<a href="https://colab.research.google.com/github/ArmandFS/hiv-quality-of-care-vis-update/blob/main/Quality_Care_of_HIV_Patients_Plotting_and_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Exploring Quality of Care Data for HIV Clients, Plotting, and Analysis

### Import Libraries for Analysis
##### In this cell, we import the libraries needed to analyze the data. NumPy is used for linear algebra and other numerical calculations, pandas for data processing and reading tabular data files, and matplotlib for plotting. The os library is used for joining file paths and managing the main directory.

##### Additional libraries will be added later for visualization and configuration

In [None]:
#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import plotly.express as px
import plotly.graph_objects as go

### Loading the data
##### Let's load our data into a pandas DataFrame using the read_excel function. The describe function then gives summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns of the data table.

In [None]:
excel_data = pd.read_excel('QualityOfCare.xlsx')

In [None]:
data = pd.DataFrame(excel_data)
stats_table = data.describe()

styled_table = stats_table.style\
    .set_caption('Quality Care of HIV Patients Initial Dataset')\
    .set_table_styles([
        {'selector': 'caption',
         'props': [
             ('font-size', '18px'),
             ('font-weight', 'bold'),
             ('text-align', 'center')
         ]},
        {'selector': 'table',
         'props': [
             ('border-collapse', 'collapse')
         ]},
        {'selector': 'th, td',
         'props': [
             ('border', '1px solid black'),
             ('padding', '8px')
         ]}
    ])
styled_table = styled_table.background_gradient(cmap='Blues', subset=stats_table.columns[0:])

styled_table

Unnamed: 0,Age,AgeInMonths,WeightAtStart,HeightAtStart,Cd4AtStart,WeightAtLastVisit,HeightAtLastVisit,MostRecentCd4Count
count,26569.0,642.0,25065.0,15357.0,21906.0,24444.0,15231.0,20106.0
mean,109.559047,14.890966,54.92087,142.788071,451.511158,60.014575,144.670872,654.213839
std,12159.486892,80.235363,17.123884,51.772227,6625.037637,30.309822,335.184395,7467.950481
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,27.0,0.0,47.0,150.0,139.0,50.0,150.0,229.0
50%,34.0,2.0,55.0,160.0,267.0,60.0,160.0,396.0
75%,42.0,18.0,65.0,166.0,434.0,69.0,166.0,600.0
max,1982014.0,1987.0,100.0,980.0,662645.0,1248.0,40914.0,578566.0
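
##### The counts in the table above differ from column to column because describe skips missing values, and text columns do not appear at all. The cell below is a minimal sketch of that behavior on a made-up miniature table (the `toy` DataFrame and its values are assumptions for illustration, not part of the dataset).

```python
import pandas as pd

# Hypothetical miniature table; the values are made up for illustration
toy = pd.DataFrame({
    "Sex": ["Female", "Male", "Female", "Female"],
    "Age": [34.0, 42.0, 27.0, None],
    "WeightAtStart": [55.0, 65.0, 47.0, 60.0],
})

toy_stats = toy.describe()

# Only the numeric columns are summarized; 'Sex' is left out by default
print(list(toy_stats.columns))

# 'count' ignores missing values, so Age has one fewer entry than WeightAtStart
print(toy_stats.loc["count", "Age"], toy_stats.loc["count", "WeightAtStart"])
```

##### A low count for a column, such as AgeInMonths in the real table, therefore points to many missing entries rather than to fewer rows.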


##### Let's look at the columns of the data table. The dataset has 46 columns

In [None]:
print(f'The dataset has {len(data.columns)} columns')
print("---------------------------------------------")
data.columns

The dataset has 46 columns
---------------------------------------------


Index(['Health facility level', 'FacilityType', 'FundingSources',
       'DateOfBirth', 'Age', 'AgeInMonths', 'Sex', 'MaritalStatus',
       'EducationLevel', 'Occupation', 'DateOfConfirmedHIV',
       'DateOfEnrollment', 'CareEntryPoint', 'DateArtStarted',
       'RegimenAtStart', 'WeightAtStart', 'HeightUnit', 'HeightAtStart',
       'FunctionalStatusAtStart', 'ClinicalStageAtStart', 'Cd4Unit',
       'Cd4AtStart', 'AdherenceCouncelingCompleted', 'InitialTbScreeningDone',
       'ArtSubstitution', 'ArtSwitch', 'ArtInterruption', 'PatientDead',
       'ClinicalStageAtLastVisit', 'TbStatusAtLAstVisit', 'WeightAtLastVisit',
       'HeightAtLastVisit', 'OpportunisticInfectionPresentAtLastVisit',
       'AnySideEffects', 'WasPatientReceivingArv', 'ArvAdherenceLatestLevel',
       'MostRecentCd4Count', 'ViralLoadDone', 'PregnancyStatus',
       'ArtInterruptionType', 'ArtInterruptionDate', 'ArtInterruptionReason',
       'OpportunisticInfectionAtLastVisit',
       'OpportunisticInfectionAt

##### We can also list the columns with their corresponding data types


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27288 entries, 0 to 27287
Data columns (total 46 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   Health facility level                     27288 non-null  object 
 1   FacilityType                              27288 non-null  object 
 2   FundingSources                            27196 non-null  object 
 3   DateOfBirth                               24493 non-null  object 
 4   Age                                       26569 non-null  float64
 5   AgeInMonths                               642 non-null    float64
 6   Sex                                       27288 non-null  object 
 7   MaritalStatus                             25547 non-null  object 
 8   EducationLevel                            24638 non-null  object 
 9   Occupation                                24610 non-null  object 
 10  DateOfConfirmedHIV                

### Reviewing the Gender Demographic Data
##### The dataset contains several demographics that we'd want to plot and compare, such as gender representation. We can use the Sex column to find out how the sexes are represented in the dataset. The pandas value_counts function counts the rows for each value in the Sex column

##### We can then plot a bar chart comparing the number of males and females



In [None]:
# Count the rows for each sex once and reuse the result
sex_counts = data['Sex'].value_counts()
male_count = sex_counts['Male']
female_count = sex_counts['Female']

# Print the total number of males and females
print(f"Number of Males: {male_count}")
print(f"Number of Females: {female_count}")
print("-----------------------------------")

# Take labels and values from the same Series so each bar matches its label
sexes = list(sex_counts.index)
values = list(sex_counts.values)
fig = go.Figure(data=[go.Bar(x=sexes, y=values, marker_color=['red', 'blue'])])
fig.update_layout(
    title="HIV Clients Receiving Treatment: Gender Chart",
    xaxis_title="Sex",
    yaxis_title="No. of clients",
    width=400,
    height=500
)
fig.show()

Number of Males: 8808
Number of Females: 18480
-----------------------------------
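
##### A note on pairing labels with counts: unique returns values in order of first appearance, while value_counts sorts by frequency, so the two orders are not guaranteed to agree. The cell below is a small sketch on a made-up Series (`toy_sex` is an assumption for illustration) that shows the difference and takes both labels and values from value_counts.

```python
import pandas as pd

# Hypothetical sample; the first row is 'Male' but 'Female' is more frequent
toy_sex = pd.Series(["Male", "Female", "Female", "Male", "Female"])

# unique() keeps the order in which values first appear
appearance_order = list(toy_sex.unique())

# value_counts() sorts by count, largest first
counts = toy_sex.value_counts()
labels = list(counts.index)
values = counts.tolist()

print(appearance_order)  # order of first appearance
print(labels, values)    # labels and counts that belong together
```

##### Reading the labels from the index of the value_counts result keeps each bar attached to the right count, whatever the row order of the data.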


### Age Calculation and Comparison
##### Let's look at the age of the clients. The dataset already has an Age column, but it also has the date of birth, so we can recalculate the age from the date of birth for better accuracy. To do this, we define a function that calculates the age using the datetime library.

In [None]:
from datetime import datetime

def age(birthdate):
    # Age in whole years as of today; subtract 1 if this year's birthday has not occurred yet
    today = datetime.today()
    return today.year - birthdate.year - ((today.month, today.day) < (birthdate.month, birthdate.day))

for ind, row in data.iterrows():
    # Parse each value on its own because the column mixes date formats;
    # missing or unparseable values become NaT and yield a NaN age
    birthdate = pd.to_datetime(row['DateOfBirth'], errors='coerce')
    data.loc[ind, 'CalculatedAge'] = age(birthdate)

data.head()

Unnamed: 0,Health facility level,FacilityType,FundingSources,DateOfBirth,Age,AgeInMonths,Sex,MaritalStatus,EducationLevel,Occupation,...,ViralLoadDone,PregnancyStatus,ArtInterruptionType,ArtInterruptionDate,ArtInterruptionReason,OpportunisticInfectionAtLastVisit,OpportunisticInfectionAtLastVisitOthers,CurrentRegimen,ViralLoad,CalculatedAge
0,Tertiary hospital,Public,",Non-Governmental Organisation",28/11/2000,18.0,,Female,Single,Tertiary,Student,...,No,Non-oregnancy,,,,,,First-line Regimen,,22.0
1,Secondary health facility,Public,",State Government",1986-05-06 00:00:00,28.0,,Female,Married,Primary,Unemployed,...,Yes,Non-oregnancy,,,,,,First-line Regimen,20,37.0
2,Secondary health facility,Public,",Federal Government,Non-Governmental Organisation",1955-12-04 00:00:00,50.0,,Male,Married,Primary,Civil servant,...,Yes,,,,,,,First-line Regimen,<20,67.0
3,Tertiary hospital,Public,",Non-Governmental Organisation",15/4/1963,51.0,,Female,Married,Missing,Self employed,...,Yes,Non-oregnancy,,,,,,First-line Regimen,9031co/ml,60.0
4,Secondary health facility,Faith Based,",Non-Governmental Organisation",1985-12-03 00:00:00,30.0,,Female,Single,Secondary,Self employed,...,No,Non-oregnancy,,,,,,First-line Regimen,,37.0
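
##### Because the function above uses today's date, the calculated ages change depending on when the notebook is run. The cell below is a minimal sketch of the same calculation against a fixed reference date, which makes the result reproducible. The helper `age_on` and the reference date are assumptions for illustration, not part of the original analysis.

```python
from datetime import date

import pandas as pd


def age_on(birthdate, reference):
    """Whole years between birthdate and reference; None if birthdate is missing."""
    if pd.isna(birthdate):
        return None
    # Has the birthday already occurred in the reference year?
    had_birthday = (reference.month, reference.day) >= (birthdate.month, birthdate.day)
    return reference.year - birthdate.year - (0 if had_birthday else 1)


# Assumed reference date, chosen only for illustration
reference = date(2023, 6, 1)

# The column mixes formats; here each sample is parsed with an explicit format
iso_birthdate = pd.to_datetime("1986-05-06", format="%Y-%m-%d")
dayfirst_birthdate = pd.to_datetime("28/11/2000", format="%d/%m/%Y")
missing_birthdate = pd.NaT

print(age_on(iso_birthdate, reference))       # birthday already passed in 2023
print(age_on(dayfirst_birthdate, reference))  # birthday not yet reached in 2023
print(age_on(missing_birthdate, reference))   # missing date of birth
```

##### Passing the reference date as a parameter also makes it possible to compute the age at another point in time, for example at enrollment.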


##### Now we can plot the calculated ages in a histogram. Plotly is used for interactive visualization

In [None]:

# Plot a histogram of the calculated ages, colored by sex
fig = px.histogram(data, x="CalculatedAge", color="Sex")
fig.update_layout(
    title="Age Distribution by Sex",
    xaxis_title="Calculated Age",
    yaxis_title="Count",
    width=1000,  
    height=500 
)
# Display the interactive plot
fig.show()

### Age Grouping and Distribution Plot
##### With the ages calculated, we can now sort them into age groups to help the analysis. The pandas cut function assigns each age to a bin. We can then plot the age group distribution with the plotly library


In [None]:
# Define the age bins and their labels; with right=False each bin includes its lower bound and excludes its upper bound
bins = [0,5,10,15,20,25,30,35,40,45,50,55,60,65,200]
age_labels = ['0-5','5-10','10-15','15-20','20-25','25-30','30-35', '35-40', '40-45', '45-50', '50-55', '55-60', '60-65', '65+']
data['AgeGroup'] = pd.cut(data['CalculatedAge'], bins=bins, labels=age_labels, right=False)

#re-arrange age group in order of numerical age value.
age_group = data['AgeGroup'].value_counts()

age_group_counts = []
for label in age_labels:
    age_group_counts.append(age_group[label])
    
# Create an interactive bar chart using Plotly
fig = go.Figure(data=[go.Bar(x=age_labels, y=age_group_counts, marker_color='red')])
fig.update_layout(
    title="Age Group Distribution",
    xaxis_title="Age groups",
    yaxis_title="No of clients",
    width=800,  # Set the width of the figure
    height=500  # Set the height of the figure
)

# Display the interactive plot
fig.show()
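
##### The labels share their boundary values ('0-5', '5-10'), so it helps to check which group a boundary age lands in. The cell below is a small sketch with made-up ages and a shortened set of bins (both are assumptions for illustration) showing how cut behaves with right=False.

```python
import pandas as pd

# Hypothetical ages, including values that sit exactly on bin edges
toy_ages = pd.Series([4, 5, 9, 10, 70])
toy_bins = [0, 5, 10, 200]
toy_labels = ["0-5", "5-10", "10+"]

# right=False makes each bin closed on the left: [0, 5), [5, 10), [10, 200)
toy_groups = pd.cut(toy_ages, bins=toy_bins, labels=toy_labels, right=False)
print(list(toy_groups))

# Count clients per group and keep the groups in label order
toy_counts = toy_groups.value_counts().reindex(toy_labels)
print(toy_counts.tolist())
```

##### An age of exactly 5 falls in '5-10' rather than '0-5', and an age of exactly 10 falls in the next group up.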

### 