# Are we Safe?: HIV Risk Predictive Model

# 1.0 Project Overview
## 1.1 Introduction

HIV & AIDS world outlook reports highlight HIV as a major global health issue, with approximately 38 million people infected worldwide (WHO, 2020; UNAIDS, 2020). The disease disproportionately affects adolescent girls and young women (AGYW), who are at higher risk of HIV infection due to a combination of biological, socio-economic, and behavioral factors (WHO, 2020; UNAIDS, 2019).

Regionally, sub-Saharan Africa bears the bulk of the epidemic, accounting for nearly 70% of the world's HIV cases (WHO, 2020; UNAIDS, 2019).

In Kenya, HIV remains a significant public health issue among AGYW relative to their male counterparts. Structural impediments such as gender disparities, poverty, and limited access to education and health care leave AGYW in some counties especially vulnerable (National AIDS Control Council, 2020; Ministry of Health Kenya, 2021). Despite aid-funded programs that seek to stem HIV infections by addressing these structural drivers, reaching high-risk groups and making full use of existing resources remain a challenge. This project therefore adopts the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology, applied to the health sector.


# 2.0 Business Understanding

The Human Immunodeficiency Virus (HIV) remains a major public health problem among adolescent girls and young women (AGYW) in Kenya. AGYW still carry a disproportionate burden of HIV despite national and global campaigns, including the DREAMS program (Determined, Resilient, Empowered, AIDS-Free, Mentored, and Safe). AGYW have a greater risk of HIV infection than their male counterparts because of a mix of biological, socio-economic, and behavioral determinants (WHO, 2020, https://www.who.int/news-room/fact-sheets/detail/adolescents-health-risks-and-solutions; UNAIDS, 2019, https://www.unaids.org/sites/default/files/media_asset/2019-global-AIDS-update_en.pdf).

In Kenya, counties vary in their socio-economic, cultural, and health environments. These differences shape HIV vulnerability, access to care, and the success of interventions. Even with focused interventions, identifying those most vulnerable and allocating resources effectively to prevent infections remains a challenge.

The overall goal of this project is to build a predictive model that assesses HIV risk among AGYW based on data collected from select counties. Framed as a data-first problem, the model will identify individuals who are most at risk of acquiring HIV. The final model will help stakeholders scale up to more counties, target interventions more effectively, and allocate funds to those most at risk so that new HIV infections in these communities can be minimized.

# 3.0 Data Understanding

## 3.1 Data Description
The dataset used in this project contains detailed demographic, behavioral, and intervention-related information on adolescent girls and young women (AGYW) participating in the DREAMS program. The data includes key indicators such as:

- Demographic Information: Age, county, household structure, parental status.
- Socioeconomic Status: Household size, food security, income sources.
- Education & Behavior: School attendance, history of sexual activity, condom use.
- HIV Testing & Status: HIV testing history, last test result.
- DREAMS Program Participation: Interventions received (biomedical, behavioral, social protection).
- Exit Status: Whether participants continued or exited the program and the reason for exiting.

This dataset provides a comprehensive view of factors affecting HIV risk among AGYW, allowing for predictive modeling and impact evaluation.

## 3.2 Data Source
The data is sourced from the health records of an aid-funded program, the PEPFAR DREAMS program, in select counties in Kenya.

## 3.3 Data Loading and Preview

In [1]:
#Import relevant libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import warnings
#Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

In [2]:
#Loading the dreams dataset
df = pd.read_csv('./data/dreams_raw_dataset.csv')

In [3]:
#Dataset preview
df.head()

| Unnamed: 0 | date_of_birth | date_of_enrollment | Agency | implementing_partner_name | county | head_of_household | head_of_household_other | age_of_household_head | father_alive | mother_alive | ... | intervention_name | intervention_date | result | bio_medical | social_protection | behavioral | post_gbv_care | other_interventions | exit_age | exit_reason_other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9/16/2004 | 2/22/2020 | USAID | USAID Tumikia Mtoto | Nairobi | Mother | | 58.0 | Yes | Yes | ... | HTS - HTS (Client) | 8/8/2020 | Negative | 1 | 1 | 1 | 0 | 0 | | |
| 1 | 8/2/2004 | 12/18/2019 | USAID | USAID Tumikia Mtoto | Nairobi | Father | | 43.0 | Yes | Yes | ... | HTS - HTS (Client) | 4/27/2020 | Negative | 1 | 1 | 1 | 0 | 0 | | |
| 2 | 10/20/2005 | 3/7/2020 | USAID | USAID Tumikia Mtoto | Nairobi | Mother | | 41.0 | No | Yes | ... | HTS - HTS (Client) | 8/12/2020 | Negative | 1 | 1 | 0 | 0 | 0 | | |
| 3 | 1/18/2006 | 3/3/2020 | USAID | USAID Tumikia Mtoto | Nairobi | Mother | | 45.0 | No | Yes | ... | HTS - HTS (Client) | 8/12/2020 | Negative | 1 | 1 | 1 | 0 | 0 | | |
| 4 | 3/22/2004 | 12/18/2019 | USAID | USAID Tumikia Mtoto | Nairobi | Father | | 42.0 | Yes | Yes | ... | HTS - HTS (Client) | 4/27/2020 | Negative | 1 | 1 | 1 | 0 | 0 | | |


## 3.4 Problem Statement

## 3.5 Metrics of Success

# 4.0 Data Preparation

## 4.1 Data Cleaning

This involves checking the data for validity, accuracy, completeness, consistency and uniformity. These checks will be carried out on the select datasets that are within the scope of the research.
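The checks named above could be sketched roughly as follows. This is a minimal illustration on made-up rows, not the project's actual cleaning code: the column names come from the dataset preview, while the `clean` helper, the derived `age_at_enrollment` and `valid_age` columns, and the 10 to 24 age window are assumptions introduced only for this example.

```python
import pandas as pd

# Toy rows mimicking a few previewed columns; every value here is made up.
raw = pd.DataFrame({
    "date_of_birth": ["9/16/2004", "8/2/2004", "8/2/2004", "13/45/2005"],
    "date_of_enrollment": ["2/22/2020", "12/18/2019", "12/18/2019", "3/7/2020"],
    "county": ["Nairobi", " nairobi", " nairobi", "Kisumu"],
    "age_of_household_head": [58.0, 43.0, 43.0, None],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper applying uniformity, consistency and validity checks."""
    out = df.copy()
    # Uniformity: one datetime type for dates; unparseable values become NaT.
    for col in ["date_of_birth", "date_of_enrollment"]:
        out[col] = pd.to_datetime(out[col], format="%m/%d/%Y", errors="coerce")
    # Uniformity: one spelling per category (strip whitespace, title case).
    out["county"] = out["county"].str.strip().str.title()
    # Consistency: remove exact duplicate records.
    out = out.drop_duplicates().reset_index(drop=True)
    # Validity: derive age at enrollment and flag values outside an ASSUMED
    # 10 to 24 year window (rows with missing dates are flagged as invalid).
    days = (out["date_of_enrollment"] - out["date_of_birth"]).dt.days
    out["age_at_enrollment"] = days / 365.25
    out["valid_age"] = out["age_at_enrollment"].between(10, 24)
    return out


cleaned = clean(raw)

# Completeness: share of missing values per column.
missing_share = cleaned.isna().mean()
print(missing_share)
print(cleaned[["county", "age_at_enrollment", "valid_age"]])
```

Flagging invalid rows instead of dropping them keeps the decision about what to do with them (correct, impute, or exclude) visible and separate from detection.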

# 5.0 Exploratory Data Analysis (EDA)
This is the process of analyzing data to reveal trends and patterns, detect anomalies, test hypotheses and check assumptions using visuals and summary statistics (Tukey, 1977).

Key goals of EDA include:

- Understanding the data: Getting a sense of the data's distribution, range, and central tendencies.
- Identifying patterns: Discovering trends, correlations, or anomalies within the data.
- Checking assumptions: Verifying assumptions made about the data before further analysis or modeling.
- Generating hypotheses: Developing potential explanations or questions based on the findings.
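As a rough illustration of these goals, the sketch below computes summary statistics, class balance, a cross-tabulation, a correlation, and a simple outlier check with pandas. The rows are made up and only reuse column names from the dataset preview; treating `result` as the outcome of interest and using the 1.5 × IQR rule for anomalies are assumptions for this example, not choices stated in the project.

```python
import pandas as pd

# Made-up records that reuse column names from the dataset preview.
df = pd.DataFrame({
    "county": ["Nairobi", "Nairobi", "Kisumu", "Kisumu", "Kisumu", "Nairobi"],
    "age_of_household_head": [58.0, 43.0, 41.0, 45.0, 42.0, 120.0],
    "bio_medical": [1, 1, 1, 0, 1, 1],
    "behavioral": [1, 1, 0, 0, 1, 1],
    "result": ["Negative", "Negative", "Negative",
               "Positive", "Negative", "Negative"],
})

# Understanding the data: range and central tendency of a numeric column.
summary = df["age_of_household_head"].describe()

# Identifying patterns: class balance, and the outcome broken down by county.
result_share = df["result"].value_counts(normalize=True)
by_county = pd.crosstab(df["county"], df["result"])

# Identifying patterns: correlation between two binary intervention flags.
flag_corr = df[["bio_medical", "behavioral"]].corr()

# Detecting anomalies: flag values outside 1.5 * IQR of the quartiles.
q1, q3 = df["age_of_household_head"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = ((df["age_of_household_head"] < q1 - 1.5 * iqr)
              | (df["age_of_household_head"] > q3 + 1.5 * iqr))
outliers = df[is_outlier]

print(summary)
print(result_share)
print(by_county)
print(flag_corr)
print(outliers)
```

The same quantities can be plotted with seaborn or matplotlib; the tabular form is used here only to keep the example short.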

## 5.1 Univariate Analysis


## 5.2 Bivariate Analysis


## 5.3 Multivariate Analysis


# 6.0 Modeling

## 6.1 Baseline Model
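
As a minimal sketch of what a baseline could look like, assuming a logistic regression assembled from the scikit-learn classes imported in Section 3.3, the example below wires imputation, scaling and one-hot encoding into a single pipeline. Everything about the data is an assumption: the rows are synthetic, the county labels other than Nairobi are placeholders, and the binary target `y` is simulated rather than taken from the DREAMS dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; "County B" and "County C" are placeholder labels.
rng = np.random.default_rng(42)
n = 400
X = pd.DataFrame({
    "age_of_household_head": rng.normal(45, 10, n),
    "county": rng.choice(["Nairobi", "County B", "County C"], n),
    "behavioral": rng.integers(0, 2, n),
})
# Simulated (hypothetical) binary target with an arbitrary relationship to X.
logit = (-1.0 + 1.5 * (X["county"] == "County B")
         - 1.0 * X["behavioral"]).to_numpy()
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
# Inject some missing values so the imputer has work to do.
X.loc[rng.choice(n, 20, replace=False), "age_of_household_head"] = np.nan

numeric = ["age_of_household_head"]
categorical = ["county"]

# Numeric columns: median imputation then scaling.
# Categorical columns: mode imputation then one-hot encoding.
# Remaining columns (the 0/1 flag) pass through unchanged.
preprocess = ColumnTransformer(
    [
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
         categorical),
    ],
    remainder="passthrough",
)

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(cm)
print("recall:", recall)
print("f1:", f1)
```

Keeping preprocessing inside the pipeline means the imputer and scaler are fitted on the training split only, which avoids leaking test information. `class_weight="balanced"` and the focus on recall are illustrative choices for a setting where missing an at-risk individual may be costlier than a false alarm; the project's own metrics of success would determine the real choice.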