# HIV Status classification model

This dataset comes from the Ghana Demographic and Health Survey (GDHS) conducted by the Ghana Statistical Service (GSS) in 2003. The GDHS report covers the demographics of the 9555 persons surveyed and how many of them tested positive for HIV.

Persons who were tested for HIV were asked a few questions: their gender, whether they had ever had sex, and whether they had had sex within the past year. Their answers were matched with their individual HIV test results, and the data was summarized into a table.

The purpose of this project is to create a classification model that predicts whether or not a person has HIV based on their individual demographics. This would shed light on the types of people who are most vulnerable to contracting the virus, and in turn help the Ghana AIDS corporation direct its resources towards such persons, making the fight against HIV prevalence in Ghana more efficient.

However, the data is not comprehensive enough. Characteristics such as age, educational background, region, and employment status would have shed more light on the demographics of the most vulnerable people.

### Data Description 

| Field          | Description                                                                           |
|----------------|---------------------------------------------------------------------------------------|
|Gender| The gender of the person|
|HSB| Acronym for Had Sex Before. Indicates whether the respondent has ever engaged in any form of sexual activity. The respondent replies __Yes__ or __No__|
|HSWY| Acronym for Had Sex Within a Year. A follow-up question for respondents who have had sex before, asking whether they had sex within the past year. The respondent replies __Yes__ or __No__. All respondents who had not had sex before were assumed to have replied __No__ to this question|
|HIV-Stat| The HIV test result of the respondent during the survey|


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 

### Data Preprocessing

In [2]:
df = pd.read_csv('HIV.csv')
df.head()

In [3]:
# Convert the string labels in each column to numeric codes
mapping_gender = {"Male": 2, "Female": 4}
mapping_stat = {"Negative": -1, "Positive": 1}
mapping_hsb = {"No": 0, "Yes": 1}
mapping_hswy = {"No": 0, "Yes": 1}

df["Gender"] = df["Gender"].astype("str")
df["HIV-Stat"] = df["HIV-Stat"].astype("str")
df["HSB"] = df["HSB"].astype("str")
df["HSWY"] = df["HSWY"].astype("str")
df["Gender"] = df["Gender"].replace(mapping_gender)
df["HIV-Stat"] = df["HIV-Stat"].replace(mapping_stat)
df["HSB"] = df["HSB"].replace(mapping_hsb)
df["HSWY"] = df["HSWY"].replace(mapping_hswy)

df

In the dataframe above, all string values have been converted to numerical values. Males correspond to the number 2, and females to the number 4. Yes corresponds to 1 and No to 0. Finally, a positive status corresponds to 1, whereas a negative status corresponds to -1.
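
As a quick check on the encoding described above, the same mappings can be tried on a handful of made-up rows. This is only a sketch: the `toy` dataframe is illustrative and does not contain survey data, and it uses `Series.map` instead of `replace` so that any label missing from a mapping shows up as `NaN` rather than passing through unchanged.

```python
import pandas as pd

# Made-up rows standing in for HIV.csv (illustrative values, not survey data)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "HSB": ["Yes", "No", "Yes"],
    "HSWY": ["Yes", "No", "No"],
    "HIV-Stat": ["Negative", "Negative", "Positive"],
})

# Same codes as in the notebook cell above
mappings = {
    "Gender": {"Male": 2, "Female": 4},
    "HSB": {"No": 0, "Yes": 1},
    "HSWY": {"No": 0, "Yes": 1},
    "HIV-Stat": {"Negative": -1, "Positive": 1},
}

encoded = toy.copy()
for column, mapping in mappings.items():
    # Series.map yields NaN for any label that is missing from the mapping
    encoded[column] = toy[column].map(mapping)

# A non-zero count here would point to an unexpected label in that column
unmapped_counts = encoded.isna().sum()
print(encoded)
print(unmapped_counts)
```

Running the same `isna().sum()` check on the real dataframe would be one way to confirm that every label in `HIV.csv` was covered by the mappings.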


In [4]:
x = df[['Gender', 'HSB', 'HSWY']].values
y = df['HIV-Stat'].values
y

### Creating the model

In [6]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state = 4)
# Fit the scaler on the training split only, then apply the same scaling to both splits
scaler = preprocessing.StandardScaler().fit(x_train)
x_train_norm = scaler.transform(x_train)
x_test_norm = scaler.transform(x_test)
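
A common convention when standardizing is to learn the mean and standard deviation from the training split only and reuse them for the test split, so that no information from the test data leaks into preprocessing. The sketch below illustrates this on synthetic binary features; the `*_demo` arrays are assumptions made for the example and are not the survey data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic 0/1 features standing in for the encoded survey columns
x_train_demo = rng.choice([0, 1], size=(80, 3)).astype(float)
x_test_demo = rng.choice([0, 1], size=(20, 3)).astype(float)

# Learn mean and standard deviation from the training split only
scaler = StandardScaler().fit(x_train_demo)

# Apply the same learned statistics to both splits
x_train_scaled = scaler.transform(x_train_demo)
x_test_scaled = scaler.transform(x_test_demo)

# Training columns are centred at zero by construction;
# test columns usually are not, because they reuse the training statistics
print(x_train_scaled.mean(axis=0))
print(x_test_scaled.mean(axis=0))
```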

In [7]:
# Loop through various values of k and calculate the corresponding accuracy for each
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
ks = 5
mean_acc = np.zeros(ks)

for n in range(1,ks+1):
    neigh =  KNeighborsClassifier(n_neighbors=n).fit(x_train_norm,y_train)
    y_predict = neigh.predict(x_test_norm)  # predict on the scaled test set, matching the scaled training data
    mean_acc[n-1] = metrics.accuracy_score(y_test, y_predict)
    
mean_acc
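
Once the accuracy for each k is stored in an array, the best-performing k can be read off with `argmax`. The sketch below shows the whole loop end to end on synthetic data generated with `make_classification`. The data and the names `X_demo`, `y_demo`, `acc_by_k`, and `best_k` are assumptions for illustration, so the accuracies it prints say nothing about the survey data.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic three-feature binary classification problem (not the survey data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=4
)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)

# Scale both splits with statistics learned from the training split
scaler = StandardScaler().fit(x_tr)
x_tr_s, x_te_s = scaler.transform(x_tr), scaler.transform(x_te)

ks = 5
acc_by_k = np.zeros(ks)
for k in range(1, ks + 1):
    model = KNeighborsClassifier(n_neighbors=k).fit(x_tr_s, y_tr)
    acc_by_k[k - 1] = metrics.accuracy_score(y_te, model.predict(x_te_s))

# Index 0 holds k=1, so shift the argmax by one to recover k
best_k = int(acc_by_k.argmax()) + 1
print(acc_by_k)
print("best k:", best_k)
```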

Now let's evaluate which features in this dataset matter most when determining whether an individual has HIV.


In [8]:
from sklearn.feature_selection import mutual_info_classif

# x_train is the feature matrix, y_train is the target variable
feature_scores = mutual_info_classif(x_train, y_train)
feature_scores


From the results above, the most important feature in this dataset for determining whether or not a person has HIV is whether they have had sexual relations before (HSB).
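
To see how mutual information scores separate an informative feature from an uninformative one, here is a minimal sketch on synthetic binary data. The arrays and the names `informative`, `noise`, and `target` are made up for illustration. Because the features are categorical codes, the sketch passes `discrete_features=True`, which is an assumption about how such data might best be handled and not something the cell above does; `random_state` is set for repeatability.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(4)
n = 2000

# One binary feature that drives the target, and one that is pure noise
informative = rng.integers(0, 2, size=n)
noise = rng.integers(0, 2, size=n)

# The target copies the informative feature, with roughly 10% of labels flipped
flip = rng.random(n) < 0.1
target = np.where(flip, 1 - informative, informative)

X_demo = np.column_stack([informative, noise])

# Treat both columns as discrete, since they are category codes
scores = mutual_info_classif(
    X_demo, target, discrete_features=True, random_state=4
)

# Feature indices sorted from most to least informative
ranking = np.argsort(scores)[::-1]
print(scores)
print(ranking)
```

The informative column should receive a clearly higher score than the noise column, which is the same kind of comparison being made between `Gender`, `HSB`, and `HSWY` above.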