Assignment 1: Individual Assignment

## Remmy Bisimbeko - B26099 - J24M19/011
My GitHub - https://github.com/RemmyBisimbeko/Data-Science

Using supervised and/or unsupervised learning models, make predictions for the following datasets.

The Datasets:

1. Mosquito_Dataset.xlsx  

This dataset contains survey data collected by scientists studying the mosquitoes that carry pathogens causing infections in humans in three villages in Uganda. Further information on the variables collected can be found in the "Variables_Descriptors" sheet of the Excel workbook.

Instructions: 

A new infection case has been noted in Lwengo village. Determine the mosquito species (species.mol) that has led to this infection. 

The particulars of information collected from the infected case are as follows:

Village	Lwengo
collection	OBET
origin	MH
fed	yes
parity	NA
oocyst	3
spz	0
infection	spz
infection 1	uninfected
choice	H
RS-tech	ELISA
spz_tech	ELISA

In [2]:
# Let's bring in the libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score

In [3]:
# Load and read the dataset
df = pd.read_excel("/Users/remmy/Documents/Projects/Data-Science/Data Sets/Mosquito_Dataset.xlsx", sheet_name="Data")
df.head()

Unnamed: 0,village,collection,origin,fed,parity,species.mol,oocyst,spz,infection,infection1,choice,RS_tech,spz_tech
0,kitamilo,spray,CA,yes,,g,3,0,oocyst,oocyst,other,pcr,qpcr
1,kitamilo,spray,MH,yes,,g,0,1,spz,spz,H,pcr,qpcr
2,kitamilo,spray,MH,yes,,g,0,1,spz,spz,H,pcr,qpcr
3,kitamilo,spray,MI,yes,,g,0,1,spz,spz,H,pcr,qpcr
4,kitamilo,spray,MI,yes,,c,0,1,spz,spz,H,pcr,qpcr


In [4]:
# Strip leading and trailing whitespace from the column names
df.columns = df.columns.str.strip()

In [5]:
# Now I drop every row that has a missing value in any column
df.dropna(inplace=True)
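
A blanket `dropna()` removes a row if any column is missing, so one sparse column can discard most of the data. The sketch below shows one way to check this before dropping. It uses a small made-up frame (`toy`), not the survey data, and the column values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the survey data): 'parity' is mostly missing.
toy = pd.DataFrame({
    "village": ["kitamilo", "kitamilo", "lwengo", "lwengo"],
    "parity": [np.nan, np.nan, "P", np.nan],
    "oocyst": [3, 0, 0, 1],
})

# Count missing values per column to see which column drives the row loss.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Compare a blanket dropna() with dropping only on a chosen column.
rows_all = len(toy.dropna())
rows_subset = len(toy.dropna(subset=["oocyst"]))
print(rows_all, rows_subset)
```

Running `df.isna().sum()` on the real frame before the `dropna` call would show how many rows each column costs.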

In [6]:
# Next, I separate the features (X) from the target (y)
X = df.drop(columns=['species.mol'])
y = df['species.mol']

In [7]:
# I split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
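
If the species classes turn out to be imbalanced (an assumption, not checked in this notebook), a stratified split keeps the class proportions the same in the training and test sets. The sketch below is an optional variation on the split above, run on made-up labels (`toy_X`, `toy_y`) rather than the survey data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 15 samples of class 'g' and 5 of class 'c'.
toy_X = pd.DataFrame({"oocyst": range(20)})
toy_y = pd.Series(["g"] * 15 + ["c"] * 5)

# stratify=toy_y preserves the 3:1 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

train_counts = y_tr.value_counts().to_dict()
test_counts = y_te.value_counts().to_dict()
print(train_counts, test_counts)
```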


In [8]:
# Then I separate the numeric and categorical columns
numeric_cols = X_train.select_dtypes(include=np.number).columns.tolist()
categorical_cols = X_train.select_dtypes(exclude=np.number).columns.tolist()

In [9]:
# Here, I impute missing values in the numeric features with the column mean
numeric_imputer = SimpleImputer(strategy='mean')
X_train_numeric_imputed = numeric_imputer.fit_transform(X_train[numeric_cols])
X_test_numeric_imputed = numeric_imputer.transform(X_test[numeric_cols])

In [10]:
# Followed by imputing missing values in the categorical features with the most frequent value
categorical_imputer = SimpleImputer(strategy='most_frequent')
X_train_categorical_imputed = categorical_imputer.fit_transform(X_train[categorical_cols])
X_test_categorical_imputed = categorical_imputer.transform(X_test[categorical_cols])

In [11]:
# I then one-hot encode the categorical features
encoder = OneHotEncoder(handle_unknown='ignore')
X_train_categorical_encoded = encoder.fit_transform(X_train_categorical_imputed)
X_test_categorical_encoded = encoder.transform(X_test_categorical_imputed)
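
With `handle_unknown='ignore'`, a category that was not seen during `fit` is encoded as a row of zeros instead of raising an error. This matters for string features such as the village name, where a difference in capitalisation counts as a different category. The sketch below uses made-up village spellings. How the spreadsheet actually spells them is not shown here.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training values for a single categorical column.
train_villages = np.array([["kitamilo"], ["lwengo"], ["kitamilo"]], dtype=object)

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_villages)

# A value seen during fit gets a 1 in its own column.
seen = enc.transform(np.array([["lwengo"]], dtype=object)).toarray()
# A value not seen during fit (note the capital L) becomes all zeros.
unseen = enc.transform(np.array([["Lwengo"]], dtype=object)).toarray()

print(seen)    # [[0. 1.]]
print(unseen)  # [[0. 0.]]
```

Comparing a new case's values against `encoder.categories_` is one way to spot such mismatches before predicting.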

In [12]:
# Let's concatenate the numeric and encoded categorical features
X_train_processed = np.concatenate([X_train_numeric_imputed, X_train_categorical_encoded.toarray()], axis=1)
X_test_processed = np.concatenate([X_test_numeric_imputed, X_test_categorical_encoded.toarray()], axis=1)
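
The imputing, encoding and concatenation steps above can also be bundled with scikit-learn's `ColumnTransformer` and `Pipeline`, so the same transformations are applied to training data, test data and any new case with a single call. This is a minimal sketch of that alternative, not the approach used in this notebook. The frame `toy_X`, the labels `toy_y` and the column choices are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with two categorical and two numeric features.
toy_X = pd.DataFrame({
    "origin": ["MH", "CA", "MI", "MH", "CA", "MI"],
    "choice": ["H", "other", "H", "H", "other", "H"],
    "oocyst": [0, 3, 0, 1, 2, 0],
    "spz": [1, 0, 1, 0, 0, 1],
})
toy_y = pd.Series(["g", "c", "g", "g", "c", "g"])

num_cols = ["oocyst", "spz"]
cat_cols = ["origin", "choice"]

# Numeric columns: mean imputation. Categorical columns: mode imputation, then one-hot.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), num_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

clf = Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])
clf.fit(toy_X, toy_y)

# A new row goes through exactly the same preprocessing as the training data.
new_row = pd.DataFrame({"origin": ["MH"], "choice": ["H"], "oocyst": [3], "spz": [0]})
pred = clf.predict(new_row)
print(pred[0])
```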

In [13]:
# I then train a logistic regression model
model = LogisticRegression()
model.fit(X_train_processed, y_train)

In [14]:
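# Next, let's predict the mosquito species for the infected case in Lwengo village
# Define the features for the infected case
# Note: 'infection', 'RS_tech' and 'spz_tech' below differ from the case details listed above (spz, ELISA, ELISA)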
new_case = pd.DataFrame({
    'village': ['Lwengo'],
    'collection': ['OBET'],
    'origin': ['MH'],
    'fed': ['yes'],
    'parity': [np.nan],  
    'oocyst': [3],
    'spz': [0],
    'infection': ['oocyst'],  
    'infection1': ['uninfected'],  
    'choice': ['H'],
    'RS_tech': ['pcr'],  
    'spz_tech': ['qpcr']  
}, index=[0])  # This creates a single-row DataFrame for the new case

In [15]:
# From here, I handle missing values and encode the categorical features for the new case
new_case_numeric_imputed = numeric_imputer.transform(new_case[numeric_cols])
new_case_categorical_imputed = categorical_imputer.transform(new_case[categorical_cols])
new_case_categorical_encoded = encoder.transform(new_case_categorical_imputed)

In [16]:
# I then concatenate the numeric and encoded categorical features for the new case
new_case_processed = np.concatenate([new_case_numeric_imputed, new_case_categorical_encoded.toarray()], axis=1)

In [17]:
# I then make the prediction for the new infection case
prediction = model.predict(new_case_processed)
print("The predicted mosquito species for the new infection case is:", prediction[0])

The predicted mosquito species for the new infection case is: g
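
`predict` returns only the most likely class. To see how confident a logistic regression model is, `predict_proba` returns one probability per class, in the order given by `classes_`. The sketch below illustrates this on a made-up one-feature dataset (`X_toy`, `y_toy`). It does not reproduce the probabilities of the model trained above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: low feature values are class 'c', high values are class 'g'.
X_toy = np.array([[0.0], [0.2], [0.4], [2.0], [2.2], [2.4]])
y_toy = np.array(["c", "c", "c", "g", "g", "g"])

toy_model = LogisticRegression().fit(X_toy, y_toy)

# One row of probabilities per sample, one column per class in toy_model.classes_.
proba = toy_model.predict_proba(np.array([[2.1]]))
for label, p in zip(toy_model.classes_, proba[0]):
    print(label, round(float(p), 3))
```

On the notebook's own objects the equivalent call would be `model.predict_proba(new_case_processed)`.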


In [18]:
# Finally, I calculate the accuracy on the test set
y_pred_test = model.predict(X_test_processed)
accuracy = accuracy_score(y_test, y_pred_test)
print("Accuracy on the test set:", accuracy)

Accuracy on the test set: 0.7073170731707317
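
An accuracy figure is easier to interpret next to a baseline. One common reference point is a classifier that always predicts the most frequent training class. The sketch below computes that baseline with `DummyClassifier` on made-up labels. The class counts are assumptions and say nothing about the actual species distribution in the survey.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 'g' is the majority class in the training set.
y_train_toy = np.array(["g"] * 7 + ["c"] * 3)
y_test_toy = np.array(["g", "g", "g", "c", "c"])

# DummyClassifier ignores the features, so placeholder arrays are enough.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train_toy, y_train_toy)
baseline_accuracy = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print("Baseline accuracy:", baseline_accuracy)  # 0.6
```

Fitting the same baseline on `X_train_processed` and `y_train` and scoring it on the test set would show how much the logistic regression model adds over always guessing the majority species.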
