<a href="https://colab.research.google.com/github/KDiBSilva/Project-2/blob/main/Project_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Libraries

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [22]:
## Numpy
import numpy as np
## Pandas
import pandas as pd
## MatplotLib
import matplotlib.pyplot as plt
## Seaborn
import seaborn as sns

## Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector 
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

## Classification Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier


## Classification Metrics
from sklearn.metrics import (roc_auc_score, ConfusionMatrixDisplay, 
                             PrecisionRecallDisplay, RocCurveDisplay, 
                             f1_score, accuracy_score, precision_score,
                             recall_score, classification_report)

## Set global scikit-learn configuration 
from sklearn import set_config
## Display estimators as a diagram
set_config(display='diagram') # 'text' or 'diagram'

###1 & 2. 
About Adult Income Dataset:

Information and dataset found from Kaggle: [Here](https://www.kaggle.com/datasets/wenruliu/adult-income-dataset)

Dataset information also resourced from UCI: [Link](https://archive.ics.uci.edu/ml/datasets/adult)

Data Set Information:

Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))

Prediction task is to determine whether a person makes over 50K a year.

An individual’s annual income results from various factors. Intuitively, it is influenced by the individual’s education level, age, gender, occupation, etc.

This is a widely cited KNN dataset. However, we will still explore other models to compare performance. 

##Data Dictionary:

age: continuous.

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

fnlwgt: continuous.

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

education-num: years of education.

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

gender: Female, Male.

capital-gain: continuous.

capital-loss: continuous.

hours-per-week: continuous.

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

income: >50K, <=50K

### 3, 4 & 5.
- Target column for this dataset is 'income'.
- The income column represents whether a person earns more than $50K or not, based on the other values.
- This will be a classification problem, as there are only two outcomes to predict. This column will be changed to 0 = >50K and 1 = <=50K.
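
The target encoding described above could be sketched as follows. This is a minimal illustration on a toy frame rather than on `df_1`; the names `toy`, `income_map` and `income_encoded` are assumptions made for the example, and the 0/1 assignment simply follows the note above.

```python
import pandas as pd

# Toy frame standing in for df_1; only the 'income' column matters here.
toy = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K", ">50K", "<=50K"]})

# Mapping taken from the note above: 0 = >50K, 1 = <=50K.
income_map = {">50K": 0, "<=50K": 1}

# Strip stray whitespace first so labels such as ' >50K' still match.
toy["income_encoded"] = toy["income"].str.strip().map(income_map)

# Any label missing from the mapping becomes NaN, so count those before modeling.
n_unmapped = int(toy["income_encoded"].isna().sum())
print(toy["income_encoded"].tolist(), n_unmapped)
```

Using `.map()` with an explicit dictionary makes unexpected labels visible as NaN instead of silently passing them through.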

#Load Dataset

In [23]:
# CSV
filename = '/content/drive/My Drive/Coding Dojo/Data/adult.csv'
df_1 = pd.read_csv(filename)

In [24]:
df_1.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [25]:
df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


- There are no null values to address, although the first rows show that some entries are recorded as "?".

For Machine Learning
- Numerical (int64) columns will need to be scaled.
- Categorical (object) columns will need to be OneHotEncoded.
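
One way the two preprocessing steps above might be combined is a column transformer built from the tools already imported at the top of the notebook. The sketch below runs on a small toy frame (`toy` and its values are illustrative assumptions), and `sparse_threshold=0` is set only so the result is a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with two numeric columns and one categorical column.
toy = pd.DataFrame({
    "age": [25, 38, 28, 44],
    "hours-per-week": [40, 50, 40, 40],
    "workclass": ["Private", "Private", "Local-gov", "Private"],
})

# Select numeric columns for scaling and everything else for one-hot encoding.
num_selector = make_column_selector(dtype_include="number")
cat_selector = make_column_selector(dtype_exclude="number")

preprocessor = make_column_transformer(
    (StandardScaler(), num_selector),
    (OneHotEncoder(handle_unknown="ignore"), cat_selector),
    sparse_threshold=0,  # always return a dense array (illustration only)
)

X_processed = preprocessor.fit_transform(toy)
print(X_processed.shape)
```

In a real workflow the transformer would be fit on the training split only, so that the scaling statistics do not leak from the test data.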

In [26]:
# how many rows and columns
df_1.shape
print(f'There are {df_1.shape[0]} rows, and {df_1.shape[1]} features.')

There are 48842 rows, and 15 features.


In [27]:
#check for duplicates
df_1.duplicated().sum()

52

- There are 52 duplicates to be removed.

In [28]:
# drop duplicates
df_1 = df_1.drop_duplicates()
#confirm drop of duplicates
df_1.duplicated().sum()


0

- Duplicates have been removed.

In [32]:
# how many rows and columns
df_1.shape
print(f'There are {df_1.shape[0]} rows, and {df_1.shape[1]} features.')

There are 48790 rows, and 15 features.


### 6 & 7.

There are 48790 rows, and 15 features.



In [31]:
# Inspect categorical columns for errors(code used from code along)
cat_cols = make_column_selector(dtype_include='object')(df_1)
for col in cat_cols:
  display(df_1[col].value_counts(normalize=True))

Private             0.693995
Self-emp-not-inc    0.079135
Local-gov           0.064275
?                   0.057286
State-gov           0.040603
Self-emp-inc        0.034720
Federal-gov         0.029350
Without-pay         0.000430
Never-worked        0.000205
Name: workclass, dtype: float64

HS-grad         0.323222
Some-college    0.222648
Bachelors       0.164234
Masters         0.054437
Assoc-voc       0.042222
11th            0.037139
Assoc-acdm      0.032814
10th            0.028469
7th-8th         0.019553
Prof-school     0.017094
9th             0.015495
12th            0.013425
Doctorate       0.012175
5th-6th         0.010391
1st-4th         0.005022
Preschool       0.001660
Name: education, dtype: float64

Married-civ-spouse       0.458414
Never-married            0.329617
Divorced                 0.135889
Separated                0.031359
Widowed                  0.031113
Married-spouse-absent    0.012851
Married-AF-spouse        0.000758
Name: marital-status, dtype: float64

Prof-specialty       0.126358
Craft-repair         0.125067
Exec-managerial      0.124657
Adm-clerical         0.114901
Sales                0.112749
Other-service        0.100820
Machine-op-inspct    0.061836
?                    0.057491
Transport-moving     0.048268
Handlers-cleaners    0.042447
Farming-fishing      0.030437
Tech-support         0.029617
Protective-serv      0.020127
Priv-house-serv      0.004919
Armed-Forces         0.000307
Name: occupation, dtype: float64

Husband           0.403833
Not-in-family     0.257368
Own-child         0.155134
Unmarried         0.105022
Wife              0.047776
Other-relative    0.030867
Name: relationship, dtype: float64

White                 0.854970
Black                 0.095983
Asian-Pac-Islander    0.031092
Amer-Indian-Eskimo    0.009633
Other                 0.008321
Name: race, dtype: float64

Male      0.668457
Female    0.331543
Name: gender, dtype: float64

United-States                 0.897561
Mexico                        0.019328
?                             0.017545
Philippines                   0.006026
Germany                       0.004222
Puerto-Rico                   0.003771
Canada                        0.003730
El-Salvador                   0.003177
India                         0.003095
Cuba                          0.002828
England                       0.002603
China                         0.002501
South                         0.002357
Jamaica                       0.002173
Italy                         0.002152
Dominican-Republic            0.002111
Japan                         0.001886
Poland                        0.001783
Guatemala                     0.001763
Vietnam                       0.001763
Columbia                      0.001742
Haiti                         0.001537
Portugal                      0.001373
Taiwan                        0.001332
Iran                          0.001209
Greece                   

<=50K    0.760586
>50K     0.239414
Name: income, dtype: float64

### 8.
Cleaning
- A few columns have a "?" value, which could be replaced with "unknown". However, as it only accounts for 1-5% of the data, I may just drop those rows, since creating a new category may not benefit the model's ability to predict the income value.

- The dataset has 48790 rows and 15 features, which I feel is a good amount of data for modeling.
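
The two options for handling "?" could be compared with a sketch like the one below. It uses a toy frame, and the names `toy`, `cat_cols`, `dropped` and `relabeled` are assumptions for illustration; the same steps would apply to the categorical columns of `df_1`.

```python
import pandas as pd

# Toy frame with "?" placeholders in two categorical columns.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "Local-gov", "Private", "Private"],
    "occupation": ["Sales", "?", "Protective-serv", "?", "Tech-support"],
    "age": [25, 18, 28, 44, 38],
})
cat_cols = ["workclass", "occupation"]

# Flag rows that contain "?" in any categorical column, then measure their share.
has_placeholder = toy[cat_cols].eq("?").any(axis=1)
share_affected = has_placeholder.mean()

# Option A: drop the affected rows.
dropped = toy.loc[~has_placeholder].reset_index(drop=True)

# Option B: keep every row and relabel "?" as its own category.
relabeled = toy.replace("?", "unknown")

print(share_affected, len(dropped), len(relabeled))
```

Checking the share of affected rows across all columns together matters because the per-column percentages can overlap or add up.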

#OPTION TWO DATASET

###1 & 2. 

About Stroke Prediction Dataset


Dataset and resource information from Kaggle:
[Link](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset)



Context of this Dataset:

According to the World Health Organization (WHO) stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.

##Attribute Information:

1) id: unique identifier

2) gender: "Male", "Female" or "Other"

3) age: age of the patient

4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension

5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease

6) ever_married: "No" or "Yes"

7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"

8) Residence_type: "Rural" or "Urban"

9) avg_glucose_level: average glucose level in blood

10) bmi: body mass index

11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*

12) stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient



###3, 4 & 5.


- Target column for this dataset is 'stroke'.
- Stroke column represents whether a person will have a stroke or not based on the other values. 
- This will be a classification problem as there are only two outcomes to predict.

#Load Dataset

In [6]:
# CSV
filename = '/content/drive/My Drive/Coding Dojo/Data/healthcare-dataset-stroke-data.csv'
df_2 = pd.read_csv(filename)

In [7]:
df_2.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [8]:
df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


- The bmi column has 201 missing values (4909 non-null out of 5110). All other columns are complete.
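
The non-null counts printed by `info()` can also be turned into an explicit tally of missing values per column. The sketch below does this on a toy frame; `toy`, `missing_per_column` and `columns_with_missing` are names assumed for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing bmi value, mirroring the NaN seen in head().
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
    "smoking_status": ["formerly smoked", "never smoked", "never smoked", "smokes"],
})

# Count missing entries per column and keep only the columns that have any.
missing_per_column = toy.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```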

In [9]:
df_2.shape
print(f'There are {df_2.shape[0]} rows, and {df_2.shape[1]} features.')

There are 5110 rows, and 12 features.


###6 & 7. There are 5110 rows, and 12 features.

In [10]:
df_2.duplicated().sum()

0

- There are no duplicate values.

In [17]:
# Inspect categorical columns for errors(code used from code along)
cat_cols = make_column_selector(dtype_include='object')(df_2)
for col in cat_cols:
  display(df_2[col].value_counts(normalize=True))

Female    0.585910
Male      0.413894
Other     0.000196
Name: gender, dtype: float64

Yes    0.656164
No     0.343836
Name: ever_married, dtype: float64

Private          0.572407
Self-employed    0.160274
children         0.134442
Govt_job         0.128571
Never_worked     0.004305
Name: work_type, dtype: float64

Urban    0.508023
Rural    0.491977
Name: Residence_type, dtype: float64

never smoked       0.370254
Unknown            0.302153
formerly smoked    0.173190
smokes             0.154403
Name: smoking_status, dtype: float64

- I can use .replace() to convert the values in 'Residence_type', 'gender' and 'ever_married' to numeric codes (0, 1, 2).
- The remaining categorical columns will need to be One Hot Encoded.

- Numerical columns will need to be scaled, which converts them all to float values.
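
A minimal sketch of the numeric recoding mentioned above, on a toy frame. The integer assigned to each label is an arbitrary choice made for this example, and `.map()` is used here in place of `.replace()` so that any label missing from a dictionary shows up as NaN.

```python
import pandas as pd

# Toy frame with the three columns named above.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Other"],
    "ever_married": ["Yes", "No", "Yes"],
    "Residence_type": ["Urban", "Rural", "Rural"],
})

# Hypothetical code assignments; only the label names come from the dataset.
encodings = {
    "gender": {"Female": 0, "Male": 1, "Other": 2},
    "ever_married": {"No": 0, "Yes": 1},
    "Residence_type": {"Rural": 0, "Urban": 1},
}

encoded = toy.copy()
for col, mapping in encodings.items():
    encoded[col] = encoded[col].map(mapping)

print(encoded)
```

Coding 'gender' as 0, 1, 2 implies an ordering between the three labels, so one-hot encoding that column may be the safer choice for distance-based models such as KNN.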


###8. 
This dataset may not have enough data, and I may need to address an imbalance in order for the model to improve its predictions.
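
A possible first step for the imbalance concern is to measure the class shares of the target and keep them equal across the train and test sets with a stratified split. The toy data below is invented for illustration (a 20% positive rate is an assumption, not the actual stroke rate), and the variable names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 20 rows, 4 of them positive.
toy = pd.DataFrame({
    "age": list(range(40, 60)),
    "stroke": [1, 0, 0, 0, 0] * 4,
})

# Share of each class in the target.
class_share = toy["stroke"].value_counts(normalize=True)

X = toy[["age"]]
y = toy["stroke"]

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(class_share.to_dict(), int(y_train.sum()), int(y_test.sum()))
```

If the imbalance turns out to be severe, options such as `class_weight='balanced'` in `LogisticRegression` or `RandomForestClassifier` could be tried as well.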