<a href="https://colab.research.google.com/github/CRPeace/Coding_Dojo_Project_2/blob/main/C_Peace_Project_2_Part_1_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 2 Core
Cameron Peace

### <font color='darkorange'>***Answers to assignment questions can be found in a text cell at the end of each Dataset Section***</font>

### Assignment Checklist


Your task for this week is to propose two possible datasets you would like to work with for Project 2.

You will choose your first choice data set, and a backup data set in case the first proposed data set is not approved.

* This data can be from any source and can be on any topic with these limitations:

* the data must be available for use (it is your responsibility to ensure that the license states that you are able to use it.)
* the data must be appropriate for a professional environment
* the data must not contain personal information
* the data must not be a dataset used for any assignment, lecture, or task from the course
* the data must not be a time series dataset. You will be able to identify these because each row will represent a moment in time. These kinds of datasets follow special rules and are not appropriate for the kind of machine learning you have learned in this stack.
* Make sure you select a dataset that will be reasonable to work with in the amount of time we have left. Think about what questions you could reasonably answer with the dataset you select.

* You must propose two datasets that each have a supervised learning component. You may choose a regression or classification problem for each proposed data set.

* The dataset you choose must be approved by the instructor.

For this Task:
Create a Colab notebook where you have uploaded and shown the .head() of each of your data sets.

Create a NEW GitHub repository for this project, add your notebook to that new repository, and submit the link below.

In [4]:
## Imports

# Analysis, Plotting
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# ML
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
# Models
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
# Metrics
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score,
    accuracy_score, precision_score, recall_score, classification_report,
    ConfusionMatrixDisplay,
)

## Dataset 1 - Metabolic Syndrome Prediction (First Choice)

### Data Background:
* ***Goal: To predict whether an individual has metabolic syndrome (yes or no) based on common risk factors.***
* The dataset can be found [here](https://data.world/informatics-edu/metabolic-syndrome-prediction) on data.world.
> The dataset for analysis came from the NHANES initiative where the following variables were combined from multiple tables with SQL: abnormal waist circumference, triglycerides above 150, HDL cholesterol below 50 in women or 40 in men, history of hypertension and mildly elevated fasting blood sugar (100-125). Numerous other variables were added, such as uric acid, race, income, etc.
* From CDC website: More information on [NHANES](https://www.cdc.gov/nchs/nhanes/about_nhanes.htm) 
> The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews and physical examinations. NHANES is a major program of the National Center for Health Statistics (NCHS). NCHS is part of the Centers for Disease Control and Prevention (CDC) and has the responsibility for producing vital and health statistics for the Nation.
* It is not clear from the description on data.world when the data were collected and what regions or demographic groups the individuals in the study represent.
* It was uploaded to data.world by Robert Hoyt MD on July 22nd, 2019.


### Data Dictionary

| Column name | Type | Description |
|---|---|---|
| seqn | integer | |
| age | integer | |
| sex | string | |
| marital | string | |
| income | integer | |
| race | string | |
| waistcirc | decimal | |
| bmi | decimal | |
| albuminuria | integer | |
| uralbcr | decimal | |
| uricacid | decimal | |
| bloodglucose | integer | |
| hdl | integer | |
| triglycerides | integer | |
| metabolicsyndrome | string | |


In [5]:
df1 = pd.read_csv('/content/Metabolic  Syndrome.csv')
df1_original = df1.copy()

In [6]:
df1.sample(5)

| | seqn | Age | Sex | Marital | Income | Race | WaistCirc | BMI | Albuminuria | UrAlbCr | UricAcid | BloodGlucose | HDL | Triglycerides | MetabolicSyndrome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 449 | 63936 | 35 | Male | Single | 3500.0 | White | 114.0 | 32.2 | 0 | 3.4 | 5.5 | 129 | 41 | 151 | MetSyn |
| 720 | 65086 | 46 | Male | Married | 9000.0 | White | 87.3 | 25.8 | 0 | 5.38 | 4.1 | 89 | 53 | 67 | No MetSyn |
| 1902 | 69870 | 68 | Male | Married | 9000.0 | White | 108.0 | 27.2 | 0 | 4.5 | 6.2 | 90 | 53 | 56 | No MetSyn |
| 702 | 65009 | 51 | Female | Single | 2500.0 | Black | 155.9 | 55.8 | 0 | 4.97 | 4.4 | 126 | 79 | 59 | MetSyn |
| 37 | 62308 | 78 | Female | Married | 3500.0 | White | 94.9 | 24.6 | 0 | 10.39 | 7.0 | 94 | 57 | 177 | No MetSyn |


In [12]:
df1.MetabolicSyndrome.value_counts()

No MetSyn    1579
MetSyn        822
Name: MetabolicSyndrome, dtype: int64

In [13]:
display(df1.info(), df1.describe(include='all').T, df1.columns)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2401 entries, 0 to 2400
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   seqn               2401 non-null   int64  
 1   Age                2401 non-null   int64  
 2   Sex                2401 non-null   object 
 3   Marital            2193 non-null   object 
 4   Income             2284 non-null   float64
 5   Race               2401 non-null   object 
 6   WaistCirc          2316 non-null   float64
 7   BMI                2375 non-null   float64
 8   Albuminuria        2401 non-null   int64  
 9   UrAlbCr            2401 non-null   float64
 10  UricAcid           2401 non-null   float64
 11  BloodGlucose       2401 non-null   int64  
 12  HDL                2401 non-null   int64  
 13  Triglycerides      2401 non-null   int64  
 14  MetabolicSyndrome  2401 non-null   object 
dtypes: float64(5), int64(6), object(4)
memory usage: 281.5+ KB


None

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| seqn | 2401.0 | | | | 67030.674302 | 2823.565114 | 62161.0 | 64591.0 | 67059.0 | 69495.0 | 71915.0 |
| Age | 2401.0 | | | | 48.691795 | 17.632852 | 20.0 | 34.0 | 48.0 | 63.0 | 80.0 |
| Sex | 2401.0 | 2.0 | Female | 1211.0 | | | | | | | |
| Marital | 2193.0 | 5.0 | Married | 1192.0 | | | | | | | |
| Income | 2284.0 | | | | 4005.25394 | 2954.032186 | 300.0 | 1600.0 | 2500.0 | 6200.0 | 9000.0 |
| Race | 2401.0 | 6.0 | White | 933.0 | | | | | | | |
| WaistCirc | 2316.0 | | | | 98.307254 | 16.252634 | 56.2 | 86.675 | 97.0 | 107.625 | 176.0 |
| BMI | 2375.0 | | | | 28.702189 | 6.662242 | 13.4 | 24.0 | 27.7 | 32.1 | 68.7 |
| Albuminuria | 2401.0 | | | | 0.154102 | 0.42278 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| UrAlbCr | 2401.0 | | | | 43.626131 | 258.272829 | 1.4 | 4.45 | 7.07 | 13.69 | 5928.0 |


Index(['seqn', 'Age', 'Sex', 'Marital', 'Income', 'Race', 'WaistCirc', 'BMI',
       'Albuminuria', 'UrAlbCr', 'UricAcid', 'BloodGlucose', 'HDL',
       'Triglycerides', 'MetabolicSyndrome'],
      dtype='object')
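
The `info()` output above shows that `Marital`, `Income`, `WaistCirc` and `BMI` have fewer than 2401 non-null values (208, 117, 85 and 26 missing, respectively). A minimal sketch of how those gaps could be tabulated directly is below. It uses a tiny made-up frame that borrows a few of the real column names, so the numbers it prints are illustrative only; `demo` and `missing_report` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with made-up values (NOT the real dataset),
# borrowing a few column names from the metabolic syndrome file.
demo = pd.DataFrame({
    "Marital": ["Single", None, "Married", "Married"],
    "Income": [3500.0, 9000.0, np.nan, 2500.0],
    "BMI": [32.2, 25.8, 27.2, np.nan],
    "HDL": [41, 53, 53, 79],
})

# Count and percentage of missing values per column, largest first.
missing = demo.isna().sum()
missing_report = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(demo) * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing_report)
```

Running the same three lines on `df1` instead of `demo` should reproduce the missing counts implied by `info()`.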

### Dataset 1 Assignment Questions

  <font color='dodgerblue'>

* Source of data
  * ***data.world, more info can be found in the 'Data Background' section.***
* Brief description of data
  * ***The dataset represents individuals, drawn from a larger set of NHANES tables, who either display risk factors for Metabolic Syndrome or have been diagnosed with it.***
* What is the target?
  * ***Whether or not the individual has Metabolic Syndrome, yes or no.***
* What does one row represent? (A person? A business? An event? A product?)
  * ***Each row represents a person.***
* Is this a classification or regression problem?
  *  ***This is a classification problem.***
* How many features does the data have?
  * ***13: 15 columns minus the target and minus the id column ('seqn').***
* How many rows are in the dataset?
  *  ***2401***
* What, if any, challenges do you foresee in cleaning, exploring, or modeling with this dataset?
  *  ***Understanding enough of the medical background behind each column's data to be able to interpret the data effectively and to also discriminate between actual outliers and erroneous values.***
  * ***The data dictionary from data.world did not give descriptions of each column, so getting the necessary information without making wrong assumptions may be a slight challenge, although at first glance everything seems fairly clear.***
  </font>
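
The feature count given above (15 columns minus the target and `seqn`) can be checked with a short sketch. The two rows are copied from the `df1.sample(5)` output earlier in this section; the names `demo`, `X` and `y`, and the 1/0 encoding of the target, are illustrative assumptions rather than decisions already made for the project.

```python
import pandas as pd

columns = [
    "seqn", "Age", "Sex", "Marital", "Income", "Race", "WaistCirc", "BMI",
    "Albuminuria", "UrAlbCr", "UricAcid", "BloodGlucose", "HDL",
    "Triglycerides", "MetabolicSyndrome",
]

# Two rows taken from the sample shown earlier in this notebook.
demo = pd.DataFrame(
    [
        [63936, 35, "Male", "Single", 3500.0, "White", 114.0, 32.2, 0, 3.4, 5.5, 129, 41, 151, "MetSyn"],
        [65086, 46, "Male", "Married", 9000.0, "White", 87.3, 25.8, 0, 5.38, 4.1, 89, 53, 67, "No MetSyn"],
    ],
    columns=columns,
)

target = "MetabolicSyndrome"

# Drop the target and the identifier; everything left is a candidate feature.
X = demo.drop(columns=[target, "seqn"])

# Assumed encoding: 1 = MetSyn, 0 = No MetSyn.
y = (demo[target] == "MetSyn").astype(int)

print(X.shape[1])  # number of candidate features
```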

## Dataset 2 - Stroke Prediction Dataset

### Data Background

* This dataset can be found [here](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset) on Kaggle. From the description found there:
> According to the World Health Organization (WHO) stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to get stroke based on the input parameters like gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.
* There is no information on the dataset provenance, collection methods, time period, region, etc.
* The dataset was uploaded to kaggle in February of 2022.

### Data Dictionary

* id: unique identifier
* gender: "Male", "Female" or "Other"
* age: age of the patient
* hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
* heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
* ever_married: "No" or "Yes"
* work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
* Residence_type: "Rural" or "Urban"
* avg_glucose_level: average glucose level in blood
* bmi: body mass index
* smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
* stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient*

In [9]:
df2 = pd.read_csv('/content/healthcare-dataset-stroke-data.csv')
df2_original = df2.copy()

In [10]:
df2.sample(5)

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4470 | 47802 | Male | 28.0 | 0 | 0 | No | Private | Urban | 256.74 | 23.4 | formerly smoked | 0 |
| 2832 | 52225 | Male | 24.0 | 0 | 0 | No | Private | Urban | 84.16 | 37.5 | smokes | 0 |
| 1254 | 70610 | Female | 45.0 | 0 | 0 | Yes | Private | Rural | 81.02 | 39.0 | never smoked | 0 |
| 2982 | 39852 | Male | 59.0 | 1 | 1 | Yes | Govt_job | Rural | 81.51 | 32.6 | never smoked | 0 |
| 1194 | 542 | Female | 3.0 | 0 | 0 | No | children | Urban | 79.63 | | Unknown | 0 |


In [14]:
df2.stroke.value_counts()

0    4861
1     249
Name: stroke, dtype: int64
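
The counts above can be turned into class proportions to see how lopsided the target is. The sketch below rebuilds the two counts by hand rather than loading the file, and the names `counts`, `proportions` and `baseline_accuracy` are illustrative. On the real frame, `df2.stroke.value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 4861, 1: 249}, name="stroke")

# Share of each class in the dataset.
proportions = counts / counts.sum()

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = proportions.max()

print(proportions.round(4))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier trained on this data would need to beat that baseline, so accuracy alone is unlikely to be a useful metric here.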

In [11]:
display(df2.info(), df2.describe().T, df2.columns)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


None

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 5110.0 | 36517.829354 | 21161.721625 | 67.0 | 17741.25 | 36932.0 | 54682.0 | 72940.0 |
| age | 5110.0 | 43.226614 | 22.612647 | 0.08 | 25.0 | 45.0 | 61.0 | 82.0 |
| hypertension | 5110.0 | 0.097456 | 0.296607 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| heart_disease | 5110.0 | 0.054012 | 0.226063 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| avg_glucose_level | 5110.0 | 106.147677 | 45.28356 | 55.12 | 77.245 | 91.885 | 114.09 | 271.74 |
| bmi | 4909.0 | 28.893237 | 7.854067 | 10.3 | 23.5 | 28.1 | 33.1 | 97.6 |
| stroke | 5110.0 | 0.048728 | 0.21532 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

### Dataset 2 Assignment Questions

  <font color='dodgerblue'>

* Source of data
  * ***kaggle.com; more info can be found in the 'Data Background' section.***
* Brief description of data
  * ***The dataset provides health information and other life information on individuals who have had a stroke or may be at risk for a stroke.***
* What is the target?
  * ***Whether or not someone had a stroke previously***
* What does one row represent? (A person? A business? An event? A product?)
  * ***Each row represents a person.***
* Is this a classification or regression problem?
  *  ***This is a classification problem.***
* How many features does the data have?
  * ***10: 12 columns minus the target and minus the id column.***
* How many rows are in the dataset?
  *  ***5110***
* What, if any, challenges do you foresee in cleaning, exploring, or modeling with this dataset?
  * ***The target is very imbalanced (249 strokes out of 5110 rows, about 4.9%), which may make it harder to model.***
  * ***There is missing information in the 'bmi' column (201 null values), and 'smoking_status' uses an "Unknown" category for patients whose information is unavailable.***
  * ***There may be some medical understanding necessary to interpret the data and findings.***
  * ***There are some features in this dataset that look a little like "noise".***
  * ***The data's provenance is suspect, which may diminish any enthusiasm about modeling.***
  </font>
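
Two of the challenges listed above, the imbalanced target and the missing `bmi` values, are commonly handled with a stratified train/test split and an imputation step fitted on the training data only. The following is a rough sketch under stated assumptions, not the project's actual preprocessing: the data is synthetic (200 random rows with a 5% positive class), only three of the real feature names are used, median imputation is an arbitrary choice, and "Unknown" in `smoking_status` is simply kept as its own category.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the real file): 200 rows, 10 positives.
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "age": rng.uniform(1, 82, n).round(0),
    "bmi": rng.normal(28.9, 7.9, n).round(1),
    "smoking_status": rng.choice(
        ["formerly smoked", "never smoked", "smokes", "Unknown"], n
    ),
    "stroke": [1] * 10 + [0] * (n - 10),
})
# Blank out a few bmi values to mimic the missing data.
demo.loc[demo.sample(frac=0.04, random_state=42).index, "bmi"] = np.nan

X = demo.drop(columns="stroke")
y = demo["stroke"]

# stratify=y keeps the rare positive class in both splits at a similar rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Numeric columns: median imputation, then scaling.
numeric_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

preprocessor = make_column_transformer(
    (numeric_pipe, make_column_selector(dtype_include="number")),
    (OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude="number")),
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training split only, then apply to the test split.
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print(X_train_processed.shape, X_test_processed.shape)
print("Positive rate (train, test):", y_train.mean(), y_test.mean())
```

Fitting the imputer and scaler on `X_train` alone keeps information from the test split out of the preprocessing step.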