## Programming Lab #2
## Foundations of Machine Learning

The purpose of this project is to build algorithms that predict the likelihood that a person has a stroke. The data include:
  
  - `age`: Patient age, numeric
  - `avg_glucose_level`: Blood sugar levels, numeric
  - `bmi`: Body mass index, numeric
  - `ever_married`: Ever married, dummy/character (Yes, No)
  - `gender`: Male, Female, or Other, character
  - `heart_disease`: Has heart disease, dummy
  - `hypertension`: Has hypertension, dummy
  - `id`: Study identification number
  - `Residence_type`: Type of residence, dummy/character (Urban, Rural)
  - `smoking_status`: Former, never, or current smoker, categorical
  - `work_type`: Employment type: never worked (`Never_worked`), homemaker (`children`), public sector employment (`Govt_job`), private sector employment (`Private`), self-employed (`Self-employed`)
  - `stroke`: Suffered a stroke in the sample period
  
The data come in two files: `training_data.csv`, which you should use to build your models, and `testing_data.csv`, which you should use to test your models. The models must be trained on the training data and tested on the testing data, but providing both files allows you to experiment with your choices and iterate on model designs. If performance drops on the testing data, you know there's a problem.
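One way to respect the train/test split during cleaning is to compute any fill-in statistics on the training data only and reuse them on the testing data. The sketch below assumes mean imputation of `bmi`; the two small frames and their values are made up for illustration and stand in for the real CSV files.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-ins for training_data.csv and testing_data.csv
train = pd.DataFrame({"bmi": [20.0, 30.0, np.nan, 25.0]})
test = pd.DataFrame({"bmi": [np.nan, 40.0]})

# Learn the fill value from the training data only
bmi_fill = train["bmi"].mean()

# Apply the same value to both files so the test set never informs preprocessing
train["bmi"] = train["bmi"].fillna(bmi_fill)
test["bmi"] = test["bmi"].fillna(bmi_fill)

print(test)
```

Filling the testing data with its own mean would also run, but it lets information from the test set shape the preprocessing.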
  
You can use any of the tools presented in class: $k$ nearest neighbor, linear models, or decision trees. In principle, $k$ means clustering might also be helpful for looking for patterns in the data that the other methods might miss. Using canned versions of more advanced tools (boosting, bagging, random forests, neural networks, etc.) is deeply unsporting and thus not allowed. You can be creative about transforming variables, or combining decision trees with linear models or $k$NN. Try something interesting. Fail extravagantly. The goal is to work on an intellectually interesting question that is similar to the tasks that data scientists are called on to do every day.
  
We will compare the groups' models to see if there are common trends or significant differences, and also to declare **The Winners** on the basis of whichever team achieves the lowest $RMSE$ on the testing data. A simple linear model with some polynomials and dummy variables achieves an $R^2$ of .087 and a $RMSE$ of .206.
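For reference, the two metrics quoted above can be computed directly from predictions. This is a minimal sketch assuming the usual definitions, $RMSE = \sqrt{\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2}$ and $R^2 = 1 - SSE/TSS$ with $TSS = \sum_i (y_i - \bar{y})^2$; the function names and the toy arrays are illustrative.

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error between outcomes y and predictions y_hat."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

def r_squared(y, y_hat):
    """R^2 = 1 - SSE/TSS, where TSS is measured around the mean of y."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    sse = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - sse / tss

# Toy example: four patients, one stroke, constant prediction equal to the mean
y = [0, 0, 0, 1]
y_hat = [0.25, 0.25, 0.25, 0.25]
print(rmse(y, y_hat), r_squared(y, y_hat))
```

Predicting the mean of the same sample for everyone gives an $R^2$ of zero by construction, so a negative $R^2$ on the testing data indicates a model that does worse than that constant prediction.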

In [None]:
! git clone https://github.com/LexiVanMetre/group7

fatal: destination path 'group7' already exists and is not an empty directory.


In [None]:
import os

cwd = os.getcwd()  # Get the current working directory (cwd)
files = os.listdir(cwd)  # Get all the files in that directory
print("Files in %r: %s" % (cwd, files))

Files in '/content': ['.config', 'group7', 'sample_data']


In [None]:
ls group7

'Paper_ Project 1.pdf'                                                         project_2/
 project_1/                                                                    README.md
'Project 1: Analysis of Socioeconomic Factors Effecting Family Income.ipynb'


ls: cannot access 'project_2/': No such file or directory


In [None]:
import pandas as pd
import numpy as np
df_train = pd.read_csv('/content/group7/project_2/data/training_data.csv')
df_test = pd.read_csv('/content/group7/project_2/data/testing_data.csv')

y_train = df_train['stroke']
X_train = df_train.drop('stroke',axis=1)
y_test = df_test['stroke']
X_test = df_test.drop('stroke',axis=1)

X_train['bmi'] = X_train['bmi'].fillna(X_train['bmi'].mean())
X_test['bmi'] = X_test['bmi'].fillna(X_test['bmi'].mean())

In [None]:
## Linear Model
from sklearn.linear_model import LinearRegression # Import linear regression model
from sklearn.preprocessing import PolynomialFeatures

X_train_numeric = X_train.loc[:,['age','hypertension','heart_disease','bmi','avg_glucose_level'] ]
#
expander = PolynomialFeatures(degree=2,include_bias=False) # Create the expander
Z = expander.fit_transform(X_train_numeric) # Pass the df into the expander to get powers/interactions of x and y
names = expander.get_feature_names_out() # Get the names of these variables
continuous = pd.DataFrame(data=Z, columns = names) # Create a new, expanded dataframe
#
dummies = pd.concat([ pd.get_dummies(X_train['work_type'],dtype='int',drop_first=True),
                      pd.get_dummies(X_train['Residence_type'],dtype='int',drop_first=True),
                      pd.get_dummies(X_train['smoking_status'],dtype='int',drop_first=True)],axis=1)
#
Z_train = pd.concat([continuous,dummies],axis=1)

X_test_numeric = X_test.loc[:,['age','hypertension','heart_disease','bmi','avg_glucose_level'] ]
#
expander = PolynomialFeatures(degree=2,include_bias=False) # Create the expander
Z = expander.fit_transform(X_test_numeric) # Pass the df into the expander to get powers/interactions of x and y
names = expander.get_feature_names_out() # Get the names of these variables
continuous = pd.DataFrame(data=Z, columns = names) # Create a new, expanded dataframe

dummies = pd.concat([ pd.get_dummies(X_test['work_type'],dtype='int',drop_first=True),
                      pd.get_dummies(X_test['Residence_type'],dtype='int',drop_first=True),
                      pd.get_dummies(X_test['smoking_status'],dtype='int',drop_first=True)],axis=1)
#
Z_test = pd.concat([continuous,dummies],axis=1)

# Fit the model and get the R2 measure:
reg = LinearRegression().fit(Z_train, y_train) # Fit the linear model
print('R2: ', reg.score(Z_test, y_test)) # R squared measure
y_hat = reg.predict(Z_test)
N = len(y_test)
print('RMSE: ', (np.sum( (y_test - y_hat)**2)/N )**.5 )   # Root mean squared error


R2:  0.08717964343852191
RMSE:  0.20599583849612824


In [None]:
df_train.head()

Unnamed: 0.1,Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,2465,68685,Male,36.0,0,0,Yes,Govt_job,Urban,65.87,32.2,formerly smoked,0
1,4311,59058,Female,45.0,0,0,Yes,Govt_job,Rural,68.66,25.3,never smoked,0
2,2375,46068,Male,58.0,0,0,No,Self-employed,Rural,170.93,30.7,Unknown,0
3,5017,36837,Female,61.0,0,0,Yes,Self-employed,Urban,69.88,27.1,never smoked,0
4,753,30550,Female,78.0,0,0,No,Private,Urban,103.86,30.6,Unknown,0


In [None]:
print(df_train.dtypes)

Unnamed: 0             int64
id                     int64
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object


In [None]:
df_train['hypertension'] = df_train['hypertension'].map({1: 'yes', 0: 'no'})
df_train['heart_disease'] = df_train['heart_disease'].map({1: 'yes', 0: 'no'})
df_train['stroke'] = df_train['stroke'].map({1: 'yes', 0: 'no'})


In [None]:
print(df_train.dtypes)

Unnamed: 0             int64
id                     int64
gender                object
age                  float64
hypertension          object
heart_disease         object
ever_married          object
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                object
dtype: object


In [None]:
df_train.head()

Unnamed: 0.1,Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,2465,68685,Male,36.0,no,no,Yes,Govt_job,Urban,65.87,32.2,formerly smoked,no
1,4311,59058,Female,45.0,no,no,Yes,Govt_job,Rural,68.66,25.3,never smoked,no
2,2375,46068,Male,58.0,no,no,No,Self-employed,Rural,170.93,30.7,Unknown,no
3,5017,36837,Female,61.0,no,no,Yes,Self-employed,Urban,69.88,27.1,never smoked,no
4,753,30550,Female,78.0,no,no,No,Private,Urban,103.86,30.6,Unknown,no


In [None]:
# checking the categorical variables to make sure that there are no misspellings in the group names that we need to fix:
print(df_train['gender'].value_counts())
print('----------------------------')

print(df_train['ever_married'].value_counts())
print('----------------------------')

print(df_train['work_type'].value_counts())
print('----------------------------')

print(df_train['Residence_type'].value_counts())
print('----------------------------')

print(df_train['smoking_status'].value_counts())
print('----------------------------')

print(df_train['stroke'].value_counts())
print('----------------------------')

print(df_train['heart_disease'].value_counts())
print('----------------------------')

print(df_train['hypertension'].value_counts())
print('----------------------------')


Female    2398
Male      1688
Other        1
Name: gender, dtype: int64
----------------------------
Yes    2686
No     1401
Name: ever_married, dtype: int64
----------------------------
Private          2329
Self-employed     667
children          542
Govt_job          534
Never_worked       15
Name: work_type, dtype: int64
----------------------------
Urban    2052
Rural    2035
Name: Residence_type, dtype: int64
----------------------------
never smoked       1505
Unknown            1241
formerly smoked     699
smokes              642
Name: smoking_status, dtype: int64
----------------------------
no     3888
yes     199
Name: stroke, dtype: int64
----------------------------
no     3858
yes     229
Name: heart_disease, dtype: int64
----------------------------
no     3687
yes     400
Name: hypertension, dtype: int64
----------------------------
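The stroke counts above imply a heavily imbalanced outcome, which gives a useful reference point for the error metrics. As a rough sketch, assuming a model that predicts the training-sample stroke rate for every patient, the in-sample RMSE follows from the counts alone; the testing file will have a somewhat different rate, so this is a benchmark rather than a test-set result.

```python
import math

# Counts taken from the value_counts() output for stroke in the training data
n_no, n_yes = 3888, 199
n = n_no + n_yes

# Share of training observations with a stroke
p = n_yes / n

# For a 0/1 outcome, predicting p for everyone gives MSE = p * (1 - p)
baseline_rmse = math.sqrt(p * (1 - p))

print(f"stroke rate: {p:.4f}, constant-prediction RMSE: {baseline_rmse:.4f}")
```

A model is only adding information if its RMSE falls meaningfully below this kind of constant-prediction benchmark.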


In [None]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder

In [None]:
# New Model for Categorical Variables:
# Categorical Variables with One-Hot Encoding Model
categorical_vars = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status', 'heart_disease', 'hypertension']
X_train_categorical = X_train[categorical_vars]
X_test_categorical = X_test[categorical_vars]

encoder = OneHotEncoder(sparse_output=False, drop='first') # sparse_output replaces the sparse argument in scikit-learn >= 1.2
Xtrain_encoded = encoder.fit_transform(X_train_categorical)
Xtest_encoded = encoder.transform(X_test_categorical)




In [None]:
model = LinearRegression()
model.fit(Xtrain_encoded, y_train)
r2_categorical = model.score(Xtest_encoded, y_test)
y_pred_cat = model.predict(Xtest_encoded)
rmse_cat = np.sqrt(np.mean((y_pred_cat - y_test) ** 2))
print("R2:", r2_categorical)
print("RMSE:", rmse_cat)

R2: 0.03748719579733306
RMSE: 0.21152857630594185


In [None]:
X_combined_train = np.concatenate((X_train_numeric, Xtrain_encoded), axis=1)
X_combined_test = np.concatenate((X_test_numeric, Xtest_encoded), axis=1)

In [None]:
model_combined = LinearRegression()
model_combined.fit(X_combined_train, y_train)
r2_combined = model_combined.score(X_combined_test, y_test)
y_pred_combined = model_combined.predict(X_combined_test)
rmse_combined = np.sqrt(np.mean((y_pred_combined - y_test) ** 2))
print("R2:", r2_combined)
print("RMSE:", rmse_combined)


R2: 0.0817718426714168
RMSE: 0.20660512565020178


In [None]:
# Expanding the numerical variables with degree 3

# Expand features
expander = PolynomialFeatures(degree=3,include_bias=False) # Create the expander
Z_train = expander.fit_transform(X_train_numeric) # Pass the df into the expander to get powers/interactions of x and y
names = expander.get_feature_names_out() # Get the names of these variables
X_train_lm = pd.DataFrame(data=Z_train, columns = names) # Create a new, expanded dataframe

Z_test = expander.transform(X_test_numeric) # Apply the expansion fitted on the training data to the test data
names = expander.get_feature_names_out() # Get the names of these variables
X_test_lm = pd.DataFrame(data=Z_test, columns = names) # Create a new, expanded dataframe

In [None]:
categorical_vars = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status', 'heart_disease', 'hypertension']
X_train_categorical = X_train[categorical_vars]
X_test_categorical = X_test[categorical_vars]

encoder = OneHotEncoder(sparse_output=False, drop='first') # sparse_output replaces the sparse argument in scikit-learn >= 1.2
Xtrain_encoded = encoder.fit_transform(X_train_categorical)
Xtest_encoded = encoder.transform(X_test_categorical)



In [None]:
X_combined_train = np.concatenate((X_train_lm, Xtrain_encoded), axis=1)
X_combined_test = np.concatenate((X_test_lm, Xtest_encoded), axis=1)


In [None]:
# Degree 3

model_combined = LinearRegression()
model_combined.fit(X_combined_train, y_train)
r2_combined = model_combined.score(X_combined_test, y_test)
y_pred_combined = model_combined.predict(X_combined_test)
rmse_combined = np.sqrt(np.mean((y_pred_combined - y_test) ** 2))
print("R2:", r2_combined)
print("RMSE:", rmse_combined)

R2: 0.05453465674571989
RMSE: 0.20964697272565777


In [None]:
# Expand features with degree 5
expander = PolynomialFeatures(degree=5,include_bias=False) # Create the expander
Z_train = expander.fit_transform(X_train_numeric) # Pass the df into the expander to get powers/interactions of x and y
names = expander.get_feature_names_out() # Get the names of these variables
X_train_lm = pd.DataFrame(data=Z_train, columns = names) # Create a new, expanded dataframe

Z_test = expander.transform(X_test_numeric) # Apply the expansion fitted on the training data to the test data
names = expander.get_feature_names_out() # Get the names of these variables
X_test_lm = pd.DataFrame(data=Z_test, columns = names) # Create a new, expanded dataframe

# combine datasets
X_combined_train = np.concatenate((X_train_lm, Xtrain_encoded), axis=1)
X_combined_test = np.concatenate((X_test_lm, Xtest_encoded), axis=1)

# run new model
model_combined = LinearRegression()
model_combined.fit(X_combined_train, y_train)
r2_combined = model_combined.score(X_combined_test, y_test)
y_pred_combined = model_combined.predict(X_combined_test)
rmse_combined = np.sqrt(np.mean((y_pred_combined - y_test) ** 2))
print("R2:", r2_combined)
print("RMSE:", rmse_combined)

R2: -0.22471713185879372
RMSE: 0.23860727728111095


In [None]:
# Expand features with degree 1
expander = PolynomialFeatures(degree=1,include_bias=False) # Create the expander
Z_train = expander.fit_transform(X_train_numeric) # Pass the df into the expander to get powers/interactions of x and y
names = expander.get_feature_names_out() # Get the names of these variables
X_train_lm = pd.DataFrame(data=Z_train, columns = names) # Create a new, expanded dataframe

Z_test = expander.transform(X_test_numeric) # Apply the expansion fitted on the training data to the test data
names = expander.get_feature_names_out() # Get the names of these variables
X_test_lm = pd.DataFrame(data=Z_test, columns = names) # Create a new, expanded dataframe

# combine datasets
X_combined_train = np.concatenate((X_train_lm, Xtrain_encoded), axis=1)
X_combined_test = np.concatenate((X_test_lm, Xtest_encoded), axis=1)

# run new model
model_combined = LinearRegression()
model_combined.fit(X_combined_train, y_train)
r2_combined = model_combined.score(X_combined_test, y_test)
y_pred_combined = model_combined.predict(X_combined_test)
rmse_combined = np.sqrt(np.mean((y_pred_combined - y_test) ** 2))
print("R2:", r2_combined)
print("RMSE:", rmse_combined)

R2: 0.0817718426714168
RMSE: 0.20660512565020178


In [None]:
# looping through each degree to find the R2 and the RMSE values

for i in range(1,10):
  # Expand features with degree i
  expander = PolynomialFeatures(degree=i,include_bias=False) # Create the expander
  Z_train = expander.fit_transform(X_train_numeric) # Pass the df into the expander to get powers/interactions of x and y
  names = expander.get_feature_names_out() # Get the names of these variables
  X_train_lm = pd.DataFrame(data=Z_train, columns = names) # Create a new, expanded dataframe

  Z_test = expander.transform(X_test_numeric) # Apply the expansion fitted on the training data to the test data
  names = expander.get_feature_names_out() # Get the names of these variables
  X_test_lm = pd.DataFrame(data=Z_test, columns = names) # Create a new, expanded dataframe

  # combine datasets
  X_combined_train = np.concatenate((X_train_lm, Xtrain_encoded), axis=1)
  X_combined_test = np.concatenate((X_test_lm, Xtest_encoded), axis=1)

  # run new model
  model_combined = LinearRegression()
  model_combined.fit(X_combined_train, y_train)
  r2_combined = model_combined.score(X_combined_test, y_test)
  y_pred_combined = model_combined.predict(X_combined_test)
  rmse_combined = np.sqrt(np.mean((y_pred_combined - y_test) ** 2))
  print("R2:", r2_combined)
  print("RMSE:", rmse_combined)
  print('-----------------------')

R2: 0.0817718426714168
RMSE: 0.20660512565020178
-----------------------
R2: 0.08628551101336068
RMSE: 0.20609670307217384
-----------------------
R2: 0.05453465674571989
RMSE: 0.20964697272565777
-----------------------
R2: 0.027145742838248554
RMSE: 0.21266189546024228
-----------------------
R2: -0.22471713185879372
RMSE: 0.23860727728111095
-----------------------
R2: -1.661810817371904
RMSE: 0.3517664220570942
-----------------------
R2: -111.46564814143541
RMSE: 2.286523839905831
-----------------------
R2: -2407.717570156376
RMSE: 10.581780018566048
-----------------------
R2: -373.0609581999013
RMSE: 4.170008878068549
-----------------------


Among the models in the loop, degree 2 has the lowest RMSE (0.2061); however, the very first linear model still has the lowest RMSE overall (0.2060).
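The pattern in the loop output, where test error improves slightly and then deteriorates sharply as the degree grows, is typical of overfitting. The sketch below reproduces the idea on synthetic one-dimensional data with `numpy.polyfit` rather than the project data or scikit-learn; the data-generating function, sample sizes, seed, and variable names are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a smooth signal plus noise (hypothetical, not the stroke data)
def simulate(n):
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(2 * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_tr, y_tr = simulate(40)    # small training sample
x_te, y_te = simulate(200)   # larger held-out sample

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

results = {}
for degree in range(1, 10):
    # Fit the polynomial on the training data only
    coefs = np.polyfit(x_tr, y_tr, deg=degree)
    # Evaluate the same fitted polynomial on both samples
    results[degree] = (rmse(y_tr, np.polyval(coefs, x_tr)),
                       rmse(y_te, np.polyval(coefs, x_te)))

for degree, (train_rmse, test_rmse) in results.items():
    print(degree, round(train_rmse, 3), round(test_rmse, 3))

# Pick the degree by held-out error, not training error
best_degree = min(results, key=lambda d: results[d][1])
print("degree with lowest test RMSE:", best_degree)
```

Training RMSE can only fall as terms are added, so it is the held-out RMSE that should guide the choice of degree.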

We now will try to create a Decision Tree.

In [None]:
X_traincode= pd.get_dummies(X_train, drop_first=True)
X_testcode = pd.get_dummies(X_test, drop_first=True)

In [None]:
# Check for missing columns in the test set
missing = set(X_traincode.columns) - set(X_testcode.columns)

# Create missing dummy columns in the test set and set values to 0
for col in missing:
    X_testcode[col] = 0

X_testcode = X_testcode[X_traincode.columns]
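
The same alignment can be written with `DataFrame.reindex`, which adds any dummy columns missing from the test set, drops any that appear only in the test set, and matches the training column order in one step. A minimal sketch on made-up frames (the frame names and category values here are illustrative):

```python
import pandas as pd

# Hypothetical train/test frames where the test set lacks one category
train_raw = pd.DataFrame({"work_type": ["Private", "Govt_job", "Never_worked"]})
test_raw = pd.DataFrame({"work_type": ["Private", "Govt_job"]})

train_coded = pd.get_dummies(train_raw, dtype="int")
test_coded = pd.get_dummies(test_raw, dtype="int")

# Match the training columns; categories unseen in the test set become all-zero columns
test_aligned = test_coded.reindex(columns=train_coded.columns, fill_value=0)

print(test_aligned)
```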


In [None]:
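# Total sum of squares on the test set, used as the denominator of R2
TSS = np.sum( (y_test - np.mean(y_test))**2 )

## Decision trees with varying depths
from sklearn import tree

for i in range(1,10):
  D = i
  model = tree.DecisionTreeRegressor(max_depth=D) # Create a regression tree of depth D
  car_tree = model.fit(X_traincode, y_train) # Fit the tree on the training data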

  ## Make Predictions on the Test Set
  y_hat_car_tree = car_tree.predict(X_testcode)
  residuals_cart = y_test - y_hat_car_tree

  ## Metrics:
  SSE_car_tree = np.sum( (y_test-y_hat_car_tree)**2 )
  MSE_car_tree = (1/N)*np.sum( (y_test-y_hat_car_tree)**2 )
  RMSE_car_tree = (SSE_car_tree/N)**(1/2)
  R2_car_tree = 1 - SSE_car_tree/TSS

  print(D)
  print('R2:', R2_car_tree)
  print('SSE:', SSE_car_tree)
  print('----------------')


1
R2: 1.0
SSE: 0.0
----------------
2
R2: 1.0
SSE: 0.0
----------------
3
R2: 1.0
SSE: 0.0
----------------
4
R2: 1.0
SSE: 0.0
----------------
5
R2: 1.0
SSE: 0.0
----------------
6
R2: 1.0
SSE: 0.0
----------------
7
R2: 1.0
SSE: 0.0
----------------
8
R2: 1.0
SSE: 0.0
----------------
9
R2: 1.0
SSE: 0.0
----------------


This is supposed to be fairly "fun," so please do not turn it into a combinatorial nightmare of comparing thousands of model specifications. Settle on a strategy you think is promising, crank it out, and write up the results. Your time and energy are valuable, so learn to recognize when the marginal cost of another twenty minutes on a project exceeds the benefit in terms of improving the results and your grade.
  
## Paper format

The format of the paper should be:

  - Summary: A one paragraph description of the question, methods, and results (about 350 words).
  - Data: One to two pages discussing the data and key variables, and any challenges in reading, cleaning, and preparing them for analysis.
  - Results: Two to five pages providing visualizations, statistics, a discussion of your methodology, and a presentation of your main findings.
  - Conclusion: One to two pages summarizing the project, defending it from criticism, and suggesting additional work that was outside the scope of the project.
  - Appendix: If you have a significant number of additional plots or tables that you feel are essential to the project, you can put any amount of extra content at the end and reference it from the body of the paper.

## Submission

Half of each student's grade is based on their commits to the repo. Each student is expected to do something specific that contributes to the overall project outcome. Since commits are recorded explicitly by Git/GitHub, this is observable. A student can contribute by cleaning data, creating visualizations, performing analyses, or writing about results, but everyone has to do something substantial. A student's work doesn't need to make it into the final report to be valuable and substantial, and to fulfill the requirement to make a contribution to the project.

The other half of each student's grade is based on the written report. Groups will work together on combining results and writing up findings in a Jupyter notebook, using code chunks to execute Python commands and markdown chunks to structure the paper and provide exposition. The notebook should run on Colab or Rivanna from beginning to end without any errors.

mbers submit.

## Criteria

The project is graded based on four criteria:

  - Project Concept: What is the strategy for building and testing the group's models? How did the group decide how to use the tools presented so far in class? How did the group compare the performance of the options considered, and settle on a final choice for submission?
  - Wrangling, EDA, and Visualization: How are missing values handled? For variables with large numbers of missing values, to what extent do the data and documentation provide an explanation for the missing data? If multiple data sources are used, how are the data merged? For the main variables in the analysis, are the relevant data summarized and visualized through a histogram or kernel density plot where appropriate? Are basic quantitative features of the data addressed and explained? How are outliers characterized and addressed?
  - Analysis: What are the group's main findings? Do the tables, plots, and statistics support the conclusions? Is the research strategy carried out correctly? If the research strategy succeeds, are the results interpreted correctly and appropriately? If the research strategy fails, is there a useful discussion of the flaws of the data collection process or the research strategy?
  - Replication/Documentation: Is the code appropriately commented? Can the main results be replicated from the code and original data files? Are significant choices noted and explained?

Each of the four criteria is equally weighted (25 points out of 100).