## Datathon-2: Notebook Submission



**Content**

As per WHO,

Cancer is the second leading cause of death globally, and was responsible for an estimated 9.6 million deaths in 2018. Globally, about 1 in 6 deaths is due to cancer.
Approximately 70% of deaths from cancer occur in low- and middle-income countries.
Around one third of deaths from cancer are due to the 5 leading behavioral and dietary risks: high body mass index, low fruit and vegetable intake, lack of physical activity, tobacco use, and alcohol use.

**Problem Statement**

Many aspects of the behaviour of cancer are highly unpredictable. Even with the huge number of studies that have been done on the DNA mutations responsible for the disease, we are still unable to use this information at the clinical level. However, it is important that we understand the effects and impacts of this disease from past information as much as we possibly can.

**Objective**

You are required to build a machine learning model that predicts the cancer death rate for the given year.

**Evaluation Criteria**

Submissions are evaluated using Mean Squared Error (MSE).

https://dphi-courses.s3.ap-south-1.amazonaws.com/Datathons/mse.png
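
For reference, the standard definition of MSE over $n$ predictions, which is presumably what the linked image shows, can be written as follows. The symbols $y_i$ (observed death rate) and $\hat{y}_i$ (predicted death rate) are notation introduced here, not taken from the competition page.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```

Lower values are better, and because the errors are squared the score is expressed in squared units of the target.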

**About the data**

The data is collected from cancer.gov and the US Census American Community Survey. There are 34 columns including the target column. Some of the columns are listed below:

TARGET_deathRate: Dependent variable. Mean per capita (100,000) cancer mortalities(a)
avgAnnCount: Mean number of reported cases of cancer diagnosed annually(a)
avgDeathsPerYear: Mean number of reported mortalities due to cancer(a)
incidenceRate: Mean per capita (100,000) cancer diagnoses(a)
medianIncome: Median income per county (b)
popEst2015: Population of county (b)
povertyPercent: Percent of populace in poverty (b)
studyPerCap: Per capita number of cancer-related clinical trials per county (a)
binnedInc: Median income per capita binned by decile (b)
MedianAge: Median age of county residents (b)
MedianAgeMale: Median age of male county residents (b)
MedianAgeFemale: Median age of female county residents (b)
Geography: County name (b)
AvgHouseholdSize: Mean household size of county (b)
PercentMarried: Percent of county residents who are married (b)
PctNoHS18_24: Percent of county residents ages 18-24 highest education attained: less than high school (b)
PctHS18_24: Percent of county residents ages 18-24 highest education attained: high school diploma (b)
PctSomeCol18_24: Percent of county residents ages 18-24 highest education attained: some college (b)
PctBachDeg18_24: Percent of county residents ages 18-24 highest education attained: bachelor's degree (b)
PctHS25_Over: Percent of county residents ages 25 and over highest education attained: high school diploma (b)
PctBachDeg25_Over: Percent of county residents ages 25 and over highest education attained: bachelor's degree (b)
PctEmployed16_Over: Percent of county residents ages 16 and over employed (b)
PctUnemployed16_Over: Percent of county residents ages 16 and over unemployed (b)
PctPrivateCoverage: Percent of county residents with private health coverage (b)
PctPrivateCoverageAlone: Percent of county residents with private health coverage alone (no public assistance) (b)
PctEmpPrivCoverage: Percent of county residents with employee-provided private health coverage (b)
PctPublicCoverage: Percent of county residents with government-provided health coverage (b)
PctPubliceCoverageAlone: Percent of county residents with government-provided health coverage alone (b)
PctWhite: Percent of county residents who identify as White (b)
PctBlack: Percent of county residents who identify as Black (b)
PctAsian: Percent of county residents who identify as Asian (b)
PctOtherRace: Percent of county residents who identify in a category which is not White, Black, or Asian (b)
PctMarriedHouseholds: Percent of married households (b)
BirthRate: Number of live births relative to number of women in county (b)
(a): years 2010-2016

(b): 2013 Census Estimates

## Task 1

### Import Libraries

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
from collections import Counter

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',500)

### Load the data and display first 5 rows.

In [None]:
# load training data
cancer_data  = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/cancer_death_rate/Training_set_label.csv" )

In [None]:
cancer_data.head()

In [None]:
cancer_data.shape

In [None]:
#load test data
test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/cancer_death_rate/Testing_set_label.csv')

In [None]:
test_data.shape

### Perform Exploratory Data Analysis

We will perform all EDA and preprocessing steps on the training data to keep the test data unseen, and then apply the same steps to the test data prior to prediction.

In [None]:
# check info on columns
cancer_data.info()

All columns are numeric except two, binnedInc and Geography.

In [None]:
# descriptive statistics
cancer_data.describe()

In [None]:
# matplotlib and seaborn were already imported above; only set the plot style here
sns.set_style('whitegrid')

In [None]:
# let's take the numeric columns and examine the data distribution
col_numeric = list(cancer_data.select_dtypes(exclude='object').columns)
len(col_numeric)


In [None]:
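# plotting histograms of the numeric columns
plt.figure(figsize=(20,16))
for i, col in enumerate(col_numeric, start=1):
    plt.subplot(8,4,i)
    # drop missing values so the histogram range is well defined
    plt.hist(cancer_data[col].dropna())
    plt.xlabel(col)
# adjust the layout once, after all subplots are drawn
plt.tight_layout()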

Several distributions are skewed and may need a transformation to reduce the skew.
Others, such as MedianAgeMale and MedianAgeFemale, look approximately normal.
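
As a complement to reading the histograms by eye, the skewed columns can be ranked numerically. The sketch below uses a toy frame with hypothetical column names in place of `cancer_data`, and the cut-off of 1.0 is an assumed rule of thumb rather than anything prescribed by the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cancer_data; the column names are hypothetical.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "roughly_symmetric": rng.normal(40, 5, 2000),
    "right_skewed": rng.lognormal(0, 1, 2000),
})

# DataFrame.skew() returns the sample skewness of each numeric column.
skewness = toy.skew().sort_values(key=np.abs, ascending=False)
print(skewness)

# Columns whose absolute skewness exceeds the assumed threshold of 1.0.
candidates = skewness[skewness.abs() > 1.0].index.tolist()
print(candidates)
```

On the real data the same two lines could be applied to `cancer_data[col_numeric]` to shortlist columns for a log or similar transformation.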

In [None]:
# let's check box plots of the above columns to see outliers clearly
plt.figure(figsize=(20,16))
for i, col in enumerate(col_numeric, start=1):
    plt.subplot(8,4,i)
    sns.boxplot(y=cancer_data[col])
# adjust the layout once, after all subplots are drawn
plt.tight_layout()

In [None]:
#lets look at object data type
cat_col = cancer_data.select_dtypes(include='object').columns
cat_col

In [None]:
for col in cat_col:
    print(cancer_data[col].value_counts())
    

Geography has a very large number of categories, each occurring only once or twice, so the column is unlikely to carry much information. We can drop the Geography column.
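
The "almost one category per row" observation can also be checked with a simple ratio of distinct values to rows. This is a minimal sketch on made-up data; the 0.9 cut-off is an arbitrary assumption.

```python
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Geography": ["A County", "B County", "C County", "D County"],
    "binnedInc": ["low", "high", "low", "high"],
})

# Ratio of distinct values to rows: close to 1.0 means nearly one level per row.
cardinality_ratio = toy.nunique() / len(toy)
print(cardinality_ratio)

# 0.9 is an arbitrary cut-off chosen for this sketch.
near_unique = cardinality_ratio[cardinality_ratio > 0.9].index.tolist()
print(near_unique)
```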

In [None]:
# delete Geography column from the training set
cancer_data.drop('Geography',axis=1,inplace=True)

In [None]:
# let's encode the binnedInc categories as integer codes
from sklearn.preprocessing import LabelEncoder
le= LabelEncoder()
cancer_data['binnedInc'] = le.fit_transform(cancer_data['binnedInc'])
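
One thing to keep in mind is that `LabelEncoder` assigns codes in sorted string order, which need not match the numeric order of the income bins. The sketch below shows an alternative that ranks bins by their lower bound. It assumes the labels are interval strings such as `"(34218.1, 37413.8]"`; that format and the example values are assumptions, and `lower_bound` and `code_of` are hypothetical helper names.

```python
import re

# Hypothetical bin labels written as interval strings; the real binnedInc
# labels may be formatted differently.
bins = ["(45201, 48021.6]", "[22640, 34218.1]", "(34218.1, 37413.8]", "(125635, 200000]"]

def lower_bound(label):
    """Return the first number found in an interval-style label."""
    return float(re.search(r"[-+]?\d+(?:\.\d+)?", label).group())

# Rank the distinct labels by their lower bound, then map label -> rank.
ordered = sorted(set(bins), key=lower_bound)
code_of = {label: rank for rank, label in enumerate(ordered)}
print(code_of)

# Plain string sorting (the order LabelEncoder uses) arranges them differently.
print(sorted(set(bins)))
```

With a mapping like this, `cancer_data['binnedInc'].map(code_of)` would give codes that increase with income, and the same dictionary could be reused unchanged on the test set.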

In [None]:
# lets handle null values in data set
null_data = cancer_data.isnull().sum()
null_data[null_data.values != 0]/len(cancer_data)

About 75% of the values in PctSomeCol18_24 are missing, so we drop this column.
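
The same decision can be expressed as a threshold on the fraction of missing values per column. This is a sketch on a toy frame; the 0.5 threshold and the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "few_missing": [1.0, np.nan, 3.0, 4.0],
    "complete": [1.0, 2.0, 3.0, 4.0],
})

# isnull().mean() gives the fraction of missing values in each column.
missing_fraction = toy.isnull().mean()

# Drop columns where more than half of the values are missing (assumed threshold).
to_drop = missing_fraction[missing_fraction > 0.5].index.tolist()
reduced = toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```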

In [None]:
cancer_data.drop('PctSomeCol18_24',axis=1,inplace=True)

In [None]:
cancer_data['PctEmployed16_Over'].nunique()

In [None]:
# examine closely the null values (Percent of county residents ages 16 and over employed (b))
cancer_data[cancer_data['PctEmployed16_Over'].isnull()].head()

In [None]:
cancer_data['PctEmployed16_Over'].describe()

The column shows no obvious outliers, so the missing values can be imputed with the mean.

In [None]:
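# imputing with the mean value (assign back rather than calling fillna in place on a column selection)
cancer_data['PctEmployed16_Over'] = cancer_data['PctEmployed16_Over'].fillna(cancer_data['PctEmployed16_Over'].mean())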

In [None]:
# lets look at PctPrivateCoverageAlone column for missing values
cancer_data['PctPrivateCoverageAlone'].nunique()

In [None]:
cancer_data['PctPrivateCoverageAlone'].describe()

Here as well there are no obvious outliers, so we impute with the mean.

In [None]:
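# imputing with the mean value (assign back rather than calling fillna in place on a column selection)
cancer_data['PctPrivateCoverageAlone'] = cancer_data['PctPrivateCoverageAlone'].fillna(cancer_data['PctPrivateCoverageAlone'].mean())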

In [None]:
cancer_data.isnull().sum()

No null values remain in the dataframe.
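
An alternative way to do mean imputation, sketched below on toy data, is scikit-learn's `SimpleImputer`: it learns the mean from the training rows and reuses that same value on any later data. The two small frames are made up; only the column name is borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values; only the column name comes from the dataset.
train = pd.DataFrame({"PctEmployed16_Over": [50.0, np.nan, 60.0, 70.0]})
test = pd.DataFrame({"PctEmployed16_Over": [np.nan, 55.0]})

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train)  # learns the training mean
test_filled = imputer.transform(test)        # reuses that mean for the test rows

print(imputer.statistics_)
print(test_filled.ravel())
```

The design choice here is that the fill value is a fitted parameter, so the test set is never used to compute it.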

###  BUILDING A BASE LINE MODEL 

In [None]:
# building a baseline model for reference with all attributes as X
X1 = cancer_data.drop('TARGET_deathRate',axis=1)
y1 = cancer_data['TARGET_deathRate']

In [None]:
# splitting the data into training and validation sets
from sklearn.model_selection import train_test_split
X1_train,X1_valid,y1_train,y1_valid = train_test_split(X1,y1,test_size=0.3,shuffle=True,random_state=42)

In [None]:
# using RandomForestRegressor as base model
from sklearn.ensemble import RandomForestRegressor
rf1 = RandomForestRegressor()
rf1.fit(X1_train,y1_train)

In [None]:
#training MSE
from sklearn import metrics  # needed here; it was previously imported only in the next cell
y1_pred_trg = rf1.predict(X1_train)
metrics.mean_squared_error(y1_train,y1_pred_trg)

In [None]:
#validation MSE
from sklearn import metrics
y_pred1_valid = rf1.predict(X1_valid)
metrics.mean_squared_error(y1_valid,y_pred1_valid)
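
To put the baseline MSE in context, it can help to compare it with a model that always predicts the training mean. The sketch below uses scikit-learn's `DummyRegressor` on synthetic data; the array shapes and the target distribution are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-ins for the train/validation split.
rng = np.random.default_rng(42)
X_train, X_valid = rng.normal(size=(80, 3)), rng.normal(size=(20, 3))
y_train, y_valid = rng.normal(180, 25, 80), rng.normal(180, 25, 20)

# Always predicts the mean of y_train, ignoring the features.
dummy = DummyRegressor(strategy="mean").fit(X_train, y_train)
floor_mse = mean_squared_error(y_valid, dummy.predict(X_valid))
print(floor_mse)
```

Any real model should beat this reference score on the validation set; with the notebook's variables the same call would take `X1_train`, `y1_train`, `X1_valid` and `y1_valid`.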

## Task 2

### Perform Data Preparation Steps

In [None]:
cancer_data.head()

### FEATURE ENGINEERING

In [None]:
# the avg death has no meaning without referring to population
#so we create a new col pop_to_avgdeath
cancer_data['pop_to_avgdeath'] = cancer_data['popEst2015']/cancer_data['avgDeathsPerYear']

In [None]:
#lets look at new col distribution
plt.hist(cancer_data['pop_to_avgdeath'])
plt.show()

In [None]:
sns.boxplot(y=cancer_data['pop_to_avgdeath'])
plt.show()

In [None]:
#transform the columns using log transformation
cancer_data['pop_to_avgdeath'] = np.log(cancer_data['pop_to_avgdeath'])

In [None]:
plt.hist(cancer_data['pop_to_avgdeath'])
plt.show()
from scipy.stats import skew 
print(skew(cancer_data['pop_to_avgdeath']))


Now the distribution looks less skewed. The skew value is 0.92.

In [None]:
# let's drop the original columns (columns= already implies axis=1)
cancer_data.drop(columns=['popEst2015','avgDeathsPerYear'],inplace=True)

In [None]:
# handle skewness of data
cancer_data['medIncome'] = np.log(cancer_data['medIncome'])
plt.hist(cancer_data['medIncome'])
plt.show()

In [None]:
from scipy.stats import skew 
skew(cancer_data['medIncome'])

skewness = 0 : symmetric distribution (a normal distribution is one example).
skewness > 0 : longer right tail, with most values concentrated on the left.
skewness < 0 : longer left tail, with most values concentrated on the right.
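
A quick numerical check of the sign convention is sketched below. It computes the Fisher-Pearson coefficient by hand (the same quantity `scipy.stats.skew` returns with its default settings) on synthetic log-normal data; `sample_skewness` is a hypothetical helper name.

```python
import numpy as np

def sample_skewness(x):
    """Fisher-Pearson coefficient g1 = m3 / m2**1.5 (biased estimator)."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    m2 = np.mean(centred ** 2)
    m3 = np.mean(centred ** 3)
    return m3 / m2 ** 1.5

rng = np.random.default_rng(1)
right_tailed = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # long right tail

print(sample_skewness(right_tailed))          # positive
print(sample_skewness(-right_tailed))         # same magnitude, negative
print(sample_skewness(np.log(right_tailed)))  # close to zero after the log
```

The last line also illustrates why a log transform is a common remedy for a long right tail.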

In [None]:
plt.hist(cancer_data['PctWhite'])
plt.show()

In [None]:
cancer_data['PctWhite'] = np.log(cancer_data['PctWhite'])

### Using Hyperparameter tuning on baseline model

In [None]:
# scaling data using minmax scaler after splitting 
from sklearn.preprocessing import MinMaxScaler
X2 = cancer_data.drop('TARGET_deathRate',axis=1)
y2 = cancer_data[['TARGET_deathRate']]
X2_train,X2_valid,y2_train,y2_valid = train_test_split(X2,y2,test_size=0.3,shuffle=True,random_state=42)

In [None]:
col_train = X2_train.columns
col_target = y2_train.columns

In [None]:
X_scaled = MinMaxScaler()
X_scaled.fit(X2_train)


In [None]:
X2_train = X_scaled.transform(X2_train)
X2_valid = X_scaled.transform(X2_valid)
X2_train = pd.DataFrame(X2_train,columns=col_train)
X2_valid = pd.DataFrame(X2_valid,columns=col_train)

In [None]:
# scale target variable
target_scaler = MinMaxScaler()
target_scaler.fit(y2_train)

In [None]:
y2_train = target_scaler.transform(y2_train)
y2_valid = target_scaler.transform(y2_valid)
y2_train = pd.DataFrame(y2_train,columns=col_target)
y2_valid = pd.DataFrame(y2_valid,columns=col_target) 

In [None]:
from sklearn.model_selection import GridSearchCV
# Create the parameter grid to search over
param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]
}
# Create a base model
rf = RandomForestRegressor()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 5, n_jobs = -1, verbose = 2)
# Fit the grid search to the data
# pass the target as a 1-D array to avoid a DataConversionWarning
grid_search.fit(X2_train,y2_train.values.ravel())

In [None]:
grid_search.best_params_

In [None]:
y2_pred_trg = grid_search.predict(X2_train)
metrics.mean_squared_error(y2_train,y2_pred_trg)


In [None]:
y2_pred_valid = grid_search.predict(X2_valid)
metrics.mean_squared_error(y2_valid,y2_pred_valid)
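
Because the target was min-max scaled, the two MSE values above are in scaled units and are not directly comparable with the baseline MSE or the leaderboard score. The sketch below shows two equivalent ways to get back to original units, using made-up targets and predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler

# Made-up targets and predictions in original units.
y_true = np.array([[150.0], [180.0], [200.0], [250.0]])
y_pred = np.array([[160.0], [175.0], [210.0], [240.0]])

scaler = MinMaxScaler().fit(y_true)
y_true_s, y_pred_s = scaler.transform(y_true), scaler.transform(y_pred)
mse_scaled = mean_squared_error(y_true_s, y_pred_s)

# Option 1: undo the scaling on both arrays before scoring.
mse_raw = mean_squared_error(scaler.inverse_transform(y_true_s),
                             scaler.inverse_transform(y_pred_s))

# Option 2: rescale the MSE itself by the squared target range.
target_range = scaler.data_max_[0] - scaler.data_min_[0]
mse_from_scaled = mse_scaled * target_range ** 2

print(mse_scaled, mse_raw, mse_from_scaled)
```

In the notebook, option 1 would correspond to calling `target_scaler.inverse_transform` on the reshaped predictions and on `y2_valid` before computing the metric.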

### Predicting target on test data 

We shall preprocess the test data in the same way as the training data before using the model on it.

In [None]:
# drop geography 
test_data.drop('Geography',axis=1,inplace=True)

In [None]:
# encode binnedInc with the LabelEncoder already fitted on the training data,
# so that both sets share the same category-to-code mapping
test_data['binnedInc'] = le.transform(test_data['binnedInc'])

In [None]:
# lets handle null values in test data set
null_data = test_data.isnull().sum()
null_data[null_data.values != 0]/len(test_data)

In [None]:
#dropping PctSomeCol18_24
test_data.drop('PctSomeCol18_24',axis=1,inplace=True)

In [None]:
# imputing with the mean value (assign back rather than calling fillna in place on a column selection)
test_data['PctEmployed16_Over'] = test_data['PctEmployed16_Over'].fillna(test_data['PctEmployed16_Over'].mean())

In [None]:
# PctPrivateCoverageAlone: imputing with the mean
test_data['PctPrivateCoverageAlone'] = test_data['PctPrivateCoverageAlone'].fillna(test_data['PctPrivateCoverageAlone'].mean())

In [None]:
#so we create a new col pop_to_avgdeath
test_data['pop_to_avgdeath'] = test_data['popEst2015']/test_data['avgDeathsPerYear']

In [None]:
#transform the columns using log transformation
test_data['pop_to_avgdeath'] = np.log(test_data['pop_to_avgdeath'])

In [None]:
# let's drop the original columns (columns= already implies axis=1)
test_data.drop(columns=['popEst2015','avgDeathsPerYear'],inplace=True)

In [None]:
# medincome col
test_data['medIncome'] = np.log(test_data['medIncome'])

In [None]:
# PctWhite column: apply the log to the test set, mirroring the training step
test_data['PctWhite'] = np.log(test_data['PctWhite'])

In [None]:
test_data.head()

In [None]:
col_test = test_data.columns

In [None]:
test_data_scaled = X_scaled.transform(test_data)
test_data_scaled = pd.DataFrame(test_data_scaled,columns=col_test)

In [None]:
test_data_scaled.head()

In [None]:
y_hat2 = grid_search.predict(test_data_scaled)

In [None]:
y_hat2 = y_hat2.reshape(-1,1)

In [None]:
#take inverse transform to get predictions
y_hat2 = target_scaler.inverse_transform(y_hat2)

In [None]:
prediction_df = pd.DataFrame(y_hat2.flatten())
prediction_df.columns = ['prediction']
prediction_df.head()

In [None]:
prediction_df.shape

In [None]:
prediction_df.head()

In [None]:
prediction_df.to_csv('hari_assignment3_rf.csv',index=False)

228.78564304461932 is the score (MSE) obtained for this model.

## USING FEATURE SELECTION 

In [None]:
# drop duplicate rows from the training data (the test data is only checked for duplicates below)
cancer_data.drop_duplicates(keep='first',inplace=True)

In [None]:
duplicate = cancer_data[cancer_data.duplicated()]
duplicate.shape

In [None]:
duplicate_test = test_data[test_data.duplicated()]
duplicate_test.shape

No duplicates remain in either the train or the test data.

In [None]:
from boruta import BorutaPy as bp
from sklearn.ensemble import RandomForestRegressor
X = cancer_data.drop('TARGET_deathRate',axis=1)
y = cancer_data['TARGET_deathRate']

In [None]:
X_train,X_valid,y_train,y_valid = train_test_split(X,y,test_size=0.2,random_state=42)
X_train = X_train.reset_index(drop=True)
X_valid = X_valid.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_valid = y_valid.reset_index(drop=True)

In [None]:
#using boruta for feature selection
rf_model =  RandomForestRegressor(n_jobs=-1, max_depth=5)
feat_selector = bp(rf_model, n_estimators='auto', verbose=0, random_state=42, max_iter = 100, perc = 70)
feat_selector.fit(np.array(X_train),np.array(y_train))
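
The core idea behind Boruta can be illustrated without the library: add shuffled "shadow" copies of every feature and keep only real features whose importance beats the shadows. The sketch below performs a single such round with a plain random forest on synthetic data. It is not BorutaPy's actual implementation, which repeats the comparison over many iterations and applies a statistical test, and all names and values here are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))  # column 0 is informative, columns 1 and 2 are noise
y = 5.0 * X[:, 0] + rng.normal(scale=0.5, size=n)

# Shadow features: each real column shuffled independently, breaking any link with y.
shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_aug = np.hstack([X, shadows])

forest = RandomForestRegressor(n_estimators=200, max_depth=5, random_state=0)
forest.fit(X_aug, y)

real_imp = forest.feature_importances_[:3]
shadow_max = forest.feature_importances_[3:].max()
beats_shadows = real_imp > shadow_max
print(real_imp, shadow_max, beats_shadows)
```

In BorutaPy, the `perc` argument replaces the maximum shadow importance with a percentile, so `perc = 70` as used above is a less strict bar than the default.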

In [None]:
# Let's visualise it better in the form of a table
selected_rfe_features = pd.DataFrame({'Feature':list(X_train.columns),
                                      'Ranking':feat_selector.ranking_})
selected_rfe_features.sort_values(by='Ranking')

In [None]:
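# keep only the features Boruta confirmed (support_ is the mask that feat_selector.transform uses),
# instead of dropping hard-coded row indices
selected_rfe_features = selected_rfe_features[feat_selector.support_]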

In [None]:
#list of columns selected by boruta
final_col = list(selected_rfe_features['Feature'].values)

In [None]:
#transform selected train and valid set
X_important_train = feat_selector.transform(np.array(X_train))
X_important_valid = feat_selector.transform(np.array(X_valid))

In [None]:
X_important_train = pd.DataFrame(X_important_train,columns=final_col)
X_important_valid = pd.DataFrame(X_important_valid,columns=final_col)

In [None]:
#transform test data
test_data_imp = feat_selector.transform(np.array(test_data))

In [None]:
test_data_imp = pd.DataFrame(test_data_imp,columns=final_col)


In [None]:
X_important_train.shape,X_important_valid.shape,test_data_imp.shape

In [None]:
# do grid search CV for hyperparameter tuning
gsc = GridSearchCV(
        estimator=RandomForestRegressor(),
        param_grid={
            'max_depth': range(3,20),
            'n_estimators': (10, 50, 100, 1000),
        },
        cv=5, scoring='neg_mean_squared_error', verbose=0,n_jobs=-1)
    
grid_result = gsc.fit(X_important_train, y_train)
best_params = grid_result.best_params_

In [None]:
# gsc was already fitted in the previous cell (grid_result is the same object), so just inspect the result
best_params


In [None]:
y_pred_trg = gsc.predict(X_important_train)
metrics.mean_squared_error(y_train,y_pred_trg)

In [None]:
y_pred_valid = grid_result.predict(X_important_valid)
metrics.mean_squared_error(y_valid,y_pred_valid)

In [None]:
#predicting on test data
y_hat = grid_result.predict(test_data_imp)

In [None]:
prediction_df = pd.DataFrame(y_hat.flatten())
prediction_df.columns = ['prediction']
prediction_df.head()

In [None]:
prediction_df.shape

In [None]:
prediction_df.to_csv('hari_assignment3_rf_fs.csv',index=False)

225.02555083369964 is the score (MSE) obtained for this model.

### USING ABOVE RANDOM FOREST WITH HYPERPARAMETER TUNING AND SCALING

In [None]:
#calling MinMaxScaler
X_scaled = MinMaxScaler()
#fit X_important_train with scaler
X_scaled.fit(X_important_train)

In [None]:
X_train = X_scaled.transform(X_important_train)
X_valid = X_scaled.transform(X_important_valid)
X_train = pd.DataFrame(X_train,columns=final_col)
X_valid = pd.DataFrame(X_valid,columns=final_col)

In [None]:
X_train.head()

In [None]:
# MinMaxScaler expects 2-D input, so convert the target Series to a one-column DataFrame
y_train = pd.DataFrame(y_train, columns=col_target)

In [None]:
# scale target variable
target_scaler = MinMaxScaler()
target_scaler.fit(y_train)

In [None]:
y_valid = pd.DataFrame(y_valid, columns=col_target)

In [None]:
y_train = target_scaler.transform(y_train)
y_valid = target_scaler.transform(y_valid)
y_train = pd.DataFrame(y_train,columns=col_target)
y_valid = pd.DataFrame(y_valid,columns=col_target)

In [None]:
#scaling test data
test_data_scaled_imp = X_scaled.transform(test_data_imp)
test_data_scaled_imp = pd.DataFrame(test_data_scaled_imp,columns=final_col)

In [None]:
# do grid search CV for hyperparameter tuning
gsc = GridSearchCV(
        estimator=RandomForestRegressor(),
        param_grid={
            'max_depth': range(3,20),
            'n_estimators': (10, 50, 100, 1000),
        },
        cv=5, scoring='neg_mean_squared_error', verbose=0,n_jobs=-1)
    
grid_result = gsc.fit(X_train, y_train)
best_params = grid_result.best_params_

In [None]:
best_params

In [None]:
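# grid_result is already fitted (and refit on the full training split), so just show the chosen estimator
grid_result.best_estimator_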


In [None]:
#training MSE
y_pred_trg2 = grid_result.predict(X_train)
metrics.mean_squared_error(y_train,y_pred_trg2)

In [None]:
#validation MSE
y_pred_valid2 = grid_result.predict(X_valid)
metrics.mean_squared_error(y_valid,y_pred_valid2)

In [None]:
#predicting on test data
y_hat_2 = grid_result.predict(test_data_scaled_imp)

In [None]:
y_hat_2 = target_scaler.inverse_transform(y_hat_2.reshape(-1,1))


In [None]:
prediction_df = pd.DataFrame(y_hat_2.flatten())
prediction_df.columns = ['prediction']
prediction_df.head()

In [None]:
prediction_df.to_csv('hari_assignment3_rf_fs_sc.csv',index=False)

221.57504664347528 is the score (MSE) obtained for this model.

### Try out other Machine Learning Models and Evaluate them

We shall try the xgboost algorithm on the above data set for one more iteration. We use the important features obtained from Boruta, with unscaled features and target, because xgboost raised an error with the scaled data.

In [None]:
import xgboost as xg 


In [None]:
# use unscaled y_train
y_train = target_scaler.inverse_transform(y_train)


In [None]:
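#fitting and predicting with xgboost regressor
xgdmat=xg.DMatrix(X_important_train,y_train)
# 'reg:squarederror' is the current name of the deprecated 'reg:linear' objective
our_params={'eta':0.1,'seed':0,'subsample':0.8,'colsample_bytree':0.8,'objective':'reg:squarederror','max_depth':3,'min_child_weight':1}
# num_boost_round is not passed, so xg.train uses its default of 10 rounds
final_gb=xg.train(our_params,xgdmat)
tesdmat=xg.DMatrix(test_data_imp)
pred =final_gb.predict(tesdmat)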




In [None]:
prediction_df_xgb = pd.DataFrame(pred.flatten())
prediction_df_xgb.columns = ['prediction']
prediction_df_xgb.head()

In [None]:
prediction_df_xgb.shape

In [None]:
#write to csv file
prediction_df_xgb.to_csv('hari_assignment3_xgb.csv',index=False)

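4409.702415442606 is the mean squared error obtained, far worse than the random forest models, so we disregard this model as configured. The booster was trained with only the default number of boosting rounds at a learning rate of 0.1, so it is likely underfitted rather than inherently worse.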
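
One plausible reason for such a large error is that too few boosting rounds were run for the chosen learning rate. The arithmetic below is a simplified sketch, not a reproduction of xgboost: it assumes an idealised tree that predicts the residual exactly, a starting score of 0.5 (the historical xgboost default `base_score`), a hypothetical typical target of 178, and the default of 10 rounds used by `xg.train` when `num_boost_round` is not passed.

```python
# Assumed values, not taken from the notebook's output:
target = 178.0     # hypothetical typical death rate per 100,000
prediction = 0.5   # starting score (historical xgboost default base_score)
eta = 0.1          # learning rate used in our_params
rounds = 10        # xg.train's default num_boost_round

for _ in range(rounds):
    # With squared error, an ideal tree would predict the residual;
    # shrinkage lets only a fraction eta of it through each round.
    prediction += eta * (target - prediction)

gap = target - prediction
print(prediction, gap ** 2)
```

Under these assumptions the prediction is still far from the target after 10 rounds, which would leave a squared error of the same order as the score reported above. Passing a larger `num_boost_round` to `xg.train` would be the natural thing to try.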

## CONCLUSION

After multiple iterations, the best model was a RandomForestRegressor tuned with GridSearchCV on the 28 features selected by the Boruta feature selection algorithm, with min-max scaling of the features and target (MSE of about 221.58).

Further reduction in MSE may be possible by transforming more of the features that remain skewed.