# <span style="color:Blue">Hyperparameter Project - Dataset 1</span>

## Abstract
The goal of this project is to predict cancer mortality rates for US counties using multiple regression algorithms. To do so, we first identify the predictors (i.e. _independent variables_) that are strongly related to our target variable '_TARGET_deathRate_'. To achieve better results, we clean the data by removing extreme outliers and normalizing it. We use various regression algorithms such as linear regression, logistic regression, stepwise regression, and regularization using either Ridge regression (i.e. _L2 regularization_) or Lasso regression (i.e. _L1 regularization_). To make predictions more accurate, we create multiple models and compare their outputs. We then cross-validate the output of each model using methods such as K-Fold and `cross_val_score` from sklearn.

### Data Dictionary
***TARGET_deathRate*** : Dependent variable. Mean per capita (100,000) cancer mortalities(a) <br />
***avgAnnCount*** : Mean number of reported cases of cancer diagnosed annually(a) <br />
***avgDeathsPerYear*** : Mean number of reported mortalities due to cancer(a) <br />
***incidenceRate*** : Mean per capita (100,000) cancer diagnoses(a) <br />
***medIncome*** : Median income per county (b) <br />
***popEst2015*** : Population of county (b) <br />
***povertyPercent*** : Percent of populace in poverty (b) <br />
***studyPerCap*** : Per capita number of cancer-related clinical trials per county (a) <br />
***binnedInc*** : Median income per capita binned by decile (b) <br />
***MedianAge*** : Median age of county residents (b) <br />
***MedianAgeMale*** : Median age of male county residents (b) <br />
***MedianAgeFemale*** : Median age of female county residents (b) <br />
***Geography*** : County name (b) <br />
***AvgHouseholdSize*** : Mean household size of county (b) <br />
***PercentMarried*** : Percent of county residents who are married (b) <br />
***PctNoHS18_24*** : Percent of county residents ages 18-24 highest education attained: less than high school (b) <br />
***PctHS18_24*** : Percent of county residents ages 18-24 highest education attained: high school diploma (b) <br />
***PctSomeCol18_24*** : Percent of county residents ages 18-24 highest education attained: some college (b) <br />
***PctBachDeg18_24*** : Percent of county residents ages 18-24 highest education attained: bachelor's degree (b) <br />
***PctHS25_Over*** : Percent of county residents ages 25 and over highest education attained: high school diploma (b) <br />
***PctBachDeg25_Over*** : Percent of county residents ages 25 and over highest education attained: bachelor's degree (b) <br />
***PctEmployed16_Over*** : Percent of county residents ages 16 and over employed (b) <br />
***PctUnemployed16_Over*** : Percent of county residents ages 16 and over unemployed (b) <br />
***PctPrivateCoverage*** : Percent of county residents with private health coverage (b) <br />
***PctPrivateCoverageAlone*** : Percent of county residents with private health coverage alone (no public assistance) (b) <br />
***PctEmpPrivCoverage*** : Percent of county residents with employee-provided private health coverage (b) <br />
***PctPublicCoverage*** : Percent of county residents with government-provided health coverage (b) <br />
***PctPublicCoverageAlone*** : Percent of county residents with government-provided health coverage alone (b) <br />
***PctWhite*** : Percent of county residents who identify as White (b) <br />
***PctBlack*** : Percent of county residents who identify as Black (b) <br />
***PctAsian*** : Percent of county residents who identify as Asian (b) <br />
***PctOtherRace*** : Percent of county residents who identify in a category which is not White, Black, or Asian (b) <br />
***PctMarriedHouseholds*** : Percent of married households (b) <br />
***BirthRate*** : Number of live births relative to number of women in county (b) <br />

(a): years 2010-2016 <br />
(b): 2013 Census Estimates

## Acknowledgements

The website hosting the data is located at https://data.world/nrippner/ols-regression-challenge. These data were aggregated from a number of sources including the American Community Survey (https://www.census.gov), https://www.clinicaltrials.gov, and https://www.cancer.gov.

#### Let's start by importing libraries

In [None]:
# importing libraries
%matplotlib inline 
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import statsmodels.api as sm
from sklearn import linear_model, metrics
from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.linear_model import LinearRegression,LogisticRegression, Ridge, Lasso, ElasticNet, SGDRegressor
from sklearn.preprocessing import PolynomialFeatures, normalize, StandardScaler, MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold

import pylab as pl

# Importing H2O
import time, warnings, h2o, logging, os, sys, psutil, random
from h2o.automl import H2OAutoML


from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch
# from __future__ import print_function

import json
import datetime
import warnings
warnings.filterwarnings('ignore')


In [None]:
#Reading Data
df_cancer_data=pd.read_csv(r'cancer_reg.csv', encoding='latin-1')
# df=pd.read_csv(r'C://Users//kaila//OneDrive//Desktop//DSMT//Assignment_2_Linear Models//cancer_reg.csv', encoding='latin-1')

In [None]:
df_cancer_data.head(10)

In [None]:
df_cancer_data.describe()

In [None]:
df_cancer_data.shape

In [None]:
df_cancer_data.isnull().sum()

Columns PctSomeCol18_24, PctEmployed16_Over and PctPrivateCoverageAlone contain null values.<br />

### Feature Engineering:

In [None]:
df_cancer_data.loc[(df_cancer_data['MedianAge'] > 0) & (df_cancer_data['MedianAge'] <= 40), 'MedianAge'] = int(1)
df_cancer_data.loc[(df_cancer_data['MedianAge'] > 40) & (df_cancer_data['MedianAge'] <= 50), 'MedianAge'] = int(2)
df_cancer_data.loc[(df_cancer_data['MedianAge'] > 50), 'MedianAge'] = int(3)
df_cancer_data['MedianAge']=df_cancer_data['MedianAge'].round(0).astype(int)

df_cancer_data['MedianAge'].value_counts()

The MedianAge column is converted to a categorical column.

| MedianAge | Categorical Value   | Occurrence |
|-----------|---------------------|------------|
| 0 to 40   | 1                   | 1270       |
| 40 to 50  | 2                   | 1621       |
| Above 50  | 3                   | 156        |

All values are converted to integer type to make computation easier.
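
The same binning can also be written with `pd.cut`. The following is a minimal sketch, assuming a small hypothetical series in place of the real `MedianAge` column; the bin edges mirror the three `.loc` conditions used above.

```python
import numpy as np
import pandas as pd

# Hypothetical median ages standing in for the MedianAge column
ages = pd.Series([35.2, 40.0, 40.1, 47.8, 50.0, 63.5], name="MedianAge")

# Right-closed bins (0, 40], (40, 50], (50, inf) mirror the three .loc conditions
age_group = pd.cut(ages, bins=[0, 40, 50, np.inf], labels=[1, 2, 3]).astype(int)

print(age_group.tolist())
```

Because the bins are right-closed, a value of exactly 40 falls in group 1 and exactly 50 in group 2, as in the original conditions.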

In [None]:
df_cancer_data['povertyPercent'].describe()

In [None]:
df_cancer_data['isPoor'] = np.where(df_cancer_data['povertyPercent'] <= 15.90, 0 ,1)
df_cancer_data['isPoor'].value_counts()
# remove column povertyPercent;

Poverty percent is an important factor when estimating the cancer death rate. <br />
Creating a binary variable that can be used in linear regression is valuable.<br /> 
- Created a new binary column isPoor based on povertyPercent. 
- If povertyPercent is at or below 15.90 (the cutoff used in the code above), isPoor is 0; otherwise it is 1.

In [None]:
df_cancer_data[['PctEmployed16_Over','PctPrivateCoverageAlone','PctSomeCol18_24']].describe()

In [None]:
df_cancer_data['PctEmployed16_Over'] = pd.to_numeric(df_cancer_data['PctEmployed16_Over'], errors='coerce').fillna(54.50)
df_cancer_data['PctPrivateCoverageAlone'] = pd.to_numeric(df_cancer_data['PctPrivateCoverageAlone'], errors='coerce').fillna(48.70)
df_cancer_data['PctSomeCol18_24'] = pd.to_numeric(df_cancer_data['PctSomeCol18_24'], errors='coerce').fillna(40.40)

In [None]:
df_cancer_data[['PctEmployed16_Over','PctPrivateCoverageAlone','PctSomeCol18_24']].describe()

We fill null values with either the mean or the median to avoid introducing skew. <br />
Here we replace the null values in each column with that column's median (50%) value. <br />
After replacing the values, the overall description changes very little, which is a good sign that our dataset is still close to the original.
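
Instead of hard-coding the fill values, the medians can be computed from the data itself. This is a sketch under the assumption of a small hypothetical frame; the column names are reused from above only for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps, standing in for the columns above
df = pd.DataFrame({
    "PctEmployed16_Over": [51.9, np.nan, 54.5, 60.1],
    "PctSomeCol18_24": [np.nan, 40.4, 38.0, np.nan],
})

# Compute each column's median from the observed values, then fill the gaps
medians = df.median()
df_filled = df.fillna(medians)

print(medians)
print(df_filled.isnull().sum())
```

Computing the medians this way keeps the fill values in sync with the data if the CSV changes.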

### Creating Dummy columns for MedianAge

In [None]:
# create dummy columns
dummies = pd.get_dummies(df_cancer_data.MedianAge, prefix='MedianAge').iloc[:, 0:]
# concate dummy columns with main dataframe df
df_cancer_data = pd.concat([df_cancer_data, dummies], axis=1)
# df=df.drop('MedianAge', axis=1)

### StandardScaler standardisation and MinMaxScaler normalisation

In [None]:
std_val = df_cancer_data[['TARGET_deathRate','incidenceRate','medIncome','PctHS25_Over','PctEmployed16_Over','povertyPercent','PctUnemployed16_Over','PctPrivateCoverage','PctPrivateCoverageAlone','PctPublicCoverage','PctPublicCoverageAlone','MedianAge_1','MedianAge_2','MedianAge_3','isPoor']]

std_scale = StandardScaler().fit(std_val)
df_std = std_scale.transform(std_val)
print(std_scale)
print(df_std)

In [None]:
from sklearn.preprocessing import MinMaxScaler
minmax_val = df_cancer_data[['TARGET_deathRate','incidenceRate','medIncome','PctHS25_Over','PctEmployed16_Over','povertyPercent','PctUnemployed16_Over','PctPrivateCoverage','PctPrivateCoverageAlone','PctPublicCoverage','PctPublicCoverageAlone','MedianAge_1','MedianAge_2','MedianAge_3','isPoor']]
minmax_scale = MinMaxScaler().fit(minmax_val)
print(minmax_scale)
df_minmax = minmax_scale.transform(minmax_val)
print(df_minmax)

In [None]:
from sklearn import preprocessing

for f in df_cancer_data.columns:
    if df_cancer_data[f].dtype=='object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(np.unique(list(df_cancer_data[f].values) ))
        df_cancer_data[f] = lbl.transform(list(df_cancer_data[f].values))
        

In [None]:
df_cancer_data

### Remove outliers

In [None]:
# Calculating Z-Score 
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(df_cancer_data[['TARGET_deathRate','incidenceRate','medIncome','PctHS25_Over','PctEmployed16_Over','povertyPercent','PctBachDeg25_Over','PctUnemployed16_Over','PctPrivateCoverage','PctPrivateCoverageAlone','PctPublicCoverage','PctPublicCoverageAlone','MedianAge_1','MedianAge_2','MedianAge_3']]))
print(z)

In [None]:
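def remove_outlier(df_cancer_data, col_name):
    # Skip the encoded text columns, PctSomeCol18_24 and the dummy columns
    if ((col_name!='binnedInc')&(col_name!='Geography')&(col_name!='PctSomeCol18_24')&(col_name!='MedianAge_1')&(col_name!='MedianAge_2')&(col_name!='MedianAge_3')):
        # Note: these are the 10th and 90th percentiles, not the usual quartiles,
        # so only extreme outliers fall outside the fences below
        q1 = df_cancer_data[col_name].quantile(0.10)
        q3 = df_cancer_data[col_name].quantile(0.90)

        iqr = q3 - q1
        lower_bound  = q1 - (1.5  * iqr)
        upper_bound = q3 + (1.5 * iqr)

        # Rows whose value lies strictly inside the fences
        out_df=df_cancer_data.loc[(df_cancer_data[col_name] > lower_bound) & (df_cancer_data[col_name] < upper_bound)]
        # Index alignment turns the outlier rows into NaN in the original column
        df_cancer_data[col_name] = out_df[col_name]
        # The NaN values are then replaced by the mean of the remaining values
        df_cancer_data[col_name] = pd.to_numeric(df_cancer_data[col_name], errors='coerce').fillna(df_cancer_data[col_name].mean())
        return out_df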

In [None]:
for column in df_cancer_data:
     remove_outlier(df_cancer_data,df_cancer_data[column].name)
#     print(df[column].name) 

In [None]:
df_cancer_data.info()

### Is the relationship significant?

In [None]:
df_cancer_data.corr()

In [None]:
plt.figure(figsize=(20,12))
sns.heatmap(data=df_cancer_data.iloc[:,2:].corr(),annot=True,fmt='.2f',cmap='coolwarm')
plt.show()

In [None]:
x_val = df_cancer_data[['TARGET_deathRate','incidenceRate','medIncome','povertyPercent','PctHS25_Over','PctEmployed16_Over','PctUnemployed16_Over','PctPrivateCoverage','PctPrivateCoverageAlone','PctPublicCoverage','PctPublicCoverageAlone']]

In [None]:
plt.figure(figsize=(10,20))
sns.heatmap(df_cancer_data.corr(), annot=True)
plt.show()

In [None]:
x1_val = df_cancer_data[['TARGET_deathRate','incidenceRate','medIncome','PctHS25_Over','PctEmployed16_Over','povertyPercent','PctBachDeg25_Over']]
x2_val = df_cancer_data[['TARGET_deathRate','PctUnemployed16_Over','PctPrivateCoverage','PctPrivateCoverageAlone','PctPublicCoverage','PctPublicCoverageAlone']]
sns.set(style="ticks")
sns.pairplot(x1_val,palette="husl")
plt.show()

In [None]:
sns.set(style="ticks")
sns.pairplot(x2_val)
plt.show()

Dependent variable: TARGET_deathRate <br />

Our dataset has 35 original characteristics, 10 of which show a moderate correlation (|r| between 0.33 and 0.45) with our dependent variable. We will consider them as predictors for our analysis.<br />

Independent variables: ['incidenceRate','medIncome','povertyPercent','PctHS25_Over',
'PctEmployed16_Over','PctUnemployed16_Over','PctPrivateCoverage','PctPrivateCoverageAlone',
'PctPublicCoverage','PctPublicCoverageAlone']

| Predictor Name                | Correlation         | Nature of relation       |
|-------------------------------|---------------------|--------------------------|
| incidenceRate                 |  0.45               | Positively Correlated    |
| medIncome                     | -0.43               | Negatively Correlated    |
| povertyPercent                |  0.43               | Positively Correlated    |
| PctHS25_Over                  |  0.4                | Positively Correlated    |
| PctEmployed16_Over            | -0.4                | Negatively Correlated    |
| PctUnemployed16_Over          |  0.38               | Positively Correlated    |
| PctPrivateCoverage            | -0.39               | Negatively Correlated    |
| PctPrivateCoverageAlone       | -0.33               | Negatively Correlated    |
| PctPublicCoverage             |  0.4                | Positively Correlated    |
| PctPublicCoverageAlone        |  0.45               | Positively Correlated    |
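
A table like the one above can be produced programmatically by ranking the correlations with the target. The following is a minimal sketch on synthetic data; the column names and the 0.3 cutoff are assumptions chosen for illustration, not values taken from the analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in data; the column names are hypothetical
driver = rng.normal(size=n)
df = pd.DataFrame({
    "target": driver + rng.normal(scale=1.0, size=n),
    "strong_pos": driver + rng.normal(scale=0.5, size=n),
    "strong_neg": -driver + rng.normal(scale=0.5, size=n),
    "noise": rng.normal(size=n),
})

# Correlation of every column with the target, excluding the target itself
corr_with_target = df.corr()["target"].drop("target")

# Keep predictors whose absolute correlation reaches the assumed 0.3 cutoff
threshold = 0.3
selected = corr_with_target[corr_with_target.abs() >= threshold]
print(selected.sort_values(key=np.abs, ascending=False))
```

Dropping the target from the result avoids listing it among its own predictors.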


## Linear Regression

In [None]:
y= df_cancer_data['TARGET_deathRate']
x1 = df_cancer_data[['incidenceRate','PctHS25_Over','PctUnemployed16_Over','PctPrivateCoverage','MedianAge_3','isPoor']]

In [None]:
# Add a constant (intercept) column to the design matrix
x=sm.add_constant(x1,prepend=True)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x,y,test_size=0.5)

In [None]:
# results = sm.OLS(y_train.apply(pd.to_numeric),X_train.apply(pd.to_numeric)).fit()
results = sm.OLS(y_train,X_train).fit()

In [None]:
results.summary()

In [None]:
predictions=results.predict(X_test)
predictions

In [None]:
y_test

Plotting Y-Test vs Predictions

In [None]:
fig,ax = plt.subplots()
ax.scatter(y_test, predictions)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
fig.show()

- At both the low and the high end of the X-axis, the points lie away from the dashed reference line.
- The plot shows that there is variance between y_test and the predictions.

#### Calculating Evaluation Metric - RMSE

In [None]:
# The lower the RMSE, the better the model
from sklearn.metrics import accuracy_score,mean_squared_error
crossVal_Model1=np.sqrt(mean_squared_error(y_test,predictions))
crossVal_Model1
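
For reference, RMSE is the square root of the mean squared error. The sketch below computes it by hand with NumPy on hypothetical values, which is what `np.sqrt(mean_squared_error(...))` evaluates in the cell above.

```python
import numpy as np

# Hypothetical observed and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

# RMSE: square the errors, average them, then take the square root
errors = y_true - y_hat
rmse = np.sqrt(np.mean(errors ** 2))
print(round(float(rmse), 4))
```

RMSE is expressed in the same units as the target, so here it can be read directly as a typical error in deaths per 100,000.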

### Is the model good enough?

In [None]:
import seaborn as sns
colors = np.array('b g r c m y'.split()) # Candidate colors for plotting

fig,axes = plt.subplots(nrows =2,ncols=2, sharey=True,figsize = (15,10))
plt.tight_layout()
# Plot every predictor except the last three columns of x1
for j, col in enumerate(x1.columns[:-3]):
    row, k = divmod(j, 2)  # fill the 2x2 grid row by row
    # histplot replaces the deprecated distplot(kde=False)
    sns.histplot(x1[col], kde=False, edgecolor="w", linewidth=2,
                 color=np.random.choice(colors), ax=axes[row][k])
# The y-axis is shared, so this limit applies to every panel
axes[0][0].set_ylim(0,200)

The plotted predictors are approximately normally distributed, although incidenceRate and PctUnemployed16_Over are skewed to the left.

### Are residuals normally distributed?
In regression analysis, the difference between the observed value of the dependent variable (i.e. TARGET_deathRate) and the predicted value (predictions) is called the residual.

In [None]:
# histogram superimposed by normal curve
plt.figure(figsize=(10,6))
import scipy.stats as stats
mu = np.mean(results.resid)
sigma = np.std(results.resid)
pdf = stats.norm.pdf(sorted(results.resid), mu, sigma)
plt.hist(results.resid, bins=100, density=True)
plt.plot(sorted(results.resid), pdf, color='r', linewidth=2)
plt.show()

The figure above is a histogram with a normal curve superimposed.
- The distribution of the residuals does not adhere perfectly to a normal distribution (skew=0, excess kurtosis=0).
- There is a small number of outliers to the left, the tails appear slightly fatter than those of a normal distribution, and the distribution has moderate kurtosis (i.e. 5.81).
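
Skewness and excess kurtosis can be checked numerically as well as visually. The following sketch computes both from standardised moments on hypothetical samples (a normal sample and a fat-tailed Student-t sample); it does not use the residuals of the model above.

```python
import numpy as np

def skew_and_excess_kurtosis(x):
    """Sample skewness and excess kurtosis from standardised moments."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3), np.mean(z ** 4) - 3.0

rng = np.random.default_rng(42)
normal_resid = rng.normal(size=20000)                 # reference: both values near 0
fat_tailed_resid = rng.standard_t(df=5, size=20000)   # hypothetical fat-tailed residuals

for name, resid in [("normal", normal_resid), ("t(5)", fat_tailed_resid)]:
    s, k = skew_and_excess_kurtosis(resid)
    print(f"{name:>7}: skew={s:.2f}, excess kurtosis={k:.2f}")
```

A clearly positive excess kurtosis, as in the Student-t sample, is the numerical counterpart of the fatter tails seen in the histogram.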

In [None]:
# QQplot
fig, [ax1, ax2] = plt.subplots(1,2, figsize=(15,5))
sm.qqplot(results.resid, stats.t, fit=True, line='45', ax = ax1)
ax1.set_title("t distribution")
sm.qqplot(results.resid, stats.norm, fit=True, line='45', ax=ax2)
ax2.set_title("normal distribution")
plt.show()

- The qqplots confirm that the residuals adhere more closely to the t- than normal distribution (fatter tails). <br />
- A few prominent outliers are visible at the lower and upper extreme. <br />
- All-in-all, despite these imperfections, I consider the distribution of residuals to be adequate. <br />
- However, we should investigate the nature of the more extreme outliers. <br />
- We also may want to try to add additional information or to change the predictors (We will be doing it in Model 3).

### Are any model assumptions violated?

#### Homoscedasticity & Heteroscedasticity
Homoscedasticity means that the variance of the residuals around the regression line is the same for all values of the predictor variables; heteroscedasticity means that this variance changes.

In [None]:
# plot predicted vs actual
plt.figure(figsize=(14,7))
sns.regplot(x=y_train, y=results.fittedvalues, line_kws={'color':'r', 'alpha':0.3, 
                                              'linestyle':'--', 'linewidth':2}, 
            scatter_kws={'alpha':0.5})
plt.ylim(0,300)
plt.xlabel('Actual Values')
plt.ylabel('Fitted Values')
plt.show()
print("Pearson R: ", stats.pearsonr(results.fittedvalues, y_train))

- From the plot we can infer that the fitted values and the actual values are linearly related.
- The fitted values stick to the regression line, so no assumption appears to be violated.
- At both the low and the high end of the X-axis, the points are all very near the regression line.
- Consistent with our reported R^2 value, the plot visualizes the strong correlation between actual and predicted values.
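
A visual check of constant variance can be complemented with a numeric one. The sketch below is an assumption-laden illustration, not the method used in this notebook: it regresses squared residuals on a single hypothetical predictor and reports n times R^2 (the Koenker form of the Breusch-Pagan statistic), using synthetic residuals.

```python
import numpy as np

def lm_statistic(X, resid):
    """n * R^2 from regressing squared residuals on X (with an intercept)."""
    Z = np.column_stack([np.ones(len(X)), X])
    u2 = resid ** 2
    beta, *_ = np.linalg.lstsq(Z, u2, rcond=None)
    fitted = Z @ beta
    r2 = 1.0 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    return len(u2) * r2

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=2000)

# Hypothetical residuals: constant spread vs. spread that grows with x
resid_homo = rng.normal(scale=2.0, size=x.size)
resid_hetero = rng.normal(scale=1.0, size=x.size) * x

lm_homo = lm_statistic(x, resid_homo)
lm_hetero = lm_statistic(x, resid_hetero)
# With one predictor the statistic is compared against chi-square(1);
# 3.84 is its 5% critical value
print(lm_homo, lm_hetero)
```

A statistic far above the critical value, as for the second set of residuals, points to heteroscedasticity.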

#### Plotting actual values versus residuals

In [None]:
from statsmodels.nonparametric.smoothers_lowess import lowess
ys = lowess(results.resid.values, y_train, frac=0.2)
ys = pd.DataFrame(ys, index=range(len(ys)), columns=['a', 'b'])
ys = ys.sort_values(by='a')

fig, ax = plt.subplots(figsize=(14,9))
plt.scatter(y_train, results.resid, alpha=0.5, s=25)
plt.axhline(y=0, color='r', linestyle="--", alpha=0.5)
plt.xlabel("Actual Values")
plt.ylabel("Residuals")

plt.plot(ys.a, ys.b, c='green', linewidth=2, label="Lowess")
plt.legend()
plt.show()
print("Pearson R:", stats.pearsonr(y_train, results.resid))

- A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis (here, the actual values are used on the horizontal axis).
- The chart suggests that the values are scattered randomly, so a linear model is appropriate for the data.

### Cross-Validation

In [None]:
modelL1 = LinearRegression()
model = modelL1.fit(X_train,y_train)

In [None]:
predictions_linear=model.predict(X_test)
predictions_linear

#### Comparing the output of both models for the first 5 values

In [None]:
print ('Actual Predictions :' )
print(predictions[0:5])
print ('Cross-Validation Predictions :')
print(predictions_linear[0:5])

- From both results we can infer that the predictions of our model are correct.

In [None]:
# R-squared value calculated on each of 5 folds
scores_train = cross_val_score(model, X_train,y_train, cv=5)
print(scores_train)
print("R^2:  %0.2f (+/- %0.2f)" % (scores_train.mean(),scores_train.std()*2))

In [None]:
scores_test = cross_val_score(model, X_test,y_test, cv=5)
print(scores_test)
print("R^2:  %0.2f (+/- %0.2f)" % (scores_test.mean(),scores_test.std()*2))
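
By default `cross_val_score` reports R^2 for a regressor. To cross-validate on RMSE instead, a scoring string can be passed. The following is a minimal sketch on synthetic data (not the cancer dataset); the shuffled `KFold` and its seed are assumptions made for reproducibility.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
# Synthetic regression data (hypothetical, not the cancer dataset)
X = rng.normal(size=(300, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = LinearRegression()

# Default scoring for a regressor is R^2
r2_scores = cross_val_score(model, X, y, cv=cv)
# scikit-learn maximises scores, so RMSE is exposed with a negative sign
rmse_scores = -cross_val_score(model, X, y, cv=cv,
                               scoring="neg_root_mean_squared_error")

print("R^2 : %0.2f (+/- %0.2f)" % (r2_scores.mean(), r2_scores.std() * 2))
print("RMSE: %0.2f (+/- %0.2f)" % (rmse_scores.mean(), rmse_scores.std() * 2))
```

Reporting both makes the cross-validated numbers directly comparable with the hold-out RMSE computed earlier.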

### Multicollinearity

In [None]:
plt.figure(figsize=(10,6))
print(sns.heatmap(x1.corr(), annot=True,cmap='coolwarm'))
plt.show()

- Only PctPrivateCoverage has a high correlation with PctUnemployed16_Over and isPoor.

Let's apply variance inflation factors (VIFs) to assess multicollinearity. A VIF is obtained by regressing one independent variable on the design matrix comprising all the other independent variables, which allows us to assess the degree to which that variable is orthogonal to the others. 
Larger VIFs indicate multicollinearity. 



In [None]:
pd.DataFrame([[var, variance_inflation_factor(x.values, x.columns.get_loc(var))] for var in x.columns],
                   index=range(x.shape[1]), columns=['Variable', 'VIF'])

Variance Inflation Factor (VIF): the variance inflation factor of a predictor in a linear regression is defined as VIF = 1/T, where T = 1 - R^2 is the tolerance and R^2 comes from regressing that predictor on all the others. A VIF > 10 indicates that multicollinearity may be present; with VIF > 100 there is certainly multicollinearity among the variables. 
In the above model, some multicollinearity is present.

Nevertheless, all the predictors have an acceptable VIF.
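
To make the definition concrete, the sketch below computes VIF = 1 / (1 - R^2) by hand with NumPy on synthetic columns. It is an illustration only: unlike the cell above, it adds its own intercept to each auxiliary regression, and the data are hypothetical.

```python
import numpy as np

def vif(X):
    """VIF of each column of X: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
c = a + 0.1 * rng.normal(size=1000)   # nearly a copy of a

vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

The independent column `b` gets a VIF close to 1, while the nearly duplicated columns `a` and `c` get large values.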

- In the backward selection method, we check which predictors perform well, remove the predictors that have high p-values, and compare the output. 
- Here our target variable is 'isPoor'.
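
The backward selection loop can be sketched as follows. This is a simplified illustration under stated assumptions: it uses ordinary least squares with a continuous synthetic target (not 'isPoor'), and it drops predictors by the |t| < 1.96 rule, which for large samples roughly corresponds to a p-value above 0.05. The function name and threshold are hypothetical.

```python
import numpy as np

def backward_eliminate(X, y, names, t_min=1.96):
    """Drop the weakest predictor until every remaining |t| is at least t_min."""
    names = list(names)
    X = np.asarray(X, dtype=float)
    while names:
        Z = np.column_stack([np.ones(len(X)), X])        # intercept + predictors
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        sigma2 = resid @ resid / (len(y) - Z.shape[1])   # residual variance
        se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Z.T @ Z)))
        t = np.abs(beta / se)[1:]                        # skip the intercept
        worst = int(np.argmin(t))
        if t[worst] >= t_min:
            break
        names.pop(worst)
        X = np.delete(X, worst, axis=1)
    return names

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 4))
# Only the first two hypothetical predictors actually drive y
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=400)
kept = backward_eliminate(X, y, ["x0", "x1", "x2", "x3"])
print(kept)
```

Refitting after every removal matters, because the t-statistics of the remaining predictors change once a correlated predictor is dropped.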

## Regularization 

- Regularization is a method for "constraining" or "regularizing" the size of the coefficients, thus "shrinking" them towards zero.
- It reduces model variance which minimizes overfitting.

For a regularized linear regression model, we minimize the sum of RSS and a "penalty term" that penalizes coefficient size.
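
To illustrate the shrinking effect, the sketch below solves ridge regression in closed form with NumPy for increasing penalty strengths. It is a simplified version under stated assumptions: synthetic data, no intercept, and a hypothetical helper name; scikit-learn's `Ridge` is what the notebook itself uses.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution without intercept: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([4.0, -3.0, 2.0]) + rng.normal(size=100)

norms = []
for alpha in [0.0, 10.0, 100.0, 1000.0]:
    coef = ridge_coefficients(X, y, alpha)
    norms.append(float(np.linalg.norm(coef)))
    print(alpha, np.round(coef, 3), round(norms[-1], 3))
```

With `alpha=0` the result is the ordinary least squares fit; as `alpha` grows, the coefficient norm shrinks towards zero.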

### Ridge Regression

In [None]:
df_norm=df_cancer_data

In [None]:
df_norm=df_norm.drop('binnedInc', axis=1)
df_norm=df_norm.drop('Geography', axis=1)

In [None]:
# Ridge regression with an alpha of 0.5
y_norm=df_norm['TARGET_deathRate']
# x_norm = df_norm[['TARGET_deathRate','incidenceRate','medIncome','povertyPercent','PctHS25_Over','PctEmployed16_Over','PctUnemployed16_Over','PctPrivateCoverage','PctPrivateCoverageAlone','PctPublicCoverage','PctPublicCoverageAlone']]
x_norm = df_norm[['incidenceRate','PctHS25_Over','PctUnemployed16_Over','PctPrivateCoverage','MedianAge_3','isPoor']]

X_train, X_test, y_train, y_test = train_test_split(x_norm,y_norm,test_size=0.2)
ridge = Ridge(fit_intercept=True, alpha=0.5)
ridge.fit(X_train,y_train)

In [None]:
y_pred = ridge.predict(X_test)
plt.figure(figsize=(14,7))
plt.scatter(y_test, y_pred)
plt.xlabel("TARGET_deathRate: $Y_i$")
plt.ylabel(r"Predicted TARGET_deathRate: $\hat{y}_i$")
plt.title(r"Ridge Regression - TARGET_deathRate vs Predicted TARGET_deathRate: $Y_i$ vs $\hat{y}_i$")

#### Calculating Evaluation Metric - RMSE

In [None]:
rmse_norm = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
rmse_norm

In [None]:
linreg = LinearRegression()
# Train the model using the training sets
linreg.fit(x_norm,y_norm)

In [None]:
X_kf=np.array(x_norm)
y_kf=np.array(y_norm)
# Define the split - into 5 folds 
kf=KFold(n_splits=5,shuffle=False, random_state=None) 
#Returns the number of splitting iterations in the cross-validator
kf.get_n_splits(X_kf)

In [None]:
for train_index, test_index in kf.split(X_kf):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X_kf[train_index], X_kf[test_index]

In [None]:
scores = []
for train, test in kf.split(X_kf):
    linreg.fit(X_kf[train],y_kf[train])
    scores.append(np.sqrt(metrics.mean_squared_error(y_kf[test], linreg.predict(X_kf[test]))))
# scores
np.around(scores, decimals=2)

In [None]:
print('Mean           : ',round(np.mean(scores),2))
print('Median         : ',round(np.median(scores),2))
print('Std. Deviation : ',round(np.std(scores),2))

In [None]:
def show_stats(m, ncv, cv):
    print('Method: %s' %m)
    print('RMSE on no CV training: %.3f' %ncv)
    print('RMSE on 5-fold CV: %.3f' %cv)

In [None]:
show_stats('Simple Linear Regression',rmse_norm ,np.mean(scores))

In [None]:
print('Ridge Regression')
print('alpha\t RMSE_train\t RMSE_cv\n')
alpha = np.linspace(.01,20,50)
t_rmse = np.array([])
cv_rmse = np.array([])

for a in alpha:
    ridge = Ridge(fit_intercept=True, alpha=a)  
    # computing the RMSE on training data
    ridge.fit(X_kf,y_kf)
    y_pred = ridge.predict(X_kf)
    err = y_pred-y_kf    
    # Dot product of error vector with itself gives us the sum of squared errors
    total_error = np.dot(err,err)
    rmse_train = np.sqrt(total_error/len(y_pred))

    # computing RMSE using 5-fold cross validation
    kf = KFold(n_splits=5)
    xval_err = 0
    for train, test in kf.split(X_kf):
        ridge.fit(X_kf[train], y_kf[train])
        y_pred = ridge.predict(X_kf[test])
        err = y_pred - y_kf[test]
        xval_err += np.dot(err,err)
    rmse_cv = np.sqrt(xval_err/len(X_kf))
    
    t_rmse = np.append(t_rmse, [rmse_train])
    cv_rmse = np.append(cv_rmse, [rmse_cv])
    print('{:.3f}\t {:.4f}\t\t {:.4f}'.format(a,rmse_train,rmse_cv))

In [None]:
plt.figure(figsize=(14,7))
pl.plot(alpha, t_rmse, label='RMSE-Train')
pl.plot(alpha, cv_rmse, label='RMSE_Cross_Val')
pl.legend( ('Ridge RMSE-Train', 'Ridge RMSE_Cross_Val') )
pl.ylabel('RMSE')
pl.xlabel('Alpha')
pl.show()

In Ridge Regression:
The comparison between RMSE_train and RMSE_cv shows that the root mean square errors for the training and cross-validation values are very close.
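
The manual alpha sweep can also be delegated to scikit-learn's `RidgeCV`, which selects the penalty by cross-validation. This is a minimal sketch on synthetic data, reusing the same alpha grid as the loop above; the data and coefficients are hypothetical.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
# Synthetic regression data (hypothetical, not the cancer dataset)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.5, 0.0]) + rng.normal(size=200)

alphas = np.linspace(0.01, 20, 50)   # same grid as the loop above
# cv=5 requests 5-fold cross-validation to rank the candidate alphas
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("best alpha:", ridge_cv.alpha_)
print("coefficients:", np.round(ridge_cv.coef_, 3))
```

The selected value is available as `alpha_` and the refitted coefficients as `coef_`.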

## Implementing H2O

#### Initialising H2O

In [None]:
pct_memory=0.95
virtual_memory=psutil.virtual_memory()
print("Virtual Memory Size: ",virtual_memory)
min_mem_size=int(round(int(pct_memory*virtual_memory.available)/1073741824,0))
print("Minimum Memory Size: ",min_mem_size)

#### Importing Dataset into H2O frame

In [None]:
data_path=None
all_variables=None
test_path=None
# target='search_term'
target=None
nthreads=1 
min_mem_size= min_mem_size 
# run_time=4000
classification=False
scale=False
max_models=None    
model_path=None
balance_y=False 
balance_threshold=0.2
name=None 
server_path=None  
analysis=2

In [None]:
# Functions

def alphabet(n):
  # Build a random alphanumeric string of length n
  alpha='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'    
  return ''.join(random.choice(alpha) for _ in range(n))
  
  
def set_meta_data(analysis,run_id,server,data,test,model_path,target,run_time,classification,scale,model,balance,balance_threshold,name,path,nthreads,min_mem_size):
  m_data={}
  m_data['start_time'] = time.time()
  m_data['target']=target
#   m_data['predictors']=predictors
  m_data['server_path']=server
  m_data['data_path']=data 
  m_data['test_path']=test
  m_data['max_models']=model
  m_data['run_time']=run_time
  m_data['run_id'] =run_id
  m_data['scale']=scale
  m_data['classification']=classification
  m_data['model_path']=model_path
  m_data['balance']=balance
  m_data['balance_threshold']=balance_threshold
  m_data['project'] =name
  m_data['end_time'] = time.time()
  m_data['execution_time'] = 0.0
  m_data['run_path'] =path
  m_data['nthreads'] = nthreads
  m_data['min_mem_size'] = min_mem_size
  m_data['analysis'] = analysis
  m_data['Main_Eval_metrix'] = "RMSE"
  return m_data


def dict_to_json(dct,n):
  # Serialise the dictionary as indented JSON and write it to file n
  j = json.dumps(dct, indent=4)
  with open(n, 'w') as f:
    print(j, file=f)
  
  
def stackedensemble(mod):
    coef_norm=None
    try:
      metalearner = h2o.get_model(mod.metalearner()['name'])
      coef_norm=metalearner.coef_norm()
    except:
      pass        
    return coef_norm

def stackedensemble_df(df):
    bm_algo={ 'GBM': None,'GLM': None,'DRF': None,'XRT': None,'Dee': None}
    for index, row in df.iterrows():
      if len(row['model_id'])>3:
        key=row['model_id'][0:3]
        if key in bm_algo:
          if bm_algo[key] is None:
                bm_algo[key]=row['model_id']
    bm=list(bm_algo.values()) 
    bm=[m for m in bm if m is not None]
    return bm

def se_stats(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['auc']=modl.auc()   
    d['roc']=modl.roc()
    d['mse']=modl.mse()   
    d['null_degrees_of_freedom']=modl.null_degrees_of_freedom()
    d['null_deviance']=modl.null_deviance()
    d['residual_degrees_of_freedom']=modl.residual_degrees_of_freedom()   
    d['residual_deviance']=modl.residual_deviance()
    d['rmse']=modl.rmse()
    return d

def get_model_by_algo(algo,models_dict):
    mod=None
    mod_id=None    
    for m in list(models_dict.keys()):
        if m[0:3]==algo:
            mod_id=m
            mod=h2o.get_model(m)      
    return mod,mod_id     
    
    
def gbm_stats(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['varimp']=modl.varimp()  
    return d
    
    
def dl_stats(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['varimp']=modl.varimp()  
    return d
    
    
def drf_stats(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['varimp']=modl.varimp()  
    d['roc']=modl.roc()      
    return d
    
def xrt_stats(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['varimp']=modl.varimp()  
    d['roc']=modl.roc()      
    return d
    
    
def glm_stats(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['coef']=modl.coef()  
    d['coef_norm']=modl.coef_norm()      
    return d
    
def model_performance_stats(perf):
    d={}
    try:    
      d['mse']=perf.mse()
    except:
      pass      
    try:    
      d['rmse']=perf.rmse() 
    except:
      pass      
    try:    
      d['null_degrees_of_freedom']=perf.null_degrees_of_freedom()
    except:
      pass      
    try:    
      d['residual_degrees_of_freedom']=perf.residual_degrees_of_freedom()
    except:
      pass      
    try:    
      d['residual_deviance']=perf.residual_deviance() 
    except:
      pass      
    try:    
      d['null_deviance']=perf.null_deviance() 
    except:
      pass      
    try:    
      d['aic']=perf.aic() 
    except:
      pass      
    try:
      d['logloss']=perf.logloss() 
    except:
      pass    
    try:
      d['auc']=perf.auc()
    except:
      pass  
    try:
      d['gini']=perf.gini()
    except:
      pass    
    return d
    
def impute_missing_values(df, x, scal=False):
    # determine column types
    ints, reals, enums = [], [], []
    for key, val in df.types.items():
        if key in x:
            if val == 'enum':
                enums.append(key)
            elif val == 'int':
                ints.append(key)            
            else: 
                reals.append(key)    
    _ = df[reals].impute(method='mean')
    _ = df[ints].impute(method='median')
    if scal:
        df[reals] = df[reals].scale()
        df[ints] = df[ints].scale()    
    return


def get_independent_variables(df, targ):
    C = [name for name in df.columns if name != targ]
    # determine column types
    ints, reals, enums = [], [], []
    for key, val in df.types.items():
        if key in C:
            if val == 'enum':
                enums.append(key)
            elif val == 'int':
                ints.append(key)            
            else: 
                reals.append(key)    
    x=ints+enums+reals
    return x
    
def get_all_variables_csv(i):
    ivd={}
    try:
      iv = pd.read_csv(i,header=None)
    except:
      sys.exit(1)    
    col=iv.values.tolist()[0]
    dt=iv.values.tolist()[1]
    i=0
    for c in col:
      ivd[c.strip()]=dt[i].strip()
      i+=1        
    return ivd
    
    

def check_all_variables(df,dct,y=None):     
    targ=list(dct.keys())     
    for key, val in df.types.items():
        if key in targ:
          if dct[key] not in ['real','int','enum']:                      
            targ.remove(key)  
    for key, val in df.types.items():
        if key in targ:            
          if dct[key] != val:
            print('convert ',key,' ',dct[key],' ',val)
            if dct[key]=='enum':
                try:
                  df[key] = df[key].asfactor() 
                except:
                  targ.remove(key)                 
            if dct[key]=='int': 
                try:                
                  df[key] = df[key].asnumeric() 
                except:
                  targ.remove(key)                  
            if dct[key]=='real':
                try:                
                  df[key] = df[key].asnumeric()  
                except:
                  targ.remove(key)                  
    if y is None:
      y=df.columns[-1] 
    if y in targ:
      targ.remove(y)
    else:
      y=targ.pop()            
    return targ    
    
def predictions(mod,data,run_id):
    # Score the model passed in as `mod` on a test file and save the results
    test = h2o.import_file(data)
    mod_perf=mod.model_performance(test)
              
    stats_test=model_performance_stats(mod_perf)

    n=run_id+'_test_stats.json'
    dict_to_json(stats_test,n) 

    # A confusion matrix only exists for classification models
    try:    
      cf=mod_perf.confusion_matrix(metrics=["f1","f2","f0point5","accuracy","precision","recall","specificity","absolute_mcc","min_per_class_accuracy","mean_per_class_accuracy"])
      cf_df=cf[0].table.as_data_frame()
      cf_df.to_csv(run_id+'_test_confusion_matrix.csv')
    except:
      pass

    preds = mod.predict(test)
    predictions_df=test.cbind(preds).as_data_frame() 
    predictions_df.to_csv(run_id+'_predictions.csv')
    return

def predictions_test(mod,test,run_id):
    # Same as predictions(), but `test` is an already loaded H2OFrame
    mod_perf=mod.model_performance(test)          
    stats_test=model_performance_stats(mod_perf)
    n=run_id+'_test_stats.json'
    dict_to_json(stats_test,n) 
    # A confusion matrix only exists for classification models
    try:
      cf=mod_perf.confusion_matrix()
#      cf=mod_perf.confusion_matrix(metrics=["f1","f2","f0point5","accuracy","precision","recall","specificity","absolute_mcc","min_per_class_accuracy","mean_per_class_accuracy"])
      cf_df=cf.table.as_data_frame()
      cf_df.to_csv(run_id+'_test_confusion_matrix.csv')
    except:
      pass
    preds = mod.predict(test)    
    predictions_df=test.cbind(preds).as_data_frame() 
    predictions_df.to_csv(run_id+'_predictions.csv')
    return preds

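def check_X(x,df):
    # Build a new list instead of removing items while iterating,
    # which would skip the element that follows each removal
    return [name for name in x if name in df.columns]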
    
    
def get_stacked_ensemble(lst):
    # Prefer the BestOfFamily ensemble; fall back to the AllModels ensemble
    se=None
    for model in lst:
      if 'BestOfFamily' in model:
        se=model
    if se is None:     
      for model in lst:
        if 'AllModels' in model:
          se=model           
    return se       
    
def get_variables_types(df):
    d={}
    for key, val in df.types.items():
        d[key]=val           
    return d    

def Variable_imp_list(modl):
    d={}
    d['algo']=modl.algo
    d['model_id']=modl.model_id   
    d['varimp']=modl.varimp()  
    return d

#  End Functions
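
A note on `check_X`: filtering a list by calling `remove` inside a `for` loop over that same list can skip the element that follows each removal. The sketch below uses plain Python lists and made-up column names rather than an H2O frame, and contrasts that pattern with building a new list.

```python
# Hypothetical column names; "a_bad" and "b_bad" stand in for names missing from the frame.
requested = ["age", "a_bad", "b_bad", "income"]
available = {"age", "income"}

def filter_in_place(names, valid):
    """Removes while iterating: the item after each removal is never visited."""
    names = list(names)  # copy so the caller's list is untouched
    for name in names:
        if name not in valid:
            names.remove(name)
    return names

def filter_by_copy(names, valid):
    """Builds a new list, so every item is checked exactly once."""
    return [name for name in names if name in valid]

print(filter_in_place(requested, available))  # "b_bad" survives the filter
print(filter_by_copy(requested, available))   # only valid names remain
```

The list-comprehension form is the safer choice whenever two invalid names can sit next to each other.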

In [None]:
all_variables=None

In [None]:
# run_id="Runtime_1_333_Hyperparameter"
run_id=alphabet(9)
if server_path is None:
  server_path=os.path.abspath(os.curdir)
os.chdir(server_path) 
run_dir = os.path.join(server_path,run_id)
os.mkdir(run_id)
os.chdir(run_id)

# run_id to std_opt
print(run_id)

In [None]:
logfile=run_id+'_autoh2o_log.zip'
logs_path=os.path.join(run_dir,'logs')
print(logs_path,logfile)

In [None]:
# Connect to a cluster
port_no=random.randint(5555,55555)

#  h2o.init(strict_version_check=False,min_mem_size_GB=min_mem_size,port=port_no) # start h2o
try:
  h2o.init(strict_version_check=False,min_mem_size_GB=min_mem_size,port=port_no) # start h2o
except:
  logging.critical('h2o.init')
  h2o.download_all_logs(dirname=logs_path, filename=logfile)      
  h2o.cluster().shutdown()
  sys.exit(2)
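
Picking a port with `random.randint` can land on a port that is already in use. As a hedged alternative, the sketch below asks the operating system for a free TCP port by binding to port 0. The helper name `find_free_port` is hypothetical and not part of H2O, and the port could still be taken by another process between this check and the later `h2o.init` call.

```python
import socket

def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))           # port 0 lets the OS choose
        return sock.getsockname()[1]   # the port that was actually assigned

port_no = find_free_port()
print(port_no)
```

The resulting `port_no` could then be passed to `h2o.init(port=port_no)` in place of the random value.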

In [None]:
# H2O_df = h2o.import_file("cancer_reg.csv")
H2O_df=h2o.H2OFrame(df_cancer_data)

In [None]:
predictors=[x for x in H2O_df.columns if x not in ['TARGET_deathRate']]
Target='TARGET_deathRate'

### Model - 1 

In [None]:
run_time=333

In [None]:
meta_data = set_meta_data(analysis, run_id,server_path,data_path,test_path,model_path,target,run_time,classification,scale,max_models,balance_y,balance_threshold,name,run_dir,nthreads,min_mem_size)
print(meta_data)

In [None]:
# dependent variable
# assign target and inputs for classification or regression
if target is None:
 target=H2O_df.columns[2]   #df['target_class']
y = target

In [None]:
if all_variables is not None:
 ivd=get_all_variables_csv(all_variables)
 print(ivd)
 X=check_all_variables(H2O_df,ivd,y)
 print(X)

In [None]:
# independent variables

X = []
if all_variables is None:
 X=get_independent_variables(H2O_df, y)
 print(X)
else:
 ivd=get_all_variables_csv(all_variables)
 X=check_all_variables(H2O_df, ivd)


X=check_X(X,H2O_df)

# Add independent variables
meta_data['X']=X


# impute missing values
_=impute_missing_values(H2O_df,X, scale)
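
`impute_missing_values` fills real-valued columns with the mean and integer columns with the median. The sketch below illustrates that idea on plain Python lists with made-up values. It does not use the H2O API, and the helper name `impute` is an assumption for illustration only.

```python
from statistics import mean, median

def impute(values, method="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if method == "mean" else median(observed)
    return [fill if v is None else v for v in values]

real_col = [1.5, None, 2.5, 5.0]   # made-up real-valued column
int_col = [1, 2, None, 100]        # made-up integer column with an outlier

print(impute(real_col, "mean"))
print(impute(int_col, "median"))
```

The median is less sensitive to extreme values such as the `100` in `int_col`, which is one reason to prefer it for skewed count-like columns.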

In [None]:
if analysis == 2:
 classification=False
elif analysis == 1:
 classification=True

In [None]:
# Force target to be factors
# Only 'int' or 'string' are allowed for asfactor(), got Target (Total orders):real

if classification:
   H2O_df[y] = H2O_df[y].asfactor()

In [None]:
def check_y(y,df):
 ok=False
 C = [name for name in df.columns if name == y]
 for key, val in df.types.items():
   if key in C:
     if val in ['real','int','enum']:
       ok=True
 return ok

In [None]:
ok=check_y(y,H2O_df)
if not ok:
   print(ok)

In [None]:
# levels() only applies when the target was converted to a factor above
if classification:
   print(H2O_df[y].levels())

In [None]:
allV=get_variables_types(H2O_df)
# allV

In [None]:
meta_data['variables']=allV

In [None]:
aml = H2OAutoML(max_runtime_secs = run_time, seed=27)

In [None]:
model_start_time = time.time()
aml.train(x=predictors, y=Target, training_frame=H2O_df)
model_end_time = time.time()

In [None]:
meta_data['model_execution_time'] = model_end_time - model_start_time

In [None]:
# comma-separated predictors can be put in the metadata file
predictors_list=' ,'.join(predictors)

In [None]:
allV=get_variables_types(H2O_df)
meta_data['variables']=allV

#### Let's print the Leaderboard 

In [None]:
# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # Print all rows instead of default (10 rows)

In [None]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best=h2o.get_model(model_set[0])
meta_data['mod_best']= h2o.get_model(model_set[0])._id 
meta_data['mod_best_algo']=h2o.get_model(model_set[0]).algo

In [None]:
meta_data['mod_best_algo']

In [None]:
df_pandaFrame = lb.as_data_frame()
leaderboard_stats=run_id+'_leaderboard.json'
df_pandaFrame.to_json(leaderboard_stats)

In [None]:
count=0

for rows in df_pandaFrame.iterrows():
    
    mod_best1=h2o.get_model(model_set[count])
    count=count+1
    hy_parameter = mod_best1.params
    n= run_id + "_"+mod_best1._id + '_hy_parameter.json' 
    dict_to_json(hy_parameter,n)
#     hy_parameter.to_json(file_name)
#     print(hy_parameter)
#     output = pd.DataFrame()
#     output = output.append(hy_parameter, ignore_index=True)
    

In [None]:
best_models={}
best_models=stackedensemble(mod_best)
bm=[]
if best_models is not None: 
  if 'Intercept' in best_models.keys():
    del best_models['Intercept']
  bm=list(best_models.keys())
else:
  best_models={}
  bm=stackedensemble_df(aml_leaderboard_df)   
  for b in bm:   
    best_models[b]=None

# Make sure the leaderboard's best model is always in the list
if mod_best.model_id not in bm:
    bm.append(mod_best.model_id)
    

In [None]:
meta_data['models']=bm

In [None]:
GBM_list=['Model_name','Leaderboard_rank','learn_rate','learn_rate_annealing','max_abs_leafnode_pred','pred_noise_bandwidth','distribution','tweedie_power','quantile_alpha','huber_alpha','categorical_encoding','max_depth','sample_rate','col_sample_rate','ntrees','nfolds']
df_gbm = pd.DataFrame(columns=GBM_list)


GLM_list=['Model_name','Leaderboard_rank','nfolds','seed','tweedie_variance_power','tweedie_link_power','alpha','lambda','missing_values_handling','standardize']
df_glm = pd.DataFrame(columns=GLM_list)

# DRF and XRT are the same
DRF_list=['Model_name','Leaderboard_rank','nfolds','seed','mtries','categorical_encoding']
df_drf = pd.DataFrame(columns=DRF_list)

Deeplearn_list=['Model_name','Leaderboard_rank','balance_classes','categorical_encoding','class_sampling_factors','distribution','huber_alpha','max_after_balance_size','max_runtime_secs','missing_values_handling','model_id','quantile_alpha','seed','standardize','stopping_metric','stopping_rounds','stopping_tolerance','tweedie_power']
df_Deeplearn = pd.DataFrame(columns=Deeplearn_list)

count=0
cnt=0
# aml_leaderboard_df=aml.leaderboard.as_data_frame()
# model_set=aml_leaderboard_df['model_id']
rows_list = []
for rows in df_pandaFrame.iterrows():
    
    mod_best1=h2o.get_model(model_set[count])
    count=count+1
    hy_parameter = mod_best1.params
    
    if (mod_best1.algo == 'gbm' ):
#         print(mod_best1.varimp())
        df_gbm=df_gbm.append({'Model_name':mod_best1.model_id,
                                'Leaderboard_rank':count,
                                'RMSE_VAL':mod_best1.rmse(xval=True),
                                'learn_rate':hy_parameter['learn_rate']['actual'],
                                'max_depth':hy_parameter['max_depth']['actual'],
                                'sample_rate': hy_parameter['sample_rate']['actual'],
                                'col_sample_rate':hy_parameter['col_sample_rate']['actual'],
                                'ntrees':hy_parameter['ntrees']['actual'],
                                'nfolds':hy_parameter['nfolds']['actual'],
                                'learn_rate_annealing':hy_parameter['learn_rate_annealing']['actual'],
                                'max_abs_leafnode_pred':hy_parameter['max_abs_leafnode_pred']['actual'],
                                'pred_noise_bandwidth':hy_parameter['pred_noise_bandwidth']['actual'],
                                'distribution':hy_parameter['distribution']['actual'],
                                'tweedie_power':hy_parameter['tweedie_power']['actual'],
                                'quantile_alpha':hy_parameter['quantile_alpha']['actual'],
                                'huber_alpha':hy_parameter['huber_alpha']['actual'],
                                'categorical_encoding':hy_parameter['categorical_encoding']['actual']},
                             ignore_index=True)
     
    elif (mod_best1.algo == 'glm'):
        df_glm=df_glm.append({'Model_name':mod_best1.model_id,
                                'Leaderboard_rank':count,
                                'nfolds':hy_parameter['nfolds']['actual'],
                                'seed':hy_parameter['seed']['actual'],
                                'tweedie_variance_power':hy_parameter['tweedie_variance_power']['actual'],
                                'tweedie_link_power':hy_parameter['tweedie_link_power']['actual'],
                                'alpha':hy_parameter['alpha']['actual'],
                                'lambda':hy_parameter['lambda']['actual'],
                                'missing_values_handling':hy_parameter['missing_values_handling']['actual'],
                                'standardize':hy_parameter['standardize']['actual']},
                             ignore_index=True)
    
    elif (mod_best1.algo == 'drf'):
        df_drf=df_drf.append({'Model_name':mod_best1.model_id,
                                'Leaderboard_rank':count,
                                'nfolds':hy_parameter['nfolds']['actual'],
                                'seed':hy_parameter['seed']['actual'],
                                'mtries':hy_parameter['mtries']['actual'],
                                'categorical_encoding':hy_parameter['categorical_encoding']['actual']},
                             ignore_index=True)
        
    elif (mod_best1.algo == 'deeplearning'):
        df_Deeplearn=df_Deeplearn.append({'Model_name':mod_best1.model_id,
                                'Leaderboard_rank':count,
                                'balance_classes': hy_parameter['balance_classes']['actual'],
                                'max_after_balance_size': hy_parameter['max_after_balance_size']['actual'],
                                'class_sampling_factors':hy_parameter['class_sampling_factors']['actual'],
                                'activation':hy_parameter['activation']['actual'],
                                'hidden':hy_parameter['hidden']['actual'],
                                'epochs':hy_parameter['epochs']['actual'],
                                'train_samples_per_iteration':hy_parameter['train_samples_per_iteration']['actual'],
                                'target_ratio_comm_to_comp':hy_parameter['target_ratio_comm_to_comp']['actual'],
                                'seed':hy_parameter['seed']['actual'],
                                'adaptive_rate':hy_parameter['adaptive_rate']['actual'],
                                'rho':hy_parameter['rho']['actual'],
                                'epsilon':hy_parameter['epsilon']['actual'],
                                'rate':hy_parameter['rate']['actual'],
                                'rate_annealing':hy_parameter['rate_annealing']['actual'],
                                'rate_decay':hy_parameter['rate_decay']['actual'],
                                'momentum_start':hy_parameter['momentum_start']['actual'],
                                'momentum_ramp':hy_parameter['momentum_ramp']['actual'],
                                'momentum_stable':hy_parameter['momentum_stable']['actual'],                                          
                                'nesterov_accelerated_gradient':hy_parameter['nesterov_accelerated_gradient']['actual'],
                                'input_dropout_ratio':hy_parameter['input_dropout_ratio']['actual'],
                                'hidden_dropout_ratios':hy_parameter['hidden_dropout_ratios']['actual'],
                                'l1':hy_parameter['l1']['actual'],
                                'initial_weight_distribution':hy_parameter['initial_weight_distribution']['actual'],
                                'initial_weight_scale':hy_parameter['initial_weight_scale']['actual'],                                          
                                'l2':hy_parameter['l2']['actual'],
                                'max_w2':hy_parameter['max_w2']['actual'],
                                'loss':hy_parameter['loss']['actual'],
                                'initial_weights':hy_parameter['initial_weights']['actual'],
                                'initial_biases':hy_parameter['initial_biases']['actual'],
                                'distribution':hy_parameter['distribution']['actual'],                                          
                                'tweedie_power':hy_parameter['tweedie_power']['actual'],
                                'quantile_alpha':hy_parameter['quantile_alpha']['actual'],
                                'score_interval':hy_parameter['score_interval']['actual'],
                                'score_training_samples':hy_parameter['score_training_samples']['actual'],
                                'score_validation_samples':hy_parameter['score_validation_samples']['actual'],
                                'score_duty_cycle':hy_parameter['score_duty_cycle']['actual'],
                                'classification_stop':hy_parameter['classification_stop']['actual'],
                                'regression_stop':hy_parameter['regression_stop']['actual'],
                                'score_validation_sampling':hy_parameter['score_validation_sampling']['actual'],                                          
                                'overwrite_with_best_model':hy_parameter['overwrite_with_best_model']['actual'],
                                'use_all_factor_levels':hy_parameter['use_all_factor_levels']['actual'],
                                'standardize':hy_parameter['standardize']['actual'],
                                'fast_mode':hy_parameter['fast_mode']['actual'],
                                'variable_importances':hy_parameter['variable_importances']['actual'],
                                'force_load_balance':hy_parameter['force_load_balance']['actual'],
                                'replicate_training_data':hy_parameter['replicate_training_data']['actual'],
                                'shuffle_training_data':hy_parameter['shuffle_training_data']['actual'],
                                'missing_values_handling':hy_parameter['missing_values_handling']['actual'],
                                'sparse':hy_parameter['sparse']['actual'],
                                'col_major':hy_parameter['col_major']['actual'],
                                'average_activation':hy_parameter['average_activation']['actual'],
                                'sparsity_beta':hy_parameter['sparsity_beta']['actual'],                                          
                                'max_categorical_features':hy_parameter['max_categorical_features']['actual'],
                                'reproducible':hy_parameter['reproducible']['actual'],
                                'elastic_averaging':hy_parameter['elastic_averaging']['actual'],
                                'elastic_averaging_moving_rate':hy_parameter['elastic_averaging_moving_rate']['actual'],
                                'elastic_averaging_regularization':hy_parameter['elastic_averaging_regularization']['actual'],
                                'categorical_encoding':hy_parameter['categorical_encoding']['actual']
                                         
                                         },
                             ignore_index=True)
#     stackedensemble
#     elif (mod_best1.algo == 'stackedensemble'):
#         for models in stack_models(mod_best1):
# #             print(mod_best1)
# #             print(models)
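
The loop above grows each table with `DataFrame.append`, which was removed in pandas 2.0. A minimal sketch of an alternative is to collect one dictionary per model and build the table once at the end. The helper `actual_values` and the `example_params` dictionary are hypothetical stand-ins that mimic the `params[name]['actual']` layout used above.

```python
def actual_values(params, wanted):
    """Return {name: actual value} for each wanted hyperparameter that is present."""
    return {name: params[name]["actual"] for name in wanted if name in params}

# Hypothetical stand-in for the dictionary held by a model's `params` attribute.
example_params = {
    "ntrees": {"default": 50, "actual": 120},
    "max_depth": {"default": 5, "actual": 7},
}

rows = []
row = {"Model_name": "GBM_example", "Leaderboard_rank": 1}
# "learn_rate" is absent from example_params, so it is skipped instead of raising KeyError
row.update(actual_values(example_params, ["ntrees", "max_depth", "learn_rate"]))
rows.append(row)
print(rows)
```

After the loop, `pd.DataFrame(rows)` builds the whole table in a single call, and the long per-algorithm dictionaries shrink to a list of parameter names.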
        

In [None]:
df_gbm
# df_gbm[['Model_name','Leaderboard_rank','RMSE_VAL','learn_rate','max_depth','sample_rate','col_sample_rate','ntrees','nfolds']]

In [None]:
df_Deeplearn

In [None]:
df_glm

In [None]:
df_drf

In [None]:
# Write each algorithm's hyperparameter table to its own file
GBM_Para= run_id + "_GBM" + '_hyperparameters.json' 
df_gbm.to_json(GBM_Para)

GLM_Para= run_id + "_GLM" + '_hyperparameters.json' 
df_glm.to_json(GLM_Para)

DRF_para= run_id + "_DRF" + '_hyperparameters.json' 
df_drf.to_json(DRF_para)

Deeplearn_para= run_id + "_Deeplearn" + '_hyperparameters.json' 
df_Deeplearn.to_json(Deeplearn_para)

In [None]:
fig, (ax1, ax2,ax3,ax4,ax5) = plt.subplots(5, 1, figsize=(20,25))

ax1.plot(df_gbm['ntrees'],df_gbm['RMSE_VAL'],color='red', marker='o', linestyle='dashed',linewidth=2, markersize=12)
ax1.set_title('ntrees vs RMSE')
ax1.set_ylabel('RMSE')
ax1.set_xlabel('ntrees')

ax2.plot(df_gbm['max_depth'],df_gbm['RMSE_VAL'],color='red', marker='o', linestyle='dashed',linewidth=2, markersize=12)
ax2.set_title('max_depth vs RMSE')
ax2.set_ylabel('RMSE')
ax2.set_xlabel('max_depth')

ax3.plot(df_gbm['learn_rate'],df_gbm['RMSE_VAL'],color='blue', marker='o', linestyle='dashed',linewidth=2, markersize=12)
ax3.set_title('learn_rate vs RMSE')
ax3.set_ylabel('RMSE')
ax3.set_xlabel('learn_rate')

ax4.plot(df_gbm['sample_rate'],df_gbm['RMSE_VAL'],color='blue', marker='o', linestyle='dashed',linewidth=2, markersize=12)
ax4.set_title('sample_rate vs RMSE')
ax4.set_ylabel('RMSE')
ax4.set_xlabel('sample_rate')

ax5.plot(df_gbm['col_sample_rate'],df_gbm['RMSE_VAL'],color='blue', marker='o', linestyle='dashed',linewidth=2, markersize=12)
ax5.set_title('col_sample_rate vs RMSE')
ax5.set_ylabel('RMSE')
ax5.set_xlabel('col_sample_rate')

In [None]:
# mod_best_test=h2o.get_model(model_set[27])
# print(mod_best_test.algo)
# mod_best_test.params

In [None]:
store_val=Variable_imp_list(h2o.get_model(model_set[1]))
print(store_val)

In [None]:
df_varimp = pd.DataFrame(store_val['varimp'])
df_varimp.rename(columns={0:'Variable',1:'relative_importance',2:'scaled_importance',3:'percentage'}, inplace=True)
df_varimp.head(20)

In [None]:
df_varimp=df_varimp.sort_values(by=['relative_importance'],ascending=False)
df_varimp['Variable'].head(27)

In [None]:
VarImp_Para= run_id + '__VariableImportance.json' 
df_varimp.to_json(VarImp_Para)


#### Calculating Evaluation Metric - RMSE

In [None]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
model_set=aml_leaderboard_df['model_id']
mod_best1=h2o.get_model(model_set[0])

In [None]:
print("Best Model Name: ",mod_best.model_id)
print("RMSE of best iteration: ",mod_best.rmse())
print("RMSE on CV: ",mod_best.rmse(xval=True))
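
For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below computes it by hand on made-up death-rate values. It is only an illustration of the formula, not of how H2O computes `rmse()` internally.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values: the errors are 3, -4 and 0, so the mean squared error is 25/3
actual = [180.0, 170.0, 160.0]
predicted = [177.0, 174.0, 160.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, and the result is in the same units as the target.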

In [None]:
print(mod_best1.algo)

In [None]:
n=run_id+'_metadata.json'
dict_to_json(meta_data,n)

### Model - 2 : Increasing AutoML running time

In [None]:
aml2 = H2OAutoML(max_runtime_secs=777, seed=27)

In [None]:
model_start_time = time.time()
aml2.train(x=predictors, y=Target, training_frame=H2O_df)
model_end_time = time.time()

In [None]:
currentDT=datetime.datetime.now()
currentDT=currentDT.strftime("%Y-%m-%d %H:%M:%S")

predictors_list=' |'.join(predictors)

meta_data={}
meta_data['Problem type'] ="regression"
meta_data['Target']='TARGET_deathRate'
meta_data['predictors']=predictors_list
meta_data['Execution_Date'] = currentDT
meta_data['model_execution_time'] = model_end_time - model_start_time
meta_data['max_runtime_secs']='777'
meta_data['Evaluation_Matric']='RMSE'
# meta_data=meta_data.as_data_frame()
# Wrap the dict in a list so the all-scalar values become a single row
pd_meta=pd.DataFrame([meta_data])

# Save metadata
pd_meta.to_json('C://Users//kaila//OneDrive//Desktop//DSMT//HYPERPARAMETER-Project//LeaderBoard//Iteration2_metadata.json')
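
As a hedged alternative to routing the metadata through a DataFrame, a flat dictionary can be written directly with the standard `json` module. The sketch below uses a temporary directory and a made-up file name (`example_metadata.json`) so that it runs anywhere. The keys shown are a small subset chosen for illustration.

```python
import json
import os
import tempfile
import time

start = time.time()
# ... model training would happen here ...
elapsed = time.time() - start   # a plain float, not wrapped in a set

meta = {
    "Target": "TARGET_deathRate",
    "max_runtime_secs": 777,
    "model_execution_time": elapsed,
}

# Hypothetical output location; replace with the project's own directory
out_path = os.path.join(tempfile.mkdtemp(), "example_metadata.json")
with open(out_path, "w") as fh:
    json.dump(meta, fh, indent=2)

# Read the file back to confirm that it round-trips
with open(out_path) as fh:
    loaded = json.load(fh)
print(loaded)
```

Building the path with `os.path.join` also avoids hard-coding a machine-specific location.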

In [None]:
# View the AutoML Leaderboard
lb2 = aml2.leaderboard
lb2.head(rows=lb2.nrows)  # Print all rows instead of default (10 rows)

In [None]:
df_pandaFrame2 = lb2.as_data_frame()
# print(df_pandaFrame)
df_pandaFrame2.to_json('C://Users//kaila//OneDrive//Desktop//DSMT//HYPERPARAMETER-Project//Iteration2_777.json')

In [None]:
aml_leaderboard_df2=aml2.leaderboard.as_data_frame()
model_set2=aml_leaderboard_df2['model_id']
mod_best2=h2o.get_model(model_set2[0])

In [None]:
print(mod_best2.algo)

In [None]:
params_list2 = []
for key, value in mod_best2.params.items():
    params_list2.append(str(key)+" = "+str(value['actual']))
params_list2

### Model - 3 : Increasing AutoML running time

In [None]:
aml3 = H2OAutoML(max_runtime_secs=999, seed=27)

In [None]:
model_start_time = time.time()

try:
    aml3.train(x=predictors, y=Target, training_frame=H2O_df)
except Exception as e:
    
    logging.critical('aml3.train')
    h2o.download_all_logs(dirname=logs_path, filename=logfile)
    h2o.cluster().shutdown()
    sys.exit(4)

model_end_time = time.time()

In [None]:
currentDT=datetime.datetime.now()
currentDT=currentDT.strftime("%Y-%m-%d %H:%M:%S")

predictors_list=' |'.join(predictors)

meta_data={}
meta_data['Problem type'] ="regression"
meta_data['Target']='TARGET_deathRate'
meta_data['predictors']=predictors_list
meta_data['Execution_Date'] = currentDT
meta_data['model_execution_time'] = model_end_time - model_start_time
meta_data['max_runtime_secs']='999'
meta_data['Evaluation_Matric']='RMSE'
# meta_data=meta_data.as_data_frame()
# Wrap the dict in a list so the all-scalar values become a single row
pd_meta=pd.DataFrame([meta_data])

# Save metadata
pd_meta.to_json('C://Users//kaila//OneDrive//Desktop//DSMT//HYPERPARAMETER-Project//LeaderBoard//Iteration3_metadata.json')

In [None]:
# View the AutoML Leaderboard
lb3 = aml3.leaderboard
lb3.head(rows=lb3.nrows)  # Print all rows instead of default (10 rows)

In [None]:
df_pandaFrame3 = lb3.as_data_frame()
# print(df_pandaFrame)
df_pandaFrame3.to_json('C://Users//kaila//OneDrive//Desktop//DSMT//HYPERPARAMETER-Project//Iteration3_999.json')

### Model - 4 : Increasing AutoML running time

In [None]:
aml4 = H2OAutoML(max_runtime_secs=1333, seed=27)

In [None]:
model_start_time = time.time()

try:
    aml4.train(x=predictors, y=Target, training_frame=H2O_df)
except Exception as e:
    
    logging.critical('aml4.train')
    h2o.download_all_logs(dirname=logs_path, filename=logfile)
    h2o.cluster().shutdown()
    sys.exit(4)

model_end_time = time.time()

In [None]:
currentDT=datetime.datetime.now()
currentDT=currentDT.strftime("%Y-%m-%d %H:%M:%S")

predictors_list=' |'.join(predictors)

meta_data={}
meta_data['Problem type'] ="regression"
meta_data['Target']='TARGET_deathRate'
meta_data['predictors']=predictors_list
meta_data['Execution_Date'] = currentDT
meta_data['model_execution_time'] = model_end_time - model_start_time
meta_data['max_runtime_secs']='1333'
meta_data['Evaluation_Matric']='RMSE'
# meta_data=meta_data.as_data_frame()
# Wrap the dict in a list so the all-scalar values become a single row
pd_meta=pd.DataFrame([meta_data])

# Save metadata
pd_meta.to_json('C://Users//kaila//OneDrive//Desktop//DSMT//HYPERPARAMETER-Project//LeaderBoard//Iteration4_metadata.json')

In [None]:
# View the AutoML Leaderboard
lb4 = aml4.leaderboard
lb4.head(rows=lb4.nrows)  # Print all rows instead of default (10 rows)

In [None]:
df_pandaFrame4 = lb4.as_data_frame()
# print(df_pandaFrame)
df_pandaFrame4.to_json('C://Users//kaila//OneDrive//Desktop//DSMT//HYPERPARAMETER-Project//Iteration4_1333.json')

### Model - 5 : Increasing AutoML running time

In [None]:
aml5 = H2OAutoML(max_runtime_secs=1555, seed=27)

In [None]:
model_start_time = time.time()

try:
    aml5.train(x=predictors, y=Target, training_frame=H2O_df)
except Exception as e:
    
    logging.critical('aml5.train')
    h2o.download_all_logs(dirname=logs_path, filename=logfile)
    h2o.cluster().shutdown()
    sys.exit(4)

model_end_time = time.time()

In [None]:
currentDT=datetime.datetime.now()
currentDT=currentDT.strftime("%Y-%m-%d %H:%M:%S")

predictors_list=' |'.join(predictors)

meta_data={}
meta_data['Problem type'] ="regression"
meta_data['Target']='TARGET_deathRate'
meta_data['predictors']=predictors_list
meta_data['Execution_Date'] = currentDT
meta_data['model_execution_time'] = model_end_time - model_start_time
meta_data['max_runtime_secs']='1555'
meta_data['Evaluation_Matric']='RMSE'
# meta_data=meta_data.as_data_frame()
# Wrap the dict in a list so the all-scalar values become a single row
pd_meta=pd.DataFrame([meta_data])

# Save metadata
pd_meta.to_json('C://Users//kaila//OneDrive//Desktop//DSMT//HYPERPARAMETER-Project//LeaderBoard//Iteration5_metadata.json')

In [None]:
# View the AutoML Leaderboard
lb5 = aml5.leaderboard
lb5.head(rows=lb5.nrows)  # Print all rows instead of default (10 rows)

In [None]:
df_pandaFrame5 = lb5.as_data_frame()
# print(df_pandaFrame)
df_pandaFrame5.to_json('C://Users//kaila//OneDrive//Desktop//DSMT//HYPERPARAMETER-Project//Iteration5_1555.json')

In [None]:
lst = ['I', 'am', 'Python', 'prog']

# S1: concatenate in a loop
s = ""
for x in lst:
    s += x

# S2: concatenate with str.join
s = "".join(lst)

We have successfully implemented various regression algorithms, including linear regression, logistic regression, stepwise regression and Ridge regression (i.e. _L2 regularization_), to predict cancer mortality rates for US counties. <br />
Cancer mortality rates for US counties depend on various factors such as <br /> incidenceRate,medIncome,povertyPercent,PctHS25_Over,PctEmployed16_Over,PctUnemployed16_Over, <br /> PctPrivateCoverage,PctPrivateCoverageAlone,PctPublicCoverage,PctPublicCoverageAlone <br />
but the most prominent are: <br />
incidenceRate, PctPrivateCoverage, PctHS25_Over, povertyPercent, PctUnemployed16_Over, PctPrivateCoverageAlone, <br /> PctEmployed16_Over, PctPublicCoverage, PctPublicCoverageAlone, medIncome <br />

In linear regression, the best model gives a best fit (R-squared) of 0.51, while in logistic regression the best model gives an accuracy of 0.97.
In our analysis, the predictors we used in linear regression and the predictors suggested by forward stepwise regression are the same, which suggests that the independent variables we used are highly correlated.
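
For reference, the R-squared value quoted above is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it on made-up numbers; it illustrates the formula only and does not reproduce the 0.51 result.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up values: SS_res = 1.0 and SS_tot = 5.0
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
print(r_squared(actual, predicted))
```

A value of 1 means the predictions match the target exactly, while a value near 0 means the model does no better than always predicting the mean.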

## Contributions 

In the above analysis:
- 70% of the explanation, analysis and code is my own work.
- 20% of the material is from the web; citations are given below.
- 10% of the material is from Prof. Nik Brown's notes.

## Citations

Dataset : https://data.world/nrippner/ols-regression-challenge <br />
Regression methods : https://github.com/nikbearbrown/INFO_6105/<br />
Learn ROC Curve : https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8 <br />
Dummy function : https://towardsdatascience.com/the-dummys-guide-to-creating-dummy-variables-f21faddb1d40 <br />
K-fold cross validation: https://towardsdatascience.com/cross-validation-70289113a072 <br />
Homoscedasticity : http://davidmlane.com/hyperstat/A121947.html <br />
Confusion Matrix : https://stackoverflow.com/questions/30746460/how-to-interpret-scikits-learn-confusion-matrix-and-classification-report <br />
Forward Stepwise : https://planspace.org/20150423-forward_selection_with_statsmodels/ <br />
Backward Elimination : https://www.kaggle.com/umeshsati54/backward-elimination <br />
Feature Scaling : http://sebastianraschka.com/Articles/2014_about_feature_scaling.html#about-standardization <br />
Outliers : https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba <br />
Standard Error : http://changingminds.org/explanations/research/statistics/standard_error.htm


##  License

#### Copyright 2019 Kailash Nadkar


Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.