<a href="https://colab.research.google.com/github/AbiemwenseMaureenOshobugie/DPhi/blob/main/Data_Sprint_100_Employee_Performance_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Sprint 100 - Employee Performance Prediction

<div align="center" style="width: 950px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://www.shutterstock.com/image-photo/readymade-garments-workers-work-factory-600w-1967092840.jpg"
     alt="Garment workers at work in a factory"
     style="padding-bottom: 0.5em"
     width="950"/>
</div>

## Problem Statement
The garment industry is one of the dominant industries in this era of industrial globalization. It is highly labor-intensive and needs a large workforce to produce its goods and meet global demand for garment products. Because of this dependence on human labor, a garment company's output relies heavily on the productivity of the employees working in its different departments. A common problem in the industry is that the actual productivity of garment employees sometimes falls short of the targeted productivity that management set in order to meet production goals on time. When this productivity gap occurs, the company can face a substantial loss in production.

## Objective

As an aspiring Data Scientist, your job is to devise a Machine Learning model that helps track, analyse and predict the productivity of employees in garment factories, thereby helping manufacturers set accurate targets, minimize production loss and maximize profit.



## Dataset Description

 

This dataset includes important attributes of the garment manufacturing process and the productivity of the employees. The data was collected manually and validated by industry experts.

For more details, see the [Dataset description](https://aiplanet.com/challenges/data-sprint-100-employee-performance-prediction/317/data), [Participants and Leaderboard](https://aiplanet.com/challenges/data-sprint-100-employee-performance-prediction/317/leaderboard/public/), [Train dataset](https://s3.us-west-1.wasabisys.com/dphi/datasets/317/train_dataset.csv?AWSAccessKeyId=ABSZWDH67WW3G8YX40WD&Signature=2ytaB5yZVqTo1ADhgTNkpOfiGGs%3D&Expires=1671466004) and [Test dataset](https://s3.us-west-1.wasabisys.com/dphi/datasets/317/test_dataset.csv?AWSAccessKeyId=ABSZWDH67WW3G8YX40WD&Signature=jVtN5ujoTRfDpeNSR4xyo6sS6YA%3D&Expires=1671466004).





## Install and import the necessary packages

In [None]:
!pip3 -q install catboost
!pip -q install pandas_bokeh

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams.update({'figure.figsize':(10,8), 'figure.dpi':100})
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# for modeling data
from sklearn.linear_model import LinearRegression 
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn import neighbors

# for checking performance
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.metrics import mean_squared_error
import warnings
# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')
import pandas_bokeh
# Embedding plots in Colab Notebook
pandas_bokeh.output_notebook()
# Default figure size for matplotlib plots (overrides the earlier rcParams setting)
plt.rcParams["figure.figsize"] = [10, 7]

In [None]:
df_train = pd.read_csv('/content/garment_train_dataset.csv')
df_test = pd.read_csv('/content/garment_test_dataset.csv')

In [None]:
df_train.head()

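| | team | targeted_productivity | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | no_of_workers | ... | department_finishing | department_finishing.1 | department_sweing | day_Monday | day_Saturday | day_Sunday | day_Thursday | day_Tuesday | day_Wednesday | actual_productivity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9 | 0.75 | 3.94 | NaN | 960 | 0 | 0.0 | 0 | 0 | 8.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.755167 |
| 1 | 7 | 0.65 | 30.1 | 909.0 | 7080 | 0 | 0.0 | 0 | 1 | 59.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0.535678 |
| 2 | 3 | 0.8 | 4.15 | NaN | 1440 | 0 | 0.0 | 0 | 0 | 7.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.820833 |
| 3 | 1 | 0.65 | 22.53 | 762.0 | 5040 | 0 | 0.0 | 0 | 1 | 42.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0.581131 |
| 4 | 4 | 0.7 | 30.1 | 767.0 | 3300 | 50 | 0.0 | 0 | 1 | 57.0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.790003 |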


In [None]:
df_test.head()

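| | team | targeted_productivity | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | no_of_workers | ... | quarter_Quarter5 | department_finishing | department_finishing.1 | department_sweing | day_Monday | day_Saturday | day_Sunday | day_Thursday | day_Tuesday | day_Wednesday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 0.75 | 4.08 | NaN | 1080 | 0 | 0.0 | 0 | 0 | 9.0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 4 | 0.75 | 4.15 | NaN | 2400 | 0 | 0.0 | 0 | 0 | 20.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 0.7 | 30.1 | 1057.0 | 0 | 40 | 0.0 | 0 | 1 | 58.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 7 | 0.7 | 3.94 | NaN | 2160 | 0 | 0.0 | 0 | 0 | 18.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 5 | 0.5 | 4.15 | NaN | 1440 | 0 | 0.0 | 0 | 0 | 8.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |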


In [None]:
df_train.shape,df_test.shape

((1017, 26), (180, 25))

In [None]:
df_train.duplicated().sum()

0

In [None]:
df_train.describe()

| | team | targeted_productivity | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | no_of_workers | ... | department_finishing | department_finishing.1 | department_sweing | day_Monday | day_Saturday | day_Sunday | day_Thursday | day_Tuesday | day_Wednesday | actual_productivity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1017.0 | 1017.0 | 1017.0 | 594.0 | 1017.0 | 1017.0 | 1017.0 | 1017.0 | 1017.0 | 1017.0 | ... | 1017.0 | 1017.0 | 1017.0 | 1017.0 | 1017.0 | 1017.0 | 1017.0 | 1017.0 | 1017.0 | 1017.0 |
| mean | 6.443461 | 0.730747 | 15.150492 | 1183.183502 | 4532.94002 | 40.689282 | 0.564405 | 0.39823 | 0.160275 | 34.846116 | ... | 0.201573 | 0.214356 | 0.584071 | 0.161259 | 0.152409 | 0.164208 | 0.165192 | 0.171091 | 0.185841 | 0.736509 |
| std | 3.472473 | 0.097384 | 10.946096 | 1793.836719 | 3275.997333 | 173.240655 | 10.093731 | 3.351712 | 0.440199 | 22.185292 | ... | 0.401373 | 0.410577 | 0.493124 | 0.36795 | 0.359594 | 0.370647 | 0.371536 | 0.376774 | 0.389169 | 0.174304 |
| min | 1.0 | 0.07 | 2.9 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.233705 |
| 25% | 3.0 | 0.7 | 3.94 | 770.5 | 1440.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.651515 |
| 50% | 7.0 | 0.75 | 15.26 | 1039.0 | 4080.0 | 0.0 | 0.0 | 0.0 | 0.0 | 34.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.773333 |
| 75% | 9.0 | 0.8 | 24.26 | 1254.75 | 6900.0 | 50.0 | 0.0 | 0.0 | 0.0 | 57.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.85017 |
| max | 12.0 | 0.8 | 54.56 | 23122.0 | 15120.0 | 3600.0 | 270.0 | 45.0 | 2.0 | 89.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.108125 |
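
The `count` row shows 594 non-null `wip` values against 1017 rows, so 423 entries are missing. A minimal sketch for listing missing values per column is given here; the `toy` frame is a hypothetical stand-in for `df_train`, not the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train
toy = pd.DataFrame({
    "wip": [np.nan, 909.0, np.nan, 762.0, 767.0],
    "over_time": [960, 7080, 1440, 5040, 3300],
})

# Number and share of missing values in each column
missing = toy.isna().sum()
missing_share = toy.isna().mean()
print(pd.DataFrame({"missing": missing, "share": missing_share}))
```

Running the same two calls on `df_train` and `df_test` would show whether `wip` is the only column that needs imputation.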


In [None]:
'''def remove_outlier(df, low=.05, high=.95):
    # Keep rows whose numeric values lie inside the [low, high] quantile range.
    # The earlier dtype == 'numeric' test never matched, so no rows were filtered;
    # inclusive bounds keep 0/1 dummy columns from emptying the frame.
    quant_df = df.quantile([low, high], numeric_only=True)
    for name in quant_df.columns:
        df = df[df[name].between(quant_df.loc[low, name], quant_df.loc[high, name])]
    return df

remove_outlier(df_train)'''

In [None]:
df_train.shape

(1017, 26)

In [None]:
df_train.columns

Index(['team', 'targeted_productivity', 'smv', 'wip', 'over_time', 'incentive',
       'idle_time', 'idle_men', 'no_of_style_change', 'no_of_workers', 'month',
       'quarter_Quarter1', 'quarter_Quarter2', 'quarter_Quarter3',
       'quarter_Quarter4', 'quarter_Quarter5', 'department_finishing',
       'department_finishing ', 'department_sweing', 'day_Monday',
       'day_Saturday', 'day_Sunday', 'day_Thursday', 'day_Tuesday',
       'day_Wednesday', 'actual_productivity'],
      dtype='object')
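
The column list contains both `'department_finishing'` and `'department_finishing '` (with a trailing space), which suggests the raw department labels differed only by whitespace. One possible clean-up is sketched below on a hypothetical toy frame; whether the two dummies should be merged in this dataset is an assumption, and the notebook itself keeps both columns.

```python
import pandas as pd

# Hypothetical toy frame reproducing the duplicated dummy column
toy = pd.DataFrame({
    "department_finishing": [1, 0, 0],
    "department_finishing ": [0, 1, 0],
    "department_sweing": [0, 0, 1],
})

# Strip whitespace from the names, then collapse identically named columns
# with a row-wise maximum: a row is "finishing" if either dummy is 1
toy.columns = toy.columns.str.strip()
merged = toy.T.groupby(level=0).max().T
print(merged)
```

Stripping the labels before one-hot encoding (for example with `Series.str.strip()` on the raw department column) would avoid the duplicate in the first place.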

In [None]:
# One histogram per column, laid out six per row.
# Iterating over df_train.columns also covers 'department_finishing ' (trailing space),
# so no column is plotted twice or left out.
plots = [
    df_train.plot_bokeh(kind="hist", y=col, show_figure=False, legend=False)
    for col in df_train.columns
]
grid = [plots[i:i + 6] for i in range(0, len(plots), 6)]

pandas_bokeh.plot_grid(grid, plot_width=200, plot_height=150)


## Data Cleaning

In [None]:
'''df_train.wip.fillna(method='bfill', inplace=True)
df_train.wip.fillna(method='ffill', inplace=True)

df_test.wip.fillna(method='bfill', inplace=True)
df_test.wip.fillna(method='ffill', inplace=True)'''

In [None]:
df_train['wip'] = df_train['wip'].fillna(0)
df_test['wip'] = df_test['wip'].fillna(0)

In [None]:
# Select the positively skewed columns to transform.
# For positive skewness, square roots, cube roots, logarithms and reciprocals are common choices.
positive_columns_to_transform = ["idle_men", 'idle_time','wip','no_of_style_change','quarter_Quarter5']
# Apply the square-root transformation to the selected columns
df_train[positive_columns_to_transform] = np.sqrt(df_train[positive_columns_to_transform])
# Apply the same transformation to the test set so both sets share one feature scale
df_test[positive_columns_to_transform] = np.sqrt(df_test[positive_columns_to_transform])
# Check the skewness of the transformed training columns
print(df_train[positive_columns_to_transform].skew())

idle_men               8.143066
idle_time             16.821077
wip                    1.365880
no_of_style_change     2.333005
quarter_Quarter5       4.959514
dtype: float64
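
The skewness values stay well above zero after the transformation, so it can help to compare candidate transforms side by side before picking one. The sketch below does this on a synthetic right-skewed sample; the exponential sample and the choice of `np.log1p` as the alternative are assumptions for illustration, not results from the garment data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed, non-negative sample (not the competition data)
s = pd.Series(rng.exponential(scale=100.0, size=1000))

# Skewness before and after each candidate transform
skews = pd.Series({
    "raw": s.skew(),
    "sqrt": np.sqrt(s).skew(),
    "log1p": np.log1p(s).skew(),  # log(1 + x) is defined at x = 0
})
print(skews)
```

For columns that are zero in most rows, such as `idle_time` or `idle_men`, no monotone transform can remove the spike at zero, which may explain why their skewness remains large.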


## Data Preprocessing

In [None]:
df_train = df_train.drop(['wip'],axis =1)
df_test = df_test.drop(['wip'],axis =1)

In [None]:
# define the predictor matrix
X = df_train.drop('actual_productivity', axis = 1)

# define the target variable
y = df_train['actual_productivity']


In [None]:
# Feature engineering: encode month as a cyclical feature (sine/cosine pair)

X['month_sin'] = np.sin((X['month']-1)*(2.*np.pi/12))
X['month_cos'] = np.cos((X['month']-1)*(2.*np.pi/12))

# Square-root the quarter and day dummies. These are 0/1 columns,
# so their values stay the same (sqrt(0) = 0, sqrt(1) = 1).
dummy_columns = ['quarter_Quarter1', 'quarter_Quarter2', 'quarter_Quarter3',
                 'quarter_Quarter4', 'quarter_Quarter5', 'day_Monday',
                 'day_Saturday', 'day_Sunday', 'day_Tuesday', 'day_Thursday',
                 'day_Wednesday']
X[dummy_columns] = np.sqrt(X[dummy_columns])


X.drop(['month'], axis=1, inplace=True)
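
The sine/cosine pair places the twelve months on a circle, so December and January end up as close together as any other pair of neighbouring months. A small check of that property, using a hypothetical helper `encode_month` that applies the same formula as the cell above:

```python
import numpy as np

def encode_month(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = (month - 1) * (2.0 * np.pi / 12)
    return np.array([np.sin(angle), np.cos(angle)])

# Distances between encoded months
dec_jan = np.linalg.norm(encode_month(12) - encode_month(1))
jan_feb = np.linalg.norm(encode_month(1) - encode_month(2))
jan_jul = np.linalg.norm(encode_month(1) - encode_month(7))
print(dec_jan, jan_feb, jan_jul)
```

With the raw month number, December and January would be 11 units apart, which is the discontinuity the encoding is meant to remove.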

In [None]:
df_test['month_sin'] = np.sin((df_test['month']-1)*(2.*np.pi/12))
df_test['month_cos'] = np.cos((df_test['month']-1)*(2.*np.pi/12))

# Same square-root step as for X; the 0/1 dummy values are unchanged by it
dummy_columns = ['quarter_Quarter1', 'quarter_Quarter2', 'quarter_Quarter3',
                 'quarter_Quarter4', 'quarter_Quarter5', 'day_Monday',
                 'day_Saturday', 'day_Sunday', 'day_Tuesday', 'day_Thursday',
                 'day_Wednesday']
df_test[dummy_columns] = np.sqrt(df_test[dummy_columns])


df_test.drop(['month'], axis=1, inplace=True)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_info = pd.DataFrame()
vif_info['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif_info['Columns'] = X.columns
vif_info.sort_values('VIF', ascending=False)

| | VIF | Columns |
|---|---|---|
| 12 | inf | quarter_Quarter4 |
| 16 | inf | department_sweing |
| 22 | inf | day_Wednesday |
| 21 | inf | day_Tuesday |
| 20 | inf | day_Thursday |
| 19 | inf | day_Sunday |
| 18 | inf | day_Saturday |
| 9 | inf | quarter_Quarter1 |
| 10 | inf | quarter_Quarter2 |
| 11 | inf | quarter_Quarter3 |
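
An infinite VIF means a column is an exact linear combination of the others. That is what happens when several one-hot groups each keep all of their levels: every group sums to 1 in every row, so any dummy can be rebuilt from the rest. The sketch below reproduces the effect with NumPy only. The helper `vif` is a hypothetical re-implementation (regression without an intercept, which is my understanding of how the statsmodels function behaves), and the two toy dummy groups are assumptions for illustration.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns (no intercept)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ coef
    r2 = 1.0 - (resid @ resid) / (y @ y)
    # Treat a numerically perfect fit as infinite inflation
    return np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = np.eye(3)[rng.integers(0, 3, size=200)]  # one-hot group with 3 levels
b = np.eye(2)[rng.integers(0, 2, size=200)]  # one-hot group with 2 levels

full = np.hstack([a, b])                   # every level kept: groups are collinear
reduced = np.hstack([a[:, 1:], b[:, 1:]])  # one level dropped per group

print(vif(full, 0), vif(reduced, 0))
```

Dropping one level per group, as the next cell does for the quarter and day dummies, breaks the exact dependence and makes the VIF finite.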


In [None]:
X.drop(['quarter_Quarter4','day_Thursday'], axis=1, inplace=True)
df_test.drop(['quarter_Quarter4','day_Thursday'], axis=1, inplace=True)

In [None]:
vif_info = pd.DataFrame()
vif_info['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif_info['Columns'] = X.columns
vif_info.sort_values('VIF', ascending=False)

| | VIF | Columns |
|---|---|---|
| 15 | 259.970838 | department_sweing |
| 13 | 83.981679 | department_finishing |
| 14 | 79.014898 | department_finishing |
| 8 | 14.316933 | no_of_workers |
| 21 | 12.078212 | month_sin |
| 22 | 8.710233 | month_cos |
| 2 | 6.174965 | smv |
| 3 | 3.294868 | over_time |
| 9 | 2.169761 | quarter_Quarter1 |
| 6 | 2.054404 | idle_men |


In [None]:
# Set test size to 20 % of training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 0)

# Standardise X train and X test (the scaler is fitted on the training split only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Standardise the competition test set with the same fitted scaler
test_scaler = scaler.transform(df_test)


In [None]:
from sklearn import linear_model
from sklearn import svm

# Baseline regressors with default hyperparameters
regressors = [
    svm.SVR(),
    linear_model.SGDRegressor(),
    linear_model.BayesianRidge(),
    linear_model.LassoLars(),
    linear_model.ARDRegression(),
    linear_model.PassiveAggressiveRegressor(),
    linear_model.TheilSenRegressor(),
    linear_model.LinearRegression()]

# Fit each baseline and print its R^2 on the held-out split
for reg in regressors:
    print(reg)
    reg.fit(X_train, y_train)
    print(r2_score(y_test,reg.predict(X_test)),'\n')

SVR()
0.31701415514587916 

SGDRegressor()
0.2810984717592597 

BayesianRidge()
0.2830583864997964 

LassoLars()
-0.02173428579673975 

ARDRegression()
0.29201581061222504 

PassiveAggressiveRegressor()
-0.3206870723783857 

TheilSenRegressor(max_subpopulation=10000)
-0.6253774466759874 

LinearRegression()
0.29540535797455625 
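
These scores come from a single 80/20 split of about a thousand rows, so differences of a few hundredths between models may not be stable. A hedged sketch of a k-fold comparison is shown below. It uses a synthetic dataset from `make_regression` as a stand-in for the garment features, and wraps the scaler in a pipeline so that each fold fits it on its own training part.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the garment features and target (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Scaling lives inside the pipeline, so no fold sees statistics from its validation part
model = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")

print(scores.mean(), scores.std())
```

Reporting the mean and spread across folds gives a sense of how much of a gap between two models is noise.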



## CatBoost

In [None]:
c_model = CatBoostRegressor(iterations=100,
                          learning_rate=0.1,
                          depth=8,
                          random_state=0)
c_model.fit(X_train, y_train)


0:	learn: 0.1652988	total: 3.16ms	remaining: 313ms
1:	learn: 0.1599119	total: 9.55ms	remaining: 468ms
2:	learn: 0.1550633	total: 14.3ms	remaining: 462ms
3:	learn: 0.1506996	total: 19ms	remaining: 456ms
4:	learn: 0.1465595	total: 23.8ms	remaining: 452ms
5:	learn: 0.1422730	total: 28.3ms	remaining: 443ms
6:	learn: 0.1409688	total: 31.4ms	remaining: 417ms
7:	learn: 0.1382671	total: 32.8ms	remaining: 377ms
8:	learn: 0.1356714	total: 35.8ms	remaining: 362ms
9:	learn: 0.1331457	total: 40.6ms	remaining: 365ms
10:	learn: 0.1303336	total: 46.3ms	remaining: 374ms
11:	learn: 0.1274423	total: 49.8ms	remaining: 365ms
12:	learn: 0.1257882	total: 54.3ms	remaining: 364ms
13:	learn: 0.1238211	total: 59.1ms	remaining: 363ms
14:	learn: 0.1220014	total: 63.8ms	remaining: 362ms
15:	learn: 0.1199923	total: 68.4ms	remaining: 359ms
16:	learn: 0.1184063	total: 73ms	remaining: 357ms
17:	learn: 0.1177155	total: 77.7ms	remaining: 354ms
18:	learn: 0.1164503	total: 82.2ms	remaining: 350ms
19:	learn: 0.1154964	total

<catboost.core.CatBoostRegressor at 0x7ff4242af940>

In [None]:
# make predictions on the held-out split
c_pred = c_model.predict(X_test)

# Evaluate the CatBoost regression model (r2_score returns R^2, not a classification accuracy)
print("R^2 score of the CatBoost Regression model is",r2_score(y_test,c_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_test,c_pred)), '\nMAE: ',mean_absolute_error(y_test, c_pred))


R^2 score of the CatBoost Regression model is 0.44637664176028147
RMSE:  0.14119568337925267 
MAE:  0.09107013484744214
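
The first number printed above comes from `r2_score`, the coefficient of determination, which measures how much of the target's variance the predictions explain. The sketch below computes R², RMSE and MAE by hand on a handful of made-up values (hypothetical, not notebook output) to make the definitions concrete.

```python
import numpy as np

# Hypothetical true and predicted productivity values
y_true = np.array([0.75, 0.54, 0.82, 0.58, 0.79])
y_pred = np.array([0.70, 0.60, 0.80, 0.65, 0.75])

residuals = y_true - y_pred
ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean

r2 = 1.0 - ss_res / ss_tot
rmse = np.sqrt(np.mean(residuals ** 2))
mae = np.mean(np.abs(residuals))

# Always predicting the mean gives R^2 = 0, the reference point of the scale
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1.0 - np.sum((y_true - baseline) ** 2) / ss_tot

print(r2, rmse, mae, r2_baseline)
```

A model that does worse than the mean predictor gets a negative R², which is how values such as -0.62 can appear in the baseline comparison.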


## Random Forest

In [None]:
# Build a random forest regressor
rfr = RandomForestRegressor(max_depth=10, min_samples_leaf=2, min_samples_split=10,
                      n_estimators=50,
                      random_state =0) 

# Fit the random forest regression model on the training data
rfr.fit(X_train, y_train)


RandomForestRegressor(max_depth=10, min_samples_leaf=2, min_samples_split=10,
                      n_estimators=50, random_state=0)

In [None]:
# make predictions on the held-out split
pred_rfr = rfr.predict(X_test)

# Evaluate the random forest regression model (r2_score returns R^2, not a classification accuracy)
print("R^2 score of the Random Forest Regression model is",r2_score(y_test,pred_rfr))
print("RMSE: ", np.sqrt(mean_squared_error(y_test,pred_rfr)), '\nMAE: ',mean_absolute_error(y_test, pred_rfr))


R^2 score of the Random Forest Regression model is 0.4093222161936162
RMSE:  0.14584432519747687 
MAE:  0.09298740969514259


## Gradient Boosting

In [None]:
gbr = GradientBoostingRegressor(learning_rate=0.05, min_samples_leaf=2,
                          min_samples_split=10, n_estimators=300,
                          random_state =0)


# Fit the gradient boosting model on the training data
gbr.fit(X_train,y_train)


GradientBoostingRegressor(learning_rate=0.05, min_samples_leaf=2,
                          min_samples_split=10, n_estimators=300,
                          random_state=0)

In [None]:
# make predictions on the held-out split
pred_gbr = gbr.predict(X_test)

# Evaluate the gradient boosting regression model (r2_score returns R^2, not a classification accuracy)
print("R^2 score of the GradientBoosting regressor model is",r2_score(y_test,pred_gbr))
print("RMSE: ", np.sqrt(mean_squared_error(y_test,pred_gbr)), '\nMAE: ',mean_absolute_error(y_test, pred_gbr))


R^2 score of the GradientBoosting regressor model is 0.451089259322488
RMSE:  0.1405934478223653 
MAE:  0.09265560151523065


## GradientBoosting Bagging Ensemble

In [None]:
bag_gbr = GradientBoostingRegressor(learning_rate=0.05, min_samples_leaf=2,
                          min_samples_split=10, n_estimators=300,
                          random_state=0)

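# Instantiate a BaggingRegressor with the gradient booster as the base model.
# The estimator is passed positionally because the keyword was renamed from
# base_estimator to estimator in newer scikit-learn releases.
bag_reg = BaggingRegressor(bag_gbr)

# Fit the bagging ensemble on the training data
bag_reg.fit(X_train,y_train)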


BaggingRegressor(base_estimator=GradientBoostingRegressor(learning_rate=0.05,
                                                          min_samples_leaf=2,
                                                          min_samples_split=10,
                                                          n_estimators=300,
                                                          random_state=0))

In [None]:
# make predictions on the held-out split
bag_pred = bag_reg.predict(X_test)

# Evaluate the bagging regression model (r2_score returns R^2, not a classification accuracy)
print("R^2 score of the bagging regressor model is",r2_score(y_test,bag_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_test,bag_pred)), '\nMAE: ',mean_absolute_error(y_test, bag_pred))


R^2 score of the bagging regressor model is 0.4333515364910182
RMSE:  0.14284698380898592 
MAE:  0.09620141980024811


## Prediction for submission

In [None]:
# Predict on the scaled competition test set with the CatBoost model
predict1 = c_model.predict(test_scaler)

In [None]:
# Create a DataFrame holding the predicted values
submit1 = pd.DataFrame(predict1) 
submit1.columns = ["actual_productivity"]

# Save the predictions as CSV and download the file from Colab
from google.colab import files
submit1.to_csv('submission_s_reg.csv', index = False)
files.download('submission_s_reg.csv')


## GridSearch
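
The cells in this section hold grid searches that are disabled by wrapping them in triple-quoted strings. For reference, a miniature runnable version of the same idea is sketched here. The synthetic data from `make_regression` and the deliberately tiny grid are assumptions chosen so the example runs quickly; they are not the settings used for the competition models.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Small grid: 2 x 2 combinations, each scored with 3-fold cross validation
param_grid = {"n_estimators": [50, 100], "max_depth": [1, 3]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)

# With refit=True (the default) the best estimator is already retrained on all the data
best_model = search.best_estimator_
```

Because of the default refit, a separate `fit` call with `best_params_` is only needed when the final model should be trained on different data than the search used.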

In [None]:
'''from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from catboost import CatBoostRegressor
model_CBR = CatBoostRegressor()

parameters = {'depth'         : [6,8,10],
                  'learning_rate' : [0.01, 0.05, 0.1],
                  'iterations'    : [30, 50, 100]
                 }
grid = GridSearchCV(estimator=model_CBR, param_grid = parameters, cv = 2, n_jobs=-1)
grid.fit(X_train, y_train)
print(" Results from Grid Search " )
print("\n The best estimator across ALL searched params:\n", grid.best_estimator_)
print("\n The best score across ALL searched params:\n", grid.best_score_)
print("\n The best parameters across ALL searched params:\n", grid.best_params_)'''


In [None]:
'''# GradientBoosting
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Create a Gradient Boosting Regressor
model = GradientBoostingRegressor()

# Define the hyperparameters to tune and their possible values
param_grid = {'n_estimators': [100, 200, 300],
              'learning_rate': [0.1, 0.05, 0.01],
              'max_depth': [1, 3, 5],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [1, 2, 4]}

# Create a Grid Search object with 5-fold cross validation
grid_search = GridSearchCV(model, param_grid, cv=5)

# Fit the grid search object to the training data
grid_search.fit(X_train, y_train)

# Get the best combination of hyperparameters
best_params = grid_search.best_params_

# Use the best combination of hyperparameters to train a new model
best_model = GradientBoostingRegressor(**best_params)
best_model.fit(X_train, y_train)'''

In [None]:
'''# Randomforest
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Create a Random Forest Regressor
model = RandomForestRegressor()

# Define the hyperparameters to tune and their possible values
param_grid = {'n_estimators': [10, 50, 100, 200],
              'max_depth': [None, 5, 10, 15],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [1, 2, 4]}

# Create a Grid Search object with 5-fold cross validation
grid_search = GridSearchCV(model, param_grid, cv=5)

# Fit the grid search object to the training data
grid_search.fit(X_train, y_train)

# Get the best combination of hyperparameters
best_params = grid_search.best_params_

# Use the best combination of hyperparameters to train a new model
best_model = RandomForestRegressor(**best_params)
best_model.fit(X_train, y_train)'''

In [None]:
'''# ARDRegressor
from sklearn.linear_model import ARDRegression
from sklearn.model_selection import GridSearchCV

# Create an ARD Regressor
model = ARDRegression()

# Define the hyperparameters to tune and their possible values
param_grid = {'n_iter': [100, 200, 300],
              'tol': [1e-3, 1e-4, 1e-5],
              'alpha_1': [1e-6, 1e-5, 1e-4],
              'alpha_2': [1e-6, 1e-5, 1e-4],
              'lambda_1': [1e-6, 1e-5, 1e-4],
              'lambda_2': [1e-6, 1e-5, 1e-4]}

# Create a Grid Search object with 5-fold cross validation
grid_search = GridSearchCV(model, param_grid, cv=5)

# Fit the grid search object to the training data
grid_search.fit(X_train, y_train)

# Get the best combination of hyperparameters
best_params = grid_search.best_params_

# Use the best combination of hyperparameters to train a new model
best_model = ARDRegression(**best_params)
best_model.fit(X_train, y_train)'''

In [None]:
'''from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Set up the hyperparameter grid
param_grid = {'max_depth': [2, 4, 6, 8, 10,20,50,100],
              'min_samples_leaf': [1, 2, 4, 6, 8,10,20,30,40,50]}

# Instantiate the model
tree = DecisionTreeRegressor()

# Set up the grid search
grid_search = GridSearchCV(estimator=tree, param_grid=param_grid, cv=5)

# Fit the grid search to the data
grid_search.fit(X, y)

# Print the best hyperparameters
print(grid_search.best_params_)
'''