# Topics covered in the notebook

This notebook walks through a complete project using many of the most commonly used scikit-learn (sklearn) tools. It may help you understand how these tools work and how to apply them to real-life data.
1. Data read with pandas
2. Data exploration with pandas
3. Label Encoding with sklearn
4. OneHot Encoding with sklearn
5. Creating a pandas datetime variable and handling time data
6. train_test_split using sklearn
7. Model formation: Linear Regression, Decision Tree, Random Forest, and Gradient Boosting with sklearn, plus Light Gradient Boosting with the separate lightgbm package
8. Sklearn model performance evaluation: RMSE and score
9. K-fold cross validation with sklearn


## Library import

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## File Reading

In [2]:
df= pd.read_csv("train.csv")

## Data Exploration

In [3]:
df.head()

Unnamed: 0,row_id,date,country,store,product,num_sold
0,0,2015-01-01,Finland,KaggleMart,Kaggle Mug,329
1,1,2015-01-01,Finland,KaggleMart,Kaggle Hat,520
2,2,2015-01-01,Finland,KaggleMart,Kaggle Sticker,146
3,3,2015-01-01,Finland,KaggleRama,Kaggle Mug,572
4,4,2015-01-01,Finland,KaggleRama,Kaggle Hat,911


In [4]:
df['store'].unique()

array(['KaggleMart', 'KaggleRama'], dtype=object)

In [5]:
df['product'].unique()

array(['Kaggle Mug', 'Kaggle Hat', 'Kaggle Sticker'], dtype=object)

In [6]:
df['country'].unique()

array(['Finland', 'Norway', 'Sweden'], dtype=object)

In [7]:
df['date'].value_counts()

2018-12-10    18
2018-07-11    18
2017-01-17    18
2017-10-10    18
2015-01-06    18
              ..
2017-08-24    18
2015-02-07    18
2018-07-07    18
2015-05-01    18
2016-03-13    18
Name: date, Length: 1461, dtype: int64

In [8]:
df.describe()

Unnamed: 0,row_id,num_sold
count,26298.0,26298.0
mean,13148.5,387.533577
std,7591.723026,266.076193
min,0.0,70.0
25%,6574.25,190.0
50%,13148.5,315.0
75%,19722.75,510.0
max,26297.0,2884.0


In [9]:
df.isnull().sum()

row_id      0
date        0
country     0
store       0
product     0
num_sold    0
dtype: int64

### Separate X & y using pandas slicing

In [10]:
X = df.iloc[:, :-1] # Independent Variable
y = df.iloc[:, -1] # Dependent Variable

In [11]:
X.head()

Unnamed: 0,row_id,date,country,store,product
0,0,2015-01-01,Finland,KaggleMart,Kaggle Mug
1,1,2015-01-01,Finland,KaggleMart,Kaggle Hat
2,2,2015-01-01,Finland,KaggleMart,Kaggle Sticker
3,3,2015-01-01,Finland,KaggleRama,Kaggle Mug
4,4,2015-01-01,Finland,KaggleRama,Kaggle Hat


In [12]:
y.head()

0    329
1    520
2    146
3    572
4    911
Name: num_sold, dtype: int64
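
The positional slicing above works as long as the target is the last column. As a small sketch of an alternative, the features and target can also be selected by column name, which keeps working if the column order changes. The toy frame below is an assumption for illustration; only the column names are taken from train.csv.

```python
import pandas as pd

# Toy frame reusing column names from train.csv (illustrative rows only)
toy = pd.DataFrame({
    "row_id": [0, 3],
    "store": ["KaggleMart", "KaggleRama"],
    "num_sold": [329, 572],
})

target = "num_sold"

# drop() returns a new DataFrame, so later edits to X_demo do not touch toy
X_demo = toy.drop(columns=[target])  # independent variables
y_demo = toy[target]                 # dependent variable

# Modify the feature frame and confirm the source frame is unchanged
X_demo["store"] = X_demo["store"].str.upper()
print(X_demo.columns.tolist())
print(toy["store"].tolist())
```

Working on a separate object rather than a view of df also helps avoid pandas' SettingWithCopyWarning when columns are overwritten in later cells.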

### Use Sklearn for encoding categorical values

LabelEncoder maps each class to an integer, so a two-class column such as store becomes 0 and 1.

In [13]:
# Label Encoding for 2-class columns:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X.iloc[:, 3] = le.fit_transform(X.iloc[:, 3])

In [14]:
X.head()

Unnamed: 0,row_id,date,country,store,product
0,0,2015-01-01,Finland,0,Kaggle Mug
1,1,2015-01-01,Finland,0,Kaggle Hat
2,2,2015-01-01,Finland,0,Kaggle Sticker
3,3,2015-01-01,Finland,1,Kaggle Mug
4,4,2015-01-01,Finland,1,Kaggle Hat
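
To see which store received which code, the fitted encoder can be inspected. The sketch below uses a toy Series (an assumption, not the notebook's data) and relies on the fact that LabelEncoder sorts the labels and stores them in `classes_`, so the position of a label in that array is its integer code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column mirroring the two store names in the dataset
stores = pd.Series(["KaggleMart", "KaggleRama", "KaggleMart"])

le = LabelEncoder()
codes = le.fit_transform(stores)

# classes_ holds the sorted labels; the index of a label is its integer code
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = le.inverse_transform(codes)
print(list(restored))
```

Note that the scikit-learn documentation describes LabelEncoder as intended for target values; for input features, OrdinalEncoder is the counterpart designed for that purpose. For a single two-class column the resulting codes are the same.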


### Convert the date column into pandas datetime format

In [15]:
X['date']=pd.to_datetime(X['date'], format = '%Y-%m-%d', errors = 'coerce')
#X['just_date'] = X['date'].dt

In [16]:
X['year']=X['date'].dt.year

In [17]:
X['month']=X['date'].dt.month
X['day']=X['date'].dt.day

In [18]:
df_x_train = X.copy()  # independent copy; plain assignment would only alias X
X.drop(['date'], inplace=True, axis=1)

In [19]:
X.head()

Unnamed: 0,row_id,country,store,product,year,month,day
0,0,Finland,0,Kaggle Mug,2015,1,1
1,1,Finland,0,Kaggle Hat,2015,1,1
2,2,Finland,0,Kaggle Sticker,2015,1,1
3,3,Finland,1,Kaggle Mug,2015,1,1
4,4,Finland,1,Kaggle Hat,2015,1,1
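
The date handling above can be sketched on a small Series. Two points are worth illustrating: with `errors='coerce'`, unparseable strings silently become NaT, so it is worth counting them, and the `.dt` accessor offers more calendar features than year, month, and day. The `dayofweek` column below is an illustrative addition, not something the notebook uses.

```python
import pandas as pd

# Two valid dates and one deliberately invalid entry
raw = pd.Series(["2015-01-01", "2015-01-03", "not a date"])
dates = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Invalid strings are coerced to NaT instead of raising an error
n_bad = int(dates.isna().sum())
print("unparseable dates:", n_bad)

# Calendar features via the .dt accessor (dayofweek: Monday=0 ... Sunday=6)
features = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
    "dayofweek": dates.dt.dayofweek,
})
print(features)
```

In this dataset `df.isnull().sum()` reported no missing dates before conversion, so a check like `n_bad` would simply confirm that the format string matched every row.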


### Now use OneHotEncoder to encode columns that have more than two classes

In [20]:
# OneHot Encoding for more than 2-class columns:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1,3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

### Use sklearn to standardize the input values (zero mean, unit variance)

In [21]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X_scaled= scaler.fit_transform(X)

### Use sklearn train_test_split for splitting the dataset into train and test

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2, random_state = 0)
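
In the cells above the scaler is fitted on the full dataset before the split, so the test rows contribute to the mean and variance used for scaling. A common alternative, sketched here on synthetic data (the data and variable names are assumptions, not the notebook's), is to split first and wrap the scaler and the model in a pipeline so the scaler only ever sees the training rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the encoded features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Split first, so the test rows stay unseen during fitting
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# fit() fits the scaler on X_tr only; score() reuses those statistics on X_te
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

r2 = model.score(X_te, y_te)
n_seen = int(model.named_steps["standardscaler"].n_samples_seen_)
print("rows seen by the scaler:", n_seen)
print("held-out R^2:", round(r2, 3))
```

For the tree-based models used later in this notebook, scaling has little effect on the result, so the difference mainly matters for the linear model.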

## Model Formation and Training

### Linear Regression

In [23]:
# Training the Model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()

Linear Regression: Evaluate the model

In [24]:
y_preds= regressor.predict(X_test)
rmse= mean_squared_error(y_test, y_preds, squared=False)
print('Error = '+ str(rmse))

Error = 132.00096815465412
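
The `squared=False` argument used above depends on the installed scikit-learn version: to my knowledge it was deprecated in release 1.4 in favor of `root_mean_squared_error` and is no longer accepted in newer releases. A version-independent sketch is to compute the RMSE directly with NumPy; the helper name `rmse_score` is hypothetical.

```python
import numpy as np

def rmse_score(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Three exact predictions and one that is off by 4:
# mean squared error = 16 / 4 = 4, so the RMSE is 2
example = rmse_score([1, 2, 3, 4], [1, 2, 3, 8])
print("Error = " + str(example))
```

Because RMSE is expressed in the same units as num_sold, an error of about 132 can be compared directly with the mean of roughly 388 units reported by `df.describe()`.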


### Decision Tree Regression

In [25]:
from sklearn.tree import DecisionTreeRegressor
DTregressor = DecisionTreeRegressor(random_state = 0)
DTregressor.fit(X_train, y_train)

DecisionTreeRegressor(random_state=0)

Decision Tree: Evaluate the model

In [26]:
y_preds= DTregressor.predict(X_test)
rmse= mean_squared_error(y_test, y_preds, squared=False)
print('Error = '+ str(rmse))

Error = 68.56792585658228


### Random Forest 

In [27]:
from sklearn.ensemble import RandomForestRegressor
RFregressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
RFregressor.fit(X_train, y_train)

RandomForestRegressor(n_estimators=10, random_state=0)

Random Forest: Evaluate the model

In [28]:
y_preds= RFregressor.predict(X_test)
rmse= mean_squared_error(y_test, y_preds, squared=False)
print('Error = '+ str(rmse))

Error = 60.19449177710684


### Light gradient boosting

In [29]:
import lightgbm as lgbm  # LightGBM is a separate package, not part of sklearn

lgbm_reg = lgbm.LGBMRegressor()
lgbm_reg.fit(X_train, y_train)

LGBMRegressor()

LGBM: Evaluate the model

In [30]:
y_preds= lgbm_reg.predict(X_test)
rmse= mean_squared_error(y_test, y_preds, squared=False)
print('Error = '+ str(rmse))

Error = 55.19093158831137


### GradientBoosting Regressor

In [31]:
from sklearn.ensemble import GradientBoostingRegressor
# Note: despite the variable name, this is sklearn's GradientBoostingRegressor, not XGBoost
xgb_regressor = GradientBoostingRegressor(max_depth=10,n_estimators=100,random_state=0,learning_rate=1)
xgb_regressor.fit(X_train, y_train)

GradientBoostingRegressor(learning_rate=1, max_depth=10, random_state=0)

GradientBoosting Regressor: Model Evaluation

In [32]:
y_preds= xgb_regressor.predict(X_test)
rmse= mean_squared_error(y_test, y_preds, squared=False)
print('Error = '+ str(rmse))

Error = 52.99687184766306


### Cross Validation 

In [33]:
from sklearn.model_selection import cross_val_score

# Use sklearn for 5-fold cross validation (for regressors the default score is R^2, not RMSE)
scores_xgb= cross_val_score(xgb_regressor,X_scaled,y,cv=5)

# print the scores from different folds
print(scores_xgb)

[ 0.7052318   0.85367084  0.81354898  0.70097841 -0.85402018]
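
These scores are R^2 values, so they are not directly comparable with the RMSE figures above. The negative last fold may be related to how the folds are built: with `cv=5`, the folds are contiguous blocks without shuffling, and since the rows are ordered by date, the last fold is the final stretch of the time range. A hedged sketch of a shuffled K-fold run that reports RMSE follows; it uses synthetic data and a linear model as stand-ins, not the notebook's data or regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X_scaled and y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# Shuffled 5-fold split with a fixed seed for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# sklearn scorers follow "higher is better", so RMSE is returned negated
neg_rmse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=cv, scoring="neg_root_mean_squared_error",
)
rmse_per_fold = -neg_rmse
print(rmse_per_fold)
print("mean RMSE:", rmse_per_fold.mean())
```

Shuffling mixes past and future rows within each fold, which is the usual K-fold setup but can be optimistic for forecasting; `TimeSeriesSplit` from `sklearn.model_selection` is the alternative when the model must be evaluated only on dates later than its training data.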
