## Outline:

- 1. Dataset Observation

     
     
- 2. Exploratory Data Analysis and Cleaning
    
   - Missing Values
   - Univariate Analysis (Target)
   - Univariate Analysis (Independent Variables)
   - Multivariate Analysis
   - Outliers
   - Normalization
   - Correlation
        
    
- 3. Model Preparation

    - Split training and testing
    - Encoding
    
    
- 4. Models and Tuning / Evaluation Metrics
    
    - Regression Algorithms
    - RMSE / MSE / MAE

# 1. Data Observation

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df_train = pd.read_csv('Train.csv')

In [None]:
print("Train Shape: ", df_train.shape)

In [None]:
df_train.head()

In [None]:
df_train.dtypes

In [None]:
categorical_df = df_train.select_dtypes(include = 'object')
numerical_df = df_train.select_dtypes(exclude = 'object')

In [None]:
print(f"There are {len(categorical_df.columns)} Categorical Attributes")
print(f"There are {len(numerical_df.columns)} Numerical Attributes")

In [None]:
df_train.describe()

# 2. Exploratory Data Analysis

Exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.

We will improve our features as we go through the visualizations.

But first, let's analyze the missing values.

In [None]:
xdf = df_train.copy()

## Missing Values

In [None]:
plt.figure(figsize = (8,8))
sns.heatmap(xdf.isnull(), cbar = False);

We now know that <b>Item_Weight</b> and <b>Outlet_Size</b> contain a large number of "NaN" values, but how many?

In [None]:
## Let's list them out:

total = xdf.isnull().sum().sort_values(ascending = False)
percent = ((xdf.isnull().sum() / xdf.shape[0]) * 100).sort_values(ascending = False)
percent = np.round(percent, 3)
types = xdf[percent.index].dtypes

missing_data = pd.concat([total, percent, types], axis = 1, keys = ["Total","Percent","Type"])
missing_data.head(5)

The table reports missing values as percentages: <b>28.27%</b> of <b>Outlet_Size</b> and <b>17.16%</b> of <b>Item_Weight</b> values are missing.

### Outlet_Size 

Since this is a categorical attribute, we will impute it with the mode.

In [None]:
# Assign the result back; inplace fillna on a selected column is unreliable in recent pandas
xdf['Outlet_Size'] = xdf['Outlet_Size'].fillna(xdf['Outlet_Size'].mode()[0])

### Item_Weight 

It is a numeric variable, so we will replace missing values with the <b>median</b>.

In [None]:
# Assign the result back; inplace fillna on a selected column is unreliable in recent pandas
xdf['Item_Weight'] = xdf['Item_Weight'].fillna(xdf['Item_Weight'].median())

### Let's confirm the imputation

In [None]:
xdf.isnull().sum()
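
As a minimal sketch of the two imputation steps above, the toy frame below fills a categorical column with its mode and a numeric column with its median. The values are made up for illustration and are not taken from `Train.csv`.

```python
import numpy as np
import pandas as pd

# Toy data; these values are illustrative and not taken from Train.csv
toy = pd.DataFrame({
    "Outlet_Size": ["Medium", "Small", np.nan, "Medium"],
    "Item_Weight": [9.3, np.nan, 17.5, 5.9],
})

# Categorical column: fill with the most frequent value (mode)
toy["Outlet_Size"] = toy["Outlet_Size"].fillna(toy["Outlet_Size"].mode()[0])

# Numeric column: fill with the median, which is less sensitive to extreme values
toy["Item_Weight"] = toy["Item_Weight"].fillna(toy["Item_Weight"].median())

print(toy)
print(toy.isnull().sum())
```

Assigning the filled column back to the frame avoids relying on `inplace = True` for a selected column.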

## Univariate Analysis

Starting with the analysis of the <b>Target Attribute</b>.

In [None]:
xdf['Item_Outlet_Sales'].describe()

<b>Let's check the distribution of the Target Attribute</b>


In [None]:
plt.figure(figsize = (10,6))
sns.histplot(data = xdf, x = 'Item_Outlet_Sales', kde = True);

In [None]:
## let's confirm the outliers

plt.figure(figsize = (10,8))
sns.boxplot( x = 'Item_Outlet_Sales', data = xdf);

As we can see, it is positively skewed and also contains some outliers. First, let's remove the outliers.

In [None]:
## First we will remove the outliers from this attribute
## Function to remove outliers using the 1.5 * IQR rule

def remove_outliers(dataframe, column):
    """Return the rows whose `column` value lies strictly inside the IQR fences."""
    Q3 = dataframe[column].quantile(0.75)
    Q1 = dataframe[column].quantile(0.25)

    IQR = Q3 - Q1

    # Fences sit 1.5 * IQR beyond the quartiles
    upper = Q3 + (1.5 * IQR)
    lower = Q1 - (1.5 * IQR)

    df_no_outlier = dataframe[(dataframe[column] > lower) & (dataframe[column] < upper)]

    return df_no_outlier
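
To make the fences concrete, here is a small worked example on made-up numbers (the toy column `sales` is an assumption for illustration). With quartiles Q1 = 2 and Q3 = 4, the IQR is 2, so the fences sit at -1 and 7 and the value 100 is dropped.

```python
import pandas as pd

# Toy values chosen for illustration; 100 is an obvious outlier
toy = pd.DataFrame({"sales": [1, 2, 3, 4, 100]})

q1 = toy["sales"].quantile(0.25)   # 2.0
q3 = toy["sales"].quantile(0.75)   # 4.0
iqr = q3 - q1                      # 2.0

lower = q1 - 1.5 * iqr             # -1.0
upper = q3 + 1.5 * iqr             # 7.0

# Same strict inequalities as in remove_outliers
kept = toy[(toy["sales"] > lower) & (toy["sales"] < upper)]
print(lower, upper)
print(kept["sales"].tolist())
```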

In [None]:
# Removing outliers from Item_Outlet_Sales

xdf = remove_outliers(xdf, "Item_Outlet_Sales")

In [None]:
# Quickly checking the result in boxplot

plt.figure(figsize = (8,8))
sns.boxplot(x = 'Item_Outlet_Sales', data = xdf);

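Note that transforming the <b>target attribute</b> changes the scale on which <b>RMSE</b> is computed: the cells below apply a square-root transformation to reduce skewness, so any RMSE calculated on the transformed values is not in the original sales units.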

In [None]:
## Function for fixing positive skewness
def sqrt_transformation(dataframe):
    return np.sqrt(dataframe)

In [None]:
xdf['Item_Outlet_Sales'] = xdf['Item_Outlet_Sales'].map(sqrt_transformation)

In [None]:
# After fixing skewness

plt.figure(figsize = (10,6))
sns.histplot(data = xdf, x = 'Item_Outlet_Sales', kde = True);
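
Because the target is now on a square-root scale, a model trained on it predicts square roots of sales. One way to report errors in original sales units, sketched here under that assumption, is to square the predictions before computing RMSE. The arrays below are made-up numbers, not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only: true sales and predictions made on the sqrt scale
y_true_sales = np.array([100.0, 400.0, 900.0])
pred_sqrt = np.array([11.0, 19.0, 31.0])

# RMSE on the transformed scale
rmse_sqrt_scale = np.sqrt(mean_squared_error(np.sqrt(y_true_sales), pred_sqrt))

# Undo the transformation by squaring, then compute RMSE in original units
rmse_original_units = np.sqrt(mean_squared_error(y_true_sales, pred_sqrt ** 2))

print("RMSE (sqrt scale):", rmse_sqrt_scale)
print("RMSE (original units):", rmse_original_units)
```

The two numbers differ by more than an order of magnitude here, which is why the scale should be stated whenever an RMSE is reported.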

## Univariate Analysis (Independent Variables)

In [None]:
xxdf = xdf.copy()

In [None]:
numerical_df.columns

In [None]:
for i in numerical_df:
    sns.displot(data = xxdf, x = i, kde = True, aspect = 2, height = 6);
    plt.xlabel(i, fontsize = 12)

Let's take note of which <b>features</b> are skewed.

In [None]:
# Checking for outliers

for i in numerical_df:
    plt.figure(figsize =(8,6))
    sns.scatterplot(data = xxdf, y = xxdf.index, x = i);
    plt.xlabel(i, fontsize = 12)

Also, let's take note of which <b>attributes</b> contain outliers.

In [None]:
# Confirming the outliers

for i in numerical_df:
    plt.figure(figsize =(8,6))
    sns.boxplot(data = xxdf, y = i);
    plt.ylabel(i, fontsize = 12)

<b> Observations: </b>

Item_Visibility contains outliers and is also positively skewed. Let's fix this.

In [None]:
sns.displot(data = xxdf, x = 'Item_Visibility', kde = True, aspect = 2, height = 6);

It also contains 0 values; let's fix that too.

In [None]:
## First, replacing the strange '0' values with the median

xxdf['Item_Visibility'] = xxdf['Item_Visibility'].replace(0, xxdf['Item_Visibility'].median())

In [None]:
# Treating positive skewness

xxdf['Item_Visibility'] = xxdf["Item_Visibility"].map(sqrt_transformation)

In [None]:
# Removing Outliers

xxdf = remove_outliers(xxdf, "Item_Visibility")

In [None]:
# After removing skewness and fixing outliers on the training set

sns.displot(x = 'Item_Visibility', data = xxdf, aspect = 2, height = 6, kde = True);
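
The effect of the square-root transformation can also be checked numerically rather than only by eye. The sketch below uses a synthetic right-skewed sample as a stand-in for Item_Visibility (an assumption, not the real column) and compares the skewness before and after.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (exponential), standing in for Item_Visibility
rng = np.random.default_rng(0)
sample = pd.Series(rng.exponential(scale=1.0, size=1000))

skew_before = sample.skew()
skew_after = np.sqrt(sample).skew()

print("Skewness before:", skew_before)
print("Skewness after sqrt:", skew_after)
```

A value closer to 0 after the transformation indicates a more symmetric distribution.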

### Bivariate Analysis

First, let's look at scatter plots of all the <b>numerical variables</b> against <b>Item_Outlet_Sales</b>.

In [None]:
for i in numerical_df:
    plt.figure(figsize =(8,6))
    sns.scatterplot(data = xxdf, x = i, y = xxdf['Item_Outlet_Sales']);
    plt.xlabel(i, fontsize = 12)
    plt.ylabel("Sales")

We observe that <b>Item_MRP</b> has a linear relationship with <b>Item_Outlet_Sales</b>.

In [None]:
bi_df = xxdf.copy()

### Bivariate Analysis (Categorical)

In [None]:
categorical_df.columns

<b>Countplot</b>

In [None]:
for i in categorical_df:
    plt.figure(figsize = (10,8))
    sns.countplot( y = i, data = bi_df);   

<b> Observations: </b>

- Item_Identifier: There are a lot of individual item identifiers.
- Item_Fat_Content: The same categories appear under multiple spellings; let's fix it.
- Fruits & Veggies, Frozen Food, Dairy, Household and Snacks have the highest counts.
- Supermarkets have a higher number of counts.

First, let's fix <b>Item_Fat_Content</b>.

In [None]:
bi_df['Item_Fat_Content'].unique()

In [None]:
bi_df['Item_Fat_Content'] = bi_df['Item_Fat_Content'].map({"low fat": "Low Fat",
                                                           "Low Fat": "Low Fat",
                                                           "LF": "Low Fat",
                                                           "Regular": "Regular",
                                                           "reg": "Regular"})

In [None]:
bi_df['Item_Fat_Content'].value_counts()
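
One thing to keep in mind with `map` and a dictionary is that any label missing from the dictionary becomes `NaN`. A possible alternative is `replace`, which leaves unlisted labels untouched. The toy series below, including the extra label `"Unknown"`, is made up for illustration.

```python
import pandas as pd

toy = pd.Series(["low fat", "LF", "reg", "Regular", "Unknown"])
fixes = {"low fat": "Low Fat", "LF": "Low Fat", "reg": "Regular"}

mapped = toy.map(fixes)        # labels not in the dict turn into NaN
replaced = toy.replace(fixes)  # labels not in the dict are kept as-is

print(mapped.tolist())
print(replaced.tolist())
```

With `map`, every valid label therefore has to appear in the dictionary, as it does in the cell above.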

<b> In terms of Sales? </b>


In [None]:
for i in categorical_df:
    plt.figure(figsize = (10,8))
    sns.boxplot( y = i, x = bi_df['Item_Outlet_Sales'],data = bi_df);

<b> Observations: </b>
- In terms of 'Outlet_Type', supermarkets have the highest demand (Type1 and Type3).
- Starchy Foods, Dairy, Fruits & Vegetables and Household have the highest sales, but most categories are roughly equal in terms of overall sales.


## Skewness on Numbers

In [None]:
for i in numerical_df:
    print("\n")
    print(i)
    print("-" * 20)
    print("Skewness: %f" % bi_df[i].skew())
    print("Kurtosis: %f" % bi_df[i].kurt())
    print("-" * 20)

In [None]:
tf_df = bi_df.copy()

## Dataset Transformation

In [None]:
categorical_df.columns

### Label Encoding 

Let's encode all the categorical values and check the correlation of all the values with 'Item_Outlet_Sales'.

In [None]:
from sklearn import preprocessing

In [None]:
label_encoder = preprocessing.LabelEncoder()

In [None]:
categorical_df = tf_df.select_dtypes(include = 'object')

In [None]:
label_df = tf_df.copy()
for i in categorical_df:
    label_df[i] = label_encoder.fit_transform(tf_df[i])
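
As a small illustration of what `LabelEncoder` does, the sketch below encodes a made-up list of outlet sizes. The integer codes follow the sorted order of the labels, so they carry no meaningful ranking.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["Small", "Medium", "High", "Medium"])

# Classes are stored in sorted order; each code is an index into this array
print(list(encoder.classes_))
print(codes.tolist())
```

Because the same encoder object is refitted for every column in the loop above, only the mapping of the last column remains available through `classes_`.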

## Correlation

In [None]:
corrmat = label_df.corr()
f, ax = plt.subplots(figsize = (20,9))
sns.heatmap(corrmat, vmax = .8, annot = True)
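
A heatmap of this size can be hard to read. A complementary view, sketched here on synthetic data, is to pull out the target's column from the correlation matrix and sort it by absolute value. The toy frame and its coefficients are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: sales depend on Item_MRP but not on Item_Weight
rng = np.random.default_rng(0)
mrp = rng.uniform(30, 270, size=200)
toy = pd.DataFrame({
    "Item_MRP": mrp,
    "Item_Weight": rng.uniform(5, 20, size=200),
    "Item_Outlet_Sales": 15 * mrp + rng.normal(0, 300, size=200),
})

# Correlation of every feature with the target, strongest first
target_corr = toy.corr()["Item_Outlet_Sales"].drop("Item_Outlet_Sales")
print(target_corr.sort_values(key=abs, ascending=False))
```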

### Dropping unrelated Columns

In [None]:
drop_columns = ['Item_Visibility','Outlet_Size','Outlet_Establishment_Year','Outlet_Type','Item_Weight','Item_Identifier']

tf_df.drop(drop_columns, axis =1 , inplace = True) 

In [None]:
tf_df

### One Hot Encoding

In [None]:
tf_df = pd.get_dummies(tf_df)

In [None]:
tf_df
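
To show what `pd.get_dummies` does to a frame, here is a minimal example on made-up rows: numeric columns pass through unchanged, and each category of an object column becomes its own indicator column.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 141.6],
    "Outlet_Location_Type": ["Tier 1", "Tier 3", "Tier 1"],
})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```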

### Preparing the Dataset

In [None]:
X = tf_df.drop(['Item_Outlet_Sales'], axis = 1)
y = tf_df['Item_Outlet_Sales']

### Scaling the Dataset

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

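X1 = scaler.fit_transform(X)
# Keep the original index so the scaled features stay aligned with y
X_scaled = pd.DataFrame(data = X1, columns = X.columns, index = X.index)

In [None]:
X_scaled.head()

### Splitting Dataset

In [None]:
from sklearn.model_selection import train_test_split
# Split the scaled features so the scaling step above takes effect
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2, random_state = 101)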

X_train.shape, X_test.shape, y_train.shape, y_test.shape
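
A common alternative to scaling the full dataset, sketched here on synthetic data, is to split first and let a pipeline fit the scaler on the training rows only, so that test-set statistics do not influence the scaling. `X_demo` and `y_demo` are made-up stand-ins for `X` and `y`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y; not the notebook's data
rng = np.random.default_rng(101)
X_demo = rng.normal(loc=50, scale=10, size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=101)

# The scaler is fitted on the training rows only, then reused on the test rows
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

print("Test R2:", model.score(X_te, y_te))
```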

# Modeling and Evaluation Metrics

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_squared_log_error, make_scorer, mean_absolute_error
import math


# Note: the `normalize` argument is not available in scikit-learn >= 1.2
lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
lr_predict = lr.predict(X_test)

In [None]:
print("R2 Score:", r2_score(y_test, lr_predict))
print("Mean Squared Error:", mean_squared_error(y_test, lr_predict))
print("RMSE:", math.sqrt(mean_squared_error(y_test, lr_predict)))
print("Mean Absolute Error : " + str(mean_absolute_error(y_test,lr_predict)))
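
The same four metrics are printed for every model in this section. A small helper can cut the repetition; `regression_report` below is a hypothetical name introduced for this sketch, and the numbers it is called with are made up.

```python
import math
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Hypothetical helper: return the four metrics used in this notebook."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "R2": r2_score(y_true, y_pred),
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": mean_absolute_error(y_true, y_pred),
    }

# Tiny made-up example
report = regression_report([3, 5, 7], [2, 5, 9])
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

It could then be called as `regression_report(y_test, lr_predict)` for each fitted model.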

### XGBOOST REGRESSOR

In [None]:
from sklearn.model_selection import KFold, cross_val_score
from xgboost import XGBRegressor

xgb = XGBRegressor(n_estimators = 1000, learning_rate = 0.05)
xgb.fit(X_train, y_train)

predict = xgb.predict(X_test)


In [None]:
print("R2 Score:", r2_score(y_test, predict))
print("Mean Squared Error:", mean_squared_error(y_test, predict))
print("RMSE:", math.sqrt(mean_squared_error(y_test, predict)))
print("Mean Absolute Error : " + str(mean_absolute_error(y_test,predict)))
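
`KFold` and `cross_val_score` are imported above but not used. As a hedged sketch of how they could give a more stable error estimate than a single train/test split, the example below runs 5-fold cross-validation on synthetic data with a plain linear model; the data and the choice of estimator are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(0, 0.1, size=150)

cv = KFold(n_splits=5, shuffle=True, random_state=101)
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         cv=cv, scoring="neg_root_mean_squared_error")

# scikit-learn negates error metrics so that higher is always better
rmse_per_fold = -scores
print("RMSE per fold:", rmse_per_fold)
print("Mean RMSE:", rmse_per_fold.mean())
```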

### LASSO REGRESSOR

In [None]:
from sklearn.linear_model import Lasso

In [None]:
ls = Lasso(alpha = 0.01)
ls.fit(X_train, y_train)

In [None]:
lasso_pred = ls.predict(X_test)

In [None]:
print("R2 Score:", r2_score(y_test, lasso_pred))
print("Mean Squared Error:", mean_squared_error(y_test, lasso_pred))
print("RMSE:", math.sqrt(mean_squared_error(y_test, lasso_pred)))
print("Mean Absolute Error : " + str(mean_absolute_error(y_test,lasso_pred)))

### LGBMRegressor

In [None]:
from lightgbm import LGBMRegressor

In [None]:
lgbm = LGBMRegressor()

In [None]:
lgbm.fit(X_train, y_train)

In [None]:
lgbm_pred = lgbm.predict(X_test)

In [None]:
print("R2 Score:", r2_score(y_test, lgbm_pred))
print("Mean Squared Error:", mean_squared_error(y_test, lgbm_pred))
print("RMSE:", math.sqrt(mean_squared_error(y_test, lgbm_pred)))
print("Mean Absolute Error : " + str(mean_absolute_error(y_test,lgbm_pred)))

### RandomForestRegressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf = RandomForestRegressor(n_estimators = 50, max_depth = 15, random_state = 47, min_samples_leaf = 10)

In [None]:
rf.fit(X_train, y_train)

In [None]:
rf_pred = rf.predict(X_test)

In [None]:
print("R2 Score:", r2_score(y_test, rf_pred))
print("Mean Squared Error:", mean_squared_error(y_test, rf_pred))
print("RMSE:", math.sqrt(mean_squared_error(y_test, rf_pred)))
print("Mean Absolute Error : " + str(mean_absolute_error(y_test,rf_pred)))

### DecisionTreeRegressor

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

In [None]:
dt = DecisionTreeRegressor()

param_dist = {
    'max_depth': [2, 5, 10, 25, 30, 40, 50],
}

dt_gs = GridSearchCV(dt, param_grid = param_dist, cv = 6)
dt_gs.fit(X_train, y_train)

dt_predict = dt_gs.predict(X_test)

In [None]:
print("R2 Score:", r2_score(y_test, dt_predict))
print("Mean Squared Error:", mean_squared_error(y_test, dt_predict))
print("RMSE:", math.sqrt(mean_squared_error(y_test, dt_predict)))
print("Mean Absolute Error : " + str(mean_absolute_error(y_test,dt_predict)))
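
After a grid search, it is usually worth checking which setting was selected and how it scored during cross-validation. The sketch below shows the relevant attributes on synthetic data; the data and the shortened grid are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=200)

search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid={"max_depth": [2, 5, 10]}, cv=3)
search.fit(X_demo, y_demo)

# best_params_ holds the winning setting; best_score_ is its mean CV score (R2 by default)
print("Best max_depth:", search.best_params_["max_depth"])
print("Best CV R2:", search.best_score_)
```

For the search above, the same information is available as `dt_gs.best_params_` and `dt_gs.best_score_`.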