# Prediction of sales

### Problem Statement
This dataset contains sales data for 1559 products across 10 stores in different cities, along with certain attributes of each product and store. The aim is to build a predictive model that estimates the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Note that the data may contain missing values, since some stores might not report all of their data because of technical glitches. These values need to be treated before modelling.
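As a minimal sketch of what treating missing values can look like, the snippet below counts the gaps per column and fills them with the median (numeric) or the mode (categorical). The toy values are made up for illustration, and median/mode imputation is only one possible choice, not necessarily the one used to produce `fixed_data.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2, np.nan],
    "Outlet_Size": ["Medium", None, "Small", "Medium", "High"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9],
})

# Count missing values per column.
missing_counts = toy.isna().sum()
print(missing_counts)

# One simple treatment: median for numeric columns, most frequent value for categorical ones.
filled = toy.copy()
filled["Item_Weight"] = filled["Item_Weight"].fillna(filled["Item_Weight"].median())
filled["Outlet_Size"] = filled["Outlet_Size"].fillna(filled["Outlet_Size"].mode()[0])
print(filled)
```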



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

In [114]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from sklearn.metrics import r2_score

We covered data preparation and feature engineering two weeks ago. Now it's time to build some predictive models.

## Model Building

## Task
Make a baseline model. A baseline sets a benchmark against which to gauge the performance of later models. If a new model scores below the baseline, something has gone wrong, and you should check your data.

To make a baseline model, run a simple regression model without altering the default parameters in sklearn. 

In [115]:
df = pd.read_csv("fixed_data.csv")

In [125]:
# TODO: swap the dummy variables for one-hot encoding; for features with more than 6 levels,
#       consider label encoding, or use the correlation matrix to decide which columns to remove.
# TODO: try binning variables such as Years_Opened (and anything age-like) into categories.
# TODO: check whether each variable is linearly related to the target; take a log if it is skewed,
#       or use a polynomial fit if the relationship is non-linear.


print(df.shape)
df.head()

y = df.Item_Outlet_Sales
X = df.drop(columns=['Item_Outlet_Sales', 'Outlet_Establishment_Year', 'Unnamed: 0'])
# X.columns
# X.head()

(8523, 39)


### SPLIT data

In [126]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=101)

### SCALE

In [122]:
# # print out the numeric columns and categorical columns as numeric_cols and cat_cols 

# numeric_cols = ['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Years_Opened']
# cat_cols = list(set(X.columns) - set(numeric_cols))
# cat_cols.sort()


In [124]:
# # FROM EXERCISE
# scaler = StandardScaler()
# scaler.fit(X_train[numeric_cols])

# def get_features_and_target_arrays(df, numeric_cols, cat_cols, scaler):
#     X_numeric_scaled = scaler.transform(df[numeric_cols])
#     X_categorical = df[cat_cols].to_numpy()
#     X = np.hstack((X_categorical, X_numeric_scaled))
# #     y = df['target']
#     return X #, y

# X_train= get_features_and_target_arrays(X_train, numeric_cols, cat_cols, scaler)
# # X
# X_train.shape
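
The commented-out helper above fits the scaler on the training rows and transforms them by hand. A minimal alternative sketch, assuming the same idea of scaling only the numeric columns, wraps the scaler and the model in a `Pipeline` so the test set is transformed with the training statistics automatically. The toy data and the `Outlet_Type_Grocery` dummy column are hypothetical stand-ins, not columns confirmed to exist in `fixed_data.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (values are invented).
rng = np.random.default_rng(0)
n = 200
toy_X = pd.DataFrame({
    "Item_Weight": rng.uniform(5, 20, n),
    "Item_MRP": rng.uniform(30, 260, n),
    "Outlet_Type_Grocery": rng.integers(0, 2, n),  # hypothetical dummy column
})
toy_y = 15 * toy_X["Item_MRP"] - 800 * toy_X["Outlet_Type_Grocery"] + rng.normal(0, 100, n)

numeric_cols = ["Item_Weight", "Item_MRP"]

# Scale only the numeric columns; dummy columns pass through unchanged.
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), numeric_cols)],
    remainder="passthrough",
)
pipe = Pipeline([("prep", preprocess), ("model", LinearRegression())])

Xtr, Xte, ytr, yte = train_test_split(toy_X, toy_y, test_size=0.2, random_state=101)
pipe.fit(Xtr, ytr)              # scaler statistics come from the training rows only
test_r2 = pipe.score(Xte, yte)  # test rows are scaled with those same statistics
print("test R^2:", test_r2)
```

For plain least squares with an intercept, scaling the features does not change the predictions; it matters once a penalty such as Ridge or Lasso is added, because the penalty depends on the size of the coefficients.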

In [127]:
# Open question: do the features need to be scaled first? (I did scale them earlier.)
# The target y was not scaled either; check whether that is needed.

from sklearn.linear_model import LinearRegression, Ridge

# creating linear regression
lr = LinearRegression()
lr.fit(X_train,y_train)
y_lr_baseline = lr.predict(X_test)

In [129]:
from sklearn.metrics import mean_squared_error
MSE_lr_baseline = mean_squared_error(y_test,y_lr_baseline) 
r2_baseline = r2_score(y_test,y_lr_baseline)
print("MSE_lr", MSE_lr_baseline)
print("r2_baseline", r2_baseline)

MSE_lr 1204691.7533316647
r2_baseline 0.557970448584349
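
On the question of whether the target needs scaling: a linear rescaling of `y` does not change the R² of a least-squares fit, but a non-linear transform such as a log can help when the target is skewed. The sketch below is an illustration on synthetic data, not the notebook's method. It uses `TransformedTargetRegressor` so the model is fit on `log1p(y)` and predictions come back on the original scale.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic, right-skewed target (invented for illustration).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = np.exp(1.0 + 0.5 * X_toy[:, 0] + rng.normal(0, 0.1, 200))

# Fit on log1p(y); predict() applies expm1 so outputs are on the original scale.
log_lr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_lr.fit(X_toy, y_toy)
pred_toy = log_lr.predict(X_toy)
r2_toy = r2_score(y_toy, pred_toy)
print("R^2 on the original scale:", r2_toy)
```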


## Task
Split your data into an 80% train set and a 20% test set.

In [43]:
# See above

## Task
Use grid_search to find the best value of the parameter `alpha` for Ridge and Lasso regressions from `sklearn`.

#### Ridge

In [None]:
# How do we get the cross-validation score for each alpha? Or do we only need the best one? Does GridSearchCV average across the folds for us?

In [136]:

model = Ridge()
param_grid = {
    #     'alpha': np.arange(0, 1, 0.1)
    'alpha': np.arange(10, 20, 0.1)
#     'alpha' : [1,0.1,0.01,0.001,0.0001,0]
}

grid_search_ridge = GridSearchCV(model, param_grid, cv=10)
grid_search_ridge.fit(X_train, y_train)

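# Results
# Note: .score(X_train, y_train) is the R^2 of the refit best model on the training data;
# the mean cross-validated score of the best alpha is grid_search_ridge.best_score_.
# The best alpha sits on the lower edge of the grid, so a wider grid may be worth trying.
print("Best R_squared from grid search: %.3f"
       % grid_search_ridge.score(X_train, y_train))
print(grid_search_ridge.best_params_)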



Best R_squared from grid search: 0.564
{'alpha': 10.0}
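
To see the cross-validation score of every candidate rather than only the winner, `GridSearchCV` stores per-candidate results in `cv_results_`, where `mean_test_score` is the average over the folds. The sketch below runs on synthetic data from `make_regression`, and the alpha grid is an arbitrary example, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the sales data.
X_toy, y_toy = make_regression(n_samples=200, n_features=10, noise=20.0, random_state=0)

search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1, 10, 100]}, cv=5, scoring="r2")
search.fit(X_toy, y_toy)

# One row per alpha: the fold-averaged score, its spread, and its rank.
results = pd.DataFrame(search.cv_results_)[
    ["param_alpha", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))

# best_score_ is the mean cross-validated score of the best alpha,
# which is not the same quantity as the training-set score of the refit model.
print("Mean CV R^2 of best alpha:", search.best_score_)
print("Training R^2 of refit best model:", search.score(X_toy, y_toy))
```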


In [137]:
# use alpha found to fit ridge
rr = Ridge(alpha=10)
rr.fit(X_train,y_train)
y_rr = rr.predict(X_test)

#### Lasso

In [138]:
# Lasso lives in sklearn.linear_model, the same module as Ridge
from sklearn.linear_model import Lasso

model = Lasso()
param_grid = {
#     'alpha': np.arange(0.1, 40, 0.4)
    'alpha': np.arange(11, 20, 0.1)
}

lasso_grid_search = GridSearchCV(model, param_grid, cv=10)
lasso_grid_search.fit(X_train, y_train)

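# Note: this is the training-set R^2 of the refit best model;
# the mean cross-validated score is lasso_grid_search.best_score_.
print("Best R_squared from grid search: %.3f"
       % lasso_grid_search.score(X_train, y_train))
print(lasso_grid_search.best_params_)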

Best R_squared from grid search: 0.561
{'alpha': 11.0}


In [139]:
# use alpha found to fit lasso
# lasso = Lasso(alpha=14.1)
lasso = Lasso(alpha=11.0)
lasso.fit(X_train,y_train)
y_lasso = lasso.predict(X_test)

#### Louis' solution from Slack

In [105]:
paramgrid = {
    'alpha': [0.001, 0.01, 0.1, 1]
}
n = 5

model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=paramgrid, cv=n, scoring='r2', verbose=1, n_jobs=-1)
grid_result = grid.fit(X_train,y_train)

best_r2 = grid_result.best_score_
best_alpha = grid_result.best_params_['alpha']

Fitting 5 folds for each of 4 candidates, totalling 20 fits


In [106]:
best_model = grid_result.best_estimator_

In [107]:
from sklearn.metrics import r2_score
y_pred = best_model.predict(X_test)
r2_test = r2_score(y_test, y_pred)

print(best_model, y_pred, r2_test)

Ridge(alpha=1) [   32.91131306 -2574.16607316  1239.05041471 ... -3710.78194219
   -56.19594612 -1033.88942892] -6.157751680711613




## Task
Using the model from grid_search, predict the values in the test set and compare against your benchmark.

In [110]:
y_pred = best_model.predict(X_test)
r2_test = r2_score(y_test, y_pred)



In [141]:
from sklearn.metrics import r2_score
r2_ridge = r2_score(y_test, y_rr)
r2_lasso = r2_score(y_test, y_lasso)
r2_baseline = r2_score(y_test, y_lr_baseline)


print('r2_baseline', r2_baseline)
print('r2_lasso', r2_lasso)
print('r2_ridge', r2_ridge)

# print('r2_test_ridge', r2_test)

r2_baseline 0.557970448584349
r2_lasso 0.5569192992970409
r2_ridge 0.5582320311173267


In [112]:
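from sklearn.metrics import mean_squared_error
# MSE = mean_squared_error(y_true, y_pred)
# Compare against y_test: the predictions cover only the test rows, so passing
# the full y would fail on a length mismatch.
# The values printed below appear to come from an earlier run of this cell and
# do not match the baseline MSE reported above.
MSE_ridge = mean_squared_error(y_test, y_rr)
MSE_lasso = mean_squared_error(y_test, y_lasso)
MSE_baseline = mean_squared_error(y_test, y_lr_baseline)
print('MSE_ridge', MSE_ridge)
print('MSE_lasso', MSE_lasso)
print('MSE_baseline', MSE_baseline)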

MSE_ridge 1271132.6817950106
MSE_lasso 1282119.3802162975
MSE_baseline 1270203.416320679
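
The comparison can also be collected in one loop so that every metric is computed against the same test labels. This is a minimal sketch on synthetic data; the alpha values mirror the ones found above, but the scores it prints say nothing about the sales dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for X and y.
X_toy, y_toy = make_regression(n_samples=300, n_features=8, noise=25.0, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=101)

models = {
    "baseline": LinearRegression(),
    "ridge": Ridge(alpha=10),
    "lasso": Lasso(alpha=11.0),
}

scores = {}
for name, model in models.items():
    pred = model.fit(Xtr, ytr).predict(Xte)
    mse = mean_squared_error(yte, pred)  # always against the test labels
    scores[name] = {"r2": r2_score(yte, pred), "mse": mse, "rmse": np.sqrt(mse)}

for name, s in scores.items():
    print(f"{name:>8}  r2={s['r2']:.4f}  mse={s['mse']:.1f}  rmse={s['rmse']:.1f}")
```

RMSE is in the same units as `Item_Outlet_Sales`, which can make it easier to read than MSE.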
