# Prediction of sales

### Problem Statement
The dataset represents sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store are available. The aim is to build a predictive model and find out the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Please note that the data may have missing values, as some stores might not report all the data due to technical glitches. These missing values need to be handled before modeling.
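As a minimal sketch of how missing values could be inspected and treated, the snippet below uses a small hypothetical frame. Only the column names come from the table above; the rows are invented for illustration. The two strategies shown (group mean for a numeric column, mode for a categorical one) are one reasonable choice, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names come from the dataset description.
toy = pd.DataFrame({
    "Item_Type": ["Dairy", "Dairy", "Snacks", "Snacks", "Dairy"],
    "Item_Weight": [9.3, np.nan, 5.0, 7.0, 10.7],
    "Outlet_Size": ["Medium", None, "Small", "Small", None],
})

# Count missing values per column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Numeric column: fill with the mean weight of the same Item_Type.
type_mean = toy.groupby("Item_Type")["Item_Weight"].transform("mean")
toy["Item_Weight"] = toy["Item_Weight"].fillna(type_mean)

# Categorical column: fill with the most frequent value (the mode).
toy["Outlet_Size"] = toy["Outlet_Size"].fillna(toy["Outlet_Size"].mode()[0])
```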



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data
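The exploration and cleaning stages can be sketched on a small hypothetical frame. The inconsistent `Item_Fat_Content` spellings are the ones this notebook cleans up later, while the rows themselves are invented for illustration.

```python
import pandas as pd

# Hypothetical toy rows; the label variants mirror those found in the real data.
toy = pd.DataFrame({
    "Item_Fat_Content": ["Low Fat", "LF", "reg", "Regular", "low fat"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9],
})

# Continuous features: count, mean, standard deviation, quartiles, min and max.
numeric_summary = toy.describe()
print(numeric_summary)

# Categorical features: label frequencies expose inconsistent spellings.
label_counts = toy["Item_Fat_Content"].value_counts()
print(label_counts)

# Cleaning: map the variants onto two canonical labels.
toy["Item_Fat_Content"] = toy["Item_Fat_Content"].replace(
    {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
)
clean_counts = toy["Item_Fat_Content"].value_counts()
print(clean_counts)
```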

In [1]:
import numpy as np
import pandas as pd 
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer 
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline 
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score


In [2]:
df = pd.read_csv('regression_exercise.csv')

In [3]:
from pandas_profiling import ProfileReport
prof = ProfileReport(df)
prof.to_file(output_file='output.html')

In [4]:
data = df.copy()
data['Item_Identifier_Type'] = data.Item_Identifier.astype(str).str[:2]

In [5]:
data['current_year'] = 2021
data['Operation_Year'] = data.current_year - data.Outlet_Establishment_Year

In [6]:
data.loc[data['Item_Fat_Content'] == 'LF', 'Item_Fat_Content'] = 'Low Fat'
data.loc[data['Item_Fat_Content'] == 'low fat', 'Item_Fat_Content'] = 'Low Fat'
data.loc[data['Item_Fat_Content'] == 'reg', 'Item_Fat_Content'] = 'Regular'

In [7]:
data = data.replace({'Outlet_Location_Type': {'Tier 1': 1, 'Tier 2': 2, 'Tier 3': 3}})

In [8]:
data["Outlet_Size"] = data["Outlet_Size"].fillna('Medium')

In [9]:
data.loc[data['Item_Identifier_Type'] == 'NC', 'Item_Fat_Content'] = 'None'

In [10]:
# Fill each missing Item_Weight with the mean weight of its Item_Type
# (the group mean ignores missing values by default).
type_mean_weight = data.groupby('Item_Type')['Item_Weight'].transform('mean')
data['Item_Weight'] = data['Item_Weight'].fillna(type_mean_weight)

In [11]:
data2 = data[[ 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_MRP','Outlet_Size',
       'Outlet_Type', 'Item_Outlet_Sales', 'Item_Identifier_Type', 'Operation_Year','Outlet_Location_Type']]

In [12]:
data2

| |Item_Weight|Item_Fat_Content|Item_Visibility|Item_MRP|Outlet_Size|Outlet_Type|Item_Outlet_Sales|Item_Identifier_Type|Operation_Year|Outlet_Location_Type|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|9.300|Low Fat|0.016047|249.8092|Medium|Supermarket Type1|3735.1380|FD|22|1|
|1|5.920|Regular|0.019278|48.2692|Medium|Supermarket Type2|443.4228|DR|12|3|
|2|17.500|Low Fat|0.016760|141.6180|Medium|Supermarket Type1|2097.2700|FD|22|1|
|3|19.200|Regular|0.000000|182.0950|Medium|Grocery Store|732.3800|FD|23|3|
|4|8.930| |0.000000|53.8614|High|Supermarket Type1|994.7052|NC|34|3|
|...|...|...|...|...|...|...|...|...|...|...|
|8518|6.865|Low Fat|0.056783|214.5218|High|Supermarket Type1|2778.3834|FD|34|3|
|8519|8.380|Regular|0.046982|108.1570|Medium|Supermarket Type1|549.2850|FD|19|2|
|8520|10.600| |0.035186|85.1224|Small|Supermarket Type1|1193.1136|NC|17|2|
|8521|7.210|Regular|0.145221|103.1332|Medium|Supermarket Type2|1845.5976|FD|12|3|


We covered data preparation and feature engineering two weeks ago. Now it's time to build some predictive models.

## Model Building

## Task
Make a baseline model. A baseline model requires no predictive algorithm; it is an informed guess. For instance, predict the sales as the overall average sales, or simply as zero.
Baseline models set a benchmark. If your predictive algorithm performs worse than the baseline, something is seriously wrong and you should check your data.
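One way to build such a mean baseline is scikit-learn's `DummyRegressor`. The sketch below runs on hypothetical toy targets standing in for `Item_Outlet_Sales`; the arrays and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical toy targets standing in for Item_Outlet_Sales.
y_train_toy = np.array([100.0, 200.0, 300.0, 400.0])
y_test_toy = np.array([150.0, 350.0])

# DummyRegressor ignores the features, so placeholder zeros are enough.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

# "Learn" the training mean and predict it for every test row.
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train_toy, y_train_toy)
baseline_pred = baseline.predict(X_test_toy)

baseline_mae = mean_absolute_error(y_test_toy, baseline_pred)
print(f"Baseline MAE: {baseline_mae:.2f}")
```

Because the baseline exposes the same `fit`/`predict` interface as any other estimator, it can be scored with exactly the same code as the real models.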

In [13]:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures 
from sklearn.metrics import r2_score

In [14]:
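# Note: X_train, X_test, y_train and y_test are created in the train/test split
# and preprocessing cells further down; run those cells first.
reg = LinearRegression()
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)

r2_score(y_test, y_pred)
#mean_absolute_error(y_test, y_pred)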

NameError: name 'X_train' is not defined

In [93]:
# Fit the polynomial expansion on the training data only, then reuse it for the test data
poly = PolynomialFeatures(degree=2)
Xpoly_train = poly.fit_transform(X_train)
Xpoly_test = poly.transform(X_test)

reg.fit(Xpoly_train, y_train)
ypoly_train_pred = reg.predict(Xpoly_train)
ypoly_test_pred = reg.predict(Xpoly_test)

r2poly_train = r2_score(y_train, ypoly_train_pred)
r2poly_test = r2_score(y_test, ypoly_test_pred)
print(f'Train R^2:\t{r2poly_train}\nTest R^2:\t{r2poly_test}')
mean_absolute_error(y_test, ypoly_test_pred)

Train R^2:	0.6092251236999888
Test R^2:	0.5979246262761988


764.6755205865103

## Task
Split your data into an 80% train set and a 20% test set.

In [68]:
y = data2.Item_Outlet_Sales
X = data2.drop(['Item_Outlet_Sales'], axis=1)

In [69]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2)

In [70]:
categorical_cols = [cname for cname in X_train.columns if X_train[cname].nunique() < 10 and 
                        X_train[cname].dtype == "object"]

numerical_cols = [cname for cname in X_train.columns if X_train[cname].dtype in ['int64', 'float64']]

In [71]:
numerical_cols

['Item_Weight',
 'Item_Visibility',
 'Item_MRP',
 'Operation_Year',
 'Outlet_Location_Type']

In [72]:
from sklearn.compose import make_column_transformer
# categorical_transformer= OneHotEncoder(handle_unknown='ignore')
# numerical_transformer=StandardScaler()

ct = make_column_transformer(
    (StandardScaler(), numerical_cols), # standardize these columns to zero mean and unit variance
    (OneHotEncoder(handle_unknown='ignore'), categorical_cols)
)
ct.fit(X_train)

# transform training and test data with standardization (StandardScaler) and OneHotEncoder
X_train = ct.transform(X_train)
X_test = ct.transform(X_test)
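An alternative to transforming `X_train` and `X_test` by hand is to wrap the column transformer and the model in a single pipeline. The preprocessing is then refit inside every cross-validation fold. This is a minimal sketch on synthetic data; the toy frame, its coefficients and the choice of `LinearRegression` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: one numeric and one categorical feature.
rng = np.random.default_rng(0)
n = 60
X_toy = pd.DataFrame({
    "Item_MRP": rng.uniform(30, 250, n),
    "Outlet_Type": rng.choice(["Grocery Store", "Supermarket Type1"], n),
})
y_toy = (
    15 * X_toy["Item_MRP"]
    + 500 * (X_toy["Outlet_Type"] == "Supermarket Type1")
    + rng.normal(0, 50, n)
)

# Scale the numeric column and one-hot encode the categorical one.
preprocess = make_column_transformer(
    (StandardScaler(), ["Item_MRP"]),
    (OneHotEncoder(handle_unknown="ignore"), ["Outlet_Type"]),
)

# The pipeline fits the preprocessing on each training fold only.
model = make_pipeline(preprocess, LinearRegression())
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="r2")
print(f"Mean cross-validated R^2: {scores.mean():.3f}")
```

The same pipeline object could be passed to `GridSearchCV` as its estimator, which keeps the scaler statistics from leaking across folds.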



## Task
Use grid_search to find the best value of parameter `alpha` for Ridge and Lasso regressions from `sklearn`.
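The task names Ridge and Lasso specifically. A minimal sketch of searching `alpha` for each of them separately is shown below. It runs on synthetic data from `make_regression`; the alpha grid and the `max_iter` setting are assumptions, not values taken from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem used only to make the sketch runnable.
X_toy, y_toy = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0
)

# Hypothetical grid of regularization strengths.
alphas = [0.001, 0.01, 0.1, 1, 10]

best = {}
for name, estimator in [("Ridge", Ridge()), ("Lasso", Lasso(max_iter=10000))]:
    search = GridSearchCV(
        estimator, param_grid={"alpha": alphas}, cv=5, scoring="r2"
    )
    search.fit(X_toy, y_toy)
    # Store the winning alpha and its mean cross-validated R^2.
    best[name] = (search.best_params_["alpha"], search.best_score_)
    print(f"{name}: best alpha={best[name][0]}, CV R^2={best[name][1]:.3f}")
```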

In [83]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import ElasticNet
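# Make a dictionary with model arguments as keys and lists of grid settings as values
# Note: l1_ratio must lie in [0, 1] (0 = pure L2 penalty, 1 = pure L1 penalty, i.e. Lasso);
# the candidates 1.25 and 5 are invalid and produce NaN cross-validation scores.
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1],
    'l1_ratio': [0, 0.25, 0.5, 0.75, 1, 1.25, 5]
}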

grid = GridSearchCV(estimator=ElasticNet(), param_grid=param_grid, cv=5, scoring='r2', verbose=1, n_jobs=-1) # verbose=1 -> print results, n_jobs=-1 -> use all processors in parallel
grid_result = grid.fit(X_train, y_train)

best_r2 = grid_result.best_score_
best_alpha = grid_result.best_params_['alpha']
best_l1_ratio = grid_result.best_params_['l1_ratio']
print(f'The best hyperparameter settings achieve a cross-validated R^2 of: {best_r2}\nAlpha:\t{best_alpha}\nL1 ratio:\t{best_l1_ratio}')

Fitting 5 folds for each of 28 candidates, totalling 140 fits
The best hyperparameter settings achieve a cross-validated R^2 of: 0.5637254605986994
Alpha:	0.1
L1 ratio:	1




In [94]:
elas = ElasticNet(alpha=0.1, l1_ratio = 1)
elas.fit(X_train, y_train)
y_pred =  elas.predict(X_test)
mean_absolute_error(y_test, y_pred)



858.149993214268

## Task
Using the model from grid_search, predict the values in the test set and compare against the benchmark.

In [85]:
pip install xgboost

Collecting xgboost
  Downloading xgboost-1.4.2-py3-none-win_amd64.whl (97.8 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.4.2
Note: you may need to restart the kernel to use updated packages.


In [86]:
import xgboost as xgb

In [88]:
param_grid = {"learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ] ,
 "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
 "min_child_weight" : [ 1, 3, 5, 7 ],
 "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
 "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ] }

In [89]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

In [90]:
import numpy as np
# "Learn" the mean from the training data
mean_train = np.mean(y_train)
# Get predictions on the test set
baseline_predictions = np.ones(y_test.shape) * mean_train
# Compute MAE
mae_baseline = mean_absolute_error(y_test, baseline_predictions)
print("Baseline MAE is {:.2f}".format(mae_baseline))

Baseline MAE is 1366.92


In [108]:
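from xgboost import XGBRegressor
# Caution: early stopping is monitored on the test set here, so the test metrics below
# are optimistic; a separate validation split would give a cleaner estimate.
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.2, n_jobs=-1, max_depth=15)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_test, y_test)], 
             verbose=False)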

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.2, max_delta_step=0, max_depth=15,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=1000, n_jobs=-1, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [109]:
y_pred = my_model.predict(X_test)
r2_score(y_test, y_pred)
mean_absolute_error(y_test, y_pred)

829.1464939487528
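The MAE values printed so far can be set side by side with the baseline. The sketch below copies the numbers from the cell outputs above and assumes they all come from the same train/test split; the dictionary labels are descriptive names chosen here.

```python
# MAE values copied from the cell outputs above.
baseline_mae = 1366.92
model_mae = {
    "Polynomial regression (degree 2)": 764.6755205865103,
    "ElasticNet (alpha=0.1, l1_ratio=1)": 858.149993214268,
    "XGBRegressor": 829.1464939487528,
}

# Relative reduction in MAE compared with the mean baseline.
improvement = {name: 1 - mae / baseline_mae for name, mae in model_mae.items()}

# Print the models from largest to smallest improvement.
for name, gain in sorted(improvement.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: MAE {model_mae[name]:.1f}, {gain:.1%} below baseline")

best_model = max(improvement, key=improvement.get)
```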

In [112]:
from sklearn.model_selection import GridSearchCV
# Make a dictionary with model arguments as keys and lists of grid settings as values
param_grid = {"learning_rate"    : [0.05, 0.10, 0.15 ] ,
 "max_depth"        : [ 3, 4, 5, 6],
 "min_child_weight" : [ 1, 3, 5, 7 ],
 "gamma"            : [ 0.0, 0.1],
 "colsample_bytree" : [ 0.3, 0.4 ] }

grid = GridSearchCV(estimator=XGBRegressor(), param_grid=param_grid, cv=5, scoring='r2', verbose=1, n_jobs=-1) # verbose=1 -> print results, n_jobs=-1 -> use all processors in parallel
grid_result = grid.fit(X_train, y_train)

best_r2 = grid_result.best_score_
# Name the variables and labels after the XGBoost parameters they actually hold
best_gamma = grid_result.best_params_['gamma']
best_colsample_bytree = grid_result.best_params_['colsample_bytree']
print(f'The best hyperparameter settings achieve a cross-validated R^2 of: {best_r2}\nGamma:\t{best_gamma}\nColsample by tree:\t{best_colsample_bytree}')

Fitting 5 folds for each of 192 candidates, totalling 960 fits
The best hyperparameter settings achieve a cross-validated R^2 of: 0.5965073125716851
Gamma:	0.0
Colsample by tree:	0.4
