# Perdiction of sales

### Problem Statement
The dataset represents sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store are available. The aim is to build a predictive model and find out the sales of each product at a particular store.

|Variable|Description|
|: ------------- |:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particulat store. This is the outcome variable to be predicted.|

Please note that the data may have missing values as some stores might not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.



### Explore the problem in following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

In [31]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('regression_exercise.csv')

In [32]:
data['Item_Identifier_Type'] = data.Item_Identifier.astype(str).str[:2]

data["Outlet_Size"] = data["Outlet_Size"].fillna('Medium')
dropped = data[data.Item_Weight.notnull()]
mean = dropped.groupby('Item_Type').Item_Weight.mean()
name = pd.Series(dropped.Item_Type.unique())
mean_weights = pd.concat([mean,name],axis=1,keys=['mean'])
def fill_item(row):
    if not np.isnan(row['Item_Weight']):
        return row['Item_Weight']
    else:
        mean = mean_weights.loc[row.Item_Type, 'mean']
        return mean
data['Item_Weight'] = data.apply(fill_item, axis=1)

In [33]:
# Separate target from predictors
y = data.Item_Outlet_Sales
X = data.drop(['Item_Outlet_Sales'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

In [34]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder,StandardScaler

# Preprocessing for numerical data
numerical_transformer = Pipeline(steps=[('imputer',SimpleImputer(strategy='constant')),
                                       ('scaler', StandardScaler())
                                      ])

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

In [35]:
X_train = preprocessor.fit_transform(X_train)

In [36]:
X_valid = preprocessor.fit_transform(X_valid)

In [67]:
# import the class
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
logreg = LogisticRegression()

xg_reg = xgb.XGBRegressor(objective ='reg:squarederror')

clf=RandomForestRegressor(n_estimators=200,
 min_samples_split=5,
 min_samples_leaf=4,
 max_features='auto',
 max_depth=5,
 bootstrap=True)

GBC = GradientBoostingRegressor(n_estimators=100, learning_rate=1.0,
     max_depth=1, random_state=0)

We have covered data preparation and feature engineering two weeks ago. Plus, we have created Lasso and Ridge regressions on Monday. Now, we will work on more complex ensemble models.

## Model Building

### Ensemble Models

Try different  ensemble models (Random Forest Regressor, Gradient Boosting, XGBoost)

Calculate the mean squared error on the test set. Explore how different parameters of the model affect the results and the performance of the model

- Use GridSearchCV to find optimal paramaters of models.
- Compare agains the Lasso and Ridge Regression models from Monday.

In [22]:
param_grid = {"learning_rate"    : [0.05, 0.10, 0.15 ] ,
 "max_depth"        : [ 3, 4, 5, 6],
 "min_child_weight" : [ 1, 3, 5, 7 ],
 "gamma"            : [ 0.0, 0.1],
 "colsample_bytree" : [ 0.3, 0.4 ] }

grid = GridSearchCV(estimator=XGBRegressor(), param_grid=param_grid, cv=5, scoring='r2', verbose=1, n_jobs=-1)
grid_result = grid.fit(X_train, y_train)

best_r2 = grid_result.best_score_
best_gamma = grid_result.best_params_['gamma']
best_colsample_bytree = grid_result.best_params_['colsample_bytree']
best_learning_rate = grid_result.best_params_['learning_rate']
best_max_depth = grid_result.best_params_['max_depth']

print(f'The best hyperparameter settings achieve a cross-validated R^2 of: {best_r2}\n Gamma:\t {best_gamma}\n col_sample_bytree:\t{best_colsample_bytree} \n Best learning rate: \t{best_learning_rate}\n Best maximum depth: \t{best_max_depth}')

Fitting 5 folds for each of 192 candidates, totalling 960 fits
The best hyperparameter settings achieve a cross-validated R^2 of: 0.5962593761466426
 Gamma:	 0.0
 col_sample_bytree:	0.4 
 Best learning rate: 	0.15
 Best maximum depth: 	3


In [52]:
xg_reg.fit(X_train,y_train)

preds = xg_reg.predict(X_valid)

rmse = (r2_score(y_valid, preds))
print("RMSE: %f" % (rmse))

RMSE: 0.521070


In [68]:
clf.fit(X_train, y_train)

y_preds = clf.predict(X_valid)

r2_score(y_valid, y_preds)

0.5914970558178341

In [60]:
GBC.fit(X_train, y_train)

y_preds = GBC.predict(X_valid)

r2_score(y_valid, y_preds)

0.552221055713558

In [61]:
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}


In [62]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


RandomizedSearchCV(cv=3, estimator=RandomForestRegressor(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, 110,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   random_state=42, verbose=2)

In [63]:
rf_random.best_params_

{'n_estimators': 200,
 'min_samples_split': 5,
 'min_samples_leaf': 4,
 'max_features': 'auto',
 'max_depth': 10,
 'bootstrap': True}