# Prediction of sales

### Problem Statement
This dataset contains sales data for 1559 products across 10 stores in different cities, along with certain attributes of each product and store. The aim is to build a predictive model that estimates the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Please note that the data may have missing values, as some stores might not report all of their data because of technical glitches. These values will need to be treated accordingly.



### Explore the problem in the following stages:

1. Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome
2. Data Exploration – looking at categorical and continuous feature summaries and making inferences about the data.
3. Data Cleaning – imputing missing values in the data and checking for outliers
4. Feature Engineering – modifying existing variables and creating new ones for analysis
5. Model Building – making predictive models on the data

We covered data preparation and feature engineering two weeks ago. Now it's time to build some predictive models.

## Model Building

## Task
Make a baseline model. Baseline models help us set a benchmark to gauge the performance of our future models. If your new model is below the baseline, something has gone wrong, and you should check your data.

To make a baseline model, run a simple regression model without altering the default parameters in sklearn. 

In [8]:
import pandas as pd
data = pd.read_csv('data_feature_engg.csv')

In [9]:
data.columns

Index(['Item_Fat_Content', 'Item_Visibility', 'Item_MRP',
       'Outlet_Establishment_Year', 'Outlet_Location_Type',
       'Item_Outlet_Sales', 'Item_Weight', 'Outlet_Size', 'misleading',
       'Item_Type_Baking Goods', 'Item_Type_Breads', 'Item_Type_Breakfast',
       'Item_Type_Canned', 'Item_Type_Dairy', 'Item_Type_Frozen Foods',
       'Item_Type_Fruits and Vegetables', 'Item_Type_Hard Drinks',
       'Item_Type_Health and Hygiene', 'Item_Type_Household', 'Item_Type_Meat',
       'Item_Type_Others', 'Item_Type_Seafood', 'Item_Type_Snack Foods',
       'Item_Type_Soft Drinks', 'Item_Type_Starchy Foods',
       'Outlet_Type_Grocery Store', 'Outlet_Type_Supermarket Type1',
       'Outlet_Type_Supermarket Type2', 'Outlet_Type_Supermarket Type3',
       'Item_Identify_Type_DR', 'Item_Identify_Type_FD',
       'Item_Identify_Type_NC'],
      dtype='object')

In [10]:
data.head()

Unnamed: 0,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Outlet_Location_Type,Item_Outlet_Sales,Item_Weight,Outlet_Size,misleading,Item_Type_Baking Goods,...,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3,Item_Identify_Type_DR,Item_Identify_Type_FD,Item_Identify_Type_NC
0,0,0.016047,249.8092,23,0,3735.138,9.3,1,0,0,...,0,0,0,0,1,0,0,0,1,0
1,1,0.019278,48.2692,13,2,443.4228,5.92,1,0,0,...,0,1,0,0,0,1,0,1,0,0
2,0,0.01676,141.618,23,0,2097.27,17.5,1,0,0,...,0,0,0,0,1,0,0,0,1,0
3,1,0.0,182.095,24,2,732.38,19.2,0,0,0,...,0,0,0,1,0,0,0,0,1,0
4,0,0.0,53.8614,35,2,994.7052,8.93,2,1,0,...,0,0,0,0,1,0,0,0,0,1


In [11]:
from sklearn.linear_model import LinearRegression

In [12]:
y=data.Item_Outlet_Sales
data.drop('Item_Outlet_Sales',axis=1,inplace=True)

In [13]:
regressor = LinearRegression()
regressor.fit(data, y)

LinearRegression()

In [14]:
print(regressor.coef_)

[ 4.16774830e+01 -2.89148724e+02  1.55743028e+01 -9.74431543e+00
 -3.33438162e+01 -2.78735259e-01  8.55189736e+01 -1.74405401e+01
 -3.81639203e+00 -7.00710454e-01  3.70883394e+00  2.24970626e+01
 -5.30003569e+01 -3.06907822e+01  2.44007259e+01 -5.11773905e+01
  7.43458433e+00 -2.09147499e+01 -3.62508671e+00 -3.96037456e+00
  1.79593849e+02 -1.47834944e+01 -7.33699373e+01  1.84042192e+01
 -1.64578260e+03  1.79452196e+02 -2.47306981e+02  1.71363738e+03
  3.01255098e+01 -1.26849697e+01 -1.74405401e+01]


In [16]:
regressor.score(data, y)  # R^2 on the same data the model was fitted on

0.5629450976862653
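
The value returned by `score` is the coefficient of determination, R^2 = 1 - SS_res / SS_tot, measured here on the same rows the model was fitted on. A minimal sketch of that calculation follows. It uses made-up synthetic arrays (`X_demo`, `y_demo` are illustrative names, not the sales data) and checks the manual result against `LinearRegression.score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only (not the sales dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X_demo, y_demo)
y_hat = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Both numbers should agree up to floating-point error.
print(r2_manual, model.score(X_demo, y_demo))
```

Because this is an in-sample score, it tends to be more optimistic than a score on held-out data.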

## Task
Split your data into an 80% training set and a 20% test set.

In [17]:
import sklearn.model_selection as model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(data, y, train_size=0.8,test_size=0.2, random_state=101)

## Task
Use grid_search to find the best value of the parameter `alpha` for Ridge and Lasso regressions from `sklearn`.

In [18]:
import numpy as np
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt

In [102]:
parameter_candidates = [
  {'alpha': np.linspace(0.01, 0.02, 20)}]

In [21]:
from sklearn.linear_model import Lasso,Ridge

In [33]:
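from sklearn.preprocessing import PolynomialFeatures

# Expand the features with all degree-2 terms (squares and pairwise products)
poly = PolynomialFeatures(
    degree = 2, include_bias = False, interaction_only = False)
X_train_poly = poly.fit_transform(X_train)
# get_feature_names was removed in scikit-learn 1.2; on versions
# before 1.0 use poly.get_feature_names(input_features = ...) instead
polynomial_column_names = \
    poly.get_feature_names_out(input_features = X_train.columns)
X_train_poly = \
    pd.DataFrame(data = X_train_poly, 
        columns = polynomial_column_names )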

In [34]:
X_train_poly.dtypes

Item_Fat_Content                               float64
Item_Visibility                                float64
Item_MRP                                       float64
Outlet_Establishment_Year                      float64
Outlet_Location_Type                           float64
                                                ...   
Item_Identify_Type_DR Item_Identify_Type_FD    float64
Item_Identify_Type_DR Item_Identify_Type_NC    float64
Item_Identify_Type_FD^2                        float64
Item_Identify_Type_FD Item_Identify_Type_NC    float64
Item_Identify_Type_NC^2                        float64
Length: 527, dtype: object
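
The count of 527 columns follows from the 31 predictors left after dropping `Item_Outlet_Sales`: a degree-2 expansion without the bias term keeps the 31 original features and adds 31 squares plus 465 pairwise products. A short sketch of that arithmetic, cross-checked against scikit-learn on a dummy array (`poly_demo` is an illustrative name, not part of the notebook):

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 31  # predictors left after dropping Item_Outlet_Sales
degree = 2

# Number of monomials of total degree 1..degree in n variables:
# C(n + degree, degree) - 1, where the -1 removes the constant (bias) term.
expected = comb(n_features + degree, degree) - 1

# Cross-check with scikit-learn on a dummy array of the same width.
poly_demo = PolynomialFeatures(degree=degree, include_bias=False)
poly_demo.fit(np.zeros((1, n_features)))

print(expected, poly_demo.n_output_features_)
```

The feature count grows quadratically with the number of inputs, which is one reason regularised models such as Lasso and Ridge are used on the expanded data.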

In [36]:
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
sc = StandardScaler()
X_train_poly_scaled = sc.fit_transform(X_train_poly)
X_train_poly_scaled = pd.DataFrame(data = X_train_poly_scaled, columns = X_train_poly.columns)

In [37]:
# Apply the same transforms to the test data; no need to fit again
X_test_poly = poly.transform(X_test)
X_test_poly_scaled = sc.transform(X_test_poly)
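
The fit-on-train, transform-on-test pattern used above can also be bundled into a scikit-learn pipeline, so the test data can never be refitted by accident. The sketch below is one possible arrangement, not the notebook's method. It runs on synthetic data from `make_regression`, and the names `X_demo`, `pipe`, and `test_r2` are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the sales data (illustration only).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101)

# fit() learns the polynomial expansion and the scaling statistics from
# the training rows only; score() and predict() reuse them on new rows.
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=0.01),
)
pipe.fit(X_tr, y_tr)

test_r2 = pipe.score(X_te, y_te)
print(test_r2)
```

A pipeline passed to `GridSearchCV` also recomputes the scaling inside each cross-validation fold, which avoids leaking validation-fold statistics into training.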

In [43]:
from sklearn.metrics import \
    r2_score, get_scorer
scorer = get_scorer('r2')

In [107]:
lasso=Lasso(max_iter=3000)
ridge=Ridge()

In [108]:
lassoreg = GridSearchCV(estimator=lasso, param_grid=parameter_candidates, n_jobs=-1, scoring ='r2')


In [115]:
ridgereg=GridSearchCV(estimator=ridge, param_grid=parameter_candidates, n_jobs=-1, scoring='r2')

In [109]:
lassoreg.fit(X_train_poly_scaled, y_train)


  model = cd_fast.enet_coordinate_descent(


GridSearchCV(estimator=Lasso(max_iter=3000), n_jobs=-1,
             param_grid=[{'alpha': array([0.01      , 0.01052632, 0.01105263, 0.01157895, 0.01210526,
       0.01263158, 0.01315789, 0.01368421, 0.01421053, 0.01473684,
       0.01526316, 0.01578947, 0.01631579, 0.01684211, 0.01736842,
       0.01789474, 0.01842105, 0.01894737, 0.01947368, 0.02      ])}],
             scoring='r2')

In [116]:
ridgereg.fit(X_train_poly_scaled, y_train)

GridSearchCV(estimator=Ridge(), n_jobs=-1,
             param_grid=[{'alpha': array([0.01      , 0.01052632, 0.01105263, 0.01157895, 0.01210526,
       0.01263158, 0.01315789, 0.01368421, 0.01421053, 0.01473684,
       0.01526316, 0.01578947, 0.01631579, 0.01684211, 0.01736842,
       0.01789474, 0.01842105, 0.01894737, 0.01947368, 0.02      ])}],
             scoring='r2')

In [112]:
print('Best score for lasso:', lassoreg.best_score_) 


Best score for lasso: 0.5883887967990238


In [117]:
print('Best score for ridge:', ridgereg.best_score_) 


Best score for ridge: 0.5886496476187153


In [118]:
print('Best alpha for lasso:',lassoreg.best_estimator_.alpha) 


Best alpha for lasso: 0.02


In [119]:
print('Best alpha for ridge:',ridgereg.best_estimator_.alpha) 


Best alpha for ridge: 0.01
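
Both selected values sit on an edge of the searched interval [0.01, 0.02] (the upper edge for Lasso, the lower edge for Ridge), which often means the best value lies outside the grid. One common option is to search a log-spaced grid covering several orders of magnitude and check whether the winner is still on an edge. The sketch below assumes synthetic data and arbitrary grid bounds; `alphas`, `search`, and `on_edge` are illustrative names:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the sales data (illustration only).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=20.0, random_state=0)

# Log-spaced candidates spanning several orders of magnitude
# (the bounds 1e-3 and 1e3 are arbitrary choices for this sketch).
alphas = np.logspace(-3, 3, 13)

search = GridSearchCV(
    Ridge(), param_grid={"alpha": alphas}, scoring="r2", cv=5)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
# If the best value is the smallest or largest candidate, widen the grid.
on_edge = best_alpha in (alphas[0], alphas[-1])
print("best alpha:", best_alpha, "| on grid edge:", on_edge)
```

Once a coarse search locates the right order of magnitude, a finer grid around that value can refine the choice.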


In [110]:
import seaborn as sns

def regmodel_param_plot(
    validation_score, train_score, alphas_to_try, chosen_alpha,
    scoring, model_name, test_score = None, filename = None):
    
    plt.figure(figsize = (8,8))
    sns.lineplot(y = validation_score, x = alphas_to_try, 
                 label = 'validation_data')
    sns.lineplot(y = train_score, x = alphas_to_try, 
                 label = 'training_data')
    plt.axvline(x=chosen_alpha, linestyle='--')
    if test_score is not None:
        sns.lineplot(y = test_score, x = alphas_to_try, 
                     label = 'test_data')
    plt.xlabel('alpha_parameter')
    plt.ylabel(scoring)
    plt.title(model_name + ' Regularisation')
    plt.legend()
    if filename is not None:
        plt.savefig(str(filename) + ".png")
    plt.show()

In [111]:

from sklearn.model_selection import cross_validate

def regmodel_param_test(
    alphas_to_try, X, y, cv, scoring = 'r2', 
    model_name = 'LASSO', X_test = None, y_test = None, 
    draw_plot = False, filename = None):
    
    validation_scores = []
    train_scores = []
    results_list = []
    if X_test is not None:
        test_scores = []
        scorer = get_scorer(scoring)
    else:
        test_scores = None

    for curr_alpha in alphas_to_try:
        
        if model_name == 'LASSO':
            regmodel = Lasso(alpha = curr_alpha)
        elif model_name == 'Ridge':
            regmodel = Ridge(alpha = curr_alpha)
        else:
            return None

        results = cross_validate(
            regmodel, X, y, scoring=scoring, cv=cv, 
            return_train_score = True)

        validation_scores.append(np.mean(results['test_score']))
        train_scores.append(np.mean(results['train_score']))
        results_list.append(results)

        if X_test is not None:
            # Refit on all of X, y and score on the held-out test set
            regmodel.fit(X,y)
            test_scores.append(scorer(regmodel, X_test, y_test))
    
    chosen_alpha_id = np.argmax(validation_scores)
    chosen_alpha = alphas_to_try[chosen_alpha_id]
    max_validation_score = np.max(validation_scores)
    if X_test is not None:
        test_score_at_chosen_alpha = test_scores[chosen_alpha_id]
    else:
        test_score_at_chosen_alpha = None
        
    if draw_plot:
        regmodel_param_plot(
            validation_scores, train_scores, alphas_to_try, chosen_alpha, 
            scoring, model_name, test_scores, filename)
    
    return chosen_alpha, max_validation_score, test_score_at_chosen_alpha

Not good: the best cross-validated R^2 (about 0.589 for both Lasso and Ridge) is only slightly above the baseline of 0.563.

## Task
Using the model from grid_search, predict the values in the test set and compare against your benchmark.

In [113]:
lassoreg.score(X_test_poly_scaled,y_test)

0.5987187825667534

In [120]:
ridgereg.score(X_test_poly_scaled,y_test)

0.5993027826389071
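
The benchmark of 0.563 was measured on the data the baseline was fitted on, while the two scores above are test-set scores. A like-for-like comparison fits the baseline on the training split and scores every model on the same test rows. The sketch below illustrates this on synthetic data; `baseline`, `candidate`, and `scores` are illustrative names:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sales data (illustration only).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=101)

# Fit both models on the training rows only.
baseline = LinearRegression().fit(X_tr, y_tr)
candidate = Ridge(alpha=0.01).fit(X_tr, y_tr)

# Score both on the same held-out rows so the numbers are comparable.
scores = {
    "baseline": baseline.score(X_te, y_te),
    "ridge": candidate.score(X_te, y_te),
}
for name, r2 in scores.items():
    print(f"{name}: test R^2 = {r2:.4f}")
```

In this notebook, that would amount to fitting `LinearRegression` on `X_train`, `y_train` and scoring it on `X_test`, `y_test` before comparing against the Lasso and Ridge test scores.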