# **Project-2**

In this project, you will analyze and predict the weekly sales for a retail store. The dataset includes weekly sales data for *45* store locations over a *143-week* period. Create a machine learning model (**regression**) to predict weekly sales values using the train and test datasets provided.

**Dataset Details:**

*Store*: Store number

*Week*: 1 through 143

*Temperature*: Weekly outside temperature

*Holiday*: Yes for holiday week, No for non-holiday week

*CPI*: The Consumer Price Index

*Fuel Price*: Price per gallon

*Unemployment*: Unemployment rate

*WeeklySales*: Total sales amount


**Datasets Locations and Names:**
Canvas -> Modules -> Week 5 -> Datasets -> "trainSales.csv" and "testSales.csv".

Download the .ipynb file and save as FirstName_LastName_Project2.ipynb. Please submit (upload) your source code to Canvas.

Connect to Google drive and import the initaial libraries the may be needed

In [104]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from google.colab import drive
drive.mount('/content/drive')



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Access the necessary files that contain our training and test data sets. As the data sets are already seperate we dont not need to utilize stratfied shuffling to seperate a single data set.

In [105]:
train_set = pd.read_csv("drive/My Drive/trainSales.csv")
test_set = pd.read_csv("drive/My Drive/testSales.csv")

Verify that the data sets were imported correctly. We can see that the train_set contains 5148 entries and the test_set contains 1287 entries.

In [106]:
train_set, test_set

(      Store  Week  Temperature Holiday         CPI  FuelPrice  Unemployment  \
 0         8   109        50.95      No  224.395979      3.630         5.825   
 1         2   127        84.20      No  221.521506      3.227         6.565   
 2        38    72        86.84      No  129.043200      3.935        13.736   
 3        41    27        69.21      No  190.099003      2.690         7.335   
 4        35   125        73.23      No  142.160646      3.564         8.876   
 ...     ...   ...          ...     ...         ...        ...           ...   
 5143     43    12        62.71      No  202.483191      2.795         9.593   
 5144      5   138        71.09      No  223.373759      3.721         5.603   
 5145     18    57        28.49      No  133.614143      3.437         9.131   
 5146      2    32        79.09     Yes  211.153210      2.565         8.099   
 5147      9     3        43.06      No  214.850618      2.514         6.415   
 
       WeeklySales  
 0       952264.9

Create the "WeeklySales" labels for both data sets, that we are trying to create a model to predict.

In [107]:
train = train_set.drop("WeeklySales", axis=1)
train_labels = train_set["WeeklySales"].copy()

test = test_set.drop("WeeklySales", axis=1)
test_labels = test_set["WeeklySales"].copy()

Verify that train_labels contains the list of weekly sales for each index

In [108]:
train_labels

0        952264.91
1       2041507.40
2        356797.00
3       1338132.72
4        911696.00
           ...    
5143     638957.35
5144     307306.76
5145    1063310.62
5146    1839128.83
5147     511327.90
Name: WeeklySales, Length: 5148, dtype: float64

Create our lists of attributes for each data set. Including separating the "Holiday" attributes, which can utilize a boolean style data type, then it can be used in the onehotencoder.  

In [109]:
num_atrb = list(train.drop("Holiday", axis=1))
bool_atrb = ["Holiday"]

num_atrb, bool_atrb

(['Store', 'Week', 'Temperature', 'CPI', 'FuelPrice', 'Unemployment'],
 ['Holiday'])

From the code discussed in class, import the libraries needed to create our pipelines. Including the standard scalar and OneHotEncoders

In [110]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')

imputer

numeric_std_pipeline = Pipeline([('imputer', SimpleImputer(strategy='median')),
                                  ('stdscaler', StandardScaler())])

bool_pipeline = Pipeline([('onehot', OneHotEncoder())])

Here the piplines are combined using a column transformer to create the transformer for our data sets.

In [111]:
from sklearn.compose import ColumnTransformer

full_transformer = ColumnTransformer([('numeric_stdpreprocessing', numeric_std_pipeline, num_atrb),('bool_preprocessing', bool_pipeline, bool_atrb)
                                ])

Below is the function to fit our data to a chosen model and then return the training error, test error and R2 scores

In [112]:
#For regression

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

def fit_and_print(p, train_set, train_labels, test_set, test_labels):
  p.fit(train_set, train_labels)
  train_preds = p.predict(train_set)
  test_preds = p.predict(test_set)
  print("Training Error: " + str(mean_absolute_error(train_preds, train_labels)))
  print("Test Error: " + str(mean_absolute_error(test_preds, test_labels)))
  print("R2 score: " + str(r2_score(test_preds, test_labels)))

The next three pieces of code import and itialize three different regression models. Including linear, random forest and K Neighbors.

In [113]:
from sklearn.linear_model import LinearRegression

LR_full_pipeline = Pipeline([('all_column_transformation', full_transformer),
                        ('linear_regression', LinearRegression())
                      ])

In [114]:
from sklearn.ensemble import RandomForestRegressor as RFR

RFR_full_pipeline = Pipeline([('all_column_transformation', full_transformer),
               ("RFR_model", RFR())
               ])

In [115]:
from sklearn.neighbors import KNeighborsRegressor as KNR

KNR_full_pipeline = Pipeline([('all_column_transformation', full_transformer),
               ("KNN Regressor", KNR())
               ])

After running the fit and print function three time using one of the regression pipelines in each one, we see our training/test errors and the R2 scores for each model below.

The results found for the linear regression model show a severe lack of compatability with the data.

In [116]:
fit_and_print(LR_full_pipeline, train, train_labels, test, test_labels)

Training Error: 428353.3155306957
Test Error: 437453.8028262147
R2 score: -4.664731087070617


With an R2 score of 0.948, the random forest regression models appears to provide the best result for the data, however the errors that are found show signs of overfitting.

In [117]:
fit_and_print(RFR_full_pipeline, train, train_labels, test, test_labels)

Training Error: 27506.837977369836
Test Error: 68728.31571693857
R2 score: 0.9485115719134508


The KNeighbors model provides better results than linear regression, but the 0.45 R2 score is far from a perfect score and the errors are significant as well as showing signs of overfitting

In [118]:
fit_and_print(KNR_full_pipeline, train, train_labels, test, test_labels)

Training Error: 164030.82773038073
Test Error: 224714.07525097125
R2 score: 0.4535152877722701


Below is the grid search cv method(used in class) to help find the best parameters for our models

In [119]:

from sklearn.model_selection import GridSearchCV
param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'RFR_model__n_estimators': [3, 10, 30], 'RFR_model__max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'RFR_model__bootstrap': [False], 'RFR_model__n_estimators': [3, 10], 'RFR_model__max_features': [2, 3, 4]},
  ]

grid_search = GridSearchCV(RFR_full_pipeline, param_grid, cv=5)
grid_search.fit(train, train_labels)

Then grid_search is ran with the best_params funciton to get the best possible parameters for the dataset in that particular model. In this case it is for the RFR model.

In [120]:
grid_search.best_params_

{'RFR_model__max_features': 4, 'RFR_model__n_estimators': 30}