<a href="https://colab.research.google.com/github/HeatherAnnFoster/Regression--Prediciton-of-Grocery-Sales/blob/main/Regression_Prediction_of_Grocery_Sales.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [55]:
import pandas as pd
import numpy as np
from numpy.lib.function_base import mean
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn import set_config
set_config (display = 'diagram')

In [56]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [57]:
path = '/content/sales_predictions.xlsx'
df = pd.read_excel(path)
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


*This information show that in the Item Fat Content column, there are 5 different value names.  This will inconsistency will be fixed to show 'Low Fat' and 'Regular' names.*

In [58]:
df['Item_Fat_Content'].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

In [59]:
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace("low fat")
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace('LF')
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace('reg')
df['Item_Fat_Content'].value_counts()

Low Fat    5423
Regular    3100
Name: Item_Fat_Content, dtype: int64

In [60]:
df.duplicated().sum()

0

*There are two columns that have missing values.  The Item Weight column is missng 1,463 values, which is 17.17% of its values.  The Outlet Size column is missing 2,410 values, which is 28.27% of its values.  Deleting these columns would skew the results of the preprocessing, so they will be adjusted during the pipeline phase of this analysis.*

In [61]:
df.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [62]:
df.shape

(8523, 12)

*The Item Identifier column was dropped.  It was irrelevant to this data set and the analysis.*

In [63]:
df = df.drop(columns = 'Item_Identifier')

*The target for the data is the "Item Outlet Sales" column.  The rest of the information will be kept in the X section of the data.*

In [64]:
y = df['Item_Outlet_Sales']

In [65]:
X = df.drop(columns = 'Item_Outlet_Sales')

*The data is being split here.  The target, or y is Item Outlet Sales.*

In [66]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)

*The data is put through the prepocessing object to get the data ready for modeling.  The selectors and columns are defined and ready for the machine learning to work properly.*

In [67]:
cat_selector = make_column_selector(dtype_include = 'object')
num_selector = make_column_selector(dtype_include = 'number')
mean_imputer = SimpleImputer(strategy = 'mean')
scaler = StandardScaler()
frequency_imputer = SimpleImputer(strategy = 'most_frequent')
ohe = OneHotEncoder(handle_unknown = 'ignore', sparse_output = False)
num_columns = num_selector(X_train)
cat_columns = cat_selector(X_train)
print('numeric columns are', num_columns)
print('categorical columns are', cat_columns)

numeric columns are ['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']
categorical columns are ['Item_Fat_Content', 'Item_Type', 'Outlet_Identifier', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']


*This is where the preprocessor comes into play.  The numeric and categorical pipelines take in the data, clean it and prepare it for modeling.*

In [68]:
numeric_pipeline = make_pipeline(mean_imputer, scaler)
numeric_pipeline

In [69]:
categorical_pipeline = make_pipeline(frequency_imputer, ohe)
categorical_pipeline

*Here, we will fit the dataset into the transformer and fill in the missing values.*

In [70]:
num_tuple = (numeric_pipeline, num_selector)
cat_tuple = (categorical_pipeline, cat_selector)

In [71]:
col_transformer = make_column_transformer(num_tuple, cat_tuple, remainder = 'passthrough')
col_transformer

In [72]:
preprocessor = make_column_transformer (num_tuple, cat_tuple)
preprocessor.fit(X_train)

*Now, we will double check the dataset to make certain that the imputation is correct.*

In [73]:
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)

In [74]:
print(X_train_processed)

[[ 0.81724868 -0.71277507  1.82810922 ...  0.          1.
   0.        ]
 [ 0.5563395  -1.29105225  0.60336888 ...  0.          1.
   0.        ]
 [-0.13151196  1.81331864  0.24454056 ...  1.          0.
   0.        ]
 ...
 [ 1.11373638 -0.92052713  1.52302674 ...  1.          0.
   0.        ]
 [ 1.76600931 -0.2277552  -0.38377708 ...  1.          0.
   0.        ]
 [ 0.81724868 -0.95867683 -0.73836105 ...  1.          0.
   0.        ]]


*Below is the pipeline needed for the data set.  This will take the data set and get it ready for predictions.*

In [75]:
linreg = LinearRegression()
linreg_pipe_1 = make_pipeline(preprocessor, linreg)
linreg_pipe_1

*Here the data set was fit into the preprocessor to get it ready to predict the sales profits.*

In [76]:
linreg_pipe_1.fit(X_train, y_train)

*This prediction was made and shows a massive deviation in the training and testing models of the data set.*

In [77]:
training_predictions = linreg_pipe_1.predict(X_train)
testing_predictions = linreg_pipe_1.predict(X_test)
training_predictions[:10]

array([3808.  , 2650.  , 2610.25, 1482.75, 1875.25,  -70.  , 1591.25,
       5652.  , 4200.75, 2046.25])

In [78]:
y_train.head(10)

4776     515.3292
7510    3056.0220
5828    1577.9460
5327    1331.6000
4810    1687.1372
4377     111.8544
2280    1151.1682
8198    3401.5722
7514    3570.0196
3463    1523.3504
Name: Item_Outlet_Sales, dtype: float64

*This model is showing what the r2 or coefficient of determination.  This model shows the perentage of the variation in the target.*

In [79]:
train_r2 = np.corrcoef(y_train, training_predictions)[0][1]**2
test_r2 = np.corrcoef(y_test, testing_predictions)[0][1]**2
print(f'Model Training R2:{train_r2}')
print(f'Model Testing R2:{test_r2}')

Model Training R2:0.5615533291462604
Model Testing R2:0.5679379310902388


*This model is the Root Mean Squared Error model.  It is showing data that is positive.  It also shows there is a massive gap between the two scores.*

In [80]:
train_RMSE = np.sqrt(np.mean(np.abs(training_predictions-y_train)**2))
test_RMSE = np.sqrt(np.mean(np.abs(testing_predictions-y_test)**2))
print(f'Model Training RMSE:{train_RMSE}')
print(f'Model Testing RMSE:{test_RMSE}')

Model Training RMSE:1139.1065723592649
Model Testing RMSE:1092.7458279520458


*The following is a decision tree model.  It shows the data is positively coorelated to each other, but there is a big gap between the scores.*

In [81]:
dec_tree = DecisionTreeRegressor(random_state = 42)
dec_tree_pipe = make_pipeline(preprocessor, dec_tree)
dec_tree_pipe

In [82]:
dec_tree_pipe.fit(X_train, y_train)

In [83]:
training_predictions = dec_tree_pipe.predict(X_train)
training_predictions = dec_tree_pipe.predict(X_test)
training_predictions[:10]

array([ 792.302 , 1366.2216,  784.3124, 3691.1952, 2570.6538,  732.38  ,
       5303.097 ,  850.8924, 1704.448 , 2926.191 ])

In [84]:
train_score = dec_tree_pipe.score (X_train, y_train)
test_score = dec_tree_pipe.score(X_test, y_test)
print(train_score)
print(test_score)

1.0
0.185677307686953


*In the code boxes below, the information for the Decision Tree was put through a loop that would take the information and determine which model would work best for the company.  As the box with the scores shows, the model that would work the best is the Decision Tree with 5 columns because the test score was around 59%.  That means there are some errors, but they can be overlooked for the good of the sales.*

In [85]:
depths = list(range(2,12))
scores = pd.DataFrame(index = depths, columns = ["train_score", "test_score"])

In [86]:
for depth in depths:
  DT1 = DecisionTreeRegressor(max_depth = depth, random_state= 42)
  DT1.fit(X_train_processed, y_train)
  train_score = DT1.score(X_train_processed, y_train)
  test_score = DT1.score (X_test_processed, y_test)
  scores.loc[depth, "train_score"] = train_score

In [87]:
scores.sort_values(by = "test_score", ascending = False).head()

Unnamed: 0,train_score,test_score
2,0.431641,
3,0.524218,
4,0.582625,
5,0.603934,
6,0.61499,
