<a href="https://colab.research.google.com/github/HeatherAnnFoster/Regression--Prediciton-of-Grocery-Sales/blob/main/Project_1_part_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [59]:
import pandas as pd
import numpy as np
from numpy.lib.function_base import mean
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn import set_config
set_config (display = 'diagram')

In [60]:
path = '/content/sales_predictions.xlsx'
df = pd.read_excel(path)
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


*This information show that in the Item Fat Content column, there are 5 different value names.  This will inconsistency will be fixed to show 'Low Fat' and 'Regular' names.*

In [61]:
df['Item_Fat_Content'].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

In [62]:
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace("low fat")
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace('LF')
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace('reg')
df['Item_Fat_Content'].value_counts()

Low Fat    5423
Regular    3100
Name: Item_Fat_Content, dtype: int64

In [63]:
df.duplicated().sum()

0

*There are two columns that have missing values.  The Item Weight column is missng 1,463 values, which is 17.17% of its values.  The Outlet Size column is missing 2,410 values, which is 28.27% of its values.  Deleting these columns would skew the results of the preprocessing, so they will be adjusted during the pipeline phase of this analysis.*

In [64]:
df.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

*The target for the data is the "Item Outlet Sales" column.  The rest of the information will be kept in the X section of the data.*

In [65]:
y = df['Item_Outlet_Sales']

In [66]:
X = df.drop(columns = 'Item_Outlet_Sales')

*The data is being split here.  The target, or y is Item Outlet Sales.*

In [67]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)

*The data is put through the prepocessing object to get the data ready for modeling.  The selectors and columns are defined and ready for the machine learning to work properly.*

In [79]:
cat_selector = make_column_selector(dtype_include = 'object')
num_selector = make_column_selector(dtype_include = 'number')
mean_imputer = SimpleImputer(strategy = 'mean')
scaler = StandardScaler()
frequency_imputer = SimpleImputer(strategy = 'most_frequent')
ohe = OneHotEncoder(handle_unknown = 'ignore')
num_columns = num_selector(X_train)
cat_columns = cat_selector(X_train)
print('numeric columns are', num_columns)
print('categorical columns are', cat_columns)

numeric columns are ['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']
categorical columns are ['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']


In [74]:
num_tuple = (scaler, num_selector)
cat_tuple = (ohe, cat_selector)

In [76]:
col_transformer = make_column_transformer(num_tuple, cat_tuple, remainder = 'passthrough')
col_transformer

**This is where the preprocessor comes into play.  The numeric and categorical pipelines take in the data, clean it and prepare it for modeling.

In [72]:
numeric_pipeline = make_pipeline(mean_imputer, scaler)
numeric_pipeline

In [80]:
categorical_pipeline = make_pipeline(frequency_imputer, ohe)
categorical_pipeline

*Here, we will fit the dataset into the transformer and fill in the missing values.*

In [83]:
preprocessor = make_column_transformer (num_tuple, cat_tuple)
preprocessor.fit(X_train)

*Now, we will double check the dataset to make certain that the imputation is correct.*

In [90]:
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)

In [106]:
print(X_train_processed)

  (0, 0)	0.7431189556457063
  (0, 1)	-0.7127750716052249
  (0, 2)	1.8281092189993307
  (0, 3)	1.3278489341413338
  (0, 1320)	1.0
  (0, 1554)	1.0
  (0, 1565)	1.0
  (0, 1575)	1.0
  (0, 1583)	1.0
  (0, 1588)	1.0
  (0, 1591)	1.0
  (1, 0)	0.5058759246744862
  (1, 1)	-1.291052247079188
  (1, 2)	0.6033688805132759
  (1, 3)	1.3278489341413338
  (1, 1060)	1.0
  (1, 1555)	1.0
  (1, 1569)	1.0
  (1, 1575)	1.0
  (1, 1583)	1.0
  (1, 1588)	1.0
  (1, 1591)	1.0
  (2, 0)	-0.11958297515872977
  (2, 1)	1.8133186433439548
  (2, 2)	0.24454055715071762
  :	:
  (6389, 1585)	1.0
  (6389, 1587)	1.0
  (6389, 1590)	1.0
  (6390, 0)	1.605820886450142
  (6390, 1)	-0.22775520135060978
  (6390, 2)	-0.38377707627087376
  (6390, 3)	1.089516596141153
  (6390, 609)	1.0
  (6390, 1554)	1.0
  (6390, 1569)	1.0
  (6390, 1574)	1.0
  (6390, 1585)	1.0
  (6390, 1587)	1.0
  (6390, 1590)	1.0
  (6391, 0)	0.7431189556457063
  (6391, 1)	-0.9586768265695343
  (6391, 2)	-0.7383610459950399
  (6391, 3)	-0.10214509385974999
  (6391, 1410)	