
##**Sales Prediction Project**

Joe Lardie

Dec 2022

This part continues the sales prediction project. The goal of this step is to help the retailer by using machine learning to make predictions about future sales based on the data provided.

For Part 5, you will go back to your original, uncleaned sales prediction dataset with the goal of preventing data leakage.

Reload the original dataset here using pd.read_csv() so that no cleaning or imputation from earlier parts carries over into this one.

- Before splitting your data, you can drop duplicates and fix inconsistencies in categorical data. (There is a way to do this after the split, but for this project you may perform this step before the split.)
- Identify the features (X) and target (y): assign the "Item_Outlet_Sales" column as your target and the rest of the relevant variables as your features matrix.
- Perform a train/test split.
- Create a preprocessing object to prepare the dataset for machine learning.
- Make sure your imputation of missing values occurs after the train/test split, using SimpleImputer.

Here is the data dictionary for this dataset:

| Variable Name | Description |
|---|---|
| Item_Identifier | Unique product ID |
| Item_Weight | Weight of product |
| Item_Fat_Content | Whether the product is low fat or regular |
| Item_Visibility | The percentage of total display area of all products in a store allocated to the particular product |
| Item_Type | The category to which the product belongs |
| Item_MRP | Maximum Retail Price (list price) of the product |
| Outlet_Identifier | Unique store ID |
| Outlet_Establishment_Year | The year in which the store was established |
| Outlet_Size | The size of the store in terms of ground area covered |
| Outlet_Location_Type | The type of area in which the store is located |
| Outlet_Type | Whether the outlet is a grocery store or some sort of supermarket |
| Item_Outlet_Sales | Sales of the product in the particular store. This is the target variable to be predicted. |


In [109]:
from google.colab import drive
drive.mount('/content/drive/')


Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [110]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn import set_config
from sklearn.pipeline import make_pipeline
set_config(display='diagram')

###**Loading Data**

In [111]:
# Load the original, uncleaned data
df = pd.read_csv('/content/sales_predictions.csv')
df.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [112]:
# dataFrame info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [113]:
#Creating a copy of the DataFrame
eda_ml = df.copy()

In [114]:
#Print missing values
eda_ml.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [115]:
eda_ml.duplicated().sum()

0

There are no duplicate rows.

In [116]:
#Looking at the info from the data
eda_ml.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [117]:
#Display descriptive stats for all columns
eda_ml.describe(include='all')

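| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8523 | 7060.0 | 8523 | 8523.0 | 8523 | 8523.0 | 8523 | 8523.0 | 6113 | 8523 | 8523 | 8523.0 |
| unique | 1559 | NaN | 5 | NaN | 16 | NaN | 10 | NaN | 3 | 3 | 4 | NaN |
| top | FDW13 | NaN | Low Fat | NaN | Fruits and Vegetables | NaN | OUT027 | NaN | Medium | Tier 3 | Supermarket Type1 | NaN |
| freq | 10 | NaN | 5089 | NaN | 1232 | NaN | 935 | NaN | 2793 | 3350 | 5577 | NaN |
| mean | NaN | 12.857645 | NaN | 0.066132 | NaN | 140.992782 | NaN | 1997.831867 | NaN | NaN | NaN | 2181.288914 |
| std | NaN | 4.643456 | NaN | 0.051598 | NaN | 62.275067 | NaN | 8.37176 | NaN | NaN | NaN | 1706.499616 |
| min | NaN | 4.555 | NaN | 0.0 | NaN | 31.29 | NaN | 1985.0 | NaN | NaN | NaN | 33.29 |
| 25% | NaN | 8.77375 | NaN | 0.026989 | NaN | 93.8265 | NaN | 1987.0 | NaN | NaN | NaN | 834.2474 |
| 50% | NaN | 12.6 | NaN | 0.053931 | NaN | 143.0128 | NaN | 1999.0 | NaN | NaN | NaN | 1794.331 |
| 75% | NaN | 16.85 | NaN | 0.094585 | NaN | 185.6437 | NaN | 2004.0 | NaN | NaN | NaN | 3101.2964 |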
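
The summary reports 5 unique values for Item_Fat_Content, although the data dictionary describes only two categories (low fat or regular). A minimal sketch of how such inconsistencies could be standardized before the split is shown here. The variant spellings (`LF`, `low fat`, `reg`) and the toy DataFrame are assumptions for illustration, not values confirmed by this notebook; the actual labels should be checked with `value_counts()` first.

```python
import pandas as pd

# Toy data standing in for the real column; the variant spellings are assumed
toy = pd.DataFrame(
    {"Item_Fat_Content": ["Low Fat", "LF", "low fat", "Regular", "reg"]}
)

# Inspect the labels that are actually present before deciding on a mapping
print(toy["Item_Fat_Content"].value_counts())

# Map each variant spelling onto one canonical label
fat_map = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
toy["Item_Fat_Content"] = toy["Item_Fat_Content"].replace(fat_map)

print(toy["Item_Fat_Content"].unique())
```

Because this only renames labels and does not compute anything from the data, it can be done before the split without leaking information from the test set.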


###**Identify the features X and target y**

In [118]:
# Define the features matrix X and the target vector y
X = df.drop(columns='Item_Outlet_Sales')
y = df['Item_Outlet_Sales']

###**Train, Test, Split**

In [119]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
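
Since no test_size is passed, train_test_split falls back to its default of holding out 25% of the rows. The sketch below illustrates the resulting split sizes on synthetic data with the same number of rows as this dataset (8523); the column names `feature` and `target` are placeholders, not columns from the sales data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the sales data
n_rows = 8523
X_demo = pd.DataFrame({"feature": np.arange(n_rows)})
y_demo = pd.Series(np.arange(n_rows), name="target")

# With no test_size given, 25% of the rows are held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(X_tr.shape, X_te.shape)
```

Fixing random_state makes the split reproducible, so later cells always see the same training rows.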

###**Create a preprocessing object to prepare the dataset for Machine Learning**

**Instantiate Transformers**

In [120]:
#Create selectors for the categorical and numeric columns
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')

In [121]:
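#Instantiate the standard scaler and the one-hot encoder
scaler = StandardScaler()
#sparse_output replaces the sparse argument in scikit-learn 1.2+; on older versions use sparse=False
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')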

In [122]:
#Make tuples for processing the categorical and numeric columns
num_tuple = (scaler, num_selector)
cat_tuple = (ohe, cat_selector)

In [123]:
#Create the column transformer (make_column_transformer is already imported above)
col_transformer = make_column_transformer(num_tuple, cat_tuple, remainder='passthrough')

In [124]:
#Fit Transformer
col_transformer.fit(X_train)

In [125]:
#Transform
X_train_processed = col_transformer.transform(X_train)
X_test_processed = col_transformer.transform(X_test)

In [126]:
#View the transformed training data
X_train_df = pd.DataFrame(X_train_processed)
X_train_df.head()

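| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1586 | 1587 | 1588 | 1589 | 1590 | 1591 | 1592 | 1593 | 1594 | 1595 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.743119 | -0.712775 | 1.828109 | 1.327849 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.505876 | -1.291052 | 0.603369 | 1.327849 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | -0.119583 | 1.813319 | 0.244541 | 0.136187 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | -1.063163 | -1.004931 | -0.952591 | 0.732018 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 1.390145 | -0.965484 | -0.33646 | 0.493686 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |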


In [127]:
X_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6392 entries, 0 to 6391
Columns: 1596 entries, 0 to 1595
dtypes: float64(1596)
memory usage: 77.8 MB
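
Most of the 1596 columns are one-hot columns: Item_Identifier alone has 1559 unique values in the full dataset, and each unique value seen during fitting becomes its own column. The sketch below, on a hypothetical toy frame, shows how column cardinality drives the encoded width and how dropping an ID-like column shrinks it. Dropping the identifier is one option to consider, not a step this notebook takes.

```python
import pandas as pd

# Toy frame (hypothetical rows): an ID-like column plus a low-cardinality column
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDN15", "FDX07", "NCD19", "FDA15"],
    "Outlet_Type": ["Supermarket Type1", "Supermarket Type2", "Supermarket Type1",
                    "Grocery Store", "Supermarket Type1", "Grocery Store"],
})

# Each unique value becomes one column after one-hot encoding
cardinality = toy.nunique()
print(cardinality)

# Width of the encoded matrix with and without the identifier column
width_with_id = pd.get_dummies(toy).shape[1]
width_without_id = pd.get_dummies(toy.drop(columns="Item_Identifier")).shape[1]
print(width_with_id, width_without_id)
```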


In [129]:
#Creating a categorical data selector
cat_selector = make_column_selector(dtype_include = 'object')
cat_selector(X_train)

['Item_Identifier',
 'Item_Fat_Content',
 'Item_Type',
 'Outlet_Identifier',
 'Outlet_Size',
 'Outlet_Location_Type',
 'Outlet_Type']

In [130]:
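#Creating a pipeline for handling categorical data
impute_cat = SimpleImputer(strategy='most_frequent')
#sparse_output replaces the sparse argument in scikit-learn 1.2+; on older versions use sparse=False
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')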

cat_pipe = make_pipeline(impute_cat, ohe)
cat_pipe

In [131]:
#Creating a numeric data selector 
num_selector = make_column_selector(dtype_include='number')
num_selector(X_train)

['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']

In [132]:
#Creating a pipeline for handling numerical data
impute_num = SimpleImputer(strategy= 'mean')
scaler = StandardScaler()

num_pipe = make_pipeline(impute_num,scaler)
num_pipe

In [133]:
#Combine into 1 transformer
preprocessor = make_column_transformer((cat_pipe,cat_selector),
                                       (num_pipe,num_selector))
preprocessor

In [134]:
#Test the column transformer by fitting and transforming the training data
preprocessor.fit_transform(X_train)

array([[ 0.        ,  0.        ,  0.        , ..., -0.71277507,
         1.82810922,  1.32784893],
       [ 0.        ,  0.        ,  0.        , ..., -1.29105225,
         0.60336888,  1.32784893],
       [ 0.        ,  0.        ,  0.        , ...,  1.81331864,
         0.24454056,  0.13618724],
       ...,
       [ 0.        ,  0.        ,  0.        , ..., -0.92052713,
         1.52302674,  0.49368575],
       [ 0.        ,  0.        ,  0.        , ..., -0.2277552 ,
        -0.38377708,  1.0895166 ],
       [ 0.        ,  0.        ,  0.        , ..., -0.95867683,
        -0.73836105, -0.10214509]])
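
To avoid data leakage, the statistics used for imputation and scaling should be learned from the training rows only and then reused on the test rows. The sketch below illustrates this with SimpleImputer on hypothetical toy weights: the test set's missing value is filled with the training mean, not a mean that includes test data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy weights (hypothetical values) with one missing entry in each split
train = pd.DataFrame({"Item_Weight": [10.0, 20.0, np.nan]})
test = pd.DataFrame({"Item_Weight": [np.nan, 100.0]})

imputer = SimpleImputer(strategy="mean")

# fit learns the mean from the training rows only (15.0 here)
imputer.fit(train)

# transform reuses that training mean for the test rows
test_imputed = imputer.transform(test)
print(imputer.statistics_, test_imputed.ravel())
```

The same pattern applies to the preprocessor above: fit it on X_train, then call preprocessor.transform(X_test) without refitting.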