<a href="https://colab.research.google.com/github/Kamal-Moha/Food-Sales-Predictions/blob/main/Stack2_Food_Sales_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Import necessary libraries**


In [None]:
# imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import set_config
set_config(display='diagram')



**Explore the Data**

In [None]:
filename = '/content/drive/MyDrive/CODING DOJO DS BOOTCAMP/Stack 1 - Data Science Fundamentals/02 Week 2: Pandas/Assignments/Core Assignments/sales_predictions.csv'
df = pd.read_csv(filename)
df.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [None]:
df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

**Ordinal Encoding**

In [None]:
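# Outlet_Size is ordinal (Small < Medium < High), so we can encode it as ordered integers
replace_dict = {'Small':0, 'Medium':1, 'High':2}
# Assign the result back rather than calling replace(inplace=True) on a column selection
df['Outlet_Size'] = df['Outlet_Size'].replace(replace_dict)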

In [None]:
df['Outlet_Size'].value_counts()

1.0    2793
0.0    2388
2.0     932
Name: Outlet_Size, dtype: int64
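
The same ordinal mapping can be tried on a small toy Series. This is a minimal sketch with made-up values (the `sizes` Series and the `size_order` name are hypothetical, not taken from the dataset). It shows that missing entries stay missing after encoding, which is likely why the counts above appear as floats rather than integers.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, including one missing entry
sizes = pd.Series(['Small', 'High', np.nan, 'Medium'])

# Same ordering as replace_dict in the cell above
size_order = {'Small': 0, 'Medium': 1, 'High': 2}

# map() returns a new Series; missing or unmapped entries become NaN
encoded = sizes.map(size_order)
print(encoded.tolist())
print('missing after encoding:', encoded.isna().sum())
```

The missing values are left in place here on purpose, so that a later imputation step can fill them.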

**Validation Split**

In [None]:
# Item_Outlet_Sales is our target vector

# Creating the target vector
y = df['Item_Outlet_Sales']
# Creating the Feature Matrix
X = df.drop(columns='Item_Outlet_Sales')
# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
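
To see what the split does when no `test_size` is given, here is a minimal sketch on hypothetical toy data (the `toy` frame and its column names are assumptions for illustration). With the defaults, `train_test_split` holds out 25% of the rows for testing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, one feature, one target
toy = pd.DataFrame({'feature': np.arange(100), 'target': np.arange(100) * 2.0})
X_toy = toy.drop(columns='target')
y_toy = toy['target']

# No test_size given, so 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)
print(X_tr.shape, X_te.shape)

# Features and targets stay aligned by index after the shuffle
print((X_tr.index == y_tr.index).all())
```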

**Instantiate Column Selectors**

In [None]:
# Separating our columns by dtype
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')
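
A selector is a callable: given a DataFrame, it returns the names of the matching columns. The sketch below uses a hypothetical toy frame (`toy`, `cat_sel`, and `num_sel` are illustrative names) to show what each selector would pick.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame: one object (text) column and two numeric columns
toy = pd.DataFrame({
    'Item_Type': pd.Series(['Dairy', 'Meat'], dtype=object),
    'Item_Weight': [9.3, 17.5],
    'Outlet_Establishment_Year': [1999, 2009],
})

cat_sel = make_column_selector(dtype_include='object')
num_sel = make_column_selector(dtype_include='number')

# Calling a selector on a DataFrame returns the matching column names
print(cat_sel(toy))
print(num_sel(toy))
```

Because `Outlet_Size` was mapped to numbers earlier, it would be picked up by the numeric selector rather than the categorical one.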

**Instantiate Transformers**

In [22]:

# SimpleImputer that fills missing values with the column median
median_imputer = SimpleImputer(strategy='median')

# OneHotEncoder (the `sparse` argument was renamed `sparse_output` in scikit-learn 1.2)
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
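
To make the behaviour of the two transformers concrete, here is a minimal sketch on hypothetical arrays (all values and variable names below are made up for illustration). It shows the median fill and what `handle_unknown='ignore'` does with a category that was not present during fitting. The encoder is left at its default sparse output and densified with `.toarray()`, so the snippet does not depend on the `sparse`/`sparse_output` argument name.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical numeric column with one missing value
weights = np.array([[9.3], [np.nan], [17.5], [5.92]])
imputer = SimpleImputer(strategy='median')
# The median of the observed values 5.92, 9.3, 17.5 is 9.3
filled = imputer.fit_transform(weights)
print(filled.ravel())

# Hypothetical categorical column; 'Tier 3' is never seen during fit
train_cats = np.array([['Tier 1'], ['Tier 2']], dtype=object)
test_cats = np.array([['Tier 3']], dtype=object)
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_cats)
# An unseen category is encoded as all zeros instead of raising an error
encoded = encoder.transform(test_cats).toarray()
print(encoded)
```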



**Instantiate Column Transformers**

In [23]:
# Making tuples for column transformation
num_tuple = (median_imputer, num_selector)
cat_tuple = (ohe, cat_selector)

In [24]:
# Creating col_transformer
col_transformer = make_column_transformer(num_tuple, cat_tuple, remainder='passthrough')
col_transformer
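
The imports at the top include `StandardScaler` and `make_pipeline`, which the transformer above does not use. If scaling the numeric columns were wanted, one possible arrangement is sketched below on a hypothetical toy frame. This is an assumption for illustration, not the notebook's method; `num_pipe`, `scaled_transformer`, and the toy data are made-up names and values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with a missing numeric value
toy = pd.DataFrame({
    'Item_Weight': [9.3, np.nan, 17.5, 5.92],
    'Outlet_Type': pd.Series(
        ['Grocery Store', 'Supermarket Type1', 'Grocery Store', 'Supermarket Type1'],
        dtype=object,
    ),
})

# Numeric branch: impute the median first, then standardize
num_pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())

scaled_transformer = make_column_transformer(
    (num_pipe, make_column_selector(dtype_include='number')),
    (OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include='object')),
    sparse_threshold=0,  # always return a dense array
)
out = scaled_transformer.fit_transform(toy)
print(out.shape)
print('mean of scaled column:', out[:, 0].mean())
```

Imputing before scaling matters here, since the scaler's mean and standard deviation are then computed on complete data.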

**Transform the Data**

In [33]:
# Fit the column transformer on the training data only
col_transformer.fit(X_train)



In [26]:
# Doing the actual transformation on X_train & X_test
X_train_processed = col_transformer.transform(X_train)
X_test_processed = col_transformer.transform(X_test)

**Inspect the Results**

In [30]:
# Check for missing values and confirm the data is imputed and one-hot encoded (no scaling was applied)
print(np.isnan(X_train_processed).sum().sum(), 'missing values in training data')
print(np.isnan(X_test_processed).sum().sum(), 'missing values in testing data')
print('\n')
print('All data in X_train_processed are', X_train_processed.dtype)
print('All data in X_test_processed are', X_test_processed.dtype)
print('\n')
print('shape of data is', X_train_processed.shape)
print('\n')
X_train_processed

0 missing values in training data
0 missing values in testing data


All data in X_train_processed are float64
All data in X_test_processed are float64


shape of data is (6392, 1593)




array([[1.63500000e+01, 2.95653090e-02, 2.56464600e+02, ...,
        0.00000000e+00, 1.00000000e+00, 0.00000000e+00],
       [1.52500000e+01, 0.00000000e+00, 1.79766000e+02, ...,
        0.00000000e+00, 1.00000000e+00, 0.00000000e+00],
       [1.23500000e+01, 1.58715731e-01, 1.57294600e+02, ...,
        1.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [1.76000000e+01, 1.89436660e-02, 2.37359000e+02, ...,
        1.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [2.03500000e+01, 5.43626950e-02, 1.17946600e+02, ...,
        1.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.63500000e+01, 1.69932040e-02, 9.57410000e+01, ...,
        1.00000000e+00, 0.00000000e+00, 0.00000000e+00]])

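The shape shows that the column count grew from 11 features to 1,593. Almost all of the new columns come from the **OneHotEncoder**, which creates one column per unique category. The `Item_Identifier_...` columns in the output below show that the high-cardinality `Item_Identifier` column accounts for most of them.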
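
One way to see where the extra columns come from is to count the unique values in each categorical column before encoding. The sketch below does this on a hypothetical toy frame (the rows and the `cardinality` name are made up for illustration).

```python
import pandas as pd

# Hypothetical toy frame: an ID-like column next to a low-cardinality one
toy = pd.DataFrame({
    'Item_Identifier': pd.Series(['FDA15', 'DRC01', 'FDN15', 'FDX07'], dtype=object),
    'Outlet_Type': pd.Series(
        ['Grocery Store', 'Supermarket Type1', 'Grocery Store', 'Supermarket Type1'],
        dtype=object,
    ),
})

# Each unique value becomes one output column under one-hot encoding
cardinality = toy.select_dtypes(include='object').nunique()
print(cardinality.sort_values(ascending=False))
print('one-hot columns:', cardinality.sum())
```

Running the same count on `X_train` would show how many of the 1,593 columns each categorical feature contributes.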

**Re-creating the DataFrame**

In [32]:
# Getting the original feature names by dropping the 'transformer__' prefix
processed_names = col_transformer.get_feature_names_out()
names = [i.split('__')[-1] for i in processed_names]

# Re-Creating the DataFrame after transformation
X_train_processed_df = pd.DataFrame(X_train_processed, columns=names)
X_train_processed_df.head()


| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Outlet_Size | Item_Identifier_DRA12 | Item_Identifier_DRA24 | Item_Identifier_DRA59 | Item_Identifier_DRB01 | Item_Identifier_DRB13 | ... | Outlet_Identifier_OUT045 | Outlet_Identifier_OUT046 | Outlet_Identifier_OUT049 | Outlet_Location_Type_Tier 1 | Outlet_Location_Type_Tier 2 | Outlet_Location_Type_Tier 3 | Outlet_Type_Grocery Store | Outlet_Type_Supermarket Type1 | Outlet_Type_Supermarket Type2 | Outlet_Type_Supermarket Type3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.35 | 0.029565 | 256.4646 | 2009.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 15.25 | 0.0 | 179.766 | 2009.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 12.35 | 0.158716 | 157.2946 | 1999.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 7.975 | 0.014628 | 82.325 | 2004.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 19.35 | 0.016645 | 120.9098 | 2002.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
