
# **Project 1 - Part 5 (CORE)**
- **Machine Learning** 
- **Week 1**
- **Harmony Gasologa**
- **February 1, 2023**

## **Tasks:**
You will need to:

- Reload the original data set [here](https://drive.google.com/file/d/1syH81TVrbBsdymLT_jl2JIf6IjPXtSQw/view) using `pd.read_csv()` to ensure there is no data leakage.

- [x] Before splitting your data, you can drop duplicates and fix inconsistencies in categorical data.* (*There is a way to do this after the split, but for this project, you may perform this step before the split)
- [x] Identify the features (X) and target (y): Assign the "Item_Outlet_Sales" column as your target and the rest of the relevant variables as your features matrix.
- [x] Perform a train test split
- [x] Create a preprocessing object to prepare the dataset for Machine Learning
- [x] Make sure your imputation of missing values occurs after the train test split using SimpleImputer.

## **Load Libraries and Inspect the Data**

## **1. Import Libraries**

In [None]:
# imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import set_config
set_config(display='diagram')

## **2. Load the Data**

In [None]:
path = '/content/drive/MyDrive/sales_predictions.csv'
df = pd.read_csv(path)
df.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [None]:
# Work on a copy so the original df stays unmodified
df_pp = df.copy()

## **3. Explore the Data**

In [None]:
# Look at the info from the data
# info() prints its summary directly and returns None, so call it on its own
df_pp.info()
print()
print(df_pp.isna().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB

Item_Identifier                 0
Item

## **4. Fix Inconsistencies in Categorical Data**

In [None]:
df_pp['Item_Fat_Content'].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

In [None]:
# Map every inconsistent label to its canonical form in a single call.
# Assigning the result back avoids calling replace(..., inplace=True) on a
# column selection, which newer pandas versions warn about.
df_pp['Item_Fat_Content'] = df_pp['Item_Fat_Content'].replace(
    {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}
)
df_pp['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64
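
After a cleanup like this, it can help to confirm that no unexpected labels remain. The sketch below uses a small made-up Series rather than the project data; `expected_labels` and `fat_content_map` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Toy stand-in for df_pp['Item_Fat_Content']; the values are made up
fat = pd.Series(['Low Fat', 'LF', 'reg', 'Regular', 'low fat'])

# Hypothetical mapping from each inconsistent label to its canonical form
fat_content_map = {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}
fat = fat.replace(fat_content_map)

# Any label outside the expected set would indicate a missed inconsistency
expected_labels = {'Low Fat', 'Regular'}
leftover = set(fat.unique()) - expected_labels

counts = fat.value_counts()
print(counts)
print('Unexpected labels:', leftover)
```

An empty `leftover` set means every value now belongs to one of the two canonical categories.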

## **Check for Duplicated, Missing, or Erroneous Data**

In [None]:
# Display the total number of missing values in the DataFrame
df_pp.isna().sum().sum()

3873

In [None]:
# Display the number of missing values per column
df_pp.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

## **5. Drop Duplicates**

In [None]:
# Check to see if there are any duplicated rows
df_pp.duplicated().sum()

0
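
This data set has no duplicated rows, so nothing needs to be dropped. For reference, here is a minimal sketch of how duplicates could be removed if any were found, using a made-up `toy` DataFrame rather than the project data.

```python
import pandas as pd

# Made-up frame in which the third row repeats the first
toy = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'DRC01', 'FDA15'],
    'Item_MRP': [249.81, 48.27, 249.81],
})

# duplicated() flags rows that are identical to an earlier row
n_dupes = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row by default
toy = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(toy))
```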

In [None]:
# Display descriptive statistics for the numeric columns
df_pp.describe()

| | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales |
|---|---|---|---|---|---|
| count | 7060.0 | 8523.0 | 8523.0 | 8523.0 | 8523.0 |
| mean | 12.857645 | 0.066132 | 140.992782 | 1997.831867 | 2181.288914 |
| std | 4.643456 | 0.051598 | 62.275067 | 8.37176 | 1706.499616 |
| min | 4.555 | 0.0 | 31.29 | 1985.0 | 33.29 |
| 25% | 8.77375 | 0.026989 | 93.8265 | 1987.0 | 834.2474 |
| 50% | 12.6 | 0.053931 | 143.0128 | 1999.0 | 1794.331 |
| 75% | 16.85 | 0.094585 | 185.6437 | 2004.0 | 3101.2964 |
| max | 21.35 | 0.328391 | 266.8884 | 2009.0 | 13086.9648 |


## **6. Split the Data (Validation Split)**

- Identify features (X) and target (y)

In [None]:
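# Split into features (X) and target (y); the target is Item_Outlet_Sales
# Use the cleaned copy (df_pp) so the categorical fixes carry into the split
X = df_pp.drop(columns='Item_Outlet_Sales')
y = df_pp['Item_Outlet_Sales']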

## **Perform a train test split**

In [None]:
# Split into training and test sets
# Set random_state to 42 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
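
As a quick sanity check on the split sizes, the sketch below repeats the same call on synthetic data with the same number of rows as the sales data (8523). The names `X_toy` and `y_toy` are placeholders, not part of the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

n_rows = 8523  # same row count as the sales data
X_toy = pd.DataFrame({'feature': np.arange(n_rows)})
y_toy = pd.Series(np.arange(n_rows), name='target')

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)

# The two pieces should cover every row exactly once
print(X_tr.shape, X_te.shape)
print(len(set(X_tr.index) & set(X_te.index)))  # rows shared by both sets
```

With `test_size=0.25`, about a quarter of the rows go to the test set, and no row appears in both sets.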

## **Pre-processing**

 - Impute missing values 
 - Use the 'mean' strategy for numeric columns 
 - Use the 'most_frequent' strategy for categorical columns
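
To illustrate why the imputer is fit on the training data only, here is a minimal sketch with made-up values. The fill values are learned from `train` and then reused for `test`, so nothing about the test set influences them. The `train` and `test` frames are toy data, not the project data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric column with one missing value in each split
train = pd.DataFrame({'Item_Weight': [1.0, np.nan, 3.0]})
test = pd.DataFrame({'Item_Weight': [np.nan, 100.0]})

mean_imp = SimpleImputer(strategy='mean')
mean_imp.fit(train)                      # learns the mean from train only
test_filled = mean_imp.transform(test)   # the test NaN gets the train mean

# Toy categorical column; object dtype is set explicitly
train_cat = pd.DataFrame(
    {'Outlet_Size': pd.Series(['Medium', 'Medium', 'High', np.nan], dtype=object)}
)
freq_imp = SimpleImputer(strategy='most_frequent')
train_cat_filled = freq_imp.fit_transform(train_cat)

print(test_filled)
print(train_cat_filled)
```

Here the missing test weight is filled with the training mean, unaffected by the extreme test value of 100.0.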

## **Instantiate Column Selectors**

In [None]:
# Instantiate column selectors
num_selector = make_column_selector(dtype_include='number')
cat_selector = make_column_selector(dtype_include='object')
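
A column selector is a callable: given a DataFrame, it returns the names of the matching columns. The sketch below shows this on a made-up `toy` frame; the column names mirror the sales data but the values are illustrative.

```python
import pandas as pd
from sklearn.compose import make_column_selector

toy = pd.DataFrame({
    'Item_MRP': [249.81, 48.27],
    'Outlet_Establishment_Year': [1999, 2009],
    # object dtype is set explicitly so the example does not depend on
    # how a given pandas version infers string columns
    'Outlet_Size': pd.Series(['Medium', 'High'], dtype=object),
})

num_cols = make_column_selector(dtype_include='number')(toy)
cat_cols = make_column_selector(dtype_include='object')(toy)

print(num_cols)
print(cat_cols)
```

Calling the selectors on `X_train` in the same way is one way to check which columns each pipeline will receive.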

## **Instantiate Transformers**

In [None]:
# Instantiate transformers
mean_imputer = SimpleImputer(strategy='mean')
freq_imputer = SimpleImputer(strategy='most_frequent')

# Scaler
scaler = StandardScaler()

# OneHotEncoder
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`, and `sparse` was removed in 1.4
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

- One-hot encode nominal features
- Scale the numeric columns
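
The `handle_unknown='ignore'` setting matters when the test set contains a category that never appeared in training. A minimal sketch with made-up location tiers, assuming the encoder's default sparse output (converted with `.toarray()` for display):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categories: 'Tier 2' appears only in the test data
train_cats = np.array([['Tier 1'], ['Tier 3']], dtype=object)
test_cats = np.array([['Tier 3'], ['Tier 2']], dtype=object)

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train_cats)

# Unseen categories are encoded as all zeros instead of raising an error
encoded = enc.transform(test_cats).toarray()
print(enc.categories_)
print(encoded)
```

The row for the unseen `'Tier 2'` value contains only zeros, so the transform step does not fail on new categories.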

## **Instantiate Pipelines**

In [None]:
# Instantiate pipelines

#Numeric Pipeline
numeric_pipe = make_pipeline(mean_imputer, scaler)
numeric_pipe

In [None]:
#Categorical Pipeline
categorical_pipe = make_pipeline(freq_imputer, ohe)
categorical_pipe

## **Instantiate ColumnTransformer**

In [None]:
# Tuples for Column Transformer
number_tuple = (numeric_pipe, num_selector)
category_tuple = (categorical_pipe, cat_selector)

# ColumnTransformer
preprocessor = make_column_transformer(number_tuple, category_tuple, remainder='passthrough')
preprocessor

## **Fit Preprocessor**

- All preprocessing steps should be contained within a single preprocessing object
- We fit the ColumnTransformer, which we called `preprocessor`, on the training data only. **(Never on testing data!)**

In [None]:
# fit on train
preprocessor.fit(X_train)

## **Transform Data**

In [None]:
# transform train and test
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)
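
A quick way to validate the result is to check that the processed arrays contain no missing values and have the same number of columns. The sketch below rebuilds a small version of the preprocessor on made-up data. The names `train`, `test`, and `pre` are placeholders, and `sparse_threshold=0` is an assumption made here so the output is always a dense array regardless of the encoder's defaults.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with missing values in both a numeric and a categorical column
train = pd.DataFrame({
    'Item_Weight': [9.3, np.nan, 17.5, 19.2],
    'Outlet_Size': pd.Series(['Medium', 'Medium', np.nan, 'High'], dtype=object),
})
test = pd.DataFrame({
    'Item_Weight': [np.nan, 8.93],
    'Outlet_Size': pd.Series(['Small', np.nan], dtype=object),  # 'Small' is unseen
})

num_pipe = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
cat_pipe = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OneHotEncoder(handle_unknown='ignore'),
)
pre = make_column_transformer(
    (num_pipe, make_column_selector(dtype_include='number')),
    (cat_pipe, make_column_selector(dtype_include='object')),
    sparse_threshold=0,  # assumption: force a dense array for easy inspection
)

pre.fit(train)                      # fit on train only
train_out = pre.transform(train)
test_out = pre.transform(test)

print(train_out.shape, test_out.shape)
print(np.isnan(train_out).sum(), np.isnan(test_out).sum())
```

The same two checks (matching column counts and zero remaining NaNs) can be run on `X_train_processed` and `X_test_processed`.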