
##Imports

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import set_config
set_config(display='diagram')


##Loading data

In [None]:
df = pd.read_excel('/content/Cereal with missing values.xlsx')

df.head()

Unnamed: 0,name,Manufacturer,type,calories per serving,grams of protein,grams of fat,milligrams of sodium,grams of dietary fiber,grams of complex carbohydrates,grams of sugars,milligrams of potassium,vitamins and minerals (% of FDA recommendation),Display shelf,Weight in ounces per one serving,Number of cups in one serving,Rating of cereal
0,Apple Cinnamon Cheerios,General Mills,Cold,110.0,2.0,2.0,180.0,1.5,10.5,10.0,70.0,25.0,1.0,1.0,0.75,29.509541
1,Basic 4,General Mills,Cold,130.0,3.0,2.0,,2.0,18.0,,100.0,25.0,3.0,1.33,0.75,37.038562
2,Cheerios,General Mills,Cold,,6.0,2.0,290.0,2.0,17.0,1.0,105.0,25.0,1.0,1.0,1.25,50.764999
3,Cinnamon Toast Crunch,General Mills,Cold,120.0,1.0,3.0,210.0,0.0,13.0,9.0,45.0,25.0,2.0,1.0,0.75,19.823573
4,Clusters,General Mills,Cold,110.0,3.0,2.0,140.0,2.0,13.0,7.0,105.0,25.0,3.0,1.0,0.5,40.400208


##Checking for duplicates

In [None]:
df.duplicated().sum()

0

###Dropping duplicates

In [None]:
df.drop_duplicates(inplace=True)

###Confirming there are no duplicates remaining 

In [None]:
df.duplicated().sum()

0

##Creating a copy of the Data

In [None]:
eda_df = df[['Manufacturer','type','grams of fat','grams of sugars','Weight in ounces per one serving', 'calories per serving']]


eda_df.head()

Unnamed: 0,Manufacturer,type,grams of fat,grams of sugars,Weight in ounces per one serving,calories per serving
0,General Mills,Cold,2.0,10.0,1.0,110.0
1,General Mills,Cold,2.0,,1.33,130.0
2,General Mills,Cold,2.0,1.0,1.0,
3,General Mills,Cold,3.0,9.0,1.0,120.0
4,General Mills,Cold,2.0,7.0,1.0,110.0


##Looking for missing values

In [None]:
eda_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 77 entries, 0 to 76
Data columns (total 6 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Manufacturer                      77 non-null     object 
 1   type                              68 non-null     object 
 2   grams of fat                      69 non-null     float64
 3   grams of sugars                   68 non-null     float64
 4   Weight in ounces per one serving  77 non-null     float64
 5   calories per serving              70 non-null     float64
dtypes: float64(4), object(2)
memory usage: 4.2+ KB


There are null values in 'calories per serving'. Since that is our target column, it cannot contain null values, and we should not impute it with the mean, median, or mode.
I will drop the rows where the target is missing.

In [None]:
eda_df = eda_df.dropna(subset = ['calories per serving'])


In [None]:
eda_df.shape

(70, 6)

In [None]:
eda_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70 entries, 0 to 76
Data columns (total 6 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Manufacturer                      70 non-null     object 
 1   type                              62 non-null     object 
 2   grams of fat                      62 non-null     float64
 3   grams of sugars                   62 non-null     float64
 4   Weight in ounces per one serving  70 non-null     float64
 5   calories per serving              70 non-null     float64
dtypes: float64(4), object(2)
memory usage: 3.8+ KB


##Splitting the data

In [None]:
X = eda_df.drop(columns = 'calories per serving')
y = eda_df['calories per serving']
X.head()

Unnamed: 0,Manufacturer,type,grams of fat,grams of sugars,Weight in ounces per one serving
0,General Mills,Cold,2.0,10.0,1.0
1,General Mills,Cold,2.0,,1.33
3,General Mills,Cold,3.0,9.0,1.0
4,General Mills,Cold,2.0,7.0,1.0
5,General Mills,Cold,1.0,13.0,1.0


###Train-test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 42)
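
The call above does not pass `test_size`, so scikit-learn falls back to its default of 0.25 and rounds the test count up. The sketch below reproduces that arithmetic on placeholder arrays (`X_demo` and `y_demo` are illustrative names, not part of the notebook) to show why 70 rows split into 52 training rows and 18 test rows.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as eda_df after dropping
# the missing targets (70). The values themselves are arbitrary.
n_rows = 70
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows)

# Same call pattern as the notebook: no test_size, fixed random_state
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# With the default test_size of 0.25 the test count is rounded up
expected_test = math.ceil(0.25 * n_rows)

print(len(X_tr), len(X_te), expected_test)
```

Passing `test_size` explicitly would make the split ratio visible in the cell itself.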


In [None]:
X_train.head()

Unnamed: 0,Manufacturer,type,grams of fat,grams of sugars,Weight in ounces per one serving
63,Quaker Oats,Cold,0.0,0.0,0.5
52,Post,,1.0,12.0,1.0
18,General Mills,Cold,1.0,3.0,1.0
37,Kelloggs,Cold,2.0,7.0,1.33
46,Nabisco,,0.0,0.0,0.83


In [None]:
X_train['Manufacturer'].value_counts()

Kelloggs          16
General Mills     13
Quaker Oats        6
Post               6
Ralston Purina     6
Nabisco            5
Name: Manufacturer, dtype: int64

In [None]:
X_train['type'].value_counts()

Cold    45
Hot      2
Name: type, dtype: int64

The Manufacturer and type columns are nominal categorical features. Treating type as ordinal would not be appropriate, because Hot and Cold have no natural order. We will use OneHotEncoder to convert both columns to numbers.
Grams of fat, grams of sugars, and weight in ounces per one serving are numeric features. We will standardize them so that they are on a common scale.
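
As a small illustration of the two transformations described here, the sketch below applies one-hot encoding and standardization to a hypothetical four-row frame (`toy` is made up for this example and is not the cereal data). It calls `.toarray()` on the encoder output rather than passing a dense-output argument, since that argument's name differs between scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, not taken from the cereal file
toy = pd.DataFrame({
    "type": ["Cold", "Hot", "Cold", "Cold"],
    "grams of fat": [0.0, 1.0, 2.0, 3.0],
})

# One-hot encoding: one 0/1 column per category, so no order is implied
ohe_demo = OneHotEncoder(handle_unknown="ignore")
type_encoded = ohe_demo.fit_transform(toy[["type"]]).toarray()

# Standardization: subtract the column mean, divide by its standard deviation
scaler_demo = StandardScaler()
fat_scaled = scaler_demo.fit_transform(toy[["grams of fat"]])

print(ohe_demo.categories_[0])   # the categories, in column order
print(type_encoded)              # one row per cereal, one column per category
print(fat_scaled.mean(), fat_scaled.std())
```

After scaling, the column has a mean of approximately 0 and a standard deviation of approximately 1, which is what puts features measured in different units on a comparable footing.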

##Pipelines and Column Transformers

###Creating the column selectors

In [None]:
num_selector = make_column_selector(dtype_include= 'number')
cat_selector = make_column_selector(dtype_include= 'object')

###Initializing the imputation strategies

I am using the mean to fill the null values in the numeric columns, as I expect it to work better here than the mode or median.
I am using the most frequent value in each object column to fill its null values, which keeps those rows in the dataset.
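
To make the difference between these strategies concrete, here is a minimal sketch on a hypothetical frame with one missing value per column (`toy` and the `*_filled` names are illustrative only). It shows what `SimpleImputer` writes into the gap under the mean, median, and most-frequent strategies.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "grams of sugars": [10.0, np.nan, 1.0, 9.0],
    "type": pd.Series(["Cold", "Cold", np.nan, "Hot"], dtype=object),
})

# Numeric column: compare the mean and median strategies
mean_filled = SimpleImputer(strategy="mean").fit_transform(toy[["grams of sugars"]])
median_filled = SimpleImputer(strategy="median").fit_transform(toy[["grams of sugars"]])

# Object column: fill with the most frequent category
cat_filled = SimpleImputer(strategy="most_frequent").fit_transform(toy[["type"]])

print(mean_filled[1, 0])    # mean of 10, 1 and 9
print(median_filled[1, 0])  # median of 10, 1 and 9
print(cat_filled[2, 0])     # most frequent category
```

On skewed columns the mean and median can differ noticeably, as they do here, so the choice between them is worth checking against the data.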

In [None]:
mean_imputer = SimpleImputer(strategy='mean')
most_freq_impute = SimpleImputer(strategy= 'most_frequent')

###Initializing the scaler and OHE

In [None]:
scaler = StandardScaler()

# Note: scikit-learn 1.2+ renames `sparse` to `sparse_output`; `sparse` was removed in 1.4
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)

###Instantiating the pipelines

In [None]:
num_pipe = make_pipeline(mean_imputer, scaler)
num_pipe

In [None]:
cat_pipe = make_pipeline(most_freq_impute, ohe)
cat_pipe

###Instantiating the column transformer

####Creating tuples for the column transformer

In [None]:
num_tuple = (num_pipe, num_selector)
cat_tuple = (cat_pipe, cat_selector)

####Building the column transformer

In [None]:
preprocessing = make_column_transformer(cat_tuple, num_tuple)

In [None]:
preprocessing

##Fitting the data

In [None]:
preprocessing.fit(X_train)

##Transform the X_train, X_test

In [None]:
X_train_processed = preprocessing.transform(X_train)

X_test_processed = preprocessing.transform(X_test)

##Inspecting the data

In [None]:
#Printing missing values
print("X_train has", np.isnan(X_train_processed).sum().sum(), "missing values")
print("X_test has", np.isnan(X_test_processed).sum().sum(), "missing values")

#Printing dtypes of the columns
print("\n")
print('The X_train has:', X_train_processed.dtype)
print('The X_test has:', X_test_processed.dtype)

#Printing the shape of the dataset. 
print("\n")
print('The shape of X_train is', X_train_processed.shape)
print('The shape of X_test is', X_test_processed.shape)

#Printing a few rows from the X_train_processed array
print("\n")
print("X_train_processed example:")
print(X_train_processed[3:6])

#Printing a few rows from the X_test_processed array
print("\n")
print("X_test_processed example:")
print(X_test_processed[5:8])

X_train has 0 missing values
X_test has 0 missing values


The X_train has: float64
The X_test has: float64


The shape of X_train is (52, 11)
The shape of X_test is (18, 11)


X_train_processed example:
[[ 0.          1.          0.          0.          0.          0.
   1.          0.          0.92332071  0.05441138  1.92573028]
 [ 0.          0.          1.          0.          0.          0.
   1.          0.         -1.00539366 -1.53835815 -1.35533049]
 [ 0.          1.          0.          0.          0.          0.
   1.          0.         -1.00539366  0.05441138 -0.23976983]]


X_test_processed example:
[[ 1.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
   2.14130310e-16 -8.55742637e-01 -2.39769825e-01]
 [ 1.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   0.00000000e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
   2.14130310e-16  1.19210391e+00 -2.39769825e-01]
 [ 0.00000000e