<a href="https://colab.research.google.com/github/Jonny-T87/Dojo-Work/blob/main/Pipelines_Activity_(Core).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Pipelines Activity (Core)

- Jonny Tesfahun
- 06/22/22

How well can calories per serving be predicted from the manufacturer, cereal type, grams of fat, grams of sugars, and weight in ounces per serving?

At this point, you are just completing the pre-processing steps for this assignment.

You will need to:

1. Define features (X) and target (y).
2. Train test split the data to prepare for machine learning.
3. Identify each feature as numerical, ordinal, or nominal. (Please provide this answer in a text cell in your Colab notebook).
4. Use pipelines and column transformers to complete the following tasks:
 - Impute any missing values. Use the ‘mean’ strategy for numeric columns and the ‘most_frequent’ strategy for categorical columns.
 - One-hot encode the nominal features.
 - Scale the numeric columns.
5. All preprocessing steps should be contained within a single preprocessing object.
6. Use your preprocessing object to transform your data appropriately, avoiding data leakage, to make it ready for modeling. Show the resulting NumPy array.

In [30]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import set_config
set_config(display='diagram')

In [31]:
df = pd.read_excel('/content/drive/MyDrive/DojoBootCamp/Project Files/Cereal with missing values.xlsx')

In [32]:
df.head()

Unnamed: 0,name,Manufacturer,type,calories per serving,grams of protein,grams of fat,milligrams of sodium,grams of dietary fiber,grams of complex carbohydrates,grams of sugars,milligrams of potassium,vitamins and minerals (% of FDA recommendation),Display shelf,Weight in ounces per one serving,Number of cups in one serving,Rating of cereal
0,Apple Cinnamon Cheerios,General Mills,Cold,110.0,2.0,2.0,180.0,1.5,10.5,10.0,70.0,25.0,1.0,1.0,0.75,29.509541
1,Basic 4,General Mills,Cold,130.0,3.0,2.0,,2.0,18.0,,100.0,25.0,3.0,1.33,0.75,37.038562
2,Cheerios,General Mills,Cold,,6.0,2.0,290.0,2.0,17.0,1.0,105.0,25.0,1.0,1.0,1.25,50.764999
3,Cinnamon Toast Crunch,General Mills,Cold,120.0,1.0,3.0,210.0,0.0,13.0,9.0,45.0,25.0,2.0,1.0,0.75,19.823573
4,Clusters,General Mills,Cold,110.0,3.0,2.0,140.0,2.0,13.0,7.0,105.0,25.0,3.0,1.0,0.5,40.400208


In [33]:
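# Keep only the columns named in the question; .copy() gives an independent frame
# so that later column assignments do not act on a view of the original data
df = df[['Manufacturer', 'type', 'grams of fat', 'grams of sugars', 'Weight in ounces per one serving', 'calories per serving']].copy()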

In [34]:
df.head()

Unnamed: 0,Manufacturer,type,grams of fat,grams of sugars,Weight in ounces per one serving,calories per serving
0,General Mills,Cold,2.0,10.0,1.0,110.0
1,General Mills,Cold,2.0,,1.33,130.0
2,General Mills,Cold,2.0,1.0,1.0,
3,General Mills,Cold,3.0,9.0,1.0,120.0
4,General Mills,Cold,2.0,7.0,1.0,110.0


In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 6 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Manufacturer                      77 non-null     object 
 1   type                              68 non-null     object 
 2   grams of fat                      69 non-null     float64
 3   grams of sugars                   68 non-null     float64
 4   Weight in ounces per one serving  77 non-null     float64
 5   calories per serving              70 non-null     float64
dtypes: float64(4), object(2)
memory usage: 3.7+ KB


In [36]:
df.isna().sum()

Manufacturer                        0
type                                9
grams of fat                        8
grams of sugars                     9
Weight in ounces per one serving    0
calories per serving                7
dtype: int64

- Since `type` is nominal data with only two values, I will use a replacement dictionary to map Cold to 0 and Hot to 1. This also changes the column from object to a numeric dtype.

In [37]:
df['type'].value_counts()

Cold    65
Hot      3
Name: type, dtype: int64

In [38]:
replacement_dictionary = {'Cold':0, 'Hot':1}

In [39]:
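# Assign the recoded column back instead of replacing in place on a column selection;
# map() leaves missing values as NaN so they can still be imputed later
df['type'] = df['type'].map(replacement_dictionary)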
df['type']

0     0.0
1     0.0
2     0.0
3     0.0
4     0.0
     ... 
72    0.0
73    0.0
74    1.0
75    1.0
76    1.0
Name: type, Length: 77, dtype: float64
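
The recoding above can be sketched on a toy Series. The values here are made up for illustration and do not come from the cereal file. The point is that a dictionary mapping via `Series.map` leaves missing entries as NaN, which is why the column ends up as float64 rather than an integer dtype:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'type' column (illustrative values only)
toy_type = pd.Series(['Cold', 'Hot', np.nan, 'Cold'], name='type')

replacement_dictionary = {'Cold': 0, 'Hot': 1}

# map() returns a new Series; entries without a match (here NaN) stay NaN
encoded = toy_type.map(replacement_dictionary)

print(encoded.tolist())
print(encoded.dtype)
```

Because NaN is a float, the presence of any missing value forces the recoded column to a float dtype.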

Define features (X) and target (y).
- Features: Manufacturer, cereal type, grams of fat, grams of sugars, and weight in ounces per one serving
- Target: calories per serving

In [40]:
# Separate the features (X) from the target (y)
X = df.drop('calories per serving', axis=1)
y = df['calories per serving']
# Train/test split to prepare for machine learning.
# Note: y still has 7 missing values (see df.isna().sum() above); those rows
# would need to be dropped before a model is fit.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
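
As a quick check on what the split produces, here is a small sketch using 77 dummy rows (placeholders, not the cereal data). With no `test_size` given, `train_test_split` holds out 25% of the rows, rounded up, for testing:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 77 dummy rows standing in for the cereal data
X_demo = np.arange(77).reshape(-1, 1)
y_demo = np.arange(77)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Default test_size is 0.25: ceil(0.25 * 77) = 20 test rows, 57 training rows
print(X_tr.shape, X_te.shape)
```

The 57 training rows match the row count of the processed training array shown at the end of the notebook.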

In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 6 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Manufacturer                      77 non-null     object 
 1   type                              68 non-null     float64
 2   grams of fat                      69 non-null     float64
 3   grams of sugars                   68 non-null     float64
 4   Weight in ounces per one serving  77 non-null     float64
 5   calories per serving              70 non-null     float64
dtypes: float64(5), object(1)
memory usage: 3.7+ KB


Identify each feature as numerical, ordinal, or nominal.
- Manufacturer = nominal
- type = nominal (Cold/Hot has no natural order; it was recoded to 0/1 above, so it now has a numeric dtype)
- grams of fat, grams of sugars, weight in ounces = numerical
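
One way to support this kind of classification is to look at dtypes and the number of distinct values. The sketch below uses a small made-up frame (the column names mirror the notebook, the rows do not). It only separates numeric from non-numeric columns; deciding between nominal and ordinal still takes domain knowledge:

```python
import pandas as pd

# Illustrative rows only; 'Brand B' is a placeholder manufacturer
toy = pd.DataFrame({
    'Manufacturer': ['General Mills', 'Brand B', 'General Mills'],
    'type': ['Cold', 'Hot', 'Cold'],
    'grams of fat': [2.0, 0.0, 3.0],
})

# Non-numeric columns are candidates for nominal/ordinal features
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()
# Numeric columns are candidates for numerical features
numeric_cols = toy.select_dtypes(include='number').columns.tolist()

print(categorical_cols)
print(numeric_cols)
# Few distinct values in a text column usually signals a categorical feature
print(toy[categorical_cols].nunique().to_dict())
```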

In [42]:
# Column selectors for the column transformer:
# object columns go to the categorical pipeline, numeric columns to the numeric pipeline
cal_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')
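
To see what a selector returns, it can be called directly on a DataFrame. The following is a minimal sketch on a made-up two-row frame, with `type` already recoded to 0/1 as in this notebook. Note that the recoded `type` column is picked up by the numeric selector:

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Illustrative frame; 'Brand B' is a placeholder value
toy = pd.DataFrame({
    'Manufacturer': pd.Series(['General Mills', 'Brand B'], dtype='object'),
    'type': [0.0, 1.0],
    'grams of fat': [2.0, 3.0],
})

cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')

# Calling a selector on a DataFrame returns the matching column names
print(cat_selector(toy))
print(num_selector(toy))
```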

Impute any missing values. Use the ‘mean’ strategy for numeric columns and the ‘most_frequent’ strategy for categorical columns.

In [43]:
# Using Imputers
freq_imputer = SimpleImputer(strategy='most_frequent')
mean_imputer = SimpleImputer(strategy='mean')
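
The two strategies can be illustrated on tiny arrays. This is a sketch with made-up values: `'mean'` fills a gap with the column average, and `'most_frequent'` fills it with the most common value:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numeric column with one missing value: the mean of 1 and 3 is 2
num_col = np.array([[1.0], [np.nan], [3.0]])
mean_filled = SimpleImputer(strategy='mean').fit_transform(num_col)

# Categorical column with one missing value: 'Cold' appears most often
cat_col = np.array([['Cold'], ['Cold'], [np.nan], ['Hot']], dtype=object)
freq_filled = SimpleImputer(strategy='most_frequent').fit_transform(cat_col)

print(mean_filled.ravel())
print(freq_filled.ravel())
```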

In [44]:
# Using Scaler
scaler = StandardScaler()
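# Using One-hot encoder, returning a dense array
# (scikit-learn 1.2 renamed `sparse` to `sparse_output`; use that name on newer versions)
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)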
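
A small sketch of what these two transformers do, on made-up inputs. The example leaves the encoder's sparse setting at its default and calls `.toarray()` so that it does not depend on which name the installed scikit-learn version uses for that argument:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# StandardScaler: subtract the column mean and divide by the column std
values = np.array([[1.0], [2.0], [3.0]])
scaled = StandardScaler().fit_transform(values)

# OneHotEncoder: one indicator column per category seen during fit
labels = np.array([['A'], ['B'], ['A']], dtype=object)
ohe_demo = OneHotEncoder(handle_unknown='ignore')
encoded = ohe_demo.fit_transform(labels).toarray()

# handle_unknown='ignore': a category not seen during fit becomes all zeros
unseen = ohe_demo.transform(np.array([['C']], dtype=object)).toarray()

print(scaled.ravel())
print(encoded)
print(unseen)
```

The all-zero row for an unseen category is what keeps the test set from raising an error when it contains a manufacturer that never appeared in the training set.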

In [45]:
# Numeric pipeline for numbers
numeric_pipe = make_pipeline(mean_imputer, scaler)
numeric_pipe

In [46]:
# Categorical pipeline for objects
categorical_pipe = make_pipeline(freq_imputer, ohe)
categorical_pipe

In [53]:
# Tuples for Column Transformer, number tuple and category tuple
number_tuple = (numeric_pipe, num_selector)
category_tuple = (categorical_pipe, cal_selector)

In [54]:
#Using column transformer with diagram
preprocessing = make_column_transformer(number_tuple,category_tuple)
preprocessing
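
To make the layout of the transformed array easier to read, here is a minimal end-to-end sketch on a made-up frame. The column names `brand` and `fat` are hypothetical. It assumes the same structure as the preprocessing object above: numeric tuple first, categorical tuple second, so the scaled numeric columns come first in the output, followed by the one-hot columns:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing numeric value
toy = pd.DataFrame({
    'brand': pd.Series(['A', 'B', 'A', 'B'], dtype='object'),
    'fat': [1.0, np.nan, 3.0, 2.0],
})

num_pipe = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
cat_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'),
                         OneHotEncoder(handle_unknown='ignore'))

# sparse_threshold=0 asks the transformer to always return a dense array
pre = make_column_transformer(
    (num_pipe, make_column_selector(dtype_include='number')),
    (cat_pipe, make_column_selector(dtype_include='object')),
    sparse_threshold=0,
)
out = pre.fit_transform(toy)

# Column 0: scaled 'fat'; columns 1-2: one-hot 'brand' (A, B)
print(out.shape)
print(out)
```

Read the same way, the 10 columns of the notebook's processed array would be the 4 numeric features followed by 6 one-hot manufacturer columns.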

In [55]:
#fitting the ColumnTransformer on the training data.
preprocessing.fit(X_train)

In [56]:
#Using ColumnTransformer to transform both the training and testing datasets.
X_train_processed = preprocessing.transform(X_train)
X_test_processed = preprocessing.transform(X_test)
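
The reason for fitting on the training data only can be shown with a tiny sketch (made-up numbers). The scaler learns its mean and standard deviation from the training rows and then applies those same statistics to the test rows, so nothing about the test set leaks into the fitted parameters:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[0.0], [2.0]])
test = np.array([[10.0]])

# fit() sees only the training rows: mean = 1, std = 1
scaler_demo = StandardScaler().fit(train)

# transform() reuses the training statistics: (10 - 1) / 1 = 9
test_scaled = scaler_demo.transform(test)

print(scaler_demo.mean_, scaler_demo.scale_)
print(test_scaled)
```

Calling `fit` or `fit_transform` on the test set instead would recompute the statistics from test data, which is the leakage the assignment asks to avoid.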

In [57]:
#Checking if done correctly. Looks fine 
X_train_processed

array([[-2.22487418e-01, -9.74679434e-01,  9.94481647e-01,
        -1.32764897e-01,  0.00000000e+00,  1.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [-2.22487418e-01,  0.00000000e+00,  1.22191915e+00,
         2.03880702e+00,  0.00000000e+00,  1.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [-2.22487418e-01, -9.74679434e-01, -8.25018407e-01,
        -1.32764897e-01,  0.00000000e+00,  1.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [-2.22487418e-01,  0.00000000e+00,  1.67679417e+00,
         3.15749558e+00,  1.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [-2.22487418e-01,  0.00000000e+00, -1.42705887e-01,
        -1.32764897e-01,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  1.00000000e+00,
         0.

In [58]:
#Printing data for easy viewing
# A single .sum() already counts every NaN in the whole array
print(np.isnan(X_train_processed).sum(), 'missing values in training data')
print(np.isnan(X_test_processed).sum(), 'missing values in testing data')
print('\n')
print('All data in X_train_processed are', X_train_processed.dtype)
print('All data in X_test_processed are', X_test_processed.dtype)
print('\n')
print('shape of data is', X_train_processed.shape)
print('\n')
X_train_processed

0 missing values in training data
0 missing values in testing data


All data in X_train_processed are float64
All data in X_test_processed are float64


shape of data is (57, 10)




array([[-2.22487418e-01, -9.74679434e-01,  9.94481647e-01,
        -1.32764897e-01,  0.00000000e+00,  1.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [-2.22487418e-01,  0.00000000e+00,  1.22191915e+00,
         2.03880702e+00,  0.00000000e+00,  1.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [-2.22487418e-01, -9.74679434e-01, -8.25018407e-01,
        -1.32764897e-01,  0.00000000e+00,  1.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [-2.22487418e-01,  0.00000000e+00,  1.67679417e+00,
         3.15749558e+00,  1.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [-2.22487418e-01,  0.00000000e+00, -1.42705887e-01,
        -1.32764897e-01,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00,  1.00000000e+00,
         0.