<a href="https://colab.research.google.com/github/KDcodePy/Pipelines-Activity/blob/main/Pipeline_Activity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pipeline Activity
Name: Kim Hazed Delfino

## Imports 

In [59]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import set_config
set_config(display="diagram")

## Load the Data

In [60]:
filename = "/content/Cereal with missing values.xlsx - Sheet 1 - cereal.csv"
df = pd.read_csv(filename,index_col="name")

- Passing `index_col="name"` to `read_csv` sets the `name` column as the index, since every cereal name is unique.
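As a quick illustration of what `index_col` does, here is a minimal sketch on a small in-memory CSV. The three rows are a hypothetical stand-in for the cereal file, and `csv_text` and `demo` are names made up for this example.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the cereal file (not the real data).
csv_text = """name,Manufacturer,type
Frosted Flakes,Kelloggs,Cold
Lucky Charms,General Mills,Cold
Wheaties,General Mills,Cold
"""

# index_col="name" moves the "name" column out of the data and into the index.
demo = pd.read_csv(io.StringIO(csv_text), index_col="name")

# An index only works as a row label if every value is distinct.
print(demo.index.is_unique)

# Rows can now be looked up by cereal name.
print(demo.loc["Wheaties", "Manufacturer"])
```

Running the same `is_unique` check on `df.index` would confirm that the names in the real file are unique.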

## Data Exploration 

In [61]:
df.sample(10)

name,Manufacturer,type,calories per serving,grams of protein,grams of fat,milligrams of sodium,grams of dietary fiber,grams of complex carbohydrates,grams of sugars,milligrams of potassium,vitamins and minerals (% of FDA recommendation),Display shelf,Weight in ounces per one serving,Number of cups in one serving,Rating of cereal
Frosted Flakes,Kelloggs,Cold,110.0,1,0.0,200.0,1.0,14.0,11.0,25,25.0,1,1.0,0.75,31.435973
Lucky Charms,General Mills,Cold,110.0,2,,180.0,0.0,12.0,12.0,55,25.0,2,1.0,1.0,26.734515
Wheaties,General Mills,,100.0,3,,200.0,3.0,17.0,3.0,110,25.0,1,1.0,1.0,51.592193
Crispix,Kelloggs,Cold,,2,0.0,220.0,1.0,21.0,3.0,30,25.0,3,1.0,1.0,46.895644
Muesli Raisins; Dates; & Almonds,Ralston Purina,Cold,150.0,4,3.0,95.0,3.0,16.0,11.0,170,25.0,3,1.0,1.0,37.136863
Nutri-Grain Almond-Raisin,Kelloggs,Cold,140.0,3,2.0,220.0,3.0,21.0,7.0,130,25.0,3,1.33,0.67,40.69232
Bran Flakes,Post,Cold,90.0,3,0.0,210.0,5.0,13.0,5.0,190,25.0,3,1.0,0.67,53.313813
Fruity Pebbles,Post,,110.0,1,1.0,135.0,0.0,13.0,12.0,25,25.0,2,1.0,0.75,28.025765
Puffed Wheat,Quaker Oats,Cold,50.0,2,0.0,0.0,1.0,10.0,0.0,50,0.0,3,0.5,1.0,63.005645
Total Whole Grain,General Mills,Cold,100.0,3,1.0,200.0,3.0,16.0,3.0,110,100.0,3,1.0,1.0,46.658844


In [62]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 77 entries, Apple Cinnamon Cheerios to Quaker Oatmeal
Data columns (total 15 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   Manufacturer                                     77 non-null     object 
 1   type                                             68 non-null     object 
 2   calories per serving                             70 non-null     float64
 3   grams of protein                                 77 non-null     int64  
 4   grams of fat                                     69 non-null     float64
 5   milligrams of sodium                             76 non-null     float64
 6   grams of dietary fiber                           77 non-null     float64
 7   grams of complex carbohydrates                   77 non-null     float64
 8   grams of sugars                                  68 non-null     float64
 9   milli

In [63]:
df.isna().sum()

Manufacturer                                       0
type                                               9
calories per serving                               7
grams of protein                                   0
grams of fat                                       8
milligrams of sodium                               1
grams of dietary fiber                             0
grams of complex carbohydrates                     0
grams of sugars                                    9
milligrams of potassium                            0
vitamins and minerals (% of FDA recommendation)    1
Display shelf                                      0
Weight in ounces per one serving                   0
Number of cups in one serving                      0
Rating of cereal                                   0
dtype: int64

- The dataset has some missing values. Too few are missing to justify dropping rows or columns, so we can impute them instead: the `median` strategy for numerical features and the `most_frequent` strategy for categorical features.
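To make the two strategies concrete, here is a minimal sketch on toy arrays. The values are hypothetical, not rows from the cereal dataset, and `demo_median` and `demo_freq` are names used only in this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy columns, each with one missing entry in the third row.
fat = np.array([[0.0], [1.0], [np.nan], [3.0], [2.0]])
kind = np.array([["Cold"], ["Cold"], [np.nan], ["Hot"], ["Cold"]], dtype=object)

demo_median = SimpleImputer(strategy="median")
demo_freq = SimpleImputer(strategy="most_frequent")

# The median of the observed values 0, 1, 2, 3 is 1.5.
fat_filled = demo_median.fit_transform(fat)

# "Cold" appears most often, so it fills the gap.
kind_filled = demo_freq.fit_transform(kind)

print(fat_filled.ravel())
print(kind_filled.ravel())
```

The median is less sensitive to extreme values than the mean, which is one common reason to prefer it for numerical columns.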

In [64]:
df.duplicated().sum()

0

## Preprocessing 

In [65]:
X = df.drop(columns='calories per serving')
y = df["calories per serving"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [66]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Index: 77 entries, Apple Cinnamon Cheerios to Quaker Oatmeal
Data columns (total 14 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   Manufacturer                                     77 non-null     object 
 1   type                                             68 non-null     object 
 2   grams of protein                                 77 non-null     int64  
 3   grams of fat                                     69 non-null     float64
 4   milligrams of sodium                             76 non-null     float64
 5   grams of dietary fiber                           77 non-null     float64
 6   grams of complex carbohydrates                   77 non-null     float64
 7   grams of sugars                                  68 non-null     float64
 8   milligrams of potassium                          77 non-null     int64  
 9   vitam

- Categorical features: `Manufacturer` and `type`.
- Numerical features: all remaining features.
- Ordinal features: none. `Rating of cereal` might look like a candidate, but it holds float values on an unknown scale, so we can't treat it as ordinal.
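The split between categorical and numerical features can be checked by dtype. The sketch below uses a hypothetical two-row frame with values copied from the `df.sample(10)` output; `toy`, `cat_cols`, and `num_cols` are names made up for this example.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical two-row frame that mirrors a few of the cereal columns.
toy = pd.DataFrame({
    "Manufacturer": ["Kelloggs", "Post"],
    "type": ["Cold", "Cold"],
    "grams of protein": [1, 3],
    "Rating of cereal": [31.435973, 53.313813],
})

# Force object dtype on the text columns so the example behaves the same
# across pandas versions.
toy = toy.astype({"Manufacturer": object, "type": object})

# Calling a selector on a DataFrame returns the matching column names.
cat_cols = make_column_selector(dtype_include="object")(toy)
num_cols = make_column_selector(dtype_include="number")(toy)

print("Categorical:", cat_cols)
print("Numerical:", num_cols)
```

Note that `Rating of cereal` is selected as numerical here, which matches treating it as a continuous feature rather than an ordinal one.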

## Instantiate Column Selectors

In [67]:
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')

## Instantiate Transformers

In [68]:
scaler = StandardScaler()
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
freq_imputer = SimpleImputer(strategy='most_frequent')
median_imputer = SimpleImputer(strategy='median')

## Instantiate Pipeline

In [69]:
num_pipe = make_pipeline(median_imputer,scaler)
num_pipe

In [70]:
cat_pipe = make_pipeline(freq_imputer,ohe)
cat_pipe

## Instantiate ColumnTransformer

In [71]:
num_tuple = (num_pipe, num_selector)
cat_tuple = (cat_pipe, cat_selector)

preprocessor = make_column_transformer(num_tuple, cat_tuple)
preprocessor

## Transform Data

In [72]:
preprocessor.fit(X_train)

In [73]:
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)

## Inspecting the Result

In [75]:
print(np.isnan(X_train_processed).sum().sum(),"Missing values in training data")
print(np.isnan(X_test_processed).sum().sum(),'Missing values in testing data')
print()
print(f"All Data in X_train_processed are {X_train_processed}")
print(f"All Data in X_test_processed are {X_test_processed}")
print()
print(f"Shape of processed training data is {X_train_processed.shape}")
print()
X_train_processed

0 Missing values in training data
0 Missing values in testing data

All Data in X_train_processed are [[-1.30301442 -0.97467943  0.55786657 ...  0.          1.
   0.        ]
 [ 0.40438378  0.          0.67740941 ...  0.          1.
   0.        ]
 [ 0.40438378 -0.97467943  1.99238061 ...  0.          1.
   0.        ]
 ...
 [ 1.25808288  1.94935887 -0.03984761 ...  1.          1.
   0.        ]
 [ 0.40438378  0.97467943 -0.15939045 ...  0.          1.
   0.        ]
 [ 0.40438378  0.          0.07969522 ...  0.          1.
   0.        ]]
All Data in X_test_processed are [[ 0.40438378  0.97467943 -0.15939045 -0.02205781 -0.37217756  0.09965776
   0.21899349 -0.10151369  1.00332464 -0.1327649  -1.49653492 -0.16040716
   1.          0.          0.          0.          0.          0.
   1.          0.        ]
 [ 0.40438378  0.97467943 -0.03984761  0.44360702  0.61452575  1.46297599
   1.08693861 -0.10151369  1.00332464  3.15749558 -0.72312568 -0.86613455
   0.          1.          0.   

array([[-1.30301442, -0.97467943,  0.55786657, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.40438378,  0.        ,  0.67740941, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.40438378, -0.97467943,  1.99238061, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [ 1.25808288,  1.94935887, -0.03984761, ...,  1.        ,
         1.        ,  0.        ],
       [ 0.40438378,  0.97467943, -0.15939045, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.40438378,  0.        ,  0.07969522, ...,  0.        ,
         1.        ,  0.        ]])