<a href="https://colab.research.google.com/github/Rahafhosari/DataScience2024-2025/blob/master/pipeline_column_transformer_core.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pipeline Practice & Column Transformer Core

Name : Rahaf Hosari

### Mount Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Loading Data

### Imports

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

### Read Data

In [4]:
url = '/content/drive/MyDrive/AXSOSACADEMY/02-IntroML/Week06/ColumnTransformer/cereal-kaggle-crawford-modified - sheet 1.csv'
df = pd.read_csv(url)

In [5]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      77 non-null     object 
 1   mfr       75 non-null     object 
 2   type      77 non-null     object 
 3   calories  72 non-null     float64
 4   protein   77 non-null     int64  
 5   fat       70 non-null     float64
 6   sodium    77 non-null     int64  
 7   fiber     71 non-null     float64
 8   carbo     77 non-null     float64
 9   sugars    71 non-null     float64
 10  potass    77 non-null     int64  
 11  vitamins  77 non-null     int64  
 12  shelf     75 non-null     object 
 13  weight    77 non-null     float64
 14  cups      77 non-null     float64
 15  rating    77 non-null     float64
dtypes: float64(8), int64(4), object(4)
memory usage: 9.8+ KB


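| | name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100% Bran | N | C | NaN | 4 | 1.0 | 130 | 10.0 | 5.0 | 6.0 | 280 | 25 | top | 1.0 | 0.33 | 68.402973 |
| 1 | 100% Natural Bran | Q | C | 120.0 | 3 | 5.0 | 15 | 2.0 | 8.0 | 8.0 | 135 | 0 | top | 1.0 | 1.0 | 33.983679 |
| 2 | All-Bran | K | C | 70.0 | 4 | 1.0 | 260 | 9.0 | 7.0 | 5.0 | 320 | 25 | top | 1.0 | 0.33 | 59.425505 |
| 3 | All-Bran with Extra Fiber | K | C | 50.0 | 4 | 0.0 | 140 | 14.0 | 8.0 | 0.0 | 330 | 25 | top | 1.0 | 0.5 | 93.704912 |
| 4 | Almond Delight | R | C | NaN | 2 | 2.0 | 200 | 1.0 | 14.0 | 8.0 | -1 | 25 | NaN | 1.0 | 0.75 | 34.384843 |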


### Define Target

In [8]:
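# Target
y = df['rating']

# Features: drop the identifier column 'name'
# NOTE: 'rating' (the target) is still a column of X here. Drop it as well,
# e.g. df.drop(columns=['name', 'rating']), to avoid target leakage.
X = df.drop(columns='name')

# Train/test split (default test_size=0.25)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train.head()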

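| | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | P | C | 100.0 | 2 | 0.0 | 45 | 0.0 | 11.0 | 15.0 | 40 | 25 | bottom | 1.0 | 0.88 | 35.252444 |
| 40 | G | C | 110.0 | 2 | 1.0 | 260 | 0.0 | 21.0 | 3.0 | 40 | 25 | middle | 1.0 | 1.5 | 39.241114 |
| 39 | K | C | 140.0 | 3 | 1.0 | 170 | 2.0 | 20.0 | 9.0 | 95 | 100 | top | 1.3 | 0.75 | 36.471512 |
| 16 | K | C | 100.0 | 2 | 0.0 | 290 | 1.0 | 21.0 | 2.0 | 35 | 25 | bottom | 1.0 | 1.0 | 45.863324 |
| 65 | N | C | 90.0 | 3 | 0.0 | 0 | 3.0 | 20.0 | 0.0 | 120 | 0 | bottom | 1.0 | 0.67 | 72.801787 |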
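
The `X_train` preview above still contains the `rating` column, which is also the target. A minimal sketch of a leakage-free split is shown here on a hypothetical toy frame (`toy_df`, `X_demo`, `y_demo` and the values are illustrative assumptions, not the cereal data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the cereal data
toy_df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
    'calories': [100.0, 110.0, 90.0, 120.0, 70.0, 50.0, 140.0, 100.0],
    'rating': [35.2, 39.2, 72.8, 34.0, 59.4, 93.7, 36.5, 45.9],
})

# Target
y_demo = toy_df['rating']

# Features: drop both the identifier and the target
X_demo = toy_df.drop(columns=['name', 'rating'])

# Same call as in the notebook; the default test_size is 0.25
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.columns.tolist())
```

With the target removed from the features, a later `select_dtypes('number')` call would no longer pick up `rating` as a numeric feature.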


In [11]:
df.isna().sum()

| Column | Null count |
|---|---|
| name | 0 |
| mfr | 2 |
| type | 0 |
| calories | 5 |
| protein | 0 |
| fat | 7 |
| sodium | 0 |
| fiber | 6 |
| carbo | 0 |
| sugars | 6 |


### Ordinal Pipeline
* Save a list of ordinal features
* Impute null values with SimpleImputer using the "most_frequent" strategy.
* Use OrdinalEncoder to encode the "shelf" column.
* Scale the ordinal features using StandardScaler.
* Display the pipeline to confirm the code was error-free

In [40]:
from sklearn.preprocessing import OrdinalEncoder, StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer

In [33]:
#Ordinal Columns
# ordinal_cols = ['shelf','type']
ordinal_cols = ['shelf']

In [34]:
#Imputer
impute_most_frequent = SimpleImputer(strategy='most_frequent')

In [35]:
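# Encoder

# Specify the category order for the ordinal 'shelf' column.
# OrdinalEncoder assigns integers by position in this list, so the list follows
# the physical shelf order (lowest to highest), not the frequency order
# returned by df['shelf'].value_counts().
shelf_col_ord = ['bottom', 'middle', 'top']
shelf_ordinal_categories = [shelf_col_ord]

ord_encoder = OrdinalEncoder(categories=shelf_ordinal_categories)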


In [36]:
#Scaler
ord_scaler = StandardScaler()

In [51]:
# Display the ordinal pipeline
ordinal_pipeline = make_pipeline(impute_most_frequent,ord_encoder,ord_scaler)
ordinal_pipeline
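
To see what the three ordinal steps do to actual values, here is a minimal sketch on a hypothetical five-row `shelf` column. The toy data and the names `toy_shelf`, `shelf_order`, `encoded_only` and `scaled` are assumptions for illustration, and the category order bottom < middle < top is assumed to be the intended ranking.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy column with one missing value
toy_shelf = pd.DataFrame({'shelf': ['bottom', 'middle', 'top', np.nan, 'top']})

# Assumed ranking: position in this list becomes the encoded integer
shelf_order = ['bottom', 'middle', 'top']

# Impute + encode only, to inspect the integer codes before scaling
encoded_only = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OrdinalEncoder(categories=[shelf_order]),
).fit_transform(toy_shelf)

# Full pipeline: impute, encode, then standardize
scaled = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OrdinalEncoder(categories=[shelf_order]),
    StandardScaler(),
).fit_transform(toy_shelf)

print(encoded_only.ravel())  # missing value is filled with 'top' before encoding
print(scaled.ravel())
```

The missing entry is filled with the most frequent label (`top` in this toy column) before encoding, so the encoder never sees a null.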

### Categorical (nominal) Pipeline
* Save a list of nominal features
* Impute null values with SimpleImputer using the "constant" strategy and a fill value of "MISSING".
* Use OneHotEncoder to encode the features
* Be sure to include the arguments: sparse_output=False AND handle_unknown='ignore' when creating your OneHotEncoder.
* Display the pipeline to confirm the code was error-free

In [38]:
# nominal_cols = ['mfr','type']
#Select all object Type Columns and remove the Ordinal Columns selected before
nominal_cols = X_train.select_dtypes('object').drop(columns=ordinal_cols).columns
nominal_cols

Index(['mfr', 'type'], dtype='object')

In [39]:
#Impute Values with 'MISSING'
impute_constant_missing = SimpleImputer(strategy='constant', fill_value='MISSING')

In [42]:
#OneHotEncoder
ohe_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

In [52]:
# Display nominal pipeline
nominal_pipeline = make_pipeline(impute_constant_missing,ohe_encoder)
nominal_pipeline

### Numerical Pipeline
* Save a list of numerical features
* Impute null values with SimpleImputer using the "mean" strategy.
* Scale the data with StandardScaler
* Display the pipeline to confirm the code was error-free

In [44]:
numerical_cols = X_train.select_dtypes('number').columns
numerical_cols

Index(['calories', 'protein', 'fat', 'sodium', 'fiber', 'carbo', 'sugars',
       'potass', 'vitamins', 'weight', 'cups', 'rating'],
      dtype='object')

In [46]:
X_train[numerical_cols].isna().sum()

| Column | Null count |
|---|---|
| calories | 0 |
| protein | 0 |
| fat | 5 |
| sodium | 0 |
| fiber | 5 |
| carbo | 0 |
| sugars | 5 |
| potass | 0 |
| vitamins | 0 |
| weight | 0 |


In the training split, the only numeric columns with missing values are `fat`, `fiber`, and `sugars` (5 each).

In [47]:
# Summary stats
X_train[numerical_cols].describe().round(2)

| | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 57.0 | 57.0 | 52.0 | 57.0 | 52.0 | 57.0 | 52.0 | 57.0 | 57.0 | 57.0 | 57.0 | 57.0 |
| mean | 106.49 | 2.6 | 0.96 | 153.25 | 2.04 | 14.97 | 6.42 | 92.79 | 28.07 | 1.03 | 0.84 | 43.62 |
| std | 20.48 | 1.15 | 1.01 | 88.23 | 2.48 | 4.53 | 4.55 | 70.85 | 24.12 | 0.16 | 0.23 | 13.98 |
| min | 50.0 | 1.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.0 | -1.0 | 0.0 | 0.5 | 0.33 | 22.74 |
| 25% | 100.0 | 2.0 | 0.0 | 125.0 | 0.0 | 12.0 | 3.0 | 35.0 | 25.0 | 1.0 | 0.67 | 33.98 |
| 50% | 110.0 | 3.0 | 1.0 | 170.0 | 1.5 | 15.0 | 6.0 | 90.0 | 25.0 | 1.0 | 0.88 | 40.45 |
| 75% | 110.0 | 3.0 | 1.0 | 200.0 | 3.0 | 18.0 | 10.0 | 120.0 | 25.0 | 1.0 | 1.0 | 50.83 |
| max | 160.0 | 6.0 | 5.0 | 290.0 | 14.0 | 23.0 | 15.0 | 330.0 | 100.0 | 1.5 | 1.5 | 93.7 |


In [48]:
#Impute by mean
impute_mean = SimpleImputer(strategy='mean')
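
A mean imputer learns one mean per column when it is fit and reuses that value on any later data. This is a minimal sketch on hypothetical toy frames (`train_demo`, `test_demo` and their values are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and test columns with missing values
train_demo = pd.DataFrame({'fat': [0.0, 1.0, np.nan, 2.0]})
test_demo = pd.DataFrame({'fat': [np.nan, 5.0]})

demo_imputer = SimpleImputer(strategy='mean')

# The mean is computed from the non-missing training values only
train_filled = demo_imputer.fit_transform(train_demo)

# The test set is filled with the training mean, not its own mean
test_filled = demo_imputer.transform(test_demo)

print(demo_imputer.statistics_)
print(test_filled.ravel())
```

Fitting on the training split only keeps information from the test split out of the preprocessing step.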

In [49]:
#Scaler
scaler = StandardScaler()

In [54]:
#Display Numerical Pipeline
numerical_pipeline = make_pipeline(impute_mean,scaler)
numerical_pipeline