<a href="https://colab.research.google.com/github/Rahafhosari/DataScience2024-2025/blob/master/abalone_processing_core.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pipeline Practice & Column Transformer Core

Name: Rahaf Hosari

### Mount Drive

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Loading Data

### Imports

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

from sklearn.preprocessing import OrdinalEncoder, StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn import set_config
set_config(transform_output='pandas')

### Read Data

In [4]:
url = '/content/drive/MyDrive/AXSOSACADEMY/02-IntroML/Week06/Abalone Preprocessing/abalone_data.csv'
df = pd.read_csv(url)

In [5]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sex             4177 non-null   object 
 1   length          4177 non-null   float64
 2   diameter        4177 non-null   float64
 3   height          4177 non-null   float64
 4   whole_weight    4177 non-null   float64
 5   shucked_weight  4177 non-null   float64
 6   viscera_weight  4177 non-null   float64
 7   shell_weight    4177 non-null   float64
 8   rings           4177 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


Unnamed: 0,sex,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


Perform basic EDA on the entire dataframe (for this assignment, you may skip the feature-by-feature inspection):
- Check the data types and convert dtypes, if needed.
- Check for duplicate rows and address them, if needed.
- Check for null values and impute them, if needed. (Impute them in a way that prevents data leakage!)
- Check for inconsistent categories and fix them, if needed.
- Check for impossible numeric values and fix them, if needed.

### Exploratory Data Analysis (EDA)

#### Data Types

In [8]:
# Check data types
df.dtypes

Unnamed: 0,0
sex,object
length,float64
diameter,float64
height,float64
whole_weight,float64
shucked_weight,float64
viscera_weight,float64
shell_weight,float64
rings,int64


`No data type conversions are needed.`

#### Duplicates

In [9]:
duplicated_rows = df.duplicated()
duplicated_rows.sum()

0

`There are no duplicate rows in the dataset.`

#### Null Values

In [10]:
df.isna().sum()

Unnamed: 0,0
sex,0
length,0
diameter,0
height,0
whole_weight,0
shucked_weight,0
viscera_weight,0
shell_weight,0
rings,0


`No missing values were found.`

#### Inconsistency

In [13]:
df.describe()

Unnamed: 0,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings
count,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0
mean,0.523992,0.407881,0.139516,0.828742,0.359367,0.180594,0.238831,9.933684
std,0.120093,0.09924,0.041827,0.490389,0.221963,0.109614,0.139203,3.224169
min,0.075,0.055,0.0,0.002,0.001,0.0005,0.0015,1.0
25%,0.45,0.35,0.115,0.4415,0.186,0.0935,0.13,8.0
50%,0.545,0.425,0.14,0.7995,0.336,0.171,0.234,9.0
75%,0.615,0.48,0.165,1.153,0.502,0.253,0.329,11.0
max,0.815,0.65,1.13,2.8255,1.488,0.76,1.005,29.0
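The `min` row of the summary is where impossible measurements tend to surface (here, a height of 0.0). As a minimal sketch, the same check can be made explicit by counting non-positive values per numeric column. The dataframe `toy` and its values are made up for illustration; only the column names mirror the abalone data.

```python
import pandas as pd

# Made-up rows that mimic a few of the abalone measurement columns.
toy = pd.DataFrame({
    "length": [0.455, 0.350, 0.530, 0.440],
    "height": [0.095, 0.000, 0.135, 0.125],
    "whole_weight": [0.514, 0.2255, 0.677, 0.516],
})

# Count values that are zero or negative in every numeric column.
non_positive_counts = (toy.select_dtypes("number") <= 0).sum()
print(non_positive_counts)

# Pull out the offending rows for a closer look.
suspect_rows = toy[(toy.select_dtypes("number") <= 0).any(axis=1)]
print(suspect_rows)
```

A physical measurement such as height or weight should be strictly positive, so any non-zero count here is worth investigating.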


##### Object Columns

In [11]:
categorial_cols = df.select_dtypes(include='object').columns
categorial_cols.values

array(['sex'], dtype=object)

In [14]:
for column in categorial_cols:
  # print the value counts for the column
  count =  df[column].value_counts().sort_values(ascending=False)
  print(f"Value counts for {column} : {count}")
  print()

Value counts for sex : sex
M    1528
I    1342
F    1307
Name: count, dtype: int64



`No inconsistencies observed.`
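The `sex` column is already clean, so no fix is needed here. For reference, a sketch of what a fix could look like when labels do disagree: normalize whitespace and case, then map aliases to a single label. The series below and the alias mapping are hypothetical and not taken from the abalone data.

```python
import pandas as pd

# Made-up labels with the kinds of inconsistencies worth checking for.
sex = pd.Series(["M", "m", " F", "I", "Female", "F "], name="sex")

# Normalize whitespace and case first, then map known aliases to one label.
cleaned = sex.str.strip().str.upper().replace({"FEMALE": "F", "MALE": "M"})
print(cleaned.value_counts())
```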

##### Numerical Columns

In [12]:
numeric_cols = df.select_dtypes('number').columns
numeric_cols.values

array(['length', 'diameter', 'height', 'whole_weight', 'shucked_weight',
       'viscera_weight', 'shell_weight', 'rings'], dtype=object)

In [17]:
# Percentage of unique values in each numeric column
for col in df.select_dtypes(include=np.number):
    unique_count = df[col].nunique()
    total_count = len(df[col])
    percentage_unique = (unique_count / total_count) * 100
    print(f"Column '{col}': {percentage_unique:.2f}% unique values")

Column 'length': 3.21% unique values
Column 'diameter': 2.66% unique values
Column 'height': 1.22% unique values
Column 'whole_weight': 58.15% unique values
Column 'shucked_weight': 36.27% unique values
Column 'viscera_weight': 21.07% unique values
Column 'shell_weight': 22.17% unique values
Column 'rings': 0.67% unique values


In [24]:
for column in numeric_cols:
  # print the value counts for the column
  count =  df[column].value_counts().sort_values(ascending=False)
  print(f"Value counts for {column} : {count}")
  print()

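`Impossible numeric values: two rows have height = 0.000. Treat them as missing so they can be imputed with the mean without data leakage.`

In [23]:
# A height of 0 is physically impossible, so mark it as missing.
# Filling with a mean computed on the full dataset would leak test-set information,
# so leave the NaNs for a SimpleImputer(strategy='mean') fitted on the training split.
df['height'] = df['height'].replace(0, np.nan)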
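One leakage-free way to handle such placeholder zeros is to mark them as missing before the split and let an imputer that is fitted on the training rows fill them afterwards. The sketch below uses made-up heights in a hypothetical dataframe `toy`; it is meant to show that the fill value comes from the training split alone, not to reproduce the abalone numbers.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up heights; 0.0 stands in for the impossible measurements.
toy = pd.DataFrame({"height": [0.10, 0.0, 0.14, 0.12, 0.16, 0.0, 0.13, 0.15]})

# Mark the impossible zeros as missing instead of filling them right away.
toy["height"] = toy["height"].replace(0, np.nan)

X_train, X_test = train_test_split(toy, random_state=42)

# Fit the imputer on the training split only, then reuse it on the test split.
imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)
X_test_filled = imputer.transform(X_test)

train_mean = X_train["height"].mean()  # pandas skips NaN by default
print(imputer.statistics_, train_mean)
```

The learned fill value in `imputer.statistics_` matches the training-split mean, so the test rows never influence it.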

### Define Target

- Separate your data into the feature matrix (X) and the target vector (y).
  - `rings` will be your y.
  - The rest of the features will be your X.
- Train/test split the data. Please use the random number 42 for consistency.

In [None]:
# Target vector
y = df['rings']

# Feature matrix
X = df.drop(columns=['rings'])

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train.head()

Unnamed: 0,sex,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight
3823,F,0.615,0.455,0.135,1.059,0.4735,0.263,0.274
3956,F,0.515,0.395,0.14,0.686,0.281,0.1255,0.22
3623,M,0.66,0.53,0.175,1.583,0.7395,0.3505,0.405
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15
2183,M,0.495,0.4,0.155,0.8085,0.2345,0.1155,0.35
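Because the call above passes no `test_size`, scikit-learn uses its default of a 25% test split, and `random_state=42` makes the shuffle repeatable. A small sketch with stand-in data (a hypothetical single-column frame with the same 4177 rows as the abalone dataframe) shows both effects:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the abalone dataframe (4177).
n_rows = 4177
X = pd.DataFrame({"feature": np.arange(n_rows)})
y = pd.Series(np.arange(n_rows), name="rings")

# No test_size given, so scikit-learn falls back to a 25% test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(len(X_train), len(X_test))

# Repeating the call with the same seed selects exactly the same rows.
X_train_again, _, _, _ = train_test_split(X, y, random_state=42)
same_rows = X_train.index.equals(X_train_again.index)
print(same_rows)
```

The shuffled index values also explain why `X_train.head()` starts at rows such as 3823 rather than 0.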


## Pipelines

### Ordinal Pipeline
* Save a list of ordinal features
* Impute null values using SimpleImputer using the "most_frequent" strategy.
* Use OrdinalEncoder to encode the "shelf" column.
* Scale the ordinal features using StandardScaler
* Display the pipeline to confirm the code was error-free

Three columns are of object type: `mfr`, `shelf` and `type`.
`shelf` is ordinal; `mfr` and `type` are nominal.

In [None]:
#Ordinal Columns
# ordinal_cols = ['shelf','type']
ordinal_cols = ['shelf']

In [None]:
#Imputer
impute_most_frequent = SimpleImputer(strategy='most_frequent')

In [None]:
#Encoder

# Specify the category order for the ordinal 'shelf' column
shelf_col_ord = ['top','bottom', 'middle'] # Labels found using df['shelf'].value_counts()
shelf_ordinal_categories = [shelf_col_ord]

ord_encoder = OrdinalEncoder(categories=shelf_ordinal_categories) # OR ord_encoder = OrdinalEncoder(categories=[shelf_col_ord])


In [None]:
#Scaler
ord_scaler = StandardScaler()

In [None]:
# Display the ordinal pipeline
ordinal_pipeline = make_pipeline(impute_most_frequent,ord_encoder,ord_scaler)
ordinal_pipeline
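With `OrdinalEncoder`, the position of each label in the `categories` list is the integer it is encoded as. The sketch below applies the same category order as the encoder cell above to a few made-up `shelf` values (the dataframe `toy` is hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up shelf labels; the category order is the one used in the cell above.
toy = pd.DataFrame({"shelf": ["top", "middle", "bottom", "top"]})
shelf_order = ["top", "bottom", "middle"]

encoder = OrdinalEncoder(categories=[shelf_order])
codes = encoder.fit_transform(toy)

# Each label becomes its index in shelf_order: top -> 0, bottom -> 1, middle -> 2.
print(codes.ravel())
```

Since the codes follow list position, a feature with a natural ranking is usually listed from lowest to highest (for example bottom, middle, top) so that the integers reflect that ranking. The order in which `value_counts()` returns labels is based on frequency and does not guarantee this.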

### Categorical (nominal) Pipeline
* Save a list of nominal features
* Impute null values using SimpleImputer using the ‘constant’ strategy with a fill value of "MISSING".
* Use OneHotEncoder to encode the features
* Be sure to include the arguments: sparse_output=False AND handle_unknown='ignore' when creating your OneHotEncoder.
* Display the pipeline to confirm the code was error-free

In [None]:
# nominal_cols = ['mfr','type']
#Select all object Type Columns and remove the Ordinal Columns selected before
nominal_cols = X_train.select_dtypes('object').drop(columns=ordinal_cols).columns
nominal_cols

Index(['mfr', 'type'], dtype='object')

In [None]:
#Impute Values with 'MISSING'
impute_constant_missing = SimpleImputer(strategy='constant', fill_value='MISSING')

In [None]:
#OneHotEncoder
ohe_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

In [None]:
#Display Nominal Pipeline
nominal_pipeline = make_pipeline(impute_constant_missing,ohe_encoder)
nominal_pipeline

### Numerical Pipeline
* Save a list of numerical features
* Impute null values using SimpleImputer using the ‘mean’ strategy.
* Scale the data with StandardScaler
* Display the pipeline to confirm the code was error-free

In [None]:
numerical_cols = X_train.select_dtypes('number').columns
numerical_cols

Index(['calories', 'protein', 'fat', 'sodium', 'fiber', 'carbo', 'sugars',
       'potass', 'vitamins', 'weight', 'cups'],
      dtype='object')

In [None]:
X_train[numerical_cols].isna().sum()

Unnamed: 0,0
calories,0
protein,0
fat,5
sodium,0
fiber,5
carbo,0
sugars,5
potass,0
vitamins,0
weight,0


The only numeric columns that have missing values are `fat`, `fiber` and `sugars`.

In [None]:
# Summary stats
X_train[numerical_cols].describe().round(2)

Unnamed: 0,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,weight,cups
count,57.0,57.0,52.0,57.0,52.0,57.0,52.0,57.0,57.0,57.0,57.0
mean,106.49,2.6,0.96,153.25,2.04,14.97,6.42,92.79,28.07,1.03,0.84
std,20.48,1.15,1.01,88.23,2.48,4.53,4.55,70.85,24.12,0.16,0.23
min,50.0,1.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,0.0,0.5,0.33
25%,100.0,2.0,0.0,125.0,0.0,12.0,3.0,35.0,25.0,1.0,0.67
50%,110.0,3.0,1.0,170.0,1.5,15.0,6.0,90.0,25.0,1.0,0.88
75%,110.0,3.0,1.0,200.0,3.0,18.0,10.0,120.0,25.0,1.0,1.0
max,160.0,6.0,5.0,290.0,14.0,23.0,15.0,330.0,100.0,1.5,1.5


In [None]:
#Impute by mean
impute_mean = SimpleImputer(strategy='mean')

In [None]:
#Scaler
scaler = StandardScaler()

In [None]:
#Display Numerical Pipeline
numerical_pipeline = make_pipeline(impute_mean,scaler)
numerical_pipeline
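As a rough check of what this pipeline computes, the sketch below runs mean imputation followed by standard scaling on a made-up `fat` column and compares the result with the same two steps done by hand. The values are hypothetical; `StandardScaler` divides by the population standard deviation, which matches NumPy's default `std`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numeric column with one missing value.
train = pd.DataFrame({"fat": [0.0, 1.0, np.nan, 2.0, 5.0]})

pipeline = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
scaled = np.asarray(pipeline.fit_transform(train))

# Reproduce the two steps by hand: fill with the mean, then z-score.
filled = train["fat"].fillna(train["fat"].mean()).to_numpy()
manual = (filled - filled.mean()) / filled.std()  # NumPy's std defaults to ddof=0

print(np.allclose(scaled.ravel(), manual))
```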

## Column Transformer

How well can the "rating" of cereal be predicted using the following features?

mfr, type, calories, protein, fat, fiber, sugars, shelf

Recall the following instructions:
- Define 3 tuples, each containing the name, the pipeline object, and the list of columns to which it should be applied.
- Create a column transformer object that encompasses the 3 preprocessing pipelines from the previous assignment.
- Fit the column transformer object to the training data.
- Store the transformed training data as X_train_processed and display its .head().
- Save the transformed testing data as X_test_processed and display its .head().

In [None]:
#Ordinal Tuple
ordinal_tuple = ('ordinal', ordinal_pipeline, ordinal_cols)
ordinal_tuple

('ordinal',
 Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')),
                 ('ordinalencoder',
                  OrdinalEncoder(categories=[['top', 'bottom', 'middle']])),
                 ('standardscaler', StandardScaler())]),
 ['shelf'])

In [None]:
#Nominal Tuple
nominal_tuple = ('nominal', nominal_pipeline, nominal_cols)
nominal_tuple

('nominal',
 Pipeline(steps=[('simpleimputer',
                  SimpleImputer(fill_value='MISSING', strategy='constant')),
                 ('onehotencoder',
                  OneHotEncoder(handle_unknown='ignore', sparse_output=False))]),
 Index(['mfr', 'type'], dtype='object'))

In [None]:
# Keep only the selected columns: calories, protein, fat, fiber, sugars
# All numerical columns = ['calories', 'protein', 'fat', 'sodium', 'fiber', 'carbo', 'sugars',
#                          'potass', 'vitamins', 'weight', 'cups']
numerical_cols = numerical_cols.drop(['sodium','carbo','potass','vitamins','weight','cups'])
numerical_cols

Index(['calories', 'protein', 'fat', 'fiber', 'sugars'], dtype='object')

In [None]:
#Numerical Tuple
numerical_tuple = ('numerical', numerical_pipeline, numerical_cols)
numerical_tuple

('numerical',
 Pipeline(steps=[('simpleimputer', SimpleImputer()),
                 ('standardscaler', StandardScaler())]),
 Index(['calories', 'protein', 'fat', 'fiber', 'sugars'], dtype='object'))

### Create Column Transformer

In [None]:
# Instantiate with verbose_feature_names_out=False
col_transformer = ColumnTransformer([ordinal_tuple,nominal_tuple,numerical_tuple],
                                    verbose_feature_names_out=False)
col_transformer

### Fit Col. Transformer to training data

In [None]:
col_transformer.fit(X_train)

### Store the transformed training data as X_train_processed and display its .head().

In [None]:
X_train_processed = col_transformer.transform(X_train)

# Convert to DataFrame
X_train_processed = pd.DataFrame(X_train_processed, columns=col_transformer.get_feature_names_out())
X_train_processed.head()

Unnamed: 0,shelf,mfr_A,mfr_G,mfr_K,mfr_MISSING,mfr_N,mfr_P,mfr_Q,mfr_R,type_C,type_H,calories,protein,fat,fiber,sugars
0,0.259645,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,-0.319703,-0.524507,-1.007451,-0.871334,1.992024
1,1.492961,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.172812,-0.524507,0.040298,-0.871334,-0.795023
2,-0.97367,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.650358,0.354813,0.040298,-0.01805,0.598501
3,0.259645,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,-0.319703,-0.524507,-1.007451,-0.444692,-1.027277
4,0.259645,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,-0.812218,0.354813,-1.007451,0.408592,-1.491785


### Save the transformed testing data as X_test_processed and display its .head().

In [None]:
X_test_processed = col_transformer.transform(X_test)

# Convert to DataFrame
X_test_processed = pd.DataFrame(X_test_processed, columns=col_transformer.get_feature_names_out())

In [None]:
X_test_processed.head()

Unnamed: 0,shelf,mfr_A,mfr_G,mfr_K,mfr_MISSING,mfr_N,mfr_P,mfr_Q,mfr_R,type_C,type_H,calories,protein,fat,fiber,sugars
0,-0.97367,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,-0.524507,1.088047,-0.444692,0.366247
1,-0.97367,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,-1.403826,1.088047,-0.444692,1.063009
2,1.492961,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,-1.403826,1.088047,-0.871334,1.295263
3,-0.97367,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.234133,0.040298,3.395084,-0.098261
4,-0.97367,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.234133,2.135796,0.408592,1.063009
