<a href="https://colab.research.google.com/github/Rahafhosari/DataScience2024-2025/blob/master/abalone_processing_core.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pipeline Practice & Column Transformer Core

Name : Rahaf Hosari

### Mount Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Loading Data

### Imports

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

from sklearn.preprocessing import OrdinalEncoder, StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn import set_config
set_config(transform_output='pandas')

### Read Data

In [3]:
url = '/content/drive/MyDrive/AXSOSACADEMY/02-IntroML/Week06/Abalone Preprocessing/abalone_data.csv'
df = pd.read_csv(url)

In [4]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sex             4177 non-null   object 
 1   length          4177 non-null   float64
 2   diameter        4177 non-null   float64
 3   height          4177 non-null   float64
 4   whole_weight    4177 non-null   float64
 5   shucked_weight  4177 non-null   float64
 6   viscera_weight  4177 non-null   float64
 7   shell_weight    4177 non-null   float64
 8   rings           4177 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


Unnamed: 0,sex,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7


Perform basic EDA on the entire dataframe (for this assignment, you may skip the feature-by-feature inspection):
- Check the data types and convert dtypes, if needed.
- Check for duplicate rows and address them, if needed.
- Check for null values and impute them, if needed. (Impute them in a way that prevents data leakage!)
- Check for inconsistent categories and fix them, if needed.
- Check for impossible numeric values and fix them, if needed.

### Exploratory Data Analysis (EDA)

#### Data Types

In [5]:
# Check data types
df.dtypes

Unnamed: 0,0
sex,object
length,float64
diameter,float64
height,float64
whole_weight,float64
shucked_weight,float64
viscera_weight,float64
shell_weight,float64
rings,int64


`No data type conversions are needed.`

#### Duplicates

In [6]:
duplicated_rows = df.duplicated()
duplicated_rows.sum()

0

`There are no duplicate rows in the dataset.`

#### Null Values

In [7]:
df.isna().sum()

Unnamed: 0,0
sex,0
length,0
diameter,0
height,0
whole_weight,0
shucked_weight,0
viscera_weight,0
shell_weight,0
rings,0


`No missing values were found.`

#### Inconsistency

In [8]:
df.describe()

Unnamed: 0,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings
count,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0
mean,0.523992,0.407881,0.139516,0.828742,0.359367,0.180594,0.238831,9.933684
std,0.120093,0.09924,0.041827,0.490389,0.221963,0.109614,0.139203,3.224169
min,0.075,0.055,0.0,0.002,0.001,0.0005,0.0015,1.0
25%,0.45,0.35,0.115,0.4415,0.186,0.0935,0.13,8.0
50%,0.545,0.425,0.14,0.7995,0.336,0.171,0.234,9.0
75%,0.615,0.48,0.165,1.153,0.502,0.253,0.329,11.0
max,0.815,0.65,1.13,2.8255,1.488,0.76,1.005,29.0
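The `min` row above is where impossible values show up (a height of 0.0). As a minimal sketch, the same check can be done programmatically by counting non-positive values per numeric column. The `toy` frame and its values below are made up for illustration and are not the abalone data.

```python
import pandas as pd

# Toy frame standing in for the abalone data (values are made up for illustration).
toy = pd.DataFrame({
    "length": [0.455, 0.350, 0.530, 0.440],
    "height": [0.095, 0.000, 0.135, 0.000],
    "rings": [15, 7, 9, 10],
})

# Physical measurements should be strictly positive, so count values <= 0 per column.
is_non_positive = toy.select_dtypes("number") <= 0
non_positive_counts = is_non_positive.sum()
print(non_positive_counts)

# Pull out the offending rows for inspection.
suspect_rows = toy[is_non_positive.any(axis=1)]
print(suspect_rows)
```

On the real data, the same two lines could be applied to `df` instead of `toy`.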


##### Object Columns

In [9]:
categorial_cols = df.select_dtypes(include='object').columns
categorial_cols.values

array(['sex'], dtype=object)

In [10]:
for column in categorial_cols:
  # print the value counts for the column
  count =  df[column].value_counts().sort_values(ascending=False)
  print(f"Value counts for {column} : {count}")
  print()

Value counts for sex : sex
M    1528
I    1342
F    1307
Name: count, dtype: int64



`No inconsistent categories observed: sex takes only the values M, I, and F.`

##### Numerical Columns

In [11]:
numeric_cols = df.select_dtypes('number').columns
numeric_cols.values

array(['length', 'diameter', 'height', 'whole_weight', 'shucked_weight',
       'viscera_weight', 'shell_weight', 'rings'], dtype=object)

In [None]:
# Percentage of unique values in each numeric column
for col in df.select_dtypes(include=np.number):
    unique_count = df[col].nunique()
    total_count = len(df[col])
    percentage_unique = (unique_count / total_count) * 100
    print(f"Column '{col}': {percentage_unique:.2f}% unique values")

In [None]:
for column in numeric_cols:
  # print the value counts for the column
  count =  df[column].value_counts().sort_values(ascending=False)
  print(f"Value counts for {column} : {count}")
  print()

`Impossible numeric values: two rows have height = 0.000. Replace the zero heights with the mean height.`

In [12]:
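# Mean height over the full dataset (zeros included), rounded to 3 decimals
height_mean = df.describe().loc['mean', 'height'].round(3)
# Replace the impossible zero heights with that mean
df['height'] = df['height'].replace(to_replace=0.000, value=height_mean)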
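The cell above takes the mean over the whole dataset, zeros included, before the train/test split, so the replacement value is influenced by rows that later land in the test set. A minimal sketch of one leakage-free alternative follows, assuming the zeros are first marked as missing and then filled by a `SimpleImputer` fitted on the training rows only. The `toy` frame and its values are made up for illustration and are not the abalone data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up heights; the zeros play the role of the impossible values.
toy = pd.DataFrame({"height": [0.10, 0.00, 0.14, 0.12, 0.00, 0.16, 0.11, 0.13]})

# Mark impossible zeros as missing, then split before computing any statistic.
toy["height"] = toy["height"].replace(0.0, np.nan)
toy_train, toy_test = train_test_split(toy, random_state=42)

# The imputer learns the mean from the training rows only (NaNs are ignored).
imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(toy_train)
test_filled = imputer.transform(toy_test)

print(imputer.statistics_)
```

In a full workflow, an imputer like this could sit in front of the scaler in a numeric pipeline, so that it is fitted together with the rest of the preprocessing.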

### Define Target

- Separate your data into the feature matrix (X) and the target vector (y).
  - rings will be your y.
  - The rest of the features will be your X.
- Train/test split the data. Please use random_state=42 for consistency.

In [13]:
# Target vector
y = df['rings']

# Feature matrix
X = df.drop(columns = ['rings'])

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train.head()

Unnamed: 0,sex,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight
3823,F,0.615,0.455,0.135,1.059,0.4735,0.263,0.274
3956,F,0.515,0.395,0.14,0.686,0.281,0.1255,0.22
3623,M,0.66,0.53,0.175,1.583,0.7395,0.3505,0.405
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15
2183,M,0.495,0.4,0.155,0.8085,0.2345,0.1155,0.35
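Since no `test_size` is passed, `train_test_split` uses its default of 25% for the test set. The sketch below illustrates the resulting sizes and the reproducibility of the split on synthetic stand-in data with the same number of rows (4177). `X_demo` and `y_demo` are hypothetical names, and the values are random rather than taken from the abalone file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the abalone file; values are synthetic.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.random(4177)})
y_demo = pd.Series(rng.integers(1, 30, size=4177), name="rings")

# No test_size given, so scikit-learn falls back to a 25% test set.
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=42)
print(Xtr.shape, Xte.shape)

# The same seed reproduces the same split.
Xtr2, _, _, _ = train_test_split(X_demo, y_demo, random_state=42)
print(Xtr.index.equals(Xtr2.index))
```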


## Column Transformer
Create a ColumnTransformer to preprocess the data.

1. Create lists of column names for numeric and categorical columns.

In [14]:
cat_cols = X_train.select_dtypes('object').columns
cat_cols

Index(['sex'], dtype='object')

In [15]:
num_cols = X_train.select_dtypes('number').columns
num_cols

Index(['length', 'diameter', 'height', 'whole_weight', 'shucked_weight',
       'viscera_weight', 'shell_weight'],
      dtype='object')

2. Create a StandardScaler for scaling numeric columns.

In [16]:
scaler = StandardScaler()

3. Create a OneHotEncoder for one-hot encoding the categorical columns.

In [17]:
ohe_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

4. Create a tuple for each transformer with the name, the transformer object, and the list of columns.

In [20]:
categorial_tuple = ('categorial', ohe_encoder, cat_cols)
categorial_tuple

('categorial',
 OneHotEncoder(handle_unknown='ignore', sparse_output=False),
 Index(['sex'], dtype='object'))

In [19]:
numerical_tuple = ('numerical', scaler, num_cols)
numerical_tuple

('numerical',
 StandardScaler(),
 Index(['length', 'diameter', 'height', 'whole_weight', 'shucked_weight',
        'viscera_weight', 'shell_weight'],
       dtype='object'))

5. Use the tuples to create a ColumnTransformer to preprocess the data.
Make sure to set verbose_feature_names_out to False!

In [21]:
col_transformer = ColumnTransformer([categorial_tuple,numerical_tuple],
                                    verbose_feature_names_out=False)
col_transformer

## Fit Column Transformer
Fit the ColumnTransformer on your training data.

In [22]:
col_transformer.fit(X_train)

## Transform Data
Transform the training and test data

### Training Data

In [23]:
X_train_processed = col_transformer.transform(X_train)

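# With transform_output='pandas' set above, transform already returns a DataFrame;
# this re-wrap keeps the same columns and index.
X_train_processed = pd.DataFrame(X_train_processed, columns=col_transformer.get_feature_names_out())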
X_train_processed.head()

Unnamed: 0,sex_F,sex_I,sex_M,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight
3823,1.0,0.0,0.0,0.749291,0.464226,-0.121384,0.457447,0.499098,0.743973,0.241135
3956,1.0,0.0,0.0,-0.090254,-0.144654,-0.003756,-0.301655,-0.364269,-0.51404,-0.145838
3623,0.0,0.0,1.0,1.127086,1.225326,0.819643,1.523852,1.692114,1.544526,1.179902
0,0.0,0.0,1.0,-0.59398,-0.449095,-1.062411,-0.651696,-0.617673,-0.738195,-0.647469
2183,0.0,0.0,1.0,-0.258163,-0.093914,0.349129,-0.052352,-0.572823,-0.605532,0.785763


### Test Data

In [24]:
X_test_processed = col_transformer.transform(X_test)

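# With transform_output='pandas' set above, transform already returns a DataFrame;
# this re-wrap keeps the same columns and index.
X_test_processed = pd.DataFrame(X_test_processed, columns=col_transformer.get_feature_names_out())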
X_test_processed.head()

Unnamed: 0,sex_F,sex_I,sex_M,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight
866,0.0,0.0,1.0,0.665336,0.464226,0.466758,0.54801,0.263634,1.096216,0.606609
1483,0.0,0.0,1.0,0.539405,0.312006,0.231501,0.077896,0.111143,0.304812,0.033316
599,1.0,0.0,0.0,0.287541,0.362746,1.290157,0.298707,-0.256629,0.391729,0.678271
1702,1.0,0.0,0.0,0.9172,0.819406,0.702015,0.869559,0.790624,0.775995,1.000748
670,0.0,0.0,1.0,-0.426072,-0.246134,0.113873,-0.441061,-0.57058,-0.67415,-0.181669
