
- Jonny Tesfahun
- 06/21/22


Tasks:

1. Separate your data into the features matrix (X) and target vector (y).

2. Train/test split the data. Use a random state of 42 for consistency.

3. Use column transformers to transform the appropriate columns in the appropriate ways.



For the Column Transformations:

    a) Select the categorical columns and the numerical columns

    b) Use a OneHotEncoder to encode the categorical columns

    c) Use a StandardScaler to scale the numeric columns

    d) Use a ColumnTransformer to match the transformation to the type of column

    e) Transform the data and display the resulting NumPy array.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [2]:
df = pd.read_csv('/content/drive/MyDrive/DojoBootCamp/Project Files/abalone.data', header=None)

In [3]:
df.head()

   0      1      2      3       4       5       6      7   8
0  M  0.455  0.365  0.095   0.514  0.2245   0.101   0.15  15
1  M   0.35  0.265   0.09  0.2255  0.0995  0.0485   0.07   7
2  F   0.53   0.42  0.135   0.677  0.2565  0.1415   0.21   9
3  M   0.44  0.365  0.125   0.516  0.2155   0.114  0.155  10
4  I   0.33  0.255   0.08   0.205  0.0895  0.0395  0.055   7


In [5]:
df.columns = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']

In [6]:
df.head()

   Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight  Rings
0    M   0.455     0.365   0.095         0.514          0.2245           0.101          0.15     15
1    M    0.35     0.265    0.09        0.2255          0.0995          0.0485          0.07      7
2    F    0.53      0.42   0.135         0.677          0.2565          0.1415          0.21      9
3    M    0.44     0.365   0.125         0.516          0.2155           0.114         0.155     10
4    I    0.33     0.255    0.08         0.205          0.0895          0.0395         0.055      7


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4177 non-null   object 
 1   Length          4177 non-null   float64
 2   Diameter        4177 non-null   float64
 3   Height          4177 non-null   float64
 4   Whole weight    4177 non-null   float64
 5   Shucked weight  4177 non-null   float64
 6   Viscera weight  4177 non-null   float64
 7   Shell weight    4177 non-null   float64
 8   Rings           4177 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


In [8]:
# Prepare the DataFrame for ML: 'Rings' is the target, all other columns are features.
# 1. Separate the data into the features matrix (X) and target vector (y).
X = df.drop(columns='Rings')
y = df['Rings']

In [9]:
# 2. Train/test split the data, using random_state=42 so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
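
Since no `test_size` is passed, `train_test_split` falls back to its default 25% test split. The sketch below checks the resulting sizes on a stand-in frame with the same row count as reported by `df.info()`. The names `X_demo` and `y_demo` and their values are placeholders, not the abalone data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: 4177 rows to mirror the row count reported by df.info().
# The values are placeholders, not the real abalone measurements.
n_rows = 4177
X_demo = pd.DataFrame({"feature": np.arange(n_rows, dtype=float)})
y_demo = pd.Series(np.arange(n_rows), name="target")

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# With the default test_size=0.25, the test row count is rounded up.
print(X_tr.shape, X_te.shape)  # (3132, 1) (1045, 1)

# Re-running with the same random_state reproduces the same split.
X_tr2, _, _, _ = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.index.equals(X_tr2.index))  # True
```

The shuffled row order is also why the index of `X_train` no longer runs from 0 upward.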

In [11]:
# Inspect the training features
X_train

      Sex  Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight
3823    F   0.615     0.455   0.135        1.0590          0.4735          0.2630         0.274
3956    F   0.515     0.395   0.140        0.6860          0.2810          0.1255         0.220
3623    M   0.660     0.530   0.175        1.5830          0.7395          0.3505         0.405
   0    M   0.455     0.365   0.095        0.5140          0.2245          0.1010         0.150
2183    M   0.495     0.400   0.155        0.8085          0.2345          0.1155         0.350
 ...  ...     ...       ...     ...           ...             ...             ...           ...
3444    F   0.490     0.400   0.115        0.5690          0.2560          0.1325         0.145
 466    F   0.670     0.550   0.190        1.3905          0.5425          0.3035         0.400
3092    M   0.510     0.395   0.125        0.5805          0.2440          0.1335         0.188
3772    M   0.575     0.465   0.120        1.0535          0.5160          0.2185         0.235


- Use column transformers to transform the appropriate columns in the appropriate ways.

In [12]:
# Build selectors that pick the categorical (object) columns and the numeric columns by dtype
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')
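
A selector built by `make_column_selector` is a callable: given a DataFrame, it returns the names of the columns whose dtype matches. The sketch below illustrates this on a small hypothetical frame called `toy`. The `Sex` column is explicitly given `object` dtype, as an assumption, so the result does not depend on the string dtype a particular pandas version chooses by default.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame with a few illustrative rows; 'Sex' is forced to object dtype
# so that dtype_include='object' matches it on any pandas version.
toy = pd.DataFrame({
    "Sex": pd.Series(["M", "F", "I"], dtype=object),
    "Length": [0.455, 0.530, 0.330],
    "Rings": [15, 9, 7],
})

# Calling a selector on a DataFrame returns a list of column names.
cat_cols = make_column_selector(dtype_include="object")(toy)
num_cols = make_column_selector(dtype_include="number")(toy)

print(cat_cols)  # ['Sex']
print(num_cols)  # ['Length', 'Rings']
```

Note that `'number'` matches integer columns as well as floats. In this notebook the selectors only ever see `X`, so `Rings` is never scaled.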

In [13]:
# Preprocessors: StandardScaler scales the numeric columns, OneHotEncoder encodes the categorical columns.
# handle_unknown='ignore' encodes categories not seen during fit as all zeros instead of raising an error.
scaler = StandardScaler()
ohe = OneHotEncoder(handle_unknown='ignore')
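
To make the two transformers concrete, here is a minimal sketch on made-up inputs (the arrays `lengths` and the fitted `ohe_demo` are illustrative names, not part of the notebook). It checks that `StandardScaler` matches the manual formula (x - mean) / std, and shows how `handle_unknown='ignore'` treats a category that was absent during fitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# StandardScaler: subtract the column mean, divide by the population std (ddof=0).
lengths = np.array([[1.0], [2.0], [3.0]])  # hypothetical values
scaled = StandardScaler().fit_transform(lengths)
manual = (lengths - lengths.mean(axis=0)) / lengths.std(axis=0)
print(np.allclose(scaled, manual))  # True

# OneHotEncoder: one output column per category seen during fit (sorted: F, M).
ohe_demo = OneHotEncoder(handle_unknown="ignore")
ohe_demo.fit(np.array([["M"], ["F"]], dtype=object))

# 'I' was not seen during fit, so it becomes a row of zeros instead of an error.
encoded = ohe_demo.transform(np.array([["I"], ["M"]], dtype=object)).toarray()
print(encoded)
# [[0. 0.]
#  [0. 1.]]
```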

In [14]:
# Pair each transformer with its column selector as a (transformer, columns) tuple
num_tuple = (scaler, num_selector)
cat_tuple = (ohe, cat_selector)

In [15]:
# Import make_column_transformer from sklearn.compose
from sklearn.compose import make_column_transformer

In [17]:
# Build the ColumnTransformer from the two tuples; remainder='passthrough' keeps any unselected columns unchanged
col_transformer = make_column_transformer(num_tuple, cat_tuple, remainder='passthrough')

In [18]:
# Fit the transformer on the training data only
col_transformer.fit(X_train)

ColumnTransformer(remainder='passthrough',
                  transformers=[('standardscaler', StandardScaler(),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x7f3a50d15590>),
                                ('onehotencoder',
                                 OneHotEncoder(handle_unknown='ignore'),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x7f3a50d15c90>)])
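
Fitting on the training split alone means the scaler's mean and standard deviation, and the encoder's category list, come only from training rows. The test split is later transformed with those stored values. A minimal sketch on two hypothetical miniature frames (`train` and `test` below are invented, not the abalone data):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical miniature train/test frames.
train = pd.DataFrame({
    "Sex": pd.Series(["M", "F", "M", "F"], dtype=object),
    "Length": [1.0, 2.0, 3.0, 4.0],
})
test = pd.DataFrame({
    "Sex": pd.Series(["I", "M"], dtype=object),
    "Length": [10.0, 5.0],
})

ct = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include="number")),
    (OneHotEncoder(handle_unknown="ignore"), make_column_selector(dtype_include="object")),
)
ct.fit(train)

# The fitted scaler stores the training mean, regardless of what the test data holds.
fitted_scaler = ct.named_transformers_["standardscaler"]
print(fitted_scaler.mean_)  # [2.5]

# Test rows are scaled with the training statistics; the unseen 'I' becomes zeros.
test_out = np.asarray(ct.transform(test))
print(np.round(test_out, 3))
```

The output columns follow the order of the tuples: the scaled numeric column first, then the one-hot columns for `F` and `M`.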

In [20]:
# Transform both the training and test sets with the fitted transformer
X_train_transformed = col_transformer.transform(X_train)
X_test_transformed = col_transformer.transform(X_test)

In [21]:
# Show the transformed training features as a NumPy array
X_train_transformed

array([[ 0.74929076,  0.46422584, -0.11886923, ...,  1.        ,
         0.        ,  0.        ],
       [-0.09025371, -0.14465442, -0.0016468 , ...,  1.        ,
         0.        ,  0.        ],
       [ 1.12708577,  1.22532616,  0.81891021, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [-0.13223093, -0.14465442, -0.35331409, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.41347297,  0.56570588, -0.47053652, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.58138187,  0.66718592,  0.46724292, ...,  1.        ,
         0.        ,  0.        ]])

In [23]:
# Show the transformed test features as a NumPy array
X_test_transformed

array([[ 0.66533631,  0.46422584,  0.46724292, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.53940464,  0.31200577,  0.23279806, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.2875413 ,  0.36274579,  1.28779993, ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [ 0.91719966,  1.02236608,  1.05335507, ...,  0.        ,
         0.        ,  1.        ],
       [-0.55200317, -0.34761451, -0.0016468 , ...,  0.        ,
         0.        ,  1.        ],
       [ 0.03567796, -0.24613447, -0.35331409, ...,  1.        ,
         0.        ,  0.        ]])