# Module 3 - Activity

During this module you will be exposed to the following topics:

- Understanding of Data (Data Types and Formats)
- Data Collection
- Data Cleaning
- Data Transforms
- Feature Engineering
- Outlier Removal Methodologies
- Feature Ranking and Selection
- Dimensionality Reduction

As you begin to use these techniques in your data science process, it is important to understand how to combine them into a processing pipeline. Machine learning workflows require the same processing steps to be applied to both the training data and the test data. A processing pipeline handles this by processing the data sequentially, so that the output of each step becomes the input to the next. The pipeline should be configurable and extensible.
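As a rough illustration of the idea, a pipeline can be sketched as an ordered list of named steps, each a function that takes the data and returns the processed data. The names `run_pipeline`, `drop_duplicate_rows`, and `scale_to_unit_range` are hypothetical and exist only for this example; they are not part of the assignment.

```python
import numpy as np

def run_pipeline(data, steps):
    """Apply each (name, function) step in order; each output feeds the next step."""
    for name, step in steps:
        data = step(data)
    return data

# Hypothetical steps, chosen only for illustration.
def drop_duplicate_rows(x):
    # np.unique with axis=0 removes repeated rows (it also sorts the rows).
    return np.unique(x, axis=0)

def scale_to_unit_range(x):
    # Column-wise min-max scaling to [0, 1].
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Configurable: reorder, remove, or append steps by editing this list.
steps = [
    ("dedupe", drop_duplicate_rows),
    ("scale", scale_to_unit_range),
]

raw = np.array([[1.0, 10.0], [1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
result = run_pipeline(raw, steps)
print(result)
```

Because the steps live in a plain list, extending the pipeline amounts to appending another `(name, function)` pair.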

You will use the dataset below, which is split into train and test sets, to implement a data pipeline with at least two processing steps. You may use the built-in methods in sklearn for this assignment.
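For reference, a two-step pipeline built from sklearn's own components might look like the sketch below. The choice of `MinMaxScaler` followed by `PCA(n_components=2)` is an assumption made for illustration, not a requirement of the assignment. The pipeline is fitted on the training split and then reused, unchanged, on the test split.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Two illustrative steps: scale each feature to [0, 1], then project to 2 components.
pipe = Pipeline(steps=[
    ("scale", MinMaxScaler()),
    ("reduce", PCA(n_components=2)),
])

# Parameters (min, max, principal axes) are learned from the training data only.
X_train_t = pipe.fit_transform(X_train)
# The test data is transformed with those same learned parameters.
X_test_t = pipe.transform(X_test)

print(X_train_t.shape, X_test_t.shape)
```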

In [209]:
import pandas as pd
import numpy as np
from scipy.stats import f
from copy import deepcopy

In [210]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target  # Target labels

# Perform an 80-20 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting sets to verify the split
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (120, 4)
X_test shape: (30, 4)
y_train shape: (120,)
y_test shape: (30,)


In [211]:
X_train
#y_train

array([[4.6, 3.6, 1. , 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [6.7, 3.1, 4.4, 1.4],
       [4.8, 3.4, 1.6, 0.2],
       [4.4, 3.2, 1.3, 0.2],
       [6.3, 2.5, 5. , 1.9],
       [6.4, 3.2, 4.5, 1.5],
       [5.2, 3.5, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.2, 4.1, 1.5, 0.1],
       [5.8, 2.7, 5.1, 1.9],
       [6. , 3.4, 4.5, 1.6],
       [6.7, 3.1, 4.7, 1.5],
       [5.4, 3.9, 1.3, 0.4],
       [5.4, 3.7, 1.5, 0.2],
       [5.5, 2.4, 3.7, 1. ],
       [6.3, 2.8, 5.1, 1.5],
       [6.4, 3.1, 5.5, 1.8],
       [6.6, 3. , 4.4, 1.4],
       [7.2, 3.6, 6.1, 2.5],
       [5.7, 2.9, 4.2, 1.3],
       [7.6, 3. , 6.6, 2.1],
       [5.6, 3. , 4.5, 1.5],
       [5.1, 3.5, 1.4, 0.2],
       [7.7, 2.8, 6.7, 2. ],
       [5.8, 2.7, 4.1, 1. ],
       [5.2, 3.4, 1.4, 0.2],
       [5. , 3.5, 1.3, 0.3],
       [5.1, 3.8, 1.9, 0.4],
       [5. , 2. , 3.5, 1. ],
       [6.3, 2.7, 4.9, 1.8],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [5.6, 2

### Preprocessing
---

Let's place the data in a dataframe for manipulation

In [212]:
iris_df = pd.DataFrame(columns=iris.feature_names,data=iris.data)
iris_df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


Are there any duplicates within the frame?

In [213]:
iris_df[iris_df.duplicated()]

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
142,5.8,2.7,5.1,1.9


It looks like there is one duplicate at index 142. Let's remove it and see what our dataframe looks like

In [214]:
iris_df = iris_df.drop_duplicates()
iris_df[140:145]

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
140,6.7,3.1,5.6,2.4
141,6.9,3.1,5.1,2.3
143,6.8,3.2,5.9,2.3
144,6.7,3.3,5.7,2.5
145,6.7,3.0,5.2,2.3


Notice that row 142 has been removed successfully!
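Two details of duplicate handling can be seen on a toy frame. This is a small sketch on made-up data, not the iris frame: `duplicated(keep=False)` marks every copy in a duplicate group, and `reset_index(drop=True)` closes the gap that `drop_duplicates` leaves in the index labels.

```python
import pandas as pd

# Made-up data: rows 0 and 2 are identical.
df = pd.DataFrame({"a": [1, 2, 1, 3], "b": [5, 6, 5, 7]})

# keep=False marks every member of a duplicate group, not just the later copies.
all_copies = df[df.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence; reset_index renumbers the rows.
deduped = df.drop_duplicates().reset_index(drop=True)

print(all_copies.index.tolist())
print(deduped.index.tolist())
```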

Now we can separate per feature

In [215]:
sepal_length = iris_df["sepal length (cm)"]
sepal_width = iris_df["sepal width (cm)"]
petal_length = iris_df["petal length (cm)"]
petal_width = iris_df["petal width (cm)"]

Normalizing the data ensures that every feature is represented on the same scale, which helps with feature ranking and separation. We can do this using min-max normalization, which maps each feature $x_k$ onto the interval $[a, b]$:

$$ \hat{x}_{ik} = \frac{x_{ik} - \min(x_k)}{\max(x_k) - \min(x_k)} (b - a) + a $$
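A quick worked example of the formula on a single made-up feature, first with the default range $[0, 1]$ and then with $[-1, 1]$. The helper name `min_max` is hypothetical and used only here.

```python
import numpy as np

def min_max(x, a=0.0, b=1.0):
    # Direct transcription of the min-max formula for one feature.
    return (x - x.min()) / (x.max() - x.min()) * (b - a) + a

x = np.array([2.0, 4.0, 6.0, 10.0])   # min = 2, max = 10, range = 8

unit = min_max(x)                      # (x - 2) / 8
symmetric = min_max(x, a=-1.0, b=1.0)  # (x - 2) / 8 * 2 - 1

print(unit)       # values 0, 0.25, 0.5, 1
print(symmetric)  # values -1, -0.5, 0, 1
```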

In [216]:
def normalize_min_max(df, a=0, b=1):
    # df: pd.DataFrame -> dataset

    min_xk = np.min(df.values, axis=0)
    max_xk = np.max(df.values, axis=0)

    # Min-max normalization
    xhat = (df.values - min_xk) / (max_xk - min_xk) * (b-a) + a

    normalized_df = pd.DataFrame(data=xhat, columns=df.columns)
    return normalized_df


In [217]:
iris_df = normalize_min_max(iris_df)

What does our new dataset look like after normalization?

In [218]:
iris_df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,0.222222,0.625000,0.067797,0.041667
1,0.166667,0.416667,0.067797,0.041667
2,0.111111,0.500000,0.050847,0.041667
3,0.083333,0.458333,0.084746,0.041667
4,0.194444,0.666667,0.067797,0.041667
...,...,...,...,...
144,0.666667,0.416667,0.711864,0.916667
145,0.555556,0.208333,0.677966,0.750000
146,0.611111,0.416667,0.711864,0.791667
147,0.527778,0.583333,0.745763,0.916667


Let's look for some outliers!

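Now we can remove the outliers using Wilks' outlier removal, which is based on the squared Mahalanobis distance of each observation $x_i$ from the mean vector $\mu$, given the covariance matrix $\Sigma$:

$$ D_M^2(x_i) = (x_i - \mu)\Sigma^{-1}(x_i - \mu)^T $$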
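The sketch below evaluates the squared Mahalanobis distance for every row of a small random dataset and cross-checks one value against `scipy.spatial.distance.mahalanobis`, which returns the unsquared distance. The random data is made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))        # 50 observations, 3 features

mu = data.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(data, rowvar=False))

diff = data - mu
# For each row i, compute diff_i @ inv_cov @ diff_i without building an n x n matrix.
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# SciPy returns the distance itself, so square it before comparing.
d_first = mahalanobis(data[0], mu, inv_cov)
print(np.isclose(d2[0], d_first ** 2))
```

A handy sanity check: with the sample covariance, the squared distances always sum to $(n - 1)p$, here $49 \times 3 = 147$.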

In [219]:
def outlier_removal(df_values, alpha=0.01):
    # df_values: pd.DataFrame.values -> dataset data (observations in rows)
    # alpha: sensitivity for outliers (significance level of the F threshold)
    data = deepcopy(df_values)
    n = data.shape[0]
    p = data.shape[1]

    mu = np.mean(data, axis=0, keepdims=True)
    # rowvar=False since the observations are in the rows
    cov = np.cov(data, rowvar=False)
    inv_cov = np.linalg.inv(cov)
    # Squared Mahalanobis distance of each row, clipped at 0 to absorb round-off.
    # einsum evaluates only the per-row quadratic forms, not the full n x n matrix.
    diff = data - mu
    mahalanobis_sq = np.maximum(np.einsum("ij,jk,ik->i", diff, inv_cov, diff), 0)

    # F-distribution threshold, which applies to the squared distance
    f_threshold = f.ppf(1 - alpha, p, n - p) * (p * (n - 1) / (n - p))
    outliers = mahalanobis_sq > f_threshold
    outliers_removed = data[~outliers]
    return outliers_removed

In [220]:
cleaned = outlier_removal(iris_df.values)
cleaned.shape

(149, 4)

It looks like no outliers are present, based on Wilks' outlier removal with `alpha = 0.01`.
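`outlier_removal` returns only the filtered feature rows, so when rows are dropped the matching labels would need to be dropped as well. One possible variation, sketched here under the assumption that a boolean mask is acceptable, returns the mask instead so that features and labels stay aligned. The helper `inlier_mask` is hypothetical; it compares the squared Mahalanobis distance against the same F-based threshold formula, and the data is synthetic.

```python
import numpy as np
from scipy.stats import f

def inlier_mask(x, alpha=0.01):
    """Return True for rows whose squared Mahalanobis distance is within the threshold."""
    n, p = x.shape
    diff = x - x.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(x, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    threshold = f.ppf(1 - alpha, p, n - p) * p * (n - 1) / (n - p)
    return d2 <= threshold

# Synthetic data: 100 standard-normal rows plus one obvious outlier in the last row.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(size=(100, 3)), [[50.0, 50.0, 50.0]]])
y = np.arange(len(x))                  # stand-in labels

mask = inlier_mask(x)
x_clean, y_clean = x[mask], y[mask]    # the same rows are removed from both
print(x.shape, x_clean.shape, y_clean.shape)
```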

It would be convenient to package all of the preprocessing steps above into a single helper function. This makes it easy to transform both the training data and the test data.

In [221]:
def data_pipeline(df):
    # 1. Remove the duplicates
    df = df[~df.duplicated()]
    # 2. Normalize the Data
    normalized_df = normalize_min_max(df)
    # 3. Find and remove the outliers
    final_data = outlier_removal(normalized_df.values)

    return final_data

### Data fitting and Transformation
---

Above, we used the full `iris_df` dataset as a test. Now let's apply the pipeline to the training data and the test data.
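Note that `data_pipeline` recomputes the minimum and maximum from whichever frame it receives, so the training and test sets are each scaled by their own ranges. A common alternative, sketched here with the hypothetical helpers `fit_min_max` and `apply_min_max` on made-up numbers, is to learn the ranges from the training data and reuse them on the test data.

```python
import numpy as np

def fit_min_max(x):
    """Return the per-column minimum and maximum of x."""
    return x.min(axis=0), x.max(axis=0)

def apply_min_max(x, col_min, col_max, a=0.0, b=1.0):
    """Scale x using previously learned column minima and maxima."""
    return (x - col_min) / (col_max - col_min) * (b - a) + a

# Made-up data for illustration.
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
test = np.array([[2.0, 40.0]])

col_min, col_max = fit_min_max(train)            # learned from train only
train_scaled = apply_min_max(train, col_min, col_max)
test_scaled = apply_min_max(test, col_min, col_max)

print(test_scaled)  # values 0.25 and 1.5
```

With this approach, test values can fall outside $[a, b]$ when they lie beyond the training range, as the second column shows.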

#### Training Data

In [222]:
# Transform the Training data
train_df = pd.DataFrame(data=X_train, columns=iris.feature_names)
transformed_train = data_pipeline(train_df)
transformed_train

array([[0.08823529, 0.66666667, 0.        , 0.04166667],
       [0.41176471, 1.        , 0.0877193 , 0.125     ],
       [0.70588235, 0.45833333, 0.59649123, 0.54166667],
       [0.14705882, 0.58333333, 0.10526316, 0.04166667],
       [0.02941176, 0.5       , 0.05263158, 0.04166667],
       [0.58823529, 0.20833333, 0.70175439, 0.75      ],
       [0.61764706, 0.5       , 0.61403509, 0.58333333],
       [0.26470588, 0.625     , 0.0877193 , 0.04166667],
       [0.20588235, 0.66666667, 0.07017544, 0.04166667],
       [0.26470588, 0.875     , 0.0877193 , 0.        ],
       [0.44117647, 0.29166667, 0.71929825, 0.75      ],
       [0.5       , 0.58333333, 0.61403509, 0.625     ],
       [0.70588235, 0.45833333, 0.64912281, 0.58333333],
       [0.32352941, 0.79166667, 0.05263158, 0.125     ],
       [0.32352941, 0.70833333, 0.0877193 , 0.04166667],
       [0.35294118, 0.16666667, 0.47368421, 0.375     ],
       [0.58823529, 0.33333333, 0.71929825, 0.58333333],
       [0.61764706, 0.45833333,

#### Test Data

In [223]:
# Use the pipeline to transform the test data (X_test)
test_df = pd.DataFrame(data=X_test, columns=iris.feature_names)
transformed_test = data_pipeline(test_df)
transformed_test

array([[0.4375    , 0.375     , 0.60714286, 0.5       ],
       [0.3125    , 1.        , 0.07142857, 0.09090909],
       [0.9375    , 0.25      , 1.        , 1.        ],
       [0.40625   , 0.4375    , 0.57142857, 0.63636364],
       [0.65625   , 0.375     , 0.625     , 0.59090909],
       [0.21875   , 0.75      , 0.03571429, 0.13636364],
       [0.28125   , 0.4375    , 0.41071429, 0.54545455],
       [0.6875    , 0.5625    , 0.67857143, 1.        ],
       [0.46875   , 0.        , 0.57142857, 0.63636364],
       [0.34375   , 0.3125    , 0.46428571, 0.5       ],
       [0.5625    , 0.625     , 0.67857143, 0.86363636],
       [0.03125   , 0.5       , 0.01785714, 0.        ],
       [0.25      , 0.8125    , 0.        , 0.04545455],
       [0.0625    , 0.5625    , 0.03571429, 0.        ],
       [0.125     , 1.        , 0.03571429, 0.09090909],
       [0.5       , 0.6875    , 0.60714286, 0.68181818],
       [0.5625    , 0.5       , 0.80357143, 0.95454545],
       [0.28125   , 0.1875    ,