## Activity 2.4 - Normalization and Label Encoding
Prepared by: Ashwini Kumar Mathur

In [4]:
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
import matplotlib.pyplot as plt

In [22]:
# Manually load the iris flower dataset
df = pd.read_csv('datasets/Iris.csv')
df

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,10.2,3.4,11.4,2.3,Iris-virginica


In [25]:
df.columns

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [26]:
# Calculate the z-scores using the pre-built method in the SciPy library
from scipy import stats

# Specify columns to check for outliers
columns_to_check = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']

# Calculate z-scores for each column
z_scores = stats.zscore(df[columns_to_check])

# Define a threshold for outlier detection (e.g., 3 standard deviations)
threshold = 3
# Compare the absolute z-score so unusually low values are flagged as well as high ones
outlier_indices = (abs(z_scores) > threshold).any(axis=1)
outliers_df = df[outlier_indices]

print("Rows with Outliers:", outliers_df)

Rows with Outliers:       Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  \
130  131            7.4          22.8            6.1          19.9   
141  142           15.9           3.1            5.1           2.3   
142  143            5.8           2.7            5.1          19.9   
148  149           10.2           3.4           11.4           2.3   
149  152            5.9           3.0           59.1           9.8   

            Species  
130  Iris-virginica  
141  Iris-virginica  
142  Iris-virginica  
148  Iris-virginica  
149  Iris-virginica  


In [27]:
outlier_indices.sum()

5
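
The z-score check above can also be written out by hand, which makes the formula explicit: z = (x - mean) / std. The following is a minimal sketch on hypothetical toy data (the `toy` frame and its column names are assumptions, not part of the Iris file). It uses the population standard deviation (`ddof=0`), which matches the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 19 ordinary values plus one extreme value in column "a"
toy = pd.DataFrame({
    "a": np.append(np.linspace(0.8, 1.2, 19), 50.0),
    "b": np.linspace(1.8, 2.2, 20),
})

# z = (x - mean) / std, computed column by column with the population std (ddof=0)
z = (toy - toy.mean()) / toy.std(ddof=0)

# Flag a row if any column lies more than `threshold` standard deviations from its mean
threshold = 3
outlier_mask = (z.abs() > threshold).any(axis=1)

print(toy[outlier_mask])
```

Note that with very few rows a single extreme value cannot reach a z-score of 3, so this kind of threshold is only meaningful on reasonably sized samples.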

In [30]:
# Direct method: drop the rows flagged as outliers
clean_df = df[~outlier_indices]
clean_df

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
143,144,6.8,3.2,5.9,2.3,Iris-virginica
144,145,6.7,3.3,5.7,2.5,Iris-virginica
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica


**Perform the normalization using pre-built functions from the sklearn library**

Techniques for normalization:
- StandardScaler method
- MinMaxScaler method

In [32]:
# Apply the MinMaxScaler to the numeric columns of the cleaned dataframe
scaler = MinMaxScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(clean_df[columns_to_check]),columns=columns_to_check)
scaled_df

,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
0,0.222222,0.625000,0.067797,0.041667
1,0.166667,0.416667,0.067797,0.041667
2,0.111111,0.500000,0.050847,0.041667
3,0.083333,0.458333,0.084746,0.041667
4,0.194444,0.666667,0.067797,0.041667
...,...,...,...,...
140,0.694444,0.500000,0.830508,0.916667
141,0.666667,0.541667,0.796610,1.000000
142,0.666667,0.416667,0.711864,0.916667
143,0.555556,0.208333,0.677966,0.750000
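
The notebook lists StandardScaler among the techniques but only applies MinMaxScaler. As a rough illustration of how the two differ, here is a small sketch on hypothetical data (the `toy` frame and the `length_cm` column are assumptions). It compares each scaler with its manual formula: min-max maps values to [0, 1], while standardization gives zero mean and unit variance, using the population standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-column example
toy = pd.DataFrame({"length_cm": [4.0, 5.0, 6.0, 7.0, 8.0]})

# Library versions
minmax = MinMaxScaler().fit_transform(toy)
standard = StandardScaler().fit_transform(toy)

# Manual equivalents of the two formulas
manual_minmax = (toy - toy.min()) / (toy.max() - toy.min())  # (x - min) / (max - min)
manual_standard = (toy - toy.mean()) / toy.std(ddof=0)       # (x - mean) / std

print(np.allclose(minmax, manual_minmax.to_numpy()))
print(np.allclose(standard, manual_standard.to_numpy()))
```

Min-max scaling is sensitive to extreme values because the minimum and maximum define the range, which is one reason to remove outliers before scaling.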


**Perform the encoding techniques:**
- Label encoding
- One Hot encoding

In [45]:
# Perform the Label encoding
label_encoder = LabelEncoder()
encoded_species = pd.Series(label_encoder.fit_transform(df["Species"]),name='Species')
encoded_species

0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Name: Species, Length: 150, dtype: int32

In [50]:
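# Caution: scaled_df has a fresh 0..N-1 index, while encoded_species was built from the
# unfiltered df (150 rows). pd.concat aligns on index labels, so the labels appear to line
# up here only because every removed outlier row is Iris-virginica, near the end of the data.
final_df=pd.concat([scaled_df,encoded_species],axis=1).dropna()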
final_df

,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,0.222222,0.625000,0.067797,0.041667,0
1,0.166667,0.416667,0.067797,0.041667,0
2,0.111111,0.500000,0.050847,0.041667,0
3,0.083333,0.458333,0.084746,0.041667,0
4,0.194444,0.666667,0.067797,0.041667,0
...,...,...,...,...,...
140,0.694444,0.500000,0.830508,0.916667,2
141,0.666667,0.541667,0.796610,1.000000,2
142,0.666667,0.416667,0.711864,0.916667,2
143,0.555556,0.208333,0.677966,0.750000,2
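
Because `pd.concat(..., axis=1)` matches rows by index label rather than by position, combining features and labels after dropping rows deserves some care. The following sketch uses hypothetical toy data (`features`, `labels`, and the marker label 9 are assumptions) to show how resetting the feature index can silently pair a row with the wrong label.

```python
import pandas as pd

# Hypothetical features after row 2 was removed; the original index is kept
features = pd.DataFrame({"x": [0.1, 0.2, 0.3, 0.4]}, index=[0, 1, 3, 4])

# Labels for all five original rows; 9 marks the label of the removed row
labels = pd.Series([0, 0, 9, 1, 1], name="label")

# Index kept: rows are matched by label, and the removed row drops out via dropna()
aligned = pd.concat([features, labels], axis=1).dropna()

# Index reset: rows are matched by position, so x=0.3 is paired with the removed row's label
shifted = pd.concat([features.reset_index(drop=True), labels], axis=1).dropna()

print(aligned)
print(shifted)
```

One way to avoid the issue is to keep the original index on the scaled frame, or to encode the labels from the same filtered frame that was scaled.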


## Final Step

**Train/test split for the machine learning model**

In [47]:
#Imports
from sklearn.model_selection import train_test_split

In [56]:
# Split the data into training and testing sets (default test size is 25%)
X = final_df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
y = final_df['Species']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [57]:
# Check the shapes of the resulting sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(108, 4) (37, 4) (108,) (37,)
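
The split above is purely random, so class proportions in the test set can drift slightly from those in the full data. As an optional variation, not something the notebook itself does, `train_test_split` accepts a `stratify` argument that preserves class proportions. The sketch below uses hypothetical synthetic arrays (60 samples, 3 balanced classes).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 60 samples, 2 features, 3 classes with 20 samples each
X = np.arange(120).reshape(60, 2)
y = np.repeat([0, 1, 2], 20)

# stratify=y keeps the class proportions the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```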


### Bonus tips in Python: DataFrame slicing quick notes

**Key method for slicing:**

- .iloc[]: accesses rows and columns by integer position (like standard Python lists).

**Slicing syntax:**

- df.iloc[start:stop:step, start:stop:step]: selects rows and columns using integer positions; stop is excluded.



Examples:

**Slicing rows:**

- df.iloc[0:3]: selects the first 3 rows (0, 1, 2).
- df.iloc[::2]: selects every other row (start to end, step of 2).

**Slicing columns:**

- df.iloc[:, 1:4]: selects columns at positions 1 to 3.

**Slicing both rows and columns:**

- df.iloc[0:3, 1:4]: selects rows 0 to 2 and columns 1 to 3.

Additional tips:

- Negative indices count from the end (e.g., -1 for the last row/column).
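
The slicing notes above can be tried out on a small frame. This is a minimal sketch; the `demo` frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical 6-row, 4-column frame
demo = pd.DataFrame({
    "a": range(6),
    "b": range(10, 16),
    "c": range(20, 26),
    "d": range(30, 36),
})

first_three = demo.iloc[0:3]     # rows 0, 1, 2
every_other = demo.iloc[::2]     # rows 0, 2, 4
middle_cols = demo.iloc[:, 1:4]  # columns b, c, d (positions 1 to 3)
block = demo.iloc[0:3, 1:4]      # rows 0 to 2 and columns 1 to 3
last_row = demo.iloc[-1]         # last row, returned as a Series
last_col = demo.iloc[:, -1]      # last column, returned as a Series

print(block)
```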