#An intuitive idea of how ML algorithms work


###**General Workflow**

**Data Collection**: Gathering the data.\
**Data Preprocessing**: Cleaning and organizing the data.\
**Training**: Feeding the data into the algorithm to learn the patterns.\
**Evaluation**: Testing the model on unseen data to evaluate its performance.\
**Tuning**: Adjusting the model's hyperparameters to improve performance.\
**Deployment**: Using the trained model to make predictions on new data.
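
To make the workflow concrete, here is a minimal sketch of the six steps end to end. It assumes the clean Iris dataset that ships with scikit-learn (not the file with missing values used later in this notebook) and a k-nearest-neighbours classifier, both chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Data collection: load a small, clean dataset bundled with scikit-learn
X, y = load_iris(return_X_y=True)

# Keep some data aside so the model can be evaluated on unseen samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Preprocessing: learn the scaling from the training data only
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training: n_neighbors is the kind of setting one would adjust during tuning
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train_scaled, y_train)

# Evaluation: fraction of correctly classified test samples
accuracy = model.score(X_test_scaled, y_test)
print("Test accuracy:", accuracy)

# Deployment: predict the class of a new measurement
new_flower = [[5.1, 3.5, 1.4, 0.2]]
prediction = model.predict(scaler.transform(new_flower))
print("Predicted class:", prediction[0])
```

The rest of this notebook focuses on the preprocessing step only.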

#Data Preprocessing

Raw data often contains noise and missing values, so it cannot be fed directly to machine learning models. Data preprocessing cleans the data and formats it so that ML models can use it.

In [None]:
# Importing necessary libraries

import numpy as np
import pandas as pd

##Reading dataset

In [None]:
df = pd.read_csv('Iris_missingdata.csv')

In [None]:
df.head()

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            NaN           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2          NaN


In [None]:
df.shape

(150, 6)

##Finding missing data

Let's find the number of missing values in each column

In [None]:
df.isnull().sum()

Id                0
SepalLengthCm    11
SepalWidthCm      7
PetalLengthCm     8
PetalWidthCm      9
Species          12
dtype: int64
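
Raw counts are easier to judge as a share of the rows. The sketch below uses a small made-up frame (the column names `length`, `width` and `label` are hypothetical) to show the count, the percentage, and the affected rows.

```python
import numpy as np
import pandas as pd

# Toy data with a few gaps; not the Iris file used in this notebook
toy = pd.DataFrame({
    'length': [5.1, np.nan, 4.7, 4.6],
    'width': [3.5, 3.0, np.nan, np.nan],
    'label': ['a', 'b', None, 'a'],
})

# Number of missing values per column
missing_counts = toy.isnull().sum()

# Percentage of rows that are missing in each column
missing_share = toy.isnull().mean() * 100

# Rows that have at least one missing value
rows_with_missing = toy[toy.isnull().any(axis=1)]

print(missing_counts)
print(missing_share)
print(rows_with_missing)
```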

Knowing the data type of each column helps us decide which method to use to fill in its missing values.

In [None]:
df.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

##Categorical encoding

Now, let's check for categorical columns.

Categorical data represents discrete values that belong to a specific set of categories or groups.

In [None]:
categorical_columns = df.select_dtypes(include=['object']).columns
categorical_columns

Index(['Species'], dtype='object')

Now, let's perform **Label Encoding** on the categorical columns in the dataset.

Label encoding assigns each category an integer value.\
In this case,\
`Iris-setosa` = 0\
`Iris-versicolor` = 1\
`Iris-virginica` = 2
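
The mapping above follows from `LabelEncoder` sorting the class labels before numbering them. A small sketch on a hand-written list (not the notebook's dataframe) shows the mapping and how to reverse it:

```python
from sklearn.preprocessing import LabelEncoder

# A tiny hand-written sample of labels, deliberately out of order
species = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted encoder around is useful later, since `inverse_transform` converts model predictions back to species names.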




In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['Species'] = label_encoder.fit_transform(df['Species'])
df

      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  Species
0      1            5.1           3.5            1.4           0.2        0
1      2            4.9           3.0            1.4           0.2        0
2      3            4.7           3.2            1.3           0.2        0
3      4            NaN           3.1            1.5           0.2        0
4      5            5.0           3.6            1.4           0.2        3
..   ...            ...           ...            ...           ...      ...
145  146            6.7           3.0            5.2           2.3        2
146  147            6.3           NaN            5.0           1.9        2
147  148            6.5           3.0            5.2           2.0        2
148  149            6.2           3.4            5.4           2.3        2


Notice that the encoder also mapped the NaN values to an integer (3, as in row 4), so we have to convert those entries back to NaN.

In [None]:
df['Species'] = df['Species'].replace(3, np.nan)
df.head()

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  Species
0   1            5.1           3.5            1.4           0.2      0.0
1   2            4.9           3.0            1.4           0.2      0.0
2   3            4.7           3.2            1.3           0.2      0.0
3   4            NaN           3.1            1.5           0.2      0.0
4   5            5.0           3.6            1.4           0.2      NaN
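
One way to avoid the encode-then-undo step is to fit the encoder only on the rows that actually have a label. This is a sketch on a toy Series, not a replacement for the cells above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column with one missing label
species = pd.Series(
    ['Iris-setosa', np.nan, 'Iris-virginica', 'Iris-versicolor', 'Iris-setosa']
)

# True where a label is present
mask = species.notna()

encoder = LabelEncoder()

# Start with an all-NaN float column, then fill in codes for labelled rows only
encoded = pd.Series(np.nan, index=species.index)
encoded.loc[mask] = encoder.fit_transform(species[mask])

print(encoded)
```

The missing entry stays NaN, so it can be imputed afterwards without having to guess which integer the encoder picked for it.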


##Handling missing values

To fill in the missing values for categorical columns, **Mode Imputation** is adopted.

Mode imputation involves replacing missing values with the most frequent value (mode) in the respective column. This method is straightforward and works well for categorical variables.

In [None]:
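# Most frequent species code (mode() ignores NaN by default)
mode_value = df['Species'].mode()[0]
# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['Species'] = df['Species'].fillna(mode_value)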

Now let's fill in the values for numerical columns.

The most common approach is to fill them with the mean or median of each column.

In [None]:
# Fill NaN values with mean values of each column
mean_values = df[['PetalLengthCm', 'PetalWidthCm', 'SepalWidthCm', 'SepalLengthCm']].mean()  # Calculate mean values for each column
print(mean_values)

df = df.fillna(mean_values)  # Fill NaNs with mean values
df

PetalLengthCm    3.780282
PetalWidthCm     1.198582
SepalWidthCm     3.050350
SepalLengthCm    5.841727
dtype: float64


      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  Species
0      1       5.100000       3.50000            1.4           0.2      0.0
1      2       4.900000       3.00000            1.4           0.2      0.0
2      3       4.700000       3.20000            1.3           0.2      0.0
3      4       5.841727       3.10000            1.5           0.2      0.0
4      5       5.000000       3.60000            1.4           0.2      0.0
..   ...            ...           ...            ...           ...      ...
145  146       6.700000       3.00000            5.2           2.3      2.0
146  147       6.300000       3.05035            5.0           1.9      2.0
147  148       6.500000       3.00000            5.2           2.0      2.0
148  149       6.200000       3.40000            5.4           2.3      2.0
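
The cell above uses the mean. The median is the other option mentioned, and it tends to be the safer choice when a column has outliers. A toy example with made-up numbers shows the difference:

```python
import numpy as np
import pandas as pd

# Toy column with one extreme value and one missing value
values = pd.Series([1.0, 2.0, 3.0, 100.0, np.nan])

# mean() and median() both skip NaN by default
mean_value = values.mean()
median_value = values.median()

filled_with_mean = values.fillna(mean_value)
filled_with_median = values.fillna(median_value)

print("mean:", mean_value, "median:", median_value)
print(filled_with_mean.tolist())
print(filled_with_median.tolist())
```

Here the single outlier pulls the mean far away from the typical values, while the median stays close to them.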


Let's check if we have filled in all the missing values

In [None]:
df.isnull().sum()

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

##Feature scaling

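Features measured on different ranges can dominate distance-based or gradient-based models, so we rescale the numerical columns to a common range. Here we use min-max scaling, which maps each column to the range 0 to 1.

First, let's find the minimum and maximum values of each column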

In [None]:
min_values = df.min()
max_values = df.max()

print("\nMinimum values for each column:")
print(min_values)

print("\nMaximum values for each column:")
print(max_values)


Minimum values for each column:
Id               1.0
SepalLengthCm    4.4
SepalWidthCm     2.0
PetalLengthCm    1.0
PetalWidthCm     0.1
Species          0.0
dtype: float64

Maximum values for each column:
Id               150.0
SepalLengthCm      7.9
SepalWidthCm       4.4
PetalLengthCm      6.9
PetalWidthCm       2.5
Species            2.0
dtype: float64


In [None]:
from sklearn.preprocessing import MinMaxScaler

# Specify columns for min-max scaling
columns_to_scale = ['PetalLengthCm', 'PetalWidthCm', 'SepalWidthCm', 'SepalLengthCm']

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Fit scaler on the specified columns and transform them
df[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])

In [None]:
min_values = df.min()
max_values = df.max()

print("\nMinimum values for each column:")
print(min_values)

print("\nMaximum values for each column:")
print(max_values)


Minimum values for each column:
Id               1.0
SepalLengthCm    0.0
SepalWidthCm     0.0
PetalLengthCm    0.0
PetalWidthCm     0.0
Species          0.0
dtype: float64

Maximum values for each column:
Id               150.0
SepalLengthCm      1.0
SepalWidthCm       1.0
PetalLengthCm      1.0
PetalWidthCm       1.0
Species            2.0
dtype: float64
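
Min-max scaling computes `(x - min) / (max - min)` separately for each column. The sketch below checks that formula against `MinMaxScaler` on a small made-up array; the numbers are illustrative and not taken from the dataframe.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two toy columns on different ranges
values = np.array([
    [4.4, 0.1],
    [5.8, 1.2],
    [7.9, 2.5],
])

# Manual min-max scaling, column by column
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# The same transformation with scikit-learn
scaler = MinMaxScaler()
scaled = scaler.fit_transform(values)

print(np.allclose(manual, scaled))

# The fitted scaler can also map scaled values back to the original units
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, values))
```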


##Splitting data into features (X) and target variable (y)

Before splitting the dataset, let's check whether any missing values remain and recheck the size of the dataset

In [None]:
df.isnull().sum()

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

In [None]:
df.shape

(150, 6)

In [None]:
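# Features: every column except the target.
# 'Id' is also dropped because it is only a row identifier, not a measurement.
df_X = df.drop(columns=['Id', 'Species'])
# Target: the encoded species label
df_y = df['Species']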

##Performing train-test split

In [None]:
from sklearn.model_selection import train_test_split

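# Perform train-test split: 80% for training, 20% for testing
# random_state fixes the shuffle so the split is reproducible (42 is an arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.2, random_state=42)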
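
For classification data it can also help to keep the class proportions the same in both parts. A sketch with the `stratify` argument on made-up data of the same size (150 rows, three classes of 50); the names `X_demo` and `y_demo` are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data shaped like the Iris problem: 150 rows, 3 balanced classes
X_demo = pd.DataFrame({'feature': np.arange(150, dtype=float)})
y_demo = pd.Series(np.repeat([0, 1, 2], 50))

# stratify=y_demo keeps the class proportions equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)

# Number of test rows per class
test_counts = y_te.value_counts().sort_index()
print(test_counts)
```

With 150 rows and `test_size=0.2`, this gives 120 training rows and 30 test rows, 10 from each class.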

Our dataset is now preprocessed and ready to be used for training a model!