## Importing Required Libraries
---

In [1]:
import os
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

## Data Gathering
---
Using `pd.read_csv`, load the data from the `Iris_Dataset.csv` file.

In [2]:
file_path = os.path.join('Data_Sets', 'Iris_Dataset.csv')  # portable across operating systems
df = pd.read_csv(file_path)
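If the CSV is not present at that path, `pd.read_csv` raises `FileNotFoundError`. Below is a minimal fallback sketch, assuming scikit-learn is installed. The helper `load_iris_frame` is hypothetical (it is not part of this notebook); it rebuilds a similar frame from `sklearn.datasets.load_iris`. The column names and the `Iris-` prefix are assumptions chosen to mirror the columns used in the rest of this notebook.

```python
import os

import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame(csv_path: str) -> pd.DataFrame:
    """Hypothetical helper: read the CSV if it exists, else rebuild a similar frame."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)

    bunch = load_iris(as_frame=True)
    frame = bunch.data.copy()
    # Assumed column names, chosen to match the ones used in this notebook
    frame.columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']
    frame.insert(0, 'Id', list(range(1, len(frame) + 1)))
    # Assumed label format: the 'Iris-' prefix followed by the species name
    frame['Species'] = ['Iris-' + str(bunch.target_names[t]) for t in bunch.target]
    return frame


sample_df = load_iris_frame(os.path.join('Data_Sets', 'Iris_Dataset.csv'))
print(sample_df.shape)
```

With this in place, the rest of the notebook can run on `sample_df` even when the original file is missing, as long as the assumed column names match.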

Read the first `5` rows of our `DataFrame` using `df.head()`.

In [None]:
df.head()

## Exploratory Data Analysis
---
Let's check the shape of our dataframe using `df.shape`. `df.shape[0]` gives the number of rows, while `df.shape[1]` gives the number of columns.

In [None]:
print('Number of Rows: ', df.shape[0])

print('-' * 100)

print('Number of Columns: ', df.shape[1])

Now let's check the data types of our features using `df.dtypes` 

In [None]:
print('Data Types: \n', df.dtypes)

Here, all the feature columns are numeric, i.e. either `int` or `float`, which is good. Only `Species` has the `object` dtype. `Species` is our target variable.

In [None]:
df.Species.value_counts()

Examine the species names and note that they all begin with `Iris-`. Remove this portion of the name so the species name is shorter. 


In [None]:
df['Species'] = df.Species.apply(lambda r: r.replace('Iris-', ''))
df.head()

In [None]:
df.Species.value_counts()
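The prefix can also be removed without a Python-level `lambda`. The sketch below uses a toy `Series` (not the notebook's data) to show that the vectorised `Series.str.replace` gives the same result as the `apply` call used above.

```python
import pandas as pd

species = pd.Series(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])

# Row-by-row version, as used in the cell above
with_apply = species.apply(lambda r: r.replace('Iris-', ''))

# Vectorised version; regex=False treats 'Iris-' as a literal string
with_str = species.str.replace('Iris-', '', regex=False)

print(with_str.tolist())
```

Both versions are fine for 150 rows; the vectorised form mainly reads more directly.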

Check the share of each species using a pie chart.

In [None]:
fig = plt.figure()

ax = fig.add_axes([0,0,1,1])

ax.axis('equal')

data = df['Species'].value_counts()

ax.pie(data, labels = data.index, autopct='%1.2f%%', colors=['g', 'r', 'y'])  # one colour per species

plt.show()

Now, summary statistics of the dataframe can be checked using `df.describe()`. This function gives us the count, mean, standard deviation, minimum, maximum and quartiles of each numeric column.
Here, the median is the row labelled `50%`.

In [None]:
df.describe()
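To make the `50%` row concrete, here is a small sketch on made-up values (not the Iris data) that compares it with `Series.median()`.

```python
import pandas as pd

# Toy values for illustration only
toy = pd.DataFrame({'PetalLengthCm': [1.4, 1.5, 4.7, 5.1, 6.0]})

summary = toy.describe()
median_from_describe = summary.loc['50%', 'PetalLengthCm']
median_direct = toy['PetalLengthCm'].median()

print(list(summary.index))
print(median_from_describe, median_direct)
```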

Here, the `Id` column is only a unique running row identifier, so it cannot contribute to predicting the species. Hence, we drop that column using `df.drop()`.

In [None]:
df.drop(columns = 'Id' , inplace = True)

Check the information of Dataframe using `df.info()`.

In [None]:
df.info()

From the above results, we can conclude that there are no null values in our dataframe. We can verify this using `df.isna().sum()`.

In [None]:
df.isna().sum()

#### Visualization
---
I will be using `Seaborn` and `Matplotlib` libraries for Data Visualization.

In [None]:
def plot_continuous_distribution(data: pd.DataFrame, column: str, height: int = 5):
    # Histogram with a KDE overlay for one continuous column
    sns.displot(data, x=column, kde=True, height=height, aspect=height/5).set(title=f'Distribution of {column}')
    plt.show()


def correlation_plot(data: pd.DataFrame, numeric_only: bool = True, width: int = 10, height: int = 5):
    # Pairwise Pearson correlations between the columns
    corr = data.corr(numeric_only = numeric_only)

    # Boolean mask: True hides a cell, so hide everything above the diagonal
    mask_ = np.triu(np.ones_like(corr, dtype=bool), k=1)

    fig , ax = plt.subplots()
    fig.set_size_inches(width,height)
    sns.heatmap(corr , mask = mask_ , vmax = 0.8 , square = True , annot = True , cmap = "YlGnBu" , ax = ax)
    plt.show()
    

##### 1. SepalLengthCm

In [None]:
plot_continuous_distribution(df, 'SepalLengthCm')

##### 2. SepalWidthCm

In [None]:
plot_continuous_distribution(df, 'SepalWidthCm')

##### 3. PetalLengthCm

In [None]:
plot_continuous_distribution(df, 'PetalLengthCm')

##### 4. PetalWidthCm

In [None]:
plot_continuous_distribution(df, 'PetalWidthCm')

From the above plots we can conclude that:
- Only `SepalWidthCm` looks approximately normally distributed.
- The remaining columns do not look normally distributed (they appear skewed or show more than one peak).


We create a copy of the dataframe in order to check the relation of `Species` with the other features. For this, we use `df.copy()` and encode the species as integers with `df['column_name'].map({key : value})`.

In [None]:
data = df.copy()
# Assign the encoded column back instead of calling replace(..., inplace=True) on a column selection
data['Species'] = data['Species'].map({'setosa' : 0, 'versicolor' : 1, 'virginica' : 2})

In [None]:
correlation_plot(data)

###### Conclusion: 

- `SepalWidthCm` is moderately, negatively correlated with `Species`, with a correlation coefficient of `-0.42`.
- The closer the correlation coefficient is to `1` or `-1`, the stronger the linear relationship.
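As a reminder of what the heatmap values mean, the sketch below computes the Pearson coefficient by hand on made-up numbers (not the Iris data) and compares it with `Series.corr`. One caveat worth keeping in mind: integer codes impose an order on the species, so the coefficient depends on how the labels are encoded.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only
width = pd.Series([3.5, 3.0, 3.2, 2.8, 2.5, 3.0])
label = pd.Series([0, 0, 1, 1, 2, 2])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
dx = width - width.mean()
dy = label - label.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = width.corr(label)
print(round(r_manual, 4), round(r_pandas, 4))
```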
    

In [None]:
px.scatter_3d(df, x = 'SepalLengthCm', y = 'SepalWidthCm', z = 'PetalLengthCm', color = 'Species')

From the scatter plot, the species look largely separable: `setosa` forms its own cluster, while `versicolor` and `virginica` sit close together with some overlap.

###### Using `Pairplot`

In [None]:
sns.pairplot(df, hue='Species')

## Feature Engineering
---
Check Skewness of each column

In [None]:
def check_skewness(data: pd.DataFrame, target : str = None, limit : float = 0.75):
    # Collect the columns whose absolute skewness reaches the limit
    skw = {}

    for column in data.columns:
        if column == target:
            continue  # skip the target column

        sk_val = data[column].skew()

        if abs(sk_val) >= limit:
            skw[column] = sk_val

    return skw

In [None]:
check_skewness(df, 'Species')

- Here, a feature is treated as skewed when its absolute skew value is `0.75` or more.
- By this rule, no features are skewed in the above result.
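To see how the `0.75` threshold behaves, here is a small sketch on two made-up series (not the Iris data): a symmetric one and one with a long right tail.

```python
import pandas as pd

# Toy values for illustration only
symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 1, 1, 1, 10])

limit = 0.75  # same threshold as in check_skewness
for name, values in [('symmetric', symmetric), ('right_tailed', right_tailed)]:
    skew_value = values.skew()
    print(name, round(skew_value, 3), abs(skew_value) >= limit)
```

The symmetric series has a skew of zero, while the long right tail pushes the second series above the limit.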

#### Check for Outliers

- Using `sns.boxplot()`, we can check for outliers present in our data.
- Outlier treatment method may differ based on feature importance or available data.

In [None]:
sns.boxplot(data = df, x = 'SepalLengthCm', y = 'Species')

In [None]:
sns.boxplot(data = df, x = 'SepalWidthCm', y = 'Species')

In [None]:
sns.boxplot(data = df, x = 'PetalLengthCm', y = 'Species')

In [None]:
sns.boxplot(data = df, x = 'PetalWidthCm', y = 'Species')

From the results above, there are only a few outliers within individual `Species`. Hence, we can leave them as they are.
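A box plot with default settings draws points beyond `1.5 * IQR` from the quartiles as outliers. The sketch below counts such points per group on made-up data; the helper `count_iqr_outliers` is hypothetical and only illustrates the rule.

```python
import numpy as np
import pandas as pd


def count_iqr_outliers(values: pd.Series, whis: float = 1.5) -> int:
    """Hypothetical helper: count points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(((values < lower) | (values > upper)).sum())


# Toy values for illustration only
toy = pd.DataFrame({
    'Species': ['a'] * 6 + ['b'] * 6,
    'PetalWidthCm': [1, 2, 3, 4, 5, 100, 1, 2, 3, 4, 5, 6],
})
outliers_per_species = toy.groupby('Species')['PetalWidthCm'].apply(count_iqr_outliers)
print(outliers_per_species)
```

Applied to the real dataframe, the same `groupby` call would give a count to go with the visual impression from the plots.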

## Feature Scaling

It is not mandatory to perform feature scaling, but in this file we will perform it. We will use `MinMaxScaler`, which can be imported with `from sklearn.preprocessing import MinMaxScaler`.

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

Before we fit the data to our scaler, let's separate the features from the target column.

In [None]:
X = df.drop(columns = 'Species') # alternatively we can use >> df.drop('Species' , axis = 1)
Y = df['Species']

In [None]:
X.shape , Y.shape

Let's fit our data to the scaler object.

To fit the data, we can use:
- `fit()`  => the data then has to be transformed in a separate step using `transform()`, or
- `fit_transform()` => fits and transforms in one call, so no separate transform step is needed

In [None]:
arr = scaler.fit_transform(X)
X_scaled = pd.DataFrame(arr, columns = X.columns)
X_scaled
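For reference, min-max scaling maps each column to `[0, 1]` via `(x - min) / (max - min)`. The sketch below, on a made-up array, checks that formula against `MinMaxScaler` and shows that `fit_transform()` matches `fit()` followed by `transform()`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy values for illustration only
toy = np.array([[1.0, 10.0], [2.0, 30.0], [4.0, 20.0]])

# Manual min-max scaling, column by column
manual = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))

one_step = MinMaxScaler().fit_transform(toy)
two_step = MinMaxScaler().fit(toy).transform(toy)

print(one_step)
```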

We need to save the fitted scaler so that new data can be transformed in the same way at prediction time. The `pickle` module can be used to save it to a file.


In [None]:
import pickle
import json

In [None]:
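path = os.path.join('Task_1_Model_Files', 'Scaler.pkl')
os.makedirs(os.path.dirname(path), exist_ok=True)  # create the folder if it does not exist yet
with open(path , 'wb') as f: # We can pass complete destination path where we want to save the file
    pickle.dump(scaler, f)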
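Loading the scaler back works the same way with `pickle.load`. Below is a minimal round-trip sketch that uses a temporary folder and a toy scaler rather than the notebook's own files, so the names `toy_scaler` and `tmp_path` are illustrative only.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy scaler fitted on made-up values
toy_scaler = MinMaxScaler().fit(np.array([[1.0], [3.0], [5.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, 'Scaler.pkl')
    with open(tmp_path, 'wb') as f:
        pickle.dump(toy_scaler, f)
    with open(tmp_path, 'rb') as f:
        loaded_scaler = pickle.load(f)

# The loaded scaler applies the same min and max that were learned before saving
new_sample = np.array([[2.0]])
scaled_sample = loaded_scaler.transform(new_sample)
print(scaled_sample)
```

Only unpickle files from a source you trust, since `pickle.load` can execute arbitrary code.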

## Model Building
---
- Using `train_test_split` from `sklearn.model_selection`, we can split the data into training and testing sets.
- A common choice, used here, is `75%` of the data for `training` and the rest for `testing`.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.model_selection import GridSearchCV

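`random_state` fixes the seed used for shuffling, so the same rows land in the training and testing sets every time we execute the cell. This makes the results reproducible; it does not by itself improve the accuracy of the model.
Any `int` value can be assigned to `random_state`.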

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X_scaled, Y, train_size = 0.75, random_state = 5)
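The effect of `random_state` can be checked directly. The sketch below splits a made-up array twice with the same seed and confirms that both splits are identical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 20 rows, 2 features
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.array([0, 1] * 10)

first = train_test_split(toy_X, toy_y, train_size=0.75, random_state=5)
second = train_test_split(toy_X, toy_y, train_size=0.75, random_state=5)

# Each call returns [X_train, X_test, y_train, y_test]
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split, first[0].shape, first[1].shape)
```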

###### Check the split data size

In [None]:
x_train.shape , x_test.shape , y_train.shape , y_test.shape

We are ready to train our model.
We will train our model on:

- Logistic Regression
- KNN Classifier
- RandomForest Classifier
- Adaboost Classifier

Based on the accuracy, we will choose the final model from the list above.

Let's import the required models.
We will also need to evaluate our models, so we also import evaluation metrics such as the `confusion matrix`, `classification report` and `accuracy score`.

---

In [None]:
# importing model libraries

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier , AdaBoostClassifier

In [None]:
# importing evaluation metrics

from sklearn.metrics import classification_report , accuracy_score
from sklearn.metrics import multilabel_confusion_matrix


#### 1. Logistic Regression

In [None]:
log_model = LogisticRegression(multi_class = 'ovr') # ovr > one versus rest (the multi_class argument is deprecated from scikit-learn 1.5)
log_model.fit(x_train , y_train)

##### Accuracy of `LogisticRegression` Model on Training Dataset

In [None]:
y_pred_train = log_model.predict(x_train)

accuracy = round(accuracy_score(y_train , y_pred_train), 2)

conf_mat = multilabel_confusion_matrix(y_train , y_pred_train)

class_rep = classification_report(y_train , y_pred_train)

print('-' * 100)

print(f'Accuracy of model on training data is: {accuracy}')

print('-' * 100)

print(f'Classification Report of model on training data : \n{class_rep}')

print('-' * 100)

print(f'Multilabel Confusion Matrix : \n\n {conf_mat}')

print('-' * 100)

##### Accuracy of `LogisticRegression` Model on Testing Dataset

In [None]:
y_pred = log_model.predict(x_test)

accuracy = round(accuracy_score(y_test , y_pred), 2)

conf_mat = multilabel_confusion_matrix(y_test , y_pred)

class_rep = classification_report(y_test , y_pred)

print('-' * 100)

print(f'Accuracy of model on testing data is: {accuracy}')

print('-' * 100)

print(f'Classification Report of model on testing data : \n{class_rep}')

print('-' * 100)

print(f'Multilabel Confusion Matrix : \n\n {conf_mat}')

print('-' * 100)

###### Conclusion : 

- Training Accuracy :  0.86
- Testing Accuracy : 0.84
---
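The evaluation cells in this notebook print the same three metrics for every model and split. One way to avoid repeating them is a small helper. The function `evaluate_model` below is a hypothetical sketch, not part of the original notebook, demonstrated on a toy baseline classifier. The `zero_division=0` argument is an addition that silences warnings for classes that are never predicted.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report, multilabel_confusion_matrix


def evaluate_model(model, features, labels, split_name: str) -> float:
    """Hypothetical helper: print accuracy, classification report and confusion matrices."""
    predictions = model.predict(features)
    accuracy = round(accuracy_score(labels, predictions), 2)
    print('-' * 100)
    print(f'Accuracy of model on {split_name} data is: {accuracy}')
    print('-' * 100)
    print(f'Classification Report of model on {split_name} data : \n'
          f'{classification_report(labels, predictions, zero_division=0)}')
    print('-' * 100)
    print(f'Multilabel Confusion Matrix : \n\n {multilabel_confusion_matrix(labels, predictions)}')
    print('-' * 100)
    return accuracy


# Toy demonstration with a baseline that always predicts the most frequent class
toy_features = [[0.0], [0.1], [0.9]]
toy_labels = ['setosa', 'setosa', 'virginica']
baseline = DummyClassifier(strategy='most_frequent').fit(toy_features, toy_labels)
toy_accuracy = evaluate_model(baseline, toy_features, toy_labels, 'training')
```

In the notebook, a call such as `evaluate_model(log_model, x_test, y_test, 'testing')` would print the same output as the corresponding cell.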

#### 2. KNN Classifier

In [None]:
knn_model = KNeighborsClassifier(n_neighbors = 5 , p = 2)
knn_model.fit(x_train , y_train)

##### Accuracy of `KNeighborsClassifier` Model on Training Dataset

In [None]:
y_pred_train = knn_model.predict(x_train)

accuracy = round(accuracy_score(y_train , y_pred_train), 2)

conf_mat = multilabel_confusion_matrix(y_train , y_pred_train)

class_rep = classification_report(y_train , y_pred_train)

print('-' * 100)

print(f'Accuracy of model on training data is: {accuracy}')

print('-' * 100)

print(f'Classification Report of model on training data : \n{class_rep}')

print('-' * 100)

print(f'Multilabel Confusion Matrix : \n\n {conf_mat}')

print('-' * 100)

##### Accuracy of `KNeighborsClassifier` Model on Testing Dataset

In [None]:
y_pred = knn_model.predict(x_test)

accuracy = round(accuracy_score(y_test , y_pred), 2)

conf_mat = multilabel_confusion_matrix(y_test , y_pred)

class_rep = classification_report(y_test , y_pred)

print('-' * 100)

print(f'Accuracy of model on testing data is: {accuracy}')

print('-' * 100)

print(f'Classification Report of model on testing data : \n{class_rep}')

print('-' * 100)

print(f'Multilabel Confusion Matrix : \n\n {conf_mat}')

print('-' * 100)

###### Conclusion : 

- Training Accuracy :  0.97
- Testing Accuracy : 0.92
---

#### 3. RandomForestClassifier

In [None]:
ran_model = RandomForestClassifier(n_jobs = -1 , random_state = 9)
ran_model.fit(x_train , y_train)

##### Accuracy of `RandomForestClassifier` Model on Training Dataset

In [None]:
y_pred_train = ran_model.predict(x_train)

accuracy = round(accuracy_score(y_train , y_pred_train), 2)

conf_mat = multilabel_confusion_matrix(y_train , y_pred_train)

class_rep = classification_report(y_train , y_pred_train)

print('-' * 100)

print(f'Accuracy of model on training data is: {accuracy}')

print('-' * 100)

print(f'Classification Report of model on training data : \n{class_rep}')

print('-' * 100)

print(f'Multilabel Confusion Matrix : \n\n {conf_mat}')

print('-' * 100)

##### Accuracy of `RandomForestClassifier` Model on Testing Dataset

In [None]:
y_pred = ran_model.predict(x_test)

accuracy = round(accuracy_score(y_test , y_pred), 2)

conf_mat = multilabel_confusion_matrix(y_test , y_pred)

class_rep = classification_report(y_test , y_pred)

print('-' * 100)

print(f'Accuracy of model on testing data is: {accuracy}')

print('-' * 100)

print(f'Classification Report of model on testing data : \n{class_rep}')

print('-' * 100)

print(f'Multilabel Confusion Matrix : \n\n {conf_mat}')

print('-' * 100)

###### Conclusion : 

- Training Accuracy :  1.0
- Testing Accuracy : 0.92
---

#### 4. AdaBoostClassifier

In [None]:
ada_model = AdaBoostClassifier(random_state = 6)
ada_model.fit(x_train , y_train)

##### Accuracy of `AdaBoostClassifier` Model on Training Dataset

In [None]:
y_pred_train = ada_model.predict(x_train)

accuracy = round(accuracy_score(y_train , y_pred_train), 2)

conf_mat = multilabel_confusion_matrix(y_train , y_pred_train)

class_rep = classification_report(y_train , y_pred_train)

print('-' * 100)

print(f'Accuracy of model on training data is: {accuracy}')

print('-' * 100)

print(f'Classification Report of model on training data : \n{class_rep}')

print('-' * 100)

print(f'Multilabel Confusion Matrix : \n\n {conf_mat}')

print('-' * 100)

##### Accuracy of `AdaBoostClassifier` Model on Testing Dataset

In [None]:
y_pred = ada_model.predict(x_test)

accuracy = round(accuracy_score(y_test , y_pred), 2)

conf_mat = multilabel_confusion_matrix(y_test , y_pred)

class_rep = classification_report(y_test , y_pred)

print('-' * 100)

print(f'Accuracy of model on testing data is: {accuracy}')

print('-' * 100)

print(f'Classification Report of model on testing data : \n{class_rep}')

print('-' * 100)

print(f'Multilabel Confusion Matrix : \n\n {conf_mat}')

print('-' * 100)

###### Conclusion : 

- Training Accuracy :  0.99
- Testing Accuracy : 0.92
---

Hyperparameter tuning may improve the results of our model. We are selecting `AdaBoostClassifier` for tuning.


In [None]:
hyper_parameters = {'n_estimators' : np.arange(30 , 200), 
                    'learning_rate' : np.arange(0.1 , 1.0 , 0.1)
                   }
   

In [None]:
ada_grcv = GridSearchCV(ada_model , hyper_parameters , cv = 5 , n_jobs = -1)
ada_grcv.fit(x_train , y_train)
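A grid search fits one model per parameter combination and per fold, so the cost grows quickly. The sketch below counts the combinations in the grid defined above using `ParameterGrid`, without fitting anything.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
hyper_parameters = {'n_estimators': np.arange(30, 200),
                    'learning_rate': np.arange(0.1, 1.0, 0.1)}

n_candidates = len(ParameterGrid(hyper_parameters))
n_folds = 5  # matches cv = 5
print(f'{n_candidates} candidates x {n_folds} folds = {n_candidates * n_folds} fits')
```

After fitting, `ada_grcv.best_params_` and `ada_grcv.best_score_` report the selected combination and its mean cross-validated score.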

In [None]:
adagr_model = ada_grcv.best_estimator_

##### Accuracy of Tuned `AdaBoostClassifier` Model on Training Dataset

In [None]:
y_pred_train = adagr_model.predict(x_train)

accuracy = round(accuracy_score(y_train , y_pred_train), 2)

conf_mat = multilabel_confusion_matrix(y_train , y_pred_train)

class_rep = classification_report(y_train , y_pred_train)

print('-' * 100)

print(f'Accuracy of model on training data is: {accuracy}')

print('-' * 100)

print(f'Classification Report of model on training data : \n{class_rep}')

print('-' * 100)

print(f'Multilabel Confusion Matrix : \n\n {conf_mat}')

print('-' * 100)

##### Accuracy of Tuned `AdaBoostClassifier` Model on Testing Dataset

In [None]:
y_pred = adagr_model.predict(x_test)

accuracy = round(accuracy_score(y_test , y_pred), 2)

conf_mat = multilabel_confusion_matrix(y_test , y_pred)

class_rep = classification_report(y_test , y_pred)

print('-' * 100)

print(f'Accuracy of model on testing data is: {accuracy}')

print('-' * 100)

print(f'Classification Report of model on testing data : \n{class_rep}')

print('-' * 100)

print(f'Multilabel Confusion Matrix : \n\n {conf_mat}')

print('-' * 100)

###### Conclusion : 

- Training Accuracy :  0.96
- Testing Accuracy : 0.92
---

###### Exporting Model File

In [None]:
with open('Ada_model.pkl' , 'wb') as f:
    
    pickle.dump(adagr_model , f)

## Taking sample rows for testing

In [None]:
test = X_scaled[30 : 41]

In [None]:
prediction = adagr_model.predict(test)
prediction

In [None]:
actual = Y[30 : 41]
actual
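To compare the two outputs row by row, they can be placed side by side. The sketch below uses made-up stand-ins for `prediction` (a NumPy array) and `actual` (a `Series` that keeps its original index); the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for `prediction` and `actual`
toy_prediction = np.array(['setosa', 'setosa', 'versicolor'])
toy_actual = pd.Series(['setosa', 'setosa', 'setosa'], index=[30, 31, 32])

comparison = pd.DataFrame({'actual': toy_actual, 'predicted': toy_prediction},
                          index=toy_actual.index)
comparison['match'] = comparison['actual'] == comparison['predicted']

print(comparison)
print('Share of matching rows:', comparison['match'].mean())
```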

## Thank You
---