# Case study on Supervised learning

Do the following on the iris dataset.
1. Read the dataset into the Python environment.
2. Do the necessary pre-processing steps.
3. Find out which classification model gives the best result for predicting iris species (also try the random forest algorithm).

In [None]:
# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### 1. Read the dataset into the Python environment.

In [None]:
# Loading the excel file into a pandas dataframe.
iris_data = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/Week 11/Activity/Case Study/iris.xls')

In [None]:
# Display the data
iris_data.head()

In [None]:
# number of elements in each dimension (Rows and Columns)
iris_data.shape

In [None]:
# Summary of the data
iris_data.info()

In [None]:
# Display the columns in the dataset
iris_data.columns

From the iris dataset:

- 50 samples of each of 3 iris species (150 samples in total)
- Measurements: sepal length, sepal width, petal length, petal width


In [None]:
# check the target variable
iris_data['Classification'].value_counts()

As the counts show, 'Classification' is the target variable of the iris dataset, with 50 observations of each species (setosa, versicolor, virginica). Since the target has three classes, this is a multi-class classification problem.

### 2. Do the necessary pre-processing steps.

In [None]:
# check the data type for all features in the dataset
iris_data.dtypes

Next, we check whether the iris dataset contains any null or missing values.

In [None]:
# Count the null values in each column of the dataset (before treatment)
iris_data.isna().sum()

As shown above, the iris dataset contains 19 null or missing values, spread across sepal length (SL), sepal width (SW) and petal length (PL). Since these three features are of float type, the missing values can be filled with the column mean or median.
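Mean and median imputation can give noticeably different fill values when a column contains an extreme entry. The following is a minimal, library-free sketch of the idea; the `impute` helper and the sample column are hypothetical and are not part of the notebook's own code.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Return a copy of `values` with None replaced by the mean or median
    of the observed (non-missing) entries."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]


# Hypothetical sepal-length column with one gap and one extreme value (9.0).
sepal_length = [5.0, 5.2, None, 5.4, 9.0]

mean_filled = impute(sepal_length, "mean")
median_filled = impute(sepal_length, "median")

# The mean is pulled toward the extreme value; the median is not.
print("mean fill:  ", round(mean_filled[2], 2))
print("median fill:", round(median_filled[2], 2))
```

When a column is roughly symmetric the two fill values are close, which is why checking the distribution first is a reasonable step.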

In [None]:
# Display the rows that contain at least one null value (before treatment)
iris_data[iris_data.isna().any(axis=1)]

We can use plots and summary statistics to help identify missing or corrupt data.

In [None]:
# we can plot the frequency graph
freq_graph = iris_data.select_dtypes(include=['float64'])
freq_graph.hist(figsize=(15,10))
plt.show()

From the frequency graphs above, sepal length (SL), sepal width (SW), petal length (PL) and petal width (PW) appear to follow an approximately normal distribution, so we use the mean to fill the missing values.
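A rough numeric companion to reading histograms by eye is the moment-based skewness, which is close to 0 for symmetric data. The sketch below assumes nothing about the actual iris values; the `sample_skewness` helper and both sample lists are hypothetical.

```python
from statistics import mean, median


def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (0 for perfectly symmetric data)."""
    n = len(values)
    mu = mean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mu) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 2.0, 2.0, 10.0]

for name, data in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    print(name, "skewness:", round(sample_skewness(data), 3),
          "mean:", mean(data), "median:", median(data))
```

A skewness far from 0, or a mean that sits well away from the median, would suggest preferring the median for imputation.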

In [None]:
# The Statistical summary of iris dataset
iris_data.describe().T

> "Missing values are frequently indicated by out-of-range entries; perhaps a negative number (e.g., -1) in a numeric field that is normally only positive, or a 0 in a numeric field that can never normally be 0".

Here we can see the complete statistical summary of each column: count, mean, standard deviation, minimum, maximum, and the 25th, 50th and 75th percentiles.

We can use a loop to fill the missing or null values in sepal length (SL), sepal width (SW) and petal length (PL) with the column mean.

In [None]:
# fill the missing values
missing_val = ['SL','SW','PL']

for i in missing_val:
    iris_data[i] = iris_data[i].fillna(iris_data[i].mean())

In [None]:
# Count the null values in each column of the dataset (after treatment)
iris_data.isna().sum()

In [None]:
# Display the rows that still contain at least one null value (after treatment)
iris_data[iris_data.isna().any(axis=1)]

As you can see, the missing values in sepal length (SL), sepal width (SW) and petal length (PL) have been filled. The iris dataset now contains no null or missing values.

In [None]:
# number of elements in each dimension (Rows and Columns)
iris_data.shape

Next, we check for and handle outliers in the iris dataset. A boxplot is a convenient way to spot them.

In [None]:
# Display the columns in the dataset
iris_data.columns

In [None]:
# boxplot before removing the outliers from SL, SW, PL and PW features
for i in iris_data.columns[iris_data.dtypes == float]:
    fig = plt.figure(figsize=(8,5))
    # plot the measurements themselves (not their value_counts) so the whiskers describe the data
    iris_data[i].plot(kind='box')
    fig.suptitle(i)

As you can see, there are some outliers in the sepal width (SW) feature, so we need to remove them.

> Sepal Width (SW) feature

In [None]:
# boxplot before removing the outliers from Sepal Width feature
fig = plt.figure(figsize=(8,5))
# plot the SW values directly rather than their value_counts
iris_data['SW'].plot(kind='box')
fig.suptitle('Sepal Width')

In [None]:
# To remove outliers we first compute the quartiles, which define the outlier limits.
# NumPy >= 1.22 uses `method=`; older versions take `interpolation='midpoint'` instead.
Q1 = np.percentile(iris_data['SW'],25,method='midpoint')
Q2 = np.percentile(iris_data['SW'],50,method='midpoint')
Q3 = np.percentile(iris_data['SW'],75,method='midpoint')
print('Q1: ',Q1,'\nQ2: ',Q2,'\nQ3: ',Q3)

# check the inter quartile range (IQR)
IQR = Q3 - Q1
print('IQR: ',round(IQR,2))

# check the lower and upper limits
low_lm = Q1-1.5*IQR
upp_lm = Q3+1.5*IQR
print("Lower limit is : ",round(low_lm,2))
print("Upper limit is : ",round(upp_lm,2))

# Data points that fall below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are conventionally
# treated as outliers; values outside these limits will be removed.

# display the outliers
outliers = []
for i in iris_data['SW']:
  if((i>upp_lm)or(i<low_lm)):
    outliers.append(i)

print("Outliers in the Sepal Width: ",outliers)

These values are the outliers in the sepal width feature: one lies below the lower limit and the rest lie above the upper limit. Next, we need to find the index values of these outliers.
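The 1.5 × IQR rule can be written out without NumPy to make each step explicit. This is a sketch under stated assumptions: `percentile_midpoint` mimics the midpoint rule (average of the two data points surrounding the percentile position), `iqr_outliers` is a hypothetical helper, and the sample values are made up rather than taken from the dataset.

```python
import math


def percentile_midpoint(values, q):
    """q-th percentile using the midpoint of the two surrounding data points."""
    data = sorted(values)
    pos = (q / 100) * (len(data) - 1)          # fractional position in sorted data
    lo, hi = math.floor(pos), math.ceil(pos)   # neighbours of that position
    return (data[lo] + data[hi]) / 2


def iqr_outliers(values, k=1.5):
    """Return (outliers, (lower_limit, upper_limit)) using the k * IQR rule."""
    q1 = percentile_midpoint(values, 25)
    q3 = percentile_midpoint(values, 75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)


# Hypothetical sepal-width values with one low and one high extreme.
sepal_width = [1.5, 2.5, 3.0, 3.0, 3.1, 3.2, 3.4, 4.5]
outliers, (lower, upper) = iqr_outliers(sepal_width)
print("limits:", round(lower, 3), round(upper, 3))
print("outliers:", outliers)
```

Whether flagged points should be dropped, capped, or kept is a judgment call; this notebook drops them.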

In [None]:
# select the index of these outliers
indx_low = iris_data['SW']<low_lm
outlier_indx_low = iris_data.loc[indx_low].index

indx_upp = iris_data['SW']>upp_lm
outlier_indx_upp= iris_data.loc[indx_upp].index

print('The outliers index value of lower limit is {}'.format(outlier_indx_low),
      '\nand upper limit is {}'.format(outlier_indx_upp))

In [None]:
# drop these indices to remove the outliers
iris_data.drop(outlier_indx_low, inplace=True)
iris_data.drop(outlier_indx_upp, inplace=True)

In [None]:
# boxplot after removing the outliers from Sepal Width feature
fig = plt.figure(figsize=(8,5))
# plot the SW values directly rather than their value_counts
iris_data['SW'].plot(kind='box')
fig.suptitle('Sepal Width')

From the boxplot above, we can see that the outliers in the sepal width (SW) feature have been reduced.

In [None]:
# Summary of the data
iris_data.info()

In [None]:
# Statistical Summary of the data
iris_data.describe()

Filling the missing values with the mean did not introduce any observable change in the iris dataset.

In [None]:
# Label-encode the target feature 'Classification'
from sklearn.preprocessing import LabelEncoder
iris_data['Classification'] = LabelEncoder().fit_transform(iris_data['Classification'])

In [None]:
# check the target variable after label encode
iris_data['Classification'].value_counts()

### 3. Find out which classification model gives the best result for predicting iris species (also try the random forest algorithm).

In [None]:
# Function to check model performances
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, confusion_matrix
def check_model_metrices(y_test, y_pred):
    print('Model Accuracy = ', accuracy_score(y_test, y_pred))
    print('Model Precision = ', precision_score(y_test, y_pred, average='micro'))
    print('Model Recall = ', recall_score(y_test, y_pred, average='micro'))
    print('Model F1 Score = ', f1_score(y_test, y_pred, average='micro'))
    print('Confusion Matrix = \n', confusion_matrix(y_test, y_pred))

In [None]:
# Extract feature columns
feature_cols = list(iris_data.columns[:-1])

# Extract target column 'Classification'
target_col = iris_data.columns[-1]

# Separate the data into feature data and target data (X and y, respectively)
X = iris_data[feature_cols]
y = iris_data[target_col]
print(f'Feature shape: {X.shape}')

In [None]:
# splitting the data into train and test 
from sklearn.model_selection import train_test_split
# training points (approximately 70%) and testing points (approximately 30%).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

### Model 1. Multinomial Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression
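# solver: the algorithm used in the optimization.
# Note: the multi_class argument is deprecated in recent scikit-learn releases (1.5+),
# where multinomial is already the default for this solver; drop it if a warning or error appears.
lr = LogisticRegression(multi_class='multinomial', solver='newton-cg')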
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)


In [None]:
# Calling function to check model performances
check_model_metrices(y_test, lr_pred)

For a good model, accuracy and F1 score should be as high as possible. The multinomial logistic regression model performed well, with an accuracy of 88.64% and just 5 misclassifications.
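Because `check_model_metrices` uses `average='micro'`, the precision, recall and F1 values it prints coincide with accuracy in a single-label multi-class problem: every misclassified sample counts once as a false positive (for the predicted class) and once as a false negative (for the true class). The sketch below shows this from a confusion matrix; the `micro_metrics` helper and the matrix itself are hypothetical, not the notebook's actual output.

```python
def micro_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j.
    Returns (accuracy, micro precision, micro recall, micro F1)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = sum(cm[i][i] for i in range(n))
    # Each off-diagonal entry is a false positive for its column class...
    fp = sum(cm[i][j] for i in range(n) for j in range(n) if i != j)
    # ...and the same entry is a false negative for its row class.
    fn = fp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp / total
    return accuracy, precision, recall, f1


# Hypothetical 3-class confusion matrix: 44 samples, 5 misclassified.
confusion = [
    [15, 0, 0],
    [0, 12, 3],
    [0, 2, 12],
]
accuracy, precision, recall, f1 = micro_metrics(confusion)
print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```

To see per-class differences, `average='macro'` or `average=None` would be the options to try in scikit-learn.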

### Model 2: K-Nearest Neighbors(KNN)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
acc_values = [] #find the optimum k values
neighbors = np.arange(3,16)
for k in neighbors:
  classifier = KNeighborsClassifier(n_neighbors=k, metric='minkowski')
  classifier.fit(X_train,y_train)
  y_pred = classifier.predict(X_test)
  acc = accuracy_score(y_test,y_pred) # find the maximum accuracy
  acc_values.append(acc)

acc_values

In [None]:
plt.plot(neighbors, acc_values,'o-')
plt.xlabel('k value')
plt.ylabel('accuracy')

As the plot shows, the best k value is 3.
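Instead of reading the best k off the plot, it can be picked programmatically. A small sketch, assuming lists shaped like `neighbors` and `acc_values`; the `best_k` helper and the accuracy numbers are hypothetical.

```python
def best_k(neighbors, acc_values):
    """Return (k, accuracy) for the highest accuracy; ties go to the smallest k
    because max() keeps the first maximum it encounters."""
    best_index = max(range(len(acc_values)), key=lambda i: acc_values[i])
    return neighbors[best_index], acc_values[best_index]


# Hypothetical accuracies for k = 3..7 (not the notebook's real numbers).
example_neighbors = [3, 4, 5, 6, 7]
example_acc = [0.93, 0.91, 0.93, 0.89, 0.91]

k, acc = best_k(example_neighbors, example_acc)
print("best k:", k, "accuracy:", acc)
```

Note that selecting k on the test set tends to give an optimistic accuracy estimate; cross-validation on the training set is a common alternative.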

In [None]:
# now we can make the model with k value as 3
classifier = KNeighborsClassifier(n_neighbors=3, metric='minkowski')
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)

# Calling function to check model performances
check_model_metrices(y_test, y_pred)

The K-Nearest Neighbors (KNN) model performed well, with an accuracy of 93.18% and 3 misclassifications.

Accuracy and F1 score are noticeably better than with logistic regression.
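To make the KNN idea concrete, here is a from-scratch sketch of the prediction step: measure the Euclidean distance (Minkowski with p=2) from the query to every training point, keep the k closest, and take a majority vote. The `knn_predict` helper and the toy points are hypothetical, and the tie-breaking shown is this sketch's own choice, not necessarily scikit-learn's.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training points under Euclidean distance."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    # most_common breaks ties in favour of the label encountered first
    return votes.most_common(1)[0][0]


# Toy (petal length, petal width) points for two encoded classes.
train_X = [(1.0, 0.2), (1.5, 0.3), (1.4, 0.2), (4.5, 1.5), (4.7, 1.4), (4.0, 1.3)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.3, 0.25)))
print(knn_predict(train_X, train_y, (4.4, 1.4)))
```

Since the vote depends on raw distances, features on very different scales can dominate the result, so scaling is often worth considering for KNN.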

### Model 3: SVM(Support Vector Machine) with multi-class


Three commonly used SVM kernel types are tried here:
1. Linear
2. Polynomial
3. Radial Basis Function (RBF)
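The three kernels differ only in how they score the similarity of two feature vectors. A minimal sketch of the usual formulas follows; the function names and the default `gamma`, `coef0` and `degree` values are illustrative assumptions and do not reproduce scikit-learn's defaults (for example, its `gamma` defaults to `'scale'`).

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    # K(x, y) = <x, y>
    return dot(x, y)


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=0.0):
    # K(x, y) = (gamma * <x, y> + coef0) ** degree
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


x, y = [1.0, 2.0], [3.0, 4.0]
print("linear:", linear_kernel(x, y))
print("poly:  ", polynomial_kernel(x, y))
print("rbf:   ", rbf_kernel(x, y, gamma=0.5))
```

The linear kernel yields a straight decision boundary, while the polynomial and RBF kernels allow curved boundaries at the cost of extra hyperparameters.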

In [None]:
from sklearn.svm import SVC
# decision_function_shape is set to One-vs-One for multi-class
linear = SVC(kernel='linear', decision_function_shape='ovo') 
linear.fit(X_train, y_train)
linear_pred = linear.predict(X_test)

In [None]:
# Calling function to check model performances
check_model_metrices(y_test, linear_pred)

The multi-class SVM (Support Vector Machine) with a linear kernel performed well, with an accuracy of 90.91% and 4 misclassifications.

Accuracy and F1 score improve on logistic regression but not on KNN.

In [None]:
# decision_function_shape is set to One-vs-One for multi-class
poly = SVC(kernel='poly', degree = 3, decision_function_shape='ovo')
poly.fit(X_train, y_train)
poly_pred = poly.predict(X_test)

In [None]:
# Calling function to check model performances
check_model_metrices(y_test, poly_pred)

The multi-class SVM (Support Vector Machine) with a polynomial kernel performed well, with an accuracy of 93.18% and 3 misclassifications.

Accuracy and F1 score improve on logistic regression and the linear-kernel SVM, and equal those of KNN.

In [None]:
# decision_function_shape is set to One-vs-One for multi-class
rbf = SVC(kernel='rbf', decision_function_shape='ovo') 
rbf.fit(X_train, y_train)
rbf_pred = rbf.predict(X_test)

In [None]:
# Calling function to check model performances
check_model_metrices(y_test, rbf_pred)

The multi-class SVM (Support Vector Machine) with an RBF kernel reached an accuracy of 86.36% with 6 misclassifications.

Accuracy and F1 score are lower than for the other classification models tried so far.
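The accuracies quoted in this section follow directly from the misclassification counts. The sketch below recomputes them; the test-set size of 44 is an assumption inferred from the reported figures (for example, 3 errors at 93.18%), and the dictionary keys are just labels chosen for this illustration.

```python
TEST_SIZE = 44  # assumption: inferred from the reported percentages, not printed in the text

# Misclassification counts as reported in the text for each model.
misclassified = {
    "Logistic regression": 5,
    "KNN (k=3)": 3,
    "SVM linear": 4,
    "SVM polynomial": 3,
    "SVM RBF": 6,
}

# accuracy = correctly classified / total test samples
accuracy = {name: (TEST_SIZE - errors) / TEST_SIZE
            for name, errors in misclassified.items()}

# Print the models from most to least accurate.
for name, acc in sorted(accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {acc:.2%}")
```

With differences of only one to three test samples between models, the ranking is sensitive to the particular train/test split, so it should be read as indicative rather than conclusive.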

### Model 4: Decision Trees

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)

In [None]:
# Calling function to check model performances
check_model_metrices(y_test, dt_pred)

### Model 5: Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

In [None]:
# Calling function to check model performances
check_model_metrices(y_test, rf_pred)