## Problem Description

Use the iris flower dataset from sklearn.datasets to train ML models with different classification algorithms. You need
to measure the accuracy of each model and use a model to predict samples in your test
dataset. The iris dataset has 150 samples, each with the following features:

1. Sepal length
2. Sepal width
3. Petal length
4. Petal width

Using the above 4 features, classify a flower into one of three categories:
1. Setosa
2. Versicolour
3. Virginica

In [1]:
# Importing libraries
import numpy as np
import pandas as pd

In [2]:
# Loading dataset to variable 'iris'
from sklearn.datasets import load_iris
iris = load_iris()
dir(iris)

['DESCR',
 'data',
 'data_module',
 'feature_names',
 'filename',
 'frame',
 'target',
 'target_names']

In [4]:
# Converting to a dataframe
df = pd.DataFrame(iris.data,columns=iris.feature_names)
df['target'] = iris.target
df['flower'] = iris.target
df['flower'] = df['flower'].apply(lambda x: iris.target_names[x])
df

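| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | flower |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | setosa |
| ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 | virginica |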


In [5]:
# Target names
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

The 'target' column in the dataframe encodes the iris species 'setosa', 'versicolor' and 'virginica' as 0, 1 and 2 respectively.
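The mapping from integer codes to species names can also be done without `apply`. The following is a minimal sketch, assuming only the `load_iris` bunch shown above; the names `flower_names` and `code_to_name` are introduced here for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Indexing the names array with the integer codes maps every sample in one step
flower_names = iris.target_names[iris.target]

# Explicit code -> name lookup (hypothetical helper, for reference only)
code_to_name = {code: str(name) for code, name in enumerate(iris.target_names)}
print(code_to_name)
print(flower_names[:3], flower_names[-3:])
```

Indexing a NumPy array with an array of integers returns one name per sample, which gives the same result as the `apply` with a lambda used for the 'flower' column.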

### Data Exploration

In [6]:
df.shape

(150, 6)

There are 150 rows and 6 columns in the dataframe: the 4 features plus the 'target' and 'flower' columns added above.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    int32  
 5   flower             150 non-null    object 
dtypes: float64(4), int32(1), object(1)
memory usage: 6.6+ KB


In [8]:
# Statistical summary
df.describe(include='all')

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | flower |
|---|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 | 150 |
| unique | | | | | | 3 |
| top | | | | | | setosa |
| freq | | | | | | 50 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 | 1.0 | |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 | |
| min | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 | |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 | |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 | |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 | |


In [10]:
# Count of null values per column
df.isnull().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
target               0
flower               0
dtype: int64

There are no missing values in the dataset.

#### Splitting data into training and testing sets

In [11]:
# Features: every column except the two label columns
X = df.drop(['target', 'flower'], axis=1)
X

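| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |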


In [12]:
# Labels: the integer-encoded species
y = df['target']
y

0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Name: target, Length: 150, dtype: int32

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state = 42)

In [14]:
len(X_train)

120

In [15]:
len(X_test)

30
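With only 30 test samples, it can be useful to check how many samples of each class landed in the test set. The sketch below repeats the split with the same settings and then shows an assumed variant that passes `stratify=y`, which is not used in this notebook; the names prefixed with `Xs_` and `ys_` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# as_frame=True returns a DataFrame of features and a Series of targets
X, y = load_iris(return_X_y=True, as_frame=True)

# Same settings as the notebook: 80% train, 20% test, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(y_test.value_counts().sort_index())

# Assumed variant: stratify=y preserves the 1:1:1 class ratio in both subsets
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(ys_test.value_counts().sort_index())
```

A stratified split is a common choice for small, balanced datasets because every class is then equally represented in the evaluation.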

#### Comparing scores and hyperparameters of different classification ML models using Grid Search Cross Validation 

A grid search with cross-validation (GridSearchCV) is used to find the best hyperparameters for the following classification models:
1. Logistic Regression
2. Decision Tree
3. Random Forest
4. Support Vector Machine (SVM)
5. Naive Bayes algorithm
6. K-Nearest Neighbors (KNN)

In [16]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

In [17]:
model_params = {
    'svm': {
        'model': svm.SVC(gamma='auto'),
        'params' : {
            'C': [1,10,20],
            'kernel': ['rbf','linear']
        }  
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [1,5,10]
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(solver='liblinear',multi_class='auto'),
        'params': {
            'C': [1,5,10]
        }
    },
    'naive_bayes_gaussian': {
        'model': GaussianNB(),
        'params': {}
    },    
    'decision_tree': {
        'model': DecisionTreeClassifier(),
        'params': {
            'criterion': ['gini','entropy'],
            
        }
    },
    'knn': {
        'model': KNeighborsClassifier(),
        'params': {
            'n_neighbors': [1, 3, 5, 7, 9]
        }
    }
}

Here, model_params is a dictionary that describes the machine learning models to be evaluated, along with the hyperparameter values to try in the grid search.
Each entry in the dictionary corresponds to one model. For example, 'svm' represents Support Vector Machine, 'random_forest' represents Random Forest, and so on. Each entry provides the model object and a dictionary that maps each hyperparameter to the list of values the grid search should try.
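To see what a parameter dictionary expands to, scikit-learn's `ParameterGrid` can enumerate the combinations. This is a small sketch using the 'svm' grid from above; the variable names `svm_grid`, `combos` and `empty` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the 'svm' entry in model_params
svm_grid = {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']}

# Every combination of the listed values is one candidate setting
combos = list(ParameterGrid(svm_grid))
print(len(combos))
for combo in combos:
    print(combo)

# An empty grid (as used for naive_bayes_gaussian) yields a single
# candidate: the model with the parameters it was constructed with
empty = list(ParameterGrid({}))
print(empty)
```

The SVM grid therefore produces 3 x 2 = 6 candidates, and each candidate is fitted once per cross-validation fold.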

In [19]:
scores = []
for model_name, mp in model_params.items():
    clf =  GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(X_train,y_train)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
score_df = pd.DataFrame(scores,columns=['model','best_score','best_params'])
score_df

| | model | best_score | best_params |
|---|---|---|---|
| 0 | svm | 0.958333 | {'C': 1, 'kernel': 'linear'} |
| 1 | random_forest | 0.958333 | {'n_estimators': 10} |
| 2 | logistic_regression | 0.958333 | {'C': 10} |
| 3 | naive_bayes_gaussian | 0.941667 | {} |
| 4 | decision_tree | 0.95 | {'criterion': 'gini'} |
| 5 | knn | 0.958333 | {'n_neighbors': 3} |


The above loop iterates through each model defined in model_params. For each model, it does the following:

- Extracts the model object and its associated hyperparameters from the 'model_params' dictionary above.
- Creates a GridSearchCV object named 'clf' using the model and hyperparameters. The cv=5 parameter specifies 5-fold cross-validation.
- Fits the clf grid search object using the training data.
- Appends a dictionary to the scores list containing the model name, the best score obtained from grid search (clf.best_score_), and the best hyperparameters (clf.best_params_).
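The steps above keep only the best score per model. As a rough sketch of what the grid search records for every candidate, the example below runs the same search for KNN alone and reads `cv_results_`; the split is recreated so the block runs on its own, and the name `results` is illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One grid search, same settings as the loop: 5-fold CV on the training data
clf = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 3, 5, 7, 9]}, cv=5)
clf.fit(X_train, y_train)

# cv_results_ holds one row per candidate; keep the most informative columns
results = pd.DataFrame(clf.cv_results_)[
    ['param_n_neighbors', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(results)
print(clf.best_params_, clf.best_score_)
```

Here `best_score_` is the mean cross-validated accuracy of the top-ranked candidate, so it is an estimate from the training folds and not a score on `X_test`.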

##### Based on 5-fold cross-validation on the training data, 'SVM', 'Random Forest', 'Logistic Regression' and 'KNN' are slightly more accurate (approximately 96%) than 'Decision Tree' (95%) and 'Naive Bayes' (approximately 94%).

##### Predictions using Logistic Regression

In [20]:
# Choosing the logistic regression model with the best parameter found above (C=10) to make predictions
logreg_model = LogisticRegression(solver='liblinear', multi_class='auto', C=10)
logreg_model.fit(X_train,y_train)

In [21]:
y_pred=logreg_model.predict(X_test)
y_pred

array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0])
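The predictions can be compared with `y_test` to get the accuracy on the held-out data. The following is a minimal sketch and not the notebook's exact model: it assumes the default solver with a raised `max_iter` in place of the liblinear setup above, so its numbers may differ slightly from what `logreg_model` would give.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumption: default solver and max_iter=1000, not the notebook's liblinear model
model = LogisticRegression(C=10, max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Fraction of test samples whose predicted class matches the true class
acc = accuracy_score(y_test, y_pred)

# Rows are true classes, columns are predicted classes (0, 1, 2)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(f"Test accuracy: {acc:.3f}")
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions per class, and any off-diagonal entry shows which species was confused with which.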

#### Sample prediction

In [51]:
# Choosing a sample by its index label
print("The feature values are :\n",X_test.loc[104])
print("The flower category (target) is:", df['flower'].loc[104])

The feature values are :
 sepal length (cm)    6.5
sepal width (cm)     3.0
petal length (cm)    5.8
petal width (cm)     2.2
Name: 104, dtype: float64
The flower category (target) is: virginica


Above are the feature values and the target at index label 104. These sample features are used below to verify the model's prediction.

In [49]:
# Extracting the sample as a one-row DataFrame: the double brackets keep the
# 2D shape the model expects and the feature names it was fitted with
sample_X = X_test.loc[[104]]

# Predict the target variable for the sample using the fitted model
y_sample_pred = logreg_model.predict(sample_X)

print("Predicted y:", y_sample_pred)

Predicted y: [2]




In [50]:
flower_category = iris.target_names[y_sample_pred]
flower_category

array(['virginica'], dtype='<U10')

The predicted value '2' represents 'virginica', so the predicted and actual target values are the same.
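Besides the predicted class, a logistic regression model can report how confident it is through `predict_proba`. The sketch below recreates the split and, as an assumption, fits the default-solver variant of the model (not the liblinear one used above), so the exact probabilities are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumption: default solver and max_iter=1000, not the notebook's liblinear model
model = LogisticRegression(C=10, max_iter=1000).fit(X_train, y_train)

# Double brackets return a one-row DataFrame, so no reshape is needed
sample = X_test.loc[[104]]

# One probability per class, in the order given by model.classes_
proba = model.predict_proba(sample)[0]
for name, p in zip(iris.target_names, proba):
    print(f"{name}: {p:.3f}")
```

The class with the highest probability is the one `predict` returns for the same sample.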