<a href="https://colab.research.google.com/github/ShabnaIlmi/iris-project/blob/main/Iris_Flower_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Installing Necessary Libraries**

In [23]:
# Installing necessary libraries
# !pip install shap
# !pip install lime



# **Importing the Relevant Libraries**

In [24]:
# Importing the relevant libraries
from sklearn.datasets import load_iris
import numpy as np
import joblib
import seaborn as sns
import matplotlib.pyplot as plt
import shap
import lime.lime_tabular
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_svmlight_file
from scipy.stats import zscore

In [25]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# **Loading the Dataset**

In [26]:
# Loading the dataset using Scikit-learn
iris = load_iris()

### **Displaying the first few rows of the Iris Dataset**

In [27]:
# Displaying the first few rows of the iris dataset
iris_data = sns.load_dataset("iris")
iris_data.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
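
The `iris` object loaded with scikit-learn above is not the frame previewed here; the preview uses seaborn's copy, which is fetched over the network. As a minimal sketch, assuming scikit-learn 0.23 or later (for `as_frame=True`), a comparable DataFrame could be built from the bundled data instead. The names `iris_bunch` and `iris_df` are hypothetical and are not used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects (assumes scikit-learn >= 0.23)
iris_bunch = load_iris(as_frame=True)
iris_df = iris_bunch.frame.copy()

# Map the integer targets (0, 1, 2) to their species names
target_to_name = {i: str(name) for i, name in enumerate(iris_bunch.target_names)}
iris_df["species"] = iris_df["target"].map(target_to_name)
iris_df = iris_df.drop(columns="target")

print(iris_df.shape)
print(iris_df.columns.tolist())
```

Note that the scikit-learn column names (for example `sepal length (cm)`) differ from seaborn's (`sepal_length`), so the rest of the notebook would need renamed columns to run against `iris_df`.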


# **Exploratory Data Analysis**

In [28]:
# Displaying dataset information
print("Displaying iris data information")
iris_data.info()

Displaying iris data information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


**Checking the basic statistics of the dataset (mean, min, max, etc.)**

In [29]:
# Checking the basic statistics of the dataset
iris_data.describe()

|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


**Identifying categorical and numerical columns**

In [30]:
# Identifying categorical and numerical columns
categorical_cols = iris_data.select_dtypes(include=['object']).columns
numerical_cols = iris_data.select_dtypes(include=['int64', 'float64']).columns
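
Selecting by `object` and by `int64`/`float64` works for this dataset, but string columns are not guaranteed to be stored as `object` (pandas also offers a dedicated `string` dtype), and numeric columns may use other widths such as `int32`. A hedged alternative is to select on the generic `"number"` kind. The three-row `demo` frame below is a hypothetical stand-in for `iris_data`.

```python
import pandas as pd

# Hypothetical three-row stand-in for iris_data
demo = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_width": [0.2, 1.4, 2.5],
    "species": ["setosa", "versicolor", "virginica"],
})

# "number" matches every numeric dtype, regardless of width
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
# Everything that is not numeric is treated as categorical here
non_numeric_cols = demo.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```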

## **Categorical Features**

In [31]:
# List of categorical features
categorical_features = iris_data.select_dtypes(include=['object']).columns

# Displaying the categorical features
print("Categorical Features:")
for feature in categorical_features:
    print(f"- {feature}")

# Display data type of the columns
print("\nData Type of Categorical Features:")
print(iris_data[categorical_features].dtypes)

Categorical Features:
- species

Data Type of Categorical Features:
species    object
dtype: object


**Unique values and their counts relevant to each categorical column**

In [32]:
# Displaying the unique values and their counts relevant to each categorical column
print("Unique values and their count relevant to each categorical column:\n")
for col in categorical_features:
    unique_values = iris_data[col].unique()
    value_counts = iris_data[col].value_counts()
    print(value_counts)
    print(" ")

Unique values and their count relevant to each categorical column:

species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
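
The counts above show 50 samples per species, so the classes are perfectly balanced. A small sketch of how that balance could be checked programmatically, using a hypothetical `species` Series that mirrors those counts and an arbitrary tolerance of 0.05 chosen only for illustration:

```python
import pandas as pd

# Hypothetical Series mirroring the counts printed above (50 per species)
species = pd.Series(
    ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50,
    name="species",
)

# normalize=True returns class proportions instead of raw counts
proportions = species.value_counts(normalize=True)

# Arbitrary illustrative tolerance: largest and smallest class within 5 points
is_balanced = (proportions.max() - proportions.min()) < 0.05

print(proportions)
print(is_balanced)
```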
 


In [37]:
# Displaying the categorical columns which contain null values and their counts
found_nulls = False
for col in categorical_features:
    null_count = iris_data[col].isnull().sum()
    if null_count > 0:
        print(f"{col}: {null_count}")
        found_nulls = True

if not found_nulls:
    print("There are no null values in the categorical columns")

There are no null values in the categorical columns


In [39]:
# Displaying the categorical columns which contain 'Unknown' or 'N/A' values and their relevant counts
found_unknown_na = False

for col in categorical_features:
    unknown_count = (iris_data[col] == 'Unknown').sum()
    na_count = (iris_data[col] == 'N/A').sum()

    if unknown_count > 0 or na_count > 0:
        found_unknown_na = True
        if unknown_count > 0:
            print(f"{col} - 'Unknown': {unknown_count}")
        if na_count > 0:
            print(f"{col} - 'N/A': {na_count}")

if not found_unknown_na:
    print("There are no values with 'Unknown' or 'N/A' in the categorical columns")

There are no values with 'Unknown' or 'N/A' in the categorical columns
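
The placeholder check above compares each column against two literal strings. As a sketch, the same idea can be written with `isin`, which makes it easy to extend the list of placeholders. The `demo` frame and the `placeholders` list are hypothetical, and the match is exact and case-sensitive.

```python
import pandas as pd

# Hypothetical placeholder strings to look for (exact, case-sensitive match)
placeholders = ["Unknown", "N/A"]

# Hypothetical column containing some placeholder entries
demo = pd.DataFrame({"species": ["setosa", "Unknown", "N/A", "virginica", "Unknown"]})

# Keep only the placeholder rows, then count each placeholder
mask = demo["species"].isin(placeholders)
placeholder_counts = demo.loc[mask, "species"].value_counts()

print(placeholder_counts)
```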


## **Numerical Features**

In [40]:
# Numerical Features
numerical_features = iris_data.select_dtypes(include=['int64', 'float64']).columns

# Displaying the Numerical Columns
print("Numerical Features:")
print(numerical_features)

Numerical Features:
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], dtype='object')


**Unique values and their count relevant to each numerical column**

In [41]:
# Displaying the unique values and their count in the numerical columns
print("Unique values and their count in the numerical columns:\n")
for col in numerical_features:
    unique_values = iris_data[col].unique()
    value_counts = iris_data[col].value_counts()
    print(value_counts)
    print(" ")

Unique values and their count in the numerical columns:

sepal_length
5.0    10
6.3     9
5.1     9
6.7     8
5.7     8
6.4     7
5.5     7
5.8     7
4.9     6
6.0     6
5.4     6
5.6     6
6.1     6
6.5     5
4.8     5
7.7     4
6.9     4
4.6     4
5.2     4
6.2     4
4.4     3
7.2     3
5.9     3
6.8     3
4.7     2
6.6     2
4.3     1
7.0     1
5.3     1
4.5     1
7.1     1
7.3     1
7.6     1
7.4     1
7.9     1
Name: count, dtype: int64
 
sepal_width
3.0    26
2.8    14
3.2    13
3.4    12
3.1    11
2.9    10
2.7     9
2.5     8
3.3     6
3.5     6
3.8     6
2.6     5
3.6     4
2.3     4
3.7     3
2.2     3
2.4     3
3.9     2
4.4     1
4.2     1
4.1     1
4.0     1
2.0     1
Name: count, dtype: int64
 
petal_length
1.4    13
1.5    13
4.5     8
5.1     8
1.3     7
1.6     7
5.6     6
4.9     5
4.0     5
4.7     5
1.7     4
4.8     4
5.0     4
4.4     4
4.2     4
4.1     3
3.9     3
5.8     3
5.7     3
5.5     3
6.1     3
4.6     3
1.9     2
5.2     2
5.4     2
1.2     2
3.3     2

**Numerical columns with null values and their relevant counts**

In [43]:
# Displaying the numerical columns with null values and their relevant counts
found_nulls = False

for col in numerical_cols:
    null_count = iris_data[col].isnull().sum()
    if null_count > 0:
        print(f"{col}: {null_count}")
        found_nulls = True

if not found_nulls:
    print("There are no null values in the Numerical Columns")

There are no null values in the Numerical Columns
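
The categorical and numerical null checks use the same loop. A compact sketch that covers every column in one pass, using a hypothetical `demo` frame with deliberately missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
demo = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 6.3],
    "species": ["setosa", None, "virginica"],
})

# isnull().sum() gives the null count for every column at once
null_counts = demo.isnull().sum()

# Keep only the columns that actually contain nulls
cols_with_nulls = null_counts[null_counts > 0]

if cols_with_nulls.empty:
    print("There are no null values in the dataset")
else:
    print(cols_with_nulls)
```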


# **Data Preprocessing**

## **Data Cleaning**

In [44]:
# Step 1: Removing leading and trailing whitespace from the object-type columns
object_columns = iris_data.select_dtypes(include=['object']).columns
iris_data[object_columns] = iris_data[object_columns].apply(lambda x: x.str.strip())
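
Stripping matters because labels that differ only by surrounding spaces are treated as distinct categories. A short sketch on a hypothetical `demo` frame illustrates the effect:

```python
import pandas as pd

# Hypothetical labels with stray leading/trailing spaces
demo = pd.DataFrame({"species": ["setosa", " setosa", "setosa ", "virginica"]})

# Before stripping, each spelling counts as its own category
unique_before = demo["species"].nunique()

# str.strip() removes leading and trailing whitespace only
demo["species"] = demo["species"].str.strip()
unique_after = demo["species"].nunique()

print(unique_before, unique_after)
```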

In [45]:
# Step 2: Dropping duplicate rows and resetting the index
iris_data.drop_duplicates(inplace=True)
iris_data.reset_index(drop=True, inplace=True)

In [46]:
# Display dataset information
print("\nDataset information after removing duplicates:")
iris_data.info()


Dataset information after removing duplicates:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  149 non-null    float64
 1   sepal_width   149 non-null    float64
 2   petal_length  149 non-null    float64
 3   petal_width   149 non-null    float64
 4   species       149 non-null    object 
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
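
The row count drops from 150 to 149, so one row was an exact copy of another. Before dropping, it can be useful to look at the duplicated rows themselves. A minimal sketch on a hypothetical `demo` frame:

```python
import pandas as pd

# Hypothetical frame in which the first two rows are identical
demo = pd.DataFrame({
    "sepal_length": [5.8, 5.8, 4.9],
    "sepal_width": [2.7, 2.7, 3.0],
    "species": ["virginica", "virginica", "setosa"],
})

# keep=False flags every copy of a duplicated row, not just the later ones
dupes = demo[demo.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence by default
deduped = demo.drop_duplicates().reset_index(drop=True)

print(dupes)
print(len(demo), "->", len(deduped))
```

Whether duplicates should be removed at all is a judgment call: two different flowers can legitimately share the same measurements.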
