# Dataset Information

The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
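The separability claim can be checked for one feature with a short sketch. It assumes scikit-learn's bundled copy of the iris data (`load_iris`) rather than the `Iris.csv` file used in this notebook, and the 2.5 cm threshold is an illustrative value chosen by eye, not something taken from the dataset description.

```python
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]   # third column: petal length in cm
is_setosa = iris.target == 0     # scikit-learn encodes setosa as class 0

threshold = 2.5                  # illustrative cut-off in cm (an assumption)
setosa_max = petal_length[is_setosa].max()
others_min = petal_length[~is_setosa].min()

print("largest setosa petal length:", setosa_max)
print("smallest non-setosa petal length:", others_min)

# True when every setosa sample lies below the threshold and every
# versicolor/virginica sample lies above it.
separable = setosa_max < threshold < others_min
print("single threshold separates setosa:", separable)
```

A single threshold on one feature is the simplest kind of linear boundary, which is why setosa is easy to classify while the other two classes overlap.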

Attribute Information:

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
   - Iris Setosa
   - Iris Versicolor
   - Iris Virginica

# Import modules

In [7]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Loading the dataset

In [8]:
df = pd.read_csv('Iris.csv')
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'Iris.csv'
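If `Iris.csv` is not in the working directory, one possible workaround is to fall back to the copy of the data that ships with scikit-learn. The sketch below is an assumption-laden stand-in, not the notebook's original loading step: the helper name `load_iris_frame` is hypothetical, and the column renaming simply mimics the column names used in the later cells (`Id`, `SepalLengthCm`, ..., `Species`).

```python
import os

import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame(path="Iris.csv"):
    """Read the CSV if it exists, otherwise rebuild a similar frame (hypothetical helper)."""
    if os.path.exists(path):
        return pd.read_csv(path)

    bunch = load_iris(as_frame=True)
    frame = bunch.frame.rename(columns={
        "sepal length (cm)": "SepalLengthCm",
        "sepal width (cm)": "SepalWidthCm",
        "petal length (cm)": "PetalLengthCm",
        "petal width (cm)": "PetalWidthCm",
    })
    # Replace the integer target with 'Iris-<name>' strings, as used in later cells.
    frame["Species"] = ["Iris-" + bunch.target_names[t] for t in frame.pop("target")]
    # Add a 1-based Id column so the later drop(columns=['Id']) still works.
    frame.insert(0, "Id", range(1, len(frame) + 1))
    return frame


df = load_iris_frame()
print(df.head())
```

The values in scikit-learn's copy may differ slightly from a given `Iris.csv`, so results are not guaranteed to match the original file exactly.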

In [None]:
# drop the Id column, which is only a row identifier
df = df.drop(columns = ['Id'])
df.head()

In [None]:
# to display stats about data
df.describe()

In [None]:
# display basic info about the column datatypes
df.info()

In [None]:
# display the number of samples in each class
df['Species'].value_counts()

# Preprocessing the dataset

In [None]:
# check for null values
df.isnull().sum()

# Exploratory Data Analysis

In [None]:
# histograms
df['SepalLengthCm'].hist()

In [None]:
df['SepalWidthCm'].hist()

In [None]:
df['PetalLengthCm'].hist()

In [None]:
df['PetalWidthCm'].hist()

In [None]:
# scatterplot
colors = ['red', 'orange', 'blue']
species = ['Iris-virginica','Iris-versicolor','Iris-setosa']

In [None]:
for i in range(3):
    x = df[df['Species'] == species[i]]
    plt.scatter(x['SepalLengthCm'], x['SepalWidthCm'], c = colors[i], label=species[i])
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.legend()

In [None]:
for i in range(3):
    x = df[df['Species'] == species[i]]
    plt.scatter(x['PetalLengthCm'], x['PetalWidthCm'], c = colors[i], label=species[i])
plt.xlabel("Petal Length")
plt.ylabel("Petal Width")
plt.legend()

In [None]:
for i in range(3):
    x = df[df['Species'] == species[i]]
    plt.scatter(x['SepalLengthCm'], x['PetalLengthCm'], c = colors[i], label=species[i])
plt.xlabel("Sepal Length")
plt.ylabel("Petal Length")
plt.legend()

In [None]:
for i in range(3):
    x = df[df['Species'] == species[i]]
    plt.scatter(x['SepalWidthCm'], x['PetalWidthCm'], c = colors[i], label=species[i])
plt.xlabel("Sepal Width")
plt.ylabel("Petal Width")
plt.legend()

# Correlation Matrix

A correlation matrix is a table of correlation coefficients between variables. Each cell shows the correlation between two variables, with values ranging from -1 to 1. If two variables are highly correlated, they carry largely redundant information, so one of them can often be dropped.
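To make the coefficient concrete, here is a minimal sketch of the Pearson correlation computed by hand and compared against NumPy. The helper name `pearson` is hypothetical, and the six measurement pairs are made-up toy values for illustration, not rows from the dataset.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Toy values (assumed for illustration only).
petal_length = [1.4, 1.5, 4.5, 4.7, 5.9, 6.1]
petal_width = [0.2, 0.3, 1.5, 1.4, 2.1, 2.3]

r = pearson(petal_length, petal_width)
print("hand-computed r:", round(r, 3))
print("matches np.corrcoef:", np.isclose(r, np.corrcoef(petal_length, petal_width)[0, 1]))
```

Each off-diagonal cell of the correlation matrix holds one such coefficient for a pair of columns, and the diagonal is always 1 because every variable is perfectly correlated with itself.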

In [None]:
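# Species is still a text column here, so leave it out of the correlation
df.drop(columns=['Species']).corr()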

In [None]:
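# correlate only the numeric feature columns (Species is still text here)
corr = df.drop(columns=['Species']).corr()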
fig, ax = plt.subplots(figsize=(5,4))
sns.heatmap(corr, annot=True, ax=ax, cmap = 'coolwarm')

# Label Encoder

In machine learning, datasets often contain categorical labels in one or more columns. These labels can be words or numbers. Label encoding converts each label into an integer so that the column is in a machine-readable form.
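The idea can be sketched without any library: collect the distinct labels, sort them, and number them from zero. This is an illustrative equivalent rather than scikit-learn's actual implementation, although `LabelEncoder` also assigns codes in sorted order of the class names. The four-item `labels` list is a toy input.

```python
# Toy input (assumed for illustration).
labels = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]

# Sorted distinct labels define the code for each class.
classes = sorted(set(labels))
to_code = {name: code for code, name in enumerate(classes)}

encoded = [to_code[name] for name in labels]
decoded = [classes[code] for code in encoded]  # inverse mapping

print(to_code)
print(encoded)
print(decoded == labels)
```

The integer codes are arbitrary identifiers. Their order does not imply that one species is "greater" than another.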

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [None]:
df['Species'] = le.fit_transform(df['Species'])
df.head()

# Model Training

In [None]:
from sklearn.model_selection import train_test_split
# split the data: 70% for training, 30% for testing
X = df.drop(columns=['Species'])
Y = df['Species']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.30)

In [None]:
# logistic regression 
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

In [None]:
# model training
model.fit(x_train, y_train)

In [None]:
# report accuracy on the test set, in percent
print("Accuracy: ",model.score(x_test, y_test) * 100)

In [None]:
# knn - k-nearest neighbours
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()

In [None]:
model.fit(x_train, y_train)

In [None]:
# report accuracy on the test set, in percent
print("Accuracy: ",model.score(x_test, y_test) * 100)

In [None]:
# decision tree
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

In [None]:
model.fit(x_train, y_train)

In [None]:
# report accuracy on the test set, in percent
print("Accuracy: ",model.score(x_test, y_test) * 100)
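The three fit-and-score steps above follow the same pattern, so they can be sketched as a single loop. This version makes several assumptions that are not in the cells above: it uses scikit-learn's bundled data instead of `Iris.csv`, fixes `random_state=0` and stratifies the split so repeated runs give the same numbers, and raises `max_iter` for logistic regression to avoid convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Same 70/30 split as above, but seeded and stratified (assumptions).
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbours": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, clf in models.items():
    clf.fit(x_train, y_train)                   # train on the 70% split
    results[name] = clf.score(x_test, y_test)   # accuracy on the held-out 30%
    print(f"{name}: {results[name] * 100:.2f}")
```

Evaluating every model on the same seeded split makes the accuracies directly comparable, whereas an unseeded split produces a different test set on each run.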