## Loading/Exploring the data

Load the iris.csv file from this repo into a pandas dataframe. Take a minute to familiarize yourself with the data.

## Import libraries

Import the `pandas` library as `pd`

In [1]:
import pandas as pd

Read the `../data/iris.csv` dataset into an object named `iris`

In [2]:
iris = pd.read_csv('/content/iris-write-from-docker.csv')
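
The path passed to `pd.read_csv` depends on where the notebook runs. As a minimal sketch, the same call can be tried on a few rows of in-memory CSV text. The `csv_text` sample and the `sample` name are illustrative assumptions, not files or objects from this repo.

```python
import io

import pandas as pd

# Three hand-typed rows in the same column layout as the iris file.
# This text stands in for a file on disk (illustrative assumption).
csv_text = """sepal_length,sepal_width,petal_length,petal_width,class
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # four numeric columns and one text column
```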

In [3]:
iris.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


In [4]:
iris.tail(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
140,6.7,3.1,5.6,2.4,Iris-virginica
141,6.9,3.1,5.1,2.3,Iris-virginica
142,5.8,2.7,5.1,1.9,Iris-virginica
143,6.8,3.2,5.9,2.3,Iris-virginica
144,6.7,3.3,5.7,2.5,Iris-virginica
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica


In [5]:
iris.shape

(150, 5)

How many different species are in this dataset?

In [6]:
iris['class'].nunique()

3

What are their names?

In [7]:
iris['class'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

How many samples are there per species?

<details><summary>Hint</summary>Use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html"><code>.value_counts()</code></a> method</details>

In [8]:
iris['class'].value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: class, dtype: int64
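
The three calls above answer related questions about one column. A small sketch on a hand-made Series (the `labels` data is an illustrative assumption) shows them side by side, plus `normalize=True`, which reports shares instead of counts.

```python
import pandas as pd

# A tiny hand-made label column (illustrative, not the real dataset).
labels = pd.Series(
    ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    name="class",
)

n_species = labels.nunique()                  # number of distinct values
species = labels.unique()                     # distinct values, in order of appearance
counts = labels.value_counts()                # rows per distinct value
shares = labels.value_counts(normalize=True)  # same, as fractions of the total

print(n_species)
print(list(species))
print(counts.to_dict())
print(shares.to_dict())
```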

## Feature Engineering

Create a new column called `'sepal_ratio'` which is equal to sepal width / sepal length

In [9]:
iris['sepal_ratio'] = iris['sepal_width'] / iris['sepal_length']  

In [10]:
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class,sepal_ratio
0,5.1,3.5,1.4,0.2,Iris-setosa,0.686275
1,4.9,3.0,1.4,0.2,Iris-setosa,0.612245
2,4.7,3.2,1.3,0.2,Iris-setosa,0.680851
3,4.6,3.1,1.5,0.2,Iris-setosa,0.673913
4,5.0,3.6,1.4,0.2,Iris-setosa,0.72


Create a similar column called `'petal_ratio'`: petal width / petal length

In [11]:
iris['petal_ratio'] = iris['petal_width'] / iris['petal_length']

In [12]:
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class,sepal_ratio,petal_ratio
0,5.1,3.5,1.4,0.2,Iris-setosa,0.686275,0.142857
1,4.9,3.0,1.4,0.2,Iris-setosa,0.612245,0.142857
2,4.7,3.2,1.3,0.2,Iris-setosa,0.680851,0.153846
3,4.6,3.1,1.5,0.2,Iris-setosa,0.673913,0.133333
4,5.0,3.6,1.4,0.2,Iris-setosa,0.72,0.142857
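
Both ratio columns can also be built in one step with `DataFrame.assign`, which returns a new frame instead of modifying the original. This is a sketch of one possible alternative on two hand-typed rows; the `df` and `with_ratios` names are assumptions made for the example.

```python
import pandas as pd

# Two hand-typed rows in the iris column layout (illustrative sample).
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9],
        "sepal_width": [3.5, 3.0],
        "petal_length": [1.4, 1.4],
        "petal_width": [0.2, 0.2],
    }
)

# assign returns a new DataFrame and leaves df unchanged.
with_ratios = df.assign(
    sepal_ratio=df["sepal_width"] / df["sepal_length"],
    petal_ratio=df["petal_width"] / df["petal_length"],
)

print(with_ratios.round(6))
```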


Create 4 new columns that hold `sepal_length`, `sepal_width`, `petal_length`, and `petal_width` converted from centimeters to inches.

In [13]:
iris['sepal length (inches)'] = iris['sepal_length'] * 0.393701
iris['petal length (inches)'] = iris['petal_length'] * 0.393701
iris['sepal width (inches)'] = iris['sepal_width'] * 0.393701
iris['petal width (inches)'] = iris['petal_width'] * 0.393701
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class,sepal_ratio,petal_ratio,sepal length (inches),petal length (inches),sepal width (inches),petal width (inches)
0,5.1,3.5,1.4,0.2,Iris-setosa,0.686275,0.142857,2.007875,0.551181,1.377954,0.07874
1,4.9,3.0,1.4,0.2,Iris-setosa,0.612245,0.142857,1.929135,0.551181,1.181103,0.07874
2,4.7,3.2,1.3,0.2,Iris-setosa,0.680851,0.153846,1.850395,0.511811,1.259843,0.07874
3,4.6,3.1,1.5,0.2,Iris-setosa,0.673913,0.133333,1.811025,0.590552,1.220473,0.07874
4,5.0,3.6,1.4,0.2,Iris-setosa,0.72,0.142857,1.968505,0.551181,1.417324,0.07874
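
The four assignments repeat the same conversion, so a loop with a named constant is one way to shorten them. The sketch below divides by 2.54 (the exact number of centimeters in an inch), which matches multiplying by 0.393701 up to rounding. The one-row `df` is an illustrative assumption.

```python
import pandas as pd

CM_PER_INCH = 2.54  # exact by definition of the inch

# One hand-typed row with two of the four measurement columns.
df = pd.DataFrame({"sepal_length": [5.1], "petal_width": [0.2]})

# Derive an inch column for each centimeter column in one pass.
for col in ["sepal_length", "petal_width"]:
    label = col.replace("_", " ")
    df[f"{label} (inches)"] = df[col] / CM_PER_INCH

print(df.round(6))
```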


## Apply

Create a column called `'encoded_species'`:
- 0 for setosa
- 1 for versicolor
- 2 for virginica


<details><summary>Hint 1</summary>Create a dictionary using the species as keys and the numbers 0-2 as values</details>

<details><summary>Hint 2</summary>Use the dictionary in hint 1 with the <code>.map()</code> method to create the new column</details>


In [14]:
species_dict = {
    'Iris-setosa': 0,
    'Iris-versicolor': 1,
    'Iris-virginica': 2
}
iris['encoded_species'] = iris['class'].map(species_dict)
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class,sepal_ratio,petal_ratio,sepal length (inches),petal length (inches),sepal width (inches),petal width (inches),encoded_species
0,5.1,3.5,1.4,0.2,Iris-setosa,0.686275,0.142857,2.007875,0.551181,1.377954,0.07874,0
1,4.9,3.0,1.4,0.2,Iris-setosa,0.612245,0.142857,1.929135,0.551181,1.181103,0.07874,0
2,4.7,3.2,1.3,0.2,Iris-setosa,0.680851,0.153846,1.850395,0.511811,1.259843,0.07874,0
3,4.6,3.1,1.5,0.2,Iris-setosa,0.673913,0.133333,1.811025,0.590552,1.220473,0.07874,0
4,5.0,3.6,1.4,0.2,Iris-setosa,0.72,0.142857,1.968505,0.551181,1.417324,0.07874,0


In [15]:
iris['encoded_species'].unique()

array([0, 1, 2])
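
One thing worth knowing about `.map()` with a dictionary: any label missing from the dictionary becomes `NaN` rather than raising an error. The sketch below shows this on a deliberately misspelled label and, as an alternative, `pd.factorize`, which assigns codes by order of first appearance. The `labels` data is an illustrative assumption.

```python
import pandas as pd

species_codes = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

# The last label is deliberately misspelled to show what map does with it.
labels = pd.Series(["Iris-setosa", "Iris-virginica", "Iris-versicolour"])

encoded = labels.map(species_codes)
print(encoded.tolist())      # the unmatched label becomes NaN
print(encoded.isna().sum())  # count of labels missing from the dictionary

# Alternative: let pandas assign codes by order of first appearance.
codes, uniques = pd.factorize(labels)
print(codes.tolist(), list(uniques))
```

Checking `isna().sum()` after mapping is a cheap way to catch typos in the labels or the dictionary.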

# Separate the inputs (sepal_length, sepal_width, petal_length, petal_width) from the target

In [16]:
X = iris[['sepal_length','sepal_width','petal_length','petal_width']]
y = iris['encoded_species']

# Split the data 80/20 into train and test sets

In [17]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
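
A plain random split does not guarantee equal class counts in the test set; the confusion matrices later in this notebook show 11, 13, and 6 test rows per class. As a hedged sketch, passing `stratify=y` keeps the class proportions the same in both parts. It uses scikit-learn's bundled `load_iris` as a stand-in for the CSV, which is an assumption made to keep the example self-contained.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# load_iris ships with scikit-learn; it stands in for the CSV file here.
X, y = load_iris(return_X_y=True)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(len(x_train), len(x_test))  # 120 training rows, 30 test rows
print(Counter(y_test.tolist()))   # 10 test rows per class when stratified
```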

# Apply a logistic regression model

In [18]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()
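
The warning printed above means the solver hit its iteration limit before converging. It suggests two remedies: raise `max_iter` or scale the features. The sketch below applies both in a pipeline. It uses the bundled `load_iris` data in place of the CSV, and `max_iter=1000` is an arbitrary value chosen for the example, not a recommended setting.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale features to zero mean and unit variance, then fit the classifier.
# max_iter=1000 is an arbitrary, generous cap chosen for this sketch.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(x_train, y_train)

n_iter = model.named_steps["logisticregression"].n_iter_
print(n_iter)                       # iterations the solver actually used
print(model.score(x_test, y_test))  # accuracy on the held-out rows
```

Putting the scaler inside the pipeline means it is fitted on the training rows only and then reused unchanged on the test rows.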

# Calculate the predictions

In [19]:
y_pred = lr.predict(x_test)

# Evaluate the model using a confusion matrix and calculate accuracy

In [20]:
from sklearn.metrics import confusion_matrix, accuracy_score
# scikit-learn expects (y_true, y_pred): rows are true classes, columns are predictions
cm = confusion_matrix(y_test, y_pred)
score = accuracy_score(y_test, y_pred)
print(cm, score)

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]] 1.0
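
scikit-learn's signature is `confusion_matrix(y_true, y_pred)`: rows index the true class and columns index the predicted class. A small sketch on hand-made labels (the `y_true` and `y_hat` lists are illustrative assumptions) shows how the matrix is laid out and how accuracy follows from its diagonal.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels (illustrative): one class-1 sample is predicted as class 2.
y_true = [0, 0, 1, 1, 1, 2]
y_hat = [0, 0, 1, 2, 1, 2]

cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class
print(cm)

# Correct predictions lie on the diagonal, so accuracy is trace / total.
accuracy = np.trace(cm) / cm.sum()
print(accuracy, accuracy_score(y_true, y_hat))
```

Swapping the two arguments transposes the matrix, so the off-diagonal errors land in the wrong cells, while accuracy stays the same.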


# Apply kNN

In [21]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(x_train,y_train)

KNeighborsClassifier()
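
`KNeighborsClassifier()` uses `n_neighbors=5` by default. As a rough sketch of how one might compare a few values of k, the block below runs 5-fold cross-validation on the bundled `load_iris` data (a stand-in for the CSV). The candidate values and the `mean_scores` name are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for a few candidate values of k.
mean_scores = {}
for k in (1, 3, 5, 7, 9):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    mean_scores[k] = cross_val_score(knn_k, X, y, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print(best_k)
```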

# Calculate the predictions

In [22]:
y_pred_knn = knn.predict(x_test)

# Evaluate the model using a confusion matrix and calculate accuracy

In [23]:
# scikit-learn expects (y_true, y_pred): rows are true classes, columns are predictions
cm = confusion_matrix(y_test, y_pred_knn)
score = accuracy_score(y_test, y_pred_knn)
print(cm, score)

[[11  0  0]
 [ 0 12  1]
 [ 0  0  6]] 0.9666666666666667
