# Classification Project on Types of Iris

In [1]:
import pandas as pd

## 1. Data Source

[Data source link](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/)

## 2. Import Data

In [2]:
data_folder = 'data/'

In [3]:
data_filename = 'iris.data'

In [4]:
df = pd.read_csv(data_folder + data_filename, header=None)

In [5]:
df.head(3)

Unnamed: 0,0,1,2,3,4
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
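
Before relabeling the columns, it can help to confirm the shape of the loaded frame and check for missing values. The following is a minimal sketch, not part of the original notebook: it uses the three rows shown above as an in-memory stand-in for `data/iris.data`, and the `sample` string and `summarize` helper are hypothetical names introduced only for illustration.

```python
import io

import pandas as pd

# Three rows copied from the df.head(3) output above; a stand-in for data/iris.data.
sample = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)


def summarize(frame):
    """Return basic sanity-check facts about a loaded frame (hypothetical helper)."""
    return {
        "n_rows": frame.shape[0],
        "n_cols": frame.shape[1],
        "n_missing": int(frame.isna().sum().sum()),
    }


sample_df = pd.read_csv(io.StringIO(sample), header=None)
summary = summarize(sample_df)
print(summary)
```

Applied to the full `df`, the same helper would be expected to report five columns and one row per flower in the file.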


## 3. Relabel Data Columns

Attribute Information:
   1. sepal length in cm
   2. sepal width in cm
   3. petal length in cm
   4. petal width in cm
   5. class: 
      -- Iris Setosa
      -- Iris Versicolor
      -- Iris Virginica

In [6]:
columns_names = [
    'sepal_length', 'sepal_width',
    'petal_length', 'petal width',
    'class',
]

In [7]:
df.columns = columns_names

In [8]:
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
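
As an alternative to assigning `df.columns` after loading, `pd.read_csv` accepts a `names` argument that labels the columns at read time. The sketch below assumes a two-row in-memory sample instead of the real file, and it spells the fourth name `petal_width` with an underscore, which is an assumption made here for consistency with the other names and differs from the list used in the cell above.

```python
import io

import pandas as pd

# Assumed variant of the column list: every name uses underscores.
column_names = [
    "sepal_length", "sepal_width",
    "petal_length", "petal_width",
    "class",
]

# Two rows copied from the output above; a stand-in for data/iris.data.
sample = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "6.7,3.0,5.2,2.3,Iris-virginica\n"
)

# names= labels the columns while reading; header=None marks the file as header-less.
named_df = pd.read_csv(io.StringIO(sample), header=None, names=column_names)
print(named_df.columns.tolist())
```

Names without spaces can also be used with attribute access and `DataFrame.query`, which is one reason to prefer underscores.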


## 4. Recode Target Variable ("class")

##### Learning all possible values of the 'class' variable:

In [9]:
df['class'].value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: class, dtype: int64

##### Writing a function to recode the "class" variable:

In [10]:
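def recode_target_variable(i):
    """Map an iris class label to an integer code (1, 2, or 3)."""
    if i == 'Iris-setosa':
        return 1
    elif i == 'Iris-versicolor':
        return 2
    elif i == 'Iris-virginica':
        return 3
    # Any other label falls through and returns None.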

##### Applying the function to store the recoded target in a new column:

In [11]:
df['class_numerical'] = df['class'].apply(recode_target_variable)

##### Double-checking recoding results:

In [12]:
df['class_numerical'].value_counts()

1    50
2    50
3    50
Name: class_numerical, dtype: int64

In [13]:
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal width,class,class_numerical
0,5.1,3.5,1.4,0.2,Iris-setosa,1
1,4.9,3.0,1.4,0.2,Iris-setosa,1
2,4.7,3.2,1.3,0.2,Iris-setosa,1
3,4.6,3.1,1.5,0.2,Iris-setosa,1
4,5.0,3.6,1.4,0.2,Iris-setosa,1
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica,3
146,6.3,2.5,5.0,1.9,Iris-virginica,3
147,6.5,3.0,5.2,2.0,Iris-virginica,3
148,6.2,3.4,5.4,2.3,Iris-virginica,3
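
The recoding step could also be written with a dictionary and `Series.map`, which makes it easy to fail loudly if an unexpected label appears instead of silently producing a missing value. This is a sketch under that assumption; `CLASS_CODES` and `recode_with_map` are hypothetical names, and the codes match the ones assigned by `recode_target_variable`.

```python
import pandas as pd

# Same label-to-code assignment as recode_target_variable.
CLASS_CODES = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}


def recode_with_map(labels):
    """Map class labels to integer codes, raising on unknown labels (hypothetical helper)."""
    codes = labels.map(CLASS_CODES)
    # Labels missing from CLASS_CODES come back as NaN after map().
    unknown = labels[codes.isna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unmapped class labels: {list(unknown)}")
    return codes.astype(int)


labels = pd.Series(["Iris-setosa", "Iris-virginica", "Iris-versicolor"], name="class")
codes = recode_with_map(labels)
print(codes.tolist())
```

On the notebook's frame, this would be called as `recode_with_map(df['class'])`.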


## 5. Split Data into Training and Test Datasets

In [14]:
import numpy as np

##### Creating X as the feature matrix (the four measurement columns)

In [15]:
X = np.array(df.iloc[:, :4])

##### Creating y as the target vector

In [16]:
y = df['class_numerical'].to_numpy()

##### Splitting the feature matrix and target vector into training and test datasets

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, shuffle=True
)
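
A plain shuffled split does not guarantee that each class is equally represented in the test set. One option, shown here only as a sketch, is the `stratify` argument of `train_test_split`. The arrays `X_demo` and `y_demo` are synthetic stand-ins (random features, 50 rows per class coded 1, 2, 3) so the example runs without the data file; they are not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class balance as the notebook's data:
# 150 rows, 4 features, 50 rows per class coded 1, 2, 3.
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(150, 4))
y_demo = np.repeat([1, 2, 3], 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, shuffle=True, stratify=y_demo
)

# Count how many test rows fall in each class (index 0 is unused, so drop it).
test_counts = np.bincount(y_te)[1:]
print(test_counts)
```

With balanced classes and a 20% test size, stratification keeps the class proportions of the test set equal to those of the full data.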

## 6. Train Model Using Training Dataset

##### Using logistic regression

In [19]:
from sklearn.linear_model import LogisticRegression

In [20]:
logit = LogisticRegression(C=100, n_jobs=10, random_state=123, verbose=2)

##### Fitting the logistic regression model on the training dataset

In [21]:
import time

begin = time.time()

logit.fit(X_train, y_train)

end = time.time()

print('total time', round(end - begin, 5), 'sec')

[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.


RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =           15     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  1.31833D+02    |proj g|=  9.85333D+01

At iterate    1    f=  1.26026D+02    |proj g|=  1.04094D+02

At iterate    2    f=  1.16689D+02    |proj g|=  7.20801D+01

At iterate    3    f=  9.68441D+01    |proj g|=  9.93328D+01

At iterate    4    f=  7.94581D+01    |proj g|=  5.57300D+01

At iterate    5    f=  6.18048D+01    |proj g|=  3.26696D+01

At iterate    6    f=  5.00164D+01    |proj g|=  1.28294D+01

At iterate    7    f=  3.90757D+01    |proj g|=  4.03024D+01

At iterate    8    f=  2.78724D+01    |proj g|=  9.58477D+00

At iterate    9    f=  1.96851D+01    |proj g|=  2.88207D+00

At iterate   10    f=  1.43748D+01    |proj g|=  5.09786D+00

At iterate   11    f=  1.16492D+01    |proj g|=  2.03205D+00

At iterate   12    f=  1.04364D+01    |proj g|=  2.33303D+00

At iterate   13    f=  1.0

 This problem is unconstrained.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=10)]: Done   1 out of   1 | elapsed:    0.5s finished
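
The solver log above stops at the iteration limit, and the warning suggests raising `max_iter` or scaling the data. The sketch below shows one way to do both with a `StandardScaler` inside a pipeline. It is an illustration under assumptions, not the notebook's method: it uses scikit-learn's bundled `load_iris` data as a stand-in for `data/iris.data` (labels coded 0, 1, 2), and `max_iter=1000` is an arbitrary choice.

```python
import warnings

from sklearn.datasets import load_iris
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data stands in for data/iris.data
# (its labels are coded 0, 1, 2 rather than 1, 2, 3).
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=123, shuffle=True
)

# Standardize the features, then fit with a larger iteration budget.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=100, max_iter=1000, random_state=123),
)

# Record any warnings raised during fitting so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model.fit(X_tr, y_tr)

convergence_warnings = [
    w for w in caught if issubclass(w.category, ConvergenceWarning)
]
print("convergence warnings:", len(convergence_warnings))
print("iterations used:", model[-1].n_iter_)
```

Putting the scaler in the pipeline means the scaling parameters are learned from the training rows only and reused when the model is scored on the test rows.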


## 7. Testing Trained Logistic Regression Model on Test Dataset

##### Testing the trained logistic regression model on the test dataset

In [22]:
acc_score = logit.score(X_test, y_test)

##### ACCURACY RATE from testing the model on the test dataset

In [23]:
print(f'The accuracy rate of testing the trained model on the test dataset is {round(acc_score * 100, 1)}%.')

The accuracy rate of testing the trained model on the test dataset is 93.3%.
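
A single accuracy figure does not show which classes are confused with each other. A confusion matrix is one way to look at that. The sketch below is self-contained and makes several assumptions: it uses scikit-learn's bundled `load_iris` data (labels coded 0, 1, 2) in place of `data/iris.data`, and it refits a model with `max_iter=1000`, so its numbers are not claimed to match the accuracy reported above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Assumption: bundled iris data as a stand-in for the notebook's file.
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=123, shuffle=True
)

clf = LogisticRegression(C=100, max_iter=1000, random_state=123)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred, labels=[0, 1, 2])
acc = accuracy_score(y_te, y_pred)
print(cm)
print("accuracy:", round(acc, 3))
```

The diagonal of the matrix counts correct predictions, so its sum divided by the total number of test rows equals the accuracy returned by `score`.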
