# CLASSIFY IRIS FLOWERS
- Objective: To build a classification model to classify Iris flowers as either Setosa, Versicolor, or Virginica.
- Data: the Iris dataset bundled with scikit-learn (`sklearn.datasets.load_iris`).
- Features: sepal length, sepal width, petal length, and petal width.

### 1. Import required modules

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report

### 2. Import raw data

In [2]:
iris_data = load_iris()
print(iris_data['DESCR'])

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

### 3. Data preprocessing

#### - Assign X (features) and y (target) variables, then split into train and test sets

In [3]:
X = iris_data['data']
y = iris_data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
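
The split above is not stratified, so the test set ends up with unequal class counts (21, 30, and 24 samples, as the support column in the reports below shows). As a minimal sketch, assuming balanced class counts are wanted, passing `stratify=y` keeps the class proportions identical in both halves. The `stratify` argument and the `_s` variable names are additions for illustration and are not part of the original notebook.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_data = load_iris()
X, y = iris_data['data'], iris_data['target']

# stratify=y is an assumption added here; the original split does not use it.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

# Count how many samples of each class landed in the test half.
test_class_counts = Counter(y_test_s.tolist())
print(test_class_counts)
```

With 50 samples per class and a 50% test size, each class should contribute 25 samples to the test half.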

### 4. Model fitting and testing
- Scoring metrics:
    - Accuracy score
    - F1 score: the harmonic mean of precision and recall.
#### 4.1 Decision tree classifier

In [4]:
dtc = DecisionTreeClassifier()  # no random_state is set, so results can vary slightly between runs
dtc.fit(X_train, y_train)
dtc.score(X_test, y_test)

0.96

In [5]:
confusion_matrix(y_true=y_test, y_pred=dtc.predict(X_test))

array([[21,  0,  0],
       [ 0, 29,  1],
       [ 0,  2, 22]], dtype=int64)

In [6]:
print(classification_report(y_true=y_test, y_pred=dtc.predict(X_test)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        21
           1       0.94      0.97      0.95        30
           2       0.96      0.92      0.94        24

    accuracy                           0.96        75
   macro avg       0.96      0.96      0.96        75
weighted avg       0.96      0.96      0.96        75
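
The per-class figures in this report can be recomputed by hand from the confusion matrix printed in cell 5. The sketch below copies that matrix and derives precision, recall, F1, and accuracy with NumPy; the variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix copied from the decision tree output above
# (rows = true class, columns = predicted class).
cm = np.array([[21, 0, 0],
               [0, 29, 1],
               [0, 2, 22]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # column sums = samples predicted as each class
recall = true_positives / cm.sum(axis=1)     # row sums = samples truly in each class
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
accuracy = true_positives.sum() / cm.sum()

for label, p, r, f in zip(range(3), precision, recall, f1):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

For class 1, for example, precision is 29/31 and recall is 29/30, which gives an F1 of 58/61, or about 0.95.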



#### 4.2 Logistic regression with cross validation

In [7]:
lrcv = LogisticRegressionCV()
lrcv.fit(X_train, y_train)
lrcv.score(X_test, y_test)

0.9333333333333333

In [8]:
confusion_matrix(y_true=y_test, y_pred=lrcv.predict(X_test))

array([[21,  0,  0],
       [ 0, 29,  1],
       [ 0,  4, 20]], dtype=int64)

In [9]:
print(classification_report(y_true=y_test, y_pred=lrcv.predict(X_test)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        21
           1       0.88      0.97      0.92        30
           2       0.95      0.83      0.89        24

    accuracy                           0.93        75
   macro avg       0.94      0.93      0.94        75
weighted avg       0.94      0.93      0.93        75



#### 4.3 Support vector classifier

In [10]:
svc = SVC()
svc.fit(X_train, y_train)
svc.score(X_test, y_test)

0.9466666666666667

In [11]:
confusion_matrix(y_true=y_test, y_pred=svc.predict(X_test))

array([[21,  0,  0],
       [ 0, 29,  1],
       [ 0,  3, 21]], dtype=int64)

In [12]:
print(classification_report(y_true=y_test, y_pred=svc.predict(X_test)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        21
           1       0.91      0.97      0.94        30
           2       0.95      0.88      0.91        24

    accuracy                           0.95        75
   macro avg       0.95      0.95      0.95        75
weighted avg       0.95      0.95      0.95        75



Notes
- Based on accuracy, the decision tree performs best, with a score of 96%.
- Based on F1 score, the decision tree also performs best, with scores of 100%, 95%, and 94% for the classes Setosa, Versicolour, and Virginica respectively.
- Conclusion: select the decision tree.
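
The three models differ by only one or two test samples on this single 75-sample split, so the ranking may not be stable. As a hedged sketch, k-fold cross-validation on the full dataset gives a mean and spread for each model. Two assumptions are made here: plain `LogisticRegression` with `max_iter=1000` is swapped in for `LogisticRegressionCV` to avoid nesting one cross-validation inside another, and the decision tree is seeded with `random_state=0` for repeatability.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Model choices mirror the notebook; the seed and max_iter are assumptions.
models = {
    'decision tree': DecisionTreeClassifier(random_state=0),
    'logistic regression': LogisticRegression(max_iter=1000),
    'support vector': SVC(),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy; one score per fold.
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the means sit within one standard deviation of each other, the single-split ranking should be read as a tie rather than a clear win.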

### 5. Model Improvement

In [13]:
dtc.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'random_state': None,
 'splitter': 'best'}

In [14]:
param_distributions = {'criterion':['gini','entropy','log_loss'],
                       'splitter':['best','random'],
                       'min_samples_leaf':[1,2,3,4,5],
                       'min_samples_split':[2,3,4,5]}
rscv = RandomizedSearchCV(estimator=dtc,
                         param_distributions=param_distributions,
                         n_jobs=2)
rscv.fit(X_train, y_train)
print('Train score:',rscv.best_score_)  # best_score_ is the mean cross-validated score of the best candidate
print('Test score:',rscv.score(X_test, y_test))
print('Params:', rscv.best_params_)
print('\n', classification_report(y_true=y_test, y_pred=rscv.predict(X_test)))

Train score: 0.9600000000000002
Test score: 0.9333333333333333
Params: {'splitter': 'random', 'min_samples_split': 5, 'min_samples_leaf': 2, 'criterion': 'log_loss'}

               precision    recall  f1-score   support

           0       1.00      1.00      1.00        21
           1       0.88      0.97      0.92        30
           2       0.95      0.83      0.89        24

    accuracy                           0.93        75
   macro avg       0.94      0.93      0.94        75
weighted avg       0.94      0.93      0.93        75
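
Two details help when reading the search result above. The parameter grid has 3 × 2 × 5 × 4 = 120 combinations, but `RandomizedSearchCV` samples only `n_iter=10` of them by default, and neither the search nor the tree is seeded, so the selected parameters can change between runs. The sketch below is one way to make the search repeatable; the `random_state=0` values and the name `search` are assumptions added here, not settings from the original run.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

param_distributions = {'criterion': ['gini', 'entropy', 'log_loss'],
                       'splitter': ['best', 'random'],
                       'min_samples_leaf': [1, 2, 3, 4, 5],
                       'min_samples_split': [2, 3, 4, 5]}

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),  # seeded tree (assumption)
    param_distributions=param_distributions,
    n_iter=10,       # default: 10 sampled candidates out of 120
    cv=5,            # default: 5-fold cross-validation on the training half
    random_state=0,  # seeded sampling of candidates (assumption)
)
search.fit(X_train, y_train)

print('Mean CV accuracy of best candidate:', search.best_score_)
print('Held-out test accuracy:', search.score(X_test, y_test))
print('Candidates tried:', len(search.cv_results_['params']))
```

Raising `n_iter`, or switching to an exhaustive `GridSearchCV`, would cover more of the 120 combinations at the cost of more fits.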



#### Conclusion: stick with the original decision tree, since its test accuracy (96%) is higher than the tuned model's (93%).