# Question: 1 -- Classify Iris using Logistic Regression   
__Study the logistic regression model in scikit-learn and use it to classify the Iris data. List the Python and scikit-learn commands that you use to apply the logistic model to the Iris data.__

## 1.1 Importing libraries
__Sklearn:__ We use sklearn to import 'load_iris' (the Iris dataset loader), 'train_test_split', 'LogisticRegression' and 'accuracy_score'.  
__NumPy:__ We use its unique() function to find the distinct values in a list or NumPy array.


In [82]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np


## 1.2 Loading and Understanding Data   
We will use load_iris() from sklearn to load the dataset. __load_iris__ returns the data as a Bunch object (a custom sklearn object that extends the dictionary and allows values to be accessed by key or by attribute, e.g. bunch['value_key'] or bunch.value_key). The Bunch object returned by load_iris() has seven keys: data, target, frame, target_names, DESCR, feature_names and filename. The following code cell inspects the Bunch object returned by load_iris(). This answers some useful questions: What is the type of the data? What is the shape of the feature matrix? What is the shape of the target vector? How many features are there? How many samples are there?  
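
As a minimal sketch of the two access styles described above (the variable name bunch is only illustrative and is not used elsewhere in this notebook), key access and attribute access should return the same underlying array:

```python
from sklearn.datasets import load_iris

# `bunch` is an illustrative name, not one used elsewhere in this notebook
bunch = load_iris()

# Dictionary-style access and attribute-style access refer to the same array
by_key = bunch['data']
by_attribute = bunch.data
print(by_key is by_attribute)

# A Bunch is a dict subclass, so the usual dictionary methods are available
print(isinstance(bunch, dict))
print(sorted(bunch.keys()))
```

The exact set of keys can differ between scikit-learn versions, so the printed list may contain more entries than the seven listed above.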


In [83]:
# Loading the data using load_iris() imported from sklearn
iris_data = load_iris()

# Checking what is the type of iris_data
print('Type of iris_data: ', type(iris_data)) # Bunch object 

# Description of the data 
print('Description: ', iris_data.DESCR)
print()

# As a Bunch object inherits from dictionary, it must have a keys() method to retrieve its keys
print('iris_data keys: ', iris_data.keys())

# Checking dimension of feature matrix
print('Shape of feature matrix: ', iris_data.data.shape)

# Checking shape of target vector
print('Shape of target vector: ', iris_data.target.shape)

# Checking names of features
print('Features names: ', iris_data.feature_names)

# Checking which distinct class labels the target vector contains
print('There are ', np.unique(iris_data.target), ' distinct classes the target may fall in.')

# Checking the names of distinct classes 
print('Names of target classes: ', iris_data.target_names)



Type of iris_data:  <class 'sklearn.utils.Bunch'>
Description:  .. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
 

**Data Set Characteristics:**  
The description printed above lists the main characteristics of the data: 150 instances (50 in each of three classes), four numeric predictive attributes (sepal length, sepal width, petal length and petal width, all in cm), no missing attribute values and a class distribution of 33.3% for each of the 3 classes. The remaining output of the cell is:

**iris_data keys:**  dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])  
**Shape of feature matrix:**  (150, 4)  
**Shape of target vector:**  (150,)  
**Features names:**  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']  
**There are  [0 1 2]  distinct classes the target may fall in.**  
**Names of target classes:**  ['setosa' 'versicolor' 'virginica']

## 1.3 Separating Training and Testing Data
Before we build a machine learning model from our data, we must decide how to test its predictions on new measurements of iris flowers. load_iris() gives us 150 samples, but we should not use all of them for training, because we cannot evaluate the model on the same data it was trained on. A model can simply memorize the whole training set and would then predict the correct label for every sample in it. Such a result would not tell us whether the model 'generalizes' to new measurements.

To resolve this issue, we split the dataset into a training set and a testing set. A common approach is to shuffle the samples first and then assign 75% of them to the training set and the remaining 25% to the testing set. The two sets are then disjoint, and because the samples are shuffled, both sets will very likely contain all three species.

scikit-learn provides a function that shuffles the dataset and splits it in this way. By default, train_test_split() shuffles the dataset, extracts about 75% of the rows together with their labels as the training set and declares the remaining 25% together with their labels as the testing set.

Arguments: the data and target entries of the iris_data Bunch, and '0' as a fixed seed for the pseudorandom number generator.

In [84]:
X_train, X_test, y_train, y_test = train_test_split(iris_data['data'], iris_data['target'], random_state=0)

# X_train, y_train represent the training features and targets, and X_test, y_test represent the test features and targets
# random_state=0 means that whenever train_test_split is called with the same data and targets, the training and
# testing sets produced will be the same.
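
Shuffling makes it very likely, but does not guarantee, that every species is represented in both sets. A quick check, sketched here with np.bincount, counts the samples per class in each set. The stratify argument of train_test_split is not used in this notebook; it is shown only as one option when the class proportions should be preserved in both sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Same split as in the notebook: default 75%/25% split with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
print('Train samples per class:', np.bincount(y_train))
print('Test samples per class: ', np.bincount(y_test))

# Assumption for illustration: stratify=iris.target keeps the class
# proportions roughly equal in the training and testing sets
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    iris.data, iris.target, random_state=0, stratify=iris.target)
print('Stratified train samples per class:', np.bincount(ys_train))
print('Stratified test samples per class: ', np.bincount(ys_test))
```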

To confirm how the data was split, the following code prints the dimensions of X_train, X_test, y_train and y_test, along with those of iris_data.data and iris_data.target.

In [85]:
print('Iris_data feature matrix shape: {}'.format(iris_data.data.shape))
print('Iris_data target vector shape: {}'.format(iris_data.target.shape))

print()
print('X_train shape: {}'.format(X_train.shape))
print('y_train shape: {}'.format(y_train.shape))

print()
print('X_test shape: {}'.format(X_test.shape))
print('y_test shape: {}'.format(y_test.shape))

Iris_data feature matrix shape: (150, 4)
Iris_data target vector shape: (150,)

X_train shape: (112, 4)
y_train shape: (112,)

X_test shape: (38, 4)
y_test shape: (38,)
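
The split is not exactly 75%/25%, because 25% of 150 is 37.5, which is not a whole number of samples. The arithmetic can be sketched as follows, assuming (as in current scikit-learn releases) that a fractional test size is rounded up and the training set receives the remaining samples:

```python
import math

n_samples = 150
test_fraction = 0.25  # default test size when neither size is given

# The test set size is rounded up; the training set gets the rest
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test

print(n_test, n_train)
print(round(n_train / n_samples, 3), round(n_test / n_samples, 3))
```

This matches the 112 training rows and 38 testing rows reported above.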


## 1.4 Training the Model 
We will use LogisticRegression from sklearn to train the model on iris_data. For training we use X_train as the feature matrix and y_train as its corresponding target vector. The fit() method of LogisticRegression trains the model.

In [86]:
# Creating an instance of LogisticRegression that allows the solver at most 190 iterations to converge
logistic_model = LogisticRegression(max_iter=190)
# Fit method to train the model 
logistic_model.fit(X_train, y_train)

LogisticRegression(max_iter=190)
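
After fitting, the learned parameters can be inspected. The sketch below repeats the split and training steps so that it runs on its own (the names iris and model are illustrative) and prints the fitted attributes classes_, coef_, intercept_ and n_iter_. With three classes and four features, the coefficient matrix should have one row per class and one column per feature.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Same settings as in the notebook: at most 190 solver iterations
model = LogisticRegression(max_iter=190).fit(X_train, y_train)

print('Classes seen during fit:', model.classes_)
print('Coefficient matrix shape:', model.coef_.shape)
print('Intercept vector shape:', model.intercept_.shape)
print('Iterations used by the solver:', model.n_iter_)
```

If n_iter_ reaches max_iter, the solver stopped before converging, and a larger max_iter or feature scaling may be worth trying.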

## 1.5 Testing the Predictor  
To test the predictor we use the predict() method, invoked with X_test. Finally, using accuracy_score, we compare y_test with the predictions made by the trained model.

In [119]:
# Predicting targets using X_test
predictions = logistic_model.predict(X_test)
print(predictions)

# Test the model accuracy by comparing y_test with the predictions made in the previous statement.
print(accuracy_score(y_test, predictions))

[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]
0.9736842105263158
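
An accuracy of 0.9737 on 38 test samples corresponds to 37 correct predictions (37/38). A single accuracy figure does not show which classes are confused with each other, so one possible follow-up, sketched here with confusion_matrix from sklearn.metrics (not used elsewhere in this notebook), is to tabulate true labels against predicted labels. The sketch also shows that the model's own score() method reports the same accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
model = LogisticRegression(max_iter=190).fit(X_train, y_train)
predictions = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, predictions)
print(cm)

# For classifiers, score() returns the mean accuracy on the given data
accuracy = accuracy_score(y_test, predictions)
print(accuracy, model.score(X_test, y_test))
```

The diagonal of the matrix holds the correctly classified samples, so the sum of the diagonal divided by the total equals the accuracy.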


## 1.6 Making Predictions on New Data

In [112]:
# Generating a NumPy array of 5 rows and 4 columns of random values in [0, 10)
new_data = np.random.rand(5, 4) * 10
print(new_data)

prediction = logistic_model.predict(new_data)
print(prediction)

[[0.81676472 9.50685805 0.51732652 2.4185819 ]
 [2.0860781  9.70280773 9.57216921 0.29165695]
 [5.1421919  4.28920452 0.21262029 5.35225571]
 [4.24670875 5.37444729 2.66515665 7.19899658]
 [1.40351833 8.57270618 6.20913169 0.03022408]]
[0 2 0 2 0]
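
The predicted values are integer class indices. Random values in [0, 10) can also fall well outside the range of real measurements (for example, the largest sepal width in the dataset is 4.4 cm), so the predictions above mainly show that the model accepts any array with four columns. As a sketch of a more interpretable check, the code below predicts the first three held-out test rows, maps the indices to species names through target_names and prints the class probabilities from predict_proba. The names iris, model and samples are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
model = LogisticRegression(max_iter=190).fit(X_train, y_train)

# Use three held-out rows as stand-ins for new measurements
samples = X_test[:3]

# Map predicted class indices to species names
predicted_indices = model.predict(samples)
predicted_names = iris.target_names[predicted_indices]

# One row per sample, one probability per class (each row sums to 1)
probabilities = model.predict_proba(samples)
for name, probs in zip(predicted_names, probabilities):
    print(name, probs.round(3))
```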
