# Practice - Iris Species Classification
### Instructions
- Some code has been written for you; you practiced most of it in the lessons
- Feel free to add your own code and analysis. This is a practice exercise, so try to apply what you have learned
- If something is not clear, go back to the lessons or the documentation page

In [1]:
import warnings
warnings.filterwarnings('ignore')

# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

iris = pd.read_csv('Iris.csv') # import data
iris.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


### Conduct your exploratory analysis

In [2]:
iris.drop('Id', axis = 1, inplace = True)

In [None]:
# checking the shape of the data
iris.shape

In [None]:
iris.info()

In [None]:
iris.describe() # Summary Statistics

### Exploratory Data Visualization
- Feel free to add more plots and derive insights
- What can you learn from the plots you draw?

In [None]:
# Visualize pairwise relationship
sns.pairplot(iris, hue = 'Species')

In [None]:
# Compare distributions of petal width and petal length across species
plt.figure(figsize = (10, 6))

plt.subplot(1, 2, 1)
sns.violinplot(x="Species", y="PetalWidthCm", data=iris)

plt.subplot(1, 2, 2)
sns.violinplot(x="Species", y="PetalLengthCm", data=iris)

What can you say about your data by the visualizations?

From the pair plot and the violin plots, you can infer that petal length and petal width are roughly linearly related and that both vary across the species in a similar way.

Can you study other features and relationships too?
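
One way to check a visual impression like the one above is to compute pairwise correlations. The following is a minimal sketch, not part of the original exercise. It assumes the copy of the iris data bundled with scikit-learn (loaded with `load_iris`) rather than `Iris.csv`, so the column names differ from the ones used in this notebook.

```python
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled iris data instead of Iris.csv.
# Its columns are named like 'petal length (cm)', not 'PetalLengthCm'.
data = load_iris(as_frame=True)
features = data.data

# Pearson correlation between every pair of numeric features
corr = features.corr()
print(corr.round(2))

# Correlation between the two petal measurements
petal_corr = corr.loc['petal length (cm)', 'petal width (cm)']
print('Petal length vs petal width:', round(petal_corr, 2))
```

A value close to 1 supports the linear relationship seen in the pair plot. The same table can be used to look at the sepal features.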

### Shuffle Data
Since our data is ordered by species, it is not a good idea to train and validate the model on it in that order: without shuffling, a validation fold could consist of a single species. So we shuffle the rows randomly using sklearn's shuffle function.

In [3]:
from sklearn.utils import shuffle
iris = shuffle(iris, random_state = 42)
iris.head(10)

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
73,6.1,2.8,4.7,1.2,Iris-versicolor
18,5.7,3.8,1.7,0.3,Iris-setosa
118,7.7,2.6,6.9,2.3,Iris-virginica
78,6.0,2.9,4.5,1.5,Iris-versicolor
76,6.8,2.8,4.8,1.4,Iris-versicolor
31,5.4,3.4,1.5,0.4,Iris-setosa
64,5.6,2.9,3.6,1.3,Iris-versicolor
141,6.9,3.1,5.1,2.3,Iris-virginica
68,6.2,2.2,4.5,1.5,Iris-versicolor
82,5.8,2.7,3.9,1.2,Iris-versicolor


In [4]:
#split target and input
target = iris.Species
X_input = iris.drop('Species', axis = 1)

### Cross Validation
A technique in which we set aside a portion of the dataset, called the validation set, on which we do not train the model. We train the model on the larger remaining portion so that it can learn the patterns or trends in the data, and then test the trained model on the validation set before finalizing it.

***k-fold cross validation***
Training and testing multiple times

- Randomly split the dataset into k folds
- For each fold, train the model on the other k-1 folds and test it on the held-out fold
- Repeat until every fold has served as the test set once
- The average error across the k folds is the cross validation error
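
The steps above can be sketched with scikit-learn's `cross_val_score`, which runs the split, train, and test loop for you. This is only an illustration under some assumptions: it loads the iris data bundled with scikit-learn instead of `Iris.csv`, and the `random_state` values are arbitrary choices made for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: use scikit-learn's bundled iris data instead of Iris.csv
X, y = load_iris(return_X_y=True)

# shuffle=True randomizes which rows end up in each fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
clf = DecisionTreeClassifier(random_state=42)

# Train on k-1 folds and score on the held-out fold, once per fold
fold_scores = cross_val_score(clf, X, y, cv=kf, scoring='accuracy')

# Average over the folds; error is 1 - accuracy
cv_accuracy = fold_scores.mean()
cv_error = 1 - cv_accuracy

print('Accuracy per fold:', fold_scores)
print('Cross validation error:', cv_error)
```

Because the splitter shuffles the rows itself, this sketch does not depend on the data having been shuffled beforehand.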

### F1 Score
Often a better measure of performance for classification problems than accuracy_score, especially when the classes are imbalanced. Accuracy only measures the fraction of predictions we got right. If, say, 95% of the inputs are 'True', a model that predicts 'True' for every input reaches 95% accuracy: it gets the true inputs right, but it also predicts every false input as True, which is a bad prediction.

F1 takes into account both the false positives and the false negatives, i.e. the inputs which were false but predicted true and the inputs which were true but predicted false.

**F1 = 2 * (precision * recall) / (precision + recall)**



Learn more about it here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
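
To make the formula concrete, here is a small worked example on made-up binary labels (the lists `y_true` and `y_pred` are hypothetical, not taken from the iris data). It computes F1 by hand from precision and recall and compares the result with sklearn's `f1_score`.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical imbalanced labels: 2 positives, 8 negatives
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# The model finds only one of the two positives
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # 9 of 10 correct
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 1 / 1
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 1 / 2

# F1 = 2 * (precision * recall) / (precision + recall)
f1_manual = 2 * (precision * recall) / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred)

print('Accuracy: ', accuracy)
print('Precision:', precision)
print('Recall:   ', recall)
print('F1 manual:', f1_manual)
print('F1 sklearn:', f1_sklearn)
```

Accuracy looks good here even though half of the positives were missed, while F1 is noticeably lower. For a multi-class target such as Species, `f1_score` needs an `average` argument; `average = 'weighted'` averages the per-class F1 scores weighted by the number of samples in each class.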

In [5]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold # import kfold

clf = DecisionTreeClassifier() # create the classifier object
kf = KFold(n_splits=5) # create the k-fold splitter object (5 folds)

for train_index, test_index in kf.split(X_input):

    # splitting the test and train data using the kfolds
    X_train, X_test = X_input.iloc[train_index], X_input.iloc[test_index] 
    y_train, y_test = target.iloc[train_index], target.iloc[test_index]
    
    # fit the model and predict
    clf.fit(X_train, y_train)
    prediction = clf.predict(X_test)
    score = accuracy_score(y_test, prediction)
    f1 = f1_score(y_test, prediction, average = 'weighted')
    
    print('Accuracy: ', score)
    print('F1 Score: ', f1)
    

Accuracy:  1.0
F1 Score:  1.0
Accuracy:  0.9666666666666667
F1 Score:  0.9661782661782662
Accuracy:  0.9333333333333333
F1 Score:  0.9333333333333333
Accuracy:  0.9333333333333333
F1 Score:  0.9333333333333333
Accuracy:  0.9333333333333333
F1 Score:  0.9330808080808081


### Can you test other classifiers on this data?
- Naive Bayes
- Random Forest
- kNN
- SVM
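
As a possible starting point, the sketch below evaluates the four classifiers listed above with 5-fold cross validation and the weighted F1 score. It is an illustration rather than a reference solution: it assumes the iris data bundled with scikit-learn instead of `Iris.csv`, and the hyperparameters (`n_estimators`, `n_neighbors`, `random_state`) are arbitrary choices you may want to tune.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Assumption: use scikit-learn's bundled iris data instead of Iris.csv
X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Hyperparameters below are illustrative, not tuned
classifiers = {
    'Naive Bayes': GaussianNB(),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'kNN': KNeighborsClassifier(n_neighbors=5),
    'SVM': SVC(),
}

# Mean weighted F1 score over the 5 folds for each classifier
results = {}
for name, model in classifiers.items():
    scores = cross_val_score(model, X, y, cv=kf, scoring='f1_weighted')
    results[name] = scores.mean()
    print(f'{name}: mean weighted F1 = {results[name]:.3f}')
```

Using the same `kf` object for every model means each classifier is scored on the same folds, which makes the comparison fairer.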