**b)** Read the data from the CSV file and check the data using the `head()`, `describe()`, and other
Pandas commands. 

In [None]:
import pandas as pd
import os

df = pd.read_csv(os.path.join(os.getcwd(), "iris-data.csv"))

In [None]:
df.head()

In [None]:
df.describe()

**c)** Read the data again, this time marking ‘NA’ entries as missing values.

In [None]:
df = pd.read_csv('iris-data.csv', na_values=['NA'])
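
As a rough illustration of what `na_values` does, the sketch below reads a tiny inline CSV instead of
`iris-data.csv`. The sample rows and the names `sample_csv`, `sample_df`, and `missing_counts` are made
up for this demonstration and are not part of the exercise data.

```python
import io

import pandas as pd

# Made-up sample rows standing in for iris-data.csv (assumption, not the real data)
sample_csv = io.StringIO(
    "sepal_length_cm,petal_width_cm,class\n"
    "5.1,0.2,Iris-setosa\n"
    "4.9,NA,Iris-setosa\n"
    "6.4,1.5,Iris-versicolor\n"
)

# Entries equal to 'NA' are parsed as missing values (NaN)
sample_df = pd.read_csv(sample_csv, na_values=['NA'])

# Count the missing values in each column
missing_counts = sample_df.isnull().sum()
print(missing_counts)
```

Running `df.isnull().sum()` on the real DataFrame in the same way should show which columns contain
missing entries.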

**d)** Import the MatPlotLib and Seaborn libraries and create a scatterplot matrix of the data.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(df.dropna(), hue='class');

**e)** The plot suggests that the field researchers made some errors when entering the data. It
looks like one of them forgot to add the `Iris-` prefix to their Iris-versicolor entries. The
other extraneous class, Iris-setossa, is simply a typo that was never fixed. Use the
DataFrame to fix these errors, then create a new scatterplot matrix of the data.


In [None]:
df.loc[df['class'] == 'versicolor', 'class'] = 'Iris-versicolor'
df.loc[df['class'] == 'Iris-setossa', 'class'] = 'Iris-setosa'

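# Print the remaining class names (a bare expression is only displayed when it ends the cell)
print(df['class'].unique())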
sns.pairplot(df.dropna(), hue='class');
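
One possible alternative to the two `df.loc` assignments is to pass a mapping of wrong to right labels
to `Series.replace`. The sketch below uses a small made-up DataFrame; `sample_classes` and `class_fixes`
are hypothetical names introduced only for this example.

```python
import pandas as pd

# Made-up class column containing the two kinds of labelling error
sample_classes = pd.DataFrame({
    'class': ['Iris-setosa', 'Iris-setossa', 'versicolor',
              'Iris-versicolor', 'Iris-virginica']
})

# Map each wrong label to its corrected form; only exact matches are replaced
class_fixes = {
    'versicolor': 'Iris-versicolor',
    'Iris-setossa': 'Iris-setosa',
}
sample_classes['class'] = sample_classes['class'].replace(class_fixes)

print(sample_classes['class'].value_counts())
```

Keeping the corrections in one dictionary makes it easy to add further fixes if more mislabelled
classes turn up.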

**f)** According to the scatterplot matrix, some 'Iris-setosa' rows have a sepal width below 2.5 cm,
which is impossible for this species. Drop those rows and create a histogram of the 'Iris-setosa'
sepal widths.

In [None]:
df = df.loc[(df['class'] != 'Iris-setosa') | (df['sepal_width_cm'] >= 2.5)]
df.loc[df['class'] == 'Iris-setosa', 'sepal_width_cm'].hist();

**g)** The next data issue to address is the several near-zero sepal lengths for the Iris-versicolor
rows. Those rows were gathered in meters instead of cm. Please correct that mistake and
draw the corresponding histogram. 

In [None]:
df.loc[(df['class'] == 'Iris-versicolor') & (df['sepal_length_cm'] < 1.0)]

In [None]:
df.loc[(df['class'] == 'Iris-versicolor') & (df['sepal_length_cm'] < 1.0), 'sepal_length_cm'] *= 100.0
df.loc[df['class'] == 'Iris-versicolor', 'sepal_length_cm'].hist();

**h)** One way to deal with missing data is mean imputation. Do that for the missing values of the
petal widths for Iris-setosa and create a new scatter plot for the data. 

In [None]:
average_petal_width = df.loc[df['class'] == 'Iris-setosa', 'petal_width_cm'].mean()

df.loc[(df['class'] == 'Iris-setosa') & (df['petal_width_cm'].isnull()), 'petal_width_cm'] = average_petal_width
df.loc[(df['class'] == 'Iris-setosa') & (df['petal_width_cm'] == average_petal_width)]
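
The exercise imputes only the Iris-setosa petal widths. As a sketch of a more general variant (not what
the task asks for), `groupby(...).transform('mean')` can fill each missing value with the mean of its
own class. The data and the names `sample_df` and `class_means` below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up measurements with one missing Iris-setosa petal width
sample_df = pd.DataFrame({
    'class': ['Iris-setosa'] * 3 + ['Iris-virginica'] * 2,
    'petal_width_cm': [0.2, 0.4, np.nan, 2.0, 2.2],
})

# Per-row mean of the row's own class (NaN values are ignored by the mean)
class_means = sample_df.groupby('class')['petal_width_cm'].transform('mean')

# Fill only the missing entries with the matching class mean
sample_df['petal_width_cm'] = sample_df['petal_width_cm'].fillna(class_means)

print(sample_df)
```

Imputing per class avoids pulling a missing value toward the mean of a different species.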

In [None]:
sns.pairplot(df, hue='class');

**i)** Save the new clean dataset to the disk with the name “iris-data-clean.csv”. 

In [None]:
df.to_csv('iris-data-clean.csv', index=False)

**j)** Create some violin plots of the data to compare the measurement distributions of the
classes. Violin plots contain the same information as box plots, but also scale the width of
the box according to the density of the data.


In [None]:
plt.figure(figsize=(10, 10))

for column_index, column in enumerate(df.columns):
    if column != 'class':
        plt.subplot(2, 2, column_index + 1)
        sns.violinplot(x='class', y=column, data=df)

**k)** Create two variables with the inputs and labels using the clean dataset created. 

In [None]:
x = df[['sepal_length_cm', 'sepal_width_cm', 'petal_length_cm', 'petal_width_cm']].values  # Inputs
y = df['class'].values                                                                     # Labels

x[:5]

**l)** Import `train_test_split` and randomly split the data into training and testing sets, with
75% of the examples in the training set and 25% in the testing set: training_inputs,
testing_inputs, training_classes, testing_classes.


In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)
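
A plain random split does not guarantee that every class appears in the testing set in the same
proportion as in the full data. As a hedged sketch, the `stratify` argument of `train_test_split` can
enforce that. The arrays `x_demo` and `y_demo` below are made-up stand-ins for the real inputs and
labels.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 12 examples, 2 features, 4 examples per class
x_demo = np.arange(24).reshape(12, 2)
y_demo = np.repeat(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], 4)

# stratify=y_demo keeps the class proportions equal in both subsets
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

test_counts = Counter(y_te)
print(test_counts)
```

With 25% of 12 examples in the testing set, each of the three classes contributes exactly one test
example.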

**m)** Import the DecisionTreeClassifier, train the classifier on the training set, and report its
score (accuracy) on the testing set.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Create the classifier
decision_tree_classifier = DecisionTreeClassifier()

# Train the classifier on the training set
decision_tree_classifier.fit(x_train, y_train)

# Validate the classifier on the testing set using classification accuracy
decision_tree_classifier.score(x_test, y_test)
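
For a classifier, `score` returns the classification accuracy, that is, the fraction of test examples
whose predicted class matches the true class. The sketch below checks this by hand. It uses
scikit-learn's bundled iris data (`load_iris`) as a stand-in for the cleaned CSV, which is an
assumption made only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's bundled iris dataset
x_demo, y_demo = load_iris(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=1
)

clf = DecisionTreeClassifier(random_state=1).fit(x_tr, y_tr)

# Accuracy computed by hand: share of correct predictions on the testing set
manual_accuracy = np.mean(clf.predict(x_te) == y_te)
reported_accuracy = clf.score(x_te, y_te)

print(manual_accuracy, reported_accuracy)
```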

**n)** Repeat the split-and-train experiment 1000 times and plot a histogram of the resulting accuracies.

In [None]:
model_accuracies = []

for repetition in range(1000):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
    
    decision_tree_classifier = DecisionTreeClassifier()
    decision_tree_classifier.fit(x_train, y_train)
    classifier_accuracy = decision_tree_classifier.score(x_test, y_test)
    model_accuracies.append(classifier_accuracy)
    
plt.hist(model_accuracies);
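
Alongside the histogram, a few summary statistics can describe how much the accuracy depends on the
particular split. This is only a sketch: it uses the bundled `load_iris` data as a stand-in for the
cleaned CSV, runs 50 repetitions instead of 1000 to stay fast, and seeds each repetition so the
result is reproducible.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's bundled iris dataset
x_demo, y_demo = load_iris(return_X_y=True)

accuracies = []
for repetition in range(50):
    # A different seed per repetition gives a different random split each time
    x_tr, x_te, y_tr, y_te = train_test_split(
        x_demo, y_demo, test_size=0.25, random_state=repetition
    )
    clf = DecisionTreeClassifier(random_state=repetition).fit(x_tr, y_tr)
    accuracies.append(clf.score(x_te, y_te))

accuracies = np.array(accuracies)
print('mean: {:.3f}  std: {:.3f}  min: {:.3f}  max: {:.3f}'.format(
    accuracies.mean(), accuracies.std(), accuracies.min(), accuracies.max()))
```

A wide gap between the minimum and maximum is a sign that a single train/test split gives an
unreliable performance estimate, which motivates cross-validation.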

**o)** Import StratifiedKFold, use stratified cross-validation with 10 splits, and train the
classifier again.

In [None]:
import numpy as np
from sklearn.model_selection import StratifiedKFold

def plot_cv(cv, features, labels):
    masks = []
    for train, test in cv.split(features, labels):
        mask = np.zeros(len(labels), dtype=bool)
        mask[test] = True
        masks.append(mask)
    
    plt.figure(figsize=(15, 15))
    plt.imshow(masks, interpolation='none', cmap='gray_r')
    plt.ylabel('Fold')
    plt.xlabel('Row #')

plot_cv(StratifiedKFold(n_splits=10), x, y)

In [None]:
from sklearn.model_selection import cross_val_score

decision_tree_classifier = DecisionTreeClassifier()

# cross_val_score returns a NumPy array with one score per fold, which we can
# visualize to get a reasonable estimate of our classifier's performance
cv_scores = cross_val_score(decision_tree_classifier, x, y, cv=10)
plt.hist(cv_scores)
plt.title('Average score: {}'.format(np.mean(cv_scores)))
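
To see what "stratified" means in practice, the sketch below counts the classes inside each test fold.
It uses the bundled `load_iris` data (50 examples per class, integer labels) as a stand-in for the
cleaned CSV, which is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

# Stand-in data (assumption): 150 examples, 50 per class, labels 0, 1, 2
x_demo, y_demo = load_iris(return_X_y=True)

# Count how many examples of each class land in every test fold
fold_class_counts = [
    np.bincount(y_demo[test_idx])
    for _, test_idx in StratifiedKFold(n_splits=10).split(x_demo, y_demo)
]

for fold_number, counts in enumerate(fold_class_counts):
    print('Fold {}: {}'.format(fold_number, counts))
```

Every fold holds the same number of examples from each class, so no fold is evaluated on a skewed
class mix.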

**p)** Import GridSearchCV and perform a grid search over the Decision Tree parameters to find
the best combination, visualizing the grid of accuracies for each parameter pair
(max_features 1-4 and max_depth 1-5).

In [None]:
from sklearn.model_selection import GridSearchCV

decision_tree_classifier = DecisionTreeClassifier()

parameter_grid = {'max_depth': [1, 2, 3, 4, 5], 'max_features': [1, 2, 3, 4]}

cross_validation = StratifiedKFold(n_splits=10)

grid_search = GridSearchCV(decision_tree_classifier, param_grid=parameter_grid, cv=cross_validation)

grid_search.fit(x, y)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
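
The mean accuracies stored in `cv_results_` can also be arranged into a table by parameter name
rather than by position. The sketch below is one way to do that with a pandas pivot. It uses the
bundled `load_iris` data and a smaller grid as stand-ins, and `demo_search` and `score_table` are
names introduced only for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Stand-in data and a reduced grid (assumptions made to keep the example fast)
x_demo, y_demo = load_iris(return_X_y=True)
demo_grid = {'max_depth': [1, 2, 3], 'max_features': [1, 2]}

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=demo_grid,
    cv=StratifiedKFold(n_splits=5),
)
demo_search.fit(x_demo, y_demo)

# One row per parameter combination, then pivot into a depth x features table
results = pd.DataFrame(demo_search.cv_results_)
score_table = results.pivot(
    index='param_max_depth',
    columns='param_max_features',
    values='mean_test_score',
)
print(score_table)
```

Because the rows and columns are labelled by parameter value, the table does not depend on the order
in which the grid search evaluated the combinations.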

In [None]:
# Arrange the mean accuracies as a (max_depth x max_features) grid without
# modifying the array stored in cv_results_
grid_visualization = grid_search.cv_results_['mean_test_score'].reshape(5, 4)
sns.heatmap(grid_visualization, cmap='Blues', annot=True)
plt.xticks(np.arange(4) + 0.5, grid_search.param_grid['max_features'])
plt.yticks(np.arange(5) + 0.5, grid_search.param_grid['max_depth'])
plt.xlabel('max_features')
plt.ylabel('max_depth');

**q)** Visualize the final decision tree graphically.

In [None]:
decision_tree_classifier = DecisionTreeClassifier()

parameter_grid = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': [1, 2, 3, 4, 5],
    'max_features': [1, 2, 3, 4]
}

cross_validation = StratifiedKFold(n_splits=10)

grid_search = GridSearchCV(decision_tree_classifier, param_grid=parameter_grid, cv=cross_validation)
grid_search.fit(x, y)

print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))

In [None]:
decision_tree_classifier = grid_search.best_estimator_
decision_tree_classifier

In [None]:
import sklearn.tree as tree

fig = plt.figure(figsize=(25,20))
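# feature_names expects the column names and class_names the class labels,
# not the input and label arrays themselves
feature_names = ['sepal_length_cm', 'sepal_width_cm', 'petal_length_cm', 'petal_width_cm']
tree.plot_tree(decision_tree_classifier, feature_names=feature_names,
               class_names=list(decision_tree_classifier.classes_), filled=True);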
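
When a figure is hard to read, the same tree can be inspected as plain text with `export_text` from
`sklearn.tree`. The sketch below trains a shallow tree on the bundled `load_iris` data as a stand-in
for the tuned classifier. The depth limit and the feature names are assumptions chosen to mirror the
columns used in this exercise.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data and model (assumptions): bundled iris data, depth-2 tree
x_demo, y_demo = load_iris(return_X_y=True)
feature_names = ['sepal_length_cm', 'sepal_width_cm',
                 'petal_length_cm', 'petal_width_cm']

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(x_demo, y_demo)

# Render the split rules as indented text, one line per node
tree_text = export_text(clf, feature_names=feature_names)
print(tree_text)
```

Each indented line shows the threshold test applied at a node, and the leaf lines show the predicted
class.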