# Machine Learning
### DePaul University
Ilyas Ustun


# What is classification?

###  Which of the following is a classification problem?
Developing a good intuition for classification problems is important. From the scenarios listed below, select the ones that you think are classification problems and justify your choice.

 1. Using labeled historic pricing data to predict if the price of gold will increase or decrease tomorrow. This is classification: the target is one of two categories (increase or decrease).
 2. Using labeled pricing data to predict the price of gold tomorrow. This is regression: the target is a continuous value.
 3. Using unlabeled data to cluster job candidates into roles. This is not classification: without labels it is clustering, an unsupervised task.
 4. Using labeled data to predict the number of sales of a new song. This is regression: the number of sales is a numeric quantity, not a category.
 5. Training a drone to recognize a certain type of terrain from labeled data. This is classification: the target is a terrain category.
 6. Using labeled flower data, predict the type of a new flower. This is classification: the target is a flower type.
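
One rough way to check the intuition above is to look at the kind of values the target takes. The sketch below uses `type_of_target` from `sklearn.utils.multiclass`; the target values are made up for illustration and are not taken from any real dataset.

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical target values for three of the scenarios above
gold_direction = ["increase", "decrease", "increase"]  # scenario 1
gold_price = [1912.4, 1920.1, 1899.75]                 # scenario 2
flower_type = ["setosa", "virginica", "versicolor"]    # scenario 6

targets = {
    "gold_direction": gold_direction,
    "gold_price": gold_price,
    "flower_type": flower_type,
}

# Discrete labels point to classification, continuous values to regression
for name, target in targets.items():
    print(name, "->", type_of_target(target))
```

This is only a heuristic: an integer count such as the number of sales in scenario 4 would be reported as discrete, even though the task is usually treated as regression.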


# Classification of IRIS dataset using Scikit-Learn 

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets, metrics
%matplotlib inline 


### IRIS data set
#### https://en.wikipedia.org/wiki/Iris_flower_data_set

In [None]:
from IPython.display import Image
Image(url="https://upload.wikimedia.org/wikipedia/commons/4/41/Iris_versicolor_3.jpg", width=200, height=200)

In [None]:
# import Iris dataset from Scikit-Learn's datasets
iris = datasets.load_iris()

print("Shape of the data ", iris.data.shape)  # (rows, columns)
print("Class names ", iris.target_names)
print("Attributes ", iris.feature_names)

# view the first 5 rows
print(iris.data[:5])
print(iris.target[:5])

In [None]:
iris.feature_names

In [None]:
#show it as a table
df = pd.DataFrame(data=iris.data) # converting the input data into a dataframe
df.columns = iris.feature_names   # assigning column names
df['Class'] = iris.target         # adding a new column to the data frame.
df['Name'] = iris.target_names[iris.target] # adding a new column to the data frame
df.head() # first 5 rows

In [None]:
df.tail() # the last five rows

In [None]:
df.info()

In [None]:
sns.scatterplot(data=df, x='sepal length (cm)', y='sepal width (cm)', hue='Class');

In [None]:
sns.scatterplot(data=df, x='sepal length (cm)', y='sepal width (cm)', hue='Name');

<p>So using a distinct categorical variable (the class name rather than the numeric class code) produces a better palette.</p>
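
The difference comes from the column type: seaborn treats a numeric `hue` as a continuous quantity, while a categorical one gets one distinct color per level. A minimal sketch of the two representations, using only pandas (the variable names `codes` and `names` are illustrative):

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Numeric class codes: 0, 1, 2
codes = pd.Series(iris.target, name="Class")

# The same information as a categorical column with named levels
names = pd.Series(
    pd.Categorical.from_codes(iris.target, categories=iris.target_names),
    name="Name",
)

print(codes.dtype)                    # an integer dtype
print(names.dtype)                    # category
print(names.cat.categories.tolist())  # the three species names
```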

Get the X and y arrays:

In [None]:
X = iris.data[:, :2]  # we only take the first two features all rows
y = iris.target

In [None]:
X[:6, :] # check the first 6 rows of the columns in our data set

In [None]:
y

Using the dataframe, this can be done with the `.iloc[]` or `.loc[]` indexers. However, X will then be a dataframe and y a pandas Series.

In [None]:
# give me back all the rows but only these two columns
X = df.loc[:, ['sepal length (cm)', 'sepal width (cm)']]  # we only take the first two features.
y = df.Class

In [None]:
X.head()

In [None]:
y # this is a pandas Series, which is one-dimensional

**`sklearn` accepts both arrays and dataframes, so both methods are OK.**
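
As a quick check of this claim, the sketch below fits the same model once on a NumPy array and once on a dataframe and compares the predictions. The names `model_array` and `model_frame` are illustrative, and `max_iter=1000` is an assumption added only to avoid convergence warnings.

```python
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model

iris = datasets.load_iris()
y = iris.target

# The same two features, once as an array and once as a dataframe
X_array = iris.data[:, :2]
X_frame = pd.DataFrame(X_array, columns=iris.feature_names[:2])

model_array = linear_model.LogisticRegression(max_iter=1000).fit(X_array, y)
model_frame = linear_model.LogisticRegression(max_iter=1000).fit(X_frame, pd.Series(y))

# Both models see identical numbers, so the predictions should match
same = np.array_equal(model_array.predict(X_array), model_frame.predict(X_frame))
print("Identical predictions:", same)
```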

### Create train-test split

In [None]:
from sklearn.model_selection import train_test_split
# test_size=0.33 means that 33% of the rows go into the test set
# a fixed random_state makes the split reproducible: the same rows are selected on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
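
The effect of `test_size` and `random_state` can be checked directly. This is a minimal sketch that splits the full Iris arrays twice with the same seed; `split_a` and `split_b` are illustrative names.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Two splits with the same seed
split_a = train_test_split(X, y, test_size=0.33, random_state=42)
split_b = train_test_split(X, y, test_size=0.33, random_state=42)

X_train, X_test, y_train, y_test = split_a

# Sizes of the two parts; together they cover all 150 rows
print(len(X_train), len(X_test))

# The same seed selects the same test rows both times
print((split_a[3] == split_b[3]).all())
```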

In [None]:
X_train.info()

In [None]:
y_train.shape

In [None]:
X_test.info()

In [None]:
y_test.shape

### Generate Model

In [None]:
# Logistic Regression
# instantiation; C is the inverse of the regularization strength, so a large C means weak regularization
logreg = linear_model.LogisticRegression(C=1e5)

# model fitting: fit the classifier on the training data
logreg.fit(X_train, y_train)

# the classifier's predictions on the training set
pred_train = logreg.predict(X_train)

# the classifier's predictions on the testing set
pred_test = logreg.predict(X_test)

In [None]:
pred_train

In [None]:
# accuracy: the fraction of training rows where the actual value (y_train) equals the predicted value
(y_train == pred_train).mean()

In [None]:
(y_test == pred_test).mean()

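<p>In general, the training accuracy is expected to be somewhat higher than the testing accuracy, because the model was fitted on the training data. On a small dataset such as this one, the particular split can make the order come out either way.</p>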
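
The manual accuracy computation should agree with scikit-learn's built-in helpers. The sketch below rebuilds the model in a self-contained way and compares three routes to the same number; `max_iter=1000` is an assumption added to avoid convergence warnings, and the variable names are illustrative.

```python
from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X = X[:, :2]  # first two features only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

logreg = linear_model.LogisticRegression(C=1e5, max_iter=1000).fit(X_train, y_train)
pred_test = logreg.predict(X_test)

manual = (y_test == pred_test).mean()                    # fraction of correct predictions
via_metrics = metrics.accuracy_score(y_test, pred_test)  # same quantity from sklearn.metrics
via_score = logreg.score(X_test, y_test)                 # classifiers report accuracy from .score()

print(manual, via_metrics, via_score)
```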

In [None]:
# misclassification rate on the training
(y_train != pred_train).mean()

In [None]:
# misclassification rate on the testing data set
(y_test != pred_test).mean()
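
The misclassification rate is simply one minus the accuracy. A small sketch with hand-made labels (not taken from the Iris results) compares the manual computation with `metrics.zero_one_loss`:

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels: one of the five predictions is wrong
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1])

manual_error = (y_true != y_pred).mean()
loss = metrics.zero_one_loss(y_true, y_pred)
accuracy = metrics.accuracy_score(y_true, y_pred)

print(manual_error, loss, 1 - accuracy)
```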

### Predict

In [None]:
# collect the predictions next to the features in a copy, so that X_test keeps only the feature columns
results = X_test.copy()
results['Predicted'] = pred_test
results['Actual'] = y_test
results.tail() # end of the data

<p>Here, we tie everything back to the testing dataset.</p>

### Plot decision boundaries

In [None]:
df.groupby("Name").mean()

In [None]:
legend_labels = ["setosa", "versicolor", "virginica"]

In [None]:
X_test.columns

In [None]:
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].

def plot_clf_boundary(X, y, legend_labels):
    
    xlabel = X.columns.tolist()[0]
    ylabel = X.columns.tolist()[1]
    
    X = X.values.copy()
    
    h = .02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)

    fig_kw = {"figsize":(7, 5)}
    fig, ax = plt.subplots(**fig_kw)

    ax.pcolormesh(xx, yy, Z, cmap=plt.cm.rainbow, shading='auto')

    # Also plot the data points passed in as X and y
    scatter = ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.rainbow)
    plt.xlabel(xlabel, fontdict = {'size': 16})
    plt.ylabel(ylabel, fontdict = {'size': 16})


    # produce a legend with the unique colors from the scatter
    # get the handles and labels
    handles, _ = scatter.legend_elements()
    legend1 = ax.legend(# *scatter.legend_elements(),
                        handles, legend_labels,
                        loc="lower left", title="Classes")
    ax.add_artist(legend1)

    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())

    plt.show()

In [None]:
plot_clf_boundary(X=X_test, y=y_test, legend_labels=legend_labels)

<p>The three classes, setosa, versicolor, and virginica, are shown in purple, greenish, and red respectively. The background color is the model's prediction for that region: a point that falls in the purple area is predicted as setosa, a point in the greenish area as versicolor, and a point in the red area as virginica. All the purple points lie in the purple area, so they are classified correctly. The greenish area contains some red points, which are misclassified; the same applies to the greenish points in the red area, which are on the wrong side of the boundary.</p>
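
The colored regions can also be read as "the class with the highest predicted probability at that location". The following sketch illustrates this with `predict_proba`; the query points are hypothetical sepal measurements chosen for illustration, and `max_iter=1000` is an assumption added to avoid convergence warnings.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X = X[:, :2]  # sepal length and sepal width
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
logreg = linear_model.LogisticRegression(C=1e5, max_iter=1000).fit(X_train, y_train)

# Hypothetical query points: (sepal length, sepal width) in cm
points = np.array([[5.0, 3.5], [6.0, 2.8], [7.5, 3.0]])

# One probability per class for each point; each row sums to 1
proba = logreg.predict_proba(points)

# The region a point falls in corresponds to the most probable class
region = logreg.classes_[proba.argmax(axis=1)]

print(np.round(proba, 3))
print(region, logreg.predict(points))
```

Points close to a boundary have two classes with similar probabilities, which is where misclassifications tend to occur.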

### Plot confusion matrix

In [None]:
confusion_matrix = pd.crosstab(index=y_test, columns=pred_test.ravel(), rownames=['Expected'], colnames=['Predicted'])
sns.heatmap(confusion_matrix, annot=True, square=False, fmt='', cbar=False)
plt.title("Confusion Matrix", fontsize=15)
plt.show() # this is one way to create one

In [None]:
metrics.ConfusionMatrixDisplay.from_predictions(y_true=y_test, y_pred=pred_test)

<p>This shows the same matrix as above, though it looks better. Classes 0 and 2 are predicted correctly.</p>

In [None]:
metrics.confusion_matrix(y_true=y_test, y_pred=pred_test)
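
Per-class recall and precision can be read straight off the confusion matrix: rows are actual classes and columns are predicted classes. The sketch below uses small hand-made labels (not the Iris results) so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels for three classes
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 2, 1, 2, 1, 2])

cm = metrics.confusion_matrix(y_true=y_true, y_pred=y_pred)
print(cm)

# Diagonal entries are the correct predictions for each class
recall = cm.diagonal() / cm.sum(axis=1)     # correct / actual count (row sums)
precision = cm.diagonal() / cm.sum(axis=0)  # correct / predicted count (column sums)

print(recall)
print(precision)
```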

### Classification Report

In [None]:
print(metrics.classification_report(y_test, pred_test))
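
If the report is needed as data rather than printed text, `classification_report` can return a dictionary via `output_dict=True`. A minimal sketch with hand-made labels (not the Iris results); `per_class` is an illustrative name.

```python
import pandas as pd
from sklearn import metrics

# Hypothetical labels for three classes
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 2, 1, 2]
class_names = ["setosa", "versicolor", "virginica"]

report = metrics.classification_report(
    y_true, y_pred, target_names=class_names, output_dict=True
)

# One row per class with precision, recall, f1-score and support
per_class = pd.DataFrame({name: report[name] for name in class_names}).T
print(per_class)
print("accuracy:", report["accuracy"])
```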

## <span style="color:cornflowerblue">Exercise:</span>

1. Train two other classifiers.
2. Plot the confusion matrix for each.
3. Compare their performance with the Logistic Regression classifier.
4. Here is a list of available classifiers: http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
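
As a possible starting point, the comparison can be organized as a loop over a dictionary of models. The two extra classifiers and their settings below are arbitrary choices for illustration, not a recommended answer, and `max_iter=1000` is an assumption added to avoid convergence warnings.

```python
from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X = X[:, :2]  # first two features, as in the rest of the notebook
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Illustrative choice of classifiers; any scikit-learn classifier fits this pattern
models = {
    "logistic regression": linear_model.LogisticRegression(C=1e5, max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = metrics.accuracy_score(y_test, pred)
    print(name, round(scores[name], 3))
    print(metrics.confusion_matrix(y_test, pred))
```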
