<a href="https://colab.research.google.com/github/RhythmOfRiora/Projects/blob/master/Copy_of_WWC_ML_workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Import libraries**

Python relies on imported libraries to perform a wide range of functions. The ones below cover plotting, array maths and machine learning.

In [0]:
import matplotlib.pyplot as plt  # Matplotlib is used to create visualisations of the data
from matplotlib.colors import ListedColormap  # Builds a colour map from a list of colours
from mpl_toolkits.mplot3d import Axes3D  # Axes3D provides Matplotlib's 3D axes (used for the PCA plot)
import numpy as np  # NumPy supports multi-dimensional arrays/matrices and mathematical functions on them

# Scikit-learn (sklearn) is a machine learning library that contains a large number of ML
# algorithms and ways to analyse them

from sklearn import linear_model,  svm, datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support
from sklearn.utils.multiclass import unique_labels


**Load in the dataset**

Scikit-learn ships with a selection of datasets we can use. This includes the iris dataset, a classic example of a classification problem. It consists of 150 samples covering 3 species of iris. Our task is to identify the species from four measurements: the length and width of the sepals and of the petals. The flowers look similar to the naked eye, in both colour and shape, but they have some key differences that make it fairly simple for our classification models to differentiate them.

In [0]:
iris=datasets.load_iris()
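
As a quick check of the description above, the following sketch prints the size of the dataset and the number of samples per class. The variable name `class_counts` is introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# data holds one row per flower and one column per feature
print(iris.data.shape)

# target holds one integer label (0, 1 or 2) per flower
print(iris.target.shape)

# Count how many samples belong to each class
class_counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, class_counts):
    print(name, count)
```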

**Class names**

There are 3 different species of iris, and their names are the class names/labels for the data.

In [0]:
iris.target_names

**Feature names**

Each sample in the dataset is described by 4 features (input variables); these hold the measurements for every flower.

In [0]:
iris.feature_names


**Labelled sample**

A particular instance from the dataset, containing both the features and the label.

In [0]:
example=[iris.data[0], iris.target[0]]
print(example)
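
The raw sample is easier to read when the numbers are paired with their names. A minimal sketch, where `named_features` and `class_name` are names chosen for this example:

```python
from sklearn import datasets

iris = datasets.load_iris()

features = iris.data[0]
label = iris.target[0]

# Pair each measurement with the name of the feature it belongs to
named_features = dict(zip(iris.feature_names, features))
for name, value in named_features.items():
    print(f"{name}: {value}")

# Look up the class name that the integer label stands for
class_name = iris.target_names[label]
print("label:", label, "->", class_name)
```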

**Visualize the data**

It's important to understand the data we're using, and one of the best ways to do this is by visualising it. What data is being visualised in the two plots below? Note that in the first graph there is a large amount of overlap between the 2nd and 3rd classes, while in the second graph there is less overlap, though some still remains.

In [0]:
X = iris.data[:, :2]  # Take the 1st and 2nd features: all rows, first two columns of the array
y = iris.target

# Compute the axis limits with some padding; the comma assigns both values (min, max) on one line
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

plt.figure(2, figsize=(8, 6))
plt.clf()

# Plot the data points, coloured by class
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1,
            edgecolor='k')
plt.xlabel('X1')
plt.ylabel('X2')

plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())


In [0]:
X = iris.data[:, 2:]  # Take the 3rd and 4th features: all rows, last two columns of the array
y = iris.target

# Compute the axis limits with some padding; the comma assigns both values (min, max) on one line
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

plt.figure(2, figsize=(8, 6))
plt.clf()

# Plot the data points, coloured by class
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1,
            edgecolor='k')
plt.xlabel('X3')
plt.ylabel('X4')

plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
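
The plots above use generic axis labels and no legend. As one possible variation, the sketch below draws the two petal features with the dataset's own feature names on the axes and one legend entry per class. The column indices and variable names are choices made for this example.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
petal_length_col, petal_width_col = 2, 3  # column indices of the two petal features

fig, ax = plt.subplots(figsize=(8, 6))

# Draw one scatter series per class so each gets its own legend entry
for class_index, class_name in enumerate(iris.target_names):
    mask = iris.target == class_index
    ax.scatter(iris.data[mask, petal_length_col], iris.data[mask, petal_width_col],
               label=class_name, edgecolor='k')

ax.set_xlabel(iris.feature_names[petal_length_col])
ax.set_ylabel(iris.feature_names[petal_width_col])
ax.legend(title="Class")
```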


**Visualise all dimensions**

The previous visualisations each look at only two of the dataset's 4 dimensions. PCA (principal component analysis) lets us visualise the data by squashing the information into a smaller number of dimensions. The new axes are the eigenvectors (principal components) of the data. In the plot below there is very little overlap between the classes, which suggests that a classifier should work well on this dataset.

In [0]:
fig = plt.figure(1, figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")  # Create 3D axes attached to the figure
ax.view_init(elev=-150, azim=110)  # Set the viewing angle
X_reduced = PCA(n_components=3).fit_transform(iris.data)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=iris.target,
           cmap=plt.cm.Set1, edgecolor='k', s=40)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.xaxis.set_ticklabels([])
ax.set_ylabel("2nd eigenvector")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.zaxis.set_ticklabels([])

plt.show()
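
To see how much information survives the squashing, one option is to inspect the fitted PCA object. This sketch prints the fraction of the total variance captured by each component; `ratios` is a name chosen for this example.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

pca = PCA(n_components=3)
pca.fit(iris.data)

# Fraction of the total variance captured by each principal component
ratios = pca.explained_variance_ratio_
for i, ratio in enumerate(ratios, start=1):
    print(f"component {i}: {ratio:.3f}")

# Each row of components_ is one eigenvector, expressed in the 4 original features
print(pca.components_.shape)
```

If the first component captures most of the variance, a plot of just one or two components already shows most of the structure in the data.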

**Importance of splitting the data into train and test split**

When training a model we want to give it as much data as possible, so would it make sense to train it on all of our data? Why? Why not?

In [0]:
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=45)
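
One way to explore the question above is to compare accuracy on the training data with accuracy on the held-out test data. The sketch below uses a 1-nearest-neighbour model purely as an illustration; `train_accuracy` and `test_accuracy` are names chosen for this example.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=45)

# By default 25% of the samples are held out for testing
print(len(X_train), len(X_test))

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

# Accuracy on data the model has already seen is an optimistic estimate...
train_accuracy = model.score(X_train, y_train)
# ...so we also measure it on held-out data the model has never seen
test_accuracy = model.score(X_test, y_test)
print(train_accuracy, test_accuracy)
```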


**K-nearest neighbours (KNN) classifier**

The KNN classifier labels a new example by taking the majority label among its nearest neighbours in the training data. The predicted label depends on how many examples are counted as neighbours (k).

In [0]:
# Apply the KNN classifier
# Try different numbers of neighbours: 1, 3, 5, 15
knn_classifier = KNeighborsClassifier(n_neighbors=1)
knn_classifier.fit(X_train, y_train) 


# Predict the labels for the test set
knn_prediction = knn_classifier.predict(X_test)
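
The majority vote can be written out by hand. The following is a simplified sketch on a made-up toy dataset, not scikit-learn's actual implementation; `knn_predict`, `toy_X` and `toy_y` are hypothetical names used only here.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, query, k):
    """Label `query` with the majority label of its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbours and return the most common one
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Two well-separated clusters: class 0 near the origin, class 1 near (5, 5)
toy_X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
toy_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(toy_X, toy_y, np.array([0.2, 0.1]), k=3))
print(knn_predict(toy_X, toy_y, np.array([5.5, 5.5]), k=3))
```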

**Accuracy**

The proportion of examples in the test dataset for which the classifier selects the correct label.

In [0]:
# evaluate accuracy
print(accuracy_score(y_test, knn_prediction))
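
Accuracy is simple enough to compute by hand. This sketch uses made-up labels (`y_true` and `y_pred` are illustrative, not taken from the iris run) and checks the result against `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 1, 2, 1])

# Fraction of positions where the prediction matches the true label (4 out of 5 here)
manual_accuracy = np.mean(y_true == y_pred)
print(manual_accuracy)

# scikit-learn computes the same value
print(accuracy_score(y_true, y_pred))
```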

**KNN neighbours visualisation**

Changing the number of neighbours changes the decision boundaries (which class is predicted where) and therefore the accuracy. Experiment to find the value that gives the best accuracy on this dataset.

In [0]:
# Try different numbers of neighbours (1, 3, 5, 10, 15), using only sepal length and width
n_neighbors = 15

h = .02  # step size in the mesh

# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

for items in [[X_train[:,:2],y_train, "Train"],[X_test[:,:2],y_test, "Test"]]:
    
    X=items[0]
    y=items[1]
    # Create an instance of KNeighborsClassifier and fit it to the data.
    clf = KNeighborsClassifier(n_neighbors, weights="uniform")
    clf.fit(X, y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    # Pad the min and max of each feature by 1 so the mesh extends past the points;
    # the comma assigns both values (min, max) on one line
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Also plot the data points (train or test, depending on the iteration)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification "+items[2]+" k ="+str(n_neighbors))
    
knn_visualisation_classifier = KNeighborsClassifier(n_neighbors)
knn_visualisation_classifier.fit(X_train[:,:2], y_train) 


# predict the response
knn_visualisation_prediction = knn_visualisation_classifier.predict(X_test[:,:2])

# evaluate accuracy
print(accuracy_score(y_test, knn_visualisation_prediction))
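
Rather than editing `n_neighbors` by hand and rerunning the cell, the candidate values can be compared in a loop. A minimal sketch, assuming the same split as above; `accuracy_by_k` is a name chosen for this example.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=45)

accuracy_by_k = {}
for k in [1, 3, 5, 10, 15]:
    # Use only the first two features (sepal length and width), as in the plots
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(X_train[:, :2], y_train)
    predictions = classifier.predict(X_test[:, :2])
    accuracy_by_k[k] = accuracy_score(y_test, predictions)

for k, accuracy in accuracy_by_k.items():
    print(f"k={k}: accuracy={accuracy:.3f}")
```

With a test set this small, differences between values of k may come down to one or two samples, so treat the ranking as a rough guide.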

**Logistic regression**

Logistic regression is a classification model that finds a linear decision boundary (a line, in two dimensions) with a different class on either side. This can be extended to multiple classes too, by finding boundaries that separate each class from the others.

In [0]:
# Logistic regression

# Create logistic regression object
# Try different solvers:  'newton-cg', 'sag', 'saga' and 'lbfgs'
# multi_class is left at its default because recent scikit-learn releases deprecate the argument
logistic_regression_classifier = linear_model.LogisticRegression(random_state=0, solver='lbfgs', max_iter=200)

# Train the model using the training sets
logistic_regression_classifier.fit(X_train,y_train)

# Make predictions using the testing set
logistic_regression_classifier_prediction = logistic_regression_classifier.predict(X_test)


# evaluate accuracy
print(accuracy_score(y_test, logistic_regression_classifier_prediction))
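
Unlike KNN's majority vote, logistic regression produces a probability for each class. The sketch below shows these probabilities for a few test samples. The `max_iter=1000` setting is an assumption of this example, chosen to give the solver room to converge.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=45)

# max_iter is raised here (an assumption of this sketch) to help the solver converge
model = linear_model.LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)

# One row per test sample, one column per class; each row sums to 1
probabilities = model.predict_proba(X_test)
print(probabilities[:3].round(3))

# The predicted label is the class with the highest probability
predicted = model.classes_[np.argmax(probabilities, axis=1)]
print(predicted[:3], model.predict(X_test)[:3])
```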


**Feel free to try other classifiers!**

https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

**Precision and Recall**

Accuracy shouldn't be the only metric considered.

*Precision-* what proportion of positive identifications were actually correct. 

*Recall-* what proportion of actual positives were identified correctly

Useful link: https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall

In [0]:
# Accuracy isn't everything: also compute precision and recall,
# macro-averaged (unweighted mean) over the three classes.
knn_precision, knn_recall, knn_fbeta_score, knn_support=precision_recall_fscore_support(y_test, knn_prediction, average='macro')
print(knn_precision)
print(knn_recall)

lr_precision, lr_recall, lr_fbeta_score, lr_support=precision_recall_fscore_support(y_test, logistic_regression_classifier_prediction, average='macro')
print(lr_precision)
print(lr_recall)
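
Both metrics can be read off a confusion matrix. The worked example below uses made-up labels (`y_true` and `y_pred` are illustrative only) to compute per-class precision and recall by hand and compare them with scikit-learn's values.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

correct = np.diag(cm)                 # correctly classified count per class
precision = correct / cm.sum(axis=0)  # divide by how often each class was predicted
recall = correct / cm.sum(axis=1)     # divide by how often each class actually occurs
print(precision, recall)

# scikit-learn reports the same per-class values when average=None
sk_precision, sk_recall, _, _ = precision_recall_fscore_support(
    y_true, y_pred, average=None)

# Macro averaging, as used above, is the unweighted mean over classes
print(precision.mean(), recall.mean())
```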


In [0]:
# Work on next tabular dataset. 
breast_cancer=datasets.load_breast_cancer()
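
As a starting point for this dataset, the sketch below inspects it the same way the iris dataset was inspected earlier. It only loads and prints; no model is trained here.

```python
from sklearn import datasets

breast_cancer = datasets.load_breast_cancer()

# Same structure as iris: a feature matrix, integer labels and their names
print(breast_cancer.data.shape)
print(breast_cancer.target_names)

# There are many more features than in iris, so print only the first few names
print(breast_cancer.feature_names[:5])
```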