# <font color='289C4E'>Part A: Classification</font><a class='anchor' id='top'></a>

### <font color='289C4E'>A1.1</font><a class='anchor' id='top'></a>

Supervised machine learning is a type of machine learning in which the algorithm builds a model from both input and output data. The model learns to predict the output from the input through either classification or regression: classification predicts a category, whereas regression predicts a continuous value. Supervised learning needs both inputs and outputs so that the algorithm knows the correct answer for each set of inputs during training. This differs from unsupervised learning, which is given only input data and looks for structure in it, for example by clustering the datapoints into groups.

Labelled data consists of values for both the input and the output variables. The algorithm attempts to map the inputs to the output (the label), and the labels give it the ground truth, that is, the correct answer, for each training example. Labels can be categorical, such as cat or dog, or continuous, such as a dollar figure.

The training and testing datasets are both drawn from the given dataset. Generally, a model is trained on one portion of the data, such as 75%, which the algorithm uses to fit the model. The remaining testing data is withheld from training and used for evaluation: we feed the model the test inputs and compare its predictions with the real outputs to measure how accurately it predicts unseen data.

### <font color='289C4E'>A1.2</font><a class='anchor' id='top'></a>

In [26]:
# Import statement and reading file and storing it in variable.
# Seeing head of dataset to see format
import pandas as pd
dataset = pd.read_csv('Essay-Features.csv')
dataset.head()

Unnamed: 0,essayid,chars,words,commas,apostrophes,punctuations,avg_word_length,sentences,questions,avg_word_sentence,POS,POS/total_words,prompt_words,prompt_words/total_words,synonym_words,synonym_words/total_words,unstemmed,stemmed,score
0,1457,2153,426,14,6,0,5.053991,16,0,26.625,423.995272,0.995294,207,0.485915,105,0.246479,424,412,4
1,503,1480,292,9,7,0,5.068493,11,0,26.545455,290.993103,0.996552,148,0.506849,77,0.263699,356,345,4
2,253,3964,849,19,26,1,4.669022,49,2,17.326531,843.990544,0.9941,285,0.335689,130,0.153121,750,750,4
3,107,988,210,8,7,0,4.704762,12,0,17.5,207.653784,0.988828,112,0.533333,62,0.295238,217,209,3
4,1450,3139,600,13,8,0,5.231667,24,1,25.0,594.65215,0.991087,255,0.425,165,0.275,702,677,4


In [27]:
# Storing the independent variables in X and the dependent variable (label) in y
# essayid (column 0) is left out since the feature will not be used for model training.
# The slice end is exclusive, so 1:18 covers columns 1 to 17 (chars to stemmed).
X = dataset.iloc[:, 1:18].values # chars, words, commas, apostrophes, punctuations, avg_word_length, sentences, questions, avg_word_sentence, POS, POS/total_words, prompt_words, prompt_words/total_words, synonym_words, synonym_words/total_words, unstemmed, stemmed
y = dataset.iloc[:, 18].values # score

### <font color='289C4E'>A1.3</font><a class='anchor' id='top'></a>

In [28]:
# Import statement
from sklearn.model_selection import train_test_split
# Assigning 25% of data to be reserved for testing and using 75% of data for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

### <font color='289C4E'>A2.1</font><a class='anchor' id='top'></a>

Binary classification assigns each datapoint to one of two possible classes. The output is generally true or false, indicating the presence or absence of something. Examples include spam detection, medical diagnosis and fraud detection.

Multi-class classification divides the data into more than two classes or categories. Examples include image classification, language processing, handwriting recognition and species identification. Based on its input variables, each datapoint is assigned to one of the labels (the possible outputs).

### <font color='289C4E'>A2.2a</font><a class='anchor' id='top'></a>

Normalising data, or feature scaling, is used when the raw features have values on very different scales. A model that is sensitive to the magnitude of its inputs will then implicitly give more weight to features with larger values, regardless of how informative they are. This particularly affects objective functions built on distances, such as classifiers that calculate the Euclidean distance between two datapoints: if one feature has a much wider range of values than the others, it will dominate the distance. Normalising the data ensures that each feature contributes proportionately to the final outcome.
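
The effect on Euclidean distance can be illustrated with a small sketch. The numbers below are made up for illustration (a character count next to an average word length) and are not taken from the essay dataset; `euclidean` is a helper defined only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is a character count, column 1 an average word length.
X_demo = np.array([
    [1000.0, 4.0],
    [1100.0, 6.0],
    [3000.0, 4.1],
    [2900.0, 5.9],
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distance between the first two rows: the difference of 100 characters
# swamps the difference of 2 in average word length.
raw_distance = euclidean(X_demo[0], X_demo[1])

# After standardisation each column has mean 0 and standard deviation 1,
# so both features contribute on a comparable scale.
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_distance = euclidean(X_scaled[0], X_scaled[1])

print(raw_distance, scaled_distance)
```

On the raw values the distance is almost entirely determined by the first column, even though the second column differs by half of its range.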

### <font color='289C4E'>A2.2b</font><a class='anchor' id='top'></a>

In [29]:
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Standardising the explanatory variables to zero mean and unit variance so the effect of each variable is proportionate
# The scaler is fitted on the training data only and then applied unchanged to the test data
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
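
The difference between `fit_transform` and `transform` matters whenever new data is scaled. The sketch below uses made-up single-feature arrays (`train` and `new_data` are illustrative names, not variables from this notebook) to show that reusing the fitted scaler and fitting a fresh one give different results.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data.
train = np.array([[0.0], [10.0]])      # mean 5, standard deviation 5
new_data = np.array([[10.0], [20.0]])  # mean 15, standard deviation 5

# Fit on the training data, then apply the same scaling to the new data.
scaler = StandardScaler().fit(train)
consistent = scaler.transform(new_data)            # computes (x - 5) / 5

# Fitting a fresh scaler on the new data uses its own mean and deviation.
refit = StandardScaler().fit_transform(new_data)   # computes (x - 15) / 5

print(consistent.ravel(), refit.ravel())
```

Only the first variant places new datapoints on the scale the model saw during training.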

### <font color='289C4E'>A2.3a</font><a class='anchor' id='top'></a>

A support vector machine (SVM) is a supervised machine learning algorithm used primarily for classification, although it can be adapted for regression. It tries to find a hyperplane that separates the data into its classes, acting as a dividing wall between the categories of datapoints.

In the simplest case, datapoints lying on a curve in a two-dimensional plane can be separated with a straight line, with the data in the middle falling into one category and the data on the outside into the other in binary classification. With more input features, the same idea applies in a higher-dimensional space that cannot be visualised, where the separating boundary is a hyperplane.

In SVM, the datapoints nearest to the hyperplane are called support vectors. The algorithm tries to maximise the distance between the hyperplane and these support vectors, known as the margin, so that the classes are as distinct as possible and new datapoints are classified accurately. This is called margin maximisation.
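
As a minimal sketch of margin maximisation, the example below fits a linear SVM to six made-up, linearly separable points. The data and the very large `C` (used here to approximate a hard margin) are assumptions for illustration only. For a linear SVM the margin width is 2 / ||w||, where w is the weight vector of the hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: class 0 on the left, class 1 on the right.
X_toy = np.array([
    [-2.0, 0.0], [-3.0, 1.0], [-3.0, -1.0],
    [2.0, 0.0], [3.0, 1.0], [3.0, -1.0],
])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin (no misclassification tolerated).
clf = SVC(kernel='linear', C=1e6).fit(X_toy, y_toy)

# The support vectors are the points closest to the separating hyperplane.
support_vectors = clf.support_vectors_

# Width of the margin between the two classes.
w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)

print(support_vectors)
print(margin_width)
```

Here the closest points of the two classes sit at x = -2 and x = 2, so the widest possible gap between the classes is the distance between them.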

### <font color='289C4E'>A2.3b</font><a class='anchor' id='top'></a>

As stated earlier, data with several variables lives in a higher-dimensional space, and classes that cannot be separated linearly in the original space may become separable in an even higher-dimensional one. A kernel aims to map the data into such a space. Kernels allow SVM to do this implicitly: the kernel function returns the inner product between two datapoints as if they had already been transformed, so the algorithm can find a separating hyperplane without explicitly calculating the transformed coordinates, which would take a lot of computational power in a high-dimensional space.

Some examples of kernels used are the linear kernel, polynomial, radial basis function and sigmoid kernel.
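
The implicit mapping can be checked numerically for a simple case. The sketch below uses a simplified degree-2 polynomial kernel, k(x, z) = (x · z)^2, which is an assumption for illustration (scikit-learn's `poly` kernel also includes `gamma` and `coef0` terms). It compares the kernel value with the inner product of an explicit feature map `phi`, a helper defined only for this example.

```python
import numpy as np

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: (x . z)^2."""
    return float(np.dot(x, z) ** 2)

def phi(x):
    """Explicit feature map of the same kernel for two-dimensional input."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

# Kernel computed directly in the original two-dimensional space.
kernel_value = poly2_kernel(x, z)

# The same quantity computed by first mapping both points to three dimensions.
explicit_value = float(np.dot(phi(x), phi(z)))

print(kernel_value, explicit_value)
```

Both routes give the same number, but the kernel never constructs the three-dimensional vectors, which is what keeps the computation cheap when the implied space is very large.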

### <font color='289C4E'>A2.3c</font><a class='anchor' id='top'></a>

In [30]:
# Finding the Pearson's correlation for each feature to score
dataset.iloc[:, 0:19].corr(method='pearson')['score']

essayid                      0.033463
chars                        0.683983
words                        0.662091
commas                       0.525055
apostrophes                  0.322052
punctuations                 0.157976
avg_word_length              0.327814
sentences                    0.230895
questions                    0.277392
avg_word_sentence           -0.113036
POS                          0.662823
POS/total_words              0.311555
prompt_words                 0.641119
prompt_words/total_words     0.026646
synonym_words                0.578352
synonym_words/total_words   -0.305405
unstemmed                    0.697187
stemmed                      0.696776
score                        1.000000
Name: score, dtype: float64

In [31]:
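# Take out variables that have low correlation to score, such as essayid, so that the model can be more accurate
# essayid is also not needed since it tells us nothing about the score the student will get, hence remove it
# The index list below drops essayid (0), punctuations (5), prompt_words (12) and stemmed (17).
# Note that prompt_words and stemmed are strongly correlated with score (0.64 and 0.70),
# whereas prompt_words/total_words (index 13, correlation 0.03) is kept.
X = dataset.iloc[:, [1,2,3,4,6,7,8,9,10,11,13,14,15,16]].values
y = dataset.iloc[:, 18].values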
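
Selecting features by name from the correlation values, rather than by hand-written column indices, is one way to reduce the risk of picking the wrong column. The following is a sketch on a made-up DataFrame; the column names, the data and the `threshold` of 0.3 are all hypothetical and not the values used in this notebook.

```python
import pandas as pd

# Hypothetical toy data with one strongly positive, one uncorrelated and
# one strongly negative feature.
toy = pd.DataFrame({
    'strong':   [2, 4, 6, 8, 10],
    'weak':     [3, 1, 3, 1, 3],
    'negative': [10, 8, 6, 4, 2],
    'score':    [1, 2, 3, 4, 5],
})

threshold = 0.3  # hypothetical cut-off on the absolute correlation

# Pearson correlation of every feature with the label.
correlations = toy.corr(method='pearson')['score'].drop('score')

# Keep features whose absolute correlation reaches the threshold, so that
# strongly negative correlations are kept as well.
selected = correlations[correlations.abs() >= threshold].index.tolist()
X_toy = toy[selected].values

print(selected)
```

Using the absolute value matters because a feature with a strong negative correlation is as informative as one with a strong positive correlation.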

In [32]:
# Separate data into training and testing data again
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

In [33]:
# Feature scaling explanatory variables again
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [34]:
# Import code
from sklearn.svm import SVC

In [35]:
# Testing which kernel yields highest accuracy by looping over example kernels and seeing which has highest accuracy
# In this case, linear kernel has highest accuracy
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    svm_classifier = SVC(kernel = kernel)
    svm_classifier.fit(X_train, y_train)
    print('Accuracy for kernel:', kernel, 'is', svm_classifier.score(X_test,y_test))

Accuracy for kernel: linear is 0.6666666666666666
Accuracy for kernel: poly is 0.6576576576576577
Accuracy for kernel: rbf is 0.6636636636636637
Accuracy for kernel: sigmoid is 0.6186186186186187


In [36]:
# Seeing which value for C gives the highest accuracy. C sets the penalty for misclassification:
# a small C gives a softer margin and a large C gives a harder margin
# Choose C=1
for C in [0.1,1,10,100]:
    svm_classifier = SVC(kernel = 'linear', C=C)
    svm_classifier.fit(X_train, y_train)
    print('Accuracy for C:', C, 'is', svm_classifier.score(X_test,y_test))

Accuracy for C: 0.1 is 0.6576576576576577
Accuracy for C: 1 is 0.6666666666666666
Accuracy for C: 10 is 0.6606606606606606
Accuracy for C: 100 is 0.6546546546546547


In [37]:
# Seeing which degree of polynomial will yield highest accuracy
# Although degree 5 poly is higher than linear, it may be overfitting
# Since accuracy increase is not much, I chose to stick with linear
for degree in [1,2,3,4,5,6,7,8,9,10]:
    svm_classifier = SVC(kernel = 'poly', degree = degree)
    svm_classifier.fit(X_train, y_train)
    print('Accuracy for degree:', degree, 'is', svm_classifier.score(X_test,y_test))

Accuracy for degree: 1 is 0.6546546546546547
Accuracy for degree: 2 is 0.5375375375375375
Accuracy for degree: 3 is 0.6576576576576577
Accuracy for degree: 4 is 0.5675675675675675
Accuracy for degree: 5 is 0.6816816816816816
Accuracy for degree: 6 is 0.5795795795795796
Accuracy for degree: 7 is 0.5765765765765766
Accuracy for degree: 8 is 0.5015015015015015
Accuracy for degree: 9 is 0.5465465465465466
Accuracy for degree: 10 is 0.4804804804804805


In [38]:
# Performing the model building using linear kernel and C=1
svm_classifier = SVC(
    kernel = 'linear', 
    C=1
)
# Fitting the training data to model
svm_classifier.fit(X_train, y_train)
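
The kernel and `C` above were chosen by comparing accuracy on the test data. A common alternative, shown here only as a sketch and not as the method used in this notebook, is to choose hyperparameters with cross-validation on the training data and keep the test data for a single final check. The example uses synthetic data from `make_classification`; the grid values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, not the essay dataset.
X_syn, y_syn = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

# Scaling inside the pipeline means it is re-fitted on each training fold.
pipeline = make_pipeline(StandardScaler(), SVC())

# Hypothetical grid; make_pipeline names the SVC step 'svc'.
param_grid = {
    'svc__kernel': ['linear', 'rbf'],
    'svc__C': [0.1, 1, 10],
}

# 5-fold cross-validation on the training data only.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)

# The test data is used once, after the hyperparameters are fixed.
test_accuracy = search.score(X_te, y_te)
print(test_accuracy)
```

Because the test data plays no part in the selection, the final accuracy is a less optimistic estimate of performance on unseen data.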

### <font color='289C4E'>A2.4</font><a class='anchor' id='top'></a>

In [39]:
# Import statement for random forest
from sklearn.ensemble import RandomForestClassifier
# Performing model building using entropy criterion
forest_classifier = RandomForestClassifier(
    criterion = 'entropy',
    random_state = 0
)
# Fitting the training data to model
forest_classifier.fit(X_train, y_train)

### <font color='289C4E'>A3.1</font><a class='anchor' id='top'></a>

In [40]:
# Predicting the results based off of test subset using SVM
y_pred_svm = svm_classifier.predict(X_test)
y_pred_svm

array([4, 3, 3, 2, 3, 4, 4, 4, 2, 4, 4, 3, 4, 3, 3, 4, 3, 4, 4, 4, 3, 3,
       3, 3, 3, 4, 4, 4, 3, 3, 4, 3, 3, 4, 4, 3, 4, 4, 3, 3, 4, 2, 3, 3,
       3, 3, 3, 3, 2, 3, 4, 3, 4, 4, 4, 4, 4, 3, 3, 4, 3, 4, 3, 3, 4, 4,
       4, 4, 4, 4, 3, 3, 4, 4, 4, 3, 3, 4, 4, 3, 4, 3, 4, 3, 3, 4, 3, 3,
       4, 4, 3, 4, 3, 3, 1, 3, 3, 4, 4, 2, 4, 3, 3, 4, 4, 2, 4, 4, 3, 4,
       2, 3, 3, 3, 3, 4, 3, 4, 3, 3, 3, 3, 4, 4, 3, 3, 3, 4, 4, 4, 4, 4,
       4, 3, 4, 3, 3, 4, 4, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 4, 4, 3, 4, 4,
       4, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 2, 4, 3,
       4, 4, 4, 2, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 4, 3, 3, 2, 3, 4, 4,
       3, 4, 3, 4, 4, 3, 3, 4, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 3, 3,
       4, 4, 4, 4, 4, 3, 3, 4, 3, 3, 3, 3, 4, 3, 3, 4, 4, 4, 4, 4, 4, 4,
       3, 3, 3, 4, 4, 3, 4, 4, 3, 4, 3, 4, 4, 4, 3, 2, 2, 3, 3, 3, 4, 3,
       3, 3, 4, 4, 4, 3, 3, 4, 3, 4, 4, 3, 3, 3, 4, 3, 4, 4, 2, 3, 4, 3,
       3, 3, 4, 4, 4, 4, 3, 4, 4, 3, 4, 4, 4, 3, 3,

In [41]:
# Predicting the results based off of test subset using Random forest
y_pred_forest = forest_classifier.predict(X_test)
y_pred_forest

array([4, 3, 3, 2, 3, 4, 4, 4, 2, 4, 4, 3, 4, 3, 3, 4, 3, 4, 4, 4, 3, 3,
       3, 3, 3, 4, 4, 4, 3, 4, 4, 3, 3, 4, 4, 2, 4, 4, 3, 3, 4, 2, 4, 3,
       3, 3, 3, 3, 2, 3, 4, 3, 4, 4, 4, 4, 4, 3, 3, 4, 4, 4, 3, 3, 4, 4,
       4, 4, 4, 4, 3, 3, 4, 5, 4, 3, 3, 4, 4, 4, 4, 3, 4, 3, 3, 4, 4, 3,
       4, 4, 3, 4, 4, 3, 3, 4, 3, 4, 3, 2, 4, 3, 3, 4, 4, 3, 4, 4, 3, 4,
       3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4, 4, 4,
       4, 3, 4, 3, 3, 4, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 4, 4, 4, 3, 4, 4,
       4, 3, 2, 3, 3, 3, 3, 4, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 4, 3,
       4, 4, 4, 2, 3, 4, 4, 3, 3, 4, 3, 3, 3, 3, 3, 4, 3, 3, 1, 3, 4, 4,
       3, 4, 3, 4, 4, 3, 3, 4, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 3, 3,
       4, 4, 4, 4, 4, 3, 2, 4, 3, 3, 4, 3, 4, 3, 3, 4, 4, 4, 4, 4, 4, 4,
       3, 3, 3, 4, 4, 3, 4, 4, 3, 4, 3, 4, 4, 4, 2, 2, 2, 3, 3, 3, 4, 3,
       3, 3, 4, 4, 5, 3, 3, 4, 3, 4, 4, 3, 4, 3, 4, 3, 4, 4, 2, 3, 4, 3,
       3, 3, 4, 4, 4, 4, 3, 4, 3, 3, 4, 4, 4, 3, 3,

### <font color='289C4E'>A3.2</font><a class='anchor' id='top'></a>

These 6x6 confusion matrices show which outputs were correctly and incorrectly identified. With `confusion_matrix(y_test, y_pred)`, each row corresponds to an actual score and each column to a predicted score. The diagonal entries count the testing datapoints that were correctly identified, because there the actual and the predicted score agree. The off-diagonal entries show which datapoints were incorrectly identified and what they were identified as. For example, the third row covers the datapoints whose actual score was 3, and the fourth entry of that row is the number of datapoints the model predicted as 4 but were actually 3 (37 for the SVM and 40 for the random forest).

In [42]:
# Import statement for confusion matrix
from sklearn.metrics import confusion_matrix
# Creating confusion matrix based off test data and predicted outputs
cm_svm = confusion_matrix(y_test, y_pred_svm)
cm_svm

array([[  0,   2,   0,   0,   0,   0],
       [  1,   9,  13,   0,   0,   0],
       [  0,   2, 108,  37,   0,   0],
       [  0,   0,  39, 105,   0,   0],
       [  0,   0,   1,  15,   0,   0],
       [  0,   0,   0,   1,   0,   0]], dtype=int64)

In [43]:
# Making the Confusion Matrix
# Creating confusion matrix based off test data and predicted outputs
cm_forest = confusion_matrix(y_test, y_pred_forest)
cm_forest

array([[  1,   1,   0,   0,   0,   0],
       [  0,  11,  11,   1,   0,   0],
       [  0,   1, 106,  40,   0,   0],
       [  0,   0,  34, 109,   1,   0],
       [  0,   0,   1,  14,   1,   0],
       [  0,   0,   0,   1,   0,   0]], dtype=int64)
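
The orientation of the matrix (rows for actual labels, columns for predicted labels) can be confirmed on a tiny example. The labels below are made up for illustration and are not taken from the essay data.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted scores for four essays.
y_true_demo = [2, 3, 3, 4]
y_pred_demo = [3, 3, 4, 4]
labels = [2, 3, 4]

# cm_demo[i, j] counts datapoints whose actual label is labels[i]
# and whose predicted label is labels[j].
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

actual_2_predicted_3 = cm_demo[0, 1]
actual_3_predicted_4 = cm_demo[1, 2]

print(cm_demo)
```

The single essay with actual score 2 was predicted as 3, so it is counted in the first row (actual 2) and the second column (predicted 3).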

### <font color='289C4E'>A3.3</font><a class='anchor' id='top'></a>

In [44]:
# Finding the score of svm for classifying testing data
svm_classifier.score(X_test,y_test)

0.6666666666666666

In [45]:
# Finding the score of random forest model for classifying testing data
forest_classifier.score(X_test,y_test)

0.6846846846846847

In [46]:
# For svm model
# Can also be found through confusion matrix
# Finds the sum of diagonal and divides by total
i = 0
correctly_predicted = 0
total = 0
for row in cm_svm:
    # Finding sum of diagonal
    correctly_predicted += row[i]
    i += 1
    # Finding total
    for value in row:
        total += value
print(correctly_predicted/total)

0.6666666666666666


In [47]:
# For random forest model
# Finds the sum of diagonal and divides by total
i = 0
correctly_predicted = 0
total = 0
for row in cm_forest:
    # Finding sum of diagonal
    correctly_predicted += row[i]
    i += 1
    # Finding total
    for value in row:
        total += value
print(correctly_predicted/total)

0.6846846846846847
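
The diagonal-sum calculation can also be written compactly with NumPy. The sketch below re-enters the SVM confusion matrix printed earlier as a literal array so that it runs on its own; `cm` and `accuracy` are names chosen for this example.

```python
import numpy as np

# SVM confusion matrix copied from the output above.
cm = np.array([
    [0, 2,   0,   0, 0, 0],
    [1, 9,  13,   0, 0, 0],
    [0, 2, 108,  37, 0, 0],
    [0, 0,  39, 105, 0, 0],
    [0, 0,   1,  15, 0, 0],
    [0, 0,   0,   1, 0, 0],
])

# np.trace sums the diagonal (correct predictions); cm.sum() is the total
# number of test datapoints.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```

The diagonal sums to 222 out of 333 test datapoints, which is the same two-thirds accuracy reported by `score`.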


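Based on the accuracy of the two models' predictions on the testing data, neither model performs strongly: the SVM classifies about 66.7% of the test essays correctly and the random forest about 68.5%. In both confusion matrices, most of the errors are confusions between scores 3 and 4. The random forest performs slightly better, since its predictions on the testing data were more accurate.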

### <font color='289C4E'>A4.1</font><a class='anchor' id='top'></a>

In [48]:
# Reading data and storing, also seeing shape of data to see number of rows.
dataset_sample = pd.read_csv('Essay-Features-Submission.csv')
dataset_sample.shape

(199, 18)

### <font color='289C4E'>A4.2</font><a class='anchor' id='top'></a>

In [49]:
# Putting the features used in the model training into the test variable
test = dataset_sample.iloc[:,[1,2,3,4,6,7,8,9,10,11,13,14,15,16]].values
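# Scaling the features with StandardScaler
# Note: fit_transform fits a new scaler on the submission data, so the scaling is not identical
# to the one learned from X_train; calling transform on the scaler fitted in training would
# apply exactly the training scaling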
sc = StandardScaler()
test = sc.fit_transform(test)
# Predicting using random forest classifier using test data
prediction = forest_classifier.predict(test)
prediction

array([4, 3, 4, 4, 4, 4, 3, 3, 3, 2, 3, 4, 4, 3, 4, 4, 4, 4, 3, 3, 3, 3,
       3, 4, 4, 3, 3, 4, 4, 3, 2, 4, 3, 3, 4, 3, 4, 4, 3, 3, 3, 4, 3, 3,
       1, 3, 3, 4, 4, 3, 3, 4, 4, 4, 3, 4, 3, 4, 4, 4, 2, 3, 3, 4, 3, 4,
       4, 4, 3, 4, 3, 4, 4, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 3, 4, 2, 4, 4,
       2, 3, 3, 3, 4, 3, 4, 3, 4, 4, 4, 3, 3, 5, 2, 3, 3, 4, 3, 3, 4, 4,
       4, 3, 3, 4, 4, 3, 2, 4, 2, 3, 3, 4, 4, 3, 3, 4, 3, 4, 3, 2, 3, 4,
       2, 3, 4, 4, 3, 3, 1, 4, 4, 4, 4, 3, 3, 4, 4, 3, 4, 4, 4, 4, 4, 3,
       4, 3, 3, 3, 4, 4, 4, 3, 2, 3, 4, 4, 3, 3, 2, 3, 3, 4, 3, 4, 4, 4,
       4, 3, 3, 4, 4, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 3,
       3], dtype=int64)

Results added to 'A4 predictions.csv'