<h2>Algorithm Evaluation and Sentiment Predictions</h2>
<p>This script evaluates six classification algorithms on sentiment analysis of Amazon reviews.</p>
<p>The classification algorithms for evaluation include:</p>
<ul>
    <li>Linear Support Vector Machine (SVM)</li>
    <li>Feed Forward (Neural Network)</li>
    <li>Naïve Bayes</li>
    <li>Random Forest</li>
    <li>Decision Tree</li>
    <li>K-Nearest Neighbours (K-NN)</li>
</ul>
<p>Each algorithm is run with a range of modifier (hyperparameter) values, and the precision, recall and f-score are recorded for every run so the settings can be compared.</p>

<h3>Import Modules</h3>
<p>This block imports the required modules: pandas for handling CSV files and DataFrames, and sklearn for the classification algorithms, the metrics and the conversion of words to features. itemgetter is used to select a specific element as the key when sorting lists.</p>
<p>The <i>all_data</i> variable is a list, which will contain all of the outputs from the algorithm classifications, including: </p>
<ul>
    <li><b>[0]</b> - Algorithm Name</li>
    <li><b>[1]</b> - Modifier Name</li>
    <li><b>[2]</b> - Modifier Value</li>
    <li><b>[3]</b> - List containing the predictions</li>
    <li><b>[4]</b> - Precision</li>
    <li><b>[5]</b> - Recall</li>
    <li><b>[6]</b> - F-Score</li>
</ul>

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter 
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_recall_fscore_support
from operator import itemgetter

all_data = []
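
<p>As an illustration of the record layout listed above, the sketch below builds one entry and reads it back by position. The values are made up for the example, and the named index constants are a hypothetical convenience that the script itself does not define.</p>

```python
# Hypothetical record in the layout used by all_data; every value is made up.
record = ['SVM Linear', 'C Value', 0.1, [1, 0, 1], 0.75, 0.74, 0.745]

# Named positions (an assumption for readability; the script uses bare integers).
ALGORITHM, MODIFIER_NAME, MODIFIER_VALUE, PREDICTIONS, PRECISION, RECALL, F_SCORE = range(7)

all_data_example = [record]
latest = all_data_example[-1]

# Pull out the fields that the evaluate_* functions print after each run.
summary = (latest[ALGORITHM], latest[MODIFIER_VALUE], latest[F_SCORE])
print(summary)
```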

<h3>load_amazon_dataset (input_csv)</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>input_csv</b> - [string] filename of the file containing the Amazon reviews</li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>sentiment_scores</b> - [list] Actual class of the review</li>
            <li><b>product_ids</b> - [list] Product ID the review relates to</li>
            <li><b>text_reviews</b> - [list] Text of the review</li>
        </ul>
    </li>
</ul>
<h4>What does this method do?</h4>
<ol>
    <li>The <i>load_amazon_dataset()</i> method is used to extract the Amazon reviews from a file and convert them into a DataFrame</li>
    <li>Once this is complete, three columns of the DataFrame are returned</li>
   </ol>

In [None]:
def load_amazon_dataset(input_csv):
    df = pd.read_csv(input_csv,
                    delimiter='\t',
                    header=None)
    sentiment_scores = df[0]
    product_ids = df[1]
    text_reviews = df[2]
    return sentiment_scores, product_ids, text_reviews
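
<p>The sketch below shows how the same <i>read_csv</i> call behaves on a small in-memory sample. The two rows, the product IDs and the 0/1 labels are invented for illustration; the real files may use different label values.</p>

```python
import io

import pandas as pd

# Two made-up rows in the assumed layout: score <TAB> product id <TAB> review text
sample_tsv = (
    "1\tB00EXAMPLE1\tGreat app, works well\n"
    "0\tB00EXAMPLE2\tCrashes on start\n"
)

# header=None means the columns are labelled 0, 1, 2 rather than by a header row
df = pd.read_csv(io.StringIO(sample_tsv), delimiter='\t', header=None)

sentiment_scores, product_ids, text_reviews = df[0], df[1], df[2]
print(list(sentiment_scores))
print(text_reviews[0])
```

<p>Because the delimiter is a tab, commas inside the review text do not split the row.</p>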

<h3>Obtaining the Amazon Reviews</h3>
<h4>What does this block do?</h4>
<ol>
    <li>This block defines the file paths for the training and test Amazon reviews (<i>FILE_PATH_TO_TRAINING_REVIEWS</i> and <i>FILE_PATH_TO_TEST_REVIEWS</i> respectively)</li>
    <li>Calls the <i>load_amazon_dataset</i> twice, once for training reviews, once for test reviews. Results are stored in <i>&ast;_sentiment_scores</i>, <i>&ast;_product_ids</i> and <i>&ast;_text_reviews</i></li>
    <li>Prints the length of the training and test text reviews</li>
    <li>Prints the text and sentiment of the first review for training and test datasets</li>
   </ol>

In [None]:
FILE_PATH_TO_TRAINING_REVIEWS = 'Review Files/reviews_Apps_for_Android_5.training.txt'
FILE_PATH_TO_TEST_REVIEWS = 'Review Files/reviews_Apps_for_Android_5.test.txt'

training_sentiment_scores, training_product_ids, training_text_reviews = load_amazon_dataset(input_csv=FILE_PATH_TO_TRAINING_REVIEWS)
test_sentiment_scores, test_product_ids, test_text_reviews = load_amazon_dataset(input_csv=FILE_PATH_TO_TEST_REVIEWS)
print('Loaded ', len(training_text_reviews), ' training reviews')
print('Loaded ', len(test_text_reviews), ' test reviews') 
print(training_text_reviews[0], '--sent. label--', training_sentiment_scores[0])
print(test_text_reviews[0], '--sent. label--', test_sentiment_scores[0])

<h3>Converting Review Text to Bag of Words Features</h3>
<h4>What does this block do?</h4>
<ol>
    <li>This block transforms the text of each training review to bag of words features for analysis</li>
    <li>Transforms the text of each test review to bag of words features using the vocabulary learned from the training reviews</li>
   </ol>

In [None]:
training_vectorizer = CountVectorizer()
training_vectorizer.fit(training_text_reviews)
training_instances_bow = training_vectorizer.transform(training_text_reviews)

# Reuse the vectorizer fitted on the training reviews so the test features
# share the training vocabulary and column order (no second fit is needed)
test_instances_bow = training_vectorizer.transform(test_text_reviews)
print('Finished converting text reviews into bow')
print('Generated', training_instances_bow.shape[1], ' bow features')
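print('Below are the first 1,000 bow features')
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
# The slice [:1000] returns exactly 1,000 items ([0:999] returned only 999).
print(training_vectorizer.get_feature_names_out()[:1000])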
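
<p>To make the train/test relationship concrete, here is a minimal sketch on a made-up corpus. It fits the vocabulary on the training documents only and shows that a word unseen in training ("unknown") is simply dropped from the test features.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, for illustration only
train_docs = ["good app", "bad app", "good good game"]
test_docs = ["good unknown app"]

vectorizer = CountVectorizer()
train_bow = vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_bow = vectorizer.transform(test_docs)        # reuses it, learns nothing new

# Columns are ordered alphabetically by term
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(test_bow.toarray())
```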

<h3>svm_linear_classification (c_value)</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>c_value</b> - [float] Regularisation parameter <i>C</i> passed to LinearSVC</li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>predicted_test_sentiment_scores</b> - [list] The predictions made by the classification algorithm</li>
            <li><b>precision_recall_fscore_support()</b> - [tuple] Metrics of the classification algorithm's performance
                <ul>
                    <li>This function returns a tuple containing the weighted-average precision, recall and f-score; the fourth element (support) is <i>None</i> when an average is requested</li>
                </ul>
            </li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>The svm_linear_classification() method is used to implement the Linear Support Vector Machine classification algorithm with the C value modified to that provided by the parameter <i>c_value</i>.</li>
    <li>The classifier variable is trained using the text and sentiment class of the training dataset</li>
    <li>The trained classifier variable is then used to predict the sentiment class of each review in the test dataset</li>
    <li>The predictions and metrics are then returned</li>
</ol>

In [None]:
def svm_linear_classification(c_value):
    print("SVM Linear", c_value)
    classifier = LinearSVC(C=c_value)
    classifier.fit(X=training_instances_bow, y=training_sentiment_scores)
    print('Finished Training')
    print('Predicting Test Instances')
    predicted_test_sentiment_scores = classifier.predict(test_instances_bow)
    return predicted_test_sentiment_scores, precision_recall_fscore_support(test_sentiment_scores, predicted_test_sentiment_scores, average='weighted')
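
<p>The second return value of each classification function comes from <i>precision_recall_fscore_support()</i>. The sketch below runs that call on four made-up labels to show the shape of the result with <i>average='weighted'</i>.</p>

```python
from sklearn.metrics import precision_recall_fscore_support

# Made-up labels: one positive review is misclassified as negative
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]

precision, recall, f_score, support = precision_recall_fscore_support(
    y_true, y_pred, average='weighted'
)

# With an average requested, the per-class support is not returned
print(round(precision, 4), round(recall, 4), round(f_score, 4), support)
```

<p>This is why the evaluate functions only read elements [0] to [2] of the returned metrics.</p>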

<h3>feed_forward_classification (layers, iterations)</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>layers</b> - [list] A list, with each element in the list representing a layer, and the value of the element representing the number of nodes in the layer</li>
            <li><b>iterations</b> - [integer] Maximum number of training iterations, passed to MLPClassifier as <i>max_iter</i>; training stops earlier if the model converges</li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>predicted_test_sentiment_scores</b> - [list] The predictions made by the classification algorithm</li>
            <li><b>precision_recall_fscore_support()</b> - [tuple] Metrics of the classification algorithm's performance
                <ul>
                    <li>This function returns a tuple containing the weighted-average precision, recall and f-score; the fourth element (support) is <i>None</i> when an average is requested</li>
                </ul>
            </li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>The feed_forward_classification() method is used to implement the Feed Forward classification algorithm with the layers and iterations values modified to those provided by the parameters <i>layers</i> and <i>iterations</i>.</li>
    <li>The classifier variable is trained using the text and sentiment class of the training dataset</li>
    <li>The trained classifier variable is then used to predict the sentiment class of each review in the test dataset</li>
    <li>The predictions and metrics are then returned</li>
</ol>

In [None]:
def feed_forward_classification(layers, iterations):
    print("Feed Forward", layers, iterations)    
    classifier = MLPClassifier(hidden_layer_sizes=(layers),
                           max_iter=iterations)
    classifier.fit(training_instances_bow, training_sentiment_scores)
    print('Finished Training')
    print('Predicting Test Instances')
    predicted_test_sentiment_scores = classifier.predict(test_instances_bow) 
    return predicted_test_sentiment_scores, precision_recall_fscore_support(test_sentiment_scores, predicted_test_sentiment_scores, average='weighted')
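
<p>As a rough illustration of how the <i>layers</i> list maps onto the network, the sketch below trains a very small MLPClassifier on made-up data and inspects the weight matrices. The data, layer sizes and <i>random_state</i> are assumptions chosen only to keep the example fast and repeatable.</p>

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Made-up data: 4 samples, 2 features, 2 classes
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
layers = [4, 3]  # two hidden layers with 4 and 3 nodes

with warnings.catch_warnings():
    # A low max_iter may stop before convergence; silence that warning here
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    classifier = MLPClassifier(hidden_layer_sizes=tuple(layers),
                               max_iter=50,
                               random_state=0)
    classifier.fit(X, y)

# One weight matrix per connection between consecutive layers
weight_shapes = [weights.shape for weights in classifier.coefs_]
print(weight_shapes)
```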

<h3>naive_bayes_classification (alpha)</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>alpha</b> - [float] Additive smoothing parameter passed to MultinomialNB</li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>predicted_test_sentiment_scores</b> - [list] The predictions made by the classification algorithm</li>
            <li><b>precision_recall_fscore_support()</b> - [tuple] Metrics of the classification algorithm's performance
                <ul>
                    <li>This function returns a tuple containing the weighted-average precision, recall and f-score; the fourth element (support) is <i>None</i> when an average is requested</li>
                </ul>
            </li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>The naive_bayes_classification() method is used to implement the Naïve Bayes classification algorithm with the alpha value modified to that provided by the <i>alpha</i> parameter.</li>
    <li>The classifier variable is trained using the text and sentiment class of the training dataset</li>
    <li>The trained classifier variable is then used to predict the sentiment class of each review in the test dataset</li>
    <li>The predictions and metrics are then returned</li>
</ol>

In [None]:
def naive_bayes_classification(alpha):
    print("Naive Bayes", alpha)
    classifier = MultinomialNB(alpha=alpha)
    classifier.fit(training_instances_bow, training_sentiment_scores)
    print('Finished Training')
    print('Predicting Test Instances')
    predicted_test_sentiment_scores = classifier.predict(test_instances_bow)
    return predicted_test_sentiment_scores, precision_recall_fscore_support(test_sentiment_scores, predicted_test_sentiment_scores, average='weighted')
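
<p>In MultinomialNB, <i>alpha</i> is the additive smoothing parameter. The hand calculation below is a simplified sketch of additive smoothing on made-up word counts for a single class, not the library's implementation; <i>smoothed_probabilities</i> is a hypothetical helper.</p>

```python
def smoothed_probabilities(counts, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * vocabulary size)."""
    total = sum(counts) + alpha * len(counts)
    return [(count + alpha) / total for count in counts]

# Made-up counts of three words within one class; the second word never occurs
word_counts = [3, 0, 1]

light = smoothed_probabilities(word_counts, alpha=0.001)
heavy = smoothed_probabilities(word_counts, alpha=10.0)

# A larger alpha moves the unseen word's probability further from zero
print(light[1], heavy[1])
```

<p>Small values keep the estimates close to the raw frequencies, while large values pull every word towards a uniform probability.</p>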

<h3>random_forest_classification (trees, features)</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>trees</b> - [integer] The number of decision trees to be used in the random forest</li>
            <li><b>features</b> - [integer] The maximum number of features considered when looking for the best split in each tree (passed as <i>max_features</i>)</li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>predicted_test_sentiment_scores</b> - [list] The predictions made by the classification algorithm</li>
            <li><b>precision_recall_fscore_support()</b> - [tuple] Metrics of the classification algorithm's performance
                <ul>
                    <li>This function returns a tuple containing the weighted-average precision, recall and f-score; the fourth element (support) is <i>None</i> when an average is requested</li>
                </ul>
            </li>
        </ul>
    </li>
</ul>

<h4>What does this method do?</h4>
<ol>
    <li>The random_forest_classification() method is used to implement the Random Forest classification algorithm with the trees and features values modified to those provided by the <i>trees</i> and <i>features</i> parameters.</li>
    <li>The classifier variable is trained using the text and sentiment class of the training dataset</li>
    <li>The trained classifier variable is then used to predict the sentiment class of each review in the test dataset</li>
    <li>The predictions and metrics are then returned</li>
</ol>

In [None]:
def random_forest_classification(trees, features):
    print("Random Forest", trees, features)
    classifier = RandomForestClassifier(n_estimators=trees,
                                    max_features=features,
                                    n_jobs=100)
    classifier.fit(training_instances_bow, training_sentiment_scores)
    print('Finished Training')
    print('Predicting Test Instances')
    predicted_test_sentiment_scores = classifier.predict(test_instances_bow)
    return predicted_test_sentiment_scores, precision_recall_fscore_support(test_sentiment_scores, predicted_test_sentiment_scores, average='weighted')

<h3>decision_tree_classification ()</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>NONE</b></li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>predicted_test_sentiment_scores</b> - [list] The predictions made by the classification algorithm</li>
            <li><b>precision_recall_fscore_support()</b> - [tuple] Metrics of the classification algorithm's performance
                <ul>
                    <li>This function returns a tuple containing the weighted-average precision, recall and f-score; the fourth element (support) is <i>None</i> when an average is requested</li>
                </ul>
            </li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>The decision_tree_classification() method is used to implement the Decision Tree classification algorithm with scikit-learn's default settings; no modifier values are varied for this algorithm in this script</li>
    <li>The classifier variable is trained using the text and sentiment class of the training dataset</li>
    <li>The trained classifier variable is then used to predict the sentiment class of each review in the test dataset</li>
    <li>The predictions and metrics are then returned</li>
</ol>

In [None]:
def decision_tree_classification():
    print("Decision Tree")
    classifier = tree.DecisionTreeClassifier()
    classifier.fit(training_instances_bow, training_sentiment_scores)
    print('Finished Training')
    print('Predicting Test Instances')
    predicted_test_sentiment_scores = classifier.predict(test_instances_bow)
    return predicted_test_sentiment_scores, precision_recall_fscore_support(test_sentiment_scores, predicted_test_sentiment_scores, average='weighted')

<h3>k_nn_classification (neighbours)</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>neighbours</b> - [integer] the number of neighbours to be used for evaluation</li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>predicted_test_sentiment_scores</b> - [list] The predictions made by the classification algorithm</li>
            <li><b>precision_recall_fscore_support()</b> - [tuple] Metrics of the classification algorithm's performance
                <ul>
                    <li>This function returns a tuple containing the weighted-average precision, recall and f-score; the fourth element (support) is <i>None</i> when an average is requested</li>
                </ul>
            </li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>The k_nn_classification() method is used to implement the K-Nearest Neighbours classification algorithm with the neighbours value modified to that provided by the <i>neighbours</i> parameter.</li>
    <li>The classifier variable is trained using the text and sentiment class of the training dataset</li>
    <li>The trained classifier variable is then used to predict the sentiment class of each review in the test dataset</li>
    <li>The predictions and metrics are then returned</li>
</ol>

In [None]:
def k_nn_classification(neighbours):
    print("K-NN", neighbours)
    classifier = KNeighborsClassifier(n_neighbors=neighbours)
    classifier.fit(training_instances_bow, training_sentiment_scores)
    print('Finished Training')
    print('Predicting Test Instances')
    predicted_test_sentiment_scores = classifier.predict(test_instances_bow) 
    return predicted_test_sentiment_scores, precision_recall_fscore_support(test_sentiment_scores, predicted_test_sentiment_scores, average='weighted')
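
<p>The effect of the <i>neighbours</i> parameter can be seen on a toy problem. The one-dimensional data below is made up: each test point is labelled by a majority vote of its three nearest training points.</p>

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up 1-D training data: small values are class 0, large values are class 1
X_train = [[0], [1], [2], [10], [11], [12]]
y_train = [0, 0, 0, 1, 1, 1]

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(X_train, y_train)

# Each prediction is the majority class among the 3 closest training points
predictions = classifier.predict([[1.5], [10.5]])
print(list(predictions))
```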

<h3>evaluate_svm_linear ()</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>NONE</b></li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>NONE</b></li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>The evaluate_svm_linear() method iterates over the list of (sorted) modifier values (<i>c_values</i>) and calls the <i>svm_linear_classification()</i> method with the current modifier value, which runs the algorithm with that value</li>
    <li>The data returned from the <i>svm_linear_classification</i> method is stored in the <i>predictions</i> and <i>metrics</i> variables</li>
    <li>The data contained in the <i>predictions</i> and <i>metrics</i> variables is then stored in the <i>all_data</i> global variable (with the attributes in the order shown where the global <i>all_data</i> variable is declared)</li>
    <li>The data stored in the <i>all_data</i> variable is also printed so the user can see the performance of the algorithm with this modifier</li>
</ol>

In [None]:
def evaluate_svm_linear():
    global all_data
    c_values = [0.05, 0.02, 0.001, 0.01, 0.3, 0.2, 0.1, 2, 10, 0.0001, 1, 3, 0.03, 5]
    c_values.sort()
    for c_value in c_values:
        predictions, metrics = svm_linear_classification(c_value)
        all_data.append(['SVM Linear', 'C Value', c_value, predictions, round(metrics[0], 4), round(metrics[1], 4), round(metrics[2], 4)])
        print(all_data[-1][0], "C_Value", all_data[-1][2], "Precision", all_data[-1][4], "Recall", all_data[-1][5], "F-Score", all_data[-1][6], "\n")
    return

<h3>evaluate_feed_forward ()</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>NONE</b></li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>NONE</b></li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>The evaluate_feed_forward() method iterates over the list of (sorted) modifier values (<i>values</i>) and calls the <i>feed_forward_classification()</i> method with the current modifier values (<i>layers</i> and <i>iterations</i>) which runs the algorithm with the modified values</li>
    <li>The data returned from the <i>feed_forward_classification</i> method is stored in the <i>predictions</i> and <i>metrics</i> variables</li>
    <li>The data contained in the <i>predictions</i> and <i>metrics</i> variables is then stored in the <i>all_data</i> global variable (with the attributes in the order shown where the global <i>all_data</i> variable is declared)</li>
    <li>The data stored in the <i>all_data</i> variable is also printed so the user can see the performance of the algorithm with these modifiers</li>
</ol>

In [None]:
def evaluate_feed_forward():
    global all_data
    values = [[[100], 150], [[100], 200], [[100, 50], 200], [[100, 50], 150], 
          [[100, 75, 50], 100], [[100,75,50], 125], [[100,75,50], 150], 
          [[15], 100], [[200, 100], 125], [[200, 100], 150], [[25], 150], 
          [[300], 100], [[35], 150], [[35, 20], 100], [[35, 25], 150], 
          [[65, 20], 100], [[65, 30], 200], [[65, 30, 10], 150], [[75, 55], 100], 
          [[75, 55], 125], [[75, 55], 200]]
    values.sort()
    for layers, iterations in values:
        predictions, metrics = feed_forward_classification(layers, iterations)
        all_data.append(['Feed Forward', 'Layers; Iterations', str(layers) + ";" + str(iterations), predictions, round(metrics[0], 4), round(metrics[1], 4), round(metrics[2], 4)])
        print(all_data[-1][0], "[Layers]:Iterations", all_data[-1][2], "Precision", all_data[-1][4], "Recall", all_data[-1][5], "F-Score", all_data[-1][6], "\n")
    return
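
<p>Because each entry in <i>values</i> is a nested list, <i>values.sort()</i> compares entries element by element: first the layer lists, then the iteration counts. A shortened, illustrative subset shows the resulting order.</p>

```python
# Shortened subset in the same [[layers], iterations] shape
values = [[[100], 150], [[15], 100], [[100, 50], 150], [[100], 100]]

# Lists compare lexicographically: [15] < [100] < [100, 50],
# and ties on the layer list are broken by the iteration count
values.sort()
print(values)
```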

<h3>evaluate_naive_bayes ()</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>NONE</b></li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>NONE</b></li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>The evaluate_naive_bayes() method iterates over the list of (sorted) modifier values (<i>values</i>) and calls the <i>naive_bayes_classification()</i> method with the current modifier value which runs the algorithm with the modified value</li>
    <li>The data returned from the <i>naive_bayes_classification</i> method is stored in the <i>predictions</i> and <i>metrics</i> variables</li>
    <li>The data contained in the <i>predictions</i> and <i>metrics</i> variables is then stored in the <i>all_data</i> global variable (with the attributes in the order shown where the global <i>all_data</i> variable is declared)</li>
    <li>The data stored in the <i>all_data</i> variable is also printed so the user can see the performance of the algorithm with this modifier</li>
</ol>

In [None]:
def evaluate_naive_bayes():
    global all_data
    values = [0.001, 0.01, 0.1, 0.2, 0.3, 0.5, 0.7, 1.0, 2.0, 3.0, 5.0, 10.0]
    values.sort()
    for value in values:
        predictions, metrics = naive_bayes_classification(value)
        all_data.append(['Naive Bayes', 'Alpha Value', value, predictions, round(metrics[0], 4), round(metrics[1], 4), round(metrics[2], 4)])
        print(all_data[-1][0], "Alpha", all_data[-1][2], "Precision", all_data[-1][4], "Recall", all_data[-1][5], "F-Score", all_data[-1][6], "\n")
    return

<h3>evaluate_random_forest ()</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>NONE</b></li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>NONE</b></li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>The evaluate_random_forest() method iterates over the list of modifier values (<i>values</i>) and calls the <i>random_forest_classification()</i> method with the current modifier values (<i>trees</i> and <i>features</i>) which runs the algorithm with the modified values</li>
    <li>The data returned from the <i>random_forest_classification</i> method is stored in the <i>predictions</i> and <i>metrics</i> variables</li>
    <li>The data contained in the <i>predictions</i> and <i>metrics</i> variables is then stored in the <i>all_data</i> global variable (with the attributes in the order shown where the global <i>all_data</i> variable is declared)</li>
    <li>The data stored in the <i>all_data</i> variable is also printed so the user can see the performance of the algorithm with these modifiers</li>
</ol>

In [None]:
def evaluate_random_forest():
    global all_data
    values = [[50, 1000], [50, 5000], [50, 10000],
               [100, 3], [100, 100], [100, 200], [100, 300],
               [150, 50],
               [200, 50], [200, 100], [200, 300],
               [300, 10], [300, 20], [300, 50], [300, 100], [300, 500], [300, 1000], [300, 5000]]
    for trees, features in values:
        predictions, metrics = random_forest_classification(trees, features)
        all_data.append(['Random Forest', 'Trees; Features', str(trees) + ";" + str(features), predictions, round(metrics[0], 4), round(metrics[1], 4), round(metrics[2], 4)])
        print(all_data[-1][0], "trees;features", all_data[-1][2], "Precision", all_data[-1][4], "Recall", all_data[-1][5], "F-Score", all_data[-1][6], "\n")
    return

<h3>evaluate_decision_tree ()</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>NONE</b></li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>NONE</b></li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>The evaluate_decision_tree() method calls the <i>decision_tree_classification()</i> method which runs the algorithm.</li>
    <li>The data returned from the <i>decision_tree_classification</i> method is stored in the <i>predictions</i> and <i>metrics</i> variables</li>
    <li>The data contained in the <i>predictions</i> and <i>metrics</i> variables is then stored in the <i>all_data</i> global variable (with the attributes in the order shown where the global <i>all_data</i> variable is declared)</li>
    <li>The data stored in the <i>all_data</i> variable is also printed so the user can see the performance of the decision tree algorithm</li>
</ol>

In [None]:
def evaluate_decision_tree():
    global all_data
    predictions, metrics = decision_tree_classification()
    all_data.append(['Decision Tree', '', '' , predictions, round(metrics[0], 4), round(metrics[1], 4), round(metrics[2], 4)])
    print(all_data[-1][0], "", all_data[-1][2], "Precision", all_data[-1][4], "Recall", all_data[-1][5], "F-Score", all_data[-1][6], "\n")
    return

<h3>evaluate_k_nn ()</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>NONE</b></li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>NONE</b></li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>The evaluate_k_nn() method iterates over the list of (sorted) modifier values (<i>values</i>) and calls the <i>k_nn_classification()</i> method with the current modifier value which runs the algorithm with the modified value</li>
    <li>The data returned from the <i>k_nn_classification()</i> method is stored in the <i>predictions</i> and <i>metrics</i> variables</li>
    <li>The data contained in the <i>predictions</i> and <i>metrics</i> variables is then stored in the <i>all_data</i> global variable (with the attributes in the order shown where the global <i>all_data</i> variable is declared)</li>
    <li>The data stored in the <i>all_data</i> variable is also printed so the user can see the performance of the algorithm with this modifier</li>
</ol>

In [None]:
def evaluate_k_nn():
    global all_data
    values = [1, 3, 5, 7, 9, 11]
    values.sort()
    for value in values:
        predictions, metrics = k_nn_classification(value)
        all_data.append(['K-NN', 'Neighbours', value, predictions, round(metrics[0], 4), round(metrics[1], 4), round(metrics[2], 4)])
        print(all_data[-1][0], all_data[-1][1], all_data[-1][2], "Precision", all_data[-1][4], "Recall", all_data[-1][5], "F-Score", all_data[-1][6], "\n")
    return
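
<p>The evaluate functions above share one pattern: loop over sorted values, classify, round the metrics and append a record. The sketch below is one possible way to express that pattern once. The <i>evaluate</i> helper and the <i>fake_classifier</i> stand-in are hypothetical and return made-up numbers; they are not part of the script.</p>

```python
def evaluate(algorithm_name, modifier_name, values, classify, results):
    """Run classify(value) for each sorted value and append one record per run."""
    for value in sorted(values):
        predictions, metrics = classify(value)
        # Keep precision, recall and f-score; the fourth element (support) is ignored
        precision, recall, f_score = (round(metric, 4) for metric in metrics[:3])
        results.append([algorithm_name, modifier_name, value,
                        predictions, precision, recall, f_score])
    return results


def fake_classifier(value):
    # Stand-in for a *_classification function; the numbers are made up
    return [1, 0], (0.5 + value / 10, 0.5, 0.61234, None)


results = evaluate('Demo', 'Value', [2, 1], fake_classifier, [])
for record in results:
    print(record)
```

<p>Passing the results list in as an argument, rather than using a global, is a design choice of this sketch that makes the helper easier to test in isolation.</p>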

<h3>pull_individual_algorithm_data (algorithm)</h3>

<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>algorithm</b> - [string] The name of the algorithm that data is being retrieved for</li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>algorithm_data</b> - [list] List containing the data for the selected algorithm</li>
            <li><b>algorithm_name</b> - [string] The name of the algorithm</li>
            <li><b>algorithm_modifier_name</b> - [string] The name of the algorithm modifier</li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>The pull_individual_algorithm_data() method iterates over the <i>all_data</i> list and checks if the first attribute (algorithm name) matches the algorithm name provided as the parameter (<i>algorithm</i>)</li>
    <li>If the algorithm names match, the following data is extracted from the global <i>all_data</i> list:
        <ul>
            <li><b>[2]</b> - Modifier Value</li>
            <li><b>[4]</b> - Precision</li>
            <li><b>[5]</b> - Recall</li>
            <li><b>[6]</b> - F-Score</li>
            <li><b>[3]</b> - List containing the predictions</li>
        </ul>
    </li>
    <li>The extracted data is then added to the <i>algorithm_data</i> list as a new row, in the order shown above (so the predictions move to the last position)</li>
    <li>The <i>algorithm_name</i> and <i>algorithm_modifier_name</i> variables are then set with the name (contained in [0]) and the modifier name (contained in [1])</li>
    <li>Once the for loop completes processing the <i>all_data</i> global variable, the <i>algorithm_data</i>, <i>algorithm_name</i> and <i>algorithm_modifier_name</i> variables are returned</li>
</ol>

In [None]:
def pull_individual_algorithm_data(algorithm):
    global all_data
    algorithm_data = []
    algorithm_name, algorithm_modifier_name = '', ''
    for data in all_data:
        if data[0] == algorithm:
            algorithm_data.append([data[2], data[4], data[5], data[6], data[3]])
            algorithm_name = data[0]
            algorithm_modifier_name = data[1]
    return algorithm_data, algorithm_name, algorithm_modifier_name
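
<p>The filtering and reordering steps can be checked on a small made-up list. The <i>pull</i> function below is a hypothetical, condensed equivalent that takes the records as an argument instead of reading the global.</p>

```python
# Made-up records in the all_data layout
all_data_example = [
    ['K-NN', 'Neighbours', 1, [0, 1], 0.61, 0.60, 0.60],
    ['Naive Bayes', 'Alpha Value', 0.1, [1, 1], 0.72, 0.71, 0.71],
    ['K-NN', 'Neighbours', 3, [1, 1], 0.65, 0.64, 0.64],
]


def pull(algorithm, records):
    # Keep matching records and reorder to:
    # [modifier value, precision, recall, f-score, predictions]
    matches = [r for r in records if r[0] == algorithm]
    rows = [[r[2], r[4], r[5], r[6], r[3]] for r in matches]
    name = matches[-1][0] if matches else ''
    modifier_name = matches[-1][1] if matches else ''
    return rows, name, modifier_name


rows, name, modifier_name = pull('K-NN', all_data_example)
print(rows, name, modifier_name)
```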

<h3>get_data_from_single_column (data, column)</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>data</b> - [list] The list of data for a single algorithm</li>
            <li><b>column</b> - [integer] An integer identifying a column index in the <i>data</i> list</li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>list_to_return</b> - [list] A list containing only the values in the specified index of the <i>data</i> list</li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>The get_data_from_single_column() method iterates over the <i>data</i> parameter (a list of rows) and adds the value at the <i>column</i> index of each row to the <i>list_to_return</i> list</li>
    <li>The <i>list_to_return</i> variable is then returned</li>
</ol>

In [None]:
def get_data_from_single_column(data, column):
    list_to_return = []
    for row in data:
        list_to_return.append(row[column])
    return list_to_return

<h3>create_dataframe (data)</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>data</b> - [list] The list of data for a single algorithm</li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>dataframe</b> - [dataframe] A dataframe containing the modifier value, precision, recall and f-score from <i>data</i></li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>The create_dataframe() method creates a dictionary (<i>data_dict</i>) containing the following columns from the <i>data</i> parameter:
        <ul>
            <li><b>[1]</b> - precision</li>
            <li><b>[2]</b> - recall</li>
            <li><b>[3]</b> - f-score</li>
        </ul>
    </li>
    <li>The dataframe is created using the <i>data_dict</i> dictionary, with an index of the algorithm modifier value ([0])</li>
    <li>The <i>dataframe</i> variable is then returned</li>
</ol>

In [None]:
def create_dataframe(data):
    data_dict = {'Precision': get_data_from_single_column(data, 1), 'Recall': get_data_from_single_column(data, 2), 'F-Score': get_data_from_single_column(data, 3)}
    dataframe = pd.DataFrame(data=data_dict, index=get_data_from_single_column(data, 0))
    return dataframe
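
<p>For reference, the resulting DataFrame looks like the one built below from two made-up rows. This sketch uses list comprehensions in place of <i>get_data_from_single_column()</i>, which is an alternative formulation rather than the script's own code.</p>

```python
import pandas as pd

# Made-up rows: [modifier value, precision, recall, f-score, predictions]
rows = [
    [1, 0.61, 0.60, 0.60, [0, 1]],
    [3, 0.65, 0.64, 0.64, [1, 1]],
]

dataframe = pd.DataFrame(
    data=[row[1:4] for row in rows],           # the three metric columns
    columns=['Precision', 'Recall', 'F-Score'],
    index=[row[0] for row in rows],            # modifier value as the index
)
print(dataframe)
```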

<h3>sort_predictions (data)</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>data</b> - [list] The list of data for a single algorithm</li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>data</b> - [list] The sorted list of data for a single algorithm</li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>The sort_predictions() method sorts the rows of the <i>data</i> list by f-score (element <i>[3]</i> of each row), from highest to lowest</li>
    <li>The sorted list is then returned</li>
</ol>

In [None]:
def sort_predictions(data):
    data = sorted(data, key=itemgetter(3), reverse=True)
    return data
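
<p>A short example on made-up rows shows the ordering that <i>itemgetter(3)</i> with <i>reverse=True</i> produces; the first row afterwards is the best-scoring configuration.</p>

```python
from operator import itemgetter

# Made-up rows: [modifier value, precision, recall, f-score, predictions]
rows = [
    [1, 0.61, 0.60, 0.60, [0, 1]],
    [3, 0.65, 0.64, 0.64, [1, 1]],
    [5, 0.63, 0.62, 0.62, [1, 0]],
]

# Sort by f-score (index 3), highest first
best_first = sorted(rows, key=itemgetter(3), reverse=True)
best_modifier = best_first[0][0]
print(best_modifier)
```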

<h3>store_best_predictions (data, name, modifier)</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>data</b> - [list] The list of data for a single algorithm</li>
            <li><b>name</b> - [string] The name of the algorithm</li>
            <li><b>modifier</b> - [string] The name of the algorithm modifier</li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>.CSV file</b> - [file] file containing the product IDs and predicted sentiment classes</li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>A <i>filename</i> is created from the algorithm name, modifier name, and modifier value</li>
    <li>The method then creates the file and iterates over the <i>data</i> list, outputting the product_id and the predictions which are stored in the file</li>
</ol>

In [None]:
def store_best_predictions(data, name, modifier):
    filename = 'Predictions/' + name + ' - ' + modifier + ' ' + str(data[0][0]) + ' - Predictions.csv'
    header = "product_id,predicted_sentiment_score\n"
    print('Storing Predictions')
    with open(file=filename, mode='w') as file_writer:
        file_writer.write(header)
        for idx in range(0, len(data[0][-1])):
            file_writer.write(str(test_product_ids[idx]) + "," + str(data[0][-1][idx]) + "\n")
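
<p>The same file layout can also be produced with Python's csv module, which handles the commas and line endings. The sketch below is an alternative under stated assumptions, not the notebook's own method: <i>write_predictions()</i> is a hypothetical helper that takes the product IDs as a parameter instead of reading the <i>test_product_ids</i> global, and the IDs and predictions are made up.</p>

```python
import csv
import os
import tempfile

def write_predictions(path, product_ids, predictions):
    # Hypothetical helper: one row per review, pairing each ID with its prediction.
    with open(path, mode='w', newline='') as handle:
        writer = csv.writer(handle)
        writer.writerow(['product_id', 'predicted_sentiment_score'])
        writer.writerows(zip(product_ids, predictions))

# Made-up IDs and predictions, written to a temporary directory.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'Example - Predictions.csv')
write_predictions(out_path, ['B001', 'B002'], [1, 0])

# Read the file back to check what was stored.
with open(out_path, newline='') as handle:
    stored = list(csv.reader(handle))
print(stored)
```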

<h3>create_graph_and_store_predictions (algorithm)</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>algorithm</b> - [string] The name of the algorithm</li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>.PNG file</b> - [file] image file containing graph</li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>The create_graph_and_store_predictions() method is used to store the algorithm's best predictions and to create a graph representing the performance results</li>
    <li>The method gets all the data for the specific algorithm using the <i>pull_individual_algorithm_data()</i> method</li>
    <li>The data is sorted by f-score (low-high) and a dataframe is created from it using the <i>create_dataframe()</i> method</li>
    <li>A bar graph is then created with customised colours and size</li>
    <li>The Y axis boundaries for the graphs are then set to 0.60 and 0.825 (the range is consistent regardless of algorithm as all had a minimum f-score of 0.60 and a maximum of 0.80)</li>
    <li>The X axis label (the modifier) is then set</li>
    <li>The graph is then stored as a PNG file, and the predictions are sorted using the <i>sort_predictions()</i> method and stored using the <i>store_best_predictions()</i> method</li>
    <li>Finally, the predictions for all of the entries in the <i>all_data</i> list are replaced with an empty string (to ensure RAM is not filled up)</li>
</ol>

In [None]:
def create_graph_and_store_predictions(algorithm):
    data, name, modifier = pull_individual_algorithm_data(algorithm)
    data = sorted(data, key=itemgetter(3))
    dataframe = create_dataframe(data)
    print(algorithm)
    print('Creating Graph')
    title = algorithm + ' Performance'
    graph = dataframe.plot.bar(color=['#C44E52', '#55A868', '#4C72B0'], figsize=(15,11), title=title)
    graph.set_ylim(bottom=0.60, top=0.825)
    graph.set_xlabel(modifier)
    graph = graph.get_figure()
    graph.savefig('Graphs/' + algorithm.replace(" ", "_") + '.png')
    store_best_predictions(sort_predictions(data), name, modifier)
    print("Complete \n")
    global all_data
    for row in range(len(all_data)):
        all_data[row][3] = ""
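
<p>The final clean-up step can be illustrated on its own. In the sketch below, <i>results</i> is a made-up list in the <i>all_data</i> layout, and the algorithm and modifier names are invented for the example. Replacing index [3] drops the reference to each predictions list, so Python can reclaim that memory provided nothing else still refers to it, while the performance figures stay available for later reporting.</p>

```python
# Made-up rows in the all_data layout:
# [algorithm, modifier, modifier value, predictions, precision, recall, f-score]
results = [['K-NN', 'Neighbours', 3, [1, 0, 1, 1], 0.70, 0.69, 0.69],
           ['K-NN', 'Neighbours', 5, [1, 1, 1, 0], 0.72, 0.71, 0.71]]

for row in results:
    # Drop the reference to the (potentially large) predictions list.
    row[3] = ""

# The metrics are untouched and can still be read afterwards.
scores = [row[6] for row in results]
print(scores)
```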

<h3>Evaluation, Performance Graphs and Storing Predictions</h3>
<p>This section calls the methods that evaluate each algorithm, generate its performance graph and store its best predictions.</p>

In [None]:
evaluate_svm_linear()
create_graph_and_store_predictions("SVM Linear")
evaluate_feed_forward()
create_graph_and_store_predictions("Feed Forward")
evaluate_naive_bayes()
create_graph_and_store_predictions("Naive Bayes")
evaluate_random_forest()
create_graph_and_store_predictions("Random Forest")
evaluate_decision_tree()
create_graph_and_store_predictions("Decision Tree")
evaluate_k_nn()
create_graph_and_store_predictions("K-NN")

<h3>analysis_output ()</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>NONE</b></li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>NONE</b></li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>The analysis_output() method sorts the results contained in the <i>all_data</i> global variable by f-score from highest to lowest</li>
    <li>The performance details for the best performing algorithm are printed for the user to see</li>
    <li>A DataFrame containing all of the performance results (algorithm, modifier, modifier value, precision, recall and f-score) is created and printed</li>
</ol>

In [None]:
def analysis_output():
    global all_data
    all_data = sorted(all_data, key=itemgetter(6), reverse=True)
    
    print("The best performing algorithm was", all_data[0][0], "with a", all_data[0][1], "of", all_data[0][2])
    print("This algorithm had the following results:\n")
    print("Precision", all_data[0][4])
    print("Recall", all_data[0][5])
    print("F-Score", all_data[0][6], "\n")
    
    print("A table of all results is shown below")
    data_df = {'Algorithm': get_data_from_single_column(all_data, 0),
               'Modifier': get_data_from_single_column(all_data, 1),
               'Modifier Value': get_data_from_single_column(all_data, 2),
               'Precision': get_data_from_single_column(all_data, 4),
               'Recall': get_data_from_single_column(all_data, 5),
               'F-Score': get_data_from_single_column(all_data, 6)}
    all_data_df = pd.DataFrame(data=data_df, index=range(1,len(all_data)+1))
    print(all_data_df, "\n\n")
    print("The predictions for the best performing instance of each algorithm have been stored in CSV files in the 'Predictions' directory")

<h3>report_output ()</h3>
<ul>
    <li><u>Inputs</u>
        <ul>
            <li><b>NONE</b></li>
        </ul>
    </li>
    <li><u>Outputs</u>
        <ul>
            <li><b>.TXT file</b> - [file] containing the performance data in the format for a LaTeX table</li>
        </ul>
    </li>
</ul>
    
<h4>What does this method do?</h4>
<ol>
    <li>The report_output() method creates a text file <i>Algorithm Performance Results (LaTeX Table Format).txt</i> to store the data</li>
    <li>Writes the table header to the file</li>
    <li>Iterates over the data contained in the <i>all_data</i> global variable and writes it to the file (the list is already sorted by f-score, provided analysis_output() has been called first)</li>
</ol>

In [None]:
def report_output():
    global all_data
    with open(file='Algorithm Performance Results (LaTeX Table Format).txt', mode='w') as file_writer:
        file_writer.write("\\begin{table}[H]\n")
        file_writer.write("  \\caption{All Algorithm Results (Ordered by F-Score [Largest to Smallest])}\n")
        file_writer.write("  \\label{tab:all-algorithm-results-1}\n")
        file_writer.write("  \\centering\n")
        file_writer.write("  \\begin{tabular}{c|c|c|c|c|c}\n")
        file_writer.write("    \\toprule\n")
        
        file_writer.write("    Algorithm & Modifier & Modifier Value & Precision & Recall & F-Score \\\\ \n")
        file_writer.write("    \\midrule \n")
        for data in all_data:
            file_writer.write("    " + str(data[0]) + " & " + str(data[1]) + " & " + str(data[2]) + " & " + str(data[4]) + " & " + str(data[5]) + " & " + str(data[6]) + " \\\\\n")
        file_writer.write("    \\bottomrule\n")
        file_writer.write("  \\end{tabular}\n")
        file_writer.write("\\end{table}")
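
<p>The sketch below shows how a single table row could be formatted separately from the file writing. It rests on assumptions that are not part of the notebook: <i>latex_row()</i> is a hypothetical helper, the metrics are rounded to three decimal places, and underscores are escaped because an unescaped underscore in text mode stops LaTeX from compiling. The example row, including the modifier name, is made up.</p>

```python
def latex_row(row):
    # Hypothetical helper. row uses the all_data layout:
    # [algorithm, modifier, modifier value, predictions, precision, recall, f-score]
    cells = [str(row[0]), str(row[1]), str(row[2])]
    # Round the metrics to three decimal places (an assumption, not the notebook's behaviour).
    cells += ['{:.3f}'.format(row[i]) for i in (4, 5, 6)]
    # Escape underscores so LaTeX does not treat them as subscript markers.
    cells = [cell.replace('_', '\\_') for cell in cells]
    # Cells are separated by ' & ' and the row ends with a LaTeX line break.
    return '    ' + ' & '.join(cells) + ' \\\\\n'

example = latex_row(['Random Forest', 'n_estimators', 100, "", 0.7512, 0.7498, 0.7504])
print(example)
```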

<h3>Performance Results and Storing Results for Report</h3>
<p>This section calls the methods for presenting the performance results and storing the performance data in a LaTeX table</p>

In [None]:
analysis_output()
report_output()