<div class="alert alert-danger">
    <h3>NOTE</h3>
    <p>Before you submit this assignment, <strong>make sure everything runs as expected</strong>:</p>
    <ol>
        <li><strong>restart the kernel</strong> (in the menubar, select <strong>Kernel → Restart</strong>)</li>
        <li><strong>run all cells</strong> (in the menubar, select <strong>Cell → Run All</strong>)</li>
    </ol>
    <p>Make sure to complete every cell that states "<strong><TT>YOUR CODE IN THIS CELL</TT></strong>".</p>
</div>

---
# Final Project
**COMP-482: Winter 2023**


## Objectives

* access a *real-world* dataset (e.g., used in **kaggle.com** competitions, etc.) rather than a *toy* dataset
* download and apply *pre-trained machine learning models*
* develop and evaluate a model
* evaluate different machine learning models on *complex* real-world tasks
* work with a variety of industry-standard machine learning libraries (*Tensorflow*, *PyTorch*, *sklearn*, *NLTK*, etc.)
* visually communicate data and experimental results

### Instructions

Write your **code** in the *code cells* located directly below each red *Write Code* block.\
Write your **text** in the *Markdown cells* that follow every **Task** description below. Also complete this Notebook's **Final Report** section.

<div class="alert alert-info">
    <h4>PRO TIP</h4>
    <p>The best approach to this assignment is to work on <strong>one task at a time</strong>. Treat each task as a step toward a destination.</p>
</div>

------
## Preliminaries & Dependencies

You will require the following **Python** packages to complete this assignment (it is likely many of these libraries are already installed via **Anaconda**, **Quiz #4**, etc.):
* matplotlib
* numpy
* sklearn
* tensorflow
* tensorflow-hub
* tensorflow-datasets
* tfds-nightly
* seaborn
* nltk
* pandas


### Imports

Import the following libraries.

In [1]:
import tensorflow as tf
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import IPython.display as display
import seaborn
import sklearn
import nltk
import pandas as pd
import re

---
## Cross Validation Of A Dataset

The following example code demonstrates the built-in cross-validation module from the [**Scikit-Learn** library](https://scikit-learn.org/stable/modules/cross_validation.html).\
You will likely use a variation of the following code for all of the models you will be evaluating.

In [2]:
# EXAMPLE CODE: COMMENT OUT THIS CODE CELL AFTER IMPLEMENTING YOUR CODE

# Code from:  https://scikit-learn.org/stable/modules/cross_validation.html

#import numpy as np
#from sklearn.model_selection import train_test_split
#from sklearn import datasets
#from sklearn import svm

# X = inputs or features of the data
# y = output values from the data that we are trying to predict
#X, y = datasets.load_iris(return_X_y=True)

#X.shape, y.shape # shape displays the dimensions of the matrices

In [3]:
# EXAMPLE CODE: COMMENT OUT THIS CODE CELL AFTER IMPLEMENTING YOUR CODE

# Train the model using 80% of the dataset then test (evaluate) the model on the other 20% of the dataset
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# shape displays the dimensions of the matrices
#print(X_train.shape, y_train.shape)
#print(X_test.shape, y_test.shape)

#clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
#score = clf.score(X_test, y_test)
#print("The performance of one run of the SVM model:", score)

In [4]:
# EXAMPLE CODE: COMMENT OUT THIS CODE CELL AFTER IMPLEMENTING YOUR CODE

#from sklearn.model_selection import cross_val_score

#clf = svm.SVC(kernel='linear', C=1, random_state=42)

#scores = cross_val_score(clf, X, y, cv=5)
#print("Scores for the 5 runs of the SVM model:", scores)
#print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

----
## Task: Evaluating Language Knowledge   (5 Marks)

See https://www.kaggle.com/competitions/feedback-prize-english-language-learning/overview.

From [**Kaggle**](https://www.kaggle.com/competitions/feedback-prize-english-language-learning/overview):
> The goal of this competition is to assess the language proficiency of 8th-12th grade English Language Learners (ELLs). Utilizing a dataset of essays written by ELLs will help to develop proficiency models that better supports all students.



### Data Description
The dataset presented here (the ELLIPSE corpus) comprises argumentative essays written by 8th-12th grade English Language Learners (ELLs). The essays have been scored according to six analytic measures: cohesion, syntax, vocabulary, phraseology, grammar, and conventions.

> #### Files
> 
> **train.csv** - The training set, comprising the full_text of each essay, identified by a unique text_id. The essays are also given a score for each of the six analytic measures above: cohesion, etc. These analytic measures comprise the target for the competition.\
\
**test.csv** - For the test data we give only the full_text of an essay together with its text_id.\
\
**sample_submission.csv** - A submission file in the correct format. See the Evaluation page for details.


### What You Are Being Asked To Do

(A version of this is posted on **Discord**)

The file `sample_submission.csv` shows the format in which the results should be reported: the `text_id` followed by the cohesion, syntax, vocabulary, phraseology, grammar, and conventions scores.

An example of what you are being asked to do:
* give your model the text of an English language learner (8th-12th grade), for which the model assigns the **cohesion, syntax, vocabulary, phraseology, grammar, and conventions scores** (each score is between 1 and 5)
* in the competition, **Kaggle** takes these scores and compares them to the human-determined scores
* the model is then scored based on how close its scores are to the human scores
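
One way to make "how close" concrete is an error measure computed per score column and then averaged. The sketch below assumes the metric is the mean columnwise root mean squared error (MCRMSE); check the competition's Evaluation page before relying on it. The function name `mcrmse` and all the numbers are made up for illustration.

```python
import numpy as np

def mcrmse(y_true, y_pred):
    """Mean columnwise RMSE: one RMSE per score column, then the average of those."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_column_rmse = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))
    return float(np.mean(per_column_rmse))

# Two hypothetical essays, six scores each (made-up values, not taken from the dataset)
human_scores = [[3.0, 3.5, 3.0, 3.0, 4.0, 3.0],
                [2.5, 2.5, 3.0, 2.0, 2.0, 2.5]]
model_scores = [[3.0, 3.0, 3.0, 3.5, 4.0, 3.0],
                [2.5, 3.0, 3.0, 2.5, 2.0, 2.5]]

print(mcrmse(human_scores, model_scores))
```

Lower values mean the model's scores are closer to the human scores; a value of 0 means they match exactly.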

### Download The Datasets


In [5]:
# One way to download the data is through Tensorflow's Datasets repository
# Data is stored at:
#    ~/tensorflow_datasets/wikipedia_toxicity_subtypes/
#    ~/tensorflow_datasets/civil_comments/CivilComments/
# Dataset sizes are:
#    2 Gb - wikipedia_toxicity_subtypes
#    1 Gb - civil_comments
# This code cell takes about 5-10 minutes to execute
# You only need to run this code cell once. If you run it again it doesn't
#     do anything since the datasets have already been downloaded

#import tensorflow_datasets as tfds

# Construct a tf.data.Dataset
#ds_wikipedia_comments = tfds.load('wikipedia_toxicity_subtypes')
#ds_comment_bias = tfds.load('civil_comments')

In [6]:
# Code from: https://www.tensorflow.org/datasets/overview#tfdsas_dataframe

#import tensorflow_datasets as tfds
#import pandas

# working with DataFrames
#dset, info = tfds.load('wikipedia_toxicity_subtypes', with_info=True)
#dframe = tfds.as_dataframe(dset["train"].take(1000), info) # only takes the first 1000 data samples
#dframe = tfds.as_dataframe(dset["train"], info) # takes all the data into memory (takes a while)

# display first few rows & last few rows
#dframe.head()
#dframe.tail()

# display column names
#dframe.columns

# statistic summary of the data
#dframe.describe()

# selecting the "text" column (these do the same thing)
#dframe["text"]
#dframe.text

# selecting the "insult" column (these do the same thing)
#dframe["insult"]
#dframe.insult

# getting the comment text from the 10th comment in the dataset
#dframe["text"][9]

# selecting rows 10 to 15
#dframe[10:16]

# selecting all the comments that are labelled as an insult
#dframe[dframe["insult"] > 0]

# fancy advanced:
#    selects the comments where column "E" has either the value of "two" or "four"
#dframe = tfds.as_dataframe(dset["train"].take(6), info) # only takes the first 6 data samples
#dframe2 = dframe.copy()
#dframe2["E"] = ["one", "one", "two", "three", "four", "three"]
#dframe2[dframe2["E"].isin(["two", "four"])]

In [7]:
# Reading csv files attached
test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')
pd.set_option('display.max_rows', 20)


In [8]:
# REMOVING BAD DATA
# Drop every row whose score in any of the six measures is missing or is not a number.
# This is done by index label in one pass: dropping rows with a running position
# counter goes wrong once an earlier row has been removed, because DataFrame.drop
# works on index labels rather than positions.
score_columns = ["cohesion", "syntax", "vocabulary", "phraseology", "grammar", "conventions"]
numeric_scores = train[score_columns].apply(pd.to_numeric, errors="coerce")
train = train[numeric_scores.notna().all(axis=1)]

## Task: Extract Features From Dataset   (40 Marks)

Used **NLTK**  and **TF-IDF** to extract **features** from the **Essays** dataset.\
The **features** will be used to train  **SVM** to identify writing level.

**features** to extract:\
The frequency distribution is vectorized with TF-IDF, giving a large number of features


### Transforming Data Into Features

We represent the data as matrices/vectors. So we are transforming the dataset into a matrix representation. None of the models accept text input! Example:\
`X = [[0, 0], [1, 3], [2, 0], [3, 1]]`\
`Y = [0, 1, 2, 3]`
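
As a minimal sketch of this transformation, the code below turns two made-up essays into a numeric matrix with one row per essay and one column per feature. The helper `basic_features` and the three features it computes are illustrative assumptions, not the features required for this task.

```python
import numpy as np

def basic_features(essay):
    """Return [number of words, average word length, number of '.', '!' and '?' marks]."""
    words = essay.split()
    n_words = len(words)
    avg_len = sum(len(w) for w in words) / n_words if n_words else 0.0
    n_end_marks = sum(essay.count(mark) for mark in ".!?")
    return [n_words, avg_len, n_end_marks]

essays = ["I like school. It is fun!", "Homework is hard"]  # made-up essays
X_demo = np.array([basic_features(e) for e in essays], dtype=float)

print(X_demo.shape)  # (number_of_samples, number_of_features)
print(X_demo)
```

Words are split on whitespace only, so punctuation attached to a word counts toward its length; a tokenizer such as the one in **NLTK** would treat it separately.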

### Rubric

We will be evaluating this section in part by how clever your choice of features was (and whether you were able to extract them).
Simple sets of features may not be as informative as complex sets of features, but simple features are easier to extract from a dataset than complex features.

A breakdown of the marking for this task:
* [**10 marks**] basic features gathered (trivial)
* [**25 marks**] quality features gathered (advanced), corresponding to unusual features or clever features most would not have considered
* [**5 marks**] formatting the features and output correctly so it can be used immediately downstream for the machine learning classifiers without any further preprocessing needing to be done at that stage
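
For the last point, one possible way to deliver basic and advanced features as a single ready-to-use matrix is to stack them side by side. This is only a sketch on a toy corpus: the word-count column is a placeholder feature, and the variable names are assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

essays = ["the cat sat on the mat", "a dog ran in the park", "the cat ran"]  # toy corpus

tfidf = TfidfVectorizer().fit_transform(essays)  # sparse matrix, one row per essay
word_counts = np.array([[len(e.split())] for e in essays], dtype=float)  # dense, shape (3, 1)

# hstack appends the extra column while keeping the result sparse;
# CSR is a format that sklearn estimators accept directly
X_combined = hstack([tfidf, csr_matrix(word_counts)]).tocsr()

print(tfidf.shape, X_combined.shape)
```

TF-IDF values lie between 0 and 1 while raw counts do not, so scaling the extra columns may be worth considering before training a model such as an **SVM**.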

----
<div class="alert alert-info">
    <h4>PRO TIP</h4>
    <p>Extract at least one or two features before implementing any of the models.</p>
</div>

In [9]:
# EXAMPLE CODE: check if a word is in a list of words
#list_of_words = ["house", "hat", "war"]
#word = "hat"
#print("Is 'hat' in the wordlist?", word in list_of_words)

#word = "Hat"
#print("Is 'Hat' in the wordlist?", word in list_of_words)

In [10]:
# EXAMPLE CODE: check if a character is in a list of characters
#character_string = '?.",!@$%^&*()\n' # using a String as if it were a list
#character = "?"
#print("Is '?' in the characterlist?", character in character_string)

#character = "\n"
#print("Is '\\n' (newline) in the characterlist?", character in character_string)

In [11]:
# EXAMPLE CODE: for more code fragments that may be useful
#               check the Discord server (I will not be adding new snippets to this Task here)

<div class="alert alert-danger">
    <h4>WRITE CODE</h4>
    In the cell below, write the code that extracts features from the dataset.
</div>

### Old Vocab test features

In [12]:
# Extracting features for VOCAB of TRAINING DATA
#
#result = []
#
#tokenize full_text into words
#for text in train["full_text"]:
#    total_length = 0
#    max_word_length = 0
#    text = nltk.tokenize.word_tokenize(text)
#    
#    for word in text:
#        #get word length total
#        total_length += len(word)
#
#        #get max word length
#        if len(word) > max_word_length:
#            max_word_length = len(word)
#        
#    # get word length average
#    word_length_average = total_length/len(text)
#    
#    result.append([max_word_length, word_length_average])
#vocab_features_train = pd.DataFrame(result, columns = ['Max Word Length', 'Average Word Length'])

In [13]:
# Extracting features for VOCAB of TESTING DATA

#tokenize full_text into words
#for text in test["full_text"]:
#    total_length = 0
#    max_word_length = 0
#    text = nltk.tokenize.word_tokenize(text)
#    
#    for word in text:
#        #get word length total
#        total_length += len(word)
#
#        #get max word length
#        if len(word) > max_word_length:
#            max_word_length = len(word)
#        
#    # get word length average
#    word_length_average = total_length/len(text)
#    
#    result.append([max_word_length, word_length_average])
#vocab_features_test = pd.DataFrame(result, columns = ['Max Word Length', 'Average Word Length'])

### New TF-IDF Generated Features
Taken from: https://www.kaggle.com/code/tracyporter/ell-nlp-multioutput/notebook

In [14]:
# Feature headings
features = ['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar',  'conventions']

target = train[features]

# text of both train and test files
text_train = train['full_text']
text_test = test['full_text']
text = pd.concat([text_train, text_test])
text

0       I think that students would benefit from learn...
1       When a problem is a change you have to let it ...
2       Dear, Principal\n\nIf u change the school poli...
3       The best time in life is when you become yours...
4       Small act of kindness can impact in other peop...
                              ...                        
3909    Many people disagree with Albert Schweitzer's ...
3910    Do you think that failure is the main thing fo...
0       when a person has no experience on a job their...
1       Do you think students would benefit from being...
2       Thomas Jefferson once states that "it is wonde...
Name: full_text, Length: 3912, dtype: object

In [15]:
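# lowercase the text, then delete every character other than a-z that is directly followed by whitespace
# (the whitespace is deleted too, so the neighbouring words are joined together)

text = text.str.lower()
text = text.apply(lambda x: re.sub(r"[^a-z]\s", "", x))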
text = text.str.replace("#", "")

In [16]:
apostrophe_dict = {
"ain't": "am not / are not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is",
"i'd": "I had / I would",
"i'd've": "I would have",
"i'll": "I shall / I will",
"i'll've": "I shall have / I will have",
"i'm": "I am",
"i've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}

In [17]:
# function to replace each contraction found in the apostrophe dictionary with its expanded form
def lookup_dict(txt, dictionary):
    for word in txt.split():
        if word.lower() in dictionary:
            if word.lower() in txt.split():
                txt = txt.replace(word, dictionary[word.lower()])
    return txt
text = text.apply(lambda x: lookup_dict(x,apostrophe_dict))
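
Note that `str.replace` substitutes every occurrence of a substring, so expanding `he's` also rewrites the tail of `she's`. A token-wise variant avoids this. The sketch below is an alternative, not the code used to produce the outputs in this notebook; the name `expand_contractions` and the two-entry dictionary are made up for illustration.

```python
def expand_contractions(txt, dictionary):
    """Replace whole whitespace-separated tokens that appear as keys in `dictionary`."""
    return " ".join(dictionary.get(token.lower(), token) for token in txt.split())

demo_dict = {"he's": "he is", "she's": "she is"}  # small made-up subset of a contraction table

print(expand_contractions("he's late and she's early", demo_dict))
```

This variant collapses runs of whitespace into single spaces, and a token with punctuation attached (for example `he's,`) is left unchanged because it does not match a key exactly.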

In [18]:
# getting frequency distribution
fdist = nltk.probability.FreqDist()
for x in text:
    fdist += nltk.probability.FreqDist(word for word in nltk.tokenize.word_tokenize(x))
fdist

FreqDist({'to': 70704, 'the': 52872, 'and': 40272, 'you': 36163, 'a': 35898, 'that': 29494, 'is': 29015, 'in': 26879, 'they': 24149, 'not': 23719, ...})

In [19]:
# remove words that occur only once in the whole corpus
# (these are most likely spelling mistakes)

v = text.str.split().tolist()

text = [' '.join([y for y in x if fdist[y] > 1]) for x in v]
text = pd.Series(text)
text

0       i think that students would benefit from learn...
1       when a problem is a change you have to let it ...
2       u change the school policy of having a grade b...
3       the best time in life is when you become yours...
4       small act of kindness can impact in other peop...
                              ...                        
3907    many people disagree with albert is not the ma...
3908    do you think that failure is the main thing fo...
3909    when a person has no experience on a job their...
3910    do you think students would benefit from being...
3911    thomas jefferson once states that is wonderful...
Length: 3912, dtype: object

#### Generating features using TF-IDF vectorization

In [20]:
from sklearn import svm

# setting up input and output for SVM
y = target
X = text[: len(train)]
X_test = text[len(train) :]
X = X.tolist()
X_test = X_test.tolist() 

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

# vectorizing using tf-idf
# generating features

vectorizer = TfidfVectorizer(stop_words='english', max_df=0.5, min_df=0.01)

X = vectorizer.fit_transform(X)
X_test = vectorizer.transform(X_test)

In [22]:
print(len(vectorizer.get_feature_names_out()))

1206


----
<div class="alert alert-info">
    <h4>PRO TIP</h4>
    <p>Classifiers take two arrays as input: <strong>array X</strong> and <strong>array y</strong>.<br>
    <strong>array X</strong> has shape <tt>(number_of_samples, number_of_features)</tt> and contains the feature data of the training samples<br>
    <strong>array y</strong> of class labels/outputs (strings or integers) has shape <tt>(number_of_samples)</tt></p>
    <p></p>
    <p style="text-indent:0px"><tt>print(photos.shape, labels.shape)<br>
    num_samples = labels.shape[0]<br>
    x = np.reshape(photos, (num_samples, -1))</tt></p>
</div>
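
A runnable version of the reshape in the tip, using made-up data: `photos` here is a hypothetical stack of four 8x8 grayscale images, chosen only to show how the shapes change.

```python
import numpy as np

photos = np.zeros((4, 8, 8))     # 4 hypothetical grayscale images of 8x8 pixels
labels = np.array([0, 1, 1, 0])  # one label per image

num_samples = labels.shape[0]
x = np.reshape(photos, (num_samples, -1))  # -1 lets NumPy infer 8 * 8 = 64 features per sample

print(photos.shape, labels.shape)
print(x.shape)
```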

### Decision Trees For Classification

Example code for a **Decision Tree** performing classification. The **Decision Tree** model is predicting one category from a set of categories, such as which genre a film belongs to (`Horror`, `Comedy`, `Action`, etc.):

In [23]:
# EXAMPLE CODE: COMMENT OUT THIS CODE CELL AFTER IMPLEMENTING YOUR CODE BELOW
#from sklearn import tree

#X = [[0, 0, -1], [1, 1, 1], [1, 10, 9], [-3, 0, 33]]
#Y = [0, 1, 4, 1]

# DecisionTreeClassifier takes as input two arrays: X & Y
#    an array X, sparse or dense, of shape (number_of_samples, number_of_features) holding the training samples
#    and an array Y of integer values, of shape (number_of_samples) holding the class labels for the training samples
#clf = tree.DecisionTreeClassifier()

# train the decision tree classifier model
#clf = clf.fit(X, Y)

# after being fitted, predict from a new set of samples
#clf.predict([[2., 2., 10.]])


# plot a visualization of the decision tree
#tree.plot_tree(clf)


# a text visualization of the decision tree
#from sklearn.tree import export_text
#text_tree = export_text(clf, feature_names=["First Feature", "Height Feature", "Salary Feature"])
#print("Text visualization of decision tree:\n", text_tree)

### Decision Trees For Regression

Regression using **Decision Trees** is [found here](https://scikit-learn.org/stable/modules/tree.html#regression).

Example code of a **Decision Tree** for regression (where the **Decision Tree** model is predicting a continuous value):

In [24]:
# EXAMPLE CODE: COMMENT OUT THIS CODE CELL AFTER IMPLEMENTING YOUR CODE BELOW
# code from https://scikit-learn.org/stable/modules/tree.html#regression

#from sklearn import tree

#X = [[0, 0], [2, 2]]
#y = [0.5, 2.5]


#clf = tree.DecisionTreeRegressor()

# train the decision tree regression model
#clf = clf.fit(X, y)

# after being fitted, predict from a new set of samples
#clf.predict([[1, 1]])

# plot a visualization of the decision tree
#tree.plot_tree(clf)

<div class="alert alert-danger">
    <h4>WRITE CODE</h4>
    In the cell below, write the code that implements a Decision Tree classifier.
</div>

In [25]:
# YOUR CODE IN THIS CELL
#raise NotImplementedError() # Remove this after you have started implementing your code below

# Train the model


# Evaluate the model


----
## Task: SVM Classifier   (5 Marks)

Build an [SVM classifier](https://scikit-learn.org/stable/modules/svm.html#classification) that identifies toxic comments.\
Use the **SVM**'s default parameters.

In [26]:
# EXAMPLE CODE: COMMENT OUT THIS CODE CELL AFTER IMPLEMENTING YOUR CODE BELOW
# Code from: https://scikit-learn.org/stable/modules/svm.html#classification

#from sklearn import svm

# the dataset  X:features, y:output values we are trying to predict
#X = [[0.0, 0], [1.2, 1]]
#y = [0, 1]

#clf = svm.SVC()

# train the model
#clf.fit(X, y)

# after being fitted, predict new values
#clf.predict([[2., 2.]])

**SVM** for regression:

In [27]:
# EXAMPLE CODE: COMMENT OUT THIS CODE CELL AFTER IMPLEMENTING YOUR CODE BELOW
# Code from: https://scikit-learn.org/stable/modules/svm.html#regression

#from sklearn import svm

# the dataset  X:features, y:output values we are trying to predict
#X = [[0, 0], [1, 1]]
#y = [0, 1.2]

#clf = svm.SVR()

# train the model
#clf.fit(X, y)

# after being fitted, predict new values
#clf.predict([[2., 2.]])

<div class="alert alert-danger">
    <h4>WRITE CODE</h4>
    In the cell below, write the code that implements an SVM classifier.
</div>

### Old SVM With Old Features

In [28]:
# YOUR CODE IN THIS CELL
#
# Train the model
#from sklearn import svm
#
# the dataset  X:features, y:output values we are trying to predict
#X = vocab_features_train.values.tolist()
#y = train["vocabulary"].values.tolist()
#
#clf = svm.SVR(kernel='poly')
#
# train the model
#clf.fit(X, y)
#
# after being fitted, predict new values
#results = clf.predict([[x,y] for x,y in vocab_features_train.values.tolist()])
#print(results,flush=True)
#temp = []
#for x in results:
#    temp.append(round(x*2)/2)
#results = temp
# Evaluate the model


---
#### Cross Validation Of A Dataset Of SVM

The following example code demonstrates using the builtin cross validation module from the [**Scikit-Learn** library](https://scikit-learn.org/stable/modules/cross_validation.html).\
You will likely use a variation of the following code for all of the models you will be evaluating.

In [29]:
#from sklearn.model_selection import train_test_split

#X = np.array(vocab_features_train, dtype=float)
#y = np.array(train['vocabulary'], dtype=float)

#X = vocab_features_train
#y = train['vocabulary']

#X.shape, y.shape # shape displays the dimensions of the matrices

In [30]:

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.007, random_state=42)


#clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

#score = clf.score(X_test, y_test)

#print("The performance of one run of the SVM model:", score)

In [31]:
#from sklearn.model_selection import cross_val_score

#clf = svm.SVR(kernel='linear', C=1, random_state=42)

#scores = cross_val_score(clf, X, y, cv=2)
#print("Scores for the 5 runs of the SVM model:", scores)
#print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

### Cross Validation with SVM Regression

In [32]:
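# Fit one SVR per score column, then score on the same data that was used for training
# (this R^2 is therefore optimistic; the cross-validated scores are computed in the next cell)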
from sklearn.multioutput import MultiOutputRegressor
from sklearn import svm

clf = MultiOutputRegressor(svm.SVR()).fit(X, y)

score = clf.score(X, y)

print("The performance of one run of the SVM model:", score)

The performance of one run of the SVM model: 0.8683084411466719


In [33]:
from sklearn.model_selection import cross_val_score


# for a regressor, cross_val_score reports the R^2 score of each fold by default (not accuracy)
scores = cross_val_score(clf, X, y, cv=5)
print("Scores for the 5 runs of the SVM model:", scores)
print("%0.2f R^2 with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

Scores for the 5 runs of the SVM model: [0.2043183  0.22164647 0.20201111 0.23256935 0.21464616]
0.22 R^2 with a standard deviation of 0.01
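
Because the targets are continuous scores, an error measured in score points can be easier to interpret than the default score of a regressor. The sketch below runs the same kind of cross validation with an RMSE scorer. It uses synthetic stand-in data rather than the essay features, and the variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Synthetic stand-in data: 60 samples, 5 features, 2 continuous targets
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
y_demo = np.column_stack([X_demo[:, 0] + 3.0, X_demo[:, 1] * 0.5 + 3.0])

model = MultiOutputRegressor(SVR())

# Default scorer of a regressor: R^2 (higher is better)
r2_scores = cross_val_score(model, X_demo, y_demo, cv=5)

# sklearn negates error metrics so that "higher is better" holds for every scorer
neg_rmse = cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error")
rmse_scores = -neg_rmse

print("R^2 per fold: ", np.round(r2_scores, 2))
print("RMSE per fold:", np.round(rmse_scores, 2))
```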


In [34]:
predictions = clf.predict(X_test)
predictions

array([[2.8752967 , 2.66967502, 3.27707537, 3.09959794, 2.7080656 ,
        2.63712166],
       [3.11647148, 2.84580193, 2.89114212, 2.6167223 , 2.58304548,
        3.00077795],
       [3.54040148, 3.46766088, 3.64180325, 3.49960481, 3.37827924,
        3.39248726]])
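
The predictions can then be arranged in the layout described for `sample_submission.csv`: a `text_id` column followed by the six score columns. This is a sketch under assumptions: the ids and prediction values are made up, and clipping to the 1-5 range is an optional choice based on the score range stated in the task description.

```python
import numpy as np
import pandas as pd

score_columns = ["cohesion", "syntax", "vocabulary", "phraseology", "grammar", "conventions"]

# Hypothetical ids and predictions, standing in for test["text_id"] and clf.predict(X_test)
demo_ids = ["id_a", "id_b"]
demo_predictions = np.array([[2.9, 2.7, 3.3, 3.1, 2.7, 0.6],
                             [3.5, 3.5, 5.4, 3.5, 3.4, 3.4]])

# Keep every prediction inside the 1-5 range that each score is described as having
clipped = np.clip(demo_predictions, 1.0, 5.0)

submission = pd.DataFrame(clipped, columns=score_columns)
submission.insert(0, "text_id", demo_ids)

csv_text = submission.to_csv(index=False)  # pass a file path instead to write the file to disk
print(csv_text)
```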

In [35]:
# refer to Quiz #4 for help with n-gram language models

<div class="alert alert-danger">
    <h4>WRITE CODE</h4>
    In the cell below, write the code that implements a Tri-gram Language Model.
</div>

In [36]:
# YOUR CODE IN THIS CELL
#raise NotImplementedError() # Remove this after you have started implementing your code below

# Create a Trigram language model


# Create a Bigram language model


# Create a Unigram language model


# Compute the probability a comment is toxic


# Evaluate the model


In [37]:
# EXAMPLE CODE: COMMENT OUT THIS CODE CELL AFTER IMPLEMENTING YOUR CODE BELOW
# Code from: https://scikit-learn.org/stable/modules/ensemble.html#forest
# Classification

from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1]]
Y = [0, 1]

clf = RandomForestClassifier(n_estimators=10)

clf = clf.fit(X, Y)

**Random Forest** for regression:

In [38]:
# EXAMPLE CODE: COMMENT OUT THIS CODE CELL AFTER IMPLEMENTING YOUR CODE BELOW
# Code from: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
# Regression

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_features=4, n_informative=2, random_state=0, shuffle=False)

regr = RandomForestRegressor(max_depth=2, random_state=0)

regr.fit(X, y)

print(regr.predict([[0, 0, 0, 0]]))

[-8.32987858]


<div class="alert alert-danger">
    <h4>WRITE CODE</h4>
    In the cell below, write the code that implements a Random Forest Classifier.
</div>

In [39]:
# YOUR CODE IN THIS CELL
#raise NotImplementedError() # Remove this after you have started implementing your code below

# Train the model

# Evaluate the model



In [None]:
# EXAMPLE CODE: COMMENT OUT THIS CODE CELL AFTER IMPLEMENTING YOUR CODE BELOW
# Code from:  https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from itertools import product
from sklearn.ensemble import VotingClassifier

# Load some example data
iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target

# Training classifiers
clf1 = DecisionTreeClassifier(max_depth=4)
clf2 = KNeighborsClassifier(n_neighbors=7)
clf3 = SVC(kernel='rbf', probability=True)

# The Ensemble classifier
# 'estimator' is another name for 'model' or 'learner'
eclf = VotingClassifier(estimators=[('dt', clf1), ('knn', clf2), ('svc', clf3)],
                        voting='soft',
                        weights=[2, 1, 2])

# train the individual classifiers first
clf1 = clf1.fit(X, y)
clf2 = clf2.fit(X, y)
clf3 = clf3.fit(X, y)
# then train the ensemble of the trained classifiers (clf1, clf2, & clf3)
eclf = eclf.fit(X, y)

# make predictions
print("Ensemble classifier's predictions:\n", eclf.predict(X))
print("\nThe resulting dimensions of the Ensemble classifier:", eclf.transform(X).shape)

A **Voting Ensemble** example for regression is [found here](https://scikit-learn.org/stable/modules/ensemble.html#voting-regressor).

In [None]:
# EXAMPLE CODE: COMMENT OUT THIS CODE CELL AFTER IMPLEMENTING YOUR CODE BELOW
# Code from:  https://scikit-learn.org/stable/modules/ensemble.html#voting-regressor

from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import VotingRegressor

# Loading some example data
X, y = load_diabetes(return_X_y=True)

# Training individual models
reg1 = GradientBoostingRegressor(random_state=1)
reg2 = RandomForestRegressor(random_state=1)
reg3 = LinearRegression()

# create an ensemble from the individual models
ereg = VotingRegressor(estimators=[('gb', reg1), ('rf', reg2), ('lr', reg3)])

ereg = ereg.fit(X, y)

<div class="alert alert-danger">
    <h4>WRITE CODE</h4>
    In the cell below, write the code that implements a Voting Ensemble classifier.
</div>

In [None]:
# YOUR CODE IN THIS CELL
raise NotImplementedError() # Remove this after you have started implementing your code below

# Train the model


# Evaluate the model



In [None]:
# EXAMPLE CODE: COMMENT OUT THIS CODE CELL AFTER IMPLEMENTING YOUR CODE BELOW
# from   https://www.tensorflow.org/tutorials/keras/text_classification_with_hub

import os
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds


########################
### SET UP THE DATASET
########################

# Uses the IMDB Movie Reviews dataset
# Split the training set into 60% and 40% to get
#     15,000 training examples
#     10,000 examples for validation
#     25,000 testing examples
train_data, validation_data, test_data = tfds.load( name="imdb_reviews", 
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)

train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))

# print first 10 examples
print(train_examples_batch)
# print the first 10 labels
train_labels_batch


####################################
### SET UP THE NEURAL NETWORK MODEL
####################################

# create a Keras layer that uses a TensorFlow Hub model to embed the sentences
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

# build the full model
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()

# configure the model to use an optimizer and a loss function
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])


####################################
### TRAIN THE NEURAL NETWORK MODEL
####################################

# Train the model for 10 epochs in mini-batches of 512 samples
# This is 10 iterations (epochs) over all samples in train_data
# While training, monitor the model's loss and accuracy on the 10,000 samples from the validation set
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=10,
                    validation_data=validation_data.batch(512),
                    verbose=1)


######################################
### EVALUATE THE NEURAL NETWORK MODEL
######################################

# Evaluate the model
# Two values will be returned:
#    Loss (a number which represents our error, lower values are better)
#    Accuracy
results = model.evaluate(test_data.batch(512), verbose=2)
for name, value in zip(model.metrics_names, results):
    print("%s: %.3f" % (name, value))

<div class="alert alert-danger">
    <h4>WRITE CODE</h4>
    In the cell below, write the code that implements a Neural Network.
</div>

In [None]:
# YOUR CODE IN THIS CELL
raise NotImplementedError() # Remove this after you have started implementing your code below


# Train the model


# Evaluate the model



In [None]:
# EXAMPLE CODE: COMMENT OUT THIS CODE CELL AFTER IMPLEMENTING YOUR CODE BELOW
# Code from:  https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes

import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

clf = GaussianNB()
clf.fit(X, Y)

print(clf.predict([[-0.8, -1]]))

clf_pf = GaussianNB()
clf_pf.partial_fit(X, Y, np.unique(Y))

print(clf_pf.predict([[-0.8, -1]]))

### Regression (Linear Model)

For regression, use a **Linear Regression** model instead of *Naive Bayes*.\
A succinct [linear regression example on the diabetes dataset](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py) provides a good explanation of an end-to-end experimental workflow.

In [None]:
# EXAMPLE CODE: COMMENT OUT THIS CODE CELL AFTER IMPLEMENTING YOUR CODE BELOW
# Code from:  https://scikit-learn.org/stable/modules/linear_model.html

from sklearn import linear_model

reg = linear_model.LinearRegression()

# train the model with data
reg.fit([[0, 0], [1, 1], [2, 2]],
        [0, 1, 2])

# make predictions
prediction = reg.predict([[-0.8, -1]])
print("Prediction:", prediction)

<div class="alert alert-danger">
    <h4>WRITE CODE</h4>
    In the cell below, write the code that implements a Naive Bayes Classifier.
</div>

In [None]:
# YOUR CODE IN THIS CELL
raise NotImplementedError() # Remove this after you have started implementing your code below

# Train the model


# Evaluate the model



## Task: Evaluation   (10 Marks)


The model will be *trained* on the **training data**.\
The model will be *evaluated* on the **test data**.

<font color=red>**TBD**.<font color=black> The below is *not* the final evaluation method but the one used in the **Kaggle** competition.

<font color=lightgray>

The evaluation description for the original **Identifying Toxic Comments** task needs to be updated.
    
> Submissions are evaluated on the **mean column-wise ROC AUC**.\
> In other words, the score is the **average of the individual AUCs of each predicted column**.
> 
> **Submission File**
> 
> For each id in the test set, you must predict a probability for each of the six possible types of comment toxicity (*toxic, severe_toxic, obscene, threat, insult, identity_hate*). The columns must be in the same order as shown below. The file should contain a header and have the following format:
> 
> 		id,toxic,severe_toxic,obscene,threat,insult,identity_hate
> 		00001cee341fdb12,0.5,0.5,0.5,0.5,0.5,0.5
> 		0000247867823ef7,0.5,0.5,0.5,0.5,0.5,0.5
> 		...
> 		etc.

    
<font color=black>

### How To Evaluate A Machine Learning Model

We will assign a toxicity score to each comment in a pair, where humans have judged one of the comments to be more toxic than the other.
The performance of a model is measured by how many of the comment pairs it ranks in agreement with the human rankings.

For example, **Comment 1** (from `validation_data.csv`) is given a score of 1.65 and
**Comment 2** is given a score of 76, so the model ranks **Comment 2** as more toxic than **Comment 1**. If this matches the human assessment, the pair scores 1/1. If not, it scores 0/1.
    
From the [task's description on **Kaggle**](https://www.kaggle.com/c/jigsaw-toxic-severity-rating/overview/evaluation):
> *For each of the approximately 200,000 pair ratings in the ground truth test data, we use your predicted toxicity score to rank the comment pair. The pair receives a 1 if this ranking matches the annotator ranking, or 0 if it does not match.*
    
The data used in the evaluation comes from `validation_data.csv`.
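
The pairwise scoring described above can be sketched as follows. This is a minimal illustration, not the official scoring code: `pairwise_ranking_accuracy`, `toy_score`, and the example comments are hypothetical names and data invented for this sketch, and `toy_score` merely stands in for a trained model's prediction function. Ties are counted as incorrect here, which is an assumption.

```python
from typing import Callable, Iterable, Tuple


def pairwise_ranking_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score_fn: Callable[[str], float],
) -> float:
    """Return the fraction of (less_toxic, more_toxic) pairs ranked correctly."""
    correct = 0
    total = 0
    for less_toxic, more_toxic in pairs:
        total += 1
        # A pair counts only if the "more toxic" comment gets the strictly higher score.
        if score_fn(more_toxic) > score_fn(less_toxic):
            correct += 1
    return correct / total if total else 0.0


def toy_score(comment: str) -> float:
    """Hypothetical stand-in for a trained model: counts flagged words."""
    flagged = {"idiot", "stupid"}
    return float(sum(word.strip(".,!?").lower() in flagged for word in comment.split()))


# Each tuple is (less toxic comment, more toxic comment), as judged by annotators.
example_pairs = [
    ("Thanks for the edit.", "You are an idiot."),
    ("Please cite a source.", "Stupid idiot, stop editing!"),
    ("I disagree with this.", "This is nonsense."),  # toy scorer ties at 0, so this pair counts as wrong
]

accuracy = pairwise_ranking_accuracy(example_pairs, toy_score)
print(f"Pairwise ranking accuracy: {accuracy:.3f}")
```

In practice, `score_fn` would wrap feature extraction and the model's `predict` call, and the pairs would be read from the two comment columns of `validation_data.csv`.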

In [None]:
# EXAMPLE CODE: COMMENT OUT THIS CODE CELL AFTER IMPLEMENTING YOUR CODE BELOW

# keep track of how many comment pairs we correctly rank
#total_correct_comment_pair_rankings = 0

# get next pair of comments from validation_data.csv
# NOTE: comment1 is from the column corresponding to "LESS TOXIC"
#       comment2 is from the column corresponding to "MORE TOXIC"
#comment1 = "Comment from Wikipedia! (less toxic comment)"
#comment2 = "ANOTHER Comment from Wikipedia! Swear word: bA$$ (more toxic comment)"

# convert the comment's text into a feature vector e.g., [0, 0, 2.51, 1, ...]
#comment1_features = extract_features(comment1)
#comment2_features = extract_features(comment2)

# compute the toxicity score of each comment
#toxicity_score_of_comment1 = some_model.predict(comment1_features)
#toxicity_score_of_comment2 = some_model.predict(comment2_features)

#if toxicity_score_of_comment2 > toxicity_score_of_comment1:
#    total_correct_comment_pair_rankings = total_correct_comment_pair_rankings + 1

<div class="alert alert-danger">
    <h4>WRITE CODE</h4>
    In the cell below, write the code for evaluating a model.
</div>

In [None]:
# Submission of predictions
# rounding predictions to increments of .5 with list comprehension
submission = pd.DataFrame()
submission['cohesion'] = [round(x*2)/2 for x in predictions[:,0]]
submission['syntax'] =  [round(x*2)/2 for x in predictions[:,1]]
submission['vocabulary'] =  [round(x*2)/2 for x in predictions[:,2]]
submission['phraseology'] =  [round(x*2)/2 for x in predictions[:,3]]
submission['grammar'] =  [round(x*2)/2 for x in predictions[:,4]]
submission['conventions'] =  [round(x*2)/2 for x in predictions[:,5]]
submission

In [None]:
submission.to_csv('submission.csv',index=False) # writing data to a CSV file

----
# Task: Project Report   

The total marks for all of the sections below is **80 Marks**.

This section corresponds to the write-up of the project. Your write-up is to be included within this **Jupyter Notebook** below (the code for this assignment is in the code cells above).\
The **Project Report** will consist of a few sections that each discuss a different stage of the end-to-end experiment.

## Overview    (5 Marks)

Discuss:
* the problem/task you are addressing
* concrete examples of the problem
* why the problem is worth the time and effort to solve
* how the task compares with other, similar tasks

The following are optional:
* *Related Work* i.e., what have other people tried
* historical background of the problem
* strategies used in other tasks that are similar to **Toxic Comment Identification**
* how those similar tasks differ from **Toxic Comment Identification**

## Dataset    (5 Marks)

Discuss:
* the dataset's size
* the languages the dataset contains
* anything unusual about the data
* how representative the dataset is of everyday communication
* etc.

## Features    (5 Marks)

Discuss:
* the features that were extracted
* the number of features

## Models    (5 Marks)

Discuss:
* the models used
* any specific parameters, configuration, or settings of each model
* any differences in how each model was trained

## Evaluation    (5 Marks)

This section discusses:
* how you evaluated the models in order to compare their relative performance
* evaluating models based on *overall performance*
* use visuals, tables, charts, graphs, etc. to communicate results

## Discussion    (15 Marks)
 
Compare the performance of the above models on **Identifying Toxic Comments**.\
Use visuals, charts, graphs, etc. to communicate your results.

Discuss:
* your findings in general
* compare the performance of the various models (was the performance what you expected?)
* which system performed best? why?
* which system had the worst performance? why?
* the reasons that led to the results of the evaluation
* provide some ideas you would like to have tried (provided you had more time or resources) that could potentially improve the performance of the models or a question that you were interested in exploring (i.e., *Future Work*)

<div class="alert alert-danger">
    <h1>YOUR PROJECT REPORT BEGINS BELOW THIS CELL</h1>
</div>

**INFORMATION**\
Student Name: *Hunter Klassen*\
Student ID: *300174049*\
Student UFV Email: *Hunter.Klassen@student.ufv.ca*

# Overview

The project I worked on is the Kaggle competition 'Feedback Prize - English Language Learning'. The competition tasked me with rating the cohesion, syntax, vocabulary, phraseology, grammar, and conventions of essays written by students in grades 8-12. Each measure is rated from 1.0 to 5.0 in 0.5 increments, and the training data provides these ratings for every essay. For example, consider this test text:

*"when a person has no experience on a job their is always going to be good people to help you and try to explane the job you need to get done in life you were not born with knowing everything. Life is bassicly about learing new things every single day even though without experience because life is simple and we must live happy and around with the people we love. When a person thinks they know everything in life they dont do good because they trying to make the other person less then others you must be kind to those the dont have experience because you may not know some day you will go to a different country. When you dont know anyting because you not from their so you going to need help from others to explain you about the culture or how to eat a food because you have to no experience on the new country. You must help a person the has no experience because maybe you may need help from the person the you didnt want to help.\n\nyes, even thought you may not have experience in the type of job you seek,you can learn and teach others.\n\nIf you dont have experence in a restaurant for the job you seek for you will learn. For example a person the has no experence working in a restaurant,the only place they will offer you would be to be a diswasher. But you want to dream big because everytime a person has big dreams they can learn the job whereever, you want to be like a cooking person or a kitchen manager. In a job there is always going to be people the they dont want to see you in a better place because they may think you dont deserve to be there but you the only one the knows how hard you being working to achieve your dream .In life you always going to have proof without experience to see how good you are to learn a new job in the kitchen so if you can learn quick. They can give you a good place for you teach others the has no experience.\n\nMy dad has always talk with my cousins that when their is no experience you can always learn and fight for what you want. 
When my cosuin came to America he wanted to play travel soccer, he went to tryout but the coach told him the has any experience of talking english so he wouldn't make it. But later on he learn how to speak then, he went to tryout for fc virginia he made it also everyone was talking about him because hes a great soccer player also, hes a great person. Then he came to the house and thanks my dad because of the edvice he gave to him the even though you dont have experience,you can always learn and fight for you goals. My dad is man the wants the best for hes family because when he was a kid he wanted to be a loyer but it was hard for hes parents because they were really poor and the corruption was really bad in the country.\n\nIn every job their is always going to be a person with no experience. For example people with no experience those are the ones the learn the job and when they learn the job very well they always try not to make a mistake because thay want to get the job done with quality. Because everytime you do a job for someone else they want to see good quality on you before they give the kind of job they want you to get done for them. The people the has no experience in a job that doesnt make them a less person because we all are humans and we must have the same equal rights. we all know the everytime you aplied for a job the first thing they asked you is about if you have experience but its okay to say no because they can teach you and you can get it fast.\n\nI think yes, you can be a good candidate to be hire without no experience because every person in the world needs to have a opportunity to try something new. People today in life they dont need to have experience to go find a job why because today in every work you get one week of train which people can learn the job just in one week. Because they will practice the job and every time they get practice they will going to get better and better. 
Practices makes everything better so dont be scared to applied for job only because it says you must have experience no just go believe in youself. If you believe in you everything will be good in like ad you will be going great just be you even though you have experience or not we all deserve a chance."*

The model produces a value for each of cohesion, syntax, vocabulary, phraseology, grammar, and conventions. For this essay, the values could potentially be:\
cohesion = 3.0\
syntax = 2.5\
vocabulary = 3.5\
phraseology = 3.0\
grammar = 3.0\
conventions = 2.5

---
**Application**

This can help students get feedback on particular areas of their writing, with particular emphasis on English Language Learners trying to hone their skills. Because the feedback is automated, writers receive constructive feedback instantly, which enables them to write more.\
This automated feedback is also unbiased and consistent.

# Dataset

The data used is the dataset provided with the competition named in the Overview. It is a table consisting of the 3912 full texts along with the rating of each measure for every text. The data came formatted as CSV and was easily converted to a pandas DataFrame. It contained errors: the texts had missing spaces and random symbols, and some cells that should have held the rating for a particular aspect of writing held text instead. This caused issues when trying to format the data while excluding the mistakes.\
Using Python's `re` module, I removed the special characters and punctuation. I also made the text all lowercase and applied the apostrophe dictionary so that the frequency distribution is accurate to an extent.
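
The cleaning steps above could look roughly like the sketch below. It is an illustration under assumptions rather than the exact code used in this notebook: `APOSTROPHE_MAP` and `clean_text` are hypothetical names, the map is a tiny made-up sample, and the sketch assumes the apostrophe dictionary is used to expand contractions.

```python
import re

# Hypothetical, deliberately tiny contraction map; a real one would be much larger.
APOSTROPHE_MAP = {
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
    "i'm": "i am",
}


def clean_text(text: str) -> str:
    """Lowercase, expand contractions, and strip punctuation and special characters."""
    text = text.lower()
    # Expand contractions before punctuation removal deletes the apostrophes.
    for contraction, expanded in APOSTROPHE_MAP.items():
        text = text.replace(contraction, expanded)
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_text("I'm sure it's fine,\n\nbut DON'T forget: practice!!")
print(cleaned)
```

Expanding contractions before stripping punctuation matters: once the apostrophe is gone, "don't" becomes "don t" and no longer matches the dictionary.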

# Features

The features are generated with the sklearn package. Using TF-IDF vectorization, I generated features from a frequency distribution over all of the training data, which creates 1206 features. TF-IDF measures how distinctive a word is to a document by comparing the number of times the word is used in that document with the number of documents the word appears in.\
$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$\
where $\text{TF}(t, d)$ is the number of times term $t$ appears in document $d$, and $\text{IDF}(t)$ is the inverse document frequency of term $t$.
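
To make the formula concrete, here is a small worked example in plain Python. It is a sketch, not the code used to build the 1206 features: the three documents are made up, and it uses the textbook definition $\text{IDF}(t) = \ln(N / \text{df}(t))$ as an assumption. sklearn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three made-up, already-cleaned documents.
documents = [
    "life is about learning new things",
    "you can learn the job",
    "learning the job takes practice",
]
tokenized = [doc.split() for doc in documents]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term: str) -> float:
    """Textbook inverse document frequency: ln(N / df(t)). Assumes the term occurs in the corpus."""
    return math.log(n_docs / doc_freq[term])


def tf_idf(term: str, tokens: list) -> float:
    """TF(t, d) * IDF(t), with TF as the raw count of the term in the document."""
    return tokens.count(term) * idf(term)


# Score a few terms against the third document.
for term in ("practice", "learning", "the"):
    print(f"{term!r}: {tf_idf(term, tokenized[2]):.3f}")
```

"practice" appears in only one document, so it scores higher than "learning" or "the", which each appear in two of the three.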

# Models

The model I used was the Support Vector Machine. Specifically, I used the regression variant (SVR), which models the ratings based on where each essay's feature vector lies in the feature space. After experimenting with the SVC as well as the linear and polynomial variations to get the best result, I used SVR wrapped in a Multi Output Regressor. This strategy fits one regressor per target, because SVR does not natively support multi-target regression. The multiple targets in this instance are the ratings for each measure, e.g., 'syntax' and 'grammar', with values from 1.0 to 5.0.
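
A minimal sketch of the one-regressor-per-target strategy is shown below. It is not the notebook's actual training code: the data is random stand-in data, `TARGETS` is a name introduced for the sketch, and the `kernel` and `C` values are placeholder assumptions rather than the settings used in this project.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

TARGETS = ["cohesion", "syntax", "vocabulary", "phraseology", "grammar", "conventions"]

# Synthetic stand-in data: 40 "essays" with 8 features each, scores in [1.0, 5.0].
rng = np.random.default_rng(0)
X = rng.random((40, 8))
y = rng.uniform(1.0, 5.0, size=(40, len(TARGETS)))

# SVR predicts a single target, so the wrapper fits one independent SVR per column of y.
model = MultiOutputRegressor(SVR(kernel="rbf", C=1.0))
model.fit(X, y)

predictions = model.predict(X[:3])
print(predictions.shape)        # one row per essay, one column per target
print(len(model.estimators_))   # one fitted SVR per target
```

Because each target gets its own regressor, the six measures are predicted independently; any correlation between, say, 'syntax' and 'grammar' is not shared between the fitted models.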

# Evaluation

The score was originally 86% accuracy, but when I tried cross-validation with the `cross_val_score` method I got around 22% accuracy. I must be using it incorrectly or have done something fundamentally wrong. I used 5 folds for the cross-validation, and the scores of all the folds were very close; the standard deviation was negligible.
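
One possible explanation for a gap like this, offered as a hypothesis rather than a diagnosis, is that the two numbers measure different things. When no `scoring` argument is given, `cross_val_score` calls the estimator's own `score` method, which for sklearn regressors returns R² rather than accuracy, and it scores held-out folds rather than the data the model was fit on. The competition itself ranks submissions by mean columnwise root mean squared error (MCRMSE). The sketch below computes that metric in plain Python; `mcrmse` and the sample scores are made up for illustration.

```python
import math


def mcrmse(y_true, y_pred):
    """Mean columnwise RMSE over rows of per-essay scores (lower is better)."""
    n_rows = len(y_true)
    n_cols = len(y_true[0])
    column_rmses = []
    for col in range(n_cols):
        # Squared error of every essay for this one measure.
        squared_errors = [(y_true[row][col] - y_pred[row][col]) ** 2 for row in range(n_rows)]
        column_rmses.append(math.sqrt(sum(squared_errors) / n_rows))
    # Average the per-measure RMSE values into a single score.
    return sum(column_rmses) / n_cols


# Made-up scores for two essays and two of the six measures.
y_true = [[3.0, 2.5], [4.0, 3.0]]
y_pred = [[3.5, 2.5], [3.5, 3.0]]
score = mcrmse(y_true, y_pred)
print(f"MCRMSE: {score:.3f}")
```

Reporting an error metric like this on held-out data makes the result directly comparable with the competition leaderboard, where lower values are better.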

# Conclusion 

With simple concepts, you can quickly generate features and predictions that can be extremely useful. Looking at the top submissions for the Kaggle competition, you can see just how impressively accurate they can be. A winning solution's write-up shows the approach used to win: the first-place solution combined several different models:
<ol>
  <li>microsoft-deberta-v3-base</li>
  <li>deberta-v3-large</li>
  <li>deberta-v2-xlarge</li>
  <li>roberta-large</li>
  <li>distilbert-base-uncased</li>
</ol>

They also used different pooling techniques, such as:

<ol>
  <li>mean pooling</li>
  <li>concat pooling</li>
  <li>weighted layer pooling</li>
  <li>gem pooling</li>
  <li>LSTM pooling</li>
</ol>