### Coursework coding instructions (please also see full coursework spec)

Please choose either Task 1 or Task 2. You should write your report about one task only.

For the task you choose you will need to do two approaches:
  - Approach 1, which can use pre-trained embeddings / models
  - Approach 2, which should not use any pre-trained embeddings or models
We should be able to run both approaches from the same colab file.

#### Running your code:
  - Your models should run automatically when running your colab file without further intervention
  - For each task you should automatically output the performance of both models
  - Your code should automatically download any libraries required

#### Structure of your code:
  - You are expected to use the 'train', 'eval' and 'model_performance' functions, although you may edit these as required
  - Otherwise there are no restrictions on what you can do in your code

#### Documentation:
  - You are expected to produce a .README file summarising how you have approached both tasks

#### Reproducibility:
  - Your .README file should explain how to replicate the different experiments mentioned in your report

Good luck! We are really looking forward to seeing your reports and your model code!

In [None]:
# You will need to download any word embeddings required for your code, e.g.:

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

# For any packages that Colab does not provide automatically you will also need to install these below, e.g.:

#! pip install torch

--2021-03-01 08:38:02--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-03-01 08:38:02--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-03-01 08:38:03--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2021-0

In [None]:
# Imports

import torch
import torch.nn as nn
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from torch.utils.data import Dataset, random_split
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import codecs

In [None]:
# Setting random seed and device
SEED = 1

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load data
train_df = pd.read_csv('/content/drive/MyDrive/data/task-1/train.csv')
test_df = pd.read_csv('/content/drive/MyDrive/data/task-1/dev.csv')
test_set = pd.read_csv('/content/drive/MyDrive/data/task-1/test.csv')

In [None]:
# Number of epochs
epochs = 10

# Proportion of training data for train compared to dev
train_proportion = 0.8

In [None]:
# We define our training loop
def train(train_iter, dev_iter, model, number_epoch):
    """
    Training loop for the model, which calls on eval to evaluate after each epoch
    """

    
    print("Training model.")

    for epoch in range(1, number_epoch+1):

        model.train()
        epoch_loss = 0
        epoch_sse = 0
        no_observations = 0  # Observations used for training so far

        for batch in train_iter:

            feature, target = batch

            feature, target = feature.to(device), target.to(device)

            # for RNN:
            model.batch_size = target.shape[0]
            no_observations = no_observations + target.shape[0]
            model.hidden = model.init_hidden()

            predictions = model(feature).squeeze(1)

            optimizer.zero_grad()

            loss = loss_fn(predictions, target)

            sse, __ = model_performance(predictions.detach().cpu().numpy(), target.detach().cpu().numpy())

            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()*target.shape[0]
            epoch_sse += sse

        valid_loss, valid_mse, __, __ = eval(dev_iter, model)

        epoch_loss, epoch_mse = epoch_loss / no_observations, epoch_sse / no_observations
        print(f'| Epoch: {epoch:02} | Train Loss: {epoch_loss:.2f} | Train MSE: {epoch_mse:.2f} | Train RMSE: {epoch_mse**0.5:.2f} | \
        Val. Loss: {valid_loss:.2f} | Val. MSE: {valid_mse:.2f} |  Val. RMSE: {valid_mse**0.5:.2f} |')

In [None]:
# We evaluate performance on our dev set
def eval(data_iter, model):
    """
    Evaluating model performance on the dev set
    """
    model.eval()
    epoch_loss = 0
    epoch_sse = 0
    pred_all = []
    trg_all = []
    no_observations = 0

    with torch.no_grad():
        for batch in data_iter:
            feature, target = batch

            feature, target = feature.to(device), target.to(device)

            # for RNN:
            model.batch_size = target.shape[0]
            no_observations = no_observations + target.shape[0]
            model.hidden = model.init_hidden()

            predictions = model(feature).squeeze(1)
            loss = loss_fn(predictions, target)

            # We get the mse
            pred, trg = predictions.detach().cpu().numpy(), target.detach().cpu().numpy()
            sse, __ = model_performance(pred, trg)

            epoch_loss += loss.item()*target.shape[0]
            epoch_sse += sse
            pred_all.extend(pred)
            trg_all.extend(trg)

    return epoch_loss/no_observations, epoch_sse/no_observations, np.array(pred_all), np.array(trg_all)

In [None]:
# How we print the model performance
def model_performance(output, target, print_output=False):
    """
    Returns SSE and MSE per batch (printing the MSE and the RMSE)
    """

    sq_error = (output - target)**2

    sse = np.sum(sq_error)
    mse = np.mean(sq_error)
    rmse = np.sqrt(mse)

    if print_output:
        print(f'| MSE: {mse:.2f} | RMSE: {rmse:.2f} |')

    return sse, mse
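
The training and evaluation loops accumulate the per-batch SSE and divide by the number of observations, rather than averaging per-batch MSEs. A minimal sketch of why that matters, using plain Python lists and made-up numbers (the helper `batch_sse` is hypothetical and only mirrors the SSE part of `model_performance`):

```python
# Illustrative only: made-up predictions/targets split into unequal batches.
def batch_sse(predictions, targets):
    """Sum of squared errors for one batch."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))

batches = [
    ([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]),  # batch of 3 observations
    ([4.0], [0.0]),                      # smaller final batch of 1
]

total_sse = sum(batch_sse(p, t) for p, t in batches)
n_observations = sum(len(t) for _, t in batches)

# Dataset-level MSE, computed the same way as in train() and eval()
overall_mse = total_sse / n_observations

# Averaging per-batch MSEs instead over-weights the small final batch
mean_of_batch_mses = sum(batch_sse(p, t) / len(t) for p, t in batches) / len(batches)

print(overall_mse, mean_of_batch_mses)
```

With equal batch sizes the two figures coincide; they diverge only when the last batch is shorter than the rest.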

In [None]:
def create_vocab(data):
    """
    Creating a corpus of all the tokens used
    """
    tokenized_corpus = [] # Let us put the tokenized corpus in a list

    for sentence in data:

        tokenized_sentence = []

        for token in sentence.split(' '):  # simplest tokenization: split on single spaces

            tokenized_sentence.append(token)

        tokenized_corpus.append(tokenized_sentence)

    # Create single list of all vocabulary
    vocabulary = []  # Let us put all the tokens (mostly words) appearing in the vocabulary in a list

    for sentence in tokenized_corpus:

        for token in sentence:

            if token not in vocabulary:
                vocabulary.append(token)

    return vocabulary, tokenized_corpus
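
The `token not in vocabulary` check scans the whole list for every token, which gets slow on larger corpora. One possible alternative, sketched here under the hypothetical name `create_vocab_fast`, relies on `dict` preserving insertion order to de-duplicate in a single pass while returning the same two values:

```python
def create_vocab_fast(data):
    """Hypothetical variant of create_vocab: same outputs, dict-based de-duplication."""
    tokenized_corpus = [sentence.split(' ') for sentence in data]
    # dict.fromkeys keeps the first occurrence of each token, in order of appearance
    vocabulary = list(dict.fromkeys(
        token for sentence in tokenized_corpus for token in sentence
    ))
    return vocabulary, tokenized_corpus

# Toy input, not taken from the dataset
vocabulary, tokenized_corpus = create_vocab_fast(["the cat sat", "the dog sat"])
print(vocabulary)
```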

In [None]:
def collate_fn_padd(batch):
    '''
    We add padding to our minibatches and create tensors for our model
    '''

    batch_labels = [l for f, l in batch]
    batch_features = [f for f, l in batch]

    batch_features_len = [len(f) for f, l in batch]

    seq_tensor = torch.zeros((len(batch), max(batch_features_len))).long()

    for idx, (seq, seqlen) in enumerate(zip(batch_features, batch_features_len)):
        seq_tensor[idx, :seqlen] = torch.LongTensor(seq)

    batch_labels = torch.FloatTensor(batch_labels)

    return seq_tensor, batch_labels

class Task1Dataset(Dataset):

    def __init__(self, train_data, labels):
        self.x_train = train_data
        self.y_train = labels

    def __len__(self):
        return len(self.y_train)

    def __getitem__(self, item):
        return self.x_train[item], self.y_train[item]
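
To make the padding step concrete, here is a sketch of what `collate_fn_padd` does to the features, written with plain lists instead of tensors so that it runs without PyTorch. The helper `pad_batch` and the token indices are made up for illustration; index 0 is assumed to be reserved for padding, as the zero-initialised tensor in `collate_fn_padd` implies.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

# Three "sentences" of token indices with different lengths
batch_features = [[5, 2, 9], [7], [3, 4]]
padded = pad_batch(batch_features)
print(padded)
```

If index 0 were also assigned to a real token, the model could not tell that token apart from padding, so the word-to-index mapping should leave 0 unused.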

In [31]:
import re
# obtaining edited headline
idx = 0
edited = []
for item in train_df['original']:
  text = item.strip()
  m = re.search(r"\<[^()]*\/>", text)
  #print(train_df['original'][idx])
  #print(m.group(0))
  text = text.replace(m.group(0), train_df['edit'][idx])
  edited.append(text)
  idx +=1
# print(edited)


In [32]:
# obtaining edited test headline
idx = 0
edited_test = []
for item in test_set['original']:
  text = item.strip()
  # finding words to replace in the brackets
  m = re.search(r"\<[^()]*\/>", text)
  # replacing
  text = text.replace(m.group(0), test_set['edit'][idx])
  edited_test.append(text)
  idx +=1
# print(edited_test)
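
The two loops above locate the `<word/>` marker in each original headline and substitute the edit word. A compact sketch of the same substitution on a single made-up headline (the function name `apply_edit` is hypothetical; the pattern is the one used in the cells above):

```python
import re

# Same pattern as in the cells above: "<", any run without parentheses, "/>"
EDIT_PATTERN = re.compile(r"<[^()]*/>")

def apply_edit(original, edit):
    """Replace the <word/> marker in a headline with the edit word."""
    text = original.strip()
    match = EDIT_PATTERN.search(text)
    if match is None:
        # No marker found: return the headline unchanged rather than fail
        return text
    return text.replace(match.group(0), edit)

# Made-up example headline, not taken from the dataset
edited_headline = apply_edit("Senate votes to <repeal/> the new law", "tickle")
print(edited_headline)
```

Unlike the loops above, this version does not raise an error when a headline has no marker; whether that case occurs in the data is not something this notebook shows.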

#### Approach 2: No pre-trained representations

In [None]:
# ORIGINAL- LINEAR REGRESSION
train_and_dev = edited
test = edited_test
test_y = test_set['meanGrade']
training_data, dev_data, training_y, dev_y = train_test_split(edited, train_df['meanGrade'],
                                                                        test_size=(1-train_proportion),
                                                                        random_state=42)

# We train a Tf-idf model
count_vect = CountVectorizer(stop_words='english')
train_counts = count_vect.fit_transform(training_data)
transformer = TfidfTransformer().fit(train_counts)
train_counts = transformer.transform(train_counts)
regression_model = LinearRegression().fit(train_counts, training_y)

# Train predictions
predicted_train = regression_model.predict(train_counts)

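# Refit Tf-idf on train and dev, then validate the model on dev.
# Note: this replaces the transformer the model was trained with, so dev and test features use different IDF weights.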
test_and_test_counts = count_vect.transform(train_and_dev)
transformer = TfidfTransformer().fit(test_and_test_counts)

test_counts = count_vect.transform(dev_data)

test_counts = transformer.transform(test_counts)

# Dev predictions
predicted = regression_model.predict(test_counts)

# We run the evaluation:
print("\nTrain performance:")
sse, mse = model_performance(predicted_train, training_y, True)

print("\nDev performance:")
sse, mse = model_performance(predicted, dev_y, True)

# Test predictions
test_c = count_vect.transform(test)
test_c = transformer.transform(test_c)
predicted = regression_model.predict(test_c)
print("\nTest performance:")
sse, mse = model_performance(predicted, test_y, True)


Train performance:
| MSE: 0.01 | RMSE: 0.09 |

Dev performance:
| MSE: 17.09 | RMSE: 4.13 |

Test performance:
| MSE: 52.21 | RMSE: 7.23 |
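
In the cell above the Tf-idf transformer is refitted on train plus dev after the regression model has been trained, so the same term receives a different weight at prediction time than it had during training. A minimal sketch of that effect on made-up token sets, assuming the smoothed IDF that scikit-learn's `TfidfTransformer` uses by default, idf = ln((1 + n) / (1 + df)) + 1:

```python
import math

def smoothed_idf(term, documents):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df counts documents containing term."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc)
    return math.log((1 + n) / (1 + df)) + 1

# Made-up token sets standing in for the training and dev headlines
train_docs = [{"trump", "wall"}, {"trump", "tax"}, {"senate", "vote"}]
dev_docs = [{"trump", "golf"}, {"senate", "nap"}]

idf_train = smoothed_idf("trump", train_docs)             # weight seen during training
idf_refit = smoothed_idf("trump", train_docs + dev_docs)  # weight after refitting

print(f"{idf_train:.4f} {idf_refit:.4f}")
```

Keeping the transformer fitted on the training split only would avoid the mismatch. How much of the gap between the train and dev scores it explains is not measured here.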


#### Baseline for task 1

In [None]:
# Baseline for the task
pred_baseline = torch.zeros(len(dev_y)) + np.mean(training_y)
print("\nBaseline performance:")
sse, mse = model_performance(pred_baseline, dev_y, True)


Baseline performance:
| MSE: 0.34 | RMSE: 0.58 |
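
The baseline predicts the mean training grade for every headline, so its MSE on any set equals the variance of that set's grades plus the squared gap between the two means. A minimal sketch with made-up grades (not dataset values):

```python
from statistics import fmean, pvariance

# Made-up grades for illustration
training_grades = [0.0, 1.0, 2.0, 1.0]
dev_grades = [0.5, 1.5, 2.5]

# Constant prediction used for every dev item
baseline_prediction = fmean(training_grades)

baseline_mse = fmean([(y - baseline_prediction) ** 2 for y in dev_grades])

# Decomposition: variance of the dev grades + squared difference of the means
decomposed = pvariance(dev_grades) + (fmean(dev_grades) - baseline_prediction) ** 2

print(baseline_mse, decomposed)
```

A model only adds value over this baseline if its dev MSE falls below the figure printed above.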


#### Random Forest Regression

In [None]:
# Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

train_and_dev = edited
test = edited_test
test_y = test_set['meanGrade']

training_data, dev_data, training_y, dev_y = train_test_split(edited, train_df['meanGrade'],
                                                                        test_size=(1-train_proportion),
                                                                        random_state=42)

# We train a Tf-idf model
count_vect = CountVectorizer(stop_words='english')
train_counts = count_vect.fit_transform(training_data)
transformer = TfidfTransformer().fit(train_counts)
train_counts = transformer.transform(train_counts)
regr = RandomForestRegressor(n_estimators = 200, max_depth=20, random_state=0)
regression_model = regr.fit(train_counts, training_y)

# Train predictions
predicted_train = regression_model.predict(train_counts)

# Calculate Tf-idf using train and dev, and validate model on dev:
test_and_test_counts = count_vect.transform(train_and_dev)
transformer = TfidfTransformer().fit(test_and_test_counts)

test_counts = count_vect.transform(dev_data)

test_counts = transformer.transform(test_counts)

# Dev predictions
predicted = regression_model.predict(test_counts)

# We run the evaluation:
print("\nTrain performance:")
sse, mse = model_performance(predicted_train, training_y, True)

print("\nDev performance:")
sse, mse = model_performance(predicted, dev_y, True)
# Test predictions
test_c = count_vect.transform(test)
test_c = transformer.transform(test_c)
predicted = regression_model.predict(test_c)
print("\nTest performance:")
sse, mse = model_performance(predicted, test_y, True)


Train performance:
| MSE: 0.28 | RMSE: 0.53 |

Dev performance:
| MSE: 0.33 | RMSE: 0.57 |

Test performance:
| MSE: 0.32 | RMSE: 0.56 |


#### Support Vector Regression

In [None]:
# SVR
from sklearn import svm
train_and_dev = edited
test = edited_test
test_y = test_set['meanGrade']

training_data, dev_data, training_y, dev_y = train_test_split(edited, train_df['meanGrade'],
                                                                        test_size=(1-train_proportion),
                                                                                        random_state=42)

# We train a Tf-idf model
count_vect = CountVectorizer(stop_words='english')
train_counts = count_vect.fit_transform(training_data)
transformer = TfidfTransformer().fit(train_counts)
train_counts = transformer.transform(train_counts)
# With kernel='linear', scikit-learn ignores `degree` and `gamma`
model = svm.SVR(C=1.0, kernel='linear', degree=3, epsilon=0.5, gamma='auto')
regression_model = model.fit(train_counts, training_y)

# Train predictions
predicted_train = regression_model.predict(train_counts)

# Calculate Tf-idf using train and dev, and validate model on dev:
test_and_test_counts = count_vect.transform(train_and_dev)
transformer = TfidfTransformer().fit(test_and_test_counts)

test_counts = count_vect.transform(dev_data)

test_counts = transformer.transform(test_counts)

# Dev predictions
predicted = regression_model.predict(test_counts)

# We run the evaluation:
print("\nTrain performance:")
sse, mse = model_performance(predicted_train, training_y, True)

print("\nDev performance:")
sse, mse = model_performance(predicted, dev_y, True)

# Test predictions
test_c = count_vect.transform(test)
test_c = transformer.transform(test_c)
predicted = regression_model.predict(test_c)
print("\nTest performance:")
sse, mse = model_performance(predicted, test_y, True)


Train performance:
| MSE: 0.18 | RMSE: 0.42 |

Dev performance:
| MSE: 0.34 | RMSE: 0.58 |

Test performance:
| MSE: 0.33 | RMSE: 0.58 |


#### Naive Bayes

In [None]:
# Naive Bayes (MultinomialNB)
# Attempt to use a classifier for a regression task
import numpy as np
from sklearn.naive_bayes import MultinomialNB
train_and_dev = edited
test = edited_test
test_y = test_set['meanGrade']

training_data, dev_data, training_y, dev_y = train_test_split(edited, train_df['meanGrade'],
                                                                        test_size=(1-train_proportion),
                                                                                        random_state=42)

# We train a Tf-idf model
count_vect = CountVectorizer(stop_words='english')
train_counts = count_vect.fit_transform(training_data)
transformer = TfidfTransformer().fit(train_counts)
train_counts = transformer.transform(train_counts)
model = MultinomialNB()
# MultinomialNB needs discrete class labels, so the grades are truncated to integers
regression_model = model.fit(train_counts, training_y.astype('int'))

# Train predictions
predicted_train = regression_model.predict(train_counts)

# Calculate Tf-idf using train and dev, and validate model on dev:
test_and_test_counts = count_vect.transform(train_and_dev)
transformer = TfidfTransformer().fit(test_and_test_counts)

test_counts = count_vect.transform(dev_data)

test_counts = transformer.transform(test_counts)

# Dev predictions
predicted = regression_model.predict(test_counts)

# We run the evaluation:
print("\nTrain performance:")
sse, mse = model_performance(predicted_train, training_y, True)

print("\nDev performance:")
sse, mse = model_performance(predicted, dev_y, True)
# Test predictions
test_c = count_vect.transform(test)
test_c = transformer.transform(test_c)
predicted = regression_model.predict(test_c)
print("\nTest performance:")
sse, mse = model_performance(predicted, test_y, True)


Train performance:
| MSE: 0.59 | RMSE: 0.77 |

Dev performance:
| MSE: 0.82 | RMSE: 0.91 |

Test performance:
| MSE: 0.85 | RMSE: 0.92 |
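
`MultinomialNB` is a classifier, so the cell above casts the grades to integers before fitting. That cast truncates towards zero, which means even a classifier that predicted every integer label correctly would still carry the truncation error. A sketch with made-up grades:

```python
# Made-up mean grades; int() truncates like astype('int') does for non-negative floats
grades = [0.2, 0.8, 1.4, 2.6]
labels = [int(g) for g in grades]

# MSE of a hypothetical classifier that predicts every truncated label perfectly
floor_mse = sum((g - l) ** 2 for g, l in zip(grades, labels)) / len(grades)

print(labels, round(floor_mse, 2))
```

Rounding instead of truncating, or binning the grades more finely, would be possible variations; neither is tested in this notebook.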


#### Polynomial Regression

In [None]:
# polynomial regression

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

train_and_dev = edited
test = edited_test
test_y = test_set['meanGrade']

training_data, dev_data, training_y, dev_y = train_test_split(edited, train_df['meanGrade'],
                                                                        test_size=(1-train_proportion),
                                                                                        random_state=42)

# We train a Tf-idf model
count_vect = CountVectorizer(stop_words='english')
train_counts = count_vect.fit_transform(training_data)
transformer = TfidfTransformer().fit(train_counts)
train_counts = transformer.transform(train_counts)
degree = 2
reg = make_pipeline(PolynomialFeatures(degree),LinearRegression())
regression_model = reg.fit(train_counts, training_y)

# Train predictions
predicted_train = regression_model.predict(train_counts)

# Calculate Tf-idf using train and dev, and validate model on dev:
test_and_test_counts = count_vect.transform(train_and_dev)
transformer = TfidfTransformer().fit(test_and_test_counts)

test_counts = count_vect.transform(dev_data)

test_counts = transformer.transform(test_counts)

# Dev predictions
predicted = regression_model.predict(test_counts)

# We run the evaluation:
print("\nTrain performance:")
sse, mse = model_performance(predicted_train, training_y, True)

print("\nDev performance:")
sse, mse = model_performance(predicted, dev_y, True)
# Test predictions
test_c = count_vect.transform(test)
test_c = transformer.transform(test_c)
predicted = regression_model.predict(test_c)
print("\nTest performance:")
sse, mse = model_performance(predicted, test_y, True)


Train performance:
| MSE: 0.00 | RMSE: 0.02 |

Dev performance:
| MSE: 0.43 | RMSE: 0.65 |

Test performance:
| MSE: 0.42 | RMSE: 0.65 |
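
A train MSE of 0.00 against a dev MSE of 0.43 suggests overfitting, which is unsurprising given how many features a degree-2 expansion creates. With n input features and degree d, `PolynomialFeatures` (with its default bias column) produces C(n + d, d) outputs. A quick check of that count, using a hypothetical vocabulary size:

```python
from math import comb

def n_polynomial_features(n_inputs, degree):
    """Number of monomials of total degree <= degree in n_inputs variables (bias included)."""
    return comb(n_inputs + degree, degree)

# Two inputs a, b at degree 2: 1, a, b, a^2, ab, b^2
small = n_polynomial_features(2, 2)

# Hypothetical vocabulary of 5,000 terms (the real size is not stated in this notebook)
large = n_polynomial_features(5000, 2)

print(small, large)
```

Under that assumption the expanded feature count is in the millions, far above the number of training headlines, so a linear model can fit the training set almost exactly.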


#### Ridge Regression

In [None]:
# Ridge Regression

from sklearn.linear_model import Ridge

train_and_dev = edited
test = edited_test
test_y = test_set['meanGrade']

training_data, dev_data, training_y, dev_y = train_test_split(edited, train_df['meanGrade'],
                                                                        test_size=(1-train_proportion),
                                                                                        random_state=42)

# We train a Tf-idf model
count_vect = CountVectorizer(stop_words='english')
train_counts = count_vect.fit_transform(training_data)
transformer = TfidfTransformer().fit(train_counts)
train_counts = transformer.transform(train_counts)
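# Note: the `normalize` argument was removed from Ridge in scikit-learn 1.2, so this call needs an older release
reg = Ridge(alpha=1.0, normalize=True)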
regression_model = reg.fit(train_counts, training_y)

# Train predictions
predicted_train = regression_model.predict(train_counts)

# Calculate Tf-idf using train and dev, and validate model on dev:
test_and_test_counts = count_vect.transform(train_and_dev)
transformer = TfidfTransformer().fit(test_and_test_counts)

test_counts = count_vect.transform(dev_data)

test_counts = transformer.transform(test_counts)

# Dev predictions
predicted = regression_model.predict(test_counts)

# We run the evaluation:
print("\nTrain performance:")
sse, mse = model_performance(predicted_train, training_y, True)

print("\nDev performance:")
sse, mse = model_performance(predicted, dev_y, True)
# Test predictions
test_c = count_vect.transform(test)
test_c = transformer.transform(test_c)
predicted = regression_model.predict(test_c)
print("\nTest performance:")
sse, mse = model_performance(predicted, test_y, True)


Train performance:
| MSE: 0.12 | RMSE: 0.34 |

Dev performance:
| MSE: 0.35 | RMSE: 0.59 |

Test performance:
| MSE: 0.34 | RMSE: 0.58 |


#### Lasso Regression

In [None]:
# Lasso Regression
from sklearn import linear_model

train_and_dev = edited
test = edited_test
test_y = test_set['meanGrade']

training_data, dev_data, training_y, dev_y = train_test_split(edited, train_df['meanGrade'],
                                                                        test_size=(1-train_proportion),
                                                                                        random_state=42)

# We train a Tf-idf model
count_vect = CountVectorizer(stop_words='english')
train_counts = count_vect.fit_transform(training_data)
transformer = TfidfTransformer().fit(train_counts)
train_counts = transformer.transform(train_counts)
reg = linear_model.Lasso(alpha=0.0001)
regression_model = reg.fit(train_counts, training_y)

# Train predictions
predicted_train = regression_model.predict(train_counts)

# Calculate Tf-idf using train and dev, and validate model on dev:
test_and_test_counts = count_vect.transform(train_and_dev)
transformer = TfidfTransformer().fit(test_and_test_counts)

test_counts = count_vect.transform(dev_data)

test_counts = transformer.transform(test_counts)

# Dev predictions
predicted = regression_model.predict(test_counts)

# We run the evaluation:
print("\nTrain performance:")
sse, mse = model_performance(predicted_train, training_y, True)

print("\nDev performance:")
sse, mse = model_performance(predicted, dev_y, True)
# Test predictions
test_c = count_vect.transform(test)
test_c = transformer.transform(test_c)
predicted = regression_model.predict(test_c)
print("\nTest performance:")
sse, mse = model_performance(predicted, test_y, True)


Train performance:
| MSE: 0.25 | RMSE: 0.50 |

Dev performance:
| MSE: 0.33 | RMSE: 0.57 |

Test performance:
| MSE: 0.32 | RMSE: 0.56 |


#### MLP Regression

In [None]:
#MLP Regressor
from sklearn.neural_network import MLPRegressor

train_and_dev = edited
test = edited_test
test_y = test_set['meanGrade']

training_data, dev_data, training_y, dev_y = train_test_split(edited, train_df['meanGrade'],
                                                                        test_size=(1-train_proportion),
                                                                                        random_state=42)

# We train a Tf-idf model
count_vect = CountVectorizer(stop_words='english')
train_counts = count_vect.fit_transform(training_data)
transformer = TfidfTransformer().fit(train_counts)
train_counts = transformer.transform(train_counts)
reg = MLPRegressor(early_stopping=True)
regression_model = reg.fit(train_counts, training_y)

# Train predictions
predicted_train = regression_model.predict(train_counts)

# Calculate Tf-idf using train and dev, and validate model on dev:
test_and_test_counts = count_vect.transform(train_and_dev)
transformer = TfidfTransformer().fit(test_and_test_counts)

test_counts = count_vect.transform(dev_data)

test_counts = transformer.transform(test_counts)

# Dev predictions
predicted = regression_model.predict(test_counts)

# We run the evaluation:
print("\nTrain performance:")
sse, mse = model_performance(predicted_train, training_y, True)

print("\nDev performance:")
sse, mse = model_performance(predicted, dev_y, True)
# Test predictions
test_c = count_vect.transform(test)
test_c = transformer.transform(test_c)
predicted = regression_model.predict(test_c)
print("\nTest performance:")
sse, mse = model_performance(predicted, test_y, True)


Train performance:
| MSE: 0.22 | RMSE: 0.47 |

Dev performance:
| MSE: 0.33 | RMSE: 0.58 |

Test performance:
| MSE: 0.33 | RMSE: 0.57 |


#### Stemming

In [None]:
# stemming
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import EnglishStemmer
from sklearn.ensemble import RandomForestRegressor
stemmer = EnglishStemmer()
analyzer = CountVectorizer().build_analyzer()

# defining a function to allow stemming
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

train_and_dev = edited
test = edited_test
test_y = test_set['meanGrade']

training_data, dev_data, training_y, dev_y = train_test_split(edited, train_df['meanGrade'],
                                                                        test_size=(1-train_proportion),
                                                                                        random_state=42)

# We train a Tf-idf model
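# Note: scikit-learn ignores `stop_words` when `analyzer` is a callable, so stop words are kept here
count_vect = CountVectorizer(stop_words='english', analyzer=stemmed_words)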
train_counts = count_vect.fit_transform(training_data)
transformer = TfidfTransformer().fit(train_counts)
train_counts = transformer.transform(train_counts)
regr = RandomForestRegressor(max_depth=20, random_state=0)
#regression_model = LinearRegression().fit(train_counts, training_y)
regression_model = regr.fit(train_counts, training_y)
# Train predictions
predicted_train = regression_model.predict(train_counts)

# Calculate Tf-idf using train and dev, and validate model on dev:
test_and_test_counts = count_vect.transform(train_and_dev)
transformer = TfidfTransformer().fit(test_and_test_counts)

test_counts = count_vect.transform(dev_data)

test_counts = transformer.transform(test_counts)

# Dev predictions
predicted = regression_model.predict(test_counts)

# We run the evaluation:
print("\nTrain performance:")
sse, mse = model_performance(predicted_train, training_y, True)

print("\nDev performance:")
sse, mse = model_performance(predicted, dev_y, True)
# Test predictions
test_c = count_vect.transform(test)
test_c = transformer.transform(test_c)
predicted = regression_model.predict(test_c)
print("\nTest performance:")
sse, mse = model_performance(predicted, test_y, True)


Train performance:
| MSE: 0.27 | RMSE: 0.52 |

Dev performance:
| MSE: 0.33 | RMSE: 0.57 |

Test performance:
| MSE: 0.32 | RMSE: 0.56 |


#### Lemmatization

In [None]:
# imports to carry out lemmatization
import nltk
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
#Lemmatization
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 
from sklearn.ensemble import RandomForestRegressor

# defining class for lemmatizing
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

train_and_dev = edited
test = edited_test
test_y = test_set['meanGrade']

training_data, dev_data, training_y, dev_y = train_test_split(edited, train_df['meanGrade'],
                                                                        test_size=(1-train_proportion),
                                                                                        random_state=42)

# We train a Tf-idf model
count_vect = CountVectorizer(tokenizer=LemmaTokenizer(),
                                # stop_words = 'english', # removed because it wasn't compatible
                                lowercase = True)
train_counts = count_vect.fit_transform(training_data)
transformer = TfidfTransformer().fit(train_counts)
train_counts = transformer.transform(train_counts)
regr = RandomForestRegressor(max_depth=20, random_state=0)
#regression_model = LinearRegression().fit(train_counts, training_y)    # previous experiment
regression_model = regr.fit(train_counts, training_y)

# Train predictions
predicted_train = regression_model.predict(train_counts)

# Calculate Tf-idf using train and dev, and validate model on dev:
test_and_test_counts = count_vect.transform(train_and_dev)
transformer = TfidfTransformer().fit(test_and_test_counts)

test_counts = count_vect.transform(dev_data)

test_counts = transformer.transform(test_counts)

# Dev predictions
predicted = regression_model.predict(test_counts)

# We run the evaluation:
print("\nTrain performance:")
sse, mse = model_performance(predicted_train, training_y, True)

print("\nDev performance:")
sse, mse = model_performance(predicted, dev_y, True)
# Test predictions
test_c = count_vect.transform(test)
test_c = transformer.transform(test_c)
predicted = regression_model.predict(test_c)
print("\nTest performance:")
sse, mse = model_performance(predicted, test_y, True)


Train performance:
| MSE: 0.27 | RMSE: 0.52 |

Dev performance:
| MSE: 0.33 | RMSE: 0.57 |

Test performance:
| MSE: 0.32 | RMSE: 0.56 |
