# Notebook Project 2 Predictive Modeling
### Michael Fatjanov

## Introduction
In this project, we will build an SVC model with a linear kernel and train it mostly on word embeddings. Other features and models are also used and are explained later.
## Modules
First, the necessary modules are imported. *Gensim* is used to import word embeddings. From *sklearn* we will use the Support Vector Classifier (SVC) and the Multi-layer Perceptron Classifier (MLPC); XGBoost comes from the separate *xgboost* package. Lastly, *nltk* is used to lemmatize the words and to obtain part-of-speech tags of words.

In [3]:
import pandas as pd
import gensim
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score
from nltk.stem import WordNetLemmatizer
import xgboost as xgb
import pickle
from sklearn import preprocessing

lem = WordNetLemmatizer()

## Preprocessing of training dataset
To train the model, we will use [this Kaggle dataset](https://www.kaggle.com/datasets/abhinavwalia95/entity-annotated-corpus). It contains 1048575 words extracted from the Groningen Meaning Bank and is already annotated for Named Entity Recognition use cases, so we don't have to label the data manually. It provides two kinds of annotation for every word: a Part of Speech (POS) tag and an entity tag. In this project, we look specifically for locations, so we assign label 1 to words tagged 'I-geo' or 'B-geo' and label 0 to all other words. These are the labels to be predicted later.

Lemmatization is also performed (reducing every word to its dictionary base form, the lemma).

In [4]:
df = pd.read_csv('archive/ner_dataset.csv', encoding='latin1')

label = []
for i in df['Tag']:
    if i == 'I-geo' or i == 'B-geo':
        label.append(1)
    else:
        label.append(0)
df['label'] = label

df['lemma'] = [lem.lemmatize(i) for i in df['Word']]
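
The labelling loop above can also be written as a single vectorized pandas expression. The following is a minimal sketch on a made-up toy frame (the rows are illustrative, not taken from the dataset); it should give the same labels as the loop.

```python
import pandas as pd

# Toy stand-in for the NER dataset; only the 'Tag' column matters here.
toy = pd.DataFrame({
    "Word": ["London", "to", "Iraq", "war"],
    "Tag": ["B-geo", "O", "I-geo", "O"],
})

# 1 for location tags, 0 for everything else.
toy["label"] = toy["Tag"].isin(["B-geo", "I-geo"]).astype(int)
print(toy["label"].tolist())
```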


We will also take the part-of-speech tags into account, but first we have to assign a numerical label to those so it becomes understandable for a predictive model.

In [19]:
le = preprocessing.LabelEncoder()

POS_set = set(df['POS'].tolist())
POS_list = (list(POS_set))

le.fit(POS_list)
POS = le.transform(df['POS'].tolist())
df['labelPOS'] = POS

In [6]:
print(df.head(30))

     Sentence #           Word  POS    Tag  label         lemma
0   Sentence: 1      Thousands  NNS      O      0     Thousands
1           NaN             of   IN      O      0            of
2           NaN  demonstrators  NNS      O      0  demonstrator
3           NaN           have  VBP      O      0          have
4           NaN        marched  VBN      O      0       marched
5           NaN        through   IN      O      0       through
6           NaN         London  NNP  B-geo      1        London
7           NaN             to   TO      O      0            to
8           NaN        protest   VB      O      0       protest
9           NaN            the   DT      O      0           the
10          NaN            war   NN      O      0           war
11          NaN             in   IN      O      0            in
12          NaN           Iraq  NNP  B-geo      1          Iraq
13          NaN            and   CC      O      0           and
14          NaN         demand   VB     

In [5]:
print(len(df))

1048575


## Building the feature set
Here, *GloVe* pre-trained word embeddings are imported *via* the *Gensim* library. These vectors were trained on 6 billion tokens from *Wikipedia* and *Gigaword*, and each vector is 50 units long. I have tried to work with 300-unit embeddings, but building the feature sets was then not feasible on a laptop with 16 GB of RAM.

In [6]:
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

glove2word2vec('glove.6B.50d.txt', 'glove')
w2v = KeyedVectors.load_word2vec_format('glove', binary=False)

Here, a list of word embeddings is made for every word in the training dataset. *Gensim* raises a `KeyError` when a word doesn't exist in the loaded vocabulary, so those words are assigned a padding vector containing 50 zeros. The length of this list is compared to the number of words to ensure the code has not missed a word.

Variables that might take a while to build are pickled (dumping Python variables to local files that can be loaded later) to avoid waiting all over again.

In [6]:
print(len(df))
vectorlist = []
l = 50 * [0]
for i in range(len(df)):
    try:
        vectorlist.append(w2v[df['lemma'][i]].tolist())
    except KeyError:
        vectorlist.append(l)
print(len(vectorlist))

outfile = open('vectorlist','wb')
pickle.dump(vectorlist, outfile)
outfile.close()

outfile = open('df','wb')
pickle.dump(df, outfile)
outfile.close()

1048575
1048575
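
One thing worth checking in the lookup above: the `glove.6B` vocabulary is lowercased, so a capitalized lemma such as "London" would raise a `KeyError` and end up with the zero vector. The sketch below shows one possible fallback that retries with the lowercased word. The helper name `lookup_vector` and the toy dictionary are hypothetical; a plain dict stands in for the loaded vectors, which support the same `in` and `[]` operations.

```python
def lookup_vector(word, vectors, dim=50):
    """Return the embedding for `word`, retrying in lowercase, else a zero vector."""
    for candidate in (word, word.lower()):
        if candidate in vectors:
            return list(vectors[candidate])
    return [0.0] * dim

# Hypothetical 2-dimensional stand-in for the real 50-dimensional vectors.
toy_vectors = {"london": [0.1, 0.2], "war": [0.3, 0.4]}

print(lookup_vector("London", toy_vectors, dim=2))
print(lookup_vector("Xyzzy", toy_vectors, dim=2))
```

Whether this helps depends on the model: the capitalization information is still available through the `istitle()` features.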


In [7]:
infile = open('vectorlist','rb')
vc2 = pickle.load(infile)
infile.close()
infile = open('df','rb')
df2 = pickle.load(infile)
infile.close()
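
The open/dump/close pattern is repeated throughout this notebook. As a sketch, it could be wrapped in two small helpers that use a context manager, so the file is closed even if pickling fails. The names `save_pickle` and `load_pickle` are made up for this example.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Dump `obj` to `path`; the file is closed automatically."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Load and return the object stored at `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip on a small stand-in for `vectorlist`, written to a temp folder.
demo_path = os.path.join(tempfile.mkdtemp(), "vectorlist_demo")
save_pickle([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]], demo_path)
restored = load_pickle(demo_path)
print(restored)
```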

Here are all the features considered for training the model later. For every word in the dataset, a dictionary is made containing all of the word's features: the vectors of the main word, the 3 words before it and the 3 words after it. For the main word, a label is included telling whether it consists of digits, since such words have a very small chance of being a location. There are also labels for the capitalization of all 7 words, since practically all locations are capitalized.

For words at the beginning or the end of the dataset, padding is added: a vector of 50 zeros for each missing word vector and a 0 for each missing label. For example, the very first word of the dataset has no words before it, so padding is added there. In hindsight, approaching the dataset as a circular unit might have been a better idea in this case.
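
The circular alternative mentioned above can be sketched with modulo indexing: positions that run past either end wrap around to the other end, so no padding is needed. The function name `circular_context` is hypothetical and the example works on a plain list of words.

```python
def circular_context(items, i, window=3):
    """Return the `window` items before and after position i, wrapping around."""
    n = len(items)
    before = [items[(i - k) % n] for k in range(1, window + 1)]
    after = [items[(i + k) % n] for k in range(1, window + 1)]
    return before, after

words = ["a", "b", "c", "d", "e", "f", "g", "h"]
before, after = circular_context(words, 0)
print(before, after)
```

For the first word, the "previous" words are then simply the last words of the dataset.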

Unfortunately, due to time constraints, I could not include as many features as I wanted to. The feature set takes at least 3 hours to build, so it wasn't practical to add and remove features freely. That's also why I did not put everything in one dataframe, since it would likely reach the limits of my system. In hindsight, I could also have tested features on a smaller dataset.

A model that also uses POS tags is added and trained later on in this notebook.

In [8]:
def word2features(i, df, vc):
    # Note: this list is rebuilt on every call, which is costly for a large dataframe.
    lemma = df['lemma'].tolist()
    n = len(lemma)
    pad = [0] * 50
    # Features of the main word.
    features = {
        'wordvector': vc[i],
        'word.istitle()': int(lemma[i].istitle()),
        'word.isdigit()': int(lemma[i].isdigit()),
    }
    # Context window: the 3 words before, then the 3 words after.
    # Neighbours that fall outside the dataset get a zero vector and a 0 label.
    for k in (-1, -2, -3, 1, 2, 3):
        j = i + k
        inside = 0 <= j < n
        features[f'wordvector[{k:+d}]'] = vc[j] if inside else pad
        features[f'{k:+d}:word.istitle()'] = int(lemma[j].istitle()) if inside else 0
    return features

This next block employs the function above on every word to gather its features and puts them in list X.

In [16]:
X = []
for i in range(len(df)):
    p = word2features(i, df, vectorlist).values()
    new_p = []
    for a in p:
        if type(a)!=list:
            new_p.append(a)
        else:
            new_p.extend(a)
    X.append(new_p)

outfile = open('X','wb')
pickle.dump(X, outfile)
outfile.close()

In [None]:
infile = open('X','rb')
X = pickle.load(infile)
infile.close()
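
Since the loop above is the slowest part of the notebook, here is a sketch of how the same feature layout could be built with *numpy* slices instead of one function call per word. This is an assumption-laden alternative, not the code used for the results below: `build_window_features` is a hypothetical helper, and it expects the word vectors and the `istitle()`/`isdigit()` flags as arrays. The column order follows `word2features`: main word, then offsets -1, -2, -3, +1, +2, +3.

```python
import numpy as np

def build_window_features(vectors, is_title, is_digit, window=3):
    """Stack each word's own features with those of its neighbours (zero-padded)."""
    vectors = np.asarray(vectors, dtype=np.float32)
    is_title = np.asarray(is_title, dtype=np.float32)
    is_digit = np.asarray(is_digit, dtype=np.float32)
    n = vectors.shape[0]
    # Zero-pad `window` rows at both ends so every word has a full context.
    vec_pad = np.pad(vectors, ((window, window), (0, 0)))
    title_pad = np.pad(is_title, (window, window))
    blocks = [vectors, is_title[:, None], is_digit[:, None]]
    offsets = [-k for k in range(1, window + 1)] + list(range(1, window + 1))
    for k in offsets:
        start = window + k  # row `start + i` of the padded array is word i + k
        blocks.append(vec_pad[start:start + n])
        blocks.append(title_pad[start:start + n, None])
    return np.hstack(blocks)

# Tiny demo: 5 words with 2-dimensional vectors.
demo_vectors = np.arange(10, dtype=np.float32).reshape(5, 2)
demo_features = build_window_features(demo_vectors, [1, 0, 0, 1, 0], [0, 0, 1, 0, 0])
print(demo_features.shape)
```

A float32 array also takes far less memory than nested Python lists, which may matter on a machine with limited RAM.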

To ensure we haven't missed any word, we compare the lengths of X and y.

In [14]:
print(len(X))
y = df['label'].tolist()
print(len(y))

1048575
1048575


To ensure every word has the same number of features, we take a quick look at the lengths of two feature vectors.

In [11]:
print(len(X[0]))
print(len(X[42]))

365
365


## Training the linear SVC model
Here, we will feed X and y into a Support Vector Classifier (SVC) with a linear kernel. I have also tried linear classification *via* Linear Discriminant Analysis (LDA), but this performed worse on all fronts than SVC in this case. This is likely because LDA makes strong assumptions about the data, for example that the features are normally distributed within each class, whilst SVC makes no such distributional assumptions.

I have also tried SVC with a Radial Basis Function (RBF) kernel, but fitting such a model takes more than double the time of a linear SVC, so I decided to focus on linear SVC from here. SVC with RBF is used and discussed in the very last part of this notebook. The same applies to MLPC and XGBoost.

For the C value, I have tried 0.1, 1 and 10; 10 seems to give the best results without overfitting when evaluated on the test set later. A more refined way to find the optimal C value would be a grid search. This would have taken too much computation time, but it would be a nice addition to further refine the model.
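
A grid search over C could look roughly like the sketch below. It runs on a small synthetic, imbalanced dataset from `make_classification` rather than on the real feature set, and the grid and the number of folds are just example values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Small synthetic stand-in for (X, y) with roughly 10% positives.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=42
)

search = GridSearchCV(
    LinearSVC(class_weight="balanced", max_iter=5000),
    param_grid={"C": [0.1, 1, 10]},
    scoring="f1",  # select on the F1-score rather than on accuracy
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

On the full dataset, every value of C is fitted once per fold, so the cost grows quickly; searching on a subsample would be one way to keep it manageable.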

The `class_weight` parameter is set to 'balanced'. This means that the classifier automatically weights each class inversely proportional to its frequency. This yields far better results than leaving it at the default (`None`) because our dataset contains many more non-location words than locations. The maximum number of iterations has been raised because the solver does not converge within the default (1000). The model seems to perform better with more iterations, but that comes at the cost of computation time.
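
To make the 'balanced' setting concrete: *sklearn* computes the weight of a class as `n_samples / (n_classes * count_of_that_class)`. The toy labels below (9 non-locations, 1 location) are made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 9 non-locations and 1 location.
toy_labels = np.array([0] * 9 + [1])

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=toy_labels
)
# Class 0: 10 / (2 * 9), class 1: 10 / (2 * 1)
print(weights)
```

The rare class gets a weight nine times larger than the frequent one, so misclassifying a location costs the model correspondingly more.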

In [11]:
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)
svm1 = LinearSVC(C = 1, class_weight='balanced',  max_iter=2000)
svm1.fit(x_train,y_train)



LinearSVC(C=1, class_weight='balanced', max_iter=2000)

In [14]:
svm2 = LinearSVC(C = 0.1, class_weight='balanced',  max_iter=2000)
svm2.fit(x_train,y_train)



LinearSVC(C=0.1, class_weight='balanced', max_iter=2000)

In [15]:
svm3 = LinearSVC(C = 10, class_weight='balanced',  max_iter=2000)
svm3.fit(x_train,y_train)



LinearSVC(C=10, class_weight='balanced', max_iter=2000)

The tests print the accuracy first and then the F1-score. Here we can see the effect of the class imbalance in the data. A high accuracy doesn't necessarily mean that we have a good model: it can simply result from having a small number of locations and a large number of non-locations in both the training and test set, which is why the F1-score is the more informative metric here.
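
A small made-up example illustrates the point: a "model" that never predicts a location still reaches a high accuracy on imbalanced data, while its F1-score is zero.

```python
from sklearn.metrics import accuracy_score, f1_score

# 95 non-locations and 5 locations; the predictions are all "non-location".
y_true_demo = [0] * 95 + [1] * 5
y_all_zero = [0] * 100

acc_demo = accuracy_score(y_true_demo, y_all_zero)
# zero_division=0 avoids a warning when no positive is predicted at all.
f1_demo = f1_score(y_true_demo, y_all_zero, zero_division=0)
print(acc_demo, f1_demo)
```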

In [13]:
from sklearn.metrics import f1_score
y_pred_test = svm1.predict(x_test)
print(svm1.score(x_test,y_test))
print(f1_score(y_test, y_pred_test))

0.8941754285578046
0.4338087098502436


In [16]:
from sklearn.metrics import f1_score
y_pred_test2 = svm2.predict(x_test)
print(svm2.score(x_test,y_test))
print(f1_score(y_test, y_pred_test2))

0.8719881744271988
0.39303640063305445


In [20]:
from sklearn.metrics import f1_score
y_pred_test3 = svm3.predict(x_test)
print(svm3.score(x_test,y_test))
print(f1_score(y_test, y_pred_test3))

0.9335193000023841
0.5196720181905877


The model we will finally use here is an SVC with a linear kernel, the C value set to 10 and 5000 iterations, fitted on the whole training dataset.

In [7]:
svm = LinearSVC(C = 10, class_weight='balanced', max_iter = 5000)
svm.fit(X,y)

outfile = open('svm','wb')
pickle.dump(svm, outfile)
outfile.close()



LinearSVC(C=10, class_weight='balanced', max_iter=5000)

In [None]:
infile = open('svm','rb')
svm = pickle.load(infile)
infile.close()

## Training the linear SVC model with POS
In the next block of code, the POS labels of the main word and the 6 words around it are added to the feature set. Then a linear SVC model with the same settings as above will be trained on this feature set.

In [None]:
# Copy each row so that X itself is not modified (X_pos = X would only alias it).
X_pos = [row.copy() for row in X]
pos_labels = df['labelPOS'].tolist()
n = len(pos_labels)
for i in range(n):
    # POS label of the word itself, then of the 3 following and the 3 preceding words.
    for k in (0, 1, 2, 3, -1, -2, -3):
        j = i + k
        # Neighbours outside the dataset are padded with 0 (note: 0 is also a valid encoded tag).
        X_pos[i].append(pos_labels[j] if 0 <= j < n else 0)

outfile = open('X_pos','wb')
pickle.dump(X_pos, outfile)
outfile.close()

svm_pos = LinearSVC(C = 10, class_weight='balanced', max_iter = 5000)
# Fit svm_pos itself; fitting `svm` here would overwrite the model without POS features.
svm_pos.fit(X_pos,y)

outfile = open('svm_pos','wb')
pickle.dump(svm_pos, outfile)
outfile.close()

In [None]:
infile = open('X_pos','rb')
X_pos = pickle.load(infile)
infile.close()

infile = open('svm_pos','rb')
svm_pos = pickle.load(infile)
infile.close()

## Preprocessing the test data
After building the model, we can finally prepare the (competition) test dataset for prediction. Lemmatization and the same features are applied here too.

In [4]:
df_test = pd.read_csv('test_dataset.csv')
lemma = []
for i in df_test['Word']:
    lemma.append(lem.lemmatize(i))
df_test['lemma'] = lemma

POS tags were not provided with the test dataset, so we predict them with the *nltk* package. Any tagging errors made by the package could have an impact on the overall accuracy of the model.

In [8]:
import nltk

text = nltk.pos_tag(lemma)
POS_tags = [i[1] for i in text]
df_test['POS'] = POS_tags

In [10]:
vckag = []
l = [0] * 50
for i in range(len(df_test)):
    try:
        vckag.append(w2v[df_test['lemma'][i]].tolist())
    except KeyError:
        vckag.append(l)

outfile = open('vckag','wb')
pickle.dump(vckag, outfile)
outfile.close()

outfile = open('dfkag','wb')
pickle.dump(df_test, outfile)
outfile.close()

## Make feature set of test data (without POS)

In [11]:
X_pred = []
for i in range(len(df_test)):
    p = word2features(i, df_test, vckag).values()
    new_p = []
    for a in p:
        if type(a)!=list:
            new_p.append(a)
        else:
            new_p.extend(a)
    X_pred.append(new_p)

outfile = open('X_pred','wb')
pickle.dump(X_pred, outfile)
outfile.close()

In [None]:
infile = open('X_pred','rb')
X_pred = pickle.load(infile)
infile.close()

## Make feature set of test data (with POS)

In [13]:
# These nltk tags are mapped to ',' before encoding; le.transform raises an
# error for any tag the encoder was not fitted on.
unseen_tags = {'(', ')', "''", '#', 'SYM'}
new = [',' if tag in unseen_tags else tag for tag in df_test['POS']]
POS_test = le.transform(new)
df_test['labelPOS'] = POS_test

# Copy each row so that X_pred itself is not modified (X_pred_pos = X_pred would only alias it).
X_pred_pos = [row.copy() for row in X_pred]
pos_labels = df_test['labelPOS'].tolist()
n = len(pos_labels)
for i in range(n):
    # Same order as for the training set: the word itself, 3 following, 3 preceding.
    for k in (0, 1, 2, 3, -1, -2, -3):
        j = i + k
        X_pred_pos[i].append(pos_labels[j] if 0 <= j < n else 0)

outfile = open('X_pred_pos','wb')
pickle.dump(X_pred_pos, outfile)
outfile.close()
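
The tags that need replacing in the cell above do not have to be found by trial and error. As a sketch, they can be listed by comparing the tags in the test data with the classes the encoder was fitted on. The tag lists here are toy values, and mapping unknown tags to ',' simply mirrors the choice made above.

```python
from sklearn.preprocessing import LabelEncoder

# Toy encoder fitted on a handful of "training" tags.
encoder = LabelEncoder().fit(["NN", "NNP", "VB", ","])
test_tags = ["NNP", "(", "VB", "SYM"]

known = set(encoder.classes_)
unseen = sorted(set(test_tags) - known)  # tags the encoder cannot transform

# Replace unknown tags by ',' and encode the cleaned list.
cleaned = [tag if tag in known else "," for tag in test_tags]
encoded = encoder.transform(cleaned)
print(unseen, encoded.tolist())
```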

## Predict via linear SVC model

In [14]:
y_pred = svm.predict(X_pred)

In [None]:
y_pred_pos = svm_pos.predict(X_pred_pos)

In [15]:
idlist = [x for x in range(len(df_test))]

data_pred = {'Id':idlist,
        'Predicted':y_pred}
df_pred = pd.DataFrame(data_pred)
df_pred.to_csv('predictions.csv',index=False)

data_pred = {'Id':idlist,
        'Predicted':y_pred_pos}
df_pred = pd.DataFrame(data_pred)
df_pred.to_csv('predictions_pos.csv',index=False)

The model without POS features yielded an F1-score of almost 0.38 on the Kaggle leaderboard. This is not too far off the F1-score in the testing phase (about 0.52 for C = 10), but the model can use some more refinement to reduce overfitting. In conclusion, this model has potential for better predictions after more fine-tuning and the addition of more informative features. One of the main drawbacks of this model is that it is computationally very heavy; both the feature engineering and its code could use more refinement.

The same can be said for the model with POS features, which yielded an F1-score of almost 0.37 on the Kaggle leaderboard. From this, it seems that adding POS features makes no significant difference to the model. What could have happened here is that, because the feature vectors are so large (365 units long), a handful of extra features barely influences the model. An interesting thing to try later would be to use even smaller word embeddings, or to shrink them ourselves with a dimensionality reduction method such as Principal Component Analysis (PCA), to see whether the POS tags then make a difference.
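
A PCA reduction of the embeddings could be sketched as follows. Random vectors stand in for the 50-unit GloVe vectors, and the target of 10 components is an arbitrary example value, not a tuned choice.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for 200 word vectors of 50 units each.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 50))

pca = PCA(n_components=10)
reduced = pca.fit_transform(embeddings)
print(reduced.shape)
# Share of the total variance kept by the 10 components.
print(pca.explained_variance_ratio_.sum())
```

The PCA would be fitted on the training vectors only and then applied to the test vectors with `pca.transform`, so that both feature sets live in the same reduced space.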

Also, because no normalization has been done, it could be that some features carry more weight than others. Because of computational limits, I couldn't normalize the full feature sets, so I only took the first 100000 words to train the model. The code for this is not included, but it was easily done in just 2 lines with *sklearn*'s `StandardScaler` class. In hindsight, I could have done this in batches, but this was not possible due to time constraints. The model with the normalized, smaller feature set scored around 0.36 on Kaggle without POS and 0.38 with POS. This is very similar to the scores above, so it could be that neither normalization nor subsetting the dataset affects the model that much.
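
Normalizing in batches, as suggested above, is possible with `StandardScaler.partial_fit`, which updates the running mean and variance one chunk at a time. The sketch below uses random data and an arbitrary batch size of 100 rows; it is not the code used for the scores reported here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-in for a feature matrix that is too large to scale at once.
rng = np.random.default_rng(0)
X_big = rng.normal(loc=3.0, scale=2.0, size=(1000, 5))

scaler = StandardScaler()
batch_size = 100

# First pass: accumulate the mean and variance batch by batch.
for start in range(0, len(X_big), batch_size):
    scaler.partial_fit(X_big[start:start + batch_size])

# Second pass: transform batch by batch with the final statistics.
X_scaled = np.vstack([
    scaler.transform(X_big[start:start + batch_size])
    for start in range(0, len(X_big), batch_size)
])
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

The same fitted scaler would then be applied to the competition test set, so that both sets are scaled with the training statistics.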

## Other models

Here is the section where I experimented with other models.

Here I tried an MLPClassifier, and this model actually yielded my highest score of 0.41. This is definitely a model to look into further, but due to time constraints this was not possible now. There are many parameters here to experiment with.

In [10]:
mlp = MLPClassifier(max_iter=500)
mlp.fit(X, y)
y_pred_mlp = mlp.predict(X_pred)

outfile = open('mlp','wb')
pickle.dump(mlp, outfile)
outfile.close()

idlist = [x for x in range(len(df_test))]

data_pred = {'Id':idlist,
        'Predicted':y_pred_mlp}
df_pred = pd.DataFrame(data_pred)
df_pred.to_csv('predictions_mlp.csv',index=False)

Here I tried an SVC with an RBF kernel, but this did not yield good results at all and took much more computation time, so I stuck with linear SVC. This model could use much more refinement, such as optimizing the C and gamma values.

In [11]:
svmrbf = SVC(C = 10, class_weight='balanced',  max_iter=5000)
svmrbf.fit(X,y)

outfile = open('svmrbf','wb')
pickle.dump(svmrbf, outfile)
outfile.close()

y_pred_rbf = svmrbf.predict(X_pred)

idlist = [x for x in range(len(df_test))]

data_pred = {'Id':idlist,
        'Predicted':y_pred_rbf}
df_pred = pd.DataFrame(data_pred)
df_pred.to_csv('predictions_rbf.csv',index=False)



In [8]:
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

mlp = MLPClassifier(max_iter=500)
mlp.fit(x_train, y_train)
y_pred_mlp2 = mlp.predict(x_test)

print(mlp.score(x_test,y_test))
print(f1_score(y_test, y_pred_mlp2))

0.9678516081348497
0.6350546714301181


In [9]:
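svmrbf = SVC(C = 10, class_weight='balanced',  max_iter=5000)
svmrbf.fit(x_train,y_train)
# Evaluate on the held-out split (not on the competition set X_pred).
y_pred_rbf2 = svmrbf.predict(x_test)

print(svmrbf.score(x_test,y_test))
# As originally run, this line scored y_pred_mlp2, so the F1 printed below repeats the MLP value.
print(f1_score(y_test, y_pred_rbf2))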



0.6130415087142074
0.6350546714301181


Here I tried an XGBoost model, which yielded a score of approximately 0.39. I have not looked much into this model, but it likely has potential with more refinement.

In [15]:
weight = df['label'].value_counts().tolist()[0] / df['label'].value_counts().tolist()[1]

xgb_model = xgb.XGBClassifier(scale_pos_weight = weight)
xgb_model.fit(X, y)

y_pred_xgb = xgb_model.predict(X_pred)

outfile = open('xgb','wb')
pickle.dump(xgb_model, outfile)
outfile.close()

idlist = [x for x in range(len(df_test))]

data_pred = {'Id':idlist,
        'Predicted':y_pred_xgb}
df_pred = pd.DataFrame(data_pred)
df_pred.to_csv('predictions_xgb.csv',index=False)
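
The `weight` passed to `scale_pos_weight` above is the ratio of negative to positive examples, which is the typical value suggested in the XGBoost documentation. The cell relies on `value_counts()` listing the majority class first; a sketch that names the two counts explicitly, on made-up labels, could look like this:

```python
import pandas as pd

# Toy labels: 9 non-locations (0) and 1 location (1).
toy_labels = pd.Series([0] * 9 + [1] * 1)

n_negative = int((toy_labels == 0).sum())
n_positive = int((toy_labels == 1).sum())

# Ratio of negatives to positives, usable as scale_pos_weight.
scale_pos_weight = n_negative / n_positive
print(scale_pos_weight)
```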

## Conclusion
Because all of the models (except for the SVC with RBF kernel) yielded very similar results, it is very possible that our training dataset or, more likely, our feature set could use much more refinement. Perhaps more relevant features could be added, and a dimensionality reduction of the word embeddings might help too, both for better predictions and for computational efficiency.