## Setting up Google Drive

You can connect the notebook to your Google Drive with the code below. Running it opens a prompt to link the Colab notebook to your Google account.

To access the files in our shared "ML_Team_16" folder, where I put the JSON files, you need a link to the folder in your personal Drive. To do this, go to "Shared with me" and drag the folder over to "My Drive". This should do the trick.

In [2]:
from google.colab import drive
drive.mount('/Drive')

Mounted at /Drive


After you run the code above, a folder called "Drive" should appear under the "Files" category on the left. This folder is your personal Google Drive. Navigate to the "ML_Team_16" folder and then to "Data", right-click the file you need, and copy its path. You can pass this path to the pandas read_json() function to read the JSON file and create a DataFrame object from it.

## Feature Engineering

### Loading the data

In [7]:
import pandas as pd
import numpy as np
articles = pd.read_json('/Drive/MyDrive/ML_Team_16/Data/train.json')


### Cleaning the data

In [19]:
# Replacing blank entries with NaN
articles = articles.mask(articles == '')

# Changing the authorId column to string
articles['authorId'] = articles['authorId'].astype('string')

# Counting the number of NaNs in the dataset. There are 242 missing venue values.
print(articles.isna().sum())

paperId         0
title           0
authorId        0
authorName      0
abstract        0
year            0
venue         242
dtype: int64
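
As a small illustration of the cleaning step, the sketch below applies the same `mask` call to a toy frame. The frame and its values are hypothetical stand-ins for `articles`, not the real data.

```python
import pandas as pd

# Hypothetical toy frame standing in for `articles`
toy = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "venue": ["ACL", "", "EMNLP"],
})

# mask() replaces every cell where the condition is True with NaN,
# so empty strings become proper missing values
cleaned = toy.mask(toy == "")

# Count missing values per column, as in the cell above
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Only the empty `venue` entry is counted as missing; non-empty strings are left untouched.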



### Tokenize and lemmatize abstract and title

In [21]:
import spacy
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk import WordNetLemmatizer
from spacy.lang.en import English
nlp = English()
tokenizer = nlp.tokenizer 
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [24]:
tokens_all = []

for _, row in articles.iterrows():
    tokens = tokenizer(row['title'].lower() + ' ' + row['abstract'].lower())
    lemmas = [lemmatizer.lemmatize(token.text) for token in tokens]

    # removing stop words and storing the lemma of every remaining token
    lemmas_per_row = [lemma for token, lemma in zip(tokens, lemmas)
                      if not token.is_stop]

    tokens_all.append(lemmas_per_row)

articles['tokens'] = tokens_all

### Train Test Split

In [25]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(articles, test_size=0.1, random_state=32)

### Functions for creating n-grams

In [26]:
def character_ngrams(s, n):
    
    """takes in a string and an integer defining the size of ngrams.
     Returns the character ngrams of desired size in the input string"""
    
    s = '#'*(n-1) + s.replace('§', ' ') + '#'*(n-1)
    ngrams = [s[i:i+n] for i in range(len(s)-n+1)]
    
    return ngrams


def word_ngrams(l, n):
    
    """takes in a list of tokens and an integer defining the size of ngrams.
     Returns the word ngrams of desired size in the input list"""
    
    
    s = ['#']*(n-1) + l + ['#']*(n-1)
    ngrams = ['§'.join(s[i:i+n]) for i in range(len(s)-n+1)]
    
    return ngrams


def token2ngrams(articles, n, char_n_grams=False):

    featurised_articles = []

    for i, row in articles.iterrows():
        
        featurised_row = []

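        if char_n_grams:
            # row['tokens'] is a list, while character_ngrams expects a string,
            # so the tokens are joined with '§' first (assumed intent)
            featurised_row.extend(character_ngrams('§'.join(row['tokens']), n))
        else:
            featurised_row.extend(word_ngrams(row['tokens'], n))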
        
        featurised_articles.append(featurised_row)


    articles['ngrams'] = featurised_articles
    
    return articles


def feature_matrix(articles, mapping=None):
    
    if not mapping:
        all_ngrams = {}
        for _, row in articles.iterrows():
          for ngram in row['ngrams']:
            try:
              all_ngrams[ngram] += 1
            except KeyError:
              all_ngrams[ngram] = 1

        # removing ngrams that appear 5 times or less 
        reduced_ngrams = set([ngram for ngram, count in all_ngrams.items() if count > 5])
        mapping = {ngram: i for i, ngram in enumerate(reduced_ngrams)}
    
    X = np.zeros((len(articles.index), len(mapping)), dtype='uint8')
    #y = np.zeros(len(articles.index))
    y = []

    r = 0
    for _, row in articles.iterrows():
        #y[r] = row['authorId']
        y.append(str(row['authorId']))
        for ngram in row['ngrams']:
            try:
                X[r, mapping[ngram]] += 1
            except KeyError:
                pass
        r += 1
    
    return X, y, mapping

# Credit for these functions goes to Dr. Giovanni Cassani (https://www.tilburguniversity.edu/staff/g-cassani)
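
To make the padding and joining behaviour concrete, here is a small worked example. The two helpers are compact restatements of the functions above, and the token list is made up for illustration.

```python
def character_ngrams(s, n):
    # Pad with n-1 '#' on both sides, then slide a window of size n
    s = '#' * (n - 1) + s.replace('§', ' ') + '#' * (n - 1)
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def word_ngrams(l, n):
    # Pad the token list with n-1 '#' markers, join each window with '§'
    s = ['#'] * (n - 1) + l + ['#'] * (n - 1)
    return ['§'.join(s[i:i + n]) for i in range(len(s) - n + 1)]


tokens = ['deep', 'learning', 'model']  # hypothetical lemmas

unigrams = word_ngrams(tokens, 1)         # no padding is added for n=1
bigrams = word_ngrams(tokens, 2)          # '#' marks start and end
char_bigrams = character_ngrams('ab', 2)  # works on a string, not a list

print(unigrams)      # ['deep', 'learning', 'model']
print(bigrams)       # ['#§deep', 'deep§learning', 'learning§model', 'model§#']
print(char_bigrams)  # ['#a', 'ab', 'b#']
```

With `n=1`, as used in this notebook, the word n-grams are simply the tokens themselves.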


### Creating n-grams for training and testing datasets

In [27]:
train = token2ngrams(train, n=1, char_n_grams=False)
test = token2ngrams(test, n=1, char_n_grams=False)

### Creating feature and target matrices

In [None]:
X_train, y_train, ngram2id = feature_matrix(train)
X_test, y_test, _ = feature_matrix(test, mapping=ngram2id)

In [None]:
print(X_train.shape)
print(X_test.shape)

(10916, 7101)
(1213, 7101)
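
Both matrices have the same number of columns because the test set reuses the mapping built on the training set. The sketch below shows that idea on toy data; `count_matrix` is a hypothetical simplification of `feature_matrix` that skips the frequency threshold and the target vector.

```python
import numpy as np


def count_matrix(docs, mapping):
    # One row per document, one column per n-gram in the mapping
    X = np.zeros((len(docs), len(mapping)), dtype='uint8')
    for r, doc in enumerate(docs):
        for ngram in doc:
            col = mapping.get(ngram)
            if col is not None:  # n-grams unseen in training are ignored
                X[r, col] += 1
    return X


train_docs = [['neural', 'network'], ['neural', 'parser']]
test_docs = [['neural', 'neural', 'tagger']]

# The mapping is built from the training documents only
vocab = sorted({ngram for doc in train_docs for ngram in doc})
mapping = {ngram: i for i, ngram in enumerate(vocab)}

X_tr = count_matrix(train_docs, mapping)
X_te = count_matrix(test_docs, mapping)

print(X_tr.shape, X_te.shape)  # (2, 3) (1, 3)
print(X_te)                    # [[0 2 0]]
```

The unseen token `tagger` contributes nothing to the test row. Note also that a `uint8` matrix can only hold counts up to 255 before wrapping around.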


### Doing feature selection using chi squared

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import SelectKBest, chi2

In [None]:
# In principle this cell does not need to be run, unless a feature set of a different size is required. Otherwise, import the saved data as written down below.

#selector = SelectKBest(chi2, k=500).fit(X_train, y_train)
#X_train_reduced = selector.transform(X_train)
#X_test_reduced = selector.transform(X_test)

#print('Done')

Done
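
For reference, this is roughly what chi-squared selection does, shown on a toy count matrix. The data and the names `X_toy`, `y_toy` and `toy_selector` are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical counts: columns 0 and 1 separate the two classes,
# columns 2 and 3 are spread evenly over both
X_toy = np.array([
    [3, 0, 1, 0],
    [4, 0, 0, 1],
    [0, 5, 1, 0],
    [0, 3, 0, 1],
])
y_toy = ['a', 'a', 'b', 'b']

# Keep the k features with the highest chi-squared score
toy_selector = SelectKBest(chi2, k=2).fit(X_toy, y_toy)
X_sel = toy_selector.transform(X_toy)
kept = toy_selector.get_support(indices=True)

print(kept)         # indices of the retained columns
print(X_sel.shape)  # (4, 2)
```

The same fitted selector must be used to transform both the training and the test matrix so that the columns line up.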


In [None]:
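# If new data is generated, save it using the following code.
# np.savetxt writes one row per sample; ndarray.tofile would flatten the matrix into a single line.
np.savetxt("x_train_reduced_randomstate32.csv", X_train_reduced, delimiter=",", fmt="%d")
np.savetxt("x_test_reduced_randomstate32.csv", X_test_reduced, delimiter=",", fmt="%d")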

### Opening feature-reduced training and test sets

The reduced data for chi2 with k=500 has been saved; open it from the folder. This is faster than running the feature selection code above, which takes at least 40 minutes.
Use the code below to open both files as pandas DataFrames.

In [None]:
X_train_reduced = pd.read_csv('x_train_reduced_randomstate32.csv', delimiter=",", header=None)
X_test_reduced = pd.read_csv('x_test_reduced_randomstate32.csv', delimiter=",", header=None)
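
When saving and reloading a matrix as CSV, it is worth checking that the shape survives the round trip. The sketch below compares two ways of writing a small, made-up array; the file names and the temporary directory are only for illustration.

```python
import os
import tempfile

import numpy as np

X = np.arange(6, dtype='uint8').reshape(2, 3)  # toy 2x3 matrix

with tempfile.TemporaryDirectory() as tmp:
    flat_path = os.path.join(tmp, 'flat.csv')
    rows_path = os.path.join(tmp, 'rows.csv')

    # tofile() with a separator writes all values on one line
    X.tofile(flat_path, sep=',')
    # savetxt() writes one line per row of the matrix
    np.savetxt(rows_path, X, delimiter=',', fmt='%d')

    flat_back = np.loadtxt(flat_path, delimiter=',', ndmin=2)
    rows_back = np.loadtxt(rows_path, delimiter=',', ndmin=2)

print(flat_back.shape)  # (1, 6): the row structure is lost
print(rows_back.shape)  # (2, 3): the original shape is preserved
```

A file written with `tofile` would need to be reshaped after loading, which requires knowing the number of columns in advance.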


In [None]:
selector.get_feature_names_out()

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

### Training NB with Full Feature Matrix

In [None]:
NB = MultinomialNB(alpha=0.001, fit_prior=True)
NB.fit(X_train, y_train)

# Highest accuracy of 10.8%

MultinomialNB(alpha=0.001)

### Training NB with Full Feature Matrix & Year included

In [None]:
X_train_years = X_train 
X_train_years = np.hstack((X_train_years, np.reshape(train['year'].values,(-1,1)))) #a column of year is added to X_train

X_test_years = X_test
X_test_years = np.hstack((X_test_years, np.reshape(test['year'].values,(-1,1)))) #a column of year is added to X_test

In [None]:
NB_full_years = MultinomialNB(alpha=0.01, fit_prior=True)
NB_full_years.fit(X_train_years, y_train)
y_pred = NB_full_years.predict(X_test_years)
print(accuracy_score(y_test, y_pred))

#accuracy outcome: 0.10305028854080792

0.10305028854080792
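
The mechanics of appending the year column can be seen on a toy matrix. The values below are made up; only the `hstack` and `reshape` calls mirror the cell above.

```python
import numpy as np

X = np.array([[1, 0], [0, 2]], dtype='uint8')  # toy n-gram counts
years = np.array([2018, 2020])                 # toy publication years

# reshape(-1, 1) turns the 1-D year vector into a single column
X_years = np.hstack((X, years.reshape(-1, 1)))

print(X_years.shape)  # (2, 3)
print(X_years.dtype)  # a wider integer type, since uint8 cannot hold 2018
print(X_years)
```

The year values are in the thousands while the n-gram counts are small integers, and MultinomialNB models every column as a count.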


### Training NB with Reduced Feature Matrix

In [None]:
NB_reduced = MultinomialNB(alpha=0.001, fit_prior=True)
NB_reduced.fit(X_train_reduced, y_train)


MultinomialNB(alpha=0.001)

### Training NB with Reduced features and Year-features
This does not give higher accuracy scores than using only the reduced features.

In [None]:
#Testing with the inclusion of year-feature
temporary = X_train_reduced 
temporary = np.hstack((temporary, np.reshape(train['year'].values,(-1,1)))) #a column of year is added to X_train_reduced

temporary_test = X_test_reduced
temporary_test = np.hstack((temporary_test, np.reshape(test['year'].values,(-1,1)))) #a column of year is added to X_test_reduced


In [None]:
NB_reduced_and_year = MultinomialNB(alpha=0.001, fit_prior=True)
NB_reduced_and_year.fit(temporary, y_train)

y_pred_reduced = NB_reduced_and_year.predict(temporary_test)
print(accuracy_score(y_test, y_pred_reduced))
#Accuracy of 0.016488046166529265

MultinomialNB(alpha=0.001)

### Accuracy Score with Full Matrix

In [None]:
y_pred = NB.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred, average=None))
print(recall_score(y_test, y_pred, average=None))

#Accuracy 0.1079967023907667

0.1079967023907667
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
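
With many classes that are never predicted, the per-class precision is undefined for those classes. A minimal sketch on made-up labels shows how the `zero_division` parameter and macro averaging behave in that case.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: class 'b' is never predicted
y_true = ['a', 'b', 'c']
y_hat = ['a', 'a', 'c']

acc = accuracy_score(y_true, y_hat)

# zero_division=0 scores undefined precision as 0 instead of warning
per_class_precision = precision_score(y_true, y_hat, average=None, zero_division=0)

# Macro averaging gives every class the same weight
macro_precision = precision_score(y_true, y_hat, average='macro', zero_division=0)
macro_recall = recall_score(y_true, y_hat, average='macro', zero_division=0)

print(acc)
print(per_class_precision)  # one value per class, in sorted label order
print(macro_precision, macro_recall)
```

A single macro-averaged number is easier to compare across models than a long per-class array that is mostly zeros.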


### Accuracy Score with Reduced Matrix 

In [None]:
y_pred_reduced = NB_reduced.predict(X_test_reduced)
print(accuracy_score(y_test, y_pred_reduced))
#print(precision_score(y_test, y_pred_reduced, average=None))
#print(recall_score(y_test, y_pred_reduced, average=None))

0.01731244847485573


## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
#The following code turns the data into dataframes and extends it with the "year" feature
X_train_df = pd.DataFrame(X_train)
X_test_df = pd.DataFrame(X_test)

X_train_reduced_df = pd.DataFrame(X_train_reduced)
X_test_reduced_df = pd.DataFrame(X_test_reduced)

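# .values assigns the years by position; train and test keep their shuffled index
# from the split, so assigning the Series directly would align on index and create NaNs
X_train_df['year'] = train['year'].astype("int").values
X_test_df['year'] = test['year'].astype("int").values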

In [None]:
X_train_years = X_train 
X_train_years = np.hstack((X_train_years, np.reshape(train['year'].values,(-1,1)))) #a column of year is added to X_train

X_test_years = X_test
X_test_years = np.hstack((X_test_years, np.reshape(test['year'].values,(-1,1)))) #a column of year is added to X_test

(10916, 7102)


(10916, 7101)

### Replacing missing values of year by median year (2018)

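I am not sure whether we actually need to impute these missing year values, because there shouldn't be any. Correct me if I am wrong, but running "isnull().sum()" on the year column of the whole train data does not show any missing values. As such, the X_train dataframe shouldn't have missing values either. If NaNs do show up in X_train_df['year'], a possible cause is index alignment: X_train_df has a fresh 0 to n-1 index, while train keeps its shuffled index from the split, so assigning train['year'] as a Series only fills the rows whose index labels match.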

In [None]:
train["year"].isnull().sum()

0

In [None]:
X_train_df['year'] = X_train_df['year'].fillna(2018)
X_test_df['year'] = X_test_df['year'].fillna(2018)

X_train_reduced_df['year'] = X_train_df['year'].fillna(2018)
X_test_reduced_df['year'] = X_test_df['year'].fillna(2018)
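
One possible source of unexpected NaNs in a year column is pandas index alignment. The toy example below, with made-up values, contrasts assigning a Series directly with assigning its underlying values.

```python
import numpy as np
import pandas as pd

# Feature frame with a fresh RangeIndex (0, 1, 2)
features = pd.DataFrame(np.zeros((3, 2)))

# Year column whose index is shuffled, as after a train/test split
years = pd.Series([2018, 2019, 2020], index=[7, 1, 5])

aligned = features.copy()
aligned['year'] = years            # matched on index labels

positional = features.copy()
positional['year'] = years.values  # matched on row position

n_missing_aligned = int(aligned['year'].isna().sum())
n_missing_positional = int(positional['year'].isna().sum())

print(n_missing_aligned)     # 2: only label 1 exists in both indexes
print(n_missing_positional)  # 0
```

If the NaNs come from alignment, filling them with the median replaces real year values that were simply not matched.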

In [None]:
logreg = LogisticRegression(penalty='none')
logreg.fit(X_train_reduced_df, y_train)

In [None]:
logreg_reduced = logreg

### Accuracy Score Logistic Regression

In [None]:
y_pred_reduced_log = logreg_reduced.predict(X_test_reduced_df)
print(accuracy_score(y_test, y_pred_reduced_log))

0.01483924154987634




### GridSearch - Testing different parameters for logistic regression

In [None]:
from sklearn.model_selection import GridSearchCV

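# cv must be at least 2 (cv=1 raises a ValueError); cv=3 is an assumed choice.
# Note that 'elasticnet' additionally needs an l1_ratio, so those fits will fail as written.
grid_reduced = GridSearchCV(estimator=LogisticRegression(),
        param_grid={'C': [10, 1, 0.1, 0.01], 'penalty': ['none', 'l1', 'l2', 'elasticnet'], 'solver' : ["saga"]}, n_jobs=-1, cv=3)
grid_reduced.fit(X_train_reduced_df, y_train)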
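
A small, self-contained grid search can be useful for checking the setup before running it on the full data. The sketch below uses synthetic data from `make_classification` and searches over `C` only; the data, `max_iter` value and number of folds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (not the article data)
X_toy, y_toy = make_classification(n_samples=60, n_features=5, random_state=0)

toy_grid = GridSearchCV(
    estimator=LogisticRegression(solver='saga', max_iter=2000),
    param_grid={'C': [10, 1, 0.1, 0.01]},
    cv=3,  # cross-validation needs at least two folds
)
toy_grid.fit(X_toy, y_toy)

best_c = toy_grid.best_params_['C']
best_score = toy_grid.best_score_

print(toy_grid.best_estimator_)
print('best score: ' + str(round(best_score, 2)))
```

`best_score_` is the mean cross-validated accuracy of the best parameter setting, so it is not directly comparable to the held-out test accuracy reported elsewhere in the notebook.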



In [None]:
print(grid_reduced.best_estimator_)
print('best score: ' + str(round(grid_reduced.best_score_,2)))

In [None]:
# cv must be at least 2 (cv=1 raises a ValueError); cv=3 is an assumed choice.
grid = GridSearchCV(estimator=LogisticRegression(),
        param_grid={'C': [10, 1, 0.1, 0.01], 'penalty': ['none', 'l1', 'l2', 'elasticnet'], 'solver' : ["saga"]}, n_jobs=-1, cv=3)
grid.fit(X_train_df, y_train)
