# Final Project
## Remi LeBlanc and Max Shinnerl

https://github.com/fastai/fastbook/blob/master/12_nlp_dive.ipynb

Some info:

Original fastai model was pretrained with wiki and finetuned on these african new articles
Task to was a binary classification to determine in the tags 'Agribusiness' or 'Food and Agriculture' were in the list of tags for each article. One problem with this was only 3% of the articles had this tag, this made it a bit interesting. Accuracy is not a good metric, used f1/f beta score and precision and recall as our metric.
Fastai model performed about 80% recall 60% precision.

For our project we can either continue looking for agriculture news articles, becasue then we wont have to do anything with the fastai notebooks, or we can do a multilabel or multiclass somehow. 

The link above in from the fastai text book and he goes through out to make a lanugage model from scratch in pytorch - we might be able to follow that

### How will we compare the models?

We can just compare the metrics we chose - precision/recall and f1score.
We can talk about how it is harder to build from scratch, but have more flexibility. We should find something to do in the pytorch model that helps our model we couldnt do in fastai. Like perhaps with our unbalanced data

In [1]:
# output_0628_fixedwithspace_updated.json has about 100,000 articles

In [2]:
# output_0728.json has about 500k with 241k in english

In [3]:
# small_data.json has 5000 articles

In [4]:
import json
import pandas as pd
import os
import spacy
from tqdm.notebook import tqdm
from collections import Counter
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split

In [5]:
#with open('output_0628_fixedwithspace_updated.json') as json_file:
#with open('output_0728.json') as json_file:
with open('small_data.json') as json_file:

    data = json.load(json_file)

In [6]:
len(data)

5000

In [7]:
all_articles = {key:val for key, val in data if type(val) == dict}


In [8]:
len(all_articles)

5000

In [9]:
all_articles = {k:v for k,v in all_articles.items() if v!='Article NA'}
all_articles = {k:v for k,v in all_articles.items() if v['language']=='English'}

In [10]:
len(all_articles)

5000

In [11]:
all_articles['202003010001']

{'title': 'Somalia: Sufi Muslim Leaders Surrender to Government',
 'article': ' The leaders of a Sufi Muslim group turned themselves into the custody of the Somali government Saturday after fighting left 22 people dead in central Somalia. Moallim Mohamud Sheikh, the spiritual leader, and Sheikh Mohamed Shakir, the chief of Ahlu-Sunna Wal-Jamaa (ASWJ), are in the custody of the Somali national army in the town of Dhusamareb after the group’s militias were overpowered in a battle with government forces. Dhusamareb is the administrative capital of Galmudug state. "Our security forces have ended the standoff and disarmed all ASWJ militias,” Osman Isse Nur, the spokesperson of the newly elected president, told VOA. Speaking in a video posted online, ASWJ chief Sheikh Shakir said his group ceded power to the Somali national army. "We agreed to end the fighting for the sake of the civilians. We agreed to hand over ASWJ militias to the commander general who will, in return, take responsibility

In [12]:
# Clean up non-matching column names
for k,v in all_articles.items():
    # Update tag to tags
    t = v.get('tag')
    if t:
        all_articles[k]['tags'] = t
        del all_articles[k]['tag']
        
    # Update full_story to full
    f = v.get('full text')
    if str(f)!='None':
        all_articles[k]['full_story'] = f
        del all_articles[k]['full text']

In [13]:
df = pd.DataFrame.from_dict(all_articles,orient='index')
df[df['full text'].notnull()]

Unnamed: 0,title,article,author,date,source,language,original_url,tags,full_story,full text


In [14]:
df = df.drop(['full text'],axis=1)
len(df)

5000

In [15]:
df['liststring'] = [','.join(map(str, l)) for l in df['tags']]

In [16]:
# Get the indices of the articles with agri tags
indices = []
for i, tags in enumerate(df['liststring']):
    if 'Agribusiness' in tags or 'Food and Agriculture' in tags:
        indices.append(i)
# make column of 0 and 1 in df for appearance of that tag
ies = [0]*df.shape[0]
for x in indices:
    ies[x]+=1
df['agri_label'] = ies

In [17]:
df['agri_label'] = ies

In [18]:
# 3% of articles have agri/foodag tag
sum(df['agri_label']==1), len(df), sum(df['agri_label']==1)/len(df)

(123, 5000, 0.0246)

In [19]:
df.head()

Unnamed: 0,title,article,author,date,source,language,original_url,tags,full_story,liststring,agri_label
202003010001,Somalia: Sufi Muslim Leaders Surrender to Gove...,The leaders of a Sufi Muslim group turned the...,By Abdulaziz Osman,1 March 2020,"Voice of America (Washington, DC)",English,https://www.voanews.com/africa/somalias-sufi-m...,"[Somalia, East Africa, Legal Affairs, Conflict...",False,"Somalia,East Africa,Legal Affairs,Conflict,Arm...",0
202003010002,Libya: UN-Mediated Political Talks on Libya En...,U.N.-mediated political talks aimed at resolv...,By Lisa Schlein,1 March 2020,"Voice of America (Washington, DC)",English,https://www.voanews.com/middle-east/un-mediate...,"[Libya, Conflict, External Relations, North Af...",False,"Libya,Conflict,External Relations,North Africa...",0
202003010003,Nigeria: Obasanjo At 83 - a Leader and His Cou...,"""I don't want to be remembered. I am still he...",,1 March 2020,This Day (Lagos),English,https://www.thisdaylive.com/index.php/2020/03/...,"[Nigeria, West Africa, Governance]",False,"Nigeria,West Africa,Governance",0
202003010004,"Sudan: Service of 14 Ambassadors, 35 Diplomats...",Khartoum — The Higher Committee for the disma...,,29 February 2020,Sudan News Agency (Khartoum),English,https://suna-sd.net/en/single?id=561889,"[Sudan, East Africa, Governance]",False,"Sudan,East Africa,Governance",0
202003010005,Sudan: German President Ends Sudan Visit With ...,Khartoum — German President Frank-Walter Stei...,,29 February 2020,Radio Dabanga (Amsterdam),English,https://www.dabangasudan.org/en/all-news/artic...,"[Sudan, External Relations, East Africa, Gover...",False,"Sudan,External Relations,East Africa,Governanc...",0


 We will build a language model from the 'article' column to predict the 'agri_label' column or the 'tags' column

In [20]:
# load information about words
#!python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

In [21]:
# "article" column has article text
# Let's make new column with title and article text appended + tokenized and cleaned.
# also get max article length while we're here, for padding later
import re
combined_clean = []
max_length = 0

for i in tqdm(range(len(df))):
    row = df.iloc[i].copy()
    combined = row['title'] + ' ' + row['article']
    combined = re.sub("[^A-Za-z']+", ' ', combined).lower()
    combined = nlp(combined)
    combined = [token.lemma_ for token in combined if ((not token.is_stop) or (' ' in token.text))]
    if len(combined) > 1:
        if len(combined) > max_length:
            max_length = len(combined)
        cleaned = ' '.join(combined)
    combined_clean.append(cleaned)

df['cleaned'] = combined_clean
df.to_csv('cleaned.csv')
print('max_length:', max_length)

HBox(children=(FloatProgress(value=0.0, max=5000.0), HTML(value='')))


max_length: 3575


#### re-read clean data and let's begin

In [22]:
df_clean = pd.read_csv('cleaned.csv')
df_clean.columns
max_length=3575 # calculated above, hardcoded here to avoid running long code

#### Get vocab

In [23]:
articles = [article.split(' ') for article in list(df_clean['cleaned'])]
word_freq = dict(Counter([token for text in articles for token in text]).most_common())
len(word_freq)

48338

In [24]:
min_freq = 5  # word must appear at least min_freq times to get it's own token

word2idx = {}
i = 0  # leave 0 for unknowns/infrequents
for word in word_freq:
    if word_freq[word] > min_freq:
        word2idx[word] = i
        i += 1
    else:
        word2idx[word] = 0

vocab_length = max(word2idx.values()) + 1
vocab_length  # essentially number of all unique words excluding infrequents

14423

In [25]:
class ArticleDataset(Dataset):
    def __init__(self, df, word_dict, max_length):
        self.df = df
        self.word_dict = word_dict
        self.max_length = max_length
        
    def __len__(self):
        return(len(self.df))
    
    def __getitem__(self, idx):
        row = self.df.iloc[idx].copy()
        
        article = row['cleaned'].split(' ')
        x = torch.zeros(self.max_length)
        for i in range(len(article)):
            # front pad
            x[self.max_length - len(article) + i] = self.word_dict[article[i]]
            
        y = torch.tensor(row['agri_label'])
        
        return x.long(), y.float()

ds = ArticleDataset(df_clean, word2idx, max_length)
next(iter(ds))

(tensor([   0,    0,    0,  ..., 6501,  806, 1627]), tensor(0.))

In [26]:
# get data loaders
batch_size = 100

train_df, valid_df = train_test_split(df_clean, test_size=0.2)

train_ds = ArticleDataset(train_df, word2idx, max_length)
valid_ds = ArticleDataset(valid_df, word2idx, max_length)

train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=batch_size)

# Models

## RNN's

In [27]:
class RNNModel1(nn.Module):
    def __init__(self, vocab_size=vocab_length, hidden_size=50):
        super(RNNModel1, self).__init__()
        self.i_h = nn.Embedding(vocab_size, hidden_size)  
        self.h_h = nn.Linear(hidden_size, hidden_size)     
        self.h_o = nn.Linear(hidden_size, 1)
        self.h = 0
        self.relu = nn.ReLU()
        
    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h = self.h_h(self.h)
        out = self.h_o(self.h)
        self.h = self.h.detach()
        return out.squeeze()


In [28]:
class RNNModel2(nn.Module):
    def __init__(self, vocab_size=vocab_length, embedding_size=50):
        super(RNNModel2, self).__init__()
        self.word_emb = nn.Embedding(vocab_size, embedding_size, padding_idx=0)
 
        self.rnn = nn.RNN(input_size=embedding_size, hidden_size=1, batch_first=True)
        
    def forward(self, x):
        x = self.word_emb(x)

        x = self.rnn(x)[1]
    
        return torch.squeeze(x)

In [29]:
# let's train
epochs = 5
model1_rnn = RNNModel1()
lossFun = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model1_rnn.parameters(), lr = 0.001)

for epoch in tqdm(range(epochs)):
    total_loss = 0.0
    model1_rnn.train()
    for x, y in tqdm(train_dl):
        yhat = model1_rnn(x)
        loss = lossFun(yhat, y)
        
        total_loss += loss.item()
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    avg_loss = total_loss / len(train_dl)
    print('epoch:', epoch, 'avg_loss:', avg_loss)
        
    

HBox(children=(FloatProgress(value=0.0, max=5.0), HTML(value='')))

HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 0 avg_loss: 0.2204151235637255


HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 1 avg_loss: 0.1174587975256145


HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 2 avg_loss: 0.11634211838245392


HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 3 avg_loss: 0.11704528140835464


HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 4 avg_loss: 0.11682993597351014



In [30]:
# let's train
epochs = 5
model2_rnn = RNNModel2()
lossFun = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model2_rnn.parameters(), lr = 0.001)

for epoch in tqdm(range(epochs)):
    total_loss = 0.0
    model2_rnn.train()
    for x, y in tqdm(train_dl):
        yhat = model2_rnn(x)
        loss = lossFun(yhat, y)
        
        total_loss += loss.item()
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    avg_loss = total_loss / len(train_dl)
    print('epoch:', epoch, 'avg_loss:', avg_loss)
        
    

HBox(children=(FloatProgress(value=0.0, max=5.0), HTML(value='')))

HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 0 avg_loss: 0.8319368928670883


HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 1 avg_loss: 0.794037775695324


HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 2 avg_loss: 0.7512031868100166


HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 3 avg_loss: 0.7021968856453895


HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 4 avg_loss: 0.6514816924929618



#### Precision and recall:
Precision = True Positives / (True Positives + False Positives)

Recall = True Postivies / (True Positives + False Negatives)

Reason: Significantly fewer agriculture articles compared to others, don't really care about True Negatives (i.e. y and yhat both 0)

We want to emphasize guessing the agriculture ones correctly


In [31]:
def precision_recall(ytrues, ypreds):
    assert(len(ytrues) == len(ypreds))
    
    # don't need all of these, tracking just in case
    TP = 0
    TN = 0
    FP = 0
    FN = 0
    N = len(ytrues)
    
    for ytrue, ypred in zip(ytrues, ypreds):
        
        if   ytrue == 1 and ypred == 1:
            TP += 1
        elif ytrue == 0 and ypred == 1:
            FP += 1
        elif ytrue == 1 and ypred == 0:
            FN += 1
        elif ytrue == 0 and ypred == 0:
            TN += 1
        else:
            print('bad input, not ones and zeroes')
            break
    
    if TP == 0:
        return (0,0)
    
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    
    return (precision,recall)

In [43]:
def get_binary_prediction(x, model, threshold=0.01):
    model.eval()
    return (torch.sigmoid(model(x)) > threshold).float()


In [44]:
model1_ypreds_rnn = []
model2_ypreds_rnn = []
ytrues_rnn        = []

for x, y in valid_dl:
    
    ytrues_rnn += list(y)
    model1_ypreds_rnn += list(get_binary_prediction(x, model1_rnn))
    model2_ypreds_rnn += list(get_binary_prediction(x, model2_rnn))


In [45]:
precision_recall(ytrues_rnn, model1_ypreds_rnn)

(0.024, 1.0)

In [46]:
precision_recall(ytrues_rnn, model2_ypreds_rnn)

(0.024, 1.0)

## LSTM's

In [36]:
# LSTM Model 1
class LSTMModel1(nn.Module):
    def __init__(self,
                 batch_size,
                 vocab_size=vocab_length,
                 hidden_size=50):
        
        super(LSTMModel1, self).__init__()
        
        self.emb = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.h2o = nn.Linear(hidden_size, 1)  # binary classification
        
    
    def forward(self, x):
        emb = self.emb(x)
        out, hidden = self.lstm(emb)
        
        # many to one, so final hidden state through linear layer
        final = torch.squeeze(hidden[-1].permute((1,2,0)))
        final = self.h2o(final)
        
        
        return(torch.squeeze(final))

In [37]:
# LSTM Model 2 -- dropout layer added
class LSTMModel2(nn.Module):
    def __init__(self,
                 batch_size,
                 vocab_size=vocab_length,
                 hidden_size=50,
                 p=0.5
                ):
        
        super(LSTMModel2, self).__init__()
        
        self.emb = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.h2o = nn.Linear(hidden_size, 1)  # binary classification
        
        self.dropout = nn.Dropout(p)
    
    def forward(self, x):
        emb = self.emb(x)
        out, hidden = self.lstm(emb)
        
        final = torch.squeeze(hidden[-1].permute((1,2,0)))
        
        final = self.dropout(final)
        
        final = self.h2o(final)
        
        return(torch.squeeze(final))

In [38]:
# train LSTM 1
epochs = 5
model1_lstm = LSTMModel1(batch_size)
lossFun = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model1_lstm.parameters(), lr = 0.001)

for epoch in tqdm(range(epochs)):
    total_loss = 0.0
    model1_lstm.train()
    for x, y in tqdm(train_dl):
        yhat = model1_lstm(x)
        
        loss = lossFun(yhat, y)
        
        total_loss += loss.item()
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    avg_loss = total_loss / len(train_dl)
    print('epoch:', epoch, 'avg_loss:', avg_loss)

HBox(children=(FloatProgress(value=0.0, max=5.0), HTML(value='')))

HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 0 avg_loss: 0.4460502710193396


HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 1 avg_loss: 0.12852044766768814


HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 2 avg_loss: 0.11184826488606632


HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 3 avg_loss: 0.10404208078980445


HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 4 avg_loss: 0.09463547193445265



In [39]:
# train lstm 2
epochs = 5
model2_lstm = LSTMModel2(batch_size)
lossFun = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model2_lstm.parameters(), lr = 0.001)

for epoch in tqdm(range(epochs)):
    total_loss = 0.0
    model2_lstm.train()
    for x, y in tqdm(train_dl):
        yhat = model2_lstm(x)
        
        loss = lossFun(yhat, y)
        
        total_loss += loss.item()
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    avg_loss = total_loss / len(train_dl)
    print('epoch:', epoch, 'avg_loss:', avg_loss)

HBox(children=(FloatProgress(value=0.0, max=5.0), HTML(value='')))

HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 0 avg_loss: 0.46490304078906775


HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 1 avg_loss: 0.13312433022074402


HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 2 avg_loss: 0.11384900910779834


HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 3 avg_loss: 0.10714390901848674


HBox(children=(FloatProgress(value=0.0, max=40.0), HTML(value='')))


epoch: 4 avg_loss: 0.09489621040411293



In [48]:
model1_ypreds_lstm = []
model2_ypreds_lstm = []
ytrues_lstm        = []

for x, y in valid_dl:
    
    ytrues_lstm += list(y)
    model1_ypreds_lstm += list(get_binary_prediction(x, model1_lstm))
    model2_ypreds_lstm += list(get_binary_prediction(x, model2_lstm))


In [49]:
precision_recall(ytrues_lstm, model1_ypreds_lstm)

(0.032432432432432434, 0.75)

In [50]:
precision_recall(ytrues_lstm, model2_ypreds_lstm)

(0.03355704697986577, 0.625)

# Model Performance Summary

### RNN Performance:
RNN Model 1 precision, recall:  (0.024, 1.0)

RNN Model 2 precision, recall:  (0.024, 1.0)

### LSTM Performance:
LSTM Model 1 precision, recall: (0.032, 0.75)

LSTM Model 2 precision, recall: (0.034, 0.625)

### Fastai model performance:
For the fastai it was 60% precision, 80% recall.