# Introduction to deep learning - exam project
By: Andreas Nørregaard Pedersen & Michael Rulle
## Fake News Text Generator
In this project the group has been hired to produce fake news for Danish media, in order to wreak havoc in Danish society. The project consists of a model that generates text from a starting sentence. Text generation is hard for a machine to learn, because text carries context at several levels (a sentence is made of words, which are made of letters).

## Data collection
The data was taken from Kaggle: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset.
This dataset contains both real and fake news articles, which would give the text model different scenarios to learn from. Only the real articles are used.

## Prepare the data
The data is cleaned of the repeated prefix at the start of each article, e.g. the city name and news agency, and punctuation is replaced by tokens.
The longest article is found, and the remaining articles are filled with padding so that they all reach the same number of tokens.
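
As a minimal sketch of the two steps above (not the notebook's exact code), the snippet below replaces a few punctuation marks with tokens and pads every token list to the length of the longest one. The helper names `tokenize` and `pad_all`, the token spellings, and the sample sentences are assumptions made for illustration.

```python
# Illustrative only: helper names and token spellings are assumptions.
PUNCTUATION_TOKENS = {
    ".": " <period> ",
    ",": " <comma> ",
    "?": " <question_mark> ",
}


def tokenize(text):
    """Lowercase the text, replace punctuation with tokens, split on whitespace."""
    text = text.lower()
    for mark, token in PUNCTUATION_TOKENS.items():
        text = text.replace(mark, token)
    return text.split()


def pad_all(token_lists, pad_token="<pad>"):
    """Return copies of the token lists, padded to the longest length."""
    longest = max(len(tokens) for tokens in token_lists)
    return [tokens + [pad_token] * (longest - len(tokens)) for tokens in token_lists]


articles = ["The vote passed, finally.", "Why?"]
padded = pad_all([tokenize(article) for article in articles])
for tokens in padded:
    print(tokens)
```

Surrounding each token with spaces is what lets a plain `split()` separate punctuation from the neighbouring word.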

## Model
We chose an LSTM because of its memory, and on top of the LSTM we chose to make it bidirectional so that it remembers the sequence, i.e. takes more of the context in a sentence into account.
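
To make the effect of `bidirectional=True` concrete, the sketch below runs a random batch through `nn.LSTM` and prints the resulting shapes. The sizes are arbitrary assumptions. The point is that the output feature dimension is `2 * hidden_size` (one half per direction), while the hidden and cell states have `num_layers * 2` as their first dimension.

```python
import torch
import torch.nn as nn

# Arbitrary sizes chosen for illustration.
batch_size, seq_len, hidden_size, num_layers = 3, 5, 8, 2

lstm = nn.LSTM(hidden_size, hidden_size, num_layers,
               batch_first=True, bidirectional=True)

x = torch.randn(batch_size, seq_len, hidden_size)
h0 = torch.zeros(num_layers * 2, batch_size, hidden_size)
c0 = torch.zeros(num_layers * 2, batch_size, hidden_size)

out, (hn, cn) = lstm(x, (h0, c0))
print(out.shape)  # (batch_size, seq_len, 2 * hidden_size)
print(hn.shape)   # (num_layers * 2, batch_size, hidden_size)
```

A linear layer placed on top of this output therefore needs `2 * hidden_size` input features per time step, independent of `num_layers`.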

## Hyperparameter tuning
We tested different parameter values to find the combination that gave the best accuracy.
We then trained the model with the best hyperparameters found.
Dropout probability was not built in at the time of hand-in, but can be presented at the exam. Therefore dropout probability = 0.


## Evaluation
Visualization of the training loss with plots, and a prediction made by generating text.

## How to run
There must be three folders named "data", "models" and "images". The True.csv file from the dataset must be placed in the data folder. Then you can run the notebook.
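
The folder layout can be prepared with a few lines of Python. This is only a convenience sketch: the function name `prepare_folders` is hypothetical, and `True.csv` still has to be downloaded from Kaggle and copied into `data` by hand.

```python
import tempfile
from pathlib import Path


def prepare_folders(root="."):
    """Create the data, models and images folders under root if they are missing."""
    created = []
    for name in ("data", "models", "images"):
        folder = Path(root) / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Demonstration in a throwaway directory; in the project itself,
# call prepare_folders() from the notebook's directory instead.
with tempfile.TemporaryDirectory() as tmp:
    folders = prepare_folders(tmp)
    print([folder.name for folder in folders])
```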


Andreas and Michael

In [None]:
# entire project can be found at: https://github.com/mrulle/deep_learning_exam_project

import pickle
import pandas as pd
import re
import torch
import torch.nn as nn
import random
from datetime import datetime
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker


In [None]:
corpus_size = 2000
# device config
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
df = pd.read_csv('./data/True.csv')
df.head()


In [None]:



# functions to clean data

# search-and-replace patterns for curly quotes and runs of spaces/tabs
# (newlines are left alone because they are turned into tokens later)
double_quotes = re.compile(r'“|”')
single_quotes = re.compile(r'’|‘')
multiple_whitespace = re.compile(r'[ \t\v\f]+')


def clean_data(row):
    txt = row['text'].lower()
    # remove everything up to the first dash: CITY (news agency) and separator "-"
    txt = txt[txt.find('-')+1:].lstrip()
    txt = double_quotes.sub('"', txt)
    txt = single_quotes.sub('', txt)
    txt = multiple_whitespace.sub(' ', txt)

    return txt
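
A self-contained sketch of the cleaning idea is shown below: drop the dateline in front of the first dash, normalize curly quotes, and collapse runs of spaces and tabs. The function name `strip_dateline_and_normalize` and the sample text are assumptions; it is not the notebook's `clean_data` itself.

```python
import re

CURLY_DOUBLE_QUOTES = re.compile(r"[“”]")
CURLY_SINGLE_QUOTES = re.compile(r"[‘’]")
HORIZONTAL_WHITESPACE = re.compile(r"[ \t\v\f]+")


def strip_dateline_and_normalize(text):
    """Lowercase, drop the 'CITY (agency) -' prefix, normalize quotes and spaces."""
    text = text.lower()
    # str.find returns -1 when there is no dash, so the slice then starts at 0.
    text = text[text.find("-") + 1:].lstrip()
    text = CURLY_DOUBLE_QUOTES.sub('"', text)
    text = CURLY_SINGLE_QUOTES.sub("", text)
    return HORIZONTAL_WHITESPACE.sub(" ", text)


sample = "WASHINGTON (Reuters) - The senator said:  “We’ll   vote”"
print(strip_dateline_and_normalize(sample))
```

Note that cutting at the first dash assumes the dateline is present; an article without one loses everything before its first hyphen.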


In [None]:
# function to tokenize the text

def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenized dictionary where the key is the punctuation and the value is the token
    """
    token = dict()
    token['.'] = ' <PERIOD> '
    token[','] = ' <COMMA> '
    token['"'] = ' <QUOTATION_MARK> '
    token[':'] = ' <COLON> '
    token[';'] = ' <SEMICOLON> '
    token['!'] = ' <EXCLAMATION_MARK> '
    token['?'] = ' <QUESTION_MARK> '
    token['('] = ' <LEFT_PAREN> '
    token[')'] = ' <RIGHT_PAREN> '
    # dashes get their own token instead of being mixed up with question marks
    token['-'] = ' <DASH> '
    token['\n'] = ' <NEW_LINE> '
    return token


In [None]:
# apply padding for all articles to match length of longest article
# this turned out to be unnecessary due to the get_random_batch and
def pad_to_max(tokenized, max_length):
    # extend the token list in place with '<pad>' until it holds max_length tokens
    padding_length = max_length - len(tokenized)
    tokenized.extend(['<pad>'] * padding_length)
    return tokenized


In [None]:
df = df[:corpus_size]
df = df.astype({'text': 'string'})

df['text'] = df.apply(clean_data, axis=1)
print(df['text'][0])
articles = df['text'].values.tolist()

longest_article = 0

token_dict = token_lookup()

tokenized_articles = []

for article in articles:
    for key, token in token_dict.items():
        article = article.replace(key, token)
    article = article.lower()
    article = article.split()
    if len(article) > longest_article:
        longest_article = len(article)
    tokenized_articles.append(article)


print(f'longest article contains {longest_article} tokens')


unique_tokens = set()

for tokens in tokenized_articles:
    tokens = pad_to_max(tokens, longest_article)
    for token in tokens:
        unique_tokens.add(token)

unique_tokens = list(unique_tokens)

print(f'there are {len(unique_tokens)} unique tokens')

print(
    f'articles equal length: {len(tokenized_articles[0])==len(tokenized_articles[1])}')

articles = [' '.join(art) for art in tokenized_articles]

print(articles[0])
print(tokenized_articles[0])


In [None]:
with open('./data/tokenized_2k_articles.dat', 'wb') as file:
    pickle.dump(tokenized_articles, file)

with open('./data/unique_words_2k_articles.dat', 'wb') as file:
    pickle.dump(unique_tokens, file)


In [None]:
# get words
all_words = []


with open('./data/unique_words_2k_articles.dat', 'rb') as file:
    all_words = pickle.load(file)

all_words.append(' ')
vocab_length = len(all_words)
print(f'number of words: {vocab_length}')
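
Later cells look up words with `all_words.index(...)`, which scans the whole vocabulary on every call. A possible alternative, sketched here with a tiny made-up vocabulary, is to build a word-to-index dictionary once. The names `word_to_index` and `encode` are assumptions and do not appear in the notebook.

```python
# Tiny made-up vocabulary standing in for all_words.
vocabulary = ["<period>", "the", "president", "is", "here"]

# Build the reverse mapping once; each lookup is then a dictionary access.
word_to_index = {word: index for index, word in enumerate(vocabulary)}


def encode(words):
    """Map a list of words to their vocabulary indices."""
    return [word_to_index[word] for word in words]


indices = encode(["the", "president", "is", "here", "<period>"])
print(indices)
# The mapping round-trips back to the original words.
print([vocabulary[i] for i in indices])
```

A word that is not in the vocabulary raises `KeyError` here, just as `list.index` raises `ValueError`.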

In [None]:
# read file
corpus = []
file_content = []
with open('./data/tokenized_2k_articles.dat', 'rb') as file:
    file_content = pickle.load(file)

# this approach uses embedding, and therefore doesn't need padding so we remove it from the prepared data
print(len(file_content))
for article in file_content:
    corpus.extend([word.strip() for word in article if word not in ['<pad>', ' ']])

print(len(corpus))
# print(corpus[-30:])
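
Training on this flat word list works by next-word prediction: a window of `chunk_length + 1` words is cut from the corpus, and the target sequence is the input shifted by one position. The sketch below shows this with a toy corpus; `make_pair` is a hypothetical helper, not part of the notebook.

```python
import random

toy_corpus = "the president said the vote will happen soon <period>".split()


def make_pair(corpus, chunk_length, rng=random):
    """Cut a random window and return (input_words, target_words)."""
    # The window needs chunk_length + 1 words, so the last valid start index
    # is len(corpus) - chunk_length - 1 (randint includes both endpoints).
    start = rng.randint(0, len(corpus) - chunk_length - 1)
    window = corpus[start:start + chunk_length + 1]
    return window[:-1], window[1:]


inputs, targets = make_pair(toy_corpus, chunk_length=4, rng=random.Random(0))
for source, target in zip(inputs, targets):
    print(f"{source} -> {target}")
```

Each printed pair is one training example: given the word on the left, the model should predict the word on the right.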


In [None]:
# module

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size, dropout_rate):
        super().__init__()
        self.dropout_rate = dropout_rate
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.embed = nn.Embedding(input_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True, dropout=self.dropout_rate, bidirectional=True)
        # a bidirectional LSTM outputs hidden_size features per direction,
        # so the linear layer takes hidden_size * 2 inputs regardless of num_layers
        self.fc = nn.Linear(self.hidden_size * 2, output_size)

    def forward(self, x, hidden, cell):
        # x holds one word index per batch row, i.e. a single time step
        out = self.embed(x)
        out, (hidden, cell) = self.lstm(out.unsqueeze(1), (hidden, cell))

        out = self.fc(out.reshape(out.shape[0], -1))
        return out, (hidden, cell)

    def init_hidden(self, batch_size):
        # first dimension is num_layers * 2 because of the two directions
        hidden = torch.zeros(self.num_layers * 2, batch_size, self.hidden_size).to(device)
        cell = torch.zeros(self.num_layers * 2, batch_size, self.hidden_size).to(device)
        return hidden, cell

    def save(self, f_path):
        # store only the weights, matching how the training loop saves models
        torch.save(self.state_dict(), f_path)

    


In [None]:
class Generator():
    def __init__(self, chunk_length=200, num_epochs=500, batch_size=1, hidden_size=256, num_layers=2, learning_rate=0.002, dropout_rate=0.2):
        self.dropout_rate = dropout_rate
        self.chunk_len = chunk_length
        self.num_epochs = num_epochs
        self.batch_size = batch_size
        self.print_every = self.num_epochs // 20 or 1
        self.plot_every = self.num_epochs // 40 or 1
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lr = learning_rate



    def word_tensor(self, string):
        tensor = torch.zeros(len(string)).long()
        for c in range(len(string)):
            tensor[c] = all_words.index(string[c])
        return tensor
    
    def get_random_chunk(self, chunk_length):
        # a chunk holds chunk_length + 1 words, so the last valid start index
        # is len(corpus) - chunk_length - 1 (random.randint is inclusive)
        start_idx = random.randint(0, len(corpus) - chunk_length - 1)
        end_idx = start_idx + chunk_length + 1
        return corpus[start_idx:end_idx]

    def get_random_batch(self, chunk_length):
        text_input = torch.zeros(self.batch_size, chunk_length, dtype=torch.long)
        text_target = torch.zeros(self.batch_size, chunk_length, dtype=torch.long)
        for i in range(self.batch_size):
            # draw a separate random chunk for every row of the batch
            text_str = self.get_random_chunk(chunk_length)
            text_input[i,:] = self.word_tensor(text_str[:-1])
            text_target[i,:] = self.word_tensor(text_str[1:])
        return text_input, text_target


    def generate(self, initial_str='the president is dead', predict_len=200, temperature=0.85):
        initial_words = initial_str.split()
        # generation feeds one word at a time, so the batch size is always 1
        hidden, cell = self.rnn.init_hidden(batch_size=1)
        initial_input = self.word_tensor(initial_words)
        predicted = initial_words

        was_training = self.rnn.training
        self.rnn.eval()  # disable dropout while generating
        with torch.no_grad():
            # warm up the hidden state on all but the last prompt word
            for p in range(len(initial_words) - 1):
                _, (hidden, cell) = self.rnn(initial_input[p].view(1).to(device), hidden, cell)

            last_word = initial_input[-1]
            for p in range(predict_len):
                output, (hidden, cell) = self.rnn(last_word.view(1).to(device), hidden, cell)
                # lower temperature sharpens the distribution, higher flattens it
                output_dist = output.view(-1).div(temperature).exp()
                top_word = torch.multinomial(output_dist, 1)[0].item()
                predicted_word = [all_words[top_word]]
                predicted.extend(predicted_word)
                last_word = self.word_tensor(predicted_word)
        self.rnn.train(was_training)  # restore the previous mode

        return predicted


    def train(self):
        self.rnn = RNN(vocab_length, self.hidden_size, self.num_layers, vocab_length, self.dropout_rate).to(device)
        optimizer = torch.optim.Adam(self.rnn.parameters(), lr=self.lr)
        criterion = nn.CrossEntropyLoss()
        print(f'<{datetime.now()}> starting training')
        lowest_loss = float('inf')
        all_losses = []
        accumulated_losses = 0
        # note: each "epoch" here is a single update on one random batch
        for epoch in range(1, self.num_epochs + 1):
            inputs, targets = self.get_random_batch(self.chunk_len)
            hidden, cell = self.rnn.init_hidden(batch_size=self.batch_size)

            optimizer.zero_grad()
            loss = 0
            inputs = inputs.to(device)
            targets = targets.to(device)

            # feed the chunk one word at a time and sum the per-step losses
            for c in range(self.chunk_len):
                output, (hidden, cell) = self.rnn(inputs[:, c], hidden, cell)
                loss += criterion(output, targets[:, c])

            loss.backward()
            optimizer.step()
            loss = loss.item() / self.chunk_len
            accumulated_losses += loss
            if loss < lowest_loss:
                # state_dict() returns references to the live parameters, so clone them;
                # otherwise the "best" model would simply follow the latest weights
                self.best_model = {k: v.detach().clone() for k, v in self.rnn.state_dict().items()}
                # print(f'<{datetime.now()}> better model found after {epoch}/{self.num_epochs} epochs with loss: {loss}')
                lowest_loss = loss
            if epoch % self.plot_every == 0:
                all_losses.append(accumulated_losses / self.plot_every)
                accumulated_losses = 0
            if epoch % self.print_every == 0: # enable below lines, if you wish to see print statements on progress
                pass
                # print(f'\n\n<{datetime.now()}> | epoch: {epoch}/{self.num_epochs} | loss: {loss}')
                # print(self.generate())
        file_path = f'./models/epoc_{self.num_epochs}_chunk_{self.chunk_len}_hiddensize_{self.hidden_size}_lr_{self.lr}__loss_{lowest_loss}.pt'
        print(f'saving model at {file_path}')
        torch.save(self.best_model, file_path)
        return all_losses


    

In [None]:
run_params = 'chunk_length=200, num_epochs=500, batch_size=1, hidden_size=256, num_layers=2, learning_rate=0.002, dropout_rate=0.2'
gen = Generator() # with default parameters: chunk_length=200, num_epochs=500, batch_size=1, hidden_size=256, num_layers=2, learning_rate=0.002, dropout_rate=0.2
%matplotlib inline
losses_to_plot = gen.train()
plt.figure()
plt.xlabel(f'epochs/{gen.plot_every}')
plt.ylabel('loss')
plt.text(2, 10, run_params)
plt.plot(losses_to_plot)
file_name = f'./images/test_run.svg'
plt.savefig(file_name, format='svg')

temperatures = [0.2, 0.4, 0.6, 0.8]

for temperature in temperatures:
    stmt = ' '.join(gen.generate(initial_str='i would like', predict_len=150, temperature=temperature)).replace('<quotation_mark>', '"').replace(' <question_mark>','?').replace(' <comma>', ',').replace(' <period>', '.')
    print(f'temperature: {temperature}\nstatement:\n{stmt}')
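
The `temperature` argument divides the model's output scores before they are exponentiated and sampled. The pure-Python sketch below, using made-up scores, shows the effect: a low temperature concentrates probability on the highest score, a high temperature flattens the distribution. `temperature_probabilities` is a hypothetical helper.

```python
import math


def temperature_probabilities(scores, temperature):
    """Softmax over scores / temperature, computed in a numerically stable way."""
    scaled = [score / temperature for score in scores]
    highest = max(scaled)
    weights = [math.exp(value - highest) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


scores = [2.0, 1.0, 0.5]  # made-up scores for three candidate words
for temperature in (0.2, 0.8, 2.0):
    probabilities = temperature_probabilities(scores, temperature)
    print(temperature, [round(p, 3) for p in probabilities])
```

The notebook skips the explicit normalization because `torch.multinomial` accepts non-negative weights that do not sum to one and samples in proportion to them.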


In [None]:
# default parameters for generator: 
# chunk_length=200, num_epochs=500, batch_size=1, hidden_size=256, num_layers=2, learning_rate=0.002, dropout_rate=0.2

epoch_numbers = [50, 25] # to save time
chunk_lengths = [50, 200, 300]
batch_sizes = [1, 5, 10]
hidden_sizes = [64, 128, 256]
dropout_probs = [0.1, 0.2, 0.4] # not implemented in model at time of handin
layer_numbers = [2, 4] # taken out due to mismatch on layer input/output
learning_rates = [0.001, 0.003, 0.005]


# simple naive tuning
tuning_results = []
for epoch in epoch_numbers:
    for batch_size in batch_sizes:
        for chunk in chunk_lengths:
            for hidden_size in hidden_sizes:
                for dropout_prob in dropout_probs:
                    for learning_rate in learning_rates:
                        gen = Generator(chunk_length=chunk, num_epochs=epoch, batch_size=batch_size, hidden_size=hidden_size, learning_rate=learning_rate, dropout_rate=dropout_prob)
                        losses_to_plot = gen.train()
                        print(losses_to_plot)
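                        # include every tuned parameter so plots from different runs do not overwrite each other
                        run_params = f'epoc_{epoch}_batch_{batch_size}_chunk_{chunk}_hidden_s_{hidden_size}_dropout_{dropout_prob}_lr_{learning_rate}'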
                        labelx = f'epochs/{gen.plot_every}'
                        plt.figure()
                        plt.xlabel(labelx)
                        plt.ylabel('loss')
                        plt.text(2, 10, run_params)
                        plt.plot(losses_to_plot)
                        file_name = f'./images/{run_params}.svg'
                        plt.savefig(file_name, format='svg')
                        plt.close()

                        test_res = f'epochs: {epoch}\nbatch size: {batch_size}\nchunk length: {chunk}\nhidden_size: {hidden_size}\ndropout: {dropout_prob}\nlearning rate: {learning_rate}'
                        print(test_res)
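
The nested loops amount to a grid search over every combination of values. A flatter way to write the same enumeration, sketched here with `itertools.product`, is shown below. `fake_final_loss` is invented for illustration and stands in for training a model and reading its final loss; the grid values are a subset of those above.

```python
from itertools import product

grid = {
    "chunk_length": [50, 200],
    "hidden_size": [64, 128],
    "learning_rate": [0.001, 0.003],
}


def fake_final_loss(chunk_length, hidden_size, learning_rate):
    """Placeholder for a real training run; returns an invented number."""
    return abs(chunk_length - 200) / 100 + abs(hidden_size - 128) / 64 + learning_rate


results = []
for values in product(*grid.values()):
    # pair each parameter name with the value chosen for this combination
    params = dict(zip(grid.keys(), values))
    results.append((fake_final_loss(**params), params))

best_loss, best_params = min(results, key=lambda item: item[0])
print(f"{len(results)} combinations tried")
print(f"best: {best_params} (placeholder loss {best_loss:.3f})")
```

Collecting `(loss, params)` pairs also makes it possible to pick the best run programmatically instead of comparing saved plots by eye.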


In [None]:
# from the saved plots, it looks like the following parameters are the best:

# chunk_length = 200
# hidden_size = 128
# learning_rate = 0.001
# dropout_rate = 0.2 - this wasn't part of the tuning, new tuning is required - we consider 0.2 a qualified guess

# so we're going to go with them

gen = Generator(chunk_length=200, num_epochs=5000, hidden_size=128, learning_rate=0.001, dropout_rate=0.2)
gen.train()


temperatures = [0.2, 0.4, 0.6, 0.8]

for temperature in temperatures:
    stmt = ' '.join(gen.generate(initial_str='i would like', predict_len=150, temperature=temperature)).replace('<quotation_mark>', '"').replace(' <question_mark>','?').replace(' <comma>', ',').replace(' <period>', '.')
    print(f'temperature: {temperature}\nstatement:\n{stmt}')
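
The chained `.replace` calls turn a few tokens back into punctuation. A table-driven variant, sketched below, keeps the mapping in one place. The function name `detokenize` and the choice of which tokens to map are assumptions.

```python
# Lowercase tokens, since the prepared text is lowercased after tokenization.
TOKEN_TO_MARK = {
    "<period>": ".",
    "<comma>": ",",
    "<question_mark>": "?",
    "<quotation_mark>": '"',
}
ATTACH_TO_PREVIOUS = {".", ",", "?"}


def detokenize(words):
    """Join words into a string, restoring punctuation without a leading space."""
    pieces = []
    for word in words:
        mark = TOKEN_TO_MARK.get(word, word)
        if mark in ATTACH_TO_PREVIOUS and pieces:
            # glue the punctuation mark onto the preceding word
            pieces[-1] += mark
        else:
            pieces.append(mark)
    return " ".join(pieces)


generated = ["i", "would", "like", "<comma>", "he", "said", "<period>"]
print(detokenize(generated))
```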
