# Exercise 1 Report
<b> Contains the code, and report as comments </b> <br> 

The following script presented below is tasked with reading and processing information acquired into something readable, along with the statistics required to make an informed decision regarding it. This report will cover the code blocks performing so, alongsides the reasons behind the results and choice of methods.

The following block covers the imports and variables instantiated. There is no real significant mention here

In [1]:
import re
import nltk
import json
from nltk import FreqDist, collections
from collections import Counter
from nltk.corpus import wordnet
from functools import lru_cache
from nltk import word_tokenize
import math
import time

# PartA:Q1
data = []
lmm = nltk.WordNetLemmatizer()
lemma_tize = lru_cache(maxsize=50000)(lmm.lemmatize)
vocab = Counter()
total_length = 0
positive_count = 0
negative_count = 0
news_pos = 0
news_neg = 0
C0 = 0
word_list = []
pos_list = set()
neg_list = set()
gen_sent = ['is','this']
wordlist1 = []
wordlist2 = []
test_list = []

This section reads the positive and negative word text files (I can't seem to make it read anything in a subdirectory, so I had to place them in the same directory as this script), saving them as a key to a set: one positive, and one negative.

There is also a function for POS tagging that has been commented out, which will be addressed further down during the lemmatization phase

There is also another function, regTokenize(), which is another way of tokenizing that I chose purely because it had less overhead and did roughly the same job as nltk.word_tokenize, thus taking less time to process

In [2]:
with open(r"positive-words.txt") as p:
    for line in p:
        key = p.readline()
        key = key.strip()
        pos_list.add(key)
    # print(pos_list)
with open(r"negative-words.txt") as p:
    for line in p:
        key = p.readline()
        key = key.strip()
        neg_list.add(key)
    # print(neg_list)
    
# def get_pos(word):
#     w_synsets = wordnet.synsets(word)
#     pos_counts = Counter()
#     pos_counts["n"] = len([item for item in w_synsets if item.pos() == "n"])
#     pos_counts["v"] = len([item for item in w_synsets if item.pos() == "v"])
#     pos_counts["a"] = len([item for item in w_synsets if item.pos() == "a"])
#     pos_counts["r"] = len([item for item in w_synsets if item.pos() == "r"])
#     most_common_pos_list = pos_counts.most_common(3)
#     return most_common_pos_list[0][0]

WORD = re.compile(r'\w+')
def regTokenize(text):
    words = WORD.findall(text)
    return words

The following code block has a massive pile of code sections mashed into a single for loop, so for the sake of convenience and navigation, comment markers with their respective numbers have been used:

[1] This section opens the jsonl file and loads in only the "content" element of each line, thus processing just the news stories. It is also where the text preprocessing is done, such that:
    * URLS are filtered out such that any word containing HTTP until the first whitespace is removed. I had considered adding another piece of filter to include words with no spacing and a ".", such as 'abcd.efg'. I assumed this to be dangerous as poor formatting of the file may result in unnecessary deletions*
    * Filtering of nonalphanumeric characters, words of 1 length, and pure numbers. standard regex sub process here*
The text is then tokenized with a custom function previously mentioned above*

[2] This section counts the overall occurences of positive/negative words, as well as an instantiated variable of current positive/negative words, to tally into the news stories which has more positives than negatives, and vice versa. 
* Not much to be said here, other than a set was used since order/value of the keys did not matter, just its presence*

[3] This section tallies the overall tokens per news story, into an overall total count, alongsides updating the dictionary with unique word occurences per story. 
Lemmatization of the tokenized story was also done here, saved as a string to a list, for future processing
* The lemmatization function is a basic wordnetlemmatizer in the initialization phase, but cached for performance. I also did not use the POS tagger function (from above) due to the absurdly long time it takes to process. This does however affect the trigrams and overall wording as some words like 'as', become 'a'*

In [3]:
# start_time = time.time()
#[1]=============================================
with open('signal-news1.jsonl') as f:
    for line in f.readlines():
        obj = json.loads(line)
        x = obj["content"].lower()
        #--------
        test_list.append(x)
        x = re.sub(r"http\S+", "", x)
        x = re.sub(r"\s*[^\w=]+\s*", " ", x)
        x = re.sub(r"\b\w{1,1}\b", "", x)
        x = re.sub(r'\b[0-9]+\b', '', x)
        y = regTokenize(x)
#[1]---------------------------------------------
#[2]=============================================
        pos_news_count = 0
        neg_news_count = 0
        for i in y:
            if i in pos_list:
                positive_count += 1
                pos_news_count += 1
            if i in neg_list:
                negative_count += 1
                neg_news_count += 1
        if pos_news_count > neg_news_count:
            news_pos += 1
        elif pos_news_count < neg_news_count:
            news_neg += 1
        # print(positive_count, pos_news_count)
        # y = word_tokenize(x)
#[2]----------------------------------------------
#[3]===============================================
        total_length += len(y)
        y = ' '.join(lemma_tize(word) for word in y)
        word_list.append(y)
        [vocab.update(s.split()) for s in y]
#[3]-----------------------------------------------
f.close()
print("token count:", total_length)
print("vocabulary count:", (len(set(word_list))))
print("total positive:",positive_count,"| total negative:", negative_count)
print("postive>negative: ", news_pos, "| negative>positive:", news_neg)

token count: 5773603
vocabulary count: 19175
total positive: 95412 | total negative: 62333
postive>negative:  10512 | negative>positive: 5388


[4] This section splits the aforementioned wordlist, into sections, one up to 16,000 rows, and the other 16,000+.
The trigram after was formed by converting the wordlist (corpus) into a giant, single string, before tokenizing and making trigrams of it. The top 25 were displayed after

[5] This section runs a similiar version of [4], with the exception that it only does so up to the 16,000 rows. It also has a logic where it considers 2 words, "filter1" and "filter2" as parameters, to find the most common trigram beginning with those 2. The last 2 words are then set to "filter1" and "filter2", whilst the last word in the trigram is added to a list. That repeats 8 times (since the list already has 'is' and 'this') to form the sentence

In [4]:
#[4]======================================================
word1 = word_list[:16000]
word2 = word_list[16001:]
test = test_list[16001:]

z = [' '.join(sentence.split()) for sentence in word_list]
joined = " ".join(z)
trigrams = nltk.trigrams(nltk.word_tokenize(joined))
fd = Counter(trigrams)
print("Top 25 trigrams: ",fd.most_common(25))
#[4]--------------------------------------------------------
#[5]=========================================================
z1 = [' '.join(sentence.split()) for sentence in word1]
joined1 = " ".join(z1)
tri1 = nltk.trigrams(nltk.word_tokenize(joined1))
fd1 = Counter(tri1)
filter1 = 'is'
filter2 = 'this'
counter = 0
sentencelist = ['is','this']
for trigram in fd1.most_common():
    if (filter1 == trigram[0][0]) and (filter2 == trigram[0][1]):
        #print(filter1, filter2, trigram)
        sentencelist.append(trigram[0][2])
        filter1 = trigram[0][1]
        filter2 = trigram[0][2]
        counter += 1
    if counter == 8:
        break
print("generated sentence: ", sentencelist)
#[5]-----------------------------------------------------

Top 25 trigrams:  [(('one', 'of', 'the'), 2440), (('on', 'share', 'of'), 2095), (('day', 'moving', 'average'), 1979), (('on', 'the', 'stock'), 1567), (('a', 'well', 'a'), 1426), (('in', 'research', 'report'), 1417), (('in', 'research', 'note'), 1375), (('the', 'year', 'old'), 1261), (('the', 'united', 'state'), 1227), (('for', 'the', 'quarter'), 1221), (('average', 'price', 'of'), 1193), (('research', 'report', 'on'), 1177), (('research', 'note', 'on'), 1138), (('the', 'end', 'of'), 1134), (('share', 'of', 'the'), 1132), (('in', 'report', 'on'), 1124), (('earnings', 'per', 'share'), 1123), (('buy', 'rating', 'to'), 1075), (('cell', 'phone', 'plan'), 1073), (('phone', 'plan', 'detail'), 1070), (('according', 'to', 'the'), 1069), (('of', 'the', 'company'), 1058), (('appeared', 'first', 'on'), 995), (('moving', 'average', 'price'), 995), (('price', 'target', 'on'), 981)]
is this (('is', 'this', 'the'), 5)
this the (('this', 'the', 'company'), 4)
the company (('the', 'company', 'diesel'), 

[6] The final section involves a convoluted sequence, so it has to be explained in this text rather than in the code:
   * [6.1] The remaining rows in the corpus began with a fresh, unprocessed list from row 16,000 onwards. The reason is because having the processed/lemmatized version would remove all stopwords, thus making it impossible to distinguish the sentences per news story. Each line (sentence) in sentence (news story) is then saved to an array, named 'arr'. There are 2 ngrams used here: a bigram and unigram (to form the [[x,y],[z]] trigram model, being the probability of z given x,y)
   * [6.2] Each sentence in the array is now having their probability of occuring in that sequence, calculated, using the method in [6.3]. This will be summed as the final overall perplexity
   * [6.3] This function checks each word in that sentence, then calculates the probability of all of them occuring (By saving the value from [6.4], given the occurance of a word, into a list, then summing the value, before returning it in the chain rule formula)
   * [6.4] This function calculates the probability of a unigram following a bigram (I realized computing the probability using a trigram is too complex to code, so I went with a bigram calculation)

In [5]:
#[6]=============================================================
#[6.4]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def bigramprobability(bigram, unigram_list, bigram_list):
    word1 = str(bigram[0])
    word2 = str(bigram[1])
    sent1 = bigram_list.get((word1, word2))
    sent2 = unigram_list.get((word1,))
    return sent1/sent2
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#[6.1]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tokenizedcorpus = [nltk.sent_tokenize(content) for content in test]
arr = []
for sentence in tokenizedcorpus:
    for line in sentence:
        arr.append(line)
bigram_list = Counter()
unigram_list =Counter()
for line in arr:
    y = nltk.word_tokenize(line)
    bigram_list.update(nltk.ngrams(y,2))
    unigram_list.update(nltk.ngrams(y,1))
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#[6.3]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
def sentence_probability(sentence, bigram_list, unigram_list):
    val = []
    words = nltk.word_tokenize(sentence)
    sentlength = len(words)
    finalval = 1
    for i in range(1,sentlength):
        bigrams = [words[i-1],words[i]]
        prob = bigramprobability(bigrams,unigram_list,bigram_list)
        # print(prob)
        value = 1/prob
        val.append(value)
    for x in val:
        finalval *= x
    # print(finalval)
    return 2**-((1/sentlength) * finalval)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#[6.2]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
totalvalue=0
for sentence in arr:
    sentprob = sentence_probability(sentence, bigram_list, unigram_list)
    # print(sentprob)
    totalvalue += sentprob
    # print(totalvalue)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
print("row 16001+ perplexity:", totalvalue)
#[6]------------------------------------------------------
#-------------------------------------------
# print("--- %s seconds ---" % (time.time() - start_time))

row 16001+ perplexity: 107.98467530302058
