# HW 2: N-gram Language Models
**Rishi Parida and Jazzy Howard**

## Date Out: Thursday, February 20
## Due Date: Sunday, March 8

This programming assignment is more open-ended than the previous ones. It centers on N-gram language models and asks you to:

* download and process a large text dataset in python using the <code>csv</code> library
* perform sentence and word tokenization
* calculate N-gram counts and probabilities
* compare the characteristics of the N-grams across different models
* generate random sentences using the models

<u>You may work in teams of two or three (2-tuples or 3-tuples?) for this assignment.</u>

<hr>

In [2]:
import nltk

In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [4]:
import csv
#Users/anjuchopra/Downloads/Wine/winemag-data_first150k.csv

In [5]:
import random

### Task #1

<u>Download two large text datasets from Kaggle.</u>

The <a href="http://kaggle.com">Kaggle competition hosting site</a> offers a number of free datasets that contain interesting text fields. For this assignment, we will use the "Wine Reviews" and "All the News" datasets. They can be accessed by selecting the "Datasets" header and then searching for these specific datasets. Then, choose "Data" from the sub-header, preview some of the csv data and notice how at least one of the columns in the dataset will contain sufficient text. I chose to direct you to these two datasets because the textual content seemed interesting and would have different language characteristics, and both were large csv files that could generate significant n-gram counts, but not be too large of a file.

<em>(You can use other datasets if you wish. Others that looked interesting on Kaggle include the "Yelp Dataset" (but it's over 3 GB!), "SMS Spam Collection Dataset", "Russian Troll Tweets", and "A Million News Headlines".)</em>

### Task #2

<u>Process the downloaded <code>csv</code> files in python.</u>

There's a nice csv library already included in Python for accessing values that are stored in a comma-separated values (csv) format. Read the <a href="https://docs.python.org/3/library/csv.html">csv library documentation</a>.
What is the delimiter in your csv files? Open each of the two .csv files that you downloaded using this library and be able to read in the data. Note that we really only care about the text column in this assignment.
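
As a rough sketch of this step, the snippet below reads a single text column with `csv.DictReader`. The helper name `read_text_column`, its `max_rows` parameter, and the in-memory sample rows are all illustrative assumptions; they are not part of the assignment or of either Kaggle file.

```python
import csv
import io
from itertools import islice

def read_text_column(csv_file, column, delimiter=",", max_rows=None):
    """Return the values of one column from an already opened CSV file.

    `column`, `delimiter`, and `max_rows` are illustrative parameters;
    max_rows=None reads every row.
    """
    reader = csv.DictReader(csv_file, delimiter=delimiter)
    rows = islice(reader, max_rows)  # islice(reader, None) yields all rows
    return [row[column] for row in rows]

# Tiny in-memory stand-in for a downloaded file (made-up rows).
sample = io.StringIO(
    'id,description\n'
    '1,"Ripe, juicy fruit."\n'
    '2,Crisp and dry.\n'
    '3,Soft tannins.\n'
)
texts = read_text_column(sample, "description", max_rows=2)
print(texts)
```

With a real file, the same helper would be called inside `with open(path, newline='') as f:`. Limiting `max_rows` is one way to experiment on a fraction of a large corpus before running on all of it.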

In [6]:
# PYTHON CODE HERE
#wine = '/Users/anjuchopra/Downloads/Wine/Testing2.csv'
wine = '/Users/anjuchopra/Downloads/winemag-data-130k-v2.csv'
debate = '/Users/anjuchopra/Downloads/debate_transcripts_v3_2020-02-26.csv'

descriptions = []
sentTokens = []
wordTokens = []

debateDescriptions = []
debateSentTokens = []
debateWordTokens = []

### Task #3

<u>Perform sentence segmentation and word tokenization.</u>

Utilize the nltk module to perform sentence segmentation and word tokenization. But at this point, there are a few decisions that need to be made:

* How should we handle the .csv rows from the previous step? If we ignore row markers and "lump everything together", how will that affect our language model?
* Do we want to remove punctuation? What is the effect of keeping punctuation in the model?
* Do we want to add sentence boundary markers, such as <samp>&lt;S&gt;</samp> and <samp>&lt;/S&gt;</samp>?
* Should the two words <samp>The</samp> and <samp>the</samp> be treated as the same? What are the effects of doing, or not doing, this?
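
To make these choices concrete, here is a minimal sketch that exposes two of them as switches. The function `prepare_sentence` and its flags are hypothetical, and the regular expression is only a simple stand-in for nltk's `word_tokenize` so that the example runs without downloading tokenizer data.

```python
import re

def prepare_sentence(sentence, lowercase=True, keep_punct=True):
    """Tokenize one sentence and wrap it in <S> ... </S> boundary markers.

    The regex is a simplified stand-in for nltk.word_tokenize.
    """
    # Words only, or words plus each punctuation character as its own token
    pattern = r"\w+|[^\w\s]" if keep_punct else r"\w+"
    tokens = re.findall(pattern, sentence)
    if lowercase:
        # Merge "The" and "the" into a single vocabulary entry
        tokens = [t.lower() for t in tokens]
    return ["<S>"] + tokens + ["</S>"]

print(prepare_sentence("The wine is dry, the finish long."))
print(prepare_sentence("The wine is dry, the finish long.",
                       lowercase=False, keep_punct=False))
```

Comparing the two printed token lists shows how each decision changes the vocabulary the model will see.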

In [7]:
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

In [8]:
with open(wine, newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        currDesc = row['description']
        descriptions.append(currDesc)

        sentTokens.append(sent_tokenize(currDesc))

        currWord = word_tokenize(currDesc)
        # Keep alphanumeric tokens plus '.' and ','; collapse every number to the placeholder '0.0'
        currWord = ['0.0' if is_number(x) else x for x in currWord if (x.isalnum() or x == '.' or x == ',')]

        wordTokens.append(currWord)

FileNotFoundError: [Errno 2] No such file or directory: '/Users/anjuchopra/Downloads/winemag-data-130k-v2.csv'

In [None]:
with open(debate, encoding='mac_roman', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        currDesc = row['speech']
        debateDescriptions.append(currDesc)        
        
        debateSentTokens.append(sent_tokenize(currDesc))
        
        currWord = word_tokenize(currDesc)
        # Keep alphanumeric tokens plus '.' and ','; collapse every number to the placeholder '0.0'
        currWord = ['0.0' if is_number(x) else x for x in currWord if (x.isalnum() or x == '.' or x == ',')]
        debateWordTokens.append(currWord)

### Task #4

<u>Calculate N-gram counts and compute probabilities.</u>

Use a python dictionary (or any suitable data structure) to first compute unigram counts. Then try bigram counts. Finally, trigram counts.

How much memory are you using? How fast, or slow, is the code -- how long is this step taking? If it is taking too long, try only using a fraction of your corpus: instead of loading the entire .csv file, try only reading the first 1000 rows of data.

Using those counts, compute the probabilities for the unigrams, bigrams, and trigrams, and store those in a new python dictionary (or some other data structure).
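
One possible way to turn counts into probabilities is sketched below. It uses the maximum-likelihood estimate: the count of an n-gram divided by the total count of all n-grams that share its context (the first n-1 words). The names `ngram_counts` and `conditional_probs` and the two toy sentences are assumptions made for illustration.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams (stored as tuples) inside each tokenized sentence."""
    counts = Counter()
    for tokens in sentences:
        # Stop early enough that every slice holds exactly n tokens
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def conditional_probs(counts):
    """Estimate P(last word | preceding words) from n-gram counts."""
    context_totals = Counter()
    for gram, c in counts.items():
        context_totals[gram[:-1]] += c
    # For unigrams the context is the empty tuple, so this is count / total tokens
    return {gram: c / context_totals[gram[:-1]] for gram, c in counts.items()}

# Made-up toy corpus: one token list per sentence
sentences = [
    ["the", "wine", "is", "dry", "."],
    ["the", "finish", "is", "long", "."],
]
unigram_probs = conditional_probs(ngram_counts(sentences, 1))
bigram_probs = conditional_probs(ngram_counts(sentences, 2))
print(unigram_probs[("the",)])
print(bigram_probs[("the", "wine")])
```

Storing n-grams as tuples rather than space-joined strings is a design choice of this sketch; it makes the context `gram[:-1]` easy to extract.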

In [None]:
def generate_ngrams(words_list, n):
    ngrams_list = []

    # Stop n - 1 positions before the end so every n-gram holds exactly n words
    for num in range(len(words_list) - n + 1):
        ngram = ' '.join(words_list[num:num + n])
        ngrams_list.append(ngram)

    return ngrams_list

In [None]:
unigrams = {}
bigrams = {}
trigrams = {}

unigramsTemp = []
bigramsTemp = []
trigramsTemp = []

for i in wordTokens:
    unigramsTemp.append(generate_ngrams(i, 1))
    bigramsTemp.append(generate_ngrams(i, 2))
    trigramsTemp.append(generate_ngrams(i, 3))
#print (unigramsTemp)

for c in unigramsTemp:
    for i in c:
        if i in unigrams:
            unigrams[i] = unigrams[i] + 1
        else:
            unigrams[i] = 1

for c in bigramsTemp:
    for i in c:
        if i in bigrams:
            bigrams[i] = bigrams[i] + 1
        else:
            bigrams[i] = 1
            
for c in trigramsTemp:
    for i in c:
        if i in trigrams:
            trigrams[i] = trigrams[i] + 1
        else:
            trigrams[i] = 1
            

unigramsList = sorted(unigrams.items(), key=lambda x: x[1], reverse=True)
bigramsList = sorted(bigrams.items(), key=lambda x: x[1], reverse=True)
trigramsList = sorted(trigrams.items(), key=lambda x: x[1], reverse=True)

a = 0

In [None]:
debateUnigrams = {}
debateBigrams = {}
debateTrigrams = {}

unigramsTemp = []
bigramsTemp = []
trigramsTemp = []

for i in debateWordTokens:
    unigramsTemp.append(generate_ngrams(i, 1))
    bigramsTemp.append(generate_ngrams(i, 2))
    trigramsTemp.append(generate_ngrams(i, 3))
#print (unigramsTemp)

for c in unigramsTemp:
    for i in c:
        if i in debateUnigrams:
            debateUnigrams[i] = debateUnigrams[i] + 1
        else:
            debateUnigrams[i] = 1

for c in bigramsTemp:
    for i in c:
        if i in debateBigrams:
            debateBigrams[i] = debateBigrams[i] + 1
        else:
            debateBigrams[i] = 1

for c in trigramsTemp:
    for i in c:
        if i in debateTrigrams:
            debateTrigrams[i] = debateTrigrams[i] + 1
        else:
            debateTrigrams[i] = 1
            

debateUnigramsList = sorted(debateUnigrams.items(), key=lambda x: x[1], reverse=True)
debateBigramsList = sorted(debateBigrams.items(), key=lambda x: x[1], reverse=True)
debateTrigramsList = sorted(debateTrigrams.items(), key=lambda x: x[1], reverse=True)

a = 0

In [None]:
def generateNonStarter(base):
    # Sample the word that follows `base`, weighted by the wine bigram counts
    totWords = 0
    possibleWords = []
    prefix = base + ' '

    for i in bigramsList:
        if i[0].startswith(prefix):
            # Keep a running total so each candidate owns a slice of [0, totWords)
            totWords += i[1]
            possibleWords.append((i[0][len(prefix):], totWords))

    # No bigram starts with `base`: return None so the caller can pick a new starter
    if totWords == 0:
        return None

    # A random index in [0, totWords) falls inside exactly one candidate's slice
    rndIndex = random.randrange(totWords)
    for i in possibleWords:
        if rndIndex < i[1]:
            return i[0]

In [None]:
def generateNonStarterTri(base):
    # Sample the word that follows the two-word context `base`, weighted by the wine trigram counts
    totWords = 0
    possibleWords = []

    for i in trigramsList:
        threeWords = i[0].split()
        # Skip truncated entries that hold fewer than three words
        if len(threeWords) != 3:
            continue
        if threeWords[0] + " " + threeWords[1] == base:
            # Keep a running total so each candidate owns a slice of [0, totWords)
            totWords += i[1]
            possibleWords.append((threeWords[2], totWords))

    # Unseen context: back off to the bigram model using the most recent word
    if totWords == 0:
        return generateNonStarter(base.split()[-1])

    rndIndex = random.randrange(totWords)
    for i in possibleWords:
        if rndIndex < i[1]:
            return i[0]

In [None]:
def generateDebateNonStarter(base):
    # Sample the word that follows `base`, weighted by the debate bigram counts
    totWords = 0
    possibleWords = []
    prefix = base + ' '

    for i in debateBigramsList:
        if i[0].startswith(prefix):
            # Keep a running total so each candidate owns a slice of [0, totWords)
            totWords += i[1]
            possibleWords.append((i[0][len(prefix):], totWords))

    # No bigram starts with `base`: return None so the caller can pick a new starter
    if totWords == 0:
        return None

    # A random index in [0, totWords) falls inside exactly one candidate's slice
    rndIndex = random.randrange(totWords)
    for i in possibleWords:
        if rndIndex < i[1]:
            return i[0]

In [None]:
def generateDebateNonStarterTri(base):
    # Sample the word that follows the two-word context `base`, weighted by the debate trigram counts
    totWords = 0
    possibleWords = []

    for i in debateTrigramsList:
        threeWords = i[0].split()
        # Skip truncated entries that hold fewer than three words
        if len(threeWords) != 3:
            continue
        if threeWords[0] + " " + threeWords[1] == base:
            totWords += i[1]
            possibleWords.append((threeWords[2], totWords))

    # Unseen context: return None so the caller can pick a new starter
    if totWords == 0:
        return None

    rndIndex = random.randrange(totWords)
    for i in possibleWords:
        if rndIndex < i[1]:
            return i[0]

### Task #5

<u>Compare the statistics of the corpora.</u>
                        
Use the results of those calculations that you just made the poor computer painstakingly compute. What are the differences in the most common unigrams between the two language models? Are there interesting differences between the bigram models or trigram models?

Be able to sort the n-grams to output the top k with the highest count or probability.
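
A small sketch of a top-k helper follows. It assumes the n-grams live in a dictionary that maps each n-gram to a count or a probability; the name `top_k` and the sample counts are made up for illustration. `heapq.nlargest` avoids sorting the whole dictionary when k is small.

```python
import heapq

def top_k(scores, k):
    """Return the k (ngram, score) pairs with the highest score."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Made-up bigram counts
bigram_counts = {"of the": 12, "the wine": 7, "and a": 3, "on the": 9}
print(top_k(bigram_counts, 2))
```

The same helper works unchanged on a dictionary of probabilities, since it only compares the values.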

In [None]:
print("Wine reviews: ",bigramsList)

In [None]:
print("Debate: ",debateBigramsList)

### Task #6

<u>Generate random sentences from the N-grams models for both datasets.</u>
                        
We briefly talked about this idea in class. It's also introduced at a high-level in J&M 4.3. How can a random number in the range [0,1] probabilistically generate a word using your model?
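
One common answer, sketched here under the assumption that the model is a dictionary from word to probability, is to lay the words end to end on the interval [0, 1], give each a segment as wide as its probability, and return the word whose segment contains the random number. The function `sample_word` and the toy distribution are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def sample_word(probs, r=None):
    """Pick a word from {word: probability} using a number r in [0, 1]."""
    words = list(probs)
    # Running sums mark the right edge of each word's segment
    cumulative = list(accumulate(probs[w] for w in words))
    if r is None:
        r = random.random()
    # Scale r by the total mass in case the probabilities do not sum to exactly 1
    index = bisect_right(cumulative, r * cumulative[-1])
    # Clamp so that r == 1.0 maps to the last word
    return words[min(index, len(words) - 1)]

# Made-up distribution over the words that may follow "the"
next_word_probs = {"wine": 0.5, "finish": 0.25, "nose": 0.25}
print(sample_word(next_word_probs, 0.1))   # falls in the "wine" segment [0, 0.5)
print(sample_word(next_word_probs))        # random draw
```

Words with larger probabilities own wider segments, so they are chosen proportionally more often over many draws.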

In [None]:
def generateWords(currWord):
    sentence = ""
    while not sentence.endswith('.'):
        sentence = addWord(sentence, currWord)
        currWord = generateNonStarter(currWord)
        if currWord is None:
            # Dead end: restart from a word that follows a sentence-final period
            currWord = getStarter() or "where"
    return sentence

In [None]:
def generateWordsSize(currWord, length):
    sentence = ""
    for i in range(length):
        sentence = addWord(sentence, currWord)
        currWord = generateNonStarter(currWord)
        if currWord is None:
            # Dead end: restart from a word that follows a sentence-final period
            currWord = getStarter() or "where"
    return sentence

In [None]:
def generateWordsTri(lastWord, currWord):
    sentence = lastWord
    while not sentence.endswith('.'):
        sentence = addWord(sentence, currWord)
        temp = currWord
        currWord = generateNonStarterTri(lastWord + " " + currWord)
        lastWord = temp
        if currWord is None:
            # Dead end: restart from a word that follows a sentence-final period
            currWord = getStarter() or "where"
    return sentence

In [None]:
def generateDebateWords(currWord):
    sentence = ""
    while not sentence.endswith('.'):
        sentence = addWord(sentence, currWord)
        currWord = generateDebateNonStarter(currWord)
        if currWord is None:
            # Dead end: restart from a word that follows a sentence-final period
            currWord = getDebateStarter() or "where"
    return sentence

In [None]:
def generateDebateWordsSize(currWord, length):
    sentence = ""
    for i in range(length):
        sentence = addWord(sentence, currWord)
        currWord = generateDebateNonStarter(currWord)
        if currWord is None:
            # Dead end: restart from a word that follows a sentence-final period
            currWord = getDebateStarter() or "where"
    return sentence

In [None]:
def generateDebateWordsTri(lastWord, currWord):
    sentence = lastWord
    while not sentence.endswith('.'):
        sentence = addWord(sentence, currWord)
        temp = currWord
        currWord = generateDebateNonStarterTri(lastWord + " " + currWord)
        lastWord = temp
        if currWord is None:
            # Dead end: restart from a debate word that follows a sentence-final period
            currWord = getDebateStarter() or "where"
    return sentence

In [None]:
def addWord(base, word):
    # Numbers were normalized to "0.0" during tokenization; emit a random number from 0 to 9999 instead
    if word == "0.0":
        word = str(random.randrange(0, 10000))
    if word == 'i':
        word = 'I'
    # No space at the start of a sentence or before '.' and ','
    needsSpace = base != "" and not word.endswith(('.', ','))
    return base + (" " if needsSpace else "") + word

In [None]:
def getDebateStarter():
    return generateDebateNonStarter('.')

def getStarter():
    return generateNonStarter('.')

In [None]:
randomStarter = getStarter()

def biWine():
    sentence = generateWords(getStarter())
    return(sentence)

def triWine():
    first = getStarter()
    # A trigram needs two words of context, so pick the second word with the bigram model
    second = generateNonStarter(first)
    if second in (None, '.', ','):
        second = getStarter()
    return generateWordsTri(first, second)


def biDebate():
    debateSentence = generateDebateWords((getDebateStarter()))
    return(debateSentence)

def triDebate():
    first = getDebateStarter()
    # A trigram needs two words of context, so pick the second word with the bigram model
    second = generateDebateNonStarter(first)
    if second in (None, '.', ','):
        second = getDebateStarter()
    return generateDebateWordsTri(first, second)

### Report

Write a technical report (in this Jupyter Notebook, with good Markdown formatting) that documents your findings, "lessons learned", any areas where you ran into difficulty, and any other interesting details. Include in your report the following details:

1. Names of the datasets used.
1. Does your model use all of the data in the .csv file or only a subset of it (i.e. first 1,000 rows)?
1. What is the vocabulary and size of each dataset?
1. How did you handle the merging of separate rows in a .csv file? How did you handle sentence segmentation with sentence boundary markers? Also report on any other decisions made in step #3.
1. How long did it take your program to build these models? Do you have any statistics on memory/RAM usage?
1. Output the top 15 unigrams, bigrams, trigrams for each model. Are there any interesting differences?
1. Output 3 different randomly generated sentences for each unigram, bigram, trigram model. How did you know where the randomly generated sentence ended?

Also submit this python notebook `.ipynb` to D2L.

## 1. Names of the datasets used.

The datasets I used were "Wine Reviews" (included in the instructions) and the 2020 Democratic Debate Transcripts, which can be found here:
https://www.kaggle.com/brandenciranni/democratic-debate-transcripts-2020

I used the former since it was in the instructions, but I decided to go with something a little different for the second dataset. I settled on the 2020 Democratic Debate Transcripts since I thought it would be interesting to generate sentences based on politicians' speaking patterns.

## 2. Does your model use all of the data in the .csv file or only a subset of it (i.e. first 1,000 rows)?

Both of my models use all of the data in their .csv files.

## 3. What is the vocabulary and size of each dataset?

In [None]:
print("Wine reviews set size:",len(unigrams))
print("Debate set size:",len(debateUnigrams))

The size of the vocabulary for the wine reviews is **35186**.

The size of the vocabulary for the debate transcripts is **10063**.

## 4. How did you handle the merging of separate rows in a .csv file? How did you handle sentence segmentation with sentence boundary markers? Also report on any other decisions made in step #3.

For both datasets I stored each row as a separate item in a list, then used a nested for loop to combine the n-grams from all rows into the unigram, bigram, and trigram counts. Because n-grams are built within a single row and never span two rows, I did not add sentence boundary markers; within a row, the `.` token acts as the boundary. I kept the punctuation marks `,` and `.` as separate words, since they often change the meaning of a sentence. All other punctuation is discarded. Additionally, capitalized and lowercase words are treated as different words, so *The* and *the* are distinct, since they usually appear in different contexts. All numbers were replaced with the placeholder `0.0`, so the n-gram models only record that some number precedes or follows a word. Just for a bit of extra fun, I generate a random number from 0-9999 whenever a number is output.

## 5. How long did it take your program to build these models? Do you have any statistics on memory/RAM usage?

Unfortunately I don't know how to get any statistics on memory/RAM usage. It usually takes about 80-120 seconds to load the data sets.
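
One way such figures could be gathered, as a rough sketch, is with the standard library's `time.perf_counter` and `tracemalloc`. The helper `measure` and the made-up token list are illustrative assumptions. Note that `tracemalloc` only tracks memory allocated by Python while tracing is active, and it slows the code down, so the timing here is an upper bound.

```python
import time
import tracemalloc
from collections import Counter

def measure(build, *args):
    """Run build(*args); return (result, seconds elapsed, peak bytes allocated)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = build(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # returns (current, peak) in bytes
    tracemalloc.stop()
    return result, elapsed, peak

# Made-up corpus standing in for the tokenized rows
tokens = ["the", "wine", "is", "dry", "."] * 10_000
counts, seconds, peak_bytes = measure(Counter, tokens)
print(f"{seconds:.3f} s, peak {peak_bytes / 1e6:.2f} MB")
```

Wrapping the cell that builds the n-gram dictionaries in a function and passing it to `measure` would give comparable numbers for the real models.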

## 6. Output the top 15 unigrams, bigrams, trigrams for each model. Are there any interesting differences?

In [None]:
print("Wine unigrams:", unigramsList[:15])
print("Wine bigrams:", bigramsList[:15])
print("Wine trigrams:", trigramsList[:15])

print("\n\nDebate unigrams:", debateUnigramsList[:15])
print("Debate bigrams:", debateBigramsList[:15])
print("Debate trigrams:", debateTrigramsList[:15])

The unigrams, bigrams, and trigrams are all (as expected) very different. While the wine n-grams are very focused on smells, aromas, and descriptions, the debate n-grams are all person and action focused, with words like I and we being much more heavily emphasized.

## 7. Output 3 different randomly generated sentences for each unigram, bigram, trigram model. How did you know where the randomly generated sentence ended?

In [None]:
print("***Bigram with Wine Review dataset***\n")
for i in range(3):
    print(biWine())

print("\n***Trigram with Wine Review dataset***\n")
for i in range(3):
    print(triWine())

print("\n***Bigram with Debate dataset***\n")
for i in range(3):
    print(biDebate())
    
print("\n***Trigram with Debate dataset***\n")
for i in range(3):
    print(triDebate())
    

The randomly generated sentences end when they reach a `.` character.
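
The stopping rule can be illustrated with a small self-contained sketch. It assumes a bigram table that maps each word to the counts of the words seen after it, and it adds a length cap (an assumption of this sketch, not something the code above does) so that a table with no path to `.` cannot loop forever. The name `generate_sentence` and the toy table are hypothetical.

```python
import random

def generate_sentence(successors, start, max_words=30, rng=random):
    """Walk a bigram table until '.' is produced or max_words is reached.

    `successors` maps a word to {next_word: count}.
    """
    words = [start]
    while words[-1] != "." and len(words) < max_words:
        options = successors.get(words[-1])
        if not options:  # dead end: no observed continuation
            break
        # Weighted draw: frequent continuations are chosen more often
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        words.append(nxt)
    return words

# Made-up bigram table
successors = {
    "the": {"wine": 2, "finish": 1},
    "wine": {"is": 1},
    "finish": {"is": 1},
    "is": {"dry": 1, "long": 1},
    "dry": {".": 1},
    "long": {".": 1},
}
print(" ".join(generate_sentence(successors, "the")))
```

Passing `rng=random.Random(0)` (or any other seed) makes the output reproducible, which is handy when comparing sentences across runs.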