# Project 1: Language Modeling and Fake Review Classification

Names: Lusca Robinson, Kyrus Mama

Netids: nar73, krm74


**After you make your own copy, please rename this notebook by clicking on it's name in the upper left corner.** It should be named: CS4740_FA21_p1_netid1_netid2

Don't forget to share your newly copied notebook with your partner!

**Reminder: both of you can't work in this notebook at the same time from different computers/browser windows because of sync issues. We even suggest to close the tab with this notebook when you are not working on it so your partner doesn't get sync issues.**

---



## Introduction
In this project we will build an **n-gram-based language model** for deceptive review classification. We will also investigate a feature-based **Naive Bayes model**. The task we are faced with is to **decide whether a hotel review is deceptive or truthful**. This is a relavent problem as websites that contain consumer reviews are a target of opinion spam. Typically, these deceptive opinions are neither easily ignored nor even identifiable by a human reader so we'd like to assist in flagging reviews. The dataset we are investigating looks at *deceptive opinion spam*, that is decetive opinions that have been purposely written to sound genuine ([Ott et al](https://arxiv.org/pdf/1107.4557.pdf)).

To help us approach this problem, we will use NLP techniques covered thus far to frame this as a (supervised) binary classification task, where each opinion will have a label $y \in \{0,1\}$, where *0 indicates a truthful review* and *1 indicates a deceptive one*. You will train and validate your two different models and then run them on a test data set with hidden $y$ labels. You will then submit the results on the test data set to Kaggle to participate in our class-wide competition!

The project is divided into six parts:
1. Dataset loading and preprocessing
2. Unsmoothed n-gram language model (LM): build the unsmoothed n-gram language model using our Fake Review corpus. 
3. Smoothed n-gram language model: build a smoothed version of the model from part 2.
4. Perplexity: compute perplexity for both the unsmoothed and smoothed model
5. Putting everything together and submitting the first model to Kaggle
6. Naive Bayes: build a feature-based Naive Bayes model to perform the same classification task. Compare the LM with Naive Bayes and identify the pros and cons of each.

## Logistics (IMPORTANT!)
- You should work in **groups of 2 students**. Students in the same group will get the same grade. Thus, you should make sure that everyone in your group contributes to the project. 
- **Remember to form groups on BOTH CMS and Gradescope** or not all group members will receive grades. You can use make a post on EdStem to find a partner for this project.
- Please complete the written questions of this notebook in a clear and informative way. We have created a template document for you to answer the written questions. This document can be found [here](https://docs.google.com/document/d/11GX5vG8TeHk1F2eakOgbTFaqYl3lfvZf9fYoMUTVGMs/edit?usp=sharing). Please make a copy of this document for yourself and add your names and netids in the header and answer the written questions on it. You will need to submit this document to gradescope as well (do not forget to do this please!).
- At the end: please make sure to submit the following 3 items:
  1. PDF version of Colab notebook on Gradescope (instructions for converting to PDF are at the end).
  2. PDF version of Google Doc with written answers on Gradescope.
  3. .ipynb version of your colab notebook on CMS.

**Advice:** The written questions is where you get to show us that you understand not only what you are doing but also why and how you are doing it. So be clear, organized and concise; avoid vagueness and excess verbiage. Spend time doing error analysis for the models. This is how you understand the advantages and drawbacks of the systems you build. It's also useful to think about how the theory of n-grams/Naive Bayes bridges with the real world application we are building. Think about what you expect from these models based on your current understanding, and then see if your expectation aligns with empirical results that you'll get. 

## General Guidelines
In this project, we provide a few code snippets or starter points in case you need them. You DO NOT need to follow the structure. 

If you think you have a better idea, go for it. You can ADD, MODIFY, or DELETE any code snippets given to you.

You are expected to use functions or classes to organize your code. A portion of the grade is regarding code cleanliness / readability and applying these models in the real world means we need to collaborate with others (ie. other people should be able to read your code and run it)!

To help with debugging and testing, you should use this example from class [09-02 Thurs - lec3: N-gram models](https://edstem.org/us/courses/12801/resources) as your training corpus:

```
<s> I see what I eat and I eat what I see.
```

The test sentence you can use also comes from class:
```
I see what
```

**Let's do this** 🚀

### Dataset

You are given a **Review Corpus** on CMS, which consists of roughly the same amount of real and fake reviews.

Real review example:
```
Stayed with a group for a bachelorette party, and was disappointed. The hotel is beautiful, the staff was all rather friendly. The main problem was the room/sleeping situation. We had booked rooms with 2 queen beds several weeks before, but received an email a few days before our visit stating they were sold out (how that happens I don't know!!) so they "upgraded" us to two "suites" with a king and a pull out. First, this meant our party was split up and on different floors. Second, that meant two of us were stuck on a pull out couch. :( I'm not a picky, unreasonable person, but that was the WORST "bed" I've ever slept on! It was sunken in the middle so we literally rolled into each other unless we balanced ourselves on the very edge of the bed. Then there were the springs poking into our backs ALL night! Just awful! For the amount of money we spent I expected to be comfortable! I would not stay here again after this experience.
```

Fake review example:
```
I truly enjoyed my stay at the Omni Chicago Hotel. We stayed in a suite, which was clean and extremely nice, at a very reasonable rate. My husband and I spent quite a bit of time in the indoor pool, but personally I preferred laying out on the sundeck. Service was excellent; they were friendly and all of our needs were met promptly. I would definitely recommend this hotel to anyone looking to have a great experience in the downtown Chicago area.
```

In the dataset folder you should find 2 files, training and validation splits for both real and fake reviews.

The project will proceed generally as follows in terms of code development:
1. Write code to train unsmoothed unigram and bigram language models for an arbitrary corpus
2. Implement smoothing and unknown word handling. 
3. Implement the Perplexity calculation. 
4. Using 1, 2 and 3, together with the provided training and validation sets, develop a language-model-based approach for Fake Review Classification.
5. Apply your best language-model-based review classifier (from 4) to the
provided test set. Submit the results to the online Kaggle competition. 
6. Use any existing implementation of Naive Bayes (and the provided training and validation sets) to create an additional Naive Bayes fake review classifier. Apply your best NB classifier to the provided test set. Submit the results to the separate Kaggle competition (for NB classifiers). 

We will progress towards these tasks throughout this notebook.

# Part 1: Preprocessing the Dataset
In this part, you are going to do a few things:
* Connect to the google drive where the data set is stored
* Load and read files
* Preprocess the text

------
**Please upload the dataset to each partner's individual Google Drive now.** We suggest using the same folder structure within Google Drive because the notebook is shared among you, so the code to load the data would have to be changed every time if folder structures are different. One folder structure might be: Google Drive/CS 4740/Project 1/Dataset/ or whatever works for you. See our code below for an example of how we load the data from Google Drive.

## 1.1 Connect to google drive

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## 1.2 Load and read files
First, let's install [NLTK](https://www.nltk.org/), a very widely package for NLP preprocessing (and other tasks) for Python.

In [2]:
!pip install -U nltk tqdm

Collecting nltk
  Downloading nltk-3.6.3-py3-none-any.whl (1.5 MB)
[K     |████████████████████████████████| 1.5 MB 5.2 MB/s 
Collecting tqdm
  Downloading tqdm-4.62.3-py2.py3-none-any.whl (76 kB)
[K     |████████████████████████████████| 76 kB 4.7 MB/s 
Installing collected packages: tqdm, nltk
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.62.2
    Uninstalling tqdm-4.62.2:
      Successfully uninstalled tqdm-4.62.2
  Attempting uninstall: nltk
    Found existing installation: nltk 3.2.5
    Uninstalling nltk-3.2.5:
      Successfully uninstalled nltk-3.2.5
Successfully installed nltk-3.6.3 tqdm-4.62.3


Then we read and load data.

In [26]:
import os
import csv
import io
import math
from nltk import word_tokenize, sent_tokenize
import nltk
from tqdm.notebook import tqdm
nltk.download('punkt')

root_path = os.path.join(os.getcwd(), "drive", "My Drive/CS 4740/Project 1") # replace based on your Google drive organization
dataset_path = os.path.join(root_path, "Dataset") # same here

real_review_train = []
real_review_validation = [] #loop through each word in list. Find wrd in dictionary. Compute probabilitiy of word
fake_review_train = []
fake_review_validation = []

def load_real_fake_dataset(dataset_path, filename):
    real = []
    fake = []
    with open(os.path.join(dataset_path, filename)) as fp:
        csvreader = csv.reader(fp, delimiter="|")
        for txt, label in csvreader:
            label = int(label)
            if label:
                fake.append(txt)
            else:
                real.append(txt)
    
    return real, fake

real_review_train, fake_review_train = load_real_fake_dataset(dataset_path, "P1_real_fake_review_train.txt")

real_review_validation, fake_review_validation = load_real_fake_dataset(dataset_path, "P1_real_fake_review_val.txt")


def tokenize_reviews(reviews):
    return [
        [
            word.lower() for sent in sent_tokenize(review)
            for word in word_tokenize(sent)
        ]
        for review in tqdm(reviews, leave=False)
    ]

tokenized_real_review_training = tokenize_reviews(real_review_train)
tokenized_fake_review_training = tokenize_reviews(fake_review_train)
tokenized_real_review_validation = tokenize_reviews(real_review_validation)
tokenized_fake_review_validation = tokenize_reviews(fake_review_validation)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


  0%|          | 0/642 [00:00<?, ?it/s]

  0%|          | 0/638 [00:00<?, ?it/s]

  0%|          | 0/80 [00:00<?, ?it/s]

  0%|          | 0/80 [00:00<?, ?it/s]

Sanity checks for our real and fake training sets

In [None]:
tokenized_real_review_training[0]

In [None]:
tokenized_fake_review_training[0]

## 1.3 Data Preprocessing & Preparation

There's a well-known parable in machine learning that 80% of the work is all about data preparation, 10% is supporting infrastructure and 10% is actual modeling. If your "raw" dataset is not preprocessed and prepared in a way to maximize its value, then your model will be more like this: https://xkcd.com/1838/. For this project, modeling is the star of the show for learning purposes, but we still want you to pay attention to the preprocessing stage.

*We've already tokenized and lowercased* the raw data for you. We have not added a start of sentence token but feel free to do so (it is not neccessary). Here are a few extra things you might want to do:

- Think about edge cases. For example, you don't want to accidentally append a period to the last word of a sentence. 
- Watch out for apostrophes and other tricky things like quotations, they cause lots of edge cases. For example, "they're" can be all one token, or two tokens ("they", "'re") or even three tokens ("they", " ' ", "re"). 

Why did we lowercase all tokens? Because the computer will otherwise consider "The" and "the" as two separate words and this will cause problems.

Note that you may use existing
tools just for the purpose of preprocessing. 

Advice: don't get bugged down in the dozens of preprocessing packages and suggestions that you can find on Towards Data Science or Stack Overflow. Start with this [NLTK tutorial](https://lost-contact.mit.edu/afs/cs.pitt.edu/projects/nltk/docs/tutorial/introduction/nochunks.html#:~:text=The%20Natural%20Language%20Toolkit%20(NLTK,tokenization%2C%20tagging%2C%20and%20parsing.) and that should be plenty.

In [27]:
# TODO: preprocessing
def preprocessing(processList):
  for reviewEntry in processList:
    i=0
    while i < len(reviewEntry):
      if reviewEntry[i] == "n't":
        reviewEntry[i-1] = reviewEntry[i-1] + reviewEntry[i]
        reviewEntry.pop(i)
      elif reviewEntry[i][0] == "'":    
        reviewEntry.pop(i)
      elif "." in reviewEntry[i] and len(reviewEntry[i]) > 1 and reviewEntry[i] != "...":
        temp = reviewEntry.pop(i)
        stopLoc = temp.index(".")
        if stopLoc > 0:
          reviewEntry.insert(i, temp[0:stopLoc])
        reviewEntry.insert(i+1, ".")
        if stopLoc+1 < len(temp):
          reviewEntry.insert(i+2, temp[stopLoc+1:])
      elif reviewEntry[i] in [".", "?", "!", "..."]:
           reviewEntry.insert(i+1, "SOS")
           i += 2
      elif i == 0:
        reviewEntry.insert(i,"SOS") 
        i += 1
      elif reviewEntry[i].isnumeric():
        reviewEntry[i] = "NUMBER"
      # elif re.match("^[,^%$#@*()[]{}\/]+$", reviewEntry[i])
      else:
        i+=1

preprocessing(tokenized_real_review_training)
preprocessing(tokenized_fake_review_training)
preprocessing(tokenized_real_review_validation)
preprocessing(tokenized_fake_review_validation)



**Q1.1: Show some observations or statistics from the dataset** (should be quantitative – i.e. most frequent words, most frequent bigram, etc.) You may do the computations for your graphs/statistics on the colab notebook, however, please mmake sure you transfer all your work (statistics, graphs, snapshots of thh code if needed) to the Google Doc!

Please answer on your writeup doc!

In [76]:
# TODO: observations/statistics
word_count = {}
bigram_count = {}

def include_words(processList, word_count_dict={}):
  for reviewEntry in processList:
      for word in reviewEntry:
        if word in word_count_dict:
          word_count_dict[word] += 1
        else:
          word_count_dict[word] = 1

def include_bigrams(processList, word_count_dict={}):
  for reviewEntry in processList:
      for i in range(len(reviewEntry)-1):
        key = (reviewEntry[i], reviewEntry[i+1])
        if key in word_count_dict:
          word_count_dict[key] += 1
        else:
          word_count_dict[key] = 1

include_words(tokenized_real_review_training, word_count)
include_words(tokenized_fake_review_training, word_count)

real_word_count = {}
fake_word_count = {}
include_words(tokenized_real_review_training, real_word_count)
include_words(tokenized_fake_review_training, fake_word_count)

include_bigrams(tokenized_real_review_training, bigram_count)
include_bigrams(tokenized_fake_review_training, bigram_count)

real_bigram_count = {}
fake_bigram_count = {}
include_bigrams(tokenized_real_review_training, real_bigram_count)
include_bigrams(tokenized_fake_review_training, fake_bigram_count)

all_unigrams = set(real_word_count.keys()) | set(fake_word_count.keys())

all_bigrams = set(real_bigram_count.keys()) | set(fake_bigram_count.keys())

# tokens = word_count.keys()
# freq = [word_count[key] for key in tokens]

# tog = zip(freq, tokens)
# sor = sorted(tog)
# print(sor)

# tokens = bigram_count.keys()
# freq = [bigram_count[key] for key in tokens]

# tog = zip(freq, tokens)
# sor = sorted(tog)
# print(sor)

**Please answer the following question**:

**Q1.2: What did you do in your preprocessing part?**

Example answer format:

A: We tokenized and lowercased all the words.

Please answer on your writeup doc!

# Part 2: Compute Unsmoothed Language Models.

To start, you will write a program that computes unsmoothed unigram and bigram probabilities. You should consider real and deceptive reviews as separate corpora and
generate a separate language model for each set of reviews.
We have already loaded the data and (partially) preprocessed it and you probably did some of your own preprocessing. 

Note that you were allowed to use existing
tools for the purpose of preprocessing, but you must write the code for gathering n-gram counts and computing n-gram probabilities yourself. 

For example, consider the
simple corpus consisting of the sole sentence:


> the students liked the class

Part of what your program would compute for a unigram and bigram model, for example,
would be the following:


> $P("the") = 0.4; P("liked") = 0.2; P("the"|"liked") = 1.0; P("students"|"the") = 0.5$

Remember to add a symbol to mark the beginning of sentence. See Sept. 2nd lecture, p25-28 for an example.




**Advice**: jupyter notebooks (including colab) can be a double-edged sword. It's amazing and liberating to just start writing code and run it by simply running a cell. However, it gets messy very quickly. So, once you're done prototyping, you should be using functions (classes may be unnecessary but go for it if you want) to make things cleaner and easier to debug.

## 2.1 Unsmoothed Uni-gram Model.

In this part of the project, you are trying to compute the probabilities for a unigram model. You might want to take in a list of words, and return the probabilities for each
occurence. Think of an efficient data structure to use here given what ratio of reads and puts you expect.

Please look at the example above and consider how we get the probabilities.

Below is a starter point you can go from, but you DO NOT need to stick it. Feel free to use your own design.

In [77]:
def uu_lprob(word, word_count, total_num):
  return math.log(word_count[word] / total_num) if word in word_count else -1000000000

"""
Reference code for start. You do not need to follow this.
Function [unsmoothed_unigram] computes the probabilities for a unigram model
lst: a list of words in a sentence
Return: [data structure of your choice] that stores the result
"""
def unsmoothed_unigram(lst):
  # TODO
  tokens = real_word_count.keys()
  freq = [real_word_count[key] for key in tokens]
  real_total_num = sum(freq)

  tokens = fake_word_count.keys()
  freq = [fake_word_count[key] for key in tokens]
  fake_total_num = sum(freq)

  real_prob = 0.0
  fake_prob = 0.0
  for word in lst:
    real_prob += uu_lprob(word, real_word_count, real_total_num)
    fake_prob += uu_lprob(word, fake_word_count, fake_total_num)
  return real_prob < fake_prob, real_prob, fake_prob

# for review in tokenized_real_review_validation:
#   print(unsmoothed_unigram(review))
# print(unsmoothed_unigram(["SOS", "the", "hotel", "is", "good", ".", "SOS"]))

## 2.2 Unsmoothed Bi-gram Model.

In this part of the project, you are trying to compute the probabilities for a bigram model. You can approach this with similar methods as above.

Remember the definition:
$p(w_n\mid w_{n-1})=\frac{C(w_{n-1}w_n)}{C(w_{n-1})}$ this means you might want to store two things (count of $w_{n-1}$ and count of $w_{n-1}w_n$).

In [78]:
# TODO: Add code for bigram probability calculation. 
def ub_lprob(bigram, bigram_count, word_count):
  prev = bigram[0]
  return math.log(bigram_count[bigram] / word_count[prev]) \
    if bigram in bigram_count else -1000000000
    
"""
Reference code for start. You do not need to follow this.
Function [unsmoothed_unigram] computes the probabilities for a unigram model
lst: a list of words in a sentence
Return: [data structure of your choice] that stores the result
"""
def unsmoothed_bigram(lst):
  # TODO
  real_prob = 0.0
  fake_prob = 0.0
  for i in range(len(lst)-1):
    bigram = (lst[i], lst[i+1])
    real_prob += ub_lprob(bigram, real_bigram_count, real_word_count)
    fake_prob += ub_lprob(bigram, fake_bigram_count, fake_word_count)
  return real_prob < fake_prob, real_prob, fake_prob

# for review in tokenized_real_review_validation:
#   print(unsmoothed_bigram(review))
# print(unsmoothed_bigram(["SOS", "the", "hotel", "is", "good", ".", "SOS"]))

**Please answer the following question**:

**Q2: What data structure are you using to store probabilities for unigrams and bigrams? Why did you select this data structure?**

Please answer on your writeup doc!

# Part 3: Smoothed Language Model
In this part, you will need to implement **at least one** smoothing method and **at least one** method to handle unknown words in the test data. You can choose any method(s) that you want for each. You should make clear
**what method(s)** were selected and **why**, providing a description for any non-standard approach (e.g., an approach that was not covered in class or in the readings). 

You should use the
provided validation sets to experiment with different smoothing/unknown word handling
methods if you wish to see which one is more effective for this task. (We will cover this in Part 4).

## 3.1 Unknown Words Handling

**Please answer the following questions:**

**Q3.1: How are you going to handle unknown words? What parameters might be needed? Do you need a method to determine the value?**

Please answer on your writeup doc!


In [None]:
# TODO: Add your unknown word handling code 

## 3.2 Smoothing

In this part of project, we are going to compute the probabilities for unigram and bigram models after smoothing.
There are several smoothing methods you can start with:
* add-k
* Kneser-Ney
* Good-Turing
* ...

You need to compute for both unigram and bigram models.

Below is a starter point using add-k smoothing. As always, you DO NOT need to follow it; you do need to implement add-k smoothing however feel free to implement any other smoothing methods you'd like and use those for later parts of the assignment!

In [79]:
def su_lprob(word, word_count, total_num, k, all_unigrams):
  return math.log(((word_count[word] if word in word_count else 0) + k) /
                          (total_num + k * len(all_unigrams)))


"""
Reference code for add-k smoothing on unigram model.
dic: a dictionary of your unigrams. key: words, val: occurence
k: parameter k for smoothing
Return: a dictionary of results after smoothing
"""
def add_k_unigram(lst, k=65):
  tokens = real_word_count.keys()
  freq = [real_word_count[key] for key in tokens]
  real_total_num = sum(freq)

  tokens = fake_word_count.keys()
  freq = [fake_word_count[key] for key in tokens]
  fake_total_num = sum(freq)

  real_prob = 0.0
  fake_prob = 0.0
  for word in lst:
    real_prob += su_lprob(word, real_word_count, real_total_num, k, all_unigrams)
    fake_prob += su_lprob(word, fake_word_count, fake_total_num, k, all_unigrams)
  return real_prob < fake_prob, real_prob, fake_prob

# for review in tokenized_real_review_validation:
#   print(add_k_unigram(review))
# print(add_k_unigram(["SOS", "the", "hotel", "is", "good", ".", "SOS"]))


In [81]:
def sb_lprob(bigram, bigram_count, word_count, k, all_unigrams):
  prev = bigram[0]
  return math.log(((bigram_count[bigram] if bigram in bigram_count else 0) + k) /
                          ((word_count[prev] if prev in word_count else 0) + k * len(all_unigrams)))

"""
Reference code for add-k smoothing on bigram model.
uni_dic: a dictionary of your unigrams.
bi_dic: a dictionary of your bigrams.
k: parameter k for smoothing
Return: a dictionary of results after smoothing
"""
def add_k_bigram(lst, k=0.5):
  real_prob = 0.0
  fake_prob = 0.0
  for i in range(len(lst)-1):
    bigram = (lst[i], lst[i+1])
    real_prob += sb_lprob(bigram, real_bigram_count, real_word_count, k, all_unigrams)
    fake_prob += sb_lprob(bigram, fake_bigram_count, fake_word_count, k, all_unigrams)
  return real_prob < fake_prob, real_prob, fake_prob

# for review in tokenized_real_review_validation:
#   print(add_k_bigram(review))
# print(add_k_bigram(["SOS", "the", "hotel", "is", "good", ".", "SOS"]))

**Please answer the following question:**

**Q3.2: Which smoothing method did you choose? Are there any parameters, if so how are you planning to pick the value? If you choose to implement more than 1 method (not a requirement), please state each of them. Providing a description for any non-standard approach, e.g., an approach that was not covered in class or in the readings**

Please answer on your writeup doc!

# Part 4: Perplexity
At this point, we have developed several language models: unigram vs bigram, unsmoothed vs smoothed. We now want to compare all the models. 

Implement code to compute the perplexity of a **“development set.”** (“Development set”
is just another way to refer to the validation set—part of a dataset that is distinct from
the training portion and the test portion.) Compute and report the perplexity of each
of the language models (one trained on true reviews and fake reviews) on
the development corpora. Compute perplexity as follows:
\begin{align*}
PP &= \left(\prod_i^N\frac{1}{P\left(W_i\mid W_{i-1}, ...W_{i-n+1}\right)}\right)^{\frac{1}{N}}\\
&=\exp \frac{1}{N}\sum_{i}^N-\log P\left(W_i\mid W_{i-1}, ...W_{i-n+1}\right)
\end{align*}
where $N$ is the total number of tokens in the test corpus and $P\left(W_i\mid W_{i-1}, ...W_{i-n+1}\right)$
is the n-gram probability of your model. Under the second definition above, perplexity
is a function of the average (per-word) log probability: use this to avoid numerical
computation errors.

Please complete the following tasks and report what you have observed. Remember, lower perplexity means better model.

## Task 1: Compute perplexity for smoothed unigram and smoothed bigram. 
*Note: If you choose more than one smoothing method, pick one of them to compute. If you need to try different values of parameters, you can try them out here.*


In [82]:
# TODO: compute perplexity for one smoothing method on unigram, and one smoothing method on bigram.

tokens = real_word_count.keys()
freq = [real_word_count[key] for key in tokens]
real_total_num = sum(freq)

acc = 0
N_real = 0
for review in real_review_validation:
  for word in review:
    N_real += 1
    acc -= su_lprob(word, real_word_count, real_total_num, 65, all_unigrams)
print("real unigram ", math.exp(acc/N_real))

tokens = fake_word_count.keys()
freq = [fake_word_count[key] for key in tokens]
fake_total_num = sum(freq)

acc = 0
N_fake = 0
for review in fake_review_validation:
  for word in review:
    N_fake += 1
    acc -= su_lprob(word, fake_word_count, fake_total_num, 65, all_unigrams)
print("fake unigram ", math.exp(acc/N_fake))


acc = 0
N_real = 0
for review in real_review_validation:
  N_real += 1
  for i in range(len(review)-1):
    N_real += 1
    bigram = (review[i], review[i+1])
    acc -= sb_lprob(bigram, real_bigram_count, real_word_count, 40, all_unigrams)
print("real bigram ", math.exp(acc/N_real))

acc = 0
N_fake = 0
for review in fake_review_validation:
  N_fake += 1
  for i in range(len(review)-1):
    N_fake += 1
    bigram = (review[i], review[i+1])
    acc -= sb_lprob(bigram, fake_bigram_count, fake_word_count, 1, all_unigrams)
print("fake bigram ", math.exp(acc/N_fake))


real unigram  6327.406995350338
fake unigram  6412.516794423919
real bigram  8896.064302000623
fake bigram  9141.851943418138


**Q4.1: Why do we need to compute perplexity after smoothing?**

Please answer on your writeup doc!

**Q4.2: Did you choose any values for parameters?**

Please answer on your writeup doc!

## Task 2: Compute perplexity for other smoothing methods (BONUS 🎉). 
*Note: If you only pick one smoothing method, you can omit this task. If you need to try different values of parameters, you can try them out here.*

In [None]:
# TODO: compute perplexity for your rest of smoothing method.

**Q4.3: If your smoothing method needs to pick a parameter, what is the value of your parameter?**

Please answer on your writeup doc!

**Q4.4: Which smoothing method is the best among your choices?**

Please answer on your writeup doc!

# Part 5: Putting Everything Together and Submitting to Kaggle
Combining all the previous parts together, we have developed a bunch of language models. Before we proceed to the next step, let's check a few things (no need to answer):
* Did you train your model only on training set?
* Did you validate your model only on validation/development set?
* Did you determine all your parameters?

Finally, please answer:

**Q5: What is your choice of language model, and why?** (Hint: How do we usually choose language models? What is our selection criteria? _Look at the Sept. 9th lecture_)

Please answer on your writeup doc!



In [None]:
# TODO: anything that helps you answer/check the above points.

## Part 5.1: First Model Submission to Kaggle

Now we need to apply our model to testing data. What you need to do:
* Takes the test data as input, and generates an output of your prediction based on your chosen language model
* Your output file should be ONLY your predictions
* Submit to Kaggle

You should use your trained model to predict labels for all the reviews in `TestData.txt`. Output your predictions to a **csv** file and submit it to kaggle. Each line should contain the id of the test review and its corresponding prediction (in total 160 lines). In other words, your output should look like (**including the header**):
```
Id,Prediction
0,0
1,0
2,1
3,0
...
160,1
```
Note that you should add the header `Id,Prediction` and there is no space in the output. The Id starts from 0 (not 1).

Use this kaggle [link](https://www.kaggle.com/t/eb382e53c0cc448d9da21b3527d) to submit your output. Your team name should be the concatenation of your netids, **exactly in the same order as this notebook is named**. For example, if notebook is 4740_FA21_p1_mb2363_ssc255, then Kaggle group should be mb2363_ssc255.

You have 10 submissions **per day** so do not wait until the last minute! There is additionally a baseline score on Kaggle for you to benchmark against.


In [83]:
# TODO: Add code to generate the Kaggle output file and submit the output file to Kaggle

# real_review_test, fake_review_test = load_real_fake_dataset(dataset_path, "P1_real_fake_review_test.txt")
filename = "P1_real_fake_review_test.txt"
lst = []
with open(os.path.join(dataset_path, filename)) as fp:
        csvreader = list(csv.reader(fp, delimiter="\n"))
        csvreader = csvreader[1:]
        for txt in csvreader:
            lst.append(txt[0].split("\"")[1])

tokenized_review_test = tokenize_reviews(lst)

preprocessing(tokenized_review_test)

count = 0
print("Id,Prediction")
for review in tokenized_review_test:
  print(str(count)+","+str(int(add_k_bigram(review, k=1.0)[0])))
  count += 1


  0%|          | 0/160 [00:00<?, ?it/s]

Id,Prediction
0,1
1,1
2,1
3,1
4,0
5,1
6,0
7,1
8,1
9,1
10,0
11,0
12,1
13,0
14,1
15,1
16,1
17,0
18,1
19,1
20,0
21,0
22,1
23,1
24,1
25,0
26,0
27,0
28,1
29,1
30,1
31,0
32,0
33,1
34,0
35,1
36,1
37,1
38,0
39,1
40,1
41,0
42,0
43,1
44,1
45,1
46,0
47,0
48,1
49,1
50,1
51,0
52,0
53,1
54,1
55,0
56,1
57,1
58,0
59,0
60,0
61,0
62,1
63,0
64,1
65,1
66,1
67,1
68,0
69,0
70,1
71,1
72,0
73,1
74,0
75,0
76,1
77,1
78,0
79,0
80,0
81,0
82,0
83,1
84,0
85,1
86,0
87,0
88,0
89,0
90,1
91,0
92,1
93,0
94,1
95,0
96,1
97,1
98,1
99,1
100,0
101,1
102,1
103,1
104,1
105,1
106,1
107,0
108,0
109,1
110,1
111,1
112,1
113,0
114,1
115,1
116,1
117,0
118,0
119,1
120,1
121,1
122,0
123,1
124,1
125,1
126,1
127,0
128,1
129,1
130,1
131,0
132,0
133,0
134,1
135,1
136,1
137,0
138,1
139,0
140,1
141,0
142,0
143,1
144,0
145,1
146,1
147,1
148,1
149,0
150,0
151,0
152,0
153,0
154,1
155,0
156,0
157,0
158,0
159,1


# Part 6: Naive Bayes

The Naive Bayes classification method is based on Bayes Rule. Suppose we have a review *d* and its label *c* (either 0 or 1).
\begin{align*}
P(c|d)=\frac{P(d|c)P(c)}{P(d)}
\end{align*}
Likelihood: $P(d|c)$. In real/deception corpus, how likely *d* would appear.

Prior: $P(c)$. The probability of real/deceptive reviews in general.

Posterior: $P(c|d)$. Given *d*, how likely is it that it is real/deceptive.

Goal: $\underset{c\in \{0,1\}}{\operatorname{argmax}} P(c|d)$, which is equivalent to $\underset{c\in \{0,1\}}{\operatorname{argmax}} P(d|c)P(c)$.

The equivalence holds because $P(d)$ is the same for any $c$. Thus the denominator can be dropped.

Denote $d=\{x_1, x_2, ..., x_n\}$ where $x_i$'s are words in the reviews *d* (sometimes called features). Unlike n-gram language modelling, we make the multinomial Naive Bayes independence assumption here, where we assume positions of words do not matter. Formally, 
\begin{align*}
&\underset{c\in \{0,1\}}{\operatorname{argmax}} P(d|c)P(c)\\
=&\underset{c\in \{0,1\}}{\operatorname{argmax}} P(x_1, ..., x_n|c)P(c)\\
=&\underset{c\in \{0,1\}}{\operatorname{argmax}} P(x_1|c)P(x_2|c)...P(x_n|c)
\end{align*}

Now we only need to collect the occurences of each word for the classification. This is often called a **bag of words** feature. 

For instance, in the sentence `All for one and one for all .`, the bag of words feature would be `{"all": 2, "for": 2, "one": 2, "and": 1, ".": 1}`. Essentially, the bag of words feature is a dictionary which maps the word to its occurences. We can see that the order is not considered here.

Now, your goal is to implement the Multinomial Naive Bayes. You can use existing codes or Python packages, and adapt them to our reviews classification task.

You might find the following packages/functions useful:

* nltk.word_tokenize(), nltk.word_tokenize()
* nltk.classify.naivebayes()
* sklearn.feature_extraction.text
* sklearn.naive_bayes.MultinomialNB()

**Please answer the following question(s).**

**Q6: Comparing Multinomial Naive Bayes with the unigram language model, which one do you expect to perform better? Why?**

Please answer on your writeup doc!

## 6.1 Implementation

In [85]:
# TODO: Naive Bayes implementation 

def nb_prob(word, word_count, total_num):
  return math.log(word_count[word] / total_num) if word in word_count else 0  # skip words that aren't in the dict

def naive_bayes(lst):
  tokens = real_word_count.keys()
  freq = [real_word_count[key] for key in tokens]
  real_total_num = sum(freq)

  tokens = fake_word_count.keys()
  freq = [fake_word_count[key] for key in tokens]
  fake_total_num = sum(freq)

  real_prob = math.log(len(real_review_train) / (len(real_review_train) + len(fake_review_train)))
  fake_prob = math.log(len(fake_review_train) / (len(real_review_train) + len(fake_review_train)))
  for word in lst:
    real_prob += uu_lprob(word, real_word_count, real_total_num)
    fake_prob += uu_lprob(word, fake_word_count, fake_total_num)
  return real_prob < fake_prob, real_prob, fake_prob

# for review in tokenized_real_review_validation:
#   print(naive_bayes(review))
# print(naive_bayes(["SOS", "the", "hotel", "is", "good", ".", "SOS"]))

## 6.2 Putting Everything Together and Submitting to Kaggle

You should use your trained model to predict labels for all the reviews in `P1_real_fake_review_test.txt`. Output your predictions to a **csv** file and submit it to kaggle. The format should follow Part 6 as well.

Use the previous kaggle link to submit your output! (You are allowed multiple submissions!)

In [71]:
# TODO: Code for predicting the test labels and generating the output file. Then submit the output file to Kaggle

filename = "P1_real_fake_review_test.txt"
lst = []
with open(os.path.join(dataset_path, filename)) as fp:
        csvreader = list(csv.reader(fp, delimiter="\n"))
        csvreader = csvreader[1:]
        for txt in csvreader:
            lst.append(txt[0].split("\"")[1])

tokenized_review_test = tokenize_reviews(lst)

preprocessing(tokenized_review_test)

count = 0
print("Id,Prediction")
for review in tokenized_review_test:
  print(str(count)+","+str(int(naive_bayes(review)[0])))
  count += 1

  0%|          | 0/160 [00:00<?, ?it/s]

Id,Prediction
0,1
1,0
2,1
3,1
4,0
5,1
6,0
7,1
8,1
9,1
10,0
11,1
12,1
13,0
14,1
15,0
16,1
17,1
18,0
19,0
20,0
21,0
22,1
23,1
24,1
25,1
26,0
27,0
28,1
29,1
30,1
31,1
32,1
33,1
34,0
35,0
36,1
37,0
38,0
39,1
40,1
41,1
42,1
43,0
44,0
45,1
46,1
47,0
48,1
49,1
50,0
51,0
52,0
53,1
54,0
55,1
56,1
57,0
58,1
59,0
60,0
61,1
62,0
63,0
64,0
65,0
66,0
67,1
68,0
69,0
70,1
71,1
72,1
73,1
74,1
75,0
76,0
77,0
78,0
79,0
80,0
81,1
82,1
83,1
84,0
85,1
86,0
87,0
88,0
89,0
90,0
91,0
92,1
93,0
94,1
95,0
96,1
97,1
98,0
99,1
100,0
101,1
102,1
103,0
104,1
105,0
106,1
107,0
108,0
109,0
110,1
111,1
112,1
113,0
114,1
115,1
116,1
117,0
118,0
119,1
120,1
121,1
122,0
123,1
124,0
125,1
126,1
127,0
128,0
129,1
130,1
131,0
132,0
133,0
134,0
135,0
136,1
137,0
138,0
139,0
140,1
141,1
142,0
143,1
144,0
145,1
146,1
147,1
148,1
149,0
150,0
151,1
152,0
153,0
154,0
155,0
156,0
157,0
158,1
159,1


# Work Distribution

**Please briefly describe how you divided the work.**

Please answer on your writeup doc!


# Project Feedback [1 point]
 e on the course staff are trying our best to adapt our teaching, projects and everything else in the class to best help you learn (and hope you get as excited about NLP as we do!). We would immenselly appreciate it if you could provide us feedback (it's a super short form!!) on this project and **it's worth 1 point of your project grade**

Link to the feedback form: https://forms.gle/EfYoeggeGkr2Mb6Y7

We will use this feedback to improve both **upcoming projects** and projects for next year. 

Thank you so much!

# Submitting the Notebook

1. Go to File (upper left corner) -> Download .ipynb -> submit this downloaded file to cms
2. Run the first code block
3. Replace our placeholder for your correct Google Drive directory structure in the 2nd code block below. Run the code block
4. Put the name of this notebook into our placeholder in the 3rd code block. Run the code block
5. Then go to the folder icon on the very left panel, under the orange CO logo. Click on the folder and wait for a PDF version of your notebook to appear. Might take a few minutes.
6. Download the pdf version and submit to Gradescope

In [86]:
%%capture
!apt-get install texlive texlive-xetex texlive-latex-extra pandoc
!pip install pypandoc

In [87]:
%%capture
# the red text is a placeholder! Change it to your directory structure!
!cp 'drive/My Drive/CS 4740/Project 1/CS4740_FA21_p1_nar73_krm74.ipynb' ./ 

In [88]:
# the red text is a placeholder! Change it to the name of this notebook!
!jupyter nbconvert --to PDF "CS4740_FA21_p1_nar73_krm74.ipynb"

[NbConvertApp] Converting notebook CS4740_FA21_p1_nar73_krm74.ipynb to PDF
[NbConvertApp] Writing 92418 bytes to ./notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: [u'xelatex', u'./notebook.tex', '-quiet']
[NbConvertApp] Running bibtex 1 time: [u'bibtex', u'./notebook']
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 102908 bytes to CS4740_FA21_p1_nar73_krm74.pdf


# Once Again
Please make sure do the following:
1. Submit the PDF version of your colab notebook on Gradescope.
2. Submit a PDF version of your Google Doc with all your written questions complete on Gradescope.
3. Submit the .ipynb of your colab notebook on CMS.

# You are done! ✅