Homework 4: Sentiment Analysis - Task 2
----

Names 
----
Names: __YOUR NAMES HERE__ (Write these in every notebook you submit.)

Task 2: Train a Naive Bayes Model (30 points)
----

Using `nltk`'s `NaiveBayesClassifier` class, train a Naive Bayes classifier using a Bag of Words as features.

You will be implementing **binarized** (presence/absence of word) and **multinomial** (counts of word) BoW representations of your data.
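
As a minimal sketch of the two representations, the snippet below turns one tokenized document into both kinds of feature dictionary. The helper name `bow_features` and the example sentence are made up for illustration and are not part of the assignment code.

```python
from collections import Counter

def bow_features(tokens, binary):
    """Return a bag-of-words feature dict for one tokenized document.

    binary=True  -> 1 for every word that occurs (presence/absence)
    binary=False -> number of times each word occurs (counts)
    """
    counts = Counter(tok.lower() for tok in tokens)
    if binary:
        return {word: 1 for word in counts}
    return dict(counts)

# Hypothetical example document
example_tokens = ["Good", "movie", "really", "good", "cast"]
binarized = bow_features(example_tokens, binary=True)
multinomial = bow_features(example_tokens, binary=False)
print("binarized  :", binarized)
print("multinomial:", multinomial)
```

The only difference between the two dictionaries is the value stored for repeated words such as `good`; words that do not occur are simply absent from the dictionary, which stands in for the 0 entries.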

Learn more about Naive Bayes here: https://www.nltk.org/_modules/nltk/classify/naivebayes.html 

Naive Bayes classifiers use Bayes’ theorem for predictions. Naive Bayes can be a good baseline for NLP applications in particular. You can use it as a baseline for your project!
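
To make the prediction step concrete, here is a small sketch that trains `NaiveBayesClassifier` on a made-up four-document training set and inspects the posterior for a new feature dictionary. The toy data and variable names are assumptions for illustration only. One caveat worth knowing: NLTK treats every (feature name, value) pair as a categorical event, so a count of 2 and a count of 3 are simply two unrelated values rather than points on a numeric scale.

```python
from nltk.classify import NaiveBayesClassifier

# Tiny made-up training set of (feature dict, label) pairs.
# Labels: 1 = positive review, 0 = negative review.
toy_train = [
    ({"great": 1, "fun": 1}, 1),
    ({"great": 1, "acting": 1}, 1),
    ({"boring": 1, "acting": 1}, 0),
    ({"boring": 1, "slow": 1}, 0),
]

toy_clf = NaiveBayesClassifier.train(toy_train)

# Classify an unseen (hypothetical) review and look at the posterior,
# which combines the class prior with per-feature likelihoods via Bayes' theorem.
new_review = {"great": 1, "acting": 1}
predicted = toy_clf.classify(new_review)
posterior = toy_clf.prob_classify(new_review)
print("predicted label:", predicted)
for label in toy_clf.labels():
    print(f"P(label={label} | features) = {posterior.prob(label):.3f}")
```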


**10 points in Task 5 will be allocated for all 9 graphs (including the one generated here in Task 2 for the Naive Bayes classifier) being:**
- Legible
- Present below
- Properly labeled
     - x and y axes labeled
     - Legend for accuracy measures plotted
     - Plot Title with which model and run number the graph represents
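
The labeling checklist above could be satisfied with a plotting helper along these lines. This is only a sketch: the function name `plot_learning_curves` is hypothetical, and the numbers passed to it are placeholder values, not real results.

```python
import matplotlib.pyplot as plt

def plot_learning_curves(percents, metrics, model_name, run_id, savepath=None):
    """Plot one line per metric against the percentage of training data used."""
    fig, ax = plt.subplots()
    for metric_name, values in metrics.items():
        ax.plot(percents, values, marker="o", label=metric_name)
    ax.set_xlabel("Training data used (%)")           # x axis labeled
    ax.set_ylabel("Score on dev set")                 # y axis labeled
    ax.set_title(f"{model_name}, run {run_id}")       # model and run number
    ax.legend(title="Metric")                         # legend for the plotted measures
    if savepath is not None:
        fig.savefig(savepath)
    return fig, ax

# Placeholder values purely to exercise the function (NOT real results).
demo_percents = [10, 50, 100]
demo_metrics = {
    "precision": [0.5, 0.6, 0.7],
    "recall": [0.4, 0.6, 0.7],
    "f1": [0.45, 0.6, 0.7],
    "accuracy": [0.5, 0.6, 0.7],
}
demo_fig, demo_ax = plot_learning_curves(demo_percents, demo_metrics, "Naive Bayes", 1)
```

Keeping the labeling logic in one function makes it easier to produce consistently labeled graphs for the other models as well.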

In [1]:
# our utility functions
# RESTART your jupyter notebook kernel if you make changes to this file
import sentiment_utils as sutils

# nltk for Naive Bayes and metrics
import nltk
import nltk.classify.util
from nltk.metrics.scores import (precision, recall, f_measure, accuracy)
from nltk.classify import NaiveBayesClassifier

# some potentially helpful data structures from collections
from collections import defaultdict, Counter

# so that we can make plots
import matplotlib.pyplot as plt
# if you want to use seaborn to make plots
#import seaborn as sns

In [2]:
# define constants for the files we are using
TRAIN_FILE = "movie_reviews_train.txt"
DEV_FILE = "movie_reviews_dev.txt"

In [3]:
# load in your data and make sure you understand the format
# Do not print out too much so as to impede readability of your notebook
# train_tups = sutils.generate_tuples_from_file(TRAIN_FILE)
# dev_tups = sutils.generate_tuples_from_file(DEV_FILE)

# Load tokenized data
train_X, train_y = sutils.generate_tuples_from_file(TRAIN_FILE)
dev_X,   dev_y   = sutils.generate_tuples_from_file(DEV_FILE)

In [4]:
# set up a sentiment classifier using NLTK's NaiveBayesClassifier and 
# a bag of words as features
# take a look at the function in lecture notebook 7 (feel free to copy + paste that function)
# the nltk classifier expects a dictionary of features as input where the key is the feature name
# and the value is the feature value

# need to return a dict to work with the NLTK classifier
# Possible problem for students: evaluate the difference 
# between using binarized features and using counts (non binarized features)
# (Optional) build a vocab from training set; you may use it to filter features if desired
vocab = set(sutils.create_index(train_X, min_freq=1))

# === Assignment-style feature function ===
def word_feats(tokens, binary: bool = False, use_train_vocab: bool = True) -> dict:
    """
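    Convert one document (a list of tokens) into a feature dict usable by Naive Bayes.
    Args:
        tokens: List[str], the token sequence for this example
        binary: True -> binarized features; False -> multinomial (count) features
        use_train_vocab: if True, keep only words that appear in the training vocabulary (avoids introducing unseen words)
    Returns:
        dict[str, int] mapping feature name -> feature value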
    """
    if use_train_vocab:
        cnt = Counter(t.lower() for t in tokens if t.lower() in vocab)
    else:
        cnt = Counter(t.lower() for t in tokens)
    if binary:
        return {w: 1 for w in cnt.keys()}
    else:
        return dict(cnt)

def build_instances(X_tok, y, binary: bool, use_train_vocab: bool = True):
    feats = [word_feats(toks, binary=binary, use_train_vocab=use_train_vocab) for toks in X_tok]
    return list(zip(feats, y))

def train_eval_nb(percent: int, binary: bool, seed: int = 0, use_train_vocab: bool = True):
    # sample percent% of training data deterministically
    sub_X, sub_y = sutils.take_percent(train_X, train_y, percent, shuffle=True, seed=seed)
    train_set = build_instances(sub_X, sub_y, binary=binary, use_train_vocab=use_train_vocab)
    dev_set   = build_instances(dev_X, dev_y, binary=binary, use_train_vocab=use_train_vocab)

    # Train NB
    clf = NaiveBayesClassifier.train(train_set)

    # Predict on dev
    preds = [clf.classify(feats) for feats, _ in dev_set]
    preds = [int(p) for p in preds]

    # Metrics
    prec, rec, f1, acc = sutils.get_prfa(dev_y, preds, verbose=False)
    return prec, rec, f1, acc

def plot_runs(binary: bool, run_id: int, percents=None, save_as=None, use_train_vocab: bool = True):
    if percents is None:
        percents = [10, 20, 40, 60, 80, 100]
    title = f"Naive Bayes ({'Binarized' if binary else 'Multinomial'}) — Run {run_id}"
    curves = sutils.create_training_graph(
        metrics_fun=lambda p: train_eval_nb(p, binary=binary, seed=run_id, use_train_vocab=use_train_vocab),
        percents=percents,
        title=title,
        savepath=save_as
    )
    return curves   


# set up & train a sentiment classifier using NLTK's NaiveBayesClassifier and
# classify the first example in the dev set as an example
# make sure your output is well-labeled



# test to make sure that you can train the classifier and use it to classify a new example
if __name__ == "__main__":
    # Produce both variants and save graphs (three runs each)
    for run in (1,2,3):
        plot_runs(binary=False, run_id=run, save_as=f"Naive_Bayes_multinomial_run{run}.png")
        plot_runs(binary=True,  run_id=run, save_as=f"Naive_Bayes_binarized_run{run}.png")

    # Quick comparison: train on 100% of the training data, evaluate on the dev set
    _,_,f1_bin,_   = train_eval_nb(100, binary=True,  seed=1)
    _,_,f1_multi,_ = train_eval_nb(100, binary=False, seed=1)
    print(f"Final F1 (binarized)  : {f1_bin:.4f}")
    print(f"Final F1 (multinomial): {f1_multi:.4f}")

Final F1 (binarized)  : 0.7761
Final F1 (multinomial): 0.7813


<span style="color: red;">__Expected Behavior__ </span>

**Naive Bayes**:
Naive Bayes relies on word counts or feature frequencies to compute probabilities. Because it involves no random initialization, it is a deterministic algorithm: given the same data and preprocessing steps, it always produces identical results. So if your Naive Bayes graphs are identical across runs, this is expected and completely fine!
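
A quick way to convince yourself of this is to train twice on the same data and compare the predicted probabilities, as in the sketch below. The toy data and the helper `train_and_score` are made up for illustration. Note that the guarantee only covers identical training data: if each run samples a different random subset of the training set, the curves can still differ at percentages below 100%.

```python
import math
from nltk.classify import NaiveBayesClassifier

def train_and_score(train_set, example):
    """Train a fresh classifier and return {label: P(label | example)}."""
    clf = NaiveBayesClassifier.train(train_set)
    dist = clf.prob_classify(example)
    return {label: dist.prob(label) for label in clf.labels()}

# Made-up toy data: (feature dict, label) pairs.
same_data = [
    ({"good": 2, "plot": 1}, 1),
    ({"good": 1, "cast": 1}, 1),
    ({"bad": 1, "plot": 1}, 0),
    ({"bad": 2, "cast": 1}, 0),
]
probe = {"good": 1, "plot": 1}

run_a = train_and_score(same_data, probe)
run_b = train_and_score(same_data, probe)
identical = all(math.isclose(run_a[label], run_b[label]) for label in run_a)
print("identical across runs:", identical)
```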

<span style="color: red;">__Note on Training Data Increments__ </span>

When varying the amount of training data, choose increments that are meaningful and reasonable: you should be able to observe clear trends without making the experiment unnecessarily long. You may increment the training data percentage by **5%**, **10%**, or **20%**.

**Make sure that one of your experiments includes 10% of the training data, as you will need this result to answer a question in Task 5.**
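
One possible way to build the list of percentages while guaranteeing that 10% is always included is sketched below; the helper name `training_percents` is hypothetical.

```python
from typing import List

def training_percents(step: int) -> List[int]:
    """Return training-data percentages for the given step, always including 10."""
    if step not in (5, 10, 20):
        raise ValueError("step must be 5, 10, or 20")
    percents = set(range(step, 101, step))
    percents.add(10)  # the 10% experiment is needed for Task 5
    return sorted(percents)

print(training_percents(20))
print(training_percents(10))
```

With a step of 20 this adds the 10% point in front of the regular 20, 40, ..., 100 sequence; with steps of 5 or 10 the 10% point is already part of the sequence.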

In [None]:
# Naive Bayes performs slightly better with multinomial features (F1=0.7813 vs 0.7761).
# Using the provided dev set, evaluate your model with precision, recall, and f1 score as well as accuracy
# You may use nltk's implemented `precision`, `recall`, `f_measure`, and `accuracy` functions
# (make sure to look at the documentation for these functions!)
# you will be creating a similar graph for logistic regression and neural nets, so make sure
# you use functions wisely so that you do not have excessive repeated code
# write any helper functions you need in sentiment_utils.py (functions that you'll use in your other notebooks as well)


# create a graph of your classifier's performance on the dev set as a function of the amount of training data
# the x-axis should be the amount of training data (as a percentage of the total training data)
# NOTE : make sure one of your experiments uses 10% of the data, you will need this to answer the first question in task 5
# the y-axis should be the performance of the classifier on the dev set
# the graph should have 4 lines, one for each of precision, recall, f1, and accuracy
# the graph should have a legend, title, and axis labels



Test your model using both a __binarized__ (bag of words representation where we put 1 [true] if the word is there and 0 [false] otherwise) and a __multinomial__ (bag of words representation where we put the count of the word if the word occurs, and 0 otherwise). Use whichever one gives you a better final f1 score on the dev set to produce your graphs.

- f1 score binarized: 0.7761
- f1 score multinomial: 0.7813
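
For reference, one way to compute the scores being compared is with nltk's metric functions. The sketch below is an assumption about how such a helper could look, not necessarily how `sutils.get_prfa` is implemented. `precision`, `recall`, and `f_measure` operate on *sets* (here, the indices of examples labeled positive), while `accuracy` takes the two label lists directly. The labels used are made up.

```python
from nltk.metrics.scores import accuracy, f_measure, precision, recall

def prfa(gold, pred, positive_label=1):
    """Return (precision, recall, f1, accuracy) for the positive class.

    precision/recall/f_measure return None when the relevant set is empty,
    so callers may want to handle that case.
    """
    gold_pos = {i for i, y in enumerate(gold) if y == positive_label}
    pred_pos = {i for i, y in enumerate(pred) if y == positive_label}
    return (
        precision(gold_pos, pred_pos),
        recall(gold_pos, pred_pos),
        f_measure(gold_pos, pred_pos),
        accuracy(gold, pred),
    )

# Made-up gold and predicted labels for five dev examples.
demo_gold = [1, 0, 1, 1, 0]
demo_pred = [1, 0, 0, 1, 1]
p, r, f1, acc = prfa(demo_gold, demo_pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f} accuracy={acc:.3f}")
```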