Homework 4: Sentiment Analysis - Task 0, Task 1, Task 5 (all primarily written tasks)
----

The following instructions are only written in this notebook but apply to all notebooks and `.py` files you submit for this homework.

Due date: October 25th, 2023

Goals:
- understand the difficulties of counting and probabilities in NLP applications
- work with real world data to build a functioning language model
- stress test your model (to some extent)

Complete in groups of: __two (pairs)__. If you prefer to work on your own, you may, but be aware that this homework has been designed as a partner project.

Allowed python modules:
- `numpy`, `matplotlib`, `keras`, `pytorch`, `nltk`, `pandas`, `scikit-learn` (`sklearn`), `seaborn`, and all built-in python libraries (e.g. `math` and `string`)
- if you would like to use a library not on this list, post on Piazza to request permission
- all *necessary* imports have been included for you (all imports that we used in our solution)

Instructions:
- Complete outlined problems in this notebook. 
- When you have finished, __clear the kernel__ and __run__ your notebook "fresh" from top to bottom. Ensure that there are __no errors__. 
    - If a problem asks for you to write code that does result in an error (as in, the answer to the problem is an error), leave the code in your notebook but commented out so that running from top to bottom does not result in any errors.
- Double check that you have completed Task 0.
- Submit your work on Gradescope.
- Double check that your submission on Gradescope looks like you believe it should __and__ that all partners are included (for partner work).

6120 students: complete __all__ problems.

4120 students: you are not required to complete problems marked "CS 6120 REQUIRED". If you complete these you will not get extra credit. We will not take points off if you attempt these problems and do not succeed.

Names & Sections
----
Names: Alex Kramer

Task 0: Name, References, Reflection (5 points)
---

References
---
List the resources you consulted to complete this homework here. Write one sentence per resource about what it provided to you. If you consulted no references to complete your assignment, write a brief sentence stating that this is the case and why it was the case for you.

- https://www.nltk.org/_modules/nltk/translate/metrics.html
    - Read the nltk source code for its translation metrics.
- https://www.nltk.org/howto/metrics.html
    - Read about fixing an error in nltk.metrics.scores related to passing in sets.
    
- https://arize.com/blog-course/binary-cross-entropy-log-loss/
    - Read about binary cross entropy.
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
    - Read about the sklearn count vectorizer and how to use it.
- https://keras.io/api/layers/activations/
    - Read about Keras activations. 
- https://keras.io/api/models/sequential/
    - Read about Keras sequential.
- https://keras.io/api/layers/core_layers/dense/
    - Read about the Keras Dense layer.
- https://stackoverflow.com/questions/71918564/valueerror-logits-and-labels-must-have-the-same-shape
    - Learned how to fix the "logits and labels must have the same shape" ValueError.
- https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html
    - Learned about overfitting and underfitting.
 
AI Collaboration
---
Following the *AI Collaboration Policy* in the syllabus, please cite any LLMs that you used here and briefly describe what you used them for. Additionally, provide comments in-line identifying the specific sections that you used LLMs on, if you used them towards the generation of any of your answers.

__NEW__: Do not include nested list comprehensions supplied by AI collaborators; all nested list comprehensions __must__ be re-written.

Reflection
----
Answer the following questions __after__ you complete this assignment (no more than 1 sentence per question required, this section is graded on completion):

1. Does this work reflect your best effort?
Yes, we both tried our best to implement the helper functions and complete all the tasks.
2. What was/were the most challenging part(s) of the assignment?
Implementing the helper functions was the most challenging part.
3. If you want feedback, what function(s) or problem(s) would you like feedback on and why?
We would like feedback on the graphs we created.
4. Briefly reflect on how your partnership functioned--who did which tasks, how was the workload on each of you individually as compared to the previous homeworks, etc.


Task 1: Provided Data Write-Up (10 points)
---

Every time you use a data set in an NLP application (or in any software application), you should be able to answer a set of questions about that data. Answer these now. Default to no more than 1 sentence per question needed. If more explanation is necessary, do give it.

This is about the __provided__ movie review data set.

1. Where did you get the data from? The provided dataset(s) were sub-sampled from https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews 
- We got the data from Kaggle; the provided files are a sub-sample of the IMDB Dataset of 50K Movie Reviews linked above.
2. (1 pt) How was the data collected (where did the people acquiring the data get it from and how)?
The dataset author does not say how the data was collected. I assume it was collected from a review app or website.
3. (2 pts) How large is the dataset (answer for both the train and the dev set, separately)? (# reviews, # tokens in both the train and dev sets)
There are 1600 reviews and 425421 tokens in the train set. There are 200 reviews and 54603 tokens in the dev set.
4. (1 pt) What is your data? (i.e. newswire, tweets, books, blogs, etc)
The data is movie reviews.
5. (1 pt) Who produced the data? (who were the authors of the text? Your answer might be a specific person or a particular group of people)
Audience members who watched the movies wrote the reviews.
6. (2 pts) What is the distribution of labels in the data (answer for both the train and the dev set, separately)?
The labels are split 50/50 between the positive and negative classes.
7. (2 pts) How large is the vocabulary (answer for both the train and the dev set, separately)?
The train set vocabulary has 27132 word types, and the dev set vocabulary has 8145.
8. (1 pt) How big is the overlap between the vocabulary for the train and dev set?
6123 word types appear in both the train and dev vocabularies.
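
Question 6 asks for the label distribution of the train and dev sets separately. A minimal sketch of how that could be counted is below. It assumes the labels are available as a plain list per split (for example `train_tups[1]` and `dev_tups[1]`, which is an assumption about the tuple layout); the helper name `label_distribution` and the toy labels are hypothetical.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, fraction)} for a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}


# Toy labels standing in for the real train and dev labels (hypothetical data).
toy_train_labels = [1, 0, 1, 0, 1, 0]
toy_dev_labels = [0, 1]

for name, labels in [("train", toy_train_labels), ("dev", toy_dev_labels)]:
    for label, (n, frac) in label_distribution(labels).items():
        print(f"{name}: label {label} -> {n} reviews ({frac:.0%})")
```

Reporting the count and the fraction for each split makes it easy to see whether the dev set has the same balance as the train set.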

In [1]:
# our utility functions
# RESTART your jupyter notebook kernel if you make changes to this file
import sentiment_utils as sutils

TRAIN_FILE = "movie_reviews_train.txt"
DEV_FILE = "movie_reviews_dev.txt"

# load in your data and make sure you understand the format
# Do not print out too much so as to impede readability of your notebook
train_tups = sutils.generate_tuples_from_file(TRAIN_FILE)
dev_tups = sutils.generate_tuples_from_file(DEV_FILE)

# 3
# see how many reviews
print(f"There are {len(train_tups[0])} reviews in the training set")
print(f"There are {len(dev_tups[0])} reviews in the dev set")

num_tokens_train = sum([len(review) for review in train_tups[0]])
num_tokens_dev = sum([len(review) for review in dev_tups[0]])

print(f"There are {num_tokens_train} tokens in the training set")
print(f"There are {num_tokens_dev} tokens in the dev set")

# 7
# get the vocabulary
vocab_train = sutils.create_index(train_tups[0])
vocab_dev = sutils.create_index(dev_tups[0])

print("Vocabulary size of training set:", len(vocab_train))
print("Vocabulary size of dev set:", len(vocab_dev))

# 8
overlap = [vocab for vocab in vocab_dev if vocab in vocab_train]
print(f"There are {len(overlap)} vocabulary exists in both train and dev data")


There are 1600 reviews in the training set
There are 200 reviews in the dev set
There are 425421 tokens in the training set
There are 54603 tokens in the dev set
Vocabulary size of training set: 27132
Vocabulary size of dev set: 8145
There are 6123 vocabulary exists in both train and dev data
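
The vocabulary counts above can also be cross-checked with plain Python sets. The sketch below does not use `sutils.create_index` (its return type is not shown here); `build_vocab` and the toy documents are hypothetical stand-ins, and the loop is written without nested comprehensions.

```python
def build_vocab(docs):
    """Collect the set of distinct tokens (word types) across tokenized documents."""
    vocab = set()
    for doc in docs:
        vocab.update(doc)
    return vocab


# Toy tokenized documents standing in for the real train and dev reviews.
toy_train = [["a", "great", "movie"], ["a", "dull", "plot"]]
toy_dev = [["great", "plot", "twist"]]

train_vocab = build_vocab(toy_train)
dev_vocab = build_vocab(toy_dev)
shared = train_vocab & dev_vocab  # set intersection: types seen in both splits

print(f"train types: {len(train_vocab)}, dev types: {len(dev_vocab)}")
print(f"shared types: {len(shared)} ({len(shared) / len(dev_vocab):.0%} of dev)")
```

Applying the same ratio to the reported counts, 6123 / 8145 is roughly 75%, so about three quarters of the dev word types also occur in the train set.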


Task 5: Model Evaluation (15 points)
---
Save your three graph files for the __best__ configurations that you found with your models using the `plt.savefig(filename)` command. The `bbox_inches` optional parameter will help you control how much whitespace outside of the graph is in your resulting image.
Run each notebook containing a classifier 3 times, resulting in __NINE__ saved graphs (don't just overwrite your previous ones).

You will turn in all of these files.

10 points in this section are allocated for having all nine graphs legible, properly labeled, and present.
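
One way to avoid overwriting graphs between runs is to build the file name from the model name and the run number. The sketch below is only an illustration: the helper names, the `graphs` directory, and the model names are assumptions. `save_graph` expects a matplotlib `Figure` and calls its `savefig` method with `bbox_inches="tight"` to trim surrounding whitespace.

```python
from pathlib import Path


def graph_path(model_name, run, out_dir="graphs"):
    """Build a distinct file path per model and run, e.g. graphs/naive_bayes_run2.png."""
    return Path(out_dir) / f"{model_name}_run{run}.png"


def save_graph(fig, model_name, run, out_dir="graphs"):
    """Save a matplotlib Figure with tight bounding box; returns the path written."""
    path = graph_path(model_name, run, out_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    fig.savefig(path, bbox_inches="tight")
    return path


# Three models times three runs gives nine distinct file names.
models = ["naive_bayes", "logistic_regression", "neural_net"]
paths = []
for model in models:
    for run in (1, 2, 3):
        paths.append(graph_path(model, run))

print(len(paths), "graph files planned")
```

In a classifier notebook this could be used as `fig, ax = plt.subplots()`, followed by the plotting calls and `save_graph(fig, "naive_bayes", run=1)`.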




1. (1 pt) When using __10%__ of your data, which model had the highest f1 score?
2. (1 pt) Which classifier had the most __consistent__ performance (that is, which classifier had the least variation across all three graphs you have for it -- no need to mathematically calculate this, you can just look at the graphs)?
3. (1 pt) For each model, what percentage of training data resulted in the highest f1 score?
    1. Naive Bayes:
    2. Logistic Regression:
    3. Neural Net:
4. (2 pts) Which model, if any, appeared to overfit the training data the most? Why?
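
The questions above compare models by f1 score and ask about overfitting. As a reminder of what is being compared, here is a minimal sketch that computes f1 from gold and predicted labels and reports the gap between train and dev f1, which is one rough signal of overfitting. The function names, toy labels, and toy scores are hypothetical and are not results from our models.

```python
def f1_score(gold, pred, positive=1):
    """Compute f1 for the given positive label from parallel gold/pred lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are 0 or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def generalization_gap(train_f1, dev_f1):
    """Train f1 minus dev f1; a large positive gap suggests overfitting."""
    return train_f1 - dev_f1


toy_gold = [1, 1, 0, 0, 1]
toy_pred = [1, 0, 0, 1, 1]
print(f"toy f1: {f1_score(toy_gold, toy_pred):.3f}")  # prints "toy f1: 0.667"
print(f"toy gap: {generalization_gap(0.99, 0.80):.2f}")
```

Comparing the gap for each model at the same training percentage gives a consistent basis for answering question 4.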


6120 REQUIRED
----

Find a second data set that is labeled for sentiment from a different domain (not movie reviews). Rerun your notebook with this data (you should set up your notebook so that you only need to change the paths and possibly run a different pre-processing function on the data). Note that you will want binary labels.

Answer the regular data questions for your new data set
----
1. Where did you get the data from?
We got the data from Kaggle.
2. How was the data collected (where did the people acquiring the data get it from and how)?
The data was collected from Twitter. The dataset author does not say how it was collected.
3. How large is the dataset (answer for both the train and the dev set, separately)? (# reviews, # tokens in both the train and dev sets)
For the training set, there are 25569 tweets and 336636 tokens. For the dev set, there are 6393 tweets and 83943 tokens.
4. What is your data? (i.e. newswire, tweets, books, blogs, etc)
The data is tweets.
5. Who produced the data? (who were the authors of the text? Your answer might be a specific person or a particular group of people)
Twitter users wrote the tweets.
6. What is the distribution of labels in the data (answer for both the train and the dev set, separately)?
Across the full data set (train and dev combined), there are 29720 examples of class 0 and 2242 examples of class 1.
7. How large is the vocabulary (answer for both the train and the dev set, separately)?
The train set vocabulary has 57567 word types. The dev set vocabulary has 21244 word types.
8. How big is the overlap between the vocabulary for the train and dev set?
11589 word types appear in both the train and dev vocabularies.
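
The class counts reported for this data set are heavily imbalanced, which matters when reading the f1 graphs. The short worked example below uses only the two reported counts; the "always predict class 0" baseline is a hypothetical classifier added for illustration.

```python
# Class counts reported above for the Twitter data set.
n_class0, n_class1 = 29720, 2242
total = n_class0 + n_class1

# Share of each class in the full data set.
frac_class0 = n_class0 / total
frac_class1 = n_class1 / total

# A hypothetical baseline that always predicts class 0 gets this accuracy,
# but it never predicts class 1, so its f1 for class 1 is 0.
baseline_accuracy = frac_class0
baseline_f1_class1 = 0.0

print(f"total examples: {total}")
print(f"class 0: {frac_class0:.1%}, class 1: {frac_class1:.1%}")
print(f"always-class-0 accuracy: {baseline_accuracy:.1%}, f1 for class 1: {baseline_f1_class1}")
```

Because a trivial baseline already reaches about 93% accuracy here, f1 on the minority class is the more informative number for comparing the three models.
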
Answer the model evaluation questions for your new data set
----
1. When using __10%__ of your data, which model had the highest f1 score?
2. Which classifier had the most __consistent__ performance (that is, which classifier had the least variation across all three graphs you have for it -- no need to mathematically calculate this, you can just look at the graphs)?
3. For each model, what percentage of training data resulted in the highest f1 score?
    1. Naive Bayes:
    2. Logistic Regression:
    3. Neural Net:
4. Which model, if any, appeared to overfit the training data the most? Why?

In [3]:
# Feel free to write code to help answer the above questions

# load the data from the csv file; it is stored as [[list of token lists], [labels]]
data = sutils.generate_tuples_from_file("twitter.csv", csvfile=True)

# split the data into a train set and a dev set; the split is not random (no shuffling)
X_train, X_dev, y_train, y_dev = sutils.train_dev_split(data, train_ratio=0.8)
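
For readers without access to `sentiment_utils`, a non-random split like the one above could look roughly like the sketch below. This is an assumption about the behavior, not the actual `sutils.train_dev_split` implementation; the function name `train_dev_split_sketch` and the toy data are hypothetical.

```python
def train_dev_split_sketch(data, train_ratio=0.8):
    """Split [docs, labels] in order: the first train_ratio share is train, the rest is dev."""
    docs, labels = data
    cut = int(len(docs) * train_ratio)
    return docs[:cut], docs[cut:], labels[:cut], labels[cut:]


# Toy data in the [[list of token lists], [labels]] layout described above.
toy_data = [[["tok"] for _ in range(10)], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]]
X_tr, X_dv, y_tr, y_dv = train_dev_split_sketch(toy_data, train_ratio=0.8)
print(len(X_tr), len(X_dv), len(y_tr), len(y_dv))
```

With 31962 examples and a ratio of 0.8, this cut point gives 25569 train and 6393 dev examples, which matches the sizes reported for this data set. Because nothing is shuffled, the label balance in each split depends on the order of rows in the csv file.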

In [6]:
# 3
# see how many reviews
print(f"There are {len(X_train)} reviews in the training set")
print(f"There are {len(X_dev)} reviews in the dev set")

num_tokens_train = sum([len(review) for review in X_train])
num_tokens_dev = sum([len(review) for review in X_dev])

print(f"There are {num_tokens_train} tokens in the training set")
print(f"There are {num_tokens_dev} tokens in the dev set")

# 7
# get the vocabulary
vocab_train = sutils.create_index(X_train)
vocab_dev = sutils.create_index(X_dev)

print("Vocabulary size of training set:", len(vocab_train))
print("Vocabulary size of dev set:", len(vocab_dev))

# 8
overlap = [vocab for vocab in vocab_dev if vocab in vocab_train]
print(f"There are {len(overlap)} vocabulary exists in both train and dev data")

There are 25569 reviews in the training set
There are 6393 reviews in the dev set
There are 336636 tokens in the training set
There are 83943 tokens in the dev set
Vocabulary size of training set: 57567
Vocabulary size of dev set: 21244
There are 11589 vocabulary exists in both train and dev data


In [None]:
# any code you need to write here