Homework 4: Sentiment Analysis - Task 0, Task 1, Task 5 (all primarily written tasks)
----

The following instructions are only written in this notebook but apply to all notebooks and `.py` files you submit for this homework.

Due date: October 25th, 2023

Points: 
- Task 0: 5 points
- Task 1: 10 points
- Task 2: 30 points
- Task 3: 20 points
- Task 4: 20 points
- Task 5: 15 points

Goals:
- understand the difficulties of counting and probabilities in NLP applications
- work with real world data to build a functioning language model
- stress test your model (to some extent)

Complete in groups of: __two (pairs)__. If you prefer to work on your own, you may, but be aware that this homework has been designed as a partner project.

Allowed python modules:
- `numpy`, `matplotlib`, `keras`, `pytorch`, `nltk`, `pandas`, `scikit-learn` (`sklearn`), `seaborn`, and all built-in python libraries (e.g. `math` and `string`)
- if you would like to use a library not on this list, post on piazza to request permission
- all *necessary* imports have been included for you (all imports that we used in our solution)

Instructions:
- Complete outlined problems in this notebook. 
- When you have finished, __clear the kernel__ and __run__ your notebook "fresh" from top to bottom. Ensure that there are __no errors__. 
    - If a problem asks for you to write code that does result in an error (as in, the answer to the problem is an error), leave the code in your notebook but commented out so that running from top to bottom does not result in any errors.
- Double check that you have completed Task 0.
- Submit your work on Gradescope.
- Double check that your submission on Gradescope looks like you believe it should __and__ that all partners are included (for partner work).

6120 students: complete __all__ problems.

4120 students: you are not required to complete problems marked "CS 6120 REQUIRED". If you complete these you will not get extra credit. We will not take points off if you attempt these problems and do not succeed.

Names & Sections
----
Names: Shashidhar Gollamudi - 6120
       Sunny Huang - 4120


Task 0: Name, References, Reflection (5 points)
---

References
---
List the resources you consulted to complete this homework here. Write one sentence per resource about what it provided to you. If you consulted no references to complete your assignment, write a brief sentence stating that this is the case and why it was the case for you.

(Example)
- https://docs.python.org/3/tutorial/datastructures.html
    - Read about the basics and syntax for data structures in python.

AI Collaboration
---
Following the *AI Collaboration Policy* in the syllabus, please cite any LLMs that you used here and briefly describe what you used them for. Additionally, provide comments in-line identifying the specific sections that you used LLMs on, if you used them towards the generation of any of your answers.

__NEW__: Do not include nested list comprehensions supplied by AI collaborators. All nested list comprehensions __must__ be re-written.
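
As an illustration of the kind of rewrite this asks for, the sketch below flattens a list of token lists twice: once with a comprehension that has two `for` clauses and once with explicit loops. It assumes the policy covers comprehensions of that shape. The function names and sample reviews are made up for this example and are not part of the assignment.

```python
def flatten_with_comprehension(reviews):
    """Comprehension with two `for` clauses (the form to re-write)."""
    return [token for review in reviews for token in review]


def flatten_with_loops(reviews):
    """Equivalent version written as explicit nested loops."""
    tokens = []
    for review in reviews:
        for token in review:
            tokens.append(token)
    return tokens


# Hypothetical sample data, only to show that both versions agree.
sample_reviews = [["great", "movie"], ["not", "great"]]
assert flatten_with_comprehension(sample_reviews) == flatten_with_loops(sample_reviews)
print(flatten_with_loops(sample_reviews))
```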

Reflection
----
Answer the following questions __after__ you complete this assignment (no more than 1 sentence per question required, this section is graded on completion):

1. Does this work reflect your best effort? Yes
2. What was/were the most challenging part(s) of the assignment? The hardest part of this assignment was formatting data correctly so that it could be used as input for classifiers and neural networks. For example, the Naive Bayes Classifier needed a list of lists containing the featurized word vectors.
3. If you want feedback, what function(s) or problem(s) would you like feedback on and why? We'd like feedback on the neural network training and implementation because we'd like to know if we achieved optimal accuracy.
4. Briefly reflect on how your partnership functioned--who did which tasks, how was the workload on each of you individually as compared to the previous homeworks, etc. We split the workload evenly by dividing up the writing and coding tasks. Shashi did tasks 1 and 3, Sunny did tasks 4 and 5, and since task 2 was weighted the most heavily we collaborated closely on it.

Task 1: Provided Data Write-Up (10 points)
---

Every time you use a data set in an NLP application (or in any software application), you should be able to answer a set of questions about that data. Answer these now. Default to no more than 1 sentence per question needed. If more explanation is necessary, do give it.

This is about the __provided__ movie review data set.

1. Where did you get the data from? The provided dataset(s) were sub-sampled from https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews 
2. (1 pt) How was the data collected (where did the people acquiring the data get it from and how)? The reviews were collected from IMDB, with at most 30 reviews included for any given movie. Each review was labeled by its rating: reviews rated below 5 were labeled negative and reviews rated above 6 were labeled positive.
3. (2 pts) How large is the dataset (answer for both the train and the dev set, separately)? (# reviews, # tokens in both the train and dev sets) Train - 1600 reviews, 425421 tokens. Dev - 200 reviews, 54603 tokens.
4. (1 pt) What is your data? (i.e. newswire, tweets, books, blogs, etc) The data is a collection of IMDB reviews saved as text files.
5. (1 pt) Who produced the data? (who were the authors of the text? Your answer might be a specific person or a particular group of people) The review text was written by IMDB users. The dataset was compiled by Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts, researchers at Stanford, and published through the Association for Computational Linguistics.
6. (2 pts) What is the distribution of labels in the data (answer for both the train and the dev set, separately)? Train - 796 0s, 804 1s, Dev - 95 0s, 105 1s.
7. (2 pts) How large is the vocabulary (answer for both the train and the dev set, separately)? Train - 30705 words, Dev - 8953 words
8. (1 pt) How big is the overlap between the vocabulary for the train and dev set? 6574 words.

In [18]:
# our utility functions
# RESTART your jupyter notebook kernel if you make changes to this file
import sentiment_utils as sutils

import pandas as pd
import nltk
from collections import Counter


In [30]:
# Feel free to write code to help answer the above questions
TRAIN_FILE = "movie_reviews_train.txt"
DEV_FILE = "movie_reviews_dev.txt"

train_tups = sutils.generate_tuples_from_file(TRAIN_FILE)
dev_tups = sutils.generate_tuples_from_file(DEV_FILE)

# number of reviews and total number of tokens (train, then dev)
print(len(train_tups[0]))
print(sum(map(len, train_tups[0])))
print(len(dev_tups[0]))
print(sum(map(len, dev_tups[0])))

# label distribution: counts of 0 labels and 1 labels (train, then dev)
counterTrain = Counter(train_tups[1])
counterDev = Counter(dev_tups[1])
print(counterTrain.get(0))
print(counterTrain.get(1))
print(counterDev.get(0))
print(counterDev.get(1))

# vocabulary size: number of unique tokens across all train reviews
trainList = [item for sublist in train_tups[0] for item in sublist]
vocabTrain = set(trainList)
print(len(vocabTrain))

# vocabulary size: number of unique tokens across all dev reviews
devList = [item for sublist in dev_tups[0] for item in sublist]
vocabDev = set(devList)
print(len(vocabDev))

# number of unique tokens that appear in both train and dev
print(len(vocabTrain.intersection(vocabDev)))


1600
425421
200
54603
796
804
95
105
30705
8953
6574
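
The raw counts above can be turned into a few derived figures that are easier to compare across the two splits. The following is a minimal sketch that only does arithmetic on the numbers printed above; the dictionary layout and the `summarize` helper are made up for this illustration.

```python
# Counts copied from the notebook output above.
train = {"reviews": 1600, "tokens": 425421, "ones": 804, "vocab": 30705}
dev = {"reviews": 200, "tokens": 54603, "ones": 105, "vocab": 8953}
overlap = 6574  # unique tokens shared by train and dev


def summarize(name, stats):
    """Print and return average review length and the share of reviews labeled 1."""
    avg_len = stats["tokens"] / stats["reviews"]
    share_ones = stats["ones"] / stats["reviews"]
    print(f"{name}: {avg_len:.1f} tokens/review, {share_ones:.2%} labeled 1")
    return avg_len, share_ones


train_summary = summarize("train", train)
dev_summary = summarize("dev", dev)

# Share of the dev vocabulary that also appears in train, and the
# number of dev word types never seen in train.
dev_coverage = overlap / dev["vocab"]
unseen_dev_types = dev["vocab"] - overlap
print(f"dev vocabulary covered by train: {dev_coverage:.1%}")
print(f"dev word types unseen in train: {unseen_dev_types}")
```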


Task 5: Model Evaluation (15 points)
---
Save your three graph files for the __best__ configurations that you found with your models using the `plt.savefig(filename)` command. The `bbox_inches` optional parameter will help you control how much whitespace outside of the graph is in your resulting image.
Run each notebook containing a classifier 3 times, resulting in __NINE__ saved graphs (don't just overwrite your previous ones).

You will turn in all of these files.

10 points in this section are allocated for having all nine graphs legible, properly labeled, and present.
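
One way to avoid overwriting earlier graphs is to build each file name from the model name and the run number. The sketch below assumes that approach; the helper `save_run_figure`, the `graphs` directory, and the file naming scheme are hypothetical and not required by the assignment. It accepts any object with a `savefig` method, such as a matplotlib `Figure`.

```python
from pathlib import Path


def save_run_figure(fig, model_name, run, out_dir="graphs"):
    """Save `fig` under a name unique to (model, run) so earlier runs are kept."""
    out_path = Path(out_dir) / f"{model_name}_run{run}.png"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # bbox_inches="tight" trims extra whitespace around the graph.
    fig.savefig(out_path, bbox_inches="tight")
    return out_path
```

In a notebook that has imported `matplotlib.pyplot` as `plt`, a call might look like `save_run_figure(plt.gcf(), "naive_bayes", run=1)`, repeated with `run=2` and `run=3` after each fresh run.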




1. (1 pt) When using __10%__ of your data, which model had the highest f1 score?
2. (1 pt) Which classifier had the most __consistent__ performance (that is, which classifier had the least variation across all three graphs you have for it -- no need to mathematically calculate this, you can just look at the graphs)?
3. (1 pt) For each model, what percentage of training data resulted in the highest f1 score?
    1. Naive Bayes:
    2. Logistic Regression:
    3. Neural Net:
4. (2 pts) Which model, if any, appeared to overfit the training data the most? Why?
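
Questions 2 and 3 can be answered by eye from the graphs, but if the f1 scores are also kept in lists, a short script can cross-check the reading. The sketch below assumes one list of f1 scores per run, with one score per training-data percentage. The scores are placeholders invented for this illustration, not results from this homework, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def best_percentage(percentages, model_runs):
    """Training percentage with the highest f1 averaged over all runs."""
    avg_per_pct = [mean(scores) for scores in zip(*model_runs)]
    return percentages[avg_per_pct.index(max(avg_per_pct))]


def run_to_run_spread(model_runs):
    """Standard deviation across runs, averaged over percentages (lower = more consistent)."""
    return mean(pstdev(scores) for scores in zip(*model_runs))


# PLACEHOLDER values only: each inner list is one run, each entry one percentage.
percentages = [10, 50, 100]
runs = {
    "model_a": [[0.70, 0.78, 0.80], [0.71, 0.77, 0.81], [0.69, 0.79, 0.80]],
    "model_b": [[0.60, 0.82, 0.75], [0.72, 0.70, 0.76], [0.65, 0.80, 0.74]],
}

for name, model_runs in runs.items():
    print(name, best_percentage(percentages, model_runs), round(run_to_run_spread(model_runs), 4))
```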


6120 REQUIRED
----

Find a second data set that is labeled for sentiment from a different domain (not movie reviews). Rerun your notebook with this data (you should set up your notebook so that you only need to change the paths and possibly run a different pre-processing function on the data). Note that you will want binary labels.
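
If the new data set comes with numeric ratings rather than binary labels, a small pre-processing function can convert it. The sketch below is one possible approach under several assumptions: ratings are on a 1-10 scale, the thresholds mirror the ones described for the movie review data, 0 stands for negative and 1 for positive, tokens are split on whitespace, and the output is a pair of lists shaped like the `train_tups[0]` and `train_tups[1]` values used in Task 1. The function names are hypothetical and are not part of `sentiment_utils`.

```python
def binarize_rating(rating, neg_max=4, pos_min=7):
    """Map a rating to 0 (negative) or 1 (positive); return None for ratings in between."""
    if rating <= neg_max:
        return 0
    if rating >= pos_min:
        return 1
    return None


def to_tuples(rows):
    """Turn (text, rating) rows into (list of token lists, list of binary labels)."""
    reviews, labels = [], []
    for text, rating in rows:
        label = binarize_rating(rating)
        if label is None:
            continue  # drop in-between ratings so the labels stay binary
        reviews.append(text.split())  # whitespace tokenization (an assumption)
        labels.append(label)
    return reviews, labels


# Hypothetical rows, only to show the shape of the output.
rows = [("good stuff", 9), ("it was fine", 5), ("bad", 2)]
print(to_tuples(rows))
```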

Answer the regular data questions for your new data set
----
1. Where did you get the data from?
2. How was the data collected (where did the people acquiring the data get it from and how)?
3. How large is the dataset (answer for both the train and the dev set, separately)? (# reviews, # tokens in both the train and dev sets)
4. What is your data? (i.e. newswire, tweets, books, blogs, etc)
5. Who produced the data? (who were the authors of the text? Your answer might be a specific person or a particular group of people)
6. What is the distribution of labels in the data (answer for both the train and the dev set, separately)?
7. How large is the vocabulary (answer for both the train and the dev set, separately)?
8. How big is the overlap between the vocabulary for the train and dev set?

Answer the model evaluation questions for your new data set
----
1. When using __10%__ of your data, which model had the highest f1 score?
2. Which classifier had the most __consistent__ performance (that is, which classifier had the least variation across all three graphs you have for it -- no need to mathematically calculate this, you can just look at the graphs)?
3. For each model, what percentage of training data resulted in the highest f1 score?
    1. Naive Bayes:
    2. Logistic Regression:
    3. Neural Net:
4. Which model, if any, appeared to overfit the training data the most? Why?

In [None]:
# any code you need to write here