# Exercise: Classification I

In this exercise session, you will be training a logistic regression model to classify movie review text into positive and negative reviews. You will be using a bag-of-words approach, where the features are the TF-IDF scores of the tokens in the review.
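For reference, the score behind each feature can be written out. This is a sketch that assumes scikit-learn's `TfidfVectorizer` with its default settings (`smooth_idf=True`, `norm="l2"`). The symbols are notation introduced here: $n$ is the number of reviews, $\mathrm{tf}(t, d)$ is the count of token $t$ in review $d$, and $\mathrm{df}(t)$ is the number of reviews that contain $t$.

$$
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \left( \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1 \right)
$$

Under those defaults, each review's vector is then scaled to unit Euclidean length, so a token scores high when it is frequent in one review but rare across the collection.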

# 1. Load the libraries

You will need to have installed:

- pandas
- numpy
- datasets
- scikit-learn (imported as `sklearn`)
- wordcloud and matplotlib (optional)

It is good practice to have all the imports at the top of the notebook.

# 2. Load the *rotten_tomatoes* data set

This is a data set of short snippets from movie reviews on Rotten Tomatoes, along with whether the review gave the movie a positive ("fresh") or negative ("rotten") rating.

1. Have a look at the [documentation](https://huggingface.co/datasets/rotten_tomatoes) of the data set on HuggingFace.
2. Load the data set (train, validation and test splits) with the HuggingFace `datasets` library.
3. Print a few reviews to get an idea of what kind of data this is.

You can also browse all HF datasets visually online at [huggingface datasets](https://huggingface.co/datasets).
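The loading and printing steps could look roughly like the sketch below. The helper names `load_rotten_tomatoes` and `preview` and the two toy reviews are made up for illustration; only `load_dataset("rotten_tomatoes")` and the `text` and `label` fields come from the data set itself.

```python
def load_rotten_tomatoes():
    """Download (or read from the local cache) all three splits."""
    # In the notebook this import belongs in the import cell at the top;
    # it is local here only so the rest of the snippet runs without a download.
    from datasets import load_dataset
    return load_dataset("rotten_tomatoes")


def preview(split, n=3):
    """Return the first n examples of a split as (label, text) pairs."""
    return [(split[i]["label"], split[i]["text"]) for i in range(min(n, len(split)))]


# Stand-in rows with the same fields as the real data (made-up reviews),
# so the snippet can be tried offline.
toy_split = [
    {"text": "a warm and funny film", "label": 1},
    {"text": "dull from start to finish", "label": 0},
]
for label, text in preview(toy_split):
    print(label, text)

# With the real data:
# dataset = load_rotten_tomatoes()
# print(dataset)  # shows the split names and their sizes
# for label, text in preview(dataset["train"]):
#     print(label, text)
```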

# 3. Vectorizing the reviews with TF-IDF

1. Get the review texts and the labels into separate lists for each of the `rotten_tomatoes` splits (train, validation and test).
2. Turn the texts into numbers with the `TfidfVectorizer` from scikit-learn.

TF-IDF vectorizer documentation can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
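A minimal sketch of the vectorizing step, using made-up toy lists in place of the real splits. The variable names are assumptions, not part of the exercise. The point it illustrates is that the vectorizer is fitted on the training texts only and then reused, unchanged, on the other splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins; with the real data these would be e.g.
# list(dataset["train"]["text"]) and list(dataset["train"]["label"]).
train_texts = ["a warm and funny film", "dull from start to finish", "funny and moving"]
train_labels = [1, 0, 1]
val_texts = ["a dull film", "moving and warm"]
val_labels = [0, 1]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training texts.
X_train = vectorizer.fit_transform(train_texts)
# transform reuses that vocabulary, so both matrices have the same columns.
X_val = vectorizer.transform(val_texts)

print(X_train.shape, X_val.shape)
```

Words that appear only in the validation texts are simply ignored by `transform`, because they have no column in the fitted vocabulary.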

# 4. Exploratory analysis

1. How many classes does this data have? Are the classes balanced?
2. What are the 10 most frequent words across all the reviews?

Hint: if you use scikit-learn's `CountVectorizer` and are running out of memory, you can limit how many words it computes frequencies for (see its `max_features` parameter).
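One possible way to answer both questions is sketched below on made-up toy data. `Counter` handles the class balance, and `CountVectorizer` with an arbitrarily chosen `max_features=1000` handles the word frequencies. Note that the default tokenizer drops one-character tokens such as "a".

```python
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the review texts and labels.
texts = ["a great great movie", "a dull movie", "great fun"]
labels = [1, 0, 1]

# Class balance: how many examples carry each label.
class_counts = Counter(labels)
print(class_counts)

# Word frequencies: sum each vocabulary column over all reviews.
count_vectorizer = CountVectorizer(max_features=1000)
counts = count_vectorizer.fit_transform(texts)
totals = np.asarray(counts.sum(axis=0)).ravel()

# vocabulary_ maps each word to its column index in the count matrix.
word_totals = {word: int(totals[col]) for word, col in count_vectorizer.vocabulary_.items()}

# Sort by descending count, breaking ties alphabetically, and keep ten.
top_words = sorted(word_totals.items(), key=lambda item: (-item[1], item[0]))[:10]
print(top_words)
```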

# 5. Logistic Regression

1. Create the classifier and train it on the train set. If you find that it doesn't converge, try increasing the number of iterations (`max_iter` parameter), e.g. to 1500. Documentation for logistic regression can be found [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
2. Make predictions on the validation set (leave the test set aside for now)
3. Use the accuracy metric to compare the predicted labels to the ground truth labels you have from the original data. The documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).
4. Optional bonus task: scikit-learn has a very useful [dummy classifier](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html). If your classifier always simply predicted the majority class, how well would it do?
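The four steps above can be sketched as follows, again on made-up toy data with assumed variable names. The numbers it prints say nothing about the real data set; the sketch only shows how the pieces fit together.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy stand-ins for the train and validation splits.
train_texts = [
    "a warm and funny film",
    "moving and beautifully acted",
    "funny and clever",
    "dull from start to finish",
    "a boring mess",
]
train_labels = [1, 1, 1, 0, 0]
val_texts = ["a clever film", "boring and dull", "warm and moving", "beautifully funny"]
val_labels = [1, 0, 1, 1]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_val = vectorizer.transform(val_texts)

# Steps 1-3: train, predict on the validation set, score with accuracy.
clf = LogisticRegression(max_iter=1500)
clf.fit(X_train, train_labels)
val_pred = clf.predict(X_val)
val_accuracy = accuracy_score(val_labels, val_pred)
print("logistic regression:", val_accuracy)

# Step 4: a baseline that always predicts the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, train_labels)
baseline_accuracy = accuracy_score(val_labels, baseline.predict(X_val))
print("majority baseline:", baseline_accuracy)
```

Comparing the two numbers shows how much the classifier has learned beyond the class distribution.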

# 6. Testing on your own data!

1. Find 10-20 reviews outside of this dataset. You can pick any reviews you like, from rottentomatoes.com or ones you write yourself.
2. Put them into a spreadsheet and either manually annotate them with negative or positive sentiment, or use the labels provided on rottentomatoes.com. Make sure the columns are named "text" and "label" and the labels are consistent with the `rotten_tomatoes` labeling scheme (1=positive, 0=negative).
3. Save this file as a .csv file, load it into your notebook, and convert the text to TF-IDF scores using the vectorizer you already fitted on the training data.
4. Use this small dataset as a test dataset for the logistic regression classifier trained on the `rotten_tomatoes` data. Is your classifier doing better or worse than on the validation data?
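A sketch of steps 3 and 4, assuming pandas is used to read the file. An in-memory string stands in for the .csv file here so the snippet runs on its own, and its two reviews are made up. With a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for the vectorizer already fitted on the rotten_tomatoes train split.
train_texts = ["a warm and funny film", "dull from start to finish", "flat and lifeless"]
vectorizer = TfidfVectorizer()
vectorizer.fit(train_texts)

# Stand-in for your own file; the columns must be named "text" and "label".
csv_file = io.StringIO("text,label\nan absolute delight,1\nflat and lifeless,0\n")
own_reviews = pd.read_csv(csv_file)

# transform (not fit_transform), so the columns match what the classifier was trained on.
X_own = vectorizer.transform(own_reviews["text"])
y_own = own_reviews["label"].tolist()
print(X_own.shape, y_own)
```

`X_own` and `y_own` can then be passed to the trained classifier's `predict` and to `accuracy_score`, exactly as for the validation set.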

# 7. Does pre-processing make a difference?

Everything you've done so far was just considering the raw text of the reviews. Let us try to add pre-processing.

1. What preprocessing do you think could help the classifier? Why do you think so?
2. Implement the pre-processing step(s) of your choice and re-vectorize the `rotten_tomatoes` reviews.
3. Re-run the classifier. Did the accuracy improve? Why do you think it improved (or didn't)?
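As one illustrative possibility, not the expected answer, a pre-processing function might lowercase the text and strip digits and punctuation. The function name `preprocess` and the example review are made up for this sketch.

```python
import re


def preprocess(text):
    """Lowercase, replace anything that is not a letter or whitespace, tidy spaces."""
    text = text.lower()
    # Digits and punctuation become spaces so neighbouring words stay separate.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse the runs of whitespace left behind by the previous step.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


cleaned = preprocess("A GREAT film... 10/10, would watch again!")
print(cleaned)  # a great film would watch again
```

The cleaned texts can then be vectorized and classified exactly as before. Keep in mind that `TfidfVectorizer` already lowercases by default and can drop English stop words via `stop_words="english"`, so some steps may change less than you expect.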

# 8. Bonus: visualize the reviews with a word cloud

Wordcloud is a nice little library to visualize text. The only required argument is the text from which the wordcloud should be generated. Removing punctuation, lowercasing and stripping English stopwords happens automatically.

- [reference](https://github.com/amueller/word_cloud/blob/master/examples/simple.py)
- [tutorial](https://www.datacamp.com/community/tutorials/wordcloud-python)

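```python
# Basic usage (imports are repeated here so the cell runs on its own):
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Type in your sentence; generate() raises an error if the text contains no words.
sentence = 'replace this placeholder with the text of one or more reviews'
wordcloud = WordCloud(background_color="white",
                      width=400
                     ).generate(sentence)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
```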