# Exercise: Classification I

In this exercise session, you will be training a logistic regression model to classify movie review text into positive and negative reviews. You will be using a bag-of-words approach, where the features are the TF-IDF scores of the tokens in the review.

# 1. Load the libraries

You will need to have installed:

- pandas
- numpy
- datasets
- scikit-learn (imported as `sklearn`)
- wordcloud and matplotlib (optional)

It is good practice to have all the imports at the top of the notebook.

In [74]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from datasets import load_dataset, logging

# 2. Load the *rotten_tomatoes* data set

This is a data set of short snippets from movie reviews on Rotten Tomatoes, along with whether the review gave the movie a positive ("fresh") or negative ("rotten") rating.

1. Have a look at the [documentation](https://huggingface.co/datasets/rotten_tomatoes) of the data set on HuggingFace
2. Load the dataset (train, validation and test splits) from the huggingface library.
3. Print a few reviews to get an idea of what kind of data this is.

You can also browse all HF datasets visually online at [huggingface datasets](https://huggingface.co/datasets).

In [18]:
classification1_annotation = pd.read_csv("../dataset/classification1_annotation.csv")

# load the 2-class sentiment classification dataset from rotten_tomatoes
train = load_dataset('rotten_tomatoes', 'sentiment', split='train')
val = load_dataset('rotten_tomatoes', 'sentiment', split='validation')
test = load_dataset('rotten_tomatoes', 'sentiment', split='test')


Found cached dataset rotten_tomatoes (C:/Users/asger/.cache/huggingface/datasets/rotten_tomatoes/sentiment/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46)


In [17]:
print(train.description)

Movie Review Dataset.
This is a dataset of containing 5,331 positive and 5,331 negative processed
sentences from Rotten Tomatoes movie reviews. This data was first used in Bo
Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for
sentiment categorization with respect to rating scales.'', Proceedings of the
ACL, 2005.



# 3. Vectorizing the reviews with TF-IDF

1. Get the texts of the reviews and labels into separate lists for all the `rotten_tomatoes` data subsets
2. Turn the texts into numbers with the TF-IDF vectorizer from scikit-learn.

TF-IDF vectorizer documentation can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [25]:
train_corpus = [x["text"] for x in train]
train_labels = [x["label"] for x in train]

In [39]:
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_corpus)
feature_names = vectorizer.get_feature_names_out()
# X_train is kept sparse: a dense copy via .toarray() would need over 1 GB and is not used below

print(f"Train shape: {X_train.shape}")

val_corpus = [x["text"] for x in val]
val_features = vectorizer.transform(val_corpus)
val_labels = [x["label"] for x in val]

test_corpus = [x["text"] for x in test]
test_features = vectorizer.transform(test_corpus)
test_labels = [x["label"] for x in test]

Train shape: (8530, 16474)


# 4. Exploratory analysis

1. How many classes does this data have? Are the classes balanced?
2. What are the top 10 frequent words in all the reviews?

Hint: if you use sklearn CountVectorizer and are running out of memory, you can limit how many words to compute frequencies for.

In [48]:
all_labels = train_labels+val_labels+test_labels
print(f"The full dataset contains {len(all_labels)} instances, of which {all_labels.count(0)} are negative, and {all_labels.count(1)} are positive.")

The full dataset contains 10662 instances, of which 5331 are negative, and 5331 are positive.


In [62]:
corpus = " ".join(train_corpus+val_corpus+test_corpus)

c_vectorizer = CountVectorizer()
X = c_vectorizer.fit_transform([corpus])

counts = pd.DataFrame(X.toarray(),
                      columns=c_vectorizer.get_feature_names_out())

# getting top 10 most common words
counts.T.sort_values(by=0, ascending=False).head(10)

          0
the   10209
and    6264
of     6148
to     4275
it     3435
is     3384
in     2675
that   2658
as     1808
but    1641


# 5. Logistic Regression

1. Create the classifier and train it on the train set. If you find that it doesn't converge, try increasing the number of iterations (`max_iter` parameter), e.g. to 1500. Documentation for logistic regression can be found [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
2. Make predictions on the validation set (leave the test set aside for now)
3. Use the accuracy metric to compare the predicted labels to the ground truth labels you have from the original data. The documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).
4. Optional bonus task: scikit-learn has a very useful [dummy classifier](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html). If your classifier always predicted the majority class, how well would it do?
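The steps above can be sketched on a tiny made-up dataset. This is only an illustration under stated assumptions: the `toy_*` texts and labels are hypothetical, and `make_pipeline` is used here for brevity even though the exercise keeps the vectorizer and classifier separate. The majority-class baseline gives a reference point for reading the logistic regression accuracy.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = positive, 0 = negative
toy_train_texts = ["great film", "wonderful acting", "loved it",
                   "boring film", "terrible plot"]
toy_train_y = [1, 1, 1, 0, 0]
toy_val_texts = ["great acting", "boring plot"]
toy_val_y = [1, 0]

# Vectorizer + classifier in one object, so validation text is always
# transformed with the vocabulary learned on the training texts
toy_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1500))
toy_model.fit(toy_train_texts, toy_train_y)
toy_acc = accuracy_score(toy_val_y, toy_model.predict(toy_val_texts))

# Baseline that always predicts the most frequent training label (here: 1)
toy_baseline = make_pipeline(TfidfVectorizer(),
                             DummyClassifier(strategy="most_frequent"))
toy_baseline.fit(toy_train_texts, toy_train_y)
toy_baseline_acc = accuracy_score(toy_val_y, toy_baseline.predict(toy_val_texts))

print("Logistic regression:", toy_acc)
print("Majority baseline:", toy_baseline_acc)
```

On a perfectly balanced validation set, a majority-class baseline gets exactly half of the examples right, which is why 0.5 is the number to beat for this dataset.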

In [81]:
# Create the logistic regression classifier and train it on the training set
lr = LogisticRegression(max_iter=1500).fit(X_train, train_labels)
# X_train holds the TF-IDF features built above;
# train_labels says whether each review is positive or negative


# Make predictions on the validation set
val_pred = lr.predict(val_features)
# Predict with the fitted logistic regression

print(f"The logistic regression has an accuracy of {accuracy_score(val_labels, val_pred)}")

#Dummy Classifier - a "naive baseline" that simply predicts the majority class
from sklearn.dummy import DummyClassifier
majority_class = DummyClassifier(strategy= 'most_frequent').fit(X_train,train_labels)
val_pred_majority = majority_class.predict(val_features)

print("Majority baseline accuracy:", accuracy_score(val_labels, val_pred_majority))

The logistic regression has an accuracy of 0.7514071294559099
Majority baseline accuracy: 0.5


In [103]:
X_train

<8530x16474 sparse matrix of type '<class 'numpy.float64'>'
	with 143428 stored elements in Compressed Sparse Row format>

# 6. Testing on your own data!

1. Find 10-20 reviews outside of this dataset. You can pick any reviews you like, from rottentomatoes.com or ones you write yourself.
2. Put them into a spreadsheet and either manually annotate them with negative or positive sentiment, or use the labels provided on rottentomatoes.com. Make sure the columns are named "text" and "label" and the labels are consistent with the `rotten_tomatoes` markup scheme (1=positive, 0=negative).
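3. Save this file as a .csv file, load it into your notebook, and convert the text to TF-IDF scores with the vectorizer you already fitted on the training data (use `transform`, not `fit_transform`).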
4. Use this small dataset as a test dataset for the logistic regression classifier trained on the `rotten_tomatoes` data. Is your classifier doing better or worse than on the validation data?
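If your spreadsheet stores labels as words, one option is an explicit mapping, sketched below. It assumes the label column contains the strings "negative" and "positive"; the in-memory CSV and the `toy_*` names are hypothetical stand-ins for your own file. Unlike a two-way `np.where`, a dictionary mapping leaves unexpected values (typos, empty cells) as missing, so they can be caught before evaluation.

```python
import io
import pandas as pd

# Hypothetical stand-in for your own .csv file
toy_csv = io.StringIO(
    'text,label\n'
    '"A joy to watch.",positive\n'
    '"Two hours I will never get back.",negative\n'
)
toy_df = pd.read_csv(toy_csv)

# Map string labels to the rotten_tomatoes scheme (1 = positive, 0 = negative)
label_map = {"negative": 0, "positive": 1}
toy_df["label"] = toy_df["label"].map(label_map)

# Any label outside the mapping becomes NaN; fail early if that happens
if toy_df["label"].isna().any():
    raise ValueError("Found labels other than 'negative' or 'positive'")

toy_texts = toy_df["text"].tolist()
toy_labels = toy_df["label"].astype(int).tolist()
print(toy_labels)
```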

In [100]:
mydata = pd.read_csv("../dataset/classification1_annotation.csv")
mydata["label"] = np.where(mydata["label"] == "negative", 0, 1) # convert the labels to match the earlier markup (1=positive, 0=negative)
print(mydata.head(9))

mytest_corpus= list(mydata["text"])
mytest_labels = list(mydata["label"])
mytest_features = vectorizer.transform(mytest_corpus)

mytest_pred = lr.predict(mytest_features)
print("\nAccuracy for LR on the new test data:", accuracy_score(mytest_labels, mytest_pred))

                                                text  label
0  It's extremely sad, but imagine u have to work...      0
1  #NLProc question about publication ethics: how...      0
2  Some institutions I know never could afford pa...      0
3  Looking back, I don’t see shortages & failures...      1
4  Very excited to speak at this workshop! Defini...      1
5  People have been asking, so I wanted to make i...      1
6  Aerospace defence research institute of the Ru...      1
7  i am at the stage of my life where i need to w...      0
8  I am actually really enjoying @svpino's daily ...      1

Accuracy for LR on the new test data: 0.4444444444444444


# 7. Does pre-processing make a difference?

Everything you have done so far used only the raw text of the reviews. Now try adding pre-processing.

1. What preprocessing do you think could help the classifier? Why do you think so?
2. Implement the pre-processing step(s) of your choice and re-vectorize the rotten_tomatoes reviews.
3. Re-run the classifier. Did the accuracy improve? Why do you think it improved (or didn't)?
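As a starting point, here is a minimal sketch of plugging a custom cleaning function into the vectorizer via its `preprocessor` parameter. The helper `simple_clean` and the toy sentences are hypothetical, and stripping digits is just one illustrative choice, not a recommendation. Keep in mind that `TfidfVectorizer` already lowercases by default and its default token pattern ignores punctuation, so those two steps alone will not change the features.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def simple_clean(text):
    """Hypothetical helper: lowercase the text and strip digit sequences."""
    text = text.lower()
    return re.sub(r"\d+", " ", text)

# Hypothetical toy documents
toy_docs = ["The movie was GREAT in 2002", "the movie was dull"]

# No normalization at all, for comparison
raw_vec = TfidfVectorizer(lowercase=False)
# A custom preprocessor replaces the built-in lowercasing step
clean_vec = TfidfVectorizer(preprocessor=simple_clean)

raw_vocab = sorted(raw_vec.fit(toy_docs).vocabulary_)
clean_vocab = sorted(clean_vec.fit(toy_docs).vocabulary_)

print(len(raw_vocab), raw_vocab)
print(len(clean_vocab), clean_vocab)
```

To test a choice on the real data, re-fit the vectorizer on the training texts, transform the validation texts with it, re-train the classifier, and compare the validation accuracy with the earlier run.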

# 8. Bonus: visualize the reviews with a word cloud

Wordcloud is a nice little library to visualize text. The only required argument is the text from which the wordcloud should be generated. Removing punctuation, lowercasing and stripping English stopwords happens automatically.

- [reference](https://github.com/amueller/word_cloud/blob/master/examples/simple.py)
- [tutorial](https://www.datacamp.com/community/tutorials/wordcloud-python)

In [None]:
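# Basic usage (requires the optional wordcloud package):
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Type in your sentence (generate() raises an error if the text is empty)
sentence = ''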
wordcloud = WordCloud(background_color="white",
                      width=400
                     ).generate(sentence)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()