In [None]:
import numpy as np
import random
# Set seed for reproducibility
np.random.seed(42)  # Set seed for NumPy
random.seed(42) # Set seed for random module

## Introduction

As in last week's tutorial, we will again perform a **sentiment analysis**. The difference is that this time we want to learn the sentiment of words using **supervised learning** instead of looking them up in a dictionary.

The dataset is the same as last week, so we are still trying to classify movie reviews (from IMDB) as positive or negative.

## Data

The dataset we will use contains movie reviews from IMDB. Initially the data is stored as a dataframe with three columns (id, sentiment_human, text).

*Run the code below.*

In [None]:
import pandas as pd
#Loading the data from a csv file
reviews = pd.read_csv("https://raw.githubusercontent.com/kbrennig/MODS_WS24_25/refs/heads/main/data/imdb_sample.csv")

Since we want to perform a classification afterward, we recode the output variable such that a positive sentiment is represented by 1 and a negative sentiment by 0, so that it can be used in supervised learning models.

*Run the code below.*

In [None]:
# Recode sentiment_human
reviews['sentiment_positive'] = np.where(reviews['sentiment_human'] == 'positive', 1, 0)


## Convert data to text format for processing
We will use the text of the movie reviews that we used last week for further analysis.

*Run the code below.*

In [None]:
# Display the first row of the data
reviews['text'].iloc[0]

## Preprocessing

Since unstructured data doesn't have an inherent and consistent structure we have to perform some preprocessing steps in order to make the data usable for the computer.
One thing to keep in mind is that the more preprocessing we perform the more information we lose, but the basic methods we are using here require it.

### Tokenize documents
First, we tokenize the texts. This means we transform each text from one long string into a list of tokens.

### Lemmatize all words
After tokenizing the texts we perform lemmatization (alternatively stemming could be performed). Lemmatization replaces each word with its dictionary form ([lemma](https://en.wikipedia.org/wiki/Lemma_(morphology))).

### Remove stopwords
Finally, we remove stopwords, i.e., very common words that carry little meaning on their own (e.g., 'this', 'the', 'a').


Additionally, we remove tokens that consist of punctuation. Note that the function below does not remove numbers.

*Run the code below.*

In [None]:
# Preprocessing
import nltk
import string
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download the required NLTK resources (tokenizer, stopword list, WordNet)
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Define a function with all necessary preprocessing steps for our IMDB reviews.
# In comparison to last week, we now use lemmatization instead of stemming.

# Create the lemmatizer and the English stopword set once instead of on every call
lemmatizer = WordNetLemmatizer()
english_stopwords = set(nltk.corpus.stopwords.words("english"))

def preprocess(text):
    # Tokenize the text
    tokens = nltk.word_tokenize(text)

    # Lemmatize each token
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Remove stopwords (case-insensitive comparison)
    filtered_tokens = [token for token in lemmatized_tokens if token.lower() not in english_stopwords]

    # Remove tokens that are punctuation characters
    filtered_tokens_nopunct = [token for token in filtered_tokens if token not in string.punctuation]

    return " ".join(filtered_tokens_nopunct)
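
One detail of the punctuation filter is worth knowing: `string.punctuation` is a single string, so `token not in string.punctuation` is a substring test. The following is a minimal sketch with hypothetical tokens (chosen to resemble what a tokenizer may emit for quotes, ellipses, and dashes); it compares the substring test with a character-wise test. The names `kept_substring` and `kept_charwise` are only used for illustration.

```python
import string

# Hypothetical tokens for illustration only
tokens = ["great", "!", "...", "``", "''", "movie", "--"]

# Substring test (as in preprocess): only tokens that occur verbatim
# inside string.punctuation are removed, e.g. "!" but not "..."
kept_substring = [t for t in tokens if t not in string.punctuation]

# Character-wise test: remove tokens made up entirely of punctuation characters
kept_charwise = [t for t in tokens if not all(ch in string.punctuation for ch in t)]

print(kept_substring)
print(kept_charwise)
```

If multi-character punctuation tokens show up in the processed reviews, the character-wise variant is one possible way to filter them as well.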

## Apply preprocessing

After defining the different preprocessing steps, we now apply these preprocessing steps to our IMDB reviews. Running the code below we apply the preprocess function to the "text" column of our data and save the new preprocessed reviews as a new column in our dataset.

*Run the code below.*

In [None]:
# Apply text preprocessing
reviews['processed_text'] = reviews['text'].apply(preprocess)
reviews['processed_text'].iloc[0]  # Display first processed review

## Remove irrelevant words
In this case, we manually remove specific words that are irrelevant to the analysis.

*Run the code below.*

In [None]:
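# Remove additional irrelevant words (amp, document)
# \b restricts the match to whole words, so that e.g. "example" or "documentary" stay intact
reviews['processed_text'] = reviews['processed_text'].str.replace(r'\b(?:amp|document)\b', '', regex=True)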
reviews['processed_text'].iloc[0] 
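
When removing words with a regular expression, it matters whether the pattern matches whole words or any substring. The sketch below uses two made-up example strings (not taken from the dataset) to compare a plain substring pattern with a word-boundary pattern.

```python
import pandas as pd

# Made-up example texts for illustration only
texts = pd.Series(["great example amp camp", "document documentary"])

# Substring pattern: also removes "amp" inside "example" and "camp"
substring_removed = texts.str.replace("amp|document", "", regex=True)

# Word-boundary pattern: removes only the standalone words
word_removed = texts.str.replace(r"\b(?:amp|document)\b", "", regex=True)

print(substring_removed.tolist())
print(word_removed.tolist())
```

The leftover double spaces are harmless here, since the vectorizers used later split on word boundaries anyway.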


## Supervised Sentiment Analysis
Last week we performed **dictionary-based Sentiment Analysis** where we used a dictionary to look up each word's sentiment. This week we want to learn the sentiment of the words contained in the reviews by using **supervised learning**. This requires some additional transformations of the dataset, which we perform in the following steps.

### Prepare data for classifier
We split the data into a training and test set for supervised learning.

*Run the code below.*

In [None]:
from sklearn.model_selection import train_test_split

#Drop original review text from the data as we don't need it anymore
reviews_preprocessed = reviews.drop(columns="text")

#define X and y
X = reviews_preprocessed.drop(columns=['sentiment_positive'])
y = reviews_preprocessed['sentiment_positive']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
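
To get a feeling for what the split does, here is a small sketch on toy data (the arrays `X_toy` and `y_toy` are made up for illustration). It shows the resulting sizes for `test_size=0.2` and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 observations with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# 80% of the rows go to the training set, 20% to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Same random_state -> identical split
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```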

### Construct document-term matrix for training set and test set
We transform the training and the test data into matrices where each row represents a text and each column represents a word (token). At this stage the entries are raw counts: the cell (i,j) contains the number of occurrences of the j-th word in the i-th document (TF-IDF weights are applied in a later step). The resulting matrix has 5,000 rows (documents) and 33,071 columns (features/tokens).

*Run the code below.*

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Vectorize the text using Term Frequency
count_vectorizer = CountVectorizer() 
tf_vectorizer = count_vectorizer.fit(X_train['processed_text'])
reviews_matrix_train = tf_vectorizer.transform(X_train['processed_text'])
reviews_matrix_test = tf_vectorizer.transform(X_test['processed_text'])

# Show the shape of the resulting matrix
print(reviews_matrix_train.shape)
print(reviews_matrix_test.shape)

# Turn matrix into a DataFrame so it can be used for supervised learning and merge with sentiment
reviews_matrix_train_df = pd.DataFrame(reviews_matrix_train.toarray(), columns=count_vectorizer.get_feature_names_out())
reviews_matrix_test_df = pd.DataFrame(reviews_matrix_test.toarray(), columns=count_vectorizer.get_feature_names_out())

# Display the first few rows of the training matrix
reviews_matrix_train_df.head()

### Remove rare terms from the train and test matrix (min_df)

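In order to decrease the size of the matrix, we filter out tokens that appear in fewer than 15 training documents (`min_df=15`). Note that `min_df` is a threshold on the number of documents containing a token, not on the column sum. This already decreases the number of columns from 33,071 to 3,927.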

*Run the code below.*

In [None]:
# Vectorize the text using Term Frequency
count_vectorizer = CountVectorizer(min_df=15) #define min_df
tf_vectorizer = count_vectorizer.fit(X_train['processed_text'])
reviews_matrix_train = tf_vectorizer.transform(X_train['processed_text'])
reviews_matrix_test = tf_vectorizer.transform(X_test['processed_text'])

# Show the shape of the resulting matrix
print(reviews_matrix_train.shape)
print(reviews_matrix_test.shape)

# Turn matrix into a DataFrame so it can be used for supervised learning and merge with sentiment
reviews_matrix_train_df = pd.DataFrame(reviews_matrix_train.toarray(), columns=count_vectorizer.get_feature_names_out())
reviews_matrix_test_df = pd.DataFrame(reviews_matrix_test.toarray(), columns=count_vectorizer.get_feature_names_out())

# Display the first few rows of the training matrix
reviews_matrix_train_df.head()

### Transform counts to TF-IDF weights

In the previous step we represented each document by its **term frequencies (TF)** and filtered out very rare words (via `min_df`), since they most likely won't contribute a lot to our model. On the other hand, terms that appear in almost every document are most likely not very informative either, but TF alone isn't enough to determine that. That's why we need an additional metric called **inverse document frequency (IDF)**, which takes on large values for terms that appear in only a few documents and small values for terms that appear in many documents.

When we put these two metrics together we get the **TF-IDF = TF\*IDF**. We use this formula to recalculate our matrix entries.

*Run the code below.*

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the text using TF-IDF
vectorizer = TfidfVectorizer(min_df=15) #define min_df
tfidf_vectorizer = vectorizer.fit(X_train['processed_text'])
reviews_matrix_train_tfidf = tfidf_vectorizer.transform(X_train['processed_text'])
reviews_matrix_test_tfidf = tfidf_vectorizer.transform(X_test['processed_text'])

# Show the shape of the resulting matrix
print(reviews_matrix_train_tfidf.shape)
print(reviews_matrix_test_tfidf.shape)

# Turn matrix into a DataFrame so it can be used for supervised learning and merge with sentiment
reviews_matrix_train_df_tfidf = pd.DataFrame(reviews_matrix_train_tfidf.toarray(), columns=vectorizer.get_feature_names_out())
reviews_matrix_test_df_tfidf = pd.DataFrame(reviews_matrix_test_tfidf.toarray(), columns=vectorizer.get_feature_names_out())

# Display the first few rows of the training matrix
reviews_matrix_train_df_tfidf.head()
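
To make the TF-IDF weighting more concrete, the sketch below recomputes it by hand on three made-up documents and compares the result with `TfidfVectorizer`. It assumes scikit-learn's default settings, i.e., the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by an L2 normalization of each row; other libraries or settings may use slightly different formulas.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents for illustration only
docs = ["good movie", "bad movie", "good good plot"]

# Term frequencies: raw counts per document
counts = CountVectorizer().fit_transform(docs).toarray()

# Document frequency: number of documents containing each term
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF (scikit-learn default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF * IDF, then scale each row to unit length (L2 norm)
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```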

## Train and evaluate classifier
From here on, almost everything is the same as before when we performed classification with the only difference being that the input features are now terms instead of numbers or categorical values.


### Train random forest classifier
We train a random forest classifier on the training set (without hyperparameter tuning) to classify the sentiment based on the processed text features.

*Run the code below.*

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest classifier
rf_01 = RandomForestClassifier(random_state=42).fit(reviews_matrix_train_df_tfidf, y_train)


Several parameters can be specified when training a random forest. The __number of trees__ (`n_estimators`), for example, determines how many trees are trained.   
For a complete list of the parameters of `RandomForestClassifier`, you can have a look at its [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

*Run the code below.*


In [None]:
print("Parameters:", rf_01.get_params())

### Make predictions and calculate evaluation metrics on test set

Similarly to last week, we can make predictions on the test set and calculate different evaluation metrics.

*Run the code below.*

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score

predictions_testset_rf01 = rf_01.predict_proba(reviews_matrix_test_df_tfidf)[:, 1]
predictions_testset_rf01_binary = np.where(predictions_testset_rf01 > 0.5, 1, 0)

# Calculate Accuracy

accuracy_rf = accuracy_score(y_test, predictions_testset_rf01_binary)
print("Accuracy (Random Forests):", accuracy_rf)

# Create the confusion matrix
ConfusionMatrixDisplay.from_predictions(y_test, predictions_testset_rf01_binary)
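
The relation between the confusion matrix and the accuracy can be verified on a few made-up predictions. The sketch below thresholds hypothetical probabilities at 0.5, reads the four cells of the confusion matrix, and recomputes the accuracy by hand.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predicted probabilities for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0])
proba_pos = np.array([0.9, 0.2, 0.4, 0.7, 0.6, 0.1])

# Turn probabilities into binary predictions with a threshold of 0.5
y_pred = np.where(proba_pos > 0.5, 1, 0)

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Accuracy = share of correct predictions
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)
print(manual_accuracy, accuracy_score(y_true, y_pred))
```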

### ROC and AUC

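Plot the ROC curve and calculate the AUC on the test set.
With __binary__ predictions the ROC curve has only one intermediate point, so it consists of relatively straight line segments. The predicted __probability__ yields one point per threshold and therefore maps the trade-off between true positive rate and false positive rate more precisely. That is why we use the classification probability (e.g., `predictions_testset_rf01`) to calculate the AUC.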

*Run the code below.*

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import RocCurveDisplay

# Calculate and Print the AUC score
auc_score = roc_auc_score(y_test, predictions_testset_rf01)
print("AUC Score:", auc_score)

#plot ROC curve
RocCurveDisplay.from_predictions(y_test, predictions_testset_rf01, plot_chance_level=True)
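
The effect of using probabilities instead of binary predictions can be seen on a small made-up example. In the sketch below, the AUC based on probabilities is higher than the AUC based on thresholded predictions, because thresholding discards the ranking information among observations on the same side of the threshold.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for illustration only
y_true = np.array([0, 0, 1, 1, 1])
proba_pos = np.array([0.1, 0.6, 0.4, 0.7, 0.9])

# Binary predictions with a threshold of 0.5
y_binary = np.where(proba_pos > 0.5, 1, 0)

# AUC from probabilities uses the full ranking of the observations
auc_proba = roc_auc_score(y_true, proba_pos)

# AUC from binary predictions only has a single threshold to work with
auc_binary = roc_auc_score(y_true, y_binary)

print("AUC (probabilities):", auc_proba)
print("AUC (binary predictions):", auc_binary)
```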

## Summary
In this tutorial, we:

1. Preprocessed the text (tokenization, lemmatization, stopword removal, etc.).
2. Split the data into a training and a test set.
3. Transformed the data into a matrix using TF-IDF.
4. Performed supervised sentiment analysis using a random forest classifier.
5. Evaluated the model using accuracy, confusion matrix, ROC curve, and AUC.

## Perform additional analyses below

You can use the cell below to perform and evaluate different sentiment analyses.

In [None]:
# Enter your Code here!