# LLM Project Goals and Objectives
### Define the Task and Requirements
- Develop a sentiment analysis tool that uses an LLM to interpret emotions in text data from social media platforms.
- Purpose: Sentiment Classification or Review Classification

**Sentiment Analysis**:

Use the LLM model to classify reviews as positive or negative based on the label.

In [1]:
import warnings
warnings.filterwarnings("ignore")

## 1. Load data

### Load ready dataset from datasets library

In [2]:
from datasets import load_dataset
data = load_dataset('imdb')
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [3]:
data['train'].features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}

**Comment**:

 - There are 3 splits: trainset (containing 25000 rows/documents), testset (containing 25000 rows/documents), and unsupervised set (containing 50000 rows/documents).

- There are 2 features: 'label' and 'text' in each dataset.

## 2. Convert data to DataFrame

In [4]:
import pandas as pd
data_train = pd.DataFrame(data['train'])

## 3. Preliminary EDA

In [5]:
data_train.head()

Unnamed: 0,text,label
0,I rented I AM CURIOUS-YELLOW from my video sto...,0
1,"""I Am Curious: Yellow"" is a risible and preten...",0
2,If only to avoid making this type of film in t...,0
3,This film was probably inspired by Godard's Ma...,0
4,"Oh, brother...after hearing about this ridicul...",0


In [6]:
# see an example of the first row of text
data_train.iloc[0,0]

'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, ev

In [7]:
# see an example of the 20th row of text
data_train.iloc[20,0]

'If the crew behind "Zombie Chronicles" ever read this, here\'s some advice guys: <br /><br />1. In a "Twist Ending"-type movie, it\'s not a good idea to insert close-ups of EVERY DEATH IN THE MOVIE in the opening credits. That tends to spoil the twists, y\'know...? <br /><br />2. I know you produced this on a shoestring and - to be fair - you worked miracles with your budget but please, hire people who can actually act. Or at least, walk, talk and gesture at the same time. Joe Haggerty, I\'m looking at you...<br /><br />3. If you\'re going to set a part of your movie in the past, only do this if you have the props and costumes of the time.<br /><br />4. Twist endings are supposed to be a surprise. Sure, we don\'t want twists that make no sense, but signposting the "reveal" as soon as you introduce a character? That\'s not a great idea.<br /><br />Kudos to the guys for trying, but in all honesty, I\'d rather they hadn\'t...<br /><br />Only for zombie completists.'

In [8]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    25000 non-null  object
 1   label   25000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 390.8+ KB


In [9]:
data_train.isnull().sum()    # Check for missing values

text     0
label    0
dtype: int64

In [10]:
data_train.nunique()       # Check for number of unique values in each feature

text     24904
label        2
dtype: int64

In [11]:
data_train.duplicated().sum()   # Check for duplicated rows

96

In [12]:
# Handle duplicates
data_train.drop_duplicates(keep='first', inplace=True)

# Check for duplicated rows again
data_train.duplicated().sum() 

0

In [13]:
data_train.shape

(24904, 2)

In [14]:
data_train['label'].value_counts()

label
1    12472
0    12432
Name: count, dtype: int64

***Comment***:

- There are **2 classes** in target variable - 'label', and the number of each class is pretty equal -> Balanced dataset.

## 4. Preprocessing data

## 4.1. Clean data

In [15]:
import re
import string
from textblob import TextBlob
import emoji
# from bs4 import BeautifulSoup

# Define a function to preprocess our messages
def clean_text(text):
    
    ## 1. Lowercase
    text = text.lower()
    
    ## 2. Remove HTML tags
    text = re.sub("<.*?>"," ", text)

    ## 3. Replace contractions with full words
    text = re.sub(r"\'m", " am", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can\'t", "can not", text)
    text = re.sub(r"n\'t", " not", text)
        
    ## 4. Remove Punctuation
    text = "".join([char for char in text if char not in string.punctuation])


    # ## 5. Replace common misspelled words with their correct forms
    # misspelling_mapping = {"u":"you", 
    #                        "gr8":"great"
    #                       }
    # for misspelling, correction in misspelling_mapping.items():
    #     text = re.sub(r"\b{}\b".format(misspelling), correction, text)

    # ## 6. Replace abbreviations with their expanded forms
    # abbreviation_mapping = {"lol":"laugh out loud",
    #                         "brb":"be right back",
    #                         "omg":"oh my god"
    #                        }
    # for abbreviation, expansion in abbreviation_mapping.items():
    #     text = text.replace(abbreviation, expansion)

    # ## 7. Correct Spelling
    # text = TextBlob(text).correct().string

    # ## 8. Handle/Translate Emoji (emojis can convey valuable information about emotions or sentiments)
    # text = emoji.demojize(text)
    
    return text

In [16]:
# Implement text cleaning

data_train['text'] = data_train['text'].apply(clean_text)
data_train.head()

Unnamed: 0,text,label
0,i rented i am curiousyellow from my video stor...,0
1,i am curious yellow is a risible and pretentio...,0
2,if only to avoid making this type of film in t...,0
3,this film was probably inspired by godard is m...,0
4,oh brotherafter hearing about this ridiculous ...,0


In [17]:
# see an example of the first row of text AFTER CLEANED
data_train.iloc[0,0]

'i rented i am curiousyellow from my video store because of all the controversy that surrounded it when it was first released in 1967 i also heard that at first it was seized by us customs if it ever tried to enter this country therefore being a fan of films considered controversial i really had to see this for myself  the plot is centered around a young swedish drama student named lena who wants to learn everything she can about life in particular she wants to focus her attentions to making some sort of documentary on what the average swede thought about certain political issues such as the vietnam war and race issues in the united states in between asking politicians and ordinary denizens of stockholm about their opinions on politics she has sex with her drama teacher classmates and married men  what kills me about i am curiousyellow is that 40 years ago this was considered pornographic really the sex and nudity scenes are few and far between even then it is not shot like some cheapl

## 4.2. Manually Tokenize text and Remove stop words

In [18]:
from nltk.corpus import stopwords
import nltk

# Download the required NLTK data files
nltk.download('stopwords')


def tokenize_and_remove_stopwords(text):
    # Tokenize the text
    tokens = text.lower().split()

    # Remove English stopwords (reduction in dimensionality)
    stop_words = stopwords.words('english')
    tokens = [word for word in tokens if word not in stop_words]  
    
    return tokens

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Vinh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [19]:
# Implement text tokenization and stop words removal

data_train['text'] = data_train['text'].apply(tokenize_and_remove_stopwords)
data_train.head()

Unnamed: 0,text,label
0,"[rented, curiousyellow, video, store, controve...",0
1,"[curious, yellow, risible, pretentious, steami...",0
2,"[avoid, making, type, film, future, film, inte...",0
3,"[film, probably, inspired, godard, masculin, f...",0
4,"[oh, brotherafter, hearing, ridiculous, film, ...",0


In [20]:
# Check datatype of 'text' (make sure it's still a list type)
data_train['text'].apply(type).value_counts()

text
<class 'list'>    24904
Name: count, dtype: int64

## 4.3. Lemmatization  
(reduction in dimensionality: reducing words to their base or dictionary form)

In [21]:
# Import necessay modules
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag

# Download the required NLTK data files 
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger') # For pos (part of speech) tagging


# Step 1: Instantiate WordNetLemmatizer
lemmatizer = WordNetLemmatizer()


# Step 2: Function to get POS tag for lemmatization
# (the get_wordnet_pos function to get the correct POS tag for each word)
def get_wordnet_pos(tag):
    """Map POS tag to first character lemmatize() accepts"""
    tag_dict = {"J": wordnet.ADJ, 
                "N": wordnet.NOUN, 
                "V": wordnet.VERB, 
                "R": wordnet.ADV
               }
    return tag_dict.get(tag[0].upper(), wordnet.NOUN)


# Step 3: Apply lemmatization
def lemmatize_tokens(tokens):
    # Perform POS tagging for all tokens at once
    pos_tags = pos_tag(tokens)
    
    # Lemmatize tokens based on POS tags
    lemmatized_tokens = [lemmatizer.lemmatize(token, get_wordnet_pos(tag)) for token, tag in pos_tags]
    
    return lemmatized_tokens

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Vinh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Vinh\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [22]:
# Implement text lemmatization

data_train['text'] = data_train['text'].apply(lemmatize_tokens)
data_train.head()

Unnamed: 0,text,label
0,"[rent, curiousyellow, video, store, controvers...",0
1,"[curious, yellow, risible, pretentious, steam,...",0
2,"[avoid, make, type, film, future, film, intere...",0
3,"[film, probably, inspire, godard, masculin, fé...",0
4,"[oh, brotherafter, hear, ridiculous, film, ump...",0


### Analyze what is the minimum, maximum, and average number of words in the tokenized reviews

In [23]:
# Calculate minimum, maximum, and average number of words in each tokenized review:
# min_words = data_train['text'].apply(len).min()
# max_words = data_train['text'].apply(len).max()
# average_words = data_train['text'].apply(len).mean()

print(f"Minimum number of words: {data_train['text'].apply(len).min()}")
print(f"Maximum number of words: {data_train['text'].apply(len).max()}")
print(f"Average number of words: {round(data_train['text'].apply(len).mean(), 0)}")

Minimum number of words: 4
Maximum number of words: 1425
Average number of words: 119.0


In [24]:
# How many total words make up the vocabulary(unique words) in the dataset?

unique_words = set()  # Create an empty set storing vocabularies
for row in data_train['text']:
    for word in row:
        unique_words.add(word)
        
print("Number of words make up the vocabulary (unique words) in the dataset:", len(unique_words))

Number of words make up the vocabulary (unique words) in the dataset: 99078


## 5. Text Data Representation - Data Transformation - Vectorization

In [26]:
# Convert list of words TO string of words that Vectorizer requires

data_train['text'] = data_train['text'].apply(lambda x: " ".join(x) )  

In [27]:
data_train['text']

0        rent curiousyellow video store controversy sur...
1        curious yellow risible pretentious steam pile ...
2        avoid make type film future film interesting e...
3        film probably inspire godard masculin féminin ...
4        oh brotherafter hear ridiculous film umpteen y...
                               ...                        
24995    hit time well categorise australian cult film ...
24996    love movie like another time try explain virtu...
24997    film sequel barry mckenzie hold two great come...
24998    adventure barry mckenzie start life satirical ...
24999    story center around barry mckenzie must go eng...
Name: text, Length: 24904, dtype: object

### Try BoW 

In [28]:
# from sklearn.feature_extraction.text import CountVectorizer
# bow_vectorizer = CountVectorizer()
# bow_X = bow_vectorizer.fit_transform(vectorizer_input)
# bow_matrix = pd.DataFrame(bow_X.toarray(), columns = bow_vectorizer.get_feature_names_out())
# bow_matrix

### Try TF-IDF 

Challenge: MemoryError

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

# vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=15000, ngram_range(1,2)) 
vectorizer = TfidfVectorizer(max_features=20000)     # Remain 20,000 words having highest score

# Tokenize and build vocabularies
vectorizer.fit(data_train['text'])   
print("Vocabularies:", vectorizer.vocabulary_)

# Encode document to vector to array
tfidf_X = vectorizer.transform(data_train['text']).toarray() 

In [30]:
# print("Number of vocabularies (unique words):", len(vectorizer.vocabulary_))    #20,000

Number of words make up the vocabulary (unique words) in the dataset: 20000


In [31]:
# Create matrix
tfidf_matrix = pd.DataFrame(tfidf_X, columns=vectorizer.get_feature_names_out())
tfidf_matrix 

Unnamed: 0,007,01,010,02,05,06,10,100,1000,10000,...,zoo,zoom,zorak,zorro,zp,zu,zucker,zulu,zuniga,zwick
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24899,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24900,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24901,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24902,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 6. Modeling: Classification on  data
Using the TF-IDF transformed data to classify data as 0,1.

In [32]:
X = tfidf_X
y = data_train['label']

## 6.1. Try XGBoost with Cross-validation

In [33]:
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix
import xgboost as xgb

model = xgb.XGBClassifier(objective='multi:softmax', num_class=5, eval_metric='mlogloss', use_label_encoder=False)

# Set cross-validation with 5 folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation and score on the data_train
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

print(f"Cross-validation scores: {scores}")
print(f"Mean cross-validation score: {scores.mean()}")

# Train the model on the entire training set (after cross-validation)
model.fit(X, y)

Cross-validation scores: [0.85003011 0.84260189 0.85264003 0.8544469  0.84658635]
Mean cross-validation score: 0.8492610554645807


## 6.2. Try XGBoost without Cross-validation

In [34]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import xgboost as xgb

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train the model on the entire training set
model.fit(X_train, y_train)

# Prediction on validation set (val)
y_val_pred = model.predict(X_val)

# Calculate accuracy on test set (val)
val_accuracy = accuracy_score(y_val, y_val_pred)
print(f"Accuracy on validation set: {val_accuracy}")

Accuracy on validation set: 0.8486247741417386


## 6.3. Try Logistic Regression with Cross-validation

In [35]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()

# Set cross-validation with 5 folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation and score on the data_train
scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')

print(f"Cross-validation scores: {scores}")
print(f"Mean cross-validation score: {scores.mean()}")

# Train the model on the entire training set (after cross-validation)
model.fit(X, y)

Cross-validation scores: [0.8863682  0.88175065 0.88697049 0.88616744 0.88674699]
Mean cross-validation score: 0.8856007527399298


## 6.4. Try Logistic Regression without Cross-validation

In [36]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)

y_val_pred = clf.predict(X_val)
acc = accuracy_score(y_val, y_val_pred)
C = confusion_matrix(y_val, y_val_pred)

print(f'Accuracy: {acc}')
print(f'Confusion matrix:\n {C}')

Accuracy: 0.8823529411764706
Confusion matrix:
 [[2155  331]
 [ 255 2240]]


***Comment***:
#### How well does this model perform? 
- The XGBoost model, with an accuracy of 85%, is also a good performer but slightly less effective than Logistic Regression, with an accuracy of 88%, in this context.

#### What are some of the limitations of these models?
- Feature Representation: TF-IDF method is used for feature extraction, they do not capture the contextual meaning of words or the order of words in a sentence. For instance, "I didn't like the movie" and "I liked the movie" might end up having similar features.
- Sparse Representations: This method often lead to high-dimensional sparse matrices, which can be computationally expensive and may not effectively capture nuanced information.
- These models might not handle rare words or phrases effectively. ----> Pre-trained language models typically perform better in capturing the semantics of rare terms due to their vast training on diverse datasets.

In [37]:
# data_test.head()

Unnamed: 0,text,label
0,I love sci-fi and am willing to put up with a ...,0
1,"Worth the entertainment value of a rental, esp...",0
2,its a totally average film with a few semi-alr...,0
3,STAR RATING: ***** Saturday Night **** Friday ...,0
4,"First off let me say, If you haven't enjoyed a...",0


In [38]:
# # Prediction on data_test
# X_test = data_test.drop(columns=['label']) 
# y_test = data_test['label']

# y_test_pred = clf.predict(X_test)

# # Calculate accuracy on data_test
# test_accuracy = accuracy_score(y_test, y_test_pred)
# C = confusion_matrix(y_test, y_test_pred)
# print(f"Accuracy on test set: {test_accuracy}")
# print(f'Confusion matrix:\n {C}')