<a href="https://colab.research.google.com/github/Sriva29/bert-learning-analytics/blob/main/eda-and-preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Importing the Sight and Coursera Review Datasets, then Preprocessing and Cleaning Them

In [3]:
import pandas as pd

# sight_df = pd.read_csv(
#     "data/sight_dataset.csv",
#     delimiter=",",               # Specify delimiter
#     quotechar='"',               # Handle embedded quotes
#     escapechar="\\",             # Escape special characters
#     on_bad_lines="skip",         # Skip problematic lines
#     engine="python"              # Use the Python parser for flexibility
# )

# On inspection, this dataset is unlabelled, so we set it aside for testing.
#sight_df.head()

coursera_df = pd.read_csv("https://raw.githubusercontent.com/Sriva29/bert-learning-analytics/refs/heads/main/data/reviews_by_course.csv")
coursera_df.head()

   CourseId    Review                                               Label
0  2-speed-it  BOring                                               1
1  2-speed-it  Bravo !                                              5
2  2-speed-it  Very goo                                             5
3  2-speed-it  Great course - I recommend it for all, especia...    5
4  2-speed-it  One of the most useful course on IT Management!      5


In [4]:
#Inspecting the data types and checking if there are any missing values
print(coursera_df.info())
print(coursera_df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140320 entries, 0 to 140319
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   CourseId  140320 non-null  object
 1   Review    140317 non-null  object
 2   Label     140320 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 3.2+ MB
None
CourseId    0
Review      3
Label       0
dtype: int64


In [5]:
# Only 3 values are missing, so we drop those rows
coursera_df = coursera_df.dropna(subset=["Review"])
print(coursera_df.isnull().sum())

CourseId    0
Review      0
Label       0
dtype: int64


In [6]:
#Checking label distribution
print(coursera_df["Label"].value_counts())

Label
5    106514
4     22460
3      5923
1      2866
2      2554
Name: count, dtype: int64


In [8]:
# The dataset is heavily skewed toward label 5. Of the three options (undersampling label 5,
# oversampling labels 1-4, or using class weights), we chose class weights to avoid adding
# artificial data (fake reviews) to the dataset.

from sklearn.utils.class_weight import compute_class_weight
import numpy as np

#Defining the classes and their freq

classes = sorted(coursera_df["Label"].unique())
print(classes)
classes = np.array(classes)
class_weights = compute_class_weight(
    class_weight = "balanced",
    classes = classes,
    y=coursera_df["Label"]
)

#Converting to dictionary for easy ref
class_weights_dict = {classes[i]: class_weights[i] for i in range(len(classes))}
print(class_weights_dict)


[1, 2, 3, 4, 5]
{1: 9.791835310537333, 2: 10.988018794048552, 3: 4.738038156339693, 4: 1.2494835262689226, 5: 0.26347146853934694}
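
As a sanity check, the weights printed above can be reproduced by hand. With `class_weight="balanced"`, scikit-learn computes each weight as `n_samples / (n_classes * class_count)`. The sketch below hardcodes the label counts from the `value_counts()` output; the names `label_counts` and `balanced_weights` are illustrative only and not part of the notebook.

```python
# Label counts copied from the value_counts() output above
label_counts = {5: 106514, 4: 22460, 3: 5923, 1: 2866, 2: 2554}

n_samples = sum(label_counts.values())   # 140317 reviews after dropping missing values
n_classes = len(label_counts)            # 5 star ratings

# "balanced" heuristic: n_samples / (n_classes * class_count)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in sorted(label_counts.items())
}

for label, weight in balanced_weights.items():
    print(label, round(weight, 4))
```

Each class ends up with the same total weight (count times weight), which is why the rare labels 1 and 2 get weights near 10 while the dominant label 5 is scaled down to about 0.26.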


Upon visual inspection of the dataset, we noticed some issues:

1. Non-English reviews: Some of the reviews are in Spanish. Since BERT is pre-trained on English text, these reviews will cause problems and hurt fine-tuning quality.
2. Gibberish and encoding issues: Some of the reviews contain garbled text, e.g. Ð”Ð¾ÑÑ‚ÑƒÐ¿Ð½Ð¾ Ð¸ Ð¸Ð½Ñ‚ÐµÑ€ÐµÑÐ½Ð¾. We will either correct the encoding errors or drop the affected reviews.
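
Text like the sample in item 2 looks like mojibake (UTF-8 bytes that were decoded with a single-byte codec) rather than random gibberish. Below is a minimal sketch of one way to reverse it with the standard library, assuming the wrong codec is known. It uses latin-1 for illustration; the actual reviews may need cp1252 or a dedicated library such as ftfy. The helper name `repair_mojibake` is hypothetical.

```python
def repair_mojibake(text, wrong_codec="latin-1"):
    """Undo a UTF-8 -> wrong_codec mis-decoding; return text unchanged if it does not apply."""
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text contains characters outside wrong_codec, or the
        # recovered bytes are not valid UTF-8, so the text was not mojibake
        return text

# Build a garbled string the same way the damage usually happens
original = "Доступно и интересно"   # Russian: "Accessible and interesting"
garbled = original.encode("utf-8").decode("latin-1")

print(repair_mojibake(garbled) == original)  # round trip restores the text
print(repair_mojibake("Great course"))       # plain ASCII passes through
print(repair_mojibake("café"))               # genuine accents are left alone
```

A review repaired this way would still be Russian, so the English-only filter would remove it anyway; the repair mainly matters for English reviews that contain a few mis-encoded characters.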

In [9]:
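# Requires langdetect and unidecode, e.g. `pip install langdetect unidecode`

from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

# langdetect is non-deterministic by default; fixing the seed makes results reproducible
DetectorFactory.seed = 0

def detect_language(text):
    try:
        return detect(text)
    except LangDetectException:
        # Raised when the text has no detectable features (e.g. only digits or punctuation)
        return "unknown"

# Detecting the language of each review
coursera_df["Language"] = coursera_df["Review"].apply(detect_language)

# Keeping only English reviews; .copy() avoids SettingWithCopyWarning when adding columns later
english_reviews_df = coursera_df[coursera_df["Language"] == "en"].copy()

print(english_reviews_df["Language"].value_counts())
english_reviews_df.head()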

ModuleNotFoundError: No module named 'langdetect'

In [None]:
# Fixing gibberish: transliterating non-ASCII characters to their closest ASCII equivalents

from unidecode import unidecode

english_reviews_df["cleaned_review"] = english_reviews_df["Review"].apply(unidecode)

print(english_reviews_df["cleaned_review"].head())
print(english_reviews_df.shape)


In [None]:
#Checking whether original df had more langs

print(coursera_df["Language"].value_counts())



In [None]:
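# Saving the English-only reviews to a new CSV, creating data/ first if it does not exist
import os
os.makedirs("data", exist_ok=True)
english_reviews_df.to_csv("data/coursera_english_reviews.csv", index=False)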


Now that we have only English reviews, we apply standard text preprocessing: lowercasing, removing punctuation and special characters, and collapsing extra whitespace.

In [None]:
import re

# Text cleaning function
def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Applying cleaning to the transliterated text so the unidecode step above is not discarded
english_reviews_df["cleaned_review"] = english_reviews_df["cleaned_review"].apply(clean_text)

# Verifying the changes
print(english_reviews_df[["Review", "cleaned_review"]].head())
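
To make the effect of `clean_text` concrete, here is a small self-contained sketch that runs the same function on a few sample strings. The third sample is made up for illustration; it shows that a review consisting only of punctuation or symbols becomes an empty string, which may be worth filtering out before tokenization.

```python
import re

def clean_text(text):
    text = text.lower()                          # lowercase
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)   # drop punctuation and special characters
    text = re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace
    return text

# The first two samples come from the head() output; the third is hypothetical
samples = ["Bravo !", "Great course - I recommend it!", "¡¡¡ :-) !!!"]
cleaned = [clean_text(s) for s in samples]
print(cleaned)

# Reviews that end up empty carry no signal for the model
non_empty = [c for c in cleaned if c]
print(non_empty)
```

On the DataFrame, the equivalent filter could be written as `english_reviews_df[english_reviews_df["cleaned_review"].str.len() > 0]`; whether to drop such rows is a judgment call the notebook does not make.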

In [None]:
# Splitting the data
from sklearn.model_selection import train_test_split

train_data, val_data = train_test_split(english_reviews_df, test_size=0.2, stratify=english_reviews_df["Label"], random_state=37)

print(f"Training set size: {len(train_data)}")
print(f"Validation set size: {len(val_data)}")

Tokenizing with the BERT tokenizer from the Hugging Face Transformers library

In [None]:
# Analyzing dataset to determine max_length for BERT tokenization
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

review_lengths = english_reviews_df["cleaned_review"].apply(lambda x: len(tokenizer.tokenize(x)))
print(review_lengths.describe())  # Check mean, median, and max token length
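
One way to turn the `describe()` summary into a `max_length` choice is to take a high percentile of the token counts and leave room for the two special tokens (`[CLS]` and `[SEP]`) that `tokenizer.tokenize()` does not count. The sketch below uses hypothetical lengths, not the notebook's actual data; `suggest_max_length` is an illustrative helper, and 512 is the position limit of `bert-base-uncased`.

```python
import math

def suggest_max_length(token_lengths, coverage_pct=99, special_tokens=2, cap=512):
    """Smallest max_length that fully covers coverage_pct percent of the reviews."""
    ordered = sorted(token_lengths)
    # Nearest-rank percentile: the value at position ceil(p/100 * n), 1-indexed
    rank = max(1, math.ceil(coverage_pct * len(ordered) / 100))
    return min(ordered[rank - 1] + special_tokens, cap)

# Hypothetical token counts standing in for review_lengths
lengths = [3, 5, 8, 12, 20, 35, 60, 90, 130, 400]

print(suggest_max_length(lengths, coverage_pct=90))
print(suggest_max_length([600] * 5))  # very long reviews are capped at 512
```

Reviews longer than the chosen value are truncated rather than dropped, so a slightly lower percentile trades a little lost text for shorter padded sequences.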


In [None]:
print(english_reviews_df.columns)

In [None]:
#Verifying GPU usage
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA version:", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name(0))


In [None]:
from transformers import BertTokenizer

# Load a tokenizer to test
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print("Transformers library is functional!")


In [None]:
# Since token length rarely reaches 128, we keep max_length at 128
import torch
def tokenize_data(data):
    return tokenizer(
        list(data["cleaned_review"]),
        padding = True,
        truncation = True,
        max_length = 128,
        return_tensors="pt"
    )

train_encodings = tokenize_data(train_data)
val_encodings = tokenize_data(val_data)

print("Tokenization has been completed!")
