# Workshop 2 - Data Prep


Submitted by Jirapat Sereerat 662115004

Use the IMDB Dataset of 50K Movie Reviews (train and test).
- Step 1: Text cleaning
    - Remove special characters, numbers, and extra spaces
- Step 2: Tokenization
    - Split into sentences and words
- Step 3: Lowercasing and stop word removal
    - Convert text to lowercase
- Step 4: Emoticons, stemming, and lemmatization
- Final:
    1. Check the Flesch-Kincaid readability score (report in class)
    2. Regex explanation: provide a manual "translation" of the most complex regular expression used in Step 1

*Note: the "manual" functions (commented out in the code) are only experiments based on the slides.*

## Setup and Libraries

In [1]:
# !pip install textstat
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4')

import pandas as pd
import re
import nltk
import textstat
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

df = pd.read_csv('IMDB_dataset/IMDB Dataset.csv', encoding='latin1')

print(df.head())

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


## Step 1 - Text Cleaning

In [2]:
# Emoticon Dictionary
emoticon_map = {
    ":)": "happy", ":(": "sad", ":D": "laugh", ";)": "wink", ":-(": "sad", 
    "<3": "love", ":-P": "tongue"
}

# Manual Emoticon Function (Simple Replace)
# def manual_emoticon_handling(text):
#     for emo, word in emoticon_map.items():
#         text = text.replace(emo, word)
#     return text

# re Emoticon Function (Regex / Import re) - ACTIVE 
def re_emoticon_handling(text):
    for emo, word in emoticon_map.items():
        pattern = re.escape(emo)
        text = re.sub(pattern, " " + word + " ", text)
    return text

# CLEANING
def clean_text(text):
    # Handle Emoticons
    text = re_emoticon_handling(text) 
    # text = manual_emoticon_handling(text)
    
    # Remove HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    
    # Insert a space after punctuation that is directly followed by a letter ("film.It" -> "film. It")
    text = re.sub(r'(?<=[.,!?])(?=[a-zA-Z])', ' ', text)

    # Remove Special Characters BUT KEEP PUNCTUATION
    text = re.sub(r'[^a-zA-Z0-9\s.,!?\'"]', '', text)
    
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

df['cleaned_review'] = df['review'].apply(clean_text)

print(df[['review', 'cleaned_review']].head())

                                              review  \
0  One of the other reviewers has mentioned that ...   
1  A wonderful little production. <br /><br />The...   
2  I thought this was a wonderful way to spend ti...   
3  Basically there's a family where a little boy ...   
4  Petter Mattei's "Love in the Time of Money" is...   

                                      cleaned_review  
0  One of the other reviewers has mentioned that ...  
1  A wonderful little production. The filming tec...  
2  I thought this was a wonderful way to spend ti...  
3  Basically there's a family where a little boy ...  
4  Petter Mattei's "Love in the Time of Money" is...  


## Step 2 & 3 - Tokenization, Lowercasing, and Stop Words removal

In [3]:
stop_words = set(stopwords.words('english'))

def tokenize_and_prep(text):
    # Lowercasing 
    text = text.lower()
    
    # Tokenization (whitespace split, so punctuation stays attached to words, e.g. "production.")
    words = text.split()
    
    # Remove Stop words
    filtered_words = [w for w in words if w not in stop_words]
    
    return filtered_words

df['tokens'] = df['cleaned_review'].apply(tokenize_and_prep)

print(df['tokens'].head())

0    [one, reviewers, mentioned, watching, 1, oz, e...
1    [wonderful, little, production., filming, tech...
2    [thought, wonderful, way, spend, time, hot, su...
3    [basically, there's, family, little, boy, jake...
4    [petter, mattei's, "love, time, money", visual...
Name: tokens, dtype: object
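
The whitespace split above keeps punctuation attached to tokens (`production.`, `"love`) and does not split the text into sentences. The block below is a minimal standard-library sketch of both splits, assuming simple regex rules; `split_sentences` and `split_words` are hypothetical helper names, not part of the notebook. NLTK's `sent_tokenize` and `word_tokenize` (imported in the setup cell) cover far more edge cases.

```python
import re

def split_sentences(text):
    """Naive rule: split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def split_words(sentence):
    """Lowercase word tokens; an internal apostrophe stays inside the word."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", sentence.lower())

sample = 'A wonderful little production. The filming technique is "old-time" BBC!'
sentences = split_sentences(sample)
tokens = [split_words(s) for s in sentences]

print(sentences)
# ['A wonderful little production.', 'The filming technique is "old-time" BBC!']
print(tokens[0])
# ['a', 'wonderful', 'little', 'production']
```

Because punctuation is dropped at the word level, `production.` and `production` become the same token, which matters for the stop word filter and for the stemmer in the next step.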


## Step 4 - Stemming and Lemmatization

In [4]:
# --- STEMMING ---
# Manual Stemming (Suffix Stripping)
# def manual_stemmer(word):
#     if word.endswith('ing'): return word[:-3]
#     if word.endswith('ed'): return word[:-2]
#     return word

# Import Stemming (NLTK PorterStemmer) - ACTIVE
ps = PorterStemmer()
def import_stemmer(word):
    return ps.stem(word)

# --- LEMMATIZATION ---
# Manual Lemmatization (Dictionary) 
# lemma_dict = {"better": "good", "ran": "run", "movies": "movie"}
# def manual_lemmatizer(word):
#     return lemma_dict.get(word, word)

# Import Lemmatization (NLTK WordNet) - ACTIVE
wnl = WordNetLemmatizer()
def import_lemmatizer(word):
    return wnl.lemmatize(word)

def step_4_pipeline(tokens):
    stemmed_tokens = [import_stemmer(w) for w in tokens]
    lemmatized_tokens = [import_lemmatizer(w) for w in tokens]
    return stemmed_tokens, lemmatized_tokens

df[['stemmed', 'lemmatized']] = df['tokens'].apply(
    lambda x: pd.Series(step_4_pipeline(x))
)

print(df[['tokens', 'stemmed', 'lemmatized']].head())

                                              tokens  \
0  [one, reviewers, mentioned, watching, 1, oz, e...   
1  [wonderful, little, production., filming, tech...   
2  [thought, wonderful, way, spend, time, hot, su...   
3  [basically, there's, family, little, boy, jake...   
4  [petter, mattei's, "love, time, money", visual...   

                                             stemmed  \
0  [one, review, mention, watch, 1, oz, episod, h...   
1  [wonder, littl, production., film, techniqu, u...   
2  [thought, wonder, way, spend, time, hot, summe...   
3  [basic, there', famili, littl, boy, jake, thin...   
4  [petter, mattei', "love, time, money", visual,...   

                                          lemmatized  
0  [one, reviewer, mentioned, watching, 1, oz, ep...  
1  [wonderful, little, production., filming, tech...  
2  [thought, wonderful, way, spend, time, hot, su...  
3  [basically, there's, family, little, boy, jake...  
4  [petter, mattei's, "love, time, money", visual...  
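
The commented-out `manual_stemmer` strips suffixes with no check on what remains, which is one reason the notebook uses `PorterStemmer` instead. The sketch below reproduces that experiment and compares it with a guarded variant; `guarded_stemmer` and its `min_stem` parameter are hypothetical additions for illustration only.

```python
def manual_stemmer(word):
    """Suffix stripping as in the commented-out experiment."""
    if word.endswith('ing'):
        return word[:-3]
    if word.endswith('ed'):
        return word[:-2]
    return word

def guarded_stemmer(word, min_stem=3):
    """Hypothetical variant: strip a suffix only if min_stem letters remain."""
    for suffix in ('ing', 'ed'):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ['watching', 'mentioned', 'sing', 'red']:
    print(w, '->', manual_stemmer(w), '|', guarded_stemmer(w))
# watching -> watch | watch
# mentioned -> mention | mention
# sing -> s | sing
# red -> r | red
```

Even the guarded version is only a heuristic: it knows nothing about doubled consonants (`running`) or irregular forms, which is what the Porter rules and the WordNet lemmatizer address.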

## Final - Readability & Regex Explanation

In [5]:
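# Flesch Reading Ease (higher score = easier to read).
# Note: this is not the Flesch-Kincaid Grade Level, which textstat exposes as flesch_kincaid_grade.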
def get_readability(text):
    return textstat.flesch_reading_ease(text)

# Original
df['readability_score_original'] = df['review'].apply(get_readability)

# Cleaned
df['readability_score_cleaned'] = df['cleaned_review'].apply(get_readability)

avg_original = df['readability_score_original'].mean()
avg_cleaned = df['readability_score_cleaned'].mean()

print("--- READABILITY COMPARISON (Flesch Reading Ease) ---")
print(f"Average Score (Original Text): {avg_original:.2f}")
print(f"Average Score (Cleaned Text):  {avg_cleaned:.2f}")
print("------------------------------")

# --- Generate Processed File ---
output_columns = [
    'review', 
    'cleaned_review', 
    'tokens', 
    'stemmed', 
    'lemmatized', 
    'readability_score_original', 
    'readability_score_cleaned'
]

df_final = df[output_columns]

# Save to CSV
df_final.to_csv('IMDB_Processed_Comparison.csv', index=False)
print("\nFile 'IMDB_Processed_Comparison.csv' has been generated successfully.")

--- READABILITY COMPARISON (Flesch Reading Ease) ---
Average Score (Original Text): 64.14
Average Score (Cleaned Text):  64.44
------------------------------

File 'IMDB_Processed_Comparison.csv' has been generated successfully.
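
For reference, the Flesch Reading Ease formula is `206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)`. The block below is a rough manual sketch of that formula, assuming a naive vowel-group syllable counter; `count_syllables` and `manual_reading_ease` are hypothetical names, and the scores will not match `textstat` exactly because its sentence and syllable counting is more careful.

```python
import re

def count_syllables(word):
    """Naive heuristic: one syllable per group of consecutive vowels (minimum 1)."""
    groups = re.findall(r'[aeiouy]+', word.lower())
    return max(1, len(groups))

def manual_reading_ease(text):
    """Flesch Reading Ease from rough sentence, word, and syllable counts."""
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        return 0.0  # nothing to score
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

score = manual_reading_ease("The cat sat on the mat. It was happy.")
print(f"{score:.1f}")  # 2 sentences, 9 words, 10 syllables
```

Short sentences and short words push the score up, which is consistent with the small increase observed after cleaning: removing HTML tags and stray symbols slightly changes the word and sentence counts.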


### Regular Expressions (Regex) Explanation

#### 1. HTML Tag Removal
**Code:** `re.sub(r'<[^>]*>', '', text)`

**Translation:** "Find any text that starts with `<` and ends with `>`, and delete it."

| Symbol | Function |
| :--- | :--- |
| **`<`** | Matches the starting angle bracket. |
| **`[^>]*`** | Matches any character that is **NOT** a closing bracket (`>`), repeated zero or more times. |
| **`>`** | Matches the closing angle bracket. |

* **Before:** `"This movie <br />was terrible."`
* **After:** `"This movie was terrible."`
* **Caveat:** the tag is deleted, not replaced with a space, so a tag with no whitespace around it (`"movie<br />was"`) leaves the two words joined (`"moviewas"`).
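
A small sketch of the two replacement choices, assuming a made-up review string: deleting the tag versus replacing it with a space and collapsing whitespace afterwards. The variable names are illustrative only.

```python
import re

TAG = r'<[^>]*>'
raw = "This movie<br />was terrible. <br /><br />Avoid it."

# Replacement used in clean_text: the tag is deleted outright.
deleted = re.sub(TAG, '', raw)

# Alternative: replace with a space, then collapse runs of whitespace.
spaced = re.sub(r'\s+', ' ', re.sub(TAG, ' ', raw)).strip()

print(deleted)  # This moviewas terrible. Avoid it.
print(spaced)   # This movie was terrible. Avoid it.
```

In the IMDB reviews most `<br />` tags sit next to a space or a sentence end, so the two variants usually agree, but the space variant is safer when they do not.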

---

#### 2. Grammar Fixer (The "Space Inserter")
**Code:** `re.sub(r'(?<=[.,!?])(?=[a-zA-Z])', ' ', text)`

**Translation:** "Find the **position** right after a punctuation mark and right before a letter, and insert a space there."

This uses **Lookarounds** (Zero-width assertions). It doesn't consume characters; it only checks boundaries.

| Symbol | Name | Function |
| :--- | :--- | :--- |
| **`(?<=...)`** | **Positive Lookbehind** | Checks if the previous character matches the set `[.,!?]`. |
| **`(?=...)`** | **Positive Lookahead** | Checks if the next character matches a letter `[a-zA-Z]`. |

* **Why?** Many reviews contain typos like `"movie.It"`, which a whitespace tokenizer would keep as a single token.
* **Before:** `"I loved the film.It was great."`
* **After:** `"I loved the film. It was great."`
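
The sketch below runs the same lookaround pattern on two made-up strings to show what it touches and what it leaves alone; `insert_space` is a hypothetical wrapper name.

```python
import re

SPACE_AFTER_PUNCT = r'(?<=[.,!?])(?=[a-zA-Z])'

def insert_space(text):
    """Insert a space at every position between punctuation and a letter."""
    return re.sub(SPACE_AFTER_PUNCT, ' ', text)

print(insert_space("I loved the film.It was great."))
# I loved the film. It was great.
print(insert_space("Rated 3.5 stars, e.g. by fans."))
# Rated 3.5 stars, e. g. by fans.
```

Decimals such as `3.5` are untouched because the lookahead requires a letter, but abbreviations such as `e.g.` are split, a side effect worth keeping in mind before sentence tokenization.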

---

#### 3. Smart Special Character Cleaning
**Code:** `re.sub(r'[^a-zA-Z0-9\s.,!?\'"]', '', text)`

**Translation:** "Delete any character that is **NOT** a letter, number, space, or standard punctuation mark."

Unlike a stricter filter that strips every non-alphanumeric character, this whitelist approach keeps sentence punctuation, so sentence boundaries survive for the later steps.

| Symbol | Function |
| :--- | :--- |
| **`^`** | When inside `[...]`, it means **NOT**. Matches everything *except* what follows. |
| **`a-zA-Z`** | Matches all English letters. |
| **`0-9`** | Matches numbers. |
| **`\s`** | Matches whitespace (spaces, tabs, newlines). |
| **`.,!?\'"`** | **The Whitelist:** Explicitly keeps dots, commas, exclamation marks, question marks, and quotes. |

* **Before:** `"Wow!!! It cost $100 & was 100% fun :)"`
* **After:** `"Wow!!! It cost 100  was 100 fun "` (Note: `$`, `&`, `%`, `:`, and `)` are removed; `!` is kept. The leftover double space is collapsed by the extra space removal step.)
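
The following sketch applies the whitelist pattern on its own and then after a simplified emoticon replacement, to illustrate why `clean_text` handles emoticons first. The plain `str.replace` call stands in for `re_emoticon_handling` and is an assumption made for brevity.

```python
import re

NOT_ALLOWED = r'[^a-zA-Z0-9\s.,!?\'"]'
raw = "Wow!!! It cost $100 & was 100% fun :)"

# Whitelist alone: the emoticon is destroyed along with the symbols.
stripped = re.sub(NOT_ALLOWED, '', raw)
tidy = re.sub(r'\s+', ' ', stripped).strip()

# Emoticon mapped to a word first, then the same whitelist.
mapped = raw.replace(":)", " happy ")
tidy_with_emoticon = re.sub(r'\s+', ' ', re.sub(NOT_ALLOWED, '', mapped)).strip()

print(repr(stripped))            # 'Wow!!! It cost 100  was 100 fun '
print(repr(tidy))                # 'Wow!!! It cost 100 was 100 fun'
print(repr(tidy_with_emoticon))  # 'Wow!!! It cost 100 was 100 fun happy'
```

Running the whitelist before the emoticon step would discard the sentiment signal that the emoticon dictionary is meant to preserve.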

---

#### 4. Extra Space Removal
**Code:** `re.sub(r'\s+', ' ', text).strip()`

**Translation:** "Find any run of one or more whitespace characters and replace it with a single space, then trim whitespace from both ends."

| Symbol | Function |
| :--- | :--- |
| **`\s`** | Matches whitespace (space, tab, newline). |
| **`+`** | Quantifier meaning "one or more times". |

* **Before:** `"Movie    was  good."`
* **After:** `"Movie was good."`

---

#### 5. Sentence Splitting (For Manual Score)
**Code:** `re.split(r'[.!?]+', text)`

**Translation:** "Split the text into a list whenever you see one or more periods, exclamation marks, or question marks."

| Symbol | Function |
| :--- | :--- |
| **`[...]`** | Character set matching dot, exclamation, or question mark. |
| **`+`** | Matches one or more (handles multiple marks like `..` or `?!`). |

* **Input:** `"Really?! I didn't know."`
* **Output List:** `["Really", " I didn't know", ""]`
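
The split leaves an empty string at the end and leading spaces on later pieces, so a manual score needs a small cleanup before counting sentences. A minimal sketch, with `naive_sentences` as a hypothetical helper name:

```python
import re

def naive_sentences(text):
    """Split on runs of . ! ? and drop empty or whitespace-only pieces."""
    return [p.strip() for p in re.split(r'[.!?]+', text) if p.strip()]

print(re.split(r'[.!?]+', "Really?! I didn't know."))
# ['Really', " I didn't know", '']
print(naive_sentences("Really?! I didn't know."))
# ['Really', "I didn't know"]
print(naive_sentences("Dr. Smith loved it."))
# ['Dr', 'Smith loved it']
```

The last call shows the main weakness of this rule: abbreviations count as sentence ends, which inflates the sentence count and therefore the readability score.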

---

#### 6. Emoticon Escaping
**Code:** `re.escape(emo)`

**Translation:** "Automatically add backslashes `\` before special characters in a string so Regex treats them as text, not commands."

* **Input:** `":)"`
* **Output:** `":\)"` (the parenthesis is escaped so regex treats it as a literal character, not the end of a group; the colon has no special meaning and is left unescaped in Python 3.7+).
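
As a sketch of how escaping is used in practice, the block below escapes a subset of the emoticon dictionary and, as one possible alternative to looping over the dictionary, combines the escaped strings into a single alternation pattern. `replace_emoticons` and `EMOTICON_PATTERN` are hypothetical names, not part of the notebook.

```python
import re

# Subset of the notebook's emoticon_map, for illustration.
emoticon_map = {":)": "happy", ":-(": "sad", "<3": "love"}

print(re.escape(":)"))  # :\)   (Python 3.7+)

# One alternation of escaped emoticons, longest first so that a longer
# emoticon is tried before any shorter one that shares its prefix.
EMOTICON_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(emoticon_map, key=len, reverse=True))
)

def replace_emoticons(text):
    """Replace each known emoticon with its word, padded by spaces."""
    return EMOTICON_PATTERN.sub(lambda m: " " + emoticon_map[m.group(0)] + " ", text)

result = replace_emoticons("Great film :) <3")
print(" ".join(result.split()))  # Great film happy love
```

Without `re.escape`, the `)` in `:)` would be read as a group delimiter and `re.compile` would raise an error; a single compiled pattern also scans each review once instead of once per emoticon.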

---