# Assignment 2: Milestone I Natural Language Processing
## Task 1. Basic Text Pre-processing
#### Student Name: Tong Minh Hieu Le
#### Student ID: 4098368


Environment: Python 3 and Jupyter notebook

Libraries used:
* pandas
* numpy
* nltk (RegexpTokenizer, sent_tokenize, FreqDist)
* itertools (chain)

## Introduction

In this notebook, we perform basic text pre-processing on the given dataset, including, but not limited to, tokenization, stop word removal, and removal of the most and least frequent words. We pre-process the "Review Text" column only. The steps are:
1. Extract the review text and apply the pre-processing steps below to the extracted reviews.
2. Tokenize each clothing review. The word tokenization must use the regular expression r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?".
3. Convert all words to lower case.
4. Remove words with a length less than 2.
5. Remove stop words using the provided stop word list (stopwords_en.txt), located in the same downloaded folder.
6. Remove words that appear only once in the document collection, based on term frequency.
7. Remove the top 20 most frequent words, based on document frequency.
8. Save the processed data as processed.csv.
9. Build a vocabulary of the cleaned reviews and save it in a txt file (please refer to the Required Output section).

## Importing libraries 

In [1]:
# Import the libraries used in this notebook
import pandas as pd
import numpy as np
from nltk import RegexpTokenizer
from nltk.tokenize import sent_tokenize
from itertools import chain
from nltk.probability import FreqDist  # explicit import instead of a wildcard

### 1.1 Examining and loading data
- Examine the data and explain your findings
- Load the data into proper data structures and get it ready for processing.

In [2]:
# Load the provided data file into a DataFrame
df = pd.read_csv('assignment3.csv')

In [3]:
df.shape

(19662, 10)

In [4]:
df.head()

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
1,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
2,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses
3,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
4,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,5,1,1,General Petite,Tops,Knits


In [5]:
# Count the columns that contain at least one missing value
df.isnull().any().sum()

0
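Note that `df.isnull().any().sum()` counts the columns that contain at least one missing value, not the number of missing cells. The result of 0 above still means that no value is missing, but the distinction matters on other data. A minimal sketch on a toy frame (the frame `toy` and its values are invented for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data, not taken from the assignment dataset
toy = pd.DataFrame({
    "Title": ["Nice", None, "Too small"],
    "Review Text": ["Love it", None, None],
    "Rating": [5, 3, 2],
})

cols_with_nulls = toy.isnull().any().sum()   # number of columns with at least one null
total_null_cells = toy.isnull().sum().sum()  # number of null cells overall
per_column = toy.isnull().sum()              # null count for each column

print(cols_with_nulls, total_null_cells)
print(per_column)
```

Here two columns contain nulls, while three cells are null in total, so the two counts can differ.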

In [6]:
# Check data types
df.dtypes

Clothing ID                 int64
Age                         int64
Title                      object
Review Text                object
Rating                      int64
Recommended IND             int64
Positive Feedback Count     int64
Division Name              object
Department Name            object
Class Name                 object
dtype: object

In [7]:
df.describe()

Unnamed: 0,Clothing ID,Age,Rating,Recommended IND,Positive Feedback Count
count,19662.0,19662.0,19662.0,19662.0,19662.0
mean,921.297274,43.260808,4.183145,0.818177,2.652477
std,200.227528,12.258122,1.112224,0.385708,5.834285
min,1.0,18.0,1.0,0.0,0.0
25%,861.0,34.0,4.0,1.0,0.0
50%,936.0,41.0,5.0,1.0,1.0
75%,1078.0,52.0,5.0,1.0,3.0
max,1205.0,99.0,5.0,1.0,122.0


### 1.2 Pre-processing data
Perform the required text pre-processing steps.

1. **Extract review information**: Extract the text from the "Review Text" column for processing.

2. **Tokenize reviews**: Use the regular expression `r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"` to tokenize each clothing review into individual words.

3. **Convert to lowercase**: Transform all words to lowercase to ensure consistency.

4. **Remove short words**: Remove words with length less than 2 characters.

5. **Remove stopwords**: Filter out common stopwords using the provided stopwords_en.txt file.

6. **Remove infrequent words**: Eliminate words that appear only once in the entire collection.

7. **Remove most frequent words**: Remove the top 20 most frequent words based on document frequency.

8. **Save processed data**: Store the processed reviews in a CSV file named "processed.csv".

#### 1.2.1 Extract review information: extract the text from the "Review Text" column for processing

In [8]:
reviews = df['Review Text']

In [9]:
len(reviews)

19662

#### 1.2.2 + 1.2.3: Tokenize reviews and convert to lowercase

In [10]:
def tokenizeReview(raw_review):
    """
    Lowercase the raw review, split it into sentences, tokenize each
    sentence with the required regular expression, and return the review
    as a flat list of tokens. Non-string input (e.g. NaN) yields [].
    """
    # Handle NaN or non-string values
    if not isinstance(raw_review, str):
        return []
        
    nl_review = raw_review.lower()  # convert all words to lowercase
    
    # segment into sentences
    sentences = sent_tokenize(nl_review)
    
    # tokenize each sentence
    pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
    tokenizer = RegexpTokenizer(pattern)
    token_lists = [tokenizer.tokenize(sen) for sen in sentences]
    
    # merge them into a list of tokens
    tokenised_review = list(chain.from_iterable(token_lists))
    return tokenised_review

In [11]:
tk_reviews = [tokenizeReview(r) for r in reviews]  # one list of tokens per review

In [12]:
def stats_print(tk_reviews):
    words = list(chain.from_iterable(tk_reviews))  # flatten all tokens in the corpus into a single list
    vocab = set(words)  # the vocabulary is the set of unique tokens
    lexical_diversity = len(vocab)/len(words)
    print("Vocabulary size: ",len(vocab))
    print("Total number of tokens: ", len(words))
    print("Lexical diversity: ", lexical_diversity)
    print("Total number of reviews:", len(tk_reviews))
    lens = [len(review) for review in tk_reviews]
    print("Average review length:", np.mean(lens))
    print("Maximun review length:", np.max(lens))
    print("Minimun review length:", np.min(lens))
    print("Standard deviation of review length:", np.std(lens))

In [13]:
# Index of the sample review used to check the result of each step
test_index = 1

In [14]:
print("Raw review:\n",reviews[test_index],'\n')
print("Tokenized review:\n",tk_reviews[test_index])

Raw review:
 I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments! 

Tokenized review:
 ['i', 'love', 'love', 'love', 'this', 'jumpsuit', "it's", 'fun', 'flirty', 'and', 'fabulous', 'every', 'time', 'i', 'wear', 'it', 'i', 'get', 'nothing', 'but', 'great', 'compliments']
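The required pattern keeps at most one internal hyphen or apostrophe per token, so contractions such as "it's" survive while longer compounds are split. The sketch below uses `re.findall` from the standard library as a stand-in for `RegexpTokenizer`, on the assumption that the two agree for a pattern without capturing groups; `demo_tokenize` and the sample sentence are hypothetical.

```python
import re

# The pattern required by the assignment specification
PATTERN = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

def demo_tokenize(text):
    """Lowercase the text and return every non-overlapping match of PATTERN."""
    return re.findall(PATTERN, text.lower())

# Hypothetical sample sentence
sample = "It's a well-made top, but my mother-in-law found it too short at 5'4."
tokens = demo_tokenize(sample)
print(tokens)
```

In this sample, "mother-in-law" is split into "mother-in" and "law" because the optional group matches only once, and "5'4" yields no token because it contains no letters.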


In [15]:
stats_print(tk_reviews)

Vocabulary size:  14806
Total number of tokens:  1206688
Lexical diversity:  0.012269948818584423
Total number of reviews: 19662
Average review length: 61.37157969687723
Maximun review length: 113
Minimun review length: 2
Standard deviation of review length: 27.802596969841698
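Lexical diversity is the number of unique tokens divided by the total number of tokens; for the output above that is 14806 / 1206688, roughly 0.0123. A minimal sketch of the same ratio on a toy corpus (the helper `lexical_diversity` and the toy reviews are illustrative, not part of the notebook):

```python
from itertools import chain

def lexical_diversity(tokenized_reviews):
    """Return unique tokens / total tokens, or 0.0 for an empty corpus."""
    words = list(chain.from_iterable(tokenized_reviews))
    if not words:
        return 0.0  # avoid dividing by zero when there are no tokens
    return len(set(words)) / len(words)

# Hypothetical toy corpus: 3 unique tokens out of 5
toy_reviews = [["love", "love", "dress"], ["dress", "fits"]]
print(lexical_diversity(toy_reviews))
```

A low value indicates that a small vocabulary is reused heavily, which is expected before stop words are removed.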


#### 1.2.4 Remove short words: remove words with length less than 2 characters

In [16]:
# filter out single character tokens
tk_reviews = [[w for w in review if len(w) >=2] \
                      for review in tk_reviews]

In [17]:
print("Tokenized review:\n",tk_reviews[test_index])

Tokenized review:
 ['love', 'love', 'love', 'this', 'jumpsuit', "it's", 'fun', 'flirty', 'and', 'fabulous', 'every', 'time', 'wear', 'it', 'get', 'nothing', 'but', 'great', 'compliments']


In [18]:
stats_print(tk_reviews)

Vocabulary size:  14780
Total number of tokens:  1109634
Lexical diversity:  0.013319707218776641
Total number of reviews: 19662
Average review length: 56.43545926151968
Maximun review length: 104
Minimun review length: 2
Standard deviation of review length: 25.39546596696992


#### 1.2.5 Remove stopwords: filter out common stopwords using the provided stopwords_en.txt file

In [19]:
# Loading the stop words
with open('stopwords_en.txt', 'r') as f:
    stopwords = f.read().splitlines()

In [20]:
len(stopwords)

571

In [21]:
# Filter out stopwords from tokenized reviews
stopword_set = set(stopwords)  # set membership tests avoid scanning the whole list
tk_reviews = [[w for w in review if w not in stopword_set]
              for review in tk_reviews]

# Check the result on the sample review
print("Tokenized review after removing stopwords:\n", tk_reviews[test_index])

Tokenized review after removing stopwords:
 ['love', 'love', 'love', 'jumpsuit', 'fun', 'flirty', 'fabulous', 'time', 'wear', 'great', 'compliments']
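A membership test against a list scans the list element by element, so with 571 stop words and more than a million tokens a set is usually the faster container. A minimal sketch, in which `toy_stopwords` is a hypothetical miniature list standing in for the contents of stopwords_en.txt:

```python
# Hypothetical miniature stop word list; the notebook loads the real one
# from stopwords_en.txt
toy_stopwords = ["this", "it's", "and", "every", "it", "get", "nothing", "but"]
stopword_set = set(toy_stopwords)  # average O(1) membership test

def remove_stopwords(tokens, stop_set):
    """Keep tokens that are not in stop_set, preserving order and duplicates."""
    return [t for t in tokens if t not in stop_set]

review = ["love", "love", "this", "jumpsuit", "it's", "fun", "and", "fabulous"]
print(remove_stopwords(review, stopword_set))
```

The filter preserves token order and repeated words, so only the stop words disappear.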


In [22]:
stats_print(tk_reviews)

Vocabulary size:  14283
Total number of tokens:  452692
Lexical diversity:  0.031551253390826435
Total number of reviews: 19662
Average review length: 23.023700539110976
Maximun review length: 51
Minimun review length: 1
Standard deviation of review length: 10.165913222944233


#### 1.2.6 Remove infrequent words: eliminate words that appear only once in the document collection, based on term frequency

In [23]:
words = list(chain.from_iterable(tk_reviews))  # flatten all tokens in the corpus into a single list
term_freq = FreqDist(words)  # term frequency: total occurrences of each word in the corpus

In [24]:
# Find the words that occur exactly once in the corpus (hapaxes)
lessFreqWords = set(term_freq.hapaxes())
lessFreqWords 

{'hulked',
 'geiger',
 'incidents',
 'over-indulge',
 'over-stretch',
 'goods',
 'incr',
 'whips',
 'whispered',
 'creativity',
 'goto',
 'sweathsirt',
 'top-it',
 'unforgettable',
 'worldly',
 'withhold',
 'makes-you',
 'slouch-y',
 'regardi',
 'taupe-totally',
 'nandita',
 'relentless',
 'supporting',
 'cold-its',
 'clinton',
 'mensware',
 "petite's",
 'attachments',
 's-on',
 'gateway',
 'gypsy',
 'tonal',
 'non-collar',
 'embarrassed',
 'discolored',
 'announce',
 'vribant',
 'sillhoette',
 'lampshades',
 'fluttered',
 'pashminas',
 "stylist's",
 "sevigny's",
 'sale-only',
 'ding',
 'propensity',
 'juuuuussst',
 'lef',
 'core',
 'possess',
 'ubiquitous',
 'vinta',
 'hermosa',
 'shel',
 'guidance',
 'pantyhose',
 'unforced',
 'shield',
 'fisherman',
 'escalante',
 'thinne',
 'lower-cut',
 'invent',
 'groin',
 'rico',
 'notwithstanding',
 'camouflaged',
 'succulents',
 'deceided',
 'rebought',
 'remarks',
 'flowered-design',
 'cooking',
 'jaw',
 'enjoyment',
 'pop-it',
 'jeans-super'

In [25]:
len(lessFreqWords)

6734

In [26]:
def removeLessFreqWords(review):
    return [w for w in review if w not in lessFreqWords]

tk_reviews = [removeLessFreqWords(review) for review in tk_reviews]

In [27]:
stats_print(tk_reviews)

Vocabulary size:  7549
Total number of tokens:  445958
Lexical diversity:  0.016927603047820646
Total number of reviews: 19662
Average review length: 22.681212491099583
Maximun review length: 51
Minimun review length: 1
Standard deviation of review length: 9.958115518909024
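Each removed word occurs exactly once, which is consistent with the figures above: the vocabulary and the token count both drop by 6734 (14283 to 7549, and 452692 to 445958). The same step can be sketched with `collections.Counter` from the standard library instead of `FreqDist`; the helper `remove_hapaxes` and the toy reviews are hypothetical.

```python
from collections import Counter
from itertools import chain

def remove_hapaxes(tokenized_reviews):
    """Drop every token whose corpus-wide count is exactly 1."""
    term_freq = Counter(chain.from_iterable(tokenized_reviews))
    hapaxes = {w for w, c in term_freq.items() if c == 1}
    cleaned = [[w for w in review if w not in hapaxes]
               for review in tokenized_reviews]
    return cleaned, hapaxes

# Hypothetical toy corpus
toy_reviews = [["jumpsuit", "fun", "fun"], ["jumpsuit", "vribant"], ["geiger"]]
cleaned, hapaxes = remove_hapaxes(toy_reviews)
print(sorted(hapaxes))
print(cleaned)
```

In the toy corpus, "fun" is kept because it occurs twice even though both occurrences are in one review, and the last review ends up empty.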


#### 1.2.7 Remove most frequent words: remove the top 20 most frequent words based on document frequency

In [28]:
# Convert each review to a set first so that a word is counted at most once per review
words_2 = list(chain.from_iterable([set(review) for review in tk_reviews]))

In [29]:
doc_fd = FreqDist(words_2)  # compute document frequency for each unique word/type
mostCommonFredWords = doc_fd.most_common(20) # top 20 most frequent words
mostCommonFredWords

[('love', 6416),
 ('size', 5888),
 ('fit', 5537),
 ('dress', 5346),
 ('wear', 4900),
 ('top', 4670),
 ('great', 4497),
 ('fabric', 3712),
 ('color', 3604),
 ('small', 3265),
 ('ordered', 3099),
 ('perfect', 2973),
 ('flattering', 2939),
 ('soft', 2805),
 ('comfortable', 2597),
 ('back', 2538),
 ('cute', 2398),
 ('fits', 2394),
 ('nice', 2393),
 ('bought', 2376)]
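Document frequency counts the number of reviews that contain a word, whereas term frequency counts every occurrence. A minimal sketch of the difference, using `collections.Counter` in place of `FreqDist`; the helper `document_frequency` and the toy reviews are hypothetical.

```python
from collections import Counter
from itertools import chain

def document_frequency(tokenized_reviews):
    """Count, for each word, the number of reviews that contain it."""
    return Counter(chain.from_iterable(set(review) for review in tokenized_reviews))

# Hypothetical toy corpus
toy_reviews = [["love", "love", "love", "jumpsuit"], ["size", "small"], ["size", "love"]]
term_freq = Counter(chain.from_iterable(toy_reviews))
doc_freq = document_frequency(toy_reviews)

print(term_freq["love"], doc_freq["love"])
print(term_freq["size"], doc_freq["size"])
```

Here "love" occurs four times but in only two reviews, so its document frequency equals that of "size" even though its term frequency is twice as high.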

In [30]:
# Keep only the words, dropping the document frequency counts
most_common_words = [word for word, freq in mostCommonFredWords]
most_common_words

['love',
 'size',
 'fit',
 'dress',
 'wear',
 'top',
 'great',
 'fabric',
 'color',
 'small',
 'ordered',
 'perfect',
 'flattering',
 'soft',
 'comfortable',
 'back',
 'cute',
 'fits',
 'nice',
 'bought']

In [31]:
def removeMostFreqWords(review):
    return [w for w in review if w not in most_common_words]

tk_reviews = [removeMostFreqWords(review) for review in tk_reviews]

In [32]:
stats_print(tk_reviews)

Vocabulary size:  7529
Total number of tokens:  355505
Lexical diversity:  0.021178323792914306
Total number of reviews: 19662
Average review length: 18.080815786796865
Maximun review length: 47
Minimun review length: 0
Standard deviation of review length: 8.833524535391433


#### 1.2.8 Save processed data: store the processed reviews in a CSV file named "processed.csv"

In [33]:
print("Raw review:\n",reviews[test_index],'\n')
print("Tokenized review:\n",tk_reviews[test_index])

Raw review:
 I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments! 

Tokenized review:
 ['jumpsuit', 'fun', 'flirty', 'fabulous', 'time', 'compliments']


In [34]:
joined_reviews = [' '.join(review) for review in tk_reviews]
joined_reviews[test_index]

'jumpsuit fun flirty fabulous time compliments'

In [35]:
# Add the processed reviews to the DataFrame as a new column
df['Processed Review Text'] = joined_reviews
df.head()

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Processed Review Text
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,high hopes wanted work initially petite usual ...
1,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants,jumpsuit fun flirty fabulous time compliments
2,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,shirt due adjustable front tie length leggings...
3,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses,tracy reese dresses petite feet tall brand pre...
4,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,5,1,1,General Petite,Tops,Knits,basket hte person store pick teh pale hte gorg...


In [36]:
# Count the null values in the processed review column
print(f"Null values in process reviews: {df['Processed Review Text'].isnull().sum()}")

Null values in process reviews: 0


In [37]:
# Check for empty strings too
print(f"Empty strings in process reviews: {(df['Processed Review Text'] == '').sum()}")
display(df.shape)

Empty strings in process reviews: 10


(19662, 11)

In [38]:
# Remove rows with null or empty strings in the 'Processed Review Text' column because they are not useful for our analysis
df = df[(df['Processed Review Text'].notna()) & (df['Processed Review Text'] != '')]

In [39]:
# Confirm that no empty strings remain
print(f"Empty strings in process reviews: {(df['Processed Review Text'] == '').sum()}")

Empty strings in process reviews: 0


In [40]:
df.shape

(19652, 11)

In [41]:
# Save the processed DataFrame to a new CSV file
df.to_csv('processed.csv', index=False)
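One further reason to drop the empty reviews before saving is that, with its default settings, `pd.read_csv` parses an empty field as NaN, so an empty processed review would come back as a missing value rather than an empty string. A minimal sketch of that round trip, using an in-memory buffer and a hypothetical two-row frame instead of the real file:

```python
import io
import pandas as pd

# Hypothetical two-row frame; the second processed review is empty
toy = pd.DataFrame({
    "Clothing ID": [1049, 847],
    "Processed Review Text": ["jumpsuit fun flirty", ""],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # same call as above, written to memory
buffer.seek(0)
reloaded = pd.read_csv(buffer)   # default NA handling

null_count = reloaded["Processed Review Text"].isnull().sum()
print(null_count)
```

If empty reviews had to be kept, passing `keep_default_na=False` to `pd.read_csv` is one way to read them back as empty strings.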

## Saving required outputs
Save the requested information as per specification.
- vocab.txt

In [42]:
stats_print(tk_reviews)

Vocabulary size:  7529
Total number of tokens:  355505
Lexical diversity:  0.021178323792914306
Total number of reviews: 19662
Average review length: 18.080815786796865
Maximun review length: 47
Minimun review length: 0
Standard deviation of review length: 8.833524535391433


In [43]:
# Generate the vocabulary
words_3 = list(chain.from_iterable(tk_reviews))  # flatten all tokens in the corpus into a single list
vocab = sorted(set(words_3))  # unique tokens in alphabetical order
len(vocab)

7529

In [44]:
# Save the vocabulary to vocab.txt
with open('vocab.txt', 'w') as f:
    for i, word in enumerate(vocab):
        f.write(f"{word}:{i}\n") 
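Each line of vocab.txt has the form `word:index`, with indices starting at 0 and following the alphabetical order of the vocabulary. A minimal sketch of writing and reading back that format, using an in-memory buffer; the helpers `write_vocab` and `read_vocab` and the toy vocabulary are hypothetical.

```python
import io

def write_vocab(vocab, handle):
    """Write one 'word:index' entry per line, indices starting at 0."""
    for i, word in enumerate(vocab):
        handle.write(f"{word}:{i}\n")

def read_vocab(handle):
    """Parse 'word:index' lines back into a dict mapping word -> index."""
    mapping = {}
    for line in handle:
        # Split on the last colon so the index is always the final field
        word, index = line.rstrip("\n").rsplit(":", 1)
        mapping[word] = int(index)
    return mapping

# Hypothetical toy vocabulary, sorted alphabetically like the real one
toy_vocab = sorted({"jumpsuit", "fun", "a-line", "flirty"})
buffer = io.StringIO()
write_vocab(toy_vocab, buffer)
buffer.seek(0)
vocab_index = read_vocab(buffer)
print(vocab_index)
```

Reading the file back this way gives a quick check that every word has a unique index.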

## Summary
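This notebook tokenized 19,662 clothing reviews with the required regular expression, lowercased them, and then removed single-character tokens, stop words, words that occur only once in the collection, and the 20 words with the highest document frequency. These steps reduced the vocabulary from 14,806 to 7,529 words and the token count from 1,206,688 to 355,505. Ten reviews were left with no tokens and were dropped, so processed.csv contains 19,652 rows. The 7,529-word vocabulary was saved to vocab.txt in `word:index` format.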

## Reference
- Activities and labs files for this course. 
- GitHub Copilot