# Assignment 2: Milestone I Natural Language Processing
## Task 1. Basic Text Pre-processing
#### Student Name: Nguyen Ba Duc Manh
#### Student ID: s3978506


Environment: Python 3 and Jupyter notebook

Libraries used:
* pandas
* re
* numpy
* nltk

## Introduction
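This notebook covers Task 1 of the assignment: basic text pre-processing of a clothing review dataset. It loads `assignment3.csv`, examines its structure, tokenizes the `Review Text` column, removes single-character words, stop words, words that appear only once, and the 20 most frequent words, and saves the results to `processed.csv` and `vocab.txt`.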


## Importing libraries 

In [1]:
# Import the libraries used in this assessment
import pandas as pd
import re
import numpy as np
from nltk.tokenize import RegexpTokenizer

### 1.1 Examining and loading data
- Examine the data and explain your findings
- Load the data into proper data structures and get it ready for processing.

In [2]:
# Load the provided data file into a DataFrame
data = pd.read_csv('assignment3.csv')

In [3]:
data.head()

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
1,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
2,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses
3,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
4,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,5,1,1,General Petite,Tops,Knits


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19662 entries, 0 to 19661
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              19662 non-null  int64 
 1   Age                      19662 non-null  int64 
 2   Title                    19662 non-null  object
 3   Review Text              19662 non-null  object
 4   Rating                   19662 non-null  int64 
 5   Recommended IND          19662 non-null  int64 
 6   Positive Feedback Count  19662 non-null  int64 
 7   Division Name            19662 non-null  object
 8   Department Name          19662 non-null  object
 9   Class Name               19662 non-null  object
dtypes: int64(5), object(5)
memory usage: 1.5+ MB


Each column's data type already matches what its name suggests. `Clothing ID`, `Age`, `Rating`, `Recommended IND`, and `Positive Feedback Count` are stored as `int64` (for example, `Clothing ID` holds values such as 1077 and 1049), while the five text columns (`Title`, `Review Text`, `Division Name`, `Department Name`, and `Class Name`) are stored as `object`. No type conversion is needed.

In [5]:
data.describe()

Unnamed: 0,Clothing ID,Age,Rating,Recommended IND,Positive Feedback Count
count,19662.0,19662.0,19662.0,19662.0,19662.0
mean,921.297274,43.260808,4.183145,0.818177,2.652477
std,200.227528,12.258122,1.112224,0.385708,5.834285
min,1.0,18.0,1.0,0.0,0.0
25%,861.0,34.0,4.0,1.0,0.0
50%,936.0,41.0,5.0,1.0,1.0
75%,1078.0,52.0,5.0,1.0,3.0
max,1205.0,99.0,5.0,1.0,122.0


This result summarizes the count, mean, standard deviation, and quartiles of each numeric column. I noticed possible outliers in the `Age` and `Positive Feedback Count` columns: the mean `Age` is about 43.3, yet the maximum is 99, and the mean `Positive Feedback Count` is about 2.65, yet the maximum is 122.
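One way to make the outlier observation concrete is the common 1.5 × IQR rule, which is an assumption introduced here and not part of the assignment specification. The sketch below plugs in the 25% and 75% values from the `describe()` output above; the helper name `iqr_fences` is hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences for the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles copied from the describe() output above
age_low, age_high = iqr_fences(34.0, 52.0)
fb_low, fb_high = iqr_fences(0.0, 3.0)

print(age_high)  # 79.0
print(fb_high)   # 7.5
```

Under this rule, the maximum `Age` of 99 and the maximum `Positive Feedback Count` of 122 both lie above their upper fences. Because only the review text is pre-processed in this task, these values do not need to be removed.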

In [6]:
nullvalue = data.isnull().sum()
nullvalue

Clothing ID                0
Age                        0
Title                      0
Review Text                0
Rating                     0
Recommended IND            0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
dtype: int64

The dataset contains no null values.

### 1.2 Pre-processing data
Perform the required text pre-processing steps.


I start with the `Review Text` column, which is the column pre-processed in this task.

In [7]:
# Select the review text column
review_texts = data['Review Text']

In [8]:
tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")

In [9]:
# Load the stop word list (one word per line) into a set for fast membership tests
with open('stopwords_en.txt', 'r') as f:
    stop_words = {line.strip() for line in f}

In [10]:
# Function to process each review
def process_review(review):
    # Tokenize review
    tokens = tokenizer.tokenize(review.lower())  # Convert to lowercase and tokenize
    # Remove short words and stopwords
    filtered_tokens = [word for word in tokens if len(word) > 1 and word not in stop_words]
    return filtered_tokens

In [11]:
data['Processed_Review'] = data['Review Text'].apply(process_review)

In [12]:
data['Processed_Review']

0        [high, hopes, dress, wanted, work, initially, ...
1        [love, love, love, jumpsuit, fun, flirty, fabu...
2        [shirt, flattering, due, adjustable, front, ti...
3        [love, tracy, reese, dresses, petite, feet, ta...
4        [aded, basket, hte, mintue, person, store, pic...
                               ...                        
19657    [happy, snag, dress, great, price, easy, slip,...
19658    [reminds, maternity, clothes, soft, stretchy, ...
19659    [fit, top, worked, glad, store, order, online,...
19660    [bought, dress, wedding, summer, cute, fit, pe...
19661    [dress, lovely, platinum, feminine, fits, perf...
Name: Processed_Review, Length: 19662, dtype: object

This is the result of processing the `Review Text` column: each review is now a list of lowercase tokens, with stop words and single-character words removed.
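To show what the tokenization and filtering do to a single review, here is a minimal sketch on a made-up sentence. It assumes `re.findall` with the same pattern as a stand-in for `RegexpTokenizer`, and uses a tiny hypothetical stop word set in place of `stopwords_en.txt`; the function name `process_review_sketch` is also hypothetical.

```python
import re

# Same pattern as the tokenizer above: words with an optional internal hyphen or apostrophe
TOKEN_PATTERN = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")

# Hypothetical stand-in for the contents of stopwords_en.txt
SAMPLE_STOP_WORDS = {"i", "this", "it's", "and", "a"}

def process_review_sketch(review, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase, tokenize, then drop single-character tokens and stop words."""
    tokens = TOKEN_PATTERN.findall(review.lower())
    return [t for t in tokens if len(t) > 1 and t not in stop_words]

sample = "I love this A-line dress, it's fun and well-made!"
result = process_review_sketch(sample)
print(result)  # ['love', 'a-line', 'dress', 'fun', 'well-made']
```

Hyphenated words such as `a-line` and `well-made` survive as single tokens because the pattern allows one internal `-` or `'`.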

In [13]:
# Flatten all tokens to calculate frequencies and remove rare/frequent words
all_words = [word for sublist in data['Processed_Review'] for word in sublist]

In [14]:
# Calculate word frequencies
word_counts = pd.Series(all_words).value_counts()

In [15]:
# Identify words that appear only once and the 20 most frequent words, then drop both from the vocabulary
single_appearance_words = word_counts[word_counts == 1].index
top_20_frequent_words = word_counts.head(20).index
filtered_words = word_counts.drop(single_appearance_words).drop(top_20_frequent_words).index

In [16]:
# Filter the reviews based on the updated word list
data['Processed_Review'] = data['Processed_Review'].apply(lambda review: [word for word in review if word in filtered_words])
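The frequency-based filtering above can be checked on a toy corpus. This is a sketch only: it uses `collections.Counter` in place of `pd.Series.value_counts`, a hypothetical helper `filter_by_frequency`, and `top_n=1` in place of 20 so the effect is visible on three short documents.

```python
from collections import Counter

def filter_by_frequency(docs, top_n):
    """Drop words that occur once overall and the top_n most frequent words."""
    counts = Counter(word for doc in docs for word in doc)
    rare = {w for w, c in counts.items() if c == 1}
    frequent = {w for w, _ in counts.most_common(top_n)}
    keep = set(counts) - rare - frequent
    return [[w for w in doc if w in keep] for doc in docs], keep

# Made-up tokenized reviews, not taken from the dataset
docs = [["dress", "fit", "soft"], ["dress", "fit", "dress"], ["soft", "zipper"]]
filtered, vocab = filter_by_frequency(docs, top_n=1)

print(filtered)       # [['fit', 'soft'], ['fit'], ['soft']]
print(sorted(vocab))  # ['fit', 'soft']
```

Here `zipper` is dropped because it appears once, and `dress` is dropped as the single most frequent word. Both counts are term frequencies over the whole corpus, as in the notebook code.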

## Saving required outputs
Save the requested information as per specification.
- vocab.txt
- processed.csv

In [18]:
# Save the processed reviews to a CSV file
data['Processed_Review'] = data['Processed_Review'].apply(lambda x: ' '.join(x))
data[['Processed_Review']].to_csv('processed.csv', index=False)

In [19]:
# Sort the remaining vocabulary alphabetically and write one `word:index` pair per line
unique_words = sorted(set(filtered_words))
with open('vocab.txt', 'w') as vocab_file:
    for idx, word in enumerate(unique_words):
        vocab_file.write(f"{word}:{idx}\n")
        
print("Processing complete. Files saved as 'processed.csv' and 'vocab.txt'.")

Processing complete. Files saved as 'processed.csv' and 'vocab.txt'.


## Summary

In this task, I have examined and processed the `Review Text` in the dataset according to the required steps. The processed data has been saved to a file named 'processed.csv', and each unique word is stored alphabetically in 'vocab.txt' in the format `word_string:word_integer_index`.
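As a quick sanity check on the `word_string:word_integer_index` format, the vocabulary can be read back into a dictionary. The sketch below uses a hypothetical helper `load_vocab` and an in-memory sample with made-up entries so it runs without the real `vocab.txt`.

```python
import io

def load_vocab(lines):
    """Parse `word:index` lines into a dict mapping each word to its integer index."""
    vocab = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last colon so the index is always the final field
        word, idx = line.rsplit(":", 1)
        vocab[word] = int(idx)
    return vocab

# Made-up sample in the same format as vocab.txt
sample_file = io.StringIO("a-line:0\ndress:1\nfit:2\n")
vocab = load_vocab(sample_file)
print(vocab)  # {'a-line': 0, 'dress': 1, 'fit': 2}
```

To check the real output, the same function could be called on an opened file, for example `with open('vocab.txt') as f: load_vocab(f)`.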
