# Robust Text Normalisation

This notebook applies robust normalisation to preprocessed text before textual analysis, with the aim of improving model accuracy and reducing computation time.

The following entity types are removed from the text:
- PERSON: Names of individuals.
- ORG: Names of organizations, including companies, governmental entities, and other groups.
- NORP: Nationalities, religious and political groups.
- FAC: Facilities, like buildings, airports, highways, bridges.
- GPE: Geo-political entities, such as countries, cities, states.
- LOC: Non-GPE locations, mountain ranges, bodies of water.
- PRODUCT: Objects, vehicles, foods, etc. (not services).
- EVENT: Named events, such as battles, wars, sports events, hurricanes, etc.
- WORK_OF_ART: Titles of books, songs, and other works of art.
- LAW: Named documents made into laws, including directives, regulations, and legislative acts.
- LANGUAGE: Any named language.
- DATE: Absolute or relative dates or periods.
- TIME: Times smaller than a day.
- PERCENT: Percentage (including "%").
- MONEY: Monetary values, including unit.
- QUANTITY: Measurements, as of weight or distance.
- ORDINAL: "first", "second", etc.
- CARDINAL: Numerals that do not fall under another type (e.g., a counting number).

We rely on spaCy's `EntityRecognizer` to identify entities in the text. Because of limited computational power, we use spaCy's small English web model, `en_core_web_sm`, instead of `en_core_web_trf`, trading some accuracy for speed and efficiency.

## Required Packages

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm

import spacy
from spacy.tokens import DocBin
nlp = spacy.load("en_core_web_sm")


In [2]:
ent_labels_to_remove = ["PERSON", "ORG", "NORP", "FAC", "GPE", "LOC", "PRODUCT", 
                          "EVENT", "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", 
                          "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"]  # every entity label listed above

## Define different normalisation functions

In [3]:
# Normalise a single text string; use with Series.apply()

def normalise_text(text):
    text = text.replace("\n", " ")
    text = text.lower()
    doc = nlp(text)
    tokens = []
    ents_to_exclude_index = set()

    for ent in doc.ents:
        if ent.label_ in ent_labels_to_remove:
            ents_to_exclude_index.update(range(ent.start, ent.end))

    for token in doc:
        if (
            not token.like_url
            and not token.like_email
            and not token.is_stop
            and not token.is_punct
            and token.is_alpha
            and token.i not in ents_to_exclude_index):
                tokens.append(token.lemma_.lower())
    return " ".join(tokens)  # return a string
    # return tokens # return a list


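def hash_to_word(hashes):
    # Map spaCy string-store hashes back to their text
    # (the loop variable is not called `hash`, to avoid shadowing the built-in)
    return [nlp.vocab.strings[h] for h in hashes]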
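
The entity filter in `normalise_text` collects the token positions covered by every unwanted entity span and then skips those positions. The sketch below isolates that step using plain tuples as stand-ins for spaCy spans. The `excluded_token_indices` helper, the token list and the spans are hypothetical and only illustrate the idea; they are not part of the notebook.

```python
def excluded_token_indices(spans, labels_to_remove):
    """Return the set of token positions covered by unwanted entity spans.

    Each span is (start, end, label) with an exclusive end, mirroring
    spaCy's ent.start / ent.end / ent.label_.
    """
    excluded = set()
    for start, end, label in spans:
        if label in labels_to_remove:
            excluded.update(range(start, end))
    return excluded


# Hypothetical tokenised review and entity spans, for illustration only.
tokens = ["bought", "from", "barnes", "&", "noble", "for", "twenty", "dollars"]
spans = [(2, 5, "ORG"), (6, 8, "MONEY")]

excluded = excluded_token_indices(spans, {"ORG", "MONEY"})
kept = [tok for i, tok in enumerate(tokens) if i not in excluded]
print(kept)
```

Using a set for the excluded positions keeps the per-token membership check cheap, which matters when the same check runs for every token of millions of reviews.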

In [4]:
# Normalise a column that has been converted to a list of spaCy Doc objects
# This method is the fastest, but it holds every Doc in memory at once

def normalise_doc_at_once(nlp_object_column_list, ent_labels_to_remove):
    results = []
    for doc in tqdm(nlp_object_column_list):
        tokens = []
        for token in doc:
            if (
                not token.like_url
                and not token.like_email
                and not token.is_stop
                and not token.is_punct
                and token.is_alpha
                and token.ent_type_ not in ent_labels_to_remove
            ):
                tokens.append(token.lemma_.lower())
        results.append(" ".join(tokens))
    return results

def column_to_nlp_object_list_at_once(a_df_column):
    return list(tqdm(nlp.pipe(a_df_column.tolist())))

In [5]:
# Normalise in batches to avoid out-of-memory errors

def normalise_doc(doc, ent_labels_to_remove):
    tokens = [
        token.lemma_.lower() for token in doc
        if not token.like_url
        and not token.like_email
        and not token.is_stop
        and not token.is_punct
        and token.is_alpha
        and token.ent_type_ not in ent_labels_to_remove
    ]
    return " ".join(tokens)

def process_column_in_batches(df_column, batch_size=50000):
    results = []
    for i in tqdm(range(0, len(df_column), batch_size)):
        batch = df_column.iloc[i:i+batch_size].tolist()  # explicit positional slice
        docs = nlp.pipe(batch)
        for doc in docs:
            normalized_text = normalise_doc(doc, ent_labels_to_remove)
            results.append(normalized_text)
    return results
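
The batching pattern used by `process_column_in_batches` can be looked at on its own. The sketch below is a minimal, library-free illustration; `iter_batches` and `batch_count` are hypothetical helper names, not functions from the notebook. It also shows how many batches a given row count produces, using the row count of the Camera & Photo subset reported later in this notebook.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def batch_count(n_rows, batch_size):
    """Number of batches needed to cover n_rows."""
    return math.ceil(n_rows / batch_size)


rows = list(range(12))
batches = list(iter_batches(rows, batch_size=5))
print([len(b) for b in batches])  # -> [5, 5, 2]

# Row count taken from the Camera & Photo subset in this notebook.
print(batch_count(2_502_223, 50_000))  # -> 51
```

Only one batch of texts is handed to `nlp.pipe` at a time, so peak memory depends on `batch_size` rather than on the size of the whole column.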


## Load preprocessed data

In [6]:
df_review = pd.read_csv('review.csv')
# The warning raised here is caused by uncleaned, mixed data types in the vote column; we will deal with it when the vote data is needed



In [7]:
df_review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20566364 entries, 0 to 20566363
Data columns (total 10 columns):
 #   Column      Dtype  
---  ------      -----  
 0   overall     float64
 1   verified    bool   
 2   reviewTime  object 
 3   asin        object 
 4   reviewText  object 
 5   vote        object 
 6   image       bool   
 7   Year        int64  
 8   price       float64
 9   main_cat    object 
dtypes: bool(2), float64(2), int64(1), object(5)
memory usage: 1.3+ GB


In [8]:
# Make sure the reviewText column contains no NA values, as these cause errors in spaCy's nlp.pipe
df_review["reviewText"].isna().sum()

0

In [9]:
# Show the top 10 main product categories
df_review['main_cat'].value_counts().head(10)

main_cat
Computers                    6732262
All Electronics              3655780
Home Audio & Theater         3301869
Camera & Photo               2502223
Cell Phones & Accessories    2272921
Car Electronics               412355
Amazon Devices                296778
Sports & Outdoors             221047
Tools & Home Improvement      198603
Office Products               172867
Name: count, dtype: int64

## Testing

In [10]:
# code used for testing
# df_review.info()
# df_review_filter = df_review.loc[df_review["main_cat"] == "Computers" ]
# df_review_filter = df_review_filter.sample(100000, random_state=0)
# df_review_filter.info()

In [11]:
# df_review_filter['reviewText'] = process_column_in_batches(df_review_filter['reviewText'], batch_size=10000)

In [12]:
# df_review_filter['reviewText'].to_csv('test.csv', index=False)

In [13]:
# df_review_filter["reviewText"] = normalise_doc_at_once(column_to_nlp_object_list_at_once(df_review_filter["reviewText"]), ent_labels_to_remove)


In [14]:
# df_review_filter['reviewText'].to_csv('test2.csv', index=False)

In [15]:
# code used for testing
# df_review_filter["reviewText"] = normalise_doc_at_once(column_to_nlp_object_list_at_once(df_review_filter["reviewText"]), ent_labels_to_remove)
# df_review_filter["reviewText"]

## Normalisation

### Normalise Camera & Photo reviews

In [17]:
df_review_camera = df_review.loc[df_review["main_cat"] == "Camera & Photo"].copy()  # copy so the later column assignment does not act on a slice of df_review
# df_review_camera.to_csv('review_camera.csv', index=False) # unnormalised version to csv if needed

In [18]:
df_review_camera["reviewText"]

12386       I was skeptical about buying a generic replace...
12387       Battery arrived ahead of schedule and was 1/2 ...
12388       Muy importante tener una batera cargada de rep...
12389       The two rechargeable battery packs I ordered a...
12390       Battery charged quickly and installed in my ca...
                                  ...                        
20566300    I love this dry box!!!! Besides being extremel...
20566309    If you have more than day 5000 dollars of gear...
20566310                                  Highly Recommended.
20566313    I have been using this camera for about 5 mont...
20566314    I enjoyed how durable and small this product i...
Name: reviewText, Length: 2502223, dtype: object

In [20]:
df_review_camera['reviewText'] = process_column_in_batches(df_review_camera['reviewText'], batch_size=50000) # adjust batch_size to the available memory (16 GB of memory could handle around 50000 per batch)

100%|██████████| 51/51 [3:04:29<00:00, 217.05s/it]  


In [21]:
df_review_camera['reviewText']

12386       skeptical buy generic replacement battery new ...
12387       battery arrive ahead schedule price anyplace b...
12388       muy importante tener una batera cargada de rep...
12389       rechargeable battery pack order work great cam...
12390       battery charge quickly instal camera easily gr...
                                  ...                        
20566300    love dry box extremely functional allow displa...
20566309    gear worth invest professional storage camera ...
20566310                                     highly recommend
20566313    camera truly k great video quality seriously g...
20566314           enjoy durable small product bulky easy use
Name: reviewText, Length: 2502223, dtype: object

In [22]:
df_review_camera.to_csv('review_camera_normalised.csv', index=False)
del df_review_camera
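
The progress bar above reports 3:04:29 for the 2,502,223 Camera & Photo reviews. As a rough planning aid, the sketch below turns that into a throughput figure and extrapolates to the other two categories. It assumes throughput stays roughly constant across categories, which may not hold if review lengths differ; `parse_hms` and `estimate_hours` are hypothetical helper names, not part of the notebook.

```python
def parse_hms(text):
    """Convert an 'H:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def estimate_hours(n_rows, rows_per_second):
    """Estimated processing time in hours at a fixed throughput."""
    return n_rows / rows_per_second / 3600


# Figures taken from the Camera & Photo run in this notebook.
camera_rows = 2_502_223
camera_seconds = parse_hms("3:04:29")
rate = camera_rows / camera_seconds  # reviews per second

# Row counts taken from the main_cat value counts in this notebook.
# Assumption: the same throughput applies to every category.
for name, n_rows in [("Cell Phones & Accessories", 2_272_921),
                     ("Computers", 6_732_262)]:
    print(f"{name}: ~{estimate_hours(n_rows, rate):.1f} h")
```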

### Normalise Cell Phones & Accessories reviews

In [23]:
df_review_phone = df_review.loc[df_review["main_cat"] == "Cell Phones & Accessories"].copy()  # copy so the later column assignment does not act on a slice of df_review
# df_review_phone.to_csv('review_phone.csv', index=False) # unnormalised version to csv if needed

In [24]:
df_review_phone["reviewText"]

1245        Stephanie has spent time filtering out many pr...
1246        For the past two years I've taught math profic...
1247        This book has notes, definitions,and practice ...
1248        The resources in this book are like no other. ...
1249        I am a High School mathematics teacher in the ...
                                  ...                        
20566349    Great product, great customer care. Thanks & w...
20566350    Works great, love the longer cord. As with any...
20566351    Perfect length. Very durable braiding. Works g...
20566352    Ok here is an odd thing that happened to me, I...
20566353                                          Works well.
Name: reviewText, Length: 2272921, dtype: object

In [25]:
df_review_phone['reviewText'] = process_column_in_batches(df_review_phone['reviewText'], batch_size=50000) # adjust batch_size to the available memory (16 GB of memory could handle around 50000 per batch)

  0%|          | 0/46 [00:00<?, ?it/s]

In [None]:
df_review_phone['reviewText']

In [None]:
df_review_phone.to_csv('review_phone_normalised.csv', index=False)
del df_review_phone

### Normalise Computers reviews

In [None]:
df_review_comp = df_review.loc[df_review["main_cat"] == "Computers"].copy()  # copy so the later column assignment does not act on a slice of df_review
# df_review_comp.to_csv('review_computer.csv', index=False) # unnormalised version to csv if needed

In [None]:
df_review_comp["reviewText"]

743         It does 2A and charges a DEAD Nook in a few ho...
744         Same charger can be bought at Barnes & Noble f...
745         Works well, a little pricey I think for a char...
746         My son crewed my HD charger cord so I needed a...
747         It works perfect, puppy chewed last one and I ...
                                  ...                        
20566359    Had it 1 day and it quit working, will be retu...
20566360    Received item in 2 days. Product worked as adv...
20566361    I have it plugged into a usb extension on my g...
20566362              Fast delivery product was simple to use
20566363           Working as advertised, so far no problems.
Name: reviewText, Length: 6732262, dtype: object

In [None]:
df_review_comp['reviewText'] = process_column_in_batches(df_review_comp['reviewText'], batch_size=50000) # adjust batch_size to the available memory (16 GB of memory could handle around 50000 per batch)

485888it [40:24, 200.39it/s]

In [None]:
df_review_comp['reviewText']

In [None]:
df_review_comp.to_csv('review_computer_normalised.csv', index=False)
del df_review_comp