# Data Cleaning

In [30]:
'''
Import required packages and libraries for data exploration
'''
import pandas as pd
import numpy as np
import tensorflow as tf
import re
import transformers
import pyabsa

In [24]:
'''
Set up file path and data handling objects
'''
PATH = "../data/reviews.csv"
data = pd.read_csv(PATH)

In [25]:
data.describe()

Unnamed: 0,Id,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time
count,568454.0,568454.0,568454.0,568454.0,568454.0
mean,284227.5,1.743817,2.22881,4.183199,1296257000.0
std,164098.679298,7.636513,8.28974,1.310436,48043310.0
min,1.0,0.0,0.0,1.0,939340800.0
25%,142114.25,0.0,0.0,4.0,1271290000.0
50%,284227.5,0.0,1.0,5.0,1311120000.0
75%,426340.75,2.0,2.0,5.0,1332720000.0
max,568454.0,866.0,923.0,5.0,1351210000.0


## Case Sensitivity
Convert the input features in the raw dataset into a case insensitive format (all lowercase/uppercase) to reduce the amount of distinct words in the data.

In [26]:
# Remove null values from tokenizer strings
data["Summary"] = data["Summary"].fillna("")
data["Text"] = data["Text"].fillna("")

In [27]:
# Convert all words to lowercase to reduce the number of unique features
data["Summary"] = data["Summary"].str.lower()
data["Text"] = data["Text"].str.lower()

data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,good quality dog food,i have bought several of the vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,not as advertised,product arrived labeled as jumbo salted peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""delight"" says it all",this is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,cough medicine,if you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,great taffy,great taffy at a great price. there was a wid...


## Punctuation Handling
Some words that contain punctuation can be recorded as separate features without punctuation handling (e.g., "Steve's pizza is great!" and "Steve makes great pizza!").

| is | great | great! | makes | pizza | pizza! | Steve | Steve's |
|----|-------|--------|-------|-------|--------|-------|---------|
|1   | 1     | 1      | 1     | 1     | 1      | 1     | 1       |

We want to remove uncessesary punctuation so that we don't have duplicates of effectively the same word.
| is | great | makes | pizza | Steve |
|----|-------|-------|-------|-------|
| 1  | 2     | 1     | 2     | 2     |

Doing this prevents our model from interpreting duplicate words as two separate features and reduces the number of dimensions our model has to process (increasing efficiency).

In [28]:
pattern = r"(?u)\b\w\w+\b"
tokenizer = lambda string : " ".join(re.findall(pattern=pattern, string=string))

data["Summary"] = data["Summary"].apply(tokenizer)
data["Text"] = data["Text"].apply(tokenizer)

## Remove Filler Words
Some words like "I", "the", "a", etc. don't impact the sentiment of the text content. Remove these words from all review content so there is less redundant features for the final model.

## Remove Irrelevant Data Points
The first stage of data cleaning is to identify and remove data points that aren't related to our task. In "Amazon Fine Food Reviews", we have many different product reviews including: pet food, medicine, microwavable food, fine foods, etc.
- Is this category of food or type of review relevant to our task?
- Would removing this type of review from the data improve the accuracy of our model?
- If we remove this type of review, how will it effect our training process (would there be too little data remaining?)

In [29]:
non_aspects = {
    "pet_species":{
        "dog","cat","puppy","kiten","fish","hamster","rabit","guinea pig","bird","parrot","turtle",
        "lizard", "snake", "ferret", "gerbil", "chinchilla", "mouse", "rat", "iguana", "gecko",
        "dogs","cats","puppys","kitens","fishs","hamsters","rabits","guinea pigs","birds","parrots","turtles",
        "lizards", "snakes", "ferrets", "gerbils", "chinchillas", "mouses", "rats", "iguanas", "geckos"
    },
    "pet_food_brands":{
        "purina", "pedigre", "iams", "blue buffalo", "hill science diet", "royal canin", "fancy feast", "friskies",
        "cesar", "meow mix", "nutro", "wellness", "orijen", "acana", "grenies", "temptations", "whiskas"
    },
    "otc_medicines": {
        "ibuprofen", "acetaminophen", "naproxen", "aspirin", "loperamide", "simethicone", "diphenhydramine", "loratadine", "cetirizine", "fexofenadine", "doxylamine",
        "phenylephrine", "pseudoephedrine", "guaifenesin", "dextromethorphan", "omeprazole", "famotidine", "ranitidine", "calcium carbonate", "bismuth subsalicylate", 
        "polyethylene glycol 3350", "docusate sodium", "hydrocortisone", "bacitracin", "neomycin", "polymyxin b", "benzocaine", "lidocaine", "menthol", "camphor", 
        "salicylic acid", "minoxidil", "fluoride", "nicotine", "melatonin", "vitamin", "zinc", "iron", "magnesium", "calcium", "probiotics", "electrolytes", 
        "oral rehydration salts", "antacids", "laxatives", "antihistamines", "decongestants", "cough suppressants", "expectants", "sleep aids", "pain relievers", "fever reducers",
        "anti diarrheal","anti gas","allergy relief", "cold medicine", "flu medicine", "heartburn relief", "acid reducer", "stomach remedy", "constipation relief", 
        "hemorrhoid treatment", "motion sickness relief", "smoking cessation aids", "eye drops", "ear drops", "nasal spray", "throat lozenges", "topical analgesics", 
        "antifungal creams", "antiseptic solutions", "first aid ointments", "wound care", "bandages", "thermometers", "blood pressure monitors", "glucose meters", "pregnancy tests", 
        "ovulation tests", "condoms", "personal lubricants", "feminine hygiene products", "incontinence products", "foot care products", "wart removers", "corn removers", 
        "callus removers", "antiperspirants", "deodorants", "oral care products", "toothpaste", "mouthwash", "dental floss", "denture care"
    },
    "medicine_brands":{
        "tylenol", "advil", "aleve", "motrin", "excedrin", "bayer", "bufferin", "midol", "benadryl", "claritin", "zyrtec", "allegra", 
        "xyzal", "sudafed", "mucinex", "robitussin", "delsym", "nyquil", "dayquil", "theraflu", "vicks", "pepto bismol", "tums",
        "rolaids", "gas x", "imodium", "dramamine", "preparation h", "monistat", "lotrimin", "lamisil", "neosporin", "polysporin", "cortizone 10",
        "hydrocortisone", "orajel", "anbesol", "abreva", "zicam", "airborne", "emergen c", "nature made", "nature bounty", "centrum", "one a day",
        "flintstones", "gnc", "Kirkland signature", "equate", "up & up", "amazon basic care", "rite aid", "cvs health", "walgrens", "boiron", "hyland",
        "similasan", "breathe right", "nicorete", "nicoderm", "zantac", "prilosec", "prevacid", "nexium", "pepcid", "omeprazole", "famotidine",
        "ranitidine", "lactaid", "beano", "align", "culturelle", "florastor", "metamucil", "miralax", "colace", "senokot", "flet",
        "tucks", "anusol", "hemaway", "preparation h", "voltaren", "salonpas", "icy hot", "bengay", "tiger balm", "biofreze", "aspercreme",
        "capzasin", "blue emu", "thermacare", "salonpas", "bengay", "flexall", "arnicare", "boiron", "hyland", "similasan", "genexa",
        "zarbe", "maty", "olly", "smartypants", "rainbow light", "garden of life", "new chapter", "megafood", "nature way", "solaray", "solgar",
        "now foods", "Jarrow formulas", "doctor best", "thorne research", "pure encapsulations", "designs for health", "douglas laboratories", 
        "integrative therapeutics", "vital nutrients", "standard process", "metagenics", "ortho molecular products", "xymogen", "biotics research", "nutri west", 
        "professional formulas", "ecological formulas", "progressive labs", "Klaire labs", "allergy research group", "ayush herbs", "ayurvedic herbs", 
        "himalaya herbal healthcare", "planetary herbals", "herb pharm", "gaia herbs", "nature answer", "nature sunshine", "nature way", "solaray",
        "solgar", "now foods", "Jarrow formulas", "doctor best", "thorne research", "pure encapsulations", "designs for health", "douglas laboratories", 
        "integrative therapeutics", "vital nutrients", "standard process", "metagenics", "ortho molecular products", "xymogen", "biotics research", "nutri west", 
        "professional formulas", "ecological formulas", "progressive labs", "Klaire labs", "allergy research group", "ayush herbs", "ayurvedic herbs",
        "himalaya herbal healthcare", "planetary herbals", "herb pharm", "gaia herbs", "nature answer", "nature sunshine"
    }
}

In [31]:
def search_prod(value, dataframe, series):
    products = set()

    for i, string in enumerate(series):
        if re.search(pattern=f" {value} ", string=string):
            products.add(dataframe.iloc[i]["ProductId"])
    return products

In [32]:
for key in non_aspects.keys():
    for value in non_aspects[key]:
        sum_id = search_prod(value=value, dataframe=data, series=data["Summary"])
        txt_id = search_prod(value=value, dataframe=data, series=data["Text"])
        prod_id = sum_id.union(txt_id)
        
        data = data[~data["ProductId"].isin(prod_id)]


In [11]:
data.describe()

Unnamed: 0,Id,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time
count,283226.0,283226.0,283226.0,283226.0,283226.0
mean,286214.118337,1.627954,2.059009,4.187606,1296071000.0
std,164080.193161,7.10235,7.579589,1.321972,49046820.0
min,2.0,0.0,0.0,1.0,939340800.0
25%,144938.25,0.0,0.0,4.0,1270771000.0
50%,285881.5,0.0,1.0,5.0,1311466000.0
75%,429968.0,2.0,2.0,5.0,1333066000.0
max,568454.0,559.0,562.0,5.0,1351210000.0


In [12]:
data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,not as advertised,product arrived labeled as jumbo salted peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,delight says it all,this is confection that has been around few ce...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,great taffy,great taffy at great price there was wide asso...
5,6,B006K2ZZ7K,ADT0SRK1MGOEU,Twoapennything,0,0,4,1342051200,nice taffy,got wild hair for taffy and ordered this five ...
6,7,B006K2ZZ7K,A1SP2KVKFXXRU1,David C. Sullivan,0,0,5,1340150400,great just as good as the expensive brands,this saltwater taffy had great flavors and was...


## Remove Uncecessary Columns
- What columns are necessary for our model? 
- Is there anything that needs to be removed?

In [13]:
# Only include features that can be plotted in correlation matrix
# String features cannot be intepreted in correlation matrix
numeric_data = data.drop(columns=["ProductId", "UserId", "ProfileName", "Summary", "Text"])

In [14]:
# Calculate the helpfulness
helpfulness_scores = data["HelpfulnessNumerator"]/data["HelpfulnessDenominator"].replace(0,np.nan)

# Add the new helpfulness column to the numeric data as correlation feature
data["Helpfulness"] = helpfulness_scores

In [15]:
# As seen in the data exploration stage, most numerical features excluding 
# the newly created "Helpfulness" were not indicative of Score
data.drop(columns=[
    "Id",
    "UserId", 
    "ProfileName", 
    "HelpfulnessNumerator", 
    "HelpfulnessDenominator",
    "Time"
])

Unnamed: 0,ProductId,Score,Summary,Text,Helpfulness
1,B00813GRG4,1,not as advertised,product arrived labeled as jumbo salted peanut...,
2,B000LQOCH0,4,delight says it all,this is confection that has been around few ce...,1.0
4,B006K2ZZ7K,5,great taffy,great taffy at great price there was wide asso...,
5,B006K2ZZ7K,4,nice taffy,got wild hair for taffy and ordered this five ...,
6,B006K2ZZ7K,5,great just as good as the expensive brands,this saltwater taffy had great flavors and was...,
...,...,...,...,...,...
568441,B000NY8O9M,5,great for fast gulasch,quick and easy had similar gulasch in guest ho...,
568442,B006T7TKZO,5,great cafe latte,this product is great gives you so much energy...,
568443,B000H7K114,5,excellent tea,love this tea first discovered the pleasures o...,
568450,B003S1WTCU,2,disappointed,disappointed with the flavor the chocolate not...,


## Dependency Parsing Split
In this section we need to split the dataset into single entity and multiple entity data points. This step is necessary because the framework for our model requires that single entity data points are handled by **model A** and multiple entity data points are handled by **model B**.

In [16]:
from pyabsa.framework.checkpoint_class.checkpoint_template import CheckpointManager

checkpoint = CheckpointManager()
checkpoint_path = checkpoint._get_remote_checkpoint(checkpoint="multilingual", task_code="ATEPC")
print("Checkpoint downloaded to:", checkpoint_path)

[2025-05-07 05:44:45] (2.4.1.post1) ********** Available ATEPC model checkpoints for Version:2.4.1.post1 (this version) **********
[2025-05-07 05:44:45] (2.4.1.post1) ********** Available ATEPC model checkpoints for Version:2.4.1.post1 (this version) **********
[2025-05-07 05:44:45] (2.4.1.post1) Downloading checkpoint:multilingual 
[2025-05-07 05:44:45] (2.4.1.post1) Notice: The pretrained model are used for testing, it is recommended to train the model on your own custom datasets
[2025-05-07 05:44:45] (2.4.1.post1) Checkpoint already downloaded, skip
Checkpoint downloaded to: ./checkpoints\ATEPC_MULTILINGUAL_CHECKPOINT


In [18]:
from pyabsa import AspectTermExtraction as ATEPC, available_checkpoints

# view available checkpoints
checkpoint_map = available_checkpoints()

# load model
aspect_extractor = ATEPC.AspectExtractor(
    checkpoint=checkpoint_path,
    auto_device=True,
    cal_perplexity=True
)

# single sentence prediction
results = aspect_extractor.predict(
    data.iloc[:1000]["Summary"].tolist(),
    print_result=True,
    ignore_error=True,
)

# Print aspect terms for each item
for i, result in enumerate(results):
    aspects = result.get("aspect", [])
    print(f"Item {i+1} aspects: {aspects}")


[2025-05-07 05:45:28] (2.4.1.post1) Please specify the task code, e.g. from pyabsa import TaskCodeOption
[2025-05-07 05:45:28] (2.4.1.post1) Load aspect extractor from ./checkpoints\ATEPC_MULTILINGUAL_CHECKPOINT
[2025-05-07 05:45:28] (2.4.1.post1) config: ./checkpoints\ATEPC_MULTILINGUAL_CHECKPOINT\fast_lcf_atepc.config
[2025-05-07 05:45:28] (2.4.1.post1) state_dict: ./checkpoints\ATEPC_MULTILINGUAL_CHECKPOINT\fast_lcf_atepc.state_dict
[2025-05-07 05:45:28] (2.4.1.post1) model: None
[2025-05-07 05:45:28] (2.4.1.post1) tokenizer: ./checkpoints\ATEPC_MULTILINGUAL_CHECKPOINT\fast_lcf_atepc.tokenizer
[2025-05-07 05:45:29] (2.4.1.post1) Set Model Device: cuda:0
[2025-05-07 05:45:29] (2.4.1.post1) Device Name: NVIDIA GeForce GTX 1070


preparing ate inference dataloader: 100%|██████████| 1000/1000 [00:00<00:00, 2705.68it/s]
extracting aspect terms: 100%|██████████| 32/32 [00:21<00:00,  1.49it/s]
preparing apc inference dataloader: 100%|██████████| 671/671 [00:00<00:00, 1235.47it/s]
classifying aspect sentiments: 100%|██████████| 21/21 [00:14<00:00,  1.44it/s]


[2025-05-07 05:46:19] (2.4.1.post1) The results of aspect term extraction have been saved in d:\Documents\Education\nn-fuzz-proj\Amazon-Sentiment-Analysis\src\cleaning\Aspect Term Extraction and Polarity Classification.FAST_LCF_ATEPC.result.json
[2025-05-07 05:46:19] (2.4.1.post1) Example 0: not as advertised
[2025-05-07 05:46:19] (2.4.1.post1) Example 1: delight says it all
[2025-05-07 05:46:19] (2.4.1.post1) Example 2: great taffy
[2025-05-07 05:46:19] (2.4.1.post1) Example 3: nice taffy
[2025-05-07 05:46:19] (2.4.1.post1) Example 4: great just as good as the expensive brands
[2025-05-07 05:46:19] (2.4.1.post1) Example 5: wonderful tasty taffy
[2025-05-07 05:46:19] (2.4.1.post1) Example 6: yay barley
[2025-05-07 05:46:19] (2.4.1.post1) Example 7: the best <hot sauce:Positive Confidence:0.998> in the world
[2025-05-07 05:46:19] (2.4.1.post1) Example 8: my cats love this <diet food:Positive Confidence:0.9886> better than their regular food
[2025-05-07 05:46:19] (2.4.1.post1) Example 9:

In [22]:
# you can view all available checkpoints by calling available_checkpoints()
checkpoint_map = available_checkpoints()

aspect_extractor = ATEPC.AspectExtractor(
    'multilingual',
    auto_device=True,  # False means load model on CPU
    cal_perplexity=True,
)

# instance inference
atepc_result = aspect_extractor.batch_predict(
    data.iloc[:1000]["Summary"].tolist(),  #
    save_result=True,
    print_result=True,  # print the result
    pred_sentiment=True,  # Predict the sentiment of extracted aspect terms
)

print(atepc_result)

[2025-05-07 07:29:14] (2.4.1.post1) Please specify the task code, e.g. from pyabsa import TaskCodeOption
[2025-05-07 07:29:15] (2.4.1.post1) ********** Available ATEPC model checkpoints for Version:2.4.1.post1 (this version) **********
[2025-05-07 07:29:15] (2.4.1.post1) ********** Available ATEPC model checkpoints for Version:2.4.1.post1 (this version) **********
[2025-05-07 07:29:15] (2.4.1.post1) Downloading checkpoint:multilingual 
[2025-05-07 07:29:15] (2.4.1.post1) Notice: The pretrained model are used for testing, it is recommended to train the model on your own custom datasets
[2025-05-07 07:29:15] (2.4.1.post1) Checkpoint already downloaded, skip
[2025-05-07 07:29:15] (2.4.1.post1) Load aspect extractor from checkpoints\ATEPC_MULTILINGUAL_CHECKPOINT
[2025-05-07 07:29:15] (2.4.1.post1) config: checkpoints\ATEPC_MULTILINGUAL_CHECKPOINT\fast_lcf_atepc.config
[2025-05-07 07:29:15] (2.4.1.post1) state_dict: checkpoints\ATEPC_MULTILINGUAL_CHECKPOINT\fast_lcf_atepc.state_dict
[2025-0

preparing ate inference dataloader: 100%|██████████| 1000/1000 [00:00<00:00, 2454.85it/s]
extracting aspect terms: 100%|██████████| 32/32 [00:23<00:00,  1.36it/s]
preparing apc inference dataloader: 100%|██████████| 671/671 [00:00<00:00, 1051.12it/s]
classifying aspect sentiments: 100%|██████████| 21/21 [00:15<00:00,  1.32it/s]


[2025-05-07 07:30:10] (2.4.1.post1) The results of aspect term extraction have been saved in d:\Documents\Education\nn-fuzz-proj\Amazon-Sentiment-Analysis\src\cleaning\Aspect Term Extraction and Polarity Classification.FAST_LCF_ATEPC.result.json
[2025-05-07 07:30:10] (2.4.1.post1) Example 0: not as advertised
[2025-05-07 07:30:10] (2.4.1.post1) Example 1: delight says it all
[2025-05-07 07:30:10] (2.4.1.post1) Example 2: great taffy
[2025-05-07 07:30:10] (2.4.1.post1) Example 3: nice taffy
[2025-05-07 07:30:10] (2.4.1.post1) Example 4: great just as good as the expensive brands
[2025-05-07 07:30:10] (2.4.1.post1) Example 5: wonderful tasty taffy
[2025-05-07 07:30:10] (2.4.1.post1) Example 6: yay barley
[2025-05-07 07:30:10] (2.4.1.post1) Example 7: the best <hot sauce:Positive Confidence:0.998> in the world
[2025-05-07 07:30:10] (2.4.1.post1) Example 8: my cats love this <diet food:Positive Confidence:0.9886> better than their regular food
[2025-05-07 07:30:10] (2.4.1.post1) Example 9:

## Word Embedding