# Data Cleaning

In [None]:
'''
Import required packages and libraries for data cleaning
'''
import pandas as pd
import numpy as np
import tensorflow as tf
import transformers
import pyabsa



In [None]:
'''
Set up file path and data handling objects
'''
PATH = "../data/reviews.csv"
data = pd.read_csv(PATH)

## Remove Irrelevant Data Points
The first stage of data cleaning is to identify and remove data points that aren't related to our task. The "Amazon Fine Food Reviews" dataset contains reviews for many product types, including pet food, medicine, microwavable food, and fine foods. For each type of review, we need to ask:
- Is this category of food or type of review relevant to our task?
- Would removing this type of review from the data improve the accuracy of our model?
- If we remove this type of review, how will it affect our training process (would there be too little data remaining)?
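
One way to approach the last question is to measure how much data would remain before committing to a removal. The sketch below is a minimal illustration on toy rows (not actual dataset rows); the keyword list `IRRELEVANT_KEYWORDS` and the helper `flag_irrelevant` are hypothetical names introduced here, and keyword matching is only an assumed stand-in for whatever relevance criterion is eventually chosen.

```python
import re

import pandas as pd

# Hypothetical keywords marking categories we might exclude (assumption).
IRRELEVANT_KEYWORDS = ["dog food", "cat food", "cough medicine"]


def flag_irrelevant(df: pd.DataFrame, keywords: list[str]) -> pd.Series:
    """Return a boolean mask that is True where Summary or Text mentions a keyword."""
    combined = (df["Summary"].fillna("") + " " + df["Text"].fillna("")).str.lower()
    pattern = "|".join(re.escape(keyword) for keyword in keywords)
    return combined.str.contains(pattern, regex=True)


# Toy rows for illustration only.
toy = pd.DataFrame({
    "Summary": ["Good Quality Dog Food", "Great taffy", "Cough Medicine"],
    "Text": ["My dog loves it", "Great taffy at a great price", "Tastes like root beer"],
})

mask = flag_irrelevant(toy, IRRELEVANT_KEYWORDS)
remaining = toy[~mask]
print(f"{len(remaining)} of {len(toy)} reviews remain")
```

Running the same count on the full dataset before dropping anything would show whether enough reviews are left for training.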

## Remove Unnecessary Columns
- What columns are necessary for our model? 
- Is there anything that needs to be removed?

In [None]:
# Only include features that can be plotted in a correlation matrix
# String features cannot be interpreted in a correlation matrix
numeric_data = data.drop(columns=["ProductId", "UserId", "ProfileName", "Summary", "Text"])

In [None]:
# Calculate helpfulness as the fraction of votes that found the review helpful.
# Zero denominators are replaced with NaN to avoid division by zero.
helpfulness_scores = data["HelpfulnessNumerator"] / data["HelpfulnessDenominator"].replace(0, np.nan)

# Add the new helpfulness column to the data as a correlation feature
data["Helpfulness"] = helpfulness_scores

In [None]:
# As seen in the data exploration stage, most numerical features, excluding
# the newly created "Helpfulness", were not indicative of Score.
# Assign the result back, since drop() returns a new DataFrame.
data = data.drop(columns=[
    "Id",
    "ProfileName",
    "HelpfulnessNumerator",
    "HelpfulnessDenominator",
    "Time"
])
data

|        | ProductId  | UserId         | Score | Summary                            | Text                                               | Helpfulness |
|--------|------------|----------------|-------|------------------------------------|----------------------------------------------------|-------------|
| 0      | B001E4KFG0 | A3SGXH7AUHU8GW | 5     | Good Quality Dog Food              | I have bought several of the Vitality canned d...  | 1.0         |
| 1      | B00813GRG4 | A1D87F6ZCVE5NK | 1     | Not as Advertised                  | Product arrived labeled as Jumbo Salted Peanut...  | NaN         |
| 2      | B000LQOCH0 | ABXLMWJIXXAIN  | 4     | "Delight" says it all              | This is a confection that has been around a fe...  | 1.0         |
| 3      | B000UA0QIQ | A395BORC6FGVXV | 2     | Cough Medicine                     | If you are looking for the secret ingredient i...  | 1.0         |
| 4      | B006K2ZZ7K | A1UQRSCLF8GW1T | 5     | Great taffy                        | Great taffy at a great price. There was a wid...   | NaN         |
| ...    | ...        | ...            | ...   | ...                                | ...                                                | ...         |
| 568449 | B001EO7N10 | A28KG5XORO54AY | 5     | Will not do without                | Great for sesame chicken..this is a good if no...  | NaN         |
| 568450 | B003S1WTCU | A3I8AFVPEE8KI5 | 2     | disappointed                       | I'm disappointed with the flavor. The chocolat...  | NaN         |
| 568451 | B004I613EE | A121AA1GQV751Z | 5     | Perfect for our maltipoo           | These stars are small, so you can give 10-15 o...  | 1.0         |
| 568452 | B004I613EE | A3IBEVCTXKNOH  | 5     | Favorite Training and reward treat | These are the BEST treats for training and rew...  | 1.0         |


## Case Sensitivity
Convert the text features in the raw dataset to a single case (all lowercase or all uppercase) to reduce the number of distinct words in the data.
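
As a minimal sketch of this step, the snippet below lowercases the text columns of a toy DataFrame with pandas' `Series.str.lower()`. The column names `Summary` and `Text` match the dataset; the toy rows are made up for illustration, and choosing lowercase over uppercase is an assumption.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Summary": ["Great taffy", "Not as Advertised"],
    "Text": ["GREAT taffy at a Great price.", "Product arrived labeled as Jumbo."],
})

# Lowercase every text column so "Great", "GREAT" and "great" map to one token.
for column in ["Summary", "Text"]:
    toy[column] = toy[column].str.lower()

print(toy["Text"].tolist())
```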

## Remove Filler Words
Some words, such as "I", "the", and "a", don't affect the sentiment of the text. Remove these words from all review content so that there are fewer redundant features for the final model.
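
A minimal sketch of filler-word removal is shown below. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical and written by hand for illustration; a real pipeline would more likely use a curated stop-word list from an NLP library. Negations such as "not" are left out of the set on purpose, on the assumption that removing them could change the sentiment of a review.

```python
# Small hand-written stop-word set for illustration only (assumption).
# Negations such as "not" are deliberately excluded from the set.
STOP_WORDS = {"i", "the", "a", "an", "is", "it", "of", "to", "and"}


def remove_stop_words(text: str, stop_words: set[str] = STOP_WORDS) -> str:
    """Drop whitespace-separated tokens whose lowercase form is a stop word."""
    kept = [token for token in text.split() if token.lower() not in stop_words]
    return " ".join(kept)


cleaned = remove_stop_words("I think the taffy is not great")
print(cleaned)
```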

## Punctuation Handling
Without punctuation handling, the same word can be recorded as several separate features depending on the punctuation attached to it. For example, splitting "Steve's pizza is great!" and "Steve makes great pizza!" on whitespace produces the following counts:

| is | great | great! | makes | pizza | pizza! | Steve | Steve's |
|----|-------|--------|-------|-------|--------|-------|---------|
|1   | 1     | 1      | 1     | 1     | 1      | 1     | 1       |

We want to remove unnecessary punctuation so that we don't have duplicates of effectively the same word:

| is | great | makes | pizza | Steve |
|----|-------|-------|-------|-------|
| 1  | 2     | 1     | 2     | 2     |

Doing this prevents our model from treating variants of the same word as separate features and reduces the number of dimensions the model has to process, which improves efficiency.
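
The two tables can be reproduced with a short sketch using only the standard library. The `normalize` helper is a hypothetical name, and stripping the possessive `'s` before removing the remaining punctuation is an assumption made so that "Steve's" collapses to "Steve" as in the table.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Strip possessive 's and punctuation, then split on whitespace."""
    text = re.sub(r"'s\b", "", text)  # "Steve's" -> "Steve"
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


sentences = ["Steve's pizza is great!", "Steve makes great pizza!"]

# Without punctuation handling: eight distinct features.
raw_counts = Counter(token for s in sentences for token in s.split())

# With punctuation handling: five distinct features.
clean_counts = Counter(token for s in sentences for token in normalize(s))

print(len(raw_counts), len(clean_counts))
print(clean_counts)
```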

## Dependency Parsing Split
In this section we split the dataset into single-entity and multiple-entity data points. This step is necessary because the framework for our model requires that single-entity data points are handled by **model A** and multiple-entity data points are handled by **model B**.

In [None]:
from pyabsa import AspectTermExtraction as ATEPC, available_checkpoints

# view available checkpoints
checkpoint_map = available_checkpoints()

# load model
aspect_extractor = ATEPC.AspectExtractor(
    checkpoint='fast_lcf_atepc',
    auto_device=True,
    cal_perplexity=True
)

# Predict aspect terms for the first five review summaries
results = aspect_extractor.predict(
    data.iloc[:5]["Summary"].tolist(),
    print_result=True,
    ignore_error=True,
)

# Print aspect terms for each item
for i, result in enumerate(results):
    aspects = result.get("aspect", [])
    print(f"Item {i+1} aspects: {aspects}")


[2025-05-05 20:30:20] (2.4.1.post1) Please specify the task code, e.g. from pyabsa import TaskCodeOption
[2025-05-05 20:30:21] (2.4.1.post1) ********** Available ATEPC model checkpoints for Version:2.4.1.post1 (this version) **********
[2025-05-05 20:30:21] (2.4.1.post1) Checkpoint:fast_lcf_atepc is not found, you can raise an issue for requesting shares of checkpoints
[2025-05-05 20:30:22] (2.4.1.post1) No checkpoint found in Model Hub for task: fast_lcf_atepc
[2025-05-05 20:30:25] (2.4.1.post1) Load aspect extractor from checkpoints\ATEPC_MULTILINGUAL_CHECKPOINT
[2025-05-05 20:30:25] (2.4.1.post1) config: checkpoints\ATEPC_MULTILINGUAL_CHECKPOINT\fast_lcf_atepc.config
[2025-05-05 20:30:25] (2.4.1.post1) state_dict: checkpoints\ATEPC_MULTILINGUAL_CHECKPOINT\fast_lcf_atepc.state_dict
[2025-05-05 20:30:25] (2.4.1.post1) model: None
[2025-05-05 20:30:25] (2.4



[2025-05-05 20:34:03] (2.4.1.post1) The results of aspect term extraction have been saved in d:\Documents\OneDrive - UTS\2025\49275 - Neural Networks and Fuzzy Logic\Project\Amazon-Sentiment-Analysis\src\cleaning\absa\Aspect Term Extraction and Polarity Classification.FAST_LCF_ATEPC.result.json
[2025-05-05 20:34:03] (2.4.1.post1) Example 0: Good Quality Dog Food
[2025-05-05 20:34:03] (2.4.1.post1) Example 1: Not as Advertised
[2025-05-05 20:34:03] (2.4.1.post1) Example 2: " <Delight:Positive Confidence:0.9965> " says it all
[2025-05-05 20:34:03] (2.4.1.post1) Example 3: Cough Medicine
[2025-05-05 20:34:03] (2.4.1.post1) Example 4: Great taffy
Item 1 aspects: ['Dog Food']
Item 2 aspects: ['Advertised']
Item 3 aspects: ['Delight']
Item 4 aspects: []
Item 5 aspects: ['taffy']
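
Given extracted aspect lists like the ones printed above, the split into single-entity and multiple-entity data points could be sketched as follows. The helper `split_by_entity_count` is hypothetical, and keeping items with no extracted aspect in a separate group is an assumption, since the text does not say how such items should be handled.

```python
def split_by_entity_count(aspect_lists):
    """Group item indices by how many aspect terms were extracted for each."""
    groups = {"none": [], "single": [], "multiple": []}
    for index, aspects in enumerate(aspect_lists):
        if len(aspects) == 0:
            groups["none"].append(index)      # no aspect found (handling is an assumption)
        elif len(aspects) == 1:
            groups["single"].append(index)    # candidates for model A
        else:
            groups["multiple"].append(index)  # candidates for model B
    return groups


# Aspect lists copied from the printed output above.
extracted = [["Dog Food"], ["Advertised"], ["Delight"], [], ["taffy"]]
groups = split_by_entity_count(extracted)
print(groups)
```

The returned indices can then be used with `data.iloc[...]` to build the two subsets.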


## Word Embedding