# About the dataset:

Sourced from [cs.uic.edu](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html)
- Currently analyzing [Customer Review Datasets (5 products)](http://www.cs.uic.edu/~liub/FBS/CustomerReviewData.zip)
- Contains reviews for 5 products:
	1. digital camera: Canon G3
	2. digital camera: Nikon coolpix 4300
	3. cellular phone: Nokia 6610
	4. mp3 player:     Creative Labs Nomad Jukebox Zen Xtra 40GB
	5. dvd player:     Apex AD2600 Progressive-scan DVD player

# Preprocessing

## Parsing raw text files and saving as csv files

The raw data is distributed as plain-text files, one per product.

Notes:

Regarding Apex DVD player:
- One sentence in the text file (line 485) is marked with a single pound sign `#` instead of the usual `##`, so there is no `##` to split on.
- Some sentences have broken brackets in their annotations; the annotation extraction methods will not pick these up.
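
Both Apex quirks can be reproduced on made-up lines. The sketch below uses hypothetical sample strings (not copied from the dataset) together with the same annotation pattern as the parsing code in this notebook. It suggests why a single `#` needs a fallback split and why an annotation with a missing bracket is silently ignored.

```python
import re

# Same pattern as the notebook's extract_sentiment function.
ANNOTATION_PATTERN = r'(.*?)\[(\+|-)(\d)\]'

# Hypothetical lines for illustration only; not taken from the dataset.
single_pound_line = "remote[-2]#the remote feels cheap ."
broken_bracket_annotation = "picture[+2"  # closing bracket missing

# No '##' is present, so splitting on '##' leaves the line in one piece.
parts_double = single_pound_line.split('##')

# Falling back to a single '#' recovers the annotations and the sentence text.
annotations, text = single_pound_line.split('#', maxsplit=1)

well_formed_matches = re.findall(ANNOTATION_PATTERN, annotations)
broken_matches = re.findall(ANNOTATION_PATTERN, broken_bracket_annotation)

print(parts_double)
print(well_formed_matches)
print(broken_matches)
```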

Regarding Nokia phone:
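- One of the title lines is prefixed with a stray triple asterisk. Full line: "***[t]great phone , but no warranty ! ". Because the file is split on `[t]`, the `***` is left behind as a separate line with no `#`, and that line is skipped during parsing.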
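
A small sketch of how the stray `***` behaves when the content is split on `[t]`. The two-review snippet is hypothetical; only the `***[t]...` title line mirrors the note above.

```python
import re

# Hypothetical snippet; only the '***[t]...' title line mirrors the note above.
sample = (
    "[t]first review title\n"
    "battery[+2]##battery life is good .\n"
    "***[t]great phone , but no warranty ! \n"
    "##i bought it last month .\n"
)

# Split on the title tag and drop the text before the first [t].
reviews = re.split(r'\[t\]', sample)[1:]

# The '***' stays attached to the end of the previous review.
first_review_lines = [line.strip() for line in reviews[0].strip().split('\n')]
second_title = reviews[1].strip().split('\n')[0].strip()

# Lines after the title that contain no '#' cannot be split into annotations and text.
orphan_lines = [line for line in first_review_lines[1:] if '#' not in line]

print(first_review_lines)
print(second_title)
print(orphan_lines)
```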

In [73]:
import re
import pandas as pd

raw_container_path = 'raw_data/customer review data/'

file_name_dict = {
    'canon_g3': "Canon G3.txt",
    'nikon_coolpix_4300': "Nikon coolpix 4300.txt",
    'nokia_6610': "Nokia 6610.txt",
    'nomad_jukebox_zen_xtra': "Creative Labs Nomad Jukebox Zen Xtra 40GB.txt",
    'apex_ad2600_dvd_player': "Apex AD2600 Progressive-scan DVD player.txt",
}

def extract_sentiment(annotations_part: str) -> tuple[dict, int]:
    feature_sentiment_dict = {}
    sentiment_value_total = 0
    # Capture (feature name, sign, single-digit strength) for every "feature[+n]" / "feature[-n]" tag
    feature_sentiment_matches = re.findall(r'(.*?)\[(\+|-)(\d)\]', annotations_part)
    for match in feature_sentiment_matches:
        feature_name = match[0]
        if match[1] == '+':
            sentiment = int(match[2])
        elif match[1] == '-':
            sentiment = int(match[2]) * -1
        else:
            raise Exception("Invalid sentiment: " + match[1])
        feature_sentiment_dict[feature_name] = sentiment
        sentiment_value_total += sentiment
    return feature_sentiment_dict, sentiment_value_total

def extract_other_features(annotations_part: str) -> dict:
    non_sentiment_feature_tags = {
        "[u]": False, 
        "[p]": False, 
        "[s]": False, 
        "[cc]": False, 
        "[cs]": False
    }
    for key in non_sentiment_feature_tags:
        if key in annotations_part:
            non_sentiment_feature_tags[key] = True
    return non_sentiment_feature_tags

def parse_reviews(file_content, raw_text_file_name) -> pd.DataFrame:
    reviews = re.split(r'\[t\]', file_content) # Split the content by the review title tag [t]
    reviews = reviews[1:] # Skip header by skipping to the first [t] tag

    data = []
    for review in reviews:
        # 1. Remove leading and trailing whitespace from review
        # 2. Split into a list of individual lines by '\n'
        # 3. Remove leading and trailing whitespace from individual line
        lines = [line.strip() for line in review.strip().split(sep = '\n')]
        
        title = lines[0] # First line of each review is the review title
        sentences = lines[1:] # The rest are sentences
        
        for sentence in sentences:
            # Split annotations and sentence text
            # The annotations are before '##', the sentence text is after
            # maxsplit=1 keeps any later '#' characters inside the sentence text
            if '##' in sentence:
                annotations_part, sentence_text = sentence.split('##', maxsplit=1)
            elif '#' in sentence:
                # Fallback for lines that mistakenly use a single '#'
                annotations_part, sentence_text = sentence.split('#', maxsplit=1)
            else:
                print(f"Sentence in '{raw_text_file_name}' does not contain a valid sentence starter symbol. Sentence detected: '{sentence}'. Invalid sentence will be skipped.")
                continue
            sentiment_dict, sentiment_total = extract_sentiment(annotations_part)
            other_features = extract_other_features(annotations_part)
            # Append the data
            data.append({
                'title': title,
                'sentence': sentence_text.strip(),
                'sentiment_dict': sentiment_dict,
                'sentiment_total': sentiment_total,
                "[u]": other_features['[u]'], 
                "[p]": other_features['[p]'], 
                "[s]": other_features['[s]'],
                "[cc]": other_features['[cc]'],
                "[cs]": other_features['[cs]'],
                'annotations': annotations_part
            })

    df = pd.DataFrame(data)
    return df

for key, raw_text_file_name in file_name_dict.items():
    # Parse and save to csv
    with open(raw_container_path + raw_text_file_name, 'r') as f:
        content = f.read()
        df = parse_reviews(content, raw_text_file_name)
        df.to_csv('data/' + key + '.csv', index=False)

Sentence in 'Nokia 6610.txt' does not contain a valid sentence starter symbol. Sentence detected: '***'. Invalid sentence will be skipped.
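
In `extract_sentiment`, the lazy group `(.*?)` captures everything between one tag and the next, so in a comma-separated annotation string the separator ends up in the feature name (`,weight` rather than `weight`), and a non-sentiment tag such as `[u]` sitting between two features would be absorbed too. A possible stricter variant is sketched below. The function name `extract_sentiment_clean` and the sample string are hypothetical, and this is not the version that produced the CSV files in this notebook.

```python
import re

def extract_sentiment_clean(annotations_part: str) -> tuple[dict, int]:
    """Hypothetical variant: feature names may not contain brackets or commas."""
    feature_sentiment = {}
    total = 0
    # The name group stops at ',', '[' and ']', so separators and other tags are excluded.
    pattern = r'([^\[\],]+)\[([+-])(\d)\]'
    for name, sign, digit in re.findall(pattern, annotations_part):
        value = int(digit) if sign == '+' else -int(digit)
        feature_sentiment[name.strip()] = value
        total += value
    return feature_sentiment, total

# Made-up annotation string for illustration.
features, total = extract_sentiment_clean("size[+2],weight[+2][u], sound[-1]")
print(features, total)
```

Excluding commas from feature names is a design assumption; it would need revisiting if any product feature in the data legitimately contains a comma.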


In [74]:
nomad_df = pd.read_csv('data/nomad_jukebox_zen_xtra.csv')
nomad_df

| | title | sentence | sentiment_dict | sentiment_total | [u] | [p] | [s] | [cc] | [cs] | annotations |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a quick update to the new zen nx ? | this is an edited review , now that i have had... | {} | 0 | False | False | False | False | False | |
| 1 | a quick update to the new zen nx ? | while , there are flaws with the machine , the... | {'affordability': 3} | 3 | False | False | False | False | False | affordability[+3] |
| 2 | a quick update to the new zen nx ? | it is the most bang-for-the-buck out there . | {'bang-for-the-buck': 3} | 3 | True | False | False | False | False | bang-for-the-buck[+3][u] |
| 3 | a quick update to the new zen nx ? | like it 's predecessor , the quickly revised n... | {'size': 2, ',weight': 2, ',navigational syste... | 9 | False | False | False | False | False | size[+2],weight[+2],navigational system[+2], s... |
| 4 | a quick update to the new zen nx ? | the xtra improves upon the zen nx with a large... | {'screen': 3} | 3 | False | False | False | False | False | screen[+3] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1711 | nomads are wonderful , but be carefull ! | in that model the hard drive just died one mor... | {'hard drive': -2} | -2 | False | False | False | False | False | hard drive[-2] |
| 1712 | nomads are wonderful , but be carefull ! | it 's nothing major , just a bad hard drive , ... | {} | 0 | False | False | False | False | False | |
| 1713 | nomads are wonderful , but be carefull ! | so rule of thumb , no matter what you end up b... | {} | 0 | False | False | False | False | False | |
| 1714 | nomads are wonderful , but be carefull ! | it always pays off . | {} | 0 | False | False | False | False | False | |
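
Because the parsed frame is round-tripped through CSV, the `sentiment_dict` column comes back from `pd.read_csv` as plain strings rather than dictionaries. Below is a minimal sketch of one way to restore them with the standard library's `ast.literal_eval`. The sample strings are copied from untruncated cells in the output above, and the name `restored` is only illustrative.

```python
import ast

# Cell values as they appear after the CSV round trip: strings, not dicts.
raw_cells = ["{}", "{'affordability': 3}", "{'hard drive': -2}"]

# literal_eval only evaluates Python literals (dicts, numbers, strings, ...),
# so it is a safer choice than eval for this kind of data.
restored = [ast.literal_eval(cell) for cell in raw_cells]
print(restored)

# On the DataFrame itself the same idea would be (assumes nomad_df from the cell above):
# nomad_df['sentiment_dict'] = nomad_df['sentiment_dict'].apply(ast.literal_eval)
```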
