# NLP with spaCy
--------------------------------------------------------------------------------

# Text Data: User reviews from Amazon Product Reviews.

Goal:
==
1. Perform named entity recognition (NER) to extract product names and brands.

2. Analyze sentiment (positive/negative) using a rule-based approach.

Deliverable: Code snippet and output showing extracted entities and sentiment.
==

# About Dataset
This dataset consists of a few million Amazon customer reviews (input text) and star ratings (output labels) for learning how to train fastText for sentiment analysis.

The idea is that the dataset resembles real business data on a reasonable scale.
# Content
The fastText supervised learning tutorial requires data in the following format:

__label__<X> __label__<Y> ... <Text>
where X and Y are the class names. No quotes, all on one line.

In this case, the classes are __label__1 and __label__2, and there is only one class per row.

__label__1 corresponds to 1- and 2-star reviews, and __label__2 corresponds to 4- and 5-star reviews.

(3-star reviews, i.e. reviews with neutral sentiment, were not included in the original.)
The review titles, followed by ':' and a space, are prepended to the text.
Most of the reviews are in English, but there are a few in other languages, like Spanish.
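
As an illustration of the format described above, one line could be split into its label, title, and body roughly as follows. This is a minimal sketch: the helper name `parse_fasttext_line` and the sample line are hypothetical, and it assumes one label per row and that the first `': '` separates the title from the body.

```python
import re

# Hypothetical helper (not part of the original notebook): split one
# fastText-formatted line into (label, title, body).
LINE_RE = re.compile(r"^__label__(\d+)\s+(.*)$", re.DOTALL)

def parse_fasttext_line(line):
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # Line does not start with a fastText label
    label = int(match.group(1))
    # Assumption: the first ": " separates the prepended title from the body
    title, sep, body = match.group(2).partition(": ")
    if not sep:
        title, body = "", match.group(2)  # No title separator found
    return label, title, body

# Made-up sample line, for illustration only
sample = "__label__1 Bad buy: Stopped working."
print(parse_fasttext_line(sample))
```

A review whose body itself contains `': '` is still split at the first occurrence only, which matches the assumption that the title comes first.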
    

# Sentiment Analysis
The rule-based sentiment analysis defined here uses the labels extracted above, which correspond to star ratings as follows:
1. 1- and 2-star ratings are considered "NEGATIVE SENTIMENT".
2. A 3-star rating is considered "NEUTRAL SENTIMENT" (no such reviews are included in this dataset).
3. 4- and 5-star ratings are considered "POSITIVE SENTIMENT".
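
The mapping above can be sketched as two small lookups. The names `stars_to_label` and `label_to_sentiment` are hypothetical and not part of the notebook; because 3-star reviews are excluded from this dataset, the neutral branch is not expected to be reached on real data.

```python
# Hypothetical helpers illustrating the star-rating -> label -> sentiment rule.
def stars_to_label(stars):
    """Map a 1-5 star rating to the dataset label (1 or 2); 3 stars has no label."""
    if stars in (1, 2):
        return 1
    if stars in (4, 5):
        return 2
    return None  # 3-star reviews are not part of the dataset

def label_to_sentiment(label):
    """Map a dataset label to a sentiment string; anything else is neutral."""
    return {1: "NEGATIVE", 2: "POSITIVE"}.get(label, "NEUTRAL")

for stars in range(1, 6):
    print(stars, label_to_sentiment(stars_to_label(stars)))
```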

In [1]:
# Import all the required libraries
import spacy
import bz2
import pandas as pd
import numpy as np
import re

In [2]:
# Load spaCy English model
nlp = spacy.load("en_core_web_md")


In [3]:
# Load the Amazon reviews data (fastText-formatted text file, one review per line; adjust path as needed)
# Example: df = pd.read_csv("amazonreviews/amazonreviews.tsv", sep='\t')

# with bz2.open('test.ft.txt.bz2', 'rb') as f:
#         data = f.read()
#         print(data.decode())
        
df = pd.read_csv("Downloads/test.ft.txt", header=None, encoding_errors='ignore')  # Update path if needed


In [4]:
def assign_labels_and_comments(file, limit=100):
    labels = []   # Integer labels (1 = negative, 2 = positive)
    review = []   # Review texts with the label prefix removed

    # Read the bz2-compressed file line by line as UTF-8 text
    with bz2.open(file, "rt", encoding="utf-8") as f:
        for i, x in enumerate(f):
            if limit is not None and i >= limit:   # Stop after `limit` lines
                break

            # Match the fastText prefix, e.g. "__label__2 ", at the start of the line
            match = re.match(r'__label__([12])\s+', x)
            if not match:
                print("No label")
                continue   # Skip the line so labels and reviews stay aligned

            labels.append(int(match.group(1)))
            review.append(x[match.end():].strip())   # Text after the label prefix

    return np.array(labels), review   # Labels as a NumPy array, reviews as a list

In [5]:
# Load data and call the assign labels function to split the data into label and reviewText 
labels, review = assign_labels_and_comments("Downloads/test.ft.txt.bz2/test.ft.txt.bz2")
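
A quick sanity check after loading is to count how many reviews fall into each class. The sketch below uses a small made-up `sample_labels` list in place of the array returned above, so the counts are illustrative only; `SENTIMENT_NAMES` is likewise a hypothetical name.

```python
from collections import Counter

# Made-up labels standing in for the array returned by assign_labels_and_comments
sample_labels = [2, 2, 1, 2, 1]

SENTIMENT_NAMES = {1: "Negative", 2: "Positive"}  # label -> sentiment name

# Count reviews per sentiment class
counts = Counter(SENTIMENT_NAMES[int(label)] for label in sample_labels)
for name, count in counts.most_common():
    print(f"{name}: {count}")
```

With the real data, passing `labels` instead of `sample_labels` gives the class balance of the loaded reviews.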

In [6]:
df.head(10)

   0
0  __label__2 Great CD: My lovely Pat has one of ...
1  __label__2 One of the best game music soundtra...
2  __label__1 Batteries died within a year ...: I...
3  __label__2 works fine, but Maha Energy is bett...
4  __label__2 Great for the non-audiophile: Revie...
5  __label__1 DVD Player crapped out after one ye...
6  __label__1 Incorrect Disc: I love the style of...
7  __label__1 DVD menu select problems: I cannot ...
8  __label__2 Unique Weird Orientalia from the 19...
9  __label__1 Not an "ultimate guide": Firstly,I ...


In [7]:
# Function to clean a review: lowercase, drop stop words and punctuation, lemmatize

def cleanData(doc, stemming=False):
    # NOTE: `stemming` is currently unused
    doc = nlp(doc.lower())   # Lowercase the raw text, then run the spaCy pipeline
    # Keep only tokens that are neither stop words nor punctuation
    tokens = [token for token in doc if not token.is_stop and not token.is_punct]
    # Join the lemmas back into a single space-separated string
    return " ".join(token.lemma_ for token in tokens)

In [8]:
# Preview the first 10 rows of the reviews
for idx, row in df.head(10).iterrows():
    Review = row[0]
    clean_review = cleanData(Review)
    print(f'clean_review: {clean_review}')

clean_review: label__2 great cd lovely pat great voice generation listen cd year love good mood make feel well bad mood evaporate like sugar rain cd oozes life vocal jusat stuunning lyric kill life hide gem desert isle cd book big everytime play matter black white young old male female everybody say thing singing
clean_review: label__2 good game music soundtrack game play despite fact play small portion game music hear plus connection chrono trigger great lead purchase soundtrack remain favorite album incredible mix fun epic emotional song sad beautiful track especially like kind song video game soundtrack admit song life distant promise bring tear eye occasions.my complaint soundtrack use guitar fret effect song find distract include consider collection worth
clean_review: label__1 battery die year buy charger jul 2003 work ok design nice convenient year battery hold charge alkaline disposable look charger come battery well staying power
clean_review: label__2 work fine maha energy we

In [9]:
# Simple rule-based sentiment analyzer
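def rule_based_sentiment(label):
    # Map the dataset label (not the review text) to a sentiment string
    if label == 2:
        return "Positive"
    elif label == 1:
        return "Negative"
    else:
        return "Neutral"   # Fallback; 3-star reviews are not in this dataset


# Function to extract product names and brands using NER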
def extract_entities(text):
    doc = nlp(text)
    products = [ent.text for ent in doc.ents if ent.label_ == "PRODUCT"]
    brands = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
    return products, brands

# Process first 10 reviews as a sample
for idx, row in df.head(10).iterrows():
    Review = row[0]
    clean_review = cleanData(Review)

    # Use regex to find the label digit after 'label__' in the cleaned text
    match = re.search(r'label__([12])', clean_review)
    if not match:
        print("No label")
        continue   # Skip this row rather than reuse a stale or undefined label

    label = int(match.group(1))
    sentiment = rule_based_sentiment(label)
    # NER runs on the cleaned (lowercased, lemmatized) text here; spaCy's NER
    # may find more entities on the raw review text
    products, brands = extract_entities(clean_review)

    print(f'clean_review: {clean_review}')
    print("Extracted label:", label)
    print(f"Sentiment: {sentiment}\n")
    print(f"Brands: {brands}")
    print(f"Products: {products}")
   

clean_review: label__2 great cd lovely pat great voice generation listen cd year love good mood make feel well bad mood evaporate like sugar rain cd oozes life vocal jusat stuunning lyric kill life hide gem desert isle cd book big everytime play matter black white young old male female everybody say thing singing
Extracted label: 2
Sentiment: Positive

Brands: []
Products: []
clean_review: label__2 good game music soundtrack game play despite fact play small portion game music hear plus connection chrono trigger great lead purchase soundtrack remain favorite album incredible mix fun epic emotional song sad beautiful track especially like kind song video game soundtrack admit song life distant promise bring tear eye occasions.my complaint soundtrack use guitar fret effect song find distract include consider collection worth
Extracted label: 2
Sentiment: Positive

Brands: []
Products: []
clean_review: label__1 battery die year buy charger jul 2003 work ok design nice convenient year batt