**📌 Extracting Product & Brand Intelligence from Amazon Reviews with spaCy: NER + Rule‑Based Sentiment**

## 🧠 Introduction

In today’s data‑driven marketplace, organizations rely on Natural Language Processing (NLP) to transform unstructured customer feedback into actionable insights. This mini‑project—part of the broader **AI Tools and Applications** assignment themed **“Mastering the AI Toolkit”**—focuses on applying **spaCy**, a production‑grade Python NLP framework, to Amazon product reviews.

By combining **Named Entity Recognition (NER)** and a lightweight **rule‑based sentiment model**, we demonstrate how AI tools can surface which **brands** and **products** customers mention and how they feel about them.

---

## 📚 Project Description

We will build a small pipeline that:

1. **Ingests** raw review text from the Amazon Product Reviews dataset.  
2. **Processes** each review with spaCy’s pretrained English model (`en_core_web_sm`), enhanced with an `EntityRuler`.  
3. **Extracts** entities of type `PRODUCT` and `ORG`, plus the custom `BRAND` label added through the `EntityRuler`.  
4. **Classifies** overall review sentiment (positive, negative, or neutral) using a rule‑based lexicon.  
5. **Outputs** a structured summary showing:  
   - Review Text  
   - Extracted Entities  
   - Sentiment Label

---

## ❗ Problem Statement

Customer reviews often hide valuable signals about **which products** and **which brands** receive praise or criticism. Manually sifting through thousands of reviews is infeasible and prone to bias.

We need an automated, reproducible method to **identify product/brand mentions** and **gauge customer sentiment**, so that businesses can:

- Track product reputation in real time  
- Compare competing brands based on customer experience  
- Prioritize improvements or marketing where negativity is high

---

## 🎯 Main Objective

Design and implement a **spaCy‑based NLP workflow** that **automatically extracts product/brand entities** and **assigns a sentiment label** (positive, negative, or neutral) to each Amazon review.

---

## 📌 Specific Objectives

1. **Data Preparation**
   - Load and inspect the Amazon Product Reviews dataset.
   - Handle missing or noisy text (e.g., remove HTML, standardize casing).

2. **NER Pipeline Setup**
   - Load spaCy model `en_core_web_sm`.
   - Add an `EntityRuler` with custom patterns to capture product and brand names.
   - Ensure correct extraction of `ORG`, `PRODUCT`, and custom `BRAND` entities.

3. **Rule‑Based Sentiment Analysis**
   - Create simple word lists of positive and negative terms.
   - Implement a scoring function to classify sentiment.
   - Test function on sample reviews.

4. **Review Processing Loop**
   - Run spaCy on each review.
   - For each, extract:
     - Entities (brands/products)
     - Sentiment label

5. **Result Display & Export**
   - Compile results into a DataFrame.
   - Display sample output and save to CSV.
   - (Optional) Create summary charts of top brands or sentiment ratios.

6. **Deliverable Packaging**
   - Provide clean, reproducible Python code or notebook.
   - Include at least 3 output examples with entities and sentiment correctly labeled.

---

> ✅ This task demonstrates mastery of practical AI tools through real-world application of NLP with spaCy.

## ✅ 1. DATA COLLECTION AND PREPARATION

🔹 LOAD AND PREPARE YOUR DATA

📦 Import Libraries:

In [41]:
import pandas as pd
import spacy
from spacy.pipeline import EntityRuler
from spacy import displacy
import re
from collections import defaultdict


Load NLP Model

In [34]:
# Load spaCy English model
nlp = spacy.load("en_core_web_sm")


Load the Dataset

In [14]:
import pandas as pd


df = pd.read_csv("Reviews.csv")

# Show first few rows and columns
print(df.head())



   Id   ProductId          UserId                      ProfileName  \
0   1  B001E4KFG0  A3SGXH7AUHU8GW                       delmartian   
1   2  B00813GRG4  A1D87F6ZCVE5NK                           dll pa   
2   3  B000LQOCH0   ABXLMWJIXXAIN  Natalia Corres "Natalia Corres"   
3   4  B000UA0QIQ  A395BORC6FGVXV                             Karl   
4   5  B006K2ZZ7K  A1UQRSCLF8GW1T    Michael D. Bigham "M. Wassir"   

   HelpfulnessNumerator  HelpfulnessDenominator  Score        Time  \
0                     1                       1      5  1303862400   
1                     0                       0      1  1346976000   
2                     1                       1      4  1219017600   
3                     3                       3      2  1307923200   
4                     0                       0      5  1350777600   

                 Summary                                               Text  
0  Good Quality Dog Food  I have bought several of the Vitality canned d...  
1 

In [15]:
df.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

✅ Step-by-Step: Clean the **Amazon Review Data**

In [48]:
## 1. Data Preparation

def load_data(filepath):
    """Load and clean the Amazon reviews dataset"""
    df = pd.read_csv(filepath)
    print(f"Original data shape: {df.shape}")

    # Basic cleaning
    df = df.dropna(subset=['Text'])  # Remove reviews with no text
    df['Text'] = df['Text'].apply(clean_text)
    print(f"Cleaned data shape: {df.shape}")
    print("\nSample data:")
    print(df[['Score', 'Text']].head(3))
    return df

def clean_text(text):
    """Clean review text"""
    if not isinstance(text, str):
        return ""
    text = re.sub(r'<[^>]+>', '', text)  # Remove HTML
    text = text.lower()  # Lowercase
    text = ' '.join(text.split())  # Remove extra whitespace
    return text
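One detail of tag stripping is worth checking on a small example: deleting a tag such as `<br />` outright joins the words on either side of it, while substituting a space keeps them apart. The sketch below contrasts the two; the function names `clean_deleting_tags` and `clean_spacing_tags` and the sample string are hypothetical and not part of the notebook.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")

def clean_deleting_tags(text):
    """Strip tags by deleting them, then lowercase and collapse whitespace."""
    if not isinstance(text, str):
        return ""
    return " ".join(TAG_RE.sub("", text).lower().split())

def clean_spacing_tags(text):
    """Hypothetical variant: replace each tag with a space before collapsing."""
    if not isinstance(text, str):
        return ""
    return " ".join(TAG_RE.sub(" ", text).lower().split())

sample = "Great taste.<br />Will buy again"  # made-up review text
print(clean_deleting_tags(sample))  # great taste.will buy again
print(clean_spacing_tags(sample))   # great taste. will buy again
```

Lowercasing is also a trade-off here: the pretrained NER component may rely on capitalization cues, so it can be worth comparing entity counts with and without that step.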



# NER Pipeline Setup

In [50]:
## 2. NER Pipeline Setup

def create_ner_pipeline():
    """Create and configure the spaCy NLP pipeline with enhanced entity recognition"""
    nlp = spacy.load("en_core_web_sm")

    # Expanded patterns with variations
    patterns = [
        {"label": "PRODUCT", "pattern": [{"LOWER": "kindle"}]},
        {"label": "PRODUCT", "pattern": [{"LOWER": "echo"}, {"LOWER": "dot"}]},
        {"label": "PRODUCT", "pattern": [{"LOWER": "fire"}, {"LOWER": "tv"}, {"LOWER": "stick"}]},
        {"label": "BRAND", "pattern": [{"LOWER": "amazon"}]},
        {"label": "BRAND", "pattern": [{"LOWER": "bose"}]},
        {"label": "BRAND", "pattern": [{"LOWER": "sony"}]},
        {"label": "PRODUCT", "pattern": [{"LOWER": "dog"}, {"LOWER": "food"}]},
        {"label": "PRODUCT", "pattern": [{"LOWER": "cat"}, {"LOWER": "treats"}]},
        {"label": "PRODUCT", "pattern": [{"LOWER": "canned"}, {"LOWER": "food"}]},
        {"label": "BRAND", "pattern": [{"LOWER": "vitality"}]},
        {"label": "PRODUCT", "pattern": [{"LOWER": "peanut"}, {"LOWER": "butter"}]},
        {"label": "PRODUCT", "pattern": [{"LOWER": "confection"}]}
    ]

    ruler = nlp.add_pipe("entity_ruler", before="ner")
    ruler.add_patterns(patterns)

    return nlp
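To see what the ruler patterns match in isolation, they can be tried on a single sentence. This is a minimal sketch, assuming a blank English pipeline (tokenizer only, so no model download is needed) and just two of the patterns above; the sample sentence is made up.

```python
import spacy

# Blank pipeline: only the tokenizer plus the entity ruler we add below.
nlp_demo = spacy.blank("en")
ruler = nlp_demo.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": [{"LOWER": "echo"}, {"LOWER": "dot"}]},
    {"label": "BRAND", "pattern": [{"LOWER": "bose"}]},
])

doc = nlp_demo("I swapped my Echo Dot for a Bose speaker.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)  # [('Echo Dot', 'PRODUCT'), ('Bose', 'BRAND')]
```

Because the patterns match on `LOWER`, they fire regardless of the casing of the input text.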

# Rule-Based Sentiment Analysis

In [51]:
## 3. Rule-Based Sentiment Analysis

POSITIVE_WORDS = {
    'good', 'great', 'excellent', 'awesome', 'fantastic',
    'wonderful', 'amazing', 'love', 'liked', 'perfect', 'best',
    'superb', 'outstanding', 'recommend', 'satisfied', 'happy'
}

NEGATIVE_WORDS = {
    'bad', 'terrible', 'awful', 'horrible', 'hate',
    'dislike', 'poor', 'worst', 'broke', 'defective', 'waste',
    'disappointed', 'return', 'broken', 'junk', 'avoid'
}

def get_sentiment(text):
    """Enhanced sentiment analysis with negation handling"""
    if not text:
        return "neutral"

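    # Whitespace tokenization: punctuation stays attached to words, so
    # "great!" will not match the lexicon and "n't" never appears on its own
    words = text.split()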
    positive_count = 0
    negative_count = 0

    for i, word in enumerate(words):
        if word in POSITIVE_WORDS:
            # Check for negation
            if i > 0 and words[i-1] in ['not', "n't", 'never']:
                negative_count += 1
            else:
                positive_count += 1
        elif word in NEGATIVE_WORDS:
            # Check for negation
            if i > 0 and words[i-1] in ['not', "n't", 'never']:
                positive_count += 1
            else:
                negative_count += 1

    if positive_count > negative_count:
        return "positive"
    elif negative_count > positive_count:
        return "negative"
    return "neutral"
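Whitespace splitting leaves punctuation attached to words, so a token like `great!` misses the lexicon and contractions such as `isn't` are never recognized as negators. The sketch below shows one possible alternative using a regex tokenizer; `tokenize`, `is_negator`, and `get_sentiment_tokenized` are hypothetical names, and the word sets are small subsets of the lists above chosen for illustration.

```python
import re

# Illustrative subsets of the notebook's lexicons.
POSITIVE = {"good", "great", "love", "best", "recommend"}
NEGATIVE = {"bad", "terrible", "worst", "waste", "broken"}
NEGATORS = {"not", "never"}

def tokenize(text):
    """Lowercase and keep alphabetic tokens, allowing one internal apostrophe."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def is_negator(token):
    """True for plain negators and for contractions ending in n't."""
    return token in NEGATORS or token.endswith("n't")

def get_sentiment_tokenized(text):
    """Count lexicon hits, flipping polarity when the previous token negates."""
    tokens = tokenize(text or "")
    score = 0
    for i, tok in enumerate(tokens):
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)  # +1, -1, or 0
        if polarity and i > 0 and is_negator(tokens[i - 1]):
            polarity = -polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(get_sentiment_tokenized("This is great!"))   # positive
print(get_sentiment_tokenized("It isn't good."))   # negative
print(get_sentiment_tokenized("Not bad at all"))   # positive
```

Negation is still only checked one token back, so phrases like "not very good" would need a wider window.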

# Review Processing Loop

In [52]:
## 4. Review Processing Loop

def process_reviews(df, nlp, sample_size=1000):
    """Process reviews with enhanced entity extraction"""
    sample_size = min(sample_size, len(df))
    sample_df = df.sample(sample_size, random_state=42)

    results = []
    for _, row in sample_df.iterrows():
        doc = nlp(row['Text'])
        entities = extract_entities(doc)

        results.append({
            'review_id': row['Id'],
            'text': row['Text'][:200] + "...",  # Show more context
            'score': row['Score'],
            'entities': entities,
            'sentiment': get_sentiment(row['Text'])
        })

    return pd.DataFrame(results)

def extract_entities(doc):
    """Enhanced entity extraction with merging multi-word entities"""
    entities = defaultdict(list)
    for ent in doc.ents:
        if ent.label_ in ['PRODUCT', 'BRAND', 'ORG']:
            # Merge multi-word entities
            merged_text = ' '.join([t.text for t in ent])
            entities[ent.label_].append(merged_text)

    # Remove duplicates and empty entries
    return {k: list(set(v)) for k, v in entities.items() if v}
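For larger samples, spaCy's `nlp.pipe` processes texts in batches, which is generally faster than calling `nlp` once per row. A minimal sketch, assuming a blank English pipeline with two of the ruler patterns stands in for `create_ner_pipeline()`; the two reviews and the `batch_size` value are illustrative only.

```python
import spacy

# Stand-in pipeline (assumption): tokenizer plus an entity ruler.
nlp_batch = spacy.blank("en")
ruler = nlp_batch.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": [{"LOWER": "dog"}, {"LOWER": "food"}]},
    {"label": "BRAND", "pattern": [{"LOWER": "vitality"}]},
])

# Made-up (review_id, text) pairs.
reviews = [
    (1, "the vitality canned dog food is good quality."),
    (2, "arrived late and the box was damaged."),
]

rows = []
texts = [text for _, text in reviews]
# nlp.pipe yields docs in the same order as the input texts.
for (review_id, _), doc in zip(reviews, nlp_batch.pipe(texts, batch_size=50)):
    ents = sorted({(ent.label_, ent.text) for ent in doc.ents})
    rows.append({"review_id": review_id, "entities": ents})

print(rows)
```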



# Result Display & Export

In [53]:
## 5. Result Display & Export

def show_entity_examples(results_df, n=3):
    """Display reviews that actually contained entities"""
    has_entities = results_df[results_df['entities'].apply(bool)]
    if not has_entities.empty:
        print("\nReviews with extracted entities:")
        for _, row in has_entities.head(n).iterrows():
            print(f"\nScore: {row['score']}, Sentiment: {row['sentiment']}")
            print(f"Text: {row['text']}")
            print(f"Entities: {row['entities']}")
    else:
        print("\nWarning: No entities were extracted in any reviews")

def analyze_results(results_df):
    """Enhanced results analysis with entity examples"""
    print("\nSample processed reviews:")
    print(results_df.head(3))

    print("\nSentiment distribution:")
    print(results_df['sentiment'].value_counts())

    # Show entity examples
    show_entity_examples(results_df)

    # Calculate entity statistics
    all_entities = []
    for entities in results_df['entities']:
        for entity_type, entity_list in entities.items():
            all_entities.extend([(entity_type, entity) for entity in entity_list])

    if all_entities:
        entity_df = pd.DataFrame(all_entities, columns=['type', 'entity'])
        print("\nMost frequent entities:")
        print(entity_df.value_counts().head(10))

    # Save to CSV
    results_df.to_csv('amazon_reviews_analyzed.csv', index=False)
    print("\nResults saved to 'amazon_reviews_analyzed.csv'")

## Main Execution

if __name__ == "__main__":
    print("Loading data...")
    df = load_data("Reviews.csv")

    print("\nSetting up NLP pipeline...")
    nlp = create_ner_pipeline()

    print("\nProcessing reviews...")
    results_df = process_reviews(df, nlp, sample_size=1000)

    analyze_results(results_df)

Loading data...
Original data shape: (568454, 10)
Cleaned data shape: (568454, 10)

Sample data:
   Score                                               Text
0      5  i have bought several of the vitality canned d...
1      1  product arrived labeled as jumbo salted peanut...
2      4  this is a confection that has been around a fe...

Setting up NLP pipeline...

Processing reviews...

Sample processed reviews:
   review_id                                               text  score  \
0     165257  having tried a couple of other brands of glute...      5   
1     231466  my cat loves these treats. if ever i can't fin...      5   
2     427828  a little less than i expected. it tends to hav...      3   

  entities sentiment  
0       {}  positive  
1       {}  positive  
2       {}   neutral  

Sentiment distribution:
sentiment
positive    596
neutral     349
negative     55
Name: count, dtype: int64

Reviews with extracted entities:

Score: 5, Sentiment: positive
Text: i absolutely lov

In [54]:
import pandas as pd
df = pd.read_csv('amazon_reviews_analyzed.csv')
df.head()

   review_id                                               text  score  entities  sentiment
0     165257  having tried a couple of other brands of glute...      5        {}   positive
1     231466  my cat loves these treats. if ever i can't fin...      5        {}   positive
2     427828  a little less than i expected. it tends to hav...      3        {}    neutral
3     433955  first there was frosted mini-wheats, in origin...      2        {}   positive
4      70261  and i want to congratulate the graphic artist ...      5        {}   positive
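Because `to_csv` writes each `entities` dict as text, reading the file back yields strings such as `"{}"` rather than dicts. The sketch below shows one way to restore them with `ast.literal_eval`; it uses the standard-library `csv` module on two made-up rows so that it runs without the real file.

```python
import ast
import csv
import io

# Made-up rows in the same column layout as amazon_reviews_analyzed.csv.
csv_text = '''review_id,text,score,entities,sentiment
1,my cat loves these treats...,5,{},positive
2,the vitality dog food...,5,"{'BRAND': ['vitality'], 'PRODUCT': ['dog food']}",positive
'''

rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    # literal_eval safely parses Python literals (dicts, lists, strings).
    row["entities"] = ast.literal_eval(row["entities"])

print(rows[0]["entities"])  # {}
print(rows[1]["entities"])  # {'BRAND': ['vitality'], 'PRODUCT': ['dog food']}
```

With pandas, the same conversion can be applied to the loaded frame via `df['entities'].apply(ast.literal_eval)`.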


In [55]:
from google.colab import files
files.download('amazon_reviews_analyzed.csv')
