<a href="https://colab.research.google.com/github/Fabian-lewis/PLP-AI-Model-WK3/blob/main/NLP_with_spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 📘 Task 3 Report: Named Entity Recognition (NER) & Sentiment Analysis Using spaCy

#### 🎯 Objective
The goal of this task was to perform two Natural Language Processing (NLP) operations on user-generated reviews from Amazon:


1. Named Entity Recognition (NER) to extract meaningful entities like product names, brands, locations, and people from the text.

2. Sentiment Analysis using a basic rule-based approach to classify the emotional tone of each review as either positive, negative, or neutral.

#### 🗃️ Dataset
- Source: Amazon Reviews (Kaggle Dataset)
- Format: Plain text (.txt) with each line structured as:
   - `__label__2 This book is amazing!`

- Labels:
  - `__label__2` = Positive
  - `__label__1` = Negative
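
As a minimal sketch of how one line in this format could be split into a numeric label and the review text (the helper name `parse_fasttext_line` is hypothetical, and treating every non-`__label__2` prefix as negative is an assumption that matches the two labels listed above):

```python
def parse_fasttext_line(line):
    """Split one '__label__N review text' line into (label, review)."""
    prefix, _, review = line.strip().partition(" ")
    if not prefix.startswith("__label__") or not review:
        return None  # empty or malformed line
    # Map __label__2 -> 1 (positive); anything else -> 0 (negative)
    label = 1 if prefix == "__label__2" else 0
    return label, review


print(parse_fasttext_line("__label__2 This book is amazing!"))
# (1, 'This book is amazing!')
```

Returning `None` for malformed lines lets the caller skip them explicitly instead of failing partway through the file.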


#### 🧱 Methodology
1. Data Loading and Preprocessing
  - Opened the raw .txt file and read the first 1,000 lines into memory.

  - Parsed each line into:

    - label: Binary sentiment (1 = positive, 0 = negative)

    - review: The actual text content

  - Stored the cleaned data into a pandas DataFrame.

2. Named Entity Recognition (NER)
  - Loaded the en_core_web_sm spaCy model.

  - Defined a function to process each review and extract named entities.

  - spaCy’s doc.ents was used to return entities with types such as:

    - PERSON, ORG, PRODUCT, WORK_OF_ART, GPE, etc.

  - Applied this function to the first 100 reviews for efficient testing.

3. Rule-Based Sentiment Analysis
  - Created two lists of keywords:

    - positive_words: good, great, love, best, etc.

    - negative_words: bad, hate, terrible, worst, etc.

  - Wrote a function that:

    - Converts the review to lowercase

    - Counts how many positive and negative words appear

    - Assigns a sentiment label: positive, negative, or neutral

  - Applied this function to the first 100 reviews.

In [2]:
# Install spaCy if not installed
# !pip install -U spacy

# Download English model (if not already downloaded)
!python -m spacy download en_core_web_sm

# Import spaCy
import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm") # en_core_web_sm is a lightweight pretrained English NLP model from spaCy.

print("spaCy loaded and ready! 💪")

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
spaCy loaded and ready! 💪


In [3]:
## Import Libraries
from itertools import islice

import pandas as pd

# Parse the first 1,000 lines of the file into (label, review) pairs.
# islice stops reading after 1,000 lines instead of loading the whole file.
data = []
with open("train.ft.txt", encoding="utf-8") as f:
  for line in islice(f, 1000):
    parts = line.strip().split(' ', 1)
    if len(parts) == 2: # Skips blank lines and lines without review text
      label, review = parts
      label = 1 if label == '__label__2' else 0 # 1 = positive, 0 = negative
      data.append((label, review))

# Create a dataframe
df = pd.DataFrame(data, columns=['label', 'review'])

# Display the first few rows
df.head()

|   | label | review |
|---|-------|--------|
| 0 | 1 | Stuning even for the non-gamer: This sound tra... |
| 1 | 1 | The best soundtrack ever to anything.: I'm rea... |
| 2 | 1 | Amazing!: This soundtrack is my favorite music... |
| 3 | 1 | Excellent Soundtrack: I truly like this soundt... |
| 4 | 1 | Remember, Pull Your Jaw Off The Floor After He... |


### Named Entity Recognition (NER) with spaCy

#### We’ll use spaCy to:

  - Process each review
  - Extract named entities like product names, brands, people, places, etc.


#### 🧠 Recap: What’s NER?
*Named Entity Recognition is the task of finding real-world "things" (entities) in text, such as people, organizations, products, and places.*


#### 💡 Interpretation:
- ent.text = the actual named word/phrase
- ent.label_ = its type, like:
  - PERSON → human name
  - ORG → organization
  - PRODUCT → product
  - GPE → country, city, state
  - WORK_OF_ART → songs, albums, books

In [4]:
## Named Entity Recognition

# Load English NLP model
nlp = spacy.load("en_core_web_sm")

# Function to extract named entities from text
def extract_entities(text):
  doc = nlp(text)
  return [(ent.text, ent.label_) for ent in doc.ents]

# Apply to the first 100 reviews to start with
# (rows beyond the first 100 are left as NaN in the new column):
df['entities'] = df['review'].head(100).apply(extract_entities)

# Show sample output
df[['review','entities']].head(10)

|   | review | entities |
|---|--------|----------|
| 0 | Stuning even for the non-gamer: This sound tra... | [(Chrono Cross, ORG)] |
| 1 | The best soundtrack ever to anything.: I'm rea... | [(Yasunori Mitsuda's, PERSON), (years, DATE), ... |
| 2 | Amazing!: This soundtrack is my favorite music... | [(Prisoners of Fate, WORK_OF_ART), (A Distant ... |
| 3 | Excellent Soundtrack: I truly like this soundt... | [(Scars Of Time, FAC), (Between Life and Death... |
| 4 | Remember, Pull Your Jaw Off The Floor After He... | [(Chrono Cross, ORG), (Time, ORG), (Sea, LOC),... |
| 5 | an absolute masterpiece: I am quite sure any o... | [(Mitsuda, PERSON), (every single minute, TIME... |
| 6 | Buyer beware: This is a self-published book, a... | [(5, CARDINAL), (Haddon, PERSON), (an evening,... |
| 7 | Glorious story: I loved Whisper of the wicked ... | [(normaly, GPE)] |
| 8 | A FIVE STAR BOOK: I just finished reading Whis... | [(FIVE, CARDINAL), (Whisper of the Wicked, ORG... |
| 9 | Whispers of the Wicked Saints: This was a easy... | [] |
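
To get an overview of which entity types dominate, the per-review lists can be tallied by label. The sketch below is one possible way to do this with `collections.Counter`. The helper name `count_entity_labels` is hypothetical, and the sample data contains only the three rows of the output above that are shown in full (rows 0, 7, and 9).

```python
from collections import Counter

# Same shape as the `entities` column: one list of (text, label) tuples per review.
sample_entities = [
    [("Chrono Cross", "ORG")],  # row 0
    [("normaly", "GPE")],       # row 7
    [],                         # row 9 (no entities found)
]


def count_entity_labels(entity_lists):
    """Count how often each entity label occurs across all reviews."""
    counts = Counter()
    for entities in entity_lists:
        counts.update(label for _text, label in entities)
    return counts


label_counts = count_entity_labels(sample_entities)
print(label_counts.most_common())
```

On the notebook's DataFrame this would presumably be called as `count_entity_labels(df['entities'].dropna())`, since only the first 100 rows have entity lists.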


### Rule-Based Sentiment Analysis (DIY Style)

- spaCy doesn't include a built-in sentiment analyzer (unlike TextBlob or VADER)

- so instead we will build a simple rule-based version using keyword spotting, which illustrates the basic idea behind lexicon-based sentiment tools

### Steps to Follow
 1. Define sets of positive and negative words
 2. Write a function to scan each review
 3. Return positive, negative, or neutral based on the counts
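
The three steps can be sketched as follows. This is a variant under stated assumptions, not the notebook's exact cell: it splits the review into tokens with a regular expression and counts whole-word matches, so a keyword such as "hate" is not triggered by "whatever", and repeated keywords count each time they occur. The names `POSITIVE`, `NEGATIVE`, and `token_sentiment` are hypothetical.

```python
import re

# Step 1: keyword sets (sets give fast membership tests)
POSITIVE = {"good", "great", "amazing", "awesome", "fantastic", "love",
            "beautiful", "best", "wonderful", "excellent"}
NEGATIVE = {"bad", "worst", "awful", "boring", "terrible", "hate",
            "poor", "ugly", "horrible", "disappointing"}


def token_sentiment(text):
    # Step 2: lowercase the review and split it into word tokens
    tokens = re.findall(r"[a-z']+", text.lower())
    pos_hits = sum(token in POSITIVE for token in tokens)
    neg_hits = sum(token in NEGATIVE for token in tokens)
    # Step 3: compare the counts
    if pos_hits > neg_hits:
        return "positive"
    if neg_hits > pos_hits:
        return "negative"
    return "neutral"


print(token_sentiment("The best soundtrack ever!"))  # positive
print(token_sentiment("Whatever, it was fine."))     # neutral
```

Whole-word matching trades recall for precision: inflected forms such as "loved" are no longer counted unless they are added to the keyword sets.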

In [5]:
## Rule-Based Sentiment Analysis (DIY Style)

# Define basic keyword lists
positive_words = ['good', 'great', 'amazing', 'awesome', 'fantastic', 'love', 'beautiful', 'best', 'wonderful', 'excellent']
negative_words = ['bad', 'worst', 'awful', 'boring', 'terrible', 'hate', 'poor', 'ugly', 'horrible', 'disappointing']


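## Function to assign sentiment based on keywords
# Note: `word in text_lower` is a substring test, so each keyword counts at most
# once per review and also matches inside longer words (e.g. 'hate' in 'whatever').
def rule_based_sentiment(text):
  text_lower = text.lower()
  pos_hits = sum(word in text_lower for word in positive_words)
  neg_hits = sum(word in text_lower for word in negative_words)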

  if pos_hits > neg_hits:
    return 'positive'
  elif neg_hits > pos_hits:
    return 'negative'
  else:
    return 'neutral'

## Apply to the first 100 reviews
df['sentiment'] = df['review'].head(100).apply(rule_based_sentiment)

# Preview the sentiment alongside the entities
df[['review', 'entities', 'sentiment']].head(10)


|   | review | entities | sentiment |
|---|--------|----------|-----------|
| 0 | Stuning even for the non-gamer: This sound tra... | [(Chrono Cross, ORG)] | positive |
| 1 | The best soundtrack ever to anything.: I'm rea... | [(Yasunori Mitsuda's, PERSON), (years, DATE), ... | positive |
| 2 | Amazing!: This soundtrack is my favorite music... | [(Prisoners of Fate, WORK_OF_ART), (A Distant ... | positive |
| 3 | Excellent Soundtrack: I truly like this soundt... | [(Scars Of Time, FAC), (Between Life and Death... | positive |
| 4 | Remember, Pull Your Jaw Off The Floor After He... | [(Chrono Cross, ORG), (Time, ORG), (Sea, LOC),... | positive |
| 5 | an absolute masterpiece: I am quite sure any o... | [(Mitsuda, PERSON), (every single minute, TIME... | positive |
| 6 | Buyer beware: This is a self-published book, a... | [(5, CARDINAL), (Haddon, PERSON), (an evening,... | negative |
| 7 | Glorious story: I loved Whisper of the wicked ... | [(normaly, GPE)] | positive |
| 8 | A FIVE STAR BOOK: I just finished reading Whis... | [(FIVE, CARDINAL), (Whisper of the Wicked, ORG... | positive |
| 9 | Whispers of the Wicked Saints: This was a easy... | [] | neutral |
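
Because the dataset already carries a ground-truth `label` column, the rule-based output can be checked against it. The sketch below shows one possible scoring scheme: accuracy over the reviews that received a polarity, plus the share of reviews that were not left neutral. The function name `score_against_labels` is hypothetical, and the toy lists are made-up values for illustration, not results from this notebook.

```python
def score_against_labels(labels, sentiments):
    """Return (accuracy on non-neutral predictions, share of non-neutral predictions).

    labels:     iterable of 1 (positive) / 0 (negative)
    sentiments: iterable of 'positive' / 'negative' / 'neutral'
    """
    pairs = list(zip(labels, sentiments))
    decided = [(y, s) for y, s in pairs if s != "neutral"]
    if not decided:
        return 0.0, 0.0
    # A prediction is correct when 'positive' lines up with label 1
    # and 'negative' lines up with label 0.
    correct = sum((s == "positive") == (y == 1) for y, s in decided)
    return correct / len(decided), len(decided) / len(pairs)


# Toy values for illustration only
toy_labels = [1, 0, 1, 0]
toy_sentiments = ["positive", "negative", "neutral", "positive"]

accuracy, coverage = score_against_labels(toy_labels, toy_sentiments)
print(f"accuracy={accuracy:.2f}, coverage={coverage:.2f}")
# accuracy=0.67, coverage=0.75
```

On the notebook's DataFrame this would presumably be called as `score_against_labels(df['label'].head(100), df['sentiment'].head(100))`.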
