<a href="https://colab.research.google.com/github/Kdavis2025/Projects/blob/main/Project_5_NLP_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing

This project gives you practical experience with Natural Language Processing techniques. It has three parts:

1. In Part 1, you will use a dataset in a CSV file.
2. In Part 2, you will use the Wikipedia API to directly access content on Wikipedia.
3. In Part 3, you will make your notebook interactive.

**Part 1**
- The CSV file is available at https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv
- The file contains a list of famous people and a brief overview of each.
- The goal of Part 1 is to:
  1. Pick one person from the list (the target person) and output 10 other people whose overviews are "closest" to the target person's in a Natural Language Processing sense.
  2. Also output the sentiment of the target person's overview.

**Part 2**
- For the same target person that you chose in Part 1, use the Wikipedia API to access the whole content of the target person's Wikipedia page.
- The goal of Part 2 is to:
  1. Print out the text of the Wikipedia article for the target person.
  2. Determine the sentiment of the text of the Wikipedia page for the target person.
  3. Collect the text of the Wikipedia pages of the 10 nearest neighbors from Part 1.
  4. Determine the nearness ranking of these 10 people to your target person based on their entire Wikipedia page.
  5. Compare, i.e. plot, the nearness ranking from Part 1 with the Wikipedia page nearness ranking. The difference in rank is one means of comparison.

**Part 3**
- Make an interactive notebook where a user can choose or enter a name and the notebook displays the 10 closest individuals.

- In addition to presenting the project slides, at the end of the presentation each student will demonstrate their code using a famous person suggested by the other students that exists in the DBpedia set.

## Business Case: Problem Definition

Marketing/PR teams, talent agencies, and content platforms need a fast, scalable way to identify which public figures are thematically similar and understand how they’re portrayed. Today, analysts manually review bios and disparate sentiment sources, a slow process that hinders campaign agility and partnership decisions. **The goal is to automate discovery of the top-10 closest personalities to any target, surface standardized sentiment scores (polarity and subjectivity), compare “short-form” vs. “long-form” similarity, and deliver insights via an intuitive interface—thereby cutting research time by up to 80%, improving targeting accuracy, and enabling data-driven talent strategy.**

## Data Science Problem Definition

**This is an unsupervised similarity-retrieval problem**: Use cosine-based NearestNeighbors to rank the top-10 most similar figures in each vector space (overview and full text).

1. Data Ingestion & Cleaning: Load a CSV of names/overviews and fetch full Wikipedia articles via API. Preprocess text by lowercasing, stripping punctuation/possessives, and removing stopwords.

2. Feature Engineering: Build TF-IDF vector spaces for both the overview corpus and full-text corpus.

3. Sentiment Analysis: Apply TextBlob to compute polarity and subjectivity for both overview and full-page texts.

4. Evaluation & Visualization: Quantify alignment between overview- and full-text rankings via Spearman’s ρ, display rank-difference plots, and deploy an interactive widget for sub-second queries.
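
The rank-alignment measure named in step 4 can be sketched as follows. This is a minimal illustration, not the notebook's own implementation: `spearman_rho` is a hypothetical helper, the two rank lists are made-up values, and the closed-form formula assumes both rankings cover the same items with no ties.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's rank correlation for two tie-free rankings of the same items."""
    if len(ranks_a) != len(ranks_b):
        raise ValueError("rankings must have the same length")
    n = len(ranks_a)
    if n < 2:
        raise ValueError("need at least two items")
    # Sum of squared rank differences, item by item
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    # Closed form, valid only when neither ranking contains ties
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

# Hypothetical ranks for five neighbors (illustrative values, not project results)
overview_rank = [1, 2, 3, 4, 5]
fullpage_rank = [2, 1, 3, 5, 4]

rho = spearman_rho(overview_rank, fullpage_rank)
print(f"Spearman rho = {rho:.2f}")  # Spearman rho = 0.80
```

A value near 1 means the overview ranking and the full-page ranking largely agree; a value near -1 means they are roughly reversed.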

## Data Overview

1. **Columns:** 3 (URI, name, text)
2. **Entries:** 42,786 (index 0 to 42,785)
3. **Data Types:** all 3 columns are `object` (string) dtype
4. **Memory Usage:** 1002.9+ KB (as reported by `df.info()`)

## Data Collection/Sources

Prepare Proper Installs for Project Implementation

In [None]:
#---------Install all required libraries (Combined)-------------------------#
# wikipedia-api: Used to interact with and extract data from Wikipedia.
# wikipedia: Another library to access Wikipedia data.
# textblob: A library for performing common NLP tasks, including sentiment analysis.
# nltk: (Natural Language Toolkit) A comprehensive library for various NLP tasks, such as tokenization and part-of-speech tagging.
# scikit-learn: A machine learning library that includes tools for tasks like TF-IDF vectorization and nearest-neighbor search.
!pip install --quiet wikipedia-api wikipedia textblob nltk scikit-learn ipywidgets matplotlib

#---------Import Implementation----------------------------------------------#
import re                         # regular expressions: find and manipulate patterns in text
import numpy as np                # numerical computations and arrays of numbers
import pandas as pd               # tabular data manipulation (e.g. the CSV file used in this project)
import nltk                       # Natural Language Toolkit; used below to download tokenizer and tagger data
import matplotlib.pyplot as plt   # data visualization and plotting comparisons
import wikipediaapi               # fetching full Wikipedia page text

# Converts text into numerical vectors that machine learning algorithms can use.
# TF-IDF stands for "Term Frequency-Inverse Document Frequency".
from sklearn.feature_extraction.text import TfidfVectorizer
# Finds data points that are similar to each other (e.g. famous people with similar overviews).
from sklearn.neighbors import NearestNeighbors
# Common NLP tasks such as sentiment analysis (positive, negative, or neutral text).
from textblob import TextBlob
# Decodes percent-encoded special characters in URLs (web addresses).
from urllib.parse import unquote
# interact creates UI controls (sliders, dropdowns, etc.) linked to a Python function and
# re-runs that function whenever the user changes a control; widgets provides the controls.
from ipywidgets import interact, widgets
# clear_output replaces a cell's old output with new results; display renders DataFrames.
from IPython.display import clear_output, display

#-------- Download necessary NLTK data---------------------------------------#
# Provides a pre-trained sentence tokenizer, used to split text into sentences.
# Provides a pre-trained part-of-speech tagger, used to identify the grammatical role of words in a sentence.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

#-------- Download TextBlob corpora (sentiment lexicons)---------------------#
!python3 -m textblob.download_corpora >/dev/null

# Enable inline plotting
%matplotlib inline

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


Part 1: Load Data & Compute Nearest Neighbors on the Overviews (Target: Manny Pacquiao)

In [None]:
# 1. Load the CSV of famous people and their overviews
DATA_URL = 'https://ddc-datascience.s3.amazonaws.com/Projects/Project.5-NLP/Data/NLP.csv'
df = pd.read_csv(DATA_URL)

#------------ Pain point: some names are percent-encoded in the CSV ------------#
# Decode any percent-encoded names (e.g. H%C3%BClya %C5%9Eahin → Hülya Şahin)
df['name'] = df['name'].apply(unquote)

# Gather info on the Data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42786 entries, 0 to 42785
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   URI     42786 non-null  object
 1   name    42786 non-null  object
 2   text    42786 non-null  object
dtypes: object(3)
memory usage: 1002.9+ KB
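
The percent-decoding step applied to the `name` column can be checked in isolation. The snippet below is a small sketch using only the standard library; the sample string is the one quoted in the code comment above, and `raw_name` / `decoded` are illustrative variable names.

```python
from urllib.parse import unquote

# Percent-encoded form of a name, as it appears in the raw CSV
raw_name = 'H%C3%BClya %C5%9Eahin'

# unquote interprets %XX escapes as UTF-8 bytes by default
decoded = unquote(raw_name)
print(decoded)  # Hülya Şahin

# Names without escapes pass through unchanged
print(unquote('Manny Pacquiao'))  # Manny Pacquiao
```

Decoding before any lookup matters because the target-name check later compares against the decoded, human-readable form.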


## Data Cleaning

Note: It's best to wrap each step in a function so the logic can be reused by the interactive UI in Part 3.

In [None]:
# 2. Preprocess function: lowercase, remove newlines, possessives, punctuation
def preprocess(text):
    text = text.lower()
    text = re.sub(r'[\r\n]+', ' ', text)
    text = re.sub(r"\'s", "", text)
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    return text.strip()

# 3. Apply cleaning to the overview text
df['clean_text'] = df['text'].fillna('').apply(preprocess)

# 4. Vectorize all overviews using TF-IDF (English stopwords removed)
tfidf_vect = TfidfVectorizer(stop_words='english')
X_overviews = tfidf_vect.fit_transform(df['clean_text'])
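
To see what the cleaning step does before vectorization, the sketch below restates `preprocess` (copied from the cell above so the snippet runs on its own) and applies it to two made-up sample strings. One side effect worth noting: the `[^a-zA-Z0-9\s]` pattern also drops accented letters, which may matter for non-English names.

```python
import re

def preprocess(text):
    text = text.lower()                         # normalize case
    text = re.sub(r'[\r\n]+', ' ', text)        # newlines -> single space
    text = re.sub(r"\'s", "", text)             # strip possessive 's
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # drop everything except ASCII letters, digits, whitespace
    return text.strip()

# Made-up sample inputs for illustration
sample = "Manny's Fight!\nWon."
print(preprocess(sample))         # manny fight won

# Accented characters are removed entirely, not transliterated
print(preprocess("Hülya Şahin"))  # hlya ahin
```

If accented names need to survive cleaning, one option (an assumption, not part of the original notebook) would be to normalize the text with `unicodedata.normalize` before applying the final regex.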

## Exploratory Data Analysis

In [None]:
# 5. Fit a NearestNeighbors model (cosine distance = 1 - cosine similarity)
nn_overview = NearestNeighbors(metric='cosine')
nn_overview.fit(X_overviews)

# 6. Identify the index for our target person (Manny Pacquiao)
target_name = 'Manny Pacquiao'
if target_name not in df['name'].values:
    raise ValueError(f"{target_name} not found in dataset")  # fail early if the name is missing
target_idx = int(df.index[df['name'] == target_name][0])  # row index of the first matching name

# 7. Query the 11 nearest (including self), then drop the first (itself)
distances, indices = nn_overview.kneighbors(
    X_overviews[target_idx], n_neighbors=11  # the target itself plus its 10 nearest neighbors
)
neighbor_idxs = indices.flatten()[1:]     # dataset row indices of the 10 neighbors (target removed)
neighbor_dists = distances.flatten()[1:]  # cosine distances to those neighbors (target removed)

# 8. Display the 10 nearest neighbors with distances
print(f"10 nearest neighbors to {target_name} based on overview TF-IDF:\n")
for rank, (idx, dist) in enumerate(zip(neighbor_idxs, neighbor_dists), start=1):
    print(f"{rank:2d}. {df.at[idx, 'name']}  (cosine distance: {dist:.4f})")

# 9. Sentiment analysis of Manny Pacquiao's overview
manny_overview = df.at[target_idx, 'text']
manny_sent = TextBlob(manny_overview).sentiment
print(f"\nSentiment for {target_name} overview:\n"
      f"  • Polarity   = {manny_sent.polarity:.3f}\n"
      f"  • Subjectivity = {manny_sent.subjectivity:.3f}")

10 nearest neighbors to Manny Pacquiao based on overview TF-IDF:

 1. Bernard Hopkins  (cosine distance: 0.7745)
 2. Kaliesha West  (cosine distance: 0.7758)
 3. Hülya Şahin  (cosine distance: 0.7803)
 4. Danny Ildefonso  (cosine distance: 0.7907)
 5. Pernell Whitaker  (cosine distance: 0.7928)
 6. Curt McCune  (cosine distance: 0.7947)
 7. Mario Yagobi  (cosine distance: 0.8022)
 8. Marvelous Marvin Hagler  (cosine distance: 0.8067)
 9. Haider Ali (boxer)  (cosine distance: 0.8072)
10. Miguel Cotto  (cosine distance: 0.8073)

Sentiment for Manny Pacquiao overview:
  • Polarity   = 0.226
  • Subjectivity = 0.333
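
The distances printed above are cosine distances, i.e. one minus the cosine of the angle between two TF-IDF vectors, so 0 means identical direction and 1 means no shared weighted terms. The sketch below is a pure-Python illustration with tiny made-up vectors; `cosine_distance` is a hypothetical helper, and returning 1.0 for an all-zero vector is an assumed convention for this example.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle) between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumed convention: an empty document shares nothing with any other
    return 1.0 - dot / (norm_u * norm_v)

# Made-up term-weight vectors over a three-word vocabulary
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]   # same direction as doc_a, just longer
doc_c = [0.0, 0.0, 3.0]   # shares no terms with doc_a

print(round(cosine_distance(doc_a, doc_b), 6))  # approximately 0: same direction
print(cosine_distance(doc_a, doc_c))            # 1.0: orthogonal vectors
```

Because the measure ignores vector length, a short overview and a long one can still be close if they emphasize the same terms. Distances around 0.77 to 0.81, as in the list above, indicate only modest term overlap even among the nearest neighbors.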


## Processing

Part 2: Wikipedia Fetch, Sentiment & Ranking Comparison

In [None]:
# 1. Initialize the Wikipedia API client
wiki = wikipediaapi.Wikipedia(language='en', user_agent='NLP-Project')

# 2. Helper: fetch & clean full Wikipedia page text for a person
def fetch_wiki_text(name):
    page = wiki.page(name)
    if not page.exists():
        return ""
    text = page.text
    # reuse our preprocess for basic cleaning
    return preprocess(text)

# 3. Fetch full page text for Manny and the 10 neighbors
neighbors = df.loc[neighbor_idxs, 'name'].tolist()
target_full = fetch_wiki_text(target_name)
neighbors_full = [fetch_wiki_text(n) for n in neighbors]

# 4. Sentiment on full pages
print(f"Full-page sentiment for {target_name}:")
ts = TextBlob(target_full).sentiment
print(f"  • Polarity={ts.polarity:.3f}, Subjectivity={ts.subjectivity:.3f}\n")

print("Full-page sentiment for each neighbor:")
for name, text in zip(neighbors, neighbors_full):
    s = TextBlob(text).sentiment
    print(f"  • {name}: Polarity={s.polarity:.3f}, Subjectivity={s.subjectivity:.3f}")

# 5. Build a TF-IDF corpus on the 11 full texts, then re-run NearestNeighbors
corpus = [target_full] + neighbors_full
tfidf_full = TfidfVectorizer(stop_words='english', min_df=1)
X_full = tfidf_full.fit_transform(corpus)

nn_full = NearestNeighbors(metric='cosine')
nn_full.fit(X_full)
dist_full, idx_full = nn_full.kneighbors(X_full[0], n_neighbors=11)

# 6. Extract full-page neighbor order (skip self at position 0)
full_order_idxs = idx_full.flatten()[1:]

# Map corpus positions back to names (position 0 is the target)
corpus_names = [target_name] + neighbors
full_neighbors = [corpus_names[i] for i in full_order_idxs]

# Shape of the TF-IDF matrix: (documents, vocabulary terms).
# The shape lives on the matrix X_full, not on the vectorizer.
X_full.shape

Full-page sentiment for Manny Pacquiao:
  • Polarity=0.102, Subjectivity=0.393

Full-page sentiment for each neighbor:
  • Bernard Hopkins: Polarity=0.141, Subjectivity=0.413
  • Kaliesha West: Polarity=0.119, Subjectivity=0.162
  • Hülya Şahin: Polarity=0.072, Subjectivity=0.336
  • Danny Ildefonso: Polarity=0.078, Subjectivity=0.400
  • Pernell Whitaker: Polarity=0.136, Subjectivity=0.403
  • Curt McCune: Polarity=0.036, Subjectivity=0.289
  • Mario Yagobi: Polarity=0.121, Subjectivity=0.291
  • Marvelous Marvin Hagler: Polarity=0.099, Subjectivity=0.401
  • Haider Ali (boxer): Polarity=0.141, Subjectivity=0.268
  • Miguel Cotto: Polarity=0.089, Subjectivity=0.390


## Data Visualization/Communication of Results

In [None]:
# 7. Assemble ranking comparison
overview_ranks = {name: r for r,name in enumerate(neighbors, start=1)}
fullpage_ranks = {name: r for r,name in enumerate(full_neighbors, start=1)}

ranking_df = pd.DataFrame({
    'name': neighbors,
    'overview_rank': [overview_ranks[n] for n in neighbors],
    'fullpage_rank': [fullpage_ranks.get(n, None) for n in neighbors]
})
ranking_df['rank_diff'] = ranking_df['overview_rank'] - ranking_df['fullpage_rank']

print("\nRank comparison table:")
display(ranking_df)

# 8. Plot line graph of overview vs full-page ranks side by side
plt.figure(figsize=(10,6))
x = np.arange(len(neighbors)) + 1
plt.plot(x, ranking_df['overview_rank'],   marker='o', label='Overview Rank')
plt.plot(x, ranking_df['fullpage_rank'],   marker='x', label='Full-page Rank')
plt.xticks(x, ranking_df['name'], rotation=45, ha='right')
plt.xlabel('Neighbor')
plt.ylabel('Rank (1 = closest)')
plt.title('Nearest Neighbor Rank: Overview vs Full Wikipedia Page')
plt.legend()
plt.tight_layout()
plt.show()

Part 3: Interactive Selection of Any Person

In [None]:
# 1. Define a function to fetch & display top-N neighbors for any selected name
def show_neighbors(selected_name: str, k: int = 10):
    clear_output(wait=True)  # clear previous output
    if selected_name not in df['name'].values:
        print(f"❌  '{selected_name}' not found in dataset.")
        return

    # Note: neighbors are ranked by cosine distance; sentiment could be used as an alternative.
    idx = int(df.index[df['name'] == selected_name][0])
    dists, idxs = nn_overview.kneighbors(X_overviews[idx], n_neighbors=k+1)

    print(f"Top {k} neighbors for '{selected_name}':\n")
    # Skip position 0, which is the selected person
    for i, (nbr_idx, dist) in enumerate(zip(idxs.flatten()[1:], dists.flatten()[1:]), start=1):
        print(f"{i:2d}. {df.at[nbr_idx, 'name']} (distance: {dist:.4f})")

# 2. Build a dropdown widget of all names
name_widget = widgets.Combobox(
    placeholder='Type or select a name',
    options=sorted(df['name'].tolist()),
    description='Person:',
    ensure_option=True,
    layout=widgets.Layout(width='60%')
)

# 3. Tie it together with interact
interact(show_neighbors,
         selected_name=name_widget,
         k=widgets.IntSlider(value=10, min=5, max=20, step=1, description='# Neighbors:'));