# Week 4: Unsupervised Learning Techniques

**Web and Social Network Analytics**

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Explain** the difference between supervised and unsupervised learning
2. **Apply** sentiment analysis to classify customer reviews
3. **Calculate** support, confidence, and lift for association rules
4. **Implement** the A-Priori algorithm to find frequent itemsets
5. **Design** a collaborative filtering recommendation system

---

**ShopSocial Context**: How can we understand customer opinions, find product bundles, and make personalized recommendations?

## Setup

Run this cell first to import all required libraries.

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
from itertools import combinations

# Machine Learning
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import cosine

# Sentiment Analysis
try:
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    print("VADER imported successfully!")
except ImportError:
    print("VADER not installed. Run: pip install vaderSentiment")

# Visualization
import matplotlib.pyplot as plt
import matplotlib.cm as cm

print('Core libraries imported successfully!')

---

# Part 1: Clustering Review

Clustering groups similar items together **without predefined labels**. This is the essence of unsupervised learning.

## 1.1 Distance Metrics

Clustering algorithms group data based on distance. Common metrics include:

- **Euclidean Distance**: $d = \sqrt{\sum_{i}(a_i - b_i)^2}$
- **Manhattan Distance**: $d = \sum_{i}|a_i - b_i|$
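
As a minimal sketch of the two formulas, the snippet below computes both distances for a pair of points. The function names and the points `p` and `q` are hypothetical, chosen only so the arithmetic is easy to check by hand.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical points: the differences are 3 and 4 along the two axes
p = (1.0, 2.0)
q = (4.0, 6.0)

print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

Manhattan distance is never smaller than Euclidean distance for the same pair of points, so the choice of metric can change which points count as "close".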

## 1.2 K-Means Clustering

K-Means partitions data into K clusters in four steps:
1. Randomly initialize K centroids
2. Assign each point to the nearest centroid
3. Update centroids to the mean of assigned points
4. Repeat until convergence

**Key Parameter**: You must specify K (number of clusters) in advance.
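
The four steps can be sketched in plain Python as follows. This is an illustration under simplifying assumptions (2-D points, a fixed seed, a hypothetical function name `kmeans_sketch`), not how scikit-learn implements `KMeans`, which by default uses a smarter initialization than purely random centroids.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans_sketch(points, k, n_iter=100, seed=42):
    """Plain K-Means on a list of (x, y) tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: random initial centroids
    labels = []
    for _ in range(n_iter):
        # step 2: assign each point to its nearest centroid
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # step 3: move each centroid to the mean of its assigned points
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        # step 4: stop once the centroids no longer move
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: two well-separated groups of three points each
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans_sketch(toy_points, k=2)
print(labels)
```

Which group receives label 0 depends on the random initialization, so only the grouping is meaningful, not the label numbers.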

In [None]:
# Load Starbucks location data
data = pd.read_csv("data/starbucks_locations.csv", index_col=0)
data = data.dropna()
print(f"Total locations: {len(data)}")
print(data.head())

In [None]:
# Filter to a specific region for faster processing
# Middle East region (UAE, Saudi Arabia, etc.)
region_data = data[(data["Longitude"].between(49, 56)) & (data["Latitude"].between(24, 27))]
print(f"Filtered locations: {len(region_data)}")
region_data.head()

In [None]:
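# Apply K-Means with K=5 on the coordinate columns only,
# so that centroid column 0 is longitude and column 1 is latitude
coords = region_data[["Longitude", "Latitude"]]
kmeans = KMeans(n_clusters=5, max_iter=500, random_state=42)
kmeans.fit(coords)

print("Cluster centroids:")
for i, center in enumerate(kmeans.cluster_centers_):
    print(f"  Cluster {i}: Longitude={center[0]:.2f}, Latitude={center[1]:.2f}")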

In [None]:
# Visualize K-Means clusters
plt.figure(figsize=(10, 8))
plt.scatter(region_data['Longitude'], region_data['Latitude'], 
            c=kmeans.labels_, cmap='tab10', s=50, alpha=0.7)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
            c='red', marker='X', s=200, label='Centroids')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('K-Means Clustering of Starbucks Locations (K=5)')
plt.legend()
plt.show()

## 1.3 DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters of arbitrary shape and identifies **outliers**.

**Key Parameters**:
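- `eps`: Maximum distance between two points for them to count as neighbors
- `min_samples`: Minimum number of points (including the point itself) within `eps` for a point to be a core point of a dense region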

**Advantage**: Does not require specifying K in advance!
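
To make the two parameters concrete, here is a small sketch of the core-point test that DBSCAN builds on. It covers only that one step (not cluster expansion), and the helper names and coordinates are hypothetical.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def core_points(points, eps, min_samples):
    """Indices of points whose eps-neighborhood holds at least min_samples points."""
    return [i for i in range(len(points))
            if len(region_query(points, i, eps)) >= min_samples]

# Hypothetical coordinates: a tight group of four plus one isolated point
pts = [(0.0, 0.0), (0.05, 0.0), (0.0, 0.05), (0.05, 0.05), (1.0, 1.0)]
print(core_points(pts, eps=0.1, min_samples=3))  # [0, 1, 2, 3]
```

The isolated point is not a core point and is not within `eps` of any core point, which is the situation DBSCAN reports as an outlier (label `-1`).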

In [None]:
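# Apply DBSCAN on the coordinate columns only
# (eps=0.1 is in degrees here, roughly 11 km of latitude)
dbscan = DBSCAN(eps=0.1, min_samples=3)
dbscan.fit(region_data[["Longitude", "Latitude"]])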

labels = set(dbscan.labels_)
n_clusters = len(labels) - (1 if -1 in labels else 0)
n_outliers = list(dbscan.labels_).count(-1)

print(f"Number of clusters: {n_clusters}")
print(f"Number of outliers: {n_outliers}")
print(f"Cluster labels: {labels}")

In [None]:
# Visualize DBSCAN clusters (outliers shown as black X)
plt.figure(figsize=(10, 8))
cluster_labels = dbscan.labels_
is_outlier = cluster_labels == -1

# Plot all clustered points in one call, colored by cluster label
plt.scatter(region_data['Longitude'][~is_outlier], region_data['Latitude'][~is_outlier],
            c=cluster_labels[~is_outlier], cmap='tab10', s=50)
# Plot outliers as black X markers
plt.plot(region_data['Longitude'][is_outlier], region_data['Latitude'][is_outlier],
         'kx', markersize=8)

plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('DBSCAN Clustering (Black X = Outliers)')
plt.show()

## 1.4 RFM Customer Segmentation

**RFM Analysis** segments customers using three metrics:
- **R**ecency: How recently did the customer purchase?
- **F**requency: How often do they purchase?
- **M**onetary: How much do they spend?

| Segment | Recency | Frequency | Monetary | Strategy |
|---------|---------|-----------|----------|----------|
| **Champions** | Recent | Often | High | Reward loyalty |
| **At Risk** | Long ago | Often | High | Win back campaigns |
| **New Customers** | Recent | Low | Low | Onboarding emails |
| **Hibernating** | Long ago | Low | Low | Re-engagement offers |

**Key Insight**: K-Means on RFM features helps ShopSocial personalize marketing without manual labeling!
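
As a rough sketch of where the three features come from, the snippet below derives them from an order log. The customers, dates, amounts, and the `snapshot` date are all made up for illustration; a real pipeline would more likely use a pandas `groupby`.

```python
from datetime import date

# Hypothetical order log: (customer_id, order_date, amount)
orders = [
    ("alice", date(2024, 6, 28), 120.0),
    ("alice", date(2024, 6, 1), 80.0),
    ("bob", date(2024, 1, 15), 40.0),
]
snapshot = date(2024, 7, 1)  # assumed "today" for measuring recency

rfm = {}
for customer, day, amount in orders:
    row = rfm.setdefault(customer, {"recency": None, "frequency": 0, "monetary": 0.0})
    days_ago = (snapshot - day).days
    # Recency = days since the customer's most recent order
    row["recency"] = days_ago if row["recency"] is None else min(row["recency"], days_ago)
    row["frequency"] += 1       # Frequency = number of orders
    row["monetary"] += amount   # Monetary = total spend

for customer, row in rfm.items():
    print(customer, row)
```

Because the three features live on very different scales (days, counts, currency), they would normally be standardized before being passed to K-Means.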

---

# Part 2: Sentiment Analysis

**Sentiment Analysis** automatically determines the emotional tone of text (positive, negative, or neutral).

## 2.1 Why Sentiment Analysis Matters

**ShopSocial Challenge**: With 500,000 product reviews, a human reading 1 review per minute would take **347 days** (24/7) to read them all!

Automated sentiment analysis can:
- Identify unhappy customers for follow-up
- Track product quality over time
- Compare sentiment across product categories
- Detect emerging issues before they escalate

## 2.2 Sentiment Analysis Approaches

| Approach | How It Works | Pros | Cons |
|----------|--------------|------|------|
| **Lexicon-Based** | Count positive/negative words using a dictionary | Simple, interpretable | Misses context, sarcasm |
| **VADER** | Social media optimized lexicon | Handles emoji, slang, caps | Still rule-based |
| **Machine Learning** | Train classifier on labeled examples | More accurate | Needs training data |
| **LLM-Based** | Prompt GPT/Claude directly | Flexible, contextual | API costs, latency |

## 2.3 Lexicon-Based Sentiment Analysis

The simplest approach: build a dictionary of words with sentiment scores.

In [None]:
# Simple sentiment lexicon
sentiment_lexicon = {
    'excellent': 3, 'amazing': 3, 'love': 2, 'great': 2,
    'good': 1, 'nice': 1, 'okay': 0, 'fine': 0,
    'bad': -1, 'poor': -1, 'disappointing': -2,
    'terrible': -3, 'awful': -3, 'hate': -2
}

def lexicon_sentiment(text, lexicon):
    """Calculate sentiment score using a lexicon."""
    words = text.lower().split()
    score = 0
    matched_words = []
    
    for word in words:
        # Remove punctuation
        clean_word = ''.join(c for c in word if c.isalpha())
        if clean_word in lexicon:
            score += lexicon[clean_word]
            matched_words.append(f"{clean_word}({lexicon[clean_word]:+d})")
    
    return score, matched_words

In [None]:
# Worked example from lecture
review = "This product is excellent! The quality is good but shipping was bad."

score, matched = lexicon_sentiment(review, sentiment_lexicon)
print(f"Review: {review}")
print(f"Matched words: {matched}")
print(f"Total score: {score}")
print(f"Sentiment: {'Positive' if score > 0 else 'Negative' if score < 0 else 'Neutral'}")

## 2.4 VADER Sentiment Analysis

**VADER** (Valence Aware Dictionary and sEntiment Reasoner) is specifically designed for social media text. It handles:
- Capitalization ("AMAZING" vs "amazing")
- Punctuation ("good!!!" vs "good")
- Emoji and emoticons
- Slang and abbreviations

Returns a **compound score** from -1 (most negative) to +1 (most positive).
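
The compound score stays within these bounds because the summed word valences are squashed by a normalization step. The sketch below shows a normalization of the form used in the `vaderSentiment` source, as far as I know with `alpha = 15`; treat that constant as an assumption and check the installed version. The raw scores are hypothetical.

```python
import math

def normalize(raw_score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

# Hypothetical valence sums, from strongly negative to strongly positive
for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(f"{raw:+.1f} -> {normalize(raw):+.3f}")
```

Large raw sums approach ±1 without reaching it, so very long, very emotional reviews do not produce unbounded scores.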

In [None]:
# Initialize VADER
analyzer = SentimentIntensityAnalyzer()

# Test reviews
reviews = [
    "This product is AMAZING!!!",
    "Meh, it's okay I guess...",
    "Worst purchase ever :(",
    "Pretty good value for the price",
    "DO NOT BUY! Terrible quality!!!"
]

print("VADER Sentiment Analysis")
print("-" * 60)
print(f"{'Review':<40} {'Compound':>10} {'Sentiment':>10}")
print("-" * 60)

for review in reviews:
    scores = analyzer.polarity_scores(review)
    compound = scores['compound']
    
    # Classify using the thresholds recommended by the VADER authors
    if compound >= 0.05:
        sentiment = "Positive"
    elif compound <= -0.05:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"
    
    # Truncate long reviews for display
    display = review[:37] + "..." if len(review) > 40 else review
    print(f"{display:<40} {compound:>10.3f} {sentiment:>10}")

## 2.5 Challenges in Sentiment Analysis

| Challenge | Example | Problem |
|-----------|---------|--------|
| **Negation** | "This is **not** good" | Simple lexicon says positive! |
| **Sarcasm** | "Oh **great**, another delayed delivery" | Actually negative |
| **Context** | "This phone battery **dies** quickly" | "dies" not about death |
| **Domain-specific** | "This lens is **sharp**" | Positive for cameras! |
| **Comparative** | "Better than X but worse than Y" | Mixed sentiment |
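
The negation row is the easiest of these to patch in a lexicon approach. The sketch below flips the sign of a sentiment word that directly follows a negation word. The word lists and scores are illustrative assumptions, and this is not how VADER handles negation internally; sarcasm and context remain unsolved by this trick.

```python
NEGATIONS = {"not", "no", "never"}  # assumed, deliberately tiny list
LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -3}  # illustrative scores

def negation_aware_score(text):
    """Sum word scores, flipping the sign of a word that directly follows a negation."""
    # Lowercase, split on whitespace, and strip punctuation from each token
    words = ["".join(c for c in w if c.isalpha()) for w in text.lower().split()]
    score = 0
    for i, word in enumerate(words):
        if word in LEXICON:
            value = LEXICON[word]
            if i > 0 and words[i - 1] in NEGATIONS:
                value = -value
            score += value
    return score

print(negation_aware_score("This is not good"))  # -1
print(negation_aware_score("This is good"))      # 1
```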

In [None]:
# VADER handles some challenges well
challenging_reviews = [
    "This is NOT good",           # Negation
    "The product is good!!!",     # Emphasis
    "The product is GOOD",        # Capitalization
]

print("VADER handling challenges:")
for review in challenging_reviews:
    scores = analyzer.polarity_scores(review)
    print(f"  '{review}' -> compound: {scores['compound']:.3f}")

---

# Part 3: Frequent Itemset Analysis

**Market Basket Analysis** finds items that are frequently purchased together.

**Example**: "Customers who bought iPhone also bought AirPods"

## 3.1 Association Rules

An **association rule** has the form: **Antecedent -> Consequent**

Example: `{Beer, Pizza} -> {Diapers}`

We measure the strength of rules using three metrics:
1. **Support**: How often does the itemset appear?
2. **Confidence**: How often is the rule correct?
3. **Lift**: Is there a real association or just coincidence?

## 3.2 Support

**Support** measures how frequently an itemset appears in transactions:

$$sup(A) = \frac{|\{t \in T \mid A \subseteq t\}|}{|T|}$$

Where:
- $A$ is an itemset (e.g., {iPhone, AirPods})
- $T$ is the set of all transactions
- $t$ is a single transaction

In [None]:
# Example transactions (from lecture)
transactions = [
    ['iphoneX', 'S8', 'LG55', 'S9'],      # Basket 1
    ['iphoneX', 'S8', 'S9', 'LG55'],      # Basket 2  
    ['LG55', 'S2', 'iphoneX'],            # Basket 3
    ['iphoneX', 'S8', 'LG55']             # Basket 4
]

def support(itemset, transactions):
    """Calculate the support of an itemset."""
    if isinstance(itemset, str):
        itemset = [itemset]
    
    count = 0
    for trans in transactions:
        if set(itemset).issubset(set(trans)):
            count += 1
    return count / len(transactions)

# Calculate support for individual items
items = ['iphoneX', 'S8', 'LG55', 'S9', 'S2']
print("Support for individual items:")
for item in items:
    sup = support(item, transactions)
    print(f"  support({item}) = {sup:.2f} ({round(sup * len(transactions))}/{len(transactions)} transactions)")

## 3.3 Confidence

**Confidence** measures how often the rule is correct:

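$$conf(A \rightarrow B) = \frac{sup(A \cup B)}{sup(A)}$$

Here $A \cup B$ is the combined itemset, so $sup(A \cup B)$ counts the transactions that contain every item of both $A$ and $B$. In other words: "Of the transactions containing A, what fraction also contain B?"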

In [None]:
def confidence(A, B, transactions):
    """Calculate confidence of rule A -> B."""
    sup_A = support(A, transactions)
    sup_AB = support(list(set([A] if isinstance(A, str) else A) | 
                         set([B] if isinstance(B, str) else B)), transactions)
    return sup_AB / sup_A if sup_A > 0 else 0

# Example: iphoneX -> S8
conf = confidence('iphoneX', 'S8', transactions)
print(f"Confidence(iphoneX -> S8) = {conf:.2f}")
print(f"  = support({{iphoneX, S8}}) / support({{iphoneX}})")
print(f"  = {support(['iphoneX', 'S8'], transactions):.2f} / {support('iphoneX', transactions):.2f}")
print(f"\nInterpretation: {conf*100:.0f}% of customers who bought iphoneX also bought S8")

## 3.4 Lift

**Lift** measures whether there's a real association:

$$lift(A \rightarrow B) = \frac{sup(A \cup B)}{sup(A) \times sup(B)}$$

**Interpretation**:
- **Lift > 1**: Items are **positively associated** (bought together more often than expected if they were independent)
- **Lift = 1**: Items are **independent** (no association)
- **Lift < 1**: Items are **negatively associated** (bought together less often than expected; often **substitutes**)

In [None]:
def lift(A, B, transactions):
    """Calculate lift of rule A -> B."""
    sup_A = support(A, transactions)
    sup_B = support(B, transactions)
    sup_AB = support(list(set([A] if isinstance(A, str) else A) | 
                         set([B] if isinstance(B, str) else B)), transactions)
    return sup_AB / (sup_A * sup_B) if sup_A * sup_B > 0 else 0

# Calculate lift for iphoneX -> S8
lift_val = lift('iphoneX', 'S8', transactions)
print(f"Lift(iphoneX -> S8) = {lift_val:.2f}")
print(f"  = support({{iphoneX, S8}}) / (support({{iphoneX}}) * support({{S8}}))")
print(f"  = {support(['iphoneX', 'S8'], transactions):.2f} / ({support('iphoneX', transactions):.2f} * {support('S8', transactions):.2f})")

if lift_val > 1:
    print(f"\nInterpretation: Lift > 1, so iphoneX and S8 are DEPENDENT (bought together)")
elif lift_val < 1:
    print(f"\nInterpretation: Lift < 1, so iphoneX and S8 are SUBSTITUTES")
else:
    print(f"\nInterpretation: Lift = 1, so iphoneX and S8 are INDEPENDENT")
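
The iphoneX -> S8 rule comes out at exactly 1 because iphoneX appears in every basket, so knowing that a customer bought it says nothing about S8. To see the other two cases, the self-contained sketch below recomputes lift on the same four baskets for two further pairs; the helper names `sup` and `lift_of` are hypothetical.

```python
# The same four baskets as in the lecture example, stored as sets
baskets = [
    {"iphoneX", "S8", "LG55", "S9"},
    {"iphoneX", "S8", "S9", "LG55"},
    {"LG55", "S2", "iphoneX"},
    {"iphoneX", "S8", "LG55"},
]

def sup(items):
    """Fraction of baskets containing every item in `items`."""
    return sum(items <= b for b in baskets) / len(baskets)

def lift_of(a, b):
    """Lift of the rule a -> b, where a and b are sets of items."""
    return sup(a | b) / (sup(a) * sup(b))

print(lift_of({"S8"}, {"S9"}))  # 0.5 / (0.75 * 0.5), greater than 1
print(lift_of({"S9"}, {"S2"}))  # 0.0, never bought together
```

S9 only ever appears in baskets that also contain S8, which pushes the lift above 1, while S9 and S2 never share a basket, which gives the lowest possible lift.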

## 3.5 A-Priori Algorithm

The **A-Priori algorithm** efficiently finds all frequent itemsets:

1. Find support of all 1-item sets
2. Keep only those meeting minimum support (minSup)
3. Generate 2-item candidate sets from survivors
4. Repeat until no more candidates

**Key Insight**: If an itemset doesn't meet minSup, none of its supersets will either!
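
This insight is what lets A-Priori discard candidates without counting them. A minimal sketch of that pruning step, assuming itemsets are stored as `frozenset`s (the function name and the letters A to D are hypothetical):

```python
from itertools import combinations

def prune_candidates(candidates, frequent_prev):
    """Keep a k-item candidate only if every (k-1)-item subset is frequent."""
    kept = set()
    for cand in candidates:
        subsets = (frozenset(s) for s in combinations(cand, len(cand) - 1))
        if all(s in frequent_prev for s in subsets):
            kept.add(cand)
    return kept

# Hypothetical frequent pairs: {B, C} is absent because it failed minSup
frequent_pairs = {frozenset("AB"), frozenset("AC"), frozenset("AD"), frozenset("BD")}
candidates = {frozenset("ABC"), frozenset("ABD")}

print(prune_candidates(candidates, frequent_pairs))
```

Only {A, B, D} survives: {A, B, C} contains the infrequent pair {B, C}, so it cannot meet minSup and its support never needs to be counted.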

In [None]:
def mingle(items, level):
    """Generate candidate itemsets of size 'level' from items."""
    outcome = set()
    for item in items:
        for item2 in items:
            if item != item2:
                new_combination = set()
                if level > 2:  # Combine existing itemsets
                    for i in item:
                        new_combination.add(i)
                    for i in item2:
                        new_combination.add(i)
                else:  # Combine single items
                    new_combination.add(item)
                    new_combination.add(item2)
                
                if len(new_combination) == level:
                    outcome.add(frozenset(new_combination))
    return outcome

def support_level(itemset, transactions, level):
    """Calculate support for itemsets at any level."""
    count = 0
    for trans in transactions:
        contain = True
        if level > 1:
            for item in itemset:
                if item not in trans:
                    contain = False
                    break
        else:
            if itemset not in trans:
                contain = False
        if contain:
            count += 1
    return count / len(transactions)

In [None]:
def apriori(level, transactions, items, minsup):
    """A-Priori algorithm implementation."""
    print(f"\n{'='*50}")
    print(f"Level {level}:")
    print(f"{'='*50}")
    
    retain = set()
    
    # Calculate support for each item
    for item in items:
        sup = support_level(item, transactions, level)
        status = "KEEP" if sup >= minsup else "DROP"
        print(f"  {str(item):30} support: {sup:.2f}  [{status}]")
        if sup >= minsup:
            retain.add(item)
    
    print(f"\nRetained: {retain}")
    
    level += 1
    newsets = mingle(retain, level)
    print(f"New candidates for level {level}: {newsets}")
    
    if len(newsets) != 0 and level < len(items) + 1:
        apriori(level, transactions, newsets, minsup)

In [None]:
# Run A-Priori with minSup = 50%
print("A-PRIORI ALGORITHM")
print("minSup = 50% (0.5)")
print("\nTransactions:")
for i, t in enumerate(transactions):
    print(f"  Basket {i+1}: {t}")

items = {'iphoneX', 'S8', 'LG55', 'S9', 'S2'}
apriori(1, transactions, items, 0.5)

---

# Part 4: Recommendation Systems

**Recommendation systems** predict what items a user might like based on:
- **Collaborative Filtering**: Find similar users/items
- **Content-Based**: Match item features to user preferences

## 4.1 The Utility Matrix

A **utility matrix** stores user-item interactions (ratings, purchases, etc.):

| User | Python | R | MATLAB | Java |
|------|--------|---|--------|------|
| Douglas | 5 | 4 | 3 | - |
| Johannes | 4 | 5 | - | 2 |
| Maurizio | 5 | 4 | - | - |
| Tong | 3 | - | 5 | 4 |

**Challenge**: The matrix is very sparse! (Most users haven't rated most items)
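
One common way to cope with sparsity is to store only the ratings that exist. The sketch below holds the table above as a dict of dicts and measures how empty it is; the variable names are hypothetical.

```python
# The utility matrix as a dict of dicts; unrated cells are simply absent
ratings_by_user = {
    "Douglas":  {"Python": 5, "R": 4, "MATLAB": 3},
    "Johannes": {"Python": 4, "R": 5, "Java": 2},
    "Maurizio": {"Python": 5, "R": 4},
    "Tong":     {"Python": 3, "MATLAB": 5, "Java": 4},
}
languages = ["Python", "R", "MATLAB", "Java"]

n_cells = len(ratings_by_user) * len(languages)
n_rated = sum(len(r) for r in ratings_by_user.values())
sparsity = 1 - n_rated / n_cells
print(f"{n_rated} of {n_cells} cells rated, sparsity = {sparsity}")
```

Even this toy matrix is about a third empty; real rating datasets are typically far sparser.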

## 4.2 Jaccard Similarity

For binary data (bought/didn't buy):

$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$

In [None]:
def jaccard_similarity(set1, set2):
    """Calculate Jaccard similarity between two sets."""
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return intersection / union if union > 0 else 0

# Example from lecture
douglas = {'Python', 'R', 'MATLAB'}
maurizio = {'Python', 'R'}

sim = jaccard_similarity(douglas, maurizio)
print(f"Douglas knows: {douglas}")
print(f"Maurizio knows: {maurizio}")
print(f"\nJaccard(Douglas, Maurizio) = |{douglas & maurizio}| / |{douglas | maurizio}|")
print(f"                           = {len(douglas & maurizio)} / {len(douglas | maurizio)}")
print(f"                           = {sim:.2f}")

## 4.3 Cosine Similarity

For rating data (vectors):

$$cos(\theta) = \frac{X \cdot Y}{\|X\| \|Y\|} = \frac{\sum_{i} x_i y_i}{\sqrt{\sum_{i} x_i^2} \sqrt{\sum_{i} y_i^2}}$$
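
As a worked example, take the rating vectors $X = (5, 3, 0, 1)$ and $Y = (4, 0, 0, 1)$:

$$cos(\theta) = \frac{5 \cdot 4 + 3 \cdot 0 + 0 \cdot 0 + 1 \cdot 1}{\sqrt{5^2 + 3^2 + 0^2 + 1^2} \sqrt{4^2 + 0^2 + 0^2 + 1^2}} = \frac{21}{\sqrt{35} \sqrt{17}} \approx 0.861$$

Only items rated by both users contribute to the numerator, while every rating contributes to the norms in the denominator.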

In [None]:
def cosine_sim(vec1, vec2):
    """Calculate cosine similarity between two vectors."""
    # scipy.spatial.distance.cosine returns a DISTANCE, so we subtract it from 1
    return 1 - cosine(vec1, vec2)

# Example with rating vectors
user_a = np.array([5, 3, 0, 1])  # Ratings for 4 items
user_b = np.array([4, 0, 0, 1])

sim = cosine_sim(user_a, user_b)
print(f"User A ratings: {user_a}")
print(f"User B ratings: {user_b}")
print(f"\nCosine similarity: {sim:.3f}")

## 4.4 Collaborative Filtering Example

Let's build a simple user-based recommendation system.

In [None]:
# Load ratings data
ratings = pd.read_csv('data/ratings.csv')
ratings = ratings[:5000]  # Sample for speed

noMovies = len(ratings['movieId'].unique())
noUsers = len(ratings['userId'].unique())

print(f"Dataset: {noMovies} movies rated by {noUsers} users")
print(ratings.head())

In [None]:
# Create utility matrix
utility = np.zeros(shape=(noUsers, noMovies))

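# Map user and movie IDs to sequential row/column indices
# (avoids assuming that user IDs run 1..N without gaps)
userIds = {uid: idx for idx, uid in enumerate(ratings['userId'].unique())}
movieIds = {mid: idx for idx, mid in enumerate(ratings['movieId'].unique())}

# Populate the matrix
for row in ratings.itertuples(index=False):
    utility[userIds[row.userId], movieIds[row.movieId]] = row.rating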

print(f"Utility matrix shape: {utility.shape}")
print(f"Sparsity: {(utility == 0).sum() / utility.size * 100:.1f}% empty")

In [None]:
def find_similar_users(user_id, utility_matrix, min_similarity=0.5):
    """Find users similar to the given user."""
    similar_users = []
    user_ratings = utility_matrix[user_id]
    
    for other_id in range(len(utility_matrix)):
        if user_id != other_id:
            other_ratings = utility_matrix[other_id]
            # Only compare if both users have rated at least one item
            if np.any(user_ratings) and np.any(other_ratings):
                sim = cosine_sim(user_ratings, other_ratings)
                if sim > min_similarity:
                    similar_users.append((other_id, sim))
    
    return sorted(similar_users, key=lambda x: x[1], reverse=True)

# Find similar users for user 0
similar = find_similar_users(0, utility, min_similarity=0.3)
print(f"Users similar to User 0:")
for uid, sim in similar[:5]:
    print(f"  User {uid}: similarity = {sim:.3f}")

In [None]:
def recommend_movies(user_id, utility_matrix, similar_users, n_recommendations=5):
    """Recommend movies based on similar users' ratings."""
    user_ratings = utility_matrix[user_id]
    recommendations = []
    
    # For each movie the user hasn't rated
    for movie_id in range(utility_matrix.shape[1]):
        if user_ratings[movie_id] == 0:  # Not rated
            weighted_sum = 0.0
            sim_total = 0.0
            for sim_user_id, sim in similar_users:
                sim_rating = utility_matrix[sim_user_id, movie_id]
                if sim_rating > 0:  # Similar user has rated this movie
                    weighted_sum += sim_rating * sim  # Weight by similarity
                    sim_total += sim
            
            if sim_total > 0:
                # Similarity-weighted average, so the score stays on the rating scale
                recommendations.append((movie_id, weighted_sum / sim_total))
    
    return sorted(recommendations, key=lambda x: x[1], reverse=True)[:n_recommendations]

# Get recommendations for user 0
similar = find_similar_users(0, utility, min_similarity=0.3)
recs = recommend_movies(0, utility, similar, n_recommendations=5)

print(f"\nTop 5 recommendations for User 0:")
for movie_idx, score in recs:
    print(f"  Movie index {movie_idx}: predicted score = {score:.2f}")

## 4.5 Matrix Factorization with NMF

**Non-negative Matrix Factorization** decomposes the utility matrix:

$$M \approx U \times V^T$$

Where:
- $M$ is the original utility matrix (users x items)
- $U$ is user features (users x latent factors)
- $V^T$ is item features (latent factors x items)
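
Once the factors are known, a predicted rating is just the dot product of a user's row in $U$ and an item's column in $V^T$. The sketch below uses made-up factor values (2 users, 2 latent factors, 3 items) purely to show the arithmetic; real factors come from fitting the model.

```python
# Hypothetical factors: 2 users x 2 latent factors, and 2 latent factors x 3 items
U_toy = [[1.0, 0.5],
         [0.2, 2.0]]
V_T_toy = [[4.0, 1.0, 0.0],
           [0.0, 1.0, 2.0]]

def predict(user, item):
    """Predicted rating = dot product of the user's row and the item's column."""
    return sum(U_toy[user][k] * V_T_toy[k][item] for k in range(len(V_T_toy)))

# Rebuild the full (users x items) matrix of predicted ratings
M_hat = [[predict(u, i) for i in range(3)] for u in range(2)]
for row in M_hat:
    print(row)
```

User 0 loads mostly on the first latent factor and so scores highest on item 0, while user 1 loads on the second factor and scores highest on item 2.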

In [None]:
# NMF decomposition
n_components = 20  # Number of latent factors

nmf = NMF(n_components=n_components, init='random', random_state=42, max_iter=500)
U = nmf.fit_transform(utility)
V_T = nmf.components_

print(f"Original matrix shape: {utility.shape}")
print(f"U (user features): {U.shape}")
print(f"V^T (item features): {V_T.shape}")

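# Reconstruct matrix
M_reconstructed = np.dot(U, V_T)
# Use the Frobenius norm of the residual; a plain sum lets positive and negative errors cancel
print(f"\nReconstruction error (Frobenius norm): {np.linalg.norm(utility - M_reconstructed):.2f}")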

---

# Summary

## Key Takeaways

| Topic | Key Concept | Application |
|-------|-------------|-------------|
| **Clustering** | Group similar items without labels | Customer segmentation (RFM) |
| **Sentiment Analysis** | Determine emotional tone of text | Product review analysis |
| **Frequent Itemsets** | Find items bought together | "Customers also bought" |
| **Recommendations** | Predict user preferences | Personalized suggestions |

## Quick Quiz

1. K-means requires specifying ___ in advance (Answer: K, number of clusters)
2. VADER is designed for analyzing ___ text (Answer: social media)
3. Support measures how ___ an itemset appears (Answer: frequently)
4. Lift > 1 indicates items are ___ (Answer: dependent/associated)
5. Collaborative filtering finds similar ___ (Answer: users or items)