# Week 4: Exercises - Unsupervised Learning Techniques

**Web and Social Network Analytics**

---

**Instructions**: Complete each exercise in the provided code cells. If you get stuck, open the hints in order; each one reveals a little more help.

**Topics Covered**:
- Sentiment Analysis with VADER
- Jaccard Similarity
- Support, Confidence, and Lift
- A-Priori Algorithm
- Collaborative Filtering

## Setup

Run this cell first to import all required libraries.

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
from itertools import combinations

# Sentiment Analysis
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Machine Learning
from sklearn.cluster import KMeans
from scipy.spatial.distance import cosine

# Visualization
import matplotlib.pyplot as plt

print('All libraries imported successfully!')

---

## Exercise 1: Sentiment Analysis with VADER (Easy)

**Task**: Analyze the sentiment of 5 product reviews using VADER and classify each as Positive, Neutral, or Negative.

**Expected Output**:
- Compound score for each review
- Classification: Positive (> 0.05), Neutral (-0.05 to 0.05, inclusive), Negative (< -0.05)
- Summary count: How many positive/neutral/negative reviews?

**Skills Practiced**:
- Using VADER SentimentIntensityAnalyzer
- Interpreting compound scores
- Classifying sentiment based on thresholds
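
To make the thresholds concrete, the sketch below labels a handful of made-up compound scores (they were not produced by VADER) and tallies the result with `collections.Counter`. The helper name `label_for` and the scores are illustrative assumptions. Note how the boundary values 0.05 and -0.05 land in the Neutral band under the strict inequalities used in this exercise.

```python
from collections import Counter

def label_for(compound):
    """Map a compound score to a label using this exercise's thresholds."""
    if compound > 0.05:
        return "Positive"
    if compound < -0.05:
        return "Negative"
    return "Neutral"

# Hypothetical compound scores, chosen to include both boundary values
sample_scores = [0.80, 0.05, 0.00, -0.05, -0.60]

labels = [label_for(score) for score in sample_scores]
counts = Counter(labels)

for score, label in zip(sample_scores, labels):
    print(f"{score:+.2f} -> {label}")
print(dict(counts))
```

A `Counter` is one alternative to keeping three separate counter variables; either approach satisfies the summary requirement.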

---

<details>
<summary>Hint 1: Initialize VADER</summary>

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
```
</details>

<details>
<summary>Hint 2: Get sentiment scores</summary>

```python
for review in reviews:
    scores = analyzer.polarity_scores(review)
    compound = scores['compound']
    print(f"Compound: {compound:.3f}")
```
</details>

<details>
<summary>Hint 3: Classify sentiment</summary>

```python
def classify_sentiment(compound):
    if compound > 0.05:
        return "Positive"
    elif compound < -0.05:
        return "Negative"
    else:
        return "Neutral"
```
</details>

In [None]:
# Exercise 1: Your code here
# ===========================

# Step 1: Define the reviews
reviews = [
    "This product is absolutely AMAZING! Best purchase ever!!!",
    "Meh, it's okay. Nothing special.",
    "Terrible quality. Completely disappointed :(",
    "Pretty good value for the price, would recommend.",
    "DO NOT BUY! Worst experience of my life!!!"
]

# Step 2: Initialize VADER analyzer


# Step 3: Define classification function
def classify_sentiment(compound):
    """Classify sentiment based on compound score."""
    # YOUR CODE HERE
    pass

# Step 4: Analyze each review
print("Sentiment Analysis Results")
print("-" * 70)

pos_count, neu_count, neg_count = 0, 0, 0

for review in reviews:
    # YOUR CODE HERE
    pass

# Step 5: Print summary
print("-" * 70)
print(f"\nSummary: {pos_count} Positive, {neu_count} Neutral, {neg_count} Negative")

---

## Exercise 2: Jaccard Similarity Calculation (Easy-Medium)

**Task**: Calculate the Jaccard similarity between users based on their product purchases.

**Expected Output**:
- Jaccard similarity between all pairs of users
- Identify which two users are most similar
- Identify which two users are least similar

**Skills Practiced**:
- Implementing Jaccard similarity formula
- Set operations (intersection, union)
- Finding maximum/minimum values
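
As a quick worked example of the formula (intersection size divided by union size), the sketch below uses two hypothetical baskets that are unrelated to the exercise data. The function name `jaccard` and the items are assumptions made for illustration only.

```python
def jaccard(a, b):
    """Jaccard similarity: |A & B| / |A | B| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical baskets, unrelated to the exercise data
x = {"tea", "honey", "lemon"}
y = {"tea", "lemon", "sugar", "milk"}

print(sorted(x & y))   # items both baskets share
print(sorted(x | y))   # every distinct item across both baskets
print(jaccard(x, y))   # 2 shared items out of 5 distinct items
```

Two shared items out of five distinct items gives a similarity of 0.4. Identical sets score 1.0 and disjoint sets score 0.0, which is a useful sanity check for your own implementation.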

---

<details>
<summary>Hint 1: Jaccard formula</summary>

```python
def jaccard_similarity(set1, set2):
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return intersection / union if union > 0 else 0
```
</details>

<details>
<summary>Hint 2: Generate all pairs</summary>

```python
from itertools import combinations

user_names = list(users.keys())
for user1, user2 in combinations(user_names, 2):
    sim = jaccard_similarity(users[user1], users[user2])
    print(f"{user1} - {user2}: {sim:.3f}")
```
</details>

<details>
<summary>Hint 3: Find most/least similar</summary>

```python
similarities = {}
for user1, user2 in combinations(user_names, 2):
    similarities[(user1, user2)] = jaccard_similarity(users[user1], users[user2])

most_similar = max(similarities, key=similarities.get)
least_similar = min(similarities, key=similarities.get)
```
</details>

In [None]:
# Exercise 2: Your code here
# ===========================

# Step 1: Define user purchases
users = {
    'Alice': {'iPhone', 'AirPods', 'MacBook', 'iPad'},
    'Bob': {'iPhone', 'AirPods', 'Galaxy Watch'},
    'Carol': {'MacBook', 'iPad', 'iMac'},
    'Dave': {'iPhone', 'AirPods', 'MacBook', 'iPad', 'iMac'}
}

# Step 2: Implement Jaccard similarity function
def jaccard_similarity(set1, set2):
    """Calculate Jaccard similarity between two sets."""
    # YOUR CODE HERE
    pass

# Step 3: Calculate similarity for all pairs
print("Jaccard Similarities:")
print("-" * 40)

similarities = {}
# YOUR CODE HERE


# Step 4: Find most and least similar pairs
print("\nMost similar pair:")
# YOUR CODE HERE

print("\nLeast similar pair:")
# YOUR CODE HERE

---

## Exercise 3: Support, Confidence, and Lift Calculation (Medium)

**Task**: Calculate support, confidence, and lift for association rules from transaction data.

**Required Calculations**:
1. Support for each individual item
2. Support for {bread, milk}
3. Confidence of bread -> milk
4. Confidence of milk -> bread
5. Lift of bread -> milk
6. Interpret the lift value

**Expected Output**: All support, confidence, and lift values with interpretations.

**Skills Practiced**:
- Calculating support from transactions
- Deriving confidence from support values
- Calculating and interpreting lift
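
The three measures build on each other: support is a fraction of transactions, confidence divides a pair's support by the antecedent's support, and lift further divides by the consequent's support. The sketch below walks through that arithmetic on a small hypothetical dataset (not the exercise transactions); the helper name `supp` and the items are assumptions.

```python
# Hypothetical transactions, unrelated to the exercise data
baskets = [
    {"tea", "honey"},
    {"tea", "honey", "lemon"},
    {"tea"},
    {"lemon"},
    {"honey", "lemon"},
]

def supp(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

s_tea = supp({"tea"})             # 3 of 5 baskets
s_honey = supp({"honey"})         # 3 of 5 baskets
s_both = supp({"tea", "honey"})   # 2 of 5 baskets

confidence = s_both / s_tea           # how often tea buyers also buy honey
lift = s_both / (s_tea * s_honey)     # ratio to what independence would predict

print(f"support(tea)={s_tea:.2f}, support(honey)={s_honey:.2f}, support(both)={s_both:.2f}")
print(f"confidence(tea -> honey)={confidence:.3f}")
print(f"lift(tea -> honey)={lift:.3f}")
```

Here the lift is slightly above 1, so in this toy data tea and honey appear together a little more often than independence would predict.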

---

<details>
<summary>Hint 1: Support calculation</summary>

```python
def support(itemset, transactions):
    """Calculate support of an itemset."""
    if isinstance(itemset, str):
        itemset = [itemset]
    count = 0
    for trans in transactions:
        if set(itemset).issubset(set(trans)):
            count += 1
    return count / len(transactions)
```
</details>

<details>
<summary>Hint 2: Confidence formula</summary>

```python
# Confidence(A -> B) = Support(A and B) / Support(A)
conf_bread_milk = support(['bread', 'milk'], transactions) / support('bread', transactions)
```
</details>

<details>
<summary>Hint 3: Lift formula and interpretation</summary>

```python
# Lift(A -> B) = Support(A and B) / (Support(A) * Support(B))
lift_bread_milk = support(['bread', 'milk'], transactions) / (
    support('bread', transactions) * support('milk', transactions)
)

# Interpretation:
# Lift > 1: Positive association (bought together more often than independence predicts)
# Lift = 1: Items are independent
# Lift < 1: Negative association (bought together less often; possibly substitutes)
```
</details>

In [None]:
# Exercise 3: Your code here
# ===========================

# Step 1: Define transactions
transactions = [
    ['bread', 'milk', 'eggs'],
    ['bread', 'butter'],
    ['milk', 'eggs', 'butter'],
    ['bread', 'milk', 'eggs', 'butter'],
    ['bread', 'milk'],
    ['eggs', 'butter'],
    ['bread', 'milk', 'butter'],
    ['bread', 'eggs']
]

# Step 2: Implement support function
def support(itemset, transactions):
    """Calculate the support of an itemset."""
    # YOUR CODE HERE
    pass

# Step 3: Calculate individual item supports
print("Individual Item Support:")
print("-" * 30)
items = ['bread', 'milk', 'eggs', 'butter']
# YOUR CODE HERE


# Step 4: Calculate support for {bread, milk}
print("\nPair Support:")
# YOUR CODE HERE


# Step 5: Calculate confidence values
print("\nConfidence:")
# YOUR CODE HERE


# Step 6: Calculate lift
print("\nLift:")
# YOUR CODE HERE


# Step 7: Interpret results
print("\nInterpretation:")
# Write your interpretation here

---

## Exercise 4: A-Priori Algorithm Implementation (Medium-Hard)

**Task**: Implement the A-Priori algorithm to find frequent itemsets.

**Requirements**:
1. Implement the `mingle()` function to generate candidate itemsets
2. Implement the `support()` function for different levels
3. Implement the `apriori()` function
4. Run with minSup = 0.6 (60%) on the baskets data

**Expected Output**: All frequent itemsets at each level that meet the minimum support threshold.

**Skills Practiced**:
- Candidate generation (mingle)
- Support calculation at different levels
- Recursive A-Priori implementation
- Working with frozensets
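
For comparison with the recursive structure this exercise asks for, here is a minimal iterative sketch of the same level-wise idea: keep the itemsets that meet the minimum support, join them into larger candidates, and discard any candidate that has an infrequent subset. It is an alternative formulation built on `itertools.combinations`, not the `mingle()`/`apriori()` design required above; the function name `frequent_itemsets` and the toy data are assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Level-wise search: return {frozenset: support} for all frequent itemsets."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def supp(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1: every distinct item is a candidate
    candidates = {frozenset([item]) for b in baskets for item in b}
    frequent = {}
    level = 1
    while candidates:
        supports = {c: supp(c) for c in candidates}
        kept = {c: s for c, s in supports.items() if s >= minsup}
        frequent.update(kept)
        level += 1
        # Join step: unions of two kept sets that contain exactly `level` items
        joined = {a | b for a, b in combinations(kept, 2) if len(a | b) == level}
        # Prune step: every (level - 1)-item subset must itself be frequent
        candidates = {
            c for c in joined
            if all(frozenset(s) in kept for s in combinations(c, level - 1))
        }
    return frequent

# Hypothetical transactions, unrelated to data/baskets.csv
toy = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["a", "b", "c"], ["b", "d"]]
result = frequent_itemsets(toy, minsup=0.6)

for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

On this toy data the candidate `{a, b, c}` is never counted, because its subset `{b, c}` falls below the threshold. That pruning is what keeps the number of candidates manageable at higher levels.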

---

<details>
<summary>Hint 1: Mingle function structure</summary>

```python
def mingle(items, level):
    """Generate candidate itemsets of size 'level' from items."""
    outcome = set()
    for item in items:
        for item2 in items:
            if item != item2:
                new_combination = set()
                if level > 2:  # Combine existing itemsets
                    for i in item:
                        new_combination.add(i)
                    for i in item2:
                        new_combination.add(i)
                else:  # Combine single items
                    new_combination.add(item)
                    new_combination.add(item2)
                if len(new_combination) == level:
                    outcome.add(frozenset(new_combination))
    return outcome
```
</details>

<details>
<summary>Hint 2: Support for different levels</summary>

```python
def support(itemset, transactions, level):
    count = 0
    for trans in transactions:
        contain = True
        if level > 1:
            for item in itemset:
                if item not in trans:
                    contain = False
                    break
        else:
            if itemset not in trans:
                contain = False
        if contain:
            count += 1
    return count / len(transactions)
```
</details>

<details>
<summary>Hint 3: A-Priori recursion</summary>

```python
def apriori(level, transactions, items, minsup):
    retain = set()
    
    # Find items meeting minimum support
    for item in items:
        if support(item, transactions, level) >= minsup:
            retain.add(item)
    
    print(f"Level {level} - Retained: {retain}")
    
    level += 1
    newsets = mingle(retain, level)
    
    if len(newsets) > 0 and level <= len(items):
        apriori(level, transactions, newsets, minsup)
```
</details>

In [None]:
# Exercise 4: Your code here
# ===========================

from itertools import combinations

# Step 1: Implement mingle function
def mingle(items, level):
    """Generate candidate itemsets of size 'level' from items."""
    outcome = set()
    # YOUR CODE HERE
    return outcome

# Test mingle
assert mingle(["a","b","c"], 2) == {frozenset({'a', 'c'}), 
                                     frozenset({'b', 'c'}), 
                                     frozenset({'a', 'b'})}
print("mingle() test passed!")

In [None]:
# Step 2: Implement support function for levels
def support(itemset, transactions, level):
    """Calculate support of an itemset at a given level."""
    count = 0
    # YOUR CODE HERE
    return count / len(transactions)

# Test support
test_trans = [["a","b","c"], ["a","b","d"], ["b","c"], ["a","c"]]
assert support("a", test_trans, 1) == 0.75
assert support(["a","b"], test_trans, 2) == 0.5
print("support() tests passed!")

In [None]:
# Step 3: Implement apriori function
def apriori(level, transactions, items, minsup):
    """A-Priori algorithm implementation."""
    print(f"\nLevel {level}:")
    print("-" * 40)
    
    retain = set()
    # YOUR CODE HERE - find items meeting minsup
    
    print(f"Retained: {retain}")
    
    level += 1
    newsets = mingle(retain, level)
    print(f"New candidates: {newsets}")
    
    if len(newsets) != 0 and level < len(items) + 1:
        apriori(level, transactions, newsets, minsup)

In [None]:
# Step 4: Load baskets data and run A-Priori
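transactions = []
items = set()

# A context manager closes the file even if parsing raises an error
with open('data/baskets.csv', 'r') as file:
    for line in file:
        line = line.strip()
        if not line:  # skip blank lines, e.g. a trailing newline at end of file
            continue
        litems = line.split(',')
        transactions.append(litems)
        items.update(litems)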

print(f"Loaded {len(transactions)} transactions with {len(items)} unique items")
print(f"Items: {items}")

# Run A-Priori with minSup = 60%
print("\n" + "="*50)
print("A-PRIORI ALGORITHM (minSup = 60%)")
print("="*50)
apriori(1, transactions, items, 0.6)

---

## Exercise 5: Collaborative Filtering Recommendation (Medium-Hard)

**Task**: Build a collaborative filtering recommendation system using cosine similarity.

**Requirements**:
1. Load and preprocess the ratings data
2. Create a utility matrix (users x movies)
3. Implement `findSimilarUsers()` using cosine similarity
4. Implement `findNewProducts()` to recommend movies
5. Generate recommendations for a specific user

**Expected Output**: List of similar users and movie recommendations.

**Skills Practiced**:
- Building utility matrices from rating data
- Calculating cosine similarity between users
- User-based collaborative filtering
- Making personalized recommendations
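
To see what cosine similarity measures before applying it to the full utility matrix, the sketch below computes it by hand for three hypothetical users. It is written in plain Python so the arithmetic stays visible; for nonzero vectors, `1 - cosine(u, v)` from SciPy yields the same quantity. The function name `cosine_similarity` and the toy ratings are assumptions, and unrated movies are stored as 0, as in this exercise's utility matrix.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of vector lengths (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical utility matrix: rows are users, columns are movies, 0 = not rated
toy_utility = [
    [5, 4, 0, 0],  # user 0
    [4, 5, 0, 1],  # user 1: overlaps with user 0 on the first two movies
    [0, 0, 5, 4],  # user 2: no movies in common with user 0
]

sim_01 = cosine_similarity(toy_utility[0], toy_utility[1])
sim_02 = cosine_similarity(toy_utility[0], toy_utility[2])

print(f"user 0 vs user 1: {sim_01:.3f}")
print(f"user 0 vs user 2: {sim_02:.3f}")
```

Users with no rated movies in common get a similarity of 0, so they never pass a positive `minCos` threshold.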

---

<details>
<summary>Hint 1: Create utility matrix</summary>

```python
# Create empty matrix
utility = np.zeros(shape=(noUsers, noMovies))

# Map movieIds to indices
movieIds = {}
for i, mid in enumerate(ratings['movieId'].unique()):
    movieIds[mid] = i

# Populate matrix (assumes userId values are consecutive integers starting at 1)
for _, row in ratings.iterrows():
    uid = int(row['userId']) - 1
    mid = movieIds[row['movieId']]
    utility[uid, mid] = row['rating']
```
</details>

<details>
<summary>Hint 2: Finding similar users</summary>

```python
from scipy.spatial.distance import cosine

def findSimilarUsers(person_number, utility_matrix, minCos=0.5):
    similar_users = []
    for other in range(len(utility_matrix)):
        if person_number != other:
            # cosine() returns distance, so similarity = 1 - distance
            cos_sim = 1 - cosine(utility_matrix[person_number], 
                                  utility_matrix[other])
            if cos_sim > minCos:
                similar_users.append((other, cos_sim))
    return similar_users
```
</details>

<details>
<summary>Hint 3: Making recommendations</summary>

```python
def findNewProducts(similar_users, person_number, utility_matrix, minScore=2.0):
    recommendations = []
    for movie in range(utility_matrix.shape[1]):
        if utility_matrix[person_number, movie] == 0:  # Not rated
            scores = []
            for user_id, sim in similar_users:
                if utility_matrix[user_id, movie] > 0:
                    scores.append(utility_matrix[user_id, movie])
            if scores:
                avg = sum(scores) / len(scores)
                if avg > minScore:
                    recommendations.append((movie, avg))
    return sorted(recommendations, key=lambda x: x[1], reverse=True)
```
</details>

In [None]:
# Exercise 5: Your code here
# ===========================

# Step 1: Load ratings data
ratings = pd.read_csv('data/ratings.csv')
ratings = ratings[:5000]  # Keep only the first 5,000 rows for speed (not a random sample)

noMovies = len(ratings['movieId'].unique())
noUsers = len(ratings['userId'].unique())

print(f"Dataset: {noMovies} movies rated by {noUsers} users")
print(ratings.head())

In [None]:
# Step 2: Create utility matrix
utility = np.zeros(shape=(noUsers, noMovies))

# Map movie IDs to indices
movieIds = {}
# YOUR CODE HERE

# Populate the matrix
# YOUR CODE HERE

print(f"Utility matrix shape: {utility.shape}")
print(f"Sparsity: {(utility == 0).sum() / utility.size * 100:.1f}% empty")

In [None]:
# Step 3: Implement findSimilarUsers
def findSimilarUsers(person_number, utility_matrix, minCos=0.5):
    """Find users similar to the given user using cosine similarity."""
    similar_users = []
    # YOUR CODE HERE
    return similar_users

# Test with user 0
similar = findSimilarUsers(0, utility, minCos=0.2)
print(f"Found {len(similar)} similar users for User 0")

In [None]:
# Step 4: Implement findNewProducts
def findNewProducts(similar_users, person_number, utility_matrix, minScore=2.0):
    """Recommend movies based on similar users' ratings."""
    recommendations = []
    # YOUR CODE HERE
    return recommendations

# Generate recommendations
recs = findNewProducts(similar, 0, utility, minScore=1.0)
print("\nTop 5 recommendations for User 0:")
for movie_idx, score in recs[:5]:
    print(f"  Movie index {movie_idx}: predicted score = {score:.2f}")

---

## Bonus Exercise: K-Means Clustering on Starbucks Data

**Task**: Apply K-Means clustering to Starbucks location data and visualize the results.

**Requirements**:
1. Load and filter location data to a specific region
2. Apply K-Means with different values of K (3, 5, 10)
3. Visualize clusters with different colors
4. Discuss: How do you choose the optimal K?
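
One common starting point for the discussion in requirement 4 is the elbow heuristic: compute the within-cluster sum of squared distances (inertia) for several values of K and look for the point where adding clusters stops paying off. The sketch below illustrates this with a small pure-Python k-means on hypothetical points; it is not scikit-learn's implementation, and the helper names (`sq_dist`, `mean_point`, `kmeans_inertia`) and the data are assumptions. With scikit-learn, a fitted `KMeans` model exposes the same quantity as `inertia_`.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def mean_point(cluster):
    """Centroid of a non-empty list of 2-D points."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def kmeans_inertia(points, k, iterations=20):
    """Run basic Lloyd iterations and return the within-cluster sum of squares."""
    centers = list(points[:k])  # naive deterministic start: the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the previous center if a cluster ends up empty
        centers = [mean_point(c) if c else centers[i] for i, c in enumerate(clusters)]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Hypothetical points forming three tight groups
points = [(0, 0), (10, 0), (0, 10), (1, 0), (0, 1),
          (11, 0), (10, 1), (1, 10), (0, 11)]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 5)}
for k, value in inertias.items():
    print(f"K={k}: inertia={value:.2f}")
```

On this toy data the inertia falls sharply up to K=3 and barely improves at K=4, which is the "elbow". Real location data rarely shows such a clean bend, so the choice of K usually also depends on what the clusters will be used for.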

---

<details>
<summary>Hint 1: Load and filter data</summary>

```python
data = pd.read_csv("data/starbucks_locations.csv", index_col=0)
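# Drop only rows without coordinates; a bare dropna() also discards rows
# that have a missing value in any other column
data = data.dropna(subset=["Latitude", "Longitude"])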

# Filter to a region
filtered = data[(data["Latitude"].between(24, 27)) & 
                (data["Longitude"].between(49, 56))]
```
</details>

<details>
<summary>Hint 2: Apply K-Means</summary>

```python
from sklearn.cluster import KMeans

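# Cluster on the coordinates only; the column order matches the plot
# (x = Longitude, y = Latitude), so centers[:, 0] is a longitude
coords = filtered[['Longitude', 'Latitude']]
kmeans = KMeans(n_clusters=5, max_iter=500, n_init=10, random_state=42)
kmeans.fit(coords)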
labels = kmeans.labels_
centers = kmeans.cluster_centers_
```
</details>

<details>
<summary>Hint 3: Visualize clusters</summary>

```python
plt.figure(figsize=(10, 8))
plt.scatter(filtered['Longitude'], filtered['Latitude'], 
            c=labels, cmap='tab10', s=50)
plt.scatter(centers[:, 0], centers[:, 1], 
            c='red', marker='X', s=200, label='Centroids')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('K-Means Clustering (K=5)')
plt.legend()
plt.show()
```
</details>

In [None]:
# Bonus Exercise: Your code here
# ================================

# Step 1: Load and filter Starbucks data


# Step 2: Try different values of K


# Step 3: Visualize results


# Step 4: Discussion
# How would you choose the optimal K?
# Write your thoughts here:

---

## Submission Checklist

Before submitting, verify:

- [ ] Exercise 1: Analyzed 5 reviews with VADER and classified sentiment
- [ ] Exercise 2: Calculated Jaccard similarity for all user pairs
- [ ] Exercise 3: Calculated support, confidence, and lift correctly
- [ ] Exercise 4: Implemented A-Priori algorithm, passed all tests
- [ ] Exercise 5: Built collaborative filtering system, generated recommendations
- [ ] Bonus: K-Means clustering with visualization (optional)
- [ ] All code cells run without errors