# Capstone

Last week's notebook discussed a cumulative project that tests your knowledge from this series of courses.
This notebook will serve as a reference point while you work on that project. It includes a set of answers to your tasks, based on a fixed example dataset. Make sure your final submission uses a different dataset!

You will be working on 4 tasks:
1. __Data Processing__ 
2. __Classification__ 
3. __Regression__ 
4. __Recommender Systems__

These tasks are each representative of one of the courses in the series. So if you need help with any one of these tasks, be sure to look back at those courses for reference! Along with the previous courses, there will be checkpoints with given solutions so you can check to make sure you are headed in the right direction. ___Good Luck!___

# Task 1: Data Processing

## The Data

For this final project you will be doing your work on a dataset of your choice. For reference, an example with checkpoint answers is included. The example is an Amazon dataset, which does not need any cleaning before analysis. It can be found [here.](https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Home_Improvement_v1_00.tsv.gz)
This dataset is a set of Home Improvement product reviews on Amazon. It is a rather large dataset, so computation might take slightly longer than normal.

### First Step: Imports

In the next cell we will give you all of the imports you should need to do your project. Feel free to add more if you would like, but these should be sufficient.

In [1]:
import gzip
from collections import defaultdict
import random
import numpy
import scipy.optimize
import string
from sklearn import linear_model
from nltk.stem.porter import PorterStemmer # Stemming

### TODO 1: Read the data and Fill your dataset

Take care of int casting the votes and rating. Also __add this bit of code__ to your for loop, taking off the outer " ":

" d['verified_purchase'] = d['verified_purchase'] == 'Y' "

This simply converts the verified purchase column to strict true/false values rather than Y/N strings.
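The casting and flag conversion described above can be sketched as follows. This is only one possible approach, assuming a tab-separated file with the column names shown in the dataset (`star_rating`, `helpful_votes`, `total_votes`, `verified_purchase`); the helper names `read_reviews` and `read_reviews_gz` and the tiny in-memory sample are hypothetical, used here so the example runs without the real file.

```python
import csv
import gzip
import io

def read_reviews(fileobj):
    """Parse a tab-separated review file into a list of dicts (hypothetical helper)."""
    reader = csv.DictReader(fileobj, delimiter='\t', quoting=csv.QUOTE_NONE)
    dataset = []
    for d in reader:
        try:
            # Cast the rating and vote counts from strings to ints
            for field in ('star_rating', 'helpful_votes', 'total_votes'):
                d[field] = int(d[field])
        except (TypeError, ValueError):
            continue  # skip rows with missing or malformed numeric fields
        # Turn the Y/N string into a strict boolean
        d['verified_purchase'] = d['verified_purchase'] == 'Y'
        dataset.append(d)
    return dataset

def read_reviews_gz(path):
    """Same as read_reviews, but for a gzip-compressed file on disk."""
    with gzip.open(path, 'rt', encoding='utf8') as f:
        return read_reviews(f)

# Tiny in-memory sample standing in for the real file (made-up rows)
sample = (
    "customer_id\tproduct_id\tstar_rating\thelpful_votes\ttotal_votes\tverified_purchase\n"
    "1\tA\t5\t0\t1\tY\n"
    "2\tB\t3\t2\t4\tN\n"
)
dataset = read_reviews(io.StringIO(sample))
print(dataset[0]['star_rating'], dataset[0]['verified_purchase'])
```

Skipping malformed rows is a design choice made for this sketch; you may prefer to count or inspect them instead.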

In [2]:
#YOUR CODE HERE
import pandas as pd

# Load the dataset
data_path = '/Users/joeko/Js/NCKU/2024Fall/Coursera/Python/CAPSTONE/amazon_reviews_us_Musical_Instruments_v1_00.tsv'
data = pd.read_csv(data_path, sep='\t', on_bad_lines='skip')

# Display basic information about the dataset
print(data.head())
print(data.info())

  marketplace  customer_id       review_id  product_id  product_parent  \
0          US     45610553   RMDCHWD0Y5OZ9  B00HH62VB6       618218723   
1          US     14640079   RZSL0BALIYUNU  B003LRN53I       986692292   
2          US      6111003   RIZR67JKUDBI0  B0006VMBHI       603261968   
3          US      1546619  R27HL570VNL85F  B002B55TRG       575084461   
4          US     12222213  R34EBU9QDWJ1GD  B00N1YPXW2       165236328   

                                       product_title     product_category  \
0  AGPtek® 10 Isolated Output 9V 12V 18V Guitar P...  Musical Instruments   
1         Sennheiser HD203 Closed-Back DJ Headphones  Musical Instruments   
2                   AudioQuest LP record clean brush  Musical Instruments   
3      Hohner Inc. 560BX-BF Special Twenty Harmonica  Musical Instruments   
4        Blue Yeti USB Microphone - Blackout Edition  Musical Instruments   

   star_rating  helpful_votes  total_votes vine verified_purchase  \
0            3         

To do this setup properly, you __should__ shuffle your data (and you should do so in your submission), but shuffling would change the checkpoint values, so for the sake of this example we will ___not___ shuffle the data.

### TODO 2: Split the data into a Training and Testing set

Have Training be the first 80%, and testing be the remaining 20%. 
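An ordered (unshuffled) split like the one described above might look like the sketch below. The function name `ordered_split` is hypothetical, and the list of integers stands in for your dataset rows; the same slicing idea should carry over to a list of dicts.

```python
def ordered_split(rows, train_fraction=0.8):
    """Split rows in their existing order: first part for training, rest for testing."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

rows = list(range(10))          # stand-in for the dataset
train, test = ordered_split(rows)
print(len(train), len(test))
```

With the reference lengths from this notebook (2107824 + 526957 = 2634781 rows), `int(2634781 * 0.8)` gives 2107824, which matches the expected training size.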

In [3]:
#YOUR CODE HERE
from sklearn.model_selection import train_test_split

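# Shuffle and split the dataset into training and testing sets
# Note: train_test_split shuffles by default; pass shuffle=False to keep the
# "first 80% / remaining 20%" ordering that the checkpoint values assume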
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

# Print the sizes of the datasets
print(f"Total data entries: {len(data)}")
print(f"Training set size: {len(train_set)}")
print(f"Testing set size: {len(test_set)}")

print(len(train_set), len(test_set))
print("Lengths should be: 2107824 526957")



Total data entries: 904004
Training set size: 723203
Testing set size: 180801
723203 180801
Lengths should be: 2107824 526957


#### Now delete your dataset
You don't want any of your answers to come from your original dataset any longer, but rather from your Training Set. This will help you avoid mistakes later on, especially when referencing the checkpoint solutions.

In [4]:
del data

# Verify that the original dataset is deleted
try:
    print(data)
except NameError:
    print("Original dataset deleted successfully.")


Original dataset deleted successfully.


### TODO 3: Extracting Basic Statistics

Next you need to answer some questions through any means (i.e. write a function or just find the answer) all based on the __Training Set:__
1. What is the __average rating__?
2. What fraction of reviews are from __verified purchases__?
3. How many __total users__ are there?
4. How many __total items__ are there?
5. What fraction of reviews have __5-star ratings__?

In [5]:
#YOUR CODE HERE
# Ensure you have already loaded the train_set DataFrame

# 1. Calculate the average rating
average_rating = train_set['star_rating'].mean()
print(f"Average Rating: {average_rating}")

# 2. Calculate the fraction of reviews from verified purchases
verified_fraction = (train_set['verified_purchase'] == 'Y').mean()
print(f"Fraction of Verified Purchases: {verified_fraction}")

# 3. Count the total number of unique users
total_users = train_set['customer_id'].nunique()
print(f"Total Users: {total_users}")

# 4. Count the total number of unique items
total_items = train_set['product_id'].nunique()
print(f"Total Items: {total_items}")

# 5. Calculate the fraction of reviews with a 5-star rating
five_star_fraction = (train_set['star_rating'] == 5).mean()
print(f"Fraction of 5-Star Ratings: {five_star_fraction}")

Average Rating: 4.2507497894781965
Fraction of Verified Purchases: 0.8637837508970511
Total Users: 481442
Total Items: 111187
Fraction of 5-Star Ratings: 0.6331749176925427


### Checkpoint:

Here is a list of answers for the questions above. Use these to reference how you are doing in finding the correct solutions.
1. 4.219492709068689
2. 0.9176558384381238
3. 1396587
4. 294787
5. 0.642813631498645

# Task 2: Classification

Next you will use your knowledge of classification to extract features and make predictions based on them. Here you will be using a Logistic Regression model; keep this in mind so you know which course to look back at for help.

### TODO 1: Define the feature function

This implementation will be based on the __star rating__ and the ___length___ of the __review body__. Hint: Remember the offset!

In [6]:
#YOUR CODE HERE

# Define the feature function
def feature_function(data):
    """
    Extract features for logistic regression:
    1. Offset (always 1)
    2. Star rating
    3. Review body length (word count)
    """
    # Calculate the length of the review body
    review_length = data['review_body'].apply(lambda x: len(str(x).split()))
    
    # Create the feature matrix
    features = pd.DataFrame({
        'offset': 1,  # Offset term
        'star_rating': data['star_rating'],
        'review_length': review_length
    })
    
    return features


In [7]:
# Generate features for the training set
X_train = feature_function(train_set)

# Display the first few rows of the feature matrix
print(X_train.head())


        offset  star_rating  review_length
235985       1            5             17
398088       1            5             85
893773       1            3             90
813575       1            5            163
288860       1            3              6


### TODO 2: Fit your model

1. Create your __Feature Vector__ based on your feature function defined above. 
2. Create your __Label Vector__ based on the "verified purchase" column of your training set.
3. Define your model as a __Logistic Regression__ model.
4. Fit your model.

In [8]:
#YOUR CODE HERE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Step 1: Extract features for the training set
X_train = feature_function(train_set)

# Step 2: Create the label vector
# Encode 'Y' as 1 and 'N' as 0 in the verified_purchase column
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_set['verified_purchase'])

# Step 3: Initialize the Logistic Regression model
logistic_model = LogisticRegression()

# Step 4: Train the model
logistic_model.fit(X_train, y_train)

print("Model training complete!")

# Print the model coefficients
print("Model coefficients:", logistic_model.coef_)
print("Model intercept:", logistic_model.intercept_)



Model training complete!
Model coefficients: [[ 0.88858746  0.11158428 -0.00509123]]
Model intercept: [0.88859591]


### TODO 3: Compute Accuracy of Your Model

1. Make __Predictions__ based on your model.
2. Compute the __Accuracy__ of your model.

In [9]:
#YOUR CODE HERE

from sklearn.metrics import accuracy_score

# Step 1: Extract features for the testing set
X_test = feature_function(test_set)

# Step 2: Encode the labels for the testing set
y_test = label_encoder.transform(test_set['verified_purchase'])

# Step 3: Generate predictions using the trained model
y_pred = logistic_model.predict(X_test)

# Step 4: Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")


Model Accuracy: 0.8638


### TODO 4: Finding the Balanced Error Rate

1. Compute __True__ and __False Positives__
2. Compute __True__ and __False Negatives__
3. Compute __Balanced Error Rate__ based on your above defined variables.
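The three steps above can also be worked through by hand. The sketch below counts the four outcomes directly and averages the false positive and false negative rates; the function name `balanced_error_rate` and the toy labels are made up for illustration, and the code assumes both classes appear in the labels (otherwise a division by zero occurs).

```python
def balanced_error_rate(labels, predictions):
    """BER = (false positive rate + false negative rate) / 2."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # true positives
    tn = sum(1 for y, p in pairs if not y and not p)  # true negatives
    fp = sum(1 for y, p in pairs if not y and p)      # false positives
    fn = sum(1 for y, p in pairs if y and not p)      # false negatives
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    return 0.5 * (fpr + fnr)

# Toy example: 4 positives, 2 negatives
labels      = [True, True, True, True, False, False]
predictions = [True, True, True, False, True, False]
ber = balanced_error_rate(labels, predictions)
print(ber)  # 0.375: fpr = 1/2, fnr = 1/4

# A classifier that always predicts True has BER 0.5 regardless of accuracy
print(balanced_error_rate(labels, [True] * len(labels)))  # 0.5
```

The second call hints at why a model can have high accuracy on imbalanced labels and still a BER close to 0.5.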

In [10]:
#YOUR CODE HERE

from sklearn.metrics import confusion_matrix

# Step 1: Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = conf_matrix.ravel()

# Step 2: Calculate FPR and FNR
fpr = fp / (fp + tn)  # False Positive Rate
fnr = fn / (fn + tp)  # False Negative Rate

# Step 3: Calculate BER
ber = (fpr + fnr) / 2
print(f"Balanced Error Rate (BER): {ber:.4f}")


Balanced Error Rate (BER): 0.4819


### Checkpoint:

Here is a list of answers for the questions above. Use these to reference how you are doing in finding the correct solutions.
3. Accuracy = 0.9172573231920692
4. BER = 0.4895446422238072

# Task 3: Regression

In this section you will start by working through two examples of altering features to further differentiate. Then you will work through how to evaluate a regularized model.

Let's start by defining a new y vector, specific to our Regression model.

In [11]:
# Extract 'star_rating' column as a list
y_reg = train_set['star_rating'].tolist()

# Verify the result
print(y_reg[:5])  # Print the first 5 values


[5, 5, 3, 5, 3]


### TODO 1: Unique Words in a Sample Set

We are going to work with a smaller Sample Set here, as stemming on the normal training set will take a very long time. (Feel free to change sampleSet -> train_set if you would like to see)

1. Count the number of unique words found within the 'review body' portion of the sample set defined below, making sure to __Ignore Punctuation and Capitalization__.
2. Count the number of unique words found within the 'review body' portion of the sample set defined below, this time with use of __Stemming,__ __Ignoring Punctuation,__ ___and___ __Capitalization__.

In [12]:
#GIVEN for 1.
wordCount = defaultdict(int)
punctuation = set(string.punctuation)

#GIVEN for 2.
wordCountStem = defaultdict(int)
stemmer = PorterStemmer() #use stemmer.stem(stuff)

In [13]:
sampleSet = train_set[:2*len(train_set)//10]

In [14]:
#YOUR CODE HERE

from collections import defaultdict
from nltk.stem.porter import PorterStemmer
import string

# Initialize stemmer and punctuation set
stemmer = PorterStemmer()
punctuation = set(string.punctuation)

# Sample the dataset (20% of training set for efficiency)
sample_set = train_set.sample(frac=0.2, random_state=42)

# Step 1: Count unique words (ignoring punctuation and capitalization)
word_count = defaultdict(int)
for review in sample_set['review_body']:
    # Remove punctuation and convert to lowercase
    cleaned_review = ''.join([c.lower() for c in str(review) if c not in punctuation])
    # Count unique words
    for word in cleaned_review.split():
        word_count[word] += 1

# Step 2: Count unique words with stemming
word_count_stem = defaultdict(int)
for review in sample_set['review_body']:
    # Remove punctuation and convert to lowercase
    cleaned_review = ''.join([c.lower() for c in str(review) if c not in punctuation])
    # Apply stemming and count unique words
    for word in cleaned_review.split():
        stemmed_word = stemmer.stem(word)
        word_count_stem[stemmed_word] += 1

# Results
print(f"Unique words (no stemming): {len(word_count)}")
print(f"Unique words (with stemming): {len(word_count_stem)}")


Unique words (no stemming): 124564
Unique words (with stemming): 103340


### TODO 2: Evaluating Classifiers

1. Given the feature function and your counts vector, __Define__ your X_reg vector. (This being the X vector, simply labeled for the Regression model)
2. __Fit__ your model using a __Ridge Model__ with (alpha = 1.0, fit_intercept = True).
3. Using your model, __Make your Predictions__.
4. Find the __MSE__ between your predictions and your y_reg vector.

In [15]:
# Update the feature_reg function
def feature_reg(datum):
    # Initialize a feature vector with zeros
    feat = [0] * len(words)
    # Ensure the review_body is a string, otherwise use an empty string
    review_body = str(datum['review_body']) if isinstance(datum['review_body'], str) else ""
    # Remove punctuation and convert to lowercase
    r = ''.join([c for c in review_body.lower() if c not in punctuation])
    # Count word frequencies
    for w in r.split():
        if w in wordSet:
            feat[wordId[w]] += 1
    return feat


def MSE(predictions, labels):
    differences = [(x-y)**2 for x,y in zip(predictions,labels)]
    return sum(differences) / len(differences)


In [16]:
#GIVEN COUNTS AND SETS
counts = [(wordCount[w], w) for w in wordCount]
counts.sort()
counts.reverse()

#Note: increasing the size of the dictionary may require a lot of memory
words = [x[1] for x in counts[:100]]

wordId = dict(zip(words, range(len(words))))
wordSet = set(words)

In [17]:
# Count words in the training set
word_count = defaultdict(int)
for review in train_set['review_body']:
    review_text = str(review) if isinstance(review, str) else ""
    cleaned_review = ''.join([c.lower() for c in review_text if c not in punctuation])
    for word in cleaned_review.split():
        word_count[word] += 1

# Create the top N words dictionary (e.g., top 100 words)
N = 100  # Adjust the number as needed
counts = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
words = [x[0] for x in counts[:N]]
wordId = {word: i for i, word in enumerate(words)}
wordSet = set(words)


In [18]:
#YOUR CODE HERE
from sklearn.linear_model import Ridge
import numpy as np

# Create the feature matrix (X_reg) using the training set
X_reg = np.array([feature_reg(row) for _, row in train_set.iterrows()])
# Ridge Regression target variable
y_reg = train_set['star_rating'].tolist()

# Initialize and fit the Ridge Regression model
ridge_model = Ridge(alpha=1.0, fit_intercept=True)
ridge_model.fit(X_reg, y_reg)

# Make predictions
y_pred = ridge_model.predict(X_reg)

# Calculate Mean Squared Error
def MSE(predictions, labels):
    differences = [(x - y) ** 2 for x, y in zip(predictions, labels)]
    return sum(differences) / len(differences)

mse = MSE(y_pred, y_reg)
print(f"Mean Squared Error (MSE): {mse:.4f}")


Mean Squared Error (MSE): 1.2245


In [19]:

# Initialize punctuation set
punctuation = set(string.punctuation)

# Step 1: Count words in the training set
word_count = defaultdict(int)
for review in train_set['review_body']:
    review_text = str(review) if isinstance(review, str) else ""
    cleaned_review = ''.join([c.lower() for c in review_text if c not in punctuation])
    for word in cleaned_review.split():
        word_count[word] += 1

# Checkpoint: Unique word counts
print(f"Checkpoint: Unique words (no stemming): {len(word_count)}")
# Note: `words` is the top-N dictionary built above, not a stemmed vocabulary
print(f"Checkpoint: Dictionary size (top words): {len(words)}")

# Checkpoint: Feature matrix shape
print(f"Checkpoint: Feature matrix shape: {X_reg.shape}")

# Checkpoint: Mean Squared Error (MSE)
print(f"Checkpoint: Mean Squared Error (MSE): {mse:.4f}")


Checkpoint: Unique words (no stemming): 343747
Checkpoint: Dictionary size (top words): 100
Checkpoint: Feature matrix shape: (723203, 100)
Checkpoint: Mean Squared Error (MSE): 1.2245


### Checkpoint:

Here is a list of answers for the questions above. Use these to reference how you are doing in finding the correct solutions.
1. len(wordCount) = 135769
2. len(wordCountStem) = 113888
4. MSE = 1.2869981011943792 (Roughly, could change slightly due to rounding errors)

In [20]:
# If you would like to work with this example more in your free time, here are some tips to improve your solution:
# 1. Implement a validation pipeline and tune the regularization parameter
# 2. Alter the word features (e.g. dictionary size, punctuation, capitalization, stemming, etc.)
# 3. Incorporate features other than word features
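Tip 1 above (a validation pipeline for the regularization parameter) could be sketched roughly as follows. To keep the example self-contained it uses a closed-form ridge solver written with NumPy rather than `sklearn.linear_model.Ridge`, and it runs on synthetic data; the names `fit_ridge`, `mse`, the candidate `alphas`, and the 200/50/50 split are all assumptions made for illustration.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centering handles the intercept
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(A, Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def mse(pred, y):
    return float(np.mean((pred - y) ** 2))

# Synthetic data standing in for (X_reg, y_reg)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 3.0 + rng.normal(scale=0.5, size=300)

# Train / validation / test portions, kept in order
X_train, y_train = X[:200], y[:200]
X_valid, y_valid = X[200:250], y[200:250]
X_test, y_test = X[250:], y[250:]

# Pick the alpha with the lowest validation MSE
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
valid_mse = {}
for a in alphas:
    w, b = fit_ridge(X_train, y_train, a)
    valid_mse[a] = mse(X_valid @ w + b, y_valid)
best_alpha = min(valid_mse, key=valid_mse.get)

# Report performance on the held-out test portion only once, at the end
w, b = fit_ridge(X_train, y_train, best_alpha)
test_mse = mse(X_test @ w + b, y_test)
print(best_alpha, test_mse)
```

The key design point is that the test portion is never used to choose `alpha`; only the validation portion is.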

# Task 4: Recommendation Systems

For your final task, you will use your knowledge of simple latent factor-based recommender systems to make predictions, then evaluate the performance of those predictions.

### Starting up

The next cell contains some starter code that you will need for your tasks in this section.
Notice you are back to using the __train_set__.

### TODO 1: Calculate the ratingMean

1. Find the __average rating__ of your training set.
2. Calculate a __baseline MSE value__ between the actual ratings and the average rating.

In [21]:
#YOUR CODE HERE
# Step 1: Calculate the average rating
rating_mean = train_set['star_rating'].mean()
print(f"Rating Mean: {rating_mean:.4f}")

# Step 2: Calculate the baseline MSE
# Use the mean rating as a constant prediction for all items
y_baseline = [rating_mean] * len(train_set)
y_actual = train_set['star_rating'].tolist()

def MSE(predictions, labels):
    differences = [(x - y) ** 2 for x, y in zip(predictions, labels)]
    return sum(differences) / len(differences)

baseline_mse = MSE(y_baseline, y_actual)
print(f"Baseline Mean Squared Error (MSE): {baseline_mse:.4f}")


Rating Mean: 4.2507
Baseline Mean Squared Error (MSE): 1.4810


Here we are defining the functions you will need to optimize your MSE value. 

In [22]:
from collections import defaultdict
import numpy as np
import scipy.optimize

# Initialize dictionaries for user and item biases
user_biases = defaultdict(float)
item_biases = defaultdict(float)

# Group reviews by user and item
reviews_per_user = defaultdict(list)
reviews_per_item = defaultdict(list)

for _, row in train_set.iterrows():
    user = row['customer_id']
    item = row['product_id']
    reviews_per_user[user].append(row)
    reviews_per_item[item].append(row)

# Number of users and items
n_users = len(reviews_per_user)
n_items = len(reviews_per_item)

# User and item lists
users = list(reviews_per_user.keys())
items = list(reviews_per_item.keys())


In [23]:
# Prediction function
def prediction(user, item):
    return alpha + user_biases[user] + item_biases[item]

# Unpack the parameters
def unpack(theta):
    global alpha, user_biases, item_biases
    alpha = theta[0]
    user_biases = dict(zip(users, theta[1:n_users+1]))
    item_biases = dict(zip(items, theta[1+n_users:]))

# Cost function
def cost(theta, labels, lamb):
    unpack(theta)
    predictions = [prediction(row['customer_id'], row['product_id']) for _, row in train_set.iterrows()]
    cost = MSE(predictions, labels)
    # Add regularization terms
    cost += lamb * sum(user_biases[u] ** 2 for u in user_biases)
    cost += lamb * sum(item_biases[i] ** 2 for i in item_biases)
    return cost

# Derivative function
def derivative(theta, labels, lamb):
    unpack(theta)
    dalpha = 0
    d_user_biases = defaultdict(float)
    d_item_biases = defaultdict(float)

    for _, row in train_set.iterrows():
        user = row['customer_id']
        item = row['product_id']
        true_rating = row['star_rating']
        pred = prediction(user, item)
        diff = pred - true_rating
        dalpha += 2 * diff / len(train_set)
        d_user_biases[user] += 2 * diff / len(train_set)
        d_item_biases[item] += 2 * diff / len(train_set)

    for u in user_biases:
        d_user_biases[u] += 2 * lamb * user_biases[u]
    for i in item_biases:
        d_item_biases[i] += 2 * lamb * item_biases[i]

    grad = [dalpha] + [d_user_biases[u] for u in users] + [d_item_biases[i] for i in items]
    return np.array(grad)
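The functions above loop over the training set with `iterrows` on every evaluation, which can be slow at this scale. One possible alternative, sketched below, computes the same objective (MSE plus `lamb` times the sum of squared biases) and its gradient with NumPy arrays, assuming users and items have first been mapped to integer indices. The names `make_objective`, `user_idx`, `item_idx`, and the toy ratings are hypothetical.

```python
import numpy as np
import scipy.optimize

def make_objective(user_idx, item_idx, ratings, n_users, n_items, lamb):
    """Return a function theta -> (cost, gradient) for the bias-only model."""
    n = len(ratings)

    def objective(theta):
        alpha = theta[0]
        beta_u = theta[1:1 + n_users]
        beta_i = theta[1 + n_users:]
        # Prediction error for every review at once
        diff = alpha + beta_u[user_idx] + beta_i[item_idx] - ratings
        cost = np.mean(diff ** 2) + lamb * (beta_u @ beta_u + beta_i @ beta_i)
        grad = np.empty_like(theta)
        grad[0] = 2.0 * diff.sum() / n
        # bincount sums the errors belonging to each user / item
        grad[1:1 + n_users] = (2.0 * np.bincount(user_idx, weights=diff, minlength=n_users) / n
                               + 2.0 * lamb * beta_u)
        grad[1 + n_users:] = (2.0 * np.bincount(item_idx, weights=diff, minlength=n_items) / n
                              + 2.0 * lamb * beta_i)
        return cost, grad

    return objective

# Toy data (made up): six reviews from three users on three items
users = ['u1', 'u1', 'u2', 'u3', 'u3', 'u2']
items = ['a', 'b', 'a', 'b', 'c', 'c']
ratings = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 4.0])

user_id = {u: k for k, u in enumerate(dict.fromkeys(users))}
item_id = {i: k for k, i in enumerate(dict.fromkeys(items))}
user_idx = np.array([user_id[u] for u in users])
item_idx = np.array([item_id[i] for i in items])
n_users, n_items = len(user_id), len(item_id)

objective = make_objective(user_idx, item_idx, ratings, n_users, n_items, lamb=0.1)
theta0 = np.concatenate(([ratings.mean()], np.zeros(n_users + n_items)))
initial_cost = objective(theta0)[0]

# When fprime is omitted, fmin_l_bfgs_b expects func to return (cost, gradient)
theta_opt, final_cost, info = scipy.optimize.fmin_l_bfgs_b(objective, theta0)
print(initial_cost, final_cost)
```

Returning the cost and gradient together avoids computing the prediction errors twice per iteration.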


### TODO 2: Optimize

1. __Optimize__ your MSE using the scipy.optimize.fmin_l_bfgs_b("arguments") function.

In [24]:
#YOUR CODE HERE
# Initial parameters
alpha = rating_mean
initial_user_biases = np.zeros(n_users)
initial_item_biases = np.zeros(n_items)
initial_theta = np.hstack(([alpha], initial_user_biases, initial_item_biases))

# Labels (actual ratings)
labels = train_set['star_rating'].tolist()

# Regularization parameter
lamb = 0.1

# Optimize using scipy
result = scipy.optimize.fmin_l_bfgs_b(cost, initial_theta, fprime=derivative, args=(labels, lamb))
optimized_theta = result[0]

# Unpack the optimized parameters
unpack(optimized_theta)

# Calculate the optimized MSE
optimized_predictions = [prediction(row['customer_id'], row['product_id']) for _, row in train_set.iterrows()]
optimized_mse = MSE(optimized_predictions, labels)
print(f"Optimized Mean Squared Error (MSE): {optimized_mse:.4f}")


Optimized Mean Squared Error (MSE): 1.4803


In [25]:
import numpy as np
from collections import defaultdict
import scipy.optimize

# Step 1: Calculate the average rating (ratingMean)
rating_mean = train_set['star_rating'].mean()
print(f"Rating Mean: {rating_mean:.4f}")

# Step 2: Calculate the baseline MSE (baseLine)
y_baseline = [rating_mean] * len(train_set)
y_actual = train_set['star_rating'].tolist()

def MSE(predictions, labels):
    differences = [(x - y) ** 2 for x, y in zip(predictions, labels)]
    return sum(differences) / len(differences)

baseline_mse = MSE(y_baseline, y_actual)
print(f"Baseline MSE: {baseline_mse:.4f}")

print(f"Optimized Mean Squared Error (MSE): {optimized_mse:.4f}")

Rating Mean: 4.2507
Baseline MSE: 1.4810
Optimized Mean Squared Error (MSE): 1.4803


### Checkpoint:

Here is a list of answers for the questions above. Use these to reference how you are doing in finding the correct solutions.
1. ratingMean = 4.219492709068689
2. baseLine = 1.634495697493549
3. optimized MSE -> converges to roughly 1.6083083.....

## You're all done!

Congratulations! This project marks the end of four whole courses' worth of content! It clearly didn't cover every single topic from those courses, but it serves as a summary of everything you have learned. This is only the start of Python Data Projects, so continue to learn and good luck in your future endeavors!