<a href="https://colab.research.google.com/github/NinaMwangi/Pi_Swap/blob/main/PiSwap.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Optimization Techniques in Machine Learning

Objective: This assignment aims to explore implementation or Machine Learning Models with regularization, optimization and Error analysis  techniques used in machine learning to improve models' performance, convergence speed, and efficiency..

A Notebook detailing the following

* Project name
* Clear out puts from cells






**Instructions**

1. Acquire a dataset suitable for ML tasks as per your proposal.
2. Implement a simple machine learning model based on neural networks on the chosen dataset without any defined optimization techniques. (Check instructions)
3. Implement and compare the model's performance after applying 3 to 4 disntict combinations regularization and optimization techniques.
4. Discuss the results on the README file.
5. Make predictions using test data
7. Implement error analysis techniques and ensure there is: F1-Score, Recall, Precision, RUC a confusion matrix using plotting libraries (not verbose)

Submit notebook to github repo



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install unidecode




# Case Study and Implementation




In [None]:
#Import Necessary Libraries
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GroupShuffleSplit, StratifiedKFold,StratifiedShuffleSplit
from sklearn.metrics import mean_squared_error
from sklearn.feature_extraction.text import TfidfVectorizer
import unidecode

# The Dataset
> My dataset contains a list of secondhand books from the pre-primary level all the way to the tertiary level. With the high cost of living I would like to create a model that will be used to recommend the right second hand books to the user based on different factors that are both user based and item based, to allow easier accessibility to fair priced good quality second hand books. The data contains the following columns. isbn, book_id, User_id, title, author, unit_price, unit_price_vat, level, subject, description, condition, purchase_intent, previous_purchase,rating, clicks, wishlist.


In [None]:
#TO DO: Load Data (Seprate into: Train, Validation and test sets)
#Importing dataset and defining columns.
data = pd.read_csv('/content/drive/MyDrive/Summative Intro 2 ML/PiSwap Data - Sheet1.csv')
columns = ['isbn', 'book_id', 'User_id', 'title', 'author', 'unit_price', 'unit_price_vat', 'level', 'subject', 'description', 'condition', 'purchase_intent', 'previous_purchase', 'rating', 'clicks', 'wishlist']
data.columns = columns
data.head()

Unnamed: 0,isbn,book_id,User_id,title,author,unit_price,unit_price_vat,level,subject,description,condition,purchase_intent,previous_purchase,rating,clicks,wishlist
0,978-9966-65-190-7,1,1176,KLB Skillgrow Mathematical Activities L/B PP1,J. Mbugua et al,240,279,PP1,Mathematical Activities,"KLB Skillgrow Mathematical Activities L/B PP1,...",Fair,0,0.0,2.0,8.0,0.0
1,978-9966-65-191-4,2,1031,KLB Skillgrow Mathematical Activities T/G PP1,J. Mbugua et al,340,395,PP1,Mathematical Activities,"KLB Skillgrow Mathematical Activities T/G PP1,...",Fair,0,0.0,2.0,8.0,1.0
2,978-9966-65-185-3,3,1010,KLB Skillgrow Language Activities(English) L/B...,G. Wambiri et al,345,401,PP1,English Activities,KLB Skillgrow Language Activities(English) L/B...,Fair,0,0.0,1.0,7.0,1.0
3,978-9966-65-184-6,4,1051,KLB Skillgrow Language Activities(English) T/G...,G. Wambiri et al,382,444,PP1,English Activities,KLB Skillgrow Language Activities(English) T/G...,Fair,0,1.0,3.0,7.0,1.0
4,978-9966-65-188-4,5,1158,KLB Skillgrow Kiswahili Activities L/B PP1,S. Wandera et al,240,279,PP1,Kiswahili Activities,"KLB Skillgrow Kiswahili Activities L/B PP1,S. ...",Fair,0,0.0,5.0,1.0,1.0


#Task: Define a function that creates models without and With specified Optimization techniques


In [None]:
def clean_title(title):
  return unidecode.unidecode(title.lower().strip())

#Cleaning titles and removing inconsistencies
data["title"] = data["title"].apply(clean_title)

#Removing duplicates
data = data.drop_duplicates(subset=["title"])

#Removing duplicate user-book interactions (if ratings exist)
data = data.drop_duplicates(subset=["User_id", "title"])


In [None]:
#Use StratifiedShuffleSplit to ensure overlap
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# Getting the value counts of 'User_id'
user_counts = data['User_id'].value_counts()

# Filtering out users with only one occurrence
valid_users = user_counts[user_counts > 1].index

# Filter the data to include only valid users
filtered_data = data[data['User_id'].isin(valid_users)]

# performing the split on the filtered data
for train_indices, test_indices in splitter.split(filtered_data, filtered_data["User_id"]):
    train_data = filtered_data.iloc[train_indices]
    test_data = filtered_data.iloc[test_indices]


In [None]:
# Creating rating matrix from updated train data
train_rating_matrix = train_data.pivot_table(values="rating", index="User_id", columns="title")
test_rating_matrix = test_data.pivot_table(values="rating", index="User_id", columns="title")

# Ensuring test data has valid users
test_data = test_data[test_data["User_id"].isin(train_rating_matrix.index)]
# Ensuring test users exist in train_data
test_data = test_data[test_data["User_id"].isin(train_data["User_id"])]

# Filling missing values with each book's average rating
train_rating_matrix = train_rating_matrix.apply(lambda col: col.fillna(col.mean()), axis=0)

# Ensuring item overlap between train and test sets
train_items = train_rating_matrix.columns
test_rating_matrix = test_rating_matrix.loc[:, test_rating_matrix.columns.isin(train_items)]

# Compute cosine similarity
item_similarity = cosine_similarity(train_rating_matrix.T)

# Build similarity dataframe
item_similarity_df = pd.DataFrame(item_similarity,
                                  index=train_rating_matrix.columns,
                                  columns=train_rating_matrix.columns)


 No missing users to add.
 Users with all zero ratings: 0
 No zero-rated users found. Skipping replacement.


In [None]:
def recommend_books(user_id, num_recommendations=5):
    if user_id not in train_rating_matrix.index:
        most_popular = train_rating_matrix.count().sort_values(ascending=False).index[:num_recommendations]
        return list(most_popular)

    # Get user ratings
    user_ratings = train_rating_matrix.loc[user_id]
    rated_items = user_ratings[user_ratings > 0].index

    if len(rated_items) == 0:
        most_popular = train_rating_matrix.count().sort_values(ascending=False).index[:num_recommendations]
        return list(most_popular)

    # Store similar books
    similar_items = {}
    for item in rated_items:
        if item in item_similarity_df.index:
            similar_scores = item_similarity_df[item]
            similar_items[item] = similar_scores.drop(item)
        else:
            print(f" Item '{item}' not found in similarity matrix.")

    # Compute weighted scores
    weighted_scores = {}
    for item, similar in similar_items.items():
        for similar_item, score in similar.items():
            weighted_scores.setdefault(similar_item, 0)
            weighted_scores[similar_item] += score * user_ratings[item]

    # Return top recommended books
    if weighted_scores:
        recommendations = sorted(weighted_scores.items(), key=lambda x: x[1], reverse=True)[:num_recommendations]
        recommended_items = [item for item, score in recommendations]
        return recommended_items
    else:
        most_popular = train_rating_matrix.count().sort_values(ascending=False).index[:num_recommendations]
        return list(most_popular)

# Task: Evaluating the model using RMSE, Precision@K, Recall@k

In [None]:
# RMSE Calculation
def calculate_rmse(predictions, actuals):
    predictions = np.array(predictions)
    actuals = np.array(actuals)

    # Filter out NaN values
    valid_indices = ~np.isnan(predictions) & ~np.isnan(actuals)
    predictions = predictions[valid_indices]
    actuals = actuals[valid_indices]

    if actuals.size > 0:
        return np.sqrt(mean_squared_error(actuals, predictions))
    return 0

# Precision@K Calculation
def calculate_precision_at_k(predictions, actuals, k=5):
    if len(actuals) == 0 or len(predictions) == 0:
        return 0.0
    intersection = np.intersect1d(predictions, actuals)
    return len(intersection) / min(k, len(predictions))

# Recall@K Calculation
def calculate_recall_at_k(predictions, actuals, k=5):
    if len(actuals) == 0 or len(predictions) == 0:
        return 0.0
    intersection = np.intersect1d(predictions, actuals)
    return len(intersection) / len(actuals)


 **Evaluation Results:**
 **Average RMSE:** 0.0000
 **Average Precision@K:** 0.0023
 **Average Recall@K:** 0.0116


In [None]:
# Model Evaluation
def evaluate_predictions(test_rating_matrix, item_similarity_df, train_rating_matrix, k=5):
    rmse_scores = []
    precision_at_k_scores = []
    recall_at_k_scores = []

    for user_id in test_rating_matrix.index:
        user_test_ratings = test_rating_matrix.loc[user_id].dropna()
        if user_test_ratings.empty:
            continue

        actual_items = user_test_ratings[user_test_ratings > 0].index.tolist()
        predicted_items = recommend_books(user_id)

        if isinstance(predicted_items, str) or len(predicted_items) == 0:
            continue

        # RMSE Calculation (for rating prediction)

        for item in actual_items:
          try:
            predicted_rating = predict_rating(user_id, item, item_similarity_df, train_rating_matrix)
            predicted_ratings = np.array(predicted_ratings)
          except KeyError:
            pass

        # Precision@K Calculation
        precision_at_k = calculate_precision_at_k(predicted_items, actual_items, k)
        if not np.isnan(precision_at_k):
            precision_at_k_scores.append(precision_at_k)

        # Recall@K Calculation
        recall_at_k = calculate_recall_at_k(predicted_items, actual_items, k)
        if not np.isnan(recall_at_k):
            recall_at_k_scores.append(recall_at_k)

    # Compute Averages
    avg_rmse = np.mean(rmse_scores) if rmse_scores else 0
    avg_precision_at_k = np.mean(precision_at_k_scores) if precision_at_k_scores else 0
    avg_recall_at_k = np.mean(recall_at_k_scores) if recall_at_k_scores else 0

    return avg_rmse, avg_precision_at_k, avg_recall_at_k

In [None]:
# Predict Rating Function
def predict_rating(user_id, item_id, item_similarity_df, train_rating_matrix):

    if user_id not in train_rating_matrix.index:
        return train_rating_matrix.stack().mean()

    if item_id not in train_rating_matrix.columns:
        return train_rating_matrix.stack().mean()

    user_ratings = train_rating_matrix.loc[user_id]

    # If all user ratings are zero, return mean rating
    if user_ratings.sum() == 0:
        return train_rating_matrix.stack().mean()

    # Compute similarity-based rating
    rated_items = user_ratings[user_ratings > 0].index

    if rated_items.empty:
        return train_rating_matrix.stack().mean()

        similar_items = {}
        for rated_item in rated_items:
            if rated_item in item_similarity_df.index:
                similar_scores = item_similarity_df[rated_item]
                similar_items[rated_item] = similar_scores.drop(rated_item)

        weighted_scores = {}
        for rated_item, similar in similar_items.items():
            for similar_item, score in similar.items():
                if similar_item == item_id:
                    weighted_scores.setdefault(similar_item, 0)
                    weighted_scores[similar_item] += score * user_ratings[rated_item]

        if weighted_scores:
            return np.mean(list(weighted_scores.values()))
        else:
          return train_rating_matrix[item_id].mean() if item_id in train_rating_matrix else train_rating_matrix.stack().mean()


    return train_rating_matrix.stack().mean()

# SECTION 2: Optimization and Regularization Combinations
Using Pearson Correlation instead of cosine similarity and adjusting Top K to 10

In [52]:
# Use Pearson Correlation Instead of Cosine Similarity
item_similarity_df_pearson = train_rating_matrix.T.corr(method="pearson").fillna(0)

# Apply Weighting Based on Rating Count
book_rating_counts = train_rating_matrix.count()  # How many users rated each book
book_weights = np.sqrt(book_rating_counts)  # More ratings = higher weight
item_similarity_df_pearson = item_similarity_df_pearson.multiply(book_weights, axis=0)

# Set K to 10
k = 10

# Run evaluation using Pearson-based similarity
avg_rmse_pearson, avg_precision_at_k_pearson, avg_recall_at_k_pearson = evaluate_predictions(
    test_rating_matrix, item_similarity_df_pearson, train_rating_matrix, k=k
)



 **Evaluation Results (Using Pearson Correlation & K=10):**
 **Average RMSE:** 0.0000
 **Average Precision@K:** 0.0023
 **Average Recall@K:** 0.0116


# Combining collaborative filtering and content based to form a hybrid recommender system.

In [48]:
from scipy.stats import rankdata

# Compute Content-Based Similarity (TF-IDF)
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
item_feature_matrix = vectorizer.fit_transform(data['description'])

item_similarity_content_df = pd.DataFrame(
    cosine_similarity(item_feature_matrix),
    index=data['title'],
    columns=data['title']
)

# Compute Pearson-Based Collaborative Filtering Similarity
item_similarity_df_pearson = train_rating_matrix.T.corr(method="pearson").fillna(0)

# Apply Truncated SVD for Dimensionality Reduction
svd = TruncatedSVD(n_components=100, random_state=42)
svd_matrix = svd.fit_transform(train_rating_matrix.fillna(0))

# Compute Similarity on SVD Transformed Data
svd_similarity = cosine_similarity(svd_matrix)
svd_similarity_df = pd.DataFrame(
    svd_similarity,
    index=train_rating_matrix.index,
    columns=train_rating_matrix.index
)

In [None]:
# New Hybrid Recommendation Function
def hybrid_recommend_books_svd(user_id, content_weight=0.4, cf_weight=0.4, svd_weight=0.2, num_recommendations=10):
    if user_id not in train_rating_matrix.index:
        return list(train_rating_matrix.count().sort_values(ascending=False).index[:num_recommendations])

    # Content-Based Recommendations
    user_rated_books = train_rating_matrix.loc[user_id].dropna().index
    content_scores = item_similarity_content_df[user_rated_books].sum(axis=1)
    content_based_recs = content_scores.nlargest(num_recommendations).index.tolist()

    # Pearson-Based CF Recommendations
    weighted_scores_cf = {}
    user_ratings = train_rating_matrix.loc[user_id]
    rated_items = user_ratings[user_ratings > 0].index

    for item in rated_items:
        if item in item_similarity_df_pearson.index:
            similar_scores = item_similarity_df_pearson[item]
            for similar_item, score in similar_scores.items():
                weighted_scores_cf.setdefault(similar_item, 0)
                weighted_scores_cf[similar_item] += score * user_ratings[item]

    collaborative_recs = sorted(weighted_scores_cf.items(), key=lambda x: x[1], reverse=True)
    collaborative_recs = [item for item, _ in collaborative_recs[:num_recommendations]]

    # SVD-Based User Similarity Recommendations
    similar_users = svd_similarity_df[user_id].nlargest(10).index.tolist()
    svd_recs = train_rating_matrix.loc[similar_users].mean().nlargest(num_recommendations).index.tolist()

    # Merge Hybrid Recommendations
    hybrid_scores = {}

    for item in content_based_recs:
        hybrid_scores[item] = content_weight * content_scores.get(item, 0)

    for item in collaborative_recs:
        hybrid_scores[item] = hybrid_scores.get(item, 0) + cf_weight * weighted_scores_cf.get(item, 0)

    for item in svd_recs:
        hybrid_scores[item] = hybrid_scores.get(item, 0) + svd_weight

    hybrid_recommendations = sorted(hybrid_scores.items(), key=lambda x: x[1], reverse=True)
    hybrid_recommendations = [item for item, _ in hybrid_recommendations[:num_recommendations]]

    return hybrid_recommendations

# Test the New Hybrid Model
user_id = 1182
hybrid_recs_svd = hybrid_recommend_books_svd(user_id, content_weight=0.4, cf_weight=0.4, svd_weight=0.2, num_recommendations=10)


# Evaluating the hybrid recomendation system.

In [49]:
def evaluate_hybrid_predictions(test_rating_matrix, item_similarity_df, train_rating_matrix, k=10):
    rmse_scores = []
    precision_at_k_scores = []
    recall_at_k_scores = []

    for user_id in test_rating_matrix.index:
        user_test_ratings = test_rating_matrix.loc[user_id].dropna()
        if user_test_ratings.empty:
            continue

        actual_items = user_test_ratings[user_test_ratings > 0].index.tolist()
        predicted_items = hybrid_recommend_books_svd(user_id, content_weight=0.4, cf_weight=0.4, svd_weight=0.2, num_recommendations=k)

        if isinstance(predicted_items, str) or len(predicted_items) == 0:
            continue

        # Calculate RMSE (Rating Prediction Accuracy)
        actual_ratings = user_test_ratings[actual_items]
        predicted_ratings = []
        for item in actual_items:
            try:
                predicted_rating = predict_rating(user_id, item, item_similarity_df, train_rating_matrix)
                predicted_ratings.append(predicted_rating)
            except KeyError:
                pass

        if actual_ratings.size > 0 and len(predicted_ratings) > 0:
            rmse = calculate_rmse(predicted_ratings, actual_ratings)
            if not np.isnan(rmse):
                rmse_scores.append(rmse)

        # Calculate Precision@K
        precision_at_k = calculate_precision_at_k(predicted_items, actual_items, k)
        precision_at_k_scores.append(precision_at_k)

        # Calculate Recall@K
        recall_at_k = calculate_recall_at_k(predicted_items, actual_items, k)
        recall_at_k_scores.append(recall_at_k)

    avg_rmse = np.mean(rmse_scores) if rmse_scores else np.nan
    avg_precision_at_k = np.mean(precision_at_k_scores) if precision_at_k_scores else np.nan
    avg_recall_at_k = np.mean(recall_at_k_scores) if recall_at_k_scores else np.nan

    return avg_rmse, avg_precision_at_k, avg_recall_at_k




 **Evaluation Results (Hybrid Recommender, K=10):**
 **Average RMSE:** 0.2451
 **Average Precision@K:** 0.0047
 **Average Recall@K:** 0.0349


In [51]:
import pickle

with open("recommend_books.pkl", "wb") as file:
    pickle.dump(hybrid_recommend_books_svd, file)

#Save the hybrid function
with open("hybrid_recommender_svd.pkl", "wb") as file:
    pickle.dump(hybrid_recommend_books_svd, file)

#Task: Make Predictions using the best saved model


Create a confusion Matrix and F1 score for both Models. Ensure outputs for the cells are visible

Finally, Make predictions using the best model. By the time you get to this cell you may realise at some point you needed to save the model so that you cal load it later

In [None]:
model_path = "/content/hybrid_recommender_svd.pkl"
user_ids = 1017, 1049

def make_predictions(model_path, user_ids, num_recommendations=5):

    # Load the model
    with open(model_path, "rb") as file:
        hybrid_recommender = pickle.load(file)
    # Make predictions
    recommendations = {}
    for user_id in user_ids:
        recommendations[user_id] = hybrid_recommender(user_id, num_recommendations)

    return recommendations

# Run the predictions
predictions = make_predictions(model_path, user_ids)

# Print the results
for user, recs in predictions.items():
    print(f"Recommendations for User {user}: {recs}")

Recommendations for User 1017: ['klb visionary kiswahili grade 4 l/b', 'klb visionary kiswahili grade 4 t/g', 'klb visionary kiswahili grade 5 l/b', 'klb visionary kiswahili grade 5 t/g', 'klb visionary kiswahili grade 6 l/b', 'klb visionary kiswahili grade 6 t/g', 'klb visionary christian religious education grade 5 l/b', 'klb visionary christian religious education grade 5 t/g', 'klb visionary christian religious education grade 6 l/b', 'klb visionary christian religious education grade 6 t/g']
Recommendations for User 1049: ['klb visionary kiswahili grade 4 l/b', 'klb visionary kiswahili grade 4 t/g', 'klb visionary kiswahili grade 5 l/b', 'klb visionary kiswahili grade 5 t/g', 'klb visionary kiswahili grade 6 l/b', 'klb visionary kiswahili grade 6 t/g', 'klb visionary christian religious education grade 5 l/b', 'klb visionary christian religious education grade 5 t/g', 'klb visionary christian religious education grade 6 l/b', 'klb visionary christian religious education grade 6 t/

Congratulations!!
