# Task 7

## Imports

In [15]:
!pip install scikit-surprise
import pandas as pd
from surprise import accuracy, Dataset, SVD, SVDpp, NMF
from surprise.model_selection import train_test_split
from surprise.model_selection import cross_validate



## Load Data

In [16]:
data = Dataset.load_builtin(name = 'ml-100k' , prompt = False)

## Calculations

In [17]:
%%time
# `data` was already loaded in the previous cell, so it is reused here
algorithms = {'SVD': SVD(), 'SVDpp': SVDpp(), 'NMF': NMF()}
cv = 5
measures = ['RMSE', 'MAE']

results = {}
check_results = {}
trained_models = {}

# cross-validation for each algorithm
for name, algorithm in algorithms.items():
    result = cross_validate(algorithm, data, measures=measures, cv=cv, verbose=True)
    results[name] = pd.DataFrame.from_dict(result).mean(axis=0)

    # Training and testing for each algorithm
    trainset, testset = train_test_split(data, test_size=0.25)
    algorithm.fit(trainset)
    predictions = algorithm.test(testset)
    rmse = accuracy.rmse(predictions, verbose=True)
    check_results[name] = rmse

    # Train the algorithms on the entire dataset
    algorithm.fit(data.build_full_trainset())
    trained_models[name] = algorithm


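# Generate predictions for every rating already in the dataset.
# Note: build_testset() returns only the known (user, item) pairs;
# unseen pairs would come from build_anti_testset() instead.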
all_predictions = {}
for name, algorithm in trained_models.items():
    predictions = algorithm.test(data.build_full_trainset().build_testset())
    all_predictions[name] = predictions

# creating a final DataFrame to compare the results
result_df = pd.DataFrame(results)
check_results_df = pd.DataFrame.from_dict(check_results, orient='index', columns=['RMSE_Check'])

print(result_df)
print()
print(check_results_df)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9394  0.9333  0.9469  0.9312  0.9334  0.9368  0.0057  
MAE (testset)     0.7393  0.7340  0.7447  0.7354  0.7373  0.7382  0.0038  
Fit time          0.39    0.40    0.40    0.43    0.43    0.41    0.02    
Test time         0.03    0.03    0.03    0.04    0.04    0.04    0.00    
RMSE: 0.9412
Evaluating RMSE, MAE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9195  0.9197  0.9247  0.9107  0.9244  0.9198  0.0051  
MAE (testset)     0.7225  0.7204  0.7254  0.7141  0.7248  0.7214  0.0041  
Fit time          6.43    6.24    6.36    6.34    6.37    6.35    0.06    
Test time         1.30    1.29    1.51    1.37    1.33    1.36    0.08    
RMSE: 0.9199
Evaluating RMSE, MAE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Me

In [22]:
%%time
def load_movie_list(filename):
    with open(filename, encoding='ISO-8859-1') as file:
        movies = file.readlines()
    movie_names = [movie.strip().split(' ', 1)[1] for movie in movies]
    return movie_names


def make_recommendations_for_user(predictions, user_id, num_recommendations):
    user_predictions = [pred for pred in predictions if pred.uid == str(user_id)]
    user_predictions.sort(key=lambda x: x.est, reverse=True)
    top_predictions = user_predictions[:num_recommendations]
    recommendations = [(pred.iid, pred.est) for pred in top_predictions]
    return recommendations

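def predict_movies_based_on_movie(predictions, movie_id, movie_names, num_recommendations=5):
    # Keep only the predictions made for this movie, highest estimate first
    movie_predictions = [pred for pred in predictions if pred.iid == str(movie_id)]
    movie_predictions.sort(key=lambda x: x.est, reverse=True)
    top_predictions = movie_predictions[:num_recommendations]
    # NOTE: pred.uid is a user ID, so indexing movie_names with it returns titles
    # unrelated to movie_id; this does not measure movie-to-movie similarity.
    recommendations = [(movie_names[int(pred.uid) - 1], pred.est) for pred in top_predictions]
    return recommendations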

movie_ids_file = 'movie_ids.txt'
movie_names = load_movie_list(movie_ids_file)

# Make recommendations for a specific user
user_id = 100
num_recommendations = 7
for name in algorithms:
    user_recommendations = make_recommendations_for_user(all_predictions[name], user_id, num_recommendations)
    print(f"\nRecommendations for user {user_id} [{name}]:")
    for movie_id, estimated_rating in user_recommendations:
        movie_name = movie_names[int(movie_id) - 1]
        print(f"Movie: {movie_name}, ID: {movie_id}, Estimated Rating: {estimated_rating:.2f}")

# Predict movies based on a specific movie
movie_id = 316
movie_name_ = "As Good As It Gets (1997)"
# num_recommendations = 10
for name in algorithms:
    movie_recommendations = predict_movies_based_on_movie(all_predictions[name], movie_id, movie_names, num_recommendations)
    print(f"\nMovies similar to '{movie_name_}' [{name}]:")
    for movie_name, estimated_rating in movie_recommendations:
        print(f"Movie: {movie_name}, Estimated Rating: {estimated_rating:.2f}")



Recommendations for user 100 [SVD]:
Movie: As Good As It Gets (1997), ID: 316, Estimated Rating: 4.18
Movie: Apt Pupil (1998), ID: 315, Estimated Rating: 4.04
Movie: L.A. Confidential (1997), ID: 302, Estimated Rating: 4.01
Movie: Good Will Hunting (1997), ID: 272, Estimated Rating: 3.98
Movie: Titanic (1997), ID: 313, Estimated Rating: 3.97
Movie: Air Force One (1997), ID: 300, Estimated Rating: 3.67
Movie: Seven Years in Tibet (1997), ID: 690, Estimated Rating: 3.62

Recommendations for user 100 [SVDpp]:
Movie: Titanic (1997), ID: 313, Estimated Rating: 4.17
Movie: Good Will Hunting (1997), ID: 272, Estimated Rating: 4.10
Movie: Air Force One (1997), ID: 300, Estimated Rating: 4.01
Movie: As Good As It Gets (1997), ID: 316, Estimated Rating: 3.93
Movie: Contact (1997), ID: 258, Estimated Rating: 3.88
Movie: Apt Pupil (1998), ID: 315, Estimated Rating: 3.81
Movie: L.A. Confidential (1997), ID: 302, Estimated Rating: 3.79

Recommendations for user 100 [NMF]:
Movie: Titanic (1997), ID:
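
The function `predict_movies_based_on_movie` above looks up titles by user ID, so its output is not a movie-to-movie similarity. One possible alternative, sketched here under assumptions, is to compare items by the cosine similarity of their latent factor vectors. The helper name `most_similar_items` and the tiny `item_factors` array are hypothetical; with scikit-surprise, a fitted `SVD` model exposes its item factors as `qi`, and `trainset.to_inner_iid` maps a raw movie ID to the matching row index.

```python
import numpy as np

def most_similar_items(item_factors, item_index, top_n=5):
    """Rank items by cosine similarity of their latent factor vectors."""
    norms = np.linalg.norm(item_factors, axis=1, keepdims=True)
    unit = item_factors / np.clip(norms, 1e-12, None)  # avoid division by zero
    sims = unit @ unit[item_index]
    sims[item_index] = -np.inf  # exclude the query item itself
    order = np.argsort(sims)[::-1][:top_n]
    return [(int(i), float(sims[i])) for i in order]

# Hypothetical factors: 4 items, 2 latent dimensions (not taken from ml-100k)
item_factors = np.array([[1.0, 0.0],
                         [0.9, 0.1],
                         [0.0, 1.0],
                         [-1.0, 0.0]])
print(most_similar_items(item_factors, item_index=0, top_n=2))
```

The returned indices are row positions in the factor matrix, so they would still need to be mapped back to raw movie IDs and titles.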

## Summary of Recommendation System Results

This analysis evaluated three recommendation system algorithms (SVD, SVD++, NMF) using the scikit-surprise library.


- SVD++ achieved the lowest RMSE and MAE (means of 0.9198 and 0.7214 across five folds, versus 0.9368 and 0.7382 for SVD), suggesting better prediction accuracy. However, its mean fit time per fold (6.35 s) is about 15 times that of SVD (0.41 s).
- NMF had the highest RMSE and MAE, potentially indicating lower prediction accuracy.
- SVD offers a good balance between performance (decent RMSE and MAE) and training efficiency (fastest training time).
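
As a small illustration of why both metrics are reported, the sketch below computes RMSE and MAE for two made-up sets of predictions. The numbers are invented for the example and are not taken from the runs above; the helper names `rmse` and `mae` are local to this sketch.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts the same."""
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([4.0, 3.0, 5.0, 2.0])
many_small = y_true + np.array([0.5, -0.5, 0.5, -0.5])  # four errors of 0.5
one_large = y_true + np.array([0.0, 0.0, 0.0, 2.0])     # a single error of 2.0

for label, y_pred in [("many small errors", many_small), ("one large error", one_large)]:
    print(f"{label}: RMSE={rmse(y_true, y_pred):.2f}, MAE={mae(y_true, y_pred):.2f}")
```

Both prediction sets have the same MAE, but the one with a single large miss has the higher RMSE, which is why the two metrics can rank algorithms differently.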

### Recommendations for Further Action

* **Additional User Testing:** While the metrics provide insight into algorithm performance, user testing is also crucial: real users can indicate which recommendations they find most relevant and helpful.

* **Consider Application Needs:** If recommendation speed is a priority, SVD might be a good choice. If accuracy is critical and training time is less of a concern, SVD++ could be a potential candidate.

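* **Explore Other Algorithms:** The scikit-surprise library offers other algorithms, such as the neighborhood methods KNNBasic and KNNWithMeans, SlopeOne, CoClustering, and BaselineOnly, whose baseline estimates can be fitted with ALS (Alternating Least Squares). Its SVD class already implements the Funk-style SVD. Consider testing these to see if they outperform the evaluated ones in specific use cases.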


## Part 2: Recommendation system from scratch

### Imports and functions

In [19]:
import numpy as np
import pandas as pd
from scipy.io import loadmat
from sklearn.metrics import mean_squared_error, mean_absolute_error

# load_movie_list is defined in Part 1 and reused here.

def normalize_ratings(Y, R):
    Ymean = np.sum(Y, axis=1) / np.sum(R, axis=1)
    Ymean = np.nan_to_num(Ymean)  # Ensure no NaNs in Ymean
    Ynorm = Y - Ymean[:, None] * R
    return Ynorm, Ymean

def cofi_cost_func(params, Y, R, num_users, num_movies, num_features, lambda_=0.0):
    X = params[:num_movies * num_features].reshape(num_movies, num_features)
    Theta = params[num_movies * num_features:].reshape(num_users, num_features)

    # Prediction error on rated entries only (R zeroes out unrated cells);
    # computed once and reused for the cost and both gradients
    error = np.dot(X, Theta.T) * R - Y

    J = (1 / 2) * np.sum(error ** 2)
    J += (lambda_ / 2) * (np.sum(Theta ** 2) + np.sum(X ** 2))

    X_grad = (error @ Theta) + lambda_ * X
    Theta_grad = (error.T @ X) + lambda_ * Theta

    grad = np.concatenate([X_grad.ravel(), Theta_grad.ravel()])
    return J, grad

def gradient_descent(Y, R, num_users, num_movies, num_features, alpha=0.002, lambda_=0.02, iterations=1000):
    X = np.random.rand(num_movies, num_features)
    Theta = np.random.rand(num_users, num_features)
    params = np.concatenate([X.ravel(), Theta.ravel()])

    print('Gradient descent calculations:')
    for i in range(iterations):
        cost, grad = cofi_cost_func(params, Y, R, num_users, num_movies, num_features, lambda_)
        params -= alpha * grad
        if i % 100 == 0:
            print(f'Iteration {i}: cost = {cost}')

    X = params[:num_movies * num_features].reshape(num_movies, num_features)
    Theta = params[num_movies * num_features:].reshape(num_users, num_features)

    return X, Theta

def predict_ratings(X, Theta, Ymean):
    predictions = np.dot(X, Theta.T) + Ymean[:, None]
    # print(f"Predictions before clipping:\n{predictions}")
    return np.clip(predictions, 1, 5)


def make_recommendations(predicted_ratings, movie_names, user_id, num_recommendations):
    user_row = predicted_ratings[:, user_id - 1]
    sorted_indices = np.argsort(user_row)[::-1]
    top_indices = sorted_indices[:num_recommendations]
    recommendations = [(idx + 1, movie_names[idx], user_row[idx]) for idx in top_indices]
    return recommendations

def predict_movies(movie_name, movie_names, Y, R, num_recommendations=5):
    movie_index = movie_names.index(movie_name)
    movie_user_ratings = Y[movie_index]
    moviematrix = pd.DataFrame(Y, index=movie_names)
    similar_to_movie = moviematrix.T.corrwith(moviematrix.loc[movie_name])
    corr_movie = pd.DataFrame(similar_to_movie, columns=['correlation'])
    corr_movie.dropna(inplace=True)
    ratings_count = R.sum(axis=1)
    corr_movie['number of ratings'] = ratings_count
    predictions = corr_movie[corr_movie['number of ratings'] > 100].sort_values('correlation', ascending=False)
    return predictions.head(num_recommendations)

def calculate_rmse(Y, R, predicted_ratings):
    # Flatten arrays and filter only rated movies
    y_true = Y[R == 1]
    y_pred = predicted_ratings[R == 1]
    return np.sqrt(mean_squared_error(y_true, y_pred))

def calculate_mae(Y, R, predicted_ratings):
    # Flatten arrays and filter only rated movies
    y_true = Y[R == 1]
    y_pred = predicted_ratings[R == 1]
    return mean_absolute_error(y_true, y_pred)


## Calculations

In [20]:
%%time
# Load data
movie_ids_file = 'movie_ids.txt'
movie_names = load_movie_list(movie_ids_file)
movies_file = 'movies.mat'
data = loadmat(movies_file)
Y, R = data['Y'], data['R']

# Normalize ratings
Ynorm, Ymean = normalize_ratings(Y, R)

# Matrix Factorization
num_users, num_movies = Y.shape[1], Y.shape[0]
num_features = 10  # Number of latent features
X, Theta = gradient_descent(Ynorm, R, num_users, num_movies, num_features)

# Predict ratings (predict_ratings already clips them to the 1-5 range)
predicted_ratings = predict_ratings(X, Theta, Ymean)

# Calculate RMSE on the same ratings used for training (no held-out set)
rmse = calculate_rmse(Y, R, predicted_ratings)
print(f"RMSE of the predicted ratings: {rmse}")

# Calculate MAE on the same training ratings
mae = calculate_mae(Y, R, predicted_ratings)
print(f"MAE of the predicted ratings: {mae}")


Gradient descent calculations:
Iteration 0: cost = 381999.1725681445
Iteration 100: cost = 30500.567694106834
Iteration 200: cost = 26905.666671550684
Iteration 300: cost = 25706.676667300126
Iteration 400: cost = 25093.64781395749
Iteration 500: cost = 24720.509591853224
Iteration 600: cost = 24469.312922603116
Iteration 700: cost = 24287.993452668
Iteration 800: cost = 24150.369141103856
Iteration 900: cost = 24041.7852885783
RMSE of the predicted ratings: 0.6884532964811354
MAE of the predicted ratings: 0.5295317049306661
CPU times: user 4min 23s, sys: 9.17 s, total: 4min 32s
Wall time: 41.2 s


In [21]:
# Recommendations for a user
user_id = 100
num_recommendations = 7
recommendations = make_recommendations(predicted_ratings, movie_names, user_id, num_recommendations)

# Display recommendations
print(f"\nRecommendations for user with id={user_id}:")
for position, movie_name, rating in recommendations:
    print(f"ID {position}: {movie_name}, Predicted Rating: {rating:.2f}")

# General movie recommendations based on a specific movie
movie_name_input = "As Good As It Gets (1997)"
movie_recommendations = predict_movies(movie_name_input, movie_names, Y, R, num_recommendations+1)
print(f"\nRecommendations based on the movie '{movie_name_input}':")
for idx, row in movie_recommendations.iloc[1:].iterrows():
    print(f"Movie: {idx}, Correlation: {row['correlation']:.2f}, Number of ratings: {row['number of ratings']:.0f}")



Recommendations for user with id=100:
ID 893: For Richer or Poorer (1997), Predicted Rating: 5.00
ID 793: Crooklyn (1994), Predicted Rating: 5.00
ID 1001: Stupids, The (1996), Predicted Rating: 5.00
ID 982: Maximum Risk (1996), Predicted Rating: 5.00
ID 1293: Star Kid (1997), Predicted Rating: 5.00
ID 372: Jeffrey (1995), Predicted Rating: 5.00
ID 721: Mallrats (1995), Predicted Rating: 5.00

Recommendations based on the movie 'As Good As It Gets (1997)':
Movie: Apt Pupil (1998), Correlation: 0.59, Number of ratings: 160
Movie: Good Will Hunting (1997), Correlation: 0.50, Number of ratings: 198
Movie: Wag the Dog (1997), Correlation: 0.42, Number of ratings: 137
Movie: Titanic (1997), Correlation: 0.34, Number of ratings: 350
Movie: Tomorrow Never Dies (1997), Correlation: 0.32, Number of ratings: 180
Movie: Amistad (1997), Correlation: 0.30, Number of ratings: 124
Movie: L.A. Confidential (1997), Correlation: 0.29, Number of ratings: 297


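## Summary

* On the ratings it was trained on, the custom recommendation system reaches an RMSE of 0.688 and an MAE of 0.530, lower than the cross-validated scores of the scikit-surprise algorithms (SVD, SVDpp, and NMF). These numbers are not directly comparable: the custom scores are training error, while the scikit-surprise scores come from held-out folds, so the gap does not by itself indicate higher prediction accuracy.

* In the run shown above, training took 41.2 s of wall time (4 min 32 s of CPU time) for 1,000 gradient-descent iterations over the full rating matrix. The cross-validation logs show mean fit times of about 0.41 s per fold for SVD and 6.35 s per fold for SVDpp, so the custom system was slower to train than both.

* The recommendations from the custom system also differ from those generated by the scikit-surprise algorithms, reflecting different underlying methods of predicting user-item ratings. All seven top recommendations for user 100 are clipped at the maximum rating of 5.00, so the ranking among them carries little information.

### Overall

The custom system shows promise, with low training RMSE and MAE, but it needs a held-out evaluation before its accuracy can be compared with scikit-surprise's SVD, and it was slower to train in this run. scikit-surprise also offers established algorithms and the flexibility to explore different options.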
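
The RMSE and MAE of the custom system are computed on the same ratings used for training, because `calculate_rmse` and `calculate_mae` receive the full `R`. A held-out split would give scores that are closer in spirit to the cross-validated ones from Part 1. The sketch below shows one possible way to build such a split. The helpers `split_ratings_mask` and `masked_rmse` are hypothetical, and the small random `Y` and `R` are stand-ins for the matrices loaded from `movies.mat`.

```python
import numpy as np

def split_ratings_mask(R, test_fraction=0.2, seed=0):
    """Split the observed entries of R into train and test indicator masks."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(R)
    n_test = int(round(test_fraction * rows.size))
    test_idx = rng.choice(rows.size, size=n_test, replace=False)
    R_test = np.zeros_like(R)
    R_test[rows[test_idx], cols[test_idx]] = 1
    R_train = R - R_test  # every observed rating lands in exactly one mask
    return R_train, R_test

def masked_rmse(Y, mask, predicted):
    """RMSE restricted to the entries where mask == 1."""
    diff = (predicted - Y)[mask == 1]
    return float(np.sqrt(np.mean(diff ** 2)))

# Tiny synthetic stand-in for the (movies x users) matrices Y and R
rng = np.random.default_rng(42)
R = (rng.random((6, 8)) < 0.6).astype(int)
Y = rng.integers(1, 6, size=R.shape) * R

R_train, R_test = split_ratings_mask(R, test_fraction=0.25, seed=1)

# A model would be trained on (Y * R_train, R_train) only; here a global-mean
# baseline stands in for the trained model to keep the sketch short.
baseline = np.full(Y.shape, Y[R_train == 1].mean())
print("held-out RMSE of the global-mean baseline:", masked_rmse(Y, R_test, baseline))
```

With the real data, the same masks could be passed to `normalize_ratings` and `gradient_descent` in place of `R`, and `masked_rmse` evaluated with `R_test`, so that the reported error reflects ratings the model has not seen.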