# Consumer-Product Matrix

A consumer-product matrix is an $m \times n$ matrix $C$ in which each row represents a consumer and each column represents a product. The entry $c_{ij}$ is the probability that consumer $i$ will buy or like product $j$.

To put it simply, our goal is to gain insights into the factors influencing people's purchasing decisions. We want to use this understanding to predict their future choices, even when we lack complete information.

We assume that certain hidden characteristics, such as age, gender, or income, drive consumers' buying decisions, and that each consumer's decision is a function of these hidden features only.

Under this hypothesis we can write $C = AB$, where $A$ is $m \times k$ and $B$ is $k \times n$ for $k$ hidden features. The matrix $A$ reflects the extent to which each hidden feature influences each consumer's choices, and the matrix $B$ gives the probability that a consumer buys or likes each product based on a specific hidden feature.
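
The factorization can be illustrated with a small sketch. The sizes and numbers below are hypothetical (3 consumers, 2 hidden features, 4 products) and are chosen only to show that $c_{ij} = \sum_{l} a_{il} b_{lj}$ and that the rank of $C$ cannot exceed the number of hidden features.

```python
import numpy as np

# Hypothetical toy sizes: 3 consumers, 2 hidden features, 4 products.
# A[i, l]: how strongly hidden feature l applies to consumer i.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])

# B[l, j]: probability that a consumer driven purely by feature l likes product j.
B = np.array([[0.9, 0.1, 0.8, 0.2],
              [0.2, 0.7, 0.1, 0.6]])

C = A @ B  # c_ij = sum over l of a_il * b_lj

print(C)
print("shape:", C.shape)
print("rank:", np.linalg.matrix_rank(C))  # bounded by the number of hidden features
```

The third consumer is an even mix of the two features, so its row of `C` is the average of the first two rows.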

In an ideal scenario, we'd have complete data in our large table. However, in reality, data gaps are common, and our goal is to predict missing information. This is where challenges like the Netflix challenge come into play, where we are given some ratings and tasked with predicting ratings for other movies. In online advertising, we aim to determine which ad is best for a user based on their past purchases (for more information, please refer to the videos posted on Moodle).

In this lab, we use Singular Value Decomposition (SVD) for Movie Recommendations.

Instructions:

**Step 1:** Data Gathering 

**Step 2:** Data Preprocessing

**Step 3:** Finding the best rank $k$ to predict ratings.

**Step 4:** Writing a function to recommend movies for any user.

**Step 1: Data Gathering:**

Start by importing the necessary Python libraries, NumPy and pandas. Next, visit http://grouplens.org/datasets/movielens/. Under the "recommended for education and development" section, locate and download the file named `ml-latest-small.zip` (about 1 MB). After downloading, unzip it and load the CSV files it contains.


In [94]:
import pandas as pd
import numpy as np

ratings = pd.read_csv('ml-latest-small/ratings.csv')
movies = pd.read_csv('ml-latest-small/movies.csv')
tags = pd.read_csv('ml-latest-small/tags.csv')

**Step 2: Data Preprocessing:**

1. Begin by examining the first few rows of your data to familiarize yourself with its structure.

In [95]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [96]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [98]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


2. Transform the data so that each row represents a user. You can achieve this using the `.pivot()` function.

In [102]:
R_df = ratings.pivot(index = 'userId', columns ='movieId', values = 'rating')
R_df.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,,,,,,,,,,,,,,,,,,,,,
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


3. Convert this transformed table into a numerical matrix $C$. `NaN` values mark movies that a user has not rated. Common ways to handle them include replacing them with zero or with the average rating of the corresponding row or column. Discuss which option you think is better. Use `.fillna()`.

In [134]:
R_df = R_df.fillna(0)
R_df.sample(5)

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,,,,,,,,,,,,,,,,,,,,,
529,3.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
359,4.0,3.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
604,3.0,5.0,0.0,0.0,3.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
206,5.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
404,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
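
To make the comparison in item 3 concrete, here is a minimal sketch of the three fill strategies on a hypothetical toy table (the values and the names `toy`, `filled_zero`, `filled_user_mean`, and `filled_movie_mean` are illustrative, not part of the lab data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are users, columns are movies, NaN = not rated.
toy = pd.DataFrame(
    [[4.0, np.nan, 2.0],
     [np.nan, 5.0, 3.0],
     [1.0, np.nan, np.nan]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)

# Option 1: treat every missing rating as 0.
filled_zero = toy.fillna(0)

# Option 2: row (user) mean, i.e. fill a user's gaps with that user's average rating.
filled_user_mean = toy.apply(lambda row: row.fillna(row.mean()), axis=1)

# Option 3: column (movie) mean; fillna with a Series aligns on the column labels.
filled_movie_mean = toy.fillna(toy.mean())

print(filled_zero, filled_user_mean, filled_movie_mean, sep="\n\n")
```

One point to weigh in the discussion: with zero-filling, an unrated movie looks like a rating below the lowest possible score, which tends to pull the reconstructed values toward zero.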


4. In machine learning, it's common to normalize features so that they have similar scales. This can improve many algorithms, for example models trained with gradient descent, which may converge faster and more reliably on normalized data. Discuss whether feature normalization is necessary for this dataset. (Read the note on normalization at the end of this notebook for more information.)


In [None]:
# your code
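
As one possible starting point for this discussion, the sketch below centers each user's ratings on that user's own mean before filling the gaps. This is an assumption about what "normalization" could mean for a ratings table, not the required answer, and the toy data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings with NaN for unrated movies.
toy = pd.DataFrame(
    [[4.0, np.nan, 2.0],
     [np.nan, 5.0, 3.0]],
    index=pd.Index([1, 2], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)

user_mean = toy.mean(axis=1)            # mean over rated movies only (NaN is skipped)
centered = toy.sub(user_mean, axis=0)   # subtract each user's mean from their row
centered_filled = centered.fillna(0)    # 0 now means "this user's average rating"

print(centered_filled)
```

If predictions are computed from `centered_filled`, the user means would need to be added back afterwards to return to the original rating scale.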

**Step 3: Finding the Best Rank k:**

The best rank-$k$ approximation of the ratings matrix is a matrix of predicted values. Discuss why this is the case.

1. First, convert the DataFrame to a NumPy matrix.


In [111]:
R = R_df.to_numpy()

2. Use $k = 50$. Determining the optimal rank $k$ for movie recommendation is a separate problem, which could be the topic of your final project.

In [112]:
# Finding SVD
#from scipy.sparse.linalg import svds
#U, sigma, Vt = svds(R, k = 50)

In [119]:
U, S, Vh = np.linalg.svd(R, full_matrices=False)
U.shape, S.shape, Vh.shape

((610, 610), (610,), (610, 9724))

In [156]:
# Rank-k approximation of R built from the top k singular values and vectors
k = 50
all_user_predicted_ratings = U[:, :k] @ np.diag(S[:k]) @ Vh[:k, :]
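
The sense in which the truncated SVD is "best" can be checked numerically. The sketch below uses a small random matrix `M` as a hypothetical stand-in for `R` and verifies the Eckart-Young property: the Frobenius error of the rank-$k$ truncation equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((6, 8))  # hypothetical stand-in for the ratings matrix R

U, S, Vh = np.linalg.svd(M, full_matrices=False)

k = 2
M_k = U[:, :k] @ np.diag(S[:k]) @ Vh[:k, :]   # rank-k truncation

frob_error = np.linalg.norm(M - M_k, "fro")
expected_error = np.sqrt(np.sum(S[k:] ** 2))  # error carried by the discarded singular values

print("rank of M_k:", np.linalg.matrix_rank(M_k))
print("errors match:", np.isclose(frob_error, expected_error))
```

No other matrix of rank at most $k$ has a smaller Frobenius error, which is why the truncation is a natural candidate for filling in the unknown entries.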

3. From this matrix, construct the corresponding DataFrame using `pd.DataFrame(prediction_matrix, columns=original_dataframe.columns)`. This DataFrame contains the predicted ratings: each row represents a user, each column represents a movie, and each cell holds a predicted rating.


In [157]:
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)
preds_df.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
0,-0.948792,-1.495639,0.136415,-0.088957,-1.585177,1.65209,-1.823802,-0.152851,0.091218,0.681468,...,-0.034792,-0.029821,-0.039762,-0.039762,-0.034792,-0.039762,-0.034792,-0.034792,-0.034792,-0.020211
1,-0.14317,-0.051057,0.08085,0.041888,0.217854,-0.114971,0.163791,0.014035,0.036675,-0.135111,...,0.01543,0.013226,0.017634,0.017634,0.01543,0.017634,0.01543,0.01543,0.01543,0.041234
2,0.074621,0.034381,-0.006975,0.004072,0.004535,0.049204,-0.026532,0.008169,-0.004093,-0.150649,...,-0.003062,-0.002625,-0.003499,-0.003499,-0.003062,-0.003499,-0.003062,-0.003062,-0.003062,0.002007
3,0.839014,-1.206717,-0.50043,0.036663,0.608137,-0.401276,0.580697,0.086293,0.143029,-1.037848,...,0.029541,0.025321,0.033761,0.033761,0.029541,0.033761,0.029541,0.029541,0.029541,-0.081122
4,0.462821,-0.0183,-0.360945,0.027521,-0.153705,0.066706,-0.265529,0.063343,-0.205985,0.229209,...,-0.004039,-0.003462,-0.004616,-0.004616,-0.004039,-0.004616,-0.004039,-0.004039,-0.004039,-0.013781
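
Step 3 fixes $k = 50$. One simple way to compare several ranks, sketched here under stated assumptions, is to hide a fraction of the known ratings, rebuild the rank-$k$ matrix from the rest, and measure the error on the hidden entries. The function name `heldout_rmse`, its parameters, and the toy matrix are hypothetical. Zeros are treated as missing, matching the zero-fill used above.

```python
import numpy as np

def heldout_rmse(R, ks, holdout_frac=0.2, seed=0):
    """Hide a fraction of the observed (non-zero) entries, refit, and score each k."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(R)                     # observed ratings (zeros = missing)
    n_hold = max(1, int(holdout_frac * len(rows)))
    pick = rng.choice(len(rows), size=n_hold, replace=False)
    hr, hc = rows[pick], cols[pick]

    R_train = R.copy()
    R_train[hr, hc] = 0                            # treat held-out ratings as missing

    U, S, Vh = np.linalg.svd(R_train, full_matrices=False)
    scores = {}
    for k in ks:
        pred = U[:, :k] @ np.diag(S[:k]) @ Vh[:k, :]
        scores[k] = float(np.sqrt(np.mean((pred[hr, hc] - R[hr, hc]) ** 2)))
    return scores

# Hypothetical toy matrix: ratings 1..5, with about half the entries set to 0 (missing).
rng = np.random.default_rng(1)
R_toy = rng.integers(1, 6, size=(20, 15)).astype(float)
R_toy[rng.random(R_toy.shape) < 0.5] = 0

scores = heldout_rmse(R_toy, ks=[1, 2, 5, 10])
print(scores)
```

On random toy data the scores carry no meaning. The point is only the procedure, which could be run on `R` with a range of candidate ranks.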


**Step 4: Movie Recommendations:**

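1. Pick a user, retrieve their row of predictions, and sort it in descending order so that the highest predicted ratings come first. Note that `.iloc` selects by row position, so `.iloc[2]` is the third row of `preds_df`.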



In [158]:
sorted_user_predictions = preds_df.iloc[2].sort_values(ascending=False)
sorted_user_predictions

movieId
293     0.289054
1214    0.279243
1200    0.276596
1275    0.271725
2288    0.221770
          ...   
1265   -0.194730
858    -0.222767
648    -0.222917
165    -0.263362
2791   -0.264709
Name: 2, Length: 9724, dtype: float64
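
Because `preds_df` was built without an index, its rows are labeled from 0 while the user IDs in `R_df` start at 1. A minimal sketch of one way to avoid off-by-one lookups is to pass the original index when building the predictions DataFrame, so that rows can be selected by `userId` with `.loc`. The toy data and the names `R_toy` and `preds_toy` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy pivot whose userId labels start at 1, as in R_df.
R_toy = pd.DataFrame(
    [[5.0, 0.0, 3.0],
     [0.0, 4.0, 0.0],
     [1.0, 0.0, 2.0]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([10, 20, 30], name="movieId"),
)

U, S, Vh = np.linalg.svd(R_toy.to_numpy(), full_matrices=False)
k = 2
pred = U[:, :k] @ np.diag(S[:k]) @ Vh[:k, :]

# Passing index= keeps the userId labels, so .loc[user_id] selects by ID.
preds_toy = pd.DataFrame(pred, index=R_toy.index, columns=R_toy.columns)

user_id = 2
by_label = preds_toy.loc[user_id]       # row for userId 2
by_position = preds_toy.iloc[user_id]   # third row, which belongs to userId 3

print(by_label.name, by_position.name)
```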

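2. For the same user, retrieve their original ratings and merge them with the `movies` DataFrame to get details about the movies the user has already rated. Store the combined information in `user_full`. Check that the `userId` matches the row chosen in the previous step: assuming user IDs run consecutively from 1, row position 2 of `preds_df` belongs to `userId` 3, whereas the cell below filters on `userId == 2`.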


In [159]:
user_data = ratings[ratings['userId'] == 2]  # ratings given by userId 2
user_full = (user_data.merge(movies, how='left', left_on='movieId', right_on='movieId')
                 .sort_values(['rating'], ascending=False))
user_full.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
28,2,131724,5.0,1445714851,The Jinx: The Life and Deaths of Robert Durst ...,Documentary
27,2,122882,5.0,1445715272,Mad Max: Fury Road (2015),Action|Adventure|Sci-Fi|Thriller
22,2,106782,5.0,1445714966,"Wolf of Wall Street, The (2013)",Comedy|Crime|Drama
18,2,89774,5.0,1445715189,Warrior (2011),Drama
9,2,60756,5.0,1445714980,Step Brothers (2008),Comedy


3. Generate movie recommendations by merging the sorted predicted ratings with movie details and sorting the result by predicted ratings in descending order. The top-rated movies that the user hasn't seen yet are selected, and the specified number of recommendations is returned.

In [155]:
recommendations = (movies[~movies['movieId'].isin(user_full['movieId'])]
         .merge(pd.DataFrame(sorted_user_predictions).reset_index(), how='left',
               left_on='movieId',
               right_on='movieId')
         .rename(columns={2: 'Predictions'})
         .sort_values('Predictions', ascending=False)
         .iloc[:10, :-1]  # keep the top 10 rows and drop the trailing 'Predictions' column
        )

recommendations

Unnamed: 0,movieId,title,genres
254,293,Léon: The Professional (a.k.a. The Professiona...,Action|Crime|Drama|Thriller
913,1214,Alien (1979),Horror|Sci-Fi
900,1200,Aliens (1986),Action|Adventure|Horror|Sci-Fi
972,1275,Highlander (1986),Action|Adventure|Fantasy
1698,2288,"Thing, The (1982)",Action|Horror|Sci-Fi|Thriller
2245,2985,RoboCop (1987),Action|Crime|Drama|Sci-Fi|Thriller
955,1258,"Shining, The (1980)",Horror
1983,2640,Superman (1978),Action|Adventure|Sci-Fi
2633,3527,Predator (1987),Action|Sci-Fi|Thriller
1902,2529,Planet of the Apes (1968),Action|Drama|Sci-Fi


__Note on Normalization:__

Normalization rescales or transforms data so that values measured on different scales can be compared and analyzed meaningfully. It is used in statistics, data analysis, and machine learning. The specific techniques and purposes vary, but the general goal is to bring data to a common range or distribution.

Normalization is often used for the following purposes:

1. **Comparing Variables:** When you have multiple variables measured in different units or with different ranges, normalization can make them directly comparable. This is especially important in multivariate analysis.

2. **Machine Learning:** In machine learning, it's common to normalize features so that they have similar scales. This can improve many algorithms, for example models trained with gradient descent, which may converge faster and more reliably on normalized data.

3. **Data Visualization:** Normalization can be helpful when you're creating visualizations or graphs. It ensures that the data is displayed accurately and that relative differences between data points are easily discernible.

Common methods of normalization include:

- **Min-Max Scaling:** This method scales the data to a specific range, often between 0 and 1. The formula for Min-Max scaling is `(x - min(x)) / (max(x) - min(x))`.

- **Z-Score Standardization:** This method rescales the data to have a mean of 0 and a standard deviation of 1. It is often simply called standardization. The formula for Z-Score standardization is `(x - mean(x)) / std(x)`.

- **Log Transformation:** Taking the logarithm of data can be a form of normalization, especially when dealing with skewed or exponentially distributed data.

- **Box-Cox Transformation:** This is a family of power transformations that can stabilize variance and make data closer to a normal distribution.

- **Robust Scaling:** This method scales data using the median and interquartile range to handle outliers better.

The choice of normalization method depends on the specific context and data distribution. Normalization can be a crucial step in data preprocessing to ensure that data is suitable for analysis or machine learning models.
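
The formulas above can be tried on a small array. The sample values are hypothetical and include one large value to show how the methods react to an outlier. Box-Cox is left out because it needs an extra library.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])  # hypothetical sample with one large value

# Min-max scaling: maps the smallest value to 0 and the largest to 1.
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: mean 0 and standard deviation 1 (population std).
z_score = (x - x.mean()) / x.std()

# Robust scaling: center on the median and divide by the interquartile range.
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

# Log transformation: only valid for strictly positive data.
log_x = np.log(x)

print(min_max, z_score, robust, log_x, sep="\n")
```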

The procedure in Step 4 can be wrapped in a Python function `recommend_movies` that generates movie recommendations for a given user from the predicted ratings (a form of collaborative filtering). Its arguments are:

1. `predictions_df` is a DataFrame containing predicted ratings for movies by different users. Each row represents a user, and each column represents a movie, with the cells containing predicted ratings.

2. `userID` is the ID of the user for whom you want to generate movie recommendations.

3. `movies_df` is a DataFrame containing information about movies, such as `movieId`, `title`, and `genres`.

4. `original_ratings_df` is a DataFrame that contains the original user ratings for movies. It's used to identify movies that the user has already rated.

5. `num_recommendations` is an optional parameter that specifies the number of movie recommendations to generate for the user. The default value is 5.

The function works as follows:

- It first locates the row of `predictions_df` that belongs to the user identified by `userID` and sorts the predictions from highest to lowest, so the top-rated movies come first.

- It then merges the user's original ratings with the movie information. The result is stored in a new DataFrame called `user_full_info`.

- Finally, it generates recommendations from the movies the user has not rated yet: it removes the already-rated movies, sorts the remaining ones by predicted rating, and returns the top `num_recommendations` of them.
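
A minimal sketch of such a function is given below. It follows the description above but is not a definitive implementation: it assumes that `predictions_df` is indexed by `userId` (not by 0-based row position) and that its columns are `movieId` values. The toy DataFrames are hypothetical.

```python
import pandas as pd

def recommend_movies(predictions_df, userID, movies_df, original_ratings_df,
                     num_recommendations=5):
    """Return (movies the user already rated, top predicted unrated movies).

    Assumes predictions_df is indexed by userId and has movieId columns.
    """
    # Predicted ratings for this user, highest first.
    sorted_user_predictions = predictions_df.loc[userID].sort_values(ascending=False)

    # Movies the user has already rated, with movie details attached.
    user_data = original_ratings_df[original_ratings_df["userId"] == userID]
    user_full_info = (user_data.merge(movies_df, how="left", on="movieId")
                               .sort_values("rating", ascending=False))

    # Predictions as a two-column table: movieId, Predictions.
    predictions = (sorted_user_predictions.rename("Predictions")
                                          .rename_axis("movieId")
                                          .reset_index())

    # Drop already-rated movies, attach predictions, keep the best ones.
    recommendations = (movies_df[~movies_df["movieId"].isin(user_full_info["movieId"])]
                       .merge(predictions, how="left", on="movieId")
                       .sort_values("Predictions", ascending=False)
                       .head(num_recommendations))
    return user_full_info, recommendations

# Hypothetical toy data for a quick check.
movies_toy = pd.DataFrame({"movieId": [10, 20, 30], "title": ["A", "B", "C"]})
ratings_toy = pd.DataFrame({"userId": [1, 1, 2],
                            "movieId": [10, 20, 30],
                            "rating": [4.0, 5.0, 3.0]})
preds_toy = pd.DataFrame([[3.9, 4.8, 2.5],
                          [1.0, 4.2, 3.1]],
                         index=pd.Index([1, 2], name="userId"),
                         columns=pd.Index([10, 20, 30], name="movieId"))

already_rated, recs = recommend_movies(preds_toy, 2, movies_toy, ratings_toy,
                                       num_recommendations=2)
print(recs[["movieId", "title", "Predictions"]])
```

With the notebook's data, the call would take the predictions DataFrame, a user ID, `movies`, and `ratings`, provided the predictions DataFrame carries the `userId` index.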

References:

1. https://web.stanford.edu/class/cs168/l/l9.pdf

2. https://courses.cs.washington.edu/courses/cse521/16sp/521-lecture-9.pdf

3. https://beckernick.github.io/datascience/