# Consumer-Product Matrix

A consumer-product matrix is an $m \times n$ matrix $C$ in which each row represents a consumer and each column represents a product. The element $c_{ij}$ is the probability that consumer $i$ will buy or like product $j$.

To put it simply, our goal is to gain insights into the factors influencing people's purchasing decisions. We want to use this understanding to predict their future choices, even when we lack complete information.

We are assuming that certain hidden characteristics, such as age, gender, income, etc., impact consumers' buying decisions, and the decision of each consumer is only a function of these hidden features. 

Under this hypothesis, we can factor $C = AB$, where $A$ is $m \times k$, $B$ is $k \times n$, and $k$ is the number of hidden features. The matrix $A$ reflects the extent to which each hidden feature influences each consumer's choices, and the matrix $B$ gives, for each hidden feature, the probability that a consumer with that feature buys or likes each product.
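As a minimal sketch of this factorization, the snippet below builds a toy $C = AB$ from random factors. The sizes `m`, `n`, `k` and the random entries are assumptions chosen for illustration only; they are not taken from any dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 4, 5, 2  # toy sizes: consumers, products, hidden features

A = rng.random((m, k))  # row i: how strongly each hidden feature applies to consumer i
B = rng.random((k, n))  # row f: how hidden feature f relates to each product
C = A @ B               # consumer-product matrix implied by the two factors

print(C.shape)
print(np.linalg.matrix_rank(C))  # cannot exceed k, because C is a product of rank-k factors
```

The point of the example is that a matrix built this way has rank at most $k$, which is why a low-rank approximation is a natural model for $C$.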

In an ideal scenario, we'd have complete data in our large table. However, in reality, data gaps are common, and our goal is to predict missing information. This is where challenges like the Netflix challenge come into play, where we are given some ratings and tasked with predicting ratings for other movies. In online advertising, we aim to determine which ad is best for a user based on their past purchases.

In this lab, we use the best rank-$k$ approximation for movie recommendations (for more information, please refer to the related videos posted on Moodle for this week).

Instructions:

**Step 1:** Data Gathering 

**Step 2:** Data Preprocessing

**Step 3:** The best rank-$k$ approximation predicts ratings

**Step 4:** Writing a function to recommend movies for any user.

**Step 1: Data Gathering:**

1. Start by importing the necessary Python libraries, such as NumPy and pandas.
2. Next, visit the provided URL: http://grouplens.org/datasets/movielens/. Under the "recommended for education and development" section, locate and download the file named `ml-latest-small.zip` (which has a size of 1 MB).
3. After downloading, import the CSV files contained within the zip file.
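The loading step can be sketched as follows, assuming the archive has already been unzipped into a local folder. The helper name `load_movielens` and the folder path in the usage comment are hypothetical; the file names `ratings.csv` and `movies.csv` are the ones shipped in `ml-latest-small.zip` at the time of writing, so check them against your download.

```python
from pathlib import Path

import pandas as pd


def load_movielens(data_dir):
    """Read ratings.csv and movies.csv from an unzipped ml-latest-small folder.

    `data_dir` is wherever you extracted the archive (an assumption of this sketch).
    """
    data_dir = Path(data_dir)
    ratings = pd.read_csv(data_dir / "ratings.csv")  # one row per (user, movie) rating
    movies = pd.read_csv(data_dir / "movies.csv")    # one row per movie with its details
    return ratings, movies


# Example call (hypothetical path, adjust to your own folder):
# ratings, movies = load_movielens("ml-latest-small")
```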


**Step 2: Data Preprocessing:**

1. Begin by examining the first few rows of your data to familiarize yourself with its structure.
2. Transform the data so that each row represents a user. You can achieve this using the `.pivot()` function.
3. Note that `NaN` values in the dataset represent movies that a user has not rated. Common ways to handle these `NaN` values include replacing them with zero or with the average rating of the corresponding row or column. Discuss which option you think is better. Use `.fillna()`.
4. Convert this transformed table into a numerical matrix (C). 
5. Discuss whether feature normalization is necessary for this dataset.
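The preprocessing steps above can be sketched on a tiny hand-made table. The toy `ratings` frame and the names `user_movie`, `filled_zero`, and `filled_mean` are assumptions for illustration; the column names `userId`, `movieId`, and `rating` mirror the MovieLens ratings file but should be checked against your data.

```python
import pandas as pd

# Toy ratings in long format: one row per (user, movie) pair
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.0],
})

# One row per user, one column per movie; unrated movies become NaN
user_movie = ratings.pivot(index="userId", columns="movieId", values="rating")

# Option A: treat unrated movies as zero
filled_zero = user_movie.fillna(0)

# Option B: replace each user's missing ratings with that user's own mean rating
filled_mean = user_movie.apply(lambda row: row.fillna(row.mean()), axis=1)

# Numerical matrix C (here built from option A)
C = filled_zero.to_numpy()
print(C.shape)
```

Which fill is better is left open on purpose: zero treats "unrated" as "disliked", while the row mean treats it as "average for this user".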


In [42]:
# your code

**Step 3: Finding the Best Rank k:**

The best rank-$k$ approximation of $C$ is a matrix of predicted values; discuss this.

1. Use $k = 50$. Determining the optimal rank $k$ for movie recommendation is a separate problem, which could be the topic of your final project.

2. Computing the SVD might be time-consuming. If that's the case, discuss how to find a more efficient algorithm.

3. From this matrix, construct the corresponding DataFrame using `pd.DataFrame(prediction_matrix, columns=original_dataframe.columns)`. This DataFrame will contain the predicted ratings of movies by different users. Each row represents a user, and each column represents a movie, with the cells containing predicted ratings.
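One way to compute a best rank-$k$ approximation is to truncate the SVD, sketched below with `np.linalg.svd`. The helper name `best_rank_k` and the 4-by-4 toy table are assumptions; the toy uses `k=2` only because the matrix is too small for `k = 50`. The `index=` argument is an addition to the call given in the step above, so that rows stay labeled by user.

```python
import numpy as np
import pandas as pd


def best_rank_k(C, k):
    """Keep the k largest singular values of C and drop the rest."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)  # s is sorted in descending order
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


# Toy user-by-movie table standing in for the filled ratings table
original_dataframe = pd.DataFrame(
    [[5.0, 4.0, 0.0, 1.0],
     [4.0, 5.0, 1.0, 0.0],
     [0.0, 1.0, 5.0, 4.0],
     [1.0, 0.0, 4.0, 5.0]],
    index=[1, 2, 3, 4],        # user ids
    columns=[10, 20, 30, 40],  # movie ids
)
C = original_dataframe.to_numpy()

prediction_matrix = best_rank_k(C, k=2)
predictions = pd.DataFrame(
    prediction_matrix,
    columns=original_dataframe.columns,
    index=original_dataframe.index,
)
print(predictions.round(2))
```

For larger matrices, a truncated solver such as `scipy.sparse.linalg.svds` computes only the leading $k$ singular triplets; it is not used in this sketch.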


In [None]:
# your code

**Step 4: Movie Recommendations:**
1. Pick a user, retrieve that user's row in the predictions, and sort it in descending order (top-rated movies come first).

2. For the same user, retrieve their original ratings and merge this information with the movies DataFrame to gather details about the movies the user has already rated. Store this combined information in `user_full`.

3. Generate movie recommendations by merging the sorted predicted ratings with the movie details and sorting the result by predicted rating in descending order. Select the top-rated movies that the user has not rated yet, and return the specified number of recommendations.
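The three steps above can be sketched as a single function on toy data. The function name `recommend_movies`, its parameters, the `predicted_rating` column, and the toy tables are all assumptions for illustration. The sketch expects `predictions` to have user ids as its index and movie ids as its columns, and it treats "rated" as a stand-in for "seen".

```python
import pandas as pd


def recommend_movies(predictions, user_id, movies, ratings, num_recommendations=5):
    # Step 1: the user's predicted ratings, highest first
    sorted_preds = (
        predictions.loc[user_id]
        .sort_values(ascending=False)
        .rename("predicted_rating")
        .rename_axis("movieId")
        .reset_index()
    )

    # Step 2: the user's original ratings joined with movie details
    user_ratings = ratings[ratings["userId"] == user_id]
    user_full = user_ratings.merge(movies, on="movieId", how="left").sort_values(
        "rating", ascending=False
    )

    # Step 3: movies the user has not rated, ranked by predicted rating
    unseen = movies[~movies["movieId"].isin(user_full["movieId"])]
    recommendations = (
        unseen.merge(sorted_preds, on="movieId", how="left")
        .sort_values("predicted_rating", ascending=False)
        .head(num_recommendations)
    )
    return user_full, recommendations


# Toy inputs (made up for this example)
movies = pd.DataFrame({"movieId": [10, 20, 30, 40], "title": ["A", "B", "C", "D"]})
ratings = pd.DataFrame({"userId": [1, 1], "movieId": [10, 20], "rating": [5.0, 4.0]})
predictions = pd.DataFrame([[4.8, 4.1, 0.7, 2.9]], index=[1], columns=[10, 20, 30, 40])

user_full, recs = recommend_movies(predictions, 1, movies, ratings, num_recommendations=2)
print(recs[["title", "predicted_rating"]])
```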





In [None]:
# your code!

__Step 5:__ Can you write a Python function that computes these recommendations for an arbitrary user?

In [1]:
# your code!

Well Done! You are done with this lab too!

__Note on Normalization:__

Normalization is a method used in statistics, data analysis, and machine learning to rescale or transform data so that values can be compared meaningfully. The specific techniques and purposes vary, but the general goal is to bring data to a common range or distribution.

In machine learning, it's common to normalize features so that they have similar scales. This can improve the performance of many algorithms, especially those trained with gradient descent, which may converge faster and more reliably on normalized data.


Common methods of normalization include:

- **Min-Max Scaling:** This method scales the data to a specific range, often between 0 and 1. The formula for Min-Max scaling is `(x - min(x)) / (max(x) - min(x))`.

- **Z-Score Standardization:** This method rescales the data to have a mean of 0 and a standard deviation of 1. It is often simply called standardization. The formula for Z-Score standardization is `(x - mean(x)) / std(x)`.

- **Log Transformation:** Taking the logarithm of data can be a form of normalization, especially when dealing with skewed or exponentially distributed data.

- **Box-Cox Transformation:** This is a family of power transformations that can stabilize variance and make data closer to a normal distribution.

- **Robust Scaling:** This method scales data using the median and interquartile range to handle outliers better.

The choice of normalization method depends on the specific context and data distribution. Normalization can be a crucial step in data preprocessing to ensure that data is suitable for analysis or machine learning models.
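As a small illustration of three of the methods listed above, the snippet below applies them to a toy array with one outlier. The array `x` is made up for this example, and `x.std()` uses NumPy's default population standard deviation (`ddof=0`), which is an assumption about which definition of `std` is intended.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])  # toy data; 10.0 acts as an outlier

# Min-Max scaling: maps the smallest value to 0 and the largest to 1
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: mean 0, standard deviation 1
z_score = (x - x.mean()) / x.std()

# Robust scaling: center on the median, scale by the interquartile range
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

print(min_max)
print(z_score)
print(robust)
```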

References:

1. https://web.stanford.edu/class/cs168/l/l9.pdf

2. https://courses.cs.washington.edu/courses/cse521/16sp/521-lecture-9.pdf

3. https://beckernick.github.io/datascience/