# Movie Rating Analysis with Normalization and Standardization

In this notebook, we will analyze movie ratings collected from multiple users. 
We will:

1. Load the ratings directly into a Pandas DataFrame.
2. Calculate the average ratings for each user and each movie.
3. Normalize each movie's ratings to the range 0 to 1 (min-max scaling).
4. Standardize each movie's ratings so that they have a mean of 0 and a standard deviation of 1.
5. Compare the effects of normalization and standardization and discuss their use cases.


In [2]:
import pandas as pd

# Input movie rating data directly in a dictionary
data = {
    'User': ['John', 'Grace', 'Modesto', 'Malcolm', 'Jane'],
    'Inception': [5, 4, 3, 4, 4],
    'Interstellar': [4, 3, 4, 3, 4],
    'The Matrix': [3, 5, 4, 4, 4],
    'Titanic': [2, 3, 4, 3, 4],
    'Parasite': [4, 4, 5, 4, 4],
    'The Godfather': [5, 4, 4, 5, 3]
}

# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)
print("Original Data:")
display(df)  # Use display for better formatting in Jupyter


Original Data:


,User,Inception,Interstellar,The Matrix,Titanic,Parasite,The Godfather
0,John,5,4,3,2,4,5
1,Grace,4,3,5,3,4,4
2,Modesto,3,4,4,4,5,4
3,Malcolm,4,3,4,3,4,5
4,Jane,4,4,4,4,4,3


In [4]:
# Calculate the average rating per user and per movie
movie_cols = df.columns[1:7]  # the six movie columns (excludes 'User')
df['User Average'] = df[movie_cols].mean(axis=1)
movie_avg = df[movie_cols].mean(axis=0)  # movies only, so 'User Average' is not averaged in

print("\nAverage Ratings per User:")
display(df[['User', 'User Average']])  # Display user averages

print("\nAverage Ratings per Movie:")
display(movie_avg)  # Display movie averages



Average Ratings per User:


,User,User Average
0,John,3.833333
1,Grace,3.833333
2,Modesto,4.0
3,Malcolm,3.833333
4,Jane,3.833333



Average Ratings per Movie:


Inception        4.000000
Interstellar     3.600000
The Matrix       4.000000
Titanic          3.200000
Parasite         4.200000
The Godfather    4.200000
dtype: float64

### Average Ratings

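Per-user averages show how generously each person rates, and per-movie averages show how well each movie was received overall.
Here, Modesto has the highest user average (4.0), while the other four users tie at about 3.83. Parasite and The Godfather are the best-liked movies (4.2 each), and Titanic is the lowest rated (3.2).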
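
As a small sketch of how these comparisons could be read off programmatically, the snippet below rebuilds the ratings with users as the index and looks up the extremes of each average. The names `ratings`, `most_generous_user`, `lowest_rated_movie`, and `top_movies` are illustrative and do not appear in the notebook cells.

```python
import pandas as pd

# Same ratings as in the notebook, with users as the index (illustrative layout)
ratings = pd.DataFrame(
    {
        'Inception': [5, 4, 3, 4, 4],
        'Interstellar': [4, 3, 4, 3, 4],
        'The Matrix': [3, 5, 4, 4, 4],
        'Titanic': [2, 3, 4, 3, 4],
        'Parasite': [4, 4, 5, 4, 4],
        'The Godfather': [5, 4, 4, 5, 3],
    },
    index=['John', 'Grace', 'Modesto', 'Malcolm', 'Jane'],
)

user_avg = ratings.mean(axis=1)   # one average per user (row-wise)
movie_avg = ratings.mean(axis=0)  # one average per movie (column-wise)

most_generous_user = user_avg.idxmax()   # user with the highest average
lowest_rated_movie = movie_avg.idxmin()  # movie with the lowest average

# Several movies can tie for the top spot, so collect all of them
top_movies = movie_avg[movie_avg == movie_avg.max()].index.tolist()

print(most_generous_user, lowest_rated_movie, top_movies)
```

Collecting every movie that matches the maximum, rather than calling `idxmax()` alone, avoids hiding a tie behind whichever column happens to come first.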


In [7]:
from sklearn.preprocessing import MinMaxScaler

# Normalize the ratings (scale between 0 and 1)
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(df.iloc[:, 1:7])  # Exclude 'User' column
df_normalized = pd.DataFrame(normalized_data, columns=df.columns[1:7], index=df['User'])

print("\nNormalized Data:")
display(df_normalized)



Normalized Data:


User,Inception,Interstellar,The Matrix,Titanic,Parasite,The Godfather
John,1.0,1.0,0.0,0.0,0.0,1.0
Grace,0.5,0.0,1.0,0.5,0.0,0.5
Modesto,0.0,1.0,0.5,1.0,1.0,0.5
Malcolm,0.5,0.0,0.5,0.5,0.0,1.0
Jane,0.5,1.0,0.5,1.0,0.0,0.0


### Normalization

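Min-max normalization rescales each column to the range 0 to 1: the lowest rating a movie received maps to 0 and the highest maps to 1.
Because `MinMaxScaler` works column by column, the scaling here is per movie, not per user, so on its own it does not correct for users who rate on different personal scales. It is most useful when features have very different ranges. With so few distinct ratings per movie, a single value can set the extremes: Parasite's one rating of 5 maps to 1.0, and every rating of 4 maps to 0.0.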
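
To make the arithmetic concrete, here is a minimal hand-rolled sketch of min-max scaling for a single movie column, using the formula `(x - min) / (max - min)`. The helper `min_max_scale` is a hypothetical name for illustration, not a scikit-learn function, and returning zeros for a constant column is a choice made for this sketch.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to scale, so return zeros (sketch's choice)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Ratings in user order: John, Grace, Modesto, Malcolm, Jane
inception = [5, 4, 3, 4, 4]
parasite = [4, 4, 5, 4, 4]

scaled_inception = min_max_scale(inception)
scaled_parasite = min_max_scale(parasite)

print(scaled_inception)  # [1.0, 0.5, 0.0, 0.5, 0.5]
print(scaled_parasite)   # [0.0, 0.0, 1.0, 0.0, 0.0]
```

These values match the Inception and Parasite columns of the normalized table, which confirms that the scaling is applied to each movie separately.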


In [10]:
from sklearn.preprocessing import StandardScaler

# Standardize the ratings (mean = 0, std = 1)
scaler = StandardScaler()
standardized_data = scaler.fit_transform(df.iloc[:, 1:7])  # Exclude 'User' column
df_standardized = pd.DataFrame(standardized_data, columns=df.columns[1:7], index=df['User'])

print("\nStandardized Data:")
display(df_standardized)



Standardized Data:


User,Inception,Interstellar,The Matrix,Titanic,Parasite,The Godfather
John,1.581139,0.816497,-1.581139,-1.603567,-0.5,1.069045
Grace,0.0,-1.224745,1.581139,-0.267261,-0.5,-0.267261
Modesto,-1.581139,0.816497,0.0,1.069045,2.0,-0.267261
Malcolm,0.0,-1.224745,0.0,-0.267261,-0.5,1.069045
Jane,0.0,0.816497,0.0,1.069045,-0.5,-1.603567


### Standardization

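Standardization rescales each column to a mean of 0 and a standard deviation of 1, so every value is expressed as the number of standard deviations from that movie's mean rating.
This method is useful when an algorithm expects centered, comparably scaled features, for example when the data is roughly normally distributed. It does not remove outliers, but unlike min-max scaling it does not force every other value into a narrow band when a single extreme value is present.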
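
The z-score behind the table can be sketched by hand as `(x - mean) / std`. The example below assumes the population standard deviation (dividing by n), which is what reproduces the values in the standardized table; note that pandas' `Series.std()` defaults to the sample standard deviation (dividing by n - 1) and would give slightly different numbers. The helper `standardize` is a hypothetical name for illustration.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores using the population standard deviation (divide by n)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Constant column: every value equals the mean (sketch's choice)
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Ratings in user order: John, Grace, Modesto, Malcolm, Jane
inception = [5, 4, 3, 4, 4]
z_inception = standardize(inception)

print([round(z, 6) for z in z_inception])
```

For Inception the mean is 4 and the population standard deviation is the square root of 0.4, so John's rating of 5 becomes roughly 1.581139, matching the first entry of the standardized table.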


### Conclusion

Both normalization and standardization have their advantages:

- **Normalization** is ideal when we want all values in a fixed, bounded range such as 0 to 1, which makes magnitudes easy to compare.
- **Standardization** is helpful when we need the data to have consistent statistical properties, namely a mean of 0 and a standard deviation of 1.

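In a movie recommendation context, normalization might be more suitable since it puts ratings on a common, bounded scale.
Note, however, that the scalers in this notebook operate per movie (column-wise); aligning users' ratings for fair comparison requires scaling per user (row-wise) instead.
If we are working with more complex data or want to flag unusually high or low values, standardization might be more appropriate, since a z-score directly expresses how far a value lies from the mean.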
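
One way to put users on a comparable footing is to scale along rows rather than columns. The sketch below uses per-user mean-centering, a technique swapped in here for illustration rather than something the notebook itself does: each user's own average is subtracted from their ratings, so 0 means "typical for this user". The names `ratings` and `user_centered` are illustrative.

```python
import pandas as pd

# Same ratings as in the notebook, with users as the index (illustrative layout)
ratings = pd.DataFrame(
    {
        'Inception': [5, 4, 3, 4, 4],
        'Interstellar': [4, 3, 4, 3, 4],
        'The Matrix': [3, 5, 4, 4, 4],
        'Titanic': [2, 3, 4, 3, 4],
        'Parasite': [4, 4, 5, 4, 4],
        'The Godfather': [5, 4, 4, 5, 3],
    },
    index=['John', 'Grace', 'Modesto', 'Malcolm', 'Jane'],
)

# Subtract each user's mean rating from that user's row (axis=0 aligns on the index)
user_centered = ratings.sub(ratings.mean(axis=1), axis=0)

print(user_centered.round(3))
```

After centering, every row averages to zero, so a positive value means the user liked that movie more than they usually like movies, regardless of how generous they are overall.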
