**Shakhyry Walsh**  
IS 362: Data Acquisition and Management  
Professor Larry Cohen  
10/12/2025


In [7]:
import pandas as pd
import numpy as np


**Ratings by Friends** 


In [26]:
df = pd.read_csv("movie_reviews.csv", index_col="friends")
df


friends,Sinners,Fantastic 4,Superman,Final Destination,Wicked,Weapons
Shadera,5.0,4.0,3.0,,5.0,5.0
Juan,3.0,2.0,4.0,1.0,3.0,
Quana,4.0,5.0,,3.0,4.0,4.0
Marc,2.0,3.0,2.0,2.0,1.0,2.0
Trent,5.0,4.0,5.0,4.0,2.0,
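
For readers who do not have `movie_reviews.csv`, the table above can be rebuilt in memory. This is a minimal sketch that assumes the displayed values are the complete file contents; the name `df_inline` is illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Ratings copied from the table above; np.nan marks a movie the friend did not rate.
ratings = {
    "Sinners":           [5, 3, 4, 2, 5],
    "Fantastic 4":       [4, 2, 5, 3, 4],
    "Superman":          [3, 4, np.nan, 2, 5],
    "Final Destination": [np.nan, 1, 3, 2, 4],
    "Wicked":            [5, 3, 4, 1, 2],
    "Weapons":           [5, np.nan, 4, 2, np.nan],
}
friends = pd.Index(["Shadera", "Juan", "Quana", "Marc", "Trent"], name="friends")

# astype(float) matches the float values shown in the table above
# (df_inline is a hypothetical name used only for this sketch).
df_inline = pd.DataFrame(ratings, index=friends).astype(float)
print(df_inline)
```

Substituting `df_inline` for `df` should let the later cells run without the CSV file.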


**Average Ratings for Each Viewer and Each Movie** 


In [35]:
avg_per_viewer = df.mean(axis=1, skipna=True)
print("Average rating per viewer")
print(avg_per_viewer)

avg_per_movie = df.mean(axis=0, skipna=True)
print("\nAverage rating per movie")
print(avg_per_movie)


Average rating per viewer
friends
Shadera    4.4
Juan       2.6
Quana      4.0
Marc       2.0
Trent      4.0
dtype: float64

Average rating per movie
Sinners              3.800000
Fantastic 4          3.600000
Superman             3.500000
Final Destination    2.500000
Wicked               3.000000
Weapons              3.666667
dtype: float64


**Normalized Ratings for Each Viewer** 


In [34]:
def normalize_viewer_minmax(row):
    # Min-max scale one viewer's ratings to [0, 1]; unrated movies stay NaN.
    valid = row.dropna()
    if valid.empty:
        return row
    min_v = valid.min()
    max_v = valid.max()
    if max_v == min_v:
        # All ratings identical: there is no spread to scale, so use the midpoint 0.5.
        return row.where(row.isna(), 0.5)
    # NaN propagates through the arithmetic, so no per-element check is needed.
    return (row - min_v) / (max_v - min_v)

df_norm = df.apply(normalize_viewer_minmax, axis=1)
df_norm


friends,Sinners,Fantastic 4,Superman,Final Destination,Wicked,Weapons
Shadera,1.0,0.5,0.0,,1.0,1.0
Juan,0.666667,0.333333,1.0,0.0,0.666667,
Quana,0.5,1.0,,0.0,0.5,0.5
Marc,0.5,1.0,0.5,0.5,0.0,0.5
Trent,1.0,0.666667,1.0,0.666667,0.0,


**Average Normalized Rating per Viewer and Movie** 


In [33]:
avg_norm_viewer = df_norm.mean(axis=1, skipna=True)
avg_norm_movie = df_norm.mean(axis=0, skipna=True)

print("Average normalized rating per viewer")
print(avg_norm_viewer)

print("\nAverage normalized rating per movie")
print(avg_norm_movie)


Average normalized rating per viewer
friends
Shadera    0.700000
Juan       0.533333
Quana      0.500000
Marc       0.500000
Trent      0.666667
dtype: float64

Average normalized rating per movie
Sinners              0.733333
Fantastic 4          0.700000
Superman             0.625000
Final Destination    0.291667
Wicked               0.433333
Weapons              0.666667
dtype: float64


**Conclusion**

Advantages of using normalized ratings per user:
Normalization evens out differences between people who use the rating scale differently. A viewer who usually gives high scores and one who usually gives low scores are both mapped onto the same 0-to-1 range, so their opinions can be compared more fairly. Recommendation systems can then focus on each person's relative preferences rather than on the raw numbers, which makes it easier to combine ratings from many users who each have their own way of scoring.
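
To make this point concrete, here is a small sketch with two made-up raters (hypothetical names and scores, not taken from the friends' data) who rank three movies in the same order but use different parts of the scale. The `minmax` helper is a simplified stand-in for `normalize_viewer_minmax` and assumes each rater has at least two distinct ratings.

```python
import pandas as pd

# Hypothetical raters: same preference order, different scoring habits.
toy = pd.DataFrame(
    {"Movie A": [5.0, 3.0], "Movie B": [4.0, 2.0], "Movie C": [3.0, 1.0]},
    index=["generous_rater", "harsh_rater"],
)

def minmax(row):
    # Simplified min-max scaling; assumes max != min for the row.
    return (row - row.min()) / (row.max() - row.min())

toy_norm = toy.apply(minmax, axis=1)
print(toy_norm)
# Both rows become 1.0, 0.5, 0.0, so the raw 2-point gap between the raters disappears.
```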

Disadvantages of normalization:
Normalizing discards how much someone liked the movies overall. For example, if one person gives all 5s and another gives all 3s, the min-max function above maps both to 0.5 for every movie, even though the first person liked everything more. It is also unreliable when someone has rated only a few movies, because a single rating then sets the minimum or the maximum. Another downside is that a normalized number (like 0.8) does not translate directly into a star rating (like 4 out of 5), which makes it harder for people to interpret.
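
The edge cases described here can be reproduced with a short sketch. The raters and scores below are hypothetical, and `minmax_or_midpoint` is an illustrative helper that is assumed to follow the same rule as `normalize_viewer_minmax` (constant rows map to 0.5).

```python
import numpy as np
import pandas as pd

def minmax_or_midpoint(row):
    # Min-max scaling with the same fallback as the notebook: constant rows map to 0.5.
    valid = row.dropna()
    if valid.empty:
        return row
    lo, hi = valid.min(), valid.max()
    if hi == lo:
        return row.where(row.isna(), 0.5)
    return (row - lo) / (hi - lo)

# Hypothetical raters chosen to show the edge cases.
toy_edge = pd.DataFrame(
    {
        "Movie A": [5.0, 3.0, 4.0],
        "Movie B": [5.0, 3.0, 5.0],
        "Movie C": [5.0, 3.0, np.nan],
    },
    index=["all_fives", "all_threes", "two_ratings"],
)

toy_edge_norm = toy_edge.apply(minmax_or_midpoint, axis=1)
print(toy_edge_norm)
```

In this sketch `all_fives` and `all_threes` end up with identical rows of 0.5, and `two_ratings` can only ever produce 0.0 and 1.0, however close the two raw scores were.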

Standardization notes:
Standardization subtracts each person's own average from their ratings and divides by their own standard deviation, giving a z-score. This shows which movies they liked more or less than usual, measured in units of their typical spread. It is useful for more advanced data analysis, but z-scores are easiest to interpret when a person's ratings are roughly bell-shaped. If someone has rated only a few movies, the mean and standard deviation are poorly estimated, so the standardized values may not be very meaningful.
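
The small-sample caveat can be seen in a short sketch. The two viewers below are hypothetical, and `zscores` is an illustrative helper rather than the notebook's own function. With only two ratings and the population standard deviation (`ddof=0`), the z-scores come out as -1 and +1 no matter how far apart the raw ratings were.

```python
import pandas as pd

def zscores(ratings, ddof=0):
    # z = (rating - viewer mean) / viewer standard deviation
    s = pd.Series(ratings, dtype=float)
    return (s - s.mean()) / s.std(ddof=ddof)

wide_gap = zscores([1, 5])    # hypothetical viewer who loved one movie and disliked the other
narrow_gap = zscores([3, 4])  # hypothetical viewer who saw little difference

print(wide_gap.tolist())
print(narrow_gap.tolist())
```

Both viewers get the same standardized values, so the size of the original gap is lost when there are this few ratings.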

**Extra Credit**  
**Standardized Ratings per Viewer** 


In [37]:
def standardize_viewer(row):
    # Convert one viewer's ratings to z-scores; unrated movies stay NaN.
    valid = row.dropna()
    if valid.empty:
        return row
    mean_v = valid.mean()
    std_v = valid.std(ddof=0)  # population std (ddof=0); use ddof=1 for the sample std
    if std_v == 0:
        # All ratings identical: every z-score is 0.
        return row.where(row.isna(), 0.0)
    # NaN propagates through the arithmetic, so no per-element check is needed.
    return (row - mean_v) / std_v

df_std = df.apply(standardize_viewer, axis=1)
df_std


friends,Sinners,Fantastic 4,Superman,Final Destination,Wicked,Weapons
Shadera,0.75,-0.5,-1.75,,0.75,0.75
Juan,0.392232,-0.588348,1.372813,-1.568929,0.392232,
Quana,0.0,1.581139,,-1.581139,0.0,0.0
Marc,0.0,1.732051,0.0,0.0,-1.732051,0.0
Trent,0.912871,0.0,0.912871,0.0,-1.825742,


**Averages on Standardized Data** 


In [39]:
avg_std_viewer = df_std.mean(axis=1, skipna=True)
avg_std_movie = df_std.mean(axis=0, skipna=True)

print("Average standardized rating per viewer")
print(avg_std_viewer)

print("\nAverage standardized rating per movie")
print(avg_std_movie)


Average standardized rating per viewer
friends
Shadera   -3.996803e-16
Juan      -6.661338e-17
Quana      0.000000e+00
Marc       0.000000e+00
Trent      0.000000e+00
dtype: float64

Average standardized rating per movie
Sinners              0.411021
Fantastic 4          0.444968
Superman             0.133921
Final Destination   -0.787517
Wicked              -0.483112
Weapons              0.250000
dtype: float64
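
The per-viewer averages above are zero by construction, because z-scores are centered on each viewer's own mean; values such as `-3.996803e-16` are floating-point rounding rather than real differences. A minimal sketch using Shadera's five ratings from the first table (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Shadera's five ratings from the table at the top of the notebook.
shadera = pd.Series([5.0, 4.0, 3.0, 5.0, 5.0])
z = (shadera - shadera.mean()) / shadera.std(ddof=0)

print(z.mean())                   # at or extremely close to 0
print(np.isclose(z.mean(), 0.0))  # tolerance-based comparison instead of ==
```

Only the per-movie averages carry information after standardization, so those are the ones worth comparing.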


**CSVs** 


In [40]:
df.to_csv("movie_reviews_raw.csv")
df_norm.to_csv("movie_reviews_normalized.csv")
df_std.to_csv("movie_reviews_standardized.csv")
