# MovieLens Data Analysis Project

This project analyzes the MovieLens dataset provided by GroupLens Research. The dataset contains movie ratings collected from MovieLens users in the late 1990s and early 2000s.

## Dataset Description
The dataset includes:
- Movie ratings
- Movie metadata (genres and release year)
- User demographic information:
  - Age
  - Zip code
  - Gender
  - Occupation
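
The raw `.dat` files use `::` as the field separator and have no header row. The following is a minimal sketch of how such a file parses, using a small in-memory sample instead of the real file. The two rows and the name `sample_movies` are illustrative only.

```python
import io
import pandas as pd

# Illustrative sample in the "::"-delimited layout of movies.dat (not the real file)
sample = io.StringIO(
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
)

# A multi-character separator requires the Python parsing engine
sample_movies = pd.read_csv(
    sample,
    sep="::",
    header=None,
    names=["movie_id", "title", "genre"],
    engine="python",
)
print(sample_movies)
```

Passing `engine="python"` explicitly avoids the warning pandas emits when it has to fall back from the C parser for a multi-character separator.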

In [None]:
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Movies data - Holds the movie metadata, that is, the movie id, title, and genre
# Load the dataset
mnames = ["movie_id", "title", "genre"]
movies = pd.read_table("movies.dat", sep="::", header=None, names=mnames, engine='python') 
movies.head()

# Check for missing values
print("Missing values in each column:")
print(movies.isna().sum())

# Check for duplicates
print("\nNumber of duplicate rows:", movies.duplicated().sum())

# Count the number of unique movies
unique_movie_count = movies["movie_id"].nunique()
print("\nNumber of unique movies:", unique_movie_count)

The same loading and checking steps are needed for two more data files, so the function below wraps them to avoid repeating code.

In [None]:
def load_data(path, col_names, id_col=None):
    """
    Load a "::"-delimited file, print the first few rows, and report
    missing values, duplicate rows, and (optionally) the unique ID count.
    """
    data = pd.read_table(path, sep="::", header=None, names=col_names, engine="python")
    print(data.head())
    print("\nMissing values in each column:")
    print(data.isna().sum())
    print("\nNumber of duplicate rows:", data.duplicated().sum())
    
    if id_col:  
        print(f"\nNumber of unique {id_col}s:", data[id_col].nunique())
    else:
        print("\nUnique ID column not provided, skipping uniqueness check.")
        
    return data

# 1) Movies data - Holds the movie metadata, that is, the movie id, title, and genre
mnames = ["movie_id", "title", "genre"]    
movies = load_data("movies.dat", mnames, "movie_id")


In [None]:
# 2) Users data - contains 5 columns: the user id, gender, age, occupation, and the zip code
unames = ["user_id", "gender", "age", "occupation", "zip"]
users = load_data("users.dat", unames, "user_id")

- The data has no missing values or duplicate rows
- There are 6,040 distinct users
- Age and occupation are coded as integers

In [None]:
# 3) Ratings - contains the user id, movie id, rating given, and a timestamp
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = load_data("ratings.dat", rnames)

In [None]:
# To properly analyze these data sets, join them into a single data set

df = users.merge(ratings, on="user_id").merge(movies, on="movie_id")
print(df.head())
df.count()
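
A join like the one above can silently duplicate or drop rows if a key that should be unique is not. The sketch below shows one way to guard against that with the `validate` argument of `merge`, on made-up toy tables (`toy_users`, `toy_ratings`, and `toy_movies` are hypothetical stand-ins, not the real data).

```python
import pandas as pd

# Toy stand-ins for the users, ratings, and movies tables (values are made up)
toy_users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 10],
    "rating": [5, 3, 4],
})
toy_movies = pd.DataFrame({"movie_id": [10, 20], "title": ["A", "B"]})

# validate= raises MergeError if a key that should be unique is duplicated
merged = (
    toy_users
    .merge(toy_ratings, on="user_id", validate="one_to_many")
    .merge(toy_movies, on="movie_id", validate="many_to_one")
)

# With inner joins and fully matching keys, every rating should survive
assert len(merged) == len(toy_ratings)
print(merged)
```

Applied to the real tables, the same row-count comparison against `ratings` would show whether any ratings were lost in the join.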

### Analysis Overview

This notebook contains three main analytical sections:

#### 1. Exploratory Data Analysis
Examines the basic distributions in the dataset including:
- Movie rating patterns
- User age distribution
- User occupation distribution

#### 2. Movie Rating Analysis by Gender
Investigates gender-based rating patterns including:
- Average ratings per movie by gender
- Most frequently rated movies by female and male users
- Analysis of rating disagreements between genders, identifying movies with the largest rating disparities

#### 3. Genre Analysis
Explores patterns and trends across different movie genres, providing insights into genre preferences and rating behaviors.

#### 1. Exploratory Analysis

In [None]:
# Distribution of ratings
df["rating"].unique()     # The ratings are from 1 to 5

rating_counts = df["rating"].value_counts().sort_index()

# Bar chart
plt.figure(figsize=(8, 6))
plt.bar(rating_counts.index, rating_counts.values, color="skyblue", edgecolor="black")

# Add titles and labels
plt.title("Rating Distribution", fontsize=14)
plt.xlabel("Rating", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks(rating_counts.index)  
plt.grid(axis="y", linestyle="--", alpha=0.6)

plt.tight_layout()
plt.show()

# The plot shows that 4 is the most frequently given rating
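
Raw counts can be hard to compare at a glance, so it may help to look at each rating's share of the total as well. A minimal sketch on made-up ratings (the values in `toy_ratings` are hypothetical):

```python
import pandas as pd

# Made-up ratings standing in for df["rating"]
toy_ratings = pd.Series([4, 4, 5, 3, 4, 1, 5, 3], name="rating")

# normalize=True returns proportions instead of counts
rating_share = toy_ratings.value_counts(normalize=True).sort_index()
print(rating_share)

# The most common rating is the one with the largest share
print("Most common rating:", rating_share.idxmax())
```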

In [None]:
# Age Distribution of the users
df["age"].unique()

# First replace the encoded age values with their respective age groups
age_mapping = {
    1: "Under 18", 18: "18-24", 25: "25-34",
    35: "35-44", 45: "45-49", 50: "50-55",
    56: "56+"
}

df["age_group"] = df["age"].map(age_mapping)
df.columns

# Since the age groups are categorical, we use a bar chart to display the age distribution of the users

age_group_counts = df["age_group"].value_counts()

plt.figure(figsize=(8, 6))
age_group_counts.loc[["Under 18", "18-24", "25-34", "35-44", "45-49", "50-55", "56+"]].plot(
    kind="bar", color="skyblue", edgecolor="black")

plt.title("Age Distribution by Group", fontsize=14)
plt.xlabel("Age Group", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks(rotation=45, fontsize=10)
plt.grid(axis="y", linestyle="--", alpha=0.6)

plt.tight_layout()
plt.show()

# Users aged 25-34 contributed the most ratings.
# Note: df has one row per rating, so this chart counts ratings, not distinct users.
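
Because the merged table has one row per rating, counting rows per age group measures rating activity. To count distinct users instead, one option is `nunique` on the user ID. The sketch below contrasts the two on a made-up table (`toy` is hypothetical):

```python
import pandas as pd

# Toy merged table: one row per rating (values are made up)
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3],
    "age_group": ["25-34", "25-34", "25-34", "18-24", "18-24"],
})

# Rows per group: how many ratings each age group contributed
ratings_per_group = toy["age_group"].value_counts()

# Distinct users per group: how many people are in each age group
users_per_group = toy.groupby("age_group")["user_id"].nunique()

print(ratings_per_group)
print(users_per_group)
```

In this toy data the 25-34 group has the most ratings but the fewest users, which is why the two counts should not be read interchangeably.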

In [None]:
# Occupation Distribution

occupation_mapping = {
    0: "other or not specified", 1: "academic/educator", 2: "artist", 3: "clerical/admin",
    4: "college/grad student", 5: "customer service", 6: "doctor/health care",  7: "executive/managerial", 
     8: "farmer",  9: "homemaker", 10: "K-12 student", 11: "lawyer", 12: "programmer", 13: "retired",
    14: "sales/marketing", 15: "scientist", 16: "self-employed", 17: "technician/engineer", 18: "tradesman/craftsman",
    19: "unemployed", 20:  "writer"
}

occupation_mapping

# Replace the encoded occupation values with their respective descriptions

df["occupation_description"] = df["occupation"].map(occupation_mapping)
df.columns


occupation_counts = df["occupation_description"].value_counts()
plt.figure(figsize=(12, 8))
plt.bar(occupation_counts.index, occupation_counts.values, color="skyblue", edgecolor="black")

plt.title("Occupation Distribution", fontsize=14)
plt.xlabel("Occupation", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks(rotation=45, ha="right", fontsize=10)
plt.grid(axis="y", linestyle="--", alpha=0.6)

plt.tight_layout()
plt.show()


# College/grad students were the most active in rating movies while farmers were the least active.

#### 2. Movie Analysis

In [None]:
# Which movies received the most ratings?
most_rated = df.groupby("title")["rating"].count().sort_values(ascending=False).head(10)
print("Most rated movies")
print(most_rated)


# Which movies had the highest average ratings?
avg_ratings = df.groupby("title")["rating"].mean().sort_values(ascending=False)
print("\n Highest average movie ratings")
print("\n", avg_ratings)
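
An unfiltered average can be dominated by movies with only a handful of ratings. One common remedy is to require a minimum number of ratings before ranking. The sketch below uses made-up data, and the threshold `min_ratings` is a hypothetical value chosen for the toy example.

```python
import pandas as pd

# Made-up ratings: movie B has a perfect score from a single rating
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "rating": [4, 5, 3, 5, 2, 4],
})

# Compute the mean and the number of ratings in one pass
stats = toy.groupby("title")["rating"].agg(["mean", "count"])

# Hypothetical threshold; keep only movies with enough ratings
min_ratings = 2
reliable = stats[stats["count"] >= min_ratings].sort_values("mean", ascending=False)
print(reliable)
```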


In [None]:
# Movie Ratings By Gender

# Average movie ratings for each film grouped by gender
avg_ratings_gender = df.pivot_table("rating", index="title", columns="gender", aggfunc="mean")
print("\n Average Ratings by gender")
print("\n", avg_ratings_gender)


# Which movies are rated highest by female and male viewers, respectively?
# First count how many ratings each movie received overall
ratings_by_title = df.groupby("title").size()
print("\n Number of ratings per movie")
print(ratings_by_title)

# Filter movies that received at least 250 ratings
active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles

# Keep only the active titles
avg_ratings_gender = avg_ratings_gender.loc[active_titles]
avg_ratings_gender

# To see top films among females, sort the F column in descending order
top_female_films = avg_ratings_gender.sort_values("F", ascending=False)
print(top_female_films)

"""
Close Shave, Wrong Trousers, and Sunset Blvd are the three highest-rated movies among female viewers

""" 

# Top films among males
top_male_films = avg_ratings_gender.sort_values("M", ascending=False)
print(top_male_films)

"""
Godfather, Seven Samurai, and Shawshank Redemption are the three highest-rated movies among male viewers

"""

In [None]:
# Measuring Rating Disagreement
# Which were the most divisive movies among female and male viewers? Which movies were highly 
# rated by females but received low ratings among males?

avg_ratings_gender["diff"] = avg_ratings_gender["F"] - avg_ratings_gender["M"]

# Sort by diff to see which movies had the largest disparities.
# Films such as Dirty Dancing, Jumpin' Jack Flash, and Grease top this list:
# they were rated higher by females than by males.
sort_by_diff = avg_ratings_gender.sort_values("diff", ascending=False)
sort_by_diff

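Sorting by `diff` in descending order surfaces only the movies that females preferred. To see disagreement in either direction, one option is to rank by the absolute difference. A minimal sketch on made-up averages (the table `toy` and its titles are hypothetical):

```python
import pandas as pd

# Made-up per-gender average ratings
toy = pd.DataFrame(
    {"F": [4.5, 3.0, 4.0], "M": [3.5, 4.2, 4.1]},
    index=["Movie X", "Movie Y", "Movie Z"],
)

# Positive diff: rated higher by females; negative diff: rated higher by males
toy["diff"] = toy["F"] - toy["M"]

# Rank by the size of the gap, regardless of its direction
order = toy["diff"].abs().sort_values(ascending=False).index
by_abs_diff = toy.reindex(order)
print(by_abs_diff)

# Sorting ascending instead puts the movies preferred by males first
male_preferred = toy.sort_values("diff").index[0]
print("Most male-preferred:", male_preferred)
```
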
In [None]:
# Movies that elicited the most disagreement among viewers, independent of gender:
# these are the movies with the highest standard deviation of ratings

# Standard deviation of ratings per movie
rating_std_by_title = df.groupby("title")["rating"].std()

# Filter down to the active titles
rating_std_by_title = rating_std_by_title.loc[active_titles]
rating_std_by_title.sort_values(ascending=False)[:10]
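
Standard deviation captures disagreement that the mean hides. The sketch below uses two made-up movies, `P` and `Q`, that share the same average rating but differ in how split the viewers are.

```python
import pandas as pd

# Made-up ratings: P splits viewers between 1 and 5, Q gets a 3 from everyone
toy = pd.DataFrame({
    "title": ["P", "P", "P", "P", "Q", "Q", "Q", "Q"],
    "rating": [1, 5, 1, 5, 3, 3, 3, 3],
})

# Same mean for both movies, very different spread
spread = toy.groupby("title")["rating"].agg(["mean", "std"])
print(spread)
```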

#### 3. Genre Analysis

In [None]:
# A movie can belong to several genres, separated by "|" characters

df["genres"] = df["genre"].str.split("|")    # Turns genre column into lists of genres for each movie
exploded_df = df.explode("genres")
exploded_df.head()

# What are the popular movie genres?
genre_counts = exploded_df["genres"].value_counts()

# Bar chart visualizing genre popularity
plt.figure(figsize=(8, 6))
genre_counts.head(10).plot(kind="bar", color="skyblue")
plt.title("Top 10 Most Common Genres")
plt.xlabel("Genre")
plt.ylabel("Count")
plt.xticks(rotation=45, ha="right", fontsize=10)
plt.grid(axis="y", linestyle="--", alpha=0.6)
plt.show()

# Comedy and Drama were the most popular genres
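
After `explode`, each rating appears once per genre of its movie, so the counts above measure ratings per genre and not the number of movies per genre. A minimal sketch of the difference on made-up data (the table `toy` is hypothetical):

```python
import pandas as pd

# Made-up merged rows: movie 1 has two genres and two ratings, movie 2 has one of each
toy = pd.DataFrame({
    "movie_id": [1, 1, 2],
    "genre": ["Comedy|Drama", "Comedy|Drama", "Drama"],
    "rating": [4, 5, 3],
})

# Split the genre string into a list, then emit one row per (rating, genre) pair
toy["genres"] = toy["genre"].str.split("|")
exploded = toy.explode("genres")

# Ratings per genre versus distinct movies per genre
ratings_per_genre = exploded["genres"].value_counts()
movies_per_genre = exploded.groupby("genres")["movie_id"].nunique()

print(ratings_per_genre)
print(movies_per_genre)
```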


In [None]:
# What is the average rating for each genre?
avg_rating_by_genre = exploded_df.groupby("genres")["rating"].mean().sort_values(ascending=False)
avg_rating_by_genre

"""
Film-Noir had the highest average rating, whereas Horror had the lowest

""" 

# Which genres do females and males rate most highly?
gender_genre_ratings = exploded_df.pivot_table("rating", index="genres", columns="gender", aggfunc="mean")


# Most active genres
genre_counts = exploded_df.groupby("genres").size()
genre_counts

# Filter genres that received at least 30,000 ratings
active_genres = genre_counts.index[genre_counts >= 30000]
active_genres

# Filter the top genres
gender_genre_ratings = gender_genre_ratings.loc[active_genres]
gender_genre_ratings


# Most preferred genre by females
female_genre = gender_genre_ratings.sort_values("F", ascending=False)
female_genre

"""
War, Musical, and Drama are the highest-rated genres among females
"""

# Most preferred genres by males
male_genre = gender_genre_ratings.sort_values("M", ascending=False)
male_genre

"""
War, Drama, and Crime are the highest-rated genres among males
"""

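The two rankings above overlap heavily, so the per-genre gap between the two columns may be more informative than either ranking alone. A minimal sketch on made-up ratings (the table `toy` and its values are hypothetical):

```python
import pandas as pd

# Made-up exploded rows: one rating per (genre, gender) pair
toy = pd.DataFrame({
    "genres": ["War", "War", "Musical", "Musical", "Crime", "Crime"],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "rating": [4, 4, 5, 3, 3, 4],
})

# Mean rating per genre and gender, then the female-minus-male gap
by_gender = toy.pivot_table("rating", index="genres", columns="gender", aggfunc="mean")
by_gender["diff"] = by_gender["F"] - by_gender["M"]

print(by_gender.sort_values("diff", ascending=False))
```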