# MovieLens 1M - Exploratory Data Analysis (EDA)

This notebook covers the initial exploration of the MovieLens 1M dataset. We will load the data, check for data quality issues, and visualize key distributions (ratings, genres, user activity) to understand the dataset before modeling.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Set style for plots
sns.set_style("whitegrid")
%matplotlib inline

## 1. Load Data

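MovieLens 1M ships as `.dat` files that use `::` as the field separator. Because the separator is longer than one character, pandas needs `engine='python'` to parse it. We also set the encoding to `latin-1` (another name for `ISO-8859-1`), because some titles contain accented characters stored in that encoding.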
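As a quick illustration of the file format, the sketch below parses two sample lines written in the `movies.dat` layout from an in-memory buffer. The names `sample` and `demo` are introduced here only for the example and are not used in the rest of the notebook.

```python
import io

import pandas as pd

# Two sample lines in the movies.dat layout: MovieID::Title::Genres
sample = io.StringIO(
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
)

# A multi-character separator is treated as a regex, which requires the
# Python parsing engine rather than the default C engine.
demo = pd.read_csv(
    sample,
    sep="::",
    header=None,
    names=["MovieID", "Title", "Genres"],
    engine="python",
)
print(demo)
```

Genres stay packed in a single `|`-separated string at this point; they are split later in the genre analysis.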

In [2]:
# Define file paths
MOVIES_FILE = "../data/ml-1m/movies.dat"
RATINGS_FILE = "../data/ml-1m/ratings.dat"
USERS_FILE = "../data/ml-1m/users.dat"

# Define column names (based on README/documentation)
movies_cols = ['MovieID', 'Title', 'Genres']
ratings_cols = ['UserID', 'MovieID', 'Rating', 'Timestamp']
users_cols = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']

# Load data
print("Loading movies...")
movies = pd.read_csv(MOVIES_FILE, sep='::', header=None, names=movies_cols, engine='python', encoding='latin-1')

print("Loading ratings...")
ratings = pd.read_csv(RATINGS_FILE, sep='::', header=None, names=ratings_cols, engine='python', encoding='latin-1')

print("Loading users...")
users = pd.read_csv(USERS_FILE, sep='::', header=None, names=users_cols, engine='python', encoding='latin-1')

print("Data loaded successfully!")

In [3]:
# Check dimensions
print(f"Movies shape: {movies.shape}")
print(f"Ratings shape: {ratings.shape}")
print(f"Users shape: {users.shape}")

In [4]:
movies.head()

In [5]:
ratings.head()

## 2. Basic Analysis & Data Quality
Check for missing values and duplicates.

In [6]:
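# Missing values per column and fully duplicated rows, for all three tables
for name, df in [("Movies", movies), ("Ratings", ratings), ("Users", users)]:
    print(f"Missing values in {name}:")
    print(df.isnull().sum())
    print(f"Duplicate rows in {name}: {df.duplicated().sum()}\n")

# A user should rate each movie at most once
print(f"Duplicate (UserID, MovieID) pairs: {ratings.duplicated(['UserID', 'MovieID']).sum()}")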

## 3. Visualizations

### 3.1 Rating Distribution
What are the most common ratings? Are users generous or critical?

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(x='Rating', data=ratings, palette='viridis')
plt.title('Distribution of Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

### 3.2 Most Rated Movies
Which movies have the highest number of ratings (Popularity)?

In [None]:
# Merge ratings with movie titles
movie_ratings = ratings.merge(movies, on='MovieID')

# Count ratings per movie
rating_counts = movie_ratings.groupby('Title').size().sort_values(ascending=False)

print("Top 10 Most Rated Movies:")
print(rating_counts.head(10))

# Plot top 20
plt.figure(figsize=(10, 8))
sns.barplot(y=rating_counts.head(20).index, x=rating_counts.head(20).values, palette='magma')
plt.title('Top 20 Most Rated Movies')
plt.xlabel('Number of Ratings')
plt.show()

### 3.3 Top Rated Movies (with Minimum Interactions)
High ratings are only meaningful if there is a sufficient number of votes. Let's find the best movies among those that have at least 100 ratings.
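To see why the threshold matters, here is a minimal sketch on made-up toy data (the `toy` frame and the threshold of 3 are assumptions for illustration, not values from the dataset). A movie with a single 5-star rating tops the raw ranking, but drops out once a minimum count is required.

```python
import pandas as pd

# Hypothetical ratings: movie 2 has one 5-star vote, movie 1 has four votes
toy = pd.DataFrame({
    "MovieID": [1, 1, 1, 1, 2, 3, 3, 3],
    "Rating":  [5, 4, 4, 5, 5, 3, 4, 3],
})

stats = toy.groupby("MovieID")["Rating"].agg(["count", "mean"])

# Ranking by mean alone rewards the movie with the fewest votes
raw_top = stats["mean"].idxmax()

# Requiring a minimum number of votes removes it from the ranking
min_votes_toy = 3
filtered = stats[stats["count"] >= min_votes_toy]
filtered_top = filtered["mean"].idxmax()

print(stats)
print(f"Raw winner: {raw_top}, filtered winner: {filtered_top}")
```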

In [None]:
# Calculate count and mean rating per movie
movie_stats = ratings.groupby('MovieID')['Rating'].agg(['count', 'mean'])
movie_stats = movie_stats.merge(movies, on='MovieID')

# Filter: at least 100 ratings
min_votes = 100
popular_movies = movie_stats[movie_stats['count'] >= min_votes]

# Sort by mean rating
top_rated = popular_movies.sort_values('mean', ascending=False)

print(f"Top 10 Highest Rated Movies (with >= {min_votes} ratings):")
print(top_rated[['Title', 'count', 'mean']].head(10))

# Plot top 10
plt.figure(figsize=(10, 6))
sns.barplot(x='mean', y='Title', data=top_rated.head(10), palette='RdYlGn')
plt.title(f'Top 10 Highest Rated Movies (>= {min_votes} ratings)')
plt.xlabel('Average Rating')
plt.xlim(3, 5) # Focus on the scale 3-5
plt.show()

### 3.4 User Activity
How many ratings does a typical user give? The distribution shows whether a small group of heavy raters accounts for a large share of the activity (the "Long Tail").
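One way to quantify a long tail is to compare the mean with the median and to measure how much of the total activity comes from the most active users. The sketch below does this on hypothetical per-user counts. Both `toy_counts` and the helper `top_share` are illustrative names, not part of the notebook.

```python
import pandas as pd

def top_share(counts: pd.Series, frac: float = 0.1) -> float:
    """Share of all ratings contributed by the most active `frac` of users."""
    n_top = max(1, int(len(counts) * frac))
    return counts.nlargest(n_top).sum() / counts.sum()

# Hypothetical number of ratings per user: one heavy rater, many light ones
toy_counts = pd.Series([500, 120, 60, 40, 30, 25, 25, 20, 20, 20])

print(f"Mean:   {toy_counts.mean():.1f}")
print(f"Median: {toy_counts.median():.1f}")
print(f"Share of ratings from the top 10% of users: {top_share(toy_counts, 0.1):.1%}")
```

A mean far above the median, together with a large top-user share, is the signature of a long-tailed distribution. The same helper could be applied to the real per-user counts once they are computed.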

In [None]:
user_counts = ratings.groupby('UserID').size()

print(f"Max ratings by one user: {user_counts.max()}")
print(f"Min ratings by one user: {user_counts.min()}")
print(f"Average ratings per user: {user_counts.mean():.2f}")

plt.figure(figsize=(10, 5))
sns.histplot(user_counts, bins=50, kde=True)
plt.title('Distribution of Number of Ratings per User')
plt.xlabel('Number of Ratings')
plt.ylabel('Count of Users')
plt.show()

### 3.5 Genres Analysis (Validation for Content-Based Filtering)

Since we decided to use TF-IDF for genres, it is crucial to understand the genre space better.
1.  **Genre Distribution**: Which genres are common and which are rare? Rare genres receive a higher IDF weight.
2.  **Genre Co-occurrence**: Which genres tend to appear together?
3.  **TF-IDF Sanity Check**: Can we find similar movies purely based on genres?
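The link between rarity and IDF weight can be made concrete with a small calculation. The sketch below uses the smoothed IDF formula that scikit-learn applies by default (`smooth_idf=True`), `idf = ln((1 + n) / (1 + df)) + 1`. The document counts are made-up numbers chosen for illustration, and `smoothed_idf` is a hypothetical helper.

```python
import math

def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

n_movies = 1000       # hypothetical catalogue size
common_genre_df = 500  # a genre attached to half of the movies
rare_genre_df = 10     # a genre attached to very few movies

print(f"IDF of common genre: {smoothed_idf(n_movies, common_genre_df):.3f}")
print(f"IDF of rare genre:   {smoothed_idf(n_movies, rare_genre_df):.3f}")
```

A genre present in every movie gets the minimum weight of 1, so it contributes little to distinguishing one movie from another.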

In [None]:
# 1. Basic Counts
genres_df = movies.copy()
genres_df['Genres'] = genres_df['Genres'].str.split('|')
genres_exploded = genres_df.explode('Genres')

genre_counts = genres_exploded['Genres'].value_counts()

plt.figure(figsize=(12, 6))
sns.barplot(x=genre_counts.index, y=genre_counts.values, palette='coolwarm')
plt.title('Distribution of Movie Genres')
plt.xticks(rotation=45)
plt.ylabel('Count')
plt.show()

In [None]:
# 2. Genre Co-occurrence Matrix (Heatmap)
# For every pair of genres, count the movies tagged with both.
# The diagonal holds each genre's total count.
genre_list = sorted(genres_exploded['Genres'].unique())
co_occurrence = pd.DataFrame(0, index=genre_list, columns=genre_list)

for genres in movies['Genres'].str.split('|'):
    for g1 in genres:
        for g2 in genres:
            co_occurrence.loc[g1, g2] += 1

plt.figure(figsize=(12, 10))
sns.heatmap(co_occurrence, cmap='YlOrRd', annot=False)
plt.title('Genre Co-occurrence Heatmap')
plt.show()
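The same co-occurrence counts can also be obtained without explicit loops, by building a multi-hot matrix and multiplying it by its transpose. The sketch below shows the idea on three sample genre strings. `toy_genres`, `multi_hot` and `co_toy` are illustrative names, and this is an alternative formulation rather than the approach used in the cell above.

```python
import pandas as pd

# Three sample genre strings in the MovieLens "|"-separated format
toy_genres = pd.Series([
    "Animation|Children's|Comedy",
    "Adventure|Children's|Fantasy",
    "Comedy|Romance",
])

# One row per movie, one 0/1 column per genre
multi_hot = toy_genres.str.get_dummies(sep="|")

# (genres x movies) times (movies x genres): entry [g1, g2] counts the
# movies tagged with both g1 and g2; the diagonal is the per-genre total
co_toy = multi_hot.T.dot(multi_hot)
print(co_toy)
```

On the full dataset this would replace the nested loops with a single matrix product, which tends to matter more as the catalogue grows.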

In [None]:
# 3. TF-IDF Validation check
print("\n--- TF-IDF Validation ---")

# Prepare data
movies['genres_str'] = movies['Genres'].str.replace('|', ' ', regex=False)

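# Fit TF-IDF. Tokens are whitespace-separated so that "Children's" stays one
# term; a \b-based pattern would split it into "children" and a spurious "s".
tfidf = TfidfVectorizer(token_pattern=r"\S+")
tfidf_matrix = tfidf.fit_transform(movies['genres_str'])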

print("Top 10 High-IDF Terms (Rarest Genres):")
indices = np.argsort(tfidf.idf_)[::-1]
features = np.array(tfidf.get_feature_names_out())
print(features[indices][:10])

# Function to get similar movies
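def get_similar_movies(title, k=5):
    # Literal (non-regex) match, so titles containing "(", "*" or "?" are safe
    matches = movies.index[movies['Title'].str.contains(title, case=False, regex=False)]
    if len(matches) == 0:
        return f"Movie '{title}' not found."
    # movies keeps its default RangeIndex, so this label is also the row position
    idx = matches[0]

    # Compute cosine similarity for this movie against all others
    sim_scores = cosine_similarity(tfidf_matrix[idx], tfidf_matrix).flatten()
    # Exclude the query movie explicitly: many movies share identical genres,
    # so the query is not guaranteed to sort last among the tied scores
    sim_scores[idx] = -1.0

    # Get top k indices, most similar first
    top_indices = sim_scores.argsort()[::-1][:k]

    input_movie = movies.iloc[idx]
    print(f"\nMovies similar to '{input_movie['Title']}' ({input_movie['Genres']}):")

    return movies.iloc[top_indices][['Title', 'Genres']]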

# Dry Run
print(get_similar_movies("Toy Story"))
print(get_similar_movies("Star Wars"))

## 4. Conclusion

Key takeaways from the EDA:
- **Rating Distribution**: Ratings are skewed toward the upper end of the scale, with 3 and 4 the most common values.
- **Top Rated**: Requiring at least 100 votes surfaces the true audience favorites and filters out niche movies whose high averages rest on only a handful of ratings.
- **Genre Analysis**: The TF-IDF sanity check suggests that genre alone retrieves reasonably similar movies.