<h1 style="color:yellow; text-align:center; font-size:50px;">MovieMate</h1>
<h2 align="center" style="color:yellow;">--Movie Recommendation Engine--</h2>



<h3 style="color:cyan;"> Installation of required Python Libraries. </h3>
<i>This step is only needed the first time the notebook is run on a new system.</i>

In [1]:
!pip install pandas numpy scikit-learn matplotlib seaborn



<h4 style="color:cyan;">Code Block 1: Loading and Previewing the Dataset</h4>

In [2]:
import pandas as pd

# Load movies and ratings data
movies_df = pd.read_csv('../data/u.item', sep='|', encoding='latin-1', header=None,
                        names=['movie_id', 'movie_title', 'release_date', 'video_release_date', 
                               'IMDb_URL', 'unknown', 'Action', 'Adventure', 'Animation', 
                               'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 
                               'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 
                               'Sci-Fi', 'Thriller', 'War', 'Western'])


# Assumption: u.data sits next to u.item in ../data/. The original path,
# 'ml-100k/u.data', raised FileNotFoundError on the recorded run.
ratings_df = pd.read_csv('../data/u.data', sep='\t', encoding='latin-1', header=None,
                         names=['user_id', 'movie_id', 'rating', 'timestamp'])

# Preview the data
display(movies_df.head(20))
display(ratings_df.head(20))

<h4 style="color:orange;"> Explanation: </h4>

The above block introduces the two main datasets:

<b>Movies Dataset:</b> Contains details about movies, including their titles, release dates, genres, and IMDb links.

<b>Ratings Dataset:</b> Records how users have rated movies, complete with timestamps.

Think of these datasets as two puzzle pieces. The movies_df gives you a list of movies, and the ratings_df provides insights into what people think about them. For example, if you want to know whether "Toy Story (1995)" was a fan favorite, this dataset combination can tell you.


The head(20) method shows the first 20 rows of each DataFrame, letting us peek into the data. This is like flipping through the first few pages of a book to see if the story grabs your attention!
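To make the "two puzzle pieces" idea concrete, here is a minimal sketch of how the two tables could be combined to answer the "fan favorite" question. It uses tiny stand-in tables (sample_movies and sample_ratings are hypothetical names, and the rows are made up) with the same column names as movies_df and ratings_df, so it runs without the MovieLens files.

```python
import pandas as pd

# Tiny stand-in tables with the same column names as movies_df and ratings_df.
# The rows are made up for illustration, not taken from MovieLens.
sample_movies = pd.DataFrame({
    'movie_id': [1, 2],
    'movie_title': ['Toy Story (1995)', 'GoldenEye (1995)'],
})
sample_ratings = pd.DataFrame({
    'user_id': [10, 11, 12, 10],
    'movie_id': [1, 1, 1, 2],
    'rating': [5, 4, 3, 2],
})

# Attach each rating to its movie title via the shared movie_id key
merged = sample_ratings.merge(sample_movies, on='movie_id', how='left')

# Average rating and number of ratings per title, best-rated first
summary = (merged.groupby('movie_title')['rating']
                 .agg(mean_rating='mean', num_ratings='count')
                 .sort_values('mean_rating', ascending=False))
print(summary)
```

On the real data, the same merge and groupby would presumably be applied to ratings_df and movies_df directly. Looking at num_ratings alongside mean_rating helps avoid over-trusting a high average that comes from only a handful of ratings.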

<h4 style="color:cyan;">Code Block 2: Inspecting the Structure of the DataFrames </h4>

In [None]:
# Basic Info about datasets
print("Movies DataFrame Info:")
print(movies_df.info())

print("Ratings DataFrame Info:")
print(ratings_df.info())

<h4 style="color:orange;"> Explanation: </h4>

The above block examines the structure and metadata of both DataFrames. Key insights include:

<ul>
<li>Total number of rows and columns.</li>
<li>Data types of each column (e.g., integers, strings, floats).</li>
<li>Presence of missing values.</li>
</ul>

<b>Example Insights:</b>

<ul>
<li>The movies_df has 1,682 rows, with some missing release dates.</li>
<li>The ratings_df has 100,000 ratings from users.</li>
</ul>

Understanding your dataset's structure is essential for planning your data cleaning and analysis steps. It’s like knowing the tools in your toolbox before starting a DIY project.
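As a compact complement to info(), the sketch below collects the same structural facts (shape, data types, non-null counts) into a small table that can be sorted or filtered. The frame named sample is a made-up stand-in for movies_df, and the variable names are illustrative only.

```python
import pandas as pd

# Made-up frame standing in for movies_df; values are illustrative only
sample = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'movie_title': ['A', 'B', 'C'],
    'release_date': ['01-Jan-1995', None, '01-Jan-1996'],
})

# Number of rows and columns
n_rows, n_cols = sample.shape

# One row per column: its data type and how many entries are present
structure = pd.DataFrame({
    'dtype': sample.dtypes.astype(str),
    'non_null': sample.notna().sum(),
})

print(f"{n_rows} rows x {n_cols} columns")
print(structure)
```

A column whose non_null count is below the row count has missing values, which is the same signal info() gives in its "Non-Null Count" column.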

<h4 style="color:cyan;">Code Block 3: Counting Missing Values </h4>

In [None]:
# Check for any missing values
print("\nMissing values in Movies Data:")
print(movies_df.isnull().sum())

print("\nMissing values in Ratings Data:")
print(ratings_df.isnull().sum())

<h4 style="color:orange;"> Explanation: </h4>

Here, we use isnull().sum() to count the number of missing values in each column. This helps identify incomplete data that may need cleaning or imputation.

<b>For instance:</b>

The video_release_date column in movies_df is completely empty (1,682 missing values), so it carries no information and can be dropped.

This is like finding broken links in a chain before using it.
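Raw counts are easier to act on when expressed as a share of the rows. The sketch below, on a made-up frame (sample and its values are hypothetical), computes the fraction of missing entries per column and lists the columns that are entirely empty, which is how a column like video_release_date could be flagged automatically.

```python
import numpy as np
import pandas as pd

# Made-up frame: one partially missing column and one fully empty column
sample = pd.DataFrame({
    'movie_title': ['A', 'B', 'C', 'D'],
    'release_date': ['01-Jan-1995', None, '01-Jan-1996', '01-Jan-1997'],
    'video_release_date': [np.nan, np.nan, np.nan, np.nan],
})

# isnull() gives True/False per cell; the mean of booleans is the missing share
missing_share = sample.isnull().mean()

# Columns where every single entry is missing
empty_columns = missing_share[missing_share == 1.0].index.tolist()

print(missing_share)
print("Completely empty columns:", empty_columns)
```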

In [None]:
# Display basic statistics of the ratings
print("\nRatings statistics:")
print(ratings_df.describe())

<h4 style="color:cyan;">Code Block 4: Cleaning the Movies Dataset </h4>

In [None]:
# Check if 'video_release_date' exists before dropping
if 'video_release_date' in movies_df.columns:
    movies_df.drop(columns=['video_release_date'], inplace=True)
else:
    print("'video_release_date' column does not exist or was already dropped.")
    
# Handle missing values in 'release_date' and 'IMDb_URL'
movies_df['release_date'] = movies_df['release_date'].fillna('Unknown')
movies_df['IMDb_URL'] = movies_df['IMDb_URL'].fillna('Unknown')

print("Missing values after cleaning:")
print(movies_df.isnull().sum())

<h4 style="color:orange;"> Explanation: </h4>

This block cleans the movies_df dataset by addressing missing or irrelevant data. Here's what happens step-by-step:

<b>Dropping the 'video_release_date' Column:</b>

The code checks if the video_release_date column exists before attempting to drop it. This avoids runtime errors if the column is missing.
This column has already been identified as irrelevant (completely empty), so it is removed.
Example Thought: Imagine a column in your data that's like a ghost town—completely empty and offering no insights. Dropping it makes your dataset cleaner and more manageable.

<b>Handling Missing Values:</b>

Missing values in the release_date and IMDb_URL columns are replaced with the string 'Unknown'.
This leaves no null entries in those columns, so later steps do not have to special-case NaN. The trade-off is that release_date stays a string column, and any later date handling must skip the 'Unknown' placeholder.
Why is this important? For example, if a movie’s release date is missing, marking it as 'Unknown' helps you recognize it explicitly rather than dealing with cryptic NaN values.

<b>Checking for Remaining Missing Values:</b>

After cleaning, the code prints the count of missing values in each column to confirm that the data is ready for analysis.
This step is like tidying up your workspace before starting a project—it eliminates unnecessary distractions (empty columns) and ensures all tools (data fields) are in their place.
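For comparison, here is a sketch of a different way to handle missing release dates. It is not what this notebook does: instead of writing an 'Unknown' placeholder, it parses the dates with pd.to_datetime and lets unparseable or missing entries become NaT. It assumes the dates look like '01-Jan-1995' (day, abbreviated English month, year) and uses a made-up frame named sample.

```python
import pandas as pd

# Made-up frame with one valid and one missing release date
sample = pd.DataFrame({
    'movie_title': ['A', 'B'],
    'release_date': ['01-Jan-1995', None],
})

# Parse the assumed day-month-year format; anything that fails becomes NaT
parsed = pd.to_datetime(sample['release_date'], format='%d-%b-%Y', errors='coerce')

# Nullable integer dtype keeps the year an integer while allowing missing values
sample['release_year'] = parsed.dt.year.astype('Int64')

print(sample)
```

The design trade-off: the placeholder approach keeps every cell filled and human-readable, while the datetime approach keeps the column typed, so sorting and year extraction work without string handling.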

<h4 style="color:cyan;">Code Block 5: Visualizing Movie Release Years </h4>

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Ensure 'release_date' is treated as a string
movies_df['release_date'] = movies_df['release_date'].astype(str)

# Extract the release year from the 'release_date' column
movies_df['release_year'] = movies_df['release_date'].apply(
    lambda x: x.split('-')[-1] if x != 'Unknown' and '-' in x else 'Unknown'
)

# Remove entries with 'Unknown' year for the visualization and create a copy
movies_with_year = movies_df[movies_df['release_year'] != 'Unknown'].copy()

# Convert 'release_year' to numeric for sorting and visualization
movies_with_year['release_year'] = pd.to_numeric(movies_with_year['release_year'])

# Plot a histogram of movie release years
plt.figure(figsize=(12, 6))
sns.histplot(data=movies_with_year, x='release_year', bins=30, kde=False, color="skyblue")

plt.title('Distribution of Movie Release Years', fontsize=16)
plt.xlabel('Release Year', fontsize=14)
plt.ylabel('Number of Movies', fontsize=14)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

<h4 style="color:cyan;">Code Block 6: Visualizing Movie Genres </h4>

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# Select the genre indicator columns by exact name (each holds 0 or 1 per movie)
genre_names = ['unknown', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy',
               'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
               'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
genre_columns = [col for col in genre_names if col in movies_df.columns]

# Coerce the flags to numeric, then count the movies tagged with each genre
genre_counts = movies_df[genre_columns].apply(pd.to_numeric, errors='coerce').sum().sort_values(ascending=False)

# Plot the genre distribution as a bar chart
plt.figure(figsize=(12, 6))
genre_counts.plot(kind='bar', color='coral', alpha=0.8)

plt.title('Distribution of Movie Genres', fontsize=16)
plt.xlabel('Genres', fontsize=14)
plt.ylabel('Number of Movies', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

<h4 style="color:cyan;">Code Block 7: Visualizing Ratings Distribution </h4>

In [None]:
# Plot the distribution of ratings (discrete=True draws one bar per integer rating)
plt.figure(figsize=(10, 6))
sns.histplot(data=ratings_df, x='rating', discrete=True, kde=False, color="green")

plt.title('Distribution of User Ratings', fontsize=16)
plt.xlabel('Rating', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

<h4 style="color:cyan;">Code Block 8: Saving the Cleaned Datasets </h4>

In [None]:
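import os

# Create the output folder if it does not exist yet; to_csv will not create it
os.makedirs('cleaned_data', exist_ok=True)

# Save the cleaned movies dataset
movies_df.to_csv('cleaned_data/movies_cleaned.csv', index=False)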

# Save the cleaned ratings dataset
ratings_df.to_csv('cleaned_data/ratings_cleaned.csv', index=False)

print("Cleaned datasets saved successfully:")
print("movies_cleaned.csv and ratings_cleaned.csv are stored in the 'cleaned_data/' folder.")