## Handling Missing Values

### Strategy 1: Drop Rows with Missing Values
One approach to handling missing data is to remove any row that contains at least one missing value. While this guarantees a dataset free of missing values, it can lead to **significant data loss**, especially if missing values are spread across many rows and columns.

After applying this strategy, the dataset was reduced from **1000 rows to 713 rows**.

### Strategy 2: Fill Missing Values with Median / ‘unknown’
A more conservative approach is to **fill missing values**:

- **Numerical columns**: Missing values are replaced with the **median** (a more robust alternative to the mean, as it is less sensitive to outliers).
- **Categorical columns**: Missing values are replaced with the string **`'unknown'`**.

This method retains **all rows**, which is especially valuable for a relatively small dataset like ours (Top 1000 movies). It also keeps **downstream analysis simpler** and avoids the bias that dropping incomplete rows could introduce.

### Selected Strategy: Filling Missing Values
Given the **limited size** of the dataset (1000 rows) and the importance of **preserving data integrity**, we chose to proceed with **Strategy 2**. This allows us to **keep all records** while still addressing the issue of missing data effectively.
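
To make the trade-off between the two strategies concrete, here is a minimal sketch on a toy DataFrame. The values and the two-column layout are illustrative assumptions, not rows from the actual dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the movie dataset (illustrative values only)
toy = pd.DataFrame({
    "Meta_score": [80.0, np.nan, 90.0, 70.0],
    "Certificate": ["PG", "R", None, "PG"],
})

# Strategy 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Strategy 2: median for the numeric column, 'unknown' for the categorical one
filled = toy.copy()
filled["Meta_score"] = filled["Meta_score"].fillna(filled["Meta_score"].median())
filled["Certificate"] = filled["Certificate"].fillna("unknown")

print(f"original: {len(toy)} rows, dropped: {len(dropped)} rows, filled: {len(filled)} rows")
```

In this toy case, dropping removes half of the rows while filling keeps all of them, which mirrors the reasoning behind choosing Strategy 2.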


In [1]:
import pandas as pd

# Load dataset
df = pd.read_csv("top1000movies.csv")

df_dropna = df.dropna()
print("Shape after dropping rows with missing values:", df_dropna.shape)







Shape after dropping rows with missing values: (713, 16)


## Data Cleaning: Preparing Numerical and Categorical Data

After deciding to fill missing values with the median, we performed a series of cleaning steps to ensure the dataset is ready for analysis.

### 1. Cleaning the ‘Gross’ Column
The `Gross` column, representing box office revenue, contained **commas** (e.g., `1,300,000`) that prevented numerical conversion. We removed all commas to allow this column to be treated as a **numeric feature**.

### 2. Converting Columns to Numeric
The following columns were identified as **numerical** and explicitly converted:

- `Released_Year`
- `Runtime` (minutes)
- `IMDB_Rating`
- `Meta_score`
- `No_of_Votes`
- `Gross`

To ensure consistency, any value that could not be converted was set to `NaN` and later filled with the **median** of that column.
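
The comma removal and coercion steps can be sketched on a small Series. The sample strings, including the `"N/A"` entry, are hypothetical and only stand in for values that cannot be parsed.

```python
import pandas as pd

# Illustrative values only; "N/A" stands in for any unparseable entry
gross = pd.Series(["1,300,000", "28,341,469", "N/A", None])

# Remove thousands separators, then coerce anything unparseable to NaN
gross_clean = gross.str.replace(",", "", regex=False)
gross_num = pd.to_numeric(gross_clean, errors="coerce")

# Fill the resulting gaps with the median of the valid values
gross_filled = gross_num.fillna(gross_num.median())
print(gross_filled.tolist())
```

With `errors="coerce"`, bad entries become `NaN` instead of raising an exception, so they are handled by the same median fill as values that were missing from the start.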

### 3. Filling Remaining Missing Values in Numerical Columns
To address any remaining missing values, we filled all missing entries in **numerical columns** with the **median value** of their respective columns. This ensures that **outliers have minimal impact** on the imputation and preserves underlying data patterns.

### 4. Filling Missing Categorical Values
For **categorical features**, such as `Certificate`, `Genre`, `Director`, etc., we replaced all missing entries with the placeholder **`'unknown'`**. This approach maintains dataset completeness while clearly indicating the **absence of information** in those fields.


In [71]:
numeric_cols = ['Released_Year', 'Runtime', 'IMDB_Rating', 'Meta_score', 'No_of_Votes', 'Gross']
df['Gross'] = df['Gross'].replace({',': ''}, regex=True)

for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')


df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
print(df[numeric_cols].head(20))



# For categorical columns, fill with 'unknown'
categorical_cols = ['Certificate', 'Genre', 'Overview', 'Director', 'Star1', 'Star2', 'Star3', 'Star4']
df[categorical_cols] = df[categorical_cols].fillna('unknown')




    Released_Year  Runtime  IMDB_Rating  Meta_score  No_of_Votes        Gross
0          1994.0      142          9.3        80.0      2343110   28341469.0
1          1972.0      175          9.2       100.0      1620367  134966411.0
2          2008.0      152          9.0        84.0      2303232  534858444.0
3          1974.0      202          9.0        90.0      1129952   57300000.0
4          1957.0       96          9.0        96.0       689845    4360000.0
5          2003.0      201          8.9        94.0      1642758  377845905.0
6          1994.0      154          8.9        94.0      1826188  107928762.0
7          1993.0      195          8.9        94.0      1213505   96898818.0
8          2010.0      148          8.8        74.0      2067042  292576195.0
9          1999.0      139          8.8        66.0      1854740   37030102.0
10         2001.0      178          8.8        92.0      1661481  315544750.0
11         1994.0      142          8.8        82.0      1809221



In [72]:
from sklearn.preprocessing import MinMaxScaler

# Columns to normalize
numeric_cols = ['IMDB_Rating', 'Meta_score', 'No_of_Votes', 'Gross']

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Fit and transform the data
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Display the transformed data
print(df[numeric_cols].head())


   IMDB_Rating  Meta_score  No_of_Votes     Gross
0     1.000000    0.722222     1.000000  0.030257
1     0.941176    1.000000     0.688207  0.144092
2     0.823529    0.777778     0.982797  0.571025
3     0.823529    0.861111     0.476641  0.061173
4     0.823529    0.944444     0.286778  0.004653
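
The scaling above can be checked by hand. By default, `MinMaxScaler` rescales each column with `(x - min) / (max - min)`. The sketch below applies that formula directly; the minimum rating of 7.6 is an assumption chosen because it is consistent with the scaled values printed above, not a figure stated in the notebook.

```python
import numpy as np

def min_max_scale(values):
    """Rescale a 1-D array to [0, 1] via (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo)

# Illustrative ratings; 7.6 is an assumed minimum, not taken from the source
ratings = [9.3, 9.2, 9.0, 7.6]
scaled = min_max_scale(ratings)
print(np.round(scaled, 6))
```

Because `df` is overwritten in place, every later cell that reads `IMDB_Rating`, `Meta_score`, `No_of_Votes`, or `Gross` works with these 0-1 values rather than the original units.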


## Initial Data Exploration: Value Counts for Key Features

To better understand the structure and distribution of key features in the dataset, we analyzed **value counts** for several columns. This step was crucial during **debugging and data validation** to ensure that data cleaning was effective and to identify any unusual patterns or inconsistencies.

### Features Analyzed:
- **`Certificate`**: Frequency of movie certification types (e.g., PG, R, etc.)
- **`Released_Year`**: Distribution of movie release years to spot potential outliers or gaps.
- **`Genre`**: Common genre combinations and frequency.
- **`Meta_score`**: Distribution of Metacritic scores, useful to assess score granularity.
- **`Director`**: Count of movies per director — helps identify prolific directors or potential data duplication.
- **`Runtime`**: Most common movie runtimes — useful for detecting outliers or data entry errors.

By analyzing **value counts**, we could quickly assess:
- Whether certain categories were **dominant or underrepresented**.
- If **missing or placeholder values** like `'unknown'` were present.
- The **effectiveness of data cleaning**, particularly in `Certificate`, `Runtime`, and `Meta_score`.
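
As a rough illustration of these checks, the sketch below counts categories in a toy column and looks for the `'unknown'` placeholder. The sample values are made up for the example.

```python
import pandas as pd

# Illustrative certificate values after cleaning
cert = pd.Series(["pg", "r", "unknown", "pg", "r", "pg"])

counts = cert.value_counts()                # absolute frequency per category
shares = cert.value_counts(normalize=True)  # relative frequency per category

# How many rows carry the placeholder introduced during imputation
n_unknown = int(counts.get("unknown", 0))
print(counts)
print(f"placeholder rows: {n_unknown}")
```

Passing `normalize=True` makes dominant or underrepresented categories easier to spot, since the shares sum to 1.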


In [73]:
vc_certificate =  df['Certificate'].value_counts()


vc_release_year = df['Released_Year'].value_counts()

vc_genre = df['Genre'].value_counts()

vc_metascore = df['Meta_score'].value_counts()

vc_director = df['Director'].value_counts()


vc_runtime = df['Runtime'].value_counts()



## Data Cleaning: Standardizing and Encoding the `Certificate` Feature

### 1. Dropping Irrelevant Column
We dropped the `Poster_Link` column as it contains **image URLs** irrelevant to our analysis and adds no analytical value.

### 2. Standardizing Categorical Text
To ensure **consistency** across categorical data, all entries in categorical columns (e.g., `Certificate`, `Genre`, `Director`, etc.) were **stripped of leading/trailing spaces** and **converted to lowercase**. This step prevents issues caused by inconsistent text formats (e.g., `' PG'`, `'pg'`, `'Pg-13 '`).

### 3. Normalizing Certificate Categories
The `Certificate` feature had **multiple variations** of the same rating system, including international and TV-specific certifications. To simplify analysis:
- We **mapped variations** (e.g., `'pg-13'`, `'tv-14'`) into **standardized categories** (e.g., `'pg13'`).
- Rare or outdated certifications (e.g., `'passed'`, `'gp'`) were merged into their closest equivalents (`'approved'`, `'pg'`).
- Entries labeled `'unrated'` were replaced with `'unknown'`.
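
The text standardization and the variant mapping can be sketched together on a few sample strings. The raw values are illustrative, and `variant_map` is only a subset of the mapping used in the notebook.

```python
import pandas as pd

# Illustrative raw values with inconsistent spacing and casing
raw = pd.Series([" PG", "pg-13 ", "TV-14", "Unrated", "GP"])

# Step 1: trim whitespace and convert to lowercase
clean = raw.str.strip().str.lower()

# Step 2: collapse variants into standardized categories (subset of the mapping)
variant_map = {"pg-13": "pg13", "tv-14": "pg13", "gp": "pg", "unrated": "unknown"}
standardized = clean.replace(variant_map)
print(standardized.tolist())
```

Stripping and lowercasing first matters: the mapping keys are lowercase, so an untrimmed `"pg-13 "` would otherwise pass through unchanged.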

### 4. Encoding Certificate for Numerical Analysis
To allow for **numerical analysis and similarity scoring**, we created a new column: `Cert_numeric`, which maps each standardized certificate to a **numeric value** based on its **age-appropriateness**. Lower values indicate more general audience suitability, while higher values represent more restricted content.

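| Certificate | Encoded Value | Meaning |
|-------------|---------------|---------|
| `g`, `u` | 1 | General Audience |
| `pg`, `a`, `approved` | 2 | Parental Guidance |
| `pg13`, `ua`, `unknown` | 3 | Some material may be inappropriate for children |
| `r` | 4 | Restricted |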
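
One thing to watch with this kind of ordinal encoding is that `Series.map` silently turns any label missing from the dictionary into `NaN`. The sketch below shows a quick way to list such labels; `cert_levels` is a small subset of the encoding, and `'x-rated'` is a hypothetical label included only to trigger the case.

```python
import pandas as pd

# Subset of an ordinal encoding for certificates
cert_levels = {"g": 1, "u": 1, "pg": 2, "a": 2}

# 'x-rated' is a hypothetical label that the mapping does not cover
cert = pd.Series(["g", "pg", "x-rated", "u"])
cert_numeric = cert.map(cert_levels)

# Series.map leaves unmapped labels as NaN, so list them explicitly
unmapped = cert[cert_numeric.isna()].unique().tolist()
print(unmapped)
```

Comparing the sums of `value_counts()` before and after the mapping catches the same problem, because `value_counts()` ignores `NaN` by default.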



In [74]:
#Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross

df = df.drop('Poster_Link', axis=1)


for col in categorical_cols:
    df[col] = df[col].str.strip().str.lower()



df['Certificate'] = df['Certificate'].str.lower().str.strip()

certificate_mapping = {
    'u/a': 'ua',  # Merge variations
    'pg-13': 'pg13',
    'tv-pg': 'pg',
    'tv-14': 'pg13',
    'tv-ma': 'r',
    'gp': 'pg',
    'passed': 'approved',
    '16': 'r',
    'unrated': 'unknown'  # Unrated movies can be labeled as 'unknown'
}


# Apply mapping
df['Certificate'] = df['Certificate'].replace(certificate_mapping)


certificate_mapping2 = {
    'u': 1,         # Universal (suitable for all)
    'a': 2,         # Approved (similar to PG)
    'ua': 3,        # Unrestricted but for adult audiences (some material may be inappropriate for children)
    'r': 4,         # Restricted (under 17 requires accompanying parent or adult guardian)
    'unknown': 3,   # Mid-scale default for missing certificates (assumed close to PG-13)
    'pg': 2,        # Parental Guidance (some material may not be suitable for children)
    'g': 1,         # General Audience (appropriate for all ages)
    'approved': 2,  # Approved (similar to PG)
    'tv-14': 3,     # TV-14 (similar to PG-13)
    'pg13' : 3,
    'tv-ma': 5,     # TV Mature (intended for adult audiences only)
}


df['Cert_numeric'] = df['Certificate'].map(certificate_mapping2)

print(df['Certificate'].value_counts().sum())
print(df['Cert_numeric'].value_counts().sum())





1000
1000


## Encoding Multi-Label Genre Data

### 1. Splitting Genre Strings into Lists
The `Genre` column initially contained comma-separated strings representing multiple genres per movie (e.g., `"Action, Adventure, Sci-Fi"`). To handle this **multi-label data**, we:
- Converted all text to **lowercase** for consistency.
- Split the string values into **lists of genres** using `str.split(', ')`.

### 2. Multi-Label Binarization
Since each movie can belong to **multiple genres**, single-label encodings such as label encoding or standard one-hot encoding do not apply directly. Instead, we used `MultiLabelBinarizer` to:
- Transform each genre list into a **binary matrix**, where each genre becomes a **new column** (e.g., `action`, `drama`, `thriller`, etc.).
- Assign **1** if the movie belongs to that genre, otherwise **0**.

This transformation enables:
- **Efficient analysis** of genre distributions.
- Use of genre data in **similarity-based recommendations** or **feature importance evaluation**.

### 3. Merging with Original DataFrame
The resulting **genre dummies** were concatenated back to the main DataFrame, expanding our feature set with individual genre indicators for each movie.

This **multi-label encoding** preserves the richness of the genre data while enabling powerful analytical and modeling capabilities.
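
As a cross-check on the binarization described above, pandas' own `Series.str.get_dummies` produces the same kind of indicator matrix without scikit-learn. This is a swapped-in alternative for illustration, not the method used in the notebook, and the sample rows are made up.

```python
import pandas as pd

# Illustrative rows only
toy = pd.DataFrame({"Genre": ["Action, Adventure, Sci-Fi", "Drama", "Action, Drama"]})

# One indicator column per distinct genre; sep must match the separator exactly
indicators = toy["Genre"].str.lower().str.get_dummies(sep=", ")

# Attach the indicator columns to the original rows
toy = pd.concat([toy, indicators], axis=1)
print(toy)
```

Either approach yields one column per genre with a 1 where the movie carries that genre, so genre membership can be summed or filtered like any other numeric feature.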


In [75]:
df_en = df.copy()


df_en['Genre'] = df_en['Genre'].str.lower().str.split(', ')

# Lists are unhashable, so convert to tuples before counting genre combinations
vc_genre = df_en['Genre'].apply(tuple).value_counts()

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
genre_dummies = pd.DataFrame(mlb.fit_transform(df_en['Genre']), columns=mlb.classes_, index=df.index)

# Merge with original DataFrame
df_en = pd.concat([df_en, genre_dummies], axis=1)



print(df_en.columns)

df = df_en.copy()



Index(['Series_Title', 'Released_Year', 'Certificate', 'Runtime', 'Genre',
       'IMDB_Rating', 'Overview', 'Meta_score', 'Director', 'Star1', 'Star2',
       'Star3', 'Star4', 'No_of_Votes', 'Gross', 'Cert_numeric', 'action',
       'adventure', 'animation', 'biography', 'comedy', 'crime', 'drama',
       'family', 'fantasy', 'film-noir', 'history', 'horror', 'music',
       'musical', 'mystery', 'romance', 'sci-fi', 'sport', 'thriller', 'war',
       'western'],
      dtype='object')


In [76]:
import plotly.express as px

# Group by 'Released_Year' and calculate mean of 'Gross'
df_mean_gross_by_year = df.groupby('Released_Year')['Gross'].mean().reset_index()

# Create a line plot for the mean gross by year
fig = px.line(df_mean_gross_by_year, x='Released_Year', y='Gross', title='Mean Gross by Year')

# Show the plot
fig.show()



This histogram shows the distribution of IMDB ratings across the top 1000 movies, revealing which rating ranges hold the most films. Because `IMDB_Rating` was min-max scaled earlier, the x-axis runs from 0 to 1 rather than over the original rating scale.

In [77]:
import plotly.express as px

# Create a histogram of IMDB ratings
fig = px.histogram(df, x='IMDB_Rating', nbins=20, title='Distribution of IMDB Ratings')
fig.show()


This scatter plot visualizes the relationship between a movie's IMDB rating and its gross revenue. It helps us understand if there’s a trend that higher-rated movies tend to earn more at the box office.

In [78]:
import plotly.express as px

# Create a scatter plot of Gross vs. IMDB Rating
# Note: trendline='ols' requires the statsmodels package to be installed
fig = px.scatter(df, x='IMDB_Rating', y='Gross', title='Gross Revenue vs. IMDB Rating', trendline='ols')
fig.show()


--------------------

This bar chart displays the average runtime for movies in each genre. By visualizing this, we can identify which genres tend to have longer or shorter films.

In [79]:
import plotly.express as px

# Melt the dataframe to have one genre per row
df_melted = df.melt(id_vars=['Runtime'], value_vars=mlb.classes_, var_name='Genre', value_name='Present')
df_genre_runtime = df_melted[df_melted['Present'] == 1].groupby('Genre')['Runtime'].mean().reset_index()

# Create a bar chart of average runtime by genre
fig = px.bar(df_genre_runtime, x='Genre', y='Runtime', title='Average Runtime by Genre')
fig.show()


This bar chart shows the number of movies released each year. It helps us observe trends in movie production over time and see if there has been a surge or decline in the number of top-rated movies over the years.

In [80]:
import plotly.express as px

# Create a bar chart of the number of movies released per year
df_year_count = df['Released_Year'].value_counts().reset_index()
df_year_count.columns = ['Released_Year', 'Count']
fig = px.bar(df_year_count, x='Released_Year', y='Count', title='Number of Movies Released per Year')
fig.show()


This bar chart showcases the top 10 directors with the most films in the top 1000. It helps identify prolific filmmakers and their contribution to the top-rated movies in the dataset.

In [81]:
import plotly.express as px

# Create a bar chart of the top 10 directors by number of movies
df_director_count = df['Director'].value_counts().head(10).reset_index()
df_director_count.columns = ['Director', 'Count']
fig = px.bar(df_director_count, x='Director', y='Count', title='Top 10 Directors by Number of Movies')
fig.show()


This pie chart represents the distribution of movie certificates in the dataset. It shows the prevalence of different content ratings (like PG, R, etc.) and how they are distributed across the top 1000 movies.

In [82]:
import plotly.express as px

# Create a pie chart of movie certificates
df_certificate_count = df['Certificate'].value_counts().reset_index()
df_certificate_count.columns = ['Certificate', 'Count']
fig = px.pie(df_certificate_count, names='Certificate', values='Count', title='Distribution of Movie Certificates')
fig.show()


This heatmap visualizes the correlations between various numerical features, such as IMDB ratings, metascore, votes, and gross revenue. It helps identify relationships, such as whether a higher IMDB rating correlates with higher gross revenue.

In [83]:
import plotly.express as px

# Compute the correlation matrix
corr_matrix = df[['IMDB_Rating', 'Meta_score', 'No_of_Votes', 'Gross', 'Runtime']].corr()

# Create a heatmap
fig = px.imshow(corr_matrix, text_auto=True, title='Correlation Heatmap of Numerical Features')
fig.show()


This bar chart displays the average IMDB rating by movie certificate. It reveals how different certificates correlate with the average ratings of the movies, which can be useful for understanding the audience’s preferences.

In [84]:
import plotly.express as px

# Calculate average IMDB rating by certificate
df_certificate_rating = df.groupby('Certificate')['IMDB_Rating'].mean().reset_index()

# Create a bar chart
fig = px.bar(df_certificate_rating, x='Certificate', y='IMDB_Rating', title='Average IMDB Rating by Certificate')
fig.show()


This line plot tracks the popularity of various genres over the years. By analyzing the data, we can identify which genres have gained or lost popularity over time, providing insights into changing audience tastes.

In [85]:
import plotly.express as px

# Melt the dataframe to have one genre per row
df_melted = df.melt(id_vars=['Released_Year'], value_vars=mlb.classes_, var_name='Genre', value_name='Present')
df_genre_year = df_melted[df_melted['Present'] == 1].groupby(['Released_Year', 'Genre']).size().reset_index(name='Count')

# Create a line plot
fig = px.line(df_genre_year, x='Released_Year', y='Count', color='Genre', title='Genre Popularity Over Time')
fig.show()


### Movie Similarity Calculation

#### Helper Functions for Similarity Calculation
This section contains helper functions and the main logic to calculate the similarity between two movies based on different features such as the year of release, runtime, IMDB rating, certificate, genre, and director/stars. Each function calculates a specific aspect of the similarity, and they are all combined to generate a total similarity score. This score can be used to recommend similar movies.

## Normalizing Similarity Scores for Equal Impact

To keep features measured in different units (e.g., years, minutes, rating points) on a comparable footing, we normalize their differences to a common 0-1 scale. This reduces the chance that a single feature disproportionately affects the final score. Below are the updated similarity functions; each feature's weight is still set by its base points.

### 1. **Released Year Similarity**
We normalize the year difference by dividing it by the maximum possible year difference (100 years by default). This ensures the year difference is on a 0-1 scale.
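
A small standalone sketch shows how this normalized decay behaves. The constants mirror the ones in the notebook cell (`10` base points, decay `0.1`, maximum gap `100`), but the function and constant names here are chosen for the example.

```python
import math

BASE_POINTS = 10      # points awarded for an identical release year
DECAY = 0.1           # decay factor applied to the normalized difference
MAX_YEAR_DIFF = 100   # maximum gap used for normalization

def year_similarity(year_a, year_b):
    """Exponentially decaying score on the normalized year gap."""
    normalized_gap = abs(year_a - year_b) / MAX_YEAR_DIFF
    return BASE_POINTS * math.exp(-DECAY * normalized_gap)

# Compare an identical year, a 10-year gap, and a 100-year gap
for gap in (0, 10, 100):
    print(gap, round(year_similarity(2000, 2000 - gap), 3))
```

With these constants, even a 100-year gap still scores about 9.05 of 10 points, so release year separates movies only weakly. A larger decay factor would be one way to make it more discriminating, if that is desired.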



In [86]:
import numpy as np


# Helper functions for similarity calculation

# Released Year similarity
BasePointReleasedYear = 10  # Base points for same year
year_multiplier = 0.1  # Decay factor for the year difference
def get_year_similarity_points(movie_year, compare_year, max_year_diff=100):
    year_diff = abs(movie_year - compare_year)
    # Normalize the year difference to a 0-1 scale
    normalized_year_diff = year_diff / max_year_diff
    return BasePointReleasedYear * np.exp(-year_multiplier * normalized_year_diff)

# Runtime similarity
BasePointRuntime = 5  # Base points for matching runtime
runtime_multiplier = 0.05  # Decay factor for runtime difference
def get_runtime_similarity_points(movie_runtime, compare_runtime, max_runtime_diff=120):
    runtime_diff = abs(movie_runtime - compare_runtime)
    # Normalize the runtime difference to a 0-1 scale
    normalized_runtime_diff = runtime_diff / max_runtime_diff
    return BasePointRuntime * np.exp(-runtime_multiplier * normalized_runtime_diff)


# IMDB Rating similarity
BasePointIMDB = 5  # Base points for matching IMDB rating
rating_multiplier = 0.2  # Decay factor for rating difference
def get_rating_similarity_points(movie_rating, compare_rating):
    rating_diff = abs(movie_rating - compare_rating)
    return BasePointIMDB * np.exp(-rating_multiplier * rating_diff)


# Certificate similarity
BasePointCert = 10  # Base points for certificate similarity
def get_cert_similarity_points(movie_cert, compare_cert):
    cert_diff = abs(movie_cert - compare_cert) / 4  # Normalize the certificate difference to a 0-1 scale
    return BasePointCert * (1 - cert_diff)


# Genre similarity
BasePointGenres = 10  # Base points for matching genre
def get_genre_similarity_points(movie_genres, compare_genres, max_genres=5):
    common_genres = set(movie_genres) & set(compare_genres)
    # Normalize the overlap count to a 0-1 scale (max_genres is the assumed cap)
    normalized_common_genres = len(common_genres) / max_genres
    # exp() grows with the overlap; no shared genres still yields the base points
    return BasePointGenres * np.exp(normalized_common_genres)

# Director and Stars similarity
BasePointDirectorStars = 10  # Base points for matching director or stars
def get_director_star_similarity_points(movie_director, movie_stars, compare_director, compare_stars, max_stars=4):
    points = 0
    if movie_director == compare_director:
        points += BasePointDirectorStars
    points += len(set(movie_stars) & set(compare_stars)) * BasePointDirectorStars
    # Scale by max_stars; a shared director plus four shared stars yields 1.25
    normalized_points = points / (max_stars * BasePointDirectorStars)
    return BasePointDirectorStars * normalized_points



### Main Function for Similarity Score

#### Main Logic for Similarity Score Calculation
The main function, `get_movie_similarity_score`, aggregates all the individual similarity scores calculated by the helper functions. The scores for year, runtime, IMDB rating, certificate, genre, and director/stars are summed to give an overall similarity score between two movies.
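
To get a feel for the scale of the summed score, the sketch below adds up the component values for one hypothetical pair: two movies with identical year, runtime, rating, and certificate, three shared genres, and no shared director or stars. The constants follow the helper functions above; the scenario itself is an assumption made for illustration.

```python
import math

# Component scores for the hypothetical pair described above
components = {
    "year": 10.0,                     # identical release year
    "runtime": 5.0,                   # identical runtime
    "rating": 5.0,                    # identical IMDB rating
    "certificate": 10.0,              # identical certificate level
    "genre": 10.0 * math.exp(3 / 5),  # three shared genres, max_genres=5
    "director_stars": 0.0,            # no shared director or stars
}

total = sum(components.values())
for name, value in components.items():
    print(f"{name:>15}: {value:6.2f}")
print(f"{'total':>15}: {total:6.2f}")
```

The genre term contributes the largest single share in this scenario, which suggests that genre overlap carries more weight in the total than its base points alone would indicate.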


In [87]:
# Main function to calculate the similarity score
def get_movie_similarity_score(movie, compare_movie):
    star_cols = ['Star1', 'Star2', 'Star3', 'Star4']

    # Score each feature separately so individual contributions are easy to inspect
    year_score = get_year_similarity_points(movie['Released_Year'], compare_movie['Released_Year'])
    runtime_score = get_runtime_similarity_points(movie['Runtime'], compare_movie['Runtime'])
    rating_score = get_rating_similarity_points(movie['IMDB_Rating'], compare_movie['IMDB_Rating'])
    cert_score = get_cert_similarity_points(movie['Cert_numeric'], compare_movie['Cert_numeric'])
    genre_score = get_genre_similarity_points(movie['Genre'], compare_movie['Genre'])
    director_star_score = get_director_star_similarity_points(
        movie['Director'], [movie[c] for c in star_cols],
        compare_movie['Director'], [compare_movie[c] for c in star_cols])

    # The total is the unweighted sum of the per-feature scores
    return (year_score + runtime_score + rating_score
            + cert_score + genre_score + director_star_score)





### Movie Recommendation Function

#### Function to Recommend Top 5 Similar Movies
This function, `recommend_similar_movies`, calculates the similarity score for each movie in the dataset relative to the given movie, then sorts the movies by their similarity score and returns the top 5 most similar movies.
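
The score, sort, and truncate pattern can be sketched independently of the movie data. The example below uses `heapq.nlargest` instead of a full sort, which is a swapped-in alternative rather than the notebook's approach. The `years` data and the `closeness` score are hypothetical stand-ins for the dataset and the similarity function.

```python
import heapq

def top_k_similar(target, candidates, score_fn, k=5):
    """Return (score, candidate) pairs for the k highest-scoring candidates."""
    scored = ((score_fn(target, c), c) for c in candidates if c != target)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

# Hypothetical stand-in data: movie label -> release year
years = {"A": 1994, "B": 1995, "C": 2010, "D": 1990}

def closeness(a, b):
    """Toy similarity: a smaller year gap gives a higher (less negative) score."""
    return -abs(years[a] - years[b])

print(top_k_similar("A", years, closeness, k=2))
```

For 1000 movies a full sort is perfectly adequate; `heapq.nlargest` mainly helps when the candidate list is much larger than `k`.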



In [88]:
# Function to recommend the Top 5 most similar movies
def recommend_similar_movies(movie_title, df):
    matches = df[df['Series_Title'] == movie_title]
    if matches.empty:
        raise ValueError(f"Movie not found: {movie_title!r}")
    movie = matches.iloc[0]  # Use the first match if the title is duplicated
    scores = []

    for _, compare_movie in df.iterrows():
        if compare_movie['Series_Title'] != movie_title:
            score = get_movie_similarity_score(movie, compare_movie)
            scores.append((compare_movie['Series_Title'], score))

    # Sort by score (highest first) and return the top 5
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:5]



### Movie Similarity Recommendations

#### Test the Movie Recommendation System

This section tests the movie recommendation system by recommending the top 5 similar movies for a list of movie titles. The movies are compared based on their similarity scores derived from various attributes such as release year, runtime, IMDB rating, genre, certificate, and director/stars. The top 5 most similar movies are displayed along with their similarity scores.

The following list of movie titles will be used for testing:
- "Spider-Man: Into the Spider-Verse"
- "Avengers: Endgame"
- "Star Wars: Episode VI - Return of the Jedi"
- "Sherlock Jr."
- "Before Sunset"


In [89]:
# List of movies to test
movie_titles = [
    "Spider-Man: Into the Spider-Verse",
    "Avengers: Endgame",
    "Star Wars: Episode VI - Return of the Jedi",
    "Sherlock Jr.",
    "Before Sunset"
]
# Assuming df has been preprocessed and contains the necessary columns
for movie_title in movie_titles:
    print(f"Top 5 similar movies for: {movie_title}")
    top_5_similar_movies = recommend_similar_movies(movie_title, df)
    for rank, (movie, score) in enumerate(top_5_similar_movies, start=1):
        print(f"{rank}. {movie} (Score: {score:.2f})")
    print("\n" + "-"*50 + "\n")


Top 5 similar movies for: Spider-Man: Into the Spider-Verse
1. Mononoke-hime (Score: 47.98)
2. How to Train Your Dragon (Score: 47.93)
3. The Incredibles (Score: 47.85)
4. Big Hero 6 (Score: 47.81)
5. How to Train Your Dragon 2 (Score: 47.81)

--------------------------------------------------

Top 5 similar movies for: Avengers: Endgame
1. Avengers: Infinity War (Score: 54.84)
2. Captain America: Civil War (Score: 54.48)
3. Captain America: The Winter Soldier (Score: 51.88)
4. The Avengers (Score: 49.54)
5. Gladiator (Score: 47.92)

--------------------------------------------------

Top 5 similar movies for: Star Wars: Episode VI - Return of the Jedi
1. Star Wars: Episode V - The Empire Strikes Back (Score: 52.95)
2. Star Wars (Score: 50.47)
3. Indiana Jones and the Last Crusade (Score: 47.29)
4. Aliens (Score: 44.88)
5. Raiders of the Lost Ark (Score: 44.81)

--------------------------------------------------

Top 5 similar movies for: Sherlock Jr.
1. The General (Score: 47.29)
2. A

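# Top 5 Similar Movies

The scores in this summary are on a different scale from the cell output above (for example, 226.16 versus 47.98 for the top *Spider-Man: Into the Spider-Verse* match), and the rankings differ, so the two listings appear to come from different scoring configurations.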

## Spider-Man: Into the Spider-Verse
1. **Incredibles 2** (Score: 226.16)
2. **Big Hero 6** (Score: 224.58)
3. **How to Train Your Dragon 2** (Score: 224.58)
4. **The Lego Movie** (Score: 224.30)
5. **Kubo and the Two Strings** (Score: 223.95)

---

## Avengers: Endgame
1. **The Lord of the Rings: The Two Towers** (Score: 222.03)
2. **Dawn of the Planet of the Apes** (Score: 221.86)
3. **The Revenant** (Score: 221.76)
4. **Letters from Iwo Jima** (Score: 218.97)
5. **Gladiator** (Score: 218.66)

---

## Star Wars: Episode VI - Return of the Jedi
1. **Star Wars: Episode V - The Empire Strikes Back** (Score: 262.56)
2. **Star Wars** (Score: 250.20)
3. **Wo hu cang long** (Score: 216.28)
4. **Pirates of the Caribbean: The Curse of the Black Pearl** (Score: 215.78)
5. **Avatar** (Score: 213.37)

---

## Sherlock Jr.
1. **Andaz Apna Apna** (Score: 213.88)
2. **The General** (Score: 108.68)
3. **The Circus** (Score: 96.83)
4. **It Happened One Night** (Score: 92.76)
5. **City Lights** (Score: 92.30)

---

## Before Sunset
1. **Before Sunrise** (Score: 124.71)
2. **Before Midnight** (Score: 124.01)
3. **Jeux d'enfants** (Score: 100.26)
4. **Once** (Score: 99.83)
5. **Bom Yeoareum Gaeul Gyeoul Geurigo Bom** (Score: 99.46)

---

# Conclusion

The results show how the similarity scores reflect relationships between movies in terms of genres, themes, and key attributes:
- **Spider-Man: Into the Spider-Verse** is most similar to other animated superhero films, like *Incredibles 2* and *Big Hero 6*.
- **Avengers: Endgame** shares the highest similarity with other action-packed, epic films, including *The Lord of the Rings: The Two Towers* and *Gladiator*.
- **Star Wars: Episode VI - Return of the Jedi** has strong connections with other Star Wars movies, notably *The Empire Strikes Back*, along with other action-adventure films like *Avatar*.
- **Sherlock Jr.**, a silent film, is matched with other silent-era classics such as *The General* and *The Circus*. Its top-ranked match, *Andaz Apna Apna*, is a 1994 comedy rather than a silent-era film, and its score (213.88) is roughly double that of the next match, which marks it as an outlier.
- **Before Sunset** aligns closely with its predecessor *Before Sunrise* and its sequel *Before Midnight*, as well as other romantic films like *Jeux d'enfants*.

Overall, these results suggest that the scoring system identifies similarity based on genre, content, and style, providing meaningful recommendations for movie lovers, with occasional outliers such as the top *Sherlock Jr.* match.

