In [None]:
import pandas as pd
import matplotlib.pyplot as plt



# Movie Dataset Analysis and Recommendation System

### Introduction
This project analyzes a movie dataset to uncover patterns and insights related to genres, actors, movies, and themes. The dataset comprises approximately 1 million movies and is structured across four CSV files, each detailing a different aspect of the movies.
I'm a huge movie enthusiast and I'm always looking for new movies to watch, which is why I'm passionate about this challenge. I hope to find some interesting insights that will help me discover new movies to watch :)

### Objectives
- Analyze and explore the data to discover patterns and trends within the movie industry.
- Develop a recommendation system based on the insights gathered from the analysis.

### Dataset
The dataset is sourced from [Kaggle](https://www.kaggle.com/datasets/gsimonx37/letterboxd/data) and includes:
1. **Genres** - Categorization of movies by genre.
2. **Actors** - Information about the cast in each movie.
3. **Movies** - Details about individual movies.
4. **Themes** - Various themes associated with each movie.

### Let's start by loading the data and performing an initial exploration.


In [None]:
# Load the four CSV files: df = genres, df2 = actors, df3 = movies, df4 = themes
df = pd.read_csv('movie_dataset/genres.csv')
df2 = pd.read_csv('movie_dataset/actors.csv')
df3 = pd.read_csv('movie_dataset/movies.csv')
df4 = pd.read_csv('movie_dataset/themes.csv')
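
Before plotting, a quick per-column summary helps spot missing values and odd types in each file. The helper below is a minimal sketch, not part of the original notebook: `summarize` is a hypothetical name, and the tiny `toy` frame is made-up data standing in for one of the CSV files (its `id` column is an assumption).

```python
import pandas as pd

def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, distinct-value count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "unique": frame.nunique(),  # NaN is not counted as a distinct value
    })

# Made-up stand-in for one of the CSV files (illustration only)
toy = pd.DataFrame({"id": [1, 2, 3], "genre": ["Drama", None, "Drama"]})
summary = summarize(toy)
print(summary)
```

The same call should work on the loaded frames, for example `summarize(df3)` for the movies file.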

### Genres Dataset

In [None]:
genre_counts = df["genre"].value_counts()

# Plot the distribution of genres
genre_counts.plot(kind='bar')
plt.xlabel("Genre")
plt.ylabel("Count")
plt.title("Genre Distribution")
plt.show()

In [None]:
# Actors Dataset
actor_counts = df2["name"].value_counts()
actor_counts[:20].plot(kind='bar')
plt.xlabel("Actor")
plt.ylabel("Count")
plt.title("Top 20 Actors in Movies")
plt.show()

In [None]:
# Movies Dataset
df3.head()

In [None]:
# Count movies per 'date' value (assumed here to hold the release year)
df3['date'].value_counts().sort_index().plot()
plt.xlabel("Year")
plt.ylabel("Number of Movies")
plt.title("Number of Movies Released Each Year")
plt.show()

In [None]:
df3['rating'].plot(kind='hist', bins=20, title="Distribution of Movie Ratings")
plt.show()

In [None]:
# Themes Dataset
theme_counts = df4["theme"].value_counts()
theme_counts[:20].plot(kind='bar')
plt.xlabel("Theme")
plt.ylabel("Count")
plt.title("Top 20 Themes in Movies")
plt.show()

### Recommendations
Building on the analysis, we can develop a recommendation system that suggests movies using the following criteria:
1. **Genre** - Recommend movies based on the user's preferred genre.
2. **Actors** - Suggest movies that feature the user's favorite actors.
3. **Themes** - Recommend movies with themes that the user finds interesting.
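
The three criteria above could be combined into a simple filter, sketched below. This is one possible approach under stated assumptions, not a finished recommender: it assumes the four files share an `id` column identifying each movie and that the movies file has a `name` column, and the `recommend` function and all toy rows are hypothetical. The `genre`, `name` (actors), `theme`, and `rating` columns are the ones used in the cells above.

```python
import pandas as pd

def recommend(movies, genres, actors, themes,
              genre=None, actor=None, theme=None, top_n=5):
    """Return up to top_n movies matching every given criterion, best rated first."""
    matching = set(movies["id"])  # assumption: 'id' links the four tables
    if genre is not None:
        matching &= set(genres.loc[genres["genre"] == genre, "id"])
    if actor is not None:
        matching &= set(actors.loc[actors["name"] == actor, "id"])
    if theme is not None:
        matching &= set(themes.loc[themes["theme"] == theme, "id"])
    picks = movies[movies["id"].isin(matching)]
    return picks.sort_values("rating", ascending=False).head(top_n)

# Made-up toy tables (illustration only, not real dataset rows)
movies = pd.DataFrame({"id": [1, 2, 3],
                       "name": ["Movie A", "Movie B", "Movie C"],
                       "rating": [3.5, 4.2, 2.9]})
genres = pd.DataFrame({"id": [1, 2, 2, 3],
                       "genre": ["Drama", "Drama", "Comedy", "Comedy"]})
actors = pd.DataFrame({"id": [1, 2, 3],
                       "name": ["Actor X", "Actor X", "Actor Y"]})
themes = pd.DataFrame({"id": [1, 2], "theme": ["Theme P", "Theme Q"]})

drama_with_x = recommend(movies, genres, actors, themes,
                         genre="Drama", actor="Actor X")
print(drama_with_x)
```

With the real data, the call would look like `recommend(df3, df, df2, df4, genre=..., actor=...)`. Requiring every criterion to match is a deliberate simplification; scoring partial matches would be a natural next step.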