## 1. Introduction
**This notebook processes movie data: it handles missing values, segments movies by genre using KMeans clustering, and generates movie recommendations from user input using cosine similarity.
The project uses the TMDb API to fetch movie posters, and the data comes from a CSV file containing information about various movies.**
## 2. Importing Libraries and Setting Up
**We start by importing all necessary libraries that will be used throughout the notebook.
These include libraries for data manipulation (pandas, NumPy), visualization (Matplotlib, Seaborn), machine learning with scikit-learn (KMeans, StandardScaler, CountVectorizer, cosine_similarity), notebook display (IPython), and interacting with the TMDb API (tmdbv3api).**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from IPython.display import display, Image, HTML
from tmdbv3api import TMDb, Movie
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

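# Setting up the TMDb API key for fetching movie posters.
# The key is read from the TMDB_API_KEY environment variable (an assumed name)
# instead of being hard-coded in the notebook
import os
tmdb = TMDb()
tmdb.api_key = os.environ.get('TMDB_API_KEY', '')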

## 3. Data Preprocessing
### 3.1 Loading the Data

In [2]:
# We load the movie dataset from a CSV file. The delimiter is specified as ";" and
# encoding is set to 'ISO-8859-1' to handle special characters correctly
df = pd.read_csv("../dataset/movies2.csv", delimiter=";", encoding='ISO-8859-1')

# Display general information about the dataset, including the number of entries,
# data types, and memory usage
print("General information about the Movie data: ")
df.info()

# Checking for missing values in the dataset to identify columns that need treatment
print("Missing values in the Movie data: ")
print(df.isnull().sum())

### 3.2 Visualization of Missing Values

In [3]:
# Visualize the number of missing values per column using a bar chart
# This helps in identifying which columns have the most missing data
missing_counts = df.isnull().sum()
sns.set_palette('viridis')
plt.figure(figsize=(10, 6))
missing_counts.plot(kind='bar')
plt.title('Number of Missing Values per Column')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.tight_layout()

## 4. Handling Missing Values
### 4.1 Treating Missing Values

In [4]:
# We handle missing values by filling text columns with a placeholder string
# ('No rating', 'Unknown') and numeric columns with the mean or median.
# The result is assigned back to the column: calling fillna(inplace=True) on a
# column selection triggers a warning in recent pandas versions and may not update df
print("Handling missing values:")
df['rating'] = df['rating'].fillna('No rating')
df['released'] = df['released'].fillna('Unknown')
df['score'] = df['score'].fillna(df['score'].mean())
df['votes'] = df['votes'].fillna(df['votes'].median())
df['writer'] = df['writer'].fillna('Unknown')
df['star'] = df['star'].fillna('Unknown')
df['country'] = df['country'].fillna('Unknown')
df['budget'] = df['budget'].fillna(df['budget'].median())
df['gross'] = df['gross'].fillna(df['gross'].median())
df['company'] = df['company'].fillna('Unknown')
df['runtime'] = df['runtime'].fillna(df['runtime'].median())

# Verify that missing values have been treated by printing the updated count of missing values
print(df.isnull().sum())
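
The same filling policy can be written generically. The following is a minimal sketch, not the notebook's own method: the helper name `fill_missing` and the toy frame are illustrative assumptions, and it uses the median for every numeric column, whereas the cell above uses the mean for `score`.

```python
import numpy as np
import pandas as pd

def fill_missing(frame, placeholder="Unknown"):
    """Return a copy with numeric NaNs set to the column median and other NaNs set to a placeholder."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(placeholder)
    return out

# Toy data (hypothetical), with one missing value per column
toy = pd.DataFrame({
    "star": ["A", None, "C"],
    "budget": [10.0, np.nan, 30.0],
})
filled = fill_missing(toy)
print(filled)
```

Because the helper returns a copy, the original frame is left untouched and the before/after missing-value counts can still be compared.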

### 4.2 Visualization After Treatment

In [5]:
# Visualize the number of missing values per column after treatment to ensure that
# the missing values have been appropriately handled
missing_counts = df.isnull().sum()
sns.set_palette('magma')
plt.figure(figsize=(10, 6))
missing_counts.plot(kind='bar')
plt.title('Number of Missing Values per Column (After Treatment)')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.tight_layout()

## 5. Genre Distribution Analysis
### 5.1 Analyzing the Distribution of Movie Genres

In [6]:
# We separate the genres and count their occurrences to see which genres are most common
genres_list = df['genre'].str.split('|').explode()

# Count occurrences of each genre and display them using a bar chart
genre_counts = genres_list.value_counts()
plt.figure(figsize=(12, 9))
sns.barplot(x=genre_counts.values, y=genre_counts.index, palette='viridis')
plt.title('Distribution of Movie Genres')
plt.xlabel('Number of Movies')
plt.ylabel('Genres')
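
To make the split-and-count step concrete, here is a small sketch on made-up data. The three toy rows are an assumption for illustration only; they just show how a `|`-separated genre string contributes one count per genre.

```python
import pandas as pd

# Toy data (hypothetical): genres separated by '|', as the cell above assumes
toy = pd.DataFrame({"genre": ["Action|Comedy", "Comedy", "Drama|Comedy"]})

# split turns each string into a list; explode yields one row per (movie, genre) pair
toy_counts = toy["genre"].str.split("|").explode().value_counts()
print(toy_counts)
```

A movie with several genres is counted once for each of them, so the totals can exceed the number of movies.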

## 6. KMeans Clustering for Movie Segmentation
### 6.1 Applying KMeans Clustering
**We apply KMeans clustering to segment movies into clusters based on their genres.
This helps in grouping similar movies together, which can later be used for recommendations.**


In [7]:
# Encode genres using one-hot encoding (one row per movie, one column per genre)
genres_encoded = pd.get_dummies(df['genre'].str.split('|').explode()).groupby(level=0).sum()

# Standardize the data
scaler = StandardScaler()
genres_scaled = scaler.fit_transform(genres_encoded)

# Determining the optimal number of clusters using the Elbow Method,
# fitted on the scaled genre features rather than on random example data
k_range = range(1, 11)
inertias = []
for k in k_range:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(genres_scaled)
    inertias.append(kmeans.inertia_)

# Plotting the elbow method graph to find the optimal number of clusters
plt.figure(figsize=(8, 6))
plt.plot(k_range, inertias, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of Clusters k')
plt.ylabel('Inertia')
plt.xticks(k_range)
plt.grid(True)

# Apply KMeans clustering for segmentation
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(genres_scaled)

# Add clusters to the DataFrame
df['cluster'] = clusters

# Displaying the movies in each cluster to understand how the segmentation worked
for cluster_id in range(3):
    cluster_movies = df[df['cluster'] == cluster_id]['name']
    print(f"Cluster {cluster_id}:")
    print(cluster_movies)
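
The encode-then-cluster idea can be checked on a tiny example. This is a sketch under assumptions: the four toy movies are invented, `Series.str.get_dummies` is used as a shorter alternative to the split/explode/groupby chain, and scaling is skipped because the toy columns are already comparable.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy data (hypothetical): two action-like and two comedy-like movies
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "genre": ["Action|Thriller", "Action", "Romance|Comedy", "Comedy"],
})

# One row per movie, one 0/1 column per genre
onehot = toy["genre"].str.get_dummies(sep="|")

# Two clusters; n_init and random_state are fixed so the run is repeatable
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(onehot)
print(onehot)
print(labels)
```

Movies that share a genre column end up closer together in this space, which is why the two action titles and the two comedy titles fall into separate clusters here. The cluster numbers themselves are arbitrary labels.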

## 7. Movie Recommendation System
### 7.1 Calculating Cosine Similarity
**We calculate the cosine similarity between movies based on their genre encoding. This similarity matrix will be used to recommend movies that are similar to a given movie.**

In [8]:
cosine_sim = cosine_similarity(genres_scaled)

print("Cosine similarity matrix (first 10 columns):")
print("")
print(cosine_sim[:, :10])
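
As a worked example, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below uses invented raw 0/1 genre vectors, so all values lie between 0 and 1. In the notebook the features are standardized first, so negative similarities can also occur.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy genre vectors (hypothetical) over the columns [Action, Comedy, Drama]
vectors = np.array([
    [1, 1, 0],  # movie 0: Action + Comedy
    [1, 0, 0],  # movie 1: Action
    [0, 0, 1],  # movie 2: Drama
])

def cosine(a, b):
    """Cosine of the angle between a and b: dot product over the product of norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine_similarity(vectors)
print(round(cosine(vectors[0], vectors[1]), 3))  # 0.707, i.e. 1 / sqrt(2)
print(sim.round(3))
```

Movies 0 and 1 share one genre and score about 0.707, while movie 2 shares none with either and scores 0.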

### 7.2 Getting Recommendations

In [9]:
# The user is prompted to enter the name of a movie,
# and the system will find similar movies based on the cosine similarity matrix
film_name = input("Enter the movie name: ")
film_index = df.index[df['name'] == film_name].tolist()

# If the title is not found, report it; otherwise list the five most similar movies,
# excluding the movie itself
if not film_index:
    print("Movie not found.")
else:
    film_index = film_index[0]
    # Take the six highest scores so that five remain once the movie itself is dropped
    similar_movies_indices = cosine_sim[film_index].argsort()[:-7:-1]
    similar_movies_indices = [idx for idx in similar_movies_indices if idx != film_index][:5]
    print(f"Recommended movies for {df.iloc[film_index]['name']}:")
    for idx in similar_movies_indices:
        print(df.iloc[idx]['name'])
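
The lookup logic can be isolated into a small function, which makes it testable without user input. This is a sketch only: the name `top_n_similar` and the toy similarity matrix are assumptions introduced for illustration.

```python
import numpy as np

def top_n_similar(sim_matrix, index, n=5):
    """Return positions of the n rows most similar to `index`, excluding `index` itself."""
    order = np.argsort(sim_matrix[index])[::-1]  # highest similarity first
    order = order[order != index]                # drop the query movie
    return order[:n].tolist()

# Toy symmetric similarity matrix (hypothetical values)
toy_sim = np.array([
    [1.0, 0.9, 0.1, 0.5],
    [0.9, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])
print(top_n_similar(toy_sim, 0, n=2))  # [1, 3]
```

Filtering out the query index before slicing guarantees that `n` results come back whenever at least `n` other movies exist.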

## 8. Feature Engineering for Enhanced Recommendations
### 8.1 Creating a Combined Feature
**To improve the movie recommendation system, we combine multiple columns into a single feature. This feature will be used to calculate more accurate similarities between movies based on various aspects, not just genres.**

In [10]:
# Combine relevant features into a single string for each movie
df['info'] = df[['name','genre', 'director', 'writer', 'star','company']].apply(lambda x: ' '.join(x.astype(str)), axis=1)

# Example of what the 'info' column looks like after combining the features
print(df['info'].head())

### 8.2 Transforming the Combined Feature into Numerical Form
**We use the `CountVectorizer` from the `sklearn.feature_extraction.text` module to transform the combined feature into a numerical matrix. This allows us to calculate similarities between movies based on this feature.**

In [11]:
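# Use CountVectorizer to transform the 'info' feature into numerical form.
# min_df=20 keeps only tokens that appear in at least 20 movies
countV = CountVectorizer(min_df=20, stop_words='english')
matrice_transform = countV.fit_transform(df['info'])

# Display the shape of the transformed matrix (movies x retained tokens)
print(matrice_transform.shape)

# Display the first 10 columns of the matrix as an example
print(matrice_transform[:, :10].toarray())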

### 8.3 Calculating Similarity Based on the Combined Feature
**Next, we calculate the cosine similarity between movies based on the transformed 'info' feature. This similarity matrix will be used for making recommendations.**

In [12]:
# Calculate cosine similarity based on the transformed features
matrice_score = cosine_similarity(matrice_transform)

# Print the first 10 columns of the similarity matrix as an example
print('Score matrix')
print(matrice_score[:, :10])

## 9. Advanced Movie Recommendation System
### 9.1 Fetching Movie Posters Using TMDb API
**To enhance the user experience, we fetch movie posters using the TMDb API. The `get_movie_poster` function retrieves the poster URL for a given movie title and returns `None` if no poster is found or the request fails.**

In [13]:
def get_movie_poster(movie_title):
    movie = Movie()
    try:
        search = movie.search(movie_title)
        if search:
            poster_path = search[0].poster_path
            if poster_path:
                return f"https://image.tmdb.org/t/p/original{poster_path}"
    except Exception as e:
        print(f"Error fetching poster for '{movie_title}': {e}")
    return None
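
Since the posters are later shown at 150 px wide, requesting a smaller rendition than `original` may be enough. The sketch below only builds the URL and makes no network call. The helper name `poster_url`, the default size `w342`, and the path `/example.jpg` are assumptions for illustration, and the sizes actually on offer should be checked against TMDb's configuration data.

```python
TMDB_IMAGE_BASE = "https://image.tmdb.org/t/p"

def poster_url(poster_path, size="w342"):
    """Build a poster URL from a TMDb poster_path such as '/example.jpg'."""
    if not poster_path:
        return None
    return f"{TMDB_IMAGE_BASE}/{size}{poster_path}"

# '/example.jpg' is a made-up path, not a real poster
print(poster_url("/example.jpg"))
print(poster_url("/example.jpg", size="original"))
```

Passing `size="original"` reproduces the URL shape used in `get_movie_poster`.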

### 9.2 Building the Recommendation System
**The `film_recommander` function takes a movie title as input and returns a list of dictionaries, one per recommended movie, each holding the title (`titre`) and the poster URL (`poster`).**

In [16]:
def film_recommander(titre, data, matrice_score, nombre=5):
    lignes = data.index[data['name'].str.lower() == titre.lower()]
    if len(lignes) == 0:
        return [{'titre': 'Sorry! No similar movies found.', 'poster': None}]

    ligne = lignes[0]
    if ligne >= len(matrice_score):
        return []

    films_similaires = list(enumerate(matrice_score[ligne]))
    recommandations = sorted(films_similaires, key=lambda x: x[1], reverse=True)
    top_films_recommandes = recommandations[1:nombre + 1]
    print(f"Because you liked '{titre}', we recommend:")
    films_recommandes = []
    for i in range(len(top_films_recommandes)):
        indice_film = top_films_recommandes[i][0]
        if indice_film < len(data):
            titre_film = data.iloc[indice_film]['name']
            poster_url = get_movie_poster(titre_film)
            films_recommandes.append({'titre': titre_film, 'poster': poster_url})
    return films_recommandes

### 9.3 Displaying Recommendations
**Finally, we take user input and display the top 5 recommended movies with their posters.**

In [18]:
utilisateur = input("Enter a movie title: ")
recommandations = film_recommander(titre=utilisateur, data=df, matrice_score=matrice_score, nombre=5)

html_content = "<div style='display: flex; flex-wrap: wrap;'>"
for film in recommandations:
    html_content += "<div style='margin: 10px; text-align: center;'>"
    if film['poster']:
        html_content += f"<img src='{film['poster']}' style='width: 150px;'><br>"
    html_content += f"<p>{film['titre']}</p>"
    html_content += "</div>"
html_content += "</div>"

display(HTML(html_content))
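
Movie titles can contain characters such as `&` or `<` that have a meaning in HTML. As a hedged sketch, the rendering loop could be wrapped in a function that escapes them with the standard library's `html.escape`. The function name `render_cards` and the sample title are assumptions introduced here.

```python
from html import escape

def render_cards(films):
    """Build an HTML flex container with one card per film dict ('titre', 'poster')."""
    cards = []
    for film in films:
        img = ""
        if film.get("poster"):
            src = escape(film["poster"], quote=True)
            img = f"<img src='{src}' style='width: 150px;'><br>"
        title = escape(film["titre"])
        cards.append(
            f"<div style='margin: 10px; text-align: center;'>{img}<p>{title}</p></div>"
        )
    return "<div style='display: flex; flex-wrap: wrap;'>" + "".join(cards) + "</div>"

# Made-up title containing characters that need escaping
sample = [{"titre": "Fast & Furious <3>", "poster": None}]
html_out = render_cards(sample)
print(html_out)
```

In the notebook, the returned string could be passed to `display(HTML(...))` in the same way as `html_content`.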

## 10. Conclusion
**In this notebook, we've gone through various steps of movie data processing, including handling missing values, analyzing genre distribution, applying KMeans clustering for segmentation, and building a movie recommendation system using cosine similarity and feature engineering. We've also enhanced the recommendations with movie posters fetched from the TMDb API.**