### Importing Libraries

In [None]:
import os
import sys
import numpy as np
import pandas as pd

### Adding `utils` directory to `PYTHONPATH`

In [None]:
sys.path.append(os.path.abspath("../utils"))

### Reading Merged Data

In [None]:
# Importing load_csv function from read_data module
from read_data import load_csv

In [None]:
# Loading movies data
merged_df = load_csv('clean_data', 'merged_data.csv')
merged_df.head()

### Summary of the DataFrame

In [None]:
# Importing dataframe_summary function from summary module
from summary import dataframe_summary

In [None]:
# Printing the summary of merged dataframe
dataframe_summary(merged_df)

### Filtering Important Columns

Since we're building a content-based movie recommender system,
<br>
we need columns that provide meaningful information about each movie to generate recommendations.
<br>
After analyzing the data, I selected the columns that are most relevant for this task:
- `movie_id` : Uniquely identifies each movie (useful when building the web application for displaying the posters).
- `title` : Name of the movie.
- `overview` : A short summary of the movie.
- `genres` : Lists the genres the movie belongs to (e.g., action, comedy, drama).
- `keywords` : Key phrases associated with the movie (e.g., space travel, dystopia).
- `cast` : The main actors in the movie.
- `crew` : Details about people involved behind the scenes (e.g., directors, writers).

In [None]:
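# Filtering important columns
# (.copy() gives an independent DataFrame, so later column assignments don't trigger SettingWithCopyWarning)
filtered_df = merged_df[['movie_id','title','overview','genres','keywords','cast','crew']].copy()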
filtered_df.head()

### Summary of Filtered DataFrame

In [None]:
# Importing dataframe_summary function from summary module
from summary import dataframe_summary

In [None]:
# Printing the summary of filtered dataframe
dataframe_summary(filtered_df)

### Handling Null Values

In [None]:
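# Only a very small number of rows have null values (in the overview column),
# and the dataset is large enough that we can drop them.
# Resetting the index keeps label-based lookups such as filtered_df['genres'][0] valid.
filtered_df = filtered_df.dropna().reset_index(drop=True)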

In [None]:
# Data after removing null values
filtered_df.isna().sum()

### Analyzing Data before Cleaning

In [None]:
# Top 5 Rows of Filtered Data
filtered_df.head()

We can start cleaning the data with these oddly structured columns, but there is one problem.
<br>
Each of these columns holds a list of dictionaries, but the list itself is stored as a string.
<br>
The same problem affects the other columns too, so a single solution will help with all of them.

```python
'[
  {"id": 28, "name": "Action"},
  {"id": 12, "name": "Adventure"},
  {"id": 14, "name": "Fantasy"},
  {"id": 878, "name": "Science Fiction"}
]'
```

In [None]:
# Structure of the Data
filtered_df['genres'][0]

In [None]:
# Type of the Data
print(f"Type: {type(filtered_df['genres'][0])}")

Python's `ast` (Abstract Syntax Trees) module provides a function called `literal_eval`.
<br>
It takes a string that contains a Python literal (such as a list or a dictionary) and converts it into the corresponding Python object, without executing arbitrary code.

In [None]:
# Importing ast
import ast

In [None]:
# Converts the string back into a real list of dictionaries
print(ast.literal_eval(filtered_df['genres'][0]))
print()
print(f"Type: {type(ast.literal_eval(filtered_df['genres'][0]))}")
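
As a small, self-contained illustration of what `literal_eval` accepts and rejects, here is a sketch using a shortened version of the genres string shown earlier. The variable names (`raw`, `parsed`, `rejected`) are only for this example and are not part of the notebook.

```python
import ast

# A shortened genres string, in the same shape as the column values
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# A string holding a plain literal is parsed into real Python objects
parsed = ast.literal_eval(raw)
print(type(parsed).__name__, parsed[0]["name"])

# Anything that is not a literal (here, a function call) is rejected
try:
    ast.literal_eval("len([1, 2, 3])")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Because only literals are accepted, `literal_eval` is a better fit than `eval` for parsing data columns like these.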

#### Cleaning `genres` Column

In [None]:
# Importing extract_genre function from get_genre module
from get_genre import extract_genre

In [None]:
# Extracting genres of each Movie
filtered_df['genres'] = filtered_df['genres'].apply(extract_genre)
filtered_df.head()
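
The `extract_genre` helper lives in the `utils` directory and its source is not shown in this notebook. As a rough sketch of what such a helper might do, assuming it parses the string with `ast.literal_eval` and keeps each `name` value, consider the hypothetical `extract_names` function below.

```python
import ast

def extract_names(raw):
    """Hypothetical stand-in for extract_genre: return every 'name' value."""
    # Parse the stringified list of dictionaries into real Python objects
    items = ast.literal_eval(raw)
    # Keep only the human-readable name of each entry
    return [item["name"] for item in items]

sample = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'
print(extract_names(sample))  # ['Action', 'Science Fiction']
```

Passing a function like this to `.apply` would turn each string in the column into a plain list of names.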

#### Cleaning `keywords` Column

In [None]:
# The keywords column has exactly the same structure as genres,
# so the same extraction logic works for it too.
# extract_keyword is that same function, renamed and with its docstring updated for keywords.
filtered_df['keywords'][0]

In [None]:
# Importing extract_keyword function from get_keyword module
from get_keyword import extract_keyword

In [None]:
# Extracting keywords of each Movie
filtered_df['keywords'] = filtered_df['keywords'].apply(extract_keyword)
filtered_df.head()

#### Cleaning `cast` Column

In [None]:
# A movie's full cast can be very large (lead actors, supporting actors, cameos, extras, etc.),
# so we extract only the top 3 cast members and use them as the cast.
filtered_df['cast'][0]

In [None]:
# Importing extract_cast function from get_cast module
from get_cast import extract_cast

In [None]:
# Extracting top 3 cast members of each Movie
filtered_df['cast'] = filtered_df['cast'].apply(extract_cast)
filtered_df.head()
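
The source of `extract_cast` is not shown here either. A minimal sketch of the "top 3" idea, assuming each cast entry is a dictionary with a `name` key and that entries are already ordered by billing, could look like the hypothetical `top_cast` function below. The sample names are placeholders.

```python
import ast

def top_cast(raw, limit=3):
    """Hypothetical stand-in for extract_cast: names of the first `limit` entries."""
    # Parse the stringified list of dictionaries
    items = ast.literal_eval(raw)
    # Slice first, so only the leading entries are read
    return [item["name"] for item in items[:limit]]

sample = ('[{"name": "Actor A"}, {"name": "Actor B"}, '
          '{"name": "Actor C"}, {"name": "Actor D"}]')
print(top_cast(sample))  # ['Actor A', 'Actor B', 'Actor C']
```

If a movie lists fewer than three cast members, the slice simply returns whatever is available.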

#### Cleaning `crew` Column

In [None]:
# A movie's crew can be very large (director, producer, writer, camera operator, sound designer, editor, etc.).
# We fetch only the name of the director, since directors are a major part of a movie's hype and popularity,
# like James Cameron for Avatar and Christopher Nolan for Interstellar.
# It also helps recommend other movies by the same director.
filtered_df['crew'][0]

In [None]:
# Importing extract_director function from get_director module
from get_director import extract_director

In [None]:
# Extracting only directors from crew of each Movie
filtered_df['crew'] = filtered_df['crew'].apply(extract_director)
filtered_df.head()
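
The source of `extract_director` is likewise not shown. The sketch below assumes each crew entry is a dictionary with `job` and `name` keys and that directors are marked with `job == "Director"`; both the key names and the hypothetical `directors` function are assumptions for illustration.

```python
import ast

def directors(raw):
    """Hypothetical stand-in for extract_director: names of crew whose job is Director."""
    # Parse the stringified list of dictionaries
    crew = ast.literal_eval(raw)
    # Keep only entries flagged as the director (assumed 'job' key)
    return [member["name"] for member in crew if member.get("job") == "Director"]

sample = ('[{"job": "Editor", "name": "Person A"}, '
          '{"job": "Director", "name": "James Cameron"}]')
print(directors(sample))  # ['James Cameron']
```

Returning a list rather than a single string keeps the column consistent with `genres`, `keywords` and `cast`, and also covers movies with more than one director.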

#### Cleaning `overview` Column

In [None]:
# The overview column is a plain string, while the other columns are now lists.
# That mismatch would cause problems when we combine all columns into a single set of tags for each movie,
# so we convert overview into a list as well for easier handling.
filtered_df['overview'][0]

In [None]:
# Converting overview column from string to list of words
# (split() with no argument splits on any whitespace and drops empty strings)
filtered_df['overview'] = filtered_df['overview'].apply(lambda x: x.split())
filtered_df.head()
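
One detail worth knowing when turning text into a list of words: splitting on a literal space and splitting on any whitespace behave differently if the text contains repeated spaces or line breaks. The sample string below is made up purely to show the difference.

```python
# Made-up text with a double space and a line break
text = "space  travel\nand dystopia"

# Splitting on a single space keeps an empty string and leaves the newline inside a token
by_space = text.split(' ')
print(by_space)  # ['space', '', 'travel\nand', 'dystopia']

# Splitting with no argument treats any run of whitespace as one separator
by_whitespace = text.split()
print(by_whitespace)  # ['space', 'travel', 'and', 'dystopia']
```

Empty strings and tokens with embedded newlines would otherwise end up in the combined tags.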

#### Removing Spaces between Words

##### Problem with Spaces
Suppose our data contains two directors with similar names, such as:
- `Paul Thomas Anderson`
- `Paul William Anderson`

If names are split into separate words, both directors share the tokens "`Paul`" and "`Anderson`".
<br>
The model might then mistakenly treat "`Paul Thomas Anderson`" and "`Paul William Anderson`" as the same person, leading to incorrect recommendations.
<br>
For example, it might recommend movies directed by "`Paul Thomas Anderson`" to someone who likes movies by "`Paul William Anderson`",
<br>
because it matches on the shared words rather than on the full name.

<hr>

##### After Removing Spaces
If we remove the spaces, "`PaulThomasAnderson`" and "`PaulWilliamAnderson`" each become a single token, which avoids this confusion.
<br>
The model can then tell that "`PaulThomasAnderson`" is not the same as "`PaulWilliamAnderson`",
<br>
because the two are distinct tokens that no longer share "`Paul`" or "`Anderson`".
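
The effect can be checked directly by counting shared tokens. This sketch assumes the text is tokenized on whitespace, which is how typical bag-of-words vectorizers behave; the `tokens` helper is hypothetical and exists only for this illustration.

```python
def tokens(names):
    """Split each name on whitespace and collect the resulting tokens."""
    return {tok for name in names for tok in name.split()}

a = ["Paul Thomas Anderson"]
b = ["Paul William Anderson"]

# With spaces, the two names share two of their three tokens
print(sorted(tokens(a) & tokens(b)))  # ['Anderson', 'Paul']

# With spaces removed, each name is a single token and nothing overlaps
a_joined = [name.replace(" ", "") for name in a]
b_joined = [name.replace(" ", "") for name in b]
print(sorted(tokens(a_joined) & tokens(b_joined)))  # []
```

The same reasoning applies to multi-word genres and keywords, such as "`Science Fiction`" becoming "`ScienceFiction`".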

In [None]:
# Importing remove_spaces function from clean_text module
from clean_text import remove_spaces

In [None]:
# Removing spaces between words in the genres, keywords, cast and crew columns
filtered_df['genres'] = remove_spaces(filtered_df, 'genres')
filtered_df['keywords'] = remove_spaces(filtered_df, 'keywords')
filtered_df['cast'] = remove_spaces(filtered_df, 'cast')
filtered_df['crew'] = remove_spaces(filtered_df, 'crew')

In [None]:
# Top 5 rows after removing spaces
filtered_df.head()