### Importing Libraries

In [None]:
import os
import sys
import numpy as np
import pandas as pd

### Adding `utils` directory to `PYTHONPATH`

In [None]:
sys.path.append(os.path.abspath("../utils"))

### Reading Merged Data

In [None]:
# Importing load_csv function from read_data module
from read_data import load_csv

In [None]:
# Loading movies data
merged_df = load_csv('clean_data', 'merged_data.csv')
merged_df.head()

### Summary of the DataFrame

In [None]:
# Importing dataframe_summary function from summary module
from summary import dataframe_summary

In [None]:
# Printing the summary of merged dataframe
dataframe_summary(merged_df)

### Filtering Important Columns

Since we're building a content-based movie recommender system,
<br>
we need columns that provide meaningful information about the movies to generate recommendations.
<br>
After analyzing the data, I kept the columns that are most relevant for this task:
- `movie_id` : Uniquely identifies each movie (useful when building the web application for displaying the posters).
- `title` : Name of the movie.
- `overview` : A short summary of the movie.
- `genres` : Lists the genres the movie belongs to (e.g., action, comedy, drama).
- `keywords` : Key phrases associated with the movie (e.g., space travel, dystopia).
- `cast` : The main actors in the movie.
- `crew` : Details about people involved behind the scenes (e.g., directors, writers).

In [None]:
# Filtering important columns (copy so later assignments don't act on a view of merged_df)
filtered_df = merged_df[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']].copy()
filtered_df.head()

### Summary of Filtered DataFrame

In [None]:
# Importing dataframe_summary function from summary module
from summary import dataframe_summary

In [None]:
# Printing the summary of filtered dataframe
dataframe_summary(filtered_df)

### Handling Null Values

In [None]:
# The overview column has only a very small number of null values,
# and the dataset is large enough, so we can drop those rows.
filtered_df = filtered_df.dropna()

In [None]:
# Null counts per column after removing null values
filtered_df.isna().sum()
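
Before dropping rows, it can help to confirm how small the null share really is, and to remember that `dropna()` keeps the original index labels. A minimal sketch on a made-up toy frame (`toy_df` and its values are hypothetical, not taken from the merged data):

```python
import pandas as pd

# Hypothetical toy data standing in for the filtered movie frame
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "overview": ["text", None, "text", "text"],
})

# Fraction of null values per column
null_share = toy_df.isna().mean()

# Drop rows containing nulls and rebuild a clean 0..n-1 index
cleaned = toy_df.dropna().reset_index(drop=True)

print(null_share)
print(len(toy_df), "->", len(cleaned))
```

Resetting the index is optional; without it, the surviving rows keep their old labels, so positional access is safer through `.iloc`.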

### Analyzing Data before Cleaning

In [None]:
# Top 5 Rows of Filtered Data
filtered_df.head()

We can start cleaning the data with these oddly structured columns, but there is one problem.
<br>
A column such as `genres` holds a list of dictionaries, but the list itself is stored as a string.
<br>
Other columns have the same problem, so one solution will help with all of them.

```python
'[
  {"id": 28, "name": "Action"},
  {"id": 12, "name": "Adventure"},
  {"id": 14, "name": "Fantasy"},
  {"id": 878, "name": "Science Fiction"}
]'
```

In [None]:
# Structure of the Data
filtered_df['genres'].iloc[0]

In [None]:
# Type of the Data
print(f"Type: {type(filtered_df['genres'].iloc[0])}")

Python's standard library has a module named `ast` (Abstract Syntax Trees), which provides the function `literal_eval`.
<br>
It takes a string containing a Python literal (such as a list or dictionary) and converts it into the corresponding Python object.

In [None]:
# Importing ast
import ast

In [None]:
# Convert the string back into a real list of dictionaries
parsed_genres = ast.literal_eval(filtered_df['genres'].iloc[0])
print(parsed_genres)
print()
print(f'Type: {type(parsed_genres)}')
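
The same behaviour can be checked without the dataset. The sketch below uses a hand-written sample string (an assumption, shaped like the `genres` values above) and also shows that `literal_eval` accepts only literals, so an expression such as a function call is rejected rather than executed:

```python
import ast

# Hand-written sample shaped like a genres value (not read from the data)
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = ast.literal_eval(raw)
print(type(raw).__name__, "->", type(parsed).__name__)
print(parsed[0]["name"])

# Anything that is not a plain literal raises instead of being evaluated
rejected = False
try:
    ast.literal_eval("len([1, 2])")
except (ValueError, SyntaxError):
    rejected = True
print("function call rejected:", rejected)
```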

#### Cleaning `genres` Column

In [None]:
# Importing extract_genre function from get_genre module
from get_genre import extract_genre

In [None]:
# Extracting genres of each Movie
filtered_df['genres'] = filtered_df['genres'].apply(extract_genre)
filtered_df.head()
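
The `get_genre` module is not shown here, so the following is only a guess at what such a helper might look like. `extract_names` is a hypothetical stand-in, not the actual `extract_genre` implementation: it parses the string with `ast.literal_eval` and keeps each `name` value.

```python
import ast

def extract_names(raw):
    """Hypothetical helper: turn a stringified list of dicts into a list of names."""
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Malformed or non-string input: fall back to an empty list
        return []
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items
            if isinstance(item, dict) and "name" in item]

# Hand-written sample shaped like a genres value
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(extract_names(sample))
print(extract_names("not a list"))
```

Returning an empty list on bad input keeps `.apply` from failing on a single malformed row; whether that is the right trade-off depends on how much such rows matter downstream.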