### **Data Cleaning: TMDB Movies**
The TMDB dataset provides popularity metrics and genre classifications for each movie. To prepare it for analysis, we apply the following preprocessing steps:

1.  **Removing Redundancy**: Drop the `Unnamed: 0` column, which serves as a duplicate index, to streamline the dataframe.
2.  **Temporal Feature Engineering**: Convert `release_date` to a standard datetime format and extract `release_year` into its own column. This lets us group and compare movies by release year when analyzing trends over time.
3.  **Genre Data Parsing**: The `genre_ids` are currently stored as strings (e.g., `"[12, 14]"`). We use `ast.literal_eval` to convert these into actual Python lists, enabling us to later map these IDs to their specific genre names (like Action or Comedy).

In [None]:
# ==========================================
# DATA CLEANING: TMDB Movies
# ==========================================
# 1. Drop redundant index column
if 'Unnamed: 0' in tmdb_movies.columns:
    tmdb_movies.drop(columns=['Unnamed: 0'], inplace=True)

# 2. Feature Engineering: Convert release_date to Datetime and extract Year
tmdb_movies['release_date'] = pd.to_datetime(tmdb_movies['release_date'])
tmdb_movies['release_year'] = tmdb_movies['release_date'].dt.year

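# 3. Clean Genre IDs
# They are strings like "[12, 14]". We turn them into actual Python lists.
# Non-string values (e.g., NaN or already-parsed lists) are left unchanged.
import ast  # needed for literal_eval; harmless if already imported earlier

tmdb_movies['genre_ids'] = tmdb_movies['genre_ids'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)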
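
Once `genre_ids` holds real lists, mapping IDs to names becomes a simple lookup. The sketch below illustrates the idea on a toy dataframe; the `GENRE_NAMES` dictionary is only an illustrative subset, and the helper `ids_to_names` and the `genre_names` column are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Illustrative subset of a genre lookup (assumption: replace with the full
# ID-to-name mapping used in your project).
GENRE_NAMES = {12: "Adventure", 14: "Fantasy", 28: "Action", 35: "Comedy"}


def ids_to_names(ids, lookup=GENRE_NAMES):
    """Translate a list of genre IDs into names; unknown IDs are flagged, not dropped."""
    if not isinstance(ids, list):
        return []  # e.g., a missing value that was never parsed into a list
    return [lookup.get(i, f"Unknown ({i})") for i in ids]


# Toy frame standing in for tmdb_movies after the cleaning step
demo = pd.DataFrame({"title": ["A", "B"], "genre_ids": [[12, 14], [28, 35, 999]]})
demo["genre_names"] = demo["genre_ids"].apply(ids_to_names)

# One row per (movie, genre) pair, which is convenient for per-genre aggregation
demo_long = demo.explode("genre_names")
```

Exploding the list column is one possible design; keeping the lists intact may be preferable if each movie should remain a single row.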


### **Data Export**
After performing feature engineering and cleaning, we save the processed TMDB dataset to a CSV file. This ensures that the cleaned data is ready for the exploratory data analysis (EDA) phase without needing to rerun the cleaning script.

In [None]:
# Work on a copy so the in-memory tmdb_movies dataframe stays untouched
tmdb_movies_cleaned = tmdb_movies.copy()
tmdb_movies_cleaned.to_csv(
    r"C:\Users\Hp\Flatiron\Phase_2\Phase-2-Movie-Analysis-Project\data\cleanedData\tmdb_cleaned_data.csv",
    index=False
)
print("Saved successfully")
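
One thing to keep in mind when reloading the exported file: CSV stores everything as text, so datetime and list columns come back as strings unless they are parsed again. The following is a minimal sketch of a round trip on a toy dataframe, assuming the same column names as above; the temporary file name `demo_cleaned.csv` is hypothetical.

```python
import ast
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned TMDB data
demo = pd.DataFrame({
    "release_date": pd.to_datetime(["2010-07-16", "2019-04-26"]),
    "genre_ids": [[12, 14], [28]],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_cleaned.csv")  # hypothetical file name
    demo.to_csv(path, index=False)

    # Re-parse dates and lists on load; without these arguments both
    # columns would be read back as plain strings.
    reloaded = pd.read_csv(
        path,
        parse_dates=["release_date"],
        converters={"genre_ids": ast.literal_eval},
    )
```

If `genre_ids` can contain empty cells, the converter would need to handle them, since `ast.literal_eval` raises an error on an empty string.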