
In [None]:
import pandas as pd
import numpy as np
import re

### Data Loading

This section loads the movie data from the `movie_data_2008_2024.csv` file into a pandas DataFrame named `df`. This is the foundational step, making the data available for all subsequent processing, cleaning, and analysis.

In [None]:
# 1. Load Data
df = pd.read_csv('movie_data_2008_2024.csv')


## Date Cleaning and Filtering

This section focuses on cleaning and standardizing the `release_date` column and then filtering the dataset.

### The `parse_date` Function
The `parse_date` function converts date strings into pandas `Timestamp` values. It works in three steps:
- **Initial Cleaning**: It keeps only the text before the first newline (`\n`) or hyphen (`-`) and strips surrounding whitespace. This discards any extra information that follows the date itself.
- **Format Conversion**: It then parses the cleaned string with the format `'%b %d, %Y'` (e.g., 'Jul 18, 2008').
- **Missing and Invalid Values**: If the input is missing (`NaN`) or the cleaned string does not match the format, the function returns `pd.NaT` (Not a Time), pandas' representation of a missing datetime value.
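
The three steps above can be traced on a few sample strings. This is a minimal sketch: the sample values and the helper name `first_part` are hypothetical and are not taken from the dataset.

```python
import re
import pandas as pd

# Hypothetical raw values, chosen to exercise each branch described above
samples = ["Jul 18, 2008", "Dec 18, 2009\nWide release", "Mar 14, 2007 - limited", "TBD", None]

def first_part(value):
    """Return the text before the first newline or hyphen, stripped."""
    return re.split(r"[\n-]", str(value))[0].strip()

parsed = [
    pd.NaT if pd.isna(v) else pd.to_datetime(first_part(v), format="%b %d, %Y", errors="coerce")
    for v in samples
]

for raw, ts in zip(samples, parsed):
    print(repr(raw), "->", ts)
```

One consequence of splitting on hyphens is that an ISO-style string such as '2008-07-18' is cut at its first hyphen and ends up as `pd.NaT`.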

### Data Filtering:
After applying the `parse_date` function to the `release_date` column, two subsequent filtering steps are performed:
1. **Drop Missing Dates**: Rows where `release_date` is `pd.NaT` (i.e., dates that were originally missing or could not be parsed) are dropped from the DataFrame. This ensures that all remaining entries in the `release_date` column are valid datetime objects.
2. **Filter by Year**: The DataFrame is then filtered to include only movies released from the year 2008 onwards (`df['release_date'].dt.year >= 2008`). This aligns the dataset with the specified time frame of interest, ensuring that only relevant movie data is analyzed.
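
The parse-then-filter sequence can also be written with vectorized pandas calls instead of a row-wise `apply`. The following is a sketch on a hypothetical toy DataFrame (`toy` and its rows are made up for illustration); it is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_date": ["Jul 18, 2008", "Mar 14, 2007", None, "Dec 18, 2009\nWide"],
})

# Keep the text before the first newline or hyphen, then parse; failures become NaT
first_part = toy["release_date"].str.extract(r"^([^\n-]*)", expand=False).str.strip()
toy["release_date"] = pd.to_datetime(first_part, format="%b %d, %Y", errors="coerce")

# Drop unparseable dates, then keep 2008 and later
toy = toy.dropna(subset=["release_date"])
toy = toy[toy["release_date"].dt.year >= 2008]
print(toy)
```

Here `errors="coerce"` plays the role of the `try`/`except` in `parse_date`: values that do not match the format become `pd.NaT` and are removed by `dropna`.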

In [None]:
# --- CLEANING STEPS ---
def parse_date(date_str):
    """Parse strings such as 'Jul 18, 2008' into a Timestamp; return NaT on failure."""
    if pd.isna(date_str):
        return pd.NaT
    # Keep only the text before the first newline or hyphen
    clean_str = re.split(r'[\n-]', str(date_str))[0].strip()
    try:
        return pd.to_datetime(clean_str, format='%b %d, %Y')
    except (ValueError, TypeError):
        return pd.NaT

# Clean Dates & Filter Year
df['release_date'] = df['release_date'].apply(parse_date)
df = df.dropna(subset=['release_date'])
df = df[df['release_date'].dt.year >= 2008]

### Dropping Unneeded Columns

Before further processing, the following columns are dropped:

- **`opening_source`**: This column records the source of the opening revenue data, which is not relevant to the analysis or modeling goals. Its values are often uniform or uninformative.
- **`imdb_votes`**: IMDb vote counts indicate a movie's popularity, but `tmdb_rating` and `tmdb_votes` already carry similar information and are consistent with the TMDb data used elsewhere. The column may also contain a significant number of missing values.

Dropping these columns reduces noise, simplifies the dataset, and avoids redundancy in downstream tasks.
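
Claims such as "uniform values" or "many missing values" can be checked before dropping. The sketch below uses a hypothetical toy DataFrame (its values are invented) to report the share of missing entries and the number of distinct values per candidate column, and then drops the columns with `errors="ignore"` so that absent names do not raise.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the real columns may look different
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "opening_source": ["web", "web", "web"],
    "imdb_votes": [1200.0, np.nan, np.nan],
})

candidates = ["opening_source", "imdb_votes"]
report = pd.DataFrame({
    "missing_share": toy[candidates].isna().mean(),
    "unique_values": toy[candidates].nunique(),
})
print(report)

# errors="ignore" skips names that are not present in the DataFrame
toy = toy.drop(columns=candidates + ["not_a_column"], errors="ignore")
print(list(toy.columns))
```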

In [None]:
# Drop Columns
cols_to_drop = ['opening_source', 'imdb_votes']
df = df.drop(columns=cols_to_drop, errors='ignore')  # Skip names that are not present

### Cleaning the `distributor` Column

This step cleans the `distributor` column so that distributor names are consistent. The code performs the following actions:

- **Converts to String**: Ensures the column is of string type using `.astype(str)` to handle potential mixed data types.
- **Removes Placeholder Text**: Uses `.str.replace('See full company information', '', regex=False)` to remove a placeholder phrase that appears in some entries and is not part of the distributor name.
- **Removes Extra Whitespace**: Applies `.str.strip()` to remove leading or trailing whitespace left by the replacement or present in the original data.
- **Restores Missing Values**: Because `.astype(str)` turns missing entries into the literal strings `'nan'` or `'None'`, these are mapped back to `None` so they are not mistaken for real distributor names.

This standardizes the distributor names and makes the column more suitable for analysis or categorical encoding.
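
The effect of these steps can be seen on a small hypothetical Series (the distributor names are invented). This sketch is a variant of the notebook's approach: it uses `mask` to restore missing values and, as an added assumption, also treats an empty string left behind by the placeholder removal as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values
raw = pd.Series([
    "Example Studios See full company information",
    "  Sample Pictures ",
    np.nan,
    "See full company information",
])

cleaned = (
    raw.astype(str)
       .str.replace("See full company information", "", regex=False)
       .str.strip()
)

# Map the string artifacts of astype(str), plus empty strings, back to missing
cleaned = cleaned.mask(cleaned.isin(["nan", "None", ""]))
print(cleaned.tolist())
```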

In [None]:
# We use .str.replace() to find the specific text and replace it with nothing ('')
if 'distributor' in df.columns:
    df['distributor'] = df['distributor'].astype(str).str.replace('See full company information', '', regex=False).str.strip()
    df['distributor'] = df['distributor'].replace({'nan': None, 'None': None})

## Genre Processing

To use genre information for analysis or machine learning, the raw `genres` string needs to be processed. This section performs two key transformations:

1.  **Extracting Primary Genre**: For each movie, the first genre listed is stored as `primary_genre` (or `'Unknown'` when no genres are recorded). This provides a single main category for charts and high-level analysis.

2.  **One-Hot Encoding Target Genres**: A movie can belong to several genres, so binary indicator columns are created for six target genres: the five that the code labels as the top five (Drama, Comedy, Action, Adventure, Thriller) plus Science Fiction, which is included because it is essential for Avatar. Each column (e.g., `genre_Action`, `genre_Science Fiction`) holds `1` if the movie lists that genre and `0` otherwise. An additional `genre_Other` column is set to `1` when the movie lists at least one genre outside the target list (e.g., Horror). This turns the genre strings into numeric features that machine learning models can use directly.
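
The encoding scheme can be illustrated on a few hypothetical genre strings. In this sketch the shortened `targets` list and the sample rows are assumptions made for brevity; the logic mirrors the target-list-plus-`genre_Other` approach.

```python
import pandas as pd

# Hypothetical genre strings, including a missing value
genres = pd.Series(["Action, Science Fiction", "Horror", "Drama, Horror", None])

# Split each string into a clean list of genre names
lists = genres.fillna("").apply(lambda s: [g.strip() for g in s.split(",") if g.strip()])

targets = ["Drama", "Action", "Science Fiction"]  # shortened list for illustration

encoded = pd.DataFrame({
    f"genre_{g}": lists.apply(lambda gs, g=g: int(g in gs)) for g in targets
})

# 1 if the movie lists at least one genre outside the target list
encoded["genre_Other"] = lists.apply(lambda gs: int(any(g not in targets for g in gs)))
print(encoded)
```

A movie with no recorded genres gets `0` in every column, including `genre_Other`.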

In [None]:
# --- GENRE PROCESSING ---

# 1. Primary Genre (First item in list) - Useful for charts
df['primary_genre'] = df['genres'].apply(lambda x: str(x).split(',')[0].strip() if pd.notna(x) else 'Unknown')

# 2. Extract Genres List
df['genres_list'] = df['genres'].fillna('').apply(lambda x: [g.strip() for g in str(x).split(',') if g.strip()])

# 3. Define Target Genres (Top 5 + Sci-Fi)
# Top 5: Drama, Comedy, Action, Adventure, Thriller
# Plus: Science Fiction (Essential for Avatar)
target_genres = ['Drama', 'Comedy', 'Action', 'Adventure', 'Thriller', 'Science Fiction']

# 4. Create One-Hot Columns
for genre in target_genres:
    # Example: Creates 'genre_Action', 'genre_Science Fiction'
    df[f'genre_{genre}'] = df['genres_list'].apply(lambda x: 1 if genre in x else 0)

# 5. Create 'genre_Other'
# If the movie has a genre NOT in our target list (e.g., Horror), mark it here
def has_other(g_list):
    for g in g_list:
        if g not in target_genres:
            return 1
    return 0

df['genre_Other'] = df['genres_list'].apply(has_other)

# Drop the temporary list column
df = df.drop(columns=['genres_list'])

## Column Reordering for Readability and Structure

To make the DataFrame easier to read, the columns are reordered. Identifier, date, genre, and financial columns come first: `title`, `year`, `release_date`, `primary_genre`, `total_gross`, `opening_revenue`, and `final_budget`. These are followed by `distributor`, the TMDb and IMDb IDs, the rating columns, and details such as `runtime`, `director`, and `genres`. Any remaining columns come next, and the one-hot encoded genre columns are grouped together at the end. Columns named in the ordering but absent from the DataFrame are skipped.
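
The reordering pattern (preferred columns first, leftovers next, genre indicators last, absent names skipped) can be sketched on a hypothetical toy DataFrame. The column `extra_note` and the shortened lists are invented for illustration.

```python
import pandas as pd

# Hypothetical frame with columns in an arbitrary order
toy = pd.DataFrame({
    "genre_Action": [1],
    "runtime": [162],
    "title": ["Example"],
    "extra_note": ["x"],
    "year": [2009],
})

main_cols = ["title", "year", "runtime", "director"]  # 'director' is absent on purpose
genre_cols = ["genre_Action"]

# Anything not named explicitly goes between the main and genre columns
extra_cols = [c for c in toy.columns if c not in main_cols and c not in genre_cols]

# Skip names that do not exist in the frame
final_order = [c for c in main_cols + extra_cols + genre_cols if c in toy.columns]
toy = toy[final_order]
print(list(toy.columns))
```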

In [None]:
# --- REORDER COLUMNS ---
main_cols = [
    'title', 'year', 'release_date', 'primary_genre',
    'total_gross', 'opening_revenue', 'final_budget',  # Financials
    'distributor',
    'tmdb_id', 'imdb_id',                              # IDs
    'tmdb_rating', 'imdb_rating', 'tmdb_votes',        # Ratings
    'runtime', 'director', 'genres'                    # Details
]

# Append the specific genre columns
genre_cols = [f'genre_{g}' for g in target_genres] + ['genre_Other']

# Append any other remaining columns
extra_cols = [c for c in df.columns if c not in main_cols and c not in genre_cols]

# Apply Final Order
final_order = main_cols + extra_cols + genre_cols
df = df[[c for c in final_order if c in df.columns]]

## Saving Cleaned Data

Before moving on to analysis or modeling, the cleaned data is saved in two versions:

1.  **`cleaned_movie_data_ml_ready.csv`**: Keeps `total_gross`, `opening_revenue`, and `final_budget` as plain numbers. Use this file for machine learning model training, where numeric columns must be free of currency symbols and separators.

2.  **`cleaned_movie_data_display.csv`**: Intended for human-readable reports. `total_gross`, `opening_revenue`, and `final_budget` are formatted as currency strings (e.g., '$123,456.78'), and missing values are written as '$0.00'. This version should **not** be used directly for machine learning, because the formatted values are strings rather than numbers and a missing amount cannot be told apart from a true zero.

Choose the appropriate file based on your next steps: `_ml_ready.csv` for modeling, and `_display.csv` for reporting.
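
The difference between the two files shows up when they are read back. The sketch below writes a hypothetical one-row frame to in-memory buffers rather than to disk (the helper `round_trip` is invented for illustration) and compares the resulting column types.

```python
import io
import pandas as pd

# Hypothetical one-row dataset
numeric = pd.DataFrame({"title": ["Example"], "total_gross": [123456.78]})

display = numeric.copy()
display["total_gross"] = display["total_gross"].map("${:,.2f}".format)

def round_trip(frame):
    """Write a frame to CSV text and read it back, without touching the disk."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=False)
    buffer.seek(0)
    return pd.read_csv(buffer)

numeric_back = round_trip(numeric)
display_back = round_trip(display)

print(numeric_back["total_gross"].dtype)     # numeric column survives as a float
print(display_back["total_gross"].iloc[0])   # formatted column comes back as text

# Recovering numbers from the display version requires stripping '$' and ','
recovered = pd.to_numeric(display_back["total_gross"].str.replace(r"[$,]", "", regex=True))
```

The extra parsing step needed for the display file is the reason the numeric file is the one to use for modeling.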

In [None]:

# --- SAVE OUTPUTS ---

# Version 1: Machine Learning Ready (Numeric)
df.to_csv('cleaned_movie_data_ml_ready.csv', index=False)

# Version 2: Display Ready (Currency Strings)
df_display = df.copy()
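def clean_currency(x):
    """Format a number as a currency string; missing values become '$0.00'."""
    if pd.isna(x):
        return "$0.00"
    try:
        return "${:,.2f}".format(float(x))
    except (ValueError, TypeError):
        return x  # Leave non-numeric values unchanged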

for col in ['total_gross', 'opening_revenue', 'final_budget']:
    df_display[col] = df_display[col].apply(clean_currency)

df_display.to_csv('cleaned_movie_data_display.csv', index=False)

print("✅ Files Created:")
print("1. cleaned_movie_data_ml_ready.csv (Numeric - Use this for Standardization/ML)")
print("2. cleaned_movie_data_display.csv (Currency Strings - Use this for Reports)")



✅ Files Created:
1. cleaned_movie_data_ml_ready.csv (Numeric - Use this for Standardization/ML)
2. cleaned_movie_data_display.csv (Currency Strings - Use this for Reports)


### The Importance of Feature Scaling for Machine Learning


As the dataset shows, `total_gross` reaches hundreds of millions, while `tmdb_rating` typically lies between 0 and 10. If such columns are used without scaling, models that are sensitive to feature magnitude (for example, those trained with gradient descent or based on distances) can be dominated by the larger-valued column.

To put features on a comparable scale, apply **normalization** or **standardization** before training. Common techniques include:
- **Min-Max Scaling:** Rescales features to a fixed range, usually 0 to 1.
- **Standardization (Z-score normalization):** Rescales features to have a mean of 0 and a standard deviation of 1.

With features on a comparable scale, such models tend to converge more reliably, and no single feature dominates merely because of its units.
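
Both techniques can be written in a few lines of pandas. This is a minimal sketch on hypothetical values (the numbers are invented, not taken from the dataset); it uses the population standard deviation (`ddof=0`) for the z-score.

```python
import pandas as pd

# Hypothetical values on very different scales
toy = pd.DataFrame({
    "total_gross": [1.0e6, 5.0e7, 9.0e8],
    "tmdb_rating": [5.5, 6.6, 8.0],
})

# Min-max scaling: x' = (x - min) / (max - min), giving values in [0, 1]
min_max = (toy - toy.min()) / (toy.max() - toy.min())

# Standardization: z = (x - mean) / std, giving mean 0 and standard deviation 1
z_score = (toy - toy.mean()) / toy.std(ddof=0)

print(min_max.round(3))
print(z_score.round(3))
```

In a modeling workflow, the minimum, maximum, mean, and standard deviation would normally be computed on the training split only and then reused to transform validation and test data.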

In [None]:
# --- STANDARDIZATION CHECK ---
print("\n📊 Standardization Check (Why you need it):")
stats = df[['total_gross', 'tmdb_rating']].describe().loc[['mean', 'max']]
print(stats)
print("\n--> 'total_gross' max is ~900 Million, while 'rating' max is 10.")
print("--> You MUST standardize these features before Machine Learning.")


📊 Standardization Check (Why you need it):
       total_gross  tmdb_rating
mean  5.816772e+07     6.626091
max   9.366622e+08    10.000000

--> 'total_gross' max is ~900 Million, while 'rating' max is 10.
--> You MUST standardize these features before Machine Learning.


## Summary


*   The cleaning pipeline covers date parsing, column dropping, distributor cleanup, genre encoding, and column reordering, and produces a dataset ready for analysis and machine learning model development.
*   Keeping separate ML-ready and display-ready files prevents data type errors in machine learning pipelines while still providing a formatted version for reporting.
