# Dirty IMDb Top 1000 Movies Data Cleaning
**Datasets :** [Dirty_imdb_top_1000.csv](../Datasets/Dirty_imdb_top_1000.csv)

**Author   :** Fajar Laksono 

**Github   :** http://fajarlaksono.github.io/

## 1. Overview

This project focuses on understanding and preparing the **Dirty IMDb Top 1000 Movies** dataset (`Dirty_imdb_top_1000.csv`), which is provided in a "dirty", uncleaned format. The dataset contains multiple issues, such as a missing required column, incomplete values, and inconsistent formatting, that prevent us from extracting any insight from it.

## 2. Objectives
The primary objective of this project is to identify and clean these issues in order to produce a more reliable dataset that can serve as a foundation for meaningful analysis.

Key steps in this project will include:
- Identifying data quality issues.
- Recovering the missing country column.
- Cleaning and resolving the issues.
- Producing a cleaned version of the dataset.

The dataset contains 1,000 movies from various countries. As part of the insight extraction process, we need to recover the missing column that indicates each movie's country of origin.

By systematically addressing these problems, we aim to transform the raw dataset into a dependable resource for exploratory and descriptive analysis. The actual extraction of insights will be conducted in the subsequent project.

## 3. Preparation

### 3.1. Import Libraries

In [100]:
import pandas as pd
import os

print("Pandas Version:", pd.__version__)
print("Current Working Directory:", os.getcwd())

Pandas Version: 2.2.3
Current Working Directory: D:\Project\Github\FajarLaksono\analytics-dirty-imdb-data


### 3.2. Load Dataset

In [117]:
dataset_path = 'Datasets/01_Raw/Dirty_imdb_top_1000.csv'
df = pd.read_csv(dataset_path)

## 4. Exploratory Data Analysis (EDA)

### 4.1. Data Preview

In [102]:
df.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,IMDB_Rating,Overview,Meta_score,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Director_Genre
0,https://m.media-amazon.com/images/M/MV5BMDFkY...,the shawshank redemption,1994.0,,142 min,9.3,Two imprisoned men bond over a number of years...,80.0,Tim @ Robbins,,Bob Gunton,William Sadler,2343110.0,28341469.0,Frank Darabont *Drama
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,THE GODFATHER,1972.0,A,175 min,9.2,An organized crime dynasty's aging patriarch t...,100.0,Marlon @ Brando,,James Caan,Diane Keaton,1620367.0,,"Francis Ford Coppola*Crime, Drama"
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008.0,UA,152 min,9.0,When the menace known as the Joker wreaks hav...,84.0,@ Christian @ Bale @,Heath Ledger,Aaron Eckhart,Michael Caine,2303232.0,534858444.0,"Christopher Nolan * Action, Crime, Drama"
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,THE GODFATHER: PART II,1974.0,A,202 min,,,90.0,@ Al @ Pacino @,,Robert Duvall,Diane Keaton,,,"Francis Ford Coppola * Crime, Drama"
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 angry men,1957.0,U,96 min,9.0,A jury holdout attempts to prevent a miscarria...,96.0,@ Henry @ Fonda @,Lee J. Cobb,Martin Balsam,,689845.0,4360000.0,"Sidney Lumet*Crime, Drama"


### 4.2. Schema

In [103]:
df.shape

(1000, 15)

In [104]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Poster_Link     902 non-null    object 
 1   Series_Title    901 non-null    object 
 2   Released_Year   900 non-null    float64
 3   Certificate     813 non-null    object 
 4   Runtime         900 non-null    object 
 5   IMDB_Rating     900 non-null    float64
 6   Overview        900 non-null    object 
 7   Meta_score      760 non-null    float64
 8   Star1           900 non-null    object 
 9   Star2           900 non-null    object 
 10  Star3           900 non-null    object 
 11  Star4           900 non-null    object 
 12  No_of_Votes     900 non-null    float64
 13  Gross           744 non-null    object 
 14  Director_Genre  900 non-null    object 
dtypes: float64(4), object(11)
memory usage: 117.3+ KB


### 4.3. Missing Values

In [105]:
df.isnull().sum()

Poster_Link        98
Series_Title       99
Released_Year     100
Certificate       187
Runtime           100
IMDB_Rating       100
Overview          100
Meta_score        240
Star1             100
Star2             100
Star3             100
Star4             100
No_of_Votes       100
Gross             256
Director_Genre    100
dtype: int64
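
The raw counts are easier to compare when they sit next to their share of the rows. The following is a minimal sketch, assuming a small made-up frame `sample` in place of `df` and a hypothetical helper `missing_report` that is not part of the original notebook:

```python
import pandas as pd

# Made-up stand-in for df; in the notebook you would pass df instead.
sample = pd.DataFrame({
    "Series_Title": ["the shawshank redemption", None, "The Dark Knight", "12 angry men"],
    "Certificate": [None, "A", "UA", None],
    "Gross": ["28,341,469", None, "534,858,444", None],
})

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages per column, worst first."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(1),
    })
    return report.sort_values("missing", ascending=False)

print(missing_report(sample))
```

Calling `missing_report(df)` on the real data would list the columns from most to least affected.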

### 4.4. Duplications

In [106]:
df.duplicated().sum()

np.int64(0)
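
Note that `df.duplicated()` compares whole rows exactly, so two rows for the same movie that differ only in casing or whitespace are not flagged. A minimal sketch of a looser check, assuming a made-up `sample` frame and a hypothetical comparison key built from the lower-cased title and the release year:

```python
import pandas as pd

# Made-up rows: the first two describe the same movie with different formatting.
sample = pd.DataFrame({
    "Series_Title": ["THE GODFATHER", " the godfather ", "12 angry men"],
    "Released_Year": [1972.0, 1972.0, 1957.0],
})

# Exact row comparison, as used in the cell above.
exact = sample.duplicated().sum()

# Looser comparison on a normalised key (an assumption, not the notebook's method).
key = sample.assign(
    title_key=sample["Series_Title"].str.strip().str.lower()
)[["title_key", "Released_Year"]]
near = key.duplicated().sum()

print(exact, near)
```

On the real data this check is more meaningful after the formatting clean-up in section 5.3.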

### 4.5. Summary
1. <ins>Duplications:</ins> No duplicated rows are detected.
2. <ins>Concatenated Fields:</ins> The `Director_Genre` field concatenates the director name and the genre, separated by `*`.
3. <ins>Inconsistent Formatting:</ins> Inconsistent formatting is detected in some of the categorical and textual columns (for example, mixed casing in `Series_Title` and stray `@` characters in `Star1`).
4. <ins>Missing Column:</ins> The data is missing an important column that defines each movie's country of origin.
5. <ins>Missing Values:</ins> Every column has missing values, ranging from 98 (`Poster_Link`) to 256 (`Gross`).
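
The formatting problems listed above can also be counted rather than spotted by eye. A minimal sketch, assuming a made-up `sample` frame built from the preview rows in section 4.1; the names in `checks` are illustrative only:

```python
import pandas as pd

# Made-up stand-in for df, copied from the preview rows in section 4.1.
sample = pd.DataFrame({
    "Series_Title": ["the shawshank redemption", "THE GODFATHER", "The Dark Knight"],
    "Star1": ["Tim @ Robbins", "Marlon @ Brando", "@ Christian @ Bale @"],
})

# Count a few symptoms of inconsistent formatting.
checks = {
    "title_all_lower": sample["Series_Title"].str.islower().sum(),
    "title_all_upper": sample["Series_Title"].str.isupper().sum(),
    "star1_has_at": sample["Star1"].str.contains("@", regex=False).sum(),
}
print(checks)
```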


## 5. Data Cleaning

### 5.1. Generate "ID" column
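A sequential `ID` column gives each row a stable key, which makes rows easier to reference during cleaning and lets the inferred countries be mapped back to the right movie later.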

In [118]:
df.insert(0, 'ID', range(1, len(df) + 1)) 

### 5.2. Clean Concatenated Fields

In [119]:
# ======= Split "Director_Genre" Column ======= 
df[['Director_Name', 'Genre']] = df["Director_Genre"].str.split('*', expand=True)
df.drop(columns=["Director_Genre"], inplace=True)
df.head()

Unnamed: 0,ID,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,IMDB_Rating,Overview,Meta_score,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Director_Name,Genre
0,1,https://m.media-amazon.com/images/M/MV5BMDFkY...,the shawshank redemption,1994.0,,142 min,9.3,Two imprisoned men bond over a number of years...,80.0,Tim @ Robbins,,Bob Gunton,William Sadler,2343110.0,28341469.0,Frank Darabont,Drama
1,2,https://m.media-amazon.com/images/M/MV5BM2MyNj...,THE GODFATHER,1972.0,A,175 min,9.2,An organized crime dynasty's aging patriarch t...,100.0,Marlon @ Brando,,James Caan,Diane Keaton,1620367.0,,Francis Ford Coppola,"Crime, Drama"
2,3,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008.0,UA,152 min,9.0,When the menace known as the Joker wreaks hav...,84.0,@ Christian @ Bale @,Heath Ledger,Aaron Eckhart,Michael Caine,2303232.0,534858444.0,Christopher Nolan,"Action, Crime, Drama"
3,4,https://m.media-amazon.com/images/M/MV5BMWMwMG...,THE GODFATHER: PART II,1974.0,A,202 min,,,90.0,@ Al @ Pacino @,,Robert Duvall,Diane Keaton,,,Francis Ford Coppola,"Crime, Drama"
4,5,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 angry men,1957.0,U,96 min,9.0,A jury holdout attempts to prevent a miscarria...,96.0,@ Henry @ Fonda @,Lee J. Cobb,Martin Balsam,,689845.0,4360000.0,Sidney Lumet,"Crime, Drama"
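
The preview in section 4.1 shows that the `*` separator is sometimes surrounded by spaces (`Christopher Nolan * Action, Crime, Drama`), so both halves may carry leading or trailing whitespace after the split. A minimal sketch of a split that also trims both parts, assuming a made-up `sample` frame; `n=1` is an added safeguard so that a second `*` would not create a third column:

```python
import pandas as pd

# Made-up stand-in for df with the separator styles seen in the preview.
sample = pd.DataFrame({
    "Director_Genre": [
        "Frank Darabont *Drama",
        "Christopher Nolan * Action, Crime, Drama",
        None,
    ]
})

# Split on the first literal '*' only, then trim whitespace on both sides.
parts = sample["Director_Genre"].str.split("*", n=1, expand=True, regex=False)
sample["Director_Name"] = parts[0].str.strip()
sample["Genre"] = parts[1].str.strip()

print(sample[["Director_Name", "Genre"]])
```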


### 5.3. Handle Inconsistent Formatting

In [120]:
import re

# ===== Standardize "Poster_Link" =====
df['Poster_Link'] = df['Poster_Link'].astype('string').str.strip()

# ===== Standardize "Series_Title" ===== 
def normalize_name(name):    
    if pd.isna(name):
        return pd.NA

    name = name.strip().lower()
    name = re.sub(r'\s+', ' ', name) 
    
    roman_pattern = r'^(?=[MDCLXVI])M{0,4}(CM|CD|D?C{0,3})?' \
                r'(XC|XL|L?X{0,3})?(IX|IV|V?I{0,3})$'

    words = name.split(' ')

    normalized = []
    for word in words: 
        if re.match(roman_pattern, word.upper()):
            normalized.append(word.upper())
        else:
            normalized.append(word.capitalize())

    return ' '.join(normalized)

df['Series_Title'] = df['Series_Title'].astype('string').apply(normalize_name)

# ===== Standardize "Released_Year" ===== 
# no action needed

# ===== Standardize "Certificate" ===== 
df['Certificate'] = df['Certificate'].astype('string').str.strip().str.upper()

# ===== Standardize "Runtime" =====
df['Runtime'] = df['Runtime'].astype('string').str.strip().str.replace(' min', '', regex=False).apply(pd.to_numeric, errors='coerce').astype('Int64')
df.rename(columns={'Runtime': 'Runtime_Minutes'}, inplace=True)

# ===== Standardize "IMDB_Rating" =====
# no action needed

# ===== Standardize "Overview" =====
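# Use .str.replace for substring replacement; Series.replace only matches whole values
df['Overview'] = df['Overview'].astype('string').str.strip().str.replace(', ...', '.', regex=False).str.replace('See full summary »', '', regex=False).str.strip()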

# ===== Standardize "Meta_score" =====
# no action needed

# ===== Standardize "Star1" =====
df['Star1'] = df['Star1'].astype('string').str.replace('@', '').apply(normalize_name)

# ===== Standardize "Star2" =====
df['Star2'] = df['Star2'].astype('string').apply(normalize_name)

# ===== Standardize "Star3" =====
df['Star3'] = df['Star3'].astype('string').apply(normalize_name)

# ===== Standardize "Star4" =====
df['Star4'] = df['Star4'].astype('string').apply(normalize_name)

# ===== Standardize "No_of_Votes" =====
df['No_of_Votes'] = df['No_of_Votes'].astype('string').str.strip().astype('Float64')

# ===== Standardize "Gross" =====
df['Gross'] = df['Gross'].str.replace(',', '', regex=False).astype('Float64')

# ===== Standardize "Director_Name" =====
df['Director_Name'] = df['Director_Name'].astype('string').apply(normalize_name)

# ===== Standardize "Genre" =====
# no action needed

df.head()


Unnamed: 0,ID,Poster_Link,Series_Title,Released_Year,Certificate,Runtime_Minutes,IMDB_Rating,Overview,Meta_score,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Director_Name,Genre
0,1,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994.0,,142,9.3,Two imprisoned men bond over a number of years...,80.0,Tim Robbins,,Bob Gunton,William Sadler,2343110.0,28341469.0,Frank Darabont,Drama
1,2,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972.0,A,175,9.2,An organized crime dynasty's aging patriarch t...,100.0,Marlon Brando,,James Caan,Diane Keaton,1620367.0,,Francis Ford Coppola,"Crime, Drama"
2,3,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008.0,UA,152,9.0,When the menace known as the Joker wreaks havo...,84.0,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232.0,534858444.0,Christopher Nolan,"Action, Crime, Drama"
3,4,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974.0,A,202,,,90.0,Al Pacino,,Robert Duvall,Diane Keaton,,,Francis Ford Coppola,"Crime, Drama"
4,5,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957.0,U,96,9.0,A jury holdout attempts to prevent a miscarria...,96.0,Henry Fonda,Lee J. Cobb,Martin Balsam,,689845.0,4360000.0,Sidney Lumet,"Crime, Drama"
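
One caveat of `normalize_name` is that its Roman-numeral pattern also matches short words that happen to be valid numerals, such as the surname "Li" (51) or the first name "Liv" (54). A minimal sketch of one possible variant, assuming a hypothetical `keep_roman` flag (not in the original function) so that the numeral check can be switched off for person-name columns:

```python
import re

import pandas as pd

# Same Roman-numeral pattern as in normalize_name, compiled once.
ROMAN = re.compile(
    r'^(?=[MDCLXVI])M{0,4}(CM|CD|D?C{0,3})?'
    r'(XC|XL|L?X{0,3})?(IX|IV|V?I{0,3})$'
)

def normalize_text(value, keep_roman=True):
    """Capitalize each word; optionally keep Roman numerals upper-case.

    Hypothetical variant of normalize_name: keep_roman is an added flag.
    """
    if pd.isna(value):
        return pd.NA
    words = re.sub(r'\s+', ' ', value.strip().lower()).split(' ')
    out = []
    for word in words:
        if keep_roman and ROMAN.match(word.upper()):
            out.append(word.upper())
        else:
            out.append(word.capitalize())
    return ' '.join(out)

print(normalize_text("THE GODFATHER: PART II"))    # numeral kept upper-case in a title
print(normalize_text("jet li"))                     # surname is treated as a numeral
print(normalize_text("jet li", keep_roman=False))   # numeral check skipped for names
```

Word-by-word capitalization still lower-cases inner capitals (for example "McGregor" becomes "Mcgregor"), so name columns may need a manual review either way.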


In [122]:
df.isnull().sum()

ID                   0
Poster_Link         98
Series_Title        99
Released_Year      100
Certificate        187
Runtime_Minutes    100
IMDB_Rating        100
Overview           100
Meta_score         240
Star1              100
Star2              100
Star3              100
Star4              100
No_of_Votes        100
Gross              256
Director_Name      100
Genre              100
dtype: int64
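
These counts match the ones in section 4.3, which suggests that the numeric conversions did not silently coerce any malformed value to a missing one. A minimal sketch of making that check explicit, assuming a made-up `sample` frame in place of `df` and calling `pd.to_numeric` on the whole Series instead of through `apply`:

```python
import pandas as pd

# Made-up stand-in for the raw Runtime and Gross columns.
sample = pd.DataFrame({
    "Runtime": ["142 min", " 175 min", None],
    "Gross": ["28,341,469", None, "534,858,444"],
})
nulls_before = sample.isnull().sum()

# Strip the unit and convert to a nullable integer column.
runtime = pd.to_numeric(
    sample["Runtime"].str.strip().str.replace(" min", "", regex=False),
    errors="coerce",
).astype("Int64")

# Remove thousands separators and convert to a nullable float column.
gross = pd.to_numeric(
    sample["Gross"].str.replace(",", "", regex=False),
    errors="coerce",
).astype("Float64")

cleaned = pd.DataFrame({"Runtime": runtime, "Gross": gross})

# A positive number here would mean a value was lost to errors='coerce'.
lost = cleaned.isnull().sum() - nulls_before
print(lost)
```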

### 5.4. Recover "Country" Column With AI

First, check for rows that lack all of the key values the GenAI prompt relies on.

In [123]:
df[
    (df['Series_Title'].isnull() | df['Series_Title'].str.strip().eq('')) 
    & (df['Director_Name'].isnull() | df['Director_Name'].str.strip().eq('')) 
    & (df['Overview'].isnull() | df['Overview'].str.strip().eq('')) 
]

Unnamed: 0,ID,Poster_Link,Series_Title,Released_Year,Certificate,Runtime_Minutes,IMDB_Rating,Overview,Meta_score,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Director_Name,Genre
552,553,https://m.media-amazon.com/images/M/MV5BODcxYj...,,1956.0,U,,7.9,,,Charlton Heston,Yul Brynner,Anne Baxter,Edward G. Robinson,63560.0,93740000.0,,


To retrieve the Country values, we can rely on existing values such as Series_Title, Overview, or Director_Name.
For the one row above that has none of these, we can rely on the star names and the release year.

We will use Gen AI processing to generate the Country column.

In [None]:
import os

from dotenv import load_dotenv
import google.generativeai as genai

def generate_prompt(data):
    """
    Generate a prompt for the Gemini model based on the provided data.
    """
    prompt = (
        "You have 2 objectives:\n"
        "1. Tell me the country of origin of each of these series/movies.\n"
        "2. Fill in the missing values within the data.\n"
        "Notes:\n"
        " - Use the existing values to fill in the missing values and to guess the country of origin in a new column.\n"
        " - If you cannot determine the country of origin or a missing value, put 'Unknown' in that cell.\n"
        "This is the dataset:\n"
        + str(data)
    )
    return prompt

def extract_columns_as_clues(df, columns):
    """
    Return only the columns that serve as clues for the model.
    """
    return df[list(columns)]
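
One thing to watch with `str(data)` is that pandas truncates the string form of large frames (rows beyond the display limit and long cell values are replaced by `...`), so the model may not see every movie. A minimal sketch of an alternative serialisation, assuming a hypothetical helper `clues_to_text` and a made-up `sample` frame:

```python
import pandas as pd

def clues_to_text(frame: pd.DataFrame, columns: list[str]) -> str:
    """Serialise the chosen clue columns as CSV text, one movie per line."""
    return frame[columns].to_csv(index=False)

# Made-up stand-in for the cleaned df.
sample = pd.DataFrame({
    "ID": [1, 2],
    "Series_Title": ["The Shawshank Redemption", "The Godfather"],
    "Director_Name": ["Frank Darabont", "Francis Ford Coppola"],
    "Gross": [28341469.0, None],
})

# Only the clue columns are sent; Gross is left out.
text = clues_to_text(sample, ["ID", "Series_Title", "Director_Name"])
print(text)
```

The resulting text could be passed to `generate_prompt` in place of the DataFrame.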


In [2]:
from dotenv import load_dotenv
import os
import google.generativeai as genai
import json
import pandas as pd

load_dotenv()
api_key = os.getenv("GEMINI_API_KEY")
model_name = os.getenv("GEMINI_MODEL", "gemini-1.5-flash")

# Configure Gemini API
genai.configure(api_key=api_key)
model = genai.GenerativeModel(model_name)

# Define candidate countries for validation
candidate_countries = [
    'United States', 'United Kingdom', 'France', 'Germany', 'Italy', 'Spain', 
    'India', 'Japan', 'China', 'South Korea', 'Canada', 'Australia', 
    'Russia', 'Brazil', 'Mexico', 'Argentina', 'Unknown'
]

def infer_countries_batch(df, batch_size=50):
    """
    Infer countries for movies in batches using a single API request per batch.
    Returns a dictionary mapping ID to country.
    """
    results = {}
    
    # Process in batches
    for i in range(0, len(df), batch_size):
        batch_df = df.iloc[i:i+batch_size]
        
        # Create batch prompt with all movies
        movie_data = []
        for _, row in batch_df.iterrows():
            movie_id = row['ID']
            title = str(row['Series_Title']) if pd.notnull(row['Series_Title']) else 'Unknown'
            director = str(row['Director_Name']) if pd.notnull(row['Director_Name']) else 'Unknown'
            movie_data.append({
                'id': movie_id,
                'title': title,
                'director': director
            })
        
        # Create batch prompt
        prompt = f"""Given the following list of movies, predict the country of origin for each one.
Return your response as a JSON object where each key is the movie ID and the value is the country name.
Only use countries from this list: {', '.join(candidate_countries[:-1])}. If uncertain, use 'Unknown'.

Movies:
"""
        
        for movie in movie_data:
            prompt += f"ID {movie['id']}: Title: '{movie['title']}', Director: '{movie['director']}'\n"
        
        prompt += f"\nReturn format: {json.dumps({str(movie['id']): 'Country' for movie in movie_data[:2]})}"
        
        try:
            response = model.generate_content(prompt)
            response_text = response.text.strip()
            
            # Try to extract JSON from response
            try:
                # Sometimes the response includes markdown formatting
                if '```json' in response_text:
                    json_start = response_text.find('{')
                    json_end = response_text.rfind('}') + 1
                    response_text = response_text[json_start:json_end]
                elif '```' in response_text:
                    lines = response_text.split('\n')
                    response_text = '\n'.join([line for line in lines if not line.strip().startswith('```')])
                
                batch_results = json.loads(response_text)
                
                # Convert string IDs back to integers and validate countries
                for movie_id_str, country in batch_results.items():
                    movie_id = int(movie_id_str)
                    validated_country = country if country in candidate_countries else 'Unknown'
                    results[movie_id] = validated_country
                    
            except (json.JSONDecodeError, ValueError) as e:
                print(f"Error parsing JSON response for batch {i//batch_size + 1}: {e}")
                print(f"Response was: {response_text}")
                # Fallback: assign Unknown to all movies in this batch
                for movie in movie_data:
                    results[movie['id']] = 'Unknown'
                    
        except Exception as e:
            print(f"Error processing batch {i//batch_size + 1}: {e}")
            # Fallback: assign Unknown to all movies in this batch
            for movie in movie_data:
                results[movie['id']] = 'Unknown'
    
    return results

# Note: df must be loaded and cleaned (sections 3 to 5) before calling infer_countries_batch
print("Batch country inference function created successfully!")

Batch country inference function created successfully!
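
The JSON extraction inside `infer_countries_batch` can be factored into a small function that is testable without calling the API. A minimal sketch, assuming a hypothetical helper `parse_country_json` (not part of the original notebook) that takes the raw reply text and the set of allowed countries:

```python
import json
import re

def parse_country_json(response_text: str, allowed: set[str]) -> dict[int, str]:
    """Extract the outermost {...} object from a model reply and validate it.

    Hypothetical helper; returns {} when no valid JSON object is found.
    """
    # Greedy match from the first '{' to the last '}', ignoring any fences or prose.
    match = re.search(r'\{.*\}', response_text, flags=re.DOTALL)
    if match is None:
        return {}
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}

    parsed = {}
    for key, country in raw.items():
        try:
            movie_id = int(key)
        except ValueError:
            continue  # skip keys that are not numeric IDs
        is_valid = isinstance(country, str) and country in allowed
        parsed[movie_id] = country if is_valid else 'Unknown'
    return parsed

# Made-up reply in the style of a fenced model answer.
reply = 'Here you go:\n```json\n{"1": "United States", "2": "Narnia"}\n```'
print(parse_country_json(reply, {"United States", "Italy"}))
```

Keeping the parsing separate from the API call means malformed replies can be reproduced and fixed offline.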

In [None]:
# Apply batch country inference
print("Starting batch country inference...")
country_results = infer_countries_batch(df, batch_size=50)

# Map results back to the dataframe using ID; IDs missing from the model's
# response would otherwise become NaN, so treat them as 'Unknown'
df['Country'] = df['ID'].map(country_results).fillna('Unknown')

# Verify results
print("\nCountry inference completed!")
print(f"Total movies processed: {len(df)}")
print(f"Countries found: {df['Country'].ne('Unknown').sum()}")
print(f"Unknown countries: {df['Country'].eq('Unknown').sum()}")

# Show distribution of countries
print("\nCountry distribution:")
print(df['Country'].value_counts().head(10))

In [None]:
# Test with a small sample first to validate the approach
print("Testing batch approach with first 5 movies...")

# Create a small test dataframe
test_df = df.head(5).copy()
test_results = infer_countries_batch(test_df, batch_size=5)

print(f"Test results: {test_results}")
print("\nSample movies and their inferred countries:")
for _, row in test_df.iterrows():
    movie_id = row['ID']
    title = row['Series_Title']
    director = row['Director_Name']
    country = test_results.get(movie_id, 'Unknown')
    print(f"ID {movie_id}: '{title}' by {director} -> {country}")
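
The last objective in section 2 is to produce a cleaned version of the dataset. A minimal sketch of that final step, assuming a made-up `cleaned` frame in place of `df`; the folder name `02_Cleaned` and the file name are hypothetical, and a temporary directory is used only to keep the sketch runnable anywhere:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the cleaned df with its new Country column.
cleaned = pd.DataFrame({
    "ID": [1, 2, 3],
    "Series_Title": ["The Shawshank Redemption", "The Godfather", "The Dark Knight"],
    "Country": ["United States", "United States", "Unknown"],
})

# Hypothetical output location; replace with a folder inside the project.
out_dir = Path(tempfile.mkdtemp()) / "02_Cleaned"
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "Cleaned_imdb_top_1000.csv"

# index=False avoids writing the row index as an extra unnamed column.
cleaned.to_csv(out_path, index=False, encoding="utf-8")

# Read the file back to confirm the shape survived the round trip.
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```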

=====================================

https://docs.google.com/document/d/1HNaIX8a1rqtNvj0xQJNNJ9jK4zM_AZw1Vr96yaS7nWY/edit?tab=t.0#heading=h.ixompnlae5qn

Todo: 
1. Overview
    - Background and Overview
    - Problem Statement
    - Research Question
    - Methodology
    - Data Collection
    - Data Description
    - Data Collection Procedures
2. Exploratory Data Analysis EDA
3. Descriptive Analysis  
    - Introduction
    - Continuous Variables
    - Categorical Variables
    - Bivariate Analysis
        - Continuous Variable Against Target Variable
        - T-Test (Continuous Variable)
        - Categorical Variable Against Target Variable
        - Chi Square Test for Categorical Variables
        - Correlation
    - Graphs and Charts 
4. DATA VISUALIZATION
    - Demographic
    - Summary
5. PRESCRIPTIVE ANALYTICS
    - Introduction 
    - Model Performance Evaluation
    - Confusion Matrix
    - Classification Metrics
    - Accuracy and AUC
    - Feature Importance (Model Coefficients)
    - Prescriptive Analytics Recommendations
    - SmartPLS Model: Structural Relationship Analysis
    - Influence on Default Risk
    - Latent Variable Composition
    - Practical Implications
    - Integration with Logistic Regression Findings
    - Conclusion
6. Discussions
    - Descriptive Analysis
    - Data Visualisation
    - Predictive Analysis
7. RECOMMENDATIONS
    - Execute Risk-Based Monitoring Utilizing Payment History
    - Modify Credit Limits According to Utilization and Repayment Patterns
    - Integrate sophisticated machine learning models to enhance precision.
    - Guarantee Ethical Utilization of Predictive Analytics via Fairness Audits


## Recommended Structure for a Jupyter Analysis Notebook

### Title & Introduction

Short, clear project title

Brief background about the dataset/problem

The key questions you want to answer

### Objectives / Research Questions

Define what you want to investigate (e.g., “Which factors influence sales performance?”)

State hypotheses if relevant

### Data Understanding

Data source (where it comes from)

Dataset description (size, variables, what each means)

Any business/real-world context

### Data Loading & Setup

Import libraries

Load dataset(s)

Quick look at raw data (first rows, dimensions)

### Data Cleaning & Preprocessing

Handle missing values

Remove duplicates

Data type conversions (e.g., dates, categories)

Outlier detection & handling

Feature engineering (if needed)

### Exploratory Data Analysis (EDA)

Summary statistics (mean, median, distributions)

Visualizations (histograms, bar plots, scatter plots, boxplots, correlations)

Key insights discovered

### Data Visualization for Insights

Advanced visuals: heatmaps, time series plots, interactive charts (Plotly, Seaborn, etc.)

Explain what each visualization means in context

### Analysis / Modeling (Optional, depending on your goal)

If you want to show machine learning: regression, classification, clustering, etc.

If pure business analytics: trends, comparisons, KPI calculations

### Evaluation & Validation (if modeling)

Model metrics (accuracy, RMSE, confusion matrix, etc.)

Compare models if you tried more than one

### Business Insights & Interpretation

Translate numbers/graphs into business meaning

What does the analysis suggest?

Which hypotheses were confirmed/refuted?

### Conclusion & Recommendations

Key takeaways from the analysis

Actionable recommendations (if this were for a stakeholder)

### Next Steps / Limitations

Mention limitations of data/analysis

Suggest future improvements or additional data that would help

### Appendix (Optional)

Extra charts, code snippets, references