# Dirty IMDb Top 1000 Movies Data Cleaning V1
**Datasets :** [Dirty_imdb_top_1000.csv](../Datasets/Dirty_imdb_top_1000.csv)

**Author   :** Fajar Laksono 

**Github   :** http://fajarlaksono.github.io/

## 1. Overview

This project focuses on understanding and preparing the **Dirty IMDb Top 1000 Movies** dataset (`Dirty_imdb_top_1000.csv`), which is provided in a "dirty", unclean format. The dataset has several issues, such as a missing required column, incomplete values, and inconsistent formatting, that prevent us from extracting insights from it.

## 2. Objectives
The primary objective of this project is to identify and clean these issues in order to produce a more reliable dataset that can serve as a foundation for meaningful analysis.

Key steps in this project will include:
- Identifying data quality issues.
- Retrieving the missing country column.
- Cleaning and resolving the issues.
- Producing a cleaned version of the dataset.

The dataset contains 1,000 movies from various countries. Because the later insight extraction depends on it, we need to recover the missing column that indicates each movie's country of origin.

By systematically addressing these problems, we aim to transform the raw dataset into a dependable resource for exploratory and descriptive analysis. The actual extraction of insights will be conducted in the subsequent project.

## 3. Preparation

### 3.1. Import Libraries

In [240]:
import pandas as pd
import os
import io
from io import StringIO

from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv()
print("Pandas Version:", pd.__version__)

# For dotenv version
import importlib.metadata
print("python-dotenv Version:", importlib.metadata.version("python-dotenv"))

# For google.generativeai version
print("google-generativeai Version:", importlib.metadata.version("google-generativeai"))

print("Current Working Directory:", os.getcwd())

Pandas Version: 2.2.3
python-dotenv Version: 1.1.0
google-generativeai Version: 0.8.5
Current Working Directory: D:\Project\Github\FajarLaksono\analytics-dirty-imdb-data


### 3.2. Load dataset

In [241]:
dataset_path = 'Datasets/01_Raw/Dirty_imdb_top_1000.csv'
df = pd.read_csv(dataset_path)

## 4. Exploratory Data Analysis (EDA)

### 4.1. Data Preview

In [242]:
df.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,IMDB_Rating,Overview,Meta_score,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Director_Genre
0,https://m.media-amazon.com/images/M/MV5BMDFkY...,the shawshank redemption,1994.0,,142 min,9.3,Two imprisoned men bond over a number of years...,80.0,Tim @ Robbins,,Bob Gunton,William Sadler,2343110.0,28341469.0,Frank Darabont *Drama
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,THE GODFATHER,1972.0,A,175 min,9.2,An organized crime dynasty's aging patriarch t...,100.0,Marlon @ Brando,,James Caan,Diane Keaton,1620367.0,,"Francis Ford Coppola*Crime, Drama"
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008.0,UA,152 min,9.0,When the menace known as the Joker wreaks hav...,84.0,@ Christian @ Bale @,Heath Ledger,Aaron Eckhart,Michael Caine,2303232.0,534858444.0,"Christopher Nolan * Action, Crime, Drama"
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,THE GODFATHER: PART II,1974.0,A,202 min,,,90.0,@ Al @ Pacino @,,Robert Duvall,Diane Keaton,,,"Francis Ford Coppola * Crime, Drama"
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 angry men,1957.0,U,96 min,9.0,A jury holdout attempts to prevent a miscarria...,96.0,@ Henry @ Fonda @,Lee J. Cobb,Martin Balsam,,689845.0,4360000.0,"Sidney Lumet*Crime, Drama"


### 4.2. Schema

In [243]:
df.shape

(1000, 15)

In [244]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Poster_Link     902 non-null    object 
 1   Series_Title    901 non-null    object 
 2   Released_Year   900 non-null    float64
 3   Certificate     813 non-null    object 
 4   Runtime         900 non-null    object 
 5   IMDB_Rating     900 non-null    float64
 6   Overview        900 non-null    object 
 7   Meta_score      760 non-null    float64
 8   Star1           900 non-null    object 
 9   Star2           900 non-null    object 
 10  Star3           900 non-null    object 
 11  Star4           900 non-null    object 
 12  No_of_Votes     900 non-null    float64
 13  Gross           744 non-null    object 
 14  Director_Genre  900 non-null    object 
dtypes: float64(4), object(11)
memory usage: 117.3+ KB


### 4.3. Missing Values

In [245]:
df.isnull().sum()

Poster_Link        98
Series_Title       99
Released_Year     100
Certificate       187
Runtime           100
IMDB_Rating       100
Overview          100
Meta_score        240
Star1             100
Star2             100
Star3             100
Star4             100
No_of_Votes       100
Gross             256
Director_Genre    100
dtype: int64

### 4.4. Duplications

In [246]:
df.duplicated().sum()

np.int64(0)

### 4.5. Summary
1. <ins>Duplications:</ins> No duplicate rows were detected.
2. <ins>Concatenated fields:</ins> The `Director_Genre` field combines the director name and the genre in one value, separated by `*`.
3. <ins>Inconsistent Formatting:</ins> Some categorical and textual columns are formatted inconsistently (for example, mixed letter case in `Series_Title` and stray `@` characters in `Star1`).
4. <ins>Missing Column:</ins> The data is missing an important column that defines each movie's country of origin.
5. <ins>Missing Values:</ins> Every column has missing values, ranging from 98 (`Poster_Link`) to 256 (`Gross`).
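
The formatting issues noted in this summary can also be counted rather than spotted by eye. The following is a minimal sketch on a hypothetical three-row sample that mimics the preview in section 4.1; the helper name `formatting_report` and the choice of checks are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical three-row sample that mimics the issues seen in the preview.
sample = pd.DataFrame({
    "Series_Title": ["the shawshank redemption", "THE GODFATHER", "The Dark Knight"],
    "Star1": ["Tim @ Robbins", "Marlon @ Brando", "@ Christian @ Bale @"],
    "Runtime": ["142 min", "175 min", "152 min"],
    "Director_Genre": [
        "Frank Darabont *Drama",
        "Francis Ford Coppola*Crime, Drama",
        "Christopher Nolan * Action, Crime, Drama",
    ],
})


def formatting_report(frame: pd.DataFrame) -> dict:
    """Count rows affected by each formatting issue (a sketch, not exhaustive)."""
    titles = frame["Series_Title"].astype("string")
    stars = frame["Star1"].astype("string")
    runtime = frame["Runtime"].astype("string")
    joined = frame["Director_Genre"].astype("string")
    return {
        # Titles written entirely in lower case or entirely in upper case
        "title_bad_case": int((titles.str.islower() | titles.str.isupper()).sum()),
        # Stray '@' characters inside actor names
        "star1_has_at": int(stars.str.contains("@", regex=False).sum()),
        # Runtime stored as text with a unit suffix
        "runtime_has_unit": int(runtime.str.endswith(" min").sum()),
        # Director and genre packed into one field with '*'
        "director_genre_joined": int(joined.str.contains("*", regex=False).sum()),
    }


report = formatting_report(sample)
print(report)
```

Running the same function on the full `df` would give a per-issue row count, which can be compared before and after cleaning.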


## 5. Data Cleaning

### 5.1. Generate "ID" column
A sequential `ID` column gives every row a stable key, which makes the cleaning steps easier to track.

In [247]:
df.insert(0, 'ID', range(1, len(df) + 1)) 
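
One way a stable key can help, sketched on hypothetical data: rows produced by a later step can be joined back to the original rows even if they come back in a different order or with rows missing. The frames `original` and `ai_rows` below are made up for illustration.

```python
import pandas as pd

# Hypothetical frames: `original` stands in for the dataset, `ai_rows` for
# rows returned by a later step in a different order and with one row missing.
original = pd.DataFrame({"ID": [1, 2, 3], "Series_Title": ["A", "B", "C"]})
ai_rows = pd.DataFrame({"ID": [3, 1], "Country_Origin": ["Japan", "USA"]})

# A left join on ID keeps every original row; validate= raises if an ID repeats.
merged = original.merge(ai_rows, on="ID", how="left", validate="one_to_one")
print(merged)
```

The row with `ID` 2 keeps its title and gets a missing `Country_Origin`, so gaps stay visible instead of silently shifting values between rows.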

### 5.2. Clean Concatenated Fields

In [248]:
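# ======= Split "Director_Genre" Column =======
# Split on the first literal '*' only, then trim the whitespace left around it.
df[['Director_Name', 'Genre']] = df["Director_Genre"].str.split('*', n=1, regex=False, expand=True)
for col in ['Director_Name', 'Genre']:
    df[col] = df[col].str.strip()
df.drop(columns=["Director_Genre"], inplace=True)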
df.head()

Unnamed: 0,ID,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,IMDB_Rating,Overview,Meta_score,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Director_Name,Genre
0,1,https://m.media-amazon.com/images/M/MV5BMDFkY...,the shawshank redemption,1994.0,,142 min,9.3,Two imprisoned men bond over a number of years...,80.0,Tim @ Robbins,,Bob Gunton,William Sadler,2343110.0,28341469.0,Frank Darabont,Drama
1,2,https://m.media-amazon.com/images/M/MV5BM2MyNj...,THE GODFATHER,1972.0,A,175 min,9.2,An organized crime dynasty's aging patriarch t...,100.0,Marlon @ Brando,,James Caan,Diane Keaton,1620367.0,,Francis Ford Coppola,"Crime, Drama"
2,3,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008.0,UA,152 min,9.0,When the menace known as the Joker wreaks hav...,84.0,@ Christian @ Bale @,Heath Ledger,Aaron Eckhart,Michael Caine,2303232.0,534858444.0,Christopher Nolan,"Action, Crime, Drama"
3,4,https://m.media-amazon.com/images/M/MV5BMWMwMG...,THE GODFATHER: PART II,1974.0,A,202 min,,,90.0,@ Al @ Pacino @,,Robert Duvall,Diane Keaton,,,Francis Ford Coppola,"Crime, Drama"
4,5,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 angry men,1957.0,U,96 min,9.0,A jury holdout attempts to prevent a miscarria...,96.0,@ Henry @ Fonda @,Lee J. Cobb,Martin Balsam,,689845.0,4360000.0,Sidney Lumet,"Crime, Drama"


### 5.3. Handle Inconsistent Formatting

In [249]:
import re

# ===== Standardize "Poster_Link" =====
df['Poster_Link'] = df['Poster_Link'].astype('string').str.strip()

# ===== Standardize "Series_Title" ===== 
def normalize_name(name):    
    if pd.isna(name):
        return pd.NA

    name = name.strip().lower()
    name = re.sub(r'\s+', ' ', name) 
    
    roman_pattern = r'^(?=[MDCLXVI])M{0,4}(CM|CD|D?C{0,3})?' \
                r'(XC|XL|L?X{0,3})?(IX|IV|V?I{0,3})$'

    words = name.split(' ')

    normalized = []
    for word in words: 
        if re.match(roman_pattern, word.upper()):
            normalized.append(word.upper())
        else:
            normalized.append(word.capitalize())

    return ' '.join(normalized)

df['Series_Title'] = df['Series_Title'].astype('string').apply(normalize_name)

# ===== Standardize "Released_Year" ===== 
# no action needed

# ===== Standardize "Certificate" =====
df['Certificate'] = df['Certificate'].astype('string').str.strip().str.upper()

# ===== Standardize "Runtime" =====
df['Runtime'] = df['Runtime'].astype('string').str.strip().str.replace(' min', '', regex=False).apply(pd.to_numeric, errors='coerce').astype('Int64')
df.rename(columns={'Runtime': 'Runtime_Minutes'}, inplace=True)

# ===== Standardize "IMDB_Rating" =====
# no action needed

# ===== Standardize "Overview" =====
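# Use .str.replace for substring replacement; Series.replace only matches whole cell values.
df['Overview'] = df['Overview'].astype('string').str.strip().str.replace(', ...', '.', regex=False).str.replace('See full summary »', '', regex=False).str.strip()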

# ===== Standardize "Meta_score" =====
# no action needed

# ===== Standardize "Star1" =====
df['Star1'] = df['Star1'].astype('string').str.replace('@', '').apply(normalize_name)

# ===== Standardize "Star2" =====
df['Star2'] = df['Star2'].astype('string').apply(normalize_name)

# ===== Standardize "Star3" =====
df['Star3'] = df['Star3'].astype('string').apply(normalize_name)

# ===== Standardize "Star4" =====
df['Star4'] = df['Star4'].astype('string').apply(normalize_name)

# ===== Standardize "No_of_Votes" =====
df['No_of_Votes'] = df['No_of_Votes'].astype('string').str.strip().astype('Float64')

# ===== Standardize "Gross" =====
df['Gross'] = df['Gross'].str.replace(',', '', regex=False).astype('Float64')

# ===== Standardize "Director_Name" =====
df['Director_Name'] = df['Director_Name'].astype('string').apply(normalize_name)

# ===== Standardize "Genre" =====
# no action needed

df.head()


Unnamed: 0,ID,Poster_Link,Series_Title,Released_Year,Certificate,Runtime_Minutes,IMDB_Rating,Overview,Meta_score,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Director_Name,Genre
0,1,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994.0,,142,9.3,Two imprisoned men bond over a number of years...,80.0,Tim Robbins,,Bob Gunton,William Sadler,2343110.0,28341469.0,Frank Darabont,Drama
1,2,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972.0,A,175,9.2,An organized crime dynasty's aging patriarch t...,100.0,Marlon Brando,,James Caan,Diane Keaton,1620367.0,,Francis Ford Coppola,"Crime, Drama"
2,3,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008.0,UA,152,9.0,When the menace known as the Joker wreaks havo...,84.0,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232.0,534858444.0,Christopher Nolan,"Action, Crime, Drama"
3,4,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974.0,A,202,,,90.0,Al Pacino,,Robert Duvall,Diane Keaton,,,Francis Ford Coppola,"Crime, Drama"
4,5,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957.0,U,96,9.0,A jury holdout attempts to prevent a miscarria...,96.0,Henry Fonda,Lee J. Cobb,Martin Balsam,,689845.0,4360000.0,Sidney Lumet,"Crime, Drama"


In [250]:
df.isnull().sum()

ID                   0
Poster_Link         98
Series_Title        99
Released_Year      100
Certificate        187
Runtime_Minutes    100
IMDB_Rating        100
Overview           100
Meta_score         240
Star1              100
Star2              100
Star3              100
Star4              100
No_of_Votes        100
Gross              256
Director_Name      100
Genre              100
dtype: int64

### 5.4. Recover Missing Data With AI

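First, check for rows where all of the main clue columns for the AI step (`Series_Title`, `Director_Name`, and `Overview`) are missing or empty: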

In [251]:
df[
    (df['Series_Title'].isnull() | df['Series_Title'].str.strip().eq('')) 
    & (df['Director_Name'].isnull() | df['Director_Name'].str.strip().eq('')) 
    & (df['Overview'].isnull() | df['Overview'].str.strip().eq('')) 
]

Unnamed: 0,ID,Poster_Link,Series_Title,Released_Year,Certificate,Runtime_Minutes,IMDB_Rating,Overview,Meta_score,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Director_Name,Genre
552,553,https://m.media-amazon.com/images/M/MV5BODcxYj...,,1956.0,U,,7.9,,,Charlton Heston,Yul Brynner,Anne Baxter,Edward G. Robinson,63560.0,93740000.0,,


To retrieve the country values, we can rely on existing values such as `Series_Title`, `Overview`, or `Director_Name`.
For the one row above that has none of them, we can rely on the star names and the release year.

We will use generative AI to produce the country column.
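
To see how many clues each row offers, the check can be turned into a per-row count. This is a minimal sketch on a hypothetical three-row sample; the helper `is_blank` and the sample values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample; None marks a missing clue, "  " a whitespace-only one.
sample = pd.DataFrame({
    "ID": [1, 2, 3],
    "Series_Title": ["The Godfather", None, None],
    "Director_Name": ["Francis Ford Coppola", "Sidney Lumet", None],
    "Overview": [None, "some overview text", "  "],
})

clue_columns = ["Series_Title", "Director_Name", "Overview"]


def is_blank(col: pd.Series) -> pd.Series:
    """True where a value is missing or contains only whitespace."""
    text = col.astype("string")
    return (text.isna() | text.str.strip().eq("").fillna(False)).astype(bool)


blank = sample[clue_columns].apply(is_blank)

# Number of usable clue columns per row
clues_available = (~blank).sum(axis=1)

# IDs of rows with no usable clue at all
no_clues = sample.loc[clues_available.eq(0), "ID"].tolist()
print(clues_available.tolist(), no_clues)
```

Rows listed in `no_clues` are the ones that would need other columns, such as the star names and release year, as clues.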

#### 5.4.1. Preparation

In [252]:
def generate_prompt(data, tasks):
    prompt = (
        "Tasks: \n"
        + str(tasks) +
        "\nRules: \n"
        " - Use the existing values as clues to find out the missing value, to fill in the missing cells or columns\n"
        " - If you cannot find the country origin or the missing value, just put 'Unknown' into the cell.\n"
        " - DO NOT provide any explanations, reasoning, or commentary.\n"
        " - DO NOT include any text before or after the dataset.\n"
        " - Return data in JSON format for easier parsing.\n"
        "\nResponse Format: \n"
        " - Provide ONLY a JSON array of objects.\n"
        " - Each object should have all the required fields.\n"
        " - Start immediately with the JSON array (no introductory text).\n"
        " - Ensure all missing values are filled in appropriately.\n"
        " - Example format: [{\"ID\": 1, \"Series_Title\": \"Movie Name\", \"Director_Name\": \"Director\", \"Overview\": \"Description\", \"Country_Origin\": \"USA\"}]\n"
        "\nThis is the data set:\n"
        + data.to_json(orient='records', indent=2)
    )
    return prompt


def extract_columns_as_clues(df, columns):
    # Return only the columns that serve as clues for the model.
    return df[list(columns)]


def parse_ai_response_to_dataframe(ai_response_text):
    import json
    try:
        # Clean the response - remove any markdown formatting
        clean_text = ai_response_text.strip()
        
        # Remove markdown code blocks if present
        if clean_text.startswith('```json'):
            clean_text = clean_text[7:]
        if clean_text.endswith('```'):
            clean_text = clean_text[:-3]
        clean_text = clean_text.strip()
        
        # Try to find JSON array in the response
        start_idx = clean_text.find('[')
        end_idx = clean_text.rfind(']') + 1
        
        if start_idx != -1 and end_idx > start_idx:
            json_text = clean_text[start_idx:end_idx]
            
            # Parse JSON and convert to DataFrame
            data = json.loads(json_text)
            df = pd.DataFrame(data)
            return df
        else:
            print("No JSON array found in response")
            return None
            
    except json.JSONDecodeError as e:
        print(f"JSON parsing error: {e}")
        print(f"Response text: {ai_response_text[:200]}...")
        return None
    except Exception as e:
        print(f"Error parsing AI response: {e}")
        print(f"Raw response: {ai_response_text[:200]}...")
        return None


def call_gemini_api(prompt):
    api_key = os.getenv('GEMINI_API_KEY')
    model_name = os.getenv("GEMINI_MODEL", "gemini-1.5-flash")

    if not api_key:
        print("GEMINI_API_KEY environment variable not set.")
        return None
    
    try:
        genai.configure(api_key=api_key)

        model = genai.GenerativeModel(model_name)

        response = model.generate_content(prompt)

        return response.text

    except Exception as e:
        print(f"Error calling Gemini API: {e}")

        return None
    

def process_data_in_batches(df, tasks, batch_size=50, max_retries=3):
    all_results = []
    total_batches = len(df) // batch_size + (1 if len(df) % batch_size != 0 else 0)
    
    print(f"Processing {len(df)} rows in {total_batches} batches of {batch_size} rows each...")
    
    for i in range(0, len(df), batch_size):
        batch_num = i // batch_size + 1
        batch = df.iloc[i:i+batch_size].copy()
        
        print(f"Processing batch {batch_num}/{total_batches} (rows {i+1}-{min(i+batch_size, len(df))})")
        
        success = False
        for retry in range(max_retries):
            try:
                prompt = generate_prompt(batch, tasks)
                
                print(f"Calling Gemini API for batch {batch_num} (attempt {retry + 1}/{max_retries})...")
                ai_response_text = call_gemini_api(prompt)
                print(f"Gemini API response received for batch {batch_num}")
                
                if ai_response_text:
                    print(f"Parsing AI response for batch {batch_num}...")
                    batch_result = parse_ai_response_to_dataframe(ai_response_text)
                    
                    if batch_result is not None and len(batch_result) > 0:
                        all_results.append(batch_result)
                        print(f"✓ Batch {batch_num} completed successfully")
                        success = True
                        break
                    else:
                        print(f"✗ Batch {batch_num} failed to parse (attempt {retry + 1}/{max_retries})")
                else:
                    print(f"✗ Batch {batch_num} API call failed (attempt {retry + 1}/{max_retries})")
                    
            except Exception as e:
                print(f"✗ Batch {batch_num} error: {e} (attempt {retry + 1}/{max_retries})")
        
        if not success:
            print(f"⚠️  Batch {batch_num} failed after {max_retries} attempts. Using original data.")
            all_results.append(batch)
    
    if all_results:
        combined_result = pd.concat(all_results, ignore_index=True)
        print(f"✓ All batches completed. Final dataset: {len(combined_result)} rows")
        return combined_result
    else:
        print("✗ No successful batches. Returning original data.")
        return df

#### 5.4.2. Recover Missing Values and Columns

In [253]:

series_director_overview_country_tasks_list = "1. Add a new column called 'Country_Origin' that tells the country origin of the Series/Movies, \n" \
"2. Fill in the missing value within the data.\n"

series_director_overview_country_prompt = generate_prompt(df, series_director_overview_country_tasks_list)

print(series_director_overview_country_prompt)

Tasks: 
1. Add a new column called 'Country_Origin' that tells the country origin of the Series/Movies, 
2. Fill in the missing value within the data.

Rules: 
 - Use the existing values as clues to find out the missing value, to fill in the missing cells or columns
 - If you cannot find the country origin or the missing value, just put 'Unknown' into the cell.
 - DO NOT provide any explanations, reasoning, or commentary.
 - DO NOT include any text before or after the dataset.
 - Return data in JSON format for easier parsing.

Response Format: 
 - Provide ONLY a JSON array of objects.
 - Each object should have all the required fields.
 - Start immediately with the JSON array (no introductory text).
 - Ensure all missing values are filled in appropriately.
 - Example format: [{"ID": 1, "Series_Title": "Movie Name", "Director_Name": "Director", "Overview": "Description", "Country_Origin": "USA"}]

This is the data set:
[
  {
    "ID":1,
    "Poster_Link":"https:\/\/m.media-amazon.com\/i

In [254]:
result = process_data_in_batches(
    df, 
    series_director_overview_country_tasks_list,
    batch_size=10
)

if result is not None:
    print("\n AI results:")
    print(result.head())
    print(f"\n Columns: {list(result.columns)}")
else:
    print("AI cleaning failed")

Processing 1000 rows in 100 batches of 10 rows each...
Processing batch 1/100 (rows 1-10)
Calling Gemini API for batch 1 (attempt 1/3)...
Gemini API response received for batch 1
Parsing AI response for batch 1...
✓ Batch 1 completed successfully
Processing batch 2/100 (rows 11-20)
Calling Gemini API for batch 2 (attempt 1/3)...
Gemini API response received for batch 2
Parsing AI response for batch 2...
✓ Batch 2 completed successfully
Processing batch 3/100 (rows 21-30)
Calling Gemini API for batch 3 (attempt 1/3)...
Gemini API response received for batch 3
Parsing AI response for batch 3...
✓ Batch 3 completed successfully
Processing batch 4/100 (rows 31-40)
Calling Gemini API for batch 4 (attempt 1/3)...
Gemini API response received for batch 4
Parsing AI response for batch 4...
✓ Batch 4 completed successfully
Processing batch 5/100 (rows 41-50)
Calling Gemini API for batch 5 (attempt 1/3)...
Gemini API response received for batch 5
Parsing AI response for batch 5...
✓ Batch 5 comp

In [255]:
result.head()

Unnamed: 0,ID,Poster_Link,Series_Title,Released_Year,Certificate,Runtime_Minutes,IMDB_Rating,Overview,Meta_score,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Director_Name,Genre,Country_Origin
0,1,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994.0,R,142,9.3,Two imprisoned men bond over a number of years...,80.0,Tim Robbins,Unknown,Bob Gunton,William Sadler,2343110.0,28341469.0,Frank Darabont,Drama,USA
1,2,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972.0,A,175,9.2,An organized crime dynasty's aging patriarch t...,100.0,Marlon Brando,Unknown,James Caan,Diane Keaton,1620367.0,Unknown,Francis Ford Coppola,"Crime, Drama",USA
2,3,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008.0,UA,152,9.0,When the menace known as the Joker wreaks havo...,84.0,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232.0,534858444.0,Christopher Nolan,"Action, Crime, Drama",USA
3,4,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974.0,A,202,Unknown,Unknown,90.0,Al Pacino,Unknown,Robert Duvall,Diane Keaton,Unknown,Unknown,Francis Ford Coppola,"Crime, Drama",USA
4,5,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957.0,U,96,9.0,A jury holdout attempts to prevent a miscarria...,96.0,Henry Fonda,Lee J. Cobb,Martin Balsam,Unknown,689845.0,4360000.0,Sidney Lumet,"Crime, Drama",USA
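
The preview above shows the placeholder `Unknown` in numeric columns such as `IMDB_Rating` and `No_of_Votes`. One possible follow-up check before exporting, sketched here on hypothetical miniature data, is to confirm that the IDs still line up with the input and to turn non-numeric placeholders back into missing values. The function `check_ai_result` and the frames `before` and `after` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature stand-ins for `df` (before) and `result` (after the AI step).
before = pd.DataFrame({"ID": [1, 2, 3], "IMDB_Rating": [9.3, None, 9.0]})
after = pd.DataFrame({
    "ID": [1, 2, 3],
    "IMDB_Rating": [9.3, "Unknown", 9.0],
    "Country_Origin": ["USA", "USA", "Unknown"],
})


def check_ai_result(before, after, numeric_columns):
    """Return (ids_match, cleaned) where numeric columns hold numbers or NaN."""
    # The AI step should return the same IDs in the same order.
    ids_match = before["ID"].tolist() == after["ID"].tolist()
    cleaned = after.copy()
    for col in numeric_columns:
        # Non-numeric placeholders such as 'Unknown' become NaN.
        cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")
    return ids_match, cleaned


ids_match, cleaned = check_ai_result(before, after, ["IMDB_Rating"])
print(ids_match)
print(cleaned.dtypes)
```

Text columns such as `Country_Origin` are left untouched, so `Unknown` remains a valid label there while numeric columns keep a numeric dtype.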


## 6. Export The Result

In [None]:
result.to_csv('Datasets/02_Cleaned/Cleaned_imdb_top_1000_v1.csv', index=False)