## Hybrid Enrichment

These two scripts fill the `director` and `imdb_rating` columns, using the IMDb static datasets first and the OMDb API second, and only where a value is currently empty or missing.

#### Phase 1: IMDb Static Datasets (Bulk Fill)

Data source: https://datasets.imdbws.com/

How to use:
1. Download the 4 IMDb files from the link:
    - title.basics.tsv.gz
    - title.ratings.tsv.gz
    - title.crew.tsv.gz
    - name.basics.tsv.gz
2. Extract them into a folder called imdb_datasets (or update IMDB_DIR).
3. Run the script. It matches by title + year and fills the "director" and "imdb_rating" columns.
4. Then run the second script (Phase 2) to fill the remaining missing values using the OMDb API.
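
The title + year match from step 3 can be illustrated on toy data. This is a minimal sketch rather than the notebook's script: both small frames and all their values are invented for illustration, and only the column names follow the script. It normalizes titles on both sides, keeps one IMDb row per (title, year) so the left join cannot multiply rows, and fills only the cells that are still empty.

```python
import pandas as pd

# Toy stand-ins for the Netflix CSV and IMDb title.basics (values are made up)
netflix = pd.DataFrame({
    "title": [" Alpha ", "Beta", "Gamma"],
    "release_year": [2020, 2019, 2021],
    "director": ["Already Known", None, None],
})
imdb = pd.DataFrame({
    "primaryTitle": ["alpha", "Beta", "Beta", "Delta"],
    "startYear": ["2020", "2019", "2019", "2021"],
    "imdb_director": ["Someone Else", "First Match", "Second Match", "Nobody"],
})

# Normalize both sides the same way: trim whitespace, lowercase, numeric year
netflix["match_title"] = netflix["title"].str.strip().str.lower()
netflix["match_year"] = pd.to_numeric(netflix["release_year"], errors="coerce")
imdb["match_title"] = imdb["primaryTitle"].str.strip().str.lower()
imdb["match_year"] = pd.to_numeric(imdb["startYear"], errors="coerce")

# Keep one IMDb row per (title, year) so each Netflix row matches at most once
imdb_unique = imdb.drop_duplicates(subset=["match_title", "match_year"])

merged = netflix.merge(
    imdb_unique[["match_title", "match_year", "imdb_director"]],
    on=["match_title", "match_year"],
    how="left",
    validate="many_to_one",  # raises if the right-hand keys are not unique
)

# Fill only the cells that are still empty; existing values are left alone
merged["director"] = merged["director"].fillna(merged["imdb_director"])
print(merged[["title", "director"]])
```

Keeping the first duplicate is an arbitrary tie-break chosen for brevity; a real run might prefer a rule such as the most-voted title.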



In [None]:


import pandas as pd
import os

# === CONFIG ===
INPUT_FILE = "netflix_titles.csv"
OUTPUT_FILE = "netflix_titles(modified using Dataset).csv"
IMDB_DIR = r"D:\Python\imdb_datasets"  # Folder with extracted .tsv files

# Check IMDb data folder
if not os.path.exists(IMDB_DIR):
    raise FileNotFoundError(
        f"Please download and extract IMDb .tsv files into: {IMDB_DIR}\n"
        "Required: title.basics.tsv, title.ratings.tsv, title.crew.tsv, name.basics.tsv"
    )

# === LOAD NETFLIX DATA ===
print("Loading Netflix dataset...")
df = pd.read_csv(INPUT_FILE)

# Add enrichment columns if missing
if 'imdb_rating' not in df.columns:
    df['imdb_rating'] = None
    print("✅ Added missing column: 'imdb_rating'")
if 'director' not in df.columns:
    df['director'] = None
    print("✅ Added missing column: 'director'")

# Record null counts (now safe to access)
orig_director_nulls = df['director'].isna().sum()
orig_rating_nulls = df['imdb_rating'].isna().sum()
orig_cast_nulls = df['cast'].isna().sum() if 'cast' in df.columns else 0
orig_country_nulls = df['country'].isna().sum() if 'country' in df.columns else 0

print(f"\nOriginal missing values:")
print(f"  director:     {orig_director_nulls}")
print(f"  imdb_rating:  {orig_rating_nulls}")
print(f"  cast:         {orig_cast_nulls}")
print(f"  country:      {orig_country_nulls}")

# Prepare matching keys
df = df.copy()
df['match_title'] = df['title'].astype(str).str.strip().str.lower()
df['match_year'] = pd.to_numeric(df['release_year'], errors='coerce')

# === LOAD IMDb title.basics ===
print("\nLoading title.basics.tsv...")
basics = pd.read_csv(
    os.path.join(IMDB_DIR, "title.basics.tsv"),
    sep='\t',
    usecols=['tconst', 'primaryTitle', 'startYear'],
    na_values='\\N',
    keep_default_na=False,
    dtype={'startYear': 'str'}
)

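# Normalize IMDb titles the same way as the Netflix titles (trim + lowercase)
basics['imdb_title'] = basics['primaryTitle'].str.strip().str.lower()
basics['imdb_year'] = pd.to_numeric(basics['startYear'], errors='coerce')

# Keep one IMDb row per (title, year): duplicate keys would multiply Netflix rows
# in the left join and break the index alignment used when filling df below.
# Rows without a title or year are dropped so NaN keys cannot match each other.
basics = (
    basics.dropna(subset=['imdb_title', 'imdb_year'])
          .drop_duplicates(subset=['imdb_title', 'imdb_year'], keep='first')
)

# === MATCH NETFLIX → IMDb (left join) ===
print("Matching titles...")
merged = pd.merge(
    df,
    basics[['tconst', 'imdb_title', 'imdb_year']],
    left_on=['match_title', 'match_year'],
    right_on=['imdb_title', 'imdb_year'],
    how='left',
    validate='many_to_one'  # fail loudly if the IMDb keys are not unique
)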

# === LOAD RATINGS ===
print("Loading title.ratings.tsv...")
ratings = pd.read_csv(
    os.path.join(IMDB_DIR, "title.ratings.tsv"),
    sep='\t',
    na_values='\\N',
    keep_default_na=False
)
merged = pd.merge(merged, ratings[['tconst', 'averageRating']], on='tconst', how='left')

# === LOAD DIRECTORS ===
print("Loading title.crew.tsv and name.basics.tsv...")
crew = pd.read_csv(
    os.path.join(IMDB_DIR, "title.crew.tsv"),
    sep='\t',
    na_values='\\N',
    keep_default_na=False
)
names = pd.read_csv(
    os.path.join(IMDB_DIR, "name.basics.tsv"),
    sep='\t',
    usecols=['nconst', 'primaryName'],
    na_values='\\N',
    keep_default_na=False
)

name_dict = dict(zip(names['nconst'], names['primaryName']))

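def directors_to_names(director_ids):
    """Convert a comma-separated string of IMDb nconst IDs into director names."""
    if pd.isna(director_ids) or director_ids == '':
        return None
    found = [name_dict.get(nconst.strip()) for nconst in str(director_ids).split(',')]
    # Keep only real names (skips unknown IDs and missing primaryName values)
    found = [n for n in found if isinstance(n, str) and n]
    return ', '.join(found) if found else None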

merged = pd.merge(merged, crew[['tconst', 'directors']], on='tconst', how='left')
merged['imdb_director'] = merged['directors'].apply(directors_to_names)

# === SAFELY FILL ONLY MISSING VALUES ===
print("\nFilling missing imdb_rating and director...")

# Fill imdb_rating (only where currently null and IMDb has a rating)
rating_mask = df['imdb_rating'].isna() & merged['averageRating'].notna()
df.loc[rating_mask, 'imdb_rating'] = merged.loc[rating_mask, 'averageRating']

# Fill director (only where currently null/empty)
director_mask = (
    (df['director'].isna() | (df['director'] == '')) &
    merged['imdb_director'].notna()
)
df.loc[director_mask, 'director'] = merged.loc[director_mask, 'imdb_director']

# === REPORT & SAVE ===
new_director_nulls = df['director'].isna().sum()
new_rating_nulls = df['imdb_rating'].isna().sum()
new_cast_nulls = df['cast'].isna().sum() if 'cast' in df.columns else 0
new_country_nulls = df['country'].isna().sum() if 'country' in df.columns else 0

print(f"\nAfter enrichment:")
print(f"  director:     {orig_director_nulls} → {new_director_nulls}")
print(f"  imdb_rating:  {orig_rating_nulls} → {new_rating_nulls}")
print(f"  cast:         {orig_cast_nulls} → {new_cast_nulls} (unchanged)")
print(f"  country:      {orig_country_nulls} → {new_country_nulls} (unchanged)")

# Drop the temporary matching keys so they are not written to the output file
df = df.drop(columns=['match_title', 'match_year'])
df.to_csv(OUTPUT_FILE, index=False)
print(f"\n✅ Saved to: {OUTPUT_FILE}")

Loading Netflix dataset...
✅ Added missing column: 'imdb_rating'

Original missing values:
  director:     2634
  imdb_rating:  8807
  cast:         825
  country:      831

Loading title.basics.tsv...
Matching titles...
Loading title.ratings.tsv...
Loading title.crew.tsv and name.basics.tsv...

Filling missing imdb_rating and director...

After enrichment:
  director:     2634 → 770
  imdb_rating:  8807 → 3932
  cast:         825 → 825 (unchanged)
  country:      831 → 831 (unchanged)

✅ Saved to: netflix_titles(modified).csv
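
The before/after counts in the output above can be turned into fill rates with a few lines of arithmetic. This is a small sketch that only reuses the four numbers printed by the run; nothing else is assumed.

```python
# Counts copied from the run output above: (missing before, missing after)
counts = {
    "director": (2634, 770),
    "imdb_rating": (8807, 3932),
}

fill_rates = {}
for column, (before, after) in counts.items():
    filled = before - after
    fill_rates[column] = filled / before
    print(f"{column}: filled {filled} of {before} ({fill_rates[column]:.1%})")
```

Phase 1 therefore closes roughly 71% of the director gaps and 55% of the rating gaps, which leaves the remainder for the API pass in Phase 2.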


#### Phase 2: OMDb API (Gap Filling)


✅ The second script fills whatever Phase 1 left empty. Specifically:

- If a row already has values in both imdb_rating (e.g., 8.5) and director (e.g., "Jane Campion"), the code skips it without calling the API. A field that already has a value is never overwritten.
- If either field is missing (i.e., None, NaN, or an empty string ""), the code:
    - Makes one API call to OMDb using the title (and the release year, when available).
    - Fills in the missing field(s) from the API response:
        - imdb_rating ← from OMDb’s imdbRating (converted to a number if possible).
        - director ← from OMDb’s Director field (unless it’s "N/A").
- The script saves its progress and stops as soon as the OMDb daily request limit is reached.

---
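
The field handling described above can be sketched as a small pure function that needs no network access. This is an illustration under assumptions: `parse_omdb_fields` is a hypothetical helper name that does not appear in the script, and the sample dictionaries are hand-written to imitate the `Response`, `imdbRating`, `Director`, and `Error` keys the script reads.

```python
from typing import Optional, Tuple


def parse_omdb_fields(data: dict) -> Tuple[Optional[float], Optional[str]]:
    """Return (rating, director) from an OMDb-style dict, using None for gaps."""
    if data.get("Response") != "True":
        return None, None

    # "N/A" marks an unknown value; anything non-numeric also becomes None
    rating: Optional[float] = None
    raw_rating = data.get("imdbRating", "N/A")
    if raw_rating != "N/A":
        try:
            rating = float(raw_rating)
        except (TypeError, ValueError):
            rating = None

    raw_director = str(data.get("Director", "")).strip()
    director = raw_director if raw_director and raw_director != "N/A" else None
    return rating, director


# Hand-written sample payloads (not real API output)
found = {"Response": "True", "imdbRating": "8.5", "Director": "Jane Campion"}
partial = {"Response": "True", "imdbRating": "N/A", "Director": "N/A"}
missing = {"Response": "False", "Error": "Movie not found!"}

for payload in (found, partial, missing):
    print(parse_omdb_fields(payload))
```

Separating the parsing from the request loop makes the "N/A" rules easy to check without spending any of the daily request quota.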
##### 🔑 How to Get a Free OMDb API Key (for enrichment script)

1. **Go to**: [https://www.omdbapi.com/apikey.aspx](https://www.omdbapi.com/apikey.aspx)  
2. **Enter your email address** in the box.  
3. **Click “Sign Up”**.  
4. **Check your email inbox**: you’ll receive a message from OMDb with your **API key** (looks like: `abcd1234`).  
5. **Copy the key** and paste it into the script where it says:  
   ```python
   API_KEY = "your_key_here"
   ```

> ✅ That’s it! **No credit card**, no approval needed.  
> ⚠️ Free tier allows **1,000 requests per day**, enough for most small projects.

Important API notes from the website:
- Rate Limit: ~1,000 requests/day (free).
- No bulk requests: You must query one title at a time.
- Not real-time: Data is updated periodically, not instantly.
- Commercial use? Requires a paid plan (not covered here).
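
Because the free tier allows about 1,000 requests per day and each title needs its own call, it can help to estimate the number of calls before starting. A minimal sketch, assuming a hypothetical helper `rows_needing_lookup` and a made-up four-row frame; the emptiness test follows the one described for the script (NaN, None, or an empty string).

```python
import math

import pandas as pd

DAILY_LIMIT = 1000  # free-tier figure quoted above


def rows_needing_lookup(df: pd.DataFrame) -> pd.Series:
    """Boolean mask of rows where imdb_rating or director is still empty."""
    rating_missing = df["imdb_rating"].isna()
    director_missing = df["director"].isna() | (
        df["director"].astype(str).str.strip() == ""
    )
    return rating_missing | director_missing


# Made-up sample: two complete rows and two rows with gaps
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_rating": [7.1, None, 6.4, 8.0],
    "director": ["X", "Y", "", "Z"],
})

pending = int(rows_needing_lookup(sample).sum())
days = math.ceil(pending / DAILY_LIMIT)
print(f"{pending} lookups needed, about {days} day(s) at {DAILY_LIMIT} requests/day")
```

On the real file, the same mask applied to the Phase 1 output gives an upper bound on the API calls, since rows that OMDb cannot find stay empty and are retried on the next run.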



In [None]:


import pandas as pd
import requests
import time
import os
import sys

# === CONFIG ===
API_KEY = "[insert your created API Key Here]"      # see "How to Get a Free OMDb API Key" above to know how to create it
INPUT_FILE = "netflix_titles(modified using Dataset).csv"
OUTPUT_FILE = "netflix_titles(modified_Final).csv"
SAVE_EVERY = 10  

# === LOAD DATA ===
if os.path.exists(OUTPUT_FILE):
    print(f"📁 Resuming from existing {OUTPUT_FILE}")
    df = pd.read_csv(OUTPUT_FILE)
else:
    df = pd.read_csv(INPUT_FILE)
    if 'imdb_rating' not in df.columns:
        df['imdb_rating'] = None
    if 'director' not in df.columns:
        df['director'] = None

if 'title' not in df.columns:
    raise ValueError("Your CSV must have a 'title' column!")

# Track how many rows we've processed in this run (for saving)
processed_count = 0

# FETCH RATINGS AND DIRECTORS 
for idx, row in df.iterrows():
    title = str(row['title']).strip()
    year = row.get('release_year', '')
    if pd.notna(year):
        try:
            year = str(int(float(year)))
        except (ValueError, TypeError):
            year = ''
    else:
        year = ''

    # Check current state
    current_rating = row.get('imdb_rating', None)
    current_director = row.get('director', '')

    rating_done = pd.notna(current_rating) and current_rating != ''
    director_done = not (pd.isna(current_director) or str(current_director).strip() == '')

    if rating_done and director_done:
        print(f"⏭️  Skipping (rating & director already present): {title}")
        processed_count += 1
    else:
        # Let requests build and URL-encode the query string; HTTPS keeps the key out of clear text
        params = {"t": title, "apikey": API_KEY}
        if year:
            params["y"] = year

        try:
            response = requests.get("https://www.omdbapi.com/", params=params, timeout=10)
            data = response.json()
        except Exception as e:
            print(f"⚠️  Network error for '{title}' ({year}): {e}")
            time.sleep(1)
            processed_count += 1
            if processed_count % SAVE_EVERY == 0:
                df.to_csv(OUTPUT_FILE, index=False)
                print(f"💾 Saved progress at row {idx + 1} (every {SAVE_EVERY})")
            continue
            continue

        # Handle successful response
        if data.get('Response') == 'True':
            # --- Update IMDb Rating (if missing) ---
            if not rating_done:
                rating = data.get('imdbRating', 'N/A')
                if rating != 'N/A':
                    try:
                        df.at[idx, 'imdb_rating'] = float(rating)
                    except ValueError:
                        df.at[idx, 'imdb_rating'] = None
                else:
                    df.at[idx, 'imdb_rating'] = None

            # --- Update Director (if missing) ---
            if not director_done:
                omdb_director = data.get('Director', '').strip()
                if omdb_director and omdb_director != 'N/A':
                    df.at[idx, 'director'] = omdb_director
                else:
                    df.at[idx, 'director'] = None

            print(f"✅ {title} ({year}) → "
                  f"Rating: {df.at[idx, 'imdb_rating']}, "
                  f"Director: {df.at[idx, 'director'] or 'N/A'}")

        else:
            error_msg = data.get('Error', 'Unknown error')
            print(f"❌ Not found: '{title}' ({year}) → Reason: {error_msg}")
            # Only set rating/director to None if we were trying to fetch them
            if not rating_done:
                df.at[idx, 'imdb_rating'] = None
            if not director_done:
                df.at[idx, 'director'] = None

            # 🔴 Stop on rate limit
            if error_msg == "Request limit reached!":
                print("\n🛑 OMDb daily request limit reached! Stopping.")
                df.to_csv(OUTPUT_FILE, index=False)
                print(f"✅ Final progress saved to {OUTPUT_FILE}")
                sys.exit(0)

        processed_count += 1
        time.sleep(1)  

    # 💾 Save progress periodically (including skipped rows)
    if processed_count % SAVE_EVERY == 0:
        df.to_csv(OUTPUT_FILE, index=False)
        print(f"💾 Saved progress at row {idx + 1} (every {SAVE_EVERY})")

# Final save
df.to_csv(OUTPUT_FILE, index=False)
print(f"\n✨ All done! Final results saved to {OUTPUT_FILE}")

📁 Resuming from existing netflix_titles(modified_Final).csv
⏭️  Skipping (rating & director already present): Dick Johnson Is Dead
✅ Blood & Water (2021) → Rating: nan, Director: Clarence Horatio
⏭️  Skipping (rating & director already present): Ganglands
⏭️  Skipping (rating & director already present): Jailbirds New Orleans
❌ Not found: 'Kota Factory' (2021) → Reason: Movie not found!
⏭️  Skipping (rating & director already present): Midnight Mass
⏭️  Skipping (rating & director already present): My Little Pony: A New Generation
⏭️  Skipping (rating & director already present): Sankofa
⏭️  Skipping (rating & director already present): The Great British Baking Show
⏭️  Skipping (rating & director already present): The Starling
💾 Saved progress at row 10 (every 10)
❌ Not found: 'Vendetta: Truth, Lies and The Mafia' (2021) → Reason: Movie not found!
⏭️  Skipping (rating & director already present): Bangkok Breaking
⏭️  Skipping (

SystemExit: 0

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
