# **ETL**

## Objectives

* Write your notebook objective here, for example, "Fetch data from Kaggle and save as raw data", or "engineer features for modelling"

## Inputs

* Write down which data or information you need to run the notebook 

## Outputs

* Write here which files, code or artefacts you generate by the end of the notebook 

## Additional Comments

* If you have any additional comments that don't fit in the previous bullets, please state them here. 



---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when you run a notebook in the editor you need to change the working directory.

We need to change the working directory from the notebook's current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
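Note that the cell that calls `os.chdir(os.path.dirname(current_dir))` moves one level up every time it is run, so re-running it can leave you above the project root. The sketch below shows what `os.path.dirname()` returns without changing any directory. The folder names and the `parent_of` helper are hypothetical and used only for illustration.

```python
import os


def parent_of(path: str) -> str:
    """Return the parent folder of `path` without changing directories."""
    return os.path.dirname(os.path.normpath(path))


# Hypothetical project layout, used only for illustration.
notebook_dir = os.path.join("project", "jupyter_notebooks")
project_root = parent_of(notebook_dir)
print(project_root)
```

If the confirmed directory is not the project root, restarting the kernel and running the cells once from the top resets it.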

# Section 1

Section 1 content

In [None]:
import pandas as pd
import numpy as np
from ast import literal_eval # to convert string representation of list to list
from pathlib import Path
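
As a minimal sketch of what `literal_eval` does, the example below parses a string that holds a list of dictionaries back into Python objects. The sample string is made up for illustration and is not taken from the dataset.

```python
from ast import literal_eval

# Made-up sample resembling a stringified list of dicts (an assumption, not real data).
raw = '[{"id": 1, "name": "Action"}, {"id": 2, "name": "Drama"}]'

# literal_eval safely evaluates Python literals (lists, dicts, strings, numbers).
parsed = literal_eval(raw)

# Pull one field out of each dictionary.
names = [item["name"] for item in parsed]
print(names)
```

Unlike `eval`, `literal_eval` only accepts literal structures, so it will raise an error rather than execute arbitrary code.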





In [None]:
# load the datasets
movies_df = pd.read_csv('Data/RAW/tmdb_5000_movies.csv')
credits_df = pd.read_csv('Data/RAW/tmdb_5000_credits.csv')

In [None]:
print(movies_df.shape)
print(credits_df.shape)

In [None]:
print(movies_df.head())
print(credits_df.head())

In [None]:
# missing values
print(movies_df.isna().sum().sort_values(ascending=False))
print(credits_df.isna().sum().sort_values(ascending=False))


In [None]:
print(movies_df.dtypes)
print(credits_df.dtypes)

In [None]:
# Summary statistics for all columns of movies_df

movies_df.describe(include='all')

In [None]:
# Extra checks for duplicates and unique IDs
print("Full Dupe check movies_df", movies_df.duplicated().sum())
print("Full Dupe check credits_df", credits_df.duplicated().sum())

In [None]:
# Unique IDs in both datasets
print("Unique movie IDs in movies_df:", movies_df['id'].is_unique)
print("Unique movie IDs in credits_df:", credits_df['movie_id'].is_unique)


In [None]:
# fill missing runtime values with the column median
movies_df["runtime"] = movies_df["runtime"].fillna(movies_df["runtime"].median())

# check runtime missing values
movies_df['runtime'].isna().sum()

In [None]:
# fill homepage, tagline and overview missing values with an empty string
for col in ['homepage', 'tagline', 'overview']:
    movies_df[col] = movies_df[col].fillna('')
movies_df.isna().sum().sort_values(ascending=False)
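
The two fill strategies used here (median for a numeric column, empty string for text columns) can be sketched on a toy frame. The column values below are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration), with gaps in both columns.
toy = pd.DataFrame({
    "runtime": [90.0, np.nan, 120.0, 100.0],
    "tagline": ["A", None, "C", None],
})

# Numeric column: replace missing values with the median of the observed values.
toy["runtime"] = toy["runtime"].fillna(toy["runtime"].median())

# Text column: replace missing values with an empty string.
toy["tagline"] = toy["tagline"].fillna("")

# Count any values that are still missing across the whole frame.
remaining_missing = int(toy.isna().sum().sum())
print(remaining_missing)
```

The median is less sensitive to extreme values than the mean, which is one common reason to prefer it for a skewed column.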

In [None]:
# make sure release_date is in datetime format
movies_df["release_date"] = pd.to_datetime(movies_df["release_date"], errors="coerce")

In [None]:
# mark which rows have a valid release date
movies_df["has_release_date"] = movies_df["release_date"].notna()
movies_df["has_release_date"].value_counts()

In [None]:
# create a release year column for ease of analysis
movies_df["release_year"] = movies_df["release_date"].dt.year

In [None]:
# Placeholder fill for missing release dates (use has_release_date to exclude these rows later)
movies_df["release_date"] = movies_df["release_date"].fillna(pd.Timestamp("1900-01-01"))

In [None]:
# check the range of release dates
movies_df["release_date"].min(), movies_df["release_date"].max()


In [None]:
# check no missing values in release_date after placeholder fill
movies_df["release_date"].isna().sum()

In [None]:
# check which rows have the placeholder date
movies_df.loc[movies_df["release_date"] == pd.Timestamp("1900-01-01"), ["id","title","release_date"]]
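
The order of the date steps matters, and a toy example makes the effect visible. This is a sketch on invented values, assuming the same sequence as above: coerce, flag, derive the year, then fill the placeholder.

```python
import pandas as pd

# Toy data (an assumption): one valid date, one missing, one unparseable.
toy = pd.DataFrame({"release_date": ["2009-12-10", None, "not a date"]})

# Unparseable or missing values become NaT instead of raising an error.
toy["release_date"] = pd.to_datetime(toy["release_date"], errors="coerce")

# Flag and year are derived BEFORE the placeholder fill, so they keep the truth.
toy["has_release_date"] = toy["release_date"].notna()
toy["release_year"] = toy["release_date"].dt.year

# Placeholder fill removes the NaT values from the date column only.
toy["release_date"] = toy["release_date"].fillna(pd.Timestamp("1900-01-01"))

print(toy)
```

After the fill, the minimum of `release_date` is the placeholder itself, so date-range checks should filter on `has_release_date` or use `release_year`, which stays missing for those rows.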

In [None]:
# describe all for credits_df
credits_df.describe(include='all')

In [None]:
# Check if all movie IDs in credits are also in movies
all_ids_match = credits_df['movie_id'].isin(movies_df['id'])
print("All movie IDs in credits are in movies:", all_ids_match.all())


In [None]:
# merge the two datasets on the movie ID columns
merged_df = movies_df.merge(credits_df, left_on='id', right_on='movie_id', how='left', validate='one_to_one')

# print the shape of the merged dataframe
print("Shape of merged dataframe:", merged_df.shape)
# print missing values after merge
print("Missing values after merge:\n", merged_df.isna().sum().sort_values(ascending=False))

# show the first 5 rows transposed for better readability
merged_df.head().T
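
As a sketch of how a left merge behaves when a key has no partner, the toy example below adds pandas' `indicator=True` option, which is not used in the cell above. The frames are invented for illustration.

```python
import pandas as pd

# Toy frames (an assumption): movie 3 has no matching credits row.
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
credits = pd.DataFrame({"movie_id": [1, 2], "title": ["A", "B"]})

merged = movies.merge(
    credits,
    left_on="id",
    right_on="movie_id",
    how="left",               # keep every movie, matched or not
    validate="one_to_one",    # raise if either key column has duplicates
    indicator=True,           # add a _merge column describing each row's origin
)

# Rows that exist only in the left frame have no credits information.
unmatched = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
print(unmatched)
print(merged.columns.tolist())
```

Because both frames have a `title` column, pandas suffixes them as `title_x` (left) and `title_y` (right).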


---

In [None]:
# Check that every merged row has its movie_id filled
print("Any missing movie_id after merge?", merged_df["movie_id"].isna().sum())

# Double-check duplicates
print("Duplicate IDs in merged_df:", merged_df["id"].duplicated().sum())

# Confirm column names and count
print("Columns in merged_df:", merged_df.columns.tolist())
print("Merged shape:", merged_df.shape)

In [None]:
# Range Validation
print("Release year range:", merged_df["release_year"].min(), "to", merged_df["release_year"].max())

In [None]:
# the merge produced two title columns (title_x, title_y); check they match
title_match = (merged_df["title_x"] == merged_df["title_y"])
print("All titles match between title_x and title_y:", title_match.all())

In [None]:
# the merge kept two ID columns (id, movie_id); check they match
id_match = (merged_df["id"] == merged_df["movie_id"])
print("All IDs match between id and movie_id:", id_match.all())
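
When checking a boolean Series, `all` must be called with parentheses. Without them, the expression refers to the method object rather than its result. A minimal sketch on made-up values:

```python
import pandas as pd

# Toy boolean Series (an assumption) containing one False.
flags = pd.Series([True, False, True])

# Without parentheses this is the bound method, not the answer.
method_is_callable = callable(flags.all)

# With parentheses the check is actually evaluated.
result = bool(flags.all())

print(method_is_callable, result)
```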

Section 2 content

---

NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook cells should run top-down. Do not create a workflow in which you have to go back to a previous cell to execute a task, such as re-running an earlier cell to refresh a variable.

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.