# Data Collection

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data and save it under outputs/datasets/collection


## Inputs

* Kaggle JSON file - the authentication token.


## Outputs

* Generate Dataset: outputs/datasets/collection/merged_movie_data.csv


---

## Install Requirements

In [1]:
%pip install -r /workspace/Film_Hit_prediction/requirements.txt

Note: you may need to restart the kernel to use updated packages.


# Change working directory

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Film_Hit_prediction/jupyter_notebooks'

Set the parent folder as the new current directory


In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/Film_Hit_prediction'
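
The `os.path.dirname` step above only lands in the project root when the notebook is started from `jupyter_notebooks`. A minimal sketch of a more defensive variant follows. It assumes the project root can be recognised by a marker file such as `requirements.txt`; the helper name `find_project_root` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "requirements.txt") -> Path:
    """Walk upwards from `start` until a folder containing `marker` is found."""
    start = start.resolve()
    # Check the starting folder first, then each of its parents in turn.
    for folder in (start, *start.parents):
        if (folder / marker).is_file():
            return folder
    raise FileNotFoundError(f"No '{marker}' found at or above {start}")
```

It could be used as `os.chdir(find_project_root(Path.cwd()))`, which gives the same result whether the kernel starts in the root or in a subfolder.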

# Fetch data from Kaggle

Install Kaggle package to fetch data

In [5]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.



Verify kaggle.json and test the API connection

In [7]:
import os

# The Kaggle token is stored next to the notebooks, so move back into that folder.
# Relative paths used in the following cells resolve from here.
os.chdir('/workspace/Film_Hit_prediction/jupyter_notebooks')

print(f"Current working directory: {os.getcwd()}")
print("Files in directory:", os.listdir())

# Point the Kaggle CLI at this folder and restrict the token to owner read/write
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
!chmod 600 kaggle.json

print("\nTesting Kaggle API connection:")
!kaggle datasets list --sort-by votes

Current working directory: /workspace/Film_Hit_prediction/jupyter_notebooks
Files in directory: ['1_data_collection.ipynb', '2_data_cleaning.ipynb', '3_Film_success_study.ipynb', '4_feature_engineering.ipynb', '5_modeling_evaluation.ipynb', 'inputs', 'kaggle.json', 'outputs']



Testing Kaggle API connection:
ref                                                           title                                                size  lastUpdated          downloadCount  voteCount  usabilityRating  
------------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
jessicali9530/animal-crossing-new-horizons-nookplaza-dataset  Animal Crossing New Horizons Catalog                577KB  2021-06-08 15:05:09          71862      49657  0.88235295       
mlg-ulb/creditcardfraud                                       Credit Card Fraud Detection                          66MB  2018-03-23 01:17:27         808878      11837  0.85294116       
allen-institute-for-ai/CORD-19-research-challenge             COVID-19 Open Research Dataset Challenge (CORD-19)   18GB  2022-06-06 19:39:40         183549      11029  0.88235295       
shivamb/netflix-shows                 
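
The check above relies on the CLI call failing if the token is wrong. A minimal sketch of validating the token file before calling the CLI is shown below. It assumes the standard Kaggle token layout with a `username` and a `key` field; the helper name `check_kaggle_token` is hypothetical.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(config_dir: str) -> Path:
    """Validate kaggle.json in `config_dir` and restrict its permissions."""
    token_path = Path(config_dir) / "kaggle.json"
    if not token_path.is_file():
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")

    token = json.loads(token_path.read_text())
    missing = {"username", "key"} - token.keys()
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")

    # Same effect as `chmod 600`: read/write for the owner only.
    token_path.chmod(stat.S_IRUSR | stat.S_IWUSR)
    # Tell the Kaggle CLI where to look for the token.
    os.environ["KAGGLE_CONFIG_DIR"] = str(token_path.parent)
    return token_path
```

Calling `check_kaggle_token(os.getcwd())` before `!kaggle datasets list` would surface a missing or incomplete token as a clear Python error instead of a CLI failure.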

Install Kagglehub package

In [8]:
%pip install kagglehub

Note: you may need to restart the kernel to use updated packages.


Destination folder:

In [9]:

DestinationFolder = "inputs/datasets/raw"
os.makedirs(DestinationFolder, exist_ok=True)

Download the dataset

In [10]:
!kaggle datasets download -d tmdb/tmdb-movie-metadata -p {DestinationFolder}

tmdb-movie-metadata.zip: Skipping, found more recently modified local copy (use --force to force download)


Unzip the dataset

In [11]:
import zipfile
dataset_zip = os.path.join(DestinationFolder, "tmdb-movie-metadata.zip")
with zipfile.ZipFile(dataset_zip, 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

print(f"Dataset downloaded and extracted to: {DestinationFolder}")

Dataset downloaded and extracted to: inputs/datasets/raw
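
The extraction step prints a success message without checking what was unpacked. A small sketch of extracting and then confirming that the expected files exist is given below. The helper name `extract_and_check` is hypothetical, and the list of expected files is supplied by the caller.

```python
import zipfile
from pathlib import Path


def extract_and_check(zip_path: str, dest: str, expected: list) -> list:
    """Extract `zip_path` into `dest` and confirm the expected files exist."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(dest)

    # Report every expected file that did not end up in the destination folder.
    missing = [name for name in expected if not (Path(dest) / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing after extraction: {missing}")
    return names
```

In this notebook the call might look like `extract_and_check(dataset_zip, DestinationFolder, ["tmdb_5000_credits.csv", "tmdb_5000_movies.csv"])`, using the two file names loaded in the next section.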


---

# Load and inspect Kaggle data


In [12]:
import pandas as pd

movie_credits_path = "inputs/datasets/raw/tmdb_5000_credits.csv"
movies_path = "inputs/datasets/raw/tmdb_5000_movies.csv"

df_movie_credits = pd.read_csv(movie_credits_path)
df_movies = pd.read_csv(movies_path)

print("Movie Credits Data:")
print(df_movie_credits.head())

print("\nMovies Data:")
print(df_movies.head())

Movie Credits Data:
   movie_id                                     title  \
0     19995                                    Avatar   
1       285  Pirates of the Caribbean: At World's End   
2    206647                                   Spectre   
3     49026                     The Dark Knight Rises   
4     49529                               John Carter   

                                                cast  \
0  [{"cast_id": 242, "character": "Jake Sully", "...   
1  [{"cast_id": 4, "character": "Captain Jack Spa...   
2  [{"cast_id": 1, "character": "James Bond", "cr...   
3  [{"cast_id": 2, "character": "Bruce Wayne / Ba...   
4  [{"cast_id": 5, "character": "John Carter", "c...   

                                                crew  
0  [{"credit_id": "52fe48009251416c750aca23", "de...  
1  [{"credit_id": "52fe4232c3a36847f800b579", "de...  
2  [{"credit_id": "54805967c3a36829b5002c41", "de...  
3  [{"credit_id": "52fe4781c3a36847f81398c3", "de...  
4  [{"credit_id": "52fe47

Detailed columns in Movies Data

In [13]:
print("Movies DataFrame Columns and Types:")
for col in df_movies.columns:
    print(f"{col}: {df_movies[col].dtype}")

Movies DataFrame Columns and Types:
budget: int64
genres: object
homepage: object
id: int64
keywords: object
original_language: object
original_title: object
overview: object
popularity: float64
production_companies: object
production_countries: object
release_date: object
revenue: int64
runtime: float64
spoken_languages: object
status: object
tagline: object
title: object
vote_average: float64
vote_count: int64


DataFrame summary

In [14]:
print("Movie Credits DataFrame Info:")
df_movie_credits.info()

print("\nMovies DataFrame Info:")
df_movies.info()

Movie Credits DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB

Movies DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   ob
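
The `info()` output shows that `homepage` has far fewer non-null values than the other columns. A compact way to list only the columns with gaps is sketched below. The helper name `missing_report` is hypothetical, and the toy frame is made up for illustration rather than taken from the TMDB files.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """List columns with missing values, with counts and percentages."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that have at least one missing value.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy data: `homepage` is missing in three of four rows.
toy = pd.DataFrame({"budget": [1, 2, 3, 4], "homepage": [None, "x", None, None]})
print(missing_report(toy))
```

Running `missing_report(df_movies)` would give the same kind of table for the real data.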

Check for duplicates

In [15]:
print("\nDuplicate Rows in Movie Credits DataFrame:")
duplicates_movie_credits = df_movie_credits[df_movie_credits.duplicated()]
print(duplicates_movie_credits)


print("\nDuplicate Rows in Movies DataFrame:")
duplicates_movies = df_movies[df_movies.duplicated()]
print(duplicates_movies)


Duplicate Rows in Movie Credits DataFrame:
Empty DataFrame
Columns: [movie_id, title, cast, crew]
Index: []

Duplicate Rows in Movies DataFrame:
Empty DataFrame
Columns: [budget, genres, homepage, id, keywords, original_language, original_title, overview, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, vote_average, vote_count]
Index: []
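
`duplicated()` without arguments only flags rows that are identical in every column. Since the merge below joins on the movie id, it may also be worth checking whether the key alone repeats. A minimal sketch on toy data follows; the helper name `duplicate_summary` is hypothetical.

```python
import pandas as pd


def duplicate_summary(df: pd.DataFrame, key: str) -> dict:
    """Count fully duplicated rows and rows that repeat the key column."""
    return {
        "full_row_duplicates": int(df.duplicated().sum()),
        "key_duplicates": int(df.duplicated(subset=[key]).sum()),
    }


# Toy data (not from the TMDB files): two rows share id 2 but differ in title.
toy = pd.DataFrame({"id": [1, 2, 2], "title": ["A", "B", "B (re-release)"]})
summary = duplicate_summary(toy, key="id")
print(summary)  # {'full_row_duplicates': 0, 'key_duplicates': 1}
```

For the notebook's data the calls would be `duplicate_summary(df_movies, "id")` and `duplicate_summary(df_movie_credits, "movie_id")`.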


---


## Merging datasets

Show the columns of the two datasets to merge

In [16]:
print("Movies columns:", df_movies.columns.tolist())
print("\nMovie Credits columns:", df_movie_credits.columns.tolist())

Movies columns: ['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count']

Movie Credits columns: ['movie_id', 'title', 'cast', 'crew']


Rename movie_id to id in the credits dataframe to match the movies dataframe

In [17]:
df_movie_credits = df_movie_credits.rename(columns={'movie_id': 'id'})

Merge the datasets

In [18]:
df_merged = df_movies.merge(df_movie_credits, on='id', how='left')

Print the shape of the merged dataset

In [19]:
print("\nMerged dataset shape:", df_merged.shape)


Merged dataset shape: (4803, 23)
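
Because both frames carry a `title` column, the merge above produces `title_x` and `title_y`. One possible variant, sketched on toy data, drops the second title before merging and lets pandas check the join. It assumes the two title columns carry the same information, which this notebook does not verify.

```python
import pandas as pd

# Toy frames standing in for df_movies and df_movie_credits (values are made up).
movies = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "budget": [10, 20]})
credits = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "cast": ["[]", "[]"]})

merged = movies.merge(
    credits.drop(columns=["title"]),  # avoid the title_x / title_y pair
    on="id",
    how="left",
    validate="one_to_one",  # raises MergeError if ids repeat on either side
    indicator=True,         # adds a _merge column: both / left_only / right_only
)

# Rows that found no partner in the credits frame.
unmatched = int((merged["_merge"] != "both").sum())
print(merged.columns.tolist())
print("Movies without credits:", unmatched)
```

The `validate` and `indicator` arguments are optional; they turn silent join problems (repeated ids, unmatched rows) into something visible.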


---

# Push file to repo

In [20]:
import os

output_dir = "outputs/datasets/collection"
merged_file_path = os.path.join(output_dir, "merged_movie_data.csv")

# Create the output folder if it is missing
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    print(f"Created directory: {output_dir}")
else:
    print(f"Directory already exists: {output_dir}")

try:
    # Only save when the merge cell above has been run
    if 'df_merged' in locals():
        df_merged.to_csv(merged_file_path, index=False)
        print(f"\nMerged file saved successfully in {output_dir}")
        print(f"Merged dataset shape: {df_merged.shape}")
    else:
        print("Warning: df_merged not found. Please create your merged DataFrame first.")
except Exception as e:
    print(f"Error saving merged file: {e}")

Directory already exists: outputs/datasets/collection



Merged file saved successfully in outputs/datasets/collection
Merged dataset shape: (4803, 23)


Verify the merged dataset


In [21]:
print("\nColumns in merged dataset:", df_merged.columns.tolist())
print("\nSample of merged data:")
print(df_merged[['id', 'title_x', 'budget', 'cast']].head())


Columns in merged dataset: ['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title_x', 'vote_average', 'vote_count', 'title_y', 'cast', 'crew']

Sample of merged data:
       id                                   title_x     budget  \
0   19995                                    Avatar  237000000   
1     285  Pirates of the Caribbean: At World's End  300000000   
2  206647                                   Spectre  245000000   
3   49026                     The Dark Knight Rises  250000000   
4   49529                               John Carter  260000000   

                                                cast  
0  [{"cast_id": 242, "character": "Jake Sully", "...  
1  [{"cast_id": 4, "character": "Captain Jack Spa...  
2  [{"cast_id": 1, "character": "James Bond", "cr...  
3  [{"cast_id": 2, "
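
Printing a sample confirms the in-memory frame, not the file on disk. A minimal sketch of a round-trip check that reads the saved CSV back and compares shape and column names is shown below. The helper name `verify_saved_csv` is hypothetical, and the toy frame and temporary folder are only for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def verify_saved_csv(df: pd.DataFrame, csv_path: Path) -> bool:
    """Return True if the CSV at `csv_path` has the same shape and columns as `df`."""
    reloaded = pd.read_csv(csv_path)
    return reloaded.shape == df.shape and list(reloaded.columns) == list(df.columns)


# Toy example in a temporary folder; values are made up.
toy = pd.DataFrame({"id": [1, 2], "title_x": ["A", "B"]})
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "merged_movie_data.csv"
    toy.to_csv(path, index=False)
    print(verify_saved_csv(toy, path))  # True
```

In the notebook this would be `verify_saved_csv(df_merged, merged_file_path)`.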

---