#  Data Collection 

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data and save it under outputs/datasets/collection

## Inputs

* Kaggle JSON file - the authentication token.


## Outputs

* Generate Dataset: outputs/datasets/collection/encoded_movies.csv

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

# Install Python Packages #

In [4]:
%pip install -r //workspace/Film_Hit_prediction/requirements.txt

Note: you may need to restart the kernel to use updated packages.


# Change working directory

* The notebooks are stored in a subfolder, so the working directory must be changed when a notebook is run in the editor.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [5]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Film_Hit_prediction/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [6]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [7]:
current_dir = os.getcwd()
current_dir

'/workspace/Film_Hit_prediction'
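
Re-running the `os.chdir` cell above moves one level further up each time. A minimal sketch of a guarded variant follows; the helper name `project_root` and the assumption that the notebooks folder is called `jupyter_notebooks` (taken from the path printed above) are illustrative, not part of the original notebook.

```python
import os
from pathlib import Path


def project_root(start, notebooks_dirname="jupyter_notebooks"):
    """Return the parent of `start` only if `start` is the notebooks folder."""
    start = Path(start)
    return start.parent if start.name == notebooks_dirname else start


# Safe to run repeatedly: once in the project root, nothing changes.
os.chdir(project_root(Path.cwd()))
print(f"Working directory: {Path.cwd()}")
```

With this guard the cell can be executed more than once without leaving the project folder.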

# Fetch data from Kaggle

Install Kaggle package to fetch data

In [8]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.



In [10]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
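
The cell above assumes that `kaggle.json` already sits in the working directory. As a hedged sketch, the same setup can be written in plain Python with an explicit check that the token exists; `prepare_kaggle_token` is a hypothetical helper name, and `os.chmod` with owner read/write bits stands in for the shell `chmod 600`.

```python
import os
import stat
from pathlib import Path


def prepare_kaggle_token(config_dir):
    """Point the Kaggle client at config_dir and restrict kaggle.json to its owner."""
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"Expected an API token at {token}")
    # Same effect as `chmod 600 kaggle.json`: read/write for the owner only.
    os.chmod(token, stat.S_IRUSR | stat.S_IWUSR)
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    return token


# Usage in this notebook (assumes kaggle.json sits in the project root):
# prepare_kaggle_token(os.getcwd())
```

Failing early with a clear message is easier to diagnose than an authentication error from the download command later on.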

In [11]:
!pip install kagglehub



Destination folder:

In [12]:

DestinationFolder = "inputs/datasets/raw"
os.makedirs(DestinationFolder, exist_ok=True)

Download the Dataset:

In [13]:
!kaggle datasets download -d tmdb/tmdb-movie-metadata -p {DestinationFolder}


tmdb-movie-metadata.zip: Skipping, found more recently modified local copy (use --force to force download)


Unzip the dataset

In [14]:
import zipfile
dataset_zip = os.path.join(DestinationFolder, "tmdb-movie-metadata.zip")
with zipfile.ZipFile(dataset_zip, 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

print(f"Dataset downloaded and extracted to: {DestinationFolder}")

Dataset downloaded and extracted to: inputs/datasets/raw
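
The extraction step above unpacks every member of the archive. A small sketch of a variant that extracts only the CSV files and reports their names is shown below; `extract_csv_members` is a hypothetical helper, and filtering on the `.csv` suffix is an assumption about what the archive contains.

```python
import zipfile
from pathlib import Path


def extract_csv_members(zip_path, dest_dir):
    """Extract only the .csv members of zip_path into dest_dir and return their names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        csv_names = [name for name in archive.namelist() if name.lower().endswith(".csv")]
        archive.extractall(dest, members=csv_names)
    return sorted(csv_names)


# Usage in this notebook:
# extract_csv_members("inputs/datasets/raw/tmdb-movie-metadata.zip", "inputs/datasets/raw")
```

Returning the member names makes it easy to confirm which files are available before loading them.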


---

# Load and inspect Kaggle Data #


In [15]:
import pandas as pd

In [16]:
movie_credits_path = "inputs/datasets/raw/tmdb_5000_credits.csv"
movies_path = "inputs/datasets/raw/tmdb_5000_movies.csv"

df_movie_credits = pd.read_csv(movie_credits_path)
df_movies = pd.read_csv(movies_path)

print("Movie Credits Data:")
print(df_movie_credits.head())

print("\nMovies Data:")
print(df_movies.head())


Movie Credits Data:
   movie_id                                     title  \
0     19995                                    Avatar   
1       285  Pirates of the Caribbean: At World's End   
2    206647                                   Spectre   
3     49026                     The Dark Knight Rises   
4     49529                               John Carter   

                                                cast  \
0  [{"cast_id": 242, "character": "Jake Sully", "...   
1  [{"cast_id": 4, "character": "Captain Jack Spa...   
2  [{"cast_id": 1, "character": "James Bond", "cr...   
3  [{"cast_id": 2, "character": "Bruce Wayne / Ba...   
4  [{"cast_id": 5, "character": "John Carter", "c...   

                                                crew  
0  [{"credit_id": "52fe48009251416c750aca23", "de...  
1  [{"credit_id": "52fe4232c3a36847f800b579", "de...  
2  [{"credit_id": "54805967c3a36829b5002c41", "de...  
3  [{"credit_id": "52fe4781c3a36847f81398c3", "de...  
4  [{"credit_id": "52fe47

DataFrame Summary

In [17]:
print("Movie Credits DataFrame Info:")
df_movie_credits.info()

print("\nMovies DataFrame Info:")
df_movies.info()

Movie Credits DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB

Movies DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   ob

Check for duplicates

In [18]:
print("\nDuplicate Rows in Movie Credits DataFrame:")
duplicates_movie_credits = df_movie_credits[df_movie_credits.duplicated()]
print(duplicates_movie_credits)


print("\nDuplicate Rows in Movies DataFrame:")
duplicates_movies = df_movies[df_movies.duplicated()]
print(duplicates_movies)


Duplicate Rows in Movie Credits DataFrame:
Empty DataFrame
Columns: [movie_id, title, cast, crew]
Index: []

Duplicate Rows in Movies DataFrame:
Empty DataFrame
Columns: [budget, genres, homepage, id, keywords, original_language, original_title, overview, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, vote_average, vote_count]
Index: []
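
`DataFrame.duplicated()` with no arguments only flags rows that are identical in every column. A complementary check, sketched here on made-up toy data, looks for repeated values in a key column; treating `id` (or `movie_id` in the credits table) as the key is an assumption.

```python
import pandas as pd

# Toy data: the same id appears twice with different titles.
toy = pd.DataFrame({"id": [1, 2, 2], "title": ["A", "B", "B (re-release)"]})

# Whole-row comparison finds nothing because the titles differ.
full_row_dupes = toy[toy.duplicated()]

# Key-based comparison flags both rows that share id == 2.
key_dupes = toy[toy.duplicated(subset=["id"], keep=False)]

print(f"Full-row duplicates: {len(full_row_dupes)}")
print(f"Rows sharing an id:  {len(key_dupes)}")
```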


Keep relevant columns



In [19]:
columns_to_keep = ['original_language', 'genres', 'budget', 'revenue']
# Take a copy so that later column assignments do not act on a view of df_movies
df_movies_filtered = df_movies[columns_to_keep].copy()

print("Filtered DataFrame with selected columns:")
print(df_movies_filtered.head())

Filtered DataFrame with selected columns:
  original_language                                             genres  \
0                en  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
1                en  [{"id": 12, "name": "Adventure"}, {"id": 14, "...   
2                en  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
3                en  [{"id": 28, "name": "Action"}, {"id": 80, "nam...   
4                en  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   

      budget     revenue  
0  237000000  2787965087  
1  300000000   961000000  
2  245000000   880674609  
3  250000000  1084939099  
4  260000000   284139100  


Handle missing values

In [20]:
df_movies_filtered = df_movies_filtered.dropna(subset=['original_language', 'genres', 'budget', 'revenue'])

print("DataFrame after dropping rows with NaN in 'original_language', 'genres', 'budget', 'revenue':")
print(df_movies_filtered.head())

DataFrame after dropping rows with NaN in 'original_language', 'genres', 'budget', 'revenue':
  original_language                                             genres  \
0                en  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
1                en  [{"id": 12, "name": "Adventure"}, {"id": 14, "...   
2                en  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
3                en  [{"id": 28, "name": "Action"}, {"id": 80, "nam...   
4                en  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   

      budget     revenue  
0  237000000  2787965087  
1  300000000   961000000  
2  245000000   880674609  
3  250000000  1084939099  
4  260000000   284139100  
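
Counting missing values before and after the drop makes it explicit how many rows are removed; the `info()` output earlier lists 4803 non-null entries for `budget`, `genres` and `original_language`, so the drop may remove few or no rows. The sketch below uses made-up toy data in place of `df_movies_filtered`.

```python
import pandas as pd

# Toy stand-in for df_movies_filtered (values are made up).
toy = pd.DataFrame({
    "original_language": ["en", None, "fr"],
    "genres": ["[]", "[]", "[]"],
    "budget": [100.0, 200.0, None],
    "revenue": [10, 20, 30],
})
required = ["original_language", "genres", "budget", "revenue"]

# Missing values per required column before dropping anything.
missing_per_column = toy[required].isna().sum()

cleaned = toy.dropna(subset=required)
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(f"Rows dropped: {rows_dropped}")
```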


One-hot encode the 'genres' column 

In [21]:
df_movies_filtered['genres'] = df_movies_filtered['genres'].apply(lambda x: x.split(','))
# .sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df_genres = pd.get_dummies(df_movies_filtered['genres'].apply(pd.Series).stack(), prefix='genre').groupby(level=0).sum()
df_movies_filtered = pd.concat([df_movies_filtered, df_genres], axis=1)

print("DataFrame after concatenating one-hot encoded genres:")
print(df_movies_filtered.head())

DataFrame after concatenating one-hot encoded genres:
  original_language                                             genres  \
0                en  [[{"id": 28,  "name": "Action"},  {"id": 12,  ...   
1                en  [[{"id": 12,  "name": "Adventure"},  {"id": 14...   
2                en  [[{"id": 28,  "name": "Action"},  {"id": 12,  ...   
3                en  [[{"id": 28,  "name": "Action"},  {"id": 80,  ...   
4                en  [[{"id": 28,  "name": "Action"},  {"id": 12,  ...   

      budget     revenue  genre_ "name": "Action"}  genre_ "name": "Action"}]  \
0  237000000  2787965087                         1                          0   
1  300000000   961000000                         0                          1   
2  245000000   880674609                         1                          0   
3  250000000  1084939099                         1                          0   
4  260000000   284139100                         1                          0   

   genre_ "nam

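
The column names in the output above (for example `genre_ "name": "Action"}`) suggest that splitting the raw string on commas breaks each JSON object into fragments. Assuming each `genres` cell is a JSON array of objects with a `name` key, as the previews show, one possible alternative is to parse the JSON and encode only the genre names. The sketch below runs on made-up toy rows; `genre_names` is a hypothetical helper.

```python
import json

import pandas as pd

# Toy rows in the same shape as the previewed `genres` column.
toy = pd.DataFrame({
    "genres": [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        '[{"id": 12, "name": "Adventure"}]',
        '[]',
    ]
})


def genre_names(raw):
    """Parse one JSON string and return the list of genre names."""
    return [entry["name"] for entry in json.loads(raw)]


genre_lists = toy["genres"].apply(genre_names)

# explode() yields one row per genre (NaN for empty lists); grouping by the
# original index collapses them back to one row per movie.
df_genres = (
    pd.get_dummies(genre_lists.explode(), prefix="genre", dtype=int)
    .groupby(level=0)
    .sum()
)
print(df_genres)
```

This yields one column per genre name, such as `genre_Action`, and a row of zeros for movies with no listed genre.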


One-hot encode 'original_language'

In [22]:
df_movies_filtered = pd.get_dummies(df_movies_filtered, columns=['original_language'])

print("DataFrame after applying one-hot encoding to 'original_language':")
print(df_movies_filtered.head())

DataFrame after applying one-hot encoding to 'original_language':
                                              genres     budget     revenue  \
0  [[{"id": 28,  "name": "Action"},  {"id": 12,  ...  237000000  2787965087   
1  [[{"id": 12,  "name": "Adventure"},  {"id": 14...  300000000   961000000   
2  [[{"id": 28,  "name": "Action"},  {"id": 12,  ...  245000000   880674609   
3  [[{"id": 28,  "name": "Action"},  {"id": 80,  ...  250000000  1084939099   
4  [[{"id": 28,  "name": "Action"},  {"id": 12,  ...  260000000   284139100   

   genre_ "name": "Action"}  genre_ "name": "Action"}]  \
0                         1                          0   
1                         0                          1   
2                         1                          0   
3                         1                          0   
4                         1                          0   

   genre_ "name": "Adventure"}  genre_ "name": "Adventure"}]  \
0                            1                
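
In pandas 2.0 and later, `pd.get_dummies` returns boolean columns by default. If 0/1 integers are wanted, passing `dtype=int` is one way to get them. A minimal sketch on made-up toy data:

```python
import pandas as pd

# Toy data with two languages (values are made up).
toy = pd.DataFrame({"original_language": ["en", "fr", "en"], "budget": [10, 20, 30]})

# dtype=int forces 0/1 integer indicator columns regardless of pandas version.
encoded = pd.get_dummies(toy, columns=["original_language"], dtype=int)
print(encoded)
```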


---

# Save as CSV

* Save the filtered and encoded DataFrame as a CSV file.

In [23]:
os.makedirs('outputs/datasets/collection', exist_ok=True)
df_movies_filtered.to_csv('outputs/datasets/collection/encoded_movies.csv', index=False)
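
A quick way to confirm that the file was written as intended is to read it back and compare it with the DataFrame in memory. The sketch below does this with made-up toy data in a temporary directory; the `outputs/datasets/collection` layout follows the objectives stated at the top of the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the encoded DataFrame (values are made up).
toy = pd.DataFrame({"budget": [1, 2], "genre_Action": [1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "outputs" / "datasets" / "collection" / "encoded_movies.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)
    # Read the file back to check that shape and values survive the round trip.
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
```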
