# Data Collection

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data, encode the categorical columns, and save the result as encoded_movies.csv

## Inputs

* Kaggle JSON file - the authentication token.


## Outputs

* Generate Dataset: encoded_movies.csv (written to the working directory)



---

# Install Python Packages #

In [None]:
%pip install -r /workspace/Film_Hit_prediction/requirements.txt

# Change working directory

* The notebooks are stored in a subfolder, so the working directory must be changed when running a notebook in the editor.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
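Running the `os.chdir` cell a second time moves up one more level. A minimal sketch of a repeatable alternative follows, assuming the project root is the folder that holds `requirements.txt` (as the install cell suggests). `find_project_root` is a hypothetical helper, not part of the original notebook.

```python
import os
from pathlib import Path


def find_project_root(start, marker="requirements.txt"):
    """Return the closest folder at or above `start` that contains `marker`.

    Falls back to `start` itself when no such folder is found.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_file():
            return candidate
    return start


# Assumption: requirements.txt sits in the project root.
project_root = find_project_root(os.getcwd())
print(f"Project root resolved to: {project_root}")
```

Passing `project_root` to `os.chdir()` gives the same result no matter how many times the cell is run.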

# Fetch data from Kaggle

Install Kaggle package to fetch data

In [None]:
%pip install kaggle==1.5.12

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
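The download step depends on `kaggle.json` being present in the folder that `KAGGLE_CONFIG_DIR` points to. The sketch below checks for the file and its two credential fields (`username` and `key`) before any download is attempted. `check_kaggle_token` is a hypothetical helper introduced for illustration.

```python
import json
import os


def check_kaggle_token(config_dir):
    """Return True if config_dir/kaggle.json exists and holds both credential fields."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    with open(token_path, encoding="utf-8") as f:
        try:
            token = json.load(f)
        except json.JSONDecodeError:
            return False
    # Both fields must be present and non-empty.
    return isinstance(token, dict) and all(token.get(k) for k in ("username", "key"))


if not check_kaggle_token(os.getcwd()):
    print("kaggle.json is missing or incomplete; the download step will fail.")
```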

In [None]:
%pip install kagglehub

Destination folder:

In [38]:

DestinationFolder = "inputs/datasets/raw"
os.makedirs(DestinationFolder, exist_ok=True)

Download the Dataset:

In [None]:
!kaggle datasets download -d tmdb/tmdb-movie-metadata -p {DestinationFolder}


Unzip the dataset

In [None]:
import zipfile
dataset_zip = os.path.join(DestinationFolder, "tmdb-movie-metadata.zip")
with zipfile.ZipFile(dataset_zip, 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

print(f"Dataset downloaded and extracted to: {DestinationFolder}")
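It can help to confirm which CSV files the archive actually contained before loading them. The sketch below wraps the extraction in a hypothetical helper, `extract_and_list_csvs`, that returns the CSV member names. The guarded call at the end only runs if the zip file is present.

```python
import os
import zipfile


def extract_and_list_csvs(zip_path, destination):
    """Extract every member of zip_path into destination and return the CSV names."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(destination)
        names = archive.namelist()
    return sorted(name for name in names if name.lower().endswith(".csv"))


# Same paths as the cells above; skipped when the archive has not been downloaded.
zip_path = os.path.join("inputs/datasets/raw", "tmdb-movie-metadata.zip")
if os.path.isfile(zip_path):
    print(extract_and_list_csvs(zip_path, "inputs/datasets/raw"))
```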

---

# Load and inspect Kaggle Data #


In [12]:
import pandas as pd

In [None]:
movie_credits_path = "inputs/datasets/raw/tmdb_5000_credits.csv"
movies_path = "inputs/datasets/raw/tmdb_5000_movies.csv"

df_movie_credits = pd.read_csv(movie_credits_path)
df_movies = pd.read_csv(movies_path)

print("Movie Credits Data:")
print(df_movie_credits.head())

print("\nMovies Data:")
print(df_movies.head())


DataFrame Summary

In [None]:
print("Movie Credits DataFrame Info:")
df_movie_credits.info()

print("\nMovies DataFrame Info:")
df_movies.info()

Check for duplicates

In [None]:
print("\nDuplicate Rows in Movie Credits DataFrame:")
duplicates_movie_credits = df_movie_credits[df_movie_credits.duplicated()]
print(duplicates_movie_credits)


print("\nDuplicate Rows in Movies DataFrame:")
duplicates_movies = df_movies[df_movies.duplicated()]
print(duplicates_movies)

Keep relevant columns



In [None]:
columns_to_keep = ['genres','original_language','budget', 'revenue']
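# .copy() gives an independent DataFrame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning
df_movies_filtered = df_movies[columns_to_keep].copy()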

print("Filtered DataFrame with selected columns:")
print(df_movies_filtered.head())

Check for missing data


In [None]:
print(df_movies_filtered[columns_to_keep].isnull().sum())

One-hot encode the 'genres' column 

In [None]:
print(df_movies_filtered['genres'].head())

In [None]:
import json

df_movies_filtered['genres'] = df_movies_filtered['genres'].apply(lambda x: [genre['name'] for genre in json.loads(x)])
genre_dummies = pd.get_dummies(df_movies_filtered['genres'].explode()).groupby(level=0).max()
df_movies_filtered = pd.concat([df_movies_filtered, genre_dummies], axis=1)
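To make the transformation easier to follow, here is a plain-Python sketch of the same idea on a few illustrative rows. It assumes each `genres` value is a JSON string holding a list of objects with a `name` field, as the parsing code above implies. The sample rows and the helper names are made up for illustration, and the output uses 0/1 integers rather than the dtype pandas produces.

```python
import json

# Illustrative rows in the same JSON-string format as the 'genres' column.
sample_genres = [
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[{"id": 18, "name": "Drama"}]',
    '[]',
]


def parse_genre_names(raw):
    """Turn one JSON string into a list of genre names."""
    return [genre["name"] for genre in json.loads(raw)]


def one_hot_rows(rows):
    """Return the sorted genre names and one 0/1 indicator dict per row."""
    parsed = [parse_genre_names(raw) for raw in rows]
    all_genres = sorted({name for names in parsed for name in names})
    encoded = [{g: int(g in names) for g in all_genres} for names in parsed]
    return all_genres, encoded


columns, encoded = one_hot_rows(sample_genres)
print(columns)
for row in encoded:
    print(row)
```

A row with an empty genre list ends up with 0 in every genre column.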

In [None]:
# Print the shape of the genre dummies
print("Genre Dummy Columns Shape:", genre_dummies.shape)

# Print the names of the new genre columns
print("\nNew Genre Columns:")
print(list(genre_dummies.columns))

In [None]:
print(df_movies_filtered['Action'].dtype)

Label encode 'original_language'

In [None]:
print(df_movies_filtered['original_language'].head())

In [23]:
from sklearn.preprocessing import LabelEncoder

# Label encoding: LabelEncoder assigns integer codes in sorted order of the unique values
le = LabelEncoder()
df_movies_filtered['language_encoded'] = le.fit_transform(df_movies_filtered['original_language'])

# Alternative: a simple mapping in order of first appearance.
# Left commented out because it would overwrite the LabelEncoder result above.
# language_map = {lang: idx for idx, lang in enumerate(df_movies_filtered['original_language'].unique())}
# df_movies_filtered['language_encoded'] = df_movies_filtered['original_language'].map(language_map)
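`LabelEncoder` and a first-appearance mapping generally assign different integers to the same language, so it matters which one is used. The plain-Python sketch below mimics both orderings on a few illustrative values. The function names are hypothetical and the sample list is not taken from the dataset.

```python
def sorted_label_codes(values):
    """Codes by sorted unique value, the ordering LabelEncoder uses."""
    classes = sorted(set(values))
    lookup = {value: idx for idx, value in enumerate(classes)}
    return [lookup[v] for v in values]


def first_seen_codes(values):
    """Codes by order of first appearance, like an enumerate(unique()) mapping."""
    lookup = {}
    for v in values:
        lookup.setdefault(v, len(lookup))
    return [lookup[v] for v in values]


languages = ["en", "ja", "fr", "en", "es"]  # illustrative values
print(sorted_label_codes(languages))  # [0, 3, 2, 0, 1]
print(first_seen_codes(languages))    # [0, 1, 2, 0, 3]
```

Either way, the integers are arbitrary identifiers and carry no meaningful order between languages.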

In [None]:
print("First few rows of encoded languages:")
print(df_movies_filtered[['original_language', 'language_encoded']].head())

In [None]:
print(df_movies_filtered['original_language'].value_counts())

---

In [None]:
print(df_movies_filtered.columns)

# Drop the original 'genres' column

In [None]:
df_movies_filtered = df_movies_filtered.drop(['genres'], axis=1)
print(df_movies_filtered.columns)

In [None]:
print(df_movies_filtered.columns.tolist())

---


# Save as CSV

In [29]:
df_movies_filtered.to_csv('encoded_movies.csv', index=False)