#  Data Collection 

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data and save it under outputs/datasets/collection

## Inputs

* Kaggle JSON file - the authentication token.


## Outputs

* Generate Dataset: encoded_movies.csv (written to the current working directory by the final cell)



---

# Install Python Packages #

In [30]:
%pip install -r //workspace/Film_Hit_prediction/requirements.txt

Note: you may need to restart the kernel to use updated packages.


# Change working directory

* The notebooks are assumed to live in a subfolder, so the working directory must be changed when a notebook is run in the editor.

We need to change the working directory from the current folder to its parent folder.
* os.getcwd() returns the current directory

In [31]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Film_Hit_prediction'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [32]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [33]:
current_dir = os.getcwd()
current_dir

'/workspace'
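
Running the cell above more than once moves the working directory up one level each time. A minimal sketch of a guard that only moves up when the notebook is actually inside the notebooks subfolder; the folder name `jupyter_notebooks` and the helper `resolve_project_root` are assumptions for illustration, not taken from this project:

```python
import os

def resolve_project_root(cwd, notebook_dir_name="jupyter_notebooks"):
    """Return the parent of cwd only if cwd is the notebooks subfolder."""
    # notebook_dir_name is an assumed folder name; adjust it to the real one
    normalized = os.path.normpath(cwd)
    if os.path.basename(normalized) == notebook_dir_name:
        return os.path.dirname(normalized)
    return cwd

# Safe to re-run: once the project root is reached, the directory no longer changes
os.chdir(resolve_project_root(os.getcwd()))
print(os.getcwd())
```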

# Fetch data from Kaggle

Install the Kaggle package to fetch the data

In [34]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.



In [36]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory


In [37]:
!pip install kagglehub



Destination folder:

In [38]:

DestinationFolder = "inputs/datasets/raw"
os.makedirs(DestinationFolder, exist_ok=True)

Download the Dataset:

In [39]:
!kaggle datasets download -d tmdb/tmdb-movie-metadata -p {DestinationFolder}


Traceback (most recent call last):
  File "/workspace/.pip-modules/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/workspace/.pip-modules/lib/python3.8/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/workspace/.pip-modules/lib/python3.8/site-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /workspace. Or use the environment method.
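
The traceback above means the Kaggle client found no credentials in the configured directory. A minimal sketch of a check to run before the download, assuming the token file is named `kaggle.json` as in the error message. `has_kaggle_credentials` is a hypothetical helper, not part of the Kaggle package, and `KAGGLE_USERNAME` / `KAGGLE_KEY` are assumed to be the variables behind the "environment method" the message mentions:

```python
import os

def has_kaggle_credentials(config_dir, environ=None):
    """Return True if kaggle.json exists in config_dir or both env variables are set."""
    environ = os.environ if environ is None else environ
    token_path = os.path.join(config_dir, "kaggle.json")
    if os.path.isfile(token_path):
        return True
    # Assumed variable names for the "environment method"
    return bool(environ.get("KAGGLE_USERNAME")) and bool(environ.get("KAGGLE_KEY"))

config_dir = os.environ.get("KAGGLE_CONFIG_DIR", os.getcwd())
if not has_kaggle_credentials(config_dir):
    print(f"No Kaggle credentials found in {config_dir}; place kaggle.json there before downloading.")
```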


Unzip the dataset

In [40]:
import zipfile
dataset_zip = os.path.join(DestinationFolder, "tmdb-movie-metadata.zip")
with zipfile.ZipFile(dataset_zip, 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

print(f"Dataset downloaded and extracted to: {DestinationFolder}")

FileNotFoundError: [Errno 2] No such file or directory: 'inputs/datasets/raw/tmdb-movie-metadata.zip'
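
The `FileNotFoundError` above follows from the failed download: the archive was never written. A minimal sketch of a guard that extracts only when a valid archive exists; `extract_if_present` is a hypothetical helper name:

```python
import os
import zipfile

def extract_if_present(zip_path, destination):
    """Extract zip_path into destination and return its member names, or [] if no valid archive exists."""
    if not (os.path.isfile(zip_path) and zipfile.is_zipfile(zip_path)):
        print(f"Skipping extraction: {zip_path} is missing or not a zip file")
        return []
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
        return zip_ref.namelist()

extracted = extract_if_present(
    os.path.join("inputs/datasets/raw", "tmdb-movie-metadata.zip"),
    "inputs/datasets/raw",
)
print(f"Extracted {len(extracted)} file(s)")
```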

---

# Load and inspect Kaggle Data #


In [12]:
import pandas as pd

In [None]:
movie_credits_path = "inputs/datasets/raw/tmdb_5000_credits.csv"
movies_path = "inputs/datasets/raw/tmdb_5000_movies.csv"

df_movie_credits = pd.read_csv(movie_credits_path)
df_movies = pd.read_csv(movies_path)

print("Movie Credits Data:")
print(df_movie_credits.head())

print("\nMovies Data:")
print(df_movies.head())


Movie Credits Data:
   movie_id                                     title  \
0     19995                                    Avatar   
1       285  Pirates of the Caribbean: At World's End   
2    206647                                   Spectre   
3     49026                     The Dark Knight Rises   
4     49529                               John Carter   

                                                cast  \
0  [{"cast_id": 242, "character": "Jake Sully", "...   
1  [{"cast_id": 4, "character": "Captain Jack Spa...   
2  [{"cast_id": 1, "character": "James Bond", "cr...   
3  [{"cast_id": 2, "character": "Bruce Wayne / Ba...   
4  [{"cast_id": 5, "character": "John Carter", "c...   

                                                crew  
0  [{"credit_id": "52fe48009251416c750aca23", "de...  
1  [{"credit_id": "52fe4232c3a36847f800b579", "de...  
2  [{"credit_id": "54805967c3a36829b5002c41", "de...  
3  [{"credit_id": "52fe4781c3a36847f81398c3", "de...  
4  [{"credit_id": "52fe47

DataFrame Summary

In [None]:
print("Movie Credits DataFrame Info:")
df_movie_credits.info()

print("\nMovies DataFrame Info:")
df_movies.info()

Movie Credits DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB

Movies DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   ob

Check for duplicates

In [None]:
print("\nDuplicate Rows in Movie Credits DataFrame:")
duplicates_movie_credits = df_movie_credits[df_movie_credits.duplicated()]
print(duplicates_movie_credits)


print("\nDuplicate Rows in Movies DataFrame:")
duplicates_movies = df_movies[df_movies.duplicated()]
print(duplicates_movies)


Duplicate Rows in Movie Credits DataFrame:
Empty DataFrame
Columns: [movie_id, title, cast, crew]
Index: []

Duplicate Rows in Movies DataFrame:
Empty DataFrame
Columns: [budget, genres, homepage, id, keywords, original_language, original_title, overview, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, vote_average, vote_count]
Index: []


Keep relevant columns



In [16]:
columns_to_keep = ['genres','original_language','budget', 'revenue']
df_movies_filtered = df_movies[columns_to_keep]

print("Filtered DataFrame with selected columns:")
print(df_movies_filtered.head())

Filtered DataFrame with selected columns:
                                              genres original_language  \
0  [{"id": 28, "name": "Action"}, {"id": 12, "nam...                en   
1  [{"id": 12, "name": "Adventure"}, {"id": 14, "...                en   
2  [{"id": 28, "name": "Action"}, {"id": 12, "nam...                en   
3  [{"id": 28, "name": "Action"}, {"id": 80, "nam...                en   
4  [{"id": 28, "name": "Action"}, {"id": 12, "nam...                en   

      budget     revenue  
0  237000000  2787965087  
1  300000000   961000000  
2  245000000   880674609  
3  250000000  1084939099  
4  260000000   284139100  


Check for missing data


In [17]:
print(df_movies_filtered[columns_to_keep].isnull().sum())

genres               0
original_language    0
budget               0
revenue              0
dtype: int64


One-hot encode the 'genres' column 

In [18]:
print(df_movies_filtered['genres'].head())

0    [{"id": 28, "name": "Action"}, {"id": 12, "nam...
1    [{"id": 12, "name": "Adventure"}, {"id": 14, "...
2    [{"id": 28, "name": "Action"}, {"id": 12, "nam...
3    [{"id": 28, "name": "Action"}, {"id": 80, "nam...
4    [{"id": 28, "name": "Action"}, {"id": 12, "nam...
Name: genres, dtype: object


In [None]:
import json

df_movies_filtered['genres'] = df_movies_filtered['genres'].apply(lambda x: [genre['name'] for genre in json.loads(x)])
genre_dummies = pd.get_dummies(df_movies_filtered['genres'].explode()).groupby(level=0).max()
df_movies_filtered = pd.concat([df_movies_filtered, genre_dummies], axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_movies_filtered['genres'] = df_movies_filtered['genres'].apply(lambda x: [genre['name'] for genre in json.loads(x)])
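
The `SettingWithCopyWarning` above appears because `df_movies_filtered` was created by selecting columns from `df_movies`, so pandas cannot tell whether the assignment is meant to reach the original frame. A minimal sketch on a toy two-row frame (illustrative data, not from the dataset) of one common way to avoid it, taking an explicit `.copy()` at selection time:

```python
import json
import pandas as pd

# Toy stand-in for df_movies; the rows are illustrative only
df_toy = pd.DataFrame({
    "genres": [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        '[{"id": 18, "name": "Drama"}]',
    ],
    "budget": [100, 200],
})

# .copy() makes the selection an independent DataFrame, so later assignments are unambiguous
df_toy_filtered = df_toy[["genres", "budget"]].copy()
df_toy_filtered["genres"] = df_toy_filtered["genres"].apply(
    lambda x: [genre["name"] for genre in json.loads(x)]
)

# One row per (movie, genre), one indicator column per genre, then collapse back to one row per movie
genre_dummies = pd.get_dummies(df_toy_filtered["genres"].explode()).groupby(level=0).max()
df_toy_filtered = pd.concat([df_toy_filtered, genre_dummies], axis=1)
print(df_toy_filtered)
```

The original `df_toy["genres"]` column still holds the raw JSON strings afterwards, since only the copy was modified.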


In [20]:
# Print the shape of the genre dummies
print("Genre Dummy Columns Shape:", genre_dummies.shape)

# Print the names of the new genre columns
print("\nNew Genre Columns:")
print(list(genre_dummies.columns))

Genre Dummy Columns Shape: (4803, 20)

New Genre Columns:
['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Foreign', 'History', 'Horror', 'Music', 'Mystery', 'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western']


In [21]:
print(df_movies_filtered['Action'].dtype)

uint8


Label-encode 'original_language'

In [22]:
print(df_movies_filtered['original_language'].head())

0    en
1    en
2    en
3    en
4    en
Name: original_language, dtype: object


In [23]:
from sklearn.preprocessing import LabelEncoder

# Option 1: LabelEncoder assigns integer codes in sorted order of the labels
le = LabelEncoder()
df_movies_filtered['language_encoded'] = le.fit_transform(df_movies_filtered['original_language'])

# Option 2: map each language to an integer in order of first appearance.
# Both options write to 'language_encoded', so this assignment overwrites Option 1.
language_map = {lang: idx for idx, lang in enumerate(df_movies_filtered['original_language'].unique())}
df_movies_filtered['language_encoded'] = df_movies_filtered['original_language'].map(language_map)

In [24]:
print("First few rows of encoded languages:")
print(df_movies_filtered[['original_language', 'language_encoded']].head())

First few rows of encoded languages:
  original_language  language_encoded
0                en                 0
1                en                 0
2                en                 0
3                en                 0
4                en                 0
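
The output above shows `en` mapped to 0, which is the result of the first-appearance mapping rather than of `LabelEncoder`, whose codes follow the sorted order of the labels. A small pure-Python sketch of the two orderings on an illustrative list (not the real column):

```python
languages = ["en", "en", "fr", "de", "en"]

# Codes in sorted label order, which is how LabelEncoder numbers its classes
sorted_map = {lang: idx for idx, lang in enumerate(sorted(set(languages)))}

# Codes in order of first appearance, as with enumerate(Series.unique())
first_seen_map = {lang: idx for idx, lang in enumerate(dict.fromkeys(languages))}

print(sorted_map)      # {'de': 0, 'en': 1, 'fr': 2}
print(first_seen_map)  # {'en': 0, 'fr': 1, 'de': 2}
```

Either scheme imposes an arbitrary numeric order on the languages, which is worth keeping in mind when the column is later fed to a model.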


In [25]:
print(df_movies_filtered['original_language'].value_counts())

en    4505
fr      70
es      32
zh      27
de      27
hi      19
ja      16
it      14
cn      12
ru      11
ko      11
pt       9
da       7
sv       5
nl       4
fa       4
th       3
he       3
ta       2
cs       2
ro       2
id       2
ar       2
vi       1
sl       1
ps       1
no       1
ky       1
hu       1
pl       1
af       1
nb       1
tr       1
is       1
xx       1
te       1
el       1
Name: original_language, dtype: int64


---

In [26]:
print(df_movies_filtered.columns)

Index(['genres', 'original_language', 'budget', 'revenue', 'Action',
       'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary', 'Drama',
       'Family', 'Fantasy', 'Foreign', 'History', 'Horror', 'Music', 'Mystery',
       'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western',
       'language_encoded'],
      dtype='object')


# Drop the original 'genres' column

In [27]:
df_movies_filtered = df_movies_filtered.drop(['genres'], axis=1)
print(df_movies_filtered.columns)

Index(['original_language', 'budget', 'revenue', 'Action', 'Adventure',
       'Animation', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family',
       'Fantasy', 'Foreign', 'History', 'Horror', 'Music', 'Mystery',
       'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western',
       'language_encoded'],
      dtype='object')


In [28]:
print(df_movies_filtered.columns.tolist())

['original_language', 'budget', 'revenue', 'Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Foreign', 'History', 'Horror', 'Music', 'Mystery', 'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western', 'language_encoded']


---


# Save as CSV

In [29]:
df_movies_filtered.to_csv('encoded_movies.csv', index=False)
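
The objectives state that the inspected data should be saved under outputs/datasets/collection, while the cell above writes to the current working directory. A minimal sketch of preparing that path first; `prepare_output_path` is a hypothetical helper, and the `to_csv` call is left as a comment because it depends on `df_movies_filtered` from the cells above:

```python
import os

def prepare_output_path(folder, filename):
    """Create folder if needed and return the full path for filename inside it."""
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, filename)

output_path = prepare_output_path("outputs/datasets/collection", "encoded_movies.csv")
print(output_path)
# df_movies_filtered.to_csv(output_path, index=False)
```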