# Data Processing: TMDb + MovieLens

This notebook combines movie metadata from TMDb with ratings from the MovieLens dataset. The goal is a clean, analysis-ready dataset that serves as the basis for a machine learning model.

In [None]:
import os
import pandas as pd

RAW_PATH = '../dataset/raw'
PROCESSED_PATH = '../dataset/processed'

tmdb_file = os.path.join(RAW_PATH, 'tmdb_5000_movies.csv')
link_file = os.path.join(RAW_PATH, 'link.csv')
rating_file = os.path.join(RAW_PATH, 'rating.csv')


In [15]:
# Load the three raw CSV files
tmdb_df = pd.read_csv(tmdb_file)
link_df = pd.read_csv(link_file)
ratings_df = pd.read_csv(rating_file)
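
The ratings file is large (about 20 million rows, as the shapes printed further down show), so it can help to tell `pd.read_csv` which columns and dtypes to use. The following is a minimal sketch on a tiny in-memory stand-in for `rating.csv`. Dropping `timestamp` and the chosen 32-bit dtypes are assumptions for illustration, not something this notebook does.

```python
import io
import pandas as pd

# Tiny stand-in for rating.csv with the same column layout.
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,2,3.5,2005-04-02 23:53:47\n"
    "1,29,3.5,2005-04-02 23:31:16\n"
    "2,32,4.0,2005-04-02 23:33:39\n"
)

ratings_small = pd.read_csv(
    sample_csv,
    # Assumption: timestamp is not needed downstream, so it is skipped.
    usecols=["userId", "movieId", "rating"],
    # Assumption: 32-bit types are wide enough for these values.
    dtype={"userId": "int32", "movieId": "int32", "rating": "float32"},
)

print(ratings_small.dtypes)
```

Whether the narrower dtypes are safe depends on the actual value ranges in the data, so they should be checked before being applied to the full file.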

In [16]:
# Overview of the datasets
print("TMDB data:")
display(tmdb_df.head())
print("\nLink data:")
display(link_df.head())
print("\nRatings data:")
display(ratings_df.head())

# Shapes and column info
print("Shapes:", tmdb_df.shape, link_df.shape, ratings_df.shape)
tmdb_df.info()
ratings_df.info()


TMDB data:


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124



Link data:


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0



Ratings data:


Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40


Shapes: (4803, 20) (27278, 3) (20000263, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 n
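
The TMDb preview shows that columns such as `genres`, `keywords`, and `spoken_languages` hold JSON-encoded lists stored as plain strings (dtype `object`). Before they can be used as features they usually need to be decoded. Below is a minimal sketch on a two-row toy frame using the `genres` values of "Rob Roy" and "Clerks" as they appear in this notebook. The helper `extract_names` and the new column `genre_names` are hypothetical names introduced for illustration.

```python
import json
import pandas as pd

# Toy frame mimicking the structure of the TMDb `genres` column.
toy = pd.DataFrame({
    "title": ["Rob Roy", "Clerks"],
    "genres": [
        '[{"id": 12, "name": "Adventure"}]',
        '[{"id": 35, "name": "Comedy"}]',
    ],
})

def extract_names(cell):
    """Hypothetical helper: decode a JSON list and keep only the 'name' values."""
    if not isinstance(cell, str):
        # Missing values (NaN) are mapped to an empty list.
        return []
    return [item["name"] for item in json.loads(cell)]

toy["genre_names"] = toy["genres"].apply(extract_names)
print(toy[["title", "genre_names"]])
```

The same helper would presumably work for the other JSON-style columns, provided each list element carries a `name` key.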

In [17]:
# Missing values
print("Missing values in TMDB:")
print(tmdb_df.isnull().sum())

# Check for duplicates
print("Duplicates in ratings:", ratings_df.duplicated().sum())
ratings_df = ratings_df.drop_duplicates()


Missing values in TMDB:
budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title                      0
vote_average               0
vote_count                 0
dtype: int64
Duplicates in ratings: 0
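
The counts above only report the missing values; the notebook does not yet act on them. One possible treatment, sketched here on a toy frame, is to drop the sparse `homepage` column (3091 of 4803 values missing), fill `tagline` with an empty string, and drop the few rows lacking `runtime` or `release_date`. These choices are assumptions for illustration, and the toy values are made up.

```python
import pandas as pd

# Made-up rows that reproduce the kinds of gaps reported for tmdb_df.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "homepage": [None, "http://example.com", None],
    "tagline": ["x", None, "y"],
    "runtime": [100.0, None, 90.0],
    "release_date": ["2000-01-01", "2001-01-01", None],
})

cleaned = (
    toy.drop(columns=["homepage"])                   # mostly empty column
       .fillna({"tagline": ""})                      # optional free-text field
       .dropna(subset=["runtime", "release_date"])   # only a handful of rows
)

print(cleaned)
```

Whether dropping rows is acceptable depends on the downstream model; imputing `runtime` would be an alternative.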


In [18]:
# Merge: link with tmdb (via tmdbId → id)
merged_links = link_df.merge(tmdb_df, left_on='tmdbId', right_on='id', how='inner')

# Merge: then with ratings (via movieId)
merged_df = ratings_df.merge(merged_links, on='movieId', how='inner')

print("Merged data:")
display(merged_df.head())


Merged data:


Unnamed: 0,userId,movieId,rating,timestamp,imdbId,tmdbId,budget,genres,homepage,id,...,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,1,47,3.5,2005-04-02 23:32:07,114369,807.0,33000000,"[{""id"": 80, ""name"": ""Crime""}, {""id"": 9648, ""na...",http://www.sevenmovie.com/,807,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",1995-09-22,327311859,127.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Seven deadly sins. Seven ways to die.,Se7en,8.1,5765
1,1,50,3.5,2005-04-02 23:29:40,114814,629.0,6000000,"[{""id"": 18, ""name"": ""Drama""}, {""id"": 80, ""name...",http://www.mgm.com/#/our-titles/2083/The-Usual...,629,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",1995-07-19,23341568,106.0,"[{""iso_639_1"": ""es"", ""name"": ""Espa\u00f1ol""}, ...",Released,Five Criminals. One Line Up. No Coincidence.,The Usual Suspects,8.1,3254
2,1,112,3.5,2004-09-10 03:09:00,113326,33542.0,7500000,"[{""id"": 80, ""name"": ""Crime""}, {""id"": 28, ""name...",,33542,...,"[{""iso_3166_1"": ""HK"", ""name"": ""Hong Kong""}]",1995-01-30,32392047,91.0,"[{""iso_639_1"": ""cn"", ""name"": ""\u5e7f\u5dde\u8b...",Released,No Fear. No Stuntman. No Equal.,Rumble in the Bronx,6.5,240
3,1,151,4.0,2004-09-10 03:08:54,114287,11780.0,28000000,"[{""id"": 12, ""name"": ""Adventure""}]",,11780,...,"[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",1995-04-13,31596911,139.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Honor made him a man. Courage made him a hero....,Rob Roy,6.5,148
4,1,223,4.0,2005-04-02 23:46:13,109445,2292.0,27000,"[{""id"": 35, ""name"": ""Comedy""}]",http://www.miramax.com/movie/clerks/,2292,...,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",1994-09-13,3151130,92.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Just because they serve you doesn't mean they ...,Clerks,7.4,755
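
Because `tmdbId` is stored as a float (for example `807.0`) while `id` is an integer, and because an inner join silently discards unmatched rows, it can be worth checking the join before relying on it. The sketch below uses toy frames with made-up titles. It drops missing `tmdbId` values, casts the key to an integer, and uses the `validate` and `indicator` parameters of `DataFrame.merge` to inspect the match. The one-to-one assumption is mine and may not hold for the real link table, in which case pandas raises a `MergeError`.

```python
import pandas as pd

# Toy stand-ins for link_df and tmdb_df.
links = pd.DataFrame({"movieId": [1, 2, 3], "tmdbId": [862.0, 8844.0, None]})
movies = pd.DataFrame({"id": [862, 8844, 19995], "title": ["A", "B", "C"]})

# Remove links without a TMDb id and align the key dtype with `id`.
links = links.dropna(subset=["tmdbId"]).astype({"tmdbId": "int64"})

# An outer join with indicator=True shows which rows an inner join would lose.
check = links.merge(
    movies,
    left_on="tmdbId",
    right_on="id",
    how="outer",
    validate="one_to_one",  # assumption: keys are unique on both sides
    indicator=True,
)

matched = int((check["_merge"] == "both").sum())
only_movies = int((check["_merge"] == "right_only").sum())
print(f"matched: {matched}, TMDb rows without a link: {only_movies}")
```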


In [21]:
print(merged_df.shape)
print(merged_df.memory_usage(deep=True).sum() / 1024**2, "MB")


(13425642, 26)
25900.681980133057 MB
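
A likely reason for the roughly 25.9 GB footprint is that every rating row carries a full copy of the movie's text columns (`overview`, `genres`, `homepage`, and so on). Two common remedies are to keep only the columns that are needed before merging, or to store heavily repeated strings as `category`. The toy example below illustrates the second option. It is a sketch, not a measurement of this dataset.

```python
import pandas as pd

# 10,000 rows that repeat just two distinct titles, similar to how a
# merged ratings table repeats movie metadata for every rating.
n = 10_000
toy = pd.DataFrame({
    "userId": range(n),
    "title": ["Se7en", "The Usual Suspects"] * (n // 2),
})

before = toy.memory_usage(deep=True).sum()

# A categorical column stores each distinct string once plus small integer codes.
toy["title"] = toy["title"].astype("category")
after = toy.memory_usage(deep=True).sum()

print(f"before: {before / 1024:.1f} KiB, after: {after / 1024:.1f} KiB")
```

The saving shrinks as the number of distinct values approaches the number of rows, so this helps most for columns like `title` or `original_language`.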


In [None]:
# Reduced DataFrame (200,000 random rows)
reduced_df = merged_df.sample(n=200_000, random_state=42)

# Create the output directory if it does not exist
os.makedirs(PROCESSED_PATH, exist_ok=True)

# Save
reduced_df.to_csv(os.path.join(PROCESSED_PATH, 'merged_movies_sampled.csv'), index=False)
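
Sampling individual rows keeps the file small but scatters each user's ratings, which may matter if the later model depends on per-user histories. As an alternative sketch, assuming that property is wanted, one can sample users instead of rows and keep all of their ratings. The toy ratings and the names `kept_users` and `by_user` are illustrative only.

```python
import pandas as pd

# Made-up ratings for four users.
toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 3, 4, 4],
    "movieId": [47, 50, 112, 47, 151, 223, 50, 223],
    "rating": [3.5, 3.5, 3.5, 4.0, 4.0, 5.0, 2.0, 3.0],
})

# Draw two users reproducibly, then keep every rating they made.
users = pd.Series(toy["userId"].unique())
kept_users = users.sample(n=2, random_state=42)
by_user = toy[toy["userId"].isin(kept_users)]

print(by_user)
```

With this approach the number of rows in the sample is not fixed in advance, since it depends on how active the selected users are.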
