# TMDB Data Acquisition
This notebook retrieves movie data from TMDB. To do so, we need to connect to the TMDB API. The Python library "tmdbsimple" makes this possible.

Source: https://github.com/celiao/tmdbsimple

The movie data includes, for example, the title, the genres, the movie description, the runtime, the production countries, and many other attributes. The data is then stored in a DataFrame and exported as a CSV file so it can be cleaned later.

Finally, this data is joined with the MovieLens data, so that we can later access the ratings of the queried movies.
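
The join with the MovieLens data could look roughly like the following minimal sketch. The three small data frames and their values are made up for illustration; only the column names `movieId` and `tmdbId` (from `links.csv`), `id` (from the TMDB response) and `rating` are assumed to match the real data sets.

```python
import pandas as pd

# toy stand-ins for links.csv, the TMDB result and the MovieLens ratings (illustrative values)
df_links = pd.DataFrame({"movieId": [1, 2, 3], "tmdbId": [862, 8844, 15602]})
df_tmdb = pd.DataFrame({"id": [862, 15602], "title": ["Movie A", "Movie B"]})
df_ratings = pd.DataFrame({"movieId": [1, 1, 3], "rating": [4.0, 5.0, 3.5]})

# map each TMDB movie to its MovieLens movieId via the links table
df_joined = df_tmdb.merge(df_links, left_on="id", right_on="tmdbId", how="inner")

# attach the ratings through the shared movieId
df_with_ratings = df_joined.merge(df_ratings, on="movieId", how="inner")
print(df_with_ratings[["movieId", "title", "rating"]])
```

An inner join keeps only movies that appear in both sources, which matches the goal of looking up ratings for the queried movies.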

In [1]:
import tmdbsimple as tmdb
import requests
import pandas as pd
from datetime import datetime

In [2]:
# read movie IDs (movieId & tmdbId)
df_movie_ids = pd.read_csv('ml-25m/links.csv', usecols = ["movieId", "tmdbId"])
df_movie_ids.head()

Unnamed: 0,movieId,tmdbId
0,1,862.0
1,2,8844.0
2,3,15602.0
3,4,31357.0
4,5,11862.0


In [3]:
# print number of missing values
print(df_movie_ids.isna().sum())

movieId      0
tmdbId     107
dtype: int64


In [4]:
# Remove 107 missing values from column tmdbId
df_movie_ids = df_movie_ids[df_movie_ids["tmdbId"].notna()]

print(df_movie_ids.isna().sum())

movieId    0
tmdbId     0
dtype: int64
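
Because the column contained missing values, pandas reads `tmdbId` as float (for example `862.0`). A minimal sketch of dropping the missing IDs and casting the rest to integers, using a toy frame in place of `links.csv`:

```python
import numpy as np
import pandas as pd

# toy stand-in for links.csv with one missing tmdbId
df_ids = pd.DataFrame({"movieId": [1, 2, 3], "tmdbId": [862.0, np.nan, 15602.0]})

# drop rows without a tmdbId, then cast the remaining IDs to int
df_ids = df_ids.dropna(subset=["tmdbId"]).astype({"tmdbId": "int64"})
print(df_ids.dtypes)
```

Integer IDs avoid passing values such as `862.0` on to later lookups.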


Here the movie data is requested from the TMDB API and, provided the movie was released on or after 2010-01-01, stored in a DataFrame that is exported afterwards.

In [68]:
tmdb.API_KEY = "f831ad4674ee3f8502035db2a48af4c6"
movie_lst = []

# set the request timeout (connect, read) and create one session for all requests
tmdb.REQUESTS_TIMEOUT = (2, 5)
tmdb.REQUESTS_SESSION = requests.Session()

# only keep movies released on or after this date
min_release_date = datetime(2010, 1, 1)

# loop through all tmdb IDs
for movie_id in df_movie_ids["tmdbId"]:

    try:
        # request movie data; tmdbId was read as float (e.g. 862.0), so cast to int
        movie = tmdb.Movies(int(movie_id))

        # get movie information (dict)
        response = movie.info()

        # add to movie list if the release date is on or after 2010-01-01
        if datetime.strptime(movie.release_date, "%Y-%m-%d") >= min_release_date:
            movie_lst.append(response)
        
    except Exception as e:
        
        print(e)
    
df_movies = pd.DataFrame(movie_lst)

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,/s2bpgVhpWODDfoADW78IpMDCMTR.jpg,,1783810,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,79782,tt1684935,pl,Wenecja,...,2010-06-11,0,110,"[{'english_name': 'Czech', 'iso_639_1': 'cs', ...",Released,,Venice,False,7.000,13
1,False,,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 27, 'nam...",,141210,tt2250194,en,The Sleepover,...,2012-10-12,0,6,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,The Sleepover,False,6.600,8
2,False,,,0,"[{'id': 18, 'name': 'Drama'}]",http://www.thefarmerswifefilm.co.uk/,143750,tt2140519,en,The Farmer's Wife,...,2012-06-20,0,18,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,The Farmer's Wife,False,10.000,1
3,False,,,0,"[{'id': 99, 'name': 'Documentary'}]",,84198,tt1736049,en,A Place at the Table,...,2012-03-22,0,84,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,One Nation. Underfed.,A Place at the Table,False,6.700,20
4,False,,,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 18, ...",,171982,tt2378428,en,Romance,...,2012-10-09,0,27,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,Romance,False,6.000,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19958,False,/n7kr24jkZBg6EERpJBdKvOjMMdV.jpg,,0,"[{'id': 99, 'name': 'Documentary'}, {'id': 36,...",,646282,tt8132166,es,El cuadro,...,2019-11-08,0,107,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,What is happening in that room?,The Painting,False,8.000,2
19959,False,/4evYVAzIHXSSVFxCQhBgkgj52pH.jpg,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 36, 'name...",,595924,tt10199670,fr,Liberté,...,2019-09-04,0,132,"[{'english_name': 'French', 'iso_639_1': 'fr',...",Released,,Liberte,False,5.400,22
19960,False,/ekVWMz32hsrRuSLf5KTjg3PvcUa.jpg,,0,"[{'id': 36, 'name': 'History'}]",,622831,tt10551150,zh,決勝時刻,...,2019-09-20,15030400,0,"[{'english_name': 'Mandarin', 'iso_639_1': 'zh...",Released,,Mao Zedong 1949,False,5.700,6
19961,False,/3kb5b8IQCX4vd3baNBoZqAboP41.jpg,,0,"[{'id': 18, 'name': 'Drama'}]",,499546,tt6671244,nl,Wij,...,2018-07-12,0,100,"[{'english_name': 'Dutch', 'iso_639_1': 'nl', ...",Released,,We,False,5.938,56
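
The release-date check in the loop can also be written as a small helper that tolerates missing or empty dates instead of raising an exception. This is only a sketch: the function name `released_on_or_after` is hypothetical, and the dictionaries are hand-made stand-ins for API responses (only the `release_date` key in `YYYY-MM-DD` format is taken from the output above).

```python
from datetime import datetime


def released_on_or_after(info, cutoff):
    """Return True if info["release_date"] (YYYY-MM-DD) is on or after cutoff."""
    release_date = info.get("release_date")
    if not release_date:
        # missing or empty date: treat as not matching
        return False
    try:
        return datetime.strptime(release_date, "%Y-%m-%d") >= cutoff
    except ValueError:
        # unexpected date format: treat as not matching
        return False


cutoff = datetime(2010, 1, 1)
responses = [
    {"title": "Venice", "release_date": "2010-06-11"},
    {"title": "Older movie", "release_date": "1995-10-30"},
    {"title": "No date", "release_date": ""},
]
kept = [r["title"] for r in responses if released_on_or_after(r, cutoff)]
print(kept)
```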


Save the movies released from 2010 onward as CSV

In [39]:
# save dataframe to csv (tab-separated, because sep="\t")
df_movies.to_csv("tmdb_movies.csv", sep="\t")
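
When the file is read back for cleaning, the same separator has to be passed, and nested columns such as `genres` come back as plain strings. A minimal sketch, assuming a one-row toy frame and an in-memory buffer in place of `tmdb_movies.csv`:

```python
import ast
import io

import pandas as pd

# toy stand-in for df_movies with one nested column
df_toy = pd.DataFrame({"title": ["Venice"], "genres": [[{"id": 18, "name": "Drama"}]]})

# write tab-separated text into a buffer instead of a file on disk
buffer = io.StringIO()
df_toy.to_csv(buffer, sep="\t")
buffer.seek(0)

# read it back with the same separator; the first column holds the saved index
df_loaded = pd.read_csv(buffer, sep="\t", index_col=0)

# the list of dicts was stored as its string representation, so parse it back
df_loaded["genres"] = df_loaded["genres"].apply(ast.literal_eval)
genre_names = [g["name"] for g in df_loaded.loc[0, "genres"]]
print(genre_names)
```

`ast.literal_eval` is used here because the stored strings are Python literals with single quotes, not valid JSON.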