# Update DB 
`get_additional_data` must be run first (or retrieve the CSVs from Google Drive: )
This file uses CSVs generated from `get_additional_data` (descriptions, trailers, posters) and `get_service_provider` (service provider links) to update movie data in the `imdb_movies` table.

It may make sense to have `service_providers` as a separate table, with a one-to-many relationship between each movie and its service providers.
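As a rough sketch of what that separate table could look like: the column names below (`provider_id`, `offer_url`) and their types are hypothetical, and the foreign key assumes `imdb_movies.movie_id` is an integer primary key (or otherwise unique), which this notebook does not confirm.

```python
# Hypothetical schema sketch: one row per (movie, provider offer).
# Column names and types are assumptions, not the project's actual schema.
CREATE_SERVICE_PROVIDERS = """
CREATE TABLE IF NOT EXISTS service_providers (
    id          SERIAL PRIMARY KEY,
    movie_id    INTEGER NOT NULL REFERENCES imdb_movies (movie_id),
    provider_id INTEGER NOT NULL,
    offer_url   TEXT NOT NULL
);
"""


def create_service_providers_table(connection):
    """Create the table using an open DB-API connection (e.g. psycopg2)."""
    cursor = connection.cursor()
    try:
        cursor.execute(CREATE_SERVICE_PROVIDERS)
        connection.commit()
    finally:
        cursor.close()
```

With this layout a movie with several offers simply has several rows in `service_providers`, instead of comma-delimited strings stored on the movie row.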

In [2]:
# imports 
import pandas as pd 
import psycopg2

In [3]:
# create connection to the DB (note: the host below is the dev instance)
connection = psycopg2.connect(
    user="postgres",
    password="lambdaschoolgroa",
    host="groadb-dev.cbayt2opbptw.us-east-1.rds.amazonaws.com",
    port="5432",
    database="postgres")

## Add descriptions to movies

In [6]:
# read in description data
desc_df = pd.read_csv('description_results.csv', engine='python')
print(desc_df.shape)
desc_df.head()

(201352, 3)


Unnamed: 0,movie_id,tmdb_id,description
0,1051231,31223.0,In the Hands of the Gods is the true story of ...
1,1051244,573815.0,A group of talented youth exploited by the hea...
2,10513474,639651.0,A unique chance to explore Pier Paolo Pasolini...
3,10515086,599672.0,A meeting with a new inmate in the psychiatric...
4,10515340,531678.0,A strange disease is plaguing the city. Hoping...


In [7]:
# how many null values
desc_df['description'].isnull().sum()

27476

In [8]:
# any empty strings?
desc_df[desc_df['description'] == ""].shape[0]

0

In [9]:
# drop rows without a description
desc_df = desc_df.dropna(subset=['description'])
desc_df.shape

(173876, 3)

In [14]:
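# update every movie we have a description for
# NOTE: dry run. Only the first query is printed; the execute/commit lines
# stay commented out until the query has been checked.
# cursor = connection.cursor()

for movie_id, desc in desc_df[['movie_id', 'description']].values:
    # caution: interpolating `desc` directly breaks on descriptions that contain
    # a single quote; pass the values as query parameters when actually executing
    query = f"UPDATE imdb_movies SET description = '{desc}' WHERE movie_id = {movie_id};"
    print(query)
    break
#     cursor.execute(query)

# connection.commit()
# cursor.close()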

UPDATE imdb_movies SET description = 'In the Hands of the Gods is the true story of five young British freestyle footballers journey across the Americas to Argentina in the hope of meeting their hero, Diego Maradona. This coming-of-age road movie tells the story of a group of young men in pursuit of a lifelong dream.' WHERE movie_id = 1051231;
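Descriptions are free text and can contain single quotes, which would break (or alter) a query built with an f-string. A minimal sketch of the same update using query parameters follows; `update_descriptions` is a hypothetical helper name, and it assumes a DB-API connection that uses the `%s` placeholder style, as psycopg2 does.

```python
def update_descriptions(connection, rows):
    """Run one parameterized UPDATE per (movie_id, description) pair.

    `connection` is any open DB-API connection using `%s` placeholders.
    Returns the number of rows submitted.
    """
    query = "UPDATE imdb_movies SET description = %s WHERE movie_id = %s;"
    # Placeholder order is (description, movie_id); the driver escapes each value.
    # int() also converts NumPy integers, which psycopg2 cannot adapt by default.
    params = [(str(desc), int(movie_id)) for movie_id, desc in rows]
    cursor = connection.cursor()
    try:
        cursor.executemany(query, params)
        connection.commit()
    finally:
        cursor.close()
    return len(params)
```

In this notebook it could be called as `update_descriptions(connection, desc_df[['movie_id', 'description']].values)` once the dry run looks right.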


In [None]:
# count every movie with a description
cursor = connection.cursor()

query = "SELECT COUNT(*) FROM imdb_movies WHERE description IS NOT NULL;"
cursor.execute(query)
desc_count = cursor.fetchone()

cursor.close()
desc_count

## Add trailers to movies

In [4]:
# read in trailer data 
trailer_df = pd.read_csv('trailer_data.csv')
print(trailer_df.shape)
trailer_df.head()

(50793, 4)


Unnamed: 0,movie_id,video_key,video_site,more_than_one
0,1051244,ztSS7hnEviY,YouTube,False
1,10515086,WA2NvFSHchk,YouTube,False
2,10515460,HQksgesFrFY,YouTube,False
3,10515480,QBNKpcUOWgI,YouTube,False
4,1051232,k9SdzYiyG14,YouTube,False


In [15]:
# any null values?
trailer_df.isnull().sum()

movie_id         0
video_key        0
video_site       0
more_than_one    0
dtype: int64

In [16]:
# what are the video_site values?
trailer_df['video_site'].value_counts()

YouTube    50205
Vimeo        588
Name: video_site, dtype: int64

In [17]:
# how many movies had more than one trailer on TMDb?
trailer_df['more_than_one'].value_counts()

False    43481
True      7312
Name: more_than_one, dtype: int64

In [18]:
# update every movie we have a trailer for
# cursor = connection.cursor()

for movie_id, key, site in trailer_df[['movie_id', 'video_key', 'video_site']].values:
    query = f"UPDATE imdb_movies SET video_key = '{key}', video_site = '{site}' WHERE movie_id = {movie_id};"
    print(query)
    break
#     cursor.execute(query)

# connection.commit()
# cursor.close()

UPDATE imdb_movies SET video_key = 'ztSS7hnEviY', video_site = 'YouTube' WHERE movie_id = 1051244;


## Add posters to movies that didn't have one

In [19]:
# read in poster data 
poster_df = pd.read_csv('poster_data.csv')
print(poster_df.shape)
poster_df.head()

(60252, 3)


Unnamed: 0,movie_id,tmdb_id,poster_path
0,1051231,31223,
1,1051704,636806,
2,1051245,120528,
3,1051226,41255,/85PnTI5NknwbOabekq30kqisMPX.jpg
4,1051834,533781,


In [20]:
# how many null poster_paths?
poster_df['poster_path'].isnull().sum()

55371

In [21]:
# drop those rows
poster_df = poster_df.dropna(subset=['poster_path'])
poster_df.shape

(4881, 3)

In [22]:
# update every movie we have a poster for
# cursor = connection.cursor()

for movie_id, poster_path in poster_df[['movie_id', 'poster_path']].values:
    query = f"UPDATE imdb_movies SET poster_path = '{poster_path}' WHERE movie_id = {movie_id};"
    print(query)
    break
#     cursor.execute(query)

# connection.commit()
# cursor.close()

UPDATE imdb_movies SET poster_path = '/85PnTI5NknwbOabekq30kqisMPX.jpg' WHERE movie_id = 1051226;


## Add service provider links to movies

In [4]:
provider_df = pd.read_csv('provider_data.csv')
print(provider_df.shape)
provider_df.head()

(224382, 7)


Unnamed: 0,movie_id,title,jw_id,jw_title,offer_provider_id,offer_urls,ratio
0,1051262,The Stalker Within,309961,The Evil Within,"7,7,7,7,192,3,3,192,3,192,9,9,10,10,10,10,3,19...",https://www.vudu.com/content/movies/details/Th...,73
1,1051263,Stolen Life,167298,Stolen Life,2510109,"https://www.fandor.com/films/stolen_life,https...",100
2,10513072,Don't Date Your Sister,434938,Don't Open Your Eyes,1010101022221299,https://www.amazon.com/gp/product/B07H4T4NQ2?c...,67
3,1051320,La cantatrice chauve,23259,La Bamba,"7,7,7,7,3,3,192,192,279,279,279,279,10,10,10,1...",https://www.vudu.com/content/movies/details/La...,36
4,10513286,Historia de mi nombre,403709,Marriage Story,888,"http://www.netflix.com/title/80223779,http://w...",40


We keep only rows with a `ratio` above 90; at or below that value, it is very likely that the JustWatch match (`jw_title`) is not the correct movie.

In [5]:
filtered_df = provider_df[provider_df['ratio'] > 90]
print(filtered_df.shape)
filtered_df.head()

(29017, 7)


Unnamed: 0,movie_id,title,jw_id,jw_title,offer_provider_id,offer_urls,ratio
1,1051263,Stolen Life,167298,Stolen Life,2510109,"https://www.fandor.com/films/stolen_life,https...",100
9,10511068,Inheritance,311393,Inheritance,337777192192331921929910101010,https://play.google.com/store/movies/details/I...,100
11,10514932,What Do I Do Now?,659810,What Do I Do Now?,9910101010,https://www.amazon.com/gp/product/B075QVYP3H?c...,100
28,1051245,Moving Midway,78930,Moving Midway,2219199,https://itunes.apple.com/us/movie/moving-midwa...,100
31,1051253,The Portal,319476,The Portal,"3,3,3,10,10,10,10,9,9,192,192,192,2,2,2,2,68,6...",https://play.google.com/store/movies/details/T...,100


`offer_provider_id` and `offer_urls` are comma-delimited strings. We can use the JustWatch API to look up the provider details for each ID, and the URLs will be saved so they can be linked on the movie detail page. There is other information we could have saved to distinguish between entries that share a provider ID on the same movie: from what I saw, some offers are HD while others are SD, and the monetization type can differ, so the same provider appears more than once.
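To make the two columns easier to work with, they can be split into `(provider_id, url)` pairs. The sketch below uses a hypothetical helper, `pair_offers`, and assumes that both strings list their items in the same order, that the counts match, and that no URL contains a comma; none of this is confirmed by the data shown here.

```python
def pair_offers(offer_provider_id, offer_urls):
    """Split the two comma-delimited strings into unique (provider_id, url) pairs.

    Assumes both strings list their items in the same order and that URLs
    contain no commas. Raises ValueError if the item counts differ.
    """
    ids = [int(part) for part in str(offer_provider_id).split(",") if part.strip()]
    urls = [part.strip() for part in str(offer_urls).split(",") if part.strip()]
    if len(ids) != len(urls):
        raise ValueError(f"{len(ids)} provider IDs but {len(urls)} URLs")
    pairs = []
    seen = set()
    for pair in zip(ids, urls):
        if pair not in seen:  # drop exact repeats of the same ID and URL
            seen.add(pair)
            pairs.append(pair)
    return pairs
```

Raising on a length mismatch, rather than silently truncating with `zip`, makes it obvious when the same-order assumption does not hold for a row.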

We will probably want to filter out most service providers (keeping Netflix, Amazon, and Hulu) because some are not popular enough to justify displaying their links on the web page.
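Since the provider IDs for those services are not listed in this notebook, one possible approach is to filter on the URL's domain instead. This is only a sketch: `ALLOWED_DOMAINS`, `is_allowed_offer`, and `keep_allowed_offers` are hypothetical names, and matching by domain is an assumption rather than the project's chosen method.

```python
from urllib.parse import urlparse

# Hypothetical allow-list keyed on URL domain; extend or replace as needed.
ALLOWED_DOMAINS = ("netflix.com", "amazon.com", "hulu.com")


def is_allowed_offer(url, allowed_domains=ALLOWED_DOMAINS):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def keep_allowed_offers(offer_urls, allowed_domains=ALLOWED_DOMAINS):
    """Filter a comma-delimited `offer_urls` string down to allowed providers."""
    urls = [u.strip() for u in str(offer_urls).split(",") if u.strip()]
    return [u for u in urls if is_allowed_offer(u, allowed_domains)]
```

Applied to the data above, this could look like `filtered_df['offer_urls'].apply(keep_allowed_offers)`, which yields a list of kept URLs per movie.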