# Abstract 

Digitization, as one of the key outcomes of technological growth, has led to deep changes in entertainment, the world of cinema, and many other socio-economical areas. As a result, the distribution and broadcasting strategy, that Netflix brought to the market, turned into an amazing success story in a very short lapse of time.

Netflix's strategy is based on the idea that consumers can access the platform's entire content catalogue for a monthly price. Additionally, Netflix only broadcasts its films on the web, with no theatrical or limited distribution. This approach, which is vastly different from the classic idea of the Hollywood studio system, has led to advances for audiences, directors and studios in various directions. In this way, we can say that streaming services, such as Netflix, are influencing the film industry in terms of how we access movies, what material we consume and how movies are made. Their dominant position places Internet platforms in a strong decisional position in terms of film content. In the future, this kind of authority could be key in determining what constitutes a "well-made film".

The impact of Internet streaming services on filmmakers has been one of the most important transformations in the show business in recent years. The promise of a more open environment for filmmakers than other large studios has attracted numerous directors to the platforms, with huge ramifications in the world of cinema. Furthermore, these services have less stringent standards than cinemas, making them attractive to producers. Another important aspect concerns independent directors. Since the 1980s, when Hollywood became the hub of cinema and blockbuster films began to dominate theatres, it has been difficult for independent directors to reach large audiences. Cinemas often prefer high-budget movies as they can profit much more from them. As a result, independent movies have had few opportunities outside of film festivals to date. However, as internet technologies advance, streaming services are becoming major actors in the world of cinema, and independent filmmakers are now facing the opportunity to reach a wider audience.

The purpose of our research is to investigate, through data, how streaming platforms have changed movies production. Has the world of production really become fairer?

# Data gatering (Everyone)
We start from two existing datasets:
* [Netflix](https://www.kaggle.com/datasets/shivamb/netflix-shows):it is one of the most popular media and video streaming platforms. They have over 8000 movies or tv shows available on their platform, as of mid-2021, they have over 200M Subscribers globally. This tabular dataset consists of listings of all the movies and tv shows available on Netflix, along with details such as - cast, directors, ratings, release year, duration, etc.

* [Amazon prime](https://www.kaggle.com/datasets/shivamb/amazon-prime-movies-and-tv-shows): it is one of the most popular media and video streaming platforms. They have close to 10000 movies or tv shows available on their platform, as of mid-2021, they have over 200M Subscribers globally. This tabular dataset consists of listings of all the movies and tv shows available on Amazon Prime, along with details such as - cast, directors, ratings, release year, duration, etc.*



In [2]:
import pandas as pd # queried_data processing
import pandas_profiling as pp
import numpy as np # linear algebra

In [66]:
df_netflix = pd.read_csv('Dataset/originalDataset/netflix_titles.csv')
df_amazon = pd.read_csv('Dataset/originalDataset/amazon_prime_titles.csv')
print(len(df_netflix), len(df_amazon))

8807 9668


# Preparing data (Giulia)

The first step is to concatenate the Amazon and Netflix datasets while keeping the storage information. We add two columns: netflix and amazon both with a value of 1 or 0 which represent the absence or presence of the title on the platform. To keep the added date information we rename the columns to distinguish the related streaming service, respectively in date_added_netflix and date_added_amazon.

In [67]:
df_netflix.drop(columns = df_netflix.columns[0], axis = 1, inplace= True)
df_netflix['netflix'] = 1
df_netflix['amazon'] = 0
df_netflix.rename(columns = {'date_added':'date_added_netflix'}, inplace = True)

df_amazon.drop(columns = df_amazon.columns[0], axis = 1, inplace= True)
df_amazon['amazon'] = 1
df_amazon['netflix'] = 0
df_amazon.rename(columns = {'date_added':'date_added_amazon'}, inplace = True)

In [68]:

dataset = pd.concat([df_netflix, df_amazon],axis=0, join="outer", sort=False)
dataset = dataset.reset_index(drop=True)
dataset.head(3)


Unnamed: 0,type,title,director,cast,country,date_added_netflix,release_year,rating,duration,listed_in,description,netflix,amazon,date_added_amazon
0,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",1,0,
1,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",1,0,
2,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,1,0,


The concatenated dataset has some errors. Shows that are both featured on Netflix and Amazon Prime are recorded twice in our dataset: the first time with netflix information only, the second time only on amazon. We decide to extract year of publication, type (film or TV show) and title name and store them in a triple. We then check the triples that the two datasets have in common and store them in a list called 'title'.

In [69]:
netflix = []
amazon = []

def union(df, new):
    for  i, x in df['title'].iteritems():
        year = df['release_year'][i]
        type = df['type'][i]
        movie = x
        new.append((year, type, movie))
    return new

union(df_netflix, netflix)
union(df_amazon, amazon)

print(len(netflix), len(amazon))

8807 9668


In [70]:
title = []
for (y,t,m) in netflix:
    if (y,t,m) in amazon:
        title.append((y,t,m))
len(title)

182

Finally, we iterate over the concatenated dataset querying only the shared titles, merging Amazon and Netflix information about possession and date of addition.
We decide to take description, country and cast information from the Netflix dataset because it was the best compiled of the two. So, at the end of this process, we drop duplicates filtered by title, director, release year and type; keeping the first entries. Then we fill null value with 'No Data'.

In [71]:
df = dataset.copy()
df.replace(np.nan, 'null', inplace=True)
to_drop = set()
for i, r in df.iterrows(): 
    if (r['release_year'],r['type'],r['title']) in title:
        df.loc[i,'netflix'] = 1
        df.loc[i, 'amazon'] = 1
        q = df.query('title=="'+r['title']+'" & type=="'+r['type']+'" & release_year== '+ str(r['release_year']) +'')

        null_val = []      
        for j, x in q.iterrows():
            if x['date_added_netflix'] != 'null':
                df.loc[i, 'date_added_netflix'] = x['date_added_netflix']

            if x['date_added_amazon'] != 'null':
                df.loc[i, 'date_added_amazon'] = x['date_added_amazon']

            n = 0 
            for i in x.values:
                if i == 'null':
                    n+=1
            null_rate = n / len(x.values) * 100
            null_val.append((j,null_rate))
        
        if len(null_val) >= 2:
            row_to_keep = min(null_val, key=lambda tup: tup[1])
            null_val.remove(row_to_keep)
            for i, x in null_val:
                if x != 0.0:
                    to_drop.add(i)


df = df.drop(to_drop)

In [72]:
df.replace( 'null', np.nan, inplace=True)
for i in df.columns:
    null_rate = df[i].isna().sum() / len(df) * 100 
    if null_rate > 0 :
        print("{} null rate: {}%".format(i,round(null_rate,2)))

type null rate: 0.01%
title null rate: 0.01%
director null rate: 25.64%
cast null rate: 11.23%
country null rate: 52.86%
date_added_netflix null rate: 51.91%
release_year null rate: 0.01%
rating null rate: 1.83%
duration null rate: 0.02%
listed_in null rate: 0.01%
description null rate: 0.01%
netflix null rate: 0.01%
amazon null rate: 0.01%
date_added_amazon null rate: 99.16%


From null values score we notice that we miss some information about cast, director and countires. We don't consider the null rate of date_added_netflix column because netflix titles cover half of the dataset so, 52.86% null rate is a normal score. The colum date_added_amazon, on the contrary, has a lot of missing information that we will try to fill during data enrichment fases. 

In [73]:
df['date_added_netflix'].replace(np.nan, 'No Data',inplace  = True)
df['date_added_amazon'].replace(np.nan, 'No Data',inplace  = True)
df['country'].replace(np.nan, 'No Data',inplace  = True)
df['director'].replace(np.nan, 'No Data',inplace  = True)
df['cast'].replace(np.nan, 'No Data',inplace  = True)
df['rating'].replace(np.nan, 'No Data',inplace  = True)
df['title'] = df['title'].replace({'"':''}, regex=True)
df['title'] = df['title'].replace({'\n':' '}, regex=True)
df = df.drop_duplicates(subset=['title','director', 'release_year', 'type'], keep='first')
df = df.dropna()
df = df.reset_index(drop=True)
df.to_csv('data.csv', index=False)

df.head(3)

Unnamed: 0,type,title,director,cast,country,date_added_netflix,release_year,rating,duration,listed_in,description,netflix,amazon,date_added_amazon
0,Movie,Dick Johnson Is Dead,Kirsten Johnson,No Data,United States,"September 25, 2021",2020.0,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",1.0,0.0,No Data
1,TV Show,Blood & Water,No Data,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021.0,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",1.0,0.0,No Data
2,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",No Data,"September 24, 2021",2021.0,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,1.0,0.0,No Data


# Data enrichment (Camilla)

Our data will be enriched using two sources:

* [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page):  is a free and open knowledge graph, containing linked information used by the famous online Wikipedia. 

* [iMdB](https://www.imdb.com/):  the Internet Movie Database is the world's most popular and authoritative source for movies, TV shows and celebrity content, where you can find ratings and reviews by critique and the public.

In order to query information, we split our dataset into Movies and TV Shows. Through Wikidata we retrieve missing information from our starting dataset about countries and directors, adding director gender data. In addition, we add interesting information for us such as the gender of the director and the distributor. Finally, we also retrieve the IMDb id of the movie for future queries. The results are stored in two CSV files: one for movies and the other for TV shows.

Then we query IMDb's API asking for general information about titles, starting from the id retrieved from Wikidata. From this API we get information about contents languages, keywords, distributors, critique ratings (Rotten Tomatoes) and public ratings (IMDb ratings). The results are then stored in JSON format.

In [3]:
import pprint #indet json 
import requests #make http requests
import json
from qwikidata.sparql  import return_sparql_query_results #return sparql results
from SPARQLWrapper import SPARQLWrapper, JSON #questo serve a vedere la struttura delle risposte
import ssl
from http.client import IncompleteRead
import time
import urllib.error
from textblob import Word

In [75]:
movie_title = df.query("type == 'Movie'")
print(len(movie_title))

# movie_query = """
#             SELECT ?film_label ?director_label ?dir_gen_label ?distributor_label ?imdbID ?rottenscore
#             WHERE
#             {
#             ?film wdt:P31 wd:Q11424 .
#             ?film rdfs:label """+'"'+ x +'"' +"""@en .
#             ?film rdfs:label ?film_label .
#             FILTER(lang(?film_label) = 'en')
#             OPTIONAL {?film wdt:P57 ?director . 
#             ?director rdfs:label ?director_label .    
#             FILTER(lang(?director_label) = 'en')
#             OPTIONAL {?director wdt:P21 ?dir_gen . 
#             ?dir_gen rdfs:label ?dir_gen_label .
#             FILTER(lang(?dir_gen_label) = 'en')}}
#             OPTIONAL {?film wdt:P750 ?distributor . 
#             ?distributor rdfs:label ?distributor_label .
#             FILTER(lang(?distributor_label) = 'en')}
#             OPTIONAL {?film wdt:P345 ?imdbID.}
#             OPTIONAL {?film wdt:P444 ?rottenscore.}
#             }
#             """

13784


In [76]:
tv_series = df.query("type == 'TV Show'")
print(len(tv_series))

# tv_series_query = """
#             SELECT ?series_label ?director_label ?dir_gen_label ?distributor_label ?imdbID ?rottenscore
#             WHERE
#             {
#             ?series wdt:P31 wd:Q5398426 .
#             ?series rdfs:label """+'"'+ x +'"' +"""@en .
#             ?series rdfs:label ?series_label .
#             FILTER(lang(?series_label) = 'en')
#             OPTIONAL {?series wdt:P57 ?director . 
#             ?director rdfs:label ?director_label .    
#             FILTER(lang(?director_label) = 'en')
#             OPTIONAL {?director wdt:P21 ?dir_gen . 
#             ?dir_gen rdfs:label ?dir_gen_label .
#             FILTER(lang(?dir_gen_label) = 'en')}}
#             OPTIONAL {?series wdt:P449 ?distributor . 
#             ?distributor rdfs:label ?distributor_label .
#             FILTER(lang(?distributor_label) = 'en')}
#             OPTIONAL {?series wdt:P345 ?imdbID.}
#             OPTIONAL {?series wdt:P444 ?rottenscore.}
#             }
#             """

4505


In [77]:
# def wikidata(data, query):

#     title = []
#     director = []
#     gender = []
#     distributor = []
#     imdbID = []
#     rottenscore = []
#     not_found = []

    
#     # get the endpoint API
#     wikidata_endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"
        

#     for x in data():
        
#         try:
#             my_SPARQL_query = query
#             # set the endpoint 
#             sparql_wd = SPARQLWrapper(wikidata_endpoint)
#             # set the query
#             sparql_wd.setQuery(my_SPARQL_query)
#             # set the returned format
#             sparql_wd.setReturnFormat(JSON)
#             # get the results
            
#             results = sparql_wd.query().convert()

#             if results['results']['bindings'] == []:
#                 not_found.append(""+x+"")
                
#             else:
#                 title.append(results['results']['bindings'][0]['film_label']['value'])
#                 if "director_label" in results['results']['bindings'][0]:
#                     director.append(results['results']['bindings'][0]['director_label']['value'])
#                 else:
#                     director.append("no_data")
#                 if "dir_gen_label" in results['results']['bindings'][0]:
#                     gender.append(results['results']['bindings'][0]['dir_gen_label']['value'])
#                 else:
#                     gender.append("no_data")
#                 if "distributor_label" in results['results']['bindings'][0]:
#                     distributor.append(results['results']['bindings'][0]['distributor_label']['value'])
#                 else:
#                     distributor.append("no_data")
#                 if "imdbID" in results['results']['bindings'][0]:
#                     imdbID.append(results['results']['bindings'][0]['imdbID']['value'])
#                 else:
#                     imdbID.append("no_data")
#                 if "rottenscore" in results['results']['bindings'][0]:
#                     rottenscore.append(results['results']['bindings'][0]['rottenscore']['value'])
#                 else:
#                     rottenscore.append("no_data")
                
#         except urllib.error.HTTPError as e:
#             time.sleep((int(e.headers["retry-after"])) + 1)
#             error_title = query.index(x)
#             wikidata(query[error_title:])

#     df = {"id": imdbID, "title": title, "director": director, "director_gender": gender, "distributor": distributor, "rating score": rottenscore}
#     return df
            
# movie_wiki = wikidata(movie_title['title'], movie_query)
# movie_wiki.to_csv('Dataset/query_results_data/wikidata_query_results/movie.csv', index = False)
# serie_wiki = wikidata(tv_series['title'], tv_series_query)
# serie_wiki.to_csv('Dataset/query_results_data/wikidata_query_results/TVShows.csvv', index = False)

In [78]:
movie_wiki = pd.read_csv('Dataset/query_results_data/wikidata_query_results/movie.csv')
serie_wiki = pd.read_csv('Dataset/query_results_data/wikidata_query_results/TVShows.csv')

print(len(movie_wiki), len(serie_wiki))

6666 1969


In [79]:
# def imdb(data, filename):
#     json_list = []
#     for id in data:

#         url = "https://imdb-api.com/en/API/UserRatings/k_2xtxoo0v/"+id+""

#         resp = requests.get(url)
#         data = resp.json()
#         json_list.append(data)

#         with open(filename, 'w') as file_object:  #open the file in write mode
#             json.dump(json_list, file_object)

# imdb(movie_wiki['id'].to_list(), 'movie.json')
# imdb(serie_wiki['id'].to_list(), 'series.json')


In [80]:
f = open('Dataset/query_results_data/imdb_query_results/movie.json')
movie_imdb = json.load(f)

f = open('Dataset/query_results_data/imdb_query_results/series.json')
serie_imdb = json.load(f)

# Data Cleaning (Giulia)
In order to clean the data, we first merge IMDb results with Wikidata results through the function query_union(). Then we fill our starting dataset using the function reconciliation(). Then, we replace any 'No Data', '', 'no_data' and 'None' strings with null values. We normalize to singular duration information that was expressed with 'Serie', 'Series' and 'min' values to keep only 'Serie' and 'min' values. Then we drop brackets from 'distributor' columns since IMDb had stored the data as stringified lists, so as to get a comma-separated list of directors. Finally we proceed to control the rate of the null value occurring.

In [81]:
def query_union(imdb, wiki):
    for x in range(len(imdb)):
        id = imdb[x]['id']
        if id:
            row = wiki.query('id == "'+id+'"')
            index = row.index.values
            if len(index) != 0:
                index = row.index.values[0]
                wiki.loc[index, 'title_imdb'] = imdb[x]['title']
                wiki.loc[index, 'release_date'] = imdb[x]['releaseDate']

                wiki.loc[index, 'directors'] = imdb[x]['directors']
                wiki.loc[index, 'cast'] = imdb[x]['stars']

                if wiki._get_value(index, 'distributor') == 'no_data':
                    wiki.loc[index, 'distributor'] = imdb[x]['companies']

                wiki.loc[index, 'countries'] = imdb[x]['countries']
                wiki.loc[index, 'languages'] = imdb[x]['languages']
                wiki.loc[index, 'rating'] = imdb[x]['contentRating']
                
                wiki.loc[index, 'imDbRating'] = imdb[x]['imDbRating']
                wiki.loc[index, 'imDbRatingVotes'] = imdb[x]['imDbRatingVotes']

                if wiki._get_value(index, 'rating score') == 'no_data':
                    if imdb[x]['ratings']:
                        wiki.loc[index, 'rating score'] = imdb[x]['ratings']['rottenTomatoes']

                if imdb[x]['boxOffice']:
                    wiki.loc[index, 'budget'] = imdb[x]['boxOffice']['budget']
                    wiki.loc[index, 'cumulativeWorldwideGross'] = imdb[x]['boxOffice']['cumulativeWorldwideGross']

                wiki.loc[index, 'keywords'] = imdb[x]['keywords']
                wiki.loc[index, 'awards'] = imdb[x]['awards']
    
    d = wiki
    return d

movie = query_union(movie_imdb, movie_wiki)
movie["release_date"] = pd.to_datetime(movie['release_date'])
movie['relese_year'] = movie['release_date'].dt.year

serie = query_union(serie_imdb, serie_wiki)
serie["release_date"] = pd.to_datetime(serie['release_date'])
serie['relese_year'] = serie['release_date'].dt.year

In [None]:
def reconciliation(starting_data, queried_data, type):
    for i, r in starting_data.iterrows():
        results = queried_data['title'].to_list()
        if r['type'] == type:
            if r['title'] in results:
                index = results.index(r['title'])

                starting_data.loc[i, 'imdb_id'] = queried_data._get_value(index, 'id')
                starting_data.loc[i, 'languages'] = queried_data._get_value(index, 'languages')
                starting_data.loc[i, 'director_gender'] = queried_data._get_value(index, 'director gender')

                starting_data.loc[i, 'distributor'] = queried_data._get_value(index, 'distributor')
                starting_data.loc[i, 'imDbRating'] = queried_data._get_value(index, 'imDbRating')
                starting_data.loc[i, 'imDbRatingVotes'] = queried_data._get_value(index, 'imDbRatingVotes')
                starting_data.loc[i, 'rottenTomatoes'] = queried_data._get_value(index, 'rating score')

                starting_data.loc[i, 'budget'] = queried_data._get_value(index, 'budget')
                starting_data.loc[i, 'gross'] = queried_data._get_value(index, 'cumulativeWorldwideGross')
                starting_data.loc[i, 'keywords'] = queried_data._get_value(index, 'keywords')
                starting_data.loc[i, 'awards'] = queried_data._get_value(index, 'awards')
                
                if r['director'] == 'No Data':
                    starting_data.loc[i, 'director'] = queried_data._get_value(index, 'director')
                elif pd.isna(r['director']):
                    starting_data.loc[i, 'directors'] = queried_data._get_value(index, 'directors')

                if r['country'] == 'No Data':
                    starting_data.loc[i, 'country'] = queried_data._get_value(index, 'countries')
                if r['rating'] == 'No Data':
                    starting_data.loc[i, 'rating'] = queried_data._get_value(index, 'rating')
                if r['cast'] == 'No Data':
                    starting_data.loc[i, 'cast'] = queried_data._get_value(index, 'cast')

                if r['date_added_netflix'] == 'No Data' and r['netflix'] == 1.0 and queried_data._get_value(index, 'relese_year') > 2006.0:
                    starting_data.loc[i, 'date_added_netflix'] = queried_data._get_value(index, 'release_date')

                if r['date_added_amazon'] == 'No Data' and r['amazon'] == 1.0 and queried_data._get_value(index, 'relese_year') > 2006.0:
                    starting_data.loc[i, 'date_added_amazon'] = queried_data._get_value(index, 'release_date')

    final_dataset = starting_data
    return final_dataset

working = reconciliation(df, movie, 'Movie')
final = reconciliation(working, serie, 'TV Show')

final.replace( 'No Data', np.nan, inplace=True)
final.replace( 'None', np.nan, inplace=True)
final.replace( '', np.nan, inplace=True)
final.replace( 'no_data', np.nan, inplace=True)
final.dropna(subset=['imdb_id'], inplace=True)
final.reset_index(drop=True, inplace=True)
final = final[['type', 'imdb_id', 'title', 'director','director_gender', 'cast', 'distributor', 'country', 'languages','release_year', 'rating', 'duration', 'listed_in', 'netflix', 'amazon', 'date_added_netflix', 'date_added_amazon', 'imDbRating', 'imDbRatingVotes','rottenTomatoes', 'budget', 'gross', 'description', 'keywords', 'awards']]
final['distributor'] = final['distributor'].str.replace(r"['()']","")
final['duration'] = final['duration'].apply(lambda w: Word(w).singularize())


In [83]:
for i in final.columns:
    null_rate = final[i].isna().sum() / len(final) * 100 
    if null_rate > 0 :
        print("{} null rate: {}%".format(i,round(null_rate,2)))

director null rate: 15.92%
director_gender null rate: 27.59%
cast null rate: 0.23%
distributor null rate: 23.03%
country null rate: 13.75%
languages null rate: 24.06%
rating null rate: 0.01%
date_added_netflix null rate: 39.87%
date_added_amazon null rate: 84.16%
imDbRating null rate: 22.87%
imDbRatingVotes null rate: 22.87%
rottenTomatoes null rate: 53.32%
budget null rate: 78.98%
gross null rate: 68.44%
keywords null rate: 27.62%
awards null rate: 50.54%


In [84]:
final_dataset = final.copy()

From the score of the null values ​​we note that the null values ​​of the director's data drop from 25.64% to 15.92%. National data also improved from a zero rate of 52.86% to 13.75%. Netflix added date reaches a score of 39.87% while amazon added date reaches 84.16%.
The null transmission rate is almost completely filled in.

Regarding the data coming from the queries, we see the gender of the director, the distributor, the languages, imDb rating, imDbRating votes and the keywords as the best compiled. In contrast, Rotten Tomatoes, budgets, gross, and awards cover less than half of the dataset, but we have decided to keep that information as it could be helpful in understanding the impact of streaming services on the film industry.

At this point we analyze our data to understand the type of normalization process to be applied and if it is possible to fill in the missing information starting from the records of other columns.

### Director Gender - 27% null to 18.92%  (Giulia)

Director gender could be compiled starting from director records. We use the library [gender-guesser](https://pypi.org/project/gender-guesser/), in order to get the gender from the director's names. This library converts any English name into its gender retrieving six types of results: male, female, mostly_male, mostly_female, unknown and andy  (be used for both genders). We decide to transform mostly_male into male, mostly_female into female, andy is split half  into female and half into male. As for unknowns, we decide to keep them as attention-worth information.

In [85]:
import gender_guesser.detector as gender
d = gender.Detector()
missing_gender = final_dataset.query('director.notnull() & director_gender.isnull()')

In [86]:
andy = 0
for  index, name in missing_gender['director'].iteritems(): 
    if not pd.isna(name):
        full_names = name.split(', ')
        genders = []
        for i in full_names:
            name = i.split()[0]
            if '.' in name:
                name = i.split()[1]
            gender = d.get_gender(name)
            if gender == 'mostly_male':
                gender = 'male'
            elif gender == 'mostly_female':
                gender = 'female'
            elif gender == 'andy':
                if (andy % 2) == 0:
                    gender = 'male'
                else:
                    gender = 'female'
                andy += 1
            genders.append(gender)
        final_dataset.loc[index, 'director_gender'] = ', '.join(genders)

### Languages - 22.76% null to 13.75%   (Giulia)

Langauges cold be compiled form country data. We use two library: [CountryInfo](https://pypi.org/project/countryinfo/) and [Babel](https://babel.pocoo.org/en/latest/). First we get the list of spoken languages in each country, using CountryInfo's module .languages() that retrieves its iso-alpha code. Finally, we convert it with the results in languages name using .get_language_name('lang') module. During this process we must deal with two problems: some countries are not present in the CountryInfo dataset, and some iso-alpha codes are not converted by Babel. To overcome them, we manually compile the dataset creating the LANG_MAPPING dictionary. 

In [87]:
from countryinfo import CountryInfo
from babel import Locale
missing_languages = final_dataset.query('country.notnull() & languages.isnull()')

LANG_MAPPING = {
    'Soviet Union' : 'Russian',
    'Bahamas': 'English',
    'Kosovo': 'Albanian, Serbian',
    'South Africa': 'English',
    'United Kingdom': 'English',
    'gn': 'Guarani',
    'nr': 'South Ndebele',
    'st': 'Sotho',
    'tn': 'Tswana',
    'ts' : 'Tsonga',
    've' : 'Venda',
}

In [88]:
for index, x in missing_languages['country'].iteritems():
    if not pd.isna(x):
        to_add = set()
        l = x.split(', ')
        for i in range(len(l)):
            name = l[i].replace(',', '')
            if name in LANG_MAPPING.keys():
                lang = LANG_MAPPING[name]
                to_add.add(lang)
            else:
                country = CountryInfo(name)
                data = country.languages()
                list = set()
                for item in data:
                    if item in LANG_MAPPING.keys():
                        lang = LANG_MAPPING[item]
                        list.add(lang)
                    else:
                        language = Locale.parse(item)
                        language = language.get_language_name('en')
                        list.add(language)
                to_add.add(', '.join(list)) 

        final_dataset.loc[index, 'languages'] = ', '.join(to_add)

### Keywords - 26.47% null to 0%  (Giulia)

Keywords missing data could be compiled starting from titles' descriptions. To do that we use [Spacy](https://spacy.io), a natural language processing library. We decide to keep any world annotated by this library as a noun (NOUN) and pronoun (PROPN).


In [89]:
import spacy
keywords_to_find = final_dataset.query('keywords.isnull() and description.notnull()')
NER = spacy.load("en_core_web_sm")
to_take = ['NOUN', 'PROPN']
keywords = []
for index, x in keywords_to_find['description'].iteritems():
    parsed = NER(x)
    tokens = []
    for token in parsed:
        if token.pos_ in to_take:
            if token.text not in tokens:
                tokens.append(token.text)
    keys = ','.join(tokens)
    final_dataset.loc[index, 'keywords']  = keys

### Rating - normalization   (Camilla)
We normalize rating information grouping the various categories into six macro categories: kids all, older kids 7+, teens 13+, young adults 16+, adults 18+ and unrated.

In [90]:
for index, rating in final_dataset['rating'].iteritems():
    if not pd.isna(rating):
        if rating == 'ALL'or rating == 'TV-G'or rating == 'TV-Y'or rating == 'G':
            final_dataset.loc[index, 'rating']  = 'Kids All'
        elif rating == '7+'or rating == 'PG'or rating == 'TV-PG'or rating == 'TV-Y7'or rating == 'TV-Y7-FV' :
            final_dataset.loc[index, 'rating']  = 'Older Kids 7+'
        elif rating == '13+'or rating == 'PG-13' :
            final_dataset.loc[index, 'rating']  = 'Teens 13+'
        elif rating == '16'or rating == '16+'or rating == 'AGES_16_'or rating == 'TV-14':
            final_dataset.loc[index, 'rating']  = 'Young Adults 16+'
        elif rating == '18+'or rating == 'AGES_18_'or rating == 'TV-MA'or rating == 'NC-17' or rating == 'R':
            final_dataset.loc[index, 'rating']  = 'Adults 18+'
        elif rating == 'NR'or rating == 'UNRATED'or rating == 'UR'or rating == 'NOT_RATE' or rating == 'TV-NR':
            final_dataset.loc[index, 'rating']  = 'Unrated'

### Listed in - normalization   (Giulia)

Amazon and Netflix use to list titles into genre macro categories. They use different names, sometimes differentiating contents genre if it deals with TV Shows or Movies. In order to normalize the titles' genre information we first normalize them to the singular form, and then we convert the results using the GEN_MAPPING dictionary.

In [91]:
GEN_MAPPING = {
    'Lgbtq movie': 'Lgbtq',
    'Suspense':'Thriller',
     'Sports movie':'Sport',
     'Anime feature':'Anime',
     'Anime series':'Anime',
    'International movie':'International',
    'International tv show':'International',
     'Science fiction': 'Sci-fi & fantasy',
      'Tv sci-fi & fantasy':'Sci-fi & fantasy',
    'Fantasy':'Sci-fi & fantasy',
     'Tv drama' : 'Drama',
    'Horror movie':'Horror',
    'Tv horror' :'Horror',
    'Adventure':'Action & adventure',
    'Action':'Action & adventure',
    'Tv action & adventure':'Action & adventure',
    'Romantic movie':'Romance',
    'Faith and spirituality':'Faith & spirituality',
     'Arthouse': 'Art',
       'And culture': 'Art',
    'Movie':'Classic & cult',
    'Special interest':'Independent', 
    'Kid':'Children & family',
     'Young adult audience':'Teen',
     'Teen tv show':'Teen',
     'Children & family movie':'Children & family',
     "Kid's tv":'Children & family',
     "Kids' tv":'Children & family',
     'Stand-up comedy': 'Stand-up comedy & talk show',
     'Tv thriller':'Thriller',
     'Tv comedy': 'Commedy',
     'Comedy':'Commedy',
     'Classic movie': 'Classic & cult',
     'Classic & cult tv': 'Classic & cult',
     'Cult movie': 'Classic & cult',
     'Docuseries':'Documentary',
     'Romantic tv show':'Romance',
     'Crime tv show':'Crime & mystery',
     'Tv mystery':'Crime & mystery',
     'Tv show': 'Reality',
     'Reality tv': 'Reality',
     'Science & nature tv': 'Science & nature',
     'Indipendent movie':'Indipendent',
     'Music videos and concert':'Music & musical',
     'Spanish-language tv show': 'International',
     'Korean tv show': 'International',
     'British tv show': 'International',
     'Independent movie':'Independent'

    }

In [92]:
df = pd.DataFrame()

df['genre'] = final_dataset['listed_in'].apply(lambda x :  x.replace(' ,',',').replace(', ',',').split(',')) 
for index, i in df['genre'].iteritems(): 
    for n in range(len(i)):
        w = i[n]
        w = Word(w).singularize().capitalize()
        if w in GEN_MAPPING.keys():       
            w = GEN_MAPPING[w]
        i = i[:n]+[w]+i[n+1:]
    final_dataset.loc[index, 'listed_in'] = ', '.join(i)


### Distributor - normalization  (Giulia)

We notce that in iMDb distributor data, Amazon was recoded using different names, refering to the company department responsable for the title distribution. As we wanna keep only cumulative information we convert any of those strings in Amazon Prime.

In [93]:
final_dataset['distributor'].replace(  np.nan, 'no_data', inplace=True)

In [94]:
df = pd.DataFrame()
Types = set()
df['distr'] = final_dataset['distributor'].apply(lambda x :  x.replace(' ,',',').replace(', ',',').split(',')) 
for index, i in df['distr'].iteritems(): 
    for n in range(len(i)):
        w = i[n]
        if 'Amazon' in w or 'amazon' in w:
            w = 'Amazon Prime'
        i = i[:n]+[w]+i[n+1:]
    final_dataset.loc[index, 'distributor'] = ', '.join(i)

### Rotten Tomatoes and iMDb Ratings - normalization  (Camilla)
Rotten Tomatoes data was sometimes expressed as a percentage, sometimes in fractional form. They are also of type string and not numeric. We, therefore, decided to convert them into percentages of type float. IMDb Ratings were expressed as float form value 0 to 10. In order to prepare data for analysis, we diced to also convert  IMDb Ratings into percentages.

In [95]:
for  index, vote in final_dataset['rottenTomatoes'].iteritems():
    if not pd.isna(vote) and type(vote) != float:
        l = (vote.split())
        if len(l) > 1:
            l.remove('of')
            val = float(l[0])*100//float(l[1])
        if '%' in vote:
            val = float(vote.replace('%', ''))
        elif '/' in vote:
            l = vote.split('/')
            val = float(l[0]) * 100 // float(l[1])
        else:
            val = float(l[0])
            
        final_dataset.loc[index, 'rottenTomatoes'] = val

for  index, imbdvote in final['imDbRating'].iteritems():
    if pd.notnull(imbdvote):
        val = float(imbdvote)*10
        final_dataset.loc[index, 'imDbRating'] = val

### Awards - normalization  (Giulia)

Awards were expressed as strings containing the first half a well-known award name (such as Oscars), information on the status of the award (if the title was nominated or it actually won the award), and the total number of prizes or nominations received. In the second half, there is the recorded total amount of nominations and prizes won by that title from any kind of competition or contest. We split this data into five columns: special_award_name, special_award_tot, special_award_stat, award_win_tot and award_nomination_tot.

In [96]:
for  index, x in final_dataset['awards'].iteritems():
    if not pd.isna(x) and x != '':
        list = x.split(", ")
        
        k = list[0]

        special = k.split('|')

        nameA =[]
        tot = []
        stat= []
        for item in special:     
            l = item.split()
            
            if 'for' in l:
                l.remove('for')

            if 'Top' not in l and len(l) > 1:
                stat.append(l[0])
                tot.append(l[1])
                name = ' '.join(l[2:])
                name = name.removesuffix('s')
                nameA.append(name)
                
            elif 'Top' in l and len(l) > 1:
                tot.append('1')
                name = ' '.join(l[:3])
                nameA.append(name)
                stat.append('nn')
                
            else:
                if l[0] != 'Nominated' and l[0] != 'Won':
                    nameA.append(l[0])
                    tot.append('1')
                    stat.append('nn')
                    
        final_dataset.loc[index, 'special_award_name']  = ', '.join(nameA)
        final_dataset.loc[index, 'special_award_tot']  = ', '.join(tot)
        final_dataset.loc[index, 'special_award_stat']  = ', '.join(stat)

        w = list[1].split('&')
        win_tot = []
        nomination_tot = []
        
        for n in range(len(w)):
            if 'win' in  w[n]  and 'nominations' not in w[n]:
                nom = w[n].split()
                win_tot.append(int(nom[0]))
            if 'nomination' in w[n] and 'win' not in w[n]:
                win = w[n].split()
                nomination_tot.append(int(win[0]))
        
        if nomination_tot != []:
            final_dataset.loc[index, 'award_nomination_tot']  = nomination_tot[0]
        else :
            final_dataset.loc[index, 'award_nomination_tot']  = np.nan

        if win_tot != []:
            final_dataset.loc[index, 'award_win_tot']  = win_tot[0]
        else :
            final_dataset.loc[index, 'award_win_tot']  = np.nan
 
    else:
        final_dataset.loc[index, 'special_award_name']  = np.nan        
        final_dataset.loc[index, 'special_award_tot']  = np.nan        
        final_dataset.loc[index, 'special_award_stat']  = np.nan        
        final_dataset.loc[index, 'award_win_tot']  = np.nan        
        final_dataset.loc[index, 'award_nomination_tot']  = np.nan
        

### Budget and Gross - normalization  (Giulia)

Budget data was expressed in different currencies. We decide to normalize them in euro using CurrencyConverter library. Here we had to deal with some issues. Budgets were recorded as strings, so we had first to separate the numbers from the currency symbol. Then using CURRENCY_MAPPING dictionary we convert currency symbols in its ISO code used as a parameter for the conversion that was made using the .convert(tot, currency) method. Not all the ISO currency codes were present on CurrencyConverter so we add the CURRENCY_EXRATE dictionary to perform the conversion manually.

To normalize gross data we followed the same procedure, beside the fact that all records were expressed in american dollars because it refere to world wide tiltes' gross. So we are not resorting to dictionaries, but we convert the ammount using .convert(tot, 'USD') method

In [97]:
import re
from currency_converter import CurrencyConverter
c = CurrencyConverter()

CURRENCY_MAPPING = {
    '$' : 'USD',
    'A$': 'AUD',
    'CA$': 'CAD',
    'CN¥': 'CNY',
    'HK$': 'HKD',
    'MX$': 'MXN',
    'NT$': 'TWD',
    'RUR': 'RUB',
    '£' : 'GBP',
    '₩' : 'KRW',
    '₪' : 'ILS',
    '€' : 'EUR',
    '₹' : 'INR',
}

CURRENCY_EXRATE = {
    'ARS' : 0.0070, 
    'EGP' : 0.052, 
    'MVR' : 0.065, 
    'NGN' : 0.0023, 
    'PKR' : 0.0043, 
    'TWD' : 0.032,
}

In [98]:
# Budget normalization
for index, x in final_dataset['budget'].iteritems():
    if not pd.isna(x):
        x = x.split()
        if '(estimated)' in x:
            x.remove('(estimated)')
        if len(x) == 1:
            val = x[0].split(',')
            val = ''.join(val)
            currency = [s for s in re.findall(r"[^0-9.]", val)]
            currency = ''.join(currency)
            tot = val.strip(currency)
        elif len(x) != 0:
            tot = x[1].split(',')
            tot = ''.join(tot)
            currency = x[0]

        if currency in CURRENCY_MAPPING.keys():
            currency = CURRENCY_MAPPING[currency]

        if currency in c.currencies:
            ammount = int(c.convert(tot, currency))
        else:
            ex_Rate = CURRENCY_EXRATE[currency]
            ammount = int(int(tot) // ex_Rate)

        final_dataset.loc[index, 'budget']  = ammount 
    else:
        final_dataset.loc[index, 'budget']  = np.nan

# Gross normalization
for index, x in final_dataset['gross'].iteritems():
    if not pd.isna(x):
        x = x.split(',')
        x = ''.join(x)
        x = [int(s) for s in re.findall(r'\b\d+\b', x)]
        ammount = int(c.convert(x[0], 'USD'))
        final_dataset.loc[index, 'gross']  =  ammount
    else:
        final_dataset.loc[index, 'gross']  = np.nan

### Final Dataset

Finally, we convert any 'no_data' string, added during the cleaning phases to deal with all the procedures, in nan values and we save the final dataset in CSV format.

In [99]:
final_dataset.replace( np.nan, 'no_data', inplace=True)
final_dataset.to_csv('final_dataset.csv', index= False)

# Data analysis and visualization (Giulia)

Starting from our data and our research question: how streaming platforms have changed film production; we decided to divide the analysis into 3 steps:

1. What kind of content is present on Streaming Platforms? 
2. How streaming services treats their content? 
3. How streaming service influence cinema industry?

To deal with visualization and analysis we use [matplotlib](https://matplotlib.org), [plotly](https://go.plotly.com/website) and [sklearn](https://scikit-learn.org/stable/) library.

In [4]:
import tkinter
import matplotlib
import matplotlib.colors
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from sklearn.preprocessing import MultiLabelBinarizer 

data = pd.read_csv('final_dataset.csv')
colors = ['#0f79af','#5287c1','#7b95d0','#a0a3de','#c2b2ea','#e2c1f5','#ffd2ff','#ffb5ea','#ff96cc','#ff76a7','#ff547c','#f9314c','#e50914']
two_colors =['#ff547c', '#a0a3de']
margin={'t':10,'l':10,'b':10,'r':10}
margin_title = {'t':50,'l':10,'b':10,'r':10}

# What kind of content is present on Streaming Platforms?
To represent contents we use data about the type of item (if in Tv Shows or Movies), country of provenance, and spoken languages. Then to have a deeper insight we look to the directors' gender and rating, who has made the content and for whom the content is made.

### Tv Series/Movie distribution sunburst
We plot with a sunburst chart the distribution of content typology in our dataset (Movie, TV Show) and the consequent subdivision between Amazon and Netflix.

**Conclusion:** Netflix is more represented in our dataset than Amazon. Surprisingly there are more films than TV series on both platforms.

In [101]:
x= data.groupby(['type'])[['amazon', 'netflix']].sum()
y=len(data)
r=((x*100/y)).round(2)
ratio = pd.DataFrame(r)
ratio
df = ratio.stack().reset_index()
df.columns = ['type','distr','val']
df['distr'] = df['distr'].apply(lambda x: x.capitalize())
fig = px.sunburst(df, path=['distr', 'type'], values='val',color='distr', hover_name="type", color_discrete_map={'Netflix':'#ff547c', 'Amazon':'#a0a3de'})

fig.update_traces(marker=dict(line=dict(color='#212529', width=2)), textfont=dict(family="Arial, san serif",size=16,color="white"),hoverlabel=dict(font_size=14,font_family="Arial, san serif",font_color="white"),)
fig.update_traces(hovertemplate=('<br>%{customdata[0]} presence is: %{value}%<br>'))
fig.update_layout(paper_bgcolor="#212529", plot_bgcolor ='rgba(0, 0, 0, 0)', hoverlabel_bordercolor='#212529',margin=margin, yaxis_gridcolor ='rgba(225, 225, 225, .5)', font_color="white", font_family="Arial, san serif", font_size=15)
fig.write_html('visualization/one.html')
fig.show()


The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.


The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.



### Country analysis
We plot in a bar chart the 10 countries more represented in our dataset.

**Conclusion:** here we can notice that half of the content comes from the USA, the second most represented country is India followed by the UK. The huge presence of American content could be related to the fact that both companies are from there.

In [102]:
df = data.copy()
df['country'].replace('no_data', np.nan, inplace  = True)
df.dropna(inplace=True)
country = []
for i, x in df['country'].iteritems():
    l = x.split(', ')
    for c in l:
        country.append(c)
df = pd.DataFrame()
df['country'] = country
df['count'] = 1
df['country'].replace('United States', 'USA', inplace=True)
df['country'].replace('United Kingdom', 'UK',inplace=True)
df['country'].replace('South Korea', 'S. Korea',inplace=True)
x = df.groupby('country')['count'].sum().sort_values(ascending=False)[0:10]
y=len(df)
r=((x*100/y)).round(2)
df = pd.DataFrame(r)
fig = go.Figure(data=[go.Bar(x=df.index, y=df['count'].to_list())])
fig.update_traces(marker_color=colors[::-1][1:], marker_line_color='#212529',
                  marker_line_width=2)

fig.update_traces(marker=dict(line=dict(color='#212529', width=2)), hoverlabel=dict(font_size=14,font_family="Arial, san serif",font_color="white"))
fig.update_traces(hovertemplate=('<br>%{x} presence is: %{value}%<br><extra></extra>'))
fig.update_layout(paper_bgcolor="#212529", plot_bgcolor ='rgba(0, 0, 0, 0)', hoverlabel_bordercolor='#212529',margin=margin, yaxis_gridcolor ='rgba(225, 225, 225, .5)', font_color="white", font_family="Arial, san serif", font_size=15)

fig.update_yaxes(color="white")
fig.update_xaxes(color="white")
fig.write_html('visualization/two.html')
fig.show()

### Languages analysis

We plot in a bar chart the 10 languages most represented on our streaming services (Amazon and Netflix).

**Conclusion:** as we expected, English is the most prominent language in streaming services. Most Asian films are translated into English for international fruition and placed in this kind of worldwide services, even if Hindi is having a good representation with respect to others. European languages too have a good degree of representation.

In [103]:
df = data.copy()
df['languages'].replace('no_data', np.nan, inplace  = True)
df.dropna(inplace=True)
languages = []
for i, x in df['languages'].iteritems():
    l = x.split(', ')
    for lang in l:
        languages.append(lang)
df = pd.DataFrame()
df['languages'] = languages
df['count'] = 1
x = df.groupby('languages')['count'].sum().sort_values(ascending=False)[0:10]
y=len(df)
r=((x*100/y)).round(2)
df = pd.DataFrame(r)
fig = go.Figure(data=[go.Bar(x=df.index, y=df['count'].to_list())])
fig.update_traces(marker_color=colors[::-1][1:], marker_line_color='#212529',
                  marker_line_width=2)

fig.update_traces(marker=dict(line=dict(color='#212529', width=2)), hoverlabel=dict(font_size=14,font_family="Arial, san serif",font_color="white"))

fig.update_traces(hovertemplate=('<br>%{x} presence is: %{value}%<br><extra></extra>'))

fig.update_layout(paper_bgcolor="#212529", plot_bgcolor ='rgba(0, 0, 0, 0)', hoverlabel_bordercolor='#212529',margin=margin, yaxis_gridcolor ='rgba(225, 225, 225, .5)', font_color="white", font_family="Arial, san serif", font_size=15)

fig.update_yaxes(color="white")
fig.update_xaxes(color="white")
fig.write_html('visualization/three.html')
fig.show()

## Rating analysis

We plot in bar chart ratings present on our dataset divided by type (Movie and TV Show).

**Conclusion:** Here we can notice that most of the contents are made for Adults. Another interesting insight is the fact that movies present more teenagers than young adults ratings and TV Shows tend to be rated more by young adults than teens.

In [104]:
df = data.copy()
df['rating'].replace('no_data', np.nan, inplace  = True)
df.dropna(inplace=True)
df['count'] = 1
order = pd.DataFrame(df.groupby('rating')['count'].sum().sort_values(ascending=False).reset_index())
rating_order = order['rating'].to_list()
mf = df.groupby('type')['rating'].value_counts().unstack().sort_index().fillna(0).astype(int)[rating_order]
fig = go.Figure()
fig.add_trace(go.Bar(x=mf.columns, y=mf.loc['Movie'],
                base=0,
                marker_color='#ff547c',
                name='Movie'))
fig.add_trace(go.Bar(x=mf.columns,  y= mf.loc['TV Show'],
                marker_color='#a0a3de',
                name='TV Shows'
                ))

fig.update_traces(marker=dict(line=dict(color='#212529', width=2)), hoverlabel=dict(font_size=14,font_family="Arial, san serif",font_color="white") )
fig.update_traces(hovertemplate=('<br>Total %{x}: %{y}<br><extra></extra>'))
fig.update_layout(paper_bgcolor="#212529", plot_bgcolor ='rgba(0, 0, 0, 0)', hoverlabel_bordercolor='#212529',margin=margin, yaxis_gridcolor ='rgba(225, 225, 225, .5)', font_color="white", font_family="Arial, san serif", font_size=15)

fig.update_yaxes(color="white")
fig.update_xaxes(color="white")
fig.write_html('visualization/four.html')
fig.show()

### Director gender analysis

We plot a pie chart to understand gender distribution in our dataset. Then, in order to better get the evolution over time of gender inclusion, we plot a filled area chart.

**Conclusion:** FFrom the two plots we notice that male directors are still the majority. Interestingly, after 2014 (the year of the real boom of streaming services), female directors have seen an increase. But, in the same year, male directors also had an exponential growth. Are we really achieving a gender balance in the world of cinema, or are we simply producing more content in less time?

In [105]:
df = data.copy()
df['director_gender'].replace('no_data', np.nan, inplace  = True)
df = df.dropna(subset=['director_gender'])
gender = []
release_year = []
for i, x in df['director_gender'].iteritems():
    x = x.split(', ')
    for gen in x:
        gender.append(gen)
        year = df._get_value(i, 'release_year')
        release_year.append(year)
df = pd.DataFrame()
df['release_year'] = release_year
df['gender'] = gender
df['count'] = 1
df1 = pd.DataFrame(df.groupby('gender')['count'].sum().sort_values(ascending=False).reset_index())
gender_order = df1['gender'].to_list()
fig = px.pie(df1, values='count', names='gender')

fig.update_traces(textposition='inside', textinfo='percent+label', marker=dict(colors=colors[::-1][2:], line=dict(color='#212529', width=2)), textfont=dict(family="Arial, san serif",size=15,color="white"),hoverlabel=dict(font_size=14,font_family="Arial, san serif", font_color='white'))

fig.update_traces(hovertemplate=('<br>Total %{label} directors are %{value}<br><extra></extra>'))
fig.update_layout(paper_bgcolor="#212529", plot_bgcolor ='rgba(0, 0, 0, 0)', hoverlabel_bordercolor='#212529',margin=margin, yaxis_gridcolor ='rgba(225, 225, 225, .5)', font_color="white", font_family="Arial, san serif", font_size=15)
fig.write_html('visualization/five.html')
fig.show()

df2 = df[['release_year', 'gender']].groupby('gender')['release_year'].value_counts().unstack().loc[gender_order]
df2 = df2.stack().reset_index()
df2.columns = ['gender','year','val']
df2 = df2[(df2.year >= 2008)&(df2.year < 2022)]
df2 = df2[::-1]
df2
fig = px.area(df2, x="year", y="val", color="gender", line_shape='spline', color_discrete_sequence= colors[::-1][1:], labels={'gender':'Director gender', 'val':'Director amount', 'year': 'Movie release year'}, custom_data=['gender'])

fig.update_traces( marker=dict(line=dict(width=2)),textfont=dict(family="Arial, san serif",size=16,color="#212529"),hoverlabel=dict(font_size=14,font_family="Arial, san serif", font_color='white'))

fig.update_traces(hovertemplate=('<br>In %{x}, %{y} films by %{customdata[0]} directors were recorded<br><extra></extra>'))
fig.update_layout(paper_bgcolor="#212529", plot_bgcolor ='rgba(0, 0, 0, 0)', hoverlabel_bordercolor='#212529',margin=margin, yaxis_gridcolor ='rgba(225, 225, 225, .5)', font_color="white", font_family="Arial, san serif", font_size=15, xaxis_showgrid=False)
fig.write_html('visualization/six.html')
fig.show()

# How streaming services treats their content? 
o understand how streaming services deal with their content we first analyse how they add content on the platforms. We start from a yearly overview to then get more granular, restituting the monthly addition habits of streaming services. Then we explore which genre they add during the year and if there is some trend for each month. Because the genre is associated with the streaming platform we also analyse how Amazon and Netflix register their content by looking at genre co-occurrence. Finally, we investigate how old movies are and which national film culture is more represented by the two companies.


### Year added
We plot two line charts one for Netflix and one for Amazon to investigate movement chages by year in content type and quantity addition. 

**Conclusions:** Netflix and Amazon act in different ways in a matter of content addition. If on one hand Netflix appears to have an increase in content addition after 2014, on the other hand, Amazon seems to be more constant but it shows less offer. Both streaming services have a downward trend after 2020, probably due to the Covid-19 pandemic. Both services' peak in global content amount was in 2019. It appears that Netflix has focused more attention on increasing Movie content rather than TV Shows. Movies have increased much more drastically than TV shows. Amazon, on the contrary, has always had more attention to Movies than Tv Shows but in 2021 there is an inversion of the trend. During this year Amazon increased TV shows more than movies.

In [106]:
dfn = data.copy()
dfn['date_added_netflix'].replace('no_data', np.nan, inplace  = True)
dfn = dfn.dropna()
dfn['date_added_netflix'] = pd.to_datetime(dfn['date_added_netflix'])
dfn['year_added'] = dfn['date_added_netflix'].dt.year
dfn['count'] = 1

dfa = data.copy()
dfa['date_added_amazon'].replace('no_data', np.nan, inplace  = True)
dfa = dfa.dropna()
dfa['date_added_amazon'] = pd.to_datetime(dfa['date_added_amazon'])
dfa['year_added'] = dfa['date_added_amazon'].dt.year
dfa['count'] = 1

def year_added(df, file, prov, range):
    fig = go.Figure(layout_yaxis_range=range)
    for i, r in enumerate(df['type'].value_counts().index[::-1]):
        mtv_rel = pd.DataFrame(df[df['type']==r]['year_added'].value_counts().sort_index())
        fig.add_trace(go.Scatter(
            x=mtv_rel.index, y=mtv_rel['year_added'],
            hoverinfo='x+y',
            mode='lines',
            line=dict(width=2, color=two_colors[i]),
            # line_shape='spline', 
            name = r,
            stackgroup='one'
        ))

        for fig_scatter_data in fig.data:
            fig_scatter_data['customdata'] = [fig_scatter_data['name']] * len(fig_scatter_data['x'])

    fig.update_traces( marker=dict(line=dict(width=2)),textfont=dict(family="Arial, san serif",size=16,color="#212529"),hoverlabel=dict(font_size=14,font_family="Arial, san serif", font_color='white'))

    fig.update_traces(hovertemplate=('<br>In %{x}, %{y} %{customdata} were added on ' +prov+'<br><extra></extra>'))

    fig.update_layout(paper_bgcolor="#212529", plot_bgcolor ='rgba(0, 0, 0, 0)', hoverlabel_bordercolor='#212529',margin=margin, yaxis_gridcolor ='rgba(225, 225, 225, .5)', font_color="white", font_family="Arial, san serif", font_size=15, xaxis_showgrid=False)
    
    fig.write_html('visualization/'+ file)
    fig.show()

year_added(dfn, 'seven.html', 'Netflix', [0,1300])
year_added(dfa, 'eight.html', 'Amazon', [0,300])

### Month added 

We plot two line charts one for Netflix and one for Amazon to investigate movement by month in content type and quantity addition.

**Conclusions:** It is interesting to notice that Amazon and Netflix seem to be complementary in month addition. If Netflix is publishing mostly at the beginning and at the end of the year, Amazon concentrates more in the middle - in particular on the month of June.

In [107]:
month_order = ['January',
 'February',
 'March',
 'April',
 'May',
 'June',
 'July',
 'August',
 'September',
 'October',
 'November',
 'December']

dfn = data.copy()
dfn['date_added_netflix'].replace('no_data', np.nan, inplace  = True)
dfn = dfn.dropna()
dfn['date_added_netflix'] = pd.to_datetime(dfn['date_added_netflix'])
dfn['month_added_netflix']=dfn['date_added_netflix'].dt.month
dfn['month_added_netflix']=dfn['date_added_netflix'].dt.month_name()
dfn['month_name_added'] = pd.Categorical(dfn['month_added_netflix'], categories=month_order, ordered=True)

dfa = data.copy()
dfa['date_added_amazon'].replace('no_data', np.nan, inplace  = True)
dfa = dfa.dropna()
dfa['date_added_amazon'] = pd.to_datetime(dfa['date_added_amazon'])
dfa['month_added_amazon']=dfa['date_added_amazon'].dt.month
dfa['month_added_amazon']=dfa['date_added_amazon'].dt.month_name()
dfa['month_name_added'] = pd.Categorical(dfa['month_added_amazon'], categories=month_order, ordered=True)
def month_added(df, file, prov, range):
    data_sub = df.groupby('type')['month_name_added'].value_counts().unstack().fillna(0).loc[['TV Show','Movie']].cumsum(axis=0).T
    fig = go.Figure(layout_yaxis_range=range)
    for i, r in enumerate(df['type'].value_counts().index[::-1]):
        mtv_rel = pd.DataFrame(data_sub[r])
        fig.add_trace(go.Scatter(
            x=mtv_rel.index, y=mtv_rel[r],
            hoverinfo='x+y',
            mode='lines',
            line=dict(width=2, color=two_colors[i]),
            # line_shape='spline', 
            name = r,
            stackgroup='one' # define stack group
        ))

        for fig_scatter_data in fig.data:
            fig_scatter_data['customdata'] = [fig_scatter_data['name']] * len(fig_scatter_data['x'])

    fig.update_traces( marker=dict(line=dict(width=2)),textfont=dict(family="Arial, san serif",size=16,color="#212529"),hoverlabel=dict(font_size=14,font_family="Arial, san serif", font_color='white'))

    fig.update_traces(hovertemplate=('<br>On %{x}, %{y} %{customdata} were added on ' +prov+'<br><extra></extra>'))

    fig.update_layout(paper_bgcolor="#212529", plot_bgcolor ='rgba(0, 0, 0, 0)', hoverlabel_bordercolor='#212529',margin=margin, yaxis_gridcolor ='rgba(225, 225, 225, .5)', font_color="white", font_family="Arial, san serif", font_size=15, xaxis_showgrid=False)
    fig.write_html('visualization/'+ file)
    fig.show()
month_added(dfn, 'nine.html', 'Netflix', [0,700])
month_added(dfa, 'ten.html', 'Amazon', [0, 200])

### Content genre addition over month

We plot two histograms to understand if there is a differentiation in terms of content genres month addition between the two streaming services.

**Conclusion:** both the services have a dominance of Dramatic and Comedy content. Netflix has a huge amount of international movies, with respect to Amazon, which appears to add less of them but more constantly over the months. 

In [None]:
month_order = ['January',
 'February',
 'March',
 'April',
 'May',
 'June',
 'July',
 'August',
 'September',
 'October',
 'November',
 'December']

dfn = data.copy()
dfn['date_added_netflix'].replace('no_data', np.nan, inplace  = True)
dfn = dfn.dropna()
dfn['date_added_netflix'] = pd.to_datetime(dfn['date_added_netflix'])
dfn['month_added_netflix']=dfn['date_added_netflix'].dt.month
dfn['month_added_netflix']=dfn['date_added_netflix'].dt.month_name()
dfn['month_name_added'] = pd.Categorical(dfn['month_added_netflix'], categories=month_order, ordered=True)
dfn['genre'] = dfn['listed_in'].apply(lambda x :  x.replace(' ,',',').replace(', ',',').split(',')) 

dfa = data.copy()
dfa['date_added_amazon'].replace('no_data', np.nan, inplace  = True)
dfa = dfa.dropna()
dfa['date_added_amazon'] = pd.to_datetime(dfa['date_added_amazon'])
dfa['month_added_amazon']=dfa['date_added_amazon'].dt.month
dfa['month_added_amazon']=dfa['date_added_amazon'].dt.month_name()
dfa['month_name_added'] = pd.Categorical(dfa['month_added_amazon'], categories=month_order, ordered=True)
dfa['genre'] = dfa['listed_in'].apply(lambda x :  x.replace(' ,',',').replace(', ',',').split(',')) 
def genre_month(df, file, prov):
    d = pd.DataFrame()
    j = 0
    for i, x in dfn['genre'].iteritems():
        for gen in x:
            d.loc[j, 'genre'] = gen
            mh = dfn._get_value(i, 'month_name_added')
            d.loc[j, 'month'] = mh
            d.loc[j, 'count'] = 1
            j+=1
    d = d[['month', 'genre']].groupby('genre')['month'].value_counts().unstack()
    d = d.stack().reset_index()
    d.columns = ['genre','month','count']

    for x in set(d['month'].values):
        f = d[d['month']==x]
        f['perc'] = (f['count'] / f['count'].sum()) * 100
        for i, x in f['perc'].iteritems():
            d.loc[i, 'percent'] = x

    customdata = np.stack(d['genre'])

    fig = px.histogram(d, x="month", y="percent", color='genre', color_discrete_sequence=colors, category_orders=dict(month=month_order), labels={'genre':'Genre'})

    fig.update_traces( textfont=dict(family="Arial, san serif",size=16,color="#212529"),hoverlabel=dict(font_size=14,font_family="Arial, san serif", font_color='white'))

    fig.update_traces(customdata=['genre'], selector=dict(type='histogram'))

    fig.update_layout(paper_bgcolor="#212529", plot_bgcolor ='rgba(0, 0, 0, 0)', hoverlabel_bordercolor='#212529',margin=margin, yaxis_gridcolor ='rgba(225, 225, 225, .5)', font_color="white", font_family="Arial, san serif", font_size=15, xaxis_showgrid=False)

    fig.write_html('visualization/'+ file)
    fig.show()

genre_month(dfn, 'eleven.html', 'Netflix')
genre_month(dfa, 'twelve.html', 'Amazon')

In [None]:
d = pd.DataFrame()
j = 0
for i, x in dfn['genre'].iteritems():
    for gen in x:
        d.loc[j, 'genre'] = gen
        mh = dfn._get_value(i, 'month_name_added')
        d.loc[j, 'month'] = mh
        d.loc[j, 'count'] = 1
        j+=1
d = d[['month', 'genre']].groupby('genre')['month'].value_counts().unstack()
d = d.stack().reset_index()
d.columns = ['genre','month','count']

for x in set(d['month'].values):
    f = d[d['month']==x]
    f['perc'] = (f['count'] / f['count'].sum()) * 100
    for i, x in f['perc'].iteritems():
        d.loc[i, 'percent'] = x
d

### Genre correlation

In order to understand how genres combine together for type and streaming services, we built a heatmap.

**Conclusion:** It is interesting to notice that for Netflix Independent Movies tend to be Dramas, while Independent contents tend to be Documentary on Amazon. On Netflix, Documentary is associated with Science and Nature. In Amazon, LGBTQ movies tend to be listed in Arts and rarely Horror are also classified as Drama, same for Thriller and Classic movies. Different when we talk about tv shows in Amazon where Thriller is mostly associated with Horror. Amazon tends to relate the Entertainment and Arts contents of both types together.
Another observation is that International Movies are rarely in the Children's genre for Netflix, while Amazon distributes more heterogeneously international content over genres.

In [None]:
df = data.copy()
df['listed_in'].replace('no_data', np.nan, inplace  = True)
df = df.dropna(subset=['listed_in'])
df_tv = df[df["type"] == "TV Show"]
df_movies = df[df["type"] == "Movie"]

df_amazon = df.query("amazon == 1.0")
df_tv_a = df_amazon[df_amazon["type"] == "TV Show"]
df_movies_a = df_amazon[df_amazon["type"] == "Movie"]

df_netflix = df.query("netflix == 1.0")
df_tv_n = df_netflix[df_netflix["type"] == "TV Show"]
df_movies_n = df_netflix[df_netflix["type"] == "Movie"]
def genre_heatmap(df, file, prov):
    df['genre'] = df['listed_in'].apply(lambda x :  x.replace(' ,',',').replace(', ',',').split(',')) 
    test = df['genre']
    mlb = MultiLabelBinarizer()
    res = pd.DataFrame(mlb.fit_transform(test), columns=mlb.classes_, index=test.index)
    corr = res.corr()
    mask = np.zeros_like(corr, dtype = bool) 
    mask[np.triu_indices_from(mask)] = True
    corr1 = corr.mask(mask)
    X = corr1.columns.values
    corr = corr1.values.tolist()
    N = len(corr)
    hovertext = [[f'corr({X[i]}, {X[j]})= {corr[i][j]:.2f}' if i>j else '' for j in range(N)] for i in range(N)]
    title = prov + ' genres' 
    heat = go.Heatmap(z=corr,
                  x=X,
                  y=X,
                  xgap=1, ygap=1,
                  colorscale=colors,
                  colorbar_thickness=20,
                  colorbar_ticklen=3,
                  hovertext =hovertext,
                  hoverinfo='text'
                   )  
    layout = go.Layout(
                title_text=title, title_x=0.5,
                   autosize=False,
                    width=600,
                    height=600,
                   xaxis_showgrid=False,
                   yaxis_showgrid=False,
                   yaxis_autorange='reversed')
    fig=go.Figure(data=[heat], layout=layout)

    fig.update_traces( textfont=dict(family="Arial, san serif",size=16,color="#212529"),hoverlabel=dict(font_size=14,font_family="Arial, san serif", font_color='white'))

    fig.update_layout(paper_bgcolor="#212529", plot_bgcolor = 'rgba(0, 0, 0, 0)', hoverlabel_bordercolor='#212529',xaxis_title="Contents divided by genre on "+ prov, margin=margin_title,
    yaxis_title="Contents amount",
    font=dict(
        family="Arial, san serif",
        size=12,
        color="white"
    ) ) 
    fig.write_html('visualization/'+ file)     
    fig.show() 

genre_heatmap(df, 'therteen.html', 'Full dataset')
genre_heatmap(df_tv, 'fourteen.html', 'Full tv shows')
genre_heatmap(df_movies, 'fifteen.html', 'Full movies')

genre_heatmap(df_amazon, 'sixteen.html', 'Full Amazon')
genre_heatmap(df_tv_a, 'seventeen.html', 'Amazon tv shows')
genre_heatmap(df_movies_a, 'eighteen.html', 'Amazon movies')

genre_heatmap(df_netflix, 'nineteen.html', 'Full Netflix')
genre_heatmap(df_tv_n, 'twenty.html', 'Netflix tv shows')
genre_heatmap(df_movies_n, 'twentyone.html', 'Netflix movies')

### Cinema culture representation

To understand how old are movie present in streaming services and which country's cinema culture is more represented in our dataset, we plot one scatter plot for Amazon and one from Netflix.

**Conclusion:** The average gap between when content is released, and when it is then added varies by country. In Spain and Nigeria, Netflix appears to be dominated by newer movies whereas Hong Kong & India have an older average movie. Amazon on the contrary has newer Chinese movies and older French movies. The span between the year of addition and movie release year is very close. The gap for TV shows seems more regular than for movies. This is likely due to subsequent series being released year by year. Spain seems to have the newest content overall on Netflix while Indian TV shows are the newest on Amazon.

In [110]:
df = data.copy()
df['country'].replace('no_data', np.nan, inplace  = True)
df = df.dropna()
df['first_country'] = df['country'].apply(lambda x: x.split(",")[0])
df['first_country'].replace('United States', 'USA', inplace=True)
df['first_country'].replace('United Kingdom', 'UK',inplace=True)
df['first_country'].replace('South Korea', 'S. Korea',inplace=True)
df['count'] = 1

dfn = df.copy()
dfn['date_added_netflix'].replace('no_data', np.nan, inplace  = True)
dfn = dfn.dropna()
dfn['date_added_netflix'] = pd.to_datetime(dfn['date_added_netflix'])
dfn['year_added'] = dfn['date_added_netflix'].dt.year
df_movies_n = dfn[dfn["type"] == "Movie"]
df_series_n = dfn[dfn["type"] == "TV Show"]

dfa = df.copy()
dfa['date_added_amazon'].replace('no_data', np.nan, inplace  = True)
dfa = dfa.dropna()
dfa['date_added_amazon'] = pd.to_datetime(dfa['date_added_amazon'])
dfa['year_added'] = dfa['date_added_amazon'].dt.year
df_movies_a = dfa[dfa["type"] == "Movie"]
df_series_a = dfa[dfa["type"] == "TV Show"]

def scatter_repr(data, file, prov):
    df = data.groupby('first_country')[['first_country','count']].sum().sort_values(by='count',ascending=False).reset_index()[:10]
    df = df['first_country']
    df_loli = data.loc[data['first_country'].isin(df)]
    loli = df_loli.groupby('first_country')['release_year','year_added'].mean().round()
    ordered_df = loli.sort_values(by='release_year')
    my_range=[0,len(loli.index)+1]
    min = ordered_df['release_year'].to_list()[0]-3
    label_pos = [2022] * len(ordered_df+1)
    title = 'Average '+ prov + ' oldness'
    fig = go.Figure()

    fig.add_trace(go.Scatter(
            x=ordered_df['release_year'].to_list(),
            y=[i for i in range(1, 11)],
            marker=dict(color="#ff547c", size=12),
            mode="markers",
            name="Release year",
    ))

    fig.add_trace(go.Scatter(
            x=ordered_df['year_added'].to_list(),
            y=[i for i in range(1, 11)],
            marker=dict(color="#a0a3de", size=12),
            mode="markers",
            name="Added year",
    ))

    fig.update_layout(title_text=title, title_x=0.5,
        yaxis_range=my_range, xaxis_range=[min,2025],
        xaxis_showgrid=False,
        yaxis_showgrid=False,
        yaxis_showticklabels=False,
        paper_bgcolor="#212529", plot_bgcolor = 'rgba(0, 0, 0, 0)', hoverlabel_bordercolor='#212529',
        margin=margin_title,
        font=dict(
            family="Arial, san serif",
            size=12,
            color="white"
        ) 
    )

    fig.update_traces(marker=dict(line=dict(width=2)),textfont=dict(family="Arial, san serif",size=16,color="#212529"),hoverlabel=dict(font_size=14,font_family="Arial, san serif", font_color='white'))

    y_pos = 1
    for i, x in ordered_df.iterrows():

        fig.add_shape(type='line',
                        x0=ordered_df['release_year'][i],
                        y0=y_pos,
                        x1=ordered_df['year_added'][i],
                        y1=y_pos,
                        line=dict(color='white'),
                        xref='x',
                        yref='y',
                        layer='below'
        )
        y_pos += 1
    
    fig.add_trace(go.Scatter(
        x= label_pos,
        y= [i for i in range(1, 11)],
        text= ordered_df.index.values.tolist(),
        mode="text",
        hoverinfo='skip',
        showlegend=False,
        textposition="top right",
            textfont=dict(
                family="Arial, sans serif",
                size=12,
                color="white"
            )
    ))

    fig.write_html('visualization/'+ file) 
    fig.show()
    
scatter_repr(df_movies_n,'twentytwo.html', 'Netflix movies')
scatter_repr(df_series_n, 'twentythree.html', 'Netflix series')
scatter_repr(df_movies_a, 'twentyfour.html', 'Amazon movies')
scatter_repr(df_series_a, 'twentyfive.html', 'Amazon series')


Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.




Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.




Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.




Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.



# How streaming service influence the cinema industry?

To understand how the streaming services affect the film industry, we need to consider several factors including distribution, whether it is theirs or that they rent content from other distributors; the type of title they distribute and whether it has ever been nominated for or won an award, finally the opinion of critics and the public.

### Distributor presence over time on streaming services
film industry, let's look at the amount of content owned by distributors from 2008, the year in which many streaming services were born, to 2021. So, let's draw a histogram that has the number of titles on the y-axis and the year of release on the x-axis.

**Conclusion:** On streaming platforms, most movies are distributed by the platform itself or by other streaming services such as Hulu. Dominant in the chart is Netflix followed by Amazon. Among the traditional theatrical distributor, only Warner Bros seems to be competitive.

In [111]:
df = data.copy()
df['distributor'].replace('no_data', np.nan, inplace  = True)
df = df.dropna()
df['distr'] = df['distributor'].apply(lambda x :  x.replace(' ,',',').replace(', ',',').split(',')) 

df_new = pd.DataFrame()
j = 0
for index, i in df['distr'].iteritems(): 
    for n in range(len(i)):
        w = i[n]
        df_new.loc[j, 'distributor'] = w
        df_new.loc[j, 'release_year'] = df._get_value(index, 'release_year')
        j += 1
df = df_new[['release_year', 'distributor']].groupby('distributor')['release_year'].value_counts().unstack()
df = df.stack().reset_index()
df.columns = ['distributor','year','val']
df = df[(df.year >= 2008)&(df.year < 2022)]
df = df.sort_values(by='val', ascending=False)[:50]
fig = px.histogram(df, x="year", y="val", color='distributor', color_discrete_sequence=colors)

fig.update_traces( marker=dict(line=dict(width=2)),textfont=dict(family="Arial, san serif",size=16,color="#212529"),hoverlabel=dict(font_size=14,font_family="Arial, san serif", font_color='white'))

fig.update_layout(
    yaxis_title="Contents amount own by distributor",
    legend_title="Distributors",
    margin=margin, yaxis_gridcolor ='rgba(225, 225, 225, .5)',  xaxis_showgrid=False,
        paper_bgcolor="#212529", plot_bgcolor = 'rgba(0, 0, 0, 0)', hoverlabel_bordercolor='#212529',
        font=dict(
            family="Arial, san serif",
            size=15,
            color="white"
        ) 
    )
fig.write_html('visualization/twentysix.html') 
fig.show()

### Distributor and genere correlation 

So let's investigate what genre are the movies that streaming services distribute. To do this we use a heat map.

**Conclusion:** Amazon tends to distribute more Drama alongside the United Artists Corporation. Netflix has a lack of content for children and families, but in general, it has a wider gender distribution.

In [112]:
df = data.copy()
df['listed_in'].replace('no_data', np.nan, inplace  = True)
df['distributor'].replace('no_data', np.nan, inplace  = True)
df = df.query('listed_in.notnull() & distributor.notnull()')
df['distr'] = df['distributor'].apply(lambda x :  x.replace(' ,',',').replace(', ',',').split(',')) 
df['genre'] = df['listed_in'].apply(lambda x :  x.replace(' ,',',').replace(', ',',').split(',')) 
genre = []
distr = []
year = []
for i, x in df['genre'].iteritems():
    for n in range(len(x)):
        k = df._get_value(i, 'distr')
        for num in range(len(k)):
            genre.append(x[n])
            distr.append(k[num])
            
df = pd.DataFrame()
df['distr'] = distr
df['genre'] = genre
df['count'] = 1

g = df.groupby('genre')[['genre','count']].sum().sort_values(by='count',ascending=False).reset_index()[:10]
g = g['genre']

d = df.groupby('distr')[['distr','count']].sum().sort_values(by='count',ascending=False).reset_index()[:10]
d = d['distr']

df_heatmap = df.loc[df['genre'].isin(g) & df['distr'].isin(d)]
df_heatmap = pd.crosstab(df_heatmap['distr'],df_heatmap['genre'],normalize = "index")
for x in df_heatmap.columns.to_list():
    df_heatmap[x] = pd.Series([round(val, 2) for val in df_heatmap[x]], index = df_heatmap.index)
fig = px.imshow(df_heatmap, color_continuous_scale=colors, aspect="auto")

fig.update_traces( textfont=dict(family="Arial, san serif",size=16,color="white"),hoverlabel=dict(font_size=14,font_family="Arial, san serif", font_color='white'))

fig.update_layout(paper_bgcolor="#212529", plot_bgcolor = 'rgba(0, 0, 0, 0)', hoverlabel_bordercolor='#212529', margin=margin,
yaxis_title="Distributors",
xaxis_title="Genre",
font=dict(
    family="Arial, san serif",
    size=12,
    color="white"
) ) 
fig.write_html('visualization/twentyseven.html') 
fig.show()

### Award analisys

Until 2019, content not released in theatres could not participate in many major competitions, including the Oscars. To understand the correlation between distributor and prize we use a heat map. Then, we draw a histogram showing the awards and nominations amount divided by distributors.

**Conclusion:** 
Of the movies on streaming platforms, Netflix and Amazon appear to have won the most generic awards, but they don't score as high in the most popular awards. Content on these platforms is more likely to win a Primetime Emmy among well-known awards. There are a few movies that have won Oscars, most of which have been distributed by Warner Bros and Paramount Pictures. Proportionately, the movies distributed by the platforms do not tend to be nominated for awards, with the most acclaimed awards received coming from TV broadcasters. We will probably have to wait a few years to see any substantial changes.


In [113]:
df = data.copy()
df['special_award_tot'].replace('no_data', np.nan, inplace  = True)
df['special_award_name'].replace('no_data', np.nan, inplace  = True)
df['distributor'].replace('no_data', np.nan, inplace  = True)
df = df.query('special_award_name.notnull() & distributor.notnull()')
df[['special_award_name', 'special_award_tot', 'distributor']]

df['distr'] = df['distributor'].apply(lambda x :  x.replace(' ,',',').replace(', ',',').split(',')) 
df['special'] = df['special_award_name'].apply(lambda x :  x.replace(' ,',',').replace(', ',',').split(',')) 
df['tot'] = df['special_award_tot'].apply(lambda x :  x.replace(' ,',',').replace(', ',',').split(',')) 
spe = []
distr = []
for i, x in df['distr'].iteritems():
    for n in range(len(x)):
        k = df._get_value(i, 'special')
        for num in range(len(k)):
            t = df._get_value(i, 'tot')
            for tot in t:
                for u in range(int(tot)):
                    spe.append(k[num])
                    distr.append(x[n])

df = pd.DataFrame()
df['distr'] = distr
df['award'] = spe
df['count'] = 1

s = df.groupby('award')[['award','count']].sum().sort_values(by='count',ascending=False).reset_index()[:10]
s = s['award']

d = df.groupby('distr')[['distr','count']].sum().sort_values(by='count',ascending=False).reset_index()[:10]
d = d['distr']
d

df_heatmap = df.loc[df['award'].isin(s) & df['distr'].isin(d)]
df_heatmap = pd.crosstab(df_heatmap['distr'],df_heatmap['award'],normalize = "index")
for x in df_heatmap.columns.to_list():
    df_heatmap[x] = pd.Series([round(val, 2) for val in df_heatmap[x]], index = df_heatmap.index)
fig = px.imshow(df_heatmap, color_continuous_scale=colors, aspect="auto")

fig.update_traces( textfont=dict(family="Arial, san serif",size=16,color="white"),hoverlabel=dict(font_size=14,font_family="Arial, san serif", font_color='white'))

fig.update_layout(paper_bgcolor="#212529", plot_bgcolor = 'rgba(0, 0, 0, 0)', hoverlabel_bordercolor='#212529', margin=margin,
yaxis_title="Distributors",
xaxis_title="Awards",
font=dict(
    family="Arial, san serif",
    size=12,
    color="white"
) ) 
fig.write_html('visualization/twentyeight.html') 
fig.show()

In [114]:
df = data.copy()
df = df[['distributor', 'award_nomination_tot', 'award_win_tot' ]]
df.replace('no_data', np.nan, inplace  = True)
df.dropna(inplace=True)

df['distr'] = df['distributor'].apply(lambda x :  x.replace(' ,',',').replace(', ',',').split(',')) 

df_new = pd.DataFrame()
j = 0 
for index, i in df['distr'].iteritems(): 
    for n in range(len(i)):
        w = i[n]
        df_new.loc[j, 'distributor'] = w
        df_new.loc[j, 'count'] = 1
        df_new.loc[j, 'award_nomination_tot'] = float(df._get_value(index, 'award_nomination_tot'))
        df_new.loc[j, 'award_win_tot'] = float(df._get_value(index, 'award_win_tot'))
        j += 1

order = pd.DataFrame(df_new.groupby('distributor')['count', 'award_nomination_tot', 'award_win_tot'].sum().sort_values(by='count', ascending=False).reset_index())[:10]

distr = []
val_name = []
val = []
for i, x in order.iterrows():
    total = x['count']
    if x['award_nomination_tot'] > 0:
        distr.append(x['distributor'])
        val_name.append('Award nomination')
        val.append(int(x['award_nomination_tot']/total))
    if x['award_win_tot'] > 0:
        distr.append(x['distributor'])
        val_name.append('Award win')
        val.append(int(x['award_win_tot']/total))
df = pd.DataFrame()
df['distr'] = distr
df['val_name'] = val_name
df['val'] = val


fig = px.histogram(df, x="distr", y="val", color='val_name', barmode='group', color_discrete_sequence=two_colors)

fig.update_traces( textfont=dict(family="Arial, san serif",size=16,color="#212529"),hoverlabel=dict(font_size=14,font_family="Arial, san serif", font_color='white'))

# fig.update_traces(hovertemplate=('<br>Total %{x}: %{customdata[0]}<br><extra></extra>'))

fig.update_layout(
    margin=margin,
    yaxis_title="Percentage",
    xaxis_title="Distributer",
    legend_title="Awards type", yaxis_gridcolor ='rgba(225, 225, 225, .5)',
        paper_bgcolor="#212529", plot_bgcolor = 'rgba(0, 0, 0, 0)', hoverlabel_bordercolor='#212529',
        font=dict(
            family="Arial, san serif",
            size=15,
            color="white"
        ) 
    )
fig.write_html('visualization/twentynine.html') 
fig.show()


Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.



### Critique v.s Popular opinion
Finally we focus on the audience. Streaming services link how people watch movies. Let's compare 1) iMDb audience figures, great for gathering general public opinon, with 2) Rotten Tomatoes, a platform that takes care of simplifying the opinions of critics. In this way we believe can get an idea, if the films on streaming platforms are more appreciated by the public/critics, and we are able to explore the evolution of the trend over time. Finally, to give more context to our data, we build a scatter chart where the points are the titles positioned on the year of release, sorted by public opinion but forgotten by critical opinion.

**Conlusion:** Older content has higher Rotten Tomatoes than iMDB Reating. It is interesting that in 1960 there was a dramatic drop in Rotten Tomatoes, but public opinion seems almost unchanged. Exploring the scatterplot we note that the result is caused by an outlier: G.I Blues, one of the worst Rotten Tomatoes scored films (has score 0), but at the same time rated with 68 points by the audienc, that may fancy the Pop icon Elvis Preslei, the protagonist of the film. Similarly, the trend of Rotten Tomatoes decreases compared to iMDb reating in 1995 and 2014, years of the internet explosion; and then it normalizes in 2008, the year in which streaming services appeared.

In [115]:
df = data.copy()
df = df[['release_year', 'rottenTomatoes', 'imDbRating']]
df.replace('no_data', np.nan, inplace  = True)
df.dropna(inplace=True)

year = []
val_name = []
val = []
for i, x in df.iterrows():
    if x['rottenTomatoes']:
        year.append(x['release_year'])
        val_name.append('Rotten Tomatoes')
        val.append(float(x['rottenTomatoes']))
    if x['imDbRating']:
        year.append(x['release_year'])
        val_name.append('iMDb Rating')
        val.append(float(x['imDbRating']))
df = pd.DataFrame()
df['year'] = year
df['val_name'] = val_name
df['score'] = val

rotten = df.query('val_name == "Rotten Tomatoes"')
rotten = rotten.groupby('year').mean()
rotten['value'] = 'Rotten Tomatoes'

rotten.groupby('year')['score'].mean()
imdb = df.query('val_name == "iMDb Rating"')
imdb = imdb.groupby('year').mean()
imdb['value'] = 'iMDb Rating'

df = pd.concat([rotten, imdb])
fig = px.line(df, x=df.index, y='score', color='value', color_discrete_sequence=two_colors)

fig.update_traces(marker=dict(line=dict(color='#212529', width=2)), hoverlabel=dict(font_size=14,font_family="Arial, san serif",font_color="white") )

fig.update_layout(
    margin=margin,
    yaxis_title="Average by year",
    legend_title="Critiques", yaxis_gridcolor ='rgba(225, 225, 225, .5)',xaxis_showgrid=False, paper_bgcolor="#212529", plot_bgcolor = 'rgba(0, 0, 0, 0)', hoverlabel_bordercolor='#212529', xaxis_title=None,
    font=dict(
            family="Arial, san serif",
            size=15,
            color="white"
        ) 
    )
fig.write_html('visualization/thirty.html') 
fig.show()

In [116]:
df = data.copy()
df = df[['title','release_year', 'rottenTomatoes', 'imDbRating', 'amazon', 'netflix']]
df.replace('no_data', np.nan, inplace  = True)
df.dropna(inplace=True)
df = df.astype({'rottenTomatoes':'float','imDbRating':'float'})
title = []
release_year = []
rottenTomatoe = []
imdb = []
distr = []
for i, x in df.iterrows():
    if x['amazon'] == 1.0:
        title.append(x['title'])
        release_year.append(x['release_year'])
        rottenTomatoe.append(x['rottenTomatoes'])
        imdb.append(x['imDbRating'])
        distr.append('Amazon')
    if x['netflix'] == 1.0:
        title.append(x['title'])
        release_year.append(x['release_year'])
        rottenTomatoe.append(x['rottenTomatoes'])
        imdb.append(x['imDbRating'])
        distr.append('Netflix')

df = pd.DataFrame()
df['title'] = title
df['release_year'] = release_year
df['rottenTomatoes'] = rottenTomatoe
df['imDbRating'] = imdb
df['distributer'] = distr
fig = px.scatter(df, x='release_year', y='imDbRating', color='rottenTomatoes', hover_name="title", hover_data=['distributer'], color_continuous_scale=colors)
fig.update_traces( marker=dict(line=dict(width=2), size=12),textfont=dict(family="Arial, san serif",size=16,color="#212529"),hoverlabel=dict(font_size=14,font_family="Arial, san serif", font_color='white'))

fig.update_layout(
    margin=margin, xaxis_showgrid=False,
    yaxis_title="iMDb Reating", yaxis_gridcolor ='rgba(225, 225, 225, .5)',hoverlabel_bordercolor='#212529',
        paper_bgcolor="#212529", plot_bgcolor = 'rgba(0, 0, 0, 0)',
        xaxis_title=None,
        font=dict(
            family="Arial, san serif",
            size=15,
            color="white"
        ) 
    )
fig.layout.coloraxis.colorbar.title = "Rotten Tomatoes"
fig.write_html('visualization/thirtyone.html') 
fig.show()