# Abstract 

Digitization, as one of the key outcomes of technological growth, has led to profound changes in entertainment, and therefore in the world of cinema, as well as in many other areas. As a result, the distribution and broadcasting strategy that Netflix brought to the market turned into an amazing success story in a very short period of time.

Netflix's strategy is based on the idea that consumers can access the platform's entire content catalog for a monthly price. In addition, Netflix only broadcasts its films on the web, with no theatrical or limited distribution. This approach, which is vastly different from the classic idea of the Hollywood studio system, has led to significant advances for audiences, directors and studios in various ways. We can therefore confidently say that streaming services such as Netflix are influencing the film industry in terms of how we access movies, what material we consume and how movies are made.

Every day, platforms such as Netflix and Amazon Prime gain more users thanks to recommendation algorithms and to prices that are competitive with movie theaters. These algorithms, which draw on the data of millions of users, play an important role in the spread of genres such as romantic comedies and thrillers. This places Internet platforms in a dominant position with respect to film content. In the future, that authority could be key in determining what constitutes a "well-made film".

The impact of Internet streaming services on filmmakers has been one of the most important transformations in the world of cinema in recent years. The promise of a more open environment than the one offered by the large studios has attracted numerous directors to the platforms, with huge ramifications for the world of cinema. Furthermore, the fact that these services have less stringent standards than cinemas makes them attractive to producers. Another important aspect concerns independent directors. Since the 1980s, when Hollywood became the hub of cinema and blockbuster films began to dominate theaters, it has been difficult for independent directors to reach large audiences. Cinemas often prefer high-budget movies because they can make a much larger profit from them. As a result, independent films have so far had few opportunities outside of film festivals. However, now that Internet streaming services play a major role in the world of cinema, independent filmmakers have the opportunity to reach a wider audience.

The purpose of this notebook is to investigate, through data, how streaming platforms have changed film production. Is the world of production really fairer? How much power does the user of these platforms have?

# Data gathering
We start from two existing datasets:
* [Netflix](https://www.kaggle.com/datasets/shivamb/netflix-shows): One of the most popular media and video streaming platforms. It has over 8,000 movies and TV shows available and, as of mid-2021, over 200M subscribers globally. This tabular dataset lists all the movies and TV shows available on Netflix, along with details such as cast, directors, ratings, release year and duration.

* [Amazon Prime](https://www.kaggle.com/datasets/shivamb/amazon-prime-movies-and-tv-shows): Another of the most popular media and video streaming platforms. It has close to 10,000 movies and TV shows available and, as of mid-2021, over 200M subscribers globally. This tabular dataset lists all the movies and TV shows available on Amazon Prime, along with details such as cast, directors, ratings, release year and duration.



In [1]:
import pandas as pd # data processing
import pandas_profiling as pp
import numpy as np # linear algebra

In [2]:
df_netflix = pd.read_csv('originalDataset/netflix_titles.csv')
df_amazon = pd.read_csv('originalDataset/amazon_prime_titles.csv')

In [3]:
print(len(df_netflix))
print(len(df_amazon))

8807
9668


# Preparing data

The objective is to concatenate the Amazon and Netflix datasets while keeping track of which platform each title comes from.
We add two columns, netflix and amazon, each with value 1 or 0 representing the presence or absence of the title on that platform. To keep the date-added information, we rename the date_added column of each dataset so that it identifies the corresponding streaming service.

In [4]:
df_netflix.drop(columns=df_netflix.columns[0], inplace=True)  # drop the first column (the show_id identifier)
df_netflix['netflix'] = 1
df_netflix['amazon'] = 0
df_netflix.rename(columns = {'date_added':'date_added_netflix'}, inplace = True)

df_netflix.head(2)

Unnamed: 0,type,title,director,cast,country,date_added_netflix,release_year,rating,duration,listed_in,description,netflix,amazon
0,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",1,0
1,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",1,0


In [5]:
df_amazon.drop(columns=df_amazon.columns[0], inplace=True)  # drop the first column (the show_id identifier)
df_amazon['amazon'] = 1
df_amazon['netflix'] = 0
df_amazon.rename(columns = {'date_added':'date_added_amazon'}, inplace = True)

df_amazon.head(2)

Unnamed: 0,type,title,director,cast,country,date_added_amazon,release_year,rating,duration,listed_in,description,amazon,netflix
0,Movie,The Grand Seduction,Don McKellar,"Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",Canada,"March 30, 2021",2014,,113 min,"Comedy, Drama",A small fishing village must procure a local d...,1,0
1,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,"March 30, 2021",2018,13+,110 min,"Drama, International",A Metro Family decides to fight a Cyber Crimin...,1,0


In [10]:

dataset = pd.concat([df_netflix, df_amazon],axis=0, join="outer", sort=False)

dataset.head(3)


Unnamed: 0,type,title,director,cast,country,date_added_netflix,release_year,rating,duration,listed_in,description,netflix,amazon,date_added_amazon
0,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",1,0,
1,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",1,0,
2,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,1,0,


The concatenated dataset contains duplicates. Shows that are available on both Netflix and Amazon Prime are recorded twice: once with only the Netflix information and once with only the Amazon information. We therefore extract from each original dataset the triple (release year, type, title) that identifies a show, where the type is either movie or TV show. We then find the triples that the two datasets have in common and store them in a list named 'title'.

In [7]:
netflix = []
amazon = []

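def union(df, new):
    # collect one (release_year, type, title) triple per row
    for i, x in df['title'].items():  # Series.iteritems() was removed in pandas 2.0
        year = df['release_year'][i]
        kind = df['type'][i]  # renamed so that the built-in type() is not shadowed
        movie = x
        new.append((year, kind, movie))
    return new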

union(df_netflix, netflix)
union(df_amazon, amazon)

print(len(netflix), len(amazon))

8807 9668


In [8]:
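amazon_triples = set(amazon)  # set membership avoids rescanning the Amazon list for every title
title = []
for (y,t,m) in netflix:
    if (y,t,m) in amazon_triples:
        title.append((y,t,m))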

Finally, we iterate over the concatenated dataset and, for the shared titles only, merge the Amazon and Netflix information about availability and date added.
We take the description, country and cast information from the Netflix dataset because it is the more complete of the two. At the end of this process we therefore drop the duplicates identified by title, director, release year and type, keeping the first entry, which is the Netflix one.
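
One way to check which dataset is better filled is to compare the share of missing values per column. The sketch below is only illustrative: the helper name `missing_share` and the toy frames are assumptions, and the real comparison would be run on df_netflix and df_amazon.

```python
import pandas as pd

def missing_share(df: pd.DataFrame, columns) -> pd.Series:
    """Fraction of missing values in each of the given columns."""
    return df[columns].isna().mean()

# Toy frames standing in for df_netflix and df_amazon (values are made up).
toy_netflix = pd.DataFrame({"country": ["India", None], "cast": ["A", "B"]})
toy_amazon = pd.DataFrame({"country": [None, None], "cast": [None, "C"]})

comparison = pd.DataFrame({
    "netflix": missing_share(toy_netflix, ["country", "cast"]),
    "amazon": missing_share(toy_amazon, ["country", "cast"]),
})
print(comparison)
```

A lower share in the Netflix column for description, country and cast would support taking those fields from the Netflix rows.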

In [9]:
# the concatenated frame repeats index labels, so reset them: .loc[i] must address a single row
df = dataset.reset_index(drop=True)
df.replace(np.nan, 'null', inplace=True)
shared = set(title)
for i, r in df.iterrows():
    if (r['release_year'], r['type'], r['title']) in shared:
        df.loc[i, 'netflix'] = 1
        df.loc[i, 'amazon'] = 1
        # all rows describing the same show (a boolean mask also works for titles containing quotes)
        q = df[(df['title'] == r['title']) & (df['type'] == r['type']) & (df['release_year'] == r['release_year'])]

        # copy the known date of addition of each platform onto the current row
        for col in ('date_added_netflix', 'date_added_amazon'):
            known = q.loc[q[col] != 'null', col]
            if not known.empty:
                df.loc[i, col] = known.iloc[0]
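
The same combination could probably be expressed without row iteration, using an outer `pandas.merge` on the (title, type, release_year) key. What follows is a minimal sketch on made-up rows, not the procedure used in this notebook. The toy frames and the `_amazon` suffix are assumptions, and the sketch presumes that each key appears at most once per platform.

```python
import pandas as pd

KEY = ["title", "type", "release_year"]

# Made-up rows standing in for the two renamed datasets.
toy_netflix = pd.DataFrame({
    "title": ["A", "B"], "type": ["Movie", "Movie"], "release_year": [2020, 2019],
    "country": ["Italy", "France"],
    "date_added_netflix": ["May 1, 2021", "June 2, 2021"],
})
toy_amazon = pd.DataFrame({
    "title": ["B", "C"], "type": ["Movie", "TV Show"], "release_year": [2019, 2018],
    "country": ["France", None],
    "date_added_amazon": ["July 3, 2021", None],
})

# Outer merge: one row per key; the _merge indicator records where the key was found.
merged = toy_netflix.merge(toy_amazon, on=KEY, how="outer",
                           suffixes=("", "_amazon"), indicator=True)

merged["netflix"] = (merged["_merge"] != "right_only").astype(int)
merged["amazon"] = (merged["_merge"] != "left_only").astype(int)

# Prefer the Netflix value and fall back to the Amazon one when it is missing.
merged["country"] = merged["country"].fillna(merged["country_amazon"])
merged = merged.drop(columns=["country_amazon", "_merge"])

print(merged[KEY + ["netflix", "amazon"]])
```

With this layout a shared title ends up on a single row that carries both flags and both dates, so no later de-duplication step is needed for it.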

In [10]:
df.replace('null', np.nan, inplace=True)
# placeholders for missing values, assigned back because a chained inplace replace may act on a copy
df['date_added_netflix'] = df['date_added_netflix'].fillna(1000)
df['date_added_amazon'] = df['date_added_amazon'].fillna(1000)
df['country'] = df['country'].fillna('No Data')
df = df.drop_duplicates(subset=['title','director', 'release_year', 'type'], keep='first')
df = df.dropna()
df = df.reset_index(drop=True)
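
For any later analysis over time, the date columns mix text such as "September 24, 2021" with the numeric placeholder 1000. A possible way to turn them into real dates is sketched below; the helper name `parse_date_added` is an assumption, and the date format is inferred from the rows shown in this notebook.

```python
import pandas as pd

def parse_date_added(column: pd.Series, placeholder=1000) -> pd.Series:
    """Convert 'Month day, year' strings to timestamps; the placeholder becomes NaT."""
    text = column.where(column != placeholder)  # placeholder -> NaN
    text = text.str.strip()                     # tolerate stray spaces around the date
    return pd.to_datetime(text, format="%B %d, %Y", errors="coerce")

# Toy column mimicking date_added_netflix after the cleaning step.
toy_dates = pd.Series(["September 25, 2021", 1000, " March 30, 2021"])
parsed = parse_date_added(toy_dates)
print(parsed)
```

Values that do not match the expected format are coerced to NaT rather than raising, so they can be counted and inspected afterwards.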

In [11]:
df

Unnamed: 0,type,title,director,cast,country,date_added_netflix,release_year,rating,duration,listed_in,description,netflix,amazon,date_added_amazon
0,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",No Data,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,1,0,1000
1,TV Show,Midnight Mass,Mike Flanagan,"Kate Siegel, Zach Gilford, Hamish Linklater, H...",No Data,"September 24, 2021",2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries",The arrival of a charismatic young priest brin...,1,0,1000
2,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",No Data,"September 24, 2021",2021,PG,91 min,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...,1,0,1000
3,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s...",1,0,1000
4,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",United Kingdom,"September 24, 2021",2021,TV-14,9 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...,1,0,1000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12209,Movie,The Man in the Hat,"John-Paul Davidson, Stephen Warbeck","Ciaran Hinds, Stephen Dillane, Maïwenn",No Data,1000,2021,13+,96 min,Comedy,The Man in the Hat journeys through France in ...,0,1,1000
12210,Movie,River,Emily Skye,"Mary Cameron Rogers, Alexandra Rose, Rob Marsh...",No Data,1000,2021,16+,93 min,"Drama, Science Fiction, Suspense","River is a grounded Sci-Fi mystery Thriller, t...",0,1,1000
12211,Movie,Pride Of The Bowery,Joseph H. Lewis,"Leo Gorcey, Bobby Jordan",No Data,1000,1940,7+,60 min,Comedy,New York City street principles get an East Si...,0,1,1000
12212,Movie,Outpost,Steve Barker,"Ray Stevenson, Julian Wadham, Richard Brake, M...",No Data,1000,2008,R,90 min,Action,"In war-torn Eastern Europe, a world-weary grou...",0,1,1000


# Data enrichment

In [12]:
import pprint # pretty-print JSON
import requests # make HTTP requests
from qwikidata.sparql  import return_sparql_query_results # return SPARQL results
from SPARQLWrapper import SPARQLWrapper, JSON # used to inspect the structure of the responses
import ssl
from http.client import IncompleteRead
import time
import urllib.error

In [14]:
# keep only the movies and renumber the rows (no inplace call on a filtered view)
movie_title = df.query("type == 'Movie'").reset_index(drop=True)
movie_title.head(2)

Unnamed: 0,type,title,director,cast,country,date_added_netflix,release_year,rating,duration,listed_in,description,netflix,amazon,date_added_amazon
0,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",No Data,"September 24, 2021",2021,PG,91 min,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...,1,0,1000
1,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s...",1,0,1000


In [16]:
film = []
director = []
gender = []
distributor = []
imdbID = []
rottenscore = []
not_found = []

In [None]:

def wikidata_reconciliation(query):

    
    # get the endpoint API
    wikidata_endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"
        

    for x in query:
        
        try:

            print(x)
            my_SPARQL_query = """
            SELECT ?film ?film_label ?director ?director_label ?dir_gen ?dir_gen_label 
            WHERE
            {
            ?film wdt:P31 wd:Q11424 .
            ?film rdfs:label """+'"'+ x +'"' +"""@en .
            ?film rdfs:label ?film_label .
            FILTER(lang(?film_label) = 'en')
            OPTIONAL {?film wdt:P57 ?director . 
            ?director rdfs:label ?director_label .    
            FILTER(lang(?director_label) = 'en')
            OPTIONAL {?director wdt:P21 ?dir_gen . 
            ?dir_gen rdfs:label ?dir_gen_label .
            FILTER(lang(?dir_gen_label) = 'en')}}
            
            }
            """
            # set the endpoint 
            sparql_wd = SPARQLWrapper(wikidata_endpoint)
            # set the query
            sparql_wd.setQuery(my_SPARQL_query)
            # set the returned format
            sparql_wd.setReturnFormat(JSON)
            # get the results
            
            results = sparql_wd.query().convert()

            if results['results']['bindings'] == []:
                not_found.append(""+x+"")
                
            else:
                film.append(results['results']['bindings'][0]['film_label']['value'])
                if "director_label" in results['results']['bindings'][0]:
                    director.append(results['results']['bindings'][0]['director_label']['value'])
                else:
                    director.append("no_data")
                if "dir_gen_label" in results['results']['bindings'][0]:
                    gender.append(results['results']['bindings'][0]['dir_gen_label']['value'])
                else:
                    gender.append("no_data")
                


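        except urllib.error.HTTPError as e:
            # rate limited: wait as long as the server asks, then resume from the failed title
            time.sleep((int(e.headers["retry-after"])) + 1)
            error_title = query.index(x)
            wikidata_reconciliation(query[error_title:])
            return  # the recursive call has already processed the remaining titles
            

# pass the titles as a list: iterating over a DataFrame yields its column names
wikidata_reconciliation(movie_title['title'].tolist()[2637:])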
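
The query above pastes each title between double quotes, so a title that itself contains a double quote or a backslash would produce an invalid query. A small helper along the following lines could escape the title first. The function name `sparql_string_literal` is an assumption, and the escapes follow the string-literal rules of the SPARQL grammar.

```python
def sparql_string_literal(text: str, lang: str = "en") -> str:
    """Quote text as a SPARQL string literal with a language tag."""
    escaped = (
        text.replace("\\", "\\\\")  # backslash first, so later escapes are not doubled
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
    )
    return f'"{escaped}"@{lang}'

# The triple pattern used in the query, built with the escaped label.
example_title = 'The "Best" Film'  # made-up title containing quotes
pattern = "?film rdfs:label " + sparql_string_literal(example_title) + " ."
print(pattern)
```

Inside the loop, the concatenation `'"' + x + '"' + "@en"` could then be replaced by `sparql_string_literal(x)`.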

In [None]:


import urllib.error

def wikidata_reconciliation1(query):

    
    # get the endpoint API
    wikidata_endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"
        

    for x in query:

        try:

            
            my_SPARQL_query = """
            SELECT ?film ?film_label ?distributor ?distributor_label ?imdbID ?rottenscore
            WHERE
            {
            ?film wdt:P31 wd:Q11424 .
            ?film rdfs:label """+'"'+ x +'"' +"""@en .
            ?film rdfs:label ?film_label .
            FILTER(lang(?film_label) = 'en')
            OPTIONAL {?film wdt:P750 ?distributor . 
            ?distributor rdfs:label ?distributor_label .
            FILTER(lang(?distributor_label) = 'en')}
            OPTIONAL {?film wdt:P345 ?imdbID.}
            OPTIONAL {?film wdt:P444 ?rottenscore.}
            }
            """
            # set the endpoint 
            sparql_wd = SPARQLWrapper(wikidata_endpoint)
            # set the query
            sparql_wd.setQuery(my_SPARQL_query)
            # set the returned format
            sparql_wd.setReturnFormat(JSON)
            # get the results
            
            results = sparql_wd.query().convert()

            if results['results']['bindings'] == []:
                print(""+x+" not found")
            else:
                film.append(results['results']['bindings'][0]['film_label']['value'])
                if "distributor_label" in results['results']['bindings'][0]:
                    distributor.append(results['results']['bindings'][0]['distributor_label']['value'])
                else:
                    distributor.append("no_data")
                if "imdbID" in results['results']['bindings'][0]:
                    imdbID.append(results['results']['bindings'][0]['imdbID']['value'])
                else:
                    imdbID.append("no_data")
                if "rottenscore" in results['results']['bindings'][0]:
                    rottenscore.append(results['results']['bindings'][0]['rottenscore']['value'])
                else:
                    rottenscore.append("no_data")



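        except urllib.error.HTTPError as e:

            print(e.headers["retry-after"])
            time.sleep((int(e.headers["retry-after"])) + 1)
            error_title = query.index(x)
            wikidata_reconciliation1(query[error_title:])
            return  # the recursive call has already processed the remaining titles
        except IncompleteRead:
            # the response was cut short: skip this title and move on
            continue
            

# pass the titles as a list: iterating over a DataFrame yields its column names
wikidata_reconciliation1(movie_title['title'].tolist()[8089:9000])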
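
Both functions recover from rate limiting by calling themselves on the rest of the list. An alternative is a small retry wrapper around the single request, sketched below under stated assumptions: the name `call_with_retry`, the attempt limit and the default wait are all hypothetical choices, and the `sleep` parameter exists only so the wait can be replaced in tests.

```python
import time
import urllib.error

def call_with_retry(fetch, max_attempts=5, default_wait=5, sleep=time.sleep):
    """Call fetch(); after an HTTPError wait (honouring Retry-After if given) and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except urllib.error.HTTPError as error:
            if attempt == max_attempts:
                raise  # give up and let the caller see the error
            header = error.headers.get("Retry-After") if error.headers else None
            wait = int(header) if header and header.isdigit() else default_wait
            sleep(wait + 1)

# Intended use inside the loop over titles (sparql_wd as configured above):
# results = call_with_retry(lambda: sparql_wd.query().convert())
```

Because the retry happens per request, the loop position is never lost and no title is processed twice.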

In [None]:
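# film is filled by both reconciliation functions: reset it (film = []) before each run,
# otherwise the lists differ in length and the DataFrame cannot be built
director_data = {"title": film, "director": director,
        "director gender": gender}

lists_df = pd.DataFrame(director_data)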
lists_df.to_csv('savelist.csv')
lists_df.head(5)

In [None]:
distributor_data = {"title": film, "distributor": distributor, "id": imdbID, "rating score": rottenscore}

lists1_df = pd.DataFrame(distributor_data)
lists1_df.to_csv('6000-9000.csv')
lists1_df.head(5)
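
Keeping one parallel list per column works only while every list receives exactly one value per film. A possible alternative is to build one dictionary per film and collect those, as in the sketch below. The helper name `first_binding_record` is an assumption, and the response is hand-written in the SPARQL JSON results layout that the functions above index into.

```python
import pandas as pd

def first_binding_record(results: dict, fields, default="no_data"):
    """Return one row (a dict) built from the first binding, or None if nothing matched."""
    bindings = results["results"]["bindings"]
    if not bindings:
        return None
    first = bindings[0]
    return {field: first[field]["value"] if field in first else default
            for field in fields}

# Hand-written response; the gender binding is deliberately missing.
fake_results = {"results": {"bindings": [
    {"film_label": {"value": "Sankofa"},
     "director_label": {"value": "Haile Gerima"}}
]}}

records = []
row = first_binding_record(fake_results, ["film_label", "director_label", "dir_gen_label"])
if row is not None:
    records.append(row)

print(pd.DataFrame(records))
```

Each record carries all of its fields, so the resulting columns always have the same length and a missing value is visible as "no_data" on the right row.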

# Data analysis

# Data visualization

In [None]:
#