# Abstract 

Digitization, as one of the key outcomes of technological growth, has led to profound changes in entertainment, and therefore in the world of cinema, as well as in many other areas. As a result, the distribution and broadcasting strategy that Netflix brought to the market turned into an amazing success story in a very short period of time.

Netflix's strategy is based on the idea that consumers can access the platform's entire content catalog for a monthly price. In addition, Netflix only broadcasts its films on the web, with no theatrical or limited distribution. The approach, which is vastly different from the classic idea of ​​the Hollywood studio system, has led to significant advances for audiences, directors and studios in various ways. In this way, we can confidently say that streaming services, such as Netflix, are influencing the film industry in terms of how we access movies, what material we consume and how movies are made.

Every day, platforms such as Netflix and Amazon Prime gain more users thanks to competitive prices compared to movie theaters, and recommendation algorithms. The latter play an important role in the dissemination of romantic comedies and thrillers, obtaining some success thanks to the data of millions of users who use them. This dominant position places Internet platforms in a strong position in terms of film content. In the future, that authority could be key in determining what constitutes a "well-made film".

The impact of Internet streaming services on filmmakers has been one of the most important transformations in the world of cinema in recent years. The promise of a more open environment for filmmakers than other large studios has attracted numerous directors to the platforms, with huge ramifications in the world of cinema. Furthermore, the fact that these services have less stringent standards than cinemas makes them attractive to producers. Another important aspect concerns independent directors. Since the 1980s, when Hollywood became the hub of cinema and blockbuster films began to dominate theaters, it has been difficult for independent directors to reach large audiences. Cinemas often prefer high-budget movies as they can make a much larger profit from them. As a result, independent films have few opportunities outside of film festivals to date. However, with internet streaming services becoming a major role in the world of cinema, independent filmmakers now have the opportunity to reach a wider audience.

The purpose of this notebook is to investigate, through data, how streaming platforms have changed film production. is the world of production really fairer? How much power does the user of these platforms have?

# Data gatering
We start from two existing datasets:
* [Netflix](https://www.kaggle.com/datasets/shivamb/netflix-shows): One of the most popular media and video streaming platforms. They have over 8000 movies or tv shows available on their platform, as of mid-2021, they have over 200M Subscribers globally. This tabular dataset consists of listings of all the movies and tv shows available on Netflix, along with details such as - cast, directors, ratings, release year, duration, etc.

* [Amazon prime](https://www.kaggle.com/datasets/shivamb/amazon-prime-movies-and-tv-shows): Another one of the most popular media and video streaming platforms. They have close to 10000 movies or tv shows available on their platform, as of mid-2021, they have over 200M Subscribers globally. This tabular dataset consists of listings of all the movies and tv shows available on Amazon Prime, along with details such as - cast, directors, ratings, release year, duration, etc.*



In [91]:
import pandas as pd # data processing
import pandas_profiling as pp
import numpy as np # linear algebra

In [92]:
df_netflix = pd.read_csv('originalDataset/netflix_titles.csv')
df_amazon = pd.read_csv('originalDataset/amazon_prime_titles.csv')

In [93]:
print(len(df_netflix))
print(len(df_amazon))

8807
9668


# Prepering data

Objective is the one of concatenate amazon and netflix databases mantaing storage information. W
We add two colums: netflix and amazon both with value 1 or 0 representing the absence or presence of the title on the platform. To keep the date added information we rename columns to distinguish the relative streaming service.

In [94]:
df_netflix.drop(columns = df_netflix.columns[0], axis = 1, inplace= True)
df_netflix['netflix'] = 1
df_netflix['amazon'] = 0
df_netflix.rename(columns = {'date_added':'date_added_netflix'}, inplace = True)

df_netflix.head(2)

Unnamed: 0,type,title,director,cast,country,date_added_netflix,release_year,rating,duration,listed_in,description,netflix,amazon
0,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",1,0
1,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",1,0


In [95]:
df_amazon.drop(columns = df_amazon.columns[0], axis = 1, inplace= True)
df_amazon['amazon'] = 1
df_amazon['netflix'] = 0
df_amazon.rename(columns = {'date_added':'date_added_amazon'}, inplace = True)

df_amazon.head(2)

Unnamed: 0,type,title,director,cast,country,date_added_amazon,release_year,rating,duration,listed_in,description,amazon,netflix
0,Movie,The Grand Seduction,Don McKellar,"Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",Canada,"March 30, 2021",2014,,113 min,"Comedy, Drama",A small fishing village must procure a local d...,1,0
1,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,"March 30, 2021",2018,13+,110 min,"Drama, International",A Metro Family decides to fight a Cyber Crimin...,1,0


In [96]:

dataset = pd.concat([df_netflix, df_amazon],axis=0, join="outer", sort=False)

dataset.head(3)


Unnamed: 0,type,title,director,cast,country,date_added_netflix,release_year,rating,duration,listed_in,description,netflix,amazon,date_added_amazon
0,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",1,0,
1,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",1,0,
2,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,1,0,


The concatenated dataset present some errors. Shows that are both present on Netflix and Amazon Prime are recoded twice in our dataset: first time having only netflix information, the second time only the amazon ones. We decide to extract unic triples from the original datasets containing year, type (movie or tv Series) and name. Then we check the triples that the two dataset has in common and we store them in a list named 'title'.

In [97]:
netflix = []
amazon = []

def union(df, new):
    for  i, x in df['title'].iteritems():
        year = df['release_year'][i]
        type = df['type'][i]
        movie = x
        new.append((year, type, movie))
    return new

union(df_netflix, netflix)
union(df_amazon, amazon)

print(len(netflix), len(amazon))

8807 9668


In [98]:
title = []
for (y,t,m) in netflix:
    if (y,t,m) in amazon:
        title.append((y,t,m))

Finally, we iterate over the concatenated dataset querying only the shared titles merging amazon and Netflix information about possession and date of addition.
We decide to take description, country and cast information from the Netflix dataset because it was the best filled of the two. So, at the end of this process, we drop duplicates filtered by title, director, release year and type; keeping the first entries. Then we fill null value with 'No Data'.

In [99]:
df = dataset.copy()
df.replace(np.nan, 'null', inplace=True)
for i, r in df.iterrows(): 
    if (r['release_year'],r['type'],r['title']) in title:
        df.loc[i,'netflix'] = 1
        df.loc[i, 'amazon'] = 1
        q = df.query('title=="'+r['title']+'" & type=="'+r['type']+'" & release_year== '+ str(r['release_year']) +'')

        for i, x in q['date_added_netflix'].iteritems():
            if x != 'null':
                df.loc[i, 'date_added_netflix'] = x

        for i, x in q['date_added_amazon'].iteritems():
            if x != 'null':
                df.loc[i, 'date_added_amazon'] = x

In [100]:
df.replace( 'null', np.nan, inplace=True)
for i in df.columns:
    null_rate = df[i].isna().sum() / len(df) * 100 
    if null_rate > 0 :
        print("{} null rate: {}%".format(i,round(null_rate,2)))

director null rate: 25.53%
cast null rate: 11.14%
country null rate: 53.19%
date_added_netflix null rate: 51.4%
rating null rate: 1.85%
duration null rate: 0.02%
date_added_amazon null rate: 99.15%


In [101]:

df['date_added_netflix'].replace(np.nan, 1000,inplace  = True)
df['date_added_amazon'].replace(np.nan, 1000,inplace  = True)
df['country'].replace(np.nan, 'No Data',inplace  = True)
df['director'].replace(np.nan, 'No Data',inplace  = True)
df['cast'].replace(np.nan, 'No Data',inplace  = True)
df['rating'].replace(np.nan, 'No Data',inplace  = True)
df = df.drop_duplicates(subset=['title','director', 'release_year', 'type'], keep='first')
df = df.dropna()
df = df.reset_index(drop=True)

In [102]:
df['title'] = df['title'].replace({'"':''}, regex=True)
df['title'] = df['title'].replace({'\n':' '}, regex=True)

In [103]:
df.to_csv('data.csv')

# Data enrichment 

Our data will be enriched using two sources:

* [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page): wikidata is the a free and open knowledge graph containing linked information used by the famouse online encyclopedia. 

* [iMdB](https://www.imdb.com/): Internet Movie Database is the world's most popular and authoritative source for movie, TV shows and celebrity content, where you can find ratings and reviews by creteque and public.

In [104]:
import pprint #indet json 
import requests #make http requests
from qwikidata.sparql  import return_sparql_query_results #return sparql results
from SPARQLWrapper import SPARQLWrapper, JSON #questo serve a vedere la struttura delle risposte
import ssl
from http.client import IncompleteRead
import time
import urllib.error
from xml.etree.ElementPath import xpath_tokenizer_re

In order to query information, we split our dataset in Movies and TV Shows. Through wikidate we retrive missing information form our starting dataset, such like countries and directors. In addition, we add interesting information for us such as the gender of the director and the distributor. Finally, we also retrive the iMdb id of the movie for future query.

In [105]:
df = pd.read_csv('data.csv')

Unnamed: 0.1,Unnamed: 0,type,title,director,cast,country,date_added_netflix,release_year,rating,duration,listed_in,description,netflix,amazon,date_added_amazon
8416,8416,Movie,The Memphis Belle: A Story of a Flying Fortress,William Wyler,No Data,United States,"March 31, 2017",1944,TV-PG,40 min,"Classic Movies, Documentaries",This documentary centers on the crew of the B-...,1,0,1000


In [75]:
for i, x in df['title'].iteritems():
    if x == 'The Memphis Belle: A Story of a Flying Fortress':
        print(x)

In [41]:
movie_title = df.query("type == 'Movie'")
movie_title.reset_index(level=None, drop=True, inplace=True, col_level=0, col_fill='')
movie_title.head(2)

Unnamed: 0,type,title,director,cast,country,date_added_netflix,release_year,rating,duration,listed_in,description,netflix,amazon,date_added_amazon
0,Movie,Dick Johnson Is Dead,Kirsten Johnson,No Data,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",1,0,1000
1,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",No Data,"September 24, 2021",2021,PG,91 min,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...,1,0,1000


In [23]:
imdbID = []
not_found = []

def wikidata_reconciliation(query):
        wikidata_endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"

        for i, x in query.iterrows():
                title = x['title']
                year =  x['release_year']

                try:

                        my_SPARQL_query = """SELECT ?imdbID
                        WHERE
                        {
                        ?film wdt:P31 wd:Q11424 .
                        ?film rdfs:label """+'"'+ title +'"' +"""@en .
                        ?film wdt:P577 ?time .
                        FILTER ( YEAR(?time) =  """+ str(year) +""" ).
                        ?film wdt:P345 ?imdbID.
                        }"""

                        print(title)

                        sparql_wd = SPARQLWrapper(wikidata_endpoint)
                        # set the query
                        sparql_wd.setQuery(my_SPARQL_query)
                        # set the returned format
                        sparql_wd.setReturnFormat(JSON)

                        results = sparql_wd.query().convert()

                        if results['results']['bindings'] == []:
                                not_found.append(title)
                                
                        else:

                                imdbID.append((results['results']['bindings'][0]['imdbID']['value'], title, year))

                except urllib.error.HTTPError as e:
                        time.sleep((int(e.headers["retry-after"])) + 1)
                        wikidata_reconciliation(query[i:])
    
wikidata_reconciliation(movie_title[:7000])

Dick Johnson Is Dead
My Little Pony: A New Generation
Sankofa
The Starling
Je Suis Karl
Confessions of an Invisible Girl
Europe's Most Dangerous Man: Otto Skorzeny in Spain
Intrusion
Avvai Shanmughi
Go! Go! Cory Carson: Chrissy Takes the Wheel
Jeans
Minsara Kanavu
Grown Ups
Dark Skies
Paranoia
Ankahi Kahaniya
The Father Who Moves Mountains
The Stronghold
Birth of the Dragon
Jaws
Jaws 2
Jaws 2
Jaws 3
Jaws: The Revenge
My Heroes Were Cowboys
Safe House
Training Day
InuYasha the Movie 2: The Castle Beyond the Looking Glass
InuYasha the Movie 3: Swords of an Honorable Ruler
InuYasha the Movie 4: Fire on the Mystic Island
InuYasha the Movie: Affections Touching Across Time
Naruto Shippuden the Movie: Blood Prison
Naruto Shippûden the Movie: Bonds
Naruto Shippûden the Movie: The Will of Fire
Naruto Shippuden: The Movie
Naruto Shippuden: The Movie: The Lost Tower
Naruto the Movie 2: Legend of the Stone of Gelel
Naruto the Movie 3: Guardians of the Crescent Moon Kingdom
Naruto the Movie: Ninja

QueryBadFormed: QueryBadFormed: a bad request has been sent to the endpoint, probably the sparql query is bad formed. 

Response:
b'SPARQL-QUERY: queryStr=SELECT ?imdbID\n                        WHERE\n                        {\n                        ?film wdt:P31 wd:Q11424 .\n                        ?film rdfs:label "The Memphis Belle: A Story of a\nFlying Fortress"@en .\n                        ?film wdt:P577 ?time .\n                        FILTER ( YEAR(?time) =  1944 ).\n                        ?film wdt:P345 ?imdbID.\n                        }\njava.util.concurrent.ExecutionException: org.openrdf.query.MalformedQueryException: Lexical error at line 5, column 74.  Encountered: "\\n" (10), after : "\\"The Memphis Belle: A Story of a"\n\tat java.util.concurrent.FutureTask.report(FutureTask.java:122)\n\tat java.util.concurrent.FutureTask.get(FutureTask.java:206)\n\tat com.bigdata.rdf.sail.webapp.BigdataServlet.submitApiTask(BigdataServlet.java:292)\n\tat com.bigdata.rdf.sail.webapp.QueryServlet.doSparqlQuery(QueryServlet.java:678)\n\tat com.bigdata.rdf.sail.webapp.QueryServlet.doGet(QueryServlet.java:290)\n\tat com.bigdata.rdf.sail.webapp.RESTServlet.doGet(RESTServlet.java:240)\n\tat com.bigdata.rdf.sail.webapp.MultiTenancyServlet.doGet(MultiTenancyServlet.java:273)\n\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:687)\n\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:790)\n\tat org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1655)\n\tat org.wikidata.query.rdf.blazegraph.throttling.ThrottlingFilter.doFilter(ThrottlingFilter.java:320)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)\n\tat org.wikidata.query.rdf.blazegraph.throttling.SystemOverloadFilter.doFilter(SystemOverloadFilter.java:82)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)\n\tat ch.qos.logback.classic.helpers.MDCInsertingServletFilter.doFilter(MDCInsertingServletFilter.java:49)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)\n\tat org.wikidata.query.rdf.blazegraph.filters.QueryEventSenderFilter.doFilter(QueryEventSenderFilter.java:116)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)\n\tat org.wikidata.query.rdf.blazegraph.filters.ClientIPFilter.doFilter(ClientIPFilter.java:43)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)\n\tat org.wikidata.query.rdf.blazegraph.filters.JWTIdentityFilter.doFilter(JWTIdentityFilter.java:66)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)\n\tat org.wikidata.query.rdf.blazegraph.filters.RealAgentFilter.doFilter(RealAgentFilter.java:33)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)\n\tat org.wikidata.query.rdf.blazegraph.filters.RequestConcurrencyFilter.doFilter(RequestConcurrencyFilter.java:50)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1340)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1242)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:503)\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:364)\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)\n\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)\n\tat org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)\n\tat org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)\n\tat org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683)\n\tat java.lang.Thread.run(Thread.java:750)\nCaused by: org.openrdf.query.MalformedQueryException: Lexical error at line 5, column 74.  Encountered: "\\n" (10), after : "\\"The Memphis Belle: A Story of a"\n\tat com.bigdata.rdf.sail.sparql.Bigdata2ASTSPARQLParser.parseQuery2(Bigdata2ASTSPARQLParser.java:404)\n\tat com.bigdata.rdf.sail.webapp.QueryServlet$SparqlQueryTask.call(QueryServlet.java:741)\n\tat com.bigdata.rdf.sail.webapp.QueryServlet$SparqlQueryTask.call(QueryServlet.java:695)\n\tat com.bigdata.rdf.task.ApiTaskForIndexManager.call(ApiTaskForIndexManager.java:68)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\t... 1 more\nCaused by: com.bigdata.rdf.sail.sparql.ast.TokenMgrError: Lexical error at line 5, column 74.  Encountered: "\\n" (10), after : "\\"The Memphis Belle: A Story of a"\n\tat com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilderTokenManager.getNextToken(SyntaxTreeBuilderTokenManager.java:3994)\n\tat com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilder.jj_ntk(SyntaxTreeBuilder.java:9637)\n\tat com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilder.PathElt(SyntaxTreeBuilder.java:3185)\n\tat com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilder.PathSequence(SyntaxTreeBuilder.java:3134)\n\tat com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilder.PathAlternative(SyntaxTreeBuilder.java:3093)\n\tat com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilder.Path(SyntaxTreeBuilder.java:3084)\n\tat com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilder.VerbPath(SyntaxTreeBuilder.java:3080)\n\tat com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilder.PropertyListPath(SyntaxTreeBuilder.java:2981)\n\tat com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilder.TriplesSameSubjectPath(SyntaxTreeBuilder.java:2919)\n\tat com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilder.TriplesBlock(SyntaxTreeBuilder.java:2321)\n\tat com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilder.BasicGraphPattern(SyntaxTreeBuilder.java:2097)\n\tat com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilder.GraphPattern(SyntaxTreeBuilder.java:2034)\n\tat com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilder.GroupGraphPattern(SyntaxTreeBuilder.java:1969)\n\tat com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilder.WhereClause(SyntaxTreeBuilder.java:1013)\n\tat com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilder.SelectQuery(SyntaxTreeBuilder.java:377)\n\tat com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilder.Query(SyntaxTreeBuilder.java:328)\n\tat com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilder.QueryContainer(SyntaxTreeBuilder.java:216)\n\tat com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilder.parseQuery(SyntaxTreeBuilder.java:32)\n\tat com.bigdata.rdf.sail.sparql.Bigdata2ASTSPARQLParser.parseQuery2(Bigdata2ASTSPARQLParser.java:336)\n\t... 7 more\n'

In [23]:
film = []
director = []
gender = []
distributor = []
imdbID = []
rottenscore = []
not_found = []

In [None]:

def wikidata_reconciliation(query):

    
    # get the endpoint API
    wikidata_endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"
        

    for i, r in query.iterrow():
        x = r['title'] 
        y = r['release_year']
        
        try:

            print(x)
            my_SPARQL_query = """
            SELECT ?film_label ?director_label ?dir_gen_label 
            WHERE
            {
            ?film wdt:P31 wd:Q11424 .
            ?film rdfs:label """+'"'+ x +'"' +"""@en .
            ?film rdfs:label ?film_label .
            FILTER(lang(?film_label) = 'en')
            OPTIONAL {?film wdt:P57 ?director . 
            ?director rdfs:label ?director_label .    
            FILTER(lang(?director_label) = 'en')
            OPTIONAL {?director wdt:P21 ?dir_gen . 
            ?dir_gen rdfs:label ?dir_gen_label .
            FILTER(lang(?dir_gen_label) = 'en')}}
            
            }
            """
            # set the endpoint 
            sparql_wd = SPARQLWrapper(wikidata_endpoint)
            # set the query
            sparql_wd.setQuery(my_SPARQL_query)
            # set the returned format
            sparql_wd.setReturnFormat(JSON)
            # get the results
            
            results = sparql_wd.query().convert()

            if results['results']['bindings'] == []:
                not_found.append(""+x+"")
                
            else:
                film.append(results['results']['bindings'][0]['film_label']['value'])
                if "director_label" in results['results']['bindings'][0]:
                    director.append(results['results']['bindings'][0]['director_label']['value'])
                else:
                    director.append("no_data")
                if "dir_gen_label" in results['results']['bindings'][0]:
                    gender.append(results['results']['bindings'][0]['dir_gen_label']['value'])
                else:
                    gender.append("no_data")
                


        except urllib.error.HTTPError as e:
            time.sleep((int(e.headers["retry-after"])) + 1)
            error_title = query.index(x)
            wikidata_reconciliation(query[error_title:])
            

wikidata_reconciliation(movie_title[2637:])

In [None]:


import urllib.error
from xml.etree.ElementPath import xpath_tokenizer_re

def wikidata_reconciliation1(query):

    
    # get the endpoint API
    wikidata_endpoint = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"
        

    for x in query:

        try:

            
            my_SPARQL_query = """
            SELECT ?film_label ?distributor_label ?imdbID ?rottenscore
            WHERE
            {
            ?film wdt:P31 wd:Q11424 .
            ?film rdfs:label """+'"'+ x +'"' +"""@en .
            ?film rdfs:label ?film_label .
            FILTER(lang(?film_label) = 'en')
            OPTIONAL {?film wdt:P750 ?distributor . 
            ?distributor rdfs:label ?distributor_label .
            FILTER(lang(?distributor_label) = 'en')}
            OPTIONAL {?film wdt:P345 ?imdbID.}
            OPTIONAL {?film wdt:P444 ?rottenscore.}
            }
            """
            # set the endpoint 
            sparql_wd = SPARQLWrapper(wikidata_endpoint)
            # set the query
            sparql_wd.setQuery(my_SPARQL_query)
            # set the returned format
            sparql_wd.setReturnFormat(JSON)
            # get the results
            
            results = sparql_wd.query().convert()

            if results['results']['bindings'] == []:
                print(""+x+" not found")
            else:
                film.append(results['results']['bindings'][0]['film_label']['value'])
                if "distributor_label" in results['results']['bindings'][0]:
                    distributor.append(results['results']['bindings'][0]['distributor_label']['value'])
                else:
                    distributor.append("no_data")
                if "imdbID" in results['results']['bindings'][0]:
                    imdbID.append(results['results']['bindings'][0]['imdbID']['value'])
                else:
                    imdbID.append("no_data")
                if "rottenscore" in results['results']['bindings'][0]:
                    rottenscore.append(results['results']['bindings'][0]['rottenscore']['value'])
                else:
                    rottenscore.append("no_data")



        except urllib.error.HTTPError as e:

            print(e.headers["retry-after"])
            time.sleep((int(e.headers["retry-after"])) + 1)
            error_title = query.index(x)
            wikidata_reconciliation1(query[error_title:])
        except IncompleteRead:
            # Oh well, reconnect and keep trucking
            continue
            




wikidata_reconciliation1(movie_title[8089:9000])

In [None]:
dict = {"title": film, "director": director,
        "director gender": gender}

lists_df = pd.DataFrame(dict)
lists_df.to_csv('savelist.csv')
lists_df.head(5)

In [None]:
dict1 = {"title": film, "distributor": distributor, "id": imdbID, "rating score": rottenscore}

lists1_df = pd.DataFrame(dict1)
lists1_df.to_csv('6000-9000.csv')
lists1_df.head(5)

# data analysis

# data visualization

In [None]:
#