<a href="https://colab.research.google.com/github/AdamKirstein/Wikipedia_data_collection_example/blob/master/lets-talk-data-walkthrough.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Install Dependencies**

In [None]:
!pip3 install qwikidata --quiet
!pip3 install SPARQLWrapper --quiet
!pip3 install requests --quiet
!pip3 install Wikidata --quiet
!pip3 install wikipedia --quiet 

In [None]:
from qwikidata.sparql import return_sparql_query_results 
from SPARQLWrapper import SPARQLWrapper, JSON 
from pandas.errors import ParserError
import pandas as pd
import sys
import requests
from requests.exceptions import SSLError
from datetime import datetime
import warnings 
import wikipedia

warnings.simplefilter("ignore")

Start by setting up a connection between the Wikidata store and your Python notebook. If you have used SQLAlchemy, or connected to any external database from Python, this works in much the same way.

In [None]:
def Create_sparql_engine():
    """ 
    Step 1: define the user agent and the endpoint

    user_agent = The User-Agent for the HTTP request header.
                If omitted, SPARQLWrapper autogenerates one from its version number.
    endpoint_url = The SPARQL endpoint's URI.

    Step 2: create the SPARQL 'engine' by calling SPARQLWrapper with endpoint_url and user_agent
    """
    user_agent = "WDQS-example Python/%s.%s" % ( sys.version_info[0], sys.version_info[1])
    endpoint_url = "https://query.wikidata.org/sparql"
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    return sparql

In [None]:

def get_wikidata_query(engine,query_string):
    """ 
    Step 3: We call the engine from the previous function, as well as a string of the query we want. 
    You can construct this query using the Wikidata SPARQL Interface
    """
    # Attach the query string to the engine
    engine.setQuery(query_string)
    # Ask the endpoint to return JSON
    engine.setReturnFormat(JSON)
    # Run the query and flatten the result bindings into a pandas data frame
    # (pd.json_normalize replaces the deprecated pd.io.json.json_normalize)
    results_df = pd.json_normalize(engine.query().convert()['results']['bindings'])
    return results_df



# Page views functions 
def pull_pageviews_from_wikimedia(project, access, agent, articles, granularity, start, end):
  # Accept either a single title or a list of titles
  if not isinstance(articles, list):
    articles = [articles]
  errors = []      # titles whose request did not return usable data
  dataframes = []
  for article in articles:
    response = makePageviewsRequest(project, access, agent, article, granularity, start, end)
    try:
      dataframes.append(pd.DataFrame(response.json()['items']))
    except (KeyError, ValueError):
      # No 'items' key (e.g. article not found) or a non-JSON response
      errors.append(article)
  # pd.concat raises on an empty list, so fall back to an empty frame
  df = pd.concat(dataframes, axis=0, ignore_index=True) if dataframes else pd.DataFrame()
  return df, errors



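def makePageviewsRequest(project, access, agent, articles, granularity, start, end):
  url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/{}.wikipedia.org/{}/{}/{}/{}/{}/{}"\
          .format(project, access, agent, articles, granularity, start, end)
  # Wikimedia asks API clients to identify themselves; this reuses the
  # User-Agent string from Create_sparql_engine. Swap in your own details.
  headers = {"User-Agent": "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])}
  index = requests.get(url, headers=headers)
  return index 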


# assets functions
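def pull_page_description(wikiFilmList):
    # wikipedia.page() expects spaces, not underscores, in titles
    wikiFilmList = [" ".join(i.split("_")) for i in wikiFilmList]
    summaries_dict = {}
    for i in wikiFilmList:
      try:
        summaries_dict[i] = wikipedia.page(i).content
      except Exception:
        # Print titles that could not be resolved so they can be retried under another name
        print(i)
    return summaries_dict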

# **Section 1: Interface with Wikidata**

#### 1.1 Query Anatomy 



---



```
# Select our columns!
# Variable names always start with '?'.
# We need an item (?film, in this case) and its label (?filmLabel).
# Without the label we only get entity URIs, which are hard to read.


SELECT DISTINCT ?film ?filmLabel ?date ?alias 
WHERE{
  # I want all items that are an instance of (P31) film (Q11424)

  ?film wdt:P31 wd:Q11424;

        # Now we target the publication date (P577) statement. From that one
        # statement we want two things: its place of publication (P291)
        # must be the United States (Q30), and its date value goes into ?date.

        # publication date statement -> place of publication -> United States
        p:P577 [ pq:P291 wd:Q30; ps:P577 ?date].

  # In addition to the film name as it appears in Wikidata, we also select the title
  # as it appears on Wikipedia. Some names are shared across multiple subjects,
  # so the Wikipedia title keeps our data specific.

  # ?alias = film name as it appears on Wikipedia
  ?article schema:about  ?film ; schema:isPartOf <https://en.wikipedia.org/> ;  schema:name ?alias .

  # Include this whenever you select a label variable. It sets the language of the labels.

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
}

ORDER BY DESC(?date)
limit 100
```



#### 1.1.5 Pseudocode For Query


---
As a reminder, the query we constructed says:

Give me every film published in the United States, its US publication date, and its title as it appears on English Wikipedia (en.wikipedia.org).




#### 1.2 Use this for your query 

---



Copy the query below, then go to: https://query.wikidata.org/

```
SELECT DISTINCT ?film ?filmLabel ?date ?alias
WHERE
{
  ?film wdt:P31 wd:Q11424;
        p:P577 [ pq:P291 wd:Q30; ps:P577 ?date].  
  ?article schema:about  ?film ; schema:isPartOf <https://en.wikipedia.org/> ;  schema:name ?alias .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
}
ORDER BY DESC(?date)
limit 100
```



#### 1.2.5 Paste Query in query_string quotations

In [None]:
query_string = """ erase me and paste query here plz (don't erase the quotation marks tho) """

## 1.3 Initialize the connection between python and Wikidata and retrieve result

---







In [None]:
engine = Create_sparql_engine()
# Give the results function the engine (access to Wikidata) and the query to execute
results = get_wikidata_query(engine, query_string)

In [None]:
results
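The dotted column names in `results` (for example `alias.value`) come from flattening the nested SPARQL JSON response. Here is a minimal sketch of that step. The sample response is hand-written for illustration (the entity ID and title are made up, not real query output).

```python
import pandas as pd

# Hypothetical, hand-written response in the SPARQL JSON results shape.
sample_response = {
    "results": {
        "bindings": [
            {
                "film": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0000001"},
                "alias": {"type": "literal", "value": "Example Film"},
                "date": {"type": "literal", "value": "2020-01-01T00:00:00Z"},
            }
        ]
    }
}

# Each nested key becomes a dotted column name, e.g. "alias.value".
sample_df = pd.json_normalize(sample_response["results"]["bindings"])
print(sorted(sample_df.columns))
```

Every selected variable produces a `.type` and a `.value` column; only the `.value` columns are usually worth keeping.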

## 1.3.5 Clean Up Results

---






In [None]:
# Select the columns we want (copy so the edits below do not touch `results`)
wikidata_results = results[['alias.value','date.value', 'film.value']].copy()
# Rename them to make them cute
wikidata_results.columns = ['film', 'release_date', 'wikipedia_code']
# Clean up the code: keep only the ID at the end of the entity URI
wikidata_results['wikipedia_code'] = [i.split('/')[-1] for i in wikidata_results.wikipedia_code]
# Format dates as YYYY-MM-DD
wikidata_results['release_date'] = pd.to_datetime(wikidata_results.release_date).dt.strftime("%Y-%m-%d")
# Match the title format the pageviews API expects (underscores instead of spaces)
wikidata_results['meta_wiki_format'] = wikidata_results.film.str.replace(' ', "_")
wikidata_results.drop_duplicates(inplace=True)

In [None]:
wikidata_results

In [None]:
wiki_film_list = wikidata_results.meta_wiki_format.to_list()

# Section 2: **Meta Wiki Pageviews API**








## 2.1 API Anatomy 
### **The Meta-Wiki REST API takes 7 arguments:**


---




*   project - language of Wikipedia (e.g. `en`)
*   access - type of access (desktop, mobile, etc.)
*   agent - type of user (bot, real user, API call, etc.)
*   article - the name of the Wikipedia page. It must be in the format My_Wikipedia_Page (underscores in place of spaces)
*   granularity - daily, monthly, hourly (tbd)
*   start time - earliest date you want your query to start at (see date formats below)
*   end date - the latest date you want your query to consider (see date formats below)

Date formats must be YYYYMMDD or YYYYMMDDHH.
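To make the seven arguments concrete, here is a minimal sketch of how they could be assembled into a request URL. `build_pageviews_url` and `BASE_URL` are hypothetical names introduced for this example; the percent-encoding of the title is an extra precaution for titles containing characters such as `:` or `/`, not something the notebook's own helper does.

```python
from datetime import date
from urllib.parse import quote

BASE_URL = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def build_pageviews_url(project, access, agent, article, granularity, start, end):
    """Assemble the per-article pageviews URL from the seven arguments."""
    # Titles use underscores instead of spaces; percent-encode everything else
    # that is not safe in a URL path segment.
    title = quote(article.replace(" ", "_"), safe="")
    # Dates are sent as YYYYMMDD.
    start_str = start.strftime("%Y%m%d")
    end_str = end.strftime("%Y%m%d")
    parts = [BASE_URL, f"{project}.wikipedia.org", access, agent,
             title, granularity, start_str, end_str]
    return "/".join(parts)

url = build_pageviews_url("en", "all-access", "user", "Ip Man 4: The Finale",
                          "daily", date(2019, 1, 1), date(2020, 7, 28))
print(url)
```

Passing `date` objects instead of pre-formatted strings keeps the YYYYMMDD formatting in one place.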


---









### 2.2 **Retrieve Page Views From API**

In [None]:
meta_wiki_pageviews,errors = pull_pageviews_from_wikimedia(project = 'en',
                                     access= 'all-access',
                                     agent= 'user',
                                     articles= wiki_film_list,
                                     granularity='daily', 
                                     start='20190101',
                                     end='20200728')

In [None]:
meta_wiki_pageviews
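Each row of the pageviews result is one article on one day, with the day encoded as a `YYYYMMDDHH` string in the `timestamp` column. Below is a minimal sketch of a typical next step: parsing the timestamp and totalling views per article. The rows are made up for illustration and only include the `article`, `timestamp` and `views` fields.

```python
import pandas as pd

# Hypothetical rows in the shape of the API's "items" list (values are made up).
sample_items = [
    {"article": "Example_Film", "timestamp": "2019010100", "views": 10},
    {"article": "Example_Film", "timestamp": "2019010200", "views": 15},
    {"article": "Other_Film", "timestamp": "2019010100", "views": 7},
]

views = pd.DataFrame(sample_items)
# Daily timestamps arrive as YYYYMMDDHH strings; parse them into real datetimes.
views["date"] = pd.to_datetime(views["timestamp"], format="%Y%m%d%H")
# Total views per article over the sampled period.
totals = views.groupby("article")["views"].sum()
print(totals)
```

The same two lines should work on `meta_wiki_pageviews` itself, assuming the request returned at least one article.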

# Section 3: **MediaWiki Content API**

# **Media Wiki Page Content**


---






This API returns the text of the Wikipedia page for the title you give it. The required input format is the opposite of the pageviews API: plain spaces, with no underscores between words.

input : My Film Name
output: wikipedia page (text)

You might encounter an error where your input does not return an output.
When that happens, go back to the item's Wikidata page and look at its "Also known as" section, found by clicking the "In more languages" drop-down beneath the page title. You can collect these alias values in your query, then write a script that tries each one in turn until you get a result.
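The retry idea can be sketched as a small helper that walks a list of candidate titles and stops at the first one that works. `first_successful` and `fake_fetch` are hypothetical names for this example; `fake_fetch` stands in for the real page lookup so the sketch runs without a network connection.

```python
def first_successful(candidates, fetch):
    """Try each candidate title in order; return (title, content) for the first that works."""
    for title in candidates:
        try:
            return title, fetch(title)
        except Exception:
            # This title did not resolve; move on to the next alias.
            continue
    return None, None

# Stand-in for the real page lookup (made-up title and text).
def fake_fetch(title):
    pages = {"Example Film (2020 film)": "Plot summary..."}
    return pages[title]  # raises KeyError for unknown titles

title, content = first_successful(
    ["Example Film", "Example Film (2020 film)"], fake_fetch
)
print(title)
```

In the notebook, `fetch` could be something like `lambda t: wikipedia.page(t).content`, with the candidate list built from the Wikipedia title followed by the item's aliases.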

In [None]:
test = pull_page_description(wiki_film_list)

What About Love (film)
Ip Man 4: The Finale


In [None]:
list(test.values())[6]

**END**