<center><img src="http://i.imgur.com/sSaOozN.png" width="500"></center>

## Course: Computational Thinking for Governance Analytics

### Prof. José Manuel Magallanes, PhD 
* Visiting Professor of Computational Policy at Evans School of Public Policy and Governance, and eScience Institute Senior Data Science Fellow, University of Washington.
* Professor of Government and Political Methodology, Pontificia Universidad Católica del Perú. 

_____
<a id='home'></a>

# Data collecting

0. [The GitHub Repo](#part0) 
1. [Uploading files](#part1a)
2. [APIs](#part2) 
3. [Scraping](#part3) 
4. [Social media](#part4) 

____
<a id='part0'></a>

## 0. The GitHub Repo

**Where is the Data?**
**Where is the Code?**

We use GitHub to organize our work. It transforms the way you access your code and data beyond your local machine, giving you the feeling of working in the cloud. Follow these steps now:

1. Go to [github.com](https://github.com/), and sign up. Remember your _username_ and _password_.
2. Install the GitHub *desktop app* on your computer. It is available [here](https://desktop.github.com/). Sign in to the app using your _username_ and _password_.
3. Once you are signed in, go back to the GitHub website and create a _repository_. Choose a name. **DO NOT forget** to select the option to add a README file, and choose a **LICENSE** too (I recommend MIT).
4. **Clone** the repository you just created to your computer. You can select where to clone the repository; avoid cloning it into a folder that belongs to another app.


After those steps, let's open the data file that you find [here](https://drive.google.com/drive/folders/1uH6S-8rns8THDnBRCy3OGLEki9LjyTCL?usp=sharing). Once you open the file, you will see something like this:


In [None]:
from IPython.display import IFrame  
wikiLink="https://docs.google.com/spreadsheets/d/e/2PACX-1vRXdVAxZTnQ6N7bI1xJ_XRQSoG-FsiucGI_fsyBuKCi6TcO3guGB_-6nk4i6so7SG__eIpfS0o8pUuZ/pubhtml?widget=true&headers=false"
IFrame(wikiLink, width=700, height=300)

Understanding what you see is not straightforward:

* What does this data represent?
* What is this data table talking about?
* What does each column represent?
* What possible mistakes can appear in the data values?
* What variables will be useful for my hypotheses and goals?

If this is the first time you run into a data set like this, you first need to **read** the documentation of the data (available in the same folder).

At this stage, you can download this data from GoogleDocs into the folder cloned from GitHub (now in your local machine). Download it first as a CSV file into your computer. 

Now, open the GitHub Desktop app. You will see that the app is telling you that some changes have occurred in your local folder cloned from GitHub. Now, let's **Commit** and **Push**; that will synchronize the contents of your local repo and your cloud repo.

In *my own* repo, it looks like [this](https://github.com/EvansDataScience/data/blob/master/hsb_ok_google%20-%20data.csv).

GitHub lets you preview the data contents (this is not always possible). Now, find the **Raw** button and copy its link address (**do not** copy the URL of the page). This is mine:

https://github.com/EvansDataScience/data/raw/master/hsb_ok_google%20-%20data.csv
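The page URL and the raw link differ in only one path segment, but the first returns a web page while the second returns the file itself. A minimal sketch of that difference, using the two links from my repo; the helper name `blob_to_raw` is hypothetical (it is not a GitHub function) and it assumes the usual `/blob/` layout of a file page:

```python
def blob_to_raw(page_url):
    """Turn a GitHub file-page URL into its raw-file URL (illustrative helper)."""
    # only the first '/blob/' segment is replaced
    return page_url.replace("/blob/", "/raw/", 1)

pageURL = "https://github.com/EvansDataScience/data/blob/master/hsb_ok_google%20-%20data.csv"
rawURL = blob_to_raw(pageURL)
print(rawURL)
```

Copying the link from the button is still the safest route; the sketch only shows what changes between both addresses.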

Now let's get ready for Python.

____
<a id='part1a'></a>

## 1. Uploading files

The file we are planning to read into Python is a data table. Python needs support from one of its libraries to deal with data tables: **pandas**.

* Check if you already have Pandas.
* If you do not have it, install it.

In [None]:
# do I have it?
!pip show pandas

In [None]:
# if you do not have it:
#!pip install pandas

Now that I have it, I can read in the data, so that I can work on it:

In [None]:
# call pandas
import pandas as pd # 'pd' is a nickname

# use a function from pandas to read the cloud data into a Python object
gitCloudRepo='https://github.com/EvansDataScience/data/raw/master/'
fileName="hsb_ok_google%20-%20data.csv"
DFcsv=pd.read_csv(gitCloudRepo + fileName)

If Python has not **complained** (that is, you got no error messages), you are good to continue!

The **DFcsv** Python object holds the data:

In [None]:
DFcsv

This object is of a particular type:

In [None]:
type(DFcsv)

The type is **Data Frame** (DF). We will see several functions that can be applied to DFs. Let me show you some basic ones.

In [None]:
# dimensions (rows,columns)
DFcsv.shape

In [None]:
# top / bottom
DFcsv.head(10) #tail()

In [None]:
# column names
DFcsv.columns

In [None]:
# access by index position
DFcsv.iloc[:,4]

In [None]:
# access by index names
DFcsv.loc[:,'sctyp']

In [None]:
# subdata frame
DFcsv[['sctyp']]
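The last two cells look alike but return different objects: asking for one column by name gives a Series, while asking with a list of names gives a data frame. A small sketch with toy data (the values are invented; only the column name `sctyp` comes from the file above):

```python
import pandas as pd

# hypothetical toy data; 'sctyp' mimics a column name from the file above
toyDF = pd.DataFrame({'id': [1, 2, 3],
                      'sctyp': ['public', 'private', 'public']})

asSeries = toyDF.loc[:, 'sctyp']   # one column name: returns a Series
asFrame = toyDF[['sctyp']]         # a list of names: returns a DataFrame

print(type(asSeries).__name__, asSeries.shape)
print(type(asFrame).__name__, asFrame.shape)
```

Notice the shapes: the Series has one dimension, the data frame keeps two.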

And the most important for future sessions:

In [None]:
DFcsv.info() # the data type Python has assigned

Python does not care about the original file format (CSV, Excel, etc.): once a file is read into Python, you have a DF.

**Proprietary Software**

Several times, you may find that you are given a file that was previously prepared with proprietary software. The most common in the policy field are:

* SPSS (file extension: **sav**).
* STATA (file extension: **dta**).
* EXCEL (file extension: **xlsx** or **xls**).

Reading these files may not bring many preprocessing challenges. However, you may need different levels of effort to read them from **GitHub**: they will not be as easy to open as a CSV.

Download [these files](https://drive.google.com/drive/folders/1XxTztY6rFkGwbR7wUtO_xiD4xvBrFqJN?usp=share_link) into the repository where your CSV file is currently stored in your local machine, then commit and push. We can create the links to each of them. Using my repo as an example:

In [None]:
gitCloudRepo + fileName


linkToSTATA=gitCloudRepo+'hsb_ok.dta'
linkToSPSS=gitCloudRepo+'hsb_ok.sav'
linkToEXCEL=gitCloudRepo+'hsb_ok.xlsx'
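Before trying the links, it may help to see one of these readers at work on a file you control. The sketch below writes a tiny, invented data frame as a Stata file in a temporary folder and reads it back with `pd.read_stata`; the toy data and the file name `toy.dta` are assumptions for illustration only:

```python
import os
import tempfile
import pandas as pd

# hypothetical toy data, standing in for the real file
toyDF = pd.DataFrame({'id': [1, 2, 3],
                      'sctyp': ['public', 'private', 'public']})

# write a Stata file into a temporary folder, then read it back
with tempfile.TemporaryDirectory() as tmpFolder:
    pathToSTATA = os.path.join(tmpFolder, 'toy.dta')
    toyDF.to_stata(pathToSTATA, write_index=False)
    DFstata = pd.read_stata(pathToSTATA)

print(type(DFstata))
print(DFstata)
```

For the other formats, pandas offers `pd.read_excel` (which relies on the **openpyxl** library for xlsx files) and `pd.read_spss` (which relies on **pyreadstat**). Whether each of them accepts a link directly may depend on your pandas version, so check before assuming it does.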

**Exercise:**
    
Open the other files (Excel, SPSS, Stata). Check whether all the functions shown above work with the DFs created, and verify that the results are the same.

[home](#home)

____

<a id='part2'></a>

## 2. Collecting data from APIs

Open data portals from governments and other organizations offer APIs, a service that allows you to collect their data. Let's take a look at the Seattle data on [Seattle Real Time Fire 911 Calls](https://dev.socrata.com/foundry/data.seattle.gov/kzjm-xkqj).

That page tells you how to get the data into pandas. But first, you need to install **sodapy**. Then you can continue:

In [None]:
#!pip install sodapy

Let's follow some steps, according to the API:

In [None]:
from sodapy import Socrata

# Unauthenticated client (using 'None')

client = Socrata("data.seattle.gov", None)

# If you have credentials:
# client = Socrata("data.seattle.gov",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("kzjm-xkqj", limit=2000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

You can see the results now:

In [None]:
results_df

Data from APIs may need more preprocessing than the previous files. Besides, you should study the API documentation to know how to interact with the portal. Not every open data portal behaves the same.
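One common preprocessing step is fixing data types, since values collected from an API frequently arrive as text. The sketch below does not call the portal: the records, field names and values are invented to mimic a list of dictionaries, and the conversions use `pd.to_numeric` and `pd.to_datetime`:

```python
import pandas as pd

# hypothetical records, shaped like the list of dictionaries an API returns;
# the field names and values are invented for illustration
fakeResults = [
    {'type': 'Aid Response', 'datetime': '2022-12-05T10:15:00.000',
     'latitude': '47.61', 'longitude': '-122.33'},
    {'type': 'Medic Response', 'datetime': '2022-12-05T11:02:00.000',
     'latitude': '47.65', 'longitude': '-122.30'},
]

fake_df = pd.DataFrame.from_records(fakeResults)
print(fake_df.dtypes)   # before converting: everything is text

# convert text into numbers and dates
fake_df['latitude'] = pd.to_numeric(fake_df['latitude'])
fake_df['longitude'] = pd.to_numeric(fake_df['longitude'])
fake_df['datetime'] = pd.to_datetime(fake_df['datetime'])
print(fake_df.dtypes)   # after converting
```

With the real `results_df`, run `results_df.info()` first to see which columns need this treatment.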

[home](#home)
____
<a id='part3'></a>

## 3. Scraping

Sometimes you are interested in data from the web. Let me get a table from Wikipedia:

In [30]:
from IPython.display import IFrame  
wikiLink="https://en.wikipedia.org/wiki/Democracy_Index" 
IFrame(wikiLink, width=700, height=300)

I will use pandas to get the table, but you need to install these first:
* html5lib 
* beautifulsoup4
* lxml

In [31]:
dataWIKI=pd.read_html(wikiLink,header=0,flavor='bs4',attrs={'class': 'wikitable'})

Pandas has the function **read_html**, which saves a lot of coding. Above, I just specified:
* The link to the webpage.
* The position of the header.
* The external library that will be used to extract the text (_flavor_).
* The attributes of the table.

dataWIKI is not a data frame:

In [32]:
type(dataWIKI)

list

The function **read_html** returns a list with every table in the page that matches those attributes. Let's see how many there are:

In [33]:
len(dataWIKI)

5

This means you have five DFs. Is ours the first one?

In [34]:
# remember that Python starts counting in ZERO!
dataWIKI[0]

| | Type of regime | Score | Countries | Countries.1 | Proportion ofWorld population (%) |
|---|---|---|---|---|---|
| 0 | Type of regime | Score | Number | (%) | Proportion ofWorld population (%) |
| 1 | Full democracies | 9.01–10.00 8.01–9.00 | 21 | 12.6% | 6.4% |
| 2 | Flawed democracies | 7.01–8.00 6.01–7.00 | 53 | 31.7% | 39.3% |
| 3 | Hybrid regimes | 5.01–6.00 4.01–5.00 | 34 | 20.4% | 17.2% |
| 4 | Authoritarian regimes | 3.01–4.00 0–3.00 | 59 | 35.3% | 37.1% |


or the last one?

In [35]:
dataWIKI[4]

| | Rank | .mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Δ Rank | Country | Regime type | Overall score | Δ Score | Elec­toral pro­cessand plura­lism | Func­tioningof govern­ment | Poli­ticalpartici­pation | Poli­ticalcul­ture | Civilliber­ties |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Full democracies | Full democracies | Full democracies | Full democracies | Full democracies | Full democracies | Full democracies | Full democracies | Full democracies | Full democracies | Full democracies |
| 1 | 1 | | Norway | Full democracy | 9.75 | 0.06 | 10.00 | 9.64 | 10.00 | 10.00 | 9.12 |
| 2 | 2 | 2 | New Zealand | Full democracy | 9.37 | 0.12 | 10.00 | 8.93 | 9.44 | 8.75 | 9.71 |
| 3 | 3 | 3 | Finland | Full democracy | 9.27 | 0.07 | 10.00 | 9.29 | 8.89 | 8.75 | 9.41 |
| 4 | 4 | 1 | Sweden | Full democracy | 9.26 | | 9.58 | 9.29 | 8.33 | 10.00 | 9.12 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 166 | 162 | 2 | Central African Republic | Authoritarian | 1.43 | 0.11 | 1.25 | 0.00 | 1.67 | 1.88 | 2.35 |
| 167 | 164 | 2 | Democratic Republic of the Congo | Authoritarian | 1.40 | 0.27 | 0.75 | 0.00 | 2.22 | 3.13 | 0.88 |
| 168 | 165 | 2 | North Korea | Authoritarian | 1.08 | | 0.00 | 2.50 | 1.67 | 1.25 | 0.00 |
| 169 | 166 | 31 | Myanmar | Authoritarian | 1.02 | 2.04 | 0.00 | 0.00 | 1.67 | 3.13 | 0.29 |


Scraped tables will bring different cleaning challenges.
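The column names of the last table are a good example: one carries a piece of CSS code, and several carry invisible soft hyphens. A rough sketch of a cleaner for such names, using only the standard library; the helper `clean_header` is a hypothetical name, and the pattern is tailored to the debris seen above, not to every possible page:

```python
import re

def clean_header(name):
    """Illustrative cleaner for a scraped column name."""
    # drop soft hyphens (invisible characters used for line breaking)
    name = name.replace('\xad', '')
    # drop CSS rules of the form '.selector{...}' glued to the text
    name = re.sub(r'\.[^{}]*\{[^{}]*\}', '', name)
    return name.strip()

dirtyName = '.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Δ Rank'
print(clean_header(dirtyName))
print(clean_header('Poli\xadticalcul\xadture'))
```

It could then be applied to every column with a list comprehension, for instance `[clean_header(c) for c in dataWIKI[4].columns]`. Words that were glued together by the scraping would still need a separate fix.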

[home](#home)
____
<a id='part4'></a>

## 4. Social media data

Social media platforms offer APIs too, which allow you to get _some_ data. To use this service, you need to register as a developer. For our Twitter example, you should go [here](https://developer.twitter.com/en).
Once you are a confirmed developer, Twitter, Facebook and others will allow you to get _some_ of their data (the more you pay, the more they offer).

Let's pay attention to Twitter. First, check if you have **tweepy**:

In [None]:
!pip show tweepy

In [None]:
#!pip install tweepy

Tweepy is the key library, but you may need several other libraries according to your goals.

In [1]:
import tweepy

Let me introduce myself to Twitter:

In [8]:
# credentials from a file (the 'with' block closes the file when done)
import json
with open('APIkeys.txt','r') as keysFile:
    keysAPI = json.load(keysFile)

# getting info from the file
api_key = keysAPI['api_key']
api_key_secret = keysAPI['api_key_secret']
access_token = keysAPI['access_token']
access_token_secret = keysAPI['access_token_secret']

# introducing myself:
auth = tweepy.OAuthHandler(api_key, api_key_secret)
auth.set_access_token(access_token, access_token_secret)
api=tweepy.API(auth, wait_on_rate_limit=True,timeout=60,
               parser=tweepy.parsers.JSONParser())
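The cell above expects a text file holding the four credentials in JSON format. A sketch of how such a file could be created and read back, with fake values and a temporary folder so nothing real is written; the values `'xxxx'` are placeholders, only the four key names come from the cell above:

```python
import json
import os
import tempfile

# fake credentials, only to show the layout of the file
fakeKeys = {'api_key': 'xxxx',
            'api_key_secret': 'xxxx',
            'access_token': 'xxxx',
            'access_token_secret': 'xxxx'}

with tempfile.TemporaryDirectory() as tmpFolder:
    pathToKeys = os.path.join(tmpFolder, 'APIkeys.txt')
    # save the dictionary as JSON text
    with open(pathToKeys, 'w') as keysFile:
        json.dump(fakeKeys, keysFile)
    # read it back, as the cell above does
    with open(pathToKeys, 'r') as keysFile:
        keysDemo = json.load(keysFile)

print(sorted(keysDemo.keys()))
```

Keep the real file out of your GitHub repo, so that your credentials are never pushed to the cloud.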

Let me ask for some tweets from a particular user:

In [46]:
who='@WAStateGov'
howMany=100
gottenTweets = api.user_timeline(screen_name = who, 
                                 count = howMany, 
                                 include_rts = True,
                                 tweet_mode="extended")

In the previous cases, I got a table (a data frame). You should always check what you have:

In [69]:
type(gottenTweets)

list

I have a list, so I can ask how many tweets I got (just to confirm):

In [70]:
len(gottenTweets)

100

Let me view what I have in the first tweet:

In [71]:
gottenTweets[0]

{'created_at': 'Tue Dec 06 00:05:12 +0000 2022',
 'id': 1599917840953651200,
 'id_str': '1599917840953651200',
 'full_text': '#TheDailyGov: ‘This is what we need’ Inslee says of Catalyst Project opening Monday; After decade, same-sex marriage law not so controversial; Shelved since 2018, WA gun law may be implemented; Dealing with the flu or a cold?; Real ID deadline extended. https://t.co/O4VPPiEWML https://t.co/w081Rj4NDz',
 'truncated': False,
 'display_text_range': [0, 277],
 'entities': {'hashtags': [{'text': 'TheDailyGov', 'indices': [0, 12]}],
  'symbols': [],
  'user_mentions': [],
  'urls': [{'url': 'https://t.co/O4VPPiEWML',
    'expanded_url': 'https://www.governor.wa.gov/news-media/news-media/daily-gov-news-clips',
    'display_url': 'governor.wa.gov/news-media/new…',
    'indices': [254, 277]}],
  'media': [{'id': 1599917715950817281,
    'id_str': '1599917715950817281',
    'indices': [278, 301],
    'media_url': 'http://pbs.twimg.com/media/FjQMr5kaUAE3nME.jpg',
    'medi

It will take some time to become familiar with a [tweet object structure](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet). Let's find out how each tweet is currently stored:

In [72]:
type(gottenTweets[0])

dict

Now you know that each tweet is stored as a **dictionary**. 
Dictionaries or **dicts** are very flexible and important structures in Python. Let me show you a simple one: 

In [73]:
aDictExample={"name":"Peter",
             "speaks":['French', 'Spanish'],
             'country':'Morocco'}

# then
aDictExample

{'name': 'Peter', 'speaks': ['French', 'Spanish'], 'country': 'Morocco'}

Dicts are a basic structure in Python, and one that makes Python very appealing. Each element in a dict can be accessed via its **key**. Our _aDictExample_ has these keys:

In [74]:
aDictExample.keys()

dict_keys(['name', 'speaks', 'country'])

So, you access the info like this:

In [75]:
aDictExample['speaks']

['French', 'Spanish']
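Two more details are worth a quick sketch. The value you get back may itself be a list (or another dict), so you can keep indexing; and asking for a key that does not exist with square brackets raises an error, while the method `.get()` does not. The key `'age'` below is deliberately one the dict does not have:

```python
aDictExample = {"name": "Peter",
                "speaks": ['French', 'Spanish'],
                'country': 'Morocco'}

# a value can be a list, so you can keep indexing
firstLanguage = aDictExample['speaks'][0]

# .get() returns None (or a default you choose) instead of raising an error
age = aDictExample.get('age')
ageOrDefault = aDictExample.get('age', 'unknown')

print(firstLanguage, age, ageOrDefault)
```

This matters for tweets, where many values are nested lists or dicts.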

Then, let's see our **gottenTweets** keys:

In [76]:
gottenTweets[0].keys()

dict_keys(['created_at', 'id', 'id_str', 'full_text', 'truncated', 'display_text_range', 'entities', 'extended_entities', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'possibly_sensitive', 'lang'])

In [77]:
# then:
gottenTweets[0]['created_at']

'Tue Dec 06 00:05:12 +0000 2022'

In [78]:
gottenTweets[1]['created_at']

'Mon Dec 05 23:59:49 +0000 2022'
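Notice that these dates are plain text. A possible way to turn one into a real date, using only the standard library; the format string is my reading of the text above (weekday, month, day, time, UTC offset, year), so treat it as an assumption to verify against your own tweets:

```python
from datetime import datetime

# the string shown above for the first tweet
rawDate = 'Tue Dec 06 00:05:12 +0000 2022'

# weekday, month, day, time, UTC offset, year
asDate = datetime.strptime(rawDate, '%a %b %d %H:%M:%S %z %Y')

print(asDate.isoformat())
print(asDate.year, asDate.month, asDate.day)
```

Once it is a date, you can compare tweets in time or extract the year, month or day.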

We could prepare a data frame using the current tweets. First, let's prepare a list for each of the fields we want:

In [79]:
# list comprehensions
dates=[t['created_at'] for t in gottenTweets]
ids=[t['id'] for t in gottenTweets]
rts=[t['retweet_count'] for t in gottenTweets]
likes=[t['favorite_count'] for t in gottenTweets]
text=[t['full_text'] for t in gottenTweets]
rtw=[t['full_text'].startswith('RT') for t in gottenTweets]

Each of the objects created is a list (dates, ids, rts, likes, text and rtw). Let me show you one:

In [80]:
rtw

[False,
 True,
 True,
 True,
 True,
 False,
 False,
 False,
 True,
 True,
 True,
 False,
 True,
 False,
 True,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 True,
 True,
 False,
 False,
 True,
 False,
 True,
 False,
 True,
 True,
 False,
 False,
 False,
 True,
 True,
 False,
 True,
 False,
 False,
 False,
 True,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 True,
 False,
 True,
 False,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False]

Let me create a data frame with those lists:

In [81]:
tweetsAsDF=pd.DataFrame({'dates':dates,'ids':ids,'rts':rts,'likes':likes,'text':text,'rtw':rtw})

In [64]:
tweetsAsDF

| | dates | ids | rts | likes | text | rtw |
|---|---|---|---|---|---|---|
| 0 | Tue Dec 06 00:05:12 +0000 2022 | 1599917840953651200 | 0 | 2 | #TheDailyGov: ‘This is what we need’ Inslee sa... | False |
| 1 | Mon Dec 05 23:59:49 +0000 2022 | 1599916485081968640 | 9 | 0 | RT @WA_DOL: BREAKING: Stricter air travel ID s... | True |
| 2 | Mon Dec 05 20:11:10 +0000 2022 | 1599858941844127744 | 2 | 0 | RT @TVWnews: Right now, TVW is livestreaming –... | True |
| 3 | Sat Dec 03 20:30:49 +0000 2022 | 1599139111222509568 | 11 | 0 | RT @WAStatePks: On the first day of '23, my st... | True |
| 4 | Sat Dec 03 00:34:47 +0000 2022 | 1598838121839603712 | 2 | 0 | RT @kxly4news: The Washington State Department... | True |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | Sun Mar 15 20:16:28 +0000 2020 | 1239284348269899777 | 3664 | 0 | RT @GovInslee: While fighting COVID-19, we mus... | True |
| 96 | Thu Mar 12 02:00:27 +0000 2020 | 1237921364591308800 | 37 | 0 | RT @GovInslee: Para proteger a los residentes ... | True |
| 97 | Wed Mar 11 22:56:17 +0000 2020 | 1237875018727428096 | 110 | 0 | RT @waEMD: As of today, the state will prohibi... | True |
| 98 | Wed Mar 11 22:56:05 +0000 2020 | 1237874969138221056 | 370 | 0 | RT @GovInslee: Washingtonians without health i... | True |
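Since every tweet is a dict, pandas can also build the data frame directly from the list of tweets, without preparing one list per field. A sketch with two invented tweets that carry only a few of the keys listed earlier; the ids and texts are made up for illustration:

```python
import pandas as pd

# two hypothetical tweets, reduced to a few of the keys listed above
fakeTweets = [
    {'created_at': 'Tue Dec 06 00:05:12 +0000 2022', 'id': 1,
     'full_text': 'Some original text', 'retweet_count': 0, 'favorite_count': 2},
    {'created_at': 'Mon Dec 05 23:59:49 +0000 2022', 'id': 2,
     'full_text': 'RT @someone: some text', 'retweet_count': 9, 'favorite_count': 0},
]

# each dictionary becomes a row, each key a column; keep only the wanted ones
wanted = ['created_at', 'id', 'retweet_count', 'favorite_count', 'full_text']
fakeDF = pd.DataFrame(fakeTweets)[wanted]

# the retweet flag, computed on the whole column at once
fakeDF['rtw'] = fakeDF['full_text'].str.startswith('RT')
print(fakeDF)
```

Both routes should lead to the same table; the lists make each step visible, while this one is shorter.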


You can find out how many are retweets and how many are not:

In [66]:
tweetsAsDF[tweetsAsDF['rtw']==False].shape

(44, 6)

In [68]:
tweetsAsDF[tweetsAsDF['rtw']==True].shape

(56, 6)
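The same counts can be obtained in one step. A sketch with an invented column of flags standing in for `rtw`, using `value_counts` and the fact that a Boolean column can be summed:

```python
import pandas as pd

# hypothetical flags, standing in for the 'rtw' column
toyTweets = pd.DataFrame({'rtw': [False, True, True, False, True]})

# how many times each value appears
counts = toyTweets['rtw'].value_counts()
print(counts)

# a Boolean column can be summed: True counts as 1
howManyRTs = toyTweets['rtw'].sum()
print(howManyRTs)
```

With the real data, `tweetsAsDF['rtw'].value_counts()` should report the same two numbers shown above.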

[home](#home)