# Session 8: Strings, Queries and APIs

*Nicklas Johansen*

## Recap (1:2) 

We can think of there as being two 'types' of plots:
- **Exploratory** plots: Figures for understanding data
    - Quick to produce $\sim$ minimal polishing
    - Interesting features may be implied by the producer
    - Be careful showing these out of context
- **Explanatory** plots: Figures to convey a message
    - Polished figures
    - Direct attention to interesting feature in the data
    - Minimize risk of misunderstanding

There exist several packages for plotting.  Some popular ones:
- `Matplotlib` is good for customization (explanatory plots). Might take a lot of time when customizing!
- `Seaborn` and `Pandas` are good for quick and dirty plots (exploratory)

## Recap (2:2) 

We need to put a lot of thought into how to present data.

In particular, one must consider the *type* of data that is to be presented:


- One variable:
    - Categorical: Pie charts, simple counts, etc.
    - Numeric: Histograms, distplot (/cumulative), boxplot in seaborn


- Multiple variables:
    - `scatter` (matplotlib) or `jointplot` (seaborn) for (i) simple descriptives when (ii) both variables are numeric and (iii) there are not too many observations
    - `lmplot` or `regplot` (seaborn) when you also want to fit a linear model
    - `barplot` (matplotlib), `catplot` and `violinplot` (both seaborn) when one or more variables are categorical
    - The option `hue` allows you to add a "third" categorical dimension... use with care
    - Lots of other plot types and options. Go explore yourself!
    

- When you just want to explore: `pairplot` (seaborn) plots all pairwise correlations

## Agenda

In this session, we will work with strings, requests and APIs:
- Text as Data
- Key Based Containers
- Interacting with the Web
- Leveraging APIs

# Text as Data



## Why Text Data

Data is everywhere... and collection is picking up speed! 
- Personal devices and [what we have at home](https://www.nytimes.com/wirecutter/blog/amazons-alexa-never-stops-listening-to-you/)
- Online: news websites, Wikipedia, social media, blogs, document archives 

Working with text data opens up interesting new avenues for analysis and research. Some cool examples:
  - Text analysis, topic modelling and monetary policy:
      - [Transparency and shifts in deliberation about monetary policy](https://sekhansen.github.io/pdf_files/qje_2018.pdf)
      - [Narrative signals about uncertainty in inflation reports drive long-run outcomes](https://sekhansen.github.io/pdf_files/jme_2019.pdf)
  - [More partisanship (polarization) in congressional speeches](https://www.brown.edu/Research/Shapiro/pdfs/politext.pdf)


## How Text Data

Data from the web often comes as HTML or in another text format.

In this course, you will get tools to do basic work with text as data.

However, in order to do that, we first need to:

- learn how to manipulate and save strings
- save our text data in smart ways (JSON)
- interact with the web
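
As a first taste of string manipulation, here is a minimal sketch using only built-in `str` methods. The input string `raw` is made up for illustration.

```python
# Hypothetical raw text, e.g. a messy field copied from a web page
raw = "  Data Science; Python ;APIs  "

# Remove leading and trailing whitespace
cleaned = raw.strip()

# Split on ';' and normalise each piece (trim whitespace, lower-case)
parts = [piece.strip().lower() for piece in cleaned.split(";")]

# Join the pieces back into one string with a new separator
joined = ", ".join(parts)

# Replace a substring
replaced = joined.replace("apis", "web apis")

print(parts)
print(joined)
print(replaced)
```

Strings are immutable, so each method returns a new string rather than changing `raw` in place.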

In [None]:
# DST
# Scraping

# Key Based Containers

## Containers Recap (1:2)

*What are containers? Which have we seen?*

Sequential containers:
- `list` which we can modify (**mutable**).
    - useful to collect data on the go
- `tuple` which is **immutable** after initial assignment
     - tuples are faster as they can do fewer things
- `array` 
    - which is mutable in content (i.e. we can change elements)
    - but immutable in size
    - great for data analysis
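
The difference between mutable and immutable containers can be sketched as follows. Only `list` and `tuple` are shown, so that the example runs without extra packages; the variable names are chosen for illustration.

```python
# Lists are mutable: they can grow and elements can be reassigned
my_list = [1, 2, 3]
my_list.append(4)
my_list[0] = 10

# Tuples are immutable: item assignment raises a TypeError
my_tuple = (1, 2, 3)
try:
    my_tuple[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(my_list)
print(my_tuple, tuple_changed)
```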

## Containers Recap (2:2)

Non-sequential containers:
- Dictionaries (`dict`) which are accessed by keys (immutable objects).
- Sets (`set`) where elements are
    - unique (no duplicates) 
    - not ordered
    - disadvantage: cannot access specific elements by position (no indexing)!
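
A short sketch of these properties, with made-up example data: sets drop duplicates and support fast membership tests but no indexing, and dictionary keys must be immutable (hashable) objects.

```python
names = ["Nicklas", "Jacob", "Nicklas", "Britta"]

# Converting to a set removes the duplicate entry
unique_names = set(names)
has_jacob = "Jacob" in unique_names

# Sets cannot be indexed by position
try:
    unique_names[0]
    indexable = True
except TypeError:
    indexable = False

# A tuple is immutable and works as a dictionary key ...
coordinates = {(55.68, 12.57): "Copenhagen"}

# ... whereas a list is mutable and is rejected as a key
try:
    bad = {[55.68, 12.57]: "Copenhagen"}
    list_key_ok = True
except TypeError:
    list_key_ok = False

print(sorted(unique_names), has_jacob, indexable, list_key_ok)
```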

## Dictionaries Recap (1:2)

*How did we make a container which is accessed by arbitrary keys?*

By using a dictionary, `dict`. Simple way of constructing a `dict`:

In [18]:
my_dict = {'Nicklas': 'Programmer',
           'Jacob': 'Political Scientist',
           'Preben': 'Executive',
           'Britta': 'Accountant'}

my_dict

{'Nicklas': 'Programmer',
 'Jacob': 'Political Scientist',
 'Preben': 'Executive',
 'Britta': 'Accountant'}

In [19]:
print(my_dict['Nicklas'])

Programmer


In [20]:
my_new_dict = {}
for a in range(0,10):
    # The values are squares (a**2), so the keys are named accordingly
    my_new_dict[f"square{a}"] = a**2
    
print(my_new_dict['square1'])

my_new_dict

1


{'square0': 0,
 'square1': 1,
 'square2': 4,
 'square3': 9,
 'square4': 16,
 'square5': 25,
 'square6': 36,
 'square7': 49,
 'square8': 64,
 'square9': 81}

## Dictionaries Recap (2:2)

Dictionaries can also be constructed from two associated lists. These are tied together with the `zip` function. Try the following code:

In [21]:
keys = ['a', 'b', 'c']
values = range(2,5)

key_value_pairs = list(zip(keys, values))
print(key_value_pairs) #Print as a list of tuples

[('a', 2), ('b', 3), ('c', 4)]


In [22]:
my_dict2 = dict(key_value_pairs)
print(my_dict2) #Print dictionary

{'a': 2, 'b': 3, 'c': 4}


In [23]:
print(my_dict2['a']) #Fetch the value associated with 'a'

2
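
A few further dictionary idioms that are handy when working with API data, as a minimal sketch; the names `letters`, `squares` and `doubled` are chosen for illustration.

```python
keys = ["a", "b", "c"]
values = range(2, 5)

# zip pairs the two sequences; dict turns the pairs into key-value entries
letters = dict(zip(keys, values))

# A dict comprehension builds keys and values in one expression
squares = {f"square{n}": n**2 for n in range(4)}

# .get returns a default instead of raising KeyError for a missing key
missing = letters.get("z", 0)

# .items() iterates over key-value pairs
doubled = {key: 2 * value for key, value in letters.items()}

print(letters)
print(squares)
print(missing, doubled)
```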


## Storing Containers

*Does there exist a file format for easy storage of containers?*

Yes, the JSON file format.
- Can store lists and dictionaries.
- Syntax is close to that of Python lists and dictionaries, but strings (including dictionary keys) must be in double quotation marks. 
    - Example: `'{"a":1,"b":1}'`
    

*Why is JSON so useful?*

- Standard format that looks almost exactly like Python.
- Extreme flexibility:
    - Can hold any list or dictionary of any depth which contains only float, int, str (plus booleans and `null`, Python's `None`).
    - Does not directly store other Python objects (e.g. dates or sets), but normally holds any structured data.
        - Extension to spatial data: GeoJSON
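
Storing and restoring a container can be sketched with the standard `json` module. The dictionary contents and the file name `my_file.json` are made up for illustration, and a temporary directory is used so that nothing is left behind on disk.

```python
import json
import os
import tempfile

my_dict = {"Nicklas": "Programmer", "scores": [1, 2.5, 3], "active": True, "office": None}

# dumps: Python object -> JSON string; loads: JSON string -> Python object
json_string = json.dumps(my_dict)
restored = json.loads(json_string)

# dump/load do the same, but write to and read from a file
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "my_file.json")
    with open(path, "w") as f:
        json.dump(my_dict, f)
    with open(path) as f:
        from_file = json.load(f)

# Tuples have no JSON counterpart and come back as lists
tuple_roundtrip = json.loads(json.dumps((1, 2)))

print(json_string)
print(restored == my_dict, from_file == my_dict, tuple_roundtrip)
```

Note how `True` and `None` are written as `true` and `null` in the JSON string.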

# Interacting with the Web

## The Internet as Data (1:2)

When we surf around the internet we are exposed to a wealth of information.

- What if we could take this and analyze it?   


Well, we can. And we will.   
Examples: Facebook, Twitter, Reddit, Wikipedia, Airbnb etc.

## The Internet as Data (2:2)

There are two ways to get the data:

- Sometimes we get lucky and the data is served to us through an `API`.
- Otherwise, the data can be extracted using `web scraping`.

## Web Interactions

In the words of Gazarov (2016): The web can be seen as a large network of connected servers
- A page on the internet is stored somewhere on a remote server
    - Remote server $\sim$ remotely located computer that is optimized to process requests
    
    
- When accessing a web page through a browser:
    - Your browser (the *client*) sends a request to the website's server
    - The server then sends code back to the browser
    - This code is interpreted by the browser and displayed


- Websites come in the form of HTML, whereas APIs return only data (often in *JSON* format) without presentational overhead

## The Web Protocol
*What is `http` and where is it used?*

- `http` stands for HyperText Transfer Protocol.
- `http` is the protocol used to transmit data when a webpage is visited:
   - the visiting client sends a request for a URL or object;
   - the server returns the relevant data if it is active.


*Should we care about `http`?*

- In this course we ***do not*** care explicitly about `http`. 
- We use a Python module called `requests` as an `http` interface.
- However... Some useful advice - you should **always**:
  - use the encrypted version, `https`;
  - use an authenticated connection, i.e. private login, whenever possible.
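
The request/response cycle can be sketched entirely offline by starting a tiny server on your own machine and querying it. This is only an illustration: the handler `DemoHandler`, the path `/demo` and the returned JSON are made up, and the standard library (`http.server`, `http.client`) is used instead of `requests` so the example runs without internet access. With `requests`, the client side would typically be `requests.get(url)` followed by `.json()` on the response.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server: answers every GET request with a small JSON document."""

    def do_GET(self):
        body = json.dumps({"path": self.path, "message": "hello"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default request logging


# Port 0 lets the operating system pick a free port
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client sends a request and reads the server's response
conn = HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("GET", "/demo")
response = conn.getresponse()
status = response.status
content_type = response.getheader("Content-Type")
data = json.loads(response.read().decode("utf-8"))
conn.close()

# Stop the server again
server.shutdown()
server.server_close()

print(status, content_type, data)
```

The status code `200` signals success; the body is plain text that we parse into a dictionary.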

## Markup Language
*What is `html` and where is it used?*

- HyperText Markup Language
- `html` is a language for communicating what a webpage looks like and how it behaves.
  - That is, `html` contains: content, design, available actions.

*Should we care about `html`?*

- Yes, `html` is often where the interesting data can be found.
- Sometimes, we are lucky, and instead of `html` we get a JSON in return. 
- Getting data from `html` will be the topic of the upcoming scraping session.

# Leveraging APIs 

## Web APIs (1:4)
*So when do we get lucky, i.e. when is `html` not important?*

- When we get an Application Programming Interface (`API`) on the web
- What does this mean?
  - We send a query to the Web API 
  - We get a response from the Web API with data back in return, typically as JSON.
  - The API usually provides access to a database or some service

## Web APIs (2:4)
*So where is the API?*

- Usually on separate sub-domain, e.g. `api.github.com`
- Sometimes hidden in code (upcoming scraping session) 

*So how do we know how the API works?*

- There usually is some documentation. E.g. google ["api github com"](https://www.google.com/search?q=api+github)

## Web APIs (3:4)
*So is data free?*

- Most commercial APIs require authentication and have limited free usage
  - e.g. Twitter, Google Maps, weather services, etc.
  

- Some open APIs that are free
  - Danish 
    - Danish statistics (DST)
    - Danish weather data (DMI)
    - Danish spatial data (DAWA, Danish addresses) 
  - Global
      - OpenStreetMap, Wikipedia
      

- If no authentication is required, the API may be rate limited.
  - This means only a certain number of requests can be handled per second or per hour from a given IP address.
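
One simple way to stay within such a limit is to pause between calls. The helper below is a hypothetical sketch, not part of any library: `fetch` stands for whatever function performs the actual request, and `min_interval` is the assumed minimum number of seconds between two calls.

```python
import time


def fetch_all(urls, fetch, min_interval=1.0):
    """Call fetch(url) for each URL, waiting at least min_interval seconds between calls."""
    results = []
    last_call = None
    for url in urls:
        if last_call is not None:
            # Sleep for whatever remains of the interval since the previous call
            wait = min_interval - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)
        last_call = time.monotonic()
        results.append(fetch(url))
    return results


# Usage with a stand-in for a real request function
start = time.monotonic()
output = fetch_all(["a", "b", "c"], str.upper, min_interval=0.05)
elapsed = time.monotonic() - start
print(output, round(elapsed, 2))
```

The actual limits differ between APIs, so check the documentation of the API you are using.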

## Web APIs (4:4)
*So how do we make the URLs?*

- An `API` query is a URL consisting of:
  - Server URL, e.g. `https://api.github.com`
  - Endpoint path, `/users/isdsucph/repos`
  
We can convert a JSON string to Python objects (lists and dictionaries) with `json.loads`.
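
Building a query URL and parsing a response can be sketched as follows. The query parameters `per_page` and `sort` are assumptions for illustration (check the API documentation), and `response_text` is a made-up stand-in for the text a server would return, so the example runs without internet access.

```python
import json
from urllib.parse import urlencode, urljoin

server_url = "https://api.github.com"
endpoint = "/users/isdsucph/repos"

# Assumed query parameters; they are appended after a '?' as key=value pairs
params = {"per_page": 5, "sort": "updated"}

url = urljoin(server_url, endpoint) + "?" + urlencode(params)
print(url)

# Made-up response text, standing in for what the server would send back
response_text = '[{"name": "repo-a", "stargazers_count": 3}, {"name": "repo-b", "stargazers_count": 0}]'

# loads turns the JSON string into a list of dictionaries
repos = json.loads(response_text)
repo_names = [repo["name"] for repo in repos]
print(repo_names)
```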

## File Handling
*How can we remove a file?*

The module `os` can do a lot of file handling tasks, e.g. removing files:

In [24]:
import os
# os.remove raises FileNotFoundError if the file is missing, so check first
if os.path.exists('my_file.json'):
    os.remove('my_file.json')

# Associated Readings

PDA:
- Section 2.3: How to work with strings in Python
- Section 3.3: Opening text files, interpreting characters
- Section 6.1: Opening and working with CSV files
- Section 6.3: Intro to interacting with APIs
- Section 7.3: Manipulating strings

Gazarov (2016): "What is an API? In English, please."
- Excellent and easily understood intro to the concept
- Examples of different 'types' of APIs
- Intro to the concepts of servers, clients and HTML

# session_8_exercises.ipynb
Will be uploaded to GitHub.
- Method 1: sync your cloned repo
- Method 2: download from git repo

**Remember** to create a local copy of the notebook