# Data Gathering

This notebook covers strategies you can use to gather data for an analysis. 

If you would rather start by working on data analyses with the provided data, you can move on to the next tutorials and come back to this one later.

<div class="alert alert-success">
Data Gathering is simply the process of collecting your data together.
</div>

It can encompass anything from launching a data collection project to web scraping, pulling from a database, or downloading data in bulk.

It might even include simply calling someone to ask if you can use some of their data. 

## Where to get Data

### The Web 

The web is full of data and ways to get it. Websites may host **data repositories** from which you can download data, offer **APIs** through which you can request specific data from particular applications, or present data directly on their pages, in which case you can use **web scraping** to extract it.

### Other than the Web

Not all data is indexed or accessible on the web, at least not publicly. 

Sometimes finding data means chasing it down wherever it might be.

If there is some particular data you need, you can try to figure out who might have it, and get in touch to see if it might be available.

### Data Gathering Skills
Depending on your gathering method, you will likely have to do some combination of the following:
- Download data files from repositories
- Read data files into Python
- Use APIs 
- Query databases
- Call someone and ask them to send you a hard drive
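
As a rough illustration of the "read data files into Python" step, here is a minimal sketch using only the standard library `csv` module. The file contents, column names, and the `read_rows` helper are made up for this example; `io.StringIO` stands in for a real file opened with `open()`. In practice, `pandas.read_csv` is a common alternative.

```python
import csv
import io

# Toy file contents (made up); a real file would live on disk
raw_text = "name,value\na,1\nb,2\n"


def read_rows(file_obj):
    """Read CSV rows from an open file object into a list of dictionaries."""
    reader = csv.DictReader(file_obj)
    return [dict(row) for row in reader]


# io.StringIO behaves like an open text file holding raw_text
rows = read_rows(io.StringIO(raw_text))
print(rows)
```

Note that the `csv` module reads every value as a string, so numeric columns need to be converted before analysis.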

## Data Repositories

<div class="alert alert-success">
A Data Repository is basically just a place where data is stored. For our purposes, it is a place you can download data from. 
</div>

<div class="alert alert-info">
A good list of available data repositories: http://www.kdnuggets.com/datasets/index.html
</div>
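
Downloading a file from a repository can be scripted. The sketch below is one possible approach using only the standard library; the `download_file` helper is a hypothetical name introduced here, and you would need to supply the URL of a real data file from the repository you are using.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, dest):
    """Download the resource at `url` and save it to the path `dest`."""
    dest = Path(dest)
    # Stream the response straight to disk rather than holding it in memory
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return dest

# Example call (the URL is a placeholder, not a real dataset):
# download_file("https://example.com/some_dataset.csv", "some_dataset.csv")
```

Streaming the response to disk keeps memory use low, which matters when downloading data in bulk.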

### DATA.GOV

<div class="alert alert-success">
Data.gov is the US Federal Government's publicly available data repository. It hosts many different datasets, all available for download. 
</div>

<div class="alert alert-info">
Check out what data is available on data.gov here: https://www.data.gov/
</div>

<br>
<img src="img/data_gov.png" alt="gov_dat" height="500" width="750">
<br>

### San Diego City Data

<div class="alert alert-success">
The City of San Diego also has a data repository for publicly available San Diego city data. 
</div>

<div class="alert alert-info">
Check out what data is available from the city of San Diego here: https://data.sandiego.gov/
</div>

<br>
<img src="img/sd_data.png" alt="sd_dat" height="500" width="750">
<br>

## Databases

<div class="alert alert-success">
A database is an organized collection of data. More formally, 'database' refers to a set of related data, and the way it is organized. 
</div>

### Structured Query Language - SQL

<div class="alert alert-success">
SQL (pronounced 'sequel') is a language used to 'communicate' with databases, making queries to request particular data from them.
</div>

<div class="alert alert-info">
There is a useful introduction and tutorial to SQL
<a href=http://www.sqlcourse.com/intro.html>here</a>
as well as some useful 'cheat sheets' 
<a href=http://www.cheat-sheets.org/sites/sql.su/>here</a>
and
<a href=http://www.sqltutorial.org/wp-content/uploads/2016/04/SQL-cheat-sheet.pdf>here</a>.
</div>

SQL is the standard, and most popular, way to interface with relational databases.

Note: None of the rest of the tutorials presume or require any knowledge of SQL. 

You can look into it if you want, or if it is relevant to accessing some data you want to analyze, but it is not required for this set of tutorials. 
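
If you are curious what a query looks like, here is a minimal sketch using Python's built-in `sqlite3` module with a temporary in-memory database. The table name, column names, and values are made up for illustration.

```python
import sqlite3

# An in-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")

# Create a toy table and add a few made-up rows
conn.execute("CREATE TABLE measurements (subject TEXT, score REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("s01", 0.5), ("s02", 0.9), ("s03", 0.7)],
)

# Ask for the subjects whose score is above a threshold, highest first
query = "SELECT subject, score FROM measurements WHERE score > ? ORDER BY score DESC"
results = conn.execute(query, (0.6,)).fetchall()
conn.close()

print(results)
```

The `?` placeholders let the database driver fill in values safely, rather than building the query string by hand.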

## Application Programming Interfaces (APIs)

<div class="alert alert-success">
APIs are basically a way for software to talk to software: an API is an interface into an application / website / database designed for software.
</div>

<div class="alert alert-info">
For a simple explanation of APIs go
<a href=https://medium.freecodecamp.com/what-is-an-api-in-english-please-b880a3214a82>here</a>
or for a much broader, more technical, overview try
<a href=https://medium.com/@mattburgess/apis-a-basic-primer-f8250602597d>here</a>.
</div>

APIs offer a lot of functionality - you can send requests to the application to perform all kinds of actions. In fact, any application interface that is designed to be used programmatically is an API, including, for example, interfaces for using packages of code. 

One of the many things that APIs offer is a way to query and access data from particular applications / databases. The benefit of using APIs for data gathering is that they typically return data in nicely structured formats that are relatively easy to analyze.
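
To give a sense of what "nicely structured" means, here is a small sketch that parses a JSON response body with the standard library `json` module. The response text and its field names are made up for illustration and do not come from any real API.

```python
import json

# Hypothetical API response body (made-up fields and values)
response_text = (
    '{"user": "example", '
    '"repos": [{"name": "alpha", "stars": 3}, {"name": "beta", "stars": 5}]}'
)

# json.loads turns the text into ordinary Python dictionaries and lists
data = json.loads(response_text)

# Once parsed, fields can be pulled out by name
repo_names = [repo["name"] for repo in data["repos"]]
total_stars = sum(repo["stars"] for repo in data["repos"])

print(repo_names, total_stars)
```

Because the structure is explicit, no text processing is needed to find the values of interest.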

### Launching URL Requests from Python

In [1]:
# Imports
#  requests lets you make HTTP requests from Python
import requests
import pandas as pd

In practice, APIs are usually special URLs that return raw data (JSON or XML) as opposed to a web page to be rendered for human viewers (HTML). Find the documentation for a particular API to see how to send requests for the data you want. For example, let's try the GitHub API. 

In [2]:
# Request data from the GitHub API on a particular user
page = requests.get('https://api.github.com/users/tomdonoghue')

In [3]:
# The content we get back is JSON, returned as raw bytes
page.content

b'{"login":"TomDonoghue","id":7727566,"avatar_url":"https://avatars3.githubusercontent.com/u/7727566?v=3","gravatar_id":"","url":"https://api.github.com/users/TomDonoghue","html_url":"https://github.com/TomDonoghue","followers_url":"https://api.github.com/users/TomDonoghue/followers","following_url":"https://api.github.com/users/TomDonoghue/following{/other_user}","gists_url":"https://api.github.com/users/TomDonoghue/gists{/gist_id}","starred_url":"https://api.github.com/users/TomDonoghue/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/TomDonoghue/subscriptions","organizations_url":"https://api.github.com/users/TomDonoghue/orgs","repos_url":"https://api.github.com/users/TomDonoghue/repos","events_url":"https://api.github.com/users/TomDonoghue/events{/privacy}","received_events_url":"https://api.github.com/users/TomDonoghue/received_events","type":"User","site_admin":false,"name":"Tom","company":"UC San Diego","blog":"tomdonoghue.github.io","location":"San Dieg

In [4]:
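# We can read in the JSON data with pandas
#  Recent pandas versions expect a file-like object rather than a raw
#  JSON string, so we wrap the response text in StringIO
from io import StringIO
pd.read_json(StringIO(page.text), typ='series')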

avatar_url             https://avatars3.githubusercontent.com/u/77275...
bio                    Cognitive Science Grad Student @ UCSD. \r\nOn ...
blog                                               tomdonoghue.github.io
company                                                     UC San Diego
created_at                                          2014-05-28T20:20:48Z
email                                          thomasdonoghue@hotmail.ca
events_url             https://api.github.com/users/TomDonoghue/event...
followers                                                              2
followers_url          https://api.github.com/users/TomDonoghue/follo...
following                                                              9
following_url          https://api.github.com/users/TomDonoghue/follo...
gists_url              https://api.github.com/users/TomDonoghue/gists...
gravatar_id                                                             
hireable                                           

<div class="alert alert-info">
This link lists some commonly used APIs: http://www.webopedia.com/TERM/A/API.html
</div>
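
Many APIs let you refine a request by adding query parameters to the URL. The sketch below shows how such a URL is put together, using only the standard library. The base URL and parameter names are placeholders, and `build_url` is a hypothetical helper; a real API's documentation lists the parameters it accepts.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_url(base_url, params):
    """Append URL-encoded query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint and parameters, for illustration only
url = build_url("https://api.example.com/search", {"q": "data science", "per_page": 5})
print(url)

# The parameters can be recovered by parsing the URL again
print(parse_qs(urlsplit(url).query))
```

When using `requests`, the same thing can be done by passing a dictionary to the `params` argument of `requests.get`, which handles the encoding for you.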



## Web Scraping

<div class="alert alert-success">
Web scraping is when you (programmatically) extract data from websites.
</div>

<div class="alert alert-info">
More information on web scraping (wikipedia): https://en.wikipedia.org/wiki/Web_scraping
</div>

Web scraping is distinct from using an API, even though many APIs may be accessed over the internet. Web scraping is different in that you are (programmatically) navigating through the internet, and extracting data of interest. 
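
As a minimal sketch of the extraction step, the example below pulls the link targets out of a small piece of HTML using the standard library `html.parser` module. The HTML snippet and the `LinkCollector` class are made up for illustration; in a real project the HTML would come from a downloaded page, and many people use a dedicated parsing library instead.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Made-up page content standing in for a downloaded web page
html_text = """
<html><body>
  <h1>Datasets</h1>
  <a href="/data/one.csv">First</a>
  <p>Some text <a href="/data/two.csv">Second</a></p>
</body></html>
"""

parser = LinkCollector()
parser.feed(html_text)
print(parser.links)
```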

Note:
Be aware that scraping data from websites (without using APIs) can often be an involved project in itself - scraping sites can take a considerable amount of tuning to get the data you want. 

Be aware that data presented on websites may not be well structured, or in an organized format that lends itself to easy analysis.

If you try scraping websites, also make sure you are allowed to scrape the data, and follow the website's terms of service. 
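
One signal of what a site permits is its `robots.txt` file, which lists paths that automated clients are asked to avoid. The sketch below checks URLs against a set of rules using the standard library `urllib.robotparser` module. The rules and URLs here are made up; for a real site you would read the rules from that site's own `robots.txt`. This check does not replace reading the terms of service.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules, for illustration only
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

robot_parser = RobotFileParser()
robot_parser.parse(robots_lines)

# Ask whether a generic client ("*") may fetch each URL
allowed_public = robot_parser.can_fetch("*", "https://www.example.com/data/")
allowed_private = robot_parser.can_fetch("*", "https://www.example.com/private/page")

print(allowed_public, allowed_private)
```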