<center><img src="https://github.com/DACSS-PreProcessing/session1_main/blob/main/pics/LogoSimple.png?raw=true" width="900"></center>


# Data Collection in Python

### Prof. José Manuel Magallanes, PhD 


<a id='beginning'></a>

Let me share some code to get data, and show you how to deal with the following cases:

1. [Proprietary software.](#part1) 
2. [Ad-hoc collection.](#part2) 
3. [Use of APIs.](#part3) 
4. [Scraping webpages.](#part4) 


Remember that the location of your files is extremely important. If you have created a folder named "myProject", and your code is saved in that folder, then "myProject" is your root folder. In general, you will create other folders inside the root. One of those will be for your data. In any case, you should become familiar with some important commands in Python:

In [None]:
# where is my current code located?
import os
os.getcwd()

The command above gave you your current location. If it is not what you expected, you can change it with another command, which needs the path of the folder you want to move to:

In [None]:
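# os.chdir needs a path as its argument; os.getcwd() is only a placeholder
# here (it keeps you where you are), so replace it with the folder you want:
os.chdir(os.getcwd())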

The location returned by **os.getcwd** is a path. As you see, a path uses a _separator_ ( \, \\, /, //) between folder names. This separator may vary depending on what platform (operating system) you have. To avoid typing the separator yourself, you simply write:

In [None]:
folder="data"
fileName="anes_timeseries_2012.dta"
fileToRead=os.path.join(folder,fileName)
fileToRead

The object _fileToRead_ has the right name of the path, because **os.path.join**  creates a path using the **right** _separators_.
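
Let me add a small sketch of what happens behind that call. It only uses the standard library, and it reuses the folder and file names from the cell above. The **pathlib** alternative is my addition, not something the rest of this notebook relies on:

```python
import os
from pathlib import Path

folder = "data"
fileName = "anes_timeseries_2012.dta"

# os.path.join inserts os.sep, the separator of your operating system
joined = os.path.join(folder, fileName)

# pathlib builds the same relative path with the "/" operator
asPath = Path(folder) / fileName

print(joined)
print(str(asPath) == joined)

# the pieces can be recovered without splitting on a separator yourself
print(os.path.basename(joined))   # the file name
print(os.path.dirname(joined))    # the folder
```

Either approach should keep your code working when a collaborator runs it on a different operating system.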


____


<a id='part1'></a>
## Collecting data from proprietary software

Let's start with data from SPSS and STATA, very common in public policy schools. To work with these kinds of files, we will simply use *pandas*. 

In [None]:
import pandas as pd

I am using a file from the American National Election Studies (ANES). This is a rather big file, so let me select just two variables, "libcpre_self" and "libcpo_self". These are a pair of pre- and post-election questions asking respondents to place themselves on a seven-point scale ranging from ‘extremely liberal’ to ‘extremely conservative’. I will create a data frame with them:

In [None]:
varsOfInterest=["libcpre_self","libcpo_self"]

Getting a Stata file into pandas is quite easy:

In [None]:
import os
folder="data"
fileName="anes_timeseries_2012.dta"
fileToRead=os.path.join(folder,fileName)
dataStata=pd.read_stata(fileToRead,columns=varsOfInterest)

In [None]:
dataStata.head()

Pandas can also read SPSS files, as long as you have installed **pyreadstat** beforehand:

In [None]:
# pip install pyreadstat

In [None]:
fileName="anes_timeseries_2012.sav"
fileToRead=os.path.join(folder,fileName)

dataSpss=pd.read_spss(fileToRead,usecols=varsOfInterest)
dataSpss.head()
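
If you do not have the ANES file at hand, the following sketch lets you try the same idea on a tiny made-up file. The values and the column "other_var" are hypothetical. The point is only to show that the **columns** argument of **pd.read_stata** returns just the variables you ask for:

```python
import os
import tempfile
import pandas as pd

# made-up data; 'other_var' stands for the many columns you do not need
toy = pd.DataFrame({
    "libcpre_self": [1, 4, 7],
    "libcpo_self": [2, 4, 6],
    "other_var": [0.5, 1.5, 2.5],
})

with tempfile.TemporaryDirectory() as tmpFolder:
    # write a small Stata file into a temporary folder
    toyFile = os.path.join(tmpFolder, "toy.dta")
    toy.to_stata(toyFile, write_index=False)

    # read it back, keeping only the requested columns
    subset = pd.read_stata(toyFile, columns=["libcpre_self", "libcpo_self"])

print(subset.shape)
print(list(subset.columns))
```

Selecting columns at reading time is a reasonable habit with big survey files, since you avoid building a data frame with variables you will drop anyway.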

[Go to page beginning](#beginning)

_____

<a id='part2'></a>

## Collecting your ad-hoc data

Let me assume you are answering some questions in this [form](https://forms.gle/6Ga61ifjCjo3j3YF6).

After completing the survey, your answers are saved in a spreadsheet, which you should publish as a CSV file. Then, you can read it like this:

In [None]:
import pandas as pd

link='https://docs.google.com/spreadsheets/d/1MrWp0kVtNGzTOmhuexjsIaZ9HpSv2X8RZsZ-KuEC4A8/pub?gid=1044145359&single=true&output=csv'

myData = pd.read_csv(link)

# here it is:
myData.head()
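
If reading the link fails, it may help to check how the link is composed. The sketch below takes the same link apart with the standard library. My reading of the parameters (that "output=csv" requests the CSV format and "gid" identifies one sheet of the spreadsheet) is an assumption based on the link itself:

```python
from urllib.parse import urlsplit, parse_qs

link = 'https://docs.google.com/spreadsheets/d/1MrWp0kVtNGzTOmhuexjsIaZ9HpSv2X8RZsZ-KuEC4A8/pub?gid=1044145359&single=true&output=csv'

parts = urlsplit(link)
params = parse_qs(parts.query)   # each value comes as a list of strings

print(parts.netloc)
print(params)

# a quick sanity check before handing the link to pd.read_csv
isCsv = params.get("output") == ["csv"]
print(isCsv)
```

A link copied from the browser address bar, instead of the one produced when publishing the sheet, would typically lack the "output" parameter, and this check would tell you so.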

[Go to page beginning](#beginning)

-----

<a id='part3'></a>

## Collecting data from APIs

There are organizations, public and private, that have an open data policy that allows people to access their repositories dynamically. You can get that data in CSV format if available, but the data is usually offered in XML or JSON format, which are data containers that store data in an *associative array* structure. Let me get the data about [Fire 9-1-1 calls from Seattle](https://dev.socrata.com/foundry/data.seattle.gov/kzjm-xkqj):

In [None]:
import requests

# where is it online?
url = "https://data.seattle.gov/resource/kzjm-xkqj.json"

# Go for the data:
response = requests.get(url)

# Stop with an error if the request failed; otherwise parse the JSON:
response.raise_for_status()
data911 = response.json()

In [None]:
len(data911)

In [None]:
# You can turn it easily into a pandas data frame:
import pandas as pd
data911DF=pd.DataFrame(data911)

In [None]:
# here you are...
data911DF
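
To see why the conversion to a data frame is so direct, here is a minimal sketch with made-up text that imitates the shape of an API response. These are not real records, and the keys are only illustrative. A JSON array becomes a Python list, and each JSON object becomes a dictionary (the *associative array*):

```python
import json

# made-up text imitating the shape of an API response (not real records)
rawText = '''[
  {"type": "Aid Response", "incident_number": "F000000001"},
  {"type": "Medic Response", "incident_number": "F000000002", "note": "extra key"}
]'''

records = json.loads(rawText)      # JSON array -> Python list

print(type(records).__name__, len(records))
print(type(records[0]).__name__)   # each JSON object -> Python dict
print(records[0]["type"])          # values are looked up by key

# records need not share the same keys; collect every key seen
allKeys = sorted({key for rec in records for key in rec})
print(allKeys)
```

When a list like this is given to **pd.DataFrame**, every key becomes a column, and a record that lacks a key gets a missing value in that column.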

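The **requests.get** call returned only 1000 rows, which is the default limit of this API. To request more rows, you may try the **sodapy** client: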

In [None]:
# pip install sodapy

In [None]:
from sodapy import Socrata

client = Socrata("data.seattle.gov", None)
results = client.get("kzjm-xkqj", limit=3000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

In [None]:
results_df.shape
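
If you prefer to stay with **requests**, the same API accepts the query parameters **$limit** and **$offset**, so you can ask for the rows page by page. The sketch below only builds the links and does not download anything. The function name **pagedUrls** and the page size are my own choices for illustration:

```python
from urllib.parse import urlencode

baseUrl = "https://data.seattle.gov/resource/kzjm-xkqj.json"

def pagedUrls(base, totalRows, pageSize=1000):
    """Return one URL per page, using the $limit and $offset parameters."""
    urls = []
    for offset in range(0, totalRows, pageSize):
        # the last page may be smaller than pageSize
        limit = min(pageSize, totalRows - offset)
        # safe="$" keeps the dollar sign readable instead of encoding it
        query = urlencode({"$limit": limit, "$offset": offset}, safe="$")
        urls.append(base + "?" + query)
    return urls

urls = pagedUrls(baseUrl, totalRows=2500)
for u in urls:
    print(u)
```

Each link could then be passed to **requests.get**, and the resulting lists joined before creating the data frame. When paging like this, you would typically also want to request a stable ordering of the rows, so that pages do not overlap.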

[Go to page beginning](#beginning)

_____

<a id='part4'></a>

## Collecting data by scraping

We are going to get the data from a table in this [wikipage](https://en.wikipedia.org/wiki/List_of_freedom_indices).

In [None]:
# pip install beautifulsoup4 html5lib lxml

In [None]:
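# the wikipage linked above:
wikiLink="https://en.wikipedia.org/wiki/List_of_freedom_indices"

# keep only tables of class "wikitable" whose text includes 'Score':
wikiTables=pd.read_html(io=wikiLink,flavor='bs4',attrs={"class":"wikitable"},match='Score')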

In [None]:
#How many are there?
len(wikiTables)

In [None]:
# So, I just pick the one I need:
wikiTable=wikiTables[0]

In [None]:
# what do you have:
wikiTable
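
To give you an intuition of what the **attrs** and **match** arguments are filtering, here is a rough sketch that uses only the standard library on a made-up piece of HTML. This is not how pandas works internally, and the class **WikiTableParser** is a hypothetical helper written for this illustration. It ignores complications such as nested tables:

```python
from html.parser import HTMLParser

# made-up HTML imitating a page with one relevant table (not the real wikipage)
html = """
<table class="wikitable">
  <tr><th>Country</th><th>Score</th></tr>
  <tr><td>A</td><td>90</td></tr>
  <tr><td>B</td><td>75</td></tr>
</table>
<table class="other"><tr><td>ignore me</td></tr></table>
"""

class WikiTableParser(HTMLParser):
    """Collect the rows of every <table> whose class includes 'wikitable'."""
    def __init__(self):
        super().__init__()
        self.tables = []      # list of tables, each one a list of rows
        self.inTable = False  # are we inside a table we care about?
        self.inCell = False   # are we inside a <th> or <td>?

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            self.inTable = "wikitable" in classes
            if self.inTable:
                self.tables.append([])
        elif self.inTable and tag == "tr":
            self.tables[-1].append([])
        elif self.inTable and tag in ("th", "td"):
            self.inCell = True
            self.tables[-1][-1].append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.inTable = False
        elif tag in ("th", "td"):
            self.inCell = False

    def handle_data(self, data):
        # text is stored only when it belongs to a cell of a relevant table
        if self.inTable and self.inCell:
            self.tables[-1][-1][-1] += data.strip()

parser = WikiTableParser()
parser.feed(html)

# keep only the tables whose text mentions 'Score', similar to match='Score'
matching = [t for t in parser.tables
            if any("Score" in cell for row in t for cell in row)]
print(len(matching))
print(matching[0])
```

In practice **pd.read_html** does all of this for you and returns data frames, so the sketch is only meant to show why the table class and a word from the table are enough to locate it.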

### Reminder for Deliverable 1:

Start collecting the data you will use in the course. If you have the files, save them in a folder inside your repo. If the files are above 100 MB, you may try Amazon Web Services (AWS) instead, or use Git LFS. There might also be cases where you decide not to commit/push your local data files.
