<img src="https://i.imgur.com/6U6q5jQ.png"/>


# Data Collection in Python

<a id='beginning'></a>
This session focuses on getting data. You will face a choice: collect data from repositories or similar sources, or collect your own data to answer an ad hoc research question. In the latter case you must decide whether you need a probabilistic or a non-probabilistic design, which in turn determines the next steps of your design.
In any case, the data you collect must be readable by R or Python, unless it is unsuitable for any kind of computational processing; in this unit, I assume it is suitable. If you collected the data yourself, a popular choice for recording observations is a spreadsheet, for example in Excel or Google Sheets. If you obtained data from another party, you may also receive spreadsheets, or files in more specialized formats such as SPSS or Stata. If you collected data from the web, you may be dealing with XML or JSON, or simply with text that has little structure. Let me show you how to deal with the following cases:

1. [Proprietary software.](#part1)
2. [Ad-hoc collection.](#part2)
3. [Use of APIs.](#part3)
4. [Scraping webpages.](#part4)


Remember that the location of your files is extremely important. If you have created a folder named "my project", your code should be in that folder, which I sometimes call the root folder, and your data in another folder inside that root folder. In any case, you should become familiar with some important commands from the **os** package:

In [82]:
import os

The two most important uses are finding out where you are and building the path to a file:

In [83]:
# where am I?
os.getcwd()

'/content'

If the file is in a folder inside your root folder, you simply write:

In [84]:
import os

folder="data"
fileName="anes_timeseries_2012.dta"
fileToRead=os.path.join(folder,fileName)

The object _fileToRead_ holds the correct path, because **os.path.join** builds a path from the elements between the parentheses. If you are using Windows, a folder on the "C" drive should be written like this:
os.path.join('c:/','folder1', 'folder2'). You can list several folders and **os.path.join** supplies the right separator between them; only on Windows do you need the extra 'c:/' element. If you want to know which separator your computer is using, type this:

In [85]:
os.path.sep

'/'
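
As a small sketch of the two ideas above, the following builds the same path by hand and with **os.path.join**, and then checks whether the file is actually there before any attempt to read it. The variable names `sameAsManual` and `fileExists` are only illustrative.

```python
import os

folder = "data"
fileName = "anes_timeseries_2012.dta"

# os.path.join inserts the separator of the current operating system
fileToRead = os.path.join(folder, fileName)

# building the path by hand with os.path.sep gives the same result
sameAsManual = (fileToRead == folder + os.path.sep + fileName)

# check that the file exists (relative to os.getcwd()) before reading it
fileExists = os.path.isfile(fileToRead)

print(os.getcwd())
print(fileToRead, sameAsManual, fileExists)
```

If `fileExists` is False, the usual suspects are a working directory different from the root folder or a misspelled folder name.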

Let's turn our attention to the file acquisition process.


____


<a id='part1'></a>
## Collecting data from proprietary software

Let's start with data from SPSS and Stata, both very common in public policy schools. To work with these kinds of files, we will simply use *pandas*.

In [86]:
import pandas as pd

I am using a file from the American National Election Studies (ANES). It is a rather big file, so let me select two variables, "libcpre_self" and "libcpo_self" (a pair of pre- and post-election questions asking respondents to place themselves on a seven-point scale ranging from ‘extremely liberal’ to ‘extremely conservative’), and create a data frame with them:

In [87]:
varsOfInterest=["libcpre_self","libcpo_self"]

Getting a Stata file into pandas is quite easy:

In [89]:
import os
folder="data"
fileName="anes_timeseries_2012.dta"
fileToRead=os.path.join(folder,fileName)
dataStata=pd.read_stata(fileToRead,columns=varsOfInterest)

ValueError: buffer is smaller than requested size

In [None]:
dataStata.head()
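
To see what the `columns` argument does without needing the ANES file, here is a minimal round trip on a toy data frame. The values and the extra variable `other_var` are invented for illustration; only the two variable names come from this session.

```python
import os
import tempfile
import pandas as pd

# toy data standing in for the ANES file (values are invented)
toy = pd.DataFrame({"libcpre_self": [1, 4, 7],
                    "libcpo_self": [2, 4, 6],
                    "other_var": [10, 20, 30]})

with tempfile.TemporaryDirectory() as tmpFolder:
    toyFile = os.path.join(tmpFolder, "toy.dta")
    # write a Stata file, leaving the row index out
    toy.to_stata(toyFile, write_index=False)
    # read back only the variables of interest
    toyStata = pd.read_stata(toyFile,
                             columns=["libcpre_self", "libcpo_self"])

print(toyStata.columns.tolist())
print(toyStata.shape)
```

Selecting columns at read time keeps the resulting data frame small, which matters for wide survey files.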

Opening SPSS files in pandas requires that you first install pyreadstat:

In [None]:
# do you have it?
!pip show pyreadstat

In [None]:
# Set up the file location:
fileName="anes_timeseries_2012.sav"
fileToRead=os.path.join(folder,fileName)

# Open it:
dataSpss=pd.read_spss(fileToRead)

In [None]:
dataSpss.head()

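Excel files are read in a similar way, but pandas needs openpyxl to open the .xlsx format. Check that you have it:

In [None]: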
pip show openpyxl

In [None]:
# Set up the file location:
fileName="HDI.xlsx"
fileToRead=os.path.join(folder,fileName)

# Open it:
dataExcel=pd.read_excel(fileToRead)

In [None]:
dataExcel.info()

[Go to page beginning](#beginning)

_____

<a id='part2'></a>

## Collecting your ad-hoc data

Let me assume you have collected some data using Google Forms. The answers to your form are saved in a spreadsheet, which you should publish as a CSV file. Then, you can read it like this:

In [104]:
import pandas as pd
link='https://docs.google.com/spreadsheets/d/e/2PACX-1vRCHCDPx4NmYA5phchO2rZhZSPvHZjkF08E11i3gsjHCy4zVWc12IRGg8rMzDgpvIHCZQqGeqPFhWa6/pub?gid=692075096&single=true&output=csv'
fromGoogle = pd.read_csv(link)

# here it is:
fromGoogle

Unnamed: 0,HDI rank,Country,1990,2000,2010,2011,2012,2013,2014
0,1,Norway,0849,0917,0940,0941,0942,0942,0944
1,2,Australia,0865,0898,0927,0930,0932,0933,0935
2,3,Switzerland,0831,0888,0924,0925,0927,0928,0930
3,4,Denmark,0799,0862,0908,0920,0921,0923,0923
4,5,Netherlands,0829,0877,0909,0919,0920,0920,0922
...,...,...,...,...,...,...,...,...,...
183,184,Burundi,0295,0301,0390,0392,0395,0397,0400
184,185,Chad,..,0332,0371,0382,0386,0388,0392
185,186,Eritrea,..,..,0381,0386,0390,0390,0391
186,187,Central African Republic,0314,0310,0362,0368,0373,0348,0350


In [105]:
fromGoogle.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188 entries, 0 to 187
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   HDI rank  188 non-null    int64 
 1   Country   188 non-null    object
 2   1990      188 non-null    object
 3   2000      188 non-null    object
 4   2010      188 non-null    object
 5   2011      188 non-null    object
 6   2012      188 non-null    object
 7   2013      188 non-null    object
 8   2014      188 non-null    object
dtypes: int64(1), object(8)
memory usage: 13.3+ KB
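
The year columns above come out as `object` rather than numbers, most likely because missing values are written as `..` in the sheet. A minimal sketch of one way to handle this, on an invented toy CSV (the countries `A` and `B` and their values are made up), is to tell `read_csv` which text marks a missing value:

```python
import io
import pandas as pd

# toy CSV with '..' standing for a missing value (invented data)
toyCsv = "Country,1990,2000\nA,0.5,0.6\nB,..,0.4\n"

# default reading: the '..' forces the 1990 column to be text
asText = pd.read_csv(io.StringIO(toyCsv))

# declaring '..' as missing lets pandas read the column as numbers
asNumber = pd.read_csv(io.StringIO(toyCsv), na_values=[".."])

print(asText.dtypes)
print(asNumber.dtypes)
```

An alternative for a data frame that is already loaded is `pd.to_numeric(column, errors="coerce")`, which turns anything that is not a number into a missing value.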


[Go to page beginning](#beginning)

-----

<a id='part3'></a>

## Collecting data from APIs

Some organizations, public and private, have an open data policy that lets people access their repositories dynamically. You may be able to get the data as CSV, but it is usually offered in XML or JSON, formats that store data in an *associative array* structure. Python dictionaries are very useful in these situations, as they preserve that NoSQL-like structure better than data frames do. Let me get the data about 9-1-1 emergency responses from Seattle:

In [90]:
pip install sodapy



In [91]:
# make sure to install these packages before running:
# pip install pandas
# pip install sodapy

import pandas as pd
from sodapy import Socrata

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.seattle.gov", None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.seattle.gov",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")

# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("kzjm-xkqj", limit=2000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)



In [92]:
results_df.shape

(2000, 12)

In [93]:
results_df

Unnamed: 0,address,type,datetime,latitude,longitude,report_location,incident_number,:@computed_region_ru88_fbhk,:@computed_region_kuhn_3gp2,:@computed_region_q256_3sug,:@computed_region_2day_rhn5,:@computed_region_cyqu_gs94
0,3515 Woodland Park Ave N,Auto Fire Alarm,2024-02-06T07:05:00.000,47.649775,-122.344031,"{'type': 'Point', 'coordinates': [-122.344031,...",F240018983,20,3,18377,,
1,4349 Ne 55th St,Aid Response,2024-02-06T07:03:00.000,47.668672,-122.281477,"{'type': 'Point', 'coordinates': [-122.281477,...",F240018981,55,48,18383,,
2,1940 11th Ave W,Triaged Incident,2024-02-06T07:01:00.000,47.637192,-122.371622,"{'type': 'Point', 'coordinates': [-122.371622,...",F240018982,50,39,19575,,
3,520 S Cloverdale St,Auto Fire Alarm,2024-02-06T07:00:00.000,47.52648,-122.327935,"{'type': 'Point', 'coordinates': [-122.327935,...",F240018980,59,15,18388,,
4,Sb I5 At Spokane,MVI Freeway,2024-02-06T06:50:00.000,47.577176,-122.319669,"{'type': 'Point', 'coordinates': [-122.319669,...",F240018978,57,41,19584,,
...,...,...,...,...,...,...,...,...,...,...,...,...
1995,3625 Ne 105th St,Low Acuity Response,2024-01-31T10:43:00.000,47.704764,-122.28909,"{'type': 'Point', 'coordinates': [-122.28909, ...",F240016322,55,29,19579,127,2
1996,E Pike St / 12th Ave,Aid Response,2024-01-31T10:43:00.000,47.614104,-122.316836,"{'type': 'Point', 'coordinates': [-122.316836,...",F240016323,8,11,19578,,
1997,12740 30th Ave Ne,Nurseline/AMR,2024-01-31T10:36:00.000,47.721329,-122.296339,"{'type': 'Point', 'coordinates': [-122.296339,...",F240016321,29,26,19579,117,2
1998,8300 Military Rd S,Aid Response,2024-01-31T10:25:00.000,47.530109,-122.295457,"{'type': 'Point', 'coordinates': [-122.295457,...",F240016320,58,37,18388,,
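
To make the path from JSON to a data frame explicit without calling the API, the sketch below starts from a JSON text with two of the records shown above (a subset of their fields), turns it into a list of dictionaries, and then into a data frame. It assumes, as is typical for this kind of API, that every field arrives as text, so dates and coordinates are converted afterwards.

```python
import json
import pandas as pd

# two records copied from the output above, as JSON text
jsonText = """[
 {"address": "3515 Woodland Park Ave N", "type": "Auto Fire Alarm",
  "datetime": "2024-02-06T07:05:00.000",
  "latitude": "47.649775", "longitude": "-122.344031"},
 {"address": "4349 Ne 55th St", "type": "Aid Response",
  "datetime": "2024-02-06T07:03:00.000",
  "latitude": "47.668672", "longitude": "-122.281477"}
]"""

# JSON array of objects -> Python list of dictionaries
records = json.loads(jsonText)

# list of dictionaries -> data frame (one row per dictionary)
toy_df = pd.DataFrame.from_records(records)

# text -> proper types
toy_df["datetime"] = pd.to_datetime(toy_df["datetime"])
coords = ["latitude", "longitude"]
toy_df[coords] = toy_df[coords].apply(pd.to_numeric)

print(toy_df.dtypes)
```

Each dictionary key becomes a column, which is why records with different keys produce missing values in the data frame.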


[Go to page beginning](#beginning)

_____

<a id='part4'></a>

## Collecting data by scraping

We are going to get the data from a table on this [Wikipedia page](https://en.wikipedia.org/wiki/List_of_freedom_indices).

In [94]:
pip show beautifulsoup4 html5lib lxml

Name: beautifulsoup4
Version: 4.12.3
Summary: Screen-scraping library
Home-page: 
Author: 
Author-email: Leonard Richardson <leonardr@segfault.org>
License: MIT License
Location: /usr/local/lib/python3.10/dist-packages
Requires: soupsieve
Required-by: gdown, google, nbconvert, yfinance
---
Name: html5lib
Version: 1.1
Summary: HTML parser based on the WHATWG HTML specification
Home-page: https://github.com/html5lib/html5lib-python
Author: 
Author-email: 
License: MIT License
Location: /usr/local/lib/python3.10/dist-packages
Requires: six, webencodings
Required-by: yfinance
---
Name: lxml
Version: 4.9.4
Summary: Powerful and Pythonic XML processing library combining libxml2/libxslt with the ElementTree API.
Home-page: https://lxml.de/
Author: lxml dev team
Author-email: lxml-dev@lxml.de
License: BSD-3-Clause
Location: /usr/local/lib/python3.10/dist-packages
Requires: 
Required-by: nbconvert, pandas-datareader, yfinance


In [95]:

# Location
wikilink = "https://en.wikipedia.org/wiki/List_of_freedom_indices"

wikiTables1=pd.read_html(wikilink)
wikiTables2=pd.read_html(wikilink,attrs={'class':'wikitable sortable'})
wikiTables3=pd.read_html(wikilink,match="Score")

In [96]:
#How many are there?
len(wikiTables1),len(wikiTables2),len(wikiTables3)

(5, 2, 2)
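
The three calls differ only in how they filter the tables on the page: no filter, filter by HTML attributes (`attrs`), and filter by text found inside the table (`match`). The sketch below reproduces that on a tiny invented HTML page, so it runs without an internet connection; like the calls above, it needs one of the parsers checked earlier (lxml, or beautifulsoup4 with html5lib).

```python
import io
import pandas as pd

# a tiny invented page with two tables; only the first has the
# 'wikitable sortable' class and the word 'Score'
toyHtml = """
<html><body>
<table class="wikitable sortable">
  <tr><th>Country</th><th>Score</th></tr>
  <tr><td>Norway</td><td>100</td></tr>
</table>
<table class="plain">
  <tr><th>Note</th><th>Text</th></tr>
  <tr><td>a</td><td>b</td></tr>
</table>
</body></html>
"""

# every table on the page
allTables = pd.read_html(io.StringIO(toyHtml))
# only tables whose class attribute is exactly this value
byClass = pd.read_html(io.StringIO(toyHtml),
                       attrs={'class': 'wikitable sortable'})
# only tables containing text that matches this pattern
byText = pd.read_html(io.StringIO(toyHtml), match="Score")

print(len(allTables), len(byClass), len(byText))
```

In every case the result is a list of data frames, which is why the tables are picked with an index such as `[0]`.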

In [97]:
wikiTables2[0]

Unnamed: 0,Country,Freedom in the World 2023[13],Score,Index of Economic Freedom 2023[14],Score.1,Press Freedom Index 2023[3],Score.2,Democracy Index 2023[9],Score.3
0,Norway,,100,,76.9,,95.18,,9.81
1,Ireland,,97,,82,,89.91,,9.05
2,Sweden,,100,,77.5,,88.15,,9.26
3,Finland,,100,,77.1,,87.94,,9.2
4,Denmark,,97,,77.6,,89.48,,9.15
...,...,...,...,...,...,...,...,...,...
192,Afghanistan,,8,,—,,39.75,,2.85
193,Yemen,,9,,—,,32.78,,1.95
194,Palestine,,—,,—,,37.86,,3.83
195,Syria,,1,,—,,27.22,,1.43


In [98]:
wikiTables3[0]

Unnamed: 0,Country,Freedom in the World 2023[13],Score,Index of Economic Freedom 2023[14],Score.1,Press Freedom Index 2023[3],Score.2,Democracy Index 2023[9],Score.3
0,Norway,,100,,76.9,,95.18,,9.81
1,Ireland,,97,,82,,89.91,,9.05
2,Sweden,,100,,77.5,,88.15,,9.26
3,Finland,,100,,77.1,,87.94,,9.2
4,Denmark,,97,,77.6,,89.48,,9.15
...,...,...,...,...,...,...,...,...,...
192,Afghanistan,,8,,—,,39.75,,2.85
193,Yemen,,9,,—,,32.78,,1.95
194,Palestine,,—,,—,,37.86,,3.83
195,Syria,,1,,—,,27.22,,1.43


In [99]:
wikiTables2[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 9 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Country                             197 non-null    object 
 1   Freedom in the World 2023[13]       0 non-null      float64
 2   Score                               197 non-null    object 
 3   Index of Economic Freedom 2023[14]  0 non-null      float64
 4   Score.1                             197 non-null    object 
 5   Press Freedom Index 2023[3]         0 non-null      float64
 6   Score.2                             197 non-null    object 
 7   Democracy Index 2023[9]             0 non-null      float64
 8   Score.3                             197 non-null    object 
dtypes: float64(4), object(5)
memory usage: 14.0+ KB


In [100]:
wikiTables2_bs=pd.read_html(wikilink,flavor='bs4',
                            attrs={'class':'wikitable sortable'})

In [101]:
wikiTables2_bs[0]

Unnamed: 0,Country,Freedom in the World 2023[13],Score,Index of Economic Freedom 2023[14],Score.1,Press Freedom Index 2023[3],Score.2,Democracy Index 2023[9],Score.3
0,Norway,free,100,mostly free,76.9,good,95.18,full democracy,9.81
1,Ireland,free,97,free,82,good,89.91,full democracy,9.05
2,Sweden,free,100,mostly free,77.5,good,88.15,full democracy,9.26
3,Finland,free,100,mostly free,77.1,good,87.94,full democracy,9.2
4,Denmark,free,97,mostly free,77.6,good,89.48,full democracy,9.15
...,...,...,...,...,...,...,...,...,...
192,Afghanistan,not free,8,,—,very serious,39.75,authoritarian regime,2.85
193,Yemen,not free,9,,—,very serious,32.78,authoritarian regime,1.95
194,Palestine,,—,,—,very serious,37.86,authoritarian regime,3.83
195,Syria,not free,1,,—,very serious,27.22,authoritarian regime,1.43


In [102]:
wikiTables2_bs[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 9 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   Country                             197 non-null    object
 1   Freedom in the World 2023[13]       196 non-null    object
 2   Score                               197 non-null    object
 3   Index of Economic Freedom 2023[14]  176 non-null    object
 4   Score.1                             197 non-null    object
 5   Press Freedom Index 2023[3]         184 non-null    object
 6   Score.2                             197 non-null    object
 7   Democracy Index 2023[9]             165 non-null    object
 8   Score.3                             197 non-null    object
dtypes: object(9)
memory usage: 14.0+ KB


In [103]:
freedomDF=wikiTables2_bs[0].copy()
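
The copy still has footnote markers such as `[13]` in its column names, and its score columns are text because a dash marks missing scores. A possible next step, sketched here on three rows taken from the output above (only a few of the columns are kept), is to clean the names and convert the scores. The name `toyFreedom` is only illustrative.

```python
import pandas as pd

# three rows from the scraped table; "\u2014" is the dash that
# marks a missing score on the page
toyFreedom = pd.DataFrame({
    "Country": ["Norway", "Ireland", "Palestine"],
    "Freedom in the World 2023[13]": ["free", "free", None],
    "Score": ["100", "97", "\u2014"],
})

# remove footnote markers like [13] from the column names
toyFreedom.columns = toyFreedom.columns.str.replace(r"\[\d+\]", "",
                                                    regex=True)

# text -> number; anything that is not a number becomes missing
toyFreedom["Score"] = pd.to_numeric(toyFreedom["Score"],
                                    errors="coerce")

toyFreedom.info()
```

The same two steps could be applied to `freedomDF`, looping over its score columns.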