## Getting Data in its Different Formats
### Getting and Saving Data
*Curtis Miller*

In this notebook I look at some of the myriad ways we can load data.

#### CSV
We can load comma-separated value (CSV) files with the pandas function `read_csv()`, using code like the following.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
pop = pd.read_csv("PopPyramids.csv")

# A peek at the contents
pop.head()

We can set some parameters to get a better data frame (the one above is fine, but could be better). We want `Country`, `Year`, and `Age` to be the index levels, and we want to drop `Region`, which carries no useful information here.

In [None]:
pop = pd.read_csv("PopPyramids.csv", index_col=['Country', 'Year', 'Age'])
pop.drop('Region', axis=1, inplace=True)
pop.sort_index(inplace=True)    # If we don't do this, some slicing operations won't work (index will not be sorted)
pop.head()

In [None]:
pop.loc[('UnitedStates', 2013), :]    # Usage demonstration

In [None]:
pop.loc[(slice(None), 2017, 'Total'), :].sort_values('Both Sexes Population', ascending=False)
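
For readers without `PopPyramids.csv` at hand, the same kind of selection can be tried on a toy table. This is only a sketch: the countries and numbers are made up, and `pd.IndexSlice` is used as an alternative spelling of `slice(None)`.

```python
import pandas as pd

# Made-up numbers, laid out like the population table above
toy = pd.DataFrame(
    {
        "Country": ["B", "A", "A", "B"],
        "Year": [2017, 2017, 2016, 2016],
        "Age": ["Total", "Total", "Total", "Total"],
        "Both Sexes Population": [20, 10, 9, 19],
    }
).set_index(["Country", "Year", "Age"])

toy = toy.sort_index()  # Sort first so that slicing across index levels works reliably

idx = pd.IndexSlice
# All countries, year 2017, age group "Total"
totals_2017 = toy.loc[idx[:, 2017, "Total"], :]
ranked = totals_2017.sort_values("Both Sexes Population", ascending=False)
print(ranked)
```

Here `idx[:, 2017, "Total"]` reads a little more like ordinary slicing than a tuple containing `slice(None)`, but both select the same rows.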

#### Excel

We can read Excel files (including `.xls` and `.xlsx`) using the `read_excel()` function from pandas.

In [None]:
pop_excel = pd.read_excel("PopPyramids.xlsx", index_col=[1, 2, 3])
pop_excel.drop('Region', axis=1, inplace=True)
pop_excel.sort_index(inplace=True)
pop_excel.head()

#### HTML

Reading HTML can be done using the `read_html()` function in pandas. Let's first read a relatively clean HTML file.

In [None]:
pop_html = pd.read_html("PopPyramids.html")    # This returns a list
pop_html

In [None]:
pop_html[0].head()    # This is a data frame

In [None]:
pop_html = pd.read_html("PopPyramids.html", attrs={'id': 'PopData'}, index_col=[1, 2, 3])[0]    # More specific way to get the table wanted
pop_html.drop('Region', axis=1, inplace=True)
pop_html.sort_index(inplace=True)
pop_html.head()

How about parsing a real-world HTML file? *(Warning: HTML file may have changed; the Internet is unpredictable.)*

In [None]:
# You may need to install html5lib via conda
cities = pd.read_html("https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population")
cities

In [None]:
cities[3]    # Ugly
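
Because the page can change, a hard-coded position like `3` is fragile. One possible alternative, sketched below, is to pick the table by its column names. The helper `find_table` and the toy tables are hypothetical stand-ins, not part of pandas or of the real page.

```python
import pandas as pd

# Stand-ins for the list returned by pd.read_html(); the contents are made up
tables = [
    pd.DataFrame({"Note": ["navigation"]}),
    pd.DataFrame({"City": ["Springfield", "Shelbyville"], "Population": [30000, 20000]}),
]


def find_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name."""
    required = set(required_columns)
    for table in tables:
        if required <= {str(col) for col in table.columns}:
            return table
    raise ValueError("no table has columns {}".format(sorted(required)))


city_table = find_table(tables, ["City", "Population"])
print(city_table)
```

Real tables scraped from the web may have multi-level or oddly named columns, so the matching rule would likely need adjusting for any particular page.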

#### XML

In general, you cannot simply convert from XML to a `DataFrame`, or any native Python object for that matter. XML needs to be parsed (HTML is very similar). But suppose that XML data is in a nice format. We can use a parser like lxml for creating our `DataFrame`. The solution, though, depends on the XML in the file. There is no universal solution.

The following demonstrates what parsing the file `PopPyramids.xml` looks like:

In [None]:
from lxml import objectify

In [None]:
with open('PopPyramids.xml') as f:
    root = objectify.parse(f).getroot()    # Get the root of the tree structure of the XML

obj = list()    # Will contain all rows of the DataFrame

for entry in root.entry:    # Iterate over all children in root with tag "entry"
    entry_fields = dict()   # Create a dict that will contain a row
    for var in entry.var:   # Iterate over all children of entry with tag "var"
        entry_fields[var.attrib['name']] = var.pyval    # The element of entry_fields corresponding to the name attribute of var
                                                        # is assigned the pythonized value of the contents of var
    obj.append(entry_fields)  # Add this row to obj

obj

In [None]:
pop_xml = pd.DataFrame(obj)
pop_xml

In [None]:
# Make the DataFrame nicer
cols = [col for col in pop_xml.columns if col not in ['Age', 'Year', 'Country', 'Region']]    # Columns to be included
idx_list = pop_xml[['Country', 'Year', 'Age']].values.T.tolist()    # A list of lists to be used to create a MultiIndex
# Notice that for a DataFrame df, df.values is a NumPy array (look on your own)
pop_xml = pd.DataFrame(pop_xml[cols].values, columns=cols, index=idx_list)

In [None]:
pop_xml.head()
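
To make the assumed file layout concrete, here is a small self-contained sketch of the same idea using the standard library's `xml.etree.ElementTree` instead of lxml. The XML string is made up: the `entry` and `var` tags and the `name` attribute are inferred from the loop above, while the root tag `data` and the values are assumptions.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Assumed layout, inferred from the loop above; the values are made up
xml_text = """
<data>
  <entry>
    <var name="Country">A</var>
    <var name="Year">2017</var>
    <var name="Age">0-4</var>
    <var name="Male Population">10</var>
  </entry>
  <entry>
    <var name="Country">A</var>
    <var name="Year">2017</var>
    <var name="Age">5-9</var>
    <var name="Male Population">12</var>
  </entry>
</data>
""".strip()

root = ET.fromstring(xml_text)

# One dict per <entry>, keyed by each <var>'s name attribute
rows = [
    {var.get("name"): var.text for var in entry.findall("var")}
    for entry in root.findall("entry")
]

toy_xml = pd.DataFrame(rows)
# ElementTree returns text only, so numeric columns are converted explicitly
toy_xml["Year"] = toy_xml["Year"].astype(int)
toy_xml["Male Population"] = pd.to_numeric(toy_xml["Male Population"])
toy_xml = toy_xml.set_index(["Country", "Year", "Age"]).sort_index()
print(toy_xml)
```

Unlike `var.pyval` in lxml's `objectify`, ElementTree does not guess types, which is why the conversions are spelled out. `set_index()` is used here as a shorter route to the MultiIndex than building it from lists.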

#### JSON

JSON is easier to work with than XML, and pandas provides a `read_json()` function for reading from a JSON file. Look at the file first, though, to check that the JSON can be coerced into a tabular format: not all Python `dict`s can become `DataFrame`s, and the same holds for JSON objects, since the two are almost the same thing.

Here's what reading from JSON looks like:

In [None]:
pop_json = pd.read_json('PopPyramids.json')
pop_json.head()
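
The point about tabular shape can be illustrated with two tiny JSON strings. This is a sketch with made-up data, not the contents of `PopPyramids.json`: a list of flat records reads directly, while nested objects can be flattened with `pd.json_normalize()`.

```python
import io
import json

import pandas as pd

# A list of flat records maps directly onto rows and columns (values are made up)
flat = '[{"Country": "A", "Year": 2017, "Population": 10}, {"Country": "B", "Year": 2017, "Population": 20}]'
flat_df = pd.read_json(io.StringIO(flat))

# Nested objects need flattening first; json_normalize joins nested keys with "."
nested = '[{"Country": "A", "pop": {"male": 4, "female": 6}}, {"Country": "B", "pop": {"male": 9, "female": 11}}]'
nested_df = pd.json_normalize(json.loads(nested))

print(flat_df)
print(nested_df)
```

Deeper or irregular nesting may not flatten cleanly, which is the case where looking at the file first pays off.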

#### Raw API Call

API calls will likely consist mostly of `GET` requests, sometimes `POST` requests (and very rarely anything else). That's about all that's common across APIs; everything else is API-specific. Python then handles what the API returns (commonly JSON, sometimes XML).

Here we get the data contained in the files used above directly via the U.S. Census Bureau's API. Refer to these links for usage of the API:

* [Census Bureau API Overview](https://www.census.gov/developers/)
* [API Guide](https://www.census.gov/content/dam/Census/data/developers/api-user-guide/api-guide.pdf)
* [Available APIs](https://www.census.gov/data/developers/data-sets.html)
* [International Database](https://www.census.gov/data/developers/data-sets/international-database.html)
* [Populations by 5-Year Age Groups](https://api.census.gov/data/timeseries/idb/5year.html)
* [Request a Key](http://api.census.gov/data/key_signup.html)

As with most APIs, you will need a unique key to use this one. Here, it's stored in `secret_key` (which I created off-video).
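
The notebook does not show how `secret_key` was created. One common approach, sketched here as an assumption rather than as what was actually done, is to read the key from an environment variable so it never appears in the notebook. The helper `load_api_key` and the variable name `CENSUS_API_KEY` are hypothetical.

```python
import os


def load_api_key(name="CENSUS_API_KEY", environ=os.environ):
    """Return the API key stored under `name`, or raise if it is missing.

    CENSUS_API_KEY is a hypothetical variable name; use whatever you set in your shell.
    """
    key = environ.get(name)
    if not key:
        raise RuntimeError("environment variable {} is not set".format(name))
    return key


# Usage, once the variable has been set outside the notebook:
# secret_key = load_api_key()
```

Keeping the key out of the notebook means the file can be shared without leaking the key.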

In [None]:
from requests import get    # For making GET requests

In [None]:
base_url = "https://api.census.gov/data/timeseries/idb/5year"    # The base URL of the API for making requests
parameters = {"key": secret_key,    # The secret key
              "get": ",".join(["FPOP", "FPOP0_4", "FPOP5_9", "FPOP10_14", "FPOP15_19", "FPOP20_24", "FPOP25_29", "FPOP30_34",
                            "FPOP35_39", "FPOP40_44", "FPOP45_49", "FPOP50_54", "FPOP55_59", "FPOP60_64",
                            "FPOP65_69", "FPOP70_74", "FPOP75_79", "FPOP80_84", "FPOP85_89", "FPOP90_94",
                            "FPOP95_99", "FPOP100_", "MPOP", "MPOP0_4", "MPOP5_9", "MPOP10_14", "MPOP15_19", "MPOP20_24",
                            "MPOP25_29", "MPOP30_34", "MPOP35_39", "MPOP40_44", "MPOP45_49", "MPOP50_54", "MPOP55_59",
                            "MPOP60_64", "MPOP65_69", "MPOP70_74", "MPOP75_79", "MPOP80_84", "MPOP85_89", "MPOP90_94",
                            "MPOP95_99", "MPOP100_"]),    # Variables we request from the API
              "time": "from 2013 to 2017",
              "FIPS": "*"}    # Get data for all FIPS codes (identifiers for countries; for example, NO is Norway)

In [None]:
response = get(base_url, params=parameters)

In [None]:
response.status_code    # If 200, the call was a "success"

In [None]:
response.url    # What the resulting URL passed in the call looks like
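
To see how a parameter dictionary turns into a URL without making a network call, the standard library's `urllib.parse` can be used. This is only an illustration: the key is a placeholder, the variable list is shortened, and the exact escaping that `requests` applies may differ in detail.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "https://api.census.gov/data/timeseries/idb/5year"
toy_params = {
    "key": "YOUR_KEY",                   # Placeholder, not a real key
    "get": ",".join(["FPOP", "MPOP"]),   # Shortened variable list
    "time": "from 2013 to 2017",
    "FIPS": "*",
}

# Percent-encode the parameters and append them as a query string
url = base_url + "?" + urlencode(toy_params)
print(url)

# Decoding the query string recovers the original parameters
decoded = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}
print(decoded)
```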

In [None]:
response.content     # The raw bytes of the response body; here they contain JSON

In [None]:
resp_obj = response.json()    # Create a Python object from the JSON sent back
resp_obj

In [None]:
pops_api_raw = pd.DataFrame(resp_obj[1:], columns=resp_obj[0])    # Create a DataFrame
pops_api_raw

The format is unlike what we had before, and the numbers are being treated as strings. We will need to do some serious transformation to put this in the format we want.
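
As a rough idea of what that transformation involves, here is a sketch on a made-up response with the same shape (a header row followed by rows of strings). The variable names follow the `FPOP0_4` pattern requested above, but the numbers and the FIPS code `AA` are invented. Totals such as `FPOP` and the open-ended `FPOP100_` would need extra handling that is not shown.

```python
import pandas as pd

# Made-up rows in the same shape as the response: a header row, then rows of strings
toy_resp = [
    ["FPOP0_4", "FPOP5_9", "MPOP0_4", "MPOP5_9", "time", "FIPS"],
    ["100", "90", "105", "95", "2016", "AA"],
    ["110", "92", "115", "97", "2017", "AA"],
]

raw = pd.DataFrame(toy_resp[1:], columns=toy_resp[0])

# Wide to long: one row per (FIPS, time, variable)
long = raw.melt(id_vars=["FIPS", "time"], var_name="variable", value_name="population")
long["population"] = pd.to_numeric(long["population"])  # Strings to numbers
long["time"] = long["time"].astype(int)

# Split names like "FPOP0_4" into sex ("F") and age group ("0-4")
parts = long["variable"].str.extract(r"^(?P<sex>[FM])POP(?P<age>.+)$")
long["sex"] = parts["sex"]
long["Age"] = parts["age"].str.replace("_", "-", regex=False)

# One row per (FIPS, time, Age), one column per sex
tidy = long.pivot_table(index=["FIPS", "time", "Age"], columns="sex",
                        values="population", aggfunc="sum")
print(tidy)
```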

But there's a better way.


#### APIs Via Packages

Always check to see if there's a Python package already written for the API you want to use. Twitter, for example, has a dedicated package. Unfortunately, the API for accessing the International Database does not have a package (though the census Python package allows access to other data sets).

#### Population Pyramid Plot

Below is the code for generating a population pyramid for the United States in 2017 using matplotlib.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
pop.head()

In [None]:
plotdf = pop.loc[('UnitedStates', 2017), ['Male Population', 'Female Population']]
plotdf

In [None]:
agegroups = pd.Categorical(['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49',
                            '50-54', '55-59', '60-64', '65-69', '70-74', '75-79', '80-84', '85-89', '90-94',
                            '95-99', '100+'])    # pandas' Categorical type, for categorical data
plotdf = plotdf.loc[agegroups, :]    # Put the rows in a custom order (this also drops rows such as 'Total')
plotdf
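
The reason a custom order is needed is that age-group labels are strings, so a plain sort puts `'10-14'` and `'100+'` before `'5-9'`. The sketch below, on made-up counts, shows two ways of imposing the natural order: `reindex()` with an explicit list, and an ordered `CategoricalIndex`.

```python
import pandas as pd

# Made-up counts; as strings, '10-14' and '100+' sort before '5-9'
counts = pd.Series({"0-4": 5, "10-14": 7, "100+": 1, "5-9": 6})

age_order = ["0-4", "5-9", "10-14", "100+"]

# Option 1: reindex with the labels in the order wanted
by_reindex = counts.reindex(age_order)

# Option 2: an ordered categorical index, so sort_index() follows age_order
by_category = counts.copy()
by_category.index = pd.CategoricalIndex(counts.index, categories=age_order, ordered=True)
by_category = by_category.sort_index()

print(by_reindex)
print(by_category)
```

The categorical version keeps the ordering attached to the index, so later sorts follow it as well.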

In [None]:
def plot_pop_pyramid(df, country=None, year=None):
    """Generate a plot of a population pyramid.
    
    Args:
        df (pandas.DataFrame): A DataFrame with index Age (for age groups) and columns Male Population, Female Population
                               of numeric data that will be used for creating the plot
        country (str): The country for which the population pyramid represents (used in the title); if None, ignored
        year (int): The year of the population pyramid's data (used in the title); if None, ignored
    """
    
    agegroups = pd.Categorical(['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49',
                                '50-54', '55-59', '60-64', '65-69', '70-74', '75-79', '80-84', '85-89', '90-94',
                                '95-99', '100+'])    # Tick labels; the rows of df are assumed to be in this order
    ypos = list(range(len(agegroups)))    # One bar position per age group
    plt.yticks(ypos, list(agegroups))
    plt.barh(ypos, -df["Male Population"], align='center', color='blue')
    plt.barh(ypos, df["Female Population"], align='center', color='red')
    
    max_extent = df.values.max() * 1.1
    plt.xlim([-max_extent, max_extent])
    t = "Population Pyramid"
    if country is not None:
        t += " for " + country
    if year is not None:
        t += ", " + str(year)
    _ = plt.title(t)
    plt.ylim([-0.5, len(ypos) + 1])
    
    plt.show()

In [None]:
plot_pop_pyramid(plotdf, "The United States of America", 2017)