### APIs: What and why

An API (Application Programming Interface) is a way for two different applications to communicate. Whilst the term applies to any two programs, here we use it to mean the API of a web service that provides data.

To retrieve data from an API, a request to a remote web server is made.

For example, if you want to build an application that plots stock prices, you would use the API of a service like Google Finance to request the current stock prices.

APIs are useful where:
* Data is changing quickly, e.g. stock prices
* The whole dataset is not required, e.g. the tweets of one user
* Repeated computation is involved, e.g. the Spotify API that tells you the genre of a piece of music
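
As a rough sketch of what such a request involves, the snippet below builds a request URL and parses a JSON response using only the standard library. The endpoint `https://api.example.com/prices`, the `symbol` and `limit` parameters, and the sample response body are all made up for illustration; a real service defines its own URL, parameters and response format. No network call is made: the response is a hard-coded string standing in for what a server might return.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and query parameters (assumptions, not a real service)
base_url = "https://api.example.com/prices"
params = {"symbol": "ABC", "limit": 2}

# A request is often just a URL with the parameters encoded into it
request_url = base_url + "?" + urlencode(params)
print(request_url)

# Stand-in for the text a server might send back (made-up data)
response_text = '{"symbol": "ABC", "prices": [101.5, 102.0]}'

# Many web APIs return JSON, which maps onto Python dicts and lists
payload = json.loads(response_text)
latest_price = payload["prices"][-1]
print(latest_price)
```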




We will be using Pandas (a contraction of 'panel' and 'data'). Pandas is a Python library for doing practical, real-world data analysis.

Becoming comfortable with pandas deserves a tutorial (or set of tutorials) of its own$^{*}$, so don't worry if you're unfamiliar with it; we will pick up the basics as we go.

\* *(See the resources section at the end of this tutorial for more resources on pandas)*

The main data structure in pandas is the DataFrame, which stores rows of observations over columns of variables. Let's see how it works...


#### Construct a DataFrame


import pandas as pd  # Import the package

# Some input data - a small sample of the iris dataset
data = {
    'sepal_length': [6.9, 6.9, 4.8, 5.4, 4.6],
    'sepal_width': [3.2, 3.1, 3.4, 3.0, 3.6],
    'petal_length': [5.7, 5.1, 1.9, 4.5, 1.0],
    'petal_width': [2.3, 2.3, 0.2, 1.5, 0.2],
    'species': ['virginica', 'virginica', 'setosa', 'versicolor', 'setosa']
}

df = pd.DataFrame(data)

df  # Typing a variable at the end of a cell will display it

#### Indexes

df.index  # Access the index

# Change the index
df.index = [2, 3, 4, 5, 6]
print(df.index)  # Access the index
df

df.index = df.index - 1  # Change the index
df.index  # Access the index

#### Columns

df.columns  # Access the columns

# Change the columns
df.columns = ['sepal_length', 'sepal_width',
              'petal_length', 'petal_width', 'SPECIES']
print(df.columns)
df

# Apply the `str.lower` function to every column name
# (wrapped in `list` because `map` returns a lazy iterator in Python 3)
df.columns = list(map(str.lower, df.columns))
df.columns  # Access the columns
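
Reassigning `df.columns` replaces every name at once. To change only some of them, `rename` with a mapping is one alternative. A minimal sketch on a made-up two-column frame (`toy` is a placeholder name):

```python
import pandas as pd

# Made-up frame with one inconsistently named column
toy = pd.DataFrame({"sepal_length": [6.9, 4.8], "SPECIES": ["virginica", "setosa"]})

# rename returns a new DataFrame; columns not in the mapping are left alone
renamed = toy.rename(columns={"SPECIES": "species"})
print(list(renamed.columns))
```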

#### Accessing columns

# Access one column
df.loc[:, 'sepal_length']

# Access one column (short-hand)
df['sepal_length']

# Access two columns
df.loc[:, ['sepal_length', 'sepal_width']]

# Access two columns (short-hand)
df[['sepal_length', 'sepal_width']]

#### Accessing rows

# Access one row
df.loc[3, :]

# Access two rows
df.loc[[3, 1], :]
# Note: every label in the list must exist in the index. A missing label such as `0`
# raises a `KeyError` in pandas 1.0 and later; `df.reindex([3, 0])` returns `NaN` rows instead

# Access a range of rows
df.loc[2:5, :]

# Access rows by index location (number starting from zero)
df.iloc[0:2, :]
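
The two slices above differ in more than syntax: `.loc` works on index labels and includes both end points, while `.iloc` works on integer positions and excludes the stop, like ordinary Python slicing. The sketch below shows this on a made-up frame (`toy`) whose index runs from 1 to 5.

```python
import pandas as pd

toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]}, index=[1, 2, 3, 4, 5])

# .loc slices by label and includes both end points: labels 2, 3, 4 and 5
by_label = toy.loc[2:5, :]
print(list(by_label.index))

# .iloc slices by position and excludes the stop: positions 0 and 1
by_position = toy.iloc[0:2, :]
print(list(by_position.index))
```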

# Filter keeping only rows with `sepal_length` > 5
df[df.sepal_length > 5]

# Filter keeping only rows with `sepal_length` > 5 (alternative syntax)
df.query("sepal_length > 5")
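
Filters can also combine several conditions. The sketch below, on a made-up three-row frame (`toy`), shows the boolean-mask form and what should be the equivalent `query` string.

```python
import pandas as pd

toy = pd.DataFrame({
    "sepal_length": [6.9, 4.8, 5.4],
    "species": ["virginica", "setosa", "versicolor"],
})

# Each comparison yields a boolean Series; wrap each in parentheses and
# combine with & (and) or | (or) rather than the `and` / `or` keywords
mask = (toy["sepal_length"] > 5) & (toy["species"] != "virginica")
filtered = toy[mask]
print(filtered)

# The same filter written as a query string
same = toy.query("sepal_length > 5 and species != 'virginica'")
```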

#### Reading data from disk

Pandas can read data directly from a file (even Excel files) and will automatically try to infer as much as it can about the structure of the data.

``` python
filename = 'Your_filename_here.json'
df = pd.read_json(filename)

filename = 'Your_filename_here.csv'
df = pd.read_csv(filename)

# Requires the `openpyxl` package to be installed (`xlrd` only handles legacy `.xls` files)
filename = 'Your_filename_here.xlsx'
df = pd.read_excel(filename)
```

# You may need to manually tell pandas some things about your dataset
# Lots of options detailed in the docs...
pd.read_csv?
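
As an illustration of telling pandas about a dataset, the sketch below passes a few commonly used `read_csv` options. The data is a made-up, semicolon-separated string wrapped in `io.StringIO` so the example runs without a file on disk; the column names `id`, `measured_on` and `length` are placeholders.

```python
import io
import pandas as pd

# Made-up, semicolon-separated text standing in for a file on disk
raw = "id;measured_on;length\n001;2020-01-05;6.9\n002;2020-01-06;4.8\n"

df_csv = pd.read_csv(
    io.StringIO(raw),             # a file path or file-like object
    sep=";",                      # the delimiter is not a comma
    dtype={"id": str},            # keep leading zeros instead of parsing as int
    parse_dates=["measured_on"],  # convert this column to datetimes
)
print(df_csv.dtypes)
```

Without the `dtype` hint, the `id` column would be inferred as integers and the leading zeros would be lost.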

#### Adding new columns

df['new_column'] = 1
df

df['new_column'] = [1, 2, 3, 4, 5]
df

df['new_column'] = df['sepal_length'] - df['petal_length']
df
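
The assignments above modify `df` in place. When you would rather keep the original untouched, `assign` is one option: it returns a new DataFrame with the extra column. A minimal sketch on a made-up frame (`toy` and `length_diff` are placeholder names):

```python
import pandas as pd

toy = pd.DataFrame({"sepal_length": [6.9, 4.8], "petal_length": [5.7, 1.9]})

# assign returns a new DataFrame and leaves the original untouched
with_diff = toy.assign(length_diff=toy["sepal_length"] - toy["petal_length"])
print(list(with_diff.columns))
print(list(toy.columns))
```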