# Introduction to Pandas

Civil 774 - Smart Infrastructure Analytics


![alt text](https://miro.medium.com/max/3006/1*KdxlBR9P3mDp9JZ_URMdYQ.jpeg)




**pandas** is a Python package providing fast, flexible, and expressive data structures designed to make working with *relational* or *labeled* data both easy and intuitive. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.

pandas is well suited for:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure


Key features:
    
- Easy handling of **missing data**
- **Size mutability**: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
- Powerful, flexible **group by functionality** to perform split-apply-combine operations on data sets
- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets
- Intuitive **merging and joining** data sets
- Flexible **reshaping and pivoting** of data sets
- **Hierarchical labeling** of axes
- Robust **IO tools** for loading data from flat files, Excel files, databases, and HDF5
- **Time series functionality**: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

In this lecture, we will talk about: 
1. Pandas Data Structures
2. Import data into Pandas
3. Pandas fundamentals


In [22]:
import pandas as pd

## 1. Pandas Data Structures

## Series

A **Series** is a single vector of data (like an array) with an *index* that labels each element in the vector.

First, let's create a pandas Series containing only numbers:

In [23]:
counts = pd.Series([23,2321,42,52])

In [24]:
counts

0      23
1    2321
2      42
3      52
dtype: int64

If an index is not specified, a default sequence of integers is assigned as the index. A NumPy array comprises the values of the `Series`, while the index is a pandas `Index` object.

Let's try to see only the values of a series

In [25]:
counts.values

array([  23, 2321,   42,   52], dtype=int64)

Now let's try to see only the indexes of a series

In [26]:
counts.index

RangeIndex(start=0, stop=4, step=1)

### Index and slicing

The index can also contain strings, which makes the labels more meaningful. Let's try that:

In [27]:
Price = pd.Series([1.2, 0.8, 1.1, 0.6], index=['Auckland','Christchurch','Wellington', 'Hamilton'])

In [28]:
Price

Auckland        1.2
Christchurch    0.8
Wellington      1.1
Hamilton        0.6
dtype: float64

In [29]:
Price.index

Index(['Auckland', 'Christchurch', 'Wellington', 'Hamilton'], dtype='object')

In [30]:
Price.values

array([1.2, 0.8, 1.1, 0.6])

These labels can be used to refer to the values in the `Series`.

In [31]:
Price['Auckland']

1.2

With string indexes, we can easily find values using the index. For instance, let's find the entries whose index label ends with the letter "d":

In [32]:
Price[[a.endswith('d') for a in Price.index]]

Auckland    1.2
dtype: float64

We can still use positional indexing if we wish. Let's find the second city in our Series. Using `.iloc` makes it explicit that we mean a position rather than a label:

In [33]:
Price.iloc[1]

0.8

Slicing lets us take a subset of the data out of a Series and, later on, a DataFrame. We talked briefly about this in the previous week, but let's do it in Pandas:

In [34]:
Price[1:3]

Christchurch    0.8
Wellington      1.1
dtype: float64

Let's also see if we can add new data to this Series. Older versions of pandas provided `.append` for this purpose, but it was deprecated in pandas 1.4 and removed in pandas 2.0, so we use `pd.concat` instead. Note that `pd.concat` is not an in-place operation: it returns a new object, so we need to assign the result back.

In [35]:
new_series = pd.Series([0.9],index=['Gisborne'])

In [36]:
Price = pd.concat([Price, new_series])

In [37]:
Price

Auckland        1.2
Christchurch    0.8
Wellington      1.1
Hamilton        0.6
Gisborne        0.9
dtype: float64

We can also drop (remove) values. Let's remove the item we just added:


In [38]:
Price = Price.drop('Gisborne')

In [39]:
Price

Auckland        1.2
Christchurch    0.8
Wellington      1.1
Hamilton        0.6
dtype: float64

### Convert Python array to Pandas Series

Because a Series stores its values and its index separately, we can convert a standard Python list into a pandas Series and then assign a new index. Let's try that:

In [40]:
array0 = [123,2312,442,3322]


In [41]:
pop = pd.Series(array0)
pop

0     123
1    2312
2     442
3    3322
dtype: int64

In [42]:
pop.index = ['Auckland','Christchurch','Wellington', 'Hamilton']

In [43]:
pop

Auckland         123
Christchurch    2312
Wellington       442
Hamilton        3322
dtype: int64

We can also combine pandas with other Python packages. For instance, let's use `NumPy`, another data analytics package, to create a Series of random values:

In [44]:
import numpy as np


In [45]:
n0 = np.random.randn(4)
n0

array([-0.2957932 ,  0.08980085,  0.97836122, -0.25114475])

In [46]:
index0 = ['Auckland','Christchurch','Wellington', 'Hamilton']

In [47]:
s = pd.Series(n0,index=index0)
s

Auckland       -0.295793
Christchurch    0.089801
Wellington      0.978361
Hamilton       -0.251145
dtype: float64

### Name the data and the index

We can give both the array of values and the index meaningful labels themselves:

In [48]:
pop.name = 'Population'
pop.index.name = 'Cities'
pop

Cities
Auckland         123
Christchurch    2312
Wellington       442
Hamilton        3322
Name: Population, dtype: int64

### Filter the data

Now let's filter the cities with a population value greater than 350 (that is, more than 350,000 people if the values are read in thousands):

In [49]:
pop[pop>350]

Cities
Christchurch    2312
Wellington       442
Hamilton        3322
Name: Population, dtype: int64

In [50]:
pop[pop<1000]

Cities
Auckland      123
Wellington    442
Name: Population, dtype: int64

## DataFrame

Inevitably, we want to be able to store, view and manipulate data that is *multivariate*, where for every index there are multiple fields or columns of data, often of varying data type.

A `DataFrame` is a tabular data structure, encapsulating multiple series like columns in a spreadsheet. Data are stored internally as a 2-dimensional object, but the `DataFrame` allows us to represent and manipulate higher-dimensional data.

### Create DataFrame

We created some Series in the previous section, let's combine them as a new DataFrame (2-dimension Excel-spreadsheet-like data structure!)

In [51]:
df = pd.concat([Price,pop,s],axis=1)
df

                0  Population         1
Auckland      1.2         123 -0.295793
Christchurch  0.8        2312  0.089801
Wellington    1.1         442  0.978361
Hamilton      0.6        3322 -0.251145


Notice that the two unnamed `Series` were given the default column names `0` and `1`. We can assign more meaningful names by setting `df.columns`:

In [52]:
df.columns = ['Price','Population','Score']

In [53]:
df

              Price  Population     Score
Auckland        1.2         123 -0.295793
Christchurch    0.8        2312  0.089801
Wellington      1.1         442  0.978361
Hamilton        0.6        3322 -0.251145


We can also rearrange the columns by indexing with a list of column names in the order we desire:

In [54]:
df[['Population','Price','Score']]

              Population  Price     Score
Auckland             123    1.2 -0.295793
Christchurch        2312    0.8  0.089801
Wellington           442    1.1  0.978361
Hamilton            3322    0.6 -0.251145


A `DataFrame` has a second index, representing the columns:

In [55]:
df.columns

Index(['Price', 'Population', 'Score'], dtype='object')

If we wish to access columns, we can do so either by dict-like indexing or by attribute:

In [56]:
df['Price']

Auckland        1.2
Christchurch    0.8
Wellington      1.1
Hamilton        0.6
Name: Price, dtype: float64

In [57]:
df['Score']

Auckland       -0.295793
Christchurch    0.089801
Wellington      0.978361
Hamilton       -0.251145
Name: Score, dtype: float64

In [58]:
df.Price

Auckland        1.2
Christchurch    0.8
Wellington      1.1
Hamilton        0.6
Name: Price, dtype: float64

We can modify or add values in a DataFrame. Use `.loc` to locate a cell by its row and column labels, or `.iloc` to locate it by integer position.

In [59]:
df.loc['Auckland','Population'] = 1600
df

              Price  Population     Score
Auckland        1.2        1600 -0.295793
Christchurch    0.8        2312  0.089801
Wellington      1.1         442  0.978361
Hamilton        0.6        3322 -0.251145


In [60]:
df.iloc[0,0]

1.2

Two important ways to select data in `Pandas` are `.loc` and `.iloc`.

They enable us to access a group of rows and columns by label(s), integer position(s), or a boolean array.

`.loc[]` is primarily label based, but may also be used with a boolean array.

Allowed inputs are:

* A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index).

* A list or array of labels, e.g. ['a', 'b', 'c'].

* A slice object with labels, e.g. 'a':'f'.

* A boolean array of the same length as the axis being sliced, e.g. [True, False, True].

`.iloc[]` is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

Allowed inputs are:

* An integer, e.g. 5.

* A list or array of integers, e.g. [4, 3, 0].

* A slice object with ints, e.g. 1:7.

* A boolean array.

Note that we can add a new column simply by assigning to a new column name of `df`. The assigned data must either have the same length as the `DataFrame` or be a single value, which is then repeated for the whole column.
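As a minimal sketch of adding columns, the example uses a small stand-in DataFrame called `cities` (a hypothetical name, with the `Country` and `Island` columns invented for illustration) rather than the lecture's `df`:

```python
import pandas as pd

# Hypothetical stand-in for the lecture's DataFrame
cities = pd.DataFrame(
    {"Price": [1.2, 0.8, 1.1, 0.6], "Population": [1600, 2312, 442, 3322]},
    index=["Auckland", "Christchurch", "Wellington", "Hamilton"],
)

# A single value is repeated for every row
cities["Country"] = "NZ"

# A list must supply exactly one value per row
cities["Island"] = ["North", "South", "North", "North"]

# A list of the wrong length is rejected with a ValueError
try:
    cities["Bad"] = [1, 2]
except ValueError as err:
    print("Rejected:", err)

print(cities)
```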

As we did with Series, we can also use `.drop` to remove columns or rows from a DataFrame.

**Exercise**: How can you remove a row in Pandas?

## 2. Importing data to Pandas

A key, but often under-appreciated, step in data analysis is importing the data that we wish to analyze. Pandas provides a convenient set of functions for importing tabular data in a number of formats directly into a `DataFrame` object. These functions include a slew of options to perform type inference, indexing, parsing, iterating and cleaning automatically as data are imported.

Data from text files such as CSV can be read into a DataFrame using `read_csv`:

In [62]:
df = pd.read_csv("../Data/NZ_cars.csv")
df


Notice that `read_csv` automatically considered the first row in the file to be a header row.

We can see the data types of the data by using `.dtypes`
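Since the lecture's data file may not be at hand, here is a small sketch that reads CSV text from memory instead. The rows are invented, and the `ENGINE_CC` column is a hypothetical addition used only to show a floating point column:

```python
import io
import pandas as pd

# Invented CSV content; only the first three column names come from the lecture
csv_text = """OBJECTID,BODY_TYPE,BASIC_COLOUR,ENGINE_CC
1,SEDAN,WHITE,1800.5
2,HATCHBACK,RED,1300.0
3,SEDAN,BLUE,2000.0
"""

# The first line is used as the header row
cars = pd.read_csv(io.StringIO(csv_text))

# One inferred data type per column
print(cars.dtypes)
```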

If we want to see the first 5 lines in the data, `pandas` has the function `.head()` to do that

In [None]:
df.head()

We can change the number of lines `.head()` displays by passing a number to it. For example, here we show the first 2 lines:


In [None]:
df.head(2)


We can also show the last X lines with `.tail(X)`, for instance the last 3 lines:

In [None]:
df.tail(3)

We can show only one column, for instance let's show only the `BASIC_COLOUR`

In [None]:
df['BASIC_COLOUR']

We can list the unique colours with `unique` and count them with `nunique`:

In [None]:
df.BASIC_COLOUR.unique()

In [None]:
df.BASIC_COLOUR.nunique()

And find out the most popular colour by `value_counts()`

In [None]:
df.BASIC_COLOUR.value_counts()

`read_csv` is just a convenience function for `read_table`, since csv is such a common format:

In [None]:
df = pd.read_table("../Data/NZ_cars.csv", sep=',')
df.head(3)

The `sep` argument can be customized as needed to accommodate arbitrary separators. For example, we can use a regular expression to match a variable amount of whitespace, which is unfortunately very common in some data formats: 
    
    sep=r'\s+'
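A short sketch of the whitespace separator in action, using invented text held in memory rather than a file:

```python
import io
import pandas as pd

# Invented sample with an irregular number of spaces between fields
text = """city   price     population
Auckland     1.2   1600
Hamilton 0.6        3322
"""

# The regular expression treats any run of whitespace as one separator
table = pd.read_csv(io.StringIO(text), sep=r"\s+")
print(table)
```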

For a more useful index, we can use `index_col` to specify one or more columns as the index of our data:

In [None]:
df = pd.read_csv("../Data/NZ_cars.csv", index_col = ['OBJECTID'])
df.head(3)

In [None]:
df = pd.read_csv("../Data/NZ_cars.csv", index_col = ['BODY_TYPE','BASIC_COLOUR'])
df.head(3)

This is called a *hierarchical* index, which we will revisit later in the lecture.

If we have sections of data that we do not wish to import (for example, known bad data), we can populate the `skiprows` argument:

In [None]:
df = pd.read_csv("../Data/NZ_cars.csv",skiprows=[0,1,3])
df.head()

Conversely, if we only want to import a small number of rows from, say, a very large data file we can use `nrows`:

In [None]:
df = pd.read_csv("../Data/NZ_cars.csv",nrows = 4)
df

Most real-world data is incomplete, with values missing due to incomplete observation, data entry or transcription error, or other reasons. Pandas will automatically recognize and parse common missing data indicators, including `NA` and `NULL`.
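To illustrate, the following sketch reads invented CSV text in which `NA`, `NULL`, and an empty field all appear; each should be parsed as a missing value:

```python
import io
import pandas as pd

# Invented readings: NA, NULL and an empty field mark missing values
csv_text = """site,TP1,TP2
A,10.5,NA
B,NULL,3.0
C,,7.5
"""

readings = pd.read_csv(io.StringIO(csv_text))
print(readings)

# Count the missing values in each column
print(readings.isna().sum())
```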

## 3. Pandas Fundamentals

This section introduces the new user to the key functionality of Pandas that is required to use the software effectively.

For some variety, we will now look at our Electricity generation data in NZ. Try to read the file `generation2.csv` in the Data folder. 

Note that this dataset is larger. It might take a few seconds to a minute of processing time just to read the data

In [None]:
df_gen = pd.read_csv("../Data/generation2.csv", index_col = 'id')
df_gen.head()

### 3.1. Manipulating indices

**Reindexing** allows users to manipulate the data labels in a DataFrame. It forces a DataFrame to conform to the new index, and optionally, fill in missing data if requested.

A simple use of `reindex` is to alter the order of the rows:
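A minimal sketch of `reindex` on a small invented Series (the name `prices` is hypothetical):

```python
import pandas as pd

prices = pd.Series(
    [1.2, 0.8, 1.1],
    index=["Auckland", "Christchurch", "Wellington"],
)

# Reorder the rows by listing the labels in the desired order
reordered = prices.reindex(["Wellington", "Auckland", "Christchurch"])
print(reordered)

# Labels that did not exist get NaN unless a fill_value is supplied
extended = prices.reindex(["Auckland", "Hamilton"], fill_value=0.0)
print(extended)
```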

Notice that we specified the `id` column as the index, since it appears to be a unique identifier. We could try to create a unique index ourselves by combining `Site_Code` and `POC_Code`:

In [None]:
gen_id = df_gen.Site_Code + df_gen.POC_Code.astype(str)
gen_id

In [None]:
df2 = df_gen.copy()
df2.index = gen_id
df2.head()

This looks okay, but let's check:

In [None]:
df2.index.is_unique

So, indices need not be unique. Our choice is not unique because there are multiple samples with the same `Site_Code` and `POC_Code`

In [None]:
pd.Series(df2.index).value_counts()

The most important consequence of a non-unique index is that indexing by label will return multiple values for some labels:

In [None]:
df2.loc['HWAHWA1101']

We will learn more about indexing below.

We can create a truly unique index by combining `POC_Code` and `Trading_date`

In [None]:
new_index = df_gen.POC_Code + df_gen.Trading_date
df3 = df_gen.copy()
df3.index = new_index
df3.head()

In [None]:
df3.index.is_unique

In [None]:
df3.loc['ARG11011/08/97']

You may ask: why do we need a truly unique index?

![alt_text](https://i.stack.imgur.com/1ejWY.png)

You may notice missing values, which Pandas has read as `NaN`. Missing values can be filled as desired, either with selected values or by rule:

In [None]:
df3['Site_Code'] = df3['Site_Code'].fillna("Unknown_Site_Code")
df3['TP1'].sum()

And we can fill the remaining `NaN` values by back-filling, which propagates the next valid observation backward:

In [None]:
df3 = df3.bfill()
display(df3)

Keep in mind that `reindex` raises an error if the existing index contains duplicate labels.

We can remove rows or columns via the `drop` method:

In [None]:
df3=df3.drop(['TP1'],axis=1)
df3.head()

### 3.2 Data Selection

This section discusses methods to select data from a pandas DataFrame. First, let's create an array named `fuel` containing the unique `Fuel_Code` values:

In [None]:
fuel = df3['Fuel_Code'].unique()


Now let's choose the first three items in the array:

In [None]:
fuel[0:3]

Then let's choose specific values from one index to another

We can slice with data labels, since they have an intrinsic order within the Index:

With this we can also modify the data, let's force all the data selected to 'Solar'
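As a rough sketch of label slicing and assignment, assume a small Series of fuel codes indexed by site labels (both the labels and the values here are invented):

```python
import pandas as pd

# Invented fuel codes indexed by hypothetical site labels
fuel_by_site = pd.Series(
    ["Hydro", "Diesel", "Wind", "Gas"],
    index=["ARG", "BEN", "COL", "HWA"],
)

# A label slice includes both end points, unlike a positional slice
subset = fuel_by_site["BEN":"COL"]
print(subset)

# Assigning through .loc with a label slice modifies the Series itself
fuel_by_site.loc["BEN":"COL"] = "Solar"
print(fuel_by_site)
```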

In a `DataFrame` we can slice along either or both axes. Let's select a smaller Dataframe of `Fuel_Code` and `Tech_Code`

We notice that there are data points where 'TP1' is zero, so let's filter them out. Note that we have to assign the value to itself or to another data frame. 
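These two steps might look like the following sketch. The rows are invented; only the column names follow the generation data:

```python
import pandas as pd

# Invented rows with column names borrowed from the generation data
gen = pd.DataFrame(
    {
        "Fuel_Code": ["Hydro", "Diesel", "Wind", "Hydro"],
        "Tech_Code": ["Hydro", "Thrml", "Wind", "Hydro"],
        "TP1": [120.0, 0.0, 35.5, 0.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# A list of column names returns a smaller DataFrame
codes = gen[["Fuel_Code", "Tech_Code"]]

# A boolean mask keeps the rows where the condition is True;
# the result is a new object, so it has to be assigned to a name
gen_nonzero = gen[gen["TP1"] != 0]
print(gen_nonzero)
```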

Using the labels, we can use indexing field `loc` to select subsets of rows and columns in an intuitive way:

We can choose multiple rows and columns, too

What if we want to select rows and columns by their numeric positions?

We use `iloc` instead.

And what if we want to mix string labels and numeric positions?
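The sketch below shows the three cases on an invented DataFrame. For the mixed case it takes one possible approach, converting positions to labels (or a label to a position) first, which is an illustrative choice rather than the only option:

```python
import pandas as pd

# Invented rows with column names borrowed from the generation data
gen = pd.DataFrame(
    {
        "Fuel_Code": ["Hydro", "Diesel", "Wind", "Hydro"],
        "TP1": [120.0, 0.0, 35.5, 80.0],
        "TP2": [110.0, 5.0, 30.0, 75.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Label-based: lists of row labels and column labels
by_label = gen.loc[["s1", "s3"], ["Fuel_Code", "TP1"]]

# Position-based: first two rows, second and third columns
by_position = gen.iloc[0:2, 1:3]

# Mixed: turn row positions into labels, then use .loc
mixed = gen.loc[gen.index[0:2], "TP1"]

# Mixed the other way: turn a column label into a position, then use .iloc
mixed_alt = gen.iloc[0:2, gen.columns.get_loc("TP1")]
```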

We can use `groupby` to select data of specific conditions

For instance, let's select all generators of three different types ('Diesel', 'Hydro' and 'Solar') and then calculate the mean, minimum and maximum of `TP1`


`groupby` can also be applied with multiple grouping keys
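A minimal sketch of both uses of `groupby`, on invented rows that reuse the `Fuel_Code`, `Tech_Code` and `TP1` column names:

```python
import pandas as pd

# Invented generation records
gen = pd.DataFrame(
    {
        "Fuel_Code": ["Diesel", "Hydro", "Solar", "Hydro", "Diesel", "Wind"],
        "Tech_Code": ["Thrml", "Hydro", "Solar", "Hydro", "Thrml", "Wind"],
        "TP1": [10.0, 120.0, 5.0, 80.0, 30.0, 35.0],
    }
)

# Keep only the three fuel types of interest
wanted = gen[gen["Fuel_Code"].isin(["Diesel", "Hydro", "Solar"])]

# One row per fuel type, one column per statistic
stats = wanted.groupby("Fuel_Code")["TP1"].agg(["mean", "min", "max"])
print(stats)

# Grouping by two keys gives a result with a two-level index
by_two_keys = gen.groupby(["Fuel_Code", "Tech_Code"])["TP1"].mean()
print(by_two_keys)
```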

### 3.3 Data Operations

`DataFrame` and `Series` objects allow for several operations to take place either on a single object, or between two or more objects.

For example, we can perform arithmetic on the elements of two objects, such as adding the `TP1` values of `Hydro` and `Thrml` generation across sites:

We see a lot of `NaN` because Pandas' data alignment places `NaN` values for labels that do not overlap in the two Series. In fact, there are only 264 sites that have both `Hydro` and `Thrml`

While we do want the operation to honor the data labels in this way, we probably do not want the missing values to be filled with `NaN`. We can use the `add` method to calculate `TP1` totals by using the `fill_value` argument to insert a zero where labels do not overlap:
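The difference between `+` and `add` with `fill_value` can be seen on two tiny invented Series whose labels only partly overlap:

```python
import pandas as pd

# Invented totals; only siteB appears in both Series
hydro = pd.Series([100.0, 50.0], index=["siteA", "siteB"])
thermal = pd.Series([20.0, 70.0], index=["siteB", "siteC"])

# Plain addition aligns on labels and yields NaN where a label is missing
aligned = hydro + thermal
print(aligned)

# add() with fill_value treats the missing side as zero
filled = hydro.add(thermal, fill_value=0)
print(filled)
```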

Operations can also be **broadcast** between rows or columns.

For example, if we subtract the maximum `TP1` value from the `TP1` column, we get how far below the maximum each site is:

Or, looking at things row-wise, we can see how a particular site compares with the rest of the group with respect to important statistics. Let's look at site 1000 and see how it differs from the rest in terms of `TP1` to `TP4`.
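A small sketch of both kinds of broadcasting, with invented values and only two `TP` columns to keep it short:

```python
import pandas as pd

# Invented values; the index plays the role of the site id
tp = pd.DataFrame(
    {"TP1": [10.0, 40.0, 25.0], "TP2": [12.0, 38.0, 20.0]},
    index=[1000, 1001, 1002],
)

# A scalar is broadcast down the whole column
below_max = tp["TP1"] - tp["TP1"].max()
print(below_max)

# A row is broadcast across all rows, matched on column labels
diff_from_1000 = tp - tp.loc[1000]
print(diff_from_1000)
```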

We can also apply functions to each column or row of a `DataFrame`. Let's see the median of TP1 to TP4

If we need to do more complicated calculations that Pandas does not natively support, a useful trick is to write a `lambda` function that can be applied across the DataFrame.

For instance, here we want to know the difference between the maximum and minimum values of `TP1` to `TP4`:
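A sketch of the median and of `apply` with a `lambda`, on invented `TP1` to `TP4` values:

```python
import pandas as pd

# Invented values for four trading periods
tp = pd.DataFrame(
    {
        "TP1": [10.0, 40.0, 25.0],
        "TP2": [12.0, 38.0, 20.0],
        "TP3": [11.0, 35.0, 30.0],
        "TP4": [9.0, 45.0, 22.0],
    }
)

# Built-in reduction: median of each column
medians = tp.median()

# apply() calls the lambda once per column by default
col_range = tp.apply(lambda col: col.max() - col.min())

# axis=1 calls the lambda once per row instead
row_range = tp.apply(lambda row: row.max() - row.min(), axis=1)
```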

Let's use `apply` to calculate a weighted sum of values. This is also an example of how we can write equations in Markdown:

$$SLG = \frac{TP1 + (2 \times TP2) + (3 \times TP3) + (4 \times TP4)}{TP1+TP2+TP3+TP4+100}$$
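One way to evaluate this formula row by row is sketched here with invented values; the helper name `slg` is hypothetical:

```python
import pandas as pd

# Invented values for two generators
tp = pd.DataFrame(
    {"TP1": [10.0, 0.0], "TP2": [20.0, 0.0], "TP3": [30.0, 0.0], "TP4": [40.0, 0.0]}
)

def slg(row):
    """Weighted sum of TP1..TP4 divided by their plain sum plus 100."""
    numerator = row["TP1"] + 2 * row["TP2"] + 3 * row["TP3"] + 4 * row["TP4"]
    denominator = row["TP1"] + row["TP2"] + row["TP3"] + row["TP4"] + 100
    return numerator / denominator

# axis=1 passes each row to the function
tp_slg = tp.apply(slg, axis=1)
print(tp_slg)
```

The same result can be obtained without `apply` by writing the arithmetic directly on the columns, which is usually faster on large data.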


### 3.4 Sorting and Ranking

Pandas objects include methods for re-ordering data. First, let's try to sort the index of our dataframe

What if we want to sort the columns instead?

And we can also sort values instead of sorting the index

For a `DataFrame`, we can sort according to the values of one or more columns using the `by` argument of `sort_values`:
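The sorting methods mentioned in this section can be sketched on a small invented DataFrame:

```python
import pandas as pd

# Invented values with a deliberately unordered index and column order
gen = pd.DataFrame(
    {"TP2": [5.0, 1.0, 3.0], "TP1": [20.0, 20.0, 10.0]},
    index=["c", "a", "b"],
)

# Sort the rows by index label
by_index = gen.sort_index()

# Sort the columns by column label
by_columns = gen.sort_index(axis=1)

# Sort a single Series by value, largest first
tp1_sorted = gen["TP1"].sort_values(ascending=False)

# Sort the rows by TP1, breaking ties with TP2
by_values = gen.sort_values(by=["TP1", "TP2"])
print(by_values)
```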

**Ranking** does not re-arrange data, but instead returns an index that ranks each value relative to others in the Series. Let's rank the generators according to their `TP1` values

Ties are assigned the mean value of the tied ranks, which may result in decimal values.

Alternatively, you can break ties via one of several methods, such as by the order in which they occur in the dataset:

When multiple columns are selected, the rank for each column will be returned
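A short sketch of ranking with ties, using invented `TP1` values:

```python
import pandas as pd

# Invented values; "a" and "c" are tied
tp1 = pd.Series([20.0, 10.0, 20.0, 5.0], index=["a", "b", "c", "d"])

# Default: tied values share the mean of their ranks
avg_rank = tp1.rank()
print(avg_rank)

# method="first" breaks ties by order of appearance
first_rank = tp1.rank(method="first")
print(first_rank)

# Ranking a DataFrame ranks each column separately
frame_rank = pd.DataFrame({"TP1": tp1, "TP2": [1.0, 2.0, 3.0, 4.0]}).rank()
print(frame_rank)
```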

### Exercise

Calculate another  **weighted sum** of each generator in `df_gen_unique`, and return the ordered series of estimates.

$$eq2 = \frac{TP3 + TP5^2 - 5*TP7}{TP9 - 4*TP11 + TP13 + 200}$$

In [None]:
# Write your answer here

### 3.5 Hierarchical indexing

In the electricity generation example, we combined two fields to obtain a unique index that was not simply an integer value. A more elegant way to have done this would be to create a hierarchical index from the two fields.

This index is a `MultiIndex` object that consists of a sequence of tuples, each of which is a combination of values from the two columns used to create the index. Where there are multiple repeated values, Pandas does not print the repeats, making it easy to identify groups of values.

Multi-indexing can be done when we read CSV data, too

With a hierarchical index, we can select subsets of the data based on a *partial* index:
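As a sketch, the following builds a hierarchical index from `POC_Code` and `Trading_date` on a few invented rows and then selects by a partial and by a full key:

```python
import pandas as pd

# Invented rows with column names borrowed from the generation data
gen = pd.DataFrame(
    {
        "POC_Code": ["ARG1101", "ARG1101", "HWA1101", "HWA1101"],
        "Trading_date": ["1/08/97", "2/08/97", "1/08/97", "2/08/97"],
        "TP1": [10.0, 12.0, 30.0, 28.0],
    }
)

# Two columns become the two levels of a MultiIndex
indexed = gen.set_index(["POC_Code", "Trading_date"])
print(indexed)

# Partial index: all rows for one POC_Code
one_poc = indexed.loc["ARG1101"]

# Full index: a tuple identifies a single row
one_row = indexed.loc[("HWA1101", "2/08/97")]
```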

Hierarchical indices can be created on either or both axes. Here is a trivial example:

If you want to get fancy, both the row and column indices themselves can be given names:

With this, we can do all sorts of custom indexing:

Additionally, the order of the set of indices in a hierarchical `MultiIndex` can be changed by swapping them pairwise:
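A trivial sketch with hierarchical indices on both axes follows. All level names and labels are invented for illustration:

```python
import numpy as np
import pandas as pd

# Named two-level indices for the rows and for the columns
rows = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["key", "num"])
cols = pd.MultiIndex.from_product(
    [["Hydro", "Thrml"], ["TP1", "TP2"]], names=["fuel", "period"]
)
frame = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)
print(frame)

# Select the outer row label "a", then the outer column label "Hydro"
hydro_a = frame.loc["a"]["Hydro"]

# Swap the two row levels pairwise
swapped = frame.swaplevel("key", "num")
```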

### 3.6 Missing data

The occurrence of missing data is so prevalent that it pays to use tools like Pandas, which seamlessly integrates missing data handling so that it can be dealt with easily, and in the manner required by the analysis at hand.

Missing data are represented in `Series` and `DataFrame` objects by the `NaN` floating point value. However, `None` is also treated as missing, since it is commonly used as such in other contexts (*e.g.* NumPy).

Missing values may be dropped or indexed out:

By default, `dropna` drops entire rows in which one or more values are missing.

This can be overridden by passing the `how='all'` argument, which only drops a row when every field is a missing value.

This can be customized further with the `thresh` argument, which specifies the minimum number of non-missing values a row must contain in order to be kept.

If we want to drop missing values column-wise instead of row-wise, we use `axis=1`.
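The `dropna` options described in this section can be compared on a small invented DataFrame:

```python
import numpy as np
import pandas as pd

# Invented values: row 0 is complete, row 2 is entirely missing
data = pd.DataFrame(
    {
        "TP1": [1.0, np.nan, np.nan, 4.0],
        "TP2": [2.0, 5.0, np.nan, 6.0],
        "TP3": [3.0, np.nan, np.nan, np.nan],
    }
)

# Default: drop every row that has at least one missing value
complete_rows = data.dropna()

# how="all": drop only rows in which every value is missing
not_all_missing = data.dropna(how="all")

# thresh=2: keep rows with at least two non-missing values
at_least_two = data.dropna(thresh=2)

# axis=1: drop columns (here, from the frame without the empty row)
complete_columns = not_all_missing.dropna(axis=1)
```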

Rather than omitting missing data from an analysis, in some cases it may be suitable to fill the missing value in, either with a default value (such as zero) or a value that is either imputed or carried forward/backward from similar data points. We can do this programmatically in Pandas with the `fillna` method. First, let's fill all missing values with zeros

We can also specify a different fill value for each column

Notice that `fillna` by default returns a new object with the desired filling behavior, rather than changing the `Series` or  `DataFrame` in place (**in general, we like to do this, by the way!**).

We can alter values in-place using `inplace=True`.
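A sketch of the three `fillna` variants on invented data. The use of the column mean for `TP1` is an illustrative choice, not a recommendation:

```python
import numpy as np
import pandas as pd

# Invented values with one missing entry per column
data = pd.DataFrame(
    {"Site_Code": ["HWA", None, "ARG"], "TP1": [1.0, np.nan, 3.0]}
)

# Fill a numeric column with zeros; this returns a new Series
tp1_zero = data["TP1"].fillna(0)

# A dict gives each column its own fill value
per_column = data.fillna(
    {"Site_Code": "Unknown_Site_Code", "TP1": data["TP1"].mean()}
)
print(per_column)

# So far `data` is unchanged; inplace=True modifies it directly
data.fillna({"TP1": 0.0}, inplace=True)
print(data)
```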

Missing values can also be interpolated, using any one of a variety of methods: 
Read more here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
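For comparison, a minimal sketch of linear interpolation next to a forward fill, on an invented Series:

```python
import numpy as np
import pandas as pd

# Invented values with a gap of two missing points
s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Linear interpolation (the default) spaces the gap evenly
linear = s.interpolate()
print(linear)

# Forward fill repeats the last valid value instead
carried = s.ffill()
print(carried)
```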

### 3.7 Data summarization

We often wish to summarize data in `Series` or `DataFrame` objects, so that they can more easily be understood or compared with similar data. The Pandas package contains several functions that are useful here, and several summarization or reduction methods are built into Pandas data structures.

A useful summarization that gives a quick snapshot of multiple statistics for a `Series` or `DataFrame` is `describe`:

`describe` can detect non-numeric data and sometimes yield useful information about it.
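A quick sketch of `describe` on numeric and non-numeric columns, with invented values:

```python
import pandas as pd

# Invented values: one text column and one numeric column
data = pd.DataFrame(
    {
        "Fuel_Code": ["Hydro", "Hydro", "Wind", "Gas"],
        "TP1": [10.0, 20.0, 30.0, 40.0],
    }
)

# On a DataFrame, describe() summarizes the numeric columns by default
numeric_summary = data.describe()
print(numeric_summary)

# On a text column it reports count, unique, top and freq
text_summary = data["Fuel_Code"].describe()
print(text_summary)
```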

We can also calculate summary statistics *across* multiple columns, for example, correlation and covariance.

$$cov(x,y) = \frac{1}{n-1} \sum_i (x_i - \bar{x})(y_i - \bar{y})$$

$$corr(x,y) = \frac{cov(x,y)}{s_x s_y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$

A simple function can give us the whole correlation table
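A sketch with invented values chosen so that the results are easy to check by hand. Note that pandas computes the sample covariance, normalized by n-1:

```python
import pandas as pd

# Invented values: TP2 is exactly 2 * TP1, TP3 decreases as TP1 increases
tp = pd.DataFrame(
    {
        "TP1": [1.0, 2.0, 3.0, 4.0],
        "TP2": [2.0, 4.0, 6.0, 8.0],
        "TP3": [4.0, 3.0, 2.0, 1.0],
    }
)

# Pairwise statistics between two Series
pair_cov = tp["TP1"].cov(tp["TP2"])
pair_corr = tp["TP1"].corr(tp["TP2"])

# The whole correlation table in one call
corr_table = tp.corr()
print(corr_table)
```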

If we have a `DataFrame` with a hierarchical index (or indices), summary statistics can be applied with respect to any of the index levels:
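One way to do this in current pandas is `groupby(level=...)`, sketched here on invented rows:

```python
import pandas as pd

# Invented rows under a two-level index
index = pd.MultiIndex.from_tuples(
    [("ARG1101", "1/08/97"), ("ARG1101", "2/08/97"), ("HWA1101", "1/08/97")],
    names=["POC_Code", "Trading_date"],
)
tp = pd.DataFrame({"TP1": [10.0, 12.0, 30.0]}, index=index)

# Total per POC_Code (outer level)
per_poc = tp.groupby(level="POC_Code").sum()
print(per_poc)

# Mean per Trading_date (inner level)
per_date = tp.groupby(level="Trading_date").mean()
print(per_date)
```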

### 3.8 Merging and appending data

Sometimes we have to merge data together. Pandas also supports this operation

First, we slice the data to obtain two DataFrames, for `Coal` and `Hydro` only

Now we combine `df1` and `df2`
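A sketch of both steps on invented rows. The `sites` lookup table and its `Region` values are hypothetical and only serve to show a key-based `merge` alongside `concat`:

```python
import pandas as pd

# Invented generation records
gen = pd.DataFrame(
    {
        "Site_Code": ["HLY", "ARG", "HWA", "WND"],
        "Fuel_Code": ["Coal", "Hydro", "Hydro", "Wind"],
        "TP1": [200.0, 10.0, 30.0, 5.0],
    }
)

# Slice out one DataFrame per fuel type
df1 = gen[gen["Fuel_Code"] == "Coal"]
df2 = gen[gen["Fuel_Code"] == "Hydro"]

# concat stacks the rows of both DataFrames
combined = pd.concat([df1, df2])
print(combined)

# merge joins on a shared key column (hypothetical lookup table)
sites = pd.DataFrame({"Site_Code": ["HLY", "ARG"], "Region": ["North", "South"]})
merged = combined.merge(sites, on="Site_Code", how="left")
print(merged)
```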

### 3.9 Writing Data to Files

As well as being able to read several data input formats, Pandas can also export data to a variety of storage formats. We will bring your attention to just a couple of these.

First, we do a simple export to a CSV file in the same folder as our code.

The `to_csv` method writes a `DataFrame` to a comma-separated values (csv) file. You can specify custom delimiters (via `sep` argument), how missing values are written (via `na_rep` argument), whether the index is written (via `index` argument), whether the header is included (via `header` argument), among other options.
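To show these arguments without touching the disk, this sketch writes to an in-memory buffer. Passing a file name such as `"output.csv"` (a hypothetical name) instead of the buffer would write a file:

```python
import io
import pandas as pd

# Invented values with one missing entry
data = pd.DataFrame({"TP1": [1.5, None], "TP2": [3.0, 4.0]}, index=["s1", "s2"])

# Semicolon separator, custom marker for missing values, no index column
buffer = io.StringIO()
data.to_csv(buffer, sep=";", na_rep="missing", index=False)

csv_text = buffer.getvalue()
print(csv_text)
```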

An efficient way of storing data to disk is in binary format. Pandas supports this using Python’s built-in pickle serialization.

The complement to `to_pickle` is the `read_pickle` function, which restores the pickle to a `DataFrame` or `Series`:
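A round trip through a pickle file can be sketched as follows. A temporary folder and the file name `generation.pkl` are assumptions made so that the example cleans up after itself:

```python
import os
import tempfile
import pandas as pd

# Invented values
data = pd.DataFrame({"TP1": [1.5, 2.5]}, index=["s1", "s2"])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "generation.pkl")

    # Write the DataFrame in binary form
    data.to_pickle(path)

    # Read it back into a new DataFrame
    restored = pd.read_pickle(path)

print(restored)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.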