# Pandas


## Overview

[Pandas](http://pandas.pydata.org/) is a package of fast, efficient data analysis tools for Python.

Its popularity has surged in recent years, coincident with the rise
of fields such as data science and machine learning.
 
Just as [NumPy](http://www.numpy.org/) provides the basic array data type plus core array operations, pandas

1. defines fundamental structures for working with data and  
1. endows them with methods that facilitate operations such as  
  
  - reading in data  
  - adjusting indices  
  - working with dates and time series  
  - sorting, grouping, re-ordering and general data munging <sup><a href=#mung id=mung-link>[1]</a></sup>  
  - dealing with missing values, and so on  

More sophisticated statistical functionality is left to other packages, such
as [statsmodels](http://www.statsmodels.org/) and [scikit-learn](http://scikit-learn.org/), which can work with pandas data structures.

We start by importing the usual data analysis libraries.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Series

Two important data types defined by pandas are `Series` and `DataFrame`.

You can think of a `Series` as a “column” of data, such as a collection of observations on a single variable.

Let’s start with a `Series` example:

In [3]:
s = pd.Series(np.random.randn(4), name='daily returns')
s

0    0.175195
1   -1.685238
2   -0.517588
3   -0.197971
Name: daily returns, dtype: float64

Here you can imagine the indices `0, 1, 2, 3` as indexing four listed
companies, and the values being daily returns on their shares.

Pandas `Series` are built on top of NumPy arrays and support many similar
operations.
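
As a minimal sketch of what "similar operations" means here, the block below applies elementwise arithmetic and a NumPy function to a `Series`. The values are copied from the output printed above so that the result is reproducible; the names `scaled` and `magnitudes` are illustrative only.

```python
import numpy as np
import pandas as pd

# Fixed values (copied from the printed output above) instead of random draws
s = pd.Series([0.175195, -1.685238, -0.517588, -0.197971], name='daily returns')

scaled = s * 100        # elementwise arithmetic, as with a NumPy array
magnitudes = np.abs(s)  # NumPy functions accept a Series and return a Series

print(scaled)
print(magnitudes)
```

In both cases the result keeps the index and the name of the original `Series`.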

But `Series` provide more than NumPy arrays.

For example, they have some additional (statistically oriented) methods:

In [4]:
s.describe()

count    4.000000
mean    -0.556401
std      0.804049
min     -1.685238
25%     -0.809501
50%     -0.357780
75%     -0.104680
max      0.175195
Name: daily returns, dtype: float64

Their index values also do not need to be integers, unlike the positions in a NumPy array.

Viewed in this way, `Series` are like fast, efficient Python dictionaries
(with the restriction that the items in the dictionary all have the same
type—in this case, floats).

1. How can you set the index to be 'AMZN', 'AAPL', 'MSFT', 'GOOG' (in that order)?
2. How can you pick out the `AMZN` value from the series?
3. How can you check that `AAPL` is included in the series?
4. Suppose you learn that there is a problem with the data source. How can you set the `GOOG` return to an unknown value?
5. How can you find the return with the highest value?
6. If you wanted to include the city in the index, how could this be done?
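
One possible way to approach these questions is sketched below. It is not the only solution. The values are copied from the output printed earlier, and the last step rests on an assumption: "the city" is read as a second index level holding a city for each company, and the city names used are illustrative choices that do not come from the data.

```python
import numpy as np
import pandas as pd

# Fixed values (copied from the printed output earlier) instead of random draws
s = pd.Series([0.175195, -1.685238, -0.517588, -0.197971], name='daily returns')

# 1. Replace the default integer index with ticker labels
s.index = ['AMZN', 'AAPL', 'MSFT', 'GOOG']

# 2. Look up a value by label, as with a dictionary
amzn_return = s['AMZN']

# 3. Membership tests check the index labels, not the values
has_aapl = 'AAPL' in s

# 4. Mark the GOOG observation as missing
s['GOOG'] = np.nan

# 5. Largest return and its label (missing values are skipped by default)
best_return = s.max()
best_ticker = s.idxmax()

# 6. Assumed reading: add a second index level with a city per company
cities = ['Seattle', 'Cupertino', 'Redmond', 'Mountain View']  # illustrative
s_city = s.copy()
s_city.index = pd.MultiIndex.from_arrays([s.index, cities],
                                         names=['ticker', 'city'])

print(amzn_return, has_aapl, best_ticker)
print(s_city)
```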

## DataFrames

While a `Series` is a single column of data, a `DataFrame` is several columns, one for each variable.

A `DataFrame` is an object for storing related columns of data.

In essence, a `DataFrame` in pandas is analogous to a (highly optimized) Excel spreadsheet.

Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns, often with descriptive indexes for individual rows and individual columns.
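
To make the idea of labeled rows and columns concrete, here is a small sketch that builds a `DataFrame` by hand. The numbers are the `POP` and `XRAT` entries for two countries from the `test_pwt.csv` listing in this section; the name `df_small` is illustrative only.

```python
import pandas as pd

# Two rows and two columns, with descriptive labels on both axes
df_small = pd.DataFrame(
    {'POP': [37335.653, 19053.186], 'XRAT': [0.9995, 1.72483]},
    index=['Argentina', 'Australia'],
)

print(df_small)
print(df_small.loc['Australia', 'XRAT'])  # one cell, by row and column label
print(df_small['POP'])                    # a single column is a Series
```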

Here’s the content of `test_pwt.csv`:

```text
"country","country isocode","year","POP","XRAT","tcgdp","cc","cg"
"Argentina","ARG","2000","37335.653","0.9995","295072.21869","75.716805379","5.5788042896"
"Australia","AUS","2000","19053.186","1.72483","541804.6521","67.759025993","6.7200975332"
"India","IND","2000","1006300.297","44.9416","1728144.3748","64.575551328","14.072205773"
"Israel","ISR","2000","6114.57","4.07733","129253.89423","64.436450847","10.266688415"
"Malawi","MWI","2000","11801.505","59.543808333","5026.2217836","74.707624181","11.658954494"
"South Africa","ZAF","2000","45064.098","6.93983","227242.36949","72.718710427","5.7265463933"
"United States","USA","2000","282171.957","1","9898700","72.347054303","6.0324539789"
"Uruguay","URY","2000","3219.793","12.099591667","25255.961693","78.978740282","5.108067988"
```


1. Supposing you have this data saved as `test_pwt.csv` in the `data` directory, how can you read it into a data frame `df`?
2. How can we use `.iloc` to extract rows 3 to 8 and columns 1 to 5, inclusive?
3. How can we use `df.index` and `df.loc` to select rows 3 to 8 and both the `country` and `tcgdp` columns?
4. How do we use the `country` variable as the index in this data frame?
5. How can we rename `POP` to `population` and `tcgdp` to `total GDP`?
6. The population is in thousands. How can we convert it to single units (persons in this case)?
7. Which countries have a `population` greater than 20 million?
8. How can we add a column `GDP per capita` showing real GDP per capita, in dollars per person?
9. Grouping the countries into "large" (> 20 million) and "small" (<= 20 million), what is the mean GDP per capita for each category?
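
The sketch below shows one way the steps above might be carried out. It is not the only approach, and it rests on several assumptions. The CSV text is embedded so that the code runs without the file, and the `data/test_pwt.csv` path in the comment is a guess at the layout. Rows and columns are counted from 1, so "rows 3 to 8" means positions 2 through 7. Total GDP is assumed to be recorded in millions of dollars, which this section does not state. Names such as `rows_cols`, `subset`, `large`, `size` and `mean_by_size` are illustrative only.

```python
import io
import pandas as pd

# Embedded copy of the listing above. With the file on disk this would be
# pd.read_csv('data/test_pwt.csv') (the path is an assumption).
csv_text = '''"country","country isocode","year","POP","XRAT","tcgdp","cc","cg"
"Argentina","ARG","2000","37335.653","0.9995","295072.21869","75.716805379","5.5788042896"
"Australia","AUS","2000","19053.186","1.72483","541804.6521","67.759025993","6.7200975332"
"India","IND","2000","1006300.297","44.9416","1728144.3748","64.575551328","14.072205773"
"Israel","ISR","2000","6114.57","4.07733","129253.89423","64.436450847","10.266688415"
"Malawi","MWI","2000","11801.505","59.543808333","5026.2217836","74.707624181","11.658954494"
"South Africa","ZAF","2000","45064.098","6.93983","227242.36949","72.718710427","5.7265463933"
"United States","USA","2000","282171.957","1","9898700","72.347054303","6.0324539789"
"Uruguay","URY","2000","3219.793","12.099591667","25255.961693","78.978740282","5.108067988"
'''
df = pd.read_csv(io.StringIO(csv_text))                        # step 1

# Steps 2 and 3: iloc uses 0-based positions and excludes the end point
rows_cols = df.iloc[2:8, 0:5]
subset = df.loc[df.index[2:8], ['country', 'tcgdp']]

df = df.set_index('country')                                   # step 4
df = df.rename(columns={'POP': 'population',
                        'tcgdp': 'total GDP'})                 # step 5
df['population'] = df['population'] * 1e3                      # step 6

large = df[df['population'] > 20e6]                            # step 7

# Step 8: assumes total GDP is in millions of dollars
df['GDP per capita'] = df['total GDP'] * 1e6 / df['population']

# Step 9: label each country by size, then average within each label
size = (df['population'] > 20e6).map({True: 'large', False: 'small'})
mean_by_size = df.groupby(size.rename('size'))['GDP per capita'].mean()

print(large.index.tolist())
print(mean_by_size)
```

After this runs, `df` is indexed by country and has a `GDP per capita` column.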

One of the nice things about pandas `DataFrame` and `Series` objects is that they have methods for plotting and visualization that work through Matplotlib.

For example, we can easily generate a bar plot of GDP per capita:

In [None]:
ax = df['GDP per capita'].plot(kind='bar')  # column name as defined in question 8
ax.set_xlabel('country', fontsize=12)
ax.set_ylabel('GDP per capita', fontsize=12)
plt.show()

10. At the moment the data frame is ordered alphabetically by country. How can we use `sort_values()` to order it by `GDP per capita`?
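
As a small, self-contained illustration of `sort_values()`, the sketch below uses three rows from the listing above in a separate frame called `demo` (an illustrative name). As before, it assumes that `POP` is in thousands and `tcgdp` is in millions of dollars. The descending order is a choice made for the example, not something the question requires.

```python
import pandas as pd

# Three rows from the listing: POP in thousands, tcgdp assumed in millions
demo = pd.DataFrame(
    {'POP': [37335.653, 1006300.297, 282171.957],
     'tcgdp': [295072.21869, 1728144.3748, 9898700.0]},
    index=pd.Index(['Argentina', 'India', 'United States'], name='country'),
)
demo['GDP per capita'] = demo['tcgdp'] * 1e6 / (demo['POP'] * 1e3)

# sort_values returns a new, sorted frame and leaves the original unchanged
demo_sorted = demo.sort_values(by='GDP per capita', ascending=False)

print(demo_sorted['GDP per capita'])
```

Because the result is a new object, it has to be assigned back (for example `df = df.sort_values(...)`) for later cells to see the new order.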

Plotting as before now yields:

In [None]:
ax = df['GDP per capita'].plot(kind='bar')  # column name as defined in question 8
ax.set_xlabel('country', fontsize=12)
ax.set_ylabel('GDP per capita', fontsize=12)
plt.show()

11. Which plot do you prefer, and why?