# Introductory Econometrics with Python
## Carlos Góes (andregoes@gmail.com)

# 0.2. Introduction to Pandas
## 0.2.1. What is a data frame?

When working with different statistical software packages, you will soon notice a concept that keeps coming up: "data frames." But what are data frames?

**Data frames are a way to organize data**. They are essentially matrices in which each row denotes one observation (for instance, an individual) and each column denotes one variable (a characteristic of that observation).

Think of how the Census used to be conducted before computers came around. First, an interviewer went to many different houses and wrote down the characteristics of the people living there. Then, someone had to organize the interview sheets. They did that by creating a table identifying each individual (in a row) and their respective characteristics (in a column). Finally, statisticians used those data to create summary tables, find trends, and draw correlations.

A typical dataframe looks like the figure below.

<img src="https://github.com/omercadopopular/cgoes/raw/master/IntroEconometricsPython/Figs/dfcs.PNG" width="400">

Note that there is a column, highlighted in grey, which is the index identifier. The numbers in that column identify different observations (in this case, different students). To its right lie the columns that store the characteristics of those individuals.

Try first running your finger vertically down the indices, then moving it rightwards to see the characteristics of that student. With that method, you will be able to see that student \#3 is male, majors in Chemistry, and has a 2.6 GPA.

The data frame above shows characteristics of different individuals at a particular moment in time. But another way to organize data is to observe how the characteristics of the same individual change over time. The figure below shows an example of that.

<img src="https://github.com/omercadopopular/cgoes/raw/master/IntroEconometricsPython/Figs/dfts.PNG" width="350">

If you run your finger down the indices (which now are not individuals, but years) and stop at 1997, then move your finger rightwards, you will conclude that in 1997 the GDP growth rate was -1.2% and the unemployment rate was 13.4%.

**Data frames are a central tool of statistical analysis**. We first organize the data into a data frame, then make adjustments to "clean" problems in the data (such as wrong or missing numbers), and only then conduct formal econometrics. **In applied statistics, knowing how to do these procedures is just as important as knowing how to run regressions**. If your data is garbage, your results will also be garbage, regardless of how fancy your statistical models may be.

In this lecture you will learn:

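* how to build a pandas series and look up its values by label or by position
* how to combine several series into a data frame and select rows and columns
* how to filter rows with Boolean masks and compute statistics by group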

## 0.2.2. Pandas Series

During this course, we will frequently work with Pandas. Pandas is a Python package providing fast, flexible, and expressive data structures that offer an intuitive way to work with "relational" (labeled) data. Its basic data structure is a series: a one-dimensional array (like a list or a numpy array) that has a specific index and some values. Series can store integers, floats, strings, or booleans.

Below, we first build a dictionary, then transform it into a pandas series. Using dictionaries can be quite handy, because pandas automatically uses the keys of our dictionary as the index. (You could also create a series from a list and then later set the indices manually.)

In [5]:
import pandas as pd
import numpy as np

idnumber  = {
            'Carlos Goes': '06/99209',
            "Nicolas Powidayko": '10/22290',
            "Alexander Rabbat": '08/21346',
            "Dani Alaino": '07/20345',
            "Lya Nikate": '09/23567',
            "Niz Borroz": '11/22035',
            "Tom Rundal": "98/20145"
            }

# Transform it into a Pandas Series

idnumber = pd.Series(idnumber)
idnumber

Alexander Rabbat     08/21346
Carlos Goes          06/99209
Dani Alaino          07/20345
Lya Nikate           09/23567
Nicolas Powidayko    10/22290
Niz Borroz           11/22035
Tom Rundal           98/20145
dtype: object
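
As mentioned above, a series can also start from a plain list, with the labels supplied afterwards. The snippet below is a minimal sketch of that route using three of the students; the variable name `ids` is only illustrative.

```python
import pandas as pd

# Build the series from a list; pandas assigns the positions 0, 1, 2 as the index
ids = pd.Series(['06/99209', '10/22290', '08/21346'])

# Then set the labels manually by overwriting the index
ids.index = ['Carlos Goes', 'Nicolas Powidayko', 'Alexander Rabbat']

print(ids.loc['Nicolas Powidayko'])
```

The same result can be obtained in one step by passing the labels through the `index` argument of `pd.Series`.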

You can use .loc to return the value stored under a given index label

In [6]:
idnumber.loc['Carlos Goes']

'06/99209'

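Alternatively, you can use .iloc with a numerical position (counting from zero), just like you would with a list. Here position 1 is the second entry of the alphabetically sorted index shown above.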

In [7]:
idnumber.iloc[1]

'06/99209'
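
Because .iloc counts positions in the current order of the series, the same position can point at a different student once the order changes. The sketch below illustrates this with three of the id numbers; the variable names are illustrative, and it assumes a recent pandas version, which keeps a dictionary's insertion order rather than sorting the labels as in the output above.

```python
import pandas as pd

ids = pd.Series({
    'Carlos Goes': '06/99209',
    'Nicolas Powidayko': '10/22290',
    'Alexander Rabbat': '08/21346',
})

# Lookup by label: independent of how the series is ordered
by_label = ids.loc['Carlos Goes']

# Lookup by position: depends on the current order of the series
first_current = ids.iloc[0]

# After sorting the labels alphabetically, position 0 is 'Alexander Rabbat'
first_sorted = ids.sort_index().iloc[0]

print(by_label, first_current, first_sorted)
```

When the identity of the observation matters, looking it up by label with .loc is the safer choice.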

Below, we create two more series. Note that this time we create each series directly, passing the dictionary inside the parentheses of pd.Series.

In [8]:
major = pd.Series({
        'Carlos Goes': 'Economics',
        "Nicolas Powidayko": 'Economics',
        "Alexander Rabbat": 'International Affairs',
        "Dani Alaino": 'International Affairs',
        "Lya Nikate": 'International Affairs',
        "Niz Borroz": 'International Affairs',
        "Tom Rundal": "Economics"
        })

major

Alexander Rabbat     International Affairs
Carlos Goes                      Economics
Dani Alaino          International Affairs
Lya Nikate           International Affairs
Nicolas Powidayko                Economics
Niz Borroz           International Affairs
Tom Rundal                       Economics
dtype: object

In [9]:
gpa = pd.Series({
        'Carlos Goes': 4.0,
        "Nicolas Powidayko": 3.8,
        "Alexander Rabbat": 2.8,
        "Dani Alaino": 3.4,
        "Lya Nikate": 3.3,
        "Niz Borroz": 2.0,
        "Tom Rundal": 3.0
        })

gpa

Alexander Rabbat     2.8
Carlos Goes          4.0
Dani Alaino          3.4
Lya Nikate           3.3
Nicolas Powidayko    3.8
Niz Borroz           2.0
Tom Rundal           3.0
dtype: float64

## 0.2.3. Pandas DataFrames

Okay. Now we have three series and, as you might have noticed, they hold information about the same students: their student ids, majors, and GPAs. These series look like the kind of data we usually store in a data frame.

Fortunately, pandas makes it easy to join them. We can call the pd.DataFrame constructor and pass the list of series as our data, labeling each one through the index argument. The end result will be a properly indexed table.

In [10]:
data = [idnumber, major, gpa]

df = pd.DataFrame(data, index=['idnumber','major', 'gpa'])
df

| | Alexander Rabbat | Carlos Goes | Dani Alaino | Lya Nikate | Nicolas Powidayko | Niz Borroz | Tom Rundal |
|---|---|---|---|---|---|---|---|
| idnumber | 08/21346 | 06/99209 | 07/20345 | 09/23567 | 10/22290 | 11/22035 | 98/20145 |
| major | International Affairs | Economics | International Affairs | International Affairs | Economics | International Affairs | Economics |
| gpa | 2.8 | 4 | 3.4 | 3.3 | 3.8 | 2 | 3 |


Interesting, but is there a better way to look at these data? What if we wanted the names of the individuals as the indices? No problem: we can use the .T attribute, which transposes the data frame.

In [11]:
df = df.T
df

| | idnumber | major | gpa |
|---|---|---|---|
| Alexander Rabbat | 08/21346 | International Affairs | 2.8 |
| Carlos Goes | 06/99209 | Economics | 4.0 |
| Dani Alaino | 07/20345 | International Affairs | 3.4 |
| Lya Nikate | 09/23567 | International Affairs | 3.3 |
| Nicolas Powidayko | 10/22290 | Economics | 3.8 |
| Niz Borroz | 11/22035 | International Affairs | 2.0 |
| Tom Rundal | 98/20145 | Economics | 3.0 |
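
An alternative that avoids the transpose is to pass a dictionary of series, so that each key becomes a column and the rows are aligned on the shared index. The sketch below uses two of the students; the name `students` is illustrative. One practical difference is that each column keeps its own data type, whereas transposing a table that mixes text and numbers generally leaves every column with the generic object type.

```python
import pandas as pd

# Two small series sharing the same index (a subset of the students above)
major = pd.Series({'Carlos Goes': 'Economics',
                   'Niz Borroz': 'International Affairs'})
gpa = pd.Series({'Carlos Goes': 4.0, 'Niz Borroz': 2.0})

# Each dictionary key becomes a column name; rows are aligned on the index
students = pd.DataFrame({'major': major, 'gpa': gpa})

print(students)
print(students.dtypes)  # the gpa column stays numeric
```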


In [12]:
# Now you can retrieve data in different ways

#   Using ".loc" with an index label returns the whole row for that label

df.loc['Carlos Goes']

idnumber     06/99209
major       Economics
gpa                 4
Name: Carlos Goes, dtype: object

In [13]:
# You can also ask for one specific column of a single row

df.loc['Carlos Goes', 'idnumber']

'06/99209'

In [14]:
# Using brackets, you first select the column (a series), then one index label

df['idnumber']['Carlos Goes']

'06/99209'
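
The same .loc syntax also accepts lists of labels, which returns several rows and columns at once. The sketch below rebuilds a three-student table for illustration; the names `students`, `subset`, and `gpas` are not part of the lecture's code.

```python
import pandas as pd

students = pd.DataFrame(
    {'major': ['Economics', 'International Affairs', 'Economics'],
     'gpa': [4.0, 2.0, 3.0]},
    index=['Carlos Goes', 'Niz Borroz', 'Tom Rundal'],
)

# A list of row labels plus a list of column names returns a smaller data frame
subset = students.loc[['Carlos Goes', 'Tom Rundal'], ['gpa']]

# A single column name in brackets returns that column as a series
gpas = students['gpa']

print(subset)
```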

In [15]:
# Boolean masking: comparing a column to a value returns True or False for each row

(df['major'] == 'Economics')

Alexander Rabbat     False
Carlos Goes           True
Dani Alaino          False
Lya Nikate           False
Nicolas Powidayko     True
Niz Borroz           False
Tom Rundal            True
Name: major, dtype: bool

In [16]:
# Passing the mask inside the brackets keeps only the rows where it is True
df[df['major'] == 'Economics']

| | idnumber | major | gpa |
|---|---|---|---|
| Carlos Goes | 06/99209 | Economics | 4.0 |
| Nicolas Powidayko | 10/22290 | Economics | 3.8 |
| Tom Rundal | 98/20145 | Economics | 3.0 |


In [17]:
df[df['gpa'] <= 3]

| | idnumber | major | gpa |
|---|---|---|---|
| Alexander Rabbat | 08/21346 | International Affairs | 2.8 |
| Niz Borroz | 11/22035 | International Affairs | 2.0 |
| Tom Rundal | 98/20145 | Economics | 3.0 |
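
Masks can also be combined. The sketch below rebuilds the major and GPA columns from the table above (the variable names are illustrative) and uses `&` for "and", `|` for "or", and `~` for "not". Each condition needs its own parentheses when written inline, because these operators bind more tightly than the comparisons.

```python
import pandas as pd

students = pd.DataFrame(
    {
        'major': ['International Affairs', 'Economics', 'International Affairs',
                  'International Affairs', 'Economics', 'International Affairs',
                  'Economics'],
        'gpa': [2.8, 4.0, 3.4, 3.3, 3.8, 2.0, 3.0],
    },
    index=['Alexander Rabbat', 'Carlos Goes', 'Dani Alaino', 'Lya Nikate',
           'Nicolas Powidayko', 'Niz Borroz', 'Tom Rundal'],
)

is_econ = students['major'] == 'Economics'
low_gpa = students['gpa'] <= 3

econ_and_low = students[is_econ & low_gpa]   # both conditions hold
econ_or_low = students[is_econ | low_gpa]    # at least one condition holds
not_econ = students[~is_econ]                # the mask negated

print(econ_and_low)
```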


In [18]:
# Arithmetic on a column applies to every row at once
df['gpa'] *= 2
df

| | idnumber | major | gpa |
|---|---|---|---|
| Alexander Rabbat | 08/21346 | International Affairs | 5.6 |
| Carlos Goes | 06/99209 | Economics | 8.0 |
| Dani Alaino | 07/20345 | International Affairs | 6.8 |
| Lya Nikate | 09/23567 | International Affairs | 6.6 |
| Nicolas Powidayko | 10/22290 | Economics | 7.6 |
| Niz Borroz | 11/22035 | International Affairs | 4.0 |
| Tom Rundal | 98/20145 | Economics | 6.0 |


In [19]:
df['gpa'] /= 2
df

| | idnumber | major | gpa |
|---|---|---|---|
| Alexander Rabbat | 08/21346 | International Affairs | 2.8 |
| Carlos Goes | 06/99209 | Economics | 4.0 |
| Dani Alaino | 07/20345 | International Affairs | 3.4 |
| Lya Nikate | 09/23567 | International Affairs | 3.3 |
| Nicolas Powidayko | 10/22290 | Economics | 3.8 |
| Niz Borroz | 11/22035 | International Affairs | 2.0 |
| Tom Rundal | 98/20145 | Economics | 3.0 |


In [20]:
# The mean computed by hand: the sum divided by the number of observations
np.sum(df['gpa'])/len(df['gpa'])

3.185714285714286

In [21]:
np.mean(df['gpa'])

3.1857142857142859
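
A series also carries its own summary methods, so the same mean can be obtained without calling numpy directly. The sketch below uses the seven GPA values from the table above; the variable names are illustrative.

```python
import pandas as pd

gpa = pd.Series([2.8, 4.0, 3.4, 3.3, 3.8, 2.0, 3.0])

# The series method gives the same mean as the manual calculation
mean_gpa = gpa.mean()

# describe() reports count, mean, standard deviation, min, quartiles, and max
summary = gpa.describe()

print(mean_gpa)
print(summary)
```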

In [22]:
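# Note: as a line magic, "%timeit 10" times only the expression 10,
# not the loop below (to time a whole cell, start it with "%%timeit")
%timeit 10

# Iterating over the unique majors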

for name in df['major'].unique():
    avg = np.mean(df['gpa'][df['major'] == name])
    print('Mean GPA, ' + name + ': ' + str(avg))

100000000 loops, best of 3: 6.92 ns per loop
Mean GPA, International Affairs: 2.875
Mean GPA, Economics: 3.6


In [23]:
%timeit 10

# Using groupby: each iteration yields a group name and the sub-table for that group

for name, table in df.groupby('major'):
    avg = np.mean(table['gpa'])
    print('Mean GPA, ' + name + ': ' + str(avg))

100000000 loops, best of 3: 6.76 ns per loop
Mean GPA, Economics: 3.6
Mean GPA, International Affairs: 2.875
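
The explicit loop is useful for seeing what groupby does, but the group means can also be requested in a single expression. The sketch below rebuilds the major and GPA columns from the table above (the names `students` and `means` are illustrative) and returns a series indexed by major.

```python
import pandas as pd

students = pd.DataFrame({
    'major': ['International Affairs', 'Economics', 'International Affairs',
              'International Affairs', 'Economics', 'International Affairs',
              'Economics'],
    'gpa': [2.8, 4.0, 3.4, 3.3, 3.8, 2.0, 3.0],
})

# Split the rows by major, pick the gpa column, and average within each group
means = students.groupby('major')['gpa'].mean()

print(means)
```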


In [27]:
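# Passing a list of function names creates one column per statistic
# (renaming through a dictionary is rejected by recent pandas versions)
maxmin = (df.groupby('major')['gpa']
            .agg(['max', 'min', 'count', 'sum']))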

maxmin['mean'] = maxmin['sum'] / maxmin['count'] 

maxmin

| major | max | min | count | sum | mean |
|---|---|---|---|---|---|
| Economics | 4.0 | 3.0 | 3 | 10.8 | 3.6 |
| International Affairs | 3.4 | 2.0 | 4 | 11.5 | 2.875 |
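
The mean does not have to be derived from the sum and the count: it can be requested alongside the other statistics. The sketch below assumes a pandas version that supports keyword-style ("named") aggregation, where each keyword is the output column name and its value is the function to apply. The names `students` and `stats` are illustrative.

```python
import pandas as pd

students = pd.DataFrame({
    'major': ['International Affairs', 'Economics', 'International Affairs',
              'International Affairs', 'Economics', 'International Affairs',
              'Economics'],
    'gpa': [2.8, 4.0, 3.4, 3.3, 3.8, 2.0, 3.0],
})

# keyword = output column name, value = aggregation function to apply
stats = students.groupby('major')['gpa'].agg(
    max='max',
    min='min',
    count='count',
    mean='mean',
)

print(stats)
```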
