<h1><center> PPOL 5203 Data Science I: Foundations <br><br> 
<font color='grey'> Introduction to Pandas<br><br>
Tiago Ventura</font></center></h1> 

---

**In this Notebook we cover**

- Pandas data objects and how to construct them.  
- Using indexes for Pandas data objects.

In [2]:
import pandas as pd
import numpy as np

# `pandas` Objects

Recall that `numpy` offers great flexibility and efficiency when dealing with data matrices (compared to manipulating data represented as a nested list). However, as we saw, `numpy` is limited in its capacity to deal with heterogeneous data types in a single data matrix. This is a limitation given the nature of most social science datasets. 

The `pandas` package was designed to deal with this issue. Built on top of the `numpy` library, Pandas retains and expands upon `numpy`'s functionality.

The fundamental data constructs in `pandas` are the `Series` and the `DataFrame`.

## `Series`

A `pandas` series is a **one-dimensional** labeled array. 

It is capable of holding **heterogeneous data types** (e.g. integer, boolean, strings, etc.). 

The axis of a series is called the **"index"**. It is similar to the positions in a list or `numpy` array; however, we can use other data types to serve as an index, which allows for some powerful ways of manipulating the array.

At its core, a `pandas` `Series` is nothing but a column in an Excel sheet or an `R` `data.frame`.  
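A minimal sketch of what "heterogeneous" means in practice, assuming a small made-up list of values: when the elements do not share one type, `pandas` falls back to the generic `object` dtype and each element keeps its own Python type.

```python
import pandas as pd

# A Series with a single data type gets a specific dtype.
ints = pd.Series([1, 2, 3])

# Mixing floats, a boolean, and a string falls back to the generic `object` dtype.
mixed = pd.Series([0.25, 0.5, True, "text"])

print(ints.dtype)   # an integer dtype (the exact width depends on the platform)
print(mixed.dtype)  # object

# Each element keeps its own Python type inside an object Series.
print([type(v).__name__ for v in mixed])
```

The trade-off is speed: operations on an `object` Series are generally slower than on a Series with a single numeric dtype.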

### Constructor

To construct a pandas `Series`, we can use the `pd.Series()` constructor.

**Notice:** The input to create a pandas series is a list or an array. 

If you are migrating from `R`, this is a bit of a difference, since in `R` you can use simple vectors to create a data frame column. Python does not have vectors as `R` does, only lists and arrays. These are the basic inputs to `pd.Series()`.

In [92]:
import pandas as pd
s = pd.Series(["Argentina", "France", "Germany","Spain", "Italy", "Brazil"],
                 index=[2022, 2018, 2014, 2010, 2006, 2002])
s

2022    Argentina
2018       France
2014      Germany
2010        Spain
2006        Italy
2002       Brazil
dtype: object

The `Series` combines a sequence of **values** with an explicit sequence of **indices**, which we can access with the `.values` and `.index` attributes. 
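As a small illustration of the values/index pairing, the sketch below builds a shorter version of the same series from a dictionary (keys become the index, values become the values). The variable name `champions` is only for this example.

```python
import pandas as pd

# A dict maps labels (keys) to values, so it already has the shape of a Series.
champions = pd.Series({2022: "Argentina", 2018: "France", 2014: "Germany"})

print(champions.index.tolist())  # [2022, 2018, 2014]
print(list(champions.values))    # ['Argentina', 'France', 'Germany']

# Iterating over .items() pairs each label with its value.
for year, country in champions.items():
    print(year, country)
```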

### `Series` index

Series indexing comes in three flavors: 

- explicit index
- implicit (deprecated)
- mask indexing

In [93]:
s.index

Int64Index([2022, 2018, 2014, 2010, 2006, 2002], dtype='int64')

In [98]:
# explicit
s[2022]

'Argentina'

In [113]:
# implicit (positional) slicing -> deprecated on an integer index; use s.iloc[:2] instead
s[:2]


2022    Argentina
2018       France
dtype: object

In [124]:
# masking with a boolean
s[s.index>2016]

2022    Argentina
2018       France
dtype: object

In [125]:
# combine conditions with | (or); each condition needs its own parentheses
s[((s=="Brazil") | (s=="France"))]

2018    France
2002    Brazil
dtype: object
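The parentheses are needed because Python's `|`, `&`, and `~` operators bind more tightly than comparisons such as `==`. A short sketch of the other combinations on the same series follows; `.isin()` is shown as one possible shortcut for several `|` comparisons.

```python
import pandas as pd

s = pd.Series(["Argentina", "France", "Germany", "Spain", "Italy", "Brazil"],
              index=[2022, 2018, 2014, 2010, 2006, 2002])

# & = and: entries from the 2010s only
in_2010s = s[(s.index >= 2010) & (s.index < 2020)]

# ~ = not: everything except France
not_france = s[~(s == "France")]

# .isin() is a compact alternative to chaining several | comparisons
brazil_or_france = s[s.isin(["Brazil", "France"])]

print(in_2010s.tolist())          # ['France', 'Germany', 'Spain']
print(brazil_or_france.tolist())  # ['France', 'Brazil']
```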

### `Series` values

In [11]:
s.values

array(['Argentina', 'France', 'Germany', 'Spain', 'Italy', 'Brazil'],
      dtype=object)

---

## `DataFrame`

A `pandas` `DataFrame` is a two dimensional, relational data structure with the capacity to handle heterogeneous data types.

- "relational" = each column value contained within a row entry corresponds with the same observation. 
- "two dimensional" = a matrix data structure (no $N$-dimensional arrays). The data construct can be accessed through row/column indices. 
- "heterogeneous" = different data types can be contained within each column series. This means, for example, string, integer, and boolean values can coexist in the same data structure and retain the specific properties of their data type class. 

Put simply, a `DataFrame` is a collection of pandas series where each index position corresponds to the same observation. 
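To make the "collection of series" idea concrete, here is a minimal sketch with a made-up three-column frame: each column comes back as a `Series` with its own dtype, and all columns share one row index.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"], "C": [True, False, True]})

# Each column is a Series with its own dtype...
col = df["A"]
print(type(col).__name__)  # Series
print(df.dtypes)

# ...and every column shares the same row index.
print(col.index.equals(df.index))  # True

# A row cuts across columns, so it mixes the data types of all of them.
print(df.loc[0].tolist())
```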

## How to create a DataFrame?

### Constructor

To create a pandas DataFrame, we call the `pd.DataFrame()` constructor. 


#### Construction using `dict()`
As input, we need to feed in a dictionary, where the _keys_ are the column names and the values are the relational data input.

In [7]:
my_dict = {"A":[1,2,3,4,5,6],"B":[2,3,1,.3,4,1],"C":['a','b','c','d','e','f']}
pd.DataFrame(my_dict)

   A    B  C
0  1  2.0  a
1  2  3.0  b
2  3  1.0  c
3  4  0.3  d
4  5  4.0  e
5  6  1.0  f


Data must be relational. If the dimensions do not align, an error will be thrown.

In [8]:
my_dict = {"A":[1,2,3,4,5,6],"B":[2,3,1,.3,4,1],"C":['a','b','c']}
pd.DataFrame(my_dict)

ValueError: arrays must all be same length
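One way to diagnose this error is to compare the lengths of the inputs before building the frame. The sketch below is only an illustration; the exact wording of the error message differs across `pandas` versions.

```python
import pandas as pd

columns = {"A": [1, 2, 3, 4, 5, 6], "C": ["a", "b", "c"]}

# Check the length of every column before building the DataFrame.
lengths = {name: len(values) for name, values in columns.items()}
print(lengths)  # {'A': 6, 'C': 3}

raised = False
try:
    pd.DataFrame(columns)
except ValueError as err:
    # The message text varies by pandas version, so we only print it.
    raised = True
    print("Could not build the DataFrame:", err)
```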

When constructing a DataFrame from scratch, using the dict constructor can help ease typing.  

In [9]:
pd.DataFrame(dict(A = [1,2,3],B = ['a','b','c']))

   A  B
0  1  a
1  2  b
2  3  c


#### Construction using `list()`

Likewise, we can simply input a list, and the `DataFrame` will put a 0-based integer index by default.

In [10]:
my_list = [4,4,5,6,7]
pd.DataFrame(my_list)

   0
0  4
1  4
2  5
3  6
4  7


The same holds if we feed in a nested list structure.

In [11]:
nested_list = np.random.randint(1,10,25).reshape(5,5).tolist()
nested_list

[[8, 3, 9, 3, 8],
 [9, 7, 3, 5, 9],
 [7, 8, 7, 6, 8],
 [2, 3, 9, 1, 4],
 [2, 4, 9, 7, 2]]

In [12]:
pd.DataFrame(nested_list)

   0  1  2  3  4
0  8  3  9  3  8
1  9  7  3  5  9
2  7  8  7  6  8
3  2  3  9  1  4
4  2  4  9  7  2


To overwrite the default indexing protocol, we can **provide a list of column names** to correspond to each column index position. 

In [13]:
col_names = [f'Var{i}' for i in range(1,6)]
D = pd.DataFrame(nested_list,columns=col_names)
D

   Var1  Var2  Var3  Var4  Var5
0     8     3     9     3     8
1     9     7     3     5     9
2     7     8     7     6     8
3     2     3     9     1     4
4     2     4     9     7     2


### From Nested Data to DataFrames

In real cases, your data will rarely come from ONE list or ONE dictionary. 

Imagine you are scraping data from the web, and at every iteration you want to add a new row to your dataframe. You have nested data, in which the same columns repeat at every iteration.

For these cases, there are in general two abstract approaches to go from nested data (as lists or dictionaries) to a `pandas` `DataFrame`. 

- **Column-Wise:** From a dictionary of lists
- **Row-Wise:** From a list of dictionaries



### Column-Wise

In this approach, your data goes from a **dictionary of lists** to a `DataFrame`.

**Your input is organized by columns!**

- Keys -> Columns Names
- Lists (values) -> Values for every observation **i** on your data. 


In [2]:
# create a dictionary of lists
dict_ = {"names":["Darrow", "Adrius", "Sevro", "Virginia", "Victra"],
        "nickname":["The Reaper", "Jakal", "Goblin", "Mustang", "NaN"], 
        "house":["Mars", "Venus", "Mars", "Mercury", "Jupiter"], 
        "color": ["Red", "Gold", "Gold", "Gold", "Gold"]}

# create a dataframe
pd.DataFrame(dict_)


      names    nickname    house color
0    Darrow  The Reaper     Mars   Red
1    Adrius       Jakal    Venus  Gold
2     Sevro      Goblin     Mars  Gold
3  Virginia     Mustang  Mercury  Gold
4    Victra         NaN  Jupiter  Gold


### Row-Wise

In this approach, your data goes from a **list of dictionaries** to a DataFrame. 

**Your input is organized by observations**

- Keys of each dictionary -> Columns names
- Values of each dictionary -> Values for every observation i in column j on your data.

In [22]:
# create a list of dictionaries (same data as the column-wise example)
list_ = [{"names":"Darrow", "nickname":"The Reaper", "house":"Mars", "color":"Red"}, # obs 1
         {"names":"Adrius", "nickname":"Jakal", "house":"Venus", "color":"Gold"}, # obs 2
         {"names":"Sevro", "nickname":"Goblin", "house":"Mars", "color":"Gold"}, # obs 3
         {"names":"Virginia", "nickname":"Mustang", "house":"Mercury", "color":"Gold"}, # obs 4
         {"names":"Victra", "nickname":"NaN", "house":"Jupiter", "color":"Gold"} # obs 5
        ] 

# create a dataframe
D = pd.DataFrame(list_)


The same building approach can be done with a list of lists + a list of names. 

In [69]:
# list of lists + list of names
list_names= ["names","nickname", "house", "color"]
list_of_lists = [
                 ["Darrow","The Reaper", "Mars", "red"], # obs 1
                 ["Adrius", "Jakal", "Venus","gold"], # obs 2
                 ["Sevro", "Goblin", "Mars", "gold"], # obs 3
                 ["Virginia", "Mustang", "Mercury", "gold"], # obs 4
                 ["Victra", "NaN","Jupiter", "gold"] # obs 5
                ]

# create a dataframe
pd.DataFrame(list_of_lists, columns=list_names)


      names    nickname    house color
0    Darrow  The Reaper     Mars   red
1    Adrius       Jakal    Venus  gold
2     Sevro      Goblin     Mars  gold
3  Virginia     Mustang  Mercury  gold
4    Victra         NaN  Jupiter  gold


Keep these approaches in mind. The conversion of nested data to `DataFrames` will be useful when you are collecting your own data, particularly when using web scraping and loops. 
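A rough sketch of the row-wise pattern inside a loop. The `raw_records` list is a hypothetical stand-in for whatever each iteration of a scraper would return; the point is to collect one dictionary per observation and build the `DataFrame` once at the end.

```python
import pandas as pd

# Hypothetical stand-in for records collected one at a time (e.g. one per scraped page).
raw_records = [
    ("Darrow", "Mars", "The Reaper"),
    ("Sevro", "Mars", "Goblin"),
    ("Victra", "Jupiter", None),  # no nickname found for this record
]

rows = []  # list of dictionaries: one dictionary per observation
for name, house, nickname in raw_records:
    row = {"names": name, "house": house}
    if nickname is not None:
        row["nickname"] = nickname  # only add the key when a value exists
    rows.append(row)

# Build the DataFrame once, after the loop has finished.
scraped = pd.DataFrame(rows)
print(scraped)

# Keys missing from a dictionary become missing values in that row.
print(scraped["nickname"].isna().tolist())  # [False, False, True]
```

Appending to a plain list and converting once is usually simpler than growing a `DataFrame` row by row inside the loop.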

### `DataFrame` index

Unlike with a `numpy` array, we cannot simply call a row index position. 

 <span style='color:red'> **No implicit index with DataFrames!!!**</span>

In [128]:
D[1,:]

InvalidIndexError: (1, slice(None, None, None))

This is because indexing a `DataFrame` directly with `[]` refers to the column index. This might seem odd at first, but think back to the behavior of Python dictionaries, which a `DataFrame` resembles: the key is the default way to look up a value (immutable keys allow efficient lookups in a dictionary).

In [26]:
D['names']

0      Darrow
1      Adrius
2       Sevro
3    Virginia
4      Victra
Name: names, dtype: object
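The dictionary analogy can be checked directly. The following sketch uses a two-row version of the data and shows that membership tests and iteration operate on the column labels, just as they operate on keys in a `dict`.

```python
import pandas as pd

D = pd.DataFrame({"names": ["Darrow", "Adrius"], "house": ["Mars", "Venus"]})

# Membership tests look at the column labels, like keys in a dict.
print("names" in D)   # True
print("Darrow" in D)  # False: values are not keys

# Iterating yields the column labels.
print([col for col in D])  # ['names', 'house']

# A single label returns a Series; a list of labels returns a DataFrame.
print(type(D["names"]).__name__, type(D[["names"]]).__name__)  # Series DataFrame
```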

Always remember that there are **2 indices** in a `DataFrame` that we must keep track of: the row `index` and the `column` index.

In [27]:
# Row index
D.index

RangeIndex(start=0, stop=5, step=1)

In [28]:
# column index
D.columns

Index(['names', 'nickname', 'house', 'color'], dtype='object')

### <span style='color:red'> Alert: </span> loc and iloc?

To **access** rows and columns by index in a `DataFrame`, we use two built-in indexers:

- `.iloc[]` = use the numerical index position to call to locations in the `DataFrame`. (_The `i` is short for `integer`, as in integer position._)
- `.loc[]` = use the labels to call to the location in the data frame. 


In [36]:
D.iloc[1,3]

'Gold'

In [37]:
D.loc[:3,['names','house']]

      names    house
0    Darrow     Mars
1    Adrius    Venus
2     Sevro     Mars
3  Virginia  Mercury


A few things to note about `.loc[]`

- It selects by label, and a slice includes both endpoints. Above, `:3` returns the rows labeled 0 through 3, whereas a positional slice stops one short of the stop value. This is because `.loc[]` treats the index as labels rather than positions. 
- Selecting ranges from labeled indices works the same as with numerical indices. That is, we can select all variables between two column labels (see below). 

In [38]:
D.loc[:,'names':'house']

      names    nickname    house
0    Darrow  The Reaper     Mars
1    Adrius       Jakal    Venus
2     Sevro      Goblin     Mars
3  Virginia     Mustang  Mercury
4    Victra         NaN  Jupiter
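The difference between labels and positions is easiest to see when the row labels are integers that do not match the positions. The frame below is hypothetical and exists only to make that contrast visible.

```python
import pandas as pd

# Hypothetical frame whose integer labels differ from the row positions.
df = pd.DataFrame({"x": ["a", "b", "c", "d"]}, index=[10, 20, 30, 40])

by_position = df.iloc[0:2]  # positions 0 and 1; the stop position is excluded
by_label = df.loc[10:30]    # labels 10 through 30; the stop label is included

print(by_position["x"].tolist())  # ['a', 'b']
print(by_label["x"].tolist())     # ['a', 'b', 'c']

# A label that does not exist raises a KeyError with .loc.
try:
    df.loc[0]
except KeyError:
    print("0 is a position here, not a label")
```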


As with a `Series`, we can redefine the row and column indices.

In [53]:
dict_

{'names': ['Darrow', 'Adrius', 'Sevro', 'Virginia', 'Victra'],
 'nickname': ['The Reaper', 'Jakal', 'Goblin', 'Mustang', 'NaN'],
 'house': ['Mars', 'Venus', 'Mars', 'Mercury', 'Jupiter'],
 'color': ['Red', 'Gold', 'Gold', 'Gold', 'Gold']}

In [54]:
D2 = pd.DataFrame(dict_,
                 index=["d","a", "s", "v", "v2"])
D2

       names    nickname    house color
d     Darrow  The Reaper     Mars   Red
a     Adrius       Jakal    Venus  Gold
s      Sevro      Goblin     Mars  Gold
v   Virginia     Mustang  Mercury  Gold
v2    Victra         NaN  Jupiter  Gold


In [55]:
# We can use the named indices for look up (and as with numpy, column rearrangement).
D2.loc[["d","a","v"],"names":"house"]

      names    nickname    house
d    Darrow  The Reaper     Mars
a    Adrius       Jakal    Venus
v  Virginia     Mustang  Mercury


In [57]:
# notice, we cannot use the numbers with loc anymore
D2.loc[1:2,"names":"house"]

TypeError: cannot do slice indexing on Index with these indexers [1] of type int
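Positions are still available through `.iloc[]` after relabeling the rows. A small sketch on a three-row version of `D2`, which also uses `Index.get_loc()` to translate a label into its position:

```python
import pandas as pd

D2 = pd.DataFrame({"names": ["Darrow", "Adrius", "Sevro"],
                   "house": ["Mars", "Venus", "Mars"]},
                  index=["d", "a", "s"])

# Position-based: rows at positions 1 and 2, whatever their labels are.
print(D2.iloc[1:3])

# Label-based: the slice runs from label "a" through label "s", both included.
print(D2.loc["a":"s"])

# Translate a label into its position when you need to mix the two.
pos = D2.index.get_loc("s")
print(pos)  # 2
```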

#### And Series? You told us before that implicit indexes were deprecated...

Yes... you can also use `.loc` and `.iloc` with a `Series`. 

In [129]:
# get our old series
s

2022    Argentina
2018       France
2014      Germany
2010        Spain
2006        Italy
2002       Brazil
dtype: object

In [132]:
# iloc: position-based slice (the replacement for the deprecated s[:2])
s.iloc[:2]

2022    Argentina
2018       France
dtype: object

In [134]:
# loc
s.loc[2022]

'Argentina'
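Because this series uses integers as labels, being explicit matters: `.loc` always reads the number as a label and `.iloc` always reads it as a position. A short sketch on the same series:

```python
import pandas as pd

s = pd.Series(["Argentina", "France", "Germany", "Spain", "Italy", "Brazil"],
              index=[2022, 2018, 2014, 2010, 2006, 2002])

print(s.loc[2018])  # France: the row labeled 2018
print(s.iloc[1])    # France: the row at the second position

# Label slices include both endpoints; position slices exclude the stop.
print(s.loc[2018:2010].tolist())  # ['France', 'Germany', 'Spain']
print(s.iloc[1:3].tolist())       # ['France', 'Germany']
```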

#### Setting the index as the unit of observation

We can set the index to our unit of observation, which keeps row lookups consistent, clean, and easy to use.

In [70]:
dat = D.set_index('names')
dat

            nickname    house color
names
Darrow    The Reaper     Mars   Red
Adrius         Jakal    Venus  Gold
Sevro         Goblin     Mars  Gold
Virginia     Mustang  Mercury  Gold
Victra           NaN  Jupiter  Gold


In [71]:
dat.loc['Darrow',:]

nickname    The Reaper
house             Mars
color              Red
Name: Darrow, dtype: object
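An index works best as a unit of observation when its labels are unique. The sketch below, on a three-row version of the data, checks uniqueness first and uses the `verify_integrity` argument of `.set_index()` as a guard; it also shows that `.set_index()` returns a new object rather than changing the original.

```python
import pandas as pd

D = pd.DataFrame({"names": ["Darrow", "Adrius", "Sevro"],
                  "house": ["Mars", "Venus", "Mars"]})

# "names" has no repeated values, so it identifies each row.
print(D["names"].is_unique)  # True
print(D["house"].is_unique)  # False: "Mars" appears twice

# verify_integrity=True raises a ValueError if the new index has duplicates.
dat = D.set_index("names", verify_integrity=True)

# set_index returns a new object; D itself keeps its 0-based index.
print(list(D.columns))    # ['names', 'house']
print(list(dat.columns))  # ['house']
```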

Reverting the index back to its original 0-based index is straightforward with the `.reset_index()` method.

In [72]:
dat = dat.reset_index()
dat

      names    nickname    house color
0    Darrow  The Reaper     Mars   Red
1    Adrius       Jakal    Venus  Gold
2     Sevro      Goblin     Mars  Gold
3  Virginia     Mustang  Mercury  Gold
4    Victra         NaN  Jupiter  Gold


#### Hierarchical (multi-) index

In [73]:
dat = D.set_index(keys=['names', 'house'])
dat

                    nickname color
names    house
Darrow   Mars     The Reaper   Red
Adrius   Venus         Jakal  Gold
Sevro    Mars         Goblin  Gold
Virginia Mercury     Mustang  Gold
Victra   Jupiter         NaN  Gold


We can see that the index is composed of two levels.

In [74]:
dat.index

MultiIndex([(  'Darrow',    'Mars'),
            (  'Adrius',   'Venus'),
            (   'Sevro',    'Mars'),
            ('Virginia', 'Mercury'),
            (  'Victra', 'Jupiter')],
           names=['names', 'house'])

Under the hood, the hierarchical indices are actually tuples.

In [76]:
dat.loc[("Darrow","Mars"),:]

nickname    The Reaper
color              Red
Name: (Darrow, Mars), dtype: object

We can use boolean lookups on the level values 

In [77]:
dat.loc[dat.index.get_level_values('house') == 'Mars',:]

                nickname color
names  house
Darrow Mars   The Reaper   Red
Sevro  Mars       Goblin  Gold
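For lookups on a single level, the `.xs()` (cross-section) method is one alternative to the boolean mask. The sketch below uses a three-row version of the data and sorts the index first, which keeps partial lookups on a `MultiIndex` well behaved.

```python
import pandas as pd

D = pd.DataFrame({"names": ["Darrow", "Adrius", "Sevro"],
                  "house": ["Mars", "Venus", "Mars"],
                  "color": ["Red", "Gold", "Gold"]})
dat = D.set_index(["names", "house"]).sort_index()

# Cross-section: all rows whose "house" level equals "Mars".
# By default .xs() drops the level that was selected on.
mars = dat.xs("Mars", level="house")
print(mars)

# Selecting on the first level alone returns the matching inner levels.
darrow = dat.loc["Darrow"]
print(darrow)
```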


Finally, we can easily sort and order the index. 

In [79]:
dat.sort_index()

                    nickname color
names    house
Adrius   Venus         Jakal  Gold
Darrow   Mars     The Reaper   Red
Sevro    Mars         Goblin  Gold
Victra   Jupiter         NaN  Gold
Virginia Mercury     Mustang  Gold


As before, if we wish to revert the index back to a 0-based integer, we can with `.reset_index()`

In [82]:
# inplace=True modifies dat itself and returns None, so this cell displays nothing
dat.reset_index(inplace=True)


In [81]:
dat

      names    house    nickname color
0    Darrow     Mars  The Reaper   Red
1    Adrius    Venus       Jakal  Gold
2     Sevro     Mars      Goblin  Gold
3  Virginia  Mercury     Mustang  Gold
4    Victra  Jupiter         NaN  Gold
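A brief sketch of the two ways to keep the result of `.reset_index()`, on a two-row version of the data: assign the returned frame, or pass `inplace=True`, in which case the method returns `None`. The `drop=True` argument is shown as well, since it discards the old index instead of restoring it as columns.

```python
import pandas as pd

D = pd.DataFrame({"names": ["Darrow", "Adrius"], "house": ["Mars", "Venus"]})
dat = D.set_index(["names", "house"])

# Without inplace, reset_index returns a new DataFrame that we assign to a name.
flat = dat.reset_index()
print(list(flat.columns))  # ['names', 'house']

# With inplace=True the method returns None and changes dat itself.
result = dat.reset_index(inplace=True)
print(result)             # None
print(list(dat.columns))  # ['names', 'house']

# drop=True discards the old index instead of turning it into columns.
dropped = D.set_index("names").reset_index(drop=True)
print(list(dropped.columns))  # ['house']
```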


### Column Indices ("column names")

As seen, `pandas` keeps track of column features using the column index. We can access the column index at any time using the `.columns` attribute.

In [83]:
dat.columns

Index(['names', 'house', 'nickname', 'color'], dtype='object')

Or we can pass the dataframe to the `list()` constructor, which returns the column names (recall that a `DataFrame` behaves like a `dict`, so iterating over it yields its keys).

In [84]:
list(dat)

['names', 'house', 'nickname', 'color']

**Overwriting column names**: below let's set all of the columns to be upper case. Note that we can invoke a `.str` method that gives access to all of the string data type methods. 

In [85]:
dat.columns = dat.columns.str.upper()
dat.columns

Index(['NAMES', 'HOUSE', 'NICKNAME', 'COLOR'], dtype='object')

In [86]:
dat

      NAMES    HOUSE    NICKNAME COLOR
0    Darrow     Mars  The Reaper   Red
1    Adrius    Venus       Jakal  Gold
2     Sevro     Mars      Goblin  Gold
3  Virginia  Mercury     Mustang  Gold
4    Victra  Jupiter         NaN  Gold
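The `.str` methods can be chained, which is handy for cleaning column labels that arrive with stray spaces or mixed case. The labels below are hypothetical and only serve to illustrate the chain.

```python
import pandas as pd

# Hypothetical messy column labels.
df = pd.DataFrame([[1, 2, 3]], columns=[" First Name", "House ", "Nick Name"])

# Chain string methods: trim whitespace, lower-case, replace spaces with underscores.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
print(list(df.columns))  # ['first_name', 'house', 'nick_name']
```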


But note that the **column index is _not mutable_**. Recall that _values_ are mutable in a dictionary, but the _keys_ are not.

In [87]:
dat.columns[dat.columns == "NAMES"] = "NAME"

TypeError: Index does not support mutable operations
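A minimal sketch of the same error on a one-row frame, together with one possible workaround: copy the labels into a plain list (which is mutable), edit the list, and assign the whole list back.

```python
import pandas as pd

dat = pd.DataFrame({"NAMES": ["Darrow"], "HOUSE": ["Mars"]})

try:
    dat.columns[0] = "NAME"  # item assignment on an Index is not allowed
except TypeError as err:
    print("TypeError:", err)

# Workaround: edit a plain list, then assign the whole list back.
new_cols = list(dat.columns)
new_cols[0] = "NAME"
dat.columns = new_cols
print(list(dat.columns))  # ['NAME', 'HOUSE']
```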

We either have to replace all the keys (as we do above), or use the `.rename()` method to rename a specific data feature by passing it a `dict` with the new renaming convention. 

```
data.rename(columns = {'old_name':'new_name'})
```

In [88]:
dat.rename(columns={"NAMES":"NAME"},
             inplace=True) # Makes the change in-place rather than making a copy
dat

       NAME    HOUSE    NICKNAME COLOR
0    Darrow     Mars  The Reaper   Red
1    Adrius    Venus       Jakal  Gold
2     Sevro     Mars      Goblin  Gold
3  Virginia  Mercury     Mustang  Gold
4    Victra  Jupiter         NaN  Gold
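Two further options of `.rename()` may be useful, sketched here on a one-row frame: a function can be passed instead of a `dict`, and `errors="raise"` turns a misspelled column name into an error instead of a silent no-op.

```python
import pandas as pd

dat = pd.DataFrame({"NAME": ["Darrow"], "HOUSE": ["Mars"]})

# A function is applied to every column label.
lower = dat.rename(columns=str.lower)
print(list(lower.columns))  # ['name', 'house']

# By default, keys that match no column are silently ignored...
same = dat.rename(columns={"POPULATION": "POP"})
print(list(same.columns))  # ['NAME', 'HOUSE']

# ...while errors="raise" turns the mismatch into a KeyError.
caught = False
try:
    dat.rename(columns={"POPULATION": "POP"}, errors="raise")
except KeyError as err:
    caught = True
    print("KeyError:", err)
```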


Similar to row indices, we can generate **hierarchical column indices** as well. (As we'll see this will be the default column index output when aggregating variables next time).
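As a preview, here is a rough sketch of how aggregation can produce a hierarchical column index. The numeric column `age` is hypothetical, added only so there is something to aggregate.

```python
import pandas as pd

# Hypothetical numeric column "age", added only to have something to aggregate.
df = pd.DataFrame({"house": ["Mars", "Venus", "Mars"],
                   "age": [20, 22, 18]})

# Requesting several statistics for one column yields two column levels.
summary = df.groupby("house").agg({"age": ["min", "max"]})
print(summary.columns.tolist())  # [('age', 'min'), ('age', 'max')]

# As with hierarchical rows, a tuple addresses one column of the hierarchy.
print(summary.loc["Mars", ("age", "max")])  # 20
```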

### That's it for today! Next week we will see more and more of `Pandas DataFrames`