# Intro to pandas


We will learn the key data structures in **pandas**, along with their attributes and methods. We will then learn how to select data in a **`DataFrame`** and run computations on the selection. Finally, we will cover some basics of estimating and forecasting autoregressive models in Python.

Our goal with this section is to give you the skills you need to, at a minimum, replicate your current data analysis workflow in Python right away.


---
## Basics

First, let's import the libraries and give them short aliases, as we did in the intro to NumPy notebook.


In [27]:
import numpy as np  # import the library of numerical programming
import matplotlib.pyplot as plt # import the library for plotting
import pandas as pd # import the library of data management
import seaborn as sns # import the library of beautiful plots
#import pandas_datareader as pdr # import package for extracting data from online sources
import datetime as dt # import package for working with dates

# set the figures display in the talk context
sns.set_context("talk") 
sns.set_style("darkgrid")

Let's first get to know the two most important data structures in `Pandas`.

### Series

The Series is the primary building block of pandas. A Series is a single column of data, with row labels for each observation. Pandas refers to the row labels as the index of the Series.

A Series can be created from a scalar value, a NumPy ndarray, a Python list, or a Python dict.
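As a quick illustration of these four options, here is a minimal sketch. The variable names and the index labels `"a"`, `"b"`, `"c"` are made up for this example; the numbers are borrowed from the cells in this notebook.

```python
import numpy as np
import pandas as pd

# From a scalar: the value is repeated for every index label supplied
s_scalar = pd.Series(1.0, index=["a", "b", "c"])

# From a NumPy array or a list: the default index is 0, 1, 2, ...
s_array = pd.Series(np.array([127.5, 169.3, 217.488]))
s_list = pd.Series([1990, 2000, 2010])

# From a dict: the keys become the index labels
s_dict = pd.Series({1990: 5974.7, 2000: 10031.0, 2010: 14681.1})

print(s_scalar.tolist())      # [1.0, 1.0, 1.0]
print(s_dict.index.tolist())  # [1990, 2000, 2010]
```

Note that with a dict, each key maps to exactly one entry of the Series.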

In [28]:
# create a series from a dictionary
gdp = {"GDP": [5974.7, 10031.0, 14681.1]} 
# what kind of data structure is this
print(type(gdp))

<class 'dict'>


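As expected, it is a dictionary whose single key, `"GDP"`, maps to a list. Passing it to `pd.Series` turns each key into an index label and each value into one entry, so here the whole list becomes a single element of the Series.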

In [29]:
# 'name' parameter specifies the column name of the series object
gdp_s= pd.Series(gdp,name='GDP')
print(type(gdp_s))

print(gdp_s)

<class 'pandas.core.series.Series'>
GDP    [5974.7, 10031.0, 14681.1]
Name: GDP, dtype: object


In [30]:
# create series from a list
cpi = [127.5, 169.3, 217.488]
cpi_s= pd.Series(cpi,name='CPI')

year = [1990, 2000, 2010]
country = ["US", "US", "US"]

year_s = pd.Series(year,name='Year')
country_s = pd.Series(country,name='Country')

In [31]:
print(cpi_s, year_s, country_s)

0    127.500
1    169.300
2    217.488
Name: CPI, dtype: float64 0    1990
1    2000
2    2010
Name: Year, dtype: int64 0    US
1    US
2    US
Name: Country, dtype: object


We can also create a Series with a specified index. Below, we create a Series which contains the US unemployment rate every other year starting in 1995.

In [32]:
values = [5.6, 5.3, 4.3, 4.2, 5.8, 5.3, 4.6, 7.8, 9.1, 8., 5.7]
years = list(range(1995, 2017, 2))

unemp = pd.Series(data=values, index=years, name="Unemployment")

In [33]:
unemp

1995    5.6
1997    5.3
1999    4.3
2001    4.2
2003    5.8
2005    5.3
2007    4.6
2009    7.8
2011    9.1
2013    8.0
2015    5.7
Name: Unemployment, dtype: float64

You may wonder what the `name` argument is for. It gives a name to the Series, i.e. to the column, so that when you put the Series into a `DataFrame`, the column is labeled with that name.

In [34]:
df = pd.DataFrame(unemp)

In [35]:
df

Unnamed: 0,Unemployment
1995,5.6
1997,5.3
1999,4.3
2001,4.2
2003,5.8
2005,5.3
2007,4.6
2009,7.8
2011,9.1
2013,8.0
2015,5.7


### DataFrame

A `DataFrame` is essentially a table of data, or multiple Series placed side by side as columns, while a `Series` can be thought of as a one-column `DataFrame`. A `DataFrame` is therefore similar to a sheet in an Excel workbook or a table in a SQL database.

In addition to row labels (an index), DataFrames also have column labels. We refer to these column labels as the columns or column names.

A `DataFrame` can be created from one or more Series or from a Python dict.

Let's start by creating a `DataFrame` from the Series created earlier, using the pandas [concat](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) function. It concatenates pandas objects, e.g., Series or DataFrames, along a particular axis.


In [36]:
Series_Df = pd.concat([year_s,country_s],axis=1)
print(Series_Df)

   Year Country
0  1990      US
1  2000      US
2  2010      US


In [37]:
Series_Df = pd.concat([year_s,country_s],axis=0)
print(Series_Df)

0    1990
1    2000
2    2010
0      US
1      US
2      US
dtype: object


Be careful when using `pd.concat`: by default, it concatenates along the rows (`axis=0`), so omitting `axis` gives the same result as the previous cell.

In [38]:
Series_Df = pd.concat([year_s,country_s])
print(Series_Df)

0    1990
1    2000
2    2010
0      US
1      US
2      US
dtype: object
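Notice that the row-wise result repeats the labels 0, 1, 2. The sketch below rebuilds the two Series from above and compares the default behavior with `ignore_index=True`, which renumbers the rows, and with `axis=1`. The names `stacked`, `renumbered`, and `side_by_side` are only for this illustration.

```python
import pandas as pd

year_s = pd.Series([1990, 2000, 2010], name="Year")
country_s = pd.Series(["US", "US", "US"], name="Country")

# Default (axis=0): stacked vertically, original labels kept, so 0, 1, 2 repeat
stacked = pd.concat([year_s, country_s])
print(stacked.index.tolist())     # [0, 1, 2, 0, 1, 2]

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([year_s, country_s], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5]

# axis=1: placed side by side, one column per Series
side_by_side = pd.concat([year_s, country_s], axis=1)
print(side_by_side.shape)         # (3, 2)
```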


Now let's create a `DataFrame` object from a dictionary.

In [39]:
data = {"GDP": [5974.7, 10031.0, 14681.1],
        "CPI": [127.5, 169.3, 217.488],
        "Year": [1990, 2000, 2010],
        "Country": ["US", "US", "US"]}

Now we convert `data` to a `DataFrame`, which is the key object in pandas.

In [40]:
df = pd.DataFrame(data)

In [41]:
df

Unnamed: 0,GDP,CPI,Year,Country
0,5974.7,127.5,1990,US
1,10031.0,169.3,2000,US
2,14681.1,217.488,2010,US


We can also go the other way: selecting a single column of a `DataFrame` returns a `Series`. Selecting with a list of column names (double brackets) returns a `DataFrame` instead.

In [42]:
df['GDP']

0     5974.7
1    10031.0
2    14681.1
Name: GDP, dtype: float64

In [43]:
type(df['GDP'])

pandas.core.series.Series

In [44]:
df[['GDP']]

Unnamed: 0,GDP
0,5974.7
1,10031.0
2,14681.1


In [45]:
type(df[['GDP']])

pandas.core.frame.DataFrame

##### Columns in `DataFrame`

In [46]:
df

Unnamed: 0,GDP,CPI,Year,Country
0,5974.7,127.5,1990,US
1,10031.0,169.3,2000,US
2,14681.1,217.488,2010,US


This lays out the data in an intuitive way: each column is labeled with its name, just as it would be in an Excel sheet.

#### `DataFrame` Attributes and Methods

Remember that in Python a `DataFrame` is an object, and with that object come attributes and methods (so far we have seen few attributes but many methods).


In [47]:
df

Unnamed: 0,GDP,CPI,Year,Country
0,5974.7,127.5,1990,US
1,10031.0,169.3,2000,US
2,14681.1,217.488,2010,US


In [48]:
print(df.shape)
# Note that shape is an attribute, not a method: a method is called with (),
# possibly with arguments, whereas an attribute is simply looked up on df

(3, 4)


In [49]:
print(df.columns) # returns an Index object, which we can convert to a list
print(df.columns.tolist())

Index(['GDP', 'CPI', 'Year', 'Country'], dtype='object')
['GDP', 'CPI', 'Year', 'Country']


In [50]:
print(df.index) # a RangeIndex, which behaves like Python's range but lives in pandas
print(df.index.tolist())

RangeIndex(start=0, stop=3, step=1)
[0, 1, 2]


In [51]:
print(df.dtypes) # this is an attribute of the dataframe, similar to type

GDP        float64
CPI        float64
Year         int64
Country     object
dtype: object


This is interesting. The numeric columns are reported as `float64` or `int64`,
which is what we want. The string column `Country`, however, is reported as
`object`, NOT as a string type. Pandas does this: (i) if every value in a column
is a number, the column gets a numeric dtype; (ii) if not, it falls back to `object`.
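To see this rule at work, here is a small sketch with three made-up Series: one with only integers, one mixing integers and a float, and one where a single entry is a string.

```python
import pandas as pd

numbers = pd.Series([1990, 2000, 2010])     # only integers
floats = pd.Series([1990, 2000.5, 2010])    # integers mixed with a float
mixed = pd.Series([1990, "2000", 2010])     # one entry is a string

print(numbers.dtype)  # an integer dtype
print(floats.dtype)   # a float dtype: the integers are promoted
print(mixed.dtype)    # object: a single non-number is enough
```

A column that unexpectedly shows up as `object` is therefore often a sign that some entries were read as text.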

---
### Time to practice


**Exercises.** 

1. Load the dataframe `Average Value Weighted Returns -- Monthly` from the `'6_Portfolios_2x3'` in the Fama-French datasets. The sample period is from Jul 1963 - Sep 2018. 

In [52]:
port6_df = pd.read_csv("six_Portfolios.csv")


In [None]:
port6_df.head()

In [None]:
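# convert the dates to strings so they can be parsed as monthly periods
port6_df['Date'] = port6_df['Date'].astype(str)
port6_df['period'] = pd.PeriodIndex(port6_df['Date'],freq='M')  # 'M' = monthly
port6_df['Date'] = port6_df['period']
# use the monthly period as the row index and drop the helper column
port6_df.set_index(['Date'],inplace=True)
port6_df.drop(columns=['period'],inplace=True)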

2. What are the dimensions of the dataframe?

3. What dtypes are the variables? What do they mean?

4. What does `dataframe_name.columns.tolist()` do? How does it compare to `list(dataframe_name)`?

5. What is `list(dataframe_name)[0]`? Why? What type is it?

#### Rename the column name


In [None]:
df

In [None]:
df.columns = ["gdp","cpi", "year","country"]
# What if the list had fewer elements than the number of columns?

df

In [None]:
df.columns = [var.upper() for var in df.columns]
# A list comprehension lets us transform every column name
# in whatever way we want...

df

Another way to rename specific columns is the `rename` method. Note that without
the `df =` assignment in front, nothing fundamentally changes: `rename` returns a
renamed copy, which the notebook prints, but the saved `df` stays the same.

In [None]:
df.rename(columns = {"GDP":"gdp"})

In [None]:
df

To change the saved `df`, assign the result back to it.

In [None]:
df = df.rename(columns = {"GDP":"gdp"})

In [None]:
df

Alternatively, use the `inplace` option.

In [None]:
df.rename(columns = {"CPI":"cpi"},inplace=True)

In [None]:
df

---
### Time to practice


**Exercises.** 

6. For the DataFrame `port6_df`, create a variable `diff` equal to the difference between `SMALL HiBM` and `SMALL LoBM`. Verify that `diff` is now in the DataFrame.

7. Change its name to `small HML`.

#### Play with Rows

The most direct way to grab rows is **`iloc`**, which selects by integer position.

In [None]:
df

For example, suppose we want to grab the second row. Positions work the same way as indexing in a `list` or a NumPy array, so the second row is at position 1.

In [53]:
df.iloc[1]

GDP        10031.0
CPI          169.3
Year          2000
Country         US
Name: 1, dtype: object

Here the row labels happen to coincide with the integer positions (the default, or natural, index), so `loc` selects the same row.

In [None]:
df.loc[1]

In this example, although both `.iloc[1]` and `.loc[1]` use 1, the two `1`s mean different things. The `1` in `.iloc[1]` selects the second row by position, whereas the `1` in `.loc[1]` selects the row whose index label is `1`.
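The distinction becomes visible once the labels are not 0, 1, 2. Here is a minimal sketch with a hypothetical frame `df_demo` indexed by year (three of the unemployment values from earlier):

```python
import pandas as pd

# Hypothetical frame indexed by year, so labels and positions differ
df_demo = pd.DataFrame({"unemp": [5.6, 4.2, 9.1]}, index=[1995, 2001, 2011])

print(df_demo.iloc[1])     # second row by position: the 2001 row
print(df_demo.loc[2001])   # row whose label is 2001: the same row

# df_demo.loc[1] would raise a KeyError, because no row is labeled 1
```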

`.iloc` can also be used to select several rows:

In [None]:
df.iloc[0:2]

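Note the difference when `.loc` is used: a label slice includes its end point, so `0:2` returns the rows labeled 0, 1 and 2, whereas `.iloc[0:2]` stops before position 2.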

In [None]:
df.loc[0:2]
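To make the slicing difference concrete, here is a small sketch on a made-up four-row frame `df_demo` with the default labels:

```python
import pandas as pd

df_demo = pd.DataFrame({"x": [10, 20, 30, 40]})  # default labels 0, 1, 2, 3

by_position = df_demo.iloc[0:2]  # positions 0 and 1; the end point is excluded
by_label = df_demo.loc[0:2]      # labels 0, 1 and 2; the end label is included

print(len(by_position), len(by_label))  # 2 3
```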

Selecting by position or by label makes conditional selection awkward, for example when we want the rows where the year is 2000. For that, pass a boolean condition to `.loc`:

In [None]:
df.loc[df['YEAR']==2000]

In [None]:
# This will also work
df[df['YEAR']==2000]
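What happens here is that the comparison produces a boolean Series, and only the rows marked `True` are kept. The sketch below, on a small stand-in frame `df_demo`, shows the mask explicitly and how two conditions could be combined:

```python
import pandas as pd

df_demo = pd.DataFrame({"YEAR": [1990, 2000, 2010],
                        "cpi": [127.5, 169.3, 217.488]})

mask = df_demo["YEAR"] >= 2000   # boolean Series: False, True, True
print(df_demo.loc[mask])

# Combine conditions with & (and) or | (or); each condition needs parentheses
both = df_demo.loc[(df_demo["YEAR"] >= 2000) & (df_demo["cpi"] < 200)]
print(both["YEAR"].tolist())     # [2000]
```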

#### Play with Columns

##### Grab one column

The recommended way is to index with the column name:

In [None]:
df['cpi']

In [None]:
df

In [None]:
df.iloc[:,1]

Since a `DataFrame` is like an Excel sheet, think of the first entry in the brackets as the index for rows and the second as the index for columns. Here, we select all the rows of the second column (position 1).

Remember that Python indexing starts at **0**.

The `:` works just as in the slicing we learned before: on its own, it selects everything along that dimension.


##### Grab several columns

In [None]:
df[["cpi","COUNTRY"]]

### The index


Every Series or DataFrame has an index. As mentioned earlier, the index holds the "row labels" for the data.

Very often, the index of a DataFrame is the natural index, i.e. the default integers 0, 1, 2, and so on. The index object is an interesting structure in itself, and it can be thought of as an immutable array.

In [None]:
df

In [None]:
df.columns = [var.lower() for var in df.columns]

In [None]:
df

In [None]:
df.index

In [None]:
df.index.values

Recall that we use `.iloc` with positions in the index (so it only takes integers) to grab rows.

In [None]:
df.iloc[0:2]

In [None]:
df.loc[0:2]

The natural index in many ways operates like an array. It has many of the attributes familiar from NumPy arrays.

In [None]:
df.index.shape

In [None]:
df.index.dtype

We can also use standard Python indexing notation to retrieve values or slices:

In [None]:
df.index[1]

One difference between Index objects and NumPy arrays is that indices are immutable; that is, they cannot be modified via the normal means. The assignment below therefore raises a `TypeError`:

In [None]:
df.index[1] = 0

This immutability makes it safer to share indices between multiple DataFrames and arrays, without the potential for side effects from inadvertent index modification.
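As a small sketch of what immutability means in practice (using a made-up frame `df_demo`): assigning to a single element of the index fails, while replacing the whole index at once is allowed.

```python
import pandas as pd

df_demo = pd.DataFrame({"x": [10, 20, 30]})

try:
    df_demo.index[1] = 0          # item assignment on an Index is not allowed
except TypeError as err:
    print("TypeError:", err)

# Replacing the whole index at once is allowed
df_demo.index = [1990, 2000, 2010]
print(df_demo.index.tolist())     # [1990, 2000, 2010]
```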

#### Set Row Index

In [None]:
df

In [None]:
df.set_index(["year"])

In [None]:
df

Why is the index back to the original? Just as with `rename`, the original `DataFrame` is not changed. To change it, you need to either (i) assign the modified `DataFrame` back to the same name or to a new name, or (ii) pass `inplace=True`, which does not create a new object but sets the new index directly on the existing one.

So if we set `inplace=True`...

In [None]:
df.set_index(["year"], inplace = True)

In [None]:
df

In [None]:
df.loc[[2000,2010]]

---
### Time to practice


**Continue Our Exercises in `port6_df`** 

8. How would you extract all rows after 1990?

9. How would you extract the variables SMALL HiBM and BIG LoBM? 

10. How would you extract all variables for just one year, say 2000?


#### Reset the Row Index

Right now, the index of the **df** `DataFrame` is the `year` column. Can we reset it to a numeric index? Yes: pandas provides the `reset_index` method for this, as shown below.

In [None]:
df

In [None]:
df.reset_index()

In [None]:
df

If we set `inplace=True`, the change is applied to `df` itself, just as with `set_index`.

In [None]:
df.reset_index(inplace=True)

In [None]:
df

### Remove Stuff by Column or Row
How do we remove rows or columns? With the `.drop` method. Here we come across the `axis` parameter again, so let's become familiar with it.

Can you guess what will happen, if...

In [None]:
df

In [None]:
df.reset_index(inplace=True)

In [None]:
df

In [None]:
df.drop("index", axis = 1) 

In [None]:
df.drop(0, axis = 0)  # the first 0 here means we want drop the first row which is indexed by 0

In [None]:
df

Again, if we want to change the saved `df`, we need to use the `inplace` option (or assign the result back to `df`).

In [None]:
df.drop("index",axis=1,inplace=True)

In [None]:
df

Now we can conclude: if we want `drop` to act on columns, we should set **axis** = 1, while for rows we set **axis** = 0 (the default).
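The `axis` argument shows up in many other methods too, and its reading differs slightly between them. The sketch below uses a made-up two-column frame `df_demo` to contrast `drop` with `sum`:

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# drop: axis says where the label lives (1 = column labels, 0 = row labels)
print(df_demo.drop("a", axis=1).columns.tolist())  # ['b']
print(df_demo.drop(0, axis=0).index.tolist())      # [1, 2]

# sum: axis says which direction is collapsed
print(df_demo.sum(axis=0).tolist())  # [6, 60]       one total per column
print(df_demo.sum(axis=1).tolist())  # [11, 22, 33]  one total per row
```

When in doubt, it helps to run the operation on a tiny frame like this one and check the shape of the result.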

### Add a row to a DataFrame

We can add a row at the end of a DataFrame. The most convenient way is to use the `.loc` operation.

In [None]:
df

In [None]:
df.loc[3] = [566,788,2020,'US']

In [None]:
df

In [None]:
df.loc[len(df)] = [566,788,2020,'US']

In [None]:
df
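An alternative, sketched here under the assumption that the new observation is first wrapped in its own one-row `DataFrame`, is to reuse `pd.concat`. The frame `df_demo` and the name `new_row` are only for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({"GDP": [5974.7, 10031.0], "Year": [1990, 2000]})

# Hypothetical new observation, built as a one-row DataFrame
new_row = pd.DataFrame({"GDP": [14681.1], "Year": [2010]})

# ignore_index=True renumbers the rows 0, 1, 2
df_demo = pd.concat([df_demo, new_row], ignore_index=True)
print(df_demo.shape)  # (3, 2)
```

This form can be handy when several rows arrive at once, since they can all be concatenated in one call.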

---
### Time to practice


**Exercises.** 

11. How would you drop the variable `small HML` in `port6_df`? If you print your dataframe again, is it gone? If not, why do you think it is still there?


Hint: the key thing to recognize is the axis. With `axis=1`, you are asking to drop a column named "small HML".
Without the axis argument, you would get an error. Why? The default is
`axis=0`, which refers to rows, and there is no index label named "small HML".

12. How would you drop one year, say, 1963, from the data set?

13. How would you run a linear regression in which the dependent variable is `BIG HiBM` and the independent variables are `SMALL LoBM` and `ME1 BM2`?

---
# Some basics in autoregressive models in Python

We first construct our model in `statsmodels` using the `tsa.arima.ARIMA` class.

We will use pandas DataFrames with `statsmodels`; however, standard arrays can also be used as arguments.

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd
import statsmodels.api as sm

In [None]:
ff_factors_ann = pd.read_csv('factor2024.csv')
# convert the dates to strings, then to annual periods
ff_factors_ann['Date'] = ff_factors_ann['Date'].astype(str)
ff_factors_ann['period'] = pd.PeriodIndex(ff_factors_ann['Date'],freq='Y')  # 'Y' = annual ('A' is the older alias)
ff_factors_ann['Date'] = ff_factors_ann['period']
# use the annual period as the row index and drop the helper column
ff_factors_ann.set_index(['Date'],inplace=True)
ff_factors_ann.drop(columns=['period'],inplace=True)

In [None]:
ff_factors_ann.tail()

In [None]:
ff_factors_ann.head()

In [None]:
ff_factors_ann.index

In [None]:
# fit an AR(1) model: order=(p, d, q) = (1, 0, 0)
ar1_mod = sm.tsa.arima.ARIMA(ff_factors_ann['RF']/100,order=(1,0,0), missing='drop')
ar1_fit = ar1_mod.fit()
ar1_fit.summary()

In [None]:
# use integer index to predict
ar1_fit.predict(start=98,end=99)

In [None]:
# use row label index to predict
ar1_fit.predict(start='2025',end='2026')

In [None]:
# use forecast function to predict
ar1_fit.forecast()

In [None]:
ar1_fit.forecast().iloc[0]
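To build some intuition for what an AR(1) forecast does, here is a minimal NumPy-only sketch. It is not what `statsmodels` runs internally: it simulates a series with made-up parameters, estimates the model by ordinary least squares (a regression of each value on the previous one), and plugs the last observation into the fitted equation. All names and parameter values are assumptions for illustration, and the numbers will not match a `statsmodels` fit exactly because the estimators differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate y_t = c + phi * y_{t-1} + e_t (all parameter values are made up)
c_true, phi_true, n = 0.01, 0.8, 500
y = np.empty(n)
y[0] = c_true / (1 - phi_true)          # start at the unconditional mean
for t in range(1, n):
    y[t] = c_true + phi_true * y[t - 1] + rng.normal(scale=0.01)

# Regress y_t on a constant and y_{t-1}
X = np.column_stack([np.ones(n - 1), y[:-1]])
c_hat, phi_hat = np.linalg.lstsq(X, y[1:], rcond=None)[0]

# One-step-ahead forecast: plug the last observation into the fitted equation
forecast = c_hat + phi_hat * y[-1]
print(phi_hat, forecast)
```

The estimated `phi_hat` should land near the value used in the simulation, and the forecast is simply the fitted line evaluated at the most recent observation.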

In [None]:
# fit an MA(1) model: order=(p, d, q) = (0, 0, 1)
ma1_mod = sm.tsa.arima.ARIMA(ff_factors_ann['Rm']/100,order=(0,0,1), missing='drop')
ma1_fit = ma1_mod.fit()
ma1_fit.summary()

In [None]:
# to predict
ma1_fit.predict(start=98,end=99)

In [None]:
ma1_fit.predict(start='2025',end='2026')

In [None]:
ma1_fit.forecast()

In [None]:
# fit an ARMA(1,1) model: order=(p, d, q) = (1, 0, 1)
arma1_mod = sm.tsa.arima.ARIMA(ff_factors_ann['Rm']/100,order=(1,0,1), missing='drop')
arma1_fit = arma1_mod.fit()
arma1_fit.summary()

In [None]:
arma1_fit.predict(start=98,end=99)

In [None]:
arma1_fit.predict(start='2025',end='2026')

In [None]:
arma1_fit.forecast().iloc[0]

#### Time to practice

**Exercise.** 

1. Use AR(1), MA(1), ARMA(1,1), ARMA(1,2), ARMA(2,1), and ARMA(2,2) to fit the size factor (SMB).


2. Which model should you choose? Why?

3. Use the chosen model to predict the size factor in 2025.

---
## Summary

Let us summarize some key things that we covered.

* **Pandas Core Objects**: A `DataFrame` is essentially a table of data, while a `Series` can be thought of as a one-column `DataFrame`.

* **Understanding the `DataFrame`**:
    * Become familiar with the basic attributes (`.columns`, `.shape`) and methods (`.sum()`, `.mean()`) of the `DataFrame` data structure.
    * Know the different ways to grab columns and rows, their pros and cons, and especially the differences between `iloc` and `loc`. They look similar, but their inputs are very different: `loc` gets rows with particular labels from the index, while `iloc` gets rows at particular positions in the index (so it only takes integers).

* **Axis Understanding**: when setting **axis**, always think about the operation first: will it be done across columns or across rows? If the former, set axis = 1. In this course, and for the vast majority of DataFrames, **axis** will always be **0** or **1**.

