<div>
<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

# Intro to Pandas

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

# What do we need for data processing?

* **import data** from CSV, Excel, text files, databases, etc.
* **export data** to CSV, Excel, text files, and the web
* work with **tabular data**
  - merge data tables (sheets, in Excel lingo)
  - filter data
  - pivot data
  - modify / reshape data
  - add / remove / rename columns
  - look up data
  - add columns with derived data
  - compute statistics
* **handle dates, text, numbers, etc.**
* **visualize data**
* **share data**
* **handle large files (tens of GB)**

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

# The Python `pandas` package meets the requirements

* Python package for working with “**pan**el **da**ta”
* Replicates the look & feel of data tables in Excel and SQL DBs.
* Primary data structures: **DataFrame (2-D) & Series (1-D)**
* Easy I/O
* **The most important tool for doing data analysis in Python**
* Built on top of **NumPy (Numerical Python)**

Initially created at AQR Capital Management _"out of the need for a high performance, flexible tool to perform quantitative analysis on financial data"_ (according to Wikipedia)

## Documentation / help

* https://pandas.pydata.org/pandas-docs/stable/reference/
* www.stackoverflow.com
* search online; remember to use keywords `dataframe`, `pandas`, `python`

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

# Installing Required Modules

**Run one of these two sets of commands in a terminal (or console), depending on whether you use `pip` or `conda`:**

```
pip install pandas
pip install numpy
```

```
conda install pandas
conda install numpy
```

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

# Importing Required Modules

**I advise you to always start with these three lines of code:**

In [None]:
import pandas as pd
import numpy as np
import datetime

# This line of code will just show numbers in a nice format.
# No need to dig into it right now, just copy/paste it in your projects.
pd.options.display.float_format = '{:,.2f}'.format
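
As a rough illustration of what the `float_format` setting does, the sketch below applies the same format string to a single number and then to a small series. The variable names and values are made up for this example.

```python
import pandas as pd

# Apply the same display format as in the cell above.
pd.options.display.float_format = '{:,.2f}'.format

# The format string adds thousands separators and keeps two decimals.
formatted = '{:,.2f}'.format(1234567.891)
print(formatted)  # 1,234,567.89

# Only the display changes; the stored values keep their full precision.
prices = pd.Series([1234567.891, 0.5])
print(prices)
```

The setting affects how floats are printed, not how they are stored, so calculations are unaffected.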

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

# DataFrames and Series

* fundamental `pandas` objects
* represent 2-D data (tables) and 1-D data (columns)
* have lots of functionality!

<div>
<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/dataframe_example.png" width="500"/>
</div>

<br/>
<br/>

## DataFrame Indexes

* used to access data in a dataframe
* can be nested!

<div>
<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/dataframe_index.png" width="500"/>
</div>

<br/>
<br/>
<br/>

<div>
<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/dataframe_multiindex.png" width="500"/>
</div>
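
To make the two index ideas above more concrete, here is a minimal sketch of a simple index and a nested index (a `MultiIndex`). The `Sales` numbers and the variable names are invented purely for illustration.

```python
import pandas as pd

# A dataframe whose index holds country names (the sales values are made up).
sales = pd.DataFrame(
    {'Sales': [10, 20, 30]},
    index=['Andorra', 'Belgium', 'Croatia'])

# .loc looks up data by index label (and column name).
belgium_sales = sales.loc['Belgium', 'Sales']
print(belgium_sales)  # 20

# A nested index built from (year, country) pairs.
nested_index = pd.MultiIndex.from_tuples(
    [(2020, 'Andorra'), (2020, 'Belgium'), (2021, 'Andorra')],
    names=['Year', 'Country'])
nested_sales = pd.DataFrame({'Sales': [10, 20, 15]}, index=nested_index)

# Selecting on the outer level returns all rows for that year.
sales_2020 = nested_sales.loc[2020]
print(sales_2020)

# A tuple selects one specific (year, country) row.
andorra_2021 = nested_sales.loc[(2021, 'Andorra'), 'Sales']
print(andorra_2021)  # 15
```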

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Series
* one-dimensional data structure
* an array of data (**values**) + an array of labels (**index**)
* can contain values of mixed types (the series then gets the generic `object` data type)
* can be built from a list, a dictionary etc.
* **think of a dataframe column or row as a series object**

<div>
<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/series_example.png" width="400"/>
</div>
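
The points above can be sketched in a few lines. This example assumes nothing beyond `pandas` itself; the variable names are chosen just for illustration.

```python
import pandas as pd

# From a dictionary: the keys become the index, the values become the data.
capitals = pd.Series({'Andorra': 'Andorra la Vella', 'Belgium': 'Brussels'})
print(capitals.index.tolist())   # ['Andorra', 'Belgium']
print(capitals.values.tolist())  # ['Andorra la Vella', 'Brussels']

# Mixing types is allowed, but the series falls back to the object dtype.
mixed = pd.Series(['a', 1, 2.5])
print(mixed.dtype)  # object
```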

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Create a Series

In [None]:
letters = pd.Series( ['a', 'b', 'c'] )
letters

In [None]:
countries = pd.Series( ['Andorra', 'Belgium', 'Croatia'] )
countries

In [None]:
countries.index

**A series has an index, corresponding values, and a data type.**

In [None]:
type(countries)

In [None]:
countries.dtype

#### You can specify a custom index

In [None]:
c_data = pd.Series(
    ['Andorra', 'Belgium', 'Croatia', 'Albania'], 
    index=['a', 'b', 'c', 'a'])

c_data

In [None]:
print(c_data.dtype)

In [None]:
c_data.values

In [None]:
c_data.index

In [None]:
c_data['a']
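
Because the label `'a'` appears twice in the custom index, looking it up returns more than one value. The short sketch below, which rebuilds `c_data` so it can run on its own, shows the difference between a repeated label and a unique one.

```python
import pandas as pd

c_data = pd.Series(
    ['Andorra', 'Belgium', 'Croatia', 'Albania'],
    index=['a', 'b', 'c', 'a'])

# 'a' appears twice in the index, so we get a series with both matches.
a_countries = c_data['a']
print(a_countries.tolist())  # ['Andorra', 'Albania']

# 'b' appears only once, so we get a single value back.
print(c_data['b'])  # Belgium

# is_unique tells us whether any index label repeats.
print(c_data.index.is_unique)  # False
```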

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Create an empty dataframe, with specific columns

In [None]:
# country_df is a DataFrame object. 
# It is empty, but has the columns we specified.

country_df = pd.DataFrame( columns=['Letter', 'Country'] )
country_df
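
If you want to confirm that the dataframe really is empty but has the expected columns, a quick check along these lines may help.

```python
import pandas as pd

country_df = pd.DataFrame(columns=['Letter', 'Country'])

# No rows yet, but both columns are defined.
print(country_df.empty)          # True
print(country_df.shape)          # (0, 2)
print(list(country_df.columns))  # ['Letter', 'Country']
```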

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Create dataframe columns using values from Series
Essentially, this is like copy/pasting a column from one Excel sheet to another.

In [None]:
letters = pd.Series( ['a', 'b', 'c'] )
letters

In [None]:
countries = pd.Series( ['Andorra', 'Belgium', 'Croatia'] )
countries

In [None]:
country_df = pd.DataFrame({
    'Letter': letters, 
    'Country': countries}, 
    index=[0, 1, 5])

country_df

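**NOTE:** pandas (not Python itself) builds the index. Here we passed `index=[0, 1, 5]`, so pandas keeps only those labels: rows 0 and 1 come from the series, label 2 is left out, and label 5 has no matching data, so it is filled with `NaN`. If you omit `index`, pandas reuses the labels of the series (0, 1, 2).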
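
To see how the `index` argument affects the result, the sketch below builds the same dataframe twice, once without `index` and once with `index=[0, 1, 5]`. The names `default_df` and `custom_df` are introduced only for this comparison.

```python
import pandas as pd

letters = pd.Series(['a', 'b', 'c'])
countries = pd.Series(['Andorra', 'Belgium', 'Croatia'])

# Without index=, the dataframe reuses the labels 0, 1, 2 from the series.
default_df = pd.DataFrame({'Letter': letters, 'Country': countries})
print(default_df.index.tolist())  # [0, 1, 2]

# With index=[0, 1, 5], only those labels are kept.
# Label 2 ('c', 'Croatia') is dropped; label 5 has no data, so it is NaN.
custom_df = pd.DataFrame(
    {'Letter': letters, 'Country': countries},
    index=[0, 1, 5])
print(custom_df['Letter'].isna().tolist())  # [False, False, True]
```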

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Exercise

Create a series object that contains 4 letters of the alphabet and has the following values in the index: 2, 3, 4, 5. Name this series object `alphabet`. 

Create another series object that contains 4 animals starting with those four letters and which has the following values in the index: 2, 3, 4, 7. Name this series `animals`. 

Finally, create a dataframe object from these two series. What do you notice?

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

#### Solution

In [None]:
alphabet = pd.Series(
    ['c', 'd', 'z', 'o'], 
    index=[2, 3, 4, 5])

alphabet

In [None]:
animals = pd.Series(
    ['cat', 'dolphin', 'zebra', 'orangutan'], 
    index=[2, 3, 4, 7])

animals

In [None]:
animals_data = pd.DataFrame({
    'Letter': alphabet, 
    'Animal': animals})

animals_data

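We notice that when pandas creates the dataframe, it matches the rows **based on the index label**, not on position. Labels 5 and 7 each exist in only one of the two series, so the missing cells are filled with `NaN`. There is a way around this, which we will cover in a bit.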
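
As a preview, here are two possible ways to line the series up by position instead of by label. This is only a sketch and may differ from the approach covered later; the names `by_label`, `by_position`, and `keep_alphabet_index` are made up for this example.

```python
import pandas as pd

alphabet = pd.Series(['c', 'd', 'z', 'o'], index=[2, 3, 4, 5])
animals = pd.Series(
    ['cat', 'dolphin', 'zebra', 'orangutan'], index=[2, 3, 4, 7])

# Aligned by label: labels 5 and 7 each appear in only one series.
by_label = pd.DataFrame({'Letter': alphabet, 'Animal': animals})
print(by_label.index.tolist())  # [2, 3, 4, 5, 7]

# Option 1: drop the old labels so both series are numbered 0 to 3.
by_position = pd.DataFrame({
    'Letter': alphabet.reset_index(drop=True),
    'Animal': animals.reset_index(drop=True)})
print(by_position)

# Option 2: pass plain arrays (no labels) and reuse one series' index.
keep_alphabet_index = pd.DataFrame(
    {'Letter': alphabet.to_numpy(), 'Animal': animals.to_numpy()},
    index=alphabet.index)
print(keep_alphabet_index)
```

Both options assume the two series have the same length and that their order already matches.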

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
