# Pandas

Pandas is the Swiss Army knife for data analysis in Python. It makes data analysis straightforward, but you first need to get your head around its two core structures: the Data Frame and the Series.

This tutorial is a compact introduction to Pandas for beginners. It covers I/O, data visualisation, statistical data analysis and aggregation within Jupyter notebooks.

---

## Brief Introduction to Pandas

Pandas is built on two main data structures: the **Data Frame** and the **Series**.
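
As a rough illustration of how the two structures relate, the sketch below builds a small Data Frame by hand and pulls a single column and a single row out of it; both come back as a Series. The values are borrowed from the sales data used later in this tutorial, and the variable names are made up for this example.

```python
import pandas as pd

# A tiny hand-built frame; names and values are illustrative only.
df = pd.DataFrame({
    "product": ["Thriller record", "Corolla", "iPhone"],
    "units": [2, 26, 16],
})

units = df["units"]      # selecting one column yields a Series
first_row = df.iloc[0]   # selecting one row also yields a Series

print(type(df).__name__)     # DataFrame
print(type(units).__name__)  # Series
print(df.shape)              # (3, 2): rows, columns
```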

### Data Frame _from the outside_

<img src="./images/df_outside.png" width="50%" />

### Data Frame _from the inside_

<img src="./images/df_inside.png" width="60%" />

### Data Frame vs Numpy Array

#### Numpy Array

<img src="./images/ndarray.png" />

#### Pandas Data Frame

<img src="./images/df_inside_numpy.png" width="70%" />
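
One practical consequence of this difference can be sketched with a few made-up values: a NumPy array has a single data type for all of its elements, while a Data Frame keeps one data type per column. Converting a mixed frame back with `to_numpy()` therefore falls back to a common type.

```python
import numpy as np
import pandas as pd

# A NumPy array holds one dtype; the integers are upcast to float here.
arr = np.array([[2, 13.27], [26, 24458.69]])
print(arr.dtype)  # float64

# A Data Frame keeps a separate dtype for every column.
frame = pd.DataFrame({
    "product": ["Thriller record", "Corolla"],
    "units": [2, 26],
    "unitprice": [13.27, 24458.69],
})
print(frame.dtypes)

# Converting back to NumPy needs one common dtype for all columns.
numeric = frame[["units", "unitprice"]].to_numpy()
mixed = frame.to_numpy()
print(numeric.dtype)  # float64
print(mixed.dtype)    # object
```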

---


# Pandas in a Nutshell

In [1]:
import numpy as np
import pandas as pd

In [2]:
!head ./data/blooth_sales_data.csv

name,birthday,customer,orderdate,product,units,unitprice
Pasquale,1967-09-02,Electronics Inc,2016-07-17 13:48:03.156566,Thriller record,2,13.27
India,1968-12-13,Electronics Resource Group,2016-07-06 13:48:03.156596,Corolla,26,24458.69
Wayne,1992-09-10,East Application Contract Inc,2016-07-22 13:48:03.156618,Rubik’s Cube,41,15.79
Cori,1986-11-05,Signal Industries,2016-07-23 13:48:03.156638,iPhone,16,584.01
Chang,1972-04-23,Star Alpha Industries,2016-07-16 13:48:03.156657,Harry Potter book,4,25.69
Weldon,1953-03-17,Network Application Co,2016-07-22 13:48:03.156678,Lipitor,1,11.22
Sung,1977-10-23,Omega Pacific Future Incorporated,2016-07-09 13:48:03.156698,PlayStation,25,294.9
Emily,1982-07-02,Medicine Incorporated,2016-07-16 13:48:03.156717,Thriller record,5,18.27
Cornell,1963-07-02,Technology Direct Star Limited,2016-07-08 13:48:03.156735,Rubik’s Cube,35,15.98


In [3]:
sales_data = pd.read_csv('./data/blooth_sales_data.csv')

### Let's explore our data set

In [4]:
sales_data.head(10)

Unnamed: 0,name,birthday,customer,orderdate,product,units,unitprice
0,Pasquale,1967-09-02,Electronics Inc,2016-07-17 13:48:03.156566,Thriller record,2,13.27
1,India,1968-12-13,Electronics Resource Group,2016-07-06 13:48:03.156596,Corolla,26,24458.69
2,Wayne,1992-09-10,East Application Contract Inc,2016-07-22 13:48:03.156618,Rubik’s Cube,41,15.79
3,Cori,1986-11-05,Signal Industries,2016-07-23 13:48:03.156638,iPhone,16,584.01
4,Chang,1972-04-23,Star Alpha Industries,2016-07-16 13:48:03.156657,Harry Potter book,4,25.69
5,Weldon,1953-03-17,Network Application Co,2016-07-22 13:48:03.156678,Lipitor,1,11.22
6,Sung,1977-10-23,Omega Pacific Future Incorporated,2016-07-09 13:48:03.156698,PlayStation,25,294.9
7,Emily,1982-07-02,Medicine Incorporated,2016-07-16 13:48:03.156717,Thriller record,5,18.27
8,Cornell,1963-07-02,Technology Direct Star Limited,2016-07-08 13:48:03.156735,Rubik’s Cube,35,15.98
9,Ervin,1977-10-14,Provider Agency,2016-07-19 13:48:03.156754,Star Wars,24,11.5


#### Let's see what we have got now

In [5]:
type(sales_data)

pandas.core.frame.DataFrame

In [6]:
len(sales_data)

1000

#### Inspect your DataFrame with pandas methods

In [7]:
sales_data.head(5)

Unnamed: 0,name,birthday,customer,orderdate,product,units,unitprice
0,Pasquale,1967-09-02,Electronics Inc,2016-07-17 13:48:03.156566,Thriller record,2,13.27
1,India,1968-12-13,Electronics Resource Group,2016-07-06 13:48:03.156596,Corolla,26,24458.69
2,Wayne,1992-09-10,East Application Contract Inc,2016-07-22 13:48:03.156618,Rubik’s Cube,41,15.79
3,Cori,1986-11-05,Signal Industries,2016-07-23 13:48:03.156638,iPhone,16,584.01
4,Chang,1972-04-23,Star Alpha Industries,2016-07-16 13:48:03.156657,Harry Potter book,4,25.69


In [8]:
sales_data.tail(5)

Unnamed: 0,name,birthday,customer,orderdate,product,units,unitprice
995,Ethan,1952-12-08,Application Industries,2016-07-21 13:48:03.177885,Harry Potter book,39,24.4
996,Rudolph,1959-10-15,Network Software West Inc,2016-07-19 13:48:03.177903,Rubik’s Cube,9,15.11
997,Annmarie,1982-06-04,Atlantic Corporation,2016-07-13 13:48:03.177924,Thriller record,19,9.16
998,Chang,1984-02-05,Venture Alpha Corporation,2016-07-13 13:48:03.177943,Harry Potter book,24,28.21
999,Ervin,1977-10-14,Provider Agency,2016-07-09 13:48:03.177962,iPhone,39,663.83


In [9]:
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
name         1000 non-null object
birthday     1000 non-null object
customer     1000 non-null object
orderdate    1000 non-null object
product      1000 non-null object
units        1000 non-null int64
unitprice    1000 non-null float64
dtypes: float64(1), int64(1), object(5)
memory usage: 54.8+ KB


**Note: floats and ints were detected automatically, but the date(time) columns are still string objects.**

`info()` reports:

* the columns
* the row count
* the data types (NumPy)
* the memory used

**`Strings`** are stored in **`pandas`** as **`object`**!
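
To check this on a small scale, the following sketch reads two rows of the sample data from an in-memory string (an assumption made so the example runs without the CSV file) and prints the detected data types. Depending on the pandas version, text columns may show up as `object` or as a dedicated string type. The `+` in the `memory usage` line of `info()` signals a lower bound: `memory_usage(deep=True)` also counts the Python string objects themselves.

```python
import io
import pandas as pd

# Two rows from the sample data, kept in memory for this sketch.
csv_text = """name,product,units,unitprice
Pasquale,Thriller record,2,13.27
India,Corolla,26,24458.69
"""
df = pd.read_csv(io.StringIO(csv_text))

# Numeric columns are detected automatically; text columns are not numeric.
print(df.dtypes)

# The shallow figure ignores the size of the strings behind object columns.
shallow = df.memory_usage().sum()
deep = df.memory_usage(deep=True).sum()
print(shallow, deep)
```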

In [10]:
?pd.read_csv

Signature: pd.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=No

**`pandas.read_csv`** has more than 50 parameters to customize imports.

For example, dates can be parsed automatically.

> **`parse_dates`**: a list of columns to parse as dates.

This is only one of multiple options to customize imports.

In [11]:
sales_data = pd.read_csv('./data/blooth_sales_data.csv',
                         parse_dates=['birthday', 'orderdate']
                        )
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
name         1000 non-null object
birthday     1000 non-null datetime64[ns]
customer     1000 non-null object
orderdate    1000 non-null datetime64[ns]
product      1000 non-null object
units        1000 non-null int64
unitprice    1000 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(1), object(3)
memory usage: 54.8+ KB
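
Once the columns are real datetimes, the `.dt` accessor becomes available on them. The sketch below shows this on two rows of the sample data, read from an in-memory string so that it runs without the CSV file.

```python
import io
import pandas as pd

# Two rows from the sample data, kept in memory for this sketch.
csv_text = """name,birthday,orderdate
Pasquale,1967-09-02,2016-07-17 13:48:03.156566
India,1968-12-13,2016-07-06 13:48:03.156596
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["birthday", "orderdate"])

# Date components are available through the .dt accessor.
birth_years = df["birthday"].dt.year.tolist()
order_weekdays = df["orderdate"].dt.day_name().tolist()

print(birth_years)      # [1967, 1968]
print(order_weekdays)   # ['Sunday', 'Wednesday']
```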


In [12]:
sales_data.head(5)

Unnamed: 0,name,birthday,customer,orderdate,product,units,unitprice
0,Pasquale,1967-09-02,Electronics Inc,2016-07-17 13:48:03.156566,Thriller record,2,13.27
1,India,1968-12-13,Electronics Resource Group,2016-07-06 13:48:03.156596,Corolla,26,24458.69
2,Wayne,1992-09-10,East Application Contract Inc,2016-07-22 13:48:03.156618,Rubik’s Cube,41,15.79
3,Cori,1986-11-05,Signal Industries,2016-07-23 13:48:03.156638,iPhone,16,584.01
4,Chang,1972-04-23,Star Alpha Industries,2016-07-16 13:48:03.156657,Harry Potter book,4,25.69


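By default, the automatic date parser is US-date friendly: it reads ambiguous dates month first (MM/DD/YYYY). Add *dayfirst=True* for the international and European day-first format. The dates in this file are written year first, so they are unambiguous and the output stays the same.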

In [13]:
sales_data = pd.read_csv('./data/blooth_sales_data.csv',
                         parse_dates=['birthday', 'orderdate'],
                         dayfirst=True)
sales_data.head(5)

Unnamed: 0,name,birthday,customer,orderdate,product,units,unitprice
0,Pasquale,1967-09-02,Electronics Inc,2016-07-17 13:48:03.156566,Thriller record,2,13.27
1,India,1968-12-13,Electronics Resource Group,2016-07-06 13:48:03.156596,Corolla,26,24458.69
2,Wayne,1992-09-10,East Application Contract Inc,2016-07-22 13:48:03.156618,Rubik’s Cube,41,15.79
3,Cori,1986-11-05,Signal Industries,2016-07-23 13:48:03.156638,iPhone,16,584.01
4,Chang,1972-04-23,Star Alpha Industries,2016-07-16 13:48:03.156657,Harry Potter book,4,25.69


**!** The date parser is US-date friendly: *MM/DD/YYYY*

To make sure the more common international format is used,<br>
add 
>**`dayfirst=True`** 
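

The effect only shows on ambiguous dates, which the sample file does not contain. The sketch below therefore uses two made-up dates in slash notation, read from an in-memory string, and compares the default with `dayfirst=True`.

```python
import io
import pandas as pd

# Made-up, ambiguous dates: 01/02/2016 could be 2 January or 1 February.
csv_text = "orderdate\n01/02/2016\n03/04/2016\n"

month_first = pd.read_csv(io.StringIO(csv_text), parse_dates=["orderdate"])
day_first = pd.read_csv(io.StringIO(csv_text), parse_dates=["orderdate"],
                        dayfirst=True)

print(month_first["orderdate"].dt.month.tolist())  # [1, 3]
print(day_first["orderdate"].dt.month.tolist())    # [2, 4]
```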

The CSV import can be customized extensively, <br>e.g.:

* `parse_dates` - which columns to parse as dates (`date_parser` sets the function that converts them)
* `compression` - compression of the file; the default `infer` detects it from the file extension
* `delimiter` - the field delimiter (alias for `sep`)
* `thousands`, `decimal` - thousands separator and decimal point character
* `encoding` - encoding of the file
* `dtype` - target data type of the column(s)
* `header` - row number(s) to use as the column names
* `skipfooter` - number of lines at the end of the file to skip (e.g. a summary line)
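
As a minimal sketch of how several of these options combine, the example below reads a made-up, European-style CSV from an in-memory string: semicolons as delimiter, a dot as thousands separator, a comma as decimal point and a summary line at the end. The data is invented for illustration. `skipfooter` is used together with `engine='python'`, since the default C engine does not support it.

```python
import io
import pandas as pd

# Made-up European-style export with a summary line at the bottom.
csv_text = """product;units;unitprice
Corolla;26;24.458,69
iPhone;16;584,01
TOTAL;42;"""

df = pd.read_csv(
    io.StringIO(csv_text),
    delimiter=";",             # fields are separated by semicolons
    thousands=".",             # 24.458,69 uses a dot for thousands ...
    decimal=",",               # ... and a comma as decimal point
    header=0,                  # the first line holds the column names
    dtype={"units": "int64"},  # force the target type of a column
    skipfooter=1,              # drop the TOTAL summary line
    engine="python",           # skipfooter needs the Python engine
)
print(df)
print(df.dtypes)
```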

