# Introduction to Pandas

## Pandas

Pandas is an open-source Python library that provides high-performance data manipulation and analysis tools built on its own data structures. The name Pandas is derived from "panel data", an econometrics term for multidimensional data sets.

Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data:
- load
- prepare
- manipulate
- model
- analyze

### Key Features of Pandas
- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing and subsetting of large data sets.
- Columns from a data structure can be deleted or inserted.
- Group by data for aggregation and transformations.
- High-performance merging and joining of data.
- Time Series functionality.
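
A few of the features listed above can be sketched on small, made-up data. The Series `a` and `b` and the `sales` and `managers` tables are hypothetical and exist only for illustration; the sketch shows label alignment with missing data, a group-by aggregation, and a merge.

```python
import numpy as np
import pandas as pd

# Two Series with partially overlapping labels (hypothetical data).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Arithmetic aligns on labels; "x" has no partner in b, so it becomes NaN.
total = a + b

# Missing values can then be filled explicitly.
filled = total.fillna(0.0)

# Group-by aggregation on a small hypothetical table.
sales = pd.DataFrame({"region": ["north", "south", "north"],
                      "amount": [5, 7, 3]})
per_region = sales.groupby("region")["amount"].sum()

# Merging two tables on a shared key column.
managers = pd.DataFrame({"region": ["north", "south"],
                         "manager": ["Ann", "Bob"]})
merged = sales.merge(managers, on="region", how="left")
```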

Import the pandas module and additional modules:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Object Creation


Creating a Series by passing a list of values, letting pandas create a default integer index:

In [2]:
s = pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
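
As a complement to the default integer index, a Series can also be given custom labels. The sketch below assumes the same values as above and a hypothetical label set `a` to `f`; the name `s_labeled` is only for illustration.

```python
import numpy as np
import pandas as pd

# Same values as before, but with explicit string labels as the index.
s_labeled = pd.Series([1, 3, 5, np.nan, 6, 8], index=list("abcdef"))

# Values are looked up by label instead of by integer position.
third = s_labeled["c"]

# The NaN forces a float dtype; isna() flags the missing entry.
n_missing = int(s_labeled.isna().sum())
```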

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [3]:
dates = pd.date_range('20180901', periods=6)
dates

DatetimeIndex(['2018-09-01', '2018-09-02', '2018-09-03', '2018-09-04',
               '2018-09-05', '2018-09-06'],
              dtype='datetime64[ns]', freq='D')

In [4]:
#np.random.randn(rows,columns)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2018-09-01,0.457683,-0.765402,-0.59547,0.213014
2018-09-02,-1.839686,0.624647,-1.668384,-0.063743
2018-09-03,0.554446,-0.462718,-0.094293,-0.674034
2018-09-04,0.933482,0.020483,1.508006,0.770757
2018-09-05,-1.294754,-1.421425,0.050653,-1.004429
2018-09-06,-0.749926,-3.430865,-2.798963,-0.392262
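
Because `np.random.randn` draws fresh values on every run, the numbers in a frame built this way change whenever the cell is re-executed. One way to make such a frame reproducible, shown here as a sketch rather than as what this notebook does, is to draw from a seeded NumPy generator. The seed value `0` and the names `make_frame`, `df_seeded`, and `df_again` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def make_frame(seed):
    """Build a 6x4 frame of standard-normal values from a seeded generator."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("20180901", periods=6)
    return pd.DataFrame(rng.standard_normal((6, 4)),
                        index=dates, columns=list("ABCD"))

# The same seed yields the same values on every call.
df_seeded = make_frame(0)
df_again = make_frame(0)
```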


Creating a DataFrame by passing a dict of objects that can each be converted to something Series-like:

In [8]:
df2 = pd.DataFrame({ 'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo' })
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


The columns of the resulting DataFrame have different dtypes.

In [9]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
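
Once the dtypes are known, columns can be filtered or converted by type. The following is a minimal sketch on a reduced, hypothetical version of the frame above (`df_types`), using `select_dtypes` and `astype`.

```python
import numpy as np
import pandas as pd

# A reduced version of the mixed-dtype frame (hypothetical name df_types).
df_types = pd.DataFrame({
    "A": 1.0,
    "C": pd.Series(1, index=range(4), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "F": "foo",
})

# Keep only the numeric columns.
numeric = df_types.select_dtypes(include="number")

# Convert one column to another dtype; the original frame is unchanged.
converted = df_types.astype({"D": "int64"})
```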

### Viewing Data

Here is how to view the top and bottom rows of the frame:

In [12]:
df.head(2)

Unnamed: 0,A,B,C,D
2018-09-01,0.179401,0.869784,-0.708861,1.540188
2018-09-02,1.170228,-1.214013,-2.377521,0.106146


In [13]:
df.tail()

Unnamed: 0,A,B,C,D
2018-09-02,1.170228,-1.214013,-2.377521,0.106146
2018-09-03,-0.566644,-0.106563,-2.353751,-1.267331
2018-09-04,-0.888024,0.133448,-2.120545,1.031937
2018-09-05,1.40458,0.816753,-1.335552,-0.935397
2018-09-06,0.176161,-1.289045,-1.156599,2.697843


Display the index, columns, and the underlying NumPy data:

In [14]:
df.index

DatetimeIndex(['2018-09-01', '2018-09-02', '2018-09-03', '2018-09-04',
               '2018-09-05', '2018-09-06'],
              dtype='datetime64[ns]', freq='D')

In [15]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [16]:
df.values

array([[ 0.17940097,  0.86978358, -0.70886074,  1.54018815],
       [ 1.17022776, -1.21401272, -2.37752141,  0.10614585],
       [-0.5666436 , -0.10656306, -2.35375132, -1.26733133],
       [-0.88802416,  0.13344816, -2.12054471,  1.0319375 ],
       [ 1.40457953,  0.81675264, -1.33555155, -0.93539651],
       [ 0.17616124, -1.2890445 , -1.15659897,  2.697843  ]])
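
Recent pandas documentation recommends `DataFrame.to_numpy()` over the `.values` attribute for getting the underlying array. A short sketch, using a small hypothetical frame `df_small`:

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with a single float dtype.
df_small = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

# to_numpy() returns the data as a NumPy array; index and column labels are dropped.
arr = df_small.to_numpy()
```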

describe() shows a quick statistical summary of your data:

In [17]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.24595,-0.131606,-1.675471,0.528898
std,0.911287,0.947028,0.702871,1.518512
min,-0.888024,-1.289045,-2.377521,-1.267331
25%,-0.380942,-0.93715,-2.29545,-0.675011
50%,0.177781,0.013443,-1.728048,0.569042
75%,0.922521,0.645927,-1.201337,1.413125
max,1.40458,0.869784,-0.708861,2.697843
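
The result of `describe()` is itself a DataFrame, so individual statistics can be read back by label. A small sketch on hypothetical data (`df_stats`) where the values are easy to check by hand:

```python
import pandas as pd

# Four evenly spaced values make the summary easy to verify.
df_stats = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})

summary = df_stats.describe()

# Rows of the summary are labeled by statistic name.
mean_a = summary.loc["mean", "A"]
median_a = summary.loc["50%", "A"]
```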


Transposing your data:

In [18]:
df.T

Unnamed: 0,2018-09-01 00:00:00,2018-09-02 00:00:00,2018-09-03 00:00:00,2018-09-04 00:00:00,2018-09-05 00:00:00,2018-09-06 00:00:00
A,0.179401,1.170228,-0.566644,-0.888024,1.40458,0.176161
B,0.869784,-1.214013,-0.106563,0.133448,0.816753,-1.289045
C,-0.708861,-2.377521,-2.353751,-2.120545,-1.335552,-1.156599
D,1.540188,0.106146,-1.267331,1.031937,-0.935397,2.697843


Sorting by an axis:

In [19]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2018-09-01,1.540188,-0.708861,0.869784,0.179401
2018-09-02,0.106146,-2.377521,-1.214013,1.170228
2018-09-03,-1.267331,-2.353751,-0.106563,-0.566644
2018-09-04,1.031937,-2.120545,0.133448,-0.888024
2018-09-05,-0.935397,-1.335552,0.816753,1.40458
2018-09-06,2.697843,-1.156599,-1.289045,0.176161


Sorting by values:

In [20]:
df.sort_values(by='B')

Unnamed: 0,A,B,C,D
2018-09-06,0.176161,-1.289045,-1.156599,2.697843
2018-09-02,1.170228,-1.214013,-2.377521,0.106146
2018-09-03,-0.566644,-0.106563,-2.353751,-1.267331
2018-09-04,-0.888024,0.133448,-2.120545,1.031937
2018-09-05,1.40458,0.816753,-1.335552,-0.935397
2018-09-01,0.179401,0.869784,-0.708861,1.540188
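
`sort_values` also accepts a sort direction and several key columns. The sketch below uses a hypothetical frame `df_sort` with made-up `group` and `score` columns.

```python
import pandas as pd

df_sort = pd.DataFrame({"group": ["b", "a", "b", "a"],
                        "score": [2, 5, 9, 1]})

# Largest score first.
by_score_desc = df_sort.sort_values(by="score", ascending=False)

# Sort by group (ascending), then by score within each group (descending).
by_group_then_score = df_sort.sort_values(by=["group", "score"],
                                          ascending=[True, False])
```

Sorting returns a new frame; the original row labels travel with the rows, so the index shows where each row came from.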


### Selection

#### Getting

Selecting a single column, which yields a Series, equivalent to df.A:

In [21]:
df['A']

2018-09-01    0.179401
2018-09-02    1.170228
2018-09-03   -0.566644
2018-09-04   -0.888024
2018-09-05    1.404580
2018-09-06    0.176161
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows.

In [22]:
# get the first 3 rows (positions 0, 1 and 2; the end position is excluded)
df[0:3]

Unnamed: 0,A,B,C,D
2018-09-01,0.179401,0.869784,-0.708861,1.540188
2018-09-02,1.170228,-1.214013,-2.377521,0.106146
2018-09-03,-0.566644,-0.106563,-2.353751,-1.267331


In [24]:
df['20180902':'20180904']

Unnamed: 0,A,B,C,D
2018-09-02,1.170228,-1.214013,-2.377521,0.106146
2018-09-03,-0.566644,-0.106563,-2.353751,-1.267331
2018-09-04,-0.888024,0.133448,-2.120545,1.031937
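
The two slices above treat their endpoints differently: an integer slice is positional and excludes the end, while a label slice includes both ends. A sketch on a hypothetical frame `df_demo` whose single column simply counts the rows:

```python
import pandas as pd

dates = pd.date_range("20180901", periods=6)
df_demo = pd.DataFrame({"A": range(6)}, index=dates)

# Positional slice: rows 0, 1, 2 (end position 3 is excluded).
by_position = df_demo[0:3]

# Label slice: 2018-09-02 through 2018-09-04, both endpoints included.
by_label = df_demo["20180902":"20180904"]
```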


#### Selection by Label

For getting a cross section using a label:

In [25]:
df.loc[dates[0]]

A    0.179401
B    0.869784
C   -0.708861
D    1.540188
Name: 2018-09-01 00:00:00, dtype: float64

Selecting on a multi-axis by label:

In [26]:
df.loc[:,['A','B']]

Unnamed: 0,A,B
2018-09-01,0.179401,0.869784
2018-09-02,1.170228,-1.214013
2018-09-03,-0.566644,-0.106563
2018-09-04,-0.888024,0.133448
2018-09-05,1.40458,0.816753
2018-09-06,0.176161,-1.289045


Label slicing includes both endpoints:

In [27]:
df.loc['20180902':'20180904',['A','B']]

Unnamed: 0,A,B
2018-09-02,1.170228,-1.214013
2018-09-03,-0.566644,-0.106563
2018-09-04,-0.888024,0.133448


For getting a scalar value:

In [28]:
df.loc[dates[0],'A']

0.17940096583151505
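
For single-value lookups, pandas also provides the `.at` accessor, which takes one row label and one column label. A sketch on a hypothetical two-row frame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [0.5, 1.5], "B": [2.5, 3.5]},
                       index=pd.date_range("20180901", periods=2))
first = df_demo.index[0]

# Both expressions return the same scalar; .at accepts only single labels.
via_loc = df_demo.loc[first, "A"]
via_at = df_demo.at[first, "A"]
```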

#### Selection by Position

Select via the position of the passed integers:

In [29]:
df.iloc[3]

A   -0.888024
B    0.133448
C   -2.120545
D    1.031937
Name: 2018-09-04 00:00:00, dtype: float64

By lists of integer position locations:

In [30]:
df.iloc[[1,2,4],[0,2]]

Unnamed: 0,A,C
2018-09-02,1.170228,-2.377521
2018-09-03,-0.566644,-2.353751
2018-09-05,1.40458,-1.335552
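
Positional selection also works with integer slices, and `.iat` reads a single cell by position. The sketch below uses a hypothetical frame `df_pos` filled with the integers 0 to 23 so that each result is easy to verify.

```python
import numpy as np
import pandas as pd

df_pos = pd.DataFrame(np.arange(24).reshape(6, 4),
                      index=pd.date_range("20180901", periods=6),
                      columns=list("ABCD"))

# Rows at positions 3 and 4, columns at positions 0 and 1 (ends excluded).
block = df_pos.iloc[3:5, 0:2]

# All columns for the rows at positions 1 and 2.
whole_rows = df_pos.iloc[1:3, :]

# A single cell by integer position: row 1, column 1.
cell = df_pos.iat[1, 1]
```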


### Boolean Indexing

Using a single column’s values to select data.

In [32]:
df[df.A < 0]

Unnamed: 0,A,B,C,D
2018-09-03,-0.566644,-0.106563,-2.353751,-1.267331
2018-09-04,-0.888024,0.133448,-2.120545,1.031937


Selecting values from a DataFrame where a boolean condition is met.

In [33]:
df[df > 0]

Unnamed: 0,A,B,C,D
2018-09-01,0.179401,0.869784,,1.540188
2018-09-02,1.170228,,,0.106146
2018-09-03,,,,
2018-09-04,,0.133448,,1.031937
2018-09-05,1.40458,0.816753,,
2018-09-06,0.176161,,,2.697843
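
Boolean conditions can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. The sketch below uses a hypothetical frame `df_bool`; it also shows that masking a whole frame gives the same result as `DataFrame.where`.

```python
import pandas as pd

df_bool = pd.DataFrame({"A": [-1.0, 2.0, 3.0, -4.0],
                        "B": [1.0, -2.0, 3.0, -4.0]})

# Rows where both columns are positive.
both_positive = df_bool[(df_bool["A"] > 0) & (df_bool["B"] > 0)]

# Rows where at least one column is negative.
either_negative = df_bool[(df_bool["A"] < 0) | (df_bool["B"] < 0)]

# Element-wise mask: entries that fail the condition become NaN.
masked = df_bool.where(df_bool > 0)
```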


Using the isin() method for filtering:

In [32]:
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2

Unnamed: 0,A,B,C,D,E
2018-09-01,0.179401,0.869784,-0.708861,1.540188,one
2018-09-02,1.170228,-1.214013,-2.377521,0.106146,one
2018-09-03,-0.566644,-0.106563,-2.353751,-1.267331,two
2018-09-04,-0.888024,0.133448,-2.120545,1.031937,three
2018-09-05,1.40458,0.816753,-1.335552,-0.935397,four
2018-09-06,0.176161,-1.289045,-1.156599,2.697843,three


In [33]:
df2[df2['E'].isin(['two','four'])]

Unnamed: 0,A,B,C,D,E
2018-09-03,-0.566644,-0.106563,-2.353751,-1.267331,two
2018-09-05,1.40458,0.816753,-1.335552,-0.935397,four
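
The mask returned by `isin()` can be stored and negated with `~` to select the complementary rows. A sketch on a hypothetical frame `df_cat` that reuses the labels from column `E`:

```python
import pandas as pd

df_cat = pd.DataFrame({"E": ["one", "one", "two", "three", "four", "three"],
                       "value": range(6)})

# True where E is one of the listed labels.
mask = df_cat["E"].isin(["two", "four"])

selected = df_cat[mask]   # rows whose label is in the list
rest = df_cat[~mask]      # all remaining rows
```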


### Reference:

https://pandas.pydata.org/pandas-docs/stable/10min.html#selection-by-label