# Introduction to Pandas

## Pandas

Pandas is an open-source Python library that provides high-performance data manipulation and analysis tools built on its powerful data structures. The name Pandas is derived from "panel data", an econometrics term for multidimensional data sets.

Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data:
- load
- prepare
- manipulate
- model
- analyze

### Key Features of Pandas
- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing and subsetting of large data sets.
- Columns from a data structure can be deleted or inserted.
- Group by data for aggregation and transformations.
- High-performance merging and joining of data.
- Time Series functionality.
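
A few of the features listed above (missing-data handling, group-by aggregation, and merging) can be sketched in a handful of lines. The `sales` and `managers` tables, their column names, and their values are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records; names and values are made up for illustration.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "units": [10.0, np.nan, 7.0, 5.0],
})

# Integrated handling of missing data: replace NaN in "units" with 0.
sales_filled = sales.fillna({"units": 0.0})

# Group by a column and aggregate.
totals = sales_filled.groupby("region")["units"].sum()

# Merge with a second (hypothetical) table on a shared key.
managers = pd.DataFrame({
    "region": ["north", "south"],
    "manager": ["Ann", "Bob"],
})
merged = sales_filled.merge(managers, on="region", how="left")
```

Each step returns a new object, so the original `sales` frame is left unchanged.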

Import the pandas module and additional modules:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Object Creation


Creating a Series by passing a list of values, letting pandas create a default integer index:

In [2]:
s = pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
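
As a small variation on the cell above, a Series can also be given explicit labels instead of the default integer index. The letter labels here are an assumption made for illustration only.

```python
import numpy as np
import pandas as pd

# Same values as above, but with explicit string labels instead of 0..5.
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=list("abcdef"))

value_c = s["c"]          # label-based lookup
missing = s.isna().sum()  # number of NaN entries
mean = s.mean()           # NaN values are skipped by default
```

Because the list contains `np.nan`, the integers are stored as `float64`, just as in the default-index example.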

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [3]:
dates = pd.date_range('20180901', periods=6)
dates

DatetimeIndex(['2018-09-01', '2018-09-02', '2018-09-03', '2018-09-04',
               '2018-09-05', '2018-09-06'],
              dtype='datetime64[ns]', freq='D')

In [4]:
# np.random.randn(rows, columns) draws samples from a standard normal distribution
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

| | A | B | C | D |
|---|---|---|---|---|
| 2018-09-01 | -0.764375 | 0.654639 | -0.528972 | 1.514615 |
| 2018-09-02 | 2.395033 | -1.80413 | -0.331756 | -1.137572 |
| 2018-09-03 | -1.059115 | -1.340427 | -0.01308 | 0.384761 |
| 2018-09-04 | -0.913472 | 0.044729 | 0.10189 | -0.241046 |
| 2018-09-05 | -0.560076 | -1.40519 | -1.493912 | -0.500598 |
| 2018-09-06 | 1.803176 | 0.020965 | 0.01666 | 0.426011 |


Creating a DataFrame by passing a dict of objects that can be converted to a series-like structure:

In [5]:
df2 = pd.DataFrame({ 'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo' })
df2

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |


The columns of the resulting DataFrame have different dtypes.

In [6]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
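
Once the dtypes are known, columns can be filtered or converted by dtype. The following is a minimal sketch on a reduced version of `df2`, using `select_dtypes` and `astype`; which columns to keep and the target types are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

# A reduced version of df2 with mixed dtypes.
df2 = pd.DataFrame({
    "A": 1.0,
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "F": "foo",
})

# Keep only the numeric columns.
numeric = df2.select_dtypes(include="number")

# Widen the 32-bit columns to 64 bits; this returns a new DataFrame.
df2_wide = df2.astype({"C": "float64", "D": "int64"})
```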

### Viewing Data

Here is how to view the top and bottom rows of the frame:

In [7]:
df.head(2)

| | A | B | C | D |
|---|---|---|---|---|
| 2018-09-01 | -0.764375 | 0.654639 | -0.528972 | 1.514615 |
| 2018-09-02 | 2.395033 | -1.80413 | -0.331756 | -1.137572 |


In [8]:
df.tail()

| | A | B | C | D |
|---|---|---|---|---|
| 2018-09-02 | 2.395033 | -1.80413 | -0.331756 | -1.137572 |
| 2018-09-03 | -1.059115 | -1.340427 | -0.01308 | 0.384761 |
| 2018-09-04 | -0.913472 | 0.044729 | 0.10189 | -0.241046 |
| 2018-09-05 | -0.560076 | -1.40519 | -1.493912 | -0.500598 |
| 2018-09-06 | 1.803176 | 0.020965 | 0.01666 | 0.426011 |


Display the index, columns, and the underlying NumPy data:

In [9]:
df.index

DatetimeIndex(['2018-09-01', '2018-09-02', '2018-09-03', '2018-09-04',
               '2018-09-05', '2018-09-06'],
              dtype='datetime64[ns]', freq='D')

In [10]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [11]:
df.values

array([[-0.76437529,  0.65463914, -0.52897199,  1.51461457],
       [ 2.39503292, -1.80413007, -0.33175627, -1.1375715 ],
       [-1.05911518, -1.34042651, -0.01308011,  0.38476136],
       [-0.91347166,  0.04472889,  0.10188972, -0.2410459 ],
       [-0.56007631, -1.40518978, -1.49391169, -0.50059796],
       [ 1.80317586,  0.02096476,  0.01665952,  0.42601096]])
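
The pandas documentation recommends `DataFrame.to_numpy()` over `.values` for obtaining the underlying data as a NumPy array. A minimal sketch, using a small deterministic frame instead of the random one above:

```python
import numpy as np
import pandas as pd

# A small deterministic frame: 3 rows, 4 columns, values 0.0 .. 11.0.
df = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=list("ABCD"))

# Convert to a plain NumPy array; index and column labels are dropped.
arr = df.to_numpy()
```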

describe() shows a quick statistical summary of your data:

In [12]:
df.describe()

| | A | B | C | D |
|---|---|---|---|---|
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | 0.150195 | -0.638236 | -0.374862 | 0.074362 |
| std | 1.530123 | 1.00131 | 0.598222 | 0.91577 |
| min | -1.059115 | -1.80413 | -1.493912 | -1.137572 |
| 25% | -0.876198 | -1.388999 | -0.479668 | -0.43571 |
| 50% | -0.662226 | -0.659731 | -0.172418 | 0.071858 |
| 75% | 1.212363 | 0.038788 | 0.009225 | 0.415699 |
| max | 2.395033 | 0.654639 | 0.10189 | 1.514615 |
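
The summary can be adjusted, or individual statistics computed directly. The sketch below uses a small hand-written frame so the numbers are easy to check; the chosen percentiles are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [10.0, 20.0, 30.0, 40.0],
})

# Request the 10th and 90th percentiles in the summary.
summary = df.describe(percentiles=[0.1, 0.9])

# Individual statistics are also available as methods; each returns a Series
# indexed by column name.
col_means = df.mean()
```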


Transposing your data:

In [13]:
df.T

| | 2018-09-01 | 2018-09-02 | 2018-09-03 | 2018-09-04 | 2018-09-05 | 2018-09-06 |
|---|---|---|---|---|---|---|
| A | -0.764375 | 2.395033 | -1.059115 | -0.913472 | -0.560076 | 1.803176 |
| B | 0.654639 | -1.80413 | -1.340427 | 0.044729 | -1.40519 | 0.020965 |
| C | -0.528972 | -0.331756 | -0.01308 | 0.10189 | -1.493912 | 0.01666 |
| D | 1.514615 | -1.137572 | 0.384761 | -0.241046 | -0.500598 | 0.426011 |


Sorting by an axis:

In [14]:
df.sort_index(axis=1, ascending=False)

| | D | C | B | A |
|---|---|---|---|---|
| 2018-09-01 | 1.514615 | -0.528972 | 0.654639 | -0.764375 |
| 2018-09-02 | -1.137572 | -0.331756 | -1.80413 | 2.395033 |
| 2018-09-03 | 0.384761 | -0.01308 | -1.340427 | -1.059115 |
| 2018-09-04 | -0.241046 | 0.10189 | 0.044729 | -0.913472 |
| 2018-09-05 | -0.500598 | -1.493912 | -1.40519 | -0.560076 |
| 2018-09-06 | 0.426011 | 0.01666 | 0.020965 | 1.803176 |


Sorting by values:

In [15]:
df.sort_values(by='B')

| | A | B | C | D |
|---|---|---|---|---|
| 2018-09-02 | 2.395033 | -1.80413 | -0.331756 | -1.137572 |
| 2018-09-05 | -0.560076 | -1.40519 | -1.493912 | -0.500598 |
| 2018-09-03 | -1.059115 | -1.340427 | -0.01308 | 0.384761 |
| 2018-09-06 | 1.803176 | 0.020965 | 0.01666 | 0.426011 |
| 2018-09-04 | -0.913472 | 0.044729 | 0.10189 | -0.241046 |
| 2018-09-01 | -0.764375 | 0.654639 | -0.528972 | 1.514615 |
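
`sort_values` also accepts a sort direction and a list of columns. A minimal sketch on a hypothetical frame (the `group` and `score` columns are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["b", "a", "b", "a"],
    "score": [2, 9, 5, 1],
})

# Sort by a single column, largest value first.
by_score_desc = df.sort_values(by="score", ascending=False)

# Sort by group ascending, then by score descending within each group.
by_group_then_score = df.sort_values(
    by=["group", "score"], ascending=[True, False]
)
```

Both calls return a new DataFrame and keep the original index labels, so the row order of `df` itself does not change.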


### Selection

#### Getting

Selecting a single column, which yields a Series, equivalent to df.A:

In [16]:
df['A']

2018-09-01   -0.764375
2018-09-02    2.395033
2018-09-03   -1.059115
2018-09-04   -0.913472
2018-09-05   -0.560076
2018-09-06    1.803176
Freq: D, Name: A, dtype: float64
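
Attribute access such as `df.A` only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute, so bracket notation is the more general form. Passing a list of names selects several columns and yields a DataFrame. A short sketch with a hypothetical column name containing a space:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "my col": [3, 4]})

single = df["A"]               # one column -> Series
subset = df[["A", "my col"]]   # list of columns -> DataFrame
spaced = df["my col"]          # not reachable via attribute access
```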

Selecting via [], which slices the rows.

In [17]:
# get the first 3 rows (positions 0, 1 and 2)
df[0:3]

| | A | B | C | D |
|---|---|---|---|---|
| 2018-09-01 | -0.764375 | 0.654639 | -0.528972 | 1.514615 |
| 2018-09-02 | 2.395033 | -1.80413 | -0.331756 | -1.137572 |
| 2018-09-03 | -1.059115 | -1.340427 | -0.01308 | 0.384761 |


In [18]:
df['20180902':'20180904']

| | A | B | C | D |
|---|---|---|---|---|
| 2018-09-02 | 2.395033 | -1.80413 | -0.331756 | -1.137572 |
| 2018-09-03 | -1.059115 | -1.340427 | -0.01308 | 0.384761 |
| 2018-09-04 | -0.913472 | 0.044729 | 0.10189 | -0.241046 |


#### Selection by Label

For getting a cross section using a label:

In [19]:
df.loc[dates[0]]

A   -0.764375
B    0.654639
C   -0.528972
D    1.514615
Name: 2018-09-01 00:00:00, dtype: float64

Selecting on a multi-axis by label:

In [20]:
df.loc[:,['A','B']]

| | A | B |
|---|---|---|
| 2018-09-01 | -0.764375 | 0.654639 |
| 2018-09-02 | 2.395033 | -1.80413 |
| 2018-09-03 | -1.059115 | -1.340427 |
| 2018-09-04 | -0.913472 | 0.044729 |
| 2018-09-05 | -0.560076 | -1.40519 |
| 2018-09-06 | 1.803176 | 0.020965 |


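When slicing by label, both endpoints are included. The labels must fall within the frame's index, which here covers 2018-09-01 to 2018-09-06:

In [21]:
df.loc['20180902':'20180904',['A','B']]

| | A | B |
|---|---|---|
| 2018-09-02 | 2.395033 | -1.80413 |
| 2018-09-03 | -1.059115 | -1.340427 |
| 2018-09-04 | -0.913472 | 0.044729 |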


For getting a scalar value:

In [22]:
df.loc[dates[0],'A']

-0.7643752905586128
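
For single scalar lookups, pandas also provides the `.at` accessor, which is equivalent to the `.loc` call above but intended for fast access to one value. A minimal sketch on a small deterministic frame:

```python
import pandas as pd

dates = pd.date_range("20180901", periods=3)
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=dates)

via_loc = df.loc[dates[0], "A"]  # general label-based access
via_at = df.at[dates[0], "A"]    # scalar-only label-based access
```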

#### Selection by Position

Select via the position of the passed integers:

In [23]:
df.iloc[3]

A   -0.913472
B    0.044729
C    0.101890
D   -0.241046
Name: 2018-09-04 00:00:00, dtype: float64

Selecting by lists of integer position locations:

In [24]:
df.iloc[[1,2,4],[0,2]]

| | A | C |
|---|---|---|
| 2018-09-02 | 2.395033 | -0.331756 |
| 2018-09-03 | -1.059115 | -0.01308 |
| 2018-09-05 | -0.560076 | -1.493912 |
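
One difference between label-based and position-based selection is worth a quick check: a `.loc` slice includes both endpoints, while an `.iloc` slice follows Python's convention and excludes the end position. A minimal sketch, using a single integer column so the selected rows are easy to identify:

```python
import pandas as pd

dates = pd.date_range("20180901", periods=6)
df = pd.DataFrame({"A": range(6)}, index=dates)

# Label slice: 2018-09-02 through 2018-09-04, both included (3 rows).
by_label = df.loc["20180902":"20180904"]

# Position slice: positions 1 and 2 only, position 3 is excluded (2 rows).
by_position = df.iloc[1:3]
```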


### Boolean Indexing

Using a single column’s values to select data.

In [25]:
df[df.A < 0]

| | A | B | C | D |
|---|---|---|---|---|
| 2018-09-01 | -0.764375 | 0.654639 | -0.528972 | 1.514615 |
| 2018-09-03 | -1.059115 | -1.340427 | -0.01308 | 0.384761 |
| 2018-09-04 | -0.913472 | 0.044729 | 0.10189 | -0.241046 |
| 2018-09-05 | -0.560076 | -1.40519 | -1.493912 | -0.500598 |


Selecting values from a DataFrame where a boolean condition is met.

In [26]:
df[df > 0]

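| | A | B | C | D |
|---|---|---|---|---|
| 2018-09-01 | NaN | 0.654639 | NaN | 1.514615 |
| 2018-09-02 | 2.395033 | NaN | NaN | NaN |
| 2018-09-03 | NaN | NaN | NaN | 0.384761 |
| 2018-09-04 | NaN | 0.044729 | 0.10189 | NaN |
| 2018-09-05 | NaN | NaN | NaN | NaN |
| 2018-09-06 | 1.803176 | 0.020965 | 0.01666 | 0.426011 |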
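
Several conditions can be combined with the element-wise operators `&` (and) and `|` (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A minimal sketch on a small hand-written frame:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [-1.0, 2.0, -3.0, 4.0],
    "B": [1.0, -2.0, -3.0, 4.0],
})

# Rows where both A and B are positive.
both_positive = df[(df["A"] > 0) & (df["B"] > 0)]

# Rows where at least one of A or B is negative.
either_negative = df[(df["A"] < 0) | (df["B"] < 0)]
```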


Using the isin() method for filtering:

In [27]:
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2

| | A | B | C | D | E |
|---|---|---|---|---|---|
| 2018-09-01 | -0.764375 | 0.654639 | -0.528972 | 1.514615 | one |
| 2018-09-02 | 2.395033 | -1.80413 | -0.331756 | -1.137572 | one |
| 2018-09-03 | -1.059115 | -1.340427 | -0.01308 | 0.384761 | two |
| 2018-09-04 | -0.913472 | 0.044729 | 0.10189 | -0.241046 | three |
| 2018-09-05 | -0.560076 | -1.40519 | -1.493912 | -0.500598 | four |
| 2018-09-06 | 1.803176 | 0.020965 | 0.01666 | 0.426011 | three |


In [28]:
df2[df2['E'].isin(['two','four'])]

| | A | B | C | D | E |
|---|---|---|---|---|---|
| 2018-09-03 | -1.059115 | -1.340427 | -0.01308 | 0.384761 | two |
| 2018-09-05 | -0.560076 | -1.40519 | -1.493912 | -0.500598 | four |
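
The boolean mask returned by `isin()` can be inverted with `~` to keep the rows that do not match. A minimal sketch on a small hypothetical frame:

```python
import pandas as pd

df2 = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "E": ["one", "two", "three", "four"],
})

# True where E is one of the listed values.
mask = df2["E"].isin(["two", "four"])

# Invert the mask to exclude those rows instead of selecting them.
excluded = df2[~mask]
```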


### Reference:

https://pandas.pydata.org/pandas-docs/stable/10min.html#selection-by-label