# Pandas - Introduction

`Pandas` is a package built on top of `NumPy`. Its core data structures are the `Series` and `DataFrame` objects, and it provides an efficient implementation of the DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

Compared with NumPy, Pandas offers more flexibility (e.g., attaching labels to data, working with missing data) and handles operations that do not map well to element-wise broadcasting (e.g., groupings, pivots). Each of these is an important part of analyzing the less structured data found in many forms in the world around us.

* Pandas series, DataFrame
* Quick checking DataFrame
* Descriptive stats on DataFrame
* Indexing, slicing, conditional subsetting
* Basic operations

In [1]:
import numpy as np
import pandas as pd

## 1.1. Series Object

The Series object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the index: while a NumPy array has an implicitly defined integer index used to access the values, a Pandas Series has an explicitly defined index associated with the values. The two parts can be accessed with the `values` and `index` attributes.

In [2]:
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=["alice", "bob", "charles", "darwin"])
s

alice      0.25
bob        0.50
charles    0.75
darwin     1.00
dtype: float64

In [3]:
# series items
s.values

array([0.25, 0.5 , 0.75, 1.  ])

In [4]:
# series keys
s.index

Index(['alice', 'bob', 'charles', 'darwin'], dtype='object')
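To make the difference between the explicit index and the implicit integer position concrete, here is a minimal sketch that rebuilds the Series above and reads it both ways. The variable names (`by_label`, `label_slice`, and so on) are only illustrative.

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=["alice", "bob", "charles", "darwin"])

# explicit index: look up by label
by_label = s.loc["bob"]
# implicit integer position, as in a NumPy array
by_position = s.iloc[1]

# label slices include the end label; positional slices exclude the end
label_slice = s.loc["bob":"charles"]
position_slice = s.iloc[1:2]

print(by_label, by_position)
print(list(label_slice.index), list(position_slice.index))
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether a key is meant as a label or as a position.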

The Pandas Series is also like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values. **This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.**

In [5]:
# Using a pre-defined Dictionary object
s = {'color': 'black',
     'size': 'S',
     'a': 0,
     'b': 1}

pd.Series(s)

color    black
size         S
a            0
b            1
dtype: object
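The dictionary analogy can be pushed a little further. The sketch below, using a small made-up `prices` Series, shows a few dictionary-style operations that a Series supports, along with one thing a plain dictionary does not offer (vectorized arithmetic).

```python
import pandas as pd

# hypothetical data, used only for illustration
prices = pd.Series({"apple": 1.2, "banana": 0.5, "cherry": 3.0})

# membership tests look at the index, like dictionary keys
has_apple = "apple" in prices

# keys() returns the index
keys = list(prices.keys())

# assigning to a new label extends the Series
prices["durian"] = 8.0

# unlike a dictionary, arithmetic applies to all values at once
doubled = prices * 2

print(has_apple, keys)
print(doubled)
```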

## 1.2. DataFrame Object

The DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. A DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. We can create a DataFrame from a dictionary of `Series` objects or from a plain dictionary of lists with a specified index. When the Series have different lengths, Pandas aligns them on their index labels and fills the gaps with `NaN`.

In [6]:
# create DataFrame by series with index
data = {'one': pd.Series(list(range(0,5)), index=['a','b','c','d','e']),
       'two': pd.Series(list(range(1,7)), index=['a','b','c','d','e','f'])}

pd.DataFrame(data)

Unnamed: 0,one,two
a,0.0,1
b,1.0,2
c,2.0,3
d,3.0,4
e,4.0,5
f,,6


In [7]:
# create DataFrame by dictionary
data = {'color': ['black', 'white', 'red', 'white', 'green'],
        'size': ['S', 'M', 'L', 'M', 'XL'],
        'date': pd.date_range('1/1/2019', periods=5, freq='W'),
        'a': np.random.randn(5),
        'b': np.random.normal(0.5, 2, 5)}

df = pd.DataFrame(data, index=range(1,6))
df

Unnamed: 0,color,size,date,a,b
1,black,S,2019-01-06,0.053487,0.451937
2,white,M,2019-01-13,0.83865,0.813639
3,red,L,2019-01-20,0.456919,2.49317
4,white,M,2019-01-27,-1.224732,-1.017963
5,green,XL,2019-02-03,1.564027,0.780043


In [8]:
# access to the index labels
df.index

RangeIndex(start=1, stop=6, step=1)

In [9]:
# access to the column labels
df.columns

Index(['color', 'size', 'date', 'a', 'b'], dtype='object')

A DataFrame can also be created by reading directly from a CSV or an Excel file with `pd.read_csv` and `pd.read_excel`.

In [10]:
pd.read_csv('../Data/Test_Scores.csv').head()

Unnamed: 0,ACT,FinalExam,QuizAvg,TestAvg
0,33,181,95,89
1,31,169,81,89
2,21,176,65,68
3,25,181,66,90
4,29,169,89,81
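Since the CSV file above may not be at hand, the following sketch reads the same kind of data from an in-memory string instead. The two data rows are copied from the output above; wrapping the text in `io.StringIO` is just a convenience for the example, not part of the original notebook.

```python
import io
import pandas as pd

# first two rows of the table shown above, written out as CSV text
csv_text = "ACT,FinalExam,QuizAvg,TestAvg\n33,181,95,89\n31,169,81,89\n"

# read_csv accepts a path or any file-like object
scores = pd.read_csv(io.StringIO(csv_text))

print(scores)
print(scores.dtypes)
```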


## 1.3. Index Object

The Index object is an interesting structure in itself. It can be thought of either as an immutable array or as an ordered set (technically a multi-set, as Index objects may contain repeated values). Index objects have many of the attributes familiar from NumPy arrays, but unlike arrays they cannot be modified in place.

In [11]:
index = [['A', 'B', 'B', 'B', 'C'], ['JP', 'CN', 'US', 'US', 'US']]

ind = pd.Index(index)
ind

Index([['A', 'B', 'B', 'B', 'C'], ['JP', 'CN', 'US', 'US', 'US']], dtype='object')
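The immutability mentioned above is easy to check. In this small sketch (the values are arbitrary), reading and slicing an Index works as it does for an array, while item assignment is expected to raise a `TypeError`.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# reading and slicing work like a NumPy array
first = ind[0]
sliced = ind[::2]

# item assignment is rejected because an Index is immutable
try:
    ind[0] = 0
    mutated = True
except TypeError:
    mutated = False

print(first, list(sliced), mutated)
```

This immutability makes it safer to share one Index between several Series and DataFrames.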

The Index object follows many of the conventions used by Python's built-in `set` data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

In [12]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

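# The operators &, | and ^ no longer perform set operations on an Index
# in recent pandas versions, so the explicit methods are used here.

# intersection
print(indA.intersection(indB))

# union
print(indA.union(indB))

# symmetric difference
print(indA.symmetric_difference(indB))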

Int64Index([3, 5, 7], dtype='int64')
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
Int64Index([1, 2, 9, 11], dtype='int64')


In [13]:
# Adding single-indexing to dataframe
df.set_index(pd.Index(['A', 'B', 'C', 'D', 'E']), inplace=True)
df

Unnamed: 0,color,size,date,a,b
A,black,S,2019-01-06,0.053487,0.451937
B,white,M,2019-01-13,0.83865,0.813639
C,red,L,2019-01-20,0.456919,2.49317
D,white,M,2019-01-27,-1.224732,-1.017963
E,green,XL,2019-02-03,1.564027,0.780043


In [14]:
# reset index from 0
df.reset_index()

Unnamed: 0,index,color,size,date,a,b
0,A,black,S,2019-01-06,0.053487,0.451937
1,B,white,M,2019-01-13,0.83865,0.813639
2,C,red,L,2019-01-20,0.456919,2.49317
3,D,white,M,2019-01-27,-1.224732,-1.017963
4,E,green,XL,2019-02-03,1.564027,0.780043


## 2. Quick Checking DataFrames

In [15]:
# select top 3
df.head(3)

# select last 3
df.tail(3)

# sample data
df.sample(3)

Unnamed: 0,color,size,date,a,b
A,black,S,2019-01-06,0.053487,0.451937
B,white,M,2019-01-13,0.83865,0.813639
D,white,M,2019-01-27,-1.224732,-1.017963


In [16]:
# check data type
df.dtypes

color            object
size             object
date     datetime64[ns]
a               float64
b               float64
dtype: object

In [17]:
# change data type
df.a.astype(int)

A    0
B    0
C    0
D   -1
E    1
Name: a, dtype: int32

In [18]:
# check data information
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, A to E
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   color   5 non-null      object        
 1   size    5 non-null      object        
 2   date    5 non-null      datetime64[ns]
 3   a       5 non-null      float64       
 4   b       5 non-null      float64       
dtypes: datetime64[ns](1), float64(2), object(2)
memory usage: 400.0+ bytes


## 3. Basic Descriptive Statistics

In [19]:
# basic statistical analysis
df.describe()

Unnamed: 0,a,b
count,5.0,5.0
mean,0.33767,0.704165
std,1.035738,1.249762
min,-1.224732,-1.017963
25%,0.053487,0.451937
50%,0.456919,0.780043
75%,0.83865,0.813639
max,1.564027,2.49317


In [20]:
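# numeric_only=True skips the text columns, which recent pandas
# versions no longer drop silently

# average by columns
df.mean(numeric_only=True)

# median by columns
df.median(numeric_only=True)

# variance by columns
df.var(numeric_only=True)

# standard deviation by columns
df.std(numeric_only=True)

# min by columns
df.min()

# max by columns
df.max()

# 95th percentile (value below which 95% of the data fall)
df.quantile(0.95, numeric_only=True)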

  
  """


a    1.418951
b    2.157263
Name: 0.95, dtype: float64

## 4. Indexing & Slicing

### 4.1. select rows

In [21]:
# select row by integer position, return Series
df.iloc[0]

# select row by integer position, return DataFrame
df.iloc[[0]]

Unnamed: 0,color,size,date,a,b
A,black,S,2019-01-06,0.053487,0.451937


In [22]:
# select row by label, return Series
df.loc['A']

# select row by label, return DataFrame
df.loc[['A']]

Unnamed: 0,color,size,date,a,b
A,black,S,2019-01-06,0.053487,0.451937


### 4.2. select columns

In [23]:
# select column by integer position, return Series
df.iloc[:, 0]

# select column by integer position, return DataFrame
df.iloc[:, [0]]

Unnamed: 0,color
A,black
B,white
C,red
D,white
E,green


In [24]:
# select column by name, return Series
df.loc[:, 'color']

# select column by name, return dataframe
df[['color']]

Unnamed: 0,color
A,black
B,white
C,red
D,white
E,green


### 4.3. conditional subsetting

In [25]:
# numeric condition
df[df.a < -0.1]

Unnamed: 0,color,size,date,a,b
D,white,M,2019-01-27,-1.224732,-1.017963


In [26]:
# categorical condition
df[df.color.isin(['white', 'black'])]

Unnamed: 0,color,size,date,a,b
A,black,S,2019-01-06,0.053487,0.451937
B,white,M,2019-01-13,0.83865,0.813639
D,white,M,2019-01-27,-1.224732,-1.017963


In [27]:
# exclusive condition
# same as != 'S'
df[df['size'].ne('S')]

Unnamed: 0,color,size,date,a,b
B,white,M,2019-01-13,0.83865,0.813639
C,red,L,2019-01-20,0.456919,2.49317
D,white,M,2019-01-27,-1.224732,-1.017963
E,green,XL,2019-02-03,1.564027,0.780043


In [28]:
df.loc[(df.color == 'white') & (df['size'] == 'M'), ['a', 'b']]

Unnamed: 0,a,b
B,0.83865,0.813639
D,-1.224732,-1.017963


## 5. Basic Operation

### 5.1. Evaluating an expression

Pandas supports expression evaluation through the `eval()` and `query()` methods, which take string expressions. When the `Numexpr` library is installed, these expressions are evaluated without allocating full-sized temporary arrays, which can be much more efficient than the equivalent `NumPy` operations, especially for large arrays.

Besides the potential gain in efficiency, a query string is often easier to read and understand than the equivalent masking expression. Both `query()` and `eval()` also accept the `@` prefix to refer to local Python variables.

In [29]:
df.query("color == 'white' and size == 'M'")

Unnamed: 0,color,size,date,a,b
B,white,M,2019-01-13,0.83865,0.813639
D,white,M,2019-01-27,-1.224732,-1.017963


In [30]:
# quick evaluate
df.eval('a + b')

A    0.505424
B    1.652288
C    2.950088
D   -2.242694
E    2.344069
dtype: float64

In [31]:
threshold = 0
df.eval('c = a + b > @threshold', inplace=True)
df

Unnamed: 0,color,size,date,a,b,c
A,black,S,2019-01-06,0.053487,0.451937,True
B,white,M,2019-01-13,0.83865,0.813639,True
C,red,L,2019-01-20,0.456919,2.49317,True
D,white,M,2019-01-27,-1.224732,-1.017963,False
E,green,XL,2019-02-03,1.564027,0.780043,True


### 5.2. Insert columns

In [32]:
# insert the day column as the 4th column (zero-based position 3)
df.insert(3, 'day', df.date.dt.strftime('%A'))
df

Unnamed: 0,color,size,date,day,a,b,c
A,black,S,2019-01-06,Sunday,0.053487,0.451937,True
B,white,M,2019-01-13,Sunday,0.83865,0.813639,True
C,red,L,2019-01-20,Sunday,0.456919,2.49317,True
D,white,M,2019-01-27,Sunday,-1.224732,-1.017963,False
E,green,XL,2019-02-03,Sunday,1.564027,0.780043,True


### 5.3. Deleting columns or rows 

In [33]:
# Notice the axis=1 option for columns
df.drop('c', axis = 1)

Unnamed: 0,color,size,date,day,a,b
A,black,S,2019-01-06,Sunday,0.053487,0.451937
B,white,M,2019-01-13,Sunday,0.83865,0.813639
C,red,L,2019-01-20,Sunday,0.456919,2.49317
D,white,M,2019-01-27,Sunday,-1.224732,-1.017963
E,green,XL,2019-02-03,Sunday,1.564027,0.780043


In [34]:
# axis = 0 is default for rows
df.drop('B', axis = 0)

Unnamed: 0,color,size,date,day,a,b,c
A,black,S,2019-01-06,Sunday,0.053487,0.451937,True
C,red,L,2019-01-20,Sunday,0.456919,2.49317,True
D,white,M,2019-01-27,Sunday,-1.224732,-1.017963,False
E,green,XL,2019-02-03,Sunday,1.564027,0.780043,True


In [35]:
# removing rows with conditions
# here this gives the same rows as df[df.color != 'red'];
# note that dropna() would also drop rows with any other missing value
df.replace('red', np.nan).dropna()

Unnamed: 0,color,size,date,day,a,b,c
A,black,S,2019-01-06,Sunday,0.053487,0.451937,True
B,white,M,2019-01-13,Sunday,0.83865,0.813639,True
D,white,M,2019-01-27,Sunday,-1.224732,-1.017963,False
E,green,XL,2019-02-03,Sunday,1.564027,0.780043,True


### 5.4. Unique & Counts

In [36]:
# return columns as array
print('Color column:', df.color.values)

# return unique category 
print('Color column unique values:', df.color.unique())

Color column: ['black' 'white' 'red' 'white' 'green']
Color column unique values: ['black' 'white' 'red' 'green']


In [37]:
# return counts
print('Color column counts:', df.color.count())

# return distinct counts
print('Color column distinct counts:', df.color.nunique())

Color column counts: 5
Color column distinct counts: 4


In [38]:
# return counts by category
df.color.value_counts()

white    2
black    1
green    1
red      1
Name: color, dtype: int64

### 5.5. Sorting

In [39]:
# sort by columns
df.sort_values(by='date', ascending=False)

Unnamed: 0,color,size,date,day,a,b,c
E,green,XL,2019-02-03,Sunday,1.564027,0.780043,True
D,white,M,2019-01-27,Sunday,-1.224732,-1.017963,False
C,red,L,2019-01-20,Sunday,0.456919,2.49317,True
B,white,M,2019-01-13,Sunday,0.83865,0.813639,True
A,black,S,2019-01-06,Sunday,0.053487,0.451937,True


In [40]:
# sort by index
df.sort_index(ascending=False)

Unnamed: 0,color,size,date,day,a,b,c
E,green,XL,2019-02-03,Sunday,1.564027,0.780043,True
D,white,M,2019-01-27,Sunday,-1.224732,-1.017963,False
C,red,L,2019-01-20,Sunday,0.456919,2.49317,True
B,white,M,2019-01-13,Sunday,0.83865,0.813639,True
A,black,S,2019-01-06,Sunday,0.053487,0.451937,True


### 5.6. Drop/filling Missing Value

In [41]:
dff = df.copy()
dff.iloc[0, 4] = np.nan
dff.iloc[2, 3] = np.nan
dff

Unnamed: 0,color,size,date,day,a,b,c
A,black,S,2019-01-06,Sunday,,0.451937,True
B,white,M,2019-01-13,Sunday,0.83865,0.813639,True
C,red,L,2019-01-20,,0.456919,2.49317,True
D,white,M,2019-01-27,Sunday,-1.224732,-1.017963,False
E,green,XL,2019-02-03,Sunday,1.564027,0.780043,True


In [42]:
# drop rows containing any NA (use axis=1 to drop columns instead)
# same as dff[~dff.isna().any(axis=1)]
dff.dropna(axis=0)

Unnamed: 0,color,size,date,day,a,b,c
B,white,M,2019-01-13,Sunday,0.83865,0.813639,True
D,white,M,2019-01-27,Sunday,-1.224732,-1.017963,False
E,green,XL,2019-02-03,Sunday,1.564027,0.780043,True


In [43]:
# drop rows where a specific column has NA values
dff[~dff.a.isna()]

Unnamed: 0,color,size,date,day,a,b,c
B,white,M,2019-01-13,Sunday,0.83865,0.813639,True
C,red,L,2019-01-20,,0.456919,2.49317,True
D,white,M,2019-01-27,Sunday,-1.224732,-1.017963,False
E,green,XL,2019-02-03,Sunday,1.564027,0.780043,True


In [44]:
# fill in NA with the column mean (only numeric columns have one)
dff.fillna(dff.mean(numeric_only=True))

  


Unnamed: 0,color,size,date,day,a,b,c
A,black,S,2019-01-06,Sunday,0.408716,0.451937,True
B,white,M,2019-01-13,Sunday,0.83865,0.813639,True
C,red,L,2019-01-20,,0.456919,2.49317,True
D,white,M,2019-01-27,Sunday,-1.224732,-1.017963,False
E,green,XL,2019-02-03,Sunday,1.564027,0.780043,True


### 5.7. Map & Apply

Mapping is used to transform an initial set of values to another set of values through a function.

In [45]:
# map
df.a.map(lambda x: np.mean(x))

A    0.053487
B    0.838650
C    0.456919
D   -1.224732
E    1.564027
Name: a, dtype: float64

`apply` is similar to `map`, except that on a DataFrame it passes each whole column (or each row, with `axis=1`) to the function rather than one value at a time.

In [46]:
# apply
df[['a','b']].apply(lambda x: x.mean())

a    0.337670
b    0.704165
dtype: float64
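The contrast between the two is easier to see on data where the function actually changes something. The sketch below uses made-up values: `map` translates each element of a Series through a dictionary, while `apply` receives a whole column, or a whole row when `axis=1`.

```python
import pandas as pd

# map: element-wise lookup on a Series (the cm values are hypothetical)
sizes = pd.Series(["S", "M", "L", "M"])
size_cm = sizes.map({"S": 36, "M": 40, "L": 44})

nums = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# apply with the default axis=0: the function sees one column at a time
col_range = nums.apply(lambda col: col.max() - col.min())

# apply with axis=1: the function sees one row at a time
row_sum = nums.apply(lambda row: row["a"] + row["b"], axis=1)

print(size_cm.tolist())
print(col_range)
print(row_sum.tolist())
```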

In [47]:
# Transform
df.groupby('color').transform(lambda x: x.mean())

Unnamed: 0,date,a,b,c
A,2019-01-06,0.053487,0.451937,1.0
B,2019-01-20,-0.193041,-0.102162,0.5
C,2019-01-20,0.456919,2.49317,1.0
D,2019-01-20,-0.193041,-0.102162,0.5
E,2019-02-03,1.564027,0.780043,1.0


### 5.8. Rolling & Expanding

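Window functions aggregate over a sliding group of rows, optionally within each group of a `groupby`. `rolling` uses a fixed-size window of n rows; `expanding` uses a window that grows by one row at each step. When a window holds fewer observations than required, the result is `NaN`.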

In [48]:
# aggregation
df.groupby('color').a.sum()

color
black    0.053487
green    1.564027
red      0.456919
white   -0.386082
Name: a, dtype: float64

In [49]:
# like a SQL window function with a fixed-size window
df.groupby('color').rolling(2).a.sum()

color   
black  A         NaN
green  E         NaN
red    C         NaN
white  B         NaN
       D   -0.386082
Name: a, dtype: float64

In [50]:
# like a SQL window function with a growing window
df.groupby('color').expanding(1).a.sum()

color   
black  A    0.053487
green  E    1.564027
red    C    0.456919
white  B    0.838650
       D   -0.386082
Name: a, dtype: float64
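Because most groups above contain a single row, the window behaviour is hard to see. Here is a minimal sketch on a plain Series of the integers 1 to 5, chosen only for illustration, that shows the fixed window, the growing window, and the effect of `min_periods`.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# fixed window of 2 rows: the first row has too little data, so it is NaN
rolling_sum = s.rolling(window=2).sum()

# growing window: each row sums everything seen so far
expanding_sum = s.expanding(min_periods=1).sum()

# min_periods=1 lets a rolling window return a value from a partial window
rolling_partial = s.rolling(window=2, min_periods=1).sum()

print(rolling_sum.tolist())
print(expanding_sum.tolist())
print(rolling_partial.tolist())
```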

### 5.9. Filter

In [51]:
# SQL having
df.groupby('color').filter(lambda x: len(x) > 1)

Unnamed: 0,color,size,date,day,a,b,c
B,white,M,2019-01-13,Sunday,0.83865,0.813639,True
D,white,M,2019-01-27,Sunday,-1.224732,-1.017963,False


### 5.10. Melt

In [52]:
# Convert wide dataframes to narrow
temp = pd.DataFrame({'ID': ['A','B','C'], 'Day1': [1,2,3], 'Day2': [2,4,6]})
print(temp)

temp.melt(id_vars=['ID'])

  ID  Day1  Day2
0  A     1     2
1  B     2     4
2  C     3     6


Unnamed: 0,ID,variable,value
0,A,Day1,1
1,B,Day1,2
2,C,Day1,3
3,A,Day2,2
4,B,Day2,4
5,C,Day2,6


### 5.11. Explode

In [53]:
temp = {'order_id':[1,3,7],
        'order_date':['20/5/2018','22/5/2018','23/5/2018'],
        'package':['p1,p2,p3','p4','p5,p6'],
        'package_code':['#111,#222,#333','#444','#555,$666']}
temp = pd.DataFrame(temp)
temp

Unnamed: 0,order_id,order_date,package,package_code
0,1,20/5/2018,"p1,p2,p3","#111,#222,#333"
1,3,22/5/2018,p4,#444
2,7,23/5/2018,"p5,p6","#555,$666"


In [54]:
temp.set_index(['order_id', 'order_date']).apply(lambda x: x.str.split(',').explode()).reset_index()

Unnamed: 0,order_id,order_date,package,package_code
0,1,20/5/2018,p1,#111
1,1,20/5/2018,p2,#222
2,1,20/5/2018,p3,#333
3,3,22/5/2018,p4,#444
4,7,23/5/2018,p5,#555
5,7,23/5/2018,p6,$666
