# Pandas Basics - Part 1

## 1. Series

A pandas `Series` is very similar to a NumPy array and is built on top of the array object. The difference is that a `Series` carries **labels**, so its values **can be indexed** by **label** as well as by position.
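As a minimal sketch of that difference (the values and labels here are illustrative assumptions, not data from the cells below), the same numbers can be read by position from an array and by label from a `Series`:

```python
import numpy as np
import pandas as pd

# Hypothetical example data
arr = np.array([10, 20, 30])
ser = pd.Series(arr, index=['a', 'b', 'c'])

# A NumPy array is addressed by position only
second_from_array = arr[1]

# A Series can be addressed by label, or by position through .iloc
second_by_label = ser['b']
second_by_position = ser.iloc[1]

# The values remain available as a NumPy array
values = ser.to_numpy()
```

All three lookups return the same element; only the way it is addressed differs.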

### 1.1. Series initialization

In [10]:
import numpy as np
import pandas as pd

In [11]:
labels = ['a','b','c']
my_data = [10,20,30]
arr = np.array(my_data)
d = {'a':10,'b':20,'c':30}

In [12]:
pd.Series(data=my_data)

0    10
1    20
2    30
dtype: int64

In [13]:
# initialize with a list
#  - the 1st positional argument is data, the 2nd is index
pd.Series(data=my_data,index=labels)

a    10
b    20
c    30
dtype: int64

In [16]:
pd.Series(labels, my_data) # arguments swapped: still valid, but the strings become the data and the numbers become the index [note dtype: object]

10    a
20    b
30    c
dtype: object

In [18]:
# initialize with numpy array
pd.Series(arr,labels)

a    10
b    20
c    30
dtype: int64

In [19]:
# initialize with dictionary
pd.Series(d)

a    10
b    20
c    30
dtype: int64

In [20]:
# A Series can hold arbitrary objects, e.g. functions; the dtype is 'object' in this case
pd.Series(data=[sum,print,len])

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

### 1.2 Series indexing and selection

In [23]:
ser1 = pd.Series([1,2,3,4],['USA','China','Japan','Germany'])
ser1

USA        1
China      2
Japan      3
Germany    4
dtype: int64

In [24]:
ser2 = pd.Series([1,2,5,4],['USA','China','Italy','Germany'])
ser2

USA        1
China      2
Italy      5
Germany    4
dtype: int64

In [25]:
# index as str
ser1['China']

2

In [27]:
# without an explicit index, the labels default to the integers 0..n-1
ser3 = pd.Series(labels)
ser3[2]

'c'

In [28]:
# Series operations align on index labels: values with matching labels are added, labels found in only one Series give NaN
ser1 + ser2

China      4.0
Germany    8.0
Italy      NaN
Japan      NaN
USA        2.0
dtype: float64
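If the `NaN` results are not wanted, one possible alternative is `Series.add` with `fill_value`, which treats a label missing from one side as the given value. The sketch below reuses the `ser1` and `ser2` definitions from the cells above; the names `plain` and `filled` are introduced here only for illustration.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], ['USA', 'China', 'Japan', 'Germany'])
ser2 = pd.Series([1, 2, 5, 4], ['USA', 'China', 'Italy', 'Germany'])

# Plain addition: labels present in only one Series give NaN
plain = ser1 + ser2

# fill_value=0 substitutes 0 for the side where the label is missing
filled = ser1.add(ser2, fill_value=0)
```

The result dtype is `float64` in both cases because the alignment step introduces `NaN`, which is a floating-point value.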

## 2. DataFrames
  
A pandas `DataFrame` is a collection of `Series` objects that share the same index.
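One way to see this is to build a small frame from a dictionary of `Series`. The names `w`, `x` and `frame` and their values are assumptions made for this sketch:

```python
import pandas as pd

index = ['A', 'B', 'C']
w = pd.Series([1.0, 2.0, 3.0], index=index)
x = pd.Series([10.0, 20.0, 30.0], index=index)

# Each dict key becomes a column name, each Series a column
frame = pd.DataFrame({'W': w, 'X': x})

# Pulling a column back out gives a Series that shares the frame's row index
col = frame['W']
same_index = col.index.equals(frame.index)
```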

In [1]:
import numpy as np
import pandas as pd
from numpy.random import randn

In [2]:
np.random.seed(101) # fix the random seed so the numbers below are reproducible

In [3]:
df = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])
df

| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [4]:
type(df)

pandas.core.frame.DataFrame

In [5]:
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [6]:
type(df['W'])

pandas.core.series.Series

In [7]:
# passing a list of column names returns a DataFrame with those columns
df[['W','Z']]

| | W | Z |
|---|---|---|
| A | 2.70685 | 0.503826 |
| B | 0.651118 | 0.605965 |
| C | -2.018168 | -0.589001 |
| D | 0.188695 | 0.955057 |
| E | 0.190794 | 0.683509 |


### * Add a column

In [8]:
df['new'] = df['W'] + df['Y']
df

| | W | X | Y | Z | new |
|---|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |


### * Drop a column

In [9]:
df.drop('new',axis=1) # returns a new DataFrame object; the original is not affected

| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [10]:
df

| | W | X | Y | Z | new |
|---|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |


In [11]:
df.drop('new',axis=1,inplace=True) # inplace=True modifies df itself and returns None
df

| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


### * Drop a row

In [12]:
df.drop('E',inplace=True) # axis=0 is the default, so this drops the row labelled 'E'
df

| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
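As an alternative to `axis` and `inplace`, `drop` also accepts the keywords `columns=` and `index=`, and the result can simply be reassigned. The following is a sketch on a small hypothetical frame (`df_demo` and its values are assumptions, not the `df` used above):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'W': [1, 2, 3], 'X': [4, 5, 6], 'new': [5, 7, 9]},
    index=['A', 'B', 'C'],
)

# columns= / index= state explicitly what is dropped, so no axis is needed
without_new = df_demo.drop(columns=['new'])
without_c = df_demo.drop(index=['C'])

# drop() returned new objects, so df_demo still has all of its columns here
original_columns = list(df_demo.columns)

# Reassigning keeps the result without using inplace=True
df_demo = df_demo.drop(columns=['new'])
```

Reassignment keeps each step visible in the code, which some readers find easier to follow than in-place mutation; either style gives the same final frame.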


### * Selecting a row

There are two ways to select a row in a `DataFrame`: **label-based** (`.loc`) and **numerical-based** (by integer position, `.iloc`).

In [13]:
df.loc['C'] # label based indexing

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

In [14]:
df.iloc[2] # numerical-based indexing

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

In [15]:
# subset with rows and columns
df.loc['B','Y']

-0.8480769834036315

In [16]:
df.loc[['B','C'],['Y','Z']]

| | Y | Z |
|---|---|---|
| B | -0.848077 | 0.605965 |
| C | 0.528813 | -0.589001 |
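The two indexers also differ in how they slice. A short sketch on a hypothetical frame (`df_demo` and its values are assumptions chosen so the results are easy to check):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'W': [1, 2, 3, 4], 'X': [5, 6, 7, 8], 'Y': [9, 10, 11, 12]},
    index=['A', 'B', 'C', 'D'],
)

by_label = df_demo.loc['B', 'Y']      # row label 'B', column label 'Y'
by_position = df_demo.iloc[1, 2]      # second row, third column

# Label slices include the end label; position slices exclude the end position
label_slice = df_demo.loc['A':'C']    # rows A, B and C
position_slice = df_demo.iloc[0:2]    # rows A and B
```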


### * Conditional selection

In [17]:
df > 0

| | W | X | Y | Z |
|---|---|---|---|---|
| A | True | True | True | True |
| B | True | False | False | True |
| C | False | True | True | False |
| D | True | False | False | True |


In [18]:
df[df>0] # keeps the shape; values that fail the condition become NaN (not commonly used)

| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | NaN | NaN | 0.605965 |
| C | NaN | 0.740122 | 0.528813 | NaN |
| D | 0.188695 | NaN | NaN | 0.955057 |


In [20]:
df[df['W']>0] # passing a boolean Series returns only the rows that satisfy the condition

| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |


In [22]:
df[df['W']>0][['X','Y']]

| | X | Y |
|---|---|---|
| A | 0.628133 | 0.907969 |
| B | -0.319318 | -0.848077 |
| D | -0.758872 | -0.933237 |


In [29]:
# multiple conditions: wrap each condition in ()
# combine them with & (and) or | (or), not the Python keywords 'and' / 'or'
df[(df['X']<0) & (df['W']>0)]

| | W | X | Y | Z |
|---|---|---|---|---|
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
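The sketch below illustrates why the Python keywords do not work, shows the element-wise "or", and gives a `.loc` form that filters rows and picks columns in one step. `df_demo` and its rounded values are assumptions made for the example.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'W': [2.7, 0.6, -2.0, 0.2],
     'X': [0.6, -0.3, 0.7, -0.8],
     'Y': [0.9, -0.8, 0.5, -0.9]},
    index=['A', 'B', 'C', 'D'],
)

# 'and' asks for the truth value of a whole Series, which pandas refuses
try:
    df_demo[(df_demo['X'] < 0) and (df_demo['W'] > 0)]
    raised = False
except ValueError:
    raised = True

# '|' is the element-wise "or"
either = df_demo[(df_demo['X'] < 0) | (df_demo['W'] < 0)]

# .loc takes the row mask and the column list together
subset = df_demo.loc[df_demo['W'] > 0, ['X', 'Y']]
```

The `.loc[mask, columns]` form selects the same data as chaining two sets of brackets, but does so in a single indexing operation.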


### * Index manipulation

In [34]:
df.reset_index() # not inplace

| | index | W | X | Y | Z |
|---|---|---|---|---|---|
| 0 | A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| 1 | B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| 2 | C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| 3 | D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |


In [35]:
newIndex = 'CA NY WY CO'.split()
df['states']=newIndex

In [36]:
df

| | W | X | Y | Z | states |
|---|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 | CA |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | NY |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | WY |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | CO |


In [38]:
# set index (not inplace; returns a new DataFrame)
df.set_index('states')

| states | W | X | Y | Z |
|---|---|---|---|---|
| CA | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| NY | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| WY | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| CO | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
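Because neither `set_index` nor `reset_index` changes the frame it is called on, the result has to be assigned to a name to be kept. A minimal sketch, with `df_demo` and its two rows assumed for illustration:

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2], 'states': ['CA', 'NY']}, index=['A', 'B'])

# set_index returns a new frame; the old index (A, B) is replaced in the result
indexed = df_demo.set_index('states')
still_has_column = 'states' in df_demo.columns   # df_demo itself is untouched

# reset_index moves the index back into an ordinary column
restored = indexed.reset_index()

# drop=True discards the old index instead of keeping it as a column
renumbered = df_demo.reset_index(drop=True)
```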


### * Multi-level index

In [39]:
import numpy as np
import pandas as pd
from numpy.random import randn

In [46]:
# Index levels
outside = 'G1 G1 G1 G2 G2 G2'.split()
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside)) # zip into tuple pairs
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index

MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

In [50]:
pd.DataFrame(randn(6,2),hier_index,['A','B'])

| | | A | B |
|---|---|---|---|
| G1 | 1 | -0.376519 | 0.230336 |
| G1 | 2 | 0.681209 | 1.035125 |
| G1 | 3 | -0.03116 | 1.939932 |
| G2 | 1 | -1.005187 | -0.74179 |
| G2 | 2 | 0.187125 | -0.732845 |
| G2 | 3 | -1.38292 | 1.482495 |


In [52]:
df = pd.DataFrame(randn(6,2),hier_index,['A','B'])
df.loc['G1'] # rows are referred to by index label through '.loc[]'; here the outer level

| | A | B |
|---|---|---|
| 1 | 0.961458 | -2.141212 |
| 2 | 0.992573 | 1.192241 |
| 3 | -1.04678 | 1.292765 |


In [55]:
df.loc['G1'].loc[2]

A    0.992573
B    1.192241
Name: 2, dtype: float64

In [57]:
df.index.names # index has no names yet

FrozenList([None, None])

In [65]:
# set index names
df.index.names = ["Groups","Numbers"]
df

| Groups | Numbers | A | B |
|---|---|---|---|
| G1 | 1 | 0.961458 | -2.141212 |
| G1 | 2 | 0.992573 | 1.192241 |
| G1 | 3 | -1.04678 | 1.292765 |
| G2 | 1 | -1.467514 | -0.494095 |
| G2 | 2 | -0.162535 | 0.485809 |
| G2 | 3 | 0.392489 | 0.221491 |


In [66]:
df.loc['G2'].loc[2]['B']

0.48580873745486103

### * Cross-section function

In [67]:
df.xs('G1')  # == df.loc['G1']

| Numbers | A | B |
|---|---|---|
| 1 | 0.961458 | -2.141212 |
| 2 | 0.992573 | 1.192241 |
| 3 | -1.04678 | 1.292765 |


In [68]:
df.xs(1,level='Numbers')  # selects on the inner level; the same selection is awkward with '.loc[]'

| Groups | A | B |
|---|---|---|
| G1 | 0.961458 | -2.141212 |
| G2 | -1.467514 | -0.494095 |
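For comparison, here is a sketch of how `.loc` can reach the same data with tuples, and why `xs` is the shorter route for an inner level. The frame `df_demo` is an assumption: it is built with `pd.MultiIndex.from_product` and integer values rather than the random numbers used above.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['G1', 'G2'], [1, 2, 3]], names=['Groups', 'Numbers']
)
df_demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=index, columns=['A', 'B'])

# A tuple addresses both levels at once instead of chaining .loc calls
row = df_demo.loc[('G1', 2)]
cell = df_demo.loc[('G2', 2), 'B']

# Selecting on the inner level with .loc needs slice(None) for the outer level
inner_loc = df_demo.loc[(slice(None), 1), :]

# xs does the same selection and drops the selected level from the result
inner_xs = df_demo.xs(1, level='Numbers')
```

Both inner-level selections return the same rows; the `.loc` result keeps both index levels, while the `xs` result is indexed by `Groups` only.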


## 3. Missing Data

In [69]:
import numpy as np
import pandas as pd

In [70]:
d ={'A':[1,2,np.nan], 'B':[5,np.nan,np.nan], 'C':[1,2,3]}
df = pd.DataFrame(d)
df

| | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | NaN | 2 |
| 2 | NaN | NaN | 3 |


### * Drop NaN

In [72]:
df.dropna(axis=1) # drop every column that contains a NaN

| | C |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |


In [73]:
df.dropna(thresh=2)  # keep only the rows with at least 2 non-NaN values

| | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | NaN | 2 |
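The other `dropna` options can be compared side by side on the same data. This sketch rebuilds the frame under the assumed name `df_demo`; the result names are also introduced only for illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

# Default (how='any'): drop a row if it has at least one NaN
any_nan_dropped = df_demo.dropna()

# how='all': drop a row only if every value in it is NaN
all_nan_dropped = df_demo.dropna(how='all')

# thresh=2: keep a row only if it has at least 2 non-NaN values
at_least_two = df_demo.dropna(thresh=2)

# subset: look for NaN only in the listed columns
by_subset = df_demo.dropna(subset=['A'])
```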


### * Fill NaN

In [74]:
df.fillna(value='Fill')

| | A | B | C |
|---|---|---|---|
| 0 | 1 | 5 | 1 |
| 1 | 2 | Fill | 2 |
| 2 | Fill | Fill | 3 |


In [76]:
df['A'].fillna(value=df['A'].mean())  # fill the NaN by the column mean

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
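The same idea extends to the whole frame: passing a `Series` or a dict to `fillna` fills each column with its own value. A sketch on the same data, with the names `df_demo`, `filled` and `custom` assumed for illustration:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

# df_demo.mean() is a Series indexed by column name,
# so each column is filled with its own mean
filled = df_demo.fillna(df_demo.mean())

# A dict gives a separate fill value per column
custom = df_demo.fillna({'A': 0, 'B': -1})
```

Like the calls above, `fillna` returns a new object here and leaves `df_demo` unchanged.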