# Pandas Basics

## 1. Series

A pandas `Series` is very similar to a NumPy array, and it is built on top of the array object. The difference is that a `Series` includes **labels**, so its values **can be indexed** by those **labels**.
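
As a minimal sketch of that difference (the names `arr`, `s`, and `underlying` are illustrative and not part of the cells below), the same values can be read by position from an array and by label from a `Series`:

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30])               # plain array: positions only
s = pd.Series(arr, index=['a', 'b', 'c'])  # same values, plus labels

first_by_position = arr[0]   # positional access on the array
first_by_label = s['a']      # label access on the Series

# The values of a Series are still available as a NumPy array.
underlying = s.to_numpy()
```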

### 1.1. Series initialization

In [10]:
import numpy as np
import pandas as pd

In [11]:
labels = ['a','b','c']
my_data = [10,20,30]
arr = np.array(my_data)
d = {'a':10,'b':20,'c':30}

In [12]:
pd.Series(data=my_data)

0    10
1    20
2    30
dtype: int64

In [13]:
# positionally, the 1st argument is data and the 2nd is index; keywords make this explicit
pd.Series(data=my_data,index=labels)

a    10
b    20
c    30
dtype: int64

In [16]:
pd.Series(labels, my_data) # arguments swapped: still valid, but the strings become the data and the numbers the index

10    a
20    b
30    c
dtype: object

In [18]:
# initialize with numpy array
pd.Series(arr,labels)

a    10
b    20
c    30
dtype: int64

In [19]:
# initialize with dictionary
pd.Series(d)

a    10
b    20
c    30
dtype: int64
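
When a dictionary is combined with an explicit `index`, the index selects and orders the keys. The sketch below assumes a label `'z'` that is not a key of `d`, chosen only to show what happens to a missing key:

```python
import pandas as pd

d = {'a': 10, 'b': 20, 'c': 30}

# Keys become the index; passing `index` selects and orders them.
# 'z' is not a key in d, so its value is NaN and the dtype becomes float64.
s = pd.Series(d, index=['c', 'a', 'z'])
```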

In [20]:
# A Series can hold arbitrary objects, e.g., functions; the dtype is then 'object'
pd.Series(data=[sum,print,len])

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

### 1.2. Series indexing and selection

In [23]:
ser1 = pd.Series([1,2,3,4],['USA','China','Japan','Germany'])
ser1

USA        1
China      2
Japan      3
Germany    4
dtype: int64

In [24]:
ser2 = pd.Series([1,2,5,4],['USA','China','Italy','Germany'])
ser2

USA        1
China      2
Italy      5
Germany    4
dtype: int64

In [25]:
# index by label (str)
ser1['China']

2

In [27]:
# index by integer: ser3 has the default 0..n-1 index, so 2 is its third label
ser3 = pd.Series(labels)
ser3[2]

'c'
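
`ser3[2]` works because `ser3` has the default integer index. On a string-labeled `Series` such as `ser1`, recent pandas releases emit a deprecation warning when a plain integer in `[]` falls back to positional access, so the explicit accessors are the safer choice. A short sketch (the names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], ['USA', 'China', 'Japan', 'Germany'])

by_label = ser1.loc['China']   # label-based lookup
by_position = ser1.iloc[1]     # position-based lookup (second element)
```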

In [28]:
# Series arithmetic aligns on the index: values with matching labels are added; a label found in only one Series gives NaN
ser1 + ser2

China      4.0
Germany    8.0
Italy      NaN
Japan      NaN
USA        2.0
dtype: float64
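
If a missing label should count as zero rather than produce `NaN`, one option is the `add` method with `fill_value`. This is a sketch of an alternative, not what the cell above does:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], ['USA', 'China', 'Japan', 'Germany'])
ser2 = pd.Series([1, 2, 5, 4], ['USA', 'China', 'Italy', 'Germany'])

# Treat a label missing from one Series as 0 instead of producing NaN.
total = ser1.add(ser2, fill_value=0)
```

Here `Italy` and `Japan` keep the value from the one `Series` that contains them.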

## 2. DataFrames
  
A pandas `DataFrame` is a collection of `Series` that share the same index.
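
To make the "collection of `Series`" idea concrete, here is a small sketch that builds a `DataFrame` from two `Series` with a common index. The names `idx`, `w`, `x`, and `df_small` and their values are made up for illustration:

```python
import pandas as pd

idx = ['A', 'B', 'C']
w = pd.Series([1.0, 2.0, 3.0], index=idx)
x = pd.Series([4.0, 5.0, 6.0], index=idx)

# Each dict key becomes a column; the shared index becomes the row labels.
df_small = pd.DataFrame({'W': w, 'X': x})

col = df_small['W']   # selecting one column gives back a Series
```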

In [67]:
import numpy as np
import pandas as pd
from numpy.random import randn

In [68]:
np.random.seed(101) # fix the seed so the random numbers are reproducible
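
The seed does not control how many numbers are drawn. It fixes the starting state of the generator, so the same calls give the same values. A quick check (the names `first`, `second`, and `same` are illustrative):

```python
import numpy as np
from numpy.random import randn

np.random.seed(101)
first = randn(5, 4)

np.random.seed(101)   # reset to the same seed
second = randn(5, 4)

same = np.array_equal(first, second)   # identical draws after reseeding
```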

In [69]:
df = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [70]:
type(df)

pandas.core.frame.DataFrame

In [71]:
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [72]:
type(df['W'])

pandas.core.series.Series

In [73]:
# pass a list of column names to get multiple columns (the result is a DataFrame)
df[['W','Z']]

|   | W | Z |
|---|---|---|
| A | 2.70685 | 0.503826 |
| B | 0.651118 | 0.605965 |
| C | -2.018168 | -0.589001 |
| D | 0.188695 | 0.955057 |
| E | 0.190794 | 0.683509 |


### * Add a column

In [79]:
df['new'] = df['W'] + df['Y']
df

|   | W | X | Y | Z | new |
|---|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
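
Assigning to `df['new']` changes `df` itself. When the original should stay untouched, `assign` is one alternative that returns a new `DataFrame`. A sketch on a small made-up frame (`df_demo` and its values are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1.0, 2.0], 'Y': [10.0, 20.0]}, index=['A', 'B'])

# assign() returns a new DataFrame with the extra column; df_demo is unchanged.
with_new = df_demo.assign(new=df_demo['W'] + df_demo['Y'])
```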


### * Drop a column

In [80]:
df.drop('new',axis=1) # returns a new DataFrame object, does not affect the original

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [81]:
df

|   | W | X | Y | Z | new |
|---|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |


In [82]:
df.drop('new',axis=1,inplace=True) # inplace=True modifies df itself; drop() then returns None
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
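
The two forms of `drop` differ in what they return, which the following sketch makes explicit on a small made-up frame (`df_demo`, `dropped`, and `result` are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2], 'new': [3, 4]}, index=['A', 'B'])

dropped = df_demo.drop('new', axis=1)               # new DataFrame, df_demo untouched
result = df_demo.drop('new', axis=1, inplace=True)  # modifies df_demo, returns None
```

Because the in-place call returns `None`, writing `df = df.drop(..., inplace=True)` would replace the frame with `None`.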


### * Drop a row

In [84]:
df.drop('E',inplace=True) # axis=0 is the default, so this drops the row labeled 'E'
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
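
Instead of remembering which `axis` means rows and which means columns, `drop` also accepts the `index` and `columns` keywords. A sketch on a made-up frame (`df_demo` and its values are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'W': [1, 2, 3], 'X': [4, 5, 6]},
    index=['A', 'B', 'C'],
)

fewer_rows = df_demo.drop(index=['B', 'C'])   # same as drop(['B', 'C'], axis=0)
fewer_cols = df_demo.drop(columns='X')        # same as drop('X', axis=1)
```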


### * Selecting a row

There are two ways to select a row in a `DataFrame`: **label-based** (`loc`) and **numerical-based** (`iloc`, by integer position).

In [88]:
df.loc['C'] # label-based indexing

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

In [89]:
df.iloc[2] # numerical-based indexing

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64
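
The two accessors also treat slices differently, which is easy to trip over. The sketch below uses a small made-up frame (`df_demo` is illustrative) to show that a `loc` slice includes its end label while an `iloc` slice excludes its end position:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'W': [1, 2, 3, 4], 'X': [5, 6, 7, 8]},
    index=['A', 'B', 'C', 'D'],
)

by_label = df_demo.loc['A':'C']   # label slice: includes the end label 'C'
by_position = df_demo.iloc[0:2]   # position slice: excludes position 2
```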

In [90]:
# subset with rows and columns
df.loc['B','Y']

-0.84807698340363147

In [93]:
df.loc[['B','C'],['Y','Z']]

|   | Y | Z |
|---|---|---|
| B | -0.848077 | 0.605965 |
| C | 0.528813 | -0.589001 |
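
The same kind of row-and-column subset can be written with integer positions through `iloc`. A sketch on a made-up frame with easy-to-read values (`df_demo` is illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
    index=['A', 'B', 'C'],
    columns=['W', 'X', 'Y', 'Z'],
)

by_label = df_demo.loc[['B', 'C'], ['Y', 'Z']]
by_position = df_demo.iloc[[1, 2], [2, 3]]   # rows at positions 1-2, columns at positions 2-3

same = by_label.equals(by_position)   # both select the same 2x2 block
```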
