# Introduction to Data Structures 

http://pandas.pydata.org/pandas-docs/stable/dsintro.html 

In [1]:
import numpy as np
import pandas as pd

## Series 

A **Series** is a one-dimensional labelled array whose values all share a single data type (integers, floats, strings, Python objects, and so on). The axis labels are collectively called the **index**.

In [4]:
# Creating a series from an ndarray
# What is an ndarray: 
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html

s = pd.Series(np.random.randn(5), index=['a','b','c','d','e'])
print(s)

a   -0.724282
b   -0.254142
c    0.657918
d    0.082987
e   -1.304031
dtype: float64


In [5]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [7]:
# Without an explicit index, pandas uses the default integer index 0..n-1
pd.Series(np.random.randn(10))

0    0.648510
1   -0.246504
2   -2.382480
3   -0.849500
4    3.401260
5   -0.479616
6   -0.841827
7   -0.496669
8   -0.460854
9   -1.389307
dtype: float64

In [8]:
# Creating a series from a dict

d = {'a':0., 'b':1., 'c':2.}
pd.Series(d)

a    0.0
b    1.0
c    2.0
dtype: float64

In [10]:
# Passing `index` reorders the entries; a label with no matching
# key in the dict ('d') gets NaN

pd.Series(d, index=['b','c','d','a'])

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

In [11]:
# Creating a series from a single scalar using broadcasting

pd.Series(5., index=['a','b','c','d','e'])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

In [12]:
# Using numpy ndarray operations and slicing

s

a   -0.724282
b   -0.254142
c    0.657918
d    0.082987
e   -1.304031
dtype: float64

### Series are ndarray-like

In [13]:
s[0]

-0.72428242430317336

In [14]:
s[:3]

a   -0.724282
b   -0.254142
c    0.657918
dtype: float64

In [15]:
s[s > s.median()]

c    0.657918
d    0.082987
dtype: float64

In [16]:
# Subsetting by passing a LIST of integer positions
# (recent pandas versions prefer the explicit form s.iloc[[4, 3, 1]])

s[[4,3,1]]

e   -1.304031
d    0.082987
b   -0.254142
dtype: float64

In [17]:
# NumPy functions are applied element-wise and return a Series

np.exp(s)

a    0.484672
b    0.775582
c    1.930769
d    1.086527
e    0.271435
dtype: float64

### Series are also dict-like

In [20]:
s

a   -0.724282
b   -0.254142
c    0.657918
d    0.082987
e   -1.304031
dtype: float64

In [21]:
s['a']

-0.72428242430317336

In [22]:
s['e']

-1.3040313881629566

In [23]:
'e' in s

True

In [24]:
'f' in s

False

So there are two ways to access the data in a Series:

1. By label, passing an index value such as `'a'`
2. By position, passing an integer such as `0`
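
The two access styles listed above can also be spelled out explicitly with the `.loc` (label) and `.iloc` (integer position) accessors. The following is a minimal sketch; the Series `demo` and its values are made up for illustration and are not the random `s` used in this notebook.

```python
import pandas as pd

# Hypothetical Series with fixed values so the results are predictable
demo = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])

by_label = demo.loc['b']      # label-based lookup
by_position = demo.iloc[1]    # position-based lookup (0-indexed)

print(by_label, by_position)  # both refer to the same element
```

Using the accessors makes the intent unambiguous, which matters most when the index labels are themselves integers.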

In [26]:
s[0]

-0.72428242430317336

In [27]:
s['a']

-0.72428242430317336

## DataFrames

A **DataFrame** is a 2-dimensional labelled data structure whose columns can have different types. It can also be thought of as a dict of Series that share an index.

### DataFrame from dict of Series (or dict of dicts)

In [36]:
d = {'one':pd.Series(np.random.randn(5), index=['a','b','c','d','e']),
     'two':pd.Series(np.random.randn(3), index=['a','b','x'])
}


df = pd.DataFrame(d)

In [45]:
df

|   | one | two |
|:--|:----|:----|
| a | 0.791277 | 0.356481 |
| b | -0.479026 | 0.036603 |
| c | 0.624419 | NaN |
| d | 0.885753 | NaN |
| e | -0.679874 | NaN |
| x | NaN | -0.176253 |


The index is the union of the indices of the Series. 
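
One way to see the union rule at work is to compute the union of the two indices by hand and compare it with the index of the resulting DataFrame. This is only a sketch with made-up values; `left` and `right` are illustrative names, not objects from this notebook.

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
right = pd.Series([10.0, 20.0], index=['a', 'x'])

frame = pd.DataFrame({'one': left, 'two': right})

# Union of the two indices, computed directly
expected_index = left.index.union(right.index)

# Compare after sorting so the check does not depend on label order
same_labels = frame.index.sort_values().equals(expected_index.sort_values())
print(same_labels)

# Labels that one Series lacks show up as NaN in its column
print(frame.isna().sum())
```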

In [42]:
pd.DataFrame(d, index=['a','b','c'])

|   | one | two |
|:--|:----|:----|
| a | 0.791277 | 0.356481 |
| b | -0.479026 | 0.036603 |
| c | 0.624419 | NaN |


In [44]:
pd.DataFrame(d, index=['a', 'Blue', 'Yellow'], columns = ['one', 'three'])

|   | one | three |
|:--|:----|:------|
| a | 0.791277 | NaN |
| Blue | NaN | NaN |
| Yellow | NaN | NaN |


In [48]:
# Another example 

s1 = {'a': 10, 'b': 20, 'c': 30}
s2 = {'John': 1.1, 'Paul': 2.2, 'George': 3.3, 'Ringo':4.4}

d2 = {'one': s1, 'two': s2}
df2 = pd.DataFrame(d2)
df2

|   | one | two |
|:--|:----|:----|
| George | NaN | 3.3 |
| John | NaN | 1.1 |
| Paul | NaN | 2.2 |
| Ringo | NaN | 4.4 |
| a | 10.0 | NaN |
| b | 20.0 | NaN |
| c | 30.0 | NaN |


In [50]:
s3 = {'a': 3.1, 'b':4.5}
df3 = pd.DataFrame({'s1': s1, 's3':s3})
df3

|   | s1 | s3 |
|:--|:---|:---|
| a | 10 | 3.1 |
| b | 20 | 4.5 |
| c | 30 | NaN |


In [52]:
# Creates a new DataFrame from df3, keeping the existing row label 'a'
# and adding a new label 'x', which has no data (all NaN)
pd.DataFrame(df3, index=['a', 'x'])

|   | s1 | s3 |
|:--|:---|:---|
| a | 10.0 | 3.1 |
| x | NaN | NaN |


In [55]:
# New DataFrame using the contents of df3:
# two existing row labels plus a new one with no data ('z'),
# one existing column plus a new one with no data ('blue')

pd.DataFrame(df3, index=['b','c','z'], columns=['blue', 's3'])

|   | blue | s3 |
|:--|:-----|:---|
| b | NaN | 4.5 |
| c | NaN | NaN |
| z | NaN | NaN |


In [57]:
df.columns

Index(['one', 'two'], dtype='object')

In [58]:
df.index

Index(['a', 'b', 'c', 'd', 'e', 'x'], dtype='object')

### DataFrame from dict of ndarrays or lists 

In [59]:
d = {'one':[1.,2.,3.,4.], 'two':[4., 3., 2., 1]}
pd.DataFrame(d)

|   | one | two |
|:--|:----|:----|
| 0 | 1.0 | 4.0 |
| 1 | 2.0 | 3.0 |
| 2 | 3.0 | 2.0 |
| 3 | 4.0 | 1.0 |


In [62]:
pd.DataFrame(d, index=['a','b','c','d'])

|   | one | two |
|:--|:----|:----|
| a | 1.0 | 4.0 |
| b | 2.0 | 3.0 |
| c | 3.0 | 2.0 |
| d | 4.0 | 1.0 |


So, in dict-of-lists form:

- The keys of the dict are the column labels
- Each list holds the values of that column, one per row
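
A small sketch of the rule above, using made-up column names: each key becomes a column, and because every list supplies one value per row, the lists need to have the same length. As far as I know, pandas raises a `ValueError` when they do not.

```python
import pandas as pd

columns = {'height': [1.0, 2.0, 3.0], 'width': [4.0, 5.0, 6.0]}
frame = pd.DataFrame(columns)

print(list(frame.columns))   # keys of the dict
print(frame.shape)           # (rows, columns)

# Lists of unequal length cannot be lined up into rows
try:
    pd.DataFrame({'height': [1.0, 2.0], 'width': [4.0, 5.0, 6.0]})
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```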

### DataFrame from list of dicts

In [63]:
data2 = [{'a':1, 'b':2}, {'a':5, 'b':10, 'c': 20}]
pd.DataFrame(data2)

|   | a | b | c |
|:--|:--|:--|:--|
| 0 | 1 | 2 | NaN |
| 1 | 5 | 10 | 20.0 |


In list-of-dicts form, each dict is one row: its keys are the column labels and its values are that row's entries. A key missing from a dict (here `c` in the first one) becomes NaN.
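
To make the row-per-dict reading concrete, here is a minimal sketch that labels the rows and pulls one back out. The row labels `'first'` and `'second'` are illustrative assumptions.

```python
import pandas as pd

records = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

# `index` supplies row labels; without it the rows are numbered 0, 1, ...
frame = pd.DataFrame(records, index=['first', 'second'])

print(frame.loc['second'])          # the second dict, as a row
print(frame['c'].isna().tolist())   # 'c' is missing from the first dict
```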

### Column selection, addition, and deletion

In [90]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
df

|   | one | two |
|:--|:----|:----|
| a | 1.0 | 1.0 |
| b | 2.0 | 2.0 |
| c | 3.0 | 3.0 |
| d | NaN | 4.0 |


In [67]:
# Selects the column

df['one']

a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

In [68]:
df['two']

a    1.0
b    2.0
c    3.0
d    4.0
Name: two, dtype: float64

In [70]:
df['three'] = df['one'] * df['two']
df

|   | one | two | three |
|:--|:----|:----|:------|
| a | 1.0 | 1.0 | 1.0 |
| b | 2.0 | 2.0 | 4.0 |
| c | 3.0 | 3.0 | 9.0 |
| d | NaN | 4.0 | NaN |


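In [92]:
# Reproduce the state shown in the output below: drop 'three'
# and add a constant column before adding the boolean column
del df['three']
df['foo'] = 'bar'
df['flag'] = df['one'] > 2
df

|   | one | two | foo | flag |
|:--|:----|:----|:----|:-----|
| a | 1.0 | 1.0 | bar | False |
| b | 2.0 | 2.0 | bar | False |
| c | 3.0 | 3.0 | bar | True |
| d | NaN | 4.0 | bar | False |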


So `df['name']` selects the **column** `name` from `df`. This differs from a Series, where the same syntax selects a single element by its index label:

In [80]:
s = pd.Series({'x': 5, 'y': 10, 'z': 20})
s['x']

5

In [93]:
# Assigning a scalar fills every row of the column with that value
df['foo'] = 'bar'
df

|   | one | two | foo | flag |
|:--|:----|:----|:----|:-----|
| a | 1.0 | 1.0 | bar | False |
| b | 2.0 | 2.0 | bar | False |
| c | 3.0 | 3.0 | bar | True |
| d | NaN | 4.0 | bar | False |
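
The heading of this section also mentions deletion. Below is a minimal sketch using `del` and `DataFrame.pop`, both of which remove columns; the frame is rebuilt here with fixed values (an assumption for illustration) so the example stands on its own.

```python
import pandas as pd

frame = pd.DataFrame({'one': [1.0, 2.0, 3.0],
                      'two': [4.0, 5.0, 6.0],
                      'foo': 'bar'},
                     index=['a', 'b', 'c'])

del frame['two']            # remove a column in place
foo = frame.pop('foo')      # remove a column and get it back as a Series

print(list(frame.columns))
print(foo.tolist())
```

`pop` is handy when the removed column is still needed, for example to use it as a separate Series.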


To get a **row**, use `loc` (by label) or `iloc` (by integer position).

### Row selection

In [94]:
df

|   | one | two | foo | flag |
|:--|:----|:----|:----|:-----|
| a | 1.0 | 1.0 | bar | False |
| b | 2.0 | 2.0 | bar | False |
| c | 3.0 | 3.0 | bar | True |
| d | NaN | 4.0 | bar | False |


In [95]:
df.loc['a']

one         1
two         1
foo       bar
flag    False
Name: a, dtype: object

In [96]:
df.iloc[2]

one        3
two        3
foo      bar
flag    True
Name: c, dtype: object

However, plain slicing (without `loc` or `iloc`) also selects rows:

In [97]:
df[2:4]

|   | one | two | foo | flag |
|:--|:----|:----|:----|:-----|
| c | 3.0 | 3.0 | bar | True |
| d | NaN | 4.0 | bar | False |


In [98]:
df[:-1]

|   | one | two | foo | flag |
|:--|:----|:----|:----|:-----|
| a | 1.0 | 1.0 | bar | False |
| b | 2.0 | 2.0 | bar | False |
| c | 3.0 | 3.0 | bar | True |


...or by passing a boolean vector with one entry per row:

In [100]:
df[[True,False,True,True]]

|   | one | two | foo | flag |
|:--|:----|:----|:----|:-----|
| a | 1.0 | 1.0 | bar | False |
| c | 3.0 | 3.0 | bar | True |
| d | NaN | 4.0 | bar | False |
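
In practice the boolean vector usually comes from a comparison on a column rather than a hand-written list. A sketch of that pattern, using a small frame with fixed values (an assumption for illustration, not the `df` above):

```python
import pandas as pd

frame = pd.DataFrame({'one': [1.0, 2.0, 3.0, None],
                      'two': [1.0, 2.0, 3.0, 4.0]},
                     index=['a', 'b', 'c', 'd'])

mask = frame['two'] > 2      # boolean Series aligned with the rows
selected = frame[mask]       # keeps only the rows where mask is True

print(mask.tolist())
print(selected.index.tolist())
```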


From the documentation: 

| Operation | Syntax | Result | 
|:----------|:-------|:-------|
|Select column | `df['colname']` | Series | 
|Select row by label | `df.loc['rowname']` | Series |
|Select row by integer location | `df.iloc[int]` | Series | 
|Slice rows | `df[a:b]` | DataFrame | 
|Select rows by boolean vector | `df[boolvect]` | DataFrame| 
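
The result types in the table can be checked directly. The following sketch applies each operation to a small frame with made-up values and prints the type of what comes back.

```python
import pandas as pd

frame = pd.DataFrame({'one': [1.0, 2.0, 3.0],
                      'two': [4.0, 5.0, 6.0]},
                     index=['a', 'b', 'c'])

results = {
    "frame['one']": frame['one'],                        # select column
    "frame.loc['a']": frame.loc['a'],                    # row by label
    "frame.iloc[0]": frame.iloc[0],                      # row by position
    "frame[0:2]": frame[0:2],                            # slice rows
    "frame[frame['one'] > 1]": frame[frame['one'] > 1],  # boolean vector
}

for expression, value in results.items():
    print(expression, '->', type(value).__name__)
```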