
# Pandas intro

* High-level API centered on the DataFrame
* Think of a DataFrame as a programmable Excel sheet
* Used to analyze and preprocess data
* Built on NumPy, but for some tasks also a higher-level alternative to it

https://pandas.pydata.org/

In [1]:
import pandas as pd
pd.__version__

'0.24.2'

In [2]:
import numpy as np
np.__version__

'1.16.4'

## Series

In [0]:
# a Series holds one-dimensional data, here created from a list
s1 = pd.Series([10,20,30])

In [4]:
type(s1)

pandas.core.series.Series

In [5]:
# each entry has an index label; if none is specified, entries are numbered consecutively from 0
s1

0    10
1    20
2    30
dtype: int64

In [6]:
# you can also specify the index yourself, either as a dict of label/value pairs
s2 = pd.Series({'a' : 10, 'b' : 20, 'c' : 30})
s2

a    10
b    20
c    30
dtype: int64

In [7]:
# or as an additional parameter
s3 = pd.Series([10,20,30], index=['a', 'b', 'c'])
s3

a    10
b    20
c    30
dtype: int64

In [8]:
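# []-operator gets either by position
# (newer pandas versions deprecate positional access via [] on a labeled Series; s3.iloc[0] is the explicit form)
s3[0]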

10

In [9]:
# or by index label
s3['a']

10

In [10]:
# you can also make that explicit: loc selects by label, iloc by position
s3.loc['a']

10

In [11]:
s3.iloc[0]

10
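
The difference between `loc` and `iloc` is easiest to see when the labels are themselves integers. The following is a minimal sketch; the Series `s_int` is a hypothetical example that does not appear in the notebook.

```python
import pandas as pd

# hypothetical Series whose integer labels do not match the positions
s_int = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s_int.loc[0]      # the entry labeled 0, which is stored last
by_position = s_int.iloc[0]  # the entry at position 0, which is stored first

print(by_label, by_position)
```

With labels like these, spelling out `loc` or `iloc` avoids any doubt about which lookup is meant.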

In [12]:
# you can select several entries at once by passing a list of labels
s3[['a', 'c']]

a    10
c    30
dtype: int64

In [15]:
# or a list of positions
s3[[0, 2]]

a    10
c    30
dtype: int64

In [19]:
# positional ranges are written with a colon; start is inclusive, end is exclusive
s3[0:2]

a    10
b    20
dtype: int64

In [20]:
# start 0 is implicit and can be left out
s3[:2]

a    10
b    20
dtype: int64

In [21]:
# end is optional, gets you all values from start
s3[1:]

b    20
c    30
dtype: int64
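
The end-exclusive rule applies to positional slices. Slicing by label through `loc` includes the end label instead. A small sketch of the contrast, rebuilding `s3` so the cell runs on its own:

```python
import pandas as pd

s3 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# positional slice: positions 0 and 1, the end position 2 is excluded
positional = s3.iloc[0:2]

# label slice: from 'a' up to and including 'c'
by_label = s3.loc['a':'c']

print(positional)
print(by_label)
```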

In [22]:
# turn into numpy array
s3.values

array([10, 20, 30])
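
Since pandas 0.24 there is also a `to_numpy()` method, which the pandas documentation recommends over `.values`. A short sketch, rebuilding `s3` so the cell is self-contained:

```python
import numpy as np
import pandas as pd

s3 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

arr = s3.to_numpy()   # plain NumPy array, the index labels are gone
doubled = arr * 2     # from here on, regular NumPy operations apply

print(type(arr), doubled)
```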

### Advanced Series

In [24]:
# you can pass a function that takes the Series and returns a boolean mask,
# which lets you make selections as complex as you want
s3[lambda s: s >= 20]

b    20
c    30
dtype: int64
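
The function passed to `[]` receives the whole Series and returns a Series of booleans, a so-called mask. You can also build such a mask directly, which may be easier to read. A minimal sketch:

```python
import pandas as pd

s3 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# comparing a Series with a value gives one boolean per entry
mask = s3 >= 20
selected = s3[mask]

# conditions are combined with & (and) and | (or); the parentheses are required
middle = s3[(s3 > 10) & (s3 < 30)]

print(mask)
print(selected)
print(middle)
```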

In [25]:
# if you want to understand how this is possible:
# Python allows operator overloading, which Series and DataFrame make use of
# in this case the []-operator is overloaded via __getitem__
# http://stackoverflow.com/questions/1957780/how-to-override-operator

# here is a very simple example of how to do this
class MyClass:
    def __getitem__(self, key):
        return key * 2
myobj = MyClass()
myobj[3]

6
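
To get a feeling for how one `[]`-operator can accept labels, lists, and functions, here is a toy sketch. `TinySeries` is a hypothetical class made up for illustration and is not how pandas is implemented. Unlike pandas, it calls the function once per value rather than once with the whole Series.

```python
class TinySeries:
    """Hypothetical toy container that dispatches on the type of the key."""

    def __init__(self, data):
        self.data = dict(data)

    def __getitem__(self, key):
        if callable(key):
            # keep the entries for which the function returns True
            return TinySeries({k: v for k, v in self.data.items() if key(v)})
        if isinstance(key, list):
            # a list of labels selects several entries
            return TinySeries({k: self.data[k] for k in key})
        # anything else is treated as a single label
        return self.data[key]


toy = TinySeries({'a': 10, 'b': 20, 'c': 30})
single = toy['a']
several = toy[['a', 'c']].data
filtered = toy[lambda v: v >= 20].data

print(single, several, filtered)
```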

## DataFrames

In [0]:
# a DataFrame consists of Series as its columns; typically, but not necessarily, they have the same index
# (Series with differing indices are aligned, and missing entries become NaN)

df1 = pd.DataFrame(
    {'one': pd.Series([10,20,30], index=['a', 'b', 'c']),
     'two': pd.Series([100,200,300], index=['a', 'b', 'c'])
    })
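
What happens when the Series do not share the same index can be sketched as follows. The frame `df_mixed` and its values are hypothetical and only serve as illustration.

```python
import pandas as pd

# hypothetical example: the two Series only overlap on the labels 'b' and 'c'
df_mixed = pd.DataFrame({
    'one': pd.Series([10, 20, 30], index=['a', 'b', 'c']),
    'two': pd.Series([200, 300, 400], index=['b', 'c', 'd']),
})

# the resulting index contains all labels; entries without a value are NaN
print(df_mixed)
print(df_mixed.isna().sum())
```

Because NaN is a floating point value, both columns are stored as floats in this case.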

In [27]:
type(df1)

pandas.core.frame.DataFrame

In [32]:
# just the first few rows (all in our case, because we do not have that many)
df1.head()

   one  two
a   10  100
b   20  200
c   30  300


In [33]:
df1.describe()

        one    two
count   3.0    3.0
mean   20.0  200.0
std    10.0  100.0
min    10.0  100.0
25%    15.0  150.0
50%    20.0  200.0
75%    25.0  250.0
max    30.0  300.0
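
The rows of `describe()` correspond to methods you can also call on a single column. A small sketch that recomputes a few of them for column `one`, rebuilding `df1` so the cell runs on its own:

```python
import pandas as pd

df1 = pd.DataFrame(
    {'one': pd.Series([10, 20, 30], index=['a', 'b', 'c']),
     'two': pd.Series([100, 200, 300], index=['a', 'b', 'c'])
    })

col = df1['one']
mean_one = col.mean()         # the "mean" row
std_one = col.std()           # the "std" row, the sample standard deviation
q25_one = col.quantile(0.25)  # the "25%" row

summary = df1.describe()
print(mean_one, std_one, q25_one)
print(summary.loc['mean', 'two'])
```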


In [0]:
df1.describe?

In [0]:
# columns can be accessed by their labels; each column is a Series

s4 = df1['one']

In [37]:
type(s4)

pandas.core.series.Series

In [38]:
s4['a']

10

In [39]:
df1['one']['a']

10
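
Chaining two `[]` lookups works for reading, but `loc` and `at` can address a row and a column in a single step. A minimal sketch, rebuilding `df1` so the cell is self-contained:

```python
import pandas as pd

df1 = pd.DataFrame(
    {'one': pd.Series([10, 20, 30], index=['a', 'b', 'c']),
     'two': pd.Series([100, 200, 300], index=['a', 'b', 'c'])
    })

chained = df1['one']['a']      # first the column, then the row
via_loc = df1.loc['a', 'one']  # row label first, then column label
via_at = df1.at['a', 'one']    # like loc, but for a single value only
row = df1.loc['a']             # a whole row comes back as a Series

print(chained, via_loc, via_at)
print(row)
```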

In [0]:
# a list of column labels selects several columns and returns a DataFrame
df1[['one', 'two']]

   one  two
a   10  100
b   20  200
c   30  300


In [40]:
# turn into numpy array (loses all names and indices)
df1.values

array([[ 10, 100],
       [ 20, 200],
       [ 30, 300]])

## Exercise: Load a dataset and experiment with it

* load any of your own data as a CSV using `read_csv`
* if you do not have any, you can use https://raw.githubusercontent.com/DJCordhose/ml-workshop/master/data/iris_dirty.csv
* make sure it was imported properly
* use the `header`, `encoding`, and `names` parameters if necessary
* show the first few entries and compute descriptive statistics
* do anything you can think of to get to know the data and develop a feeling for it

In [0]:
# note the header, encoding, and potentially names parameters
pd.read_csv?
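
If you want a starting point, the following sketch shows the `header` and `names` parameters on a small inline CSV. The text in `raw` and the column names are hypothetical stand-ins, not the contents of the iris file. For a real file you would pass a path or URL instead of the `StringIO` object, and possibly an `encoding` such as `'utf-8'`.

```python
import io
import pandas as pd

# hypothetical CSV text without a header row and with one missing value
raw = "5.1,3.5,setosa\n4.9,,setosa\n6.3,3.3,virginica\n"

df = pd.read_csv(
    io.StringIO(raw),                      # stands in for a path or URL
    header=None,                           # the data has no header row
    names=['length', 'width', 'species'],  # so we supply column names ourselves
)

print(df.head())                       # first few entries
print(df.describe())                   # descriptive statistics of numeric columns
print(df.isna().sum())                 # missing values per column
print(df['species'].value_counts())    # how often each category occurs
```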