# Pandas
* Install: `$ conda install pandas`
* Collection of tools and methods for data processing
* Fundamental data structure: two-dimensional (rows and columns) tabular data (`DataFrame`)
* Important related type: `Series`, one-dimensional data with an index

# Series I
* One-dimensional data
* Data is labeled with an index
* A Series consists of (index, data) pairs

In [1]:
import pandas as pd
a = pd.Series([4, 2, 7, 8, 4, 4])
print(a)

0    4
1    2
2    7
3    8
4    4
5    4
dtype: int64


In [4]:
print(a[2])

7


In [5]:
print(a*2 + 4)

0    12
1     8
2    18
3    20
4    12
5    12
dtype: int64


In [6]:
print(a.unique())

[4 2 7 8]


In [8]:
a.isin([2, 4])

0     True
1     True
2    False
3    False
4     True
5     True
dtype: bool
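
The boolean result of `isin` can be used directly as a mask to filter the Series. A minimal sketch; the names `mask`, `selected` and `counts` are chosen here for illustration only:

```python
import pandas as pd

a = pd.Series([4, 2, 7, 8, 4, 4])

# Boolean mask: True where the value is 2 or 4
mask = a.isin([2, 4])

# Keep only the entries where the mask is True (index labels are preserved)
selected = a[mask]

# Count how often each distinct value occurs
counts = a.value_counts()

print(selected)
print(counts)
```

Note that the filtered Series keeps the original index labels (0, 1, 4, 5) rather than being renumbered.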

# Series II
* Indexing with non-integer labels

In [9]:
a = pd.Series([4, 2, 7, 8], index=['Spring', 'Summer', 'Autumn', 'Winter'])
a

Spring    4
Summer    2
Autumn    7
Winter    8
dtype: int64

In [10]:
a['Winter']

8

### Mini-assignment
* Modify the time series
$$(\text{Spring}, 4 ), (\text{Summer}, 2 ), (\text{Autumn}, 7), (\text{Winter}, 8)$$
such that all data is increased by $10$%
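
One possible approach to the assignment, sketched under the assumption that a new Series (here called `increased`, a name chosen for illustration) is acceptable instead of modifying `a` in place:

```python
import pandas as pd

a = pd.Series([4, 2, 7, 8], index=['Spring', 'Summer', 'Autumn', 'Winter'])

# Arithmetic is applied element-wise; the index labels are kept
increased = a * 1.10

print(increased)
```

Multiplying an integer Series by a float yields a Series with dtype `float64`.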

# Date/Time series
* Extensive support for date-/time-type series of data
* Convert times/dates from other external formats
* Create times/dates as `DatetimeIndex`

In [12]:
dti = pd.to_datetime(['Jan 2018', 'Feb 2018', 'Mar 2018', 'Apr 2018', 'May 2018', 'Jun 2018', 'Jul 2018', 'Aug 2018', 'Sep 2018', 'Oct 2018', 'Nov 2018', 'Dec 2018'])
dti

DatetimeIndex(['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01',
               '2018-05-01', '2018-06-01', '2018-07-01', '2018-08-01',
               '2018-09-01', '2018-10-01', '2018-11-01', '2018-12-01'],
              dtype='datetime64[ns]', freq=None)

In [13]:
temps_2018 = pd.Series(data=[2.3, -.7, .3, 8.4, 15.0, 16.5, 19.2, 17.5, 14.1, 10.3, 5.9, 4.3], index=dti)
temps_2018

2018-01-01     2.3
2018-02-01    -0.7
2018-03-01     0.3
2018-04-01     8.4
2018-05-01    15.0
2018-06-01    16.5
2018-07-01    19.2
2018-08-01    17.5
2018-09-01    14.1
2018-10-01    10.3
2018-11-01     5.9
2018-12-01     4.3
dtype: float64

In [14]:
dti = pd.date_range(start='Jan 2018', end='Dec 2018', freq='MS')
dti

DatetimeIndex(['2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01',
               '2018-05-01', '2018-06-01', '2018-07-01', '2018-08-01',
               '2018-09-01', '2018-10-01', '2018-11-01', '2018-12-01'],
              dtype='datetime64[ns]', freq='MS')

`DatetimeIndex` is the standard object type for storing date/time series. Often used as index for `Series`.
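
With a `DatetimeIndex` in place, a Series can be sliced by date strings and aggregated over time periods. A small sketch using the temperature data from above; the names `summer` and `quarterly` are illustrative:

```python
import pandas as pd

dti = pd.date_range(start='Jan 2018', end='Dec 2018', freq='MS')
temps_2018 = pd.Series(
    data=[2.3, -.7, .3, 8.4, 15.0, 16.5, 19.2, 17.5, 14.1, 10.3, 5.9, 4.3],
    index=dti)

# Label-based slicing with date strings; both end points are included
summer = temps_2018.loc['2018-06':'2018-08']

# Mean temperature per quarter ('QS' = quarter start)
quarterly = temps_2018.resample('QS').mean()

print(summer)
print(quarterly)
```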

# DataFrame
1. Forming
2. Pretty-printing
3. Slicing columns and selecting

In [10]:
data = [['Henrik Pedersen', 42, 'Associate Professor'], 
        ['Susanne Smed', 49, 'Professor'], 
        ['Kirstine Bak', 27, 'ph.d.-student'],
        ['Hans Hansen', 55, 'Professor']]
a = pd.DataFrame(data, columns=['Name', 'Age', 'Category'])

In [11]:
a

              Name  Age             Category
0  Henrik Pedersen   42  Associate Professor
1     Susanne Smed   49            Professor
2     Kirstine Bak   27        ph.d.-student
3      Hans Hansen   55            Professor


In [17]:
a.shape

(4, 3)

In [18]:
a['Name']  #or a.Name

0    Henrik Pedersen
1       Susanne Smed
2       Kirstine Bak
3        Hans Hansen
Name: Name, dtype: object

## String methods
* String columns can be manipulated like Python strings via the `.str` accessor

In [8]:
a['Name'].str.upper()

0    HENRIK PEDERSEN
1       SUSANNE SMED
2       KIRSTINE BAK
3        HANS HANSEN
Name: Name, dtype: object

## Regex

In [19]:
a['Name'].str.split(' ')

0    [Henrik, Pedersen]
1       [Susanne, Smed]
2       [Kirstine, Bak]
3        [Hans, Hansen]
Name: Name, dtype: object

In [31]:
# Raw string so that \w and \s reach the regex engine unchanged
a['Name'].str.extract(r'(\w+)\s*')

          0
0    Henrik
1   Susanne
2  Kirstine
3      Hans
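
Both `str.split` and `str.extract` can return their parts as separate columns. A short sketch; the column names `First` and `Last` are hypothetical and not part of the original data:

```python
import pandas as pd

names = pd.Series(['Henrik Pedersen', 'Susanne Smed', 'Kirstine Bak', 'Hans Hansen'])

# expand=True returns a DataFrame with one column per part;
# n=1 splits at the first space only
parts = names.str.split(' ', n=1, expand=True)
parts.columns = ['First', 'Last']   # hypothetical column names

# A named group in the regex becomes the column name of the result
first = names.str.extract(r'(?P<First>\w+)\s')

print(parts)
print(first)
```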


## Slicing rows and selecting columns

In [19]:
a[2:4]

           Name  Age       Category
2  Kirstine Bak   27  ph.d.-student
3   Hans Hansen   55      Professor


In [20]:
a.loc[2:4, 'Age']   # a single .loc call on a; label slices include the end label

2    27
3    55
Name: Age, dtype: int64

In [21]:
a[2:4]['Age']    # chained indexing: this is a view or a copy - we don't know

2    27
3    55
Name: Age, dtype: int64

## Difference between a view and a copy

In [12]:
a[2:4]['Age'] += 1
a

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


              Name  Age             Category
0  Henrik Pedersen   42  Associate Professor
1     Susanne Smed   49            Professor
2     Kirstine Bak   28        ph.d.-student
3      Hans Hansen   56            Professor


In [23]:
a.loc[2:4, 'Age'] += 1
a

              Name  Age             Category
0  Henrik Pedersen   42  Associate Professor
1     Susanne Smed   49            Professor
2     Kirstine Bak   29        ph.d.-student
3      Hans Hansen   57            Professor
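
When a subset should be modified without touching the original, an explicit `.copy()` removes the ambiguity. A minimal sketch, assuming a reduced version of the staff table; `subset` is an illustrative name:

```python
import pandas as pd

a = pd.DataFrame({'Name': ['Henrik Pedersen', 'Susanne Smed',
                           'Kirstine Bak', 'Hans Hansen'],
                  'Age': [42, 49, 27, 55]})

# Explicit copy: changes to `subset` never reach `a`
subset = a.loc[2:3].copy()
subset['Age'] += 1

# A single .loc call on `a` itself: this does modify `a`
a.loc[0:1, 'Age'] += 1

print(a)
print(subset)
```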


## Manipulation and arithmetic

In [24]:
a['Age'] += 1

In [25]:
a['Compensation'] = [44000, 66000, 30000, 65000]

In [26]:
a

              Name  Age             Category  Compensation
0  Henrik Pedersen   43  Associate Professor         44000
1     Susanne Smed   50            Professor         66000
2     Kirstine Bak   30        ph.d.-student         30000
3      Hans Hansen   58            Professor         65000


## Membership and comparison

In [27]:
a.Category.isin(['Professor'])

0    False
1     True
2    False
3     True
Name: Category, dtype: bool

In [28]:
a[a.Category.isin(['Professor'])]

           Name  Age   Category  Compensation
1  Susanne Smed   50  Professor         66000
3   Hans Hansen   58  Professor         65000


In [29]:
a.Compensation > 50000

0    False
1     True
2    False
3     True
Name: Compensation, dtype: bool
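
Several boolean conditions can be combined into one mask. A sketch on the same staff data; the thresholds and the result names are chosen for illustration:

```python
import pandas as pd

a = pd.DataFrame({'Name': ['Henrik Pedersen', 'Susanne Smed',
                           'Kirstine Bak', 'Hans Hansen'],
                  'Age': [43, 50, 30, 58],
                  'Category': ['Associate Professor', 'Professor',
                               'ph.d.-student', 'Professor'],
                  'Compensation': [44000, 66000, 30000, 65000]})

# Each condition needs its own parentheses; use & (and), | (or), ~ (not)
well_paid_professors = a[(a.Category == 'Professor') & (a.Compensation > 65000)]
young_or_low = a[(a.Age < 35) | (a.Compensation < 45000)]

print(well_paid_professors)
print(young_or_low)
```

The Python keywords `and` and `or` do not work element-wise on Series, which is why the operators `&` and `|` are used.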

In [30]:
del a['Compensation']
a

              Name  Age             Category
0  Henrik Pedersen   43  Associate Professor
1     Susanne Smed   50            Professor
2     Kirstine Bak   30        ph.d.-student
3      Hans Hansen   58            Professor


## Index functionalities

In [31]:
a.reindex([1,3,0,2])

              Name  Age             Category
1     Susanne Smed   50            Professor
3      Hans Hansen   58            Professor
0  Henrik Pedersen   43  Associate Professor
2     Kirstine Bak   30        ph.d.-student


In [32]:
a.sort_values('Age')

              Name  Age             Category
2     Kirstine Bak   30        ph.d.-student
0  Henrik Pedersen   43  Associate Professor
1     Susanne Smed   50            Professor
3      Hans Hansen   58            Professor


## Mini-assignment
* Give all with a compensation less than 50000 a raise of 1000 (Hint: use loc)

In [34]:
a['Compensation'] = [44000, 66000, 30000, 65000]
i = a['Compensation'] < 50000
a.loc[i, 'Compensation'] += 1000

In [35]:
a

              Name  Age             Category  Compensation
0  Henrik Pedersen   43  Associate Professor         45000
1     Susanne Smed   50            Professor         66000
2     Kirstine Bak   30        ph.d.-student         31000
3      Hans Hansen   58            Professor         65000


# Statistics
* First-order

In [36]:
a.mean(numeric_only=True)  # skip the string columns (required in pandas 2.0 and later)

Age                45.25
Compensation    51750.00
dtype: float64

In [37]:
a.Age.mean()

45.25

In [38]:
a.median(numeric_only=True)

Age                46.5
Compensation    55000.0
dtype: float64

## Second-order statistics
Pearson correlation coefficient
$$ r_{xy} = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n (x_i - \bar x)^2}\sqrt{\sum_{i=1}^n (y_i - \bar y)^2}}$$

In [51]:
a['Compensation'] = [44000, 66000, 30000, 65000]
a.corr(numeric_only=True)

                   Age  Compensation
Age           1.000000      0.943998
Compensation  0.943998      1.000000


In [41]:
a.cov(numeric_only=True)

                     Age  Compensation
Age           140.916667      195250.0
Compensation    195250.0   303583300.0
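
The Pearson coefficient can also be computed by hand from the deviations from the mean, or from the covariance and the standard deviations. A sketch for checking the values reported by `corr` and `cov`; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

x = pd.Series([43, 50, 30, 58], name='Age')
y = pd.Series([44000, 66000, 30000, 65000], name='Compensation')

# Deviations from the mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson correlation written out term by term
r_manual = (dx * dy).sum() / (np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum()))

# Equivalent: covariance divided by the product of the standard deviations
r_from_cov = x.cov(y) / (x.std() * y.std())

# Built-in pandas version
r_pandas = x.corr(y)

print(r_manual, r_from_cov, r_pandas)
```

All three values agree because the normalisation factor 1/(n-1) used by `cov` and `std` cancels in the ratio.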


# Not available (NA) and Not a Number (NaN)
* Missing data
* Operations not "well-defined"

In [45]:
a['Compensation'] *= pd.Series([1.02, 1.03, 1.04])

In [46]:
a

              Name  Age             Category  Compensation
0  Henrik Pedersen   43  Associate Professor       45777.6
1     Susanne Smed   50            Professor       70019.4
2     Kirstine Bak   30        ph.d.-student       32448.0
3      Hans Hansen   58            Professor           NaN


In [47]:
b = a.dropna()
b

              Name  Age             Category  Compensation
0  Henrik Pedersen   43  Associate Professor       45777.6
1     Susanne Smed   50            Professor       70019.4
2     Kirstine Bak   30        ph.d.-student       32448.0


In [48]:
a.fillna(0) # Be careful here - mini-assignment!

              Name  Age             Category  Compensation
0  Henrik Pedersen   43  Associate Professor       45777.6
1     Susanne Smed   50            Professor       70019.4
2     Kirstine Bak   30        ph.d.-student       32448.0
3      Hans Hansen   58            Professor           0.0


In [49]:
# Reset our dataframe
a.loc[3, 'Compensation'] = 65000*1.02

## Mini-assignment
1. Compute the mean compensation with the values `a['Compensation'] = [44000, 66000, 30000, 65000]`
2. Update the compensation: `a['Compensation'] *= pd.Series([1.02, 1.02, 1.02])`
3. Compute the mean compensation: has it increased by 2%?
4. Replace the NA values using `a.fillna(0, inplace=True)`
5. Compute the mean compensation again: what do you observe, and why?
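
The effect behind this assignment can be sketched on a plain Series: the multiplier has only three entries, so the fourth row becomes NaN, and `mean` skips NaN by default. The variable names below are illustrative:

```python
import pandas as pd

comp = pd.Series([44000, 66000, 30000, 65000], dtype=float)
mean_before = comp.mean()

# The multiplier has no entry for index 3, so that row becomes NaN
updated = comp * pd.Series([1.02, 1.02, 1.02])

# mean() ignores NaN: this is the mean of the first three rows only
mean_skipna = updated.mean()

# Filling with 0 treats the missing compensation as a real value of 0
mean_filled = updated.fillna(0).mean()

print(mean_before, mean_skipna, mean_filled)
```

Neither of the two later means equals 1.02 times the original mean: the first silently drops one employee, the second counts that employee with a compensation of zero.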

## Non-consecutive/custom indexing
* Changing Index

In [13]:
a['Staff number'] = [11130, 23485, 209385, 10345]
b = a.set_index('Staff number')

In [14]:
b

                         Name  Age             Category
Staff number
11130         Henrik Pedersen   42  Associate Professor
23485            Susanne Smed   49            Professor
209385           Kirstine Bak   28        ph.d.-student
10345             Hans Hansen   56            Professor


In [15]:
b.loc[10345]

Name        Hans Hansen
Age                  56
Category      Professor
Name: 10345, dtype: object
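
With a custom index, label-based access (`.loc`) and position-based access (`.iloc`) no longer coincide. A small sketch on a reduced staff table; the result names are illustrative:

```python
import pandas as pd

a = pd.DataFrame({'Name': ['Henrik Pedersen', 'Susanne Smed',
                           'Kirstine Bak', 'Hans Hansen'],
                  'Age': [42, 49, 27, 55],
                  'Staff number': [11130, 23485, 209385, 10345]})
b = a.set_index('Staff number')

# .loc looks up the index label (the staff number)
by_label = b.loc[10345]

# .iloc looks up the row position (0-based), independent of the labels
by_position = b.iloc[3]

# Several labels and a single column at once
ages = b.loc[[11130, 23485], 'Age']

print(by_label)
print(by_position)
print(ages)
```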

# Categorical Data
We can work more explicitly with categorical data in Pandas.
- Define a categorical variable with `pd.Categorical` and store it in a `Series` or a `DataFrame` column
- Categorical variables can be ordered or un-ordered

In [67]:
pd.Categorical(a['Category'],
               categories=['ph.d.-student', 'Associate Professor', 'Professor'],
               ordered=True)

[Associate Professor, Professor, ph.d.-student, Professor]
Categories (3, object): [ph.d.-student < Associate Professor < Professor]

In [68]:
c = a.copy()
c['Category'] = pd.Categorical(a['Category'],
               categories=['ph.d.-student', 'Associate Professor', 'Professor'],
               ordered=True)

In [69]:
c.sort_values(['Category', 'Compensation'], ascending=False)

              Name  Age             Category  Compensation  Staff number  Seniority
1     Susanne Smed   50            Professor       67320.0         23485     Senior
3      Hans Hansen   58            Professor           0.0         10345     Senior
0  Henrik Pedersen   43  Associate Professor       44880.0         11130     Senior
2     Kirstine Bak   30        ph.d.-student       30600.0        209385     Junior
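
An ordered categorical also supports comparisons against one of its categories. A sketch using the same category order as above; `cat`, `codes`, `at_least_associate` and `highest` are illustrative names:

```python
import pandas as pd

cat = pd.Series(pd.Categorical(
    ['Associate Professor', 'Professor', 'ph.d.-student', 'Professor'],
    categories=['ph.d.-student', 'Associate Professor', 'Professor'],
    ordered=True))

# Integer position of each value in the category order
codes = cat.cat.codes

# Comparisons follow the category order, not the alphabet
at_least_associate = cat >= 'Associate Professor'

# The highest category present in the data
highest = cat.max()

print(codes)
print(at_least_associate)
print(highest)
```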


# Writing and reading data with Pandas
* csv
* xlsx (excel)
* json
* SAS
* Stata

## Write a simple csv file
* employees.csv

In [74]:
a.to_csv('employees.csv')

## Reading a simple csv file


In [17]:
df_new = pd.read_csv('employees.csv', index_col=['Category'])
df_new.sort_index()

                     Seniority             Name  Age  Compensation
Category
Associate Professor     Senior  Henrik Pedersen   42         44000
Professor               Senior     Susanne Smed   49         66000
Professor               Senior      Hans Hansen   55         65000
ph.d.-student           Junior     Kirstine Bak   27         30000
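
By default `to_csv` also writes the row index as an unnamed first column, which `read_csv` then returns as a column named `Unnamed: 0`. A sketch of a round trip without the index; it writes to an in-memory buffer (`io.StringIO`) instead of a file, which is an assumption made here to keep the example self-contained:

```python
import io
import pandas as pd

a = pd.DataFrame({'Name': ['Henrik Pedersen', 'Susanne Smed',
                           'Kirstine Bak', 'Hans Hansen'],
                  'Age': [42, 49, 27, 55],
                  'Category': ['Associate Professor', 'Professor',
                               'ph.d.-student', 'Professor']})

# Write without the row index
buffer = io.StringIO()
a.to_csv(buffer, index=False)

# Rewind the buffer and read the data back
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

Passing a file name such as `'employees.csv'` instead of the buffer works the same way.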


### Saving to other file formats

In [None]:
a.to_excel('employees.xlsx')

In [None]:
a.to_html('employees.html')

In [None]:
a.to_stata('employees.dta')

# Saving in binary format
* Should be the preferred format for large datasets

In [None]:
a.to_hdf('employees.h5', key='employees')

# Data preparation: assembling
1. Adding columns via merge
2. Adding columns via concatenation
3. Adding rows via concatenation
4. Adding rows via merge

In [18]:
data1 = [['Henrik Pedersen', 42, 44000], 
        ['Susanne Smed', 49, 66000], 
        ['Kirstine Bak', 27, 30000],
        ['Hans Hansen', 55, 65000],
        ['Per Kirk', 42, 47000]]
df1 = pd.DataFrame(data1, columns=['Name', 'Age', 'Compensation'])
data2 = [['Associate Professor', 'Henrik Pedersen', 'Senior'], 
        ['Professor', 'Susanne Smed', 'Senior'], 
        ['ph.d.-student', 'Kirstine Bak', 'Junior'],
        ['Professor', 'Hans Hansen', 'Senior']]
df2 = pd.DataFrame(data2, columns=['Category', 'Name', 'Level'])
# use csv files instead

## Merging column based on column name
* inner (default): intersection
* outer : union

In [82]:
df = pd.merge(df1, df2, on='Name')
df

              Name  Age  Compensation             Category   Level
0  Henrik Pedersen   42         44000  Associate Professor  Senior
1     Susanne Smed   49         66000            Professor  Senior
2     Kirstine Bak   27         30000        ph.d.-student  Junior
3      Hans Hansen   55         65000            Professor  Senior


In [83]:
df = pd.merge(df1, df2, on='Name', how='outer')
df

              Name  Age  Compensation             Category   Level
0  Henrik Pedersen   42         44000  Associate Professor  Senior
1     Susanne Smed   49         66000            Professor  Senior
2     Kirstine Bak   27         30000        ph.d.-student  Junior
3      Hans Hansen   55         65000            Professor  Senior
4         Per Kirk   42         47000                  NaN     NaN
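
To see where each row of a merge came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. A sketch with the same two frames; `merged`, `only_in_df1` and `left` are illustrative names:

```python
import pandas as pd

df1 = pd.DataFrame([['Henrik Pedersen', 42, 44000],
                    ['Susanne Smed', 49, 66000],
                    ['Kirstine Bak', 27, 30000],
                    ['Hans Hansen', 55, 65000],
                    ['Per Kirk', 42, 47000]],
                   columns=['Name', 'Age', 'Compensation'])
df2 = pd.DataFrame([['Associate Professor', 'Henrik Pedersen', 'Senior'],
                    ['Professor', 'Susanne Smed', 'Senior'],
                    ['ph.d.-student', 'Kirstine Bak', 'Junior'],
                    ['Professor', 'Hans Hansen', 'Senior']],
                   columns=['Category', 'Name', 'Level'])

# _merge is 'both', 'left_only' or 'right_only' for each row
merged = pd.merge(df1, df2, on='Name', how='outer', indicator=True)
only_in_df1 = merged[merged['_merge'] == 'left_only']

# how='left' keeps every row of df1, matched or not
left = pd.merge(df1, df2, on='Name', how='left')

print(merged)
print(only_in_df1)
```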


## Adding columns by concatenation

In [84]:
data1 = [['Henrik Pedersen', 42, 44000], 
        ['Susanne Smed', 49, 66000], 
        ['Kirstine Bak', 27, 30000],
        ['Hans Hansen', 55, 65000],
        ['Per Kirk', 42, 47000]]
df1 = pd.DataFrame(data1, columns=['Name', 'Age', 'Compensation'])
data2 = [['Associate Professor', 'Senior'], 
        ['Professor', 'Senior'], 
        ['ph.d.-student', 'Junior'],
        ['Professor', 'Senior']]
df2 = pd.DataFrame(data2, columns=['Category', 'Level'])

In [85]:
df = pd.concat([df1, df2], axis=1)
df

              Name  Age  Compensation             Category   Level
0  Henrik Pedersen   42         44000  Associate Professor  Senior
1     Susanne Smed   49         66000            Professor  Senior
2     Kirstine Bak   27         30000        ph.d.-student  Junior
3      Hans Hansen   55         65000            Professor  Senior
4         Per Kirk   42         47000                  NaN     NaN
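
Note that `pd.concat(..., axis=1)` aligns rows on their index labels, not on their position. A small sketch with two hypothetical frames (`left` and `right`) whose indexes only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({'Name': ['Henrik Pedersen', 'Susanne Smed']}, index=[0, 1])
right = pd.DataFrame({'Level': ['Senior', 'Junior']}, index=[1, 2])

# Rows are matched by label: labels 0 and 2 get NaN in one of the columns
aligned = pd.concat([left, right], axis=1)

# Resetting both indexes first pairs the rows by position instead
by_position = pd.concat([left.reset_index(drop=True),
                         right.reset_index(drop=True)], axis=1)

print(aligned)
print(by_position)
```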


## Adding rows via concatenation

In [86]:
column_names = ['Name', 'Age', 'Compensation', 'Category', 'Level']
data1 = [['Henrik Pedersen', 42, 44000, 'Associate Professor', 'Senior'], 
        ['Susanne Smed', 49, 66000, 'Professor', 'Senior']]
df1 = pd.DataFrame(data1, columns=column_names)
data2 = [['Kirstine Bak', 27, 30000, 'ph.d.-student', 'Junior'],
        ['Hans Hansen', 55, 65000, 'Professor', 'Senior']]
df2 = pd.DataFrame(data2, columns=column_names) 

In [87]:
df = pd.concat([df1, df2]).reset_index(drop=True)
df

              Name  Age  Compensation             Category   Level
0  Henrik Pedersen   42         44000  Associate Professor  Senior
1     Susanne Smed   49         66000            Professor  Senior
2     Kirstine Bak   27         30000        ph.d.-student  Junior
3      Hans Hansen   55         65000            Professor  Senior


## Mini-assignment
* Add rows and columns by merging the data frames below
* Try both outer and inner: what do you observe, and why?

In [88]:
data1 = [['Henrik Pedersen', 42, 'Senior'], 
        ['Susanne Smed', 49, 'Senior'],
        ['Kirstine Bak', 27, 'Junior']]
df1 = pd.DataFrame(data1, columns=['Name', 'Age', 'Level'])
data2 = [['Kirstine Bak', 27, 30000, 'Junior'],
        ['Hans Hansen', 55, 65000, 'Senior']]
df2 = pd.DataFrame(data2, columns=['Name', 'Age', 'Compensation', 'Level'])
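
One way to approach the assignment, as a sketch rather than a reference solution: without an `on` argument, `pd.merge` joins on all columns the two frames share (here `Name`, `Age` and `Level`). The names `inner`, `outer` and `by_name` are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame([['Henrik Pedersen', 42, 'Senior'],
                    ['Susanne Smed', 49, 'Senior'],
                    ['Kirstine Bak', 27, 'Junior']],
                   columns=['Name', 'Age', 'Level'])
df2 = pd.DataFrame([['Kirstine Bak', 27, 30000, 'Junior'],
                    ['Hans Hansen', 55, 65000, 'Senior']],
                   columns=['Name', 'Age', 'Compensation', 'Level'])

# Default: how='inner', joined on all shared columns
inner = pd.merge(df1, df2)

# Union of the keys; missing compensations become NaN
outer = pd.merge(df1, df2, how='outer')

# Joining on 'Name' only duplicates the other shared columns (_x / _y suffixes)
by_name = pd.merge(df1, df2, on='Name', how='outer')

print(inner)
print(outer)
print(by_name.columns.tolist())
```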

## (If time permits) Data preparation : dropping
1. Specific
2. Duplicates

In [None]:
df2 = pd.DataFrame([['Kirstine Bak', 27, 31000, 'Ph.d.-student', 'Junior']],
                  columns=df.columns)

In [None]:
df = pd.concat([df, df2]).reset_index(drop=True)

In [None]:
df

In [None]:
df.drop_duplicates(subset='Name', inplace=True)

In [None]:
df

In [None]:
df.drop(2, inplace=True)
df
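
`drop_duplicates` keeps the first occurrence by default; the `keep` parameter changes that. A sketch on a small hypothetical frame with one repeated name:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Kirstine Bak', 'Hans Hansen', 'Kirstine Bak'],
                   'Compensation': [30000, 65000, 31000]})

# Default keep='first': the earlier Kirstine Bak row survives
first = df.drop_duplicates(subset='Name')

# keep='last': the later (updated) row survives
last = df.drop_duplicates(subset='Name', keep='last')

# Mark every row that has a duplicate, without dropping anything
dup_mask = df.duplicated(subset='Name', keep=False)

print(first)
print(last)
print(dup_mask)
```

Without `inplace=True`, both calls return new frames and leave `df` unchanged.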

## Split-apply-combine
 * Split data into groups
 * Apply a function(ality) on each group
 * Combine into a new data structure

In [None]:
# Classic: using index / Pivot-table
df.set_index('Category', inplace=True)

In [None]:
df.groupby(level='Category').mean(numeric_only=True)  # df.mean(level=...) was removed in pandas 2.0

## Split-apply-combine using groupby

In [None]:
df.reset_index(inplace=True)

In [None]:
df.groupby('Category').mean(numeric_only=True)

In [None]:
df.groupby('Category').Age.min()
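
Several aggregations can be computed in one pass with `agg`. A sketch using named aggregation; the output column names (`mean_age`, `max_compensation`, `staff`) are chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Henrik Pedersen', 'Susanne Smed',
                            'Kirstine Bak', 'Hans Hansen'],
                   'Age': [42, 49, 27, 55],
                   'Compensation': [44000, 66000, 30000, 65000],
                   'Category': ['Associate Professor', 'Professor',
                                'ph.d.-student', 'Professor']})

# Split by Category, apply one function per output column, combine the results
summary = df.groupby('Category').agg(
    mean_age=('Age', 'mean'),
    max_compensation=('Compensation', 'max'),
    staff=('Name', 'count'))

print(summary)
```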

# References

Fabio Nelli (2015) "Python Data Analytics", Apress, ISBN-13 (electronic): 978-1-4842-0958-5

Wes McKinney & PyData Development Team (2018) "pandas: powerful Python data analysis toolkit"

Hadley Wickham (2011) "The Split-Apply-Combine Strategy for Data Analysis", Journal of Statistical Software, Volume 40, Issue 1.
