# Session 2: Data Structuring 1

*Nicklas Johansen*

## Agenda

In this session, we will work with `pandas` and learn how to structure data.

- Tidy Data
- numpy & pandas modules
- pandas series
- pandas DataFrames
- selecting data
- indexing & renaming


NB
- Download this file to your computer
- Rename the file
- Run code together with Nicklas

## Why We Structure Data

### Motivation
*Why do we want to learn data structuring?*

- Data rarely comes in the form of our model. We need to 'wrangle' our data.
- Someone has to do this
- You need to understand how data was prepared to avoid drawing wrong conclusions

### Tidy Data

Good discussion [here](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The fundamentals:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
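
A minimal sketch of what these rules mean in practice, assuming a small made-up temperature table (the names `wide` and `tidy` and all values are hypothetical). The wide table hides the variable "month" in its column headers; `pd.melt` is one way to reshape it so that every row is a single observation.

```python
import pandas as pd

# Hypothetical "wide" table: the months are spread across column headers
wide = pd.DataFrame({
    'city': ['Copenhagen', 'Aarhus'],
    'january': [0, 1],
    'february': [-3, -1],
})

# Reshape so that each row is one (city, month) observation
tidy = pd.melt(wide, id_vars='city', var_name='month', value_name='temperature')
print(tidy)
```

After reshaping, `city`, `month` and `temperature` are each a column, and each cell holds a single value.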

<center><img src='https://raw.githubusercontent.com/abjer/sds2017/master/slides/figures/tidy.png'></center>

# Numpy and Pandas

In [1]:
# Loading packages
import numpy as np
import pandas as pd

## Numpy Overview
*What is the [`numpy`](http://www.numpy.org/) module?*

`numpy` is a Python module / library / package
- fast and versatile for manipulating arrays
- linear algebra tools available
- used in some machine learning and statistics packages

Example of creating an array similar to a 2x2 matrix:

In [2]:
table = [[1,2],[3,4]]
arr = np.array(table)
arr

array([[1, 2],
       [3, 4]])
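
As a small illustration of the array and linear algebra tools mentioned above, here is a sketch using the same 2x2 array (the variable names `doubled`, `product` and `transposed` are only for this example):

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

doubled = arr * 2      # elementwise multiplication
product = arr @ arr    # matrix multiplication
transposed = arr.T     # transpose

print(doubled)
print(product)
print(transposed)
```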

## Pandas Motivation
*Why use Pandas?*

It is built on numpy:
- Simplicity: Pandas follows Python's simple syntax
- Powerful and fast tools for manipulating data from numpy

Improves on numpy:
- Clarity, flexibility by using labels (keys)
- Introduces lots of new, useful tools for data analysis (more on this)

Note: Much more similar to common software for data manipulation like, say, Stata


#### Pandas Popularity

<center><img src='https://www.sqlshack.com/wp-content/uploads/2020/08/pandas-in-python-popularity-from-stack-overflow.png' alt="Drawing" style="width: 500px;"/></center>



# DataFrames and Series

In [3]:
# Loading packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns

## Pandas Data Types
*How do we work with data in Pandas?*

- We use two fundamental data structures: 
  - ``Series``, and
  - ``DataFrame``.

## Pandas Series (1:5)
*What is a `Series`?*
- A vector/list with labels for each entry. Example:

In [4]:
L = [1, 1.2, 'abc', True]

my_series = pd.Series(L)
my_series

0       1
1     1.2
2     abc
3    True
dtype: object

## Pandas Series (2:5)
*What are the components in a Series?* 

A Series generally consists of three components:
- `index`: label for each observation
- `values`: observation data
- `dtype`: the format of the series (`object` means any data type is allowed)
  - examples are fundamental datatypes (`float`, `int`, `bool`)  
      - in terms of precision: `float`>`int`>`bool`
      - this comes at a cost in the form of speed
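
The three components can be inspected directly. A short sketch, assuming a small made-up series `s` (hypothetical values):

```python
import pandas as pd

s = pd.Series([1.5, 2.0, 3.5], index=['a', 'b', 'c'])

print(s.index)    # the labels
print(s.values)   # the underlying numpy array
print(s.dtype)    # the data type shared by all values

# Mixing different types forces the generic `object` dtype
mixed = pd.Series([1, 1.2, 'abc', True])
print(mixed.dtype)
```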

## Pandas Series (3:5)
*How do we set a custom index?* 

Indices need not have a sequential structure. To see this, consider the following example:

In [2]:
num_data = range(0,3) # Generate data
num_data

range(0, 3)

In [3]:
indices = ['B', 'C', 'A'] # Generate index names
indices

['B', 'C', 'A']

In [4]:
# Create a pandas series from the two
my_series2 = pd.Series(data=num_data, index=indices) 
my_series2

B    0
C    1
A    2
dtype: int64

## Pandas Series (4:5)
*What data structure does the pandas series remind us of?*

A mix of Python list and dictionary. Consider the following simple transformation:

In [7]:
my_series.to_dict()

{0: 1, 1: 1.2, 2: 'abc', 3: True}

*Can we also convert a dictionary to a series?*
- Yes, we just pass it to the Series (class) constructor. Example:

In [5]:
d = {'yesterday': 0, 'today': 1, 'tomorrow':3} # Create some dictionary
d

{'yesterday': 0, 'today': 1, 'tomorrow': 3}

In [6]:
my_series3 = pd.Series(d) # Use the constructor
my_series3

yesterday    0
today        1
tomorrow     3
dtype: int64

## Pandas Series (5:5)
*How is the series different from a dict?*
- An important distinction: Series indices need NOT be unique! Example:

In [8]:
s = pd.Series(range(3), index=['A','A', 'A']) # Create series with same indices
s

A    0
A    1
A    2
dtype: int64

In [9]:
print(s.index.duplicated()) # Check duplicates


[False  True  True]


In [10]:
print(s.to_dict()) # So translating to a dict gives...

{'A': 2}


Series are both key-based and position-based (i.e. sequential).
- Remember that unlike, say, lists, dictionaries cannot be accessed by position!
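
A minimal sketch of the two access styles, using made-up series (`s` and `dup` are hypothetical): `.loc` looks up by key, `.iloc` by position, and a duplicated key returns every match.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['B', 'C', 'A'])

by_label = s.loc['A']      # key-based lookup, like a dict
by_position = s.iloc[0]    # position-based lookup, like a list
print(by_label, by_position)

# With duplicated labels, a key lookup returns all matching entries
dup = pd.Series(range(3), index=['A', 'A', 'B'])
matches = dup.loc['A']
print(matches)
```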

## Pandas Data Frames (1:4)

*OK, so now we know what a series is. What is a `DataFrame` then?*
- A 2d-array (matrix) with labelled columns and rows (which are called indices). Example:

In [11]:
df = pd.DataFrame(data=[[1,2],[3,4]],columns=['A', 'B'])
df

   A  B
0  1  2
1  3  4


## Pandas Data Frames (2:4)
 *How can we really think about this?*

There are at least two simple ways of seeing the pandas DataFrame:
1. A numpy array with some additional stuff.
2. A set of series that have been merged horizontally
    - Note that columns can have different datatypes!
    
Most functions from `numpy` can be applied directly to Pandas objects. We can convert a DataFrame to a `numpy` array with the `values` attribute.

In [11]:
df.values

array([[1, 2],
       [3, 4]], dtype=int64)

*To note*: In Python we can describe it as a *list of lists* or sometimes a *dict of dicts*.

In [12]:
df.values.tolist()

[[1, 2], [3, 4]]
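
To illustrate the point about `numpy` functions, here is a small sketch on the same 2x2 DataFrame (the names `roots`, `col_sums` and `as_array` are only for this example). `to_numpy()` is used as an alternative to the `values` attribute.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1, 2], [3, 4]], columns=['A', 'B'])

roots = np.sqrt(df)          # numpy function applied elementwise; labels are kept
col_sums = df.sum(axis=0)    # a Series indexed by column name
as_array = df.to_numpy()     # same content as df.values

print(roots)
print(col_sums)
print(as_array)
```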

## Pandas Data Frames (3:4)
*How can larger pandas dataframes be built?*
- Similar to Series, DataFrames can be built from dictionaries.
- An important difference: To create distinct columns with labelled rows, each value in the dictionary should itself be a dictionary (or another sequence-like object). Example:

In [13]:
djan = {'1st': 0, '2nd': 1, '3rd':3} # Create some dictionary for january
dfeb = {'1st': -3, '2nd': -1, '3rd':-2} # Create some dictionary for february
dmar = {'1st': 3, '2nd': 5, '3rd':4} # Create some dictionary for march

d = {'january': djan, 'february': dfeb, 'march': dmar} # Create dictionary of dictionaries
my_df1 = pd.DataFrame(d) # Use the constructor
my_df1

     january  february  march
1st        0        -3      3
2nd        1        -1      5
3rd        3        -2      4


## Pandas Data Frames (4:4)

*What happens if keys are not the same?*
- No big deal...

In [14]:
djan = {'1st': 0, '2nd': 1, '3rd':3} # Create some dictionary for january
dfeb = {'1st': -3, '2nd': -1, '3rd':-2} # Create some dictionary for february
dmar = {'1st': 3, '2nd': 5, '4th':4} # Create some dictionary for march

d = {'january': djan, 'february': dfeb, 'march': dmar} # Create dictionary of dictionaries
my_df2 = pd.DataFrame(d) # Use the constructor
my_df2

     january  february  march
1st      0.0      -3.0    3.0
2nd      1.0      -1.0    5.0
3rd      3.0      -2.0    NaN
4th      NaN       NaN    4.0
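
A short sketch of what happens behind the scenes, assuming two of the dictionaries from above: missing keys are filled with `NaN`, and since `NaN` is a float, the integer columns are converted to floats.

```python
import pandas as pd

djan = {'1st': 0, '2nd': 1, '3rd': 3}
dmar = {'1st': 3, '2nd': 5, '4th': 4}

df_months = pd.DataFrame({'january': djan, 'march': dmar})

print(df_months)
print(df_months.dtypes)        # both columns become float
print(df_months.isna().sum())  # number of missing values per column
```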


## Series vs DataFrames (1:2)
*How are Series related to DataFrames?*
- Put simply: Every column is a series. Example, access by key (recommended):

In [15]:
print(df['B'])

0    2
1    4
Name: B, dtype: int64


Another option is attribute access (dot notation)... smart, but dangerous! Sometimes it works...

In [16]:
print(df.B)

0    2
1    4
Name: B, dtype: int64


## Series vs DataFrames (2:2)
*But when wouldn't this work?*
- To illustrate, add one more column:

In [17]:
df['count'] =  5 # adding new column to df
print(df)

   A  B  count
0  1  2      5
1  3  4      5


Now print this and see!

In [18]:
print(df.count)

<bound method DataFrame.count of    A  B  count
0  1  2      5
1  3  4      5>


Clearly, the key-based option is more robust: columns with the same name as a DataFrame method, e.g. `count`, cannot be accessed with dot notation.
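
The difference can be checked directly. A minimal sketch, rebuilding a small DataFrame with a column named `count` (the name `df_demo` is hypothetical):

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 3], 'B': [2, 4], 'count': [5, 5]})

is_method = callable(df_demo.count)   # dot notation finds the DataFrame method
column = df_demo['count']             # key access finds the column

print(is_method)
print(column.tolist())
```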

## Converting Data Types

The data type of a series can be converted with the **astype** method. Some examples:

In [19]:
print(my_series3)
print()
print(my_series3.astype(float))
print()
print(my_series3.astype(str))

yesterday    0
today        1
tomorrow     3
dtype: int64

yesterday    0.0
today        1.0
tomorrow     3.0
dtype: float64

yesterday    0
today        1
tomorrow     3
dtype: object
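
A common use case is numbers that were read in as text. A small sketch, assuming a made-up series `raw` of numeric strings:

```python
import pandas as pd

raw = pd.Series(['1', '2', '3'])     # numbers stored as text

as_int = raw.astype(int)             # parse the strings as integers
as_float = as_int.astype(float)      # integers to floats
back_to_text = as_int.astype(str)    # and back to strings

print(as_int.sum())                  # arithmetic now works as expected
```

Note that `astype` raises an error if a value cannot be converted, e.g. the string `'abc'` to an integer.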


## Indices and Column Names
*Why don't we just use numpy arrays and matrices?*

- Inspection of data is quicker
    - What was it that column 18 represented?
- Keep track of rows after deletion
    - Again.... What was it that column 18 represented!?
- Indices may contain fundamentally different data structures 
    - e.g. time series (more about this later)
    - Other datatypes (spatial data $\rightarrow$ advanced course)
- Facilitates complex operations (next session):
    - Merging datasets
    - Split-apply-combine (operations on subsets of data)
    - Method chaining (multiple operations in sequence)

## Viewing Series and Dataframes
*How can we view the contents in our dataset?*
- We can use `print` on our dataset
- We can visualize patterns by plotting

## The Head and Tail
*But what if we have a large data set with many rows?*
- Let's load the 'titanic' data set that comes with the *seaborn* library:

In [20]:
import seaborn as sns
titanic = sns.load_dataset('titanic')

We now select the *first* 3 rows of the data set with the `head` method.

In [21]:
titanic.head(3)

   survived  pclass     sex   age  sibsp  parch     fare  embarked  class    who  adult_male  deck  embark_town  alive  alone
0         0       3    male  22.0      1      0     7.25         S  Third    man        True   NaN  Southampton     no  False
1         1       1  female  38.0      1      0  71.2833         C  First  woman       False     C    Cherbourg    yes  False
2         1       3  female  26.0      0      0    7.925         S  Third  woman       False   NaN  Southampton    yes   True


The `tail` method selects the last observations in a DataFrame. 
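
A quick sketch of both methods, using a made-up DataFrame `numbers` as a stand-in for a large data set (so the example runs without downloading anything):

```python
import pandas as pd

# Hypothetical stand-in for a large data set with 100 rows
numbers = pd.DataFrame({'x': range(100), 'y': range(100, 200)})

first_rows = numbers.head(3)   # the first 3 rows
last_rows = numbers.tail(2)    # the last 2 rows

print(first_rows)
print(last_rows)
print(numbers.shape)           # (rows, columns)
```

Both methods return 5 rows if no number is given.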

## Row and Column Selection (1:3)
*How can we select certain rows in a DataFrame using **keys**?*

With the `loc` attribute. Example:

In [22]:
print(titanic.loc[range(3),['survived', 'age', 'sex']])

   survived   age     sex
0         0  22.0    male
1         1  38.0  female
2         1  26.0  female


## Row and Column Selection (2:3)
*How can we select certain rows in a DataFrame using **index integers**?* 

The `iloc` attribute selects rows and columns by integer position. 

In [23]:
print(titanic.iloc[10:15,:5])

    survived  pclass     sex   age  sibsp
10         1       3  female   4.0      1
11         1       1  female  58.0      0
12         0       3    male  20.0      0
13         0       3    male  39.0      1
14         0       3  female  14.0      0
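
One difference that is easy to miss: slices with `loc` include the end point, while slices with `iloc` exclude it, like ordinary Python slices. A minimal sketch on a made-up DataFrame (`small` is hypothetical):

```python
import pandas as pd

small = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]},
                     index=['w', 'x', 'y', 'z'])

by_label = small.loc['x':'y', 'a']   # label slice: the end point 'y' is included
by_position = small.iloc[1:2, 0]     # integer slice: the end point 2 is excluded

print(by_label)
print(by_position)
```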


## Row and Column Selection (3:3)
*Other things to be aware of?* 

We can select rows for all columns by not specifying columns (or specifying `:`). I.e.:

In [24]:
titanic.loc[[0,1,2]]

   survived  pclass     sex   age  sibsp  parch     fare  embarked  class    who  adult_male  deck  embark_town  alive  alone
0         0       3    male  22.0      1      0     7.25         S  Third    man        True   NaN  Southampton     no  False
1         1       1  female  38.0      1      0  71.2833         C  First  woman       False     C    Cherbourg    yes  False
2         1       3  female  26.0      0      0    7.925         S  Third  woman       False   NaN  Southampton    yes   True


We can also select certain columns by specifying column names:

In [25]:
titanic[['survived']].head(3)

   survived
0         0
1         1
2         1
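
Note the double brackets above. A short sketch of the difference, using a made-up DataFrame with two of the titanic column names (`passengers` is hypothetical):

```python
import pandas as pd

passengers = pd.DataFrame({'survived': [0, 1, 1], 'age': [22.0, 38.0, 26.0]})

as_series = passengers['survived']     # single brackets: a Series
as_frame = passengers[['survived']]    # double brackets: a one-column DataFrame

print(type(as_series))
print(type(as_frame))
```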


## Modifying DataFrames
*Why do we want to modify DataFrames?*

- Because data rarely comes in the form we want it.


## Changing the Index (1:3)
*How can we change the index of a DataFrame?*

In [26]:
my_df = pd.DataFrame([[1,2], [3,4], [5,6]], columns = ['a', 'b'], index = ['i', 'ii', 'iii'])
my_df

     a  b
i    1  2
ii   3  4
iii  5  6


We change or set a DataFrame's index using its method `set_index`. Example:

In [27]:
print(my_df.set_index('a'))

   b
a   
1  2
3  4
5  6


Clearly, doing so, we also implicitly delete the previous index.

Also, notice the level shift in *b* due to this.

In [27]:
my_df

     a  b
i    1  2
ii   3  4
iii  5  6


## Changing the Index (2:3)
*Is our DataFrame changed? I.e. does it have a new index?*
- No. `set_index` returns a new DataFrame and leaves the original unchanged, so we must assign the result to keep it:

In [28]:
my_df

     a  b
i    1  2
ii   3  4
iii  5  6


In [30]:
my_df_a = my_df.set_index('a')
my_df_a

   b
a   
1  2
3  4
5  6


## Changing the Index (3:3)

Sometimes we wish to remove the index. This is done with the `reset_index` method:

In [29]:
print(my_df_a.reset_index()) # default drop=False: the old index becomes a column
print()
print(my_df_a.reset_index(drop=True)) # drop=True: the old index is discarded

   a  b
0  1  2
1  3  4
2  5  6

   b
0  2
1  4
2  6


By specifying the keyword `drop=True` we delete the old index.

*To note:* Indices can have multiple levels, in this case `level` can be specified to delete a specific level.
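
A minimal sketch of the multi-level case, assuming a made-up DataFrame with a `year` and a `month` column (all names and values are hypothetical):

```python
import pandas as pd

monthly = pd.DataFrame({'year': [2020, 2020, 2021],
                        'month': ['jan', 'feb', 'jan'],
                        'value': [1, 2, 3]})

indexed = monthly.set_index(['year', 'month'])   # index with two levels
partial = indexed.reset_index(level='month')     # move only 'month' back to a column

print(indexed)
print(partial)
```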

## Changing the Column Names

Column names can simply be changed with `columns`:

In [30]:
print(my_df)
my_df.columns = ['A', 'B']
print()
print(my_df)

     a  b
i    1  2
ii   3  4
iii  5  6

     A  B
i    1  2
ii   3  4
iii  5  6


DataFrames also have a method called `rename`.

In [31]:
my_df.rename(columns={'A': 'Aa'}, inplace=True)
print(my_df)

     Aa  B
i     1  2
ii    3  4
iii   5  6
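
Without `inplace=True`, `rename` returns a new DataFrame and leaves the original untouched. A small sketch, rebuilding a DataFrame like the one above (`letters` and `renamed` are names chosen for this example), which also shows that row labels can be renamed the same way:

```python
import pandas as pd

letters = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
                       columns=['A', 'B'], index=['i', 'ii', 'iii'])

# Rename one column and one row label; the result is a new DataFrame
renamed = letters.rename(columns={'A': 'Aa'}, index={'i': 'first'})

print(renamed)
print(letters)   # the original is unchanged
```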


## Changing all Column Values
*How can we update values in a DataFrame?*

In [32]:
print(my_df)

# set uniform value
my_df['B'] = 3
print()
print(my_df)

# set different values
my_df['B'] = [2,17,0] 
print()
print(my_df)

     Aa  B
i     1  2
ii    3  4
iii   5  6

     Aa  B
i     1  3
ii    3  3
iii   5  3

     Aa   B
i     1   2
ii    3  17
iii   5   0


## Changing Specific Column Values
*How can we update specific values in a DataFrame?*

In [33]:
print(my_df)

# select rows by label with loc and overwrite their values in column 'Aa'
my_loc2 = ['i', 'iii']
my_df.loc[my_loc2, 'Aa'] = 10

print()
print(my_df)

     Aa   B
i     1   2
ii    3  17
iii   5   0

     Aa   B
i    10   2
ii    3  17
iii  10   0


## Sorting Data

A DataFrame can be sorted with `sort_values`; this method takes one or more columns to sort by. 

In [34]:
print(my_df.sort_values(by='Aa', ascending=True))

     Aa   B
ii    3  17
i    10   2
iii  10   0


Many keyword arguments are possible for `sort_values`, including `ascending`, which can be set per column if we want descending order for one or more variables. 
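
A short sketch of sorting by several columns with a different direction for each, rebuilding a DataFrame with the same values as above (`scores` is a name chosen for this example):

```python
import pandas as pd

scores = pd.DataFrame({'Aa': [10, 3, 10], 'B': [2, 17, 0]},
                      index=['i', 'ii', 'iii'])

# Sort by 'Aa' descending, then break ties by 'B' ascending
sorted_scores = scores.sort_values(by=['Aa', 'B'], ascending=[False, True])
print(sorted_scores)
```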

In addition, sorting by index is also possible with `sort_index`.

In [35]:
print(my_df.sort_index())

     Aa   B
i    10   2
ii    3  17
iii  10   0


## DO2021 COHORT

In [16]:
import pandas as pd

df = pd.read_excel('data.xlsx', sheet_name='Complete')
df

Unnamed: 0,Hvad er dit KU brugernavn? (Skriv de 3 bogstaver + 3 tal fra din ku mail fx. abc123),Gruppe,Hvor gammel er du?,Hvilket fagområde er din nuværende uddannelse indenfor?,"Hvilket fagområde er din nuværende uddannelse indenfor? - Andet, skriv venligst:",Er du lige nu indskrevet på bachelor- eller kandidatstudieordning?,"Er du lige nu indskrevet på bachelor- eller kandidatstudieordning? - Andet, skriv venligst:",Hvilket postnummer er du vokset op?,I hvilken region er du vokset op?,I hvilken region er du vokset op? - Udlandet (uddyb venligst):,...,Pizza eller Poke Bowl?,Hvor vil du helst arbejde når du er færdiguddannet?,Hvad er din yndlingsmusik?,Navn,E-mail,Samlet status - Ny,Samlet status - Distribueret,Samlet status - Nogen svar,Samlet status - Gennemført,Samlet status - Frafaldet
0,vjq698,1.0,26.0,"Andet, skriv venligst:",samfundsfag,Kandidatstudieordning,,2830.0,Hovedstaden,,...,Pizza,Privat,Nik & Jay,Michael Jørgen Kjær,vjq698@alumni.ku.dk,0,0,0,1,0
1,gzf378,1.0,24.0,Statskundskab,,Kandidatstudieordning,,3480.0,Hovedstaden,,...,Pizza,Offentligt,Nik & Jay,Nanna Holze Brandt,gzf378@alumni.ku.dk,0,0,0,1,0
2,FTB283,1.0,24.0,Statskundskab,,Kandidatstudieordning,,7000.0,Syddanmark,,...,Pizza,Privat,Nik & Jay,Matthias Niels Runge Madsen,ftb283@alumni.ku.dk,0,0,0,1,0
3,,,,,,,,,,,...,,,,Kathrine Edwards,kvf505@alumni.ku.dk,0,1,0,0,0
4,zcn409,2.0,26.0,"Andet, skriv venligst:",Sikkerheds- og risikoledelse,Kandidatstudieordning,,90100.0,Udlandet (uddyb venligst):,Finland,...,Pizza,Privat,Nik & Jay,Laura Henna Sinikka Sunnari,zcn409@alumni.ku.dk,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56,sdv351,18.0,22.0,Statskundskab,,Bachelorstudieordning,,2860.0,Hovedstaden,,...,Pizza,Privat,Nik & Jay,Emil Juel Stephens,sdv351@alumni.ku.dk,0,0,0,1,0
57,lmg642,19.0,40.0,Antropologi,,Bachelorstudieordning,,8600.0,Midtjylland,,...,Poke Bowl,Privat,Nik & Jay,lmg642 lmg642,lmg642@alumni.ku.dk,0,0,0,1,0
58,sjg501,19.0,23.0,Statskundskab,,Bachelorstudieordning,,4230.0,Sjælland,,...,Pizza,Privat,Nik & Jay,Oskar Holm Klæbel,sjg501@alumni.ku.dk,0,0,0,1,0
59,kmp116,20.0,22.0,Statskundskab,,Bachelorstudieordning,,4500.0,Sjælland,,...,Pizza,Offentligt,Nik & Jay,Albert Neve Alsbjerg,kmp116@alumni.ku.dk,0,0,0,1,0


In [23]:
# How many have answered the survey?
df['Samlet status - Gennemført'].value_counts()

1    57
0     4
Name: Samlet status - Gennemført, dtype: int64

In [24]:
# Who has not answered?
df[df['Samlet status - Gennemført']==0]['E-mail']

3     kvf505@alumni.ku.dk
8            mnk253@ku.dk
20    bnr433@alumni.ku.dk
21    hdz370@alumni.ku.dk
Name: E-mail, dtype: object

In [25]:
# Do people love pizza?
df['Pizza eller Poke Bowl?'].value_counts()

Pizza        41
Poke Bowl    16
Name: Pizza eller Poke Bowl?, dtype: int64

In [None]:
# What is the MSc vs BSc split in the class?
df['Er du lige nu indskrevet på bachelor- eller kandidatstudieordning?'].value_counts()

In [None]:
To be continued...

## Assignment 0

- Fundamentals of Python:
    - Data types: numeric, string and boolean
    - Operators: numerical and logical
    - Sequential containers (and a tiny bit on the non-sequential)
    
    
- Building blocks of code:
    - If-then syntax
    - Loops: for and while
    - Reusable code: Functions, classes and modules


- Data Structuring in Pandas
    - Constructing a pandas Series/DataFrame
    - Reading CSV files
    - Naming columns and rows
    - Selecting columns and rows
    - Numerical operations
    - Sorting data

## Associated Readings

PDA, section 5.3: Descriptive statistics and numerical methods

PDA, chapter 7:
- Handling missing data
- Data transformations (duplicates, dummies, binning, etc.)
- String manipulations

PDA, sections 11.1-11.2:
- Dates and time in Python
- Working with time series in pandas (time as index)

PDA, sections 12.1, 12.3:
- Working with categorical data in pandas
- Method chaining

PML, chapter 4, section 'Handling categorical data':
- Encoding class labels with `LabelEncoder`
- One-hot encoding

## Group Exercises
Exercises where you can practice your data wrangling skills!  
Will be uploaded after the weekly lecture.