<img src="ku_logo_uk_v.png" alt="drawing" width="130" style="float:right"/>

# <span style="color:#2c061f"> Problem Set 3: Loading and Structuring data</span>  

<br>

## <span style="color:#374045"> Introduction to Programming and Numerical Analysis </span>
*Oluf Kelkjær*

### **Today's Plan**  
1. Introduction to Pandas
2. Monte Carlo integration briefly

### Introduction to Pandas  
`Pandas` is a powerful library for working with data.  


`Pandas` is built on top of `Numpy`, which means that a lot of `Numpy` structure is used or replicated in `Pandas`.  

The core element of `Pandas` is the `DataFrame`. It looks like a 'classic' dataset and can store heterogeneous tabular data.  

The `DataFrame` is a `Class` with many methods!  

In [5]:
import pandas as pd
import numpy as np

data = {"A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo"}

df = pd.DataFrame(data)
df.head() # df.tail(x) last x rows, df.sample(x) x random rows

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |


Almost every data-wrangling action you can do in `SQL`, `Excel` and `R (data.table)` can also be done in `pandas`.    
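
To make the comparison concrete, here is a minimal sketch of a SQL-style `GROUP BY` and `ORDER BY` in `pandas`. The DataFrame `toy` and its columns `group` and `value` are made up for illustration and are not part of the problem set.

```python
import pandas as pd

# Hypothetical toy data, not from the problem set
toy = pd.DataFrame({
    "group": ["test", "train", "test", "train"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Roughly: SELECT group, AVG(value), COUNT(*) FROM toy GROUP BY group
summary = toy.groupby("group")["value"].agg(["mean", "count"])

# Roughly: SELECT * FROM toy ORDER BY value DESC (or sorting a sheet in Excel)
toy_sorted = toy.sort_values("value", ascending=False)

print(summary)
print(toy_sorted)
```

`groupby` followed by `agg` returns a new DataFrame indexed by the grouping column, so the result can be accessed and subset like any other DataFrame.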

### Accessing data in DataFrame  
There are multiple ways to go about it:

In [23]:
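df["A"]            # bracket notation
# df.A             # attribute notation
# df.loc[:,"A"]    # .loc needs labels. First input is rows, second is columns. ":" means take all
# df.iloc[:,0]     # .iloc needs integer positions
# Check that all four give the same column (.equals compares the values element by element):
# df["A"].equals(df.A) and df["A"].equals(df.loc[:,"A"]) and df["A"].equals(df.iloc[:,0])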

True
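
The difference between `.loc` and `.iloc` is easiest to see when the index is not the default `0, 1, 2, ...`. The sketch below uses a made-up DataFrame `toy` with a string index (an assumption for illustration only).

```python
import pandas as pd

# Hypothetical toy data with a string index instead of the default 0, 1, 2, ...
toy = pd.DataFrame({"A": [10, 20, 30], "B": [1.5, 2.5, 3.5]},
                   index=["x", "y", "z"])

by_label = toy.loc["y", "A"]     # row labelled "y", column named "A"
by_position = toy.iloc[1, 0]     # second row, first column

# Both select the same cell
print(by_label, by_position)

# Slices differ: .loc includes the end label, .iloc excludes the end position
print(toy.loc["x":"y", "A"].tolist())   # rows "x" and "y"
print(toy.iloc[0:1, 0].tolist())        # row at position 0 only
```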

### Creating new columns  
You can add new columns to the DataFrame, and math operations are allowed:  

In [36]:
df["C/D"] = df['C'] / df['D']
df

| | A | B | C | D | E | F | C/D |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo | 0.333333 |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo | 0.333333 |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo | 0.333333 |


### Subsetting DataFrame
Sometimes you only need specific parts of a DataFrame. To subset, the `.loc` indexer is often used:

In [40]:
# Subset DataFrames
boolean_array = df['E'] == 'test'
print(boolean_array)

df_new = df.loc[boolean_array,['B','E','C/D']] # only want rows where boolean array is True + specified columns
df_new

0     True
1    False
2     True
Name: E, dtype: bool


| | B | E | C/D |
|---|---|---|---|
| 0 | 2013-01-02 | test | 0.333333 |
| 2 | 2013-01-02 | test | 0.333333 |
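
Boolean arrays can also be combined. The following is a small sketch on made-up data (the DataFrame `toy` is hypothetical) showing how several conditions can be joined before passing them to `.loc`.

```python
import pandas as pd

# Hypothetical toy data, not from the problem set
toy = pd.DataFrame({
    "E": ["test", "train", "test", "train"],
    "D": [3, 5, 7, 9],
})

# Each condition needs its own parentheses; use & (and), | (or), ~ (not)
is_test_and_large = (toy["E"] == "test") & (toy["D"] > 4)
subset = toy.loc[is_test_and_large, ["E", "D"]]

# .isin() is handy when matching against several values
subset_isin = toy.loc[toy["D"].isin([3, 9]), :]

print(subset)
print(subset_isin)
```

Note that Python's `and` / `or` do not work element-wise on a `Series`, which is why `&` and `|` are used here.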


### Pandas wrapped up
These functions will get you far.  
**Remember**: the answers to the PS are suggested answers - what matters is the right result.  
However, don't overcomplicate things.  


For your next project (**Data Project**) - you will be using `pandas`.  
If you're spending time on the **Inaugural Project** today, fear not.  
You will also be dealing with `pandas` in the next Problem Set.  

## Numerical Integration
**General** problem:
\begin{equation}
    \mathbb{E}[g(x)]=\int_{x \in X} g(x)f(x)\,dx
\end{equation}
where $g:\mathbb{R}\rightarrow \mathbb{R}$ is some function and $f(x)$ is the PDF for $x$.  

**General** solution:  
Relying on the **LLN** (law of large numbers) we can **approximate** the true integral with a finite sample, i.e. turn it into a discrete sum:
\begin{equation}
    \mathbb{E}[g(x)]\approx \sum_{i=1}^{N} g(x_i) w_{i} 
\end{equation} 
In **Monte Carlo integration** we draw $N$ (pseudo-)random $x_i$ from $f(x)$ and give every draw the same weight, $w_i=\frac{1}{N}$, so that $\sum_{i=1}^{N} w_i=1$.  
This means the integral can be approximated by
\begin{equation}
    \mathbb{E}[g(x)]\approx \frac{1}{N} \sum_{i=1}^{N} g(x_i)
\end{equation}
**In conclusion:** the most likely values of $x$ carry the most weight because they are sampled most often, so the draws are already weighted appropriately in MC integration. Taking the simple mean is therefore sufficient.
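
As a quick sanity check of this logic, consider a case with a known answer. The sketch below assumes $x \sim \mathcal{N}(0,1)$ and $g(x)=x^2$, so that $\mathbb{E}[g(x)]=1$. Both choices, and the seed, are picked for illustration only and are unrelated to the project.

```python
import numpy as np

rng = np.random.default_rng(seed=2024)  # seed chosen arbitrarily for reproducibility

def g(x):
    """Example integrand; E[g(x)] = 1 when x ~ N(0, 1)."""
    return x**2

for N in [100, 10_000, 1_000_000]:
    x = rng.normal(loc=0.0, scale=1.0, size=N)  # draw x_i from f(x)
    estimate = np.mean(g(x))                    # (1/N) * sum of g(x_i)
    print(f"N = {N:>9,}: estimate = {estimate:.4f}, abs. error = {abs(estimate - 1):.4f}")
```

The error typically shrinks as $N$ grows, in line with the LLN, although for any single seed it need not fall monotonically.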

**Question 3** of the Inaugural Project is presented as an integral - you should apply the previous logic.  
Meaning:
1. Draw `x` from the beta distribution
2. Evaluate `u( )` as seen in the question 
3. Return its mean
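
The three steps above could be sketched as follows. This is only an outline under stated assumptions: the function `u` below is a placeholder (here $u(x)=x^2$), the helper `mc_expectation` is a hypothetical name, and the beta parameters `a=2`, `b=7` are arbitrary. Replace all of them with what the question specifies.

```python
import numpy as np

def u(x):
    """Placeholder only: replace with the function from the question."""
    return x**2

def mc_expectation(func, a, b, N, seed=None):
    """Monte Carlo estimate of E[func(x)] for x ~ Beta(a, b)."""
    rng = np.random.default_rng(seed)
    x = rng.beta(a, b, size=N)   # 1. draw x from the beta distribution
    values = func(x)             # 2. evaluate the function at each draw
    return np.mean(values)       # 3. return the mean

# Placeholder parameters, not those of the project
estimate = mc_expectation(u, a=2.0, b=7.0, N=100_000, seed=1)
print(estimate)
```

For this placeholder the exact value is $\frac{a(a+1)}{(a+b)(a+b+1)}=\frac{1}{15}\approx 0.0667$, which gives a way to check that the estimate is in the right neighbourhood before swapping in the real function.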