LEARNING OBJECTIVES:
- List & describe fundamental statistical evaluations
- Discuss the use of the t-test/ANOVA as general-purpose statistics


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.integrate import solve_ivp
import re
import seaborn as sns

In [None]:
titanic = sns.load_dataset("titanic")

# Working with real-world datasets
So far we have not talked about the fact that datasets you find in the real world are often messy.  
Data points may be **missing**, and others may be measured incorrectly and lie far **outside** the range of the replicates.  
The data may also be saved in unusual formats or be otherwise difficult to read.  
Most importantly, measurements usually follow some kind of distribution, and we might want to find out whether and how something changed between conditions.

Let's load a dataset and start by cleaning it, that is, preparing our data for analysis.

## Problems when loading
The first problems can already arise when you try to load the data.  
Try to load the file `wrk_example_data.csv`.  

In [None]:
pd.read_csv("../data/wrk_example_data.csv")

### Exercise 1.1
Also load the file `viz_example_data.csv` that you used before.  
Can you work out what went wrong when loading the previous file?

In [None]:
df2 = pd.read_table("../data/viz_example_data.csv", sep="\t")
df = pd.read_table("../data/wrk_example_data.csv", sep="\t")
df
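
One way to spot a separator mismatch without opening the file by hand is to let Python guess the delimiter. The sketch below uses the standard library's `csv.Sniffer` on a small in-memory sample and then passes the detected delimiter to pandas. The sample text is made up for illustration and is not taken from the course files.

```python
import csv
import io

import pandas as pd

# Made-up tab-separated content standing in for a file on disk.
raw = "sample\tvalue\tgroup\na\t1.5\tctrl\nb\t2.0\ttreat\n"

# Reading with the default comma separator squeezes everything into one column.
wrong = pd.read_csv(io.StringIO(raw))

# csv.Sniffer guesses the delimiter from the text; we restrict the candidates.
dialect = csv.Sniffer().sniff(raw, delimiters=",;\t")
right = pd.read_csv(io.StringIO(raw), sep=dialect.delimiter)

print(wrong.shape)  # one column: the separator was not recognized
print(right.shape)  # three columns after using the detected separator
```

Sniffing is a heuristic and can fail on unusual files, so it is still worth looking at the first few lines of a file yourself.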

Another problem can be **file types** that are old, rarely used, or, most often, *proprietary*, meaning they belong to a licensed program (e.g. Excel).  
Such data may be hard to read with most open-source software, and you often have to convert it to a common type like CSV.  
To do that, you have to know exactly which format your data is in, and even then some proprietary formats simply cannot be read without the proper program.  
(Therefore, when depositing data, it is best to export it to an easily readable format.)
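
As a minimal sketch of such an export, the snippet below writes a small made-up DataFrame to CSV text and reads it back. It writes to an in-memory buffer (`io.StringIO`) so that nothing is saved to disk; with a real dataset you would pass a file path to `to_csv` instead.

```python
import io

import pandas as pd

# Made-up measurements standing in for data loaded from another format.
df_demo = pd.DataFrame({"sample": ["a", "b", "c"], "value": [1.5, 2.0, 3.25]})

buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)  # index=False avoids an extra unnamed column

# Rewind the buffer and read the CSV text back in.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```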

### Exercise 1.2
We have saved the same data in another format as `wrk_example_data.dta`.  
Take 5-10 minutes to google or ask your neighbors which data type that is, and find out how to read it into Python.  
(Tip: you can use pandas)

In [None]:
# .dta is the Stata file format, which pandas reads with read_stata
df = pd.read_stata("../data/wrk_example_data.dta")

## Bad values
One thing to keep in mind when working with Excel:  
While Excel is quite easy to use and powerful for data analysis, it can have problems with automatically assigning data types (like text, number, date, etc.) to values.  
Especially when entering data into a spreadsheet, or when changing or copying values, make sure that the values are saved correctly.  
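
A quick way to notice wrongly assigned types after loading is to inspect `dtypes`. The sketch below uses a made-up column in which one entry is a date-like string instead of a number. `pd.to_numeric` with `errors="coerce"` turns entries it cannot parse into `NaN`, which makes them easy to locate. The column name and values are hypothetical.

```python
import pandas as pd

# Made-up column: the third entry is a date-like string rather than a number.
df_demo = pd.DataFrame({"value": ["1.5", "2.0", "2-Mar", "4.1"]})

print(df_demo.dtypes)  # the column is not stored as a numeric type

# Convert to numbers; entries that cannot be parsed become NaN.
numeric = pd.to_numeric(df_demo["value"], errors="coerce")

# Show the original entries that failed to convert.
suspicious = df_demo.loc[numeric.isna(), "value"]
print(suspicious)
```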

### Exercise 1.3
Load the file `wrk_example_data.xlsx`.  
Observe the data and hypothesize what error happened.  
Could you reverse it?

In [None]:
df = pd.read_excel("../data/wrk_example_data.xlsx")

In [None]:
df

## Not Available (NA) or Not a Number (NaN)
Let's now move on to bigger datasets that are closer to real-world examples.
We can look at a dataset on the people aboard the Titanic, which we loaded at the beginning of the notebook as `titanic`.

In [None]:
titanic.head()

We can see that there are `NaN`s or `NA`s, i.e. missing values, in the `deck` column of our data.  
This can happen because the value was simply not measured, or in other cases because a calculation returned a value that is Not a Number (`NaN`, e.g. the division 0/0).  
It's important to know that they are there and to decide how to deal with them.  
**By default, pandas ignores them in most operations such as `sum`, `mean`, etc.**  
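
This default can be seen in a small made-up Series: `mean` skips the missing value unless `skipna=False` is passed.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry.
values = pd.Series([1.0, 2.0, np.nan, 3.0])

default_mean = values.mean()             # NaN is skipped: (1 + 2 + 3) / 3
strict_mean = values.mean(skipna=False)  # NaN propagates into the result

print(default_mean)
print(strict_mean)
```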

We should take a closer look at where the `NA`s are and how big a problem they are.  
The method `isna` returns a DataFrame in which every entry is `True` if the original value is `NA` and `False` otherwise.  
The `any` method can then tell us, for each column, whether any of these entries is `True`:  

In [None]:
titanic.isna().any()

### Exercise 1.4 
1. Using the `isna` and `value_counts` methods, find out how many people's entries contain `NA`s.  
2. (optional) Find out in which columns these are and give a recommendation on how to deal with them.

In [None]:
titanic.isna().value_counts()
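
`value_counts` on the boolean DataFrame lists every combination of missing columns. If only the number of incomplete rows is needed, one alternative is to reduce each row with `any(axis=1)` first. The sketch uses a small made-up DataFrame instead of `titanic` so that it runs on its own; the column names only mimic the real ones.

```python
import numpy as np
import pandas as pd

# Made-up data with missing values in two columns.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "deck": ["C", np.nan, np.nan, "E"],
    "fare": [7.25, 71.3, 8.05, 53.1],
})

# True for every row that contains at least one NA.
row_has_na = toy.isna().any(axis=1)
print(row_has_na.value_counts())

# Number of NAs per column.
na_per_column = toy.isna().sum()
print(na_per_column)
```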

You should have seen that most `NA` entries are in the `deck` column and that only about 20% of all rows are fully complete.  
`pandas` has a method called `dropna` that removes all rows containing `NA`s; however, right now we would lose far too much data.  
Let's assume we are not really interested in the deck the people were on and remove that column, then use `dropna`.

In [None]:
titanic2 = titanic.drop("deck", axis=1)
titanic2 = titanic2.dropna()
titanic2.head()
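
An alternative to dropping the column first is the `subset` parameter of `dropna`, which only considers the listed columns when deciding which rows to remove. The following is a minimal sketch on made-up data; the column names mirror the Titanic ones, but the values are invented.

```python
import numpy as np
import pandas as pd

# Made-up data: "deck" is often missing, "age" only once.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "deck": ["C", "B", np.nan, np.nan],
    "fare": [7.25, 71.3, 8.05, 53.1],
})

# Dropping every row with any NA keeps only the fully complete rows.
all_complete = toy.dropna()

# Only require "age" to be present; "deck" stays in the data with its NAs.
age_complete = toy.dropna(subset=["age"])

print(len(all_complete), len(age_complete))
```

Which of the two approaches fits better depends on whether the sparse column is still needed later in the analysis.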