# Live Coding Session 2:
What you will learn in this session:
 * Control Structures in Python
     * Conditional Processing (if-then-else)
     * Looped Processing (for- and while-loops)
 * Type conversion (casting)
 * Numpy
 * Matplotlib
 

## Task Description: 
The task, which was already sketched in [Session1](Session1.ipynb), is solved in this notebook. Instead of using only a few self-defined data instances, we import a comprehensive dataset, the publicly available [insurance dataset](../../R/Lecture/data/insurance.csv).

## For Loop
This example demonstrates only the basic usage of for-loops. The subject is described more thoroughly in the notebook [04ControllStructures.ipynb](../Lecture/04ControllStructures.ipynb).


**Tasks:**
1. Read all data from the file [insurance dataset](../../R/Lecture/data/insurance.csv) into a nested list: each row of the file shall be represented by one inner list, and all inner lists shall be collected in one outer list. Use a for-loop to read the file line by line.
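One possible way to approach this task is sketched here. To keep the example runnable without the file, it reads from a small in-memory sample instead of `insurance.csv`; the sample rows are made up, and the column layout (`age, sex, bmi, children, smoker, region, charges`) is an assumption about the file.

```python
import io

# Made-up sample standing in for insurance.csv.
# The column layout is an assumption about the real file.
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.92\n"
    "18,male,33.77,1,no,southeast,1725.55\n"
)

clientList = []                      # outer list
for line in sample:                  # a file object yields one line per iteration
    row = line.strip().split(",")    # inner list: one entry per column
    clientList.append(row)

print(clientList[0])   # the header row
print(clientList[1])   # first data row, all values are still strings
```

With the real file, `sample` would be replaced by a file object, for example `with open(path) as sample:`, where `path` points to the CSV file.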

## Conditional Execution
This example demonstrates only the basic usage of conditional expressions. The subject is described more thoroughly in the notebook [04ControllStructures.ipynb](../Lecture/04ControllStructures.ipynb).

**Tasks:**
1. As can be seen in the code cell above, the first line of the file is a header that specifies the feature names. Re-implement the reading process such that the header row is excluded from the data rows.
2. The output of the code cell above also shows that all data has been imported as strings. Re-implement the reading process such that the numeric features are converted to their appropriate types.
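A minimal sketch of both tasks, again on a made-up in-memory sample rather than the real file. It assumes the column order `age, sex, bmi, children, smoker, region, charges`, with `age` and `children` as integers and `bmi` and `charges` as floats.

```python
import io

# Made-up sample standing in for insurance.csv (column order is an assumption).
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.92\n"
    "18,male,33.77,1,no,southeast,1725.55\n"
)

clientList = []
for i, line in enumerate(sample):
    if i == 0:
        continue                     # conditional: skip the header row
    age, sex, bmi, children, smoker, region, charges = line.strip().split(",")
    # Type conversion (casting) of the numeric features
    clientList.append(
        [int(age), sex, float(bmi), int(children), smoker, region, float(charges)]
    )

print(clientList)
```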

**Tasks:**
1. For all numeric features, determine the mean in the smoker partition and the mean in the non-smoker partition. On average, do smokers have
    * a higher BMI? 
    * more children?
    * higher charges?
    * a higher age?
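The partition means could be computed in plain Python roughly as follows. The four rows are made-up values, so the printed means say nothing about the real dataset; the helper `partition_means` and the column indices are hypothetical and assume the column order `age, sex, bmi, children, smoker, region, charges`.

```python
# Made-up rows with already converted numeric features.
clientList = [
    [19, "female", 27.9, 0, "yes", "southwest", 16884.92],
    [18, "male", 33.77, 1, "no", "southeast", 1725.55],
    [28, "male", 33.0, 3, "no", "southeast", 4449.46],
    [62, "female", 26.29, 0, "yes", "southeast", 27808.73],
]

SMOKER_COL = 4                                      # assumed index of `smoker`
numeric = {"age": 0, "bmi": 2, "children": 3, "charges": 6}


def partition_means(rows, col):
    """Return (mean of smokers, mean of non-smokers) for column `col`."""
    smoker_sum, nonsmoker_sum = 0.0, 0.0
    smoker_n, nonsmoker_n = 0, 0
    for row in rows:
        if row[SMOKER_COL] == "yes":
            smoker_sum += row[col]
            smoker_n += 1
        else:
            nonsmoker_sum += row[col]
            nonsmoker_n += 1
    return smoker_sum / smoker_n, nonsmoker_sum / nonsmoker_n


for name, col in numeric.items():
    mean_smoker, mean_nonsmoker = partition_means(clientList, col)
    print(name, "smokers:", mean_smoker, "non-smokers:", mean_nonsmoker)
```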

## Numpy
Arithmetic operations, like the calculation of descriptive statistics in the example above, can be implemented much more conveniently and efficiently with the numpy package. [Numpy](https://docs.scipy.org/doc/) provides efficient data structures and a large collection of functions for scientific computing. The main data structure is the numpy-array. The basics of numpy are described in the [numpy introduction notebook of this lecture](../Lecture/NP01numpyBasics.ipynb).

The code cells below demonstrate how numpy can be applied to efficiently solve the task of the example above (do features like BMI differ between the groups of smokers and non-smokers?).

First, a numpy-array is generated from the nested list `clientList` (in the third line of the code-cell below).
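A sketch of this step, assuming `clientList` holds the rows as they were read from the file, i.e. as strings. The two rows are made up, and the name `clientArray` is hypothetical.

```python
import numpy as np

# Made-up rows, all values as strings (as read from the file).
clientList = [
    ["19", "female", "27.9", "0", "yes", "southwest", "16884.92"],
    ["18", "male", "33.77", "1", "no", "southeast", "1725.55"],
]

clientArray = np.array(clientList)   # 2D array: one row per client

print(clientArray.shape)   # (number of rows, number of columns)
print(clientArray.dtype)   # a unicode string type
print(clientArray)
```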

As can be seen in the output, the new numpy-array keeps the same data as the nested `clientList`. All values are stored as strings. 

Next, all numeric features of the insurance data are extracted, and the type of these numeric columns is converted to float.
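The extraction and conversion might look like this. The column indices assume the order `age, sex, bmi, children, smoker, region, charges`; the names `clientArray`, `numericCols` and `numericArray` are hypothetical, and the rows are made up.

```python
import numpy as np

# Made-up string array standing in for the imported data.
clientArray = np.array([
    ["19", "female", "27.9", "0", "yes", "southwest", "16884.92"],
    ["18", "male", "33.77", "1", "no", "southeast", "1725.55"],
])

numericCols = [0, 2, 3, 6]           # assumed: age, bmi, children, charges
# Select all rows and the numeric columns, then cast strings to float.
numericArray = clientArray[:, numericCols].astype(float)

print(numericArray.dtype)
print(numericArray)
```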

Numpy provides many functions for calculating descriptive statistics. These functions can be applied to any numpy-array. In the code cell below, only the mean of each numeric feature is calculated: first the mean over all clients, then the mean over the smoker partition and the non-smoker partition, respectively. Note how easily numpy-arrays can be filtered (split) with respect to the values of a variable. Here, the array of all clients is partitioned with respect to the value of the feature `smoker`.
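The filtering described above can be sketched with a boolean mask. The values are made up, and the names `numericArray`, `smoker` and `isSmoker` are hypothetical; the columns are assumed to be `age, bmi, children, charges`.

```python
import numpy as np

# Made-up numeric features: age, bmi, children, charges (one row per client).
numericArray = np.array([
    [19, 27.9, 0, 16884.92],
    [18, 33.77, 1, 1725.55],
    [28, 33.0, 3, 4449.46],
    [62, 26.29, 0, 27808.73],
])
smoker = np.array(["yes", "no", "no", "yes"])   # `smoker` value of each client

isSmoker = smoker == "yes"                      # boolean mask, one entry per row

meanAll = numericArray.mean(axis=0)                     # column means, all clients
meanSmoker = numericArray[isSmoker].mean(axis=0)        # rows where the mask is True
meanNonSmoker = numericArray[~isSmoker].mean(axis=0)    # rows where the mask is False

print("all:        ", meanAll)
print("smokers:    ", meanSmoker)
print("non-smokers:", meanNonSmoker)
```

`axis=0` computes one mean per column, so each result has one entry per numeric feature.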

## Visualization with Matplotlib
[Matplotlib](http://matplotlib.org/) is the main Python 2D plotting library. In this notebook we apply the *pyplot* module of Matplotlib in order to
* draw scatter plots for visual correlation analysis
* draw histograms for the visualization of value distributions

These are only two of the many plot types provided by pyplot. The basics of matplotlib are described in the [matplotlib introduction notebook of this lecture](../Lecture/PLT01visualization.ipynb).

**Question:**

In the code cell below the feature `charges` is plotted versus the feature `BMI`. What can be concluded from this visualisation?
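A scatter plot of this kind could be drawn roughly as follows. The values are made up for illustration, so the resulting figure does not answer the question for the real dataset.

```python
import matplotlib.pyplot as plt

# Made-up values; in the notebook these would be columns of the client array.
bmi = [27.9, 33.77, 33.0, 26.29]
charges = [16884.92, 1725.55, 4449.46, 27808.73]

fig, ax = plt.subplots()
ax.scatter(bmi, charges)             # one point per client
ax.set_xlabel("BMI")
ax.set_ylabel("charges")
ax.set_title("charges vs. BMI")
```

In a notebook the figure is typically rendered inline; in a plain script, `plt.show()` would be called at the end.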

Next, the histograms of the features
* BMI
* charges
* number of children

are plotted:
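A possible sketch of such a histogram cell, using made-up values and an arbitrarily chosen number of bins:

```python
import matplotlib.pyplot as plt

# Made-up values; in the notebook these would be columns of the client array.
features = {
    "BMI": [27.9, 33.77, 33.0, 26.29],
    "charges": [16884.92, 1725.55, 4449.46, 27808.73],
    "number of children": [0, 1, 3, 0],
}

# One subplot per feature, side by side.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, (name, values) in zip(axes, features.items()):
    ax.hist(values, bins=5)          # bins=5 is an arbitrary choice
    ax.set_title(name)
    ax.set_ylabel("count")
fig.tight_layout()
```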