In [7]:
# This is Lecture05 - Exercise 1 of the "Data Science" class 
# at Technische Hochschule Rosenheim

# Iris Dataset revisited

Do you remember Bernd the Botanist from the previous lecture? The one with the Iris flowers? If not, go back and re-read this business problem!

Bernd has decided to go ahead and solve his Iris flower classification problem. As there are too many flowers for him to measure all by himself, he has asked a few people to help him (you will find all datasets mentioned below in the folder `data`).

### 1) Mary, the biology student
He asks Mary to measure all the Iris setosa he has in the lab. At the end of the week, Mary provides him with a file `setosa.csv` and mentions "sorry it took so long, but I had to study for my final exam in every break I took". 

### 2) Tom, the gardener
He asks Tom to measure all the Iris versicolor. Tom is a very diligent person and comes back after two days with the file `versicolor.xlsx`, telling Bernd "I did all the measurements for the 50 flowers you asked for, first the sepal length and width in centimeters, and then the petal length and width, which I did in millimeters, as the numbers were quite small. Hope this helps!"

### 3) Angi and Angus, two summer interns
He asks Angi and Angus to measure all the Iris virginica. The two decide to split up the work. They number each flower (from 1 to 50). Angi does the sepal measurements while Angus is responsible for the petal measurements. At the end of the week they give him two files: Angi has done her measurements in cm, starting with plant number 1 and going forward (`virginica angi.txt`), and Angus has done his measurements in mm, starting with plant number 50 and going backwards (`virginica angus.csv`).

# Part I - Data Loading, Munging, Missing Values

First, you need to understand the data in more detail, clean and combine it.

**Try to make as many of your solution cells idempotent as possible.**
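
As a minimal sketch of what idempotent means here: a cell that overwrites its own input gives a different result each time it is run, while a cell that derives a new object from an unchanged input can be re-run safely. The toy DataFrame `raw` and the helper names `convert_in_place` and `convert_copy` are made up for illustration and are not part of the exercise data.

```python
import pandas as pd

# Hypothetical toy data (petal lengths in mm), not taken from the exercise files.
raw = pd.DataFrame({"petal length": [47, 45, 49]})

def convert_in_place(df):
    """Non-idempotent: every call divides the stored values again."""
    df["petal length"] = df["petal length"] / 10
    return df

def convert_copy(df):
    """Idempotent: the input stays untouched, so repeated calls agree."""
    return df.assign(**{"petal length": df["petal length"] / 10})

once = convert_copy(raw)
twice = convert_copy(raw)        # same result as `once`, `raw` is unchanged

scratch = raw.copy()
convert_in_place(scratch)
convert_in_place(scratch)        # values have now been divided by 100
```

Re-running a notebook cell behaves like calling the function again, which is why an in-place unit conversion is a typical source of silently wrong values.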

### Exercise I.1

Load each of the 4 files into a separate DataFrame, called 'setosa_raw', 'versicolor_raw', 'angi_virginica_raw' and 'angus_virginica_raw'.

In [8]:
# imports
%matplotlib notebook
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

In [9]:
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

In [10]:
## ---------- SOLUTIONS
setosa_raw = pd.read_csv("data/setosa.csv", sep=";")

versicolor_raw = pd.read_excel("data/versicolor.xls", header=None)
versicolor_raw.columns = ["sepal length", "sepal width", "petal length", "petal width"]

angi_virginica_raw = pd.read_csv("data/virginica angi.txt", sep="\t", header=None, names=["#","sepal length","sepal width"], index_col=0)

angus_virginica_raw = pd.read_csv("data/virginica angus.csv", index_col=0)
angus_virginica_raw.columns = ["petal length", "petal width"]

display("setosa_raw.head()", "versicolor_raw.head()", "angi_virginica_raw.head()", "angus_virginica_raw.head()")

,sepal length,sepal width,petal length,petal width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.9,3.0,1.4,
4,4.6,3.1,1.5,0.2

,sepal length,sepal width,petal length,petal width
0,7.0,3.2,47,14
1,6.4,3.2,45,15
2,6.9,3.1,49,15
3,5.5,2.3,40,13
4,6.5,2.8,46,15

#,sepal length,sepal width
1,6.3,3.3
2,5.8,2.7
3,7.1,3.0
4,6.3,2.9
5,6.5,3.0

#,petal length,petal width
50,51,18
49,54,23
48,52,20
47,50,19
46,52,23


### Exercise I.2

Convert all measurements which are not in cm to cm (by changing the 'xxx_raw' DataFrames) and combine the DataFrames for Angi and Angus into one DataFrame 'virginica_raw'.

In [15]:
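## ---------- SOLUTIONS
petal_cols = ["petal length", "petal width"]

# Guard the mm -> cm conversion so that re-running this cell does not divide twice.
# Heuristic (an assumption about this data): petal lengths in mm are well above 10,
# while the same lengths in cm stay below 10.
if versicolor_raw["petal length"].max() >= 10:
    versicolor_raw[petal_cols] /= 10
if angus_virginica_raw["petal length"].max() >= 10:
    angus_virginica_raw[petal_cols] /= 10

# Match Angi's and Angus's rows by plant number, regardless of row order.
virginica_raw = angi_virginica_raw.merge(angus_virginica_raw, on="#")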


display("setosa_raw", "versicolor_raw", "virginica_raw")

,sepal length,sepal width,petal length,petal width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.9,3.0,1.4,
4,4.6,3.1,1.5,0.2
5,5.0,3.6,1.4,0.2
6,5.4,3.9,1.7,0.4
7,4.6,3.4,1.4,0.3
8,5.0,3.4,1.5,0.2
9,4.6,3.5,,0.1

,sepal length,sepal width,petal length,petal width
0,7.0,3.2,4.7,1.4
1,6.4,3.2,4.5,1.5
2,6.9,3.1,4.9,1.5
3,5.5,2.3,4.0,1.3
4,6.5,2.8,4.6,1.5
5,5.7,2.8,4.5,1.3
6,6.3,3.3,4.7,1.6
7,4.9,2.4,3.3,1.0
8,6.6,2.9,4.6,1.3
9,5.2,2.7,3.9,1.4

#,sepal length,sepal width,petal length,petal width
1,6.3,3.3,6.0,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3.0,5.9,2.1
4,6.3,2.9,5.6,1.8
5,6.5,3.0,5.8,2.2
6,7.6,3.0,6.6,2.1
7,4.9,2.5,4.5,1.7
8,7.3,2.9,6.3,1.8
9,6.7,2.5,5.8,1.8
10,7.2,3.6,6.1,2.5
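
Because Angi and Angus recorded the plants in opposite order, the two tables have to be matched by plant number rather than by row position. The following sketch illustrates the difference on made-up toy values (the frames `sepals` and `petals` are hypothetical); it uses `DataFrame.join`, which aligns on the index, as one possible alternative to `merge`.

```python
import pandas as pd

# Hypothetical toy measurements for three numbered plants.
sepals = pd.DataFrame(
    {"sepal length": [6.3, 5.8, 7.1]},
    index=pd.Index([1, 2, 3], name="#"),
)
# The same plants, recorded in reverse order.
petals = pd.DataFrame(
    {"petal length": [5.9, 5.1, 6.0]},
    index=pd.Index([3, 2, 1], name="#"),
)

# join() aligns rows on the index labels, so plant 1 gets petal length 6.0.
combined = sepals.join(petals)

# Positional concatenation ignores the plant numbers and pairs
# plant 1's sepals with plant 3's petals.
misaligned = pd.concat(
    [sepals.reset_index(drop=True), petals.reset_index(drop=True)],
    axis=1,
)
```

A quick sanity check after combining is to compare a few plant numbers by hand against both source files.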


### Exercise I.3

Now you should have one DataFrame for each kind of Iris (each 'class'). Let's check each of these for missing values!

* Remove all tuples with missing values, creating three new DataFrames 'xxx_nmv' (for NoMissingValues). Do not change the 'xxx_raw' DataFrames!
* Replace all missing values with the mean of the attribute, again creating three new DataFrames 'xxx_mmv' (for MeanMissingValues).

Save all three 'xxx_nmv' datasets as three csv files in the 'output' directory.

In [17]:
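## ---------- SOLUTIONS
# dropna() returns a new DataFrame without the incomplete rows,
# so the 'xxx_raw' DataFrames stay unchanged.
setosa_nmv = setosa_raw.dropna().reset_index(drop=True)
versicolor_nmv = versicolor_raw.dropna()
virginica_nmv = virginica_raw.dropna()

# fillna() without inplace=True returns a new DataFrame; with inplace=True
# it would return None and overwrite the raw data.
setosa_mmv = setosa_raw.fillna(setosa_raw.mean())
versicolor_mmv = versicolor_raw.fillna(versicolor_raw.mean())
virginica_mmv = virginica_raw.fillna(virginica_raw.mean())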

setosa_nmv.to_csv("output/setosa_nmv.csv")
versicolor_nmv.to_csv("output/versicolor_nmv.csv")
virginica_nmv.to_csv("output/virginica_nmv.csv")
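
To see what the two strategies do, it can help to try them on a tiny example first. The sketch below uses a made-up DataFrame `toy` (not the exercise data) with one missing value per column and compares dropping rows with mean imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
toy = pd.DataFrame({
    "petal length": [1.4, np.nan, 1.6],
    "petal width": [0.2, 0.4, np.nan],
})

# Strategy 1: keep only the rows without any missing value.
no_missing = toy.dropna()

# Strategy 2: fill each gap with the mean of its column
# (mean() skips NaN values by default).
mean_filled = toy.fillna(toy.mean())

# Count the remaining gaps in each result; `toy` itself is not modified.
remaining_gaps = {
    "toy": int(toy.isna().sum().sum()),
    "no_missing": int(no_missing.isna().sum().sum()),
    "mean_filled": int(mean_filled.isna().sum().sum()),
}
```

Dropping rows loses two of the three toy observations here, while mean imputation keeps all rows but reduces the variance of the filled columns, which is the trade-off behind creating both variants.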

------