## This notebook describes the proces of exploring and manipulating the data

For review purpose, there is an option to run or not run specific lines of code. Use run_all to run all script.

In [8]:
run_all = False

**Imports**

In [21]:
import pandas as pd
import math
import matplotlib.pyplot as plt


**Load db into DataFrame**

In [3]:
# df = pd.read_csv("../dload_db/csv/db_mcr.csv")
# Temporary static file, until we can use Django-server
df = pd.read_csv("../dload_db/csv/static_db_mcr.csv")

#### Exploration of data
This database represents subjects (persons) with a certain lifestyle and projected lifespan. Ultimately, this data is used to predict lifespan based on the other variables.

First, we answer some exploratory question. E.g. what kind of data are we working with, do we understand the variabels (columns), is the data making sense for each variable, are we missing data etc. I am starting with the most simple descriptives and build up from there.

In [9]:
run = False

if run or run_all:
    # Headers (variables), dtypes and shape
    print("Headers (variables):", [variable for variable in df.columns])
    print("Datatypes:", [dtype for dtype in df.dtypes])
    print("Rows: ", df.shape[0], "; Cols: ", df.shape[1], sep="")
    print("") # Create enter after information
    
    # Same can be done via .info()
    print(df.info())
    print("") # Create enter after information

    # Display head()
    display(df.head())

**Interpretation**<br>
Firstly we see that there are 8 variables. Of these 1 is a dependent variable (lifespan) and 7 are independent variables. 

These variables are describes as:<br>
***genetic:** the expected lifespan in years based genetic factors, independent on lifestyle<br>
**lenght:** lenght in cm<br>
**mass:** weight (mass) in kg<br>
**exercise:** (average) daily hours of exercise<br>
**smoking:** (average) daily number of sigarettes<br>
**alcohol:** (average) daily number of glasses alcohol<br>
**suger:** (average) daily cubes of sugar (4 gramms)<br>
**lifespan (dependant):** projected lifespan in years, dependent on lifestyle<br>*

Based on the descriptions, we can deduce that all variables are numbers, mostly floats but also possibly ints (lenght?). All variables should be positive numbers and depending on the variable have a certain logical range. 

We see in '.info()' that some variables have dtype 'object' which is equivalent to 'string'. We need to find out why and whether we need to manipulate the data. Also, we see that there are no 'Null' or empty datapoints from comparing the index (shape) with the non-null count. However, there might be datapoints that have other annotations to indicate data is missing (e.g. "NaN") or other non-numerical data.

##### Check for non numerical data
Start with listing all indices that have non numerical data. Respecify the dtype as either int() or float(). For all non-complient data, list the (unique) indices (in a set) and decide what to do with them.

In [10]:
run = False

if run or run_all:
    # List indices of all non numerical entries
    list_non_num = []
    for variable in df.columns:
        list_non_num = list_non_num + df.index[~pd.to_numeric(df[variable], errors="coerce").notnull()].tolist()

    set_non_num = set(list_non_num)
    print("There are ", len(set_non_num), " instances with missing data: ", set_non_num, sep="")

    # Remove incomplete/missing data from dataset
    df = df.loc[~df.index.isin(set_non_num)]

    # Then we convert the object-type variables to numeric and check.
    for variable in df.columns:
        df[variable] = pd.to_numeric(df[variable])
    print("Datatypes:", [dtype for dtype in df.dtypes])

**Interpretation**<br>
There are 10 indices (rows) with non-numerical data. In the code above, (for now) I have decided to exclude them from the dataset, as they account for <0,25% of the data. Possibly in later stages I might want to use a different way to handle these exeptions, if the model requires so.

*Note: if the output indicates that "there are 0 instances with missing data", this might be due to running an already 'cleaned' version of df.*

##### Check for illogical data
For each variable I decide by 'common sense' and looking at the data what the range should be in which the data is 'makes sense', or in other words is 'logical'. We start with looking at the histograms to get an impression of the distribution, min/max values etc. 

For all data that does not comply to the logical range, list the (unique) indices (in a set) and decide what to do with them.

In [68]:
run = False

if run or run_all:
    # Prepare and dispay histograms
    for variable in df.columns:
        # Set bins of histogram to whole numbers 
        bin_min = df[variable].min(); bin_max = df[variable].max()
        bins = range(math.floor(bin_min), math.ceil(bin_max))

        # Define the histogram
        n, bins, patches = plt.hist(x=df[variable], bins=bins, rwidth=0.75) 
        plt.grid(axis='y', alpha=0.75)
        plt.ylabel("Count")
        plt.title(f"Histogram of {variable}")
        plt.show()

# Define range that the data 'logically' can in a dictionary
range_var = {
    'genetic': (40, 120),
    'length': (147, 220),
    'mass': (30, 200),
    'exercise': (0, 8),
    'smoking': (0, 60),
    'alcohol': (0, 12),
    'sugar': (0, 50),
    'lifespan': (40, 120)
}

# # List data that does not comply with range
# list_illogical = []
# for variable in df.columns:
#     # Load range of variable in range_var
#     range_min = range_var[variable][0]; range_max = range_var[variable][1]
#     # list_illogical = list_illogical + df.index[(df[variable] < range_min) | (df[variable] > range_max)]

print(list_illogical)

test = df[df['genetic'] < 40]
print(test)

Float64Index([], dtype='float64')
Empty DataFrame
Columns: [genetic, length, mass, exercise, smoking, alcohol, sugar, lifespan]
Index: []


**Interpretation**

        # (this makes most/all variables 'logically interpretable', but I must beware that the data are mostly floats)