# Lab1 Python for data science!

This is an introductory class.

The objectives of this lab are for you to get:
* familiar with the lab structure
* a Python crash course / reminder
* an introduction / reminder to key data science libraries (Numpy, Pandas)
* experience with loading, validating, and visualizing data.

## Quick note on the labs
The labs will be made available on [GitHub](https://github.com/Faur/ITU-Data-Science-in-Games-Exercises) on a rolling basis.
Be sure to have the most recent version locally by pulling from the repo.
This can be done from the notebook by using the cell below.
Remove the comment symbol `#` and run the cell (`Ctrl` + `Enter`).
`!` tells the notebook to run the command in the terminal, instead of in the Python interpreter.

Some important notes:
* **Shut down notebooks** when you are done. Otherwise the server will run out of resources, and we will be forced to restart them.
* Server storage is volatile! I.e. you must **save everything locally** that you don't want to lose.

In [None]:
# ! git pull

## Python Crash Course

If you are new to Python [this 45 min video](https://www.youtube.com/watch?v=N4mEzFDjqtA) gives a good introduction to the key concepts.


In [None]:
# Makes matplotlib plots work better with Jupyter
%matplotlib inline

# Import the necessary libraries. 
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Task 1: Loading the data

> Estimated task time: 10 minutes.

For this first assignment you must

1) load the data in `./data/Data-Mining-Spring-2018.csv` using **`pandas`**.
Pandas is an open-source Python library providing high-performance, easy-to-use data structures and data analysis tools.
It is very popular among data scientists and statisticians as it allows you to work very quickly and efficiently.
 * Have a look at the `pandas.read_csv` function.

2) Make sense of the data by printing the first 10 rows. Determine the number of observations and features in the data, and have a quick look at what data types they are (or should be).
* Pandas dataframes have the `head` method, which is useful for printing a limited number of observations.
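
The loading and inspection steps can be sketched on a small in-memory CSV. The column names and values below are made up for illustration and are not the lab data; `io.StringIO` only stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file; the lab data has other columns.
csv_text = """Name,Age,Height
Ann,23,170
Bob,31,182
Cy,unknown,165
"""

# read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.head(2))  # first 2 rows (head defaults to 5)
print(toy.shape)    # (number of observations, number of features)
print(toy.dtypes)   # Age is not numeric because of the 'unknown' entry
```

Note how a single non-numeric entry is enough for a whole column to be read as text, which is why checking the data types early is worthwhile.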

In [None]:
# Check that the data directory and the data file are present
basedir = "./"
assert os.path.isdir(basedir+"data") and os.path.exists(basedir + "data/Data-Mining-Spring-2018.csv"), 'Data not found. Make sure to have the most recent version!'


In [None]:
## YOUR CODE HERE 


## Task 2: Cleaning the data

> Estimated task time: 10 minutes.

We don't want to work with all the features for this exercise.

1) Select a subset of the features, as defined by `feature_sub` (cell below).

2) Rename the columns, such that `What degree are you studying?` becomes `Degree`, and `Shoe Size` becomes `ShoeSize`. Not having spaces (or long names) makes it easier to work with the data in `pandas`.
 * Look at the `rename` method.

3) Convert the columns to the appropriate data formats (e.g. `Age` should be a float, and `Gender` should be a string).
 * `to_numeric` is a useful method (but not the only way) to convert strings to numerical values, and the `errors` argument can be used to handle errors.
 * `dropna` can be used to remove `nan` values.
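
A minimal sketch of these three steps on toy data follows. The DataFrame, its column names, and its values are hypothetical; only `rename`, `pd.to_numeric`, and `dropna` are taken from the hints above.

```python
import pandas as pd

# Hypothetical toy data; names and values are made up for illustration.
toy = pd.DataFrame({
    "Age": ["23", "31", "twenty"],
    "Shoe Size": ["38", "44", "41"],
    "Favourite colour": ["red", "blue", "green"],
})

# 1) Keep a subset of the columns (copy so we do not modify a view of `toy`).
toy_sub = toy[["Age", "Shoe Size"]].copy()

# 2) Rename columns via a mapping from old to new names.
toy_sub = toy_sub.rename(columns={"Shoe Size": "ShoeSize"})

# 3) Convert strings to numbers; errors="coerce" turns unparsable entries into NaN.
for col in ["Age", "ShoeSize"]:
    toy_sub[col] = pd.to_numeric(toy_sub[col], errors="coerce")

# Drop every row that contains a NaN in any column.
toy_sub = toy_sub.dropna()
print(toy_sub)
```

Using `errors="coerce"` followed by `dropna` discards unparsable rows silently, so it is worth checking how many rows were lost.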


In [None]:
feature_sub = ['Age', 'Gender', 'Shoe Size', 'Height', 'What degree are you studying?']

In [None]:
## YOUR CODE HERE


## Visualizing the Data
Now that we have `Age`, `ShoeSize`, and `Height` as numerical columns, we can start visualizing them.
Simple visualizations, like histograms, are an easy way to get a sense of the data and to check for outliers, faulty values, or anything else we need to take care of.

In [None]:
def hist_plot(data):
    plt.figure(figsize=[8,8])

    plt.subplot(211)
    plt.hist(data.Age.values)
    plt.title("Age")

    plt.subplot(223)
    plt.hist(data.ShoeSize.values)
    plt.title("ShoeSize")

    plt.subplot(224)
    plt.hist(data.Height.values)
    plt.title("Height")

    plt.tight_layout()
    plt.show()

hist_plot(data_sub)

## Task 3: Remove Invalid values

> Estimated task time: 15 minutes.

In the histograms above we see that several values seem suspicious, e.g. a height of 19 cm is probably not true.
In this exercise your job is to remove the faulty observations.
This is of course fundamentally a subjective task, where you will have to rely on your domain knowledge.

1) Remove the observations with invalid data points.
 * `df.where`/`df.mask` in conjunction with `dropna` can be useful for these kinds of operations.

2) Visualize the data again. If it still looks strange go back to 1)
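
As a sketch of how `where`/`mask` combine with `dropna`, consider the toy example below. The data and the bounds are hypothetical and only illustrate the mechanics; they are not recommended cut-offs for the lab data.

```python
import pandas as pd

# Hypothetical toy data; the bounds below are illustrative only.
toy = pd.DataFrame({"Age": [23.0, 31.0, 250.0], "Height": [170.0, 19.0, 182.0]})

# where keeps values for which the condition is True and sets the rest to NaN.
valid_height = toy["Height"].between(100, 230)
toy["Height"] = toy["Height"].where(valid_height)

# mask is the inverse: it sets values to NaN where the condition is True.
toy["Age"] = toy["Age"].mask(toy["Age"] > 120)

# The invalid observations now contain NaN, so dropna removes them.
toy_clean = toy.dropna()
print(toy_clean)
```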


In [None]:
## YOUR CODE HERE


## Task 4: Convert Gender to Integers

> Estimated task time: 15 minutes.

We often prefer working with integer class labels, rather than strings. 
As you can see, the gender has been specified in several different ways, so you need to do some work to make the data interpretable.
For this task you should:

1) Create a new column called `GenderNumerical` with 0's for males, 1's for females, and 2's for other.
 * Define a function that interprets the `Gender` string, and returns the appropriate number.
  * Python distinguishes between upper and lower case, so when working with strings it can help to convert everything to lower case.
 * Use `df.apply` to apply the function to every element of the `Gender` column.

2) Determine the ratio of the three gender categories.
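
The same pattern can be sketched on an unrelated toy column. The data, the `answer_to_int` function, and the label scheme are hypothetical and only show how `apply` and `value_counts` fit together.

```python
import pandas as pd

# Hypothetical toy data: a free-text answer mapped to integer labels.
toy = pd.DataFrame({"Answer": ["Yes", "yes ", "NO", "maybe", "no"]})


def answer_to_int(text):
    """Return 0 for yes, 1 for no, and 2 for anything else."""
    cleaned = str(text).strip().lower()  # ignore case and surrounding spaces
    if cleaned == "yes":
        return 0
    if cleaned == "no":
        return 1
    return 2


# Series.apply calls the function once per element of the column.
toy["AnswerNumerical"] = toy["Answer"].apply(answer_to_int)

# normalize=True returns the fraction of rows per label instead of counts.
ratios = toy["AnswerNumerical"].value_counts(normalize=True).sort_index()
print(ratios)
```

Normalizing the string before comparing (here with `strip` and `lower`) keeps the number of cases the function has to handle small.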


In [None]:
print("Values in the 'Gender' column:")
print(np.unique(data_sub['Gender']))

In [None]:
def determine_gender_numerical(string):
    ## YOUR CODE HERE
    pass  # placeholder so the cell runs; replace with your implementation


## Scatter plot visualization

Now that we have removed the faulty observations from the data, we can visualize it further.
As long as the number of features is small, a pair plot is an easy way to quickly get an overview of the relationships between the different features.

We can make such a plot easily using `seaborn`, a popular visualization library.
It is based on `matplotlib`, but provides a higher-level API, making it one of the easiest ways to make pretty plots.
It also has nice `pandas` integration, as shown below.
Another cool Python library to look into is [`bokeh`](https://bokeh.pydata.org).
It allows you to easily create interactive plots.

**Question**: What relationships do you see in the data?

In [None]:
sns.pairplot(data_sub, hue="Degree", diag_kind='hist')
# diag_kind='hist' is necessary when you have small classes, as kde-plot fails for classes with one observation.

plt.show()

## Task 5: Normalize Data

> Estimated task time: 10 minutes.

For the last task you must normalize the data.
Many data science methods require that we first normalize the data.
Typically we would want to use a library (e.g. `sklearn.preprocessing.StandardScaler`), but for this task you should do it yourself.


1) Make a new DataFrame, `data_norm`, where all the floating point columns are normalized to zero mean and unit (1) variance using the following equation:
$$
x_{norm} = \frac{x-\mu}{\sigma}
$$

2) Add the `Degree` column to the normalized DataFrame.
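
The equation can be sketched on a hypothetical toy DataFrame as shown below. Note one assumption: pandas' `std` defaults to the sample standard deviation (`ddof=1`), so `ddof=0` is passed here to get exactly unit population variance; either convention may be acceptable for the task.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns and one label column.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [10.0, 20.0, 60.0],
    "label": ["x", "y", "z"],
})

# Select only the numeric columns; the label column cannot be normalized.
num = toy.select_dtypes(include="number")

# mean() and std() are computed per column and broadcast across the rows.
toy_norm = (num - num.mean()) / num.std(ddof=0)

# Re-attach the label column; rows align on the shared index.
toy_norm["label"] = toy["label"]
print(toy_norm)
```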

In [None]:
data_norm = None ## YOUR CODE HERE


## PCA Visualization

One example of a method that requires normalization is Principal Component Analysis (PCA).
PCA is a popular visualization technique, as it allows you to project high dimensional data into a low dimensional space.
This is useful for reducing the number of features, or for visualizing features.

A simple PCA plot that projects 'Age', 'ShoeSize', 'Height' down into 2 dimensions is performed below.
We cannot use 'Degree' and 'GenderNumerical' to compute the projection, as they are nominal.
We can, however, use them to color the plots in order to see how well PCA separates the classes.
We see that PCA doesn't seem to separate 'Degree', whereas 'GenderNumerical' is separated somewhat nicely.
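
To illustrate what the projection does, here is a rough NumPy sketch of PCA via the singular value decomposition. It uses randomly generated, hypothetical data, and it is not how the cell below computes the projection (that cell relies on `sklearn.decomposition.PCA`).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 observations of 3 correlated features.
base = rng.normal(size=(100, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(100, 1)) for _ in range(3)])

# Standardize each column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)

# Project the data onto the first two principal directions.
projected = X_std @ Vt[:2].T

# Fraction of the total variance captured by each component.
explained = S**2 / np.sum(S**2)
print(projected.shape)
print(explained)
```

Because the three features are strongly correlated here, the first component should capture most of the variance.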

In [None]:
from sklearn.decomposition import PCA

data_as_numpy = data_norm[['Age', 'ShoeSize', 'Height']].values
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(data_as_numpy)
principalDf = pd.DataFrame(data = principalComponents, columns = ['PC1', 'PC2'])

principalDf['Degree'] = data_sub.Degree.values
principalDf['GenderNumerical'] = data_sub.GenderNumerical.values

sns.scatterplot(data=principalDf, x='PC1', y='PC2', hue='Degree')
plt.show()

sns.scatterplot(data=principalDf, x='PC1', y='PC2', hue='GenderNumerical', palette=sns.color_palette("muted",3))
plt.show()