# Lab1 Python for data science!

This is an introductory class.

The objectives of this lab are for you to get:
* familiar with the lab structure
* a Python crash course / reminder
* an introduction / reminder to key data science libraries (NumPy, pandas)
* experience with loading, validating, and visualizing data.

## Quick note on the labs
The labs will be made available on [GitHub](https://github.com/Faur/ITU-Data-Science-in-Games-Exercises) on a rolling basis.
Be sure to have the most recent version locally by pulling from the repo.
This can be done from the notebook by using the cell below.
Remove the comment symbol `#` and run the cell (`Ctrl` + `Enter`).
`!` tells the notebook to run the command in the terminal instead of in the Python interpreter.

Some important notes:
* **Shut down notebooks** when you are done. Otherwise the server will run out of resources, and we will be forced to restart them.
* Server storage is volatile! I.e. you must **save everything locally** that you don't want to lose.

In [None]:
# ! git pull

## Python Crash Course

If you are new to Python [this 45 min video](https://www.youtube.com/watch?v=N4mEzFDjqtA) gives a good introduction to the key concepts.


In [None]:
# Makes matplotlib plots work better with Jupyter
%matplotlib inline

# Import the necessary libraries. 
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Task 1: Loading the data

For this first assignment you must:

1) load the data in `./data/Data-Mining-Spring-2018.csv` using **`pandas`**.
Pandas is an open-source Python library providing high-performance, easy-to-use data structures and data analysis tools.
It is very popular among data scientists and statisticians as it allows you to work very quickly and efficiently.

2) Make sense of the data by printing the first 10 values. Determine the number of observations and features in the data, and have a quick look at what data types they are (or should be).
* Pandas dataframes have the `head` method, which is useful for printing a limited number of observations.
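
As a minimal sketch of these steps, the snippet below reads a small CSV from an in-memory string rather than the lab file. The column names and values are made up for illustration and are not taken from the real data set.

```python
import io

import pandas as pd

# Hypothetical toy CSV standing in for the real survey file.
toy_csv = io.StringIO(
    "Age,Gender,Height\n"
    "23,male,180\n"
    "twenty,female,165\n"
    "31,F,172\n"
)

toy = pd.read_csv(toy_csv, sep=",")

# shape is a (rows, columns) tuple, i.e. (observations, features).
n_observations, n_features = toy.shape
print("Observations:", n_observations, "Features:", n_features)

# A column with any non-numeric entry (here "twenty") is read as text,
# so its dtype tells you that it still needs cleaning.
print(toy.dtypes)

# head(n) returns the first n rows (fewer if the frame is shorter).
print(toy.head(2))
```

Comparing `dtypes` with what each column should contain is a quick way to spot columns that need conversion later.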

In [None]:
# Check that the data directory and the data file are present
assert os.path.isdir("./data") and os.path.exists("./data/Data-Mining-Spring-2018.csv"), 'Data not found. Make sure to have the most recent version!'


In [None]:
## YOUR CODE HERE 
data = pd.read_csv('data/Data-Mining-Spring-2018.csv', sep=",")
print('Data shape:', data.shape)
print('I.e. there are',data.shape[0],'observations, each with',data.shape[1],'features.')
# print(data.columns.values)  # <-- Print the names of all the features.

data.head(10)

## Task 2: Cleaning the data
1) We don't want to work with all the features for this exercise. Select a subset of the features, as defined by `feature_sub` (cell below)

2) Rename the columns, such that `What degree are you studying?` becomes `Degree`, and `Shoe Size` becomes `ShoeSize`. Not having spaces (or long names) makes it easier to work with the data in `pandas`.
 * Look at the `rename` method.

3) Convert the columns to the appropriate data formats (e.g. `Age` should be a float).
 * `to_numeric` is a useful function (but not the only way) to convert strings to numerical values, and its `errors` parameter controls what happens to values that cannot be parsed.
 * `dropna` can be used to remove `nan` values.
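
The three steps can be sketched on a toy frame as follows. The data and the extra `Favourite colour` column are hypothetical and only serve to show how each call behaves.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": ["23", "twenty", "31"],
    "Shoe Size": ["42", "38", "n/a"],
    "Favourite colour": ["red", "blue", "green"],
})

# 1) Keep only the columns of interest (copy so the original stays untouched).
toy_sub = toy[["Age", "Shoe Size"]].copy()

# 2) Rename via a mapping from old names to new names.
toy_sub = toy_sub.rename(columns={"Shoe Size": "ShoeSize"})

# 3) errors="coerce" turns anything unparsable into NaN instead of raising.
for col in ["Age", "ShoeSize"]:
    toy_sub[col] = pd.to_numeric(toy_sub[col], errors="coerce")

# dropna() removes every row that contains at least one NaN.
toy_clean = toy_sub.dropna()
print(toy_clean)
```

Note that `dropna` discards the whole observation even if only one feature is missing, so it is worth checking how many rows remain afterwards.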


In [None]:
feature_sub = ['Age', 'Gender', 'Shoe Size', 'Height', 'What degree are you studying?']

In [None]:
## YOUR CODE HERE

# 1
data_sub = data[feature_sub]

# 2
data_sub = data_sub.rename(columns={'What degree are you studying?': "Degree",
                                    'Shoe Size': "ShoeSize"})

# 3
data_sub['Age'] = pd.to_numeric(data_sub['Age'], errors='coerce')
data_sub['ShoeSize'] = pd.to_numeric(data_sub['ShoeSize'], errors='coerce')
data_sub['Height'] = pd.to_numeric(data_sub['Height'], errors='coerce')
data_sub = data_sub.dropna()

print(data_sub.describe())
print('\n\n')
print(data_sub.head(15))


Now that `Age`, `ShoeSize`, and `Height` are numerical, we can start visualizing them.
Simple visualizations, like histograms, are an easy way to check for outliers or faulty data that we need to take care of.
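
To see why a histogram exposes outliers, it can help to look at the bin counts directly. The sketch below uses `np.histogram` on made-up heights (not the lab data) and prints a text version of the bars.

```python
import numpy as np

# Hypothetical heights in cm; 19 is a deliberately implausible value.
heights = np.array([19, 165, 170, 172, 175, 178, 180, 183, 185, 190])

# 10 equal-width bins between the minimum and maximum value.
counts, bin_edges = np.histogram(heights, bins=10)

# Print one text "bar" per bin.
for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"{left:6.1f} to {right:6.1f}: {'#' * int(count)}")
```

A single isolated bar far away from the rest, with empty bins in between, is the typical signature of a faulty entry.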

In [None]:
def hist_plot(data):
    plt.figure(figsize=[8,8])

    plt.subplot(211)
    plt.hist(data.Age.values)
    plt.title("Age")

    plt.subplot(223)
    plt.hist(data.ShoeSize.values)
    plt.title("ShoeSize")

    plt.subplot(224)
    plt.hist(data.Height.values)
    plt.title("Height")

    plt.tight_layout()
    plt.show()

hist_plot(data_sub)

## Task 3: Remove Invalid values

In the histograms above we see that several values seem suspicious, e.g. a height of 19 cm is probably not true.
In this exercise your job is to remove the faulty observations.
This is of course a fundamentally subjective task, where you will have to rely on your domain knowledge.

1) Remove the observations with invalid data points.
 * `df.where`/`df.mask` can be useful for these kinds of operations.

2) Visualize the data again. If it still looks strange, go back to 1).
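
The difference between `mask` and `where` is easiest to see on a small example. The data and the thresholds below are assumptions chosen for illustration; pick your own limits from domain knowledge.

```python
import pandas as pd

# Hypothetical toy data; 19 cm and 400 cm are deliberately implausible.
toy = pd.DataFrame({
    "Height": [19.0, 172.0, 400.0, 181.0],
    "Age": [22.0, 25.0, 30.0, 27.0],
})

# Assumed plausibility limits for a height in cm.
implausible = (toy["Height"] < 100) | (toy["Height"] > 250)

# mask() replaces values where the condition is True (with NaN by default) ...
heights_masked = toy["Height"].mask(implausible)
# ... while where() keeps values where the condition is True.
heights_where = toy["Height"].where(~implausible)

toy["Height"] = heights_masked

# Dropping the NaN rows removes the faulty observations entirely.
toy_clean = toy.dropna()
print(toy_clean)
```

The two calls are mirror images of each other, so which one to use is mostly a matter of whether it is more natural to state the condition for bad values or for good ones.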


In [None]:
## YOUR CODE HERE

# TODO: look at df.where https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html

data_sub.mask(data_sub.Age < 15, np.nan, inplace=True)
data_sub.mask(data_sub.Age > 70, np.nan, inplace=True)

data_sub.mask(data_sub.ShoeSize < 20, np.nan, inplace=True)
data_sub.mask(data_sub.ShoeSize > 60, np.nan, inplace=True)

data_sub.mask(data_sub.Height < 100, np.nan, inplace=True)
data_sub.mask(data_sub.Height > 250, np.nan, inplace=True)

data_sub = data_sub.dropna()
hist_plot(data_sub)

## Task 4: Convert Gender to Integers
We often prefer working with integer class labels, rather than strings. 
As you can see the gender has been specified in several different ways, so you need to do some work making the data nicer.
For this task you should:

1) Create a new column called `GenderNumerical` with 0's for males, 1's for females, and 2's for other.
 * Define a function that interprets the `Gender` string, and returns the appropriate number.
  * Python distinguishes between upper and lower case, so when working with strings it can sometimes help converting everything to lower case.
 * Use `df.apply` to apply the function to every element in the `Gender` column.

2) Determine the ratio of the three gender categories.
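
A minimal sketch of both steps on made-up answers is shown below. The helper name `to_label` and the accepted spellings are hypothetical; the real column will contain other variants that you need to handle.

```python
import pandas as pd

# Hypothetical free-text answers, made up for illustration.
toy = pd.DataFrame(
    {"Gender": ["Male", " female", "M", "Woman ", "prefer not to say"]}
)


def to_label(text):
    """Map a free-text answer to 0 (male), 1 (female) or 2 (other)."""
    cleaned = text.strip().lower()  # ignore surrounding spaces and case
    if cleaned in {"m", "male", "man"}:
        return 0
    if cleaned in {"f", "female", "woman"}:
        return 1
    return 2


toy["GenderNumerical"] = toy["Gender"].apply(to_label)

# normalize=True returns fractions of the total instead of raw counts.
ratios = toy["GenderNumerical"].value_counts(normalize=True).sort_index()
print(ratios)
```

Stripping whitespace before comparing avoids having to list variants such as `'man '` separately.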


In [None]:
print("Values in the 'Gender' column:")
print(np.unique(data_sub['Gender']))

In [None]:
def determine_gender_numerical(string):
    ## YOUR CODE HERE
    
    string = string.lower()
    if string in ['man', 'male', 'alpha male', 'man ', 'm']:
        return 0
    elif string in ['f', 'female', 'woman']:
        return 1
    else:
        return 2

data_sub['GenderNumerical'] = data_sub['Gender'].apply(determine_gender_numerical)

print("Male\t", np.sum(data_sub.GenderNumerical==0) / len(data_sub))
print("Female\t", np.sum(data_sub.GenderNumerical==1) / len(data_sub))
print("Other\t", np.sum(data_sub.GenderNumerical==2) / len(data_sub))

data_sub.head(10)

## Scatter plot visualization

Another cool Python library to look into is [Bokeh](https://bokeh.pydata.org).
It allows you to easily create interactive plots.

In [None]:
sns.set(style="ticks")

sns.pairplot(data_sub, hue="Degree")
plt.suptitle("Subset", y=1.02)  # figure-level title; plt.title would only label the last subplot
plt.show()

## Task 5: Normalize Data

For this task you are supposed to normalize the numerical features, so that each of them has zero mean and unit standard deviation.
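
One common form of normalization is standardization (z-scoring): subtract the column mean and divide by the column standard deviation. The sketch below assumes this is the intended normalization and uses made-up values, including hypothetical `Degree` labels.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration.
toy = pd.DataFrame({
    "Height": [160.0, 170.0, 180.0],
    "Degree": ["Games", "SDT", "Games"],
})

# Only numerical columns can be standardized; text columns are left out.
numeric = toy.select_dtypes(include="number")

# pandas' std() uses the sample standard deviation (ddof=1) by default.
toy_norm = (numeric - numeric.mean()) / numeric.std()
print(toy_norm)
```

After this step every numerical feature is on the same scale, which matters for methods such as PCA that are sensitive to the variance of each feature.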


In [None]:
## YOUR CODE HERE

def normalize(df):
    result = pd.DataFrame()
    for feature_name in df.columns:
        if df[feature_name].dtype == np.float64 or df[feature_name].dtype == np.int64:
            result[feature_name] = (df[feature_name] - df[feature_name].mean()) / (df[feature_name].std())
    return result

data_norm = normalize(data_sub)
data_norm['Degree'] = data_sub.Degree

data_norm.head(10)

## PCA Visualization

In [None]:
from sklearn.decomposition import PCA

data_as_numpy = data_norm[['Age', 'ShoeSize', 'Height', 'GenderNumerical']].values
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(data_as_numpy)
principalDf = pd.DataFrame(data = principalComponents, columns = ['PC1', 'PC2'])
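# Use .values: data_sub's index has gaps after dropna, so index-based alignment would mismatch rows.
principalDf['Degree'] = data_sub.Degree.values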

sns.scatterplot(data=principalDf, x='PC1', y='PC2', hue='Degree')
plt.show()