# Lab 1: Python for data science!

This is an introductory class.

The objectives of this lab are for you to get:
* familiar with the lab structure
* a Python crash course / reminder
* an introduction / reminder to key data science libraries (NumPy, pandas)
* experience with loading, validating, and visualizing data.

## Lab structure
The labs will be made available on [GitHub](https://github.com/Faur/ITU-Data-Science-in-Games-Exercises) on a rolling basis.
You can update your local version through the notebook by removing the comment symbol `#` in the cell below and running it.
(`!` tells the notebook to run the command in the terminal instead of in the Python interpreter.)

Some important notes:
* Shut down notebooks when you are done. Otherwise we will run out of resources and be forced to restart the servers.
* Server storage is volatile! I.e., you must **save locally** everything that you don't want to lose.

In [None]:
# ! git pull

## Python Crash Course

If you are new to Python [this 45 min video](https://www.youtube.com/watch?v=N4mEzFDjqtA) gives a good introduction to the key concepts.


In [None]:
# Makes matplotlib plots work better with Jupyter
%matplotlib inline  

# Import the necessary libraries. 
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Task 1: Loading the data

In this first assignment you must:

1) Load the data in `./data/Data-Mining-Spring-2018.csv` using **`pandas`**.
Pandas is an open-source Python library providing high-performance, easy-to-use data structures and data analysis tools.
It is very popular among data scientists and statisticians as it allows you to work very quickly and efficiently.

2) Make sense of the data by printing the first 20 rows. Determine the number of observations and features in the data, and have a quick look at what types the features are (or should be).
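
As a small illustration of step 2, the sketch below inspects a tiny made-up table rather than the real survey file. The column names and values are assumptions chosen for the example. It shows how `shape` gives the number of observations and features, and how a single malformed entry keeps a column from being read as numeric.

```python
import io

import pandas as pd

# Hypothetical stand-in for the survey file; the real CSV has other columns and values.
csv_text = "Age,Height,Gender\n23,180,male\n25,abc,F\n31,172,Woman\n"
toy = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = toy.shape  # (observations, features)
print("Observations:", n_rows, "Features:", n_cols)

print(toy.head(2))  # first two rows
print(toy.dtypes)   # Height is read as text because of the "abc" entry
```

Columns that should be numeric but are reported as text are the ones that need cleaning in the next task.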


In [None]:
# Check that the data directory and the data file are present
assert os.path.isdir("./data") and os.path.exists("./data/Data-Mining-Spring-2018.csv"), 'Data not found. Make sure to have the most recent version!'


In [None]:
## YOUR CODE HERE 
data = pd.read_csv('data/Data-Mining-Spring-2018.csv', sep=",")
print('Data shape:', data.shape)
print('I.e. there are',data.shape[0],'observations, each with',data.shape[1],'features.')
# print(data.columns.values)  # <-- Print the names of the features.

data[:20]

## Task 2: Cleaning the data
1) We don't want to work with all the features for this exercise. Select a subset of the features, as defined by `feature_sub` (cell below).

2) Rename the columns, such that `What degree are you studying?` becomes `Degree`, and `Shoe Size` becomes `ShoeSize`. Not having spaces (or long names) makes it easier to work with the data in `pandas`.
 * Look at the `rename` method.

3) Convert the columns to the appropriate data formats (e.g. `Age` should be a float).
 * `to_numeric` is a useful method (but not the only way) to convert strings to numerical values, and its `errors` parameter can be used to control what happens to values that cannot be converted.
 * `dropna` can be used to remove `nan` values.

4) Visualize each of the numerical features (Age, ShoeSize, Height). The objective here is to see if there are any outliers that we want to take care of.
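
To make the conversion step concrete, here is a minimal sketch on a made-up column (the values are assumptions, not taken from the survey). It shows how `errors="coerce"` turns unparsable entries into `NaN`, which `dropna` then removes.

```python
import pandas as pd

# Hypothetical toy column; "abc" stands in for a malformed survey answer.
toy = pd.DataFrame({"Shoe Size": ["42", "abc", "38.5"]})
toy = toy.rename(columns={"Shoe Size": "ShoeSize"})

# errors="coerce" turns values that cannot be parsed into NaN instead of raising.
toy["ShoeSize"] = pd.to_numeric(toy["ShoeSize"], errors="coerce")
n_missing = int(toy["ShoeSize"].isna().sum())

toy = toy.dropna()  # drop the rows that contain NaN
print(n_missing, toy["ShoeSize"].tolist())
```

Counting the `NaN` values before dropping them is a cheap way to see how much data the cleaning step discards.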

In [None]:
feature_sub = ['Age', 'Gender', 'Shoe Size', 'Height', 'What degree are you studying?']

In [None]:
## YOUR CODE HERE

# 1
data_sub = data[feature_sub].copy()  # copy, so later column assignments don't act on a view of `data`

# 2
data_sub = data_sub.rename(columns={'What degree are you studying?': 'Degree',
                                    'Shoe Size': 'ShoeSize'})

# 3
data_sub['Age'] = pd.to_numeric(data_sub['Age'], errors='coerce')
data_sub['ShoeSize'] = pd.to_numeric(data_sub['ShoeSize'], errors='coerce')
data_sub['Height'] = pd.to_numeric(data_sub['Height'], errors='coerce')
data_sub = data_sub.dropna()

# 4
def hist_plot(data):
    plt.figure(figsize=[8,8])

    plt.subplot(211)
    plt.hist(data.Age.values)
    plt.title("Age")

    plt.subplot(223)
    plt.hist(data.ShoeSize.values)
    plt.title("ShoeSize")

    plt.subplot(224)
    plt.hist(data.Height.values)
    plt.title("Height")

    plt.tight_layout()
    plt.show()
hist_plot(data_sub)

## Task 3: Remove Invalid values

This task is more subjective than the others.
You will use your domain knowledge to determine what are reasonable values for each of these features, and remove the invalid ones.

1) Remove invalid data points.
 * `loc` is useful for this.

2) Visualize the data again. If it still looks strange, go back to 1).
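
One possible way to use `loc` for this is with a boolean mask, sketched below on made-up data. The bounds are illustrative assumptions, not values prescribed by the lab; choosing them is the subjective part of the task.

```python
import pandas as pd

# Hypothetical toy data; the bounds below are illustrative only.
toy = pd.DataFrame({"Age": [23.0, 230.0, 31.0],
                    "Height": [180.0, 175.0, 17.0]})

# between() is inclusive at both ends by default; & combines the two conditions.
valid = toy["Age"].between(10, 70) & toy["Height"].between(100, 250)

toy_clean = toy.loc[valid]  # keep only the rows where every check passes
print(toy_clean)
```

Only the first row survives here: the second has an implausible age and the third an implausible height.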



In [None]:
## YOUR CODE HERE
# Mark values outside a plausible range as NaN, then drop the affected rows
data_sub.loc[~data_sub.Age.between(10.0, 70.0), 'Age'] = np.nan
data_sub.loc[~data_sub.ShoeSize.between(20.0, 60.0), 'ShoeSize'] = np.nan
data_sub.loc[~data_sub.Height.between(100.0, 250.0), 'Height'] = np.nan
data_sub = data_sub.dropna()

data_sub.info()
data_sub[:10]

## Task 4: Convert Gender to Integers
We often prefer working with integer class labels, rather than strings. 
As you can see the gender has been specified in several different ways, so you need to do some work making the data nicer.
For this task you should:

1) Convert males to 0, females to 1, and other to 2.

2) Determine the ratio of the three categories.
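
The two steps can be sketched on a made-up column as follows. The answers and the spelling dictionary are assumptions for illustration; the real column contains its own set of spellings that you need to discover.

```python
import pandas as pd

# Hypothetical answers; the real column contains other spellings as well.
gender = pd.Series(["Male", "f", "man ", "Female", "unicorn"])

# Normalize whitespace and case so that fewer spellings have to be listed.
cleaned = gender.str.strip().str.lower()
codes = {"male": 0, "man": 0, "m": 0, "female": 1, "woman": 1, "f": 1}

# Anything not in the dictionary maps to NaN, which is then labelled 2 (other).
labels = cleaned.map(codes).fillna(2).astype(int)

# normalize=True returns fractions instead of raw counts.
ratios = labels.value_counts(normalize=True).sort_index()
print(ratios)
```

Listing `cleaned.unique()` first is a quick way to see which spellings the dictionary has to cover.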

In [None]:
## YOUR CODE HERE
# Normalize case and surrounding whitespace before matching
data_sub['Gender'] = data_sub['Gender'].str.strip().str.lower()

data_sub.loc[data_sub.Gender.isin(['man', 'male', 'alpha male', 'm']), 'Gender'] = '0'
data_sub.loc[data_sub.Gender.isin(['f', 'female', 'woman']), 'Gender'] = '1'
# Anything that is not '0' or '1' becomes NaN, and is then labelled 2 (other)
data_sub['Gender'] = pd.to_numeric(data_sub['Gender'], errors='coerce').fillna(2).astype(int)

# Ratio of the three categories
print(data_sub['Gender'].value_counts(normalize=True))

data_sub[:20]


## Task 5: Visualize the Data

In [None]:
sns.set(style="ticks")

sns.pairplot(data_sub, hue="Degree")