# Workshop 01: Introduction to Data Science

In [1]:
print("helolo")

helolo


## Setup

1. **Get the data**. Download the Titanic: Machine Learning from Disaster dataset from Kaggle (both "train.csv" and "test.csv"); put the two files in the same directory as this notebook.
2. **Install `pandas`**. Open Anaconda3 navigator, and open the console (Environments --> root). Type `conda install pandas` into the console, and wait for it to install.
3. **Install `sklearn`**. The same as in 2, but typing `conda install scikit-learn` (the package is named `scikit-learn`; `sklearn` is the name you import in Python).

## Dataset Anatomy

**Some basic definitions**. Let's do away with the boring shit first. If we're going to be doing data science, then we need to talk about what a "dataset" is.

* **Object**, a real-world entity that we describe with data.
* **Features**, the properties of an object that we use to describe it.
* **Data**, actual measurements of properties of objects.
* **Labels**, the category that the object belongs to.
* **Dataset**, a set of feature-vectors and associated labels.
* **Model**, a function that takes an object's features and predicts its label.

**An example**. Take, for example, an enrollment list for a subject. In this case, each student (corresponding to a row in our list) constitutes an *object*. The *features* that we record for each student might be (for example) "StudentID", "Name", "AssignmentMark1", "AssignmentMark2", "ExamMark", and "Passed/Failed". The values of these features for any given student constitute the *data* for that student. Let's say that "Passed/Failed" is the thing that we want to predict; "Passed/Failed" then becomes our *label*. We could construct some kind of model that takes in a student's "AssignmentMark1" and "AssignmentMark2" and predicts whether they pass or fail the subject. This could then be used on next year's student cohort to identify any students in danger of failing.

## Data Science Process

The point of this workshop is to give you an idea of what "doing data science" actually entails, and then to help you through that process (and some of the tools we could use) on an example data set. This process has roughly five steps:

1. **Cleaning** your data;
2. **Exploring** your data for interesting patterns;
3. **Feature Engineering** to generate new informative features for the data;
4. **Modelling**, that is, building your classifier (using one of the common algorithms); and
5. **Explaining** your model (that is, why it works well or poorly).

This is obviously a pretty naive workflow, but it gives you a bit of a sense of where to start. In practice, there's a lot of back-and-forth between the steps (especially 2, 3, and 4).

### 1. Cleaning

When you first get your data, you'll usually find that it's "unclean" - that is, not yet suitable to be used in any of our models. This will typically be because there are **missing values**, **anomalously extreme values (outliers)**, or because the **data is non-numeric**. Cleaning means dealing with each of these issues.

**Dealing with Missing Values & Outliers**
* **Ignore them.** This is lightweight and easy, but you'll probably run into issues later on when it comes to training your model (some models raise errors on null values).
* **Delete them.** Delete the rows with missing/extreme data. This saves you from having to deal with null-value errors later on, as well as the complexities of imputation (the next thing), but it's also kind of wasteful.
* **Impute them.** Replace each missing/extreme value with an estimate, such as the mean or median of that feature.
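
The "delete" and "impute" options above can be sketched in `pandas` roughly as follows. The dataframe `people` and its values are made up for illustration, and median imputation is just one common choice, not the only one.

```Python
import numpy as np
import pandas as pd

# Hypothetical data with one missing "Age" value
people = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Delete: drop every row that contains at least one missing value
deleted = people.dropna()

# Impute: fill missing ages with the median of the observed ages
imputed = people.copy()
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())
```

Deleting loses the whole row (including its perfectly good "Fare"), while imputing keeps the row at the cost of inventing a plausible "Age".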

**Dealing with Non-Numeric Data**
* **Encoding.**

### Aside: The Pandas Dataframe

If we want to use Python, then we need to get the data into some kind of Python object to mess around with it. We're going to use the `DataFrame` supplied by the `pandas` module.

A `DataFrame` is pretty much the same thing as a table (like in Excel). Columns pertain to features (e.g. age, weight) and rows to data objects (e.g. a person).
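
To make the table analogy concrete, here is a small sketch that builds a dataframe by hand. The `students` data is hypothetical, loosely following the enrollment-list example from earlier.

```Python
import pandas as pd

# Hypothetical enrollment list: each dict key is a feature (column),
# each position in the lists is one student (row).
students = pd.DataFrame({
    "StudentID": [101, 102, 103],
    "AssignmentMark1": [72, 45, 88],
    "ExamMark": [65, 38, 91],
    "Passed": [True, False, True],
})

print(students.shape)          # (number of rows, number of columns)
print(list(students.columns))  # the feature names
```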

### Reading the Data
```Python
import pandas as pd
df = pd.read_csv('data_file.csv') # read "data_file.csv" into a DataFrame
df.head() # returns the first five rows of the dataframe (displayed when it's the last line of a cell)
```

### Initial Exploration
After reading the data into `pandas`, it's good practice to check that everything's been read in correctly, then mess around with it a little to get a sense of what features you have, and how they might be related to the label.

```Python
df.describe() # generates summary statistics over the data
df.info() # prints dataframe metadata

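df['col_name'].value_counts() # counts how often each distinct value occurs in a column
pd.crosstab(df['col_name_1'], df['col_name_2']) # counts each combination of values from two columns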
```
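
As a rough illustration of what the last two calls return, here they are on a tiny made-up dataframe (`sample` and its values are assumptions, not rows from the Titanic data).

```Python
import pandas as pd

sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Survived": [0, 1, 1, 0, 1],
})

# How many rows take each value of "Sex"
sex_counts = sample["Sex"].value_counts()

# Counts of every (Sex, Survived) combination
table = pd.crosstab(sample["Sex"], sample["Survived"])
```

A cross-tabulation of a feature against the label is a quick way to see whether that feature looks related to the thing you're trying to predict.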

### Cleaning the Data

**Slicing data by row-index and column-name**. Just like a table, a dataframe lets you look up data by column or by row.

```Python
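one_col = df['col_name']                     # a single column
two_cols = df[['col_name_1', 'col_name_2']]  # several columns (note the double brackets)
df.loc[0]                                    # the row whose index label is 0
df.iloc[0]                                   # the first row, by integer position
df.at[0, 'col_name']                         # a single value: row label 0, column 'col_name'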
```

**Conditional slicing**

```Python
something

```
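
Conditional slicing means keeping only the rows that satisfy some condition. A minimal sketch, using a made-up `passengers` dataframe with Titanic-style column names:

```Python
import pandas as pd

passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Pclass": [3, 1, 3, 3],
})

# A comparison on a column yields a boolean Series (one True/False per row)
is_female = passengers["Sex"] == "female"

# Indexing with that mask keeps only the rows where it is True
females = passengers[is_female]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
older_third_class = passengers[(passengers["Age"] > 30) & (passengers["Pclass"] == 3)]
```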

**Finding dirty data**. Finding NaNs and outliers.
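
One way to go looking for NaNs and outliers is sketched below. The `data` dataframe is made up, and the "more than 10x the median" rule is only a crude assumption for illustration; what counts as an outlier depends on the feature.

```Python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 24.0],
    "Fare": [7.25, 8.05, 7.93, 512.33, np.nan],
})

# Count the missing values in each column
missing_per_column = data.isna().sum()

# Rows that contain at least one missing value
rows_with_missing = data[data.isna().any(axis=1)]

# Crude outlier check (an assumption, not a standard rule):
# flag fares more than 10x the median fare
fare_median = data["Fare"].median()
suspect_fares = data[data["Fare"] > 10 * fare_median]
```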

**Creating new columns**

As with a table, it's really easy to look up data by column or by row with a dataframe, using any of the slicing methods shown earlier.

## Exercises

**Primary**. We're going to start off by reading and messing around with our data. Move "data.csv" into the same directory as this Jupyter Notebook.
1. Read the data (check that it's been read correctly).
2. Familiarise yourself with the data a little more.
3. Clean the data (we want a DataFrame that has no missing values).

**Additional**. 
1. 

## 3. Feature Engineering

## 4. Modelling/Evaluation

In [1]:
import pandas as pd

## Dataframes

`df = pd.read_csv(<path_to_data_file>)`  
`df.head()`  
`df.describe()`  
`df[<column_name>]`
`df[]`

**Reading data into pandas dataframes**  
Raw data is typically stored either in files (like .csv) or in some kind of database. If we want to use Python to manipulate that data, however, we need to read it into some kind of idiomatic data structure - pandas provides the `DataFrame` object for this purpose. A `DataFrame` is essentially a table: each column holds one feature and each row holds one object.

In [26]:
# read data into pandas dataframe
df = pd.read_csv("datasets/titanic_train.csv")

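# look up a single value: the "Sex" of the passenger in row 0
# (only the last expression in a cell is displayed; wrap this in print() to see it)
df.at[0, 'Sex']
# show the "Sex" and "Embarked" columns for rows 0 to 9 (.loc slices include the end label)
df.loc[0:9, ['Sex', 'Embarked']]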

Unnamed: 0,Sex,Embarked
0,male,S
1,female,C
2,female,S
3,female,S
4,male,S
5,male,Q
6,male,S
7,male,S
8,female,S
9,female,C


**Creating new columns**  
The raw data you read directly into pandas dataframes will pretty much always be "unclean" (i.e. unsuitable for consumption by a model). This is likely because (1) the data is non-numeric or non-boolean (especially in the case of strings like the "Name" feature of this dataframe), (2) some of the data is missing, and/or (3) some of the data has been entered erroneously. Forget about (2) and (3) for the moment; let's focus on (1).

Firstly, notice that values for the "Sex" feature are represented as strings ("male" and "female"). The problem with this is that the string cannot be consumed by most models (e.g. k-nearest neighbours). Since it's a binary feature (i.e. only takes two possible values), we can encode it as a new binary feature that takes the value 1 if "Sex" is "male", and 0 if "female".

Let's do the same with "Embarked" (embarkation location). Since "Embarked" can take three values, "C", "Q", and "S", we'll need to encode the strings using either (1) three binary variables ("EmbarkedAtC", "EmbarkedAtQ", and "EmbarkedAtS") or (2) a simple numerical mapping (C->1, Q->2, S->3).

In [11]:
# define a function to transform one of the values from "Sex" into a binary digit
def binarize_sex(sex):
    if sex == "male":
        return 1
    else:
        return 0

# create a new column "IsMale" by 'applying' binarize_sex to each row of the "Sex" column
df["IsMale"] = df["Sex"].apply(binarize_sex)

# do the same for "Embarked"
def encode_embarked(emb):
    if emb == "C":
        return 1
    elif emb == "Q":
        return 2
    else:  # any other value, including a missing one, is treated like "S"
        return 3

df["EmbarkedEncoded"] = df["Embarked"].apply(encode_embarked)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,IsMale,EmbarkedEncoded
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1,3
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,3
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0,3
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1,3
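
The cell above uses option (2), the numerical mapping. Option (1), three binary variables, could be sketched with `pd.get_dummies` as below. The `ports` dataframe is made up for illustration, and the `prefix`/`prefix_sep` arguments are only there to reproduce the "EmbarkedAtC"-style names mentioned earlier.

```Python
import pandas as pd

ports = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]})

# One 0/1 column per distinct value; prefix + prefix_sep build the column names
one_hot = pd.get_dummies(ports["Embarked"], prefix="EmbarkedAt", prefix_sep="", dtype=int)

# Attach the new columns to the original dataframe
ports = pd.concat([ports, one_hot], axis=1)
```

The trade-off is roughly this: the numerical mapping is compact but implies an ordering (C < Q < S) that doesn't really exist, which can matter for distance-based models like k-nearest neighbours; the binary columns avoid that at the cost of extra features.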


The features "SibSp" (number of siblings or spouses) and "Parch" (number of parents or children) are sort of hard to understand when taken alone. Let's try to aggregate the two features into a single feature, "CountFamily" (number of family members).

In [12]:
# generate new column "CountFamily" via element-wise summation of "SibSp" and "Parch"
df["CountFamily"] = df["SibSp"] + df["Parch"]
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,IsMale,EmbarkedEncoded,CountFamily
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1,3,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0,1,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,3,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0,3,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1,3,0


**Dropping existing columns**  
At this point, we want to drop all the features that we're not going to use in our model.

In [13]:
# drop the columns
#   inplace=True, change the dataframe object in-place, rather than returning a copy
#   axis=1, perform the operation on the column axis (1) rather than the row axis (default, 0)
df.drop(["Name", "Sex", "SibSp", "Parch", "Ticket", "Cabin", "Embarked"], inplace=True, axis=1)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,Fare,IsMale,EmbarkedEncoded,CountFamily
0,1,0,3,22.0,7.25,1,3,1
1,2,1,1,38.0,71.2833,0,1,1
2,3,1,3,26.0,7.925,0,3,0
3,4,1,1,35.0,53.1,0,3,1
4,5,0,3,35.0,8.05,1,3,0


**Writing-out to data file**  
So you're done cleaning your data. Now we either want to (1) pass it straight into a model or (2) save it in its clean form on the disk.

In [14]:
# write the dataframe to csv
#  index=False, don't write the row index (the unnamed leftmost column in the outputs above) to the file; "PassengerId" already identifies each row
df.to_csv("datasets/titanic_train_clean.csv", index=False)
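
To see what `index=False` changes, here is a small sketch that writes a made-up dataframe to in-memory text buffers (rather than real files) both ways and compares the header lines.

```Python
import io
import pandas as pd

clean = pd.DataFrame({"PassengerId": [1, 2], "Age": [22.0, 38.0]})

# Write to in-memory buffers, once with and once without the index
with_index = io.StringIO()
without_index = io.StringIO()
clean.to_csv(with_index)                  # index=True is the default
clean.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]

# Reading the indexed version back turns the nameless index into a column
reread = pd.read_csv(io.StringIO(with_index.getvalue()))
```

Here `header_with` begins with an empty column name, and reading that version back yields an extra column that pandas names "Unnamed: 0".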

## Exercise
Write a *function* which takes a raw dataframe (from the titanic dataset) and returns a clean dataframe (like the one generated above). Try it on the file "datasets/titanic_test.csv".

In [29]:
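# the exercise asks for the test file (this is what the output below shows)
df = pd.read_csv("datasets/titanic_test.csv")

# relies on binarize_sex and encode_embarked from the cells above; modifies df in place
# (unlike the walkthrough above, this version does not add "CountFamily")
def clean_titanic(df):
    df["IsMale"] = df["Sex"].apply(binarize_sex)
    df["EmbarkedEncoded"] = df["Embarked"].apply(encode_embarked)
    df.drop(["Name", "Sex", "SibSp", "Parch", "Ticket", "Cabin", "Embarked"], inplace=True, axis=1)
    return df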

# test clean_titanic
clean_titanic(df)

Unnamed: 0,PassengerId,Pclass,Age,Fare,IsMale,EmbarkedEncoded
0,892,3,34.5,7.8292,1,2
1,893,3,47.0,7.0000,0,3
2,894,2,62.0,9.6875,1,2
3,895,3,27.0,8.6625,1,3
4,896,3,22.0,12.2875,0,3
5,897,3,14.0,9.2250,1,3
6,898,3,30.0,7.6292,0,2
7,899,2,26.0,29.0000,1,3
8,900,3,18.0,7.2292,0,1
9,901,3,21.0,24.1500,1,3
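
The solution above changes the dataframe you pass in. As an alternative sketch (not the workshop's official answer), the variant below works on a copy and also adds "CountFamily" as in the walkthrough. The name `clean_titanic_copy` and the two-row `raw` sample are hypothetical.

```Python
import pandas as pd

def clean_titanic_copy(raw):
    """Return a cleaned copy of a raw Titanic dataframe, leaving `raw` untouched."""
    df = raw.copy()
    df["IsMale"] = (df["Sex"] == "male").astype(int)
    # same mapping as encode_embarked: anything other than "C" or "Q" becomes 3
    df["EmbarkedEncoded"] = df["Embarked"].map({"C": 1, "Q": 2}).fillna(3).astype(int)
    df["CountFamily"] = df["SibSp"] + df["Parch"]
    return df.drop(columns=["Name", "Sex", "SibSp", "Parch", "Ticket", "Cabin", "Embarked"])

# Tiny hypothetical sample with the same columns as the raw test data
raw = pd.DataFrame({
    "PassengerId": [1, 2], "Pclass": [3, 1],
    "Name": ["A", "B"], "Sex": ["male", "female"], "Age": [22.0, 38.0],
    "SibSp": [1, 1], "Parch": [0, 2], "Ticket": ["t1", "t2"],
    "Fare": [7.25, 71.28], "Cabin": [None, "C85"], "Embarked": ["S", "C"],
})
cleaned = clean_titanic_copy(raw)
```

Because `raw` is left alone, you can re-run the cell (or try a different cleaning strategy) without re-reading the file.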


# Removed

## 1. Introduction to Data Science

### Data Science Paradigms

The kinds of problems that we're interested in in data science can be split (crudely) into two categories: supervised learning, and unsupervised learning.
* **Supervised Learning**. Given a set of input variables, X (features), with corresponding output variables, y (labels), come up with some kind of function that maps X to y. For example, you might be given a data set of people's ages, weights, and heights (the features), and their associated sexes (the labels); under a supervised learning paradigm, we try to build a model that predicts a person's sex given their age, weight, and height.
* **Unsupervised Learning**. Given a set of input variables, X, learn the underlying distribution of the data. "Learning the underlying distribution" of the data is a fancy way of saying "find records that are either similar to or associated with each other" in some way. Notice that this sounds a lot more open-ended than supervised learning. That's mainly because - in this case - we're not given any notion of "truth" (i.e. a label) that we're trying to guess at. It's our job to actually discover patterns in the data.

We'll motivate this workshop under the **supervised learning** paradigm.