# Practical 0
## Part II. Exploring data with Pandas

Before we start building models using machine learning libraries such as Scikit-learn or TensorFlow, we almost always need to explore and clean the data.

In Part II of the worksheet, we will use Pandas to explore an example dataset: the Titanic disaster data, taken from a [Kaggle competition](https://www.kaggle.com/c/titanic/overview), where the task is to predict how likely a person was to survive given some ticket information.


In this practical, we will guide you through the use of this package, but in the future we expect you to consult the package's documentation yourself. It can be found here:
https://pandas.pydata.org/pandas-docs/stable/



### Hello Pandas
Pandas is a Python library that contains high-level data structures and manipulation tools designed for data analysis. Think of Pandas as a Python version of Excel. Scikit-learn, on the other hand, is an open-source machine learning library for Python. While Scikit-learn does a lot of the heavy lifting, it is equally important to process the raw data so that we are able to 'feed' it to Scikit-learn. Hence, the ability to manipulate raw data with Pandas makes it an indispensable part of our toolkit.


In [None]:
import pandas

Whilst this is an acceptable way of loading the library, in large projects it can be tiresome to write ```pandas``` in full every time you need to use it. Fortunately, we can use ```as``` when importing the library to give it a shorter name:

In [None]:
import pandas as pd

Now we load the data:

As the data is in the form of a comma-separated values (CSV) file, we will make use of pandas' ```read_csv``` function. Documentation for this can be found here:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html


In [None]:
url = "https://raw.githubusercontent.com/ashishpatel26/Titanic-Machine-Learning-from-Disaster/master/input/train.csv"
data = pd.read_csv(url, sep=",")

# View the data frame within a cell
data
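To see what ```read_csv``` does without needing a network connection, the sketch below parses a small CSV held in a string. The sample rows and the names ```csv_text``` and ```toy``` are made up for illustration and are not taken from the Titanic file.

```python
import io
import pandas as pd

# A tiny, made-up CSV with a header row and one missing Age value
csv_text = """PassengerId,Survived,Age
1,0,22
2,1,
3,1,26
"""

# read_csv accepts a URL, a file path, or any file-like object
toy = pd.read_csv(io.StringIO(csv_text), sep=",")

print(toy.dtypes)        # Age is float64 because the empty field becomes NaN
print(toy.isna().sum())  # number of missing values per column
```

Empty fields are read as missing values (NaN), which is why the real dataset needs the cleaning steps later in this practical.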

In [None]:
# Your code: Check the shape of the data
# 


## Data Exploration

Before building a model, we want to explore the data first: some data cleaning, visualisation and simple statistics will be useful here. 

In [None]:
data

And to just get a quick glimpse of the data that we have loaded, we can just call ```data.head(n_rows)``` where ```n_rows``` is equal to the number of rows we want to see.

In [None]:
data.head(10)

Before we feed our data into a classifier, we first have to do a bit of manipulation to the ```DataFrame``` object. For the purposes of this practical we will drop the columns we do not need, fill in missing values, and convert the string columns into numeric codes. Pandas makes each of these steps fairly simple:

In [None]:
# Drop the irrelevant variables
# - Note that the column names are case sensitive here
data = data.drop(['Name', 'Ticket', 'Cabin'], axis=1)

data.head(3)

In [None]:
# Fill in missing values with a mean
age_mean = data['Age'].mean()
data['Age'] = data['Age'].fillna(age_mean)
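As a minimal sketch of what mean imputation does, the example below applies the same two calls to a made-up three-value series (the values are illustrative only).

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 40.0], name="Age")  # made-up values

# mean() skips NaN by default, so the mean here is (20 + 40) / 2 = 30
age_mean = ages.mean()

# fillna returns a new Series; the original is left untouched
filled = ages.fillna(age_mean)
print(filled.tolist())  # [20.0, 30.0, 40.0]
```

Because ```fillna``` returns a new object, the result has to be assigned back to the column, as in the cell above.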

In [None]:
# Fill in missing values with the mode for discrete variables
# Series.mode() ignores missing values and returns a Series (there may be ties),
# so we take its first entry
mode_embarked = data['Embarked'].mode()[0]
data['Embarked'] = data['Embarked'].fillna(mode_embarked)
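One way to check mode imputation on a small scale is with pandas' own ```Series.mode```, shown here on a made-up series of port codes. This is a sketch under the assumption that taking the first mode is acceptable when there are ties.

```python
import numpy as np
import pandas as pd

ports = pd.Series(["S", "C", "S", np.nan, "Q"], name="Embarked")  # made-up values

# Series.mode() ignores NaN and returns a Series, so take the first entry
most_common = ports.mode()[0]

filled = ports.fillna(most_common)
print(most_common)      # S
print(filled.tolist())  # ['S', 'C', 'S', 'S', 'Q']
```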

As there are only two unique values in the column Sex, encoding them as 0 and 1 does not introduce a misleading ordering.

In [None]:
data['gender'] = data['Sex'].map({'female': 0, 'male': 1}).astype(int)
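A detail worth knowing about ```map``` with a dictionary is that any label missing from the dictionary becomes NaN, and casting NaN to ```int``` raises an error. The sketch below shows this on a made-up series; the label "unknown" is hypothetical and does not occur in the Titanic data.

```python
import pandas as pd

sex = pd.Series(["male", "female", "unknown"])  # "unknown" is a made-up extra label

# Labels missing from the dictionary are mapped to NaN
mapped = sex.map({"female": 0, "male": 1})
print(mapped.tolist())  # [1.0, 0.0, nan]

# Checking for NaN before casting avoids an error from astype(int)
print(mapped.isna().any())  # True
```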

For the column Embarked, however, replacing {C, S, Q} by {1, 2, 3} would imply the ordering C < S < Q, when in fact the three values are just labels with no inherent order.

To avoid this problem, we create dummy variables. Essentially this involves creating new columns to represent whether the passenger embarked at C with the value 1 if true, 0 otherwise. Pandas has a built-in function to create these columns automatically.

In [None]:
pd.get_dummies(data['Embarked'], prefix='Embarked').head(10)

In [None]:
data = pd.concat([data, pd.get_dummies(data['Embarked'], prefix='Embarked')], axis=1)
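To make the effect of ```get_dummies``` and ```concat``` concrete, here is a small sketch on a made-up column (the name ```Port``` and its values are illustrative only).

```python
import pandas as pd

toy = pd.DataFrame({"Port": ["C", "S", "Q", "S"]})  # made-up column and values

# One indicator column per distinct value, named <prefix>_<value>
dummies = pd.get_dummies(toy["Port"], prefix="Port")
print(dummies.columns.tolist())  # ['Port_C', 'Port_Q', 'Port_S']

# axis=1 places the new columns side by side with the original ones
toy = pd.concat([toy, dummies], axis=1)
print(toy.shape)  # (4, 4)
```

Each row has exactly one indicator set, so no ordering between the categories is implied.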

Exercise:

Write the code to create dummy variables for the column Sex.

In [None]:
# Your code here




In [None]:
data = data.drop(['Sex', 'Embarked'], axis=1)

# Put column name to a list
cols = data.columns.tolist()

# Reorder the column names and the dataframe (data) according to the new column order
cols = [cols[1]] + cols[0:1] + cols[2:]
data = data[cols]
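The reordering idiom above can be tried on a small frame first. In this sketch the column names ```Id```, ```Target``` and ```Feature``` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"Id": [1, 2], "Target": [0, 1], "Feature": [3.5, 4.0]})  # made-up data

cols = toy.columns.tolist()  # ['Id', 'Target', 'Feature']

# Move the second column to the front, keep the rest in their original order
cols = [cols[1]] + cols[0:1] + cols[2:]
toy = toy[cols]
print(toy.columns.tolist())  # ['Target', 'Id', 'Feature']
```

Indexing a ```DataFrame``` with a list of column names returns the columns in the order given by that list.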

We review our processed training data.

In [None]:
data.head(10)

In [None]:
# Summarise the dataset: descriptive statistics
data.describe()

### Visualising the data
Data visualisation can be performed using Pandas and Matplotlib. Another popular library for data visualisation is [Seaborn](https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html).

In [None]:
# %matplotlib inline displays Matplotlib figures inside the notebook
%matplotlib inline 
import matplotlib.pyplot as plt

In [None]:
# Count the passengers in each class of Survived to check its distribution
data.Survived.value_counts()


In [None]:

# Plot the counts as a bar chart
data.Survived.value_counts().plot(kind='bar')

In [None]:
y = data["Survived"].values

In [None]:
data['Age'].plot(kind='hist') # Histogram for age

In [None]:
# Boxplots to compare the distribution of continuous variables by groups
data.boxplot(column='Age', by='Survived')
data.boxplot(column='Fare', by='Survived')

In [None]:
# Scatter plots
# Visualise the data by groups in colors
df0=data[data['Survived']==0] # subset of data
df1=data[data['Survived']==1] # subset of data
# df0 holds the passengers who did not survive, df1 those who survived
ax = df0.plot(kind='scatter', x='Age', y='Fare', color='red', label='Not Survived')
df1.plot(kind='scatter', x='Age', y='Fare', color='green', label='Survived', ax=ax)

Exercise:

What are the other variables that you would like to visualise in order to understand the association between those variables and survival data? 

In [None]:
# Your answer or code here



Now using the code above, analyse the column definitions and determine what features you would like to use for predicting if a person would survive or not.

In [None]:
X = data.values[:,1:] # remember to exclude the output column (the first column here)
print(X.shape)

Now we can check whether this data has been set up correctly by ensuring that there is an equal number of samples in both ```X``` and ```y```. If this throws an exception, alter your code to make it work. If you are still stuck, call over a demonstrator to help.

In [None]:
if X.shape[0] != y.shape[0]:
    raise Exception("Sample counts do not align! Try again!")
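An alternative to slicing by position is to select the target by name and drop it from the features, which keeps working if the column order changes. The sketch below uses a made-up three-row frame and is one possible approach, not the only one.

```python
import pandas as pd

toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})  # made-up rows

# Select the target by name and drop it from the features,
# so the split does not depend on column position
y = toy["Survived"].values
X = toy.drop(columns=["Survived"]).values

print(X.shape, y.shape)  # (3, 2) (3,)
```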

## Colab Specific

You can mount your Google Drive as a VM local drive. You can save your notebooks in your Drive, or directly to your GitHub repositories, or simply download them locally.

For more instructions, see https://colab.research.google.com/notebooks/io.ipynb

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!ls /content/drive/

MyDrive


You can access your Drive files using this path (try clicking on the link): "/content/drive/MyDrive/"

## Helpers

### Pandas Cheatsheet
https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf


### List slicing tips

If you're new to the Python programming language, understanding list slicing may be a bit difficult. Here's a quick guide.

Given the following list:

In [None]:
l = ["This", "is", "a", "list", "of", "strings"]

If I wanted to get the first element of that list, I'd simply write:

In [None]:
l[0]

If I wanted to get the last element of that list, I'd simply write:

In [None]:
l[-1]

If I wanted to get everything after the first element:

In [None]:
l[1:]

And if I wanted to get everything before the last element:

In [None]:
l[:-1]

And finally, everything after the first and before the last:

In [None]:
l[1:-1]
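For reference, the slices above can be run together as a single script; the comments show what each one prints.

```python
l = ["This", "is", "a", "list", "of", "strings"]

print(l[0])     # This
print(l[-1])    # strings
print(l[1:])    # ['is', 'a', 'list', 'of', 'strings']
print(l[:-1])   # ['This', 'is', 'a', 'list', 'of']
print(l[1:-1])  # ['is', 'a', 'list', 'of']

# Slicing returns a new list, so the original is unchanged
print(len(l))   # 6
```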