## Data exploration and visualisation

In this notebook, we will address one of the most important parts of any research project: data exploration and visualisation.

Often we will provide a link to the documentation of a method or library. Make yourself familiar with this way of working: the web is a great source of information if you know *what* you want to do, but not *how*.

Another source of information for any programming/data science/Python question is [*stackoverflow*](https://stackoverflow.com/), where anyone can post a question, and the community answers. Good answers generally have a lot of upvotes and will therefore end up at the top of the page.

### Pandas
We will use the popular tool Pandas for data analysis and manipulation. The [*Pandas website*](http://pandas.pydata.org) contains all relevant information about the tool. The Open Access HTML version of the [Python for Data Analysis book](https://wesmckinney.com/book/) is also a great resource.

### Data

The data used in this notebook come from the CHAMP project (Collaborative Harmonization of Methods for Profiling of Amphetamine Type Stimulants), which aimed to create methods for drug profiling. Using these methods, different drug seizures might be linked together.

The data was somewhat altered to make it applicable for this course. It consists of 160 street samples of XTC tablets. Of each street sample, at least two tablets were measured by one of several laboratories. The laboratories cooperating in this research are NBI (National Bureau of Investigation, Finland), LPS (Laboratoire de Police Scientifique de Lyon, France), UNIL (Université de Lausanne, Switzerland) and NFI (Netherlands Forensic Institute). The properties of the tablets measured are the diameter (mm), thickness (mm), weight (mg), and purity (percentage of MDMA).

### Load the data

The first step is to load the data. Pandas has a dedicated function for loading .csv files. CSV stands for 'comma-separated values'.

In [None]:
# To be able to use Pandas, we have to import the package:
import pandas as pd
import numpy as np

# To load the file containing the data, first we specify the path to the file:
xtc_file = 'wg2_Champ_data_xtc.csv'

# Then we use pandas to load the file
xtc_data = pd.read_csv(xtc_file)

# The pandas function `read_csv()` returns a so-called pandas dataframe
print('type:', type(xtc_data))

Notice that the function `read_csv()` interpreted the data by itself: we only provided the file, without any additional information. Sometimes, for example when the data contains dates or times, more arguments need to be provided. See the [*Pandas API reference*](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).
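
As a small illustration of such an extra argument, the sketch below reads a made-up CSV snippet twice, once without and once with `parse_dates`. The column names (`sample`, `measured_on`, `weight`) are hypothetical and do not come from the CHAMP file.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV content; the column names are made up for this example.
csv_text = """sample,measured_on,weight
A,2021-03-01,250.5
B,2021-03-02,248.0
"""

# Without extra arguments, the date column is read as plain text.
plain = pd.read_csv(StringIO(csv_text))

# With parse_dates, pandas converts the named column to datetime values.
parsed = pd.read_csv(StringIO(csv_text), parse_dates=["measured_on"])

print(plain.dtypes)
print(parsed.dtypes)
```

Comparing the two printed `dtypes` shows which columns were interpreted differently.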

### Pandas dataframes

In this notebook, we will work with a type of data structure called a `DataFrame`. The cool thing about a dataframe is that it is relatively easy to manipulate and work with.

In [None]:
# To have a quick look at the dataframe, just enter the variable name and run the cell:
xtc_data

In [None]:
# Another way of inspecting it is by using `head()` or `tail()`, which show the first or last rows, respectively.
# The number within the brackets sets how many rows are shown.
xtc_data.head(5)

In [None]:
# Try it yourself. Can you show the last 4 rows of the dataset?
### YOUR CODE HERE ###

In [None]:
# To quickly get some information on the data, use info()
xtc_data.info()

In [None]:
# To get some simple statistics, use describe()
xtc_data.describe()
# Pay attention: describe() recognised that the column Street_sample_name is not numeric, so it did not
# provide statistics for it. The columns for batch number and measurement, however, were treated as numbers
# when in fact they are just labels. Their statistics are meaningless.

#### Data selection

You might want to access a specific row or column of your dataframe. You can use `.iloc[]` for this. See examples below.
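
A minimal sketch of position-based selection on a made-up toy table (the columns `a` and `b` are hypothetical, not part of the XTC data). Note that the end of a slice is excluded, just as in ordinary Python lists.

```python
import pandas as pd

# Hypothetical toy table, not the XTC data.
toy = pd.DataFrame({"a": [10, 20, 30, 40, 50], "b": [1.5, 2.5, 3.5, 4.5, 5.5]})

first_row = toy.iloc[0]   # the first row, returned as a Series
middle = toy.iloc[1:3]    # rows at positions 1 and 2 (position 3 is excluded)
last_two = toy.iloc[-2:]  # the last two rows
cell = toy.iloc[2, 1]     # row position 2, column position 1

print(middle)
print(cell)
```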

In [None]:
# Select the first row:
xtc_data.iloc[0]

# Note: Pandas uses zero indexing (like Python).

In [None]:
# Select rows 11 - 20 (positions 10 up to, but not including, 20):
xtc_data.iloc[10:20]

In [None]:
# Select a specific column:
xtc_data['Purity']

In [None]:
# Combine: select specific columns from specific rows:
xtc_data.iloc[0:2][['Weight', 'Purity']]

# Notice that we provided the argument ['Weight', 'Purity'] as a list to indicate the desired columns. 
# Since the argument itself contains brackets and we provide the argument inside brackets, we end up with double brackets.

In [None]:
# As an exercise, show below the values for the columns 'Diameter' and 'Thickness' for the last 10 rows.

# Tip: With the '-' sign you can make selections from the bottom. 
# For example, xtc_data.iloc[-1][['Weight', 'Purity']] returns the weight 
# and purity from the bottom row


### YOUR CODE HERE ###

Pandas also has the option to select rows based on a sequence of boolean values (`True` or `False`). This will be a very useful feature later on as it allows us to make selections based on specific conditions.
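
In practice, the boolean sequence is rarely typed by hand; it usually comes from a comparison on a column. The sketch below shows this on a made-up toy table (the names `name` and `purity` and the thresholds are only illustrative).

```python
import pandas as pd

# Hypothetical toy table, not the XTC data.
toy = pd.DataFrame({
    "name": ["s1", "s2", "s3", "s4"],
    "purity": [35.0, 62.0, 48.0, 71.0],
})

# A comparison on a column yields one boolean per row.
mask = toy["purity"] > 50
print(mask.tolist())

# Using the mask as a selector keeps only the rows where it is True.
high = toy[mask]

# Conditions can be combined with & (and) or | (or); each condition needs parentheses.
between = toy[(toy["purity"] > 40) & (toy["purity"] < 70)]
print(between)
```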

In [None]:
# We will now select the first value with a boolean list
selections = [False] * len(xtc_data) # initialize a list with False values -> [False, False, ..., False]
selections[0] = True # set the first value to True -> [True, False, ..., False]

xtc_data[selections]

In [None]:
# Show the first 5 rows of the table by editing the selections list.
### YOUR CODE HERE ###

### Missing values, duplicates

In most datasets, we have to deal with *missing values*. There are different ways to deal with them. In this notebook, we simply delete the samples with missing values. Of course, something like this should be done with caution. If the samples with missing values are not randomly distributed, but have a pattern, you will end up with a biased dataset.

Also, there might be duplicates in the dataset, for example due to a data entry error. It is important that these duplicates are removed.

Look again at the output for `xtc_data.info()`. Can you see if there are any missing values? (tip: pay attention to the 'Non-Null Count')

In [None]:
### YOUR CODE HERE ###

Now that you know there is a column with missing values, we will search for the exact rows that contain them.

Pandas has the function `isna()` ([*API reference*](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html?highlight=isna#pandas.DataFrame.isna)) that returns a sequence of boolean values indicating for each cell if it has a missing value. For example, to get the results for the batchnumber column, you should use `xtc_data['batchnumber'].isna()`. We saw earlier how we could use a list of booleans to make selections of a dataframe.
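
To see what `isna()` returns, here is a sketch on a made-up toy table (the columns `batch` and `weight` are hypothetical). It works on a single column as well as on the whole dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with two missing cells.
toy = pd.DataFrame({
    "batch": [1.0, np.nan, 2.0, 2.0],
    "weight": [250.0, 248.0, np.nan, 251.0],
})

# One boolean per row for a single column.
missing_batch = toy["batch"].isna()
print(missing_batch.tolist())

# One boolean per row: True if any cell in that row is missing.
any_missing = toy.isna().any(axis=1)
print(any_missing.tolist())

# Number of missing cells per column.
n_missing = toy.isna().sum()
print(n_missing)
```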

In [None]:
# Show the xtc_data table again, but now only the rows that contain a missing value.
### YOUR CODE HERE ##

Because only a few rows contain missing values, we will remove them.
For this, you can use the function `dropna()` ([*API reference*](http://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html?highlight=dropna#pandas.DataFrame.dropna)).
For example, to remove rows with missing values in the batchnumber column, you should use `xtc_data.dropna(subset='batchnumber')`.
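
The sketch below, on a made-up toy table, illustrates two details that are easy to miss: `dropna()` returns a new dataframe and leaves the original untouched, and the remaining rows keep their original index labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with two missing cells.
toy = pd.DataFrame({
    "batch": [1.0, np.nan, 2.0, 2.0],
    "weight": [250.0, 248.0, np.nan, 251.0],
})

# Drop only the rows where 'batch' is missing.
cleaned = toy.dropna(subset=["batch"])

# Drop every row that has a missing value in any column.
fully_cleaned = toy.dropna()

print(len(toy), len(cleaned), len(fully_cleaned))
print(cleaned.index.tolist())
```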

In [None]:
# Try it yourself. Replace None below with the correct function 
# so that the rows with missing values will be removed.

### YOUR CODE HERE ###

In [None]:
xtc_data.info()

In [None]:
# At this point, the dataframe should contain 496 rows. If it does not, rerun the cells up to this point by 
# selecting this cell and choosing 'Cell-Run All Above'.
xtc_data.shape[0] == 496

To search for duplicates, pandas provides the method `duplicated()` ([*API reference*](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html?highlight=duplicated#pandas.DataFrame.duplicated)). It returns a sequence of boolean values that is `True` for every row that duplicates an earlier row.
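
A short sketch on a made-up toy table shows what `duplicated()` marks. By default only the later copies are flagged; with `keep=False`, every row that has a copy is flagged.

```python
import pandas as pd

# Hypothetical toy table: rows 2 and 4 repeat rows 0 and 1.
toy = pd.DataFrame({
    "sample": ["A", "B", "A", "C", "B"],
    "weight": [250, 248, 250, 251, 248],
})

# Default (keep='first'): the first occurrence is not marked.
later_copies = toy.duplicated()
print(later_copies.tolist())

# keep=False: all rows that occur more than once are marked.
all_copies = toy.duplicated(keep=False)
print(all_copies.tolist())
```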


In [None]:
# Show below the xtc table again, but only with the rows that are duplicates of previous rows.

### YOUR CODE HERE ##

With the `drop_duplicates()` function in Pandas, you can drop duplicate rows ([*API reference*](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html?highlight=drop_duplicates#pandas.DataFrame.drop_duplicates)).
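
As an illustration on a made-up toy table: by default a row only counts as a duplicate if all of its columns match an earlier row, while the `subset` argument restricts the comparison to the listed columns.

```python
import pandas as pd

# Hypothetical toy table: row 2 is an exact copy of row 0,
# row 3 shares the sample name but has a different weight.
toy = pd.DataFrame({
    "sample": ["A", "B", "A", "A"],
    "weight": [250, 248, 250, 255],
})

# Compare all columns: only the exact copy (row 2) is dropped.
deduplicated = toy.drop_duplicates()

# Compare only 'sample': rows 2 and 3 are both dropped.
by_sample = toy.drop_duplicates(subset=["sample"])

print(deduplicated)
print(by_sample)
```

As with `dropna()`, the result is a new dataframe; `toy` itself still has four rows.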

In [None]:
# Replace None below with the Pandas function that removes the duplicate rows

### YOUR CODE HERE ###

In [None]:
# At this point, the dataframe should contain 494 rows. If it does not, rerun the cells up to this point
len(xtc_data) == 494

### Data processing

You have loaded the data from a file, but in most cases, you will not use the data exactly as provided in the file. You might need to process the data first. 

Personally, I like all variables to be lowercase. The column names can be changed using the function `rename()` ([*API reference*](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html)).
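
A minimal sketch of the two common ways to call `rename()`, shown on a made-up toy table with hypothetical column names: a dictionary that maps old names to new names, or a function that is applied to every name.

```python
import pandas as pd

# Hypothetical toy table, not the XTC data.
toy = pd.DataFrame({"Length": [8.1], "Height": [3.2]})

# A dictionary renames only the columns that are listed.
renamed = toy.rename(columns={"Length": "length (cm)"})

# A function is applied to every column name.
lowered = toy.rename(columns=str.lower)

print(list(renamed.columns))
print(list(lowered.columns))
print(list(toy.columns))  # the original is unchanged
```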

In [49]:
# Go to the link above, and find out how to change the column names. Make all column names lower case.
# On top of that, add the units to the column names, i.e. `diameter (mm)`

### YOUR CODE HERE ###

# TIP: xtc_data.rename() renames the columns based on your input.
# By default, the rename method (like many other methods of pandas dataframes) returns
# the altered dataframe; it does not change the original. To keep the result, you
# can either assign the output to a new variable (i.e. new_name = xtc_data.rename()), or overwrite the old one
# (xtc_data = xtc_data.rename()). Another way to accomplish the latter is by
# adding the argument `inplace = True`, see also the website link above.

In [None]:
xtc_data.head(5)

Suppose you want to work with grams instead of milligrams for the weight, or you want to change the purity from a percentage to a fraction. This can be done by creating a new column.

In [None]:
# For example, we can create a new column 'weight (g)' that contains the weight in grams instead of milligrams:
xtc_data['weight (g)'] = xtc_data['weight (mg)']/1000

# Use the methods you learned before to see if the new column is there!
### YOUR CODE HERE ###

The old column `weight (mg)` can now be removed. This can be done using the method `drop()`. Look at [*the documentation*](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) to see how this is done, and remove the column. 
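
For reference, the general pattern of deriving a column and then dropping the one it replaces might look as follows on a made-up toy table (the column names are hypothetical and deliberately different from the exercise).

```python
import pandas as pd

# Hypothetical toy table, not the XTC data.
toy = pd.DataFrame({"length (cm)": [120.0, 85.0], "label": ["x", "y"]})

# Derive a new column from an existing one.
toy["length (m)"] = toy["length (cm)"] / 100

# drop() returns a new dataframe without the listed columns.
trimmed = toy.drop(columns=["length (cm)"])

print(list(toy.columns))
print(list(trimmed.columns))
```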

In [53]:
# Remove the column 'weight (mg)'
### YOUR CODE HERE ###

In [None]:
# Check if the data is as you expect:
xtc_data.head(5)

In [None]:
# Exercise: create a column 'volume (mm^3)' out of the columns 'diameter (mm)' and 'thickness (mm)'.
# For this you need the formula for volume of a cylinder (https://en.wikipedia.org/wiki/Cylinder) 
# and the number pi (which we have imported for you):
from math import pi

### YOUR CODE HERE ###

In [None]:
# Just a check to see if you got it right (output should be 'True'):
np.allclose(
    xtc_data['volume (mm^3)'][100],
    238.00647038734022
)

In [None]:
# We can make a new column to represent the laboratory:
xtc_data['laboratory'] = xtc_data['street_sample_name'].str[0:3]
xtc_data

### Aggregation

Sometimes you might want to aggregate data. With the method `.describe()` we already aggregated all data and calculated the mean, standard deviation, minimum and some other statistics. If you want to aggregate data based on one of the columns, you can use `groupby()` ([API reference](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html?highlight=groupby#pandas.DataFrame.groupby)).
An example is shown below.
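
As an additional sketch on a made-up toy table (the columns `batch` and `weight` are hypothetical), `agg()` can compute several statistics per group in one call.

```python
import pandas as pd

# Hypothetical toy table with two groups.
toy = pd.DataFrame({
    "batch": [1, 1, 2, 2, 2],
    "weight": [0.25, 0.27, 0.30, 0.32, 0.31],
})

# One row per batch, one column per requested statistic.
summary = toy.groupby("batch")["weight"].agg(["mean", "median", "count"])
print(summary)
```

The result is itself a dataframe, indexed by the values of the grouping column.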

In [None]:
# For each batch, calculate the mean weight:
xtc_data.groupby('batchnumber')['weight (g)'].mean()

In [None]:
# Exercise: for each batch, calculate the maximum purity.
### YOUR CODE HERE ###

In [None]:
# Another one: let's see actually how many samples we have for each batch
xtc_data['batchnumber'].value_counts()

### Data visualisation

So far, we've only seen the raw data, including some basic statistics. To get more feeling for the data,
we will plot the data using plotting methods from pandas.

In [None]:
# Let's see how many samples were measured in each laboratory:
xtc_data['laboratory'].value_counts()

In [None]:
# This can be visualised in several ways:
xtc_data['laboratory'].value_counts().plot(kind='bar');
# (the ";" suppresses the printed reference to the plot object)

Try it for yourself: see what other kinds of visualisations you can use in [*the documentation*](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html).

In [None]:
### YOUR CODE HERE ###

In [None]:
xtc_data['laboratory'].value_counts().plot(kind='pie');

In the example above, we are dealing with categorical data. For numerical data, we can make other kinds of plots, such as a histogram, boxplot and a scatter plot.

In [None]:
# Create a histogram for the purity:
xtc_data['purity (%)'].plot(kind='hist');

An easy way to detect outliers is by using box plots.

In [None]:
# Create a boxplot for all columns:
xtc_data.plot(kind='box', rot=270);

Exercise: In the output above, we see the boxplots for all columns. However, a boxplot for the batch number or the measurement is not interesting at all. Furthermore, all boxplots are drawn in one figure, which makes most of them so small that they are unreadable, let alone useful for detecting outliers. Create a boxplot for the diameter, thickness and weight together, and two separate boxplots for the purity and the volume of the tablets. Also, play around with the different arguments. Above, `rot=270` rotates the labels on the x-axis. The figure can be given a title using `title='a title'`. If you have questions, see if you can find the answer in [*the documentation*](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html), or Google it!

In [None]:
# boxplot for the diameter, thickness and weight
### YOUR CODE HERE ###

In [None]:
# boxplots for the volume
### YOUR CODE HERE ###

In [None]:
# boxplots for the purity
### YOUR CODE HERE ###

Have you spotted any outliers? It is always hard to tell, since extreme values can occur. If you find outliers in your own project, try to see where they come from, for example a mistake in reading the file, or a measurement error. If you cannot find anything, it might be just an extreme value and you can leave it as is. For the notebook, we leave the data as is.
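
The points that a box plot draws separately are, with matplotlib's default settings, the values that lie more than 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below applies that rule by hand to a made-up series of values; it is only a screening aid under that assumption, not a test of whether a value is wrong.

```python
import pandas as pd

# Hypothetical measurements with one extreme value.
values = pd.Series([10.0, 11.0, 12.0, 11.5, 10.5, 30.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR below the first and above the third quartile.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

flagged = values[(values < lower) | (values > upper)]
print(lower, upper)
print(flagged)
```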

Another way of inspecting the data is by looking at the correlation coefficients between the different variables. These will give insight into the (in)dependence between them.
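
To get a feel for how to read such coefficients, here is a sketch on a made-up toy table in which `y` is exactly twice `x` and `z` tends to decrease as `x` increases. By default, `corr()` computes the Pearson correlation coefficient.

```python
import pandas as pd

# Hypothetical toy table, not the XTC data.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly 2 * x
    "z": [5.0, 3.0, 4.0, 1.0, 2.0],   # tends to decrease as x increases
})

toy_corr = toy.corr()
print(toy_corr)

# Individual coefficients can be read with .loc[row, column].
print(toy_corr.loc["x", "y"])
print(toy_corr.loc["x", "z"])
```

The table is symmetric and has ones on the diagonal, because every variable is perfectly correlated with itself.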

In [None]:
# To see if there are any correlations in your data, pandas also has a built-in method to provide the 
# correlation between every pair of variables at once:
columns_to_test = ["diameter (mm)", "thickness (mm)", "purity (%)", "weight (g)", "volume (mm^3)"]
corr = xtc_data[columns_to_test].corr()
corr

In [None]:
# let's try to make it easier to read
corr.style.background_gradient(cmap='coolwarm').format(precision=2)

In [None]:
# Exercise: create a scatterplot of the variables that have the highest correlation coefficient:
### YOUR CODE HERE ###

### Dimension reduction

Just for fun, we will also reduce the dimensions of the data using the algorithm [*T-SNE*](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding). This algorithm transforms each high-dimensional sample to a two-dimensional point, in such a way that if two samples are alike, the points are also close together. This can also provide more insight into your data. For now, it is not important that you understand every bit of code. We just want to give you an example of such an algorithm, and show that everybody can apply it. More information can be found in [*the documentation*](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html).

In [None]:
# For fun: tsne

# import the package
from sklearn.manifold import TSNE

# specify the data (only numerical data as input)
x = xtc_data[['diameter (mm)', 'thickness (mm)', 'purity (%)', 'weight (g)', 'volume (mm^3)']]

# run the tsne algorithm (n_components is the number of output dimensions)
tsne = TSNE(n_components=2, verbose=1, random_state=123)
z = tsne.fit_transform(x)

In [None]:
# Visualise the results. Comp-1 and comp-2 are the coordinates of the transformed samples.
df_tsne = pd.DataFrame()
df_tsne['comp-1'] = z[:,0]
df_tsne['comp-2'] = z[:,1]
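# Use the raw values so that they are matched by position rather than by index label
# (the index of xtc_data may have gaps after rows were removed).
df_tsne['purity (%)'] = xtc_data['purity (%)'].to_numpy()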
df_tsne.plot(kind='scatter', x='comp-1', y='comp-2', c='purity (%)', colormap='viridis');

# Kaggle Challenge

Do you want to put your data science skills to the test? See if you can implement your own model that predicts poisonous mushrooms in the Kaggle challenge: https://www.kaggle.com/t/3fb3213893214f28825b0f8848e471c9