# Analysis through Visualisation

## Pima Indians Dataset
This dataset contains medical records for Pima Indian patients
and whether or not each patient had an onset of diabetes within five years.

We are going to use the pandas library to load the data, which is stored as a CSV file.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
# Seaborn for plotting and styling
import seaborn as sns

* Summarise distributions of numeric features
  * We are going to use histograms

In [None]:
filename = "../data/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# df stands for "Data Frame"
df = pd.read_csv(filename, names=names)
h = df.hist()
plt.show()

Which attribute is Gaussian, exponential, or skewed?
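
As a rough numeric cross-check for the question above, sample skewness can be computed per column. The sketch below uses synthetic data rather than the Pima columns, and the column names `symmetric` and `right_skewed` are made up for illustration. On the real data the same `skew()` call could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration, not the Pima dataset):
# one roughly Gaussian column and one exponential column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=50, scale=10, size=5000),
    "right_skewed": rng.exponential(scale=10, size=5000),
})

# Sample skewness per column: near 0 for symmetric data,
# clearly positive when the long tail is on the right.
skewness = demo.skew()
print(skewness)
```

A skewness close to zero does not prove a column is Gaussian, so the histogram is still worth looking at.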

### Density Plots
Also known as Kernel Density Plots or Density Trace Graphs.

A Density Plot visualises the distribution of data over a continuous interval or time period. This chart is a variation of a Histogram that uses kernel smoothing to plot values, which smooths out the noise in the distribution. The peaks of a Density Plot help display where values are concentrated over the interval.

An advantage Density Plots have over Histograms is that they're better at showing the shape of the distribution, because they're not affected by the number of bins used (each bar in a typical Histogram is one bin). A Histogram comprising only 4 bins wouldn't produce as distinguishable a shape as a 20-bin Histogram would. With Density Plots this isn't an issue, although the smoothing bandwidth of the kernel plays a comparable role and still has to be chosen.
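
The effect of the bin count can be seen numerically. The following is a minimal sketch on a synthetic two-peaked sample (not the Pima data). The helper `gaussian_kde_1d` and the bandwidth of 0.3 are assumptions made for illustration, and pandas uses its own estimator when drawing `kind='density'`.

```python
import numpy as np

# Synthetic bimodal sample (an assumption for illustration, not the Pima data).
rng = np.random.default_rng(0)
sample = np.concatenate([
    rng.normal(loc=-2.0, scale=0.5, size=5000),
    rng.normal(loc=2.0, scale=0.5, size=5000),
])

# Same data, two bin counts over the same range.
coarse_counts, _ = np.histogram(sample, bins=4, range=(-4, 4))
fine_counts, _ = np.histogram(sample, bins=20, range=(-4, 4))
print("4 bins: ", coarse_counts)  # roughly flat: the two peaks are hidden
print("20 bins:", fine_counts)    # two clear peaks with a gap between them


def gaussian_kde_1d(data, grid, bandwidth):
    """Hypothetical helper: average one Gaussian bump per data point on `grid`."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / bandwidth


grid = np.linspace(-4, 4, 81)
# Bandwidth picked by hand; a larger value would smooth the peaks together.
density = gaussian_kde_1d(sample, grid, bandwidth=0.3)
print("density at 0:", density[40], "max density:", density.max())
```

With bin edges at -4, -2, 0, 2 and 4, each peak is split evenly across two bins, which is why the 4-bin Histogram looks flat while the density estimate keeps both peaks.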

In [None]:
df.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
plt.show()

### Box plots
Box plots summarise the distribution of each attribute, drawing a line for the median (middle value) and a box between the 25th and 75th percentiles (the middle 50% of the data). The whiskers give an idea of the spread of the data, and dots outside the whiskers show candidate outlier values (values lying more than 1.5 times the interquartile range, i.e. the spread of the middle 50% of the data, beyond the edge of the box).
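
The quantities a box plot draws can also be computed by hand. This is a small sketch on made-up values (not taken from the dataset), using the 1.5 × IQR rule that matplotlib and seaborn apply by default. The names `lower_fence` and `upper_fence` are illustrative.

```python
import pandas as pd

# Hypothetical values for illustration (not taken from the dataset).
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])

median = values.median()
q1 = values.quantile(0.25)   # bottom edge of the box
q3 = values.quantile(0.75)   # top edge of the box
iqr = q3 - q1                # spread of the middle 50% of the data

# Points beyond these fences are drawn as individual dots.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"median={median}, q1={q1}, q3={q3}, iqr={iqr}")
print(f"fences=({lower_fence}, {upper_fence})")
print("candidate outliers:", outliers.tolist())
```

Here only the value 20 falls outside the fences, so it is the only point that would be drawn as a dot.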

In [None]:
bp = df.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
plt.show()

In [None]:
# Using seaborn
f = plt.figure(figsize=(18,8))
sns.boxplot(data=df)

Let's remove some columns:

In [None]:
stats_df = df.drop(['class', 'test'], axis=1)
plt.figure(figsize=(18,5))
sns.boxplot(data=stats_df)

Plot the removed attribute on its own:

In [None]:
test_df = df[['test']]
test_df.head()

In [None]:
sns.boxplot(data=test_df)

Outliers? Always be careful before removing them!

### Correlation Matrix

In [None]:
correlations = df.corr()
# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()

In [None]:
# In seaborn the correlation matrix can be drawn as a heatmap
sns.heatmap(correlations)

### Scatter plots
Scatter plots are useful for spotting structured relationships between variables, such as whether the relationship between two variables could be summarised with a line.
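
To make "summarise with a line" concrete, a straight line can be fitted by least squares and compared with the Pearson correlation. The sketch below uses synthetic `x` and `y` values with a known slope and intercept (an assumption for illustration, not columns from the dataset).

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=500)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=500)

# Least-squares straight line through the points.
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation: close to +1 or -1 when a line describes the data well.
r = np.corrcoef(x, y)[0, 1]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.3f}")
```

When the points form a diffuse cloud instead, `r` moves towards zero and the fitted line says little about the relationship.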

In [None]:
sns.lmplot(data=df, y='age', x='preg', hue='class', fit_reg=False)

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(df)
plt.show()

## Wine Quality Dataset
This dataset contains instances for red and white wine samples.
The inputs include objective tests (e.g. pH values) and the output is based on sensory data
(median of at least 3 evaluations made by wine experts). Each expert graded the wine quality 
between 0 (very bad) and 10 (very excellent).

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine.
For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].
Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables 
are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

* Summarise distributions of numeric features
  * We are going to use histograms

In [None]:
df = pd.read_csv('../data/winequality-red.csv', sep=';')
h = df.hist()
plt.show()

Which attribute is Gaussian, exponential, or skewed?

### Density Plots

In [None]:
df.plot(kind='density', subplots=True, layout=(4,3), sharex=False)
plt.show()

### Box plots

In [None]:
bp = df.plot(kind='box', subplots=True, layout=(4,3), sharex=False, sharey=False)
plt.show()

In [None]:
# Using seaborn
f = plt.figure(figsize=(18,8))
sns.boxplot(data=df)

Let's remove some columns:

In [None]:
stats_df = df.drop(['total sulfur dioxide', 'free sulfur dioxide'], axis=1)
plt.figure(figsize=(18,5))
sns.boxplot(data=stats_df)

Plot removed attributes on a separate plot:

In [None]:
sulfur_df = df[['total sulfur dioxide', 'free sulfur dioxide']]
sulfur_df.head()

In [None]:
sns.boxplot(data=sulfur_df)

### Correlation Matrix

In [None]:
correlations = df.corr()
# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
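# The wine data has its own columns, so take ticks and labels from the matrix
# instead of reusing the Pima `names` list
labels = correlations.columns
ticks = np.arange(len(labels))
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(labels, rotation=90)
ax.set_yticklabels(labels)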
plt.show()

In [None]:
# In seaborn the correlation matrix can be drawn as a heatmap
sns.heatmap(correlations)

### Scatter plots

In [None]:
sns.lmplot(data=df, y='pH', x='chlorides', hue='quality', fit_reg=False)

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(df)
plt.show()