## Data Analysis

In this notebook we will introduce the Pandas Python package and use an example dataset to go over some basics of data analysis using Python. Specifically, we will cover:
* Introduction to Pandas
    * Loading data with Pandas
    * Manipulating data with Pandas
* Basic data analysis
    * Visualising data
    * Exploring correlations

First, let's import the required packages:

In [None]:
import numpy as np
import pandas as pd
%matplotlib inline

### Introduction to Pandas

[Pandas](https://pandas.pydata.org/) is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. Let's explore some of the benefits of Pandas.

#### Loading data with Pandas
Pandas offers a quick and convenient way to load data. For example, let's compare how you might load and print data from a csv file without Pandas and with Pandas.

*(1) Loading csv data without Pandas*:

In [None]:
arr_houses = np.genfromtxt('Data/houses.csv', delimiter = ',', dtype = '|U')
arr_houses

*(2) Loading csv data with Pandas*:

In [None]:
df_houses = pd.read_csv('Data/houses.csv', sep=',', header=0)
df_houses.head(5)

Let's take a moment to understand the differences between using Numpy and using Pandas to import a csv. 

When we import data using `np.genfromtxt` we create a *Numpy Array*, while when we import data using `pd.read_csv` we create a *Pandas DataFrame*. A Pandas DataFrame is similar to a Numpy Array, but it adds labelled columns, a separate data type for each column, and many built-in methods for analysis.

For example, one of the benefits of using a Pandas DataFrame over a Numpy Array is the ability to use `df.head` to display the *head* of a DataFrame in a readable tabular format. Note that `df.head(k)` returns only the first `k` rows (the first 5 if `k` is omitted).
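To make the difference concrete, here is a minimal sketch that reads the same csv text with both libraries. The inline data and its column names are invented for illustration and are not the contents of `Data/houses.csv`.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical csv content, used only for illustration
csv_text = "Square Meters,Number of Bathrooms\n80,1\n120,2\n95,1\n"

# Numpy: every cell, including the header row, becomes a string in one array
arr = np.genfromtxt(io.StringIO(csv_text), delimiter=',', dtype=str)

# Pandas: the header row becomes column labels and each column gets its own type
df = pd.read_csv(io.StringIO(csv_text))

print(arr.shape)         # the header counts as a row of data
print(df.shape)          # the header is not counted as a row
print(list(df.columns))  # column labels taken from the header
```

Because the DataFrame keeps the header as labels, columns can be selected by name rather than by position.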

Let's explore some more...

#### Manipulating data with Pandas
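In many cases, it is easier to manipulate data in a Pandas DataFrame than to manipulate data in a Numpy Array. For example, let us compute the mean of each numeric column in the `houses.csv` data.

In [None]:
# numeric_only=True skips any text columns, which recent Pandas versions would otherwise reject
df_houses.mean(numeric_only=True)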

Let us reduce the dataset to only houses with 1 bathroom.

In [None]:
df_houses_1bathroom = df_houses[df_houses['Number of Bathrooms']==1]
df_houses_1bathroom.head()

And let us compute the average square meters for houses with 1 bathroom.

In [None]:
df_houses_1bathroom['Square Meters'].mean()
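The filtering step above relies on a *boolean mask*. The following sketch breaks it into its parts, using a small hypothetical DataFrame (the values are invented and only stand in for `houses.csv`).

```python
import pandas as pd

# Hypothetical data standing in for houses.csv
df = pd.DataFrame({
    'Square Meters': [80, 120, 95, 150],
    'Number of Bathrooms': [1, 2, 1, 3],
})

# Step 1: the comparison yields one True/False value per row
mask = df['Number of Bathrooms'] == 1

# Step 2: indexing with the mask keeps only the rows marked True
df_one_bathroom = df[mask]

# Step 3: select a single column and aggregate it
mean_area = df_one_bathroom['Square Meters'].mean()

print(mask.tolist())
print(mean_area)
```

Writing the mask as a separate variable is optional, but it can make longer conditions easier to read and reuse.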

Finally, let us add the number of bedrooms to the number of bathrooms and create an additional column with the total number of known rooms.

In [None]:
df_houses['Total Rooms'] = df_houses['Number of Bedooms'] + df_houses['Number of Bathrooms']
df_houses.head()
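Arithmetic between two columns is applied row by row. As a sketch of an alternative, `DataFrame.assign` returns a new DataFrame with the extra column and leaves the original untouched. The data below is hypothetical.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    'Number of Bedrooms': [2, 3, 1],
    'Number of Bathrooms': [1, 2, 1],
})

# Adding two columns produces a new Series, computed one row at a time
total = df['Number of Bedrooms'] + df['Number of Bathrooms']

# assign() returns a copy with the new column; df itself is not modified
df_with_total = df.assign(**{'Total Rooms': total})

print(df_with_total)
```

The keyword-unpacking form (`**{...}`) is only needed here because the column name contains a space.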

### Example data
Let's now use Pandas to analyse some real data. The dataset we will use can be described as follows:

> Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.
> [SOURCE](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html)

Let's first use Pandas to import the data and look at its head.

In [None]:
df_diabetes = pd.read_csv('Data/diabetes.tsv', sep='\t', header=0)
df_diabetes.head()

The data appears to be a mix of integers and floats. We can check the type of each column using the `dtypes` attribute as follows.

In [None]:
df_diabetes.dtypes
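As a small sketch of how the column types can be used programmatically, the helper functions in `pd.api.types` test whether a column holds integers or floats. The DataFrame below is hypothetical; its column names merely resemble those of the diabetes data.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    'AGE': [59, 48, 72],
    'BMI': [32.1, 21.6, 30.5],
})

# One dtype per column
print(df.dtypes)

# Group the column names by the kind of number they hold
int_columns = [c for c in df.columns if pd.api.types.is_integer_dtype(df[c])]
float_columns = [c for c in df.columns if pd.api.types.is_float_dtype(df[c])]

print(int_columns)
print(float_columns)
```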

#### Visualising data
Let's now visualise the diabetes data in different ways.

Let's create a *histogram* plot of the values in the 'response of interest' column.

In [None]:
df_diabetes.hist('Y', bins=30);

(Try changing the number of bins in the above plot to see how the plot changes).
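To see what the `bins` argument controls without drawing a plot, the sketch below counts the same values with two different bin settings using `np.histogram`. The values are hypothetical and are not taken from the diabetes data.

```python
import numpy as np

# Hypothetical measurements, invented for illustration
values = np.array([25, 31, 40, 52, 60, 75, 90, 110, 150, 200])

# Each call splits the range [min, max] into equal-width bins and counts the values in each
counts_2, edges_2 = np.histogram(values, bins=2)
counts_5, edges_5 = np.histogram(values, bins=5)

print(counts_2, edges_2)
print(counts_5, edges_5)
```

Fewer bins give a coarser summary, while more bins show finer detail at the cost of noisier counts.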

Let's create a *scatter* plot to compare the values in the 'body mass index' column to the values in the 'response of interest' column.

In [None]:
df_diabetes.plot.scatter(x='BMI', y='Y');

Let's create another *scatter* plot to compare the values in the 3rd 'blood serum measurement' column to the 'response of interest' column.

In [None]:
df_diabetes.plot.scatter(x='S3', y='Y');

#### Exploring correlations
Note the different shapes in the above two scatter plots. The shape of the cloud of data points is partly due to the *correlation* between the variables on the `x` and `y` axes of the plot. Pandas can measure the correlation between every pair of columns in the diabetes DataFrame as follows.

In [None]:
df_diabetes_corr = df_diabetes.corr()
df_diabetes_corr.round(2)
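By default, `corr()` computes the Pearson correlation coefficient. As a sketch of what that number means, the example below computes it by hand for two hypothetical columns and compares the result with Pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.1, 3.9, 6.2, 8.1, 9.8],
})

# Deviation of each value from its column mean
x_dev = df['x'] - df['x'].mean()
y_dev = df['y'] - df['y'].mean()

# Pearson r: sum of products of deviations, scaled by the size of the deviations
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# The same quantity from the correlation matrix
r_pandas = df.corr().loc['x', 'y']

print(r_manual, r_pandas)
```

The coefficient always lies between -1 and 1, with values near 0 indicating little linear relationship.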

Looking along the **Y** row of the correlation matrix, and ignoring the correlation of **Y** with itself (which is always 1), we see that the highest value is in the **BMI** column and the lowest value (largest negative value) is in the **S3** column. This suggests that **BMI** is *positively correlated* with **Y** and **S3** is *negatively correlated* with **Y**.
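Rather than reading the row by eye, the most positive and most negative correlations can also be looked up in code. The sketch below uses a small hypothetical DataFrame with invented columns `A`, `B` and `Y`; the same pattern should apply to `df_diabetes_corr`.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [5, 3, 4, 1, 2],
    'Y': [2, 3, 5, 6, 9],
})

# Take the Y column of the correlation matrix and drop Y's correlation with itself
corr_with_y = df.corr()['Y'].drop('Y')

# Labels of the columns with the largest and smallest correlation with Y
most_positive = corr_with_y.idxmax()
most_negative = corr_with_y.idxmin()

print(corr_with_y.round(2))
print(most_positive, most_negative)
```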