# Data Visualization with Python and Jupyter

In this module of the course, we will use some of the libraries available with Python and Jupyter to examine our data set. To better understand the data, we can use visualizations such as charts, plots, and graphs. We'll use some common tools such as [`matplotlib`](https://matplotlib.org/users/index.html) and [`seaborn`](https://seaborn.pydata.org/index.html) and gather some statistical insights into our data.

We'll continue to use the [`insurance.csv`](https://www.kaggle.com/noordeen/insurance-premium-prediction/download) file from your project assets. If you have not already [`downloaded this file`](https://www.kaggle.com/noordeen/insurance-premium-prediction/download) to your local machine and uploaded it to your project, do that now.

## Table of Contents

1. [Using the Jupyter notebook](#jupyter)<br>
2. [Load the data](#data)<br>
3. [Visualize Data](#visualize)<br>
4. [Understand Data](#understand)<br>

<a id="jupyter"></a>
## 1. Using the Jupyter notebook

### Jupyter cells

After editing a cell in a Jupyter notebook, re-run it by pressing **`<Shift> + <Enter>`**. This makes the changes you made available to other cells.

Use **`<Enter>`** to make new lines inside a cell you are editing.

#### Code cells

Re-running will execute any statements you have written. To edit an existing code cell, click on it.

#### Markdown cells

Re-running will render the markdown text. To edit an existing markdown cell, double-click on it.

<hr>

### Common Jupyter operations

Near the top of the Jupyter notebook page, Jupyter provides a row of menu options (`File`, `Edit`, `View`, `Insert`, ...) and a row of tool bar icons (disk, plus sign, scissors, 2 files, clipboard and file, up arrow, ...).

#### Inserting and removing cells

- Use the "plus sign" icon to insert a cell below the currently selected cell
- Use "Insert" -> "Insert Cell Above" from the menu to insert above

#### Clear the output of all cells

- Use "Kernel" -> "Restart" from the menu to restart the kernel
    - click on "clear all outputs & restart" to have all the output cleared

#### Save your notebook file locally

- Clear the output of all cells
- Use "File" -> "Download as" -> "IPython Notebook (.ipynb)" to download a notebook file representing your session

<hr>

<a id="data"></a>
## 2.0 Load the data 

A lot of data is **structured data**: data that is organized and formatted so that it is easily readable, for example a table with variables as columns and records as rows, or key-value pairs in a NoSQL database. As long as the data is formatted consistently and has multiple records with numbers, text, and dates, you can probably read it with [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html), an open-source Python package for high-performance data manipulation and analysis.
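
As a minimal sketch of what "reading structured data" means in practice, the cell below parses a small CSV string with `pandas.read_csv`. The column names mirror those used later in this notebook, but the three rows are invented for illustration and are not taken from `insurance.csv`; `csv_text` and `sample_df` are hypothetical names.

```python
import io

import pandas as pd

# Hypothetical sample: same column names as the notebook's data set,
# but the values are made up purely for illustration.
csv_text = """age,sex,bmi,children,smoker,region,expenses
25,female,22.5,0,no,northeast,3000.00
40,male,30.1,2,yes,southeast,24000.00
33,female,27.3,1,no,northwest,5200.00
"""

# read_csv accepts any file-like object, so an in-memory string works
# the same way as a file on disk.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (3, 7): three records, seven variables
print(sample_df.dtypes)  # numeric columns are parsed as integers or floats
```

Each row becomes a record and each header entry becomes a column, which is the table layout described above.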

### 2.1 Load our data as a pandas data frame

**<font color='red'><< FOLLOW THE INSTRUCTIONS BELOW TO LOAD THE DATASET >></font>**

* Highlight the cell below by clicking it.
* Click the `10/01` "Find data" icon in the upper right of the notebook.
* Add the locally uploaded file `insurance.csv` by choosing the `Files` tab. Then choose the `insurance.csv`. Click `Insert to code` and choose `Insert Pandas DataFrame`.
* The code to bring the data into the notebook environment and create a Pandas DataFrame will be added to the cell below.
* Run the cell


In [None]:
# Place cursor below and insert the Pandas DataFrame for the Insurance Expense data


### 2.2 Update the variable for our Pandas dataframe

We'll use the Pandas naming convention `df` for our DataFrame. Make sure that the cell below assigns the variable name that was generated in the cell above. For the locally uploaded file, it should look like `df_data_1`, `df_data_2`, or `df_data_x`.

**<font color='red'><< UPDATE THE VARIABLE ASSIGNMENT TO THE VARIABLE GENERATED ABOVE. >></font>**

In [None]:
# Replace df_data_1 with the variable name generated above.
df = df_data_1

<a id="visualize"></a>
## 3.0 Visualize Data

Pandas uses [`Matplotlib`](https://matplotlib.org/users/index.html) as the default for visualizations.
In addition, we'll use [`Numpy`](https://numpy.org), which is "The fundamental package for scientific computing with Python".
The convention when using Jupyter notebooks is to import numpy as `np` and to import matplotlib.pyplot as `plt`. You can give these aliases any name you want, but you will often see them written this way.

Import the packages and also add the magic line starting with `%` to output the charts within the notebook. This is what is known as a [`magic command`](https://ipython.readthedocs.io/en/stable/interactive/magics.html).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

%matplotlib inline

### 3.1 Seaborn

Seaborn is a Python data visualization library based on matplotlib. It is an easy-to-use visualization package that works well with Pandas DataFrames.

Below are a few examples using Seaborn.

Refer to this [documentation](https://seaborn.pydata.org/index.html) for information on the many plots you can create.

In [None]:
import seaborn as sns

### 3.2 Statistical description

We can use the Pandas method [`describe()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) to get some statistics that will later be seen in our visualizations. This will include numeric data, but exclude the categorical fields.

In [None]:
df.describe()
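
To illustrate how `describe()` treats numeric and categorical columns differently, here is a small sketch on a toy DataFrame. The `toy` frame and its values are invented for illustration; only the `describe()` behavior carries over to the real data.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "age": [25, 40, 33, 51],
    "smoker": ["no", "yes", "no", "yes"],
    "expenses": [3000.0, 24000.0, 5200.0, 31000.0],
})

# On a mixed DataFrame, describe() summarizes only the numeric columns.
numeric_summary = toy.describe()

# Calling describe() on a text column gives count, unique, top, and freq.
categorical_summary = toy["smoker"].describe()

print(numeric_summary)
print(categorical_summary)
```

The numeric summary contains `age` and `expenses` but not `smoker`, which is why the categorical fields have to be inspected separately.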

### Question 1: Is there a relationship between BMI and insurance expenses?

We'll explore the data by asking a series of questions (hypotheses). Plots can help us find relationships and correlations.
[`Body Mass Index`](https://www.nhlbi.nih.gov/health/educational/lose_wt/BMI/bmicalc.htm) (BMI) is a measure of body fat based on height and weight that applies to adult men and women. It is often correlated with health outcomes, so let's use a [`Seaborn jointplot`](http://seaborn.pydata.org/generated/seaborn.jointplot.html) with a scatterplot to see if that holds for our data.

In [None]:
sns.jointplot(x=df["expenses"], y=df["bmi"], kind="scatter")

plt.show()

#### Answer:

It does not appear that there is a strong correlation between BMI and the expenses for these patients. We see from the histogram on the right that BMI is approximately normally distributed, and from the histogram on top we see that expenses are clustered around the lower amounts. It does not look like BMI would be a good predictor of the expenses.

### Question 2: Is there a relationship between gender and insurance expenses?

Our next hypothesis might be that there is a correlation between gender and expenses. We can use the [`Seaborn boxplot`](https://seaborn.pydata.org/generated/seaborn.boxplot.html). A boxplot uses quartiles to show how the data is distributed, and will give us a good comparison between the 2 categories in the `sex` column. The horizontal line through each box is the median value. The area above the median line is the 3rd quartile, representing the values of the 50th-75th percentiles, and the area below the median line is the 2nd quartile, representing the values of the 25th-50th percentiles. The rest of the data is collapsed into lines called "whiskers", and outliers are plotted as single points.
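
The quantities a boxplot draws can also be computed directly, which may make the description above more concrete. The sketch below uses an invented series of values (not the insurance data) and assumes the default whisker rule of 1.5 times the interquartile range that matplotlib and seaborn use; `values` and the other names are hypothetical.

```python
import pandas as pd

# Hypothetical expense values, invented for illustration.
values = pd.Series([1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 50000])

# The box spans the 25th to 75th percentiles; the line inside is the median.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# With the default rule, points farther than 1.5 * IQR from the box
# are drawn individually as outliers instead of being part of a whisker.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, median, q3)      # 3000.0 5000.0 7000.0
print(outliers.tolist())   # [50000]
```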

In [None]:
plt.figure(figsize = (5, 5))
sns.boxplot(x = 'sex', y = 'expenses',  data = df)

#### Answer:

Claims from males and females have approximately the same median (the value in the middle of the distribution). The 3rd quartile is "fatter" for the males, meaning there is a broader distribution of values, and it skews to a higher amount. The 4th quartile also skews higher for the males, so this category contains more of the higher expenses.

### Question 3: Is there a relationship between region and claim amount?

Perhaps there is a correlation between the various regions and the insurance expenses. We can once again use a series of boxplots to see the differences between the regions.

In [None]:
plt.figure(figsize = (10, 5))
sns.boxplot(x = 'region', y = 'expenses',  data = df)

#### Answer: 

In this case we see that the median values across regions are nearly the same. There is some variation in the distribution of expense values, and the southeast region has more of the higher values in the 3rd and 4th quartiles. The differences aren't particularly large, however, and it is unlikely that region would be a good predictor of expenses.

### Question: Is there a relationship between claim amount and smoking?

Given the overwhelming evidence that smoking causes mortality (death) and morbidity (disease), we might guess that there is a relationship between insurance claims and smoking.
Let's use a boxplot to examine this.

In [None]:
plt.figure(figsize = (5, 5))
sns.boxplot(x = 'smoker', y = 'expenses',  data = df)

#### Answer: 

We can see that the median, and indeed the entire interquartile range from 25% to 75%, is much higher in expense for the smokers than for the non-smokers. It looks like whether or not an individual is a smoker could be a good predictor of insurance expenses.
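
A comparison read off a boxplot can be checked numerically by grouping on the category and summarizing each group. This is a minimal sketch on invented data; with the real DataFrame, the analogous call would be `df.groupby('smoker')['expenses'].describe()`. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "smoker": ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    "expenses": [2000.0, 4000.0, 6000.0, 8000.0,
                 20000.0, 30000.0, 40000.0, 50000.0],
})

# describe() on a grouped column returns one row per group, including
# the 25%, 50% (median), and 75% values that define each box.
summary = toy.groupby("smoker")["expenses"].describe()

print(summary[["25%", "50%", "75%"]])
```

If the 25% value of one group lies above the 75% value of the other, the two boxes do not overlap at all, which is the pattern described in the answer above.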

### Question: Is the smoker group well represented?

We'll want to make sure that we have a pretty good sample size for both groups.

In [None]:
# Make the plot a little bigger
fig, ax = plt.subplots(figsize=(10, 7))

# Draw the counts on the axes created above
sns.countplot(x='smoker', data=df, ax=ax)

#### Answer:

Yes, it looks like the smokers form a large enough group to support a meaningful comparison with the non-smokers.
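
The bar heights in a count plot correspond to what `value_counts()` returns, so the group sizes can also be read as numbers. The sketch below uses an invented column (six non-smokers, two smokers) purely for illustration; on the real data the analogous call would be `df['smoker'].value_counts()`.

```python
import pandas as pd

# Hypothetical toy column; the values are invented for illustration.
toy = pd.DataFrame({"smoker": ["no"] * 6 + ["yes"] * 2})

counts = toy["smoker"].value_counts()                # absolute group sizes
shares = toy["smoker"].value_counts(normalize=True)  # fraction of all rows

print(counts)
print(shares)
```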

### Question: Is there a relationship between claim amount and age?

It seems reasonable to assume that there might be different insurance costs for different age groups. For example, older adults tend to require more health care.
Since this is continuous data, let's use a scatter plot to investigate.

In [None]:
sns.jointplot(x=df['expenses'], y=df['age'], kind='scatter')

plt.show()

#### Answer: 

Yes, it does look like claim amounts increase with age. Furthermore, there are interesting bands around the expenses for `$1,200`, up to `$3,000`, and above `$3,000`.
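
One way to check a trend across age groups numerically is to bin the ages and compare the median expense per bin. This is only a sketch: the data is invented, and the bin edges and labels are arbitrary assumptions chosen for the example, not values from the course material.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "age": [19, 24, 35, 38, 47, 52, 60, 63],
    "expenses": [1500.0, 2500.0, 5000.0, 6000.0,
                 9000.0, 10000.0, 13000.0, 14000.0],
})

# pd.cut assigns each age to a right-inclusive interval:
# (17, 30], (30, 45], (45, 65]. The edges are an arbitrary choice.
toy["age_band"] = pd.cut(
    toy["age"], bins=[17, 30, 45, 65], labels=["18-30", "31-45", "46-65"]
)

median_by_band = toy.groupby("age_band", observed=True)["expenses"].median()

print(median_by_band)
```

A median that rises from one band to the next is consistent with an upward trend in the scatter plot.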

<a id="understand"></a>
## 4.0 Understand data

Now that we have had a look at the data, let's bring some of this information together.

To look at the relationship between multiple variables, we can use the [`Seaborn pairplot()`](https://seaborn.pydata.org/generated/seaborn.pairplot.html) function. This will plot each of the numeric variables of the data set on both the x and y axes, in every possible combination. From this we can quickly see patterns that indicate the relationship between the variables.
We'll use the `hue` parameter to color the points by one categorical feature so that we can compare it against the other variables.


### 4.1 Impact of Smoking

See which variables correlate with smoking. `Red` indicates a smoker.

In [None]:
claim_pplot=df[['age', 'bmi', 'children', 'smoker', 'expenses']]
claim_pplot.head()
sns.pairplot(claim_pplot, kind="scatter", hue = "smoker" , markers=["o", "s"], palette="Set1")
plt.show()

#### Analysis:

We can see some interesting things from these plots. Although older people tend to have higher expenses, we can see from `age` vs. `expenses` that smoking is a more dominant feature. The same holds for `bmi` vs. `expenses`.

### 4.2 Impact of Gender

What is the correlation between the features and gender? `Red` is female, `Blue` is male.

In [None]:
claim_pplot=df[['age', 'bmi', 'children', 'sex', 'expenses']]
claim_pplot.head()
sns.pairplot(claim_pplot, kind="scatter", hue = "sex" , markers=["o", "s"], palette="Set1")
plt.show()

#### Analysis: 

Gender has very little impact on the expenses.

### 4.3 Impact of Region

In [None]:
claim_pplot=df[['age', 'bmi', 'children', 'region', 'expenses']]
claim_pplot.head()
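# Use filled markers for every region; recent seaborn versions raise an
# error when filled ("o", "s") and line-art ("x", "+") markers are mixed.
sns.pairplot(claim_pplot, kind="scatter", hue="region", markers=["o", "s", "D", "^"], palette="Set1")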
plt.show()

#### Analysis: 

Region does have some impact on the expenses, which can be seen in the `age` vs. `expenses` chart, where the `northeast` region appears in the lowest band more commonly, followed by the `northwest` region, and the `southeast` region is clearly higher and more prevalent in the highest band.

### Show correlations

We can quantify the correlations between features of the data set using the [`Pandas corr()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) method. This will output a table with a numerical value for the correlation coefficient of each pair of numeric features.

In [None]:
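# Pearson correlation is only defined for numeric columns, so select them
# explicitly; recent pandas versions raise an error on text columns.
df[['age', 'bmi', 'children', 'expenses']].corr(method='pearson')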

#### Analysis:

We can see from the correlation coefficients that the relationships among the numerical features are weak.
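
Pearson correlation is defined only for numeric columns, so a categorical field such as `smoker` does not appear in the table unless it is first encoded as numbers. The sketch below shows one possible encoding (0 for "no", 1 for "yes"). This step is an assumption added for illustration and is not part of the original notebook. The toy values are invented and were deliberately constructed so that smoking dominates, so the resulting coefficients say nothing about the real data set.

```python
import pandas as pd

# Hypothetical toy data, constructed so that expenses depend strongly on
# smoking and only mildly on age. Not taken from insurance.csv.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 25, 35, 45, 55],
    "smoker": ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    "expenses": [2000.0, 3000.0, 4000.0, 5000.0,
                 22500.0, 23500.0, 24500.0, 25500.0],
})

# Encode the binary category as 0/1 so it can enter the correlation table.
toy["smoker_flag"] = (toy["smoker"] == "yes").astype(int)

corr = toy[["age", "smoker_flag", "expenses"]].corr(method="pearson")

print(corr.round(2))
```

Applied to the real DataFrame, the same encoding would let the `smoker` column be compared with `age`, `bmi`, and `children` on a single numeric scale.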

## Summary: 

From our visual analysis of the data, we see that the best predictor of insurance claim expenses is whether or not the individual is a smoker.