#  Barclays x GA: Python Day 3 - Pandas and EDA

---

<a id="learning-objectives"></a>
## Learning Objectives
*After completing this notebook, you will be able to:*

- Define what Pandas is and how it relates to data science.
- Manipulate Pandas `DataFrames` and `Series`.
- Filter and sort data using Pandas.
- Manipulate `DataFrame` columns.
- Understand the different kinds of missing data, and know how to handle null and missing values.
- Visualise data with a range of different charts.

## Contents:
* [Introduction to Pandas](#pandas-intro)
* [DataFrame methods and attributes](#dataframe-methods)
* [Setting values](#setting-values)
* [Selecting columns](#selecting-cols)
* [Transforming columns](#transforming-cols)
* [Selecting rows](#selecting-rows)
* [Sorting data](#sorting-data)
* [Missing data](#missing-data)
* [Summary statistics](#summary-stats)
* [Value counts](#value-counts)
* [Grouping](#groupby)
* [Visualisations](#visualisations)


<a id="pandas-intro"></a>

# <font color='blue'> Introduction to Pandas

Pandas is a Python library that primarily adds two new datatypes to Python: `DataFrame` and `Series`.

- A `Series` is a sequence of items, where each item has a unique label (called an `index`).
- A `DataFrame` is a table of data. Each row has a unique label (the `row index`), and each column has a unique label (the `column index`).
- Note that each column in a `DataFrame` can be treated as a `Series` that shares the `DataFrame`'s row index.

Behind the scenes, these datatypes use the `numpy` (numerical Python) library. NumPy primarily adds the `ndarray` (n-dimensional array) datatype to Python. An `ndarray` is similar to a Python list, in that it stores ordered data. However, it differs in three respects:

* Each element has the same datatype (typically fixed-size, e.g., a 32-bit integer).
* Elements are stored contiguously (immediately after each other) in memory for fast retrieval.
* The total size of an `ndarray` is fixed.

Storing `Series` and `DataFrame` data in `ndarray`s makes Pandas faster and uses less memory than standard Python datatypes. Many libraries (such as scikit-learn) accept `ndarray`s as input rather than Pandas datatypes, so we will frequently convert between them.
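
As a small illustration of these datatypes, the sketch below builds a `Series` and a `DataFrame` by hand and shows how a column relates to an `ndarray`. The employer names and numbers are made up for the example; they are not taken from any dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# A Series: a sequence of values, each with an index label.
gap = pd.Series([12.5, 3.0, -1.2], index=['a', 'b', 'c'])

# A DataFrame: a table with a row index and a column index.
toy = pd.DataFrame({
    'employer': ['Alpha Ltd', 'Beta plc', 'Gamma LLP'],  # made-up names
    'gap_percent': [12.5, 3.0, -1.2],
})

# Selecting one column of a DataFrame gives back a Series.
column = toy['gap_percent']
print(type(column))

# to_numpy() exposes the column's values as a NumPy ndarray.
values = column.to_numpy()
print(type(values), values.dtype)
```

Calling `to_numpy()` is one way to hand data to a library that expects `ndarray`s rather than Pandas datatypes.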


## Using Pandas

Pandas is frequently used in data science because it offers a large set of commonly used functions, is relatively fast, and has a large community. Because many data science libraries also use NumPy to manipulate data, you can easily transfer data between libraries (as we will often do in this class!).

Pandas is a large library that typically takes a lot of practice to learn. 

It heavily overrides Python operators, resulting in odd-looking syntax. For example, given a `DataFrame` called `cars` which contains a column `mpg`, we might want to view all cars with mpg over 35. To do this, we might write: `cars[cars['mpg'] > 35]`. 

With standard Python lists, this would most likely raise a `TypeError`, because a list cannot be compared with a number.

Pandas also highly favors certain patterns of use. 

For example, looping through a `DataFrame` row by row is highly discouraged. 

Instead, Pandas favors using **vectorized functions** that operate column by column. (This is because each column is stored separately as an `ndarray`, and NumPy is optimized for operating on `ndarray`s.)
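
To make the contrast concrete, here is a minimal sketch that answers the same question twice, once with a row-by-row loop and once with a vectorized comparison. The `cars` table and its values are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical `cars` table, invented for this illustration.
cars = pd.DataFrame({
    'model': ['A', 'B', 'C', 'D'],
    'mpg': [28.0, 41.5, 36.2, 19.9],
})

# Row-by-row loop (discouraged): builds the result one row at a time.
looped = []
for _, row in cars.iterrows():
    if row['mpg'] > 35:
        looped.append(row['model'])

# Vectorized version: the comparison runs on the whole column at once
# and produces a boolean Series that is used to select rows.
mask = cars['mpg'] > 35
efficient = cars[mask]

print(looped)
print(efficient)
```

Both approaches pick out the same cars; on large tables the vectorized form is usually much faster and is shorter to write.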

Do not be discouraged if Pandas feels overwhelming. Gradually, as you use it, you will become familiar with which methods to use and the "Pandas way" of thinking about and manipulating data.

---
## <font color='red'> Exercise: Reading in pay gap data
    
Today we'll be working with a dataset on the gender pay gap across companies in the UK. 

Let's start by reading in a CSV as a Pandas `DataFrame`.

1. Use the `read_csv()` Pandas function to read in a file from the `data` directory (which is inside the directory this notebook is in). 

The file has been downloaded from https://gender-pay-gap.service.gov.uk/viewing/download. It's called `UK Gender Pay Gap Data - 2019 to 2020.csv`; read this in as a DataFrame called `pay_gap_2019_20`
    

2. Use the `head` command on `pay_gap_2019_20` to visually inspect the data. What's strange about it? Use `read_csv()` again but try playing around with the `header` parameter (e.g. `read_csv(header=5)`) until the final DataFrame looks right. What does the `header` parameter do?


3. Continue to inspect `pay_gap_2019_20` visually and figure out:

    
* What the data contains
 
* What each column corresponds to
    
* What each row corresponds to
    

4. Use `shape` to figure out how many rows are in `pay_gap_2019_20`. 

5. List as many potential data quality issues as you can in `pay_gap_2019_20`

---

<a id="dataframe-methods"></a>

# <font color='blue'> DataFrame Methods and Attributes

We've seen that the Pandas `DataFrame` is perhaps the most important class of object in Pandas. It comes with a set of attributes (or properties) and methods that can be applied specifically to ``DataFrame`` objects. 

We start by importing ``pandas`` and reading in a CSV file using the ``read_csv`` function. The ``header=2`` parameter specifies that the column names are in row ``2`` of the underlying CSV file (counting from ``0``), so the rows above it are ignored.

We preview the first five rows of the ``DataFrame`` using the ``head`` method. 


In [None]:
import pandas as pd

In [None]:
pay_gap_2019_20 = pd.read_csv('./data/UK Gender Pay Gap Data - 2019 to 2020.csv',header=2)
pay_gap_2019_20.head(5)
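
To see the effect of the `header` parameter in isolation, the sketch below reads a tiny CSV from a string instead of a file. The file contents are made up for illustration and only mimic the idea of a file with notes above the column names.

```python
import io
import pandas as pd

# Made-up file contents: two lines of notes, then the real column names.
raw = (
    "Downloaded from an example portal\n"
    "Report generated for illustration only\n"
    "EmployerName,DiffMeanHourlyPercent\n"
    "Alpha Ltd,12.5\n"
    "Beta plc,3.0\n"
)

# header=2 tells read_csv that the column names are on line 2, counting from 0.
# The lines before the header are ignored.
toy = pd.read_csv(io.StringIO(raw), header=2)
print(toy.columns.tolist())
print(toy.shape)
```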


We can access the index, which is a numbering system that labels each row with a unique number according to its position in the DataFrame (like indexing in a list)

In [None]:
pay_gap_2019_20.index

We can also quickly access the column names

In [None]:
pay_gap_2019_20.columns

The ``shape`` attribute is a good way of figuring out how big our dataset is

In [None]:
pay_gap_2019_20.shape

We can confirm that our ``DataFrame`` is the correct type

In [None]:
type(pay_gap_2019_20)     

----

## Checking data types

We can check the types of data in individual columns. **But first, we need to deliberately engineer a problem with our data by running the cell below**

In [None]:
pay_gap_2019_20 = pay_gap_2019_20.astype({'DiffMedianHourlyPercent': 'str',
                                         'DiffMeanBonusPercent': 'str',
                                         'DiffMeanHourlyPercent':'str'})

Now we can check the types using the `dtypes` attribute

In [None]:
pay_gap_2019_20.dtypes

We can see that most of the columns in our dataset are ``float64``, i.e. floating point or **decimal** numbers.

But we can also see that the `DiffMeanHourlyPercent`, `DiffMedianHourlyPercent` and `DiffMeanBonusPercent` columns are **not** a numeric type. Pandas typically labels a column as `object` when it contains strings or a mix of types.

Since we want Pandas to treat these columns as numeric columns, we need to convert them using the `to_numeric` function. 

In [None]:
pay_gap_2019_20['DiffMeanHourlyPercent'] = pd.to_numeric(pay_gap_2019_20['DiffMeanHourlyPercent'])



Now when we run `dtypes` again, we can see the `DiffMeanHourlyPercent` column has a numeric type.

In [None]:
pay_gap_2019_20.dtypes

That leaves the `DiffMedianHourlyPercent` and `DiffMeanBonusPercent` columns to convert. Instead of running `to_numeric()` two more times, it's more efficient to convert multiple columns to different types using the `astype` method.

**Note that the information we give Pandas about which columns to convert, and which types to convert them to, is formatted as a dictionary**

In [None]:
pay_gap_2019_20 = pay_gap_2019_20.astype({'DiffMedianHourlyPercent': 'float64',
                                         'DiffMeanBonusPercent': 'float64'})


Running `dtypes` a final time, we see that all the columns in our DataFrame are of the correct type.

In [None]:
pay_gap_2019_20.dtypes
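
Real data often contains entries that cannot be read as numbers at all. As a hedged sketch of one way to handle this, `pd.to_numeric` accepts `errors='coerce'`, which turns unparseable entries into `NaN`. The column below is made up for illustration.

```python
import pandas as pd

# Made-up column stored as text, with one entry that is not a number.
toy = pd.DataFrame({'gap_percent': ['12.5', '3.0', 'unknown']})
print(toy.dtypes)

# errors='coerce' turns values that cannot be parsed into NaN
# instead of raising an error.
toy['gap_percent'] = pd.to_numeric(toy['gap_percent'], errors='coerce')
print(toy.dtypes)
print(toy)
```

Coercing silently hides bad entries, so it is worth counting the resulting missing values before relying on the column.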

<a id="setting-values"></a>

# <font color='blue'> Setting values in a DataFrame

To change the value of a single element in a DataFrame, we use the `at` accessor.

We pass it the position of the element we want to set the value of, in the format `[index,column_name]`

In [None]:
pay_gap_2019_20.at[0,'Address'] = 'test value 2'

In [None]:
pay_gap_2019_20.head()

<a id="selecting-cols"></a>

# <font color='blue'> Selecting columns

Pandas DataFrames have structural similarities with Python-style lists and dictionaries. We can select, or extract, columns from a `DataFrame` using column names.



In the example below, we select a column of data using the name of the column in a similar manner to how we select a dictionary value with the dictionary key.

In [None]:
pay_gap_2019_20['EmployerName']

The result is a Pandas `Series`. We can think of this as being the Pandas equivalent of a list.

In [None]:
type(pay_gap_2019_20['EmployerName'])

We can also select a single column using double square brackets, i.e. by passing a list that contains one column name

In [None]:
pay_gap_2019_20[['EmployerName']]

The result is a DataFrame

In [None]:
type(pay_gap_2019_20[['EmployerName']])

We can select multiple columns using this syntax too.

In [None]:
pay_gap_2019_20[['EmployerName','Address']]

A neater way of doing it could be using this syntax, which does exactly the same thing.

In [None]:
columns_to_select = ['EmployerName','Address']  

pay_gap_2019_20[columns_to_select]            

<a id="transforming-cols"></a>

# <font color='blue'> Transforming columns
    
Once we've selected columns, we can perform transformations on them (e.g converting an entire column to lowercase) or calculations with them (e.g. adding two columns together to create a new column).

## Changing column names

There are a few different ways to change column names. 

### Renaming individual columns

Individual column names can be changed like this. We could add as many columns as we wanted to the dictionary below, in the format `{'old_column_name':'new_column_name'}`

`rename` is **not** an **in place** method by default, i.e. it returns a new DataFrame and leaves the underlying DataFrame unchanged. To change the underlying DataFrame, we pass an extra argument to the `rename` method: `inplace=True`

In [None]:
pay_gap_2019_20.head()

In [None]:
pay_gap_2019_20.rename(columns={'Address':'EmployerAddress'},inplace=True)


Now we can see the column has been renamed 

In [None]:
pay_gap_2019_20.head(2)

### Renaming all columns

It's also possible to rename **all** the columns in a DataFrame using the syntax

``DataFrame.columns = [full list of new column names]``
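
A minimal sketch of this syntax on a toy DataFrame follows. The column names are invented for the example.

```python
import pandas as pd

# Toy DataFrame with made-up column names.
toy = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# The new list must contain exactly one name per existing column, in order.
toy.columns = ['first', 'second', 'third']
print(toy.columns.tolist())

# A common variation: derive the new names from the old ones.
toy.columns = [name.upper() for name in toy.columns]
print(toy.columns.tolist())
```

If the list has the wrong number of names, Pandas raises an error rather than renaming only some of the columns.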

---

## Creating new columns

We can create new columns by performing calculations on existing columns. Let's say we want to create a new column that gives the Difference in Mean Hourly Pay as a proportion rather than a percentage. 

In [None]:
pay_gap_2019_20['DiffMeanHourlyPercent']/100

In [None]:
pay_gap_2019_20['DiffMeanHourlyProportion'] = pay_gap_2019_20['DiffMeanHourlyPercent']/100
pay_gap_2019_20.head()

## Removing columns

We can use the `drop` method to do this. Once again, unless we specify that the method is `inplace` the underlying DataFrame won't be changed.

In [None]:
pay_gap_2019_20.drop(columns=['DateSubmitted','DueDate'],inplace=True)
pay_gap_2019_20.head()

## Applying functions to columns

Sometimes we'll want to perform a calculation or operation on each row of a DataFrame column. There are a few different ways to do this.

### Vectorised functions

In Pandas it's discouraged to loop through all the rows in a DataFrame, applying a function or operation to each row. 

Vectorised functions, which quickly apply a function to an entire column without having to explicitly write a loop, are much faster and more efficient. 

Here are some examples.

We can convert columns to lowercase.

In [None]:
pay_gap_2019_20['EmployerName'] = pay_gap_2019_20['EmployerName'].str.lower()

In [None]:
pay_gap_2019_20.head(5)

We can replace strings. This can also be used to remove part of a string, by replacing it with the empty string `''`

In [None]:
pay_gap_2019_20['EmployerName'] = pay_gap_2019_20['EmployerName'].str.replace('limited','')

In [None]:
pay_gap_2019_20.head(5)

We can also perform calculations with entire columns.

In [None]:
pay_gap_2019_20['DiffMedianHourlyPercent']*100
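
The string methods above can be chained. The sketch below, on made-up employer names, lowercases the text, removes a word and then trims the space that the removed word leaves behind. The `str.strip()` step and the `regex=False` argument are additions for this illustration and are not used elsewhere in this notebook.

```python
import pandas as pd

# Made-up employer names for illustration.
names = pd.Series(['Alpha Limited', 'BETA LIMITED', 'Gamma plc'])

cleaned = (
    names
    .str.lower()                              # lowercase every entry
    .str.replace('limited', '', regex=False)  # remove the literal word
    .str.strip()                              # trim leftover spaces at the ends
)
print(cleaned.tolist())
```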

---
## <font color='red'> Exercise: Column calculations
    
1. Figure out how to use the `mean()` `DataFrame` method to work out the mean value of the 
`DiffMeanHourlyPercent` column.


2. Drop the `FemaleTopQuartile` column from the `DataFrame`.


---

<a id="selecting-rows"></a>

# <font color='blue'> Selecting rows

## Selecting rows by index

We can use the `loc` indexer to pick out a specific row of a DataFrame.

We use the syntax `loc[a,b]` where `a` is the index of the row we want to access, and `b` is the name of the column. 

As with lists, `:` means 'give me everything' so in this example below, we're accessing the **first** row of data and **all** the columns.

In [None]:
pay_gap_2019_20.loc[0,:]

We can specify a **range** of rows we want to extract. This gives us rows **0** to **2** **inclusive of row 2** and all the columns.

In [None]:
pay_gap_2019_20.loc[0:2,:]

We can specify rows and single columns, too.

In [None]:
pay_gap_2019_20.loc[0:2,'EmployerName']

Or the rows we want plus the list of columns we want.

In [None]:
pay_gap_2019_20.loc[0:2,['EmployerName','EmployerAddress']]

Or, the rows we want and the **range** of columns we want (notice the `:` operator again)

In [None]:
pay_gap_2019_20.loc[0:2,'EmployerName':'SicCodes']
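
Because `loc` slices include the end label, they behave differently from Python list slices. The sketch below shows this on a toy DataFrame, alongside `iloc`, a related Pandas indexer that selects by position and is not otherwise covered in this notebook. The data is made up.

```python
import pandas as pd

# Toy DataFrame with the default index 0, 1, 2, 3.
toy = pd.DataFrame({'name': ['a', 'b', 'c', 'd'], 'value': [10, 20, 30, 40]})

# loc slices by label and includes BOTH endpoints: rows 0, 1 and 2.
by_label = toy.loc[0:2, 'name']

# iloc slices by position like a Python list and excludes the stop: rows 0 and 1.
by_position = toy.iloc[0:2]

print(by_label.tolist())
print(len(by_position))
```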

## Selecting rows using logical tests

Often we won't know the exact index of the row we're looking for. 

Maybe we want to find all the rows where the `DiffMedianHourlyPercent` is greater than 10%.

We start by writing a **filter** or a logical test that will be `True` for the rows we're interested in. 

We're interested in the `DiffMedianHourlyPercent` column so our filter looks like this:

In [None]:
pay_gap_filter = pay_gap_2019_20['DiffMedianHourlyPercent']>10

In [None]:
pay_gap_filter

When we inspect this filter, we can see it's a long list of `True` and `False` values; the value of the filter is `True` for rows that pass the logical test and `False` for rows that don't.

Now we **apply** our filter to our DataFrame

In [None]:
pay_gap_2019_20[pay_gap_filter]

We can also write and apply our filter in a single step

In [None]:

pay_gap_2019_20[pay_gap_2019_20['DiffMedianHourlyPercent']>10]

It's also possible to combine logical tests using `and` and `or` operators. For example, to find all rows where `DiffMedianHourlyPercent` is greater than 10% **and** `DiffMeanHourlyPercent` is greater than 10%, we can write:

**Note that the `and` operator here is written as `&`**

In [None]:
pay_gap_filter_2 = (pay_gap_2019_20['DiffMedianHourlyPercent']>10) & (pay_gap_2019_20['DiffMeanHourlyPercent']>10)

pay_gap_2019_20[pay_gap_filter_2]



Similarly, to find all rows where `DiffMedianHourlyPercent` is greater than 10% **or** `DiffMeanHourlyPercent` is greater than 10%, we can write:

**Note that the `or` operator here is written as `|`**

In [None]:
pay_gap_filter_3 = (pay_gap_2019_20['DiffMedianHourlyPercent']>10) | (pay_gap_2019_20['DiffMeanHourlyPercent']>10)

new_df = pay_gap_2019_20[pay_gap_filter_3]


We can also use the ``str.contains()`` method to find all rows that contain a particular string.

In [None]:
pay_gap_2019_20['EmployerName'].str.lower().str.contains('school')

In [None]:
pay_gap_filter_4 = pay_gap_2019_20['EmployerName'].str.lower().str.contains('school')

pay_gap_2019_20[pay_gap_filter_4]
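
The filtering patterns in this section can be tried out on a small table where the answers are easy to check by eye. The data and column names below are made up; the `~` operator, which inverts a filter, is an extra that is not used elsewhere in this notebook.

```python
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({
    'employer': ['Hill School', 'Alpha Ltd', 'City College', 'Beta plc'],
    'median_gap': [15.0, 8.0, 12.0, -2.0],
    'mean_gap': [11.0, 14.0, 5.0, 1.0],
})

# Each comparison needs its own parentheses, because & and | bind
# more tightly than > in Python.
both = toy[(toy['median_gap'] > 10) & (toy['mean_gap'] > 10)]
either = toy[(toy['median_gap'] > 10) | (toy['mean_gap'] > 10)]

# ~ inverts a filter: rows whose name does NOT contain 'school'.
is_school = toy['employer'].str.lower().str.contains('school')
not_school = toy[~is_school]

print(both['employer'].tolist())
print(either['employer'].tolist())
print(not_school['employer'].tolist())
```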


---
## <font color='red'> Exercise: Filtering rows and columns
    
    
1. Select companies where the median hourly pay gap is in favour of women, i.e. where `DiffMedianHourlyPercent` is **negative**


2. Select companies that have 'college' in the name


3. Select companies that have a mean hourly pay gap greater than 10%, i.e. where `DiffMeanHourlyPercent` is greater than 10



<a id="sorting-data"></a>

# <font color='blue'> Sorting data
    
It's easy to sort data in ascending/descending order according to a particular column. We do this using the `sort_values` method.

In [None]:
pay_gap_2019_20.sort_values(by='DiffMedianHourlyPercent',ascending=True)

---
## <font color='red'> Exercise: Sorting data
    
1. Which company has the lowest median hourly pay gap?


2. Which companies have the top 5 highest mean hourly pay gap?



---

<a id="missing-data"></a>
# <font color='blue'> Handling missing values
    
Sometimes, values will be missing from the source data or as a byproduct of manipulations. It is very important to detect missing data. Missing data can:

- Make the entire row ineligible to be training data for a model.
- Hint at data-collection errors.
- Indicate improper conversion or manipulation.
- Actually not be missing — it sometimes means "zero," "false," "not applicable," or "entered an empty string."

In Pandas, a "null" value is typically either `None` or `np.nan` (Not a Number). 

Many fixed-size numeric datatypes (such as integers) do not have a way of representing `np.NaN`. So, numeric columns will be promoted to floating-point datatypes that do support it.
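
This promotion is easy to observe directly. In the sketch below, which uses made-up values, the same numbers are stored with an integer datatype until a missing value is introduced.

```python
import numpy as np
import pandas as pd

# A column of whole numbers is stored with an integer dtype.
complete = pd.Series([1, 2, 3])
print(complete.dtype)

# Introducing a missing value promotes the column to floating point,
# because NaN is itself a floating-point value.
with_gap = pd.Series([1, np.nan, 3])
print(with_gap.dtype)
print(with_gap.isnull().tolist())
```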

Let's check our gender pay gap dataset for missing values.

We can do this using the `isnull()` method and summing up the values for each column.

In [None]:
pay_gap_2019_20.isnull()

In [None]:
pay_gap_2019_20.isnull().sum()

We can choose to drop rows containing ``NaN`` values, or fill in ``NaN`` values with a string, float or other element of our choice. 

Be careful when doing either of these things; you could end up unintentionally removing rows, or filling in values that don't make sense or aren't accurate.

In this case, it would be important to clarify whether a ``NaN`` value in a particular column means the amount is zero, or whether it means the amount is unknown.

We can **fill in** NaN values with a value of our choice using `fillna()`. For example, it makes sense to fill in `CompanyLinkToGPGInfo` with a string like 'No URL'.

In [None]:
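# Assign the filled column back; calling fillna(..., inplace=True) on a selected
# column is not guaranteed to update the DataFrame in recent Pandas versions.
pay_gap_2019_20['CompanyLinkToGPGInfo'] = pay_gap_2019_20['CompanyLinkToGPGInfo'].fillna('No URL')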

We can now see that this column no longer has any missing values.

In [None]:
pay_gap_2019_20.isnull().sum()

We might want to **drop** rows where there is no company number provided, since this means we won't be able to look up the company on Companies House.

In [None]:
pay_gap_2019_20.dropna(subset=['CompanyNumber'],inplace=True)

Again, we can now see that there are no missing values in the `CompanyNumber` column.

In [None]:
pay_gap_2019_20.isnull().sum()
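
The same two steps can be checked on a toy table where the missing values are easy to count by hand. The column names and values below are made up for illustration.

```python
import pandas as pd

# Made-up data with missing values in both columns.
toy = pd.DataFrame({
    'company_number': ['001', None, '003', '004'],
    'link': ['http://example.com/a', None, None, 'http://example.com/d'],
})

# Fill missing links with a placeholder, assigning the result back to the column.
toy['link'] = toy['link'].fillna('No URL')

# Drop only the rows where company_number is missing.
toy = toy.dropna(subset=['company_number'])

print(toy.isnull().sum())
print(len(toy))
```

Only one row is dropped here: the row without a company number. The row that was missing just a link is kept, with the placeholder in its place.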

<a id="summary-stats"></a>

# <font color='blue'> Summary statistics

Pandas has a bunch of built-in methods to quickly summarize your data and provide you with a quick general understanding. 

The ``describe`` method gives summary statistics for the numeric columns in the data.

We'll continue working with our pay gap data.

In [None]:
pay_gap_2019_20.describe()

It's also possible to get summary statistics for all columns, including non-numeric ones.

In [None]:
pay_gap_2019_20.describe(include='all')

It's also possible to compute statistics like the median for individual columns.

In [None]:
pay_gap_2019_20['DiffMeanHourlyPercent'].median()

---
## <font color='red'> Exercise: Descriptive statistics
    
Interpret the results above to answer the following questions:

* What's the mean % difference in hourly pay between men and women, across all companies?
* What's the median % difference in hourly pay between men and women, across all companies?


Use your knowledge of `isnull()` (also available under the name `isna()`) to figure out:

* How many companies haven't provided a website address?
* How many companies don't give their employees bonuses? 


<a id="value-counts"></a>
# <font color='blue'> Getting value counts

Sometimes we might want to see the breakdown of different values in a column. This is easy with the `value_counts` method.

In our gender pay gap dataset, let's check the breakdown of company sizes

In [None]:
pay_gap_2019_20[['EmployerSize']]

In [None]:
pay_gap_2019_20['EmployerSize'].value_counts()

In [None]:
# normalize=True returns proportions instead of raw counts
pay_gap_2019_20['EmployerSize'].value_counts(normalize=True)


<a id="groupby"></a>
# <font color='blue'> Grouping data

Sometimes we might want a more detailed breakdown using more than one column. 

Let's look at the mean pay gap across all companies, grouped by company size.

In [None]:
pay_gap_2019_20.groupby('EmployerSize')['DiffMeanHourlyPercent'].mean()

Based on these results, is there a relationship between company size and pay gap?
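
A group mean on its own does not show how many rows sit behind it. As a hedged sketch, the `agg` method can compute several statistics per group in one call. The data below is made up for illustration.

```python
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({
    'size': ['small', 'small', 'large', 'large', 'large'],
    'gap': [10.0, 20.0, 5.0, 7.0, 9.0],
})

# agg() computes several statistics per group in one call.
summary = toy.groupby('size')['gap'].agg(['mean', 'median', 'count'])
print(summary)
```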

<a id="visualisations"></a>

# <font color='blue'> Visualisations

In this section, we'll learn about how plotting works in Pandas and Matplotlib. 

It's important to know that Pandas uses Matplotlib behind the scenes to make plots. 

So, you will notice that Pandas plotting methods often use parameter names similar to those of Matplotlib methods. You can also use Matplotlib functions in combination with Pandas methods to alter the plots after drawing them. 

For example, you can use Matplotlib's `xlabel` and `title` functions to label the plot's x-axis and title, respectively, after it is drawn.
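
The labelling step described above can be sketched as follows. The counts and label text are made up for illustration, and the block assumes Matplotlib is installed alongside Pandas.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up counts for illustration.
counts = pd.Series([5, 3, 8], index=['small', 'medium', 'large'])

# Pandas draws the chart and returns the Matplotlib Axes it drew on.
ax = counts.plot(kind='bar')

# Matplotlib functions then adjust the plot that was just drawn.
plt.xlabel('Company size')
plt.title('Number of companies by size')
```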

As we explore different types of plots, notice:

1. Different types of plots are drawn very similarly; they even tend to share parameter names.

2. In Pandas, calling `plot()` on a `DataFrame` is different to calling it on a `Series`. Although the methods are both named `plot`, they may take different parameters.

Toward the end of the lab, we will show some motivational plots using Seaborn, a popular statistics plotting library, as well as go more in-depth about how Matplotlib works.

Pandas documentation is a good, comprehensive source of information on different plotting functions and parameters.

[Link to Documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html)

Let's start by importing the libraries we'll be using.

In [None]:
# Numpy and Pandas for data manipulation
import pandas as pd
import numpy as np

# Import the two data visualisation libraries we'll be using
import seaborn as sns
import matplotlib.pyplot as plt

# Set some formatting parameters for this notebook
plt.style.use('fivethirtyeight')
%matplotlib inline
from IPython.display import HTML

# Increase default figure and font sizes for easier viewing.
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

---

## <font color='red'> Code-along: Reading in credit risk data
    
Today we'll be working with a dataset from Kaggle, which gives information about people applying for loans. 

Take a few minutes to read more about the dataset here: https://www.kaggle.com/c/home-credit-default-risk/overview

Then, read in the file `credit_risk.csv` from the `data` directory in this folder.

In [None]:
credit_df = pd.read_csv('./data/credit_risk.csv')
credit_df.head()

How many rows and columns are in the dataset?

In [None]:
credit_df.shape

Use `head` to preview the first few rows of the dataset, and look at the column names. Here's an explanation of what each column means:

* SK_ID_CURR: Client ID
* TARGET: This is 1 if the client has made at least one late loan payment, 0 if not
* CODE_GENDER: Gender of the client
* FLAG_OWN_CAR: This is 1 if the client owns a car, 0 if not
* FLAG_OWN_REALTY: This is 1 if client owns a house or flat, 0 if not
* CNT_CHILDREN: Number of children the client has
* AMT_INCOME_TOTAL: Income of the client
* AMT_GOODS_PRICE: For consumer loans it is the price of the goods for which the loan is given
* AMT_CREDIT: Amount of the loan
* NAME_INCOME_TYPE: Client's income type (businessman, working, maternity leave)
* NAME_EDUCATION_TYPE: Level of highest education the client achieved
* NAME_FAMILY_STATUS: Family status of the client
* REGION_POPULATION_RELATIVE: Normalized population of region where client lives (higher number means the client lives in more populated region)
* NAME_HOUSING_TYPE: What is the housing situation of the client (renting, living with parents, ...)
* DAYS_BIRTH: Client's age in days at the time of application
* DAYS_EMPLOYED: How many days before the application the person started current employment
* OCCUPATION_TYPE: What kind of occupation does the client have


In [None]:
credit_df.head()

<a id="histograms"></a>

# <font color='blue'> Histograms

Histograms show the spread of values within a single variable.


Let's create a histogram to show the spread of the number of children people have.

We set the number of buckets, or bars, to 5 and specify the limits of the x-axis.

In [None]:
credit_df['CNT_CHILDREN'].hist(bins=5,range=(0,5))

What does this histogram tell us?

* Most people applying for a loan have 0 children
* This histogram doesn't follow a 'bell curve' shape so we can say this variable isn't **normally distributed**
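
To see what a histogram is counting, the sketch below computes the bucket counts directly with NumPy's `np.histogram`, using the same `bins` and `range` idea as the plot. The numbers of children are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up numbers of children for ten applicants.
children = pd.Series([0, 0, 0, 1, 1, 2, 3, 0, 2, 5])

# bins=5 over the range 0 to 5 gives five buckets, each one unit wide.
counts, edges = np.histogram(children, bins=5, range=(0, 5))
print(edges)
print(counts)
```

Each count is the height of one bar. Note that the last bucket includes its upper edge, so values of 4 and 5 land in the same bar.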


## <font color='red'> Exercise: Histograms
    
Make histograms to show the spread of each of these variables:
* Client income: Set the number of bins to 10, and the x-axis range from 0 to 600,000
* Loan amount: Set the number of bins to 20, and the x-axis range from 0 to 2,500,000

Interpret these histograms, including:
* Are the variables **normally distributed**?
* Roughly what's the most commonly requested loan amount?
* What's the most common earnings bracket?

<a id="bar-charts"></a>

# <font color='blue'> Bar charts

## Simple bar charts

Now we want to make a visualisation to show how many loan applications came from people with different levels of education.

In [None]:
credit_df['NAME_EDUCATION_TYPE'].value_counts().plot(kind='bar',title='Loan applications by education level');

## <font color='red'> Exercise: Bar charts
    
Make bar charts showing:

* The different marital statuses of clients
* The housing situations of clients

Interpret each bar chart to figure out:

* What the most common marital status is for loan applicants
* What the most and least common housing situation is for loan applicants

<a id="box-plots"></a>
# <font color='blue'> Box plots

We can use boxplots to quickly summarize distributions and get a **five-number summary** of a dataset:

- min = minimum value
- 25% = first quartile (Q1) = median of the lower half of the data
- 50% = second quartile (Q2) = median of the data
- 75% = third quartile (Q3) = median of the upper half of the data
- max = maximum value

**Interquartile Range (IQR)** = Q3 - Q1

**Outliers:**

- below Q1 - 1.5 * IQR
- above Q3 + 1.5 * IQR
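
The outlier rule above can be sketched in a few lines. The incomes are made up for illustration. Note also that `quantile` interpolates between data points by default, so its quartiles can differ slightly from the "median of each half" description.

```python
import pandas as pd

# Made-up incomes (in thousands) with one extreme value.
incomes = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = incomes.quantile(0.25)
q3 = incomes.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged as outliers.
outliers = incomes[(incomes < lower) | (incomes > upper)]
print(q1, q3, iqr)
print(outliers.tolist())
```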

In [None]:
credit_df['AMT_INCOME_TOTAL'].plot(kind='box');

We can also look at the boxplots of income broken down by gender. This plot tells us:

* On average, male customers tend to earn more than women
* The **range** of women's salaries is smaller than men's

In [None]:
credit_df.boxplot(column='AMT_INCOME_TOTAL', by='CODE_GENDER',figsize=(10,8));


## <font color='red'> Exercise: Box plots
    
Make box plots to answer the following questions.

* Do men tend to apply for larger loans than women?
* Do people with higher levels of education have higher salaries on average?
* Do people with higher levels of education have more debt?


<a id="scatter-plots"></a>

# <font color='blue'> Scatter plots
    
Scatter plots can be used to show the relationship between two variables. 

Let's do this using the credit dataset, to show the relationship between income and loan amount.


In [None]:
credit_df.plot(kind='scatter',x='AMT_INCOME_TOTAL',y='AMT_CREDIT',xlim=(0,1000000))

What does this scatter plot tell us? 
* It looks like there's a positive correlation between a person's income and the size of the loan they're applying for

We can change the transparency of the dots to 0.3 using the `alpha` parameter.

In [None]:
credit_df.plot(kind='scatter',x='AMT_INCOME_TOTAL',y='AMT_CREDIT',xlim=(0,1000000),alpha=0.3)

Now let's set the colour of each point according to the value of the `TARGET` column, so each point is coloured according to whether a client has made a late payment or not. We do this using the `c` argument and the `colormap` option.


In [None]:
credit_df.plot(kind='scatter',x='AMT_INCOME_TOTAL',y='AMT_CREDIT',
                          c='TARGET',xlim=(0,1000000),colormap='bwr',alpha=0.5);


## <font color='red'> Exercise: Scatter plots
    
Make scatter plots to answer the following questions.

* Is there a correlation between a person's age **in years** and the size of the loan they're applying for? You'll need to create a new column in the DataFrame, containing the client's age in years.

* Is there a correlation between how many **years** a person has been employed for, and the size of the loan they're applying for? Again, you'll need to create a new column in the DataFrame containing the length of the client's employment in years.

You might need to use the following options to format your scatter plots:
* `xlim` and `ylim` to set the axis limits
* `alpha` to set the transparency of the points


<a id="pair-plots"></a>

# <font color='blue'> Pair plots
    
Often when we're exploring a large dataset, we'll want to answer questions like:
* What's the best subset of features to use in my model?
* Which of my features have the strongest relationship with my dependent variable?
* Which features have no correlation with my dependent variable so I can ditch them?
* What kind of relationship exists between a pair of variables? Is it linear, or something else?

Pair plots are a quick way of seeing the relationships between all the variables in our dataset in one go, and save the hassle of having to generate lots of scatter plots one by one.

**Note that we're only plotting the first 1000 rows of the dataset; plotting the whole dataset might slow your computer down!**

In [None]:
import seaborn as sns

# Plot only the first 1000 rows to keep the pair plot fast
sns.pairplot(credit_df.head(1000))

<a id="correlation-matrix"></a>

# <font color='blue'> Correlation matrix
    
Correlation (or the correlation coefficient) tells us whether there’s an association between two variables. It can only take values from -1 to 1. 

A strongly positive correlation between two variables X and Y means:
* When X is high, Y is high 
* When X is low, Y is low 

A strongly negative correlation between two variables X and Y means:
* When X is high, Y is low 
* When X is low, Y is high 

A correlation close to zero between two variables X and Y means there’s no linear association between them. The variables may still be related in a non-linear way.
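
These cases can be reproduced on a handful of made-up numbers. The sketch below uses `Series.corr`, which computes the Pearson correlation coefficient by default.

```python
import pandas as pd

# Made-up values for illustration.
x = pd.Series([1, 2, 3, 4, 5])
rises_with_x = 2 * x + 1    # goes up whenever x goes up
falls_with_x = 10 - 3 * x   # goes down whenever x goes up
curved = (x - 3) ** 2       # depends on x, but not in a straight line

print(x.corr(rises_with_x))
print(x.corr(falls_with_x))
print(x.corr(curved))
```

The third pair is clearly related, yet its correlation coefficient is zero, because the coefficient only measures straight-line relationships.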

A correlation matrix shows the correlation coefficient between every pair of variables in a dataset.


In [None]:
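# numeric_only=True skips text columns such as CODE_GENDER,
# which cannot be correlated directly
credit_df.corr(numeric_only=True)

In [None]:
sns.heatmap(credit_df.corr(numeric_only=True))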

---

## <font color='red'> Exercise: US election data 
    
Use pandas to read in the file `us_presidential_votes.csv` from the `data` folder, as a DataFrame called `votes`. 

Visually inspect the `DataFrame`. What do you think it contains? What does each row correspond to, and what does each column represent? Use the information available at https://www.kaggle.com/joelwilson/2012-2016-presidential-elections to help you!

Use the `columns` attribute to get a list of all the columns in the dataset. You'll need to refer back to this list when answering the questions below

Use `describe()` to find:

* The mean population across all counties
* The mean population density across all counties
* The smallest vote share achieved by Trump in any county
* The largest vote share achieved by Clinton in any county

Produce a histogram showing the spread of values for Trump's vote share across all counties. Is this variable normally distributed?

Produce a scatter plot to show the relationship between the proportion of people with a bachelor's degree in a county, and Trump's vote share in that county. How can we interpret this scatter plot?

Produce a scatter plot to show the relationship between the proportion of over-65s in a county, and Trump's vote share in that county. How can we interpret this scatter plot?

Now let's try to find out which demographic features in a county are most strongly correlated with a high vote share for Trump. Use a combination of correlation matrices and correlation heatmaps to decide on **one** variable that you think is the strongest predictor of Trump's vote share in a county.