![Ironhack logo](https://i.imgur.com/1QgrNNw.png)
# LAB | Statistics Visualization

## Introduction
We'll use the dataset to gain a deeper understanding of the distributions of some important variables.

We'll learn how to:
- get a first impression of a variable's distribution just by looking at the main descriptive statistics of our dataset.
- understand the effect of a normal and a non-normal distribution on our outlier analysis.
- understand how one variable can affect the distribution of another variable.

## Import libraries 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Read dataset from `weight-height-money.csv`.

Take a look at the first rows of the dataset.

In [None]:
# your answer here
# pandas was imported as `pd`, so call it through that alias
pd.read_csv("weight-height-money.csv")

## Use the describe method to understand the data.

In [None]:
# your answer here

## What can you observe? Compare mean and median values for each variable of our dataset.

In [None]:
# your answer here

## From that observation, what can you conclude? Can you imagine which kind of distribution each one has?

Try to predict which category each variable most likely seems to fall into (without plotting it yet):
- Left skewed
- Right skewed
- Gaussian-like

_hint: Remember the effect of outliers on the mean and the median. The comparison between the two often already gives meaningful insight into a variable's distribution. If the mean and median are close, you can suppose the distribution is roughly symmetric, with most of the data concentrated around the mean. However, if they are far apart, you can suppose that some extreme observations are pulling the mean toward them._
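
The hint can be checked on made-up numbers before touching the real data. The sketch below is only an illustration: the two samples are synthetic (a normal and an exponential draw chosen for this example), not the lab's dataset, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Synthetic samples for illustration only; these are NOT the lab's data.
symmetric = pd.Series(rng.normal(loc=50, scale=5, size=10_000))
right_skewed = pd.Series(rng.exponential(scale=50, size=10_000))

for name, s in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    gap = s.mean() - s.median()
    print(f"{name}: mean={s.mean():.2f}, median={s.median():.2f}, mean - median={gap:.2f}")
```

In the symmetric sample the gap should be close to zero, while in the right-skewed sample the long right tail pulls the mean well above the median.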

In [None]:
# your answer here

# Univariate Analysis

## Gender count

Use pandas to count how many `Male` and `Female` rows there are in this dataset.
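
As a hedged illustration of counting categories with pandas, here is a sketch on a toy column. The frame `toy` and its values are invented for this example, and `value_counts` is just one reasonable way to do it.

```python
import pandas as pd

# Toy column for illustration only; not the lab's data.
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male", "Female"]})

# value_counts() returns one row per distinct value, with its frequency
counts = toy["Gender"].value_counts()
print(counts)
```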

In [None]:
# your answer here

## Visual gender count

Use seaborn (sns) to plot how many `Male` and `Female` rows there are in the dataset.

_hint: If you don't know how to do this, you can google: seaborn + the pandas method to count items_

In [None]:
# your answer here

## Consider only Height

Create a pandas series of the `height` variable.

In [None]:
# your answer here

### Histogram-plot

Plot the histogram of the `height`

In [None]:
plt.figure(figsize=(12, 8))
# your answer here

### Box-plot

Plot the boxplot of the `height`

In [None]:
plt.figure(figsize=(12, 8))
# your answer here

What do you think the distribution of `height` is like? Do you think it is common for variables to behave like that in real life?

In [None]:
# your answer here

### KDE distribution for height

Plot the kde (kernel-density-estimation) distribution (without the hist distribution) of the `height`.

In [None]:
# your answer here

### Analysis

As we can see, most heights fall in the range from about 60 to 75. How many people are more than 3 standard deviations away from the mean? Can you consider them outliers? Why?

#### Calculate the mean

In [None]:
# your answer here

#### Calculate the standard deviation 

In [None]:
# your answer here

#### Calculate the values for the cutoffs:

`upper_cutoff = mean` <b><span style="color:red">+</span></b> `3 * standard_deviation` 

`lower_cutoff = mean` <b><span style="color:red">-</span></b> `3 * standard_deviation`

#### Now filter the original dataset. 

Use the values you calculated above to filter the original dataset. 

You should obtain a filtered dataset containing only the rows in which the `Height` column is greater than the upper cutoff or lower than the lower cutoff.
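
A minimal sketch of the cutoff-and-filter step, assuming a synthetic frame rather than the lab's dataset. The names `toy`, `value`, and `outside` are hypothetical. Note that a single value cannot be above the upper cutoff and below the lower cutoff at the same time, so the two conditions are combined with `|` (or).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Synthetic frame for illustration only; `toy` and `value` are hypothetical names.
toy = pd.DataFrame({"value": rng.normal(loc=100, scale=10, size=5_000)})

mean = toy["value"].mean()
std = toy["value"].std()  # pandas uses the sample standard deviation (ddof=1) by default

upper_cutoff = mean + 3 * std
lower_cutoff = mean - 3 * std

# Rows outside the band: above the upper cutoff OR below the lower cutoff.
outside = toy[(toy["value"] > upper_cutoff) | (toy["value"] < lower_cutoff)]

print(outside.shape[0], "of", toy.shape[0], "rows lie outside the 3-sigma band")
```

The parentheses around each comparison are needed because `|` binds more tightly than `>` and `<` in Python.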

In [None]:
# your answer here

Expected results:

|      | Gender   |   Height |   Weight |        Money |
|-----:|:---------|---------:|---------:|-------------:|
|  994 | Male     |  78.0959 | 255.691  | 1357.11      |
| 1317 | Male     |  78.4621 | 227.343  |    5.45797   |
| 2014 | Male     |  78.9987 | 269.99   |  131.474     |
| 3285 | Male     |  78.5282 | 253.889  |    0.0896631 |
| 3757 | Male     |  78.6214 | 245.734  |  204.113     |
| 6624 | Female   |  54.6169 |  71.3937 |  226.061     |
| 9285 | Female   |  54.2631 |  64.7001 |  646.532     |

#### Finally, calculate the shape of this filtered dataset and compare with the original dataframe.

What percentage of the rows fell outside these thresholds? Did you expect this value? Why?

In [None]:
# your answer here

## Now perform the same analysis for `money` variable.

You'll repeat exactly the same analysis for a variable of your dataset that behaves differently. Let's try to understand how.

## Consider only Money

Create a pandas series of the `money` variable.

In [None]:
# your code here

Shape

In [None]:
# your code here

### Histogram-plot

In [None]:
# your code here

### Box-plot

In [None]:
# your code here

### KDE distribution for money

In [None]:
# your code here

### Analysis

Again, how many people are more than 3 standard deviations away from the mean, above or below it (and what percentage of the dataset is that)? Let's do it step by step:

#### Calculate the mean

In [None]:
# your answer here

#### Calculate the standard deviation 

In [None]:
# your answer here.

#### Calculate the values for the cutoffs:

`upper_cutoff = mean` <b><span style="color:red">+</span></b> `3 * standard_deviation` 

`lower_cutoff = mean` <b><span style="color:red">-</span></b> `3 * standard_deviation`

#### Again, filter the original dataset. 

In [None]:
# your answer here

#### Finally, calculate the shape of this filtered dataset and compare with the original dataframe.

What percentage of the rows fell outside these thresholds? Did you expect this value?

In [None]:
# your answer here

Expected result:

|     | Gender   |   Height |   Weight |   Money |
|----:|:---------|---------:|---------:|--------:|
| 234 | Male     |  67.3698 |  176.636 | 3725.08 |
| 294 | Male     |  64.4252 |  169.109 | 3942.97 |
| 355 | Male     |  72.9386 |  216.097 | 3762.42 |
| 518 | Male     |  68.3465 |  178.676 | 3286.66 |
| 662 | Male     |  69.431  |  172.326 | 3798.71 |
|   ...   | ...   |   ... |   ... |   ... |
| 9873 | Female   |  63.7072 |  132.761 | 3164.37 |
| 9888 | Female   |  65.1059 |  149.695 | 3929.57 |
| 9922 | Female   |  58.7525 |  106.846 | 3541.68 |
| 9930 | Female   |  68.5444 |  148.828 | 3916.32 |
| 9946 | Female   |  66.6245 |  149.828 | 6535.36 |

Can you consider them outliers?

In [None]:
# your answer here

By now, you should have observed that in order to consider an observation an `outlier`, one needs to take into account the **distribution** of the variable. In fact, most statistical aspects do not mean anything until you understand the variable's distribution.
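
To make this point concrete, the sketch below compares the share of observations beyond 3 standard deviations for two synthetic samples. Both samples and the helper `share_outside_3_sigma` are assumptions made for this illustration; neither comes from the lab's dataset. For a normal distribution, about 0.27% of values are expected outside the band, and a skewed distribution can give a noticeably different share.

```python
import numpy as np

rng = np.random.default_rng(seed=2)


def share_outside_3_sigma(values):
    """Return the fraction of values more than 3 standard deviations from the mean."""
    mean, std = values.mean(), values.std()
    return np.mean(np.abs(values - mean) > 3 * std)


# Synthetic samples for illustration only; not the lab's data.
normal_sample = rng.normal(loc=0, scale=1, size=100_000)
skewed_sample = rng.exponential(scale=1, size=100_000)

normal_share = share_outside_3_sigma(normal_sample)
skewed_share = share_outside_3_sigma(skewed_sample)

print(f"normal:      {normal_share:.4%}")
print(f"exponential: {skewed_share:.4%}")
```

The same "3 standard deviations" rule flags several times more points in the skewed sample, even though none of them is unusual for that distribution.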

# Bivariate Analysis

## Considering both height and weight

Now we'll consider both height and weight variables to understand our data. We'll perform what is called a bivariate analysis.

### Perform a scatterplot to check the relation between Height and Weight

In [None]:
plt.figure(figsize=(12, 8))
# your answer here

### Do the same plot, but color the markers by Gender

In [None]:
plt.figure(figsize=(12, 8))
# your answer here

### What insights did you gain from coloring the plot by `Gender`?

In [None]:
# your answer here

### Create a variable called `gender_groupby` to group data by `Gender`. However, don't define any aggregations yet. 

Just perform the groupby operation.

In [None]:
# your code here

#### Run `gender_groupby.head()` to check the groups obtained.

In [None]:
# your answer here

#### Run `gender_groupby.describe().T` to check the statistics for each group.

_hint: You can transpose this result to obtain a better visualization of the results_
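
A hedged sketch of what grouping followed by `describe().T` produces, using a tiny invented frame. The values are made up and the names `toy`, `grouped`, and `summary` are hypothetical.

```python
import pandas as pd

# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Height": [70.0, 72.0, 63.0, 65.0],
    "Weight": [180.0, 200.0, 130.0, 140.0],
})

grouped = toy.groupby("Gender")   # no aggregation defined yet
summary = grouped.describe().T    # transpose: statistics as rows, groups as columns
print(summary)
```

After transposing, each row is a (variable, statistic) pair and each column is one group, which makes it easier to compare the groups side by side.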


In [None]:
# your answer here

The results above should give you some insight into the effect of gender on your dataset, both visually and numerically.

## Distribution plots

### Verifying the distribution of your variables for each Gender.

We have seen that each variable in our dataset has its own "DNA": its distribution. However, the story does not stop there. Each variable has peculiarities within it, and it is our job as data analysts to discover them. We'll see later that what `machine learning models` mostly do for us is automate this process (if we understand them, of course).

In this case, specifically, we want to understand the effect of our `Gender` variable on the distribution of our dataset.


### First, plot the distribution of the `Height` variable again.

In [None]:
# your answer here

### Now, filter your dataset for each gender. Create a dataframe called `men` and another called `women` and plot the `Height` distribution for each of them in the same plot.

In [None]:
# your answer here

### What insights did you get from that? What is the impact of `Gender` on `Height`?

In [None]:
# your answer here

**Extra Note:** The issue https://github.com/mwaskom/seaborn/issues/861 discusses why `distplot` doesn't have a `hue` argument and how to work around it (look for `FacetGrid`).

Try to do the same for the `Money` variable. What is the impact of `Gender` on `Money`?

_Hint: for the Money variable, try specifying `hist=False`_


In [None]:
# your answer here

## Boxplot 

### Gender vs Height

Plot the boxplot considering the x-axis as `Gender` and y-axis as `Height`

In [None]:
plt.figure(figsize=(12, 8))
# your answer here

### Gender vs Money

In [None]:
plt.figure(figsize=(12, 8))
# your answer here

From the conclusions of the previous exercises, did you expect the boxplots to look like the ones above?

In [None]:
# your answer here

# Multivariate Analysis

Use `sns.pairplot` to see the pairwise combinations of the variables explored so far. Use `hue='Gender'`.

Note that in a real problem, a pairplot quickly gets messy, since a dataset can contain a very large number of variables. Use it wisely.

People often plot this graph and then don't draw any conclusions from it. Don't fall into that trap.

In [None]:
# your answer here