


# Your good friend descriptive statistics

Let's take a look at **central tendency** and **variability** with a simple data set.

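The most familiar measure of central tendency is the **mean**: add all the values together and divide by how many there are. We could do that by hand, but `pandas` can help us out here with the `.mean()` method.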

In [None]:
import pandas as pd
df = pd.DataFrame([
    {'name': 'Smushface', 'salary': 1200},
    {'name': 'Jen', 'salary': 25000},
    {'name': 'James', 'salary': 55000},
    {'name': 'John', 'salary': 35000},
    {'name': 'Josephine', 'salary': 25000},
    {'name': 'Jacques', 'salary': 15000},
    {'name': 'Bill Gates', 'salary': 100000},
])

df

The problem with adding everything together is **Bill Gates is exerting undue influence**. His salary is an **outlier** - a number that's either way too high or way too low and kind of screws up our data. He might be making that much money, but by taking the mean we aren't doing a good job describing what we'd think of as the "average".


**The mean is susceptible to outliers** because of the way it is calculated. You need to be careful with it, and because of that, the mean is definitely not my favourite way of getting the average.
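
To see how much one outlier can pull the mean, here is a minimal sketch using the same seven salaries. The variable names (`salaries`, `mean_all`, `below_mean`, `mean_without_outlier`) are just for illustration, and the Series is rebuilt here so the cell runs on its own.

```python
import pandas as pd

# The same seven salaries as in the DataFrame above, indexed by name
salaries = pd.Series(
    [1200, 25000, 55000, 35000, 25000, 15000, 100000],
    index=['Smushface', 'Jen', 'James', 'John', 'Josephine', 'Jacques', 'Bill Gates'],
)

# Mean of everyone, Bill Gates included
mean_all = salaries.mean()

# How many people earn less than that "average"?
below_mean = (salaries < mean_all).sum()

# Mean again, this time leaving the outlier out
mean_without_outlier = salaries.drop('Bill Gates').mean()

print(mean_all, below_mean, round(mean_without_outlier, 2))
```

With Bill Gates included the mean is 36,600, and five of the seven people earn less than that. Drop him and the mean falls to roughly 26,033, which is a lot closer to what most of the group actually makes.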

## The MEDIAN

The **median** is like a new, improved mean, in that it describes the central tendency **without being susceptible to outliers**. To compute the median you do two things:

1. Order the numbers smallest to largest
2. Pick the middle number

In [None]:
df['salary'].sort_values()

We have seven values, so the median is the fourth one. Count down the sorted list to find it: **25,000** is the median. I'll prove it, too, using the power of `pandas`.

In [None]:
df['salary'].median()

See? Told you!

If you happen to have an **even number of data points**, you won't have a middle number. In that case, you'll take the **mean of the middle two numbers**.
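
Here is a small sketch of the even-count case. It takes the seven salaries from above and adds one **hypothetical** eighth salary of 30,000 (not part of the original data set) so that there is no single middle value.

```python
import pandas as pd

# Seven original salaries plus one hypothetical eighth value (30000)
salaries = pd.Series([1200, 25000, 55000, 35000, 25000, 15000, 100000, 30000])

# Step 1: order the numbers
ordered = salaries.sort_values().reset_index(drop=True)

# Step 2: with an even count, average the two middle numbers
middle = len(ordered) // 2
manual_median = (ordered.iloc[middle - 1] + ordered.iloc[middle]) / 2

print(manual_median)      # computed by hand
print(salaries.median())  # computed by pandas
```

The two middle values are 25,000 and 30,000, so both lines print 27500.0. Notice that with an even count the median doesn't have to be a value that actually appears in the data.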

My favorite description of the median comes from [Statistics for the Terrified](http://www.conceptstew.co.uk/pages/mean_or_median.html):

> We are all much more familiar with the mean - why? People like using the mean because it is a much easier thing to deal with than the median, mathematically, particularly in more complex situations...
> ...
> Always use the median when the distribution is skewed. You can use either the mean or the median when the population is symmetrical, because then they will give almost identical results.

Which to me reads like "if you have a computer, **use the median.**"
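
The advice in the quote is easy to check with a quick sketch. The `symmetric` values below are made up purely for illustration; `skewed` is the salary data from earlier.

```python
import pandas as pd

# Hypothetical symmetric data, invented for this example
symmetric = pd.Series([10, 20, 30, 40, 50])

# The skewed salary data from earlier in this notebook
skewed = pd.Series([1200, 25000, 55000, 35000, 25000, 15000, 100000])

for label, values in [('symmetric', symmetric), ('skewed', skewed)]:
    print(label, 'mean:', values.mean(), 'median:', values.median())
```

For the symmetric data the mean and median are both 30. For the salaries the mean is 36,600 while the median is 25,000, which is exactly the kind of gap the quote is warning about.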


## The MODE

The **mode** is the least-used measurement of central tendency: it's the **most popular value**. Even though our salary dataset has a most popular value (25,000 shows up twice), the mode actually shouldn't be used with *continuous* data, where exact repeats are mostly a coincidence. You should only use it with discrete data.
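
Here's a rough sketch of why continuous data is a bad fit. The `heights` values are hypothetical measurements invented for this example. When no value repeats, every value ties for "most popular", and `.mode()` hands all of them back.

```python
import pandas as pd

# Hypothetical continuous measurements: no two values are identical
heights = pd.Series([1.62, 1.75, 1.80, 1.68])

# Every value occurs exactly once, so every value is a mode
modes = heights.mode()
print(modes)
```

A "most popular value" that is just the whole data set again doesn't describe anything.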

Let's say our friends are reviewing a restaurant. 

In [None]:
import pandas as pd

# Let's build a data set
reviews_df = pd.DataFrame([
    {'restaurant': 'Burger King', 'reviewer': 'Smushface', 'yelp_stars': 2},
    {'restaurant': 'Burger King', 'reviewer': 'Jen', 'yelp_stars': 2},
    {'restaurant': 'Burger King', 'reviewer': 'James', 'yelp_stars': 5},
    {'restaurant': 'Burger King', 'reviewer': 'John', 'yelp_stars': 4},
    {'restaurant': 'Burger King', 'reviewer': 'Josephine', 'yelp_stars': 4},
    {'restaurant': 'Burger King', 'reviewer': 'Jacques', 'yelp_stars': 3},
    {'restaurant': 'Burger King', 'reviewer': 'Bill Gates', 'yelp_stars': 2},
])
reviews_df

In [None]:
reviews_df['yelp_stars'].mode()

Although most people gave Burger King a `3` or above, the fact that **the most popular score is `2`** might mean something.
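
If you want to see where the mode comes from, `.value_counts()` shows how often each score appears. This is a small sketch that rebuilds just the star ratings as a Series so it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# The seven star ratings from the reviews above
stars = pd.Series([2, 2, 5, 4, 4, 3, 2])

# How many times each score was given, ordered by score
counts = stars.value_counts().sort_index()
print(counts)

# How many reviewers gave 3 stars or more
three_or_above = (stars >= 3).sum()
print(three_or_above)

# .mode() returns a Series, because there can be ties for most popular
print(stars.mode().tolist())
```

Three people gave a `2`, which beats every other individual score, even though four of the seven reviewers gave a `3` or above.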

My favourite example of the mode being useful (and possibly the only example of the mode being useful) is **Amazon reviews.** For example, [this charger for a MacBook](https://www.amazon.com/Apple-Magsafe-Adapter-Charger-MacBook/dp/B014Z9P2VI/ref=sr_1_3?ie=UTF8&qid=1467768369&sr=8-3&keywords=macbook+charger) has some... interesting reviews.

## Let's quickly review the measures of central tendency

There are three measures of central tendency.

* The **mean** is the sum of all the numbers divided by the count of the numbers, and is pulled towards outliers.
* The **median** is the middle number once the values are sorted, and is not thrown off by outliers.
* The **mode** is the most frequent number, should only be used with discrete data, and isn't really useful.

The median should probably be your favorite.




OK, let's go back to the FutureLearn platform.