# Understanding Descriptive Statistics

Import the necessary libraries here:

In [41]:
# Libraries

import pandas as pd
import numpy as np
import random as rd
import math

## Challenge 1
#### 1.- Define a function that simulates rolling a dice 10 times. Save the information in a dataframe.
**Hint**: you can use the *choices* function from module *random* to help you with the simulation.

In [3]:
def roll_the_dice(number_of_rolls):
    # A standard die has six faces, so the range must run from 1 to 6 inclusive.
    return rd.choices(list(range(1, 7)), k=number_of_rolls)

df = pd.DataFrame(roll_the_dice(10), columns=['dice_results'])
df.shape

(10, 1)
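
Because the rolls are random, every run of the cell gives different results. A minimal sketch, using only the standard library, of how a seed makes the simulation repeatable; the `seed` parameter and the name `roll_the_dice_seeded` are illustrative assumptions, not part of the original exercise.

```python
import random


def roll_the_dice_seeded(number_of_rolls, seed=None):
    """Simulate rolls of a six-sided die; a fixed seed repeats the sequence."""
    rng = random.Random(seed)   # independent generator, leaves the global one untouched
    faces = list(range(1, 7))   # faces 1 through 6
    return rng.choices(faces, k=number_of_rolls)


first_run = roll_the_dice_seeded(10, seed=42)
second_run = roll_the_dice_seeded(10, seed=42)
print(first_run == second_run)  # same seed, same sequence
```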

#### 2.- Plot the results sorted by value.

In [148]:
df.sort_values(by=['dice_results'])

| | dice_results |
|---|---|
| 4 | 1 |
| 2 | 2 |
| 7 | 2 |
| 5 | 3 |
| 6 | 3 |
| 0 | 4 |
| 3 | 4 |
| 9 | 4 |
| 1 | 5 |
| 8 | 5 |


#### 3.- Calculate the frequency distribution and plot it. What is the relation between this plot and the plot above? Describe it with words.

In [150]:
df['dice_results'].value_counts()

4    3
5    2
3    2
2    2
1    1
Name: dice_results, dtype: int64

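*COMMENT:* Sorting by value lists every individual occurrence, while `value_counts()` shows the number of occurrences of each value, so the count of a value equals the length of its run in the sorted results.
Here `sort_values` is called on the DataFrame and `value_counts` on the `dice_results` Series; in recent pandas versions both methods are available on Series and DataFrames.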
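
The relation between the two views can be checked with a small sketch that uses only the standard library. The list `rolls` reproduces the ten results shown above in their original index order, and the text bars stand in for the two plots.

```python
from collections import Counter
from itertools import groupby

# The ten rolls from the output above, in original index order.
rolls = [4, 5, 2, 4, 1, 3, 3, 2, 5, 4]

sorted_rolls = sorted(rolls)

# "Plot" of the sorted results: one bar per roll, bar length = face value.
for value in sorted_rolls:
    print("#" * value)

# Frequency distribution: one bar per face, bar length = number of occurrences.
frequencies = Counter(rolls)
for face in sorted(frequencies):
    print(f"{face}: {'#' * frequencies[face]}")

# Each plateau in the sorted plot is as wide as the matching frequency bar is long.
plateau_widths = {face: len(list(group)) for face, group in groupby(sorted_rolls)}
print(plateau_widths == dict(frequencies))
```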

## Challenge 2
Now, using the dice results obtained in *challenge 1*, you are going to define some functions that will help you calculate the mean of your data in two different ways, the median and the four quartiles.

#### 1.- Define a function that computes the mean by summing all the observations and dividing by the total number of observations. You are not allowed to use any methods or functions that directly calculate the mean value. 

In [14]:
def mean_by_sum(series):
    return np.sum(series.tolist())/series.size

In [15]:
mean_by_sum(df['dice_results']) == df['dice_results'].mean()

True

#### 2.- First, calculate the frequency distribution. Then, calculate the mean using the values of the frequency distribution you've just computed. You are not allowed to use any methods or functions that directly calculate the mean value. 

In [27]:
def mean_by_frec_distr(series):
    series_mult = [i*j for i,j in zip(series.value_counts().index.tolist(), series.value_counts().tolist())]
    return np.sum(series_mult) / series.size

In [28]:
mean_by_frec_distr(df['dice_results']) == df['dice_results'].mean()

True
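
Both approaches give the same number because grouping equal observations only reorders the sum. A standard-library sketch with the ten rolls from challenge 1 (the variable names are illustrative):

```python
import math
from collections import Counter

rolls = [4, 5, 2, 4, 1, 3, 3, 2, 5, 4]

# Mean as the plain sum of observations divided by their number.
mean_from_sum = sum(rolls) / len(rolls)

# Mean from the frequency distribution: sum of value * count over the total count.
frequencies = Counter(rolls)
mean_from_frequencies = (
    sum(value * count for value, count in frequencies.items())
    / sum(frequencies.values())
)

print(mean_from_sum, mean_from_frequencies)
# math.isclose is a safer comparison than == when floats are involved.
print(math.isclose(mean_from_sum, mean_from_frequencies))
```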

#### 3.- Define a function to calculate the median. You are not allowed to use any methods or functions that directly calculate the median value. 
**Hint**: you might need to define two computation cases depending on the number of observations used to calculate the median.

In [166]:
def median_by_lenght(sorted_list):
    middle = len(sorted_list) // 2

    if len(sorted_list) % 2 == 0:
        # Even number of observations: average the two central values.
        indices = [middle - 1, middle]
    else:
        # Odd number of observations: the single central value is the median.
        indices = [middle, middle]

    middle_values = [sorted_list[indices[0]], sorted_list[indices[1]]]
    return [np.sum(middle_values)/len(middle_values), indices]

In [167]:
sorted_list = df['dice_results'].sort_values().tolist()

median_by_lenght(sorted_list)[0] == np.median(df['dice_results'])

True
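
The two cases mentioned in the hint can also be sketched without NumPy. The function name `median_of_sorted` is hypothetical, and `statistics.median` is used only to check the result.

```python
import statistics


def median_of_sorted(sorted_values):
    """Median of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("median of an empty list is undefined")
    middle = len(sorted_values) // 2
    if len(sorted_values) % 2 == 0:
        # Even count: average the two central observations.
        return (sorted_values[middle - 1] + sorted_values[middle]) / 2
    # Odd count: the single central observation.
    return sorted_values[middle]


even_case = sorted([4, 5, 2, 4, 1, 3, 3, 2, 5, 4])  # the ten rolls from challenge 1
odd_case = even_case[:-1]                            # drop one roll to get nine values

print(median_of_sorted(even_case) == statistics.median(even_case))
print(median_of_sorted(odd_case) == statistics.median(odd_case))
```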

#### 4.- Define a function to calculate the four quartiles. You can use the function you defined above to compute the median but you are not allowed to use any methods or functions that directly calculate the quartiles. 

In [174]:
def quartiles_by_length(series, quantiles):
    names = ['lower_quartile', 'median', 'upper_quartile']
    sorted_series = series.sort_values()  # sort once instead of on every lookup

    for name, quantile in zip(names, quantiles):
        position = series.size * quantile

        if position % 1 == 0:
            # Whole-number position: average the observation before it and the one at it.
            lower, upper = int(position) - 1, int(position)
        else:
            # Fractional position: average the observations on either side of it.
            lower, upper = math.floor(position), math.ceil(position)

        yield [name, (sorted_series.iloc[lower] + sorted_series.iloc[upper]) / 2]

In [177]:
dict_quartiles = {val[0]:val[1] for val in quartiles_by_length(df['dice_results'], [.25, .5, 0.75])}
print(f'My result \t{list(dict_quartiles.values())}')

# COMPARING W/ PANDAS AND NUMPY
pandas_quartiles = df['dice_results'].sort_values().quantile([0.25, 0.5, 0.75]).tolist()
numpy_quartiles = [np.percentile(df['dice_results'].sort_values().tolist(), 25),
                   np.percentile(df['dice_results'].sort_values().tolist(), 50),
                   np.percentile(df['dice_results'].sort_values().tolist(), 75)]

print(f"PANDAS \t\t{pandas_quartiles}")
print(f"NUMPY \t\t{numpy_quartiles}")

My result 	[2.5, 3.5, 4.5]
PANDAS 		[2.25, 3.5, 4.0]
NUMPY 		[2.25, 3.5, 4.0]


In [178]:
"""
Dear TA: I looked online for why my result differs from pandas/NumPy, and
other people seem to run into the same discrepancy. Both libraries interpolate
linearly between neighbouring observations by default, whereas my function averages them.

An alternative that might be acceptable is to reuse the median function I defined earlier
and split the sorted list into two halves, but that only works for Q1 and Q3.
"""

sorted_list = df['dice_results'].sort_values().tolist()

median, median_indices = median_by_lenght(sorted_list)

Q1 = median_by_lenght(sorted_list[:median_indices[0]])[0]
Q3 = median_by_lenght(sorted_list[median_indices[-1] + 1:])[0]

quartiles = {k:val for k,val in zip(['Q1', 'median', 'Q3'], [Q1, median, Q3])}
quartiles

{'Q1': 2.0, 'median': 3.5, 'Q3': 4.5}
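
The mismatch with pandas and NumPy comes from the quartile convention rather than from an arithmetic slip: by default both libraries interpolate linearly at position `(n - 1) * q` in the sorted data. A minimal sketch of that rule, with a hypothetical helper name `quantile_linear`:

```python
import math


def quantile_linear(sorted_values, q):
    """Quantile by linear interpolation at position (n - 1) * q."""
    position = (len(sorted_values) - 1) * q
    lower = math.floor(position)
    upper = min(lower + 1, len(sorted_values) - 1)  # stay inside the list when q == 1
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


sorted_rolls = [1, 2, 2, 3, 3, 4, 4, 4, 5, 5]  # sorted rolls from challenge 1
linear_quartiles = [quantile_linear(sorted_rolls, q) for q in (0.25, 0.5, 0.75)]
print(linear_quartiles)  # [2.25, 3.5, 4.0], matching the pandas and NumPy output above
```

Neither convention is wrong; they are different definitions of a quartile, and they diverge most on small samples such as these ten rolls.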

## Challenge 3
Read the csv `roll_the_dice_hundred.csv` from the `data` folder.
#### 1.- Sort the values and plot them. What do you see?

In [187]:
path = '../data/roll_the_dice_hundred.csv'

df = pd.read_csv(path)
print(df.shape)
df.head()

(100, 3)


| | Unnamed: 0 | roll | value |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 2 |
| 2 | 2 | 2 | 6 |
| 3 | 3 | 3 | 1 |
| 4 | 4 | 4 | 6 |


In [188]:
df.sort_values(by=['value'])

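| | Unnamed: 0 | roll | value |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 47 | 47 | 47 | 1 |
| 56 | 56 | 56 | 1 |
| 9 | 9 | 9 | 1 |
| 73 | 73 | 73 | 1 |
| ... | ... | ... | ... |
| 17 | 17 | 17 | 6 |
| 11 | 11 | 11 | 6 |
| 24 | 24 | 24 | 6 |
| 21 | 21 | 21 | 6 |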


In [191]:
df.drop(['Unnamed: 0', 'roll'], axis=1, inplace= True)
df['value'].value_counts()

6    23
4    22
2    17
3    14
5    12
1    12
Name: value, dtype: int64

In [192]:
"""
The `Unnamed: 0` and `roll` columns only repeat the index values, so they were dropped;
6 and 4 are the most frequent values
"""
df.head()

| | value |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 6 |
| 3 | 1 |
| 4 | 6 |


#### 2.- Using the functions you defined in *challenge 2*, calculate the mean value of the hundred dice rolls.

In [None]:
# your code here
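
One way to approach this step, sketched with the standard library: the counts below are copied from the `value_counts()` output shown earlier, so the CSV file is not needed to follow the arithmetic.

```python
# Frequency distribution of the hundred rolls, copied from the value_counts() output.
frequencies = {6: 23, 4: 22, 2: 17, 3: 14, 5: 12, 1: 12}

total_rolls = sum(frequencies.values())
weighted_sum = sum(face * count for face, count in frequencies.items())
mean_value = weighted_sum / total_rolls

print(total_rolls, weighted_sum, mean_value)
```

With the functions from challenge 2, the equivalent calls would be `mean_by_sum(df['value'])` and `mean_by_frec_distr(df['value'])`.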

#### 3.- Now, calculate the frequency distribution.


In [None]:
# your code here

#### 4.- Plot the histogram. What do you see (shape, values...) ? How can you connect the mean value to the histogram? 

In [None]:
# your code here
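
A dependency-free sketch of the histogram, using the counts printed earlier in this challenge. It also illustrates one way to connect the mean to the histogram: the mean is the balance point, so the count-weighted deviations from it cancel out.

```python
frequencies = {1: 12, 2: 17, 3: 14, 4: 22, 5: 12, 6: 23}  # counts from value_counts()

# Text histogram: one row per face, one '#' per observed roll.
for face, count in sorted(frequencies.items()):
    print(f"{face} | {'#' * count} ({count})")

total_rolls = sum(frequencies.values())
mean_value = sum(face * count for face, count in frequencies.items()) / total_rolls

# Deviations from the mean, weighted by how often each face occurred, sum to zero.
balance = sum((face - mean_value) * count for face, count in frequencies.items())
print(f"mean = {mean_value:.2f}, balance = {balance:.6f}")
```

A fair die has an expected value of 3.5, so a sample mean of 3.74 reflects the surplus of 4s and 6s in these hundred rolls.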

In [None]:
"""
your comments here
"""

#### 5.- Read the `roll_the_dice_thousand.csv` from the `data` folder. Plot the frequency distribution as you did before. Has anything changed? Why do you think it changed?

In [None]:
# your code here
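
The effect of sample size can be illustrated by simulation when the CSV files are not at hand. The sketch below is a stand-in under stated assumptions (simulated fair-die rolls with a fixed seed, not the course data, and a hypothetical helper `relative_frequencies`): as the number of rolls grows, the share of each face tends to move closer to 1/6.

```python
import random
from collections import Counter


def relative_frequencies(number_of_rolls, rng):
    """Share of each face in a simulated sample of fair-die rolls."""
    counts = Counter(rng.choices(range(1, 7), k=number_of_rolls))
    return {face: counts[face] / number_of_rolls for face in range(1, 7)}


rng = random.Random(0)  # fixed seed so the illustration is repeatable
for sample_size in (100, 1_000, 100_000):
    shares = relative_frequencies(sample_size, rng)
    largest_gap = max(abs(share - 1 / 6) for share in shares.values())
    print(f"{sample_size:>7} rolls: largest gap from 1/6 = {largest_gap:.4f}")
```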

In [None]:
"""
your comments here
"""

## Challenge 4
In the `data` folder of this repository you will find three different files with the prefix `ages_population`. These files contain information about a poll answered by a thousand people regarding their age. Each file corresponds to the poll answers in different neighbourhoods of Barcelona.

#### 1.- Read the file `ages_population.csv`. Calculate the frequency distribution and plot it as we did during the lesson. Try to guess the range in which the mean and the standard deviation will be by looking at the plot. 

In [None]:
# your code here

#### 2.- Calculate the exact mean and standard deviation and compare them with your guesses. Do they fall inside the ranges you guessed?

In [None]:
# your code here
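
When comparing a guess with the exact standard deviation, it helps to know which definition is being computed: pandas' `Series.std()` divides by `n - 1` by default (sample standard deviation), while `np.std` divides by `n` (population standard deviation). A minimal sketch of both, on a hypothetical list `ages` because the column layout of the poll file is not shown here:

```python
import math
import statistics

ages = [23, 31, 35, 36, 38, 41, 44, 52]  # hypothetical sample, not the poll data

n = len(ages)
mean_age = sum(ages) / n
squared_deviations = sum((age - mean_age) ** 2 for age in ages)

population_std = math.sqrt(squared_deviations / n)    # divides by n (NumPy default)
sample_std = math.sqrt(squared_deviations / (n - 1))  # divides by n - 1 (pandas default)

print(mean_age, population_std, sample_std)
```

With a thousand answers the two values are nearly identical, but the gap is visible on small samples like this one.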

In [None]:
"""
your comments here
"""

#### 3.- Now read the file `ages_population2.csv` . Calculate the frequency distribution and plot it.

In [None]:
# your code here

#### 4.- What do you see? Does it differ from the frequency distribution in step 1?

In [None]:
"""
your comments here
"""

#### 5.- Calculate the mean and standard deviation. Compare the results with the mean and standard deviation in step 2. What do you think?

In [None]:
# your code here

In [None]:
"""
your comments here
"""

## Challenge 5
Now it is the turn of `ages_population3.csv`.

#### 1.- Read the file `ages_population3.csv`. Calculate the frequency distribution and plot it.

In [None]:
# your code here

#### 2.- Calculate the mean and standard deviation. Compare the results with the plot in step 1. What is happening?

In [None]:
# your code here

In [None]:
"""
your comments here
"""

#### 3.- Calculate the four quartiles. Use the results to explain your reasoning for the question in step 2. How much of a difference is there between the median and the mean?

In [None]:
# your code here
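
A gap between the mean and the median is a common sign of a skewed distribution. The sketch below uses a hypothetical right-skewed list `ages` (not the poll data) and `statistics.quantiles` with `method="inclusive"`, which interpolates linearly at position `(n - 1) * q`, the same rule pandas applies by default.

```python
import statistics

# Hypothetical right-skewed sample: a larger younger group plus a smaller, much older one.
ages = [25, 27, 28, 30, 31, 33, 35, 36, 38, 40, 68, 70, 72, 75]

mean_age = statistics.mean(ages)
q1, median_age, q3 = statistics.quantiles(ages, n=4, method="inclusive")

print(f"mean={mean_age:.2f}  Q1={q1}  median={median_age}  Q3={q3}")
print(f"mean - median = {mean_age - median_age:.2f}")
```

In this made-up sample the older group pulls the mean above the median, and Q3 sits much further from the median than Q1 does.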

In [None]:
"""
your comments here
"""

#### 4.- Calculate other percentiles that might be useful to give more arguments to your reasoning.

In [None]:
# your code here

In [None]:
"""
your comments here
"""

## Bonus challenge
Compare the information about the three neighbourhoods and prepare a report on them. Remember to identify their similarities and differences, backing your arguments with basic statistics.

In [None]:
# your code here

In [None]:
"""
your comments here
"""