# Understanding Descriptive Statistics

Import the necessary libraries here:

In [1]:
import pandas as pd
import random
import plotly.express as px
import plotly.figure_factory as ff

## Challenge 1
#### 1.- Define a function that simulates rolling a dice 10 times. Save the information in a dataframe.
**Hint**: you can use the *choices* function from module *random* to help you with the simulation.

In [2]:
ten_times = random.choices(range(1,7),k=10) 
ten_times

[1, 2, 6, 5, 1, 6, 3, 4, 3, 3]
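
The cell above returns a plain list. A minimal sketch of the function-plus-dataframe version the exercise asks for could look like the following. The name `roll_dice`, the `seed` parameter and the column names `roll` and `value` are assumptions chosen for illustration, not part of the original solution.

```python
import random

import pandas as pd


def roll_dice(n_rolls=10, seed=None):
    """Simulate n_rolls rolls of a fair six-sided dice and return them as a dataframe."""
    rng = random.Random(seed)  # passing a seed makes the simulation reproducible
    values = rng.choices(range(1, 7), k=n_rolls)
    return pd.DataFrame({"roll": range(n_rolls), "value": values})


ten_rolls = roll_dice(10, seed=42)
# Sorting by value prepares the data for the "sorted by value" plot
print(ten_rolls.sort_values("value"))
```

Using a dedicated `random.Random` instance keeps the seed local to the function, so repeated calls with the same seed return the same dataframe.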

#### 2.- Plot the results sorted by value.

In [3]:
fig = px.histogram(ten_times)
fig.show()

#### 3.- Calculate the frequency distribution and plot it. What is the relation between this plot and the plot above? Describe it with words.

In [4]:
fig = ff.create_distplot([ten_times], group_labels=['Ten times'])
fig.show()

In [5]:
'''In a normal histogram the y axis represents the total number of events in each bin of the x axis,
while in a frequency distribution chart the y axis represents the number of events in each bin divided by the
total number of events. The bars keep the same shape and only the scale of the y axis changes.'''

'In a normal histogram the y axis represents the total number of events in each bin of the x axis,\nwhile in a frequency distribution chart the y axis represents the number of events in each bin divided by the\ntotal number of events. The bars keep the same shape and only the scale of the y axis changes.'
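
The relation between counts and relative frequencies can be checked numerically. The sketch below uses `collections.Counter` and a hypothetical helper named `frequency_tables`; it is an illustration, not the plotting library's internal computation.

```python
from collections import Counter


def frequency_tables(rolls):
    """Return (absolute counts, relative frequencies) for each observed value."""
    counts = Counter(rolls)
    total = len(rolls)
    absolute = {value: counts[value] for value in sorted(counts)}
    # Dividing every count by the same total rescales the bars without reshaping them
    relative = {value: count / total for value, count in absolute.items()}
    return absolute, relative


rolls = [1, 2, 6, 5, 1, 6, 3, 4, 3, 3]  # the ten rolls shown above
absolute, relative = frequency_tables(rolls)
print(absolute)
print(relative)
```

The relative frequencies always add up to 1, which is what makes samples of different sizes comparable on the same axis.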

## Challenge 2
Now, using the dice results obtained in *challenge 1*, you are going to define some functions that will help you calculate the mean of your data in two different ways, the median and the four quartiles. 

#### 1.- Define a function that computes the mean by summing all the observations and dividing by the total number of observations. You are not allowed to use any methods or functions that directly calculate the mean value. 

In [6]:
def meancalc(data: iter):
    return sum(data)/len(data)
meancalc(ten_times)

3.4

#### 2.- First, calculate the frequency distribution. Then, calculate the mean using the values of the frequency distribution you've just computed. You are not allowed to use any methods or functions that directly calculate the mean value. 

In [7]:
freq_dist = {i:ten_times.count(i) for i in range(1,7)}
# Weight each face value by its count, then divide by the total number of rolls
mean_freq = sum(value*count for value, count in freq_dist.items())/sum(freq_dist.values())
print(freq_dist)
print(mean_freq)

{1: 2, 2: 1, 3: 3, 4: 1, 5: 1, 6: 2}
3.4
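
Written as a formula, the mean computed from a frequency distribution weights each face value $x_i$ by its count $f_i$ (these symbols are introduced here only for illustration). With the ten rolls and the frequency table printed above:

$$
\bar{x} = \frac{\sum_i x_i f_i}{\sum_i f_i} = \frac{1\cdot 2 + 2\cdot 1 + 3\cdot 3 + 4\cdot 1 + 5\cdot 1 + 6\cdot 2}{10} = \frac{34}{10} = 3.4
$$

This agrees with the value returned by `meancalc(ten_times)`.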


#### 3.- Define a function to calculate the median. You are not allowed to use any methods or functions that directly calculate the median value. 
**Hint**: you might need to define two computation cases depending on the number of observations used to calculate the median.

In [8]:
def mediancalc(data: iter):
    if len(data)%2 == 0:
        return sum(sorted(data)[len(data)//2-1:len(data)//2+1])/2
    else:
        return sorted(data)[len(data)//2]
mediancalc(ten_times)

3.0

#### 4.- Define a function to calculate the four quartiles. You can use the function you defined above to compute the median but you are not allowed to use any methods or functions that directly calculate the quartiles. 

In [9]:
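def quartilecalc(data: iter, q: int):
    # q is the quartile number: 0 (minimum) to 4 (maximum)
    data = sorted(data)
    pos = q*(len(data)+1)/4  # 1-based position in the sorted data, may be fractional
    p = int(pos)
    if q == 4:
        return data[-1]
    # Interpolate linearly between the two neighbouring observations,
    # clamping the indices so that q = 0 returns the minimum
    low, high = max(p-1, 0), min(p, len(data)-1)
    return data[low] + (data[high] - data[low])*(pos - p)
quartilecalc([7, 15, 36, 39, 40, 41], 4)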

41

In [10]:
{f'Quartile {i}': quartilecalc(ten_times,i) for i in range(0,5)}

{'Quartile 0': 1.0,
 'Quartile 1': 1.75,
 'Quartile 2': 3.0,
 'Quartile 3': 5.25,
 'Quartile 4': 6}

In [11]:
def quantilecalc(data: iter, q: float):
    data = sorted(data)
    p = int(q*(len(data)+1))
    if q ==1:
        return data[-1]
    if len(data)%2 == 0:
        return data[p-1] + (data[p] - data[p-1])*1-q
    else:
        return data[p]
quantilecalc([7, 15, 36, 39, 40, 41], 1)

41

In [12]:
{f'quantile {i/10}': quantilecalc(ten_times,i/10) for i in range(0,11)}

{'quantile 0.0': 1.0,
 'quantile 0.1': 0.9,
 'quantile 0.2': 1.8,
 'quantile 0.3': 2.7,
 'quantile 0.4': 2.6,
 'quantile 0.5': 2.5,
 'quantile 0.6': 3.4,
 'quantile 0.7': 4.3,
 'quantile 0.8': 5.2,
 'quantile 0.9': 5.1,
 'quantile 1.0': 6}
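
One detail of `quantilecalc` above is worth noting: `(data[p] - data[p-1])*1-q` is parsed as `(data[p] - data[p-1])*1 - q`, so `q` is subtracted from the result instead of acting as an interpolation weight. That is why quantile 0.1 comes out as 0.9, below the smallest roll. A minimal sketch of linear interpolation at the 1-based position `q*(n+1)` follows. The name `quantile_interp` is hypothetical, and other interpolation conventions exist that give slightly different values.

```python
def quantile_interp(data, q):
    """Quantile q (0 <= q <= 1) by linear interpolation at position q*(n+1)."""
    values = sorted(data)
    n = len(values)
    if n == 0:
        raise ValueError("data must not be empty")
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    pos = q * (n + 1)   # 1-based position, may be fractional
    lower = int(pos)    # observation just below the position
    frac = pos - lower  # how far to move towards the next observation
    if lower < 1:
        return values[0]    # position falls before the first observation
    if lower >= n:
        return values[-1]   # position falls at or after the last observation
    return values[lower - 1] + frac * (values[lower] - values[lower - 1])


rolls = [1, 2, 6, 5, 1, 6, 3, 4, 3, 3]  # the ten rolls from challenge 1
print([quantile_interp(rolls, q) for q in (0.25, 0.5, 0.75)])
```

With this convention the 0.5 quantile coincides with the median, and no quantile can fall outside the range of the data.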

## Challenge 3
Read the csv `roll_the_dice_hundred.csv` from the `data` folder.
#### 1.- Sort the values and plot them. What do you see?

In [13]:
hundred_times = random.choices(range(1,7),k=100)
print(hundred_times)
fig = px.histogram(hundred_times)
fig.show()

[5, 6, 5, 5, 3, 3, 6, 2, 3, 5, 3, 1, 2, 3, 1, 5, 3, 1, 3, 2, 3, 1, 4, 1, 6, 6, 1, 4, 4, 2, 5, 6, 5, 4, 3, 6, 3, 4, 2, 5, 5, 4, 6, 4, 6, 3, 1, 3, 4, 2, 1, 1, 6, 1, 5, 4, 1, 1, 6, 3, 6, 6, 2, 6, 2, 4, 5, 3, 2, 2, 4, 6, 1, 6, 1, 5, 4, 6, 6, 5, 4, 1, 1, 1, 2, 4, 5, 5, 1, 2, 2, 5, 3, 4, 6, 2, 5, 4, 1, 1]


In [14]:
"""
The more times you roll the dice, the more uniform the distribution becomes, approaching the theoretical
probability of each value: 1/6 (about 0.1667).
"""

'\nThe more times you roll the dice, the more uniform the distribution becomes, approaching the theoretical\nprobability of each value: 1/6 (about 0.1667).\n'
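
The tendency described above can be illustrated with a small simulation. This is a sketch under stated assumptions: the helper name `max_deviation_from_uniform`, the seed and the sample sizes are chosen for illustration, and the rolls are simulated rather than read from the CSV files.

```python
import random
from collections import Counter


def max_deviation_from_uniform(n_rolls, seed=0):
    """Largest gap between an observed relative frequency and 1/6."""
    rng = random.Random(seed)
    counts = Counter(rng.choices(range(1, 7), k=n_rolls))
    # A face that never appears has count 0 in a Counter
    return max(abs(counts[face] / n_rolls - 1 / 6) for face in range(1, 7))


for n in (10, 100, 1000, 100000):
    print(n, round(max_deviation_from_uniform(n), 4))
```

The gap typically shrinks as the number of rolls grows, although for a single random run it does not have to decrease at every step.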

#### 2.- Using the functions you defined in *challenge 2*, calculate the mean value of the hundred dice rolls.

In [15]:
meancalc(hundred_times)

3.5

#### 3.- Now, calculate the frequency distribution.


In [16]:
fig = ff.create_distplot([ten_times,hundred_times], group_labels=['Ten times','Hundred times'])
fig.show()

#### 4.- Plot the histogram. What do you see (shape, values...) ? How can you connect the mean value to the histogram? 

In [17]:
fig = ff.create_distplot([ten_times,hundred_times], group_labels=['Ten times','Hundred times'])
fig.show()

In [18]:
"""
The frequency tends to 0.1667 (1/6) for each value and the average (3.4 for ten rolls) tends to 3.5, which is
the mean and median of the possible values (x axis).
"""

'\nThe frequency tends to 0.1667 (1/6) for each value and the average (3.4 for ten rolls) tends to 3.5, which is\nthe mean and median of the possible values (x axis).\n'
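
The value 3.5 is the expected value of a fair six-sided dice, where every face has probability 1/6. As a worked equation (the symbol $X$ for the outcome of one roll is introduced here for illustration):

$$
E[X] = \sum_{x=1}^{6} x \cdot \frac{1}{6} = \frac{1+2+3+4+5+6}{6} = \frac{21}{6} = 3.5
$$

On the histogram this is the balance point of six bars of equal height, halfway between 3 and 4.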

#### 5.- Read the `roll_the_dice_thousand.csv` from the `data` folder. Plot the frequency distribution as you did before. Has anything changed? Why do you think it changed?

In [19]:
thousand_times = pd.read_csv('../data/roll_the_dice_thousand.csv')
thousand_times

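| | Unnamed: 0 | roll | value |
|---|---|---|---|
| 0 | 0 | 0 | 5 |
| 1 | 1 | 1 | 6 |
| 2 | 2 | 2 | 1 |
| 3 | 3 | 3 | 6 |
| 4 | 4 | 4 | 5 |
| ... | ... | ... | ... |
| 995 | 995 | 995 | 1 |
| 996 | 996 | 996 | 4 |
| 997 | 997 | 997 | 4 |
| 998 | 998 | 998 | 3 |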


In [20]:
print(meancalc(thousand_times['value']))
fig = ff.create_distplot([hundred_times,thousand_times['value']], group_labels=['Hundred times','Thousand Times'])
fig.show()

3.447


In [21]:
"""
We continue to see the same tendency towards a uniform distribution with a frequency of 1/6 for each value
"""

'\nWe continue to see the same tendency towards a uniform distribution with a frequency of 1/6 for each value\n'

## Challenge 4
In the `data` folder of this repository you will find three different files with the prefix `ages_population`. These files contain information about a poll answered by a thousand people regarding their age. Each file corresponds to the poll answers in different neighbourhoods of Barcelona.

#### 1.- Read the file `ages_population.csv`. Calculate the frequency distribution and plot it as we did during the lesson. Try to guess the range in which the mean and the standard deviation will be by looking at the plot. 

In [22]:
ages_population = pd.read_csv('../data/ages_population.csv')
ages_population.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   observation  1000 non-null   float64
dtypes: float64(1)
memory usage: 7.9 KB


In [23]:
fig = ff.create_distplot([ages_population['observation']], group_labels=['Ages population'])
fig.show()

In [24]:
'''The mean seems to be 40 and the standard deviation around 12'''

'The mean seems to be 40 and the standard deviation around 12'

#### 2.- Calculate the exact mean and standard deviation and compare them with your guesses. Do they fall inside the ranges you guessed?

In [25]:
def stdcalc(data: iter):
    mean = meancalc(data)
    return (sum((x-mean)**2 for x in data)/(len(data) -1))**0.5


In [26]:
print(meancalc(ages_population['observation']))
print(stdcalc(ages_population['observation']))

36.56
12.81649962597677


In [27]:
"""
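The actual mean (36.56) is a bit lower than the guess of 40 and the standard deviation (12.82) a bit higher than the guess of 12
"""

'\nThe actual mean (36.56) is a bit lower than the guess of 40 and the standard deviation (12.82) a bit higher than the guess of 12\n'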

#### 3.- Now read the file `ages_population2.csv` . Calculate the frequency distribution and plot it.

In [28]:
ages_population2 = pd.read_csv('../data/ages_population2.csv')
ages_population2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   observation  1000 non-null   float64
dtypes: float64(1)
memory usage: 7.9 KB


In [29]:
fig = ff.create_distplot([ages_population2['observation']], group_labels=['Ages population2'])
fig.show()

####  4.- What do you see? Is there any difference with the frequency distribution in step 1?

In [30]:
"""
The shape is similar (normal distribution) but all the ages are extremely concentrated around 24-32 years, with
virtually no one younger than 18 or older than 36
"""

'\nThe shape is similar (normal distribution) but all the ages are extremely concentrated around 24-32 years, with\nvirtually no one younger than 18 or older than 36\n'

#### 5.- Calculate the mean and standard deviation. Compare the results with the mean and standard deviation in step 2. What do you think?

In [31]:
print(meancalc(ages_population2['observation']))
print(stdcalc(ages_population2['observation']))

27.155
2.9698139326891835


In [32]:
"""
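We confirm that the population is much younger (mean 27.2 versus 36.6) and of similar ages: the standard
deviation drops from 12.8 to 3.0, so the distribution is far more concentrated
"""

'\nWe confirm that the population is much younger (mean 27.2 versus 36.6) and of similar ages: the standard\ndeviation drops from 12.8 to 3.0, so the distribution is far more concentrated\n'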

## Challenge 5
Now is the turn of `ages_population3.csv`.

#### 1.- Read the file `ages_population3.csv`. Calculate the frequency distribution and plot it.

In [33]:
ages_population3 = pd.read_csv('../data/ages_population3.csv')
ages_population3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   observation  1000 non-null   float64
dtypes: float64(1)
memory usage: 7.9 KB


In [34]:
fig = ff.create_distplot([ages_population3['observation']], group_labels=['Ages population3'])
fig.show()

#### 2.- Calculate the mean and standard deviation. Compare the results with the plot in step 1. What is happening?

In [35]:
print(meancalc(ages_population3['observation']))
print(stdcalc(ages_population3['observation']))

41.989
16.14470595986593


In [36]:
"""
The average age is a bit higher and there is more dispersion. Looking at the distribution,
it seems to be right skewed, with a separate group of older people (probably retirement homes or
old rent apartments)
"""

'\nThe average age is a bit higher and there is more dispersion. Looking at the distribution,\nit seems to be right skewed, with a separate group of older people (probably retirement homes or\nold rent apartments)\n'

#### 3.- Calculate the four quartiles. Use the results to explain your reasoning for question in step 2. How much of a difference is there between the median and the mean?

In [37]:
print({f'Quartile {i}': quartilecalc(ages_population['observation'],i) for i in range(0,5)})
print(meancalc(ages_population['observation']))
print({f'Quartile {i}': quartilecalc(ages_population3['observation'],i) for i in range(0,5)})
print(meancalc(ages_population3['observation']))

{'Quartile 0': 1.0, 'Quartile 1': 28.0, 'Quartile 2': 37.0, 'Quartile 3': 45.0, 'Quartile 4': 82.0}
36.56
{'Quartile 0': 1.0, 'Quartile 1': 30.0, 'Quartile 2': 40.0, 'Quartile 3': 53.0, 'Quartile 4': 77.0}
41.989


In [38]:
"""
In the first case (ages_population) the distribution is slightly left skewed: the mean (36.56) is slightly lower than the median (37.0).
In the second case (ages_population3) the distribution is noticeably skewed to the right: the mean (41.989) is about 2 years higher than the median (40.0).
"""

'\nIn the first case (ages_population) the distribution is slightly left skewed: the mean (36.56) is slightly lower than the median (37.0).\nIn the second case (ages_population3) the distribution is noticeably skewed to the right: the mean (41.989) is about 2 years higher than the median (40.0).\n'
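
One way to put a number on the gap between mean and median is Pearson's second skewness coefficient, 3 * (mean - median) / standard deviation. This measure is not used in the original solution; it is brought in here as an illustration, and the function name `pearson_median_skewness` is hypothetical. The inputs are copied from the outputs printed above.

```python
def pearson_median_skewness(mean, median, std):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean - median) / std


# Mean, median (Quartile 2) and standard deviation taken from the cells above
skew_population = pearson_median_skewness(36.56, 37.0, 12.81649962597677)
skew_population3 = pearson_median_skewness(41.989, 40.0, 16.14470595986593)
print(round(skew_population, 3), round(skew_population3, 3))
```

A negative value points to a left skew and a positive value to a right skew, and a larger magnitude suggests a more pronounced asymmetry.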

#### 4.- Calculate other percentiles that might be useful to give more arguments to your reasoning.

In [39]:
print({f'Quantile {i/10}': quantilecalc(ages_population['observation'],i/10) for i in range(1,11)})
print({f'Quantile {i/10}': quantilecalc(ages_population3['observation'],i/10) for i in range(1,11)})

{'Quantile 0.1': 19.9, 'Quantile 0.2': 25.8, 'Quantile 0.3': 29.7, 'Quantile 0.4': 33.6, 'Quantile 0.5': 36.5, 'Quantile 0.6': 39.4, 'Quantile 0.7': 42.3, 'Quantile 0.8': 46.2, 'Quantile 0.9': 52.1, 'Quantile 1.0': 82.0}
{'Quantile 0.1': 21.9, 'Quantile 0.2': 27.8, 'Quantile 0.3': 31.7, 'Quantile 0.4': 35.6, 'Quantile 0.5': 39.5, 'Quantile 0.6': 44.4, 'Quantile 0.7': 49.3, 'Quantile 0.8': 56.2, 'Quantile 0.9': 66.1, 'Quantile 1.0': 77.0}


In [40]:
"""
We observe that the 0.7, 0.8 and 0.9 quantiles are significantly higher in the second case, which points to a longer right tail
"""

'\nWe observe that the 0.7, 0.8 and 0.9 quantiles are significantly higher in the second case, which points to a longer right tail\n'

## Bonus challenge
Compare the information about the three neighbourhoods. Prepare a report about the three of them. Remember to find out which are their similarities and their differences backing your arguments in basic statistics.

In [41]:
fig = ff.create_distplot([ages_population['observation'],ages_population2['observation'],ages_population3['observation']], group_labels=['Ages population','Ages population2','Ages population3'])
fig.show()

In [42]:
ages_population['Neighbourhood'] = 'Neighbourhood 1'
ages_population2['Neighbourhood'] = 'Neighbourhood 2'
ages_population3['Neighbourhood'] = 'Neighbourhood 3'
ages_all = pd.concat([ages_population,ages_population2,ages_population3])
ages_all.head(10)

| | observation | Neighbourhood |
|---|---|---|
| 0 | 68.0 | Neighbourhood 1 |
| 1 | 12.0 | Neighbourhood 1 |
| 2 | 45.0 | Neighbourhood 1 |
| 3 | 38.0 | Neighbourhood 1 |
| 4 | 49.0 | Neighbourhood 1 |
| 5 | 27.0 | Neighbourhood 1 |
| 6 | 39.0 | Neighbourhood 1 |
| 7 | 12.0 | Neighbourhood 1 |
| 8 | 42.0 | Neighbourhood 1 |
| 9 | 33.0 | Neighbourhood 1 |


In [43]:
fig = px.box(ages_all,y='observation',x='Neighbourhood')
fig.show()

In [44]:
"""
Compared to neighbourhood 1, neighbourhood 2 has a much younger and more concentrated population
(mean 27.2, standard deviation 3.0).
Neighbourhood 3 is a bit older and skewed to the right, with a much higher 3rd quartile than neighbourhood 1 (53 versus 45)
"""

'\nCompared to neighbourhood 1, neighbourhood 2 has a much younger and more concentrated population\n(mean 27.2, standard deviation 3.0).\nNeighbourhood 3 is a bit older and skewed to the right, with a much higher 3rd quartile than neighbourhood 1 (53 versus 45)\n'
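
A compact table of basic statistics per neighbourhood can back up a report like this. The sketch below is one possible approach using pandas `groupby`. The helper name `summarise_by_group` is hypothetical, and the toy dataframe is invented so the example runs without the CSV files. Note that pandas interpolates quantiles with its own default convention, so quartiles may differ slightly from those of `quartilecalc`.

```python
import pandas as pd


def summarise_by_group(df, group_col="Neighbourhood", value_col="observation"):
    """Mean, median, standard deviation, quartiles and IQR for each group."""
    grouped = df.groupby(group_col)[value_col]
    summary = grouped.agg(["mean", "median", "std"])
    summary["q1"] = grouped.quantile(0.25)
    summary["q3"] = grouped.quantile(0.75)
    summary["iqr"] = summary["q3"] - summary["q1"]  # spread of the central half
    return summary


# Toy data for illustration only, not taken from the age files
toy = pd.DataFrame({
    "Neighbourhood": ["A"] * 4 + ["B"] * 4,
    "observation": [1, 2, 3, 4, 10, 20, 30, 40],
})
print(summarise_by_group(toy))
```

Applied to the notebook's data, the call would be `summarise_by_group(ages_all)`, which puts the three neighbourhoods side by side.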