# Challenge 3: Standard deviation

## 1. Objectives:
- Use the standard deviation to analyze the dispersion of our data
 
---
    
## 2. Walkthrough:

### a) Standard deviation and data distribution

As we saw earlier, the standard deviation measures the "typical" (or expected) deviation of our data from the mean. This means we usually expect a large share of our data to lie within 1 standard deviation of the mean. The farther we move from the mean, the fewer samples we should find.
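
As a point of reference, the sketch below simulates normally distributed data using only Python's standard library and counts the share of samples within 1, 2, and 3 standard deviations of the mean. The sample size, seed, and variable names are illustrative assumptions and are not part of the challenge. For data that is roughly normal, the shares should come out close to 68%, 95%, and 99.7%.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable (arbitrary choice)

# Hypothetical sample: 100,000 draws from a standard normal distribution
samples = [random.gauss(0, 1) for _ in range(100_000)]

mean = statistics.fmean(samples)
std_dev = statistics.stdev(samples)  # sample standard deviation

shares = {}
for k in (1, 2, 3):
    # Count the samples whose distance from the mean is at most k standard deviations
    within = sum(1 for x in samples if abs(x - mean) <= k * std_dev)
    shares[k] = within / len(samples)
    print(f"Within {k} standard deviation(s): {shares[k]:.1%}")
```

Real data does not have to follow this pattern, which is exactly what the challenge asks you to check.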

Let's check this using our dataset of near-Earth objects (asteroids that pass close to Earth). Your challenge consists of the following steps:

1. Create a DataFrame from the dataset `near_earth_objects-jan_feb_1995-clean.csv`.
2. Get the total number of rows in your DataFrame.
3. Compute the standard deviation of the column `estimated_diameter.meters.estimated_diameter_max`. Use this column for all of the following steps.
4. Compute the percentage of samples that lie within 1 standard deviation of the mean.
5. Compute the percentage of samples that lie within 2 standard deviations of the mean (multiply the standard deviation by 2).
6. Compute the percentage of samples that lie within 3 standard deviations of the mean (multiply the standard deviation by 3).
7. Compare the percentages you obtained and discuss your findings with your classmates and the expert. What does this mean? Does the definition of standard deviation make sense? What can I infer about the dispersion of my data from the values obtained?

> Note: To get the percentages of the subsets, you first need to filter the original DataFrame so that only the samples that meet the criteria remain.

> Note: This challenge is designed to be of medium difficulty. Don't get frustrated if it seems too hard at first. Start small, solving the problem in little pieces, and if you have no idea how to proceed, remember that the expert is there to help you.
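
The filtering mentioned in the first note can be sketched on toy data as follows. The DataFrame, the column name `diameter`, and the values are all made up for illustration; the idea is simply to build a boolean mask, keep the rows where it is `True`, and divide the number of remaining rows by the total.

```python
import pandas as pd

# Hypothetical toy data; the column name 'diameter' is made up for illustration
df = pd.DataFrame({'diameter': [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

mean = df['diameter'].mean()
std_dev = df['diameter'].std()

# Boolean mask: True for rows within one standard deviation of the mean
mask = (df['diameter'] - mean).abs() <= std_dev
df_within = df[mask]

# Share of rows that survived the filter, as a percentage
percentage = len(df_within) / len(df) * 100
print(f'{percentage}% of the rows are within 1 standard deviation')
```

The same pattern applies to 2 and 3 standard deviations by multiplying `std_dev` in the mask.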

In [21]:
# Libraries needed

import pandas as pd

In [22]:
# Helper functions

def calc_std_deviations_from_mean(value, std_deviation, mean):
  # Absolute distance between value and mean, expressed in standard deviations
  return abs(value - mean) / std_deviation

In [23]:
# 1. Dataframe creation

df_near_earth_objs = pd.read_csv('https://raw.githubusercontent.com/jaeem006/beduadp/master/Datasets/near_earth_objects-jan_feb_1995-clean.csv', index_col=0)
df_near_earth_objs.head()

| | id | name | is_potentially_hazardous_asteroid | estimated_diameter.meters.estimated_diameter_min | estimated_diameter.meters.estimated_diameter_max | close_approach_date | epoch_date_close_approach | orbiting_body | relative_velocity.kilometers_per_second | relative_velocity.kilometers_per_hour |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2154652 | 154652 (2004 EP20) | False | 483.676488 | 1081.533507 | 1995-01-07 | 789467580000 | Earth | 16.142864 | 58114.308667 |
| 1 | 3153509 | (2003 HM) | True | 96.506147 | 215.794305 | 1995-01-07 | 789491340000 | Earth | 12.351044 | 44463.757734 |
| 2 | 3516633 | (2010 HA) | False | 44.11182 | 98.637028 | 1995-01-07 | 789446820000 | Earth | 6.220435 | 22393.567277 |
| 3 | 3837644 | (2019 AY3) | False | 46.190746 | 103.285648 | 1995-01-07 | 789513900000 | Earth | 22.478615 | 80923.015021 |
| 4 | 3843493 | (2019 PY) | False | 22.108281 | 49.435619 | 1995-01-07 | 789446700000 | Earth | 4.998691 | 17995.288355 |


In [24]:
# 2. Get the number of rows (near-Earth objects) in the dataset
ammount_of_objs = df_near_earth_objs.shape[0]
print('The number of near-Earth objects in the dataset is: {}.'.format(ammount_of_objs))

# 3. Get the standard deviation of the maximum diameter of the near-Earth objects
std_dev_of_objs_max_diameter = df_near_earth_objs['estimated_diameter.meters.estimated_diameter_max'].std()
print('The standard deviation of the maximum diameter of the near-Earth objects is: {}'.format(std_dev_of_objs_max_diameter))

# 3.1 Get the mean of the maximum diameter of the near-Earth objects
mean_of_objs_max_diameter = df_near_earth_objs['estimated_diameter.meters.estimated_diameter_max'].mean()
print('The mean of the maximum diameter of the near-Earth objects is: {}'.format(mean_of_objs_max_diameter))

The number of near-Earth objects in the dataset is: 333.
The standard deviation of the maximum diameter of the near-Earth objects is: 614.6915918552232
The mean of the maximum diameter of the near-Earth objects is: 410.08604223976545


In [25]:
# 3.2 Compute how many standard deviations each object's maximum diameter is from the mean

df_near_earth_objs['max_diameter_of_objs.std_deviations_fom_mean'] = df_near_earth_objs.apply(lambda obj: calc_std_deviations_from_mean(obj['estimated_diameter.meters.estimated_diameter_max'], std_dev_of_objs_max_diameter, mean_of_objs_max_diameter), axis = 1)
df_near_earth_objs.head()

| | id | name | is_potentially_hazardous_asteroid | estimated_diameter.meters.estimated_diameter_min | estimated_diameter.meters.estimated_diameter_max | close_approach_date | epoch_date_close_approach | orbiting_body | relative_velocity.kilometers_per_second | relative_velocity.kilometers_per_hour | max_diameter_of_objs.std_deviations_fom_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2154652 | 154652 (2004 EP20) | False | 483.676488 | 1081.533507 | 1995-01-07 | 789467580000 | Earth | 16.142864 | 58114.308667 | 1.092332 |
| 1 | 3153509 | (2003 HM) | True | 96.506147 | 215.794305 | 1995-01-07 | 789491340000 | Earth | 12.351044 | 44463.757734 | 0.31608 |
| 2 | 3516633 | (2010 HA) | False | 44.11182 | 98.637028 | 1995-01-07 | 789446820000 | Earth | 6.220435 | 22393.567277 | 0.506675 |
| 3 | 3837644 | (2019 AY3) | False | 46.190746 | 103.285648 | 1995-01-07 | 789513900000 | Earth | 22.478615 | 80923.015021 | 0.499113 |
| 4 | 3843493 | (2019 PY) | False | 22.108281 | 49.435619 | 1995-01-07 | 789446700000 | Earth | 4.998691 | 17995.288355 | 0.586718 |
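
As an alternative to the row-by-row `apply`, the same distance can be computed on the whole column at once. This is a minimal sketch on made-up values; the Series `diameters` is hypothetical and stands in for the diameter column. Note that pandas' `std()` returns the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical values standing in for the diameter column
diameters = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

mean = diameters.mean()
std_dev = diameters.std()  # pandas defaults to the sample standard deviation (ddof=1)

# Absolute distance from the mean, in units of standard deviations
std_devs_from_mean = (diameters - mean).abs() / std_dev
print(std_devs_from_mean)
```

On a dataset of a few hundred rows the difference in speed is negligible, so this is mostly a matter of readability.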


In [26]:
# 4. Get the percentage of near-Earth objects whose maximum diameter is within one standard deviation of the mean
ammount_of_objs_in_one_std_deviation = df_near_earth_objs[df_near_earth_objs['max_diameter_of_objs.std_deviations_fom_mean'] <= 1].shape[0]
percentage_of_objs_in_one_std_deviation = (100/ammount_of_objs)*ammount_of_objs_in_one_std_deviation
print('Percentage of near-Earth objects whose maximum diameter is within one standard deviation of the mean: {}%'.format(percentage_of_objs_in_one_std_deviation))

# 5. Get the percentage of near-Earth objects whose maximum diameter is within two standard deviations of the mean
ammount_of_objs_in_two_std_deviation = df_near_earth_objs[df_near_earth_objs['max_diameter_of_objs.std_deviations_fom_mean'] <= 2].shape[0]
percentage_of_objs_in_two_std_deviation = (100/ammount_of_objs)*ammount_of_objs_in_two_std_deviation
print('Percentage of near-Earth objects whose maximum diameter is within two standard deviations of the mean: {}%'.format(percentage_of_objs_in_two_std_deviation))

# 6. Get the percentage of near-Earth objects whose maximum diameter is within three standard deviations of the mean
ammount_of_objs_in_three_std_deviation = df_near_earth_objs[df_near_earth_objs['max_diameter_of_objs.std_deviations_fom_mean'] <= 3].shape[0]
percentage_of_objs_in_three_std_deviation = (100/ammount_of_objs)*ammount_of_objs_in_three_std_deviation
print('Percentage of near-Earth objects whose maximum diameter is within three standard deviations of the mean: {}%'.format(percentage_of_objs_in_three_std_deviation))

Percentage of near-Earth objects whose maximum diameter is within one standard deviation of the mean: 90.3903903903904%
Percentage of near-Earth objects whose maximum diameter is within two standard deviations of the mean: 96.3963963963964%
Percentage of near-Earth objects whose maximum diameter is within three standard deviations of the mean: 97.8978978978979%


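### 7. Conclusions

* What do these results mean?

      The share of objects covered grows as the number of standard deviations increases: about 90.4% within one, 96.4% within two, and 97.9% within three.

* Does the definition of standard deviation make sense?

      Yes, it does: it tells you how far the data typically lies from the mean.

* What could you infer about the data's dispersion from the results we got?

      Most of the data is tightly clustered: more than 90% of it lies within one standard deviation of the mean, well above the roughly 68% expected for a normal distribution. Since the standard deviation (about 614.7 m) is larger than the mean (about 410.1 m), this suggests that a small number of very large objects stretch the distribution.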
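
One caveat when reading these percentages: a few extreme values can inflate the standard deviation so much that nearly all other samples fall within one standard deviation of the mean. The sketch below illustrates this with made-up numbers (nine small values and a single outlier); it is not derived from the near-Earth objects dataset.

```python
import pandas as pd

# Hypothetical data: nine small values and one extreme outlier
values = pd.Series([10.0] * 9 + [1000.0])

mean = values.mean()
std_dev = values.std()

# Fraction of values within one standard deviation of the mean
share_within_one = ((values - mean).abs() <= std_dev).mean()
print(f'mean={mean:.1f}, std={std_dev:.1f}, within 1 std: {share_within_one:.0%}')
```

Here the standard deviation ends up larger than the mean, and every value except the outlier lies within one standard deviation, so a high percentage alone does not rule out a long tail.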