## Challenge 2: Frequency tables

### 1. Objectives:
- Learn to generate frequency tables by segmenting our data
 
---
    
### 2. Development:

#### a) Analyzing distributions with frequency tables

We are going to generate frequency tables for the following datasets and columns:

1. Dataset: 'near_earth_objects-jan_feb_1995-clean.csv'
    - Columns to tabulate: 'estimated_diameter.meters.estimated_diameter_max' and 'relative_velocity.kilometers_per_second'
2. Dataset: 'new_york_times_bestsellers-clean.json'
    - Columns to tabulate: 'price.numberDouble'
3. Dataset: 'melbourne_housing-clean.csv'
    - Columns to tabulate: 'land_size'

These are the same datasets we plotted in the previous Challenge. Before generating the frequency tables, check the range of each column and decide on a suitable number of segments for it.

Then generate the frequency table for each column and compare it with the boxplot you made in the previous Challenge. Does the table reveal anything new? What advantages or disadvantages does this new perspective offer?

Consider which of the two approaches (boxplots or frequency tables) is more useful for detecting outliers. Or are they simply useful in different contexts?
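To make the comparison concrete, here is a minimal sketch on a made-up sample (the values and the name `toy_values` are hypothetical, not taken from the datasets). It assumes the usual boxplot convention of flagging points beyond 1.5 times the interquartile range, and sets that next to an equal-width frequency table of the same data.

```python
import pandas as pd

# Hypothetical sample: a tight cluster plus one extreme value
s = pd.Series([10, 12, 11, 13, 12, 14, 11, 10, 13, 95], name="toy_values")

# Boxplot view (assumed 1.5 * IQR rule): points outside the fences are outliers
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Frequency-table view: five equal-width segments, kept in segment order
table = s.value_counts(bins=5, sort=False)

print(outliers)
print(table)
```

In this toy case the boxplot rule names the extreme point directly, while the table shows it as a lone count separated from the bulk by empty segments. The table also shows how much of the data sits in each interval, which a boxplot does not.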

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [None]:
ne = pd.read_csv("/work/B2-Analisis-de-Datos-con-Python-2020-2021-Santander/Datasets/near_earth_objects-jan_feb_1995-clean.csv")
bs = pd.read_json("/work/B2-Analisis-de-Datos-con-Python-2020-2021-Santander/Datasets/new_york_times_bestsellers-clean.json")
m = pd.read_csv("/work/B2-Analisis-de-Datos-con-Python-2020-2021-Santander/Datasets/melbourne_housing-clean.csv")

ne_col1 = 'estimated_diameter.meters.estimated_diameter_max'
ne_col2 = 'relative_velocity.kilometers_per_second'
bs_col = 'price.numberDouble'
m_col = 'land_size'


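To obtain the number of segments ($k$), we start from:

$$n = 2^k$$

and solve for $k$ using the laws of logarithms:

$$k = \log_2 n$$

In the cell below, $n$ is taken to be the range of the column (maximum minus minimum), and $k$ is rounded to the nearest integer.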
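As a quick check of the arithmetic, the sketch below evaluates the formula for a few illustrative inputs. `segment_count` is a hypothetical helper introduced only for this example, and the input values are arbitrary.

```python
import numpy as np

def segment_count(n):
    """Solve n = 2**k for k and round to the nearest whole number of segments."""
    return int(np.round(np.log2(n)))

# Arbitrary inputs, chosen only to illustrate the arithmetic
for n in (20, 1000, 76000):
    print(n, "->", segment_count(n))
```

Inputs at or below 1 do not yield a usable number of segments, so a notebook that may meet constant columns would need a guard for that case.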


In [None]:
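def rango(df, col):
    # Range of the column: maximum minus minimum
    r = df[col].max() - df[col].min()
    # Number of segments: k = log2(range), rounded to the nearest integer
    k = int(np.round(np.log2(r)))
    return tablear(df, col, k)

def tablear(df, col, k):
    # Split the column into k equal-width segments. With an integer bin count,
    # pd.cut already extends the lowest edge slightly so the minimum is included.
    segmento = pd.cut(df[col], k)
    # Count the observations per segment; observed=False keeps empty segments
    return df[col].groupby(segmento, observed=False).count()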
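The same idea can be tried on a small made-up column before running it on the real datasets. The sketch below uses hypothetical toy values and also shows `Series.value_counts(bins=...)` as an alternative way to get the counts; it is an illustration under those assumptions, not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the datasets)
toy = pd.DataFrame({"value": [1, 2, 2, 3, 5, 8, 13, 21, 34]})

# Number of segments from the range: round(log2(max - min))
spread = toy["value"].max() - toy["value"].min()  # 34 - 1 = 33
k = int(np.round(np.log2(spread)))                # log2(33) is about 5.04, so k = 5

# Approach 1: cut into k equal-width segments, then count per segment
segments = pd.cut(toy["value"], k)
by_groupby = toy["value"].groupby(segments, observed=False).count()

# Approach 2: let value_counts do the binning, keeping segment order
by_value_counts = toy["value"].value_counts(bins=k, sort=False)

print(by_groupby)
print(by_value_counts)
```

Both approaches give the same counts per segment. The printed label of the first interval can differ slightly, because `value_counts` includes the lowest edge explicitly.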

In [None]:
rango(ne, ne_col1)

estimated_diameter.meters.estimated_diameter_max
(-3.535, 504.048]       246
(504.048, 1005.118]      55
(1005.118, 1506.188]     18
(1506.188, 2007.257]      7
(2007.257, 2508.327]      1
(2508.327, 3009.396]      2
(3009.396, 3510.466]      2
(3510.466, 4011.536]      1
(4011.536, 4512.605]      0
(4512.605, 5013.675]      0
(5013.675, 5514.745]      0
(5514.745, 6015.814]      0
(6015.814, 6516.884]      1
Name: estimated_diameter.meters.estimated_diameter_max, dtype: int64

In [None]:
rango(ne, ne_col2)

relative_velocity.kilometers_per_second
(0.642, 8.651]       85
(8.651, 16.62]      126
(16.62, 24.589]      79
(24.589, 32.558]     33
(32.558, 40.527]     10
Name: relative_velocity.kilometers_per_second, dtype: int64

In [None]:
rango(bs,bs_col)

price.numberDouble
(14.97, 19.99]      47
(19.99, 24.99]     479
(24.99, 29.99]    2486
(29.99, 34.99]      21
Name: price.numberDouble, dtype: int64

In [None]:
rango(m,m_col)

land_size
(-76.0, 4750.0]       11594
(4750.0, 9500.0]         32
(9500.0, 14250.0]         3
(14250.0, 19000.0]        9
(19000.0, 23750.0]        2
(23750.0, 28500.0]        0
(28500.0, 33250.0]        0
(33250.0, 38000.0]        2
(38000.0, 42750.0]        2
(42750.0, 47500.0]        0
(47500.0, 52250.0]        0
(52250.0, 57000.0]        0
(57000.0, 61750.0]        0
(61750.0, 66500.0]        0
(66500.0, 71250.0]        0
(71250.0, 76000.0]        2
Name: land_size, dtype: int64
