# Challenge 2: Frequency tables

## 1. Objectives:
- Learn to build frequency tables by splitting our data into bins
 
---
    
## 2. Walkthrough:

#### a) Analyzing distributions with frequency tables

We are going to build frequency tables for the following datasets and columns:

1. Dataset: 'near_earth_objects-jan_feb_1995-clean.csv'
    - Columns to analyze: 'estimated_diameter.meters.estimated_diameter_max' and 'relative_velocity.kilometers_per_second'
2. Dataset: 'new_york_times_bestsellers-clean.json'
    - Columns to analyze: 'price.numberDouble'
3. Dataset: 'melbourne_housing-clean.csv'
    - Columns to analyze: 'land_size'
    
These are the same datasets we plotted in the previous Challenge. Before building the frequency tables, check the range of each column and decide on a suitable number of bins for it.
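One way to go from a range to a bin count is to let a rule of thumb suggest a starting point. The sketch below uses NumPy's `histogram_bin_edges` with the Sturges and Freedman-Diaconis rules. The data is synthetic and right-skewed, and names such as `sample` are assumptions made for illustration, not part of the challenge datasets.

```python
import numpy as np

# Synthetic right-skewed sample standing in for a column such as 'land_size'
# (illustrative data only, not the real dataset).
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=6.0, sigma=1.0, size=1000)

data_range = sample.max() - sample.min()

# Bin edges suggested by two common rules of thumb.
edges_sturges = np.histogram_bin_edges(sample, bins="sturges")
edges_fd = np.histogram_bin_edges(sample, bins="fd")

# The number of bins is always one less than the number of edges.
n_bins_sturges = len(edges_sturges) - 1
n_bins_fd = len(edges_fd) - 1

print("Range:", data_range)
print("Sturges bins:", n_bins_sturges)
print("Freedman-Diaconis bins:", n_bins_fd)
```

The two rules can disagree a lot on skewed data, so treat either number as a starting point and adjust it after looking at the resulting table.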

Then build the frequency table for each column and compare it with the box plots you made in the previous Challenge. Is there any new information? What advantages or disadvantages does this new perspective offer?

Consider which of the two approaches (box plots or frequency tables) is more useful for detecting outliers. Or are they simply useful in different contexts?
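To make the comparison concrete, here is a minimal sketch that applies both views to the same small series. The numbers are made up. The 1.5 × IQR fences are the usual box plot convention, which we assume matches the box plots from the previous Challenge.

```python
import pandas as pd

# Small illustrative series with two large values (made-up numbers).
values = pd.Series([10, 12, 13, 15, 16, 18, 20, 22, 25, 200, 400])

# Box plot rule: points beyond 1.5 * IQR from the quartiles are flagged.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Frequency table: outliers show up as sparse bins far from the bulk.
# observed=False keeps empty bins in the result.
freq_table = values.groupby(pd.cut(values, 4), observed=False).count()

print(outliers.tolist())
print(freq_table)
```

The IQR rule names the individual outlying points, while the frequency table shows how far they sit from the bulk of the data and how many empty bins lie in between.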

In [1]:
# Import libraries and load the data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df_earth_objects = pd.read_csv("https://raw.githubusercontent.com/jaeem006/beduadp/master/Datasets/near_earth_objects-jan_feb_1995-clean.csv", index_col=0)
df_nyt = pd.read_json("https://raw.githubusercontent.com/jaeem006/beduadp/master/Datasets/new_york_times_bestsellers-clean.json")
df_housing = pd.read_csv("https://raw.githubusercontent.com/jaeem006/beduadp/master/Datasets/melbourne_housing-clean.csv", index_col=0)

In [2]:
series_earth_objects_diameter = df_earth_objects['estimated_diameter.meters.estimated_diameter_max']
series_earth_objects_velocity = df_earth_objects['relative_velocity.kilometers_per_second']
series_nyt_price = df_nyt['price.numberDouble']
series_housing_land_size = df_housing['land_size']

In [3]:
# Compute the range (max - min) of each series
print("Range of Earth Objects (diameter)",
      series_earth_objects_diameter.max() - series_earth_objects_diameter.min())
print("Range of Earth Objects (velocity)",
      series_earth_objects_velocity.max() - series_earth_objects_velocity.min())
print("Range of NYT Best Sellers (price)", series_nyt_price.max() - series_nyt_price.min())
print("Range of Melbourne Housing (land size)", series_housing_land_size.max() - series_housing_land_size.min())

Range of Earth Objects (diameter) 6513.905031051
Range of Earth Objects (velocity) 39.8459916905
Range of NYT Best Sellers (price) 20.0
Range of Melbourne Housing (land size) 76000.0


In [4]:
# Create the bins (equal-width intervals) with pd.cut
cuts_series_earth_objects_diameter = pd.cut(series_earth_objects_diameter, 20)
cuts_series_earth_objects_velocity = pd.cut(series_earth_objects_velocity, 10)
cuts_series_nyt_price = pd.cut(series_nyt_price, 10)
cuts_series_housing_land_size = pd.cut(series_housing_land_size, 25)

In [7]:
cuts_series_housing_land_size

0        (-76.0, 3040.0]
1        (-76.0, 3040.0]
2        (-76.0, 3040.0]
3        (-76.0, 3040.0]
4        (-76.0, 3040.0]
              ...       
11641    (-76.0, 3040.0]
11642    (-76.0, 3040.0]
11643    (-76.0, 3040.0]
11644    (-76.0, 3040.0]
11645    (-76.0, 3040.0]
Name: land_size, Length: 11646, dtype: category
Categories (25, interval[float64, right]): [(-76.0, 3040.0] < (3040.0, 6080.0] < (6080.0, 9120.0] <
                                            (9120.0, 12160.0] ... (63840.0, 66880.0] <
                                            (66880.0, 69920.0] < (69920.0, 72960.0] <
                                            (72960.0, 76000.0]]
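The interval labels above can be checked by hand. A range of 76000 and a top edge of 76000 imply a minimum of 0, so 25 equal-width bins are each 76000 / 25 = 3040 wide. The negative left edge appears because `pd.cut`, when given an integer number of bins, lowers the first edge by 0.1% of the range so that the minimum falls inside the first (left-open) interval. The sketch below reproduces this on a toy series whose in-between values are made up.

```python
import pandas as pd

# Toy series with the same minimum (0) and maximum (76000) as the
# land_size output above; the values in between are made up.
toy = pd.Series([0.0, 150.0, 600.0, 2500.0, 9000.0, 76000.0])

n_bins = 25
value_range = toy.max() - toy.min()
bin_width = value_range / n_bins      # 76000 / 25 = 3040
left_adjust = value_range * 0.001     # 0.1% of the range = 76

toy_cuts = pd.cut(toy, n_bins)
first_interval = toy_cuts.cat.categories[0]

print("Bin width:", bin_width)
print("Left-edge adjustment:", left_adjust)
print("First interval:", first_interval)
```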

In [9]:
# Group each series by its bins and count the observations per bin.
# observed=False keeps empty bins in the table (newer pandas versions warn when it is omitted).
grouped_series_earth_objects_diameter = series_earth_objects_diameter.groupby(cuts_series_earth_objects_diameter, observed=False).count()
grouped_series_earth_objects_velocity = series_earth_objects_velocity.groupby(cuts_series_earth_objects_velocity, observed=False).count()
grouped_series_nyt_price = series_nyt_price.groupby(cuts_series_nyt_price, observed=False).count()
grouped_series_housing_land_size = series_housing_land_size.groupby(cuts_series_housing_land_size, observed=False).count()
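As a cross-check, pandas also offers `Series.value_counts(bins=...)`, which bins and counts in a single call. The sketch below compares it with the `pd.cut` plus `groupby` approach on made-up values. This shortcut should include the lowest value slightly differently, so its first interval label may not match the one `pd.cut` prints, even though the counts agree.

```python
import pandas as pd

# Made-up values: a cluster at the low end and two large values.
values = pd.Series([1.0, 2.0, 2.5, 3.0, 4.0, 9.0, 10.0])

# Approach used in this notebook: cut into bins, then group and count.
via_groupby = values.groupby(pd.cut(values, 3), observed=False).count()

# One-call alternative; sort=False keeps the bins in interval order
# instead of ordering them by frequency.
via_value_counts = values.value_counts(bins=3, sort=False)

print(via_groupby)
print(via_value_counts)
```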

In [10]:
grouped_series_housing_land_size

land_size
(-76.0, 3040.0]       11541
(3040.0, 6080.0]         64
(6080.0, 9120.0]         21
(9120.0, 12160.0]         2
(12160.0, 15200.0]        3
(15200.0, 18240.0]        7
(18240.0, 21280.0]        0
(21280.0, 24320.0]        2
(24320.0, 27360.0]        0
(27360.0, 30400.0]        0
(30400.0, 33440.0]        0
(33440.0, 36480.0]        0
(36480.0, 39520.0]        3
(39520.0, 42560.0]        1
(42560.0, 45600.0]        0
(45600.0, 48640.0]        0
(48640.0, 51680.0]        0
(51680.0, 54720.0]        0
(54720.0, 57760.0]        0
(57760.0, 60800.0]        0
(60800.0, 63840.0]        0
(63840.0, 66880.0]        0
(66880.0, 69920.0]        0
(69920.0, 72960.0]        0
(72960.0, 76000.0]        2
Name: land_size, dtype: int64
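The table above is dominated by its first bin (11541 of 11646 observations), so percentages can be easier to read than raw counts. The sketch below copies the counts from the output above and adds relative and cumulative percentages. The integer index is simply the bin position, and the column names are chosen here for illustration.

```python
import pandas as pd

# Counts copied from the land_size frequency table above (25 equal-width bins).
counts = pd.Series(
    [11541, 64, 21, 2, 3, 7, 0, 2, 0, 0, 0, 0, 3, 1,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
    name="count",
)

total = counts.sum()

summary = pd.DataFrame({
    "count": counts,
    # Share of all observations that falls in each bin.
    "percent": (counts / total * 100).round(2),
    # Share of all observations at or below each bin's upper edge.
    "cumulative_percent": (counts.cumsum() / total * 100).round(2),
})

print(summary.head())
```

Seen this way, about 99% of the properties fall in the first bin, which suggests that equal-width bins over the full range hide most of the structure of this column.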