## Challenge 2: Frequency tables

### 1. Objectives:
- Learn to build frequency tables by binning our data into segments

---
    
### 2. Development:

#### a) Analyzing distributions with frequency tables

We will build frequency tables for the following datasets and columns:

1. Dataset: 'near_earth_objects-jan_feb_1995-clean.csv'
    - Columns to tabulate: 'estimated_diameter.meters.estimated_diameter_max' and 'relative_velocity.kilometers_per_second'
2. Dataset: 'new_york_times_bestsellers-clean.json'
    - Columns to tabulate: 'price.numberDouble'
3. Dataset: 'melbourne_housing-clean.csv'
    - Columns to tabulate: 'land_size'
    
These are the same datasets we plotted in the previous Challenge. Before building the frequency tables, check the range (minimum and maximum) of each column and decide on a suitable number of segments (bins) for each one.
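As a starting point for choosing the number of segments, here is a minimal sketch that inspects the range of a column and applies Sturges' rule (1 + log2 of the number of rows), one common heuristic among several. The helper `sturges_bins` and the synthetic `sample` series are hypothetical stand-ins, not part of the datasets; swap in a real column when running it in the notebook.

```python
import numpy as np
import pandas as pd

def sturges_bins(values: pd.Series) -> int:
    """Number of bins suggested by Sturges' rule: round(1 + log2(n))."""
    return int(round(1 + np.log2(values.count())))

# Hypothetical stand-in data; replace with a real column from one of the datasets.
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=6, sigma=1, size=333), name="sample")

# Range of the data and the equal-width bin size that the rule implies.
value_range = sample.max() - sample.min()
n_bins = sturges_bins(sample)
bin_width = value_range / n_bins

print(f"min={sample.min():.2f}, max={sample.max():.2f}, range={value_range:.2f}")
print(f"bins={n_bins}, approximate bin width={bin_width:.2f}")
```

The rule only depends on the number of rows, so it is worth comparing the resulting bin width against the range: for heavily skewed columns, a few extreme values can stretch the bins so that most rows end up in the first one.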

Next, build the frequency table for each of these columns and compare it with the boxplot you made in the previous Challenge. Is there any new information? What advantages or disadvantages does this new perspective offer?

Consider which of the two approaches (boxplots or frequency tables) is more useful for detecting outliers. Or are they simply useful in different contexts?
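To make the comparison concrete, the sketch below looks at the same values both ways: the 1.5 × IQR fences that a boxplot conventionally uses to flag outliers, and an equal-width frequency table. The data is synthetic and hypothetical (200 ordinary values plus three extreme ones), and the choice of 8 bins is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 200 ordinary values plus three extreme ones.
rng = np.random.default_rng(1)
values = pd.Series(np.concatenate([rng.normal(50, 5, 200), [120, 130, 200]]))

# Boxplot view: points beyond 1.5 * IQR from the quartiles are flagged as outliers.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

# Frequency-table view: sparse bins far from the bulk hint at the same points.
freq = pd.cut(values, bins=8).value_counts(sort=False)

print(f"fences: [{lower:.2f}, {upper:.2f}], outliers flagged: {len(outliers)}")
print(freq)
```

The fences give an explicit cut-off for individual points, while the table shows how many values sit in each region, including gaps of empty bins between the bulk and the extremes.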

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

# Mount Google Drive (only available inside Google Colab)
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
dfNEO = pd.read_csv('/content/drive/MyDrive/DataScience/Remoto Datasets/Remoto near_earth_objects-jan_feb_1995-clean.csv')
neos = dfNEO[['estimated_diameter.meters.estimated_diameter_max', 'relative_velocity.kilometers_per_second']]

# Number of bins from Sturges' rule: 1 + log2(number of rows)
n = round(1 + np.log2(neos.shape[0]))

## **Diameter**

In [None]:
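# Split the diameters into n equal-width intervals and count the rows in each one
diamSegments = pd.cut(neos['estimated_diameter.meters.estimated_diameter_max'], n)
# observed=False keeps empty intervals in the table
neos['estimated_diameter.meters.estimated_diameter_max'].groupby(diamSegments, observed=False).count()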

estimated_diameter.meters.estimated_diameter_max
(-3.535, 726.746]       285
(726.746, 1450.513]      32
(1450.513, 2174.28]       9
(2174.28, 2898.048]       3
(2898.048, 3621.815]      2
(3621.815, 4345.582]      1
(4345.582, 5069.349]      0
(5069.349, 5793.117]      0
(5793.117, 6516.884]      1
Name: estimated_diameter.meters.estimated_diameter_max, dtype: int64
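Raw counts can be easier to read alongside relative and cumulative percentages. The sketch below rebuilds the diameter table from the counts printed above (typed in by hand, so it runs without the dataset) and adds both columns; the column names `relative_pct` and `cumulative_pct` are illustrative choices.

```python
import pandas as pd

# Counts copied from the diameter frequency table above (nine equal-width bins).
counts = pd.Series([285, 32, 9, 3, 2, 1, 0, 0, 1], name="count")

table = pd.DataFrame({
    "count": counts,
    # Share of all rows that falls in each bin.
    "relative_pct": (counts / counts.sum() * 100).round(1),
    # Share of all rows at or below the upper edge of each bin.
    "cumulative_pct": (counts.cumsum() / counts.sum() * 100).round(1),
})
print(table)
```

With these counts, the first bin alone holds about 85.6% of the 333 objects, which quantifies how right-skewed the diameter distribution is.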

## **Velocity**

In [None]:
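# Split the velocities into n equal-width intervals and count the rows in each one
velSegments = pd.cut(neos['relative_velocity.kilometers_per_second'], n)
# observed=False keeps empty intervals in the table
neos['relative_velocity.kilometers_per_second'].groupby(velSegments, observed=False).count()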

relative_velocity.kilometers_per_second
(0.642, 5.109]      22
(5.109, 9.536]      70
(9.536, 13.963]     69
(13.963, 18.391]    82
(18.391, 22.818]    38
(22.818, 27.245]    21
(27.245, 31.673]    17
(31.673, 36.1]       8
(36.1, 40.527]       6
Name: relative_velocity.kilometers_per_second, dtype: int64

## **Prices**

In [None]:
dfNYT = pd.read_json('/content/drive/MyDrive/DataScience/Remoto Datasets/Remoto new_york_times_bestsellers-clean.json')
price = dfNYT['price.numberDouble']

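# Number of bins from Sturges' rule: 1 + log2(number of rows)
n = round(1 + np.log2(price.shape[0]))

# Equal-width intervals; observed=False keeps empty intervals in the table
priceSegments = pd.cut(price, n)
price.groupby(priceSegments, observed=False).count()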

price.numberDouble
(14.97, 16.528]        3
(16.528, 18.067]      11
(18.067, 19.605]       0
(19.605, 21.144]      33
(21.144, 22.682]      24
(22.682, 24.221]      48
(24.221, 25.759]     407
(25.759, 27.298]    1257
(27.298, 28.836]     986
(28.836, 30.375]     243
(30.375, 31.913]       9
(31.913, 33.452]       0
(33.452, 34.99]       12
Name: price.numberDouble, dtype: int64

## **Land size**

In [None]:
dfLand = pd.read_csv('/content/drive/MyDrive/DataScience/Remoto Datasets/Remoto melbourne_housing-clean.csv')
size = dfLand['land_size']

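# Number of bins from Sturges' rule: 1 + log2(number of rows)
n = round(1 + np.log2(size.shape[0]))

# Equal-width intervals; observed=False keeps empty intervals in the table
landSegments = pd.cut(size, n)
size.groupby(landSegments, observed=False).count()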

land_size
(-76.0, 5066.667]         11600
(5066.667, 10133.333]        28
(10133.333, 15200.0]          3
(15200.0, 20266.667]          7
(20266.667, 25333.333]        2
(25333.333, 30400.0]          0
(30400.0, 35466.667]          0
(35466.667, 40533.333]        3
(40533.333, 45600.0]          1
(45600.0, 50666.667]          0
(50666.667, 55733.333]        0
(55733.333, 60800.0]          0
(60800.0, 65866.667]          0
(65866.667, 70933.333]        0
(70933.333, 76000.0]          2
Name: land_size, dtype: int64
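In the land size table above, 11,600 of the 11,646 rows fall into the first interval, so equal-width bins say little about where most properties sit. One possible alternative, shown here only as a sketch on hypothetical synthetic data, is quantile-based binning with `pd.qcut`, which varies the interval widths so that each bin holds roughly the same number of rows. It is not part of the original exercise, and the choice of 5 bins is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed stand-in for a column such as land_size.
rng = np.random.default_rng(2)
skewed = pd.Series(rng.lognormal(mean=6, sigma=1, size=1000), name="skewed")

# Equal-width bins: most rows land in the first interval.
equal_width = pd.cut(skewed, bins=5).value_counts(sort=False)

# Quantile bins: interval widths vary so each bin holds about the same number of rows.
quantile = pd.qcut(skewed, q=5).value_counts(sort=False)

print(equal_width)
print(quantile)
```

The trade-off is that quantile bins hide the long tail that equal-width bins expose, so the two tables answer different questions: where the bulk of the data lies versus how far the extremes reach.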