## Data cleaning

What we will do in this phase:

1. Make a copy of the original database (always do this before modifying anything).
2. Then clean the copy, checking for:

    ✅ Missing data (nulls).
    
    ✅ Duplicate records.
    
    ✅ Inconsistent formats (dates, names, numbers).
    
    ✅ Outliers.

In [6]:
# Load the data
import pandas as pd
import sqlite3

# Path to the original database file
ruta = "../data/raw/ufo_data.db"
# Open the connection
conexion = sqlite3.connect(ruta)
cursor = conexion.cursor()


In [7]:
# Create a copy of the database
copia = "../data/processed/copia.db"

# Attach the backup file under the alias "copia" (SQLite creates it if it does not exist)
cursor.execute("ATTACH DATABASE ? AS copia;", (copia,))

# List the user tables of the original database (internal sqlite_* tables are skipped)
cursor.execute(
    "SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%';"
)
tablas = cursor.fetchall()

# Copy every table into the backup database
for tabla in tablas:
    nombre_tabla = tabla[0]
    # Quote the name in case it contains spaces or special characters
    consulta_copia = f'CREATE TABLE copia."{nombre_tabla}" AS SELECT * FROM main."{nombre_tabla}";'
    cursor.execute(consulta_copia)

conexion.commit()
print("✅ Database copy created successfully...")
# Do not run this cell again: CREATE TABLE fails if the tables already exist in copia.db.
# To redo the copy, first close the connection
# and then delete the copia.db file.

✅ Database copy created successfully...
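
Note that `CREATE TABLE ... AS SELECT` copies the rows but not constraints or indexes. As an alternative, the standard library offers `sqlite3.Connection.backup` (Python 3.7+), which copies the whole database. The sketch below is only an illustration under assumed conditions: it uses two in-memory databases and a made-up two-row table, whereas in this notebook the destination would be a file such as `copia.db`.

```python
import sqlite3

# Hypothetical in-memory source standing in for the original database
origen = sqlite3.connect(":memory:")
origen.execute("CREATE TABLE ufo_table (city TEXT, shape TEXT)")
origen.executemany(
    "INSERT INTO ufo_table VALUES (?, ?)",
    [("detroit", "fireball"), ("irving", "fireball")],
)
origen.commit()

# Destination connection; with a real file, pass its path to sqlite3.connect
destino = sqlite3.connect(":memory:")

# Copy the full database (tables, indexes, schema) into the destination
origen.backup(destino)

filas_copiadas = destino.execute("SELECT COUNT(*) FROM ufo_table").fetchone()[0]
print(filas_copiadas)

origen.close()
destino.close()
```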


In [9]:
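# Count the rows that have a NULL in at least one of the listed columns
# (longitude is not included in this check)
revisar_nulos = """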
                SELECT COUNT(*) AS nulos
                FROM copia.ufo_table
                WHERE datetime IS NULL
                    OR city IS NULL
                    OR state IS NULL
                    OR country IS NULL
                    OR shape IS NULL
                    OR `duration (seconds)` IS NULL
                    OR `duration (hours/min)` IS NULL
                    OR comments IS NULL
                    OR `date posted` IS NULL
                    OR latitude IS NULL
                """
revisar_nulos = pd.read_sql_query(revisar_nulos, conexion)
revisar_nulos

Unnamed: 0,nulos
0,13816
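
The query above returns a single total, so it does not say which columns hold the nulls. A minimal sketch of a per-column count follows. It assumes a small made-up table with only three of the columns, and the helper names (`columnas`, `nulos_por_columna`) are illustrative. It relies on the fact that in SQLite `col IS NULL` evaluates to 0 or 1, so `SUM` counts the nulls.

```python
import sqlite3

# Hypothetical miniature of ufo_table with only three columns
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ufo_table (city TEXT, state TEXT, country TEXT)")
con.executemany(
    "INSERT INTO ufo_table VALUES (?, ?, ?)",
    [
        ("bocholt (germany)", None, "de"),
        ("detroit", "mi", "us"),
        ("unknown", "la", None),
    ],
)

# PRAGMA table_info returns one row per column; index 1 is the column name
columnas = [fila[1] for fila in con.execute("PRAGMA table_info(ufo_table)")]

# Build one SUM(col IS NULL) expression per column, quoting each name
partes = ", ".join(f'SUM("{c}" IS NULL) AS "{c}"' for c in columnas)
fila = con.execute(f"SELECT {partes} FROM ufo_table").fetchone()

nulos_por_columna = dict(zip(columnas, fila))
print(nulos_por_columna)
con.close()
```

Quoting the names with double quotes also covers columns such as `duration (seconds)` that contain spaces and parentheses.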


In [34]:
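# Note: COUNT(DISTINCT latitude) counts distinct latitude values, not distinct records,
# so it is only a rough proxy for spotting duplicates
revisar_duplicados = """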
                    SELECT COUNT(DISTINCT latitude) AS distintos_registros
                    FROM copia.ufo_table;
                     """
                     
revisar_duplicados = pd.read_sql_query(revisar_duplicados, conexion)
revisar_duplicados

Unnamed: 0,distintos_registros
0,18427


In [30]:
total_registros = "SELECT COUNT(*) FROM copia.ufo_table;"
total_registros = pd.read_sql_query(total_registros, conexion)
total_registros

Unnamed: 0,COUNT(*)
0,80332


In [31]:
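# Rough estimate: total rows minus a distinct count.
# Note: the hardcoded 19900 differs from the 18427 returned by the distinct-count cell above.
total_duplicados = 80332 - 19900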
total_duplicados

60432
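
Subtracting a single-column distinct count from the total tends to overstate duplicates, because two different sightings can share a latitude or a datetime. A stricter estimate, sketched here on a made-up four-row table with assumed columns, compares the total against `SELECT DISTINCT *`, so only rows that match in every column count as duplicates.

```python
import sqlite3

# Hypothetical sample: one exact duplicate plus one row that only shares the datetime
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ufo_table (datetime TEXT, city TEXT, latitude REAL)")
con.executemany(
    "INSERT INTO ufo_table VALUES (?, ?, ?)",
    [
        ("2014-05-07 00:00:00", "detroit", 42.3313889),
        ("2014-05-07 00:00:00", "detroit", 42.3313889),  # exact duplicate
        ("2014-05-07 00:00:00", "bocholt (germany)", 51.833333),
        ("2014-05-04 00:00:00", "irving", 32.8138889),
    ],
)

total = con.execute("SELECT COUNT(*) FROM ufo_table").fetchone()[0]

# Rows that remain after collapsing rows identical in every column
distintas = con.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM ufo_table)"
).fetchone()[0]

# For comparison: the single-column approach on datetime
por_datetime = con.execute(
    "SELECT COUNT(DISTINCT datetime) FROM ufo_table"
).fetchone()[0]

duplicados_exactos = total - distintas
estimacion_por_columna = total - por_datetime
print(duplicados_exactos, estimacion_por_columna)
con.close()
```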

In [None]:
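# Inspect every row recorded at one specific datetime;
# num_fila numbers the rows within that datetime
identificar_duplicados = """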
                        SELECT *,
                        ROW_NUMBER() OVER(PARTITION BY datetime ORDER BY datetime) AS num_fila
                        FROM copia.ufo_table
                        WHERE datetime = '2014-05-07 00:00:00';
                         """
identificar_duplicados = pd.read_sql_query(identificar_duplicados, conexion)
identificar_duplicados

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude,num_fila
0,2014-05-07 00:00:00,bocholt (germany),,de,circle,240,4 minutes,((HOAX)) ((NUFORC Note: No information provi...,2014-05-08,51.833333,6.6,1
1,2014-05-07 00:00:00,detroit,mi,us,fireball,180,3 minutes,Fire balls in detroit sky.,2014-05-08,42.3313889,-83.045833,2


In [26]:
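# List rows whose datetime already appeared in an earlier row (num_fila > 1).
# Sharing a datetime does not by itself make two rows duplicates.
duplicados = """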
            SELECT *
            FROM (
                SELECT *,
                ROW_NUMBER() OVER(PARTITION BY datetime ORDER BY datetime) AS num_fila
                FROM copia.ufo_table
            ) subconsulta
            WHERE num_fila > 1;
             """
duplicados = pd.read_sql_query(duplicados, conexion)
duplicados

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude,num_fila
0,1943-08-15 00:00:00,unknown,la,,unknown,720.0,10-12 minutes,White light at midnight traveling into outer s...,2014-01-30,37.245443,-107.827839,2
1,1947-07-01 20:00:00,maywood,ca,us,disk,120,2 minutes,1947 UFO sighting Date: ? sometime in early su...,2005-04-16,33.9866667,-118.184444,2
2,1947-07-01 20:00:00,wexford,pa,us,unknown,10,10 seconds,I have told this to people over many years. Wh...,2002-09-13,40.6263889,-80.056111,3
3,1947-07-15 15:00:00,hazelton (northeast of),id,us,disk,600,10 min.,The Object was Huge&#44Saucer-shaped&#44beauti...,2001-08-05,42.5963889,-114.135278,2
4,1947-07-15 21:00:00,san jose,ca,us,chevron,240,2-4min ?,The object seen that summer evening was cheve...,2004-04-27,37.3394444,-121.893889,2
...,...,...,...,...,...,...,...,...,...,...,...,...
10853,2014-05-03 22:30:00,travelers rest,sc,us,fireball,60,1 minute,Orange/yellow fireball over Greenville&#44 SC.,2014-05-08,34.9675,-82.443611,3
10854,2014-05-04 00:00:00,irving,tx,us,fireball,2,2 seconds,A friend and I were at Running Bear Park and I...,2014-05-08,32.8138889,-96.948611,2
10855,2014-05-06 23:00:00,melbourne,fl,us,sphere,45,45 seconds,Saw bright orange orb about 1000 feet up about...,2014-05-08,28.0833333,-80.608333,2
10856,2014-05-07 00:00:00,detroit,mi,us,fireball,180,3 minutes,Fire balls in detroit sky.,2014-05-08,42.3313889,-83.045833,2
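
Partitioning by `datetime` alone also flags distinct sightings that happened at the same moment, such as the detroit and bocholt rows shown earlier. One possible refinement, sketched below on made-up data, is to partition by several columns at once. The choice of `datetime`, `city` and `shape` as the key is an assumption for illustration, not a rule taken from the dataset, and window functions need SQLite 3.25 or later.

```python
import sqlite3

# Hypothetical sample: detroit appears twice, bocholt shares only the datetime
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ufo_table (datetime TEXT, city TEXT, shape TEXT)")
con.executemany(
    "INSERT INTO ufo_table VALUES (?, ?, ?)",
    [
        ("2014-05-07 00:00:00", "bocholt (germany)", "circle"),
        ("2014-05-07 00:00:00", "detroit", "fireball"),
        ("2014-05-07 00:00:00", "detroit", "fireball"),
    ],
)

# Number the rows inside each (datetime, city, shape) group;
# anything numbered above 1 repeats an earlier row of the same group
consulta = """
SELECT datetime, city, shape, num_fila
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY datetime, city, shape
               ORDER BY rowid
           ) AS num_fila
    FROM ufo_table
) AS subconsulta
WHERE num_fila > 1;
"""
repetidos = con.execute(consulta).fetchall()
print(repetidos)
con.close()
```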


In [5]:
# Close the connection
conexion.close()
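
To avoid forgetting this step, the connection can be wrapped in `contextlib.closing`, which calls `close()` when the block ends. Using `with sqlite3.connect(...)` on its own only manages the transaction and leaves the connection open. The snippet is a small sketch on an in-memory database with an illustrative table `t`.

```python
import sqlite3
from contextlib import closing

# closing() guarantees con.close() runs when the block exits, even on errors
with closing(sqlite3.connect(":memory:")) as con:
    con.execute("CREATE TABLE t (x INTEGER)")
    con.execute("INSERT INTO t VALUES (1)")
    valor = con.execute("SELECT x FROM t").fetchone()[0]

# Any further use of the closed connection raises ProgrammingError
try:
    con.execute("SELECT 1")
    cerrada = False
except sqlite3.ProgrammingError:
    cerrada = True

print(valor, cerrada)
```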