# Wine Data Cleaning and Analysis

This notebook works with the file `data/winemag-data-130k-v2.csv`.

Objectives:
1. Load the data and explore it.
2. Detect and handle null values.
3. Remove duplicates.
4. Fix data types.
5. Clean up text.
6. Detect empty columns and drop them.
7. Save the cleaned dataset.
8. Run a short final analysis.
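
The steps above can be sketched end to end on a toy DataFrame. This is a minimal sketch, not the notebook's actual pipeline: the `clean` helper, the toy values, and the choice to drop rows missing `price` or `country` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that applies the cleaning steps listed above."""
    out = df.dropna(axis=1, how="all")             # drop columns that are entirely empty
    out = out.dropna(subset=["price", "country"])  # drop rows missing key fields
    out = out.drop_duplicates()                    # remove exact duplicate rows
    out = out.assign(
        country=out["country"].str.strip(),        # trim stray whitespace in text
        points=out["points"].astype("int64"),      # store scores as integers
    )
    return out.reset_index(drop=True)

# Toy data (made up): one padded string, one duplicate row, one missing country
toy = pd.DataFrame({
    "country": [" Italy", "Spain", "Spain", None],
    "points": [87.0, 90.0, 90.0, 85.0],
    "price": [15.0, 20.0, 20.0, 12.0],
    "empty": [np.nan, np.nan, np.nan, np.nan],
})
cleaned = clean(toy)
print(cleaned)
```

Saving the result would then be a single `cleaned.to_csv(path, index=False)` call, with `path` chosen by the reader.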

## 1. Import and initial exploration

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the CSV (adjust the path if your copy is stored elsewhere)
df = pd.read_csv("data/winemag-data-130k-v2.csv")

# Show the first rows
print(df.head())

# General information; info() prints directly and returns None
df.info()

# Basic statistics, including text columns
display(df.describe(include='all'))

# Statistics for the numeric columns only
print(df.describe())

# Missing values per column
print(df.isnull().sum())



   id   country                                        description  \
0   0     Italy  Aromas include tropical fruit, broom, brimston...   
1   1  Portugal  This is ripe and fruity, a wine that is smooth...   
2   2        US  Tart and snappy, the flavors of lime flesh and...   
3   3        US  Pineapple rind, lemon pith and orange blossom ...   
4   4        US  Much like the regular bottling from 2012, this...   

                          designation  points  price           province  \
0                        Vulkà Bianco      87    NaN  Sicily & Sardinia   
1                            Avidagos      87   15.0              Douro   
2                                 NaN      87   14.0             Oregon   
3                Reserve Late Harvest      87   13.0           Michigan   
4  Vintner's Reserve Wild Child Block      87   65.0             Oregon   

              region_1           region_2         taster_name  \
0                 Etna                NaN       Kerin O’Keefe  

| | id | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 129971.0 | 129908 | 129971 | 92506 | 129971.0 | 120975.0 | 129908 | 108724 | 50511 | 103727 | 98758 | 129971 | 129970 | 129971 |
| unique | | 43 | 119955 | 37976 | | | 425 | 1229 | 17 | 19 | 15 | 118840 | 707 | 16757 |
| top | | US | Seductively tart in lemon pith, cranberry and ... | Reserve | | | California | Napa Valley | Central Coast | Roger Voss | @vossroger | Gloria Ferrer NV Sonoma Brut Sparkling (Sonoma... | Pinot Noir | Wines & Winemakers |
| freq | | 54504 | 3 | 2009 | | | 36247 | 4480 | 11065 | 25514 | 25514 | 11 | 13272 | 222 |
| mean | 64985.0 | | | | 88.447138 | 35.363389 | | | | | | | | |
| std | 37519.540256 | | | | 3.03973 | 41.022218 | | | | | | | | |
| min | 0.0 | | | | 80.0 | 4.0 | | | | | | | | |
| 25% | 32492.5 | | | | 86.0 | 17.0 | | | | | | | | |
| 50% | 64985.0 | | | | 88.0 | 25.0 | | | | | | | | |
| 75% | 97477.5 | | | | 91.0 | 42.0 | | | | | | | | |


                  id         points          price
count  129971.000000  129971.000000  120975.000000
mean    64985.000000      88.447138      35.363389
std     37519.540256       3.039730      41.022218
min         0.000000      80.000000       4.000000
25%     32492.500000      86.000000      17.000000
50%     64985.000000      88.000000      25.000000
75%     97477.500000      91.000000      42.000000
max    129970.000000     100.000000    3300.000000
id                           0
country                     63
description                  0
designation              37465
points                       0
price                     8996
province                    63
region_1                 21247
region_2                 79460
taster_name              26244
taster_twitter_handle    31213
title                        0
variety                      1
winery                       0
dtype: int64
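
The null counts above are easier to compare as a share of all rows. A minimal sketch on a toy DataFrame; the toy values and the `null_report` name are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (made up) with a few missing entries
toy = pd.DataFrame({
    "country": ["Italy", None, "US", "US"],
    "price": [np.nan, 15.0, np.nan, 13.0],
    "points": [87, 87, 87, 87],
})

# isna().mean() gives the fraction of missing values per column
null_report = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
}).sort_values("pct_missing", ascending=False)
print(null_report)
```

On the real data, `region_2` would top such a report: 79460 of 129971 rows are missing, roughly 61%.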


## 2. Detecting and handling null values (NaN)

In [4]:
print("Null values per column")
# Count null values (isnull() is an alias of isna(), so one call is enough)
print(df.isna().sum())

# Drop rows missing price or country
df = df.dropna(subset=['price','country'])

# Fill missing values in 'points' with 0 (a no-op here: 'points' has no nulls)
df['points'] = df['points'].fillna(0)

print('\nAfter NaN cleanup:')
print(df.isna().sum())

display(df.head())


Null values per column
id                           0
country                      0
description                  0
designation              34768
points                       0
price                        0
province                     0
region_1                 19516
region_2                 70624
taster_name              24496
taster_twitter_handle    29416
title                        0
variety                      1
winery                       0
dtype: int64

After NaN cleanup:
id                           0
country

| | id | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | | | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| 2 | 2 | US | Tart and snappy, the flavors of lime flesh and... | | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
| 3 | 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | | Alexander Peartree | | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
| 4 | 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
| 5 | 5 | Spain | Blackberry and raspberry aromas show a typical... | Ars In Vitro | 87 | 15.0 | Northern Spain | Navarra | | Michael Schachner | @wineschach | Tandem 2011 Ars In Vitro Tempranillo-Merlot (N... | Tempranillo-Merlot | Tandem |
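
Dropping rows is not the only way to handle missing prices. As an alternative that this notebook does not use, the sketch below fills each gap with the median price of the same country. The toy data and the `price_filled` column name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one missing price per country
toy = pd.DataFrame({
    "country": ["Italy", "Italy", "Italy", "US", "US"],
    "price": [10.0, np.nan, 30.0, 50.0, np.nan],
})

# Median price within each country, aligned back to the original rows
country_median = toy.groupby("country")["price"].transform("median")

# Keep known prices; use the country median only where the price is missing
toy["price_filled"] = toy["price"].fillna(country_median)
print(toy)
```

Whether imputing is appropriate depends on the later analysis: filled prices keep more rows but pull the price distribution toward each country's median.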


## 3. Removing duplicates

In [5]:
# Detect fully duplicated rows
duplicados = df[df.duplicated()]

print("Duplicate rows detected:")
display(duplicados)

# Remove duplicates
df.drop_duplicates(inplace=True)

print("Shape after removing duplicates:", df.shape)

Duplicate rows detected:


| | id | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|


Shape after removing duplicates: (120916, 14)
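
An empty result from `df.duplicated()` does not by itself rule out repeated reviews. The check compares every column, so a row counter such as `id` (which appears to run from 0 to 129970 in the statistics earlier) makes otherwise identical rows differ. The earlier summary also shows 119955 unique descriptions across 129971 rows, which suggests some repetition. A minimal sketch of a check that ignores `id`, on made-up toy rows:

```python
import pandas as pd

# Toy data (made up): rows 0 and 1 differ only by id
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "title": ["Wine A 2011", "Wine A 2011", "Wine B 2013"],
    "points": [87, 87, 90],
})

# With id included, no row is an exact duplicate
print(toy.duplicated().sum())

# Compare on every column except id
cols = [c for c in toy.columns if c != "id"]
dupes_ignoring_id = toy.duplicated(subset=cols)
deduped = toy.drop_duplicates(subset=cols)
print(dupes_ignoring_id.sum(), deduped.shape)
```

By default `drop_duplicates` keeps the first occurrence of each group; the `keep` parameter changes that.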


## 4. Data type conversion
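
A minimal sketch of what type conversion could look like for this kind of data, assuming `points` should be an integer, `price` a float, and low-cardinality text such as `country` a categorical. The toy values, including the numbers stored as strings, are made up to show the conversions.

```python
import pandas as pd

# Toy data (made up): numeric columns arrive as strings, one price is unparseable
toy = pd.DataFrame({
    "country": ["Italy", "US", "US"],
    "points": ["87", "90", "88"],
    "price": ["15.0", "n/a", "65"],
})

# errors="coerce" turns unparseable entries into NaN instead of raising
toy["points"] = pd.to_numeric(toy["points"], errors="coerce").astype("int64")
toy["price"] = pd.to_numeric(toy["price"], errors="coerce")

# Categorical storage suits columns with few distinct values
toy["country"] = toy["country"].astype("category")
print(toy.dtypes)
```

Note that `astype("int64")` fails if the column contains NaN, so it is only safe on columns without missing values, as `points` is here.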

In [None]:
# Highest-rated wines from Italy (more than 90 points), cheapest first
italian_high_score = df[(df['country'] == 'Italy') & (df['points'] > 90)].sort_values('price', ascending=True)


# Highest-scoring wines from Spain
mejores_espanya = df[(df['country'] == 'Spain')].sort_values('points', ascending=False)

print(mejores_espanya[['title', 'points', 'price']])
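
Printing a whole sorted frame can produce thousands of rows. As a sketch of a more compact view, `nlargest` returns only the top entries by a column. The toy data and the `top_spain` name are assumptions for illustration.

```python
import pandas as pd

# Toy data (made up)
toy = pd.DataFrame({
    "country": ["Spain", "Spain", "Spain", "Italy"],
    "title": ["A", "B", "C", "D"],
    "points": [91, 95, 88, 93],
    "price": [20.0, 60.0, 12.0, 35.0],
})

# Filter to one country, then keep the two highest-scoring rows
spain = toy[toy["country"] == "Spain"]
top_spain = spain.nlargest(2, "points")[["title", "points", "price"]]
print(top_spain)
```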