Practice 43: Data Cleaning, Handling, and Transformation with Pandas

Load the file retail2.csv into a Pandas DataFrame and carry out whatever querying, exploration, and cleaning operations are needed. Some cleaning steps are stated explicitly as questions. The files contain several columns, and some of them hold data that may need cleaning or other treatment.

## Exploring the retail2 data

In [156]:
import pandas as pd
import numpy as np

In [157]:
retail2_df = pd.read_csv('retail2.csv')
exchange_rates_df = pd.read_csv('exchange_rates.csv')

In [158]:
retail2_df.head(3)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,CustomerName,Email,...,StockLevel,Discount,SaleChannel,ReturnStatus,ProductWeight,ProductDimensions,ShippingCost,SalesRegion,PromotionCode,PaymentMethod
0,536578.0,84969,"[""description"": ""BOX OF 6 ASSORTED COLOUR TEAS...",6,12/1/2010 12:28,4.25,17763,United Kingdom,David Johnson,david.johnson@mail.com,...,853,17.28,Online,Not Returned,3.81,37x38x83 cm,7.14,North America,SALE15,Bank Transfer
1,536446.0,21756,DOORMAT NEW ENGLAND,100,12/1/2010 10:16,795.0,15939,United Kingdom,Henry Williams,henry.williams@test.org,...,910,9.08,Online,Not Returned,9.51,8x65x86 cm,12.48,Asia,SALE15,Bank Transfer
2,536633.0,22632,HAND WARMER RED POLKA DOT,6,12/1/2010 13:23,1.85,12295,United Kingdom,Jane Brown,jane.brown@mail.com,...,578,45.42,In-Store,Returned,7.35,17x71x89 cm,14.27,North America,DISCOUNT5,Bank Transfer


In [159]:
# General information about the retail2 DataFrame
retail2_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 440 entries, 0 to 439
Data columns (total 24 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   InvoiceNo          420 non-null    float64
 1   StockCode          440 non-null    object 
 2   Description        413 non-null    object 
 3   Quantity           440 non-null    object 
 4   InvoiceDate        435 non-null    object 
 5   UnitPrice          440 non-null    object 
 6   CustomerID         440 non-null    int64  
 7   Country            439 non-null    object 
 8   CustomerName       440 non-null    object 
 9   Email              440 non-null    object 
 10  Address            440 non-null    object 
 11  PhoneNumber        440 non-null    object 
 12  Category           440 non-null    object 
 13  Supplier           440 non-null    object 
 14  StockLevel         440 non-null    int64  
 15  Discount           440 non-null    float64
 16  SaleChannel        440 non

# Data cleaning: InvoiceNo column

- Fill the null values with a specific value, 0
- Convert the float values to integers

In [160]:
# Assumes retail2_df is already defined
min_invoice = retail2_df['InvoiceNo'].min()
max_invoice = retail2_df['InvoiceNo'].max()

print(f"Invoice numbers start at: {min_invoice}")
print(f"Invoice numbers end at: {max_invoice}")

Invoice numbers start at: 536365.0
Invoice numbers end at: 536744.0


In [161]:
# Assumes retail2_df is already defined.
# range() excludes its stop value, so 536745 is needed to cover the last invoice (536744).
expected_range = set(range(536365, 536745))
actual_values = set(retail2_df['InvoiceNo'].dropna().astype(int))
missing_values = expected_range - actual_values

print(f"Missing values in the InvoiceNo column: {sorted(missing_values)}")

Missing values in the InvoiceNo column: [536397, 536422, 536497, 536508, 536586, 536596, 536602, 536603, 536612, 536625, 536642, 536659, 536666, 536685, 536714]
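Because `range()` stops one short of its second argument, a hard-coded upper bound is easy to get wrong by one. A minimal sketch, using a small made-up Series (`invoice_no` is illustrative, not the notebook's data), derives both bounds from the column itself:

```python
import pandas as pd

# Hypothetical sample: invoice numbers with one null and one gap (536367).
invoice_no = pd.Series([536365.0, 536366.0, None, 536368.0])

present = set(invoice_no.dropna().astype(int))
# +1 because range() excludes its stop value.
expected = set(range(min(present), max(present) + 1))
missing = sorted(expected - present)
print(missing)
```

The same pattern should carry over to `retail2_df['InvoiceNo']`, as long as the column has at least one non-null value.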


In [162]:

# Fill null values with a specific number, for example 0
retail2_df['InvoiceNo'] = retail2_df['InvoiceNo'].fillna(0)

# Convert the column to integer type
retail2_df['InvoiceNo'] = retail2_df['InvoiceNo'].astype(int)

# Check the change
print(retail2_df['InvoiceNo'].head())

0    536578
1    536446
2    536633
3    536522
4         0
Name: InvoiceNo, dtype: int64
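Filling with 0 turns each missing invoice into something that looks like a real number. As a possible alternative (not what the notebook does), pandas' nullable `Int64` dtype stores integers while keeping the gaps as `<NA>`. A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical sample standing in for retail2_df['InvoiceNo'].
invoice_no = pd.Series([536578.0, 536446.0, None, 536522.0])

# 'Int64' (capital I) is the nullable integer dtype; NaN becomes <NA>.
as_nullable = invoice_no.astype('Int64')
print(as_nullable.dtype)
print(as_nullable.isna().sum())
```

With this dtype the missing rows can still be found with `isna()` instead of comparing against a sentinel value.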


In [163]:
# Assumes retail2_df is already defined and its null values were already filled with 0
# (the next two lines repeat the previous cell, so they change nothing)
retail2_df['InvoiceNo'] = retail2_df['InvoiceNo'].fillna(0)
retail2_df['InvoiceNo'] = retail2_df['InvoiceNo'].astype(int)

# Filter and show the rows whose InvoiceNo is 0
cero_values = retail2_df[retail2_df['InvoiceNo'] == 0]
print(cero_values['InvoiceNo'])

4      0
24     0
25     0
52     0
78     0
91     0
102    0
165    0
167    0
208    0
209    0
218    0
251    0
253    0
297    0
299    0
304    0
366    0
395    0
412    0
Name: InvoiceNo, dtype: int64


In [164]:
# Dictionary mapping row indices to the new InvoiceNo values
nuevos_valores = {
    91: 536397,
    366: 536422,
    4: 536497,
    102: 536508,
    304: 536524,
    297: 536570,
    25: 536586,
    253: 536596,
    251: 536603,
    165: 536612,
    208: 536625,
    218: 536642,
    299: 536659,
    209: 536666,
    395: 536685,
    167: 536698,
    24: 536714,
    78: 536653,
    52: 536730,
    412: 536384
}

# Assign the new values
for index, new_value in nuevos_valores.items():
    retail2_df.loc[index, 'InvoiceNo'] = new_value
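The row-by-row loop can also be written as a single `.loc` assignment. The sketch below uses a made-up frame and a made-up `new_values` dictionary of the same shape (index label to new InvoiceNo):

```python
import pandas as pd

# Hypothetical frame and corrections (index label -> new InvoiceNo).
df = pd.DataFrame({'InvoiceNo': [536578, 0, 536633, 0]})
new_values = {1: 536497, 3: 536586}

# Assign all corrections at once: the i-th value goes to the i-th listed label.
df.loc[list(new_values), 'InvoiceNo'] = list(new_values.values())
print(df['InvoiceNo'].tolist())
```

Both versions give the same result. The single assignment mainly saves typing and avoids one `.loc` call per row.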



In [165]:
# Assumes retail2_df is already defined.
# range() excludes its stop value, so 536745 is needed to cover the last invoice (536744).
expected_range = set(range(536365, 536745))
actual_values = set(retail2_df['InvoiceNo'].dropna().astype(int))
missing_values = expected_range - actual_values

print(f"Missing values in the InvoiceNo column: {sorted(missing_values)}")

Missing values in the InvoiceNo column: [536602]


In [166]:
# Create a DataFrame holding the missing value
missing_value = pd.DataFrame({'InvoiceNo': [536602]})

# Concatenate this new DataFrame with the original one
# (every other column of the added row is NaN)
retail2_df = pd.concat([retail2_df, missing_value], ignore_index=True)

# Sort the DataFrame by InvoiceNo in ascending order
retail2_df = retail2_df.sort_values(by='InvoiceNo')

# Check again for missing values in the expected range
expected_range = set(range(536365, 536745))
actual_values = set(retail2_df['InvoiceNo'].dropna().astype(int))
missing_values = expected_range - actual_values

print(f"Missing values in the InvoiceNo column: {sorted(missing_values)}")

Missing values in the InvoiceNo column: []


In [167]:
# Configure pandas to display all rows
pd.set_option('display.max_rows', None)

# Show the first 10 values of the InvoiceNo column
print(retail2_df['InvoiceNo'].head(10))

47     536365
38     536365
302    536365
361    536365
307    536365
175    536365
313    536365
211    536366
31     536366
86     536367
Name: InvoiceNo, dtype: int64


# Data cleaning: InvoiceDate column


- Inspect the values the column contains.
- Remove the word 'Date:'.
- Unify every row to the single date 12/01/2010, because the CSV shows that all the sales belong to the same day. The values replaced are 01/12/2010, 01/01/1900, 01/01/2050, 30/12/2010, 12/1/2010 and 30/02/2010.
- Try to convert the InvoiceDate column to datetime using the correct format.
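The column mixes several layouts, so a single format string leaves some rows as `NaT`. One possible approach, sketched on made-up strings, tries a second format for the rows the first one could not parse. The dash-separated format and the day-first reading of both layouts are assumptions, and `raw` is illustrative data:

```python
import pandas as pd

# Hypothetical sample covering the layouts described in the steps above.
raw = pd.Series([
    'Date: 12/1/2010 10:00',   # stray prefix
    '12/01/2010 09:08',        # slash-separated
    '01-12-2010 09:47',        # dash-separated
    '01/12/2010 25:61',        # impossible time
])

cleaned = raw.str.replace('Date:', '', regex=False).str.strip()

# Parse each layout separately; values that do not fit become NaT.
slash = pd.to_datetime(cleaned, format='%d/%m/%Y %H:%M', errors='coerce')
dash = pd.to_datetime(cleaned, format='%d-%m-%Y %H:%M', errors='coerce')

# Keep the slash result where it exists, otherwise fall back to the dash result.
parsed = slash.fillna(dash)
print(parsed)
```

Rows that are still `NaT` afterwards, such as the one with the impossible time, are the ones that need a manual decision.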


In [168]:
# Inspect the values in sorted order
retail2_df = retail2_df.sort_values(by='InvoiceDate') 
print(retail2_df['InvoiceDate'].unique())

['01-12-2010 09:47' '01-12-2010 10:09' '01-12-2010 11:27'
 '01-12-2010 13:42' '01/01/1900 00:00' '01/01/2050 00:00'
 '01/12/2010 25:61' '12/01/2010 09:08' '12/01/2010 10:41'
 '12/01/2010 10:52' '12/01/2010 12:22' '12/01/2010 14:55'
 '12/1/2010 10:00' '12/1/2010 10:01' '12/1/2010 10:03' '12/1/2010 10:04'
 '12/1/2010 10:05' '12/1/2010 10:06' '12/1/2010 10:07' '12/1/2010 10:08'
 '12/1/2010 10:10' '12/1/2010 10:12' '12/1/2010 10:13' '12/1/2010 10:15'
 '12/1/2010 10:16' '12/1/2010 10:17' '12/1/2010 10:18' '12/1/2010 10:20'
 '12/1/2010 10:21' '12/1/2010 10:22' '12/1/2010 10:23' '12/1/2010 10:24'
 '12/1/2010 10:26' '12/1/2010 10:27' '12/1/2010 10:28' '12/1/2010 10:29'
 '12/1/2010 10:30' '12/1/2010 10:31' '12/1/2010 10:32' '12/1/2010 10:33'
 '12/1/2010 10:35' '12/1/2010 10:36' '12/1/2010 10:37' '12/1/2010 10:38'
 '12/1/2010 10:39' '12/1/2010 10:40' '12/1/2010 10:42' '12/1/2010 10:44'
 '12/1/2010 10:46' '12/1/2010 10:47' '12/1/2010 10:48' '12/1/2010 10:49'
 '12/1/2010 10:50' '12/1/2010 10:51' '

In [169]:
# Remove the word 'Date:'
retail2_df['InvoiceDate'] = retail2_df['InvoiceDate'].str.replace(r'Date:', '', regex=True).str.strip()
retail2_df = retail2_df.sort_values(by='InvoiceDate')
print(retail2_df['InvoiceDate'].unique())

['01-12-2010 09:47' '01-12-2010 10:09' '01-12-2010 11:27'
 '01-12-2010 13:42' '01/01/1900 00:00' '01/01/2050 00:00'
 '01/12/2010 25:61' '12/01/2010 09:08' '12/01/2010 10:41'
 '12/01/2010 10:52' '12/01/2010 12:22' '12/01/2010 14:55'
 '12/1/2010 10:00' '12/1/2010 10:01' '12/1/2010 10:03' '12/1/2010 10:04'
 '12/1/2010 10:05' '12/1/2010 10:06' '12/1/2010 10:07' '12/1/2010 10:08'
 '12/1/2010 10:10' '12/1/2010 10:12' '12/1/2010 10:13' '12/1/2010 10:15'
 '12/1/2010 10:16' '12/1/2010 10:17' '12/1/2010 10:18' '12/1/2010 10:20'
 '12/1/2010 10:21' '12/1/2010 10:22' '12/1/2010 10:23' '12/1/2010 10:24'
 '12/1/2010 10:26' '12/1/2010 10:27' '12/1/2010 10:28' '12/1/2010 10:29'
 '12/1/2010 10:30' '12/1/2010 10:31' '12/1/2010 10:32' '12/1/2010 10:33'
 '12/1/2010 10:35' '12/1/2010 10:36' '12/1/2010 10:37' '12/1/2010 10:38'
 '12/1/2010 10:39' '12/1/2010 10:40' '12/1/2010 10:42' '12/1/2010 10:44'
 '12/1/2010 10:46' '12/1/2010 10:47' '12/1/2010 10:48' '12/1/2010 10:49'
 '12/1/2010 10:50' '12/1/2010 10:51' '

In [170]:
# Try to convert the InvoiceDate column to datetime with the correct format
retail2_df['InvoiceDate'] = pd.to_datetime(retail2_df['InvoiceDate'], format='%d/%m/%Y %H:%M', errors='coerce')

# Define the valid date
valid_date = pd.to_datetime('12/01/2010', format='%d/%m/%Y')

# Select the rows whose date differs from the valid date (only the date part is compared)
invalid_dates = retail2_df[retail2_df['InvoiceDate'].dt.date != valid_date.date()]

# Show only the InvoiceNo and InvoiceDate columns of the rows with invalid dates
print(invalid_dates[['InvoiceNo', 'InvoiceDate']])

# Count the rows with invalid dates
invalid_count = invalid_dates.shape[0]
print(f"Number of rows with invalid dates: {invalid_count}")

     InvoiceNo InvoiceDate
223     536417         NaT
32      536439         NaT
114     536517         NaT
396     536652         NaT
100     536588  1900-01-01
83      536598  1900-01-01
338     536380  1900-01-01
268     536636  1900-01-01
352     536473  1900-01-01
139     536687  1900-01-01
186     536407  1900-01-01
358     536425  1900-01-01
160     536544  1900-01-01
87      536677  1900-01-01
78      536653  2050-01-01
104     536564  2050-01-01
69      536564  2050-01-01
34      536651  2050-01-01
6       536707  2050-01-01
173     536519  2050-01-01
380     536444  2050-01-01
142     536374  2050-01-01
279     536709  2050-01-01
241     536530  2050-01-01
308     536449  2050-01-01
174     536390         NaT
16      536676         NaT
430     536370         NaT
52      536730         NaT
285     536669         NaT
82      536432         NaT
318     536673         NaT
90      536605         NaT
281     536630         NaT
105     536630         NaT
58      536455         NaT
4

In [171]:
# Assumes the InvoiceDate column was already converted to datetime
retail2_df['InvoiceDate'] = pd.to_datetime(retail2_df['InvoiceDate'], errors='coerce', dayfirst=True)

# Dictionary of corrected dates keyed by InvoiceNo
fechas_actualizadas = {
    536417: '12/01/2010 9:47',
    536439: '12/01/2010 10:09',
    536517: '12/01/2010 11:27',
    536652: '12/01/2010 13:42',
    536588: '12/01/2010 12:38',
    536598: '12/01/2010 12:48',
    536380: '12/01/2010 9:10',
    536636: '12/01/2010 13:26',
    536473: '12/01/2010 10:43',
    536687: '12/01/2010 14:17',
    536407: '12/01/2010 9:37',
    536425: '12/01/2010 9:55',
    536544: '12/01/2010 11:54',
    536677: '12/01/2010 14:07',
    536653: '12/01/2010 13:44',
    536564: '12/01/2010 12:14',
    536651: '12/01/2010 13:41',
    536707: '12/01/2010 14:37',
    536519: '12/01/2010 11:29',
    536444: '12/01/2010 10:14',
    536374: '12/01/2010 9:04',
    536709: '12/01/2010 14:39',
    536530: '12/01/2010 11:40',
    536449: '12/01/2010 10:19',
    536390: '12/01/2010 9:20',
    536676: '12/01/2010 14:06',
    536370: '12/01/2010 9:00',
    536730: '12/01/2010 15:00',
    536669: '12/01/2010 13:59',
    536432: '12/01/2010 10:02',
    536673: '12/01/2010 14:03',
    536605: '12/01/2010 12:55',
    536630: '12/01/2010 13:20',
    536455: '12/01/2010 10:25',
    536384: '12/01/2010 9:14',
    536375: '12/01/2010 9:05',
    536369: '12/01/2010 8:36',
    536628: '12/01/2010 13:18',
    536464: '12/01/2010 10:34',
    536441: '12/01/2010 10:11',
    536388: '12/01/2010 9:18',
    536475: '12/01/2010 10:45',
    536568: '12/01/2010 12:18',
    536584: '12/01/2010 12:34',
    536602: '12/01/2010 12:52',
    536604: '12/01/2010 12:54'
}

# Update the dates in the DataFrame
for invoice_no, fecha in fechas_actualizadas.items():
    retail2_df.loc[retail2_df['InvoiceNo'] == invoice_no, 'InvoiceDate'] = pd.to_datetime(fecha, dayfirst=True)

# Format the dates as day/month/year hour:minute strings
retail2_df['InvoiceDate'] = retail2_df['InvoiceDate'].dt.strftime('%d/%m/%Y %H:%M')

# Identify the dates that could not be converted
fechas_invalidas = retail2_df[retail2_df['InvoiceDate'].isna()]

# Print the heading for the invalid dates (the rows themselves are held in fechas_invalidas)
print("Invalid dates:")



Invalid dates:
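The per-invoice loop can also be expressed with `Series.map`, which looks up every InvoiceNo in the dictionary in one pass. This is a sketch on made-up data. `corrections` plays the role of `fechas_actualizadas`, and the explicit format string is an assumption based on how the dates in that dictionary are written:

```python
import pandas as pd

# Hypothetical data: invoice 536417 has an unparseable date (NaT).
df = pd.DataFrame({
    'InvoiceNo': [536365, 536417, 536417],
    'InvoiceDate': pd.to_datetime(['2010-01-12 08:26', None, None]),
})
corrections = {536417: '12/01/2010 9:47'}

# Rows without an entry in the dictionary map to NaN, which parses to NaT.
fixed = pd.to_datetime(df['InvoiceNo'].map(corrections), format='%d/%m/%Y %H:%M')

# Use the correction where one exists, otherwise keep the existing date.
df['InvoiceDate'] = fixed.fillna(df['InvoiceDate'])
print(df)
```

As in the loop, a correction overrides whatever date the row had before, so the dictionary should only list invoices whose dates are known to be wrong.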


In [172]:
# Inspect the values in sorted order
retail2_df = retail2_df.sort_values(by='InvoiceDate') 
print(retail2_df['InvoiceDate'].unique())

['12/01/2010 08:26' '12/01/2010 08:28' '12/01/2010 08:34'
 '12/01/2010 08:35' '12/01/2010 08:36' '12/01/2010 09:00'
 '12/01/2010 09:01' '12/01/2010 09:02' '12/01/2010 09:03'
 '12/01/2010 09:04' '12/01/2010 09:05' '12/01/2010 09:06'
 '12/01/2010 09:07' '12/01/2010 09:08' '12/01/2010 09:09'
 '12/01/2010 09:10' '12/01/2010 09:11' '12/01/2010 09:12'
 '12/01/2010 09:13' '12/01/2010 09:14' '12/01/2010 09:15'
 '12/01/2010 09:16' '12/01/2010 09:17' '12/01/2010 09:18'
 '12/01/2010 09:19' '12/01/2010 09:20' '12/01/2010 09:21'
 '12/01/2010 09:22' '12/01/2010 09:23' '12/01/2010 09:24'
 '12/01/2010 09:25' '12/01/2010 09:26' '12/01/2010 09:27'
 '12/01/2010 09:28' '12/01/2010 09:29' '12/01/2010 09:30'
 '12/01/2010 09:31' '12/01/2010 09:32' '12/01/2010 09:33'
 '12/01/2010 09:34' '12/01/2010 09:35' '12/01/2010 09:36'
 '12/01/2010 09:37' '12/01/2010 09:38' '12/01/2010 09:39'
 '12/01/2010 09:40' '12/01/2010 09:41' '12/01/2010 09:42'
 '12/01/2010 09:43' '12/01/2010 09:44' '12/01/2010 09:45'
 '12/01/2010 0

## Data cleaning: Description column

- Print the unique values
- Remove the characters *, [, ', {, and "
- Remove the words 'description:', 'DETAILS' and 'INVALIDDESCRIPTION123'
- Convert the text to uppercase
- Remove extra whitespace from the column


In [173]:
# Inspect the values in sorted order
retail2_df = retail2_df.sort_values(by='Description') 
print(retail2_df['Description'].unique())

['  BAKING SET 9 PIECE RETROSPOT  ' '  BOX OF VINTAGE ALPHABET BLOCKS  '
 '  BOX OF VINTAGE JIGSAW BLOCKS  ' '  CHOCOLATE HOT WATER BOTTLE  '
 '  CREAM CUPID HEARTS COAT HANGER  ' '  HAND WARMER RED POLKA DOT  '
 '  HAND WARMER UNION JACK  ' '  HOME BUILDING BLOCK WORD  '
 '  JUMBO BAG RED RETROSPOT  ' '  PACK OF 12 BLUE RETROSPOT TISSUES  '
 '  PACK OF 6 SMALL FRUIT STRAWS  ' "  POPPY'S PLAYHOUSE BEDROOM  "
 "  POPPY'S PLAYHOUSE KITCHEN  " '  RED HARMONICA IN BOX  '
 '  RED TOADSTOOL LED NIGHT LIGHT  ' '  RETROSPOT TEA SET CERAMIC 11 PC  '
 '  WHITE METAL LANTERN  ' '  nan  ' '***BAKING SET 9 PIECE RETROSPOT***'
 '***CHOCOLATE HOT WATER BOTTLE***' '***DOORMAT NEW ENGLAND***'
 '***FELTCRAFT PRINCESS CHARLOTTE DOLL***'
 '***HAND WARMER RED POLKA DOT***' '***IVORY KNITTED MUG COSY***'
 '***JAM MAKING SET WITH JARS***' "***POPPY'S PLAYHOUSE BEDROOM***"
 "***POPPY'S PLAYHOUSE KITCHEN***" '***RED GLASS CANDLE HOLDER STAR***'
 '***RED TOADSTOOL LED NIGHT LIGHT***'
 '***RED WOOLLY HOTTIE WHIT

In [174]:
# Remove the characters *, [, ], {, }, ', ", : and . from the column
retail2_df['Description']=retail2_df['Description'].str.replace(r'[\*\[\{\'\"\}\]\:\.]','',regex=True).str.strip()

# Remove the words 'description:', 'DETAILS' and 'INVALIDDESCRIPTION123'
retail2_df['Description'] = retail2_df['Description'].str.replace(r'description:|DETAILS|INVALIDDESCRIPTION123|DESCRIPTION', '', regex=True).str.strip()

# Convert the text in the 'Description' column to uppercase
retail2_df['Description'] = retail2_df['Description'].str.upper()


# Sort the data alphabetically by the 'Description' column
retail2_df = retail2_df.sort_values(by='Description')

# Remove extra whitespace in the 'Description' column
retail2_df['Description'] = retail2_df['Description'].str.replace(r'\s+', ' ', regex=True).str.strip()

# Show the unique values after cleaning
print(retail2_df['Description'].unique())

['ASSORTED COLOUR BIRD ORNAMENT' 'BAKING SET 9 PIECE RETROSPOT'
 'BLUE HARMONICA IN BOX' 'BLUE HARMONICA IN BOX DETAILS'
 'BOX OF 6 ASSORTED COLOUR TEASPOONS' 'BOX OF VINTAGE ALPHABET BLOCKS'
 'BOX OF VINTAGE ALPHABET BLOCKS DETAILS' 'BOX OF VINTAGE JIGSAW BLOCKS'
 'CHOCOLATE HOT WATER BOTTLE' 'CHOCOLATE HOT WATER BOTTLE DETAILS'
 'COFFEE MUG APPLES DESIGN' 'CREAM CUPID HEARTS COAT HANGER'
 'DESCRIPTION ASSORTED COLOUR BIRD ORNAMENT'
 'DESCRIPTION BAKING SET 9 PIECE RETROSPOT'
 'DESCRIPTION BLUE HARMONICA IN BOX'
 'DESCRIPTION BOX OF 6 ASSORTED COLOUR TEASPOONS'
 'DESCRIPTION BOX OF VINTAGE ALPHABET BLOCKS'
 'DESCRIPTION BOX OF VINTAGE JIGSAW BLOCKS'
 'DESCRIPTION DESCRIPTION RED GLASS CANDLE HOLDER STAR'
 'DESCRIPTION HAND WARMER RED POLKA DOT'
 'DESCRIPTION HAND WARMER UNION JACK'
 'DESCRIPTION HOME BUILDING BLOCK WORD'
 'DESCRIPTION JAM MAKING SET WITH JARS'
 'DESCRIPTION JAM MAKING SET WITH JARS DETAILS'
 'DESCRIPTION PACK OF 12 RED RETROSPOT TISSUES DETAILS'
 'DESCRIPTION RECIPE B
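The output still contains entries such as 'BLUE HARMONICA IN BOX DETAILS' and 'DESCRIPTION HAND WARMER UNION JACK'. A likely cause is the order of the steps: the colon is stripped before 'description:' is searched for, and lowercase wrapper words are only uppercased after the word-removal step has already run. The sketch below, on made-up values, uppercases first and then removes the wrapper words as whole words. Like the notebook's own pattern, it also drops apostrophes:

```python
import pandas as pd

# Hypothetical sample of the kinds of noise described in the steps above.
desc = pd.Series([
    '  BAKING SET 9 PIECE RETROSPOT  ',
    '***DOORMAT NEW ENGLAND***',
    '["description": "BOX OF 6 ASSORTED COLOUR TEASPOONS"]',
    'Blue harmonica in box: details',
    None,
])

clean = desc.str.upper()
# Remove the wrapper words only when they appear as whole words.
clean = clean.str.replace(
    r'\b(?:DESCRIPTION|DETAILS|INVALIDDESCRIPTION123)\b', '', regex=True)
# Remove the stray punctuation characters.
clean = clean.str.replace(r'[*\[\]{}\'":.]', '', regex=True)
# Collapse runs of whitespace and trim both ends.
clean = clean.str.replace(r'\s+', ' ', regex=True).str.strip()
print(clean.tolist())
```

Missing descriptions stay missing, so they can still be counted with `isnull()` afterwards.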

In [175]:
# Count the empty values in the 'Description' column
missing_values_count = retail2_df['Description'].isnull().sum()

# Select the rows with empty values in 'Description'
missing_values = retail2_df[retail2_df['Description'].isnull()]

# Keep only the 'InvoiceNo', 'StockCode' and 'Description' columns
filtered_missing_values = missing_values[['InvoiceNo', 'StockCode', 'Description']]

# Show the total number of empty values
print(f"There are {missing_values_count} empty values in the 'Description' column.")

# Show the filtered rows
print("Rows with empty values in 'Description':")
print(filtered_missing_values)

There are 28 empty values in the 'Description' column.
Rows with empty values in 'Description':
     InvoiceNo                                          StockCode Description
360     536367                                              84879         NaN
319     536378                                              22634         NaN
322     536379                          SCANDINAVIAN REDS RIBBONS         NaN
309     536387                                              21756         NaN
401     536401                                              22386         NaN
314     536415  {"description": ["RED WOOLLY HOTTIE WHITE HEAR...         NaN
223     536417                                              21730         NaN
298     536434                              RED  HARMONICA IN BOX         NaN
243     536438                                              22632         NaN
32      536439                                              21914         NaN
213     536451                 STRIPED CHARLIE+L

# Data cleaning: StockCode column

- Look at the unique values in the column
- Unify the code format, for example by removing the trailing letters so that the codes are no longer alphanumeric
- Identify the rows where StockCode contains a description instead of a code
- Count the null values in the 'StockCode' column
- Fill the missing 'StockCode' values from the 'Description' column, using a dictionary that maps each Description to its StockCode
- Check for missing values again
- One value is still missing, so create a dictionary for that value
- Verify that no missing values remain

In [176]:
# Inspect the values in sorted order
retail2_df = retail2_df.sort_values(by='InvoiceDate') 
print(retail2_df['StockCode'].unique())


['22752' '71053' '84406B' '21730' '85123A' '84029E' '84029G' '22632'
 '22633' '22745' '22749' '22623' '84969' '21755' '22748' '21754' '22622'
 '84879' '22310' '21777' '84988' '21756' '21724' '21723' '22111' '21914'
 '21731' '22960' '22139' '22112' '22634' 'SCANDINAVIAN REDS RIBBONS'
 '22725' '22624' '22570' '22571' '22754' '22386' '85099B' '84907B'
 '84907C' '84907A' '22751' '22382' '22469' '22384'
 '{"description": ["RED WOOLLY HOTTIE WHITE HEART.: details"]}'
 'RED  HARMONICA IN BOX' 'STRIPED CHARLIE+LOLA CHARLOTTE BAG'
 'JAM MAKING SET WITH JARS' 'SET 2 TEA TOWELS I LOVE LONDON' nan
 'HAND WARMER RED POLKA DOT' 'BOX OF 6 ASSORTED COLOUR TEASPOONS'
 'RED HARMONICA IN BOX' 'LOVE BUILDING BLOCK WORD']


In [177]:
# Strip leading and trailing whitespace from the values in the 'StockCode' column
retail2_df['StockCode'] = retail2_df['StockCode'].str.strip()

# Show the first rows to check the change
print(retail2_df['StockCode'].head())

302     22752
361     71053
313    84406B
175     21730
307    85123A
Name: StockCode, dtype: object


In [178]:
retail2_df = retail2_df.sort_values(by='InvoiceDate') 
print(retail2_df['StockCode'].unique())


['22752' '71053' '84406B' '21730' '85123A' '84029E' '84029G' '22632'
 '22633' '22310' '84879' '22622' '21754' '22748' '84969' '22623' '22749'
 '22745' '21755' '21777' '84988' '21756' '21723' '21724' '22111' '21914'
 '21731' '22960' '22139' '22112' '22634' 'SCANDINAVIAN REDS RIBBONS'
 '22725' '22624' '22570' '22571' '22754' '22386' '85099B' '84907B'
 '84907C' '84907A' '22751' '22382' '22469' '22384'
 '{"description": ["RED WOOLLY HOTTIE WHITE HEART.: details"]}'
 'RED  HARMONICA IN BOX' 'STRIPED CHARLIE+LOLA CHARLOTTE BAG'
 'JAM MAKING SET WITH JARS' 'SET 2 TEA TOWELS I LOVE LONDON' nan
 'HAND WARMER RED POLKA DOT' 'BOX OF 6 ASSORTED COLOUR TEASPOONS'
 'RED HARMONICA IN BOX' 'LOVE BUILDING BLOCK WORD']


In [179]:
import json

# Move descriptions that ended up in 'StockCode' into 'Description'
def move_descriptions(row):
    if pd.isna(row['Description']) and isinstance(row['StockCode'], str):
        if row['StockCode'].startswith('{"description":'):
            # The value is a JSON object; json.loads parses it without executing code, unlike eval
            return json.loads(row['StockCode'])['description'][0]
        else:
            return row['StockCode']
    return row['Description']

# Apply the function to update the 'Description' column
retail2_df['Description'] = retail2_df.apply(move_descriptions, axis=1)

# Drop the rows whose 'Description' is the literal text 'nan'
retail2_df = retail2_df[~retail2_df['Description'].isin(['nan', '***nan***'])]

# Display the resulting DataFrame
retail2_df

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,CustomerName,Email,...,StockLevel,Discount,SaleChannel,ReturnStatus,ProductWeight,ProductDimensions,ShippingCost,SalesRegion,PromotionCode,PaymentMethod
302,536365,22752,SET 7 BABUSHKA NESTING BOXES,2,12/01/2010 08:26,7.65,16989.0,Denmark,Jane Taylor,jane.taylor@test.org,...,920.0,23.37,In-Store,Not Returned,5.45,9x48x72 cm,8.48,Australia,DISCOUNT5,PayPal
361,536365,71053,WHITE METAL LANTERN,6,12/01/2010 08:26,3.39,13849.0,Denmark,John Jones,john.jones@demo.net,...,23.0,13.35,Online,Returned,6.47,77x81x40 cm,18.91,Asia,,Credit Card
313,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/01/2010 08:26,2.75,18058.0,Denmark,Grace Moore,grace.moore@example.com,...,334.0,11.02,Online,Not Returned,9.63,25x13x82 cm,19.87,North America,,Gift Card
175,536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,12/01/2010 08:26,425.0,11849.0,Germany,Eva Davis,eva.davis@example.com,...,394.0,8.31,Online,Not Returned,1.83,47x68x76 cm,13.56,Australia,SALE15,Gift Card
307,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/01/2010 08:26,2.55,10049.0,Denmark,Frank Brown,frank.brown@example.com,...,817.0,47.14,In-Store,Returned,3.3,31x24x99 cm,7.05,Europe,PROMO10,Credit Card
38,536365,84029E,RED WOOLLY HOTTIE WHITE HEART,6,12/01/2010 08:26,3.39,10619.0,United Kingdom,Eva Johnson,eva.johnson@mail.com,...,785.0,0.23,Online,Not Returned,6.87,5x33x65 cm,12.9,Australia,SALE15,Bank Transfer
47,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/01/2010 08:26,3.39,15366.0,United Kingdom,Alice Smith,alice.smith@demo.net,...,205.0,11.23,In-Store,Returned,5.25,5x3x53 cm,10.05,Australia,PROMO10,Gift Card
211,536366,22632,HAND WARMER RED POLKA DOT,6,12/01/2010 08:28,1.85,11914.0,Germany,Grace Wilson,grace.wilson@mail.com,...,582.0,18.05,In-Store,Returned,6.95,20x86x92 cm,11.73,Asia,PROMO20,Cash
31,536366,22633,HAND WARMER UNION JACK,6,12/01/2010 08:28,1.85,15258.0,United Kingdom,David Moore,david.moore@test.org,...,133.0,43.49,In-Store,Returned,1.79,27x63x17 cm,17.22,Asia,SALE15,Gift Card
222,536367,22310,IVORY KNITTED MUG COSY,6,12/01/2010 08:34,1.65,12524.0,Germany,David Moore,david.moore@test.org,...,823.0,32.94,Online,Returned,1.03,38x51x63 cm,12.79,Australia,PROMO10,PayPal


In [180]:
# Inspect the values in sorted order
retail2_df = retail2_df.sort_values(by='InvoiceDate') 
print(retail2_df['StockCode'].unique())


['22752' '71053' '84406B' '21730' '85123A' '84029E' '84029G' '22632'
 '22633' '21755' '22745' '22749' '22623' '84969' '21754' '22622' '84879'
 '22310' '22748' '21777' '84988' '21756' '21724' '21723' '22111' '21914'
 '21731' '22960' '22139' '22112' '22634' 'SCANDINAVIAN REDS RIBBONS'
 '22725' '22624' '22570' '22571' '22754' '22386' '85099B' '84907B'
 '84907C' '84907A' '22751' '22382' '22469' '22384'
 '{"description": ["RED WOOLLY HOTTIE WHITE HEART.: details"]}'
 'RED  HARMONICA IN BOX' 'STRIPED CHARLIE+LOLA CHARLOTTE BAG'
 'JAM MAKING SET WITH JARS' 'SET 2 TEA TOWELS I LOVE LONDON' nan
 'HAND WARMER RED POLKA DOT' 'BOX OF 6 ASSORTED COLOUR TEASPOONS'
 'RED HARMONICA IN BOX' 'LOVE BUILDING BLOCK WORD']


In [181]:
# Remove the trailing letter from the stock codes
#retail2_df['StockCode'] = retail2_df['StockCode'].str.replace(r'([0-9]+)[A-Za-z]$', r'\1', regex=True)

# Show the unique values after cleaning
#print(retail2_df['StockCode'].unique())

In [182]:
# Identify the rows where StockCode contains a description instead of a code
# (any letter triggers the match, so alphanumeric codes such as '84406B' are flagged too)
mask = retail2_df['StockCode'].str.contains(r'[a-zA-Z]', na=False)
# Move the 'StockCode' data into 'Description' where needed
retail2_df.loc[mask, 'Description'] = retail2_df.loc[mask, 'StockCode']
retail2_df.loc[mask, 'StockCode'] = np.nan
print(retail2_df['StockCode'].unique())

['22752' '71053' nan '21730' '22632' '22633' '21755' '22745' '22749'
 '22623' '84969' '21754' '22622' '84879' '22310' '22748' '21777' '84988'
 '21756' '21724' '21723' '22111' '21914' '21731' '22960' '22139' '22112'
 '22634' '22725' '22624' '22570' '22571' '22754' '22386' '22751' '22382'
 '22469' '22384']
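Codes such as '84406B' and '85123A' no longer appear in the unique values, because the `[a-zA-Z]` mask matches any value that contains a letter. A stricter test is sketched below on made-up values. It assumes a valid code is a run of digits optionally followed by one letter, which is inferred from the codes shown in this notebook rather than from any specification:

```python
import pandas as pd

# Hypothetical sample mixing valid codes, misplaced descriptions, and a null.
codes = pd.Series([
    '22752',
    '84406B',
    'RED  HARMONICA IN BOX',
    None,
    '{"description": ["RED WOOLLY HOTTIE WHITE HEART.: details"]}',
])

# fullmatch requires the whole string to fit the pattern; nulls count as "not a code".
is_code = codes.str.fullmatch(r'\d+[A-Za-z]?', na=False)

# Flag only non-null values that are not valid codes.
misplaced = codes.notna() & ~is_code
print(codes[misplaced].tolist())
```

Using `misplaced` as the mask would move only the real descriptions and leave the alphanumeric codes in place.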


In [183]:
# Count the null values in the 'StockCode' column
nulos_stockcode = retail2_df['StockCode'].isnull().sum()

print(f"Number of empty values in the 'StockCode' column: {nulos_stockcode}")

Number of empty values in the 'StockCode' column: 64


In [184]:
# Build a dictionary mapping each description to its stock code
description_to_stockcode = retail2_df.dropna(subset=['StockCode']).drop_duplicates(subset=['Description']).set_index('Description')['StockCode'].to_dict()

# Fill the missing values in 'StockCode'
retail2_df['StockCode'] = retail2_df.apply(lambda row: description_to_stockcode.get(row['Description'], row['StockCode']), axis=1)

Explanation of the code:

- Build a dictionary from descriptions to stock codes:
  - dropna(subset=['StockCode']) drops the rows where 'StockCode' is NaN.
  - drop_duplicates(subset=['Description']) drops duplicates based on the 'Description' column.
  - set_index('Description')['StockCode'].to_dict() builds a dictionary whose keys are the descriptions and whose values are the stock codes.
- Fill the missing 'StockCode' values using the dictionary:
  - apply(lambda row: description_to_stockcode.get(row['Description'], row['StockCode']), axis=1) applies a lambda function to every row of the DataFrame. The function looks up the row's 'Description' in the dictionary and keeps the existing 'StockCode' when the description is not found.
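The same lookup can be done without a row-wise `apply`. The sketch below, on a made-up frame, keeps the lookup as a Series and combines `map` with `fillna`, so only the rows whose 'StockCode' is missing are touched:

```python
import pandas as pd

# Hypothetical frame: two rows lack a stock code, one of them has no known match.
df = pd.DataFrame({
    'StockCode': ['22632', None, '22633', None],
    'Description': ['HAND WARMER RED POLKA DOT', 'HAND WARMER RED POLKA DOT',
                    'HAND WARMER UNION JACK', 'UNKNOWN ITEM'],
})

# Description -> StockCode lookup built from the rows that do have a code.
lookup = (df.dropna(subset=['StockCode'])
            .drop_duplicates(subset=['Description'])
            .set_index('Description')['StockCode'])

# map() finds a candidate code per row; fillna() uses it only where the code is missing.
df['StockCode'] = df['StockCode'].fillna(df['Description'].map(lookup))
print(df)
```

Descriptions that never appear alongside a code, like 'UNKNOWN ITEM' here, stay null and have to be resolved some other way.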

In [185]:
# Count the null values in the 'StockCode' column
nulos_stockcode = retail2_df['StockCode'].isnull().sum()

print(f"Number of empty values in the 'StockCode' column: {nulos_stockcode}")

Number of empty values in the 'StockCode' column: 56


In [186]:
# Keep only the 'StockCode' and 'Description' columns
filtered_df = retail2_df[['StockCode', 'Description']]

# Select the rows with empty values in 'StockCode'
missing_values = filtered_df[filtered_df['StockCode'].isnull()]

# Show the filtered rows
print(missing_values)

    StockCode                                        Description
313       NaN                                             84406B
307       NaN                                             85123A
38        NaN                                             84029E
47        NaN                                             84029G
383       NaN                                             85123A
350       NaN                                             85099B
418       NaN                                             84907B
121       NaN                                             84907B
15        NaN                                             84907C
386       NaN                                             84907A
320       NaN                                             85123A
332       NaN                                             84406B
75        NaN                                             84029G
314       NaN  {"description": ["RED WOOLLY HOTTIE WHITE HEAR...
298       NaN            

In [187]:
# Dictionary with the remaining description and its corresponding stock code
description_to_stockcode = {
    'RED WOOLLY HOTTIE WHITE HEART': 84029
}
# Fill the missing values in 'StockCode'
retail2_df['StockCode'] = retail2_df.apply(lambda row: description_to_stockcode.get(row['Description'], row['StockCode']), axis=1)


In [188]:
# Count the null values in the 'StockCode' column
nulos_stockcode = retail2_df['StockCode'].isnull().sum()

print(f"Number of empty values in the 'StockCode' column: {nulos_stockcode}")

Number of empty values in the 'StockCode' column: 56


# Data cleaning: Quantity column
- Print the unique values
- Change the column from object to int
- Where the value is 'unknown', use 6: the corresponding description is GLASS STAR FROSTED T-LIGHT HOLDER, and all its other rows have that same quantity

In [189]:
# Print the unique values
retail2_df=retail2_df.sort_values(by='Quantity')
print(retail2_df['Quantity'].unique())

['-20' '-3' '-6' '-60' '10' '100' '12' '15' '2' '20' '200' '3' '32' '4'
 '5' '6' '60' '8' 'unknown' nan]


In [190]:
# Replace 'unknown' with a numeric value, for example 6
retail2_df['Quantity'] = retail2_df['Quantity'].replace('unknown', 6)
retail2_df = retail2_df.sort_values(by='Quantity')
print(retail2_df['Quantity'].unique())

TypeError: '<' not supported between instances of 'int' and 'str'
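The TypeError arises because the column now mixes the integer 6 with strings such as '-20', and sorting has to compare them. One way around it, sketched on made-up values, is to replace 'unknown' with the string '6' and then convert the whole column with `pd.to_numeric`. The nullable `Int64` dtype is a choice made for this sketch so that the existing null survives the conversion:

```python
import pandas as pd

# Hypothetical sample standing in for retail2_df['Quantity'].
quantity = pd.Series(['-20', '6', 'unknown', None, '100'])

# Keep every value a string until the whole column is converted at once.
quantity = quantity.replace('unknown', '6')

# Convert to numbers, then to nullable integers so the null stays null.
quantity = pd.to_numeric(quantity).astype('Int64')

# Sorting works now that all values share one numeric type.
print(quantity.sort_values().tolist())
```

Negative quantities are left as they are here. Whether they should be kept, for example as returns, is a separate decision.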