We load the dataset into a pandas DataFrame.

In [2]:
import pandas as pd
import os
import time
file = "Online Retail.xlsx"
url = "https://github.com/Alf-caput/LAB02_ReglasYPatronesSecuenciales/raw/main/Online%20Retail.xlsx"
if os.path.exists(file):
    target = file
else:
    target = url
df = pd.read_excel(target)

df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom


With **.info** we can see the number of rows and columns and the stored types.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   541909 non-null  int64         
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(2), object(4)
memory usage: 33.1+ MB


We see 541909 entries and 8 columns.  
According to the documentation on Kaggle:  
The dataset contains the transactions that took place between 01/12/2010 and 09/12/2011 at a UK-based company that mainly sells all-occasion gifts; the documentation also notes that many of its customers are wholesalers.  
Columns of the dataset:
- **InvoiceNo:**  
    - Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
- **StockCode:**
    - Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
- **Description:**
    - Product (item) name. Nominal.
- **Quantity:** 
    - The quantities of each product (item) per transaction. Numeric.
- **InvoiceDate:**
    - Invoice date and time. Numeric. The day and time when a transaction was generated.
- **UnitPrice:**
    - Unit price. Numeric. Product price per unit in sterling (£).
- **CustomerID:**
    - Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
- **Country:**
    - Country name. Nominal. The name of the country where a customer resides.

Looking at the types reported by **.info**, pandas has stored Quantity, UnitPrice and CustomerID as numeric types, while the rest, except InvoiceDate (a datetime type), are of type object, which is the default dtype pandas uses for strings.

With **.isna** we can see how many NA values the dataframe has, both in total and per column.  
(This can also be deduced from **.info**, but it is less intuitive.)

In [4]:
print(f'Total number of NA values in the dataframe: {(col_na:=df.isna().sum()).sum()}')
pd.DataFrame({'NA values': col_na})

Total number of NA values in the dataframe: 1454


Unnamed: 0,NA values
InvoiceNo,0
StockCode,0
Description,1454
Quantity,0
InvoiceDate,0
UnitPrice,0
CustomerID,0
Country,0
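
As a small illustration of how `.isna().sum()` yields the per-column and total counts, here is a minimal sketch on a made-up frame. The `toy` frame and its values are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only to illustrate the calls used above.
toy = pd.DataFrame({
    "StockCode": ["A1", "B2", "C3"],
    "Description": ["LAMP", np.nan, np.nan],
})

na_per_column = toy.isna().sum()      # Series: one NA count per column
total_na = int(na_per_column.sum())   # scalar: NA count over the whole frame

print(na_per_column.to_dict())
print(total_na)
```

Summing twice collapses first over rows and then over columns, which is why the notebook cell can report both figures from a single `isna()` call.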


We find few NA values, which we could fill in if we know their StockCode; however, we explore the dataset a bit further before deciding what to do about them.  
With **.describe** we can see the range of values taken by the different columns (we drop the "count" statistic and transpose for readability).

In [8]:
(
    df
    .describe()
    .drop('count')
    .T
)

Unnamed: 0,mean,min,25%,50%,75%,max,std
Quantity,9.55225,-80995.0,1.0,3.0,10.0,80995.0,218.081158
InvoiceDate,2011-07-04 13:34:57.156386048,2010-12-01 08:26:00,2011-03-28 11:34:00,2011-07-19 17:17:00,2011-10-19 11:27:00,2011-12-09 12:50:00,
UnitPrice,4.611114,-11062.06,1.25,2.08,4.13,38970.0,96.759853
CustomerID,15287.518434,12346.0,14367.0,15287.0,16255.0,18287.0,1484.746041


Some values are not useful for the problem we want to solve: there are negative quantities and unit prices ("Quantity", "UnitPrice"), which may be due to returns or some other kind of adjustment to the product.  

In [35]:
df_negs = (
    df
    .loc[(df.loc[:, 'Quantity'] <= 0) | (df.loc[:, 'UnitPrice'] <= 0), :]
)
print(f"Entries with a negative (or zero) quantity or unit price: {df_negs.shape[0]}")

negs_ratio = round(len(df_negs)/len(df), 2)
print(f"Ratio: {negs_ratio}")

df_negs.head()

Entries with a negative (or zero) quantity or unit price: 11805
Ratio: 0.02


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
141,C536379,D,Discount,-1,2010-12-01 09:41:00,27.5,14527,United Kingdom
154,C536383,35004C,SET OF 3 COLOURED FLYING DUCKS,-1,2010-12-01 09:49:00,4.65,15311,United Kingdom
235,C536391,22556,PLASTERS IN TIN CIRCUS PARADE,-12,2010-12-01 10:24:00,1.65,17548,United Kingdom
236,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548,United Kingdom
237,C536391,21983,PACK OF 12 BLUE PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548,United Kingdom
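
The share of affected rows can also be read directly from the boolean mask, since the mean of a boolean Series is the fraction of `True` values. The following is a minimal sketch on hypothetical rows (the `toy` frame is an assumption for illustration, not data from the notebook).

```python
import pandas as pd

# Hypothetical toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "Quantity": [6, -1, 8, 0],
    "UnitPrice": [2.55, 4.65, 0.0, 1.0],
})

# True where the quantity or the unit price is zero or negative.
mask = (toy["Quantity"] <= 0) | (toy["UnitPrice"] <= 0)

bad_share = mask.mean()      # fraction of True values in the mask
toy_clean = toy.loc[~mask]   # rows with strictly positive quantity and price

print(round(bad_share, 2), len(toy_clean))
```

Keeping the mask in a variable makes it possible to reuse it both for the ratio and for selecting the rows to keep.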


That is more than 11,000 entries, about 2% of the dataset.  
Since they are of no interest to us, we can remove them.  

In [36]:
df_std = df.drop(df_negs.index)
df_std.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom


Let us now look at each variable in a bit more detail.  
If InvoiceNo starts with "c", it is a cancellation:

In [60]:
df_std.loc[:, 'Country'].unique()

array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Japan', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Finland',
       'Austria', 'Bahrain', 'Israel', 'Greece', 'Hong Kong', 'Singapore',
       'Lebanon', 'United Arab Emirates', 'Saudi Arabia',
       'Czech Republic', 'Canada', 'Unspecified', 'Brazil', 'USA',
       'European Community', 'Malta', 'RSA'], dtype=object)

(13, 0)

In [57]:
no_unique = df_std.loc[:, 'InvoiceNo'].unique().astype(str)
for no in no_unique:
    try:
        int(no)
    except ValueError:
        # int() fails when the invoice number contains a non-digit character
        print('Letter here')
no_unique

Letter here


array(['536365', '536366', '536367', ..., '581585', '581586', '581587'],
      dtype='<U7')

In [52]:
import numpy as np
import re

# no_unique is the numpy array of invoice numbers (as strings) built above

# Regular expression matching strings that start with "c" (case-insensitive)
pattern = re.compile('^c', flags=re.IGNORECASE)

# Boolean mask indicating whether each element starts with "c"
starts_with_c_mask = np.array([bool(pattern.match(element)) for element in no_unique])

# Use the mask to keep only the elements starting with "c"
strings_starting_with_c = no_unique[starts_with_c_mask]

print(strings_starting_with_c)


[]
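
The same checks can be expressed without an explicit Python loop by using the pandas string accessor. This is only a sketch under the assumption that the invoice numbers are available as a Series of strings; the `invoices` values below are partly hypothetical (`c000001` and `A000001` are made up).

```python
import pandas as pd

# Example invoice numbers; the last two are hypothetical.
invoices = pd.Series(["536365", "C536379", "c000001", "A000001"])

# str.match anchors at the start of each string; case=False ignores case.
is_cancellation = invoices.str.match(r"c", case=False)

# Invoice numbers that are not made of digits only.
has_letter = ~invoices.str.isdigit()

print(invoices[is_cancellation].tolist())
print(invoices[has_letter].tolist())
```

On a frame where the rows with non-positive quantities were already dropped, an empty result for the cancellation mask would be expected, because cancelled invoices carry negative quantities in the rows shown earlier.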


In [4]:
df_unique_desc = (
    df
    .loc[:, 'Description']
    .drop_duplicates()
)
print(len(df_unique_desc))
df_unique_desc.head()

4224


0     WHITE HANGING HEART T-LIGHT HOLDER
1                    WHITE METAL LANTERN
2         CREAM CUPID HEARTS COAT HANGER
3    KNITTED UNION FLAG HOT WATER BOTTLE
4         RED WOOLLY HOTTIE WHITE HEART.
Name: Description, dtype: object

In [5]:
import re
a = 'hola123mundo45.6'
num = re.findall(r'\d+', a)
print(num)

['123', '45', '6']
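
Note that `\d+` splits `45.6` into two separate matches. If decimal numbers should be kept together, one possible pattern (an assumption about the intent, not something the notebook requires) allows an optional fractional part:

```python
import re

text = 'hola123mundo45.6'

# A run of digits, optionally followed by a dot and more digits.
decimal_pattern = r'\d+(?:\.\d+)?'

numbers = re.findall(decimal_pattern, text)
print(numbers)  # ['123', '45.6']
```

The non-capturing group `(?:...)` matters here: with a capturing group, `re.findall` would return only the captured fraction instead of the whole number.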


In [6]:
import pandas as pd
import re

# Series with the product descriptions as strings
serie = df.loc[:, 'Description'].astype(str)

# Return the total number of digits that appear in a description
def num_in_desc_len(desc):
    nums = re.findall(r'\d+', desc)  # Find every run of digits in the string
    len_num = sum(len(num) for num in nums)
    return len_num

# Apply the function to each element of the series
longitudes_numeros = serie.apply(num_in_desc_len)

# Build a DataFrame with the original series and the digit counts
df_nums = pd.DataFrame({'serie': serie, 'longitudes_numeros': longitudes_numeros})

# Print the resulting DataFrame
print(df_nums)


                                      serie  longitudes_numeros
0        WHITE HANGING HEART T-LIGHT HOLDER                   0
1                       WHITE METAL LANTERN                   0
2            CREAM CUPID HEARTS COAT HANGER                   0
3       KNITTED UNION FLAG HOT WATER BOTTLE                   0
4            RED WOOLLY HOTTIE WHITE HEART.                   0
...                                     ...                 ...
541904          PACK OF 20 SPACEBOY NAPKINS                   2
541905         CHILDREN'S APRON DOLLY GIRL                    0
541906        CHILDRENS CUTLERY DOLLY GIRL                    0
541907      CHILDRENS CUTLERY CIRCUS PARADE                   0
541908        BAKING SET 9 PIECE RETROSPOT                    1

[541909 rows x 2 columns]
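
The total number of digits per description can also be obtained with the vectorized `Series.str.count`, which avoids the Python-level `apply`. This is a minimal sketch on four descriptions that appear in the outputs of this notebook; the variable names are hypothetical.

```python
import pandas as pd

descriptions = pd.Series([
    "PACK OF 20 SPACEBOY NAPKINS",
    "BAKING SET 9 PIECE RETROSPOT",
    "WHITE METAL LANTERN",
    "6 GIFT TAGS 50'S CHRISTMAS",
])

# Count the matches of a single digit in each string.
digit_counts = descriptions.str.count(r"\d")

print(digit_counts.tolist())  # [2, 1, 0, 3]
```

Counting single digits gives the same totals as summing the lengths of all `\d+` runs, because every digit belongs to exactly one run.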


In [7]:
print(df_nums.loc[:, 'longitudes_numeros'].unique())
num_digits = 3
df_filtered = (
    df
    .loc[
        (df_nums.loc[:, 'longitudes_numeros'] == num_digits)
        & (df.loc[:, 'Quantity'] > 0)
        & (df.loc[:, 'UnitPrice'] > 0),
        :,
    ]
)
df_filtered.tail(), len(df_filtered)

[0 1 2 3 4 5 8 6 7 9]


(       InvoiceNo StockCode                  Description  Quantity  \
 541496    581498     23354  6 GIFT TAGS 50'S CHRISTMAS          9   
 541612    581514     21705      BAG 500g SWIRLY MARBLES        84   
 541615    581516     21705      BAG 500g SWIRLY MARBLES        24   
 541690    581538     23319  BOX OF 6 MINI 50'S CRACKERS         1   
 541815    581579     23319  BOX OF 6 MINI 50'S CRACKERS        12   
 
                InvoiceDate  UnitPrice  CustomerID         Country  
 541496 2011-12-09 10:26:00       1.63       15287  United Kingdom  
 541612 2011-12-09 11:20:00       0.39       17754  United Kingdom  
 541615 2011-12-09 11:26:00       0.39       14422  United Kingdom  
 541690 2011-12-09 11:34:00       2.49       14446  United Kingdom  
 541815 2011-12-09 12:19:00       2.49       17581  United Kingdom  ,
 1973)

In [8]:
import pandas as pd
import re

# Series with the product descriptions as strings
serie = df.loc[:, 'Description'].astype(str)

# Return the length of each number found in a string
def encontrar_longitud_numeros(cadena):
    numeros_encontrados = re.findall(r'\d+', cadena)  # Find every run of digits in the string
    longitudes = [len(numero) for numero in numeros_encontrados]  # Length of each number
    return longitudes

# Apply the function to each element of the series
longitudes_numeros = serie.apply(encontrar_longitud_numeros)

# Build a DataFrame with the original series and the lengths of the numbers found.
# It gets its own name so that the original `df` is not overwritten.
df_num_lens = pd.DataFrame({'serie': serie, 'longitudes_numeros': longitudes_numeros})

# Print the resulting DataFrame
print(df_num_lens)


                                      serie longitudes_numeros
0        WHITE HANGING HEART T-LIGHT HOLDER                 []
1                       WHITE METAL LANTERN                 []
2            CREAM CUPID HEARTS COAT HANGER                 []
3       KNITTED UNION FLAG HOT WATER BOTTLE                 []
4            RED WOOLLY HOTTIE WHITE HEART.                 []
...                                     ...                ...
541904          PACK OF 20 SPACEBOY NAPKINS                [2]
541905         CHILDREN'S APRON DOLLY GIRL                  []
541906        CHILDRENS CUTLERY DOLLY GIRL                  []
541907      CHILDRENS CUTLERY CIRCUS PARADE                 []
541908        BAKING SET 9 PIECE RETROSPOT                 [1]

[541909 rows x 2 columns]


In [9]:
df_unique_desc = (
    df
    .loc[:, 'Description']
    .dropna()
    .drop_duplicates()
)
len(df_unique_desc)

4223

In [None]:
df_unique_desc.head()

0     WHITE HANGING HEART T-LIGHT HOLDER
1                    WHITE METAL LANTERN
2         CREAM CUPID HEARTS COAT HANGER
3    KNITTED UNION FLAG HOT WATER BOTTLE
4         RED WOOLLY HOTTIE WHITE HEART.
Name: Description, dtype: object

In [None]:
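df_unique_desc = df_unique_desc.astype(str)
df_unique_desc.head()

0     WHITE HANGING HEART T-LIGHT HOLDER
1                    WHITE METAL LANTERN
2         CREAM CUPID HEARTS COAT HANGER
3    KNITTED UNION FLAG HOT WATER BOTTLE
4         RED WOOLLY HOTTIE WHITE HEART.
Name: Description, dtype: object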

In [None]:
# na=False treats missing descriptions as non-matching, so the mask stays boolean
# and `~` does not fail on NaN values
df_unique_desc[~df_unique_desc.str.contains('[a-z]', na=False) & df_unique_desc.str.contains(r'\d', na=False)]

In [None]:
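# str.extract with expand=False returns a Series with the first number found
# (NaN when there is none); we store it next to the description in a new DataFrame
df_desc_nums = df_unique_desc.to_frame()
df_desc_nums['Numero_Extraido'] = df_unique_desc.str.extract(r'(\d+)', expand=False)
df_desc_nums.head()

Unnamed: 0,Description,Numero_Extraido
0,WHITE HANGING HEART T-LIGHT HOLDER,
1,WHITE METAL LANTERN,
2,CREAM CUPID HEARTS COAT HANGER,
3,KNITTED UNION FLAG HOT WATER BOTTLE,
4,RED WOOLLY HOTTIE WHITE HEART.,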

We could fill in the missing descriptions from their StockCode.

In [None]:
df_unique_desc_code = (
    df
    .loc[:, ['StockCode', 'Description']]
    .dropna()
    .drop_duplicates()
)
len(df_unique_desc_code)

4792

In [None]:
count_by_name = df.groupby('Description')['StockCode'].nunique()
count_by_name.sort_values(inplace=True)

In [None]:
df.loc[df.loc[:, 'Description'] == 'wrongly marked 23343', :].head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
415582,572547,20713,wrongly marked 23343,200,2011-10-24 17:01:00,0.0,15287,United Kingdom


In [None]:
df.loc[df.loc[:, 'StockCode'] == 20713, :].head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
522,536409,20713,JUMBO BAG OWLS,1,2010-12-01 11:45:00,1.95,17908,United Kingdom
1117,536527,20713,JUMBO BAG OWLS,10,2010-12-01 13:04:00,1.95,12662,Germany
1439,536542,20713,JUMBO BAG OWLS,30,2010-12-01 14:11:00,1.95,16456,United Kingdom
6381,536938,20713,JUMBO BAG OWLS,20,2010-12-03 12:05:00,1.95,14680,United Kingdom
7788,537054,20713,JUMBO BAG OWLS,3,2010-12-05 11:40:00,1.95,16931,United Kingdom


In [None]:
count_by_name, count_by_name.index, count_by_name.values

(Description
 20713                           1
 POSSIBLE DAMAGES OR LOST?       1
 POSTAGE                         1
 POSTE FRANCE CUSHION COVER      1
 POSY CANDY BAG                  1
                              ... 
 found                          25
 damages                        43
 damaged                        43
 ?                              47
 check                         146
 Name: StockCode, Length: 4223, dtype: int64,
 Index([                             20713,        'POSSIBLE DAMAGES OR LOST?',
                                 'POSTAGE',       'POSTE FRANCE CUSHION COVER',
                          'POSY CANDY BAG', 'POTTERING IN THE SHED METAL SIGN',
                           'POTTERING MUG',   'POTTING SHED CANDLE CITRONELLA',
                'POTTING SHED ROSE CANDLE',      'POTTING SHED SEED ENVELOPES',
        ...
                             'thrown away',           'Unsaleable, destroyed.',
                                 'Damaged',                     

In [None]:
df_unique_desc = (
    df
    .loc[:, 'Description']
    .dropna()
    .drop_duplicates()
)
df_unique_stockcode = (
    df
    .loc[:, 'StockCode']
    .dropna()
    .drop_duplicates()
)
df_unique_desc_code = (
    df
    .loc[:, ['Description', 'StockCode']]
    .dropna()
    .drop_duplicates()
)
len(df_unique_desc), len(df_unique_stockcode), len(df_unique_desc_code)

(4223, 4070, 4792)

In [None]:
df_unique_desc.head()

0     WHITE HANGING HEART T-LIGHT HOLDER
1                    WHITE METAL LANTERN
2         CREAM CUPID HEARTS COAT HANGER
3    KNITTED UNION FLAG HOT WATER BOTTLE
4         RED WOOLLY HOTTIE WHITE HEART.
Name: Description, dtype: object

In [None]:
df.loc[df.loc[:, 'Description'] == "WHITE HANGING HEART T-LIGHT HOLDER", :].head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom
49,536373,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 09:02:00,2.55,17850,United Kingdom
66,536375,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 09:32:00,2.55,17850,United Kingdom
220,536390,85123A,WHITE HANGING HEART T-LIGHT HOLDER,64,2010-12-01 10:19:00,2.55,17511,United Kingdom
262,536394,85123A,WHITE HANGING HEART T-LIGHT HOLDER,32,2010-12-01 10:39:00,2.55,13408,United Kingdom


In [None]:
(
    df
    .describe()
    .drop('count')
    .T
)

Unnamed: 0,mean,min,25%,50%,75%,max,std
Quantity,9.55225,-80995.0,1.0,3.0,10.0,80995.0,218.081158
InvoiceDate,2011-07-04 13:34:57.156386048,2010-12-01 08:26:00,2011-03-28 11:34:00,2011-07-19 17:17:00,2011-10-19 11:27:00,2011-12-09 12:50:00,
UnitPrice,4.611114,-11062.06,1.25,2.08,4.13,38970.0,96.759853
CustomerID,15287.518434,12346.0,14367.0,15287.0,16255.0,18287.0,1484.746041


Note that there are negative values in both the quantity (Quantity) and the unit price (UnitPrice).  
Let us see how many entries have this problem before deciding what to do.

In [None]:
df_negs = df.loc[(df.loc[:, 'Quantity'] <= 0) | (df.loc[:, 'UnitPrice'] <= 0), :]
df_negs.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
141,C536379,D,Discount,-1,2010-12-01 09:41:00,27.5,14527,United Kingdom
154,C536383,35004C,SET OF 3 COLOURED FLYING DUCKS,-1,2010-12-01 09:49:00,4.65,15311,United Kingdom
235,C536391,22556,PLASTERS IN TIN CIRCUS PARADE,-12,2010-12-01 10:24:00,1.65,17548,United Kingdom
236,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548,United Kingdom
237,C536391,21983,PACK OF 12 BLUE PAISLEY TISSUES,-24,2010-12-01 10:24:00,0.29,17548,United Kingdom


In [None]:
len(df_negs)

11805

There are more than 11,000 entries with a negative or zero value in at least one of the "Quantity" and "UnitPrice" columns.  

In [None]:
negs_ratio = round(len(df_negs)/len(df), 2)
print(negs_ratio)

0.02


Only 2% of the rows contain such values; we consider this a small part of the data and therefore remove them from the dataset.

In [None]:
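# Drop the offending rows by index; concatenating and calling
# drop_duplicates(keep=False) would also discard legitimate repeated rows
df_std = df.drop(df_negs.index)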
df_std.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom


We check that the transformation was applied correctly.

In [None]:
df_negs = df_std.loc[(df_std.loc[:, 'Quantity'] <= 0) | (df_std.loc[:, 'UnitPrice'] <= 0), :]
len(df_negs)

0

In [None]:
df.loc[df.loc[:, 'Description'].isna(), :].head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
622,536414,22139,,56,2010-12-01 11:52:00,0.0,15287,United Kingdom
1970,536545,21134,,1,2010-12-01 14:32:00,0.0,15287,United Kingdom
1971,536546,22145,,1,2010-12-01 14:33:00,0.0,15287,United Kingdom
1972,536547,37509,,1,2010-12-01 14:33:00,0.0,15287,United Kingdom
1987,536549,85226A,,1,2010-12-01 14:34:00,0.0,15287,United Kingdom
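
Rows like the ones above have a StockCode but no Description. One way to fill them, sketched here under the assumption that the most frequent description of a code is the right one, is to build a code-to-description lookup and map it onto the missing values. The `toy` frame and its item names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the item names are made up.
toy = pd.DataFrame({
    "StockCode": ["22139", "22139", "21134", "22139"],
    "Description": ["TOY ITEM A", np.nan, np.nan, "TOY ITEM A"],
})

# Most frequent non-null description per StockCode (an assumed rule).
lookup = (
    toy.dropna(subset=["Description"])
       .groupby("StockCode")["Description"]
       .agg(lambda s: s.mode().iloc[0])
)

# Fill only the missing descriptions; codes without any known
# description stay NaN.
filled = toy["Description"].fillna(toy["StockCode"].map(lookup))

print(filled.tolist())
```

Codes that never appear with a description cannot be recovered this way, so some NA values may remain after the fill.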


In [None]:
df.loc[:, 'StockCode'].nunique(), df.loc[:, 'Description'].nunique()

(4070, 4223)
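
Since the number of unique codes and the number of unique descriptions differ, some codes must carry more than one description. A minimal sketch for locating such codes follows; the `toy` frame mixes the `20713` rows shown earlier with a hypothetical `TOY ITEM` row.

```python
import pandas as pd

toy = pd.DataFrame({
    "StockCode": ["20713", "20713", "22139", "22139"],
    "Description": ["JUMBO BAG OWLS", "wrongly marked 23343",
                    "TOY ITEM", "TOY ITEM"],  # "TOY ITEM" is hypothetical
})

# Number of distinct descriptions seen for each code.
desc_per_code = toy.groupby("StockCode")["Description"].nunique()

# Codes that appear with more than one description.
ambiguous_codes = desc_per_code[desc_per_code > 1].index.tolist()

print(ambiguous_codes)  # ['20713']
```

Inspecting the rows of the codes returned this way is one option for separating genuine product names from warehouse notes.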

In [None]:
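# keep=False drops every row that has a duplicate, so only the pairs that occur once remain
num_uniques = len(df.loc[:, ['StockCode', 'Description']].drop_duplicates(keep=False))
print(f'Number of (StockCode, Description) pairs that appear exactly once: {num_uniques}')

Number of (StockCode, Description) pairs that appear exactly once: 1429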


In [None]:
import pandas as pd

# Small example DataFrame with repeated rows
df1 = pd.DataFrame({
    'columna1': ['a', 'c', 'a', 'c', 'c'],
    'columna2': ['x', 'z', 'x', 'z', 'z']
})

# Drop duplicated rows, keeping the first occurrence of each
df_filas_unicas = df1.drop_duplicates()

print("Original DataFrame:")
print(df1)

print("\nDataFrame with unique rows:")
print(df_filas_unicas)

print("\nDataFrame keep=False:")
print(df1.drop_duplicates(keep=False))


Original DataFrame:
  columna1 columna2
0        a        x
1        c        z
2        a        x
3        c        z
4        c        z

DataFrame with unique rows:
  columna1 columna2
0        a        x
1        c        z

DataFrame keep=False:
Empty DataFrame
Columns: [columna1, columna2]
Index: []
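
The output above shows that `keep=False` removes every row that has a duplicate. For the same reason, removing a subset of rows by concatenating it back and calling `drop_duplicates(keep=False)` is not equivalent to dropping by index when the data contains repeated rows. The sketch below illustrates the difference on hypothetical values.

```python
import pandas as pd

# Hypothetical toy rows; rows 0 and 1 are legitimate repeats.
toy = pd.DataFrame({
    "Quantity": [6, 6, -1, 8],
    "UnitPrice": [2.55, 2.55, 4.65, 2.75],
})
bad = toy.loc[(toy["Quantity"] <= 0) | (toy["UnitPrice"] <= 0)]

# Dropping by index removes only the offending row.
by_index = toy.drop(bad.index)

# concat + keep=False also removes the two identical valid rows.
by_concat = pd.concat([toy, bad]).drop_duplicates(keep=False)

print(len(by_index), len(by_concat))  # 3 1
```

Dropping by index is therefore the safer choice whenever repeated rows are meaningful, as they can be in transaction data.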
