# ChatGPT Data Cleaning Exercises

## 📊 Exercise 1: Handling missing transaction amounts

Instruction:

Fill missing amount values with 0.

Drop rows where customer_id is missing.

Keep only the first 5 rows.

Data:

In [5]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "transaction_id": [101,102,103,104,105,106],
    "customer_id": [1,2,None,2,3,4],
    "amount": [200,np.nan,150,100,np.nan,400]
})

df

   transaction_id  customer_id  amount
0             101          1.0   200.0
1             102          2.0     NaN
2             103          NaN   150.0
3             104          2.0   100.0
4             105          3.0     NaN
5             106          4.0   400.0


0    1.0
1    2.0
3    2.0
4    3.0
5    4.0
Name: customer_id, dtype: float64

Exercise 1 solution

In [6]:
df["amount"] = df["amount"].fillna(0)
df = df.dropna(subset=["customer_id"])
df = df.iloc[:5]  # first 5 rows
df

   transaction_id  customer_id  amount
0             101          1.0   200.0
1             102          2.0     0.0
3             104          2.0   100.0
4             105          3.0     0.0
5             106          4.0   400.0
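
The same three steps can also be written as one method chain that leaves the original frame untouched. This is a minimal sketch on the same sample data; the names `raw` and `cleaned` are illustrative and not part of the exercise.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "transaction_id": [101, 102, 103, 104, 105, 106],
    "customer_id": [1, 2, None, 2, 3, 4],
    "amount": [200, np.nan, 150, 100, np.nan, 400],
})

cleaned = (
    raw
    .fillna({"amount": 0})           # fill only the amount column
    .dropna(subset=["customer_id"])  # drop rows with no customer
    .head(5)                         # keep the first 5 remaining rows
)
print(cleaned)
```

Order matters here: taking the first 5 rows before dropping the row with the missing customer_id would leave only 4 rows.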


## 📊 Exercise 2: Removing duplicate transactions

Instruction:

Remove duplicate transactions based on customer_id and amount.

Keep the first occurrence.

Data:

In [8]:
df = pd.DataFrame({
    "transaction_id": [101,102,103,104,105],
    "customer_id": [1,1,2,2,2],
    "amount": [200,200,100,100,300]
})
df

   transaction_id  customer_id  amount
0             101            1     200
1             102            1     200
2             103            2     100
3             104            2     100
4             105            2     300


In [11]:
# Attempt: fails because the DataFrame method is drop_duplicates (plural), not drop_duplicate
df = df.drop_duplicate(['customer_id', 'amount'], keep="first")
df

AttributeError: 'DataFrame' object has no attribute 'drop_duplicate'

Exercise 2 solution

In [12]:
df_clean = df.drop_duplicates(subset=["customer_id","amount"], keep="first")
df_clean

   transaction_id  customer_id  amount
0             101            1     200
2             103            2     100
4             105            2     300
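
Before dropping anything, it can help to see which rows count as duplicates. The sketch below uses `DataFrame.duplicated` with the same `subset` and `keep` arguments; the names `dup_mask` and `deduped` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "transaction_id": [101, 102, 103, 104, 105],
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [200, 200, 100, 100, 300],
})

# True for every repeat after the first occurrence of a (customer_id, amount) pair
dup_mask = df.duplicated(subset=["customer_id", "amount"], keep="first")

print(df[dup_mask])      # the rows drop_duplicates would remove
deduped = df[~dup_mask]  # the rows drop_duplicates would keep
print(deduped)
```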


## 📊 Exercise 3: Selecting and filtering rows and columns

Instruction:

Keep only rows where amount > 150.

Select only transaction_id and amount columns.

Data:

In [13]:
df = pd.DataFrame({
    "transaction_id": [101,102,103,104],
    "customer_id": [1,2,3,4],
    "amount": [200,100,250,150]
})

df

   transaction_id  customer_id  amount
0             101            1     200
1             102            2     100
2             103            3     250
3             104            4     150


In [17]:
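# Attempt: `and` between a non-empty list and a Series simply returns the Series,
# so this only filters rows and keeps every column
df = df[['transaction_id', 'amount'] and df['amount'] > 150]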
df

   transaction_id  customer_id  amount
0             101            1     200
2             103            3     250


Exercise 3 solution

In [19]:
df_filtered = df.loc[df["amount"] > 150, ["transaction_id","amount"]]
df_filtered


   transaction_id  amount
0             101     200
2             103     250
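
When a filter needs more than one condition, pandas expects the element-wise operator `&` rather than Python's `and`. The sketch below adds a second condition (`customer_id > 1`) purely for illustration; it is not part of the exercise.

```python
import pandas as pd

df = pd.DataFrame({
    "transaction_id": [101, 102, 103, 104],
    "customer_id": [1, 2, 3, 4],
    "amount": [200, 100, 250, 150],
})

# Element-wise AND needs `&`, with each comparison in parentheses
mask = (df["amount"] > 150) & (df["customer_id"] > 1)
result = df.loc[mask, ["transaction_id", "amount"]]
print(result)

# Python's `and` asks for the truth value of a whole Series,
# which pandas refuses to guess
try:
    (df["amount"] > 150) and (df["customer_id"] > 1)
except ValueError as exc:
    print("ValueError:", exc)
```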


## 📊 Exercise 4: Date handling and conversion

Instruction:

Convert transaction_date from string to datetime.

Create a new column month extracting the transaction month.

Filter for transactions in January 2025.

Data:

In [20]:
df = pd.DataFrame({
    "transaction_id": [101,102,103],
    "transaction_date": ["2025-01-05","2025-02-10","2025-01-20"],
    "amount": [200,150,300]
})

df

   transaction_id  transaction_date  amount
0             101        2025-01-05     200
1             102        2025-02-10     150
2             103        2025-01-20     300


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   transaction_id    3 non-null      int64 
 1   transaction_date  3 non-null      object
 2   amount            3 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 204.0+ bytes


In [23]:
df['transaction_date'] = pd.to_datetime(df['transaction_date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   transaction_id    3 non-null      int64         
 1   transaction_date  3 non-null      datetime64[ns]
 2   amount            3 non-null      int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 204.0 bytes
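
Real data often contains strings that are not valid dates. A minimal sketch, assuming the dates follow the `YYYY-MM-DD` layout used above: passing `format` makes the expected layout explicit, and `errors="coerce"` turns unparseable values into `NaT` instead of raising. The `"not a date"` entry is made up for illustration.

```python
import pandas as pd

dates = pd.Series(["2025-01-05", "2025-02-10", "not a date"])

# Invalid strings become NaT rather than raising an exception
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
print(parsed)
print(parsed.isna().sum(), "value(s) could not be parsed")
```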


In [26]:
df['month'] = df['transaction_date'].dt.month
df

   transaction_id  transaction_date  amount  month
0             101        2025-01-05     200      1
1             102        2025-02-10     150      2
2             103        2025-01-20     300      1


In [29]:
df = df.loc[df['month'] == 1]
df

   transaction_id  transaction_date  amount  month
0             101        2025-01-05     200      1
2             103        2025-01-20     300      1


Exercise 4 solution

In [31]:
df["transaction_date"] = pd.to_datetime(df["transaction_date"])
df["month"] = df["transaction_date"].dt.month
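# Check the year as well as the month so that only January 2025 matches
df_jan = df.loc[(df["transaction_date"].dt.year == 2025) & (df["month"] == 1)]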
df_jan


   transaction_id  transaction_date  amount  month
0             101        2025-01-05     200      1
2             103        2025-01-20     300      1
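
Filtering on the month number alone would also match January of any other year. One way to pin down a specific month is a half-open date range. In this sketch, transaction 104 (dated January 2024) is a made-up row added only to show the difference.

```python
import pandas as pd

df = pd.DataFrame({
    "transaction_id": [101, 102, 103, 104],
    "transaction_date": ["2025-01-05", "2025-02-10", "2025-01-20", "2024-01-15"],
    "amount": [200, 150, 300, 50],
})
df["transaction_date"] = pd.to_datetime(df["transaction_date"])

# Half-open interval: on or after 1 January 2025, strictly before 1 February 2025
start = pd.Timestamp("2025-01-01")
end = pd.Timestamp("2025-02-01")
in_jan_2025 = (df["transaction_date"] >= start) & (df["transaction_date"] < end)

df_jan_2025 = df.loc[in_jan_2025]
print(df_jan_2025)
```

A plain `dt.month == 1` check on this data would also return transaction 104.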


## 📊 Exercise 5: Complex cleaning combining multiple steps

Instruction:

Drop duplicate transactions based on customer_id and amount.

Fill missing amount with median amount.

Filter for transactions after 2025-01-10.

Select only customer_id, amount, and transaction_date.

Data:

In [52]:
df = pd.DataFrame({
    "customer_id": [1,1,2,2,3],
    "transaction_date": ["2025-01-05","2025-01-15","2025-01-20","2025-01-25","2025-01-10"],
    "amount": [200,200,None,400,250]
})

df

   customer_id  transaction_date  amount
0            1        2025-01-05   200.0
1            1        2025-01-15   200.0
2            2        2025-01-20     NaN
3            2        2025-01-25   400.0
4            3        2025-01-10   250.0


In [53]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customer_id       5 non-null      int64  
 1   transaction_date  5 non-null      object 
 2   amount            4 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 252.0+ bytes


In [36]:
# .copy() gives an independent frame, so assigning to it later is unambiguous
df_dup = df.drop_duplicates(subset=['customer_id', 'amount']).copy()

In [49]:
df_dup['amount'] = df_dup['amount'].fillna(df_dup['amount'].median())

In [50]:
# Build the mask from df_dup itself (not df) so the mask and the frame share the same index
df_dup = df_dup.loc[df_dup['transaction_date'] > '2025-01-10', ['customer_id', 'amount', 'transaction_date']]
df_dup

   customer_id  amount  transaction_date
2            2   250.0        2025-01-20
3            2   400.0        2025-01-25


Exercise 5 solution

In [54]:
df["transaction_date"] = pd.to_datetime(df["transaction_date"])
# .copy() makes the deduplicated frame independent, so the assignment below
# does not trigger SettingWithCopyWarning
df = df.drop_duplicates(subset=["customer_id","amount"], keep="first").copy()
median_amount = df["amount"].median()
df["amount"] = df["amount"].fillna(median_amount)
df_clean = df.loc[df["transaction_date"] > "2025-01-10", ["customer_id","amount","transaction_date"]]
df_clean


   customer_id  amount  transaction_date
2            2   250.0        2025-01-20
3            2   400.0        2025-01-25


In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 0 to 4
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   customer_id       4 non-null      int64         
 1   transaction_date  4 non-null      datetime64[ns]
 2   amount            4 non-null      float64       
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 128.0 bytes
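
The whole pipeline can also be wrapped in a small function that works on a copy of its input. This is a sketch under the exercise's assumptions, not the only way to do it; the function name `clean_transactions` and its `cutoff` parameter are hypothetical.

```python
import pandas as pd

def clean_transactions(raw: pd.DataFrame, cutoff: str) -> pd.DataFrame:
    """Deduplicate, fill missing amounts with the median, then filter and select."""
    out = raw.copy()  # work on a copy so the caller's frame is not modified
    out["transaction_date"] = pd.to_datetime(out["transaction_date"])
    out = out.drop_duplicates(subset=["customer_id", "amount"], keep="first")
    # The median is computed after deduplication, matching the exercise order
    out = out.assign(amount=out["amount"].fillna(out["amount"].median()))
    after_cutoff = out["transaction_date"] > pd.Timestamp(cutoff)
    return out.loc[after_cutoff, ["customer_id", "amount", "transaction_date"]]

raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "transaction_date": ["2025-01-05", "2025-01-15", "2025-01-20", "2025-01-25", "2025-01-10"],
    "amount": [200, 200, None, 400, 250],
})

result = clean_transactions(raw, "2025-01-10")
print(result)
```

Because `assign` returns a new frame instead of writing into an existing one, this version avoids the ambiguity of assigning into a filtered frame.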
