# Data Cleaning & Preprocessing in Pandas

Data cleaning and preprocessing are essential steps in data science and machine learning because real-world data is often messy, incomplete, or inconsistent. Clean data ensures that models learn accurate patterns instead of noise or errors. Preprocessing helps handle missing values, remove duplicates, correct data types, and normalize values so algorithms can work effectively. It also improves model accuracy, reliability, and generalization by providing high-quality inputs. Without proper cleaning, even advanced models can produce misleading results. This stage also helps uncover hidden issues, ensures fairness, and makes the dataset usable for analysis or prediction. Overall, data cleaning forms the foundation for building trustworthy and high-performance ML systems.

In [1]:
import pandas as pd

In [2]:
pd.__version__

'2.2.3'

In [3]:
df = pd.read_excel('data_cleaning.xlsx') # Loading Data
df

Unnamed: 0,User_ID,Transaction_ID,Date,Location,Amount,Merchant,Category,Transaction_Type,Status,MCC,MCC Stands,Is_Online,Fraud_Flag,User_Age_Group,Age,Card_Type,Email
0,USER_001,2867825.0,2025-12-09,Seattle,377.67,Amazon,Clothing,Purchase,Approved,5541,Service Stations,True,0,18-24,19,Amex,kabir.shah@gmail.com
1,USER_002,1419610.0,2025-12-06,Dallas,950.96,Starbucks,Dining,Refund,Approved,5651,Family Clothing Stores,True,0,25-34,22,Discover,sara.malik92@yahoo.com
2,USER_003,5614226.0,2025-11-27,Houston,733.33,McDonalds,Groceries,Purchase,Approved,5732,Electronics Stores,False,0,18-24,17,Visa,ahmed.khan01@outlook.com
3,USER_004,5108603.0,2025-11-23,San Francisco,600.67,,Dining,Purchase,Approved,5651,,True,0,35-44,35,,jason.lee.manager@company.com
4,USER_004,5108603.0,2025-11-23,San Francisco,600.67,,Dining,Purchase,Approved,5651,,True,0,35-44,38,,emily.roberts.dev@gmail.com
5,USER_005,,2025-11-22,Los Angeles,160.24,Online Shop,Utilities,Purchase,Approved,5999,,False,0,18-24,20,Mastercard,finance.team.support@corporate.org
6,USER_006,,2025-11-22,Jacksonville,160.21,,Fuel,Purchase,Approved,5814,,False,0,18-24,18,Mastercard,maria.ali24@hotmail.com
7,USER_007,,2025-11-17,San Francisco,62.79,McDonalds,,Purchase,Approved,5812,Restaurants,False,0,35-44,36,Visa,david_smith742@protonmail.com
8,USER_007,,2025-11-17,San Francisco,62.79,McDonalds,,Purchase,Approved,5812,Restaurants,False,0,35-44,39,Visa,techinsider.newsletter@info.com
9,USER_008,2458591.0,2025-11-11,Philadelphia,866.85,Restaurant,Groceries,Purchase,Approved,5655,Sports & Riding Apparel Stores,True,0,25-34,31,,hamza.raza22@students.edu


## Checking for missing values (NaN) in a DataFrame

- df.isnull() creates a True/False table showing where values are missing.
- sum() adds up the True values in each column.

The final result is a count of how many missing values each column has.

In [4]:
df.isnull().sum()  # Return the number of null values in each column

User_ID             0
Transaction_ID      5
Date                0
Location            0
Amount              0
Merchant            3
Category            5
Transaction_Type    0
Status              0
MCC                 0
MCC Stands          4
Is_Online           0
Fraud_Flag          0
User_Age_Group      0
Age                 0
Card_Type           3
Email               0
dtype: int64
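
The same count can also be expressed as a percentage, which is easier to read on larger tables. The following is a minimal sketch on a hypothetical three-column miniature of the dataset (the sample frame and the names missing_count, missing_share and summary are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the transactions table, for illustration only.
sample = pd.DataFrame({
    "Transaction_ID": [2867825.0, np.nan, 5614226.0, np.nan],
    "Merchant": ["Amazon", "Starbucks", None, "McDonalds"],
    "Amount": [377.67, 950.96, 733.33, 600.67],
})

missing_count = sample.isna().sum()         # missing cells per column
missing_share = sample.isna().mean() * 100  # the same information as a percentage

# Put both views side by side, worst column first.
summary = pd.DataFrame({"missing": missing_count, "percent": missing_share})
print(summary.sort_values("missing", ascending=False))
```

isna() is an alias of isnull(), so either name can be used.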

## dropna() - Remove rows or columns that contain missing values (NaN)

dropna() cleans incomplete data by deleting rows or columns that contain missing entries. By default, it removes every row that has at least one NaN.

You can remove columns instead by using: df.dropna(axis=1).

### Is it useful?
It is useful when only a small share of the data is missing, or when the incomplete rows or columns are not important to keep. Many machine-learning algorithms cannot handle NaN values, so removing them is one way to make the dataset usable.

## The axis parameter
The axis parameter controls what gets dropped. With axis=0, any row that has at least one missing value is deleted. With axis=1, any column that has at least one missing value anywhere in it is deleted. This is helpful when you want to clean the dataset by removing incomplete rows or columns instead of filling them.

axis=0 → remove rows with missing values (this is the default)

In [5]:
df.dropna()  # Keep only the rows that have no missing values at all

Unnamed: 0,User_ID,Transaction_ID,Date,Location,Amount,Merchant,Category,Transaction_Type,Status,MCC,MCC Stands,Is_Online,Fraud_Flag,User_Age_Group,Age,Card_Type,Email
0,USER_001,2867825.0,2025-12-09,Seattle,377.67,Amazon,Clothing,Purchase,Approved,5541,Service Stations,True,0,18-24,19,Amex,kabir.shah@gmail.com
1,USER_002,1419610.0,2025-12-06,Dallas,950.96,Starbucks,Dining,Refund,Approved,5651,Family Clothing Stores,True,0,25-34,22,Discover,sara.malik92@yahoo.com
2,USER_003,5614226.0,2025-11-27,Houston,733.33,McDonalds,Groceries,Purchase,Approved,5732,Electronics Stores,False,0,18-24,17,Visa,ahmed.khan01@outlook.com
11,USER_010,1533224.0,2025-11-02,Denver,709.53,Online Shop,Clothing,Purchase,Approved,5651,Family Clothing Stores,True,0,35-44,44,Discover,linda.green.marketing@agency.co


axis=1 → remove columns with missing values

In [6]:
df.dropna(axis=1)

Unnamed: 0,User_ID,Date,Location,Amount,Transaction_Type,Status,MCC,Is_Online,Fraud_Flag,User_Age_Group,Age,Email
0,USER_001,2025-12-09,Seattle,377.67,Purchase,Approved,5541,True,0,18-24,19,kabir.shah@gmail.com
1,USER_002,2025-12-06,Dallas,950.96,Refund,Approved,5651,True,0,25-34,22,sara.malik92@yahoo.com
2,USER_003,2025-11-27,Houston,733.33,Purchase,Approved,5732,False,0,18-24,17,ahmed.khan01@outlook.com
3,USER_004,2025-11-23,San Francisco,600.67,Purchase,Approved,5651,True,0,35-44,35,jason.lee.manager@company.com
4,USER_004,2025-11-23,San Francisco,600.67,Purchase,Approved,5651,True,0,35-44,38,emily.roberts.dev@gmail.com
5,USER_005,2025-11-22,Los Angeles,160.24,Purchase,Approved,5999,False,0,18-24,20,finance.team.support@corporate.org
6,USER_006,2025-11-22,Jacksonville,160.21,Purchase,Approved,5814,False,0,18-24,18,maria.ali24@hotmail.com
7,USER_007,2025-11-17,San Francisco,62.79,Purchase,Approved,5812,False,0,35-44,36,david_smith742@protonmail.com
8,USER_007,2025-11-17,San Francisco,62.79,Purchase,Approved,5812,False,0,35-44,39,techinsider.newsletter@info.com
9,USER_008,2025-11-11,Philadelphia,866.85,Purchase,Approved,5655,True,0,25-34,31,hamza.raza22@students.edu
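
Dropping every incomplete row or column can discard a lot of data (only 4 complete rows remain above). dropna() also accepts subset, thresh and how to make the rule less strict. A minimal sketch on a hypothetical sample frame, not the notebook's own data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample, for illustration only.
sample = pd.DataFrame({
    "User_ID": ["USER_001", "USER_002", "USER_003", "USER_004"],
    "Transaction_ID": [2867825.0, np.nan, 5614226.0, np.nan],
    "Merchant": ["Amazon", "Starbucks", None, None],
    "Amount": [377.67, 950.96, 733.33, 600.67],
})

# Drop a row only when the chosen key column is missing.
has_id = sample.dropna(subset=["Transaction_ID"])

# Keep rows that have at least 3 non-missing values.
mostly_complete = sample.dropna(thresh=3)

# Drop a column only when every value in it is missing (none here).
no_empty_columns = sample.dropna(axis=1, how="all")

print(has_id.index.tolist())
print(mostly_complete.index.tolist())
print(no_empty_columns.columns.tolist())
```

Which rule fits depends on the dataset; the thresholds here are arbitrary.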


### df.fillna("Error!!!")

Replaces all missing values (NaN) in the entire DataFrame with the text "Error!!!". It’s used to fill empty or null cells with a custom message instead of leaving them blank.

In [7]:
df.fillna("Error!!!")

Unnamed: 0,User_ID,Transaction_ID,Date,Location,Amount,Merchant,Category,Transaction_Type,Status,MCC,MCC Stands,Is_Online,Fraud_Flag,User_Age_Group,Age,Card_Type,Email
0,USER_001,2867825.0,2025-12-09,Seattle,377.67,Amazon,Clothing,Purchase,Approved,5541,Service Stations,True,0,18-24,19,Amex,kabir.shah@gmail.com
1,USER_002,1419610.0,2025-12-06,Dallas,950.96,Starbucks,Dining,Refund,Approved,5651,Family Clothing Stores,True,0,25-34,22,Discover,sara.malik92@yahoo.com
2,USER_003,5614226.0,2025-11-27,Houston,733.33,McDonalds,Groceries,Purchase,Approved,5732,Electronics Stores,False,0,18-24,17,Visa,ahmed.khan01@outlook.com
3,USER_004,5108603.0,2025-11-23,San Francisco,600.67,Error!!!,Dining,Purchase,Approved,5651,Error!!!,True,0,35-44,35,Error!!!,jason.lee.manager@company.com
4,USER_004,5108603.0,2025-11-23,San Francisco,600.67,Error!!!,Dining,Purchase,Approved,5651,Error!!!,True,0,35-44,38,Error!!!,emily.roberts.dev@gmail.com
5,USER_005,Error!!!,2025-11-22,Los Angeles,160.24,Online Shop,Utilities,Purchase,Approved,5999,Error!!!,False,0,18-24,20,Mastercard,finance.team.support@corporate.org
6,USER_006,Error!!!,2025-11-22,Jacksonville,160.21,Error!!!,Fuel,Purchase,Approved,5814,Error!!!,False,0,18-24,18,Mastercard,maria.ali24@hotmail.com
7,USER_007,Error!!!,2025-11-17,San Francisco,62.79,McDonalds,Error!!!,Purchase,Approved,5812,Restaurants,False,0,35-44,36,Visa,david_smith742@protonmail.com
8,USER_007,Error!!!,2025-11-17,San Francisco,62.79,McDonalds,Error!!!,Purchase,Approved,5812,Restaurants,False,0,35-44,39,Visa,techinsider.newsletter@info.com
9,USER_008,2458591.0,2025-11-11,Philadelphia,866.85,Restaurant,Groceries,Purchase,Approved,5655,Sports & Riding Apparel Stores,True,0,25-34,31,Error!!!,hamza.raza22@students.edu
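
Filling the whole DataFrame with one string also puts text into numeric columns such as Transaction_ID. One alternative is to pass a dictionary so that each column gets its own fill value. This is a sketch on a hypothetical sample frame, and the "Unknown" placeholder is an assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical sample, for illustration only.
sample = pd.DataFrame({
    "Merchant": ["Amazon", None, "McDonalds"],
    "Card_Type": ["Amex", "Visa", None],
    "Amount": [377.67, np.nan, 62.79],
})

# One fill value per column; columns not listed in the dict are left untouched.
filled = sample.fillna({"Merchant": "Unknown", "Card_Type": "Unknown"})

print(filled)
print(filled.dtypes)  # Amount is still float64 and still has its NaN
```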


Here we select a single column, Transaction_ID, to see where its NaN values are.

In [8]:
nan_val = df['Transaction_ID']  # Select the Transaction_ID column, which contains NaN values
nan_val

0     2867825.0
1     1419610.0
2     5614226.0
3     5108603.0
4     5108603.0
5           NaN
6           NaN
7           NaN
8           NaN
9     2458591.0
10    8078673.0
11    1533224.0
12          NaN
13    8078673.0
Name: Transaction_ID, dtype: float64

### Filling NaN with the mean

Here we take the mean of the Transaction_ID column (the average of the values that are present) and place it wherever a value is missing, using the fillna and mean functions in pandas.

In [9]:
mean_placing = nan_val.fillna(df["Transaction_ID"].mean())  # Fills the NaN values at indexes 5, 6, 7, 8 and 12
mean_placing.astype(int)

0     2867825
1     1419610
2     5614226
3     5108603
4     5108603
5     4474225
6     4474225
7     4474225
8     4474225
9     2458591
10    8078673
11    1533224
12    4474225
13    8078673
Name: Transaction_ID, dtype: int64

In the example above, the NaN values have been replaced with the column mean (4474225 after conversion to int).
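
Transaction_ID is an identifier, so its mean only demonstrates the mechanics. Mean or median filling is more commonly applied to a measured quantity. The sketch below uses a hypothetical Amount series with gaps (the Amount column in the notebook has no missing values):

```python
import numpy as np
import pandas as pd

# Hypothetical amounts with two gaps, for illustration only.
amounts = pd.Series([377.67, np.nan, 733.33, 600.67, np.nan, 62.79], name="Amount")

# mean() and median() skip NaN by default.
mean_filled = amounts.fillna(amounts.mean())
median_filled = amounts.fillna(amounts.median())

print(mean_filled)
print(median_filled)
```

The median is often preferred when the column contains outliers, because a few extreme values shift it less than they shift the mean.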

## Forward and Backward Filling example

Forward filling (ffill) and Backward filling (bfill) are methods to fill missing values in a dataset.

### Forward Filling (ffill):
- Fills each missing value with the previous non-missing value.
- Works like “carry forward the last known value.”
- Useful when data changes gradually (e.g., time-series).

### Backward Filling (bfill)
- Fills missing values with the next non-missing value.
- Works like “pull the next known value backward.”
- Useful when future data can logically fill earlier gaps.

In [10]:
forward_filling = nan_val.ffill()
forward_filling

0     2867825.0
1     1419610.0
2     5614226.0
3     5108603.0
4     5108603.0
5     5108603.0
6     5108603.0
7     5108603.0
8     5108603.0
9     2458591.0
10    8078673.0
11    1533224.0
12    1533224.0
13    8078673.0
Name: Transaction_ID, dtype: float64

In [11]:
backward_filling = nan_val.bfill()
backward_filling

0     2867825.0
1     1419610.0
2     5614226.0
3     5108603.0
4     5108603.0
5     2458591.0
6     2458591.0
7     2458591.0
8     2458591.0
9     2458591.0
10    8078673.0
11    1533224.0
12    8078673.0
13    8078673.0
Name: Transaction_ID, dtype: float64
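
In the outputs above, one known value is copied into four consecutive gaps. Both ffill() and bfill() accept a limit argument that caps how many consecutive gaps are filled. A minimal sketch on hypothetical series:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a run of three gaps, for illustration only.
ids = pd.Series([5108603.0, np.nan, np.nan, np.nan, 2458591.0])

# Fill at most one consecutive gap; the rest of the run stays NaN.
capped = ids.ffill(limit=1)
print(capped)

# Forward fill has nothing to copy into a gap at the very start.
leading = pd.Series([np.nan, 1.0, np.nan])
print(leading.ffill())
```

A gap at the start of a series survives ffill(), and a gap at the end survives bfill(), so the two are sometimes chained.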

In [12]:
df.duplicated()  # Compares entire rows; all False here because, for example, rows 3 and 4 still differ in Age and Email

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
dtype: bool

In [13]:
w_out_duplicate = df.drop_duplicates()
w_out_duplicate

Unnamed: 0,User_ID,Transaction_ID,Date,Location,Amount,Merchant,Category,Transaction_Type,Status,MCC,MCC Stands,Is_Online,Fraud_Flag,User_Age_Group,Age,Card_Type,Email
0,USER_001,2867825.0,2025-12-09,Seattle,377.67,Amazon,Clothing,Purchase,Approved,5541,Service Stations,True,0,18-24,19,Amex,kabir.shah@gmail.com
1,USER_002,1419610.0,2025-12-06,Dallas,950.96,Starbucks,Dining,Refund,Approved,5651,Family Clothing Stores,True,0,25-34,22,Discover,sara.malik92@yahoo.com
2,USER_003,5614226.0,2025-11-27,Houston,733.33,McDonalds,Groceries,Purchase,Approved,5732,Electronics Stores,False,0,18-24,17,Visa,ahmed.khan01@outlook.com
3,USER_004,5108603.0,2025-11-23,San Francisco,600.67,,Dining,Purchase,Approved,5651,,True,0,35-44,35,,jason.lee.manager@company.com
4,USER_004,5108603.0,2025-11-23,San Francisco,600.67,,Dining,Purchase,Approved,5651,,True,0,35-44,38,,emily.roberts.dev@gmail.com
5,USER_005,,2025-11-22,Los Angeles,160.24,Online Shop,Utilities,Purchase,Approved,5999,,False,0,18-24,20,Mastercard,finance.team.support@corporate.org
6,USER_006,,2025-11-22,Jacksonville,160.21,,Fuel,Purchase,Approved,5814,,False,0,18-24,18,Mastercard,maria.ali24@hotmail.com
7,USER_007,,2025-11-17,San Francisco,62.79,McDonalds,,Purchase,Approved,5812,Restaurants,False,0,35-44,36,Visa,david_smith742@protonmail.com
8,USER_007,,2025-11-17,San Francisco,62.79,McDonalds,,Purchase,Approved,5812,Restaurants,False,0,35-44,39,Visa,techinsider.newsletter@info.com
9,USER_008,2458591.0,2025-11-11,Philadelphia,866.85,Restaurant,Groceries,Purchase,Approved,5655,Sports & Riding Apparel Stores,True,0,25-34,31,,hamza.raza22@students.edu


In [14]:
w_out_duplicate.duplicated()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
dtype: bool

In [15]:
print(df.duplicated(subset=["User_ID", "Location"]))  # Flag rows that repeat an earlier User_ID + Location pair

0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8      True
9     False
10    False
11    False
12     True
13     True
dtype: bool
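
The subset check can be combined with drop_duplicates() and the keep argument to decide which copy survives. A minimal sketch on a hypothetical sample frame that mirrors the pattern above (same user and location, different Age):

```python
import pandas as pd

# Hypothetical sample, for illustration only.
sample = pd.DataFrame({
    "User_ID": ["USER_004", "USER_004", "USER_007", "USER_007", "USER_008"],
    "Location": ["San Francisco", "San Francisco", "San Francisco",
                 "San Francisco", "Philadelphia"],
    "Age": [35, 38, 36, 39, 31],
})
keys = ["User_ID", "Location"]

full_row_dupes = sample.duplicated()                      # all False: Age differs
key_dupes = sample.duplicated(subset=keys)                # marks the later copies
all_copies = sample.duplicated(subset=keys, keep=False)   # marks every copy
deduped = sample.drop_duplicates(subset=keys, keep="first")

print(key_dupes.tolist())
print(deduped)
```

keep="first" keeps the earliest row of each group, keep="last" keeps the latest, and keep=False treats every member of the group as a duplicate.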


In [16]:
df["Location"].str.lower() 

0           seattle
1            dallas
2           houston
3     san francisco
4     san francisco
5       los angeles
6      jacksonville
7     san francisco
8     san francisco
9      philadelphia
10         san jose
11           denver
12    san francisco
13         san jose
Name: Location, dtype: object

In [17]:
df  # Check the original data: it is unchanged, because str.lower() returns a new Series instead of modifying df

Unnamed: 0,User_ID,Transaction_ID,Date,Location,Amount,Merchant,Category,Transaction_Type,Status,MCC,MCC Stands,Is_Online,Fraud_Flag,User_Age_Group,Age,Card_Type,Email
0,USER_001,2867825.0,2025-12-09,Seattle,377.67,Amazon,Clothing,Purchase,Approved,5541,Service Stations,True,0,18-24,19,Amex,kabir.shah@gmail.com
1,USER_002,1419610.0,2025-12-06,Dallas,950.96,Starbucks,Dining,Refund,Approved,5651,Family Clothing Stores,True,0,25-34,22,Discover,sara.malik92@yahoo.com
2,USER_003,5614226.0,2025-11-27,Houston,733.33,McDonalds,Groceries,Purchase,Approved,5732,Electronics Stores,False,0,18-24,17,Visa,ahmed.khan01@outlook.com
3,USER_004,5108603.0,2025-11-23,San Francisco,600.67,,Dining,Purchase,Approved,5651,,True,0,35-44,35,,jason.lee.manager@company.com
4,USER_004,5108603.0,2025-11-23,San Francisco,600.67,,Dining,Purchase,Approved,5651,,True,0,35-44,38,,emily.roberts.dev@gmail.com
5,USER_005,,2025-11-22,Los Angeles,160.24,Online Shop,Utilities,Purchase,Approved,5999,,False,0,18-24,20,Mastercard,finance.team.support@corporate.org
6,USER_006,,2025-11-22,Jacksonville,160.21,,Fuel,Purchase,Approved,5814,,False,0,18-24,18,Mastercard,maria.ali24@hotmail.com
7,USER_007,,2025-11-17,San Francisco,62.79,McDonalds,,Purchase,Approved,5812,Restaurants,False,0,35-44,36,Visa,david_smith742@protonmail.com
8,USER_007,,2025-11-17,San Francisco,62.79,McDonalds,,Purchase,Approved,5812,Restaurants,False,0,35-44,39,Visa,techinsider.newsletter@info.com
9,USER_008,2458591.0,2025-11-11,Philadelphia,866.85,Restaurant,Groceries,Purchase,Approved,5655,Sports & Riding Apparel Stores,True,0,25-34,31,,hamza.raza22@students.edu
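
To keep the cleaned text, the result has to be assigned back to the column. A minimal sketch, assuming hypothetical Location values with stray whitespace and inconsistent casing (the notebook's own column is already clean):

```python
import pandas as pd

# Hypothetical messy values, for illustration only.
sample = pd.DataFrame({"Location": [" Seattle", "san francisco ", "DALLAS"]})

# String methods return a new Series; assign it back to keep the change.
sample["Location"] = sample["Location"].str.strip().str.title()

print(sample["Location"].tolist())
```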


### Finding rows where the Location column contains "San Francisco", ignoring case

In [18]:
df["Location"].str.contains("san francisco", case=False)

0     False
1     False
2     False
3      True
4      True
5     False
6     False
7      True
8      True
9     False
10    False
11    False
12     True
13    False
Name: Location, dtype: bool
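
The True/False Series above is a mask, and passing it back to the DataFrame returns only the matching rows. The sketch below uses a hypothetical sample frame. The na=False argument is included on the assumption that the column may contain missing values, which would otherwise appear as NaN in the mask:

```python
import pandas as pd

# Hypothetical sample with one missing location, for illustration only.
sample = pd.DataFrame({
    "Location": ["Seattle", "San Francisco", None, "SAN FRANCISCO"],
    "Amount": [377.67, 600.67, 160.24, 62.79],
})

# Case-insensitive match; missing values count as "no match".
mask = sample["Location"].str.contains("san francisco", case=False, na=False)
sf_rows = sample[mask]

print(sf_rows)
```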

### Email addresses were added to the dataset to demonstrate the str.split method in pandas

In [19]:
df["Email"].str.split("@")

0                   [kabir.shah, gmail.com]
1                 [sara.malik92, yahoo.com]
2               [ahmed.khan01, outlook.com]
3          [jason.lee.manager, company.com]
4            [emily.roberts.dev, gmail.com]
5     [finance.team.support, corporate.org]
6                [maria.ali24, hotmail.com]
7          [david_smith742, protonmail.com]
8        [techinsider.newsletter, info.com]
9              [hamza.raza22, students.edu]
10         [contact.sales, businesshub.com]
11       [linda.green.marketing, agency.co]
12       [support_center.help, service.net]
13    [john.turner.analytics, dataworks.io]
Name: Email, dtype: object

In [20]:
type(df["Email"].str.split("@")[0])

list
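
Each element above is a Python list. Passing expand=True returns the pieces as separate columns instead, which is usually easier to work with. A minimal sketch with two addresses from the dataset (the Username and Domain column names are assumptions):

```python
import pandas as pd

emails = pd.DataFrame({"Email": ["kabir.shah@gmail.com", "sara.malik92@yahoo.com"]})

# n=1 splits on the first "@" only; expand=True returns a DataFrame of parts.
parts = emails["Email"].str.split("@", n=1, expand=True)
emails["Username"] = parts[0]
emails["Domain"] = parts[1]

print(emails)
```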

In [21]:
df["Date"] = pd.to_datetime(df["Date"])  # Convert the column to datetime; the dates were already in a standard format, so the displayed values look the same
c_dateformat = df["Date"]
print(c_dateformat)

0    2025-12-09
1    2025-12-06
2    2025-11-27
3    2025-11-23
4    2025-11-23
5    2025-11-22
6    2025-11-22
7    2025-11-17
8    2025-11-17
9    2025-11-11
10   2025-11-06
11   2025-11-02
12   2025-11-17
13   2025-11-06
Name: Date, dtype: datetime64[ns]
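
Real date columns often contain entries that cannot be parsed. The sketch below assumes a hypothetical raw series with one malformed value and one missing value. errors="coerce" turns both into NaT instead of raising an error:

```python
import pandas as pd

# Hypothetical raw values, for illustration only.
raw = pd.Series(["2025-12-09", "2025-11-27", "not a date", None])

# An explicit format makes the expectation clear; bad values become NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed)
print(parsed.isna().sum())  # how many values could not be parsed
```

Counting the NaT values afterwards shows how many entries need attention.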


### Applying Functions

First we drop the rows that contain NaN values, so that the functions are applied to a cleaned dataset.

In [22]:
df2 = df.dropna().copy()

In [23]:
df2 #here is our cleaned dataset

Unnamed: 0,User_ID,Transaction_ID,Date,Location,Amount,Merchant,Category,Transaction_Type,Status,MCC,MCC Stands,Is_Online,Fraud_Flag,User_Age_Group,Age,Card_Type,Email
0,USER_001,2867825.0,2025-12-09,Seattle,377.67,Amazon,Clothing,Purchase,Approved,5541,Service Stations,True,0,18-24,19,Amex,kabir.shah@gmail.com
1,USER_002,1419610.0,2025-12-06,Dallas,950.96,Starbucks,Dining,Refund,Approved,5651,Family Clothing Stores,True,0,25-34,22,Discover,sara.malik92@yahoo.com
2,USER_003,5614226.0,2025-11-27,Houston,733.33,McDonalds,Groceries,Purchase,Approved,5732,Electronics Stores,False,0,18-24,17,Visa,ahmed.khan01@outlook.com
11,USER_010,1533224.0,2025-11-02,Denver,709.53,Online Shop,Clothing,Purchase,Approved,5651,Family Clothing Stores,True,0,35-44,44,Discover,linda.green.marketing@agency.co


In [24]:
df2.info() #To validate our cleaned dataset

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 0 to 11
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   User_ID           4 non-null      object        
 1   Transaction_ID    4 non-null      float64       
 2   Date              4 non-null      datetime64[ns]
 3   Location          4 non-null      object        
 4   Amount            4 non-null      float64       
 5   Merchant          4 non-null      object        
 6   Category          4 non-null      object        
 7   Transaction_Type  4 non-null      object        
 8   Status            4 non-null      object        
 9   MCC               4 non-null      int64         
 10  MCC Stands        4 non-null      object        
 11  Is_Online         4 non-null      bool          
 12  Fraud_Flag        4 non-null      int64         
 13  User_Age_Group    4 non-null      object        
 14  Age               4 non-null      

### Adding a new column at the end that labels each row as Adult or Minor based on the Age column

In [25]:
df2["Age_Group"] = df2["Age"].apply(lambda x: "Adult" if x >= 18 else "Minor")

In [26]:
df2

Unnamed: 0,User_ID,Transaction_ID,Date,Location,Amount,Merchant,Category,Transaction_Type,Status,MCC,MCC Stands,Is_Online,Fraud_Flag,User_Age_Group,Age,Card_Type,Email,Age_Group
0,USER_001,2867825.0,2025-12-09,Seattle,377.67,Amazon,Clothing,Purchase,Approved,5541,Service Stations,True,0,18-24,19,Amex,kabir.shah@gmail.com,Adult
1,USER_002,1419610.0,2025-12-06,Dallas,950.96,Starbucks,Dining,Refund,Approved,5651,Family Clothing Stores,True,0,25-34,22,Discover,sara.malik92@yahoo.com,Adult
2,USER_003,5614226.0,2025-11-27,Houston,733.33,McDonalds,Groceries,Purchase,Approved,5732,Electronics Stores,False,0,18-24,17,Visa,ahmed.khan01@outlook.com,Minor
11,USER_010,1533224.0,2025-11-02,Denver,709.53,Online Shop,Clothing,Purchase,Approved,5651,Family Clothing Stores,True,0,35-44,44,Discover,linda.green.marketing@agency.co,Adult


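We can achieve the same kind of labeling by writing a named function. Note that the function below uses a cutoff of 25 rather than 18, so the resulting Age_Func column differs from Age_Group.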

In [27]:
def label_age(x):
    # Same idea as the lambda above, but with a cutoff of 25 instead of 18
    return "Adult" if x >= 25 else "Minor"
df2["Age_Func"] = df2["Age"].apply(label_age)

In [28]:
df2

Unnamed: 0,User_ID,Transaction_ID,Date,Location,Amount,Merchant,Category,Transaction_Type,Status,MCC,MCC Stands,Is_Online,Fraud_Flag,User_Age_Group,Age,Card_Type,Email,Age_Group,Age_Func
0,USER_001,2867825.0,2025-12-09,Seattle,377.67,Amazon,Clothing,Purchase,Approved,5541,Service Stations,True,0,18-24,19,Amex,kabir.shah@gmail.com,Adult,Minor
1,USER_002,1419610.0,2025-12-06,Dallas,950.96,Starbucks,Dining,Refund,Approved,5651,Family Clothing Stores,True,0,25-34,22,Discover,sara.malik92@yahoo.com,Adult,Minor
2,USER_003,5614226.0,2025-11-27,Houston,733.33,McDonalds,Groceries,Purchase,Approved,5732,Electronics Stores,False,0,18-24,17,Visa,ahmed.khan01@outlook.com,Minor,Minor
11,USER_010,1533224.0,2025-11-02,Denver,709.53,Online Shop,Clothing,Purchase,Approved,5651,Family Clothing Stores,True,0,35-44,44,Discover,linda.green.marketing@agency.co,Adult,Adult
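
For simple threshold rules like these, apply() is not the only option. A minimal sketch of two vectorized alternatives, np.where for a two-way label and pd.cut for several bands, using the four ages shown above (the bin edges and the Age_Band name are assumptions for illustration):

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({"Age": [19, 22, 17, 44]})

# Two-way label without a Python-level function call per row.
ages["Age_Group"] = np.where(ages["Age"] >= 18, "Adult", "Minor")

# Several bands at once; each bin includes its right edge by default.
ages["Age_Band"] = pd.cut(
    ages["Age"],
    bins=[0, 17, 24, 34, 44],
    labels=["<18", "18-24", "25-34", "35-44"],
)

print(ages)
```

On a table this small the difference is negligible. Vectorized operations mainly pay off on large datasets.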


In [29]:
card_type = {"Amex": "Paypal", "Discover":"MasterCard", "Visa":"Visa-International"}
df2["Card_Type"] = df2["Card_Type"].map(card_type)

In [30]:
df2

Unnamed: 0,User_ID,Transaction_ID,Date,Location,Amount,Merchant,Category,Transaction_Type,Status,MCC,MCC Stands,Is_Online,Fraud_Flag,User_Age_Group,Age,Card_Type,Email,Age_Group,Age_Func
0,USER_001,2867825.0,2025-12-09,Seattle,377.67,Amazon,Clothing,Purchase,Approved,5541,Service Stations,True,0,18-24,19,Paypal,kabir.shah@gmail.com,Adult,Minor
1,USER_002,1419610.0,2025-12-06,Dallas,950.96,Starbucks,Dining,Refund,Approved,5651,Family Clothing Stores,True,0,25-34,22,MasterCard,sara.malik92@yahoo.com,Adult,Minor
2,USER_003,5614226.0,2025-11-27,Houston,733.33,McDonalds,Groceries,Purchase,Approved,5732,Electronics Stores,False,0,18-24,17,Visa-International,ahmed.khan01@outlook.com,Minor,Minor
11,USER_010,1533224.0,2025-11-02,Denver,709.53,Online Shop,Clothing,Purchase,Approved,5651,Family Clothing Stores,True,0,35-44,44,MasterCard,linda.green.marketing@agency.co,Adult,Adult
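
One thing to watch with map(): a value that is not a key in the dictionary becomes NaN. All three card types in df2 are covered, but the full dataset also contains Mastercard. A minimal sketch of the behaviour and one possible fallback (the cards series is a hypothetical sample):

```python
import pandas as pd

# Hypothetical sample, for illustration only.
cards = pd.Series(["Amex", "Discover", "Visa", "Mastercard"], name="Card_Type")
card_type = {"Amex": "Paypal", "Discover": "MasterCard", "Visa": "Visa-International"}

# "Mastercard" is not a key in the dict, so it maps to NaN.
mapped = cards.map(card_type)

# Fall back to the original value wherever no mapping exists.
mapped_keep = cards.map(card_type).fillna(cards)

print(mapped.tolist())
print(mapped_keep.tolist())
```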


In [31]:
df2["Transaction_Type"] = df2["Transaction_Type"].replace({"Refund": "Purchase"})

In [32]:
df2 

Unnamed: 0,User_ID,Transaction_ID,Date,Location,Amount,Merchant,Category,Transaction_Type,Status,MCC,MCC Stands,Is_Online,Fraud_Flag,User_Age_Group,Age,Card_Type,Email,Age_Group,Age_Func
0,USER_001,2867825.0,2025-12-09,Seattle,377.67,Amazon,Clothing,Purchase,Approved,5541,Service Stations,True,0,18-24,19,Paypal,kabir.shah@gmail.com,Adult,Minor
1,USER_002,1419610.0,2025-12-06,Dallas,950.96,Starbucks,Dining,Purchase,Approved,5651,Family Clothing Stores,True,0,25-34,22,MasterCard,sara.malik92@yahoo.com,Adult,Minor
2,USER_003,5614226.0,2025-11-27,Houston,733.33,McDonalds,Groceries,Purchase,Approved,5732,Electronics Stores,False,0,18-24,17,Visa-International,ahmed.khan01@outlook.com,Minor,Minor
11,USER_010,1533224.0,2025-11-02,Denver,709.53,Online Shop,Clothing,Purchase,Approved,5651,Family Clothing Stores,True,0,35-44,44,MasterCard,linda.green.marketing@agency.co,Adult,Adult


You can see that the Transaction_Type value at index 1 has been changed from Refund to Purchase.
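
Unlike map(), replace() leaves values that are not in the dictionary unchanged. Comparing the column before and after is a quick way to confirm how many rows were affected. A minimal sketch on a hypothetical sample frame:

```python
import pandas as pd

# Hypothetical sample, for illustration only.
sample = pd.DataFrame({
    "Transaction_Type": ["Purchase", "Refund", "Purchase"],
    "Status": ["Approved", "Approved", "Declined"],
})

# Only "Refund" is rewritten; every other value is kept as it is.
replaced = sample["Transaction_Type"].replace({"Refund": "Purchase"})

# Count the rows whose value actually changed.
n_changed = (replaced != sample["Transaction_Type"]).sum()

print(replaced.tolist())
print(n_changed)
```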