# 🧹 Module 2: Data Cleaning

## 🎯 Objective
Now that you’ve explored the dataset and identified issues, it’s time to clean the data so that we can use it for analysis.

You will:
- Convert data types (e.g., Date, Quantity, Price)
- Handle missing and invalid values
- Remove duplicates
- Standardize categorical variables

Your cleaned dataset will be the foundation for calculating portfolio value in the next module.

### Step 2.0 – Import the DataFrame

Make sure you start from the same dataset you explored in Module 1.

Load the CSV file into a DataFrame.
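
A minimal sketch of loading a CSV with pandas. The sample data below is made up so the example runs on its own, and the file name `trades.csv` in the comment is an assumption; use the file from Module 1 instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the Module 1 file. In the notebook you would pass
# the real path instead, e.g. pd.read_csv("trades.csv") (file name assumed).
raw_csv = io.StringIO(
    "Date,Asset,Quantity,Price,Type\n"
    "2024-01-05,AAPL,10,185.5,Buy\n"
    "not a date, msft ,five,410.0,sell\n"
    "2024-01-07,,3,,Hold\n"
)

df = pd.read_csv(raw_csv)

print(df.shape)   # number of rows and columns
print(df.dtypes)  # column types as inferred by pandas
```

Empty fields are read as NaN, and a column containing text such as `five` is not read as numeric, which is why the conversion step that follows is needed.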


In [None]:
# 💡 Tip: Use pd.read_csv() to load the CSV file

# YOUR CODE HERE


### Step 2.1 – Convert data types

Make sure:
- `Date` is a datetime object
- `Quantity` and `Price` are numeric

If conversion fails due to bad data, handle errors gracefully (e.g., use `errors='coerce'`).
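
The conversion can be sketched as follows, assuming the columns are named `Date`, `Quantity`, and `Price` as above. The sample values are hypothetical and chosen so that each column contains one bad entry.

```python
import pandas as pd

# Hypothetical sample with the column names used in this module.
df = pd.DataFrame({
    "Date": ["2024-01-05", "not a date", "2024-01-07"],
    "Quantity": ["10", "five", "3"],
    "Price": ["185.5", "410.0", None],
})

# errors="coerce" turns unparseable entries into NaT / NaN instead of raising.
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
df["Quantity"] = pd.to_numeric(df["Quantity"], errors="coerce")
df["Price"] = pd.to_numeric(df["Price"], errors="coerce")

print(df.dtypes)
print(df.isna().sum())  # how many values could not be converted, per column
```

Coercing does not remove anything. It only marks the bad entries as missing, so they can be handled in the next step.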


In [None]:
# 💡 Tip: Use pd.to_datetime() and pd.to_numeric()

# Convert Date column
# YOUR CODE HERE

# Convert Quantity and Price
# YOUR CODE HERE


### Step 2.2 – Remove invalid or missing rows

Decide how to deal with rows that:
- Have missing `Asset`, `Quantity`, or `Price`
- Have `Quantity` or `Price` <= 0
- Have a missing or invalid `Date`

Hint: Don't drop every row that contains a missing value. Drop a row only when one of the critical columns is missing.

Clean based on what would break portfolio valuation.
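
One possible approach, shown on a made-up sample: treat `Date`, `Asset`, `Quantity`, and `Price` as critical and leave other columns alone. The `Notes` column is hypothetical and only illustrates a non-critical field.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-05", None, "2024-01-07", "2024-01-08"]),
    "Asset": ["AAPL", "MSFT", None, "TSLA"],
    "Quantity": [10.0, 5.0, 3.0, -2.0],
    "Price": [185.5, 410.0, 120.0, 250.0],
    "Notes": [None, "ok", None, None],  # hypothetical non-critical column
})

# Drop rows only when a critical column is missing.
critical = ["Date", "Asset", "Quantity", "Price"]
df = df.dropna(subset=critical)

# Keep only rows with strictly positive Quantity and Price.
df = df[(df["Quantity"] > 0) & (df["Price"] > 0)]

print(df)
```

The remaining row still has a missing `Notes` value, which is fine because that column is not needed for valuation.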


In [None]:
# 💡 Tip: Use .dropna() and boolean filtering

# Remove rows with missing or invalid critical fields
# YOUR CODE HERE


### Step 2.3 – Keep only valid trade types

Valid trade types are: `'Buy'` and `'Sell'`

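All others (e.g., `Hold`, empty strings, NaN) should be removed or fixed. Entries that differ only in case or whitespace (e.g., `buy` or ` Sell`) are worth fixing rather than dropping; the standardization in Step 2.4 handles them, so you may want to apply it before filtering.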
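
A small sketch of the filter, using made-up rows and assuming the column is named `Type`:

```python
import pandas as pd

df = pd.DataFrame({
    "Asset": ["AAPL", "MSFT", "TSLA", "NVDA"],
    "Type": ["Buy", "Hold", "", None],
})

valid_types = ["Buy", "Sell"]

# .isin() returns a boolean mask; only rows where it is True are kept.
df = df[df["Type"].isin(valid_types)]

print(df)
```

`.isin()` compares values exactly, and a missing value never matches, so the NaN row is dropped together with `Hold` and the empty string.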


In [None]:
# 💡 Tip: Use .isin() to filter only valid values

# Filter the 'Type' column
# YOUR CODE HERE


### Step 2.4 – Standardize text columns

Some entries in the `Asset` or `Type` columns might be lowercase or contain whitespace.

Clean them by:
- Removing extra spaces
- Converting everything to a consistent case (e.g., uppercase)
- Dropping empty strings if necessary
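
The three steps above can be sketched as follows. The sample values are hypothetical, and uppercase is just one choice of consistent case.

```python
import pandas as pd

df = pd.DataFrame({
    "Asset": [" aapl", "MSFT ", "  ", "tsla"],
    "Type": ["buy", " Sell", "BUY", "sell "],
})

# Strip surrounding whitespace, then convert to uppercase.
for col in ["Asset", "Type"]:
    df[col] = df[col].str.strip().str.upper()

# A value that was only whitespace is now an empty string; drop those rows.
df = df[df["Asset"] != ""]

print(df)
```

If you uppercase `Type`, the valid values become `'BUY'` and `'SELL'`, so any filter on `Type` should use the same case.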


In [None]:
# 💡 Tip: Use .str.strip() and .str.upper()

# Clean Asset column
# YOUR CODE HERE

# Clean Type column
# YOUR CODE HERE


### Step 2.5 – Remove duplicates

If there are duplicate trades, remove them to avoid double-counting in future portfolio valuation.
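
A minimal sketch on made-up trades, assuming a duplicate means a row that is identical in every column:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-05", "2024-01-05", "2024-01-06"]),
    "Asset": ["AAPL", "AAPL", "AAPL"],
    "Quantity": [10, 10, 10],
    "Price": [185.5, 185.5, 186.0],
    "Type": ["BUY", "BUY", "BUY"],
})

# Count fully identical rows before removing them.
n_duplicates = df.duplicated().sum()
print(n_duplicates)

# By default the first occurrence is kept and later copies are dropped.
df = df.drop_duplicates()
print(len(df))
```

Whether two identical rows are really the same trade depends on the data. Two genuine trades can share the same date, asset, quantity, and price, so it is worth inspecting the duplicates before dropping them.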


In [None]:
# 💡 Tip: Use .drop_duplicates()

# YOUR CODE HERE


### Step 2.6 – Reset the DataFrame index

Removing or filtering rows leaves gaps in the index. Reset the index so that it is sequential again.
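
To see the effect, here is a small sketch with a hypothetical four-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Asset": ["AAPL", "MSFT", "TSLA", "NVDA"]})

# Filtering keeps the original labels, so the index is now 0, 2, 3.
df = df[df["Asset"] != "MSFT"]
print(df.index.tolist())

# drop=True discards the old index instead of adding it as a new column.
df = df.reset_index(drop=True)
print(df.index.tolist())
```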


In [None]:
# 💡 Tip: Use .reset_index(drop=True)

# YOUR CODE HERE


### ✅ Final Check: Review the cleaned dataset

Display the first few rows and check:
- Are there still any NaNs?
- Are all values consistent and valid (correct dtypes, positive `Quantity` and `Price`, only valid trade types)?
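
These checks might look like the sketch below. The DataFrame is a made-up example of what a cleaned result could look like, not the actual dataset.

```python
import pandas as pd

# Hypothetical cleaned data with the columns used in this module.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-05", "2024-01-06"]),
    "Asset": ["AAPL", "MSFT"],
    "Quantity": [10.0, 5.0],
    "Price": [185.5, 410.0],
    "Type": ["BUY", "SELL"],
})

print(df.head())                     # first few rows
print(df.isnull().sum())             # NaN count per column
print(df.dtypes)                     # confirm the converted types
print(sorted(df["Type"].unique()))   # remaining trade types
```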


In [None]:
# 💡 Tip: Use .head() and .isnull().sum()

# YOUR CODE HERE


## 📦 Save the cleaned DataFrame

You’ll need this cleaned dataset for the next module (portfolio computation). Let’s save it into a new variable.
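
A minimal sketch, assuming your working DataFrame is called `df` (the sample rows are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "Asset": ["AAPL", "MSFT"],
    "Quantity": [10.0, 5.0],
    "Price": [185.5, 410.0],
})

# .copy() creates an independent DataFrame, so later changes to df
# do not affect df_clean.
df_clean = df.copy()

print(df_clean)
```

A plain assignment (`df_clean = df`) would only create a second name for the same object, which is why `.copy()` is used here.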


In [None]:
# 💡 Tip: Use a meaningful name like df_clean

# YOUR CODE HERE
