In [1]:
from Aeterna_Func import Generate_Data
import pandas as pd

# 🧼 Module 1: Data Exploration

## 🎯 Objective
Explore a messy dataset containing financial trades and identify problems that must be cleaned before analysis.  

By the end of this module, you'll be able to clean a raw trade log and prepare it to calculate portfolio value — a key task in any financial data pipeline.

---

## 📘 Assignment 1: First Exploration

Run the function below to generate your raw dataset.


In [2]:
# 💡 Tip: This function returns a DataFrame. Save it in a variable so you can explore it later.
df = Generate_Data()

## Step 1.1 – Display the first few rows

You should now display the first 5 rows of the dataset to understand its structure.

- What kind of information is stored?
- Do you see anything strange already?


In [3]:
# 💡 Tip: Use a method that shows the first 5 rows of a DataFrame.

# YOUR CODE HERE
df.head()

| | Asset | Quantity | Price | Date | Type |
|---|---|---|---|---|---|
| 0 | 123 | twenty | 1255188.12 | | Hold |
| 1 | TSLA | 9 | 916.0 | 2024-07-14 | Sell |
| 2 | | | | 2023-15-01 | |
| 3 | AAPL | | | yesterday | Hold |
| 4 | | | 3111455.83 | 2022-02-30 | Purchase |


## Step 1.2 – Check the shape of the dataset

How many rows and columns does it contain?


In [4]:
# 💡 Tip: Use `.shape` to see the dimensions of the DataFrame.

# YOUR CODE HERE
df.shape

(30, 5)

## Step 1.3 – Check column names

This tells you which features are present and how to reference them in code.


In [5]:
# 💡 Tip: Use `.columns` to access the column names.

# YOUR CODE HERE
df.columns

Index(['Asset', 'Quantity', 'Price', 'Date', 'Type'], dtype='object')

## Step 1.4 – Check data types

Each column must have the correct data type to allow calculations and filtering.

- Quantities and prices → numeric
- Dates → datetime
- Asset names and trade types → text

In [6]:
# 💡 Tip: Use `.dtypes` to inspect the type of each column.

# YOUR CODE HERE
df.dtypes

Asset       object
Quantity    object
Price       object
Date        object
Type        object
dtype: object
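
Every column comes back as `object`, which usually means text and numbers are mixed in the same column. One way to see how much of a column is actually numeric is to coerce it and list what fails. The sketch below is only an illustration: `sample` is a hand-built frame using a few values visible in this notebook's outputs, not the real result of `Generate_Data()`.

```python
import pandas as pd

# Hypothetical mini-sample built from values shown in this notebook;
# the real frame comes from Generate_Data().
sample = pd.DataFrame({
    "Quantity": ["twenty", 9, None, 11, 20.5, ""],
    "Price": [1255188.12, 916.0, "NaN", None, "free", 591.58],
})

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
quantity_num = pd.to_numeric(sample["Quantity"], errors="coerce")
price_num = pd.to_numeric(sample["Price"], errors="coerce")

# Entries that were present in the raw data but are not usable numbers.
bad_quantity = sample.loc[quantity_num.isna() & sample["Quantity"].notna(), "Quantity"]
bad_price = sample.loc[price_num.isna() & sample["Price"].notna(), "Price"]

print(quantity_num.dtype, price_num.dtype)
print("Unparseable quantities:", bad_quantity.tolist())
print("Unparseable prices:", bad_price.tolist())
```

Nothing is overwritten here; the coerced Series are kept separate so the raw column stays available for inspection.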

## Step 1.5 – Check for missing values

Missing values are one of the most common problems in real-world financial data.

Let’s count how many NaNs appear in each column.


In [7]:
# 💡 Tip: Use `.isnull().sum()` to get missing values per column.

# YOUR CODE HERE
df.isnull().sum()

Asset       0
Quantity    7
Price       6
Date        1
Type        1
dtype: int64
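
Note that `.isnull()` only counts real missing values (`None` / `NaN`). Empty strings and the text `'NaN'` are ordinary strings, so they are not counted, which is likely why `Asset` shows 0 here. A minimal sketch of counting such placeholders separately, assuming a hand-built `sample` frame and an assumed placeholder list (both are illustrative, not part of the assignment code):

```python
import pandas as pd

# Hypothetical mini-sample using values that appear in this notebook.
sample = pd.DataFrame({
    "Asset": [123, "TSLA", "", "AAPL"],
    "Price": [1255188.12, 916.0, "NaN", None],
})

# Assumed list of strings that stand in for "missing"; extend as needed.
placeholders = ["", "NaN"]

true_missing = sample.isnull().sum()              # real None/NaN values only
hidden_missing = sample.isin(placeholders).sum()  # strings that act as missing

print(true_missing)
print(hidden_missing)
```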

## Step 1.6 – View basic statistics

Now let’s look at statistical summaries of the dataset.

This helps you spot:
- Outliers
- Inconsistencies
- Unexpected values


In [8]:
# 💡 Tip: Use `.describe(include='all')` to cover both numeric and categorical data.

# YOUR CODE HERE
df.describe(include='all')

| | Asset | Quantity | Price | Date | Type |
|---|---|---|---|---|---|
| count | 30 | 23 | 24 | 29 | 29 |
| unique | 7 | 16 | 18 | 18 | 6 |
| top | | twenty | free | | Sell |
| freq | 8 | 4 | 4 | 5 | 9 |
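
`describe` reports only the single most frequent value per column. To see the full distribution of a column, `value_counts` can help. The sketch below uses a hypothetical Series of trade types (the values are illustrative, and the repeated `'Sell'` is an assumption added so that the counts differ):

```python
import pandas as pd

# Hypothetical trade types, invented for illustration only.
types = pd.Series(["Hold", "Sell", "", "Purchase", "Buy", 123, None, "Sell"])

# dropna=False keeps missing values in the tally instead of silently dropping them.
counts = types.value_counts(dropna=False)
print(counts)
```

Seeing `'Purchase'` next to `'Buy'`, or a number where a label is expected, is easier in this view than in the `top`/`freq` rows alone.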


## Step 1.7 – Look at unique values

Print the unique values in each column to identify:
- Inconsistencies in categories (e.g. "BUY" vs "buy")
- Invalid formats (e.g. bad dates or prices)

Loop through columns and print the `.unique()` values.

In [9]:
# 💡 Tip: Use a for-loop to loop over `df.columns`, and call `.unique()` on each column.

# YOUR CODE HERE
for column in df.columns:
    print("Column:" + column)
    print(df[column].unique())

Column:Asset
[123 'TSLA' '' 'AAPL' 'AMZN' 'MSFT' 'NFLX']
Column:Quantity
['twenty' 9 None 11 51 96 -43 2 20.5 79 46 '' 95 58 16 59 15]
Column:Price
[1255188.12 916.0 'NaN' None 3111455.83 591.58 720.35 599.67 1145.42
 'free' 1303.32 265.12 1182.83 6912771.46 881016.98 1864646.73 6573810.99
 667.38 1455.88]
Column:Date
['' '2024-07-14' '2023-15-01' 'yesterday' '2022-02-30' '2024-12-07'
 '2024-05-26' '2024-07-02' '2025-02-04' '2024-12-28' 'not a date'
 '2025-03-23' '2025-03-16' '2025-04-06' '2024-09-01' '2024-08-20' None
 '2024-09-06' '2024-10-28']
Column:Type
['Hold' 'Sell' '' 'Purchase' 'Buy' 123 None]
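
Some of the dates above look valid at first glance but are not (`'2023-15-01'` has month 15, `'2022-02-30'` is a day that does not exist). One possible way to flag them is to parse with a strict format and list what fails. This is a sketch under the assumption that the intended format is `YYYY-MM-DD`; the `dates` Series is a hand-built subset of the values printed above.

```python
import pandas as pd

# Subset of the Date values printed above, rebuilt by hand for illustration.
dates = pd.Series([
    "", "2024-07-14", "2023-15-01", "yesterday",
    "2022-02-30", "not a date", None,
])

# Assumed target format: YYYY-MM-DD. Anything else becomes NaT.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Entries that were present in the raw data but are not valid dates.
invalid = dates[parsed.isna() & dates.notna()]

print(parsed)
print("Invalid date strings:", invalid.tolist())
```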


## Step 1.8 – Check for duplicate rows

Duplicate trades can corrupt portfolio calculations.
Count how many fully duplicated rows exist.

In [14]:
# 💡 Tip: Use `.duplicated().sum()` to count full-row duplicates.

# YOUR CODE HERE
print(df.duplicated().sum())

0
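
A count of 0 only rules out rows that match in every column. Two records of the same trade that differ in one field (for example `'Sell'` vs `'sell'`) would not be counted. A minimal sketch of checking a subset of columns instead; the rows and the choice of `key_cols` are hypothetical and only illustrate the idea:

```python
import pandas as pd

# Hypothetical trades, invented for illustration only.
trades = pd.DataFrame({
    "Asset": ["TSLA", "TSLA", "AAPL"],
    "Quantity": [9, 9, 11],
    "Price": [916.0, 916.0, 591.58],
    "Date": ["2024-07-14", "2024-07-14", "2024-12-07"],
    "Type": ["Sell", "sell", "Buy"],
})

# Full-row check: the two TSLA rows differ in Type, so nothing is flagged.
full_dupes = trades.duplicated().sum()

# Assumed key columns; keep=False marks every row in a duplicate group.
key_cols = ["Asset", "Quantity", "Price", "Date"]
partial_dupes = trades[trades.duplicated(subset=key_cols, keep=False)]

print(full_dupes)
print(partial_dupes)
```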


## ✅ Summary

By now you should be able to:
- Inspect a DataFrame's structure
- Identify missing data and inconsistent types
- Spot potential problems with categorical values
- Detect duplicate rows

In the next assignment, you’ll **fix these issues** and transform this dataset into something usable for portfolio analysis.

📌 Final Goal: Write a Python function that takes this raw DataFrame as input and returns the **total portfolio value over time**.  

You're on the right track — let's clean things up next 🔧


The file was messy, and it will be fun to clean. Exploring it was insightful: it is a good way to understand the task at hand and to plan the work in advance, so you know what needs to be done and how much effort it will take. Most of the commands were very straightforward and user-friendly.

## 📦 Save the dirty DataFrame

You’ll need this dirty dataset for the next module (data cleaning). Let’s save it to a CSV file.

In [15]:
# 💡 Tip: Use df.to_csv() and set index=False to avoid saving the index column

# YOUR CODE HERE
df.to_csv("Data_To_Clean", index=False)
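
One thing to keep in mind when reloading the file: a CSV round trip may not reproduce the frame exactly. With default settings, `pd.read_csv` treats empty fields and the text `NaN` as missing, so the missing-value counts can change. The sketch below demonstrates this on a hypothetical two-column frame, writing to an in-memory buffer instead of a file:

```python
import io

import pandas as pd

# Hypothetical mini-sample; Price mixes a number, text, None and an empty string.
sample = pd.DataFrame({
    "Asset": ["TSLA", "AAPL", "AMZN", "MSFT", "NFLX"],
    "Price": [916.0, "NaN", None, "free", ""],
})

# Write to an in-memory buffer, then read it back with default settings.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

missing_before = sample["Price"].isnull().sum()
missing_after = reloaded["Price"].isnull().sum()

print("Missing before saving:", missing_before)
print("Missing after reloading:", missing_after)
```

If the next module needs the frame exactly as it is in memory, `DataFrame.to_pickle` is one option that preserves the original Python objects; whether that fits the course setup is an assumption to check.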

## 📘 Report

Now, write a small report on the findings about the dataset!