Basic Cleaning Example

Install the yfinance package, which downloads stock data from Yahoo Finance.

In [None]:
%pip install yfinance

Import yfinance to download the data, pandas to clean and handle it, and matplotlib to visualise the results.

In [None]:
import yfinance as yf 
import pandas as pd
import matplotlib.pyplot as plt

This downloads the Apple (AAPL) stock data for 2024 and prints the first few rows. The end date is exclusive, so "2025-01-01" covers trading days up to the end of 2024.

In [None]:
apple_df = yf.download("AAPL", start="2024-01-01", end="2025-01-01")
print(apple_df.head()) # displays the first few rows of the data
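
Depending on the installed yfinance version, the downloaded frame may carry two-level column labels (price field, then ticker) even for a single ticker. The sketch below uses a small made-up frame, not real AAPL prices, to show one way to flatten such columns with pandas. Whether your own download needs this is an assumption to check by looking at `apple_df.columns`.

```python
import pandas as pd

# Made-up values standing in for a single-ticker download with two-level columns.
columns = pd.MultiIndex.from_tuples(
    [("Close", "AAPL"), ("Open", "AAPL")], names=["Price", "Ticker"]
)
demo_df = pd.DataFrame(
    [[101.0, 100.0], [102.5, 101.5]],
    index=pd.to_datetime(["2024-01-02", "2024-01-03"]),
    columns=columns,
)

# Flatten only if the columns really are a MultiIndex; otherwise leave them alone.
if isinstance(demo_df.columns, pd.MultiIndex):
    demo_df.columns = demo_df.columns.get_level_values(0)

print(list(demo_df.columns))
```

After flattening, `demo_df["Close"]` is a plain Series, which is what the later plotting and saving steps are easiest to work with.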

Before cleaning the data, inspect its structure. The command below displays the index type (DatetimeIndex), the column data types and the non-null count of each column, which reveals missing values (if any).


In [None]:
apple_df.info()  # info() prints its report itself and returns None, so no print() is needed

This command shows summary statistics for each price column, including the mean, std, min and max. This helps detect extreme outliers or implausible data (e.g., negative prices).

In [None]:
print(apple_df.describe())
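
As a complement to the summary statistics, the following sketch flags suspicious rows explicitly: non-positive prices and unusually large day-to-day moves. It runs on a small made-up price series, and the 20% threshold is an illustrative assumption rather than a rule for the Apple data.

```python
import pandas as pd

# Made-up closing prices with one spike and one impossible negative value.
demo_df = pd.DataFrame(
    {"Close": [100.0, 101.0, 150.0, 102.0, -5.0]},
    index=pd.to_datetime(
        ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05", "2024-01-08"]
    ),
)

# Rows whose price is zero or negative.
non_positive = demo_df[demo_df["Close"] <= 0]

# Rows whose absolute change from the previous row exceeds the threshold.
threshold = 0.20  # assumed cut-off for a "large" daily move
daily_change = demo_df["Close"].pct_change().abs()
big_moves = demo_df[daily_change > threshold]

print(non_positive)
print(big_moves)
```

A flagged row is only a candidate for review: a large move can be a genuine market event rather than a data error.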

This command counts the missing values in each column, which helps you decide whether to impute or drop the affected rows.

In [None]:
print(apple_df.isnull().sum())

> Forward fill any missing values: each gap takes the value from the previous available day, which keeps the price series continuous.

In [26]:
apple_df = apple_df.ffill()
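
To see what forward filling does and where it stops, here is a minimal sketch on a made-up series (the values are illustrative, not Apple prices). It shows that gaps take the last known value, while a missing value at the very start has nothing before it and stays missing.

```python
import numpy as np
import pandas as pd

# Made-up closing prices: one gap at the start and a two-day gap in the middle.
demo = pd.Series(
    [np.nan, 100.0, np.nan, np.nan, 103.0],
    index=pd.to_datetime(
        ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05", "2024-01-08"]
    ),
    name="Close",
)

filled = demo.ffill()

# The leading NaN cannot be filled from an earlier row, so one gap remains.
remaining = filled.isnull().sum()
print(filled)
print("Missing values left:", remaining)

# One possible follow-up is to drop rows that are still missing.
cleaned = filled.dropna()
```

Re-running `isnull().sum()` after the fill is a cheap way to confirm whether any leading gaps are left.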

> Remove duplicate dates, keeping the first row for each date.

In [32]:
apple_df = apple_df[~apple_df.index.duplicated(keep='first')]
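
The sketch below applies the same index-based filter to a made-up frame with one repeated date, to show which row survives. The values are illustrative only.

```python
import pandas as pd

# Made-up data: 2024-01-03 appears twice with different prices.
demo_df = pd.DataFrame(
    {"Close": [100.0, 101.0, 999.0, 102.0]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-03", "2024-01-04"]),
)

# duplicated(keep="first") marks every repeat of an index label except the first.
mask = demo_df.index.duplicated(keep="first")
deduped = demo_df[~mask]

print("Duplicate rows found:", mask.sum())
print(deduped)
```

With `keep="first"` the earlier row wins, so it is worth checking that the first occurrence is the one you trust before discarding the rest.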

This turns the Date index into a regular column, which makes it easier to use later for the graph.

In [36]:
apple_df.reset_index(inplace=True)

> Validate the data by confirming that there are no zero or negative Open or Close prices. The column types themselves are listed in the info() output earlier.

In [None]:
print((apple_df[["Open", "Close"]] > 0).all()) 
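
A slightly fuller validation could test the column types as well as the sign of the prices. The helper below is a sketch: `validate_prices` is a hypothetical name introduced here, it assumes single-level column labels, and the two demo frames are made up.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def validate_prices(df, price_columns):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for column in price_columns:
        if column not in df.columns:
            problems.append(f"{column}: column is missing")
        elif not is_numeric_dtype(df[column]):
            problems.append(f"{column}: not numeric (dtype {df[column].dtype})")
        elif not (df[column] > 0).all():
            # Missing values also fail this comparison, so they are reported here too.
            problems.append(f"{column}: has values that are missing or not positive")
    return problems


# Made-up frames: one clean, one with text prices and a negative close.
good_df = pd.DataFrame({"Open": [100.0, 101.0], "Close": [100.5, 101.5]})
bad_df = pd.DataFrame({"Open": ["100", "101"], "Close": [100.5, -1.0]})

print(validate_prices(good_df, ["Open", "Close"]))
print(validate_prices(bad_df, ["Open", "Close"]))
```

Returning a list of messages instead of a single True/False makes it easier to see which column failed and why.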

If you want, you can save the cleaned data to a CSV file.

In [34]:
apple_df.to_csv("apple_stock_clean.csv", index=False)
print("Clean data saved!")

Clean data saved!
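
When the saved file is read back later, the dates come in as plain text unless pandas is told to parse them. The sketch below shows the round trip on a made-up frame, using an in-memory buffer as a stand-in for apple_stock_clean.csv so that it runs without touching the disk.

```python
import io

import pandas as pd

# Made-up frame with the same shape of columns as the cleaned data.
demo_df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2024-01-02", "2024-01-03"]),
        "Close": [100.0, 101.0],
    }
)

# Write to an in-memory buffer instead of a real file.
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates converts the Date column back to datetime values on load.
reloaded = pd.read_csv(buffer, parse_dates=["Date"])
print(reloaded.dtypes)
```

For the real file, the same `parse_dates=["Date"]` argument would be passed to `pd.read_csv("apple_stock_clean.csv", ...)`, assuming the Date column was saved as shown above.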


Plot the closing price over time to spot any anomalies and see trends.

In [None]:
plt.figure(figsize=(10,5))
plt.plot(apple_df["Date"], apple_df["Close"], label="Apple Close Price", linewidth=2)
plt.title("Apple Stock Price (Cleaned Data)")
plt.xlabel("Date")
plt.ylabel("Price (USD)")
plt.legend()
plt.show()


These steps help ensure that a model trains on a realistic, consistent dataset, a crucial foundation before any predictive modelling.