## 1. Data Cleaning

### Importing Libraries and Loading the Dataset
Before getting started, we import the libraries we will be using for the data analysis:
- `pandas` is used for tabular data.
- `numpy` will be useful later for numerical operations.
- `matplotlib.pyplot` and `seaborn` are often used for visualisations (we'll use them later).


Next, we load the dataset using `pd.read_csv()`. This reads the CSV file into a DataFrame. 


In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

df = pd.read_csv("DASH_A1.csv")
df

,Date,Close,High,Low,Open,Volume
0,24-02-2022,100.419998,100.919998,85.177002,86.879997,6639000.0
1,01-08-2024,108.199997,112.769997,105.905998,108.620003,7965400.0
2,11-02-2025,193.089996,194.000000,189.500000,190.919998,6771900.0
3,13-04-2021,149.460007,150.360001,143.550003,146.839996,2823500.0
4,17-09-2024,129.880005,131.369995,126.900002,131.350006,2825500.0
...,...,...,...,...,...,...
1179,27-11-2024,178.440002,180.179993,177.699997,179.990005,2031100.0
1180,12-02-2025,200.889999,201.169998,195.197998,198.000000,9989400.0
1181,01-04-2025,182.419998,183.014999,178.259995,182.050003,3740700.0
1182,25-03-2024,137.820007,138.899994,136.740005,137.050003,2162800.0


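The initial output shows that the dates are not in chronological order. As the checks further down confirm, the dataset also contains missing values and duplicate rows, so there are three issues to address before performing analysis: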
- Missing values
- Unsorted dates
- Duplicate rows

We will get to all of these issues below. 
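
Each of these issues can also be confirmed programmatically rather than by eye. The sketch below is a minimal illustration on a small made-up frame (the name `toy` and its values are assumptions, not taken from `DASH_A1.csv`), with one check per issue.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from DASH_A1.csv.
toy = pd.DataFrame({
    "Date": ["03-01-2022", "01-01-2022", "03-01-2022", "02-01-2022"],
    "Close": [10.0, np.nan, 10.0, 11.0],
})
toy["Date"] = pd.to_datetime(toy["Date"], dayfirst=True)

has_missing = bool(toy.isna().any().any())             # any NaN anywhere?
is_sorted = bool(toy["Date"].is_monotonic_increasing)  # dates in chronological order?
n_duplicates = int(toy.duplicated().sum())             # rows repeating an earlier row

print(has_missing, is_sorted, n_duplicates)
```

On the toy frame this reports a missing value, unsorted dates, and one duplicate row.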

### Data Overview
Before we begin cleaning, it is important to understand what the dataset looks like. We use `df.info()` to inspect each column's data type and non-null count, along with the overall size of the dataset.

In [23]:
print("Dataset Structure:")
df.info()

Dataset Structure:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1184 entries, 0 to 1183
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    1184 non-null   object 
 1   Close   1159 non-null   float64
 2   High    1160 non-null   float64
 3   Low     1168 non-null   float64
 4   Open    1167 non-null   float64
 5   Volume  1158 non-null   float64
dtypes: float64(5), object(1)
memory usage: 55.6+ KB


We also check for missing values in each column to identify where cleaning is needed. We will address this issue further down.

In [24]:
print("Missing Values per Column:")
print(df.isnull().sum())

Missing Values per Column:
Date       0
Close     25
High      24
Low       16
Open      17
Volume    26
dtype: int64
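
Raw counts are easier to judge relative to the dataset size; for example, 25 missing `Close` values out of 1,184 rows is about 2.1%. As a rough sketch on made-up data (the frame `toy` is an assumption for illustration), the share of missing values per column and the number of affected rows could be computed like this:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Close":  [1.0, np.nan, 3.0, 4.0],
    "Volume": [np.nan, np.nan, 30.0, 40.0],
})

missing_share = toy.isna().mean() * 100             # percentage of NaN per column
rows_affected = int(toy.isna().any(axis=1).sum())   # rows with at least one NaN

print(missing_share.round(1).to_dict())
print(rows_affected)
```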


### Sorting Data and Checking for Duplicates
Since this is time series data, where each row corresponds to a trading day for DoorDash (DASH), it is helpful to set the `Date` column as the index. A date index is easier to work with than the default integer index. The dates are stored as day-first strings (for example `24-02-2022`), so we parse them with `dayfirst=True` before setting the index.

In [25]:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df.set_index('Date', inplace=True)
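
The `dayfirst=True` argument matters because a string such as `01-08-2024` could be read as either 1 August or 8 January. As a small illustration, and assuming every date in the file follows the `DD-MM-YYYY` pattern seen in the preview, an explicit `format` makes the intended reading unambiguous and raises an error for any value that does not match:

```python
import pandas as pd

# Three date strings copied from the preview of the raw data.
raw = pd.Series(["24-02-2022", "01-08-2024", "11-02-2025"])

# Assumes every entry follows DD-MM-YYYY; a non-matching value raises ValueError.
parsed = pd.to_datetime(raw, format="%d-%m-%Y")

iso_dates = parsed.dt.strftime("%Y-%m-%d").tolist()
print(iso_dates)
```

Here `01-08-2024` is read as 1 August 2024.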

Now that we have set the date as the index, we can sort the data in chronological order. 

In [26]:
df.sort_index(inplace=True)
df

Date,Close,High,Low,Open,Volume
2020-12-09,189.509995,195.500000,163.800003,182.000000,25373700.0
2020-12-10,186.000000,187.695007,172.636002,179.710007,
2020-12-11,175.000000,182.000000,168.250000,176.520004,4760600.0
2020-12-14,160.000000,170.000000,151.199997,169.100006,7859600.0
2020-12-15,158.889999,161.419998,153.759995,157.100006,5017000.0
...,...,...,...,...,...
2025-06-09,217.490005,219.830002,216.955002,218.029999,2710300.0
2025-06-10,214.970001,219.210007,210.927002,216.589996,3916700.0
2025-06-11,217.800003,219.529999,212.240005,214.184998,3091500.0
2025-06-12,216.600006,219.419998,215.675003,218.080002,2510400.0


Now, let's focus on duplicates. We check for two types:
- **Duplicate rows**: these are rows where all values (across every column) are identical.
- **Duplicate dates**: since this is time series data, each date should only appear once. 

In [27]:
# Rows whose column values repeat an earlier row (the index is not compared)
duplicate_rows = df[df.duplicated()]
print(f"Number of duplicate rows: {duplicate_rows.shape[0]}")
# keep=False shows every copy, including the first occurrence
print(df[df.duplicated(keep=False)])

# Dates that repeat an earlier index entry
duplicate_dates = df[df.index.duplicated()]
print(f"Number of duplicate dates: {duplicate_dates.shape[0]}")
print(df[df.index.duplicated(keep=False)])

Number of duplicate rows: 50
                 Close        High         Low        Open     Volume
Date                                                                 
2020-12-15  158.889999  161.419998  153.759995  157.100006  5017000.0
2020-12-15  158.889999  161.419998  153.759995  157.100006  5017000.0
2021-01-25  191.809998  215.389999  191.309998  194.330002  3430400.0
2021-01-25  191.809998  215.389999  191.309998  194.330002  3430400.0
2021-05-10  120.449997  125.089996  118.559998  124.900002  2277500.0
...                ...         ...         ...         ...        ...
2025-03-20  192.929993  195.210007  187.865005  188.070007  5600800.0
2025-04-17  181.240005  182.889999  177.979996  181.250000  2496100.0
2025-04-17  181.240005  182.889999  177.979996  181.250000  2496100.0
2025-04-28  187.880005  190.690002  186.550003  188.809998  3083600.0
2025-04-28  187.880005  190.690002  186.550003  188.809998  3083600.0

[100 rows x 5 columns]
Number of duplicate dates: 50
       

From our check, the duplicate dates align with the duplicated rows: both checks report 50, suggesting that 50 rows are exact copies of others. This is a problem for time series data, because each date should represent a single trading day. Let's go ahead and delete the duplicate rows, keeping the first occurrence of each.

In [28]:
df.drop_duplicates(inplace=True)
print(f"Remaining duplicate rows: {df.duplicated().sum()}")

Remaining duplicate rows: 0
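
Note that `duplicated()` and `drop_duplicates()` compare column values only and ignore the index, so `Date` is no longer part of the comparison once it has become the index. A minimal sketch of a stricter variant, on a made-up frame (`toy` and its values are assumptions for illustration), moves `Date` back into the columns for the comparison and then confirms that the index is unique:

```python
import pandas as pd

# Toy data for illustration only.
idx = pd.to_datetime(["2021-01-04", "2021-01-04", "2021-01-05", "2021-01-06"])
toy = pd.DataFrame({"Close": [10.0, 10.0, 10.0, 12.0]}, index=idx)
toy.index.name = "Date"

# Values-only comparison: 2021-01-05 is dropped although its date is unique.
values_only = toy.drop_duplicates()

# Date-aware comparison: only the repeated 2021-01-04 row is dropped.
date_aware = (
    toy.reset_index()
       .drop_duplicates()
       .set_index("Date")
)

print(len(values_only), len(date_aware), date_aware.index.is_unique)
```

With real price data, identical values on two different dates are unlikely, but the date-aware form removes the dependence on that assumption.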


### Missing Data
After sorting and removing duplicates, the next step is to handle any missing values in the dataset. From our earlier check, we found that there are several missing entries per column. We follow the firm's official cleaning rules to handle missing data, one column at a time. 

Let's start with the `Close` column. Missing values are forward-filled, meaning each gap takes the most recent earlier closing price, which avoids look-ahead bias.

In [29]:
df['Close'] = df['Close'].ffill()
print(f"Remaining missing values in 'Close': {df['Close'].isnull().sum()}")

Remaining missing values in 'Close': 0
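
To illustrate why forward-filling avoids look-ahead bias, the toy series below (made-up values, not from the dataset) compares it with backward-filling. `ffill` carries the last known price forward, whereas `bfill` would pull in a price that was not yet known on the missing day.

```python
import numpy as np
import pandas as pd

# Toy prices for illustration only.
close = pd.Series(
    [100.0, np.nan, 104.0],
    index=pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"]),
)

forward = close.ffill()    # 2021-01-05 takes the 2021-01-04 price
backward = close.bfill()   # 2021-01-05 takes the 2021-01-06 price (future data)

print(forward.tolist())
print(backward.tolist())
```

One caveat: `ffill` cannot fill a gap in the very first row, since there is no earlier value to carry forward. The zero count above indicates that this did not occur here.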


Now let's move on to the `Open` column, which should be filled using the previous day's `Close` price.

In [30]:
df['Open'] = df['Open'].fillna(df['Close'].shift(1))
print(f"Remaining missing values in 'Open': {df['Open'].isnull().sum()}")



Remaining missing values in 'Open': 0
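
As a small illustration of how `shift(1)` lines up each row with the previous row's `Close`, consider the made-up frame below (the values are assumptions for demonstration only):

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame(
    {"Open": [99.0, np.nan, 103.0], "Close": [100.0, 102.0, 105.0]},
    index=pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"]),
)

prev_close = toy["Close"].shift(1)             # lagged by one row: NaN, 100.0, 102.0
toy["Open"] = toy["Open"].fillna(prev_close)   # only missing Opens are replaced

print(toy["Open"].tolist())
```

`shift(1)` moves values by one row, not by one calendar day, so it returns the previous trading day's close only because the data has already been sorted and de-duplicated.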


Next up are the `High` and `Low` columns, which should be filled using the average value of that column for the month in which the missing entry appears.

In [33]:
# Temporary helper column holding each row's calendar month (e.g. 2021-01)
df['Month'] = df.index.to_period('M')
# transform('mean') returns the month's mean aligned to every row of that month
monthly_high_mean = df.groupby('Month')['High'].transform('mean')
df['High'] = df['High'].fillna(monthly_high_mean)
monthly_low_mean = df.groupby('Month')['Low'].transform('mean')
df['Low'] = df['Low'].fillna(monthly_low_mean)
df.drop('Month', axis=1, inplace=True)

print(f"Remaining missing values in 'High': {df['High'].isnull().sum()}")
print(f"Remaining missing values in 'Low' : {df['Low'].isnull().sum()}")

Remaining missing values in 'High': 0
Remaining missing values in 'Low' : 0
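
The monthly-mean fill can be seen more clearly on a toy series. The sketch below uses made-up values and groups directly on the month derived from the index, which is one possible alternative to a helper column rather than the only way to do it:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
idx = pd.to_datetime(
    ["2021-01-04", "2021-01-05", "2021-01-06", "2021-02-01", "2021-02-02"]
)
high = pd.Series([10.0, np.nan, 14.0, 20.0, np.nan], index=idx)

# Group by the calendar month of each index entry; NaNs are skipped in the mean.
month = high.index.to_period("M")
monthly_mean = high.groupby(month).transform("mean")

filled = high.fillna(monthly_mean)
print(filled.tolist())
```

The January gap receives the January mean of 12.0 and the February gap receives 20.0. If a month had no observed values at all, its mean would be `NaN` and the gap would remain unfilled.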


Lastly, let's move on to the `Volume` column, where missing values should be filled with 0 if the `Open` and `Close` prices are equal on a day, indicating no trading activity. Otherwise, missing values should be filled with the median trading volume across the dataset.

In [36]:
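missing_volume = df['Volume'].isna()
# Take the median from observed volumes before any gaps are filled,
# so the zeros filled in below do not affect it
volume_median = df['Volume'].median()

# No price change on the day: treat as no trading activity
zero_vol_mask = missing_volume & (df['Close'] == df['Open'])
df.loc[zero_vol_mask, 'Volume'] = 0

# Price changed on the day: use the dataset-wide median volume
non_zero_vol_mask = missing_volume & (df['Close'] != df['Open'])
df.loc[non_zero_vol_mask, 'Volume'] = volume_median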

print(f"Remaining missing values in 'Volume': {df['Volume'].isnull().sum()}")

Remaining missing values in 'Volume': 0
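
The same two-branch rule can also be expressed as a single vectorised step. The following is a sketch on made-up data, not a replacement for the cell above; it assumes the median should come from observed volumes only, so it is computed before anything is filled.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Open":   [10.0, 11.0, 12.0, 13.0, 14.0],
    "Close":  [10.5, 11.0, 12.5, 13.5, 14.5],
    "Volume": [100.0, np.nan, 300.0, np.nan, 500.0],
})

observed_median = toy["Volume"].median()   # median of the non-missing volumes
missing = toy["Volume"].isna()
flat_day = toy["Close"] == toy["Open"]

# Missing + flat day -> 0; missing + price moved -> median; otherwise keep value.
toy["Volume"] = np.select(
    [missing & flat_day, missing & ~flat_day],
    [0.0, observed_median],
    default=toy["Volume"],
)

print(toy["Volume"].tolist())
```

The second row is missing on a flat day and becomes 0, while the fourth row is missing on a day when the price moved and receives the median of 300.0.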


Now that we have addressed each column, let's carry out a final check for any remaining missing values:

In [37]:
df.isnull().sum()

Close     0
High      0
Low       0
Open      0
Volume    0
dtype: int64

At this stage, we have successfully cleaned the dataset following the firm's official guidelines. We will save the cleaned dataset to a new CSV file for reuse.

In [38]:
df.to_csv('cleaned_data.csv')
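
Because `Date` is the index, `to_csv` writes it as the first column of the file. When the file is loaded again, the date index has to be restored explicitly. A minimal sketch of that round trip, using a made-up frame and an in-memory buffer in place of `cleaned_data.csv`:

```python
import io
import pandas as pd

# Toy frame standing in for the cleaned data; values are illustrative.
toy = pd.DataFrame(
    {"Close": [100.0, 102.0], "Volume": [1000.0, 0.0]},
    index=pd.DatetimeIndex(["2021-01-04", "2021-01-05"], name="Date"),
)

buffer = io.StringIO()   # in-memory stand-in for a file on disk
toy.to_csv(buffer)       # the index is written as the first column
buffer.seek(0)

# Restore Date as a DatetimeIndex when reading back.
reloaded = pd.read_csv(buffer, index_col="Date", parse_dates=True)

print(type(reloaded.index).__name__)
print(reloaded["Close"].tolist())
```

Without `index_col` and `parse_dates`, the dates would come back as an ordinary string column and the earlier parsing step would need to be repeated.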