# Data Cleaning

In compliance with the firm's data cleaning protocols, we have performed the following actions on DoorDash's raw stock price data.

First, we import the necessary libraries and convert the **Date** column to the correct format.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("DASH_A1.csv")

df["Date"] = pd.to_datetime(df.Date, dayfirst=True)
df = df.set_index("Date").sort_index().drop_duplicates()


FileNotFoundError: [Errno 2] No such file or directory: 'DASH_A1.csv'
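As a minimal sketch of what the loading step does, the example below uses a small in-memory table as a hypothetical stand-in for `DASH_A1.csv` (the values are invented for illustration). It drops repeated rows while **Date** is still a column, because `drop_duplicates` compares column values only and ignores the index.

```python
import io
import pandas as pd

# Hypothetical stand-in for DASH_A1.csv: day-first dates, unsorted, one repeated row.
raw = io.StringIO(
    "Date,Close\n"
    "03/01/2023,50.0\n"
    "02/01/2023,49.0\n"
    "03/01/2023,50.0\n"
)
demo = pd.read_csv(raw)

# dayfirst=True reads 03/01/2023 as 3 January rather than 1 March.
demo["Date"] = pd.to_datetime(demo["Date"], dayfirst=True)

# Remove repeated rows first, so that the date takes part in the comparison,
# then index by date and sort chronologically.
demo = demo.drop_duplicates().set_index("Date").sort_index()
print(demo)
```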

Next, we fill in missing values in the Close column using forward-fill to avoid look-ahead bias.

In [None]:
df.Close = df.Close.ffill()
# We clean Close first because cleaning Open requires the Close data.
df
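To illustrate why forward-fill avoids look-ahead bias, the sketch below uses invented prices. Each gap is filled from the most recent earlier value only, so a gap at the very start of the series stays missing.

```python
import numpy as np
import pandas as pd

# Invented closing prices with a leading gap and a mid-series gap.
idx = pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05"])
close = pd.Series([np.nan, 50.0, np.nan, 52.0], index=idx)

# ffill copies the last known value forward; it never looks at later rows.
filled = close.ffill()
print(filled)
```

If the real data starts with a missing **Close**, that row would still need separate handling after this step.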

Next, we fill in missing values in the **Open** column with the previous day's **Close**. This is an approximation that ignores overnight trading.

In [None]:
df.Open = df.Open.fillna(df.Close.shift())
df
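The alignment performed by `shift()` can be seen on a small invented table. Only the missing **Open** is replaced, and it takes the **Close** from the row above.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-04"])
demo = pd.DataFrame(
    {"Open": [48.5, np.nan, 51.0], "Close": [49.0, 50.0, 52.0]},
    index=idx,
)

# shift() moves each Close down one row, lining it up with the next day's Open.
demo["Open"] = demo["Open"].fillna(demo["Close"].shift())
print(demo)
```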

To support accurate analysis of the **High** and **Low** columns, we need a close approximation of these values when they are missing. Missing values in the **High** and **Low** columns are filled with the mean of the respective column for the same month. Because the grouping key is the month name, the same month from different years is pooled if the data spans more than one year.

In [None]:
df["Month"] = df.index.month_name()
# Because Date is the index, df.index is already a DatetimeIndex, which exposes datetime properties directly, so we do not need the .dt accessor.
df["High"] = df["High"].fillna(df.groupby("Month")["High"].transform("mean"))
# We use transform to make sure each row gets the mean without collapsing the DataFrame structure.
df["Low"] = df["Low"].fillna(df.groupby("Month")["Low"].transform("mean"))
df
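The sketch below, on invented data, compares grouping by month name with grouping by year-month period. The second variant is an assumption about what may be preferable for multi-year data, not the method used above.

```python
import numpy as np
import pandas as pd

# Two Januaries from different years, each with one missing High.
idx = pd.to_datetime(["2022-01-10", "2022-01-11", "2023-01-10", "2023-01-11"])
demo = pd.DataFrame({"High": [10.0, np.nan, 100.0, np.nan]}, index=idx)

# Grouping by month name pools January 2022 with January 2023.
by_name = demo["High"].fillna(
    demo.groupby(demo.index.month_name())["High"].transform("mean")
)

# Grouping by year-month period keeps each calendar month separate.
by_period = demo["High"].fillna(
    demo.groupby(demo.index.to_period("M"))["High"].transform("mean")
)
print(pd.DataFrame({"by_name": by_name, "by_period": by_period}))
```

In both variants `transform("mean")` returns one value per row, so the result lines up with the original column and can be passed straight to `fillna`.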

To account for the lack of trading activity, we fill in missing **Volume** data with zero when the **Open** and **Close** prices are equal. When this is not the case, we use the median value.

In [None]:
volume_median = df["Volume"].median()
df.loc[(df["Open"] == df["Close"]) & (df["Volume"].isnull()), "Volume"] = 0
# We use loc to select the rows and & to combine the two conditions.
df.loc[(df["Open"] != df["Close"]) & (df["Volume"].isnull()), "Volume"] = volume_median
df
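The two-branch rule can be checked on a small invented table. The median is taken before any filling, so it reflects observed volumes only.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "Open": [50.0, 50.0, 51.0, 52.0],
        "Close": [50.5, 50.0, 51.5, 52.5],
        "Volume": [1000.0, np.nan, np.nan, 3000.0],
    }
)

# Median of the observed volumes (1000 and 3000).
volume_median = demo["Volume"].median()

# Build both masks once, before any assignment changes the column.
missing = demo["Volume"].isna()
flat = demo["Open"] == demo["Close"]

demo.loc[missing & flat, "Volume"] = 0               # no price change: assume no trading
demo.loc[missing & ~flat, "Volume"] = volume_median  # price moved: use a typical volume
print(demo)
```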

To verify that we have removed all duplicates and filled in missing values in accordance with the firm's standards, we perform the following.

In [None]:
# Data cleaning verification
print("Duplicates:", df.duplicated().sum())
print("Data is monotonically increasing:", df.index.is_monotonic_increasing)
print("Missing:", df.isnull().sum().sum())
df
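The same checks can be gathered into a single summary. `cleaning_report` below is a hypothetical helper, not part of the firm's protocol, and the data is invented. It adds a check for repeated dates, since `DataFrame.duplicated` compares column values and does not look at the index.

```python
import pandas as pd

def cleaning_report(frame: pd.DataFrame) -> dict:
    """Summarise duplicate, ordering and missing-value checks (hypothetical helper)."""
    return {
        "duplicate_rows": int(frame.duplicated().sum()),
        "duplicate_dates": int(frame.index.duplicated().sum()),
        "sorted": bool(frame.index.is_monotonic_increasing),
        "missing": int(frame.isna().sum().sum()),
    }

# The same date appears twice with different prices.
idx = pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-03"])
demo = pd.DataFrame({"Close": [49.0, 50.0, 50.5]}, index=idx)

report = cleaning_report(demo)
print(report)
```

In this example the row-level check finds nothing, while the index-level check reports one repeated date.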

# Feature Engineering

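In this section, we perform the necessary feature engineering on the cleaned data.

1. Simple Daily Returns: Calculate the simple daily returns using Close prices.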

In [None]:
df["DailyReturns"] = df.Close.pct_change()

2. Logarithmic Returns: Calculate the logarithmic returns using Close prices.

In [None]:
df["PrevClose"] = df.Close.shift() 
df["LogReturns"] = np.log(df.Close / df.PrevClose)
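As a quick sanity check on the two return definitions, the sketch below uses invented prices to show that a log return equals ln(1 + simple return), and that log returns add up across days.

```python
import numpy as np
import pandas as pd

close = pd.Series([100.0, 110.0, 99.0])  # invented prices

simple = close.pct_change()             # (P_t - P_{t-1}) / P_{t-1}
log_ret = np.log(close / close.shift())  # ln(P_t / P_{t-1})

# Each log return equals ln(1 + simple return).
matches = np.allclose(log_ret.dropna(), np.log1p(simple.dropna()))

# Summing log returns gives the log of the total price ratio.
total_log_return = log_ret.sum()
print(matches, total_log_return, np.log(close.iloc[-1] / close.iloc[0]))
```

The first row of both series is NaN because there is no previous close to compare against.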

3. 20-Day Momentum: Calculate the 20-day momentum by subtracting the
Close price 20 days prior from the current Close price, providing insights
into the stock's short-term trend.

In [None]:
df["Momentum_20"] = df.Close - df.Close.shift(20)

4. 20-Day Simple Moving Average: Calculate the 20-day simple moving
average to smooth out short-term fluctuations and highlight longer-term
trends in the Close prices.

In [None]:
df["SMA_20"] = df.Close.rolling(window=20).mean()

5. 20-Day Rolling Volatility: Calculate the 20-day rolling volatility based on the
standard deviation of simple daily returns to indicate the stock's risk level.

In [None]:
df["Volatility_20"] = df.DailyReturns.rolling(window=20).std()
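The three 20-day features above have a warm-up period during which they are NaN. The sketch below, on an invented series of 25 prices, shows where each one produces its first value.

```python
import numpy as np
import pandas as pd

close = pd.Series(np.arange(100.0, 125.0))  # 25 invented prices rising by 1 each day

sma = close.rolling(window=20).mean()
momentum = close - close.shift(20)
volatility = close.pct_change().rolling(window=20).std()

# The SMA needs 20 prices; momentum needs a price 20 rows back;
# volatility needs 20 returns, and the first return is itself NaN.
first_valid = {
    "SMA_20": sma.first_valid_index(),
    "Momentum_20": momentum.first_valid_index(),
    "Volatility_20": volatility.first_valid_index(),
}
print(first_valid)
```

Any plot or statistic built on these columns therefore starts about 20 trading days after the first row.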


6. Day of the Week: Identify the day of the week for each trading day.

In [None]:
df["Day"] = df.index.day_name()

7. Price Surge Identification: Identify days where the price surged, defined as
when the daily return is more than 4 standard deviations above the mean daily
return for the period, indicating significant price movements.

In [None]:
mean_return = df["DailyReturns"].mean()
std_return = df["DailyReturns"].std()
df["PriceSurge"] = df["DailyReturns"] > (mean_return + 4 * std_return)

8. Volume Spike Identification: Identify days where the volume spiked, defined
as when the trading volume is more than 6 standard deviations above the
mean volume for the period, highlighting unusual trading activity.

In [None]:
mean_volume = df["Volume"].mean()
std_volume = df["Volume"].std()
df["VolumeSpike"] = df["Volume"] > (mean_volume + 6 * std_volume)
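The price surge and volume spike flags follow the same pattern, which can be sketched with one function. `above_k_sigma` is a hypothetical helper and the returns are invented. The test is one-sided, so a large drop is not flagged.

```python
import pandas as pd

def above_k_sigma(series: pd.Series, k: float) -> pd.Series:
    """Flag values more than k sample standard deviations above the series mean."""
    return series > series.mean() + k * series.std()

# 48 small moves, then one large gain and one equally large loss.
returns = pd.Series([0.01, -0.01] * 24 + [0.20, -0.20])

surge_4 = above_k_sigma(returns, 4)
surge_6 = above_k_sigma(returns, 6)
print(int(surge_4.sum()), int(surge_6.sum()))
```

Here the gain clears the 4-sigma threshold but not the 6-sigma one, because the outliers themselves inflate the standard deviation used in the threshold.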

9. Bollinger Bands Calculation: Calculate the upper and lower Bollinger Bands
for the stock, which are set at 2 standard deviations above and below the
20-day simple moving average, to identify overbought and oversold conditions.

In [None]:
df["SMA_20"] = df["Close"].rolling(window=20).mean()
df["Dev"] = df["Close"].rolling(window=20).std()

df["HighBand"] = df.SMA_20 + 2 * df.Dev
df["LowBand"] = df.SMA_20 - 2 * df.Dev
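The band construction can be wrapped in a small function for reuse. `bollinger` is a hypothetical helper, the column names mirror those above, and the prices are invented.

```python
import numpy as np
import pandas as pd

def bollinger(close: pd.Series, window: int = 20, k: float = 2.0) -> pd.DataFrame:
    """Return the rolling mean and the bands k rolling standard deviations around it."""
    sma = close.rolling(window=window).mean()
    dev = close.rolling(window=window).std()
    return pd.DataFrame(
        {"SMA": sma, "HighBand": sma + k * dev, "LowBand": sma - k * dev}
    )

close = pd.Series(np.arange(30.0))  # invented prices 0, 1, ..., 29
bands = bollinger(close)

# Closes outside the bands are the candidates for overbought or oversold conditions.
outside = (close > bands["HighBand"]) | (close < bands["LowBand"])
print(bands.dropna().head())
```

The bands are symmetric around the moving average by construction, and they are NaN for the first 19 rows.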

## Key Dates

# Data Visualisation

1. Plot the close prices along with the simple moving average line and Bollinger Bands.
Volume and volatility should be presented in subplots under the main plot, which all share an x-axis.



In [None]:
plt.style.use("ggplot")

# sharex=True links the three panels to a common date axis.
fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, sharex=True, figsize=(15,10), gridspec_kw={"height_ratios": [3, 1, 1]})
fig.subplots_adjust(hspace=0.3)

ax1.set_title("Bollinger Bands on DASH")
ax1.set_ylabel("USD")
ax1.plot(df.Close, label = "Closing Price", color = "blue")
ax1.plot(df.SMA_20, label="20-day SMA", color = "red", linestyle ="--")
ax1.plot(df.HighBand, color="grey", linestyle =":")
ax1.plot(df.LowBand, color="grey", linestyle =":")
ax1.fill_between(df.index, df.HighBand, df.LowBand, color = "grey", alpha=0.3)
ax1.legend()

ax2.bar(df.index, df["Volume"], label="Volume", color="grey")
ax2.set_ylabel("Volume")

ax3.plot(df.Volatility_20, label="20-day Volatility", color="purple")
ax3.set_ylabel("Volatility")
ax3.set_title('20-Day Rolling Volatility')
ax3.set_xlabel("Date")

plt.tight_layout()
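A shared x-axis in matplotlib is requested with `sharex=True` in `plt.subplots`. The sketch below shows this on randomly generated data (not DASH prices), using the non-interactive `Agg` backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for running outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data on business days.
idx = pd.date_range("2023-01-02", periods=60, freq="B")
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "Close": 50 + rng.normal(0, 1, 60).cumsum(),
        "Volume": rng.integers(1_000_000, 5_000_000, 60),
    },
    index=idx,
)

fig, (ax_price, ax_vol) = plt.subplots(
    nrows=2, sharex=True, figsize=(10, 6), gridspec_kw={"height_ratios": [3, 1]}
)
ax_price.plot(demo.index, demo["Close"], label="Closing Price")
ax_price.legend()
ax_vol.bar(demo.index, demo["Volume"], color="grey")
ax_vol.set_xlabel("Date")
fig.tight_layout()
```

With linked axes, zooming or changing the limits on one panel applies to the others, and tick labels appear only on the bottom panel.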

2. Plot a histogram of log returns.

In [None]:
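log_returns = df.LogReturns.dropna()  # the first row has no previous close, so its log return is NaN
recommended_bins = int(np.sqrt(len(log_returns)))  # square-root rule for the number of bins
plt.figure(figsize=(10, 5))
plt.hist(log_returns, bins=recommended_bins, color="skyblue", edgecolor="black")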

plt.title("Histogram of Log Returns")
plt.xlabel("Log Returns")
plt.ylabel("Frequency")
plt.show()
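The bin count above follows the square-root rule, which sets the number of bins to the square root of the number of observations. A minimal sketch with invented values, using `np.histogram` so that the counts can be inspected without plotting:

```python
import numpy as np
import pandas as pd

# One NaN (as produced by the first log return) followed by 100 invented values.
log_returns = pd.Series([np.nan] + [0.001 * i for i in range(-50, 50)])

clean = log_returns.dropna()
n_bins = int(np.sqrt(len(clean)))  # square-root rule: 100 observations give 10 bins

counts, edges = np.histogram(clean, bins=n_bins)
print(n_bins, counts)
```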

3. Construct a scatter plot to explore the relationship between volume and daily returns.

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(df.DailyReturns, df.Volume)
plt.title("Volume vs Daily Returns")
plt.ylabel("Trading Volume (shares)")
plt.xlabel("Simple Daily Returns")
plt.show()

# Reporting

A: According to the histogram of log returns, the data is centered around zero, indicating that most daily gains and losses are small. It has s