# Data Analysis

Now that we've got clean data, let's start with some basic financial analysis.

First, let's load our CSV file into a DataFrame, convert our dates, set the index, and check for duplicated rows or missing values.

In [2]:
import pandas as pd

df = pd.read_csv("TSLA_clean.csv")
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index("Date").sort_index().drop_duplicates()

print("Duplicates:", df.duplicated().sum())
print("Missing:", df.isnull().sum().sum())

Duplicates: 0
Missing: 0
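As a small aside, here is a sketch of the same loading steps on an in-memory CSV, so it runs without `TSLA_clean.csv`. The rows are made up for illustration (one row is repeated and one value is missing on purpose), and the counts are taken before cleaning so that they actually report something.

```python
import io
import pandas as pd

# Made-up sample rows, not TSLA data. The first row is repeated on purpose
# and the last row has no Volume.
raw = """Date,Close,Volume
2024-01-03,102.0,2000
2024-01-02,100.0,1500
2024-01-03,102.0,2000
2024-01-04,101.0,
"""

# parse_dates and index_col do the date conversion and set_index in one call.
sample = pd.read_csv(io.StringIO(raw), parse_dates=["Date"], index_col="Date")

# Count problems before fixing them, so the check is informative.
n_duplicate_rows = int(sample.reset_index().duplicated().sum())
n_missing = int(sample.isnull().sum().sum())
print("Duplicates:", n_duplicate_rows)
print("Missing:", n_missing)

# Then clean: drop repeated dates and put the rows in chronological order.
sample = sample[~sample.index.duplicated(keep="first")].sort_index()
print(sample)
```

Checking `sample.index.duplicated()` catches repeated dates even when the other columns differ, which `drop_duplicates()` on the values alone would miss.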


## Returns

Returns refer to the gain or loss made on an initial investment, often expressed as a percentage. We can use the generic **percentage change** formula here:

$$ \frac{\text{price}_{\text{end}} - \text{price}_{\text{start}}}{\text{price}_{\text{start}}} $$

We can apply this to close prices to calculate the simple daily return:

$$ \frac{\text{close}_{\text{today}} - \text{close}_{\text{yesterday}}}{\text{close}_{\text{yesterday}}} $$

In [3]:
jan31_closing = df.loc["2024-01-31", "Close"]
jan30_closing = df.loc["2024-01-30", "Close"]

jan31_return = (jan31_closing - jan30_closing) / jan30_closing
print(f"Return on 31 Jan was {jan31_return:.2%}")

Return on 31 Jan was -2.24%


In [4]:
name = "Jay"
print(f"Hi {name} !!")

Hi Jay !!


This simple daily return expresses a loss in value of 2.24% from one day to the next. Notice that we keep the return in decimal form and only format it as a percentage on output, using an f-string with the `:.2%` format specifier.

Using the above approach to calculate the daily return for every day in our data set would take a long time. Let's see how pandas' `pct_change()` makes this sort of work easy by applying our percentage change formula to a whole column at once.

Notice how the first row in our data has a missing value **NaN** in the new daily return column. This is because our data doesn't have a close price for the day before it!

What to do with this missing value depends on what further analysis we want to do. If we want to carry out simple descriptive statistics, such as computing the mean, max, or standard deviation, we can leave our missing value as NaN, because pandas ignores NaNs by default when calculating these.

In [5]:
df["Returns"] = df.Close.pct_change()

df

| Date | Close | High | Low | Open | Volume | Returns |
|---|---|---|---|---|---|---|
| 2015-01-02 | 14.620667 | 14.883333 | 14.217333 | 14.858000 | 71466000 | NaN |
| 2015-01-05 | 14.006000 | 14.433333 | 13.810667 | 14.303333 | 80527500 | -0.042041 |
| 2015-01-06 | 14.085333 | 14.280000 | 13.614000 | 14.004000 | 93928500 | 0.005664 |
| 2015-01-07 | 14.063333 | 14.318667 | 13.985333 | 14.223333 | 44526000 | -0.001562 |
| 2015-01-08 | 14.041333 | 14.253333 | 14.000667 | 14.187333 | 51637500 | -0.001564 |
| ... | ... | ... | ... | ... | ... | ... |
| 2024-12-23 | 430.600006 | 434.510010 | 415.410004 | 28.586000 | 72698100 | 0.022657 |
| 2024-12-24 | 462.279999 | 462.779999 | 435.140015 | 435.899994 | 59551800 | 0.073572 |
| 2024-12-26 | 454.130005 | 465.329987 | 451.019989 | 465.160004 | 76366400 | -0.017630 |
| 2024-12-27 | 431.660004 | 450.000000 | 426.500000 | 449.519989 | 82666800 | -0.049479 |
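To make the link between `pct_change()` and the percentage change formula concrete, here is a minimal sketch on a made-up price series (not TSLA data). It compares the pandas result with the formula written out by hand using `shift(1)`, and shows that the mean skips the leading NaN.

```python
import pandas as pd

# Made-up closing prices, not TSLA data.
close = pd.Series(
    [100.0, 110.0, 99.0, 99.0],
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]),
    name="Close",
)

returns = close.pct_change()

# The same calculation written out with the percentage change formula:
# shift(1) lines up each day with the previous day's close.
manual = (close - close.shift(1)) / close.shift(1)

print(pd.DataFrame({"Close": close, "pct_change": returns, "manual": manual}))

# The first value is NaN (no previous day); mean() ignores it by default.
print("Mean return:", returns.mean())
```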


In [6]:
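# Largest single-day return, as a percentage
# (not displayed: a notebook cell only shows its last expression)
df.Returns.max() * 100
# Date on which that largest return occurred
df.Returns.idxmax()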

Timestamp('2024-10-24 00:00:00')

In [7]:
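# Cumulative return over the whole period (compounded daily), as a percentage.
# The parentheses matter: written as `... - 1 * 100`, Python multiplies first
# and subtracts 100 from the product instead of converting the return to percent.
((1 + df.Returns).prod() - 1) * 100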

For more complex analyses, though, we may want to drop or fill the missing first-day return. Let's calculate cumulative returns for the period. Instead of comparing a given day with the day before it, cumulative returns compare a given day with the first day of the period, to indicate how our stock has performed since our initial investment.

We generally fill missing daily returns with 0, which indicates no change from the day before. The cell below leaves the NaN in place instead: `cumprod()` skips missing values by default, so only the first row of the new column stays NaN.

Because we're doing cumulative multiplication, we add 1 to each daily return to get a growth factor, so we can compound the returns over time. After calculating the cumulative product, we subtract 1 to get back to a return.

In [8]:
df["Cumulative"] = (1 + df.Returns).cumprod() - 1
df 

| Date | Close | High | Low | Open | Volume | Returns | Cumulative |
|---|---|---|---|---|---|---|---|
| 2015-01-02 | 14.620667 | 14.883333 | 14.217333 | 14.858000 | 71466000 | NaN | NaN |
| 2015-01-05 | 14.006000 | 14.433333 | 13.810667 | 14.303333 | 80527500 | -0.042041 | -0.042041 |
| 2015-01-06 | 14.085333 | 14.280000 | 13.614000 | 14.004000 | 93928500 | 0.005664 | -0.036615 |
| 2015-01-07 | 14.063333 | 14.318667 | 13.985333 | 14.223333 | 44526000 | -0.001562 | -0.038120 |
| 2015-01-08 | 14.041333 | 14.253333 | 14.000667 | 14.187333 | 51637500 | -0.001564 | -0.039624 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-12-23 | 430.600006 | 434.510010 | 415.410004 | 28.586000 | 72698100 | 0.022657 | 28.451460 |
| 2024-12-24 | 462.279999 | 462.779999 | 435.140015 | 435.899994 | 59551800 | 0.073572 | 30.618255 |
| 2024-12-26 | 454.130005 | 465.329987 | 451.019989 | 465.160004 | 76366400 | -0.017630 | 30.060826 |
| 2024-12-27 | 431.660004 | 450.000000 | 426.500000 | 449.519989 | 82666800 | -0.049479 | 28.523960 |
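The compounding step can be checked on a toy example. The sketch below uses made-up prices (not TSLA data), fills the first missing return with 0, and confirms that compounding the daily growth factors gives the same answer as comparing each price directly with the first one.

```python
import pandas as pd

# Made-up closing prices, not TSLA data.
close = pd.Series([100.0, 110.0, 99.0, 121.0], name="Close")

returns = close.pct_change().fillna(0)  # first day: treated as no change
growth = 1 + returns                    # daily growth factors
cumulative = growth.cumprod() - 1       # compounded return since the first day

# Cross-check: compounding daily returns should match price / first price - 1.
direct = close / close.iloc[0] - 1

print(pd.DataFrame({"Returns": returns, "Cumulative": cumulative, "Direct": direct}))
```

With the fill in place the cumulative series starts at 0 rather than NaN, which is often more convenient for plotting.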


### Exercise: Bull or Bear?

Calculate the daily change in trading volume over the data frame. Then calculate the average change in trading volume over the period. Take the same approach we used for calculating daily returns, but considering `Volume` instead of `Close` price.

Then determine the trend in TSLA's stock for **Q1 2024**:

- Rising volume and increasing price might indicate a **bullish** trend (where the uptrend is backed by strong demand and could continue).

- Rising volume and decreasing price might indicate a **bearish** trend (where the downtrend is backed by strong selling pressure and could continue).

- Falling volume on a price increase or decrease often indicates that a trend is losing strength. It might suggest that momentum is waning and a price reversal is coming.

In [17]:

# Slicing with .loc can return a view of df rather than a new DataFrame, so
# adding a column to the slice may raise a SettingWithCopyWarning or touch the
# original data. Calling .copy() gives us an independent DataFrame to modify.
q1 = df.loc["2024-01-01":"2024-03-31"].copy()

q1["Daily_Volume_Change"] = q1.Volume.pct_change()

print(f"Volume change is {q1.Daily_Volume_Change.mean()}")
print(f"Price change is {q1.Returns.mean()}")





Volume change is 0.014709577351289112
Price change is -0.005226914518494
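One way to turn the three rules of thumb from the exercise into code is sketched below. The function name `classify_trend` and the label `"weakening"` are hypothetical, chosen only for illustration, and using the sign of the two averages is a crude simplification of what "rising" and "falling" mean.

```python
def classify_trend(mean_volume_change: float, mean_return: float) -> str:
    """Map two averages onto the exercise's rules of thumb (illustrative only)."""
    if mean_volume_change > 0 and mean_return > 0:
        return "bullish"   # rising volume, rising price
    if mean_volume_change > 0 and mean_return < 0:
        return "bearish"   # rising volume, falling price
    return "weakening"     # volume not rising, or price flat


# Averages printed by the Q1 2024 cell above.
q1_trend = classify_trend(0.014709577351289112, -0.005226914518494)
print(q1_trend)
```

For the Q1 2024 averages this returns `bearish`: volume rose on average while the price fell.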


## Surges

Surges in price or trading volume can be helpful indicators for traders. One way to define a surge is as an increase over the previous day that exceeds some defined threshold. That threshold is often set at some number of standard deviations above the mean. Let's look at price surges of more than five standard deviations above the mean daily return.

In [21]:
# Threshold: mean daily return plus five standard deviations
mean_return = df["Returns"].mean()
threshold = mean_return + df.Returns.std() * 5

# Keep only the days whose return exceeds the threshold
condition = df.Returns > threshold
df[condition]

| Date | Close | High | Low | Open | Volume | Returns | Cumulative | Daily_Volume_Change | Cumulative_Vol_Change |
|---|---|---|---|---|---|---|---|---|---|
| 2020-02-03 | 52.0 | 52.409332 | 44.901333 | 44.912666 | 705975000 | 0.198949 | 2.556609 | 1.99409 | 8.878474 |
| 2020-03-19 | 28.509333 | 30.133333 | 23.897333 | 24.98 | 452932500 | 0.183877 | 0.949934 | 0.269455 | 5.337734 |
| 2021-03-09 | 224.526672 | 226.029999 | 198.403336 | 202.726669 | 202569900 | 0.196412 | 14.3568 | 0.303866 | 1.834493 |
| 2024-10-24 | 260.480011 | 262.119995 | 242.649994 | 244.679993 | 204491900 | 0.21919 | 16.815877 | 1.526497 | 1.861387 |
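The threshold logic can be wrapped in a small helper so the number of standard deviations is easy to vary. This is a sketch under assumptions: `find_surges` is a hypothetical name, and the returns are made up (60 quiet days followed by one large jump) rather than taken from the TSLA data.

```python
import pandas as pd


def find_surges(returns: pd.Series, n_std: float = 5.0) -> pd.Series:
    """Return the values more than n_std standard deviations above the mean."""
    threshold = returns.mean() + n_std * returns.std()
    return returns[returns > threshold]


# Made-up returns: 60 quiet days alternating +1% / -1%, then one 30% jump.
values = [0.01, -0.01] * 30 + [0.30]
dates = pd.bdate_range("2024-01-01", periods=len(values))
returns = pd.Series(values, index=dates, name="Returns")

surges = find_surges(returns)
print(surges)
```

Note that a single extreme day also inflates the standard deviation it is measured against, so very short series may never produce a five-sigma surge.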


## Moving Averages

Moving averages are a different kind of indicator, one that smooths out small variations in trading data to give a better picture of the overall trend.

A Simple Moving Average (SMA) averages a price over a specific period. The average is "moving" because when a new day enters the period, the oldest day is discarded.

Moving averages can be *fast*, when they cover a short period, or *slow*, when they cover a longer period. The longer the period, the more those small variations are smoothed out.

In [22]:
# A fast MA covers only the most recent few days, so it is sensitive to
# short-term changes. rolling(n) takes the window length in rows (trading days).
df["FastMA"] = df.Close.rolling(5).mean()
df["SlowMA"] = df.Close.rolling(50).mean()
df

| Date | Close | High | Low | Open | Volume | Returns | Cumulative | Daily_Volume_Change | Cumulative_Vol_Change | FastMA | SlowMA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-01-02 | 14.620667 | 14.883333 | 14.217333 | 14.858000 | 71466000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2015-01-05 | 14.006000 | 14.433333 | 13.810667 | 14.303333 | 80527500 | -0.042041 | -0.042041 | 0.126795 | 0.126795 | NaN | NaN |
| 2015-01-06 | 14.085333 | 14.280000 | 13.614000 | 14.004000 | 93928500 | 0.005664 | -0.036615 | 0.166415 | 0.314310 | NaN | NaN |
| 2015-01-07 | 14.063333 | 14.318667 | 13.985333 | 14.223333 | 44526000 | -0.001562 | -0.038120 | -0.525959 | -0.376962 | NaN | NaN |
| 2015-01-08 | 14.041333 | 14.253333 | 14.000667 | 14.187333 | 51637500 | -0.001564 | -0.039624 | 0.159716 | -0.277454 | 14.163333 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-12-23 | 430.600006 | 434.510010 | 415.410004 | 28.586000 | 72698100 | 0.022657 | 28.451460 | -0.450157 | 0.017240 | 441.564001 | 324.8296 |
| 2024-12-24 | 462.279999 | 462.779999 | 435.140015 | 435.899994 | 59551800 | 0.073572 | 30.618255 | -0.180834 | -0.166711 | 438.048004 | 329.6920 |
| 2024-12-26 | 454.130005 | 465.329987 | 451.019989 | 465.160004 | 76366400 | -0.017630 | 30.060826 | 0.282353 | 0.068570 | 440.848004 | 334.3832 |
| 2024-12-27 | 431.660004 | 450.000000 | 426.500000 | 449.519989 | 82666800 | -0.049479 | 28.523960 | 0.082502 | 0.156729 | 439.946002 | 338.5898 |
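The leading NaNs in the moving average columns appear because `rolling(n)` needs `n` rows before it can produce its first value. A minimal sketch on made-up prices (not TSLA data), with much shorter windows than the 5 and 50 used above, shows both this and the difference in how quickly each average reacts to a jump:

```python
import pandas as pd

# Made-up closing prices with a jump on the last day, not TSLA data.
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 20.0], name="Close")

fast = close.rolling(2).mean()  # short window: reacts quickly
slow = close.rolling(4).mean()  # longer window: smoother, slower to react

print(pd.DataFrame({"Close": close, "Fast": fast, "Slow": slow}))
```

On the last day the fast average moves to 17.0 while the slow one only reaches 14.75, since the jump is averaged with more of the earlier, lower prices.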


## Volatility

Volatility measures how much a stock's returns vary, and can be helpful for determining risk. Here we measure it as the standard deviation of daily returns over a rolling window. Periods of high standard deviation indicate higher volatility and may suggest a riskier investment.


In [23]:
# Rolling standard deviation of daily returns over a 20-row (trading day) window
df["Volatility"] = df.Returns.rolling(20).std()
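To see what the rolling standard deviation picks up, here is a small sketch on made-up returns (not TSLA data): a calm stretch followed by a choppier one. The window of 4 is an assumption chosen so the toy series produces values; the cell above uses 20.

```python
import pandas as pd

# Made-up daily returns: a calm stretch followed by a choppier one.
calm = [0.001, -0.001] * 5
choppy = [0.03, -0.03] * 5
returns = pd.Series(calm + choppy, name="Returns")

window = 4  # short window for the toy series; the cell above uses 20
volatility = returns.rolling(window).std()

print(volatility)
```

The first `window - 1` values are NaN, and the rolling value rises sharply once the window starts covering the choppy days.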