## Module 1 Homework (2025 cohort)

In this homework, we download financial data from several sources and run simple calculations and analyses on it.

### Question 1: [Index] S&P 500 Stocks Added to the Index

**Which year had the highest number of additions?**

Using the list of S&P 500 companies from Wikipedia's [S&P 500 companies page](https://en.wikipedia.org/wiki/List_of_S%26P_500_companies), download the data including the year each company was added to the index.

Hint: you can use [pandas.read_html](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html) to scrape the data into a DataFrame.

Steps:
1. Create a DataFrame with company tickers, names, and the year they were added.

In [1]:
import pandas as pd

In [8]:
url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
tables = pd.read_html(url)
df = tables[0]
df = df[['Symbol', 'Security', 'Date added']].copy()

2. Extract the year from the addition date and calculate the number of stocks added each year.
3. Which year had the highest number of additions (1957 doesn't count, as it was the year when the S&P 500 index was founded)? Write down this year as your answer (if several years tie, take the most recent one).

In [10]:
df['Date added'] = pd.to_datetime(df['Date added'], errors='coerce')
df['Year added'] = df['Date added'].dt.year
additions_per_year = df['Year added'].value_counts().sort_index()
filtered = additions_per_year[additions_per_year.index != 1957]  # exclude the founding year
max_year = filtered[filtered == filtered.max()].index.max()  # most recent year among ties
print(f"Year with the most additions (excluding 1957): {max_year}")

Year with the most additions (excluding 1957): 2017
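
The counting and tie-breaking logic above can be checked on a small hand-made table. This is a minimal sketch: the tickers and dates below are invented for illustration and do not come from the Wikipedia table.

```python
import pandas as pd

# Hypothetical sample rows; tickers and dates are invented for illustration.
sample = pd.DataFrame({
    "Symbol": ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF", "GGG"],
    "Date added": ["1957-03-04", "1957-03-04", "1957-03-04",
                   "2016-05-02", "2016-09-12",
                   "2017-06-19", "2017-10-23"],
})
sample["Date added"] = pd.to_datetime(sample["Date added"], errors="coerce")
sample["Year added"] = sample["Date added"].dt.year

counts = sample["Year added"].value_counts()
counts = counts[counts.index != 1957]  # the founding year does not count

# 2016 and 2017 tie with two additions each; keep the most recent year.
tied_years = counts[counts == counts.max()].index
answer = int(tied_years.max())
print(answer)
```

Because 2016 and 2017 both have two additions in this sample, the tie-break is what decides the answer.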


*Context*: 
> "Following the announcement, all four new entrants saw their stock prices rise in extended trading on Friday" - recent examples of S&P 500 additions include DASH, WSM, EXE, TKO in 2025 ([Nasdaq article](https://www.nasdaq.com/articles/sp-500-reshuffle-dash-tko-expe-wsm-join-worth-buying)).

*Additional*: How many current S&P 500 stocks have been in the index for more than 20 years? When stocks are added to the S&P 500, they usually experience a price bump as investors and index funds buy shares following the announcement.

In [12]:
import datetime
today = datetime.datetime.today()
cutoff_date = today - pd.DateOffset(years=20)
df_long_term = df[df['Date added'] <= cutoff_date]
num_long_term = df_long_term.shape[0]
print(f"Number of current S&P 500 stocks in the index for more than 20 years: {num_long_term}")

Number of current S&P 500 stocks in the index for more than 20 years: 220


### Question 2. [Macro] Indexes YTD (as of 1 May 2025)

**How many indexes (out of 10) have better year-to-date returns than the US (S&P 500) as of May 1, 2025?**

Using Yahoo Finance World Indices data, compare the year-to-date (YTD) performance (1 January-1 May 2025) of major stock market indexes for the following countries:
* United States - S&P 500 (^GSPC)
* China - Shanghai Composite (000001.SS)
* Hong Kong - Hang Seng Index (^HSI)
* Australia - S&P/ASX 200 (^AXJO)
* India - Nifty 50 (^NSEI)
* Canada - S&P/TSX Composite (^GSPTSE)
* Germany - DAX (^GDAXI)
* United Kingdom - FTSE 100 (^FTSE)
* Japan - Nikkei 225 (^N225)
* Mexico - IPC Mexico (^MXX)
* Brazil - Ibovespa (^BVSP)

*Hint*: pass `start='2025-01-01'` and `end='2025-05-01'` to `yf.download` when downloading daily data. yfinance treats `end` as exclusive, so the last row returned is 30 April 2025 at the latest.

In [19]:
import yfinance as yf

index_tickers = ["^GSPC", "000001.SS", "^HSI", "^AXJO", "^NSEI", "^GSPTSE", "^GDAXI", "^FTSE", "^N225", "^MXX", "^BVSP"]
index_data = yf.download(index_tickers, start="2025-01-01", end="2025-05-01")
index_data

[*********************100%***********************]  11 of 11 completed


| Date | Close 000001.SS | Close ^AXJO | Close ^BVSP | Close ^FTSE | Close ^GDAXI | Close ^GSPC | Close ^GSPTSE | Close ^HSI | Close ^MXX | Close ^N225 | ... | Volume ^AXJO | Volume ^BVSP | Volume ^FTSE | Volume ^GDAXI | Volume ^GSPC | Volume ^GSPTSE | Volume ^HSI | Volume ^MXX | Volume ^N225 | Volume ^NSEI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2025-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 154900.0 |
| 2025-01-02 | 3262.561035 | 8201.200195 | 120125.0 | 8260.099609 | 20024.660156 | 5868.549805 | 24898.000000 | 19623.320312 | 49765.199219 | NaN | ... | 304400.0 | 9373600.0 | 4.222199e+08 | 52445600.0 | 3.621680e+09 | 215089400.0 | 4.033400e+09 | 87535300.0 | NaN | 283200.0 |
| 2025-01-03 | 3211.429932 | 8250.500000 | 118533.0 | 8224.000000 | 19906.080078 | 5942.470215 | 25073.500000 | 19760.269531 | 48957.238281 | NaN | ... | 329100.0 | 9804400.0 | 7.425039e+08 | 44372900.0 | 3.667340e+09 | 186569100.0 | 3.393800e+09 | 112782300.0 | NaN | 312300.0 |
| 2025-01-06 | 3206.923096 | 8288.500000 | 120022.0 | 8249.700195 | 20216.189453 | 5975.379883 | 24999.800781 | 19688.289062 | 49493.558594 | 39307.050781 | ... | 52200.0 | 9685600.0 | 7.662447e+08 | 70784900.0 | 4.940120e+09 | 239976800.0 | 2.465700e+09 | 139872100.0 | 137900000.0 | 278100.0 |
| 2025-01-07 | 3229.644043 | 8285.099609 | 121163.0 | 8245.299805 | 20340.570312 | 5909.029785 | 24929.900391 | 19447.580078 | 50085.500000 | 40083.300781 | ... | 424300.0 | 11116400.0 | 7.415068e+08 | 62020000.0 | 4.517330e+09 | 237759800.0 | 3.581000e+09 | 142173400.0 | 127000000.0 | 262300.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2025-04-24 | 3297.288086 | 7968.200195 | 134580.0 | 8407.400391 | 22064.509766 | 5484.770020 | 24727.500000 | 21909.759766 | 56382.000000 | 35039.148438 | ... | 639100.0 | 14113400.0 | 1.126606e+09 | 62636800.0 | 4.697710e+09 | 224419200.0 | 2.985800e+09 | 249950000.0 | 137100000.0 | 358800.0 |
| 2025-04-25 | 3295.060059 | NaN | 134739.0 | 8415.299805 | 22242.449219 | 5525.209961 | 24710.500000 | 21980.740234 | 56720.121094 | 35705.738281 | ... | NaN | 13051800.0 | 8.027340e+08 | 70917400.0 | 4.236580e+09 | 214234300.0 | 3.025700e+09 | 217532100.0 | 134700000.0 | 387700.0 |
| 2025-04-28 | 3288.415039 | 7997.100098 | 135016.0 | 8417.299805 | 22271.669922 | 5528.750000 | 24798.599609 | 21971.960938 | 56980.128906 | 35839.988281 | ... | 769000.0 | 11449700.0 | 7.387417e+08 | 55883200.0 | 4.257880e+09 | 224287200.0 | 2.466000e+09 | 193000200.0 | 132400000.0 | 320500.0 |
| 2025-04-29 | 3286.655029 | 8070.600098 | 135093.0 | 8463.500000 | 22425.830078 | 5560.830078 | 24874.500000 | 22008.109375 | 55613.429688 | NaN | ... | 710800.0 | 12761100.0 | 6.559248e+08 | 75547100.0 | 4.747150e+09 | 199905200.0 | 3.045200e+09 | 240938000.0 | NaN | 357600.0 |


In [22]:
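close_prices = index_data['Close']
# Keep only dates on which every index has a close, so all returns share the same first and last date
close_prices = close_prices.dropna()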
start_prices = close_prices.iloc[0]
end_prices = close_prices.iloc[-1]
returns = ((end_prices - start_prices) / start_prices) * 100
sp500_return = returns["^GSPC"]
better_than_us = (returns > sp500_return).sum()
print(f"Number of indexes with better YTD performance than the US (S&P 500): {better_than_us} out of {len(returns)-1}")

Number of indexes with better YTD performance than the US (S&P 500): 9 out of 10
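
The cell above drops every date on which at least one market was closed before taking the first and last rows. An alternative reading of "year-to-date" uses each index's own first and last available close, which can give slightly different returns. The sketch below illustrates that alternative on invented prices; the column names `US`, `IDX_A`, `IDX_B` and the helper `ytd_return_pct` are hypothetical and not part of the homework.

```python
import numpy as np
import pandas as pd

# Invented closing prices; NaN marks a day on which that market was closed.
dates = pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-03", "2025-04-30"])
closes = pd.DataFrame(
    {
        "US": [np.nan, 100.0, 101.0, 95.0],
        "IDX_A": [50.0, 51.0, np.nan, 55.0],
        "IDX_B": [np.nan, 200.0, 198.0, np.nan],
    },
    index=dates,
)


def ytd_return_pct(series: pd.Series) -> float:
    """Percentage change from the first to the last non-missing close."""
    valid = series.dropna()
    return (valid.iloc[-1] / valid.iloc[0] - 1) * 100


returns = closes.apply(ytd_return_pct)  # one return per column
beats_us = int((returns.drop("US") > returns["US"]).sum())
print(returns.round(2).to_dict(), beats_us)
```

Which of the two conventions is intended is a judgment call; the row-wise `dropna()` keeps all indexes on identical dates, while the per-column version uses as much of each index's history as is available.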


### Question 3. [Index] S&P 500 Market Corrections Analysis


**Calculate the median duration (in days) of significant market corrections in the S&P 500 index.**

For this task, define a correction as an event when a stock index goes down by **more than 5%** from the most recent all-time high.

Steps:
1. Download S&P 500 historical data (1950-present) using yfinance

In [36]:
sp_500 = yf.download('^GSPC', start='1950-01-01')['Close']
sp_500 = sp_500['^GSPC'].dropna()

[*********************100%***********************]  1 of 1 completed


2. Identify all-time high points (where price exceeds all previous prices)

In [37]:
all_time_highs = sp_500.cummax()
high_dates = sp_500[sp_500 == all_time_highs].index

3. For each pair of consecutive all-time highs, find the minimum price in between
4. Calculate drawdown percentages: (high - low) / high × 100

In [38]:
corrections = []

for i in range(len(high_dates) - 1):
    start_date = high_dates[i]
    end_date = high_dates[i + 1]
    high_value = sp_500.loc[start_date]

    # Find the minimum price between the two highs
    low_between = sp_500.loc[start_date:end_date].min()
    low_date = sp_500.loc[start_date:end_date].idxmin()

    drawdown = (high_value - low_between) / high_value * 100

    if drawdown >= 5:
        duration = (low_date - start_date).days
        corrections.append(duration)

5. Filter for corrections with at least 5% drawdown
6. Calculate the duration in days for each correction period
7. Determine the 25th, 50th (median), and 75th percentiles for correction durations

In [39]:
corrections_series = pd.Series(corrections)
percentiles = corrections_series.quantile([0.25, 0.5, 0.75]).astype(int)

print("Correction Duration Percentiles (in days):")
print(f"25th percentile: {percentiles[0.25]} days")
print(f"Median (50th): {percentiles[0.5]} days")
print(f"75th percentile: {percentiles[0.75]} days")

Correction Duration Percentiles (in days):
25th percentile: 21 days
Median (50th): 39 days
75th percentile: 89 days
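
To see what the correction steps do, it helps to run them on a series short enough to check by hand. The sketch below uses invented prices and, like the loop above, measures duration from the all-time high to the lowest close before the next all-time high; measuring up to the next high instead would be a different reading of "duration" and would give larger numbers.

```python
import pandas as pd

# Invented closes, one per calendar day.
prices = pd.Series(
    [100, 98, 94, 97, 101, 100, 99, 102, 95, 96, 90, 103],
    index=pd.date_range("2020-01-01", periods=12, freq="D"),
    dtype=float,
)

running_max = prices.cummax()
high_dates = prices[prices == running_max].index  # days that set a new high

rows = []
for start, end in zip(high_dates[:-1], high_dates[1:]):
    window = prices.loc[start:end]
    low_date = window.idxmin()
    drawdown_pct = (prices.loc[start] - window.min()) / prices.loc[start] * 100
    if drawdown_pct >= 5:  # same threshold as the cell above
        rows.append({
            "high": start,
            "low": low_date,
            "drawdown_pct": drawdown_pct,
            "days_to_low": (low_date - start).days,
        })

corrections_demo = pd.DataFrame(rows)
print(corrections_demo)
print(corrections_demo["days_to_low"].quantile([0.25, 0.5, 0.75]))
```

In this sample the highs fall on days 1, 5, 8 and 12. The dip between days 5 and 8 is under 2% and is skipped, leaving two corrections of roughly 6% and 11.8%.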


### Question 4. [Stocks] Earnings Surprise Analysis for Amazon (AMZN)


**Calculate the median 2-day percentage change in stock prices following positive earnings surprise days.**

Steps:
1. Load earnings data from CSV ([ha1_Amazon.csv](ha1_Amazon.csv)) containing earnings dates, EPS estimates, and actual EPS. Make sure you are using the correct delimiter to read the data, such as in this command: `pandas.read_csv("ha1_Amazon.csv", delimiter=';')`

In [44]:
df_4 = pd.read_csv('https://raw.githubusercontent.com/DataTalksClub/stock-markets-analytics-zoomcamp/refs/heads/main/cohorts/2025/ha1_Amazon.csv', sep=';').iloc[:-1, :]
df_4['Earnings Date'] = pd.to_datetime(df_4['Earnings Date'].str.split(' at').str[0], errors='coerce')
numeric_cols = ['EPS Estimate', 'Reported EPS', 'Surprise (%)']
for col in numeric_cols:
    df_4[col] = pd.to_numeric(df_4[col].str.replace('[^-.0-9]', '', regex=True), errors='coerce')
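
The cleaning in the cell above can be tried on a couple of hand-written rows. The sketch assumes the raw strings look like `"April 29, 2025 at 4 PM EDT"` for dates and `"+16.74"` or `"-"` for numbers; that layout is inferred from the parsing code, and the values below are invented rather than taken from the CSV.

```python
import pandas as pd

# Invented rows in the assumed raw layout of the earnings file.
raw = pd.DataFrame({
    "Earnings Date": ["April 29, 2025 at 4 PM EDT", "February 16, 2025 at 4 PM EST"],
    "EPS Estimate": ["1.37", "-"],
    "Surprise (%)": ["+16.74", "-"],
})

# Keep only the calendar date in front of " at" and parse it.
raw["Earnings Date"] = pd.to_datetime(
    raw["Earnings Date"].str.split(" at").str[0], errors="coerce"
)

for col in ["EPS Estimate", "Surprise (%)"]:
    # Strip everything except digits, the decimal point and the minus sign;
    # a lone "-" placeholder cannot be parsed and becomes NaN.
    cleaned = raw[col].str.replace("[^-.0-9]", "", regex=True)
    raw[col] = pd.to_numeric(cleaned, errors="coerce")

print(raw)
```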

2. Download complete historical price data using yfinance

In [46]:
amzn = yf.download('AMZN', start='1997-05-15')['Close'].reset_index()
amzn.columns = ['Date', 'Price']

[*********************100%***********************]  1 of 1 completed


3. Calculate 2-day percentage changes for all historical dates: for each sequence of 3 consecutive trading days (Day 1, Day 2, Day 3), compute the *return* as Close_Day3 / Close_Day1 - 1. (Assume Day 2 may correspond to the earnings announcement.)

In [47]:
# Row t holds Close[t+2] / Close[t] - 1, i.e. the return is stored on Day 1 of each 3-day window
amzn['2_day_pct_change'] = amzn['Price'].shift(-2) / amzn['Price'] - 1
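
Where the 2-day return is stored matters once it is matched to earnings dates. With `shift(-2)` the value sits on Day 1 of the window; if the earnings date is meant to be Day 2, a centred version using `shift(-1)` and `shift(1)` places the same kind of return on the middle day instead. The sketch below compares the two on invented prices; the column names `fwd_2d` and `centred_2d` are hypothetical.

```python
import pandas as pd

# Invented closes on five consecutive trading days.
px = pd.DataFrame({
    "Date": pd.to_datetime(["2025-01-06", "2025-01-07", "2025-01-08",
                            "2025-01-09", "2025-01-10"]),
    "Price": [100.0, 102.0, 110.0, 121.0, 120.0],
})

# Anchored on Day 1: row t holds Close[t+2] / Close[t] - 1.
px["fwd_2d"] = px["Price"].shift(-2) / px["Price"] - 1
# Centred on Day 2: row t holds Close[t+1] / Close[t-1] - 1.
px["centred_2d"] = px["Price"].shift(-1) / px["Price"].shift(1) - 1

# Suppose earnings were announced on 2025-01-08.
row = px.set_index("Date").loc[pd.Timestamp("2025-01-08")]
print(row[["fwd_2d", "centred_2d"]])
```

For the 8 January row the first column compares 10 January with 8 January, while the second compares 9 January with 7 January, so the two conventions can produce noticeably different medians.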

4. Identify positive earnings surprises (where "actual EPS > estimated EPS"). Both fields should be present in the file. You should obtain 36 data points for use in the descriptive analysis (median) later. 

In [49]:
positive_surprises = df_4[(df_4['Reported EPS'] > df_4['EPS Estimate']) | (df_4['Surprise (%)'] > 0)].copy()

5. Calculate 2-day percentage changes following positive earnings surprises. Show your answer in %, rounded to two decimal places: *return* * 100.0

In [50]:
dates_to_consider = positive_surprises['Earnings Date'].tolist()
amzn_filtered = amzn[amzn['Date'].isin(dates_to_consider)].copy()
amzn_filtered['2_day_pct_change'].median()*100

0.2672266474036067

In [51]:
merged = pd.merge_asof(
    df_4.sort_values('Earnings Date'),
    amzn.sort_values('Date'),
    left_on='Earnings Date',
    right_on='Date',
    direction='forward'
)

positive_surprises = merged[
    ((merged['Reported EPS'] > merged['EPS Estimate']) | (merged['Surprise (%)'] > 0)) &
    merged['2_day_pct_change'].notna()
].copy()

median_positive = positive_surprises['2_day_pct_change'].median()
median_all = amzn['2_day_pct_change'].median()

median_positive, median_all

(0.002672266474036067, 0.0016581674487468057)
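
The two cells above differ in how they match earnings dates to price rows: `isin` keeps only exact date matches, while `pd.merge_asof` with `direction='forward'` maps each earnings date to the same or the next available trading day. A minimal sketch on invented data shows the forward match:

```python
import pandas as pd

# Invented price rows; 2025-01-04 and 2025-01-05 are a weekend with no data.
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2025-01-03", "2025-01-06", "2025-01-07"]),
    "Price": [10.0, 11.0, 12.0],
})
# Invented event dates: the first falls on a Saturday, the second on a trading day.
events = pd.DataFrame({
    "Earnings Date": pd.to_datetime(["2025-01-04", "2025-01-07"]),
})

matched = pd.merge_asof(
    events.sort_values("Earnings Date"),
    prices.sort_values("Date"),
    left_on="Earnings Date",
    right_on="Date",
    direction="forward",  # same day if present, otherwise the next trading day
)
print(matched)

# Exact matching would keep only the 2025-01-07 price row.
exact = prices[prices["Date"].isin(events["Earnings Date"])]
print(len(exact))
```

Both keys must be sorted and free of missing values for `merge_asof`, which is why the inputs are sorted before the call.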