
Questions to be considered for the project 

Which industries performed best over a given time period?  
Are there trends in stock price movements by quarter, year or decade?  
Can we predict stock price trends using historical data?  
What impact do certain events have on stock prices, e.g. the US election, end-of-quarter performance announcements by companies, press releases etc.?

In [63]:
# importing libraries

import pandas as pd
import yfinance as yf
import seaborn as sns
import numpy as np 
import matplotlib.pyplot as plt

In [64]:
# creating a dictionary to define the industries and the companies that will be used for the project
industries = {
    'Technology': ['AAPL', 'MSFT', 'NVDA'], 
    'Quantum Computing': ['IONQ', 'RGTI', 'QBTS'],
    'Electric Vehicles': ['TSLA', 'RIVN', 'LCID'],
    'Renewable Energy': ['NEE', 'FSLR', 'ENPH']
}

In [65]:
# creating dictionaries to store data for different time periods
industry_data = {
    '5y': {},  # mid-term trends (5 years)
    '1y': {},  # quarterly/yearly movements (1 year)
    'event_specific': {},  # specific date range for event analysis (2024 US election)
    '10y': {},  # long-term trends (10 years or max)
}

## Checking and cleaning the data

Some of the industries being considered for the project are relatively new.  
The date range available for each of the chosen companies will therefore be checked to see whether data is available for the last 5 years.

In [82]:
print('Company:\tData starts:\tLatest date:\t5yr available?')
print('------------------------------------------------------------------')

# checking the start date of the data for the chosen companies
for industry, companies in industries.items():
    for company in companies:
        ticker = yf.Ticker(company)
        hist = ticker.history(period='max')  # fetching the maximum available data
        # earliest and latest dates in the index, formatted as strings so that only the date (not the timestamp) is shown
        start_date = hist.index.min().strftime('%Y-%m-%d')
        latest_date = hist.index.max().strftime('%Y-%m-%d')
        # the data must start on or before 2019-12-27 (5 years before the latest date pulled) to cover the last 5 years
        if pd.to_datetime(start_date) > pd.Timestamp('2019-12-27'):
            five_yr_available = 'No'
        else:
            five_yr_available = 'Yes'

        # printing the company, the date the data is available from, the latest date pulled and whether the last 5 years are available
        print(f"{company}\t\t{start_date}\t{latest_date}\t{five_yr_available}")


Company:	Data starts:	Latest date:	5yr available?
------------------------------------------------------------------
AAPL		1980-12-12	2024-12-27	Yes
MSFT		1986-03-13	2024-12-27	Yes
NVDA		1999-01-22	2024-12-27	Yes
IONQ		2021-01-04	2024-12-27	No
RGTI		2021-04-22	2024-12-27	No
QBTS		2020-12-11	2024-12-27	No
TSLA		2010-06-29	2024-12-27	Yes
RIVN		2021-11-10	2024-12-27	No
LCID		2020-09-18	2024-12-27	No
NEE		1973-02-21	2024-12-27	Yes
FSLR		2006-11-17	2024-12-27	Yes
ENPH		2012-03-30	2024-12-27	Yes


From the above we can see that several companies do not have 5 years' worth of data available.  
These are all three quantum computing companies and two of the three electric vehicle companies (RIVN and LCID), which is not surprising as the technology in these industries is relatively new compared with some of the long-established companies/industries in the stock market.  

With this information, only the technology and renewable energy industries will be used for mid- to long-term analysis.
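
The check above compares each start date against a hard-coded cutoff (2019-12-27), which goes stale when the notebook is rerun later. As a minimal sketch, the cutoff could instead be derived from a reference date. The helper name `has_full_history` and its parameters are hypothetical, and the date ranges below are synthetic stand-ins for the real `hist.index` values returned by yfinance.

```python
import pandas as pd

def has_full_history(index, years=5, as_of=None):
    """Return True if `index` starts on or before `as_of` minus `years` years."""
    if len(index) == 0:
        return False
    first = index.min()
    # price history indexes are timezone-aware; drop the timezone so the
    # comparison with a naive cutoff date works
    if first.tzinfo is not None:
        first = first.tz_localize(None)
    # default to today's date when no reference date is given
    as_of = pd.Timestamp(as_of) if as_of is not None else pd.Timestamp.today().normalize()
    cutoff = as_of - pd.DateOffset(years=years)
    return bool(first <= cutoff)

# synthetic indexes using the start dates reported for AAPL and IONQ
aapl_like = pd.date_range('1980-12-12', '2024-12-27', freq='B', tz='America/New_York')
ionq_like = pd.date_range('2021-01-04', '2024-12-27', freq='B', tz='America/New_York')

print(has_full_history(aapl_like, as_of='2024-12-27'))  # True
print(has_full_history(ionq_like, as_of='2024-12-27'))  # False
```

Passing `as_of` explicitly keeps the result reproducible; leaving it out uses the date on which the cell is run.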

In [71]:

# looping through each industry and getting the data for multiple time periods
for industry, companies in industries.items():
    industry_data['5y'][industry] = {}  # storing 5-year data
    industry_data['1y'][industry] = {}  # storing 1-year data
    industry_data['event_specific'][industry] = {}  # storing event-specific data
    industry_data['10y'][industry] = {}  # storing 10-year data
    for company in companies:
        # fetching 1 year of data for quarterly/yearly analysis
        hist_1y = yf.Ticker(company).history(period='1y')
        industry_data['1y'][industry][company] = hist_1y

        # fetching event-specific data (example: US election dates)
        hist_event = yf.Ticker(company).history(start='2023-12-26', end='2024-12-26')  # one-year window covering the build-up to and aftermath of the 2024 US election
        industry_data['event_specific'][industry][company] = hist_event

        # only doing mid- to long-term analysis for the technology and renewable energy industries, as the quantum computing
        # and electric vehicle companies do not have data covering a long enough period to analyse the industry overall
        if industry in ('Technology', 'Renewable Energy'):
            # Fetch 10 years of data for trend predictions
            hist_10y = yf.Ticker(company).history(period='10y')
            industry_data['10y'][industry][company] = hist_10y

            # Fetch 5 years of data for general trends
            hist_5y = yf.Ticker(company).history(period='5y')
            industry_data['5y'][industry][company] = hist_5y

print(industry_data)

{'5y': {'Technology': {'AAPL':                                  Open        High         Low       Close  \
Date                                                                        
2019-12-27 00:00:00-05:00   70.558937   71.249695   69.831825   70.239006   
2019-12-30 00:00:00-05:00   70.156601   70.939461   69.128952   70.655884   
2019-12-31 00:00:00-05:00   70.270515   71.179405   70.171143   71.172134   
2020-01-02 00:00:00-05:00   71.799888   72.856628   71.545402   72.796036   
2020-01-03 00:00:00-05:00   72.020447   72.851776   71.862907   72.088310   
...                               ...         ...         ...         ...   
2024-12-20 00:00:00-05:00  248.039993  255.000000  245.690002  254.490005   
2024-12-23 00:00:00-05:00  254.770004  255.649994  253.449997  255.270004   
2024-12-24 00:00:00-05:00  255.490005  258.209991  255.289993  258.200012   
2024-12-26 00:00:00-05:00  258.190002  260.100006  257.630005  259.019989   
2024-12-27 00:00:00-05:00  257.899994  258.70
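
Printing the whole nested dictionary produces output that is too long to read (it is cut off above). A compact alternative, sketched here under assumptions, is to list one row per period, industry and company with the number of rows and the date range fetched. The helper name `summarise_history` is hypothetical, and `demo_data` is a small synthetic stand-in with the same nesting as `industry_data`, not real price data.

```python
import pandas as pd

def summarise_history(industry_data):
    """Build one summary row per (period, industry, company) entry."""
    rows = []
    for period, by_industry in industry_data.items():
        for industry, by_company in by_industry.items():
            for company, hist in by_company.items():
                rows.append({
                    'Period': period,
                    'Industry': industry,
                    'Company': company,
                    'Rows': len(hist),
                    # guard against empty frames, which have no min/max date
                    'First date': hist.index.min().date() if len(hist) else None,
                    'Last date': hist.index.max().date() if len(hist) else None,
                })
    columns = ['Period', 'Industry', 'Company', 'Rows', 'First date', 'Last date']
    return pd.DataFrame(rows, columns=columns)

# synthetic example: three business days of made-up closing prices
idx = pd.date_range('2024-01-02', periods=3, freq='B')
demo_data = {
    '1y': {'Technology': {'AAPL': pd.DataFrame({'Close': [1.0, 2.0, 3.0]}, index=idx)}},
    '5y': {'Quantum Computing': {}},  # industries left empty simply produce no rows
}
summary = summarise_history(demo_data)
print(summary)
```

In the notebook, `summarise_history(industry_data)` would give a quick check that each period holds the companies and date ranges expected.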

In [68]:
# creating an empty list to store summary statistics for each industry
industry_summary = []

# fetching the data for each industry by looping through the key value pairs returned from the .items() method
for industry, companies in industries.items():
    data = yf.Tickers(' '.join(companies))  # .Tickers() function is used to fetch the data for all of the companies in the list within the industry for that iteration of the for loop
    # creating lists to store the  market capitalisations, P/E ratios and dividend yields per company
    market_caps = []
    pe_ratios = []
    dividend_yields = []

    # getting the data for each company within the industry specified for the parent for loop
    for company in companies:
        info = data.tickers[company].info  # retrieving detailed information about the company

        # Adding the market cap, P/E ratio and dividend yield for the current company to the lists defined above
        # getting the market cap and trailing P/E ratio, defaulting to 0 if data is missing
        # (a missing value is therefore counted as 0 and pulls the industry average down)
        market_caps.append(info.get('marketCap', 0))
        pe_ratios.append(info.get('trailingPE', 0))
        # calculating the dividend yield as a percentage, using 0 if it is missing
        dividend_yields.append((info.get('dividendYield') or 0) * 100)

    # calculating the averages for the market cap, P/E ratio and dividend yield for the industry using the values added to the 3 lists for each company
    # first the 3 values are added together then the length of list is used to divide the sum giving the average for each industry
    avg_market_cap = sum(market_caps) / len(market_caps)
    avg_pe_ratio = sum(pe_ratios) / len(pe_ratios)
    avg_dividend_yield = sum(dividend_yields) / len(dividend_yields)

    # adding the results for the industry to the summary list
    industry_summary.append({
        'Industry': industry,  # industry name
        'Avg Market Cap ($B)': round(avg_market_cap / 1e9, 2),  # average market cap in billions
        'Avg P/E Ratio': round(avg_pe_ratio, 2),  # average price-to-earnings ratio
        'Avg Dividend Yield (%)': round(avg_dividend_yield, 2)  # Average dividend yield percentage
    })

# creating a DataFrame to store and display results 
summary_df = pd.DataFrame(industry_summary)  # converting summary list into a DataFrame
print(summary_df)  # displaying the summary table


            Industry  Avg Market Cap ($B)  Avg P/E Ratio  \
0         Technology              3460.73          43.74   
1  Quantum Computing                 5.84           0.00   
2  Electric Vehicles               471.57          39.80   
3   Renewable Energy                59.04          64.54   

   Avg Dividend Yield (%)  
0                    0.39  
1                    0.00  
2                    0.00  
3                    0.96  
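
Because missing values default to 0 in the cell above, they are averaged in as if they were real figures. The 0.00 average P/E for Quantum Computing is consistent with none of its three companies reporting a trailing P/E, rather than with a genuine ratio of zero. As a hedged sketch of one alternative, missing entries could be skipped when averaging. The helper name `mean_ignoring_missing` is hypothetical and the numbers below are made up for illustration.

```python
import numpy as np

def mean_ignoring_missing(values):
    """Average the usable numbers in `values`, skipping None and NaN.

    Returns NaN when there is nothing usable to average.
    """
    usable = [v for v in values if v is not None and not np.isnan(v)]
    return float(np.mean(usable)) if usable else float('nan')

# made-up P/E ratios: the second company has no trailing P/E
pe_ratios = [30.0, None, 60.0]

zero_filled = [v if v is not None else 0 for v in pe_ratios]
print(sum(zero_filled) / len(zero_filled))   # 30.0, missing value counted as 0
print(mean_ignoring_missing(pe_ratios))      # 45.0, missing value skipped
print(mean_ignoring_missing([None, None]))   # nan, nothing to average
```

Reporting NaN for an industry with no usable values makes the gap visible in the summary table instead of showing it as 0.00.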
