# About this notebook

**Requirements:**
1. This notebook requires a Cohere API key. Save it as `COHERE_API_KEY` in Colab secrets, or assign it directly in the code section "Create empty vector for adding documents".
2. The library versions this notebook was built with are pinned in the first code section.

**Working of this notebook:**
1. The input is a dictionary named `tickers` that maps stock tickers to company names. Code section: "MAIN".
2. The notebook downloads the 10-K filings of each company in `tickers` and extracts accession numbers from the names of the downloaded filing directories. Code section: "Get accession numbers and download excel files".
3. The CIK of each ticker is fetched in the code section "Get CIK from stock ticker".
4. Using the CIK and accession numbers, the Excel files (`Financial_Report.xlsx`) of each filing are downloaded.
5. Each file is loaded into a dataframe, processed, and converted to LangChain documents.
6. The documents are embedded into a FAISS vector store, which is then used to answer queries with Cohere.
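
The Excel download step depends on a link assembled from the CIK and an accession number. A minimal sketch of that construction follows; `build_excel_link` is a hypothetical helper name, and the CIK and accession number are placeholders rather than a real filing.

```python
def build_excel_link(cik: str, accession_no: str) -> str:
    """Return the Financial_Report.xlsx URL for one filing (hypothetical helper)."""
    # EDGAR archive paths use the accession number with the dashes removed
    folder = accession_no.replace("-", "")
    return f"https://www.sec.gov/Archives/edgar/data/{cik}/{folder}/Financial_Report.xlsx"

# Placeholder values, for illustration only
example_link = build_excel_link("0001234567", "0001234567-24-000001")
print(example_link)
```

The same pattern appears inline in the "Execute data processing and vector creation" section.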

In [1]:
# @title Install required libraries along with their correct version at the time of project creation

!pip install -q sec_edgar_downloader==5.0.3 faiss-cpu==1.11.0.post1 langchain==0.3.27 langchain-community==0.3.27
!pip install -q langchain-cohere==0.4.4 cohere==5.15.0

# The libraries below are preinstalled in Colab
# !pip install -q requests==2.32.3
# !pip install -q numpy==2.0.2
# !pip install -q pandas==2.2.2
# !pip install -q google.colab==1.0.0
# !pip install -q IPython==7.34.0

In [2]:
# @title Imports

import os
import shutil
import requests
import numpy as np
import pandas as pd
from google.colab import userdata
from IPython.display import display
from sec_edgar_downloader import Downloader

# LangChain & LangChain Community
import faiss
from langchain.chains import RetrievalQA
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS

# Cohere integrations
from langchain_cohere import ChatCohere
from langchain_cohere import CohereEmbeddings

In [3]:
# @title Delete all files in a directory: delete_files_in_folder(folder_path)

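def delete_files_in_folder(folder_path):
    # Remove every file and sub-directory inside folder_path, keeping the folder itself
    for item in os.listdir(folder_path):
        item_path = os.path.join(folder_path, item)
        if os.path.isdir(item_path):
            shutil.rmtree(item_path)
        else:
            os.remove(item_path)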

In [4]:
# @title Download SEC html/txt files -> download_sec_files(ticker)

# Header details for downloading data
USER_AGENT_COMPANY_NAME = "Company"
USER_AGENT_EMAIL = "company.mail@company.com"

# Set up the download folder and the downloader used to fetch 10-K filings
sec_folder = "/content/sec_filings/"
sec_downloader = Downloader(USER_AGENT_COMPANY_NAME, USER_AGENT_EMAIL, download_folder = sec_folder)

os.makedirs(sec_folder, exist_ok=True)

def download_sec_files(ticker):
    try:
        sec_downloader.get("10-K", ticker, limit=25)  # up to the 25 most recent filings (usually 1 per year)
        return True
    except Exception as e:
        print(f"Could not download 10-K filings for {ticker}: {e}")
        return False

In [5]:
# @title Get CIK from stock ticker -> get_cik_from_ticker(ticker)

def get_cik_from_ticker(ticker):
    url = "https://www.sec.gov/files/company_tickers.json"
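    headers = {"User-Agent": "Your Name Contact@example.com"}  # SEC requires a descriptive User-Agent; replace with your own name and email

    response = requests.get(url, headers=headers)
    response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
    data = response.json()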

    for entry in data.values():
        if entry["ticker"].lower() == ticker.lower():
            return str(entry["cik_str"]).zfill(10)  # pad with zeros to make 10 digits

    return None
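
The lookup above can be tried offline. The sketch below runs the same matching and zero-padding logic on a small sample shaped like `company_tickers.json`; `lookup_cik` and `SAMPLE_TICKERS` are hypothetical names introduced only for this illustration.

```python
from typing import Optional

# Small sample in the same shape as company_tickers.json (illustrative subset)
SAMPLE_TICKERS = {
    "0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."},
    "1": {"cik_str": 789019, "ticker": "MSFT", "title": "Microsoft Corp"},
}

def lookup_cik(data: dict, ticker: str) -> Optional[str]:
    """Return the zero-padded 10-digit CIK for ticker, or None if it is absent."""
    wanted = ticker.lower()
    for entry in data.values():
        if entry["ticker"].lower() == wanted:
            return str(entry["cik_str"]).zfill(10)
    return None

print(lookup_cik(SAMPLE_TICKERS, "aapl"))  # 0000320193
print(lookup_cik(SAMPLE_TICKERS, "XXXX"))  # None
```

Because `get_cik_from_ticker` fetches the full JSON on every call, loading it once and passing it to a helper like this may be worth considering when many tickers are processed.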


In [6]:
# @title Get accession numbers and download excel files -> get_acc_nos(folder_path) & download_excel_file(excel_link, excel_file_path)

headers = {
    # Reuse the identity declared in the download section instead of a made-up one
    'User-Agent': f'{USER_AGENT_COMPANY_NAME} {USER_AGENT_EMAIL}',
    'Accept': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
}

# Get accession numbers
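def get_acc_nos(folder_path):
    # Each filing is saved in a sub-directory named after its accession number.
    # Heuristic: return the first directory level that holds more than 5 sub-directories;
    # a company with 5 or fewer downloaded filings therefore yields None.
    for _, dirnames, _ in os.walk(folder_path):
        if len(dirnames) > 5:
            return tuple(dirnames)

    return None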

# Download excel file
def download_excel_file(excel_link, excel_file_path):
    try:
        response = requests.get(excel_link, headers=headers)
        response.raise_for_status()
        with open(excel_file_path, 'wb') as f:
            f.write(response.content)
        return True
    except requests.exceptions.RequestException as e:
        return False
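
`get_acc_nos` identifies the accession-number directories by counting sub-directories. An alternative sketch, not the notebook's method, matches directory names against the accession-number pattern (10 digits, 2 digits, 6 digits) so that companies with only a few filings are also picked up. `find_accession_dirs` is a hypothetical name, and the demo directory tree uses placeholder names and an assumed layout.

```python
import os
import re
import tempfile

# Accession numbers have the form: 10 digits, 2 digits, 6 digits
ACCESSION_RE = re.compile(r"^\d{10}-\d{2}-\d{6}$")

def find_accession_dirs(folder_path: str) -> tuple:
    """Collect sub-directory names under folder_path that look like accession numbers."""
    found = set()
    for _, dirnames, _ in os.walk(folder_path):
        found.update(name for name in dirnames if ACCESSION_RE.match(name))
    return tuple(sorted(found))

# Demo on a temporary tree; the intermediate folder names are an assumed layout
with tempfile.TemporaryDirectory() as root:
    for name in ("0001234567-23-000001", "0001234567-24-000002"):
        os.makedirs(os.path.join(root, "sec-edgar-filings", "DEMO", "10-K", name))
    demo_result = find_accession_dirs(root)

print(demo_result)
```

This variant returns an empty tuple rather than `None` when nothing matches, which lets the caller loop over the result without an extra check.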


In [7]:
# @title Convert excel data to dataframe -> excel_to_df(excel_file_path)
def excel_to_df(excel_file_path: str):

    # Description of sheet names that need to be read from excel files
    sheet_name_desc1 = 'consolidated statements of oper'
    sheet_name_desc2 = 'consolidated statements of incom'

    try:
        matched_sheet = ''
        excel_content = pd.ExcelFile(excel_file_path)
        for sheet_name in excel_content.sheet_names:
            if sheet_name_desc1 in sheet_name.lower() or sheet_name_desc2 in sheet_name.lower():
              matched_sheet = sheet_name
              break
        df = pd.read_excel(excel_file_path, sheet_name = matched_sheet, engine = 'openpyxl')

    except Exception as e:
        df = f'Could not read the excel file {excel_file_path}!'

    return df

In [8]:
# @title Correct the values in dataframe according to share and usd multipliers -> process_df_values(df, share_multiplier, usd_multiplier)

def process_df_values(df: pd.DataFrame, share_multiplier: int, usd_multiplier: int) -> pd.DataFrame:
    pre_string = ''
    nan_index, text_index = [], []

    # Some common terms to be taken care of when converting dataframe values
    common_averages = ["in share", "average share", "average common share"]
    usd_terms = ["in usd", "in dollars", "income per share"]

    for index, row in df.iterrows():
        try:
            # Get row description for appropriate value conversion
            row_desc = row.loc['Description'].lower()
            if row_desc[-1] == ":":
                row_desc = row_desc[:-1]
        except Exception:  # description is missing, empty, or not a string
            continue

        if row.isnull().all():
          pre_string = ''
          nan_index.append(index)

        elif row.iloc[2:].isnull().all():
          if index - 1 in nan_index:
            pre_string = pre_string + " | " + row_desc
          else:
            pre_string = row_desc
          nan_index.append(index)

        else:
            try:
                # If value can't be converted to float, then skip that row
                float(row.iloc[2])

                if pre_string:
                    df.iloc[index, 0] = pre_string

                for col in df.columns[2:]:
                    val = row[col]
                    if any(common_average in row_desc for common_average in common_averages) or any(common_average in pre_string for common_average in common_averages):
                        val = float(val) * share_multiplier
                        val = str(val) + " shares"
                    else:
                        if all(usd_term not in row_desc and usd_term not in pre_string for usd_term in usd_terms):
                            val = float(val) * usd_multiplier
                            if abs(val) >= 1000000000:
                                val = str(val/1000000000) + " billion"
                            elif abs(val) >= 1000000:
                                val = str(val/1000000) + " million"

                        val = "USD " + str(val)

                    df.at[index, col] = val
            except Exception:  # the row holds text rather than numbers
                text_index.append(index)
                continue

    # Drop rows that contain text and not numbers
    if text_index:
      df = df.drop(text_index)

    # Drop empty rows
    df = df.dropna(subset=df.columns[1:])

    return df.reset_index(drop = True)
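
The value formatting inside `process_df_values` is easier to follow in isolation. The sketch below reproduces only the USD branch: scale by the multiplier, then express large magnitudes in billions or millions. `format_usd` is a hypothetical helper, and it deliberately omits the share-count and per-share cases handled above.

```python
def format_usd(value: float, usd_multiplier: int = 1) -> str:
    """Scale a raw sheet value and render it as the notebook's documents phrase it."""
    val = float(value) * usd_multiplier
    if abs(val) >= 1_000_000_000:
        return f"USD {val / 1_000_000_000} billion"
    if abs(val) >= 1_000_000:
        return f"USD {val / 1_000_000} million"
    return f"USD {val}"

print(format_usd(12202, usd_multiplier=1_000_000))  # USD 12.202 billion
print(format_usd(-747, usd_multiplier=1_000_000))   # USD -747.0 million
print(format_usd(5.25))                             # USD 5.25
```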

In [9]:
# @title Set column headings and find share and usd multipliers -> process_dataframe(df)

def process_dataframe(df: pd.DataFrame) -> pd.DataFrame:

    # Identify multipliers from the first row of the sheet
    multiplier_identifier = df.columns[0].lower()

    share_multiplier = 1
    if 'shares in billion' in multiplier_identifier:
        share_multiplier = 1000000000
    elif 'shares in million' in multiplier_identifier:
        share_multiplier = 1000000
    elif 'shares in thousand' in multiplier_identifier:
        share_multiplier = 1000

    usd_multiplier = 1
    if '$ in billion' in multiplier_identifier:
        usd_multiplier = 1000000000
    elif '$ in million' in multiplier_identifier:
        usd_multiplier = 1000000
    elif '$ in thousand' in multiplier_identifier:
        usd_multiplier = 1000

    # Rename columns with None heading for proper column renaming
    df.columns = [f'new_column_{i}' if col is None else col for i, col in enumerate(df.columns)]

    # Replace empty dataframe values with NaN for better processing
    df = df.map(lambda x: np.nan if (x == '' or x is None or (isinstance(x, str) and x.strip() == "")) else x)

    # Delete columns with very few unique values, as they do not contain useful data
    for column in df.columns:
        if df[column].nunique() < 5:
            df = df.drop(columns=[column])

    # Rename columns with their respective years
    df = df.rename(columns = {df.columns[0]: "Description",
                              df.columns[1]: int(df.iloc[0,1][-4:]),
                              df.columns[2]: int(df.iloc[0,2][-4:]),
                              df.columns[3]: int(df.iloc[0,3][-4:])})

    # Remove the first row, as it is no longer required, and add a new column 'Category' to the dataframe
    df = df[1:]
    df.insert(0, 'Category', '')
    df = df.reset_index(drop = True)      # To make sure indices are in sync

    return process_df_values(df = df, share_multiplier = share_multiplier, usd_multiplier = usd_multiplier)
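
The multiplier detection at the top of `process_dataframe` can be expressed as a table lookup. The following is a sketch under the same assumption the notebook makes, namely that the sheet header contains phrases such as "shares in thousands" or "$ in millions". `parse_multipliers` and `SCALES` are hypothetical names, and the header string is an illustrative example rather than text from a specific filing.

```python
SCALES = {"billion": 1_000_000_000, "million": 1_000_000, "thousand": 1_000}

def parse_multipliers(header: str) -> tuple:
    """Return (share_multiplier, usd_multiplier) read from a sheet header string."""
    text = header.lower()
    share_multiplier = usd_multiplier = 1
    for word, factor in SCALES.items():
        # Keep the first (largest) scale that matches, mirroring the if/elif chain
        if share_multiplier == 1 and f"shares in {word}" in text:
            share_multiplier = factor
        if usd_multiplier == 1 and f"$ in {word}" in text:
            usd_multiplier = factor
    return share_multiplier, usd_multiplier

header = "CONSOLIDATED STATEMENTS OF OPERATIONS - USD ($) shares in Thousands, $ in Millions"
print(parse_multipliers(header))  # (1000, 1000000)
```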


In [10]:
# @title Convert dataframe to langchain Documents -> df_to_document(df, company_name, company_years)

def df_to_document(df: pd.DataFrame, company_name: str, company_years: list):
    documents = []

    # Delete all columns with years for which parsing is already done
    drop_cols = []
    for column in df.columns[2:]:
      if int(column) in company_years:
        drop_cols.append(column)
      else:
        company_years.append(int(column))

    df = df.drop(columns=drop_cols)

    # Convert each dataframe row to text for embedding
    for index, row in df.iterrows():
      if row.iloc[0]:
        row_desc = f"For {company_name} {row.iloc[1]} in {row.iloc[0]} category for year"
      else:
        row_desc = f"For {company_name} {row.iloc[1]} for year"

      for column in df.columns[2:]:
        metadata = {"company": company_name, "year": column}
        documents.append(Document(page_content = f"{row_desc} {column} is {row[column]}.", metadata = metadata))

    return [documents, company_years]
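
Each statement row becomes one short sentence per year, and those sentences are what get embedded. The sketch below shows the sentence construction with plain dictionaries standing in for LangChain `Document` objects; `row_to_sentences` is a hypothetical helper, and the company name and figures are placeholders.

```python
def row_to_sentences(company, category, description, values_by_year):
    """Build one sentence per year for a single statement row (hypothetical helper)."""
    if category:
        prefix = f"For {company} {description} in {category} category for year"
    else:
        prefix = f"For {company} {description} for year"
    return [
        {
            "page_content": f"{prefix} {year} is {value}.",
            "metadata": {"company": company, "year": year},
        }
        for year, value in values_by_year.items()
    ]

# Placeholder company and figures, for illustration only
docs = row_to_sentences(
    "ExampleCo", "", "net income",
    {2023: "USD 1.5 billion", 2022: "USD 900.0 million"},
)
for d in docs:
    print(d["page_content"])
```

Keeping the company and year in the metadata as well as in the sentence leaves room for metadata-based filtering at retrieval time.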


In [11]:
# @title Create empty vector for adding documents

# Embedding and Cohere Configuration
cohere_api_key = userdata.get('COHERE_API_KEY')
embeddings = CohereEmbeddings(cohere_api_key=cohere_api_key, model="embed-english-v3.0")

# The store is seeded with a single placeholder document so it can be created before any filings are parsed
vectorstore = FAISS.from_documents([Document(page_content="dummy")], embeddings)

In [12]:
# @title Execute data processing and vector creation -> execute_process(tickers)

def execute_process(tickers):
  for ticker, company_name in tickers.items():

      # Delete any existing files/folders in the directory
      delete_files_in_folder(sec_folder)

      # Get CIK from ticker name
      cik = get_cik_from_ticker(ticker)

      # Download sec files for extracting accession numbers
      downloaded = download_sec_files(ticker = ticker)

      if cik is None or downloaded is False:
          continue

      # Get accession numbers from downloaded files (skip the company if none are found)
      accession_nos = get_acc_nos(sec_folder)
      if accession_nos is None:
          continue

      delete_files_in_folder(sec_folder)

      # To store langchain Documents and years parsed of the company
      docs = []
      company_years = []

      # Construct excel links from all the extracted data and convert them to langchain documents
      for acc_no in accession_nos:
          excel_link = f"https://www.sec.gov/Archives/edgar/data/{cik}/{acc_no.replace('-','')}/Financial_Report.xlsx"
          excel_file_path = f"{sec_folder}{company_name}_Financial_20{acc_no.split('-')[1]}.xlsx"

          downloaded = download_excel_file(excel_link, excel_file_path)

          if downloaded is False:
              continue

          # Convert excel to dataframe
          df = excel_to_df(excel_file_path = excel_file_path)
          if not isinstance(df, pd.DataFrame):
              continue

          df = process_dataframe(df = df)

          # print(excel_file_path)
          # display(df)
          # print("\n\n")
          doc, company_years = df_to_document(df = df, company_name = company_name, company_years = company_years)
          docs.extend(doc)

      print("Years for which data is downloaded")
      print(f"{company_name}: {sorted(company_years)}")
      vectorstore.add_documents(docs)

In [13]:
# @title MAIN

# All stock tickers
tickers = {
    "A": "Agilent Technologies", "AAPL": "Apple", "ABBV": "AbbVie", "ABT": "Abbott Laboratories",
    "ADBE": "Adobe", "ADP": "Automatic Data Processing", "ADM": "Archer Daniels Midland",
    "AEP": "American Electric Power", "AFL": "Aflac", "AIG": "American International Group",
    "ALL": "Allstate", "AMGN": "Amgen", "AMT": "American Tower", "AMZN": "Amazon",
    "AXP": "American Express", "BA": "Boeing", "BAX": "Baxter International", "BBY": "Best Buy",
    "BDX": "Becton Dickinson", "BIIB": "Biogen", "BMY": "Bristol-Myers Squibb",
    "BRK-B": "Berkshire Hathaway", "BURL": "Burlington Stores", "C": "Citigroup", "CB": "Chubb",
    "CI": "Cigna", "CL": "Colgate-Palmolive", "CME": "CME Group", "CNC": "Centene",
    "COP": "ConocoPhillips", "COST": "Costco", "CSCO": "Cisco Systems", "CSX": "CSX Corporation",
    "CTSH": "Cognizant Technology Solutions", "CTVA": "Corteva", "CVS": "CVS Health",
    "CVX": "Chevron", "D": "Dominion Energy", "DE": "Deere & Co.", "DIS": "Walt Disney",
    "DLR": "Digital Realty", "DOW": "Dow", "DUK": "Duke Energy", "EL": "Estée Lauder",
    "EMR": "Emerson Electric", "ETN": "Eaton Corporation", "EXC": "Exelon",
    "F": "Ford Motor Company", "FIS": "FIS", "GE": "General Electric", "GILD": "Gilead Sciences",
    "GM": "General Motors", "GOOGL": "Google", "GS": "Goldman Sachs", "HD": "Home Depot",
    "HCA": "HCA Healthcare", "HUM": "Humana", "IBM": "IBM", "INTC": "Intel", "INTU": "Intuit",
    "IP": "International Paper", "ISRG": "Intuitive Surgical", "JNJ": "Johnson & Johnson",
    "KHC": "Kraft Heinz", "KMB": "Kimberly-Clark", "KMI": "Kinder Morgan", "LHX": "L3Harris Technologies",
    "LLY": "Eli Lilly", "LMT": "Lockheed Martin", "LRCX": "Lam Research", "LULU": "Lululemon",
    "LVS": "Las Vegas Sands", "MA": "Mastercard", "MCD": "McDonald's", "MCK": "McKesson",
    "MDT": "Medtronic", "META": "Meta", "MMM": "3M", "MO": "Altria Group", "MRK": "Merck",
    "MRNA": "Moderna", "MS": "Morgan Stanley", "MSFT": "Microsoft", "NFLX": "Netflix",
    "NKE": "Nike", "NSC": "Norfolk Southern", "NVDA": "NVIDIA", "O": "Realty Income",
    "PEP": "PepsiCo", "PFE": "Pfizer", "PG": "Procter & Gamble", "PLD": "Prologis",
    "PGR": "Progressive", "PSX": "Phillips 66", "PYPL": "PayPal", "RMD": "ResMed",
    "RSG": "Republic Services", "RTX": "Raytheon Technologies", "SBUX": "Starbucks",
    "SCHW": "Charles Schwab", "SLB": "Schlumberger", "SPGI": "S&P Global", "STT": "State Street",
    "STZ": "Constellation Brands", "SYY": "Sysco", "SYK": "Stryker", "T": "AT&T", "TGT": "Target",
    "TMO": "Thermo Fisher Scientific", "TRV": "Travelers", "TSLA": "Tesla",
    "TSM": "Taiwan Semiconductor", "UNH": "UnitedHealth Group", "UNP": "Union Pacific",
    "USB": "U.S. Bancorp", "V": "Visa", "VLO": "Valero", "VTR": "Ventas", "VZ": "Verizon",
    "WBA": "Walgreens Boots Alliance", "WFC": "Wells Fargo", "WMB": "Williams Companies",
    "WMT": "Walmart", "XEL": "Xcel Energy", "XOM": "ExxonMobil", "ZBH": "Zimmer Biomet",
    "ZTS": "Zoetis"
}

# file_list = ['/content/Apple_Financial_2021.xlsx', '/content/Moderna_Financial_2025.xlsx', '/content/CVS_Health_Financial_2024.xlsx', '/content/Tesla_Financial_2023.xlsx']
# Subset of tickers used for this run (overrides the full dictionary above)
tickers = {"AAPL": "Apple", "MRNA": "Moderna", "CVS": "CVS Health", "TSLA": "Tesla"}

# Execute data processing and vectorization
execute_process(tickers)

# Retriever and question-answering chain configuration
retriever = vectorstore.as_retriever(search_kwargs={"k": 25})
llm = ChatCohere(cohere_api_key=cohere_api_key)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever, return_source_documents=False)


Years for which data is downloaded
Apple: [2022, 2021, 2020, 2023, 2019, 2018, 2017, 2015, 2014, 2013, 2016, 2024]
Years for which data is downloaded
Moderna: [2018, 2017, 2016, 2019, 2024, 2023, 2022, 2021, 2020]
Years for which data is downloaded
CVS Health: [2020, 2019, 2018, 2022, 2021, 2024, 2023, 2017, 2016]
Years for which data is downloaded
Tesla: [2021, 2020, 2019, 2018, 2017, 2015, 2014, 2013, 2024, 2023, 2022, 2016]


In [14]:
# @title Query 1 to Cohere

query = "Compare Moderna and Tesla's net income"
response = qa_chain.invoke({"query": query})

print("\nAnswer:")
print(response['result'])


Answer:
Based on the provided data, here’s a comparison of Moderna and Tesla's net income (or net loss) for the available years:

### **Moderna Net (Loss) Income in Operating Expenses Category:**
- **2020**: USD -747.0 million (loss)  
- **2021**: USD 12.202 billion (profit)  
- **2022**: USD 8.362 billion (profit)  
- **2023**: USD -4.714 billion (loss)  
- **2024**: USD -3.561 billion (loss)  

### **Tesla Net Income Attributable to Common Stockholders in Operating Expenses Category:**
- **2021**: USD 5.519 billion (profit)  
- **2022**: USD 12.556 billion (profit)  
- **2023**: USD 14.997 billion (profit)  
- **2024**: USD 7.091 billion (profit)  

### **Key Observations:**
1. **Profitability Trends:**  
   - Moderna experienced significant profits in 2021 and 2022, likely due to its COVID-19 vaccine sales, but shifted to losses in 2023 and 2024.  
   - Tesla has consistently reported profits from 2021 to 2024, with a peak in 2023 and a slight decline in 2024.  

2. **Scale of Prof

In [15]:
# @title Query 2 to Cohere

query = "What was CVS's net income in 2022?"
response = qa_chain.invoke({"query": query})

print("\nAnswer:")
print(response['result'])


Answer:
According to the provided information, CVS Health's net income in 2022 was USD 4.165 billion.


In [16]:
# @title Query 3 to Cohere

query = "What is income tax provision?"
response = qa_chain.invoke({"query": query})

print("\nAnswer:")
print(response['result'])


Answer:
Income tax provision is an estimate of the amount a company expects to pay in taxes for the current year, based on its taxable income. It is recorded on the company's financial statements, typically in the income statement, as an expense. The provision reflects the company's best estimate of its tax liability for the period, considering applicable tax laws, rates, and any adjustments for deferred taxes.

Here’s a breakdown of key aspects of income tax provision:

1. **Current Tax Expense**: This is the tax owed on the company's taxable income for the current year, calculated based on the applicable tax rates and regulations.

2. **Deferred Tax**: This accounts for temporary differences between the company's accounting income and taxable income. Deferred tax assets and liabilities arise from these differences and are adjusted over time as they reverse.

3. **Estimation**: Since tax laws can be complex, the provision is an estimate and may be adjusted in future periods if actual

In [17]:
# @title Query 4 to Cohere

query = "Compare the operating expenses across all companies"
response = qa_chain.invoke({"query": query})

print("\nAnswer:")
print(response['result'])


Answer:
Here’s a comparison of the **total operating expenses** across **Apple**, **Tesla**, and **Moderna** for the available years:

### **Apple**
| Year  | Total Operating Expenses (USD Billion) |
|-------|-----------------------------------------|
| 2013  | 15.305                                  |
| 2014  | 18.034                                  |
| 2015  | 22.396                                  |
| 2016  | 24.239                                  |
| 2017  | 26.842                                  |
| 2018  | 30.941                                  |
| 2019  | 34.462                                  |
| 2020  | 38.668                                  |
| 2021  | 43.887                                  |
| 2022  | 51.345                                  |
| 2023  | 54.847                                  |
| 2024  | 57.467                                  |

### **Tesla**
| Year  | Total Operating Expenses (USD Billion) |
|-------|-----------------------------------------|
| 201

In [18]:
# @title Query 5 to Cohere
query = "Compare moderna's profitability across years"
response = qa_chain.invoke({"query": query})

print("\nAnswer:")
print(response['result'])


Answer:
To compare Moderna's profitability across the years provided, we'll focus on key metrics such as **Net Income (Loss)**, **(Loss) Income from Operations**, and **(Loss) Income Before Income Taxes**. Here’s a breakdown of Moderna's profitability for the years 2016 to 2024:

### **Net Income (Loss)**
- **2016**: USD -216.211 million (Net Loss)  
- **2017**: USD -255.916 million (Net Loss)  
- **2018**: USD -384.734 million (Net Loss)  
- **2020**: USD -747.0 million (Net Loss)  
- **2021**: USD 12.202 billion (Net Income)  
- **2022**: USD 8.362 billion (Net Income)  
- **2023**: USD -4.714 billion (Net Loss)  
- **2024**: USD -3.561 billion (Net Loss)  

### **(Loss) Income from Operations**
- **2017**: USD -269.356 million (Loss)  
- **2018**: USD -413.266 million (Loss)  
- **2020**: USD -763.0 million (Loss)  
- **2021**: USD 13.296 billion (Income)  
- **2022**: USD 9.42 billion (Income)  
- **2023**: USD -4.239 billion (Loss)  
- **2024**: USD -3.945 billion (Loss)  

### *