# Project Goal

The objective of this project is to perform an in-depth analysis of corporate annual reports from various companies over multiple years. The analysis extracts and processes the textual content of these reports to derive insights into sentiment, narrative tone, recurring themes, and R&D focus.


### Detailed Aims and Methodologies

1. **In-depth Textual Data Extraction**
   - The first phase of the project is dedicated to the sophisticated extraction of textual data from corporate annual reports, typically available in PDF format. This involves parsing document structures to convert visual data into analyzable text, setting the stage for all subsequent analytical endeavors.

2. **Extraction and Analysis of R&D Sections**
   - Recognizing the critical role of Research and Development (R&D) in corporate growth and innovation, this project includes a focused extraction and analysis of R&D sections from the annual reports. By pinpointing these sections based on the table of contents, the analysis seeks to delve into the companies' commitments to innovation, R&D spending trends, and the strategic importance of R&D activities in their overall business strategy.
   
3. **Multifaceted Sentiment Analysis**
   - A significant emphasis is placed on dissecting the sentiment within the R&D annual reports. Employing a blend of traditional NLTK models and advanced transformer models, the project aims to capture a nuanced spectrum of sentiments, from explicit expressions to subtle tones that influence perceptions.

4. **Granular Keyword Analysis**
   - Delving into the specifics, the project conducts a detailed keyword analysis to identify and quantify the occurrence of specific positive and negative keywords within the reports. This granular analysis aims to gauge the reports' tonality, providing quantitative insights into the companies' narrative framing.

5. **Thematic Exploration through Topic Modeling**
   - Employing Latent Dirichlet Allocation (LDA), the project uncovers latent topics within the textual corpus, revealing the primary themes that pervade the corporate discourse. This thematic exploration is key to understanding the strategic priorities and operational challenges highlighted by companies.

6. **Visual Representation through Word Clouds**
   - To enhance data accessibility and comprehension, the project incorporates word clouds that visually highlight the most frequent words and themes, including separate visualizations for positive and negative keywords, offering an immediate sense of the reports' content and sentiment.

7. **Thematic Consistency Analysis using Cosine Similarity**
   - The project employs cosine similarity measures to assess thematic consistency or divergence across different years or among different companies. This analysis provides insights into the evolution or stability of strategic narratives over time in response to various factors.

This project uses a suite of Python libraries: `pandas` for data manipulation, `transformers` and `nltk` for natural language processing, `requests` and `pdfplumber` for downloading PDFs and extracting their text, `gensim` for topic modeling and word embeddings, and `matplotlib` and `wordcloud` for visualization, along with custom sentiment analysis modules. Together, these analyses let stakeholders compare how company narratives and strategic focus shift from year to year.


In [None]:
# Import libraries
import pandas as pd
from transformers import pipeline, AutoTokenizer
import requests
import pdfplumber
import concurrent.futures
from io import BytesIO
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import gensim.corpora as corpora
from gensim.models.ldamodel import LdaModel
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
import nltk
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
import numpy as np
from gensim.downloader import load

# Download NLTK stopwords
nltk.download('stopwords')


This section of the code involves setting up the environment and defining functions for sentiment analysis using two approaches: the NLTK library and transformer models from the Hugging Face library.

### Import Statements

- `import sentiment_analysis.sentiment_nltk as n`: Imports the NLTK-based sentiment analysis functions.

- `import sentiment_analysis.sentiment_transformers as t`: Imports the transformer model-based sentiment analysis functions.


### NLTK-Based Sentiment Analysis Functions

- `analyze_sentiment_vader(text)`: Utilizes the VADER tool from the NLTK library to calculate a compound sentiment score for the input text, reflecting the overall sentiment from negative (-1) to positive (+1).

- `analyze_sentiment_vader_detail(text)`: Provides a detailed sentiment analysis using VADER, returning a dictionary with positive, negative, neutral, and compound scores.

### Transformer Model-Based Sentiment Analysis Functions

- `chunk_text(text, max_length)`: Splits a large text into smaller chunks of a specified maximum length. This is necessary for processing long texts with models that limit input length (512 tokens for the DistilBERT model used here).

- `clean_text(text)`: Used later in the notebook to clean the raw report text before transformer scoring and keyword counting.

- `analyze_sentiment(text)`: Performs sentiment analysis on the input text using an imported transformer model. It chunks the text if necessary and averages the sentiment scores across chunks.

- `analyze_sentiment(text, tokenizer, sentiment_analysis)`: The variant called in this notebook. It takes a tokenizer and a sentiment analysis pipeline as explicit arguments.
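The source of `sentiment_analysis.sentiment_transformers` is not shown in this notebook, so the following is only a sketch of how chunk-and-average scoring could work. It assumes chunks are measured in whitespace-separated words rather than tokenizer tokens, and `score_fn` is a hypothetical stand-in for a call to the transformer pipeline.

```python
from typing import Callable, List


def chunk_words(text: str, max_length: int) -> List[str]:
    """Split text into chunks of at most `max_length` whitespace-separated words."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + max_length]) for i in range(0, len(words), max_length)]


def average_chunk_score(text: str, score_fn: Callable[[str], float], max_length: int = 400) -> float:
    """Score every chunk with `score_fn` and return the unweighted mean (0.0 for empty text)."""
    chunks = chunk_words(text, max_length)
    if not chunks:
        return 0.0
    return sum(score_fn(chunk) for chunk in chunks) / len(chunks)


def toy_score(chunk: str) -> float:
    """Toy scorer (assumption): +1 if the chunk mentions 'growth', otherwise -1."""
    return 1.0 if "growth" in chunk else -1.0


print(chunk_words("a b c d e", 2))                         # ['a b', 'c d', 'e']
print(average_chunk_score("growth x y z", toy_score, 2))   # 0.0
```

A word count is only a rough proxy for a token count, since one word can map to several subword tokens. A word-based limit therefore needs headroom below the model's 512-token maximum.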


In [None]:
# Import custom modules
import sentiment_analysis.sentiment_nltk as n
import sentiment_analysis.sentiment_transformers as t

In this section, we define functions for downloading annual reports from `annualreports.com` and extracting their text directly within Python, plus two helpers for inspecting LDA topics. Predownloaded examples are also available: raw PDF files in the `reports` folder and text files in the `discussions` folder.

### Function: `download_and_extract_text(company, year)`

This function downloads the annual report for a given company and year and extracts its text. For 2022, it requests the report from a single URL built from the company name and year. For any other year, the report sits in an archive directory named with a single lowercase letter, so the function tries the prefixes `a` through `z` until one returns HTTP 200. It then uses `pdfplumber` to extract the text page by page, joins the pages with spaces, and returns the result, or `None` if no URL succeeded. A message reports the success or failure of each report.
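The URL logic can be looked at separately from the network calls. The sketch below uses a hypothetical helper, `candidate_urls`, that is not part of the notebook. It lists the URLs in the order they would be tried, using the two URL patterns from the code in this section.

```python
from typing import List

BASE = "https://www.annualreports.com/HostedData"


def candidate_urls(company: str, year: str, current_year: str = "2022") -> List[str]:
    """Return the URLs to try, in order, for one company-year report."""
    if year == current_year:
        return [f"{BASE}/AnnualReports/PDF/{company}_{year}.pdf"]
    # Archived reports sit under a single-letter directory, so try every letter.
    return [
        f"{BASE}/AnnualReportArchive/{letter}/{company}_{year}.pdf"
        for letter in "abcdefghijklmnopqrstuvwxyz"
    ]


print(len(candidate_urls("NASDAQ_TSLA", "2022")))   # 1
print(len(candidate_urls("NASDAQ_TSLA", "2021")))   # 26
print(candidate_urls("NASDAQ_TSLA", "2021")[0])
```

Because an archived report may need up to 26 requests before a match, separating URL generation from fetching makes it easier to cache the letter that worked for a company.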

### Function: `download_and_process_reports_parallel(companies, years)`

This function orchestrates the parallel downloading and processing of annual reports for a list of companies and a range of years. It uses a `ThreadPoolExecutor` from the `concurrent.futures` module to manage concurrent execution, submitting tasks to download and extract text for each company-year combination. The results are aggregated in a dictionary, mapping each company-year pair to its corresponding extracted text, facilitating efficient parallel processing and reducing overall execution time.
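The future-to-key mapping described above can be tried without any network access. The following is a minimal sketch, not the notebook's implementation: `collect_parallel` and `fake_fetch` are hypothetical names, and the `try`/`except` around `future.result()` is an added assumption so that one failed download does not abort the whole batch.

```python
import concurrent.futures
from typing import Callable, Dict, Iterable, Optional, Tuple

Key = Tuple[str, str]


def collect_parallel(
    fetch: Callable[[str, str], Optional[str]],
    companies: Iterable[str],
    years: Iterable[str],
    max_workers: int = 5,
) -> Dict[Key, str]:
    """Run fetch(company, year) concurrently and keep only non-empty results."""
    years = list(years)  # materialize so the inner loop can be repeated per company
    results: Dict[Key, str] = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(fetch, company, year): (company, year)
            for company in companies
            for year in years
        }
        for future in concurrent.futures.as_completed(futures):
            key = futures[future]
            try:
                text = future.result()
            except Exception as exc:  # report the failure and continue with the rest
                print(f"{key} raised {exc!r}")
                continue
            if text:
                results[key] = text
    return results


def fake_fetch(company: str, year: str) -> Optional[str]:
    """Stand-in for a real download: raises for 2019 and returns nothing for 2020."""
    if year == "2019":
        raise RuntimeError("simulated network error")
    return None if year == "2020" else f"report text for {company} {year}"


texts = collect_parallel(fake_fetch, ["NASDAQ_TSLA", "NASDAQ_AAPL"], ["2021", "2020", "2019"])
print(sorted(texts))   # [('NASDAQ_AAPL', '2021'), ('NASDAQ_TSLA', '2021')]
```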

### Function: `get_top_n_topics(bow_doc, lda_model, n=5)`

Given a document represented as a bag-of-words (BoW) and an LDA model, this function identifies the top N topics within the document. It retrieves the list of topics and their associated probabilities from the LDA model, sorts them by probability in descending order, and selects the top N topics.
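For reference, gensim's `get_document_topics` returns a list of `(topic_id, probability)` pairs. The sorting and slicing step can be sketched with plain tuples; the probabilities below are made up for illustration, and `top_n` is a hypothetical name.

```python
from typing import List, Tuple


def top_n(doc_topics: List[Tuple[int, float]], n: int = 5) -> List[Tuple[int, float]]:
    """Return the n (topic_id, probability) pairs with the highest probability."""
    return sorted(doc_topics, key=lambda pair: pair[1], reverse=True)[:n]


# Illustrative values only, not output from a trained model.
doc_topics = [(0, 0.05), (1, 0.60), (2, 0.10), (3, 0.25)]
print(top_n(doc_topics, n=2))   # [(1, 0.6), (3, 0.25)]
```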

### Function: `get_top_keywords_for_topic(lda_model, topic_id, topn=5)`

This function retrieves the top N keywords for a given topic within an LDA model.


In [None]:
def download_and_extract_text(company, year):
    text = None
    response = None

    if year == '2022':
        url = f'https://www.annualreports.com/HostedData/AnnualReports/PDF/{company}_{year}.pdf'
        response = requests.get(url, stream=True)
    else:
        # This loop tries URLs with different prefixes for non-2022 reports
        for prefix in [chr(97 + i) for i in range(26)]:  # Iterates through all lowercase letters (a to z)
            url = f'https://www.annualreports.com/HostedData/AnnualReportArchive/{prefix}/{company}_{year}.pdf'
            response = requests.get(url, stream=True)
            if response.status_code == 200:
                break  # Exit the loop if a valid response is received

    # Check if a valid response was received
    if response is not None and response.status_code == 200:
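        with BytesIO(response.content) as bytes_io:
            with pdfplumber.open(bytes_io) as pdf:
                # Extract each page only once; pages without extractable text are skipped
                extracted = (page.extract_text() for page in pdf.pages)
                pages_text = [page_text for page_text in extracted if page_text is not None]
                text = ' '.join(pages_text)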
        print(f"Successfully processed {company} {year}")
    else:
        print(f"Failed to process {company} {year}")

    return text

def download_and_process_reports_parallel(companies, years):
    all_texts = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = {}
        for company in companies:
            for year in years:
                future = executor.submit(download_and_extract_text, company, year)
                futures[future] = (company, year)  # Map future to (company, year)

        for future in concurrent.futures.as_completed(futures):
            company, year = futures[future]  # Get company and year from the future
            text = future.result()
            if text:
                all_texts[(company, year)] = text  # Store text with (company, year) as key

    return all_texts

def get_top_n_topics(bow_doc, lda_model, n=5):
    # Get the list of topic probabilities for the document
    doc_topics = lda_model.get_document_topics(bow_doc, minimum_probability=0.0)
    
    # Sort the topics by their probabilities (highest first)
    sorted_doc_topics = sorted(doc_topics, key=lambda x: x[1], reverse=True)
    
    # Select the top N topics
    top_n_topics = sorted_doc_topics[:n]
    
    return top_n_topics

def get_top_keywords_for_topic(lda_model, topic_id, topn=5):
    top_keywords = lda_model.show_topic(topic_id, topn=topn)
    return ', '.join([word for word, prob in top_keywords])

### Function: `keyword_analysis(text)`

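This function performs a simple keyword analysis on the input text by counting the occurrences of predefined positive and negative keywords. It reads the global lists `positive_keywords` and `negative_keywords`, which are defined in a later cell, and sums `str.count` over each list. Note that `str.count` is case-sensitive and matches substrings, so `loss` is also counted inside `losses` and `good` inside `goodwill`.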
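Counting with `str.count` matches substrings, which can inflate the totals. The sketch below shows one possible alternative that counts whole-word, case-insensitive matches with the standard `re` module. `count_keywords` is a hypothetical helper, not the notebook's function.

```python
import re
from typing import Iterable


def count_keywords(text: str, keywords: Iterable[str]) -> int:
    """Count whole-word, case-insensitive occurrences of each keyword in text."""
    total = 0
    for keyword in keywords:
        # \b anchors the match at word boundaries; re.escape protects hyphens and spaces.
        pattern = r"\b" + re.escape(keyword) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


sample = "goodwill impairment led to a loss; prior losses were smaller."
print(sum(sample.count(word) for word in ["good", "loss"]))   # 3 (substring matches)
print(count_keywords(sample, ["good", "loss"]))               # 1 (whole words only)
```

Whether `losses` should count toward `loss` is a modeling choice. With whole-word matching, each inflected form has to be listed explicitly, as the keyword lists in this notebook already do for `loss` and `losses`.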

### Function: `make_word_cloud(text, title=None)`

This function generates a word cloud from the provided text, which is a visual representation highlighting the frequency of word occurrence in a visually appealing format. The function checks if the input text is non-empty and then proceeds to create a word cloud using the `WordCloud` class, with configurable dimensions and background color.

### Function: `make_map(text, positive, negative)`

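The `make_map` function generates word clouds for three categories: all words in the text, positive words, and negative words. The positive and negative clouds are built from lowercased, whitespace-split tokens that exactly match an entry in the keyword lists. As a result, multi-word keywords such as `market leader` and tokens with attached punctuation such as `growth,` are not picked up.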

In [None]:
def keyword_analysis(text):
    positive_count = sum(text.count(word) for word in positive_keywords)
    negative_count = sum(text.count(word) for word in negative_keywords)
    return positive_count, negative_count

def make_word_cloud(text, title=None):
    if not text.strip():  # Checks if the text is empty or contains only whitespace
        print(f"No words found for '{title}'. Skipping word cloud generation.")
        return  # Exits the function early

    # Proceed with word cloud generation if text is not empty
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
    plt.figure(figsize=(10, 5))
    if title:
        plt.title(title)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()

def make_map(text, positive, negative):
    # Display word cloud for all words
    print("All Words Word Cloud:")
    make_word_cloud(text, "All Words")

    # Display positive word cloud
    print("Positive Word Cloud:")
    positive_text = ' '.join([word for word in text.lower().split() if word in positive])
    make_word_cloud(positive_text, "Positive Words")

    # Display negative word cloud
    print("Negative Word Cloud:")
    negative_text = ' '.join([word for word in text.lower().split() if word in negative])
    make_word_cloud(negative_text, "Negative Words")

### Function: `preprocess_text(text)`

This function preprocesses the given text by removing English stopwords and tokenizing the text into individual words.

### Function: `get_vector(word)`

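Attempts to retrieve the vector representation of a given word from the pre-trained word embedding model stored in the global variable `model`, which is loaded in a later cell. If the word is not in the model's vocabulary, it returns a zero vector of the same dimension.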

### Function: `average_topic_vector(topic_keywords)`

Calculates the centroid or average vector of a set of keywords representing a topic. This is achieved by averaging the vector representations of each keyword, resulting in a single vector that encapsulates the semantic essence of the entire topic.
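The averaging step can be illustrated with a toy embedding table. The two-dimensional vectors below are hypothetical and stand in for the pre-trained model; out-of-vocabulary words map to a zero vector, as in `get_vector`.

```python
from typing import Dict, List


def average_vector(keywords: List[str], vectors: Dict[str, List[float]], size: int) -> List[float]:
    """Average the keyword vectors component-wise; unknown words count as zero vectors."""
    rows = [vectors.get(word, [0.0] * size) for word in keywords]
    if not rows:
        return [0.0] * size
    return [sum(column) / len(rows) for column in zip(*rows)]


# Hypothetical 2-d embeddings, for illustration only.
toy_vectors = {"battery": [1.0, 0.0], "vehicle": [0.0, 1.0]}

print(average_vector(["battery", "vehicle"], toy_vectors, 2))                 # [0.5, 0.5]
print(average_vector(["battery", "vehicle", "zzz", "qqq"], toy_vectors, 2))   # [0.25, 0.25]
```

Unknown words shrink the length of the average but leave its direction unchanged. Since cosine similarity depends only on direction, they do not affect the similarity scores unless every keyword is unknown.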

### Function: `cosine_similarity(vec1, vec2)`

Computes the cosine similarity between two vectors.
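Cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. The sketch below is a standard-library version for illustration; it mirrors the notebook's convention of returning 0 when either vector has zero length.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Return dot(u, v) / (|u| * |v|), or 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(round(cosine([1, 0], [0, 1]), 4))   # 0.0    (orthogonal)
print(round(cosine([1, 2], [2, 4]), 4))   # 1.0    (same direction, different length)
print(round(cosine([1, 0], [1, 1]), 4))   # 0.7071 (45 degrees apart)
```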

### Function: `calculate_cosine_similarities(company_name, df)`

This function is designed to calculate and display a matrix of cosine similarities between topics associated with a specific company's reports across different years. It aggregates the topics for each year, computes their average vector representations, and then calculates the cosine similarity between each pair of yearly topic aggregations. The resulting matrix provides insights into the thematic consistency or evolution of the company's focus over time.


In [None]:
def preprocess_text(text):
    # Use a set so each membership test is O(1) instead of a scan of the stopword list
    stop_words = set(stopwords.words('english'))
    return [word for word in simple_preprocess(text) if word not in stop_words]
           
def get_vector(word):
    """Return the vector for a word, or a zero vector if the word is not in the model."""
    try:
        return model[word]
    except KeyError:
        return np.zeros(model.vector_size)

def average_topic_vector(topic_keywords):
    """Compute the average vector for a list of topic keywords."""
    vectors = [get_vector(word) for word in topic_keywords.split(', ')]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

def cosine_similarity(vec1, vec2):
    """Calculate cosine similarity between two vectors."""
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2) if norm_vec1 > 0 and norm_vec2 > 0 else 0

def calculate_cosine_similarities(company_name, df):
    """Calculate and print cosine similarities of topics aggregated by year for a given company."""
    # Filter the DataFrame for the specified company
    company_df = df[df['Company'] == company_name]
    yearly_topics = company_df.groupby('Year')['Topic Keywords'].apply(lambda topics: ', '.join(topics)).reset_index()
    yearly_topics['Average Vector'] = yearly_topics['Topic Keywords'].apply(average_topic_vector)
    cosine_similarities = pd.DataFrame(index=yearly_topics['Year'], columns=yearly_topics['Year'])
    
    # Calculate the cosine similarity between each pair of years
    for i, row_i in yearly_topics.iterrows():
        for j, row_j in yearly_topics.iterrows():
            cosine_similarities.at[row_i['Year'], row_j['Year']] = cosine_similarity(row_i['Average Vector'], row_j['Average Vector'])
    
    # Print the cosine similarity matrix
    print(cosine_similarities)

This code snippet initializes the analysis by defining two lists: `companies` and `years`. The `companies` list contains exchange-prefixed stock symbols such as `NASDAQ_TSLA`, which are used to build the report URLs, while the `years` list specifies the years to analyze as strings. The function `download_and_process_reports_parallel` is then invoked to concurrently download and extract text from the annual reports of these companies for the specified years. The resulting texts are stored in the `all_texts` dictionary, keyed by tuples of `(company, year)`. Reports that could not be downloaded are absent from the dictionary.

### Companies and Their Stock Symbols

| Stock Symbol  | Company Name               |
|---------------|----------------------------|
| NASDAQ_TSLA   | Tesla, Inc.                |
| NASDAQ_AAPL   | Apple Inc.                 |
| NASDAQ_MSFT   | Microsoft Corporation      |
| NASDAQ_AMZN   | Amazon.com, Inc.           |
| NYSE_BRK-A    | Berkshire Hathaway Inc.    |
| NYSE_PFE      | Pfizer Inc.                |
| NASDAQ_CCBG   | Capital City Bank Group Inc.|

Each entry in the `all_texts` dictionary represents the textual content extracted from an annual report for a specific company and year, providing a foundational dataset for subsequent analysis.


In [None]:
# Define the list of companies and years
companies = ['NASDAQ_TSLA', 'NASDAQ_AAPL', 'NASDAQ_MSFT', 'NASDAQ_AMZN', 'NYSE_BRK-A', 'NYSE_PFE', 'NASDAQ_CCBG']
years = ['2022', '2021', '2020', '2019', '2018']

# Download and process the reports
all_texts = download_and_process_reports_parallel(companies, years)

This code segment sets up the foundational elements for sentiment analysis by defining lists of positive and negative keywords and initializing a pre-trained sentiment analysis model.

- **Positive and Negative Keywords**: Two lists, `positive_keywords` and `negative_keywords`, contain words and phrases commonly associated with positive and negative sentiment in financial reports.

- **Sentiment Analysis Model**: The code loads the `distilbert-base-uncased-finetuned-sst-2-english` model and its tokenizer. This is a DistilBERT model fine-tuned on SST-2, a binary (positive/negative) sentiment dataset of movie-review sentences, so its scores on financial text should be read with that domain gap in mind.


In [None]:
# Define the positive and negative keywords
positive_keywords = [
    'good', 'great', 'positive', 'successful', 'profitable', 'improved', 'improving', 'excellent',
    'beneficial', 'strong', 'growth', 'upturn', 'bullish', 'booming', 'advantageous',
    'rewarding', 'lucrative', 'surplus', 'expansion', 'upswing', 'thriving', 'yielding',
    'gains', 'outperform', 'optimistic', 'upbeat', 'recovery', 'acceleration', 'enhancement',
    'rally', 'surge', 'boom', 'profitability', 'efficiency', 'superior', 'leadership',
    'innovation', 'breakthrough', 'high-demand', 'competitive edge', 'market leader',
    'dividend increase', 'shareholder value', 'capital gain', 'revenue growth', 'cost reduction',
    'strategic acquisition', 'synergy', 'scalability', 'liquidity'
]
negative_keywords = [
    'bad', 'poor', 'negative', 'loss', 'problem', 'decrease', 'difficult', 'weak', 'decline',
    'losses', 'bearish', 'slump', 'downturn', 'adverse', 'challenging', 'deteriorating',
    'declining', 'recession', 'deficit', 'contraction', 'downgrade', 'volatility', 'risk',
    'uncertainty', 'impairment', 'write-off', 'underperform', 'pessimistic', 'downbeat',
    'stagnation', 'erosion', 'turmoil', 'crisis', 'bankruptcy', 'default', 'devaluation',
    'overleveraged', 'layoffs', 'restructuring', 'downsizing', 'liquidation', 'fraud',
    'scandal', 'litigation', 'regulatory penalty', 'market exit', 'competitive pressure',
    'product recall', 'safety concern'
]

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
sentiment_analysis = pipeline("sentiment-analysis", model=model_name)

This code segment performs an in-depth analysis of textual content from annual reports for specified companies and years, focusing on sentiment analysis, keyword frequency, and topic modeling.

**Analysis Loop**: Iterates over each `(company, year)` pair, conducting analyses if the text for the pair is available in `all_texts`.

**Sentiment Analysis**: 
   - Utilizes VADER from NLTK to compute sentiment scores.
   - Employs a transformer-based model for a detailed sentiment analysis, displaying scores for each `(company, year)` pair.

**Keyword Analysis**: 
   - Cleans the text and counts occurrences of predefined positive and negative keywords.
   - Calculates the share of positive keywords among all matched keywords, `positive / (positive + negative)`, and visualizes the matches using word clouds.

**Topic Modeling**: 
   - Preprocesses the text to remove stopwords and tokenizes it.
   - Constructs a dictionary and a single-document corpus for each report, trains a 10-topic LDA model on it, and keeps the top five keywords of the report's most probable topic.

**Data Aggregation**: 
   - Compiles sentiment scores, keyword counts, and topic keywords into a dictionary for each company-year pair.
   - Appends each dictionary to a list for later conversion into a DataFrame.

**DataFrame Creation**: 
   - Converts the list of dictionaries into a pandas DataFrame, providing a structured overview of analysis results.

In [None]:
data = []
lda_models = {}
text_key = [(company, year) for company in companies for year in years]

for key in text_key:
    company, year = key
    if key in all_texts:
        text = all_texts[key]
        sentiment_score_nltk = n.analyze_sentiment_vader(text)
        print(f"{company} {year}: Sentiment Analysis")
        print(f"\n Sentiment Score NLTK = {sentiment_score_nltk}")
        scores = n.analyze_sentiment_vader_detail(text)
        print(f"  Positive Score: {scores['pos']}")
        print(f"  Negative Score: {scores['neg']}")
        print(f"  Neutral Score: {scores['neu']}")

        cleaned_text = t.clean_text(text)
        sentiment_score_transformer = t.analyze_sentiment(cleaned_text, tokenizer, sentiment_analysis)
        print(f"{company} {year}: Sentiment Score Transformer = {sentiment_score_transformer}\n")

        # Keyword analysis (reuses the cleaned text computed above)
        positive_count, negative_count = keyword_analysis(cleaned_text)

        print('Positive Keywords:', positive_count)
        print('Negative Keywords:', negative_count)
        print('Positive Keywords Ratio:', positive_count / (negative_count + positive_count ) if (negative_count + positive_count ) != 0 else "N/A")
        make_map(text, positive_keywords, negative_keywords)
        
        # topic modeling part 
        processed_text = preprocess_text(text)
        id2word = corpora.Dictionary([processed_text])
        corpus = [id2word.doc2bow(processed_text)]
        lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=10,
                     alpha='asymmetric',
                     eta='auto',
                     iterations=400,
                     passes=15,
                     eval_every=None)
        lda_models[(company, year)] = lda_model
        bow_text = id2word.doc2bow(processed_text)
        top_topics = get_top_n_topics(bow_text, lda_model, n=1)

        # Extract the top keywords of the single most probable topic (n=1 above)
        top_keywords = [get_top_keywords_for_topic(lda_model, topic_id) for topic_id, _ in top_topics]

        # Append one row per report; the dominant topic's keywords go into a single column
        data.append({
            "Company": company,
            "Year": year,
            "NLTK_Sentiment_Score": sentiment_score_nltk,
            "Positive_Score": scores['pos'],
            "Negative_Score": scores['neg'],
            "Neutral_Score": scores['neu'],
            "Transformer_Sentiment_Score": sentiment_score_transformer,
            "Positive_Keywords": positive_count,
            "Negative_Keywords": negative_count,
            "Positive_Keywords_Ratio": positive_count / (negative_count + positive_count ) if (negative_count + positive_count ) != 0 else "N/A",
            "Topic Keywords": ', '.join(top_keywords)
        })
        print('-' * 50,'\n')
    
df = pd.DataFrame(data)

Results of the analysis are stored in a pandas DataFrame. The DataFrame contains the following columns: `Company`, `Year`, `NLTK_Sentiment_Score`, `Positive_Score`, `Negative_Score`, `Neutral_Score`, `Transformer_Sentiment_Score`, `Positive_Keywords`, `Negative_Keywords`, `Positive_Keywords_Ratio`, and `Topic Keywords`. The DataFrame is displayed below.

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
df

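This segment prepares the inputs for an interactive pyLDAvis plot of the topic model for one report. The two visualization calls are commented out; uncomment them to render the plot.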

In [None]:
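key = ('NASDAQ_TSLA', '2022')
lda_model = lda_models[key]
# Use the dictionary this model was trained with; the `id2word` left over
# from the analysis loop belongs to the last report processed.
id2word = lda_model.id2word
corpus = [id2word.doc2bow(preprocess_text(all_texts[key]))]
# vis = gensimvis.prepare(lda_model, corpus, id2word)
# pyLDAvis.display(vis)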

This section of the code is focused on loading a pre-trained word embedding model and performing cosine similarity calculations to analyze the thematic consistency or divergence of topics associated with companies' annual reports over different years. 

In [None]:
model = load('glove-wiki-gigaword-100')

In [None]:
calculate_cosine_similarities('NASDAQ_TSLA', df)

In [None]:
for company in companies:
    print(f"\n\nCompany: {company}")
    # Show the yearly topic keywords before the similarity matrix
    company_df = df[df['Company'] == company][['Year', 'Topic Keywords']]
    print(company_df)
    calculate_cosine_similarities(company, df)