# Earnings Call Report Generation using Large Language Models

For investors seeking to gain an edge in the market, the ability to efficiently analyze the content of earnings calls is crucial. Large language models offer a powerful solution to the challenges posed by unstructured data, enabling investors to extract meaningful insights and make better-informed decisions.

# Notebook Overview

In this notebook, we develop a comprehensive pipeline that performs sentiment analysis, generates a summary report, and extracts a list of 10 key takeaways for investors given a YouTube video. The process is implemented through the following steps:

1. **Obtain the Transcript**  
   Use the YouTube Transcript API to retrieve the transcript of an earnings call, namely the [Alphabet Q2 2024 earnings call](https://www.youtube.com/watch?v=r9ylamQmNBU).

2. **Preprocess the Transcript**  
   Preprocess the retrieved transcript to ensure it is suitable for further analysis.

3. **Extract Important Keywords and Concepts**  
   Generate a list of keywords by finding the key terms and concepts in the transcript. These keywords highlight the most significant topics of the call and help us focus on its main content. We also add general financial keywords to capture other pertinent information in the text.

4. **Filter the Transcript**  
   Use these keywords to filter the transcript and extract only the passages that contain these relevant terms.

5. **Perform Sentiment Analysis**  
   Evaluate the transcript using sentiment analysis, classifying the sentiment of each filtered sentence as positive, negative, or neutral. For this task, we use [FinBERT](https://huggingface.co/ProsusAI/finbert), a sentiment analysis model pre-trained on financial data.

6. **Summarize the Transcript**  
   Split the transcript into chunks and summarize each chunk using a trained model, allowing us to generate condensed versions of the original text. For this task, we use [PEGASUS for Financial Summarization](https://huggingface.co/human-centered-summarization/financial-summarization-pegasus). This model was fine-tuned on a novel financial news dataset, which consists of 2K articles from Bloomberg, on topics such as stock, markets, currencies, rate and cryptocurrencies.

7. **Generate a Report and Key Takeaways**  
   Create a comprehensive report and a list of key takeaways from the summarized chunks using the [Microsoft Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) model. Phi-3-Mini-4K-Instruct is a lightweight, state-of-the-art open model with 3.8B parameters, trained on the Phi-3 datasets, which include both synthetic data and filtered publicly available website data, with a focus on high-quality, reasoning-dense content.


# Main Concepts

> **Earnings Calls**

Earnings calls are a critical source of information for investors, providing insights into a company's financial performance, strategic direction, and management's outlook. These calls offer an opportunity to hear directly from company executives about the factors driving their business, the challenges they face, and their plans for the future. For investors, this information is invaluable as it helps them make informed decisions about buying, selling, or holding a company's stock.


> **Unstructured Data**

However, the content of earnings calls is typically delivered in an unstructured format, often consisting of lengthy discussions, Q&A sessions, and prepared remarks. Extracting actionable insights from this unstructured data can be a complex and time-consuming task. Investors need to sift through large volumes of text, identify key points, analyze sentiment, and connect the dots to understand the broader implications for the company's future performance.

> **Large Language Models**

This is where large language models (LLMs) come into play. LLMs, like GPT-4 and specialized models such as FinBERT, are designed to process and analyze vast amounts of unstructured text data. These models can quickly identify key themes, extract relevant information, and even evaluate the sentiment expressed in different parts of the transcript. By leveraging LLMs, investors can streamline the process of analyzing earnings calls, allowing them to focus on the most critical information and make more informed investment decisions.

The ability of LLMs to handle unstructured data is particularly important because traditional data analysis methods are often inadequate for this type of content. Unstructured data, which includes natural language text, lacks the predefined structure found in databases or spreadsheets, making it challenging to analyze using conventional techniques. LLMs are specifically designed to understand and process this type of data, providing a powerful tool for investors who need to quickly and accurately interpret the nuances of an earnings call.

# Implementation

In [None]:
# Install Youtube Transcript API
!pip install youtube_transcript_api


In [None]:
# Import libraries
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
import textwrap
import torch
import plotly.express as px
import plotly.graph_objects as go
from youtube_transcript_api import YouTubeTranscriptApi
from tqdm import tqdm
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, PegasusTokenizer, PegasusForConditionalGeneration

In [None]:
# Set the default plotly template to "simple_white" for consistent styling
px.defaults.template = "simple_white"

# Assign the API token (e.g., Hugging Face API key) to the variable 'token'
token = 'HF_KEY'

In [None]:
# Check if a CUDA-enabled GPU is available and set the device accordingly.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Output the selected device
device

'cuda'

# Get Video Transcription

In [None]:
# Get the video ID from the YouTube URL
link = 'https://www.youtube.com/watch?v=r9ylamQmNBU'
video_id = link.split('v=')[1]
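
Splitting on `'v='` works for the link above, but it would keep any trailing query parameters (for example `&t=42s`) as part of the ID. A minimal sketch of a more tolerant parser, using only the standard library, is shown below. The helper name `extract_video_id` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    """Return the YouTube video ID from a watch or youtu.be URL (hypothetical helper)."""
    parsed = urlparse(url)
    # Short links carry the ID in the path: https://youtu.be/<id>
    if parsed.netloc.endswith("youtu.be"):
        return parsed.path.lstrip("/")
    # Standard watch links carry the ID in the 'v' query parameter
    ids = parse_qs(parsed.query).get("v")
    if not ids:
        raise ValueError(f"No video ID found in URL: {url}")
    return ids[0]

# Extra query parameters no longer leak into the ID
video_id_checked = extract_video_id("https://www.youtube.com/watch?v=r9ylamQmNBU&t=42s")
print(video_id_checked)
```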

In [None]:
# Fetch the transcript
transcript = YouTubeTranscriptApi.get_transcript(video_id)
transcript[:5]

[{'text': '>> Operator: WELCOME, EVERYONE.', 'start': 1.97, 'duration': 3.57},
 {'text': 'THANK YOU FOR STANDING BY FOR ', 'start': 3.838, 'duration': 3.037},
 {'text': 'THE ALPHABET SECOND QUARTER 2024',
  'start': 5.64,
  'duration': 3.504},
 {'text': 'EARNINGS CONFERENCE CALL. ', 'start': 6.975, 'duration': 3.737},
 {'text': 'AT THIS TIME, ALL PARTICIPANTS ',
  'start': 9.244,
  'duration': 3.637}]

# Preprocessing

Once the video transcription is obtained, it needs to be cleaned and prepared for analysis. Preprocessing involves a series of text normalization steps that ensure the data is consistent and ready for further operations. For example:
  * **Text Cleaning**: Remove punctuation, special symbols, double spaces, new lines, and unnecessary characters.
  * **Lowercasing**: Convert all text to lowercase to maintain uniformity.
  * **Stopword Removal**: Eliminate common words (e.g., "and," "the") that do not contribute to the analysis.
  * **Lemmatization**: Reduce words to their base or root form, ensuring that different forms of a word are treated the same.

In [None]:
# Join the transcript segments into a single string
text = " ".join(entry['text'] for entry in transcript)

# Convert text to lowercase
text = text.lower()

# Replace new lines and carriage returns with spaces so adjacent words are not merged
text = text.replace("\n", " ").replace("\r", " ")

# Define a regex pattern that keeps only letters, numbers, whitespace, and selected punctuation
pattern = r'[^a-z0-9\s%$&"\':.()\-\;]'  # Keep letters, numbers, spaces, %, $, &, ", ', :, ., (, ), -, ;

# Remove unwanted characters using the regex pattern
text = re.sub(pattern, '', text)

# Collapse runs of whitespace into a single space and strip the ends
text = re.sub(r'\s+', ' ', text).strip()

# Display text
text

'operator: welcome everyone. thank you for standing by for the alphabet second quarter 2024 earnings conference call. at this time all participants are in a listen-only mode. after the speaker presentation there will be a question and answer session. to ask a question during the session you will need to press 1 on your telephone. i would now like to hand the conference over to your speaker today jim friedland director welcome to alphabet\'s second-quarter 2024 earnings conference call. with us today are sundar pichai philipp schindler and ruth porat. now i\'ll quickly cover the safe harbor. some of the statements that we make today regarding our business operations and financial performance may be considered forward-looking. such statements are based on current expectations and assumptions that are subject to a number of risks and uncertainties. actual results could differ materially. please refer to our forms 10-k and 10-q includg the risk factors. we undertake no obligation to update

# Most Repeated Words

Next, we find the most repeated words in the transcript. To do so, we remove stopwords and lemmatize the remaining words.

In [None]:
# Download the 'stopwords' dataset from NLTK. This dataset contains a list of common stopwords
# in various languages that can be used to filter out unimportant words from text during preprocessing.
nltk.download('stopwords')

# Download the 'wordnet' dataset from NLTK. WordNet is a lexical database for the English language
# that provides synonyms, antonyms, and other lexical relations. It is commonly used for lemmatization
# and other natural language processing tasks.
nltk.download('wordnet')

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Step 1: Remove punctuation using regex
text_ = re.sub(r'[^\w\s]', '', text)

# Step 2: Split the text into words
words = text_.split()

# Step 3: Remove stop words
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]

# Step 4: Lemmatize words
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]

# Step 5: Count the frequency of each word
word_counts = Counter(lemmatized_words)

# Step 6: Get the most common words
most_common_words = word_counts.most_common(50)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
# Create and display a horizontal bar chart of the most common words using Plotly for visualization.
fig = pd.DataFrame(most_common_words, columns=['word', 'count']).set_index('word').sort_values(by='count', ascending=True).plot(kind='barh', backend='plotly')
fig.update_layout(title='Most Common Words',
                  xaxis_title='Count',
                  yaxis_title='Word',
                  height=1200,
                  width=1600)
fig.show()

# Filtering by Keywords

This section focuses on narrowing down the large amount of transcribed text to only the most relevant parts. Filtering by keywords allows the user to focus on specific topics or areas of interest within the transcription.

**Steps**:
1. **Keyword Definition**: Define a list of keywords or phrases that are relevant to the analysis. To do that, we can also pick some of the most repeated words that we just found in our analysis.
2. **Text Scanning**: The text is scanned for occurrences of these keywords.
3. **Extract Relevant Sentences/Sections**: Sentences or sections of the text containing these keywords are extracted and isolated for further analysis.
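
Several of the keywords used in this notebook are two-word phrases (for example "cash flow" or "operating margin"), which single-word counts cannot surface. As a rough sketch, adjacent word pairs can be counted in the same way as single words. The helper `top_bigrams` and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def top_bigrams(words, n=10):
    """Count adjacent word pairs and return the n most frequent (hypothetical helper)."""
    pairs = zip(words, words[1:])
    counts = Counter(" ".join(pair) for pair in pairs)
    return counts.most_common(n)

# Toy input; in the notebook, a list such as `lemmatized_words` could be passed instead
sample_words = "operating margin improved as cash flow grew and operating margin expanded".split()
sample_bigrams = top_bigrams(sample_words, n=3)
print(sample_bigrams)
```

Note that if the input list has already had stopwords removed, the pairs are adjacent only in the filtered sequence, not necessarily in the original sentence.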

In [None]:
# Separate text by phrases
phrases = text.split(". ")
phrases = [phrase.strip() for phrase in phrases]

phrases[:5]

['operator: welcome everyone',
 'thank you for standing by for the alphabet second quarter 2024 earnings conference call',
 'at this time all participants are in a listen-only mode',
 'after the speaker presentation there will be a question and answer session',
 'to ask a question during the session you will need to press 1 on your telephone']

In [None]:
# Example list of key phrases to search for
keywords = [
    "ai", "revenue", "profit", "growth", "youtube", "google", "cloud", "customer",
    "earnings", "income", "expenses", "cash flow", "operating margin",
    "capital expenditure", "investment", "return on investment", "market share",
    "sales", "cost", "loss", "milestone", "product launch",
    "expansion", "reduction", "innovation", "strategy", "forecast",
    "opportunity", "challenge", "performance", "projection",
    "risk", "uncertainty", "adjustment", "increase", "decrease",
]

# Extracting key financial phrases from the grouped text
extracted_phrases = {}
for keyword in keywords:
    extracted_phrases[keyword] = []
    pattern = r'\b' + keyword + r'\b'
    for i, phrase in enumerate(phrases):
        if re.search(pattern, phrase):
            extracted_phrases[keyword].append((i, phrase))

In [None]:
# Print some examples
for phrase in extracted_phrases[keywords[1]][:3]:
    print('Index: ', phrase[0], '\nPhrase: ', phrase[1].replace("revenue", "<REVENUE>"), '\n')

Index:  138 
Phrase:  advertisers who use profit optimization in smart bidding see a 15% uplift in profit on average compared to <REVENUE>-only bidding 

Index:  189 
Phrase:  search remained the largest contributor to <REVENUE> growth 

Index:  209 
Phrase:  turning to the google cloud segment <REVENUE>s were 10.3 billion for the quarter up 29% reflecting first significant growth in gcp which was above growth for cloud overall and includes an increasing contribution from ai and second strong google workspace growth primarily driven by increases in average <REVENUE> per seat 



In [None]:
# Count the number of phrases associated with each keyword and create a DataFrame to display the counts, sorted in ascending order.
n_phrases_by_keyword = {key:len(values) for key, values in extracted_phrases.items()}
n_phrases_by_keyword_df = pd.DataFrame(n_phrases_by_keyword, index=['count']).T.sort_values(by='count', ascending=True)

print(f'Percent of phrases containing keywords: {round(n_phrases_by_keyword_df.sum().values[0] / len(phrases), 3) * 100}%')

Percent of phrases containing keywords: 70.5%
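
The figure above sums the per-keyword counts, so a phrase that matches several keywords is counted once per keyword and the percentage can overstate the true coverage. A minimal sketch of a de-duplicated measure, which counts each phrase index only once, follows. The helper `keyword_coverage` and the toy data are assumptions for illustration.

```python
def keyword_coverage(extracted, n_phrases):
    """Share of phrases that match at least one keyword.

    `extracted` maps keyword -> list of (phrase_index, phrase) tuples,
    the same shape as `extracted_phrases` in the notebook.
    """
    matched = {idx for hits in extracted.values() for idx, _ in hits}
    return len(matched) / n_phrases if n_phrases else 0.0

# Toy data: 4 phrases in total, 2 of which match one or more keywords
toy = {
    "revenue": [(0, "revenue growth was strong"), (2, "cloud revenue rose")],
    "growth": [(0, "revenue growth was strong")],
    "cloud": [(2, "cloud revenue rose")],
}
toy_coverage = keyword_coverage(toy, n_phrases=4)   # distinct phrases / total phrases
toy_summed = sum(len(v) for v in toy.values()) / 4  # summed matches / total phrases
print(f"distinct: {toy_coverage:.0%}, summed: {toy_summed:.0%}")
```

In the notebook this could be called as `keyword_coverage(extracted_phrases, len(phrases))`.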


In [None]:
# Create and display a horizontal bar chart of the count of phrases associated with each keyword.
fig = n_phrases_by_keyword_df.plot(kind='barh', backend='plotly')
fig.update_layout(
    title='Count of Phrases by Keyword',
    xaxis_title='Count',
    yaxis_title='Keyword',
    width=1600, # Set the width of the figure
    height=800  # Set the height of the figure
)

fig.show()

# Sentiment Analysis

With the relevant text filtered, sentiment analysis is performed to understand the emotional tone of the content. This involves categorizing the text based on the sentiment it conveys—whether it is positive, negative, or neutral.

**Steps:**
1. **Load Sentiment Analysis Model**: A pre-trained sentiment analysis model is loaded. In our case, we use `ProsusAI/finbert`. FinBERT is a pre-trained NLP model for analyzing the sentiment of financial text. It is built by further training the BERT language model on a large financial corpus and then fine-tuning it for financial sentiment classification.
2. **Apply Sentiment Analysis**: The filtered text is passed through the model to determine the sentiment score or classification.
3. **Interpret Results**: The sentiment results are interpreted, showing whether the content is generally positive, negative, or neutral.

In [None]:
# Create a sentiment analysis pipeline using the 'ProsusAI/finbert' model.
sentiment_pipeline = pipeline("sentiment-analysis", model='ProsusAI/finbert', device=device)




In [None]:
# Create a sentiment DataFrame for each keyword
sentiment_dfs = {}
for keyword in tqdm(keywords):

    # For each keyword, if there are at least 3 extracted phrases, create a DataFrame with phrases.
    if len(extracted_phrases[keyword]) > 2:

        # Create a DataFrame for the current keyword
        sentiment_df = pd.DataFrame(extracted_phrases[keyword], columns=['index', 'phrase'])

        # Apply sentiment analysis to get labels and scores for each phrase.
        sentiment_df[['label', 'score']] = sentiment_df['phrase'].apply(lambda x: pd.Series(sentiment_pipeline(x)[0]))

        # Adjust sentiment scores to be negative for 'negative' labels.
        sentiment_df['score'] = sentiment_df.apply(lambda x: x['score'] * -1 if x['label'] == 'negative' else x['score'], axis=1)

        # Store the resulting DataFrame in the 'sentiment_dfs' dictionary.
        sentiment_dfs[keyword] = sentiment_df

  0%|          | 0/36 [00:00<?, ?it/s]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 36/36 [00:09<00:00,  4.00it/s]


In [None]:
# Display the first few rows of the sentiment DataFrame for the keyword 'ai', showing the sentiment analysis results.
sentiment_dfs['ai'].head()

Unnamed: 0,index,phrase,label,score
0,20,they show tremendous ongoing momentum in searc...,positive,0.9275
1,22,and in terms of product innovation we are seei...,positive,0.928177
2,25,year-to-date our ai infrastructure and generat...,positive,0.697626
3,26,as i spoke about last quarter we are uniquely ...,positive,0.872008
4,29,importantly we are innovating at every layer o...,neutral,0.705883
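
In the cell above only scores with a `negative` label are flipped, so a `neutral` phrase keeps its positive confidence value (row 4 is an example). If the score is meant to express polarity, one possible alternative is to map neutral phrases to zero. The sketch below is an assumption about how that could be done, not the method used in this notebook, and the helper name `signed_score` is hypothetical.

```python
def signed_score(label, score):
    """Map a (label, confidence) pair to a signed value in [-1, 1] (hypothetical helper)."""
    if label == "positive":
        return score
    if label == "negative":
        return -score
    return 0.0  # neutral phrases contribute no polarity

# Toy rows in the same (label, score) form returned by the sentiment pipeline
rows = [("positive", 0.93), ("neutral", 0.71), ("negative", 0.80)]
signed = [signed_score(label, score) for label, score in rows]
mean_signed = sum(signed) / len(signed)
print(signed, round(mean_signed, 4))
```

With this mapping, an average near zero means mostly neutral or mixed phrases, rather than a high value driven by confident neutral predictions.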


In [None]:
# Count the number of sentiments (negative, neutral, positive) for each keyword:
sentiment_counts_by_keyword = {}

# Iterate over each keyword in the sentiment_dfs dictionary
for key in sentiment_dfs.keys():

    # Initialize a dictionary to store sentiment counts for the current keyword
    sentiment_counts = {'negative': 0,
                        'neutral': 0,
                        'positive': 0}

    # Count occurrences of each sentiment label in the DataFrame for the current keyword
    counts = sentiment_dfs[key]['label'].value_counts()

    # Update the sentiment counts dictionary with the counted values
    for label, count in counts.items():
        sentiment_counts[label] = count

    # Store the sentiment counts for the current keyword in the dictionary
    sentiment_counts_by_keyword[key] = sentiment_counts

# Create a DataFrame from the sentiment counts dictionary
sentiment_counts_by_keyword_df = pd.DataFrame.from_dict(sentiment_counts_by_keyword).T

# sentiment_counts_by_keyword_df = sentiment_counts_by_keyword_df.drop(columns=['neutral'])

# Create a total series that contains the sum of sentiment counts for each keyword
total = sentiment_counts_by_keyword_df.sum(axis=1)

# Normalize the DataFrame by the respective 'total' and sort values by positive sentiment
sentiment_counts_by_keyword_df = sentiment_counts_by_keyword_df.div(total, axis=0).sort_values(by='positive', ascending=True)

# Display the last few rows of the sorted DataFrame
sentiment_counts_by_keyword_df.tail()

Unnamed: 0,negative,neutral,positive
operating margin,0.0,0.166667,0.833333
performance,0.0,0.133333,0.866667
innovation,0.0,0.0,1.0
expansion,0.0,0.0,1.0
profit,0.0,0.0,1.0


In [None]:
# Custom hex color codes for red, blue, green
colors = ['#FF6347', '#1E90FF', '#32CD32']

# Create a horizontal bar chart using the sentiment counts DataFrame with Plotly
fig = sentiment_counts_by_keyword_df.plot(kind='barh', backend='plotly')

# Update layout for title and axis labels (the DataFrame holds normalized shares, not raw counts)
fig.update_layout(
    title='Sentiment Share by Keyword',
    xaxis_title='Share of Phrases',
    yaxis_title='Keyword'
)

# Update traces to set custom colors
for i, color in enumerate(colors):
    fig.data[i].marker.color = color

fig.show()

* **Keywords with Predominantly Positive Sentiment**: Several keywords exhibit a strong inclination towards positive sentiment, as indicated by the substantial green segments. These keywords include "expansion", "profit", "performance", "operating margin", "growth", and "cloud".

* **Keywords with Predominantly Negative Sentiment**: A few keywords demonstrate a higher prevalence of negative sentiment, as evidenced by the larger red segments. These keywords are "expenses", "increase", "investment", "revenue", and "youtube".

* **Keywords with Predominantly Neutral Sentiment**: Some keywords demonstrate a high prevalence of neutral sentiment. For instance: "earnings", "customer", "google", "income", "opportunity", and "ai".



In [None]:
# Calculate the average sentiment score for each keyword:
avg_sentiment_by_keyword = {}

# Iterate over each keyword in the sentiment_dfs dictionary
for keyword in sentiment_dfs.keys():
    # Check if there are extracted phrases for the current keyword
    if len(extracted_phrases[keyword]) > 0:
        # Compute the average sentiment score for the keyword and store it in the dictionary
        avg_sentiment_by_keyword[keyword] = sentiment_dfs[keyword]['score'].mean()

# Create a DataFrame from the average sentiment dictionary, with the keyword as the index
# and sort the DataFrame by average sentiment score in descending order
avg_sentiment_by_keyword_df = pd.DataFrame.from_dict(avg_sentiment_by_keyword, orient='index', columns=['avg_sentiment']).sort_values(by='avg_sentiment', ascending=False)

# Display the first few rows of the sorted DataFrame
avg_sentiment_by_keyword_df.head()

Unnamed: 0,avg_sentiment
income,0.884088
performance,0.876743
profit,0.852781
operating margin,0.852604
earnings,0.849693


In [None]:
# Sort the DataFrame by 'avg_sentiment' in ascending order
sorted_df = avg_sentiment_by_keyword_df.sort_values(by='avg_sentiment', ascending=True)

# Create the plot
fig = go.Figure()

# Generate a list of colors based on the sentiment values
colors = ['#FF6347' if val < 0 else '#32CD32' for val in sorted_df['avg_sentiment']]

# Add a bar trace to the figure
fig.add_trace(go.Bar(
    x=sorted_df['avg_sentiment'],  # x-axis as average sentiment
    y=sorted_df.index,  # y-axis as keywords
    orientation='h',  # Horizontal bars
    marker=dict(color=colors),  # Apply the color list,
))

# Update layout for title and axis labels
fig.update_layout(
    title='Average Sentiment by Keyword',
    xaxis_title='Average Sentiment',
    yaxis_title='Keyword',
    plot_bgcolor='white',  # Set plot background to white
    paper_bgcolor='white'  # Set outer background to white
)

# Show the plot
fig.show()

Potential Implications:
* **Positive Perception**: The overall positive sentiment associated with the keywords suggests a generally favorable perception.
* **Effective Messaging**: The keywords with the highest sentiment scores might be particularly effective in communicating key messages or values.
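* **Limited Data or Sentiment Analysis**: The absence of negative average scores could indicate a limited dataset, a focus on positive aspects in the call, or a bias in the sentiment analysis methodology. In particular, only scores with a negative label are negated, so neutral phrases still contribute a positive value to each average.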

# Summarization

To make the analysis more digestible, this section summarizes the preprocessed transcript. Summarization condenses the text, highlighting the key points and providing a concise version of the original content.

**Steps:**
1. **Choose Summarization Technique**: Decide on an extractive or abstractive summarization method.
2. **Apply Summarization Algorithm**: Use the selected method to generate a summary of the text.
3. **Review Summary**: The summary is reviewed to ensure it accurately reflects the key points of the original content.

In [None]:
# Let's load the model and the tokenizer
model_name = "human-centered-summarization/financial-summarization-pegasus"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)
model.to(device)


Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at human-centered-summarization/financial-summarization-pegasus and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
# Split the transcript into chunks before summarization.
# Note: textwrap.wrap measures width in characters, not tokens. Chunks of 1024 characters
# typically stay well below the model's 512-token input limit.
max_chunk_size = 1024

# Split the transcript into chunks of at most max_chunk_size characters
chunks = textwrap.wrap(text, max_chunk_size)
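
`textwrap.wrap` breaks on whitespace wherever the width limit falls, so a sentence can be split across two chunks. A possible alternative, sketched below under the assumption that sentences are separated by `". "` as elsewhere in this notebook, groups whole sentences into each chunk. The helper `chunk_by_sentence` is hypothetical.

```python
def chunk_by_sentence(text, max_chars=1024, sep=". "):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    chunks, current = [], ""
    for sentence in text.split(sep):
        sentence = sentence.strip()
        if not sentence:
            continue
        # Try to append the sentence to the chunk being built
        candidate = f"{current}{sep}{sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # The chunk is full: store it and start a new one
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

demo_chunks = chunk_by_sentence("revenue rose 10%. costs fell. margins improved", max_chars=30)
print(demo_chunks)
```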

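The `generate` call in the next cell accepts the following parameters:
* `max_length`: Maximum length of the summary, in tokens.
* `min_length`: Minimum length of the summary, in tokens (not set in this notebook).
* `length_penalty`: Exponent applied to the sequence length when scoring beams. Values above 0 favor longer summaries; values below 0 favor shorter ones.
* `num_beams`: Number of beams for beam search (more beams can improve summary quality but are slower).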
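
The direction of `length_penalty` is easy to get backwards. In beam search as implemented in Hugging Face `transformers`, to the best of my understanding, each finished candidate is scored as the sum of its token log-probabilities divided by `length ** length_penalty`. Because log-probabilities are negative, a larger exponent moves long candidates closer to zero and so ranks them higher. The simplified sketch below uses two made-up candidates to show the effect; it is an illustration, not the library's code.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized beam score: sum of token log-probabilities / length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)

# Two hypothetical candidates with the same average log-probability per token
short = {"sum_logprobs": -5.0, "length": 10}
long = {"sum_logprobs": -10.0, "length": 20}

for penalty in (0.0, 1.0, 2.0):
    s = beam_score(short["sum_logprobs"], short["length"], penalty)
    l = beam_score(long["sum_logprobs"], long["length"], penalty)
    print(f"length_penalty={penalty}: short={s:.3f}, long={l:.3f}")
```

With a penalty of 0.0 the short candidate scores higher, at 1.0 the two tie, and at 2.0 (the value used in this notebook) the long candidate scores higher.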

In [None]:
# Summarize each chunk
summaries = []
for chunk in tqdm(chunks):
    # Tokenize the chunk; truncate at 512 tokens, the model's maximum input length
    inputs = tokenizer(chunk, return_tensors='pt', max_length=512, truncation=True).to(device)
    summary_ids = model.generate(inputs['input_ids'], max_length=32, length_penalty=2.0, num_beams=5, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    summaries.append(summary)

100%|██████████| 50/50 [00:22<00:00,  2.18it/s]


In [None]:
# Combine the summaries of each chunk
final_summary = "\n\n".join(summaries)
print(final_summary)

Company to release second-quarter 2024 earnings before the market opens on Tuesday.

Good day, everyone, and welcome to the search giant's conference call.

we are in a strong position to control our destiny. this was underscored by announcements at io cloud next

We're seeing higher engagement from younger users with ai overviews. ai is introducing new ways to search via lens

gemini is powering more than two billion monthly users

Google unveils astra, first data center and cloud region. trillium is our best performing and most energy-efficient tpu to date

We continue to see strong customer interest winning leading brands like hitachi.

we continue to drive fundamental differentiation with new advances. our momentum begins with ai infrastructure

our ai-powered applications portfolio is helping us win new customers.

we had a great brandcast this quarter and are pleased with the progress here.

waymo is making a real leader in the space. final earnings call with ruth on the call

su

In [None]:
# Approximate the compression as the number of chunk summaries divided by the number of sentences in the transcript.
print(f'Approx. Compression: {len(summaries) / len(text.split(". ")):.2%}')

Approx. Compression: 10.99%
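
The ratio above compares the number of summaries with the number of sentences, so it does not account for how long each summary or sentence is. A complementary measure, sketched here as an assumption rather than as part of the original analysis, compares character lengths directly. The helper `char_compression` is hypothetical.

```python
def char_compression(original, summary):
    """Ratio of summary length to original length, measured in characters."""
    if not original:
        raise ValueError("original text is empty")
    return len(summary) / len(original)

# Toy example: an 80-character summary of a 1000-character text
ratio = char_compression("a" * 1000, "b" * 80)
print(f"Character-level compression: {ratio:.2%}")
```

In the notebook this could be called as `char_compression(text, final_summary)`.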


# Generate Report & Key Takeaways

In [None]:
# Set the random seed for reproducibility in PyTorch operations.
torch.random.manual_seed(0)

# Load the pre-trained causal language model "microsoft/Phi-3-mini-4k-instruct".
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

# Load the corresponding tokenizer for the "microsoft/Phi-3-mini-4k-instruct" model.
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")



In [None]:
def get_report(prompt, output_tokens=1024, print_outcome=True):
    # Set the arguments for text generation:
    generation_args = {
        "max_new_tokens": output_tokens, # Limit the number of tokens generated to 'output_tokens'
        "return_full_text": False, # Return only the newly generated text, not the full conversation
        "do_sample": False, # Disable sampling for deterministic output (i.e., always generate the same output given the same input)
    }

    # Define the conversation history as a list of messages
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant specialized in generating concise and informative summaries and reports for investors."},
        {"role": "user", "content": "I need a summary of the latest quarterly financial performance of the following Company X. Here are the details: revenue increased by 10% compared to the previous quarter, net profit margin improved by 2%, and operating expenses decreased by 5%. Can you summarize this information for an investor?"},
        {"role": "assistant", "content": "Sure! Here’s a concise summary for investors: In the latest quarterly report for Company X, revenue rose by 10% from the previous quarter, net profit margin saw a 2% improvement, and operating expenses were reduced by 5%."},
        {"role": "user", "content": prompt},
    ]

    # Initialize the text-generation pipeline using the pre-loaded model and tokenizer.
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )

    # Generate the report based on the conversation history and specified arguments.
    output = pipe(messages, **generation_args)

    # If 'print_outcome' is True, print the generated text.
    if print_outcome:
        print(output[0]['generated_text'])

    # Return the generated text.
    return output[0]['generated_text']
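
The `do_sample=False` setting is what makes the report reproducible. As a rough illustration (a toy sketch, not the model's actual decoding code), with the default single-beam setting the generator keeps the most likely next token at every step, whereas sampling draws a token at random in proportion to its probability. The tokens and probabilities below are made up for the example, and `pick_greedy` and `pick_sampled` are hypothetical helper names.

```python
import random

# Toy next-token distribution; tokens and probabilities are invented for illustration.
NEXT_TOKEN_PROBS = {"rose": 0.6, "fell": 0.3, "stalled": 0.1}


def pick_greedy(probs):
    """Return the single most likely token (the behaviour of do_sample=False with one beam)."""
    return max(probs, key=probs.get)


def pick_sampled(probs, rng):
    """Draw one token at random, weighted by its probability (the idea behind do_sample=True)."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Greedy selection yields the same token no matter how often it is repeated.
greedy_picks = {pick_greedy(NEXT_TOKEN_PROBS) for _ in range(100)}

# Sampling depends on the random state, so different seeds can give different tokens.
sampled_picks = {pick_sampled(NEXT_TOKEN_PROBS, random.Random(seed)) for seed in range(100)}

print("greedy:", greedy_picks)
print("sampled:", sorted(sampled_picks))
```

For an investor-facing report, the deterministic setting is a reasonable default because rerunning the notebook on the same transcript should not change the text.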

In [None]:
# Create a prompt for generating a concise summary using the provided context.
prompt_summary = f"""
Use the context to summarize all the content available in the context. Be concise and specific.\n
Context:\n{final_summary}\n\n
Summary:
"""

In [None]:
# Get the report and print it
output_summary = get_report(prompt_summary)

 In the upcoming second-quarter 2024 earnings call, Google is expected to report strong financial performance, with a 14% or 15% increase in constant currency revenues, driven by search and cloud services. The company's search revenue remains the largest contributor, while cloud performance saw a 26% increase, with a 20% year-on-year rise in revenues. Google's G&A expenses decreased by 5%, and free cash flow was reported at $13.5 billion. The company's AI-powered applications, including AI overviews and AI-driven improvements in search ads, are contributing to this growth. The AI infrastructure and applications are driving fundamental differentiation and customer interest, with AI-powered tools like Gemini and AI-enhanced shopping ads expanding to performance max and standard campaigns. The company is also investing in AI-driven initiatives such as virtual try-on in shopping ads and AI-powered tools for demand generation and search ads. With a strong focus on AI, Google is positioning 

Generated report:

> *In the upcoming second-quarter 2024 earnings call, Google is expected to report strong financial performance, with a 14% or 15% increase in constant currency revenues, driven by search and cloud services. The company's search revenue remains the largest contributor, while cloud performance saw a 26% increase. The second quarter also saw a 20% year-on-year revenue rise, driven by subscriptions and platforms. Google's G&A expenses decreased by 5%, and free cash flow was reported at $13.5 billion. The company's investment in AI, including the launch of Astra, the first data center and cloud region, and the introduction of AI-powered applications like AI overviews and AI-powered shopping ads, is expected to continue driving growth. The company's strong customer interest and partnerships with leading brands like Hitachi are also highlighted. Sundar Pichai, CEO, emphasizes the importance of AI in driving fundamental differentiation and the company's momentum in AI infrastructure.*

In [None]:
def get_takeaways(prompt, output_tokens=1024, print_outcome=True):
    generation_args = {
        "max_new_tokens": output_tokens, # Limit the number of tokens generated to 'output_tokens'
        "return_full_text": False, # Return only the newly generated text, not the full conversation
        "do_sample": False, # Disable sampling for deterministic output (i.e., always generate the same output given the same input)
    }

    # Define the conversation history tailored for generating key takeaways
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant specialized in generating concise and informative key takeaways for investors from financial reports and earnings calls."},
        {"role": "user", "content": "Please extract the key takeaways for investors from the following financial report or earnings call details. Ensure the information is concise, focusing on growth, financial performance, and strategic initiatives."},
        {"role": "assistant", "content": "Absolutely! I’ll provide a clear and concise summary of the most important takeaways for investors."},
        {"role": "user", "content": prompt},
    ]

    # Initialize the text-generation pipeline using the pre-loaded model and tokenizer.
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )

    # Generate the takeaways based on the conversation history and specified arguments
    output = pipe(messages, **generation_args)

    # If 'print_outcome' is True, print the generated text.
    if print_outcome:
        print(output[0]['generated_text'])

    # Return the generated text.
    return output[0]['generated_text']
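
`get_report` and `get_takeaways` differ only in the conversation they send to the model. As a possible simplification (a sketch, not part of the original pipeline), the shared message layout could be built by a small helper. `build_messages` is a hypothetical name, and the strings in the usage lines are shortened placeholders.

```python
def build_messages(system_prompt, example_user, example_assistant, prompt):
    """Assemble a chat history: system message, one example exchange, then the actual request."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": example_user},
        {"role": "assistant", "content": example_assistant},
        {"role": "user", "content": prompt},
    ]


# Placeholder strings; the notebook's functions use longer, task-specific texts.
messages = build_messages(
    system_prompt="You are a helpful AI assistant for investors.",
    example_user="Please extract the key takeaways from the following details.",
    example_assistant="Absolutely! I will provide a concise summary.",
    prompt="List the top 10 key points for investors.",
)
print([message["role"] for message in messages])
```

With such a helper, each task would only need to supply its own system prompt and example exchange, while the generation arguments and the pipeline call stay in one place.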

In [None]:
# Create a prompt to list the top 10 key points for investors using the provided context.
prompt_takeaways = f"""
List the top 10 key points that an investor would find most relevant based on the following context. Focus on financial and strategic insights.

Context:
{final_summary}

Top 10 Key Points for Investors:
"""

In [None]:
# Get the top 10 key points and print them
output_takeaways = get_takeaways(prompt_takeaways)

 1. Google's search and cloud services continue to drive strong revenue growth, with second-quarter revenues up 14% or 15% in constant currency.

2. The company's AI initiatives are gaining traction, with higher engagement from younger users and the introduction of new AI search features.

3. Google's AI-powered applications are helping to attract new customers, and the company is expanding AI tools to performance marketing campaigns.

4. The company's AI infrastructure is a key differentiator, with the AI-powered Gemini platform powering over two billion monthly users.

5. Google's AI-driven improvements have led to a 10% increase in broad match performance for search ads.

6. YouTube remains the top-watched streaming platform in the U.S., with a 130% increase in views over the last three years.

7. Creator ads on YouTube have a lower viewer CPA and higher conversion rate, indicating strong performance in this segment.

8. Google's G&A expenses have decreased by 5%, and the company ha