# **Earnings Call Summarization**

by Strong Sizhi Chen

In [1]:
import pandas as pd
df = pd.read_pickle("motley-fool-data.pkl")
df

Unnamed: 0,date,exchange,q,ticker,transcript
0,"Aug 27, 2020, 9:00 p.m. ET",NASDAQ: BILI,2020-Q2,BILI,"Prepared Remarks:\nOperator\nGood day, and wel..."
1,"Jul 30, 2020, 4:30 p.m. ET",NYSE: GFF,2020-Q3,GFF,Prepared Remarks:\nOperator\nThank you for sta...
2,"Oct 23, 2019, 5:00 p.m. ET",NASDAQ: LRCX,2020-Q1,LRCX,Prepared Remarks:\nOperator\nGood day and welc...
3,"Nov 6, 2019, 12:00 p.m. ET",NASDAQ: BBSI,2019-Q3,BBSI,"Prepared Remarks:\nOperator\nGood day, everyon..."
4,"Aug 7, 2019, 8:30 a.m. ET",NASDAQ: CSTE,2019-Q2,CSTE,Prepared Remarks:\nOperator\nGreetings and wel...
...,...,...,...,...,...
18750,"Nov 9, 2021, 1:00 p.m. ET",NYSE: SWX,2021-Q3,SWX,Prepared Remarks:\nOperator\nLadies and gentle...
18751,"Nov 18, 2021, 12:00 p.m. ET",NYSE: PNNT,2021-Q4,PNNT,"Prepared Remarks:\nOperator\nGood morning, and..."
18752,"Feb 08, 2022, 11:00 a.m. ET",NYSE: TDG,2022-Q1,TDG,Prepared Remarks:\nOperator\nThank you for sta...
18753,"Feb 28, 2022, 4:30 p.m. ET",NASDAQ: DVAX,2021-Q4,DVAX,"Prepared Remarks:\nOperator\nGood day, ladies ..."


**Clean the Earnings Call**

In [2]:
import re

def is_speaker_line(line: str) -> bool:
    line = line.strip()
    if line == "":
        return False

    # Pattern 1: "Name --"
    if re.match(r"^[A-Za-z][A-Za-z\s\.]*\s--\s", line):
        return True

    # Pattern 2: "Operator:", "Analyst:", etc.
    if re.match(r"^(Operator|Analyst|Coordinator|Speaker)\b", line, re.IGNORECASE):
        return True

    # Pattern 3: "Name - Title"
    if re.match(r"^[A-Za-z][A-Za-z\s\.]*\s-\s[A-Za-z]", line):
        return True

    # Pattern 4: lines ending with ":" (likely a speaker)
    if line.endswith(":") and len(line.split()) <= 4:
        return True

    return False


def is_disclaimer(line: str) -> bool:
    """ forward-looking statements / safe harbor """
    text = line.lower()
    keywords = [
        "forward-looking statements",
        "safe harbor",
        "actual results may differ",
        "undue reliance",
        "investors are cautioned",
        "risk factors",
        "securities and exchange commission",
    ]
    return any(k in text for k in keywords)


def is_operator_instruction(line: str) -> bool:
    """ operator instructions """
    text = line.lower()
    patterns = [
        "we will now begin the q&a",
        "we will now begin the question-and-answer",
        "please stand by",
        "your lines are open",
        "our first question comes from",
        "we'll now open the call",
        "this call is being recorded",
        "a replay will be available",
    ]
    return any(p in text for p in patterns)


def is_meaningless_short(line: str) -> bool:
    """ remove lines like 'Thanks', 'Good morning', 'Hi everyone' """
    text = line.strip().lower()

    # too short or generic
    if len(text) < 8:
        return True

    keywords = [
        "thank you",
        "thanks",
        "good morning",
        "good afternoon",
        "good evening",
        "hi everyone",
        "hello everyone",
        "welcome to",
    ]
    return any(text.startswith(k) for k in keywords)


def is_qa_marker(line: str) -> bool:
    text = line.lower()
    patterns = [
        "question-and-answer session",
        "q&a session",
        "questions and answers",
        "question and answer session",
    ]
    return any(p in text for p in patterns)


def clean_transcript(text: str) -> str:
    if not isinstance(text, str):
        return ""

    cleaned = []
    for line in text.split("\n"):
        line = line.strip()
        if line == "":
            continue

        if is_speaker_line(line):
            continue
        if is_disclaimer(line):
            continue
        if is_operator_instruction(line):
            continue
        if is_qa_marker(line):
            continue
        if is_meaningless_short(line):
            continue

        cleaned.append(line)

    return "\n".join(cleaned)

# Create a new column with cleaned transcripts
df["clean_transcript"] = df["transcript"].apply(clean_transcript)

In [3]:
# Inspect one example before/after
idx = 0
print("===== ORIGINAL TRANSCRIPT (first 500 chars) =====")
print(df.loc[idx, "transcript"][:1000], "\n")

print("===== CLEANED TRANSCRIPT (first 500 chars) =====")
print(df.loc[idx, "clean_transcript"][:1000])

===== ORIGINAL TRANSCRIPT (first 500 chars) =====
Prepared Remarks:
Operator
Good day, and welcome to the Bilibili 2020 Second Quarter Earnings Conference Call. Today's conference is being recorded.
At this time, I would like to turn the conference over to Juliet Yang, Senior Director of Investor Relations. Please go ahead.
Juliet Yang -- Senior Director of Investor Relations
Thank you, operator.
Please note the discussion today will contain forward-looking statements relating to the Company's future performance, and are intended to qualify for the Safe Harbor from liability, as established by the US Private Securities Litigation Reform Act. Such statements are not guarantees of future performance and are subject to certain risks and uncertainties, assumptions and other factors. Some of these risks are beyond the Company's control and could cause actual results to differ materially from those mentioned in today's press release and this discussion. A general discussion of the risk facto

# **TF-IDF Extractive Summarizer**
TF-IDF (Term Frequency–Inverse Document Frequency) is a numerical measure used in NLP to evaluate how important a word is within a sentence or document. It combines two ideas: TF, which counts how often a word appears in the current sentence (words appearing more frequently in that sentence are considered more relevant), and IDF, which penalizes words that appear too often across the entire document or corpus (common words like “the”, “and”, “we” get very low scores). By multiplying TF and IDF, TF-IDF assigns high scores to words that are both frequent in a sentence and relatively rare but meaningful globally—such as “revenue”, “margin”, or “cloud” in an earnings call. Because important sentences tend to contain many high-TF-IDF words, TF-IDF can be used for extractive summarization: each sentence is turned into a TF-IDF vector, compared with the document’s overall theme vector, and the most representative sentences are selected as the summary.

In [4]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re

# 1. Simple sentence splitter
#    This avoids needing nltk/spacy for now.
def split_sentences(text):
    """
    Split a long transcript into sentences using simple regex.
    Removes very short or empty sentences.
    """
    if not isinstance(text, str) or text.strip() == "":
        return []

    # Split by newline or punctuation + optional whitespace
    raw_sents = re.split(r'[\n\.!?]+', text)

    # Clean whitespace and remove very short sentences
    sentences = [s.strip() for s in raw_sents if len(s.strip()) > 20]

    return sentences

# 2. Extractive summarization function (TF-IDF based)
def extractive_summary(text, num_sentences=5):
    """
    Produce an extractive summary by:
    1. Splitting transcript into sentences
    2. Computing TF-IDF vectors
    3. Computing cosine similarity to document vector
    4. Selecting top-k sentences
    """
    sentences = split_sentences(text)
    if len(sentences) == 0:
        return ""
    if len(sentences) <= num_sentences:
        return " ".join(sentences)

    # TF-IDF for all sentences
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(sentences)   # sparse matrix

    # Document vector (average)
    doc_vector = tfidf.mean(axis=0)               # 1 x vocab_size (numpy matrix)

    # Convert doc_vector to a proper array
    doc_vector = np.asarray(doc_vector)           # shape (1, vocab_size)

    # Compute cosine similarity for each sentence vs document vector
    scores = cosine_similarity(tfidf, doc_vector).ravel()

    # Select the top-k sentences
    top_k_idx = np.argsort(scores)[-num_sentences:]
    top_k_idx = np.sort(top_k_idx)

    selected = [sentences[i]+"." for i in top_k_idx]
    return "\n".join(selected)

In [5]:
# 3. Test extractive summarization on a single transcript
sample_text = df.loc[15578, "clean_transcript"]

print("====== ORIGINAL TEXT (first 1000 chars) ======")
print(sample_text[:1000], "...\n")

summary = extractive_summary(sample_text, num_sentences=10)

print("====== EXTRACTIVE SUMMARY ======")
print(summary)

Good day and welcome to the Topgolf Callaway Brands Corp. 2022 third quarter earnings conference call. All participants will be in a listen-only mode. [Operator instructions] After today's presentation, there will be an opportunity to ask questions.
[Operator instructions] Please note this event is being recorded. I would now like to turn the conference over to Ms. Lauren Scott, director of investor relations. Please go ahead.
Jennifer Thomas, our chief accounting officer; and Patrick Burke, our senior vice president of global finance, are also in the room today for Q&A. Earlier today, the company issued a press release announcing its third quarter 2022 financial results. In addition, there is a presentation that accompanies today's prepared remarks and may make it easier for you to follow the call. This earnings presentation, as well as the earnings release, are both available under the company's investor relations website under the financial results tab.
The strength of our results u

# **TextRank baseline**
TextRank extractive summarization is a graph-based method for identifying the most important sentences in a document. It works by first splitting the text into individual sentences and computing the similarity between every pair—typically using TF-IDF or sentence embeddings. Each sentence becomes a node in a graph, and the similarity scores become edge weights connecting these nodes. The algorithm then applies PageRank—the same algorithm Google uses to rank web pages—to determine which sentences are most "central" or well-connected within the document. Sentences that are similar to many other important sentences receive higher scores. Finally, the top-ranked sentences are selected and arranged in their original order to form an extractive summary. This produces summaries that capture the main ideas by leveraging relationships among sentences rather than relying solely on keyword frequency.

In [6]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx
import re

def textrank_summary(text, num_sentences=5):
    """
    Extractive summarization using a TextRank-like approach:
    1. Split transcript into sentences
    2. Compute TF-IDF vectors for each sentence
    3. Build a sentence similarity graph (cosine similarity)
    4. Run PageRank on the graph
    5. Select top-k ranked sentences as summary
    """
    sentences = split_sentences(text)
    if len(sentences) == 0:
        return ""
    if len(sentences) <= num_sentences:
        return " ".join(sentences)

    # Step 1: TF-IDF for sentences
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(sentences)

    # Step 2: compute sentence-to-sentence cosine similarity matrix
    sim_matrix = cosine_similarity(tfidf, tfidf)

    # Step 3: build graph from similarity matrix
    # Nodes: sentence indices; Edges: similarity weights
    nx_graph = nx.from_numpy_array(sim_matrix)

    # Step 4: run PageRank
    scores = nx.pagerank(nx_graph)

    # Step 5: sort sentences by PageRank score
    ranked_sentences = sorted(scores.items(), key=lambda x: x[1], reverse=True)

    # pick top-k indices
    top_indices = [idx for idx, _ in ranked_sentences[:num_sentences]]
    top_indices = sorted(top_indices)  # keep original order

    selected = [sentences[i] for i in top_indices]
    return "\n".join(selected)

In [7]:
# Test on a single cleaned transcript
sample_text = df.loc[100, "clean_transcript"]

print("====== ORIGINAL (first 600 chars) ======")
print(sample_text[:1000], "...\n")

print("====== TF-IDF SUMMARY ======")
print(extractive_summary(sample_text, num_sentences=10), "\n")

print("====== TEXTRANK SUMMARY ======")
print(textrank_summary(sample_text, num_sentences=10))

Neal Froneman
Those are our single and most important priorities. I will be assisted in presenting that section by Jevon Martin and Grant Stuart. I'll then do a strategic update, which is titled positioning for positive impact; and I think you will find that very interesting. We do use our year-end results to provide strategic guidance to the market and our shareholders.
I will then hand over to Dr. Richard Stewart, our chief operating officer, who will present the results of the operations in detail. As you can see, we've called it operational excellence. We've had an outstanding year considering all the challenges of COVID and safety and so on.
Richard will then hand over to our CFO, Charl Keyter, who will then conduct the financial review, which I think you will also enjoy, an outstanding year from a financial perspective as well. And then I'll wrap up with a brief outlook and conclusion. So as I said, let's start with safety. And it gives me absolutely no pleasure to talk about fat

In [8]:
# Add TF-IDF summary for all rows
df["tfidf_summary"] = df["clean_transcript"].apply(
    lambda t: extractive_summary(t, num_sentences=50)
)

# Add TextRank summary for all rows
df["textrank_summary"] = df["clean_transcript"].apply(
    lambda t: textrank_summary(t, num_sentences=50)
)

In [9]:
# Save it
# Or save to CSV (larger file, but easy to inspect)
df.to_csv("motley_fool_with_summaries.csv", index=False)

In [10]:
import pandas as pd
df = pd.read_csv("motley_fool_with_summaries.csv")
df = df.drop_duplicates(subset=["tfidf_summary"])
df

Unnamed: 0,date,exchange,q,ticker,transcript,clean_transcript,tfidf_summary,textrank_summary
0,"Aug 27, 2020, 9:00 p.m. ET",NASDAQ: BILI,2020-Q2,BILI,"Prepared Remarks:\nOperator\nGood day, and wel...","Good day, and welcome to the Bilibili 2020 Sec...",The second quarter was another strong quarter ...,"Good day, and welcome to the Bilibili 2020 Sec..."
1,"Jul 30, 2020, 4:30 p.m. ET",NYSE: GFF,2020-Q3,GFF,Prepared Remarks:\nOperator\nThank you for sta...,"Finally, some of today's remarks will adjust f...",Adjusted EBITDA increased 31% and adjusted ear...,Adjusted EBITDA increased 31% and adjusted ear...
2,"Oct 23, 2019, 5:00 p.m. ET",NASDAQ: LRCX,2020-Q1,LRCX,Prepared Remarks:\nOperator\nGood day and welc...,Good day and welcome to the September 2019 Qua...,"During today's call, we will share our overvie...","During today's call, we will share our overvie..."
3,"Nov 6, 2019, 12:00 p.m. ET",NASDAQ: BBSI,2019-Q3,BBSI,"Prepared Remarks:\nOperator\nGood day, everyon...","Good day, everyone and thank you for participa...",The third quarter of 2019 had one more work da...,"Good day, everyone and thank you for participa..."
4,"Aug 7, 2019, 8:30 a.m. ET",NASDAQ: CSTE,2019-Q2,CSTE,Prepared Remarks:\nOperator\nGreetings and wel...,Greetings and welcome to the Caesarstone Limit...,We are managing a global growth acceleration p...,We are managing a global growth acceleration p...
...,...,...,...,...,...,...,...,...
18750,"Nov 9, 2021, 1:00 p.m. ET",NYSE: SWX,2021-Q3,SWX,Prepared Remarks:\nOperator\nLadies and gentle...,I would now like to turn the conference over t...,"Karen, will cover recent customer growth, liqu...","Karen, will cover recent customer growth, liqu..."
18751,"Nov 18, 2021, 12:00 p.m. ET",NYSE: PNNT,2021-Q4,PNNT,"Prepared Remarks:\nOperator\nGood morning, and...","Mr. Penn, please go ahead.\nAt this time, I'd ...",These dividend payments highlight the value of...,These dividend payments highlight the value of...
18752,"Feb 08, 2022, 11:00 a.m. ET",NYSE: TDG,2022-Q1,TDG,Prepared Remarks:\nOperator\nThank you for sta...,About 90% of our net sales are generated by pr...,"In our business, we saw another quarter of seq...","In our business, we saw another quarter of seq..."
18753,"Feb 28, 2022, 4:30 p.m. ET",NASDAQ: DVAX,2021-Q4,DVAX,"Prepared Remarks:\nOperator\nGood day, ladies ...","Good day, ladies and gentlemen, and welcome to...","Importantly, we expect to continue to grow rev...","Importantly, we expect to continue to grow rev..."


In [11]:
df.columns

Index(['date', 'exchange', 'q', 'ticker', 'transcript', 'clean_transcript',
       'tfidf_summary', 'textrank_summary'],
      dtype='object')

In [12]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

class AbstractiveSummarizer:
    """
    Abstractive summarization for earnings call transcripts
    Takes extractive summary (10 sentences) as input and generates fluent summary
    """

    def __init__(self, model_name="facebook/bart-large-cnn"):
        """
        Initialize the abstractive summarizer

        Args:
            model_name: HuggingFace model name
                - "facebook/bart-large-cnn" (recommended for general summaries)
                - "google/pegasus-cnn_dailymail" (good for news-style summaries)
                - "facebook/bart-large-xsum" (shorter, more abstractive)
                - "allenai/led-large-16384" (for very long texts)
        """
        self.model_name = model_name
        self.device = 0 if torch.cuda.is_available() else -1

        # Load tokenizer and model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

        if torch.cuda.is_available():
            self.model = self.model.to('cuda')

    def summarize(self, text, max_length=150, min_length=50,
                  num_beams=4, length_penalty=2.0,
                  no_repeat_ngram_size=3):
        """
        Generate abstractive summary from extractive summary

        Args:
            text: Input text (typically 10 sentences from extractive summary)
            max_length: Maximum length of generated summary
            min_length: Minimum length of generated summary
            num_beams: Number of beams for beam search (higher = better quality but slower)
            length_penalty: Penalty for length (>1.0 encourages longer summaries)
            no_repeat_ngram_size: Prevent repetition of n-grams

        Returns:
            Generated summary text
        """
        # Tokenize input
        inputs = self.tokenizer(
            text,
            max_length=1024,
            truncation=True,
            return_tensors="pt",
            padding=True
        )

        if torch.cuda.is_available():
            inputs = inputs.to('cuda')

        # Generate summary
        with torch.no_grad():
            summary_ids = self.model.generate(
                inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                max_length=max_length,
                min_length=min_length,
                num_beams=num_beams,
                length_penalty=length_penalty,
                early_stopping=True,
                no_repeat_ngram_size=no_repeat_ngram_size
            )

        # Decode summary
        summary = self.tokenizer.decode(
            summary_ids[0],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True
        )

        return summary

    def batch_summarize(self, texts, batch_size=4, **kwargs):
        """
        Summarize multiple texts in batches

        Args:
            texts: List of input texts
            batch_size: Number of texts to process at once
            **kwargs: Additional arguments for summarize()

        Returns:
            List of summaries
        """
        summaries = []

        for i in tqdm(range(0, len(texts), batch_size), desc="Generating summaries"):
            batch = texts[i:i+batch_size]

            for text in batch:
                summary = self.summarize(text, **kwargs)
                summaries.append(summary)

        return summaries


def compare_models(extractive_summary):
    """
    Compare different models on the same extractive summary
    Helps you choose the best model for your use case

    Args:
        extractive_summary: Input text (10 sentences from extractive method)
    """
    models = [
        "facebook/bart-large-cnn",
        "google/pegasus-cnn_dailymail",
        "facebook/bart-large-xsum",
    ]

    results = {}

    for model_name in models:
        print(f"\n{'='*70}")
        print(f"Model: {model_name}")
        print('='*70)

        try:
            summarizer = AbstractiveSummarizer(model_name)
            summary = summarizer.summarize(
                extractive_summary,
                max_length=130,
                min_length=40,
                num_beams=4
            )

            results[model_name] = summary
            print(f"\nSummary:\n{summary}\n")
            print(f"Length: {len(summary.split())} words")

        except Exception as e:
            print(f"Error with {model_name}: {str(e)}")
            results[model_name] = None

    return results

# Configuration presets for different use cases
SUMMARY_CONFIGS = {
    "short": {
        "max_length": 100,
        "min_length": 30,
        "num_beams": 4,
        "length_penalty": 1.5,
        "no_repeat_ngram_size": 3
    },
    "medium": {
        "max_length": 150,
        "min_length": 50,
        "num_beams": 4,
        "length_penalty": 2.0,
        "no_repeat_ngram_size": 3
    },
    "long": {
        "max_length": 300,
        "min_length": 100,
        "num_beams": 4,
        "length_penalty": 2.0,
        "no_repeat_ngram_size": 3
    }
}

  from .autonotebook import tqdm as notebook_tqdm


In [13]:
# Initialize abstractive summarizer
summarizer = AbstractiveSummarizer("facebook/bart-large-cnn")

# Process a single example first
sample_transcript = df.iloc[0]['transcript']

# Step 1: Your extractive summarization (you already have this)
extractive_summary1 = extractive_summary(sample_transcript, num_sentences=50)

# Step 2: Abstractive summarization
abstractive_summary = summarizer.summarize(
    extractive_summary1,  # Your 10 sentences
    **SUMMARY_CONFIGS["long"]
)

print("Extractive Summary:")
print(extractive_summary1)
print("\n" + "="*70 + "\n")
print("Abstractive Summary (fluent paragraph):")
print(abstractive_summary)

Extractive Summary:
Good day, and welcome to the Bilibili 2020 Second Quarter Earnings Conference Call.
The second quarter was another strong quarter of growth for Bilibili.
For the second quarter, MAUs were 172 million, up 55% and DAUs were up 52% to 51 million, both on a year-over-year basis.
7% from the same period last year.
9 million content creators uploading 6 million videos per month, representing increases of 123% and 148%, respectively, both year-over-year.
In the first half of 2020, the number of content creators who submitted their first video creation trial grew by 139% year-over-year.
2 billion, up 97% year-over-year.
By the end of the second quarter, we had 89 million official members who passed our 100-question exam, up 65% year-over-year.
Revenues from our mobile games business were up 36% year-over-year to RMB1.
[Foreign Speech] to our game portfolio.
These included simulation game Dark Boom, [Foreign Speech], a self-developed ACG game; [Foreign Speech], a highly anti

In [14]:
compare_models(extractive_summary1)


Model: facebook/bart-large-cnn

Summary:
The second quarter was another strong quarter of growth for Bilibili. MAUs were 172 million, up 55% and DAUs were up 52% to 51 million. Revenues from our mobile games business were up 36% year-over-year to RMB1.2 billion.

Length: 38 words

Model: google/pegasus-cnn_dailymail
Error with google/pegasus-cnn_dailymail: 
 requires the protobuf library but it was not found in your environment. Check out the instructions on the
installation page of its repo: https://github.com/protocolbuffers/protobuf/tree/master/python#installation and follow the ones
that match your environment. Please note that you may need to restart your runtime after installation.


Model: facebook/bart-large-xsum

Summary:
Good afternoon, and welcome to the Bilibili 2020 Second Quarter Earnings Conference Call, which is being webcast live on the company's website and mobile app, and will be available for replay on our website and social media platforms.

Length: 38 words


{'facebook/bart-large-cnn': 'The second quarter was another strong quarter of growth for Bilibili. MAUs were 172 million, up 55% and DAUs were up 52% to 51 million. Revenues from our mobile games business were up 36% year-over-year to RMB1.2 billion.',
 'google/pegasus-cnn_dailymail': None,
 'facebook/bart-large-xsum': "Good afternoon, and welcome to the Bilibili 2020 Second Quarter Earnings Conference Call, which is being webcast live on the company's website and mobile app, and will be available for replay on our website and social media platforms."}

In [None]:
# Don't try this one too slow
summarizer = AbstractiveSummarizer("facebook/bart-large-cnn")

def run_abstractive_summary(text):
    if not isinstance(text, str) or text.strip() == "":
        return ""

    try:
        summary = summarizer.summarize(
            text,
            **SUMMARY_CONFIGS["long"]
        )
        return summary
    except Exception as e:
        print("Error:", e)
        return ""

df_small = df.iloc[:50].copy()

df_small["abstractive_summary"] = df_small["tfidf_summary"].apply(run_abstractive_summary)
df_small