
# Natural Language Processing for Text Summarization

### Natural Language Processing (NLP) for text summarization
-----

Natural Language Processing (NLP) for text summarization involves the use of computational techniques to understand, interpret, and generate concise and meaningful summaries of longer text documents. This process typically includes several key stages:

1. **Text Preprocessing:** Before summarizing, the text needs to be cleaned and prepared. This involves removing irrelevant data, normalizing text, and possibly parsing it into sentences or tokens.
2. **Understanding Context:** NLP models must understand the context and main ideas of the text. This is often achieved through techniques like Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and dependency parsing.
3. **Feature Extraction:** The model identifies key features and important sentences or phrases. Techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or newer neural network-based approaches are used for this purpose.
4. **Summarization Techniques:** There are two main types:
  - **Extractive Summarization:** This method involves selecting and compiling key sentences from the original text to form a summary. It's simpler but may lack coherence.
  - **Abstractive Summarization:** More complex, this method involves generating new sentences that capture the essence of the original text, often using advanced models like sequence-to-sequence neural networks.
5. **Refinement:** The generated summary is often refined for coherence, fluency, and conciseness. This might involve rephrasing or restructuring sentences.
6. **Evaluation:** The quality of the summary is evaluated using metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which compares the generated summary with a human-written reference summary.

Advancements in deep learning, especially models like BERT, GPT, and their variants, have significantly improved the quality of text summarization, making it more accurate and human-like. These technologies have wide applications, from news aggregation to generating executive reports from extensive documents.
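
To make the evaluation stage concrete, here is a minimal sketch of ROUGE-1 recall: the share of reference words that also appear in the generated summary. The function name `rouge1_recall`, the whitespace tokenization, and the two sample strings are assumptions made for illustration; established ROUGE implementations add stemming, other n-gram sizes, and precision/F-scores.

```python
from collections import Counter


def rouge1_recall(candidate: str, reference: str) -> float:
    """Share of reference unigrams that also occur in the candidate (clipped counts)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Each reference word is credited at most as often as it occurs in the candidate
    overlap = sum(min(count, cand_counts[word]) for word, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


# Hypothetical reference and candidate summaries
reference = "revenue grew in the third quarter"
candidate = "third quarter revenue grew"
print(rouge1_recall(candidate, reference))  # 4 of the 6 reference words are covered
```

A recall near 1 means the summary covers most of the reference wording; it says nothing about fluency or about extra material in the summary.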

### Why NLP for text summarization analysis
-----

Applying Natural Language Processing (NLP) to text summarization is valuable for several reasons, particularly in our current age of information overload:

1. **Efficiency in Information Handling:** With the vast amount of textual data available online and in corporate databases, it becomes impractical to read and comprehend everything. Text summarization can condense this information, allowing users to understand content quickly.
2. **Time-Saving:** Summarization significantly reduces the time required to get the gist of a document. This is especially valuable in sectors like legal, academic, and business, where professionals need to process large volumes of text daily.
3. **Enhanced Accessibility:** Summarization makes information more accessible. For instance, it can help people with reading difficulties, or those who are not fluent in a particular language, to understand the content more easily.
4. **Improved Decision Making:** In business and research, executives and scientists can make better-informed decisions when they can quickly access the summarized essence of extensive reports, research papers, or data analyses.
5. **Content Curation and Recommendation:** Summarization can be used to curate content tailored to individual preferences, enhancing user experience in apps and websites by providing brief overviews of articles, papers, or books.
6. **Knowledge Management and Retrieval:** Summarized texts can be more effectively indexed and retrieved, improving the functionality of search engines and knowledge management systems.
7. **Support for Automated Reporting:** In journalism, finance, and weather forecasting, among other fields, text summarization can aid in generating automated reports, saving time and resources.
8. **Educational Purposes:** Summarization tools can assist students and educators by providing concise summaries of educational material, aiding in study and revision.
9. **Language Processing Research:** Summarization challenges NLP technologies to comprehend and reproduce human language more effectively, driving advancements in AI and machine learning.

By leveraging NLP for text summarization, organizations and individuals can manage, interpret, and use textual data more effectively, and keep up with a fast-moving flow of information.

### Objective of this analysis
-----

In this analysis, we use Natural Language Processing (NLP) techniques to extract and highlight the pivotal sentences from [Google's Q3 2023 earnings transcript](https://abc.xyz/2023-q3-earnings-call/). The goal is to isolate the key statements made by the speakers on the call, with a particular focus on three areas:

1. **Investment Perspective for Google Stocks:** By pinpointing the most influential sentences, our analysis will provide valuable insights for investors, aiding them in making more informed decisions about investing in Google's stocks.
2. **Deciphering Google's Product Roadmap:** We will delve into the details of Google's strategic plans, shedding light on their most significant upcoming products and innovations. This will offer a clearer understanding of the company's future direction in technology and services.
3. **Forecasting Google's Future Earnings Potential:** Our analysis will also focus on extracting insights related to Google's future earnings potential, which is vital for investors, analysts, and the financial community.

After analyzing Google's earnings call, we will repeat the process to automatically surface the most salient points from [Snap, Inc.'s latest earnings transcript](https://s25.q4cdn.com/442043304/files/doc_financials/2023/q3/Snap-Inc-Q3-2023-Transcript.pdf). This shows that the method carries over to other documents and provides insight into another major tech company's performance and strategy.

## Install and import requisite packages

In [315]:
# !pip install goose3

In [316]:
import re  # regular expressions
import nltk  # Natural Language Toolkit
import string
from goose3 import Goose
import heapq
from IPython.display import HTML, display

# Tokenization
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

# Removing Stop Words
nltk.download('stopwords')
from nltk.corpus import stopwords

# Stemming and Lemmatization
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Frequency Distribution of the Words
from nltk.probability import FreqDist

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [317]:
url = 'https://abc.xyz/2023-q3-earnings-call/'
g = Goose()
data = g.extract(url)

In [318]:
data.cleaned_text

'For a PDF version of the transcript, please click here.\n\nThank you for standing by for the Alphabet third quarter 2023 earnings conference call. At this time, all participants are in listen-only mode. After the speaker presentation, there will be a question-and-answer session. To ask a question during the session, you will need to press *1 on your telephone. I would now like to hand the conference over to your speaker today Jim Friedland, Director of Investor Relations. Please go ahead.\n\nGood afternoon, everyone, and welcome to Alphabet’s Third Quarter 2023 Earnings conference call. With us today are Sundar Pichai, Philipp Schindler and Ruth Porat. Now I’ll quickly cover the Safe Harbor.\n\nSome of the statements that we make today regarding our business, operations and financial performance may be considered forward-looking. Such statements are based on current expectations and assumptions that are subject to a number of risks and uncertainties.\n\nActual results could differ mat

In [319]:
data.title

'2023 Q3 Earnings Call'

## Preprocessing the texts

### Format the text for analysis

**If you're starting directly from `data.cleaned_text`:**

This approach is suitable if the text is not yet split into sentences. Replacing each full stop followed by a space with a full stop plus a newline gives roughly one sentence per line. Be cautious: this can break sentences incorrectly after abbreviations such as "Inc." or "Mr."; decimal numbers such as "5.2" are unaffected because no space follows the point.

In [320]:
original_text = data.cleaned_text.replace('. ', '.\n')
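
The caution about abbreviations can be illustrated with a small sketch that compares the plain replacement with a splitter that skips a list of known abbreviations. The `ABBREVIATIONS` set, the `split_sentences` helper, and the sample sentence are hypothetical and only for illustration; NLTK's `sent_tokenize`, used later in this notebook, handles many of these cases with a trained model.

```python
# Hypothetical abbreviation list; a real transcript would need a longer one
ABBREVIATIONS = {"inc.", "mr.", "ms.", "dr.", "vs."}


def split_sentences(text):
    """Split on sentence-final punctuation, skipping known abbreviations."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        ends_sentence = word[-1] in ".?!" and word.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing fragment without final punctuation
        sentences.append(" ".join(current))
    return sentences


sample = "Mr. Hunter joined Snap Inc. seven years ago. Revenue grew 5.2% vs. 4.1% last year."

naive = sample.replace(". ", ".\n").split("\n")
print(len(naive))               # the plain replacement yields five fragments
print(split_sentences(sample))  # the guarded splitter keeps two sentences
```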

In [321]:
# Let's preview the text / data
original_text

'For a PDF version of the transcript, please click here.\n\nThank you for standing by for the Alphabet third quarter 2023 earnings conference call.\nAt this time, all participants are in listen-only mode.\nAfter the speaker presentation, there will be a question-and-answer session.\nTo ask a question during the session, you will need to press *1 on your telephone.\nI would now like to hand the conference over to your speaker today Jim Friedland, Director of Investor Relations.\nPlease go ahead.\n\nGood afternoon, everyone, and welcome to Alphabet’s Third Quarter 2023 Earnings conference call.\nWith us today are Sundar Pichai, Philipp Schindler and Ruth Porat.\nNow I’ll quickly cover the Safe Harbor.\n\nSome of the statements that we make today regarding our business, operations and financial performance may be considered forward-looking.\nSuch statements are based on current expectations and assumptions that are subject to a number of risks and uncertainties.\n\nActual results could di

### Step 1: Text Normalization
This step converts all text to lowercase for uniformity, removes punctuation, and collapses whitespace (including newline characters) into single spaces.

In [322]:
# these are the types of punctuations to be removed
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [323]:
# Convert to lowercase
text = original_text.lower()

# Remove punctuation using translate
text = text.translate(str.maketrans('', '', string.punctuation))

# Remove special characters and extra spaces/newlines using regex
text = re.sub(r'[^\w\s]', '', text)
text = re.sub(r'\s+', ' ', text)
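
Deleting punctuation outright fuses hyphenated words and contractions, which is why tokens such as `listenonly` and `ill` appear in the output. The sketch below wraps the same three steps in a hypothetical `normalize` helper with an assumed `replace_with` parameter, so the effect of substituting a space instead can be compared. It is an illustration, not a change to the pipeline used in the rest of this notebook.

```python
import re
import string


def normalize(text, replace_with=""):
    """Lowercase, replace punctuation with `replace_with`, and collapse whitespace."""
    text = text.lower()
    # ASCII punctuation
    text = text.translate(str.maketrans({ch: replace_with for ch in string.punctuation}))
    # Anything else that is neither a word character nor whitespace (e.g. curly quotes)
    text = re.sub(r"[^\w\s]", replace_with, text)
    return re.sub(r"\s+", " ", text).strip()


sentence = "At this time, all participants are in listen-only mode."
print(normalize(sentence))                    # ... listenonly mode
print(normalize(sentence, replace_with=" "))  # ... listen only mode
```

Replacing with a space keeps `listen` and `only` as separate tokens, at the cost of leaving fragments such as `ll` from contractions, so neither choice is free of side effects.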

In [324]:
text

'for a pdf version of the transcript please click here thank you for standing by for the alphabet third quarter 2023 earnings conference call at this time all participants are in listenonly mode after the speaker presentation there will be a questionandanswer session to ask a question during the session you will need to press 1 on your telephone i would now like to hand the conference over to your speaker today jim friedland director of investor relations please go ahead good afternoon everyone and welcome to alphabets third quarter 2023 earnings conference call with us today are sundar pichai philipp schindler and ruth porat now ill quickly cover the safe harbor some of the statements that we make today regarding our business operations and financial performance may be considered forwardlooking such statements are based on current expectations and assumptions that are subject to a number of risks and uncertainties actual results could differ materially please refer to our form 10k inc

### Step 2: Tokenization
In this step, the normalized text is split into individual words or tokens. This is crucial for many NLP tasks as it breaks down the text into manageable units for analysis.

In [325]:
# Tokenize the text
tokens = word_tokenize(text)

### Step 3: Removing Stop Words
Stop words (like 'and', 'the', 'is') are removed here. These words are usually filtered out because they occur frequently and carry little meaning for most analyses.

In [326]:
# Load NLTK's English stop-word list.
# Note: this assignment rebinds the name `stopwords` (imported above as the
# corpus reader) to a plain list of words.
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [327]:
# Filter out stop words from tokens (a set makes each membership test O(1))
stop_word_set = set(stopwords)
filtered_tokens = [word for word in tokens if word not in stop_word_set]
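
Because punctuation is stripped before this filter runs, a contraction such as "you're" reaches it as `youre` and no longer matches the stop-word entry. One possible remedy, sketched below, is to strip punctuation from the stop words as well. The `sample_stop_words` list is a small hand-picked subset of the printed list, and the token string is hypothetical.

```python
import string

# Small subset of the stop-word list printed above, for illustration only
sample_stop_words = ["i", "we", "you're", "it's", "that'll", "the", "is"]

# Apply the same punctuation stripping that the text went through
strip_punct = str.maketrans("", "", string.punctuation)
normalized_stop_words = {word.translate(strip_punct) for word in sample_stop_words}

tokens = "youre right its the plan".split()  # hypothetical normalized tokens

naive = [t for t in tokens if t not in set(sample_stop_words)]
fixed = [t for t in tokens if t not in normalized_stop_words]

print(naive)  # 'youre' and 'its' slip through the unmodified list
print(fixed)  # both are removed once the stop words are normalized too
```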

In [328]:
len(stopwords)

179

### Step 4: Stemming and Lemmatization
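This step reduces words to a base or root form. Stemming strips suffixes with fixed rules and can produce non-words (for example 'pleas' for 'please'), while lemmatization maps each word to a dictionary form (for example 'participant' for 'participants'). Both help standardize words to their core meaning; the rest of this analysis uses the lemmatized tokens.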

In [329]:
# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [330]:
# Apply stemming and lemmatization
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

In [331]:
# Print the processed tokens
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)

Stemmed Tokens: ['pdf', 'version', 'transcript', 'pleas', 'click', 'thank', 'stand', 'alphabet', 'third', 'quarter', '2023', 'earn', 'confer', 'call', 'time', 'particip', 'listenonli', 'mode', 'speaker', 'present', 'questionandansw', 'session', 'ask', 'question', 'session', 'need', 'press', '1', 'telephon', 'would', 'like', 'hand', 'confer', 'speaker', 'today', 'jim', 'friedland', 'director', 'investor', 'relat', 'pleas', 'go', 'ahead', 'good', 'afternoon', 'everyon', 'welcom', 'alphabet', 'third', 'quarter', '2023', 'earn', 'confer', 'call', 'us', 'today', 'sundar', 'pichai', 'philipp', 'schindler', 'ruth', 'porat', 'ill', 'quickli', 'cover', 'safe', 'harbor', 'statement', 'make', 'today', 'regard', 'busi', 'oper', 'financi', 'perform', 'may', 'consid', 'forwardlook', 'statement', 'base', 'current', 'expect', 'assumpt', 'subject', 'number', 'risk', 'uncertainti', 'actual', 'result', 'could', 'differ', 'materi', 'pleas', 'refer', 'form', '10k', 'includ', 'risk', 'factor', 'section', 'f

### Step 5: Reassembling Processed Tokens into Formatted Text
After preprocessing the text, including tokenization, stop word removal, and stemming or lemmatization, this step combines the processed tokens back into a single string. This helps in visualizing the result of preprocessing as a readable text.

In [332]:
# Reassembling Processed Tokens into Formatted Text

# Using lemmatized tokens for reassembly
formatted_text = ' '.join(lemmatized_tokens)

# Print the formatted text
formatted_text

'pdf version transcript please click thank standing alphabet third quarter 2023 earnings conference call time participant listenonly mode speaker presentation questionandanswer session ask question session need press 1 telephone would like hand conference speaker today jim friedland director investor relation please go ahead good afternoon everyone welcome alphabet third quarter 2023 earnings conference call u today sundar pichai philipp schindler ruth porat ill quickly cover safe harbor statement make today regarding business operation financial performance may considered forwardlooking statement based current expectation assumption subject number risk uncertainty actual result could differ materially please refer form 10k including risk factor section form 10qs undertake obligation update forwardlooking statement call present gaap nongaap financial measure reconciliation nongaap gaap measure included today earnings press release distributed available public investor relation website 

## Word frequency

Moving to the phase of analyzing word frequency, we'll focus on several key aspects. These steps will help us understand the distribution of words in the text, identify the most frequent words, and evaluate their relative importance based on frequency. Let's break down each step:

### Step 1: Frequency Distribution of the Words
Here, we'll calculate how often each word appears in the text.

In [333]:
# word_frequency = nltk.FreqDist(nltk.word_tokenize(formatted_text))
# word_frequency

In [334]:
# Frequency Distribution of the Words

# Assuming 'lemmatized_tokens' is the list of words after preprocessing
freq_dist = FreqDist(lemmatized_tokens)

# Create a dictionary from the frequency distribution
word_freq_dict = dict(freq_dist)

# (Optional) Display the frequency distribution
for word, frequency in freq_dist.items():
    print(word, ":", frequency)

pdf : 1
version : 1
transcript : 1
please : 5
click : 1
thank : 20
standing : 1
alphabet : 19
third : 19
quarter : 43
2023 : 6
earnings : 6
conference : 5
call : 10
time : 6
participant : 1
listenonly : 1
mode : 3
speaker : 2
presentation : 1
questionandanswer : 2
session : 3
ask : 4
question : 34
need : 5
press : 3
1 : 3
telephone : 2
would : 10
like : 16
hand : 1
today : 10
jim : 4
friedland : 3
director : 2
investor : 4
relation : 3
go : 3
ahead : 9
good : 8
afternoon : 1
everyone : 7
welcome : 1
u : 11
sundar : 18
pichai : 7
philipp : 9
schindler : 5
ruth : 12
porat : 6
ill : 6
quickly : 3
cover : 1
safe : 1
harbor : 1
statement : 3
make : 10
regarding : 2
business : 15
operation : 6
financial : 6
performance : 10
may : 3
considered : 1
forwardlooking : 2
based : 3
current : 2
expectation : 3
assumption : 1
subject : 1
number : 15
risk : 2
uncertainty : 1
actual : 1
result : 15
could : 8
differ : 1
materially : 1
refer : 1
form : 2
10k : 1
including : 11
factor : 3
section : 1
10qs
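
NLTK's `FreqDist` builds on Python's `collections.Counter`, so the same counting can be sketched with the standard library alone. The `sample_tokens` list below is made up for illustration.

```python
from collections import Counter

# Hypothetical preprocessed tokens
sample_tokens = ["quarter", "google", "quarter", "question", "google", "google"]

counts = Counter(sample_tokens)

print(counts.most_common(2))  # the two most frequent words with their counts
print(counts["revenue"])      # a word that never occurs counts as 0
```

`most_common(n)` returns the top `n` entries already sorted by count, which avoids printing every word when only the head of the distribution is of interest.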

### Step 2: Find the Frequency of a Specific Word
This step looks up the frequency of a word you enter in the processed text. The cell is commented out; uncomment it to run the lookup interactively.

In [335]:
# # Find the Frequency of a Specific Word

# # Ask the user to input a word
# word_to_check = input("Enter a word to find its frequency: ").lower()

# # Fetch the frequency of the word
# word_frequency = freq_dist.get(word_to_check, 0)

# # Display the frequency
# print(f"The word '{word_to_check}' appears {word_frequency} times in the text.")

### Step 3: Looking at the Keys
Here, we'll list the words (keys of the dictionary) that have been processed.

In [336]:
# Looking at the Keys
words = list(word_freq_dict.keys())
print(words)

['pdf', 'version', 'transcript', 'please', 'click', 'thank', 'standing', 'alphabet', 'third', 'quarter', '2023', 'earnings', 'conference', 'call', 'time', 'participant', 'listenonly', 'mode', 'speaker', 'presentation', 'questionandanswer', 'session', 'ask', 'question', 'need', 'press', '1', 'telephone', 'would', 'like', 'hand', 'today', 'jim', 'friedland', 'director', 'investor', 'relation', 'go', 'ahead', 'good', 'afternoon', 'everyone', 'welcome', 'u', 'sundar', 'pichai', 'philipp', 'schindler', 'ruth', 'porat', 'ill', 'quickly', 'cover', 'safe', 'harbor', 'statement', 'make', 'regarding', 'business', 'operation', 'financial', 'performance', 'may', 'considered', 'forwardlooking', 'based', 'current', 'expectation', 'assumption', 'subject', 'number', 'risk', 'uncertainty', 'actual', 'result', 'could', 'differ', 'materially', 'refer', 'form', '10k', 'including', 'factor', 'section', '10qs', 'undertake', 'obligation', 'update', 'present', 'gaap', 'nongaap', 'measure', 'reconciliation', '

### Step 4: Counting the Number of Keys
This code will tell us how many unique words (keys) are in our processed text.

In [337]:
# Counting the Number of Keys
num_keys = len(word_freq_dict)
print("Number of unique words:", num_keys)

Number of unique words: 1432


### Step 5: Retrieving the Word with the Highest Frequency and its Count
This step identifies the word with the highest frequency in the text and displays both the word and its frequency.

In [338]:
# Retrieving the Word with the Highest Frequency and its Count

# Find the word with the highest frequency
most_common_word, highest_freq = max(freq_dist.items(), key=lambda x: x[1])

# Display the most common word and its frequency
print(f"The word '{most_common_word}' has the highest frequency, appearing {highest_freq} times.")

The word 'google' has the highest frequency, appearing 63 times.


### Step 6: Calculating Word Weights Relative to the Highest Frequency
We first determine the highest frequency in the text.
Then, for each word, we calculate its weight by dividing its frequency by that maximum. The most common word gets a weight of 1, and every other word gets a weight between 0 and 1 relative to it.

In [339]:
# Calculating Word Weights Relative to the Highest Frequency

# Find the highest frequency
_, highest_freq = max(freq_dist.items(), key=lambda x: x[1])

# Calculate the weight of each word based on the highest frequency
word_weights = {word: freq / highest_freq for word, freq in freq_dist.items()}

# Display the word weights
print(word_weights)

{'pdf': 0.015873015873015872, 'version': 0.015873015873015872, 'transcript': 0.015873015873015872, 'please': 0.07936507936507936, 'click': 0.015873015873015872, 'thank': 0.31746031746031744, 'standing': 0.015873015873015872, 'alphabet': 0.30158730158730157, 'third': 0.30158730158730157, 'quarter': 0.6825396825396826, '2023': 0.09523809523809523, 'earnings': 0.09523809523809523, 'conference': 0.07936507936507936, 'call': 0.15873015873015872, 'time': 0.09523809523809523, 'participant': 0.015873015873015872, 'listenonly': 0.015873015873015872, 'mode': 0.047619047619047616, 'speaker': 0.031746031746031744, 'presentation': 0.015873015873015872, 'questionandanswer': 0.031746031746031744, 'session': 0.047619047619047616, 'ask': 0.06349206349206349, 'question': 0.5396825396825397, 'need': 0.07936507936507936, 'press': 0.047619047619047616, '1': 0.047619047619047616, 'telephone': 0.031746031746031744, 'would': 0.15873015873015872, 'like': 0.25396825396825395, 'hand': 0.015873015873015872, 'toda
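
As a worked check of the weighting, the sketch below recomputes a few weights from counts reported earlier in this notebook ('google' 63, 'quarter' 43, 'question' 34, 'thank' 20). The dictionary is a hand-copied subset, not the full distribution.

```python
# Counts copied from the frequency output above (subset only)
counts = {"google": 63, "quarter": 43, "question": 34, "thank": 20}

highest = max(counts.values())
weights = {word: count / highest for word, count in counts.items()}

for word, weight in weights.items():
    print(f"{word}: {weight:.4f}")
```

For example, 'quarter' gets 43 / 63 ≈ 0.6825, which matches the value in the printed dictionary.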

## Sentence tokenization

Moving on to sentence tokenization, we'll work with the original data (for readability) to split the text into individual sentences. Sentence tokenization breaks a large body of text into its constituent sentences and underpins sentence-level tasks such as summarization and sentiment analysis.

Let's proceed with the code for sentence tokenization using NLTK, a popular NLP library in Python:

In [340]:
# Sentence Tokenization
# Assuming 'original_text' contains your initial, unprocessed text
# original_text = """Your original text here."""

# Perform sentence tokenization
sentences = sent_tokenize(original_text)

# Display the tokenized sentences
for idx, sentence in enumerate(sentences):
    print(f"Sentence {idx + 1}: {sentence}")

Sentence 1: For a PDF version of the transcript, please click here.
Sentence 2: Thank you for standing by for the Alphabet third quarter 2023 earnings conference call.
Sentence 3: At this time, all participants are in listen-only mode.
Sentence 4: After the speaker presentation, there will be a question-and-answer session.
Sentence 5: To ask a question during the session, you will need to press *1 on your telephone.
Sentence 6: I would now like to hand the conference over to your speaker today Jim Friedland, Director of Investor Relations.
Sentence 7: Please go ahead.
Sentence 8: Good afternoon, everyone, and welcome to Alphabet’s Third Quarter 2023 Earnings conference call.
Sentence 9: With us today are Sundar Pichai, Philipp Schindler and Ruth Porat.
Sentence 10: Now I’ll quickly cover the Safe Harbor.
Sentence 11: Some of the statements that we make today regarding our business, operations and financial performance may be considered forward-looking.
Sentence 12: Such statements are 

We now have one line per sentence. The original data (`original_text`) is used here because results stripped of stop words and punctuation would be hard for a reader to follow.

In short, the original text is used to show results to the user, while the preprocessed version is used for the calculations that identify the best sentences.

## Generate the summary (score for sentences)

To generate a score for each sentence in your text, we will follow a multi-step process. This scoring process will involve calculating a score for each sentence based on the words it contains, and then using these scores to identify the most significant sentences for the summary. Let's break down the process:

### Step 1: Calculate Scores for Each Sentence
First, we'll calculate a score for each sentence by summing the weights of the words it contains. Remember, these weights were determined based on word frequency in the previous steps.

In [341]:
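# Calculate a score for each sentence by summing the weights of its words.
# `sentences` is the list from sent_tokenize; `word_weights` maps each
# preprocessed (lemmatized, punctuation-free) word to its relative frequency.
# Note: the tokens below are only lowercased, so inflected or hyphenated forms
# such as 'participants' or 'listen-only' find no weight and contribute 0.
# Identical sentences also share a single dictionary entry.

sentence_scores = {}

for sentence in sentences:
    words = word_tokenize(sentence.lower())
    sentence_score = sum(word_weights.get(word, 0) for word in words)
    sentence_scores[sentence] = sentence_score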
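
The weight dictionary is keyed by preprocessed words, so a sentence token only finds its weight if it is normalized the same way. The sketch below strips punctuation from each token before the lookup. The `score_sentence` helper is hypothetical, and `sample_weights` holds four weights copied from the output above, written as fractions of the top count 63. Lemmatization is deliberately left out, so 'participants' still goes unmatched.

```python
import string

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def score_sentence(sentence, word_weights):
    """Sum the weights of a sentence's words after stripping punctuation."""
    tokens = (tok.translate(_STRIP_PUNCT) for tok in sentence.lower().split())
    return sum(word_weights.get(tok, 0.0) for tok in tokens)


# Subset of the weights computed earlier (count / 63)
sample_weights = {"time": 6 / 63, "mode": 3 / 63, "listenonly": 1 / 63, "participant": 1 / 63}

sentence = "At this time, all participants are in listen-only mode."
print(score_sentence(sentence, sample_weights))
```

With this normalization 'listen-only' matches `listenonly`, so the sentence scores 10/63 instead of 9/63. Summed scores also favor long sentences; dividing by the number of tokens is one common way to offset that.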

In [342]:
sentence_scores

{'For a PDF version of the transcript, please click here.': 0.14285714285714285,
 'Thank you for standing by for the Alphabet third quarter 2023 earnings conference call.': 2.047619047619048,
 'At this time, all participants are in listen-only mode.': 0.14285714285714285,
 'After the speaker presentation, there will be a question-and-answer session.': 0.19047619047619047,
 'To ask a question during the session, you will need to press *1 on your telephone.': 0.8571428571428571,
 'I would now like to hand the conference over to your speaker today Jim Friedland, Director of Investor Relations.': 0.9047619047619048,
 'Please go ahead.': 0.2698412698412698,
 'Good afternoon, everyone, and welcome to Alphabet’s Third Quarter 2023 Earnings conference call.': 1.9841269841269842,
 'With us today are Sundar Pichai, Philipp Schindler and Ruth Porat.': 1.095238095238095,
 'Now I’ll quickly cover the Safe Harbor.': 0.09523809523809523,
 'Some of the statements that we make today regarding our busin

In [343]:
sentence_scores['And we’ll let you do the forecasting.']

0.07936507936507936

In [344]:
sentence_scores.keys()

dict_keys(['For a PDF version of the transcript, please click here.', 'Thank you for standing by for the Alphabet third quarter 2023 earnings conference call.', 'At this time, all participants are in listen-only mode.', 'After the speaker presentation, there will be a question-and-answer session.', 'To ask a question during the session, you will need to press *1 on your telephone.', 'I would now like to hand the conference over to your speaker today Jim Friedland, Director of Investor Relations.', 'Please go ahead.', 'Good afternoon, everyone, and welcome to Alphabet’s Third Quarter 2023 Earnings conference call.', 'With us today are Sundar Pichai, Philipp Schindler and Ruth Porat.', 'Now I’ll quickly cover the Safe Harbor.', 'Some of the statements that we make today regarding our business, operations and financial performance may be considered forward-looking.', 'Such statements are based on current expectations and assumptions that are subject to a number of risks and uncertainties.

### Step 2: Order Sentences by Score Value
Next, we will sort the sentences based on their calculated scores to identify which ones have the highest significance.

In [345]:
# Order Sentences by Score Value
sorted_sentences = sorted(sentence_scores.items(), key=lambda x: x[1], reverse=True)

In [346]:
sorted_sentences

[('It combines video and image ads in one campaign, with access to 3 billion users across YouTube and Google, and the ability to optimize and measure across the funnel using Google AI.',
  6.142857142857142),
 ('Sundar, you guys led over a year ago starting with Performance Max, and I wanted to know if we could get your updated thoughts on how AI might impact the broader advertising industry and how you are aligning Alphabet and Google’s goals with AI and where it might take the advertising industry in the years ahead.',
  5.650793650793651),
 ('We covered many innovations last quarter after GML, like our conversational experience in Google Ads, significant updates to Performance Max and new campaign types like Demand Gen.\nAnd as Sundar said, we’re continuing to experiment with new ad formats on SGE.',
  5.38095238095238),
 ('I’ll start with our performance for the quarter and then give color into the three key priority areas for Ads -- Google AI, Retail, and YouTube -- that we’ve ide

### Step 3: Select the Best Sentences for Summary
Finally, we will select a subset of the highest-scoring sentences to create the final summary. The number of sentences to include can be adjusted based on your requirements.

In [348]:
# Select the Best Sentences for Summary

# You can adjust the number of sentences to include in the summary
number_of_sentences_in_summary = 5

best_sentences = [sentence for sentence, score in sorted_sentences[:number_of_sentences_in_summary]]
final_summary = ' '.join(best_sentences)

# Display the final summary
final_summary

'It combines video and image ads in one campaign, with access to 3 billion users across YouTube and Google, and the ability to optimize and measure across the funnel using Google AI. Sundar, you guys led over a year ago starting with Performance Max, and I wanted to know if we could get your updated thoughts on how AI might impact the broader advertising industry and how you are aligning Alphabet and Google’s goals with AI and where it might take the advertising industry in the years ahead. We covered many innovations last quarter after GML, like our conversational experience in Google Ads, significant updates to Performance Max and new campaign types like Demand Gen.\nAnd as Sundar said, we’re continuing to experiment with new ad formats on SGE. I’ll start with our performance for the quarter and then give color into the three key priority areas for Ads -- Google AI, Retail, and YouTube -- that we’ve identified on past calls as opportunities for long-term growth in advertising. Sundar
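
Joining the top sentences in score order can make the summary jump around the transcript. As an alternative sketch, `heapq.nlargest` (from the `heapq` module imported at the top of this notebook) can pick the top-scoring positions, which are then put back into document order. The `summarize` helper and the sample data are hypothetical.

```python
import heapq


def summarize(sentences, scores, n=5):
    """Pick the n highest-scoring sentences and return them in document order."""
    top_indices = heapq.nlargest(n, range(len(sentences)), key=lambda i: scores[i])
    return " ".join(sentences[i] for i in sorted(top_indices))


# Hypothetical sentences with one score per position
sample_sentences = ["A.", "B.", "C.", "D."]
sample_scores = [0.1, 0.9, 0.3, 0.8]

print(summarize(sample_sentences, sample_scores, n=2))
```

Working with positions instead of sentence text also keeps duplicate sentences apart, which a dictionary keyed by sentence cannot do.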

## Visualizing the summary in HTML
Rendering the summary as HTML presents the results in a more readable format, with styles, colors, and other formatting available to highlight the selected sentences. Here's a basic approach to visualizing your summary in HTML:

### Step 1: Sentence Tokenization
First, we tokenize the original text into individual sentences. This step assumes you have your original text in the `original_text` variable.

In [265]:
# Tokenize the text into sentences
sentence_list = sent_tokenize(original_text)

### Step 2: Identifying Best Sentences
Make sure you have identified the best sentences (for instance, the sentences with the highest scores from your previous analysis).

In [266]:
# Assuming 'best_sentences' is a list of the top sentences from your analysis
# Example:
# best_sentences = ['sentence1', 'sentence2', ...]

### Step 3: Visualizing the Summary in HTML
Now, use HTML formatting within the Jupyter Notebook to highlight these best sentences.

In [267]:
from IPython.display import HTML, display

display(HTML('<h2>Summary</h2>'))

# Wrap each selected sentence in <mark> so it renders highlighted
text = ''
for sentence in sentence_list:
    if sentence in best_sentences:
        text += ' ' + f"<mark>{sentence}</mark>"
    else:
        text += ' ' + sentence

# Display the text with highlighted best sentences
display(HTML(text))
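One caveat with inserting raw sentences into HTML: characters such as `<` or `&` in the text are interpreted as markup. The sketch below is one possible way to handle this, assuming it is acceptable to escape every sentence with the standard library's `html.escape` before wrapping it in `<mark>`. The helper name `build_highlighted_html` and the sample sentences are hypothetical.

```python
from html import escape

def build_highlighted_html(sentence_list, best_sentences):
    """Return an HTML string with the best sentences wrapped in <mark>."""
    best = set(best_sentences)  # set lookup instead of scanning a list
    parts = []
    for sentence in sentence_list:
        safe = escape(sentence)  # neutralize <, > and & in the raw text
        parts.append(f"<mark>{safe}</mark>" if sentence in best else safe)
    return ' '.join(parts)

# Hypothetical sample sentences
demo_html = build_highlighted_html(
    ["Revenue grew 5% & costs fell.", "Margins < 10%."],
    ["Margins < 10%."],
)
print(demo_html)
```

In a notebook, the returned string can be passed to `display(HTML(...))` exactly as in the cell above.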

## Extracting texts from the Internet
To test our text summarization algorithm with real-world data, we can apply it to [Snap Inc.'s Third Quarter 2023 Financial Results](https://investor.snap.com/news/news-details/2023/Snap-Inc.-Announces-Third-Quarter-2023-Financial-Results/default.aspx). This involves extracting the text from the web page, preprocessing it, and then applying the summarization algorithm:

### Step 1: Extracting Text from the Internet
Locate the web page containing Snap Inc.'s Third Quarter 2023 Financial Results, then use a Python library to download it and extract the text. `requests` with `BeautifulSoup` is one option; `goose3`, used in the cell below, fetches the page and extracts the article body in a single call.

In [268]:
from goose3 import Goose

# URL of the webpage containing the financial results
url = 'https://investor.snap.com/news/news-details/2023/Snap-Inc.-Announces-Third-Quarter-2023-Financial-Results/default.aspx'

# Initialize Goose
g = Goose()

# Fetch and extract content
article = g.extract(url=url)
extracted_text = article.cleaned_text

# Print the extracted text or proceed with processing
extracted_text

'Snap Inc. (NYSE: SNAP) today announced financial results for the quarter ended September 30, 2023.\n\n“Our revenue returned to positive growth in Q3, increasing 5% year-over-year and flowing through to positive adjusted EBITDA as our reprioritized cost structure demonstrated the leverage in our business model,” said Evan Spiegel, CEO. “We are focused on improving our advertising platform to drive higher return on investment for our advertising partners, and we have evolved our go-to-market efforts to better serve our partners and drive customer success.”\n\nJerry Hunter, Chief Operating Officer, has notified Snap that he will retire. Mr. Hunter joined Snap seven years ago and served an important role in building the company’s engineering and business structures. Mr. Hunter’s duties and responsibilities will be transitioned by the end of the month and he will continue to support Snap through July 1, 2024 to help ensure this transition is effective.\n\n“I am deeply grateful to Jerry for

In [269]:
article.title

'Snap Inc. Announces Third Quarter 2023 Financial Results'

In [270]:
len(extracted_text)

17046
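To give a rough idea of what an extractor does under the hood, here is a minimal sketch that keeps only the text inside `<p>` elements. It deliberately swaps in the standard library's `html.parser` rather than `BeautifulSoup` or `goose3`, and it runs on a small inline HTML string (hypothetical, not the Snap page) so that no network access is needed. Real pages are messier, and a dedicated library is usually the better choice.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> elements and ignore everything else."""

    def __init__(self):
        super().__init__()
        self._depth = 0        # greater than 0 while inside a <p>
        self._buffer = []      # text fragments of the current paragraph
        self.paragraphs = []   # finished paragraphs

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == 'p' and self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = ''.join(self._buffer).strip()
                if text:
                    self.paragraphs.append(text)
                self._buffer = []

    def handle_data(self, data):
        # Keep text only while inside a paragraph (skips nav, script, etc.)
        if self._depth:
            self._buffer.append(data)

# Hypothetical stand-in for a downloaded page
sample_html = (
    "<html><body><nav>Menu</nav><p>Revenue rose <b>5%</b>.</p>"
    "<script>x=1</script><p>Costs fell.</p></body></html>"
)

parser = ParagraphExtractor()
parser.feed(sample_html)
parser.close()
page_text = '\n\n'.join(parser.paragraphs)
print(page_text)
```

The resulting `page_text` plays the same role as `extracted_text` above: a plain string ready for preprocessing.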

### Step 2: Preprocessing the Extracted Text
Apply the same preprocessing steps to the extracted text, such as lowercasing, tokenization, removing stop words, and possibly stemming or lemmatization.

In [273]:
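import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Lowercase the text and split it into word tokens
text = extracted_text.lower()
words = word_tokenize(text)

# Remove stop words (a set gives fast membership tests and avoids
# rebinding the name of the nltk `stopwords` module)
stop_words = set(nltk.corpus.stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]

# Stemming with the Porter stemmer. The list keeps the name
# `lemmatized_words` because later cells use it, but it holds stems, not lemmas.
stemmer = PorterStemmer()
lemmatized_words = [stemmer.stem(word) for word in filtered_words]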

### Step 3: Applying the Summarization Algorithm
Tokenize the text into sentences, calculate scores based on word frequency, and select the best sentences.

In [280]:
from nltk import FreqDist
from nltk.tokenize import sent_tokenize, word_tokenize

# Tokenize into sentences
sentences = sent_tokenize(extracted_text)

# Calculate word frequencies (the keys are stems produced in Step 2)
freq_dist = FreqDist(lemmatized_words)

# Score each sentence. Each word is stemmed before the lookup so that it
# matches the stemmed keys in freq_dist; stop words are absent and add 0.
sentence_scores = {}
for sentence in sentences:
    sentence_words = word_tokenize(sentence.lower())
    sentence_scores[sentence] = sum(freq_dist.get(stemmer.stem(word), 0) for word in sentence_words)

# Sort and select top sentences for summary
sorted_sentences = sorted(sentence_scores.items(), key=lambda x: x[1], reverse=True)
number_of_sentences_in_summary = 8
best_sentences = [sentence for sentence, _ in sorted_sentences[:number_of_sentences_in_summary]]
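To make the scoring rule concrete, here is a small worked example on a toy text. It is a sketch under stated assumptions: it uses only the standard library (`re` and `collections.Counter`) in place of NLTK's tokenizers and `FreqDist`, skips stemming, and uses a one-word stop list. All names and data are hypothetical.

```python
import re
from collections import Counter

toy_text = "Revenue grew. Revenue growth beat guidance. Weather was mild."
toy_stop_words = {"was"}  # hypothetical, deliberately tiny stop list

# Naive sentence split on whitespace that follows ., ! or ?
toy_sentences = re.split(r'(?<=[.!?])\s+', toy_text)

def tokens(sentence):
    """Lowercase alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r'[a-z]+', sentence.lower())
            if w not in toy_stop_words]

# Word frequencies over the whole text
word_counts = Counter(w for s in toy_sentences for w in tokens(s))

# Sentence score = sum of the frequencies of its words
toy_scores = {s: sum(word_counts[w] for w in tokens(s)) for s in toy_sentences}
top_sentence = max(toy_scores, key=toy_scores.get)

# Optional variant: divide by token count to reduce the bias toward long sentences
normalized = {s: toy_scores[s] / len(tokens(s)) for s in toy_sentences}
top_normalized = max(normalized, key=normalized.get)

print(toy_scores)
print(top_sentence, '|', top_normalized)
```

Here "revenue" occurs twice and every other word once, so the raw scores are 3, 5 and 2. The longest sentence wins on the raw sum, while the length-normalized variant prefers "Revenue grew." instead. Which behavior is preferable depends on the document.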

### Step 4: Visualizing the Summary
Finally, display the summary using HTML in a Jupyter Notebook environment.

In [281]:
def visualize(title, sentence_list, best_sentences):
    from IPython.display import HTML, display

    display(HTML(f'<h1>Summary - {title}</h1>'))
    text = ''
    for sentence in sentence_list:
        # Highlight the selected sentences in place
        if sentence in best_sentences:
            text += ' ' + f"<mark>{sentence}</mark>"
        else:
            text += ' ' + sentence
    display(HTML(text))

# Usage of the function
visualize("Snapchat's Third Quarter 2023 Financial Results", sentences, best_sentences)

## Conclusion and key takeaways

Based on the key sentences extracted from Google's earnings transcript, the three main takeaways are:

1. **Integration and Optimization of Advertising Campaigns:**
  - Google is making significant strides in integrating different types of advertisements, such as video and image ads, into unified campaigns. This integration offers advertisers access to a massive user base across platforms like YouTube and Google. Furthermore, the use of Google AI for optimizing and measuring campaign performance across the sales funnel signifies a major advancement in targeted and efficient advertising.
2. **Impact of AI on Advertising and Alignment with Alphabet's Goals:**
  - AI is being viewed as a transformative force in the advertising industry. Sundar Pichai's emphasis on AI illustrates Alphabet and Google's commitment to leveraging this technology for innovation in advertising. The focus on how AI can reshape advertising strategies and its potential future impact underscores its importance as a central element in Alphabet's long-term vision and strategic alignment, particularly in enhancing advertising effectiveness and creating new opportunities.
3. **Ongoing Innovation and Focus on Key Priority Areas:**
  - Google is actively experimenting with new ad formats and continuously innovating in areas like conversational experiences in Google Ads and updates to Performance Max. The identification of key priority areas – Google AI, Retail, and YouTube – indicates a clear strategy for driving long-term growth in advertising. These areas represent the forefront of Google's focus, aiming to blend technological advancement with user engagement and advertiser value.

In summary, Google's approach is characterized by a deep integration of AI in advertising, continuous innovation in ad formats and campaign strategies, and a strategic focus on leveraging its platforms and AI capabilities to drive future growth in the advertising sector.

## Recommendations

Based on the conclusions drawn from Google's earnings transcript, here are the top three recommendations for someone considering investment:

1. **Invest in AI-Integrated Advertising Platforms:**
  - Given the significant role of AI in transforming the advertising landscape, investing in companies that are integrating AI into their advertising platforms can be a strategic move. Look for companies, like Google, that are at the forefront of AI development and are applying these advancements to optimize advertising campaigns. These companies are likely to stay competitive and offer innovative solutions in the rapidly evolving digital advertising market.
2. **Focus on Companies with Diverse Advertising Channels:**
  - Diversifying investments across companies that offer a range of advertising channels, including video, image, and conversational ad experiences, can be beneficial. This diversification aligns with the trend of integrated advertising campaigns. Companies that have a broad reach across platforms, like Google with its access to YouTube and the wider Google network, are positioned to capture a larger audience and deliver more effective advertising solutions.
3. **Prioritize Firms with a Strong Emphasis on Long-Term Growth Areas:**
  - Target companies that have identified and are investing in key areas for long-term growth, such as AI, retail, and new media platforms like YouTube. These areas are likely to drive future advancements and revenue growth in the advertising sector. Firms that are not only innovating but also strategically aligning their goals with future industry trends, like Alphabet and Google, may offer more sustainable investment opportunities.

In essence, an investment strategy that focuses on companies leveraging AI in advertising, offering diverse ad channels, and strategically investing in future growth areas could capitalize on the emerging trends and potential opportunities in the digital advertising space. As always, it's important to conduct thorough research and possibly consult with a financial advisor before making any investment decisions.