# Text preprocessing routines

In [13]:
text = """
Apple (AAPL) ended the recent trading session at $231.30, demonstrating a +1.65% swing from the preceding day's closing price. The stock outpaced the S&P 500's daily gain of 0.77%. On the other hand, the Dow registered a gain of 0.47%, and the technology-centric Nasdaq increased by 0.87%.
Coming into today, shares of the maker of iPhones, iPads and other products had gained 2.27% in the past month. In that same time, the Computer and Technology sector gained 6.36%, while the S&P 500 gained 4.87%.
Investors will be eagerly watching for the performance of Apple in its upcoming earnings disclosure. The company's earnings report is set to be unveiled on October 31, 2024. The company is forecasted to report an EPS of $1.54, showcasing a 5.48% upward movement from the corresponding quarter of the prior year. At the same time, our most recent consensus estimate is projecting a revenue of $94.48 billion, reflecting a 5.57% rise from the equivalent quarter last year.
Additionally, investors should keep an eye on any recent revisions to analyst forecasts for Apple. These recent revisions tend to reflect the evolving nature of short-term business trends. Therefore, positive revisions in estimates convey analysts' confidence in the company's business performance and profit potential.
Our research suggests that these changes in estimates have a direct relationship with upcoming stock price performance. To capitalize on this, we've crafted the Zacks Rank, a unique model that incorporates these estimate changes and offers a practical rating system.
Ranging from #1 (Strong Buy) to #5 (Strong Sell), the Zacks Rank system has a proven, outside-audited track record of outperformance, with #1 stocks returning an average of +25% annually since 1988. Over the last 30 days, the Zacks Consensus EPS estimate has witnessed a 0.12% decrease. Apple currently has a Zacks Rank of #3 (Hold).
Investors should also note Apple's current valuation metrics, including its Forward P/E ratio of 30.17. This valuation marks a premium compared to its industry's average Forward P/E of 15.36.
We can also see that AAPL currently has a PEG ratio of 2.38. The PEG ratio is similar to the widely-used P/E ratio, but this metric also takes the company's expected earnings growth rate into account. The Computer - Micro Computers was holding an average PEG ratio of 1.79 at yesterday's closing price.
"""


## Split longer documents into sentences

This step is optional, but for better readability we start by splitting the document into sentences.

In [7]:
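import nltk.data

# Load the pre-trained Punkt sentence tokenizer for English
# (requires the NLTK "punkt" resource, e.g. via nltk.download("punkt"))
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = sent_detector.tokenize(text)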

for sentence in sentences:
    print(sentence)


Apple (AAPL) ended the recent trading session at $231.30, demonstrating a +1.65% swing from the preceding day's closing price.
The stock outpaced the S&P 500's daily gain of 0.77%.
On the other hand, the Dow registered a gain of 0.47%, and the technology-centric Nasdaq increased by 0.87%.
Coming into today, shares of the maker of iPhones, iPads and other products had gained 2.27% in the past month.
In that same time, the Computer and Technology sector gained 6.36%, while the S&P 500 gained 4.87%.
Investors will be eagerly watching for the performance of Apple in its upcoming earnings disclosure.
The company's earnings report is set to be unveiled on October 31, 2024.
The company is forecasted to report an EPS of $1.54, showcasing a 5.48% upward movement from the corresponding quarter of the prior year.
At the same time, our most recent consensus estimate is projecting a revenue of $94.48 billion, reflecting a 5.57% rise from the equivalent quarter last year.
Additionally, investors sh
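
For comparison, a naive rule-based splitter can be sketched with the standard library alone. The helper `naive_sentence_split` is a hypothetical name introduced here for illustration and is not part of NLTK. It assumes that a sentence ends at `.`, `!` or `?` followed by whitespace and an uppercase letter, which copes with decimals such as `$231.30` but breaks on abbreviations.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' when followed by whitespace and an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [part for part in parts if part]

# Two sentences taken from the example text above
sample = (
    "Apple (AAPL) ended the recent trading session at $231.30, demonstrating "
    "a +1.65% swing from the preceding day's closing price. "
    "The stock outpaced the S&P 500's daily gain of 0.77%."
)

for sentence in naive_sentence_split(sample):
    print(sentence)

# An abbreviation followed by a capitalized word is split incorrectly.
print(naive_sentence_split("Mr. Smith bought shares."))
```

Abbreviations are one of the cases a trained model such as Punkt is meant to handle, which is a reason to prefer it over a fixed rule like this one.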

## Simple preprocessing routine

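* Remove punctuation and numbers
* Convert to lowercase
* Split the text into word tokens and drop very short or very long ones (by default, `simple_preprocess` keeps tokens of 2 to 15 characters)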

In [8]:
from gensim.utils import simple_preprocess

for sentence in sentences:
    # Lowercase, keep alphabetic tokens only, and drop very short or very long tokens
    preprocessed_sentence = simple_preprocess(sentence)
    print(preprocessed_sentence)

['apple', 'aapl', 'ended', 'the', 'recent', 'trading', 'session', 'at', 'demonstrating', 'swing', 'from', 'the', 'preceding', 'day', 'closing', 'price']
['the', 'stock', 'outpaced', 'the', 'daily', 'gain', 'of']
['on', 'the', 'other', 'hand', 'the', 'dow', 'registered', 'gain', 'of', 'and', 'the', 'technology', 'centric', 'nasdaq', 'increased', 'by']
['coming', 'into', 'today', 'shares', 'of', 'the', 'maker', 'of', 'iphones', 'ipads', 'and', 'other', 'products', 'had', 'gained', 'in', 'the', 'past', 'month']
['in', 'that', 'same', 'time', 'the', 'computer', 'and', 'technology', 'sector', 'gained', 'while', 'the', 'gained']
['investors', 'will', 'be', 'eagerly', 'watching', 'for', 'the', 'performance', 'of', 'apple', 'in', 'its', 'upcoming', 'earnings', 'disclosure']
['the', 'company', 'earnings', 'report', 'is', 'set', 'to', 'be', 'unveiled', 'on', 'october']
['the', 'company', 'is', 'forecasted', 'to', 'report', 'an', 'eps', 'of', 'showcasing', 'upward', 'movement', 'from', 'the', 'co
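
To make the effect of this routine more concrete, the following is a rough standard-library approximation of what `simple_preprocess` does with its default settings. The function `rough_simple_preprocess` is a hypothetical helper written for this illustration, not gensim's actual implementation, and it ignores details such as accent handling.

```python
import re

def rough_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase the text, extract alphabetic runs, and filter tokens by length."""
    # [^\W\d_]+ matches runs of letters only (no digits, punctuation, or underscores)
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [token for token in tokens if min_len <= len(token) <= max_len]

print(rough_simple_preprocess("The stock outpaced the S&P 500's daily gain of 0.77%."))
# ['the', 'stock', 'outpaced', 'the', 'daily', 'gain', 'of']
```

The single letters left over from `S&P` and `500's` fall below the minimum length, which is why they do not appear in the output.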

## Remove stopwords

* Stopwords are words that occur frequently in almost every text, such as *the*, *of*, and *and*
* They usually carry little content-specific information, so they are often removed before further analysis

In [14]:
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

for sentence in sentences:
    preprocessed_sentence = simple_preprocess(sentence)
    # Keep only the tokens that are not in gensim's built-in stopword list
    preprocessed_sentence = [word for word in preprocessed_sentence if word not in STOPWORDS]
    print(preprocessed_sentence)

['apple', 'aapl', 'ended', 'recent', 'trading', 'session', 'demonstrating', 'swing', 'preceding', 'day', 'closing', 'price']
['stock', 'outpaced', 'daily', 'gain']
['hand', 'dow', 'registered', 'gain', 'technology', 'centric', 'nasdaq', 'increased']
['coming', 'today', 'shares', 'maker', 'iphones', 'ipads', 'products', 'gained', 'past', 'month']
['time', 'technology', 'sector', 'gained', 'gained']
['investors', 'eagerly', 'watching', 'performance', 'apple', 'upcoming', 'earnings', 'disclosure']
['company', 'earnings', 'report', 'set', 'unveiled', 'october']
['company', 'forecasted', 'report', 'eps', 'showcasing', 'upward', 'movement', 'corresponding', 'quarter', 'prior', 'year']
['time', 'recent', 'consensus', 'estimate', 'projecting', 'revenue', 'billion', 'reflecting', 'rise', 'equivalent', 'quarter', 'year']
['additionally', 'investors', 'eye', 'recent', 'revisions', 'analyst', 'forecasts', 'apple']
['recent', 'revisions', 'tend', 'reflect', 'evolving', 'nature', 'short', 'term', 'b
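
A general-purpose stopword list can also remove words that matter in a given domain. In the output above, for example, `computer` has disappeared from the fifth sentence. The sketch below shows one way to adapt a stopword set. `DEMO_STOPWORDS` and `remove_stopwords` are hypothetical names, and the list is hand-picked for illustration rather than taken from gensim.

```python
# Hypothetical, hand-picked stopword list for illustration only (not gensim's list)
DEMO_STOPWORDS = frozenset({"the", "in", "that", "same", "and", "while", "computer"})

def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not contained in the stopword set."""
    return [token for token in tokens if token not in stopwords]

tokens = ["in", "that", "same", "time", "the", "computer", "and",
          "technology", "sector", "gained", "while", "the", "gained"]

print(remove_stopwords(tokens, DEMO_STOPWORDS))
# ['time', 'technology', 'sector', 'gained', 'gained']

# Keep a domain-relevant word by removing it from the stopword set
custom_stopwords = DEMO_STOPWORDS - {"computer"}
print(remove_stopwords(tokens, custom_stopwords))
# ['time', 'computer', 'technology', 'sector', 'gained', 'gained']
```

Since gensim's `STOPWORDS` is a frozenset as well, the same set difference should carry over to it.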

## Stemming

* Frequency-based algorithms in particular may improve when the number of distinct terms is small
* Stemming and lemmatization both reduce this number by mapping inflected word forms to a common base form

In [10]:
from nltk.stem.snowball import SnowballStemmer
from gensim.utils import simple_preprocess

stemmer = SnowballStemmer(language="english")

for sentence in sentences:
    preprocessed_sentence = simple_preprocess(sentence)
    # Reduce every token to its stem, e.g. 'trading' -> 'trade'
    preprocessed_sentence = [stemmer.stem(word) for word in preprocessed_sentence]
    print(preprocessed_sentence)

['appl', 'aapl', 'end', 'the', 'recent', 'trade', 'session', 'at', 'demonstr', 'swing', 'from', 'the', 'preced', 'day', 'close', 'price']
['the', 'stock', 'outpac', 'the', 'daili', 'gain', 'of']
['on', 'the', 'other', 'hand', 'the', 'dow', 'regist', 'gain', 'of', 'and', 'the', 'technolog', 'centric', 'nasdaq', 'increas', 'by']
['come', 'into', 'today', 'share', 'of', 'the', 'maker', 'of', 'iphon', 'ipad', 'and', 'other', 'product', 'had', 'gain', 'in', 'the', 'past', 'month']
['in', 'that', 'same', 'time', 'the', 'comput', 'and', 'technolog', 'sector', 'gain', 'while', 'the', 'gain']
['investor', 'will', 'be', 'eager', 'watch', 'for', 'the', 'perform', 'of', 'appl', 'in', 'it', 'upcom', 'earn', 'disclosur']
['the', 'compani', 'earn', 'report', 'is', 'set', 'to', 'be', 'unveil', 'on', 'octob']
['the', 'compani', 'is', 'forecast', 'to', 'report', 'an', 'ep', 'of', 'showcas', 'upward', 'movement', 'from', 'the', 'correspond', 'quarter', 'of', 'the', 'prior', 'year']
['at', 'the', 'same', 
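
The reduction in the number of distinct terms can be illustrated with a deliberately simple suffix stripper. `toy_stem` is a hypothetical helper with three hand-written rules. It is not the Snowball algorithm, and the word list is made up for the example.

```python
def toy_stem(word):
    """Strip a few common English suffixes (toy rule set, not the Snowball algorithm)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = ["gain", "gained", "gains", "gaining", "report", "reports", "reported"]
stems = [toy_stem(token) for token in tokens]

print(len(set(tokens)), sorted(set(tokens)))
print(len(set(stems)), sorted(set(stems)))
```

Seven distinct surface forms collapse into two stems here, so a term-frequency count would treat all variants of *gain* as the same term.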

## Using trained algorithms to preprocess text

* Modern language models often include preprocessing routines that are combined with trained algorithms
* Various algorithms exist that learn how to identify tokens, i.e., *an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing*

In [11]:
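from tokenizers import Tokenizer

# Load the pre-trained BERT tokenizer (downloaded from the Hugging Face Hub on first use)
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
for sentence in sentences:
    preprocessed_sentence = tokenizer.encode(sentence).tokens
    # Drop the special [CLS] and [SEP] tokens added at the start and end
    print(preprocessed_sentence[1:-1])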

['apple', '(', 'aa', '##pl', ')', 'ended', 'the', 'recent', 'trading', 'session', 'at', '$', '231', '.', '30', ',', 'demonstrating', 'a', '+', '1', '.', '65', '%', 'swing', 'from', 'the', 'preceding', 'day', "'", 's', 'closing', 'price', '.']
['the', 'stock', 'out', '##pace', '##d', 'the', 's', '&', 'p', '500', "'", 's', 'daily', 'gain', 'of', '0', '.', '77', '%', '.']
['on', 'the', 'other', 'hand', ',', 'the', 'dow', 'registered', 'a', 'gain', 'of', '0', '.', '47', '%', ',', 'and', 'the', 'technology', '-', 'cent', '##ric', 'nas', '##da', '##q', 'increased', 'by', '0', '.', '87', '%', '.']
['coming', 'into', 'today', ',', 'shares', 'of', 'the', 'maker', 'of', 'iphone', '##s', ',', 'ipad', '##s', 'and', 'other', 'products', 'had', 'gained', '2', '.', '27', '%', 'in', 'the', 'past', 'month', '.']
['in', 'that', 'same', 'time', ',', 'the', 'computer', 'and', 'technology', 'sector', 'gained', '6', '.', '36', '%', ',', 'while', 'the', 's', '&', 'p', '500', 'gained', '4', '.', '87', '%', '.
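
The `##` pieces in the output come from WordPiece, the subword scheme used by BERT: a word that is not in the vocabulary is split into the longest known pieces, and `##` marks a piece that continues a word. The sketch below shows the greedy longest-match-first idea. `wordpiece` and `TOY_VOCAB` are hypothetical names, and the vocabulary is a tiny hand-made set, whereas the real one is learned from data and far larger.

```python
# Tiny hand-made vocabulary for illustration (the real vocabulary is learned from data)
TOY_VOCAB = {"apple", "out", "##pace", "##d", "aa", "##pl"}

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split a word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

for word in ["apple", "aapl", "outpaced", "nasdaq"]:
    print(word, wordpiece(word, TOY_VOCAB))
```

With this toy vocabulary, `outpaced` is split into `out`, `##pace`, `##d`, as in the tokenizer output above, while `nasdaq` maps to `[UNK]` because none of its pieces are listed.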