<a href="https://colab.research.google.com/github/RocioLiu/AI-for-Trading/blob/main/text_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Text Processing**


## **Capturing Text Data**

In [2]:
from google.colab import files
uploaded = files.upload()

Saving news.csv to news.csv


#### **Plain Text**

In [3]:
# Read in a plain text file (explicit encoding, since the text contains non-ASCII characters)
with open("hieroglyph.txt", "r", encoding="utf-8") as f:
    text = f.read()
    print(text)

Hieroglyphic writing dates from c. 3000 BC, and is composed of hundreds of symbols. A hieroglyph can represent a word, a sound, or a silent determinative; and the same symbol can serve different purposes in different contexts. Hieroglyphs were a formal script, used on stone monuments and in tombs, that could be as detailed as individual works of art. In day-to-day writing, scribes used a cursive form of writing, called hieratic, which was quicker and easier. While formal hieroglyphs may be read in rows or columns in either direction (though typically written from right to left), hieratic was always written from right to left, usually in horizontal rows. A new form of writing, Demotic, became the prevalent writing style, and it is this form of writing—along with formal hieroglyphs—that accompany the Greek text on the Rosetta Stone.[120]

Around the first century AD, the Coptic alphabet started to be used alongside the Demotic script. Coptic is a modified Greek alphabet with the addition

#### **Tabular Data**

In [4]:
import pandas as pd

# Extract text column from a dataframe
df = pd.read_csv("news.csv")
df.head()[['publisher', 'title']]

|   | publisher | title |
|---|---|---|
| 0 | Los Angeles Times | Fed official says weak data caused by weather,... |
| 1 | Livemint | Fed's Charles Plosser sees high bar for change... |
| 2 | IFA Magazine | US open: Stocks fall after Fed official hints ... |
| 3 | IFA Magazine | Fed risks falling 'behind the curve', Charles ... |
| 4 | Moneynews | Fed's Plosser: Nasty Weather Has Curbed Job Gr... |


In [5]:
# Convert text column to lowercase
df['title'] = df['title'].str.lower()
df.head()[['publisher', 'title']]

|   | publisher | title |
|---|---|---|
| 0 | Los Angeles Times | fed official says weak data caused by weather,... |
| 1 | Livemint | fed's charles plosser sees high bar for change... |
| 2 | IFA Magazine | us open: stocks fall after fed official hints ... |
| 3 | IFA Magazine | fed risks falling 'behind the curve', charles ... |
| 4 | Moneynews | fed's plosser: nasty weather has curbed job gr... |


#### **Online Resource**

In [6]:
import requests
import json

# Fetch data from a REST API
r = requests.get("https://quotes.rest/qod.json")
res = r.json()
print(res)

{'success': {'total': 1}, 'contents': {'quotes': [{'quote': "Winning isn't everything.. It's the only thing.", 'length': '47', 'author': 'Vincent van Gogh', 'tags': {'0': 'inspire', '2': 'winning'}, 'category': 'inspire', 'language': 'en', 'date': '2021-06-21', 'permalink': 'https://theysaidso.com/quote/vincent-van-gogh-winning-isnt-everythingits-the-only-thing', 'id': 'B7OFzrXc4MXRTfb4Ga0fxQeF', 'background': 'https://theysaidso.com/img/qod/qod-inspire.jpg', 'title': 'Inspiring Quote of the day'}]}, 'baseurl': 'https://theysaidso.com', 'copyright': {'year': 2023, 'url': 'https://theysaidso.com'}}


In [7]:
print(json.dumps(res, indent=4))

{
    "success": {
        "total": 1
    },
    "contents": {
        "quotes": [
            {
                "quote": "Winning isn't everything.. It's the only thing.",
                "length": "47",
                "author": "Vincent van Gogh",
                "tags": {
                    "0": "inspire",
                    "2": "winning"
                },
                "category": "inspire",
                "language": "en",
                "date": "2021-06-21",
                "permalink": "https://theysaidso.com/quote/vincent-van-gogh-winning-isnt-everythingits-the-only-thing",
                "id": "B7OFzrXc4MXRTfb4Ga0fxQeF",
                "background": "https://theysaidso.com/img/qod/qod-inspire.jpg",
                "title": "Inspiring Quote of the day"
            }
        ]
    },
    "baseurl": "https://theysaidso.com",
    "copyright": {
        "year": 2023,
        "url": "https://theysaidso.com"
    }
}


In [8]:
# Extract relevant object and field
q = res["contents"]["quotes"][0]
print(q["quote"], "\n--", q["author"])

Winning isn't everything.. It's the only thing. 
-- Vincent van Gogh
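
Indexing straight into `res["contents"]["quotes"][0]` raises a `KeyError` or `IndexError` if the API returns an error payload instead of a quote. Below is a minimal sketch of a more defensive lookup, assuming the response shape printed above. `extract_quote` is a hypothetical helper, and both sample dictionaries are hand-written stand-ins for `r.json()` (the error payload shape in particular is an assumption).

```python
def extract_quote(payload):
    """Return (quote, author) from a payload shaped like the one above, or None."""
    quotes = payload.get("contents", {}).get("quotes", [])
    if not quotes:
        return None
    first = quotes[0]
    return first.get("quote"), first.get("author")

# Hand-written stand-in with the same nesting as the printed response
sample = {"contents": {"quotes": [{"quote": "Example quote.", "author": "Example Author"}]}}

# Assumed shape of an error response (hypothetical, for illustration only)
error_payload = {"error": {"code": 429, "message": "Too Many Requests"}}

print(extract_quote(sample))
print(extract_quote(error_payload))
```

Returning `None` instead of raising lets the caller decide whether a missing quote is an error.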


## **Cleaning**

In [9]:
import requests

# Fetch a web page
r = requests.get("https://news.ycombinator.com")
print(r.text)

# It looks like we got back an entire HTML source 

<html lang="en" op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?vk3lMsDnqyxpcJmXGtyi">
        <link rel="shortcut icon" href="favicon.ico">
          <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">
        <title>Hacker News</title></head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">
        <tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img src="y18.gif" width="18" height="18" style="border:1px white solid;"></a></td>
                  <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
              <a href="newest">new</a> | <a href="front">past</a> | <a href=

In [10]:
import re

# Remove HTML tags using RegEx
pattern = re.compile(r'<.*?>') # tags look like <...>  (Define a pattern to match all HTML tags)
print(pattern.sub('', r.text)) # replace them with blank

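# A lot of JavaScript and other items we don't need are still in the output.
# The pattern also misses some tags. A likely cause: '.' does not match newlines
# by default, so a tag that spans several lines is left in place.
# HTML entities such as &nbsp; are not tags, so they remain as well.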


        
          
        Hacker News
        
                  Hacker News
              new | past | comments | ask | show | jobs | submit            
                              login
                          
              

              
      1.      Yggdrasil – Early-stage implementation of an end-to-end encrypted IPv6 network (github.com/yggdrasil-network)
        269 points by dragonsh 6 hours ago  | hide | 55&nbsp;comments              
      
                
      2.      JPEG XL (jpegxl.info)
        199 points by tosh 6 hours ago  | hide | 107&nbsp;comments              
      
                
      3.      Fluid Paint (david.li)
        231 points by pjerem 5 hours ago  | hide | 27&nbsp;comments              
      
                
      4.      LibreCellular (librecellular.org)
        167 points by pabs3 5 hours ago  | hide | 17&nbsp;comments              
      
                
      5.      OrganicMaps is Android and iOS offline maps for travel without tra
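
One plausible reason the pattern leaves tags behind is that `.` stops at newlines unless `re.DOTALL` is set. The sketch below shows this on a small hand-written snippet (not the real page), and also decodes the leftover `&nbsp;` entity with the standard library's `html.unescape`. Script and style contents would still survive this approach, which is why a real parser is the better tool.

```python
import html
import re

# Hand-written snippet: the opening tag spans two lines
snippet = '<a\n  href="item?id=1">55&nbsp;comments</a>'

# Without DOTALL, '.' cannot cross the newline, so the opening tag survives
naive = re.sub(r"<.*?>", "", snippet)

# With DOTALL, '.' also matches newlines and both tags are removed
dotall = re.sub(r"<.*?>", "", snippet, flags=re.DOTALL)

# Decode entities, then turn the non-breaking space into a regular space
cleaned = html.unescape(dotall).replace("\xa0", " ")

print(repr(naive))
print(repr(dotall))
print(repr(cleaned))
```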

In [11]:
from bs4 import BeautifulSoup

# Remove HTML tags using Beautiful Soup library
soup = BeautifulSoup(r.text, "html5lib")
print(soup.get_text())


        
          
        Hacker News
        
                  Hacker News
              new | past | comments | ask | show | jobs | submit            
                              login
                          
              

              
      1.      Yggdrasil – Early-stage implementation of an end-to-end encrypted IPv6 network (github.com/yggdrasil-network)
        269 points by dragonsh 6 hours ago  | hide | 55 comments              
      
                
      2.      JPEG XL (jpegxl.info)
        199 points by tosh 6 hours ago  | hide | 107 comments              
      
                
      3.      Fluid Paint (david.li)
        231 points by pjerem 5 hours ago  | hide | 27 comments              
      
                
      4.      LibreCellular (librecellular.org)
        167 points by pabs3 5 hours ago  | hide | 17 comments              
      
                
      5.      OrganicMaps is Android and iOS offline maps for travel without trackers or ads (organi

In [12]:
# Find all articles
summaries = soup.find_all("tr", class_="athing")
summaries[0]

<tr class="athing" id="27577201">
      <td align="right" class="title" valign="top"><span class="rank">1.</span></td>      <td class="votelinks" valign="top"><center><a href="vote?id=27577201&amp;how=up&amp;goto=news" id="up_27577201"><div class="votearrow" title="upvote"></div></a></center></td><td class="title"><a class="storylink" href="https://github.com/yggdrasil-network/yggdrasil-go">Yggdrasil – Early-stage implementation of an end-to-end encrypted IPv6 network</a><span class="sitebit comhead"> (<a href="from?site=github.com/yggdrasil-network"><span class="sitestr">github.com/yggdrasil-network</span></a>)</span></td></tr>

In [13]:
# Extract title
summaries[0].find("a", class_="storylink").get_text().strip()

'Yggdrasil – Early-stage implementation of an end-to-end encrypted IPv6 network'

In [14]:
# Find all articles, extract titles
articles = []
summaries = soup.find_all("tr", class_="athing")
for summary in summaries:
    title = summary.find("a", class_="storylink").get_text().strip()
    articles.append(title)

print(len(articles), "Article summaries found. Sample:")
print(articles[0])

30 Article summaries found. Sample:
Yggdrasil – Early-stage implementation of an end-to-end encrypted IPv6 network
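
For comparison, here is a rough sketch of the same title extraction using only the standard library's `html.parser`, as a stand-in for Beautiful Soup rather than a replacement for it. `TitleParser` and `sample_html` are hypothetical names; the markup is hand-written to mimic the `storylink` anchors shown above.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text of every <a class="storylink"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "storylink" in classes:
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        # Accumulate text while inside a matching anchor
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False

# Hand-written markup imitating two story rows
sample_html = """
<tr class="athing"><td class="title"><a class="storylink" href="https://example.com/a">First story</a></td></tr>
<tr class="athing"><td class="title"><a class="storylink" href="https://example.com/b">Second &amp; third</a></td></tr>
"""

parser = TitleParser()
parser.feed(sample_html)
titles = [t.strip() for t in parser.titles]
print(titles)
```

Beautiful Soup hides this bookkeeping behind `find_all` and `get_text`, which is the main reason to prefer it for scraping.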


## **Normalization**

#### **Case Normalization**

In [15]:
# Sample text
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"
print(text)

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?


In [16]:
# Convert to lowercase
text = text.lower() 
print(text)

the first time you see the second renaissance it may look boring. look at it at least twice and definitely watch part 2. it will change your view of the matrix. are the human people the ones who started the war ? is ai a bad thing ?
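
For English text `lower()` is usually enough. As a small aside, Python also offers `str.casefold()`, which is more aggressive and may matter for other languages. The sketch below uses a hand-picked German example to show the difference.

```python
# Three spellings of the same German word
variants = ["Straße", "STRASSE", "strasse"]

lowered = {v.lower() for v in variants}      # 'ß' is kept as-is
folded = {v.casefold() for v in variants}    # 'ß' is folded to 'ss'

print(lowered)
print(folded)
```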


#### **Punctuation Removal**

In [17]:
import re

# Remove punctuation characters
# Match every character that is not a lowercase a-z, an uppercase A-Z, or a digit 0-9, and replace it with a space.
text = re.sub(r"[^a-zA-Z0-9]", " ", text) 
print(text)

the first time you see the second renaissance it may look boring  look at it at least twice and definitely watch part 2  it will change your view of the matrix  are the human people the ones who started the war   is ai a bad thing  
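
Note that the character class `[^a-zA-Z0-9]` also removes accented letters, which can split words in non-English text. A possible alternative, sketched here on a hand-written sample, is `[^\w\s]`, which keeps Unicode letters and digits (and also underscores, since `\w` includes them).

```python
import re

sample = "Café, naïve!"

# ASCII-only class: accented letters are treated like punctuation
ascii_only = re.sub(r"[^a-zA-Z0-9]", " ", sample)

# Unicode-aware class: only non-word, non-space characters are replaced
unicode_aware = re.sub(r"[^\w\s]", " ", sample)

print(ascii_only.split())
print(unicode_aware.split())
```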


## **Tokenization**

In [18]:
# Split text into tokens (words)
words = text.split()
print(words)

['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']


#### **NLTK: Natural Language ToolKit**

In [19]:
import os
import nltk
# nltk.data.path.append(os.path.join(os.getcwd(), "nltk_data"))
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker')
# nltk.download('words')
# Download all the resources used in the NLTK book, "Natural Language Processing with Python"
nltk.download("book")

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/chat80.zip.
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package conll2000 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2000.zip.
[nltk_data]    | Downloading package conll2002 to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/conll2002.zip.
[nltk_data]    | Downloading package dependency_treebank to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/dependency_treebank.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    

True

In [20]:
# Another sample text
text = "Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers."
print(text)

Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers.


In [21]:
from nltk.tokenize import word_tokenize

# Split text into words using NLTK
words = word_tokenize(text)
print(words)

['Dr.', 'Smith', 'graduated', 'from', 'the', 'University', 'of', 'Washington', '.', 'He', 'later', 'started', 'an', 'analytics', 'firm', 'called', 'Lux', ',', 'which', 'catered', 'to', 'enterprise', 'customers', '.']
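
To see what `word_tokenize` adds, compare it with two simpler approaches on the first sentence. This is only an illustrative sketch using the standard library: whitespace splitting leaves punctuation attached to words, while a simple regular expression separates punctuation but also breaks the abbreviation `Dr.` apart, which NLTK kept intact above.

```python
import re

sentence = "Dr. Smith graduated from the University of Washington."

# Whitespace split: the final period stays glued to 'Washington.'
by_whitespace = sentence.split()

# Simple regex: runs of word characters, or single punctuation marks
by_regex = re.findall(r"\w+|[^\w\s]", sentence)

print(by_whitespace)
print(by_regex)
```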


In [22]:
from nltk.tokenize import sent_tokenize

# Split text into sentences
sentences = sent_tokenize(text)
print(sentences)

['Dr. Smith graduated from the University of Washington.', 'He later started an analytics firm called Lux, which catered to enterprise customers.']
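
The same point holds for sentences. A naive splitter that breaks after every `.`, `!` or `?` followed by whitespace, sketched below with the standard library, treats the abbreviation `Dr.` as a sentence boundary, whereas `sent_tokenize` returned two sentences above.

```python
import re

text = ("Dr. Smith graduated from the University of Washington. "
        "He later started an analytics firm called Lux, which catered to enterprise customers.")

# Split on whitespace that follows sentence-ending punctuation
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for s in naive_sentences:
    print(repr(s))
```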


## **Stop Word Removal**

In [23]:
from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [24]:
# Reset text
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"

# Normalize it
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

# Tokenize it
words = text.split()
print(words)

['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']


In [25]:
# Remove stop words (build the set once for fast membership tests)
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'ones', 'started', 'war', 'ai', 'bad', 'thing']
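
One way to see why stop word removal helps is to count tokens before and after filtering. The sketch below uses a hypothetical, hand-picked stop word set (much smaller than NLTK's list) and the first few tokens from the sample text.

```python
from collections import Counter

# Hypothetical, hand-picked subset of a stop word list (not NLTK's)
STOP_WORDS = {"the", "it", "at", "a", "is", "and"}

tokens = "the first time you see the second renaissance it may look boring look at it".split()
content = [t for t in tokens if t not in STOP_WORDS]

print(len(tokens), "tokens before,", len(content), "after")
print(Counter(content).most_common(1))
```

After filtering, the most frequent token is a content word rather than a function word such as `the`.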


## **Part-of-Speech Tagging**

In [26]:
from nltk import pos_tag

# Tag parts of speech (PoS)
sentence = word_tokenize("I always lie down to tell a lie")
pos_tag(sentence)

[('I', 'PRP'),
 ('always', 'RB'),
 ('lie', 'VBP'),
 ('down', 'RP'),
 ('to', 'TO'),
 ('tell', 'VB'),
 ('a', 'DT'),
 ('lie', 'NN')]
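
The tagged output is a plain list of `(word, tag)` tuples, so it can be processed with ordinary Python. As a small sketch, the code below copies the output above into a literal and finds words that received more than one tag, plus all tokens tagged as verbs (Penn Treebank verb tags start with `VB`).

```python
from collections import defaultdict

# Copied from the pos_tag output above
tagged = [("I", "PRP"), ("always", "RB"), ("lie", "VBP"), ("down", "RP"),
          ("to", "TO"), ("tell", "VB"), ("a", "DT"), ("lie", "NN")]

# Group the tags seen for each word
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word].add(tag)

ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
verbs = [w for w, t in tagged if t.startswith("VB")]

print(ambiguous)
print(verbs)
```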

#### **Sentence Parsing**

In [27]:
import nltk

# Define a custom grammar
my_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
parser = nltk.ChartParser(my_grammar)

# Parse a sentence
sentence = word_tokenize("I shot an elephant in my pajamas")
for tree in parser.parse(sentence):
    print(tree)

(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))


## **Named Entity Recognition**

In [28]:
from nltk import pos_tag, ne_chunk
from nltk import word_tokenize

# Recognize named entities in a tagged sentence
tagged = pos_tag(word_tokenize("Shohei Otani joined MLB Inc. in Los Angeles."))
entities = ne_chunk(tagged)
print(entities)

(S
  (PERSON Shohei/NNP)
  (PERSON Otani/NNP)
  joined/VBD
  (ORGANIZATION MLB/NNP Inc./NNP)
  in/IN
  (GPE Los/NNP Angeles/NNP)
  ./.)


## **Stemming & Lemmatization**

#### **Stemming**

In [29]:
from nltk.stem.porter import PorterStemmer

# Reduce words to their stems (reuse a single stemmer instance)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
print(stemmed)

['first', 'time', 'see', 'second', 'renaiss', 'may', 'look', 'bore', 'look', 'least', 'twice', 'definit', 'watch', 'part', '2', 'chang', 'view', 'matrix', 'human', 'peopl', 'one', 'start', 'war', 'ai', 'bad', 'thing']


#### **Lemmatization**

In [30]:
from nltk.stem.wordnet import WordNetLemmatizer

# Reduce words to their dictionary form (lemma), reusing a single lemmatizer instance
lemmatizer = WordNetLemmatizer()
lemmed = [lemmatizer.lemmatize(w) for w in words]
print(lemmed)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'one', 'started', 'war', 'ai', 'bad', 'thing']


The lemmatizer needs to know, or make an assumption about, the part of speech (PoS) of each word it transforms.  
By default, `WordNetLemmatizer` treats every word as a noun, but we can override that with the `pos` parameter.

In [33]:
# Lemmatize verbs by specifying pos
lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in lemmed]
print(lemmed)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'bore', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'one', 'start', 'war', 'ai', 'bad', 'thing']
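
Rather than lemmatizing everything as a noun and then again as a verb, a common approach is to pick the `pos` argument per word from its PoS tag. The sketch below shows only the mapping step, using the tagged sentence from the PoS section as input. `penn_to_wordnet` is a hypothetical helper name, and the mapping is a simplification: anything that is not an adjective, verb or adverb falls back to noun.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet pos letter (falls back to noun)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, also used for every other tag

# Copied from the pos_tag output in the PoS tagging section
tagged = [("I", "PRP"), ("always", "RB"), ("lie", "VBP"), ("down", "RP"),
          ("to", "TO"), ("tell", "VB"), ("a", "DT"), ("lie", "NN")]

wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_pos)
```

Each pair could then be passed on as `WordNetLemmatizer().lemmatize(word, pos=pos)`.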


#### **Stemming vs. Lemmatization**

* Stemming sometimes produces stems that are not complete English words.  
* Lemmatization is similar to stemming, with one difference: the final form is also a meaningful word.  

![](https://cdn-images-1.medium.com/max/800/1*SCIQUg-UcQCAe0k-KZBUsQ.png)
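
To make the comparison concrete, the sketch below lines up a few words with their Porter stems and WordNet lemmas. All values are copied from the outputs earlier in this notebook (the lemmas come from the run with `pos='v'`), so nothing is recomputed here.

```python
# (word, Porter stem, WordNet lemma) triples copied from the outputs above
triples = [
    ("renaissance", "renaiss", "renaissance"),
    ("boring", "bore", "bore"),
    ("definitely", "definit", "definitely"),
    ("change", "chang", "change"),
    ("people", "peopl", "people"),
    ("ones", "one", "one"),
    ("started", "start", "start"),
]

# Keep the cases where the stem and the lemma disagree
differing = [(word, stem, lemma) for word, stem, lemma in triples if stem != lemma]

for word, stem, lemma in differing:
    print(f"{word}: stem={stem!r}, lemma={lemma!r}")
```

In every differing case the stem is a truncated string while the lemma is still a dictionary word.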