## TP-3: Processing and Document Representation

### Requirements
- nltk: 3.9.2

```bash
pip install nltk==3.9.2
``` 

In [1]:
%pip install nltk==3.9.2

Note: you may need to restart the kernel to use updated packages.


### Load Text

In [1]:
# Note: the file uses a different encoding than UTF-8, hence the explicit cp1252 decode below.
elon_musk_csv_link = "https://github.com/PLSeng/WR/raw/refs/heads/main/TP2/elon_musk.csv"

import urllib.request

with urllib.request.urlopen(elon_musk_csv_link) as response:
    text = response.read().decode("cp1252")

print(text)

Text
https://en.wikipedia.org/wiki/Elon_Musk
"Elon Reeve Musk FRS (/?i?l?n/ EE-lon; born June 28, 1971) is a business magnate, investor and philanthropist. He is the founder, CEO and Chief Engineer at SpaceX; angel investor, CEO and Product Architect of Tesla, Inc.; founder of The Boring Company; and co-founder of Neuralink and OpenAI. With an estimated net worth of around US$265 billion as of May 2022,[4] Musk is the wealthiest person in the world according to both the Bloomberg Billionaires Index and the Forbes real-time billionaires list.[5][6]"
"Musk was born to a Canadian mother and White South African father, and raised in Pretoria, South Africa. He briefly attended the University of Pretoria before moving to Canada at age 17. He matriculated at Queen's University and transferred to the University of Pennsylvania two years later, where he received a bachelor's degree in Economics and Physics. He moved to California in 1995 to attend Stanford University but decided instead to purs
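The explicit `cp1252` decode works for this file, but hard-coding one encoding can fail on other inputs. A minimal sketch, assuming a hypothetical helper `decode_with_fallback` (not part of the original notebook), that tries UTF-8 first and falls back to cp1252:

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Return `raw` decoded with the first encoding in `encodings` that succeeds."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and mark bad bytes with U+FFFD.
    return raw.decode(encodings[0], errors="replace")


# b"\x92" is a right single quotation mark in cp1252 but is not valid UTF-8 here.
sample = b"Queen\x92s University"
decoded = decode_with_fallback(sample)
print(decoded)
```

In the cell above, `text = decode_with_fallback(response.read())` could stand in for the hard-coded `.decode("cp1252")` call.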

### 1. Remove URLs from Text

In [3]:
import re

urls_pattern = r'https?://\S+'
cleaned_urls_txt = re.sub(urls_pattern, '', text)

print(cleaned_urls_txt)

Text

"Elon Reeve Musk FRS (/?i?l?n/ EE-lon; born June 28, 1971) is a business magnate, investor and philanthropist. He is the founder, CEO and Chief Engineer at SpaceX; angel investor, CEO and Product Architect of Tesla, Inc.; founder of The Boring Company; and co-founder of Neuralink and OpenAI. With an estimated net worth of around US$265 billion as of May 2022,[4] Musk is the wealthiest person in the world according to both the Bloomberg Billionaires Index and the Forbes real-time billionaires list.[5][6]"
"Musk was born to a Canadian mother and White South African father, and raised in Pretoria, South Africa. He briefly attended the University of Pretoria before moving to Canada at age 17. He matriculated at Queen's University and transferred to the University of Pennsylvania two years later, where he received a bachelor's degree in Economics and Physics. He moved to California in 1995 to attend Stanford University but decided instead to pursue a business career, co-founding the w

### 2. Remove Punctuation from Text

In [4]:
import string

translator = str.maketrans("", "", string.punctuation)
cleaned_text = cleaned_urls_txt.translate(translator)

print(cleaned_text)

Text

Elon Reeve Musk FRS iln EElon born June 28 1971 is a business magnate investor and philanthropist He is the founder CEO and Chief Engineer at SpaceX angel investor CEO and Product Architect of Tesla Inc founder of The Boring Company and cofounder of Neuralink and OpenAI With an estimated net worth of around US265 billion as of May 20224 Musk is the wealthiest person in the world according to both the Bloomberg Billionaires Index and the Forbes realtime billionaires list56
Musk was born to a Canadian mother and White South African father and raised in Pretoria South Africa He briefly attended the University of Pretoria before moving to Canada at age 17 He matriculated at Queens University and transferred to the University of Pennsylvania two years later where he received a bachelors degree in Economics and Physics He moved to California in 1995 to attend Stanford University but decided instead to pursue a business career cofounding the web software company Zip2 with his brother Ki
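Deleting punctuation outright fuses neighbouring characters: in the output above, `2022,[4]` has become `20224`, `list.[5][6]` has become `list56`, and `co-founder` has become `cofounder`. A minimal sketch of one possible alternative, assuming citation markers always look like `[digits]` and that replacing punctuation with spaces is acceptable for the task (`strip_citations_and_punctuation` is a hypothetical helper, not part of the original notebook):

```python
import re
import string

# Assumed shape of citation markers such as [4] or [56].
CITATION_PATTERN = re.compile(r"\[\d+\]")
# Map every ASCII punctuation character to a space instead of deleting it.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_citations_and_punctuation(text: str) -> str:
    without_citations = CITATION_PATTERN.sub("", text)
    spaced = without_citations.translate(PUNCT_TO_SPACE)
    # Collapse the runs of spaces that the replacement leaves behind (newlines are kept).
    return re.sub(r"[ \t]+", " ", spaced).strip()


sample = "as of May 2022,[4] Musk is a co-founder of OpenAI.[5][6]"
cleaned_sample = strip_citations_and_punctuation(sample)
print(cleaned_sample)
```

The trade-off is that hyphenated words such as `co-founder` now split into two tokens rather than merging into one; which behaviour is preferable depends on the downstream representation.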

### 3. Lowercasing

In [5]:
lowercased_text = cleaned_text.lower()

print(lowercased_text)

text

elon reeve musk frs iln eelon born june 28 1971 is a business magnate investor and philanthropist he is the founder ceo and chief engineer at spacex angel investor ceo and product architect of tesla inc founder of the boring company and cofounder of neuralink and openai with an estimated net worth of around us265 billion as of may 20224 musk is the wealthiest person in the world according to both the bloomberg billionaires index and the forbes realtime billionaires list56
musk was born to a canadian mother and white south african father and raised in pretoria south africa he briefly attended the university of pretoria before moving to canada at age 17 he matriculated at queens university and transferred to the university of pennsylvania two years later where he received a bachelors degree in economics and physics he moved to california in 1995 to attend stanford university but decided instead to pursue a business career cofounding the web software company zip2 with his brother ki

### 4. Tokenization

In [6]:
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs NLTK's Punkt data; if it is missing, run nltk.download('punkt_tab') first.
tokens = word_tokenize(lowercased_text)

print(tokens)

['text', 'elon', 'reeve', 'musk', 'frs', 'iln', 'eelon', 'born', 'june', '28', '1971', 'is', 'a', 'business', 'magnate', 'investor', 'and', 'philanthropist', 'he', 'is', 'the', 'founder', 'ceo', 'and', 'chief', 'engineer', 'at', 'spacex', 'angel', 'investor', 'ceo', 'and', 'product', 'architect', 'of', 'tesla', 'inc', 'founder', 'of', 'the', 'boring', 'company', 'and', 'cofounder', 'of', 'neuralink', 'and', 'openai', 'with', 'an', 'estimated', 'net', 'worth', 'of', 'around', 'us265', 'billion', 'as', 'of', 'may', '20224', 'musk', 'is', 'the', 'wealthiest', 'person', 'in', 'the', 'world', 'according', 'to', 'both', 'the', 'bloomberg', 'billionaires', 'index', 'and', 'the', 'forbes', 'realtime', 'billionaires', 'list56', 'musk', 'was', 'born', 'to', 'a', 'canadian', 'mother', 'and', 'white', 'south', 'african', 'father', 'and', 'raised', 'in', 'pretoria', 'south', 'africa', 'he', 'briefly', 'attended', 'the', 'university', 'of', 'pretoria', 'before', 'moving', 'to', 'canada', 'at', 'age'
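Because the text has already been lowercased and stripped of punctuation, a plain regular expression gives comparable tokens on this kind of input without any tokenizer data. The sketch below is a dependency-free alternative, not a drop-in replacement: on raw text, `word_tokenize` treats contractions and punctuation differently. `simple_tokenize` is a hypothetical name, and the sample string is a short excerpt of the lowercased text.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\w+")  # runs of letters, digits, and underscores


def simple_tokenize(text: str) -> list[str]:
    """Split already-cleaned text into word tokens."""
    return TOKEN_PATTERN.findall(text)


sample = "he is the founder ceo and chief engineer at spacex angel investor ceo and product architect"
sample_tokens = simple_tokenize(sample)
term_counts = Counter(sample_tokens)  # a simple bag-of-words view of the sample

print(sample_tokens[:4])
print(term_counts["ceo"], term_counts["and"], term_counts["spacex"])
```

Counting tokens this way is one simple route from a token list to a document representation (term frequencies).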

### 5. Stemming

In [7]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stemmed_tokens = [ps.stem(token) for token in tokens]

stemmed_tokens.pop(0)  # drop the leading 'text' token, which comes from the CSV header row
print(stemmed_tokens)

['elon', 'reev', 'musk', 'fr', 'iln', 'eelon', 'born', 'june', '28', '1971', 'is', 'a', 'busi', 'magnat', 'investor', 'and', 'philanthropist', 'he', 'is', 'the', 'founder', 'ceo', 'and', 'chief', 'engin', 'at', 'spacex', 'angel', 'investor', 'ceo', 'and', 'product', 'architect', 'of', 'tesla', 'inc', 'founder', 'of', 'the', 'bore', 'compani', 'and', 'cofound', 'of', 'neuralink', 'and', 'openai', 'with', 'an', 'estim', 'net', 'worth', 'of', 'around', 'us265', 'billion', 'as', 'of', 'may', '20224', 'musk', 'is', 'the', 'wealthiest', 'person', 'in', 'the', 'world', 'accord', 'to', 'both', 'the', 'bloomberg', 'billionair', 'index', 'and', 'the', 'forb', 'realtim', 'billionair', 'list56', 'musk', 'wa', 'born', 'to', 'a', 'canadian', 'mother', 'and', 'white', 'south', 'african', 'father', 'and', 'rais', 'in', 'pretoria', 'south', 'africa', 'he', 'briefli', 'attend', 'the', 'univers', 'of', 'pretoria', 'befor', 'move', 'to', 'canada', 'at', 'age', '17', 'he', 'matricul', 'at', 'queen', 'unive
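Calling `pop(0)` removes the `text` token only from the stemmed list; `tokens` itself still starts with it. Since the input is a CSV file with a `Text` header row, one alternative is to drop the header while parsing. A minimal sketch using the standard `csv` module, assuming a single-column file whose first row is the header (the `read_rows` helper and the inline sample are illustrative, not taken from the original notebook):

```python
import csv
import io


def read_rows(csv_text: str) -> list[str]:
    """Return the first column of every non-empty row, skipping the header row."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    next(reader, None)  # skip the header row ("Text")
    return [row[0] for row in reader if row]


sample_csv = (
    "Text\n"
    "https://en.wikipedia.org/wiki/Elon_Musk\n"
    '"Musk was born to a Canadian mother, and raised in Pretoria."\n'
)
rows = read_rows(sample_csv)
print(rows)
```

Parsing this way also removes the surrounding double quotes from each paragraph, so the later cleaning steps would operate on the field contents only.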

### 6. Lemmatization

In [8]:
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /teamspace/studios/this_studio/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [9]:
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

print(lemmatized_tokens)

['text', 'elon', 'reeve', 'musk', 'fr', 'iln', 'eelon', 'born', 'june', '28', '1971', 'is', 'a', 'business', 'magnate', 'investor', 'and', 'philanthropist', 'he', 'is', 'the', 'founder', 'ceo', 'and', 'chief', 'engineer', 'at', 'spacex', 'angel', 'investor', 'ceo', 'and', 'product', 'architect', 'of', 'tesla', 'inc', 'founder', 'of', 'the', 'boring', 'company', 'and', 'cofounder', 'of', 'neuralink', 'and', 'openai', 'with', 'an', 'estimated', 'net', 'worth', 'of', 'around', 'us265', 'billion', 'a', 'of', 'may', '20224', 'musk', 'is', 'the', 'wealthiest', 'person', 'in', 'the', 'world', 'according', 'to', 'both', 'the', 'bloomberg', 'billionaire', 'index', 'and', 'the', 'forbes', 'realtime', 'billionaire', 'list56', 'musk', 'wa', 'born', 'to', 'a', 'canadian', 'mother', 'and', 'white', 'south', 'african', 'father', 'and', 'raised', 'in', 'pretoria', 'south', 'africa', 'he', 'briefly', 'attended', 'the', 'university', 'of', 'pretoria', 'before', 'moving', 'to', 'canada', 'at', 'age', '17
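Stemming and lemmatization normalize the same tokens quite differently. The sketch below lines up a few tokens with the stems and lemmas printed in the outputs above (the values are copied from those outputs) and reports which ones each method changed. `changed_pairs` is a hypothetical helper name.

```python
# Values copied from the tokenization, stemming, and lemmatization outputs above.
tokens = ["business", "company", "billionaires", "was", "as", "frs"]
stems = ["busi", "compani", "billionair", "wa", "as", "fr"]
lemmas = ["business", "company", "billionaire", "wa", "a", "fr"]


def changed_pairs(original: list[str], normalized: list[str]) -> list[tuple[str, str]]:
    """Return (original, normalized) pairs where normalization altered the token."""
    return [(o, n) for o, n in zip(original, normalized) if o != n]


stem_changes = changed_pairs(tokens, stems)
lemma_changes = changed_pairs(tokens, lemmas)

print("stemming changed:     ", stem_changes)
print("lemmatization changed:", lemma_changes)
```

The odd lemmas `wa` (from `was`), `a` (from `as`), and `fr` (from `frs`) are most likely a consequence of `WordNetLemmatizer.lemmatize` treating every token as a noun when no `pos` argument is passed, so a trailing `s` is stripped as if it were a plural. Passing a part-of-speech tag, for example `pos="v"` for verbs, generally gives better lemmas.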