# Query Data from Wikipedia as an example.
- I'll use the Wikipedia page for Elon Musk, since it has plenty of content.
- "Data" in this context means the text of the page.

## Note: Should these steps be automatically applied to your NLP problem? No!
- These are just tools at your disposal that will make your job so much easier.
- There are instances where stripping down information will actually be detrimental.

---
### The steps that I will be going through:
- Removing punctuation such as . , ! $ ( ) * % @
- Removing URLs
- Removing Stop words
- Lower casing
- Tokenization by word, by sentence, and by paragraph
- Stemming
- Lemmatization

### Note: I do not use any dedicated NLP package such as spaCy or NLTK in this notebook.

# Getting the Data
- Extracted from Wikipedia
- If you want, you can check out the page [here](https://en.wikipedia.org/wiki/Elon_Musk). I copied only the first section and pasted its content into a CSV file.

In [1]:
import pandas as pd

In [2]:
text = pd.read_csv('elon_musk.csv')
text.head()

Unnamed: 0,Text
0,https://en.wikipedia.org/wiki/Elon_Musk
1,Elon Reeve Musk FRS (/?i?l?n/ EE-lon; born Jun...
2,Musk was born to a Canadian mother and White S...
3,"In 2002, Musk founded SpaceX, an aerospace man..."
4,Musk has been criticized for making unscientif...


In [3]:
text.info() # pandas reports string columns with the generic "object" dtype

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Text    5 non-null      object
dtypes: object(1)
memory usage: 168.0+ bytes


In [4]:
# This is what the data looks like.
# Each record is printed followed by a blank line (\n).
for record in text.Text:
    print(record + "\n")

https://en.wikipedia.org/wiki/Elon_Musk

Elon Reeve Musk FRS (/?i?l?n/ EE-lon; born June 28, 1971) is a business magnate, investor and philanthropist. He is the founder, CEO and Chief Engineer at SpaceX; angel investor, CEO and Product Architect of Tesla, Inc.; founder of The Boring Company; and co-founder of Neuralink and OpenAI. With an estimated net worth of around US$265 billion as of May 2022,[4] Musk is the wealthiest person in the world according to both the Bloomberg Billionaires Index and the Forbes real-time billionaires list.[5][6]

Musk was born to a Canadian mother and White South African father, and raised in Pretoria, South Africa. He briefly attended the University of Pretoria before moving to Canada at age 17. He matriculated at Queen's University and transferred to the University of Pennsylvania two years later, where he received a bachelor's degree in Economics and Physics. He moved to California in 1995 to attend Stanford University but decided instead to pursue a b

# Removing Punctuation: function to remove all punctuation.
### Why remove punctuation?
- Removing punctuation reduces the number of distinct tokens to consider: "magnate," and "magnate" become the same word.
- For many word-level tasks, punctuation does not hold vital information and can be removed.
    - However, be careful if you are using transfer learning with large language models, since their preprocessing step can depend on punctuation.
- A very nice regex tester can be found [here](https://regex101.com/)

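Regular expression to find __all__ punctuation: __[^\w\s]+__
- [^...] matches any single character that is NOT listed inside the brackets, so [^\w\s] matches anything that is neither a word character nor whitespace
- '+' matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
- \w matches any word character (for ASCII text, equivalent to __[a-zA-Z0-9_]__). The underscore counts as a word character, so it is NOT removed.
- \s matches any whitespace character (equivalent to __[\r\n\t\f\v ]__)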
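To see the pattern at work outside of pandas, here is a minimal sketch on a plain Python string using the standard `re` module. The function name `strip_punctuation` and the sample sentence are made up for illustration.

```python
import re

# Same pattern as described above: one or more characters that are
# neither word characters (\w) nor whitespace (\s).
PUNCTUATION = re.compile(r"[^\w\s]+")


def strip_punctuation(sentence: str) -> str:
    """Delete every run of punctuation from a plain Python string."""
    return PUNCTUATION.sub("", sentence)


sample = "Tesla, Inc.; co-founder (US$265) Elon_Musk!"
# Commas, periods, hyphens, brackets and "$" disappear; the underscore stays.
print(strip_punctuation(sample))
```

Notice that the hyphen in "co-founder" is removed without leaving a space, so the two halves fuse into one word. Whether that is acceptable depends on your task.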

In [5]:
def remove_punctuation(text: pd.Series) -> pd.Series:
    # r'' marks a raw string; regex=True makes pandas treat the pattern as a regular expression
    return text.str.replace(r'[^\w\s]+', '', regex=True)

no_punctuation = text.apply(remove_punctuation) # apply the function to each column of the DataFrame
for c, n in enumerate(no_punctuation.Text):
    print("PUNCTUATION REMOVED:", n + "\n")
    print("ORIGINAL:", text.Text[c] + "\n")
    print("-----")

PUNCTUATION REMOVED: httpsenwikipediaorgwikiElon_Musk

ORIGINAL: https://en.wikipedia.org/wiki/Elon_Musk

-----
PUNCTUATION REMOVED: Elon Reeve Musk FRS iln EElon born June 28 1971 is a business magnate investor and philanthropist He is the founder CEO and Chief Engineer at SpaceX angel investor CEO and Product Architect of Tesla Inc founder of The Boring Company and cofounder of Neuralink and OpenAI With an estimated net worth of around US265 billion as of May 20224 Musk is the wealthiest person in the world according to both the Bloomberg Billionaires Index and the Forbes realtime billionaires list56

ORIGINAL: Elon Reeve Musk FRS (/?i?l?n/ EE-lon; born June 28, 1971) is a business magnate, investor and philanthropist. He is the founder, CEO and Chief Engineer at SpaceX; angel investor, CEO and Product Architect of Tesla, Inc.; founder of The Boring Company; and co-founder of Neuralink and OpenAI. With an estimated net worth of around US$265 billion as of May 2022,[4] Musk is the wea



# Removing URLs
- URLs rarely add information when we analyze text at the word level.
- However, there are cases where keeping URLs is useful, especially if you want to build a graph-oriented database (Neo4j, for example).
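The same idea on a plain Python string, as a minimal sketch using the standard `re` module. The function name `strip_urls` and the `www.example.com` address are made up for illustration. Deleting a URL in the middle of a sentence leaves a double space behind, so the sketch also collapses repeated whitespace.

```python
import re

# "http" or "www." followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"http\S+|www\.\S+")


def strip_urls(sentence: str) -> str:
    """Remove URLs, then collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", sentence)
    return re.sub(r"\s+", " ", without_urls).strip()


sample = "Read https://en.wikipedia.org/wiki/Elon_Musk or www.example.com today"
print(strip_urls(sample))
```

Keep in mind that `\S+` also swallows any punctuation glued to the end of a URL, such as a closing period.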

In [6]:
def remove_url(text: pd.Series) -> pd.Series:
    # the dot after "www" is escaped so that it only matches a literal period
    return text.str.replace(r'http\S+|www\.\S+', '', regex=True)

no_url = text.apply(remove_url)
for c, n in enumerate(no_url.Text):
    print("URL REMOVED:", n + "\n") # Notice how the URL in the first record disappeared!
    print("ORIGINAL:", text.Text[c] + "\n")
    print("-----")

URL REMOVED: 

ORIGINAL: https://en.wikipedia.org/wiki/Elon_Musk

-----
URL REMOVED: Elon Reeve Musk FRS (/?i?l?n/ EE-lon; born June 28, 1971) is a business magnate, investor and philanthropist. He is the founder, CEO and Chief Engineer at SpaceX; angel investor, CEO and Product Architect of Tesla, Inc.; founder of The Boring Company; and co-founder of Neuralink and OpenAI. With an estimated net worth of around US$265 billion as of May 2022,[4] Musk is the wealthiest person in the world according to both the Bloomberg Billionaires Index and the Forbes real-time billionaires list.[5][6]

ORIGINAL: Elon Reeve Musk FRS (/?i?l?n/ EE-lon; born June 28, 1971) is a business magnate, investor and philanthropist. He is the founder, CEO and Chief Engineer at SpaceX; angel investor, CEO and Product Architect of Tesla, Inc.; founder of The Boring Company; and co-founder of Neuralink and OpenAI. With an estimated net worth of around US$265 billion as of May 2022,[4] Musk is the wealthiest person in



# Remove Stop Words!
- Why? Again, stop words add little to no value to the meaning of the sentence!
- __Stop words__ are the most commonly used words in a language. They typically include articles, determiners and prepositions, which add little information on their own.
- The list of stop words varies from package to package, but as long as you know which kinds of words a package removes, feel free to use it!
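As a rough sketch of what stop word removal does to a single sentence, the snippet below filters a plain string word by word. The tiny `STOP_WORDS` set and the function name `remove_stop_words` are assumptions for illustration only; a real list is much longer.

```python
import string

# A tiny illustrative stop word set (an assumption, not a complete list).
STOP_WORDS = {"a", "an", "and", "is", "of", "the"}


def remove_stop_words(sentence: str, stop_words: set) -> str:
    """Drop every word whose normalized form is in stop_words."""
    kept = []
    for word in sentence.split():
        # Compare a normalized copy so that "The" and "the," both count as "the".
        normalized = word.strip(string.punctuation).lower()
        if normalized not in stop_words:
            kept.append(word)
    return " ".join(kept)


sample = "Musk is the founder of The Boring Company, and the CEO."
print(remove_stop_words(sample, STOP_WORDS))
```

Normalizing before the comparison matters: without it, capitalized words and words with attached punctuation would slip through the filter.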

In [7]:
# Copied and pasted from: https://www.ranks.nl/stopwords
stop_words = pd.read_csv("common_stopwords.txt")

for word in stop_words.STOP_WORDS:
    print(word)

a
about
above
after
again
against
all
am
an
and
any
are
aren't
as
at
be
because
been
before
being
below
between
both
but
by
can't
cannot
could
couldn't
did
didn't
do
does
doesn't
doing
don't
down
during
each
few
for
from
further
had
hadn't
has
hasn't
have
haven't
having
he
he'd
he'll
he's
her
here
here's
hers
herself
him
himself
his
how
how's
i
i'd
i'll
i'm
i've
if
in
into
is
isn't
it
it's
its
itself
let's
me
more
most
mustn't
my
myself
no
nor
not
of
off
on
once
only
or
other
ought
our
ours	ourselves
out
over
own
same
shan't
she
she'd
she'll
she's
should
shouldn't
so
some
such
than
that
that's
the
their
theirs
them
themselves
then
there
there's
these
they
they'd
they'll
they're
they've
this
those
through
to
too
under
until
up
very
was
wasn't
we
we'd
we'll
we're
we've
were
weren't
what
what's
when
when's
where
where's
which
while
who
who's
whom
why
why's
with
won't
would
wouldn't
you
you'd
you'll
you're
you've
your
yours
yourself
yourselves


In [8]:
# Let's remove these stop words from our dataframe.
# Compare each word (case-insensitively) against the set of stop words.
stop_word_set = set(stop_words.STOP_WORDS)
text_with_no_stop_words = text.Text.apply(
    lambda s: " ".join(w for w in s.split() if w.lower() not in stop_word_set))
for c, twnsw in enumerate(text_with_no_stop_words):
    print("NO STOP WORDS:", twnsw + "\n")
    print("ORIGINAL:", text.Text[c] + "\n")
    print("-----")

NO STOP WORDS: https://en.wikipedia.org/wiki/Elon_Musk

ORIGINAL: https://en.wikipedia.org/wiki/Elon_Musk

-----
NO STOP WORDS: Elon Reeve Musk FRS (/?i?l?n/ EE-lon; born June 28, 1971) business magnate, investor philanthropist. founder, CEO Chief Engineer SpaceX; angel investor, CEO Product Architect Tesla, Inc.; founder Boring Company; co-founder Neuralink OpenAI. estimated net worth around US$265 billion May 2022,[4] Musk wealthiest person world according Bloomberg Billionaires Index Forbes real-time billionaires list.[5][6]

ORIGINAL: Elon Reeve Musk FRS (/?i?l?n/ EE-lon; born June 28, 1971) is a business magnate, investor and philanthropist. He is the founder, CEO and Chief Engineer at SpaceX; angel investor, CEO and Product Architect of Tesla, Inc.; founder of The Boring Company; and co-founder of Neuralink and OpenAI. With an estimated net worth of around US$265 billion as of Ma

# Lowercase your text!
- Lowercasing maps variants such as "The" and "the" to the same token, which shrinks the vocabulary. Many language models are case-sensitive and would otherwise treat the two as different words.
    - This is different for very large language models
    - It is also language dependent (but for English, consider lowercasing your corpus)
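A quick way to see the effect is to count the distinct tokens before and after lowercasing. This is only a sketch: the helper name `vocabulary` and the sample sentence are made up for illustration.

```python
def vocabulary(text: str, lowercase: bool) -> set:
    """Return the set of distinct whitespace-separated tokens."""
    tokens = text.lower().split() if lowercase else text.split()
    return set(tokens)


sample = "The founder of The Boring Company is the CEO and the founder of SpaceX"
cased = vocabulary(sample, lowercase=False)
uncased = vocabulary(sample, lowercase=True)

# "The" and "the" are two entries in the cased vocabulary but one after lowercasing.
print(len(cased), len(uncased))
```

The price is that lowercasing also erases distinctions you may care about, such as "May" the month versus "may" the verb.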

In [10]:
for c, t in enumerate(text.Text):
    print("LOWERCASE:", t.lower() + "\n")
    print("ORIGINAL:", t + "\n")
    print("-----")

LOWERCASE: https://en.wikipedia.org/wiki/elon_musk

ORIGINAL: https://en.wikipedia.org/wiki/Elon_Musk

-----
LOWERCASE: elon reeve musk frs (/?i?l?n/ ee-lon; born june 28, 1971) is a business magnate, investor and philanthropist. he is the founder, ceo and chief engineer at spacex; angel investor, ceo and product architect of tesla, inc.; founder of the boring company; and co-founder of neuralink and openai. with an estimated net worth of around us$265 billion as of may 2022,[4] musk is the wealthiest person in the world according to both the bloomberg billionaires index and the forbes real-time billionaires list.[5][6]

ORIGINAL: Elon Reeve Musk FRS (/?i?l?n/ EE-lon; born June 28, 1971) is a business magnate, investor and philanthropist. He is the founder, CEO and Chief Engineer at SpaceX; angel investor, CEO and Product Architect of Tesla, Inc.; founder of The Boring Company; and co-founder of Neuralink and OpenAI. With an estimated net worth of around US$265 billion as of May 2022,[

# Tokenization
- Tokenization is the basic building block of Natural Language Processing.
- It is the process of breaking your corpus into smaller units called tokens. A token can be a word, one or more characters, or a part of a word (a subword or character n-gram).
    - Which granularity should you use? It really depends on the task and the model.
- This is incredibly important! You can't work with text data if you don't perform some type of tokenization on your text.
- There are many forms of tokenization and more detail can be found [here](https://towardsdatascience.com/tokenization-for-natural-language-processing-a179a891bad4#:~:text=Tokenization%20is%20breaking%20the%20raw,the%20sequence%20of%20the%20words.)
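To make the different granularities concrete, here is a small sketch that tokenizes the same text by word, by character, and by character n-gram. The three helper names are hypothetical and use only built-in Python.

```python
from typing import List


def word_tokens(text: str) -> List[str]:
    """Split on whitespace: one token per word."""
    return text.split()


def char_tokens(text: str) -> List[str]:
    """One token per character."""
    return list(text)


def char_ngrams(text: str, n: int) -> List[str]:
    """Sliding window of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(word_tokens("hit that like button"))
print(char_tokens("like"))
print(char_ngrams("musk", 3))
```

Coarser tokens give a larger vocabulary with shorter sequences, while finer tokens give a smaller vocabulary with longer sequences.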

# Difficulties in Tokenization
- The toughest part of tokenization is deciding at which boundaries to split your text.
- Popular NLP models use different forms of tokenization, so it helps to be aware of the differences.

## Example of Tokenization (whitespace tokenization):
Make sure to hit that like button and subscribe!

Make, sure, to, hit, that, like, button, and, subscribe!

In [16]:
# tokenization can be as simple as str.split()
def tokenize_by_word(text: pd.Series) -> pd.Series:
    return text.str.split(" ")

text_tokenize_words = text.apply(tokenize_by_word)
for ttw in text_tokenize_words.Text:
    print(ttw)
    print("\n")

['https://en.wikipedia.org/wiki/Elon_Musk']


['Elon', 'Reeve', 'Musk', 'FRS', '(/?i?l?n/', 'EE-lon;', 'born', 'June', '28,', '1971)', 'is', 'a', 'business', 'magnate,', 'investor', 'and', 'philanthropist.', 'He', 'is', 'the', 'founder,', 'CEO', 'and', 'Chief', 'Engineer', 'at', 'SpaceX;', 'angel', 'investor,', 'CEO', 'and', 'Product', 'Architect', 'of', 'Tesla,', 'Inc.;', 'founder', 'of', 'The', 'Boring', 'Company;', 'and', 'co-founder', 'of', 'Neuralink', 'and', 'OpenAI.', 'With', 'an', 'estimated', 'net', 'worth', 'of', 'around', 'US$265', 'billion', 'as', 'of', 'May', '2022,[4]', 'Musk', 'is', 'the', 'wealthiest', 'person', 'in', 'the', 'world', 'according', 'to', 'both', 'the', 'Bloomberg', 'Billionaires', 'Index', 'and', 'the', 'Forbes', 'real-time', 'billionaires', 'list.[5][6]']


['Musk', 'was', 'born', 'to', 'a', 'Canadian', 'mother', 'and', 'White', 'South', 'African', 'father,', 'and', 'raised', 'in', 'Pretoria,', 'South', 'Africa.', 'He', 'briefly', 'attended', 'the', 'Uni

In [26]:
text.Text[1]

'Elon Reeve Musk FRS (/?i?l?n/ EE-lon; born June 28, 1971) is a business magnate, investor and philanthropist. He is the founder, CEO and Chief Engineer at SpaceX; angel investor, CEO and Product Architect of Tesla, Inc.; founder of The Boring Company; and co-founder of Neuralink and OpenAI. With an estimated net worth of around US$265 billion as of May 2022,[4] Musk is the wealthiest person in the world according to both the Bloomberg Billionaires Index and the Forbes real-time billionaires list.[5][6]'

# Tokenize by Sentence?

In [29]:
# naive sentence tokenization: split on every period
def tokenize_by_sentence(text: pd.Series) -> pd.Series:
    return text.str.split('.')

text_tokenize_sentence = text.apply(tokenize_by_sentence)
for c, tts in enumerate(text_tokenize_sentence.Text):
    print(tts)
    print("\n")
    print("ORIGINAL:", text.Text[c])
    print("----")

['https://en', 'wikipedia', 'org/wiki/Elon_Musk']


ORIGINAL: https://en.wikipedia.org/wiki/Elon_Musk
----
['Elon Reeve Musk FRS (/?i?l?n/ EE-lon; born June 28, 1971) is a business magnate, investor and philanthropist', ' He is the founder, CEO and Chief Engineer at SpaceX; angel investor, CEO and Product Architect of Tesla, Inc', '; founder of The Boring Company; and co-founder of Neuralink and OpenAI', ' With an estimated net worth of around US$265 billion as of May 2022,[4] Musk is the wealthiest person in the world according to both the Bloomberg Billionaires Index and the Forbes real-time billionaires list', '[5][6]']


ORIGINAL: Elon Reeve Musk FRS (/?i?l?n/ EE-lon; born June 28, 1971) is a business magnate, investor and philanthropist. He is the founder, CEO and Chief Engineer at SpaceX; angel investor, CEO and Product Architect of Tesla, Inc.; founder of The Boring Company; and co-founder of Neuralink and OpenAI. With an estimated net worth of around US$265 billion as of May 20
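
Splitting on every period also breaks the URL and the abbreviation "Inc.", as the output above shows. One possible refinement, sketched here with the standard `re` module, is to split only where sentence-ending punctuation is followed by whitespace and an uppercase letter. The pattern and the function name `split_sentences` are illustrative assumptions, and the heuristic still fails on abbreviations such as "Dr. Smith".

```python
import re
from typing import List

# A boundary is whitespace that follows ".", "!" or "?" and precedes a capital letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the closing punctuation."""
    return [s for s in SENTENCE_BOUNDARY.split(text) if s]


sample = ("Musk founded SpaceX in 2002. He leads Tesla, Inc.; see "
          "https://en.wikipedia.org/wiki/Elon_Musk for more. The page is long.")
for sentence in split_sentences(sample):
    print(sentence)
```

Because the periods inside the URL and in "Inc.;" are not followed by whitespace and a capital letter, they no longer trigger a split.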

# Tokenize by Paragraph?
- This approach is flexible. (What constitutes a paragraph?)

In [31]:
from typing import List

In [45]:
text_tokenize_sentence.Text[0]

['https://en', 'wikipedia', 'org/wiki/Elon_Musk']

In [155]:
def tokenize_by_paragraph(list_of_sentences: pd.Series, length_of_paragraph: int) -> List[str]:
    '''Group sentences into paragraph-sized chunks.

    Assumes that sentence tokenization has already been executed,
    so each record is a list of sentences.
    '''
    list_to_return = []
    to_join = ""
    for records in list_of_sentences.values:
        if len(records) < length_of_paragraph:
            # short record: keep it as a single paragraph
            list_to_return.append("".join(records))
        else:
            for cr, record in enumerate(records):
                to_join = to_join + record
                # close the chunk at every nonzero multiple of length_of_paragraph
                if cr % length_of_paragraph == 0 and cr != 0:
                    list_to_return.append(to_join)
                    to_join = ""
            # flush the sentences left over at the end of the record
            if to_join != "":
                list_to_return.append(to_join)
                to_join = ""
    return list_to_return

In [156]:
text_by_paragraph = tokenize_by_paragraph(text_tokenize_sentence.Text, 3)
for v in text_by_paragraph:
    print(v + "\n")

https://enwikipediaorg/wiki/Elon_Musk

Elon Reeve Musk FRS (/?i?l?n/ EE-lon; born June 28, 1971) is a business magnate, investor and philanthropist He is the founder, CEO and Chief Engineer at SpaceX; angel investor, CEO and Product Architect of Tesla, Inc; founder of The Boring Company; and co-founder of Neuralink and OpenAI With an estimated net worth of around US$265 billion as of May 2022,[4] Musk is the wealthiest person in the world according to both the Bloomberg Billionaires Index and the Forbes real-time billionaires list

[5][6]

Musk was born to a Canadian mother and White South African father, and raised in Pretoria, South Africa He briefly attended the University of Pretoria before moving to Canada at age 17 He matriculated at Queen's University and transferred to the University of Pennsylvania two years later, where he received a bachelor's degree in Economics and Physics He moved to California in 1995 to attend Stanford University but decided instead to pursue a business
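
Because the loop above only closes a chunk when the sentence index is a nonzero multiple of `length_of_paragraph`, the first chunk of each record holds one extra sentence (four instead of three in the output above). If evenly sized chunks are preferred, one alternative is to slice the sentence list in fixed steps. This is a sketch on a plain list of strings; the name `chunk_sentences` and the `sep` parameter are assumptions for illustration.

```python
from typing import List


def chunk_sentences(sentences: List[str], size: int, sep: str = ". ") -> List[str]:
    """Join consecutive sentences into chunks of at most `size` sentences."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # Drop empty fragments and the stray spaces left behind by split('.').
    cleaned = [s.strip() for s in sentences if s.strip()]
    return [sep.join(cleaned[i:i + size]) for i in range(0, len(cleaned), size)]


print(chunk_sentences(["a", " b", "c", "d", ""], 3))
```

Joining with `sep` puts a separator back between the sentences of a chunk, which the plain `"".join` drops.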

In [157]:
for s in text_tokenize_sentence.Text:
    for v in s:
        print(v + "\n")

https://en

wikipedia

org/wiki/Elon_Musk

Elon Reeve Musk FRS (/?i?l?n/ EE-lon; born June 28, 1971) is a business magnate, investor and philanthropist

 He is the founder, CEO and Chief Engineer at SpaceX; angel investor, CEO and Product Architect of Tesla, Inc

; founder of The Boring Company; and co-founder of Neuralink and OpenAI

 With an estimated net worth of around US$265 billion as of May 2022,[4] Musk is the wealthiest person in the world according to both the Bloomberg Billionaires Index and the Forbes real-time billionaires list

[5][6]

Musk was born to a Canadian mother and White South African father, and raised in Pretoria, South Africa

 He briefly attended the University of Pretoria before moving to Canada at age 17

 He matriculated at Queen's University and transferred to the University of Pennsylvania two years later, where he received a bachelor's degree in Economics and Physics

 He moved to California in 1995 to attend Stanford University but decided instead to p