
# Tokenization Challenges

Tokenization is more challenging than you might think.

Do we just split our text on non-alphanumeric characters? What about:

* contractions -- can't, won't
* words with punctuation -- Yahoo!
* emails and urls -- bob@website.com, www.senecacollege.ca
* phone numbers -- 999-999-9999
* abbreviations -- U.S.A.
* emoticons -- :)
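
To see why these cases are awkward, here is a minimal sketch that keeps only runs of letters and digits. The sample sentence and variable names are made up for illustration.

```python
import re

# Hypothetical sample text containing some of the tricky cases listed above.
sample = "I can't email bob@website.com or call 999-999-9999 :)"

# Naive approach: keep only runs of letters and digits.
naive_tokens = re.findall(r"[A-Za-z0-9]+", sample)
print(naive_tokens)
# ['I', 'can', 't', 'email', 'bob', 'website', 'com', 'or', 'call', '999', '999', '9999']
```

The contraction, the email address, and the phone number are each broken into pieces, and the emoticon disappears entirely.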

What should be considered a token? Should these examples be multiple tokens or a single token?
* United States
* end of day
* is not

And should some words be treated as the same? What about:
* "Hello" and "hello" and "HELLO"
* "NY" and "New York"
* "ski" and "skiing"
* "good" and "great"

There is no right answer to these questions and no single best way to tokenize.
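
As one illustration of such a choice, the sketch below counts a small, made-up word list before and after lowercasing. Lowercasing merges the three spellings of "hello", but it does nothing for "NY" and "New York", which would need a separate rule.

```python
from collections import Counter

# Hypothetical tokens taken from the examples above.
words = ["Hello", "hello", "HELLO", "NY", "New York"]

raw_counts = Counter(words)                        # every spelling is distinct
folded_counts = Counter(w.lower() for w in words)  # case differences are merged

print(len(raw_counts))         # 5
print(folded_counts["hello"])  # 3
print(len(folded_counts))      # 3
```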

# Regular Expressions

A regular expression is a way of describing a pattern that you want to find in text. Here are some examples of regular expressions and their meaning:
* `[0-9]` -- any single digit
* `[A-Z]` -- any single capital letter
* `[A-Z]+` -- any sequence of one or more capital letters
* `[^A]` -- any character that is not a capital A (without the brackets, `^A` matches a capital A at the start of the text)
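
These patterns can be tried out with Python's `re.findall`. The sketch below uses a made-up sentence and also contrasts `^A` (a capital A at the start of the text) with `[^A]` (any character except a capital A), since the two are easy to confuse.

```python
import re

# Hypothetical sample sentence for trying out the patterns.
text = "Apollo 11 landed in 1969. NASA ran it."

digits = re.findall(r"[0-9]", text)           # each digit on its own
capitals = re.findall(r"[A-Z]", text)         # each capital letter on its own
capital_runs = re.findall(r"[A-Z]+", text)    # runs of capital letters
starts_with_a = re.findall(r"^A", text)       # a capital A at the start of the text
not_capital_a = re.findall(r"[^A]", "ABA")    # every character except a capital A

print(digits)         # ['1', '1', '1', '9', '6', '9']
print(capitals)       # ['A', 'N', 'A', 'S', 'A']
print(capital_runs)   # ['A', 'NASA']
print(starts_with_a)  # ['A']
print(not_capital_a)  # ['B']
```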

We can use regular expressions to tokenize our text.
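
As a rough sketch of what that can look like, the pattern below tries a few specific token shapes before falling back to plain words. The sub-patterns are simplified assumptions for illustration; real email addresses and phone numbers come in many more formats.

```python
import re

# Try the most specific patterns first; the generic word pattern goes last.
token_pattern = "|".join([
    r"[\w.]+@[\w.]+\w",      # email addresses such as bob@website.com
    r"\d{3}-\d{3}-\d{4}",    # phone numbers such as 999-999-9999
    r"\w+(?:'\w+)?",         # words, keeping contractions such as can't together
])

sample = "Email bob@website.com or call 999-999-9999, I can't wait."
tokens = re.findall(token_pattern, sample)
print(tokens)
# ['Email', 'bob@website.com', 'or', 'call', '999-999-9999', 'I', "can't", 'wait']
```

The order of the alternatives matters: if the word pattern came first, it would match "bob" before the email pattern had a chance to match the whole address.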

Regular expressions are almost a language unto themselves. They are implemented in many different programming languages, including Python. Although regular expressions are fairly consistent across languages, you should be aware that there are sometimes small differences.

A great resource for visualizing and testing regular expressions is [regex101](https://regex101.com/). We will be using this during class. Make sure to choose the Python flavour.

In [None]:
# import re
# re.findall()

In [None]:
import re
doc = '''Typhoon Yagi, known in the Philippines as Severe Tropical Storm Enteng, was a powerful and destructive tropical cyclone which impacted the Philippines, China, Vietnam and Laos in early September 2024. Yagi, which means goat or the constellation of Capricornus in Japanese, is the eleventh named storm, the first violent typhoon and the first Category 5 storm of the annual typhoon season. It was one of the most intense typhoons ever to strike Northern Vietnam, the strongest typhoon to strike Hainan during the meteorological autumn and the strongest since Rammasun in 2014. It is one of the only four Category 5 super typhoons recorded in the South China Sea, alongside Pamela in 1954, Rammasun in 2014 and Rai in 2021.'''

re.findall("[A-Za-z0-9]+", doc)

['Typhoon',
 'Yagi',
 'known',
 'in',
 'the',
 'Philippines',
 'as',
 'Severe',
 'Tropical',
 'Storm',
 'Enteng',
 'was',
 'a',
 'powerful',
 'and',
 'destructive',
 'tropical',
 'cyclone',
 'which',
 'impacted',
 'the',
 'Philippines',
 'China',
 'Vietnam',
 'and',
 'Laos',
 'in',
 'early',
 'September',
 '2024',
 'Yagi',
 'which',
 'means',
 'goat',
 'or',
 'the',
 'constellation',
 'of',
 'Capricornus',
 'in',
 'Japanese',
 'is',
 'the',
 'eleventh',
 'named',
 'storm',
 'the',
 'first',
 'violent',
 'typhoon',
 'and',
 'the',
 'first',
 'Category',
 '5',
 'storm',
 'of',
 'the',
 'annual',
 'typhoon',
 'season',
 'It',
 'was',
 'one',
 'of',
 'the',
 'most',
 'intense',
 'typhoons',
 'ever',
 'to',
 'strike',
 'Northern',
 'Vietnam',
 'the',
 'strongest',
 'typhoon',
 'to',
 'strike',
 'Hainan',
 'during',
 'the',
 'meteorological',
 'autumn',
 'and',
 'the',
 'strongest',
 'since',
 'Rammasun',
 'in',
 '2014',
 'It',
 'is',
 'one',
 'of',
 'the',
 'only',
 'four',
 'Category',
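 '5',
 'super',
 'typhoons',
 'recorded',
 'in',
 'the',
 'South',
 'China',
 'Sea',
 'alongside',
 'Pamela',
 'in',
 '1954',
 'Rammasun',
 'in',
 '2014',
 'and',
 'Rai',
 'in',
 '2021']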

# Lemmatization and Stemming

Stemming is the process of reducing words to their base or root by removing suffixes and prefixes. For example:
* "skiing" and "skier" both become "ski"
* "walking" and "walks" both become "walk"

By doing this, we are treating words with the same "stem" as if they are the same words. This helps us train our models because it reduces the sparsity of our data. However, it also removes some meaning.
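
A toy suffix-stripping function gives a feel for how this works and where it goes wrong. This is only a sketch: `naive_stem` and its suffix list are hypothetical, and real stemmers use far more careful rules.

```python
def naive_stem(word, suffixes=("ing", "er", "s")):
    """Strip the first matching suffix; a toy stand-in for a real stemmer."""
    word = word.lower()
    for suffix in suffixes:
        # Only strip when at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["skiing", "skier", "walking", "walks", "running"]
stems = [naive_stem(w) for w in words]
print(stems)
# ['ski', 'ski', 'walk', 'walk', 'runn']
```

The first four words collapse onto two stems, which is the reduction in sparsity described above. The last result, "runn", shows how crude rules can produce stems that are not real words.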

Lemmatization is a related idea. Instead of stripping prefixes and suffixes, lemmatization replaces a word with its "lemma": the dictionary form of that word. For example, "was" becomes "be" and "mice" becomes "mouse". In this notebook we use the idea more loosely and also map words with similar meanings to a single word.

For example, "huge" might be replaced with "big". Again, this helps us address the sparsity of our data, but comes at the cost of removing meaning.

In [None]:
# lemmas as a dictionary
# replacing tokens with lemmas

In [None]:
lemmas = {
    "skiing": "ski",
    "skier": "ski",
    "walking": "walk",
    "walks": "walk",
    "huge": "big"
}

document = 'The skier was skiing on a huge hill!'

tokens = re.findall("[A-Za-z0-9]+", document)
# Lowercase each token, then swap in its lemma if we have one
for i, v in enumerate(tokens):
  lowercase_v = v.lower()
  if lowercase_v in lemmas:
    tokens[i] = lemmas[lowercase_v]
  else:
    tokens[i] = lowercase_v

tokens

['the', 'ski', 'was', 'ski', 'on', 'a', 'big', 'hill']

# Stop Words

Stop words are words that we remove from our analysis because we consider them insignificant. Generally, these are extremely common words like "a" and "the" that add little meaning to the text. We remove them primarily for computational reasons: NLP models can be very computationally intensive, and removing insignificant words can make our models run faster and use less compute. However, as computing power becomes cheaper and more plentiful, removing stop words is becoming less common.

In [None]:
# storing stop words in a list
# removing stop words from a list of tokens

In [None]:
lemmas = {
    "skiing": "ski",
    "skier": "ski",
    "huge": "big"
}

document = 'The skier was skiing on a huge hill!'

tokens = re.findall("[A-Za-z0-9]+", document)

# Lowercase each token, then swap in its lemma if we have one
for i, v in enumerate(tokens):
  lowercase_v = v.lower()
  if lowercase_v in lemmas:
    tokens[i] = lemmas[lowercase_v]
  else:
    tokens[i] = lowercase_v

stop_words = ['the', 'a', 'on', 'was']

print(tokens)
# Keep only the tokens that are not stop words
tokens = [v for v in tokens if v not in stop_words]
print(tokens)

['the', 'ski', 'was', 'ski', 'on', 'a', 'big', 'hill']
['ski', 'ski', 'big', 'hill']
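
The three steps in this notebook (tokenizing, replacing tokens with lemmas, and removing stop words) can also be wrapped in a single function. The sketch below is one possible arrangement; the name `preprocess` and its parameters are hypothetical, and the stop words are stored in a set so that membership checks stay fast as the list grows.

```python
import re

def preprocess(text, lemmas, stop_words):
    """Tokenize, lowercase, map tokens to lemmas, then drop stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    # Use the lemma when one is defined, otherwise keep the lowercased token.
    tokens = [lemmas.get(t.lower(), t.lower()) for t in tokens]
    return [t for t in tokens if t not in stop_words]

example_lemmas = {"skiing": "ski", "skier": "ski", "huge": "big"}
example_stop_words = {"the", "a", "on", "was"}

result = preprocess("The skier was skiing on a huge hill!",
                    example_lemmas, example_stop_words)
print(result)
# ['ski', 'ski', 'big', 'hill']
```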
