# Exercise 1: Simple Tokenization to Extract Words

## 1.1 What is Tokenization?

When we split text into smaller units, we are performing *tokenization*. For now, we will treat the words of the text as these units.
![tokenization-figure](../figures/class1/we-love-nlp-tokenization.png)


```{admonition} LLM FRAMING
:class: fuchsia
This is also the first step in how *Large Language Models* represent text, although, as you will see later, they do not necessarily use *words* as their basic units.  
```
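
To make the idea of "units" concrete, here is a minimal sketch that splits the same toy sentence in two ways: into words and into characters. The sentence and variable names are assumptions for illustration and not part of the exercise data. Real LLM tokenizers typically rely on learned subword vocabularies, which neither of these simple splits reproduces.

```python
# Toy sentence (an assumption for illustration, not taken from the essay)
toy_sentence = "We love NLP"

# Word-level units: split on whitespace
word_tokens = toy_sentence.split()

# Character-level units: every character, including spaces, becomes a unit
char_tokens = list(toy_sentence)

print(word_tokens)       # ['We', 'love', 'NLP']
print(len(char_tokens))  # 11
```

The same text yields 3 units or 11 units depending on what we decide a "unit" is.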


## 1.2 String-splitting to get a word list

Below is an *excerpt* from an essay written by ChatGPT-4 (`essay48.txt` in {cite:t}`herbold_large-scale_2023`). The topic is "distance-learning vs. attending school in person":

In [None]:
chatgpt_text = """
Education, a cornerstone of human development, has evolved significantly over time. 
With the advent of technology and the challenges posed by the global pandemic, the focus has shifted towards distance learning, a mode that offers flexibility, accessibility, and convenience. 
However, traditional in-person learning remains indispensable for a holistic educational experience. 
This essay will compare and contrast distance learning and attending school in person, highlighting the advantages and disadvantages of each.
Distance learning, primarily facilitated through online platforms, offers a plethora of advantages. 
It provides students with access to a vast array of resources and the ability to learn at their own pace. 
Moreover, it transcends geographical boundaries, allowing students from remote locations to access quality education. 
Nevertheless, distance learning poses challenges related to student engagement, lack of social interaction, and the potential for digital inequity.
"""

We can split a string in Python by calling the `.split()` method on our string variable.

In [70]:
chatgpt_words = chatgpt_text.split()
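
Called without arguments, `.split()` treats any run of whitespace (spaces, tabs, newlines) as a single separator and drops empty strings, which is why the line breaks in the triple-quoted string cause no trouble. The sketch below compares this with splitting on a single space; `toy_text` is a made-up string used only for illustration.

```python
# Made-up string with a leading newline, a double space, and a tab
toy_text = "\nDistance learning,  primarily\tonline.\n"

# No argument: split on any whitespace run, no empty strings in the result
default_split = toy_text.split()

# Explicit single space: newlines and tabs stay, double spaces give ''
space_split = toy_text.split(" ")

print(default_split)  # ['Distance', 'learning,', 'primarily', 'online.']
print(space_split)    # ['\nDistance', 'learning,', '', 'primarily\tonline.\n']
```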

Let's print the first 50 words of our word list. (*You can side-scroll through the words.*)


In [71]:
print(chatgpt_words[:50])

['Education,', 'a', 'cornerstone', 'of', 'human', 'development,', 'has', 'evolved', 'significantly', 'over', 'time.', 'With', 'the', 'advent', 'of', 'technology', 'and', 'the', 'challenges', 'posed', 'by', 'the', 'global', 'pandemic,', 'the', 'focus', 'has', 'shifted', 'towards', 'distance', 'learning,', 'a', 'mode', 'that', 'offers', 'flexibility,', 'accessibility,', 'and', 'convenience.', 'However,', 'traditional', 'in-person', 'learning', 'remains', 'indispensable', 'for', 'a', 'holistic', 'educational', 'experience.']


Some words appear more than once. To get a better overview of the kinds of words used, we can reduce the list to its unique words:

In [72]:
# get the "SET"
chatgpt_set = set(chatgpt_words)

# convert back to list 
chatgpt_unique = list(chatgpt_set)

# let's sort it also
chatgpt_unique = sorted(chatgpt_unique)

print(chatgpt_unique)


['Distance', 'Education,', 'However,', 'It', 'Moreover,', 'Nevertheless,', 'This', 'With', 'a', 'ability', 'access', 'accessibility,', 'advantages', 'advantages.', 'advent', 'allowing', 'and', 'array', 'at', 'attending', 'boundaries,', 'by', 'challenges', 'compare', 'contrast', 'convenience.', 'cornerstone', 'development,', 'digital', 'disadvantages', 'distance', 'each.', 'education.', 'educational', 'engagement,', 'essay', 'evolved', 'experience.', 'facilitated', 'flexibility,', 'focus', 'for', 'from', 'geographical', 'global', 'has', 'highlighting', 'holistic', 'human', 'in', 'in-person', 'indispensable', 'inequity.', 'interaction,', 'it', 'lack', 'learn', 'learning', 'learning,', 'locations', 'mode', 'of', 'offers', 'online', 'over', 'own', 'pace.', 'pandemic,', 'person,', 'platforms,', 'plethora', 'posed', 'poses', 'potential', 'primarily', 'provides', 'quality', 'related', 'remains', 'remote', 'resources', 'school', 'shifted', 'significantly', 'social', 'student', 'students', 'tec
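
To see which words repeat and how often, one option is `collections.Counter` from the standard library. The sketch below uses a short made-up word list (`toy_words` is an assumption for illustration) rather than the essay itself.

```python
from collections import Counter

# Made-up example text, split into words
toy_words = "the focus has shifted and the focus remains".split()

# Count how often each word occurs
word_counts = Counter(toy_words)

print(len(toy_words))              # 8 words in total
print(len(set(toy_words)))         # 6 unique words
print(word_counts.most_common(2))  # [('the', 2), ('focus', 2)]
```

The gap between the total count and the unique count tells you how much repetition the text contains.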

```{admonition} QUESTION
:class: red
Look at the unique words. Do you notice a problem with them? Discuss with your group.

<details>
<summary>Click to see ANSWER</summary>

Some words appear as duplicates, for example:
- "advantages" and "advantages." 
- "Distance" and "distance"

These duplicates can be avoided by lowercasing `chatgpt_text` and removing punctuation *before splitting*. We'll do this in the hands-on exercise.
</details>
```

## 1.3 CODE TASKS: Fixing our word list!

If you have already discussed the question, start solving the first tasks:

```{admonition} REVEAL TASKS A TO C
:class: dropdown, teal

Do the following:
<ol type="A">
  <li>Paste the text <code>chatgpt_text</code> into your own code notebook</li>
  <li>Convert all of <code>chatgpt_text</code> to lowercase (before splitting)</li>
  <li>Remove punctuation in the simplest way, only focusing on 
    <code>punctuations_to_remove = [".", ",", "!", "?"]</code>. You should <em>not need</em> to pip install any package to make this work.
  </li>
</ol>

I will provide a few hints, and you can even click to reveal the code chunk if you feel completely stuck, but try to use Google before you do so!
```

### Task B

```{admonition} HINT (only if you are stuck)
:class: dropdown, tip
Python has many built-in methods for string manipulation, such as `split()`. Might there be one for lowercasing?
```

If you are still stuck after the hint, check the code:

In [73]:
chatgpt_lowercased_text = chatgpt_text.lower()

print(chatgpt_lowercased_text)


education, a cornerstone of human development, has evolved significantly over time. 
with the advent of technology and the challenges posed by the global pandemic, the focus has shifted towards distance learning, a mode that offers flexibility, accessibility, and convenience. 
however, traditional in-person learning remains indispensable for a holistic educational experience. 
this essay will compare and contrast distance learning and attending school in person, highlighting the advantages and disadvantages of each.
distance learning, primarily facilitated through online platforms, offers a plethora of advantages. 
it provides students with access to a vast array of resources and the ability to learn at their own pace. 
moreover, it transcends geographical boundaries, allowing students from remote locations to access quality education. 
nevertheless, distance learning poses challenges related to student engagement, lack of social interaction, and the potential for digital inequity.
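
The effect of lowercasing on the number of unique words can be checked on a small example. In this sketch, `toy_text` is a made-up string (an assumption for illustration); the same comparison could be run on the essay.

```python
# Made-up sentence containing "Distance" and "distance"
toy_text = "Distance learning differs from distance education"

# Unique words without and with lowercasing
unique_before = set(toy_text.split())
unique_after = set(toy_text.lower().split())

print(len(unique_before))  # 6
print(len(unique_after))   # 5
```

After lowercasing, "Distance" and "distance" collapse into a single entry.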



### Task C

In [None]:
# define the punctuations you want to remove
punctuations_to_remove = [".", ",", "!", "?"]

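# very simple ("brute-force") solution: start from the lowercased text
# and remove one punctuation mark per iteration
chatgpt_lowercased_clean_text = chatgpt_lowercased_text
for punctuation in punctuations_to_remove:
    chatgpt_lowercased_clean_text = chatgpt_lowercased_clean_text.replace(punctuation, "")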
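
As an alternative to the loop, Python's built-in `str.maketrans` and `str.translate` can delete several characters in one pass. The following is a sketch on a made-up string (`toy_text` and the other variable names are assumptions for illustration), not necessarily the solution intended for this exercise.

```python
punctuations_to_remove = [".", ",", "!", "?"]

# Made-up example text (not the essay)
toy_text = "Distance learning offers advantages. The advantages, however, vary!"

# The third argument of str.maketrans lists the characters to delete
deletion_table = str.maketrans("", "", "".join(punctuations_to_remove))

# Lowercase first, then delete the punctuation marks
clean_text = toy_text.lower().translate(deletion_table)
print(clean_text)
# distance learning offers advantages the advantages however vary

# Split and collect the unique words, as in section 1.2
unique_words = sorted(set(clean_text.split()))
print(unique_words)
# ['advantages', 'distance', 'however', 'learning', 'offers', 'the', 'vary']
```

With punctuation removed before splitting, "advantages." and "advantages," no longer show up as separate entries.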