## Text Processing <img src="../Image/Animation/anime_gif01.gif" height="80px" style="vertical-align: middle;">
### What is Text Processing?
Text processing involves various techniques to manipulate and analyze text data. Here are some key concepts:

### Tokenization
Tokenization splits a text into smaller units, such as words, phrases, or symbols. These units are called tokens. Tokenization is often combined with other preprocessing steps, such as removing punctuation and converting all letters to lowercase.
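
The steps described above can be sketched with the Python standard library alone. This is a minimal illustration, not a full tokenizer: the helper name `tokenize` is hypothetical, and it assumes that whitespace separates words and that only ASCII punctuation needs to be removed.

```python
import string


def tokenize(text: str) -> list[str]:
    """Lowercase the text, drop ASCII punctuation, and split on whitespace."""
    lowered = text.lower()
    # Build a translation table that deletes every ASCII punctuation character.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # split() with no argument splits on any run of whitespace.
    return no_punct.split()


print(tokenize("Hello, World! Text processing is fun."))
# ['hello', 'world', 'text', 'processing', 'is', 'fun']
```

Real tokenizers also handle contractions, hyphenated words, and non-ASCII punctuation, which this sketch ignores.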

### Stop Words
Stop words are words that appear frequently in text but are considered less informative for decision-making. The list differs by language. English examples include "the", "will", "a", and "can"; Indonesian examples include "dan", "oleh", and "yang". One way to remove stop words is to create a list of these less informative words and check every word in the data against it. Any word found in the stop words list is removed.
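
The list-lookup approach can be sketched as follows. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical names, and the set holds only the few example words mentioned above rather than a complete stop word list.

```python
# Illustrative stop word set only; real lists contain hundreds of entries.
STOP_WORDS = {"the", "will", "a", "can", "dan", "oleh", "yang"}


def remove_stop_words(tokens: list[str], stop_words: set[str] = STOP_WORDS) -> list[str]:
    """Keep only the tokens that are not in the stop word set."""
    # Compare in lowercase so that "The" and "the" are treated alike.
    return [token for token in tokens if token.lower() not in stop_words]


print(remove_stop_words(["the", "movie", "will", "start"]))
# ['movie', 'start']
```

A set is used instead of a list because membership checks on a set take constant time on average, which matters once the stop word list grows.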

### Stemming
Stemming is the process of reducing a word to its base form (root word), usually by stripping affixes according to fixed rules. One well-known stemming algorithm is the Porter stemmer. Because the rules ignore meaning, stemming has some weaknesses: it can turn a word into an unrelated word (e.g., "quite" becomes "quit") or into a string that does not exist in the dictionary (e.g., "movies" becomes "movi").
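
To make the weaknesses concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm: the rule table `SUFFIX_RULES` and the function `naive_stem` are hypothetical and were chosen only so that the two problem cases above can be reproduced with a few lines of standard Python.

```python
# (suffix, replacement) pairs, tried in order; the first match wins.
# This toy rule set is an assumption for illustration, not Porter's rules.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("es", ""), ("s", ""), ("e", "")]


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three leading letters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word


for example in ["watching", "watched", "movies", "quite"]:
    print(example, "->", naive_stem(example))
# watching -> watch
# watched -> watch
# movies -> movi
# quite -> quit
```

The first two results are useful, while the last two show a non-word and a collision with an unrelated word.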

### Lemmatization
Lemmatization is a process similar to stemming, but it ensures that the result is a valid dictionary word (the lemma). It typically relies on a vocabulary and morphological analysis, so it is more complex and requires relatively more computation time.
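
The difference from stemming can be illustrated with a simple dictionary lookup. This is only a sketch: `LEMMA_TABLE` is a tiny hypothetical table, whereas real lemmatizers use a full vocabulary and often part-of-speech information.

```python
# Hypothetical lookup table mapping inflected forms to their lemmas.
LEMMA_TABLE = {
    "movies": "movie",
    "children": "child",
    "ran": "run",
    "better": "good",
}


def lemmatize(word: str) -> str:
    """Return the lemma from the table, or the lowercased word if it is unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


for example in ["movies", "children", "ran", "quite"]:
    print(example, "->", lemmatize(example))
# movies -> movie
# children -> child
# ran -> run
# quite -> quite
```

Unlike a rule-based stemmer, the lookup returns "movie" rather than "movi" and leaves "quite" unchanged, at the cost of needing an entry for every inflected form.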

---

In [None]:
import string
import pandas as pd
import numpy as np

In [1]:
text = """Saya sangat suka menonton film action, apalagi jika film tersebut
dibintangi oleh saya sendiri yang dapat dipastikan menjadi terlihat keren dan
menampakkan keseruan"""

print(text)

Saya sangat suka menonton film action, apalagi jika film tersebut
dibintangi oleh saya sendiri yang dapat dipastikan menjadi terlihat keren dan
menampakkan keseruan


In [None]:
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

stemmer = StemmerFactory().create_stemmer()
# Reduce every word in the text to its Indonesian root form
text_stemmed = stemmer.stem(text)

In [None]:
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

remover = StopWordRemoverFactory().create_stop_word_remover()
text_clean = remover.remove(text_stemmed)

In [None]:
# split() with no argument splits on any run of whitespace and never yields empty tokens
tokens = text_clean.split()

print(f"Before tokenization: {text_clean}")
print(f"After tokenization: {tokens}")
# Created by Muhamad Fadhli Akbar-2200018197-B

## THANK YOU😸

<img src="../Image/Animation/slowlife.gif" alt="Footer Background">