<a href="https://colab.research.google.com/github/Grace0212/ATM/blob/main/text_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 style="text-align: center;">Text Preprocessing 1 - Tutorial</h1>

Common text preprocessing steps:

1. Text Cleaning: remove special characters, HTML tags, and newlines.
2. Tokenization: split text into sentences and words.
3. Stop Words Removal: remove words of little value like "the", "and", "a", "an".
4. Stemming: strip affixes from words.
5. Lemmatization: convert words to their dictionary base form.

## Install Dependencies

In [None]:
!pip3 install nltk

In [None]:
example_sentence = """A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. <br> Top speed attained, CPU rated speed,
add on cards & adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4m floppies are especially requested."""

## Cleaning

In [None]:
import re

def clean_text(text):
    text = re.sub(r"<.*?>", " ", text)  # Remove HTML tags
    text = re.sub(r"\\n", " ", text)  # Remove literal "\n" escape sequences (real newlines are collapsed by the \s+ step)
    text = re.sub(r"[^\w\s.]", " ", text)  # Remove special characters but keep periods (decimal points and sentence ends)
    text = re.sub(r"\s+", " ", text)  # Replace multiple spaces with a single space
    return text.strip().lower()

In [None]:
cleaned_text = clean_text(example_sentence)
print(cleaned_text)
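
To see what each substitution in `clean_text` contributes, the sketch below applies the same four patterns one at a time and prints the intermediate string. The sample string is made up for illustration and is not part of the tutorial data.

```python
import re

# Hypothetical sample: contains an HTML tag, a literal backslash-n, and punctuation.
sample = r"Speed: 33.5 MHz <br> (CPU & adapters)\nOK!"

# The same patterns clean_text uses, in the same order.
steps = [
    ("remove HTML tags", r"<.*?>"),
    ("remove literal \\n sequences", r"\\n"),
    ("remove special characters, keep periods", r"[^\w\s.]"),
    ("collapse whitespace", r"\s+"),
]

text = sample
for label, pattern in steps:
    text = re.sub(pattern, " ", text)
    print(f"{label:<42} {text!r}")

result = text.strip().lower()
print(result)
```

The order matters: if special characters were removed before the literal `\n` step, only the backslash would be dropped and a stray `n` would be left glued to the next word.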

## Tokenization

In [None]:
import nltk
nltk.download('punkt_tab')  # Punkt data used by sent_tokenize and word_tokenize

In [None]:
#Sentence Tokenizer
from nltk.tokenize import sent_tokenize
tokenized_sent = sent_tokenize(cleaned_text)
print('number of sentences: ', len(tokenized_sent))
print(tokenized_sent)

In [None]:
#Word Tokenizer
from nltk.tokenize import word_tokenize
tokenized = word_tokenize(cleaned_text)
print(tokenized)
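
A rough way to see why a dedicated word tokenizer is useful is to compare a plain whitespace split with a small regular-expression tokenizer. The regex below is only an illustrative assumption, not the algorithm `word_tokenize` uses, and the sample text is a shortened, made-up variant of the example sentence.

```python
import re

text = "top speed attained, cpu rated speed. floppy disk with 1.4m floppies."

# Naive approach: split on whitespace. Punctuation stays glued to the words.
whitespace_tokens = text.split()

# Illustrative regex tokenizer (not NLTK's algorithm). It matches, in order:
#   numbers with an optional decimal part and trailing unit letters (1.4m),
#   runs of word characters,
#   any single punctuation mark.
regex_tokens = re.findall(r"\d+(?:\.\d+)?\w*|\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

With the whitespace split, `attained,` and `speed.` keep their punctuation, so they would not match `attained` or `speed` in a vocabulary. The regex version separates the punctuation while keeping `1.4m` in one piece.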

In [None]:
#Tweet Tokenizer compared to word_tokenize
from nltk.tokenize import TweetTokenizer
tweet = "Dont take cryptocurrency advice from people on Twitter 😃👍 #crypto"
tokenizer = TweetTokenizer()
tokenized_tweet = tokenizer.tokenize(tweet)
print(tokenized_tweet)
print(word_tokenize(tweet))

## Stemming and Lemmatization

### 1- NLTK

In [None]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()

Remember that:
* The Porter stemmer removes suffixes with a fixed set of rules.
* Its output is not always a valid English word (for example, `argue` becomes `argu`).
* Some stems are still recognizable words (for example, `running` becomes `run`).
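
To make "rule-based suffix removal" concrete, here is a deliberately tiny suffix stripper. It is a toy under stated assumptions, not the Porter algorithm: the rule list, the `toy_stem` name, and the `min_stem_len` parameter are all invented for this illustration.

```python
# Toy suffix rules, tried in order. NOT the Porter algorithm, just the general idea.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")]

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, if enough of the word remains."""
    for suffix, replacement in SUFFIX_RULES:
        remaining = len(word) - len(suffix)
        if word.endswith(suffix) and remaining >= min_stem_len:
            return word[:remaining] + replacement
    return word  # no rule applied

for w in ["argued", "arguing", "argues", "flies", "running", "fly"]:
    print(f"{w:<10} -> {toy_stem(w)}")
```

Even this small rule set shows both effects from the list above: `argued`, `arguing`, and `argues` collapse to the same stem `argu`, which is not an English word, while `fly` passes through untouched. The real Porter stemmer applies several ordered phases of rules with extra conditions, so its output differs in places (it reduces `running` to `run`, whereas this toy yields `runn`).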

In [None]:
# Standard cases
print(porter.stem('argue'))
print(porter.stem('argued'))
print(porter.stem('argues'))
print(porter.stem('arguing'))

In [None]:
# Plurals and derivational forms
print(porter.stem('running'))
print(porter.stem('runner'))
print(porter.stem('flies'))
print(porter.stem('fly'))
print(porter.stem('crying'))

In [None]:
# Complex endings
print(porter.stem('happiness'))
print(porter.stem('university'))
print(porter.stem('national'))
print(porter.stem('generalization'))

## When to Use PorterStemmer?
* For lightweight, rule-based stemming in tasks such as text classification and information retrieval (IR).
* For more linguistically accurate results, consider lemmatization instead (e.g., WordNetLemmatizer).

In [None]:
import nltk
nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("argue", 'v'))
print(lemmatizer.lemmatize("argued", 'v'))
print(lemmatizer.lemmatize("argues", 'v'))
print(lemmatizer.lemmatize("arguing", 'n'))  # tagged as a noun here, so it is not reduced to the verb "argue"

In [None]:
print(lemmatizer.lemmatize("better", 'a'))
print(lemmatizer.lemmatize("running", 'v'))
print(lemmatizer.lemmatize("running", 'n'))
print(lemmatizer.lemmatize("flies", 'n'))
print(lemmatizer.lemmatize("flies", 'v'))
print(lemmatizer.lemmatize("mice", 'n'))

In [None]:
# WordNetLemmatizer needs the correct POS tag to be accurate; the default is noun, so "went" comes back unchanged
print(lemmatizer.lemmatize("went"))
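
Since the lemmatizer depends on the POS argument, a common pattern is to translate tagger output into the single-letter codes WordNet expects (`'n'`, `'v'`, `'a'`, `'r'`). The helper below is a minimal sketch: the function name is hypothetical, and the tagged pairs are written by hand rather than produced by a tagger.

```python
def treebank_to_wordnet_pos(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # fall back to noun, which is also the lemmatizer's default

# Hand-written (word, tag) pairs for illustration; a POS tagger would normally supply these.
tagged = [("mice", "NNS"), ("went", "VBD"), ("better", "JJR"), ("quickly", "RB")]

pairs = [(word, treebank_to_wordnet_pos(tag)) for word, tag in tagged]
print(pairs)
```

Each resulting letter can be passed as the second argument to `lemmatizer.lemmatize(word, pos)`, so that a verb such as "went" is looked up as a verb rather than as a noun.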

### 2- spaCy

In [None]:
!pip3 install spacy
!python3 -m spacy download en_core_web_md

In [None]:
import spacy
nlp = spacy.load('en_core_web_md') #load the core English language model

In [None]:
doc=nlp('After the cats fell asleep, the mice went out to play.')
for token in doc:
    print(token,'-->',token.lemma_)

In [None]:
# Lemmatize the cleaned example text
doc = nlp(cleaned_text)

# Extract original words and their lemmatized forms
original = [token.text for token in doc]
lemmatized = [token.lemma_.lower() for token in doc]

# Display results in aligned format
print(f"{'Original':<15} {'Lemmatized':<15}")
print("=" * 30)
for orig, lem in zip(original, lemmatized):
    print(f"{orig:<15} {lem:<15}")

## Stop Word Removal


In [None]:
import nltk
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords

# Get stop words list
stop = stopwords.words('english')

# Convert to a set to remove duplicates
unique_stopwords = set(stop)

# Convert back to a list
stop = list(unique_stopwords)

print("Total unique stop words:", len(stop))
print("Sample stop words:", stop[:100])  # Print the first 100 stop words

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
stop_words=list(STOP_WORDS)

print("Total unique stop words:", len(stop_words))
print("Sample stop words:", stop_words[:100])  # First 100 stop words

In [None]:
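# Keep the lemmas that are not in spaCy's stop word list
stop_words_removed = [word for word in lemmatized if word not in stop_words]
# Collect the lemmas that were filtered out, for inspection
removed_arr = [word for word in lemmatized if word in stop_words]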

In [None]:
print(stop_words_removed)

In [None]:
print(removed_arr)
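
The filtering step itself does not depend on any library. As a minimal sketch, the same two list comprehensions work with a small hand-picked stop list; the list and the sample tokens below are assumptions made for illustration, and real stop lists are much longer.

```python
# Tiny hand-picked stop list; a set gives fast membership tests.
toy_stop_words = {"a", "of", "the", "their", "for", "this", "have", "who"}

tokens = "a fair number of brave souls who upgraded their clock".split()

kept = [t for t in tokens if t not in toy_stop_words]
removed = [t for t in tokens if t in toy_stop_words]

print("kept:   ", kept)
print("removed:", removed)
```

Storing the stop words in a set rather than a list keeps each membership check fast, which matters once the token list grows to a full corpus.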