# Basic Text Processing Techniques

In this notebook, we will explore some fundamental text processing techniques used in natural language processing (NLP). These techniques are essential for preparing text data for more complex tasks such as question answering and sentiment analysis, and for training machine learning models in general.

## 1. Tokenization

Tokenization is the process of breaking down a string or text into smaller components called tokens. Tokens are typically words or phrases, but they can also include punctuation and other symbols. The main purpose of tokenization is to simplify the text data by reducing it to manageable units for further processing.

There are several methods of tokenization:
- **Word tokenization**: Splits the text by words.
- **Sentence tokenization**: Splits the text into sentences.
- **Subword tokenization**: Splits words into smaller units, such as frequent character sequences or word pieces.
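
Word and sentence tokenization are shown with NLTK in the next cell. Subword tokenization is often implemented as a greedy longest-match lookup against a fixed vocabulary, as in WordPiece-style tokenizers. The following is a minimal sketch of that idea, not a production tokenizer. The vocabulary `VOCAB`, the `##` continuation marker, and the function name `subword_tokenize` are assumptions made for illustration.

```python
# Toy vocabulary (hypothetical); "##" marks a piece that continues a word.
VOCAB = {"token", "##ization", "##ize", "un", "##break", "##able", "[UNK]"}

def subword_tokenize(word, vocab=VOCAB):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces

print(subword_tokenize("tokenization"))  # ['token', '##ization']
print(subword_tokenize("unbreakable"))   # ['un', '##break', '##able']
print(subword_tokenize("xyz"))           # ['[UNK]']
```

Real subword tokenizers learn their vocabulary from a corpus; here it is written by hand to keep the example small.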

In [3]:
import nltk
# Download the Punkt tokenizer models; recent NLTK releases may need 'punkt_tab' instead
nltk.download('punkt')

# Sample text
text = "Hello world! This is an example of word tokenization."

# Word tokenization
word_tokens = nltk.word_tokenize(text)
print("Word Tokens:", word_tokens)

# Sentence tokenization
sentence_tokens = nltk.sent_tokenize(text)
print("Sentence Tokens:", sentence_tokens)


Word Tokens: ['Hello', 'world', '!', 'This', 'is', 'an', 'example', 'of', 'word', 'tokenization', '.']
Sentence Tokens: ['Hello world!', 'This is an example of word tokenization.']


[nltk_data] Downloading package punkt to /Users/waseem/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 2. Stemming

Stemming reduces words to a stem or root form, which may not be a valid word itself. Stemming algorithms work by stripping affixes from an inflected word, using a list of common prefixes and suffixes. The Porter stemmer used below removes suffixes only.
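
To make the affix-stripping idea concrete, here is a deliberately naive suffix stripper. It is a sketch for illustration only and is not the Porter algorithm, which applies several ordered rule phases with extra conditions. The suffix list `SUFFIXES`, the function `simple_stem`, and the `min_stem_len` parameter are all assumptions.

```python
# Suffixes are checked in order, so longer ones come first (hypothetical list).
SUFFIXES = ["ing", "ly", "es", "ed", "s"]

def simple_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["running", "runs", "easily", "jumped", "ran"]
print([simple_stem(w) for w in words])
# ['runn', 'run', 'easi', 'jump', 'ran']
```

The output shows two typical properties of stemming: stems such as `runn` and `easi` are not valid words, and irregular forms such as `ran` are left untouched.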


In [1]:
from nltk.stem import PorterStemmer

# Initialize the PorterStemmer
porter = PorterStemmer()

# Example words
words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly']

# Applying stemming
stemmed_words = [porter.stem(word) for word in words]
print("Stemmed Words:", stemmed_words)


Stemmed Words: ['run', 'runner', 'run', 'ran', 'run', 'easili', 'fairli']


## 3. Lemmatization

Lemmatization is similar to stemming, but it takes a word's context, such as its part of speech, into account. It maps the inflected forms of a word to a single dictionary form (the lemma), which often improves text analysis.

Unlike stemming, lemmatization relies on morphological analysis of each word. This requires detailed dictionaries that the algorithm can consult to link a word to its lemma.
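
The dictionary lookup can be pictured with a tiny hand-written table. This is only a sketch of the principle: the table `LEMMA_DICT`, the part-of-speech labels, and the function `lookup_lemma` are hypothetical, and real lemmatizers such as the WordNet one used below rely on far larger resources and additional morphological rules.

```python
# Hypothetical mini-dictionary keyed by (word form, part of speech).
LEMMA_DICT = {
    ("running", "verb"): "run",
    ("ran", "verb"): "run",
    ("runs", "verb"): "run",
    ("runs", "noun"): "run",
    ("better", "adj"): "good",
    ("mice", "noun"): "mouse",
}

def lookup_lemma(word, pos):
    """Return the dictionary lemma, or the word unchanged if it is not listed."""
    return LEMMA_DICT.get((word.lower(), pos), word)

print(lookup_lemma("ran", "verb"))      # run
print(lookup_lemma("running", "noun"))  # running (not listed as a noun here)
print(lookup_lemma("mice", "noun"))     # mouse
```

Because the lookup key includes the part of speech, the same surface form can map to different lemmas depending on how the word is used.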

In [4]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Example words
words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly']

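# Applying lemmatization
# lemmatize() treats each word as a noun by default (pos='n'),
# so verb forms such as 'running' and 'ran' are returned unchanged
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]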
print("Lemmatized Words:", lemmatized_words)

[nltk_data] Downloading package wordnet to /Users/waseem/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Lemmatized Words: ['run', 'runner', 'running', 'ran', 'run', 'easily', 'fairly']


## 4. Stop Word Removal

Stop words are common words that are usually filtered out before processing natural language text. Words like "and", "the", "a", and similar are considered stop words because they appear frequently and are unlikely to contribute to the meaning of text for many tasks. Removing stop words can shrink the dataset and speed up later processing.
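
The size reduction is easy to see on a small example. The sketch below uses a tiny hand-picked stop list (`STOP_WORDS`, an assumption made for illustration) rather than NLTK's list, and splits on whitespace instead of using a proper tokenizer.

```python
# A tiny hand-picked stop list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "is", "on", "of", "in", "to"}

tokens = "the cat sat on the mat and the dog sat in the sun".split()
kept = [t for t in tokens if t not in STOP_WORDS]

print(len(tokens), "->", len(kept))  # 13 -> 6
print(kept)  # ['cat', 'sat', 'mat', 'dog', 'sat', 'sun']
```

More than half of the tokens in this sentence are stop words, which is why filtering them can noticeably reduce the amount of data passed to later steps.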


In [6]:
from nltk.corpus import stopwords
nltk.download('stopwords')

# Set of English stop words
stop_words = set(stopwords.words('english'))

# Sample sentence
sentence = "This is an example of how to filter out stop words in a sentence."

# Removing stop words
filtered_sentence = ' '.join([word for word in nltk.word_tokenize(sentence) if word.lower() not in stop_words])
print("Filtered Sentence:", filtered_sentence)

Filtered Sentence: example filter stop words sentence .


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/waseem/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 5. Word Embeddings

Word embeddings are a type of word representation in which words with similar meanings have similar representations. They are produced by a family of feature learning techniques in NLP that map words or phrases from the vocabulary to vectors of real numbers. Conceptually, this involves a mathematical embedding from a space with many dimensions per word into a continuous vector space of much lower dimension.

Common models to generate word embeddings include:
- **Word2Vec**: Uses neural networks to learn word associations from a large corpus of text.
- **GloVe (Global Vectors for Word Representation)**: Uses matrix factorization based on global word-word co-occurrence to provide vector representations.
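
"Similar representation" is usually measured with cosine similarity between vectors. The sketch below computes it for hand-made three-dimensional vectors. The numbers in `toy_vectors` are invented for illustration and are not learned from any corpus; real embeddings typically have a hundred or more dimensions.

```python
import math

# Hand-made 3-dimensional vectors, purely illustrative (not learned from data).
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_fruit = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
print(f"king vs queen: {sim_royal:.3f}")
print(f"king vs apple: {sim_fruit:.3f}")
```

Vectors that point in nearly the same direction score close to 1, so in this toy setup "king" is far closer to "queen" than to "apple".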


<img src="./imgs/embedding.png" alt="drawing" width="650"/>

```bash
pip install gensim wget
```

In [10]:
from gensim.models import Word2Vec

# Sample sentences
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

# Train a Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Access vector for one word
word_vector = model.wv['sentence']
print("Word Vector for 'sentence':", word_vector)

Word Vector for 'sentence': [-5.3622725e-04  2.3643136e-04  5.1033497e-03  9.0092728e-03
 -9.3029495e-03 -7.1168090e-03  6.4588725e-03  8.9729885e-03
 -5.0154282e-03 -3.7633716e-03  7.3805046e-03 -1.5334714e-03
 -4.5366134e-03  6.5540518e-03 -4.8601604e-03 -1.8160177e-03
  2.8765798e-03  9.9187379e-04 -8.2852151e-03 -9.4488179e-03
  7.3117660e-03  5.0702621e-03  6.7576934e-03  7.6286553e-04
  6.3508903e-03 -3.4053659e-03 -9.4640139e-04  5.7685734e-03
 -7.5216377e-03 -3.9361035e-03 -7.5115822e-03 -9.3004224e-04
  9.5381187e-03 -7.3191668e-03 -2.3337686e-03 -1.9377411e-03
  8.0774371e-03 -5.9308959e-03  4.5162440e-05 -4.7537340e-03
 -9.6035507e-03  5.0072931e-03 -8.7595852e-03 -4.3918253e-03
 -3.5099984e-05 -2.9618145e-04 -7.6612402e-03  9.6147433e-03
  4.9820580e-03  9.2331432e-03 -8.1579173e-03  4.4957981e-03
 -4.1370760e-03  8.2453608e-04  8.4986202e-03 -4.4621765e-03
  4.5175003e-03 -6.7869602e-03 -3.5484887e-03  9.3985079e-03
 -1.5776526e-03  3.2137157e-04 -4.1406299e-03 -7.6826881e

## Word2Vec Model Training with Text8 Corpus

Next, we'll train a Word2Vec model using the `text8` dataset. The `text8` corpus is a cleaned excerpt of English Wikipedia text, roughly 100 MB in size, that is widely used for natural language processing experiments. After training, we'll demonstrate how to use this model to find the words most similar to a given input word.

In [6]:
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api
corpus = api.load('text8')


Now that the dataset is loaded, let's train a Word2Vec model on it using the Gensim library. Training a Word2Vec model involves specifying parameters such as the dimensionality of the word vectors (`vector_size`), the maximum distance between a word and its context words (`window`), and the minimum number of occurrences a word needs to be kept in the vocabulary (`min_count`).
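
To see what `window` and `min_count` control, the sketch below builds the (center word, context word) pairs that a skip-gram style model would learn from. It is a simplified illustration, not Gensim's implementation: it omits details such as subsampling of frequent words and negative sampling, and the function name `context_pairs` and the toy sentences are assumptions.

```python
from collections import Counter

def context_pairs(sentences, window=2, min_count=1):
    """Yield (center, context) pairs after dropping words rarer than min_count."""
    counts = Counter(word for sent in sentences for word in sent)
    for sent in sentences:
        kept = [w for w in sent if counts[w] >= min_count]
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield center, kept[j]

toy = [["the", "quick", "brown", "fox"], ["the", "lazy", "dog"]]
pairs = list(context_pairs(toy, window=1, min_count=1))
print(pairs[:4])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick')]
print(list(context_pairs(toy, window=1, min_count=2)))
# [] because only 'the' occurs twice, so no word has a neighbour left
```

A larger `window` produces more pairs per word, while a higher `min_count` removes rare words before any pairs are formed.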


In [11]:
# Train on text8; words seen fewer than 150 times are dropped from the vocabulary
model = Word2Vec(corpus, vector_size=100, window=5, min_count=150, workers=4)

In [19]:
# Find words similar to 'student'
similar_words = model.wv.most_similar('student', topn=10)
print("Words similar to 'student':", similar_words)

Words similar to 'student': [('graduate', 0.7455926537513733), ('teacher', 0.7073326706886292), ('students', 0.6708106398582458), ('undergraduate', 0.6443020701408386), ('faculty', 0.6319653987884521), ('professor', 0.622151255607605), ('school', 0.6202729940414429), ('harvard', 0.6047377586364746), ('bachelor', 0.6045958995819092), ('institution', 0.5951842665672302)]


## 6. Text Normalization

Text normalization is the process of transforming text into a more uniform format. This can include converting all characters to lowercase, removing punctuation, eliminating special characters, and correcting common misspellings. These steps can help in standardizing the text data and improving the performance of NLP models by reducing the number of unique tokens.

In [11]:
# Sample text
text = "This is an Example! It SHOULD be normalized, right? #NLP"

# Normalize text
normalized_text = text.lower().replace('!', '').replace(',', '').replace('?', '').replace('#', '')

print("Normalized Text:", normalized_text)


Normalized Text: this is an example it should be normalized right nlp
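
The chained `replace` calls above handle only the four characters that are listed explicitly. A more general variant, sketched here under the assumption that only ASCII punctuation needs to be removed, uses `string.punctuation` with `str.translate` and also collapses repeated whitespace. The function name `normalize` is hypothetical.

```python
import re
import string

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize("This is an Example! It SHOULD be normalized, right? #NLP"))
# this is an example it should be normalized right nlp
```

Removing all punctuation is not always desirable, for example when hashtags or contractions carry meaning, so the set of characters to strip should be chosen per task.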
