In [28]:
'''Objective:
To implement the operation of extracting the words (features) used in a sentence.
Theory:
Text Pre-Processing
Before feature engineering, we need to pre-process, clean, and normalize the text, as mentioned earlier. There are many pre-processing techniques, some of which are quite elaborate. This section does not go into much detail; the techniques are covered further in a later chapter on text classification and sentiment analysis. The following are some of the popular pre-processing techniques:
• Text tokenization and lower casing
• Removing special characters
• Contraction expansion
• Removing stopwords
• Correcting spellings
• Stemming
• Lemmatization
Some important terms:
Tokenization: Splitting of string data into constituent units. For example, splitting sentences into words or words into characters.
Stemming and lemmatization: These are normalization methods to bring words into their root or canonical forms. While stemming is a heuristic process to achieve the root form, lemmatization utilizes rules of grammar and vocabulary to derive the root.
Stopword Removal: Text contains tokens that occur at high frequency yet do not convey much information (punctuation, conjunctions, and so on). These tokens are usually removed to reduce dimensionality and noise in the data.
Corpora: The starting point of any text analytics process is collecting the documents of interest in a single dataset. This dataset is central to the next steps of processing and analysis. This collection of documents is generally called a corpus. Multiple corpus datasets are called corpora.
Tagging: Tagging involves taking a text corpus, tokenizing the text, and assigning metadata such as tags to each word in the corpus.
Chunking: Chunking is similar to parsing or tokenization, but instead of processing each word individually, it targets the phrases present in the document.

Prerequisites:
•	Basic knowledge of Python programming
•	Familiarity with string operations
Required Tools:
•	Python installed on computer

Method 1: Without using any dedicated library
Steps:
Step 1: Define the Punctuation Characters
A set of punctuation characters that should be removed from the sentence will be defined.
Step 2: Normalize the Sentence
The sentence will be converted to lowercase to standardize the words.
Step 3: Remove Punctuation
Any punctuation characters will be removed from the sentence.
Step 4: Tokenize the Sentence
The cleaned sentence will be split into individual words.
'''


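The terms defined above (tokenization, stopword removal, and stemming) can be illustrated with a small standard-library sketch. The stopword set and suffix list here are tiny hand-picked assumptions rather than a real lexicon, and `naive_stem` is a hypothetical toy helper, not the Porter stemmer or any NLTK API. It is meant only to show why stemming is called a heuristic: the stripped forms are not always real words.

```python
# Toy illustration of tokenization, stopword removal, and stemming (stdlib only).
# STOPWORDS and SUFFIXES are small hand-picked assumptions, not a real lexicon.
STOPWORDS = {"the", "is", "and", "a", "of"}
SUFFIXES = ("ing", "ed", "s")


def tokenize_words(text):
    """Split a string into lowercase word tokens on whitespace."""
    return text.lower().split()


def tokenize_chars(word):
    """Split a word into its individual characters."""
    return list(word)


def remove_stopwords(tokens):
    """Drop tokens that appear in the (assumed) stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(token):
    """Strip the first matching suffix; a crude heuristic that may yield non-words."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


tokens = tokenize_words("The dog is chasing the jumping foxes")
content = remove_stopwords(tokens)
stems = [naive_stem(t) for t in content]

print(tokens)                 # ['the', 'dog', 'is', 'chasing', 'the', 'jumping', 'foxes']
print(content)                # ['dog', 'chasing', 'jumping', 'foxes']
print(stems)                  # ['dog', 'chas', 'jump', 'foxe']
print(tokenize_chars("fox"))  # ['f', 'o', 'x']
```

Note that 'chas' and 'foxe' are not dictionary words. A lemmatizer, which consults grammar rules and a vocabulary, would be expected to return 'chase' and 'fox' instead.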

In [29]:
sentence = 'The starting point of any text analytics process is the process of collecting the documents of interest in a single dataset. This dataset is central to the next steps of processing and analysis. This collection of documents is generally called a corpus. Multiple corpus datasets are called corpora.'

In [30]:
def extract_words(sentence):
    # Step 1: Define the punctuation characters to remove.
    # A raw string keeps the backslash literal and avoids an invalid-escape warning.
    punctuation = r'''~`!@#$%^&*()_-+=[{]}|'"\;:,<.>/?'''
    # Step 2: Normalize the sentence by converting it to lowercase.
    sentence = sentence.lower()
    # Step 3: Remove punctuation by keeping only non-punctuation characters.
    cleaned_sentence = ''.join(char for char in sentence if char not in punctuation)
    # Step 4: Tokenize the cleaned sentence on whitespace.
    words = cleaned_sentence.split()
    return words

In [31]:
words = extract_words(sentence)
print('Extracted words:', words)

Extracted words: ['the', 'starting', 'point', 'of', 'any', 'text', 'analytics', 'process', 'is', 'the', 'process', 'of', 'collecting', 'the', 'documents', 'of', 'interest', 'in', 'a', 'single', 'dataset', 'this', 'dataset', 'is', 'central', 'to', 'the', 'next', 'steps', 'of', 'processing', 'and', 'analysis', 'this', 'collection', 'of', 'documents', 'is', 'generally', 'called', 'a', 'corpus', 'multiple', 'corpus', 'datasets', 'are', 'called', 'corpora']
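As an alternative to the character-by-character filtering in extract_words, the same lowercase, strip-punctuation, and split steps can be sketched with str.translate. This still relies only on the Python standard library. The function name extract_words_translate is a hypothetical one chosen for illustration, and string.punctuation covers ASCII punctuation only, so it is an assumption that the input contains no other symbols.

```python
import string


def extract_words_translate(sentence):
    """Lowercase, delete ASCII punctuation, and split on whitespace."""
    # With three arguments, str.maketrans maps every character in the
    # third argument to None, so translate() deletes those characters.
    table = str.maketrans("", "", string.punctuation)
    return sentence.lower().translate(table).split()


sample = "Multiple corpus datasets are called corpora."
print(extract_words_translate(sample))
# ['multiple', 'corpus', 'datasets', 'are', 'called', 'corpora']
```

The translation table is built once per call and applied in a single pass, which tends to be faster than building the string character by character on long inputs.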


In [32]:
#Method 2: Using re and nltk libraries
import pandas as pd 
import numpy as np
import re #regular expression
import nltk  #natural language toolkit

#Download necessary resources
nltk.download('stopwords')
nltk.download('punkt')

from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jalen\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jalen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [33]:
#Step 1: Load Data
corpus = ['The sky is blue and beautiful.',
'Love this blue and beautiful sky!',
'The quick brown fox jumps over the lazy dog.',
'The brown fox is quick and the blue dog is lazy!',
'The sky is very blue and the sky is very beautiful today',
'The dog is lazy but the brown fox is quick!']

labels = ['weather','weather', 'animals', 'animals', 'weather','animals']

In [34]:
#Create a DataFrame
corpus_df = pd.DataFrame({'Document': corpus, 'Category': labels})
print('Original Corpus:')
print(corpus_df)

Original Corpus:
                                            Document Category
0                     The sky is blue and beautiful.  weather
1                  Love this blue and beautiful sky!  weather
2       The quick brown fox jumps over the lazy dog.  animals
3   The brown fox is quick and the blue dog is lazy!  animals
4  The sky is very blue and the sky is very beaut...  weather
5        The dog is lazy but the brown fox is quick!  animals


In [35]:
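# Tokenizer that splits text into runs of word characters and runs of punctuation
wpt = WordPunctTokenizer()
# English stopword list, stored as a set for fast membership tests
stop_words = set(stopwords.words('english'))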
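To see what kind of tokens this tokenizer produces, the sketch below approximates its behaviour with a plain regular expression, so it runs without NLTK. It assumes the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of non-word, non-space characters), which is how WordPunctTokenizer is understood to be defined. The helper name word_punct_tokens is hypothetical.

```python
import re


def word_punct_tokens(text):
    """Approximate WordPunctTokenizer: word runs and punctuation runs become separate tokens."""
    return re.findall(r"\w+|[^\w\s]+", text)


print(word_punct_tokens("Love this blue and beautiful sky!"))
# ['Love', 'this', 'blue', 'and', 'beautiful', 'sky', '!']

print(word_punct_tokens("It isn't blue."))
# ['It', 'isn', "'", 't', 'blue', '.']
```

Because punctuation comes back as tokens of its own and contractions are split apart, special characters are usually stripped before tokenizing when only the words are wanted.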

In [40]:
# Normalize the text
def normalize_document(doc):
    # Remove special characters (anything other than letters, digits, and whitespace)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc)   # re.sub(pattern, repl, string, count=0, flags=0)
    # Lowercase and trim surrounding whitespace
    doc = doc.lower().strip()
    # Tokenize
    tokens = wpt.tokenize(doc)
    # Remove stopwords
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Reconstruct the document from the remaining tokens
    return ' '.join(filtered_tokens)
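The intermediate results of these steps can be traced on a single document. The sketch below is a stand-alone approximation, not the function above. It assumes a tiny hand-picked stopword set in place of the NLTK English list, and uses str.split in place of the tokenizer, which gives the same tokens here only because the punctuation has already been removed.

```python
import re

# Assumption: a tiny stand-in for the NLTK English stopword list.
SAMPLE_STOP_WORDS = {"the", "over", "is", "and"}

doc = "The quick brown fox jumps over the lazy dog."

# Remove everything except letters, digits, and whitespace
no_special = re.sub(r"[^a-zA-Z0-9\s]", "", doc)
# Lowercase and trim surrounding whitespace
lowered = no_special.lower().strip()
# Tokenize on whitespace (stand-in for the tokenizer)
tokens = lowered.split()
# Drop stopwords
kept = [t for t in tokens if t not in SAMPLE_STOP_WORDS]
# Reconstruct the document
result = " ".join(kept)

print(no_special)  # The quick brown fox jumps over the lazy dog
print(lowered)     # the quick brown fox jumps over the lazy dog
print(tokens)      # ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
print(kept)        # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
print(result)      # quick brown fox jumps lazy dog
```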

In [41]:
# Apply normalization to every document (np.vectorize maps the function element-wise)
normalize_corpus = np.vectorize(normalize_document)
corpus_df['Normalized_Document'] = normalize_corpus(corpus_df['Document'])

print("\nNormalized Corpus:")
print(corpus_df[['Normalized_Document', 'Category']])



Normalized Corpus:
              Normalized_Document Category
0              sky blue beautiful  weather
1         love blue beautiful sky  weather
2  quick brown fox jumps lazy dog  animals
3   brown fox quick blue dog lazy  animals
4    sky blue sky beautiful today  weather
5        dog lazy brown fox quick  animals
