This notebook uses the Natural Language Toolkit (NLTK) library.  
Text preprocessing is the cleaning and normalization of raw text data before it is analysed or passed to a model.  

1. Import libraries  

2. Text Lowercase    

3. Remove numbers  

4. Remove punctuation  

5. Remove default stopwords  

6. Stemming

7. Lemmatization  

8. Part of Speech Tagging

9. Chunking

10. Named Entity Recognition  

##### IMPORT NECESSARY LIBRARIES
nltk for text processing  
string for string operations  
re for regular expressions  

In [9]:
# Import necessary libraries
import nltk  # Natural Language Toolkit for text processing
import string  # For string operations
import re  # For regular expressions

##### TEXT LOWERCASE  
Convert the whole text to lower case.  
This creates consistency in the data: 'God' and 'god' are treated as the same token.  
It also reduces the vocabulary size, which simplifies later steps and can improve model efficiency.  

In [10]:
def text_lowercase(text):
    return text.lower()

input_str = "Oh my God she has 10 cubes left! I can't believe it. I thought that she had 15 ??"
text_lowercase(input_str)


"oh my god she has 10 cubes left! i can't believe it. i thought that she had 15 ??"

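For text that is not plain English, Python's built-in `str.casefold()` is a more aggressive alternative to `lower()`. The sketch below only illustrates the difference and is not part of the original pipeline; the function name `text_casefold` is made up for this example.

```python
def text_casefold(text):
    # casefold() lowercases and also folds special cases such as the German sharp s
    return text.casefold()

german = "Straße"
print(german.lower())         # the sharp s is kept
print(text_casefold(german))  # the sharp s is folded to "ss"
```

For English-only data, `lower()` and `casefold()` give the same result, so the choice matters mainly for multilingual text.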
##### REMOVE NUMBERS
Numbers can be handled in two ways:

+ Remove the numbers completely.  
This can be done using regular expressions.  

+ Convert the numbers to words.  
E.g. 3 to three.  
This can be done using the inflect library.  

In [11]:
# Remove numbers
def remove_numbers(text):
    result = re.sub(r'\d+', '', text)   #replace digits with empty string ''
    return result

input_str = "Oh my God she has 10 cubes left! I can't believe it. I thought that she had 15 ??"
remove_numbers(input_str)

# \d indicates digits 0-9
# + indicates "one or more occurrences" eg '1', '3672'

"Oh my God she has  cubes left! I can't believe it. I thought that she had  ??"

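Removing digits leaves double spaces behind, as the output above shows, and the pattern `\d+` splits a decimal such as 3.5 into two separate matches. A possible variant, sketched here under the assumption that decimals should be dropped as a whole, also collapses the leftover spaces. The name `remove_numbers_clean` is hypothetical.

```python
import re

def remove_numbers_clean(text):
    # drop integers and decimals such as 10 or 3.5
    no_numbers = re.sub(r'\d+(?:\.\d+)?', '', text)
    # collapse the runs of whitespace left behind and trim the ends
    return re.sub(r'\s+', ' ', no_numbers).strip()

cleaned = remove_numbers_clean("She has 10 cubes and 3.5 litres left")
print(cleaned)
```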
In [12]:
# import the inflect library
import inflect

#create engine object p, which provides access to the library's functions
p = inflect.engine()

def convert_number(text):
    # split string into list of words/tokens
    temp_str = text.split()
    # initialise empty list
    new_string = []

    # loop through each word in temp_str
    for word in temp_str:

        if word.isdigit():
            temp = p.number_to_words(word)
            new_string.append(temp)

        else:
            new_string.append(word)

    # join all words of new_string to form one string separated by spaces
    temp_str = ' '.join(new_string)
    return temp_str

input_str = "Oh my God she has 10 cubes left! I can't believe it. I thought that she had 15 ??"
convert_number(input_str)

"Oh my God she has ten cubes left! I can't believe it. I thought that she had fifteen ??"

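Note that `word.isdigit()` only matches tokens made up entirely of digits, so a token such as "15," is left unchanged. One way around this, sketched below, is to let `re.sub` call a converter for every run of digits. `SMALL_NUMBERS`, `lookup_words` and `convert_numbers_anywhere` are hypothetical names; the lookup table is a stand-in so the sketch runs without inflect, and in this notebook `p.number_to_words` could be passed as `to_words` instead.

```python
import re

# Hypothetical stand-in for inflect's number_to_words
SMALL_NUMBERS = {"10": "ten", "15": "fifteen"}

def lookup_words(number):
    # fall back to the digits themselves when the number is not in the table
    return SMALL_NUMBERS.get(number, number)

def convert_numbers_anywhere(text, to_words=lookup_words):
    # replace every run of digits, even when punctuation is attached to it
    return re.sub(r'\d+', lambda match: to_words(match.group()), text)

print(convert_numbers_anywhere("She had 10, then 15!"))
```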
##### REMOVE PUNCTUATION
This helps normalize the text.  
E.g. without this step, 'apple' and 'apple,' are treated as different tokens.

In [13]:
# remove punctuation
def remove_punctuation(text):
    # str.maketrans() creates a translation table
    # The two empty strings mean that no character is mapped to a replacement
    # The third argument, string.punctuation, lists the characters to delete
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

input_str = "Oh my God she has 10 cubes left! I can't believe it. I thought that she had 15 ??"
remove_punctuation(input_str)


'Oh my God she has 10 cubes left I cant believe it I thought that she had 15 '
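As the output shows, the apostrophe is removed too, so "can't" becomes "cant". If contractions should stay intact, the apostrophe can be left out of the characters to delete. This is a sketch of one option, and the function name `remove_punctuation_keep_apostrophes` is made up for the example.

```python
import string

def remove_punctuation_keep_apostrophes(text):
    # delete every punctuation character except the apostrophe
    to_delete = string.punctuation.replace("'", "")
    translator = str.maketrans('', '', to_delete)
    return text.translate(translator)

print(remove_punctuation_keep_apostrophes("I can't believe it. Really?!"))
```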

In [14]:
# remove whitespace from text
def remove_whitespace(text):
    return " ".join(text.split())

my_str = "we are going  tomorrow"
remove_whitespace(my_str)

'we are going tomorrow'
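The cleaning steps so far (lowercasing, removing numbers, removing punctuation, removing extra whitespace) can be chained into one function. The sketch below repeats the logic of the cells above in a single hypothetical helper, `preprocess`, so that it runs on its own. The order of the steps is an assumption, not a requirement.

```python
import re
import string

def preprocess(text):
    text = text.lower()                                               # lowercase
    text = re.sub(r'\d+', '', text)                                   # remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    return " ".join(text.split())                                     # collapse whitespace

input_str = "Oh my God she has 10 cubes left! I can't believe it. I thought that she had 15 ??"
print(preprocess(input_str))
```

Whitespace is collapsed last because removing numbers and punctuation can leave extra spaces behind.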

##### REMOVE DEFAULT STOPWORDS
stopwords from nltk.corpus contains certain common stopwords, such as 'the', 'is' and 'and'.  

In [None]:
# download the stopword lists and the punkt tokenizer models
# (newer NLTK releases may also require the 'punkt_tab' resource)
nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))    # set of English stopwords for fast lookup
    word_tokens = word_tokenize(text, language="english")   # tokenize text into a list of individual words
    filtered_text = [word for word in word_tokens if word.lower() not in stop_words]  # keep only non-stopwords
    return filtered_text

input_string = "Oh my God she has 10 cubes left! I can't believe it. I thought that she had 15 ??"
remove_stopwords(input_string)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
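The filtering step itself does not depend on where the stopword list comes from. The sketch below shows the same list comprehension with a small hand-written set, so it runs without any NLTK downloads. `CUSTOM_STOP_WORDS` and `filter_tokens` are hypothetical names, and `str.split()` stands in for `word_tokenize`.

```python
# Small hand-written set for illustration; NLTK's English list is much longer
CUSTOM_STOP_WORDS = {"i", "my", "she", "has", "it", "that", "had"}

def filter_tokens(tokens, stop_words):
    # compare in lower case so that "She" and "she" are both removed
    return [token for token in tokens if token.lower() not in stop_words]

tokens = "Oh my God she has cubes left".split()
print(filter_tokens(tokens, CUSTOM_STOP_WORDS))
```

The same function can be used to extend the default list, for example by passing the union of the NLTK stopwords and a set of domain-specific words.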


##### STEMMING 
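Stemming reduces a word to its **root** (stem) by stripping suffixes; the stem need not be a real word, e.g. 'science' becomes 'scienc'.  
We first convert the text into tokens, then convert the tokens to their root form.  
NLTK provides the Porter, Snowball and Lancaster stemmers; the Porter stemmer is used here.   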

In [17]:
from nltk.stem.porter import PorterStemmer

# initialize stemmer
stemmer = PorterStemmer()

# stem words in the list of tokenized words
def stem_words(text):
    word_tokens = text.split()
    stems = [stemmer.stem(word) for word in word_tokens]
    return stems

text = "data science uses scientific methods algorithms and many types of processes"
stem_words(text)

['data',
 'scienc',
 'use',
 'scientif',
 'method',
 'algorithm',
 'and',
 'mani',
 'type',
 'of',
 'process']
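To see how the three stemmers differ, the same words can be run through each of them. This is a minimal sketch, assuming NLTK is installed; the stemmers are rule-based and should not need any extra downloads. The helper name `compare_stemmers` is made up for this example.

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

def compare_stemmers(words):
    # map each word to the stem produced by every stemmer
    return {word: {name: s.stem(word) for name, s in stemmers.items()} for word in words}

for word, stems in compare_stemmers(["science", "processes", "many"]).items():
    print(word, stems)
```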

##### LEMMATIZATION  
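Lemmatization is an NLP technique that reduces a word to its dictionary form (lemma).  
Unlike stemming, it looks words up in a vocabulary (here WordNet), so the result is normally a real word rather than a truncated stem.  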

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt') # Download the 'punkt' resource
nltk.download('wordnet') # Download the 'wordnet' resource 

# initialize lemmatizer
lemmatizer = WordNetLemmatizer()

def lemma_words(text):
    word_tokens = word_tokenize(text)
    lemmas = [lemmatizer.lemmatize(word) for word in word_tokens]
    return lemmas
  
input_str = "Oh my God she has 10 cubes left! I can't believe it. I thought that she had 15 ??"
lemma_words(input_str)
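`lemmatizer.lemmatize(word)` treats every word as a noun unless a `pos` argument is given. If the tokens carry Penn Treebank tags, as in the part-of-speech section of this notebook, the tags can be mapped to the single-letter codes WordNet expects. The mapping below is a sketch; `penn_to_wordnet` is a hypothetical helper and the tagged pairs are hand-written for illustration, not tagger output.

```python
def penn_to_wordnet(tag):
    # WordNet uses single letters: 'a' adjective, 'v' verb, 'r' adverb, 'n' noun
    if tag.startswith('J'):
        return 'a'
    if tag.startswith('V'):
        return 'v'
    if tag.startswith('RB'):
        return 'r'
    return 'n'  # default, the same default that lemmatize() uses

# hand-written (word, tag) pairs for illustration
tagged = [("she", "PRP"), ("has", "VBZ"), ("cubes", "NNS")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])

# With the lemmatizer from the cell above, the mapping could be used as:
# [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tagged]
```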


##### PART OF SPEECH TAGGING
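Part-of-speech (POS) tagging labels each token with its grammatical category, such as noun, verb or adjective.  
This is necessary to understand the relationship between words.  
It also helps in disambiguating words that have more than one meaning.  
E.g. 'book' can be a verb (book a flight) or a noun (read a book).  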

In [None]:
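from nltk import pos_tag

# the tagger model must be downloaded once
# (newer NLTK releases may name it 'averaged_perceptron_tagger_eng')
nltk.download('averaged_perceptron_tagger')

# convert text into word_tokens with their tags
def pos_tagging(text):
    word_tokens = word_tokenize(text)
    return pos_tag(word_tokens)

pos_tagging('You can do anything if you put your mind to it')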

In [None]:
# download the tagset 
nltk.download('tagsets')

# extract information about the tag
nltk.help.upenn_tagset('NN')

##### CHUNKING  
