# Streamlined NLP Preprocessing: An All-in-One Python Function
Text preprocessing is a crucial step in Natural Language Processing (NLP): how you clean the text directly affects the quality of your analysis and results. To simplify these often cumbersome and repetitive tasks, I have written a Python function that bundles the essential preprocessing steps into one call. It operates on a pandas Series of strings and applies each step to every cell, so it works the same way on three reviews or on an entire DataFrame column.

## Preprocessing Steps
#### 1) Converting to Lowercase:
Standardizes the text by converting all characters to lowercase, so that "Love" and "love" are treated as the same word.
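
A small sketch of this step on plain Python strings. `safe_lower` is a hypothetical helper, not part of the function below; it is included because calling `str.lower` on a non-string cell (a missing value, for example) raises a `TypeError`.

```python
def safe_lower(value):
    """Lowercase strings; return any other value (e.g. NaN) unchanged."""
    return value.lower() if isinstance(value, str) else value


samples = ["Love this DRESS!", "love This dress!", float("nan")]
lowered = [safe_lower(s) for s in samples]

# The two differently cased reviews now compare equal.
print(lowered[0] == lowered[1])  # True
```

On a Series, the same helper could be applied with `col.apply(safe_lower)`.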

#### 2) Removing HTML: 
Strips any HTML tags present in the input text, which is useful for web-scraped data.
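
A minimal sketch of tag removal with the same non-greedy pattern (`<.*?>`) the function uses. The `html.unescape` call is my addition, not something the function does: it decodes entities such as `&amp;` that a tag-only regex leaves behind. The name `strip_html` is hypothetical.

```python
import html
import re

TAG_PATTERN = re.compile(r"<.*?>")  # non-greedy: matches one tag at a time


def strip_html(text):
    """Remove tags, then decode entities such as &amp; into plain characters."""
    without_tags = TAG_PATTERN.sub("", text)
    return html.unescape(without_tags)


raw = "<p>Soft &amp; <b>comfortable</b> fabric</p>"
clean = strip_html(raw)
print(clean)  # Soft & comfortable fabric
```

A regex like this is fine for simple markup, but it does not remove the contents of `<script>` or `<style>` elements, so heavily scripted pages may need a real HTML parser.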

#### 3) Removing Numbers:
Removes every digit character so that the analysis focuses on the words.
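
A quick sketch of what deleting digits does to real text, using the same `\d` pattern as the function. The helper name `remove_digits` is hypothetical. Note that only the digits disappear; the characters around them (apostrophes, spaces) stay until a later step handles them.

```python
import re


def remove_digits(text):
    """Delete every digit character, leaving all other characters in place."""
    return re.sub(r"\d", "", text)


height = remove_digits("i bought a petite and am 5'8")
order = remove_digits("order 2 of 10")

print(repr(height))  # "i bought a petite and am '"
print(repr(order))   # 'order  of '
```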

#### 4) Fixing Contractions:
Converts contractions (e.g., "don't" to "do not") into their expanded forms, which helps machine learning models that handle whole words better. This step runs before punctuation removal because it relies on the apostrophes still being present.
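
To illustrate the idea without the `contractions` package, here is a toy dictionary-based expander. Everything in it (the `CONTRACTIONS` mapping, `expand_contractions`) is hypothetical and far smaller than what the real library covers; it only shows the lookup-and-replace mechanism on lowercase text.

```python
import re

# Hypothetical, deliberately tiny mapping; the contractions package ships a far larger one.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "i'm": "i am"}

# One alternation of all known contractions, anchored at word boundaries.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)


def expand_contractions(text):
    """Replace each known contraction in lowercase text with its expanded form."""
    return PATTERN.sub(lambda match: CONTRACTIONS[match.group(0)], text)


expanded = expand_contractions("it's sooo pretty and i'm glad i don't own two")
print(expanded)  # it is sooo pretty and i am glad i do not own two
```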

#### 5) Removing Punctuation: 
Strips out punctuation marks to clean up the text further for analysis.
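
The sketch below shows why the order of steps matters: stripping punctuation from an unexpanded contraction leaves a non-word such as "its". It uses `str.translate`, an alternative to the character-by-character join in the function, and the name `strip_punctuation` is hypothetical. Only the ASCII characters in `string.punctuation` are removed, so typographic quotes would survive.

```python
import string

# Translation table that deletes every character in string.punctuation.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Remove all ASCII punctuation characters from text."""
    return text.translate(PUNCT_TABLE)


unexpanded = strip_punctuation("it's fun, flirty, and fabulous!")
expanded = strip_punctuation("it is fun, flirty, and fabulous!")
hyphenated = strip_punctuation("a well-known brand")

print(unexpanded)  # its fun flirty and fabulous
print(expanded)    # it is fun flirty and fabulous
print(hyphenated)  # a wellknown brand
```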

#### 6) Tokenization: 
Splits the text into individual tokens (words), which is the input format expected by the vectorization techniques used in machine learning.
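
As a rough illustration, the sketch below tokenizes already-cleaned text by splitting on whitespace and then counts the tokens, which is the starting point of a bag-of-words vector. `simple_tokenize` is a hypothetical stand-in and not equivalent to NLTK's `word_tokenize`, which also separates punctuation and clitics.

```python
from collections import Counter


def simple_tokenize(text):
    """Split on any run of whitespace; a rough stand-in for word_tokenize."""
    return text.split()


tokens = simple_tokenize("love love this  jumpsuit")
counts = Counter(tokens)

print(tokens)  # ['love', 'love', 'this', 'jumpsuit']
print(counts["love"])  # 2
```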

#### 7) Removing Stop Words (SW): 
Discards common stop words (e.g., "and", "the", "is") which typically do not contribute meaningful information for analysis.
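
A minimal sketch of stop-word filtering on a token list. The `STOP_WORDS` set here is a hypothetical handful of words chosen for illustration; NLTK's English list is much longer. Storing the words in a `set` keeps each membership test cheap, which is why the function below also converts NLTK's list to a set.

```python
# Hypothetical stop-word subset for illustration; NLTK's English list is much longer.
STOP_WORDS = {"and", "the", "is", "this", "it", "a"}


def remove_stop_words(tokens):
    """Keep only the tokens that are not in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]


tokens = ["love", "this", "dress", "it", "is", "sooo", "pretty"]
filtered = remove_stop_words(tokens)
print(filtered)  # ['love', 'dress', 'sooo', 'pretty']
```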

#### 8) Stemming: 
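Reduces words to their stems by stripping suffixes (e.g., “running” to “run”), so that a model treats variations of a word as the same term. Stems are not always dictionary words: the Porter stemmer used here turns “studying” into “studi”.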
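
A short sketch with NLTK's `PorterStemmer`, the stemmer the function uses. It assumes `nltk` is installed; the stemmer itself is rule-based and needs no corpus download. The word list is my own choice for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()  # rule-based, so no nltk.download() call is needed

words = ["running", "studies", "studying"]
stems = [stemmer.stem(w) for w in words]

# Print each word next to its stem.
for word, stem in zip(words, stems):
    print(word, "->", stem)
```

"studies" and "studying" collapse to the same stem, which is the point of the step even though that stem is not itself an English word.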

#### 9) Lemmatization:
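Maps each word to its dictionary form (lemma) instead of just trimming suffixes (e.g., “better” to “good”). That particular example only works when the lemmatizer is told the word is an adjective; the function below calls `lemmatize` without a part-of-speech tag, so every token is treated as a noun and it is mainly plurals that get reduced (e.g., “compliments” to “compliment”).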

In [1]:
import re
import nltk
import string
import pandas as pd
import contractions
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
# Newer NLTK releases may also require nltk.download('punkt_tab') for word_tokenize.
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/melikamolaei/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/melikamolaei/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/melikamolaei/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
def final_preprocess(col, converting_to_lowercase=True, removing_html=True,
                     removing_number=True, fixing_contractions=True,
                     removing_punc=True, tokenization=True, removing_sw=True,
                     stemming=True, lemmatization=True):
    """Apply the selected preprocessing steps, in order, to a pandas Series of strings."""
    
    def convert_to_lowercase(col):
        return col.apply(str.lower)
    
    def remove_html(col):
        
        def remove_tags(a):
            html_pattern = re.compile('<.*?>')
            return html_pattern.sub(r'', a)
            
        return col.apply(remove_tags)
    
    def remove_number(col):
        
        def nreplace(a):
            return re.sub(r'\d', "", a)
            
        return col.apply(nreplace)
    
    def fix_contractions(col):
        
        def cont(a):
            return contractions.fix(a)
            
        return col.apply(cont)
    
    def remove_punc(col):
        
        def no_punc(a):
            # Keep every character that is not ASCII punctuation.
            return "".join(ch for ch in a if ch not in string.punctuation)
            
        return col.apply(no_punc)
        
    def token(col):
        return col.apply(word_tokenize)
    
    def remove_sw(col):
        
        sw = set(stopwords.words('english'))

        def no_sw(a):
            # a is a list of tokens; keep the ones that are not stop words.
            return [i for i in a if i not in sw]
            
        return col.apply(no_sw)
    
    def stem_list(col):
        
        ps = PorterStemmer()  # build the stemmer once instead of once per token

        def stem_words(a):
            return [ps.stem(i) for i in a]
            
        return col.apply(stem_words)
    
    def lemmatization_list(col):
        
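        wnl = WordNetLemmatizer()  # build the lemmatizer once instead of once per token

        def lemmatization_word(a):
            # lemmatize() treats every token as a noun unless a pos argument is given.
            return [wnl.lemmatize(i) for i in a]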
            
        return col.apply(lemmatization_word)

    
    if converting_to_lowercase:
        col= convert_to_lowercase(col)
    if removing_html:
        col= remove_html(col)
    if removing_number:
        col= remove_number(col)
    if fixing_contractions:
        col= fix_contractions(col)
    if removing_punc:
        col= remove_punc(col)
    if tokenization:
        col= token(col)
    if removing_sw:
        col= remove_sw(col)
    if stemming:
        col= stem_list(col)
    if lemmatization:
        col= lemmatization_list(col)
        
    return col

The beauty of this function lies in its flexibility. You can cherry-pick the preprocessing steps that suit your specific needs. Want to skip converting to lowercase? No problem, just set `converting_to_lowercase` to `False`. Need to remove HTML tags but keep numbers intact? Easy, set `removing_html` to `True` and `removing_number` to `False`. You get the idea.

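By default, all steps are set to `True`, so if you don't specify otherwise, the function runs every step in the order listed above. To customize the process, just pass the relevant arguments with the desired values. Two caveats apply. First, stop-word removal, stemming, and lemmatization expect lists of tokens, so leave `tokenization` set to `True` whenever you use any of them. Second, with both `stemming` and `lemmatization` enabled, the lemmatizer receives stems rather than the original words, so you will usually want to switch one of them off, as the example below does with `stemming=False`.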
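
One practical consequence of the toggles is that the token-level steps depend on tokenization having run first. The sketch below shows what happens otherwise, plus a hypothetical guard (`check_flags` is not part of the function above) that could be called at the top of `final_preprocess` to fail early instead of silently returning characters.

```python
def check_flags(tokenization, removing_sw, stemming, lemmatization):
    """Raise early if a token-level step is requested without tokenization."""
    token_steps = {
        "removing_sw": removing_sw,
        "stemming": stemming,
        "lemmatization": lemmatization,
    }
    requested = [name for name, enabled in token_steps.items() if enabled]
    if requested and not tokenization:
        raise ValueError(f"{', '.join(requested)} need tokenization=True")


# Why the guard matters: iterating over a string yields characters, not words,
# so a stop-word filter applied to an untokenized string removes nothing useful.
stop_words = {"the"}
chars = [t for t in "the cat" if t not in stop_words]
print(chars)  # ['t', 'h', 'e', ' ', 'c', 'a', 't']
```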

In [3]:
txt1 = "Absolutely wonderful - silky and sexy and comfortable"
txt2 = "Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8.  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite."
txt3 = "I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!"
test = pd.Series([txt1, txt2, txt3])
pp_test = final_preprocess(test, stemming=False)

In [4]:
for i in range(3):
    print("without pp: ", test[i])
    print("with pp: ", pp_test[i])
    print()

without pp:  Absolutely wonderful - silky and sexy and comfortable
with pp:  ['absolutely', 'wonderful', 'silky', 'sexy', 'comfortable']

without pp:  Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8.  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.
with pp:  ['love', 'dress', 'sooo', 'pretty', 'happened', 'find', 'store', 'glad', 'never', 'would', 'ordered', 'online', 'petite', 'bought', 'petite', 'love', 'length', 'hit', 'little', 'knee', 'would', 'definitely', 'true', 'midi', 'someone', 'truly', 'petite']

without pp:  I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!
with pp:  ['love', 'love', 'love', 'jumpsuit', 'fun', 'flirty', 'fabulous', 'every', 'time', 'wear', 'get', 'nothing', 'great', 'compliment']

