4.1 Introduction
Data parsing and text cleaning are essential when working with textual data. Raw text often contains inconsistencies and unnecessary information, and specific parts frequently need to be extracted before the data is useful for analysis.

4.2 Techniques
In this chapter, we'll cover several techniques:

1. Extracting Meaningful Components: Extract specific information from text data, such as dates, names, or numbers.
2. Cleaning Text Data: Remove special characters, convert text to lowercase, and remove extra spaces.
3. Removing Stop Words: Remove common words that do not add significant meaning to the text, such as "and," "the," or "is."
4. Stemming and Lemmatization: Reduce words to a stem or to their base (dictionary) form.
5. Handling Contractions: Expand contracted words (e.g., "don't" to "do not").
6. Removing HTML Tags: Clean text data from HTML tags if scraping from web pages.
7. Removing Numerical Data: Remove numbers from text when they are not needed for analysis.

4.2.1 Extracting Meaningful Components

Introduction:
Extracting meaningful components from text data is often the first step in text data processing. This process involves isolating specific information within text, such as dates, phone numbers, email addresses, or any other relevant patterns. These components can be crucial for analysis, reporting, or further data processing.
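
As a minimal sketch of the idea, the snippet below pulls email addresses and phone numbers out of a sentence. The two patterns are deliberately simplified assumptions (they are not full validators), and the sample text is hypothetical rather than taken from the dataset.

```python
import re

# Simplified, illustrative patterns (assumptions, not complete validators)
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')
PHONE_RE = re.compile(r'\b\d{3}-\d{3}-\d{4}\b')  # matches only the NNN-NNN-NNNN layout

def extract_components(text):
    """Return the email addresses and phone numbers found in text."""
    return {
        'emails': EMAIL_RE.findall(text),
        'phones': PHONE_RE.findall(text),
    }

# Hypothetical sample text, not a row from Products.csv
sample = "Contact sales@example.com or call 555-123-4567 for details."
result = extract_components(sample)
print(result)
```

Each kind of component gets its own pattern, so new ones (order numbers, postal codes, and so on) can be added without touching the others.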

Task:
Let's start with extracting dates from a 'Description' column in a dataset. We'll use regular expressions to identify and extract any date formats within the text.

In [1]:
import pandas as pd
import re

# Load the data
df = pd.read_csv('D:/Projects/Data-cleaning-series/Chapter04 Data Parsing and Text Data Cleaning/Products.csv')

# Example of the dataset with text data
print("Original DataFrame:")
print(df.to_string(index=False))

# Extracting Dates from 'Description' column
df['Dates'] = df['Description'].apply(lambda x: re.findall(r'\d{4}-\d{2}-\d{2}', x) if pd.notnull(x) else [])

# Display the DataFrame after extracting dates
print("\nDataFrame After Extracting Dates:")
print(df.to_string(index=False))


Original DataFrame:
 Product ID Product Name  Price    Category  Stock              Description
          1     Widget A  19.99 Electronics  100.0    A high-quality widget
          2     Widget B  29.99 Electronics    NaN                      NaN
          3          NaN  15.00  Home Goods   50.0      Durable and stylish
          4     Widget D    NaN  Home Goods  200.0       A versatile widget
          5     Widget E   9.99         NaN   10.0    Compact and efficient
          6     Widget F  25.00 Electronics    0.0 Latest technology widget
          7     Widget G    NaN     Kitchen  150.0     Multi-purpose widget
          8     Widget H  39.99     Kitchen   75.0          Premium quality
          9     Widget I    NaN Electronics    NaN        Advanced features
         10     Widget J  49.99 Electronics   60.0            Best in class

DataFrame After Extracting Dates:
 Product ID Product Name  Price    Category  Stock              Description Dates
          1     Widget A  1

Explanation:

Regular Expressions (re): We're using the re library to search for patterns in the text. The pattern \d{4}-\d{2}-\d{2} is looking for dates in the format YYYY-MM-DD.

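Lambda Function: We apply a lambda function to each value in the 'Description' column. re.findall returns a list of every substring that matches the date pattern, and that list is stored in the new 'Dates' column. None of the sample descriptions contains a date, so every row gets an empty list.

Handling Missing Values: If the 'Description' is missing (NaN), the lambda returns an empty list instead of passing NaN to re.findall, which would raise a TypeError.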
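
Because the sample descriptions contain no dates, the following sketch uses a few hypothetical strings to show what the pattern actually returns. It also adds one step that the cell above does not perform: the pattern \d{4}-\d{2}-\d{2} only checks the shape of the text, so the helper (a name introduced here for illustration) passes each candidate through datetime.strptime to discard impossible dates such as 2024-13-45.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r'\d{4}-\d{2}-\d{2}')

def extract_valid_dates(text):
    """Return the YYYY-MM-DD substrings of text that are real calendar dates."""
    if text is None:  # stand-in for a missing value
        return []
    valid = []
    for candidate in DATE_RE.findall(text):
        try:
            datetime.strptime(candidate, '%Y-%m-%d')
        except ValueError:
            continue  # right shape, but not a real date
        valid.append(candidate)
    return valid

# Hypothetical descriptions, not rows from Products.csv
descriptions = [
    "Released 2023-05-14, restocked 2024-01-09",
    "Invalid date 2024-13-45 in text",
    "A high-quality widget",
    None,
]
dates = [extract_valid_dates(d) for d in descriptions]
print(dates)
```

With a DataFrame, the missing-value check would use pd.isnull as in the cell above, since pandas represents missing text as NaN rather than None.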

4.2.2 Cleaning Text Data

Introduction:
Cleaning text data is a crucial step in preparing data for analysis or machine learning models. It involves removing or correcting unwanted characters, formatting inconsistencies, and noise from the text. This process helps ensure that the data is in a consistent and usable format.

Task:
We'll clean the 'Description' column in our dataset by performing the following tasks:

Removing special characters
Converting text to lowercase
Removing extra spaces

In [5]:
import pandas as pd
import re

# Load the data
df = pd.read_csv('D:/Projects/Data-cleaning-series/Chapter04 Data Parsing and Text Data Cleaning/Products.csv')

# Example of the dataset with text data
print("Original DataFrame:")
print(df.to_string(index=False))

# Function to clean text data
def clean_text(text):
    if pd.isnull(text):
        return text
    # Remove special characters
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply the cleaning function to the 'Description' column
df['Cleaned_Description'] = df['Description'].apply(clean_text)

# Display the DataFrame after cleaning text data
print("\nDataFrame After Cleaning Text Data:")
print(df.to_string(index=False))


Original DataFrame:
 Product ID Product Name  Price    Category  Stock              Description
          1     Widget A  19.99 Electronics  100.0    A high-quality widget
          2     Widget B  29.99 Electronics    NaN                      NaN
          3          NaN  15.00  Home Goods   50.0      Durable and stylish
          4     Widget D    NaN  Home Goods  200.0       A versatile widget
          5     Widget E   9.99         NaN   10.0    Compact and efficient
          6     Widget F  25.00 Electronics    0.0 Latest technology widget
          7     Widget G    NaN     Kitchen  150.0     Multi-purpose widget
          8     Widget H  39.99     Kitchen   75.0          Premium quality
          9     Widget I    NaN Electronics    NaN        Advanced features
         10     Widget J  49.99 Electronics   60.0            Best in class

DataFrame After Cleaning Text Data:
 Product ID Product Name  Price    Category  Stock              Description      Cleaned_Description
      

Explanation:

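Removing Special Characters: The regular expression r'[^A-Za-z0-9\s]' deletes every character that is not an ASCII letter, a digit, or whitespace. Deleted characters are not replaced by a space, so a hyphenated word such as "high-quality" becomes "highquality", and accented letters are removed as well.

Converting to Lowercase: We convert the text to lowercase so that the same word is not treated as different values because of capitalization.

Removing Extra Spaces: The regular expression r'\s+' matches any run of whitespace (spaces, tabs, or newlines) and replaces it with a single space, and strip() removes any leading or trailing whitespace.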

This process cleans the text data, making it more consistent and ready for further processing or analysis.
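
The short comparison below traces the three steps on one string and shows a possible variant that replaces special characters with a space instead of deleting them, which keeps hyphenated words apart. The function names and the sample string are assumptions made for this sketch; the first function mirrors the steps of clean_text without the missing-value check.

```python
import re

def clean_text_delete(text):
    """Same three steps as clean_text: special characters are deleted."""
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    text = text.lower()
    return re.sub(r'\s+', ' ', text).strip()

def clean_text_replace(text):
    """Variant: special characters become a space, so word boundaries survive."""
    text = re.sub(r'[^A-Za-z0-9\s]', ' ', text)
    text = text.lower()
    return re.sub(r'\s+', ' ', text).strip()

sample = "  A high-quality   widget! "
print(clean_text_delete(sample))   # a highquality widget
print(clean_text_replace(sample))  # a high quality widget
```

Which behavior is preferable depends on the analysis: deleting keeps "highquality" as a single token, while replacing yields two ordinary words that later steps such as stop word removal can handle individually.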

4.2.3 Removing Stop Words

Introduction:
Stop words are common words that usually carry little meaningful information for text analysis, such as "and," "the," "is," etc. Removing stop words can help focus on the more important words in the text and improve the quality of text analysis and modeling.

Task:
We'll remove stop words from the 'Description' column in our dataset using the Natural Language Toolkit (nltk).

In [14]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Load the data
df = pd.read_csv('D:/Projects/Data-cleaning-series/Chapter04 Data Parsing and Text Data Cleaning/Products.csv')

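# Download the stop-word list and tokenizer data if not already present
nltk.download('stopwords')
nltk.download('punkt')
# Note: recent NLTK releases may look for 'punkt_tab' instead of 'punkt';
# if word_tokenize raises a LookupError, run nltk.download('punkt_tab')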

# Function to clean text data
def clean_text(text):
    if pd.isnull(text):
        return text
    # Remove special characters
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply the cleaning function to the 'Description' column
df['Cleaned_Description'] = df['Description'].apply(clean_text)

# Verify the DataFrame with the new column
print("DataFrame After Cleaning Text Data:")
print(df[['Description', 'Cleaned_Description']].to_string(index=False))

# Get the set of English stop words
stop_words = set(stopwords.words('english'))

# Function to remove stop words
def remove_stop_words(text):
    if pd.isnull(text):
        return text
    # Tokenize the text
    words = word_tokenize(text)
    # Remove stop words
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)

# Apply the remove_stop_words function to the 'Cleaned_Description' column
df['Description_No_Stop_Words'] = df['Cleaned_Description'].apply(remove_stop_words)

# Display the DataFrame after removing stop words
print("\nDataFrame After Removing Stop Words:")
print(df[['Cleaned_Description', 'Description_No_Stop_Words']].to_string(index=False))


DataFrame After Cleaning Text Data:
             Description      Cleaned_Description
   A high-quality widget     a highquality widget
                     NaN                      NaN
     Durable and stylish      durable and stylish
      A versatile widget       a versatile widget
   Compact and efficient    compact and efficient
Latest technology widget latest technology widget
    Multi-purpose widget      multipurpose widget
         Premium quality          premium quality
       Advanced features        advanced features
           Best in class            best in class

DataFrame After Removing Stop Words:
     Cleaned_Description Description_No_Stop_Words
    a highquality widget        highquality widget
                     NaN                       NaN
     durable and stylish           durable stylish
      a versatile widget          versatile widget
   compact and efficient         compact efficient
latest technology widget  latest technology widget
     multipurpose w



Explanation:

Tokenization: word_tokenize splits the text into individual tokens (words and punctuation marks).

Stop Words Removal: We keep only the tokens whose lowercase form is not in the set of English stop words.

Reconstruction: The remaining words are joined back into a single string.

This technique helps in focusing on the significant terms by removing common words that do not contribute much to the analysis.
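
The filtering step itself does not depend on NLTK. The sketch below applies the same tokenize, filter, and join logic with a tiny hand-picked stop word set and a plain split() on whitespace; both are simplifying assumptions for illustration, and the real NLTK list is much longer.

```python
# Tiny illustrative stop-word set (an assumption; NLTK's English list is far larger)
STOP_WORDS = {'a', 'an', 'and', 'the', 'is', 'in', 'of'}

def remove_stop_words_simple(text):
    """Drop stop words from already-cleaned text and rejoin the rest."""
    words = text.split()  # whitespace split stands in for word_tokenize
    kept = [word for word in words if word.lower() not in STOP_WORDS]
    return ' '.join(kept)

samples = ['a highquality widget', 'durable and stylish', 'best in class']
filtered = [remove_stop_words_simple(s) for s in samples]
print(filtered)
```

Storing the stop words in a set keeps each membership test fast, which matters once the column holds many rows of longer text.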

4.2.4 Stemming and Lemmatization

Introduction:
Stemming and lemmatization both reduce words to a base form. Stemming strips suffixes using heuristic rules, so the result is not always a real word (for example, "durable" becomes "durabl"). Lemmatization maps each word to its dictionary form using a vocabulary and linguistic rules. Both techniques normalize text so that different forms of the same word are treated as one term.

Task:
We'll apply stemming and lemmatization to the 'Description' column in our dataset.

In [17]:
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Load the data
df = pd.read_csv('D:/Projects/Data-cleaning-series/Chapter04 Data Parsing and Text Data Cleaning/Products.csv')

# Example of the dataset with text data
print("Original DataFrame:")
print(df.to_string(index=False))

# Download necessary nltk resources if not already present
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('stopwords')

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Create the 'Cleaned_Description' column: clean up the text
df['Cleaned_Description'] = df['Description'].fillna('').str.strip()

# Build the stop-word set once instead of on every function call
stop_words = set(stopwords.words('english'))

# Function to remove stop words
def remove_stop_words(text):
    if pd.isnull(text) or text == '':
        return text
    words = word_tokenize(text)
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)

# Apply the remove_stop_words function to the 'Cleaned_Description' column
df['Description_No_Stop_Words'] = df['Cleaned_Description'].apply(remove_stop_words)

# Function for stemming
def apply_stemming(text):
    if pd.isnull(text) or text == '':
        return text
    words = text.split()
    stemmed_words = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed_words)

# Function for lemmatization
def apply_lemmatization(text):
    if pd.isnull(text) or text == '':
        return text
    words = text.split()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)

# Apply stemming and lemmatization to the 'Description_No_Stop_Words' column
df['Description_Stemmed'] = df['Description_No_Stop_Words'].apply(apply_stemming)
df['Description_Lemmatized'] = df['Description_No_Stop_Words'].apply(apply_lemmatization)

# Display the DataFrame after stemming and lemmatization
print("\nDataFrame After Stemming and Lemmatization:")
print(df.to_string(index=False))


Original DataFrame:
 Product ID Product Name  Price    Category  Stock              Description
          1     Widget A  19.99 Electronics  100.0    A high-quality widget
          2     Widget B  29.99 Electronics    NaN                      NaN
          3          NaN  15.00  Home Goods   50.0      Durable and stylish
          4     Widget D    NaN  Home Goods  200.0       A versatile widget
          5     Widget E   9.99         NaN   10.0    Compact and efficient
          6     Widget F  25.00 Electronics    0.0 Latest technology widget
          7     Widget G    NaN     Kitchen  150.0     Multi-purpose widget
          8     Widget H  39.99     Kitchen   75.0          Premium quality
          9     Widget I    NaN Electronics    NaN        Advanced features
         10     Widget J  49.99 Electronics   60.0            Best in class





DataFrame After Stemming and Lemmatization:
 Product ID Product Name  Price    Category  Stock              Description      Cleaned_Description Description_No_Stop_Words     Description_Stemmed   Description_Lemmatized
          1     Widget A  19.99 Electronics  100.0    A high-quality widget    A high-quality widget       high-quality widget        high-qual widget      high-quality widget
          2     Widget B  29.99 Electronics    NaN                      NaN                                                                                                    
          3          NaN  15.00  Home Goods   50.0      Durable and stylish      Durable and stylish           Durable stylish          durabl stylish          Durable stylish
          4     Widget D    NaN  Home Goods  200.0       A versatile widget       A versatile widget          versatile widget         versatil widget         versatile widget
          5     Widget E   9.99         NaN   10.0    Compact and efficient

Explanation:

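Creating Cleaned_Description: We fill any missing values in the Description column with empty strings and strip any leading or trailing whitespace. Unlike clean_text in Section 4.2.2, this step keeps case and punctuation, which is why "high-quality" retains its hyphen in the output.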

Removing Stop Words: We apply the remove_stop_words function to the Cleaned_Description column.

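Applying Stemming and Lemmatization: We process the text in Description_No_Stop_Words to generate stemmed and lemmatized versions. The Porter stemmer lowercases each word and strips suffixes, which can leave non-words such as "durabl" and "versatil". WordNetLemmatizer.lemmatize treats every word as a noun unless a part-of-speech tag is passed, so the words in the rows shown above come back unchanged.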
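
To make the difference between the two approaches concrete without downloading any NLTK data, here is a toy sketch. The suffix rules and the lookup table are invented for illustration and are far simpler than PorterStemmer and WordNetLemmatizer; they only show the mechanism, rule-based suffix stripping versus dictionary lookup.

```python
# Toy rules and lookup table, invented for this sketch (not NLTK's algorithms)
SUFFIXES = ('ing', 'ed', 'es', 's')
LEMMA_TABLE = {'features': 'feature', 'widgets': 'widget',
               'better': 'good', 'ran': 'run'}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a small dictionary; unknown words are unchanged."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

words = ['features', 'widgets', 'better', 'advanced']
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)   # ['featur', 'widget', 'better', 'advanc']
print(lemmas)  # ['feature', 'widget', 'good', 'advanced']
```

The stems "featur" and "advanc" are not dictionary words, while every lemma is. The lemmatizer can also link irregular forms such as "better" and "good", but only for words its dictionary knows.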

4.2.5 Handling Dates and Times

Introduction:
Handling dates and times involves parsing, formatting, and extracting useful information from date and time fields. This is essential for analyzing time-based data, performing time series analysis, and ensuring consistency in date-time formats.
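
As a small illustration of parsing and reformatting, the sketch below converts dates written in several layouts into a single YYYY-MM-DD form. The list of accepted formats is an assumption about the data (in particular, it reads 15/03/2024 as day first), and the %B month name is matched according to the current locale, assumed here to be English.

```python
from datetime import datetime

# Candidate input formats, tried in order (an assumption about the data)
KNOWN_FORMATS = ('%Y-%m-%d', '%d/%m/%Y', '%B %d, %Y')

def to_iso_date(text):
    """Return text as 'YYYY-MM-DD', or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime('%Y-%m-%d')
        except ValueError:
            continue  # try the next format
    return None

raw = ['2024-03-15', '15/03/2024', 'March 15, 2024', 'not a date']
iso = [to_iso_date(value) for value in raw]
print(iso)
```

Trying formats in a fixed order keeps the behavior predictable: an ambiguous value is always interpreted by the first format that accepts it.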

Task:
We'll parse dates from the 'Description' column, convert them to a standard datetime format, and extract features such as year, month, and day.

In [18]:
import pandas as pd
from datetime import datetime

# Load the data
df = pd.read_csv('D:/Projects/Data-cleaning-series/Chapter04 Data Parsing and Text Data Cleaning/Products.csv')

# Example of the dataset with text data
print("Original DataFrame:")
print(df.to_string(index=False))

# Function to parse and extract date features
def parse_dates(text):
    if pd.isnull(text):
        return pd.Series([None, None, None])  # Year, Month, Day
    try:
        # strptime requires the whole string to be a 'YYYY-MM-DD' date;
        # any other text raises ValueError and is handled below
        date = datetime.strptime(text, '%Y-%m-%d')
        return pd.Series([date.year, date.month, date.day])
    except ValueError:
        return pd.Series([None, None, None])  # In case of parsing errors

# Apply the function to extract date features
df[['Year', 'Month', 'Day']] = df['Description'].apply(parse_dates)

# Display the DataFrame after extracting date features
print("\nDataFrame After Extracting Date Features:")
print(df.to_string(index=False))


Original DataFrame:
 Product ID Product Name  Price    Category  Stock              Description
          1     Widget A  19.99 Electronics  100.0    A high-quality widget
          2     Widget B  29.99 Electronics    NaN                      NaN
          3          NaN  15.00  Home Goods   50.0      Durable and stylish
          4     Widget D    NaN  Home Goods  200.0       A versatile widget
          5     Widget E   9.99         NaN   10.0    Compact and efficient
          6     Widget F  25.00 Electronics    0.0 Latest technology widget
          7     Widget G    NaN     Kitchen  150.0     Multi-purpose widget
          8     Widget H  39.99     Kitchen   75.0          Premium quality
          9     Widget I    NaN Electronics    NaN        Advanced features
         10     Widget J  49.99 Electronics   60.0            Best in class

DataFrame After Extracting Date Features:
 Product ID Product Name  Price    Category  Stock              Description Year Month  Day
         

Explanation:

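Date Parsing: The datetime.strptime function converts text into a datetime object using the '%Y-%m-%d' (YYYY-MM-DD) format. The entire string must match that format, so a description that is free text, or that contains a date alongside other words, raises ValueError. None of the sample descriptions is a bare date, so Year, Month, and Day are empty for every row.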

Extract Features: We extract year, month, and day from the datetime object and create new columns for each.

Handling Parsing Errors: If the text cannot be parsed into a date, we return None values for the date features.

This technique helps standardize and extract useful information from date and time fields, making it easier to perform time-based analysis.
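
When the date is embedded in longer text, one possible approach is to locate it with a regular expression first and only then hand the matched substring to strptime. The helper below is a sketch under that assumption; its name and the sample strings are hypothetical.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r'\d{4}-\d{2}-\d{2}')

def extract_date_features(text):
    """Return (year, month, day) for the first YYYY-MM-DD match in text.

    Returns (None, None, None) when text is missing, has no match,
    or the match is not a real calendar date.
    """
    if not isinstance(text, str):  # covers None and NaN
        return (None, None, None)
    match = DATE_RE.search(text)
    if match is None:
        return (None, None, None)
    try:
        date = datetime.strptime(match.group(), '%Y-%m-%d')
    except ValueError:
        return (None, None, None)
    return (date.year, date.month, date.day)

print(extract_date_features("Restocked on 2024-03-15 in bulk"))  # (2024, 3, 15)
print(extract_date_features("A high-quality widget"))            # (None, None, None)
```

To fill the Year, Month, and Day columns, the returned tuple can be wrapped in pd.Series inside the function passed to apply, in the same way parse_dates does above.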