4.1 Introduction
Data parsing and text cleaning are essential techniques when working with textual data. Raw text often contains inconsistencies or unnecessary information, and specific parts frequently have to be extracted before the data is useful for analysis.

4.2 Techniques
In this chapter, we'll cover several techniques:

1. Extracting Meaningful Components: Extract specific information from text data, such as dates, names, or numbers.
2. Removing Special Characters: Clean the text by removing punctuation, symbols, and other non-alphanumeric characters.
3. Lowercasing: Convert all text to lowercase to ensure uniformity.
4. Removing Stop Words: Remove common words that do not add significant meaning to the text, such as "and," "the," or "is."
5. Lemmatization: Convert words to their base or dictionary form.
6. Handling Contractions: Expand contracted words (e.g., "don't" to "do not").
7. Removing HTML Tags: Strip HTML tags from text that was scraped from web pages.
8. Removing Numerical Data: Remove numbers from text when they are not needed for analysis.
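
As a rough preview, several of the steps above can be sketched with nothing but the standard library. The function name basic_clean, the tiny stop-word set, and the sample sentence are illustrative assumptions, not part of the chapter's dataset. Lemmatization and contraction handling are left out because they normally rely on a dictionary or an external library.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"and", "the", "is", "a", "in"}

def basic_clean(text: str) -> str:
    """Apply a few of the listed cleaning steps, one after another."""
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags (naive pattern)
    text = text.lower()                     # lowercase
    text = re.sub(r"\d+", " ", text)        # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)   # remove punctuation and symbols
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # drop stop words
    return " ".join(tokens)

cleaned = basic_clean("<p>The Widget is SMALL, and costs 20 dollars!</p>")
print(cleaned)
```

The order of the steps matters: lowercasing comes before the stop-word check so that "The" and "the" are treated alike, and contractions would need to be expanded before punctuation is removed, since this sketch would otherwise split "don't" into "don" and "t".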

4.2.1 Extracting Meaningful Components
Introduction:
Extracting meaningful components from text data is often the first step in text data processing. This process involves isolating specific information within text, such as dates, phone numbers, email addresses, or any other relevant patterns. These components can be crucial for analysis, reporting, or further data processing.
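
The same idea applies to patterns other than dates. The sketch below pulls an email address and a phone number out of a made-up sentence. Both patterns are simplified assumptions for illustration (the phone pattern only covers the NNN-NNN-NNNN layout), and real-world data usually needs broader rules.

```python
import re

# Simplified patterns for illustration; real emails and phone numbers vary more.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

# Hypothetical sample text, not taken from the chapter's dataset.
sample = "Contact sales@example.com or call 555-123-4567 before 2024-03-15."

emails = EMAIL_PATTERN.findall(sample)   # every email-like substring
phones = PHONE_PATTERN.findall(sample)   # every NNN-NNN-NNNN substring

print(emails)
print(phones)
```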

Task:
Let's start by extracting dates from the 'Description' column of a dataset. We'll use a regular expression to find every date written in the YYYY-MM-DD format within the text.

In [1]:
import pandas as pd
import re

# Load the data
df = pd.read_csv('D:/Projects/Data-cleaning-series/Chapter04 Data Parsing and Text Data Cleaning/Products.csv')

# Example of the dataset with text data
print("Original DataFrame:")
print(df.to_string(index=False))

# Extracting Dates from 'Description' column
df['Dates'] = df['Description'].apply(lambda x: re.findall(r'\d{4}-\d{2}-\d{2}', x) if pd.notnull(x) else [])

# Display the DataFrame after extracting dates
print("\nDataFrame After Extracting Dates:")
print(df.to_string(index=False))


Original DataFrame:
 Product ID Product Name  Price    Category  Stock              Description
          1     Widget A  19.99 Electronics  100.0    A high-quality widget
          2     Widget B  29.99 Electronics    NaN                      NaN
          3          NaN  15.00  Home Goods   50.0      Durable and stylish
          4     Widget D    NaN  Home Goods  200.0       A versatile widget
          5     Widget E   9.99         NaN   10.0    Compact and efficient
          6     Widget F  25.00 Electronics    0.0 Latest technology widget
          7     Widget G    NaN     Kitchen  150.0     Multi-purpose widget
          8     Widget H  39.99     Kitchen   75.0          Premium quality
          9     Widget I    NaN Electronics    NaN        Advanced features
         10     Widget J  49.99 Electronics   60.0            Best in class

DataFrame After Extracting Dates:
 Product ID Product Name  Price    Category  Stock              Description Dates
          1     Widget A  1

Explanation:

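Regular Expressions (re): We use Python's re module to search for patterns in the text. The pattern \d{4}-\d{2}-\d{2} matches four digits, a hyphen, two digits, a hyphen, and two more digits, which is the shape of a date in YYYY-MM-DD format. It checks only the shape, so an impossible date such as 2023-13-45 also matches.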

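Lambda Function: We apply a lambda function to each value in the 'Description' column. re.findall returns a list of every substring that matches the date pattern, or an empty list when nothing matches, and that list is stored in a new 'Dates' column. None of the descriptions in this sample dataset contains a date, so every row gets an empty list.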

Handling Missing Values: If the 'Description' is missing (NaN), the lambda returns an empty list instead of calling re.findall, which would raise a TypeError on a non-string value.
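
Because the sample data contains no dates, the effect of the pattern is easier to see on a small made-up DataFrame. The sketch below is one possible variation, not the chapter's method: it swaps the lambda for the vectorized Series.str.findall and then filters out matches that are not real calendar dates. The names df_demo, is_real_date, and ValidDates, along with the sample descriptions, are assumptions introduced for illustration.

```python
from datetime import datetime

import pandas as pd

# Hypothetical sample data; the descriptions in Products.csv contain no dates.
df_demo = pd.DataFrame({
    "Description": [
        "Released 2023-05-14, restocked 2023-11-02",
        "Invalid date 2023-13-45 in text",
        "No date here",
        None,
    ]
})

DATE_PATTERN = r"\d{4}-\d{2}-\d{2}"

def is_real_date(candidate: str) -> bool:
    """Return True only if the string is a valid calendar date."""
    try:
        datetime.strptime(candidate, "%Y-%m-%d")
        return True
    except ValueError:
        return False

# Replace missing values with "" so every row yields a list of matches.
df_demo["Dates"] = df_demo["Description"].fillna("").str.findall(DATE_PATTERN)

# Keep only the matches that parse as actual dates.
df_demo["ValidDates"] = df_demo["Dates"].apply(
    lambda found: [d for d in found if is_real_date(d)]
)

print(df_demo.to_string(index=False))
```

The second row shows why the extra check can be worthwhile: 2023-13-45 has the right shape and lands in 'Dates', but it is dropped from 'ValidDates' because month 13 does not exist.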