
# Lab: Manipulate text data in Python

**Duration:** 1 hour

**Objectives:**
- Master basic string manipulation methods in Python
- Learn to use regular expressions for pattern matching and text cleaning
- Apply these techniques to text preprocessing tasks

---

## Instructions

1. Complete all the exercises marked with `# TODO`
2. Run each cell to verify your answers
3. Save your completed notebook
4. **Push your work to a Git repository and send the link to: yoroba93@gmail.com**

---

# Part 1: String Manipulation

Python provides powerful built-in methods for string manipulation. These are the foundation of text preprocessing.

## 1.1 Case Conversion

In [None]:
# Example text
text = "Hello World! Welcome to NLP."

# lower() - Convert to lowercase
print("lower():", text.lower())

# upper() - Convert to uppercase
print("upper():", text.upper())

# title() - Capitalize first letter of each word
print("title():", text.title())

# capitalize() - Capitalize only the first character
print("capitalize():", text.capitalize())

lower(): hello world! welcome to nlp.
upper(): HELLO WORLD! WELCOME TO NLP.
title(): Hello World! Welcome To Nlp.
capitalize(): Hello world! welcome to nlp.


In [None]:
# TODO: Exercise 1.1
# Given the following sentence, convert it to lowercase and store in 'result'

sentence = "The Quick BROWN Fox Jumps Over The Lazy DOG"

result = sentence.lower()

print(result)
assert result == "the quick brown fox jumps over the lazy dog", "Check your answer!"

the quick brown fox jumps over the lazy dog
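
When the goal is case-insensitive *comparison* rather than display, `str.casefold()` is often a safer choice than `lower()`, because it applies more aggressive Unicode case folding. A minimal sketch, with sample words chosen for illustration only (they are not part of the lab data):

```python
# Illustrative sample strings (not from the lab exercises)
german = "Stra\u00dfe"        # "Strasse" spelled with the German sharp s
ascii_variant = "STRASSE"

# lower() leaves the sharp s untouched, so the two spellings differ
lower_equal = german.lower() == ascii_variant.lower()

# casefold() maps the sharp s to "ss", so the two spellings compare equal
casefold_equal = german.casefold() == ascii_variant.casefold()

print("lower() equal:", lower_equal)
print("casefold() equal:", casefold_equal)
```

For plain ASCII text such as the exercise sentence, `lower()` and `casefold()` give the same result.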


## 1.2 Length and Counting

In [None]:
text = "banana"

# len() - Get the length of a string
print("Length:", len(text))

# count() - Count occurrences of a substring
print("Count 'a':", text.count('a'))
print("Count 'an':", text.count('an'))

Length: 6
Count 'a': 3
Count 'an': 2


In [None]:
# TODO: Exercise 1.2
# Count how many times the word "the" appears in the following text (case-insensitive)
# Hint: convert to lowercase first

paragraph = "The cat sat on the mat. The mat was on the floor. THE floor was cold."

count_the = paragraph.lower().count("the")

print(f"'the' appears {count_the} times")
assert count_the == 5, "Hint: make sure to count case-insensitively!"

'the' appears 5 times
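
Note that `count()` counts substrings, not words. It gives the right answer for the paragraph above only because no other word happens to contain "the". The sketch below, using an illustrative sentence that is not part of the lab, contrasts substring counting with a simple token-based count:

```python
import string

# Illustrative sentence: several words merely contain "the"
sample = "The theater is there, then the end."

# Substring counting also counts "the" inside longer words
substring_count = sample.lower().count("the")

# Token-based counting: split on whitespace, strip surrounding punctuation
tokens = [word.strip(string.punctuation) for word in sample.lower().split()]
word_count = tokens.count("the")

print("Substring count:", substring_count)
print("Whole-word count:", word_count)
```

Part 2 introduces the regex word boundary `\b`, which is another way to count whole words.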


## 1.3 Whitespace Handling

In [None]:
text = "   Hello World!   "

# strip() - Remove leading and trailing whitespace
print(f"strip(): '{text.strip()}'")

# lstrip() - Remove leading (left) whitespace
print(f"lstrip(): '{text.lstrip()}'")

# rstrip() - Remove trailing (right) whitespace
print(f"rstrip(): '{text.rstrip()}'")

# strip with specific characters
text2 = "###Hello###"
print(f"strip('#'): '{text2.strip('#')}'")

strip(): 'Hello World!'
lstrip(): 'Hello World!   '
rstrip(): '   Hello World!'
strip('#'): 'Hello'


In [None]:
# split() - Split string into a list
text = "apple,banana,cherry"
print("Split by comma:", text.split(','))

text2 = "Hello   World   NLP"
print("Split by whitespace:", text2.split())  # Default splits on any whitespace

# join() - Join list elements into a string
words = ["Natural", "Language", "Processing"]
print("Join with space:", " ".join(words))
print("Join with hyphen:", "-".join(words))

Split by comma: ['apple', 'banana', 'cherry']
Split by whitespace: ['Hello', 'World', 'NLP']
Join with space: Natural Language Processing
Join with hyphen: Natural-Language-Processing


In [None]:
# TODO: Exercise 1.3
# Clean the following messy text:
# 1. Remove leading/trailing whitespace
# 2. Split into words
# 3. Join back with single spaces

messy_text = "   This    text   has    irregular     spacing   "

# We can use split() without arguments to handle irregular whitespace automatically
clean_text = " ".join(messy_text.split())

print(f"Clean: '{clean_text}'")
assert clean_text == "This text has irregular spacing", "Check your answer!"

Clean: 'This text has irregular spacing'
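
The exercise relies on the difference between `split()` and `split(" ")`, which is easy to overlook. A short sketch with illustrative inputs (not from the lab) that shows both behaviors, plus the optional `maxsplit` argument:

```python
spaced = "a  b   c"   # illustrative input with irregular spacing

# split(" ") splits on every single space, so runs of spaces yield empty strings
by_single_space = spaced.split(" ")

# split() with no argument treats any run of whitespace as one separator
by_any_whitespace = spaced.split()

# maxsplit limits the number of splits, handy for "key=value" style lines
key, value = "path=/usr/bin=/opt".split("=", 1)

print(by_single_space)
print(by_any_whitespace)
print(key, "->", value)
```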


## 1.4 Find and Replace

In [None]:
text = "Hello World! Hello Python!"

# find() - Find the index of first occurrence (-1 if not found)
print("find('World'):", text.find('World'))
print("find('Java'):", text.find('Java'))

# replace() - Replace all occurrences
print("replace:", text.replace('Hello', 'Hi'))

# replace with count limit
print("replace (count=1):", text.replace('Hello', 'Hi', 1))

find('World'): 6
find('Java'): -1
replace: Hi World! Hi Python!
replace (count=1): Hi World! Hello Python!


In [None]:
# TODO: Exercise 1.4
# Replace all occurrences of "NLP" with "Natural Language Processing"

text = "NLP is fascinating. I love studying NLP. NLP has many applications."

expanded_text = text.replace("NLP", "Natural Language Processing")

print(expanded_text)
assert "NLP" not in expanded_text, "All 'NLP' should be replaced!"
assert expanded_text.count("Natural Language Processing") == 3, "Should have 3 replacements!"

Natural Language Processing is fascinating. I love studying Natural Language Processing. Natural Language Processing has many applications.
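
Because `find()` signals "not found" with `-1`, and `-1` is also a valid index (the last character), it is worth checking the result before slicing. A small sketch of `find()`, `rfind()`, and `index()` on the example string from this section:

```python
greeting = "Hello World! Hello Python!"

first = greeting.find("Hello")    # index of the first occurrence
last = greeting.rfind("Hello")    # index of the last occurrence
missing = greeting.find("Java")   # -1 signals "not found"

# Guard against -1 before using the result as a slice position
tail = greeting[missing:] if missing != -1 else ""

# index() behaves like find() but raises ValueError when nothing is found
try:
    greeting.index("Java")
    raised = False
except ValueError:
    raised = True

print(first, last, missing, repr(tail), raised)
```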


## 1.5 Checking String Content

In [None]:
# startswith() and endswith()
filename = "document.pdf"
print("Starts with 'doc':", filename.startswith('doc'))
print("Ends with '.pdf':", filename.endswith('.pdf'))
print("Ends with '.txt' or '.pdf':", filename.endswith(('.txt', '.pdf')))

# in operator - Check if substring exists
text = "Hello World"
print("'World' in text:", 'World' in text)
print("'Python' in text:", 'Python' in text)

Starts with 'doc': True
Ends with '.pdf': True
Ends with '.txt' or '.pdf': True
'World' in text: True
'Python' in text: False


In [None]:
# Character type checking
print("'Hello'.isalpha():", "Hello".isalpha())      # Only letters?
print("'Hello1'.isalpha():", "Hello1".isalpha())

print("'12345'.isdigit():", "12345".isdigit())      # Only digits?
print("'12.34'.isdigit():", "12.34".isdigit())

print("'Hello1'.isalnum():", "Hello1".isalnum())    # Letters or digits?
print("'   '.isspace():", "   ".isspace())          # Only whitespace?

'Hello'.isalpha(): True
'Hello1'.isalpha(): False
'12345'.isdigit(): True
'12.34'.isdigit(): False
'Hello1'.isalnum(): True
'   '.isspace(): True


In [None]:
# TODO: Exercise 1.5
# Filter the following list to keep only words that:
# 1. Contain only alphabetic characters
# 2. Have more than 3 characters

words = ["hello", "world123", "NLP", "AI", "machine", "42", "learning", "a1b2", "the"]

filtered_words = [word for word in words if word.isalpha() and len(word) > 3]

print(filtered_words)
assert filtered_words == ["hello", "machine", "learning"], "Check your filtering conditions!"

['hello', 'machine', 'learning']
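
The `is...()` methods test character classes, not whether a string is a valid number: signs and decimal points make `isdigit()` return `False`, while some Unicode characters such as superscripts make it return `True`. The sketch below compares them with a parse-based check; `is_number` is a hypothetical helper introduced only for this illustration.

```python
# Illustrative inputs; "\u00b2" is the superscript two character
samples = ["42", "-42", "3.14", "\u00b2", "abc"]

for s in samples:
    print(repr(s), "isdigit:", s.isdigit(), "isdecimal:", s.isdecimal())

def is_number(value):
    """Return True if float() can parse the string (hypothetical helper)."""
    try:
        float(value)
        return True
    except ValueError:
        return False

parsed = [s for s in samples if is_number(s)]
print("Parsable as float:", parsed)
```

Keep in mind that `float()` also accepts strings such as `"nan"` and `"inf"`, so this check may be too permissive for some tasks.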


## 1.6 Mini-Challenge: Text Cleaning Function

In [None]:
# TODO: Exercise 1.6
# Create a function that performs basic text cleaning:
# 1. Convert to lowercase
# 2. Remove leading/trailing whitespace
# 3. Replace multiple spaces with single space
# 4. Replace newlines with spaces

def basic_clean(text):
    """
    Perform basic text cleaning.

    Args:
        text (str): Input text
    Returns:
        str: Cleaned text
    """
    # 1. Convert to lowercase
    text = text.lower()

    # 4. Replace newlines with spaces (doing this before split ensures lines don't merge awkwardly)
    text = text.replace("\n", " ")

    # 2 & 3. Remove leading/trailing whitespace AND multiple spaces
    # Splitting by default handles all whitespace, joining puts single spaces back
    text = " ".join(text.split())

    return text

# Test your function
test_text = """   HELLO   World!
   This is    a TEST.   """

result = basic_clean(test_text)
print(f"Result: '{result}'")
assert result == "hello world! this is a test.", "Check your function!"

Result: 'hello world! this is a test.'


---

# Part 2: Regular Expressions

Regular expressions (regex) provide powerful pattern matching capabilities for text processing.

In [None]:
import re  # Import the regex module

## 2.1 Basic Regex Functions

In [None]:
text = "My phone number is 123-456-7890 and my zip code is 12345."

# re.search() - Find first match
match = re.search(r'\d+', text)  # \d+ means one or more digits
if match:
    print("search() found:", match.group())

# re.findall() - Find all matches (returns list)
all_numbers = re.findall(r'\d+', text)
print("findall() found:", all_numbers)

# re.sub() - Replace pattern with new string
censored = re.sub(r'\d', 'X', text)
print("sub() result:", censored)

search() found: 123
findall() found: ['123', '456', '7890', '12345']
sub() result: My phone number is XXX-XXX-XXXX and my zip code is XXXXX.


In [None]:
# re.split() - Split by pattern
text = "apple;banana,cherry orange"
# Split by semicolon, comma, or space
words = re.split(r'[;,\s]+', text)
print("split() result:", words)

split() result: ['apple', 'banana', 'cherry', 'orange']


In [None]:
# TODO: Exercise 2.1
# Extract all the years (4-digit numbers) from the following text

text = "Python was created in 1991. TensorFlow was released in 2015. GPT-3 came out in 2020."

years = re.findall(r'\d{4}', text)

print(years)
assert years == ['1991', '2015', '2020'], "Check your pattern!"

['1991', '2015', '2020']
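
The pattern `\d{4}` works here because the text contains no longer digit runs. On other inputs it would also match the first four digits of a longer number. A sketch with an illustrative string (not from the lab) that adds word boundaries and uses `re.finditer()` to recover match positions:

```python
import re

text = "Order 123456 shipped in 2021; invoice 98765 paid in 2022."

# Without boundaries, 4-digit slices of longer numbers are matched too
loose = re.findall(r'\d{4}', text)

# \b on both sides requires the 4 digits to stand alone
strict = re.findall(r'\b\d{4}\b', text)

# finditer() yields match objects, which expose the position of each match
positions = [(m.group(), m.start()) for m in re.finditer(r'\b\d{4}\b', text)]

print("Loose:", loose)
print("Strict:", strict)
print("Positions:", positions)
```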


## 2.2 Character Classes

In [None]:
# Common character classes
text = "Hello World! 123 test_var"

print("\\d (digits):", re.findall(r'\d', text))       # Digits
print("\\w (word chars):", re.findall(r'\w+', text))  # Word characters: letters, digits, underscore (Unicode-aware for str patterns)
print("\\s (whitespace):", re.findall(r'\s', text))   # Whitespace
print("\\S (non-whitespace):", re.findall(r'\S+', text))  # Non-whitespace

\d (digits): ['1', '2', '3']
\w (word chars): ['Hello', 'World', '123', 'test_var']
\s (whitespace): [' ', ' ', ' ']
\S (non-whitespace): ['Hello', 'World!', '123', 'test_var']
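
In Python 3, `\w`, `\d`, and `\s` are Unicode-aware when the pattern is a `str`, so `\w` matches accented letters as well as `[a-zA-Z0-9_]`. The `re.ASCII` flag restricts them to ASCII. A brief sketch with an illustrative string (written with escapes so the accented characters are unambiguous):

```python
import re

# "cafe naive" with accented e and i, written as escapes for clarity
text = "caf\u00e9 na\u00efve"

# Default for str patterns: \w matches Unicode letters
unicode_words = re.findall(r'\w+', text)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so accented letters split words
ascii_words = re.findall(r'\w+', text, re.ASCII)

print("Unicode-aware:", unicode_words)
print("ASCII only:", ascii_words)
```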


In [None]:
# Custom character classes with []
text = "The price is $50.99 or €45.00"

# Match vowels
print("Vowels:", re.findall(r'[aeiouAEIOU]', text))

# Match currency symbols
print("Currency:", re.findall(r'[$€£]', text))

# Match anything except digits
print("Non-digits:", re.findall(r'[^0-9]+', text))

Vowels: ['e', 'i', 'e', 'i', 'o']
Currency: ['$', '€']
Non-digits: ['The price is $', '.', ' or €', '.']


In [None]:
# TODO: Exercise 2.2
# Extract all words that contain only lowercase letters (no digits, no uppercase)

text = "Hello world NLP is GREAT for text processing 123"

# Hint: combine [a-z]+ with word boundaries (\b) so that fragments of
# mixed-case words such as "ello" in "Hello" are not matched
lowercase_words = re.findall(r'\b[a-z]+\b', text)

print(lowercase_words)
assert lowercase_words == ['world', 'is', 'for', 'text', 'processing'], "Check your pattern!"

['world', 'is', 'for', 'text', 'processing']


## 2.3 Quantifiers

In [None]:
# Quantifiers control how many times a pattern should match
text = "a aa aaa aaaa b bb bbb"

print("a+ (1 or more):", re.findall(r'a+', text))
print("a* (0 or more):", re.findall(r'ba*', text))  # b followed by 0+ a's
print("a? (0 or 1):", re.findall(r'ba?', text))     # b followed by 0 or 1 a
print("a{2} (exactly 2):", re.findall(r'a{2}', text))
print("a{2,3} (2 to 3):", re.findall(r'a{2,3}', text))

a+ (1 or more): ['a', 'aa', 'aaa', 'aaaa']
a* (0 or more): ['b', 'b', 'b', 'b', 'b', 'b']
a? (0 or 1): ['b', 'b', 'b', 'b', 'b', 'b']
a{2} (exactly 2): ['aa', 'aa', 'aa', 'aa']
a{2,3} (2 to 3): ['aa', 'aaa', 'aaa']


In [None]:
# Greedy vs Non-greedy
html = "<p>First</p><p>Second</p>"

# Greedy (default) - matches as much as possible
print("Greedy:", re.findall(r'<p>.*</p>', html))

# Non-greedy (add ?) - matches as little as possible
print("Non-greedy:", re.findall(r'<p>.*?</p>', html))

Greedy: ['<p>First</p><p>Second</p>']
Non-greedy: ['<p>First</p>', '<p>Second</p>']
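
An alternative to a non-greedy quantifier is a negated character class, which states directly what the content may not contain. The two differ on multi-line input, because `.` does not match a newline by default. A sketch under the assumption that the paragraphs contain no nested tags (regex is not a general HTML parser):

```python
import re

html = "<p>First</p><p>Second</p>"

# Non-greedy: stop at the first closing tag
lazy = re.findall(r'<p>(.*?)</p>', html)

# Negated class: match any characters except "<"
negated = re.findall(r'<p>([^<]*)</p>', html)

# Illustrative multi-line paragraph: "." stops at the newline, [^<] does not
multi = "<p>Line one\nLine two</p>"
lazy_multi = re.findall(r'<p>(.*?)</p>', multi)
negated_multi = re.findall(r'<p>([^<]*)</p>', multi)
dotall_multi = re.findall(r'<p>(.*?)</p>', multi, re.DOTALL)

print(lazy, negated)
print(lazy_multi, negated_multi, dotall_multi)
```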


In [None]:
# TODO: Exercise 2.3
# Extract all phone numbers in the format XXX-XXX-XXXX

text = "Call me at 123-456-7890 or 987-654-3210. My old number was 555-1234."

phone_numbers = re.findall(r'\d{3}-\d{3}-\d{4}', text)
# Hint: \d{3} matches exactly 3 digits

print(phone_numbers)
assert phone_numbers == ['123-456-7890', '987-654-3210'], "Check your pattern!"

['123-456-7890', '987-654-3210']


## 2.4 Anchors and Word Boundaries

In [None]:
# ^ matches start of string, $ matches end of string
text = "Hello World"

print("Starts with 'Hello':", bool(re.match(r'^Hello', text)))
print("Ends with 'World':", bool(re.search(r'World$', text)))

# \b matches word boundary
text2 = "cat catalog caterpillar"
print("All 'cat':", re.findall(r'cat', text2))
print("Only whole word 'cat':", re.findall(r'\bcat\b', text2))

Starts with 'Hello': True
Ends with 'World': True
All 'cat': ['cat', 'cat', 'cat']
Only whole word 'cat': ['cat']


In [None]:
# TODO: Exercise 2.4
# Find all words that START with 'pre' (as whole words, not substrings)

text = "I need to prepare a presentation. The prerequisites are comprehensive."

pre_words = re.findall(r'\bpre\w*', text)
# Hint: combine \b with \w*

print(pre_words)
assert pre_words == ['prepare', 'presentation', 'prerequisites'], "Check your pattern!"

['prepare', 'presentation', 'prerequisites']
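
Anchors interact with the choice of function: `re.match()` only tries the start of the string, `re.search()` scans the whole string, and `re.fullmatch()` requires the pattern to cover the entire string. A minimal sketch using an illustrative product code (not from the lab):

```python
import re

code = "AB-1234"   # illustrative input

starts_with_letters = re.match(r'[A-Z]{2}', code) is not None   # start only
digits_at_start = re.match(r'\d+', code) is not None            # no digits at index 0
digits_anywhere = re.search(r'\d+', code) is not None           # scans the string
whole_string = re.fullmatch(r'[A-Z]{2}-\d{4}', code) is not None
partial_only = re.fullmatch(r'[A-Z]{2}', code) is not None      # leaves "-1234" unmatched

print(starts_with_letters, digits_at_start, digits_anywhere, whole_string, partial_only)
```

`re.fullmatch()` is a convenient choice for validating a complete field, since it removes the need for explicit `^` and `$`.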


## 2.5 Groups and Alternation

In [None]:
# Groups () capture parts of the match
text = "John Smith: john.smith@email.com, Jane Doe: jane.doe@company.org"

# Extract email parts
emails = re.findall(r'(\w+\.\w+)@(\w+\.\w+)', text)
print("Email parts:", emails)

# Alternation | for OR patterns
text2 = "I have a cat and a dog. My neighbor has a bird."
pets = re.findall(r'cat|dog|bird', text2)
print("Pets found:", pets)

Email parts: [('john.smith', 'email.com'), ('jane.doe', 'company.org')]
Pets found: ['cat', 'dog', 'bird']


In [None]:
# TODO: Exercise 2.5
# Extract all dates in format DD/MM/YYYY or DD-MM-YYYY
# Return them as tuples (day, month, year)

text = "Important dates: 25/12/2024, 01-01-2025, and 14/02/2025."

dates = re.findall(r'(\d{2})[/-](\d{2})[/-](\d{4})', text)
# Hint: use groups () and a character class [/-] to accept either separator

print(dates)
assert dates == [('25', '12', '2024'), ('01', '01', '2025'), ('14', '02', '2025')], "Check your pattern!"

[('25', '12', '2024'), ('01', '01', '2025'), ('14', '02', '2025')]
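
The pattern `[/-]` accepts a different separator in each position, so a string like `14/02-2025` would also match. One way to require a consistent separator is a backreference, and named groups make the result easier to read. A sketch with an illustrative text that includes one mixed-separator date:

```python
import re

text = "Important dates: 25/12/2024, 01-01-2025, and 14/02-2025."

# (?P<name>...) defines a named group; (?P=sep) must repeat what "sep" matched
pattern = re.compile(
    r'(?P<day>\d{2})(?P<sep>[/-])(?P<month>\d{2})(?P=sep)(?P<year>\d{4})'
)

dates = [
    (m.group("day"), m.group("month"), m.group("year"))
    for m in pattern.finditer(text)
]

print(dates)
```

The mixed-separator date is skipped because the second separator does not equal the first.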


## 2.6 Practical NLP Cleaning Patterns

In [None]:
# Common text cleaning patterns
sample_text = """
Check out https://example.com for more info!
Contact us at support@company.com 📧
Follow @nlp_expert on Twitter! #MachineLearning #NLP
Price: $99.99 (50% off!!!)
"""

# Remove URLs
no_urls = re.sub(r'https?://\S+', '', sample_text)
print("No URLs:", no_urls)

# Remove emails
no_emails = re.sub(r'\S+@\S+', '[EMAIL]', sample_text)
print("No emails:", no_emails)

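# Remove hashtags and mentions
# Note: applied to the raw text, this also strips "@company" from the email address,
# so in a pipeline emails should be handled before mentions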
no_social = re.sub(r'[@#]\w+', '', sample_text)
print("No social:", no_social)

No URLs: 
Check out  for more info!
Contact us at support@company.com 📧
Follow @nlp_expert on Twitter! #MachineLearning #NLP
Price: $99.99 (50% off!!!)

No emails: 
Check out https://example.com for more info!
Contact us at [EMAIL] 📧
Follow @nlp_expert on Twitter! #MachineLearning #NLP
Price: $99.99 (50% off!!!)

No social: 
Check out https://example.com for more info!
Contact us at support.com 📧
Follow  on Twitter!  
Price: $99.99 (50% off!!!)



In [None]:
# More cleaning patterns
text = "Hello!!!   What's up???  This is so cool..."

# Remove repeated punctuation
clean1 = re.sub(r'([!?.]){2,}', r'\1', text)
print("No repeated punct:", clean1)

# Remove extra whitespace
clean2 = re.sub(r'\s+', ' ', text)
print("No extra spaces:", clean2)

# Remove non-alphanumeric (keep spaces)
clean3 = re.sub(r'[^\w\s]', '', text)
print("Only alphanumeric:", clean3)

No repeated punct: Hello!   What's up?  This is so cool.
No extra spaces: Hello!!! What's up??? This is so cool...
Only alphanumeric: Hello   Whats up  This is so cool
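
The mention pattern `[@#]\w+` matches an `@` anywhere, including inside an email address. Besides removing emails first, one option is a negative lookbehind that only accepts `@` or `#` when no word character precedes it. A sketch with an illustrative sentence, not a definitive rule for all social-media text:

```python
import re

text = "Contact support@company.com or ping @nlp_expert #NLP"

# Naive pattern: also removes "@company" from the email address
naive = re.sub(r'[@#]\w+', '', text)

# (?<!\w) requires that the character before @ or # is not a word character
guarded = re.sub(r'(?<!\w)[@#]\w+', '', text)

# Collapse leftover whitespace for easier comparison
naive_clean = " ".join(naive.split())
guarded_clean = " ".join(guarded.split())

print("Naive:  ", naive_clean)
print("Guarded:", guarded_clean)
```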


## 2.7 Flags

In [None]:
# re.IGNORECASE (re.I) - Case insensitive matching
text = "Python PYTHON python PyThOn"

print("Without flag:", re.findall(r'python', text))
print("With IGNORECASE:", re.findall(r'python', text, re.IGNORECASE))

# re.MULTILINE (re.M) - ^ and $ match line boundaries
multiline_text = """First line
Second line
Third line"""

print("Lines starting with capital:", re.findall(r'^[A-Z]\w+', multiline_text, re.MULTILINE))

Without flag: ['python']
With IGNORECASE: ['Python', 'PYTHON', 'python', 'PyThOn']
Lines starting with capital: ['First', 'Second', 'Third']
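
Flags can be combined with `|`, and `re.VERBOSE` allows whitespace and comments inside a pattern, which helps with longer expressions. A short sketch; the log text is an illustrative assumption, and the phone pattern is the one from Exercise 2.3:

```python
import re

log = "Error: disk full\nwarning: low memory\nERROR: timeout"

# IGNORECASE matches any casing of "error"; MULTILINE anchors ^ and $ per line
messages = re.findall(r'^error:\s*(.+)$', log, re.IGNORECASE | re.MULTILINE)

# VERBOSE ignores whitespace in the pattern and allows inline comments
phone = re.compile(r"""
    \d{3}   # first block of three digits
    -
    \d{3}   # second block of three digits
    -
    \d{4}   # final block of four digits
""", re.VERBOSE)

numbers = phone.findall("Call 123-456-7890 or 555-1234.")

print(messages)
print(numbers)
```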


## 2.8 Final Challenge: Complete Text Preprocessor

In [None]:
# TODO: Exercise 2.8 (Final Challenge)
# Create a comprehensive text preprocessing function that:
# 1. Converts to lowercase
# 2. Removes URLs (http/https)
# 3. Removes email addresses
# 4. Removes hashtags and mentions (@user, #topic)
# 5. Removes punctuation (keep only letters, numbers, spaces)
# 6. Removes extra whitespace
# 7. Strips leading/trailing whitespace

def preprocess_text(text):
    """
    Comprehensive text preprocessing for NLP tasks.

    Args:
        text (str): Raw input text
    Returns:
        str: Cleaned and preprocessed text
    """
    # Step 1: Lowercase
    text = text.lower()

    # Step 2: Remove URLs
    text = re.sub(r'https?://\S+', '', text)

    # Step 3: Remove emails
    text = re.sub(r'\S+@\S+', '', text)

    # Step 4: Remove hashtags and mentions
    text = re.sub(r'[@#]\w+', '', text)

    # Step 5: Remove punctuation (keep word characters and spaces)
    # [^\w\s] matches anything that is NOT a word char or space
    text = re.sub(r'[^\w\s]', '', text)

    # Step 6: Remove extra whitespace (replace multiple spaces with one)
    text = re.sub(r'\s+', ' ', text)

    # Step 7: Strip leading/trailing whitespace
    text = text.strip()

    return text

# Test your function
raw_text = """
    🚀 Check out our NEW product at https://example.com!!!
    Contact: sales@company.com @CompanyName #Innovation #Tech
    Limited time offer: 50% OFF!!!
"""

cleaned = preprocess_text(raw_text)
print(f"Cleaned: '{cleaned}'")

# Expected output ("contact" remains because only its trailing colon is removed):
# "check out our new product at contact limited time offer 50 off"

Cleaned: 'check out our new product at contact limited time offer 50 off'


---

## Summary

### String Methods
- **Case**: `lower()`, `upper()`, `title()`, `capitalize()`
- **Whitespace**: `strip()`, `lstrip()`, `rstrip()`, `split()`, `join()`
- **Find/Replace**: `find()`, `replace()`, `count()`
- **Checking**: `startswith()`, `endswith()`, `isalpha()`, `isdigit()`, `isalnum()`

### Regex Patterns
- **Character classes**: `\d`, `\w`, `\s`, `[abc]`, `[^abc]`
- **Quantifiers**: `*`, `+`, `?`, `{n}`, `{n,m}`
- **Anchors**: `^`, `$`, `\b`
- **Groups**: `()`, alternation `|`

### Regex Functions
- `re.search()` - Find first match
- `re.findall()` - Find all matches
- `re.sub()` - Replace matches
- `re.split()` - Split by pattern

---

## Submission

1. Make sure all exercises are completed
2. Save this notebook
3. Create a Git repository and push your work
4. **Send the repository link to: yoroba93@gmail.com**
