# String Manipulation in Python for Data Science

Text columns are common in data science work, and they usually need cleaning before analysis. In this notebook, you will learn to manipulate strings using Python's built-in string methods and regular expressions.

## Objectives
- Learn string methods such as `.strip()`, `.replace()`, and `.split()`.
- Use basic regular expressions with the `re` module.
- Apply string manipulation to a real dataset.


## 1. Basic String Methods
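Let's explore common methods for manipulating strings. Python strings are immutable, so each method returns a new string and leaves the original unchanged.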

In [None]:
# Example: Cleaning and formatting strings
name = "  Mr. John Smith  "
print("Original:", name)

# Remove leading and trailing whitespace (spaces inside the string are kept)
clean_name = name.strip()
print("Stripped:", clean_name)

# Replace 'Mr.' with 'Mister'
formatted_name = clean_name.replace("Mr.", "Mister")
print("Formatted:", formatted_name)

# Split into words
words = formatted_name.split()
print("Words:", words)
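
Note that `.strip()` only trims the ends of a string. The short sketch below, using a made-up input with extra spaces between the words, shows one common way to collapse internal whitespace as well, and how method calls can be chained. The variable names (`raw`, `collapsed`, `normalized`) are illustrative only.

```python
# Hypothetical input with extra spaces at the ends AND between words
raw = "  Mr.   John    Smith  "

# .strip() trims only the ends; the runs of spaces inside remain
trimmed = raw.strip()

# split() with no argument splits on any run of whitespace, so joining
# the pieces with a single space also collapses the internal gaps
collapsed = " ".join(raw.split())

# Each method returns a new string, so calls can be chained
normalized = collapsed.replace("Mr.", "Mister").lower()

print(repr(trimmed))     # 'Mr.   John    Smith'
print(repr(collapsed))   # 'Mr. John Smith'
print(repr(normalized))  # 'mister john smith'
```

Printing with `repr()` makes leading, trailing, and repeated spaces visible, which is handy when checking a cleaning step.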

## 2. Regular Expressions
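The `re` module lets you search for patterns in strings, for example to extract a title from a name or to mask numbers. In the patterns used here, `\.` matches a literal period (an unescaped `.` matches any character), `|` separates alternatives, and `\d+` matches one or more digits.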

In [None]:
import re

# Example: Extracting titles from names
name = "Mrs. Anna Maria Smith"
title = re.search(r'(Mr\.|Mrs\.|Miss\.|Master\.)', name)
if title:
    print("Title found:", title.group())

# Replace each run of digits with a single 'X'
text = "Passenger 123, cabin A456"
masked_text = re.sub(r'\d+', 'X', text)
print("Masked text:", masked_text)
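
The cell above masks numbers. To extract them instead, a minimal sketch using `re.findall` and capturing groups could look like this. The `cabin ([A-Z])(\d+)` pattern is an assumption that fits this one example string, not a general cabin format.

```python
import re

text = "Passenger 123, cabin A456"

# findall returns every non-overlapping match as a list of strings
numbers = re.findall(r'\d+', text)
print("Numbers:", numbers)  # Numbers: ['123', '456']

# Parentheses create capturing groups, so parts of a match can be
# read separately: group(1) is the letter, group(2) the digits
cabin = re.search(r'cabin ([A-Z])(\d+)', text)
if cabin:
    deck, number = cabin.group(1), cabin.group(2)
    print("Deck:", deck, "Number:", number)  # Deck: A Number: 456
```

The matches are returned as strings, so convert them with `int()` if you need to do arithmetic on them.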

## 3. Example with Titanic Dataset
Let's clean and process the 'Name' column of the Titanic dataset. The cell below expects the data at `data/titanic.csv`.

In [None]:
import pandas as pd
import re

# Load dataset
df = pd.read_csv('data/titanic.csv')

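# Function to extract titles
def extract_title(name):
    """Return the first listed title found in name (period included), else 'Other'."""
    # Titles not listed in the pattern (for example 'Dr.') fall into 'Other'
    match = re.search(r'(Mr\.|Mrs\.|Miss\.|Master\.)', name)
    return match.group() if match else 'Other'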

# Apply the function to the 'Name' column
df['Title'] = df['Name'].apply(extract_title)
print("Title counts:")
print(df['Title'].value_counts())

# Clean names (remove leading and trailing whitespace)
df['Clean_Name'] = df['Name'].str.strip()
print("\nFirst 5 clean names:")
print(df['Clean_Name'].head())
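
As an alternative to `apply`, pandas can run the regular expression on the whole column at once with `.str.extract`. The sketch below uses a few hypothetical names rather than the real file. It assumes names follow a "Surname, Title. Given names" layout, so check this against your copy of the data. Unlike `extract_title`, this pattern captures any title and returns it without the trailing period.

```python
import pandas as pd

# Hypothetical names in an assumed "Surname, Title. Given names" layout
names = pd.Series([
    "  Doe, Mr. John ",
    "Roe, Mrs. Jane (Mary Major)",
    "Poe, Miss. Ann",
    "Low, Master. Tim",
    "Hale, Dr. Sam",
    "Unknown Person",  # no title at all
], name="Name")

# Trim the ends first, then capture the text between the comma
# and the next period; rows without a match become NaN
titles = (
    names.str.strip()
         .str.extract(r',\s*([^.,]+)\.', expand=False)
         .fillna('Other')
)

print(titles.value_counts())
```

With `expand=False` and a single capturing group, `.str.extract` returns a Series, so the result can be assigned directly to a new column such as `df['Title']`.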