# String Operations in Pandas

### What Are String Operations in Pandas?

Text-based data is everywhere — in names, emails, addresses, IDs, tickets, and more. In the Titanic dataset, key columns like `Name`, `Ticket`, and `Cabin` contain strings that can be cleaned, transformed, or mined for useful information.

Pandas provides the `.str` accessor, which applies a string method to every element of a column (Series) in a single call and leaves missing values as missing. These operations allow us to:

- Normalize text (e.g., convert everything to lowercase)
- Clean messy strings (e.g., remove punctuation)
- Extract structured patterns (e.g., titles from names)
- Filter or tag rows based on string content

These string operations are especially valuable in AI/ML workflows for **feature engineering**, such as extracting names, standardizing formats, or building binary indicators from patterns in text.

### Loading the Titanic Dataset

In [1]:
import pandas as pd
df = pd.read_csv("data/train.csv")
print(df[['Name', 'Ticket', 'Cabin']].head())

                                                Name            Ticket Cabin
0                            Braund, Mr. Owen Harris         A/5 21171   NaN
1  Cumings, Mrs. John Bradley (Florence Briggs Th...          PC 17599   C85
2                             Heikkinen, Miss. Laina  STON/O2. 3101282   NaN
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)            113803  C123
4                           Allen, Mr. William Henry            373450   NaN


### Regular Expressions in Pandas (`regex=True`)

**Regular expressions (regex)** are patterns that describe which strings to match. Pandas string methods such as `.str.replace()`, `.str.contains()`, and `.str.extract()` accept regex patterns to:

- Match specific characters or groups
- Remove unwanted patterns (e.g., punctuation)
- Extract substrings using complex patterns

Regex is especially useful when string values follow a consistent format (like names or codes).

### Why Use Regex in Pandas?

In AI/ML preprocessing, regex helps:

- Extract titles from names: `"Mr."`, `"Miss."`
- Remove unwanted characters: punctuation, brackets
- Match specific types of strings: digits, uppercase words, etc.
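
As a small illustration of the last point, the sketch below checks which tickets consist of digits only and pulls out the trailing number. The five ticket values are copied from the preview above; the variable names `is_numeric` and `number_part` are chosen here for illustration.

```python
import pandas as pd

# Ticket values copied from the first five rows shown earlier
tickets = pd.Series(["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"])

# ^ and $ anchor the pattern, so the whole ticket must be digits
is_numeric = tickets.str.contains(r"^\d+$")

# Capture the run of digits at the end of each ticket (kept as strings)
number_part = tickets.str.extract(r"(\d+)$", expand=False)

print(is_numeric.tolist())
print(number_part.tolist())
```

Only the last two tickets are purely numeric, while every ticket in this sample ends in a number that can be extracted.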

### Common Regex Symbols and Their Meaning

This cheat sheet will help us understand and write regular expressions when working with string data in Pandas:

| **Symbol** | **Meaning** | **Example** | **Matches** |
| --- | --- | --- | --- |
| `.` | Any character except newline | `"a.b"` | Matches `"aab"`, `"acb"` |
| `^` | Start of string (outside `[]`) | `"^Mr"` | Matches `"Mr. John"` at the start |
| `$` | End of string | `"end$"` | Matches `"This is the end"` |
| `*` | Zero or more of the preceding pattern | `"lo*"` | Matches `"l"`, `"lo"`, `"loo"` |
| `+` | One or more of the preceding pattern | `"lo+"` | Matches `"lo"`, `"loo"` |
| `?` | Zero or one of the preceding pattern | `"lo?"` | Matches `"l"` or `"lo"` |
| `{n}` | Exactly n repetitions | `"a{3}"` | Matches `"aaa"` |
| `{n,}` | n or more repetitions | `"a{2,}"` | Matches `"aa"`, `"aaa"`, etc. |
| `{n,m}` | Between n and m repetitions | `"a{2,4}"` | Matches `"aa"`, `"aaa"`, `"aaaa"` |
| `[]` | Character set (any one inside) | `"[abc]"` | Matches `"a"`, `"b"`, `"c"` |
| `[^...]` | Negated character set | `"[^0-9]"` | Matches anything except digits |
| `\w` | Word character; `[a-zA-Z0-9_]` for ASCII text (Python also matches Unicode letters and digits) | `"\w"` | Matches `"a"`, `"Z"`, `"9"`, `"_"` |
| `\W` | Non-word character | `"\W"` | Matches `"!"`, `"."`, `" "` |
| `\d` | Digit = `[0-9]` | `"\d"` | Matches `"3"`, `"9"` |
| `\D` | Non-digit | `"\D"` | Matches `"a"`, `"!"` |
| `\s` | Whitespace (space, tab, newline) | `"\s"` | Matches `" "`, `"\n"` |
| `\S` | Non-whitespace | `"\S"` | Matches `"a"`, `"1"`, `"@"` |
| `()` | Grouping and extraction | `"(Mr\|Mrs)\.?"` | Matches `"Mr."`, `"Mrs."`, `"Mr"`, `"Mrs"` |
| `\.` | Escaped dot (matches a literal `.`) | `"Mr\."` | Matches `"Mr."` |
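| `\\` | Literal backslash in a regex. In a non-raw Python string the backslash itself must be doubled, so `"\\d"` is the regex `\d`; raw strings such as `r"\d"` avoid this | `r"\\"` | Matches a single `\` character |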
| `(?i)` | Case-insensitive mode | `"(?i)miss"` | Matches `"Miss"`, `"MISS"`, `"miss"` |
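
A few rows of the table can be checked directly with Python's built-in `re` module, which uses the same syntax that Pandas accepts for string patterns. This is only a sketch; the test strings are taken from the table or the Titanic preview.

```python
import re

candidates = ["l", "lo", "loo"]

# * allows zero repetitions of "o", + requires at least one
star_matches = [s for s in candidates if re.fullmatch(r"lo*", s)]
plus_matches = [s for s in candidates if re.fullmatch(r"lo+", s)]

# ^ only matches at the start of the string
anchored = bool(re.search(r"^Mr", "Mr. John"))
not_at_start = bool(re.search(r"^Mr", "Braund, Mr. Owen Harris"))

# A negated set removes everything that is not a digit
digits_only = re.sub(r"[^0-9]", "", "A/5 21171")

# (?i) switches the whole pattern to case-insensitive matching
any_case = re.findall(r"(?i)miss", "Miss MISS miss")

print(star_matches, plus_matches)
print(anchored, not_at_start)
print(digits_only, any_case)
```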

### Special Tips for Pandas

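- `.str.contains()` treats its pattern as a regex by default (`regex=True`). `.str.replace()` does so only when `regex=True` is passed, because its default has been `regex=False` since pandas 2.0.
- Escape symbols like `.` or `+` with `\` if we want their literal meaning.
- Group patterns using `()` for extraction with `.str.extract()`.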
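
The difference between literal and regex mode, and the effect of escaping the dot, can be seen in a short sketch. The two-element Series is made up for illustration and is not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical sample values, chosen to show the effect of an unescaped dot
s = pd.Series(["Mr. Smith", "Mrs Jones"])

# Literal mode: only the exact text "Mr." is removed
literal = s.str.replace("Mr.", "", regex=False)

# Regex mode: the unescaped "." matches any character, so "Mrs" is removed too
as_regex = s.str.replace("Mr.", "", regex=True)

# Regex mode with an escaped dot behaves like the literal version again
escaped = s.str.replace(r"Mr\.", "", regex=True)

print(literal.tolist())
print(as_regex.tolist())
print(escaped.tolist())
```

Passing `regex=` explicitly in every `.str.replace()` call keeps the behaviour the same across pandas versions.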

### Common String Methods

1. `.lower()` : Converts every character in a string to lowercase. This standardizes text so that later comparisons are effectively case-insensitive, which is useful when matching keywords like "miss" or "mr".

In [2]:
df['Name_lower'] = df['Name'].str.lower()
print(df[['Name', 'Name_lower']].head())

                                                Name  \
0                            Braund, Mr. Owen Harris   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...   
2                             Heikkinen, Miss. Laina   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)   
4                           Allen, Mr. William Henry   

                                          Name_lower  
0                            braund, mr. owen harris  
1  cumings, mrs. john bradley (florence briggs th...  
2                             heikkinen, miss. laina  
3       futrelle, mrs. jacques heath (lily may peel)  
4                           allen, mr. william henry  


2. `.replace()` : Replaces specific characters or patterns with another string. Can use both simple string replacement and regex. Clean strings by removing punctuation or replacing unwanted patterns before analysis or feature extraction.

In [3]:
df['Name_clean'] = df['Name'].str.replace(r"[^\w\s]", "", regex=True)
print(df[['Name', 'Name_clean']].head())

                                                Name  \
0                            Braund, Mr. Owen Harris   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...   
2                             Heikkinen, Miss. Laina   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)   
4                           Allen, Mr. William Henry   

                                        Name_clean  
0                            Braund Mr Owen Harris  
1  Cumings Mrs John Bradley Florence Briggs Thayer  
2                             Heikkinen Miss Laina  
3         Futrelle Mrs Jacques Heath Lily May Peel  
4                           Allen Mr William Henry  


3. `.contains()` : Returns `True` if a substring or pattern is found in the string. Create binary flags for presence of certain titles or labels like "Miss", "Master", "PC", etc.

In [4]:
df['Has_Miss'] = df['Name'].str.contains("Miss")
print(df[['Name', 'Has_Miss']].head())

                                                Name  Has_Miss
0                            Braund, Mr. Owen Harris     False
1  Cumings, Mrs. John Bradley (Florence Briggs Th...     False
2                             Heikkinen, Miss. Laina      True
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)     False
4                           Allen, Mr. William Henry     False
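
Columns with missing values, such as `Cabin`, need a little care: by default `.str.contains()` returns a missing value for a missing input, which is not a clean True/False flag. The sketch below uses the five `Cabin` values from the preview and the `na=` parameter to fill those gaps; treating a missing cabin as `False` is an assumption that may not suit every analysis.

```python
import numpy as np
import pandas as pd

# Cabin values copied from the first five rows shown earlier
cabin = pd.Series([np.nan, "C85", np.nan, "C123", np.nan], dtype="object")

# Default behaviour: missing input gives a missing result
raw = cabin.str.contains("C")

# na=False turns missing inputs into False, giving a pure boolean column
on_deck_c = cabin.str.contains("C", na=False)

print(raw.tolist())
print(on_deck_c.tolist())
```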


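4. `.startswith()` and `.endswith()` : Check whether a string starts or ends with a particular substring. Use them to identify patterns like ticket codes or labels that follow a prefix/suffix. Note that every row in the output below is `False`, because each `Name` begins with the surname rather than the title.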

In [5]:
df['Starts_with_Mr'] = df['Name'].str.startswith("Mr")
print(df[['Name', 'Starts_with_Mr']].head())

                                                Name  Starts_with_Mr
0                            Braund, Mr. Owen Harris           False
1  Cumings, Mrs. John Bradley (Florence Briggs Th...           False
2                             Heikkinen, Miss. Laina           False
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)           False
4                           Allen, Mr. William Henry           False
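
A prefix check is more informative on a column whose values really do begin with a code, such as `Ticket`. The sketch below, using the five names and tickets from the preview, flags tickets with the `PC` prefix and, for comparison, finds the title "Mr." anywhere in the name with `.str.contains()`. The variable names are illustrative.

```python
import pandas as pd

# Values copied from the first five rows shown earlier
tickets = pd.Series(["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"])
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Allen, Mr. William Henry",
])

# startswith takes a literal prefix, not a regex
pc_ticket = tickets.str.startswith("PC")

# \b is a word boundary; the escaped dot keeps "Mrs." from matching
has_mr = names.str.contains(r"\bMr\.")

print(pc_ticket.tolist())
print(has_mr.tolist())
```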


5. `.split()` with `.str.get()` : Splits a string into a list using a delimiter. Use `.str.get()` to access a specific part. Extract the last name from `Name` field (format is “Last, Title First”).

In [6]:
df['Last_Name'] = df['Name'].str.split(',').str.get(0)
print(df[['Name', 'Last_Name']].head())

                                                Name  Last_Name
0                            Braund, Mr. Owen Harris     Braund
1  Cumings, Mrs. John Bradley (Florence Briggs Th...    Cumings
2                             Heikkinen, Miss. Laina  Heikkinen
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)   Futrelle
4                           Allen, Mr. William Henry      Allen
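
When more than one piece of the split is needed, `expand=True` returns the pieces as DataFrame columns instead of lists. The following sketch splits only at the first comma (`n=1`); the column names `Last_Name` and `Rest` are chosen here for illustration.

```python
import pandas as pd

# Names copied from the preview above
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
])

# Split at the first comma only and spread the two pieces into columns
parts = names.str.split(",", n=1, expand=True)
parts.columns = ["Last_Name", "Rest"]

# The second piece keeps the space that followed the comma, so strip it
parts["Rest"] = parts["Rest"].str.strip()

print(parts)
```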


6. `.extract()` : Uses regex to extract parts of strings and place them into a new column. Extract titles such as “Mr”, “Miss”, “Dr” from names.

In [7]:
df['Title'] = df['Name'].str.extract(r'([A-Za-z]+)\.')
print(df[['Name', 'Title']].head())

                                                Name Title
0                            Braund, Mr. Owen Harris    Mr
1  Cumings, Mrs. John Bradley (Florence Briggs Th...   Mrs
2                             Heikkinen, Miss. Laina  Miss
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)   Mrs
4                           Allen, Mr. William Henry    Mr


**Explanation:**

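- `([A-Za-z]+)` matches one or more letters; the parentheses capture them as the extracted value.
- `\.` matches the literal dot character after the title, so only a word immediately followed by `.` qualifies.
- `.str.extract()` returns the first match in each string and a missing value where nothing matches.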
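
The same idea extends to several capture groups. In the sketch below, named groups (`?P<...>`) become column names, and `expand=False` returns a Series when there is a single group. The last entry, `"Unknown Passenger"`, is a made-up value added to show what happens when the pattern does not match.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",   # from the Titanic preview
    "Heikkinen, Miss. Laina",    # from the Titanic preview
    "Unknown Passenger",         # hypothetical row with no comma or title
])

# One group with expand=False gives a Series rather than a one-column DataFrame
title = names.str.extract(r"([A-Za-z]+)\.", expand=False)

# Two named groups give a DataFrame with columns "Last" and "Title"
parts = names.str.extract(r"^(?P<Last>[^,]+),\s*(?P<Title>[A-Za-z]+)\.")

print(title.tolist())
print(parts)
```

Rows that do not match produce missing values, so it is worth checking `title.isna().sum()` after extracting from real data.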

### AI/ML Use Case: Feature Engineering from Strings

In machine learning, string columns are usually **not used directly**. Instead, we:

- Extract **features** (e.g., Title from Name)
- Convert **categories** into numbers (Label Encoding)
- Clean noise for consistency (e.g., lowercase, punctuation removal)

Examples from Titanic:

- `Title` can indicate gender and age group.
- Cleaned `Ticket` or `Cabin` values can help detect social class or group travel.
- Flags like `Has_Miss` or `Has_Master` can correlate with survival rate.

Clean, structured string features can **improve model performance**, though the gain depends on the model and the dataset.
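
One possible way to turn an extracted `Title` column into numbers is sketched below, using `pd.factorize` for integer codes and `pd.get_dummies` for one-hot columns. The five titles match the preview rows; the choice of these two functions is an assumption made here, and other encoders (for example scikit-learn's `LabelEncoder`) would work as well.

```python
import pandas as pd

# Titles as extracted from the first five preview rows
titles = pd.Series(["Mr", "Mrs", "Miss", "Mrs", "Mr"], name="Title")

# Label encoding: each distinct title gets an integer in order of appearance
codes, uniques = pd.factorize(titles)

# One-hot encoding: one indicator column per title
dummies = pd.get_dummies(titles, prefix="Title")

print(codes.tolist(), list(uniques))
print(dummies.columns.tolist())
```

Integer codes imply an ordering that titles do not really have, so one-hot columns are often the safer input for linear models.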

## When to Use `.str` vs `.apply()` for String Logic

- Use `.str` for **simple**, **vectorized**, and **efficient** string operations.
- Use `.apply()` when:
    - The logic involves conditionals
    - We need more control (e.g., combining columns, multiple operations)

Example with `.apply()`:

In [8]:
df['NameLengthLabel'] = df['Name'].apply(lambda x: 'Short' if len(x) < 30 else 'Long')
print(df[['Name', 'NameLengthLabel']].head())

                                                Name NameLengthLabel
0                            Braund, Mr. Owen Harris           Short
1  Cumings, Mrs. John Bradley (Florence Briggs Th...            Long
2                             Heikkinen, Miss. Laina           Short
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)            Long
4                           Allen, Mr. William Henry           Short
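
For comparison, a simple two-way condition like this one can also be written without `.apply()`, by combining `.str.len()` with `numpy.where`. This is a sketch of an alternative, not a replacement for the cell above; the names come from the preview.

```python
import numpy as np
import pandas as pd

# Names copied from the preview above
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Allen, Mr. William Henry",
])

# Element-by-element Python function
via_apply = names.apply(lambda x: "Short" if len(x) < 30 else "Long")

# Same rule expressed on the whole column at once
via_str = pd.Series(np.where(names.str.len() < 30, "Short", "Long"), index=names.index)

print(via_apply.tolist())
print(via_str.tolist())
```

The `.apply()` version would raise an error on a missing value, because `len()` is not defined for `NaN`, so the function has to guard against that itself.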


### Exercise

Q1. Extract the passenger Title from the Name column

In [9]:
df['Title'] = df['Name'].str.extract(r'([A-Za-z]+)\.')
print(df[['Name', 'Title']].head())

                                                Name Title
0                            Braund, Mr. Owen Harris    Mr
1  Cumings, Mrs. John Bradley (Florence Briggs Th...   Mrs
2                             Heikkinen, Miss. Laina  Miss
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)   Mrs
4                           Allen, Mr. William Henry    Mr


Q2. Clean the Name column by removing all punctuation and converting to lowercase

In [10]:
# Remove punctuation and convert to lowercase
df['Name_clean'] = df['Name'].str.replace(r"[^\w\s]", "", regex=True).str.lower()
print(df[['Name', 'Name_clean']].head())

                                                Name  \
0                            Braund, Mr. Owen Harris   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...   
2                             Heikkinen, Miss. Laina   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)   
4                           Allen, Mr. William Henry   

                                        Name_clean  
0                            braund mr owen harris  
1  cumings mrs john bradley florence briggs thayer  
2                             heikkinen miss laina  
3         futrelle mrs jacques heath lily may peel  
4                           allen mr william henry  


> `r"[^\w\s]"` means: "match any character that is not a word character (`\w`) or whitespace (`\s`)".
>
> It removes punctuation such as commas, dots, and brackets.

Q3. Create a `Has_Master` column that flags if "Master" is in the name

In [11]:
df['Has_Master'] = df['Name'].str.contains("Master", case=False)
print(df[['Name', 'Has_Master']].head())

                                                Name  Has_Master
0                            Braund, Mr. Owen Harris       False
1  Cumings, Mrs. John Bradley (Florence Briggs Th...       False
2                             Heikkinen, Miss. Laina       False
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)       False
4                           Allen, Mr. William Henry       False


Q4. Extract the last name from Name using `.split()` 

In [12]:
df['Last_Name'] = df['Name'].str.split(',').str.get(0)
print(df[['Name', 'Last_Name']].head())

                                                Name  Last_Name
0                            Braund, Mr. Owen Harris     Braund
1  Cumings, Mrs. John Bradley (Florence Briggs Th...    Cumings
2                             Heikkinen, Miss. Laina  Heikkinen
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)   Futrelle
4                           Allen, Mr. William Henry      Allen


Q5. Create a binary column `Has_Initial` that flags whether a title like “Dr.” or “Rev.” appears in the name

In [None]:
# Non-capturing group (?:...) avoids the pandas "match groups" warning in str.contains
df['Has_Initial'] = df['Name'].str.contains(r'\b(?:Dr|Rev)\.', regex=True)
print(df[['Name', 'Has_Initial']].head())

Q6. Use `.replace()` with regex to remove all digits from the Ticket column

In [14]:
df['Ticket_no_digits'] = df['Ticket'].str.replace(r'\d+', '', regex=True)
print(df[['Ticket', 'Ticket_no_digits']].head())

             Ticket Ticket_no_digits
0         A/5 21171              A/ 
1          PC 17599              PC 
2  STON/O2. 3101282         STON/O. 
3            113803                 
4            373450                 
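
The output above still contains trailing spaces, and purely numeric tickets become empty strings. A possible follow-up, sketched here on the same five tickets, chains `.str.strip()` and derives a flag for tickets that have a non-numeric prefix. The names `prefix` and `has_prefix` are illustrative.

```python
import pandas as pd

# Ticket values copied from the output above
tickets = pd.Series(["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"])

# Remove digit runs, then trim the whitespace left behind
prefix = tickets.str.replace(r"\d+", "", regex=True).str.strip()

# Empty string means the ticket was purely numeric
has_prefix = prefix != ""

print(prefix.tolist())
print(has_prefix.tolist())
```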


## Summary

String operations in Pandas are essential for cleaning, transforming, and extracting meaningful features from text data, which is often messy, inconsistent, or unstructured in real-world datasets. In AI and machine learning workflows, raw textual columns like names, tickets, or IDs usually cannot be fed directly into models. Instead, they require preprocessing to convert text into structured, numeric, or categorical features.

Pandas provides the `.str` accessor, which applies a string operation to an entire Series (or DataFrame column) in one call and propagates missing values automatically. This keeps string handling concise and readable, even on large datasets.

Key Pandas string methods include:

- **`.lower()`**: Normalizes text by converting all characters to lowercase, helping standardize data and avoid case sensitivity issues during matching or filtering.
- **`.replace()`**: Removes or substitutes unwanted characters or patterns using plain strings or regular expressions (regex). This cleans noisy data by eliminating punctuation, digits, or special symbols.
- **`.contains()`**: Returns Boolean values indicating if a substring or regex pattern exists in each string. This is useful to create binary flags (e.g., “Has_Master”) that signal presence of keywords or labels.
- **`.startswith()` and `.endswith()`**: Check if strings begin or end with a specific substring, helping identify patterns like ticket prefixes or name titles.
- **`.split()`**: Splits strings by delimiters into lists; combined with `.str.get()` to extract specific parts (e.g., last names from “Last, Title First” format).
- **`.extract()`**: Uses regex to capture and extract specific patterns into new columns, ideal for structured feature extraction like titles (“Mr.”, “Dr.”).

Regular expressions are integral to advanced string operations in Pandas, enabling complex pattern matching and extraction. Understanding common regex symbols and syntax enhances the ability to clean and parse data effectively.

While `.str` methods are ideal for simple operations on a single text column, `.apply()` with a custom function is the fallback when the logic requires conditionals, multi-column interactions, or more complex processing. It is typically less efficient, and the function must handle missing values itself.

In summary, mastering Pandas string operations and regex lets data scientists and ML engineers convert raw text into structured, informative features that can improve model quality and analysis accuracy.