### Data Cleaning
Suppose we are cleaning data and need to apply several transformations to the following list of strings:

In [6]:
states = [" Alabama ", "Georgia!", "Georgia", "georgia", "FlOrIda", "south carolina##", "West virginia?"]
states

[' Alabama ',
 'Georgia!',
 'Georgia',
 'georgia',
 'FlOrIda',
 'south carolina##',
 'West virginia?']

In [3]:
import re

## The `re` module
The `re` module in Python provides support for regular expressions (regex), a tool for matching and manipulating text based on patterns. With it you can search, replace, and split strings using regex patterns.
### re.match()
- The `re.match()` function checks for a pattern only at the beginning of a string.
- It returns a match object, or `None` if the start of the string does not match.
```python
result = re.match(r'\d+', '123abc')
print(result.group())  # Output: 123
```
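Because `re.match()` can return `None`, calling `.group()` directly on its result can raise an `AttributeError`. The following is a minimal sketch of guarding against that; the variable names are illustrative only.

```python
import re

# re.match only tries the pattern at position 0,
# so digits that appear later in the string are not found.
no_match = re.match(r'\d+', 'abc123')
print(no_match)  # Output: None

# Check for None before calling .group().
found = re.match(r'\d+', '123abc')
if found is not None:
    print(found.group())  # Output: 123
```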
### re.search()
- The `re.search()` function searches for a pattern anywhere in a string.
- It returns a match object for the first occurrence only, or `None` if there is no match.
```python
result = re.search(r'\d+', 'abc123def')
print(result.group())  # Output: 123
```
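To see that only the first occurrence is reported, the match object's `span()` method can be inspected. This is a small illustrative sketch using a string with two runs of digits.

```python
import re

first = re.search(r'\d+', 'abc123def456')
print(first.group())  # Output: 123
print(first.span())   # Output: (3, 6)
```

The second run of digits (`456`) is ignored; `re.findall()` is the function to use when every occurrence is needed.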
### re.findall()
- Returns all non-overlapping matches of the pattern in the string as a list.
```python
results = re.findall(r'\d+', 'abc123def456')
print(results)  # Output: ['123', '456']
```
### re.sub()
- Replaces every non-overlapping match of the pattern with a replacement string and returns the new string.
```python
result = re.sub(r'\d+', 'NUM', 'abc123def456')
print(result)  # Output: abcNUMdefNUM
```
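`re.sub()` also accepts a `count` argument to limit the number of replacements, and the replacement can be a function that receives each match object. A short sketch of both, with illustrative variable names:

```python
import re

# Replace only the first match.
first_only = re.sub(r'\d+', 'NUM', 'abc123def456', count=1)
print(first_only)  # Output: abcNUMdef456

# Use a function to compute each replacement from the match.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'abc123def456')
print(doubled)  # Output: abc246def912
```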
### re.split()
- Splits the string by occurrences of the pattern.
```python
results = re.split(r'\d+', 'abc123def456')
print(results)  # Output: ['abc', 'def', '']
```
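The trailing empty string in the output above appears because the input ends with a match. As a sketch, the empty pieces can be filtered out, or the number of splits limited with `maxsplit`:

```python
import re

# Drop empty pieces produced by a match at the end of the string.
parts = [p for p in re.split(r'\d+', 'abc123def456') if p]
print(parts)  # Output: ['abc', 'def']

# Split at most once; the rest of the string is kept intact.
limited = re.split(r'\d+', 'abc123def456', maxsplit=1)
print(limited)  # Output: ['abc', 'def456']
```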
## Regular Expression Syntax
- `.` : Matches any character except a newline
- `^` : Matches the start of a string
- `$` : Matches the end of a string
- `|` : Matches either the expression before or after the `|`
- `*` : Matches 0 or more occurrences of the preceding element
- `+` : Matches 1 or more occurrences of the preceding element
- `?` : Matches 0 or 1 occurrence of the preceding element
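- `\d` : Matches any digit (equivalent to `[0-9]` with the `re.ASCII` flag; by default it also matches other Unicode decimal digits)
- `\w` : Matches any word character: letters, digits, and underscore (equivalent to `[a-zA-Z0-9_]` with the `re.ASCII` flag)
- `\s` : Matches any whitespace character (e.g., space, tab, newline)
- `{n,m}` : Matches between n and m occurrences of the preceding element (no space after the comma)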
- `[abc]` : Matches any character in the brackets
- `[^abc]` : Matches any character not in the brackets
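
A few of the elements above can be tried out directly. The sample strings below are made up for illustration.

```python
import re

# Anchors: ^ and $ pin the pattern to the start and end of the string.
starts = bool(re.search(r'^Geo', 'Georgia'))
ends = bool(re.search(r'ia$', 'Georgia'))
print(starts, ends)  # Output: True True

# ? makes the preceding element optional; | chooses between alternatives.
optional = re.findall(r'colou?r', 'color colour')
either = re.findall(r'cat|dog', 'cat dog bird')
print(optional)  # Output: ['color', 'colour']
print(either)    # Output: ['cat', 'dog']

# {n,m} bounds the repetition (greedy, so '4444' yields '444').
bounded = re.findall(r'\d{2,3}', '1 22 333 4444')
print(bounded)  # Output: ['22', '333', '444']

# A negated character class matches anything not listed in the brackets.
leftovers = re.findall(r'[^a-z ]', 'south carolina##')
print(leftovers)  # Output: ['#', '#']
```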

In [4]:
def clean_states(states: list):
    cleaned_states = []
    for state in states:
        state = state.strip()                # remove leading/trailing whitespace
        state = re.sub(r"[!#?]", "", state)  # drop the characters !, #, and ?
        state = state.title()                # capitalize each word
        cleaned_states.append(state)
    return cleaned_states

In [5]:
clean_states(states)

['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South Carolina',
 'West Virginia']

Alternatively, we can build a list of the functions we want to apply to each state and run them in order:

In [12]:
def remove_punctuation(value):
    return re.sub("[!#?]", "", value)

clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings(strings, ops):
    result = []
    for value in strings:
        for func in ops:
            value = func(value)
        result.append(value)
    return result

In [13]:
clean_strings(states, clean_ops)

['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South Carolina',
 'West Virginia']
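
One benefit of keeping the operations in a list is that a step can be added without touching `clean_strings`. The sketch below adds a hypothetical `collapse_whitespace` step (not part of the original pipeline) and runs it on made-up input.

```python
import re

def remove_punctuation(value):
    return re.sub(r"[!#?]", "", value)

def collapse_whitespace(value):
    # Hypothetical extra step: squeeze runs of whitespace into one space.
    return re.sub(r"\s+", " ", value)

def clean_strings(strings, ops):
    result = []
    for value in strings:
        # Apply each operation in order to the current value.
        for func in ops:
            value = func(value)
        result.append(value)
    return result

# The new step is inserted into the list; clean_strings is unchanged.
clean_ops = [str.strip, remove_punctuation, collapse_whitespace, str.title]
cleaned = clean_strings(["  south   carolina## ", "West virginia?"], clean_ops)
print(cleaned)  # Output: ['South Carolina', 'West Virginia']
```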