# String Methods, String Manipulation and Regular Expressions

## Introduction
In this class, we will explore some of the most important string methods, string manipulation techniques and regular expressions in Python. These tools help you manipulate and extract information from text data in a variety of contexts, including data cleaning, natural language processing, and pattern matching.

In [None]:
string = "As we know, this is a string. It can contain numbers (1), symbols (@), characters (a)"
# strings are immutable!
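
To illustrate what "immutable" means in practice, here is a minimal sketch (the variable names are only for illustration): string methods return a new string, and trying to change a character in place raises an error.

```python
text = "hello"

# Methods such as .upper() return a new string; the original is unchanged.
shouted = text.upper()
print(text)     # hello
print(shouted)  # HELLO

# Assigning to a single character is not allowed and raises a TypeError.
try:
    text[0] = "J"
except TypeError as error:
    print("Cannot modify a string in place:", error)

# To "change" a string, build a new one and rebind the name.
text = "J" + text[1:]
print(text)     # Jello
```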

# String Methods

String methods allow you to perform common operations on strings, such as finding the length of a string, converting between uppercase and lowercase letters, and splitting strings into substrings.

These are some of the most common string operations (`len()` is a built-in function; the others are string methods):

1. len(string)
2. string.upper()
3. string.lower()
4. string.title()
5. string.split()
6. string.replace()
7. string.format()
8. "separator".join(list_of_strings)
9. string.find(string2)
10. string.index(string2)



The **len()** function returns the length (number of characters, including whitespace) of the given string.

In [None]:
string = "Hello, world!"
length = len(string)
print(length)

The **.upper()** method returns a new string in which all the characters are in uppercase. 

In [None]:
string = "I like screaming"
uppercase = string.upper()
print(uppercase) 


The **.lower()** method returns a new string in which all the characters are in lowercase.

In [None]:
string = "I am QUIET"
lowercase = string.lower()
print(lowercase)


The **.title()** method returns a new string with the first letter of each word capitalized and the rest of the letters in lowercase.

In [None]:
text = "Hello there"
new_text = text.title()
print(new_text)

text2 = "hello there"
new_text2 = text2.title()
print(new_text2)

text3 = "HELLO THERE"
new_text3 = text3.title()
print(new_text3)
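
The **.split()** method from the list above splits a string into a list of substrings. The following is a small sketch with made-up example strings: called without arguments it splits on whitespace, and it also accepts an explicit separator.

```python
sentence = "Make me a list of words"

# With no argument, .split() splits on any run of whitespace.
words = sentence.split()
print(words)   # ['Make', 'me', 'a', 'list', 'of', 'words']

# A separator can be passed explicitly, e.g. for comma-separated values.
csv_line = "apple,banana,cherry"
fruits = csv_line.split(",")
print(fruits)  # ['apple', 'banana', 'cherry']
```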


The **.replace()** method returns a new string in which all occurrences of a substring in the original string are replaced with another substring.

In [None]:
string = "I do not know how to program in python"
new_string = string.replace("do not", "do")
# You can also remove a substring by replacing it with an empty string,
# for example: string.replace("not ", "")
print(new_string)

If you only want to replace the first occurrence (or the first few occurrences) of the substring, pass the maximum number of replacements as a third argument:

In [None]:
string = "I do not know how to program and I do not know how to cook"

new_string = string.replace("do not", "do", 1)  # third argument: maximum number of replacements

print(new_string)

We already know the **.format()** method: it inserts the specified values into the placeholders (`{}`) of a string.

In [None]:
name = "Bruno"
age = 27
subject = "Introduction to Programming"
uni = "LMU"
message = "My name is {} and I am {} years old, I teach {} at {}".format(name, age, subject, uni)
print(message)
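
Placeholders do not have to be empty. As a small sketch (the sentences are invented for illustration), they can refer to arguments by position or by name, and an f-string produces the same kind of result more compactly.

```python
name = "Bruno"
age = 27

# Numbered placeholders refer to the positional arguments by index.
by_index = "{0} is {1} years old. {0} teaches at LMU.".format(name, age)

# Named placeholders refer to keyword arguments.
by_name = "{who} is {years} years old.".format(who=name, years=age)

# An f-string evaluates the expressions inside the braces directly.
with_fstring = f"{name} is {age} years old."

print(by_index)
print(by_name)
print(with_fstring)
```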


The **.join()** method joins a list of strings into a single string, with the specified separator between each string.

In [None]:
words = ["Make", "me", "a", "single", "string"]
sentence = " ".join(words)
sentencewithhyphen = "-".join(words)
print(sentence)
print(sentencewithhyphen)
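
Note that **.join()** only works when every item is a string. A minimal sketch, assuming a list of integers as input, converts each item with `str()` first:

```python
numbers = [1, 2, 3]

# ", ".join(numbers) would raise a TypeError because the items are integers,
# so each item is converted to a string first.
joined = ", ".join(str(number) for number in numbers)
print(joined)  # 1, 2, 3
```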

The **.find()** method finds the first occurrence of a substring in a string and returns the index, within the original string, of the first character of that occurrence. If the substring is not found, it returns -1.

In [None]:
string = "Hello there"
index = string.find("there")
index2 = string.find("no") # returns -1 if it cannot find it
print(index)
print(index2)

The **.index()** method is similar to **.find()**, but raises a ValueError if the substring is not found.

In [None]:
string = "Hello there"
index = string.index("there")
print(index)
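
The difference between the two methods shows up when the substring is missing. The sketch below (variable names are illustrative) handles both cases: checking for -1 with **.find()**, and catching the exception with **.index()**.

```python
string = "Hello there"

# .find() signals "not found" with -1, so the result must be checked.
position = string.find("no")
if position == -1:
    print("find: substring not found")

# .index() signals "not found" by raising ValueError.
try:
    position = string.index("no")
except ValueError:
    print("index: substring not found")
```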

# Other String Manipulation Operations

## String Slicing

This refers to the ability to extract a portion of a string by specifying a range of indices. Slicing can be done using the square brackets notation, where the start index is included and the end index is excluded.

In [None]:
text = "Hello there"
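print(text[0:5])   # prints "Hello"
print(text[6:])    # prints "there" (from index 6 to the end)
print(text[:5])    # prints "Hello" (from the start up to index 5, excluded)
print(text[-5:])   # prints "there" (the last five characters)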


## String Addition

We already know this operation as concatenation. It joins two or more strings together using the **+** operator. The result is a new string that is the concatenation of the original strings.

In [None]:
string1 = "Hello"
string2 = " there"
concatenated_string = string1+string2
print(concatenated_string) 


## String Multiplication/Repetition

This involves repeating a string a given number of times using the **\*** operator. The result is a new string that consists of the original string repeated the specified number of times.

In [None]:
string = "ha"
print(string * 5)

## String Indexing

This refers to the ability to access individual characters in a string by specifying their position (index) within the string. Indexing starts at 0 for the first character, and negative indices can be used to access characters from the end of the string.

In [None]:
text = "Carbonara"
print(text[0])   # prints "C"
print(text[7])   # prints "r"
print(text[-1])  # prints "a"
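
Indexing and slicing behave differently when the position is outside the string. The following sketch reuses the same example word: an out-of-range index raises an IndexError, while an out-of-range slice is simply cut off at the end of the string.

```python
text = "Carbonara"  # 9 characters, valid indices are 0..8 (or -9..-1)

# Indexing past the end raises an IndexError.
try:
    print(text[20])
except IndexError:
    print("Index 20 is out of range")

# Slicing past the end does not raise; it simply stops at the end.
print(text[5:20])   # nara
print(text[20:])    # prints an empty line
```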


# Regular Expressions

Regular expressions (regex) are a powerful tool for searching for patterns in text data. They allow you to match specific characters, words, or patterns within a string.

Examples of regular expression syntax:
* `[ ]` to match any character within a set of characters
* `.` to match any character except a newline character
* `*` to match zero or more occurrences of the preceding character or group
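
The three constructs above can be tried out with Python's built-in `re` module, which is introduced in the next section. This is only a small sketch; the test string is invented for illustration.

```python
import re

text = "cat cot cut ct coat"

# [ ] matches exactly one character from the set: here 'a' or 'o'.
set_matches = re.findall(r"c[ao]t", text)
print(set_matches)   # ['cat', 'cot']

# . matches exactly one character of any kind (except a newline).
dot_matches = re.findall(r"c.t", text)
print(dot_matches)   # ['cat', 'cot', 'cut']

# * allows zero or more repetitions of the preceding item,
# so 'ct' (zero vowels) and 'coat' (two vowels) both match.
star_matches = re.findall(r"c[aou]*t", text)
print(star_matches)  # ['cat', 'cot', 'cut', 'ct', 'coat']
```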

You can try regex here:

https://regex101.com/

## Applying Regex to Python

**re** is a module in Python's standard library that lets you use regular expressions.

These are some of its functions:


* **re.compile()** to compile a regular expression into a reusable pattern object
* **re.search()** and **re.match()** to search for patterns in strings
* **re.findall()** to extract all matches from a string
* **re.sub()** to replace every match of a pattern in a string with another string



**re.search()** is used to search for a pattern within a string. It returns a match object for the first occurrence of the pattern, or None if the pattern is not found.

In [None]:
import re

# Search for the pattern 'fox' in the string
result = re.search(r'fox', 'The quick brown fox jumps over the lazy dog')

# Check if the pattern was found
if result:
    print('Pattern found')
    print(result.group())  # .group() with no argument returns the whole matched text
else:
    print('Pattern not found')


**re.match()** is used to match a pattern at the beginning of a string. It returns a match object if the string starts with the pattern, or None otherwise, even if the pattern occurs later in the string.

In [None]:
import re

# Match the pattern 'The' at the beginning of the string
result = re.match(r'The', 'The quick brown fox jumps over the lazy dog')

# Check if the pattern was found
if result:
    print('Pattern found')
else:
    print('Pattern not found')
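
To make the difference between the two functions concrete, the sketch below applies both to the same sentence with a pattern that does not occur at the start. The match object's `.start()` and `.end()` methods report where the match was found.

```python
import re

sentence = 'The quick brown fox jumps over the lazy dog'

# 'fox' is not at the start, so re.match() returns None...
print(re.match(r'fox', sentence))    # None

# ...while re.search() scans the whole string and finds it.
found = re.search(r'fox', sentence)
print(found.group())                 # fox
print(found.start(), found.end())    # 16 19
```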

**re.findall()** is used to find all occurrences of a pattern within a string and return them as a list.

In [None]:
import re

# Find all occurrences of the pattern 'cat' in the string
result = re.findall(r'cat', 'I have a cat and a kitten, but my neighbor has two cats')

# Print the list of occurrences
print(result)


**re.sub()** is used to search for a regular expression pattern in a given string and replace all occurrences of the pattern with a specified string. It takes three arguments:

* The regular expression pattern to search for.
* The replacement string to replace any matches found.
* The string to search and replace in.

In [None]:
import re

text = "The quick brown fox jumps over the lazy dog"
regex = r"\b\w{4}\b"  # matches 4-letter words

new_text = re.sub(regex, "****", text)

print(new_text)
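
Like **.replace()**, **re.sub()** accepts an optional limit on the number of replacements, here through the `count` argument. The sketch below also uses `re.subn()`, a related function in the same module that returns the number of replacements alongside the new string.

```python
import re

text = "The quick brown fox jumps over the lazy dog"

# count=1 limits the substitution to the first match only.
first_only = re.sub(r"\b\w{4}\b", "****", text, count=1)
print(first_only)    # The quick brown fox jumps **** the lazy dog

# re.subn() also reports how many replacements were made.
new_text, replacements = re.subn(r"\b\w{4}\b", "****", text)
print(replacements)  # 2
```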


**re.compile()** is used to pre-compile a regular expression pattern into a pattern object, which can then be used for various operations.

In [None]:
import re

# Define the regular expression pattern
pattern = re.compile(r'\d+')

# Use the pattern object to perform operations
result = pattern.findall('I have 2 cats and 3 dogs')
print(result)

# As with re.sub(), findall() does not modify the original string;
# it returns a new list containing the matches.
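
A compiled pattern is most useful when the same expression is applied many times. As a rough sketch (the list of sentences is made up), the pattern object can be reused in a loop and offers the same operations as the module-level functions.

```python
import re

digits = re.compile(r'\d+')

sentences = ['I have 2 cats and 3 dogs', 'No pets here', 'Room 101']

for sentence in sentences:
    first = digits.search(sentence)  # same as re.search(r'\d+', sentence)
    if first:
        print(sentence, '->', first.group())
    else:
        print(sentence, '-> no number')

# The pattern object also offers .sub(), .findall(), .match(), etc.
masked = digits.sub('#', 'I have 2 cats and 3 dogs')
print(masked)  # I have # cats and # dogs
```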

## Other Regular Expression Examples

1. `^[A-Za-z]+$`
- This regular expression matches any string that contains only alphabetic characters. The `^` and `$` anchors ensure that the entire string is made up of letters, without any extra characters at the beginning or end. The character class `[A-Za-z]` matches any uppercase or lowercase letter in the English alphabet, and the `+` quantifier means that the character class must appear one or more times in the string.
- Limits: This regular expression will not match strings that contain spaces, punctuation marks, or any characters outside of the English alphabet.

2. `\d{3}-\d{2}-\d{4}`
- This regular expression matches text in the format of a standard US Social Security number. The `\d` shorthand character class matches any digit character, and the `{3}`, `{2}`, and `{4}` quantifiers specify the exact number of digits that must appear in each section of the number. The `-` character matches the literal hyphens that separate the three sections.
- Limits: This regular expression will not match Social Security numbers written in a different format, such as those with spaces or without hyphens. It also does not check whether the digits fall within the allowed ranges for each section, so it accepts any digits, including 000-00-0000. Because it has no anchors, it can also match inside a longer piece of text.

3. `[a-z]+[A-Z]+[0-9]+`
- This regular expression matches one or more lowercase letters, immediately followed by one or more uppercase letters, immediately followed by one or more digits, such as `abcXYZ123` or `aB1`. The three character classes `[a-z]`, `[A-Z]`, and `[0-9]` match the corresponding characters, and the `+` quantifier means that each character class must appear one or more times. The order of the character classes means that the lowercase letters must come first, then the uppercase letters, then the digits, with nothing in between.
- Limits: This regular expression will not match text in which the three kinds of characters appear in a different order, are interleaved (for example `a1B`), or are separated by spaces, punctuation marks, or other characters. Because it has no `^` and `$` anchors, it can match part of a longer string.

4. `\b\w{4,}\b`
- This regular expression matches any word in a string that is at least four characters long. The `\b` word boundary matches the beginning or end of a word, and the `\w` shorthand character class matches any word character, including letters, digits, and underscores. The `{4,}` quantifier specifies that the word must be at least four characters long.
- Limits: `\w` does not include punctuation marks or spaces, so a word containing a hyphen or an apostrophe is split at that character. In `well-known`, for example, `well` and `known` are matched as two separate words.
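
The four patterns above can be checked quickly in Python. This is only a sketch: the test strings and variable names are invented for illustration.

```python
import re

# 1. Only letters, from start to end.
only_letters = bool(re.search(r'^[A-Za-z]+$', 'Python'))
with_digit = bool(re.search(r'^[A-Za-z]+$', 'Python 3'))
print(only_letters, with_digit)  # True False

# 2. Three digits, two digits, four digits, separated by hyphens.
ssn_like = re.findall(r'\d{3}-\d{2}-\d{4}', 'SSN: 123-45-6789')
print(ssn_like)                  # ['123-45-6789']

# 3. Lowercase letters, then uppercase letters, then digits, back to back.
mixed = re.findall(r'[a-z]+[A-Z]+[0-9]+', 'abcXYZ123 and aB1 but not a1B')
print(mixed)                     # ['abcXYZ123', 'aB1']

# 4. Words with at least four word characters.
long_words = re.findall(r'\b\w{4,}\b', 'The quick brown fox jumps')
print(long_words)                # ['quick', 'brown', 'jumps']
```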

## Even more with code

Matching an email address: `^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`


- `^` - start of the string
- `[a-zA-Z0-9._%+-]+` - matches one or more occurrences of any character in the set a-z, A-Z, 0-9, period (.), underscore (_), percent (%), plus (+), or hyphen (-)
- `@` - matches the at symbol (@) that separates the username from the domain name
- `[a-zA-Z0-9.-]+` - matches one or more occurrences of any character in the set a-z, A-Z, 0-9, period (.), or hyphen (-), which make up the domain name
- `\.` - escapes the period character to match the dot (.) that separates the domain name from the top-level domain (TLD)
- `[a-zA-Z]{2,}` - matches two or more occurrences of any character in the set a-z or A-Z, which make up the TLD
- `$` - end of the string

So, this regular expression will match any email address that consists of one or more alphanumeric characters, periods, underscores, percent signs, plus signs, or hyphens, followed by the at symbol (@), followed by one or more alphanumeric characters, periods, or hyphens, followed by a dot (.) and then two or more letters from a-z or A-Z.

However, it's important to note that this regular expression has some limitations. For example, it doesn't account for all the possible valid characters in an email address, such as some non-ASCII characters. Additionally, it doesn't check if the email address actually exists or is deliverable, just that it follows a general format.

In [None]:
import re

email_regex = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

email1 = 'john.doe@example.com'
email2 = 'jane@example'

if re.match(email_regex, email1):
    print(f"{email1} is a valid email address")
else:
    print(f"{email1} is not a valid email address")

if re.match(email_regex, email2):
    print(f"{email2} is a valid email address")
else:
    print(f"{email2} is not a valid email address")
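
When a whole string has to be validated, `re.fullmatch()` is a slightly stricter alternative: `$` also matches just before a trailing newline, whereas `fullmatch()` requires the pattern to cover the entire string. The sketch below wraps this in a helper; the names `EMAIL_PATTERN` and `is_valid_email` are hypothetical and not part of the `re` module.

```python
import re

# Same pattern as above, without ^ and $ because fullmatch() anchors both ends.
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def is_valid_email(address):
    """Return True if the whole string has the general shape of an email address."""
    return EMAIL_PATTERN.fullmatch(address) is not None

print(is_valid_email('john.doe@example.com'))    # True
print(is_valid_email('jane@example'))            # False
print(is_valid_email('john.doe@example.com\n'))  # False
```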


Matching a phone number in a specific format: `\b\d{3}[-.]?\d{3}[-.]?\d{4}\b`

- `\b` is a word boundary anchor that matches the empty string at the beginning or end of a word. In this case, it is used to make sure the phone number is not part of a larger word.
- `\d` is a shorthand character class that matches any digit (equivalent to `[0-9]`).
- `{3}` is a quantifier that matches exactly three occurrences of the preceding pattern. In this case, it matches three digits.
- `[-.]?` is an optional character class that matches either a hyphen or a dot. The `?` makes the preceding character or character class optional, so the phone number can have either hyphens or dots, or neither.
- The pattern `[-.]?` appears twice: once after the first group of three digits and once after the second group of three digits. Each separator is therefore optional independently of the other.
- `\b` is used again to match the end of the phone number.

So, this regular expression will match phone numbers that have either no separator between the groups of three digits, or have hyphens or dots as separators. The phone number must also be surrounded by word boundaries to ensure it is not part of a larger word. 

Here are some examples of phone numbers that would match this regular expression:
- 555-555-5555
- 555.555.5555
- 5555555555
- 555-5555555
- 555.5555555

And here are some examples of strings that would not match:
- 555.555.555 (only three digits in the last group)
- 555-5555-5555 (four digits in the middle group)
- (555) 555-5555 (parentheses and a space are not part of the pattern)

Note that 555555-5555 does match, even though only one separator is present, because each `[-.]?` is optional on its own.

In [None]:
import re

phone_regex = r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'


text = 'Call me at 555-123-0000 or 555.567.8901 or 044-0000000'

phone_numbers = re.findall(phone_regex, text)

print(f"Phone numbers found: {phone_numbers}")


Splitting a string into a list of words with regex: `\W+`

The regular expression `\W+` matches one or more non-word characters. In regex, a "word" character is usually defined as any alphanumeric character and underscores (`a-z`, `A-Z`, `0-9`, and `_`), while a non-word character is anything that is not a word character. 

So, the expression `\W+` will match any sequence of one or more consecutive non-word characters. Examples of non-word characters are spaces, punctuation marks, special characters, etc.

For example, if we apply this regular expression to the string "Hello! How are you?", the exclamation mark and the space before the word "How" are consecutive non-word characters, so they form a single match. In total there are four matches: `['! ', ' ', ' ', '?']`.

In [None]:
import re

text = 'The quick brown fox jumps over the lazy dog'

words = re.split(r'\W+', text)

print(words)
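
One detail to watch for: if the text ends (or starts) with a non-word character, `re.split()` returns an empty string at that end of the list. The following sketch, with an illustrative sentence, shows the effect and one simple way to filter it out.

```python
import re

text = 'Hello! How are you?'

# The final '?' is a separator with nothing after it,
# so the result ends with an empty string.
pieces = re.split(r'\W+', text)
print(pieces)  # ['Hello', 'How', 'are', 'you', '']

# Dropping empty strings leaves only the words.
words = [piece for piece in pieces if piece]
print(words)   # ['Hello', 'How', 'are', 'you']
```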


Extracting all the URLs from a string: `https?://\S+`

This regular expression matches URLs starting with either "http://" or "https://", followed by one or more non-whitespace characters (denoted by `\S+`), so the match ends at the next whitespace character or at the end of the string. The `?` after `s` makes the `s` optional, so the regular expression will match both "http://" and "https://".

For example, it would match "https://www.google.com", "http://example.com/page.html", but would not match "ftp://example.com" because it does not start with "http://" or "https://".

Overall, this regular expression is useful for extracting URLs from text data, such as web pages or social media posts.

In [None]:
import re

text = 'Visit my website at https://www.example.com or http://example.com'

url_regex = r'https?://\S+'

urls = re.findall(url_regex, text)

print(urls)
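
Because `\S+` accepts any non-whitespace character, punctuation that directly follows a URL in a sentence becomes part of the match. The sketch below (the example text and the set of stripped characters are assumptions, not a complete solution) removes common trailing punctuation afterwards.

```python
import re

text = 'Docs are at https://www.example.com/docs, mirror at http://example.com.'

urls = re.findall(r'https?://\S+', text)
print(urls)
# ['https://www.example.com/docs,', 'http://example.com.']

# Strip common trailing punctuation from each match.
clean_urls = [url.rstrip('.,;:!?') for url in urls]
print(clean_urls)
# ['https://www.example.com/docs', 'http://example.com']
```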


# Exercises


For each exercise, create a separate ".py" file. You can choose the name of each function!

1. Write a function that takes a string as input and returns the number of words in the string. A word is defined as a sequence of characters separated by whitespace.

2. Write a function that takes a string as input and returns the number of sentences in the string. A sentence is defined as a sequence of words ending with a punctuation mark (. or ! or ?).

3. Write a function that takes a string as input and returns the string with all the vowels removed.

4. Write a function that takes a list of phone numbers and returns a dictionary where the keys are the area codes and the values are lists of phone numbers that belong to each area code. Assume all phone numbers have the format "XXX-XXX-XXXX" or "(XXX) XXX-XXXX", where X is a digit. Use regular expressions to extract the area code from each phone number, and use a dictionary to store the results.


5. Write a function that takes a string as input and returns a list of all the unique words in the string, sorted alphabetically. Use regular expressions to split the string into words, and use a set to store the unique words.

6. Write a function that takes a dictionary structured like this as input:

In [None]:
input_dict = {"Bruno":["hello@lmu.de", "hi@gmail.com", "yes@dkes.fak12.lmu.de"], "John":["john@live.it", "johnlive.it", "john@lmu.de"]}

The function should return a list of the valid LMU emails (those ending with "@lmu.de" or "@dkes.fak12.lmu.de"). The result for that dictionary would be: ["hello@lmu.de","yes@dkes.fak12.lmu.de","john@lmu.de"]

7. Create a Jupyter notebook and document all the previous exercises.