# Strings in Python

## Introduction to Strings

### A string is a sequence

In Python, a string is a sequence of characters enclosed in single quotes `'` or double quotes `"` and is used to represent text. As a built-in sequence type, a string supports indexing (access to a character by position), slicing (extraction of a substring), iteration in loops, and operations such as concatenation and repetition.

Examples of string operations showcasing their sequence-like behavior [Downey, 2015, Python Software Foundation, 2023]:
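
As a minimal sketch (the variable names are illustrative and not taken from the cited sources), the following cell touches each of the sequence operations mentioned above:

```python
word = "Python"

# Indexing: access a single character by position (zero-based)
first_char = word[0]          # 'P'

# Slicing: extract a substring (the stop index is exclusive)
prefix = word[:2]             # 'Py'

# Iteration: visit each character in order
chars = [ch for ch in word]   # ['P', 'y', 't', 'h', 'o', 'n']

# Concatenation and repetition create new strings
joined = word + "!"           # 'Python!'
repeated = word * 2           # 'PythonPython'

# len() gives the number of characters
length = len(word)            # 6

print(first_char, prefix, chars, joined, repeated, length)
```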

### Strings are immutable

In Python, strings are immutable objects. This means that once a string is created, its contents cannot be changed or modified.

```python
text = "Hello, World!"

# Attempting to change a character in place raises an error
try:
    text[0] = 'h'
except TypeError as error:
    print(error)  # 'str' object does not support item assignment

# Build a new string from a slice instead; text[:7] is "Hello, "
modified_text = text[:7] + 'Python!'
print(modified_text)  # Output: "Hello, Python!"
```
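
One consequence of immutability is that string methods return new strings and leave the original untouched. The short sketch below illustrates this; the names `original`, `shouted`, and `alias` are chosen only for the example:

```python
original = "Hello, World!"

# String methods never modify the string in place; they return a new object
shouted = original.upper()

print(original)             # Hello, World!
print(shouted)              # HELLO, WORLD!
print(shouted is original)  # False: a separate string object

# Reassigning a name binds it to a new string; the old string is unchanged
alias = original
original = original.replace("World", "Python")
print(alias)                # Hello, World!
print(original)             # Hello, Python!
```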

## String Operations

### Indexing

In Python, string indexing allows you to access individual characters within a string. The indexing is zero-based, which means the first character of the string has an index of 0, the second character has an index of 1, and so on. You can also use negative indexing, where -1 represents the last character, -2 represents the second-to-last character, and so on.

<center>
<img src="https://raw.githubusercontent.com/HatefDastour/hatefdastour.github.io/master/_notes/Introduction_to_Digital_Engineering/_images/String_Indexing.png" alt="picture" height="100">
</center>

```python
my_string = "Hello, World!"

# Accessing individual characters using positive indexing
print(my_string[0])   # H
print(my_string[7])   # W

# Accessing individual characters using negative indexing
print(my_string[-1])  # !
print(my_string[-6])  # W
```
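
The relationship between positive and negative indices, and what happens outside the valid range, can be sketched as follows. This is an illustrative cell, with `n` introduced here as shorthand for the string's length:

```python
my_string = "Hello, World!"
n = len(my_string)  # 13 characters, so valid indices run from -13 to 12

# The last character can be reached with either index
assert my_string[n - 1] == my_string[-1] == "!"

# A negative index i refers to the same position as n + i
assert my_string[-6] == my_string[n - 6] == "W"

# Indexing outside the valid range raises IndexError
try:
    my_string[n]
except IndexError as error:
    print("IndexError:", error)  # IndexError: string index out of range
```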

### Slicing

In Python, string slicing extracts substrings using:

```python
string[start:stop]
```

Here:
- `start`: Inclusive index of the first character.
- `stop`: Exclusive index of the substring's end.

This creates a new string containing characters from `start` to just before `stop`.

Let's see some examples of string slicing:

<center>
<img src="https://raw.githubusercontent.com/HatefDastour/hatefdastour.github.io/master/_notes/Introduction_to_Digital_Engineering/_images/String_Indexing.png" alt="picture" height="100">
</center>

```python
text = "Hello, World!"

# Slice from index 7 to the end of the string
substring2 = text[7:]
print(substring2)  # World!

# Slice from index 2 up to, but not including, index 8
substring3 = text[2:8]
print(substring3)  # llo, W

# Negative indices can also be used for slicing (counting from the end of the string)
substring6 = text[-6:-1]
print(substring6)  # World
```
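
A few further properties of `start` and `stop` are worth seeing in code. The sketch below assumes nothing beyond the built-in slicing rules; the split position `i` is arbitrary:

```python
text = "Hello, World!"

# Omitting start begins at index 0; omitting stop runs to the end
print(text[:5])     # Hello
print(text[7:])     # World!

# For any index i, the two halves always reassemble the original string
i = 5
assert text[:i] + text[i:] == text

# Unlike indexing, slicing past the end does not raise an error;
# out-of-range bounds are clamped to the string's length
print(text[7:100])  # World!
print(text[50:])    # prints an empty line, because the slice is ''
```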

## Concatenation

Concatenation joins two or more strings into a single new string. Because strings are immutable, the original strings are left unchanged and the result is a separate object.

There are several ways to concatenate strings in Python [Martelli et al., 2023, Python Software Foundation, 2023]:

### Using the + Operator

The `+` operator is used to concatenate strings by placing them next to each other.

```python
string1 = "Hello, "
string2 = "Calgary!"
result = string1 + string2
print(result)  # Hello, Calgary!
```
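
When more than a couple of pieces are involved, repeated `+` becomes verbose. The following sketch compares it with two alternatives, `str.join()` and adjacent string literals; the list `parts` is made up for the illustration:

```python
parts = ["Hello", "from", "Calgary"]

# + works well for a handful of strings
greeting = parts[0] + " " + parts[1] + " " + parts[2]

# str.join() builds the result from any iterable of strings in one call
joined = " ".join(parts)

# Adjacent string literals are concatenated automatically
literal = "Hello " "from " "Calgary"

print(greeting)  # Hello from Calgary
print(joined)    # Hello from Calgary
print(literal)   # Hello from Calgary
```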

### Using f-strings

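An f-string is a string literal prefixed with `f`. Expressions written inside curly braces are evaluated and their values are inserted into the resulting string.

```python
name = "John"
city = "Calgary"
description = f"My name is {name} and I live in {city}."
print(description)  # My name is John and I live in Calgary.
```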

F-strings can also control how numbers are displayed. A format specifier placed after a colon inside the braces sets details such as the number of decimal places, zero padding, and field width. For example:

```python
# Formatting integers and floating-point numbers
integer_value = 45
float_value = 3.141592653589793

formatted_integer = f"The formatted integer: {integer_value:04d}"
formatted_float = f"The formatted float: {float_value:.2f}"

print(formatted_integer)  # The formatted integer: 0045
print(formatted_float)    # The formatted float: 3.14
```
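
A few more format specifiers, shown as a sketch. The numeric values are made up for illustration, and the trailing `|` only makes the padding visible:

```python
population = 1306784   # hypothetical figure, for illustration only
ratio = 0.4567
price = 9.5

print(f"{population:,}")   # 1,306,784  (thousands separator)
print(f"{ratio:.1%}")      # 45.7%      (percentage with one decimal place)
print(f"{price:8.2f}|")    #     9.50|  (width 8, right-aligned by default)
print(f"{price:<8.2f}|")   # 9.50    |  (width 8, left-aligned)
print(f"{42:05d}")         # 00042      (zero-padded to width 5)
```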

## Searching

### Finding a character/word in a string

In Python, you can search for a substring within a string in several ways. Two common approaches are:

**1. Using the `in` operator:**

The `in` operator checks whether a substring occurs within a string. It evaluates to `True` if the substring is found and `False` otherwise.


```python
text = "Hello, Calgary!"

if "Hello" in text:
    print("Substring found.")      # this branch runs
else:
    print("Substring not found.")
```

**2. Using the `find()` method:**

The `find()` method returns the index of the first occurrence of the substring within the string. If the substring is not found, it returns -1.

```python
text = "Hello, Calgary!"

# "World" does not occur in text, so find() returns -1
index = text.find("World")
if index != -1:
    print("Substring found at index:", index)
else:
    print("Substring not found.")  # this branch runs
```
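
`find()` also accepts an optional start position, which makes it possible to locate later occurrences. The sketch below uses a hypothetical helper, `find_all`, which is not a built-in method, to collect every match:

```python
text = "Hello, Calgary! Hello, Canada!"

# find() accepts an optional start position, so a search can resume after a hit
first = text.find("Hello")              # 0
second = text.find("Hello", first + 1)  # 16

def find_all(haystack, needle):
    """Hypothetical helper: return the start index of every occurrence."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        # Resume one character later, so overlapping matches are included
        start = haystack.find(needle, start + 1)
    return positions

print(first, second)            # 0 16
print(find_all(text, "Ca"))     # [7, 23]
print(find_all(text, "World"))  # []
```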

## String methods

In Python, strings are objects with built-in methods for transforming, searching, splitting, and joining text. Because strings are immutable, each method returns a new value rather than changing the original string. Here are some common string methods [Downey, 2015, Python Software Foundation, 2023]:


| Method            | Description                                                                                                         |
|-------------------|---------------------------------------------------------------------------------------------------------------------|
| `upper()`         | Converts all characters in the string to uppercase.                                                                |
| `lower()`         | Converts all characters in the string to lowercase.                                                                |
| `capitalize()`    | Capitalizes the first character of the string and makes the rest lowercase.                                        |
| `strip()`         | Removes leading and trailing whitespace characters (spaces, tabs, newlines) from the string.                      |
| `split()`         | Splits the string into a list of substrings at a given delimiter (runs of whitespace if none is given).            |
| `join()`          | Joins an iterable of strings into a single string, using the calling string as the separator.                     |
| `replace()`       | Replaces occurrences of a substring with another substring.                                                        |
| `find()`          | Finds the index of the first occurrence of a substring in the string. Returns -1 if not found.                    |

```python
# capitalize()
text = "hello, world!"
print(text.capitalize())  # Hello, world!

# strip()
text = "   hello, world!   "
print(text.strip())       # hello, world!

# split()
text = "apple,banana,orange"
fruits = text.split(",")
print(fruits)             # ['apple', 'banana', 'orange']

# join()
fruits = ['apple', 'banana', 'orange']
text = ",".join(fruits)
print(text)               # apple,banana,orange

# replace()
text = "Hello, World!"
modified_text = text.replace("Hello", "Hi")
print(modified_text)      # Hi, World!
```
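
The remaining methods from the table, `upper()`, `lower()`, and `find()`, can be sketched in the same way. The sample string `raw` is made up for this example, which also shows that methods can be chained because each one returns a new string:

```python
raw = "  Calgary, ALBERTA  "

# upper() and lower() return new strings in a single case
print(raw.upper())   # prints "  CALGARY, ALBERTA  " (surrounding spaces kept)
print(raw.lower())   # prints "  calgary, alberta  "

# Methods can be chained: strip the whitespace, then lowercase the result
cleaned = raw.strip().lower()
print(cleaned)       # calgary, alberta

# Searching the lowercased text gives a simple case-insensitive find()
position = cleaned.find("alberta")
print(position)      # 9

# split() with no argument splits on runs of whitespace
words = "  one   two three ".split()
print(words)         # ['one', 'two', 'three']
```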

## The `in` operator

In Python, the `in` operator checks whether a value occurs in a sequence or collection, such as a string, list, tuple, or dictionary. It evaluates to `True` if the value is found and `False` otherwise. For a string it tests for a substring, and for a dictionary it tests the keys.

Here are some examples of using the `in` operator:

### Using `in` with a string

```python
text = "Hello, World!"
print('o' in text)  # True
print('z' in text)  # False
```
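
The same operator works on the other collection types. In the sketch below, the collections and the values they hold are hypothetical and serve only to illustrate membership tests:

```python
cities = ["Calgary", "Edmonton", "Red Deer"]
coordinates = (51.05, -114.07)                  # illustrative values only
population = {"Calgary": 1.3, "Edmonton": 1.0}  # hypothetical figures, in millions

print("Calgary" in cities)          # True
print("calgary" in cities)          # False: the comparison is case-sensitive
print(51.05 in coordinates)         # True
print("Calgary" in population)      # True: dictionaries are tested by key
print(1.3 in population)            # False: 1.3 is a value, not a key
print(1.3 in population.values())   # True
print("Banff" not in cities)        # True: not in is the negated form

# A string tests for substrings, but a list compares whole elements
print("gar" in "Calgary")           # True
print("gar" in cities)              # False
```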

## Introduction to Regular Expressions

Regular expressions (regex) are a compact language for describing text patterns. In Python they are provided by the built-in `re` module [Python Software Foundation, 2023] and are widely used for tasks such as data validation and text processing.

Regex is used to:

* **Define Patterns**: Describe specific character sequences to search for or extract from formatted strings.

* **Search and Replace**: Find occurrences of a pattern in text and replace them, which simplifies bulk text processing.

* **Validate Input**: Check that input such as email addresses, numbers, or dates follows the expected format.

* **Data Extraction**: Pull out the parts of a string that match a pattern without writing manual string-handling code.

### Building Regular Expression Patterns

A regular expression pattern combines ordinary characters, which match themselves, with special metacharacters that set the matching rules. Key metacharacters and their meanings include:

- `.`: Matches any character except a newline, acting as a versatile wildcard for a single character.

- `*`: Matches zero or more instances of the previous element, indicating it can appear any number of times or not at all.

- `+`: Matches one or more instances of the preceding element, requiring at least one occurrence.

- `?`: Matches zero or one instance of the preceding element, denoting optional presence.

- `^`: Matches the start of the string. Inside a character class, as in `[^0-9]`, it negates the class instead.

- `$`: Matches the end of the string.

- `[...]`: Defines a character class, matching any single character listed inside the brackets.

- `(...|...)`: Groups elements with parentheses; `|` separates alternative patterns.

- `\`: Escapes special characters, matching them literally.
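
Each metacharacter in the list above can be tried on a small input. The test strings in this sketch are invented for illustration:

```python
import re

# . matches any single character except a newline
print(re.findall(r"c.t", "cat cot cut ct"))      # ['cat', 'cot', 'cut']

# * means zero or more, + one or more, ? zero or one of the preceding element
print(re.findall(r"ab*", "a ab abbb"))           # ['a', 'ab', 'abbb']
print(re.findall(r"ab+", "a ab abbb"))           # ['ab', 'abbb']
print(re.findall(r"colou?r", "color colour"))    # ['color', 'colour']

# ^ and $ anchor the pattern to the start and end of the string
print(bool(re.search(r"^Hello", "Hello, World!")))  # True
print(bool(re.search(r"World$", "Hello, World!")))  # False: the string ends with "!"

# [...] matches one character from a set
print(re.findall(r"[aeiou]", "Calgary"))         # ['a', 'a']

# (...|...) matches one of several alternatives
print([m.group() for m in re.finditer(r"gr(a|e)y", "gray grey groy")])  # ['gray', 'grey']

# \ makes a metacharacter literal: \. matches only a real dot
print(re.findall(r"3.14", "3.14 3x14"))          # ['3.14', '3x14']
print(re.findall(r"3\.14", "3.14 3x14"))         # ['3.14']
```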

**Remark:**

Creating effective regular expression (regex) patterns involves these key strategies:

1. **Problem Understanding:** Before you start writing a regular expression, thoroughly grasp your problem and identify target patterns in the text.

2. **Test Data:** Develop representative test data to validate your regex on real scenarios.

3. **Start Simple:** Begin with basic patterns and gradually add complexity.

4. **Online Testers:** Utilize online regex testers like Regex101.com, RegExr.com, or Pythex.com for experimentation.

5. **Escape Characters:** Be mindful of escaping special characters like ".", "*", "?" for literal matching (e.g., `\.`).

6. **Quantifiers and Groups:** Employ quantifiers and parentheses for specifying repetition and grouping.

7. **Greedy Matching:** Understand and control greedy matching with non-greedy quantifiers (e.g., `*?`, `+?`).

8. **Character Classes:** Use character classes (e.g., `[0-9]`) to match sets of characters.

9. **Practice and Refinement:** Regular expression skills improve with practice and iterative refinement.

10. **Learn from Examples:** Analyze existing regex patterns for similar tasks from online resources and communities.

Balancing simplicity, accuracy, and efficiency is vital when crafting regex patterns for specific needs.
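
Strategies 7 and 8 are easiest to appreciate side by side. The following sketch, using a made-up HTML fragment, contrasts a greedy quantifier, its non-greedy form, and a character class that reaches the same result:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* consumes as much as possible, so the single match runs
# from the first "<" to the last ">"
print(re.findall(r"<.*>", html))     # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the earliest possible ">"
print(re.findall(r"<.*?>", html))    # ['<b>', '</b>', '<i>', '</i>']

# A negated character class is often a clearer alternative
print(re.findall(r"<[^>]*>", html))  # ['<b>', '</b>', '<i>', '</i>']
```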

### Regular Expression Modifiers

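Modifiers (called flags in the `re` module) alter regular expression behavior. They are passed as an optional argument to functions such as `re.search()` and `re.findall()`. Common modifiers:

- `re.IGNORECASE` or `re.I`: Enable case-insensitive matching.
- `re.MULTILINE` or `re.M`: Make `^` and `$` match at the start and end of each line, not only at the start and end of the whole string.
- `re.DOTALL` or `re.S`: Let `.` match any character, including newlines.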
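
The effect of each modifier can be seen by running the same pattern with and without it. The three-line sample text in this sketch is invented for the purpose:

```python
import re

text = "Hello\nhello again\nHELLO once more"

# Without flags: case-sensitive, and ^ matches only at the very start of the string
print(re.findall(r"^hello", text))                 # []

# re.IGNORECASE: the capitalized word at the start of the string now matches
print(re.findall(r"^hello", text, re.IGNORECASE))  # ['Hello']

# re.MULTILINE: ^ also matches at the start of every line
print(re.findall(r"^hello", text, re.MULTILINE))   # ['hello']

# Flags are combined with the | operator
print(re.findall(r"^hello", text, re.I | re.M))    # ['Hello', 'hello', 'HELLO']

# re.DOTALL: . also matches the newline between the first two lines
print(re.search(r"Hello.hello", text))             # None
print(re.search(r"Hello.hello", text, re.DOTALL) is not None)  # True
```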

### Examples and Use Cases

Here are some practical examples of using the `re` module in Python to perform common text manipulation tasks:

<font color='Blue'><b>Examples - Extracting Email Addresses:</b></font>

```python
import re

text = "Please contact me at john@example.com or jane@example.com"
pattern = r'\S+@\S+'
emails = re.findall(pattern, text)
print(emails)  # ['john@example.com', 'jane@example.com']
```

`pattern = r'\S+@\S+'`: This line sets the regular expression pattern used for finding email addresses. Let's break down the pattern:
   - `\S+`: Matches one or more non-whitespace characters. This matches the username part of the email address.
   - `@`: Matches the "@" symbol literally.
   - `\S+`: Matches one or more non-whitespace characters. This matches the domain part of the email address.
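
Because `\S` accepts any non-whitespace character, the pattern also captures punctuation that directly follows an address. The sketch below shows the effect and one possible tightening. The name `stricter_pattern` and the pattern itself are assumptions made for illustration; it is not a complete email validator:

```python
import re

text = "Write to john@example.com, or to jane@example.com."

# The loose pattern keeps the punctuation that follows each address
loose = re.findall(r"\S+@\S+", text)
print(loose)     # ['john@example.com,', 'jane@example.com.']

# A stricter sketch (an assumption, not a full validator):
# letters, digits, underscores, dots, plus signs, or hyphens before the @,
# then a domain that must end in a dot followed by at least two letters
stricter_pattern = r"[\w.+-]+@[\w.-]+\.[A-Za-z]{2,}"
stricter = re.findall(stricter_pattern, text)
print(stricter)  # ['john@example.com', 'jane@example.com']
```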

<font color='Blue'><b>Examples - Validating Phone Numbers:</b></font>

```python
import re

def is_valid_phone_number(phone_number):
    pattern = r'^\d{3}-\d{3}-\d{4}$'  # This pattern matches phone numbers in the format XXX-XXX-XXXX
    return re.match(pattern, phone_number) is not None

phone1 = "123-456-7890"
phone2 = "555-1234"  # Invalid format

print(is_valid_phone_number(phone1))  # True
print(is_valid_phone_number(phone2))  # False
```

Here's how the pattern works:

- `^`: This symbol indicates the start of the string.

- `\d{3}`: This part matches exactly three digits. The `\d` represents any digit, and the `{3}` specifies that there should be exactly three occurrences.

- `-`: This matches the hyphen character literally.

- `\d{3}`: Similar to the previous part, this matches another three digits.

- `-`: Another hyphen.

- `\d{4}`: This matches exactly four digits.

- `$`: This symbol indicates the end of the string.
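
One subtlety of `$` is that it also matches just before a newline at the very end of the string, so the validator above accepts `"123-456-7890\n"`. A minimal sketch of a stricter variant uses `re.fullmatch`, which requires the entire string to match. The names `PHONE_PATTERN` and `is_valid_phone_number_strict` are hypothetical:

```python
import re

PHONE_PATTERN = r"\d{3}-\d{3}-\d{4}"

def is_valid_phone_number_strict(phone_number):
    """Hypothetical variant of is_valid_phone_number using re.fullmatch."""
    return re.fullmatch(PHONE_PATTERN, phone_number) is not None

# With re.match and ^...$, a trailing newline is still accepted,
# because $ also matches just before a newline at the end of the string
print(re.match(r"^\d{3}-\d{3}-\d{4}$", "123-456-7890\n") is not None)  # True

# re.fullmatch requires the whole string to match the pattern
print(is_valid_phone_number_strict("123-456-7890"))    # True
print(is_valid_phone_number_strict("123-456-7890\n"))  # False
print(is_valid_phone_number_strict("555-1234"))        # False
```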