# String Data Type in Python

In Python, a string (`str`) is an immutable sequence of Unicode characters: once created, it cannot be changed in place, but you can build a new string from it. String literals are written with single quotes (' ') or double quotes (" "), or with triple quotes for multi-line text. They can contain letters, digits, special characters, spaces, and escape sequences (such as \n for a new line).
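A minimal sketch of what immutability means in practice: assigning to a position fails, while building a new string from the old one works. The variable names are illustrative only.

```python
greeting = "Hello"

# Strings do not support item assignment; this raises a TypeError.
try:
    greeting[0] = "J"
except TypeError as error:
    print("Cannot modify in place:", error)

# Build a new string instead; the original is left untouched.
new_greeting = "J" + greeting[1:]
print(greeting, new_greeting)
```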

## String Operations

- Concatenation: Joining two or more strings together.
- Indexing: Accessing a specific character in a string using its position.
- Slicing: Extracting a portion of a string.

In [2]:
# String concatenation
string1 = "Hello"
string2 = "World"
concatenated_string = string1 + " " + string2
print("Concatenated String:", concatenated_string)

# String indexing
index = 4
character_at_index = string1[index]
print(f"Character at index {index} in '{string1}':", character_at_index)
print(' - Note: the first character in a string has index 0.')

# String slicing (the end index is exclusive)
start_index = 1
end_index = 4
sliced_string = string1[start_index:end_index]
print(f"Sliced string from index {start_index} to {end_index} in '{string1}':", sliced_string)


Concatenated String: Hello World
Character at index 4 in 'Hello': o
 - Note: the first character in a string has index 0.
Sliced string from index 1 to 4 in 'Hello': ell


In [3]:
# Same operations, relying on the notebook to display the value of the last expression

# Defining the strings
string1 = "Hello"
string2 = "World"

# String concatenation
concatenated_string = string1 + " " + string2

# String indexing
index = 4
character_at_index = string1[index]

# String slicing
start_index = 1
end_index = 4
sliced_string = string1[start_index:end_index]

(concatenated_string, character_at_index, sliced_string)


('Hello World', 'o', 'ell')

## String Character Encoding

A character encoding maps each character in a given set to a number and, in turn, to a sequence of bytes, so that text can be stored and transmitted. Several encodings remain in use for historical and practical reasons.

1. ASCII (American Standard Code for Information Interchange):

- It was one of the first character encodings and is based on the English alphabet.
- ASCII uses 7 bits to represent each character, allowing for 128 different symbols.
- It covers English letters, digits, common punctuation, and control characters, but has no characters for other languages.
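The 7-bit limit can be checked directly in Python. The sketch below uses only the built-ins `ord()`, `chr()`, and `str.encode()`; the variable names are illustrative.

```python
# ord() gives the numeric code of a character, chr() does the reverse.
code_of_a = ord("A")
char_of_97 = chr(97)

# Every ASCII code fits in 7 bits, i.e. it lies in the range 0-127.
ascii_bytes = "Hello".encode("ascii")
all_seven_bit = all(byte < 128 for byte in ascii_bytes)

# Characters outside ASCII cannot be encoded with the 'ascii' codec.
try:
    "č".encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(code_of_a, char_of_97, all_seven_bit, encodable)
```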

2. Unicode:

- Unicode is a computing industry standard designed to consistently represent text expressed in most of the world's writing systems.
- Its code space holds 1,114,112 code points, enough to cover a wide array of languages and symbols.
- Unicode is an abstract representation. To store Unicode characters in memory or on disk, we need a specific encoding, such as UTF-8 or UTF-16.
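To illustrate the difference between an abstract code point and its stored bytes, the following sketch prints the code point and official name of a few characters, then encodes one of them in two different encodings. It relies on the standard `unicodedata` module; the sample characters are arbitrary.

```python
import unicodedata

# Each character has exactly one code point and one official name.
for character in "Ač–":
    print(f"U+{ord(character):04X}", unicodedata.name(character))

code_points = [ord(character) for character in "Ač–"]

# The same code point is stored as different bytes depending on the encoding.
utf8_bytes = "č".encode("utf-8")
utf16_bytes = "č".encode("utf-16-le")
print(utf8_bytes, utf16_bytes)
```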

3. UTF-8 (Unicode Transformation Format - 8 bits):

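- It is a variable-width character encoding that can represent any character in the Unicode standard, using one to four bytes per character.
- UTF-8 is backward-compatible with ASCII: any ASCII text is also valid UTF-8 with exactly the same bytes.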
- It's the dominant character encoding for the web.
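A short sketch of the variable width and the ASCII compatibility. The sample characters are arbitrary; the emoji is written as an escape sequence so the example does not depend on the editor's encoding.

```python
# One character from each UTF-8 length class: ASCII, Latin with diacritic,
# punctuation from the General Punctuation block, and an emoji.
samples = ["A", "č", "–", "\U0001F600"]
byte_lengths = [len(character.encode("utf-8")) for character in samples]
print(byte_lengths)

# Pure ASCII text produces identical bytes in both encodings.
ascii_text = "Hello, World!"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)
```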

4. Code Page (Windows Encoding):

- Before Unicode became widespread, many different encodings were created to handle different languages and character sets. Windows had its own set of encodings known as "code pages".
- Each code page supports a different character set. For instance, CP1250 covers Central and Eastern European languages that use the Latin script, CP1251 covers Cyrillic script, and CP1252 covers Western European languages.
- With the advent of Unicode, code pages have become less common, but you might still encounter them in legacy systems or older data.
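Because a code page assigns its own meaning to the byte values 128-255, the same byte decodes to a different character under each code page. A small sketch, using an arbitrarily chosen byte:

```python
raw_byte = b"\xe8"

# Decode the same single byte with three different Windows code pages.
decoded = {codec: raw_byte.decode(codec) for codec in ("cp1250", "cp1251", "cp1252")}
print(decoded)
```

This is why text read with the wrong code page shows up as seemingly random accented or Cyrillic letters.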

Converting Between Various String Encodings in Python:

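In Python, the *encode()* method converts a string (`str`) to `bytes` in a given encoding, and the *decode()* method converts `bytes` back to a string. To convert data from one encoding to another, decode the bytes using the source encoding and encode the resulting string using the target encoding.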

In [9]:
# Original string (contains a special character for demonstration purposes)
original_string = "Hello, World! – Special Character and Local Characters čšž"

# Encoding the string to bytes in UTF-8
utf8_encoded = original_string.encode('utf-8')

# Decoding the bytes from UTF-8 to string
utf8_decoded = utf8_encoded.decode('utf-8')

# Encoding the string to bytes in CP1250 (Windows Central European encoding)
# errors='ignore' silently drops characters CP1250 cannot represent (none in this string)
cp1250_encoded = original_string.encode('cp1250', errors='ignore')

# Decoding the bytes from CP1250 to string
cp1250_decoded = cp1250_encoded.decode('cp1250')

utf8_encoded, utf8_decoded, cp1250_encoded, cp1250_decoded


(b'Hello, World! \xe2\x80\x93 Special Character and Local Characters \xc4\x8d\xc5\xa1\xc5\xbe',
 'Hello, World! – Special Character and Local Characters čšž',
 b'Hello, World! \x96 Special Character and Local Characters \xe8\x9a\x9e',
 'Hello, World! – Special Character and Local Characters čšž')
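The `errors` argument only matters when the target encoding lacks a character. As a sketch, CP1252 (Western European) is used here because it contains š and ž but not č; the choice of encoding and sample text is just for illustration.

```python
text = "čšž"

# Default (errors='strict'): an unsupported character raises an exception.
try:
    text.encode("cp1252")
    strict_failed = False
except UnicodeEncodeError as error:
    strict_failed = True
    print("strict:", error.reason)

# errors='ignore' drops the character; errors='replace' substitutes '?'.
ignored = text.encode("cp1252", errors="ignore")
replaced = text.encode("cp1252", errors="replace")
print(ignored, replaced)
```

Both `'ignore'` and `'replace'` lose information, so the original string cannot be recovered by decoding the result.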

## Sorting Strings

The built-in *sorted()* function returns a new list containing all items of an iterable, sorted in ascending order by default. Applied to a string, it treats the string as a sequence of characters and returns a sorted list of those characters. Strings and characters are compared by Unicode code point, so uppercase letters sort before lowercase ones and non-ASCII letters such as č sort after z.

In [17]:
# Original string
string = "programming"
# string = "češnja"

# Sorting the string
sorted_characters = sorted(string)

# Joining the sorted characters to form a sorted string
sorted_string = ''.join(sorted_characters)

sorted_string


'aggimmnoprr'

In [18]:
# List of strings
string_list = ["banana", "želod", "apple", "cherry", "date", "češnja"]

# Sorting the list of strings
sorted_list = sorted(string_list)

sorted_list


['apple', 'banana', 'cherry', 'date', 'češnja', 'želod']
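Because *sorted()* compares strings by Unicode code point, 'češnja' ends up after 'date', which is not where a Slovenian dictionary would place it. One way to get alphabetical order is a custom `key` function. The sketch below is an assumption for illustration: `SLOVENIAN_ALPHABET`, `RANK`, and `slovenian_key` are hypothetical names, and the approach ignores details that a full collation library would handle.

```python
# Hypothetical helper: the 25 letters of the Slovenian alphabet in dictionary order.
SLOVENIAN_ALPHABET = "abcčdefghijklmnoprsštuvzž"
RANK = {letter: position for position, letter in enumerate(SLOVENIAN_ALPHABET)}

def slovenian_key(word):
    # Letters outside the alphabet (e.g. 'y') sort after all known letters,
    # ordered among themselves by code point.
    return [(RANK.get(letter, len(RANK)), letter) for letter in word.lower()]

string_list = ["banana", "želod", "apple", "cherry", "date", "češnja"]
sorted_list = sorted(string_list, key=slovenian_key)
print(sorted_list)
```

The standard library also offers `locale.strxfrm` as a sort key, but its result depends on the locales installed on the system.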