# Strings

Strings are one of the most common types of objects used in Python, containing text data. A string is a sequence of zero or more characters, ranging from an empty string or a single letter, through data values like names or locations, to entire paragraphs of text.

Later, in machine learning, text processing is (in my opinion) one of the most interesting applications of analytics. We can use large amounts of text to create models that can identify emotions in text, translate languages, or even generate text like ChatGPT. We need to process and organize all that text data to make it useful for our modelling projects.

## String Basics

### Defining a String

Strings are defined with either single quotes or double quotes. Which one is used doesn't matter, as long as the start and end quotes match.
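
As a small sketch of the quoting rule (the variable names here are only for illustration), either quote style produces the same string, and choosing the other style is an easy way to include a quote character in the text:

```python
single = 'Hello'
double = "Hello"
print(single == double)          # True, the quote style is not part of the string

contraction = "I'm here"         # double quotes let us use ' inside
quotation = 'She said "hi"'      # single quotes let us use " inside
empty = ""                       # a string can also be empty

print(contraction)
print(quotation)
print(len(empty))                # 0
```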

### Printing a String

In a Jupyter notebook, a cell automatically displays the value of its final line, so a string on the last line of a cell is shown without any extra work. The more reliable way to display strings is the print function, which we can call on anything, anywhere in the cell, to print it.

In [111]:
# We can simply declare a string
'Hello World'
# note that we can't output multiple strings this way
'Hello World 1'
'Hello World 2'
# We can print multiple strings this way
print('Hello World 1')
print('Hello World 2')
print('Use \n to print a new line')
print('\n')
print('See what I mean?')

Hello World 1
Hello World 2
Use 
 to print a new line


See what I mean?


### String Concatenation

Concatenation is the process of combining two strings together. This is done with the + operator. This is another example of operator overloading - the '+' operator is overloaded, or redefined, in the string class to do something logical for strings. The operator combines the two strings, in the order they are written, and returns one new string.

We can also put multiple items in one print statement, separated by commas. This will print each item, separated by a space. This is an easy way to have multiple items in one print statement without having to worry about having to concatenate them together. This is useful for debugging, where we might want to print multiple variables at once and don't want to waste time making them look nice. 

In [112]:
string1 = "I'm the start"
string2 = "and I'm the end"
punctuation = "!"
aNumber = 59

print(string1 + " " + string2 + punctuation)
print(string1, " ", aNumber, string2, punctuation)

I'm the start and I'm the end!
I'm the start   59 and I'm the end !


### String Conversions

Strings can be converted to other data types using built-in functions like int() and float(). This is useful if we have extracted a numerical value from somewhere and we want to use it in a calculation. Going in the other direction, we can use the str() function to convert many objects to strings. This is very common; we often need to do it when we want to do some string operation, like concatenation, on a non-string object. In general, if something non-string is asked to print, it will just work; if something non-string is asked to do something "beyond" just printing, like concatenation, it will need to be converted to a string first.

<b>Note:</b> this string conversion calls the __str__() method of an object, which provides the string representation of that object. When creating our own classes we need to write this to make sense for whatever we are creating. 

In [113]:
stringNum = "47"
print(stringNum + 50) # This will throw an error

TypeError: can only concatenate str (not "int") to str

In [None]:
numFromString = int(stringNum)
print(numFromString + 50) # This will work

97


In [None]:
print("My number:" + numFromString) # This will throw an error

TypeError: can only concatenate str (not "int") to str

In [None]:
print("My number:" + str(numFromString)) # This will work

My number:47
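
Conversion only works when the text actually looks like a number; otherwise int() and float() raise a ValueError. As a minimal sketch, the helper to_int_or_none below is a hypothetical name (not part of Python) that wraps int() so that bad input returns None instead of stopping the program:

```python
def to_int_or_none(text):
    """Return int(text), or None if text is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None

print(to_int_or_none("47"))     # 47
print(to_int_or_none(" 47 "))   # 47, int() tolerates surrounding whitespace
print(to_int_or_none("47.5"))   # None, int() will not parse a decimal point
print(float("47.5"))            # 47.5
```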


### Strings as Containers

Strings can also be thought of as containers, where each element in the string is a single character. In older, lower-level languages, strings were actually stored as arrays of characters. In Python we can treat a string mostly like a list of characters. 

#### String Indexing

We can use square bracket notation to index individual characters in a string, or a "slice" to extract a sub-section of characters in a string. Just like a list we can use negative indices to count from the end of the string. The indexing of a string is zero-based, so the first character is at index 0.

![String Indexing](../../images/string_index.png)
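
A short sketch of indexing and slicing, using an arbitrary example word:

```python
word = "Python"

print(word[0])     # P, the first character
print(word[-1])    # n, the last character
print(word[1:4])   # yth, from index 1 up to (but not including) index 4
print(word[:2])    # Py, everything before index 2
print(word[2:])    # thon, everything from index 2 onward
print(word[::2])   # Pto, every second character
print(len(word))   # 6
```

A slice never raises an error for going past the end of the string, but indexing a single position that does not exist (such as word[10]) raises an IndexError.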

#### String Methods

Strings have a set of built-in methods that can be used to manipulate their contents.
<ul>
<li>upper() - returns a string with all characters upper case</li>
<li>lower() - returns a string with all characters lower case</li>
<li>capitalize() - returns a string with the first character upper case</li>
<li>count() - returns the number of times a character or substring occurs in a string</li>
<li>find() - returns the index of the first occurrence of a character or substring in a string</li>
<li>replace() - returns a string with all occurrences of a character or substring replaced with another character or substring</li>
<li>split() - returns a list of substrings separated by a delimiter</li>
<li>join() - returns a string that is a concatenation of the strings in a list</li>
</ul>

The documentation for strings is located here: https://docs.python.org/3/library/stdtypes.html#string-methods 
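
The methods listed above can be tried out on a short example phrase (chosen here only for illustration). Note that join() is called on the separator string, not on the list:

```python
phrase = "the quick brown fox"

print(phrase.upper())                 # THE QUICK BROWN FOX
print(phrase.capitalize())            # The quick brown fox
print(phrase.count("o"))              # 2
print(phrase.find("quick"))           # 4
print(phrase.replace("fox", "cat"))   # the quick brown cat

words = phrase.split(" ")             # break the phrase apart on spaces
print(words)                          # ['the', 'quick', 'brown', 'fox']
print("-".join(words))                # the-quick-brown-fox
```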

In [None]:
longString = 'I am a very long string with lots of characters in me'


### String Characteristics

As a data structure, strings are immutable - we cannot change individual elements of a string. If we wish to, we need to create a new string with the changes. This also means that when we use a function like upper() or lower() on a string, we are not changing the original string, but creating a new string with the changes.

In [None]:
# Attempt to change a character directly
# We want to make it start with "I-am..."
longString[1] = "-"
longString

TypeError: 'str' object does not support item assignment

In [None]:
# Doing the same thing as above
longString2 = longString[0]+'-'+longString[2:]
longString2

'I-am a very long string with lots of characters in me'

We can also use one of the String class methods that seems to change the original string, like replace. Here, we have two lines that will "modify" and print our long string. Going line by line:
<ul>
<li> The first statement does the replacement, but that string with replacements only exists as the return value of the replace method. It is not stored anywhere here, it is only the input to the print function.</li>
<li> The second statement displays longString after the replace() method runs; we can see it is unchanged.</li>
</ul>

In [None]:
print(longString.replace(' ','-'))
longString

I-am-a-very-long-string-with-lots-of-characters-in-me


'I am a very long string with lots of characters in me'

If we wanted to capture the result of the replace() method, we would need to assign it to a variable.

In [None]:
newString = longString.replace(' ','-')
print(longString)
print(newString)

I am a very long string with lots of characters in me
I-am-a-very-long-string-with-lots-of-characters-in-me


## String Escape Sequences

There are a number of special characters that can be used in strings. These are called escape sequences. They are used to represent characters that are not printable, or that otherwise cannot be typed directly into a string. In Python, we need to preface some characters with a backslash to indicate that they are escape sequences.

<ul>
<li>\n - newline</li>
<li>\t - tab</li>
<li>\' - single quote</li>
<li>\" - double quote</li>
<li>\\ - backslash</li>
</ul>

In [None]:
print("some text and then a tab \t and then some more text and then a new line \n and then some more text")

some text and then a tab 	 and then some more text and then a new line 
 and then some more text


### Raw Strings

If we want to use a string that contains a lot of backslashes, such as a Windows file path or a regular expression, we can use a raw string. Raw strings are defined by putting an r before the opening quote of a string. This tells Python to treat each backslash as a literal character instead of the start of an escape sequence.

In [None]:
# Raw
print(r"some text and then a tab \t and then some more text and then a new line \n and then some more text")

some text and then a tab \t and then some more text and then a new line \n and then some more text


### Triple Quotes

Triple quotes are used to define multi-line strings. They can be used with either single or double quotes. Triple quotes are also used to define docstrings, which are special strings used to document functions and classes that we'll look at soon.

In [None]:
# Triple quotes
tripleQuoteString = """This is a string that spans multiple lines
                    and has a tab \t and a new line \n and a backslash \\
                    and a double quote \" and a single quote \' and a carriage return \r
                    and a form feed \f and a vertical tab \v
                    and a bell \a and a unicode character \u0394"""
print(tripleQuoteString)

This is a string that spans multiple lines
                    and has a tab 	 and a new line 
 and a backslash \
                    and a double quote " and a single quote ' and a carriage return 
                    and a form feed  and a vertical tab 
                    and a bell  and a unicode character Δ


### Unicode Strings

In programming languages there are several different ways to represent characters. The oldest common standard is ASCII, which uses 7 bits to represent each character, allowing for 128 different characters (8-bit extensions raise that to 256). This is enough for the English alphabet, numbers, and some special characters, but not enough for other languages that use different alphabets. Unicode is a standard that assigns a unique number, called a code point, to each character. It has room for over a million code points (0 through 0x10FFFF), which are stored using encodings such as UTF-8 or UTF-16. Python 3 uses Unicode by default, so we can use any character from any language in our strings.

This means that every character we can think of typing is represented by some ID number. We can use the built-in ord() function to find the ID number of a character, and the chr() function to find the character associated with an ID number. We can use these IDs to insert characters into our strings that we don't have a key for on our keyboard. For things like Greek Letters, foreign punctuation, or even emojis we can look up the code for that character and insert it into our string.

The Wikipedia article here has a list, but there are many sites listing and explaining the codes. https://en.wikipedia.org/wiki/List_of_Unicode_characters 

In [None]:
umlaut = "\u00E4"
letterQ = "\u0051"
emoji = "\U0001F617"

print(umlaut)
print(letterQ)
print(emoji)

ä
Q
😗


In [None]:
# Look up the unicode value for a character
print(ord(umlaut))
print(chr(228))

228
ä


## String Formatting

When printing strings, we often want to include variables in the output. We can do this by concatenating strings and variables together, but this can get messy. Python provides three main ways to format strings to include variables: the % operator, the format() method, and f-strings. This formatting is primarily used in data science to print readable output in our notebooks. We are generally exploring and revising inside a notebook as we work, and we are the ones reading that output, so it pays to make it readable.

### String Formatting with the % Operator

The % operator can be used to format strings. The % operator is used to format a set of variables enclosed in a "tuple" (a fixed size list), together with a format string, which contains normal text together with "argument specifiers", special symbols like %s and %d. The symbols used are (you can probably stop after the first three):
<ul>
<li>%s - String (or any object with a string representation, like numbers)</li>
<li>%d - Integers</li>
<li>%f - Floating point numbers</li>
<li>%.Nf - Floating point numbers with N digits to the right of the decimal point (for example, %.2f).</li>
<li>%x/%X - Integers in hex representation (lowercase/uppercase)</li>
</ul>

The structure of the string is:

"actual string with the % operators in it" % (tuple of variables)

In [None]:
# Using the % operator
print('The %s %s fox jumped over the %s dog' %('quick', 'brown', 'lazy'))

The quick brown fox jumped over the lazy dog
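
The numeric specifiers work the same way as %s. A small sketch with made-up values (item, quantity, and price are illustrative names only):

```python
item = "widget"
quantity = 3
price = 4.5

line = "%d %ss at $%.2f each" % (quantity, item, price)
print(line)                              # 3 widgets at $4.50 each

total = "Total: %f" % (quantity * price)
print(total)                             # Total: 13.500000, %f defaults to 6 decimals

hex_line = "255 in hex is %x or %X" % (255, 255)
print(hex_line)                          # 255 in hex is ff or FF
```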

We can also split a long expression over multiple lines by wrapping it in parentheses. This is a common practice in Python, and is used to make code more readable. We'll use this pretty regularly later in machine learning when we need to make function calls that have a lot of parameters. Note the parentheses around the whole expression below: inside them, adjacent string literals are joined into one string automatically, and the % operator is then applied to the combined string.

In [None]:
reallyLongString = ("I am some text that is so %s "
                    "that I need to split it up over "
                    "%s lines for readability" %
                    ('long', 'multiple')
                    )
print(reallyLongString)

I am some text that is so long that I need to split it up over multiple lines for readability

### String Formatting with the format() Method

The format() method is another way to format strings. It is more powerful than the % operator, and can be used to format strings in a variety of ways. The basic idea is to include placeholders in the string, which are then replaced by the arguments of the format() method. The placeholders are defined by curly braces: {}. Anything that is not contained in braces is considered literal text, which is copied unchanged to the output. If you need to include a brace character in the literal text, it can be escaped by doubling: {{ and }}.

The placeholders can include a number, which is used to specify the position of the argument to the format() method. This allows for re-arranging the order of display without changing the arguments. They can also include a variable name, which is used to specify the keyword argument to the format() method. This allows for re-arranging the order of arguments, if they are all included as keyword arguments.

#### Print Formatting

There are many modifiers that we can use with the format() method to control the formatting of the output. For example, below we've forced the age to print as a float with a forced positive sign and a different number of decimal places. These modifiers are inserted after a colon in the curly brace placeholder. The general syntax is:

{<argument_index_or_keyword>:<format_spec>}

There are lots and lots of options here:

![Formatting Options](../../images/formatting.png "Formatting Options")

In general, being formatting experts isn't critically important for us, but we should be able to print things out in a nice format if we need to. 

In [185]:
# Format with the format() method
print('Insert another string with curly brackets: {}'.format('The inserted string'))
print('My name is {:} and I am {:+f} years old'.format('John', 30))
print('My name is {one} and I am {two:+.1f} years old'.format(one='John', two=30))
print('My name is {1} and I am {0} years old'.format('John', 30))


Insert another string with curly brackets: The inserted string
My name is John and I am +30.000000 years old
My name is John and I am +30.0 years old
My name is 30 and I am John years old


### F-Strings

F-strings were introduced in Python 3.6. They are similar to the format() method, but are simpler to use and faster to execute. They are indicated by an f before the opening quotation mark of the string. Similar to the format() method, they use curly braces to indicate where values will be substituted into the string. The variables, or any other expressions, are simply written inside the braces.

In [None]:
# F-Strings
name = 'John'
age = 30
print(f'My name is {name} and I am {age} years old')

My name is John and I am 30 years old


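One odd note about f-strings is that, before Python 3.12, a backslash cannot appear inside the curly-brace expression part, so writing something like {'\n'} is a syntax error. (A \n in the literal text of an f-string works fine.) We can get around this by declaring a variable with the newline character in it, and then using that variable in the f-string.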

In [169]:
newline = '\n'
print(f'Hello, {name}!{newline}You are {age:+f} years old')

Hello, John!
You are +30.000000 years old
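
The braces in an f-string accept any expression, and the same format modifiers used with format() go after a colon. A brief sketch, where balance is an illustrative value that does not appear elsewhere in this notebook:

```python
name = "John"
age = 30
balance = 1234.5

next_year = f"{name} will be {age + 1} next year"   # expressions are evaluated
money = f"Balance: {balance:,.2f}"                  # thousands separator, 2 decimals
aligned = f"[{name:>10}]"                           # right-aligned in 10 characters
quoted = f"{name!r}"                                # !r uses repr(), so quotes appear
braces = f"{{literal braces}}"                      # doubled braces print as braces

print(next_year)   # John will be 31 next year
print(money)       # Balance: 1,234.50
print(aligned)     # [      John]
print(quoted)      # 'John'
print(braces)      # {literal braces}
```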


## Exercises

### Exercise 1

Write a function that takes a string as input and returns the string reversed.

### Exercise 2

Write a function that takes a string as input and returns the number of times each word is repeated. The function should return a dictionary with the words as keys and the number of times each word is repeated as values.

### Exercise 3

Write a function that takes a string as input and returns the number of times each letter is repeated. The function should return a dictionary with the letters as keys and the number of times each letter is repeated as values.

In [186]:
# Exercises
    

## String Logic

Strings can be used in logical expressions and we frequently use logic based on strings in our programs.

<b>Note:</b> Empty strings are considered False, and non-empty strings are considered True. 
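
A quick sketch of this truthiness rule, using a hypothetical helper named describe. Notice that a string holding only a space, or the character "0", is still non-empty and therefore counts as True:

```python
def describe(text):
    """Report whether a string is empty, relying on its truth value."""
    if text:
        return "has content"
    return "empty"

print(describe(""))        # empty
print(describe(" "))       # has content, a space is still a character
print(describe("0"))       # has content, it is text, not the number 0
print(describe(" ".strip()))  # empty, strip() removed the only character
```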

### String Comparison

Strings can be compared using the standard comparison operators. The comparison is done character by character, starting with the first character in each string. If the characters match, the next pair is compared, and so on. Two strings are equal only if they have the same length and every pair of characters matches. At the first position where the characters differ, the string with the smaller character is considered smaller; "smaller" means a lower Unicode code point, which matches ASCII order for plain English text. If one string is a prefix of the other, the shorter string is considered smaller.

In [188]:
text1 = 'Hello'
text2 = 'Yello'
text3 = 'Hell'
text4 = 'Hello'

print(text1 == text2)
print(text1 == text3)
print(text1 == text4)
print(text1 != text2)

False
False
True
True


#### String Ordering

Strings will also allow for comparisons like less than or greater than, but some care needs to be taken. These test whether one string is "lexicographically" greater or lesser, which roughly means which would come first in the dictionary. This is useful for sorting strings, but not so much for other things. One concern is that uppercase and lowercase characters are not treated the same here: all uppercase English letters come before all lowercase letters in the ordering.

If we want more robust order comparisons, we should probably write something ourselves, for example by converting both strings to the same case before comparing them.

In [189]:
text1 < text2

True

In [190]:
"hello" < "Hello"

False
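
One common way to sidestep the uppercase and lowercase issue is to compare case-normalized copies of the strings. The sketch below uses the built-in casefold() method as the sort key; the list of names is made up for illustration:

```python
names = ["bob", "Alice", "carol", "Dave"]

default_order = sorted(names)                    # uppercase letters sort first
folded_order = sorted(names, key=str.casefold)   # case is ignored when ordering

print(default_order)   # ['Alice', 'Dave', 'bob', 'carol']
print(folded_order)    # ['Alice', 'bob', 'carol', 'Dave']
print("hello".casefold() == "Hello".casefold())  # True
```

The original strings are untouched; casefold() is only used to decide the order.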

### String Comparison Methods

Strings have a set of built-in methods that can be used to test the contents of a string. These methods return a boolean value, either True or False. Some methods are:
<ul>
<li>isalpha() - returns True if all characters in the string are alphabetic</li>
<li>isalnum() - returns True if all characters in the string are alphanumeric</li>
<li>isdecimal() - returns True if all characters in the string are decimal</li>
<li>isspace() - returns True if all characters in the string are whitespace</li>
<li>islower() - returns True if all characters in the string are lower case</li>
<li>isupper() - returns True if all characters in the string are upper case</li>
<li>istitle() - returns True if the string is title cased</li>
</ul>

These are useful for checking the contents of a string if we are cleaning up some data. We can test to see if a string matches whatever we require from it, such as only having letters in it, then go and do whatever cleaning we need to do to fix it. We can also use them for easy conditions - if a value is a number, go and do something; if it is not, go clean it up. 

In [191]:
print(text1.isalnum())
print(text1.islower())

True
False
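
As a sketch of the "if it is a number, use it; if not, clean it up" idea, the loop below keeps only the values that are whole numbers after stripping whitespace. The sample values are invented for illustration:

```python
raw_values = ["42", "7", "n/a", " 15 ", "3.5", ""]

clean = []
for value in raw_values:
    value = value.strip()        # remove stray spaces first
    if value.isdecimal():        # True only if every character is a digit
        clean.append(int(value))

print(clean)   # [42, 7, 15]
```

Note that "3.5" is rejected because the decimal point is not a digit, and the empty string is rejected because isdecimal() needs at least one character.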


#### Character Comparisons

Note that when comparing things, sometimes there are characters that look the same but are different characters in the encoding. If comparing these, the comparison will not see them as the same, because they have different Unicode code points. This is something to be aware of when using large text datasets, particularly if that text was gathered from multiple sources; some clean up is needed to ensure that we aren't hurt by odd issues like this.

One notable example of this is the two different types of dashes, the emdash and the endash. When printed they look almost the same, but they are not. In Python, these are represented by the strings "\u2014" and "\u2013". If we compare these strings, they are not equal, because they are different characters.

<b>Mostly Irrelevant Note:</b> These are used in typography to indicate different things, and are different characters. The emdash is used to indicate a break in a sentence, and is the width of a capital M. The endash is used to indicate a range of numbers, and is the width of a capital N. 

In [192]:
emdash = '\u2014'
endash = '\u2013'

print(emdash)
print(endash)
print(emdash == endash)

—
–
False
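
One possible clean-up, sketched below, is to map the look-alike characters onto a single character before comparing. Which characters to fold together is an assumption that depends on the data; here the two dashes and the Unicode minus sign are all mapped to a plain hyphen:

```python
# Translation table: each dash-like character becomes an ASCII hyphen.
# The set of characters chosen here is an assumption for illustration.
DASHES = str.maketrans({"\u2013": "-", "\u2014": "-", "\u2212": "-"})

range_a = "2019\u20132023"   # written with an endash
range_b = "2019-2023"        # written with a plain hyphen

print(range_a == range_b)                     # False
print(range_a.translate(DASHES) == range_b)   # True
```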


### String Loops

Since strings are containers, we can loop over them. A for loop over a string will return each character in the string in order. We can use this to do things like count the number of times a character appears in a string.

In [193]:
dictChars = {}
for char in longString:
    dictChars[char] = dictChars.get(char, 0) + 1

dictChars

{'I': 1,
 ' ': 11,
 'a': 4,
 'm': 2,
 'v': 1,
 'e': 3,
 'r': 4,
 'y': 1,
 'l': 2,
 'o': 3,
 'n': 3,
 'g': 2,
 's': 3,
 't': 4,
 'i': 3,
 'w': 1,
 'h': 2,
 'f': 1,
 'c': 2}
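
The standard library also offers collections.Counter, which does the same tallying in one step. This is shown as an alternative, not a replacement for understanding the loop above; the string is repeated so the example stands alone:

```python
from collections import Counter

text = "I am a very long string with lots of characters in me"
counts = Counter(text)           # counts every character, like the loop above

print(counts[" "])               # 11
print(counts["a"])               # 4
print(counts["z"])               # 0, missing keys count as zero
print(counts.most_common(1))     # [(' ', 11)]
```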

### String Membership

We can use the "in" operator to check if a string is a substring of another string. We can also use the startswith() and endswith() methods to check if a string starts or ends with a particular substring. All of these methods return a boolean value, so they can be used in logical expression conditions easily. 

### String Searching

We can use the find() method to search for a substring in a string. This method returns the index of the first occurrence of the substring, or -1 if the substring is not found. We can also use the rfind() method to search for a substring starting from the end of the string. This method returns the index of the last occurrence of the substring, or -1 if the substring is not found.

In [194]:
# in Strings
isIn1 = "very" in longString
isIn2 = "Cowabunga" in longString

print(isIn1)
print(isIn2)
print(longString.find("very"))

True
False
7
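
The startswith(), endswith(), and rfind() methods mentioned above can be sketched on an example file name (the name itself is made up):

```python
filename = "report_2023_final.csv"

print(filename.startswith("report"))          # True
print(filename.endswith(".csv"))              # True
print(filename.endswith((".csv", ".tsv")))    # True, a tuple checks several options
print(filename.find("_"))                     # 6, first underscore
print(filename.rfind("_"))                    # 11, last underscore
print(filename.find("xlsx"))                  # -1, not found
```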


### __repr__() vs __str__()

There is another method, __repr__(), which is similar to __str__(). This method also returns a string representation of the object, and it is what the built-in repr() function calls. The difference is intent - __str__() is intended to return a human-readable string representation of the object, while __repr__() is intended to return an unambiguous representation, ideally one that could be used to recreate the object. For numbers and many other simple data types, these two are the same. For strings they differ slightly: repr() includes the surrounding quotes, which is why a bare string at the end of a cell is displayed with quotes and a printed string is not. For more complex objects, the representation returned by __repr__() may be more elaborate or detailed. Notice that if you put just one of your new objects on the last line of a notebook cell, the __repr__() method is called to display it. If you put the object in a print, the __str__() method is called to display it.
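
A minimal sketch of the difference, using a hypothetical Point class that is not part of this notebook's exercises:

```python
class Point:
    """A tiny example class with both string representations defined."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __str__(self):
        # Friendly form, used by print() and str()
        return f"({self.x}, {self.y})"

    def __repr__(self):
        # Detailed form, used by repr() and by a notebook's cell output
        return f"Point(x={self.x}, y={self.y})"

p = Point(1, 2)
print(str(p))                  # (1, 2)
print(repr(p))                 # Point(x=1, y=2)
print(str("hi"), repr("hi"))   # hi 'hi'
```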

## Exercise

Create a class called Review, that represents one review of the Threads app, taken from the data below. The class should have the following attributes (at least):
<ul>
<li> review_id - the id of the review. This should be unique. </li>
<li> description - the text of the actual review. </li>
<li> rating - the rating given to the app by the reviewer on a 1 to 5 scale. </li>
<li> date - the date the review was posted. </li>
<li> words - a list of the unique alphanumeric words in the review text. </li>
</ul>

As well, the class should have the following methods (at least):
<ul>
<li> __init__ - the constructor, which should take the review_id, description, rating, and date as parameters. </li>
<li> __str__ - the string representation of the object. This should return a string containing only the rating and the date (and any free text you add to make a statement). </li>
<li> __eq__ - this overloads the == operator. It should take another Review object as a parameter, and return True if the RATING of the two are the same. </li>
<li> __repr__ - this overloads the repr() function. It should return a string containing ALL OF the review_id, description, rating, and date. Try to make this one different from the other one, at a minimum format this string differently. </li>
<li><br></li>
<li> word_in_review - this should return True if the supplied argument is in the words of the review, and False otherwise. </li>
<li> get_rating - a method that returns the rating of the review. </li>
<li> get_date - a method that returns the date of the review. </li>
<li> get_description - a method that returns the description of the review. </li>
<li> set_rating - a method that takes a new rating as a parameter and sets the rating of the review to the new value. </li>
</ul>

Once your class is created, create a list of Review objects, one for each of the first 100 reviews. Then, write a function that takes a list of Review objects as a parameter and returns the average rating of the reviews in the list.

<b>Note:</b> for index, row in df.iterrows() will loop through the rows of a dataframe, returning the index and the row as a tuple on each pass. We can use this to create our Review objects.

In [195]:
import pandas as pd
df = pd.read_csv("../../data/threads_reviews.csv")
df.head()

Unnamed: 0,source,review_description,rating,review_date
0,Google Play,Very good app for Android phone and me,5,27-08-2023 10:31
1,Google Play,Sl👍👍👍👍,5,27-08-2023 10:28
2,Google Play,Best app,5,27-08-2023 9:47
3,Google Play,Gatiya app,1,27-08-2023 9:13
4,Google Play,Lit bruv,5,27-08-2023 9:00


In [196]:
# Code ME!!!!!


These are some tests that I did. You don't need to do these exact things, but likely something similar. 

In [197]:
rev = Review(df.iloc[0]['review_description'], df.iloc[0]['rating'], df.iloc[0]['review_date'])
print(rev)

Rating of 5 stars.
 Very good app for Android phone and me


In [198]:
rev


Review ID-0
 5 stars 
on 27-08-2023 10:31
 Very good app for Android phone and me

In [199]:
rev2 = Review(df.iloc[1]['review_description'], df.iloc[1]['rating'], df.iloc[1]['review_date'])
print(rev2)

Rating of 5 stars.
 Sl👍👍👍👍


In [200]:
rev3 = Review(df.iloc[2]['review_description'], df.iloc[2]['rating'], df.iloc[2]['review_date'])
rev3

Review ID-2
 5 stars 
on 27-08-2023 9:47
 Best app

In [201]:
rev2 == rev

True

In [202]:
rev.get_words()

{'and', 'android', 'app', 'for', 'good', 'me', 'phone', 'very'}

In [203]:
rev2.get_words()

#### Use the Class to Create Objects

Use the dataframe to create the list of reviews.

In [204]:
# Codes