## Strings

### Assign a string to a variable
To assign a string to a variable, write the variable's name, an equal sign, and then the string. You can think of it as storing the value under the variable's name.

For example, url = 'https://gutenberg.org/cache/epub/1597/pg1597.txt'

The url variable contains the value to the right of the equal sign.

Next, we retrieve a text from Gutenberg.org. Doing so involves several functions and methods, each of which we revisit later in this notebook.

In [48]:
# We use the urlopen() function from the urllib.request library when retrieving resources from the Internet.
from urllib.request import urlopen 

# Link to HC Andersen fairy tale
url = 'https://gutenberg.org/cache/epub/1597/pg1597.txt'
with urlopen(url) as response:
    text_string = response.read().decode('utf-8')


# Find the position just after the "CONTENTS" heading
start = text_string.find("CONTENTS") + len("CONTENTS")

# Find the position where the Project Gutenberg end marker begins
end = text_string.find("End of Project Gutenberg's Andersen's Fairy Tales, by Hans Christian Andersen")

# Keep only the text between the two positions
content = text_string[start:end]
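
One thing to be aware of is that .find() returns -1 when the search string is not in the text, which would silently give a wrong slice. The following is a minimal sketch of a check for that case. The text and the marker strings are made-up stand-ins, not the real e-book.

```python
# Hypothetical stand-in for the downloaded e-book text
sample_text = "HEADER CONTENTS\nThe Swineherd\nThe Real Princess\nEND OF BOOK"

start_marker = "CONTENTS"
end_marker = "END OF BOOK"

start = sample_text.find(start_marker)
end = sample_text.find(end_marker)

# .find() returns -1 when the marker is missing, so we check before slicing
if start == -1 or end == -1:
    raise ValueError("Could not find the start or end marker in the text.")

# Skip past the start marker itself and stop where the end marker begins
sample_content = sample_text[start + len(start_marker):end]

# A search string that does not occur gives -1
missing = sample_text.find("NOT IN THE TEXT")
```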

### A text string is a sequence
A text string is a sequence of elements representing Unicode characters.

Square brackets provide access to elements in the string.

For example, content[0] returns the first character of the string (remember that the first character has index position 0).

In [49]:
content[0]

'\r'

### Slicing
You can retrieve a range of characters by slicing, which means specifying a start index and an end index separated by a colon.

For example, content[0:100] retrieves the elements from the start index 0 up to, but not including, the end index 100.

In [50]:
content[0:100]

"\r\n\r\n\r\n     The Emperor's New Clothes\r\n     The Swineherd\r\n     The Real Princess\r\n     The Shoes of "

By omitting the end index, the range will go to the end:

In [51]:
content[305000:]

'shine to God, and there\r\nno one asked after the RED SHOES.\r\n\r\n\r\n\r\n\r\n\r\n'

With a negative start index you count from the end of the string, so content[-100:] returns the last 100 characters:

In [52]:
content[-100:]

'roke. Her soul flew on the sunshine to God, and there\r\nno one asked after the RED SHOES.\r\n\r\n\r\n\r\n\r\n\r\n'
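
Indexing and slicing are easier to follow on a short string. The sketch below uses a made-up variable, title, that is not part of the notebook above.

```python
# Hypothetical short string for illustration
title = "The Red Shoes"

first_char = title[0]         # first character, index 0
last_char = title[-1]         # last character
first_three = title[0:3]      # index 0 up to, but not including, 3
last_five = title[-5:]        # the last five characters
from_index_four = title[4:]   # from index 4 to the end

print(first_char, last_char, first_three, last_five, from_index_four)
```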

### The len() Function and String Methods
#### len()
You can find the length of a string using the len() function.

In [53]:
length = len(content)
print(f'The e-book is {length} characters long (with whitespace)')

The e-book is 305070 characters long (with whitespace)


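### F-strings
Above, we used an f-string to insert the value of a variable into a string. An f-string starts with the letter f before the opening quotation mark, and any expression inside curly braces is replaced by its value.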

In [54]:
book = "ANDERSEN'S FAIRY TALES"
length = len(content)
print(f"The e-book {book} is {length} characters long (with whitespace)")

The e-book ANDERSEN'S FAIRY TALES is 305070 characters long (with whitespace)


### The .lower() method
.lower() returns a copy of the string in which all letters are lowercase.

.upper() does the opposite and returns a copy in which all letters are uppercase.

In [55]:
lower_text = content.lower()
print(lower_text[0:100])




     the emperor's new clothes
     the swineherd
     the real princess
     the shoes of 
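
Both methods return a new string and leave the original untouched. A small sketch, using a made-up variable called title:

```python
# Hypothetical short string for illustration
title = "The Snow Queen"

lower_title = title.lower()
upper_title = title.upper()

# The original string is unchanged, because strings cannot be modified in place
print(title)
print(lower_title)
print(upper_title)
```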


### The .replace() method
The .replace() method returns a string where a specified character or phrase is replaced with another specified character or phrase.

For example, lower_text.replace('.', ' ') replaces all full stops with a space.

The replace method can be used, for instance, to clean text of punctuation marks.
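
One detail is worth watching when the cleaning also removes line breaks: replacing '\r' and '\n' with an empty string joins the last word of one line with the first word of the next. Replacing them with a space avoids that. The sketch below shows the difference on a made-up two-line string; it is an illustration, not a replacement for the cell that follows.

```python
# Hypothetical two-line string with a Windows-style line break
sample = "and there\r\nno one asked"

# Removing the line break glues 'there' and 'no' together
joined = sample.replace('\r', '').replace('\n', '')

# Replacing the line break with spaces keeps the words apart
spaced = sample.replace('\r', ' ').replace('\n', ' ')

print(joined)
print(spaced.split())
```

The extra spaces that this leaves behind do no harm, since .split() without arguments treats any run of whitespace as one separator.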

In [56]:
clean_string = lower_text.replace('.', ' ').replace(',', ' ').replace(';', ' ')\
                .replace('!', ' ').replace('‘', ' ').replace('_', ' ')\
                .replace('--', '').replace('\r','').replace('\n','')
clean_string[0:100]

"     the emperor's new clothes     the swineherd     the real princess     the shoes of fortune     "

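### The .count() method
The .count() method returns the number of times a specified substring occurs in the string. Because it matches substrings, counting 'bird' also counts occurrences inside longer words such as 'birds'.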

In [57]:
word = 'bird'
word_count = clean_string.count(word) 
word_count

28
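
To see the difference between counting a substring and counting a whole word, here is a small sketch on a made-up sentence. It uses the list method .count() on the result of .split(), which only matches elements that are exactly equal to the search word.

```python
# Hypothetical sentence for illustration
sentence = "a little bird and two birds sat in the birdcage"

# Counts every occurrence of the letters 'bird', also inside 'birds' and 'birdcage'
substring_count = sentence.count("bird")

# Counts only list elements that are exactly 'bird'
whole_word_count = sentence.split().count("bird")

print(substring_count, whole_word_count)
```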

### The .split() method
The .split() method splits a string into a list, where each element is a word (we'll get back to lists). By default, the method splits on any whitespace (spaces, tabs, and line breaks), but you can also split on other characters, such as a period, like this: .split(".")

You can use the split method to build a word list, and then you can count how many words are in the text using the word list and the len() function.

In [58]:
word_list = clean_string.split()
word_list[0:10]

['the',
 "emperor's",
 'new',
 'clothes',
 'the',
 'swineherd',
 'the',
 'real',
 'princess',
 'the']
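
The following sketch compares the default behaviour of .split() with splitting on a chosen separator. The strings are made up for the example.

```python
# Hypothetical string for illustration
line = "The Swineherd. The Real Princess. The Red Shoes"

# Default: split on whitespace, punctuation stays attached to the words
words = line.split()

# Split on a period followed by a space instead
titles = line.split(". ")

# Runs of spaces, tabs and line breaks count as a single separator
spaced_words = "the   real\tprincess\n".split()

print(words)
print(titles)
print(spaced_words)
```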

### Summary of string methods
[Click on this link and get an overview of the string methods built into Python](https://www.w3schools.com/python/python_strings_methods.asp)

## Lists
A list stores a collection of elements in a single variable. It could be, for example, a list of words stored in the variable word_list.

Lists are characterized by square brackets. If it's a list of words, it might look like this:

word_list = ['and', 'soon', 'a', 'little', 'bird', 'came']

In the list, each element consists of a text string, indicated by quotation marks. Each element is separated by a comma. The list begins and ends with square brackets.

Elements are indexed; the first element has index [0], the second element has index [1], and so on.

You can also use negative indexing with lists, meaning counting starts from the end.

[-1] is the last element, [-2] is the second-to-last element, and so forth.

You can also select an index range and specify where the interval starts and ends.

This way, you can return a specific portion of the list.

For example, to retrieve the first 10 words from the list, you write:

In [59]:
word_list[0:10]

['the',
 "emperor's",
 'new',
 'clothes',
 'the',
 'swineherd',
 'the',
 'real',
 'princess',
 'the']
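
The indexing rules described above can be tried out on the short example list from earlier in this section. The variable name example_words is chosen here so that it does not overwrite the notebook's word_list.

```python
# The short example list from the text, under a separate name
example_words = ['and', 'soon', 'a', 'little', 'bird', 'came']

first = example_words[0]          # first element
last = example_words[-1]          # last element
second_to_last = example_words[-2]
middle = example_words[1:4]       # index 1 up to, but not including, 4

print(first, last, second_to_last, middle)
```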

### The len() function and lists
The len() function also works with lists and returns the number of elements in the list.

Here we use it on the word list to get the total number of words in the text.

With the total we can calculate a word's share of the whole text: count the occurrences of the word, divide by the total number of words, and multiply by 100 to get a percentage.

In [60]:
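# Total number of words in the text
total_words = len(word_list)
print(f'The word list consists of {total_words} words.')

# Note: str.count() counts substring matches, so 'birds' also counts as 'bird'
word = 'bird'
word_count = clean_string.count(word)

# Share of the word in percent, rounded to two decimals
res = round((word_count / total_words) * 100, 2)

print(f'The word {word} takes up {res} % of the whole text.')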

The word list consists of 52702 words.
The word bird takes up 0.05 % of the whole text.


### Appending (adding) elements to a list
You can add a new element to a list using the .append() method. The new element is added to the end of the list.

In [61]:
the_end = word_list[-10:]
the_end.append('The End')
the_end

['to',
 'god',
 'and',
 'thereno',
 'one',
 'asked',
 'after',
 'the',
 'red',
 'shoes',
 'The End']
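
Note that a slice such as word_list[-10:] creates a new list, so appending to the_end does not change the list it was sliced from. A small sketch with a made-up list called words:

```python
# Hypothetical short list for illustration
words = ['asked', 'after', 'the', 'red', 'shoes']

# The slice is a new list holding the last three elements
the_end = words[-3:]
the_end.append('The End')

print(the_end)
print(words)  # the original list is unchanged
```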

### Overview of list methods
[Click the link to get an overview of the list methods built into Python](https://www.w3schools.com/python/python_lists_methods.asp)

## Using a Dictionary to Store Word Frequencies
We create a dictionary that we use to store the frequencies of words.

Think of a dictionary as a collection that stores pairs of things: each key points to a value. In this case, we will build a dictionary where each word is a key and the value is how many times that word appears in a piece of text.

A dictionary is characterized by curly braces and consists of keys and values.

In [62]:
# Make an empty dictionary
word_dict = {}

Using a for loop, we add every word in the word list as a key in the dictionary. Initially, each key gets the value 0.

In [76]:
for word in word_list:
    word_dict[word] = 0
    
# word_dict

The .keys() method returns a view of all the keys, and the .values() method returns a view of all the values. Wrap the result in list() if you need an actual list.

In [77]:
# word_dict.keys()

In [78]:
# word_dict.values()
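
Because the full dictionary is large, here is a sketch with a tiny made-up dictionary, small_dict, that shows what the two methods give back.

```python
# Hypothetical small dictionary for illustration
small_dict = {'the': 0, 'red': 0, 'shoes': 0}

# .keys() and .values() return views; list() turns them into lists
keys = list(small_dict.keys())
values = list(small_dict.values())

print(keys)
print(values)
```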

### The .get() method
To be able to count the number of times each word appears, we need to use the dictionary method .get(), and we should take a closer look at its parameters.

Parameter: Key. Description: Required. The key name of the item from which you want to retrieve the value.

Parameter: Value. Description: ***Optional. A value to be returned if the specified key does not exist.***

Source: https://www.w3schools.com/python/ref_dictionary_get.asp
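
The effect of the optional second parameter is easiest to see on a small example. The dictionary counts below is made up for the illustration.

```python
# Hypothetical small dictionary for illustration
counts = {'bird': 2}

existing = counts.get('bird', 0)       # key exists: its value is returned
with_default = counts.get('queen', 0)  # key is missing: the default 0 is returned
no_default = counts.get('queen')       # key is missing, no default given: None

# .get() only looks up; it does not add the missing key to the dictionary
still_missing = 'queen' not in counts

print(existing, with_default, no_default, still_missing)
```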

In [79]:
word_dict = {}  # Dictionary to store word counts
    
for word in word_list:
    word_dict[word] = word_dict.get(word, 0) + 1
    
word_dict 

{'the': 3537,
 "emperor's": 11,
 'new': 40,
 'clothes': 22,
 'swineherd': 13,
 'real': 21,
 'princess': 69,
 'shoes': 67,
 'of': 952,
 'fortune': 11,
 'fir': 24,
 'tree': 84,
 'snow': 46,
 'queen': 25,
 'leap-frog': 7,
 'elderbush': 4,
 'bell': 19,
 'old': 221,
 'house': 45,
 'happy': 25,
 'family': 13,
 'story': 38,
 'a': 1047,
 'mother': 46,
 'false': 3,
 'collar': 17,
 'shadow': 69,
 'little': 306,
 'match': 9,
 'girl': 29,
 'dream': 15,
 'tuk': 9,
 'naughty': 7,
 'boy': 69,
 'red': 48,
 'shoesthe': 1,
 'clothesmany': 1,
 'years': 17,
 'ago': 5,
 'there': 274,
 'was': 884,
 'an': 94,
 'emperor': 26,
 'who': 162,
 'so': 358,
 'excessively': 3,
 'fond': 5,
 'ofnew': 1,
 'that': 536,
 'he': 727,
 'spent': 1,
 'all': 295,
 'his': 365,
 'money': 3,
 'in': 827,
 'dress': 10,
 'did': 90,
 'not': 345,
 'troublehimself': 1,
 'least': 15,
 'about': 119,
 'soldiers': 5,
 'nor': 17,
 'care': 19,
 'to': 979,
 'go': 76,
 'either': 12,
 'tothe': 17,
 'theatre': 3,
 'or': 92,
 'chase': 2,
 'except'

In the first line, we create an empty dictionary called word_dict.

Then a loop goes through each word in the word list, which contains all the words we want to count. For each word, the line inside the loop does the following:

- word_dict.get(word, 0) looks the word up in word_dict.
- If the word is already there, .get() returns its current count, and we add 1 to it.
- If the word is not there yet, .get() returns the default 0, and we add 1 to it, so the count starts at 1.

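After we have gone through all the words, word_dict contains the number of times each word was used. A few keys, such as 'shoesthe' and 'ofnew', are two words run together; they appear because the cleaning step removed line breaks without inserting a space.

So, word_dict keeps track of how many times each word occurs in the text. It works like a lookup table of word frequencies.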
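
Once the frequencies are stored, a natural next step is to look at the most frequent words. The sketch below is one possible way to do that, using the built-in sorted() function on a small made-up dictionary, sample_counts. It is an illustration only and not part of the notebook's own cells.

```python
# Hypothetical small frequency dictionary for illustration
sample_counts = {'the': 5, 'princess': 2, 'a': 3, 'bird': 1}

# .items() gives (word, count) pairs; we sort them by count, highest first
sorted_pairs = sorted(sample_counts.items(), key=lambda pair: pair[1], reverse=True)

# Keep the three most frequent words
top_three = sorted_pairs[:3]

for word, count in top_three:
    print(f'{word}: {count}')
```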