<div class="frontmatter text-center">
<h1>Introduction to Data Science and Programming</h1>
<h2>Exercise 6: Python Crash Course - Strings, text, and IO</h2>
<h3>IT University of Copenhagen, Fall 2023</h3>
</div>

* Task 1: Working with built-in modules
* Task 2: Simulating a dice game
* Tasks 3-5: String formatting & text file processing. (**Note:** Task 5 is an extra challenge!)

# Task 1: Exploring built-in Python modules

In each of the code snippets below, we import a built-in module (or a single function from a module), and then call one of the module's functions.

Explore the code output, as well as the [documentation](https://docs.python.org/3/py-modindex.html#) of the corresponding Python module, to answer the questions about the code snippets.

In [None]:
# the operator module
from operator import add
result = add(4,3)

# what does the variable "result" contain?
# >>> the result from the addition 4+3

# how can you access the sub() function in the operator module,
# for the following line of code to work?
# result_sub = sub(4,3)
# >>> you need to import the function sub from the module, too:
from operator import sub
result_sub = sub(4,3)

In [None]:
# >>> OR:
import operator
operator.sub(4,3)

In [None]:
# the time module:
import time
time_in_secs = time.time()
print(time_in_secs)
time_as_str = time.ctime(time_in_secs)
print(time_as_str)

# What does the number saved in the variable time_in_secs represent?
# >>> the number of seconds that have passed since Jan 1, 1970,
# >>> until the moment that you run your code

# What does the function ctime() from the time module do?
# >>> it converts this number of seconds into a human-readable string format
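
To see where the count of seconds starts, the value `0` can be converted as well. This is a minimal sketch that is not part of the original exercise; it uses `time.gmtime()`, which interprets a number of seconds in UTC, and `time.strftime()` to choose the output format. The variable name `epoch_start` is only illustrative.

```python
import time

# time.gmtime() converts seconds since the epoch into a UTC time tuple;
# passing 0 shows the starting point of the count
epoch_start = time.gmtime(0)
print(epoch_start.tm_year, epoch_start.tm_mon, epoch_start.tm_mday)

# time.strftime() formats such a tuple with a pattern of our choice
epoch_as_str = time.strftime("%Y-%m-%d %H:%M:%S", epoch_start)
print(epoch_as_str)
```

Unlike `ctime()`, which uses your local time zone, `gmtime()` always reports UTC, so the result does not depend on where the code is run.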

In [None]:
# the Counter class from the collections module:
mysnacks = ["cookies", "apple", "orange", "sandwich", "apple", "apple"]
from collections import Counter
myCounter = Counter(mysnacks)
# explain what the object myCounter now contains!
myCounter
# >>> myCounter is a Counter object (a special kind of dictionary), where keys are UNIQUE items from the list,
# >>> and values are the number of times each of the items appears on the list
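
Because a `Counter` behaves like a dictionary, its counts can be looked up by key. The short sketch below repeats the snack list under the illustrative names `snacks` and `snack_counter` and shows two things a plain dictionary does not offer: a count of 0 for missing keys, and the `most_common()` method.

```python
from collections import Counter

snacks = ["cookies", "apple", "orange", "sandwich", "apple", "apple"]
snack_counter = Counter(snacks)

# look up a count by key, exactly as with a regular dictionary
print(snack_counter["apple"])

# unlike a regular dictionary, a missing key gives 0 instead of a KeyError
print(snack_counter["banana"])

# most_common(n) returns the n most frequent items as (item, count) tuples
print(snack_counter.most_common(1))
```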

# Task 2: Simulating a dice game

Now that we know how to simulate dice with the `random` module, let's play a game! The rules are simple: you have `N` dice; you throw them `M` times; if the sum of all your dice over all throws is at least `S`, you win. For example, `N=3`, `M=2`, and `S=20` means that you have 3 dice that you are allowed to throw 2 times, and you win if all your points add up to at least 20.

**Write a function `win_dice_game` that:**
* takes as input 
    * `n_dice` (the number of dice in the game), 
    * `n_throws` (the number of times the dice are thrown), and 
    * `points_win` (the number of points you need to have to win) 
* uses a function of your choice from the [`random` module](https://docs.python.org/3/library/random.html) to simulate the dice throws
* computes the sum of all points from all dice throws
* compares the sum to `points_win` and returns either `True` (if you win, i.e. if the sum is equal to or greater than `points_win`) or `False` otherwise

**Use your function `win_dice_game()` to simulate 100 games**, with different settings. What percentage of games did you win if:
* you have 2 dice, are allowed 3 throws, and need at least 20 points to win?
* you have 5 dice, are allowed 5 throws, and need at least 120 points to win?

How many games do you need to play with the second configuration (5 dice, 5 throws, and at least 120 points to win) if you want to win at least once? (Just experiment with your code by increasing/decreasing the number of games played, no mathematical formula needed - but you can **try** to come up with one!)

In [None]:
# import random module, which we will need
import random

# define your function
def win_dice_game(n_dice, n_throws, points_win):
    '''
    Simulate one game: throw n_dice dice n_throws times and
    return True if the total is at least points_win, otherwise False.
    '''
    # initiate variable where we will sum points from all dice and all throws: 
    total_points = 0
    # throw exactly n_throws times:
    for _ in range(n_throws):
        # each iteration step is one throw of n_dice dice
        throw_points = random.choices(
            population=range(1,7),
            k = n_dice
            )
        # random.choices() already returns a list, even for n_dice=1,
        # so list() is not strictly needed here; either way,
        # we can always use the sum() function:
        throw_points = list(throw_points)
        # print("dice show: ", throw_points)
        # add the points from this throw to the "total_points" variable
        # print("points at current throw:", sum(throw_points))
        total_points += sum(throw_points)
    # after the for loop, all our throws are done,
    # now we compare our total number of points to the number of points to win:
    # print("total points:", total_points)
    if total_points >= points_win:
        #print("you win!")
        return True
    else:
        #print("you lose!")
        return False


In [None]:
# check if it works:
win_dice_game(2,1,10)

In [None]:
# simulate 100 games with:
# 2 dice, 3 throws, 20 points to win
games_won = 0
for _ in range(100):
    current_game = win_dice_game(n_dice=2, n_throws=3, points_win=20)
    games_won += current_game
    # since current_game is either True or False (1 or 0 in int interpretation),
    # we can simply add it to the games_won variable
print(games_won, "out of 100 games won")

In [None]:
# simulate 100 games with:
# 5 dice, 5 throws, 120 points to win
games_won = 0
for _ in range(100):
    current_game = win_dice_game(n_dice=5, n_throws=5, points_win=120)
    games_won += current_game
    # since current_game is either True or False (1 or 0 in int interpretation),
    # we can simply add it to the games_won variable
print(games_won, "out of 100 games won")
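
The two cells above repeat the same loop with different settings. One way to avoid the repetition, and to report a percentage directly, is to wrap the loop in a function. This is only a sketch: `win_percentage` is a hypothetical helper name, and the block redefines a compact variant of `win_dice_game` so that it runs on its own.

```python
import random

def win_dice_game(n_dice, n_throws, points_win):
    # compact variant of the function above: roll the dice of all throws at once
    rolls = random.choices(range(1, 7), k=n_dice * n_throws)
    return sum(rolls) >= points_win

def win_percentage(n_games, n_dice, n_throws, points_win):
    # hypothetical helper: play n_games games and return the percentage won
    games_won = sum(
        win_dice_game(n_dice, n_throws, points_win) for _ in range(n_games)
    )
    return 100 * games_won / n_games

print(win_percentage(100, n_dice=2, n_throws=3, points_win=20), "% of games won")
```

Since the dice are random, the printed percentage changes from run to run; calling `random.seed()` with a fixed number beforehand makes a run repeatable.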

In [None]:
# simulate N games with:
# 5 dice, 5 throws, 120 points to win
# experimenting with N (number of games) from the list:
games_played = [100, 500, 1000, 5000, 10000, 20000, 50000]
for N in games_played:
    games_won = 0
    for _ in range(N): # current number of games
        current_game = win_dice_game(n_dice=5, n_throws=5, points_win=120)
        games_won += current_game
        # since current_game is either True or False (1 or 0 in int interpretation),
        # we can simply add it to the games_won variable
    print(games_won, "out of", N, "games won")
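
As a cross-check on the experiment above, the exact win probability can be computed by counting how many of the `6 ** (n_dice * n_throws)` equally likely outcomes reach the required sum. The sketch below is one possible approach, not the only one; the function name `win_probability` is an assumption made for this example. With a win probability `p` per game, the chance of at least one win in `n_games` independent games is `1 - (1 - p) ** n_games`, and the expected number of games until the first win is `1 / p`.

```python
from collections import Counter
from fractions import Fraction

def win_probability(n_dice, n_throws, points_win):
    total_dice = n_dice * n_throws
    # ways[s] = number of ways to reach the sum s with the dice counted so far
    ways = Counter({0: 1})
    for _ in range(total_dice):
        new_ways = Counter()
        for current_sum, count in ways.items():
            for face in range(1, 7):
                new_ways[current_sum + face] += count
        ways = new_ways
    # count the outcomes whose sum is at least points_win
    winning = sum(count for s, count in ways.items() if s >= points_win)
    return Fraction(winning, 6 ** total_dice)

p = win_probability(n_dice=5, n_throws=5, points_win=120)
print("win probability per game:", float(p))

# probability of at least one win in n_games independent games
n_games = 10000
print("chance of at least one win:", 1 - (1 - float(p)) ** n_games)
```

No number of games guarantees a win; increasing `n_games` only pushes the chance of at least one win closer to 1.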

# Task 3: String formatting - Capital cities

Below, we provide you with a dictionary `capitals`, that contains key-value pairs with countries as keys, and their capital cities as values. Let's do some data cleaning first:
* Some cities' names contain numbers; these need to be deleted
* Some cities' names consist of several words, but lack a white space; insert a white space where appropriate (for example, "AddisAbaba" needs to be formatted into "Addis Ababa").

Now, use the `f'{}'` syntax to generate a file in which each line contains one sentence: `The capital of <country> is <city>.`, inserting countries and capitals from the dictionary. Save the file to `capitals.txt`. 

In [None]:
capitals = {
    "Nigeria" : "Abuja",
    "Colombia" : "0Bo0gotá",
    "Gibraltar": "Gibr2altar",
    "Ethiopia": "AddisAb3aba",
    "United Arab Emirates": "AbuDhab7i"
}

In [None]:
# import re module
import re

# remove numbers from dictionary values with the help of the "\d" or "\d+" regex 

# loop through dictionary items
for key, value in capitals.items():
    # loop through all numbers that were found in the value string
    for item in re.findall(r"\d", value):
        # reassign the new value, where item (the number) is replaced by "" (an empty string)
        capitals[key] = capitals[key].replace(item, "")        
capitals
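
The same cleaning step can also be written with `re.sub()`, which replaces every match of a pattern in one call. This is an alternative sketch, not the approach used above; `sample_capitals` and `cleaned` are illustrative names, and the sample contains only two entries in the same shape as the `capitals` dictionary.

```python
import re

# a small sample in the same shape as the capitals dictionary
sample_capitals = {"Colombia": "0Bo0gotá", "Gibraltar": "Gibr2altar"}

# re.sub(pattern, replacement, string) replaces every match of the pattern;
# r"\d" matches a single digit, and the replacement is an empty string
cleaned = {country: re.sub(r"\d", "", city)
           for country, city in sample_capitals.items()}
print(cleaned)
```

The dictionary comprehension builds a new dictionary instead of changing the original one, which leaves the uncleaned data available for comparison.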

In [None]:
# add white spaces before capital letters 
# (you can use the regex "[A-Z]" to find capital letters)

# the way we know that a white space is missing:
# if there is a capital letter "in the middle" of the word,
# i.e. at position >0.

# as above, loop through dictionary items,
# this time inserting a white space BEFORE a capital letter,
# if it is at a position >0.
for key, value in capitals.items():
    # r"\B" only matches inside a word, so the first letter is skipped
    # (a plain .replace() would also put a white space before the first "A" of "AddisAbaba");
    # each capital letter found is replaced by (whitespace + the capital letter)
    capitals[key] = re.sub(r"\B([A-Z])", r" \1", value)
capitals

In [None]:
# open up the file; with the opened file:
#     loop through the dictionary items;
#     create a sentence from keys and values (with string formatting) at each iteration step;
#     write the sentence + a line break (expressed as "\n") to the file


# create the file
with open('capitals.txt', 'w', encoding='utf-8') as opened_file: # utf-8 so that "Bogotá" is written correctly on every system
    # loop through the dictionary items
    for key, value in capitals.items():
        # create the sentence for this key-value pair
        sentence = f"The capital of {key} is {value}."
        # write the sentence to the file
        opened_file.write(sentence)
        # add a line break
        opened_file.write("\n")

# Task 4: Text processing - Numbers in an article

In the file `article.txt`, we provide the text of this [Guardian article](https://www.theguardian.com/commentisfree/2023/jul/12/progress-climate-european-greenlash-populist-right) by Nathalie Tocci. Let's say we are **VERY** interested in all the **numbers** that she used in the article. Your tasks:

* `.read()` in the text file 
* find all the numbers (of one or more digits, with the regex `"\d+"`) mentioned in the text, and print them out
* `.split()` the text into separate sentences
* loop through the sentences, `.append()`ing only the ones that contain numbers to a list
* Additional challenge: try to write this list to a text file, so that every line in the text file is a sentence (with a number) from the article

In [None]:
# if not done yet, import the re module
import re

In [None]:
# .read() in the text file
with open('files/article.txt', 'r') as opened_file:
    my_text = opened_file.read()

In [None]:
# print out all numbers (regex "\d+") that you can find
re.findall(r"\d+", my_text)

In [None]:
# split into sentences 

# with ". " (a period followed by a white space) as separator
sentences = my_text.split(". ")
sentences # sentences is now a list of strings; every string is a sentence (now without the ". " separator)
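
Splitting on `". "` drops the period and ignores sentences that end in `?` or `!`. A regular expression can handle both, as the sketch below shows; it is an optional refinement, and `sample_text` is a made-up string, not taken from the article.

```python
import re

# made-up sample text, not taken from the article
sample_text = "Emissions fell by 3% in 2022. Is that enough? Probably not! More is needed."

# split at white space that follows ".", "!" or "?";
# the lookbehind (?<=...) keeps the punctuation mark in the sentence
sample_sentences = re.split(r"(?<=[.!?])\s+", sample_text)
print(sample_sentences)
```

This still splits after abbreviations such as "U.S. ", so the result is worth a quick manual check on real text.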

In [None]:
# find all sentences that contain numbers, and append them to a list

sentences_with_numbers = []

for sentence in sentences:
    if re.search(r"\d+", sentence):
        sentences_with_numbers.append(sentence)

print(sentences_with_numbers)

In [None]:
# Challenge: write the sentences_with_numbers to a file

with open("sentences_with_numbers.txt", "w") as opened_file:
    for sentence in sentences_with_numbers:
        opened_file.write(sentence + "\n") # write each sentence plus a line break


# Task 5: Text processing - programmers' feelings

Remember the Menti survey from lecture 01? We copied (most of) your replies to the question "How do you feel about programming?" into a text file, `feelings.txt` (provided together with this notebook in a zipped folder). **You want to find out what the 5 most common feelings were.** However, there is some data cleaning to do!

You can go about it your own way; **or** take some inspiration from the instructions below:

* Read in the file, with the method of your choice (`.read()` or `.readlines()`)
* Remove all line breaks and tabulators (`"\n", "\t"`)
* You will see that some lines contain only one word, while other lines contain several words; make sure that you can access each word separately (this will depend on what method you chose to read in the file)
* Since we don't care about upper/lowercase spelling, convert everything to lower case with the `.lower()` method 
* Create a dictionary where `keys` are words (feelings) and `values` are the number of times they appear. (`Counter` from the `collections` module might help!)
* Print out all keys to look through them, and remove the keys that you think don't belong in the list (that don't contain feelings) from the dictionary
* Now you can count the number of times each feeling was mentioned (again, with the method of your choice). What were the top 5 most common feelings?

In [None]:
# open the file and read it in

with open("files/feelings.txt", "r") as opened_file:
    my_lines = opened_file.readlines()

In [None]:
# remove line breaks "\n" and tabulators "\t"
# white spaces should be treated as separators between words (use them to .split() the string(s))
# convert all words into lower case

wordlist = [] # initiate an empty list of words
for line in my_lines:
    # .replace() replaces ALL occurrences at once, so no loop is needed
    line = line.replace("\n", "") # removing "new line" signs
    line = line.replace("\t", " ") # replacing "tabulator" signs by white spaces
    words = line.split(" ") # split words (white space as separator)
    for word in words:
        if word: # i.e. if the word is not empty
            wordlist.append(word.lower()) # append its lower-case version to the list

In [None]:
# use the Counter class from the collections module
# to get a dictionary where keys are unique words from the list
# and values are number of times they appear on the list
from collections import Counter
wordCounter = Counter(wordlist) 

In [None]:
# print out the keys of the Counter dictionary (i.e. unique words)
# make a list of words you want to remove from the dictionary
print(wordCounter.keys())
words_to_remove = ["klingon", "computer", "nerdge", "potato", "stonks", "jens", "aug", "klingoin", "kligon"]

In [None]:
# remove the words
for word in words_to_remove:
    del wordCounter[word]

In [None]:
# show the highest 10 counts
sorted(wordCounter.values(), reverse = True)[0:10]

In [None]:
# we are interested in the first 5 positions - 
# all words that have been mentioned 7 or more times

# initiate a new dictionary, we will copy only the top 5 here
frequent_feelings = {}

for key, value in wordCounter.items():
    if value >= 7:
        frequent_feelings[key] = value       

In [None]:
print(frequent_feelings)
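
The threshold of 7 above was read off the sorted counts by hand, so it only fits this particular data set. `Counter` also offers the `most_common()` method, which returns the top entries directly. The sketch below uses made-up words under the illustrative names `sample_words` and `sample_counter`, not the actual survey replies.

```python
from collections import Counter

# made-up sample words, not the actual survey replies
sample_words = ["excited", "nervous", "excited", "curious", "nervous", "excited"]
sample_counter = Counter(sample_words)

# the two most frequent words with their counts, most frequent first
top_two = sample_counter.most_common(2)
print(top_two)
```

Note that `most_common(5)` returns exactly five entries: if several words share the fifth-highest count, only the one encountered first is included, whereas the threshold approach keeps all of them.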