In [None]:
from IPython.display import Image
from IPython.display import clear_output
from IPython.display import FileLink, FileLinks

## Introduction to

![title](img/python-logo-master-flat.png)

### with Application to Bioinformatics

#### - Day 3

## Day 3
- __Session 1__
    - Quiz: Review of Day 2
    - Lecture: Go through questions, data type `set`
    - Ex1: IMDb exercise - Find the number of unique genres
- __Session 2__    
    - Lecture: Data type `dict`
    - Ex2: IMDb exercise - Find the number of movies per genre
    - PyQuiz 3.1
- __Session 3__  
    - Lecture: Write your own functions
    - Ex3: Day 3, Exercise 3, Functions 
- __Session 4__
    - Lecture: Pass arguments from command line using `sys.argv` and string formatting
    - Ex4: IMDb exercise - functions and `sys.argv`
    - PyQuiz 3.2
- __Project time__

## Quiz: Review Day 2

Go to Canvas, `Modules -> Day 3 -> Review Day 2`
 
~20 minutes


## Tuples (Q 1&2)

__1. Which of the following variables are of the type tuple?__  
`a = (1, 2, 3, 4)`  
`a = ([1, 2], 'a', 'b')`

Note: a tuple is a Python data structure that stores multiple items in a fixed order and is immutable.

__2. What is the difference between a tuple and a list?__  
A tuple is immutable while a list is mutable

In [None]:
myTuple    = (1, 2, 3)
myList     = [1, 2, 3]

In [None]:
myList[2]  = 4
myList

In [None]:
myTuple[2] = 4

### Is it true that we can never modify the content of a tuple?

In [None]:
myTuple = (1, 2, [1,2,3])
print(myTuple)
myTuple[2][2] = 4
print(myTuple)

- The immutability of tuples in Python means that __the structure of the tuple itself cannot be changed__: you cannot add, remove, or replace elements in the tuple.
- However, if a tuple contains mutable objects like lists, dictionaries, or other objects, the contents of those mutable objects can still be changed.
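
A minimal sketch of both points, using a made-up tuple called `record` (the name and values are only for illustration): replacing an element raises a `TypeError`, while the list stored inside the tuple can still be changed.

```python
# Made-up example tuple: two strings and a list
record = ("BRCA1", "chr17", [1, 2, 3])

# Replacing an element of the tuple is not allowed
try:
    record[0] = "TP53"
except TypeError as err:
    print("Cannot replace an element:", err)

# The list inside the tuple is still mutable
record[2].append(4)
print(record)
```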

## How to structure the code (Q 3)

__3. What does pseudocode mean?__  
Writing down the steps you intend to include in your code in more general language

### Things to Consider When Writing Pseudocode
- Decide on the desired output.
- Identify the input files you have.
- Examine the structure of the input – can it be iterated over?
- Determine where the necessary information is located.
- Assess if you need to store information while iterating:
    - Use lists for ordered data.
    - Use sets for unique, non-duplicate entries.
    - Use dictionaries for structured, key-value information.
- After collecting the required data, decide how to process it.
- Determine if you’ll need to write your results to a file.

__Writing pseudocode before actual coding is a good habit.__
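
As an illustration, the pseudocode for a small task can be written as comments first and then filled in with code. The sketch below uses made-up in-memory lines instead of a real file; the data and the variable names are assumptions for illustration only.

```python
# Pseudocode (written first, as comments):
# 1. Desired output: the title with the highest rating
# 2. Input: lines of text, one "rating|title" pair per line
# 3. Iterate over the lines and skip comment lines
# 4. Remember the best rating and title seen so far
# 5. Print the result

lines = ["# rating|title", "8.1|Movie A", "9.0|Movie B", "7.5|Movie C"]  # made-up data

best_rating = -1.0
best_title = None
for line in lines:
    if line.startswith("#"):
        continue                      # step 3: skip comment lines
    rating_text, title = line.split("|")
    rating = float(rating_text)
    if rating > best_rating:          # step 4: keep the best so far
        best_rating = rating
        best_title = title

print(best_title, best_rating)        # step 5
```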

## Functions and methods (Q 4&5)

__4. What are the following examples of?__  
`len([1, 2, 3, 4])`  
`print("my text")`

Functions

__5. What are the following examples of?__  
`"my\ttext".split("\t")`  
`[1, 2, 3].pop()`

Methods

### What are the differences between a `function` and a `method`?
| __Function__                      | __Method__                                                |
|-------------------------------|-------------------------------------------------------|
| Standalone block of code      | Function associated with an object                    |
| Called independently, e.g. `functionName()`         | Called on an instance of a class, e.g. `obj.methodName()`   |
| Not tied to any object or class | Tied to the objects they are called on                |
| Defined outside of a class    | Defined within a class                                |
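
The difference can be sketched with the same calculation written twice, once as a function and once as a method. The names `gc_content` and `Sequence` are made up for this example; classes are only used here to show where a method is defined.

```python
# A standalone function: defined outside of any class
def gc_content(seq):
    """Return the fraction of G and C in a DNA string."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# A class with a method: the function is defined inside the class
class Sequence:
    def __init__(self, seq):
        self.seq = seq

    def gc_content(self):
        """Same calculation, but tied to the object it is called on."""
        return (self.seq.count("G") + self.seq.count("C")) / len(self.seq)


print(gc_content("ATGC"))              # called independently
print(Sequence("ATGC").gc_content())   # called on an instance
```

Note that `seq.count("G")` inside both versions is itself a method call on a string object.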


__6. Calculate the average of the list `[1,2,3.5,5,6.2]` to one decimal, using Python__

In [None]:
myList = [1, 2, 3.5, 5, 6.2]
round(sum(myList)/len(myList),1)

__7. Take the list `['I','know','Python']` as input and output the string 'I KNOW PYTHON'__

In [None]:
my_list   = ['I','know','Python']
my_string =' '.join(my_list).upper()
print(my_string)

## Exercise from yesterday

### Find the movie with the highest rating in the file `250.imdb`

<img src="img/header_imdb.png" alt="Drawing" style="width: 1000px;"/> 

Note: IMDb stands for the Internet Movie Database.

In [None]:
# Code Snippet for Finding the Movie with the Highest Rating
# Note that this is just one of the solutions
with open('../downloads/250.imdb', 'r') as fh:  
    movieList = [] 
    highestRating = -1.0  
                         
    for line in fh:     
        if not line.startswith('#'):    
            cols = line.strip().split('|')
            rating = float(cols[1].strip())
            title = cols[6].strip()
            movieList.append((rating, title))
            if rating > highestRating:
                highestRating = rating
    print("Movie(s) with highest rating " + str(highestRating) + ":" )
    for i in range(len(movieList)):
        if movieList[i][0] == highestRating:
            print(movieList[i][1])

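### The `with` keyword
- Use the `with` keyword to ensure that the file handle is closed automatically. The three snippets below show opening a file without `with` (explicit `close()`), with `with`, and with `with` plus an explicit encoding.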
```python
file = open("filename.txt", "r")
content = file.read()
file.close()
```
---
```python
with open("filename.txt", "r") as file:
    content = file.read()
```
---
```python
with open("filename.txt", "r", encoding='utf-8') as file:
    content = file.read()
```

#### In Python 3 the default encoding depends on the platform; it is usually 'utf-8' on Linux and macOS, so specifying it is often not needed.

There are other encodings such as `latin-1` and `ascii`
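
A small sketch of why the encoding matters, assuming a temporary file and a made-up text: the same bytes read back correctly with the encoding they were written in, but come out garbled when decoded as `latin-1`.

```python
import os
import tempfile

text = "Am\u00e9lie"  # "Amelie" with an accented e, a non-ASCII character

# Write the text to a temporary file, encoded as UTF-8
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w", encoding="utf-8") as fh:
    fh.write(text)

# Reading with the same encoding gives back the original text
with open(path, "r", encoding="utf-8") as fh:
    as_utf8 = fh.read()

# Reading with a different encoding gives garbled text, without any error
with open(path, "r", encoding="latin-1") as fh:
    as_latin1 = fh.read()

print(as_utf8)
print(as_latin1)
```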

In [None]:
# Alternative solution, using sorting
# Note that this is just one of the solutions
with open('../downloads/250.imdb', 'r') as fh:  # open the file   
    movieList = [] # create an empty list to start with
    for line in fh:     # iterate over the file
        if not line.startswith('#'):
            # split the line separated by '|' into a list 
            cols = line.strip().split('|')
            # extract rating and movie title from cols
            rating = float(cols[1].strip())
            title = cols[6].strip()
            # votes = int(cols[0].strip())
            # year = int(cols[2])
            movieList.append((rating, title))
    sortedMovieList = sorted(movieList, key = lambda x:x[0], reverse=True)
    highestRating = sortedMovieList[0][0]
    print("Movie(s) with highest rating " + str(highestRating) + ":" )
    for i in range(len(sortedMovieList)):
        if sortedMovieList[i][0] == highestRating:
            print(sortedMovieList[i][1])
        else:
            break

## New data type: `set`
- A set contains an unordered collection of unique and hashable objects
    - __Unordered__: Items have no defined order
    - __Unique__: Duplicate items are not allowed
    - __Hashable__: Each item must be hashable

### Syntax:  

```python
setName = set() # Create an empty set
```  
___
```python
setName = {1,2,3,4,5} # Create a populated set
setName = set([1,2,3,4,5]) # Alternative way
```

### Set is unordered

In [None]:
mySet = {"1", "2", "3", "4", "5"}
for e in mySet:
    print(e)

### Set has unique elements

In [None]:
mySet = {"1", "1", "2", "2", "3"}
print(mySet)

### Set can only have hashable elements

In [None]:
mySet = {1, "tga", (3, 4), 5.6, False}
print(mySet)

In [None]:
mySet = {1, "tga", [3, 4], 5.6, False}

In [None]:
mySet = {1, "tga", (3, 4, [1, 2]), 5.6, False}

#### Although tuples are immutable, a tuple that contains mutable items (such as a list) is not hashable. Be careful!
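
One way to check this without building a set is the built-in `hash()` function, which raises a `TypeError` for objects that are not hashable. The list of candidates below is made up for illustration.

```python
# Try to hash a few objects; hash() raises TypeError for unhashable ones
candidates = [
    (3, 4),           # tuple of immutable items
    [3, 4],           # list
    (3, 4, [1, 2]),   # tuple that contains a list
]

hashable = []
for item in candidates:
    try:
        hash(item)
        hashable.append(True)
    except TypeError:
        hashable.append(False)
    print(item, "-> hashable:", hashable[-1])
```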

### Basic operations on `set`

In [None]:
# Add elements to a set
myset = set()
myset.add(1)
myset.add(100)
myset.add(100)
print(myset)

In [None]:
# get the number of elements of a set
len(myset)

In [None]:
# membership checking
1 in myset

#### Learn more on https://www.w3schools.com/python/python_sets.asp

#### When a collection is large, membership checking with a `set` tends to be much faster than with a `list`, because a list is searched element by element while a set uses hashing

In [None]:
import time, random

# Create a large list and set
large_list = list(range(10000000))
large_set = set(large_list)
elements_to_find = random.sample(range(10000001), 10)

# Measure time for list membership check
# (time.perf_counter() is a high-resolution clock intended for timing)
list_time = time.perf_counter()
for e in elements_to_find:
    e in large_list
list_time = time.perf_counter() - list_time

# Measure time for set membership check
set_time = time.perf_counter()
for e in elements_to_find:
    e in large_set
set_time = time.perf_counter() - set_time

print(f"List check: {list_time:.6f} seconds")
print(f"Set check: {set_time:.6f} seconds")
print(f"Set is approximately {list_time / set_time:.2f} times faster.")

## Day 3, Exercise 1 (~20 min)
### Find the number of unique genres in the file `250.imdb`

<img src="img/header_imdb.png" alt="Drawing" style="width: 1000px;"/> 

- Canvas -> Modules -> Day 3 -> IMDb exercise -> 1 



- Take a break after the exercise (~10 min)

In [None]:
# Find the number of unique genres
# open the file
fh = open('../downloads/250.imdb', 'r', encoding = 'utf-8')
# empty list to start with
genres_res = []
# iterate over the file
for line in fh:
    if not line.startswith('#'):
        # split the line into a list, del |
        cols = line.strip().split('|')
        # extract genres from list, split genres into list
        genres = cols[5].strip().split(',')
        # loop over genre list and add to empty start list if genre not already in list
        for genre in genres:
            if genre.lower() not in genres_res:
                genres_res.append(genre.lower())
fh.close()
print(genres_res)
print(len(genres_res))

In [None]:
# Find the number of unique genres using set
with open('../downloads/250.imdb', 'r', encoding = 'utf-8') as fh:
    uniqueGenres = set() # create an empty set

    for line in fh:
        if not line.startswith('#'):
            cols  = line.strip().split('|')
            genre = cols[5].strip()
            glist = genre.split(',')
            for entry in glist:
                uniqueGenres.add(entry.strip().lower())
    print("Number of unique genres:", len(uniqueGenres))
    print(sorted(list(uniqueGenres)))

## Session 2     
   - Lecture: Data type `dictionary`
   - Ex2: IMDb exercise - Find the number of movies per genre + Extra
   - PyQuiz 3.1

## New data type: `dictionary`

- A dictionary is a mutable collection of key-value pairs; since Python 3.7 it preserves the order in which keys were inserted
- Each key in a dictionary must be unique and hashable (for example a string, a number, or a tuple of immutable items)
- The values associated with keys can be of any data type and can be duplicated
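
These properties can be sketched with a small made-up dictionary called `counts`; the keys and numbers are only for illustration.

```python
counts = {"drama": 4, "thriller": 2}

# Keys are unique: assigning to an existing key replaces its value
counts["drama"] = 10

# Values may repeat and can be of any type
counts["romance"] = 2

# Keys must be hashable: a tuple works, a list does not
counts[("sci-fi", "horror")] = 1
try:
    counts[["sci-fi", "horror"]] = 1
except TypeError as err:
    print("A list cannot be a key:", err)

print(counts)
```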

<br>
<img src="img/key_values.png" alt="Drawing" style="width: 1000px;"/>  

## Syntax:  
```python
d = {} # Create an empty dictionary
```  
---
```python
d = {'key1':1, 'key2':2, 'key3':3} # create a populated dictionary
```

In [None]:
myDict = {'drama': 4,
          'thriller': 2,
          'romance': 5}
myDict

### Basic operations on Dictionaries
<img src="img/dictionary.png" alt="Drawing" style="width: 600px;"/>  

In [None]:
myDict = {'drama': 4, 
          'thriller': 2, 
          'romance': 5}
myDict

In [None]:
len(myDict)

In [None]:
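# Get the value of a certain key
myDict['drama']
# Get the number of key-value pairs
len(myDict)
# Add a new key-value pair
myDict['horror'] = 2
myDict
# Delete a key-value pair
del myDict['horror']
myDict
# Check if a key exists
'drama' in myDict
# Get the keys, the key-value pairs and the values
myDict.keys()
list(myDict.items())
list(myDict.values())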

### Live Exercise

In [None]:
myDict = {'drama': 182, 
          'war': 30, 
          'adventure': 55, 
          'comedy': 46, 
          'family': 24, 
          'animation': 17, 
          'biography': 25}

- How many genres are in this dictionary?
- How many movies are in the `comedy` genre?
- You're not interested in biographies, delete this entry
- You're interested in fantasy; add that we have `29` movies in the `fantasy` genre to this dictionary.
- Which genres are listed in this dictionary after the change?
- You remembered another comedy movie; increase the number of movies in the `comedy` genre by one.

In [None]:
myDict = {'drama': 182, 
          'war': 30, 
          'adventure': 55, 
          'comedy': 46, 
          'family': 24, 
          'animation': 17, 
          'biography': 25}
print(len(myDict))
print(myDict['comedy'])
del myDict['biography']
print(myDict)
myDict['fantasy'] = 29
print(myDict)
print(myDict.keys())
myDict['comedy'] += 1
print(myDict)

### Day 3, Exercise 2 (~50 min)
- #### Find the number of movies per genre
- #### (Extra) What is the average length of the movies (in hours and minutes) in each genre?


- Canvas -> Modules -> Day 3 -> IMDb exercise -> 2&3 
___
#### Take a break after the exercise (~10 min)

#### PyQuiz 3.1 - set, list and dictionary (before lunch)
___
### Lunch

### Find the number of movies per genre

<img src="img/header_imdb.png" alt="Drawing" style="width: 1000px;"/>  


Hint! If the genre is not already in the dictionary, you have to add it first
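
The pattern from the hint can be sketched on a short made-up list of genres instead of the real file. The second loop shows `dict.get()` with a default value, which is an equivalent way to handle a missing key.

```python
genres = ["drama", "war", "drama", "comedy", "drama"]  # made-up data

# Pattern from the hint: add the key first if it is missing
counts = {}
for g in genres:
    if g not in counts:
        counts[g] = 0
    counts[g] += 1

# Equivalent pattern using dict.get() with a default of 0
counts_get = {}
for g in genres:
    counts_get[g] = counts_get.get(g, 0) + 1

print(counts)
print(counts_get)
```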

### (Extra) What is the average length of the movies (hours and minutes) in each genre?

<img src="img/header_imdb.png" alt="Drawing" style="width: 1000px;"/>  

### Answer

<img src="img/movie_dict.png" alt="Drawing" style="width: 500px;"/>  

In [None]:
# Find the number of movies per genre
fh        = open('../downloads/250.imdb', 'r', encoding = 'utf-8')
genreDict = {}     # create empty dictionary

for line in fh:
    if not line.startswith('#'):
        cols  = line.strip().split('|')
        genre = cols[5].strip()
        glist = genre.split(',')
        for entry in glist:
            if entry.lower() not in genreDict: # if genre is not in dictionary, start its count at 1
                genreDict[entry.lower()] = 1
            else:
                genreDict[entry.lower()] += 1   # if genre is in dictionary, increase count by 1
fh.close()
print(genreDict)


### (Extra) What is the average length of the movies (in hours and minutes) for each genre?

<img src="img/header_imdb.png" alt="Drawing" style="width: 1000px;"/>  

### Answer

<img src="img/average_length.png" alt="Drawing" style="width: 500px;"/>  


__Tip!__  
Here you have to loop twice

In [None]:
# Calculate the average length of the movies (in hours and minutes) for each genre
fh        = open('../downloads/250.imdb', 'r', encoding = 'utf-8')
genreDict = {}

for line in fh:
    if not line.startswith('#'):
        cols    = line.strip().split('|')
        genre   = cols[5].strip()
        glist   = genre.split(',')
        runtime = cols[3]      # length of movie in seconds
        for entry in glist:
            if entry.lower() not in genreDict:
                genreDict[entry.lower()] = []   # start an empty list for a new genre
            genreDict[entry.lower()].append(int(runtime))   # append runtime to the list of this genre
fh.close()
                
for genre in genreDict:      # loop over the genres in the dictionaries
    average = sum(genreDict[genre])/len(genreDict[genre])  # calculate average length per genre
    hours   = int(average/3600)                                 # format seconds to hours
    minutes = (average - (3600*hours))/60             # format seconds to minutes
    print('The average length for movies in genre '+genre\
          +' is '+str(hours)+'h'+str(round(minutes))+'min')

## Session 3    
   - Lecture: Write your own functions
   - Exercise 3: Functions 

Note: over the past two days we have used many functions that were written by others. How do we write our own functions?

### We have used many built-in functions

In [None]:
print("Hello Python")

In [None]:
len("ACCCCTTGAACCCC")

In [None]:
max([87, 131, 69, 112, 147, 55, 68, 130, 119, 50])

### How to write your own functions?

### Syntax of function
```python
def function_name(arg1, arg2, ...):
    # Block of code
    return result
```

In [None]:
def SayHi(name):
    print("Hi", name)

SayHi('Mike')
SayHi('Anna')

Note: let's take a look at the following code, which calculates the average duration of movies in the genre 'drama'. If we also want the average duration for the genre 'horror', we would need to either copy the code or use a loop and put the complicated code inside the loop.

Is there a better solution? This is where functions come in.

In [None]:
# Calculate the average duration of movies in the genre 'drama'
genre = "drama"
average = sum(genreDict[genre])/len(genreDict[genre])  # calculate average length per genre
hours   = int(average/3600)                                 # format seconds to hours
minutes = (average - (3600*hours))/60             # format seconds to minutes
reformattedTime = str(hours)+'h'+str(round(minutes))+'min'
print('The average length for movies in genre '+ genre +\
      ' is '+ reformattedTime)

Live coding: write a function based on the previous code

In [None]:
def formatSec(genre, genreDict):
    average   = sum(genreDict[genre])/len(genreDict[genre])
    hours     = int(average/3600)
    minutes   = (average - (3600*hours))/60   
    reformattedTime = str(hours)+'h'+str(round(minutes))+'min'
    return reformattedTime
genre = 'drama'
print('The average length for movies in genre '+ genre +\
        ' is '+ formatSec(genre, genreDict))

In [None]:
for genre in ['drama', 'horror', 'comedy']:
    average = sum(genreDict[genre])/len(genreDict[genre])  # calculate average length per genre
    hours   = int(average/3600)                                 # format seconds to hours
    minutes = (average - (3600*hours))/60             # format seconds to minutes
    reformattedTime = str(hours)+'h'+str(round(minutes))+'min'
    print('The average length for movies in genre '+ genre +\
          ' is '+ reformattedTime)

In [None]:
for genre in ['drama', 'horror', 'comedy']:
    print('The average length for movies in genre '+ genre +\
          ' is '+ formatSec(genre, genreDict))

## Why use functions?
```python
for genre in ['drama', 'horror', 'comedy']:
    print('The average length for movies in genre '+ genre +\
          ' is '+ formatSec(genre, genreDict))
```

- Cleaner code
- Better defined tasks in code
- Re-usability
- Better structure

### Scope 

- Local variables - Variables within functions
- Global variables - Variables outside of functions

In [None]:
WEIGHT = 5
def addWeight(value):
    return value * WEIGHT
print(addWeight(4))

In [None]:
WEIGHT = 5
def changeWeight():
    WEIGHT = 10   # creates a new local variable, the global WEIGHT is not changed
    return None
changeWeight()
print(WEIGHT)     # still 5

#### We will talk more about the scope of variables tomorrow

## Use external libraries in Python

In [None]:
math.sqrt(5)

In [None]:
import math
math.sqrt(5)

In [None]:
sqrt(5)

In [None]:
from math import sqrt
sqrt(5)

In [None]:
import sys
del sys.modules['math']  # remove the cached module so that it is loaded again on the next import

## Why use libraries
   - Cleaner code
   - Better defined tasks in code
   - Re-usability
   - Better structure


### How to define your own libraries
####  A simple library is just a file with some Python functions

In [None]:
def formatSec(seconds):
    hours     = seconds/3600
    minutes   = (seconds - (3600*int(hours)))/60   
    return str(int(hours))+'h'+str(round(minutes))+'min'


def toSec(days, hours, minutes, seconds):
    total = 0
    total += days*60*60*24
    total += hours*60*60
    total += minutes*60
    total += seconds
 
    return str(total)+'s'

Example:
1. Create a file called `myFunctions.py`, located in the same folder as your script
2. Put the functions `formatSec()` and `toSec()` in the file
3. Start writing your code in a separate file and `import` the functions
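
The three steps can also be tried from a single cell. The sketch below writes a tiny module to a temporary folder and imports from it; the module name `my_demo_functions` and the function `to_minutes` are made up for this example, and the `sys.path` line is only needed because the module is not in the same folder as the running code.

```python
import os
import sys
import tempfile

# Steps 1 and 2: create a module file containing one function
module_dir = tempfile.mkdtemp()
module_code = '''
def to_minutes(seconds):
    """Convert seconds to whole minutes."""
    return seconds // 60
'''
with open(os.path.join(module_dir, "my_demo_functions.py"), "w") as fh:
    fh.write(module_code)

# Make the temporary folder importable; a script located in the same
# folder as the module would not need this step
sys.path.insert(0, module_dir)

# Step 3: import the function and use it
from my_demo_functions import to_minutes

print(to_minutes(3600))
```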

In [None]:
from myFunctions import formatSec, toSec

formatSec(3601)

In [None]:
toSec(days=0, hours=1, minutes=0, seconds=1)

In [None]:
from myFunctions import formatSec, toSec

seconds = 21154
print(formatSec(seconds))

days    = 0
hours   = 21
minutes = 56
seconds = 45

print(toSec(days, hours, minutes, seconds))

### myFunctions.py

<img src="img/myFunctions.png" alt="Drawing" style="width: 600px;"/>  

## Summary

- A function is a block of organized, reusable code that is used to perform a single, related action
- Variables within a function are local variables
- Variables outside of functions are global variables
- Functions can be organized in separate files as libraries and be imported to the main code

## Day 3, Exercise 3 (~30 min)


- Canvas -> Modules -> Exercise 3 - functions 
___

#### Take a break after the exercise (~10 min)

## Session 4    
   - Lecture: Pass arguments from command line using `sys.argv` and string formatting
   - Ex4: IMDb exercise - functions and `sys.argv`
   - PyQuiz 3.2

### How to pass arguments to Python script from the command line?

#### Not just 
```bash
    python myscript.py
```

#### But also

```bash
    python myscript.py arg1 arg2
```

## `sys.argv`

- Avoid hardcoding the filename in the code
- Easier to re-use code for different input files
- Uses command-line arguments
- `sys.argv` is a list of strings:
    - Position 0: the program name
    - Position 1: the first argument
    - Position 2: the second argument
    - etc

### How to use it
```python
import sys

program_name = sys.argv[0]
arg1 = sys.argv[1] # IndexError if the first argument is not provided on the command line
arg2 = sys.argv[2] # IndexError if the second argument is not provided on the command line
```
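
To see what `sys.argv` contains without leaving the notebook, a small program can be started with `subprocess`. This is only a sketch: the program is passed with `-c`, so position 0 holds `-c` instead of a script name, and the arguments `250.imdb` and `3` are made up. The last lines show that arguments always arrive as strings and must be converted when a number is needed.

```python
import subprocess
import sys

# A tiny program, passed with -c, that prints its own sys.argv
program = "import sys; print(sys.argv)"

# Run it with two extra command-line arguments
result = subprocess.run(
    [sys.executable, "-c", program, "250.imdb", "3"],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())

# Arguments are strings; convert them when a number is needed
argv = ["myscript.py", "250.imdb", "3"]  # what sys.argv could look like
top_n = int(argv[2])
print(top_n + 1)
```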



Note: we will talk more about modules and how to use them tomorrow. Now we will switch to the terminal again

### Try out `sys.argv`

The Python script is called `print_argv.py` and can be found in the downloads folder.

#### Run the following commands in the terminal
```bash
python print_argv.py
python print_argv.py 1 
python print_argv.py arg1 arg2 arg3
```

### Naive code to copy a text file

In [None]:
input_file = "../downloads/250.imdb"
output_file = "newfile.imdb"

with open(input_file, "r") as fi:
    with open(output_file, "w") as fo:
        for line in fi:
            fo.write(line)

Copy the code to a file called `copy_file_naive.py` and run it. However, to write to a different output file or to copy another input file, the code itself has to be changed, so the reusability of the script is poor.

In [None]:
# Code that can deal with command line arguments
import sys

usage = f"{sys.argv[0]} inputFile outputFile"

if len(sys.argv) < 3:
    print(usage)
    sys.exit(1)

input_file = sys.argv[1]
output_file = sys.argv[2]

with open(input_file, "r") as fi:
    with open(output_file, "w") as fo:
        for line in fi:
            fo.write(line)

## String formatting

Format text for printing or for writing to file.

What we have been doing so far:

In [None]:
title  = 'Toy Story'
rating = 10
print('The result is: ' + title + ' with rating: ' + str(rating))

Other (better) ways of formatting strings:

<br>

__f-strings (since Python 3.6)__

In [None]:
title  = 'Toy Story'
rating = 10
print(f'The result is: {title} with rating: {rating}')

__format method__

In [None]:
title  = 'Toy Story'
rating = 10
print('The result is: {} with rating: {}'.format(title, rating))

__The old way: `%` formatting (common in Python 2, still works in Python 3)__

In [None]:
title  = 'Toy Story'
rating = 10
print('The result is: %s with rating: %s' % (title, rating))

### Day 3, Exercise 4 (~30 min)
- #### Restructure and write the output to a new file

- Canvas -> Modules -> Day 3 -> IMDb exercise -> 4
- Work in pairs

#### PyQuiz 3.2 
___
### Project time

In [None]:
# 
!cat ../scripts/reformat_imdb.py