<div style="width: 100%; clear: both;">
    <div style="float: left; width: 50%;">
       <img src="http://www.uoc.edu/portal/_resources/common/imatges/marca_UOC/UOC_Masterbrand.jpg" align="left">
    </div>
</div>

<div style="float: right; width: 50%;">
    <p style="margin: 0; padding-top: 22px; text-align:right;">22.403 · Programming Fundamentals</p>
    <p style="margin: 0; text-align:right;">Data Science Master's Degree</p>
</div>

</div>
<div style="width: 100%; clear: both;">
<div style="width:100%;">&nbsp;</div>

Programming Fundamentals
============================


This notebook deals with two topics:
- Creating functions whose parameters have different characteristics
- Interacting with files and the operating system from code

In [1]:
# Checks a cell for pep8 compliance
%load_ext pycodestyle_magic
%pycodestyle_on

# Exercise 1

Answer the following questions with True / False and briefly reason your answer.

**(a) When calling a function, we must pass exactly the same number of arguments as parameters the function has defined.**

**Answer:** 
False. When functions have parameters with default arguments, the parameters become optional. This means that depending on the desired result when executing a function, it may not be necessary to specify the value of a parameter, as there could already be one assigned to it. The syntax of a function with a default value assigned to a parameter would be as follows:

    def foo(x, y=1):
        pass  # To-Do

In addition, functions can be defined to accept an indeterminate number of arguments. This type of function has the following syntax:
    
    def foo(x, y, *extra_args, **extra_kwargs):
        pass  # To-Do
        
In this type of function it is mandatory to pass arguments for the `x` and `y` parameters. The `extra_args` parameter collects any additional positional arguments into a tuple, and `extra_kwargs` collects any additional keyword arguments into a dictionary. Therefore, this function accepts anywhere from 2 arguments to (theoretically) an unlimited number, even though it declares only 4 parameters.
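The two cases can be sketched as follows. The function names `greet` and `count_arguments` are hypothetical and serve only as an illustration.

```python
def greet(name, greeting="Hello"):
    """Return a greeting; `greeting` is optional because it has a default."""
    return "{}, {}!".format(greeting, name)


def count_arguments(x, y, *extra_args, **extra_kwargs):
    """Return how many arguments were received in total."""
    # x and y are mandatory; everything else is collected automatically.
    return 2 + len(extra_args) + len(extra_kwargs)


print(greet("Ada"))                      # the default value is used
print(greet("Ada", "Good morning"))      # the default value is overridden
print(count_arguments(1, 2))             # only the mandatory arguments
print(count_arguments(1, 2, 3, 4, z=5))  # two extra positional, one keyword
```

Both functions are called with fewer or more arguments than the number of parameters they declare, which is why the statement in (a) is false.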

**(b) In the next fragment of code we are passing a parameter by reference. Therefore, the variable `*args` could be affected if we try to assign inside the function a new value to the variable.**

In [2]:
def foo(*args):
    acc = 0
    acc_sum = [acc + arg for arg in args]
    return sum(acc_sum)

**Answer:** 
False. An `*args` parameter lets the function receive an indeterminate number of positional arguments, which Python packs into a new tuple named `args`. Tuples are immutable, so trying to modify one of its elements (for example `args[0] = 1`) raises a `TypeError`. Assigning a new value to the name `args` inside the function is allowed, but it only rebinds the local variable and does not affect the objects the caller passed in.
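A small experiment can confirm how `args` behaves. The helper names `rebind` and `mutate` are hypothetical and exist only for this check.

```python
def rebind(*args):
    # Rebinding the local name is allowed; it only changes the local variable.
    args = ("replaced",)
    return args


def mutate(*args):
    # Item assignment on a tuple raises TypeError.
    try:
        args[0] = "replaced"
    except TypeError:
        return "TypeError"
    return "no error"


values = [1, 2, 3]
print(rebind(*values))  # ('replaced',)
print(mutate(*values))  # TypeError
print(values)           # [1, 2, 3]: the caller's list is untouched
```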

# Exercise 2


We want to create a function that checks if a year is a [leap year](https://ca.wikipedia.org/wiki/Any_de_trasp%C3%A0s) or not.

The function must receive two parameters:
- the year we want to check,
- an **optional** parameter to decide if we want the function to show the reason why the year is not considered a leap year (in case the year is a leap year the function should not show any message).

The return type of the function must be *Boolean (True / False)*.

According to the Gregorian calendar, a year is considered a leap year if it meets the following conditions:

- If 4 is a divisor of the year, it is a leap year, except:
- If 100 is also a divisor of the year, it is **NOT** considered a leap year, except:
- If 400 is also a divisor of the year, it **IS** considered a leap year.
   
You can see some examples of the expected return:

2000 -> True

1800 -> False

2100 -> False
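The three rules can also be collapsed into a single boolean expression. The sketch below is one possible formulation; the name `is_leap_sketch` is hypothetical, and the optional message is left out.

```python
def is_leap_sketch(year):
    """Return True if `year` is a leap year in the Gregorian calendar."""
    # Divisible by 4 and not by 100, or divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


for year in (2000, 1800, 2100):
    print(year, "->", is_leap_sketch(year))
```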

**Note**. **You cannot use the `isleap` function we saw in Unit 1**. Apply your function to the years defined in the next cell and check the results against the `isleap` function.


In [2]:
# Answer
import calendar

years = [1800, 1996, 2000, 2101, 2400]


def find_leap(year, reason=False):
    """ Evaluates if a year is a leap-year

    Parameters:
        year(int): the year that will be evaluated

        reason(boolean): by default is False. If True will show why year
            IS NOT a leap-year.

    Returns:
        conclusion(boolean): True if the year is a leap-year. False if not.
    """
    if year % 4 == 0:
        if year % 100 == 0:
            if year % 400 == 0:
                conclusion = True
            else:
                conclusion = False
                if reason:
                    print("Reason: The year is divisible by 4 and 100,",
                          "but not 400")
        else:
            conclusion = True
    else:
        conclusion = False
        if reason:
            print("Reason: The year is not divisible by 4.")
    return conclusion

In [3]:
for year in years:
    print("Calendar ISLEAP() function on year {} -> {}".format(
        year, calendar.isleap(year)))
    print("New function FIND_LEAP() on year {} -> {}\n".format(
        year, find_leap(year, reason=True)))

Calendar ISLEAP() function on year 1800 -> False
Reason: The year is divisible by 4 and 100, but not 400
New function FIND_LEAP() on year 1800 -> False

Calendar ISLEAP() function on year 1996 -> True
New function FIND_LEAP() on year 1996 -> True

Calendar ISLEAP() function on year 2000 -> True
New function FIND_LEAP() on year 2000 -> True

Calendar ISLEAP() function on year 2101 -> False
Reason: The year is not divisible by 4.
New function FIND_LEAP() on year 2101 -> False

Calendar ISLEAP() function on year 2400 -> True
New function FIND_LEAP() on year 2400 -> True



# Exercise 3

We want to create a function that validates a list of email addresses in a file.

The function will receive the path of a file containing the email addresses as *input* and its *output* should be a tuple with the format `(address_number, address_list)` where:

- `address_number` represents the number of **invalid** email addresses.
- `address_list` represents the list of invalid email addresses sorted alphabetically. We should be able to hide this second element of the tuple using the parameters.

Below you can see some examples of output formats of the function.

```
(3,)
(3, ['hello@icloud', 'g@sbcglobal.global', 'foo@*bar*.com'])
```

An email address is considered valid if it meets the following conditions:
- It follows the format `<username>@<domain>.<extension>`
- The username (`<username>`) contains only alphanumeric characters or the characters `_` (underscore) or `-` (hyphen)
- The domain (`<domain>`) contains only alphanumeric characters
- The extension (`<extension>`) contains only characters from the Latin alphabet
- The maximum length for the extension (`<extension>`) is 3 characters
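One way to encode these conditions is a single regular expression checked with `re.fullmatch`. This is a sketch under assumptions: "alphanumeric" is read as ASCII letters and digits, every part must be non-empty, and the names `EMAIL_PATTERN` and `is_valid_email` are hypothetical.

```python
import re

# username: letters, digits, "_" or "-"; domain: letters and digits;
# extension: 1 to 3 Latin letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9_-]+@[A-Za-z0-9]+\.[A-Za-z]{1,3}")


def is_valid_email(address):
    """Return True if the whole string matches the pattern."""
    return EMAIL_PATTERN.fullmatch(address) is not None


samples = ["user_1@mail.com", "hello@icloud", "g@sbcglobal.global",
           "foo@*bar*.com"]
invalid = sorted(a for a in samples if not is_valid_email(a))
print((len(invalid), invalid))
```

Using `fullmatch` avoids the need for explicit `^` and `$` anchors, and `sorted` returns the invalid addresses in alphabetical order.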

The format of the input file will be:
```
Address 1
Address 2
...
Address N
```

Each line of the file represents an email address to validate. You can find examples of input files in the `data/ex3` folder.

In [6]:
# Answer
import glob
import re


def email_finder(path, show_list="list"):
    """ Counts and stores the invalid e-mail addresses found in a file that
    contains one address per line.

    Parameters:
        path(str): the path where the file is.

        show_list(str): by default show_list="list", which returns all the
            invalid e-mail addresses as a list. If show_list="nolist" the
            list is not returned.

    Returns:
        (regex_count, email_list)(tuple): if show_list="list" returns the count
            of invalid e-mail addresses and a list with the invalid addresses.
        (regex_count,)(tuple): if show_list="nolist" returns only the count of
            invalid addresses.
    """
    email_list = []

    # Extract the lines and store them in a list
    with open(path, 'r') as file:
        for line in file:
            email_list.append(line.strip())

    # Keep only the addresses that do NOT match the pattern.
    # The pattern only accepts lowercase letters.
    # https://docs.python.org/3/howto/regex.html
    pattern = r'^[a-z0-9_-]*@[a-z0-9]*\.[a-z]{1,3}$'
    email_list = [email for email in email_list
                  if not re.search(pattern, email)]

    # Count the number of invalid addresses.
    regex_count = len(email_list)

    if show_list == "list":
        return (regex_count, email_list)

    if show_list == "nolist":
        return (regex_count,)

In [7]:
path = 'data/ex3/*'
files = glob.glob(path, recursive=True)

for i in files:
    result = email_finder(i, "nolist")
    print(result)
    result = email_finder(i)
    print(result, "\n")

(3,)
(3, ['wonderkid@*1956*.', 'bastian@icloud', 'ghaviv@sbcglobal.global']) 

(0,)
(0, []) 

(4,)
(4, ['drjlaw$@m-ac.france', '_mleary!@sbcglobal', '__muzzy__@@optonline.net', 'smpeters2$mac$com']) 



# Exercise 4

We want to create a function that creates an ordered structure of files from an input file with statistics of the points obtained by various users in bowling games.

The input file will have an indeterminate number of lines in the following format:

```
<user_name> <match_date> <points>
```

where:

- `<user_name>`: contains alphanumeric characters
- `<match_date>`: follows a `YYYY-MM-DD` format
- `<points>`: integer that represents the number of points obtained in the game

The expected output is a structure of directories and files as described below:
```
<root_folder>

    <user_name>

        <match_date>
            scores.txt

        <match_date>
            scores.txt
            
        ...
        
    <user_name>
    
    ...
```

You will have to group the data in the original file by username and game date (the same user can play more than one game on the same day). The `scores.txt` file must be in the following format:

```
Total: <sum_of_points>
Game #1: <points_obtained>
Game #2: <points_obtained>
...
Game #N: <points_obtained>
```

`<sum_of_points>` must be replaced by the sum of points for all the games of that day and `<points_obtained>` by the points obtained in each individual game. 

The directory where this structure is to be generated must be a parameter of the function. You can find an example of the input files in the `data/ex4` folder.
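The grouping step and the `scores.txt` layout can be sketched without touching the file system. The names `group_scores` and `format_scores` and the sample lines are hypothetical and are not taken from the files in `data/ex4`.

```python
from collections import defaultdict


def group_scores(lines):
    """Group points by (user_name, match_date), keeping the input order."""
    grouped = defaultdict(list)
    for line in lines:
        user, date, points = line.split()
        grouped[(user, date)].append(int(points))
    return grouped


def format_scores(points):
    """Build the text of one scores.txt file from a list of points."""
    rows = ["Total: {}".format(sum(points))]
    rows += ["Game #{}: {}".format(number, value)
             for number, value in enumerate(points, start=1)]
    return "\n".join(rows) + "\n"


# Made-up sample input in the <user_name> <match_date> <points> format
sample = ["anna 2020-11-01 120", "anna 2020-11-01 95", "ben 2020-11-02 150"]
groups = group_scores(sample)
print(format_scores(groups[("anna", "2020-11-01")]))
```

Each `(user, date)` key corresponds to one `<user_name>/<match_date>` folder, so writing the files becomes a loop over the dictionary.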

In [8]:
import glob
import pandas as pd
import os


def create_sort_df(path):
    """ Creates a DataFrame from scores_unfiltered.txt and sorts it by the
    'names' and 'dates' columns.

    Parameters:
        path(str): the folder that contains scores_unfiltered.txt.

    Returns:
        df(DataFrame): the sorted DataFrame created from
            scores_unfiltered.txt.
    """
    # Get the path of the scores_unfiltered.txt
    df_path = os.path.join(path, '*')
    files = glob.glob(df_path, recursive=True)
    # Create and sort a dataframe from scores_unfiltered.txt
    df = pd.read_csv(files[0], sep=" ", header=None,
                     names=['names', 'dates', 'points'])
    df = df.sort_values(by=['names', 'dates'])
    return df

In [9]:
def create_append_file(path, count, points):
    """ Creates a file (or appends to it, if it already exists) with one line
    per match, containing the match number and the points obtained.

    Parameters:
        path(str): the path where the file is going to be created.

        count(int): number of the match.

        points(int): number of points obtained in the match.

    Returns: nothing.
    """
    # Append a line with the match number and the number of points
    with open(path, 'a') as file:
        file.write("Game #" + str(count) + ": " + str(points) + "\n")

In [10]:
def append_total(path, count):
    """ Writes the total of points as the first line of the file. To do so,
    it reads the existing content (matches and points), rewrites the file
    starting with the total line, and then writes the previous content back.

    Parameters:
        path(str): the path of the file.

        count(int): total of points obtained on the date of this file.

    Returns: nothing.
    """
    # Read the content of the existing file
    with open(path, 'r') as file:
        content = file.read()
    # Overwrite the file: total line first, previous content after it
    with open(path, 'w') as file:
        file.write("Total: " + str(count) + "\n")
        file.write(content)

In [11]:
def main_function(path):
    """ Reads scores_unfiltered.txt and creates, in the same directory, a
    folder for every user. Inside each user folder it creates one folder per
    match date, and inside each date folder a scores.txt file with the
    matches of that user on that date.

    Parameters:
        path(str): the folder that contains scores_unfiltered.txt.

    Returns: nothing.
    """
    # Create the sorted dataframe
    df = create_sort_df(path)

    name_var, date_var, points_var = "Nothing", None, None
    points_count = 0

    for column, row in df.iterrows():
        if name_var != row.names:
            # Write the total of the previous file when the user changes
            # (skipped for the very first user, when name_var is "Nothing")
            if name_var != "Nothing":
                append_total(new_score_file_path, points_count)

            # Create a new folder for every new user
            date_var = None
            name_var = row.names
            new_user_folder_path = os.path.join(path, name_var)
            os.mkdir(new_user_folder_path)

            if date_var != row.dates:
                # Create a new folder for every date inside every user's folder
                date_var = row.dates
                new_date_folder_path = os.path.join(new_user_folder_path,
                                                    date_var)
                os.mkdir(new_date_folder_path)

                # New file for every user's date
                count = 1
                new_score_file_path = os.path.join(new_date_folder_path,
                                                   'scores.txt')
                create_append_file(new_score_file_path, count, row.points)
                points_count = row.points
                points_var = row.points

        else:
            if date_var != row.dates:
                # Write the total of the previous file when the date changes
                append_total(new_score_file_path, points_count)
                points_count = row.points
                points_var = row.points
                date_var = row.dates
                new_date_folder_path = os.path.join(new_user_folder_path,
                                                    date_var)
                os.mkdir(new_date_folder_path)

                # New file for each date; match numbering restarts at 1
                count = 1
                new_score_file_path = os.path.join(new_date_folder_path,
                                                   'scores.txt')
                create_append_file(new_score_file_path, count, row.points)

            else:
                # Append to the file every further match of the same
                # user and date, even if the score repeats
                points_count += row.points
                count += 1
                create_append_file(new_score_file_path, count, row.points)
    # Write the total of the last file, which no later row would trigger
    append_total(new_score_file_path, points_count)

In [12]:
path = 'data/ex4'
main_function(path)

# Exercise 5

Given the file structure of the previous exercise, we want to create a function that compresses the statistics older than a **certain number of days**. For example, if we want to archive all game statistics that were played more than 7 days ago, we must compress the directories with the format `YYYY-MM-DD` that correspond to a date earlier than 7 days ago.
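The "older than N days" rule can be sketched with `datetime.date` arithmetic. The helper name `is_older_than` and its optional `today` parameter are hypothetical; the parameter is there only so the example gives a stable result.

```python
import datetime


def is_older_than(folder_name, days, today=None):
    """Return True if `folder_name` is a date more than `days` days ago."""
    today = today or datetime.date.today()
    folder_date = datetime.datetime.strptime(folder_name, "%Y-%m-%d").date()
    return (today - folder_date).days > days


reference = datetime.date(2020, 11, 17)  # fixed date so the result is stable
print(is_older_than("2020-11-09", 7, today=reference))  # 8 days ago
print(is_older_than("2020-11-10", 7, today=reference))  # exactly 7 days ago
```

Working with `date` objects rather than `datetime` objects keeps the time of day out of the comparison.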

The user should be able to choose the compression format: ZIP, or [TAR](https://en.wikipedia.org/wiki/Tar) combined with the [gzip](https://ca.wikipedia.org/wiki/Gzip) compression method. By default, the function should compress directories using ZIP.

Therefore, the above directory structure should look similar to:

```
<root_directory>

    <user_name>

        <match_date>
            scores.txt

        <match_date>.<compression_format>
            
        ...
        
    <user_name>
    
        <match_date>.<compression_format>
        
        ...
```

where `<compression_format>` is `zip` or `tar.gz`.

**Note 1** In Python we have the `tarfile` module that allows us to work with files packaged with the format *tar* ([link](https://docs.python.org/3/library/tarfile.html#module-tarfile)).

**Note 2** Once the folders are compressed, they must be removed.

**Note 3** We need to create a compressed file for each `<match_date>` folder. 
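As an alternative to handling `zipfile` and `tarfile` separately, the standard library's `shutil.make_archive` supports both formats through its `"zip"` and `"gztar"` format names. The sketch below assumes that approach; the helper name `archive_and_remove` and the temporary folder contents are hypothetical.

```python
import os
import shutil
import tempfile


def archive_and_remove(folder, archive_format="zip"):
    """Archive `folder` next to itself, then delete the original folder."""
    parent, name = os.path.split(os.path.normpath(folder))
    # archive_format "zip" gives <name>.zip, "gztar" gives <name>.tar.gz
    archive = shutil.make_archive(os.path.join(parent, name), archive_format,
                                  root_dir=parent, base_dir=name)
    shutil.rmtree(folder)
    return archive


# Demonstration in a temporary directory with made-up contents.
with tempfile.TemporaryDirectory() as root:
    day = os.path.join(root, "2020-01-01")
    os.mkdir(day)
    with open(os.path.join(day, "scores.txt"), "w") as file:
        file.write("Total: 10\nGame #1: 10\n")
    result = archive_and_remove(day, "gztar")
    print(os.path.basename(result), os.path.isdir(day))
```

Unlike an archive that stores only `scores.txt`, this one keeps the `<match_date>` folder as the top-level entry, so extracting it restores the original layout.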

In [14]:
import zipfile as zf
import tarfile
from os.path import basename
import shutil as su
import datetime


def remove(path):
    """ Removes all the folders inside a directory, together with their
    contents. Can be used to undo the effects of main_function (exercise 4).

    Parameters:
        path(str): path where all its folders and their content will be removed

    Returns: nothing
    """
    # Get all the folders paths inside parameter path
    inside_folder = os.path.join(path, "*")
    inside_folder = glob.glob(inside_folder, recursive=True)
    # Remove every folder and its content
    for i in inside_folder:
        if os.path.isdir(i):
            su.rmtree(i)

In [15]:
def find_minim_day(number):
    """ Subtracts a number of days from today's date and prints the result.

    Parameters:
        number(int): number of days to subtract from today's date.

    Returns:
        minim_day(datetime): folders dated before this moment will be
            compressed.
    """
    # Subtract number + 1 days from the current date and time
    days_number = datetime.timedelta(days=number+1)
    minim_day = datetime.datetime.today() - days_number
    str_minim_day = minim_day.strftime("%Y-%m-%d")
    print("Date selected: {}".
          format(str_minim_day))
    return minim_day

In [19]:
def create_zip_delete_folder(name_zip, path_folder, compress_type="ZIP"):
    """ Creates a new compressed file in zip format by default with the content
    from the selected folder. The format can also be tar.gz. Finally it removes
    the original folder with its content.

    Parameters:
        name_zip(str): the name of the new compressed file.

        path_folder(str): the path of the original folders.

        compress_type(str): 'ZIP' (default) compresses in zip format;
            'TAR' compresses in tar.gz format.

    Returns: nothing
    """
    inside_folder = glob.glob(os.path.join(path_folder, "*"))

    # Create a zip archive containing the first file found in the folder
    if compress_type == "ZIP":
        with zf.ZipFile(name_zip + ".zip", 'w',
                        compression=zf.ZIP_DEFLATED) as zip_fdr:
            zip_fdr.write(inside_folder[0], basename(inside_folder[0]))

    # Create a gzip-compressed tar archive containing that file
    # https://docs.python.org/3/library/tarfile.html#module-tarfile
    elif compress_type == "TAR":
        with tarfile.open(name_zip + ".tar.gz", "w:gz") as tar:
            tar.add(inside_folder[0], basename(inside_folder[0]))

    else:
        raise ValueError("compress_type must be 'ZIP' or 'TAR'")

    # Remove the original folder once it has been archived
    if os.path.isdir(path_folder):
        su.rmtree(path_folder)

In [17]:
def main_zip_file(number, zip_path, compression_type="ZIP"):
    """Compresses the date folders inside the users' folders that are older
    than a certain date, and removes the original folders and their content.
    Uses find_minim_day to calculate the cut-off date and
    create_zip_delete_folder to compress each selected folder in zip or tar.gz
    format, depending on the compression_type parameter.

    Parameters:
        number(int): number of days to subtract from the current date. Folders
            older than the resulting date will be compressed.

        zip_path(str): path where the users' folders are

        compression_type(str): 'ZIP' (default) compresses in zip format;
            'TAR' compresses in tar.gz format.

    Returns: nothing
    """
    # Finds the day since the folders will be selected
    minim_day = find_minim_day(number)

    user_folders_path = os.path.join(zip_path, '*/')
    user_folders = glob.glob(user_folders_path, recursive=True)

    # Get the date folders inside every user's folder
    for user_folder in user_folders:
        if os.path.isdir(user_folder):
            dates_folders_path = os.path.join(user_folder, '*/')
            dates_folders = glob.glob(dates_folders_path, recursive=True)

            # Parse the date folder's name (without trailing slash) as a date
            for date_folder in dates_folders:
                date = date_folder[-11:-1]
                strp_folder_date = datetime.datetime.strptime(date, "%Y-%m-%d")
                str_folder_date = strp_folder_date.strftime("%Y-%m-%d")

                # Compress the folder if it is older than the selected date
                if minim_day > strp_folder_date:
                    name_zip_folder = os.path.join(user_folder,
                                                   str_folder_date)
                    # Create the archive and remove the original folder
                    create_zip_delete_folder(name_zip_folder,
                                             date_folder,
                                             compression_type)

In [18]:
path = 'data/ex4'

# Removes main_function effect to avoid problems if there has been any change.
remove(path)
main_function(path)

# Create the zip files
main_zip_file(60, path, "ZIP")

Date selected: 2020-09-17
