Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart Kernel) and then **run all cells** (in the menubar, select Run$\rightarrow$Run All Cells). Alternatively, you can use the **validate** button in the assignment list panel.

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE". When you insert your code, you can remove the line `raise NotImplementedError()`. Also put your name, matriculation number, and collaborators below:

In [None]:
NAME = "Li xinghan"
MATRICULATIONNUMBER = "5130032"
COLLABORATORS = ""

---

<img src="images/logo_ifn.svg" alt="Drawing" style="width: 256px;" align="right"/>

# Exercise 1.2: Data Handling

Python is often used as a more advanced tool than, e.g., Excel to analyze large amounts of data. To this end, it is essential to know how to load and store different types of data and to understand the implications of different file types. In the following, we will introduce various methods and tools to handle data and to interact with the system's storage. First, let's load some libraries that we will need later.

In [1]:
import os
import PIL
import cv2
import json
import pickle
import pandas as pd
import numpy as np
import random
import params
from ipylab import JupyterFrontEnd
PARAM_1, PARAM_2, PARAM_3 = params.gen_params(os.getcwd())
PARAM_4 = int(params.gen_params(os.getcwd(), mode='float', num=1)[0] *100000)
app = JupyterFrontEnd()
app.commands.execute('notebook:render-all-markdown')

### Task 1.2-A: Loading Simple Text Files (5P)

First, we want to read simple text files. You will find a simple file in "./data/simple_text.txt". Please load the file line-wise and solve the following tasks:
- The first row contains characters (in this case only letters). Please delete all consonants and return the resulting string as `string`.
- The second row contains only numbers. Please delete all odd numbers and return the remaining numbers as a string in the variable `numbers`.
- The third row contains several names. Convert them into a list and sort them alphabetically. Return the list as `names`.
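
The three row operations can be tried on in-memory strings before touching the file. The following is a minimal sketch with hypothetical sample rows (not the contents of `simple_text.txt`), assuming that numbers and names are separated by whitespace:

```python
# Hypothetical sample rows; the real file content may differ.
sample_lines = ["Hello World\n", "1 2 3 4 10\n", "Niels Albert Marie\n"]

# Keep only vowels (this also drops spaces and the trailing newline).
vowels_only = "".join(ch for ch in sample_lines[0] if ch.lower() in "aeiou")

# Keep only even numbers, assuming whitespace-separated integers.
even_numbers = " ".join(tok for tok in sample_lines[1].split() if int(tok) % 2 == 0)

# Split the names on whitespace and sort them alphabetically.
sorted_names = sorted(sample_lines[2].split())

print(vowels_only, even_numbers, sorted_names)
```

Note that `sorted` compares strings by code point, so upper-case names sort before lower-case ones.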

In [22]:
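def read_txt():
    # YOUR CODE HERE
    with open('./data/simple_text.txt', 'r') as file:
        lines = file.readlines()
    # Row 1: keep only vowels (removes consonants and the newline).
    string = ''.join(char for char in lines[0] if char.lower() in 'aeiou')
    # Row 2: keep only even numbers, joined back into one string.
    numbers = ' '.join(num for num in lines[1].split() if int(num) % 2 == 0)
    # Row 3: split into names and sort alphabetically.
    names = sorted(lines[2].split())
    return string, numbers, names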

output = read_txt()

In [None]:
assert type(output[0]) == str
assert type(output[1]) == str
assert type(output[2]) == list


### Task 1.2-B: Writing Simple Text Files (5P)

Now that we have learned how to read files, we will learn how to write files to disk. For this purpose, please create the directory specified by the variable SAVE_DIR (see comment below). Save all files for this unit in this folder. You can verify that the directory has been created successfully either by using the os library to check whether directories/files exist or by opening a terminal (blue plus sign in the top right, then open a terminal) and using the Linux commands cd and ls to move through directories and inspect their content. With this in mind, solve the following task:
- Create a file './results/text_file.txt'.
- In the first row of this file save a space-separated number series from 1 to (including) {{PARAM_2}}.
- In the second row save the first {{PARAM_3}} prime numbers (also space-separated). Note that 1 is not a prime number.
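
Creating a directory, writing rows, and checking the result with the os library can be sketched as follows. The folder and file names here are hypothetical, and the example works in a temporary directory so that it does not touch `SAVE_DIR`:

```python
import os
import tempfile

# Work inside a throwaway directory so nothing in the project is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, "results_demo")  # hypothetical folder name
    os.makedirs(demo_dir, exist_ok=True)  # no error if it already exists

    demo_file = os.path.join(demo_dir, "demo.txt")  # hypothetical file name
    with open(demo_file, "w") as fh:
        fh.write("1 2 3\n")  # first row
        fh.write("2 3 5\n")  # second row

    # Check existence from Python instead of using cd and ls in a terminal.
    dir_exists = os.path.isdir(demo_dir)
    file_exists = os.path.isfile(demo_file)
    listing = os.listdir(demo_dir)

    with open(demo_file) as fh:
        rows = fh.read().splitlines()

print(dir_exists, file_exists, listing, rows)
```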

In [26]:
# IMPORTANT: Define the variable SAVE_DIR as:
SAVE_DIR = './results'
def is_prime(num):
    # Trial division: num is prime if no i in [2, num) divides it.
    if num <= 1:
        return False
    for i in range(2, num):
        if num % i == 0:
            return False
    return True


def write_txt():
    # YOUR CODE HERE
    os.makedirs(SAVE_DIR, exist_ok=True)
    with open(os.path.join(SAVE_DIR, 'text_file.txt'), 'w') as file:
        # First row: 1 to PARAM_2 (inclusive), space-separated.
        numbers_sequence = ' '.join(map(str, range(1, PARAM_2 + 1)))
        file.write(numbers_sequence + '\n')
        # Second row: the first PARAM_3 prime numbers, space-separated.
        prime_list = []
        num = 2
        while len(prime_list) < PARAM_3:
            if is_prime(num):
                prime_list.append(str(num))
            num += 1
        file.write(' '.join(prime_list))

write_txt()

In [None]:
assert os.path.isdir(SAVE_DIR)
assert os.path.isfile(os.path.join(SAVE_DIR, "text_file.txt"))


### Task 1.2-C: Pandas Dataframes (5P)

A very useful tool to process and convert different data formats is the library pandas. In the following, we will explore the pandas functionality using csv files as an example. Remember to store all files in the `SAVE_DIR` directory.
- Below you can see a dictionary. Convert the dictionary to a pandas dataframe and save the output to a file "complete_dataframe.csv". Use the argument `index=False` so that the row index is not stored.
- Save only the first two columns (name and birth_date) to a file "first_col.csv" without saving the indices.
- Load the data from "data/health_data.csv" into a pandas dataframe. Drop all rows which don't contain a value for 'resting_pulse' and replace all missing values for 'age' with the average value from all other patients. Return the resulting dataframe as `health_data`.
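
Dropping rows and filling missing values can be tried on a small dataframe first. The sketch below uses made-up records; only the column names `age` and `resting_pulse` are taken from the task:

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; column names mirror the task, values are made up.
toy = pd.DataFrame({
    "age": [30.0, np.nan, 50.0, 40.0],
    "resting_pulse": [60.0, 70.0, np.nan, 80.0],
})

# Drop rows that lack a resting pulse.
cleaned = toy.dropna(subset=["resting_pulse"]).copy()

# Replace missing ages with the mean of the remaining known ages.
# Assigning the result back avoids in-place edits on a column selection.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].mean())

print(cleaned)
```

Here the mean is computed after the rows without a pulse have been dropped. Whether the average should instead be taken over all patients in the file is a reading of the task that this sketch does not settle.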


In [31]:
SAVE_DIR = './results'
def process_dataframe(studs):
    # Convert the dictionary to a dataframe and write both csv files.
    studs_df = pd.DataFrame(studs)
    studs_df.to_csv(os.path.join(SAVE_DIR, "complete_dataframe.csv"), index=False)
    studs_df[["name", "birth_date"]].to_csv(os.path.join(SAVE_DIR, "first_col.csv"), index=False)
    return studs_df


studs_dict = {"name": ["Max", "Albert", "Marie", "Niels"],
         "birth_date": ["01-01-2001", "07-12-1998", "14-05-1999", "12-12-2000"],
         "id_number": [1234567, 2345678, 7654321, 9876543],
         "grade": [1.7, 1.0, 1.0, 4.0]}
studs_df = process_dataframe(studs_dict)

# Load the health data, drop rows without a pulse, and fill missing ages.
health_data = pd.read_csv("data/health_data.csv")
health_data = health_data.dropna(subset=['resting_pulse'])
mean_age = health_data['age'].mean()
# Assign the result back; inplace=True on a column selection may not modify the dataframe.
health_data['age'] = health_data['age'].fillna(mean_age)


In [None]:
assert os.path.isdir(SAVE_DIR)
assert os.path.isfile(os.path.join(SAVE_DIR, "complete_dataframe.csv"))
assert os.path.isfile(os.path.join(SAVE_DIR, "first_col.csv"))
assert type(health_data) == pd.core.frame.DataFrame


### Task 1.2-D: Handling Tables (3P)

In the following, we will explore a few other file handling operations that can be helpful in your future data science life. We start with tables generated by somebody still using Excel instead of Python. Complete the function below so that it uses pandas to load and return the table stored under the given filename.

In [34]:
def load_excel(filename):
    # YOUR CODE HERE
    table = pd.read_excel(filename)
    return table

table = load_excel('./data/health_data.xlsx')

In [None]:
assert type(table) == pd.core.frame.DataFrame


### Task 1.2-E: Handling Images (3P)

Images can be found in many different formats. Load the image './data/street_image.png' using the opencv library (cv2). Afterwards, save the image in jpeg format as "SAVE_DIR/test_image.jpeg".

In [32]:
# YOUR CODE HERE
image = cv2.imread('./data/street_image.png')
cv2.imwrite(os.path.join(SAVE_DIR,'test_image.jpeg'),image)

True

In [None]:
assert os.path.isfile(os.path.join(SAVE_DIR, "test_image.jpeg"))


### Task 1.2-F: Handling Dictionaries (3P)

For dictionaries we can use the json format for saving. It naturally preserves the dictionary structure. Create a dictionary that has, for every month, the first three letters of its English name (all lower case) as key and the number of the month as value. Save the dictionary in json format under "SAVE_DIR/months.json".
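
How json preserves the dictionary structure can be checked with a round trip. This is a minimal sketch on a hypothetical dictionary, using strings instead of a file:

```python
import json

# Hypothetical nested dictionary used only for illustration.
settings = {"name": "demo", "sizes": [1, 2, 3], "nested": {"flag": True}}

# dumps/loads operate on strings; dump/load do the same with file objects.
encoded = json.dumps(settings)
decoded = json.loads(encoded)

# JSON object keys are always strings, so integer keys change type.
int_keys = json.loads(json.dumps({1: "one"}))

print(decoded == settings, int_keys)
```

Nested lists and dictionaries survive the round trip, but non-string keys come back as strings.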

In [35]:
# YOUR CODE HERE
# Keys are the first three letters of each month name, values the month number.
month_dict = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12
}
with open(os.path.join(SAVE_DIR, 'months.json'), 'w') as json_file:
    json.dump(month_dict, json_file)
    

In [None]:
assert os.path.isfile(os.path.join(SAVE_DIR, "months.json"))


### Task 1.2-G: Handling Numpy Files (3P)

While arrays can often be saved in image format to save disk space, this comes at the cost of losing precision (e.g., most image formats convert the numbers to unsigned integers). Therefore, when working with numpy, you can use its built-in saving and loading functions to avoid any loss in precision. Set the random seed from numpy to {{PARAM_4}} before creating `in_array` and afterwards save `in_array` under "SAVE_DIR/random.npy".
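
The difference between the two storage routes can be illustrated as follows. The sketch uses a hypothetical seed and array shape, and it mimics an 8-bit image by casting to unsigned integers rather than writing an actual image file:

```python
import os
import tempfile
import numpy as np

rng_seed = 0  # hypothetical seed, not the assignment's parameter
np.random.seed(rng_seed)
values = np.random.uniform(size=(3, 4))  # float64 values in [0, 1)

# Round trip through the .npy format in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.npy")
    np.save(path, values)
    restored = np.load(path)

# Mimic an 8-bit image: scale to 0..255 and truncate to unsigned integers.
as_uint8 = (values * 255).astype(np.uint8)
back_to_float = as_uint8 / 255.0

exact = np.array_equal(values, restored)  # .npy keeps every bit
lossy_error = np.abs(values - back_to_float).max()  # error of the 8-bit route
print(exact, lossy_error)
```

The `.npy` round trip returns an identical array, whereas the 8-bit route leaves an error of up to 1/255 per element.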

In [38]:
# set random seed here
# YOUR CODE HERE
np.random.seed(PARAM_4)

in_array = np.random.uniform(size=(34,56))


# YOUR CODE HERE
np.save(os.path.join(SAVE_DIR,'random.npy'),in_array)

In [None]:
assert os.path.isfile(os.path.join(SAVE_DIR, "random.npy"))


### Task 1.2-H: Handling General Python Objects (3P)

General Python objects can also be saved without any loss in precision using pickle. The library converts Python objects into a binary format and back. The binary format is compact and still preserves all information. Use the months dictionary and extend it by a key 'days', which points to a list containing the numbers 28, 30, and 31 (in that order). Save the resulting dictionary under "SAVE_DIR/ext_months.pickle".
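
The conversion to a binary format and back can be sketched with an in-memory round trip. The object below is hypothetical and mixes types that json would not return unchanged:

```python
import pickle

# Hypothetical object: a tuple and an integer key would both change type in json.
record = {"id": 7, "days": [28, 30, 31], "point": (1.5, 2.5), 3: "int key"}

# dumps/loads work on bytes; dump/load do the same with files opened in binary mode.
blob = pickle.dumps(record)
restored = pickle.loads(blob)

print(type(blob).__name__, restored == record, type(restored["point"]).__name__)
```

Because unpickling can execute arbitrary code, pickle files should only be loaded from trusted sources.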

In [39]:
# YOUR CODE HERE
# Months dictionary: first three letters of each month name as key.
month_dict = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12
}
month_dict['days'] = [28, 30, 31]
# Pickle needs the file opened in binary write mode.
with open(os.path.join(SAVE_DIR, 'ext_months.pickle'), 'wb') as pickle_file:
    pickle.dump(month_dict, pickle_file)

In [None]:
assert os.path.isfile(os.path.join(SAVE_DIR, "ext_months.pickle"))
