# Assignment 1
The goal of this assignment is to make you familiar with Python's syntax, data types, and file I/O. This assignment is divided into twelve exercises.

We assume that the folder that you work in has the following structure.
<code>
assignment01.ipynb
hue_upload.csv
hue_upload2.csv
data/dist_matrix.txt
data/draughts.txt
data/draughts2.txt
data/textfile1.txt
data/textfile2.txt
</code>

The files `hue_upload.csv` and `hue_upload2.csv` have the same format but were collected over different time periods. Each line contains a line number, a user id, an event string, and a data field, separated by semicolons (;) and enclosed by double quotes (").
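
To make the record format concrete, here is a minimal sketch that splits one such line into its four fields using the standard `csv` module. The helper name `parse_hue_line` is hypothetical; the sample line is taken from the notebook output shown for Exercise 1.

```python
import csv
import io


def parse_hue_line(line):
    """Split one semicolon-separated, double-quoted record into its fields."""
    reader = csv.reader(io.StringIO(line), delimiter=";", quotechar='"')
    return next(reader)


# Sample record copied from the Exercise 1 output in this notebook.
sample = '2;"10";"0010_31_mei_2015_bedtime_tonight";"2300"'
print(parse_hue_line(sample))
# ['2', '10', '0010_31_mei_2015_bedtime_tonight', '2300']
```

The quotes are stripped by the reader, so the fields come back as plain strings.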

In [2]:
import pandas as pd
import scipy
import re

pd.options.display.max_rows = 20

## Exercise 1 (5 points)
Write a Python function that combines several files into a single file. The function should take as input a list of filenames and the name of the output file.
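
As one possible pandas-free approach, the sketch below copies the lines of each input file into the output file in order. The function name `concatenate_text_files` and the throwaway file names in the self-check are hypothetical, and UTF-8 encoding is an assumption.

```python
import os
import tempfile


def concatenate_text_files(input_files, output_file):
    """Write the lines of every input file, in order, to a single output file."""
    with open(output_file, "w", encoding="utf-8") as dst:
        for name in input_files:
            with open(name, encoding="utf-8") as src:
                for line in src:
                    # Guard against a missing final newline so records never merge.
                    dst.write(line if line.endswith("\n") else line + "\n")


# Small self-check on throwaway files (the contents are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    paths = [os.path.join(tmp, name) for name in ("a.csv", "b.csv")]
    for path, text in zip(paths, ('1;"10";"x";"0"\n', '2;"10";"y";"1"')):
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
    merged_path = os.path.join(tmp, "merged.csv")
    concatenate_text_files(paths, merged_path)
    with open(merged_path, encoding="utf-8") as handle:
        merged = handle.read()

print(merged)
```

Unlike a data-frame round trip, this keeps every line byte-for-byte as it was and adds no header row.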

In [2]:
# Reads in column names by specifying separator - thus outputs multiple columns

def separator_concatenate_files(input_files, output_file):
# YOUR CODE HERE
    dataframes = []
    colnames = ['line_number', 'user_id', 'event_string', 'data_field']
    for filename in input_files:
        dataframes.append(pd.read_csv(filename, sep = ';', names = colnames, header = None, encoding ='utf_8'))
    # Concatenate once, after every file has been read
    combined_df = pd.concat(dataframes)
    return combined_df.to_csv(output_file, index = False)

# YOUR CODE ENDS HERE

In [3]:
# Reads in data as is, without separator - thus outputs 1 column

def concatenate_files(input_files, output_file):
# YOUR CODE HERE
    dataframes = []
    for filename in input_files:
        dataframes.append(pd.read_csv(filename, header = None, encoding ='utf_8'))
    # Concatenate once, after every file has been read
    combined_df = pd.concat(dataframes)
    return combined_df.to_csv(output_file, index = False)

# YOUR CODE ENDS HERE

In [5]:
filenames = ['hue_upload.csv', 'hue_upload2.csv']
concatenate_files(filenames, 'hue_combined.csv')


In [6]:
# TEST OUTPUT
new_output = pd.read_csv('hue_combined.csv')

In [7]:
# READ HEAD OF TEST OUTPUT
print(new_output.head())

                                                   0
0  1;"10";"lamp_change_29_mei_2015_19_08_33_984";...
1   2;"10";"0010_31_mei_2015_bedtime_tonight";"2300"
2             3;"10";"0010_31_mei_2015_fitness";"52"
3                 4;"10";"morning_backup_minute";"0"
4  5;"10";"lamp_change_29_mei_2015_19_08_33_942";...


## Exercise 2 (5 points) 
Write a Python function that reads a file, removes all double quote (") characters and the first field of each row, and stores the result in a new file. The function should take as input the name of the input file and the name of the output file.

In [8]:
def clean_file(input_file, output_file):
# YOUR CODE HERE   
    df = pd.read_csv(input_file)
    df = df['0'].str.replace('"', "").str.split(";", expand = True)
    df.drop(df.columns[0],axis = 1, inplace = True)
    return df.to_csv(output_file, index = False)

# YOUR CODE ENDS HERE

In [9]:
clean_file('hue_combined.csv', 'hue_combined_cleaned.csv')


In [10]:
# TEST OUTPUT
new_output = pd.read_csv('hue_combined_cleaned.csv')

In [11]:
# READ HEAD OF TEST OUTPUT
print(new_output.head())

    1                                     2     3
0  10  lamp_change_29_mei_2015_19_08_33_984   OFF
1  10      0010_31_mei_2015_bedtime_tonight  2300
2  10              0010_31_mei_2015_fitness    52
3  10                 morning_backup_minute     0
4  10  lamp_change_29_mei_2015_19_08_33_942   OFF


## Exercise 3 (5 points) 
Write a Python function that removes the duplicate lines from a file and stores the output in a new file. The function should take as input the name of the input file and the name of the output file.
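
A line-based way to do this without pandas is sketched below. It assumes that the first occurrence of each line should be kept and that the original order matters; the names `unique_lines` and `drop_duplicate_lines` are hypothetical.

```python
def unique_lines(lines):
    """Return the lines with later duplicates removed, keeping first-seen order."""
    # A dict preserves insertion order (Python 3.7+), so its keys are the unique lines.
    return list(dict.fromkeys(lines))


def drop_duplicate_lines(input_file, output_file):
    """Copy input_file to output_file, skipping lines that were already written."""
    with open(input_file, encoding="utf-8") as src:
        lines = src.read().splitlines()
    with open(output_file, "w", encoding="utf-8") as dst:
        dst.writelines(line + "\n" for line in unique_lines(lines))


print(unique_lines(["a;1", "b;2", "a;1", "c;3", "b;2"]))
# ['a;1', 'b;2', 'c;3']
```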

In [12]:
def drop_duplicates_in_file(input_file, output_file):
# YOUR CODE HERE
    df = pd.read_csv(input_file)
    dropped = df.drop_duplicates()
    df = dropped["1"].astype(str) +";"+ dropped["2"] +";"+ dropped["3"]
    return df.to_csv(output_file, header = None, index = False)
# YOUR CODE ENDS HERE


In [13]:
drop_duplicates_in_file('hue_combined_cleaned.csv', 'hue.csv')


In [14]:
# TEST OUTPUT
new_output = pd.read_csv('hue.csv')
new_output.shape

(8332, 1)

In [15]:
# READ HEAD OF TEST OUTPUT
print(new_output.head())

   10;lamp_change_29_mei_2015_19_08_33_984;OFF
0     10;0010_31_mei_2015_bedtime_tonight;2300
1               10;0010_31_mei_2015_fitness;52
2                   10;morning_backup_minute;0
3  10;lamp_change_29_mei_2015_19_08_33_942;OFF
4                 10;is_wifi_switched_on;false


### Python Exercises 4-8
Exercises 4-8 take `hue.csv` as input. To the three columns of `hue.csv`, we assign the following column labels: `user_id`, `event_string`, and `data_field`, respectively. We read in the data and assign it to the variable `hue_data`.

In [16]:
hue_data = pd.read_csv('hue.csv', header=None, sep=';')
hue_data.columns = ['user_id', 'event_string', 'data_field']
hue_data.head()

   user_id                          event_string  data_field
0     10.0  lamp_change_29_mei_2015_19_08_33_984         OFF
1     10.0      0010_31_mei_2015_bedtime_tonight        2300
2     10.0              0010_31_mei_2015_fitness          52
3     10.0                 morning_backup_minute           0
4     10.0  lamp_change_29_mei_2015_19_08_33_942         OFF


## Exercise 4 (5 points) 
Calculate the number of lamp change events, and assign it to the variable `num_lamp_change`.

In [17]:
# YOUR CODE HERE
bool_series = hue_data["event_string"].str.startswith("lamp_change", na = False)
num_lamp_change = hue_data[bool_series]["event_string"].count()
# YOUR CODE ENDS HERE

In [18]:
print(num_lamp_change)


3406


## Exercise 5 (5 points) 
Return a single-column data frame with the unique values of adherence importance, and assign the result to the variable `df_adherence_importance`.

In [19]:
# YOUR CODE HERE
bool_series = hue_data["event_string"].str.contains("adherence_importance", na = False)
unique = hue_data[bool_series]['data_field'].unique()
df_adherence_importance = pd.DataFrame(unique, columns = ['adherence_importance'])
# YOUR CODE ENDS HERE

In [20]:
display(df_adherence_importance)


    adherence_importance
0                    100
1                     11
2                     78
3                     75
4                     34
..                   ...
87                    90
88                    96
89                    86
90                    19


## Exercise 6 (5 points) 
Return a single-column data frame with the number of data points (lines) for each user id. Assign the result to the variable `num_per_user`, and make sure that the `user_id` is used as index. 

In [21]:
# YOUR CODE HERE
# Count the rows per user; the resulting Series is indexed by user_id
grouped = hue_data['user_id'].groupby(hue_data['user_id']).count()
num_per_user = pd.DataFrame(grouped)
# YOUR CODE ENDS HERE

In [22]:
display(num_per_user)


         user_id
user_id
1.0          196
9.0          143
10.0         325
12.0         141
18.0         241
...          ...
70.0         317
6789.0        40
9996.0        56
9998.0        17


## Exercise 7 (10 points) 
Return a single-column data frame with all unique strings that come up in the column `event_string` of `hue_data`. Assign the result to the variable `unique_events`.

Note that a string consists of one or more words joined by an underscore, a word being one or more alphabetic characters: `lamp_change`, `rise_reason`, `mei`, but not `15_mei` or `2015_risetime`.
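
One way to read this definition is as a regular expression: a run of letters, optionally followed by further runs of letters that are each preceded by an underscore. The sketch below follows that reading. The names `EVENT_WORDS`, `extract_strings`, and `unique_strings` are hypothetical, "alphabetic" is assumed to mean ASCII letters, and missing values are assumed to be skipped.

```python
import re

# One or more alphabetic words joined by single underscores (ASCII letters assumed).
EVENT_WORDS = re.compile(r"[A-Za-z]+(?:_[A-Za-z]+)*")


def extract_strings(event_string):
    """Return every maximal run of underscore-joined alphabetic words."""
    return EVENT_WORDS.findall(event_string)


def unique_strings(event_strings):
    """Collect the distinct strings over all events, in first-seen order."""
    seen = {}
    for value in event_strings:
        if not isinstance(value, str):
            continue  # skip missing values such as NaN
        for match in extract_strings(value):
            seen.setdefault(match, None)
    return list(seen)


samples = [
    "lamp_change_29_mei_2015_19_08_33_984",
    "0010_31_mei_2015_bedtime_tonight",
]
print(unique_strings(samples))
# ['lamp_change', 'mei', 'bedtime_tonight']
```

Because the pattern is greedy, words that are adjacent in the original event string stay together, while digits act as separators.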

In [155]:
# NOT QUITE RIGHT - NEED TO KEEP STRINGS TOGETHER WHERE THEY ARE ORIGINALLY TOGETHER, I.E.: lamp change, rise reason etc.

# YOUR CODE HERE
event_string = hue_data["event_string"].astype(str)
event_list = []
for line in event_string:
    for word in line.split("_"): 
        if (word.isnumeric()) == False:
            new_string = word
            if new_string not in event_list:
                event_list.append(new_string)
print(event_list)
unique_events = pd.DataFrame(event_list,columns=['event_string'])
# YOUR CODE ENDS HERE

['lamp', 'change', 'mei', 'bedtime', 'tonight', 'fitness', 'morning', 'backup', 'minute', 'is', 'wifi', 'switched', 'on', '', 'adherence', 'importance', 'evening', 'hour', 'first', 'run', 'type', 'rise', 'reason', 'risetime', 'start', 'experiment', 'yesterday', 'target', 'url', 'nudge', 'time', 'juni', 'nan', 'augustus', 'subject', 'key', 'info', 'event', 'error', 'september', 'oktober']


In [156]:
display(unique_events)


    event_string
0           lamp
1         change
2            mei
3        bedtime
4        tonight
..           ...
36          info
37         event
38         error
39     september


## Exercise 8 (10 points) 
Return a single-column dataframe with the number of lamp change events for each relevant day. Use as index the substring of the `event_string` value, e.g., '01_augustus_2015'. Assign the result to the variable `num_lamp_changes_per_day`.
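
The day substring can also be isolated with a single capture group, sketched here with the standard library only. The pattern assumes that the day always directly follows the `lamp_change_` prefix and consists of a two-digit day, a lowercase month name, and a four-digit year; `DAY_PATTERN` and `count_lamp_changes_per_day` are hypothetical names, and the third sample event is made up for illustration.

```python
import re
from collections import Counter

# Assumed layout: lamp_change_<dd>_<month>_<yyyy>_<time parts>
DAY_PATTERN = re.compile(r"^lamp_change_(\d{2}_[a-z]+_\d{4})")


def count_lamp_changes_per_day(event_strings):
    """Count lamp change events per day substring, e.g. '29_mei_2015'."""
    counts = Counter()
    for value in event_strings:
        if not isinstance(value, str):
            continue  # skip missing values
        match = DAY_PATTERN.match(value)
        if match:
            counts[match.group(1)] += 1
    return counts


events = [
    "lamp_change_29_mei_2015_19_08_33_984",
    "lamp_change_29_mei_2015_19_08_33_942",
    "lamp_change_01_augustus_2015_10_00_00_000",  # made-up event for illustration
    "0010_31_mei_2015_fitness",
]
print(count_lamp_changes_per_day(events))
```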

In [157]:
# YOUR CODE HERE
bool_series = hue_data["event_string"].str.startswith("lamp_change", na = False)
lamp_changes = hue_data[bool_series]["event_string"].str.findall(r"\d+_\w+_\d{4}")
lamp_changes_flatten = lamp_changes.apply(pd.Series).stack().reset_index(drop=True)
df = pd.DataFrame(lamp_changes_flatten, columns = ['event_string'])
grouped = df['event_string'].groupby(df['event_string']).count()
num_lamp_changes_per_day = pd.DataFrame(grouped, columns = ['event_string'])
# YOUR CODE ENDS HERE

In [158]:
display(num_lamp_changes_per_day)


                   event_string
event_string
01_augustus_2015              8
01_juni_2015                291
02_augustus_2015             11
02_juni_2015                122
02_september_2015            22
...                         ...
29_augustus_2015              3
29_mei_2015                 120
30_augustus_2015              3
30_mei_2015                 116


### Python exercises 9-12
To get hands-on familiarity with Python, you are strongly advised to work through a Python tutorial (see Canvas module: Introduction to Python). Python has an excellent reference web site, with the documentation for all its language constructs and library APIs. Use this for all your Python work!

## Exercise 9 (10 points) 
A number is composed of digits. For example, 512 is composed of a 5, a 1, and a 2. Write a Python function that accepts keyboard input (stdin). If the user does not enter a number or if the number is smaller than 10, the script has to keep asking for input. When a number is detected, the script should compute and output the number of digits, the number of distinct digits, the largest sum of two consecutive digits (Dutch: opeenvolgende cijfers), and the sum of its distinct prime factors. For the number 5112, the program should give the following output:
<code>
4
3
6
76
</code>
Your implementation should print the solution to the screen. The function does not have to return anything.
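
The four quantities can be computed separately from the input loop, which makes them easy to check. The sketch below is one possible decomposition, not the reference solution; it assumes that "number" means a non-negative integer, and the helper names `distinct_prime_factors`, `number_stats`, and `read_number` are hypothetical.

```python
def distinct_prime_factors(n):
    """Return the set of distinct prime factors of n using trial division."""
    factors = set()
    divisor = 2
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors.add(divisor)
            n //= divisor
        divisor += 1
    if n > 1:
        factors.add(n)  # whatever remains is itself prime
    return factors


def number_stats(n):
    """Return (digit count, distinct digits, max adjacent digit sum, prime factor sum)."""
    digits = [int(ch) for ch in str(n)]
    return (
        len(digits),
        len(set(digits)),
        max(a + b for a, b in zip(digits, digits[1:])),
        sum(distinct_prime_factors(n)),
    )


def read_number():
    """Keep asking until the user enters an integer of at least 10."""
    while True:
        text = input("Enter an integer of at least 10: ").strip()
        if text.isdecimal() and int(text) >= 10:
            return int(text)


print(*number_stats(5112), sep="\n")
# 4
# 3
# 6
# 76
```

For 5112 the prime factorisation is 2^3 · 3^2 · 71, so the distinct prime factors sum to 2 + 3 + 71 = 76.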

In [None]:
# YOUR CODE HERE
def exercise09():

# YOUR CODE ENDS HERE

In [None]:
exercise09()

## Exercise 10 (10 points) 
Write a function that reads two text files, each of which has one word per line in lowercase, checks which words occur in the first file but not in the second file, and writes those words to a third file in alphabetical order, separated by a newline character. You may assume that both text files comfortably fit in memory, and that the files do not contain duplicates.

Your implementation should output the solution to a file. The function does not have to return anything or print anything to the screen.
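
Since both files fit in memory and contain no duplicates, a set difference followed by a sort is one natural fit. The sketch below separates the set logic from the file handling; `only_in_first` and `write_difference` are hypothetical names, and UTF-8 encoding is an assumption.

```python
def only_in_first(first_words, second_words):
    """Return the words that occur in first_words but not in second_words, sorted."""
    return sorted(set(first_words) - set(second_words))


def read_words(filename):
    """Read one word per line, ignoring surrounding whitespace and blank lines."""
    with open(filename, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def write_difference(filename1, filename2, outfilename):
    """Write the words unique to the first file to outfilename, one per line."""
    words = only_in_first(read_words(filename1), read_words(filename2))
    with open(outfilename, "w", encoding="utf-8") as out:
        out.writelines(word + "\n" for word in words)


print(only_in_first(["pear", "apple", "fig"], ["fig", "kiwi"]))
# ['apple', 'pear']
```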

In [None]:
# YOUR CODE HERE
def exercise10(filename1,filename2,outfilename):

# YOUR CODE ENDS HERE

In [None]:
exercise10('data/textfile1.txt', 'data/textfile2.txt', 'data/textfile3.txt')


## Exercise 11 (15 points) 
Write a function that reads an integer distance matrix from a text file. An example is the following input file:
<code>
0 1 -
2 0 4
- 4 0
</code>
which contains, e.g., the information that the distance from point 0 to point 1 is 1, while the distance from point 1 to point 0 is 2. A dash means that no direct connection exists (which can be implemented as float('inf'), so that the connection will not get used). Implement Dijkstra's Algorithm to find the shortest path from point 0 to the last point (2 in this case). 

You may not use a package that features Dijkstra's Algorithm. For the matrix shown above, the shortest distance is 5. Your implementation should return the length of the shortest path, and assign it to the variable `minimum_distance`.
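
For orientation, here is a minimal sketch of Dijkstra's algorithm on such a matrix, using only the standard-library `heapq` module as a priority queue. It assumes non-negative integer distances and a plain `-` for missing connections; `parse_matrix` and `shortest_distance` are hypothetical names, and the parsing works on text rather than a file so the example stays self-contained.

```python
import heapq


def parse_matrix(text):
    """Turn the text of a distance matrix into rows; '-' becomes infinity."""
    return [
        [float("inf") if cell == "-" else int(cell) for cell in line.split()]
        for line in text.splitlines()
        if line.strip()
    ]


def shortest_distance(matrix, source=0, target=None):
    """Return the shortest distance from source to target (default: last point)."""
    if target is None:
        target = len(matrix) - 1
    dist = [float("inf")] * len(matrix)
    dist[source] = 0
    queue = [(0, source)]  # (distance so far, point)
    while queue:
        d, node = heapq.heappop(queue)
        if node == target:
            return d
        if d > dist[node]:
            continue  # stale entry: a shorter route to this point was already found
        for neighbour, weight in enumerate(matrix[node]):
            candidate = d + weight
            if candidate < dist[neighbour]:
                dist[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return dist[target]  # infinity if the target cannot be reached


example = "0 1 -\n2 0 4\n- 4 0\n"
print(shortest_distance(parse_matrix(example)))
# 5
```

Row `i` of the matrix lists the distances from point `i`, so the asymmetric entries (1 from point 0 to point 1, 2 in the other direction) are handled without extra work.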

In [None]:
# YOUR CODE HERE
def exercise11(filename):

# YOUR CODE ENDS HERE

In [None]:
minimum_distance = exercise11('data/dist_matrix.txt')
print(minimum_distance)


## Exercise 12 (15 points) 
International draughts (Dutch: dammen) is a board game played on a 10x10 field with a chessboard pattern. For this assignment, you have to read a file that contains the state of a game as it is being played, and provide a graphical representation. For this, only a small subset of the rules is relevant:
<ul>
    <li>There are two players, white and black.</li>
    <li>A 10x10 coordinate system is imposed on the board such that (1,1) is the lower-left square for the white player and (10,1) is the lower-right square for the white player.</li>
    <li>The board is oriented such that (1,1) is a dark square.</li>
    <li>Pieces can only be placed on dark squares, i.e., on (x,y) for which |x-y| = 0 (mod 2).</li>
    <li>There are two types of pieces: regular pieces and promoted pieces.</li>
</ul>

The starting position will be supplied in an ASCII text file. Its lines contain the positions of (crowned) pieces. A valid line contains a coordinate, followed by a tab, followed by the type (w, W, b or B for a white piece, white promoted piece, black piece, or black promoted piece, respectively). Invalid lines should be ignored, as should any characters after the type. An example is shown below:
<code>
w
(5,5) w the predator
(6,6) b the prey
(this is an example)
</code>

Write a function that takes the name and path of a text file as its argument, reads the text file, and prints a graphical representation of the board on the screen. The representation should use underscores and vertical bars as follows (notice that this example is a 3x3 board instead of a full board):

<code>
 ___ ___ ___
|   |   |   |
|   |   |   |
|___|___|___|
|   |   |   |
|   |   |   |
|___|___|___|
|   |   |   |
|   |   |   |
|___|___|___|
</code>

Note that each side consists of either three underscores or three vertical bars. The center of each square should contain the character w, W, b or B to indicate the piece (if any).

Your implementation should print the output to the screen. The function does not need to return anything. For this assignment, you need to verify whether |x-y| = 0 (mod 2), in other words, whether |x-y| is an even number. The template provides some invalid lines as examples: the line with coordinate (3,1) is invalid since it starts with a space, and the line with coordinate (3,5) is invalid since no tab separates the coordinate from the color.
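
The validation rules and the drawing format can be sketched as two small functions. This is an illustration under stated assumptions, not the reference solution: the exact line syntax `(x,y)<TAB>type` without spaces inside the parentheses, the rule that a later line for the same square overrides an earlier one, and the names `LINE_PATTERN`, `parse_pieces`, and `render_board` are all assumptions.

```python
import re

# Assumed syntax of a valid line: "(x,y)", a tab, then one of w, W, b, B.
LINE_PATTERN = re.compile(r"^\((\d+),(\d+)\)\t([wWbB])")


def parse_pieces(lines, size=10):
    """Return {(x, y): piece} for every valid line; invalid lines are ignored."""
    pieces = {}
    for line in lines:
        match = LINE_PATTERN.match(line)
        if not match:
            continue
        x, y, kind = int(match.group(1)), int(match.group(2)), match.group(3)
        on_board = 1 <= x <= size and 1 <= y <= size
        if on_board and abs(x - y) % 2 == 0:  # dark squares only
            pieces[(x, y)] = kind
    return pieces


def render_board(pieces, size=10):
    """Draw the board with (1,1) in the lower-left corner."""
    rows = [" " + " ".join(["___"] * size)]
    for y in range(size, 0, -1):
        rows.append("|" + "   |" * size)
        rows.append("|" + "".join(f" {pieces.get((x, y), ' ')} |" for x in range(1, size + 1)))
        rows.append("|" + "___|" * size)
    return "\n".join(rows)


lines = ["w", "(5,5)\tw the predator", "(6,6)\tb the prey", "(this is an example)"]
print(render_board(parse_pieces(lines)))
```

Anchoring the pattern at the start of the line rejects lines that begin with a space, and requiring a literal tab rejects lines where the coordinate and the type are separated by anything else.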


In [None]:
# YOUR CODE HERE
def exercise12(filename):

# YOUR CODE ENDS HERE

In [None]:
exercise12('data/draughts.txt')