# Assignment 2
You will continue to work with the hue data files supplied for Assignment 1. We assume that the folder that you work in has the following structure.
<code>
assignment02.ipynb
hue_upload.csv
hue_upload2.csv
</code>

The first four columns represent the `row id`, `user id`, `event id`, and `value`. Any extra columns are irrelevant. For example, the first row of one file reads:

`"1";"10";"lamp_change_29_mei_2015_19_08_33_984";"OFF"`

As you can see, the `event id` encompasses both a description of the event (`lamp_change`) and the date/time
(May 29, 7:08:33 pm). The following events are considered informative:

| String                 | Description                                               |
|------------------------|-----------------------------------------------------------|
| `lamp_change`          | Light control via app                                     |
| `nudge_time`           | Automatic light dim time for people in experimental group |
| `bedtime_tonight`      | Intended bedtime (self-reported)                          |
| `risetime`             | Rise time (self-reported)                                 |
| `rise_reason`          | Reason for rising (self-reported)                         |
| `adherence_importance` | Adherence (self-reported)                                 |
| `fitness`              | Fitness (self-reported)                                   |

All self-reported values are entered around noon. Records with other events may be ignored.
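As a minimal sketch of how an `event id` could be split into its event name and timestamp: the helper `parse_event_id`, the two regular expressions, and the `DUTCH_MONTHS` lookup below are hypothetical names, and the two id layouts are inferred only from the sample rows shown in this notebook, so other layouts may exist in the files. The month lookup avoids depending on the system locale.

```python
import datetime
import re

# Dutch month names as they appear in the event ids (assumption: lower case, written in full).
DUTCH_MONTHS = {
    "januari": 1, "februari": 2, "maart": 3, "april": 4,
    "mei": 5, "juni": 6, "juli": 7, "augustus": 8,
    "september": 9, "oktober": 10, "november": 11, "december": 12,
}

# Timestamped layout, e.g. lamp_change_29_mei_2015_19_08_33_984
TIMESTAMPED = re.compile(
    r"^(?P<name>[a-z_]+?)_(?P<day>\d{1,2})_(?P<month>[a-z]+)_(?P<year>\d{4})"
    r"_(?P<h>\d{2})_(?P<m>\d{2})_(?P<s>\d{2})_(?P<ms>\d{3})$"
)
# Date-only layout, e.g. 0010_31_mei_2015_bedtime_tonight
DATE_ONLY = re.compile(
    r"^\d+_(?P<day>\d{1,2})_(?P<month>[a-z]+)_(?P<year>\d{4})_(?P<name>[a-z_]+)$"
)


def parse_event_id(event_id):
    """Return (event_name, datetime.datetime), or None if no known layout matches."""
    for pattern in (TIMESTAMPED, DATE_ONLY):
        m = pattern.match(event_id)
        if m is None or m["month"] not in DUTCH_MONTHS:
            continue
        parts = m.groupdict()
        # The date-only layout has no time groups, so those default to 0.
        return m["name"], datetime.datetime(
            int(parts["year"]), DUTCH_MONTHS[parts["month"]], int(parts["day"]),
            int(parts.get("h") or 0), int(parts.get("m") or 0),
            int(parts.get("s") or 0), int(parts.get("ms") or 0) * 1000,
        )
    return None


print(parse_event_id("lamp_change_29_mei_2015_19_08_33_984"))
# ('lamp_change', datetime.datetime(2015, 5, 29, 19, 8, 33, 984000))
```

Ids that match neither layout (for example `morning_backup_minute`) return `None`, which fits the rule that records with other events may be ignored.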

In [None]:
import datetime
import pandas as pd
import numpy as np
import pymongo
import re 

pd.options.display.max_rows = 20

In [2]:
# My locale was set to English, so the Dutch month names in the dates were not understood
import locale
locale.setlocale(locale.LC_TIME, 'nl_NL.utf8')

'nl_NL.utf8'

## Exercise 1 (70 points)
The first part of this assignment is to write a Python function `read_csv_data` that reads the data into
a Pandas DataFrame. The index should be a (date, user) tuple, where date is stored in datetime.datetime format
(see the document with the "Tips"). The columns of your Pandas DataFrame should be `bedtime`, `intended_bedtime`,
`rise_time`, `rise_reason`, `fitness`, `adherence_importance` and `in_experimental_group`. Note: it is important to stick to this nomenclature. The way to do this is by going through the CSV data line by line, and parsing each line
individually, following these requirements:

<ul>
<li>
`bedtime` should be inferred from the `lamp_change` event. There are multiple reasonable ways to accomplish this. In this assignment, we define the bedtime as the last OFF state of the `lamp_change` in the interval between 7 pm of the current day and 6 am of the next day. For example, from the row printed above you may
infer that the person did not sleep before 7:08:33 pm. As you go through the lines in the csv file, whenever
you discover new relevant information, you either update an existing record in the dataframe (if a record
for that day and user exists), or you create a new record (see the document with the "Tips").
<br><br>
For example, if you encounter a line where user 10 turns the light on at 9 pm and another line where he
turns it off at 10 pm (still on May 29), you update the record above to change the bedtime to 10 pm. If
someone falls asleep past midnight, the bedtime should be stored in the record corresponding to the day before. Again, dates and times should be stored as datetime.datetime.
<br><br></li>
<li>`intended_bedtime` should be filled in based on the `bedtime_tonight` event. Note that 1030 probably means 10:30 in the evening. Again, dates and times should be stored as datetime.datetime.
<br><br></li>
<li>
`rise_time`. The value for the column in your solution should be obtained from the `risetime` event in the CSV file.
<br><br></li>
<li>
`rise_reason`, `fitness` and `adherence_importance` values should be copied from the CSV file. Note that if
multiple distinct values are entered, the last should be assumed to be correct.
<br><br></li>
<li>
`in_experimental_group` should be boolean (True/False). The default value is False, but should be changed
to True if a `nudge_time` event is encountered. If a user is in the experimental group on one day, he is on
all days.
</li>
</ul>    
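The bedtime and intended-bedtime rules above can be sketched with plain `datetime` arithmetic. This is only one possible reading of the requirements: the helper names (`bedtime_day`, `update_bedtime`, `parse_intended_bedtime`) and the dict of records keyed by `(day, user)` are hypothetical, the window is taken as 19:00:00 up to but not including 06:00:00, "last OFF" is taken to mean the latest timestamp, and self-reported hours from 6 to 11 are assumed to mean the evening.

```python
import datetime


def bedtime_day(ts):
    """Return midnight of the day a lamp_change timestamp belongs to, or None.

    Assumption: the window runs from 19:00:00 up to, but not including, 06:00:00.
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if ts.hour >= 19:
        return midnight
    if ts.hour < 6:
        # Past midnight: the bedtime is stored with the day before.
        return midnight - datetime.timedelta(days=1)
    return None


def update_bedtime(records, user, ts, value):
    """Create or update the (day, user) record, keeping the latest OFF event."""
    day = bedtime_day(ts)
    if value != "OFF" or day is None:
        return
    record = records.setdefault((day, user), {})
    if "bedtime" not in record or ts > record["bedtime"]:
        record["bedtime"] = ts


def parse_intended_bedtime(day, value):
    """Turn a value such as '2300' or '1030' into a datetime for the night of `day`.

    Assumptions: hours 6..11 mean the evening (1030 becomes 22:30), and
    hours 0..5 mean after midnight, so on the next calendar day.
    """
    hour, minute = divmod(int(value), 100)
    if 6 <= hour < 12:
        hour += 12
    ts = day + datetime.timedelta(hours=hour, minutes=minute)
    if hour < 6:
        ts += datetime.timedelta(days=1)
    return ts


records = {}
may29 = datetime.datetime(2015, 5, 29)
update_bedtime(records, 10, datetime.datetime(2015, 5, 29, 19, 8, 33), "OFF")
update_bedtime(records, 10, datetime.datetime(2015, 5, 29, 21, 0, 0), "ON")
update_bedtime(records, 10, datetime.datetime(2015, 5, 29, 22, 0, 0), "OFF")
print(records[(may29, 10)]["bedtime"])  # 2015-05-29 22:00:00
```

Keeping the records in a dict keyed by `(day, user)` makes the "update an existing record or create a new one" step a single `setdefault` call; the dict can be converted to a DataFrame once all lines have been processed.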

In [None]:
def read_csv_data(filenames):
# YOUR CODE HERE

# YOUR CODE ENDS HERE

In [3]:
def read_csv_data(filenames):
# YOUR CODE HERE
    colnames = ['row_id', 'user_id', 'event_id', 'value']
    # Read every file into its own DataFrame
    dataframes = [
        pd.read_csv(filename, sep=';', names=colnames, header=None, encoding='utf_8')
        for filename in filenames
    ]
    # Concatenate once, after all files have been read, instead of on every iteration
    combined_df = pd.concat(dataframes)
    return combined_df
# YOUR CODE ENDS HERE

In [4]:
df = read_csv_data(['hue_upload.csv', 'hue_upload2.csv'])
df

Unnamed: 0,row_id,user_id,event_id,value
0,1,10,lamp_change_29_mei_2015_19_08_33_984,OFF
1,2,10,0010_31_mei_2015_bedtime_tonight,2300
2,3,10,0010_31_mei_2015_fitness,52
3,4,10,morning_backup_minute,0
4,5,10,lamp_change_29_mei_2015_19_08_33_942,OFF
...,...,...,...,...
1134006,1134007,63,error_event_15_september_2015_20_33_31_287,1_lamps_found
1134007,1134008,63,error_event_16_september_2015_21_15_35_568,0_lamps_found
1134008,1134009,63,start_experiment,2015-09-06T00:00:00.000+02:00
1134009,1134010,63,lamp_change_09_september_2015_23_39_52_045,OFF


In [5]:
# Pulling out 'Date' from the Event_ID column
# regex=True is passed explicitly because newer pandas versions treat the pattern literally by default
df["date_event_id"] = df["event_id"].str.findall(r"_\d+_\w+_\d{4}").astype(str).str.replace(r"\[|\]|'", '', regex=True)
df["date_event_id"] = df["date_event_id"].str.replace('_', ' ').str.strip()
df.replace(r'^\s*$', np.nan, regex=True, inplace=True)  # replacing whitespace with NaN
df

Unnamed: 0,row_id,user_id,event_id,value,date_event_id
0,1,10,lamp_change_29_mei_2015_19_08_33_984,OFF,29 mei 2015
1,2,10,0010_31_mei_2015_bedtime_tonight,2300,31 mei 2015
2,3,10,0010_31_mei_2015_fitness,52,31 mei 2015
3,4,10,morning_backup_minute,0,
4,5,10,lamp_change_29_mei_2015_19_08_33_942,OFF,29 mei 2015
...,...,...,...,...,...
1134006,1134007,63,error_event_15_september_2015_20_33_31_287,1_lamps_found,15 september 2015
1134007,1134008,63,error_event_16_september_2015_21_15_35_568,0_lamps_found,16 september 2015
1134008,1134009,63,start_experiment,2015-09-06T00:00:00.000+02:00,
1134009,1134010,63,lamp_change_09_september_2015_23_39_52_045,OFF,09 september 2015


In [6]:
# Pulling out 'Time' from the Event_ID column

# regex=True is passed explicitly because newer pandas versions treat the pattern literally by default
df["time_event_id"] = df["event_id"].str.findall(r"_\d{2}_\d{2}_\d{2}").astype(str).str.replace(r"\[|\]|'", '', regex=True)
df["time_event_id"] = df["time_event_id"].str.replace('_', ' ').str.strip()
df["time_event_id"] = df["time_event_id"].str.replace(' ', ':')
df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
# Where there is no time, replace with 00:00:00 --- NOT SURE IF THIS CREATES ANY PROBLEMS
# Assign the result back; fillna(inplace=True) on a selected column may not update df
df["time_event_id"] = df["time_event_id"].fillna("00:00:00")
df

Unnamed: 0,row_id,user_id,event_id,value,date_event_id,time_event_id
0,1,10,lamp_change_29_mei_2015_19_08_33_984,OFF,29 mei 2015,19:08:33
1,2,10,0010_31_mei_2015_bedtime_tonight,2300,31 mei 2015,00:00:00
2,3,10,0010_31_mei_2015_fitness,52,31 mei 2015,00:00:00
3,4,10,morning_backup_minute,0,,00:00:00
4,5,10,lamp_change_29_mei_2015_19_08_33_942,OFF,29 mei 2015,19:08:33
...,...,...,...,...,...,...
1134006,1134007,63,error_event_15_september_2015_20_33_31_287,1_lamps_found,15 september 2015,20:33:31
1134007,1134008,63,error_event_16_september_2015_21_15_35_568,0_lamps_found,16 september 2015,21:15:35
1134008,1134009,63,start_experiment,2015-09-06T00:00:00.000+02:00,,00:00:00
1134009,1134010,63,lamp_change_09_september_2015_23_39_52_045,OFF,09 september 2015,23:39:52


In [7]:
# Combining date and time into one column

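df['date_time'] = (df['date_event_id'] + " " + df['time_event_id']).astype(str)
# Exact match on this column only; a regex replace on the whole frame would also blank any cell containing "nan"
df['date_time'] = df['date_time'].replace('nan', np.nan)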
df.dtypes


row_id            int64
user_id           int64
event_id         object
value            object
date_event_id    object
time_event_id    object
date_time        object
dtype: object

In [8]:
# Creating the date column using datetime.datetime as instructed in the Hints

def date_convert(date_to_convert):
    return datetime.datetime.strptime(date_to_convert, '%d %B %Y %H:%M:%S')

# pd.notna is a more robust missing-value test than an identity check against np.nan
df['date'] = df['date_time'].apply(lambda s: date_convert(s) if pd.notna(s) else pd.NaT)

# SHOULD I BE EXPECTING A DATETIME.DATETIME TUPLE OUTPUT HERE? 
# NOT GETTING THAT... 


In [None]:
df.dtypes


In [None]:
# check the index
print(df.index)


In [None]:
# check the bedtime column
display(df[['bedtime']])


In [None]:
# check the intended_bedtime column
display(df[['intended_bedtime']])


In [None]:
# check the rise_time
display(df[['rise_time']])


In [None]:
# check the rise_reason, fitness, adherence_importance column
display(df[['rise_reason', 'fitness', 'adherence_importance']])


In [None]:
# check the in_experimental_group column
display(df[['in_experimental_group']])


## Exercise 2 (10 + 20 points)
The second part of this assignment is to store the contents of the DataFrame into MongoDB, and to write a function that retrieves data from MongoDB and outputs it in a user-friendly format.

<ol>
<li>
The data should be stored in the collection "sleepdata" in the database "BigData". Make sure to use
the same column names as specified for the DataFrame, and to define the correct primary key. See
the document with the "Tips" for some comments about the primary key. Add the extra columns "date",
"user", and "sleep_duration" to facilitate sorting the data if necessary. Here, "sleep_duration" is the difference between the risetime and the bedtime.
<br><br></li>
<li>The following is an example of how the output must be presented.
    
| date | user | bedtime | intended | risetime | reason | fitness | adh | in_exp | sleep_duration |
|------|------|---------|----------|----------|--------|---------|-----|--------|----------------|
| 11-06-2015 | 2  | 00:51:28 | 22:30:00 | 07:00:00 | ja  | -    | 47.0 | no  | 22351 |
| 11-06-2015 | 20 | 00:28:10 | 23:00:00 | 07:10:00 | nee | 55.0 | 88.0 | yes | 33510 |
| 11-06-2015 | 34 | 19:54:10 | -        | -        | -   | -    | -    | yes | -     |

Here sleep duration is in number of seconds. Note that, in order to determine the sleep duration of day X, it
is necessary to know the risetime of day X, but the bedtime of day X - 1.
<br><br></li>
</ol>
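The sleep-duration rule (risetime of day X minus bedtime of day X - 1) and the extra fields could be prepared before anything is written to MongoDB, roughly as follows. This is a sketch under assumptions: the records are held in a dict keyed by `(day, user)`, the helper names are hypothetical, and the composite `_id` built from date and user is only one possible primary key, since the "Tips" document is not reproduced here.

```python
import datetime


def sleep_duration_seconds(records, day, user):
    """Seconds between the bedtime of day X - 1 and the rise time of day X, or None."""
    previous_day = day - datetime.timedelta(days=1)
    rise = records.get((day, user), {}).get("rise_time")
    bed = records.get((previous_day, user), {}).get("bedtime")
    if rise is None or bed is None:
        return None
    return int((rise - bed).total_seconds())


def to_document(records, day, user):
    """Build one MongoDB document for (day, user), including the extra fields."""
    doc = dict(records[(day, user)])
    # Assumption: a composite _id of date and user serves as the primary key.
    # int() turns NumPy integers into plain Python ints, which BSON can encode.
    doc["_id"] = {"date": day, "user": int(user)}
    doc["date"] = day
    doc["user"] = int(user)
    doc["sleep_duration"] = sleep_duration_seconds(records, day, user)
    return doc


june10 = datetime.datetime(2015, 6, 10)
june11 = datetime.datetime(2015, 6, 11)
records = {
    (june10, 2): {"bedtime": datetime.datetime(2015, 6, 10, 23, 0, 0)},
    (june11, 2): {"rise_time": datetime.datetime(2015, 6, 11, 7, 0, 0)},
}
doc = to_document(records, june11, 2)
print(doc["sleep_duration"])  # 28800
```

A list of such documents could then be passed to `sleepdata.insert_many(...)` inside `to_mongodb`.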

In [None]:
def to_mongodb(df):
# YOUR CODE HERE

    # connect to MongoDB database
    client = pymongo.MongoClient("localhost", 27017)
    db = client.BigData
    sleepdata = db.sleepdata
    
    sleepdata.delete_many({})
    
# YOUR CODE ENDS HERE

In [None]:
to_mongodb(df)


In [None]:
def read_mongodb(filter,sort):
# YOUR CODE HERE

    # connect to MongoDB database
    connection = pymongo.MongoClient("localhost", 27017)
    db = connection.BigData
    sleepdata = db.sleepdata

# YOUR CODE ENDS HERE

In [None]:
query = read_mongodb({'sleep_duration': {'$gt': 40000}}, '_id')
print(query)
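For the user-friendly output, one option is to format each retrieved document as a row in the layout of the example table. The sketch below is hedged: `is_missing`, `fmt`, and `format_row` are hypothetical helpers, the field names follow the DataFrame columns from Exercise 1, and missing values are assumed to be stored as `None`, `NaN`, or `NaT`.

```python
import datetime


def is_missing(value):
    """True for None and for NaN/NaT-like values (which are not equal to themselves)."""
    return value is None or value != value


def fmt(value, pattern=None):
    """Render a cell: '-' when missing, strftime for dates and times, str otherwise."""
    if is_missing(value):
        return "-"
    if pattern is not None:
        return value.strftime(pattern)
    return str(value)


def format_row(doc):
    """Format one document as a table row in the column order of the example."""
    cells = [
        fmt(doc.get("date"), "%d-%m-%Y"),
        fmt(doc.get("user")),
        fmt(doc.get("bedtime"), "%H:%M:%S"),
        fmt(doc.get("intended_bedtime"), "%H:%M:%S"),
        fmt(doc.get("rise_time"), "%H:%M:%S"),
        fmt(doc.get("rise_reason")),
        fmt(doc.get("fitness")),
        fmt(doc.get("adherence_importance")),
        "yes" if doc.get("in_experimental_group") else "no",
        fmt(doc.get("sleep_duration")),
    ]
    return "| " + " | ".join(cells) + " |"


example = {
    "date": datetime.datetime(2015, 6, 11),
    "user": 34,
    "bedtime": datetime.datetime(2015, 6, 11, 19, 54, 10),
    "in_experimental_group": True,
}
print(format_row(example))
# | 11-06-2015 | 34 | 19:54:10 | - | - | - | - | - | yes | - |
```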
