# Assignment 3

## Part 1

The goal of this part of the assignment is to give you practice with basic data exploration and hypothesis testing in Python. You will work with data from the “HUE bedtime procrastination study”. A cleaned version of the data is available on Canvas (`hue_week_3.csv`), together with a second file that contains data from the post-study questionnaire that participants filled out at the end of the study (`hue_questionnaire.csv`). The questionnaire file contains the following information:

| Column | Description |
|--------|-------------|
| `gender` | 1 = male, 2 = female |
| `age` | Numeric age value |
| `chronotype` | Single item (7-point scale), do you consider yourself more of a <br> morning (1) or an evening (7) person? |
| `bp_scale` | Dutch version of the Bedtime Procrastination Scale. <br> Questions pertaining to personality traits related to procrastination. |
| `motivation` | Single item (7-point scale), how motivated were you to go to bed on <br> time each night? (1 = not motivated, 7 = very motivated) |
| `daytime_sleepiness` | Dutch translation of the Epworth Sleepiness Scale <br> (4-point scale from 0-3; 8 questions, values summed) |
| `self_reported_effectiveness` | Single item (7-point scale), <br> do you feel more rested since the intervention? |

In this part of the assignment, you will use Python to examine the post-questionnaire data in relation to the HUE data file, visualize trends and relationships, look for correlations between factors, test for significant differences between groups and build a regression model to predict bedtime delay. In order to perform the analyses, a number of transformations on the data still need to be done.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
import datetime
import sys
from io import StringIO


pd.options.display.max_rows = 20

## Exercise 1 (20 points)
Implement the following steps in Python:

<ul>
<li>
    Read the HUE data file and the questionnaire data file into two separate pandas DataFrames.
<br></li>
<li>
    Create a new DataFrame that contains the following Series:
    
| Column | Description |
|-----------------------|--------------------------------------------|
| `ID` | Participant ID |
| `group` | Participant group (1 for experimental, 0 for control) |
| `delay_nights` | The number of nights the participant delayed their bedtime (range: 0-12) |
| `delay_time` | The mean time in seconds a participant delayed their bedtime <br> (total delay in seconds, divided by the number of observations <br> measured for each individual, rounded to nearest second) |
| `sleep_time` | The mean sleep time in seconds of a participant (rounded to nearest second) |
    
    
<br></li>
<li>
    Set the index of this new DataFrame to `ID`. Note that there should only be a single row per participant ID.    
<br></li>
<li>
    Fill this new DataFrame by transforming data from the DataFrame about participants' bedtimes (from the HUE data file).
<br></li>
<li>
    Merge this new DataFrame with the post-questionnaire data and store the resulting DataFrame in a new variable. Perform the merge in such a way that the resulting DataFrame only contains IDs that are present in both datasets.
<br></li>
<li>
    Remove the rows that have NaN values in this merged DataFrame.
</li>
</ul> 
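
The steps above can be sketched on toy data. This is only an illustration under stated assumptions: the column names `delay_s` and `sleep_s`, the toy values, and the rule that a night counts as delayed when the delay is positive are all hypothetical and are not taken from `hue_week_3.csv`.

```python
import pandas as pd

# Toy stand-in for the HUE file: one row per participant per night.
# Column names and values are hypothetical.
nights = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3],
    "group": [1, 1, 0, 0, 1],
    "delay_s": [600.0, -120.0, 0.0, 1800.0, 300.0],
    "sleep_s": [27000.0, 28800.0, 25200.0, None, 30000.0],
})
# Toy stand-in for the questionnaire file.
survey = pd.DataFrame({"ID": [1, 2, 4], "age": [21, None, 30]})


def summarise(nights):
    """Collapse nightly rows into one row per participant, indexed by ID."""
    grouped = nights.groupby("ID")
    return pd.DataFrame({
        "group": grouped["group"].first(),
        # Assumption: a night counts as delayed when the delay is positive.
        "delay_nights": grouped["delay_s"].apply(lambda s: int((s > 0).sum())),
        # mean() skips missing observations by default.
        "delay_time": grouped["delay_s"].mean().round(),
        "sleep_time": grouped["sleep_s"].mean().round(),
    })


summary = summarise(nights)

# The inner join keeps only IDs present in both DataFrames.
merged = summary.merge(survey.set_index("ID"),
                       left_index=True, right_index=True, how="inner")

# Drop every row that still contains a NaN value.
merged_no_nan = merged.dropna()
print(merged_no_nan)
```

In this toy example participant 3 is missing from the questionnaire and participant 4 from the sleep data, so the inner join drops both; participant 2 is then removed by `dropna()` because of the missing age.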

In [None]:
sleepdatafile   = 'hue_week_3.csv'
surveydatafile  = 'hue_questionnaire.csv'

In [None]:
def read_data(sleepdatafile, surveydatafile):
# YOUR CODE HERE

# YOUR CODE ENDS HERE

In [None]:
mergedDfNoNan = read_data(sleepdatafile, surveydatafile)
mergedDfNoNan


## Exercise 2 (5 points)
Use the `scipy.stats` package and, respectively, the Pearson correlation test and the Kendall rank correlation test, to calculate the following correlation coefficients:
<ul>
<li>
    the Pearson correlation coefficient between bedtime procrastination scale (`bp_scale`, a personality trait) and mean time spent delaying bedtime,    
<br></li>
<li>
    the Kendall rank correlation coefficient between age and mean time spent delaying bedtime,
<br></li>
<li>
    the Pearson correlation coefficient between mean time spent delaying bedtime and daytime sleepiness.
</li>
</ul> 

Save them into the variables `r1`, `tau`, `r2`.
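
As a minimal sketch of the two `scipy.stats` calls involved, the example below runs both tests on made-up numbers. The lists `x` and `y` are hypothetical and only stand in for two columns of the merged DataFrame.

```python
from scipy import stats

# Hypothetical toy measurements, not taken from the HUE data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Pearson correlation: strength of the linear relationship.
r, p_pearson = stats.pearsonr(x, y)

# Kendall rank correlation: based on concordant and discordant pairs,
# so it only uses the ordering of the values.
tau, p_kendall = stats.kendalltau(x, y)

print("Pearson r:", r, "p-value:", p_pearson)
print("Kendall tau:", tau, "p-value:", p_kendall)
```

Both functions return the test statistic together with a p-value, which matches the six values that `calculate_correlations` is expected to return.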

In [None]:
def calculate_correlations(mergedDfNoNan):
# YOUR CODE HERE

# YOUR CODE ENDS HERE

In [None]:
r1, pvalue1, tau, pvalue2, r2, pvalue3 = calculate_correlations(mergedDfNoNan)

statistics = [r1,tau,r2]
pvalues = [pvalue1, pvalue2, pvalue3]

print("Correlation tests:")
for (statistic, pvalue) in zip(statistics, pvalues):
    print('The value of the test statistic is:',statistic)
    print('The p-value is:', pvalue,'\n')


## Exercise 3 (15 points)
Use the `scipy.stats` package to determine whether there are significant differences (at 5\% significance level) between the experimental group and the control group in terms of:
<ul>
<li>
    the number of nights participants delayed their bedtime,    
<br></li>
<li>
    the time participants spent in bed each night,
<br></li>
<li>
    the mean time participants spent delaying their bedtime.
</li>
</ul> 

Use the t-test or the Wilcoxon rank-sum test to reach a conclusion and use knowledge gained in the courses Statistics and Statistical Data Analysis to determine which statistical test is appropriate. Save the conclusions - either the string 'significant difference' or 'no significant difference' - into the variables `dif1`, `dif2`, `dif3`.

\* Note that in the final assignment you are expected to explicitly motivate the choice of an appropriate test; in this exercise you do not have to.
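
One possible way to organise such a comparison is sketched below. The choice rule used here (a Shapiro-Wilk normality check on both groups, then Welch's t-test or the rank-sum test) is an assumption of this sketch and not a rule prescribed by the assignment; the helper `compare_groups` and the toy lists are hypothetical.

```python
from scipy import stats


def compare_groups(a, b, alpha=0.05):
    """Return the conclusion string for a two-sample comparison."""
    # Assumption of this sketch: use Shapiro-Wilk as a rough normality check.
    _, p_normal_a = stats.shapiro(a)
    _, p_normal_b = stats.shapiro(b)
    if p_normal_a > alpha and p_normal_b > alpha:
        # Welch's t-test does not assume equal variances.
        _, p_value = stats.ttest_ind(a, b, equal_var=False)
    else:
        # Wilcoxon rank-sum test as the non-parametric alternative.
        _, p_value = stats.ranksums(a, b)
    if p_value < alpha:
        return 'significant difference'
    return 'no significant difference'


# Hypothetical toy groups.
control = [10, 12, 11, 13, 12, 11, 10, 12]
experimental = [20, 22, 21, 23, 22, 21, 20, 22]

print(compare_groups(control, experimental))
print(compare_groups(control, control))
```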

In [None]:
def perform_tests(mergedDf):
# YOUR CODE HERE

# YOUR CODE ENDS HERE

In [None]:
dif1, dif2, dif3 = perform_tests(mergedDfNoNan) 

print('The number of nights participants delayed their bedtime:', dif1)
print('The time participants spent in bed each night:', dif2)
print('The mean time participants spent delaying their bedtime:', dif3)


## Exercise 4 (15 points)
Use `statsmodels.api` to build a regression model for `delay_time` on the predictors `age`, `chronotype` and `bp_scale`. Return the coefficients of the model, and the conclusion whether the model is significant by using the string 'significant' or 'not significant'.

\* Convince yourself that the basic diagnostics for this model are OK. You do not have to report them here, but in the final assignment you are expected to check the diagnostics explicitly.

In [None]:
def regression_analysis(mergedDfNoNan):
# YOUR CODE HERE

# YOUR CODE ENDS HERE

In [None]:
parameters, significant = regression_analysis(mergedDfNoNan)

print('The parameters of the model are:')
print(parameters)
print('\nThe model is', significant)


## Exercise 5 (15 points)
Create three distinct, meaningful, well-crafted visualizations that either provide insight into the data, or help support your conclusions. This means creating three different kinds of plots (not three boxplots, or three scatterplots for example). Interpret and discuss your findings.

In [None]:
# Plot 1
# YOUR CODE HERE

# YOUR CODE ENDS HERE

In [None]:
# Plot 2
# YOUR CODE HERE

# YOUR CODE ENDS HERE

In [None]:
# Plot 3
# YOUR CODE HERE

# YOUR CODE ENDS HERE

## Part 2
The goal of this part of the assignment is to provide you with practice in implementing MapReduce in Python. Using the `map_reduce_hue.csv` dataset, you will implement two simple MapReduce algorithms.

<a href="https://towardsdatascience.com/a-beginners-introduction-into-mapreduce-2c912bb5e6ac">First read this webpage!</a>

In the ideal situation, we would have access to multiple nodes in the cloud to test our MapReduce functions. Instead, we are going to simulate such an environment in this notebook. We feed the Map function one line of the file per call (as if it were running on a node in the cloud). The Map function prints the result of its computation to the standard output. When the Map function has processed all lines of the file, the Reduce function collects the output of the Map calls (all the intermediate results that were printed). We do this line by line as well, as if the output of a Map function were sent to a Reduce function directly. The Reduce function then transforms the intermediate results into the final answer.

Since the Map function uses the `print()` function to communicate with the Reduce function, we use a smart trick! We store the original standard output in a variable and replace it with a `StringIO` buffer. Whenever the Map function calls `print()`, the printed text is appended to this buffer. After all Map calls have finished, we have collected all the output and switch back to the original standard output. Now the Reduce function can process the buffered output of the Map functions. Note that since the Reduce function is fed line by line, it might need to use global variables instead of local variables to store information between calls.
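
The standard output trick can be sketched in isolation as follows. The helper `capture_prints` and the function `toy_mapper` are hypothetical names used only for this illustration; the `try`/`finally` block is an addition that restores the standard output even if a mapper call fails.

```python
import sys
from io import StringIO


def capture_prints(func, items):
    """Call func on every item and return everything it printed, line by line."""
    old_stdout = sys.stdout      # remember the real standard output
    buffer = StringIO()          # in-memory replacement
    sys.stdout = buffer
    try:
        for item in items:
            func(item)           # print() now writes into the buffer
    finally:
        sys.stdout = old_stdout  # always switch back
    return buffer.getvalue().splitlines()


def toy_mapper(line):
    # Emits a tab-separated key and a count of 1.
    print("{}\t{}".format(line.upper(), 1))


lines = capture_prints(toy_mapper, ["a", "b"])
print(lines)
```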

## Exercise 6 (15 points)
Write a MapReduce algorithm that counts and outputs the total number of times the fitness value is strictly higher than 50. The expected output is a single integer. In this case, the Map function should print the information about each line that the Reduce function needs. The Reduce function should read all these values and print the total count. It might be necessary to have a global variable `totalCount`, which holds the current count of relevant lines.
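
The counting pattern can be illustrated without the standard output machinery. In this sketch the map step returns its pairs instead of printing them, the key `above50` and the toy rows are hypothetical, and the position of the fitness value (index 5) is an assumption about the file layout.

```python
def toy_mapper(line, fitness_index=5):
    """Return one 'key<TAB>1' pair when the fitness value exceeds 50."""
    fields = line.split(",")
    try:
        fitness = float(fields[fitness_index])
    except (IndexError, ValueError):
        return []                # malformed or empty line: emit nothing
    return ["above50\t1"] if fitness > 50 else []


def toy_reducer(pairs):
    """Add up the counts of all emitted pairs."""
    total = 0
    for pair in pairs:
        _key, count = pair.split("\t")
        total += int(count)
    return total


# Hypothetical rows; the last one mimics an empty sentinel line.
rows = ["a,1,x,y,z,51.0,q", "a,1,x,y,z,50.0,q", "a,2,x,y,z,75.5,q", ",,,,,,,"]
emitted = [pair for row in rows for pair in toy_mapper(row)]
total = toy_reducer(emitted)
print(total)
```

Emitting a count of 1 per relevant line means the reduce step only has to add numbers, which is the usual way to express counting in MapReduce.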

In [70]:
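def mapper1(line):
# YOUR CODE HERE
    values = line.split(',')
    # The fitness value is stored in the sixth column (index 5).
    if len(values) > 5:
        # Emit the fitness value with a count of 1: each line is one
        # observation, so the reducer can add up these counts.
        print('\t{}\t{}'.format(values[5], 1))
# YOUR CODE ENDS HERE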


# Pseudocode logic:
# The mapreduce1 function reads in the file and strips each line.
# The mapper splits each line on commas and selects only the field at index 5,
# because that is where the fitness value is stored in the file.
# It then prints tab-separated pairs of the form (52.0, 1).

In [76]:

def reducer1(line):
# YOUR CODE HERE
    global totalCount
    if not line:
        # An empty line signals the end of the mapper output.
        print(totalCount)
        return
    fields = [field for field in line.strip().split('\t') if field]
    if len(fields) != 2:
        # Sentinel or incomplete record without a fitness value: nothing to count.
        return
    fitness, count = float(fields[0]), int(fields[1])
    if fitness > 50:
        totalCount += count
# YOUR CODE ENDS HERE




In [77]:
# Pseudocode logic:
# We read in the lines printed by the mapper function,
# strip any whitespace,
# and split again on \t.

# I had trouble filtering on elements that are greater than 50,
# so I apply the conversion to all the lists and then build a sublist with only the values greater than 50.

# ## THE PROBLEM I RUN INTO:
# I have tried to count the elements at index position [1] in every which way.
# I have tried to convert the elements to float and int first, to be able to do a > comparison.
# Then I tried to change them back to strings, so that I don't keep getting "'int' object is not subscriptable" or "'float' object is not subscriptable".
# But nothing is working.

# I realise that the code above is probably overcomplicated for this problem,
# but nothing I was trying worked.
# I think I am missing something that would greatly simplify this problem.

# I WILL COME BACK TO NUMBERS 6 & 7 LATER TODAY.
# I JUST NEED TO STEP AWAY FROM IT FOR A FEW HOURS TO CLEAR MY MIND!

In [78]:
def mapreduce1(data):    
    old_stdout = sys.stdout
    mystdout = StringIO()
    sys.stdout = mystdout

    with open(data) as file:
        for index, line in enumerate(file):
            if index == 0:
                continue
            line = line.strip()
            mapper1(line)
        mapper1(',,,,,,,')
        
        sys.stdout = old_stdout
        mapper_lines = mystdout.getvalue().split("\n")
        mystdout.close()

        for index, line in enumerate(sorted(mapper_lines)):
            if index == 0:
                continue
            reducer1(line)
        reducer1('')

In [79]:
totalCount = 0
mapreduce1('map_reduce_hue.csv')


In [80]:
totalCount

225

## Exercise 7 (15 points)
Write a MapReduce algorithm that calculates the mean fitness per participant. Do not use any statistical packages to calculate the mean. The expected output is one line per participant, containing the participant's ID and the mean of his or her fitness, separated by a tab ("\t"). The output lines do not have to be sorted.

The Map function in this case is more complicated than in the previous case. Think about what information the Map function should give the Reduce function. In this case, it is necessary to have at least the variable `currentID` (indicating which ID you are processing) as a global variable.
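
The grouping idea behind such a Reduce step can be sketched on its own. The function `mean_per_key` and the toy lines are hypothetical; the sketch assumes that the intermediate lines have the form `ID<TAB>value` and arrive sorted, so that all lines of one participant are adjacent.

```python
def mean_per_key(sorted_lines):
    """Compute 'ID<TAB>mean' for runs of adjacent lines with the same ID."""
    results = []
    current_id, total, count = None, 0.0, 0
    for line in sorted_lines:
        key, value = line.split("\t")
        if key != current_id:
            # A new ID starts: emit the finished one and reset the totals.
            if current_id is not None:
                results.append("{}\t{}".format(current_id, total / count))
            current_id, total, count = key, 0.0, 0
        total += float(value)
        count += 1
    # Emit the last participant, which is never followed by a new ID.
    if current_id is not None:
        results.append("{}\t{}".format(current_id, total / count))
    return results


# Hypothetical intermediate lines; sorting is by string, so "10" precedes "9".
lines = sorted(["10\t40.0", "9\t60.0", "10\t50.0", "9\t70.0", "9\t80.0"])
means = mean_per_key(lines)
print(means)
```

Keeping only a running sum and a count per ID avoids storing all values of a participant; in the notebook setting the same state would live in global variables, because the Reduce function is called once per line.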

In [258]:
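def mapper2(line):
# YOUR CODE HERE
    global currentID
    user_index = 1
    fitness_index = 5
    values = line.split(',')
    if ",,,," not in line:
        # Emit "ID<TAB>fitness", stripping any surrounding quotes from the ID.
        print('{}\t{}'.format(values[user_index].strip('"'), values[fitness_index]))
    else:
        # Sentinel line (',,,,,,,'): reset the global ID before the reduce phase starts.
        currentID = ''
# YOUR CODE ENDS HERE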

In [259]:
def reducer2(line):
# YOUR CODE HERE
    global currentID, values

    def mean(l):
        return sum(l) / len(l)

    def emit():
        # Print and store "ID<TAB>mean" for the participant processed so far.
        # The list res is defined in a separate cell.
        result = "{}\t{}".format(currentID, mean(values))
        print(result)
        res.append(result)

    if not line:
        # An empty line signals the end of the input: flush the last participant.
        if currentID:
            emit()
    else:
        elements = line.strip().split('\t')
        nextID, value = elements[0].strip(), elements[1]
        if currentID == nextID:
            values.append(float(value))
        else:
            if currentID:
                emit()
            values = [float(value)]
            currentID = nextID
# YOUR CODE ENDS HERE

In [260]:
def mapreduce2(data):    
    old_stdout = sys.stdout
    mystdout = StringIO()
    sys.stdout = mystdout

    with open(data) as file:
        for index, line in enumerate(file):
            if index == 0:
                continue
            line = line.strip()
            mapper2(line)
        mapper2(',,,,,,,')
        
        sys.stdout = old_stdout
        mapper_lines = mystdout.getvalue().split("\n")
        mystdout.close()

        for index, line in enumerate(sorted(mapper_lines)):
            if index == 0:
                continue
            reducer2(line)
        reducer2('')

In [261]:
res = []

In [262]:
currentID = ''
mapreduce2('map_reduce_hue.csv')


In [263]:
res

['10',
 '37',
 '12',
 '24',
 '1',
 '20',
 '9',
 '22',
 '31',
 '34',
 '36',
 '19',
 '26',
 '29',
 '20',
 '29',
 '37',
 '31',
 '1',
 '18',
 '22',
 '12',
 '9',
 '36',
 '34',
 '24',
 '10',
 '32',
 '19',
 '26',
 '29',
 '1',
 '20',
 '10',
 '22',
 '34',
 '31',
 '12',
 '9',
 '36',
 '24',
 '18',
 '30',
 '1',
 '22',
 '10',
 '12',
 '24',
 '20',
 '34',
 '18',
 '9',
 '31',
 '36',
 '30',
 '37',
 '29',
 '26',
 '32',
 '19',
 '30',
 '24',
 '31',
 '22',
 '1',
 '12',
 '37',
 '20',
 '29',
 '9',
 '18',
 '36',
 '34',
 '10',
 '32',
 '19',
 '26',
 '1',
 '31',
 '30',
 '37',
 '29',
 '34',
 '10',
 '22',
 '20',
 '18',
 '12',
 '9',
 '36',
 '26',
 '19',
 '31',
 '36',
 '30',
 '10',
 '22',
 '18',
 '34',
 '20',
 '26',
 '9',
 '1',
 '37',
 '29',
 '30',
 '10',
 '18',
 '1',
 '34',
 '22',
 '9',
 '12',
 '20',
 '36',
 '19',
 '31',
 '26',
 '1',
 '30',
 '29',
 '37',
 '10',
 '18',
 '22',
 '20',
 '12',
 '31',
 '9',
 '36',
 '32',
 '34',
 '19',
 '26',
 '20',
 '29',
 '26',
 '1',
 '10',
 '37',
 '34',
 '9',
 '30',
 '31',
 '22',
 '36'