# Introduction

* The domain I selected for this project is my sleep data, collected by an Apple Watch application called AutoSleep. This domain is important to me because, unlike most of my peers, I am an early-morning person. My body naturally wakes up around 6:30 am without fail, unless I go to bed later than usual. The purpose of this research was to try to identify a correlation between when I go to sleep and wake up and the efficiency rating of my sleep.
* All of the datasets I used were in the format of a .csv file. 
* There are four tables in the dataset. One table contains all of the AutoSleep data from the first history export I made, beginning on January 27th and ending on April 12th. A second table contains data from April 13th to April 29th. The remaining two tables contain the corresponding days of the week for each of the instances in the AutoSleep tables.
* The data in the AutoSleep tables is collected through my Apple Watch, which senses when I have put my phone down for the night and monitors my heart rate to determine when I have fallen asleep and what stage of the sleep cycle I am in. I created the .csv files for the days of the week myself in Excel, using the 'fromDate' column from the respective AutoSleep files and a calendar to correctly label the days.
* There are 48 instances in the first sleep table and its corresponding days of the week table, and 13 instances in the second table as well as its corresponding days of the week table. This is a combined total of 61 instances of sleep data and days of the week.  
* In the AutoSleep tables, the following attributes were provided for each night's sleep:
    * From Date/To Date - the date the sleep session was recorded in
    * Bedtime / Waketime - the time you went to bed and awoke. Using the Lights Off feature gives a more accurate bedtime
    * InBed - how long you were in bed for, shown in hours, minutes and seconds
    * Awake - how long you were awake for, shown in hours, minutes and seconds
    * Fell Asleep - if you use the Lights Off feature, this will populate with the amount of time it took you to fall asleep
    * Sessions - the number of sleep sessions recorded during the from and to date
    * Asleep / AsleepAvg7 - the sleep duration recorded along with the 7 day sleep duration average on that date
    * Efficiency / EfficiencyAvg7 - the ratio of time asleep versus time spent in bed along with the 7 day efficiency average
    * Quality / QualityAvg7 - quality considers how long you have slept, how restless you've been and your sleeping heart rate. It is shown as hours, minutes and seconds along with a 7 day sleep quality average
    * Deep / DeepAvg7 - where your heart rate slows and your muscles relax to a point where you barely move.  It is shown as hours, minutes and seconds along with a 7 day deep sleep average
    * SleepBPM / SleepBPMAvg7 - your average heart rate shown in beats per minute for the sleep, along with a 7 day average of your sleeping heart rate
    * DayBPM / DayBPMAvg7 - your average heart rate outside of your sleep, generally during the day for most users, shown in beats per minute, along with a 7 day average of your daily heart rate
    * WakingBPM / WakingBPMAvg7 - your waking pulse shown in beats per minute, which is automatically captured by AutoSleep, along with a 7 day waking pulse average
    * HRV / HRVAvg7 - your Heart Rate Variability that was automatically captured by the Apple Watch or using the Breathe app. AutoSleep will use the maximum value where multiple values exist for the same date. This will also show a 7 day average of your HRV
    * SpO2Avg / SpO2Min / SpO2Max - your Blood Oxygen that was automatically captured by the Apple Watch while you sleep. Remember that you need background measures enabled for automatic measurements while you sleep
    * Tags / Notes - any emoji tags or notes recorded will appear in the export
* The days of the week tables simply had two attributes: the day of the week, and the 'fromDate'. 
* In the dataset, I am trying to classify the efficiency of my nightly sleep sessions. 
* These results have the potential of showing me what length of sleep and heart rate(s) may or may not correspond to a more efficient night's rest. 
* The only stakeholder interested in these results is myself. However, if I were to find something significant, it might be grounds for further research, at which point others, such as doctors and psychologists, might become interested (although I highly doubt that about 60 days' worth of sleep tracking is going to produce anything).
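
The way the two AutoSleep exports fit together can be sketched roughly as follows. This is only an illustration: the in-memory CSV text, the placeholder `day-00x` values, and the variable names are made up for the example, and the real exports contain many more columns and rows (48 and 13 instances, 61 in total).

```python
import io

import pandas as pd

# Made-up stand-ins for the two AutoSleep export files.
# Only 'fromDate' and 'efficiency' are real column names; the values are placeholders.
first_export = io.StringIO(
    "fromDate,efficiency\n"
    "day-001,97.5\n"
    "day-002,88.0\n"
)
second_export = io.StringIO(
    "fromDate,efficiency\n"
    "day-003,99.1\n"
)

sleep_1 = pd.read_csv(first_export)
sleep_2 = pd.read_csv(second_export)

# Stack the two exports into one table and renumber the rows 0..n-1.
sleep_df = pd.concat([sleep_1, sleep_2], ignore_index=True)

# Sanity check: the combined row count equals the sum of the parts.
assert len(sleep_df) == len(sleep_1) + len(sleep_2)
print(sleep_df.shape)
```

With the real files, the same check would confirm that 48 + 13 rows give the 61 instances described above.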




# Data Analysis 

* The dataset didn't need much cleaning. Some columns were completely empty, while others either held values that were unique to each row (dates and times) or were irrelevant to the classification goal. The following columns were therefore removed from the dataframe:
 
|Empty|Unique/Irrelevant|
|---|---|
|deep|ISO8601|
|deepAvg7|toDate|
|dayBPM|inBed|
|dayBPMAvg7|awake| 
|SpO2Avg|fellAsleepIn| 
|SpO2Min|sessions|
|SpO2Max|asleepAvg7|
|respAvg|efficiencyAvg7| 
|respMin|qualityAvg7| 
|respMax|sleepBPMAvg7| 
|tags|wakingBPMAvg7|
|notes|hrvAvg7|
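
One possible way to remove the columns listed in the table is sketched below. The column names come from the table; the two-row frame and its values are made up for illustration, and the two-step approach (empty columns first, then named columns) is an assumption rather than the exact procedure used in the project.

```python
import numpy as np
import pandas as pd

# Made-up two-row frame using a few of the column names from the table.
df = pd.DataFrame({
    "fromDate": ["day-001", "day-002"],
    "efficiency": [97.5, 88.0],
    "deep": [np.nan, np.nan],            # entirely empty
    "SpO2Avg": [np.nan, np.nan],         # entirely empty
    "ISO8601": ["stamp-1", "stamp-2"],   # unique per row
    "sessions": [1, 1],                  # judged irrelevant
})

# Step 1: drop every column in which all values are missing.
df = df.dropna(axis="columns", how="all")

# Step 2: drop the unique/irrelevant columns by name.
# errors="ignore" skips any listed name that is not present in the frame.
unique_or_irrelevant = [
    "ISO8601", "toDate", "inBed", "awake", "fellAsleepIn", "sessions",
    "asleepAvg7", "efficiencyAvg7", "qualityAvg7", "sleepBPMAvg7",
    "wakingBPMAvg7", "hrvAvg7",
]
df = df.drop(columns=unique_or_irrelevant, errors="ignore")

print(list(df.columns))
```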

* I merged the tables on the 'fromDate' column to accurately line up all of the sleep data with the day of the week I fell asleep on. 
* The only challenge with preparing the data was finding ways to compute statistics with the time stamps. I used one function to convert an entire column of time values into seconds and, when needed, a second function to turn a number of seconds back into a string of hours, minutes, and seconds (this was particularly helpful when looking at the 'asleep' column).
* I used the split-apply-combine technique to aggregate my data by day of the week. The data was split into 7 groups (one per day), statistical analyses were performed on each group, and the results were combined into a pandas Series.
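
The merge, the time conversion, and the split-apply-combine step can be sketched together as follows. Several things here are assumptions: the `HH:MM:SS` text format of the 'asleep' column, the `dayOfWeek` column name, the helper names `hms_to_seconds` and `seconds_to_hms`, and all of the sample values. Only 'fromDate' and 'asleep' are taken from the data description.

```python
import pandas as pd


def hms_to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string (assumed format) to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hms(total: float) -> str:
    """Turn a number of seconds into a readable hours/minutes/seconds string."""
    hours, remainder = divmod(int(round(total)), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours} h {minutes} min {seconds} s"


# Made-up sleep rows and their matching day-of-week rows.
sleep_df = pd.DataFrame({
    "fromDate": ["day-001", "day-002", "day-003", "day-004"],
    "asleep": ["07:30:00", "06:45:30", "08:00:00", "07:00:00"],
})
days_df = pd.DataFrame({
    "fromDate": ["day-001", "day-002", "day-003", "day-004"],
    "dayOfWeek": ["Monday", "Tuesday", "Monday", "Tuesday"],
})

# Merge on 'fromDate'; validate raises an error if a date appears twice in either table.
merged = sleep_df.merge(days_df, on="fromDate", how="inner", validate="one_to_one")

# Convert the whole 'asleep' column to seconds so that statistics can be computed.
merged["asleepSeconds"] = merged["asleep"].map(hms_to_seconds)

# Split by day, apply the mean to each group, combine into one pandas Series.
mean_by_day = merged.groupby("dayOfWeek")["asleepSeconds"].mean()

# Format the per-day means back into readable strings.
print(mean_by_day.map(seconds_to_hms))
```

Working in seconds keeps the arithmetic simple; the formatting helper is applied only at the end, when the result is shown to a reader.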




# Classification Results

* For this classification, I wanted to predict efficiency scores for my sleep using the 'efficiency' attribute from my AutoSleep dataset.
* The efficiency scores range from 55% to 100%. The 25th percentile is 92.675%, the 50th percentile (median) is 98.55%, and the maximum is 100%. In other words, at least three quarters of the scores lie at or above 92.675%, so the distribution is heavily skewed toward the top of the range.
* My hypothesis is that with the optimal classification algorithm, there will be more success in predicting the upper 95-100% efficiency scores than the lower values. 
* I evaluated my kNN and decision tree classifiers by computing their accuracies with the corresponding functions in sklearn.
* Challenges with classification can arise when dealing with a smaller dataset because there are fewer training instances to run through the different ML models. I found it particularly difficult to get a good prediction from a kNN classifier that used the holdout method, where 21% of the data was withheld for testing purposes (only 3 of the 13 instances, or about 23%, were predicted correctly).
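
A holdout evaluation of this kind might look roughly like the sketch below. It is not the project's actual code. The text above does not say how the efficiency scores were turned into class labels, so the 95% cutoff, the two label names, the choice of feature columns (`asleepSeconds`, `sleepBPM`), the feature scaling before kNN, and `n_neighbors=3` are all assumptions. The data is synthetic, so the printed accuracies say nothing about the real results.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 61 nights of data (values are NOT real measurements).
rng = np.random.default_rng(0)
n_nights = 61
df = pd.DataFrame({
    "asleepSeconds": rng.integers(5 * 3600, 9 * 3600, size=n_nights),
    "sleepBPM": rng.normal(55, 5, size=n_nights),
    "efficiency": rng.permutation(np.linspace(85, 100, n_nights)),
})

# Assumed binning: turn the continuous score into two class labels.
df["label"] = np.where(df["efficiency"] >= 95, "95-100", "below 95")

# Class distribution as proportions.
class_share = df["label"].value_counts(normalize=True)
print(class_share)

X = df[["asleepSeconds", "sleepBPM"]]
y = df["label"]

# Holdout method: withhold 13 of the 61 instances (about 21%) for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=13, stratify=y, random_state=0
)

# kNN is distance-based, so the features are scaled first (an added assumption).
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X_train, y_train)
knn_acc = accuracy_score(y_test, knn.predict(X_test))

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
tree_acc = accuracy_score(y_test, tree.predict(X_test))

print(f"kNN accuracy: {knn_acc:.2f}")
print(f"Decision tree accuracy: {tree_acc:.2f}")
```

With only 13 test instances, each prediction shifts the accuracy by almost 8 percentage points, which is one reason a single holdout split can look so unstable on a dataset of this size.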