# SI 618: Debugging exercise


*The notebook for this assignment is a little different. There are bugs and problems in the existing code that you need to fix in order to get the assert statements to pass. If you run the first cell under each question title, you'll get either an error or the wrong answer.*

Think of this assignment in the following way: it's your first day on the job and you've been given a notebook that was authored by someone who is no longer with your company. You've been asked to fix it. There are errors in it, and some of it was not completed by the original author. You're lucky, though: there are assertions sprinkled throughout the notebook to help guide you along the way.


Top-level goal of notebook:
Read a CSV file into a pandas DataFrame and add specific columns to it.
These columns are added by applying functions to specific columns.
The columns to add include:
1. A datetime column that converts "Garmin time" to standard (Unix epoch) time. Note that Garmin does not use the standard Unix epoch for its timestamps: instead of counting the seconds elapsed since midnight UTC on January 1, 1970, it counts the seconds elapsed since midnight UTC on December 31, 1989.

2. A conversion of "semicircles" of latitude and longitude to two different formats: degrees, minutes, seconds 3-tuples and fractional degrees.  For example, a latitude of 504719750 semicircles corresponds to a 3-tuple of degrees, minutes and seconds of (42, 18, 18.43) and 42.305121 degrees.

3. A "normalized speed" column in which outliers in speed are replaced by the upper or lower bound and the resulting values are converted to z-values (i.e. by subtracting the mean from each value and dividing the result by the standard deviation).
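
As a quick check of the offset described in item 1, the sketch below derives the number of seconds between the two epochs using only the standard library. The constant names are illustrative, and the sketch assumes both epochs are taken at midnight UTC.

```python
from datetime import datetime, timezone

# Both epochs are assumed to be midnight UTC.
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
GARMIN_EPOCH = datetime(1989, 12, 31, tzinfo=timezone.utc)

# Seconds to add to a Garmin timestamp to obtain a Unix timestamp.
GARMIN_OFFSET_SECONDS = int((GARMIN_EPOCH - UNIX_EPOCH).total_seconds())
print(GARMIN_OFFSET_SECONDS)  # 631065600

# First Garmin timestamp from the ride data, converted to a UTC datetime.
first = datetime.fromtimestamp(896018545 + GARMIN_OFFSET_SECONDS, tz=timezone.utc)
print(first.isoformat())  # 2018-05-23T14:02:25+00:00
```

The converted value agrees with the datetime expected by the Question 1 assertion.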

In addition, you will need to complete a function that looks at the difference between sequential rows to determine whether the cyclist is slowing down or not.

Your task for this assignment is to debug this notebook to produce the desired results as shown in the assertions below.

---

## Question 1

In [None]:
import pandas
import nunpy as mp
ride = pd.read_csv('ride_final2.csv')
ride.head()

def garmin_time_to_datetime(series):
    """Convert Garmin FIT time by adding the number of 
    seconds from January 1, 1970 to December 31, 1989.
    """
    
    return pd.to_datetime(series + 64000000, unit='s', utc=True) # second here is wrong

In [5]:
import pandas as pd
import numpy as np
ride = pd.read_csv('../data/ride_final2.csv')
ride.head()

def garmin_time_to_datetime(series):
    """Convert Garmin FIT time to UTC datetimes.

    Garmin counts seconds from midnight UTC on December 31, 1989, so we add
    the 631065600 seconds between January 1, 1970 and that date.
    """
    return pd.to_datetime(series + 631065600, unit='s', utc=True)

In [9]:
ride.head(5)

Unnamed: 0,Timestamp,Latitude,Longitude,Distance,Altitude,Speed,Timestamp_datetime
0,896018545,504719750,-998493490,10.87,285.8,1.773,2018-05-23 14:02:25+00:00
1,896018560,504717676,-998501870,71.85,285.0,5.533,2018-05-23 14:02:40+00:00
2,896018566,504716354,-998506792,108.02,284.0,6.485,2018-05-23 14:02:46+00:00
3,896018575,504714055,-998515244,170.23,284.0,6.951,2018-05-23 14:02:55+00:00
4,896018584,504711900,-998523278,229.27,285.0,6.224,2018-05-23 14:03:04+00:00


In [8]:
ride['Timestamp_datetime'] = ride.Timestamp.map(garmin_time_to_datetime)

assert ride.Timestamp_datetime[0] == pd.to_datetime('2018-05-23T14:02:25', utc=True), \
    "First datetime is not correct"
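
Because `pd.to_datetime` accepts a whole Series, the same conversion can also be done in one vectorized call rather than element by element. The following is a minimal sketch on a two-row toy Series (the variable names are illustrative and not part of the notebook), using the first two timestamps from the ride data:

```python
import pandas as pd

# Toy Series holding the first two Garmin timestamps from the ride data.
timestamps = pd.Series([896018545, 896018560])

# Add the Garmin-to-Unix offset, then interpret the result as seconds since 1970 (UTC).
converted = pd.to_datetime(timestamps + 631065600, unit='s', utc=True)
print(converted[0])  # 2018-05-23 14:02:25+00:00
```

Both approaches should give the same values; the vectorized form avoids a Python-level call per row.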

## Question 2

In [10]:
def semicircles_to_degrees(semicircles):
    '''
    Convert semicircles to degrees
    '''
    max_32_bit_int = 2**32
    return semicircles * (180/max_32_bit_int)


def degrees_to_dms(degrees_fraction):
    ''' Convert degrees to degree, minute, second 3-tuples'''
    degrees = int(degrees_fraction)
    minutes = (degrees_fraction - degrees) * 60
    seconds = round((degrees_fraction - degrees - minutes/60) * 3600, 6)
    return (degrees, abs(minutes), abs(seconds))


def dms_to_degrees(d,m,s):
    ''' Convert degrees, minutes, seconds to fractional degrees'''
    return d+m/60+s/3600

In [16]:
def semicircles_to_degrees(semicircles):
    '''
    Convert semicircles to degrees
    '''
    max_32_bit_int = 2**31  # 2**31 semicircles span 180 degrees (signed 32-bit range), not 2**32
    return semicircles * (180/max_32_bit_int)


def degrees_to_dms(degrees_fraction):
    ''' Convert degrees to degree, minute, second 3-tuples'''
    degrees = int(degrees_fraction)
    minutes = int((degrees_fraction - degrees) * 60)  # keep whole minutes only; the remainder becomes seconds
    seconds = round((degrees_fraction - degrees - minutes/60) * 3600, 6)
    return (degrees, abs(minutes), abs(seconds))


def dms_to_degrees(d,m,s):
    ''' Convert degrees, minutes, seconds to fractional degrees'''
    return d+m/60+s/3600
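
To make the arithmetic concrete, here is a standalone walk-through of the example from the introduction (a latitude of 504719750 semicircles). It inlines the same formulas instead of calling the functions above, and the constant name is illustrative only.

```python
# 2**31 semicircles correspond to 180 degrees.
SEMICIRCLES_PER_180_DEGREES = 2**31

semicircles = 504719750
degrees_fraction = semicircles * (180 / SEMICIRCLES_PER_180_DEGREES)

# Split fractional degrees into whole degrees, whole minutes and seconds.
degrees = int(degrees_fraction)
minutes = int((degrees_fraction - degrees) * 60)
seconds = (degrees_fraction - degrees - minutes / 60) * 3600

print(round(degrees_fraction, 6))              # 42.305121
print((degrees, minutes, round(seconds, 2)))   # (42, 18, 18.43)
```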

In [17]:
dms = degrees_to_dms(42.2833333)
dms

(42, 16, 59.99988)

In [18]:
# This is a read-only grader cell

dms = degrees_to_dms(42.2833333)
assert dms[0] >= -180, "dms[0] must be greater than or equal to -180"
assert dms[0] <= 180, "dms[0] must be less than or equal to 180"
assert dms[1] >= 0, "dms[1] must be greater than or equal to 0"
assert dms[1] < 60, "dms[1] must be less than 60"
assert dms[2] >= 0, "dms[2] must be greater than or equal to 0"
assert dms[2] < 60, "dms[2] must be less than 60"
assert dms == (42, 16, 59.99988), "dms value is not correct"
assert dms_to_degrees(dms[0], dms[1], dms[2]) == 42.2833333, "dms_to_degrees() conversion is not correct"

ride['Latitude_degrees'] = ride['Latitude'].map(semicircles_to_degrees)
ride['Longitude_degrees'] = ride['Longitude'].map(semicircles_to_degrees)
ride['Latitude_dms'] = ride['Latitude_degrees'].map(degrees_to_dms)
ride['Longitude_dms'] = ride['Longitude_degrees'].map(degrees_to_dms)

last_row = ride.iloc[213]
assert round(last_row.Latitude_degrees,6) == 42.280569, \
    "Last row of ride does not have the correct Latitude_degrees value"
assert round(last_row.Longitude_degrees,6) == -83.739442, \
    "Last row of ride does not have the correct Longitude_degrees value"

## Question 3

In [None]:
def normalize(df, pd_series_name, nsd=2)
    '''
    Take all values that are outside some bound (mean +- 2 sd by default)
    and convert them to the appropriate bound.
    Scale the results to z-scores before returning them
    '''
    df = df.copy()
    pd_series = df[pd_series_name].astype(float)

    # Find upper and lower bound for outliers
    avg = np.mean(pd_series)
    sd  = np.std(pd_series)

    # Calculate the bounds
    lower_bound = avg - nsd*sd
    upper_bound = avg + nsd*sd

    # Collapse in the outliers: replace them with appropriate bound
    df.loc[pd_series < lower_bound , pd_series_name ] = upper_bound
    df.loc[pd_series > upper_bound , pd_series_name ] = upper_bound
    
    return (df[pd_series_name] - avg) / sd

In [24]:
def normalize(df, pd_series_name, nsd=2):
    '''
    Take all values that are outside some bound (mean +- 2 sd by default)
    and convert them to the appropriate bound.
    Scale the results to z-scores before returning them
    '''
    df = df.copy()
    pd_series = df[pd_series_name].astype(float)

    # Find upper and lower bound for outliers
    avg = np.mean(pd_series)
    sd  = np.std(pd_series)

    # Calculate the bounds
    lower_bound = avg - nsd*sd
    upper_bound = avg + nsd*sd

    # Collapse in the outliers: replace them with appropriate bound
    df.loc[pd_series < lower_bound , pd_series_name ] = lower_bound  # low outliers are raised to lower_bound, not upper_bound
    df.loc[pd_series > upper_bound , pd_series_name ] = upper_bound
    
    return (df[pd_series_name] - avg) / sd
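
The effect of clipping before scaling can be seen on a small toy Series. This sketch uses made-up speeds and swaps in `Series.clip` for the two `df.loc` assignments; that substitution is an assumption of equivalence for illustration, not the notebook's own code. It uses `ddof=0` because `np.std` computes the population standard deviation by default.

```python
import pandas as pd

# Made-up speeds with one obvious high outlier (30.0).
speeds = pd.Series([5.0, 5.5, 6.0, 6.5, 5.2, 5.8, 6.1, 5.9, 6.3, 30.0])

avg = speeds.mean()
sd = speeds.std(ddof=0)  # population SD, matching np.std's default
lower_bound = avg - 2 * sd
upper_bound = avg + 2 * sd

# Pull outliers in to the nearest bound, then scale to z-scores.
clipped = speeds.clip(lower=lower_bound, upper=upper_bound)
z_scores = (clipped - avg) / sd

print(z_scores.round(4).tolist())
```

The outlier ends up at a z-score of about +2, the upper bound, while the other values are unaffected by the clipping. Note that the mean and standard deviation are computed before clipping, as in `normalize()`.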

In [25]:
# This is a read-only grader cell

ride['Speed_normalized'] = normalize(ride,'Speed')

assert round(ride.iloc[0].Speed_normalized,4) == -1.7737, \
    "First row of ride does not have the correct value for Speed_normalized"
assert ride.iloc[213].Speed_normalized == -2.0, \
    "Last row of ride does not have the correct value for Speed_normalized"

## Question 4

In [None]:
def proportion_slowing(df,series_name):
    ''' Calculate the proportion of rows that represent a slower speed than the previous row'''
    return 0 # whoops -- ran out of time to do this before I got fired!

In [27]:
ride.describe()

Unnamed: 0,Timestamp,Latitude,Longitude,Distance,Altitude,Speed,Latitude_degrees,Longitude_degrees,Speed_normalized
count,214.0,214.0,214.0,214.0,214.0,214.0,214.0,214.0,214.0
mean,896019200.0,504555400.0,-998795900.0,3229.65472,268.001869,5.791822,42.291348,-83.718108,-0.003329
std,317.981,81011.15,153225.7,1644.337023,13.052397,2.271123,0.00679,0.012843,0.968194
min,896018500.0,504424400.0,-999051200.0,10.87,242.4,0.0,42.280363,-83.739507,-2.0
25%,896019000.0,504504100.0,-998941000.0,1807.68,256.3,4.306,42.287043,-83.73027,-0.655758
50%,896019200.0,504544200.0,-998800800.0,3374.795,272.3,5.444,42.290403,-83.71852,-0.153509
75%,896019400.0,504614500.0,-998650100.0,4642.45,278.4,7.24075,42.296299,-83.705885,0.639474
max,896019800.0,504719800.0,-998493500.0,5981.77,288.0,11.365,42.305121,-83.692758,2.0


In [33]:
def proportion_slowing(df,series_name):
    ''' Calculate the proportion of rows that represent a slower speed than the previous row'''
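    df = df.copy()  # avoid adding helper columns to the caller's DataFrame
    df['shift'] = df[series_name].shift(1)  # previous row's value; NaN for the first row
    df['slowing'] = df[series_name] < df['shift']  # comparing against NaN yields False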
    return df['slowing'].sum() / len(df)
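
The shift-and-compare idea is easier to follow on a handful of values. The sketch below uses a made-up five-element Series, not the ride data, and shows each intermediate step:

```python
import pandas as pd

# Made-up speeds: two of the five rows are slower than the row before them.
speeds = pd.Series([3.0, 2.5, 2.5, 4.0, 3.5])

previous = speeds.shift(1)       # first entry becomes NaN
slowing = speeds < previous      # any comparison with NaN is False

print(slowing.tolist())          # [False, True, False, False, True]

proportion = slowing.sum() / len(speeds)
print(proportion)                # 0.4
```

The first row has no predecessor, so it never counts as slowing but still counts in the denominator, which is the behaviour the grader's expected values rely on.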

In [34]:
# This is a read-only grader cell

assert round(proportion_slowing(ride,'Speed_normalized'),6) == 0.514019, \
    "proportion_slowing() does not return the correct value for the full ride dataset"
assert round(proportion_slowing(ride[:10],'Speed_normalized'),6) == 0.4, \
    "proportion_slowing() does not return the correct value for the first 10 rows of the ride dataset"
assert round(proportion_slowing(ride[10:],'Speed_normalized'),6) == 0.519608, \
    "proportion_slowing() does not return the correct value for the last rows of the ride dataset"