# Machine Learning Project, Task 1: Feature Engineering - Data Visualization

Before you start, make sure that you are familiar with the basic usage of Jupyter Notebook. 

If not, please finish the Jupyter Notebook primer first. Additionally, visit the Azure Notebook library [Cloud Computing Course](https://notebooks.azure.com/CloudComputingCourse/projects/cloud-computing-course) and read the tutorials with **worked examples** and practice on Linux, Bash and Pandas.

In this task, you will visualize spatial and temporal data that may influence cab fare prices in New York City. Implement the following methods based on your observations:
```
q1()
q2()
q3()
q4()
```

Please do not change any utility method. You may add more cells to the notebook. If you don't want a cell to be included in the converted script, tag the cell with `excluded_from_script`. You can display the tags for each cell via `View > Cell Toolbar > Tags`.

Execute `./runner.sh` in the console to check the result. Please make sure that the virtualenv is activated when executing `runner.sh`.


In [None]:
# Import libraries that q1 - q4 depend on.
# Please DO NOT change this cell. 
# The cell will be included in the converted Python script.
import pandas as pd
import math
import matplotlib.pyplot as plt
import datetime
import scipy.signal
import sys
import argparse
import os

In [None]:
# Import packages that are used in data visualization but not in q1 - q4.
# This cell will be excluded in the converted Python script.
import seaborn as sns
from mapboxgl.utils import *
from mapboxgl.viz import *

In [None]:
def visualize_map(data, center, zoom):
    """
    This is a sample method for you to get used to spatial data visualization using the Mapboxgl-jupyter library.
    
    Mapboxgl-jupyter is a location based data visualization library for Jupyter Notebooks.
    To better understand this, you may want to read the documentation: 
    https://mapbox-mapboxgl-jupyter.readthedocs-hosted.com/en/latest/
    
    To use the library, you need to register for a token by accessing: 
    https://account.mapbox.com/access-tokens/
    You need to create an account and login. Then you can see your access token by revisiting the above URL.
    
    You can check the output of this method by exporting your token as an environment variable `MAPBOX_ACCESS_TOKEN`
    and then executing the cell below. It may take several minutes to show a complete map.
    
    Hint:
    You can set the environment variable in Jupyter Notebook by creating a cell to
    execute `os.environ['MAPBOX_ACCESS_TOKEN'] = "pk.your_specific_token"`.
    Be sure to exclude the plaintext token from your submission 
    by deleting the cell that includes 
    `os.environ['MAPBOX_ACCESS_TOKEN'] = "pk.your_specific_token"`.
    """
    # Create the viz from the dataframe
    viz = CircleViz(data,
                    access_token = os.environ['MAPBOX_ACCESS_TOKEN'],
                    center = center,
                    zoom = zoom,
                  )
    # It could take several minutes to show the map
    print("showing map...")
    viz.show();
    

In [None]:
# set the center of the map
center_of_nyc = (-74, 40.73)

# visualize an empty map of New York City
visualize_map(data=None, center=center_of_nyc, zoom=10)

# Q1: Spatial Data Visualization
>In q1, you will explore the geographical distribution of the data.

>You should carry out the data visualization and explore the dataset using `cc_nyc_fare_train_small.csv`. The same dataset will be used in the following feature engineering task. However, please use `NA_boundary_box.csv` when submitting q1.

>Steps:
>1. Find the proper inputs to feed into the `visualize_map` method and visualize the spatial data.
>2. Explore the data points on the map. Does every point make sense? Should some data be in the Atlantic Ocean? 
>3. Implement a data filter to exclude rows with pickup location outside the region of the United States.

>Hint: 

>You may want to figure out latitude and longitude boundaries for the United States excluding the bordering countries. A good place to find a bounding box is: http://boundingbox.klokantech.com/. Please explore the usage of this bounding box tool and find the required bounding box. You may drag and drop the box to include the region you want.
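
>A minimal sketch of a bounding-box filter on a toy DataFrame. The bounds below are hypothetical placeholders chosen only for illustration, not the bounding box required for q1; you still need to look up the real values with the tool above.

```python
import pandas as pd

# Hypothetical bounds for illustration only (NOT the bounding box required for q1).
LON_MIN, LON_MAX = -80.0, -70.0
LAT_MIN, LAT_MAX = 35.0, 45.0

# Toy data: rows 1 and 3 fall outside the hypothetical box.
toy = pd.DataFrame({
    "pickup_longitude": [-73.99, 0.0, -74.01, -73.95],
    "pickup_latitude": [40.75, 0.0, 40.71, 50.20],
})

# Series.between is inclusive on both ends by default.
inside = (
    toy["pickup_longitude"].between(LON_MIN, LON_MAX)
    & toy["pickup_latitude"].between(LAT_MIN, LAT_MAX)
)

# Boolean indexing returns a new frame and keeps the original row numbers.
filtered = toy[inside]
print(filtered)
```

>Because boolean indexing preserves the index, the original row numbers remain available when the result is written out as CSV.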

In [None]:
# Create a geojson file export from a Pandas dataframe
df = pd.read_csv('data/cc_nyc_fare_train_small.csv', parse_dates=['pickup_datetime'])
data = df_to_geojson(df, lat='pickup_latitude', lon='pickup_longitude')

# TODO: visualize spatial data on a map
raise NotImplementedError("To be implemented")
visualize_map(data=data, center=TODO, zoom=TODO)

In [None]:
def q1():
    """
    ML Objective: When exploring raw datasets you will often come across data points which do not fit the business 
    case and are called outliers. In this case, the outliers might denote data points outside of the specific area
    since our goal is to develop a model to predict fares in NYC. 
    
    You might want to exclude such data points to make your model perform better in the Feature Engineering Task.
    
    TODO: Exclude rows with pickup location outside the region of the United States excluding the bordering countries.
    
    output format:
    <row number>, <pickup_longitude>, <pickup_latitude>
    <row number>, <pickup_longitude>, <pickup_latitude>
    <row number>, <pickup_longitude>, <pickup_latitude>
    ...
    """
    
    df = pd.read_csv('data/NA_boundary_box.csv').loc[:,['pickup_latitude', 'pickup_longitude']]
    
    # TODO: Implement a data filter to exclude the data outside the region of the United States
    #       (replace "None" with your implementation)
    raise NotImplementedError("To be implemented")
    res = None
    
    # print the result to standard output in the CSV format
    res.to_csv(sys.stdout, encoding='utf-8', header=None)

In [None]:
q1()

# Q2: Hotspots and Distance
>In this task, you are supposed to calculate the distance between Madison Square Garden (MSG) and the most popular pickup location in the southeast of NYC based on your observation of the heatmap.

>You are still using `cc_nyc_fare_train_small` for data visualization to get a better idea of the dataset for the feature engineering task.

>Hints: 
>1. To find the southeastern hotspot, you need to call the draw_heatmap method to draw a heatmap. 
>2. You need to figure out the latitude and longitude of the hotspot you observed and Madison Square Garden. 
>3. You should use the provided haversine_distance function to calculate the distance.
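
>As a sanity check on the haversine formula, the sketch below restates it in a self-contained helper named `haversine_km` (a hypothetical name used here so that it does not shadow the provided utility) and applies it to two illustrative points exactly one degree of latitude apart. The coordinates are arbitrary examples, not the hotspot you are asked to find.

```python
import math


def haversine_km(origin, destination, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lng) tuples given in degrees."""
    lat1, lon1 = map(math.radians, origin)
    lat2, lon2 = map(math.radians, destination)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * radius_km * math.atan2(math.sqrt(a), math.sqrt(1 - a))


# Arbitrary example points, one degree of latitude apart on the same meridian.
point_a = (40.0, -74.0)  # (lat, lng)
point_b = (41.0, -74.0)

one_degree_km = haversine_km(point_a, point_b)
print(round(one_degree_km, 2))  # about 111.19 km for a 6371 km radius
```

>Note that both tuples are ordered as (lat, lng), which is the reverse of the (lng, lat) order used for the map center.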

In [None]:
# Utility methods, please do not change.

def haversine_distance(origin, destination):
    """
    Formula to calculate the spherical distance between 2 coordinates, with each specified as a (lat, lng) tuple

    :param origin: (lat, lng)
    :type origin: tuple
    :param destination: (lat, lng)
    :type destination: tuple
    :return: haversine distance
    :rtype: float
    """
    lat1, lon1 = origin
    lat2, lon2 = destination
    radius = 6371  # km

    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) * math.sin(dlat / 2) + math.cos(math.radians(lat1)) * math.cos(
        math.radians(lat2)) * math.sin(dlon / 2) * math.sin(dlon / 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    d = radius * c

    return d
  

def draw_heatmap(data, center, zoom):
    """
    Method to draw a heat map. You should use this method to identify the most popular pickup location in the southeast of NYC.

    :param data: name of GeoJSON file or object or JSON join-data weight_property
    :type data: string
    :param center: map center point
    :type center: tuple
    :param zoom: starting zoom level for map
    :type zoom: float
    """
    # set features for the heatmap
    heatmap_color_stops = create_color_stops([0.01,0.25,0.5,0.75,1], colors='RdPu')
    heatmap_radius_stops = [[10,1],[20,2]] #increase radius with zoom

    # create a heatmap
    viz = HeatmapViz(data,
                     access_token=os.environ['MAPBOX_ACCESS_TOKEN'],
                     color_stops=heatmap_color_stops,
                     radius_stops=heatmap_radius_stops,
                     height='500px',
                     opacity=0.9,
                     center=center,
                     zoom=zoom)
    print("drawing map...")
    viz.show()


In [None]:
def q2():
    """
    ML Objective: When exploring raw datasets, you will often come across a small set of data points which might 
    behave differently from the rest of the data points. 
    
    In this case, the fare between two hotspots in NYC might be much higher irrespective of the distance between them. 
    You might want to identify such data points to make your model perform better during the Feature Engineering Task.
    
    TODO: calculate the distance between MSG and the most popular pickup location in the southeast of NYC.
    
    output format:
    <distance>
    """
   
    MSG_coor = (40.750298, -73.993324) # lat, lng
    
    # TODO: replace "None" with your implementation
    raise NotImplementedError("To be implemented")
    hot_spot_coor = None
    res = None
    
    print(round(res, 2))

In [None]:
q2()

# Q3: Time-related Features

>Before conducting feature engineering, you may want to think about time-based features, which could be correlated with traffic conditions that may influence the fare.

>In q3, you need to implement the method to extract year, month, hour and weekday from the pickup_datetime feature.
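
>One possible way to derive such features is the pandas `.dt` accessor, sketched below on a toy DataFrame with made-up timestamps and fares. It assumes the column has already been parsed as datetimes and that weekdays follow the pandas convention of Monday=0 through Sunday=6; the task itself does not state which weekday encoding is expected.

```python
import pandas as pd

# Toy data with made-up values; the column is parsed as datetime64.
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2013-07-04 18:30:00",
                                       "2009-06-15 08:05:00"]),
    "fare_amount": [12.5, 6.1],
})

pickup = toy["pickup_datetime"]
toy["year"] = pickup.dt.year
toy["month"] = pickup.dt.month
toy["hour"] = pickup.dt.hour
toy["weekday"] = pickup.dt.weekday  # pandas convention: Monday=0 ... Sunday=6

print(toy)
```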

In [None]:
def q3():
    """
    ML Objective: As described above, time-based features are crucial for better performance of an ML model since 
    the input data points often change with respect to time.  
    
    In this case, traffic might be heavier during office hours or on weekends, which may result 
    in higher fares. You might want to develop such time-based features to make your model perform better during the 
    Feature Engineering Task.
    
    TODO: You need to implement the method to extract year, month, hour and weekday from the pickup_datetime feature
    
    output format:
    <row number>, <pickup_datetime>, <fare_amount>, <year>, <month>, <hour>, <weekday>
    """
    # read the CSV file into a DataFrame
    df = pd.read_csv('data/cc_nyc_fare_train_tiny.csv', parse_dates=['pickup_datetime'])
    time_features = df.loc[:, ['pickup_datetime', 'fare_amount']]

    # TODO: extract time-related features from the `pickup_datetime` column.
    #       (replace "None" with your implementation)
    raise NotImplementedError("To be implemented")
    time_features['year'] = None
    time_features['month'] = None
    time_features['hour'] = None
    time_features['weekday'] = None
    
    # print the result to standard output in the CSV format
    time_features.to_csv(sys.stdout, encoding='utf-8', header=None)


In [None]:
q3()

# Q4: Investigate time pattern

In feature engineering, you must write your transformation code to be stateless, which means that the same transformation is applied to the training set and to the test set. After you have trained your model using transformed features from the training set, you will need to perform the same transformation on the records from the test set with the same code.

An example is filtering outliers according to the percentile values of a feature using [pandas.DataFrame.quantile](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html). **Calling this function directly on the training set and on the test set will likely compute different thresholds and therefore apply different transformations, since each set has different records.** The transformation is the same only if the numeric column has the same quantile values in both sets.

Instead, you should **store** the specific values (for example, a quantile threshold) computed from a numeric column of the **training set**, and apply them consistently to the **test set**.
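
A minimal sketch of this idea on synthetic data: the threshold is computed once from a toy training set and then reused unchanged on a toy test set. The frames, the fares, and the helper name `drop_high_fares` are hypothetical and only illustrate the pattern; they are not the values expected in q4.

```python
import pandas as pd

# Synthetic fares: the training set holds 1.0, 2.0, ..., 1000.0.
train = pd.DataFrame({"fare_amount": [float(v) for v in range(1, 1001)]})
test = pd.DataFrame({"fare_amount": [5.0, 999.0, 1000.0, 2000.0]})

# Learn the threshold once, from the training set only.
threshold = train["fare_amount"].quantile(0.999)


def drop_high_fares(df, cutoff):
    """Return a filtered copy; the input DataFrame is left unmodified."""
    return df[df["fare_amount"] < cutoff].copy()


# The same stored cutoff is applied to both sets.
train_clean = drop_high_fares(train, threshold)
test_clean = drop_high_fares(test, threshold)

print(threshold)
print(test_clean)
```

Calling `test["fare_amount"].quantile(0.999)` instead would derive a different cutoff from the test records, which is exactly the inconsistency described above.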

>In q4, you are expected to:

>1. Use the DataFrame you generated in q3 to visualize the relationship between year, hour, weekday and fare_amount. 
>2. Explore the plot of year 2013 and fix the abnormal distribution by removing 0.1% of raw data.

>You are using `cc_nyc_fare_train_small` as your training set. However, please use `cc_nyc_fare_train_tiny` as your test set in the implementation of q4.

>Hint:

>1. You may want to draw a histogram to find what is happening here.
>2. The method needs to be stateless, i.e., the passed input DataFrame should not be modified. To achieve this, you may want to find the 99.9% quantile and hardcode the specific number in the filter so that it won't change when you're using test data.

In [None]:
# Utility method, please do not change.

def draw_fare_time_plot(df):
    """
    Utility function to draw a heatmap on fare_amount, weekday, hour and year.
    
    input format:
    <pickup_datetime>, <fare_amount>, <year>, <month>, <hour>, <weekday>
    <pickup_datetime>, <fare_amount>, <year>, <month>, <hour>, <weekday>
    <pickup_datetime>, <fare_amount>, <year>, <month>, <hour>, <weekday>
    ...
    """
    # group data by time-related figures and calculate mean fare amount
    df = df.groupby(['year','weekday','hour']).mean().reset_index()
    
    # plot
    plt.figure(1, figsize=(18,10))
    for (i, year) in enumerate(range(2009, 2016)):
        ax = plt.subplot(2, 4, i + 1)
        df_plot = df.query('year == @year') 
        sns.heatmap(df_plot.pivot(index='hour', columns='weekday', values='fare_amount'), annot=False, cmap="Blues")
        plt.title("year " + str(year)) 
        ax.set(xlim=(-1, 7))
        ax.set(ylim=(0, 25))


In [None]:
def q4():
    """
    ML Objective: While relying on time-based features might be beneficial, it is a good practice to remove the 
    abnormalities in the data. 
    
    In this case, the time of day might not explain the resulting fare. When developing 
    time-based features, you might want to exclude a few abnormal data points which might lead to overfitting.
    
    Fix the abnormal distribution in `fare_amount` by removing 0.1% of raw data.
    
    output format:
    <row number>, <pickup_datetime>, <fare_amount>
    <row number>, <pickup_datetime>, <fare_amount>
    <row number>, <pickup_datetime>, <fare_amount>
    ...
    """
    # read the CSV file into a DataFrame
    df = pd.read_csv('data/cc_nyc_fare_train_tiny.csv', parse_dates=['pickup_datetime']).loc[:, ['pickup_datetime', 'fare_amount']]

    # TODO: replace "None" with the 99.9% quantile
    raise NotImplementedError("To be implemented")
    df = df[df['fare_amount'] < None]

    # print the result to standard output in the CSV format
    df.to_csv(sys.stdout, encoding='utf-8', header=None)

In [None]:
q4()

# DO NOT MODIFY ANYTHING BELOW

In [None]:
def main():
    parser = argparse.ArgumentParser(
        description="Project Machine Learning on Cloud")
    parser.add_argument("-r",
                        metavar='<question_id>',
                        required=False)
    args = parser.parse_args()
    question = args.r

    if question == "q1":
        q1()
    elif question == "q2":
        q2()
    elif question == "q3":
        q3()
    elif question == "q4":
        q4()
    else:
        print("Invalid question")

if __name__ == "__main__":
    main()