# Lab 1 - Feature Engineering using Dask Dataframes

## Setting up a Dask client

In [11]:
from dask.distributed import Client
import dask.dataframe as dd
import dask
import pandas as pd

In [12]:
client = Client(n_workers=1, processes=False)
client



Client: Scheduler: inproc://192.168.0.26/77934/61, Dashboard: http://192.168.0.26:35305/status
Cluster: Workers: 1, Cores: 8, Memory: 16.68 GB


## **Exercise 1:** Reading the data

Dask Dataframes coordinate many Pandas dataframes, partitioned along an index.
You can read a Dask dataframe from multiple parquet files using a glob expression like below:
```python

    df = dd.read_parquet("my-parquet-data-*.parquet")

```

In the cell below, read our NYC taxi trip CSV files (under `data/nyc-taxi-trip-duration/`) just like in Pandas, but use a glob expression to read the multiple files at once.

In [3]:
# YOUR CODE GOES HERE!

**solution:**

In [58]:
df = dd.read_csv("data/nyc-taxi-trip-duration/train-*.csv")

In [51]:
df.head()

Unnamed: 0,id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982155,40.767937,-73.96463,40.765602,N,455
1,id2377394,1,2016-06-12 00:43:35,2016-06-12 00:54:38,1,-73.980415,40.738564,-73.999481,40.731152,N,663
2,id3858529,2,2016-01-19 11:35:24,2016-01-19 12:10:48,1,-73.979027,40.763939,-74.005333,40.710087,N,2124
3,id3504673,2,2016-04-06 19:32:31,2016-04-06 19:39:40,1,-74.01004,40.719971,-74.012268,40.706718,N,429
4,id2181028,2,2016-03-26 13:30:55,2016-03-26 13:38:10,1,-73.973053,40.793209,-73.972923,40.78252,N,435


**Unlike Pandas, Dask DataFrames are lazy: displaying `df` itself prints no data, although the column names and dtypes are already known.** For `read_csv`, Dask infers the column names and dtypes from a sample at the beginning of the first file, without loading every file.

**Some operations, like `head`, are computed right away, but most of them only run after an explicit call to `compute`.**

## **Exercise 2:** What was the longest trip duration in hours?

In [52]:
# YOUR CODE GOES HERE

In [53]:
(df.trip_duration.max()/3600).compute()

979.5227777777778

Hmmmm... This doesn't seem right. It would mean that someone took a trip lasting about 40 days! Let's remove the outliers: the solution below drops every trip longer than the mean plus two standard deviations.

## **Exercise 3:** Outlier Removal

In [54]:
# YOUR CODE GOES HERE

In [55]:
df = df[df.trip_duration < (df.trip_duration.mean()+2*df.trip_duration.std())]

## **Exercise 4:** What were the mean and standard deviation of the number of passengers?

In [None]:
# YOUR CODE GOES HERE

In [12]:
dask.compute(df.passenger_count.mean(), df.passenger_count.std())



(1.7018262467884164, 1.390735923905905)

## **Exercise 5:** What was the longest trip distance based on the haversine distance?

In [46]:
from math import radians, cos, sin, asin, sqrt

def haversine_distance(row):
    """
    Calculate the great-circle distance between the pickup and dropoff
    points of a row (coordinates specified in decimal degrees),
    returning the distance in kilometers.
    """
    # Convert decimal degrees to radians (a radian is the angle at the center
    # of a circle whose arc is equal in length to the radius).
    lon1, lat1, lon2, lat2 = row['pickup_longitude'], row['pickup_latitude'], row['dropoff_longitude'], row['dropoff_latitude']
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371  # Radius of the earth in kilometers. Use 3956 for miles.
    return c * r

In [47]:
# YOUR CODE GOES HERE

In [48]:
df["haversine_distance"] = df.apply(haversine_distance, meta=(None, "float64"), axis=1)
# The graph up to this point takes some time to execute, so compute it once and
# keep the result in a variable (note: this rebinds the name of the function above).
haversine_distance = df["haversine_distance"].compute()

In [49]:
haversine_distance.max()

1240.9086766508526

Let's remove some outlier distances too, using the same rule as for the trip duration (mean plus two standard deviations):

**Remember: avoid recalculations!**

In [None]:
# YOUR CODE GOES HERE

In [None]:
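threshold = haversine_distance.mean() + 2 * haversine_distance.std()  # statistics from the computed Pandas series
df = df[df["haversine_distance"] < threshold]  # filter on the lazy Dask column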
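An interquartile-range (IQR) rule is another common way to flag outliers. The sketch below applies it to a hypothetical Pandas series of distances; the values and the conventional 1.5 multiplier are assumptions for illustration, not part of the lab solution.

```python
import pandas as pd

# Hypothetical distances in kilometers, with one extreme value.
distances = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 1240.9])

# Quartiles are computed once and reused.
q1, q3 = distances.quantile([0.25, 0.75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr  # conventional upper fence

kept = distances[distances <= upper]
print(q1, q3, upper, len(kept))
```

With these values the quartiles are 2.25 and 3.75, the upper fence is 6.0, and only the 1240.9 km entry is dropped.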

## **Exercise 6:** Generating calendar features

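* `map_partitions` applies a function to each partition of the Dask dataframe; every partition is a regular Pandas dataframe.
* The Dask API closely mirrors the Pandas API, so the calendar features can be generated with the same `.dt` accessors you would use in Pandas.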

In [7]:
df['pickup_datetime'] = df.map_partitions(lambda part: pd.to_datetime(part['pickup_datetime'], format="%Y-%m-%d %H:%M:%S"),
                                          meta=("pickup_datetime", "datetime64[ns]"))
df["pickup_hour"] = df["pickup_datetime"].dt.hour
df["pickup_day"] = df["pickup_datetime"].dt.day
df["pickup_week"] = df["pickup_datetime"].dt.week

weekday_dict = {0: "Mon",
                1: "Tues",
                2: "Wed",
                3: "Thurs",
                4: "Fri",
                5: "Sat",
                6: "Sun"}
df["pickup_weekday"] = df["pickup_datetime"].dt.weekday
df["pickup_weekday"] = df["pickup_weekday"].map(weekday_dict)

month_dict = {1: "Jan",
              2: "Feb",
              3: "March",
              4: "April",
              5: "May",
              6: "June",
              7: "July",
              8: "Aug",
              9: "Sep",
              10: "Oct",
              11: "Nov",
              12: "Dec"}
df["pickup_month"] = df["pickup_datetime"].dt.month
df['month'] = df['pickup_month'].map(month_dict)
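Note that `.dt.week` was deprecated in Pandas 1.1 and removed in Pandas 2.0, with `.dt.isocalendar().week` as the replacement. The sketch below shows the same calendar features in plain Pandas on two timestamps taken from the `head` output above; whether the Dask `.dt` accessor exposes `isocalendar` depends on the installed version, so treat that part as an assumption to verify.

```python
import pandas as pd

# Two pickup timestamps from the sample rows shown earlier.
pickups = pd.to_datetime(
    pd.Series(["2016-03-14 17:24:55", "2016-06-12 00:43:35"]),
    format="%Y-%m-%d %H:%M:%S",
)

hours = pickups.dt.hour.tolist()
weekdays = pickups.dt.weekday.tolist()              # Monday is 0
weeks = pickups.dt.isocalendar().week.astype(int).tolist()
months = pickups.dt.month.tolist()
print(hours, weekdays, weeks, months)
```

2016-03-14 was a Monday in ISO week 11, which matches the first entries of `weekdays` and `weeks`.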



## **Exercise 7:** Generating the target

In [60]:
df["target"] = df["trip_duration"] > df.trip_duration.mean()

In [None]:
df.compute()

