# Introduction to Feature Engineering <font color='blue'> (10 min) </font>

# Google doc with code corrections is accessible at:
### https://docs.google.com/document/d/19Um3u0V7dekptT0NBE-ArPf0MHe74xQA_BMueBD0Vpw/edit?usp=sharing

# 0) Importing the right tools <font color='blue'> (5 min) </font>

### <font color='red'>0.1) Import the necessary packages: </font>

- pandas (aliased as pd)
- numpy (aliased as np)
- seaborn (aliased as sns)
- matplotlib.pyplot (aliased as plt)

In [37]:
from __future__ import division

#### IMPORT THE NECESSARY PACKAGES WITH THEIR ALIASES ####


%pylab inline 

Populating the interactive namespace from numpy and matplotlib


### <font color='red'>0.2) Import the dataset from <i>'../data/data_after_collection_cleaning.csv'</i></font>

In [None]:
raw_data = #### IMPORT THE DATASET HERE USING pd.read_csv(...)####
data = #### COPY THE RAW DATA ####
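
A minimal sketch of why the raw data is copied before any transformation. The CSV text and its two columns below are toy values made up for illustration (they are not the real file), and `raw` / `work` are hypothetical names.

```python
import io
import pandas as pd

# Toy CSV content (hypothetical, NOT the real dataset)
csv_text = u"tripduration,starttime\n600,2015-06-01 08:30\n900,2015-06-01 09:10\n"

raw = pd.read_csv(io.StringIO(csv_text))  # same call you would use with a file path
work = raw.copy()                         # independent copy: edits do not touch raw

# Modify the working copy only (seconds -> minutes)
work['tripduration'] = work['tripduration'] / 60.0
```

After running this, `raw` still holds the original values, so you can always restart from it if a transformation goes wrong.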

### <font color='red'>0.3) Print samples of data so you are familiar with the data </font>

## 1) Week-end, Weekday <font color='blue'> (20 min) </font>

### Weekday

### <font color='red'>1.1) Import the <i>calendar</i> package and get help on it </font>

In [None]:
#### IMPORT THE PACKAGE ####

In [None]:
#### CALL HELP ON IT ####

### <font color='red'>1.2) Run the following block, so it changes starttime and stoptime columns of <i>data</i> to the right <i>datetime</i> format </font>

In [None]:
data['starttime'] = pd.to_datetime(data['starttime'], format="%Y-%m-%d %H:%M")
data['stoptime'] = pd.to_datetime(data['stoptime'], format="%Y-%m-%d %H:%M")

### <font color='red'>1.3) Run the following block with different day indexes, so you can see what <i>calendar.day_name[index_day]</i> returns </font>

In [None]:
index_day = 0
print(calendar.day_name[index_day])

### <font color='red'>1.4) Run the following block so you get the day index of a given observation, as well as its week-day using <i>calendar</i></font>

In [None]:
example_observation = data.iloc[0] 
print(example_observation.starttime.weekday())
print(calendar.day_name[example_observation.starttime.weekday()])

### <font color='red'>1.5) Create a column in the dataframe, containing the day of the week of a given trip. You can use the following functions:</font>
- data.column_name.apply()
- calendar.day_name[day_number_here] (notice the brackets)
- datetime.weekday()

<font color='green'> <b>Any function can be passed to data.column_name.apply(name_of_your_function); for instance, you could use a <i>lambda function</i></b>:
- new_column = data.column_name.apply(lambda x: x.attribute_of_x) creates a new column based on an attribute of each element of the old column.</font>
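
The hint above can be sketched on a toy column. The two timestamps are made-up values (not taken from the dataset), and `toy_times` / `toy_days` are hypothetical names.

```python
import calendar
import pandas as pd

# Toy timestamps (hypothetical values, not from the dataset)
toy_times = pd.Series(pd.to_datetime(['2015-06-01 08:30', '2015-06-06 17:45']))

# weekday() returns 0 for Monday ... 6 for Sunday;
# calendar.day_name turns that index into the name of the day
toy_days = toy_times.apply(lambda t: calendar.day_name[t.weekday()])
```

The lambda is called once per element of the column, so each timestamp is mapped to one string.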

In [43]:
data['start_day'] = #### ADD THE DAY OF THE WEEK IN THIS NEW DATAFRAME COLUMN ####

### <font color='red'>1.6) Print samples of data to make sure the column has been created successfully </font>

### <font color='red'>1.7) Use the <i>seaborn</i> package, and more specifically <i>sns.countplot</i>, to plot the number of trips per week day</font>

### <font color='red'>1.8) Define a new binary column, with True if the day falls on the week-end, False otherwise. You can use the <i>data.column_name.apply(your_function_here)</i> function.</font>

In [46]:
data['is_weekend'] = ##### NEW BINARY COLUMN, True IF OBSERVATION IS A WEEKEND TRIP, False OTHERWISE ####
                    #### YOU CAN USE A LAMBDA FUNCTION HERE like this : .apply(lambda x: x in set_to_check) ####
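
A small sketch of the membership test on a toy column of day names (made-up values; `weekend_days`, `via_apply` and `via_isin` are hypothetical names). It also shows `Series.isin`, a vectorized alternative that should give the same result.

```python
import pandas as pd

weekend_days = {'Saturday', 'Sunday'}

# Toy column of day names (hypothetical values)
toy_days = pd.Series(['Monday', 'Saturday', 'Sunday', 'Friday'])

# Element-by-element membership test with a lambda
via_apply = toy_days.apply(lambda day: day in weekend_days)

# Vectorized alternative
via_isin = toy_days.isin(weekend_days)
```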

### <font color='red'>1.9) Print samples of data to check success of operation</font>

## 2) Morning/afternoon/evening/night <font color='blue'> (15 min) </font>

### <font color='red'>2.1) Define a function that returns a string depending on its parameter x. If:</font>
- x.hour is strictly under 6 or strictly over 22, return 'night'
- x.hour is strictly over 18 and at most 22, return 'evening'
- x.hour is strictly over 12 and at most 18, return 'afternoon'
- otherwise return 'morning'

In [None]:
def time_of_day(x):
    if x.hour < 6 or x.hour > 22:    #### COMPLETE THE FUNCTION BELOW ####
        return 'night'
    elif  ..... :
        return ...
    elif ....:
        return ...
    else:
        return ....


### <font color='red'>2.2) Apply the function to the <i>starttime</i> column of <i>data</i> to create a new feature, using the <i>data.starttime.apply(your_function)</i> function</font>

In [None]:
data['start_moment'] = #### APPLY THE FUNCTION TO data.starttime IN ORDER TO CREATE A NEW FEATURE ####
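
One possible sketch of the bucketing logic, applied to toy timestamps (made-up values, not from the dataset). It assumes that the upper bounds 18 and 22 are inclusive, so that every hour falls in exactly one bucket; treat it as an illustration rather than the reference solution.

```python
import pandas as pd

def time_of_day(ts):
    """Map a timestamp to a coarse moment of the day (bounds are an assumption)."""
    hour = ts.hour
    if hour < 6 or hour > 22:
        return 'night'
    elif hour > 18:        # 19..22
        return 'evening'
    elif hour > 12:        # 13..18
        return 'afternoon'
    return 'morning'       # 6..12

# Toy timestamps (hypothetical values)
toy_times = pd.Series(pd.to_datetime([
    '2015-06-01 03:15', '2015-06-01 09:00', '2015-06-01 14:30',
    '2015-06-01 20:45', '2015-06-01 23:10']))

toy_moments = toy_times.apply(time_of_day)
```

Because the branches are tested in order, each `elif` only needs the lower bound: the upper bound is already guaranteed by the branches above it.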

### <font color='red'>2.3) Use <i>seaborn.countplot()</i> to plot the number of trips per moment of the day, with axes labels, and a title</font>

### <font color='red'>2.4) Use <i>seaborn.countplot()</i> to plot the number of trips per weekday and per time of day</font>

## 3) Is rainy <font color='blue'> (5 min) </font>

### <font color='red'>3.1) Define a new binary column, with True if the day is rainy, False otherwise. You can use the <i>.apply()</i> method on the <i>Conditions</i> column of the dataframe</font>
- Hint : you can use a <b>lambda function</b> such as : <i>lambda condition: 'Rain' in condition</i>

In [18]:
data['is_rainy'] = #### CREATE A NEW FEATURE : TRIP HAPPENED ON A RAINY DAY OR NOT ####
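
A sketch of the substring test on a toy column of weather labels. The label values are made up for illustration and may not match the ones in the dataset. The `.str.contains` line is a vectorized alternative to the lambda.

```python
import pandas as pd

# Toy weather labels (hypothetical values)
toy_conditions = pd.Series(['Clear', 'Light Rain', 'Rain', 'Overcast'])

# Substring test with a lambda, as in the hint
rainy_apply = toy_conditions.apply(lambda condition: 'Rain' in condition)

# Vectorized alternative; regex=False asks for a plain substring match
rainy_str = toy_conditions.str.contains('Rain', regex=False)
```

Note that the test is case-sensitive: a label such as 'rain showers' would not match 'Rain'.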

### <font color='red'>3.2) Print <i>samples</i> of data to check if the operation was successful</font>

## 4) Is circle trip <font color='blue'> (5 min) </font>

### <font color='red'>4.1) Define a new binary column, with True if the trip is a loop (it starts and ends at the same station), False otherwise. You can use a boolean comparison of two columns of the dataframe:</font>

- example of boolean condition : <b>data.column_1 == data.column_2</b>

In [20]:
data['is_circle_trip'] = #### USE A COMPARISON OF TWO COLUMNS OF THE DATAFRAME TO CHECK FOR A LOOP ####

### <font color='red'>4.2) Print the proportion of circle trips in the dataset</font>
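
A sketch of the column comparison and of the proportion computation on toy data. The column names `start_station` and `end_station` and the station numbers are hypothetical; use the matching columns of your dataframe.

```python
import pandas as pd

# Toy trips (hypothetical column names and values)
toy_trips = pd.DataFrame({
    'start_station': [72, 79, 82, 72],
    'end_station':   [72, 83, 82, 116],
})

# Comparing two columns returns one boolean per row
toy_is_circle = toy_trips['start_station'] == toy_trips['end_station']

# True counts as 1 and False as 0, so the mean is the proportion of True
toy_proportion = toy_is_circle.mean()
```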

## 5) Understanding trip evolution in June <font color='blue'> (15 min) </font>

### <font color='red'>5.1) Add a new column with the trip day number. Since all trips in the dataset occurred in June, we will name this column "june_day". You can use the <i>datetime.day</i> attribute of the starttime</font>

In [51]:
data['june_day'] = #### ADD A NEW FEATURE: THE DAY IN JUNE ####
                    #### HINT : use data.starttime.apply(your_lambda_function) #####
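
A sketch of extracting the day of the month from toy timestamps (made-up values). The `.dt.day` accessor is shown as a vectorized alternative to the lambda and should give the same values.

```python
import pandas as pd

# Toy timestamps (hypothetical values)
toy_times = pd.Series(pd.to_datetime([
    '2015-06-01 08:30', '2015-06-15 12:00', '2015-06-30 23:59']))

# With a lambda, as in the hint
day_apply = toy_times.apply(lambda t: t.day)

# Vectorized alternative on a datetime column
day_dt = toy_times.dt.day
```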

### <font color='red'>5.2) Group the data by day in June, using <i>data.groupby()</i></font>

In [None]:
grouped_by_data = #### GROUP THE DATA BY JUNE DAY ####

### <font color='red'>5.3) Aggregate the grouped by data with respect to mean temperature, and number of trips, using the <i>.count()</i> and <i>.mean()</i> methods of <i>data.groupby()</i></font>

In [None]:
aggregate_count_trips = grouped_by_data.count()
aggregate_temperature = grouped_by_data.mean()
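
A sketch of what `.count()` and `.mean()` return after a group-by, on a toy dataframe. The values are made up; the column names mirror the ones used in the plotting code further down.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'june_day':     [1, 1, 2, 2, 2],
    'tripduration': [600, 900, 300, 450, 1200],
    'TemperatureC': [20.0, 22.0, 18.0, 18.0, 21.0],
})

toy_grouped = toy.groupby('june_day')

# count() gives the number of non-missing values per column and per group,
# so any column without missing values gives the number of trips per day
toy_counts = toy_grouped.count()

# mean() averages each numeric column within each group
toy_means = toy_grouped.mean()
```

Both results are indexed by `june_day`, which is why the index can later be used as the x axis of a plot.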

### <font color='red'>5.4) Understand what the code below does. How do you interpret the plots? You can look up the following functions:</font>
- fig, ax1 = plt.subplots()
- ax2=ax1.twinx()
- ax1.plot(), ax2.plot()
- any other options to set the ticks labels, colors, titles ...

In [None]:
sns.set_style('white')

fig, ax1 = plt.subplots(figsize=(15,10))

june_day = aggregate_count_trips.index

number_trips = aggregate_count_trips.tripduration
ax1.plot(june_day, number_trips, 'b')
ax1.set_xlabel('Day in June')
for ticklabel in ax1.get_yticklabels():
    ticklabel.set_color('b')
ax1.set_ylabel('Number of trips',color='b')

ax2 = ax1.twinx()
temperature = aggregate_temperature.TemperatureC
ax2.plot(june_day, temperature, 'k--')
for ticklabel in ax2.get_yticklabels():
    ticklabel.set_color('k')
ax2.set_ylabel('Temperature',color='k')
plt.title('Number of trips and temperature per day in June', fontsize=17)
plt.show()

## 6) Trip distances and speeds <font color='blue'> (15 min) </font>

### <font color='red'>Import the haversine package, that computes the haversine distance from one coordinate to another</font>

In [None]:
from haversine import haversine
help(haversine)

### <font color='red'>Understand how the <i>haversine</i> function from the package can be used to compute a distance by running the following blocks</font>

In [None]:
x = data.iloc[0]

In [None]:
print('Traveled haversine distance for trip 0: {:.1f} kilometers'.format(
    haversine((x['start station latitude'], x['start station longitude']),
              (x['end station latitude'], x['end station longitude']))))
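
To see what such a distance function computes, here is a plain-Python sketch of the haversine formula. `haversine_km` is a hypothetical helper written for illustration, not the package's own code, and the Earth radius used is an assumed mean value, so results may differ slightly from the package.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius

def haversine_km(point1, point2):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, point1)
    lat2, lon2 = map(math.radians, point2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree of latitude along a meridian is roughly 111 km
one_degree = haversine_km((0.0, 0.0), (1.0, 0.0))
```

The distance is measured "as the crow flies" between the two stations, so it is a lower bound on the distance actually ridden.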

### <font color='red'>6.1) Complete the following function so it returns, for a given observation, the haversine distance from a start station to the end station</font>

In [None]:
def distance_stations(x):
    start_lat = x['start station latitude']
    start_long = x['start station longitude']
    end_lat = x['end station latitude']
    end_long = x['end station longitude']
    return #### COMPLETE THE FUNCTION SO IT RETURNS THE DISTANCE FROM START TO END ####

### <font color='red'>6.2) What does the following block do? It may take a minute or two to run</font>

In [None]:
data['traveled_distance'] = data.apply(distance_stations, axis=1)

### <font color='red'>6.3) Show <i>samples</i> of data, and see how the traveled distance has been added as a new feature in the dataset</font>

### <font color='red'>6.4) Compute the average speed (in km/h) by dividing two dataframe columns</font>

In [29]:
data['average_speed'] = #### COMPUTE THE AVERAGE SPEED HERE ####
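
A sketch of the unit conversion on toy values. It assumes that the duration is stored in seconds, which is worth checking on your own data before dividing; the column `duration_s` and all values are hypothetical.

```python
import pandas as pd

# Toy trips (hypothetical values); duration ASSUMED to be in seconds
toy = pd.DataFrame({
    'traveled_distance': [2.0, 0.0, 5.0],   # kilometers
    'duration_s':        [600, 300, 900],   # seconds (assumption)
})

# km / hours = km/h, with hours = seconds / 3600
toy['average_speed'] = toy['traveled_distance'] / (toy['duration_s'] / 3600.0)
```

A trip with zero distance (a loop) gets a speed of 0 km/h, which is why such trips need special handling afterwards.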

### <font color='red'>Run the following block to compute the mean speed on non-loopy trips </font>

In [30]:
# Loopy trips have a traveled distance of 0, hence a speed of 0: exclude them from the mean
mean_speed = data.loc[data['average_speed'] != 0, 'average_speed'].mean()

### <font color='red'>6.5) Keep only trips for which the speed is under 50 km/h </font>

In [31]:
data = data[#### ENTER YOUR FILTERING CONDITION HERE ####
            #### WARNING : THIS WILL ERASE THE RAW DATA SO MAKE SURE IT IS CORRECT! ####
            #### MAKE TESTS ON THE SLICING CONDITION BEFORE ERASING THE DATA ####]
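
A sketch of how a filtering condition can be checked before overwriting anything, on toy speeds (made-up values; `keep`, `n_dropped` and `filtered` are hypothetical names).

```python
import pandas as pd

# Toy speeds in km/h (hypothetical values)
toy = pd.DataFrame({'average_speed': [12.0, 75.0, 0.0, 49.9]})

# Build the boolean mask first, without modifying the dataframe
keep = toy['average_speed'] < 50

# Inspect how many rows would be dropped before committing
n_dropped = (~keep).sum()

# Only then apply the mask
filtered = toy[keep]
```

Storing the result in a new variable first, as done here, keeps the unfiltered dataframe available until you are sure the condition is right.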

### <font color='red'>6.6) What does the following block do?</font>

In [32]:
data.loc[data['average_speed']==0,'average_speed'] = mean_speed

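### <font color='red'>6.7) Using <i>seaborn.distplot</i>, plot the distribution of speeds (in recent seaborn versions <i>distplot</i> is deprecated; <i>seaborn.histplot</i> is its replacement)</font>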

In [None]:
plt.figure(figsize=(20,10))
#### PLOT THE DISTRIBUTION OF SPEEDS ####

### <font color='red'>6.8) Bonus question: plot the average speed vs. the total distance. What do you observe? You can use the following: </font>
- seaborn.regplot()
- data_sample = data.sample(1000), so as not to overload the graph

# Save dataset to csv file

In [218]:
data.to_csv('my_data_after_feature_engineering.csv', index=False)

# 7) Imagine and build your own features!