# Feature Selection and Data Integration

In this report we select features from the different data files and integrate the selected features into a single table.

In [1]:
# load necessary packages
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
from IPython.display import display

# load custom modules
from integrate_data import add_new_station_id, add_bikes_available_future, integrate_data

In [2]:
# show the figures within the notebook
%matplotlib inline

# Select ggplot as style
plt.style.use("ggplot")

## Feature Selection

### Features from *weather* Data

From [another report](http://localhost:8891/notebooks/data_exploration/explore_weather_data.ipynb) we found that the features in the *weather* data can be reduced to the following:
1. mean_temperature_f
2. mean_humidity
3. mean_sea_level_pressure_inches
4. mean_visibility_miles
5. max_wind_Speed_mph
6. mean_wind_speed_mph
7. max_gust_speed_mph
8. precipitation_inches
9. cloud_cover
10. wind_dir_degrees
11. date
12. events

Although they are not well correlated, *max_wind_Speed_mph*, *mean_wind_speed_mph* and *max_gust_speed_mph* all describe the strength of the wind, so we keep only *mean_wind_speed_mph* among them. Based on the feature selection results of [Ashqar et al. (2017)](http://ieeexplore.ieee.org/abstract/document/8005700/), we also discard *mean_sea_level_pressure_inches*, *cloud_cover* and *wind_dir_degrees* because they are not important for predicting future bike availability.
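One way to check how the three wind features relate before dropping two of them is a pairwise correlation matrix. The sketch below is illustrative only: the values in `toy_weather` are made up and are not taken from the weather file, so the resulting numbers say nothing about the real data. On the actual data the same call would be `df_weather[wind_columns].corr()`.

```python
import pandas as pd

# Hypothetical toy values, for illustration only (NOT from weather_fixed.csv).
toy_weather = pd.DataFrame({
    "max_wind_Speed_mph": [20.0, 25.0, 18.0, 30.0, 22.0],
    "mean_wind_speed_mph": [11.0, 13.0, 9.0, 15.0, 12.0],
    "max_gust_speed_mph": [28.0, 31.0, 24.0, 40.0, 27.0],
})

wind_columns = ["max_wind_Speed_mph", "mean_wind_speed_mph", "max_gust_speed_mph"]

# Pairwise Pearson correlation between the three wind features.
wind_corr = toy_weather[wind_columns].corr()
print(wind_corr.round(2))
```

Whatever the correlation turns out to be, the decision in this report rests on the three columns describing the same physical quantity, not on the correlation alone.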

In [3]:
selected_columns = ['date', 'mean_temperature_f', 'mean_humidity', 'mean_visibility_miles',
                    'mean_wind_speed_mph', 'precipitation_inches', 'events']

# read the selected data
df_weather = pd.read_csv("../data/weather_fixed.csv", parse_dates=["date"])
df_weather_selected = df_weather.loc[:, selected_columns]

# display the selected data
display(df_weather_selected.head())

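|   | date | mean_temperature_f | mean_humidity | mean_visibility_miles | mean_wind_speed_mph | precipitation_inches | events |
|---|------|--------------------|---------------|-----------------------|---------------------|----------------------|--------|
| 0 | 2013-08-29 | 68.0 | 75.0 | 10.0 | 11.0 | 0.0 | No-Event |
| 1 | 2013-08-30 | 69.0 | 70.0 | 10.0 | 13.0 | 0.0 | No-Event |
| 2 | 2013-08-31 | 64.0 | 75.0 | 10.0 | 15.0 | 0.0 | No-Event |
| 3 | 2013-09-01 | 66.0 | 68.0 | 10.0 | 13.0 | 0.0 | No-Event |
| 4 | 2013-09-02 | 69.0 | 77.0 | 10.0 | 12.0 | 0.0 | No-Event |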


### Features from *status* Data

In [4]:
time_res = "15"
# read the data
df_status_res15 = pd.read_csv("../data/status_time_res_" + time_res + "min.csv", parse_dates=["time"])

# display the data
display(df_status_res15.head(5))

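|   | station_id | bikes_available | docks_available | time |
|---|------------|-----------------|-----------------|------|
| 0 | 2 | 2 | 25 | 2013-08-29 12:15:00 |
| 1 | 2 | 2 | 25 | 2013-08-29 12:30:00 |
| 2 | 2 | 2 | 25 | 2013-08-29 12:45:00 |
| 3 | 2 | 2 | 25 | 2013-08-29 13:00:00 |
| 4 | 2 | 3 | 24 | 2013-08-29 13:15:00 |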



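From the *status* data we use the current *bikes_available* as a feature and add *bikes_available_future*, the number of bikes available one prediction horizon later.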
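The next cell relies on two custom helpers whose implementation is not shown in this report. As a rough sketch of what such a step could look like, the function below maps `station_id` onto sequential integers and pairs each row with the bike count one horizon later. The name `add_future_target`, the zero-based numbering and the toy data are assumptions made for illustration; this is not the code in `integrate_data.py`.

```python
import pandas as pd


def add_future_target(df, time_res=15, horizon_time=15):
    """Hypothetical stand-in for the custom helpers; not the author's code."""
    out = df.sort_values(["station_id", "time"]).copy()
    # Map station_id onto sequential integers 0..n-1 (zero-based start is an assumption).
    out["new_id"] = pd.factorize(out["station_id"], sort=True)[0]
    # Number of rows that correspond to the prediction horizon.
    steps = horizon_time // time_res
    # Within each station, look `steps` rows ahead; the last rows get NaN.
    out["bikes_available_future"] = (
        out.groupby("station_id")["bikes_available"].shift(-steps)
    )
    # Rows without a known future value cannot be used, so drop them.
    return out.dropna(subset=["bikes_available_future"])


# Toy status data for two stations (illustrative values only).
toy_status = pd.DataFrame({
    "station_id": [2, 2, 2, 67, 67, 67],
    "bikes_available": [2, 2, 3, 1, 0, 1],
    "time": list(pd.date_range("2013-08-29 12:15", periods=3, freq="15min")) * 2,
})

result = add_future_target(toy_status)
print(result)
```

Shifting within each station keeps the last reading of one station from being paired with the first reading of the next. The shifted column becomes floating point because of the intermediate `NaN` values.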

In [5]:
time_res = 15
new_id = 56

# add a new_id column that maps station_id onto sequential integers
df_status = add_new_station_id(time_res=time_res)

# add a bikes_available_future column to the status data of the station with the given new_id
df_status = add_bikes_available_future(df_status, new_id=new_id,
                                       horizon_time=time_res)

# integrate the data
df_weather = pd.read_csv("../data/weather_fixed.csv", parse_dates=["date"])
df = integrate_data(df_status, df_weather, new_id)

# display the data
display(df.head())

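|   | station_id | new_id | time_of_day | day_of_week | month_of_year | mean_temperature_f | mean_humidity | mean_visibility_miles | mean_wind_speed_mph | precipitation_inches | events | bikes_available | bikes_available_future |
|---|------------|--------|-------------|-------------|---------------|--------------------|---------------|-----------------------|---------------------|----------------------|--------|-----------------|------------------------|
| 0 | 67 | 56 | 1215 | 3 | 8 | 68.0 | 75.0 | 10.0 | 11.0 | 0.0 | No-Event | 1 | 0.0 |
| 1 | 67 | 56 | 1230 | 3 | 8 | 68.0 | 75.0 | 10.0 | 11.0 | 0.0 | No-Event | 0 | 1.0 |
| 2 | 67 | 56 | 1245 | 3 | 8 | 68.0 | 75.0 | 10.0 | 11.0 | 0.0 | No-Event | 1 | 2.0 |
| 3 | 67 | 56 | 1300 | 3 | 8 | 68.0 | 75.0 | 10.0 | 11.0 | 0.0 | No-Event | 2 | 0.0 |
| 4 | 67 | 56 | 1315 | 3 | 8 | 68.0 | 75.0 | 10.0 | 11.0 | 0.0 | No-Event | 0 | 1.0 |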
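The integrated table combines 15-minute status rows with daily weather rows, so the join has to happen on the calendar date. The sketch below shows one way this could be done with pandas. The function name `integrate_sketch`, the HHMM encoding of `time_of_day` and the Monday=0 convention for `day_of_week` are assumptions inferred from the output above; the real `integrate_data` may work differently.

```python
import pandas as pd


def integrate_sketch(status, weather):
    """Hypothetical sketch of the join; the real integrate_data may differ."""
    out = status.copy()
    # Calendar features derived from the timestamp.
    out["time_of_day"] = out["time"].dt.hour * 100 + out["time"].dt.minute  # HHMM as an integer
    out["day_of_week"] = out["time"].dt.dayofweek  # Monday=0, so Thursday=3
    out["month_of_year"] = out["time"].dt.month
    # Daily weather is keyed by date, so strip the time part before joining.
    out["date"] = out["time"].dt.normalize()
    # Many status rows share one weather row; validate guards against duplicate dates.
    out = out.merge(weather, on="date", how="left", validate="many_to_one")
    return out.drop(columns=["time", "date"])


# Two status rows and the matching weather row (weather values as displayed earlier).
toy_status = pd.DataFrame({
    "station_id": [67, 67],
    "bikes_available": [1, 0],
    "time": pd.to_datetime(["2013-08-29 12:15:00", "2013-08-29 12:30:00"]),
})
toy_weather = pd.DataFrame({
    "date": pd.to_datetime(["2013-08-29"]),
    "mean_temperature_f": [68.0],
    "events": ["No-Event"],
})

merged = integrate_sketch(toy_status, toy_weather)
print(merged)
```

A left join keeps every status row even when a date is missing from the weather data; such rows would then carry `NaN` in the weather columns and need separate handling.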


In [6]:
# store the output into a .csv file
df.to_csv("../data/integrated_data_station_" + str(new_id) + ".csv")
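Because `to_csv` writes the row index by default, reading the stored file back with a plain `pd.read_csv` produces an extra `Unnamed: 0` column. The small round trip below illustrates this with an in-memory buffer and a made-up two-row frame; `index_col=0` restores the index on reading, and passing `index=False` to `to_csv` would avoid writing it in the first place.

```python
import io

import pandas as pd

# Made-up two-row frame standing in for the integrated data.
toy = pd.DataFrame({"new_id": [56, 56], "bikes_available": [1, 0]})

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
with_extra = pd.read_csv(buffer)  # the index comes back as "Unnamed: 0"

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # the index is restored, no extra column

print(list(with_extra.columns))
print(list(restored.columns))
```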