# Feature Engineering Exercise
Your task is to engineer some new features to try to improve a model's ability to predict the total number of bike share rentals during a given hour of the day.

## 1. Import the data, then drop the 'casual' and 'registered' columns.
These two columns add up to your target, 'count', so keeping them would leak the answer into the features.


In [3]:
# import packages
import pandas as pd

In [5]:
# import data
df = pd.read_csv('Data/bikeshare_train - bikeshare_train.csv')

In [8]:
# preview data
df.head()

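| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 0:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 1:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 2:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 |
| 3 | 2011-01-01 3:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 |
| 4 | 2011-01-01 4:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 |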


In [9]:
# drop casual and registered columns
df = df.drop(columns=['casual','registered'])

In [10]:
# verify changes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   count       10886 non-null  int64  
dtypes: float64(3), int64(6), object(1)
memory usage: 850.6+ KB
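
The five preview rows suggest why these columns are redundant: each 'count' value equals 'casual' plus 'registered'. A minimal sketch of that check, using only the preview rows shown earlier (the name `preview` is hypothetical, and whether the relationship holds for every row of the full file is an assumption you would confirm by running the same comparison on `df` before the drop):

```python
import pandas as pd

# First five rows of the data preview (rental columns only).
preview = pd.DataFrame({
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "count": [16, 40, 32, 13, 1],
})

# True when every row's total equals casual + registered.
is_sum = bool((preview["casual"] + preview["registered"]).eq(preview["count"]).all())
print(is_sum)
```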


## 2. Transform the 'datetime' column into a datetime type and use it to create 3 new columns in the data frame containing the:

In [12]:
# convert datetime column to datetime datatype
df['datetime'] = pd.to_datetime(df['datetime'])

In [13]:
# verify change
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   count       10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(6)
memory usage: 850.6 KB


### Name of the Month

In [14]:
df['Month Name'] = df['datetime'].dt.month_name()

### Name of the Day of the Week

In [17]:
df['DOTW'] = df['datetime'].dt.day_name()

### Hour of the Day

In [25]:
df['Hour'] = df['datetime'].dt.hour.astype('object')

#### 1. Make sure all 3 new columns are 'object' datatype so they can be one-hot encoded later.

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   count       10886 non-null  int64         
 10  Month Name  10886 non-null  object        
 11  DOTW        10886 non-null  object        
 12  Hour        10886 non-null  object        
dtypes: datetime64[ns](1), float64(3), int64(6), object(3)
memory usage: 1.1+ MB
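
The dtype matters because `pd.get_dummies`, when called without a `columns` argument, only encodes object, string, and category columns; an integer 'Hour' would be passed through as a number. The sketch below shows this on a toy frame. The rows and the name `toy` are illustrative assumptions, and the exercise may intend a different encoder later on.

```python
import pandas as pd

# Toy frame with made-up counts; column names mirror the exercise.
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2011-01-01 00:00:00",
        "2011-01-03 08:00:00",
        "2011-02-05 17:00:00",
    ]),
    "count": [16, 120, 95],
})
toy["Month Name"] = toy["datetime"].dt.month_name()
toy["DOTW"] = toy["datetime"].dt.day_name()
toy["Hour"] = toy["datetime"].dt.hour.astype("object")

# With no `columns` argument, only the three text/object columns are encoded;
# the numeric 'count' column is left as-is.
encoded = pd.get_dummies(toy.drop(columns="datetime"))
print(encoded.columns.tolist())
```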


#### 2. Drop the 'datetime' and 'season' columns. These are now redundant.

In [27]:
df = df.drop(columns=['datetime','season'])

In [28]:
# verify changes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   holiday     10886 non-null  int64  
 1   workingday  10886 non-null  int64  
 2   weather     10886 non-null  int64  
 3   temp        10886 non-null  float64
 4   atemp       10886 non-null  float64
 5   humidity    10886 non-null  int64  
 6   windspeed   10886 non-null  float64
 7   count       10886 non-null  int64  
 8   Month Name  10886 non-null  object 
 9   DOTW        10886 non-null  object 
 10  Hour        10886 non-null  object 
dtypes: float64(3), int64(5), object(3)
memory usage: 935.6+ KB
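
Dropping 'season' is safe only if each month maps to exactly one season. A check along these lines, run before the drop, would confirm it. This is a sketch on illustrative rows: the season codes in `toy` are assumed for the example, not read from the data.

```python
import pandas as pd

# Illustrative rows only; the season codes are assumptions for the example.
toy = pd.DataFrame({
    "Month Name": ["January", "January", "April", "April", "July"],
    "season": [1, 1, 2, 2, 3],
})

# Count the distinct season values seen within each month.
seasons_per_month = toy.groupby("Month Name")["season"].nunique()

# True when no month spans more than one season, i.e. month implies season.
month_determines_season = bool((seasons_per_month == 1).all())
print(month_determines_season)
```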


## 3. The temperatures in the 'temp' and 'atemp' columns are in Celsius. Use `.apply()` and a Lambda function to convert them to Fahrenheit.

In [32]:
# convert temp from c to f
df['temp'] = df['temp'].apply(lambda x: x*9/5+32)

In [38]:
# verify changes
df['temp'].head()

0    49.712
1    48.236
2    48.236
3    49.712
4    49.712
Name: temp, dtype: float64

In [35]:
# convert atemp from c to f
df['atemp'] = df['atemp'].apply(lambda x: x*9/5+32)

In [39]:
# verify changes
df['atemp'].head()

0    57.911
1    56.543
2    56.543
3    57.911
4    57.911
Name: atemp, dtype: float64
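
The exercise asks for `.apply()` with a lambda, but the same arithmetic can also be written directly on the Series, which pandas evaluates element-wise without a Python-level loop. A small sketch comparing the two on three Celsius values taken from the data preview (the names `celsius`, `via_apply`, and `vectorized` are hypothetical):

```python
import numpy as np
import pandas as pd

# Celsius values from the first preview rows of 'temp' and 'atemp'.
celsius = pd.Series([9.84, 9.02, 14.395])

# Approach used in the exercise: a lambda applied to each element.
via_apply = celsius.apply(lambda x: x * 9 / 5 + 32)

# Equivalent vectorized arithmetic on the whole Series.
vectorized = celsius * 9 / 5 + 32

# Both approaches should agree to floating-point tolerance.
same = bool(np.allclose(via_apply, vectorized))
print(same)
```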

## 4. Create a new column, 'temp_variance', which shows how much warmer or colder the current temperature ('temp') is than the average temperature for that day of the year ('atemp').
### If the current temperature is warmer than average ('atemp'), the value in 'temp_variance' should be positive.

In [62]:
# create function for temp_variance column
def temp_variance(temp, avg_temp):
    # positive when the current temperature is warmer than the average
    return temp - avg_temp

In [67]:
# create new column by running function
df['temp_variance'] = temp_variance(df['temp'], df['atemp'])

In [73]:
# preview changes
df.head()

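| | holiday | workingday | weather | temp | atemp | humidity | windspeed | count | Month Name | DOTW | Hour | temp_variance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 49.712 | 57.911 | 81 | 0.0 | 16 | January | Saturday | 0 | -8.199 |
| 1 | 0 | 0 | 1 | 48.236 | 56.543 | 80 | 0.0 | 40 | January | Saturday | 1 | -8.307 |
| 2 | 0 | 0 | 1 | 48.236 | 56.543 | 80 | 0.0 | 32 | January | Saturday | 2 | -8.307 |
| 3 | 0 | 0 | 1 | 49.712 | 57.911 | 75 | 0.0 | 13 | January | Saturday | 3 | -8.199 |
| 4 | 0 | 0 | 1 | 49.712 | 57.911 | 75 | 0.0 | 1 | January | Saturday | 4 | -8.199 |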
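
One way to confirm the sign convention is to compare the new column against a direct comparison of 'temp' and 'atemp'. The sketch below does this on a toy frame: the first row comes from the preview, while the other two rows and the name `toy` are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "temp": [49.712, 70.0, 60.0],   # first value from the preview, rest made up
    "atemp": [57.911, 65.0, 60.0],
})
toy["temp_variance"] = toy["temp"] - toy["atemp"]

# The variance is positive exactly where 'temp' exceeds 'atemp'.
is_positive = toy["temp_variance"] > 0
is_warmer = toy["temp"] > toy["atemp"]
sign_ok = bool((is_positive == is_warmer).all())
print(sign_ok)
```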


#### 1. Drop the 'atemp' column.

In [74]:
df = df.drop(columns=['atemp'])

In [75]:
# verify changes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   holiday        10886 non-null  int64  
 1   workingday     10886 non-null  int64  
 2   weather        10886 non-null  int64  
 3   temp           10886 non-null  float64
 4   humidity       10886 non-null  int64  
 5   windspeed      10886 non-null  float64
 6   count          10886 non-null  int64  
 7   Month Name     10886 non-null  object 
 8   DOTW           10886 non-null  object 
 9   Hour           10886 non-null  object 
 10  temp_variance  10886 non-null  float64
dtypes: float64(3), int64(5), object(3)
memory usage: 935.6+ KB
