# One-Hot Encoding
A key difference between scikit-learn and many other machine learning tools that users may come across (such as the machine learning packages used in R and RStudio) is how factor, or categorical, variables are handled.

scikit-learn does not handle categorical variables as other libraries might; instead, the data must first be converted to numerical form. This is done through one-hot encoding.

One-hot encoding takes an unordered categorical variable and represents each category with its own binary column, rather than assigning it an arbitrary number. For example, days of the week do not represent a linear trend, but if we input them as 1, 2, 3, 4, 5, 6, 7 the model would treat them as ordered numeric values. Instead, we create a new column for each day of the week and assign a 1 if the row falls on that day and a 0 otherwise.
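As a minimal sketch of the idea, the snippet below encodes a small made-up table. The DataFrame contents and the column names `week` and `temp` are illustrative assumptions, not the NOAA data used later. It uses `pd.get_dummies` with `dtype=int` so that the new columns hold 0/1 values (recent pandas versions otherwise return booleans).

```python
import pandas as pd

# Toy data (hypothetical): a day-of-week label and a temperature per row
df = pd.DataFrame({
    "week": ["Mon", "Tues", "Mon", "Sun"],
    "temp": [40, 44, 41, 45],
})

# Replace the 'week' column with one 0/1 column per distinct day.
# dtype=int requests integers instead of booleans.
encoded = pd.get_dummies(df, columns=["week"], dtype=int)
print(encoded)
```

Each row has exactly one 1 across the `week_*` columns, which is where the name "one-hot" comes from.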

## Example
To illustrate this, we will load some weather data from the NOAA website and one-hot encode the categorical variables.

In [3]:
# Pandas is used for data manipulation
import pandas as pd

# Read in data as pandas dataframe and display first 5 rows
features = pd.read_csv('temps.csv')
features.head(5)

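| Unnamed: 0 | year | month | day | week | temp_2 | temp_1 | average | actual | forecast_noaa | forecast_acc | forecast_under | friend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | 1 | 1 | Fri | 45 | 45 | 45.6 | 45 | 43 | 50 | 44 | 29 |
| 1 | 2016 | 1 | 2 | Sat | 44 | 45 | 45.7 | 44 | 41 | 50 | 44 | 61 |
| 2 | 2016 | 1 | 3 | Sun | 45 | 44 | 45.8 | 41 | 43 | 46 | 47 | 56 |
| 3 | 2016 | 1 | 4 | Mon | 44 | 41 | 45.9 | 40 | 44 | 48 | 46 | 53 |
| 4 | 2016 | 1 | 5 | Tues | 41 | 40 | 46.0 | 44 | 46 | 46 | 46 | 41 |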


In [4]:
# Use datetime for dealing with dates
import datetime

# Get years, months, and days
years = features['year']
months = features['month']
days = features['day']

# Build a list of date strings, then convert each one to a datetime object
dates = [str(int(year)) + '-' + str(int(month)) + '-' + str(int(day)) for year, month, day in zip(years, months, days)]
dates = [datetime.datetime.strptime(date, '%Y-%m-%d') for date in dates]

In [5]:
# One-hot encode categorical features
features = pd.get_dummies(features)
features.head(5)

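| Unnamed: 0 | year | month | day | temp_2 | temp_1 | average | actual | forecast_noaa | forecast_acc | forecast_under | friend | week_Fri | week_Mon | week_Sat | week_Sun | week_Thurs | week_Tues | week_Wed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | 1 | 1 | 45 | 45 | 45.6 | 45 | 43 | 50 | 44 | 29 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2016 | 1 | 2 | 44 | 45 | 45.7 | 44 | 41 | 50 | 44 | 61 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2016 | 1 | 3 | 45 | 44 | 45.8 | 41 | 43 | 46 | 47 | 56 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 2016 | 1 | 4 | 44 | 41 | 45.9 | 40 | 44 | 48 | 46 | 53 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2016 | 1 | 5 | 41 | 40 | 46.0 | 44 | 46 | 46 | 46 | 41 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |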


Seven new dummy variables have been created, one for each day of the week, and the original `week` column has been replaced by them.

## Try it out

Try one-hot encoding some data of your own.
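If you would rather do the encoding inside scikit-learn itself, one possible starting point is its `OneHotEncoder` class. The sketch below is only an illustration: the `days` array is made-up sample data, and it assumes scikit-learn and NumPy are installed. `fit_transform` returns a sparse matrix by default, so `toarray()` is called to get a dense array that is easier to inspect.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample: a single categorical feature, one row per observation
days = np.array([["Fri"], ["Sat"], ["Sun"], ["Fri"]])

encoder = OneHotEncoder()

# Learn the categories and encode in one step; convert sparse output to dense
encoded = encoder.fit_transform(days).toarray()

print(encoder.categories_[0])  # categories learned for the feature
print(encoded)                 # one column per category, one 1.0 per row
```

Unlike `pd.get_dummies`, a fitted encoder remembers its categories, so the same column layout can be applied to new data with `encoder.transform`.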