# Feature Engineering
* Creation of new features based on existing features
* Insight into relationships between features
* Extract and expand data

## 1. Encoding categorical variables
Most machine learning models require numerical input, so categorical variables must be encoded as numbers before modeling.

### Encoding binary variables - Pandas
In pandas, we can use the apply method to encode a binary DataFrame column as 1s and 0s.
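
A minimal sketch of the pandas approach. The `toy` DataFrame and its values are made up for illustration and are not the hiking data.

```python
import pandas as pd

# Hypothetical stand-in for a binary Y/N column
toy = pd.DataFrame({"Accessible": ["Y", "N", "N", "Y"]})

# Map "Y" to 1 and every other value to 0
toy["Accessible_enc"] = toy["Accessible"].apply(lambda val: 1 if val == "Y" else 0)

print(toy)
```

Note that this particular lambda also sends missing values to 0, which may or may not be what you want.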

### Encoding binary variables - scikit-learn
In scikit-learn, we can use LabelEncoder.

In [3]:
import pandas as pd
hiking = pd.read_json('hiking.json')
hiking.columns

Index(['Accessible', 'Difficulty', 'Length', 'Limited_Access', 'Location',
       'Name', 'Other_Details', 'Park_Name', 'Prop_ID', 'lat', 'lon'],
      dtype='object')

In [4]:
hiking.head()

| | Accessible | Difficulty | Length | Limited_Access | Location | Name | Other_Details | Park_Name | Prop_ID | lat | lon |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Y | | 0.8 miles | N | Enter behind the Salt Marsh Nature Center, loc... | Salt Marsh Nature Trail | \<p>The first half of this mile-long trail foll... | Marine Park | B057 | | |
| 1 | N | Easy | 1.0 mile | N | Enter Park at Lincoln Road and Ocean Avenue en... | Lullwater | Explore the Lullwater to see how nature thrive... | Prospect Park | B073 | | |
| 2 | N | Easy | 0.75 miles | N | Enter Park at Lincoln Road and Ocean Avenue en... | Midwood | Step back in time with a walk through Brooklyn... | Prospect Park | B073 | | |
| 3 | N | Easy | 0.5 miles | N | Enter Park at Lincoln Road and Ocean Avenue en... | Peninsula | Discover how the Peninsula has changed over th... | Prospect Park | B073 | | |
| 4 | N | Easy | 0.5 miles | N | Enter Park at Lincoln Road and Ocean Avenue en... | Waterfall | Trace the source of the Lake on the Waterfall ... | Prospect Park | B073 | | |


Take a look at the hiking dataset. Several columns need to be encoded before they can 
be used in a model, and one of them is the Accessible column. Accessible is a binary 
feature with two values, Y or N, so it needs to be encoded into 1s and 0s. 
Use scikit-learn's LabelEncoder class to do that transformation.

In [5]:
# Set up the LabelEncoder object
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()

# Apply the encoding to the "Accessible" column
hiking["Accessible_enc"] = enc.fit_transform(hiking["Accessible"])

# Compare the two columns
print(hiking[["Accessible_enc", "Accessible"]].head())

   Accessible_enc Accessible
0               1          Y
1               0          N
2               0          N
3               0          N
4               0          N
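
The output above maps Y to 1 and N to 0. A short sketch of why, using a made-up list of labels (the names `toy_enc` and `codes` are illustrative): LabelEncoder sorts the distinct labels and uses each label's position as its code.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the Accessible column
toy_enc = LabelEncoder()
codes = toy_enc.fit_transform(["Y", "N", "N", "Y"])

# classes_ holds the distinct labels in sorted order; the index is the code
print(toy_enc.classes_)
print(codes)

# inverse_transform recovers the original labels from the codes
print(toy_enc.inverse_transform(codes))
```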


### Encoding categorical variables - One-hot encoding
One-hot encoding encodes a categorical variable into 1s and 0s when it has more than two 
categories: each category becomes its own column, with a 1 marking the rows that belong to it.

<img src = "one-hot.png">

One of the columns in the volunteer dataset, category_desc, gives category descriptions 
for the volunteer opportunities listed. Because it is a categorical variable with more than 
two categories, we need to use one-hot encoding to transform this column numerically. 
Use Pandas' get_dummies() function to do so.

In [12]:
# making data frame from csv file 
volunteer = pd.read_csv("volunteer_opportunities.csv") 
# Transform the category_desc column
category_enc = pd.get_dummies(volunteer["category_desc"])
print(type(category_enc))
print(category_enc.shape)
print(volunteer["category_desc"].head())
print(category_enc.head())

<class 'pandas.core.frame.DataFrame'>
(665, 6)
0                          NaN
1    Strengthening Communities
2    Strengthening Communities
3    Strengthening Communities
4                  Environment
Name: category_desc, dtype: object
   Education  Emergency Preparedness  Environment  Health  \
0          0                       0            0       0   
1          0                       0            0       0   
2          0                       0            0       0   
3          0                       0            0       0   
4          0                       0            1       0   

   Helping Neighbors in Need  Strengthening Communities  
0                          0                          0  
1                          0                          1  
2                          0                          1  
3                          0                          1  
4                          0                          0  


In [None]:
# Concatenate this set back to the volunteer DataFrame
volunteer_new = pd.concat([volunteer, category_enc], axis=1)

# Take a look at the encoded columns
print(volunteer.shape)
print(volunteer_new.shape)
print(category_enc.shape)
print(category_enc.head())
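
Row 0 of category_desc is missing (NaN), and get_dummies() encodes it as a row of all zeros. The sketch below, on a made-up Series (`toy_cat` is an illustrative name), shows the default behaviour next to the `dummy_na=True` option, which adds an explicit indicator column for missing values. The `prefix` value is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the category_desc column
toy_cat = pd.Series([np.nan, "Environment", "Health", "Environment"],
                    name="category_desc")

# Default: a missing value becomes a row of all zeros
default_enc = pd.get_dummies(toy_cat)

# dummy_na=True adds one more column that flags missing values
with_na = pd.get_dummies(toy_cat, prefix="cat", dummy_na=True)

# astype(int) keeps the printout as 1s and 0s regardless of pandas version
print(default_enc.astype(int))
print(with_na.astype(int))
```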

## 2. Engineering numerical features - datetime
Several columns in the volunteer dataset contain dates stored as strings. Let's take a look at the start_date_date column and extract just the month to use as a feature for modeling.

In [13]:
print(volunteer["start_date_date"].head())

0        July 30 2011
1    February 01 2011
2     January 29 2011
3    February 14 2011
4    February 05 2011
Name: start_date_date, dtype: object


In [19]:

# First, convert string column to date column
volunteer["start_date_converted"] = pd.to_datetime(volunteer["start_date_date"])

# Method 1: loop over the rows (slow; uses .loc to avoid chained assignment)
#for index, row in volunteer.iterrows():
#    volunteer.loc[index, "start_date_month"] = row["start_date_converted"].month

# Method 2: apply a named function to each value
#def return_month(date):
#     return date.month
#volunteer["start_date_month"] = volunteer["start_date_converted"].apply(return_month)

# Method 3: apply a lambda to each value
# Extract just the month from the converted column
volunteer["start_date_month"] = volunteer["start_date_converted"].apply(lambda date: date.month)

# Take a look at the original and new columns
print(volunteer[["start_date_date","start_date_converted", "start_date_month"]].head())

    start_date_date start_date_converted  start_date_month
0      July 30 2011           2011-07-30                 7
1  February 01 2011           2011-02-01                 2
2   January 29 2011           2011-01-29                 1
3  February 14 2011           2011-02-14                 2
4  February 05 2011           2011-02-05                 2
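
For datetime columns, pandas also provides the vectorized `.dt` accessor, which extracts the month without a Python-level function call per row. A minimal sketch on a made-up frame (`toy_dates` is an illustrative name; the date strings copy the format shown above):

```python
import pandas as pd

# Hypothetical frame with date strings in the same format as start_date_date
toy_dates = pd.DataFrame(
    {"start_date_date": ["July 30 2011", "February 01 2011", "January 29 2011"]}
)
toy_dates["start_date_converted"] = pd.to_datetime(toy_dates["start_date_date"])

# .dt exposes datetime components for the whole column at once
toy_dates["start_date_month"] = toy_dates["start_date_converted"].dt.month

print(toy_dates)
```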


## 3. Engineering features from strings - extraction
The Length column in the hiking dataset is a column of strings, but each string contains 
the mileage for the hike. We're going to extract this mileage using regular expressions, 
and then use Pandas' apply to run the extraction on every row of the DataFrame.

In [21]:
print(hiking["Length"].head())

0     0.8 miles
1      1.0 mile
2    0.75 miles
3     0.5 miles
4     0.5 miles
Name: Length, dtype: object


### Regular Expressions (Regex)
http://www.ntu.edu.sg/home/ehchua/programming/howto/regexe.html

In [23]:
import re

# Compile a pattern that matches a decimal number such as 0.75 or 12.3
pattern = re.compile(r"\d+\.\d+")

def return_mileage(length):
    # Search the text for the first decimal number
    mile = pattern.search(str(length))

    # If a match is found, use group(0) to return the matched text as a float
    if mile is not None:
        return float(mile.group(0))
    # No decimal number found (e.g. "Various" or a missing value)
    return None

# Apply the function to the Length column and take a look at both columns
hiking["Length_num"] = hiking["Length"].apply(return_mileage)
print(hiking[["Length", "Length_num"]])


        Length  Length_num
0    0.8 miles        0.80
1     1.0 mile        1.00
2   0.75 miles        0.75
3    0.5 miles        0.50
4    0.5 miles        0.50
5      Various         NaN
6    1.7 miles        1.70
7    2.4 miles        2.40
8     1.0 mile        1.00
9    3.0 miles        3.00
10  12.3 miles       12.30
11  0.85 miles        0.85
12   4.0 miles        4.00
13   7.6 miles        7.60
14   8.0 miles        8.00
15   0.5 miles        0.50
16   7.6 miles        7.60
17  0.75 miles        0.75
18  0.25 miles        0.25
19  12.3 miles       12.30
20   1.4 miles        1.40
21  1.25 miles        1.25
22   1.5 miles        1.50
23    1.1 mile        1.10
24   1.5 miles        1.50
25   1.2 miles        1.20
26  0.75 miles        0.75
27   1.5 miles        1.50
28   3.0 miles        3.00
29        None         NaN
30        None         NaN
31        None         NaN
32        None         NaN
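
As a possible alternative to apply with a Python function, pandas' `Series.str.extract` runs a regular expression over the whole column. The sketch below uses a made-up Series (`toy_len` is an illustrative name) with the same kinds of values as Length.

```python
import pandas as pd

# Hypothetical stand-in for the Length column
toy_len = pd.Series(["0.8 miles", "1.0 mile", "Various", None], name="Length")

# The capture group is the extracted value; rows without a match become NaN
toy_num = toy_len.str.extract(r"(\d+\.\d+)", expand=False).astype(float)

print(toy_num)
```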


## 4. Engineering features from strings - tf-idf
Let's transform the volunteer dataset's title column into tf-idf text vectors 
to use in a prediction task.

### Bag of Words Reference:

http://datameetsmedia.com/bag-of-words-tf-idf-explained/

https://medium.freecodecamp.org/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Take the title text
title_text = volunteer["title"]
#print(title_text.head())
# Create the vectorizer object
tfidf_vec = TfidfVectorizer()
# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)
print(type(text_tfidf))
print(text_tfidf.toarray().shape)
#print(pd.DataFrame(text_tfidf.toarray()).head())
print(text_tfidf.toarray()[0][1086])
print(pd.DataFrame(text_tfidf.toarray()).columns)


<class 'scipy.sparse.csr.csr_matrix'>
(665, 1136)
0.2304728774077965
RangeIndex(start=0, stop=1136, step=1)
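
To make the shape and indexing above easier to read, here is a small sketch on three made-up titles (`toy_titles` is illustrative, not taken from the dataset). Each row of the result is a document and each column is a distinct token, which is why `text_tfidf.toarray()[0][1086]` reads the weight of one token in the first title.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical titles for illustration
toy_titles = ["volunteers needed", "web designer needed", "garden volunteers"]

toy_vec = TfidfVectorizer()
toy_tfidf = toy_vec.fit_transform(toy_titles)

# One row per document, one column per distinct token
print(toy_tfidf.shape)

# vocabulary_ maps each token to its column index (tokens are sorted alphabetically)
print(toy_vec.vocabulary_)

# Weight of "volunteers" in the first title
print(toy_tfidf[0, toy_vec.vocabulary_["volunteers"]])
```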


In [33]:
from sklearn.model_selection import train_test_split  
from sklearn.naive_bayes import GaussianNB
# Split the dataset according to the class distribution of category_desc
y = volunteer["category_desc"].astype(str)
train_X, test_X, train_y, test_y = train_test_split(text_tfidf.toarray(), y, stratify = y)

nb = GaussianNB()
# Fit the model to the training data
nb.fit(train_X, train_y)

# Print out the model's accuracy
print(nb.score(test_X, test_y))

0.49101796407185627
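
GaussianNB needs a dense array, which is why the cell above calls toarray(). As a possible alternative (a different model, not what the cell above uses), MultinomialNB accepts the sparse tf-idf matrix directly. The sketch below uses made-up titles and labels, so it says nothing about accuracy on the volunteer data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical titles and labels for illustration
toy_titles = [
    "plant trees in the park",
    "clean up the park garden",
    "tutor children in math",
    "tutor adults in reading",
]
toy_labels = ["Environment", "Environment", "Education", "Education"]

toy_vec = TfidfVectorizer()
X_sparse = toy_vec.fit_transform(toy_titles)

# Fit directly on the sparse matrix, no toarray() needed
toy_nb = MultinomialNB()
toy_nb.fit(X_sparse, toy_labels)

# New text must be transformed with the already fitted vectorizer
new_X = toy_vec.transform(["tutor children in reading"])
print(toy_nb.predict(new_X))
```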


In [27]:
print(tfidf_vec.vocabulary_)

{'volunteers': 1086, 'needed': 690, 'for': 404, 'rise': 869, 'up': 1061, 'stay': 959, 'put': 822, 'home': 493, 'rescue': 855, 'fair': 375, 'web': 1095, 'designer': 297, 'urban': 1063, 'adventures': 43, 'ice': 515, 'skating': 930, 'at': 98, 'lasker': 587, 'rink': 868, 'fight': 392, 'global': 447, 'hunger': 512, 'and': 75, 'support': 986, 'women': 1108, 'farmers': 380, 'join': 562, 'the': 1012, 'oxfam': 739, 'action': 31, 'corps': 255, 'in': 523, 'nyc': 710, 'stop': 962, 'swap': 989, 'queens': 825, 'staff': 951, 'development': 300, 'trainer': 1037, 'claro': 213, 'brooklyn': 155, 'volunteer': 1084, 'attorney': 101, 'cents': 188, 'ability': 23, 'community': 235, 'health': 480, 'advocates': 48, 'supervise': 984, 'children': 202, 'highland': 491, 'park': 748, 'garden': 433, 'worldofmoney': 1118, 'org': 727, 'youth': 1132, 'amazing': 67, 'race': 826, 'qualified': 824, 'board': 142, 'member': 649, 'seats': 899, 'available': 106, 'young': 1130, 'adult': 38, 'tutor': 1052, 'updated': 1062, '30':

In [132]:
print(tfidf_vec.idf_.shape)
print(tfidf_vec.idf_[1086])
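# get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
print(tfidf_vec.get_feature_names())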

(1136,)
3.863703510814003
['11', '125th', '14th', '17', '175th', '20', '2011', '2012', '21th', '22nd', '23', '24', '2nd', '30', '3rd', '54st', '55', '5k', '5th', '7th', '8th', '9th', 'abe', 'ability', 'aboard', 'about', 'academic', 'accountant', 'accounting', 'aces', 'achievement', 'action', 'active', 'activism', 'activities', 'activity', 'administrative', 'administrator', 'adult', 'adults', 'adv', 'advanced', 'adventure', 'adventures', 'advertising', 'advetures', 'advice', 'advisor', 'advocates', 'adwords', 'aerobics', 'affected', 'affiliate', 'african', 'after', 'afterschool', 'against', 'age', 'aid', 'aide', 'air', 'al', 'all', 'alliance', 'alongside', 'alternatives', 'alzheimer', 'amazing', 'ambassador', 'america', 'american', 'americorps', 'an', 'analysis', 'analyst', 'and', 'animal', 'annual', 'anyone', 'apartment', 'april', 'archivist', 'area', 'around', 'art', 'arthritis', 'artist', 'arts', 'as', 'asbury', 'assault', 'asser', 'assist', 'assistance', 'assistant', 'assistants', '
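
The idf_ values printed above come from the vectorizer's default smoothed formula, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where n is the number of documents and df(t) is the number of documents containing token t. A small check on made-up titles (`toy_titles` is illustrative) compares a manual calculation with scikit-learn's value:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical titles for illustration
toy_titles = ["volunteers needed", "web designer needed", "garden volunteers"]

toy_vec = TfidfVectorizer()  # smooth_idf=True by default
toy_vec.fit(toy_titles)

n_docs = len(toy_titles)
col = toy_vec.vocabulary_["volunteers"]

# Number of titles that contain the token "volunteers"
doc_freq = sum("volunteers" in title.split() for title in toy_titles)

# Smoothed idf, computed by hand
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(manual_idf, toy_vec.idf_[col])
```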

In [57]:
#Regular expression
#Example: @[A-Za-z0-9_]+
#starts with @
#followed by any letter (upper or lower case), digit, or underscore
#that repeats at least once, but any number of times
#.: wildcard, matches any single character except a newline
#^: start of a string
#$: end of a string
#[]: matches one of the set of characters within []
#[a-z]: matches one character in the range a,b,...,z
#[^abc]: matches a character that is not a, b, or c
#a|b: matches either a or b, where a and b are patterns
#(): grouping, sets the scope for operators
#\: escape character, also introduces special sequences (\t,\n,\b)
#\b: matches a word boundary
#\d: any digit, equivalent to [0-9] for ASCII text
#\D: any non-digit, equivalent to [^0-9] for ASCII text
#\s: any whitespace, equivalent to [ \t\n\r\f\v] for ASCII text
#\w: word character (alphanumeric or underscore), equivalent to [a-zA-Z0-9_] for ASCII text
#\W: non-word character, equivalent to [^a-zA-Z0-9_] for ASCII text
#*: matches zero or more occurrences
#+: matches one or more occurrences
#?: matches zero or one occurrence
#{n}: exactly n repetitions, n >= 0
#{n,}: at least n repetitions
#{,n}: at most n repetitions
#{m,n}: at least m and at most n repetitions
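
A short sketch that exercises a few of the constructs listed above. The sample text and handles are made up for illustration.

```python
import re

# Hypothetical sample text
text = "Thanks @nyc_parks and @volunteer2011 for the help!"

# @ followed by one or more letters, digits, or underscores
handle_pattern = re.compile(r"@[A-Za-z0-9_]+")
handles = handle_pattern.findall(text)
print(handles)

# \b marks word boundaries; \d{2,4} matches runs of two to four digits
numbers = re.findall(r"\b\d{2,4}\b", "Rooms 7, 42 and 2011")
print(numbers)
```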