In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

In [2]:
volunteer = pd.read_csv("../volunteer_opportunities.csv")

In [3]:
hiking = pd.read_json("../hiking.json")
hiking.head()

Unnamed: 0,Prop_ID,Name,Location,Park_Name,Length,Difficulty,Other_Details,Accessible,Limited_Access,lat,lon
0,B057,Salt Marsh Nature Trail,"Enter behind the Salt Marsh Nature Center, loc...",Marine Park,0.8 miles,,<p>The first half of this mile-long trail foll...,Y,N,,
1,B073,Lullwater,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,1.0 mile,Easy,Explore the Lullwater to see how nature thrive...,N,N,,
2,B073,Midwood,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.75 miles,Easy,Step back in time with a walk through Brooklyn...,N,N,,
3,B073,Peninsula,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.5 miles,Easy,Discover how the Peninsula has changed over th...,N,N,,
4,B073,Waterfall,Enter Park at Lincoln Road and Ocean Avenue en...,Prospect Park,0.5 miles,Easy,Trace the source of the Lake on the Waterfall ...,N,N,,


### Encoding categorical variables - binary

Take a look at the hiking dataset. Several of its columns need encoding before they can be used in a model, and one of them is the Accessible column. Accessible is a binary feature with two values, Y and N, so it needs to be encoded as 1s and 0s. Use scikit-learn's LabelEncoder class to do that transformation.

* Instructions

    * Store LabelEncoder() in a variable named enc
    * Using the encoder's fit_transform() method, encode the hiking dataset's "Accessible" column. Call the new column Accessible_enc.
    * Compare the two columns side-by-side to see the encoding.


In [4]:
# Set up the LabelEncoder object
enc = LabelEncoder()

# Apply the encoding to the "Accessible" column
hiking["Accessible_enc"] = enc.fit_transform(hiking["Accessible"])

# Compare the two columns
print(hiking[["Accessible_enc", "Accessible"]].head())

   Accessible_enc Accessible
0               1          Y
1               0          N
2               0          N
3               0          N
4               0          N
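
To check which label received which integer, the fitted encoder can be inspected directly. The sketch below uses a small made-up Series in place of hiking["Accessible"]; the names `accessible` and `mapping` are illustrative and not part of the exercise.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for hiking["Accessible"] (illustrative values only)
accessible = pd.Series(["Y", "N", "N", "Y"])

enc = LabelEncoder()
encoded = enc.fit_transform(accessible)

# classes_ holds the original labels in sorted order;
# each label's position in that array is its integer code
mapping = {label: code for code, label in enumerate(enc.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels
decoded = enc.inverse_transform(encoded)
print(list(decoded))
```

Because the labels are sorted before codes are assigned, N maps to 0 and Y maps to 1, which matches the output above.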


### Encoding categorical variables - one-hot

One of the columns in the volunteer dataset, category_desc, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need to use one-hot encoding to turn this column into numeric features. Use pandas' `get_dummies()` function to do so.

In [5]:
# Transform the category_desc column
category_enc = pd.get_dummies(volunteer["category_desc"])

# Take a look at the encoded columns
print(category_enc.head())

   Education  Emergency Preparedness  Environment  Health  \
0          0                       0            0       0   
1          0                       0            0       0   
2          0                       0            0       0   
3          0                       0            0       0   
4          0                       0            1       0   

   Helping Neighbors in Need  Strengthening Communities  
0                          0                          0  
1                          0                          1  
2                          0                          1  
3                          0                          1  
4                          0                          0  
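
Row 0 in the output above is zero in every column, which is what `get_dummies()` produces by default for a missing value, so that row likely has no category. The sketch below illustrates this on a made-up Series (not the volunteer data) and shows the `dummy_na` option. It also passes `dtype=int`, since recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Toy stand-in for volunteer["category_desc"], including one missing value
categories = pd.Series(["Health", None, "Environment", "Health"])

# Default behavior: the missing value becomes a row of all zeros
dummies = pd.get_dummies(categories, dtype=int)
print(dummies)

# dummy_na=True adds an extra indicator column for missing values
with_na = pd.get_dummies(categories, dummy_na=True, dtype=int)
print(with_na)
```

Whether missing categories deserve their own column depends on the modeling task; this is only one way to make them visible.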


### Engineering numerical features - taking an average

One common way to create a new feature from an aggregate statistic is to take the mean of several columns. Here, you have a DataFrame of running times named running_times_5k. For each name in the dataset, take the mean of their 5 run times.

* Instructions

    * Create a list of the columns you want to take the average of and store it in a variable named run_columns.
    * Use apply() with lambda row: to take the mean() of the listed columns, and remember to set axis=1 so the function runs once per row.
    * Print out the DataFrame to see the mean column.


In [6]:
data_dict = {
    'name':['Sue', 'Mark', 'Sean', 'Erin', 'Jenny', 'Russell'], 
    'run1':[20.1, 16.5, 23.5, 21.7, 25.8, 30.9], 
    'run2':[18.5, 17.1, 25.1, 21.1, 27.1, 29.6], 
    'run3':[19.6, 16.9, 25.2, 20.9, 26.1, 31.4], 
    'run4':[20.3, 17.6, 24.6, 22.1, 26.7, 30.4], 
    'run5':[18.3, 17.3, 23.9, 22.2, 26.9, 29.9]
}

In [7]:
running_times_5k = pd.DataFrame(data_dict)

In [8]:
# Create a list of the columns to average
run_columns = ["run1", "run2", "run3", "run4", "run5"]

# Use apply to create a mean column
running_times_5k["mean"] = running_times_5k.apply(lambda row: row[run_columns].mean(), axis=1)

# Take a look at the results
print(running_times_5k)

      name  run1  run2  run3  run4  run5   mean
0      Sue  20.1  18.5  19.6  20.3  18.3  19.36
1     Mark  16.5  17.1  16.9  17.6  17.3  17.08
2     Sean  23.5  25.1  25.2  24.6  23.9  24.46
3     Erin  21.7  21.1  20.9  22.1  22.2  21.60
4    Jenny  25.8  27.1  26.1  26.7  26.9  26.52
5  Russell  30.9  29.6  31.4  30.4  29.9  30.44
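
The exercise asks for apply, but the same row-wise mean can also be computed without a lambda by selecting the columns and calling `mean(axis=1)`. The sketch below compares both approaches on the first two runners from the data above; the variable names `applied_mean` and `vectorized_mean` are illustrative.

```python
import pandas as pd

# First two rows of the running data shown above
running_times_5k = pd.DataFrame({
    "name": ["Sue", "Mark"],
    "run1": [20.1, 16.5],
    "run2": [18.5, 17.1],
    "run3": [19.6, 16.9],
    "run4": [20.3, 17.6],
    "run5": [18.3, 17.3],
})
run_columns = ["run1", "run2", "run3", "run4", "run5"]

# Row-by-row version, as in the exercise
applied_mean = running_times_5k.apply(
    lambda row: row[run_columns].mean(), axis=1
)

# Vectorized version: select the columns, then average across each row
vectorized_mean = running_times_5k[run_columns].mean(axis=1)

print(pd.DataFrame({"applied": applied_mean, "vectorized": vectorized_mean}))
```

Both give the same values; the vectorized form is usually shorter and faster on large DataFrames.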


### Engineering numerical features - datetime

Several columns in the volunteer dataset contain datetimes. Let's take a look at the start_date_date column and extract just the month to use as a feature for modeling.

In [9]:
# First, convert string column to date column
volunteer["start_date_converted"] = pd.to_datetime(volunteer["start_date_date"])

# Extract just the month from the converted column
volunteer["start_date_month"] = volunteer["start_date_converted"].apply(lambda row: row.month)

# Take a look at the converted and new month columns
print(volunteer[["start_date_converted", "start_date_month"]].head())

  start_date_converted  start_date_month
0           2011-07-30                 7
1           2011-02-01                 2
2           2011-01-29                 1
3           2011-02-14                 2
4           2011-02-05                 2
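
Once a column has a datetime dtype, pandas also exposes its components through the `.dt` accessor, which avoids the lambda. The sketch below uses a few made-up date strings; the ISO format and the names `dates`, `months`, and `coerced` are assumptions for illustration, since the raw format of start_date_date is not shown here.

```python
import pandas as pd

# Toy date strings (assumed format; the real column may differ)
dates = pd.Series(["2011-07-30", "2011-02-01", "2011-01-29"])

# Convert strings to datetimes, then read the month via the .dt accessor
converted = pd.to_datetime(dates)
months = converted.dt.month
print(months.tolist())

# errors="coerce" turns unparseable strings into NaT instead of raising
coerced = pd.to_datetime(pd.Series(["2011-07-30", "not a date"]), errors="coerce")
print(coerced.isna().tolist())
```

The same accessor offers other components, such as `.dt.year` and `.dt.dayofweek`, if they turn out to be useful features.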


### Engineering features from strings - extraction

The Length column in the hiking dataset is a column of strings, but each string contains the mileage for the hike. We're going to extract this mileage using regular expressions, and then use apply in pandas to run the extraction over the DataFrame.

* Instructions

    * Create a pattern that will extract numbers and decimals from text, using `\d+` to get numbers and `\.` to get decimals, and pass it into re's compile function.
    * Use re's match function to search the text, passing in the pattern and the length text. Note that match only looks at the start of the string.
    * Use the match object's group() method to extract the matched text, making sure to request group 0, and pass it into float.
    * Apply the `return_mileage()` function to the hiking["Length"] column.


In [10]:
import re

In [11]:
def return_mileage(length):
    # Pattern for a decimal number: digits, a decimal point, then more digits
    pattern = re.compile(r"\d+\.\d+")

    # Try to match the pattern at the start of the text
    mile = re.match(pattern, length)

    # If a match is found, use group(0) to return the matched value as a float
    if mile is not None:
        return float(mile.group(0))
    # No decimal number at the start of the text
    return None

# Apply the function to the Length column and take a look at both columns
hiking["Length_num"] = hiking["Length"].dropna().apply(return_mileage)
print(hiking[["Length", "Length_num"]].head())

       Length  Length_num
0   0.8 miles        0.80
1    1.0 mile        1.00
2  0.75 miles        0.75
3   0.5 miles        0.50
4   0.5 miles        0.50
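
The pattern above requires a decimal point, and re.match only looks at the start of the string, so a value such as "1 mile" or "about 2.5 miles" would yield no result. A slightly more permissive variant is sketched below; `extract_mileage` and `MILEAGE_PATTERN` are hypothetical names, and the test strings are made up rather than taken from the hiking data.

```python
import re
from typing import Optional

# Digits, optionally followed by a decimal point and more digits
MILEAGE_PATTERN = re.compile(r"\d+(?:\.\d+)?")

def extract_mileage(length: str) -> Optional[float]:
    """Return the first number found anywhere in the text, or None."""
    found = MILEAGE_PATTERN.search(length)
    if found is None:
        return None
    return float(found.group(0))

for text in ["0.75 miles", "1 mile", "about 2.5 miles", "unknown"]:
    print(text, "->", extract_mileage(text))
```

Compiling the pattern once at module level also avoids recompiling it on every call.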


### Engineering features from strings - tf/idf

Let's transform the volunteer dataset's title column into a text vector, to use in a prediction task in the next exercise.

* Instructions

    * Store the volunteer["title"] column in a variable named title_text.
    * Use the tfidf_vec vectorizer's fit_transform() function on title_text to transform the text into a tf-idf vector.


In [12]:
volunteer['title'].shape

(665,)

In [13]:
volunteer['title'].head()

0    Volunteers Needed For Rise Up & Stay Put! Home...
1                                         Web designer
2        Urban Adventures - Ice Skating at Lasker Rink
3    Fight global hunger and support women farmers ...
4                                        Stop 'N' Swap
Name: title, dtype: object

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [29]:
temp_df = volunteer[['title', 'category_desc']]
temp_df = temp_df.dropna(axis=0)

In [30]:
# Take the title text
title_text = temp_df["title"]

# Create the vectorizer object
tfidf_vec = TfidfVectorizer()

# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)

In [31]:
text_tfidf.shape

(617, 1089)
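
The shape above reads as (documents, vocabulary terms): one row per title and one column per distinct token. The sketch below illustrates this on three of the titles shown earlier, assuming the vectorizer's default settings; the variable names `titles` and `vocabulary` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three titles taken from the volunteer["title"] preview above
titles = [
    "Web designer",
    "Stop 'N' Swap",
    "Urban Adventures - Ice Skating at Lasker Rink",
]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(titles)

# Rows are documents, columns are vocabulary terms
print(text_tfidf.shape)

# vocabulary_ maps each lowercased token to its column index
vocabulary = sorted(tfidf_vec.vocabulary_)
print(vocabulary)
```

With the defaults, text is lowercased and single-character tokens such as the "N" in "Stop 'N' Swap" are dropped, so the vocabulary here has 11 terms.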

### Text classification using tf/idf vectors

Now that we've encoded the volunteer dataset's title column into tf/idf vectors, let's use those vectors to try to predict the category_desc column.

* Instructions

    * Using train_test_split, split the text_tfidf vector, along with your y variable, into training and test sets. Set the stratify parameter equal to y, since the class distribution is uneven. Notice that we have to run the toarray() method on the tf/idf vector to get it in the proper format for the GaussianNB model used here, which requires a dense array rather than a sparse matrix.
    * Use the Naive Bayes model's fit() method on the X_train and y_train variables.
    * Print out the score() of the X_test and y_test variables.


In [32]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

nb = GaussianNB(priors=None)

In [39]:
# Split the dataset according to the class distribution of category_desc
y = temp_df["category_desc"]
X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y)

# Fit the model to the training data
nb.fit(X_train, y_train)

# Print out the model's accuracy
print(nb.score(X_test, y_test))

0.5935483870967742
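
The split above is random, so the score will vary from run to run unless random_state is set. As a point of comparison only, the sketch below swaps in MultinomialNB, which is not the model used in this exercise but accepts the sparse tf/idf matrix directly, so toarray() is not needed. The titles and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up titles and categories (not from the volunteer dataset)
titles = [
    "Tutor students in math", "Read books to children",
    "Teach adult literacy class", "Mentor high school students",
    "Plant trees in the park", "Clean up the river bank",
    "Recycle electronics drive", "Community garden planting day",
]
labels = ["Education"] * 4 + ["Environment"] * 4

# Sparse tf/idf matrix; MultinomialNB can consume it without toarray()
X = TfidfVectorizer().fit_transform(titles)

# random_state makes the stratified split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, stratify=labels, test_size=0.25, random_state=42
)

nb = MultinomialNB()
nb.fit(X_train, y_train)
accuracy = nb.score(X_test, y_test)
print(accuracy)
```

On a dataset this small the accuracy is not meaningful; the point is only the shape of the workflow.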
