# Encoding categorical variables - binary

Take a look at the `hiking` dataset. Several of its columns need encoding before they can be modeled, one of which is the `Accessible` column. `Accessible` is a binary feature with two values, `Y` or `N`, which need to be encoded as 1s and 0s. Use scikit-learn's `LabelEncoder` class to perform this transformation.

## Instructions

- Store `LabelEncoder()` in a variable named `enc`.
- Using the encoder's `.fit_transform()` method, encode the `hiking` dataset's `"Accessible"` column. Call the new column `Accessible_enc`.
- Compare the two columns side-by-side to see the encoding.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

hiking = pd.read_json("hiking.json")

In [2]:
# Set up the LabelEncoder object
enc = LabelEncoder()

# Apply the encoding to the "Accessible" column
hiking['Accessible_enc'] = enc.fit_transform(hiking['Accessible'])

# Compare the two columns
print(hiking[['Accessible', 'Accessible_enc']].head())

  Accessible  Accessible_enc
0          Y               1
1          N               0
2          N               0
3          N               0
4          N               0
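
To see which label received which code, you can inspect the fitted encoder. The sketch below uses a small made-up DataFrame (`toy` is a hypothetical stand-in, not the `hiking` data) and relies on `LabelEncoder` assigning codes by the sorted order of the labels, which is why `N` becomes 0 and `Y` becomes 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the hiking dataset
toy = pd.DataFrame({"Accessible": ["Y", "N", "N", "Y"]})

enc = LabelEncoder()
toy["Accessible_enc"] = enc.fit_transform(toy["Accessible"])

# classes_ holds the labels in sorted order; each label's position is its code
mapping = {str(label): int(code) for code, label in enumerate(enc.classes_)}
print(mapping)

# inverse_transform turns the codes back into the original labels
restored = enc.inverse_transform(toy["Accessible_enc"])
print(list(restored))
```

Keeping the fitted `enc` around lets you decode predictions or apply the same mapping to new data with `enc.transform()`.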


# Encoding categorical variables - one-hot

One of the columns in the `volunteer` dataset, `category_desc`, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need one-hot encoding to represent it numerically. Use the pandas `pd.get_dummies()` function to do so.

## Instructions

- Call `get_dummies()` on the `volunteer["category_desc"]` column to create the encoded columns and assign the result to `category_enc`.
- Print out the `.head()` of the `category_enc` variable to take a look at the encoded columns.

In [3]:
volunteer = pd.read_csv("volunteer_opportunities.csv")

# Transform the category_desc column
category_enc = pd.get_dummies(volunteer['category_desc'])

# Take a look at the encoded columns
print(category_enc.head())

   Education  Emergency Preparedness  Environment  Health  \
0      False                   False        False   False   
1      False                   False        False   False   
2      False                   False        False   False   
3      False                   False        False   False   
4      False                   False         True   False   

   Helping Neighbors in Need  Strengthening Communities  
0                      False                      False  
1                      False                       True  
2                      False                       True  
3                      False                       True  
4                      False                      False  
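
A row in which every encoded column is `False` can occur when the original value is missing, because `get_dummies()` skips `NaN` by default. The sketch below illustrates this on hypothetical toy data (`toy` is not the `volunteer` dataset) and shows two optional parameters: `dtype=int` for 0/1 output and `dummy_na=True` for an explicit missing-value column. Whether that explains any particular row in the output above is an assumption you would need to check against the data.

```python
import pandas as pd

# Hypothetical toy data with one missing category
toy = pd.DataFrame({"category_desc": ["Health", "Education", None, "Health"]})

# dtype=int yields 0/1 columns instead of booleans
category_enc = pd.get_dummies(toy["category_desc"], dtype=int)
print(category_enc)

# dummy_na=True adds a column that flags missing values
with_na = pd.get_dummies(toy["category_desc"], dummy_na=True, dtype=int)
print(with_na)
```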


# Aggregating numerical features

A good use case for taking an aggregate statistic to create a new feature is when you have many features with similar, related values. Here, you have a DataFrame of running times named `running_times_5k`. For each `name` in the dataset, take the mean of their 5 run times.

## Instructions

- Use the `.loc[]` indexer to select all rows and the columns `run1` through `run5`, then take the `.mean()` of each row.
- Print the `.head()` of the DataFrame to see the `mean` column.

In [4]:
running_times_5k = pd.DataFrame({
    'name': ['Sue', 'Mark', 'Sean', 'Erin', 'Jenny', 'Russell'],
    'run1': [20.1, 16.5, 23.5, 21.7, 25.8, 30.9],
    'run2': [18.5, 17.1, 25.1, 21.1, 27.1, 29.6],
    'run3': [19.6, 16.9, 25.2, 20.9, 26.1, 31.4],
    'run4': [20.3, 17.6, 24.6, 22.1, 26.7, 30.4],
    'run5': [18.3, 17.3, 23.9, 22.2, 26.9, 29.9],
    'mean': [19.36, 17.08, 24.46, 21.60, 26.52, 30.44]
})

In [5]:
# Use .loc to create a mean column
running_times_5k["mean"] = running_times_5k.loc[:, "run1":"run5"].mean(axis=1)

# Take a look at the results
print(running_times_5k.head())

    name  run1  run2  run3  run4  run5   mean
0    Sue  20.1  18.5  19.6  20.3  18.3  19.36
1   Mark  16.5  17.1  16.9  17.6  17.3  17.08
2   Sean  23.5  25.1  25.2  24.6  23.9  24.46
3   Erin  21.7  21.1  20.9  22.1  22.2  21.60
4  Jenny  25.8  27.1  26.1  26.7  26.9  26.52
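
The `axis=1` argument is what makes `.mean()` work across the columns of each row rather than down each column. The sketch below shows this on a hypothetical three-run subset (`toy` is made up for illustration) and adds a second aggregate, the fastest run, as one more example of the same idea.

```python
import pandas as pd

# Hypothetical subset of the running data
toy = pd.DataFrame({
    "name": ["Sue", "Mark"],
    "run1": [20.1, 16.5],
    "run2": [18.5, 17.1],
    "run3": [19.6, 16.9],
})

# Label-based slicing with .loc includes both endpoints
runs = toy.loc[:, "run1":"run3"]

# axis=1 aggregates across columns, giving one value per row
toy["mean"] = runs.mean(axis=1)
toy["best"] = runs.min(axis=1)

# Without axis=1, the mean is taken down each column instead
per_run_mean = runs.mean()

print(toy)
print(per_run_mean)
```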


# Extracting datetime components

Several columns in the `volunteer` dataset contain datetimes. Let's take a look at the `start_date_date` column and extract just the month to use as a feature for modeling.

## Instructions

- Convert the `start_date_date` column into a `pandas` datetime column and store it in a new column called `start_date_converted`.
- Retrieve the month component of `start_date_converted` and store it in a new column called `start_date_month` .
- Print the `.head()` of just the `start_date_converted` and `start_date_month` columns.

In [6]:
# First, convert string column to date column
volunteer["start_date_converted"] = pd.to_datetime(volunteer["start_date_date"])

# Extract just the month from the converted column
volunteer["start_date_month"] = volunteer["start_date_converted"].dt.month

# Take a look at the converted and new month columns
print(volunteer[["start_date_converted", "start_date_month"]].head())

  start_date_converted  start_date_month
0           2011-07-30                 7
1           2011-02-01                 2
2           2011-01-29                 1
3           2011-02-14                 2
4           2011-02-05                 2
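
The `.dt` accessor exposes other components in the same way as the month. The sketch below uses hypothetical toy strings (not the `volunteer` data) and passes `errors="coerce"`, an optional choice that turns unparseable values into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical toy data, including one value that is not a date
toy = pd.DataFrame({"start_date_date": ["2011-07-30", "2011-02-01", "not a date"]})

# errors="coerce" converts unparseable strings to NaT
toy["start_date_converted"] = pd.to_datetime(toy["start_date_date"], errors="coerce")

# Each datetime component is available through the .dt accessor
toy["start_date_month"] = toy["start_date_converted"].dt.month
toy["start_date_year"] = toy["start_date_converted"].dt.year
toy["start_date_weekday"] = toy["start_date_converted"].dt.day_name()

print(toy)
```

Note that a column containing `NaT` yields floating-point components, since the missing entry becomes `NaN`.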


# Extracting string patterns

The `Length` column in the `hiking` dataset holds strings, but each string contains the mileage for the hike. We're going to extract this mileage using regular expressions, and then apply the extraction function to the column in pandas.

## Instructions

- Search the text in the `length` argument for numbers and decimals using an appropriate pattern.
- Extract the matched pattern and convert it to a float.
- Apply the `return_mileage()` function to each row in the `hiking["Length"]` column.

In [7]:
import re

hiking = pd.read_json("hiking.json")

# remove NaN values from the "Length" column
hiking = hiking[hiking['Length'].notna()]

In [8]:
# Write a pattern to extract numbers and decimals
def return_mileage(length):
    
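    # Search the text for a whole number or a decimal (the decimal part is optional)
    mile = re.search(r'\d+(?:\.\d+)?', length)
    
    # If a match is found, use group(0) to return the matched text as a float;
    # otherwise the function returns None, which pandas stores as NaN
    if mile is not None:
        return float(mile.group(0))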
        
# Apply the function to the Length column and take a look at both columns
hiking["Length_num"] = hiking["Length"].apply(return_mileage)
print(hiking[["Length", "Length_num"]].head())

       Length  Length_num
0   0.8 miles        0.80
1    1.0 mile        1.00
2  0.75 miles        0.75
3   0.5 miles        0.50
4   0.5 miles        0.50
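
The same extraction can also be written without a helper function by using the pandas string method `.str.extract()`. This is an alternative sketch on hypothetical toy data, not the approach the exercise asks for. The pattern here treats the decimal part as optional, which is an assumption about how a value such as `"2 miles"` should be handled.

```python
import pandas as pd

# Hypothetical toy data, including a whole number and a non-numeric value
toy = pd.DataFrame({"Length": ["0.8 miles", "1.0 mile", "2 miles", "unknown"]})

# One capture group: digits, optionally followed by a decimal part
pattern = r"(\d+(?:\.\d+)?)"

# expand=False returns a Series; rows without a match become NaN
toy["Length_num"] = toy["Length"].str.extract(pattern, expand=False).astype(float)

print(toy)
```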


# Vectorizing text

You'll now transform the `volunteer` dataset's `title` column into a text vector, which you'll use in a prediction task in the next exercise.

## Instructions

- Store the `volunteer["title"]` column in a variable named `title_text`.
- Instantiate a `TfidfVectorizer` as `tfidf_vec`.
- Transform the text in `title_text` into a tf-idf vector using `tfidf_vec`.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [10]:
# Take the title text
title_text = volunteer['title']

# Create the vectorizer object
tfidf_vec = TfidfVectorizer()

# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)

# Text classification using tf/idf vectors

Now that you've encoded the `volunteer` dataset's `title` column into tf/idf vectors, you'll use those vectors to predict the `category_desc` column.

## Instructions

- Split the `text_tfidf` vector and `y` target variable into training and test sets, setting the `stratify` parameter equal to `y`, since the class distribution is uneven. Notice that we have to run the `.toarray()` method on the tf/idf vector in order to get it in the proper format for scikit-learn.
- Fit the `X_train` and `y_train` data to the Naive Bayes model, `nb` .
- Print out the test set accuracy.

In [None]:
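# Solution (assumes train_test_split has been imported and that nb is a
# Naive Bayes model supplied by the exercise)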

# Split the dataset according to the class distribution of category_desc
y = volunteer["category_desc"]
X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y, random_state=42)

# Fit the model to the training data
nb.fit(X_train, y_train)

# Print out the model's accuracy
print(nb.score(X_test, y_test))

In [15]:
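# Import the splitter and classifier used in this cell
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Take the title text and remove any NaN values from both title and category_desc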
volunteer_clean = volunteer[(volunteer['title'].notna()) & (volunteer['category_desc'].notna())]
title_text = volunteer_clean['title']

# Create the vectorizer object
tfidf_vec = TfidfVectorizer()

# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)

# Split the dataset according to the class distribution of category_desc
y = volunteer_clean["category_desc"]
X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y, random_state=42)

# Initialize the Naive Bayes model
nb = MultinomialNB()

# Fit the model to the training data
nb.fit(X_train, y_train)

# Print out the model's accuracy
print(nb.score(X_test, y_test))

0.5548387096774193
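
One variation worth knowing is to wrap the vectorizer and the classifier in a single pipeline, so the vocabulary is learned from the training split only. The sketch below is an illustration on made-up titles and labels, not the exercise's solution; `titles`, `labels`, and `model` are hypothetical names. It also passes the sparse tf-idf matrix directly to `MultinomialNB`, which accepts sparse input.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy titles and categories
titles = [
    "Tutor children in reading", "Homework help for students",
    "Teach adult literacy class", "Mentor high school students",
    "Park cleanup volunteers", "Plant trees in the park",
    "Beach cleanup day", "Community garden planting",
]
labels = ["Education"] * 4 + ["Environment"] * 4

# Split the raw text first, keeping the class balance with stratify
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.25, stratify=labels, random_state=42
)

# The pipeline fits the vectorizer on the training text only,
# then feeds the sparse tf-idf matrix to the classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(accuracy)
```

With so few examples the accuracy printed here is not meaningful; the point is the order of operations, where the split comes before the vectorizer is fitted.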
