In [1]:
import pandas as pd
import numpy as np

In [2]:
hiking = pd.read_json("../hiking.json")
# hiking.head()

In [3]:
volunteer = pd.read_csv("../volunteer_opportunities.csv")
# volunteer.head()

In [4]:
wine = pd.read_csv("../wine_types.csv")

In [5]:
sel_cols = ['vol_requests', 'title', 'hits', 'category_desc', 'locality', 'region',
       'postalcode', 'created_date', 'vol_requests_lognorm', 'created_month',
       'Education', 'Emergency Preparedness', 'Environment', 'Health',
       'Helping Neighbors in Need', 'Strengthening Communities']
# new_volunteer_df = volunteer[sel_cols]

## Selecting relevant features

Now let's identify the redundant columns in the volunteer dataset and perform feature selection on the dataset to return a DataFrame of the relevant features.

For example, if you explore the volunteer dataset in the console, you'll see three features which are related to location: locality, region, and postalcode. They contain repeated information, so it would make sense to keep only one of the features.

There are also features that have gone through the feature engineering process: columns like Education and Emergency Preparedness are a product of encoding the categorical variable category_desc, so category_desc itself is redundant now.

Take a moment to examine the features of volunteer in the console, and try to identify the redundant features.
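
One way to check for the two kinds of redundancy described above is sketched below on a small made-up frame. The `toy` DataFrame and its values are hypothetical and only mimic the column names of the volunteer dataset; the real data may contain missing values that need handling first.

```python
import pandas as pd

# Hypothetical toy frame that mimics the volunteer columns discussed above
toy = pd.DataFrame({
    "category_desc": ["Education", "Health", "Education", "Environment"],
    "locality": ["a", "b", "a", "c"],
    "postalcode": [10001, 10002, 10001, 10003],
})

# One-hot encode category_desc, as the feature engineering step would have done
encoded = pd.get_dummies(toy["category_desc"], dtype=int)

# If the original label can be rebuilt from the indicator columns,
# category_desc carries no extra information
rebuilt = encoded.idxmax(axis=1)
category_is_redundant = bool((rebuilt == toy["category_desc"]).all())

# If every postalcode maps to exactly one locality, locality adds nothing
locality_per_postalcode = toy.groupby("postalcode")["locality"].nunique()
locality_is_redundant = bool((locality_per_postalcode == 1).all())

print(category_is_redundant, locality_is_redundant)
```

On the toy frame both checks come out true, which is the situation the exercise assumes for the real columns.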

* Instructions

    * Create a list of redundant column names and store it in the to_drop variable:
        * Out of all the location-related features, keep only postalcode.
        * Features that have gone through the feature engineering process are redundant as well.
    * Drop the columns from the dataset using .drop().
    * Print out the .head() of the DataFrame to see the selected columns.


In [6]:
# Create a list of redundant column names to drop
to_drop = ["category_desc", "created_date", "locality", "region", "vol_requests"]

# Drop those columns from the dataset
volunteer_subset = volunteer.drop(to_drop, axis=1)

# Print out the head of the new dataset
print(volunteer_subset.head())

   opportunity_id  content_id  event_time  \
0            4996       37004           0   
1            5008       37036           0   
2            5016       37143           0   
3            5022       37237           0   
4            5055       37425           0   

                                               title  hits  \
0  Volunteers Needed For Rise Up & Stay Put! Home...   737   
1                                       Web designer    22   
2      Urban Adventures - Ice Skating at Lasker Rink    62   
3  Fight global hunger and support women farmers ...    14   
4                                      Stop 'N' Swap    31   

                                             summary is_priority  category_id  \
0  Building on successful events last summer and ...         NaN          NaN   
1             Build a website for an Afghan business         NaN          1.0   
2  Please join us and the students from Mott Hall...         NaN          1.0   
3  The Oxfam Action Corps is a g

## Checking for correlated features

Let's take a look at the wine dataset again, which is made up of continuous, numerical features. Compute Pearson's correlation coefficient for each pair of columns to determine which columns are good candidates for elimination. Then, remove those columns from the DataFrame.
* Instructions

    * Print out the column correlations of the wine dataset using corr().
    * Take a minute to look at the correlations. Identify a column that has a correlation greater than 0.75 with at least two other columns and store its name in the to_drop variable.
    * Drop that column from the DataFrame using drop().


In [7]:
# Select the columns, then drop rows with missing values. Chaining dropna() returns
# a new DataFrame and avoids the SettingWithCopyWarning that dropna(inplace=True)
# raises when called on a slice of wine.
new_wine_df = wine[['Flavanoids', 'Total phenols', 'Malic acid',
       'OD280/OD315 of diluted wines', 'Hue']].dropna()


In [8]:
# Print out the column correlations of the wine dataset
print(new_wine_df.corr())

# Flavanoids correlates above 0.75 with two other columns
# (Total phenols and OD280/OD315 of diluted wines), so it is the one to drop
to_drop = "Flavanoids"

# Drop that column from the DataFrame
new_wine_df = new_wine_df.drop(to_drop, axis=1)

                              Flavanoids  Total phenols  Malic acid  \
Flavanoids                      1.000000       0.864564   -0.411007   
Total phenols                   0.864564       1.000000   -0.335167   
Malic acid                     -0.411007      -0.335167    1.000000   
OD280/OD315 of diluted wines    0.787194       0.699949   -0.368710   
Hue                             0.543479       0.433681   -0.561296   

                              OD280/OD315 of diluted wines       Hue  
Flavanoids                                        0.787194  0.543479  
Total phenols                                     0.699949  0.433681  
Malic acid                                       -0.368710 -0.561296  
OD280/OD315 of diluted wines                      1.000000  0.565468  
Hue                                               0.565468  1.000000  
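
The manual inspection can also be expressed in code. The sketch below rebuilds the correlation matrix from the values printed above and counts, for each column, how many other columns it correlates with above the threshold. The helper name `columns_to_drop` and its parameters are hypothetical, and using the absolute value of the correlation is an assumption (the exercise only mentions values greater than 0.75).

```python
import numpy as np
import pandas as pd

cols = ["Flavanoids", "Total phenols", "Malic acid",
        "OD280/OD315 of diluted wines", "Hue"]

# Correlation values copied from the printed output above
corr = pd.DataFrame(
    [[1.000000, 0.864564, -0.411007, 0.787194, 0.543479],
     [0.864564, 1.000000, -0.335167, 0.699949, 0.433681],
     [-0.411007, -0.335167, 1.000000, -0.368710, -0.561296],
     [0.787194, 0.699949, -0.368710, 1.000000, 0.565468],
     [0.543479, 0.433681, -0.561296, 0.565468, 1.000000]],
    index=cols, columns=cols)


def columns_to_drop(corr, threshold=0.75, min_count=2):
    """Return columns correlated above threshold with at least min_count others."""
    # Mask out the diagonal, where every column correlates perfectly with itself
    off_diagonal = ~np.eye(len(corr), dtype=bool)
    high = (corr.abs() > threshold) & off_diagonal
    counts = high.sum(axis=1)
    return counts[counts >= min_count].index.tolist()


print(columns_to_drop(corr))  # ['Flavanoids']
```

Only Flavanoids exceeds 0.75 twice (with Total phenols and with OD280/OD315 of diluted wines), which matches the choice made in the cell above.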


## Exploring text vectors, part 1

Let's expand on the text vector exploration method we just learned about, using the tf-idf vectors of the volunteer dataset's title column. In this first part, we'll extend the function from the slides so that it returns a list of vocabulary indices for the top-weighted words in a document. In the next exercise, we'll write another function that collects the top words across all documents and use that list to filter down our text_tfidf vector.
* Instructions

    * Add parameters called original_vocab, for the tfidf_vec.vocabulary_, and top_n.
    * Call pd.Series on the zipped dictionary. This will make it easier to operate on.
    * Use the sort_values function to sort the series and slice the index up to top_n words.
    * Call the function, setting original_vocab=tfidf_vec.vocabulary_, setting vector_index=8 to grab the 9th row, and setting top_n=3, to grab the top 3 weighted words.


In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

temp_df = volunteer[['title', 'category_desc']]
temp_df = temp_df.dropna(axis=0)

# Take the title text
title_text = temp_df["title"]

# Create the vectorizer object
tfidf_vec = TfidfVectorizer()

# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)

In [10]:
# Invert the vocabulary so that it maps column index -> word
vocab = {v: k for k, v in tfidf_vec.vocabulary_.items()}

In [11]:
# Add in the rest of the parameters
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Let's transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Let's sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

# Print the vocabulary indices of the top 3 weighted words in row 8
print(return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, 8, 3))

[189, 942, 466]
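
The numbers printed above are column indices into the tf-idf matrix, not the words themselves. The following sketch, on a tiny made-up corpus, shows how a sparse row's `indices` and `data` pair up and how an index can be mapped back to its word. The names `docs`, `vec`, and `index_to_word` are hypothetical stand-ins for `title_text`, `tfidf_vec`, and `vocab`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus
docs = ["web designer needed",
        "ice skating volunteers",
        "web volunteers needed today"]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)

# vocabulary_ maps word -> column index; invert it to look words up by index
index_to_word = {idx: word for word, idx in vec.vocabulary_.items()}

# A sparse row stores only its non-zero columns (indices) and their weights (data)
row = tfidf[0]
weights = pd.Series(row.data, index=[index_to_word[i] for i in row.indices])

# Highest-weighted word in the first document and its column index
top_word = weights.sort_values(ascending=False).index[0]
top_index = vec.vocabulary_[top_word]

print(top_word, top_index, index_to_word[top_index])
```

In the first document, "designer" gets the highest weight because it appears in no other document, so its inverse document frequency is the largest.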


## Exploring text vectors, part 2

Using the function we wrote in the previous exercise, we're going to extract the top words from each document in the text vector, return a list of the word indices, and use that list to filter the text vector down to those top words.
* Instructions

    * Call return_weights to return the top weighted words for that document.
    * Call set on the returned filter_list so we don't get duplicated numbers.
    * Call words_to_filter, passing in the following parameters: vocab for the vocab parameter, tfidf_vec.vocabulary_ for the original_vocab parameter, text_tfidf for the vector parameter, and 3 to grab the top_n 3 weighted words from each document.
    * Finally, pass that filtered_words set into a list to use as a filter for the text vector.


In [12]:
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Here we'll call the function from the previous exercise, and extend the list we're creating
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

# By converting filtered_words back to a list, we can use it to filter the columns in the text vector
filtered_text = text_tfidf[:, list(filtered_words)]

## Training Naive Bayes with feature selection

Let's re-run the Naive Bayes text classification model we ran at the end of chapter 3, with our selection choices from the previous exercise, on the volunteer dataset's title and category_desc columns.
* Instructions

    * Use train_test_split on the filtered_text text vector and the y labels (the category_desc values), passing y to the stratify parameter, since we have an uneven class distribution.
    * Fit the nb Naive Bayes model to train_X and train_y.
    * Score the nb model on the test_X and test_y test sets.


In [13]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

nb = GaussianNB(priors=None)

In [14]:
y = temp_df["category_desc"]

In [15]:
# Split the dataset according to the class distribution of category_desc
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y)

# Fit the model to the training data
nb.fit(train_X, train_y)

# Print out the model's accuracy
print(nb.score(test_X, test_y))

0.535483870967742
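
To see what the stratify argument does, here is a minimal sketch on hypothetical imbalanced labels (80 samples of one class, 20 of another). The arrays `X` and `y` below are made up for illustration, and `test_size` and `random_state` are set here only to make the result reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 of class "a", 20 of class "b"
y = np.array(["a"] * 80 + ["b"] * 20)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

train_share_b = np.mean(y_train == "b")
test_share_b = np.mean(y_test == "b")

print(train_share_b, test_share_b)  # 0.2 0.2
```

Without stratification, a random split could leave a rare class under-represented in the training or test set, which makes the reported score less reliable.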


## Using PCA

Let's apply PCA to the wine dataset, to see if we can get an increase in our model's accuracy.
* Instructions

    * Set up the PCA object. You'll use PCA on the wine dataset minus its label for Type, stored in the variable wine_X.
    * Apply PCA to wine_X using pca's fit_transform method and store the transformed vector in transformed_X.
    * Print out the explained_variance_ratio_ attribute of pca to check how much variance is explained by each component.


In [16]:
from sklearn.decomposition import PCA

# Set up PCA and the X vector for dimensionality reduction
pca = PCA()
wine_X = wine.drop("Type", axis=1)

# Apply PCA to the wine dataset
transformed_X = pca.fit_transform(wine_X)

# Look at the percentage of variance explained by the different components
print(pca.explained_variance_ratio_)

[9.98091230e-01 1.73591562e-03 9.49589576e-05 5.02173562e-05
 1.23636847e-05 8.46213034e-06 2.80681456e-06 1.52308053e-06
 1.12783044e-06 7.21415811e-07 3.78060267e-07 2.12013755e-07
 8.25392788e-08]
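
In the output above, the first component explains about 99.8% of the variance. One plausible reason is that PCA is sensitive to feature scale, so a single column with much larger values can dominate. The sketch below illustrates this on synthetic data, not on the wine dataset; standardizing with `StandardScaler` before PCA is an assumption here and is not part of the original exercise.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: three independent features, one on a much larger scale
X = np.column_stack([
    rng.normal(scale=1000.0, size=500),
    rng.normal(scale=1.0, size=500),
    rng.normal(scale=1.0, size=500),
])

# PCA on the raw data: the large-scale feature dominates the first component
raw_ratio = PCA().fit(X).explained_variance_ratio_

# PCA after standardizing each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA().fit(X_scaled).explained_variance_ratio_

print(raw_ratio.round(4))
print(scaled_ratio.round(4))
```

On the raw synthetic data nearly all variance lands in the first component, while after scaling it is spread roughly evenly, since the three features are independent.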


In [17]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [19]:
# Split the transformed X and the y labels into training and test sets
y = wine['Type']
X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(transformed_X, y)

# Fit knn to the training data
knn.fit(X_wine_train, y_wine_train)

# Score knn on the test data and print it out
print(knn.score(X_wine_test, y_wine_test))

0.7333333333333333
