# Selecting relevant features

In this exercise, you'll identify the redundant columns in the `volunteer` dataset, and perform feature selection on the dataset to return a DataFrame of the relevant features.

For example, if you explore the `volunteer` dataset in the console, you'll see three features which are related to location: `locality`, `region`, and `postalcode`. They contain related information, so it would make sense to keep only one of the features.

Take some time to examine the features of `volunteer` in the console, and try to identify the redundant features.
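
One way to check whether location columns overlap is to count distinct values and test whether one column fully determines another. The sketch below uses a small made-up DataFrame as a stand-in for `volunteer` (the values and the variable names `df`, `localities_per_postalcode`, and `postalcode_determines_locality` are assumptions for illustration, not part of the exercise):

```python
import pandas as pd

# Hypothetical stand-in for three `volunteer` columns; the values are made up.
df = pd.DataFrame({
    "locality": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn"],
    "region": ["NY", "NY", "NY", "NY"],
    "postalcode": [10010, 10026, 11201, 11215],
})

# Distinct values per column: a column with a single value carries no signal.
print(df.nunique())

# If every postalcode maps to exactly one locality, then locality adds
# nothing beyond what postalcode already encodes.
localities_per_postalcode = df.groupby("postalcode")["locality"].nunique()
postalcode_determines_locality = bool((localities_per_postalcode == 1).all())
print(postalcode_determines_locality)
```

On the real dataset the same two checks can be run directly on `volunteer`, keeping in mind that missing values may need handling first.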

## Instructions

- Create a list of redundant column names and store it in the `to_drop` variable:
  - Out of all the location-related features, keep only `postalcode`.
  - Features that have gone through the feature engineering process are redundant as well.
- Drop the columns in the `to_drop` list from the dataset.
- Print out the `.head()` of `volunteer_subset` to see the selected columns.

In [1]:
import pandas as pd

volunteer = pd.read_csv("volunteer_opportunities.csv")

In [2]:
# Create a list of redundant column names to drop
# locality, region -> already got postalcode
# created_date -> preprocessed and added created_month
# vol_requests -> preprocessed and added vol_requests_lognorm
# category_desc -> already got one-hot encoded
to_drop = ["locality", "region", "created_date", "vol_requests", "category_desc"]

# Drop those columns from the dataset
volunteer_subset = volunteer.drop(to_drop, axis=1)

# Print out the head of volunteer_subset
print(volunteer_subset.head())

   opportunity_id  content_id  event_time  \
0            4996       37004           0   
1            5008       37036           0   
2            5016       37143           0   
3            5022       37237           0   
4            5055       37425           0   

                                               title  hits  \
0  Volunteers Needed For Rise Up & Stay Put! Home...   737   
1                                       Web designer    22   
2      Urban Adventures - Ice Skating at Lasker Rink    62   
3  Fight global hunger and support women farmers ...    14   
4                                      Stop 'N' Swap    31   

                                             summary is_priority  category_id  \
0  Building on successful events last summer and ...         NaN          NaN   
1             Build a website for an Afghan business         NaN          1.0   
2  Please join us and the students from Mott Hall...         NaN          1.0   
3  The Oxfam Action Corps is a g

# Checking for correlated features

You'll now return to the `wine` dataset, which consists of continuous, numerical features. Run Pearson's correlation coefficient on the dataset to determine which columns are good candidates for eliminating. Then, remove those columns from the DataFrame.

## Instructions

- Print out the Pearson correlation coefficients for each pair of features in the `wine` dataset.
- Drop any columns from `wine` that have a correlation coefficient above 0.75 with at least two other columns.
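
Rather than reading the correlation matrix by eye, the rule in the second instruction can be sketched programmatically. The example below runs on synthetic data, not on `wine`; the column names, the `threshold` variable, and the choice to compare absolute correlations are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic data: a, b, and c share a common signal; d is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=200),
    "c": base + rng.normal(scale=0.1, size=200),
    "d": rng.normal(size=200),
})

# Absolute Pearson correlations (assumption: strong negative correlation
# counts as redundancy too).
corr = df.corr().abs()

# For each column, count how many OTHER columns exceed the threshold.
threshold = 0.75
n_high = (corr > threshold).sum(axis=1) - 1  # subtract the self-correlation
candidates = n_high[n_high >= 2].index.tolist()
print(candidates)
```

The result is only a list of candidates. When several columns are mutually correlated, as `a`, `b`, and `c` are here, dropping one of them and recomputing is usually more sensible than dropping them all. On a labelled dataset, the label column would typically be excluded before running this check.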

In [3]:
wine = pd.read_csv('wine_types.csv')

In [4]:
# Print out the column correlations of the wine dataset
print(wine.corr())

# Drop that column from the DataFrame
wine = wine.drop(['Flavanoids'], axis=1)

print(wine.head())

                                  Type   Alcohol  Malic acid       Ash  \
Type                          1.000000 -0.328222    0.437776 -0.049643   
Alcohol                      -0.328222  1.000000    0.094397  0.211545   
Malic acid                    0.437776  0.094397    1.000000  0.164045   
Ash                          -0.049643  0.211545    0.164045  1.000000   
Alcalinity of ash             0.517859 -0.310235    0.288500  0.443367   
Magnesium                    -0.209179  0.270798   -0.054575  0.286587   
Total phenols                -0.719163  0.289101   -0.335167  0.128980   
Flavanoids                   -0.847498  0.236815   -0.411007  0.115077   
Nonflavanoid phenols          0.489109 -0.155929    0.292977  0.186230   
Proanthocyanins              -0.499130  0.136698   -0.220746  0.009652   
Color intensity               0.265668  0.546364    0.248985  0.258887   
Hue                          -0.617369 -0.071747   -0.561296 -0.074667   
OD280/OD315 of diluted wines -0.788230

# Exploring text vectors, part 1

Let's expand on the text vector exploration method we just learned about, using the TF-IDF vectors of the `volunteer` dataset's `title` column. In this first part, we'll extend the function from the slides so that it returns a list of vocabulary indices for the top weighted words in a document. In the next exercise, we'll write another function that collects the top words across all documents, and then use that list to filter down our `text_tfidf` vector.

## Instructions

- Add parameters called `original_vocab`, for the `tfidf_vec.vocabulary_`, and `top_n`.
- Call `pd.Series()` on the zipped dictionary. This will make it easier to operate on.
- Use the `.sort_values()` method to sort the series and slice the index up to `top_n` words.
- Call the function, setting `original_vocab=tfidf_vec.vocabulary_`, setting `vector_index=8` to grab the 9th row, and setting `top_n=3` to grab the top 3 weighted words.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Load the volunteer dataset if not already loaded
volunteer = pd.read_csv("volunteer_opportunities.csv")

# Create TF-IDF vectorizer and fit it on the title column
tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(volunteer['title'])

# Create vocab as a reverse mapping of the vocabulary
vocab = {v: k for k, v in tfidf_vec.vocabulary_.items()}


In [6]:
# Add in the rest of the arguments
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

# Print out the vocabulary indices of the top weighted words
print(return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, 8, 3))

[188, 23, 562]
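
The list printed above holds vocabulary indices rather than words. To read such a result, the indices can be mapped back through a reversed vocabulary. The sketch below illustrates this on a tiny made-up corpus (the documents and the names `docs`, `vec`, `index_to_word`, and `top_words` are assumptions, and it uses a dense row with `np.argsort` instead of the `.indices`/`.data` approach of `return_weights()`):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus standing in for the `title` column.
docs = ["web designer needed", "volunteer web tutor", "park cleanup volunteer"]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)

# vocabulary_ maps word -> column index; reverse it to map index -> word.
index_to_word = {int(idx): word for word, idx in vec.vocabulary_.items()}

# Take the first document's weights as a dense 1-D array.
row = tfidf[0].toarray().ravel()

# Column indices of the two largest weights, then the matching words.
top_indices = np.argsort(row)[::-1][:2]
top_words = [index_to_word[int(i)] for i in top_indices]
print(top_indices, top_words)
```

In the first document, "web" also appears in another document, so it receives a lower weight than "designer" and "needed", which are unique to that document.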


# Exploring text vectors, part 2

Using the `return_weights()` function you wrote in the previous exercise, you're now going to extract the top words from each document in the text vector, return a list of the word indices, and use that list to filter the text vector down to those top words.

## Instructions

- Call `return_weights()` to return the top weighted words for that document.
- Call `set()` on the returned `filter_list` to remove duplicated numbers.
- Call `words_to_filter`, passing in the following parameters: `vocab` for the `vocab` parameter, `tfidf_vec.vocabulary_` for the `original_vocab` parameter, `text_tfidf` for the `vector` parameter, and `3` to grab the `top_n` 3 weighted words from each document.
- Finally, pass that `filtered_words` set into a list to use as a filter for the text vector.

In [7]:
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Call the return_weights function and extend filter_list
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
        
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

# Filter the columns in text_tfidf to only those in filtered_words
filtered_text = text_tfidf[:, list(filtered_words)]
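
Column filtering on a sparse matrix keeps every row and only the listed columns, in the order given. Because a `set` has no guaranteed iteration order, sorting the indices first is one way to keep the filtered columns in a known order. The following is a minimal sketch on a hand-made matrix (the values and the names `matrix`, `keep`, and `columns` are assumptions for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3-document x 4-word weight matrix.
matrix = csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.9],
    [0.7, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.4, 0.6],
]))

# Word indices to keep, held in a set as in words_to_filter().
keep = {3, 1}

# Sort so the column order of the result is predictable.
columns = sorted(keep)
filtered = matrix[:, columns]

print(filtered.shape)
print(filtered.toarray())
```

The row count is unchanged, so the filtered matrix still lines up with the original labels.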

# Training Naive Bayes with feature selection

You'll now re-run the Naive Bayes text classification model that you ran at the end of Chapter 3 with our selection choices from the previous exercise: the `volunteer` dataset's `title` and `category_desc` columns.

## Instructions

- Use `train_test_split()` on the `filtered_text` text vector and the `y` labels (the `category_desc` values), and pass `y` to the `stratify` parameter, since the class distribution is uneven.
- Fit the `nb` Naive Bayes model to `X_train` and `y_train`.
- Calculate the test set accuracy of `nb` .

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Create a mask for non-null category_desc values
mask = volunteer['category_desc'].notna()

# Filter both the target variable and the text features using the same mask
y = volunteer.loc[mask, 'category_desc']
filtered_text_clean = filtered_text[mask]

# Instantiate the Naive Bayes model (it is fit in the next cell)
nb = MultinomialNB()

In [9]:
# Split the dataset according to the class distribution of category_desc
X_train, X_test, y_train, y_test = train_test_split(filtered_text_clean.toarray(), y, stratify=y, random_state=42)

# Fit the model to the training data
nb.fit(X_train, y_train)

# Print out the model's accuracy
print(nb.score(X_test, y_test))

0.5548387096774193
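
The effect of the `stratify` argument can be seen on a small imbalanced example: the minority class keeps roughly the same share in both splits. This sketch uses made-up labels, not `category_desc`, and the names `train_share_b` and `test_share_b` are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% class "a", 20% class "b".
X = np.arange(100).reshape(-1, 1)
y = np.array(["a"] * 80 + ["b"] * 20)

# stratify=y asks for the class proportions to be preserved in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

train_share_b = (y_train == "b").mean()
test_share_b = (y_test == "b").mean()
print(train_share_b, test_share_b)
```

Without stratification, a rare class can end up under-represented in, or even absent from, the test set, which makes the reported accuracy less reliable.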


# Using PCA

In this exercise, you'll apply PCA to the `wine` dataset, to see if you can increase the model's accuracy.

## Instructions

- Instantiate a `PCA` object.
- Define the features (`X`) and labels (`y`) from `wine`, using the labels in the `"Type"` column.
- Apply PCA to `X_train` and `X_test`, ensuring no data leakage, and store the transformed values as `pca_X_train` and `pca_X_test`.
- Print out the `.explained_variance_ratio_` attribute of `pca` to check how much variance is explained by each component.

In [12]:
# Import PCA
from sklearn.decomposition import PCA

# Load the wine dataset if not already loaded
wine = pd.read_csv('wine_types.csv')

In [13]:
# Instantiate a PCA object
pca = PCA()

# Define the features and labels from the wine dataset
X = wine.drop("Type", axis=1)
y = wine["Type"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Fit PCA on the training features only, then apply the same transformation to the test features
pca_X_train = pca.fit_transform(X_train)
pca_X_test = pca.transform(X_test)

# Look at the percentage of variance explained by the different components
print(pca.explained_variance_ratio_)

[9.97795009e-01 2.02071827e-03 9.88350594e-05 5.66222566e-05
 1.26161135e-05 8.93235789e-06 3.13856866e-06 1.57406401e-06
 1.15918860e-06 7.49332354e-07 3.70332305e-07 1.94185373e-07
 8.08440051e-08]
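
In the output above, the first component accounts for about 99.8% of the variance. PCA is sensitive to feature scale, so this pattern can appear when one feature has a much larger numeric range than the others; whether that is the cause in `wine` is an assumption not checked here. The sketch below shows the effect on synthetic data and how standardizing changes it (the data and the names `raw_ratio` and `scaled_ratio` are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Three independent synthetic features; the first has a far larger scale.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(scale=1000.0, size=200),
    rng.normal(scale=1.0, size=200),
    rng.normal(scale=1.0, size=200),
])

# PCA on the raw features: the large-scale feature dominates.
raw_ratio = PCA().fit(X).explained_variance_ratio_

# PCA after standardizing each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA().fit(X_scaled).explained_variance_ratio_

print(raw_ratio)
print(scaled_ratio)
```

In a train/test workflow, the scaler would be fit on `X_train` only and then applied to `X_test`, for the same leakage reason that `pca` is fit on the training split.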


# Training a model with PCA

Now that you have run PCA on the `wine` dataset, you'll finally train a KNN model using the transformed data.

## Instructions

- Fit the `knn` model to the PCA-transformed features, `pca_X_train`, and training labels, `y_train`.
- Print the test set accuracy of the `knn` model using `pca_X_test` and `y_test`.

In [14]:
# Import KNN
from sklearn.neighbors import KNeighborsClassifier

# Create a KNN classifier
knn = KNeighborsClassifier()

In [15]:
# Fit knn to the training data
knn.fit(pca_X_train, y_train)

# Score knn on the test data and print it out
print(knn.score(pca_X_test, y_test))

0.7777777777777778
