# Introduction to preprocessing

# PART ONE

## 1 What is data preprocessing?

- After exploratory data analysis and data cleaning
- Preparing data for modeling
- Example: transforming categorical features into numerical features (dummy variables)

## Why preprocess?

- Transform dataset so it's suitable for modeling
- Improve model performance
- Generate more reliable results

![Screen Shot 2023-08-30 at 11.24.15 AM](Screen%20Shot%202023-08-30%20at%2011.24.15%20AM.png)


## Recap: exploring data with pandas

In [1]:
import pandas as pd
hiking = pd.read_json("datasets/hiking.json")
wine = pd.read_csv('datasets/wine_types.csv')
running_times_5k = pd.read_csv('datasets/running.csv')
print(hiking.head())
print(hiking.info())
print(wine.describe())

  Prop_ID                     Name  ... lat lon
0    B057  Salt Marsh Nature Trail  ... NaN NaN
1    B073                Lullwater  ... NaN NaN
2    B073                  Midwood  ... NaN NaN
3    B073                Peninsula  ... NaN NaN
4    B073                Waterfall  ... NaN NaN

[5 rows x 11 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Prop_ID         33 non-null     object 
 1   Name            33 non-null     object 
 2   Location        33 non-null     object 
 3   Park_Name       33 non-null     object 
 4   Length          29 non-null     object 
 5   Difficulty      27 non-null     object 
 6   Other_Details   31 non-null     object 
 7   Accessible      33 non-null     object 
 8   Limited_Access  33 non-null     object 
 9   lat             0 non-null      float64
 10  lon             0 non-null      float64

## Removing missing data

In [3]:
print(df)
print(df.dropna())
print(df.drop([1, 2, 3]))
print(df.drop("A", axis=1))
print(df.isna().sum())
print(df.dropna(subset=["B"]))
print(df.dropna(thresh=2))
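
The calls above assume a DataFrame named `df` that is not defined in these notes. As a minimal sketch, the toy data below (hypothetical values, with column names chosen to match the snippet) shows what each call keeps or removes:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names A, B, C mirror the snippet above.
df = pd.DataFrame({
    "A": [1.0, np.nan, 7.0, 4.0, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [3.0, np.nan, np.nan, 6.0, 8.0],
})

missing_per_column = df.isna().sum()    # number of NaN values in each column
complete_rows = df.dropna()             # rows with no missing values at all
rows_with_b = df.dropna(subset=["B"])   # rows where column B is present
at_least_two = df.dropna(thresh=2)      # rows with at least 2 non-missing values
without_a = df.drop("A", axis=1)        # every row, but without column A

print(missing_per_column)
print(complete_rows.shape, rows_with_b.shape, at_least_two.shape, without_a.shape)
```

Note that none of these calls modifies `df` in place; each returns a new DataFrame.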

## Exploring missing data
You've been given a dataset of volunteer information from New York City, stored in the `volunteer` DataFrame. Explore the dataset using pandas methods and attributes to answer the following question.

### How many missing values are in the locality column?

In [2]:
volunteer = pd.read_csv('datasets/volunteer_opportunities.csv')
print(volunteer['locality'].isna().sum())

70


## Dropping missing data
Now that you've explored the volunteer dataset and understand its structure and contents, it's time to begin dropping missing values.

In this exercise, you'll drop both columns and rows to create a subset of the volunteer dataset.

### Instructions

- Drop the `Latitude` and `Longitude` columns from `volunteer`, storing the result as `volunteer_cols`.
- Subset `volunteer_cols` by dropping rows containing missing values in the `category_desc` column, and store the result in a new variable called `volunteer_subset`.
- Take a look at the `.shape` attribute of `volunteer_subset`, to verify it worked correctly.

In [3]:
# Drop the Latitude and Longitude columns from volunteer
volunteer_cols = volunteer.drop(['Latitude', 'Longitude'], axis=1)

# Drop rows with missing category_desc values from volunteer_cols
volunteer_subset = volunteer_cols.dropna(subset=['category_desc'])

# Print out the shape of the subset
print(volunteer_subset.shape)

(617, 33)


## 2 Working With Data Types

## Why are types important?

- `object`: string/mixed types
- `int64`: integer
- `float64`: float
- `datetime64`: dates and times

## Converting column types

In [None]:
print(df.info())
df["C"] = df["C"].astype("float")
print(df.dtypes)
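
`.astype("float")` raises a `ValueError` if any entry cannot be parsed as a number. The sketch below uses hypothetical data to show one common alternative, `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN` so they can be handled as missing values:

```python
import pandas as pd

# Hypothetical data: column C holds numbers stored as strings plus one bad entry.
df = pd.DataFrame({"A": [1, 2, 3], "C": ["3.5", "4.0", "not available"]})

# df["C"].astype("float") would fail on "not available";
# errors="coerce" replaces unparseable entries with NaN instead.
df["C_num"] = pd.to_numeric(df["C"], errors="coerce")

print(df.dtypes)
print(df["C_num"].isna().sum())
```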

## Exploring data types
Taking another look at the dataset comprised of volunteer information from New York City, you want to know what types you'll be working with as you start to do more preprocessing.

### Which data types are present in the volunteer dataset?

In [7]:
print(volunteer.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 665 entries, 0 to 664
Data columns (total 35 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   opportunity_id      665 non-null    int64  
 1   content_id          665 non-null    int64  
 2   vol_requests        665 non-null    int64  
 3   event_time          665 non-null    int64  
 4   title               665 non-null    object 
 5   hits                665 non-null    int64  
 6   summary             665 non-null    object 
 7   is_priority         62 non-null     object 
 8   category_id         617 non-null    float64
 9   category_desc       617 non-null    object 
 10  amsl                0 non-null      float64
 11  amsl_unit           0 non-null      float64
 12  org_title           665 non-null    object 
 13  org_content_id      665 non-null    int64  
 14  addresses_count     665 non-null    int64  
 15  locality            595 non-null    object 
 16  region  

## Converting a column type

The exercise assumes that the `hits` column is type `object` even though it consists of integers, so it needs to be converted to type `int`. In the copy of the dataset loaded here, `.info()` already reports `hits` as `int64`, so the conversion below leaves the type unchanged.

### Instructions

- Take a look at the `.head()` of the hits column.
- Convert the `hits` column to type `int`.
- Take a look at the `.dtypes` of the dataset again, and notice that the column type has changed.

In [4]:
# Print the head of the hits column
print(volunteer["hits"].head())

# Convert the hits column to type int
volunteer["hits"] = volunteer["hits"].astype('int')

# Look at the dtypes of the dataset
print(volunteer.dtypes)

0    737
1     22
2     62
3     14
4     31
Name: hits, dtype: int64
opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64


## 3 Training and test sets

## Why split?
1. Helps detect overfitting, since the model is scored on data it has not seen
2. Evaluate performance on a holdout set

## Splitting up your dataset

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Stratified sampling
- Dataset of 100 samples: 80 class 1 and 20 class 2
- Training set of 75 samples: 60 class 1 and 15 class 2
- Test set of 25 samples: 20 class 1 and 5 class 2
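
The proportions above can be checked with a small sketch. The feature column and label values here are made up for illustration; only the 80/20 class balance comes from the example:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 samples, 80 of class 1 and 20 of class 2.
X = pd.DataFrame({"feature": np.arange(100)})
y = pd.Series([1] * 80 + [2] * 20, name="labels")

# test_size defaults to 0.25, which gives 75 training and 25 test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

train_counts = y_train.value_counts().to_dict()
test_counts = y_test.value_counts().to_dict()
print(train_counts)
print(test_counts)
```

Without `stratify=y`, the split is purely random and the class ratio in each set can drift away from 80/20.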

## Stratified sampling

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X, y, stratify=y, random_state=42)
y["labels"].value_counts()
y_train["labels"].value_counts()
y_test["labels"].value_counts()

## Exercise
In the volunteer dataset, you're thinking about trying to predict the category_desc variable using the other features in the dataset. First, though, you need to know what the class distribution (and imbalance) is for that label.

### Which descriptions occur fewer than 50 times in the volunteer dataset?

In [9]:
classes = volunteer['category_desc'].value_counts()
print(classes)
print(classes[classes < 50])

Strengthening Communities    307
Helping Neighbors in Need    119
Education                     92
Health                        52
Environment                   32
Emergency Preparedness        15
Name: category_desc, dtype: int64
Environment               32
Emergency Preparedness    15
Name: category_desc, dtype: int64


## Stratified sampling
You now know that the distribution of class labels in the `category_desc` column of the volunteer dataset is uneven. If you want to train a model to predict `category_desc`, you'll need to ensure that the model is trained on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this!

### Instructions

- Create a DataFrame of features, `X`, with all of the columns except `category_desc`.
- Create a DataFrame of labels, `y`, from the `category_desc` column.
- Split `X` and `y` into training and test sets, ensuring that the class distribution in the labels is the same in both sets.
- Print the labels and counts in `y_train` using `.value_counts()`.

In [10]:
# import package
from sklearn.model_selection import train_test_split
volunteer = pd.read_csv('datasets/volunteer.csv')
# Create a DataFrame with all columns except category_desc
X = volunteer.drop('category_desc', axis=1)

# Create a category_desc labels dataset
y = volunteer[['category_desc']]

# Use stratified sampling to split up the dataset according to the y dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Print the category_desc counts from y_train
print(y_train['category_desc'].value_counts())

Strengthening Communities    230
Helping Neighbors in Need     89
Education                     69
Health                        39
Environment                   24
Emergency Preparedness        11
Name: category_desc, dtype: int64


# PART TWO

## 1 Standardization

## What is standardization?

**Standardization**: transform continuous data to appear normally distributed  
- Many scikit-learn models assume normally distributed data
- Using non-normal training data can introduce bias
- Log normalization and feature scaling in this course
- Applied to continuous numerical data


## When to standardize: linear distances
- Model in linear space
- Examples:
    - k-Nearest Neighbors (kNN)
    - Linear regression
    - K-Means Clustering

![Screen Shot 2023-08-30 at 12.40.26 PM](Screen%20Shot%202023-08-30%20at%2012.40.26%20PM.png)

## When to standardize: high variance
- Model in linear space

- Examples:
    - k-Nearest Neighbors (kNN)
    - Linear regression
    - K-Means Clustering

- Dataset features have high variance

## When to standardize: different scales

- Features are on different scales
- Example:
    - Predicting house prices using number of bedrooms and last sale price
- Linearity assumptions
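
To see why differing scales matter for distance-based models, here is a minimal sketch using three hypothetical houses (the numbers are invented for illustration). Before scaling, Euclidean distance is driven almost entirely by price; after standardizing, a three-bedroom difference and a $50,000 price difference contribute on a comparable footing:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical houses: [number of bedrooms, last sale price in dollars]
houses = np.array([
    [2, 300_000],
    [5, 301_000],
    [2, 350_000],
], dtype=float)


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))


# Raw features: house 1 looks far closer to house 0 than house 2 does,
# even though it has three more bedrooms.
raw_0_1 = dist(houses[0], houses[1])
raw_0_2 = dist(houses[0], houses[2])

# Standardized features: both differences now have similar magnitude.
scaled = StandardScaler().fit_transform(houses)
scaled_0_1 = dist(scaled[0], scaled[1])
scaled_0_2 = dist(scaled[0], scaled[2])

print(raw_0_1, raw_0_2)
print(scaled_0_1, scaled_0_2)
```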


## Modeling without normalizing
Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first.

Here we have a subset of the `wine` dataset. One of the columns, `Proline`, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which you'll learn about in the next section.

The `scikit-learn` model training process should be familiar to you at this point, so we won't go too in-depth with it. You already have a k-nearest neighbors model available (`knn`) as well as the `X` and `y` sets you need to fit and score on.

### Instructions

- Split up the `X` and `y` sets into training and test sets, ensuring that class labels are equally distributed in both sets.
- Fit the `knn` model to the training features and labels.
- Print the test set accuracy of the `knn` model using the `.score()` method.

In [11]:
# Import the datasets
X = pd.read_csv('datasets/X.csv')
y = pd.read_csv('datasets/y.csv', index_col=[0])
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
# Import KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

# Fit the knn model to the training data
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

0.9777777777777777


## 2 Log normalization

## What is log normalization?

- Useful for features with high variance
- Applies logarithm transformation
- Natural log using the constant $e \ (\approx 2.718)$
- $e^{3.4} \approx 30$
- Captures relative changes and the magnitude of change, and keeps values positive for inputs greater than 1

|Number  |Log|
|----  |---|
|30  |3.4|
|300  |5.7|
|3000  |8|
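
The table values can be reproduced with `np.log`, as in the short sketch below. It also shows the "relative change" property: each tenfold increase in the input adds the same constant, $\ln 10 \approx 2.303$, to the log.

```python
import numpy as np

values = np.array([30, 300, 3000])

# Natural log of each value, rounded to one decimal place as in the table
logs = np.round(np.log(values), 1)
print(logs)

# Differences between consecutive logs: a tenfold increase always adds ln(10)
diffs = np.diff(np.log(values))
print(diffs)
```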


## Log normalization in Python

In [None]:
print(df)
import numpy as np
df["log_2"] = np.log(df["col2"])
print(df)
print(df[["col1", "log_2"]].var())

## Checking the variance
Check the variance of the columns in the wine dataset.  
Out of the four columns listed, which column is the most appropriate candidate for normalization?

In [12]:
print(wine.var())

Type                                0.600679
Alcohol                             0.659062
Malic acid                          1.248015
Ash                                 0.075265
Alcalinity of ash                  11.152686
Magnesium                         203.989335
Total phenols                       0.391690
Flavanoids                          0.997719
Nonflavanoid phenols                0.015489
Proanthocyanins                     0.327595
Color intensity                     5.374449
Hue                                 0.052245
OD280/OD315 of diluted wines        0.504086
Proline                         99166.717355
dtype: float64


## Log normalization in Python

Now that we know that the Proline column in our wine dataset has a large amount of variance, let's log normalize it.
`numpy` has been imported as `np`.

### Instructions

- Print out the variance of the Proline column for reference.
- Use the `np.log()` function on the Proline column to create a new, log-normalized column named `Proline_log`.
- Print out the variance of the `Proline_log` column to see the difference.

In [14]:
# Import numpy as np
import numpy as np
# Print out the variance of the Proline column
print(wine['Proline'].var())

# Apply the log normalization function to the Proline column
wine['Proline_log'] = np.log(wine['Proline'])

# Check the variance of the normalized Proline column
print(wine['Proline_log'].var())

99166.71735542436
0.17231366191842012


## 3 Scaling data

## What is feature scaling?

- Features on different scales
- Model with linear characteristics
- Center features around 0 and transform to variance of 1
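- Rescales each feature linearly, so the shape of its distribution is unchanged; data that is already bell-shaped ends up approximately standard normal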


## How to scale data

In [None]:
print(df)
print(df.var())
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_scaled)
print(df_scaled.var())
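
As a sketch of what `StandardScaler` computes, the example below uses hypothetical data and compares the scaler's output with the manual formula $(x - \text{mean}) / \text{std}$. One detail worth knowing: the scaler uses the population standard deviation (`ddof=0`), while `DataFrame.var()` defaults to the sample variance (`ddof=1`), so the printed variance of scaled data is slightly above 1 for small datasets.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data with columns on very different scales
df = pd.DataFrame({
    "col1": [1.00, 1.20, 0.75, 1.60],
    "col2": [48.0, 100.0, 250.0, 87.0],
    "col3": [100.0, 333.0, 64.0, 900.0],
})

scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# The same transformation written out by hand
manual = (df - df.mean()) / df.std(ddof=0)

print(df_scaled.mean().round(6))   # each column is centered at 0
print(df_scaled.var(ddof=0))       # population variance is 1
print(df_scaled.var())             # sample variance is n / (n - 1)
```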

## Scaling data - investigating columns
You want to use the Ash, Alcalinity of ash, and Magnesium columns in the wine dataset to train a linear model, but it's possible that these columns are all measured in different ways, which would bias a linear model.

### Which of the following statements about these columns is true?

In [15]:
print(wine.describe())

             Type     Alcohol  ...      Proline  Proline_log
count  178.000000  178.000000  ...   178.000000   178.000000
mean     1.938202   13.000618  ...   746.893258     6.530303
std      0.775035    0.811827  ...   314.907474     0.415107
min      1.000000   11.030000  ...   278.000000     5.627621
25%      1.000000   12.362500  ...   500.500000     6.215606
50%      2.000000   13.050000  ...   673.500000     6.512486
75%      3.000000   13.677500  ...   985.000000     6.892642
max      3.000000   14.830000  ...  1680.000000     7.426549

[8 rows x 15 columns]


## Scaling data - standardizing columns
Since we know that the Ash, Alcalinity of ash, and Magnesium columns in the wine dataset are all on different scales, let's standardize them in a way that allows for use in a linear model.

### Instructions

- Import the `StandardScaler` class.
- Instantiate a `StandardScaler()` and store it in the variable `scaler`.
- Create a subset of the `wine` DataFrame containing the `Ash`, `Alcalinity of ash`, and `Magnesium` columns, and assign it to `wine_subset`.
- Fit the scaler to `wine_subset` and transform it.

In [16]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Create the scaler
scaler = StandardScaler()

# Subset the DataFrame you want to scale 
wine_subset = wine[['Ash', 'Alcalinity of ash', 'Magnesium']]

# Apply the scaler to wine_subset
wine_subset_scaled = scaler.fit_transform(wine_subset)

## 4 Standardized data and modeling

## K-nearest neighbors
- Data leakage: non-training data is used to train the model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
knn = KNeighborsClassifier()
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn.fit(X_train_scaled, y_train)
knn.score(X_test_scaled, y_test)
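
One way to make the leakage-safe ordering harder to get wrong is to wrap the scaler and the model in a scikit-learn pipeline, so that the scaler is only ever fitted on whatever data is passed to `.fit()`. This is a sketch, not the method used in these notes, and it assumes the wine data bundled with scikit-learn (`load_wine`), which may differ from the course's CSV files:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine dataset stands in for the course data.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# The pipeline fits the scaler on the training data only, then applies the
# already-fitted scaler to the test data when scoring.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X_train, y_train)

score = pipe.score(X_test, y_test)
print(score)
```

The same pipeline object can be passed to cross-validation utilities, where the scaler is refitted inside each fold.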

## KNN on non-scaled data

Before adding standardization to your `scikit-learn` workflow, you'll first take a look at the accuracy of a K-nearest neighbors model on the wine dataset without standardizing the data.

The `knn` model, as well as the `X` and `y` data and label sets, have already been created.

### Instructions

- Split the dataset into training and test sets.
- Fit the `knn` model to the training data.
- Print out the test set accuracy of your trained `knn` model.

In [17]:
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Fit the k-nearest neighbors model to the training data
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

# Score the model on the test data
print(knn.score(X_test, y_test))

0.9777777777777777


## KNN on scaled data
The accuracy score on the unscaled wine dataset was decent, but let's see what you can achieve by using standardization. Once again, the `knn` model, as well as the `X` and `y` data and label sets, have already been created for you.

### Instructions

- Instantiate a `StandardScaler()` and store it in a variable named `scaler`.
- Scale the training and test features, being careful not to introduce data leakage.
- Fit the `knn` model to the scaled training data.
- Evaluate the model's performance by computing the test set accuracy.

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Instantiate a StandardScaler
scaler = StandardScaler()

# Scale the training and test features
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train_scaled, y_train)

# Score the model on the test data
print(knn.score(X_test_scaled, y_test))

0.9777777777777777


# PART THREE

## 1 Feature engineering

## What is feature engineering?

Feature engineering: Creation of new features from existing ones
- Improve performance
- Insight into relationships between features
- Need to understand the data first!
- Highly dataset-dependent

## Feature engineering scenarios

|Id  |Text|
|--  |---|
|1  |"Feature engineering is fun!"|
|2  |"Feature engineering is a lot of work."|
|3  |"I don't mind feature engineering."|

|user  |fav_color|
|---   |------|
|1  |blue|
|2  |green|
|3  |orange|


|Id  |Date|
|---  |---| 
|4  |July 30 2011|
|5  |January 29 2011|
|6  |February 05 2011|

|user  |test1  |test2  |test3|
|--|--|--|--|
|1  |90.5  |89.6  |91.4|
|2  |65.5  |70.6  |67.3|
|3  |78.1  |80.7  |81.8|


## 2 Encoding categorical variables

## Categorical variables

|  |user  |subscribed  |fav_color|
|--|--|--|--|
|0     |1          |y      |blue|
|1     |2          |n     |green|
|2     |3          |n    |orange|
|3     |4          |y     |green|


## Encoding binary variables - pandas

In [None]:
print(users["subscribed"])
users["sub_enc"] = users["subscribed"].apply(lambda val: 1 if val == "y" else 0)
print(users[["subscribed", "sub_enc"]])

## Encoding binary variables - scikit-learn

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
users["sub_enc_le"] = le.fit_transform(users["subscribed"])
print(users[["subscribed", "sub_enc_le"]])
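
Both approaches can be compared side by side on a small, self-contained example. The `users` data below is taken from the table in the "Categorical variables" section; the use of `.map()` in place of `.apply()` is a stylistic choice in this sketch. Note that `LabelEncoder` assigns integers in sorted order of the labels, so here `n` becomes 0 and `y` becomes 1:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

users = pd.DataFrame({
    "user": [1, 2, 3, 4],
    "subscribed": ["y", "n", "n", "y"],
})

# pandas: an explicit mapping makes the encoding visible in the code
users["sub_enc"] = users["subscribed"].map({"y": 1, "n": 0})

# scikit-learn: the encoder learns the classes and remembers them
le = LabelEncoder()
users["sub_enc_le"] = le.fit_transform(users["subscribed"])

print(users)
print(le.classes_)                    # labels in the order of their codes
print(le.inverse_transform([0, 1]))   # codes mapped back to labels
```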

## Encoding categorical variables - binary

Take a look at the hiking dataset. There are several columns here that need encoding before they can be modeled, one of which is the `Accessible` column. `Accessible` is a binary feature, so it has two values, Y or N, which need to be encoded into 1's and 0's. Use `scikit-learn`'s `LabelEncoder` class to perform this transformation.

### Instructions

- Store `LabelEncoder()` in a variable named `enc`.
- Using the encoder's `.fit_transform()` method, encode the hiking dataset's `"Accessible"` column. Call the new column `Accessible_enc`.
- Compare the two columns side-by-side to see the encoding.

In [19]:
from sklearn.preprocessing import LabelEncoder
# Set up the LabelEncoder object
enc = LabelEncoder()

# Apply the encoding to the "Accessible" column
hiking["Accessible_enc"] = enc.fit_transform(hiking["Accessible"])

# Compare the two columns
print(hiking[["Accessible", "Accessible_enc"]].head())

  Accessible  Accessible_enc
0          Y               1
1          N               0
2          N               0
3          N               0
4          N               0


## One-hot encoding

Values: [blue, green, orange]

- blue: [1, 0, 0]
- green: [0, 1, 0]
- orange: [0, 0, 1]

In [None]:
print(users["fav_color"])
print(pd.get_dummies(users["fav_color"]))
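
A self-contained sketch of the same idea, using the `fav_color` values from the earlier table. The `dtype=int` and `drop_first=True` arguments are additions for illustration: recent pandas versions return boolean columns by default, and dropping the first category removes a column that is fully determined by the others, which can matter for linear models.

```python
import pandas as pd

users = pd.DataFrame({"fav_color": ["blue", "green", "orange", "green"]})

# One column per category, encoded as 0/1 integers
one_hot = pd.get_dummies(users["fav_color"], dtype=int)
print(one_hot)

# Same encoding with the first category ("blue") dropped:
# a row of all zeros then means "blue"
reduced = pd.get_dummies(users["fav_color"], dtype=int, drop_first=True)
print(reduced)
```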

## Encoding categorical variables - one-hot

One of the columns in the volunteer dataset, `category_desc`, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need to use one-hot encoding to transform this column numerically. Use pandas' `pd.get_dummies()` function to do so.

### Instructions

- Call `pd.get_dummies()` on the `volunteer["category_desc"]` column to create the encoded columns and assign the result to `category_enc`.
- Print out the `.head()` of the `category_enc` variable to take a look at the encoded columns.

In [20]:
# Transform the category_desc column
category_enc = pd.get_dummies(volunteer["category_desc"])

# Take a look at the encoded columns
print(category_enc.head())

   Education  ...  Strengthening Communities
0          0  ...                          1
1          0  ...                          1
2          0  ...                          1
3          0  ...                          0
4          0  ...                          0

[5 rows x 6 columns]


## 3 Engineering numerical features

In [None]:
print(temps)
temps["mean"] = temps.loc[:,"day1":"day3"].mean(axis=1)
print(temps)
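
The snippet above relies on a `temps` DataFrame that is not defined in these notes. A minimal runnable version, with hypothetical city names and temperatures, might look like this. Slicing by column label (`"day1":"day3"`) keeps non-numeric or identifier columns out of the average:

```python
import pandas as pd

# Hypothetical temperature readings over three days
temps = pd.DataFrame({
    "city": ["NYC", "SF", "LA", "Boston"],
    "day1": [68.3, 75.1, 80.3, 63.0],
    "day2": [67.9, 75.5, 84.0, 61.0],
    "day3": [67.8, 74.9, 81.3, 61.2],
})

# Row-wise mean over the day columns only (label slices include both ends)
temps["mean"] = temps.loc[:, "day1":"day3"].mean(axis=1)
print(temps)
```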

## Dates

In [None]:
print(purchases)
purchases["date_converted"] = pd.to_datetime(purchases["date"])
purchases['month'] = purchases["date_converted"].dt.month
print(purchases)
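
Here is a self-contained sketch of the date conversion, using the date strings from the table in the "Feature engineering scenarios" section. The explicit `format` argument and the extra `day_of_week` column are additions for illustration; passing the format avoids relying on automatic format inference:

```python
import pandas as pd

purchases = pd.DataFrame({
    "date": ["July 30 2011", "January 29 2011", "February 05 2011"],
})

# Parse the strings into datetime64 values
purchases["date_converted"] = pd.to_datetime(purchases["date"], format="%B %d %Y")

# The .dt accessor exposes components of each date
purchases["month"] = purchases["date_converted"].dt.month
purchases["day_of_week"] = purchases["date_converted"].dt.dayofweek  # Monday = 0

print(purchases)
```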

## Aggregating numerical features
A good use case for taking an aggregate statistic to create a new feature is when you have many features with similar, related values. Here, you have a DataFrame of running times named `running_times_5k`. For each name in the dataset, take the mean of their 5 run times.

### Instructions

- Use `.loc[]` to select all rows and the `run1` through `run5` columns, then take the `.mean()` of each row.
- Print the `.head()` of the DataFrame to see the `mean` column.

In [21]:
# Use .loc to create a mean column from the five run-time columns only;
# selecting every column would also average in the "Unnamed: 0" index column
running_times_5k["mean"] = running_times_5k.loc[:, "run1":"run5"].mean(axis=1)

# Take a look at the results
print(running_times_5k.head())

   Unnamed: 0   name  run1  run2  run3  run4  run5   mean
0           0    Sue  20.1  18.5  19.6  20.3  18.3  19.36
1           1   Mark  16.5  17.1  16.9  17.6  17.3  17.08
2           2   Sean  23.5  25.1  25.2  24.6  23.9  24.46
3           3   Erin  21.7  21.1  20.9  22.1  22.2  21.60
4           4  Jenny  25.8  27.1  26.1  26.7  26.9  26.52


## Extracting datetime components

There are several columns in the volunteer dataset comprised of datetimes. Let's take a look at the `start_date_date` column and extract just the month to use as a feature for modeling.

### Instructions

- Convert the `start_date_date` column into a `pandas` datetime column and store it in a new column called `start_date_converted`.
- Retrieve the month component of `start_date_converted` and store it in a new column called `start_date_month`.
- Print the `.head()` of just the `start_date_converted` and `start_date_month` columns.

In [22]:
# First, convert string column to date column
volunteer["start_date_converted"] = pd.to_datetime(volunteer['start_date_date'])

# Extract just the month from the converted column
volunteer["start_date_month"] = volunteer["start_date_converted"].dt.month

# Take a look at the converted and new month columns
print(volunteer[['start_date_converted', 'start_date_month']].head())

  start_date_converted  start_date_month
0           2011-02-01                 2
1           2011-01-29                 1
2           2011-02-14                 2
3           2011-02-05                 2
4           2011-02-12                 2


## 4 Engineering text features

## Extraction
Regular expressions: code to identify patterns

- `\d+`: `\d` means we want to grab digits, `+` means we want to grab as many as possible
- `\.`: `\.` means we want to grab periods/decimal points
- `\d+`: Putting this after the decimal point means grab the digits after the point.

In [23]:
import re
my_string = "temperature:75.6 F"
temp = re.search(r"\d+\.\d+", my_string)
print(float(temp.group(0)))

75.6
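
The pattern `\d+\.\d+` only matches numbers that contain a decimal point, so a string such as `"3 miles"` would produce no match. As a sketch of one possible variation (the helper name `first_number` is hypothetical), the decimal part can be made optional with a non-capturing group:

```python
import re

# \d+ matches the integer part; (?:\.\d+)? optionally matches a decimal part
pattern = re.compile(r"\d+(?:\.\d+)?")


def first_number(text):
    """Return the first number found in text as a float, or None if absent."""
    match = pattern.search(text)
    return float(match.group(0)) if match else None


print(first_number("temperature:75.6 F"))
print(first_number("3 miles"))
print(first_number("no digits here"))
```

Using a raw string (`r"..."`) keeps Python from interpreting the backslashes before the regex engine sees them.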


## Vectorizing text
TF/IDF: Vectorizes words based upon importance
- TF = Term Frequency
- IDF = Inverse Document Frequency

## Vectorizing text

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
print(documents.head())
tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)

## Text classification

$$
P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}
$$
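
As a worked instance of the formula above, the sketch below plugs in hypothetical probabilities: $A$ is "the document belongs to a given class" and $B$ is "the document contains a given word". None of these numbers come from the volunteer dataset.

```python
# Hypothetical probabilities, chosen only to illustrate the arithmetic
p_a = 0.2               # P(A): prior probability of the class
p_b_given_a = 0.5       # P(B | A): word appears in documents of the class
p_b_given_not_a = 0.05  # P(B | not A): word appears in other documents

# Law of total probability gives P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem gives the posterior P(A | B)
p_a_given_b = p_b_given_a * p_a / p_b

print(p_b)
print(p_a_given_b)
```

Seeing the word raises the probability of the class from the prior of 0.2 to roughly 0.71 in this made-up case.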

## Extracting string patterns
The Length column in the hiking dataset is a column of strings, but contained in the column is the mileage for the hike. We're going to extract this mileage using regular expressions, and then use a lambda in pandas to apply the extraction to the DataFrame.

### Instructions

- Search the text in the `length` argument for numbers and decimals using an appropriate pattern.
- Extract the matched pattern and convert it to a float.
- Apply the `return_mileage()` function to each row in the `hiking["Length"]` column.

In [25]:
hiking_new = pd.read_csv('datasets/hiking.csv', index_col=[0])

# Write a pattern to extract numbers and decimals
def return_mileage(length):
    
    # Search the text for matches
    mile = re.search(r"\d+\.\d+", length)
    
    # If a value is returned, use group(0) to return the found value
    if mile is not None:
        return float(mile.group(0))
        
# Apply the function to the Length column and take a look at both columns
hiking_new["Length_num"] = hiking_new["Length"].apply(return_mileage)
print(hiking_new[["Length", "Length_num"]].head())

       Length  Length_num
0   0.8 miles        0.80
1    1.0 mile        1.00
2  0.75 miles        0.75
3   0.5 miles        0.50
4   0.5 miles        0.50


## Vectorizing text

You'll now transform the volunteer dataset's title column into a text vector, which you'll use in a prediction task in the next exercise.

### Instructions

- Store the `volunteer["title"]` column in a variable named `title_text`.
- Instantiate a `TfidfVectorizer` as `tfidf_vec`.
- Transform the text in `title_text` into a tf-idf vector using `tfidf_vec`.

In [26]:
# Take the title text
title_text = volunteer["title"]

# Create the vectorizer object
tfidf_vec = TfidfVectorizer()

# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)

## Text classification using tf/idf vectors

Now that you've encoded the volunteer dataset's title column into tf/idf vectors, you'll use those vectors to predict the `category_desc` column.

### Instructions

- Split the `text_tfidf` vector and `y` target variable into training and test sets, setting the `stratify` parameter equal to `y`, since the class distribution is uneven. Notice that we run the `.toarray()` method on the tf/idf vector to convert it from a sparse matrix to a dense array before splitting.
- Fit the `X_train` and `y_train` data to the Naive Bayes model, `nb`.
- Print out the test set accuracy.

In [27]:
# Import MultinomialNB
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
# Split the dataset according to the class distribution of category_desc
y = volunteer["category_desc"]
X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y, random_state=42)

# Fit the model to the training data
nb.fit(X_train, y_train)

# Print out the model's accuracy
print(nb.score(X_test, y_test))

0.5548387096774193


# PART FOUR

## 1 Feature selection

## What is feature selection?

- Selecting features to be used for modeling
- Doesn't create new features
- Improve model's performance

## When to select features

|city  |state  |lat  |long|
|---  |---  |---  |---|
|hico  |tx  |31.982778  |-98.033333|
|mackinaw city  |mi  |45.783889  |-84.727778|
|winchester  |ky  |37.990000  |-84.179722|

- Reducing noise
- Features are strongly statistically correlated
- Reduce overall variance

## Selecting relevant features
In this exercise, you'll identify the redundant columns in the volunteer dataset, and perform feature selection on the dataset to return a DataFrame of the relevant features.
For example, if you explore the volunteer dataset in the console, you'll see three features which are related to location: locality, region, and postalcode. They contain related information, so it would make sense to keep only one of the features.

Take some time to examine the features of volunteer in the console, and try to identify the redundant features.

### Instructions

- Create a list of redundant column names and store it in the `to_drop` variable:
    - Out of all the location-related features, keep only `postalcode`.
    - Features that have gone through the feature engineering process are redundant as well.
- Drop the columns in the `to_drop` list from the dataset.
- Print out the `.head()` of `volunteer_subset` to see the selected columns.

In [28]:
# Create a list of redundant column names to drop
to_drop = ["vol_requests", "category_desc", "locality", "region", "created_date"]

# Drop those columns from the dataset
volunteer_subset = volunteer.drop(to_drop, axis=1)

# Print out the head of volunteer_subset
print(volunteer_subset.head())

   Unnamed: 0  opportunity_id  ...  start_date_converted  start_date_month
0           1            5008  ...            2011-02-01                 2
1           2            5016  ...            2011-01-29                 1
2           3            5022  ...            2011-02-14                 2
3           4            5055  ...            2011-02-05                 2
4           5            5056  ...            2011-02-12                 2

[5 rows x 33 columns]


## 2 Removing redundant features

## Redundant features
- Remove noisy features
- Remove correlated features
- Remove duplicated features

## Scenarios for manual removal

|city  |state  |lat  |long|
|---  |---  |---  |---|
|hico  |tx  |31.982778  |-98.033333|
|mackinaw city  |mi  |45.783889  |-84.727778|
|winchester  |ky  |37.990000  |-84.179722|

## Correlated features
- Statistically correlated: features move together directionally
- Linear models assume feature independence
- Pearson's correlation coefficient

## Correlated features

In [None]:
print(df)
print(df.corr())
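
Reading a correlation matrix by eye gets harder as the number of columns grows. The sketch below shows one possible way to flag highly correlated columns programmatically. The data is synthetic, and the 0.75 threshold and variable names are assumptions for illustration; only the upper triangle of the matrix is inspected so that each pair is considered once.

```python
import numpy as np
import pandas as pd

# Synthetic data: b is almost a multiple of a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

# Absolute Pearson correlations between every pair of columns
corr = df.corr().abs()

# Keep only the entries above the diagonal; everything else becomes NaN
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag each column that is highly correlated with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.75).any()]
print(to_drop)

df_reduced = df.drop(columns=to_drop)
print(df_reduced.columns.tolist())
```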

## Checking for correlated features
You'll now return to the wine dataset, which consists of continuous, numerical features. Run Pearson's correlation coefficient on the dataset to determine which columns are good candidates for eliminating. Then, remove those columns from the DataFrame.

### Instructions

- Print out the Pearson correlation coefficients for each pair of features in the wine dataset.
- Drop any columns from wine that have a correlation coefficient above 0.75 with at least two other columns.

In [32]:
# Print out the column correlations of the wine dataset
print(wine.corr())

# Drop the highly correlated column from the DataFrame
wine = wine.drop(['Flavanoids'], axis=1)

print(wine.head())

                                  Type  ...   Proline
Type                          1.000000  ... -0.633717
Alcohol                      -0.328222  ...  0.643720
Malic acid                    0.437776  ... -0.192011
Ash                          -0.049643  ...  0.223626
Alcalinity of ash             0.517859  ... -0.440597
Magnesium                    -0.209179  ...  0.393351
Total phenols                -0.719163  ...  0.498115
Flavanoids                   -0.847498  ...  0.494193
Nonflavanoid phenols          0.489109  ... -0.311385
Proanthocyanins              -0.499130  ...  0.330417
Color intensity               0.265668  ...  0.316100
Hue                          -0.617369  ...  0.236183
OD280/OD315 of diluted wines -0.788230  ...  0.312761
Proline                      -0.633717  ...  1.000000

[14 rows x 14 columns]
   Type  Alcohol  Malic acid  ...   Hue  OD280/OD315 of diluted wines  Proline
0     1    14.23        1.71  ...  1.04                          3.92     1065
1     1 
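
The exercise picks the column by reading the correlation matrix. As a rough sketch of how the same rule could be automated, the hypothetical helper below (its name and the toy data are assumptions, not part of the course) counts, for each column, how many other columns it correlates with above a threshold.

```python
import numpy as np
import pandas as pd

def highly_correlated_columns(df, threshold=0.75, min_partners=2):
    """Hypothetical helper: return columns whose correlation with at least
    `min_partners` other columns exceeds `threshold`."""
    corr = df.corr()
    # Count entries above the threshold per column; subtract 1 for the diagonal
    partners = (corr > threshold).sum(axis=0) - 1
    return partners[partners >= min_partners].index.tolist()

# Made-up data: two noisy copies of "x" plus one independent column
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "x": base,
    "x_copy1": base + rng.normal(scale=0.1, size=200),
    "x_copy2": base + rng.normal(scale=0.1, size=200),
    "independent": rng.normal(size=200),
})

flagged = highly_correlated_columns(toy)
print(flagged)
```

Note that the helper flags every member of a correlated group, so you still have to decide which of the flagged columns to keep. It also looks only at positive correlations, matching the wording of the exercise; taking `corr.abs()` first would be one way to include strong negative correlations.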

## 3 Selecting features using text vectors

## Looking at word weights

In [None]:
print(tfidf_vec.vocabulary_)
print(text_tfidf[3].data)
print(text_tfidf[3].indices)

## Looking at word weights

In [None]:
vocab = {v: k for k, v in tfidf_vec.vocabulary_.items()}
print(vocab)
zipped_row = dict(zip(text_tfidf[3].indices,text_tfidf[3].data))
print(zipped_row)

def return_weights(vocab, vector, vector_index):    
    zipped = dict(zip(vector[vector_index].indices,                       
                      vector[vector_index].data))
    return {vocab[i]:zipped[i] for i in vector[vector_index].indices}
print(return_weights(vocab, text_tfidf, 3))
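
The cells above rely on `tfidf_vec` and `text_tfidf` already existing. A self-contained sketch of the same idea, using a made-up three-document corpus in place of the volunteer titles, might look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the volunteer titles
docs = [
    "volunteers needed for park cleanup",
    "help teach reading to children",
    "park tour guides needed",
]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps word -> column index; invert it to map index -> word
vocab = {index: word for word, index in tfidf_vec.vocabulary_.items()}

def return_weights(vocab, vector, vector_index):
    row = vector[vector_index]
    # .indices holds the column indices of the nonzero entries, .data their weights
    zipped = dict(zip(row.indices, row.data))
    return {vocab[i]: zipped[i] for i in row.indices}

weights = return_weights(vocab, text_tfidf, 2)
print(weights)
```

Words that appear in only one document ("tour", "guides") receive a higher weight in that row than words shared with other documents ("park", "needed").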

## Exploring text vectors, part 1
Let's expand on the text vector exploration method we just learned about, using the volunteer dataset's title tf-idf vectors. In this first part, we'll extend the function from the slides so that it returns a list of word indices (numbers) rather than a dictionary of weights. In the next exercise, we'll write another function to collect the top words across all documents, extract them, and then use that list to filter down our text_tfidf vector.

### Instructions

- Add parameters called original_vocab, for the tfidf_vec.vocabulary_, and top_n.
- Call pd.Series() on the zipped dictionary. This will make it easier to operate on.
- Use the .sort_values() method to sort the series in descending order and slice the index up to top_n words.
- Call the function, setting original_vocab=tfidf_vec.vocabulary_, setting vector_index=8 to grab the 9th row, and setting top_n=3, to grab the top 3 weighted words.

In [33]:
# Add in the rest of the arguments
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    
    # Transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

# Print out the weighted words
print(return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, 8, 3))

## Exploring text vectors, part 2
Using the return_weights() function you wrote in the previous exercise, you're now going to extract the top words from each document in the text vector, return a list of the word indices, and use that list to filter the text vector down to those top words.

### Instructions

- Call `return_weights()` to return the top weighted words for that document.
- Call `set()` on the returned filter_list to remove duplicated numbers.
- Call `words_to_filter`, passing in the following parameters: vocab for the vocab parameter, `tfidf_vec.vocabulary_` for the `original_vocab` parameter, `text_tfidf` for the vector parameter, and 3 to grab the top_n 3 weighted words from each document.
- Finally, pass that `filtered_words` set into a list to use as a filter for the text vector.

In [None]:
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Call the return_weights function and extend filter_list
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
        
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

# Filter the columns in text_tfidf to only those in filtered_words
filtered_text = text_tfidf[:, list(filtered_words)]
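
The final step indexes the sparse matrix by a list of column numbers. A minimal sketch of that operation on a made-up 3 x 5 weight matrix (the values and indices are illustrative only):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical weights: 3 documents (rows) x 5 vocabulary words (columns)
weights = csr_matrix(np.array([
    [0.0, 0.7, 0.0, 0.3, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.0, 0.8],
]))

# Word indices collected across documents, with a repeat
top_word_indices = [1, 4, 1]

# set() removes the repeat; sorting gives a predictable column order
keep = sorted(set(top_word_indices))

# Keep every row, but only the selected columns
filtered = weights[:, keep]
print(filtered.toarray())
```

The result has the same number of rows (documents) but only as many columns as there are unique indices.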

## Training Naive Bayes with feature selection
You'll now re-run the Naive Bayes text classification model from the end of Chapter 3, this time using the features selected in the previous exercise, on the volunteer dataset's title and category_desc columns.

### Instructions

- Use `train_test_split()` on the `filtered_text` text vector and the `y` labels (the `category_desc` labels), and pass `y` to the `stratify` parameter, since we have an uneven class distribution.
- Fit the `nb` Naive Bayes model to `X_train` and `y_train`.
- Calculate the test set accuracy of `nb`.

In [None]:
# Split the dataset according to the class distribution of category_desc
X_train, X_test, y_train, y_test = train_test_split(filtered_text.toarray(), y, stratify=y, random_state=42)

# Fit the model to the training data
nb.fit(X_train, y_train)

# Print out the model's accuracy
print(nb.score(X_test, y_test))
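
The cell above depends on `nb`, `y`, and `filtered_text` from earlier exercises. The sketch below runs the same steps end to end on made-up titles and labels. It assumes `nb` is a `GaussianNB` instance, which this section does not show; the dense `.toarray()` input is the only hint.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Made-up titles and labels with an uneven class distribution (8 vs. 4)
titles = [
    "tutor children in reading", "teach math after school",
    "homework help for students", "reading buddies for kids",
    "teach english to adults", "school library helper",
    "math tutor for teens", "after school reading club",
    "park cleanup day", "plant trees in the park",
    "beach cleanup volunteers", "community garden planting",
]
y = ["Education"] * 8 + ["Environment"] * 4

X = TfidfVectorizer().fit_transform(titles)

# stratify=y keeps the 2:1 class ratio in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X.toarray(), y, test_size=0.25, stratify=y, random_state=42
)

nb = GaussianNB()  # assumption: stands in for the course's `nb` object
nb.fit(X_train, y_train)
score = nb.score(X_test, y_test)
print(score)
```

With only three test rows the accuracy itself means little here; the point is that stratification keeps the minority class represented in the test set.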

## 4 Dimensionality reduction
## Dimensionality reduction and PCA
         
- Dimensionality reduction
    - Unsupervised learning method
    - Combines/decomposes a feature space
    - Feature extraction: here we'll use it to reduce our feature space
- PCA
    - Principal component analysis
    - Linear transformation to uncorrelated space
    - Captures as much variance as possible in each component



## PCA in scikit-learn

In [None]:
from sklearn.decomposition import PCA
pca = PCA()
df_pca = pca.fit_transform(df)
print(df_pca)
print(pca.explained_variance_ratio_)
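
The cell above assumes a numeric `df`. A self-contained sketch on made-up data (three columns, two of them almost perfectly correlated) shows the properties listed earlier: the output keeps the same shape, the variance ratios sum to 1, and the components are uncorrelated.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Made-up data: "y" is almost a linear function of "x", "z" is independent
rng = np.random.default_rng(42)
x = rng.normal(size=100)
toy = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.1, size=100),
    "z": rng.normal(size=100),
})

pca = PCA()
toy_pca = pca.fit_transform(toy)  # NumPy array, one column per component

print(toy_pca.shape)
print(pca.explained_variance_ratio_)

# Correlation between components: off-diagonal entries are (numerically) zero
component_corr = np.corrcoef(toy_pca, rowvar=False)
print(component_corr.round(6))
```

Because `x` and `y` carry nearly the same information, the last component explains only a tiny share of the variance.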

## PCA caveats
- Difficult to interpret components
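- Best applied at the end of the preprocessing journey: once the features are transformed, there is little left to do with them besides modeling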


## Using PCA
In this exercise, you'll apply PCA to the wine dataset, to see if you can increase the model's accuracy.

### Instructions

- Instantiate a PCA object.
- Define the features (X) and labels (y) from wine, using the labels in the "Type" column.
- Apply PCA to X_train and X_test, ensuring no data leakage, and store the transformed values as pca_X_train and pca_X_test.
- Print out the .explained_variance_ratio_ attribute of pca to check how much variance is explained by each component.

In [35]:
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Instantiate a PCA object
pca = PCA()

# Define the features and labels from the wine dataset
X = wine.drop(['Type'], axis=1)
y = wine["Type"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Fit PCA on the training set only, then apply the same transformation to the test set
pca_X_train = pca.fit_transform(X_train)
pca_X_test = pca.transform(X_test)

# Look at the percentage of variance explained by the different components
print(pca.explained_variance_ratio_)

[9.97802349e-01 2.02071713e-03 9.82348559e-05 5.53994004e-05
 1.10395648e-05 5.87233448e-06 3.13858204e-06 1.54420449e-06
 1.02927386e-06 3.90521513e-07 1.95535151e-07 8.99659634e-08]
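
In the output above, the first component accounts for about 99.8% of the variance. PCA is sensitive to feature scale, so a result like this can appear when one feature has much larger values than the rest. Whether that is the cause for the wine data is not verified here. The sketch below only illustrates the effect on made-up data, with `StandardScaler` used as one possible way to standardize.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent made-up features on very different scales
X = np.column_stack([
    rng.normal(loc=750, scale=300, size=200),  # large-valued feature
    rng.normal(loc=2.5, scale=0.5, size=200),  # small-valued feature
])

# Without scaling, the large-valued feature dominates the first component
raw_ratio = PCA().fit(X).explained_variance_ratio_

# After standardizing, both features contribute on an equal footing
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA().fit(X_scaled).explained_variance_ratio_

print(raw_ratio)
print(scaled_ratio)
```

In a real train/test workflow, the scaler would be fit on the training set only, just like the PCA object.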


## Training a model with PCA
Now that you have run PCA on the wine dataset, you'll finally train a KNN model using the transformed data.

### Instructions

- Fit the `knn` model to the PCA-transformed features, `pca_X_train`, and training labels, `y_train`.
- Print the test set accuracy of the `knn` model using `pca_X_test` and `y_test`.

In [36]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
# Fit knn to the training data
knn.fit(pca_X_train, y_train)

# Score knn on the test data and print it out
print(knn.score(pca_X_test, y_test))


0.7777777777777778
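
For a version that runs without the course files, the sketch below uses scikit-learn's bundled wine dataset (which may differ from `wine_types.csv`) and wraps PCA and KNN in a pipeline, so PCA is fit on the training data only. The choice of `n_components=2` is arbitrary and purely for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Assumption: scikit-learn's wine data stands in for the course CSV
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# fit() runs PCA.fit_transform on X_train, then fits KNN on the components;
# score() applies the already-fitted PCA to X_test before predicting
model = make_pipeline(PCA(n_components=2), KNeighborsClassifier())
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print(score)
```

Using a pipeline is a design choice rather than a requirement: it gives the same leakage protection as calling `fit_transform` on the training set and `transform` on the test set by hand.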


# PART FIVE

## 1 UFOs and preprocessing

## Identifying areas for preprocessing


### Important concepts to remember
- Missing data: `.dropna()` and `.isna()`
- Types: `.astype()`
- Stratified sampling: `train_test_split(X, y, stratify=y)`
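
A short sketch of the missing-data and type methods listed above, on a made-up table (the column names and values are illustrative, not the actual UFO data):

```python
import pandas as pd

# Made-up sightings table with missing values and numbers stored as strings
ufo_toy = pd.DataFrame({
    "seconds": ["30", "120", None, "45"],
    "state": ["az", None, "tx", "nv"],
})

# Count missing values per column
missing_counts = ufo_toy.isna().sum()
print(missing_counts)

# Drop only the rows that lack a duration, then convert the column to float
clean = ufo_toy.dropna(subset=["seconds"]).copy()
clean["seconds"] = clean["seconds"].astype(float)
print(clean.dtypes)
```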


## 2 Categorical variables and standardization

## Categorical variables

|   |state |country       |type|
|-----|--------|--------|------|
|295    |az      |us     | light|
|296    |tx      |us  |formation|
|297    |nv      |us   |fireball|

- One-hot encoding: `pd.get_dummies()`
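
As a sketch, one-hot encoding the `type` column of the three rows shown in the table above could look like this:

```python
import pandas as pd

# The three rows from the table above
sightings = pd.DataFrame(
    {
        "state": ["az", "tx", "nv"],
        "country": ["us", "us", "us"],
        "type": ["light", "formation", "fireball"],
    },
    index=[295, 296, 297],
)

# One indicator column per distinct value of "type"
type_dummies = pd.get_dummies(sightings["type"])

# Attach the indicator columns to the original DataFrame
encoded = pd.concat([sightings, type_dummies], axis=1)
print(encoded)
```

Each row has exactly one indicator set, in the column matching its original `type` value.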

## Standardization
- `.var()`
- `np.log()`
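
A minimal sketch of the two steps above: inspect each column's variance with `.var()`, then log-normalize the column whose variance is much higher than the rest. The data is made up, and `np.log()` is only defined for positive values.

```python
import numpy as np
import pandas as pd

# Made-up data: "seconds" spans several orders of magnitude
toy = pd.DataFrame({
    "seconds": [5.0, 60.0, 600.0, 86400.0],
    "rating": [1.0, 2.0, 3.0, 4.0],
})

# Compare variances to spot the column that needs standardizing
print(toy.var())

# Log normalization compresses the large values
toy["seconds_log"] = np.log(toy["seconds"])
print(toy["seconds_log"].var())
```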

## 3 Engineering new features

## UFO feature engineering

|date  |length_of_time  |desc|
|----|-----|-----|
|6/16/2013 |21005 minutes  |Sabino Canyon Tucson Arizona night UFO sighting.|
|9/12/2005 |22355 minutes  |Star like objects hovering in sky, slowly m...|
|12/31/2013 |22253 minutes  |Three orange fireballs spotted by witness in E...|

- Dates: `.dt.month` or `.dt.hour` attributes
- Regex: `\d` and `.group()`
- Text: tf-idf and `TfidfVectorizer`
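
The date and regex steps can be sketched on the three rows from the table above. The helper name `return_minutes` and the pattern `\d+` (one or more digits) are choices made for this example.

```python
import re
import pandas as pd

# The date and length_of_time values from the table above
ufo_toy = pd.DataFrame({
    "date": ["6/16/2013", "9/12/2005", "12/31/2013"],
    "length_of_time": ["21005 minutes", "22355 minutes", "22253 minutes"],
})

# Convert to datetime, then pull out the month with the .dt accessor
ufo_toy["date"] = pd.to_datetime(ufo_toy["date"], format="%m/%d/%Y")
ufo_toy["month"] = ufo_toy["date"].dt.month

def return_minutes(time_string):
    # Search for the first run of digits; .group() returns the matched text
    match = re.search(r"\d+", time_string)
    return int(match.group()) if match else None

ufo_toy["minutes"] = ufo_toy["length_of_time"].apply(return_minutes)
print(ufo_toy[["month", "minutes"]])
```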


## 4 Feature selection and modeling

## Feature selection and modeling
- Redundant features
- Text vector

## Final thoughts
- Iterative processes
- Know your dataset
- Understand your modeling task


## What you've learned
- Preparing data for modeling:
    - Missing data
    - Incorrect types
    - Standardize numerical values
    - Process categorical values
    - Feature engineering
    - Select features for modeling