# Preprocessing for Machine Learning in Python

## Introduction to Data Preprocessing


In this chapter you'll learn exactly what it means to preprocess data. You'll take the first steps in any preprocessing journey, including exploring data types and dealing with missing data.

# Exploring missing data

You've been given a dataset comprised of volunteer information from New York City, stored in the volunteer DataFrame. Explore the dataset using the plethora of methods and attributes pandas has to offer to answer the following question.

How many missing values are in the locality column?

volunteer["locality"].isna().sum()

# Dropping missing data

Now that you've explored the volunteer dataset and understand its structure and contents, it's time to begin dropping missing values.

In this exercise, you'll drop both columns and rows to create a subset of the volunteer dataset.

In [None]:
# Drop the Latitude and Longitude columns from volunteer
volunteer_cols = volunteer.drop(["Latitude","Longitude"],axis=1)

# Subset volunteer_cols by dropping rows with missing category_desc values, and store the result in a new variable called volunteer_subset
volunteer_subset = volunteer_cols.dropna(subset=["category_desc"])

# Print the .shape attribute of volunteer_subset to verify that the drops worked
print(volunteer_subset.shape)
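
The two kinds of drop can be tried on a tiny frame before running them on real data. The sketch below uses a made-up stand-in (`demo` and its values are hypothetical, not the volunteer data) to count missing values per column, drop a column, and then drop only the rows that are missing a label.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the volunteer DataFrame
demo = pd.DataFrame({
    "title": ["a", "b", "c"],
    "Latitude": [np.nan, np.nan, np.nan],
    "category_desc": ["Health", np.nan, "Education"],
})

# Count missing values in every column at once
missing_per_column = demo.isna().sum()

# Drop a column, then drop rows that are missing a category_desc value
cols_dropped = demo.drop(columns=["Latitude"])
rows_dropped = cols_dropped.dropna(subset=["category_desc"])

print(missing_per_column)
print(demo.shape, "->", rows_dropped.shape)
```

Neither call modifies `demo` in place, which is why each result is stored in a new variable.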

# Working with data types

# Exploring data types
Taking another look at the dataset comprised of volunteer information from New York City, you want to know what types you'll be working with as you start to do more preprocessing.

Which data types are present in the volunteer dataset?

volunteer.info()

Floats, integers, and objects

# Converting a column type

If you take a look at the volunteer dataset types, you'll see that the hits column has type object. But if you actually look at the column, you'll see that it consists of integers. Let's convert that column to type int.

In [None]:
# Print the head of the hits column
print(volunteer["hits"].head())

# Convert the hits column to type int
volunteer["hits"] = volunteer["hits"].astype(int)

# Take a look at the .dtypes of the dataset again, and notice that the column type has changed.
print(volunteer.dtypes)
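
`.astype(int)` only works when every entry can be parsed as an integer. As a minimal sketch on made-up values (the two Series below are hypothetical), here is the same conversion next to `pd.to_numeric`, which can turn unparseable entries into NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical object column that holds integers stored as strings
hits = pd.Series(["737", "22", "62"], dtype="object")
hits_int = hits.astype(int)

# If some entries are not numeric, errors="coerce" converts them to NaN
messy = pd.Series(["737", "n/a", "62"])
messy_num = pd.to_numeric(messy, errors="coerce")

print(hits_int.dtype)
print(messy_num.isna().sum())
```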

# Training and test set

# Class imbalance
In the volunteer dataset, you're thinking about trying to predict the category_desc variable using the other features in the dataset. First, though, you need to know what the class distribution (and imbalance) is for that label.

Which descriptions occur less than 50 times in the volunteer dataset?


Possible answers


-Emergency Preparedness, -Health

-Environment, -Environment and Emergency Preparedness

All of the above

### code solution

category_counts = volunteer['category_desc'].value_counts()

rare_categories = category_counts[category_counts < 50]

rare_categories

Out[1]:

category_desc

Environment               32

Emergency Preparedness    15

Name: count, dtype: int64

# Stratified sampling

You now know that the distribution of class labels in the category_desc column of the volunteer dataset is uneven. If you want to train a model to predict category_desc, you'll need to ensure that the model is trained on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this!

In [None]:
# Create a DataFrame with all columns except category_desc
X = volunteer.drop("category_desc", axis=1)

# Create a category_desc labels dataset
y = volunteer[["category_desc"]]

# Use stratified sampling to split X and y into training and test sets, so that the class distribution in the labels is the same in both sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Print the category_desc counts from y_train
print(y_train["category_desc"].value_counts())
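
To see what `stratify` does, it can help to check class proportions on both sides of the split. This is a minimal sketch on made-up labels (80 of class "a" and 20 of class "b"; the names and `test_size=0.25` are illustrative choices, not part of the exercise).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 of class "a", 20 of class "b"
y = pd.Series(["a"] * 80 + ["b"] * 20, name="label")
X = pd.DataFrame({"feature": range(100)})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42
)

# Share of each class in the training and test labels
train_share = y_train.value_counts(normalize=True)
test_share = y_test.value_counts(normalize=True)

print(train_share)
print(test_share)
```

Both sets keep the 80/20 ratio of the full label column, which a purely random split does not guarantee.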

# Standardizing Data

This chapter is all about standardizing data. Often a model will make some assumptions about the distribution or scale of your features. Standardization is a way to make your data fit these assumptions and improve the algorithm's performance.

# When to standardize
In which scenarios should you standardize your data?


A column you want to use for modeling has extremely high variance.

You have a dataset with several continuous columns on different scales, and you'd like to use a linear model to train the data.

The models you're working with use some sort of distance metric in a linear space.

# Modeling without normalizing

Let's take a look at what might happen to your model's accuracy if you try to model data without doing some sort of standardization first.

Here we have a subset of the wine dataset. One of the columns, Proline, has an extremely high variance compared to the other columns. This is an example of where a technique like log normalization would come in handy, which you'll learn about in the next section.

The scikit-learn model training process should be familiar to you at this point, so we won't go too in-depth with it. You already have a k-nearest neighbors model available (knn) as well as the X and y sets you need to fit and score on.

In [None]:
# Split up the X and y sets into training and test sets, ensuring that class labels are equally distributed in both sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

knn = KNeighborsClassifier()

# Fit the knn model to the training features and labels.
knn.fit(X_train, y_train)

# Print the test set accuracy of the knn model using the .score() method.
print(knn.score(X_test, y_test))

# Log Normalization

# Checking the variance

Check the variance of the columns in the wine dataset. Out of the four columns listed, which column is the most appropriate candidate for normalization?

wine.var()

Out[1]:

Type                                0.601

Alcohol                             0.659

Malic acid                          1.248

Ash                                 0.075

Alcalinity of ash                  11.153

Magnesium                         203.989

Total phenols                       0.392

Color intensity                     5.

Proline                         99166.717

dtype: float64

Answer: Proline

# Log normalization in Python

Now that we know that the Proline column in our wine dataset has a large amount of variance, let's log normalize it.

numpy has been imported as np.

In [None]:
# Print out the variance of the Proline column
print(wine["Proline"].var())

# Use the np.log() function on the Proline column to create a new, log-normalized column named Proline_log
wine["Proline_log"] = np.log(wine["Proline"])

# Check the variance of the normalized Proline column
print(wine["Proline_log"].var())
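
The effect of the log transform is easiest to see on values that span several orders of magnitude. The sketch below uses a made-up Series (not the Proline column) and compares the variance before and after `np.log()`.

```python
import numpy as np
import pandas as pd

# Hypothetical values spanning several orders of magnitude
values = pd.Series([10.0, 100.0, 1_000.0, 10_000.0])
values_log = np.log(values)

print(values.var())
print(values_log.var())
```

`np.log()` is the natural logarithm and is only defined for positive values, so a column that contains zeros would need different handling (for example `np.log1p()`, which computes log(1 + x)).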

# Scaling data - investigating columns

You want to use the Ash, Alcalinity of ash, and Magnesium columns in the wine dataset to train a linear model, but it's possible that these columns are all measured in different ways, which would bias a linear model.

Which of the following statements about these columns is true?

Answer:

The max of Ash is 3.23, the max of Alcalinity of ash is 30, and the max of Magnesium is 162.

# Scaling data - standardizing columns

Since we know that the Ash, Alcalinity of ash, and Magnesium columns in the wine dataset are all on different scales, let's standardize them in a way that allows for use in a linear model.

In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Instantiate a StandardScaler() and store it in the variable, scaler
scaler = StandardScaler()

# Create a subset of the wine DataFrame containing the Ash, Alcalinity of ash, and Magnesium columns, assign it to wine_subset
wine_subset = wine[["Ash", "Alcalinity of ash", "Magnesium"]]

# Fit the scaler to wine_subset and transform it in a single step
wine_subset_scaled = scaler.fit_transform(wine_subset)

# In scikit-learn, running .fit_transform() during preprocessing will both fit the method to the data as well as transform the data in a single step.
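
As a rough check of what `StandardScaler` computes, the sketch below compares it with the calculation done by hand on a made-up array (`data` is hypothetical): subtract each column's mean and divide by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical columns on very different scales
data = np.array([[2.0, 15.0, 90.0],
                 [2.5, 20.0, 110.0],
                 [3.0, 25.0, 160.0]])

scaled = StandardScaler().fit_transform(data)

# The same result by hand: subtract the column mean, divide by the population std (ddof=0)
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no single column dominates because of its units.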

# Standardized data and modeling

# KNN on non-scaled data

Before adding standardization to your scikit-learn workflow, you'll first take a look at the accuracy of a K-nearest neighbors model on the wine dataset without standardizing the data.

The knn model as well as the X features and y labels have already been created.

In [None]:
# Split the dataset and labels into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Fit the k-nearest neighbors model to the training data
knn.fit(X_train,y_train)

# Print the test set accuracy of the trained knn model
print(knn.score(X_test, y_test))

Well done! This accuracy definitely isn't poor, but let's see if we can improve it by standardizing the data.

# KNN on scaled data

The accuracy score on the unscaled wine dataset was decent, but let's see what you can achieve by using standardization. Once again, the knn model as well as the X features and y labels have already been created for you.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Instantiate a StandardScaler object and store it in a variable named scaler.
scaler = StandardScaler()

# Scale the training and test features, being careful not to introduce data leakage: fit the scaler on the training set only, then apply it to the test set
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the k-nearest neighbors model to the scaled training data
knn.fit(X_train_scaled, y_train)

# Evaluate the model's performance by computing the test set accuracy
print(knn.score(X_test_scaled, y_test))

That's quite the improvement, and definitely made scaling the data worthwhile.
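
One way to make the fit-on-train, transform-on-test pattern harder to get wrong is a scikit-learn pipeline. The sketch below is an alternative to the manual steps above, not the course's solution, and it assumes scikit-learn's bundled `load_wine()` data, which may differ from the wine DataFrame used in these exercises.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for the course dataset
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Baseline: KNN on the raw features
raw_knn = KNeighborsClassifier().fit(X_train, y_train)
raw_score = raw_knn.score(X_test, y_test)

# The pipeline fits the scaler on the training data only and reuses it when scoring
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(X_train, y_train)
scaled_score = scaled_knn.score(X_test, y_test)

print(raw_score, scaled_score)
```

Because the scaler lives inside the pipeline, the test features are never seen during fitting.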

# Feature Engineering

# Feature engineering knowledge test
Now that you've learned about feature engineering, which of the following examples are good candidates for creating new features?

A column of timestamps

A column of newspaper headlines

Correct! Timestamps can be broken into days or months, and headlines can be used for natural language processing.

# Encoding categorical variables

# Encoding categorical variables - binary

Take a look at the hiking dataset. Several columns need encoding before they can be modeled, one of which is the Accessible column. Accessible is a binary feature with two values, Y or N, which need to be encoded as 1s and 0s. Use scikit-learn's LabelEncoder class to perform this transformation.

In [None]:
# Set up the LabelEncoder object
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()

# Encode the "Accessible" column with the encoder's .fit_transform() method and store the result in a new column called Accessible_enc
hiking["Accessible_enc"] = enc.fit_transform(hiking["Accessible"])

# Compare the two columns side-by-side to see the encoding.
print(hiking[["Accessible", "Accessible_enc"]].head())

.fit_transform() is a good way to both fit an encoding and transform the data in a single step.
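
It is worth checking which value ends up as 0 and which as 1. A minimal sketch on a made-up Series (`accessible` is hypothetical): `LabelEncoder` sorts the classes, and the fitted encoder can map the codes back to the original labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary column
accessible = pd.Series(["Y", "N", "N", "Y"])

enc = LabelEncoder()
encoded = enc.fit_transform(accessible)

# Classes are stored in sorted order, so "N" becomes 0 and "Y" becomes 1
print(list(enc.classes_))
print(list(encoded))

# The fitted encoder can also reverse the mapping
print(list(enc.inverse_transform(encoded)))
```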

# Encoding categorical variables - one-hot

One of the columns in the volunteer dataset, category_desc, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need to use one-hot encoding to transform this column numerically. Use pandas' pd.get_dummies() function to do so.

In [None]:
# Call get_dummies() on the volunteer["category_desc"] column to create the encoded columns and assign it to category_enc.
category_enc = pd.get_dummies(volunteer["category_desc"])

# Print out the .head() of the category_enc variable to take a look at the encoded columns.
print(category_enc.head())

# get_dummies() is a simple and quick way to encode categorical variables.
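
As a small illustration on made-up categories (`demo` is hypothetical), the sketch below one-hot encodes a column and attaches the encoded columns back to the original frame with `pd.concat()`.

```python
import pandas as pd

# Hypothetical category column
demo = pd.DataFrame({"category_desc": ["Health", "Education", "Health", "Environment"]})

# dtype=int gives 0/1 columns; recent pandas versions return booleans by default
category_enc = pd.get_dummies(demo["category_desc"], dtype=int)

# Attach the encoded columns to the original rows
demo_encoded = pd.concat([demo, category_enc], axis=1)
print(demo_encoded)
```

Each row has exactly one 1 across the encoded columns, marking that row's category.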

# Engineering numerical features

# Aggregating numerical features

A good use case for taking an aggregate statistic to create a new feature is when you have many features with similar, related values. Here, you have a DataFrame of running times named running_times_5k. For each name in the dataset, take the mean of their 5 run times.

In [None]:
# Use .loc[] to select all rows and the columns run1 through run5, then take the .mean() of each row.
running_times_5k["mean"] = running_times_5k.loc[:, "run1":"run5"].mean(axis=1)

# Print the .head() of the DataFrame to see the mean column.
print(running_times_5k.head())

# .loc[] is especially helpful for operating across columns.
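
The same row-wise pattern works for other aggregates. A minimal sketch on made-up run times (`running_demo` and the extra `best` column are hypothetical):

```python
import pandas as pd

# Hypothetical run times for two runners
running_demo = pd.DataFrame({
    "name": ["Sue", "Mark"],
    "run1": [20.0, 25.0],
    "run2": [22.0, 27.0],
    "run3": [24.0, 26.0],
})

# Label-based slices with .loc include both endpoints; axis=1 aggregates across columns
running_demo["mean"] = running_demo.loc[:, "run1":"run3"].mean(axis=1)
running_demo["best"] = running_demo.loc[:, "run1":"run3"].min(axis=1)

print(running_demo)
```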

# Extracting datetime components

There are several columns in the volunteer dataset comprised of datetimes. Let's take a look at the start_date_date column and extract just the month to use as a feature for modeling.

In [None]:
# First, convert the start_date_date column into a pandas datetime column and store it in a new column called start_date_converted.
volunteer["start_date_converted"] = pd.to_datetime(volunteer["start_date_date"])

# Retrieve the month component of start_date_converted and store it in a new column called start_date_month.
volunteer["start_date_month"] = volunteer["start_date_converted"].dt.month

# Print the .head() of just the start_date_converted and start_date_month columns.
print(volunteer[["start_date_converted", "start_date_month"]].head())

# You can also use attributes like .day to get the day and .year to get the year from datetime columns.
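
Several components can be pulled from the same converted column through the `.dt` accessor. A short sketch on made-up date strings (the values are hypothetical):

```python
import pandas as pd

# Hypothetical date strings
dates = pd.Series(["2011-07-30", "2011-02-01", "2012-12-25"])
converted = pd.to_datetime(dates)

# Each .dt attribute returns one component per row
parts = pd.DataFrame({
    "month": converted.dt.month,
    "year": converted.dt.year,
    "weekday": converted.dt.day_name(),
})
print(parts)
```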

# Engineering text features

# Extracting string patterns

The Length column in the hiking dataset is a column of strings, but contained in the column is the mileage for the hike. We're going to extract this mileage using regular expressions, and then use a lambda in pandas to apply the extraction to the DataFrame.

In [None]:
import re

# Write a pattern to extract numbers and decimals
def return_mileage(length):
    
    # Search the text in the length argument for numbers and decimals using an appropriate pattern (a raw string keeps the backslashes intact).
    mile = re.search(r"\d+\.\d+", length)
    
    # If a match is found, extract the matched text with group(0) and convert it to a float.
    if mile is not None:
        return float(mile.group(0))
        
# Apply the return_mileage() function to each row in the hiking["Length"] column.
hiking["Length_num"] = hiking["Length"].apply(return_mileage)
print(hiking[["Length", "Length_num"]].head())
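
The pattern in return_mileage() requires a decimal point, so a whole-number length such as "3 miles" would return None, and `re.search()` raises a TypeError if it is handed a missing value instead of a string. The sketch below is one possible variant that handles both cases; `extract_mileage` and `MILEAGE_PATTERN` are hypothetical names, not part of the exercise.

```python
import re

# Optional decimal part, so "3 miles" matches as well as "0.8 miles"
MILEAGE_PATTERN = re.compile(r"\d+(?:\.\d+)?")

def extract_mileage(length):
    """Return the first number in the string as a float, or None if there is none."""
    if not isinstance(length, str):
        return None  # e.g. a missing value in the column
    match = MILEAGE_PATTERN.search(length)
    return float(match.group(0)) if match else None

for text in ["0.8 miles", "3 miles", "unknown", None]:
    print(repr(text), "->", extract_mileage(text))
```

It can be passed to `.apply()` in the same way as return_mileage().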

# Vectorizing text

You'll now transform the volunteer dataset's title column into a text vector, which you'll use in a prediction task in the next exercise.

In [None]:
# Store the volunteer["title"] column in a variable named title_text
title_text = volunteer["title"]

# Instantiate a TfidfVectorizer as tfidf_vec
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()

# Transform the text in title_text into a tf-idf vector using tfidf_vec
text_tfidf = tfidf_vec.fit_transform(title_text)

scikit-learn provides several methods for text vectorization.
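
To get a feel for what the vectorizer produces, the sketch below runs it on three made-up titles (hypothetical text, not the volunteer data) and inspects the shape of the result and the learned vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical titles
titles = [
    "volunteer at the park",
    "park cleanup volunteer",
    "tutor students in math",
]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(titles)

# One row per document, one column per distinct word
print(text_tfidf.shape)

# vocabulary_ maps each word to its column index
print(sorted(tfidf_vec.vocabulary_))
```

The result is a sparse matrix, which is why later steps call `.toarray()` before passing it to a model that expects dense input.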

# Text classification using tf/idf vectors

Now that you've encoded the volunteer dataset's title column into tf/idf vectors, you'll use those vectors to predict the category_desc column.

In [None]:
# Split the text_tfidf vector and y target variable into training and test sets, setting the stratify parameter equal to y, since the class distribution is uneven. Notice that we have to run the .toarray() method on the tf/idf vector in order to get it in the proper format for scikit-learn.
y = volunteer["category_desc"]
X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y, random_state=42)

# Fit the X_train and y_train data to the Naive Bayes model, nb.
nb.fit(X_train, y_train)

# Print out the model's accuracy (i.e., the test set accuracy)
print(nb.score(X_test, y_test))

<script.py> output:

    0.5161290322580645

Nice work! Notice that the model doesn't score very well. We'll work on selecting the best features for modeling in the next chapter.

# Selecting Features for Modeling

# When to use feature selection
You've finished standardizing your data and creating new features. Which of the following scenarios is NOT a good candidate for feature selection?

Select one answer

Several columns of running times have been averaged into a new column

A text field that hasn't been turned into a tf/idf vector yet (correct). The text field needs to be vectorized before it is removed; otherwise we might lose important data.

A column of text that has had a float extracted from it

A categorical field that has been one-hot encoded

There are columns related to whether something is a fruit or vegetable, the name of the fruit or vegetable, and the scientific name of the plant

# Removing redundant features

# Selecting relevant features
In this exercise, you'll identify the redundant columns in the volunteer dataset, and perform feature selection on the dataset to return a DataFrame of the relevant features.

For example, if you explore the volunteer dataset in the console, you'll see three features which are related to location: locality, region, and postalcode. They contain related information, so it would make sense to keep only one of the features.

Take some time to examine the features of volunteer in the console, and try to identify the redundant features.

In [None]:
# Create a list of redundant column names to drop
to_drop = ["category_desc", "created_date", "locality", "region", "vol_requests"]

# Drop those columns from the dataset
volunteer_subset = volunteer.drop(to_drop, axis=1)

# Print out the head of volunteer_subset
print(volunteer_subset.head())

# Checking for correlated features

You'll now return to the wine dataset, which consists of continuous, numerical features. Run Pearson's correlation coefficient on the dataset to determine which columns are good candidates for eliminating. Then, remove those columns from the DataFrame.

In [None]:
# Print out the Pearson correlation coefficients for each pair of features in the wine dataset.
print(wine.corr())

# Drop any columns from wine that have a correlation coefficient above 0.75 with at least two other columns.
wine = wine.drop("Flavanoids", axis=1)

print(wine.head())

Dropping correlated features is often an iterative process, so you may need to try different combinations in your model.
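
Reading a full correlation matrix by eye gets harder as the number of columns grows. The sketch below shows one common way to flag candidates programmatically on made-up data (`demo` is hypothetical). Note that it uses a simpler rule than the exercise: it flags any column that correlates above 0.75 with at least one earlier column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a_copy" is nearly a duplicate of "a", "b" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

corr = demo.corr().abs()

# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.75).any()]

print(to_drop)
```

The flagged list is a starting point; which column of a correlated pair to keep is still a modeling decision.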

# Your recent learnings

Recap of the previous lesson, Selecting Features for Modeling (chapter 4 of the course Preprocessing for Machine Learning in Python):

You learned about selecting the most important features for your modeling tasks, focusing on removing redundant features from your dataset. Redundant features are those that are unnecessary for modeling because they either duplicate information present in other features or are too strongly correlated with other features. Key points covered include:

- The importance of dropping redundant features to avoid noise in your models. For instance, if two features are strongly correlated, keeping both can introduce bias.
- How to identify redundant features, such as through manual inspection for repeated information or using Pearson's correlation coefficient for numerical data.
- The process of using pandas to calculate Pearson's correlation coefficients between pairs of features, helping to identify which features move together directionally.

This lesson emphasized the iterative nature of feature selection and the need to reassess choices based on model performance.

The goal of the next lesson is to understand how to enhance machine learning models by selecting and utilizing the most impactful features from TF-IDF vectors.

# Selecting features using text vectors

# Exploring text vectors, part 1

Let's expand on the text vector exploration method we just learned about, using the volunteer dataset's title tf/idf vectors. In this first part of text vector exploration, we're going to add to that function we learned about in the slides. We'll return a list of numbers with the function. In the next exercise, we'll write another function to collect the top words across all documents, extract them, and then use that list to filter down our text_tfidf vector.

In [None]:
# Add the remaining parameters: original_vocab (for tfidf_vec.vocabulary_) and top_n.
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Transform that zipped dict into a series(Call pd.Series() on the zipped dictionary. This will make it easier to operate on.)
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Sort the series to pull out the top n weighted words(Use the .sort_values() function to sort the series and slice the index up to top_n words.)
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

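# vocab maps each column index back to its word (the inverse of tfidf_vec.vocabulary_) and is assumed to be defined already, for example as {v: k for k, v in tfidf_vec.vocabulary_.items()}.
# Call the function, setting original_vocab=tfidf_vec.vocabulary_, vector_index=8 to grab the 9th row, and top_n=3 to grab the top 3 weighted words.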
print(return_weights(vocab,tfidf_vec.vocabulary_, text_tfidf, 8, 3))
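
The index-to-word lookup is the part that is easiest to lose track of. The sketch below is a simplified, self-contained variant on three made-up documents; `top_words` is a hypothetical helper that returns the words themselves rather than their column indices, so it is not a drop-in replacement for return_weights().

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents
docs = ["park cleanup", "math tutor", "park park tutor"]
tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(docs)

# vocabulary_ maps word -> column index; invert it to look words up by index
vocab = {index: word for word, index in tfidf_vec.vocabulary_.items()}

def top_words(vector, row, top_n):
    """Return the top_n highest-weighted words in one row of a sparse tf/idf matrix."""
    doc = vector[row]
    weights = pd.Series(doc.data, index=[vocab[i] for i in doc.indices])
    return list(weights.sort_values(ascending=False).index[:top_n])

print(top_words(text_tfidf, 2, 1))
```

In the third document "park" appears twice and "tutor" once, and both words occur in the same number of documents, so "park" carries the larger weight.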

# Exploring text vectors, part 2

Using the return_weights() function you wrote in the previous exercise, you're now going to extract the top words from each document in the text vector, return a list of the word indices, and use that list to filter the text vector down to those top words.

In [None]:
def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Call return_weights() to return the top weighted words for that document.
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
        
    # Return the list as a set, so we don't get duplicate word indices
    return set(filter_list)

# Call words_to_filter function, passing in the following parameters: vocab for the vocab parameter, tfidf_vec.vocabulary_ for the original_vocab parameter, text_tfidf for the vector parameter, and 3 to grab the top_n 3 weighted words from each document.
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, 3)

# Filter the columns in text_tfidf to only those in filtered_words(pass that filtered_words set into a list to use as a filter for the text vector.)
filtered_text = text_tfidf[:, list(filtered_words)]

Excellent! In the next exercise, you'll train a model using the filtered vector.

# Training Naive Bayes with feature selection

You'll now re-run the Naive Bayes text classification model that you ran at the end of Chapter 3 with our selection choices from the previous exercise: the volunteer dataset's title and category_desc columns.

In [None]:
# Split the filtered_text vector and the category_desc labels into training and test sets, stratifying on the labels because the class distribution is uneven
y = volunteer["category_desc"]
X_train, X_test, y_train, y_test = train_test_split(filtered_text.toarray(), y, stratify=y, random_state=42)

# Fit the nb Naive Bayes model to X_train and y_train
nb.fit(X_train, y_train)

# Print out the model's accuracy (i.e., the test set accuracy of nb)
print(nb.score(X_test,y_test))

You can see that our accuracy score wasn't that different from the score at the end of Chapter 3. But don't worry, this is mainly because of how small the title field is.

# Dimensionality reduction

# Using PCA

In this exercise, you'll apply PCA to the wine dataset, to see if you can increase the model's accuracy.

In [None]:
# Instantiate a PCA object
from sklearn.decomposition import PCA
pca = PCA()

# Define the features (X) and labels (y) from wine, using the labels in the "Type" column.
X = wine.drop(["Type"], axis=1)
y = wine["Type"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Apply PCA to X_train and X_test, ensuring no data leakage, and store the transformed values as pca_X_train and pca_X_test.
pca_X_train = pca.fit_transform(X_train)
pca_X_test = pca.transform(X_test)

# Print out the .explained_variance_ratio_ attribute of pca to check how much variance is explained by each component.
print(pca.explained_variance_ratio_)

In the next exercise, you'll train a model using the PCA-transformed vector.
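
The explained variance ratios are often easier to read as a running total. The sketch below assumes scikit-learn's bundled `load_wine()` data as a stand-in for the course's wine DataFrame, and the 95% threshold is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled wine data stands in for the course dataset
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Fit PCA on the training set only, then apply it to the test set
pca = PCA()
pca_X_train = pca.fit_transform(X_train)
pca_X_test = pca.transform(X_test)

# Running total of explained variance across components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that explains at least 95% of the variance
n_needed = int(np.argmax(cumulative >= 0.95)) + 1

print(cumulative.round(4))
print(n_needed)
```

A count like this can then be passed as `n_components` to a new `PCA` object to keep only those components.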

# Training a model with PCA

Now that you have run PCA on the wine dataset, you'll finally train a KNN model using the transformed data.

In [None]:
# Fit the knn model to the PCA-transformed features, pca_X_train, and training labels, y_train.
knn.fit(pca_X_train, y_train)

# Print the test set accuracy of the knn model using pca_X_test and y_test
print(knn.score(pca_X_test, y_test))

# Good work! PCA turned out to be a good choice for the wine dataset.

# Putting It All Together (UFOs dataset)

Now that you've learned all about preprocessing you'll try these techniques out on a dataset that records information on UFO sightings.

## Checking column types

Take a look at the UFO dataset's column types using the .info() method. Two columns jump out for transformation: the seconds column, which is a numeric column but is being read in as object, and the date column, which can be transformed into the datetime type. That will make our feature engineering efforts easier later on.

In [None]:
# Call the .info() method on the ufo dataset (it prints its summary directly and returns None, so no print() is needed).
ufo.info()

# Convert the type of the seconds column to the float data type.
ufo["seconds"] = ufo["seconds"].astype(float)

# Convert the type of the date column to the datetime data type.
ufo["date"] = pd.to_datetime(ufo["date"])

# Call .info() on ufo again to see if the changes worked (it prints its summary directly).
ufo.info()

Nice job on transforming the column types! This will make feature engineering and standardization much easier.

## Dropping missing data

In this exercise, you'll remove some of the rows where certain columns have missing values. You're going to look at the length_of_time column, the state column, and the type column. You'll drop any row that contains a missing value in at least one of these three columns.

In [None]:
# Count the missing values in the length_of_time, state, and type columns, in that order
print(ufo[["length_of_time", "state", "type"]].isna().sum())

# Drop rows where length_of_time, state, or type are missing; the subset parameter limits dropna() to those columns
ufo_no_missing = ufo.dropna(subset=["length_of_time", "state", "type"])

# Print out the shape of the new ufo_no_missing dataset.
print(ufo_no_missing.shape)

## Extracting numbers from strings

The length_of_time field in the UFO dataset is a text field that has the number of minutes within the string. Here, you'll extract that number from that text field using regular expressions.

In [None]:
import re

def return_minutes(time_string):

    # Search for numbers in time_string (a raw string keeps the backslash intact)
    num = re.search(r"\d+", time_string)
    if num is not None:
        return int(num.group(0))
        
# Use the .apply() method to call return_minutes() on every row of the length_of_time column.
ufo["minutes"] = ufo["length_of_time"].apply(return_minutes)

# Print out the .head() of both the length_of_time and minutes columns to compare.
print(ufo[["minutes","length_of_time"]].head())

## Identifying features for standardization

In this exercise, you'll investigate the variance of columns in the UFO dataset to determine which features should be standardized. After taking a look at the variances of the seconds and minutes column, you'll see that the variance of the seconds column is extremely high. Because seconds and minutes are related to each other (an issue we'll deal with when we select features for modeling), let's log normalize the seconds column.

In [None]:
# Calculate the variance in the seconds and minutes columns and take a close look at the results.
print(ufo[["minutes","seconds"]].var())

# Perform log normalization on the seconds column, transforming it into a new column named seconds_log
ufo["seconds_log"] = np.log(ufo["seconds"])

# Print out the variance of just the seconds_log column
print(ufo["seconds_log"].var())

# Your recent learnings
Recap of the previous lesson from Putting It All Together (chapter 5 of the course Preprocessing for Machine Learning in Python):

You learned about handling categorical variables and standardizing numerical data in the context of preprocessing the UFO dataset. Specifically, you focused on:

One Hot Encoding Categorical Variables: You discovered that categorical variables, such as location data and the type of encounter in the UFO dataset, can be transformed into a format that's suitable for modeling through one hot encoding. This was achieved using pandas' get_dummies function, allowing models to interpret these categorical variables effectively.

Extracting Numbers from Strings: You tackled the challenge of extracting numerical values from the length_of_time field, which contains the duration of each sighting in a text format. By employing regular expressions, you extracted the number of minutes from these strings and applied this transformation across the dataset using the .apply() method.

Standardizing Numerical Features: You learned the importance of standardization in preprocessing, particularly for the seconds column of the UFO dataset, which exhibited high variance. By calculating the variance and applying log normalization using NumPy's log function, you transformed the seconds column into a more model-friendly seconds_log column, significantly reducing its variance and making the data more uniform for analysis.

These steps are crucial in preparing your dataset for more accurate and efficient model training by ensuring that both categorical and numerical variables are in a suitable format for analysis.

The goal of the next lesson is to learn about advanced data preprocessing techniques to further enhance the performance of machine learning models.

# Engineering new features

# Encoding categorical variables

There are a couple of columns in the UFO dataset that need to be encoded before they can be modeled through scikit-learn. You'll do that transformation here, using both binary and one-hot encoding methods.

In [None]:
# Using apply(), write a conditional lambda function that returns a 1 if the value is "us", else return 0.
ufo["country_enc"] = ufo["country"].apply(lambda x: 1 if x == "us" else 0)

# Print out the number of .unique() values in the type column.
print(len(ufo["type"].unique()))

# Using pd.get_dummies(), create a one-hot encoded set of the type column.
type_set = pd.get_dummies(ufo["type"])

# Finally, use pd.concat() to concatenate the type_set encoded variables to the ufo dataset.
ufo = pd.concat([ufo, type_set], axis=1)

# Features from dates

Another feature engineering task to perform is month and year extraction. Perform this task on the date column of the ufo dataset.

In [None]:
# Look at the first 5 rows of the date column
print(ufo["date"].head())

# Extract the month from the date column
ufo["month"] = ufo["date"].dt.month

# Extract the year from the date column
ufo["year"] = ufo["date"].dt.year

# Take a look at the .head() of the date, month, and year columns.
print(ufo[["date", "month", "year"]].head())
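
The `.dt` accessor only works on datetime columns, so the code above assumes that `date` was already converted. A minimal sketch of the full round trip, using an invented `demo` DataFrame with made-up timestamps:

```python
import pandas as pd

# Toy data: dates stored as strings, as they often are after reading a CSV.
demo = pd.DataFrame({"date": ["2011-11-03 19:21:00", "2004-10-03 19:05:00"]})

# Convert to datetime first; .dt raises an AttributeError on plain strings.
demo["date"] = pd.to_datetime(demo["date"])

# Extract the calendar month and year as integer columns.
demo["month"] = demo["date"].dt.month
demo["year"] = demo["date"].dt.year

print(demo)
```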

# Text vectorization

You'll now transform the desc column in the UFO dataset into tf-idf vectors, since there's likely something you can learn from this free-text field.

In [None]:
# Take a look at the head of the desc field
print(ufo["desc"].head())

# Import and instantiate the tfidf vectorizer object
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()

# Fit and transform the desc column using vec
desc_tfidf = vec.fit_transform(ufo["desc"])

# Print out the .shape of the desc_tfidf vector, to take a look at the number of columns this created.
print(desc_tfidf.shape)

<script.py> output:

    0    It was a large&#44 triangular shaped flying ob...
    1    Dancing lights that would fly around and then ...
    2    Brilliant orange light or chinese lantern at o...
    3    Bright red light moving north to north west fr...
    4    North-east moving south-west. First 7 or so li...
    Name: desc, dtype: object
    (1866, 3422)

You'll notice that the text vector has a large number of columns. We'll work on selecting the features we want to use for modeling in the next section.
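
To see where the column count comes from, here is a minimal sketch on a made-up three-sentence corpus (`docs_demo` is invented for illustration). The vectorizer creates one column per distinct token it finds, so the width of the matrix equals the size of the learned vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration.
docs_demo = [
    "bright red light moving north",
    "bright orange light in the sky",
    "triangular object flying south",
]

vec_demo = TfidfVectorizer()
tfidf_demo = vec_demo.fit_transform(docs_demo)

# Rows are documents; columns are vocabulary terms.
print(tfidf_demo.shape)
print(len(vec_demo.vocabulary_))
```

The result is a sparse matrix: most entries are zero because each document uses only a small part of the vocabulary.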

# Selecting the ideal dataset

Now to get rid of some of the unnecessary features in the ufo dataset. Because the country column has been encoded as country_enc, you can select it and drop the other columns related to location: city, country, lat, long, and state.

You've engineered the month and year columns, so you no longer need the date or recorded columns. You also standardized the seconds column as seconds_log, so you can drop seconds and minutes.

You vectorized desc, so it can be removed. For now you'll keep type.

You can also get rid of the length_of_time column, which is unnecessary after extracting minutes.

In [None]:
# Make a list of features to drop
to_drop = ["city", "country", "date", "desc", "lat", "length_of_time", "long", "minutes", "recorded", "seconds", "state"]

# Drop these columns from ufo
ufo_dropped = ufo.drop(to_drop, axis=1)

# Use the words_to_filter() function you created previously; pass in vocab, vec.vocabulary_, desc_tfidf, and keep the top 4 words as the last parameter.
filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)
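
The body of `words_to_filter()` is not shown in this section, so the sketch below is not that function. It is a hypothetical helper, `top_n_columns()`, that illustrates the general idea under simple assumptions: for each document, keep the column indices of its highest-weighted tf-idf terms, then pool them into one set. The corpus is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def top_n_columns(tfidf_matrix, top_n):
    """Hypothetical helper: pool the column indices of the top_n
    highest-weighted terms from every row of a sparse tf-idf matrix."""
    csr = tfidf_matrix.tocsr()
    keep = set()
    for i in range(csr.shape[0]):
        # Slice out the non-zero entries that belong to row i.
        start, end = csr.indptr[i], csr.indptr[i + 1]
        cols = csr.indices[start:end]
        weights = csr.data[start:end]
        # Positions of the top_n largest weights within this row.
        order = np.argsort(weights)[::-1][:top_n]
        keep.update(int(c) for c in cols[order])
    return keep


# Toy corpus, invented for illustration.
docs_demo = [
    "bright red light moving north",
    "bright orange light in the sky",
    "triangular object flying south",
]
tfidf_demo = TfidfVectorizer().fit_transform(docs_demo)

keep = top_n_columns(tfidf_demo, 2)

# Use the pooled indices to select a narrower text vector.
filtered_demo = tfidf_demo[:, sorted(keep)]
print(filtered_demo.shape)
```

Returning a set removes duplicates when several documents share the same top terms, which is why the filtered matrix can have fewer than `top_n` times the number of rows in columns.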

# Modeling the UFO dataset, part 1

In this exercise, you're going to build a k-nearest neighbor model to predict which country the UFO sighting took place in. The X dataset contains the log-normalized seconds column, the one-hot encoded type columns, as well as the month and year when the sighting took place. The y labels are the encoded country column, where 1 is "us" and 0 is "ca".

In [None]:
# Take a look at the features in the X set of data by printing its .columns
print(X.columns)

# Split the X and y sets with a random_state of 42, stratifying on y so the class distribution is the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Fit knn to the training data
knn.fit(X_train, y_train)

# Print the test set accuracy of the knn model
print(knn.score(X_test, y_test))

<script.py> output:

    Index(['seconds_log', 'changing', 'chevron', 'cigar', 'circle', 'cone', 'cross', 'cylinder', 'diamond', 'disk', 'egg', 'fireball', 'flash', 'formation', 'light', 'other', 'oval', 'rectangle',
           'sphere', 'teardrop', 'triangle', 'unknown', 'month', 'year'],
          dtype='object')
          
    0.867237687366167

Awesome work! The model reaches a test accuracy of roughly 0.87, which suggests the feature selection choices made here were reasonable.
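
Because `X`, `y`, and `knn` are preloaded in the exercise, here is a self-contained sketch of the same workflow. Everything in it is an assumption made for illustration: the data is random noise, the classes are deliberately imbalanced, and `n_neighbors=5` is simply scikit-learn's default rather than a setting confirmed by the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, invented for illustration: 200 rows, 3 noise features,
# and an imbalanced binary label (170 ones, 30 zeros).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([1] * 170 + [0] * 30)

# stratify=y_demo keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

knn_demo = KNeighborsClassifier(n_neighbors=5)
knn_demo.fit(X_tr, y_tr)
acc = knn_demo.score(X_te, y_te)

# Share of class 1 in each split, followed by the test accuracy.
print(y_tr.mean(), y_te.mean(), acc)
```

Since the features in this sketch carry no signal, its accuracy mostly reflects the class balance. That is a useful habit in general: compare a model's accuracy against the share of the majority class before reading too much into it.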

# Modeling the UFO dataset, part 2

Finally, you'll build a model on the text vector you created, desc_tfidf, filtered down with the filtered_words list. Let's see if you can predict the type of the sighting based on the text. You'll use a Naive Bayes model for this.

In [None]:
# Use the list of filtered words we created to filter the text vector
filtered_text = desc_tfidf[:, list(filtered_words)]

# Split the X and y sets using train_test_split, setting stratify=y
X_train, X_test, y_train, y_test = train_test_split(filtered_text.toarray(), y, stratify=y, random_state=42)

# Fit nb to the training sets
nb.fit(X_train, y_train)

# Print the score of nb on the test sets
print(nb.score(X_test, y_test))

<script.py> output:

    0.17987152034261242

With an accuracy of about 0.18, this model performs very poorly on this text data. This is a clear case where iteration would be necessary to figure out what subset of text improves the model, and whether any of the other features are useful in predicting type.
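
For reference, here is a self-contained sketch of the text-to-Naive-Bayes workflow on a tiny invented corpus. The exercise does not say which Naive Bayes variant `nb` is; `GaussianNB` is an assumption here, chosen because it needs dense input, which matches the `.toarray()` call above. The sentences and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Toy corpus and labels, invented for illustration.
docs_demo = [
    "bright light hovering in the sky",
    "flashing light moving fast",
    "orange light over the lake",
    "steady white light at night",
    "large triangle shaped craft",
    "black triangle with three corners",
    "silent triangle flying low",
    "dark triangle above the trees",
]
labels_demo = ["light"] * 4 + ["triangle"] * 4

# GaussianNB (an assumed choice) expects a dense array, hence .toarray().
X_text = TfidfVectorizer().fit_transform(docs_demo).toarray()

X_tr, X_te, y_tr, y_te = train_test_split(
    X_text, labels_demo, stratify=labels_demo, random_state=42
)

nb_demo = GaussianNB()
nb_demo.fit(X_tr, y_tr)
score = nb_demo.score(X_te, y_te)
print(score)
```

A sketch like this makes it cheap to iterate: you can swap the vectorizer settings or the subset of columns and rerun the split and score in a few lines.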