# 1. Introduction to Text Encoding
So far in this course you have dealt with data that, while sometimes messy, has been generally columnar in nature. When you are faced with text data this is often not going to be the case.

2. Standardizing your text
Data that is not in a predefined form is called unstructured data, and free text is a good example of this. Before you can leverage text data in a machine learning model you must first transform it into a series of columns of numbers, or vectors. There are many different approaches to doing this, and in this chapter we will go through the most common ones. You will be working with the United States inaugural address dataset, which contains the text of each President's inaugural speech, with George Washington's shown here. It is clear that free text like this is not in tabular form.

3. Dataset
Before any text analytics can be performed, you must ensure that the text data is in a format that can be used. The speeches have been loaded as a pandas DataFrame called speech_df, with the body of the text in the 'text' column as can be seen by looking at the top five rows using the head() method as shown.

4. Removing unwanted characters
Most bodies of text contain non-letter characters, such as punctuation, that need to be removed before analysis. This can be achieved by using the replace() method along with the str accessor. We used this in an earlier chapter, but instead of specifying the exact characters you wish to replace, this time you will use patterns called regular expressions. Unless you go through the text of all the speeches, it is difficult to determine which non-letter characters are present in the data, so the easiest way to deal with this is to specify a pattern that replaces all non-letter characters, as shown here. The pattern lowercase a to lowercase z followed by uppercase A to uppercase Z inside square brackets matches any single letter character. Placing a caret at the start of this pattern, inside the square brackets, negates it, so that it matches any non-letter character. So we use the replace() method with this pattern to replace every non-letter character with a whitespace, as shown here.
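
As a minimal sketch of the pattern described above, the snippet below applies it to a tiny made-up Series (`toy_text` is a hypothetical stand-in for the 'text' column, not the real speeches). It passes `regex=True` explicitly, since recent pandas versions treat the pattern as a literal string unless told otherwise.

```python
import pandas as pd

# Hypothetical stand-in for speech_df['text']; these strings are made up.
toy_text = pd.Series(["Fellow-Citizens: I AM here.", "It's 1789, friends!"])

# [^a-zA-Z] matches any single character that is NOT an ASCII letter.
# Each match is replaced by one space, so the string length is unchanged.
toy_clean = toy_text.str.replace('[^a-zA-Z]', ' ', regex=True)

print(toy_clean.tolist())
```

Note that digits and apostrophes are also replaced, so "It's" becomes the two tokens "It" and "s". Whether that is acceptable depends on the analysis you have in mind.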

5. Removing unwanted characters
Here you can see the text of the first speech before and after processing. Notice that the hyphen and the colon have been replaced by whitespace.

6. Standardize the case
Once all unwanted characters have been removed you will want to standardize the remaining characters in your text so that they are all lower case. This will ensure that the same word with and without capitalization will not be counted as separate words. You can use the lower() method to achieve this as shown here.
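
A small sketch of why case matters, using two made-up strings (`toy_text` is hypothetical) that differ only in capitalization. After lower-casing they become identical, so their words would no longer be counted separately.

```python
import pandas as pd

# Two hypothetical strings that differ only in capitalization.
toy_text = pd.Series(["WHEN it was first", "when It Was First"])

# str.lower() lower-cases every string in the Series.
toy_lower = toy_text.str.lower()

# Count the distinct strings before and after standardizing the case.
print(toy_text.nunique(), toy_lower.nunique())
```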

7. Length of text
Later in this chapter you will work through the creation of features based on the content of different texts, but often there is value in the fundamental characteristics of a passage, such as its length. Using the len() method, you can calculate the number of characters in each speech.

8. Word counts
Along with the pure character length of the speech, you may want to know how many words it contains. The most straightforward way to do this is to split the speech on any whitespace and then count how many words there are after the split. First, you will need to split the text with the split() method as shown here and

9. Word counts
then chain the len() method to count the total number of words in each speech.

10. Average length of word
Finally, one other stat you can calculate is the average word length. Since you already have the total number of characters and the word count, you can simply divide them to obtain the average word length.
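
The three size features described in these slides can be sketched on a tiny made-up DataFrame (`toy_df` and its two strings are assumptions for illustration only). Note that the character count from str.len() includes whitespace, so dividing it by the word count slightly overstates the true average word length.

```python
import pandas as pd

# Hypothetical cleaned text; not the real speeches.
toy_df = pd.DataFrame({
    'text_clean': ["fellow citizens of the senate", "i am again called"]
})

# Number of characters in each text (spaces included).
toy_df['char_cnt'] = toy_df['text_clean'].str.len()

# Split on any run of whitespace, then count the resulting pieces.
toy_df['word_cnt'] = toy_df['text_clean'].str.split().str.len()

# Characters per word, computed from the two columns above.
toy_df['avg_word_length'] = toy_df['char_cnt'] / toy_df['word_cnt']

print(toy_df)
```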

11. Let's practice!
Now it's time for you to practice what you have learned about how to manipulate text.

In [1]:
import pandas as pd

In [2]:
speech_df = pd.read_csv('inaugural_speeches.csv')

In [3]:
speech_df.head()

Unnamed: 0,Name,Inaugural Address,Date,text
0,George Washington,First Inaugural Address,"Thursday, April 30, 1789",Fellow-Citizens of the Senate and of the House...
1,George Washington,Second Inaugural Address,"Monday, March 4, 1793",Fellow Citizens: I AM again called upon by th...
2,John Adams,Inaugural Address,"Saturday, March 4, 1797","WHEN it was first perceived, in early times, t..."
3,Thomas Jefferson,First Inaugural Address,"Wednesday, March 4, 1801",Friends and Fellow-Citizens: CALLED upon to u...
4,Thomas Jefferson,Second Inaugural Address,"Monday, March 4, 1805","PROCEEDING, fellow-citizens, to that qualifica..."


# Cleaning up your text
Unstructured text data cannot be directly used in most analyses. Multiple steps need to be taken to go from a long free form string to a set of numeric columns in the right format that can be ingested by a machine learning model. The first step of this process is to standardize the data and eliminate any characters that could cause problems later on in your analytic pipeline.

In this chapter you will be working with a new dataset containing the inaugural speeches of the presidents of the United States loaded as speech_df, with the speeches stored in the text column.

Instructions
Replace all non letter characters in the text column with a whitespace.
Make all characters in the newly created text_clean column lower case.

In [4]:
# Print the first 5 rows of the text column
print(speech_df.text.head())


# Replace all non letter characters with a whitespace
# (regex=True makes pandas treat the pattern as a regular expression)
speech_df['text_clean'] = speech_df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)

# Change to lower case
speech_df['text_clean'] = speech_df['text_clean'].str.lower()

# Print the first 5 rows of the text_clean column
print(speech_df['text_clean'].head())


0    Fellow-Citizens of the Senate and of the House...
1    Fellow Citizens:  I AM again called upon by th...
2    WHEN it was first perceived, in early times, t...
3    Friends and Fellow-Citizens:  CALLED upon to u...
4    PROCEEDING, fellow-citizens, to that qualifica...
Name: text, dtype: object
0    fellow citizens of the senate and of the house...
1    fellow citizens   i am again called upon by th...
2    when it was first perceived  in early times  t...
3    friends and fellow citizens   called upon to u...
4    proceeding  fellow citizens  to that qualifica...
Name: text_clean, dtype: object




# High level text features
Once the text has been cleaned and standardized you can begin creating features from the data. The most fundamental information you can calculate about free form text is its size, such as its length and number of words. In this exercise (and the rest of this chapter), you will focus on the cleaned/transformed text column (text_clean) you created in the last exercise.

Instructions
Record the character length of each speech in the char_cnt column.
Record the word count of each speech in the word_cnt column.
Record the average word length of each speech in the avg_word_length column.

In [5]:
# Find the length of each text
speech_df['char_cnt'] = speech_df['text_clean'].str.len()

# Count the number of words in each text
speech_df['word_cnt'] = speech_df['text_clean'].str.split().str.len()

# Find the average length of word
speech_df['avg_word_length'] = speech_df['char_cnt'] / speech_df['word_cnt']

# Print these columns for every speech
print(speech_df[['text_clean', 'char_cnt', 'word_cnt', 'avg_word_length']])

                                           text_clean  char_cnt  word_cnt  \
0   fellow citizens of the senate and of the house...      8616      1432   
1   fellow citizens   i am again called upon by th...       787       135   
2   when it was first perceived  in early times  t...     13871      2323   
3   friends and fellow citizens   called upon to u...     10144      1736   
4   proceeding  fellow citizens  to that qualifica...     12902      2169   
5   unwilling to depart from examples of the most ...      7003      1179   
6   about to add the solemnity of an oath to the o...      7148      1211   
7   i should be destitute of feeling if i was not ...     19894      3382   
8   fellow citizens   i shall not attempt to descr...     26322      4466   
9   in compliance with an usage coeval with the ex...     17753      2922   
10  fellow citizens   about to undertake the arduo...      6818      1130   
11  fellow citizens   the will of the american peo...      7061      1179   

# 1. Word Count Representation
Once high level information has been recorded you can begin creating features based on the actual content of each text.

2. Text to columns
The most common approach to this is to create a column for each word and record the number of times each particular word appears in each text. This results in a set of columns equal in width to the number of unique words in the dataset, with counts filling each entry. Taking just one sentence, we can see that "of" occurs 3 times, "the" 2 times, and the other words once.
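
To make the counting idea concrete, here is a hedged sketch using Python's standard library. The sentence is an assumption: it is the cleaned opening line of the first speech, which happens to reproduce the counts quoted above, but the slide's actual sentence is not shown in these notes.

```python
from collections import Counter

# Assumed example sentence (cleaned opening line of the first speech).
sentence = "fellow citizens of the senate and of the house of representatives"

# Split on whitespace and count how often each word occurs.
word_counts = Counter(sentence.split())

# Each key would become a column; each count would become the cell value.
for word, count in sorted(word_counts.items()):
    print(word, count)
```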

3. Initializing the vectorizer
While you could of course write a script to do this counting yourself, scikit-learn already has this functionality built in with its CountVectorizer class. As usual, first import CountVectorizer from sklearn dot feature_extraction dot text, then instantiate it by assigning it to a variable name, cv in this case.

4. Specifying the vectorizer
It may have become apparent that creating a column for every word will result in far too many values for analyses. Thankfully, you can specify arguments when initializing your CountVectorizer to limit this. For example, you can specify the minimum number of texts that a word must be contained in using the argument min_df. If a float is given, the word must appear in at least this proportion of documents. This threshold eliminates words that occur so rarely that they would not be useful when generalizing to new texts. Conversely, max_df keeps only words that occur in no more than a certain proportion of the documents. This can be useful to remove words that occur too frequently to be of any value.

5. Fit the vectorizer
Once the vectorizer has been instantiated you can then fit it on the data you want to create your features around. This is done by calling the fit() method on the relevant column.

6. Transforming your text
Once the vectorizer has been fit you can call the transform() method on the column you want to transform. This outputs a sparse array, with a row for every text and a column for every word that has been counted.

7. Transforming your text
To transform this to a non sparse array you can use the toarray() method.

8. Getting the features
You may notice that the output is an array, with no concept of column names. To get the names of the features that have been generated you can call the get_feature_names_out() method on the vectorizer (older scikit-learn releases called this get_feature_names()), which returns an array of the feature names in the same order as the columns of the converted array.

9. Fitting and transforming
As an aside, while fitting and transforming separately can be useful, particularly when you need to transform a different dataset than the one that you fit the vectorizer to, you can accomplish both steps at once using the fit_transform() method.
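
As a quick check of that equivalence, the sketch below compares the two-step and one-step routes on made-up documents (`toy_docs` is hypothetical). When the same data is used for fitting and transforming, both should give the same counts.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents.
toy_docs = ["of the people by the people", "for the people"]

# Two steps: fit the vocabulary, then transform.
two_step = CountVectorizer().fit(toy_docs).transform(toy_docs).toarray()

# One step: fit and transform together.
one_step = CountVectorizer().fit_transform(toy_docs).toarray()

# True when both routes produce identical count arrays.
same = np.array_equal(two_step, one_step)
print(same)
```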

10. Putting it all together
Now that you have an array containing the count values of each of the words of interest, and a way to get the feature names you can combine these in a DataFrame as shown here. The add_prefix() method allows you to be able to distinguish these columns in the future.

```out
   Counts_aback  Counts_abandon  Counts_abandonment
0             1               0                   0
1             0               0                   1
2             0               1                   0
3             0               1                   0
4             0               0                   0
```

11. Updating your DataFrame
You can now combine this DataFrame with your original DataFrame, using the pandas concat() function, so that both can be used to build future analytical models. Remember to set the axis argument to 1, as you want to bind these DataFrames column-wise. Checking the DataFrame's shape shows the new, much wider size.

12. Let's practice!

# Counting words (I)
Once high level information has been recorded you can begin creating features based on the actual content of each text. One way to do this is to approach it in a similar way to how you worked with categorical variables in the earlier lessons.

For each unique word in the dataset a column is created.
For each entry, the number of times this word occurs is counted and the count value is entered into the respective column.
These "count" columns can then be used to train machine learning models.

Instructions
Import CountVectorizer from sklearn.feature_extraction.text.
Instantiate CountVectorizer and assign it to cv.
Fit the vectorizer to the text_clean column.
Print the feature names generated by the vectorizer.


Hint
Similar to the scalers and transformers from the previous chapter, vectorizers can be fit using the .fit() method.
Feature names from a vectorizer can be found using the .get_feature_names_out() method.

In [10]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate CountVectorizer
cv = CountVectorizer()

# Fit the vectorizer
cv.fit(speech_df['text_clean'])

# Print feature names
print(cv.get_feature_names_out())

['abandon' 'abandoned' 'abandonment' ... 'zealous' 'zealously' 'zone']


# Counting words (II)
Once the vectorizer has been fit to the data, it can be used to transform the text to an array representing the word counts. This array will have a row per block of text and a column for each of the features generated by the vectorizer that you observed in the last exercise.

The vectorizer you fit in the last exercise (cv) is available in your workspace.

Instructions
Apply the vectorizer to the text_clean column.
Convert this transformed (sparse) array into a numpy array with counts.


Hint
You can apply the vectorizer using the .transform() method.
To convert into a numpy array, use the .toarray() method.

In [11]:
# Apply the vectorizer
cv_transformed = cv.transform(speech_df['text_clean'])

# Print the full array
cv_array = cv_transformed.toarray()
print(cv_array)
print(cv_array.shape)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 1 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
(58, 9043)


# Limiting your features
As you have seen, using the CountVectorizer with its default settings creates a feature for every single word in your corpus. This can create far too many features, often including ones that will provide very little analytical value.

For this purpose CountVectorizer has parameters that you can set to reduce the number of features:

min_df: Use only words that occur in at least this proportion of documents. This can be used to remove outlier words that will not generalize across texts.
max_df: Use only words that occur in at most this proportion of documents. This is useful to eliminate very common words, such as "and" or "the", that occur in nearly every text without adding value.

Instructions
Limit the number of features in the CountVectorizer by setting the minimum proportion of documents a word must appear in to 20% and the maximum to 80%.
Fit and apply the vectorizer on text_clean column in one step.
Convert this transformed (sparse) array into a numpy array with counts.
Print the dimensions of the new reduced array.


Hint
Use the min_df and max_df arguments when instantiating CountVectorizer.
To fit and apply the vectorizer, you can use the .fit_transform() method.

In [12]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Specify arguments to limit the number of features generated
cv = CountVectorizer(min_df=0.2, max_df=0.8)

# Fit, transform, and convert into array
cv_transformed = cv.fit_transform(speech_df['text_clean'])
cv_array = cv_transformed.toarray()

# Print the array shape
print(cv_array.shape)

(58, 818)


# Text to DataFrame
Now that you have generated these count-based features in an array, you will need to reformat them so that they can be combined with the rest of the dataset. This can be achieved by converting the array into a pandas DataFrame, with the feature names you found earlier as the column names, and then concatenating it with the original DataFrame.

The numpy array (cv_array) and the vectorizer (cv) you fit in the last exercise are available in your workspace.

Instructions
Create a DataFrame cv_df containing the cv_array as the values and the feature names as the column names.
Add the prefix Counts_ to the column names for ease of identification.
Concatenate this DataFrame (cv_df) to the original DataFrame (speech_df) column wise.

Hint
The feature names of the columns created in a vectorizer can be found using the .get_feature_names_out() method.
The .add_prefix() method can be used to add a prefix to column names.
To concatenate the DataFrames, you can use the pandas concat() function.

In [13]:
# Create a DataFrame with these features
cv_df = pd.DataFrame(cv_array, 
                     columns=cv.get_feature_names_out()).add_prefix('Counts_')

# Add the new columns to the original DataFrame
speech_df_new = pd.concat([speech_df, cv_df], axis=1, sort=False)
print(speech_df_new.head())

                Name         Inaugural Address                      Date  \
0  George Washington   First Inaugural Address  Thursday, April 30, 1789   
1  George Washington  Second Inaugural Address     Monday, March 4, 1793   
2         John Adams         Inaugural Address   Saturday, March 4, 1797   
3   Thomas Jefferson   First Inaugural Address  Wednesday, March 4, 1801   
4   Thomas Jefferson  Second Inaugural Address     Monday, March 4, 1805   

                                                text  \
0  Fellow-Citizens of the Senate and of the House...   
1  Fellow Citizens:  I AM again called upon by th...   
2  WHEN it was first perceived, in early times, t...   
3  Friends and Fellow-Citizens:  CALLED upon to u...   
4  PROCEEDING, fellow-citizens, to that qualifica...   

                                          text_clean  char_cnt  word_cnt  \
0  fellow citizens of the senate and of the house...      8616      1432   
1  fellow citizens   i am again called upon by th...  

# 1. TF-IDF Representation (Term Frequency / Inverse Document Frequency)
While counts of occurrences of words can be a good first step towards encoding your text to build models, this approach has some limitations. The main issue is that counts will be much higher for very common words even when they occur across all texts, providing little value as a distinguishing feature.

2. Introducing TF-IDF
Take for example the counts of the word "the" shown here, with plentiful occurrences in every row. To limit these common words from overpowering your model some form of normalization can be used. One of the most effective approaches to do this is called "Term Frequency Inverse Document Frequency" or TF-IDF.

3. TF-IDF
TF-IDF divides the number of times a word occurs in a document by a measure of the proportion of all documents in which that word occurs. This has the effect of reducing the value of common words, while increasing the weight of words that do not occur in many documents.
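
To see how this weighting plays out numerically, the sketch below recomputes TF-IDF by hand and compares it with TfidfVectorizer under its default settings. In that default configuration scikit-learn multiplies each count by a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, and then scales each row to unit Euclidean length. The documents (`toy_docs`) are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus; "the" appears in every document.
toy_docs = ["the people of the union", "the law of the land", "the people"]

# Raw term counts, one row per document.
counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each word appears.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn's default form).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit L2 norm.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with the library's own result.
sklearn_result = TfidfVectorizer().fit_transform(toy_docs).toarray()
matches = np.allclose(manual, sklearn_result)
print(matches)
```

A word that appears in every document, such as "the" here, gets the smallest possible idf of 1, so it is down-weighted relative to rarer words rather than removed outright.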

4. Importing the vectorizer
To use a TF-IDF vectorizer, the approach is very similar to how you applied a count vectorizer. First you must import TfidfVectorizer from sklearn dot feature_extraction dot text, then you assign it to a variable name. Let's use tv in this case.

5. Max features and stopwords
Similar to when you were working with the count vectorizer, you can limit the number of features created by specifying arguments when initializing TfidfVectorizer. You can specify the maximum number of features using max_features; setting it to 100, as here, keeps only the 100 most common words. We will also tell the vectorizer to omit a set of stop words using the stop_words argument; these are a predefined list of the most common English words, such as "and" or "the". You can use scikit-learn's built-in list, load your own, or use lists provided by other Python libraries.

6. Fitting your text
Once the vectorizer has been specified you can fit it, and apply it to the text that you want to transform. Note that here we are fitting and transforming the train data, a subset of the original data.

7. Putting it all together
As before, you combine the TF-IDF values along with the feature names in a DataFrame as shown here.

8. Inspecting your transforms
After transforming your data you should always check how the different words are being valued, and see which words are receiving the highest scores through the process. This will help you understand if the features being generated make sense or not. One ad hoc method is to isolate a single row of the transformed DataFrame (`tv_df` in this case), using the iloc accessor, and then sorting the values in the row in descending order as shown here. These top ranked values make sense for the text of a presidential speech.

9. Applying the vectorizer to new data
So how do you apply this transformation on the test set? As mentioned before, you should preprocess your test data using the transformations made on the train data only. To ensure that the same features are created you should use the same vectorizer that you fit on the training data. So first transform the test data using the tv vectorizer and then recreate the test dataset by combining the TF-IDF values, feature names, and other columns.

10. Let's practice!
So, now you also know about TF-IDF! Great, it's time for you to implement this!

# Tf-idf
While counts of occurrences of words can be useful to build models, words that occur many times may skew the results undesirably. To limit these common words from overpowering your model a form of normalization can be used. In this lesson you will be using Term frequency-inverse document frequency (Tf-idf) as was discussed in the video. Tf-idf has the effect of reducing the value of common words, while increasing the weight of words that do not occur in many documents.

Instructions
Import TfidfVectorizer from sklearn.feature_extraction.text.
Instantiate TfidfVectorizer while limiting the number of features to 100 and removing English stop words.
Fit and apply the vectorizer on text_clean column in one step.
Create a DataFrame tv_df containing the weights of the words and the feature names as the column names.

Hint
Use the max_features and stop_words arguments when instantiating TfidfVectorizer.
To fit and apply the vectorizer, you can use the .fit_transform() method.
Use the .toarray() method to convert the Tf-idf weights into a numpy array.
The column names can be found using the .get_feature_names_out() method.

In [15]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate TfidfVectorizer
tv = TfidfVectorizer(max_features=100, stop_words='english')

# Fit the vectorizer and transform the data
tv_transformed = tv.fit_transform(speech_df['text_clean'])

# Create a DataFrame with these features
tv_df = pd.DataFrame(tv_transformed.toarray(), 
                     columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
print(tv_df.head())

   TFIDF_action  TFIDF_administration  TFIDF_america  TFIDF_american  \
0      0.000000              0.133415       0.000000        0.105388   
1      0.000000              0.261016       0.266097        0.000000   
2      0.000000              0.092436       0.157058        0.073018   
3      0.000000              0.092693       0.000000        0.000000   
4      0.041334              0.039761       0.000000        0.031408   

   TFIDF_americans  TFIDF_believe  TFIDF_best  TFIDF_better  TFIDF_change  \
0              0.0       0.000000    0.000000      0.000000      0.000000   
1              0.0       0.000000    0.000000      0.000000      0.000000   
2              0.0       0.000000    0.026112      0.060460      0.000000   
3              0.0       0.090942    0.117831      0.045471      0.053335   
4              0.0       0.000000    0.067393      0.039011      0.091514   

   TFIDF_citizens  ...  TFIDF_things  TFIDF_time  TFIDF_today  TFIDF_union  \
0        0.229644  ...    

# Inspecting Tf-idf values
After creating Tf-idf features you will often want to understand which words score highest in each text. This can be achieved by isolating the row you want to examine and then sorting the scores from high to low.

The DataFrame from the last exercise (tv_df) is available in your workspace.

Instructions
Assign the first row of tv_df to sample_row.
sample_row is now a series of weights assigned to words. Sort these values to print the top 5 highest-rated words.


Hint
Extracting a single row from a DataFrame can be accomplished using .iloc[n] where n is the index of the row.
You can use the .sort_values() method to sort the values.

In [16]:
# Isolate the row to be examined
sample_row = tv_df.iloc[0]

# Print the top 5 words of the sorted output
print(sample_row.sort_values(ascending=False).head())

TFIDF_government    0.367430
TFIDF_public        0.333237
TFIDF_present       0.315182
TFIDF_duty          0.238637
TFIDF_country       0.229644
Name: 0, dtype: float64


# Transforming unseen data
When creating vectors from text, any transformations that you perform before training a machine learning model must also be applied to the new, unseen (test) data. To achieve this, follow the same approach as in the last chapter: fit the vectorizer only on the training data, and apply it to the test data.

For this exercise the speech_df DataFrame has been split in two:

train_speech_df: The training set consisting of the first 45 speeches.
test_speech_df: The test set consisting of the remaining speeches.

Instructions
Instantiate TfidfVectorizer.
Fit the vectorizer on the text_clean column of the training data and apply it to that column.
Apply the same vectorizer on the text_clean column of the test data.
Create a DataFrame of these new features from the test set.


Hint
Remember that you fit the vectorizer only on the training data.
To apply the vectorizer on test data, use the .transform() method.
Use the .toarray() method to convert the Tf-idf weights into a numpy array.
The column names can be found using the .get_feature_names_out() method.

In [22]:
speech_df

Unnamed: 0,Name,Inaugural Address,Date,text,text_clean,char_cnt,word_cnt,avg_word_length
0,George Washington,First Inaugural Address,"Thursday, April 30, 1789",Fellow-Citizens of the Senate and of the House...,fellow citizens of the senate and of the house...,8616,1432,6.01676
1,George Washington,Second Inaugural Address,"Monday, March 4, 1793",Fellow Citizens: I AM again called upon by th...,fellow citizens i am again called upon by th...,787,135,5.82963
2,John Adams,Inaugural Address,"Saturday, March 4, 1797","WHEN it was first perceived, in early times, t...",when it was first perceived in early times t...,13871,2323,5.971158
3,Thomas Jefferson,First Inaugural Address,"Wednesday, March 4, 1801",Friends and Fellow-Citizens: CALLED upon to u...,friends and fellow citizens called upon to u...,10144,1736,5.843318
4,Thomas Jefferson,Second Inaugural Address,"Monday, March 4, 1805","PROCEEDING, fellow-citizens, to that qualifica...",proceeding fellow citizens to that qualifica...,12902,2169,5.948363
5,James Madison,First Inaugural Address,"Saturday, March 4, 1809",UNWILLING to depart from examples of the most ...,unwilling to depart from examples of the most ...,7003,1179,5.939779
6,James Madison,Second Inaugural Address,"Thursday, March 4, 1813",ABOUT to add the solemnity of an oath to the o...,about to add the solemnity of an oath to the o...,7148,1211,5.90256
7,James Monroe,First Inaugural Address,"Tuesday, March 4, 1817",I SHOULD be destitute of feeling if I was not ...,i should be destitute of feeling if i was not ...,19894,3382,5.882318
8,James Monroe,Second Inaugural Address,"Monday, March 5, 1821",Fellow-Citizens: I SHALL not attempt to descr...,fellow citizens i shall not attempt to descr...,26322,4466,5.893865
9,John Quincy Adams,Inaugural Address,"Friday, March 4, 1825",IN compliance with an usage coeval with the ex...,in compliance with an usage coeval with the ex...,17753,2922,6.075633


In [29]:
train_speech_df = speech_df.iloc[0:45,:]
# Take every remaining speech (an end index of 57 would drop the last of the 58 rows)
test_speech_df = speech_df.iloc[45:, :]

In [30]:
test_speech_df

Unnamed: 0,Name,Inaugural Address,Date,text,text_clean,char_cnt,word_cnt,avg_word_length
45,Richard Milhous Nixon,First Inaugural Address,"Monday, January 20, 1969","Senator Dirksen, Mr. Chief Justice, Mr. Vice P...",senator dirksen mr chief justice mr vice p...,11701,2152,5.437268
46,Richard Milhous Nixon,Second Inaugural Address,"Saturday, January 20, 1973","Mr. Vice President, Mr. Speaker, Mr. Chief Jus...",mr vice president mr speaker mr chief jus...,10048,1835,5.475749
47,Jimmy Carter,Inaugural Address,"Thursday, January 20, 1977","FOR myself and for our Nation, I want to thank...",for myself and for our nation i want to thank...,6934,1238,5.600969
48,Ronald Reagan,First Inaugural Address,"Tuesday, January 20, 1981","Senator Hatfield, Mr. Chief Justice, Mr. Presi...",senator hatfield mr chief justice mr presi...,13787,2457,5.611315
49,Ronald Reagan,Second Inaugural Address,"Monday, January 21, 1985","Senator Mathias, Chief Justice Burger, Vice Pr...",senator mathias chief justice burger vice pr...,14601,2586,5.646172
50,George Bush,Inaugural Address,"Friday, January 20, 1989","Mr. Chief Justice, Mr. President, Vice Preside...",mr chief justice mr president vice preside...,12536,2342,5.35269
51,Bill Clinton,First Inaugural Address,"Wednesday, January 21, 1993",My fellow citizens:Today we celebrate the myst...,my fellow citizens today we celebrate the myst...,9119,1608,5.67102
52,Bill Clinton,Second Inaugural Address,20-Jan-97,My fellow citizens:At this last presidential i...,my fellow citizens at this last presidential i...,12374,2201,5.62199
53,George W. Bush,First Inaugural Address,"Saturday, January 20, 2001","President Clinton, distinguished guests and my...",president clinton distinguished guests and my...,9084,1606,5.656289
54,George W. Bush,Second Inaugural Address,"Thursday, January 20, 2005","Vice President Cheney, Mr. Chief Justice, Pres...",vice president cheney mr chief justice pres...,12199,2122,5.748822


In [32]:
# Instantiate TfidfVectorizer
tv = TfidfVectorizer(max_features=100, stop_words='english')

# Fit the vectorizer and transform the data
tv_transformed = tv.fit_transform(train_speech_df['text_clean'])

# Transform test data
test_tv_transformed = tv.transform(test_speech_df['text_clean'])

# Create new features for the test set
test_tv_df = pd.DataFrame(test_tv_transformed.toarray(), 
                          columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
print(test_tv_df.head())

   TFIDF_action  TFIDF_administration  TFIDF_america  TFIDF_american  \
0      0.000000              0.029540       0.233954        0.082703   
1      0.000000              0.000000       0.547457        0.036862   
2      0.000000              0.000000       0.126987        0.134669   
3      0.037094              0.067428       0.267012        0.031463   
4      0.000000              0.000000       0.221561        0.156644   

   TFIDF_authority  TFIDF_best  TFIDF_business  TFIDF_citizens  \
0         0.000000    0.000000        0.000000        0.022577   
1         0.000000    0.036036        0.000000        0.015094   
2         0.000000    0.131652        0.000000        0.000000   
3         0.039990    0.061516        0.050085        0.077301   
4         0.028442    0.087505        0.000000        0.109959   

   TFIDF_commerce  TFIDF_common  ...  TFIDF_subject  TFIDF_support  \
0             0.0      0.000000  ...            0.0       0.000000   
1             0.0      0.00000

# 1. Bag of words and N-grams
So far you have looked at individual words on their own, without any context or word order. This approach is called a bag-of-words model, as the words are treated as if they are being drawn from a bag with no concept of order or grammar. While analyzing the occurrences of individual words can be a valuable way to create features from a piece of text, you will notice that individual words can lose all their context and meaning when viewed independently.

2. Issues with bag of words
Take for example the word 'happy' shown here. One would assume it was used in a positive context, but if in reality it was used in the phrase 'not happy' this assumption would be incorrect. Similarly, if the phrase was extended to 'never not happy' the connotation changes again. One common method to retain at least some concept of word order in a text is to instead use multiple consecutive words, such as pairs (bi-grams) or three consecutive words (tri-grams). This maintains at least some ordering information while at the same time allowing for the creation of a reasonable set of features.
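To make the idea concrete, here is a minimal sketch in plain Python that splits the phrase from the slide into n-grams. The helper name `ngrams` is hypothetical and only for illustration; the vectorizers used in this chapter do this work for you.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in text, joined by spaces."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

phrase = "never not happy"
unigrams = ngrams(phrase, 1)  # single words, no ordering information
bigrams = ngrams(phrase, 2)   # pairs of consecutive words
trigrams = ngrams(phrase, 3)  # the whole three-word phrase

print(unigrams)
print(bigrams)
print(trigrams)
```

The unigram 'happy' on its own hides the negation, while the bi-gram 'not happy' and the tri-gram 'never not happy' each keep more of the surrounding context.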

3. Using N-grams
To leverage n-grams in your own models, an additional argument, ngram_range, can be specified when instantiating your TF-IDF vectorizer. The values assigned to the argument are the minimum and maximum lengths of the n-grams to be included. With a range of (2, 2), for example, you would only be looking at bi-grams (n-grams with two words). Printing the bi-gram features created, we can see pairs of words instead of single words.

4. Finding common words
As mentioned in the last video, when creating new features, you should always take time to check your work and ensure that the features you are creating make sense. A good way to check your n-grams is to see which values are recorded most often. This can be done by applying the sum() method to the DataFrame of count values that you created.

5. Finding common words
After sorting the values in descending order you can see the most commonly occurring values. It comes as no surprise that the most commonly occurring bi-gram in a dataset of US presidents' speeches is "United States", which indicates that the features being created make sense.

6. Let's practice!
You should now be able to try out many different combinations of text based features. It can be interesting to go further and explore the most common longer n-grams such as three word sequences called tri-grams.

# Using longer n-grams
So far you have created features based on individual words in each of the texts. This can be quite powerful when used in a machine learning model, but you may be concerned that by looking at words individually a lot of the context is being ignored. To deal with this when creating models you can use n-grams, which are sequences of n words grouped together. For example:

bigrams: Sequences of two consecutive words
trigrams: Sequences of three consecutive words
These can be automatically created in your dataset by specifying the ngram_range argument as a tuple (n1, n2) where all n-grams in the n1 to n2 range are included.

Instructions
Import CountVectorizer from sklearn.feature_extraction.text.
Instantiate CountVectorizer while considering only trigrams.
Fit the vectorizer and apply it to the text_clean column in one step.
Print the feature names generated by the vectorizer.

Hint
When instantiating CountVectorizer, specify the ngram_range argument which takes the minimum and maximum n-gram length in the form of (minimum, maximum)

In [33]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate a trigram vectorizer
cv_trigram_vec = CountVectorizer(max_features=100, 
                                 stop_words='english', 
                                 ngram_range = (3,3))

# Fit and apply trigram vectorizer
cv_trigram = cv_trigram_vec.fit_transform(speech_df['text_clean'])

# Print the trigram features
print(cv_trigram_vec.get_feature_names_out().tolist())

['ability preserve protect', 'agriculture commerce manufactures', 'america ideal freedom', 'amity mutual concession', 'anchor peace home', 'ask bow heads', 'best ability preserve', 'best interests country', 'bless god bless', 'bless united states', 'chief justice mr', 'children children children', 'citizens united states', 'civil religious liberty', 'civil service reform', 'commerce united states', 'confidence fellow citizens', 'congress extraordinary session', 'constitution does expressly', 'constitution united states', 'coordinate branches government', 'day task people', 'defend constitution united', 'distinction powers granted', 'distinguished guests fellow', 'does expressly say', 'equal exact justice', 'era good feeling', 'executive branch government', 'faithfully execute office', 'fellow citizens assembled', 'fellow citizens called', 'fellow citizens large', 'fellow citizens world', 'form perfect union', 'general welfare secure', 'god bless america', 'god bless god', 'good greates



# Finding the most common words
It's always advisable, once you have created your features, to inspect them to ensure that they are as you would expect. This will allow you to catch errors early, and perhaps influence what further feature engineering you will need to do.

The vectorizer (cv_trigram_vec) you fit in the last exercise and the sparse array consisting of word counts (cv_trigram) are available in your workspace.

Instructions
Create a DataFrame of the features (word counts).
Add the counts of word occurrences and print the top 5 most occurring words.

Hint
First, convert the sparse array of word counts into a numpy array using the .toarray() method to create the DataFrame.
Use the .sort_values() to print the top 5 most occurring words.

In [34]:
# Create a DataFrame of the features
cv_tri_df = pd.DataFrame(cv_trigram.toarray(), 
                         columns=cv_trigram_vec.get_feature_names_out()).add_prefix('Counts_')

# Print the top 5 words in the sorted output
print(cv_tri_df.sum().sort_values(ascending=False).head())


Counts_constitution united states    20
Counts_people united states          13
Counts_mr chief justice              10
Counts_preserve protect defend       10
Counts_president united states        8
dtype: int64


# 1. Wrap-up
Congratulations on completing the course “Feature Engineering for Machine Learning in Python”. This course set out to teach you about understanding your data types and how best to prepare your dataset for a machine learning model. Let's take a moment to recap what you have covered.

2. Chapter 1
In chapter one, you learned how to better understand the underlying types of data contained in your dataset, how to create features out of categorical columns and how to bin continuous columns.

3. Chapter 2
In chapter two, we moved on to exploring how to deal with some of the challenges of real-world data, such as missing values and undesirable characters in your data.

4. Chapter 3
Chapter 3 discussed how different distributions can affect your models and how to mitigate those effects, as well as different ways to deal with spurious outlier values in your dataset.

5. Chapter 4
Finally, in chapter 4, we explored how to deal with non-tabular data such as free text, and different ways to encode it for use with a machine learning model.

6. Next steps
Hopefully these newly learned skills will benefit both your personal projects and your professional career. A great way to test out these skills is to apply them to Kaggle competitions or any of your own pet projects to see if they improve your results. Or, if you want to explore these topics further, perhaps you could try out some of the other related courses on DataCamp.

7. Thank You!
This is the final video, and I would like to thank you for going through this course. I hope you have learned from it and that it provides value in your machine learning work ahead.