#  Transforming Text For a Classifier

In this demo, you will see how scikit-learn converts raw text data into a matrix of <i>term frequency-inverse document frequency</i> (TF-IDF) features, and how to train a logistic regression model using these transformed features. You will also experiment with using different document-frequency values and see how they affect the performance of a logistic regression model.

### Import Packages

Before you get started, import a few packages. Run the code cell below. 

In [1]:
import pandas as pd
import numpy as np
import os 

We will also import the scikit-learn `LogisticRegression` class, the `train_test_split()` function for splitting the data into training and test sets, and the `roc_auc_score()` function for evaluating the model.

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

## Step 1: Load a 'ready-to-fit' Data Set

We will work with a new version of the familiar Airbnb "listings" data set. It contains all of the numerical and binary columns we used previously, but also contains unstructured text fields.

In [3]:
filename = os.path.join(os.getcwd(), "data", "airbnb_text_readytofit.csv.gz")
df = pd.read_csv(filename, header=0)

In [4]:
df.head()

Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_is_superhost,host_has_profile_pic,host_identity_verified,host_response_rate,...,neighbourhood_group_cleansed_Brooklyn,neighbourhood_group_cleansed_Manhattan,neighbourhood_group_cleansed_Queens,neighbourhood_group_cleansed_Staten Island,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,room_type_Shared room,has_availability_True,instant_bookable_True
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,Jennifer,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,False,True,True,-0.591438,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,LisaRoxanne,"New York, New York, United States",Laid-back Native New Yorker (formerly bi-coast...,False,True,True,-4.744653,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,Rebecca,"Brooklyn, New York, United States","Rebecca is an artist/designer, and Henoch is i...",False,True,True,0.578481,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",Shunichi,"New York, New York, United States",I used to work for a financial industry but no...,False,True,False,-0.060696,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,MaryEllen,"New York, New York, United States",Welcome to family life with my oldest two away...,False,True,True,0.578481,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


## Step 2: Create Training and Test Data Sets

### Create Labeled Examples

Let's obtain columns from our data set to create labeled examples. We will have one text feature and one label. 
The code cell below carries out the following steps:

* Gets the `host_is_superhost` column from DataFrame `df` and assigns it to the variable `y`. This will be our label.
* Gets the `description` column from DataFrame `df` and assigns it to the variable `X`. This will be our feature. Note that the `description` feature contains text describing the listing.


In [5]:
y = df['host_is_superhost'] 
X = df['description']

X.shape

(27388,)

In [6]:
X.head()

0    Beautiful, spacious skylit studio in the heart...
1    Enjoy 500 s.f. top floor in 1899 brownstone, w...
2    We welcome you to stay in our lovely 2 br dupl...
3    Please don’t expect the luxury here just a bas...
4    Our best guests are seeking a safe, clean, spa...
Name: description, dtype: object

### Split Labeled Examples into Training and Test Sets

Let's split our data into training and test sets with 75% of the data being the training set.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.75, random_state=1234)
X_train.head()

775      <b>The space</b><br />This 450 sf apartment is...
4        Our best guests are seeking a safe, clean, spa...
7294     I am subletting my room for the entire month o...
26836    4 queen bedrooms mean plenty of room to share ...
3914     Modern studio on the lower level of our histor...
Name: description, dtype: object

## Step 3:  Implement TF-IDF Vectorizer to Transform Text

A popular technique for transforming text into numerical feature vectors is the TF-IDF statistical measure. TF-IDF estimates how relevant a word is to a document relative to a collection of documents. It weights words so that those most distinctive of a document score highest and can therefore be used to represent the characteristics of the document. For example, the word "the" appears in many documents and therefore is not characteristic of any particular document in a collection. On the other hand, if a word appears often in one document and rarely in other documents in the collection, the word is given a higher importance for that one document.
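
For reference, here is a sketch of the weighting, assuming scikit-learn's default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`); textbooks and other libraries often define TF-IDF slightly differently. For a term $t$ in document $d$, with $n$ documents in the collection and $\text{df}(t)$ the number of documents that contain $t$:

$$
\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t)
$$

Here $\text{tf}(t, d)$ is the raw count of $t$ in $d$. Each document's vector is then divided by its Euclidean (L2) norm, so every non-empty row of the resulting matrix has length 1.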

Let's look at a simple example. We will use the scikit-learn `TfidfVectorizer` class to implement a TF-IDF vectorizer. For more information, consult the online [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). First, let's import `TfidfVectorizer`.

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

Let's consider the values in the TF-IDF matrix that `TfidfVectorizer` produces:

* <b>Row</b>: each document will be represented by a numerical vector (row) in the matrix. 
* <b>Column</b>: each column represents one word in the vocabulary, i.e. the set of distinct words found across ALL of the documents in the collection. Some words can be excluded: with the default settings, scikit-learn drops single-character tokens (it does not apply a stop-word list unless you pass one), and you will see later that you can specify frequency thresholds to eliminate words that appear in too many or too few documents. 
    * The values in a column are the TF-IDF scores (weights) for that word in every document in the collection (one document per row).

The code cell below transforms two "documents." Run the cell below to see what the code produces. 

In [9]:
document_collection = [
    'My cat loves yarn. Blue yarn.',
    'I have a blue dog.'
]

# 1. Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# 2. Fit the vectorizer to document_collection
vectorizer.fit(document_collection)

# 3. Print the vocabulary
print("Vocabulary size {0}: ".format(len(vectorizer.vocabulary_)))
print(vectorizer.vocabulary_)

# 4. Transform the data into numerical vectors 
resulting_matrix = vectorizer.transform(document_collection)

# 5. Print the matrix
print(resulting_matrix.todense())


Vocabulary size 7: 
{'my': 5, 'cat': 1, 'loves': 4, 'yarn': 6, 'blue': 0, 'have': 3, 'dog': 2}
[[0.25969799 0.36499647 0.         0.         0.36499647 0.36499647
  0.72999294]
 [0.44943642 0.         0.6316672  0.6316672  0.         0.
  0.        ]]


Let's summarize the resulting matrix:

<table>
    <tr><th></th><th>blue</th><th>cat</th><th>dog</th><th>have</th><th>loves</th><th>my</th><th>yarn</th></tr>
<tr><th>Document 1</th><th>0.25969799</th><th>0.36499647</th><th>0.</th><th>0.</th><th>0.36499647</th><th>0.36499647</th><th>0.72999294</th></tr>
<tr><th>Document 2</th><th>0.44943642</th><th>0.</th><th>0.6316672</th><th>0.6316672</th><th>0.</th><th>0.</th><th>0.</th></tr>   
    </table>




We have 7 words in our vocabulary: 'blue', 'cat', 'dog', 'have', 'loves', 'my', 'yarn'. Note that scikit-learn excluded the words 'I' and 'a': by default, `TfidfVectorizer` lowercases the text and keeps only tokens of two or more characters. Each word is considered a feature, so the matrix has seven columns, one per feature.
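
The tokenization step can be approximated with the standard library alone. The sketch below assumes the default `token_pattern` documented for `TfidfVectorizer` and the default lowercasing; the helper name `tokenize` is made up for this illustration and is not part of scikit-learn.

```python
import re

# Default token pattern documented for TfidfVectorizer: runs of 2+ word characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text):
    """Approximate the default analyzer: lowercase the text, then extract tokens."""
    return re.findall(TOKEN_PATTERN, text.lower())

tokens_doc1 = tokenize('My cat loves yarn. Blue yarn.')
tokens_doc2 = tokenize('I have a blue dog.')

print(tokens_doc1)
print(tokens_doc2)
```

The single-character tokens 'i' and 'a' never match the pattern, which is why they do not appear in the vocabulary.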

The `vectorizer.vocabulary_` attribute outputs a mapping of words to column indices: {'my': 5, 'cat': 1, 'loves': 4, 'yarn': 6, 'blue': 0, 'have': 3, 'dog': 2}. This means that the TF-IDF score (weight) for the word 'my' is contained in the column with index 5 of the matrix (column indices start at 0).
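
To read the matrix column by column, it can help to invert this mapping. The sketch below copies the dictionary printed above and sorts the words by column index; recent scikit-learn versions also provide `vectorizer.get_feature_names_out()`, which returns the words in the same order.

```python
# Mapping printed by vectorizer.vocabulary_ in the example above
vocabulary = {'my': 5, 'cat': 1, 'loves': 4, 'yarn': 6, 'blue': 0, 'have': 3, 'dog': 2}

# Sort the words by their column index to recover the column order of the matrix
columns = sorted(vocabulary, key=vocabulary.get)
print(columns)
```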

The table above summarizes the results of the code. Note that in our first document, the word 'dog' does not appear. Therefore, its value in the document's vector is 0. Since the word 'blue' appears in both documents, its importance is not as high for either document. Therefore, its value in both document vectors is not very high compared to other values. However, since 'dog' appears in the second document only, it has a higher importance since it is characteristic of the second document; its value in that document's vector is 0.6316672. 
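
To see where these numbers come from, the scores can be recomputed by hand. This is a minimal sketch that assumes the vectorizer's default settings (lowercasing, tokens of two or more characters, smoothed IDF, L2-normalized rows); it uses only the standard library, and the helper name `tfidf_row` is made up for this illustration.

```python
import math
import re
from collections import Counter

documents = ['My cat loves yarn. Blue yarn.', 'I have a blue dog.']

# Tokenize as in the default settings: lowercase, keep tokens of 2+ characters
tokenized = [re.findall(r"(?u)\b\w\w+\b", doc.lower()) for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(documents)

# Document frequency: the number of documents that contain each term
doc_freq = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector for one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value ** 2 for value in raw))
    return [value / norm for value in raw]

matrix = [tfidf_row(tokens) for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print([round(value, 8) for value in row])
```

The rows should agree with the matrix printed by `TfidfVectorizer` above. Note that 'yarn' scores exactly twice as high as 'cat' in the first document because it has the same IDF but occurs twice.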


Let's now transform our Airbnb feature values into numerical vectors using `TfidfVectorizer`. We will implement a TF-IDF transformation on the training and test data. Run the cell and inspect the results.

In [10]:
# 1. Create a TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer()

# 2. Fit the vectorizer to X_train
tfidf_vectorizer.fit(X_train)

# 3. Transform *both* the training and test data using the fitted vectorizer and its 'transform' method
X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print(X_train_tfidf)

  (0, 23779)	0.038561105614235314
  (0, 23771)	0.11589556798046946
  (0, 23753)	0.10098268550222658
  (0, 23589)	0.07414073282842239
  (0, 23471)	0.047515983056491795
  (0, 23361)	0.05155459572838163
  (0, 23334)	0.0742290362302789
  (0, 23245)	0.0704291887855407
  (0, 23016)	0.1658697199422041
  (0, 22813)	0.103512975598231
  (0, 22441)	0.04994977360595739
  (0, 22338)	0.16414868151481765
  (0, 22141)	0.04525268909148481
  (0, 21956)	0.07528502113419792
  (0, 21830)	0.11218926764426079
  (0, 21785)	0.06026807550555629
  (0, 21615)	0.0627539016485846
  (0, 21501)	0.11761714360779675
  (0, 21441)	0.06416906580885975
  (0, 21362)	0.1375436474856328
  (0, 20747)	0.08685863820037261
  (0, 20591)	0.09351951459480976
  (0, 20519)	0.08125780706785905
  (0, 20501)	0.11114219975162687
  (0, 20280)	0.07618672141212944
  :	:
  (20540, 6684)	0.08028542980233697
  (20540, 5695)	0.07745488700136696
  (20540, 5683)	0.12511711193936928
  (20540, 5674)	0.07046376746580481
  (20540, 5551)	0.033630104036

## Step 4: Fit a Logistic Regression Model to the Transformed Training Data and Evaluate the Model
The code cell below trains a logistic regression model using the TF-IDF features and computes the AUC on the test set.

In [11]:
# 1. Create a LogisticRegression model object, and fit a Logistic Regression model to the transformed training data
model = LogisticRegression(max_iter=200)
model.fit(X_train_tfidf, y_train)

# 2. Make predictions on the transformed test data using the predict_proba() method and 
# save the values of the second column
probability_predictions = model.predict_proba(X_test_tfidf)[:,1]

# 3. Make predictions on the transformed test data using the predict() method 
class_label_predictions = model.predict(X_test_tfidf)

# 4. Compute the Area Under the ROC curve (AUC) for the test data. Note that this time we are using one 
# function 'roc_auc_score()' to compute the auc rather than using both 'roc_curve()' and 'auc()' as we have 
# done in the past
auc = roc_auc_score(y_test, probability_predictions)
print('AUC on the test data: {:.4f}'.format(auc))

# 5. Print out the size of the resulting feature space using the 'vocabulary_' attribute of the vectorizer
len_feature_space = len(tfidf_vectorizer.vocabulary_)
print('The size of the feature space: {0}'.format(len_feature_space))

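# 6. Get a glimpse of the features. Note that the slice [1:5] skips the first entry and
# returns four items; use [:5] to get the first five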
first_five = list(tfidf_vectorizer.vocabulary_.items())[1:5]
print('Glimpse of first 5 entries of the mapping of a word to its column/feature index \n{}:'.format(first_five))


AUC on the test data: 0.7654
The size of the feature space: 25353
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('space', 20050), ('br', 4323), ('this', 21441), ('450', 1178)]:
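
As a quick check that `roc_auc_score()` gives the same result as the two-step route with `roc_curve()` and `auc()`, here is a small sketch. The labels and scores are made-up values for illustration only.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities for four examples
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Two-step route: compute the ROC curve, then the area under it
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc_two_step = auc(fpr, tpr)

# One-step route used in this demo
auc_one_step = roc_auc_score(y_true, y_score)

print(auc_two_step, auc_one_step)
```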


Let's check one listing's description and see whether our model correctly predicted if the host is a 'superhost' based on what the description says about the location, the amenities, etc.:

In [12]:
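# X_test and y_test keep the index labels of the original DataFrame, while the prediction
# array is indexed by position. Find the position of the listing labeled 389 in the test set
# so that the description, the prediction and the actual value refer to the same listing.
label = 389
position = X_test.index.get_loc(label)

print('Description:\n')
print(X_test.loc[label])

print('\nPrediction: Is host superhost? {}\n'.format(class_label_predictions[position]))

print('Actual: Is host superhost? {}\n'.format(y_test.loc[label]))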

Description:

***WE TAKE OUR GUESTS' SAFETY VERY SERIOUSLY!   We are following all recommended cleaning and sanitation protocols.<br /><br /><b>The space</b><br />Our 2 bedroom apartment is located on the first floor of a recently renovated, century-old brick townhouse.  It has wood floors throughout, exposed brick, original tin ceilings, tub/shower in bathroom, kitchen alcove, living room and two bedrooms.  Downstairs is an open plan finished basement with a washer/dryer and a Queen-sized sleeper sofa for extra sleeping space.<br /><br />We also have a lovely private backyard, and street parking if you are traveling by car.<br /><br />Located between the neighborhoods of DUMBO and Fort Greene, you are a 10 minute walk from either neighborhood with lots of shops and restaurants.  Across the street is Commodore Barry Park, Brooklyn's first park.  It was recently renovated and features an open area with lots of trees and benches directly across from the apartment, sports fields on the fa

## Step 5: Experiment with Different Document Frequency Values and Analyze the Results

When creating a `TfidfVectorizer` object, you can use the parameter `min_df` to specify the minimum 'document frequency.' This allows you to ignore words that have a document frequency lower than the specified value. In other words, the vectorizer ignores words that occur in too few documents.
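
As a small illustration, the sketch below applies `min_df=2` to the two-document toy collection used earlier. An integer `min_df` is an absolute number of documents; a float between 0.0 and 1.0 is interpreted as a proportion of the documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

document_collection = ['My cat loves yarn. Blue yarn.', 'I have a blue dog.']

# Keep only the words that appear in at least 2 documents
vectorizer = TfidfVectorizer(min_df=2)
vectorizer.fit(document_collection)

remaining_words = sorted(vectorizer.vocabulary_)
print(remaining_words)
```

Only 'blue' occurs in both documents, so it is the only word left in the vocabulary.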

The code cell below puts the code above into a loop over a range of 'document frequency' values. For each value, it fits a vectorizer specifying `ngram_range=(1,2)` (instead of the default (1,1)). Run the code and inspect the results. 
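
To illustrate what `ngram_range=(1,2)` changes, the sketch below fits a vectorizer on a single toy sentence. With this setting, the vocabulary contains single words (unigrams) as well as pairs of adjacent words (bigrams).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# ngram_range=(1, 2) extracts unigrams and bigrams
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit(['I have a blue dog.'])

features = sorted(vectorizer.vocabulary_)
print(features)
```

Single-character tokens are dropped before the bigrams are built, which is why 'have blue' appears as a bigram. Adding bigrams also explains why the feature space can grow much larger than with the default `(1,1)`.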

Note: This may take a short while to run.

In [13]:
for min_df in [1,10,100,1000]:
    
    print('\nMin Document Frequency Value: {0}'.format(min_df))
    
    # 1. Create a TfidfVectorizer object
    tfidf_vectorizer = TfidfVectorizer(min_df=min_df, ngram_range=(1,2))

    # 2. Fit the vectorizer to X_train
    tfidf_vectorizer.fit(X_train)

    # 3. Transform the training and test data
    X_train_tfidf = tfidf_vectorizer.transform(X_train)
    X_test_tfidf = tfidf_vectorizer.transform(X_test)

    # 4. Create a LogisticRegression model object, and fit a Logistic Regression model to the transformed 
    # training data
    model = LogisticRegression(max_iter=200)
    model.fit(X_train_tfidf, y_train)
    
    # 5. Make predictions on the transformed test data using the predict_proba() method and save 
    # the values of the second column
    probability_predictions = model.predict_proba(X_test_tfidf)[:,1]

    # 6. Compute the Area Under the ROC curve (AUC) for the test data.
    auc = roc_auc_score(y_test, probability_predictions)
    print('AUC on the test data: {:.4f}'.format(auc))

    # 7. Compute the size of the resulting feature space using the 'vocabulary_' attribute of the vectorizer
    len_feature_space = len(tfidf_vectorizer.vocabulary_)
    print('The size of the feature space: {0}'.format(len_feature_space))
    
    # 8. Get a glimpse of the features (the slice [1:5] skips the first entry and returns four items)
    first_five = list(tfidf_vectorizer.vocabulary_.items())[1:5]
    print('Glimpse of first 5 entries of the mapping of a word to its column/feature index \n{}:'.format(first_five))

    # 9. Print a few "stop words" - words that we are ignoring because of min_df
    # (the slice [1:5] returns four items)
    first_five_stop = list(tfidf_vectorizer.stop_words_)[1:5]
    print('Glimpse of first 5 stop words \n{}:'.format(first_five_stop))
    


Min Document Frequency Value: 1
AUC on the test data: 0.7976
The size of the feature space: 361588
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('space', 288068), ('br', 59038), ('this', 315196), ('450', 7085)]:
Glimpse of first 5 stop words 
[]:

Min Document Frequency Value: 10
AUC on the test data: 0.7877
The size of the feature space: 32902
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('space', 25748), ('br', 5754), ('this', 28492), ('450', 502)]:
Glimpse of first 5 stop words 
['parks galore', 'mentioned above', 'knock on', 'large tastefully']:

Min Document Frequency Value: 100
AUC on the test data: 0.7631
The size of the feature space: 4558
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('space', 3471), ('br', 767), ('this', 3881), ('apartment', 403)]:
Glimpse of first 5 stop words 
['knock on', 'mentioned above', 'receiving basket', 'grupo de']:

Min Document Frequency 

<b>Analysis (ungraded):</b> Just as you can use the parameter `min_df` to specify the minimum 'document frequency,' you can use the parameter `max_df` to ignore words that have a document frequency higher than the specified value. Try using the parameter `max_df` and compare the results.
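
As a starting point, here is a minimal sketch of `max_df` on the two-document toy collection used earlier, not on the Airbnb data. A float value is interpreted as a proportion of the documents, while an integer is an absolute count.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

document_collection = ['My cat loves yarn. Blue yarn.', 'I have a blue dog.']

# Ignore words that appear in more than 50% of the documents
vectorizer = TfidfVectorizer(max_df=0.5)
vectorizer.fit(document_collection)

remaining_words = sorted(vectorizer.vocabulary_)
print(remaining_words)
```

The word 'blue' appears in both documents, so it is removed from the vocabulary while the other six words remain.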