<table class="table table-bordered">
    <tr>
        <th style="width:250px"><img src='https://www.np.edu.sg/PublishingImages/Pages/default/odp/ICT.jpg' style="width: 100%; height: 125px; "></th>
        <th style="text-align:center;"><h1>Data Wrangling</h1><h2>Exercise 11: Extracting Features from Text Variables </h2><h3>Diploma in Data Science</h3></th>
    </tr>
</table>

## Objectives

Text can be part of the variables in our datasets. For example, in insurance, some variables that capture information about an incident may come from a free text field in a form. In data from a website that collects customer reviews or feedback, we may also encounter variables that contain short descriptions entered manually by users.

Text is unstructured: it does not follow a fixed pattern, unlike the tabular datasets we have worked with throughout this module. Text may also vary in length, content, and writing style. How can we extract information from text variables to inform our predictive models? This is the question we address in this practical.

The techniques we will cover in this practical belong to the realm of Natural Language Processing (NLP). NLP is a subfield of linguistics and computer science concerned with the interactions between computers and human language, or, in other words, with how to program computers to understand human language. NLP includes a multitude of techniques for analysing the syntax, semantics, and discourse of text, so doing the field justice would require a course in itself.

In this practical, we instead discuss techniques that let us quickly extract features from short pieces of text to complement our predictive models. Specifically, we discuss how to capture text complexity through statistical properties of the text, such as word length, the number of words and unique words used, the number of sentences, and so on. We will use the pandas and scikit-learn libraries, and take a brief look at a very useful Python NLP toolkit, the Natural Language Toolkit (NLTK).
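
As a rough illustration of the statistical features mentioned above, the sketch below derives character count, word count, unique word count, and average word length with pandas. The three sample strings and the column names (`num_chars`, `num_words`, and so on) are assumptions chosen for illustration; they are not part of the exercise solution.

```python
import pandas as pd

# Hypothetical sample of short text values (illustration only).
names = pd.Series([
    "Pleasant Room along Bukit Timah",
    "COZICOMFORT",
    "ensuite room room near expo",
])

# Split each string on whitespace once and reuse the token lists.
tokens = names.str.split()

features = pd.DataFrame({"name": names})
features["num_chars"] = names.str.len()                      # characters, spaces included
features["num_words"] = tokens.str.len()                     # whitespace-separated tokens
features["num_unique_words"] = tokens.apply(lambda words: len(set(words)))
features["avg_word_length"] = tokens.apply(
    lambda words: sum(len(w) for w in words) / len(words)
)

print(features)
```

Each new column is an ordinary numeric feature, so it can sit next to the other tabular variables in a predictive model.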

Please refer to `Practical 11.1 - Practical 11.5` in the separate files for details.

## Exercise

In the `airbnb_sg` dataset, extract useful features from the `name` column using text analysis techniques.

In [1]:
# import all the required packages

import pandas as pd
import numpy as np

# sklearn preprocess
from sklearn.model_selection import train_test_split

# Visual
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

# NLTK
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# Bag of words & TFIDF
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# load the dataset and preview the first rows
data = pd.read_csv('./data/airbnb_sg.csv')
data.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,49091,COZICOMFORT LONG TERM STAY ROOM 2,266763,Francesca,North Region,Woodlands,1.44255,103.7958,Private room,83,180,1,2013-10-21,0.01,2,365
1,50646,Pleasant Room along Bukit Timah,227796,Sujatha,Central Region,Bukit Timah,1.33235,103.78521,Private room,81,90,18,2014-12-26,0.28,1,365
2,56334,COZICOMFORT,266763,Francesca,North Region,Woodlands,1.44246,103.79667,Private room,69,6,20,2015-10-01,0.2,2,365
3,71609,Ensuite Room (Room 1 & 2) near EXPO,367042,Belinda,East Region,Tampines,1.34541,103.95712,Private room,206,1,14,2019-08-11,0.15,9,353
4,71896,B&B Room 1 near Airport & EXPO,367042,Belinda,East Region,Tampines,1.34567,103.95963,Private room,94,1,22,2019-07-28,0.22,9,355


In [3]:
# keep only the listing names and drop rows where the name is missing
df = pd.DataFrame(data['name'].dropna())

In [4]:
# Task 1: remove punctuation
# regex=True is required: recent pandas versions treat the pattern as a literal string by default
df['name'] = df['name'].str.replace(r'[^\w\s]', '', regex=True)
df.head()


Unnamed: 0,name
0,COZICOMFORT LONG TERM STAY ROOM 2
1,Pleasant Room along Bukit Timah
2,COZICOMFORT
3,Ensuite Room Room 1 2 near EXPO
4,BB Room 1 near Airport EXPO


In [5]:
# Task 2: remove numbers, keep only text
# regex=True is required: recent pandas versions treat the pattern as a literal string by default
df['name'] = df['name'].str.replace(r'\d+', '', regex=True)
df.head()


Unnamed: 0,name
0,COZICOMFORT LONG TERM STAY ROOM
1,Pleasant Room along Bukit Timah
2,COZICOMFORT
3,Ensuite Room Room near EXPO
4,BB Room near Airport EXPO


In [6]:
# Task 3: convert to lower case
df['name'] = df['name'].str.lower()
df.head()

Unnamed: 0,name
0,cozicomfort long term stay room
1,pleasant room along bukit timah
2,cozicomfort
3,ensuite room room near expo
4,bb room near airport expo
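
The three cleaning steps above can also be chained into a single helper, sketched below. The function name `clean_text` is hypothetical, and the final whitespace clean-up (collapsing repeated spaces and stripping the ends) is an extra step that the exercise itself does not perform; the sample names come from the first rows of the dataset preview.

```python
import pandas as pd


def clean_text(series: pd.Series) -> pd.Series:
    """Remove punctuation and digits, lower-case, and tidy whitespace."""
    return (
        series
        .str.replace(r"[^\w\s]", "", regex=True)  # drop punctuation
        .str.replace(r"\d+", "", regex=True)      # drop digits
        .str.lower()                               # normalise case
        .str.replace(r"\s+", " ", regex=True)     # extra step: collapse repeated spaces
        .str.strip()                               # extra step: trim leading/trailing spaces
    )


sample = pd.Series([
    "COZICOMFORT LONG TERM STAY ROOM 2",
    "Ensuite Room (Room 1 & 2) near EXPO",
    "B&B Room 1 near Airport & EXPO",
])

cleaned = clean_text(sample)
print(cleaned.tolist())
```

Keeping the steps in one function makes it easier to apply exactly the same cleaning to new data later.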


In [7]:
# Task 4: remove stop words
# Requires the NLTK stop word corpus: nltk.download('stopwords')
stop = set(stopwords.words('english'))  # build the set once, not on every row

def remove_stopwords(text):
    # keep only the words that are not in the stop word set
    return ' '.join(word for word in text.split() if word not in stop)

df['name'] = df['name'].apply(remove_stopwords)
df['name'][10]

'conveniently located city room phone number hidden airbnb'

In [8]:
# Task 5: Stemming the words
stemmer = SnowballStemmer("english")

def stem_words(text):
    # reduce every word to its stem and rebuild the string
    return ' '.join(stemmer.stem(word) for word in text.split())

df['name'] = df['name'].apply(stem_words)
df['name'][10]

'conveni locat citi room phone number hidden airbnb'
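
To see what the stemmer does word by word, the short sketch below applies `SnowballStemmer` to individual words taken from the listing name shown above. Note that a stem is not necessarily a dictionary word; it is only a common root that related word forms share.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Words taken from the example listing name before stemming.
words = ["conveniently", "located", "city", "room"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

Words that are already in their root form, such as `room`, pass through unchanged.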

In [9]:
# Task 6: Use Bag of Words transformation to create new features
# min_df=0.05 keeps only the terms that appear in at least 5% of the names
vectorizer = CountVectorizer(lowercase=True,
                             stop_words='english',
                             ngram_range=(1, 1),
                             min_df=0.05)
vectorizer.fit(df['name'])
newvars = vectorizer.transform(df['name'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
bagofwords = pd.DataFrame(newvars.toarray(),
                          columns=vectorizer.get_feature_names_out())
print(bagofwords.shape)
bagofwords.head()

(7905, 18)


Unnamed: 0,apart,apt,bed,bedroom,br,central,citi,condo,cosi,cozi,min,mrt,near,orchard,privat,room,spacious,studio
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0


In [10]:
# Task 7: Use TF-IDF transformation to create new features
vectorizer2 = TfidfVectorizer(lowercase=True,
                              stop_words='english',
                              ngram_range=(1, 1),
                              min_df=0.05)
vectorizer2.fit(df['name'])
# use vectorizer2 for the transform so the values are TF-IDF weights (floats), not raw counts
vars2 = vectorizer2.transform(df['name'])
tfidf = pd.DataFrame(vars2.toarray(),
                     columns=vectorizer2.get_feature_names_out())
print(tfidf.shape)
tfidf.head()

(7905, 18)
