### Featurizing text data

With clean data in hand, we can ask how best to extract features from it. There are many approaches in text analytics and natural language processing (NLP); we mention only a few below. First, some terminology. A **document** usually refers to a single data point with raw text, such as a tweet, a review, or an invoice. The collection of all documents is the **corpus**, and the collection of unique words in the corpus is called the **vocabulary**. To keep the vocabulary from growing too large, we can trim it to the $N$ most frequent words, making $N$ the size of the vocabulary. Our documents are then made up of "words" that come from the vocabulary (any word that is not in the vocabulary is ignored). The question now is how to represent such data numerically. Here are two approaches:

- The **bag of words (BoW) model** is a simple and surprisingly effective model for analyzing text data. It creates a **sparse vector representation** of each document in the corpus, with one entry per vocabulary word, based on how often that word occurs in the document. Neither the order of the words nor the similarity between different words is considered. Despite these serious shortcomings, the model can work well in many cases.
- We can usually do much better by using **word embeddings**, which are **dense vector representations** of each word in the corpus. Word embeddings are learned by examining each word's **context** (the other words around it). Word embeddings are very common in **deep learning** applications of NLP, although the embeddings themselves are typically learned using a shallow network. If we learn word embeddings once from a very large data set, we can save them and re-use them to create features for other data sets. In fact, **pre-trained word embeddings** are trained by large companies like Google and made available for use by others. So we can load these embeddings and numerically represent a document using the average of the embeddings of the words in it. Because word embeddings are vectors, such an average is also a vector, and it serves as a dense representation of the document.
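
To make the two representations concrete, here is a minimal sketch in plain Python. The three-document corpus and the two-dimensional "embeddings" are made-up toy values for illustration only; real embeddings are learned from data and usually have many more dimensions.

```python
from collections import Counter

# Toy corpus (hypothetical): each string is one document.
docs = ["the cat sat", "the dog sat", "the cat saw the dog"]

# Vocabulary: the unique words in the corpus, in a fixed (sorted) order.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocab):
    """Return one count per vocabulary word; word order is ignored."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

bow_vectors = [bag_of_words(doc, vocab) for doc in docs]

# Made-up 2-dimensional word embeddings, for illustration only.
toy_embeddings = {
    "the": [0.0, 0.1],
    "cat": [0.9, 0.2],
    "dog": [0.8, 0.3],
    "sat": [0.1, 0.9],
    "saw": [0.2, 0.8],
}

def average_embedding(doc, embeddings):
    """Represent a document as the mean of its words' embeddings."""
    vectors = [embeddings[w] for w in doc.split() if w in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

doc_vectors = [average_embedding(doc, toy_embeddings) for doc in docs]

print(vocab)
print(bow_vectors)
print(doc_vectors)
```

Each BoW vector has one entry per vocabulary word, so with a realistic vocabulary most entries are zero, which is why the representation is called sparse. The averaged embedding has the same small dimension no matter how long the document is.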

As you can see, BoW models seem too simplistic and word embeddings seem a bit too sophisticated (at least in the context of the DATASCI 510 course). So here's another approach that sits between the two in terms of difficulty. It is called **TF-IDF** (short for term frequency-inverse document frequency), and it is a clever way to featurize words in documents. Just like in a BoW model, we begin by "tokenizing" the data. In BoW we would then create a count (or presence/absence) feature for each token (or word). In TF-IDF we instead first extract the relative word frequencies per document (called **term frequencies**, or TF) and then multiply them by a multiplier called the **inverse document frequency** (IDF). This has the effect of dampening the values for terms that appear frequently across documents, giving them less influence when we move on to the machine learning phase. Note that we used the words "token", "word" and "term" almost interchangeably. Sorry for confusing you! Data scientists don't always agree on terminology.
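
The TF-IDF calculation can be sketched by hand. The snippet below assumes one common textbook definition: TF is the count of a term divided by the document length, and IDF is ln(N / df), where N is the number of documents and df is the number of documents containing the term. The toy documents are made up. Library implementations differ in detail: scikit-learn's `TfidfVectorizer`, for example, by default uses raw counts for TF, a smoothed IDF, and L2-normalizes each row, so its numbers will not match these.

```python
import math
from collections import Counter

# Toy documents (hypothetical), tokenized by whitespace.
docs = [
    "the senate passed the bill".split(),
    "the game was great".split(),
    "the senate vote was close".split(),
]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Inverse document frequency, textbook variant: ln(N / df)."""
    return math.log(n_docs / doc_freq[term])

def tf_idf(doc):
    """Map each term in one document to TF * IDF."""
    counts = Counter(doc)
    return {term: (count / len(doc)) * idf(term) for term, count in counts.items()}

weights = [tf_idf(doc) for doc in docs]

for w in weights:
    print({term: round(value, 3) for term, value in w.items()})
```

In the first document, "the" ends up with a weight of zero because it appears in every document, while "bill", which appears in only one document, gets the largest weight.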

In [53]:
# 1) Use pandas read_csv with sep='\t' to read in the following 2 files available from the US Naval Academy
import pandas as pd

url_kt = 'https://www.usna.edu/Users/cs/nchamber/data/twitter/keyword-tweets.txt'
url_gt = 'https://www.usna.edu/Users/cs/nchamber/data/twitter/general-tweets.txt'

# The files have no header row, so the columns are numbered 0 (label) and 1 (tweet text).
keyword_tweets = pd.read_csv(url_kt, sep='\t', header=None)
general_tweets = pd.read_csv(url_gt, sep='\t', header=None)

keyword_tweets.head()

| | 0 | 1 |
|---|---|---|
| 0 | POLIT | Global Voices Online » Alex Castro: A liberal... |
| 1 | POLIT | Do the Conservatives Have a Death Wish? http:/... |
| 2 | NOT | @MMFlint I've seen all of your movies and Capi... |
| 3 | POLIT | RT @AllianceAlert: * House Dems ask for civili... |
| 4 | POLIT | RT @AdamSmithInst Quote of the week: My politi... |


In [54]:
# 2) Concatenate these 2 data sets into a single data frame called LabeledTweets that has 2 columns, named Sentiment and Tweet [1 point]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each file's own index.
LabeledTweets = pd.concat([keyword_tweets, general_tweets], ignore_index=True)
LabeledTweets = LabeledTweets.rename(columns={0: 'Sentiment', 1: 'Tweet'})
LabeledTweets

| | Sentiment | Tweet |
|---|---|---|
| 0 | POLIT | Global Voices Online » Alex Castro: A liberal... |
| 1 | POLIT | Do the Conservatives Have a Death Wish? http:/... |
| 2 | NOT | @MMFlint I've seen all of your movies and Capi... |
| 3 | POLIT | RT @AllianceAlert: * House Dems ask for civili... |
| 4 | POLIT | RT @AdamSmithInst Quote of the week: My politi... |
| ... | ... | ... |
| 3999 | NOT | @themoderngal ditto for me. i am having remors... |
| 4000 | NOT | @ceebrito wats goodie my dominican brotha |
| 4001 | NOT | yea my fone iz a DUBB |
| 4002 | NOT | @camerongarcia oh yes! My mom wanted to buy my... |


In [55]:
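# 3) Recode the labels: 'POLIT' -> 1, 'NOT' -> 0
map_rep = {'POLIT': 1, 'NOT': 0}
# Map only the Sentiment column so that the tweet text is never touched.
LabeledTweets['Sentiment'] = LabeledTweets['Sentiment'].map(map_rep)
LabeledTweets.shape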

(4004, 2)

In [None]:
# 4)clean the tweets

# remove all tokens that contain a "@". Remove the whole token, not just the character.
# remove all tokens that contain "http". Remove the whole token, not just the characters.
# replace (not remove) all punctuation marks with a space (" ")
# replace all numbers with a space
# replace all non ascii characters with a space
# convert all characters to lowercase
# strip extra whitespaces
# lemmatize tokens
# No need to remove stopwords here because TfidfVectorizer can take care of that (when stop_words='english' is passed)

import re
import string
from nltk.stem import WordNetLemmatizer  # needs nltk.download('wordnet') once

# remove all tokens that contain a "@". Remove the whole token, not just the character.
# \S*@\S* matches the whole whitespace-delimited token, so the rest of the tweet is kept.
LabeledTweets_no_at = LabeledTweets.copy()
LabeledTweets_no_at['Tweet'] = LabeledTweets_no_at['Tweet'].fillna('').str.replace(r'\S*@\S*', ' ', regex=True)

# remove all tokens that contain "http". Remove the whole token, not just the characters.
LabeledTweets_notHttp = LabeledTweets_no_at.copy()
LabeledTweets_notHttp['Tweet'] = LabeledTweets_notHttp['Tweet'].str.replace(r'\S*http\S*', ' ', regex=True)

# replace (not remove) all punctuation marks with a space (" ")
# re.escape keeps characters such as ']' and '\' from breaking the character class.
punct_pattern = f"[{re.escape(string.punctuation)}]"
LabeledTweets_notHttp['Tweet'] = LabeledTweets_notHttp['Tweet'].str.replace(punct_pattern, ' ', regex=True)
# replace all numbers with a space
LabeledTweets_notHttp['Tweet'] = LabeledTweets_notHttp['Tweet'].str.replace(r'\d+', ' ', regex=True)
# replace all non ascii characters with a space
LabeledTweets_notHttp['Tweet'] = LabeledTweets_notHttp['Tweet'].str.replace(r'[^\x00-\x7F]+', ' ', regex=True)
# convert all characters to lowercase
LabeledTweets_notHttp['Tweet'] = LabeledTweets_notHttp['Tweet'].str.lower()

# strip extra whitespaces (collapse internal runs, then trim both ends)
LabeledTweets_notHttp['Tweet'] = LabeledTweets_notHttp['Tweet'].str.replace(r'\s+', ' ', regex=True).str.strip()

# lemmatize tokens (WordNetLemmatizer treats each word as a noun by default)
lmtzr = WordNetLemmatizer()
LabeledTweets_notHttp['Tweet'] = LabeledTweets_notHttp['Tweet'].apply(
    lambda text: " ".join(lmtzr.lemmatize(word) for word in text.split())
)

print(LabeledTweets_notHttp['Tweet'])
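
As a quick sanity check of the cleaning rules listed at the top of this cell, the sketch below applies them to a single made-up tweet using only the standard library. The helper name `clean_tweet` and the sample text are hypothetical, and lemmatization is left out so that the snippet has no NLTK dependency.

```python
import re
import string

def clean_tweet(text):
    """Apply the token-level cleaning rules to one tweet (no lemmatization)."""
    # Drop whole tokens containing "@" or "http"; keep the rest of the tweet.
    tokens = [t for t in text.split() if "@" not in t and "http" not in t]
    text = " ".join(tokens)
    # Replace punctuation, digits, and non-ASCII characters with a space.
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    text = re.sub(r"\d+", " ", text)
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    # Lowercase, then collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text.lower()).strip()

sample = "RT @AllianceAlert: House Dems ask for 2 votes! http://example.com/x café"
print(clean_tweet(sample))
```

Note that only the offending tokens disappear: the mention and the link are dropped, while the other words of the tweet survive for the vectorizer.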

In [None]:
# 5) Use TfidfVectorizer from sklearn to prepare the data for machine learning. Use max_features = 50.
from sklearn.feature_extraction.text import TfidfVectorizer

clean_texts = LabeledTweets_notHttp['Tweet']

# Stop words are only removed when stop_words is set; the default (None) keeps them.
# Other options worth trying: sublinear_tf=True, max_df=0.5
vectorizer = TfidfVectorizer(max_features=50, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(clean_texts)

# Convert the sparse matrix to a DataFrame with one column per selected term,
# keeping the tweets' index so that rows stay aligned with their labels.
feature_names = vectorizer.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names, index=clean_texts.index)
print(tfidf_df.shape)
tfidf_df.head()