# SOLUTIONS: Advanced ML Week 1, Lecture 1: Working with and Preparing Text Data

In this notebook we will be preparing Twitter (X) Tweets for sentiment analysis.  Sentiment analysis is a common text classification challenge to determine whether a text is positive or negative.  

This is useful for companies that want to analyze large numbers of documents, tweets, reviews, etc., to determine public sentiment about a product or service.

The data was originally gathered from Twitter (now X) and hand-labeled.  Of course there will be some human bias in the labeling.  It was downloaded from Kaggle at this site: [Kaggle Twitter Tweets Sentiment Dataset](https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset/)

There are 3 classes: positive, negative, and neutral.

In [18]:
## Import necessary packages
import pandas as pd
import nltk

# Load the Data

We will load our **corpus** of tweets from the archive we downloaded.

In [19]:
## Load the corpus of tweets
df = pd.read_csv('../Data/archive.zip')
df.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on the releases we already bought","Sons of ****,",negative


# Some light EDA

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27481 entries, 0 to 27480
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   textID         27481 non-null  object
 1   text           27480 non-null  object
 2   selected_text  27480 non-null  object
 3   sentiment      27481 non-null  object
dtypes: object(4)
memory usage: 858.9+ KB


In [21]:
df.duplicated().sum()

0

# Some Light Data Cleaning

We see that our **corpus** has 27481 **documents**, each with an ID, the full text, a shortened version, and the labeled sentiment.

Interestingly, one of the tweets has no text!  We definitely want to get rid of that.  We will also drop the `textID` and `selected_text` columns.  We are going to use the entire text of each tweet, not just a subset.

We will keep the label, `sentiment`, for later classification and analysis tasks.

In [22]:
df = df.drop(columns=['textID', 'selected_text'])
df = df.dropna()

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27480 entries, 0 to 27480
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       27480 non-null  object
 1   sentiment  27480 non-null  object
dtypes: object(2)
memory usage: 644.1+ KB


# Some More EDA
Let's look at some aspects of this text.
* What do the **documents** look like?
* How long do they tend to be?

## View some sample tweets

In [24]:
## Expand how many characters pandas will show
pd.set_option('display.max_colwidth', None)

## Display some of the documents (tweets)
df['text'].head(10)

0                                                             I`d have responded, if I were going
1                                                   Sooo SAD I will miss you here in San Diego!!!
2                                                                       my boss is bullying me...
3                                                                  what interview! leave me alone
4                      Sons of ****, why couldn`t they put them on the releases we already bought
5    http://www.dothebouncy.com/smf - some shameless plugging for the best Rangers forum on earth
6                                2am feedings for the baby are fun when he is all smiles and coos
7                                                                                      Soooo high
8                                                                                     Both of you
9                            Journey!? Wow... u just became cooler.  hehe... (is that possible!?)
Name: text, dtype: object

## Get some statistics on the length of **documents**

In [25]:
## Determine the length of each tweet
df['length'] = df['text'].map(len)
df.head(10)

Unnamed: 0,text,sentiment,length
0,"I`d have responded, if I were going",neutral,36
1,Sooo SAD I will miss you here in San Diego!!!,negative,46
2,my boss is bullying me...,negative,25
3,what interview! leave me alone,negative,31
4,"Sons of ****, why couldn`t they put them on the releases we already bought",negative,75
5,http://www.dothebouncy.com/smf - some shameless plugging for the best Rangers forum on earth,neutral,92
6,2am feedings for the baby are fun when he is all smiles and coos,positive,64
7,Soooo high,neutral,10
8,Both of you,neutral,12
9,Journey!? Wow... u just became cooler. hehe... (is that possible!?),positive,69


In [26]:
## Analyze the statistics of the lengths
df['length'].describe()

count    27480.000000
mean        68.330022
std         35.603870
min          3.000000
25%         39.000000
50%         64.000000
75%         97.000000
max        141.000000
Name: length, dtype: float64

The tweets have a mean length of about 68 characters and a median of 64. They range from 3 to 141 characters, with a standard deviation of about 36.  The middle 50% are between 39 and 97 characters in length.

This gives us some idea of how long they tend to be.

# Text Normalization with NLTK

## Normalizing Casing

It's common practice to lowercase the text in our documents as one step of normalization, so that 'SAD' and 'sad' are treated as the same word.

In [27]:
df['lower_text'] = df['text'].str.lower()
df.head()

Unnamed: 0,text,sentiment,length,lower_text
0,"I`d have responded, if I were going",neutral,36,"i`d have responded, if i were going"
1,Sooo SAD I will miss you here in San Diego!!!,negative,46,sooo sad i will miss you here in san diego!!!
2,my boss is bullying me...,negative,25,my boss is bullying me...
3,what interview! leave me alone,negative,31,what interview! leave me alone
4,"Sons of ****, why couldn`t they put them on the releases we already bought",negative,75,"sons of ****, why couldn`t they put them on the releases we already bought"


## Tokenizing

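Tokenizing text into single-word tokens is simple in Python.  We can just use `str.split()`.  With no arguments, `.split()` splits on any run of whitespace (spaces, tabs, newlines) and discards empty strings, which is not quite the same as passing `' '` explicitly.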

We can access pandas' string accessor on a column with `df['column'].str.<method>`.  This allows us to apply string methods to every row in that column.

When processing text, if memory allows, it can be useful to keep many versions of your text: tokenized, lemmatized, no stop words, etc.  Some analysis or modeling packages expect tokenized data and others do not.  We often want to use different versions for different kinds of analysis, too.
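To make the behavior of `.split()` concrete, here is a small sketch on a made-up string (the sample text is hypothetical, not taken from the dataset).  It compares the default whitespace splitting with an explicit `' '` separator.

```python
# Hypothetical sample text with a double space, a tab, and a trailing newline
sample = "sooo  sad\ti will miss you\n"

# No argument: split on any run of whitespace and drop empty strings
default_tokens = sample.split()

# Explicit single space: only ' ' counts as a separator, so empty strings,
# tabs, and newlines are left in the result
space_tokens = sample.split(' ')

print(default_tokens)
print(space_tokens)
```

The default form is usually what we want for messy text such as tweets, which is why the cell below calls `.str.split()` with no arguments.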

In [28]:
df['tokens'] = df['lower_text'].str.split()
df.head()

Unnamed: 0,text,sentiment,length,lower_text,tokens
0,"I`d have responded, if I were going",neutral,36,"i`d have responded, if i were going","[i`d, have, responded,, if, i, were, going]"
1,Sooo SAD I will miss you here in San Diego!!!,negative,46,sooo sad i will miss you here in san diego!!!,"[sooo, sad, i, will, miss, you, here, in, san, diego!!!]"
2,my boss is bullying me...,negative,25,my boss is bullying me...,"[my, boss, is, bullying, me...]"
3,what interview! leave me alone,negative,31,what interview! leave me alone,"[what, interview!, leave, me, alone]"
4,"Sons of ****, why couldn`t they put them on the releases we already bought",negative,75,"sons of ****, why couldn`t they put them on the releases we already bought","[sons, of, ****,, why, couldn`t, they, put, them, on, the, releases, we, already, bought]"


### Better way to tokenize data

NLTK has a more sophisticated tokenization function that also separates punctuation into its own tokens.  This way 'hooray' and 'hooray!!!' will both produce the token 'hooray'.

`nltk.word_tokenize` relies on NLTK's pretrained 'punkt' tokenizer data, so we need to download it first.  (Recent NLTK releases may ask for the 'punkt_tab' resource instead.)
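To see why separating punctuation matters, here is a rough sketch that uses only the standard library.  The regular expression is a simplified stand-in chosen for illustration; it is not NLTK's algorithm and will differ from `nltk.word_tokenize` in many cases (for example, it splits `...` into three tokens).

```python
import re

def rough_tokenize(text):
    """Lowercase the text, then split it into runs of word characters
    and individual punctuation marks (a simplified illustration only)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

# Whitespace splitting keeps the punctuation attached to the word
print("hooray!!!".split())

# Separating punctuation lets both strings share the token 'hooray'
print(rough_tokenize("hooray"))
print(rough_tokenize("Hooray!!!"))
```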

In [29]:
## Download punkt
nltk.download('punkt')

df['tokens'] = df['lower_text'].apply(nltk.word_tokenize)
df.head()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\caell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,text,sentiment,length,lower_text,tokens
0,"I`d have responded, if I were going",neutral,36,"i`d have responded, if i were going","[i, `, d, have, responded, ,, if, i, were, going]"
1,Sooo SAD I will miss you here in San Diego!!!,negative,46,sooo sad i will miss you here in san diego!!!,"[sooo, sad, i, will, miss, you, here, in, san, diego, !, !, !]"
2,my boss is bullying me...,negative,25,my boss is bullying me...,"[my, boss, is, bullying, me, ...]"
3,what interview! leave me alone,negative,31,what interview! leave me alone,"[what, interview, !, leave, me, alone]"
4,"Sons of ****, why couldn`t they put them on the releases we already bought",negative,75,"sons of ****, why couldn`t they put them on the releases we already bought","[sons, of, *, *, *, *, ,, why, couldn, `, t, they, put, them, on, the, releases, we, already, bought]"


## Remove Stop Words

In [30]:
## Download NLTK stopword list
nltk.download('stopwords')

## Load the English stop words.
stop_words = nltk.corpus.stopwords.words('english')
stop_words[:10]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\caell\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

<font color=red> NOTICE </font> that all of the stop words are lower case.  It's necessary to ensure that your tokens are all lower case before using this list to remove stop words.

To remove the stop words from each document, we will apply a function that will check each word in the list of tokens against the list of stopwords and remove them if they are in the list.  More specifically, it will only save them if they are NOT in the list.
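The filtering logic can be sketched on a single token list before applying it to the whole column.  The tiny stop word set below is a hypothetical subset used only for illustration (the real NLTK list is much longer), and the function name `remove_stopwords_sketch` is made up for this example.

```python
# Hypothetical subset of stop words; a set gives fast membership checks
sample_stop_words = {'i', 'd', 'have', 'if', 'were'}

def remove_stopwords_sketch(tokens, stop_words):
    """Keep only the tokens that are NOT in the stop word collection."""
    return [token for token in tokens if token not in stop_words]

tokens = ['i', '`', 'd', 'have', 'responded', ',', 'if', 'i', 'were', 'going']
kept = remove_stopwords_sketch(tokens, sample_stop_words)
print(kept)

# Tokens that were not lowercased first slip through the filter
mixed_case = remove_stopwords_sketch(['I', 'Have', 'responded'], sample_stop_words)
print(mixed_case)
```

The second call shows why lowercasing has to happen before stop word removal: 'I' and 'Have' do not match the lowercase entries.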

In [31]:
## Remove Stopwords Function
def remove_stopwords(tokens):
    no_stops = [token for token in tokens if token not in stop_words]
    return no_stops

df['no_stops'] = df['tokens'].map(remove_stopwords)
df.head()

Unnamed: 0,text,sentiment,length,lower_text,tokens,no_stops
0,"I`d have responded, if I were going",neutral,36,"i`d have responded, if i were going","[i, `, d, have, responded, ,, if, i, were, going]","[`, responded, ,, going]"
1,Sooo SAD I will miss you here in San Diego!!!,negative,46,sooo sad i will miss you here in san diego!!!,"[sooo, sad, i, will, miss, you, here, in, san, diego, !, !, !]","[sooo, sad, miss, san, diego, !, !, !]"
2,my boss is bullying me...,negative,25,my boss is bullying me...,"[my, boss, is, bullying, me, ...]","[boss, bullying, ...]"
3,what interview! leave me alone,negative,31,what interview! leave me alone,"[what, interview, !, leave, me, alone]","[interview, !, leave, alone]"
4,"Sons of ****, why couldn`t they put them on the releases we already bought",negative,75,"sons of ****, why couldn`t they put them on the releases we already bought","[sons, of, *, *, *, *, ,, why, couldn, `, t, they, put, them, on, the, releases, we, already, bought]","[sons, *, *, *, *, ,, `, put, releases, already, bought]"


## Remove Punctuation

We can remove punctuation in a similar way to how we removed stop words.  However, we will get our list of punctuation from Python's built-in `string` library.

In [32]:
## Import built-in String Library
from string import punctuation
print(punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [33]:
def remove_punct(tokens):
    no_punct = [token for token in tokens if token not in punctuation]
    return no_punct

df['no_stops_no_punct'] = df['no_stops'].apply(remove_punct)
df.head()

Unnamed: 0,text,sentiment,length,lower_text,tokens,no_stops,no_stops_no_punct
0,"I`d have responded, if I were going",neutral,36,"i`d have responded, if i were going","[i, `, d, have, responded, ,, if, i, were, going]","[`, responded, ,, going]","[responded, going]"
1,Sooo SAD I will miss you here in San Diego!!!,negative,46,sooo sad i will miss you here in san diego!!!,"[sooo, sad, i, will, miss, you, here, in, san, diego, !, !, !]","[sooo, sad, miss, san, diego, !, !, !]","[sooo, sad, miss, san, diego]"
2,my boss is bullying me...,negative,25,my boss is bullying me...,"[my, boss, is, bullying, me, ...]","[boss, bullying, ...]","[boss, bullying, ...]"
3,what interview! leave me alone,negative,31,what interview! leave me alone,"[what, interview, !, leave, me, alone]","[interview, !, leave, alone]","[interview, leave, alone]"
4,"Sons of ****, why couldn`t they put them on the releases we already bought",negative,75,"sons of ****, why couldn`t they put them on the releases we already bought","[sons, of, *, *, *, *, ,, why, couldn, `, t, they, put, them, on, the, releases, we, already, bought]","[sons, *, *, *, *, ,, `, put, releases, already, bought]","[sons, put, releases, already, bought]"
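Notice that the `...` token in row 2 was not removed.  Because `punctuation` is a single string, `token in punctuation` is a substring test, and `'...'` is not a substring of it.  One possible alternative, sketched below with the hypothetical helper names `is_all_punct` and `remove_punct_strict`, is to drop any token made up entirely of punctuation characters.

```python
from string import punctuation

# Set of individual punctuation characters for per-character checks
punct_chars = set(punctuation)

def is_all_punct(token):
    """True if the token is non-empty and every character is punctuation."""
    return len(token) > 0 and all(char in punct_chars for char in token)

def remove_punct_strict(tokens):
    """Drop tokens that consist only of punctuation characters."""
    return [token for token in tokens if not is_all_punct(token)]

# The substring test misses multi-character punctuation tokens
print('...' in punctuation)

print(remove_punct_strict(['boss', 'bullying', '...']))
print(remove_punct_strict(['interview', '!', 'leave', 'alone']))
```

Whether to drop tokens like `...` is a judgment call, since repeated punctuation can carry sentiment in tweets.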


## Results

Note how many fewer tokens we have in `no_stops_no_punct` than in our original tokens.  Some information was lost, but a lot was retained.  

Normalization is a huge part of the NLP process.  It is always a balance between reducing the size of our vocabulary, and therefore simplifying our models, and retaining enough information for the model to extract meaningful patterns from the texts.  

There are a lot of choices here to make.
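One way to see the trade-off is to count the vocabulary (the set of unique tokens) before and after normalizing.  The three toy documents below are invented for illustration and are not drawn from the dataset.

```python
from string import punctuation

punct_chars = set(punctuation)

# Hypothetical tokenized documents
docs = [['Sooo', 'SAD', '!'], ['so', 'sad'], ['Sad', 'day', '!']]

def vocabulary(token_lists):
    """Return the set of unique tokens across all documents."""
    return {token for tokens in token_lists for token in tokens}

# Lowercase every token and drop single punctuation characters
normalized_docs = [
    [token.lower() for token in tokens if token not in punct_chars]
    for tokens in docs
]

raw_vocab = vocabulary(docs)
normalized_vocab = vocabulary(normalized_docs)
print(len(raw_vocab), len(normalized_vocab))
```

The vocabulary shrinks because 'SAD', 'Sad', and 'sad' collapse into one token, but the emphasis carried by the capitals and the exclamation marks is gone.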

# Normalizing Text with spaCy

The spaCy Python package provides text processing pipelines that can do many of these operations, plus much more complicated processing, very fast and in many fewer steps.  For this reason it is a very popular tool.  

It utilizes pretrained language models that can recognize things like parts of speech and named entities (people, specific places, currency, etc.)

spaCy was not included in your original dojo_env, so you will need to install it if you have not already.

We will also download the pretrained English language model trained on millions of web documents.  We will use the small-sized one for efficiency.

In [34]:
## Install spacy if necessary
!pip install spacy

import spacy

## Download the English small-sized model trained on web documents if necessary
spacy.cli.download('en_core_web_sm')

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## The spaCy model

In [35]:
## Load the model.  Disable the Named Entity Recognizer (not needed here, and it slows processing)
nlp_model = spacy.load('en_core_web_sm', disable=['ner'])
nlp_model.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer']

We have our model, and we can apply it like a function.  It expects a string of text as the input.

In [36]:
## Process a document with the model
doc = nlp_model(df['text'][0])
doc

 I`d have responded, if I were going

The document is a sequence of tokens that we can iterate over.

## Documents and Tokens

In [37]:
## Display the tokens in the document
[token for token in doc]

[ , I`d, have, responded, ,, if, I, were, going]

Each token is much more than a string.  It is a spaCy `Token` object.

In [38]:
## Isolate the last token in the document
word = doc[-1]

## Display the text and type of the token
print(word)
type(word)

going


spacy.tokens.token.Token

Each token has many attributes that we can take advantage of, such as its lemma and whether it is punctuation, whitespace, or a stop word.

In [39]:
## Display the lemmatized form of the token
word.lemma_

'go'

In [40]:
## Check whether the token is punctuation
word.is_punct

False

In [41]:
## Check whether the token is a space
word.is_space

False

spaCy can even determine each token's part of speech!

In [42]:
## Check the part of speech of the token
word.pos_

'VERB'

In [43]:
[token.pos_ for token in doc]

['SPACE', 'PROPN', 'AUX', 'VERB', 'PUNCT', 'SCONJ', 'PRON', 'AUX', 'VERB']

In [44]:
## Make a list of the lemmas for each token in the document
[token.lemma_ for token in doc]

[' ', 'I`d', 'have', 'respond', ',', 'if', 'I', 'be', 'go']

Notice that the spaCy lemmatization does not automatically lower the casing of words when lemmatizing.  Let's go ahead and make sure they are all lower case.

In [45]:
## Make a list of only the tokens in the document that are not punctuation or spaces
## Lower the casing as well
[token.lemma_.lower() for token in doc if not token.is_punct and not token.is_space]

['i`d', 'have', 'respond', 'if', 'i', 'be', 'go']

In [46]:
## Make a list of all the tokens in the document that are not punctuation, spaces, or stop words
[token.lemma_.lower() for token in doc if not token.is_punct and not token.is_space and not token.is_stop]

['i`d', 'respond', 'go']

In order to use spaCy to process our entire dataframe, we will need to make a function and apply it to our text column.

## Preprocessing with spaCy

In [47]:
## Define a function to use spacy to process our text
def spacy_process(text):
    """Lemmatize tokens, lower case, remove punctuation, spaces, and stop words"""
    doc = nlp_model(text)
    processed_doc = [token.lemma_.lower() for token in doc if not token.is_punct and not token.is_space and not token.is_stop]
    return processed_doc

## process the tweets using the spacy function
df['spacy_lemmas'] = df['text'].apply(spacy_process)
df.head()

Unnamed: 0,text,sentiment,length,lower_text,tokens,no_stops,no_stops_no_punct,spacy_lemmas
0,"I`d have responded, if I were going",neutral,36,"i`d have responded, if i were going","[i, `, d, have, responded, ,, if, i, were, going]","[`, responded, ,, going]","[responded, going]","[i`d, respond, go]"
1,Sooo SAD I will miss you here in San Diego!!!,negative,46,sooo sad i will miss you here in san diego!!!,"[sooo, sad, i, will, miss, you, here, in, san, diego, !, !, !]","[sooo, sad, miss, san, diego, !, !, !]","[sooo, sad, miss, san, diego]","[sooo, sad, miss, san, diego]"
2,my boss is bullying me...,negative,25,my boss is bullying me...,"[my, boss, is, bullying, me, ...]","[boss, bullying, ...]","[boss, bullying, ...]","[boss, bully]"
3,what interview! leave me alone,negative,31,what interview! leave me alone,"[what, interview, !, leave, me, alone]","[interview, !, leave, alone]","[interview, leave, alone]","[interview, leave]"
4,"Sons of ****, why couldn`t they put them on the releases we already bought",negative,75,"sons of ****, why couldn`t they put them on the releases we already bought","[sons, of, *, *, *, *, ,, why, couldn, `, t, they, put, them, on, the, releases, we, already, bought]","[sons, *, *, *, *, ,, `, put, releases, already, bought]","[sons, put, releases, already, bought]","[son, couldn`t, release, buy]"


We used spaCy to tokenize, lemmatize, and remove punctuation and stopwords from our text in one step!

Notice that the spaCy-processed data is a little different from our previously processed data.  The text has been lemmatized, spaCy tokenizes differently (for example, it keeps `couldn`t` as a single token), and spaCy has a different list of stop words than NLTK.

The learn platform has directions for how you can customize your spaCy stopword list and a function with more flexibility in how spaCy will process your data.

# Ngrams

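N-grams are sequences of n consecutive tokens from a document, treated as a single unit.  Bigrams (n=2) and trigrams (n=3) can capture short phrases that single-word tokens miss.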
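The idea can be sketched in plain Python by sliding a window across a token list.  The helper `make_ngrams` below is a hypothetical illustration of the concept, not NLTK's implementation.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # Zip the list against copies of itself shifted by 0, 1, ..., n-1 positions
    return list(zip(*(tokens[i:] for i in range(n))))

sample_tokens = ['shameless', 'plug', 'good']

print(make_ngrams(sample_tokens, 2))
print(make_ngrams(sample_tokens, 3))

# A document shorter than n produces no n-grams
print(make_ngrams(['boss'], 2))
```

A document with k tokens yields k - n + 1 n-grams when k is at least n, and none otherwise.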

In [48]:
## Import the ngrams function
from nltk import ngrams

In [49]:
## Isolate one lemmatized document (index 5, the sixth tweet)
lemma_doc = df['spacy_lemmas'][5]
lemma_doc

['http://www.dothebouncy.com/smf',
 'shameless',
 'plug',
 'good',
 'rangers',
 'forum',
 'earth']

In [50]:
# Create bigrams
list(ngrams(lemma_doc,2))

[('http://www.dothebouncy.com/smf', 'shameless'),
 ('shameless', 'plug'),
 ('plug', 'good'),
 ('good', 'rangers'),
 ('rangers', 'forum'),
 ('forum', 'earth')]

In [51]:
# Create trigrams
list(ngrams(lemma_doc,3))

[('http://www.dothebouncy.com/smf', 'shameless', 'plug'),
 ('shameless', 'plug', 'good'),
 ('plug', 'good', 'rangers'),
 ('good', 'rangers', 'forum'),
 ('rangers', 'forum', 'earth')]

## Applying `ngrams` to make a new column

We need to make a function that returns a list of bigrams.  It won't work to just pass the `ngrams` function to `.apply()`, because `ngrams` also needs a value for `n` and returns a lazy iterator rather than a list.
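The reason for converting to a list is that a lazy iterator can only be consumed once.  The sketch below uses the built-in `zip` as a stand-in for a lazy n-gram iterator; the token list is hypothetical.

```python
tokens = ['boss', 'bully', 'today']

# zip returns a lazy iterator, used here as a stand-in for lazy bigrams
lazy_pairs = zip(tokens, tokens[1:])

# The first pass consumes every pair
first_pass = list(lazy_pairs)

# The iterator is now exhausted, so a second pass finds nothing
second_pass = list(lazy_pairs)

print(first_pass)
print(second_pass)
```

Storing a real list in each row means the bigrams can be displayed, reused, and saved as many times as needed.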

In [52]:
## Create a function to create bigrams
def make_bigrams(doc):
    bigrams = ngrams(doc, 2)
    bigrams = list(bigrams)
    return bigrams

In [53]:
# add bigrams to the df with .apply()
df['bigrams'] = df['spacy_lemmas'].apply(make_bigrams)
df.head()

Unnamed: 0,text,sentiment,length,lower_text,tokens,no_stops,no_stops_no_punct,spacy_lemmas,bigrams
0,"I`d have responded, if I were going",neutral,36,"i`d have responded, if i were going","[i, `, d, have, responded, ,, if, i, were, going]","[`, responded, ,, going]","[responded, going]","[i`d, respond, go]","[(i`d, respond), (respond, go)]"
1,Sooo SAD I will miss you here in San Diego!!!,negative,46,sooo sad i will miss you here in san diego!!!,"[sooo, sad, i, will, miss, you, here, in, san, diego, !, !, !]","[sooo, sad, miss, san, diego, !, !, !]","[sooo, sad, miss, san, diego]","[sooo, sad, miss, san, diego]","[(sooo, sad), (sad, miss), (miss, san), (san, diego)]"
2,my boss is bullying me...,negative,25,my boss is bullying me...,"[my, boss, is, bullying, me, ...]","[boss, bullying, ...]","[boss, bullying, ...]","[boss, bully]","[(boss, bully)]"
3,what interview! leave me alone,negative,31,what interview! leave me alone,"[what, interview, !, leave, me, alone]","[interview, !, leave, alone]","[interview, leave, alone]","[interview, leave]","[(interview, leave)]"
4,"Sons of ****, why couldn`t they put them on the releases we already bought",negative,75,"sons of ****, why couldn`t they put them on the releases we already bought","[sons, of, *, *, *, *, ,, why, couldn, `, t, they, put, them, on, the, releases, we, already, bought]","[sons, *, *, *, *, ,, `, put, releases, already, bought]","[sons, put, releases, already, bought]","[son, couldn`t, release, buy]","[(son, couldn`t), (couldn`t, release), (release, buy)]"


# Save the final data version for modeling

In [54]:
## Save the processed data
df.to_csv('../Data/processed_data.csv', index=False)
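One thing to keep in mind when loading this file later: a CSV stores each list as its text representation, so columns such as `tokens` come back from `pd.read_csv` as strings rather than lists.  A minimal sketch of converting one such cell back, using the standard library's `ast.literal_eval` (the variable names are made up for illustration):

```python
import ast

tokens = ['boss', 'bully']

# What a CSV cell holds after saving: the text representation of the list
as_saved = str(tokens)
print(type(as_saved).__name__, as_saved)

# Safely parse the text back into a real Python list
restored = ast.literal_eval(as_saved)
print(type(restored).__name__, restored)
```

Under the same assumption, a whole column could be converted after loading with something like `df['tokens'].apply(ast.literal_eval)`.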