## CodeLab 1: Cleaning text data with spaCy - nltweets

In [1]:
!dir

 Volume in drive C has no label.
 Volume Serial Number is 4296-BF44

 Directory of C:\Users\riley\Documents\Coding\nltweets\nltweets-master\codelabs

19/02/23  22:00    <DIR>          .
19/02/23  22:00    <DIR>          ..
19/02/23  15:06    <DIR>          .ipynb_checkpoints
19/02/23  14:59            23,768 CodeLab0TwitterAPI.ipynb
19/02/23  22:00            23,716 CodeLab1Spacy.ipynb
19/02/20  19:27           324,254 corpus.txt
19/02/23  14:04               491 tweets_with_mention_uxdesign.txt
19/02/23  10:49               243 twitter_credentials.json
               5 File(s)        372,472 bytes
               3 Dir(s)  91,350,781,952 bytes free


Before we analyze text data, we usually want to clean it to remove noise.

Let's look at a text file containing a tweet on each line. This raw data is pretty messy, as you can see:

In [2]:
with open('corpus.txt') as f:
    for line in f.readlines()[:5]:
        print(line)

b"I'm at SF MUNI - L Taraval - @sfmta_muni in San Francisco, CA https://t.co/3N1JXENWs4"

b'@James_Gross @sfmta_muni'

b'@sfmta_muni I like the Rate My Ride feature in the Muni Mobile app, but can you add an option to give other types of feedback?  For example, arrival predictions on the 24 line are several minutes out of whack tonight.  Or wondering if the new Metro cars will always be so jerky?'

b'People that live and work in San Francisco should be proud of @sfmta_muni supporting @AirResources proposed standard for zero-emission buses https://t.co/TInaTn1zoc https://t.co/Nxukzkco6H'

b'People that live and work in San Francisco should be proud of @sfmta_muni supporting @AirResources proposed standard for zero-emission buses https://t.co/TInaTn1zoc https://t.co/Nxukzkco6H'



First, let's load this data into memory... 

In [3]:
lines = []
with open('corpus.txt') as f:
    for line in f.readlines():
        lines.append(line)

and get rid of the b'...' wrapper around these strings (the slice below also drops the closing quote and the trailing newline):

In [4]:
for index, line in enumerate(lines):
    lines[index] = line[2:-2]

Now what we have is much nicer:

In [5]:
for line in lines[:5]:
    print(line)

I'm at SF MUNI - L Taraval - @sfmta_muni in San Francisco, CA https://t.co/3N1JXENWs4
@James_Gross @sfmta_muni
@sfmta_muni I like the Rate My Ride feature in the Muni Mobile app, but can you add an option to give other types of feedback?  For example, arrival predictions on the 24 line are several minutes out of whack tonight.  Or wondering if the new Metro cars will always be so jerky?
People that live and work in San Francisco should be proud of @sfmta_muni supporting @AirResources proposed standard for zero-emission buses https://t.co/TInaTn1zoc https://t.co/Nxukzkco6H
People that live and work in San Francisco should be proud of @sfmta_muni supporting @AirResources proposed standard for zero-emission buses https://t.co/TInaTn1zoc https://t.co/Nxukzkco6H
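
The slice `line[2:-2]` works here because every line ends with a closing quote and a newline, but it would clip a real character from a final line that has no trailing newline, and it would leave any escape sequences in the text unconverted. A minimal sketch of an alternative, assuming each line of corpus.txt is the printed form of a Python bytes object (as the b'...' prefix suggests), parses the literal with the standard library instead. The helper name `decode_bytes_repr` is made up for this example.

```python
import ast

def decode_bytes_repr(line):
    """Convert text like b'hello' (the printed form of bytes) into a str."""
    text = line.strip()
    try:
        value = ast.literal_eval(text)  # safely parses Python literals only
    except (ValueError, SyntaxError):
        return text  # not a literal: keep the stripped line unchanged
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return text

sample = 'b"I\'m at SF MUNI - L Taraval - @sfmta_muni in San Francisco, CA"\n'
print(decode_bytes_repr(sample))
```

With this helper, `lines = [decode_bytes_repr(line) for line in lines]` could stand in for the loop in In [4].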


Let's do a couple more things to our text using spaCy, a powerful natural language processing library.

First, we need to install spaCy like so:

In [6]:
!pip install spacy



Now, let's import spaCy, download the 'en' model, and load it. Below, we store the default spaCy model in the variable 'nlp'.

In [10]:
import spacy

In [11]:
!python -m spacy download en

Collecting en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm: started
    Running setup.py install for en-core-web-sm: finished with status 'done'
Successfully installed en-core-web-sm-2.0.0

    Linking successful
    C:\Users\riley\AppData\Local\Programs\Python\Python37-32\lib\site-packages\en_core_web_sm
    -->
    C:\Users\riley\AppData\Local\Programs\Python\Python37-32\lib\site-packages\spacy\data\en

    You can now load the model via spacy.load('en')



You do not have sufficient privilege to perform this operation.


In [13]:
nlp = spacy.load('en')

spaCy gives us a bunch of features for free. As you can see below, the default spaCy pipeline will run a part-of-speech tagger, dependency parser, and named entity recognizer on our text. We don't need every feature in the pipeline, so let's use just a few basic ones for this CodeLab.

In [16]:
nlp.pipeline

[('tagger', <spacy.pipeline.Tagger at 0xf569f30>),
 ('parser', <spacy.pipeline.DependencyParser at 0xf547d50>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0xf547ae0>)]

Let's select a sample tweet and run a few preprocessing steps to make it more ready for analysis.

In [17]:
lines[2]

'@sfmta_muni I like the Rate My Ride feature in the Muni Mobile app, but can you add an option to give other types of feedback?  For example, arrival predictions on the 24 line are several minutes out of whack tonight.  Or wondering if the new Metro cars will always be so jerky?'

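We can call nlp.make_doc(string) on our tweet to get a spaCy Doc object. Note that make_doc only runs the tokenizer, not the tagger, parser, or named entity recognizer, which is all we need for the cleaning steps that follow.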

In [18]:
doc = nlp.make_doc(lines[2])
type(doc)

spacy.tokens.doc.Doc

The Doc is already tokenized, meaning the text has been split into individual "tokens." spaCy tokens are words, punctuation marks, or whitespace tagged with useful properties.

In [19]:
tokens = [token for token in doc]
print(tokens)

[@sfmta_muni, I, like, the, Rate, My, Ride, feature, in, the, Muni, Mobile, app, ,, but, can, you, add, an, option, to, give, other, types, of, feedback, ?,  , For, example, ,, arrival, predictions, on, the, 24, line, are, several, minutes, out, of, whack, tonight, .,  , Or, wondering, if, the, new, Metro, cars, will, always, be, so, jerky, ?]


Next, let's use the token's "is_punct" attribute to remove punctuation tokens from our data.

In [20]:
tokens = [token for token in tokens if not token.is_punct]
print(tokens)

[@sfmta_muni, I, like, the, Rate, My, Ride, feature, in, the, Muni, Mobile, app, but, can, you, add, an, option, to, give, other, types, of, feedback,  , For, example, arrival, predictions, on, the, 24, line, are, several, minutes, out, of, whack, tonight,  , Or, wondering, if, the, new, Metro, cars, will, always, be, so, jerky]


Then, let's explicitly remove the blank (whitespace) tokens from our token list.

In [21]:
tokens = [token for token in tokens if token.text != ' ']
print(tokens)

[@sfmta_muni, I, like, the, Rate, My, Ride, feature, in, the, Muni, Mobile, app, but, can, you, add, an, option, to, give, other, types, of, feedback, For, example, arrival, predictions, on, the, 24, line, are, several, minutes, out, of, whack, tonight, Or, wondering, if, the, new, Metro, cars, will, always, be, so, jerky]


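Now, let's delete tokens that are not in the model's vocabulary, such as @mentions and hashtags. This is a blunt filter: in the output below it removes @sfmta_muni, but it also removes the ordinary word "Ride".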

In [22]:
tokens = [token for token in tokens if token.text in nlp.vocab]
print(tokens)

[I, like, the, Rate, My, feature, in, the, Muni, Mobile, app, but, can, you, add, an, option, to, give, other, types, of, feedback, For, example, arrival, predictions, on, the, 24, line, are, several, minutes, out, of, whack, tonight, Or, wondering, if, the, new, Metro, cars, will, always, be, so, jerky]
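
If the goal is specifically to drop Twitter handles, hashtags, and links, one alternative to the vocabulary check (a sketch, not what this CodeLab does) is to strip them from the raw string with a regular expression before tokenizing. The pattern and the helper name `strip_twitter_noise` are assumptions for illustration, and the pattern is deliberately simple.

```python
import re

# An @mention, a #hashtag, or an http(s) link.
TWITTER_NOISE = re.compile(r"@\w+|#\w+|https?://\S+")

def strip_twitter_noise(text):
    """Remove mentions, hashtags, and URLs, then collapse leftover whitespace."""
    without_noise = TWITTER_NOISE.sub(" ", text)
    return " ".join(without_noise.split())

tweet = "@sfmta_muni I like the Rate My Ride feature in the Muni Mobile app"
print(strip_twitter_noise(tweet))
```

Unlike the vocabulary check, this keeps "Ride". The trade-off is that it removes hashtags such as #TThird entirely, even when they carry meaning.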


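We can also delete stop words using the "is_stop" attribute of our tokens. Stop words are common but not meaningful words such as "the" and "in". Capitalized words such as "I", "My", "For", and "Or" survive in the output below, so the check appears to be case-sensitive in this version of spaCy.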

In [23]:
tokens = [token for token in tokens if not token.is_stop]
print(tokens)

[I, like, Rate, My, feature, Muni, Mobile, app, add, option, types, feedback, For, example, arrival, predictions, 24, line, minutes, whack, tonight, Or, wondering, new, Metro, cars, jerky]


Finally, let's retrieve the raw text data from our tokens and join them back into a single string.

In [24]:
tokens = [token.text for token in tokens]

In [25]:
print(" ".join(tokens))

I like Rate My feature Muni Mobile app add option types feedback For example arrival predictions 24 line minutes whack tonight Or wondering new Metro cars jerky


Compared to our original document, our cleaned document is much more concise but retains most of the semantically meaningful information (the word "Ride", for one, was lost along the way).

In [26]:
nlp(lines[2])

@sfmta_muni I like the Rate My Ride feature in the Muni Mobile app, but can you add an option to give other types of feedback?  For example, arrival predictions on the 24 line are several minutes out of whack tonight.  Or wondering if the new Metro cars will always be so jerky?
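
The cleaning steps from In [19] to In [25] can be gathered into one helper. This is a sketch under assumptions: `clean_tokens` is a made-up name, it relies only on the `text`, `is_punct`, `is_space`, and `is_stop` token attributes, and it deliberately leaves out the vocabulary check. The `SimpleNamespace` objects are stand-ins so the snippet runs without a model installed.

```python
from types import SimpleNamespace

def clean_tokens(tokens):
    """Join the text of tokens that are not punctuation, whitespace, or stop words."""
    kept = [
        token.text
        for token in tokens
        if not (token.is_punct or token.is_space or token.is_stop)
    ]
    return " ".join(kept)

def fake_token(text, is_punct=False, is_space=False, is_stop=False):
    """Stand-in exposing only the attributes that clean_tokens reads."""
    return SimpleNamespace(
        text=text, is_punct=is_punct, is_space=is_space, is_stop=is_stop
    )

demo = [
    fake_token("arrival"),
    fake_token("predictions"),
    fake_token("on", is_stop=True),
    fake_token("the", is_stop=True),
    fake_token("24"),
    fake_token("line"),
    fake_token(".", is_punct=True),
    fake_token(" ", is_space=True),
]
print(clean_tokens(demo))
```

With spaCy loaded, the same function could be called as `clean_tokens(nlp.make_doc(lines[2]))`.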

One more thing: since named entity recognition is particularly interesting to our project, let's use the spaCy pipeline to do some analysis on named entities. Unlike make_doc, calling nlp(...) runs the full pipeline, including the named entity recognizer.

For example, let's pick out a tweet with some clear named entities:

In [27]:
line = lines[82]
print(line)

Reminder: Bus shuttles providing #TThird svc btwn Embarcadero and Bayshore/Sunnydale all day today. Watch for signs showing where to board bus &amp; allow for extra travel time. T and #KIngleside prediction times will not be available. https://t.co/HTJkPtzHAA
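
This tweet contains `&amp;`, an HTML-escaped ampersand. A small sketch using the standard library's `html.unescape` turns such entities back into plain characters before the text is handed to spaCy. Whether the rest of the corpus contains other entities is not something this CodeLab shows, so treat this as an optional extra cleaning step.

```python
import html

raw = "Watch for signs showing where to board bus &amp; allow for extra travel time."
print(html.unescape(raw))
```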


spaCy can grab the named entities from this document like so:

In [28]:
doc = nlp(line)
for ent in doc.ents:
    print(ent.label_, ent.text)

MONEY TThird
FAC Embarcadero
ORG Bayshore/Sunnydale
DATE all day today
MONEY #KIngleside


spaCy picks out the named entities, although it makes some mistakes categorizing them. For example, TThird and #KIngleside are labeled MONEY, and Bayshore/Sunnydale is categorized as an organization, but for our purposes it's an SF MUNI stop.

Let's do a simple analysis: we can go through our corpus and count up the most frequently appearing named entities. This can help us get an idea of what SF MUNI users care about the most.

In [29]:
from collections import Counter
entities = []
for line in lines:
    doc = nlp(line)
    for ent in doc.ents:
        entities.append(str(ent))
Counter(entities).most_common(10)

[('@SFBART', 104),
 ('Thanksgiving', 100),
 ('San Francisco', 64),
 ('SF', 55),
 ('Muni', 53),
 ('@sfmta_muni', 43),
 ('Embarcadero', 40),
 ('#', 38),
 ('2', 38),
 ('today', 35)]
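
The top ten includes entries such as '#', '2', and 'today' that say little about what riders care about. One possible refinement, sketched here under assumptions, is to count (label, text) pairs and skip labels that are rarely informative. The function name `count_entities`, the set of skipped labels, and the minimum length are illustrative choices, not part of the CodeLab. The sample pairs come from the output of In [28], with one Embarcadero entry repeated to make the counting visible.

```python
from collections import Counter

# Labels assumed to be uninteresting for this analysis (an illustrative choice).
SKIP_LABELS = {"DATE", "TIME", "CARDINAL", "MONEY", "PERCENT", "QUANTITY", "ORDINAL"}

def count_entities(pairs, skip_labels=SKIP_LABELS, min_length=2):
    """Count entity texts, ignoring skipped labels and very short strings."""
    counts = Counter()
    for label, text in pairs:
        text = text.strip()
        if label in skip_labels or len(text) < min_length:
            continue
        counts[text] += 1
    return counts

pairs = [
    ("MONEY", "TThird"),
    ("FAC", "Embarcadero"),
    ("ORG", "Bayshore/Sunnydale"),
    ("DATE", "all day today"),
    ("MONEY", "#KIngleside"),
    ("FAC", "Embarcadero"),  # repeated for the demo only
]
print(count_entities(pairs).most_common(2))
```

With spaCy, the pairs could be built as `[(ent.label_, ent.text) for ent in doc.ents]`. Skipping MONEY also discards mislabeled entities such as TThird, so the label set is worth tuning against the actual corpus.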

I hope this was helpful in learning some of the basic features of spaCy! Questions? DM me @daniel.zou on the sfbrigade Slack.