![UKDS Logo](images/UKDS_Logos_Col_Grey_300dpi.png)

# Text-mining: Basics

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *New Forms of Data for Social Science Research*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites and social media platforms, text data, and simulations (agent-based modelling), to name a few. We provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/new-forms-of-data" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
May 2020

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Sentiment-Analysis-as-an-example-of-machine-learning/deep-learning-classification" data-toc-modified-id="Sentiment-Analysis-as-an-example-of-machine-learning/deep-learning-classification-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Sentiment Analysis as an example of machine learning/deep learning classification</a></span></li><li><span><a href="#Analyse-trivial-documents-with-built-in-sentiment-analysis-tool" data-toc-modified-id="Analyse-trivial-documents-with-built-in-sentiment-analysis-tool-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Analyse trivial documents with built-in sentiment analysis tool</a></span></li><li><span><a href="#Acquire-and-analyse-lell-trivial-documents" data-toc-modified-id="Acquire-and-analyse-lell-trivial-documents-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Acquire and analyse lell trivial documents</a></span></li><li><span><a href="#Train-and-test-a-sentiment-analysis-tool-with-trivial-data" data-toc-modified-id="Train-and-test-a-sentiment-analysis-tool-with-trivial-data-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Train and test a sentiment analysis tool with trivial data</a></span></li><li><span><a href="#You-can-train-and-test-a-sentiment-analysis-tool-with-more-interesting-data-too..." 
data-toc-modified-id="You-can-train-and-test-a-sentiment-analysis-tool-with-more-interesting-data-too...-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>You can train and test a sentiment analysis tool with more interesting data too...</a></span></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Further-reading" data-toc-modified-id="Further-reading-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Further reading</a></span></li></ul></div>


There is a table of contents provided here at the top of the notebook, but you can also access this menu at any point by clicking the Table of Contents button on the top toolbar (an icon with four horizontal bars; if unsure, hover your mouse over the buttons).

## Introduction

Sentiment analysis is a commonly used example of automatic classification. To be clear, automatic classification means that a model or learning algorithm has been trained on correctly classified documents and uses this training to return a probability assessment of what class a new document should belong to.

Sentiment analysis works the same way, but usually only has two classes - positive and negative. A trained model looks at new data and says whether that new data is likely to be positive or negative. Let's take a look!
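To make "trained on correctly classified documents" a bit more concrete, here is a minimal sketch of a word-counting Naive Bayes classifier written from scratch. It is a toy for illustration only, not how textblob implements its classifiers: the function names, the four training sentences and the add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    label_counts = Counter()             # label -> number of documents
    for text, label in labelled_docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def class_probabilities(text, word_counts, label_counts):
    """Return a dict mapping each label to its probability for this text."""
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label in label_counts:
        # Start from the prior: how common is this label in the training data?
        log_score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so an unseen word never gives zero probability.
            count = word_counts[label][word] + 1
            log_score += math.log(count / (total_words + len(vocabulary)))
        log_scores[label] = log_score
    # Convert the log scores back into probabilities that sum to 1.
    highest = max(log_scores.values())
    unnormalised = {label: math.exp(s - highest) for label, s in log_scores.items()}
    normaliser = sum(unnormalised.values())
    return {label: value / normaliser for label, value in unnormalised.items()}

# Made-up training documents, each already marked with the correct class.
training_docs = [
    ("i love this", "pos"),
    ("what a great day", "pos"),
    ("i hate this", "neg"),
    ("what a horrible day", "neg"),
]
word_counts, label_counts = train_naive_bayes(training_docs)
probabilities = class_probabilities("i love this day", word_counts, label_counts)
print(probabilities)
```

The returned dictionary is the "probability assessment": the class with the highest value is the model's best guess, and the gap between the values hints at how sure it is.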

## Sentiment Analysis as an example of machine learning/deep learning classification

Let's start off by importing and downloading some useful packages, including textblob. TextBlob is based on nltk and has built-in sentiment analysis tools.

Run/Shift+Enter.

In [1]:
import os                         # os is a module for navigating your machine (e.g., file directories).
import nltk                       # nltk stands for Natural Language Toolkit and is useful for text-mining.
import csv                        # csv is for importing and working with csv files.
import statistics                 # statistics provides basic summary statistics (e.g., mean and median).


# List all of the files in the "Sentiment_Analysis" folder that is provided to you
for file in os.listdir("./Sentiment_Analysis"):
    print("A file we can use is... ", file)
print("")

!pip install -U textblob -q
!python -m textblob.download_corpora -q
from textblob import TextBlob

A file we can use is...  testing_set.csv
A file we can use is...  training_set.csv

Finished.


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\mzyssjkc\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mzyssjkc\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mzyssjkc\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\mzyssjkc\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     C:\Users\mzyssjkc\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\mzyssjkc\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is a

## Analyse trivial documents with built-in sentiment analysis tool

Now, let's get some data.

Run/Shift+Enter, as above!

In [2]:
Doc1 = TextBlob("Textblob is just super. I love it!")              # A few simple documents in TextBlob format
Doc2 = TextBlob("Cabbages are the worst. Say no to cabbages!")  
Doc3 = TextBlob("Paris is the capital of France. ")   
print("...")
type(Doc1)

...


textblob.blob.TextBlob

Docs 1 through 3 are TextBlobs, which we can see by the output of type(Doc1).

We get a TextBlob by passing a string to the function that we imported above. Specifically, this is done by using this format --> TextBlob('string goes here'). TextBlobs are ready for analysis through the textblob tools, such as the built-in sentiment analysis tool that we see in the code below.

Run/Shift+Enter on those Textblobs.

In [3]:
print(Doc1.sentiment)
print(Doc2.sentiment)
print(Doc3.sentiment)

Sentiment(polarity=0.47916666666666663, subjectivity=0.6333333333333333)
Sentiment(polarity=-1.0, subjectivity=1.0)
Sentiment(polarity=0.0, subjectivity=0.0)


The previous code returns two values for each TextBlob object. Polarity refers to a positive-negative spectrum and runs from -1 (most negative) to 1 (most positive), while subjectivity refers to a fact-opinion spectrum and runs from 0 (factual) to 1 (opinion).

We can see, for example, that Doc1 is fairly positive but also quite subjective while Doc2 is very negative and very subjective. Doc3, in contrast, is both neutral and factual. 

Maybe you don't need both polarity and subjectivity. For example, if you are trying to categorise opinions, you don't need the subjectivity score and would only want the polarity. 

To get only one of the two values, you can call the appropriate sub-function (strictly speaking, an attribute) as shown below.

Run/Shift+Enter for sub-functional fun. 

In [4]:
print(Doc1.sentiment.polarity)
print(Doc1.sentiment.subjectivity)

0.47916666666666663
0.6333333333333333
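If you only need categories rather than numbers, one option is to turn the polarity score into a label. The sketch below assumes a cut-off of 0.1 either side of zero; both the function name and the threshold are hypothetical choices for illustration, and you would want to tune the threshold for your own data.

```python
def label_from_polarity(polarity, threshold=0.1):
    """Turn a polarity score in [-1, 1] into a coarse category label."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Polarity scores copied from the output for Doc1, Doc2 and Doc3 above.
polarities = {"Doc1": 0.47916666666666663, "Doc2": -1.0, "Doc3": 0.0}
for name, polarity in polarities.items():
    print(name, label_from_polarity(polarity))
```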


## Acquire and analyse less trivial documents

Super. We have imported some documents (in our case, just sentences in string format) to textblob and analysed them using the built-in sentiment analyser. But we don't want to import documents one string at a time... That would take forever!

Let's import data in .csv format instead! The data here comes from a set of customer reviews of Amazon products. Naturally, not all of the comments in the product reviews are really on topic, but it does not actually matter for our purposes. But, I think it is only fair to warn you... There is some foul language and there are potentially objectionable personal opinions in the texts if you go through them all.

Run/Shift+Enter (if you dare!)

In [6]:
with open('./Sentiment_Analysis/training_set.csv', newline='') as f:              # Open the csv of scored product reviews
    reader = csv.reader(f)                                                        # Read the file row by row
    Doc_set = list(reader)                                                        # Store every row as a list of strings

print(Doc_set[:20])                                                          # A quick look at the first 20 items in the csv

[['@queenzita  then why do they have these stupid little pictures on my iPod I can add to text if no one can see them', '0'], ['at the Holocaust museum. here come the tears  love you guys so much!! miss you&lt;33', '0'], ["just got done working out and now i'm sore ", '0'], ['i was under the impression that there was never rain in israel after Pesach. apparently i was wrong ', '0'], ['Lunchbreak is over, back to work ', '0'], ['@cramur They died  we are going to have them replaced before the big party.... are you going to be here for that!?!?', '0'], ['@ACUsports:  dang', '0'], ["for someone frm Mumbai - it's spellbinding to see how organised the cities are. Marvelous society &amp; culture. Indians are light yrs away ", '0'], ['heart breaking ', '0'], ['@HairyGee You lost the url as the tweet was too long  Feel free to cut me out and resend to make it smaller. ;)', '0'], ["I HATE when my alarm doesn't go off ", '0'], ["@Kiki_Neko And thanks for also calling me a coward. I'll just add t

A very good start (although you will see what I mean about the off-topic comments and foul language). 
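Before analysing a csv like this, it can help to check how many rows carry each score. The sketch below uses a tiny in-memory stand-in for the file (two rows copied from the output above plus one made-up row), so the names `sample_csv` and `score_counts` are assumptions for illustration rather than part of the notebook.

```python
import csv
import io
from collections import Counter

# A tiny in-memory stand-in for training_set.csv: two rows copied from the
# output above plus one made-up row with a different score.
sample_csv = io.StringIO(
    '"heart breaking ",0\n'
    '"Lunchbreak is over, back to work ",0\n'
    '"made-up example: what a lovely day",4\n'
)

rows = list(csv.reader(sample_csv))

# Count how often each score (the second string in a row) appears.
score_counts = Counter(row[1] for row in rows if len(row) == 2)
print(score_counts)
```

Running the same count on the real Doc_set would show whether the scores are balanced, which matters later when training a classifier.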

Now, each row of the csv holds two strings (the text and its score), but we need to pass the text to textblob to create a TextBlob object before we can get a polarity or subjectivity score.

The code below creates a new list that has the text string and the sentiment score for each item in the imported Doc_set, and also shows you the first 20 results of that new list to look at. 

Run/Shift+Enter

In [7]:
Doc_set_analysed = []

for item in Doc_set:
    Doc_set_analysed.append([item[0], TextBlob(item[0]).sentiment])

print(Doc_set_analysed[:20])

[['@queenzita  then why do they have these stupid little pictures on my iPod I can add to text if no one can see them', Sentiment(polarity=-0.49374999999999997, subjectivity=0.75)], ['at the Holocaust museum. here come the tears  love you guys so much!! miss you&lt;33', Sentiment(polarity=0.40625, subjectivity=0.4)], ["just got done working out and now i'm sore ", Sentiment(polarity=0.0, subjectivity=0.0)], ['i was under the impression that there was never rain in israel after Pesach. apparently i was wrong ', Sentiment(polarity=-0.225, subjectivity=0.625)], ['Lunchbreak is over, back to work ', Sentiment(polarity=0.0, subjectivity=0.0)], ['@cramur They died  we are going to have them replaced before the big party.... are you going to be here for that!?!?', Sentiment(polarity=0.0, subjectivity=0.1)], ['@ACUsports:  dang', Sentiment(polarity=0.0, subjectivity=0.0)], ["for someone frm Mumbai - it's spellbinding to see how organised the cities are. Marvelous society &amp; culture. Indians

Now, try to edit the code above to return only the polarity or only the subjectivity. 

While you are at it, try to edit the code to return a different number of results to look at. How about 25?
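Once you have a list of polarity scores, a natural next step is to summarise them. The sketch below uses the statistics module that was imported in the first cell. To stay self-contained it works on the polarity scores of the first seven items, copied by hand from the output above, rather than on the full Doc_set_analysed list.

```python
import statistics

# Polarity scores of the first seven items, copied from the output above.
polarity_scores = [-0.49374999999999997, 0.40625, 0.0, -0.225, 0.0, 0.0, 0.0]

mean_polarity = statistics.mean(polarity_scores)
median_polarity = statistics.median(polarity_scores)

# Share of items the built-in analyser scored as exactly neutral.
share_neutral = polarity_scores.count(0.0) / len(polarity_scores)

print("Mean polarity:  ", round(mean_polarity, 3))
print("Median polarity:", median_polarity)
print("Share scored exactly 0:", round(share_neutral, 2))
```

Notice how many of these short texts score exactly 0. A large share of neutral scores can be a hint that the analyser found no opinion words it recognised.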

## Train and test a sentiment analysis tool with trivial data

The built-in tool is all well and good, but... have a look back at the sentiment analysis scores for Doc1 and Doc2. 
- Doc1 scored as .48 on polarity, about halfway between totally neutral and totally positive. 
- Doc2 scored -1 on polarity, which is the most negative it could score. 

Do we really think Doc2 is so much more negative than Doc1 is positive? Hmmmm. Maybe the built-in sentiment analyser is not as accurate as we would want. Let's try to train our own, starting with a small set of trivial training and testing data sets. 
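One plausible reason for the lopsided scores: textblob's default analyser is lexicon-based, meaning it looks up scores for individual words rather than learning from labelled examples. The sketch below is a heavily simplified imitation of that idea. The mini-lexicon and its scores are made up for illustration, and the real analyser is more sophisticated (for example, it handles intensifiers and negation), so treat this as a rough mental model only.

```python
import re

# Hypothetical mini-lexicon: these scores are made up for illustration.
TOY_LEXICON = {"love": 0.5, "super": 0.3, "worst": -1.0, "horrible": -1.0}

def toy_polarity(text):
    """Average the lexicon scores of the words that appear in the lexicon."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [TOY_LEXICON[word] for word in words if word in TOY_LEXICON]
    if not scores:
        return 0.0   # no known opinion words, so treat the text as neutral
    return sum(scores) / len(scores)

print(toy_polarity("Textblob is just super. I love it!"))
print(toy_polarity("Cabbages are the worst. Say no to cabbages!"))
print(toy_polarity("Paris is the capital of France."))
```

With a scheme like this, a single strongly scored word such as "worst" decides the score of a short sentence, which would explain an extreme result for Doc2.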

The following code does a few different things:
- It defines 'train' as a data set with 10 sentences, each of which is marked as 'pos' or 'neg'.
- It defines 'test' as a data set with 6 completely different sentences, also marked as 'pos' or 'neg'. 
- It imports NaiveBayesClassifier from textblob.classifiers.
- It defines 'cl' as a brand new NaiveBayesClassifier that is trained on the 'train' data set. 

Run/Shift+Enter to make it so. 

In [8]:
train = [
    ('I love this sandwich.', 'pos'),
    ('this is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('this is my best work.', 'pos'),
    ("what an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('he is my sworn enemy!', 'neg'),
    ('my boss is horrible.', 'neg')]
test = [
     ('the beer was good.', 'pos'),
     ('I do not enjoy my job', 'neg'),
     ("I ain't feeling dandy today.", 'neg'),
     ("I feel amazing!", 'pos'),
     ('Gary is a friend of mine.', 'pos'),
     ("I can't believe I'm doing this.", 'neg')]


from textblob.classifiers import NaiveBayesClassifier
cl = NaiveBayesClassifier(train)

Hmm. The code ran but there is nothing to see. This is because we have no output! Let's get some output and see what it did. 

The next code block plays around with 'cl', the classifier we trained on our 'train' data set.

The first line asks 'cl' to return a judgement of one sentence about a library.

Then, we ask it to return a judgement of another sentence about something being a doozy. Although both times we get a judgement on whether the sentence is 'pos' or 'neg', the second one has more detailed sub-judgements that show us how positive and how negative the sentence is, so we can see whether the overall judgement is close or not.

Do the Run/Shift+Enter thing that you are so good at doing!

In [9]:
print("Our 'cl' classifier says 'This is an amazing library!' is ", cl.classify("This is an amazing library!"))
print('...')

prob_dist = cl.prob_classify("This one is a doozy.")
print("Our 'cl' classifier says 'This one is a doozy.' is probably",
      prob_dist.max(), "because its positive score is ",
      round(prob_dist.prob("pos"), 2),
      " and its negative score is ",
      round(prob_dist.prob("neg"), 2),
      ".")

Our 'cl' classifier says 'This is an amazing library!' is  pos
...
Our 'cl' classifier says 'This one is a doozy.' is probably pos because its positive score is  0.63  and its negative score is  0.37 .


Super. Now... What if we want to apply our 'cl' classifier to a document with multiple sentences... What kind of judgements can we get with that? 

Well, textblob is sophisticated enough to give an overall 'pos' or 'neg' judgement, as well as a sentence-by-sentence judgement. 

Run/Shift+Enter, buddy. 

In [10]:
blob = TextBlob("The beer is good. But the hangover is horrible.", classifier=cl)

print("Overall, 'blob' is ", blob.classify(), " because its sentences are ...")
for s in blob.sentences:
    print(s)
    print(s.classify())

Overall, 'blob' is  pos  because its sentences are ...
The beer is good.
pos
But the hangover is horrible.
neg


What if we try to classify a document that we converted to Textblob format with the built-in sentiment analyser?

Well, we still have Doc1 to try it on.

Run/Shift+Enter

In [11]:
print(Doc1)
Doc1.classify()

Textblob is just super. I love it!


NameError: This blob has no classifier. Train one first!

Uh huh. We get an error. 

The error message says the blob known as Doc1 has no classifier. It suggests we train one first, but we can just apply 'cl'. 

Run/Shift+Enter

In [12]:
cl_Doc1 = TextBlob('Textblob is just super. I love it!', classifier=cl)
cl_Doc1.classify()

'pos'

Unsurprisingly, when we classify the string that originally went into Doc1 using our 'cl' classifier, we still get a positive judgement. 

Now, what about accuracy? We have been using 'cl' even though it is trained on a REALLY tiny training data set. What does that do to our accuracy? For that, we need to run an accuracy challenge using our test data set. 

Run/Shift+Enter

In [13]:
cl.accuracy(test)


0.8333333333333334

Hmmm. Not perfect.
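The accuracy score is simply the share of test sentences that were given the correct label, so 0.83 on six sentences corresponds to five right and one wrong. The sketch below shows that arithmetic. The output above does not tell us which sentence was misclassified, so the predicted labels here are hypothetical.

```python
def accuracy(predicted_labels, true_labels):
    """Share of predictions that match the true labels."""
    matches = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return matches / len(true_labels)

# True labels of the six sentences in 'test', in order.
true_labels = ["pos", "neg", "neg", "pos", "pos", "neg"]

# Hypothetical predictions with exactly one mistake (the third sentence).
predicted_labels = ["pos", "neg", "pos", "pos", "pos", "neg"]

print(accuracy(predicted_labels, true_labels))
```

With only six test sentences, each one is worth almost 17 percentage points, so a single mistake makes a big difference to the score.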

Fortunately, we can add more training data and try again. The code below defines a new training data set and then runs a re-training function called 'update' on our 'cl' classifier.

Run/Shift+Enter.

In [14]:
new_data = [('She is my best friend.', 'pos'),
            ("I'm happy to have a new friend.", 'pos'),
            ("Stay thirsty, my friend.", 'pos'),
            ("He ain't from around here.", 'neg')]

cl.update(new_data)

True

Now, edit and run the next code block to test out how the new_data has improved 'cl'.

In [21]:
# Copy and paste the accuracy challenge from above into this cell and re-run it to get an updated accuracy score. 



1.0

## You can train and test a sentiment analysis tool with more interesting data too...

This is all well and good, but 'cl' is trained on some seriously trivial data. What if we want to use some more interesting data, like the Doc_set that we imported from .csv earlier?

Well, we are in luck! Sort of...

We can definitely train a classifier on Doc_set, but let's just have a closer look at Doc_set before we jump right in and try that. 


In [None]:
print(Doc_set[:10])
print('...')
print(len(Doc_set))

Doc_set is a set of comments that come from product reviews. Each item has two strings: the first is the comment and the second is a string containing the number 4, 2 or 0. This second string is a score of whether the comment is positive or negative. These scores may have been manually created, or may be the result of a semi-manual or supervised automation process. Excellent for our purposes, but not ideal because:
- These scores are strings rather than integers. You can tell because they are enclosed in quotes.
- These scores range from 0 (negative) to 4 (positive) and also contain 2 (neutral), while the textblob sentiment analysis tool we have been using returns polarity scores from -1 (negative) through 0 (neutral) to 1 (positive), and our 'cl' classifier returns the labels 'pos' and 'neg'.

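Well, we could change all the 4 to 1, 2 to 0 and 0 to -1 with RegEx if we wanted. But this is not strictly necessary: a classifier treats its labels as arbitrary category names, so '4' and '0' would work just as well as 'pos' and 'neg'.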
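If you did want friendlier labels, a plain dictionary lookup is a simple alternative to RegEx. The sketch below is one way to do it. The mapping, the function name `relabel` and the sample rows are all assumptions for illustration; two of the rows are copied from Doc_set above and the third is made up.

```python
# Hypothetical mapping from the string scores in the csv to readable labels.
SCORE_TO_LABEL = {"0": "neg", "2": "neutral", "4": "pos"}

def relabel(rows):
    """Convert [text, score] rows into (text, label) tuples, skipping unknown scores."""
    labelled = []
    for row in rows:
        if len(row) != 2:
            continue   # skip blank or malformed rows
        text, score = row
        if score.strip() in SCORE_TO_LABEL:
            labelled.append((text, SCORE_TO_LABEL[score.strip()]))
    return labelled

# Two rows copied from Doc_set above, plus one made-up positive row.
sample_rows = [
    ["heart breaking ", "0"],
    ["Lunchbreak is over, back to work ", "0"],
    ["made-up example: what a lovely day", "4"],
]
print(relabel(sample_rows))
```

The result is a list of (text, label) tuples, which is the same shape as the 'train' and 'test' data sets used earlier.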

However, there is another issue. Doc_set has 20,000 items. This is big, but it is actually MUCH smaller than it could be. It is a subset of a 1,000,000+ item data set that you can download for free (see extra resources and reading at the end). The original data set was way too big for a Jupyter notebook and was even too big for me to analyse on my laptop. I know because I tried. When you find yourself in a situation like this, you can try:
- Accessing proper research computing facilities (good for real research, too much for a code demo). 
- Dividing a too big data set into chunks, and training/updating a chunk at a time.
- Processing a too big data set to remove punctuation, stop words, urls, twitter handles, etc. (saving computer power for what matters).
- Or a combination of these options. 
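The chunking and clean-up options above can be sketched with nothing but the standard library. The helper names `clean_text` and `chunks` are hypothetical, the regular expressions are deliberately simple, and stop word removal is left out because it needs a word list.

```python
import re
import string

def clean_text(text):
    """Strip urls, Twitter handles, punctuation and extra whitespace."""
    text = re.sub(r"https?://\S+", " ", text)      # urls
    text = re.sub(r"@\w+", " ", text)              # Twitter handles
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.lower().split())

def chunks(items, size):
    """Yield successive slices of 'items' with at most 'size' elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# One of the comments from Doc_set above, cleaned up.
print(clean_text("@ACUsports:  dang"))

# Ten stand-in items split into chunks of at most four.
for chunk in chunks(list(range(10)), 4):
    print(chunk)
```

In practice, each chunk of (text, label) pairs could be passed to the classifier's 'update' function in turn, in the same way that new_data was added to 'cl' earlier.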



## Conclusions

You can train a classifier on whatever data you want and with whatever categories you want. 

Want to train a classifier to recognise sarcasm? Go for it. 
How about recognising lies in political speeches? Good idea. 
How about tweets from bots or from real people? Definitely useful. 

The hard part is actually getting the data ready to train your classifier. But feel free to start small. 10 items? 100? What can you do quickly that will give you enough of an idea to see if it is worth investing more time?

Good luck!

## Further reading

Books, tutorials, package recommendations, etc. for Python

- Natural Language Processing with Python by Steven Bird, Ewan Klein and Edward Loper, http://www.nltk.org/book/
- Foundations of Statistical Natural Language Processing by Christopher Manning and Hinrich Schütze, https://nlp.stanford.edu/fsnlp/promo/
- Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition by Dan Jurafsky and James H. Martin, https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
- Deep Learning in Natural Language Processing by Li Deng, Yang Liu, https://lidengsite.wordpress.com/book-chapters/
- Sentiment Analysis data sets https://blog.cambridgespark.com/50-free-machine-learning-datasets-sentiment-analysis-b9388f79c124

NLTK options
- nltk.corpus http://www.nltk.org/howto/corpus.html
- Data Camp tutorial on sentiment analysis with nltk https://www.datacamp.com/community/tutorials/simplifying-sentiment-analysis-python
- Vader sentiment analysis script available on github (nltk) https://www.nltk.org/_modules/nltk/sentiment/vader.html
- TextBlob https://textblob.readthedocs.io/en/dev/
- Flair, an NLP script available on github https://github.com/flairNLP/flair

spaCy options
- spaCy https://nlpforhackers.io/complete-guide-to-spacy/
- Data Quest tutorial on sentiment analysis with spaCy https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/


Books and package recommendations for R
- Quanteda, an R package for text analysis https://quanteda.io/
- Text Mining with R, a free online book https://www.tidytextmining.com/