![UKDS Logo](images/UKDS_Logos_Col_Grey_300dpi.png)

# Text-mining: Basics

Welcome to the <a href="https://ukdataservice.ac.uk/" target=_blank>UK Data Service</a> training series on *New Forms of Data for Social Science Research*. This series guides you through some of the most common and valuable new sources of data available for social science research: data collected from websites, social media platforms, text data and simulations (agent based modelling), to name a few. We provide webinars, interactive notebooks containing live programming code, reading lists and more.

* To access training materials for the entire series: <a href="https://github.com/UKDataServiceOpen/new-forms-of-data" target=_blank>[Training Materials]</a>

* To keep up to date with upcoming and past training events: <a href="https://ukdataservice.ac.uk/news-and-events/events" target=_blank>[Events]</a>

* To get in contact with feedback, ideas or to seek assistance: <a href="https://ukdataservice.ac.uk/help.aspx" target=_blank>[Help]</a>

<a href="https://www.research.manchester.ac.uk/portal/julia.kasmire.html" target=_blank>Dr Julia Kasmire</a> and <a href="https://www.research.manchester.ac.uk/portal/diarmuid.mcdonnell.html" target=_blank>Dr Diarmuid McDonnell</a> <br />
UK Data Service  <br />
University of Manchester <br />
May 2020

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Retrieval" data-toc-modified-id="Retrieval-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Retrieval</a></span></li><li><span><a href="#Processing" data-toc-modified-id="Processing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Processing</a></span></li><li><span><a href="#Basic-Natural-Language-Processing" data-toc-modified-id="Basic-Natural-Language-Processing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Basic Natural Language Processing</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Conclusion</a></span></li><li><span><a href="#Bibliography" data-toc-modified-id="Bibliography-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Bibliography</a></span></li></ul></div>


There is a table of contents provided here at the top of the notebook, but you can also access this menu at any point by clicking the Table of Contents button on the top toolbar (an icon with four horizontal bars). If you are unsure which button it is, hover your mouse over the buttons.

## Introduction



## Retrieval


The first step in text-mining, or any form of data-mining, is retrieving a data set to work with. Within text-mining, or any language analysis context, one data set is usually referred to as 'a corpus' while multiple data sets are referred to as 'corpora', because it is a Latin word and therefore has a funny plural.

For text-mining, a corpus can be:
- a set of tweets, 
- the full text of an 18th century novel,
- the contents of a page in the dictionary, 
- random gibberish letters and numbers, or
- just about anything else in text format. 


Retrieval is a very important step, but it is not the focus of this training series. If you are particularly interested in creating a corpus from internet data, then we recommend you check out our previous training sessions on Web-scraping (recording or Jupyter notebook) and APIs (recording or Jupyter notebook). Both of these demonstrate and discuss ways to get data from the internet that you could use to build a corpus.

Instead, for the purposes of this session, we will assume that you already have a corpus to analyse. This is easy for us to assume, because we have provided a sample text file that we can use as a corpus for these exercises. 

First, let's check that it is there. To do that, click in the code cell below and either hit the 'Run' button at the top of this page or hold down the 'Shift' key and hit the 'Enter' key. 

For the rest of this notebook, I will use 'Run/Shift+Enter' as shorthand for 'click in the code cell below and either hit the 'Run' button at the top of this page or hold down the 'Shift' key and hit the 'Enter' key'. 


In [18]:
# It is good practice to always start by importing the modules and packages you will need. 
# os is a module for navigating your machine (e.g., file directories) and the print statement is just a bit of encouragement!
import os
print("1. Successfully imported necessary modules")
print("")

# List all of the files in the "data" folder that is provided to you
for file in os.listdir("./data"):
    print("2. One of the files in ./data is...", file)
print("")


1. Successfully imported necessary modules

2. One of the files in ./data is... sample_text.txt



Great! We have imported a useful module and used it to check that we have access to the sample_text file. 

Now we need to load that sample_text file into a variable that we can work with in Python. Time to Run/Shift+Enter again!

In [19]:
# Open the "sample_text" file and read (import) its contents to a variable called "corpus"
with open("./data/sample_text.txt", "r") as f:
    corpus = f.read()

# The file is closed automatically at the end of the "with" block, so we can print afterwards
print(corpus)

This is a sample corpus. It haz some spelling errors and has numbers written two ways. For example, it has both 1972 and ninety-six. 

This sample corpus also uses abbreviations sometimes, but not always. California is spelled out once but also written CA. 

To really complicate things, another country name is written as the U.K., the UK, the United Kingdom, the United Kingdom of Great Britain and The United Kingdom of Great Britain and Northern Ireland becuase sometimes full names are important. 

Further, here is a bunch of unrelated toxt just to fill up the space. 

This privacy policy (“Privacy Policy”) is intended to inform you of some policies and practices regarding the collection, use, and disclosure of your Personal Information through our site and any other sites that links to this Privacy Policy (the “Site”). We define “Personal Information” as information that allows someone to identify you personally or contact you, including for example your name, address, telephone numbe

Hmm. Not excellent literature, but it will do for our purposes. 

A quick look tells us that there are capital letters, contractions, punctuation, numbers as digits, numbers written out, abbreviations, and other things that, as humans, we know are equivalent but that computers do not know about. 

Before we go further, it helps to know what kind of variable corpus is. Run/Shift+Enter the next code block to find out!

In [20]:
type(corpus)

str

This tells us that 'corpus' is one very long string of text characters. That is a good starting point, but 'one long thing' is less than ideal for statistical analysis, which prefers 'lots of short things'. 

As a consequence, a common starting point for text-mining is to turn a text from a string into a list of words in a consistent format, also known as a 'bag of words'. A 'bag of words' ignores word order, including whether any two words occurred next to each other in the original text. As a result, the 'bag of words' will miss out on the noun-verb distinction for 'building' in:
- "He is building a diorama for a school project." where 'building' is a verb and 
- "The building is a clear example of brutalist architecture." where 'building' is a noun.
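
To make the loss of context concrete, here is a minimal sketch that counts the words in those two sentences with Python's built-in `collections.Counter`. The variable names are made up for illustration, and the punctuation handling is deliberately crude.

```python
from collections import Counter

# The two example sentences: 'building' is a verb in one and a noun in the other.
sentences = [
    "He is building a diorama for a school project.",
    "The building is a clear example of brutalist architecture.",
]

# Build one bag of words: lowercase each sentence, split it on white space,
# strip full stops and commas from the ends of each word, then count.
bag_of_words = Counter()
for sentence in sentences:
    for word in sentence.lower().split():
        bag_of_words[word.strip(".,")] += 1

# Both uses of 'building' end up in the same count.
print(bag_of_words["building"])
print(bag_of_words.most_common(3))
```

The bag records how often 'building' occurs, but nothing in it says which occurrence was the verb and which was the noun.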

There are other kinds of analyses, but for now let's proceed with the 'bag of words' model. The first step is to turn the one long string into a list of short strings by dividing the text into words. The most obvious way to do this is by telling the computer to split the string into substrings every time it finds a white space (including tabs and new lines). 
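
As a small illustration of how splitting on white space behaves, the sketch below uses a made-up string (not the sample corpus) that mixes spaces, a tab and a new line.

```python
# A made-up string with double spaces, a tab and a new line between words.
messy = "one  two\tthree\nfour   five"

# With no argument, split() treats any run of white space as one separator.
words = messy.split()
print(words)

# With an explicit separator, only that character counts,
# so repeated spaces produce empty strings in the result.
print(messy.split(" "))
```

Calling `split()` with no argument is usually the safer choice for text, because it copes with tabs, new lines and repeated spaces in one go.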

Let's try that. But this time, let's just have a look at the first 100 things it finds instead of the entire text.
Run/Shift+Enter.

In [24]:
# Split our corpus variable whenever we find white space and print the first 100 things that are split out. 
# NOTE: It is good practice to leave the raw text as a variable and create a new variable with the manipulations.
# This allows us to get back to the original corpus easily. From now on, we can choose to create new variables each time, 
# or to keep rewriting corpus_words. Up to you. 

corpus_words = corpus.split()
print(corpus_words[:100])

['This', 'is', 'a', 'sample', 'corpus.', 'It', 'haz', 'some', 'spelling', 'errors', 'and', 'has', 'numbers', 'written', 'two', 'ways.', 'For', 'example,', 'it', 'has', 'both', '1972', 'and', 'ninety-six.', 'This', 'sample', 'corpus', 'also', 'uses', 'abbreviations', 'sometimes,', 'but', 'not', 'always.', 'California', 'is', 'spelled', 'out', 'once', 'but', 'also', 'written', 'CA.', 'To', 'really', 'complicate', 'things,', 'another', 'country', 'name', 'is', 'written', 'as', 'the', 'U.K.,', 'the', 'UK,', 'the', 'United', 'Kingdom,', 'the', 'United', 'Kingdom', 'of', 'Great', 'Britain', 'and', 'The', 'United', 'Kingdom', 'of', 'Great', 'Britain', 'and', 'Northern', 'Ireland', 'becuase', 'sometimes', 'full', 'names', 'are', 'important.', 'Further,', 'here', 'is', 'a', 'bunch', 'of', 'unrelated', 'toxt', 'just', 'to', 'fill', 'up', 'the', 'space.', 'This', 'privacy', 'policy', '(“Privacy']


OK. Well, this is a start. There are still some problems with spelling errors, capital letters and punctuation. For example, there is a full stop attached to the last word of each sentence. More worryingly, the 100th thing split out by this method is '(“Privacy', which has an opening parenthesis and the opening half of a pair of double quotes attached. 

Clearly, there are lots of steps to take to clean this bag of words. The easiest one to do first is to convert all uppercase letters to lowercase with a built-in Python method. We can try that out next, again returning just the first 100 items instead of the whole thing. 

Do the Run/Shift+Enter thing. 

In [28]:
# You can see that I edited corpus_words rather than create a new variable called corpus_lower_words or something similar. 
# If you prefer to keep each iteration of manipulated corpus as separate, you can edit the variable names used in the code. 

corpus_words = [word.lower() for word in corpus_words]
print(corpus_words[:100])

['this', 'is', 'a', 'sample', 'corpus.', 'it', 'haz', 'some', 'spelling', 'errors', 'and', 'has', 'numbers', 'written', 'two', 'ways.', 'for', 'example,', 'it', 'has', 'both', '1972', 'and', 'ninety-six.', 'this', 'sample', 'corpus', 'also', 'uses', 'abbreviations', 'sometimes,', 'but', 'not', 'always.', 'california', 'is', 'spelled', 'out', 'once', 'but', 'also', 'written', 'ca.', 'to', 'really', 'complicate', 'things,', 'another', 'country', 'name', 'is', 'written', 'as', 'the', 'u.k.,', 'the', 'uk,', 'the', 'united', 'kingdom,', 'the', 'united', 'kingdom', 'of', 'great', 'britain', 'and', 'the', 'united', 'kingdom', 'of', 'great', 'britain', 'and', 'northern', 'ireland', 'becuase', 'sometimes', 'full', 'names', 'are', 'important.', 'further,', 'here', 'is', 'a', 'bunch', 'of', 'unrelated', 'toxt', 'just', 'to', 'fill', 'up', 'the', 'space.', 'this', 'privacy', 'policy', '(“privacy']


In [31]:
!pip install pyspellchecker

Collecting pyspellchecker
  Using cached pyspellchecker-0.5.4-py2.py3-none-any.whl (1.9 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.5.4


In [None]:


#substitute contractions
# from pycontractions import Contractions

# remove punctuation
# substitute abbreviations


In [21]:
import nltk

# Download the tokenizer models and the sample Twitter corpus (only needed once)
nltk.download('punkt')
nltk.download('twitter_samples')

from nltk.corpus import twitter_samples
# Importing word_tokenize from nltk
from nltk.tokenize import word_tokenize

# Define the example text before it is used
text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""

# Pass the string into word_tokenize to break it into word tokens
token = word_tokenize(text)
print(token)

print(twitter_samples)

## Processing



## Basic Natural Language Processing



<!-- ### Thinking computationally

[Barba et al. (2019)](https://jupyter4edu.github.io/jupyter-edu-book/)

* Decomposition: Breaking down data, processes, or problems into smaller, manageable parts
* Pattern Recognition: Observing patterns, trends, and regularities in data
* Abstraction: Identifying the general principles that generate these patterns
* Algorithm Design: Developing the step by step instructions for solving this and similar problems

 -->

In [14]:
import nltk
nltk.download('punkt')


print(text)
type(text)

Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mzyssjkc\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


str

Next, we import 'sent_tokenize' from nltk and use it to split the text into sentences.

In [13]:
from nltk.tokenize import sent_tokenize

tokenized_text=sent_tokenize(text)

print(tokenized_text)
type(tokenized_text)

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"]


list

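The result is a list of four strings, one per sentence. Notice that the full stop in 'Mr.' did not trigger a split.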
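
To see why a trained sentence tokenizer is worth having, compare the output above with a naive rule written with the standard `re` module. This is only a sketch: the pattern is an assumption chosen for illustration and is not what nltk does internally.

```python
import re

text = """Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""

# Naive rule: split on white space that follows a full stop,
# question mark or exclamation mark.
naive_sentences = re.split(r"(?<=[.?!])\s+", text)

for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule breaks the text after 'Mr.' and so returns five pieces instead of four.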

In [15]:
from nltk.tokenize import word_tokenize

tokenized_word=word_tokenize(text)

print(tokenized_word)
type(tokenized_word)

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']


list

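The result is a list of word tokens. Punctuation marks become tokens of their own, 'pinkish-blue' stays whole, and the contraction "shouldn't" is split into 'should' and "n't".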
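
For comparison, a rough regular-expression tokenizer can be sketched with the standard `re` module. The pattern below is a hypothetical one written for this example; it is not how 'word_tokenize' works, and it makes a different choice about contractions.

```python
import re

text = "The sky is pinkish-blue. You shouldn't eat cardboard"

# Hypothetical pattern: a run of word characters, optionally joined by
# hyphens or apostrophes, or else any single character that is neither
# a word character nor white space (i.e. a punctuation mark).
pattern = r"\w+(?:[-']\w+)*|[^\w\s]"

regex_tokens = re.findall(pattern, text)
print(regex_tokens)
```

Here "shouldn't" stays as one token, whereas 'word_tokenize' splits it in two. Neither choice is wrong, but it needs to be made consistently before counting words.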

## Conclusion

Hopefully this chapter has demystified aspects of CSS and whetted your appetite for some applied work. The subsequent chapters provide plenty of opportunity to practice CSS with various forms of data. For now I wanted to reflect on some outstanding issues.

<!-- #### Python vs R vs Julia vs ....

[Perhaps a table with some properties of each?] The general point is it's your choice.
 -->

## Bibliography


<!-- ## Further Reading and Resources

[Copy AQMEN reading lists] -->