
# Lecture Notes

<h2>Automating Experiments</h2>

**Pipelines allow us to automate our experiments from data collection all the way to visualization.**

Let’s say we want to monitor corruption in societies and governments. The more corruption there is, we think, the more people will be talking about corruption in news articles and tweets. So we use text analytics to categorize new articles and new tweets: how many people are talking about corruption today? Over time, this gives us a dashboard to monitor corruption. We can ask whether there is a statistically significant change against our expected rate. This kind of application is a good example of why we need a pipeline: to be effective, we have to monitor talk about corruption continuously. The analysis must be automated from start to finish. Automated pipelines are best practice because they allow reproducible and streaming analytics: every step along the way is performed without human intervention. This means that every choice is documented in the code and easy to change.

The main idea in this module is that digital documents are stored as strings, a basic data type in programming languages. But humans don’t view language as strings; we think about language as a sequence of individual words. And AI can’t work with language as strings, either. AI needs to convert language into numbers.

So we need a pipeline that takes a string like “The cat sat on the mat” and turns it into something that looks like actual language: “the”, “cat”, “sat”, “on”, “the”, “mat”.  Then we need to make that language work for a machine: we have to convert it into a numeric representation. This module takes us from input (digital text) to a linguistic representation to a machine representation.
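As a rough illustration of that path, the sketch below takes the example string, splits it into lowercase words, and then maps each distinct word to an integer. The names `vocab` and `ids` are only illustrative, and numbering words by first appearance is an arbitrary choice; real pipelines use richer numeric representations.

```python
# A minimal sketch: string -> words -> numbers.
text = "The cat sat on the mat"

# Linguistic representation: a list of lowercase words.
words = text.lower().split()

# Machine representation: give each distinct word an integer id,
# in order of first appearance (an arbitrary but simple choice).
vocab = {}
for word in words:
    if word not in vocab:
        vocab[word] = len(vocab)

ids = [vocab[word] for word in words]

print(words)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(ids)    # [0, 1, 2, 3, 0, 4]
```

Both occurrences of "the" receive the same id, which is only possible because the case was standardized before the words were numbered.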

* **Indexing**. Strings are ordered. This means each character has a particular place. Look at the string “the cat sat on the mat.” The first character is “t” and the seventh character is also “t”. This index includes every character, including spaces and punctuation. You’ll see why this matters in the labs.

* **Punctuation**. Now we have to remove or separate punctuation. This allows us to identify individual words without punctuation getting in the way. For example, “cat” and “cat,” and “cat:” are different strings to a machine. But they aren’t different words to a human.

* **Symbols and Numbers**. Texts have a lot of non-linguistic information. For example, if we want to know what language a tweet is written in, we probably want to get rid of URLs and hashtags. If we want to know what a web page is about, we probably want to get rid of email addresses. These parts of a text just create noise that gets in our way.

* **Letter Case**. Are “cat” and “Cat” and “CAT” the same word? In most cases, yes. But these are different strings. So we standardize case and characters as part of our pipeline, too.
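The points in this list can be checked directly in Python. The sketch below is only an illustration on the example string; note that Python counts positions from 0, so the "first" character is index 0 and the "seventh" is index 6.

```python
import string

text = "the cat sat on the mat."

# Indexing: every character, including spaces and punctuation, has a position.
print(text[0])         # t
print(text[6])         # t
print(text[3] == " ")  # True: the space counts as a character
print(len(text))       # 23

# To a machine these are four different strings.
variants = ["cat", "cat,", "cat:", "Cat"]
print(len(set(variants)))  # 4

# After lowercasing and stripping punctuation from the edges, one string remains.
normalized = {v.lower().strip(string.punctuation) for v in variants}
print(normalized)  # {'cat'}
```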


The language that we get from a string is very different from the language that we get as humans. Think about the sentence “Peter thought he had forgotten his manners.” What kind of information does a human get out of this string?

**First**, we know this string has 7 words. Four of these words contain content: Peter is a specific person, thought is a verb for cognition, forgotten is another verb for cognition, and manners is a social object, a way of behaving properly. The other three words are functional: he and his just refer back to Peter and had tells us about when it happened. Humans know that the string contains these individual words and, generally, what each word means.

**Second**, we humans know that this sentence has two clauses:

**[Peter thought [he had forgotten his manners] ]**

In fact, the second clause is a sentence on its own. We humans also know that **[his manners]** behaves like a single unit and that **[had forgotten]** behaves like a single unit (a noun and a verb, respectively). So humans also generalize about the type of word. We know that other nouns can go in the same place that **[his manners]** goes: Peter thought he had forgotten the three oranges. But machines don’t know about nouns.

**Third**, we humans know that if you forget your keys that means you’ve left them somewhere else. And if you forget your line that means you can’t remember something. But when you forget your manners you still remember how you should act and you still have your manners with you. This is important because we have just one word (“forget”) but humans know that it means something a bit different in each case. A machine doesn’t know this.

As humans, we automatically divide language into words, combine words into phrases, and rule out word meanings that aren’t relevant in our context. But machines don’t do this.

How do we make strings look like language? We walk through the coding details in the labs, but the basic idea is to split, clean, chunk, and disambiguate.

* **Split It.** First we take that string and split it into words. For most languages we can use whitespace, but of course that doesn’t work for languages like Chinese that don’t use whitespace the way English does. And English sometimes separates one word, like White House or attorney general, into two words. We’ve just learned to spell these words with a space in the middle. But whitespace usually works just fine.

* **Clean It.** Second, we have to get the text ready. “Cat” and “cat” and “cat?” are three different strings, but they’re just one word. So we have to remove punctuation, convert all the letters to lowercase, and a few things like that. The labs give you code samples to get this all done.

* **Chunk It.** We might need to know where there are gaps in a sentence. Take this sentence: “The owner of the dog who kept smelling people was horribly embarrassed.” Who is smelling people, the dog or the owner? We have two options, given below. Humans usually know how to read sentences like this. But machines don’t.

<br>
(1) [The owner [of the dog] who kept smelling people]

(2) [The owner [of the dog who kept smelling people]]
<br>


* **Disambiguate It.** Think of a simple joke: “Smoking might kill you and bacon might kill you, but I heard that smoking bacon actually cures it.” This is funny (if it is) because cure means two different things. There’s just one string (to a machine) but there are actually two different words (to a human). We usually know which particular sense is meant. We’ll pick this up in a later section.
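To make the splitting step concrete, here is a small sketch of whitespace splitting followed by a pass that rejoins words like White House. The phrase list and the underscore convention are assumptions made for illustration; this is not how the *text_analytics* package does it.

```python
# Hypothetical list of two-word expressions to treat as single words.
PHRASES = {("white", "house"), ("attorney", "general")}

def split_and_join(text, phrases=PHRASES):
    """Split on whitespace, then merge known two-word phrases into one token."""
    words = text.lower().split()
    tokens = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in phrases:
            # Two adjacent words form a known phrase: emit them as one token.
            tokens.append("_".join(pair))
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens

print(split_and_join("The White House tried to block them"))
# ['the', 'white_house', 'tried', 'to', 'block', 'them']
```

A fixed phrase list only covers the expressions someone thought to include, so it is a starting point rather than a general solution.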


# Colab Setup

In [1]:
# If you are running these labs in Colab, you will first need to mount the drive and
# copy text_analytics.py into the working directory

from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [2]:
# Copy text_analytics.py into the working directory so it can be imported
!cp "/content/drive/My Drive/Colab Notebooks/CourseWork/Text Analytics and Natural Language Processing/text_analytics.py" .
print("Done!")

Done!


# Lecture Lab

Welcome to our first lab of Module 2! In this lab we're going to do some work with strings.

So let's start by loading up our environment.

In [3]:
from text_analytics import text_analytics
import os
import pandas as pd

ai = text_analytics()
print("Done!")

Done!


This time we're going to work with articles about corruption. These are lead paragraphs from *The New York Times*. Some are about corruption and some aren't. But they are drawn from the same set of countries. So we load our data set in memory.

In [5]:
file = "NYT.Corruption.gz"
file = os.path.join(ai.data_dir, file)
print(file)
df = pd.read_csv(file, index_col = 0)
print(df)
print("Done!")

/content/drive/My Drive/Data/NYT.Corruption.gz
         Country       Class                                               Text
0      Argentina       Other  Secretary of State Madeleine Albright, visitin...
1      Argentina  Corruption  Argentine Pres Fernando de la Rua denies recen...
2      Argentina       Other  British Telecommunications PLC plans to invest...
3      Argentina       Other  Shareholders clear $650 million rescue package...
4      Argentina       Other  United States International Trade Commission r...
...          ...         ...                                                ...
19310  Venezuela       Other  Representative Ilhan Omar, a Minnesota Democra...
19311  Venezuela       Other  Secretary of State Mike Pompeo made his first ...
19312  Venezuela       Other  President Nicolás Maduro has long enjoyed the ...
19313  Venezuela       Other  Secretary of State Mike Pompeo discussed diffe...
19314  Venezuela       Other  Juan Guaidó, the Venezuelan opposition lead
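
The load step relies on two pandas behaviors: `read_csv` infers gzip compression from a `.gz` file extension, and `index_col = 0` uses the first column as the row index. The sketch below reproduces both on a tiny stand-in file, since the course data set is not available here; the three sample rows are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Build a tiny stand-in file with the same three columns as the lab data.
sample = pd.DataFrame({
    "Country": ["Argentina", "Argentina", "Venezuela"],
    "Class": ["Other", "Corruption", "Other"],
    "Text": ["First lead paragraph.", "Second lead paragraph.", "Third lead paragraph."],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv.gz")
    sample.to_csv(path)                      # compression inferred from ".gz"
    loaded = pd.read_csv(path, index_col=0)  # first column becomes the index

# How many rows fall in each class?
print(loaded["Class"].value_counts())
print(loaded.shape)  # (3, 3)
```

Counting the `Class` column is a quick way to see how many paragraphs are labelled as corruption before doing any text cleaning.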

We're working with *strings* today. So we'll grab one at random from the data set.

In [14]:
line = ai.print_sample(df)
print("Done!")

['The Red Cross sent in their first shipment of medical supplies and hopes to distribute them without political intervention. The delay cost an untold number of lives.']
Done!


So here we have a raw string. It doesn't tell us anything about words.

Here we're going to split this string into individual words.

In [15]:
line_split = line.split()
print(line_split)
print("Done!")

['The', 'Red', 'Cross', 'sent', 'in', 'their', 'first', 'shipment', 'of', 'medical', 'supplies', 'and', 'hopes', 'to', 'distribute', 'them', 'without', 'political', 'intervention.', 'The', 'delay', 'cost', 'an', 'untold', 'number', 'of', 'lives.']
Done!


We notice that, if there is any punctuation, it is included inside of a word. So we have a series of cleaning steps to change this. Let's take our line and try them in sequence.

In [16]:
# Class code commented out: I would rather see the same line go through each transformation
#line = ai.print_sample(df)
print(line)
print("\n")
#line = ai.clean_wordclouds(line, stage = 1)
line_clean_stage1 = ai.clean_wordclouds(line, stage = 1)
print(line_clean_stage1)
print("Done!")

The Red Cross sent in their first shipment of medical supplies and hopes to distribute them without political intervention. The delay cost an untold number of lives.


['The', 'Red', 'Cross', 'sent', 'shipment', 'medical', 'supplies', 'hopes', 'distribute', 'political', 'intervention.', 'The', 'delay', 'cost', 'untold', 'number', 'lives.']
Done!


This removes stopwords. These are words that are so common they dilute the content of a text.
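
A minimal sketch of stopword filtering is shown below. The stopword set is a small hypothetical one, guessed from the words that disappeared in the output above. The matching is case-sensitive, which may be why the capitalized 'The' survives at this stage; that is an inference from the output, not a description of how `clean_wordclouds` is written.

```python
# A tiny, hypothetical stopword list; real lists have a few hundred entries.
STOPWORDS = {"the", "in", "their", "first", "of", "and", "to", "them", "without", "an"}

def remove_stopwords(words, stopwords=STOPWORDS):
    """Keep only the words that are not in the stopword set (case-sensitive)."""
    return [w for w in words if w not in stopwords]

words = "The delay cost an untold number of lives.".split()

print(remove_stopwords(words))
# ['The', 'delay', 'cost', 'untold', 'number', 'lives.']

# Lowercasing first lets the capitalized "The" match the list as well.
print(remove_stopwords([w.lower() for w in words]))
# ['delay', 'cost', 'untold', 'number', 'lives.']
```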

In [17]:
# Class code commented out: I would rather see the same line go through each transformation
#line = ai.print_sample(df)
print(line)
print("\n")
line_clean_stage2 = ai.clean_wordclouds(line, stage = 2)
print(line_clean_stage2)
print("Done!")

The Red Cross sent in their first shipment of medical supplies and hopes to distribute them without political intervention. The delay cost an untold number of lives.


['red', 'cross', 'sent', 'shipment', 'medical', 'supplies', 'hopes', 'distribute', 'political', 'intervention.', 'delay', 'cost', 'untold', 'number', 'lives.']
Done!


This makes everything lowercase as well.

In [19]:
# Class code commented out: I would rather see the same line go through each transformation
#line = ai.print_sample(df)
print(line)
print("\n")
line_clean_stage3 = ai.clean_wordclouds(line, stage = 3)
print(line_clean_stage3)
print("Done!")

The Red Cross sent in their first shipment of medical supplies and hopes to distribute them without political intervention. The delay cost an untold number of lives.


['red', 'cross', 'sent', 'shipment', 'medical', 'supplies', 'hopes', 'distribute', 'political', 'intervention', 'delay', 'cost', 'untold', 'number', 'lives']
Done!


This has removed punctuation.
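
There is more than one way to remove punctuation, and the choice affects words with internal punctuation. The sketch below compares two common approaches using only the standard library; which approach the *text_analytics* package takes is not shown in this lab.

```python
import string

words = ["political", "intervention.", "administration,", "lives."]

# Option 1: strip punctuation only from the edges of each word.
edges = [w.strip(string.punctuation) for w in words]
print(edges)  # ['political', 'intervention', 'administration', 'lives']

# Option 2: delete every punctuation character, wherever it occurs.
table = str.maketrans("", "", string.punctuation)
print("don't".strip(string.punctuation))  # don't
print("don't".translate(table))           # dont
```

Stripping only the edges keeps contractions intact, while deleting every punctuation character is simpler but changes words like "don't".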

In [20]:
# Class code commented out: I would rather see the same line go through each transformation
#line = ai.print_sample(df)
print(line)
print("\n")
line_clean_stage4 = ai.clean_wordclouds(line, stage = 4)
print(line_clean_stage4)
print("Done!")

The Red Cross sent in their first shipment of medical supplies and hopes to distribute them without political intervention. The delay cost an untold number of lives.


['red', 'cross', 'sent', 'shipment', 'medical', 'supplies', 'hopes', 'distribute', 'political', 'intervention', 'delay', 'cost', 'untold', 'number', 'lives']
Done!


And this gets rid of any other non-linguistic material, like email addresses. So, we don't always need to use this step-by-step cleaning function. Mostly we'll use the code below to fully clean each line.

In [21]:
# Class code commented out: I would rather see the same line go through each transformation
#line = ai.print_sample(df)
print(line)
print("\n")
line_clean_stage5 = ai.clean_wordclouds(line, stage = 5)
print(line_clean_stage5)
print("Done!")

The Red Cross sent in their first shipment of medical supplies and hopes to distribute them without political intervention. The delay cost an untold number of lives.


['red', 'cross', 'sent', 'shipment', 'medical', 'supplies', 'hopes', 'distribute', 'political', 'intervention', 'delay', 'cost', 'untold', 'number', 'lives']
Done!
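
The sample line contains no email addresses or links, which is presumably why the stage 4 output above matches the stage 3 output. To see what removing non-linguistic material might look like, here is a sketch with deliberately simple regular expressions. The patterns and the function name `strip_non_linguistic` are assumptions for illustration, not the package's own code.

```python
import re

# Deliberately simple patterns; they will miss unusual addresses and links.
URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"\S+@\S+\.\S+")
HASHTAG = re.compile(r"#\w+")

def strip_non_linguistic(text):
    """Remove URLs, email addresses, and hashtags, then tidy the whitespace."""
    for pattern in (URL, EMAIL, HASHTAG):
        text = pattern.sub(" ", text)
    # Collapse the runs of spaces left behind by the removals.
    return " ".join(text.split())

raw = "Contact press@example.org or see https://example.org/report #corruption today"
print(strip_non_linguistic(raw))
# Contact or see today
```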


And that's all for this lab. We've seen examples of what each pre-processing step looks like. For our purposes, you can use this code, *ai.clean()*, to take care of the cleaning. If you want to have a closer look, reference that function in the *text_analytics* package.

In [22]:
line = ai.print_sample(df)
print(line)

line_split = line.split()
print("\n")
print("Line Split")
print(line_split)

line_s1 = ai.clean_wordclouds(line, stage = 1)
print("\n")
print("Line Stage 1", "(Remove stop words)")
print(line_s1)

line_s2 = ai.clean_wordclouds(line, stage = 2)
print("\n")
print("Line Stage2", "(Make everything lower case)")
print(line_s2)

line_s3 = ai.clean_wordclouds(line, stage = 3)
print("\n")
print("Line Stage3", "(Remove punctuation)")
print(line_s3)

line_s4 = ai.clean_wordclouds(line, stage = 4)
print("\n")
print("Line Stage4", "(Remove any other non-linguistic material: emails, links)")
print(line_s4)

line_s5 = ai.clean_wordclouds(line, stage = 5)
print("\n")
print("Line Stage5", "(Join phrases)")
print(line_s5)


['How several career diplomats illuminated the state of foreign policy in the Trump administration, even as the White House tried to block them from testifying.']
How several career diplomats illuminated the state of foreign policy in the Trump administration, even as the White House tried to block them from testifying.


Line Split
['How', 'several', 'career', 'diplomats', 'illuminated', 'the', 'state', 'of', 'foreign', 'policy', 'in', 'the', 'Trump', 'administration,', 'even', 'as', 'the', 'White', 'House', 'tried', 'to', 'block', 'them', 'from', 'testifying.']


Line Stage 1 (Remove stop words)
['How', 'several', 'career', 'diplomats', 'illuminated', 'state', 'foreign', 'policy', 'Trump', 'administration,', 'White', 'House', 'tried', 'block', 'testifying.']


Line Stage2 (Make everything lower case)
['several', 'career', 'diplomats', 'illuminated', 'state', 'foreign', 'policy', 'trump', 'administration,', 'white', 'house', 'tried', 'block', 'testifying.']


Line Stage3 (Remove punct