<a href="https://colab.research.google.com/github/Baron-Sun/E-book-Research-/blob/master/Linguistic_Data_Science_Week_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Welcome to Linguistic Data Science!**

The first thing we will be doing is a review of some of the basic parts of the Python language that we will be using. Right now we will look at built-in objects. Later, we will be using objects coming from the libraries `numpy`, `pandas`, and `nltk`. We will also cover how to use Jupyter notebooks (where we are now).

You will be turning in your mini-project writeups in the form of Jupyter notebooks. Jupyter notebooks allow you to combine text, code, and visualization. They are an example of **literate code**: code that is meant to be read in the same way that text is meant to be read, and is accompanied by text and figures.

We will be dealing with text data. At first, we will work with small amounts of text data. As the course goes on, we will work with larger amounts, and toward the end of the class, very large amounts.

For now, let's just look at a single paragraph worth of text.

In [None]:
text = """It was the best of times, it was the worst of times, it was the age of
wisdom, it was the age of foolishness, it was the epoch of belief, it was the 
epoch of incredulity, it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair, we had everything 
before us, we had nothing before us, we were all going direct to Heaven, we were
all going direct the other way – in short, the period was so far like the 
present period, that some of its noisiest authorities insisted on its being 
received, for good or for evil, in the superlative degree of comparison only.
"""

print(text)

Write in some of your own text data below, any English text.

In [None]:
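mytext1 = ""  # TODO: replace the empty string with your own English text
mytext2 = ""  # TODO: replace the empty string with more of your own text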

Here we've stored our data in an object of type `str`, named `text`.  In general, you can figure out the type of an object by calling `type` on it.

In [None]:
type(text)

A `str` in Python 3 represents a string of Unicode characters. There is no problem using non-ASCII or non-English characters in a `str`:

In [None]:
chinese_text = "自強不息；厚德載物"
print(chinese_text)

emoji_text = "this 🧑 is an emoji"
print(emoji_text)
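
One consequence worth seeing directly is that `len` on a `str` counts characters (Unicode code points), not bytes. The sketch below uses a throwaway variable `sample` (an illustration, not part of the course data) and picks UTF-8 as an assumed example encoding:

```python
# `sample` is an illustrative string, not part of the course data.
sample = "自強不息"

# len() on a str counts Unicode code points (characters).
num_chars = len(sample)

# Encoding to UTF-8 produces bytes; each of these characters takes 3 bytes.
num_bytes = len(sample.encode("utf-8"))

print(num_chars, num_bytes)  # prints: 4 12
```

The difference only matters when you read or write files, but it explains why a file's size in bytes can be larger than the length of the string you get from it.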

String objects in Python are super versatile. You can figure out what you can do with an object by calling `help` on its type:

In [None]:
help(str)

You will use this a lot. No one has memorized everything there is to know about every type in any programming language. Even expert programmers and data scientists often have not memorized the details of the programming languages and frameworks they work in, because they work in many different languages and frameworks. What they are good at is looking up references, getting help fast, and playing around to quickly familiarize themselves with a system. Right now we are looking at basic Python types, but as we go on, we will be using various libraries for data analysis, visualization, and machine learning. It will be crucial to know how to get help on how to use these libraries. Sometimes, the `help` function will be enough. Other times, you will have to look at online documentation.
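
As a small sketch of "getting help fast": the built-in `dir` function lists the attributes of a type, and `help` also accepts a single method. The variable name `public_methods` is just an illustration.

```python
# List the names defined on str that do not start with an underscore;
# these are its ordinary (public) methods.
public_methods = [name for name in dir(str) if not name.startswith("_")]
print(public_methods)

# help() also works on a single method, which is quicker to read
# than the documentation for the whole type.
help(str.upper)
```

This gives a compact overview first, and then the details for just the method you care about.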

Check out the help documentation for the type `str` above. Let's scroll through the documentation and look at some of the methods and try them out. For example, we see the method `upper`. Its entry shows a signature (recent Python versions print `upper(self, /)`, older ones print `S.upper() -> str`) and a one-line description. Either way, it means that calling the method `upper` with zero arguments on a string will return a new object of type `str`. Let's try it out:

In [None]:
print(text.upper())

Now try out some other methods you see listed in the documentation. Try out the ones that take zero arguments and ignore the ones with double underscores (`__`) for now. Try out three methods in the boxes below.

You may get errors. Don't worry if you do. It's good to get an error, because it means you learned something about how the language works (or doesn't work). Just replace the code in the box and run it again until you get something that works.

You can get the length of a `str` using the function `len`:

In [None]:
len(text)

Strings can be added to each other, in which case they are concatenated. 

In [None]:
combined_text = chinese_text + emoji_text
print(combined_text)

So, if "cat" + "cat" = "catcat", then what is 2 * "cat"?

In [None]:
"cat" + "cat"

In [None]:
2 * "cat"

You can index into `str`s as if they were arrays. For example:

In [None]:
print("text[0]:", text[0])  # The character at index 0 (the first character)
print("text[1]:", text[1])  # The character at index 1 (the second character)
print("text[11]:", text[11])  # The character at index 11
print("text[5:10]:", text[5:10])  # Characters at indices 5 through 9 (index 10 is excluded)
print("text[:5]:", text[:5])  # The first 5 characters
print("text[600:]:", text[600:])  # From index 600 up to the end
print("text[-1]:", text[-1])  # The last character
print("text[-2]:", text[-2])  # The second-to-last character
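
Slices are easier to check by eye on a short string. Here is a minimal sketch using an illustrative string `word` (not part of the course data):

```python
word = "linguistics"  # illustrative string, 11 characters long

first_four = word[:4]    # indices 0, 1, 2, 3
middle = word[2:5]       # indices 2, 3, 4 (index 5 is excluded)
last_three = word[-3:]   # negative indices count from the end

print(first_four, middle, last_three)  # prints: ling ngu ics

# When both ends are within range, a slice has length stop - start.
print(len(word[2:5]))  # prints: 3
```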

You can compare strings to each other using `<`, `>`, `<=` and `>=`. Can you figure out what this means? Try to figure it out using some of the cells below. (If you already know, then come up with a series of examples that demonstrate the answer.)

In [None]:
"cat" < "dogs"

In [None]:
"dog" < "cats"

Some of the `str` methods allow us to answer questions about the contents of the `str`. For example, check out the method `startswith`:

In [None]:
text.startswith("It")

In [None]:
text.startswith("it")

There's also `endswith`. Can you figure out why the following expression is `False`?

In [None]:
text.endswith("only.")

Try out `startswith` and `endswith` on one of your text strings. Find one example that is True and another example that is False.

In [None]:
# True examples

In [None]:
# False examples

The methods `startswith` and `endswith` check whether the beginning or end of a string matches some other string. How would you tell if a string contains another string *anywhere*, not just at the beginning or end? To do this, you can use this syntax:

In [None]:
"times" in text

In [None]:
"Times" in text

In [None]:
if "道" in chinese_text:
  print("It's there")
else:
  print("It's not there")

Another extremely useful method of `str` objects is `count`. `count` returns the number of times a substring appears in a string. For example, we can ask: How often does the word "present" show up in our English text?

In [None]:
text.count("present")
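
Two details of `count` are easy to miss: it counts substrings (so matches inside longer words are included), and it counts non-overlapping matches. A short sketch with made-up strings:

```python
# count() looks for substrings, so "it" inside "with" and "without" counts too.
substring_hits = "with it, or without it".count("it")
print(substring_hits)  # prints: 4

# Matches do not overlap: "aaaa" contains "aa" twice, not three times.
nonoverlap_hits = "aaaa".count("aa")
print(nonoverlap_hits)  # prints: 2
```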

Now here's a question: how often does the *word* "it" show up in our `text`? Try to answer below.

Often it will be useful to `split` a string into multiple parts. The method `split(x)` takes a string and breaks it up whenever it sees `x` inside the string. The result is a `list` of strings. For example:



In [None]:
"this is a string".split(" ")

In [None]:
"this is a string".split("s")

In [None]:
"this is a string".split("s ")

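Two related tools are worth knowing alongside `split`: the `join` method, which glues a list of strings back together, and the optional second argument of `split`, which limits the number of splits. The names `sentence`, `pieces`, `rejoined`, and `first_two` below are illustrative.

```python
sentence = "this is a string"

# join() is the inverse of split() when the same separator is used.
pieces = sentence.split(" ")
rejoined = " ".join(pieces)
print(rejoined == sentence)  # prints: True

# The second argument (maxsplit) limits how many splits are made.
first_two = sentence.split(" ", 1)
print(first_two)  # prints: ['this', 'is a string']
```
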
You might have noticed that our `text` contains multiple lines. Suppose we want to split it into eight strings, one representing each line. To do this, we need to know a bit about how lines are represented in text. In this text, and in a lot of the text data you will be dealing with, there is a special character called "newline", written `\n`, which indicates a line break. So in order to split the text blob into its lines, you could split on `\n`. (Because `text` ends with a newline, the resulting list also has an empty string as its last element.)

In [None]:
text.split("\n")

There's a subtlety here: in text formatted on a Mac or on any Unix-based system, the line breaks are indicated by the special character `\n`. But in text formatted on Windows machines, the line breaks are indicated by a sequence of two special characters, `\r\n`. (On other, more exotic operating systems, newlines might be indicated by other special characters or sequences of special characters.) This can get annoying, so Python provides a special method to split on newlines, which will do the right thing no matter what linebreak character is used:

In [None]:
text.splitlines()
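
To see the difference concretely, here is a small sketch with a made-up string that uses Windows-style line endings (`windows_text` is an illustrative name):

```python
# A made-up two-line string with Windows-style line endings.
windows_text = "first line\r\nsecond line\r\n"

# Splitting on "\n" leaves a stray "\r" on each line and an empty final element.
by_newline = windows_text.split("\n")
print(by_newline)  # prints: ['first line\r', 'second line\r', '']

# splitlines() removes the whole line ending and adds no empty final element.
by_lines = windows_text.splitlines()
print(by_lines)  # prints: ['first line', 'second line']
```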

Now we have data in a `list` of `str`s, and we will talk about `list`s. A `list` is a finite sequence of discrete elements. You can iterate through the elements of a `list` and do things with them. For example, the code below goes through each element in a list, makes it uppercase, then prints it out:

In [None]:
small_string = "This is a string and this is also an object"
parts = small_string.split(" ")
for part in parts:
  uppercase_part = part.upper()
  print(uppercase_part)
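
When the goal is to build a new list rather than print each element, a list comprehension is a common shorthand for the same kind of loop. A minimal sketch, using the illustrative names `demo_string` and `upper_parts`:

```python
demo_string = "This is a string and this is also an object"

# Build a new list holding the uppercase version of each part.
upper_parts = [part.upper() for part in demo_string.split(" ")]
print(upper_parts)
```

The comprehension reads as "the uppercase of each part, for each part in the split string", and the result is an ordinary `list` you can index and iterate over.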

Now let's try counting the occurrences of the word "this" in `small_string` using `split`.

In [None]:
parts = small_string.split(" ")
num_this = 0
for part in parts:
  if part == "this":
    num_this += 1
    
print(num_this)

Can you see a problem with this? Correct the problem below. There are multiple solutions.

In [None]:
parts = small_string.split(" ")
num_this = 0
for part in parts:
  if part == "this":
    num_this += 1
    
print(num_this)

Lists are also useful because you can index into them in the same way you index into strings. 

In [None]:
print(parts[0])
print(parts[1])
print(parts[2])
print(parts[:5])

You can also concatenate lists with `+` just like strings, use `in` on them like strings, and use `count`. You can find this out by calling `help(list)`!

In [None]:
parts + parts

In [None]:
2 * parts

In [None]:
"string" in parts

In [None]:
"is a string" in parts

In [None]:
parts.count("is")

Especially important are the methods that *mutate* a list by adding or removing elements. `append` adds an element:

In [None]:
my_list = ['this', 'is', 'a', 'list']
my_list.append('of')
my_list.append('items')
print(my_list)

The method `pop` returns the final element of a list and deletes it from the list.

In [None]:
last_thing = my_list.pop()
print(last_thing)
print(my_list)

You can delete an arbitrary list element by its index using the keyword `del`:

In [None]:
del my_list[1] 
print(my_list)
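
A few other mutating methods follow the same pattern. The sketch below uses a separate illustrative list, `demo_list`, and shows the state of the list after each call in the comments:

```python
demo_list = ["a", "list"]

demo_list.insert(0, "this")        # now ['this', 'a', 'list']
demo_list.extend(["of", "items"])  # now ['this', 'a', 'list', 'of', 'items']
demo_list.remove("a")              # removes the first element equal to "a"
first = demo_list.pop(0)           # pop also accepts an index

print(first, demo_list)  # prints: this ['list', 'of', 'items']
```

Note that these methods change the list in place; none of them creates a new list.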

Look up the list methods with `help(list)` and try some of them out below.

Now let's try to split up our Charles Dickens text into words. What are some problems here?

In [None]:
words = text.split(" ")
print(words)

Now I want to know: based on this `list` of words, how many times does the word "it" appear in the text? A naive way to do it might be:

In [None]:
words.count("it")

But this is undercounting, for two reasons. Do you see why? Try to fix it below.

The problem we're running into here is called **tokenization**: how do you take a string of text (a sequence of characters) and determine where the boundaries between word tokens are? This is the first real linguistic issue we are going to talk about, because it turns out to be hard to say what counts as a word token in a text: in fact *there is no single solution that will work for all kinds of text*. When you are analyzing text data, the first question you will have to ask is about tokenization.



The first and simplest trick for tokenization is to use Python's `split()` method with no arguments. This splits on any whitespace character, such as `" "`, `"\n"`, `"\t"`, and `"\r"` (so `"\r\n"` is handled too), and it treats a run of consecutive whitespace characters as a single separator. So:

In [None]:
weird_text = "this  is\tweird\r\n text as you can\tsee\n"
print(weird_text)

In [None]:
parts_of_weird_text = weird_text.split()
print(parts_of_weird_text)
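
For contrast, here is a sketch of what `split(" ")` and `split()` each do on the same made-up string (`spaced`, `naive`, and `robust` are illustrative names):

```python
spaced = "two  spaces\tand a tab\n"

# split(" ") breaks only on single spaces: the double space produces an
# empty string, and the tab and newline stay inside the tokens.
naive = spaced.split(" ")
print(naive)  # prints: ['two', '', 'spaces\tand', 'a', 'tab\n']

# split() with no arguments breaks on any run of whitespace.
robust = spaced.split()
print(robust)  # prints: ['two', 'spaces', 'and', 'a', 'tab']
```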

This nicely gets rid of a lot of weirdness and it helps us solve our problem with the Dickens text:

In [None]:
words = text.casefold().split()
print(words)

Now we can count the "it"s pretty easily:

In [None]:
words.count("it")

But the word tokens are still a little off. Do you see why? Take a look at the content of the list `words` and check whether each token corresponds to a single word.

The remaining problem is that many of the word tokens that come out this way have punctuation attached to them. If we are interested in the meaningful words in a text, this is a problem. We don't want to think of `incredulity,` and `incredulity` as two different words. If someone asked "does this text contain the word incredulity?" and you answered "no, it contains the word `incredulity,`", you would not be giving the right answer.

One solution is to separate all the punctuation out into separate tokens. On the other hand, sometimes this may not be desirable: for example, it might make sense to keep the `.` attached to abbreviations like `Mr.`. Ultimately, your choice depends on what you will be doing with the word tokens later on.
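
A cruder alternative, shown here only as a sketch, is to strip punctuation from the edges of each token using `str.strip` and the standard library's `string.punctuation`. Unlike the approach described above, this deletes the punctuation rather than keeping it as separate tokens. The list `raw_tokens` is made up for illustration.

```python
import string

# Made-up tokens of the kind that text.split() produces.
raw_tokens = ["incredulity,", "Darkness,", "only.", "Mr.", "it"]

# strip() removes any of the given characters from both ends of a string.
stripped = [tok.strip(string.punctuation) for tok in raw_tokens]
print(stripped)  # prints: ['incredulity', 'Darkness', 'only', 'Mr', 'it']
```

Two caveats: `Mr.` loses its period, which may or may not be what you want, and `string.punctuation` only covers ASCII punctuation, so characters like `–` and `…` are left untouched.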

Tokenization is the first hard problem in linguistic data science. Now I'd like you to try to write a tokenizer: a function that takes in a string, and outputs a list of the *meaningful* word tokens. Your tokenizer should work for the Charles Dickens text and also for the following examples. It should produce output that *you* find satisfying.

**Exercise**: Write a function that can take in `str` objects like `text` and the ones below, and which outputs a list of *meaningful* tokens. Each token should be one word or punctuation mark. No character from the text should be deleted (except whitespace). There are many ways to do it. Try to find a way that makes sense to you. Feel free to use any string and list methods you like. How many tokens do you end up with per test text?

In [None]:
test_text_1 = """This is a sample text for tokenization, which demonstrates some
of the problems that can come up. For example, compound-words can be tricky. And
words like "Mr." or like "Ms. Molly's." You'll find that it isn't always clear 
what should count as a token and what shouldn't count. The truth is: there is 
not a single correct answer--it depends on what your analysis's goal is. 
You will have to make decisions about how to do it. (And later you may have to go
back and change those decisions.) It's surprisingly tricky. """

test_text_2 = """Sometimes different tokenization schemes will be required 
depending on the style of text you are dealing with (different styles): 
for example, :) :( ^_^ >:( #tokenizationistricky"""

test_text_3 = """In Rivendell Mr Frodo meets Gandalf, who explains why he didn't 
meet them at Bree as planned-- while imprisoned in Saruman's tower, he was able to 
escape with the aide of Gwaihir, a giant eagle. In the meantime, there are many 
meetings between various peoples, and Elrond calls a council to decide what 
should be done with the Ring.The Ring can only be destroyed by throwing it into 
the fires (that is, lava) of Mt. Doom, where it was forged. Mt. Doom is 
located in Mordor, near Sauron's fortress of Barad-dûr, and will be an 
incredibly dangerous journey…""" 

test_text_4 = """Tokens are often categorized by character content or by context
within the data stream. Categories are defined by the rules of the lexe. 
Categories often idnvolve grammar elements of the language used in the data 
stream. rogramming langsuages often categorize tokens as: identifiers, operators,
grouping symbols, or by ata type. Written languages commonly categorize tokens 
as: nouns, verbs, adjectives, or punctuation. Categories are used for 
post-processing of the tokens either by the parser or by other functions in the 
program.

A lexical analyzer generally does nothing with combinations of tokens,    task 
left for a parser. For example, a typical lexical analyzer recognizes 
parenthses as tokens, but does nothing to ensure that each "(" is matched 
with a ")".
"""

test_text_5 = "@JetBlue Nothing better than having a delayed flight #sarcasm I want to get home soon 😢😢😢"

test_text_6 = """I finally found a place that carry ground bison meat and bison 
steak! They are currently having a sale on ground bison meat (90% meat and 10% 
fat) for $7.99!! Sprouts is selling the exact same meat to fat ratio for $15! I 
ended up nabbing 6 of them. The macro of the ground bison is 92 grams of protein
and 44 grams of fat. The bison steak have ridiculously insane macros! I get 80 
grams of protein and 6 grams of fat for $15. I only grabbed one since they are 
pricey and will be only for a special occasion.

Aside from the bison meat, I enjoy going to whole food for their hot 
food bar/salad bar. The price is $8.99 per pound.

The only thing I would complain about, is the rude people that shop at Whole 
Foods. This specific Whole Foods have a lot of traffic between 11am-2pm. Come 
before that or after and you're good!"""

test_text_7 = """I used to like this WF. But they haven't had Frozen Organic 
Broccoli for like three months! What in the world? Also they are constantly out 
of stock on Gluten free items such as Applegate products, and Glutino pretzels. 
It's really a hit or miss on the products which makes this store completely 
unreliable. On one occasion I had asked the butcher if they ever carried pasture
raised pork and he said there is no such thing. ?? Also, the rotisserie chickens
are awful! So dried out! Half of the chicken is not edible! Where is the quality
control here? It's such a great location, I would expect much higher standard of
quality."""




In [None]:
def tokenize(s):
  return s.split()

So, now (hopefully) we have a way to do tokenization that makes reasonable choices no matter the kind of text. 
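
For comparison with your own solution, here is one possible sketch among many, based on a regular expression. The function name `simple_tokenize` and the pattern are illustrative choices, not the intended answer to the exercise: the pattern treats each run of word characters as one token and every other non-whitespace character as its own token.

```python
import re

def simple_tokenize(s):
    """Split s into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches any single character that is neither of those nor whitespace.
    return re.findall(r"\w+|[^\w\s]", s)

demo_sentence = "Mr. Frodo didn't meet them-- at Bree."
demo_tokens = simple_tokenize(demo_sentence)
print(demo_tokens)
# prints: ['Mr', '.', 'Frodo', 'didn', "'", 't', 'meet', 'them', '-', '-', 'at', 'Bree', '.']
```

No non-whitespace character is lost, but the result shows the trade-offs: `didn't` is broken into three tokens, `Mr.` loses its period to a separate token, and emoticons like `:)` would be split into individual characters.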

### Off-the-shelf Tokenizers

Tokenization is usually going to be the first step in your linguistic data analysis. It is unlikely you want to spend your time writing a custom tokenizer for every project. So here are some of the off-the-shelf tokenizers that are available. 

The most commonly used tokenizers come from the `nltk` library. Here is how you import `nltk` and use its `word_tokenize` function, which relies on the Punkt sentence-splitting models and a Treebank-style word tokenizer:

In [None]:
import nltk
nltk.download("punkt")  # Punkt models; recent NLTK releases may ask for "punkt_tab" instead
nltk.tokenize.word_tokenize("This is a test!!!!!")

There is also a "casual" tokenizer:

In [None]:
from nltk.tokenize import TweetTokenizer
t = TweetTokenizer()
t.tokenize("This is like a #tweet :)")

Try to come up with some interesting text below that you think will get tokenized differently by these two different tokenizers.

In [None]:
import random
random.choices("abc", weights=[.1, .1, .8])

### Counting Tokens

Typically the main thing we will do with tokens is count them. For example, asking how many times the word "it" shows up in a text. A way to characterize any text in a very general, very high-level way is to look at all the word *types* and ask how many times they appear in the text (as *tokens*). Python's standard library has an extremely useful type called `Counter` (in the `collections` module) for doing this:

In [None]:
from collections import Counter

In [None]:
tokens = tokenize(text)
c = Counter(tokens)
print(c)

A `Counter` is a dictionary (a subclass of `dict`) that maps each item to its count.

In [None]:
print(c['the'])
print(c['of'])
print(c['epoch'])
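
A `Counter` also makes the type/token distinction easy to compute. The sketch below uses a made-up sentence and the illustrative name `demo_counts`; `most_common` is a `Counter` method that returns the items with the highest counts.

```python
from collections import Counter

demo_counts = Counter("the cat and the hat and the bat".split())

# most_common(n) returns the n most frequent items with their counts.
top_two = demo_counts.most_common(2)
print(top_two)  # prints: [('the', 3), ('and', 2)]

# Number of distinct word types vs. total number of word tokens.
num_types = len(demo_counts)
num_tokens = sum(demo_counts.values())
print(num_types, num_tokens)  # prints: 5 8
```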

`Counter`s have some very useful magic methods. For example, you can add them together:

In [None]:
c2 = Counter(tokenize(emoji_text))
c3 = c + c2

In [None]:
print(c3['epoch'])

In [None]:
print(c3['🧑'])

If you query a `Counter` for the count of a word it has never seen, then it will return `0` by default:

In [None]:
c['antidisestablishmentarianism']
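
This is different from a plain `dict`, which raises a `KeyError` for a missing key. The sketch below uses a made-up dictionary `plain` and also shows that looking up a missing word in a `Counter` does not add it.

```python
from collections import Counter

plain = {"it": 10}       # a made-up plain dictionary
counts = Counter(plain)  # a Counter with the same contents

missing_count = counts["nope"]  # 0, and no error
key_added = "nope" in counts    # False: the lookup did not add the key

# The same lookup on a plain dict raises KeyError.
try:
    plain["nope"]
    raised = False
except KeyError:
    raised = True

print(missing_count, key_added, raised)  # prints: 0 False True
```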

When you tokenize a text and throw it into a `Counter`, what you've created is a **bag of words**. 

A **bag of words** is the simplest way to represent text for data science, and it is always the first thing you should try. 
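
The name "bag" reflects what this representation throws away: word order. A minimal sketch with two made-up sentences (`bag_a` and `bag_b` are illustrative names):

```python
from collections import Counter

bag_a = Counter("dog bites man".split())
bag_b = Counter("man bites dog".split())

# The two sentences mean different things, but their bags of words are equal.
print(bag_a == bag_b)  # prints: True
```

Keeping this limitation in mind helps when deciding whether a bag of words is enough for a given analysis.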
