# Demo 02: Python and Working with Data

In this notebook we'll see some more Python and some more advanced functions for working with data! As we discussed in class, data is the cornerstone of most machine learning / artificial intelligence models, so we need to understand how to access and use that data.


# Review of the Basics

In the last class we saw a few examples of how to store data in variables in Python. Let's quickly review.

First, if we only want to store a single value, we can create a variable to hold any type of data: a number, a word, or even a whole Python program!

In [None]:
# Print something to the screen!

print("This is a more advanced Notebook!")

Recall that making a variable creates a space in the computer's memory, like a box, where we can store things. When we put something new in that box, we lose what was there... unless we build the new value from the old one, for example `new_variable = new_variable + 5`.

In [None]:
new_variable = 22

print(new_variable)


In [None]:
new_variable = new_variable + 5

print(new_variable)

We also quickly saw something called a *list*, which is a type of variable that can hold other variables, in a fixed order. So if we want to keep track of a list of words where order matters, we can use a list.

In [None]:
my_words = ["this", "is", "a", "list", "of", "words"]

print(my_words)

We can add words to this list by updating it, the same as we did with the variables above.

In [None]:
my_words = my_words + ["new", "words"]

print(my_words)
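
As a small aside, lists can also be grown in place instead of being rebuilt with `+`. The sketch below uses the built-in list methods `append` and `extend`; the name `more_words` is made up for this example and is not used elsewhere in the notebook.

```python
# Illustrative only: `more_words` is a made-up name for this sketch.
more_words = ["this", "is", "a", "list"]

more_words.append("of")               # add one item to the end
more_words.extend(["new", "words"])   # add several items at once

print(more_words)
print(len(more_words))  # how many items are in the list now
```

Both approaches end up with the same list; `+` makes a new list, while `append` and `extend` change the existing one.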

Now we need one more thing that will help us out: a type of container called a dictionary. It lets us store lots of values with names, and get them back whenever we want.

In [None]:
# Start a dict

new_dictionary = {"students": 10,
                  "others": 20}

In [None]:
print(new_dictionary)

In [None]:
print(new_dictionary["students"])

These are very cool as they let us look things up by name instead of by position (consider that lists are looked up by position) and we can keep all kinds of things in there, not just numbers!

In [None]:
new_dictionary["a new one"] = "this is a string"
print(new_dictionary)
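
A few more things we can do with a dictionary, shown as a minimal sketch. The dictionary `class_counts` and the key `"teachers"` are made up for illustration; `in`, `.get()`, and `.items()` are standard Python.

```python
# Illustrative only: `class_counts` is a made-up dictionary for this sketch.
class_counts = {"students": 10, "others": 20}

# Check whether a name (key) is in the dictionary before using it.
print("students" in class_counts)
print("teachers" in class_counts)

# .get() returns a fallback value instead of raising an error for a missing key.
teacher_count = class_counts.get("teachers", 0)
print(teacher_count)

# Loop over every name/value pair stored in the dictionary.
for name, value in class_counts.items():
    print(name, "->", value)
```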

So, let's pause for a second and think about how we'd build a DFA if we wanted to count all the words in a list..
.
.
.
.
.
.
.
.




...


...


...


Let's start by iterating over the list... we're going to use a special word called `enumerate` that gives us both the position and the value of each item in a list. We'll come back to these details again later.

In [None]:
for number, word in enumerate(my_words):
  print("The " + str(number) + " word in the list is: " + word)
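
Positions in Python start at 0, so the first word is reported as word 0. If we would rather count from 1, `enumerate` accepts a `start` argument. A minimal sketch, with `sample_words` and `numbered_lines` as made-up names for this example:

```python
# Illustrative only: `sample_words` is a made-up list for this sketch.
sample_words = ["this", "is", "a", "list"]

numbered_lines = []
for number, word in enumerate(sample_words, start=1):
    # str() turns the number into text so we can glue it to other text
    numbered_lines.append("Word number " + str(number) + " is: " + word)

for line in numbered_lines:
    print(line)
```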

So with the above we can start to think about how we might go over a list of words and keep track of how many times those words appear. Let's think about the DFA for this......


...

...



...



...



...


...

Yep, we need a new dictionary and we can increment all the values as we see them in the "box" or list of words. We need a little `if` statement for this; we'll come back to these, but trust me, for now this works!





In [None]:
all_words = ["this", "is", "a", "list", "of", "words", "words", "of", "list", "kiss"]
counting_dict = {} # empty dictionary

for word in all_words:
    if word in counting_dict:
        counting_dict[word] += 1
    else:
        counting_dict[word] = 1

In [None]:
print(counting_dict)
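
Counting things is so common that Python's standard library has a ready-made tool for it, `collections.Counter`. This is only a sketch of an alternative to the loop above, not a replacement for understanding it; `sample_words` and `word_counts` are names made up for this example.

```python
from collections import Counter

# Illustrative only: the same kind of word list as in the cell above.
sample_words = ["this", "is", "a", "list", "of", "words", "words", "of", "list", "kiss"]

# Counter does the "if we've seen it, add one; otherwise start at one" step for us.
word_counts = Counter(sample_words)

print(word_counts)
print(word_counts.most_common(3))  # the three most frequent words with their counts
print(word_counts["missing"])      # words we never saw simply count as 0
```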

# Working with Data from the Internet!

Now consider that we want to read a lot of books... really fast. Well, we don't want to type them in one at a time; that would take too long.

Luckily we learned about ASCII and the way that text is encoded like any other data. So we might want to read a classic, like Frankenstein, which is available on a website called [Project Gutenberg](https://www.gutenberg.org/ebooks/42324).

Note that Frankenstein was written by [Mary Shelley](https://en.wikipedia.org/wiki/Mary_Shelley) who maybe had a thing with Lord Byron, Ada Lovelace's Dad...

There are a lot of different versions of that file. Some are for a Kindle, some are for other devices. At the end of the day they all use binary to represent the information, just in different ways! That's why we have file extensions in the first place: to tell us how the data in the file is organized.

We're interested in the most basic type of data, plain text with no fancy formatting, which is encoded as UTF-8, as we learned a few weeks ago. As we can see, there is a [UTF-8 version of Frankenstein](https://www.gutenberg.org/cache/epub/42324/pg42324.txt).
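
To make the link between text and binary concrete, here is a small sketch of turning a string into UTF-8 bytes and back using Python's built-in `encode` and `decode`. The variable names are made up for this example.

```python
# Illustrative only: these variable names are made up for this sketch.
title = "Frankenstein"

encoded = title.encode("utf-8")   # text -> bytes
print(encoded)
print(list(encoded)[:5])          # the first five bytes as numbers

decoded = encoded.decode("utf-8") # bytes -> text
print(decoded == title)

# Plain English letters take one byte each; accented letters take more.
accented_length = len("é".encode("utf-8"))
print(accented_length)
```

For plain English letters, the UTF-8 byte values are the same as the ASCII codes.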

So the first thing we need to do is download that data and store it all in a variable. For now we'll do as we did above and store it in one single variable.


In [None]:
# Load up URL Lib and read all the data into one variable.

from urllib.request import urlopen
frankenstein_url = "https://www.gutenberg.org/cache/epub/42324/pg42324.txt"
frankenstein_text = urlopen(frankenstein_url).read().decode()
print(frankenstein_text)

## Dealing with All That Text...

So look above. Note that there is a whole lot of text at the beginning and end that we don't really want. So we're going to cut off the top and bottom and just get what we need.

In [None]:
# First, just how long is all that text?
print(len(frankenstein_text))

In [None]:
# Cut off the head and the tail of the text...

# Note that we left a few things in there but that's okay..

print(frankenstein_text[950:-18800])
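
The numbers 950 and -18800 were found by eye and only fit this one file. A more reusable sketch is to search for marker lines instead. This assumes the file contains lines beginning with `*** START OF` and `*** END OF`, which is how Project Gutenberg text files are usually laid out but is not guaranteed for every book. The function name `strip_gutenberg_boilerplate` and the `sample` text are made up for this example.

```python
def strip_gutenberg_boilerplate(full_text):
    """Return the text between the START and END marker lines, if both are found."""
    # Assumption: these marker strings appear in the file; check your own book!
    start_marker = "*** START OF"
    end_marker = "*** END OF"

    start = full_text.find(start_marker)
    end = full_text.find(end_marker)
    if start == -1 or end == -1 or end <= start:
        return full_text  # markers not found, so leave the text alone

    # Skip past the rest of the START marker line.
    body_start = full_text.find("\n", start)
    if body_start == -1 or body_start > end:
        return full_text
    return full_text[body_start:end].strip()


# A tiny made-up "book" so we can see the function work without downloading anything.
sample = ("Header we do not want\n"
          "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
          "It was a dark night.\n"
          "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
          "License text we do not want\n")

body = strip_gutenberg_boilerplate(sample)
print(body)
```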

In [None]:
# That's not great, we need to break up all the words...

text = "here is a long string of text"
print(text)

In [None]:
# We can use the .split() method to do this, which breaks up the text wherever there is whitespace (spaces, tabs, new lines)

print(text.split())

In [None]:
# Let's try it with our Frankenstein text...

frankenstein_words = frankenstein_text.split()
print(frankenstein_words)
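
One thing to notice about `.split()` is that punctuation and capital letters stay attached, so "The", "the" and "the," all look like different words. A minimal sketch of one way to tidy that up is below. The helper `clean_word` and the sample sentence are made up for this example, and `string.punctuation` only covers basic ASCII punctuation, so curly quotes in a real book would need extra handling.

```python
import string

def clean_word(raw_word):
    """Strip punctuation from both ends of a word and make it lowercase."""
    return raw_word.strip(string.punctuation).lower()

# Illustrative only: a made-up sentence, not taken from the book.
sample_text = "The monster! the Monster, THE monster."

cleaned_words = [clean_word(word) for word in sample_text.split()]
# Drop anything that was only punctuation and is now empty.
cleaned_words = [word for word in cleaned_words if word]

print(cleaned_words)
```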

Okay, so now we have a long, long list of words that we got from the internet. What can we do? Well, we can go grab some of the code from above that did that simple counting thing and we can reuse it! This is one of the coolest things about code: we can reuse, and reuse, and reuse, and build very powerful things out of very simple code...




In [None]:
# So let's count all the words in Frankenstein!

counting_dict = {} # empty dictionary

for word in frankenstein_words:
    if word in counting_dict:
        counting_dict[word] += 1
    else:
        counting_dict[word] = 1

In [None]:
print(counting_dict)

## Using Pandas for Data Analysis

So this is all well and good, but to dig into these counts we need some more tools. To do this we're going to use a thing called [Pandas](https://pandas.pydata.org/) which lets us work with this data just like we would in a spreadsheet! Pandas is a lot like R, just in Python.

Pandas is great, in fact, it's the basis for a whole class I teach on [Introduction to Data Science](https://nmattei.github.io/cmps3160/) if you're interested!

In [None]:
# Kinda useful... what's the most common word?
import pandas as pd

df_words = pd.DataFrame(counting_dict.items(), columns=['Word', 'Count']).sort_values('Count', ascending=False)
df_words.head(15)


In [None]:
# It's a lot of words
df_words.describe()

In [None]:
# Let's just plot the 50 most frequent words.

df_words.sort_values('Count', ascending=False)[:50].plot.bar(x='Word', y='Count', figsize=(20,10))

# Building a Really Dumb Language Model

So let's think about what a Language Model really is... we just sample words from some set of words. So if we want to make something that sounds like, say, "One Fish, Two Fish", we could just put all the words together and sample from it.

In [None]:
# Some bits of "One Fish, Two Fish"

fish_words = '''One fish, Two fish, Red fish, Blue fish,
              Black fish, Blue fish, Old fish, New fish.
              This one has a little car.
              This one has a little star.
              Say! What a lot of fish there are.
              Yes. Some are red, and some are blue.
              Some are old and some are new.
              Some are sad, and some are glad,
              And some are very, very bad.
              Why are they sad and glad and bad?
              I do not know, go ask your dad.
              Some are thin, and some are fat.
              The fat one has a yellow hat.
              From there to here,
              From here to there,
              Funny things are everywhere.
              Here are some who like to run.
              They run for fun in the hot, hot sun.
              Oh me! Oh my! Oh me! oh my!
              What a lot of funny things go by.
              Some have two feet and some have four.
              Some have six feet and some have more.
              Where do they come from? I can't say.
              But I bet they have come a long, long way.
              we see them come, we see them go.
              Some are fast. Some are slow.
              Some are high. Some are low.
              Not one of them is like another.
              Don't ask us why, go ask your mother.'''

fish_words = fish_words.split()

# We can use random to pick some words

import random
new_text = ""
for num in range(10):
  new_text = new_text + " " + (random.choice(fish_words))

print(new_text)
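
Every time we run the cell above we get different words. As a sketch of two handy variations from Python's `random` module: `random.seed` makes the picks repeatable, and `random.choices` picks several words at once. The word list and the seed value 42 are made up for this example.

```python
import random

# Illustrative only: a short made-up word list.
sample_words = "one fish two fish red fish blue fish".split()

random.seed(42)  # fixing the seed makes the "random" picks repeatable
picked = random.choices(sample_words, k=10)  # pick 10 words, repeats allowed

new_sentence = " ".join(picked)  # glue the words together with spaces
print(new_sentence)

# Using the same seed again gives exactly the same picks.
random.seed(42)
picked_again = random.choices(sample_words, k=10)
print(picked == picked_again)
```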

So, let's think. If we wanted to make a book that **sounded like** Frankenstein, we could do what?
...


....


....



...
That's right, we could sample from the words that Mary Shelley used!


In [None]:
# Pick some Mary Shelley Words

new_text = ""
for num in range(25):
  new_text = new_text + " " + (random.choice(frankenstein_words))

print(new_text)

In [None]:
# Hard Mode... Sample from the word use probabilities in Frankenstein.

# Note that this code uses some more complicated things: NumPy
# gives us all kinds of numerical methods, and I added a place to
# put in \n, which means end-of-line, to make it look fancy.
# If you want to learn all about this take CMPS 1100!

import numpy as np

choice_probabilities = [word_count / df_words["Count"].sum() for word_count in df_words["Count"]]
word_choices = list(df_words["Word"])

fake_frankenstein = ""
for num in range(1,100):
  fake_frankenstein = fake_frankenstein + " " + (np.random.choice(word_choices, p=choice_probabilities))
  if (num % 20 == 0):
    fake_frankenstein += "\n"

print(fake_frankenstein)
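
The same idea, picking words in proportion to how often they appear, can also be sketched without NumPy by giving `random.choices` a list of weights. The tiny dictionary `sample_counts` is made up for illustration and is not real data from the book.

```python
import random

# Illustrative only: made-up counts, not taken from Frankenstein.
sample_counts = {"the": 6, "monster": 3, "lightning": 1}

words = list(sample_counts.keys())
weights = list(sample_counts.values())  # bigger count = more likely to be picked

random.seed(0)  # repeatable picks for this demo
picked = random.choices(words, weights=weights, k=1000)

# "the" has 6 of the 10 total counts, so it should be picked about 60% of the time.
the_share = picked.count("the") / len(picked)
print(the_share)
```

Note that the weights are the raw counts here; `random.choices` does the dividing-by-the-total step for us.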

# Now You Try!

So that's it! We've downloaded a whole book, read it all, and plotted the frequency of all the words in that book.

For your challenge: do this for any 3 books that you choose on Project Gutenberg. You have all the code and tools that you'll need above. You only need to generate a dictionary that contains the word count for all the books you choose **combined**.

If you submit a notebook with a graph for each book independently as well as all the books combined, then we'll award **3 bonus points**.

Congratulations, we're well on our way to reading the entire internet and creating an LLM!

In [None]:
## Put all the code you need below here, you can add as many cells and steps as you want.

## The only rules are that you must save the graphs!


## Turning in Your Work

When you are done, make sure you **save a copy of this notebook to your drive**. You will then follow the directions below to turn it in on **Tulane Canvas**.

1. Open this notebook in Colab.
2. Immediately click `File->Save` (Command/Ctrl S), then "Save a Copy in Drive"
3. This will create a new copy in the Colab Notebooks folder of your personal Google Drive for whichever Google account you are signed into at the time.
4. Save the file regularly as you complete the assignment.
5. **To Turn In:** When you're ready to submit your work:
6. Go to `File->Download .ipynb`
7. Upload the `.ipynb` file to the appropriate assignment in [Canvas](https://tulane.instructure.com/)