# Introduction to Python

## Why `python`?
- fast
- modular
- object oriented
- tons of libraries
- (relatively) easy to read
- *gets the job done*

Choice of primary programming language should really depend on your preferred style of thinking and tools of choice (e.g., linear models? decision trees?). But you'll never know if it's right (or wrong) for you unless you give it a try!

### What do you mean by `python`?
`Python` is the name of a programming language. Period. But different people could have different ideas about what '`python`' looks like. Three broad categories *I* have in mind are:
1. Running interactive commands in the `python` interpreter (aka, the glorified calculator)
<img src="img/ss_python_interp.png" width=500>

1. `Python` development in some kind of text editor or specialized environment
<img src="img/ss_python_spyder.png" width=500>

1. Research-type scripting with heavy documentation and snippets of code (usually wrangling data)
<img src="img/ss_python_notebook.png" width=500>

Today we'll be looking mostly at the third of these approaches, but I'll note that switching back and forth between a terminal and a Python interpreter can also be a powerful and quick way to accomplish small tasks.

**Prerequisites:**
* Python 3.7
* JupyterLab (see the [README](https://github.com/5harad/css) on the home page)

If you're already looking at this notebook in your browser, you should be set.

## Computing Word Frequencies

One of the nice things about Python is that you really can start just diving in without understanding too much.

Let's install and then import a useful library, `requests`, for working with files on the internet:

In [None]:
!pip3 install requests

If the command above didn't work, you may need to open up a terminal window and run either `conda install requests` (if you're using the Anaconda distribution) or `pip install requests`.

In [None]:
import requests

Now we can use it to do some useful things.

**Find an interesting book on [Project Gutenberg](https://www.gutenberg.org), and copy the link to the plain text (utf-8) file.** (Actually, anywhere you can find a plain text file of sufficient size is fine.)

<img src="img/ttc-gutenberg-screenshot.png" width="600" />

**Paste the link in place of the URL below.**

In [None]:
FULL_TXT_URL = "https://www.gutenberg.org/files/98/98-0.txt" # REPLACE THIS WITH YOUR OWN

For this example, I'm using Charles Dickens's *A Tale of Two Cities*, because a past TA (Jongbin Jung) likes it so much.

In [None]:
response = requests.get(FULL_TXT_URL)

Okay, realistically, you probably didn't need a Python library for this. You could have downloaded the file using your web browser or a command-line utility like `curl` or `wget`. But later in the course, you'll see why it's useful to have this power from within Python (for example, when interacting with web-based JSON APIs).

How long is the text?

In [None]:
len(response.text)

That's pretty long. If we printed it all out, it would fill the screen many times over.

Let's instead turn to a random "page". To do this, we'll need the `random` standard library module:

In [None]:
import random
i = random.randint(0, len(response.text) - 1000)  # randint includes both endpoints
print(response.text[i:i+1000])

Feel free to run the above cell multiple times to see different portions of the text.

### Sneak peek

We're going to dig into this, but check out what we're going to build:

In [None]:
from lib import word_freqs
print(word_freqs.compute_word_freqs(response.text)[:100])

Wow. What was that?

Let's spend some time building up to the solution.

### Strings

We have this object, `response.text`, which is a string (type `str`) containing the full text of the book. What can we do with it?

First, let's see some examples with a different string.

In [None]:
MY_NAME = "" # put your name here
s = "Hello, my name is " + MY_NAME + ". I have a feeling that Python is THE ANSWER to all my woes."

You can change everything to lower case:

In [None]:
s.lower()

You can replace pieces of the string with others (note that *all* occurrences are replaced):

In [None]:
s.replace("is", "is NOT")

You can split it into a list:

In [None]:
s.split(" ")
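
As a small aside on `split`, here is a sketch using a made-up string (`sample` is not from the book). It compares splitting on a single space with calling `split()` with no argument, which splits on any run of whitespace:

```python
# `sample` is a made-up string containing a double space and a newline.
sample = "It was  the best\nof times"

by_space = sample.split(" ")    # splits at every single space character
by_whitespace = sample.split()  # splits on any run of whitespace

print(by_space)       # note the empty string and the embedded newline
print(by_whitespace)
```

The first result keeps an empty string where the two spaces met and leaves the newline stuck inside one "word", which is worth keeping in mind for real text.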

To get an idea of what else you can do with a string, use some combination of `dir` and `help`, e.g.:

In [None]:
dir(s)

In [None]:
help(s.find)

Okay, at this point we probably know enough to split our full text into a list of words.

**Your turn: Split `response.text` into a list of words, saving it into a variable called `words`. How many words does your text contain?**

Any problems with this method? *(Discussion)*

**Improve your solution above.**

### Lists

Okay, now we've got this object, `words`, which is a `list`:

In [None]:
type(words)

What might we do with a list? Let's start with the obvious ones, like finding the length:

In [None]:
len(words)

What about checking if a specific word is in the list?

In [None]:
'horse' in words

In [None]:
'iphone' in words

If we're curious exactly where the element shows up in the list, we would do:

In [None]:
words.index('horse')

To confirm things are working as expected, we could check the word at that index:

In [None]:
words[words.index('horse')]

Another super handy thing to do with lists, called *slicing*, is accessing a desired range of elements:

In [None]:
words[3000:3020]

Indexing and slicing work from the back of the list, too, using negative numbers:

In [None]:
words[-10:]

Everything above just treats `words` as *immutable* (unchanging). What if we want to add, remove, or modify elements of a list?

In [None]:
pets = ["calico cat", "black labrador", "parakeet", "goldfish"]

Let's suppose our second-grade child begged and pleaded until we bought a gerbil:

In [None]:
pets.append("gerbil")
pets

Unfortunately, said child was also in charge of feeding the gerbil, with predictable consequences:

In [None]:
pets.pop()

We are back to our original list of pets:

In [None]:
pets

We trade in our parakeet for something flashier:

In [None]:
pets[2] = 'scarlet macaw'
pets

The last obvious thing we might want to do is *iterate* over the list:

In [None]:
for pet in pets:
    print(pet)

**Your turn: Compute the fraction of words of even length by iterating over `words` and maintaining a counter variable.**

*Hint 1:* `i % 2`, meaning "`i` modulo 2", is 1 if `i` is odd and 0 if `i` is even.

*Hint 2:* `x += 1` is a convenient shorthand for `x = x + 1`. `x++` does not exist in Python.

There's actually an even shorter way to write this with *list comprehensions*:
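
As a hedged illustration on a small made-up list (`sample_words` is an assumption standing in for the book's `words`), a list comprehension builds a new list in a single expression and can filter as it goes:

```python
# `sample_words` is illustrative data, not the word list from the book.
sample_words = ["it", "was", "the", "best", "of", "times"]

# The length of every word, built in one expression.
lengths = [len(w) for w in sample_words]

# A trailing `if` keeps only the elements that pass the test.
even_words = [w for w in sample_words if len(w) % 2 == 0]

fraction_even = len(even_words) / len(sample_words)
print(lengths, even_words, fraction_even)
```

The same pattern should carry over to any list of strings.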

### Tangent: Sets

In Python, a set of unique elements is known as a `set` (surprise, surprise). You also have its immutable counterpart, `frozenset`. For example:

In [None]:
set([1, 1, 2, 3, 47, 'red fish', 'blue fish', 'blue fish'])

Sets can also be constructed with curly-brace notation:

In [None]:
{1, 1, 2, 3, 47, 'red fish', 'blue fish', 'blue fish'}

**Your turn: How many *unique* words are in your list of words?**

### Dictionaries

We know how to iterate over lists, how to count, and how to split text into words. But how do we maintain oodles of different counters? Enter the Python *dictionary*, or `dict` type.

A `dict` associates *keys* with *values*. (We also say it "maps" keys to values.) It's a natural way to think about all sorts of things you might want to do, e.g. maintaining a store inventory:

In [None]:
inventory = {
    'apples': 50,
    'oranges': 60,
    'pet food': 0,
    'toilet paper': 1,
    'burts bees': 'none, ever since it got bought out by clorox',
}

Now we can look things up:

In [None]:
inventory['apples']

We can ask if we have an item:

In [None]:
'oranges' in inventory

*Caution:* See what happens when we try to look up an item we don't have:

In [None]:
inventory['stereo systems']
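
The lookup above raises a `KeyError`. Here is a minimal sketch of a few ways to guard against that, using a small stand-in dictionary (`stock` is a made-up name so the example runs on its own):

```python
# `stock` is a hypothetical stand-in for the inventory dictionary.
stock = {"apples": 50, "oranges": 60}

# Option 1: check membership before indexing.
if "stereo systems" in stock:
    n_stereos = stock["stereo systems"]
else:
    n_stereos = 0

# Option 2: dict.get returns a default instead of raising
# (the default is None unless you pass one).
n_pears = stock.get("pears", 0)

# Option 3: catch the exception explicitly.
try:
    stock["pears"]
    missing = False
except KeyError:
    missing = True

print(n_stereos, n_pears, missing)
```

Which option reads best depends on the situation; none of them modifies the dictionary.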

If someone buys a few oranges, we can update the inventory:

In [None]:
inventory['oranges'] -= 3
inventory

If we start stocking a new item, it's easy enough to add it:

In [None]:
inventory['gt kombucha'] = 36
inventory

Let's get rid of that silly Burt's Bees entry:

In [None]:
inventory.pop('burts bees')
inventory

Now let's say we wanted to compute the total value of our store inventory. I tell you that the unit prices are as follows:

* apples: \\$0.40 ea
* oranges: \\$0.35 ea
* pet food: \\$80.00 ea
* toilet paper: \\$11.00 ea
* gt kombucha: \\$5.00 ea

**Create a dictionary called `prices` with the unit prices of each item. The item should be used as the key, and the value should contain the price.**

Here are a few ways of looping over the items in the dictionary:

In [None]:
for item in prices:
    print(item, '-', prices[item])

In [None]:
for item, unit_price in prices.items():
    print(item, '-', unit_price)

**Now: compute and output the total valuation of the store's inventory.**

*Hint:* If you're trying to be concise, Python *does* have a `sum` function.

Okay, back to our list of `words`. Given our knowledge, we should be able to do this!

**Build a dictionary that associates each word with its number of occurrences in the text.**

*Caution:* Think about what happens if you haven't seen a word before.

*Super hacker hint:* If you want an extra concise solution, look into the `dict.get` method.
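
To illustrate the `dict.get` hint without giving away the exercise, here is a sketch that counts letters in a made-up word rather than words in the book (`letters` and `counts` are illustrative names):

```python
letters = "mississippi"  # made-up input, not from the text

counts = {}
for ch in letters:
    # get(ch, 0) yields 0 the first time a letter is seen, so no KeyError.
    counts[ch] = counts.get(ch, 0) + 1

print(counts)
```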

### Builtins/functions

One natural question you might ask is what the most frequent words are. For this, we probably want to sort and then examine the most common items. But how?

Luckily, Python has a couple of ways of going about this. Let's start with some random data:

In [None]:
rand_arr = [random.random() for _ in range(10)]
print(rand_arr)

We can return a sorted copy using Python's builtin function `sorted`, which doesn't modify the list in place:

In [None]:
print(sorted(rand_arr))
print(rand_arr)

Or, we can sort the list in place with the `sort` method available on a list:

In [None]:
rand_arr.sort()
print(rand_arr)

What if we want to sort in descending order? Well, we could be sneaky:

In [None]:
rand_arr.reverse()
print(rand_arr)

But that's no fun. Instead, let's use the `key` argument of the `sorted` builtin with a lambda function:

In [None]:
print(sorted(rand_arr, key=lambda x: -x))
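
Negating the key only works for numbers. As an alternative sketch, both `sorted` and `list.sort` accept a `reverse=True` keyword argument (the list `nums` below is made up for illustration):

```python
nums = [0.3, 0.9, 0.1, 0.5]  # illustrative data

descending = sorted(nums, reverse=True)       # new list, largest first
same_result = sorted(nums, key=lambda x: -x)  # the negated-key approach

print(descending)
print(descending == same_result)
```

Either way, `nums` itself is left untouched because `sorted` returns a copy.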

Note that `sorted` accepts any iterable (`list`, `str`, `set`, and so on) and always returns a new `list`. Handing it a `dict` sorts only the keys and ignores the values. Why might this be? *(Discussion.)*

### Putting it together

Back to our dictionary of word frequencies.

**Now, obtain a sorted copy of the items in the dictionary, in descending order of word count.**

*Extended hint:* Python contains a builtin function called `sorted`. It accepts any iterable (`list`, `str`, `set`, ...), but given a `dict` it sorts only the keys. (Can you think why this is?) It also takes a `key` keyword argument: a function that, given an element, returns the piece of data to be used in comparison for sorting. You may want to use `dict.items`.
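
As a sketch of how `key` behaves on pairs, here is some made-up data (`people` is an illustrative list of `(name, age)` tuples, unrelated to the word counts):

```python
# Illustrative (name, age) pairs.
people = [("Ada", 36), ("Grace", 85), ("Linus", 54)]

# The key function picks out the second element of each pair;
# reverse=True asks for descending order.
oldest_first = sorted(people, key=lambda person: person[1], reverse=True)

print(oldest_first)
```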

### Saving your progress

We have this beautiful, sorted list of all our words and their respective counts, but right now it's sitting in memory and could disappear if anything goes wrong (computer crash, segfault, EMP shockwave, etc.). How do we save this in a handy dandy file?

Here's a quick recipe:

In [None]:
with open('hello.txt', 'w') as f:
    f.write('Hello, my name is ' + MY_NAME + '.')

We don't have to do it all in one fell swoop, either. We can write line by line (*but be careful with newlines*):

In [None]:
with open('inventory.txt', 'w') as f:
    for item in inventory:
        line = 'We have %d %s at $%.2f each.\n' % (inventory[item], item, prices[item])
        f.write(line)

It's often easier just to do something like `'\n'.join(lines)`, where `lines` is a list of strings, and then just write the entire string:

In [None]:
lines = ['We have %d %s at $%.2f each.' % (inventory[item], item, prices[item])
         for item in inventory]
with open('inventory.txt', 'w') as f:
    f.write('\n'.join(lines))

If you're running a Python script in the terminal, on the other hand, you'd probably just use `print` and redirect the output where you want it.

To read the file back in, you need `'r'` mode rather than `'w'`. Actually, `'r'` is the default, so you can just omit it:
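
A minimal sketch of reading a file back, using only builtins. It writes a small throwaway file first (`example.txt` is a made-up name) so the block runs on its own:

```python
# Create a small file to read; 'example.txt' is a hypothetical filename.
with open("example.txt", "w") as f:
    f.write("first line\nsecond line\n")

# Mode 'r' is the default, so open(path) is enough for reading.
with open("example.txt") as f:
    contents = f.read()  # the whole file as one string

# Iterating over the file object yields one line at a time.
with open("example.txt") as f:
    lines = [line.rstrip("\n") for line in f]

print(contents)
print(lines)
```

Reading line by line is usually the better fit for large files, since it avoids holding everything in memory at once.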

**Your turn: Write the sorted frequencies to `word_frequencies.txt`, in the following format:**
```
word1,count1
word2,count2
word3,count3
```
et cetera. For example, my (very uninteresting) first few lines are
```
the,8228
and,5066
of,4139
to,3650
```

🎉 That's it! 🎉

Here, have a 🏅 for your efforts.