# Python Overview

This notebook covers an introduction to Python. It includes:

- data types
- functions
- control flow
- files

## Data Types

### Variables & Names

When programming, we use variables to keep track of data. We differentiate our variables by naming them. In the following example, we add two numbers together.

The following line is an expression where we compute a value, specifically `2+2`.

In [1]:
2+2

4

We can use a variable to store the value returned by the expression. We assign the value to the variable by using the `=` sign.

In [2]:
a = 2+2

Here, the name of the variable is `a`. It is good practice to name the variable something descriptive. 

**Question:** What would be a more appropriate name for the variable that stores the result of the expression `2+2`?

In [6]:
# ... = 2+2
four = 2+2
four

4

### Numbers

In [7]:
int_num = 4
float_num = 3.11

type(int_num), type(float_num)

(int, float)

In [173]:
int_num, float_num

(4, 3.11)

**Question:** What happens if we add, subtract, multiply, and divide floats and ints?

In [174]:
#addition
int_num + float_num 

7.109999999999999

In [175]:
int_num + float_num  == 7.11

False

In [178]:
import numpy as np
np.isclose(int_num + float_num, 7.11) # A way to compare floats

True
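The comparison above fails because `3.11` cannot be stored exactly in binary floating point, so the sum is off by a tiny rounding error. As a minimal sketch, the same tolerance-based check can be done with the standard library's `math.isclose`, without importing NumPy (the variable name `total` is introduced here only for illustration):

```python
import math

int_num = 4
float_num = 3.11
total = int_num + float_num

# Exact equality fails because of binary rounding error
print(total == 7.11)              # False
# math.isclose compares within a relative tolerance (default 1e-09)
print(math.isclose(total, 7.11))  # True
```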

In [10]:
type(int_num + float_num)

float

In [11]:
# subtraction
int_num - float_num 

0.8900000000000001

In [12]:
#multiplication
int_num * float_num 

12.44

In [13]:
#division
int_num / float_num 

1.2861736334405145

#### Long floats

In [14]:
float1 = 34.234592019394939293499293992434828482842
float2 = 34.2345920193949392934992939924

**Question:** Are these two values above equivalent?

**Question:** Are the values assigned to the variables `float1` and `float2` equivalent?

(zoom poll)

In [15]:
float1, float2

(34.23459201939494, 34.23459201939494)

In [16]:
float1 == float2

True
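The two literals compare equal because a Python `float` keeps only about 15 to 17 significant decimal digits (assuming the usual 64-bit representation), so the extra digits are rounded away. The sketch below contrasts this with the standard library's `decimal.Decimal`, which keeps every digit when built from a string; `dec1` and `dec2` are illustrative names, not part of the original notebook.

```python
from decimal import Decimal

float1 = 34.234592019394939293499293992434828482842
float2 = 34.2345920193949392934992939924
print(float1 == float2)  # True: both round to the same float

# Building Decimals from strings preserves all the digits
dec1 = Decimal("34.234592019394939293499293992434828482842")
dec2 = Decimal("34.2345920193949392934992939924")
print(dec1 == dec2)      # False: the exact values differ
```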

In [17]:
x = 3
y = "4"
z= '5.6'

In [18]:
x + y

TypeError: unsupported operand type(s) for +: 'int' and 'str'

In [19]:
int(y)

4

In [20]:
int(y + z)

ValueError: invalid literal for int() with base 10: '45.6'

In [22]:
y + z, y, z

('45.6', '4', '5.6')

In [28]:
float(y + z)

45.6

In [25]:
prefix = "My name is "
name = "Adam"
prefix + name

'My name is Adam'

In [29]:
str(x) + int(y)

TypeError: can only concatenate str (not "int") to str

In [30]:
y + float(z)

TypeError: can only concatenate str (not "float") to str

In [34]:
int(5.6), int(10.22)

(5, 10)

In [35]:
float(1)

1.0
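The errors above show that `int()` rejects strings such as `'5.6'`, while `float()` accepts them. One possible way to handle both cases is a small helper that tries `int()` first and falls back to `float()`. The function name `to_number` is hypothetical and only meant as a sketch.

```python
def to_number(text):
    """Convert a numeric string to int when possible, otherwise to float."""
    try:
        return int(text)
    except ValueError:
        # int() cannot parse strings like '5.6', so fall back to float()
        return float(text)

print(to_number("4"))    # 4
print(to_number("5.6"))  # 5.6
```

A string that is not numeric at all, such as `"Adam"`, still raises a `ValueError` from `float()`.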

#### Increment an integer

In [169]:
number = 5
number

5

In [164]:
number + 1

6

In [165]:
number

5

In [166]:
number += 1 # this is the same as number = number + 1
number

6

In [167]:
number -= 1 # this is the same as number = number - 1
number

5

In [170]:
number *= 2 # this is the same as number = number * 2
number

10

In [171]:
number /= 2 # this is the same as number = number / 2
number # notice that it now becomes a float

5.0
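The same shorthand exists for other operators. As a brief sketch (using a separate variable, `count`, so the value of `number` above is not changed), floor division with `//=` keeps the result an integer, unlike `/=`:

```python
count = 10
count //= 2    # same as count = count // 2 (floor division)
print(count, type(count))  # 5 <class 'int'>

count **= 2    # same as count = count ** 2 (exponentiation)
print(count)   # 25

count %= 7     # same as count = count % 7 (remainder)
print(count)   # 4
```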

### Strings

In [36]:
string = "hello word!"
string

'hello word!'

#### String methods

In [39]:
" ".join([func for func in dir(str) if "__" not in func])

'capitalize casefold center count encode endswith expandtabs find format format_map index isalnum isalpha isascii isdecimal isdigit isidentifier islower isnumeric isprintable isspace istitle isupper join ljust lower lstrip maketrans partition replace rfind rindex rjust rpartition rsplit rstrip split splitlines startswith strip swapcase title translate upper zfill'

In [40]:
string.capitalize()

'Hello word!'

In [41]:
string.islower()

True

In [42]:
string.split()

['hello', 'word!']

**When do you think split() would be helpful in text analysis?**
<details>
<summary>Solution</summary>
    <b>Getting words from a string</b>
    <br>
    <i>There are some issues with this that we'll discuss later this week</i>

</details>

In [43]:
string_example = "What is an API? The Federal Circuit described an API as a tool that \
“allow[s] programmers to use ... prewritten code to build certain functions into their \
own programs, rather than write their own code to perform those functions from scratch."
string_example

'What is an API? The Federal Circuit described an API as a tool that “allow[s] programmers to use ... prewritten code to build certain functions into their own programs, rather than write their own code to perform those functions from scratch.'

Let's use `split()` here

In [46]:
string_example.split("?")

['What is an API',
 ' The Federal Circuit described an API as a tool that “allow[s] programmers to use ... prewritten code to build certain functions into their own programs, rather than write their own code to perform those functions from scratch.']

**Question:** What is the length of `string_example`?

In [47]:
len(string_example)

242

**Question:** Let's lowercase string_example

In [48]:
string_example.lower()

'what is an api? the federal circuit described an api as a tool that “allow[s] programmers to use ... prewritten code to build certain functions into their own programs, rather than write their own code to perform those functions from scratch.'

**Question:** How many `c`'s are in `string_example`? 

(Zoom poll)

In [49]:
string_example.count('c')

9

In [None]:
str

##### Combining Strings

In [None]:
string_example + " hello"

##### Combining Strings and Numbers

**f strings** - https://zetcode.com/python/fstring/

In [52]:
fav_fruit = "banana"

f"I like to eat {fav_fruit}"

'I like to eat banana'

In [51]:
"I like to eat " + fav_fruit

'I like to eat banana'
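The examples above only combine strings. A short sketch of mixing in numbers follows. The variables `count`, `price`, `message`, and `plain` are made up for illustration. Inside an f-string, numbers are converted automatically and can take a format specifier such as `:.2f`, whereas `+` requires an explicit `str()` conversion.

```python
fav_fruit = "banana"
count = 3
price = 0.5

# f-strings convert numbers for us; :.2f keeps two decimal places
message = f"I ate {count} {fav_fruit}s for ${count * price:.2f}"
print(message)  # I ate 3 bananas for $1.50

# With +, each number must first be converted with str()
plain = "I ate " + str(count) + " " + fav_fruit + "s"
print(plain)    # I ate 3 bananas
```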

### Show Contextual Help

*(no memorization, use resources)*

(back to slides)

### Collections
- Lists
- Tuples
- Sets
- Dictionaries

#### Lists 
Lists store multiple items in a single variable. (Definition from [W3Schools](https://www.w3schools.com/python/python_lists.asp).)

In [53]:
fruits = ["bananas", "apples", "oranges"]
fruits

['bananas', 'apples', 'oranges']

In [54]:
type(fruits)

list

##### Size of lists

**Question:** How many items are in `fruits`?

In [55]:
len(fruits)

3

##### Accessing items in lists 

We use brackets, `[ ]` to access items in a list

In [60]:
fruits[3]

IndexError: list index out of range

**Question:** How do we access the first item in the list `fruits`?

In [61]:
first_fruit = fruits[0]
first_fruit

'bananas'

**Question:** How do we access the last item in the list `fruits`?

In [62]:
last_fruit = fruits[2]
last_fruit

'oranges'

In [68]:
fruit_length = len(fruits)  - 1
fruits[fruit_length]
#fruit_length

'oranges'

In [71]:
fruits[-1]

'oranges'

In [72]:
fruits[-2]

'apples'

**Question:** How do we access the second item in the list `fruits`?

In [None]:
second_fruit = ...
second_fruit

##### Accessing sub-lists from lists

`[first_index : last_index]`

`first_index` is inclusive, `last_index` is exclusive

In [73]:
fruits[0:2]

['bananas', 'apples']

In [74]:
fruits[1:2]

['apples']

In [75]:
fruits[1:]

['apples', 'oranges']

In [76]:
fruits[:2]

['bananas', 'apples']
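Slices also accept negative indices and an optional step, written `[first_index : last_index : step]`. The following sketch reuses the three-item `fruits` list from above to show a few more variations:

```python
fruits = ["bananas", "apples", "oranges"]

print(fruits[-2:])   # last two items: ['apples', 'oranges']
print(fruits[::2])   # every second item: ['bananas', 'oranges']
print(fruits[::-1])  # reversed copy: ['oranges', 'apples', 'bananas']

# Unlike fruits[3], a slice that runs past the end does not raise an error
print(fruits[1:10])  # ['apples', 'oranges']
```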

In [77]:
dir(fruits)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

In [None]:
fruits.append

##### Adding to lists


`<list>.append(...)` adds ... to the end of the list (in-place)

In [78]:
fruits.append("strawberries")
fruits

['bananas', 'apples', 'oranges', 'strawberries']

Override at a specific index

In [79]:
fruits[0] = "pear"
fruits

['pear', 'apples', 'oranges', 'strawberries']
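A common point of confusion is the difference between `append` and `extend`. The sketch below uses two separate lists, `basket` and `nested` (names chosen for illustration), so the `fruits` list above stays unchanged:

```python
basket = ["pear", "apples"]
basket.append("oranges")            # adds one item to the end
basket.extend(["kiwis", "plums"])   # adds each item of another list
print(basket)  # ['pear', 'apples', 'oranges', 'kiwis', 'plums']

nested = ["pear", "apples"]
nested.append(["kiwis", "plums"])   # the whole list becomes a single item
print(nested)       # ['pear', 'apples', ['kiwis', 'plums']]
print(len(nested))  # 3
```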

##### Other things we can do to a list

(Just run the code below; it's OK if we don't understand it just yet.)

In [80]:
[", ".join([func for func in dir(fruits) if not func.startswith("__") ])]

['append, clear, copy, count, extend, index, insert, pop, remove, reverse, sort']

The official [Python documentation](https://docs.python.org/3.8/tutorial/datastructures.html#more-on-lists) contains descriptions of each of these methods.

In [81]:
fruits.pop()

'strawberries'

In [82]:
fruits

['pear', 'apples', 'oranges']

In [83]:
fruits.insert(0, 'bananas')

In [84]:
fruits

['bananas', 'pear', 'apples', 'oranges']

In [85]:
fruits.insert(3, 'first organge')

In [86]:
fruits

['bananas', 'pear', 'apples', 'first organge', 'oranges']

In [88]:
fruits.index('bananas')

0

**Iterating through a list**

> **for** *element* **in** *list*: <br>
        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;do something

*Indentation is important*

**Question:** Iterate through each item in fruits and print out the item

In [91]:
for fruit in fruits:
    print(fruit, len(fruit))

bananas 7
pear 4
apples 6
first organge 13
oranges 7
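The bracketed expressions used earlier with `dir(...)` are list comprehensions, which are a compact way of writing a loop that builds a list. As a rough sketch with an illustrative list named `words`, the two forms below produce the same result:

```python
words = ["bananas", "pear", "apples"]

# Version 1: a for loop that appends to an empty list
lengths_loop = []
for word in words:
    lengths_loop.append(len(word))

# Version 2: the same logic as a list comprehension
lengths_comp = [len(word) for word in words]

print(lengths_loop)  # [7, 4, 6]
print(lengths_comp)  # [7, 4, 6]
```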


### Tuples

In [92]:
tup = (1,2,3)
tup, type(tup)

((1, 2, 3), tuple)

Tuples are immutable lists. Let's look at what that means:

**Adding to a tuple**

In [93]:
tup.append(4)

AttributeError: 'tuple' object has no attribute 'append'

In [94]:
tup[0] = 2

TypeError: 'tuple' object does not support item assignment

**What can we do with a tuple?**

Which Python function will list out the options?
<details>
<summary>Solution</summary>
    <b>dir(tup)</b> or <b>help(tup)</b>

</details>


In [95]:
dir(tup)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'count',
 'index']

In [96]:
len(fruits)

5

In [100]:
len(tup)

3

In [101]:
tup

(1, 2, 3)

In [102]:
tup.count(1)

1

In [103]:
tup.index(1)

0

**Why use tuples?**

In [104]:
play = ("Shakespeare", "A Midsummer Night's Dream", 1595)
author, title, year = play
author, title, year

('Shakespeare', "A Midsummer Night's Dream", 1595)

In [105]:
author, title = play

ValueError: too many values to unpack (expected 2)

Tuple assignment is a way of unpacking values.
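The `ValueError` above occurs because the number of names on the left must match the number of items in the tuple. One way around this, sketched below, is a starred name that collects the remaining items into a list, or an underscore for a value we do not need. The names `details` and `_` are illustrative.

```python
play = ("Shakespeare", "A Midsummer Night's Dream", 1595)

# A starred name gathers all leftover items into a list
author, *details = play
print(author)   # Shakespeare
print(details)  # ["A Midsummer Night's Dream", 1595]

# By convention, _ marks a value we intend to ignore
author, title, _ = play
print(title)    # A Midsummer Night's Dream
```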

### Sets

In [106]:
set_example = set([0,1,2,3,4])
set_example, type(set_example)

({0, 1, 2, 3, 4}, set)

In [107]:
set([0,0,0,0])

{0}

**Adding to a set** 

Let's try adding the number `5` to our set

In [108]:
set_example.append(5)

AttributeError: 'set' object has no attribute 'append'

In [110]:
set_example.add(5)
set_example

{0, 1, 2, 3, 4, 5}

In [111]:
set_example.add(0)
set_example

{0, 1, 2, 3, 4, 5}

Which Python function can you use to find out how to add to a set?
<details>
<summary>Solution</summary>
<b>dir(set_example)</b>

</details>


Now let's add 5 to our set

#### Set of authors

In [112]:
authors = set(["Shakespeare", "Austen", "Morrison", "Woolf", "Shakespeare"])
authors

{'Austen', 'Morrison', 'Shakespeare', 'Woolf'}

**What would be a good use case of sets?**
<details>
<summary>Solution</summary>
<b>Vocabularies</b>

</details>

#### 2016 Republican & Democratic Miami Presidential Debates

Let's first store the debates in the variables `repub_debates` & `dem_debates`

In [113]:
repub_debates = open("data/Republican_16_Miami_Debate.txt").read()
dem_debates = open("data/Democratic_16_Miami_Debate.txt").read()
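The calls above leave the files open until Python cleans them up. A common alternative is a `with` block, which closes the file automatically. Since the debate files may not be available to every reader, this sketch writes a small temporary file first. The file name `example_debate.txt` and its contents are made up for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical file created only for this example
    path = Path(folder) / "example_debate.txt"
    path.write_text("TAPPER: Welcome to Miami.\n", encoding="utf-8")

    # The with block closes the file even if an error occurs while reading
    with open(path, encoding="utf-8") as handle:
        text = handle.read()

print(type(text))    # <class 'str'>
print(text.split())  # ['TAPPER:', 'Welcome', 'to', 'Miami.']
```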

In [114]:
type(repub_debates)

str

In [115]:
repub_debates[:3000]

"TAPPER: Live from the Bank United Center on the campus of the University of Miami, this is the CNN Republican Presidential debate. For our viewers in the United States and around the world, welcome to Miami Florida, I'm Jake Tapper.\n\nIn just five days voters will go to the polls here in this state as well as in Ohio, Illinois, North Carolina and Missouri. The race for the Republican nomination for president could change dramatically.\n\nFlorida and Ohio each have a large number of delegates at stake and they award all of them to the candidate who wins. They're a winner-take-all state. So that's the first time that will happen in this primary season and this is the last debate before that critical round of voting.\n\nWe hope tonight the candidates will give the voters specifics on their visions for America.\n\nSo now let's welcome the candidates.\n\nOhio Governor John Kasich. [applause]\n\nSenator Ted Cruz of Texas. [applause]\n\nReal estate developer and businessman Donald Trump. [a

**What types are `repub_debates` & `dem_debates`?**

**Let's build our vocabularies**

In [116]:
repub_vocab, dem_vocab = set(), set()

**Let's discuss an algorithm for adding words to our vocabularies**

<details>
<summary>Solution - don't show for a while</summary>
1. Split the string containing the entire debate into a list of words
<br>
2. Loop through the list of words, one word at a time, and add each word to the set
</details>

In [125]:
for word in repub_debates.split():
    repub_vocab.add(word)
#repub_vocab

dem_vocab = set(dem_debates.split())

In [123]:
#set(repub_debates.split())

In [122]:
set([1,2,3,4,3,4,1])

{1, 2, 3, 4}

In [None]:
for word in repub_debates.split():
    repub_vocab.add(word)
for word in dem_debates.split():
    dem_vocab.add(word)

In [126]:
len(repub_vocab), len(dem_vocab)

(4009, 2962)

##### Validate, Validate, Validate

**It's a good idea to sanity-check our variables.**
Let's look at our data a bit

In [None]:
dem_vocab

#### Union, intersection, difference

**Question: How big is the entire vocabulary across both debates?**

Approach 1: add the length of the two vocabularies together

In [127]:
len(repub_vocab) + len(dem_vocab)

6971

Why is this wrong?

<details>
<summary>Solution</summary>
<b>There are words in both vocabularies</b>

 What words do we think might be in both?   
</details>

Approach 2: Union of the two

In [128]:
len(repub_vocab.union(dem_vocab))

5569

In [129]:
total_vocab = repub_vocab.union(dem_vocab)
#total_vocab

**Question: What are the words used in one debate but not the other?**

In [131]:
difference_vocab = dem_vocab.difference(repub_vocab)
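Note that `dem_vocab.difference(repub_vocab)` only returns words that appear in the Democratic debate and not in the Republican one. From the numbers above, 6971 - 5569 = 1402 words appear in both vocabularies. The sketch below uses two tiny made-up sets, `left` and `right`, to compare the main set operations; `sorted()` is used only so the printed order is predictable.

```python
left = {"the", "taxes", "jobs"}
right = {"the", "jobs", "health"}

print(sorted(left.union(right)))                 # ['health', 'jobs', 'taxes', 'the']
print(sorted(left.intersection(right)))          # ['jobs', 'the']
print(sorted(left.difference(right)))            # ['taxes']
# Words in exactly one of the two sets
print(sorted(left.symmetric_difference(right)))  # ['health', 'taxes']

# The size of the union equals the two sizes minus the overlap
print(len(left) + len(right) - len(left.intersection(right)))  # 4
```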

**Question: What do we notice about these words?**

<details>
<summary>Solution</summary>
<b>Punctuation</b>.
    
    Example: 'excuses' in dem_vocab, 'excuses.' in dem_vocab
    
    We'll deal with this later
</details>
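As a preview of one possible fix, and not necessarily the approach used later in the course, the sketch below strips surrounding punctuation and lowercases each token before adding it to the set. The helper name `build_vocab` and the `sample` sentence are hypothetical.

```python
from string import punctuation

def build_vocab(text):
    """Return the set of lowercased words with surrounding punctuation removed."""
    vocab = set()
    for token in text.split():
        word = token.strip(punctuation).lower()
        if word:  # skip tokens that consisted only of punctuation
            vocab.add(word)
    return vocab

sample = "No excuses. We make no excuses!"
print(len(set(sample.split())))     # 6 distinct raw tokens
print(sorted(build_vocab(sample)))  # ['excuses', 'make', 'no', 'we']
```

`string.punctuation` only covers ASCII characters, so typographic quotes or dashes in a transcript would need separate handling.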


### Dictionary

![key_val](images/key_val_dict.jpeg)
<p style='text-align: right;'>Image from <a href="https://medium.com/python-pandemonium/python-dictionaries-45cacc2b76aa">https://medium.com/python-pandemonium/python-dictionaries-45cacc2b76aa</a></p>

In [132]:
type({})

dict

In [133]:
dict_example = {"a": 1, 
                "b": 2, 
                "c": 3}
dict_example

{'a': 1, 'b': 2, 'c': 3}

In [135]:
dict_example['a']

1

#### Dictionary functions

In [137]:
#help(dict)

**Accessing items in dictionaries**

In [138]:
dict_example['a']

1

**Check if key is in a dictionary**

In [139]:
'a' in dict_example

True

In [141]:
'bananas' in fruits

True

In [142]:
'apple' in fruits

False
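Looking up a key that is missing with square brackets raises a `KeyError`. As a small sketch, using a separate dictionary named `letters` for illustration, `get()` returns a default instead, and `items()` lets us loop over keys and values together:

```python
letters = {"a": 1, "b": 2, "c": 3}

print(letters.get("a"))     # 1
print(letters.get("z"))     # None, because the key is missing
print(letters.get("z", 0))  # 0, the default we supplied

letters["d"] = 4            # assigning to a new key adds it
for key, value in letters.items():
    print(key, value)
```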

#### Use case
**Question:** What would be a good use of dictionaries from our previous example?

<details>
<summary>Solution</summary>
    <b>Term Frequencies</b>
</details>

In [147]:
# Build a dictionary of word counts for the Republican and Democratic debates
repub_word_counts, dem_word_counts = {}, {}

# Count how often each word appears in the Republican debate
for word in repub_debates.split():
    if word not in repub_word_counts:
        repub_word_counts[word] = 0
    repub_word_counts[word] += 1

# Count how often each word appears in the Democratic debate
for word in dem_debates.split():
    if word not in dem_word_counts:
        dem_word_counts[word] = 0
    dem_word_counts[word] += 1
    # same as: dem_word_counts[word] = dem_word_counts[word] + 1



<details>
<summary>Solution</summary>


for word in repub_debates.split():
    if word not in repub_word_counts:
        repub_word_counts[word] = 0 
    repub_word_counts[word] += 1
    
for word in dem_debates.split():
    if word not in dem_word_counts:
        dem_word_counts[word] = 0
    dem_word_counts[word] += 1
    

</details>

Let's sort the dictionary

In [148]:
sorted_dict = {}
sorted_keys = sorted(dem_word_counts, key=dem_word_counts.get, reverse=True)
for w in sorted_keys:
    sorted_dict[w] = dem_word_counts[w]

print(sorted_dict) 

{'the': 626, 'to': 514, 'I': 352, 'that': 313, 'and': 298, 'of': 292, 'in': 239, 'a': 233, 'you': 220, 'is': 183, 'have': 156, 'we': 155, 'And': 143, 'for': 139, 'was': 92, '[applause]': 91, 'are': 90, 'not': 83, 'on': 82, 'be': 80, 'do': 79, 'with': 78, 'this': 77, 'will': 74, 'people': 74, 'what': 73, 'it': 71, 'SANDERS:': 70, 'think': 70, 'would': 67, 'who': 66, 'CLINTON:': 61, 'as': 61, 'about': 60, 'RAMOS:': 56, '—': 54, 'can': 54, 'they': 53, 'my': 53, 'their': 51, 'your': 51, 'going': 48, 'We': 47, 'very': 47, 'Secretary': 45, 'our': 45, 'has': 44, 'when': 44, 'So': 42, 'from': 41, 'SALINAS:': 40, 'Senator': 40, 'by': 38, 'But': 37, 'more': 37, 'get': 37, 'said': 37, 'at': 37, 'me': 37, "I'm": 36, 'if': 36, 'all': 35, 'Thank': 34, 'want': 33, 'he': 33, 'just': 32, 'up': 31, 'You': 30, 'go': 30, 'immigration': 29, 'United': 29, 'Well,': 29, 'one': 28, 'been': 28, 'but': 28, 'or': 28, '[through': 27, 'were': 27, 'time': 27, 'an': 26, 'know': 26, 'many': 26, 'had': 26, "don't": 26,
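The counting loop and the manual sort above can also be expressed with `collections.Counter` from the standard library. This is an alternative sketch, not the notebook's method, and it uses a short made-up `sample` string instead of the debate text.

```python
from collections import Counter

sample = "the cat and the hat and the bat"

# Counter tallies each item of the list in a single call
counts = Counter(sample.split())
print(counts["the"])   # 3
print(counts["dog"])   # 0, missing keys count as zero

# most_common(n) returns the n most frequent words, sorted by count
print(counts.most_common(2))  # [('the', 3), ('and', 2)]
```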

**In today's tutorial we will build candidate specific vocabularies and compute their frequencies**

## Functions

In [149]:
def add(x, y):
    """Returns the sum of two numbers passed as arguments"""
    return x + y

In [150]:
type(add)

function

In [152]:
add.__doc__

'Returns the sum of two numbers passed as arguments'
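The cells above define `add` and inspect it but never call it. A brief sketch of calling the function follows. Because `+` also works on strings, the same function concatenates them, even though the docstring only mentions numbers.

```python
def add(x, y):
    """Returns the sum of two numbers passed as arguments"""
    return x + y

print(add(2, 2))                   # 4
print(add(4, 3.5))                 # 7.5
print(add("My name is ", "Adam"))  # My name is Adam
```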

In [None]:
dict_example[0]

## Control Statements

### Loops

We saw loops above. We can loop through any collection.

#### Looping through multiple collections

In [154]:
people = ['John', 'Mary', 'Karen']
fruits = ['apple', 'banana', 'organge']

for person, fruit in zip(people, fruits):
    # Do something
    print(f"{person} likes {fruit}")

John likes apple
Mary likes banana
Karen likes organge
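One detail worth knowing is that `zip` stops at the end of the shortest collection. The sketch below uses illustrative lists named `names` and `snacks` of different lengths, and also shows that the pairs can be turned into a dictionary.

```python
names = ["John", "Mary", "Karen"]
snacks = ["apple", "banana"]   # one item shorter than names

pairs = list(zip(names, snacks))
print(pairs)      # [('John', 'apple'), ('Mary', 'banana')]

# dict() accepts the (key, value) pairs that zip produces
favorites = dict(zip(names, snacks))
print(favorites)  # {'John': 'apple', 'Mary': 'banana'}
```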


#### Looping and getting indices

In [156]:
for idx, fruit in enumerate(fruits):
    print(f"{fruit} is in the {idx}-index")

apple is in the 0-index
banana is in the 1-index
organge is in the 2-index
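`enumerate` starts counting at 0 by default. If a human-friendly numbering is preferred, its `start` parameter can be set, as in this short sketch (the list and the names `numbered` and `position` are illustrative):

```python
numbered = []
for position, fruit in enumerate(["apple", "banana"], start=1):
    numbered.append(f"{position}. {fruit}")

print(numbered)  # ['1. apple', '2. banana']
```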


### Conditionals: If statements


In [157]:
if fruits[0] == 'apple':
    # action 1
    print(1)
elif fruits[0] == 'banana': # otherwise, if the second condition is true
    # action 2
    print(2)
else: # if neither the first nor the second condition is true
    # action 3
    print(3)

1


In [158]:
fruits[1] == 'apple'

False

In [159]:
3 > 2 

True

In [160]:
3 > 3

False

In [161]:
3 >= 3

True
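Comparisons like these are what `if`, `elif`, and `else` test. Conditions are checked from top to bottom and only the first true branch runs, so the order matters. The function `size_label` below is a hypothetical helper written only to illustrate this.

```python
def size_label(word):
    """Hypothetical helper: label a word by its length."""
    if len(word) >= 7:
        return "long"
    elif len(word) >= 5:   # only checked when the first test is False
        return "medium"
    else:
        return "short"

print(size_label("bananas"))  # long
print(size_label("apple"))    # medium
print(size_label("pear"))     # short
```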

## Python functions and libraries

**Question:** Write a function called `absolute()` that takes in a number and returns its absolute value

In [None]:
def absolute(number):
    '''
    Returns the absolute value of a number
    '''
    
    return
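One possible solution, sketched here under the assumption that the input is an `int` or a `float`, checks the sign with an `if` statement. The final line compares the result against Python's built-in `abs()`.

```python
def absolute(number):
    '''
    Returns the absolute value of a number
    '''
    if number < 0:
        return -number   # flip the sign of negative numbers
    return number

print(absolute(-3))              # 3
print(absolute(2.5))             # 2.5
print(absolute(-3) == abs(-3))   # True
```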

Let's look up built-in Python libraries

Many of the things we want to do in computational text analysis are already implemented in Python libraries:

- Find all people, places, numbers, organizations, countries mentioned in a text
- Identify all nouns, verbs, adjectives, or adverbs in a text
- Predict the sentiment of a tweet or news article 
- Determine the vocabulary and frequency of different terms
- Represent words with meaningful lists of numbers
- Develop a Machine Learning classifier to predict X from text

## Loops again