# Python for Data Science

##### Jiin Jung & Minjae Yun

##### Fall 2019

# Welcome!

This workshop is a brief introduction to using Python and Jupyter Notebooks.


# Python

For most Data Science tasks there are two widely used open source languages: Python and R. R is favoured more by those with a mathematical background, while Python is preferred by those with a programming background. Python is currently the most popular language on Stack Overflow; see this [Most Popular Programming language on Stack Overflow Bar Chart Race](https://www.youtube.com/watch?v=cKzP61Gjf00). Choosing a language that is used by more people allows you to communicate and collaborate with more people.

Run the following cell. You can run (execute) a cell with Ctrl-Enter (which runs the cell and keeps it selected), Shift-Enter (which runs the cell and then selects the next cell), the Run button on the toolbar, or Run Cells in the Cell menu.


In [None]:
from IPython.display import Image
Image(url="http://businessoverbroadway.com/wp-content/uploads/2019/01/programming_languages_used.png")

Genevieve Hayes examined 100 data science job advertisements, across four English-speaking countries (Australia, Canada, UK and USA), found on LinkedIn between 22 April 2019 and 5 May 2019. Run the following cell to see the [top 10 data science programming languages](https://towardsdatascience.com/which-programming-language-should-data-scientists-learn-first-aac4d3fd3038).

In [None]:
Image(url= "https://miro.medium.com/max/2289/1*KWhvKrCjKG1JbbWPoSLO4g.png")

# Jupyter notebooks

Jupyter notebooks are an incredible way to work and cowork. They allow you to present documentation and working code in the same file. People can read through the documentation and see the running code. They also make it easy for coworkers to share the file and edit the code collectively.

### Cells
You may want to edit this documentation and make some notes while taking this class.
There are two types of **_cells_**: markdown and code. This is a markdown cell. Code cells run actual Python code!
You can navigate cells by clicking on them or by using the up and down arrows. Cells will be highlighted as you navigate them.



### Markdown cells for documentation
To edit the documentation: (1) select this cell, (2) press Enter, (3) edit the text, and (4) click Run.

You can change the heading level by the number of #.
# Heading
## Heading
### Heading
#### Heading

### Tips

When using cells, try to separate distinct pieces of code.

At the end of each cell, there is room for some output. The output could be blank, printed text, images, or HTML: pretty much anything you can think of.

**_You can find many more tips by viewing "Help->Markdown"_** 

### Code cells for Python code

The following cell is a "code cell". You'll see an `In [ ]:` next to each code cell; once a cell has run, the brackets show a counter indicating the order in which cells were executed.


In [None]:
# This is a code cell

You may run the code above, but it won't produce any output, because "#" turns the rest of the line into a comment that Python ignores.

Try running the following cell and see what it prints out:

In [None]:
print("Hello world!")

### Practice

Print this: The world is round.


Did you get the output? Did you encounter an error? Check whether you wrapped the text in quotation marks (" ") and typed `print` in lowercase letters.


You run cells by pressing Ctrl-Enter. Pressing Shift-Enter runs the cell and advances to the next one.

You can find many more handy keyboard shortcuts by viewing "Help->Keyboard shortcuts"

# Expressions

Run the following cells and look at the outputs.

In [None]:
3*4

In [None]:
3**4

In [None]:
9/2

In [None]:
9%2

In [None]:
5+2

In [None]:
5-2
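
The division cells above are worth a closer look. As a small sketch: `/` always returns a float, `%` returns the remainder, and `//` (floor division, which is not used in the cells above) rounds the quotient down to a whole number.

```python
# True division always produces a float, even when the result is whole.
true_div = 9 / 2
# Floor division rounds the quotient down to a whole number.
floor_div = 9 // 2
# The modulo operator returns the remainder of the division.
remainder = 9 % 2

# Quotient and remainder recombine to the original number.
recombined = floor_div * 2 + remainder
print(true_div, floor_div, remainder, recombined)
```

The last line is a handy sanity check whenever you use `//` and `%` together.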

Python expressions obey the same familiar rules of **_precedence_** as in algebra: 

- Multiplication and division occur before addition and subtraction. 

- Exponentiation occurs before multiplication and division.

- Parentheses can be used to group together smaller expressions within a larger expression.

Before you run the following cells, first calculate your answers.

In [None]:
3**4*2

In [None]:
2*3**4

In [None]:
(2*3)**4

In [None]:
3**(4+2)

### Practice

Write code for the expression: 3(2+5)^2 

In [None]:
3*(2+5)**2

# Names
Names are given to values in Python using an **_assignment_** statement. In an assignment, a name is followed by =, which is followed by any expression. The value of the expression to the right of = is assigned to the name. Once a name has a value assigned to it, the value will be substituted for that name in future expressions.

In [None]:
a = 3
b = 4
a*b

In [None]:
fahrenheit = 55
celsius = (fahrenheit-32) *5/9
celsius


In [None]:
int(celsius)

In [None]:
round(celsius, 2)
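
The two cells above look similar but behave differently. A minimal sketch of the difference, reusing the same temperature: `int` drops the fractional part, while `round` goes to the nearest value at the requested precision.

```python
fahrenheit = 55
celsius = (fahrenheit - 32) * 5 / 9   # about 12.78

truncated = int(celsius)              # drops everything after the decimal point
rounded = round(celsius)              # nearest whole number
rounded_2 = round(celsius, 2)         # nearest value with two decimal places

# int truncates toward zero, so negative numbers move up, not down.
truncated_negative = int(-2.7)

print(truncated, rounded, rounded_2, truncated_negative)
```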

In [None]:
kelvin = celsius + 273.15
kelvin

### Practice

Complete the code below to calculate how many seconds a 400 g tennis ball takes to fall from a 10-meter-high building.

Here is the equation for a falling body:

<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/10f3d0383ea94ebc8fa369018069467481453818" align="left">

Note: Near the surface of the Earth, the acceleration due to gravity (g) is 9.807 m/s² (meters per second squared).

In [None]:
d = 10
g = 9.807
t = (2*d/g)**0.5
t

# Call Expressions

Call expressions invoke **_functions_**, which are named operations. The name of the function appears first, followed by expressions in parentheses.

In [None]:
abs(-45)

In [None]:
round(kelvin)

In [None]:
max(fahrenheit, celsius, kelvin)

A few functions are available by default, such as abs and round, but most functions that are built into the Python language are stored in a collection of functions called a module. An import statement is used to provide access to a module, such as math or operator.

In [None]:
import math
import operator
math.sqrt(operator.add(4, 5))

In [None]:
math.log?

In [None]:
math.log(16, 2)

The list of [Python's built-in functions](https://docs.python.org/3/library/functions.html) is quite long and includes many functions that are never needed in data science applications. The list of [mathematical functions in the math module](https://docs.python.org/3/library/math.html) is similarly long. This text will introduce the most important functions in context, rather than expecting the reader to memorize or understand these lists.

# Variables and array types

Python is dynamically typed: you do not declare the type of a variable, so variables are super easy to define.

Other than variables, the two constructs that we will be using all day are lists (a.k.a. arrays) and dictionaries (a.k.a. maps). Why they don't call them arrays and maps, I don't know.
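
As a quick sketch of what it means that a name carries no declared type, the same name can be rebound to values of different types. Values still have types, though, and mixing incompatible ones fails when the code runs.

```python
value = 2
first_type = type(value).__name__      # the value is an int

value = "two"                          # same name, now bound to a string
second_type = type(value).__name__     # the value is a str
print(first_type, second_type)

# Adding a string and an integer raises a TypeError at runtime.
try:
    result = "2" + 2
except TypeError as err:
    result = None
    print("TypeError:", err)
```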

In [None]:
my_variable = 2                         # A simple variable
my_list = [1, 2, 3]                     # A simple list
print(my_list[0])                       # Zero indexing. print is a built-in function, similar to printf.
another_list = ["a", "string", "list"]
print(another_list[:])                  # Colon means "all"
print(another_list[0:2])                # Slices include the start index and exclude the end index.
character_list = 'abc'
print(character_list[-1])               # -1 means the last entry, -2 means last but one

In [None]:
first_dict = {'bob': 32, 'steve': 94}               # Simple dictionary
key = "bob"
print("%s is aged %d" % (key, first_dict[key]))     # The % operator fills the placeholders in the string. Note the parentheses around the terms.
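
The `%` style above is only one way to build a string. As a sketch, here are two other common options that produce the same text: f-strings (available from Python 3.6) and `str.format`.

```python
first_dict = {'bob': 32, 'steve': 94}
key = "bob"

old_style = "%s is aged %d" % (key, first_dict[key])          # the style used above
f_string = f"{key} is aged {first_dict[key]}"                 # f-string
with_format = "{} is aged {}".format(key, first_dict[key])    # str.format

print(f_string)
```

f-strings tend to be the easiest to read because each value sits where it appears in the text.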

### Example

The code below will:

- Create a Python list of keys
- Create a map containing those keys pointing to some values
- Use the Python function `print(...)` to print a value or range of values using your list

In [None]:
keys = ["a", "b", "c"]
d = {"a": 1, "b": 2, "c": 3}
for k in keys:
    print(d[k])
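
The example above looks values up through a separate list of keys. As an alternative sketch, a dictionary can also be looped over directly with `items()`, and `get` avoids an error when a key is missing.

```python
d = {"a": 1, "b": 2, "c": 3}

pairs = []
for key, value in d.items():     # each step yields one (key, value) pair
    pairs.append((key, value))
    print(key, value)

# d["z"] would raise a KeyError; get returns the fallback instead.
missing = d.get("z", 0)
print(missing)
```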

# Functions
Functions are defined in similar ways to other languages, but you might not be used to the syntax.

Because types are checked at runtime, Python is not very strict about them (you can add optional type hints, which tools such as mypy can check).

In [None]:
def printData(x=[1, 2, 3]):     # An = in the parameter list means "default to". Note the colon.
    for x_i in x:               # Note the indentation inside the function. This is required.
        print(x_i)              # The "in" construct iterates over values in x.
        
printData()
printData([11, 12])
printData(["a", "string"])
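
To illustrate the point about types, here is a sketch of the same function with optional type hints. The name `print_data` and the return value are illustrative additions, not part of the workshop code. It also uses `None` as the default and builds the list inside the function, a common convention that avoids sharing one default list between calls.

```python
from typing import List, Optional

def print_data(x: Optional[List[int]] = None) -> int:
    """Print each item of x and return how many items were printed."""
    if x is None:                 # no argument given: fall back to the default list
        x = [1, 2, 3]
    for x_i in x:
        print(x_i)
    return len(x)

count_default = print_data()
# The hint says List[int], but Python itself does not enforce it at runtime.
count_strings = print_data(["a", "string"])
```

The second call runs without complaint; only an external type checker would flag it.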

# Handy functions

There is a wide range of handy features in Python. You might not need to use these, but here are some.

In [None]:
str_list = [str(x) for x in range(3)]               # "List comprehension", i.e. create a list from a for loop
print(str_list)                                 
print(', '.join(str(x) for x in range(3, 0, -1)))   # Joining strings

# The following will only work in Python 3, where print is a function: one of many differences between Python 2 and 3.
l = lambda x: print(x**2)                           # A "lambda", a function. ** = power. Most times you can just define a function.
l(3)
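
One place where a lambda is genuinely convenient is as a short `key` function for sorting. The sketch below uses a made-up dictionary of ages ('alice' is added purely for illustration).

```python
ages = {'bob': 32, 'steve': 94, 'alice': 41}   # made-up example data

# Sort the (name, age) pairs by age, the second item of each pair.
by_age = sorted(ages.items(), key=lambda pair: pair[1])

# Sort just the names, oldest first.
oldest_first = sorted(ages, key=lambda name: ages[name], reverse=True)

print(by_age)
print(oldest_first)
```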

# Web Scraping
In this part, we aim to understand what web scraping is, how to carry it out, and how to create a flat file containing the collected information.

## 1) Basic web scraping
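What is web scraping? It is the practice of downloading web pages with a program and extracting the information you need from their HTML.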

In [None]:
import requests
from bs4 import BeautifulSoup
url = 'https://www.icpsr.umich.edu/files/NACJD/ORIs/STATESoris.html'

To collect every link in the list of agencies:

In [None]:
# use the requests package to get the content at the defined url
s = requests.get(url)
# use the BeautifulSoup package to parse the content as HTML
soup = BeautifulSoup(s.text, 'lxml')
# get the "li" elements from the HTML source
elements = soup.find_all("li")
# note: Python does not distinguish between "" and '' for strings

To get the second link in the list:

In [None]:
# get "href" (Hypertext REFerence) 'attribute' from the element
elements[1].a['href']
# concatenate strings (or merge text) in python: "+"
url = 'https://www.icpsr.umich.edu/files/NACJD/ORIs/'+ elements[1].a['href']
print(url)
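
To try the same idea without a network connection, here is a sketch that runs on a made-up HTML snippet. It swaps in the standard library's `html.parser.HTMLParser` for BeautifulSoup and uses `urllib.parse.urljoin` instead of `+` to build the full address. The snippet, the file names, and the `LinkCollector` class are all illustrative assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Made-up HTML standing in for the downloaded page.
html = """
<ul>
  <li><a href="first.html">First</a></li>
  <li><a href="second.html">Second</a></li>
</ul>
"""

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)

collector = LinkCollector()
collector.feed(html)

base = 'https://www.icpsr.umich.edu/files/NACJD/ORIs/'
second_link = urljoin(base, collector.hrefs[1])   # second link, index 1
print(second_link)
```

`urljoin` handles slashes and relative paths for you, which matters once the links on a page are not all plain file names.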

## 2) Repeating the same task: for loop and while loop
Generate elements dynamically, pass them to key functions, and repeat the task with loops.

### For loop
Python runs the body of the loop once for each item, in order.

In [None]:
# Python indices always start from 0!
list(range(10))
# the 'range' function includes the initial value and excludes the last value
for i in range(10):
    print(i)
# we can also set the initial value
for i in range(6,10):
    print(i)

### Practice
Use a for loop and one line of code to print a series of 10 numbers, each a power of 5

In [None]:
for i in range(10):
    val = #fill out this line#
    print(val)

Let's apply a for loop to collect every link from the example URL.

In [None]:
length = len(elements)
links = []
for i in range(length):
    links.append('https://www.icpsr.umich.edu/files/NACJD/ORIs/'+elements[i].a['href'])
links[1]
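
The loop above walks through index numbers. As a sketch of two more idiomatic variants, you can loop over the items directly or use a list comprehension. The `hrefs` list here is a made-up stand-in for the values of `elements[i].a['href']`.

```python
base = 'https://www.icpsr.umich.edu/files/NACJD/ORIs/'
hrefs = ['first.html', 'second.html', 'third.html']   # made-up stand-ins

# Variant 1: loop over the items themselves, no index needed.
links_loop = []
for href in hrefs:
    links_loop.append(base + href)

# Variant 2: a list comprehension builds the same list in one line.
links_comp = [base + href for href in hrefs]

print(links_comp[1])
```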

### While loop
A while loop does the same task but relies on a conditional statement instead of a fixed sequence of numbers.

In [None]:
i = 0
links = []
# Question: why doesn't the inequality need to include the equal sign?
# Answer:
while i < length:
    links.append('https://www.icpsr.umich.edu/files/NACJD/ORIs/'+elements[i].a['href'])
    i = i+1
i = 0
links = []
while i < length:
    links.append('https://www.icpsr.umich.edu/files/NACJD/ORIs/'+elements[i].a['href'])
    # i += 1 is shorthand for i = i + 1
    i += 1
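
A while loop is most useful when you do not know in advance how many repetitions are needed. A minimal sketch with made-up numbers: keep halving a value until it drops below 10, counting the steps along the way.

```python
value = 1000
steps = 0

# The loop stops as soon as the condition becomes false.
while value >= 10:
    value = value / 2
    steps += 1

print(steps, value)
```

With a for loop you would have to guess the number of iterations up front; here the condition decides.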

### Practice
Use a while loop and the list `append` method to get the content from every link

In [None]:
# First one:
url = links[0]
s = requests.get(url)
# use the BeautifulSoup package to parse the content as HTML
soup = BeautifulSoup(s.text, 'lxml')
# get the "pre" elements from the HTML source
contents = soup.find_all("pre")

### Example
Get the list of law enforcement agencies

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_United_States_counties_and_county_equivalents'
url = 'https://en.wikipedia.org/wiki/List_of_United_States_state_and_local_law_enforcement_agencies'

## 3) Regular Expression
Convert information to a structured format

In [None]:
# an HTML element has a 'text' attribute, which drops the HTML tags and returns only the text
import re
print(contents[0].text)
#length=len(contents[0].text)
#"AUTAUGA COUNTY SHERIFF'S OFFICE" in contents[0].text
content = contents[0].text
#print(re.findall('.*(?=\w+)', content))
print(re.split(r"\s+", content))
actual_content = re.split(r"\s+", content)
actual_content = [x for x in actual_content if x != '']
title = []
for i in range(3):
    title.append(actual_content[i])
length=len(actual_content)
tickers=[]
for i in range(length):
    if 'AL' in actual_content[i]:
        tickers.append(i)
print(tickers)
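
The cell above finds code positions with `'AL' in ...`, which also matches any ordinary word that happens to contain "AL". As a hedged sketch, a regular expression can describe the shape of a code more strictly. The sample text and the assumed code format (two capital letters followed by digits) are made up for illustration and may not match the real page exactly.

```python
import re

# Made-up line: an agency name followed by two codes.
sample = "EXAMPLE COUNTY   SHERIFF'S OFFICE  XX00100 XX0010000"

# Split on any run of whitespace, as in the cell above.
tokens = re.split(r"\s+", sample.strip())

# A token counts as a code only if the whole token is two letters plus digits.
code_pattern = re.compile(r"[A-Z]{2}\d+")
code_positions = [i for i, tok in enumerate(tokens) if code_pattern.fullmatch(tok)]

# Everything before the first code is the agency name.
name = " ".join(tokens[:code_positions[0]])
print(code_positions, name)
```

Using `fullmatch` means a word like "EXAMPLE" is never mistaken for a code, even though it starts with two capital letters.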

### Practice
Use the `join` method we learned previously to get each office name

In [None]:
# This is the answer
first = ' '.join(actual_content[3:tickers[0]])
second = ' '.join(actual_content[tickers[1]+1:tickers[2]])
third = ' '.join(actual_content[tickers[3]+1:tickers[4]])
eachone = [first, second, third]

### Practice
Append the first office name, ORI7, and ORI9

In [None]:
# Below is the answer
agency=[]
agency.append(eachone[0])
agency.append(actual_content[tickers[0]])
agency.append(actual_content[tickers[0]+1])
print(agency)

Use a loop to get a list containing the other two agencies

In [None]:
# Below is the answer
agencies = []
for i in range(1,3):
    one=[]
    one.append(eachone[i])
    one.append(actual_content[tickers[2*i]])
    one.append(actual_content[tickers[2*i]+1])
    agencies.append(one)
print(agencies)    

## 4) Pandas dataframe
Build a pandas DataFrame from the collected agencies and save it to a file

In [None]:
agencies.append(agency)
import pandas as pd
agencyList = pd.DataFrame(agencies)
print(agencyList)
agencyList.to_csv('filename.csv', header = True, index = False)
# or can be saved as a string 
with open('filename.txt', 'w', encoding='utf-8') as f:
    f.write(agencyList.to_string(header = True, index = False))
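
If pandas is not available, a similar flat file can be written with the standard library's `csv` module. The sketch below uses made-up rows and illustrative column names (`agency`, `ori7`, `ori9`), and writes to an in-memory buffer so it runs anywhere. With pandas, the `columns=` argument of `pd.DataFrame` plays the same role as the header row here.

```python
import csv
import io

# Made-up rows in the same shape as the agencies list: name, ORI7, ORI9.
rows = [
    ["EXAMPLE COUNTY SHERIFF'S OFFICE", "XX00100", "XX0010000"],
    ["SAMPLE CITY POLICE DEPT", "XX00200", "XX0020000"],
]
header = ["agency", "ori7", "ori9"]   # illustrative column names

# In-memory file; for a real file use open('filename.csv', 'w', newline='').
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)      # first line: column names
writer.writerows(rows)       # one line per agency

csv_text = buffer.getvalue()
lines = csv_text.splitlines()
print(csv_text)
```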