## Python Best Practices

### Reading in PDFs while checking if file already exists in folder
**Source:** Sarah's Discussion Notebook

In [1]:
# Import statements should be at top of code!
import os
import pandas as pd
import requests
import PyPDF2
import re # regular expressions

In [2]:
# __file__ is only defined when Python runs a .py script, so this line fails inside a notebook
os.chdir(os.path.dirname(os.path.abspath(__file__)))

NameError: name '__file__' is not defined
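
The `NameError` above arises because `__file__` only exists when Python runs a `.py` script. One possible workaround, sketched here with a hypothetical helper name `project_dir` that is not part of the original notebook, is to fall back to the current working directory whenever `__file__` is missing:

```python
from pathlib import Path

def project_dir():
    """Return the script's folder, or the working directory in a notebook."""
    try:
        # __file__ is defined when Python runs a .py file
        return Path(__file__).resolve().parent
    except NameError:
        # Notebooks and interactive sessions have no __file__
        return Path.cwd()

base_dir = project_dir()
print(base_dir)
```

The same code can then be pasted into a script or a notebook cell without edits.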

In [2]:
# os.getcwd() returns the absolute path of the current working directory as a string
path = os.getcwd()
path

'/Users/ezhang/Documents/**Fall 2020/TA - Data Skills for Policy/emily-discussion-notebooks'

In [3]:
url = 'https://countyofsb.org/ceo/asset.c/4171'
filename = 'FY_2020_21_Section_B_Executive_Summary.pdf'

In [4]:
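def get_pdf(url, filename, path):
    # Download the file at url and save it as filename inside path
    response = requests.get(url)
    response.raise_for_status()  # stop here rather than save an error page as a PDF
    with open(os.path.join(path, filename), 'wb') as ofile:
        ofile.write(response.content)


# Look in path explicitly so the check matches where get_pdf saves the file
if filename not in os.listdir(path):
    print('downloading document from {}'.format(url))
    get_pdf(url, filename, path)
else:
    print('document already in {}'.format(path))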

document already in /Users/ezhang/Documents/**Fall 2020/TA - Data Skills for Policy/emily-discussion-notebooks
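
The check above scans the whole folder listing. A possible alternative is to test the one full path directly with `os.path.exists`. The helper name `needs_download` and the file name `report.pdf` below are hypothetical, and a temporary folder stands in for the real download folder:

```python
import os
import tempfile

def needs_download(filename, path):
    """Return True when filename is not yet present in the folder path."""
    return not os.path.exists(os.path.join(path, filename))

with tempfile.TemporaryDirectory() as folder:
    before = needs_download('report.pdf', folder)            # nothing saved yet
    open(os.path.join(folder, 'report.pdf'), 'wb').close()   # create an empty placeholder file
    after = needs_download('report.pdf', folder)             # the file is there now
    print(before, after)
```

Testing the joined path also keeps the check correct when `path` is not the current working directory.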


In [5]:
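def read_pdf(filename, path):
    # Return a list with the extracted text of each page of the PDF
    with open(os.path.join(path, filename), 'rb') as ifile:
        # Written for PyPDF2 < 3.0; from 3.0 on these names are PdfReader,
        # len(pdf.pages), pdf.pages[p] and page.extract_text()
        pdf = PyPDF2.PdfFileReader(ifile)

        print('Number of pages:', pdf.numPages)

        pages = []
        for p in range(pdf.numPages):
            page = pdf.getPage(p)
            text = page.extractText()
            text = text.replace("™", "'")  # apostrophes come out as ™ in this PDF
            text = text.replace("\n", "")  # drop line breaks
            pages.append(text)

        return pages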

pages = read_pdf(filename, path)

Number of pages: 24
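
`read_pdf` repairs one extraction artifact by hand (`™` standing in for an apostrophe). When several artifacts show up, a lookup dictionary keeps the fixes in one place. This is only a sketch: the name `clean_text` is hypothetical, and the mapping is an assumption that `ﬁ` and `ﬂ` in this PDF's extracted text stand for opening and closing quotation marks.

```python
# Assumed mapping from extraction artifacts to the intended characters
REPLACEMENTS = {'ﬁ': '"', 'ﬂ': '"', '™': "'"}

def clean_text(text, replacements=REPLACEMENTS):
    """Apply every artifact -> character replacement in turn."""
    for bad, good in replacements.items():
        text = text.replace(bad, good)
    return text

sample = "the State Public Health Officer™s ﬁStayatHomeﬂ Order"
print(clean_text(sample))
```

The mapping is specific to one document; in another PDF, `ﬁ` could be a genuine "fi" ligature, so check a few pages before reusing it.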


In [6]:
pages

['                  Section B  Executive Summary ',
 '           ',
 "Executive Summary B1 Adapting to an Unprecedented Pandemic  and Preparing for the ﬁNext Normalﬂ The COVID19 pandemic has caused a national recession, a sudden reduction in state and local revenues, and severe economic distress for businesses, families, community organizations and public agencies.  It has jeopardized the health of our communities and our economic livelihood.  As of the writing of the Recommended Budget, the County is still under the State Public Health Officer's ﬁStayatHomeﬂ Order, with very limited easing of restrictions.   In preparing this Recommended Budget, which began in November 2019 and was completed in early May to meet statutory requirements for public review, we lacked complete information on the extent and severity of the disruption to the County's Budget.  The Governor's May Revise of the State's budget demonstrates a deficit of $54.3 billion, and it will be several more months before the

### String formatting with format or concatenation

In [7]:
years = [2018, 2019]

for year in years:
    url = 'https://www.eia.gov/outlooks/aeo/pdf/{}.pdf'.format(year)  # {} is filled in with year
    print(url)

https://www.eia.gov/outlooks/aeo/pdf/2018.pdf
https://www.eia.gov/outlooks/aeo/pdf/2019.pdf


In [8]:
for year in years:
    url = 'https://www.eia.gov/outlooks/aeo/pdf/' + str(year) + '.pdf'
    print(url)

https://www.eia.gov/outlooks/aeo/pdf/2018.pdf
https://www.eia.gov/outlooks/aeo/pdf/2019.pdf
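
A third option, available in Python 3.6 and later, is an f-string, which places the variable directly inside the braces. A minimal sketch using the same URL pattern (the list name `urls` is introduced only for this example):

```python
years = [2018, 2019]

# The f prefix tells Python to evaluate {year} inside the string
urls = [f'https://www.eia.gov/outlooks/aeo/pdf/{year}.pdf' for year in years]

for url in urls:
    print(url)
```

Pick one style per string: an f-string already fills in its braces, so it should not be combined with a `.format()` call.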


### Data cleaning
#### Using replace

In [10]:
sentence = 'ŠHello this isŁ a sample sentence with random characters in it!'

clean = sentence.replace('Š', '')
clean = clean.replace('Ł', '')
clean

'Hello this is a sample sentence with random characters in it!'

#### Using regular expressions

Note that you have to import re

In [11]:
# Use the | character between symbols you want to remove; special characters such as [ and ] need a backslash
clean = re.sub(r'Š|Ł|,|\[|\]', '', sentence)
clean

'Hello this is a sample sentence with random characters in it!'
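
Characters such as `[`, `]`, and `|` have special meaning in a regular expression, so a pattern typed by hand can silently match something other than what was intended. One way to sidestep this, sketched below, is to build the pattern from a plain list and let `re.escape` add the backslashes. The sample sentence is a variation on the one above with brackets and a comma added, and the names `unwanted` and `pattern` are introduced for illustration.

```python
import re

sentence = 'ŠHello this isŁ a [sample] sentence, with random characters in it!'

unwanted = ['Š', 'Ł', ',', '[', ']']

# re.escape adds a backslash wherever a character is special in a regex
pattern = '|'.join(re.escape(ch) for ch in unwanted)

clean = re.sub(pattern, '', sentence)
print(clean)
```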

In [13]:
lower = clean.lower()
lower

'hello this is a sample sentence with random characters in it!'

## Functions in Python

In [15]:
def add(x, y):
    return x + y

total = add(2, 5) # you need to call a function to use it! also, you need to save the output into a 
                    # variable if you want to use it later
total

7

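A function can return more than one value. Separate the values with commas after return, and unpack them into separate variables when you call the function.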

In [16]:
def calculations(x, y):
    total = x + y      # named total rather than add, so the add() function above is not overwritten
    product = x * y
    return total, product

total, product = calculations(2, 5)

print('sum: {}'.format(total))
print('product: {}'.format(product))

sum: 7
product: 10
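
Behind the scenes, a function that returns several values returns a single tuple, and the assignment unpacks it. A short sketch (the variable names `result`, `total`, and `product` are chosen for this example):

```python
def calculations(x, y):
    return x + y, x * y   # the comma builds a tuple

result = calculations(2, 5)
print(type(result).__name__, result)

# Unpacking assigns each element of the tuple to its own variable
total, product = result
print(total, product)
```

Keeping the tuple in one variable is handy when the values are passed along together; unpacking is clearer when they are used separately.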


## Lambda
A lambda is an anonymous function: a small, unnamed function written as a single expression.

In [36]:
# Side lesson: creating data frames from a dictionary
family_dict = {'Person':['Kevin', 'Megan', 'Carol', 'Lewis'], 
        'Age': [8, 19, 35, 37]} 

family_df = pd.DataFrame(family_dict)
family_df

   Person  Age
0   Kevin    8
1   Megan   19
2   Carol   35
3   Lewis   37


In [37]:
family_df['Age'] = family_df['Age'].apply(lambda x: x + 3)
family_df

   Person  Age
0   Kevin   11
1   Megan   22
2   Carol   38
3   Lewis   40
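
A lambda is shorthand for a one-line named function, and it is also common outside pandas, for example as the `key` argument of `sorted`. The sketch below uses a plain list of tuples with the same names and ages; `family`, `add_three`, and `add_three_named` are names made up for this example.

```python
family = [('Kevin', 8), ('Megan', 19), ('Carol', 35), ('Lewis', 37)]

# These two definitions behave the same; the lambda just skips def and return
add_three = lambda age: age + 3

def add_three_named(age):
    return age + 3

new_ages = [add_three(age) for _, age in family]
print(new_ages)

# A lambda passed as the sort key: order people by age, oldest first
oldest_first = sorted(family, key=lambda person: person[1], reverse=True)
print(oldest_first[0])
```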


## Loops
#### Enumerate

In [35]:
fruits = ["apple", "banana", "cherry"]

for i, val in enumerate(fruits):
    print(i, val)

0 apple
1 banana
2 cherry
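
`enumerate` starts counting at 0 by default. Its optional `start` argument changes the first number, which helps when the counter is meant for people rather than for indexing. A small sketch (the list name `numbered` is introduced here):

```python
fruits = ["apple", "banana", "cherry"]

# start=1 makes the counter begin at 1 instead of 0
numbered = [f'{n}. {fruit}' for n, fruit in enumerate(fruits, start=1)]

for line in numbered:
    print(line)
```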


#### Dictionaries
Loop over a dictionary with dictionary.items() (key-value pairs), dictionary.keys() (keys only), or dictionary.values() (values only).

In [18]:
pages = {'2018.pdf': [1,26], '2019.pdf': [1, 35]}

for key, value in pages.items():
    print(key, value)

2018.pdf [1, 26]
2019.pdf [1, 35]


In [20]:
for value in pages.values():
    print(value)

[1, 26]
[1, 35]
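
The remaining method, `keys()`, works the same way. When each value is itself a short list, it can also be unpacked directly in the `for` line. The sketch below assumes each value holds a first and last page number, which the notebook does not state, and `page_counts` is a name introduced for this example.

```python
pages = {'2018.pdf': [1, 26], '2019.pdf': [1, 35]}

# keys() yields only the keys
for name in pages.keys():
    print(name)

# Unpack each [first, last] list while looping over items()
page_counts = {}
for name, (first, last) in pages.items():
    page_counts[name] = last - first + 1   # pages in the range, inclusive

print(page_counts)
```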


#### Nested Loops

In [21]:
energy_types = ['solar', 'wind', 'gas', 'coal']
categories = ['emissions', 'price']

for energy_type in energy_types:
    for category in categories:
        print(energy_type, category)
    print('')

solar emissions
solar price

wind emissions
wind price

gas emissions
gas price

coal emissions
coal price
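
When the inner loop only needs every combination of the two lists, `itertools.product` from the standard library produces the same pairs in the same order with a single loop. A minimal sketch (the name `pairs` is introduced here):

```python
from itertools import product

energy_types = ['solar', 'wind', 'gas', 'coal']
categories = ['emissions', 'price']

# product() yields every (energy_type, category) pair, with the last list varying fastest
pairs = list(product(energy_types, categories))

for energy_type, category in pairs:
    print(energy_type, category)
```

The explicit nested loop is still the better fit when something has to happen between groups, such as printing the blank line after each energy type.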



One way to use a nested loop is to go through a dataframe in **row major order**. That is, the outer loop steps through the rows, and for each row the inner loop steps through every column before moving on to the next row.

In [27]:
# Side lesson
energy_dict = {'energy type':['solar', 'wind', 'gas', 'coal'], 
        'emissions':['N/A', 'N/A', 'up', 'up'],
        'price': ['down', 'down', 'N/A', 'up']} 

energy_df = pd.DataFrame(energy_dict)
energy_df

   energy type  emissions  price
0        solar        N/A   down
1         wind        N/A   down
2          gas         up    N/A
3         coal         up     up


In [32]:
energy_df.shape

(4, 3)

In [24]:
nrows, ncols = energy_df.shape

for i in range(nrows):
    for j in range(ncols):
        print(i, j, energy_df.iloc[i,j])

0 0 solar
0 1 N/A
0 2 down
1 0 wind
1 1 N/A
1 2 down
2 0 gas
2 1 up
2 2 N/A
3 0 coal
3 1 up
3 2 up
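
Swapping the two loops gives column major order instead: every row of the first column, then every row of the second, and so on. The sketch below shows both orders on a plain list of lists holding the first two rows of the table; `table`, `row_major`, and `col_major` are names introduced for this example.

```python
table = [
    ['solar', 'N/A', 'down'],
    ['wind', 'N/A', 'down'],
]
nrows, ncols = len(table), len(table[0])

# Row major: outer loop over rows, inner loop over columns
row_major = [table[i][j] for i in range(nrows) for j in range(ncols)]

# Column major: outer loop over columns, inner loop over rows
col_major = [table[i][j] for j in range(ncols) for i in range(nrows)]

print(row_major)
print(col_major)
```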


You can also loop through the dataframe itself using items() (named iteritems() before pandas 2.0), iterrows(), or itertuples().

In [25]:
for i in energy_df.itertuples(): 
    print(i)

Pandas(Index=0, _1='solar', emissions='N/A', price='down')
Pandas(Index=1, _1='wind', emissions='N/A', price='down')
Pandas(Index=2, _1='gas', emissions='up', price='N/A')
Pandas(Index=3, _1='coal', emissions='up', price='up')


In [26]:
# Access the fields by name; 'energy type' is not a valid Python name, so it appears as _1
for i in energy_df.itertuples(): 
    print(i._1, i.emissions, i.price)

solar N/A down
wind N/A down
gas up N/A
coal up up
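
The `_1` field appears because `energy type` contains a space and so cannot be used as an attribute name; pandas documents that such columns are renamed to positional names. The standard library's `namedtuple` applies the same rule when called with `rename=True`, which makes the behavior easy to see without a dataframe. This sketch illustrates the rule only and is not pandas' own code; `Row` and `columns` are names introduced here.

```python
from collections import namedtuple

columns = ['Index', 'energy type', 'emissions', 'price']

# rename=True replaces invalid field names with _<position>
Row = namedtuple('Pandas', columns, rename=True)
print(Row._fields)

row = Row(0, 'solar', 'N/A', 'down')
print(row._1, row.emissions, row.price)
```

Renaming the column to something like `energy_type` before looping would give a readable attribute name instead of `_1`.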
