# PDFs and Spreadsheets
While Excel files and Google Sheets spreadsheets can be exported to .csv files, the export contains only the raw data: formulas, formatting, and additional sheets are lost.

Libraries to consider:
- Pandas: Data analysis library.
- Openpyxl: Designed to work specifically with Excel files.
- Google Sheets Python API: Direct Python interface for working with Google Spreadsheets.

---
## Spreadsheets

In [26]:
import csv

### Open the file

In [27]:
data = open('Practice_Files/example.csv', encoding='utf-8')

In [28]:
csv_data = csv.reader(data)

### Convert it into a Python list of lists

In [29]:
data_lines = list(csv_data)
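The cells above open the file and never close it. A minimal sketch of the same steps using a `with` block, which closes the file automatically. The sample rows and the temporary path are stand-ins for `Practice_Files/example.csv`, not the real file:

```python
import csv
import os
import tempfile

# Hypothetical sample data standing in for Practice_Files/example.csv.
sample = (
    "id,first_name,last_name,email\n"
    "1,Joseph,Zaniolini,jzaniolini0@simplemachines.org\n"
    "2,Freida,Drillingcourt,fdrillingcourt1@umich.edu\n"
)

# Write the sample to a temporary file so the sketch is self-contained.
csv_path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(csv_path, mode="w", encoding="utf-8", newline="") as f:
    f.write(sample)

# The with block closes the file even if reading raises an error.
# newline='' lets the csv module handle line endings itself.
with open(csv_path, encoding="utf-8", newline="") as f:
    data_lines = list(csv.reader(f))

print(data_lines[0])    # header row
print(len(data_lines))  # header plus two data rows
```

The list must be built inside the `with` block, because `csv.reader` reads lazily and the file is closed once the block ends.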

### Working with the data
#### Headers

In [30]:
data_lines[0]

['id', 'first_name', 'last_name', 'email', 'gender', 'ip_address', 'city']

In [31]:
len(data_lines)

1001

#### Lines

In [32]:
for line in data_lines[:5]:
    print(line)

['id', 'first_name', 'last_name', 'email', 'gender', 'ip_address', 'city']
['1', 'Joseph', 'Zaniolini', 'jzaniolini0@simplemachines.org', 'Male', '163.168.68.132', 'Pedro Leopoldo']
['2', 'Freida', 'Drillingcourt', 'fdrillingcourt1@umich.edu', 'Female', '97.212.102.79', 'Buri']
['3', 'Nanni', 'Herity', 'nherity2@statcounter.com', 'Female', '145.151.178.98', 'Claver']
['4', 'Orazio', 'Frayling', 'ofrayling3@economist.com', 'Male', '25.199.143.143', 'Kungur']


#### Extract one row

In [33]:
data_lines[10]

['10',
 'Hyatt',
 'Gasquoine',
 'hgasquoine9@google.ru',
 'Male',
 '221.155.106.39',
 'Złoty Stok']

#### Extract a field from row

In [34]:
data_lines[10][3]

'hgasquoine9@google.ru'

#### Extract one column

In [38]:
all_emails = []

In [39]:
for line in data_lines[1:10]:
    all_emails.append(line[3])

In [40]:
all_emails

['jzaniolini0@simplemachines.org',
 'fdrillingcourt1@umich.edu',
 'nherity2@statcounter.com',
 'ofrayling3@economist.com',
 'jmurrison4@cbslocal.com',
 'lgamet5@list-manage.com',
 'dhowatt6@amazon.com',
 'kherion7@amazon.com',
 'chedworth8@china.com.cn']

##### With list comprehension

In [43]:
[ line[3] for line in data_lines[1:5] ]

['jzaniolini0@simplemachines.org',
 'fdrillingcourt1@umich.edu',
 'nherity2@statcounter.com',
 'ofrayling3@economist.com']
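Indexing by position (`line[3]`) breaks if the column order changes. One possible alternative is `csv.DictReader`, which uses the header row as keys. This sketch reads from an in-memory sample (an assumption, to keep it self-contained) instead of the practice file:

```python
import csv
import io

# In-memory stand-in for the CSV file; the rows are copied from the output above.
sample = io.StringIO(
    "id,first_name,last_name,email\n"
    "1,Joseph,Zaniolini,jzaniolini0@simplemachines.org\n"
    "2,Freida,Drillingcourt,fdrillingcourt1@umich.edu\n"
    "3,Nanni,Herity,nherity2@statcounter.com\n",
    newline="",
)

# DictReader consumes the first row as field names,
# so every following row becomes a dict keyed by column name.
reader = csv.DictReader(sample)
emails = [row["email"] for row in reader]

print(emails)
```

With `DictReader` there is no need to skip the header manually with a `[1:]` slice.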

#### Putting columns together

In [44]:
full_names = []

In [45]:
for line in data_lines[1:6]:
    full_names.append(line[1]+" "+line[2])

In [46]:
full_names

['Joseph Zaniolini',
 'Freida Drillingcourt',
 'Nanni Herity',
 'Orazio Frayling',
 'Julianne Murrison']
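The same concatenation can be written as a list comprehension with an f-string. A small sketch, using two rows copied from the output above as sample data:

```python
# Sample rows in the same shape as data_lines (header plus two records).
data_lines = [
    ['id', 'first_name', 'last_name', 'email', 'gender', 'ip_address', 'city'],
    ['1', 'Joseph', 'Zaniolini', 'jzaniolini0@simplemachines.org', 'Male', '163.168.68.132', 'Pedro Leopoldo'],
    ['2', 'Freida', 'Drillingcourt', 'fdrillingcourt1@umich.edu', 'Female', '97.212.102.79', 'Buri'],
]

# Skip the header row, then join first_name (index 1) and last_name (index 2).
full_names = [f"{line[1]} {line[2]}" for line in data_lines[1:]]

print(full_names)  # ['Joseph Zaniolini', 'Freida Drillingcourt']
```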

### Writing a csv file

In [48]:
file_to_output = open('Practice_Files/to_save_file.csv', mode='w', newline='')

In [49]:
csv_writer = csv.writer(file_to_output, delimiter=',')

In [50]:
csv_writer.writerow(['a', 'b', 'c'])

7

In [51]:
csv_writer.writerows([['1', '2', '3'], ['4', '5', '6']])

In [52]:
file_to_output.close()
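If `close()` is never reached, buffered rows may not be written to disk. A sketch of the same write using a `with` block, followed by reading the file back to check the result. The temporary path is an assumption used in place of `Practice_Files/to_save_file.csv`:

```python
import csv
import os
import tempfile

# Temporary location standing in for Practice_Files/to_save_file.csv.
out_path = os.path.join(tempfile.mkdtemp(), "to_save_file.csv")

# newline='' stops Python from translating the '\r\n' the csv module writes.
with open(out_path, mode="w", newline="") as file_to_output:
    csv_writer = csv.writer(file_to_output, delimiter=",")
    csv_writer.writerow(["a", "b", "c"])                       # one row
    csv_writer.writerows([["1", "2", "3"], ["4", "5", "6"]])  # several rows

# Read the file back to confirm what was saved.
with open(out_path, newline="") as f:
    rows = list(csv.reader(f))

print(rows)  # [['a', 'b', 'c'], ['1', '2', '3'], ['4', '5', '6']]
```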

### Adding to a file

In [53]:
f = open('Practice_Files/to_save_file.csv', mode='a', newline='')

In [54]:
csv_writer = csv.writer(f)

In [56]:
csv_writer.writerow(['1', '2', '3']) # writerow returns the number of characters written

7

In [57]:
f.close()
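The `7` returned by `writerow` above counts the line terminator as well as the fields and delimiters. A short sketch that makes this visible by writing to an in-memory buffer (an assumption, used here so that the exact characters can be inspected):

```python
import csv
import io

# newline='' keeps the buffer from translating line endings.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)

# writerow passes through the return value of the underlying write() call.
written = writer.writerow(["1", "2", "3"])

print(written)                  # 7
print(repr(buffer.getvalue()))  # '1,2,3\r\n'
```

Five characters for `1,2,3` plus the default `\r\n` line terminator gives seven.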

 
---
## PDFs
The most important thing to keep in mind is that although all PDFs share the same extension and can be viewed in PDF readers, many PDFs are **not** machine readable through Python. Unlike CSV files, PDFs have no standard machine-readable structure for their text.

There are many paid PDF programs that can read and extract data from these files, but we will use the free, open-source **PyPDF4** library, a fork of PyPDF2.

### Reading a PDF file

In [97]:
import PyPDF4

In [98]:
f = open('Practice_Files/Working_Business_Proposal.pdf', 'rb')

In [100]:
pdf_reader = PyPDF4.PdfFileReader(f)

In [101]:
len(pdf_reader.pages)

5

In [102]:
page_one = pdf_reader.pages[0]

In [104]:
page_one_text = page_one.extractText()

In [105]:
page_one_text

'Business Proposal\n \nThe Revolution is Coming\n \nLeverage agile frameworks to provide a robust synopsis for high level \noverviews. Iterative approaches to corporate strategy foster collaborative \nthinking to further the overall value proposition. Organically grow the \nholistic world view of disruptive innovation via workplace diversity and \nempowerment. \nBring to the table win-win survival strategies to ensure proactive \ndomination. At the end of the day, going forward, a new normal that has \nevolved from generation X is on the runway heading towards a streamlined \ncloud solution. User generated content in real-time will have multiple \ntouchpoints for offshoring. \nCapitalize on low hanging fruit to identify a ballpark value added activity to \nbeta test. Override the digital divide with additional clickthroughs from \nDevOps. Nanotechnology immersion along the information highway will \nclose the loop on focusing solely on the bottom line. \nPodcasting operational change m

In [106]:
f.close()

### Writing to a PDF file

In [117]:
f = open('Practice_Files/Working_Business_Proposal.pdf', 'rb')

pdf_reader = PyPDF4.PdfFileReader(f)

In [118]:
first_page = pdf_reader.pages[0]

In [119]:
pdf_writer = PyPDF4.PdfFileWriter()

In [120]:
type(first_page)

PyPDF4.pdf.PageObject

In [121]:
pdf_writer.addPage(first_page)

In [122]:
pdf_output = open('Practice_Files/Some_BrandNew_Doc.pdf', 'wb')

In [123]:
pdf_writer.write(pdf_output)

In [124]:
f.close()

In [125]:
pdf_output.close()

### Getting all the text from a PDF

In [126]:
f = open('Practice_Files/Working_Business_Proposal.pdf', 'rb')
pdf_text = []
pdf_reader = PyPDF4.PdfFileReader(f)

for num in range(pdf_reader.numPages):
    page = pdf_reader.getPage(num)
    pdf_text.append(page.extractText())

In [127]:
pdf_text

['Business Proposal\n \nThe Revolution is Coming\n \nLeverage agile frameworks to provide a robust synopsis for high level \noverviews. Iterative approaches to corporate strategy foster collaborative \nthinking to further the overall value proposition. Organically grow the \nholistic world view of disruptive innovation via workplace diversity and \nempowerment. \nBring to the table win-win survival strategies to ensure proactive \ndomination. At the end of the day, going forward, a new normal that has \nevolved from generation X is on the runway heading towards a streamlined \ncloud solution. User generated content in real-time will have multiple \ntouchpoints for offshoring. \nCapitalize on low hanging fruit to identify a ballpark value added activity to \nbeta test. Override the digital divide with additional clickthroughs from \nDevOps. Nanotechnology immersion along the information highway will \nclose the loop on focusing solely on the bottom line. \nPodcasting operational change 
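The extracted text keeps the PDF's hard line wraps as `\n`, which makes it awkward to search or print. A rough sketch of one way to tidy each page, using plain string methods only. `normalize_page` is a hypothetical helper, and the sample string is a shortened copy of the beginning of the output above:

```python
# Shortened copy of the first page's extracted text (not the full page).
pdf_text = [
    "Business Proposal\n \nThe Revolution is Coming\n \n"
    "Leverage agile frameworks to provide a robust synopsis for high level \n"
    "overviews."
]


def normalize_page(text):
    """Collapse every run of whitespace, newlines included, into one space."""
    # str.split() with no arguments splits on any whitespace run.
    return " ".join(text.split())


clean_pages = [normalize_page(page) for page in pdf_text]

# Join pages with a blank line between them.
full_text = "\n\n".join(clean_pages)
print(full_text)
```

This also discards paragraph breaks, so it suits searching and keyword matching better than reproducing the original layout.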