# **Tutorial To Scrape PDFs from a Given URL**
This tutorial shows how to download all the PDFs linked from a given URL. It also shows how to read a PDF and save a table from it to a structured format such as CSV.

Importing the required libraries

In [None]:
import os
import requests
import urllib.request
import pandas as pd
from urllib.parse import urljoin
from bs4 import BeautifulSoup

In [None]:
# Tabula scrapes tables from PDFs
!pip install tabula-py
import tabula

Collecting tabula-py
  Downloading tabula_py-2.3.0-py3-none-any.whl (12.0 MB)
Collecting distro
  Downloading distro-1.6.0-py2.py3-none-any.whl (19 kB)
Installing collected packages: distro, tabula-py
Successfully installed distro-1.6.0 tabula-py-2.3.0


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


The URL input for downloading PDFs and setting up the output folder path

In [None]:
# Save contents from url into folder_location
url = 'https://www.premierleague.com/publications'
folder_location = r'/content/drive/MyDrive/Colab Notebooks/premier_league'
# Create the folder (and any missing parents) if it does not exist yet
os.makedirs(folder_location, exist_ok=True)

Downloading the PDFs

In [None]:
response = requests.get(url)
response.raise_for_status()  # Stop early if the page could not be fetched
soup = BeautifulSoup(response.text, "html.parser")

# Loop through all PDF links in the page
for link in soup.select("a[href$='.pdf']"):
    # Local file name is the same as the PDF file name in the URL (ignoring the rest of the path), e.g.
    # https://premierleague-static-files.s3.amazonaws.com/premierleague/document/2016/07/02/e1648e96-4eeb-456e-8ce0-d937d2bc7649/2011-12-premier-league-season-review.pdf
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url,link['href'])).content)
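The naming step in the loop above can be tried on its own, without any network access. The following is a minimal sketch rather than part of the original notebook: the helper name `pdf_target`, the relative link `/docs/report.pdf`, and the `downloads` folder are all hypothetical and serve only to show how `urljoin` and `split('/')` behave.

```python
import os
from urllib.parse import urljoin


def pdf_target(page_url, href, folder):
    """Return (absolute_url, local_path) for one PDF link.

    Hypothetical helper, for illustration only.
    """
    # Relative links are resolved against the page URL;
    # links that are already absolute pass through unchanged.
    absolute_url = urljoin(page_url, href)
    # The local file name is the last path segment of the link.
    local_path = os.path.join(folder, href.split("/")[-1])
    return absolute_url, local_path


page = "https://www.premierleague.com/publications"

# Made-up relative link, resolved against the page URL
relative_url, relative_path = pdf_target(page, "/docs/report.pdf", "downloads")

# Absolute link taken from the comment in the loop above
absolute_href = (
    "https://premierleague-static-files.s3.amazonaws.com/premierleague/document/"
    "2016/07/02/e1648e96-4eeb-456e-8ce0-d937d2bc7649/"
    "2011-12-premier-league-season-review.pdf"
)
absolute_url, absolute_path = pdf_target(page, absolute_href, "downloads")
```

Because only the last path segment is kept, two links that end in the same file name would overwrite each other in the output folder.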

Reading a table from a PDF document and storing it in a CSV file

In [None]:
combined_pdf = folder_location + "/This-is-PL-Interactive-Combined.pdf"
tabula.read_pdf(combined_pdf, pages='18')

[                                           Unnamed: 0  ... Total payment
 0                                                 NaN  ...       £149.4m
 1                                                 NaN  ...       £149.8m
 2                                                 NaN  ...       £144.4m
 3                                                 NaN  ...       £145.9m
 4                                                 NaN  ...       £141.7m
 5                                                 NaN  ...       £142.0m
 6                                                 NaN  ...       £119.8m
 7                                                 NaN  ...       £128.0m
 8                                                 NaN  ...       £118.2m
 9                                               £100m  ...       £123.0m
 10                                                NaN  ...           NaN
 11                                                NaN  ...           NaN
 12                                   

In [None]:
from tabula import convert_into

convert_into(combined_pdf, folder_location + "/table_output.csv", output_format="csv", pages=18, area=[[275, 504, 640, 900]])
pd.read_csv(folder_location + "/table_output.csv")

Unnamed: 0,Pos,Unnamed: 1,Club,W,D,L,GD,Pts,Total payment
0,1.0,,Manchester City,32.0,4.0,2.0,79.0,100.0,£149.4m
1,2.0,,Manchester United,25.0,6.0,7.0,40.0,81.0,£149.8m
2,3.0,,Tottenham Hotspur,23.0,8.0,7.0,38.0,77.0,£144.4m
3,4.0,,Liverpool,21.0,12.0,5.0,46.0,75.0,£145.9m
4,5.0,,Chelsea,21.0,7.0,10.0,24.0,70.0,£141.7m
5,6.0,,Arsenal,19.0,6.0,13.0,23.0,63.0,£142.0m
6,7.0,,Burnley,14.0,12.0,12.0,-3.0,54.0,£119.8m
7,8.0,,Everton,13.0,10.0,15.0,-14.0,49.0,£128.0m
8,9.0,,Leicester City,12.0,11.0,15.0,-4.0,47.0,£118.2m
9,10.0,,Newcastle United,12.0,8.0,18.0,-8.0,44.0,£123.0m
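
The `Total payment` column comes back as text such as `£149.4m`, so it cannot be summed or sorted numerically as it stands. One possible way to convert it is sketched below. The function name `parse_payment` is hypothetical, and the sketch assumes every value has exactly the `£<number>m` shape seen in the output above.

```python
def parse_payment(text):
    """Convert a string such as '£149.4m' to a float in millions of pounds.

    Hypothetical helper; assumes the '£<number>m' format shown in the table.
    """
    cleaned = text.strip()
    if not (cleaned.startswith("£") and cleaned.endswith("m")):
        raise ValueError(f"unexpected payment format: {text!r}")
    # Drop the leading pound sign and the trailing 'm'
    return float(cleaned[1:-1])


# First three values from the table above
payments = ["£149.4m", "£149.8m", "£144.4m"]
millions = [parse_payment(p) for p in payments]
```

With the DataFrame loaded from `table_output.csv`, the same function could be applied with `Series.map` on the `Total payment` column, provided the column has no missing values.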
