# 3. Downloading PDFs

This Jupyter notebook downloads PDFs of the books a user has read from a website called Library Genesis (libgen.rs).

How this works:
1. For each book, four different search terms are created to increase the probability that a PDF is found.
2. For these books, the website libgen.rs is scraped, looking for books in English and in PDF format.
3. These PDFs are saved in a folder called 'BookDownloads3'.
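
The idea behind step 1 is a simple fallback: try the most specific search term first and move on to the next one only when a search fails. The sketch below illustrates that control flow in isolation. `first_successful_term` and `fake_search` are hypothetical names used for illustration only; `fake_search` is a stand-in and does not contact any website.

```python
from typing import Callable, Iterable, Optional


def first_successful_term(terms: Iterable[str],
                          search: Callable[[str], bool]) -> Optional[str]:
    """Return the first term for which search(term) succeeds, else None."""
    for term in terms:
        try:
            if search(term):
                return term
        except Exception:
            # A failed lookup is treated like "not found": try the next term.
            continue
    return None


def fake_search(term: str) -> bool:
    """Hypothetical stand-in for a real search: only the short title matches."""
    if "Collins" in term:
        raise LookupError("no results")
    return term == "Good to Great"


terms = [
    "Good to Great: Why Some Companies Make the Leap... and Others Don't Collins, James C.",
    "Good to Great",
]
found = first_successful_term(terms, fake_search)
print(found)  # Good to Great
```

Ordering the terms from most to least specific means a looser term is only used when a stricter one has already failed.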

### Importing Libraries 

In [1]:
import pandas as pd
import requests
import time
import os
import warnings
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import ElementClickInterceptedException
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
warnings.filterwarnings(action='once')

#os.chdir('C:\\Users\\HP\\Data Science\\GoodReads')

In [2]:
# Defaults for webscraping
chrome_options = Options()
chrome_options.add_argument("--safebrowsing-disable-download-protection")
chrome_options.add_argument("--safebrowsing-disable-extension-blacklist")
chrome_options.add_experimental_option("prefs", {
  "download.default_directory": r'C:\Users\HP\Data Science\GoodReads\BookDownloads3',
  "download.prompt_for_download": False,
  "download.directory_upgrade": True,
  "safebrowsing.enabled": False
})

# Website we'll scrape
url = 'https://libgen.rs/'
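
The download directory above is hard-coded to one Windows machine. As a hedged alternative, the sketch below builds the same preferences from a base directory, so the folder is created if missing and the path is always absolute (an absolute path is the safe choice for `download.default_directory`). `make_download_prefs` is a hypothetical helper, not part of the original notebook.

```python
import os
import tempfile


def make_download_prefs(base_dir, folder="BookDownloads3"):
    """Create the download folder under base_dir and return Chrome prefs for it."""
    download_dir = os.path.abspath(os.path.join(base_dir, folder))
    os.makedirs(download_dir, exist_ok=True)  # no error if it already exists
    return {
        "download.default_directory": download_dir,
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": False,
    }


# Demonstrated with a temporary directory; in the notebook, os.getcwd()
# would be a natural choice for base_dir.
prefs = make_download_prefs(tempfile.mkdtemp())
print(prefs["download.default_directory"])
```

The returned dictionary can be passed to `chrome_options.add_experimental_option("prefs", prefs)` in place of the literal dictionary.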

In [3]:
# Importing the DF
df = pd.read_csv("books.csv")
df.drop(['Unnamed: 0'], axis = 1, inplace = True) 

df.head(5)

| | title | author | num pages | avg rating | num ratings | date pub | rating | date read |
|---|---|---|---|---|---|---|---|---|
| 0 | Grit: The Power of Passion and Perseverance | Duckworth, Angela* | 277 | 4.08 | 97922 | May 03, 2016 | 4 | Apr 13, 2021 |
| 1 | Talking to Strangers: What We Should Kno... | Gladwell, Malcolm | 388 | 4.02 | 234724 | Sep 10, 2019 | 3 | Mar 12, 2019 |
| 2 | The Subtle Art of Not Giving a F*ck: A C... | Manson, Mark* | 212 | 3.91 | 792438 | Sep 13, 2016 | 2 | Feb 19, 2021 |
| 3 | The Righteous Mind: Why Good People Are ... | Haidt, Jonathan | 419 | 4.21 | 46190 | Mar 13, 2012 | 5 | - |
| 4 | Maybe You Should Talk to Someone: A Ther... | Gottlieb, Lori* | 415 | 4.38 | 223425 | Apr 02, 2019 | 3 | - |


### Creating different search terms for each book

In [4]:
# Copy the two columns so new columns can be added without a SettingWithCopyWarning
df_searchterms = df[['title', 'author']].copy()

# term1: full title + author
df_searchterms['term1'] = df_searchterms['title'] + ' ' + df_searchterms['author']
# term2: title up to the first colon
df_searchterms['term2'] = df_searchterms['title'].map(lambda x: x.split(":", 1)[0])
# term3: title up to the first colon + author
df_searchterms['term3'] = df_searchterms['term2'] + ' ' + df_searchterms['author']
# term4: first seven words of the title + author
df_searchterms['term4'] = df_searchterms['title'].map(lambda x: ' '.join(x.split()[:7])) + ' ' + df_searchterms['author']

df_searchterms


| | title | author | term1 | term2 | term3 | term4 |
|---|---|---|---|---|---|---|
| 0 | Grit: The Power of Passion and Perseverance | Duckworth, Angela* | Grit: The Power of Passion and Persevera... | Grit | Grit Duckworth, Angela* | Grit: The Power of Passion and Perseverance Du... |
| 1 | Talking to Strangers: What We Should Kno... | Gladwell, Malcolm | Talking to Strangers: What We Should Kno... | Talking to Strangers | Talking to Strangers Gladwell, Malcolm | Talking to Strangers: What We Should Know Glad... |
| 2 | The Subtle Art of Not Giving a F*ck: A C... | Manson, Mark* | The Subtle Art of Not Giving a F*ck: A C... | The Subtle Art of Not Giving a F*ck | The Subtle Art of Not Giving a F*ck Mans... | The Subtle Art of Not Giving a Manson, Mark* |
| 3 | The Righteous Mind: Why Good People Are ... | Haidt, Jonathan | The Righteous Mind: Why Good People Are ... | The Righteous Mind | The Righteous Mind Haidt, Jonathan | The Righteous Mind: Why Good People Are Haidt,... |
| 4 | Maybe You Should Talk to Someone: A Ther... | Gottlieb, Lori* | Maybe You Should Talk to Someone: A Ther... | Maybe You Should Talk to Someone | Maybe You Should Talk to Someone Gottlie... | Maybe You Should Talk to Someone: A Gottlieb, ... |
| 5 | Outliers: The Story of Success | Gladwell, Malcolm | Outliers: The Story of Success Gladwell,... | Outliers | Outliers Gladwell, Malcolm | Outliers: The Story of Success Gladwell, Malcolm |
| 6 | Man's Search for Meaning | Frankl, Viktor E. | Man's Search for Meaning Frankl, Viktor E. | Man's Search for Meaning | Man's Search for Meaning Frankl, Viktor E. | Man's Search for Meaning Frankl, Viktor E. |
| 7 | Blink: The Power of Thinking Without Thi... | Gladwell, Malcolm | Blink: The Power of Thinking Without Thi... | Blink | Blink Gladwell, Malcolm | Blink: The Power of Thinking Without Thinking ... |
| 8 | Fooled by Randomness: The Hidden Role of... | Taleb, Nassim Nicholas* | Fooled by Randomness: The Hidden Role of... | Fooled by Randomness | Fooled by Randomness Taleb, Nassim Nicho... | Fooled by Randomness: The Hidden Role of Taleb... |
| 9 | Deep Work: Rules for Focused Success in ... | Newport, Cal | Deep Work: Rules for Focused Success in ... | Deep Work | Deep Work Newport, Cal | Deep Work: Rules for Focused Success in Newpor... |
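
To make the four columns in the table above concrete, here is the same string logic applied to a single book, without pandas. `build_search_terms` is a hypothetical helper written for illustration; it mirrors the `split` and `join` calls used for `term1` to `term4`.

```python
from typing import List


def build_search_terms(title: str, author: str) -> List[str]:
    """Return the four search-term variants for one book, most specific first."""
    short_title = title.split(":", 1)[0]        # text before the first colon
    first_seven = " ".join(title.split()[:7])   # first seven words of the title
    return [
        f"{title} {author}",        # term1: full title + author
        short_title,                # term2: short title only
        f"{short_title} {author}",  # term3: short title + author
        f"{first_seven} {author}",  # term4: first seven words + author
    ]


terms = build_search_terms("Grit: The Power of Passion and Perseverance",
                           "Duckworth, Angela*")
for term in terms:
    print(term)
# Grit: The Power of Passion and Perseverance Duckworth, Angela*
# Grit
# Grit Duckworth, Angela*
# Grit: The Power of Passion and Perseverance Duckworth, Angela*
```

For titles of seven words or fewer, `term1` and `term4` are identical, as in this example. Likewise, for titles without a colon, `term2` is the full title and `term3` equals `term1`.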


### Functions

In [5]:
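def search_book(book):
    """Search for `book` and return True if the results table has at least one row."""
    browser.get(url)
    browser.find_element(By.CSS_SELECTOR, '#searchform').send_keys(book)
    browser.find_element(By.CSS_SELECTOR, 'body > table > tbody:nth-child(4) > tr > td:nth-child(2) > form > input[type=submit]:nth-child(2)').click()
    # find_element raises NoSuchElementException when there is no first result row,
    # so callers should treat an exception as "not found".
    if browser.find_element(By.CSS_SELECTOR, 'body > table.c > tbody > tr:nth-child(2)'):
        return True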
    
def download_book(book):
    """Download the first English PDF found among the first four result rows."""
    try:
        # Result rows are tr:nth-child(2) through tr:nth-child(5) of the results table.
        for n in range(2, 6):
            row = f'body > table.c > tbody > tr:nth-child({n})'
            is_pdf = browser.find_element(By.CSS_SELECTOR, f'{row} > td:nth-child(9)').text == 'pdf'
            if is_pdf and browser.find_element(By.CSS_SELECTOR, f'{row} > td:nth-child(7)').text == 'English':
                try:
                    # Follow the link in the tenth column, then click the
                    # download link on the page that opens.
                    browser.find_element(By.CSS_SELECTOR, f'{row} > td:nth-child(10) > a').click()
                    browser.find_element(By.CSS_SELECTOR, '#download > h2 > a').click()
                except Exception:
                    pass
                return
    except Exception:
        # A missing row or cell raises an exception and ends the search.
        print(f"No English PDFs of {book} found")

In [8]:
# Selenium 4 takes the driver path through a Service object
browser = webdriver.Chrome(service=Service("chromedriver.exe"), options=chrome_options)
browser.maximize_window()

df_searchterms.reset_index(drop=True, inplace=True)
for index, row in df_searchterms.iterrows():
    # Try the search terms in order and stop at the first one that returns results
    for term in ['term1', 'term2', 'term3', 'term4']:
        book = row[term]
        try:
            if search_book(book):
                print('YES! Book found!')
                download_book(book)
                break
        except Exception:
            pass
        print(f' {book} Not found')



YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
       Good to Great: Why Some Companies Make the Leap... and Others Don't Collins, James C. Not found
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
