# 3. Downloading PDFs

In this Jupyter Notebook, the code can be found for importing a dataframe with books and based on this, downloading the pdf's from a website called library genesis (libgen.rs).

How this works:
1. For each book, 4 different searchterms are created to increase the probability that a pdf is found.
2. For these books, the website libgen.is is scraped, looking for books in English and in pdf format.
3. These PDF's are saved in a folder, called 'BookDownloads3'

### Importing Dataframes

In [16]:
import pandas as pd
import requests
import time
import os
import warnings
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import ElementClickInterceptedException
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
warnings.filterwarnings(action='once')

os.chdir('C:\\Users\\HP\\Data Science\\GoodReads')

In [17]:
# Defaults for webscraping
chrome_options = Options()
chrome_options.add_argument("--safebrowsing-disable-download-protection")
chrome_options.add_argument("safebrowsing-disable-extension-blacklist")
chrome_options.add_experimental_option("prefs", {
  "download.default_directory": r'C:\Users\HP\Data Science\GoodReads\BookDownloads3',
  "download.prompt_for_download": False,
  "download.directory_upgrade": True,
  "safebrowsing.enabled": False
})

# Website we'll scrape
url = 'https://libgen.rs/'

In [18]:
# Importing the DF
df = pd.read_csv("books.csv")
df.drop(['Unnamed: 0'], axis = 1, inplace = True) 

df.head(5)

Unnamed: 0,title,author,num pages,avg rating,num ratings,date pub,rating,date read
0,Deep Work: Rules for Focused Success in ...,"Newport, Cal",296,4.19,115972,"Jan 05, 2016",4,"Jun 10, 2022"
1,The Psychology of Money,"Housel, Morgan",252,4.39,61708,unknown,4,"Jun 02, 2022"
2,The Daily Stoic: 366 Meditations for Cla...,"Holiday, Ryan*",416,4.31,26003,"Oct 18, 2016",5,"May 25, 2022"
3,Models: Attract Women Through Honesty,"Manson, Mark*",246,4.3,14007,"Jul 01, 2011",4,"May 23, 2022"
4,Your Brain On Porn: Internet Pornography...,"Wilson, Gary",180,4.23,4446,"Aug 25, 2014",4,"Apr 29, 2022"


### Creating different searchterms for each book

In [19]:
df_searchterms = df[['title', 'author']]

df_searchterms['term1'] = df_searchterms['title'] + ' ' + df_searchterms['author']
df_searchterms['term2'] = df_searchterms['title'].map(lambda x: x.split(":", 1)[0])
df_searchterms['term3'] = df_searchterms['term2'] + ' ' + df_searchterms['author']
df_searchterms['term4'] = df_searchterms['title'].map(lambda x: ' '.join(x.split()[:7]))
df_searchterms['term4'] = df_searchterms['title'].map(lambda x: ' '.join(x.split()[:7])) + ' ' + df_searchterms['author']

df_searchterms

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_searchterms['term1'] = df_searchterms['title'] + ' ' + df_searchterms['author']


Unnamed: 0,title,author,term1,term2,term3,term4
0,Deep Work: Rules for Focused Success in ...,"Newport, Cal",Deep Work: Rules for Focused Success in ...,Deep Work,"Deep Work Newport, Cal",Deep Work: Rules for Focused Success in Newpor...
1,The Psychology of Money,"Housel, Morgan","The Psychology of Money Housel, Morgan",The Psychology of Money,"The Psychology of Money Housel, Morgan","The Psychology of Money Housel, Morgan"
2,The Daily Stoic: 366 Meditations for Cla...,"Holiday, Ryan*",The Daily Stoic: 366 Meditations for Cla...,The Daily Stoic,"The Daily Stoic Holiday, Ryan*","The Daily Stoic: 366 Meditations for Clarity, ..."
3,Models: Attract Women Through Honesty,"Manson, Mark*",Models: Attract Women Through Honesty Ma...,Models,"Models Manson, Mark*","Models: Attract Women Through Honesty Manson, ..."
4,Your Brain On Porn: Internet Pornography...,"Wilson, Gary",Your Brain On Porn: Internet Pornography...,Your Brain On Porn,"Your Brain On Porn Wilson, Gary",Your Brain On Porn: Internet Pornography and W...
5,The Easy Way to Stop Smoking: Join the M...,"Carr, Allen",The Easy Way to Stop Smoking: Join the M...,The Easy Way to Stop Smoking,"The Easy Way to Stop Smoking Carr, Allen","The Easy Way to Stop Smoking: Join Carr, Allen"
6,Ikigai: The Japanese Secret to a Long an...,"Garcia Puigcerver, Hector*",Ikigai: The Japanese Secret to a Long an...,Ikigai,"Ikigai Garcia Puigcerver, Hector*",Ikigai: The Japanese Secret to a Long Garcia P...
7,Influence: The Psychology of Persuasion,"Cialdini, Robert B.*",Influence: The Psychology of Persuasion ...,Influence,"Influence Cialdini, Robert B.*",Influence: The Psychology of Persuasion Cialdi...
8,The 7 Habits of Highly Effective People:...,"Covey, Stephen R.",The 7 Habits of Highly Effective People:...,The 7 Habits of Highly Effective People,The 7 Habits of Highly Effective People ...,The 7 Habits of Highly Effective People: Covey...
9,Derksen,"Egmond, Michel van","Derksen Egmond, Michel van",Derksen,"Derksen Egmond, Michel van","Derksen Egmond, Michel van"


### Functions

In [21]:
def search_book(book):
    browser.get(url)
    browser.find_element(By.CSS_SELECTOR, '#searchform').send_keys(book)
    browser.find_element(By.CSS_SELECTOR, 'body > table > tbody:nth-child(4) > tr > td:nth-child(2) > form > input[type=submit]:nth-child(2)').click()
    if browser.find_element(By.CSS_SELECTOR, 'body > table.c > tbody > tr:nth-child(2)'):
        return True
    
def download_book(book):
    try:    
        if browser.find_element(By.CSS_SELECTOR, 'body > table.c > tbody > tr:nth-child(2) > td:nth-child(9)').text == 'pdf' and browser.find_element(By.CSS_SELECTOR, 'body > table.c > tbody > tr:nth-child(2) > td:nth-child(7)').text == 'English':
            try:
                browser.find_element(By.CSS_SELECTOR, 'body > table.c > tbody > tr:nth-child(2) > td:nth-child(10) > a').click()
                browser.find_element(By.CSS_SELECTOR, '#download > h2 > a').click()
            except Exception:
                pass
        elif browser.find_element(By.CSS_SELECTOR, 'body > table.c > tbody > tr:nth-child(3) > td:nth-child(9)').text == 'pdf' and browser.find_element(By.CSS_SELECTOR, 'body > table.c > tbody > tr:nth-child(3) > td:nth-child(7)').text == 'English': 
            try:
                browser.find_element(By.CSS_SELECTOR, 'body > table.c > tbody > tr:nth-child(3) > td:nth-child(10) > a').click()
                browser.find_element(By.CSS_SELECTOR, '#download > h2 > a').click()
            except Exception:
                pass
        elif browser.find_element(By.CSS_SELECTOR, 'body > table.c > tbody > tr:nth-child(4) > td:nth-child(9)').text == 'pdf' and browser.find_element(By.CSS_SELECTOR, 'body > table.c > tbody > tr:nth-child(4) > td:nth-child(7)').text == 'English': 
            try:
                browser.find_element(By.CSS_SELECTOR, 'body > table.c > tbody > tr:nth-child(4) > td:nth-child(10) > a').click()
                browser.find_element(By.CSS_SELECTOR, '#download > h2 > a').click()
            except Exception:
                pass
        elif browser.find_element(By.CSS_SELECTOR, 'body > table.c > tbody > tr:nth-child(5) > td:nth-child(9)').text == 'pdf' and browser.find_element(By.CSS_SELECTOR, 'body > table.c > tbody > tr:nth-child(5) > td:nth-child(7)').text == 'English': 
            try:
                browser.find_element(By.CSS_SELECTOR, 'body > table.c > tbody > tr:nth-child(5) > td:nth-child(10) > a').click()
                browser.find_element(By.CSS_SELECTOR, '#download > h2 > a').click()
            except Exception:
                pass
        else:
            pass
    except Exception:
        (f" No English PDF's of {book} found")
        pass

In [22]:
browser = webdriver.Chrome("chromedriver.exe", options=chrome_options)
browser.maximize_window()

df_searchterms.reset_index(drop=True, inplace=True)
for index, row in df_searchterms.iterrows():
    book1 = row['term1']
    try:
        if search_book(book1):
            print('YES! Book found!')
            download_book(book1)
        else:
            print(f' {book1} Not found')
    except Exception as e:
        print(f' {book1} Not found')
        
        book2 = row['term2']
        try:
            if search_book(book2):
                print('YES! Book found!')
                download_book(book2)
            else:
                print(f' {book2} Not found')
        except Exception as e:
            print(f' {book2} Not found')
            
            book3 = row['term3']
            try:
                if search_book(book3):
                    print('YES! Book found!')
                    download_book(book3)
                else:
                    print(f' {book3} Not found')
            except Exception as e:
                print(f' {book3} Not found')
            
            book4 = row['term4']
            try:
                if search_book(book4):
                    print('YES! Book found!')
                    download_book(book4)
                else:
                    print(f' {book4} Not found')
            except Exception as e:
                print(f' {book4} Not found')

  browser = webdriver.Chrome("chromedriver.exe", options=chrome_options)


YES! Book found!
YES! Book found!
       The Daily Stoic: 366 Meditations for Clarity, Effectiveness, and Serenity Holiday, Ryan* Not found
YES! Book found!
YES! Book found!
YES! Book found!
       The Easy Way to Stop Smoking: Join the Millions Who Have Become Nonsmokers Using the Easyway Method Carr, Allen Not found
YES! Book found!
       Ikigai: The Japanese Secret to a Long and Happy Life / The Little Book of Lykke / Lagom: The Swedish Art of Balanced Living Garcia Puigcerver, Hector* Not found
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
       De passievrucht Loon, Karel Glastra van Not found
       De passievrucht Not found
       De passievrucht Loon, Karel Glastra van Not found
 De passievrucht Loon, Karel Glastra van Not found
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
       Kieft Egmond, Michel van Not found
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
YES! Book found!
