<div class="alert alert-warning">
<div class="label label-warning">NOTICE:</div><br>
<li>ChromeDriver is REQUIRED: Selenium uses it to control a web browser from Python code. Driving a real browser simulates human browsing, which makes the scraper less likely to be blocked by websites (Amazon, for example).</li>
</div>

In [None]:
CHROME_PATH = "C:/Users/Seddik's PC/Desktop/Cours/Fun/Selenium/chromedriver.exe"

## <a id='I'></a><h2 style="text-align=center; font-size=220">  <span class="label label-info" style="text-align=center;"> I. Searching for Name, Category, Author, Language, Pages, and Dimensions information</span>  </h2> 

###  <a id='ILaNA'></a> <div style="font-size:120%;  margin-left: 25px;"> <span class="label label-success"> Importing Libraries</span></div>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from tqdm import tqdm
import re # Regular expressions

###  <a id='ILaNA'></a> <div style="font-size:120%;  margin-left: 25px;"> <span class="label label-success"> Defining Base url</span></div>

In [3]:
BASE_URL = "https://www.amazon.fr/dp/"

###  <a id='ILaNA'></a> <div style="font-size:120%;  margin-left: 25px;"> <span class="label label-success">Reading the first 100 rows (for demonstration purposes)</span></div>

In [4]:
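# Read ISBN as text so that leading zeros are kept and it can be appended to BASE_URL
books = pd.read_csv("./Data/BX-Books.csv", nrows=100, sep=";", usecols=["ISBN", "Year-Of-Publication"], dtype={"ISBN": str})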
books.head()

,ISBN,Year-Of-Publication
0,195153448,2002
1,2005018,2001
2,60973129,1991
3,374157065,1999
4,393045218,1999
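
The preview above shows ISBNs without their leading zeros, which happens whenever the identifiers are treated as integers. Since the product URL is built by appending the 10-character ISBN to `BASE_URL`, the zeros matter. The following is a minimal sketch of how they could be restored; `normalize_isbn10` is a hypothetical helper that is not part of the original notebook.

```python
BASE_URL = "https://www.amazon.fr/dp/"


def normalize_isbn10(raw):
    """Return a 10-character ISBN-10 string, restoring lost leading zeros.

    Hypothetical helper: accepts an int or a str and left-pads with zeros.
    """
    return str(raw).strip().zfill(10)


# Integers, short strings, and ISBNs ending in 'X' are all handled the same way
for raw in (195153448, "2005018", "074322678X"):
    print(BASE_URL + normalize_isbn10(raw))
```

Reading the column as text from the start avoids the problem entirely; the helper is only useful when the zeros are already gone.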


###  <a id='ILaNA'></a> <div style="font-size:120%;  margin-left: 25px;"> <span class="label label-success">Creating columns for the features that we are going to fill</span></div>

In [5]:
books['Name'] = np.nan
books['Category'] = np.nan
books['Author'] = np.nan
books['Language'] = np.nan
books['Pages'] = np.nan
books['Dimensions'] = np.nan
books = books.set_index('ISBN')
books.head()

ISBN,Year-Of-Publication,Name,Category,Author,Language,Pages,Dimensions
195153448,2002,,,,,,
2005018,2001,,,,,,
60973129,1991,,,,,,
374157065,1999,,,,,,
393045218,1999,,,,,,


###  <a id='ILaNA'></a> <div style="font-size:120%;  margin-left: 25px;"> <span class="label label-success">Defining XPaths that we are going to use to look for the information</span></div>

In [6]:
XPATH_NAME = '//h1[@id="title"]/span[@id="productTitle"]'
XPATH_CATEGORY = '//a[@class="a-link-normal a-color-tertiary"]'
XPATH_AUTHOR = '//span[@class="author notFaded"]'
# XPATH_AUTHOR = '//span[@class="author notFaded"]/a[@class="a-link-normal"]'
XPATH_PAGES = '//td[@class="bucket"]/div/ul/li[contains(text(),"pages")]'
XPATH_LANGUAGE = '//td[@class="bucket"]/div/ul/li[contains(b,"Langue")]'
XPATH_DIMENSIONS = '//td[@class="bucket"]/div/ul/li[contains(b,"Dimensions")]'
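
To illustrate what such XPath expressions select, here is a small sketch that runs similar expressions against a made-up HTML fragment. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (and needs well-formed markup), so it is an approximation of what Selenium evaluates in the browser, not a replacement for it. The fragment itself is an assumption and does not reproduce Amazon's real page structure.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment that mimics the elements targeted by the XPaths
SAMPLE = """
<html><body>
  <h1 id="title"><span id="productTitle">Classical Mythology</span></h1>
  <span class="author notFaded">Mark P. O. Morford (Auteur)</span>
  <span class="author notFaded">Robert J. Lenardon (Auteur)</span>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Same idea as XPATH_NAME: a span with a given id nested inside the title heading
title = root.find(".//h1[@id='title']/span[@id='productTitle']").text

# Same idea as XPATH_AUTHOR: every span whose class attribute matches exactly
authors = [el.text for el in root.findall(".//span[@class='author notFaded']")]

print(title)
print(authors)
```

Note that `@class="author notFaded"` is an exact string match on the whole attribute, so an element with extra or reordered classes would not be selected.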

In [7]:
def get_info(driver, *url):
    """
    extracts the features from a given book amazon url

    driver: the Selenium web navigator driver
    url: the book amazon url
    """
    # Load the page only when a URL is given; otherwise parse the page already open in the driver
    if url:
        driver.get(url[0])
    name, categories, authors, language, pages, dimensions = (np.nan for i in range(6))
    try:
        name = driver.find_element_by_xpath(XPATH_NAME).text
    except:
        pass
    try:
        pages = driver.find_element_by_xpath(XPATH_PAGES).text
        pages = re.findall(r'\d+', pages)[0]  # keep only the first number found
    except:
        pass            
    try:
        language = driver.find_element_by_xpath(XPATH_LANGUAGE).text
        language = language.split(':')[-1].strip()
    except:
        pass            
    try:
        dimensions = driver.find_element_by_xpath(XPATH_DIMENSIONS).text
        dimensions = dimensions.split(':')[-1].strip()
    except:
        pass            
    try:
        categories =driver.find_elements_by_xpath(XPATH_CATEGORY)
        categories = ' > '.join([elem.text for elem in categories])
        if len(categories) == 0:
            categories = np.nan
    except:
        pass            
    try:
        authors = driver.find_elements_by_xpath(XPATH_AUTHOR)
        authors = ' , '.join([re.findall('^[^(]+',author.text)[0].strip() for author in authors])
        if len(authors) == 0:
            authors = np.nan
    except:
        pass

    return name, categories, authors, language, pages, dimensions                          
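
The text clean-up steps inside `get_info` can be tried in isolation, without a browser. The sketch below applies the same regular expressions and splits to sample strings; the sample strings and the helper names (`parse_pages`, `parse_value`, `parse_author`) are assumptions made for illustration, not values captured from Amazon.

```python
import re


def parse_pages(text):
    """Return the first run of digits in the text, or None if there is none."""
    match = re.search(r"\d+", text)
    return match.group() if match else None


def parse_value(text):
    """Return what follows the last colon, e.g. 'Langue : Anglais' -> 'Anglais'."""
    return text.split(":")[-1].strip()


def parse_author(text):
    """Drop a trailing role in parentheses, e.g. 'Amy Tan (Auteur)' -> 'Amy Tan'."""
    return re.findall(r"^[^(]+", text)[0].strip()


# Hypothetical element texts, shaped like the fields the XPaths target
print(parse_pages("Broché: 1028 pages"))
print(parse_value("Langue : Anglais"))
print(parse_value("Dimensions du produit: 23,2 x 3,1 x 18,9 cm"))
print(" , ".join(parse_author(a) for a in ["Mark P. O. Morford (Auteur)", "Robert J. Lenardon (Auteur)"]))
```

Testing the parsing separately from the scraping makes it easier to tell a layout change on the website apart from a parsing bug.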

In [8]:
def double_check(isbn,driver):
    """
    Checks if all the features are extracted. If not, it looks for other links to gather the remaining feature information
    
    isbn: the books id
    driver: Selenium web navigator driver
    """
    retry_counter = 1
    while books.loc[isbn].isnull().values.any():
        try:
            elem = driver.find_element_by_id("a-autoid-{}-announce".format(retry_counter)) # to check for other links to the product
            if elem.get_attribute('href'):
                if elem.get_attribute('href').find('/dp/') > -1:
                    
                    elem.click()
                    
                    parsed = get_info(driver)
                    # pd.isna() is used instead of `is np.nan`: values read back from a
                    # DataFrame are not guaranteed to be the np.nan object itself
                    if pd.isna(books.loc[isbn,'Category']) and not re.search('(audio)|(audible)|(ebook)|(kindle)',str(parsed[1]).lower()):
                        books.loc[isbn,'Category'] = parsed[1]
                    if pd.isna(books.loc[isbn,'Author']):
                        books.loc[isbn,'Author'] = parsed[2]
                    if pd.isna(books.loc[isbn,'Language']):
                        books.loc[isbn,'Language'] = parsed[3]
                    if pd.isna(books.loc[isbn,'Pages']):
                        books.loc[isbn,'Pages'] = parsed[4]
                    if pd.isna(books.loc[isbn,'Dimensions']):
                        books.loc[isbn,'Dimensions'] = parsed[5]
                    
                    driver.back()
        except:
            print('finished all available links ({}, {})'.format(retry_counter, isbn))
            break # if the link doesn't exist, break

        retry_counter += 1
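
Deciding whether a feature is still missing comes down to detecting NaN reliably. As a standard-library sketch (the variable names are illustrative), the snippet below shows why identity and equality comparisons are fragile for NaN and why a dedicated test is safer; in pandas the analogous test is `pd.isna`.

```python
import math

nan_a = float("nan")
nan_b = float("nan")

# Identity only holds when both names point to the very same object
same_object = nan_a is nan_a        # True
other_object = nan_a is nan_b       # False: two distinct NaN objects

# NaN never compares equal, not even to itself
equal_to_itself = nan_a == nan_a    # False

# A dedicated test works regardless of which NaN object is involved
detected = math.isnan(nan_a) and math.isnan(nan_b)  # True

print(same_object, other_object, equal_to_itself, detected)
```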
        


###  <a id='ILaNA'></a> <div style="font-size:120%;  margin-left: 25px;"> <span class="label label-success">Scraping Amazon with a double check: if the first page does not contain all the information, look for it on another page</span></div>

In [9]:
if __name__ == '__main__':

    # Instantiate a Chrome options object to set the headless preference and window size
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--window-size=1920,1080")

    # Disable image loading to reduce page load time
    # (the same options object is reused so the headless settings above are kept)
    prefs = {"profile.managed_default_content_settings.images": 2}
    chrome_options.add_experimental_option("prefs", prefs)

    driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=CHROME_PATH)

    driver.maximize_window()
    for isbn in tqdm(books.index):
        books.loc[isbn,['Name', 'Category', 'Author', 'Language','Pages', 'Dimensions']] = get_info(driver, BASE_URL+isbn)
        double_check(isbn, driver)
    driver.quit()  # end the session and stop the chromedriver process

 32%|█████████████████████████▉                                                       | 32/100 [02:31<05:08,  4.54s/it]

finished all available links (3, 3404921038)


 33%|██████████████████████████▋                                                      | 33/100 [02:37<05:46,  5.17s/it]

finished all available links (3, 3442353866)


 35%|████████████████████████████▎                                                    | 35/100 [02:42<04:06,  3.79s/it]

finished all available links (4, 3442446937)


 53%|██████████████████████████████████████████▉                                      | 53/100 [05:13<03:31,  4.51s/it]

finished all available links (7, 0060914068)


 55%|████████████████████████████████████████████▌                                    | 55/100 [05:43<08:01, 10.71s/it]

finished all available links (1, 0245542957)


 58%|██████████████████████████████████████████████▉                                  | 58/100 [05:54<04:15,  6.09s/it]

finished all available links (1, 0961769947)


 79%|███████████████████████████████████████████████████████████████▉                 | 79/100 [08:25<03:05,  8.84s/it]

finished all available links (5, 1569871213)


100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [10:08<00:00,  4.81s/it]


In [10]:
books

ISBN,Year-Of-Publication,Name,Category,Author,Language,Pages,Dimensions
0195153448,2002,Classical Mythology,Livres anglais et étrangers > Literature & Fic...,"Mark P. O. Morford , Robert J. Lenardon",Anglais,1028,"23,2 x 3,1 x 18,9 cm"
0002005018,2001,Clara Callan: A novel,Livres anglais et étrangers > Literature & Fic...,Richard B. Wright,Anglais,432,"15,5 x 2,8 x 28,2 cm"
0060973129,1991,Decision in Normandy,Livres anglais et étrangers > History > Europe,Carlo D'Este,Anglais,576,"3 x 13,5 x 20,1 cm"
0374157065,1999,Flu: The Story of the Great Influenza Pandemic...,Livres anglais et étrangers > Science > Biolog...,Gina Bari Kolata,Anglais,330,"12,7 x 1,9 x 22,2 cm"
0393045218,1999,The Mummies of Urumchi,Livres anglais et étrangers > Arts & Photograp...,"Elizabeth Wayland Barber , E. J. W. Barber",Anglais,240,"19 x 2,5 x 24,8 cm"
0399135782,1991,The Kitchen God's Wife,Livres anglais et étrangers > Literature & Fic...,Amy Tan,Anglais,415,"23,7 x 3,7 x 16 cm"
0425176428,2000,What If?: The World's Foremost Historians Imag...,Livres anglais et étrangers > History > Histor...,Robert Cowley,Anglais,416,"15,2 x 2,3 x 22,9 cm"
0671870432,1993,Pleading Guilty,Livres anglais et étrangers > Literature & Fic...,"Scott Turow , Stacy Keach",Anglais,483,"10,2 x 2,9 x 17,8 cm"
0679425608,1996,Under the Black Flag: The Romance and the Real...,Livres anglais et étrangers > Literature & Fic...,"David Cordingly , Godoff",Anglais,296,"16,5 x 2,5 x 24,1 cm"
074322678X,2002,Where You'll Find Me: And Other Stories,Livres anglais et étrangers > Literature & Fic...,Beattie,Anglais,208,"13,3 x 1,1 x 20,5 cm"


###  <a id='ILaNA'></a> <div style="font-size:120%;  margin-left: 25px;"> <span class="label label-success">Checking the result (missing information)</span></div>

In [13]:
books.isna().sum()

Year-Of-Publication    0
Name                   0
Category               4
Author                 0
Language               0
Pages                  7
Dimensions             0
dtype: int64

## <a id='II'></a><h2 style="text-align=center; font-size=220">  <span class="label label-info" style="text-align=center;"> II. Scraping BookFinder.com for book descriptions to use in content-based recommendation</span>  </h2> 

## Getting Book description from the website bookfinder.com

In [None]:
from selenium import webdriver
from bs4 import BeautifulSoup


# Instantiate a Chrome options object to set the headless preference and window size
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=1920,1080")

# Disable image loading to reduce page load time
# (the same options object is reused so the headless settings above are kept)
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option("prefs", prefs)

driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=CHROME_PATH)

isbns = list(books.index)
descriptions = []
for isbn in isbns:
    
    driver.get("https://www.bookfinder.com/isbn_search/")

    driver.implicitly_wait(3)

    driver.find_element_by_xpath("""//*[@name="isbn"]""").send_keys(isbn)
    driver.find_element_by_xpath("""//*[@id="submitBtn"]""").click()

    driver.implicitly_wait(3)

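    try:
        content = driver.find_element_by_xpath("""//*[@id="bookSummary"]""")
        data = BeautifulSoup(content.get_attribute('innerHTML'), "lxml")
        descriptions.append(data.text)
    except Exception:
        # No summary found: keep `descriptions` aligned with `isbns`
        descriptions.append(np.nan)

driver.quit()  # end the session once all ISBNs are processed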


In [None]:
books['Description'] = descriptions

<div class='alert alert-info'>
This was just a demonstration. The process takes a long time, depending on the internet connection speed and the processing power of the computer. It can be sped up by launching several browsers to scrape several pages at once.</div>
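
As a rough sketch of the parallel idea, the standard library's `ThreadPoolExecutor` can spread the ISBNs over several workers. Here `scrape_one` is a hypothetical placeholder that only returns a dummy string; in a real run each worker would presumably need its own browser instance, since a single Selenium driver should not be shared between threads.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_one(isbn):
    """Placeholder for the per-book scraping work (hypothetical).

    A real version would open the page for `isbn` with its own driver
    and return the extracted description.
    """
    return isbn, "description for " + isbn


def scrape_many(isbns, max_workers=4):
    """Run scrape_one over all ISBNs using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input ISBNs
        return dict(pool.map(scrape_one, isbns))


results = scrape_many(["0195153448", "0002005018", "0060973129"])
print(results)
```

The number of workers is a trade-off: more browsers finish sooner but use more memory and send more simultaneous requests to the website.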

## <a id='III'></a><h2 style="text-align=center; font-size=220">  <span class="label label-info" style="text-align=center;"> III. Measuring the similarity between two books using the cosine similarity of their descriptions</span>  </h2> 

This is one way to perform content-based recommendation: we can suggest to the end user the k books whose descriptions are most similar to that of the book they chose.
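
A minimal sketch of this "k most similar descriptions" idea follows. The catalogue, the titles, and the helper names (`to_vector`, `cosine`, `top_k_similar`) are made up for illustration, and the text is lowercased before counting words, which is an assumption on top of the plain word counts.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def to_vector(text):
    """Bag-of-words vector: word -> count (lowercased, an added assumption)."""
    return Counter(WORD.findall(text.lower()))


def cosine(v1, v2):
    """Cosine similarity between two word-count vectors."""
    common = set(v1) & set(v2)
    numerator = sum(v1[w] * v2[w] for w in common)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return numerator / (norm1 * norm2) if norm1 and norm2 else 0.0


def top_k_similar(query, catalogue, k=2):
    """Return the k titles whose descriptions are most similar to the query."""
    query_vec = to_vector(query)
    scored = [(cosine(query_vec, to_vector(desc)), title)
              for title, desc in catalogue.items()]
    return [title for score, title in sorted(scored, reverse=True)[:k]]


# Hypothetical mini-catalogue: title -> description
catalogue = {
    "Wizard School": "a young wizard learns magic at a school of wizards",
    "Space War": "a pilot fights a war among the stars",
    "Magic Garden": "a garden where magic flowers grow",
}

print(top_k_similar("a wizard studies magic", catalogue, k=2))
```

With raw counts, frequent words such as "a" contribute to every score, so removing stop words or weighting terms would likely give sharper rankings.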

In [1]:
import re, math
from collections import Counter


WORD = re.compile(r'\w+')


def get_cosine(vec1, vec2):
    
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)


Harry_potter_1 = """It was always difficult being Harry Potter and it isn't much easier now that he is an overworked employee of the Ministry of Magic, a husband, and father of three school-age children.
While Harry grapples with a past that refuses to stay where it belongs, his youngest son Albus must struggle with the weight of a family legacy he never wanted. As past and present fuse ominously, both father and son learn the uncomfortable truth: sometimes, darkness comes from unexpected places.

The playscript for Harry Potter and the Cursed Child was originally released as a 'special rehearsal edition' alongside the opening of Jack Thorne's play in London's West End in summer 2016. Based on an original story by J.K. Rowling, John Tiffany and Jack Thorne, the play opened to rapturous reviews from theatregoers and critics alike, while the official playscript became an immediate global bestseller.

This revised paperback edition updates the 'special rehearsal edition' with the conclusive and final dialogue from the play, which has subtly changed since its rehearsals, as well as a conversation piece between director John Tiffany and writer Jack Thorne, who share stories and insights about reading playscripts. This edition also includes useful background information including the Potter family tree and a timeline of events from the Wizarding World prior to the beginning of Harry Potter and the Cursed Child."""


Harry_potter_2 = """Harry Potter has never even heard of Hogwarts when the letters start dropping on the doormat at number four, Privet Drive. Addressed in green ink on yellowish parchment with a purple seal, they are swiftly confiscated by his grisly aunt and uncle. Then, on Harry's eleventh birthday, a great beetle-eyed giant of a man called Rubeus Hagrid bursts in with some astonishing news: Harry Potter is a wizard, and he has a place at Hogwarts School of Witchcraft and Wizardry. An incredible adventure is about to begin! These new editions of the classic and internationally bestselling, multi-award-winning series feature instantly pick-up-able new jackets by Jonny Duddle, with huge child appeal, to bring Harry Potter to the next generation of readers. It's time to pass the magic on"""


Star_wars = """A Jedi must be a fearless warrior, a guardian of justice, and a scholar in the ways of the Force. But perhaps a Jedi’s most essential duty is to pass on what they have learned. Master Yoda trained Dooku; Dooku trained Qui-Gon Jinn; and now Qui-Gon has a Padawan of his own. But while Qui-Gon has faced all manner of threats and danger as a Jedi, nothing has ever scared him like the thought of failing his apprentice."""



vector1 = text_to_vector(Harry_potter_1)
vector2 = text_to_vector(Harry_potter_2)

vector3 = text_to_vector(Star_wars)

cosine1 = get_cosine(vector1, vector2)
cosine2 = get_cosine(vector2, vector3)

print('Cosine similarity between the Harry Potter 1 and Harry Potter 2 descriptions is:', cosine1)
print('Cosine similarity between the Harry Potter 2 and Star Wars descriptions is:', cosine2)


Cosine similarity between the Harry Potter 1 and Harry Potter 2 descriptions is: 0.5723332526830462
Cosine similarity between the Harry Potter 2 and Star Wars descriptions is: 0.48625899626414265
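
For reference, the quantity computed by `get_cosine` can be written as follows, where $a_w$ and $b_w$ denote the number of occurrences of word $w$ in the first and second description, and $V_1$, $V_2$ are the sets of words appearing in each (this notation is introduced here for illustration):

$$
\cos(\theta) = \frac{\sum_{w \in V_1 \cap V_2} a_w \, b_w}{\sqrt{\sum_{w \in V_1} a_w^2} \; \sqrt{\sum_{w \in V_2} b_w^2}}
$$

Because word counts are non-negative, the value lies between 0 (no shared words) and 1 (identical word proportions), so a higher value means more similar descriptions. The corresponding cosine distance is usually defined as $1 - \cos(\theta)$.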


<div class='alert alert-info'>
This similarity measure can later be used for item-item recommendation: recommending books based on the books that the user has already read.</div>