# Objective

The goal is to get the list of all exhibitors at a business exhibition.
I typically use it either:
 - before attending an exhibition: it lets me review every startup attending and shortlist the best ones to visit on the day
 - if I can't get to the exhibition: to check all exhibitors and see whether some would have been worth talking to

In [49]:
# imports
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
import pandas as pd
import os
import time
import requests

# Basic Test

In [121]:
# Variables to adjust to each site
BASE_URL="https://www.bigdataworld.fr/"
LIST_TO_SURF= [ 'https://www.bigdataworld.fr/exhibitors?page=1',
               'https://www.bigdataworld.fr/exhibitors?page=2'
    
]

#test classic scraping
url= LIST_TO_SURF[0]
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
page = requests.get(url,headers=headers)
page.content

b'<html>\r\n<head>\r\n<META NAME="robots" CONTENT="noindex,nofollow">\r\n<script src="/_Incapsula_Resource?SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3">\r\n</script>\r\n<body>\r\n</body></html>'

OK, this will be harder than expected: **the website protects its data by detecting that we are not a real person. Even with the User-Agent header set, the response only contains an Incapsula script and an empty body, not the exhibitor list.**
**We'll need to use Selenium to emulate a real person browsing.**
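The check done by eye on the response above can be sketched as a small helper. This is only an illustration: `looks_like_bot_challenge` is a hypothetical name, and the marker string is simply the one visible in the response shown earlier, so other protection services would need their own markers.

```python
def looks_like_bot_challenge(body: bytes) -> bool:
    """Return True when a response body resembles the anti-bot page seen above."""
    # Decode leniently: we only look for an ASCII marker, so bad bytes do not matter.
    text = body.decode("utf-8", errors="replace").lower()
    # Marker copied from the response shown above (assumption: it stays stable).
    return "_incapsula_resource" in text


# Example with a shortened version of the body returned by requests.get() above.
challenge = b'<html><head><script src="/_Incapsula_Resource?SWJIYLWA=..."></script><body></body></html>'
print(looks_like_bot_challenge(challenge))
```

Calling it on `page.content` right after `requests.get` would tell us early whether plain `requests` is enough for a given site or whether a real browser is needed.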

# Use of selenium agent

In [123]:
# declare path of Chrome driver
driver = webdriver.Chrome(executable_path=r"c:\Users\Utilisateur\ChromeDriver\chromedriver.exe")

In [124]:
# test browsing a website
driver.get(url)  # go to the first listing page
time.sleep(5)  # give the page time to load
agent = driver.execute_script("return navigator.userAgent")  # check user agent
print(agent)
time.sleep(5)


Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36


**Note: the website is well protected: we have to click 'I am not a robot' manually before we can access its data.**
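Because the challenge has to be solved by hand, the script needs some way to pause until the real page is visible. A minimal sketch, assuming we can recognise the real page by a substring of its HTML: `wait_for_marker` is a hypothetical helper that polls any function returning the page source, so it can be tried without a browser.

```python
import time


def wait_for_marker(get_source, marker, timeout=120.0, poll=1.0):
    """Poll get_source() until `marker` appears in it.

    Returns True as soon as the marker is found, False once `timeout` seconds
    have elapsed without seeing it.
    """
    deadline = time.monotonic() + timeout
    while True:
        if marker in get_source():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)


# Dry run without a browser: the marker shows up on the third poll.
fake_pages = iter(["<html>challenge</html>", "<html>challenge</html>",
                   '<a class="m-exhibitors-list__items__item">x</a>'])
print(wait_for_marker(lambda: next(fake_pages), "m-exhibitors-list", timeout=5, poll=0.01))
```

With the Selenium session from this notebook, the call would look like `wait_for_marker(lambda: driver.page_source, "m-exhibitors-list__items__item")`, run right after `driver.get(url)` while the checkbox is clicked manually. Selenium's own `WebDriverWait` is an alternative for the same purpose.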

In [137]:
def get_exhibitors_list(page_urls):
    """
    Scrape the profile URLs of all exhibitors.

    Args:
      page_urls (list): URLs of the listing pages to scrape

    Returns:
      url_list_all (list): profile URLs of the startups/exhibitors, used later to scrape their data

    """

    url_list_all = []
    for url in page_urls:
        # collect the profile link of every exhibitor on this listing page
        driver.get(url)
        html_page = driver.page_source
        soup = BeautifulSoup(html_page, "lxml")
        link_list = soup.find_all('a', class_='m-exhibitors-list__items__item__header__title__link js-librarylink-entry')

        for link in link_list:
            url_list_all.append(link.get('href'))

    return url_list_all
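Since `link.get('href')` returns `None` for anchors without an `href`, and the same exhibitor could in principle appear on more than one listing page, it may be worth cleaning the collected list before visiting each profile. A small sketch; `clean_hrefs` is a hypothetical helper and the sample paths are made up for illustration.

```python
def clean_hrefs(hrefs):
    """Drop missing or empty hrefs and duplicates, keeping first-seen order."""
    seen = {}
    for href in hrefs:
        href = (href or "").strip()
        if href:
            # dicts preserve insertion order, so this deduplicates in order
            seen.setdefault(href, None)
    return list(seen)


# Made-up paths, only to show the behaviour.
print(clean_hrefs(["exhibitors/a", None, "exhibitors/b", "exhibitors/a", "  "]))
```

The result could then be passed to the profile scraper in place of the raw list.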

In [151]:
def get_exhibitors_profile(profile_urls):
    """
    Scrape the main profile of every company.

    Args:
      profile_urls (list): profile URL of each company (appended to BASE_URL)

    Returns:
      companies (list): one dict (name, url, description) per startup/exhibitor

    """

    tic = time.time()
    companies = []
    for url in profile_urls:
        try:
            profile = {}
            driver.get(BASE_URL + url)
            time.sleep(1)  # short pause between page loads
            page = driver.page_source
            soup = BeautifulSoup(page, "lxml")
            profile['name'] = soup.find('h1', class_="m-exhibitor-entry__item__header__infos__title").text
            profile['url'] = soup.find('div', class_="m-exhibitor-entry__item__body__contacts__additional__button__website").find('a').get('href')

            # TO DO : ADD YOUTUBE CHANNEL
            #youtube_channel = soup.find('i', class_="fa fab fa-youtube")
            #youtube_url = youtube_channel.find('a').get('href') if youtube_channel else ""
            #profile['youtube'] = youtube_url

            description = soup.find('div', class_="m-exhibitor-entry__item__body__description")
            description = description.text if description else ""
            profile['description'] = description

            companies.append(profile)
        except Exception:
            # skip profiles that fail to load or lack a name/website element
            continue
    tac = time.time()
    print("Total time to scrape: {}".format(tac - tic))
    return companies
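The profile function builds each address with `BASE_URL + url`, which only works when the scraped `href` has exactly the right shape (no leading slash, not already absolute). A more tolerant option is `urllib.parse.urljoin` from the standard library. The sketch below is an alternative, not what the notebook ran; `absolute_url` is a hypothetical helper and the example paths are made up.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.bigdataworld.fr/"


def absolute_url(href, base=BASE_URL):
    """Resolve a scraped href against the site root."""
    return urljoin(base, href)


# Made-up hrefs covering the three shapes a link can take.
print(absolute_url("exhibitors/acme"))        # relative path
print(absolute_url("/exhibitors/acme"))       # root-relative path
print(absolute_url("https://example.com/x"))  # already absolute, left unchanged
```

By comparison, plain concatenation would turn the second case into `https://www.bigdataworld.fr//exhibitors/acme`.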



# Scrape and save Data to Dataframe

In [143]:
exhibitors_list = get_exhibitors_list(LIST_TO_SURF)
exhibitors_profile_list = get_exhibitors_profile(exhibitors_list)

#Import all data into a Dataframe
df = pd.DataFrame(exhibitors_profile_list, index=[i for i in range(len(exhibitors_profile_list))])
df.info()

Total time to scrape: 362.2736783027649
<class 'pandas.core.frame.DataFrame'>
Int64Index: 45 entries, 0 to 44
Data columns (total 3 columns):
description    45 non-null object
name           45 non-null object
url            45 non-null object
dtypes: object(3)
memory usage: 1.4+ KB


In [149]:
# Export to CSV
df.to_csv("exhibitors_list.csv")
!head exhibitors_list.csv

,description,name,url
0,,4IN DATA,https://www.4indata.com/
1,"
AEKIDEN est un cabinet de conseil spécialisé en Data Governance et Data Management, architecte et réalisateur de la transformation data driven des entreprises et organisations de tous secteurs.
Nos offres reflètent les différentes facettes (culture, organisation, gouvernance, architecture) qui permettent à nos clients de maîtriser leur données et d’en tirer rapidement de la valeur stratégique et opérationnelle, avec une claire vision du pourquoi et du quoi, et une grande expérience du comment.
",Aekiden,http://www.aekiden.com
2,"
AMASAI est un cabinet de conseil en Intelligence Artificielle et Data Science.
Les projets Data et IA sont complexes, et les entreprises manquent de ressources et de compétences en interne pour les piloter. Les prestataires de service en Data Science et Machine Learning disposent de profils techniques très pointus, mais qui maîtrisent mal le business de leurs clients. Cette situat

**Here we are! We now have a CSV file we can browse to check every URL that matches our interests.**

**For trade shows where we don't know whether attending is worthwhile, this tool helps gauge the potential and interest of the exhibition!**
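Checking the CSV for exhibitors that match our interests can also be scripted. Here is a minimal sketch using only the standard `csv` module; `filter_exhibitors` and the keyword list are hypothetical, and the sample rows are abbreviated from the output above. The column names (`description`, `name`, `url`) are the ones written by `df.to_csv`.

```python
import csv
import io


def filter_exhibitors(csv_file, keywords):
    """Return (name, url) for rows whose description mentions any keyword."""
    lowered = [k.lower() for k in keywords]
    matches = []
    for row in csv.DictReader(csv_file):
        description = (row.get("description") or "").lower()
        if any(k in description for k in lowered):
            matches.append((row["name"], row["url"]))
    return matches


# Abbreviated sample in the same layout as exhibitors_list.csv.
sample = io.StringIO(
    ",description,name,url\n"
    "0,,4IN DATA,https://www.4indata.com/\n"
    '1,"Cabinet de conseil en Data Governance",Aekiden,http://www.aekiden.com\n'
)
print(filter_exhibitors(sample, ["data governance"]))

# On the real file (newline="" lets csv handle the multi-line descriptions):
# with open("exhibitors_list.csv", newline="", encoding="utf-8") as f:
#     print(filter_exhibitors(f, ["governance", "machine learning"]))
```

With pandas already loaded, the same filter can be written as `df[df["description"].str.contains("governance", case=False, na=False)]`.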