# Expanding Target Audience by Scraping LinkedIn and Glassdoor
## Name: Kuan-Cheng Fu

## Chapter 1 Introduction
We decided to explore reliable AI or data processes to update the information in a given target audience list and, at the same time, expand the list with more companies in similar industries. However, to be honest, we still haven't figured out the best way to solve the problem. Therefore, in this proposal, I will demonstrate an intuitive method that uses Google's search engine and two scraping tools, Selenium and BeautifulSoup. Although this method is simple and has some limitations, I believe that its concept is applicable to several potential data sources and might inspire our next step.

In [138]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from bs4.element import Tag
from time import sleep
import pandas as pd
import numpy as np

## Chapter 2 Framework
### 2.1 Scraping Glassdoor
In fact, besides LinkedIn, there are several reliable job search websites, such as Glassdoor and Indeed, which contain information about companies around the world. Therefore, in this section, I will focus on scraping information about companies in a certain industry and region from Glassdoor.

Basically, the proposed method consists of three parts, and I will briefly explain each part with an example. For the example, suppose that we would like to expand a target audience list with more companies in the coffee industry across Europe.

First, we will navigate to Google and perform a specific search query. The search query **site:www.glassdoor.com/Overview/ AND "coffee" AND "Europe"** will return 9-10 company profiles on Glassdoor per page. Second, we will extract the URL of each company profile on each page. In this example, we will extract the URLs from only two pages. Finally, we will extract the desired data, including company name, website, size, and industry, from each URL. After that, we will store all the data in a DataFrame named **df_glassdoor**.
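
Because the same query pattern is reused for other sites, it may help to build the query string with a small helper. The following is a minimal sketch; the function name `build_site_query` is an assumption introduced here for illustration and is not part of the original notebook.

```python
def build_site_query(site_prefix, keywords):
    """Compose a Google query limited to one site path, with quoted keywords joined by AND."""
    # Quote each non-empty keyword so Google treats it as an exact phrase
    quoted = ['"{}"'.format(keyword.strip()) for keyword in keywords if keyword.strip()]
    return ' AND '.join(['site:' + site_prefix] + quoted)


# Reproduces the query used in this section
glassdoor_query = build_site_query('www.glassdoor.com/Overview/', ['coffee', 'Europe'])
print(glassdoor_query)
```

The resulting string could then be passed to `send_keys` in Part One in place of the hard-coded query.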

In [443]:
# Part One
driver = webdriver.Chrome('/Users/fufu/desktop/GitHub/Target Audience Expansion/chromedriver') # https://sites.google.com/a/chromium.org/chromedriver/home
url = 'https://www.google.com'
driver.get(url)
sleep(1)

search_query = driver.find_element_by_name('q')
search_query.send_keys('site:www.glassdoor.com/Overview/ AND "coffee" AND "Europe"') # specific search query
sleep(1)

search_query.send_keys(Keys.RETURN)
sleep(1)

In [446]:
# Part Two
whole_links = []
pages = 2
for i in range(pages):
    soup = BeautifulSoup(driver.page_source,'lxml')
    result_div = soup.find_all('div', attrs={'class': 'g'})
    
    links = []
    titles = []
    for r in result_div:
        try:
            link = r.find('a', href=True)
            title = None
            title = r.find('h3')

            if isinstance(title,Tag):
                title = title.get_text()

            # link is a Tag or None and title is a str or None, so test for None rather than ''
            if link is not None and title:
                links.append(link['href'])
                titles.append(title)

        except Exception as e:
            print(e)
            continue
    whole_links.extend(links)
    
    # '下一頁' is the "Next" link text when Google is displayed in Traditional Chinese
    driver.find_element_by_link_text('下一頁').click()
    sleep(0.5)

In [447]:
# Part Three
rows = []
for link in whole_links[:15]:  # scrape at most the first 15 profiles
    driver.get(link)
    soup = BeautifulSoup(driver.page_source,'lxml')
    
    result_name = soup.find_all('div', attrs={'class': 'header cell info'})
    result_info = soup.find_all('div', attrs={'class': 'infoEntity'})

    columns = ["Name", "Website", "Headquarters", "Size", "Part of ", "Founded", "Type", "Industry", "Revenue"]
    info = [result_name[0].find('span').text, "None", "None", "None", "None", "None", "None", "None", "None"]
    
    
    for j in range(len(result_info)):
        if result_info[j].find('label').text == "Website":
            info[columns.index("Website")] = result_info[j].find('a', href=True)['href']
            
        else:
            info[columns.index(result_info[j].find('label').text)] = result_info[j].find('span').text

    rows.append(info)

df_glassdoor = pd.DataFrame(rows, columns=["Name", "Website", "Headquarters", "Size", "Part of ", "Founded", "Type", "Industry", "Revenue"])
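
The label lookup in Part Three assumes that every label found on the page is one of the expected columns. As a hedged alternative, the sketch below fills a row from (label, value) pairs and simply ignores labels it does not know. The helper `build_row`, the stripped column name "Part of", and the "Competitors" label in the example are assumptions made for illustration; the pairs are plain strings, so the parsing step itself is left out.

```python
COLUMNS = ["Name", "Website", "Headquarters", "Size", "Part of",
           "Founded", "Type", "Industry", "Revenue"]


def build_row(name, labelled_values, columns=COLUMNS, missing="None"):
    """Return one table row, using `missing` for every column not found on the page."""
    row = {column: missing for column in columns}
    row["Name"] = name
    for label, value in labelled_values:
        key = label.strip()
        # Unknown labels are skipped instead of raising an error
        if key in row and key != "Name":
            row[key] = value.strip()
    return [row[column] for column in columns]


# "Competitors" is a hypothetical label that is not one of the expected columns
example_row = build_row("illy", [("Website", "http://www.illy.com"),
                                 ("Size", "51 to 200 employees"),
                                 ("Competitors", "Unknown")])
print(example_row)
```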

In [452]:
df_glassdoor.head(10)

| | Name | Website | Headquarters | Size | Part of | Founded | Type | Industry | Revenue |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Urrutia's Estate and Coffee | http://www.upcoffee.com | San Salvador (El Salvador) | 1 to 50 employees | | Unknown | Private Practice / Firm | Food & Beverage Manufacturing | Unknown / Non-Applicable per year |
| 1 | illy | http://www.illy.com | Rye Brook, NY | 51 to 200 employees | | Unknown | Company - Private | Food & Beverage Manufacturing | Unknown / Non-Applicable |
| 2 | Bikeeny Caffe | http://www.bikeeny.com | Malden, MA | 1 to 50 employees | | 2018 | Unknown | Casual Restaurants | Unknown / Non-Applicable |
| 3 | Broadcasting Center Europe | http://www.bce.lu | Luxembourg (Luxembourg) | 201 to 500 employees | | Unknown | Company - Private | Unknown | Unknown / Non-Applicable per year |
| 4 | Investors Europe | http://www.investorseurope.com | Gibraltar (Gibraltar) | 1 to 50 employees | | Unknown | Company - Private | Unknown | Unknown / Non-Applicable per year |
| 5 | Starbucks | http://www.starbucks.com | Seattle, WA | 10000+ employees | | 1971 | Company - Public (SBUX) | Fast-Food & Quick-Service Restaurants | $10+ billion (USD) per year |
| 6 | maxingvest | http://www.maxingvest.de | Hamburg (Germany) | 10000+ employees | | Unknown | Company - Private | Food & Beverage Manufacturing | $5 to $10 billion (USD) per year |
| 7 | Selecta | http://www.selecta.com | Steinhausen (Switzerland) | 10000+ employees | | 1957 | Company - Private | Catering & Food Service Contractors | $1 to $2 billion (USD) per year |
| 8 | GeoSpock | http://www.geospock.com | Cambridge, England (UK) | 1 to 50 employees | | 2013 | Company - Private | Enterprise Software & Network Solutions | Unknown / Non-Applicable |
| 9 | Firmenich | http://www.firmenich.com | Geneva (Switzerland) | 5001 to 10000 employees | | 1895 | Company - Private | Chemical Manufacturing | $2 to $5 billion (USD) per year |


### 2.2 Scraping LinkedIn
In this section, the proposed method and the example are the same as in the previous section. Although LinkedIn limits how many profiles we can scrape per day, I will still try to scrape information about companies in a certain industry and region from LinkedIn in order to test whether the proposed method works on a different data source.

First, we will navigate to Google and perform a specific search query. The search query **site:www.linkedin.com/company/ AND "coffee" AND "Europe"** will return 9-10 company profiles on LinkedIn per page. Second, we will extract the URL of each company profile on each page. In this example, we will again extract the URLs from only two pages. Finally, we will extract the desired data, including company name, website, and industry, from each URL. After that, we will store all the data in a DataFrame named **df_linkedin**.

In [453]:
# Part One
url = 'https://www.google.com'
driver.get(url)
sleep(3)

search_query = driver.find_element_by_name('q')
search_query.send_keys('site:www.linkedin.com/company/ AND "coffee" AND "Europe"')
sleep(0.5)

search_query.send_keys(Keys.RETURN)
sleep(3)

In [454]:
# Part Two
whole_links = []
pages = 2
for i in range(pages):
    soup = BeautifulSoup(driver.page_source,'lxml')
    result_div = soup.find_all('div', attrs={'class': 'g'})
    
    links = []
    titles = []
    for r in result_div:

        try:
            link = r.find('a', href=True)
            title = None
            title = r.find('h3')

            if isinstance(title,Tag):
                title = title.get_text()

            # link is a Tag or None and title is a str or None, so test for None rather than ''
            if link is not None and title:
                links.append(link['href']+'/about')

        except Exception as e:
            print(e)
            continue
    whole_links.extend(links)
    
    # '下一頁' is the "Next" link text when Google is displayed in Traditional Chinese
    driver.find_element_by_link_text('下一頁').click()
    sleep(0.5)
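
Appending `'/about'` directly to each `href` can produce a double slash when the link already ends with `/`, and it also keeps any query string that Google attaches. The sketch below shows one possible way to normalize the links with the standard library before visiting them. The helper names and the `example-coffee` URLs are hypothetical, and the `/about/` path pattern is assumed from the URLs used in this notebook.

```python
from urllib.parse import urlsplit, urlunsplit


def is_company_url(href):
    """Keep only links that point at a LinkedIn company profile."""
    parts = urlsplit(href)
    return parts.netloc.endswith('linkedin.com') and parts.path.startswith('/company/')


def to_about_url(href):
    """Drop the query string and fragment, then point the link at the profile's about page."""
    parts = urlsplit(href)
    path = parts.path.rstrip('/')
    if not path.endswith('/about'):
        path += '/about'
    return urlunsplit((parts.scheme, parts.netloc, path + '/', '', ''))


# Hypothetical search-result links, used for illustration only
hrefs = [
    'https://www.linkedin.com/company/example-coffee/',
    'https://www.linkedin.com/company/example-coffee?trk=search',
    'https://www.google.com/search?q=coffee',
]

about_links = []
for href in hrefs:
    if is_company_url(href):
        about = to_about_url(href)
        if about not in about_links:  # skip duplicates while keeping the original order
            about_links.append(about)
print(about_links)
```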

In [503]:
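# Part Three
import os

driver.get('https://www.linkedin.com')
# Read the credentials from the environment instead of hard-coding them in the notebook
# (the variable names LINKEDIN_USERNAME and LINKEDIN_PASSWORD are illustrative assumptions)
linkedin_username = os.environ['LINKEDIN_USERNAME']
linkedin_password = os.environ['LINKEDIN_PASSWORD']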
username = driver.find_element_by_xpath('//*[@type="text"]')
username.send_keys(linkedin_username)
sleep(0.5)
password = driver.find_element_by_xpath('//*[@type="password"]')
password.send_keys(linkedin_password)
sleep(0.5)
sign_in_button = driver.find_element_by_xpath('//*[@type="submit"]')
sign_in_button.click()
sleep(0.5)

df_linkedin = pd.DataFrame() 
for i in range(len(whole_links)):
    driver.get(whole_links[i])
    sleep(1)
    soup = BeautifulSoup(driver.page_source,'lxml')
    
    name = soup.find_all('h1', attrs={'class': 'org-top-card-summary__title t-24 t-black truncate'})
    result_col = soup.find_all('dt', attrs={'class': 'org-page-details__definition-term t-14 t-black t-bold'})
    result_info = soup.find_all('dd', attrs={'class': 'org-page-details__definition-text t-14 t-black--light t-normal'})
    result_size = soup.find_all('dd', attrs={'class': 'org-about-company-module__company-size-definition-text t-14 t-black--light mb1 fl'})
    
    row = {}
    row['name'] = name[0].find('span').text
    if result_col:
        # 'Company size' is stored in a <dd> with a different class, so it is handled separately below
        temp_col = [col.text.strip() for col in result_col if col.text.strip() != 'Company size']

        for k, col_name in enumerate(temp_col):
            if col_name == 'Phone':
                info = result_info[k].find('span').text.strip()
            else:
                info = result_info[k].text.strip()
            row[col_name] = info

        if result_size:
            row['Company size'] = result_size[0].text.strip()

    
    df_temp = pd.DataFrame([row], columns=row.keys())
    df_linkedin = pd.concat([df_linkedin, df_temp],sort=False).reset_index(drop=True)

df_linkedin['LinekIn_URL'] = whole_links

In [507]:
df_linkedin.head(10)

| | name | Website | Industry | Type | Company size | Headquarters | Founded | Specialties | Phone | LinekIn_URL |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | UCC Europe | http://www.ucc-europe.co.uk/ | Food & Beverages | Privately Held | 501-1,000 employees | | | | | https://www.linkedin.com/company/united-coffee... |
| 1 | SPECIALITY COFFEE ASSOCIATION OF EUROPE | | | | | | | | | https://www.linkedin.com/company/speciality-co... |
| 2 | Starbucks Central & Eastern Europe | http://www.starbucks.com | Restaurants | Public Company | 10,001+ employees | Wroclaw, Poland | | | | https://www.linkedin.com/company/starbucks-cee... |
| 3 | HARIO EUROPE B.V. | https://www.hario-europe.com | Consumer Goods | Privately Held | 201-500 employees | Amstelveen, North Holland | 1921.0 | | | https://www.linkedin.com/company/hario-europe-... |
| 4 | Blue Mountain Coffee (Europe) Limited | http://www.bluemountaincoffeejamaica.com | Wholesale | Privately Held | 2-10 employees | | | Jamaica Blue Mountain Coffee | | https://www.linkedin.com/company/blue-mountain... |
| 5 | European Coffee Trip | http://www.europeancoffeetrip.com | Media Production | Privately Held | 2-10 employees | | 2014.0 | | | https://www.linkedin.com/company/european-coff... |
| 6 | UCC Coffee France | https://ucc-europe.com/fr/ | Food Production | Partnership | 51-200 employees | | 1933.0 | Private labels services, Out-of-Home, Coffee a... | 0475440202 | https://www.linkedin.com/company/ucc-coffee-fr... |
| 7 | Eden Springs | http://www.edensprings.com | Food & Beverages | Public Company | 1,001-5,000 employees | Barcelona, Barcelona | | Drinking water solutions for office and home, ... | | https://www.linkedin.com/company/eden-springs/... |
| 8 | Raja Europe BV – Royal Raja | http://www.raja-europe.com | Glass, Ceramics & Concrete | Privately Held | 2-10 employees | Maastricht, Limburg | 2004.0 | merchandise, licenses, private label, premiums... | 0031434079200 | https://www.linkedin.com/company/raja-europe-b... |
| 9 | UCC Coffee Benelux BV (NL) | http://www.werkenbijucc-coffee.nl | Food & Beverages | Privately Held | 51-200 employees | | 1818.0 | Produceren private label koffie en thee, Produ... | +31(0)515-548333 | https://www.linkedin.com/company/united-coffee... |


## Chapter 3 Future Work
As mentioned in Chapter 1, the proposed method is probably not the best way to solve the problem. However, based on Chapter 2, I believe that the concept is intuitive and applicable to other potential data sources such as Indeed and Monster. Moreover, if we decide to keep developing the proposed method, several issues need further discussion.

1. If we decide to rely on several different data sources, then we probably have to decide whether our final target audience list should be the union or the intersection of the lists from each data source.

2. Since the company information from every data source except LinkedIn might not contain LinkedIn URLs, we probably need to perform specific search queries, including company name, website, and industry, on Google's search engine in order to get the LinkedIn URLs. However, the accuracy might be questionable.

3. According to Chapter 1, another challenging issue is how to update the information in a given target audience list. Similarly, we could still rely on Google's search engine. However, as the previous point illustrates, the accuracy might be questionable, and we might still need to check the updated list manually.
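
To make the first and third issues more concrete, here is a minimal sketch using plain dictionaries. It assumes that two records refer to the same company when their websites share a domain, which is only one possible matching rule and is not something the notebook establishes. The function names, the dictionary keys, and the old "Size" value in the last example are assumptions for illustration; the company names and websites are taken from the tables in Chapter 2.

```python
from urllib.parse import urlsplit


def domain_key(website):
    """Reduce a website URL to a lower-case domain without the leading 'www.'."""
    netloc = urlsplit(website.strip().lower()).netloc
    return netloc[4:] if netloc.startswith('www.') else netloc


def index_by_domain(records):
    """Map each usable domain to the first record that carries it."""
    index = {}
    for record in records:
        key = domain_key(record.get('Website') or '')
        if key and key not in index:
            index[key] = record
    return index


def combine_lists(list_a, list_b, how='union'):
    """Combine two company lists by domain; records from list_a win on overlap."""
    a, b = index_by_domain(list_a), index_by_domain(list_b)
    if how == 'union':
        keys = list(a) + [key for key in b if key not in a]
    elif how == 'intersection':
        keys = [key for key in a if key in b]
    else:
        raise ValueError("how must be 'union' or 'intersection'")
    return [a[key] if key in a else b[key] for key in keys]


def changed_fields(old, new, fields):
    """Report fields whose freshly scraped value is non-empty and differs from the stored one."""
    changes = {}
    for field in fields:
        new_value = new.get(field)
        if new_value not in (None, '') and old.get(field) != new_value:
            changes[field] = (old.get(field), new_value)
    return changes


glassdoor = [{'Name': 'Starbucks', 'Website': 'http://www.starbucks.com'},
             {'Name': 'illy', 'Website': 'http://www.illy.com'}]
linkedin = [{'Name': 'Starbucks Central & Eastern Europe', 'Website': 'http://www.starbucks.com'},
            {'Name': 'HARIO EUROPE B.V.', 'Website': 'https://www.hario-europe.com'}]

union_names = [r['Name'] for r in combine_lists(glassdoor, linkedin, 'union')]
common_names = [r['Name'] for r in combine_lists(glassdoor, linkedin, 'intersection')]
print(union_names, common_names)

# The old 'Size' value below is hypothetical
size_update = changed_fields({'Name': 'illy', 'Size': '1 to 50 employees'},
                             {'Name': 'illy', 'Size': '51 to 200 employees'},
                             ['Size'])
print(size_update)
```

Note that matching by domain treats "Starbucks" and "Starbucks Central & Eastern Europe" as the same company, which illustrates why the accuracy of any automatic matching would still need a manual check.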