# Tutorial 6 - Python for Data Analysis
---
## Selenium

  - [Part 1](#exercise-1) : Installation
  - [Part 2](#exercise-2) : LinkedIn Login & Profile
  - [Part 3](#exercise-3) : LinkedIn Relations
  - [Bonus](#exercise-bonus) : Portail


For Selenium to work, you need to download the "webdriver" that matches your browser.
The steps below are for Chrome, but they are essentially the same for any other web browser.

First step: Check your browser version. In Chrome, open the menu and go to "Help" -> "About Google Chrome".

Second step: Find the webdriver for your browser version here:
https://googlechromelabs.github.io/chrome-for-testing/ (you should have version 118 of Chrome)

Third step: Choose the link for the "chromedriver" binary and your OS (mac, windows) to download the webdriver, then place it, ideally, in your project directory.

Fourth step: Either put the webdriver on your PATH, OR skip this step and give its location when you create the webdriver object. Recent Selenium 4 releases take that location through a Service object; the executable_path argument of webdriver.Chrome used in older tutorials has been removed.
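
The version check and the fourth step can be sketched as follows, assuming Selenium 4 is installed. The helper names `major_version`, `versions_match` and `create_driver` are illustrative and not part of the tutorial, and the second version string in each call is only an example value. The check relies on ChromeDriver sharing Chrome's major version.

```python
def major_version(version: str) -> int:
    """Return the major component of a dotted version string such as '118.0.5993.118'."""
    return int(version.strip().split(".")[0])


def versions_match(browser_version: str, driver_version: str) -> bool:
    """ChromeDriver is expected to share the browser's major version."""
    return major_version(browser_version) == major_version(driver_version)


def create_driver(driver_path=None):
    """Create a Chrome driver, optionally from an explicit chromedriver path.

    Selenium is imported here so the two helpers above stay usable without it.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    if driver_path is None:
        # No explicit path: the driver is resolved from PATH.
        return webdriver.Chrome()
    # Explicit path: pass it through a Service object.
    return webdriver.Chrome(service=Service(executable_path=driver_path))


# Example values only: a matching pair and a mismatching pair.
print(versions_match("118.0.5993.118", "118.0.5993.70"))
print(versions_match("118.0.5993.118", "117.0.5938.149"))
```

With the webdriver file in the project directory, a call such as `create_driver("./chromedriver")` (path assumed) would cover the second option of the fourth step.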

<a name="exercise-1"></a>

### Part 1 : Installation
---

In [78]:
!pip install -U selenium




[notice] A new release of pip available: 22.3.1 -> 23.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [79]:
!pip install lxml






In [80]:
!pip list




Package                   Version
------------------------- ------------
anyio                     4.0.0
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 2.4.0
async-lru                 2.0.4
attrs                     23.1.0
Babel                     2.13.0
backcall                  0.2.0
beautifulsoup4            4.12.2
bleach                    6.1.0
certifi                   2023.7.22
cffi                      1.16.0
charset-normalizer        3.3.0
colorama                  0.4.6
comm                      0.1.4
contourpy                 1.1.1
cycler                    0.12.1
debugpy                   1.8.0
decorator                 5.1.1
defusedxml                0.7.1
executing                 2.0.0
fastjsonschema            2.18.1
fonttools                 4.43.1
fqdn                      1.5.1
h11                       0.14.0
idna                      3.4
ipykernel                 6.25.2
ipython           

In [81]:
!pip install beautifulsoup4






In [82]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

Create your first driver!

Install the browser driver that matches your browser version: https://www.selenium.dev/documentation/webdriver/getting_started/install_drivers/

Place the executable in the directory from which you run JupyterLab.
For the latest Selenium documentation: https://www.selenium.dev/documentation/overview/

In [83]:
driver = webdriver.Chrome()

KeyboardInterrupt: 

In [None]:
driver.get("https://www.linkedin.com")

<a name="exercise-2"></a>

### Part 2 : LinkedIn Login
---

1) Inspect the LinkedIn page to find the elements corresponding to the login fields, then fill them in

In [None]:
import getpass

2) Find two ways of logging in: with the submit button and with keys.

In [None]:
username = input("Enter LinkedIn username (email): ")
password = getpass.getpass("Enter LinkedIn password: ")

login_field = driver.find_element(By.ID, "session_key")
login_field.send_keys(username)

password_field = driver.find_element(By.ID, "session_password")
password_field.send_keys(password)
password_field.send_keys(Keys.RETURN)
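
The cell above covers the "keys" way. A minimal sketch of the submit-button way could look like the following. The name `login_with_button` and the default `submit_selector` are assumptions: inspect the page to confirm the real selector of the button. The locator strings "id" and "css selector" are the values behind `By.ID` and `By.CSS_SELECTOR`, written out so the function only needs a driver object.

```python
def login_with_button(driver, username, password,
                      submit_selector="button[type='submit']"):
    """Fill both login fields, then click the submit button instead of pressing RETURN.

    `submit_selector` is an assumption; check it against the actual page.
    """
    # "id" and "css selector" are the string values of By.ID and By.CSS_SELECTOR.
    driver.find_element("id", "session_key").send_keys(username)
    driver.find_element("id", "session_password").send_keys(password)
    driver.find_element("css selector", submit_selector).click()
```

It would be called as `login_with_button(driver, username, password)` in place of the three `send_keys` lines that end with `Keys.RETURN`.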


3) Find the button that leads to your profile page and automate the click.

In [None]:
element = driver.find_element(By.CSS_SELECTOR, "a[href='/in/alex-szpakiewicz/']")
element.click()

4) Use a script to scroll to the bottom of the page so that all of its elements get loaded (execute_script function)

In [None]:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
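
A single scroll may not be enough when the page loads more content as you go down. One possible approach, sketched here, is to repeat the scroll until the page height stops growing. The name `scroll_to_bottom`, the pause length and the `max_rounds` cap are assumptions to tune for your connection and page.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=20):
    """Scroll down until the page height stops growing; return the number of scrolls done."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was loaded: the bottom has been reached.
            return rounds
        last_height = new_height
    return max_rounds
```

The cap keeps the loop from running forever on a page that keeps growing.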

5) Use BeautifulSoup to parse the HTML of the page you have just scrolled

In [None]:
from bs4 import BeautifulSoup
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')


6.1) Using the section ID, extract the part of the HTML that corresponds to your experiences into a variable named "section"

In [None]:
section = soup.find('section', id='ember478')
print(section)

6.2) As you can see, the section ID changes whenever you refresh the page. What can you do to always find the section again? Create a function that solves this.

In [None]:
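def find_section(soup):
    # The div with id "education" keeps the same id across page refreshes,
    # unlike the generated "ember..." id of the section that contains it
    education_div = soup.find('div', id='education')

    if education_div:
        # Walk up to the enclosing <section> element
        section = education_div.find_parent('section')
        return section

    # The anchor div was not found on this page
    return None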

In [None]:
section = find_section(soup)
print(section)

7) From this extracted part, pull out each experience separately, and for each one the company, the period of time, the description... using find_all (you may get duplicated data, but we will filter it afterwards)

In [None]:
def extract_education_details(soup, section):
    details = []
    schools = section.find_all('li', class_="artdeco-list__item pvs-list__item--line-separated pvs-list__item--one-column")
    for school in schools:
        school_name_element = school.find('span', class_='visually-hidden')
        school_name = school_name_element.text if school_name_element else None

        period_element = school.find_all('span', class_='t-14 t-normal t-black--light')
        period = period_element[0].text if period_element else None

        description_element = school.find('div', class_='pv-shared-text-with-see-more')
        description = description_element.text.strip() if description_element else None

        details.append({
            'School/Company': school_name,
            'Period': period,
            'Description/Skills': description
        })

    return details

In [None]:
education_details = extract_education_details(soup, section)
for detail in education_details:
    print(detail)

8) Create a format_data function that goes through the list and keeps only one entry per title

In [None]:
def format_data(data_list):
    seen_titles = set() 
    formatted_data = [] 

    for entry in data_list:
        title = entry.get('School/Company')
        
        if title not in seen_titles:
            seen_titles.add(title)
            formatted_data.append(entry)

    return formatted_data

In [None]:
deduplicated_data = format_data(education_details)
for detail in deduplicated_data:
    print(detail)
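
Titles scraped from HTML often differ only by case or stray whitespace, and two entries can share a title while covering different periods. As a possible variant of format_data, the sketch below compares normalized values over a chosen set of fields. The name `dedupe_by` and the sample records are made up for illustration.

```python
def dedupe_by(records, key_fields):
    """Keep the first record for each combination of `key_fields`,
    ignoring case and extra whitespace when comparing values."""
    seen = set()
    unique = []
    for record in records:
        # Collapse whitespace and lowercase each field; missing values count as "".
        key = tuple(
            " ".join(str(record.get(field) or "").split()).lower()
            for field in key_fields
        )
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Made-up records: the second differs from the first only by case and spacing.
sample = [
    {"School/Company": "School A", "Period": "2021 - 2023"},
    {"School/Company": "  school a ", "Period": "2021 - 2023"},
    {"School/Company": "School A", "Period": "2019 - 2021"},
]
print(len(dedupe_by(sample, ["School/Company"])))            # 1
print(len(dedupe_by(sample, ["School/Company", "Period"])))  # 2
```

Keying on the title alone reproduces the behaviour asked for in the exercise; adding "Period" keeps two stays at the same school or company.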

9) Create a DataFrame named my_linkedin_data and load all of that data into it.

In [None]:
!pip install pandas

In [None]:
import pandas as pd

In [None]:
my_linkedin_data = pd.DataFrame(deduplicated_data)
print(my_linkedin_data.head())

<a name="exercise-3"></a>

### Part 3 : LinkedIn Relations
---

1) Click on the "More relations" button to open your list of relations

In [86]:
more_relations = driver.find_element(By.CSS_SELECTOR, "a[href='/mynetwork/network-manager/people-follow/followers/']")
more_relations.click()

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"a[href='/mynetwork/network-manager/people-follow/followers/']"}
  (Session info: chrome=118.0.5993.118); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
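
A NoSuchElementException like the one above usually means either that the selector does not match anything on the current page, or that the element had not been rendered yet when the lookup ran. For the second case, Selenium provides `WebDriverWait` together with `expected_conditions`. The library-free sketch below only illustrates the underlying idea of retrying a lookup until a deadline; the name `retry_until` and its default values are assumptions.

```python
import time


def retry_until(action, timeout=10.0, poll=0.5, exceptions=(Exception,)):
    """Call `action` until it returns without raising one of `exceptions`,
    or re-raise the last error once `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return action()
        except exceptions:
            if time.monotonic() >= deadline:
                raise  # give up and surface the last error
            time.sleep(poll)  # wait a little before trying again
```

With a live driver it could be used as `retry_until(lambda: driver.find_element(By.CSS_SELECTOR, selector), exceptions=(NoSuchElementException,))`, where `selector` is whatever you found by inspecting the page. If the error persists after the timeout, the selector itself is the likely culprit.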


2) Extract the names and the jobs of your relations on the page.

In [90]:
html_relation = driver.page_source
soup_relation = BeautifulSoup(html_relation, 'html.parser')
# Each relation is rendered in its own "entity-result" block
relation_blocks = soup_relation.find_all('div', class_='entity-result')
relations_data = []

for block in relation_blocks:
    name_tag = block.find('span', class_='entity-result__title-text t-16')
    name = name_tag.get_text(strip=True) if name_tag else None
    
    job_tag = block.find('div', class_='entity-result__primary-subtitle t-14 t-black t-normal')
    job = job_tag.get_text(strip=True) if job_tag else None
    
    relations_data.append({
        'Name': name,
        'Job/Education': job
    })

for relation in relations_data:
    print(f"Name: {relation['Name']}")
    print(f"Job/Education: {relation['Job/Education']}\n")

Name: Sébastien MOINE
Job/Education: Master 1 Data & Intelligence Artificielle - École supérieure d'ingénieurs Léonard-de-Vinci

Name: Alan Weismann
Job/Education: Student at ESILV - Master's Degree of Engineering Energy and Sustainable Cities

Name: Ahmed Mili
Job/Education: Troisième année à L'ESILV

Name: Paul Vannesson Fauque
Job/Education: Etudiant à l'ESILV

Name: Antonin Dussart
Job/Education: Etudiant M1 à l'ESILV - Ecole Supérieure d'Ingénieurs Léonard de Vinci. Association automobile Vinci Eco Drive.

Name: Melchior Thierry
Job/Education: Master's student in IT, IoT & Security at ESILV (Ecole Supérieure d'Ingénieurs Léonard de Vinci)

Name: Olivier Linot
Job/Education: Délégué général

Name: Claire Brisbart
Job/Education: Product Owner at Dassault Systèmes | 💻 Student at IIM

Name: Solène Depret
Job/Education: 👩‍💻 Responsable communication et gestion de projet | Innovation, Recrutement, Formation, Networking | Hussar Academy

Name: Yvon MOYSAN
Job/Education: CEO at Hussar Aca

3) Find a way to navigate automatically between pages, extract all names, jobs and locations, and store them in a DataFrame

Personally, I have only one page, but I will try to implement it anyway

In [None]:
import time

In [95]:
from selenium.common.exceptions import NoSuchElementException

relations_data = []
while True:
    # Parse the relations displayed on the current page
    html_relation = driver.page_source
    soup_relation = BeautifulSoup(html_relation, 'html.parser')
    relation_blocks = soup_relation.find_all('div', class_='entity-result')
    
    for block in relation_blocks:
        name_tag = block.find('span', class_='entity-result__title-text t-16')
        name = name_tag.get_text(strip=True) if name_tag else None
    
        job_tag = block.find('div', class_='entity-result__primary-subtitle t-14 t-black t-normal')
        job = job_tag.get_text(strip=True) if job_tag else None
    
        relations_data.append({
            'Name': name,
            'Job/Education': job
        })

    
    try:
        # "selector_for_next_button" is a placeholder: replace it with the real selector of the "Next" button
        next_button = driver.find_element(By.CSS_SELECTOR, "selector_for_next_button")
        next_button.click()
        time.sleep(2)  # let the next page load before reading page_source again
    except NoSuchElementException:
        # No "Next" button found: this was the last page
        break


df = pd.DataFrame(relations_data)
df.to_csv('linkedin_relations.csv', index=False)
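
The loop above stops only when the "Next" button cannot be found, so a button that stays on the page (for instance in a disabled state) would keep it running. One possible way to guard against that, sketched here, is to separate parsing from navigation and cap the number of pages. The names `collect_all_pages`, `parse_page`, `go_to_next_page` and the default cap are assumptions.

```python
def collect_all_pages(parse_page, go_to_next_page, max_pages=50):
    """Gather rows page by page until there is no next page or `max_pages` is reached.

    parse_page():      returns the list of rows found on the current page.
    go_to_next_page(): moves to the next page and returns True,
                       or returns False when there is none.
    """
    rows = []
    for _ in range(max_pages):
        rows.extend(parse_page())
        if not go_to_next_page():
            break
    return rows
```

Here `parse_page` would wrap the BeautifulSoup extraction of names and jobs, and `go_to_next_page` would wrap the click on the "Next" button, returning False when NoSuchElementException is raised. The result can be passed to `pd.DataFrame` as before.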