# 2.2 - Web Scraping (bs4)


![scraping](images/scraping.png)


Web scraping is a technique in which software programs extract information from websites. These programs usually simulate a human browsing the web, either by issuing HTTP requests directly or by embedding a browser in an application.

Web scraping is closely related to web indexing, in which a bot crawls and indexes the content of the web; most search engines rely on this technique. Web scraping, however, focuses on turning unstructured web data, such as HTML, into structured data that can be stored and analyzed in a central database, a spreadsheet, or some other storage system. Typical uses include comparing prices across stores, monitoring weather data for a given region, detecting changes on websites, and integrating data from several websites.

In recent years web scraping has become widely used in search engine optimization (SEO), thanks to its ability to gather large amounts of data for creating quality content.

We can think of web scraping as a last resort when no API or RSS feed is available. Without a proper data source, we can always extract whatever is shown on screen.

### Extracting from the HTML

To scrape a web page, we first need to know how its HTML is structured. Let's look at the basic structure.

HTML consists of `<tagged>` content. Tags act like boxes of content, organized hierarchically:

```
<html>
    <head>
        <title>Page title</title>
    </head>
    <body>
        <h1>Heading</h1>
        <p>Paragraph</p>
    </body>
</html>
```
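
As a minimal sketch of how this hierarchy can be walked in code, the snippet below parses a copy of the example document with BeautifulSoup's built-in `html.parser` and moves between parent and child tags. The `doc` string and the variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# A copy of the example document, with placeholder text.
doc = """
<html>
    <head>
        <title>Page title</title>
    </head>
    <body>
        <h1>Heading</h1>
        <p>Paragraph</p>
    </body>
</html>
"""

soup = BeautifulSoup(doc, "html.parser")

# Attribute-style access returns the first matching descendant tag.
title_text = soup.head.title.text
heading_text = soup.body.h1.text

# Each tag knows its parent, so the hierarchy can be walked upwards too.
parent_name = soup.p.parent.name

# Direct child tags of <body>, skipping the whitespace strings between them.
body_children = [child.name for child in soup.body.find_all(True, recursive=False)]

print(title_text, heading_text, parent_name, body_children)
```

Each tag is a box that can be opened to reach the boxes inside it, which is the idea the rest of the examples rely on.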


HTML tags can be classified into several groups, depending on the type of content they hold. Some examples:

+ heading: `<h1>`, `<h2>`, `<h3>`, `<hgroup>`...
+ text: `<b>`, `<p>`...
+ embedded: `<audio>`, `<img>`, `<video>`...
+ tabular: `<table>`, `<tr>`, `<td>`, `<tbody>`...
+ sections: `<header>`, `<section>`, `<article>`...
+ metadata: `<meta>`, `<title>`, `<script>`...


Tags can have attributes. For example:

`<div class="text-monospace" id="name_132" href="www.example.com"> Page content </div>`

This `div` tag has the following attributes:

+ class: attribute with value "text-monospace". A class is not unique within the page; many tags can share it.
+ id: attribute with value "name_132". An id is meant to identify one tag uniquely within the page.
+ href: attribute with value "www.example.com". An href usually holds a link to another page or resource. It normally appears on `<a>` tags and is placed on the `div` here only for illustration.

Continuing with the box analogy: if an HTML tag is a box, its attributes are the stickers on the lid.
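
The following sketch shows one way to read those stickers with BeautifulSoup. The `snippet` string is a made-up variation of the tag above (with the link moved to an `<a>` tag and a second class added), so every name in it is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a div with two classes and an id, wrapping a link.
snippet = (
    '<div class="text-monospace small" id="name_132">'
    '<a href="https://www.example.com">Page content</a>'
    '</div>'
)

soup = BeautifulSoup(snippet, "html.parser")
div = soup.find("div")

# .attrs is a plain dict holding every attribute of the tag.
print(div.attrs)

# class is multi-valued, so BeautifulSoup returns a list of class names.
classes = div["class"]

# id is single-valued and comes back as a string.
tag_id = div["id"]

# .get() returns None instead of raising KeyError when the attribute is missing.
missing = div.get("href")
link = div.find("a").get("href")
```

Indexing with `tag["name"]` is convenient when the attribute is known to exist, while `tag.get("name")` is the safer choice for optional attributes.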

Once we know what content we want to extract, we have to find the corresponding tags in the page's HTML.

We will use **[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)**.

In [None]:
%pip install beautifulsoup4

In [1]:
import requests as req

from bs4 import BeautifulSoup as bs   # both aliases are my own choice

### Wikipedia examples

**[European countries by life expectancy](https://en.wikipedia.org/wiki/List_of_European_countries_by_life_expectancy)**

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_European_countries_by_life_expectancy'

In [5]:
# use requests to download the HTML


html = req.get(url).content    # bytes; use .text to get a str

html[:1000]

b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>List of European countries by life expectancy - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabl

In [6]:
len(html)

458048
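
The call above assumes the request succeeds. As a hedged sketch, the download can be wrapped in a small helper that sets a timeout and fails loudly on HTTP errors; `fetch_html` and its `timeout` default are hypothetical names chosen for this example, not part of the notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx answers instead of parsing an error page.
    response.raise_for_status()
    # .content holds the raw bytes; .text is the same body decoded to str.
    return response.text


# Usage (performs a real network request):
# html = fetch_html('https://en.wikipedia.org/wiki/List_of_European_countries_by_life_expectancy')
```

Without a timeout, `requests.get` can wait indefinitely on an unresponsive server, which is easy to miss inside a notebook.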

In [8]:
# parse (turn the HTML string into a navigable tree of tags)


soup = bs(html, 'html.parser')

type(soup)

bs4.BeautifulSoup

In [12]:
len(soup.find_all('table'))

14

In [15]:
tabla = soup.find_all('table')[2]

type(tabla)

bs4.element.Tag

In [19]:
tabla.find_all('tr')[0].text.split('\n')

['',
 'Countries',
 '',
 'Life expectancy at birth',
 '',
 'HALE at birth',
 '',
 'Life expectancy at age 60',
 '',
 'HALE at age 60',
 '',
 '',
 '']

In [29]:
tabla.find_all('tr')[10].find_all('td')[0].text.strip()

'Luxembourg'

In [39]:
filas = tabla.find_all('tr')

filas = [f.text.split('\n') for f in filas][3:]

filas[0]

['',
 '\xa0Switzerland',
 '83.4',
 '81.8',
 '85.1',
 '3.3',
 '3.7',
 '72.5',
 '72.2',
 '72.8',
 '0.6',
 '3.2',
 '25.4',
 '24.1',
 '26.6',
 '2.5',
 '2.4',
 '19.5',
 '18.8',
 '20.2',
 '1.4',
 '1.8',
 '',
 '']

In [40]:
# cleaning: drop the empty strings left by split('\n')

final = []


for f in filas:
    
    tmp = []
    
    for palabra in f:
        
        if palabra!='':
            tmp.append(palabra)
        else:
            pass
        
    final.append(tmp)
    
final[:2]

[['\xa0Switzerland',
  '83.4',
  '81.8',
  '85.1',
  '3.3',
  '3.7',
  '72.5',
  '72.2',
  '72.8',
  '0.6',
  '3.2',
  '25.4',
  '24.1',
  '26.6',
  '2.5',
  '2.4',
  '19.5',
  '18.8',
  '20.2',
  '1.4',
  '1.8'],
 ['\xa0Spain',
  '83.2',
  '80.7',
  '85.7',
  '5.0',
  '4.1',
  '72.1',
  '71.3',
  '72.9',
  '1.6',
  '3.0',
  '25.4',
  '23.3',
  '27.3',
  '4.0',
  '2.7',
  '19.2',
  '18.0',
  '20.3',
  '2.3',
  '1.9']]

In [42]:
final = [[palabra for palabra in f if palabra!=''] for f in filas]

final[0]

['\xa0Switzerland',
 '83.4',
 '81.8',
 '85.1',
 '3.3',
 '3.7',
 '72.5',
 '72.2',
 '72.8',
 '0.6',
 '3.2',
 '25.4',
 '24.1',
 '26.6',
 '2.5',
 '2.4',
 '19.5',
 '18.8',
 '20.2',
 '1.4',
 '1.8']
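
Splitting the row text on `'\n'` works, but it leaves empty strings and the non-breaking space (`\xa0`) in front of each country. A possible alternative, sketched below on a tiny hand-written table (the `sample` markup is an assumption, not the Wikipedia page), is to read each row cell by cell.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table with the same quirks as the real one.
sample = """
<table>
  <tr><th>Countries</th><th>Life expectancy at birth</th></tr>
  <tr><td>&nbsp;Switzerland</td><td>83.4</td></tr>
  <tr><td>&nbsp;Spain</td><td>83.2</td></tr>
</table>
"""

table = BeautifulSoup(sample, "html.parser").find("table")

rows = []
for tr in table.find_all("tr"):
    # get_text(strip=True) trims surrounding whitespace, including \xa0.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

header, body = rows[0], rows[1:]
print(header)
print(body)
```

Because every cell is taken from its own `<th>` or `<td>` tag, no separate cleaning pass over empty strings is needed.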

In [46]:
import pandas as pd

df = pd.DataFrame(final)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 21 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       47 non-null     object
 1   1       47 non-null     object
 2   2       47 non-null     object
 3   3       47 non-null     object
 4   4       47 non-null     object
 5   5       47 non-null     object
 6   6       47 non-null     object
 7   7       47 non-null     object
 8   8       47 non-null     object
 9   9       47 non-null     object
 10  10      47 non-null     object
 11  11      47 non-null     object
 12  12      47 non-null     object
 13  13      47 non-null     object
 14  14      47 non-null     object
 15  15      47 non-null     object
 16  16      47 non-null     object
 17  17      47 non-null     object
 18  18      47 non-null     object
 19  19      47 non-null     object
 20  20      47 non-null     object
dtypes: object(21)
memory usage: 7.8+ KB
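
Every column comes out as `object` because the scraped cells are strings. The sketch below shows one way to convert them, using two shortened rows shaped like the scraped ones; the column names `country`, `value_1`, and `value_2` are placeholders invented for the example.

```python
import pandas as pd

# Two rows shaped like the scraped ones, shortened to three columns.
rows = [['\xa0Switzerland', '83.4', '81.8'],
        ['\xa0Spain', '83.2', '80.7']]

# Placeholder column names; the real table has 21 columns.
df = pd.DataFrame(rows, columns=['country', 'value_1', 'value_2'])

# Remove the non-breaking space that precedes each country name.
df['country'] = df['country'].str.replace('\xa0', '', regex=False).str.strip()

# Convert the remaining columns to floats; unparseable cells become NaN.
numeric_cols = list(df.columns[1:])
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')

print(df.dtypes)
```

With `errors='coerce'`, placeholder cells such as a dash turn into `NaN` instead of raising an error.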


In [48]:
# pandas only: build a DataFrame from a table copied to the clipboard

df = pd.read_clipboard()     # equivalent to pasting with Ctrl+V

df

Unnamed: 0_level_0,—,—.1,—.2,—.3,85.9,84.3,87.7,3.4,Unnamed: 9
Monaco,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Switzerland,83.8,81.9,85.6,3.7,84.0,82.0,85.9,3.9,0.2
Malta,82.5,80.7,84.3,3.6,83.8,81.4,86.1,4.7,1.3
Liechtenstein,80.7,—,—,—,83.3,81.1,85.4,4.3,2.6
Norway,82.4,80.4,84.4,4.0,83.2,81.6,84.9,3.3,0.8
Sweden,82.8,81.0,84.6,3.6,83.0,81.1,84.9,3.8,0.2
Spain,83.6,80.8,86.2,5.4,83.0,80.2,85.8,5.6,−0.6
Italy,83.5,81.3,85.5,4.2,82.9,80.5,85.1,4.6,−0.6
Iceland,83.0,81.5,84.5,3.0,82.7,81.2,84.2,3.0,−0.3
Luxembourg,82.3,80.2,84.3,4.1,82.6,80.4,84.8,4.4,0.3
France,82.7,79.7,85.5,5.8,82.5,79.4,85.5,6.1,−0.2


In [49]:
pd.read_clipboard()

Unnamed: 0,pd.read_clipboard()


### Example: IP geolocation

https://tools.keycdn.com/geo

**Where am I?**

In [50]:
url = 'https://tools.keycdn.com/geo?host=2.136.118.161'

In [51]:
html = req.get(url).text

sopa = bs(html, 'html.parser')

In [54]:
len(sopa.find_all('div'))

62

In [55]:
sopa.find('div', id='geoResult')

<div class="mt-4" id="geoResult">
<div class="bg-light medium rounded p-3">
<p class="small text-uppercase text-muted font-weight-semi-bold line-height-headings mb-2">Location</p> <dl class="row mb-0">
<dt class="col-4">Country</dt><dd class="col-8 text-monospace">Spain (ES)</dd><dt class="col-4">Continent</dt><dd class="col-8 text-monospace">Europe (EU)</dd><dt class="col-4">Coordinates</dt><dd class="col-8 text-monospace">40.4172 (lat) / -3.684 (long)</dd><dt class="col-4">Time</dt><dd class="col-8 text-monospace">2023-10-31 12:41:09 (Europe/Madrid)</dd> </dl>
<p class="small text-uppercase text-muted font-weight-semi-bold line-height-headings mt-4 mb-2">Network</p>
<dl class="row mb-0">
<dt class="col-4">IP address</dt><dd class="col-8 text-monospace">2.136.118.161</dd><dt class="col-4">Hostname</dt><dd class="col-8 text-monospace">161.red-2-136-118.staticip.rima-tde.net</dd><dt class="col-4">Provider</dt><dd class="col-8 text-monospace">Telefonica De Espana S.a.u.</dd><dt class="co

In [58]:
sopa.find('div', class_='mt-4')

<div class="col-xl-2 col-md-3 d-flex align-items-end mt-md-0 mt-4">
<button class="btn btn-primary d-flex justify-content-center align-items-center w-md-100" id="geoBtn">Find</button>
</div>
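
The search by `class_='mt-4'` returned the button container rather than the result block because `class` is a multi-valued attribute: any tag whose class list contains `mt-4` matches, and `find` returns the first one in document order. The sketch below reproduces this on a reduced, hand-written version of the page (the `page` string is an assumption).

```python
from bs4 import BeautifulSoup

# Reduced stand-in for the page: two divs that both carry the class mt-4.
page = """
<div class="col-md-3 mt-md-0 mt-4"><button id="geoBtn">Find</button></div>
<div class="mt-4" id="geoResult"><p>Location</p></div>
"""

soup = BeautifulSoup(page, "html.parser")

# class_ matches any tag whose class list contains 'mt-4';
# find() returns the first one in document order.
first = soup.find("div", class_="mt-4")

# find_all() shows that both divs carry the class.
matches = soup.find_all("div", class_="mt-4")

# An id is expected to be unique, so it pins down the intended tag.
result = soup.find("div", id="geoResult")

print(first.button["id"], len(matches), result.p.text)
```

When a tag has an id, searching by id alone is usually enough; adding the class only narrows the match further.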

In [59]:
sopa.find('div', {'id': 'geoResult', 'class': 'mt-4'})

<div class="mt-4" id="geoResult">
<div class="bg-light medium rounded p-3">
<p class="small text-uppercase text-muted font-weight-semi-bold line-height-headings mb-2">Location</p> <dl class="row mb-0">
<dt class="col-4">Country</dt><dd class="col-8 text-monospace">Spain (ES)</dd><dt class="col-4">Continent</dt><dd class="col-8 text-monospace">Europe (EU)</dd><dt class="col-4">Coordinates</dt><dd class="col-8 text-monospace">40.4172 (lat) / -3.684 (long)</dd><dt class="col-4">Time</dt><dd class="col-8 text-monospace">2023-10-31 12:41:09 (Europe/Madrid)</dd> </dl>
<p class="small text-uppercase text-muted font-weight-semi-bold line-height-headings mt-4 mb-2">Network</p>
<dl class="row mb-0">
<dt class="col-4">IP address</dt><dd class="col-8 text-monospace">2.136.118.161</dd><dt class="col-4">Hostname</dt><dd class="col-8 text-monospace">161.red-2-136-118.staticip.rima-tde.net</dd><dt class="col-4">Provider</dt><dd class="col-8 text-monospace">Telefonica De Espana S.a.u.</dd><dt class="co

In [60]:
tabla = sopa.find('div', {'id': 'geoResult', 'class': 'mt-4'})

In [62]:
tabla.find_all('dd')

[<dd class="col-8 text-monospace">Spain (ES)</dd>,
 <dd class="col-8 text-monospace">Europe (EU)</dd>,
 <dd class="col-8 text-monospace">40.4172 (lat) / -3.684 (long)</dd>,
 <dd class="col-8 text-monospace">2023-10-31 12:41:09 (Europe/Madrid)</dd>,
 <dd class="col-8 text-monospace">2.136.118.161</dd>,
 <dd class="col-8 text-monospace">161.red-2-136-118.staticip.rima-tde.net</dd>,
 <dd class="col-8 text-monospace">Telefonica De Espana S.a.u.</dd>,
 <dd class="col-8 text-monospace">3352</dd>]

In [64]:
type(tabla.find_all('dd')[0])

bs4.element.Tag

In [66]:
tabla.find_all('dd')[0].text

'Spain (ES)'

In [67]:
type(tabla.find_all('dd')[0].text)

str

In [68]:
[e.text for e in tabla.find_all('dd')]

['Spain (ES)',
 'Europe (EU)',
 '40.4172 (lat) / -3.684 (long)',
 '2023-10-31 12:41:09 (Europe/Madrid)',
 '2.136.118.161',
 '161.red-2-136-118.staticip.rima-tde.net',
 'Telefonica De Espana S.a.u.',
 '3352']

In [70]:
tabla.find_all('dt')[0].text

'Country'

In [71]:
[e.text for e in tabla.find_all('dt')]

['Country',
 'Continent',
 'Coordinates',
 'Time',
 'IP address',
 'Hostname',
 'Provider',
 'ASN']

In [73]:
conexion = dict(zip([e.text for e in tabla.find_all('dt')], 
                     [e.text for e in tabla.find_all('dd')]))

In [74]:
conexion

{'Country': 'Spain (ES)',
 'Continent': 'Europe (EU)',
 'Coordinates': '40.4172 (lat) / -3.684 (long)',
 'Time': '2023-10-31 12:41:09 (Europe/Madrid)',
 'IP address': '2.136.118.161',
 'Hostname': '161.red-2-136-118.staticip.rima-tde.net',
 'Provider': 'Telefonica De Espana S.a.u.',
 'ASN': '3352'}

In [75]:
# wrap the whole process in a function that takes an IP address

def encontrar(ip):
    
    url = f'https://tools.keycdn.com/geo?host={ip}'
    
    html = req.get(url).text

    sopa = bs(html, 'html.parser')
    
    tabla = sopa.find('div', {'id': 'geoResult', 'class': 'mt-4'})
    
    conexion = dict(zip([e.text for e in tabla.find_all('dt')], 
                     [e.text for e in tabla.find_all('dd')]))
    
    return conexion

In [76]:
encontrar('2.136.118.161')

{'Country': 'Spain (ES)',
 'Continent': 'Europe (EU)',
 'Coordinates': '40.4172 (lat) / -3.684 (long)',
 'Time': '2023-10-31 12:48:07 (Europe/Madrid)',
 'IP address': '2.136.118.161',
 'Hostname': '161.red-2-136-118.staticip.rima-tde.net',
 'Provider': 'Telefonica De Espana S.a.u.',
 'ASN': '3352'}

In [77]:
encontrar('168.123.4.5')

{'City': 'Barrigada Village',
 'Postal code': '96921',
 'Country': 'Guam (GU)',
 'Continent': 'Oceania (OC)',
 'Coordinates': '13.4593 (lat) / 144.7942 (long)',
 'Time': '2023-10-31 21:48:12 (Pacific/Guam)',
 'IP address': '168.123.4.5',
 'Hostname': '168.123.4.5',
 'Provider': 'UNIVERSITY-GUAM',
 'ASN': '395400'}

In [78]:
encontrar('46.222.34.200')

{'City': 'Madrid',
 'Region': 'Madrid (M)',
 'Postal code': '28009',
 'Country': 'Spain (ES)',
 'Continent': 'Europe (EU)',
 'Coordinates': '40.4169 (lat) / -3.6841 (long)',
 'Time': '2023-10-31 12:48:20 (Europe/Madrid)',
 'IP address': '46.222.34.200',
 'Hostname': '46.222.34.200',
 'Provider': 'Xtra Telecom S.A.',
 'ASN': '15704'}
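
The three results above do not share the same keys (`City`, `Region`, and `Postal code` appear only for some addresses), and `sopa.find(...)` returns `None` if the result block is missing. A more defensive variant might look like the sketch below, where `parse_geo` is a hypothetical helper that works on an HTML string so it can be tried without a network request.

```python
from bs4 import BeautifulSoup


def parse_geo(html):
    """Return the dt/dd pairs of the geoResult block as a dict (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find("div", id="geoResult")
    if block is None:
        # The block is absent, e.g. on an error page or after a layout change.
        return {}
    labels = [dt.get_text(strip=True) for dt in block.find_all("dt")]
    values = [dd.get_text(strip=True) for dd in block.find_all("dd")]
    return dict(zip(labels, values))


# Hand-written sample with only two fields.
sample = ('<div class="mt-4" id="geoResult"><dl>'
          '<dt>Country</dt><dd>Spain (ES)</dd>'
          '<dt>ASN</dt><dd>3352</dd>'
          '</dl></div>')

info = parse_geo(sample)

# Fields such as 'City' are not always present, so read them with .get().
city = info.get("City", "unknown")
print(info, city)
```

Returning an empty dict is one possible convention; raising an exception would also be reasonable if a missing block should stop the program.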

### Example: LinkedIn

In [79]:
url = 'https://www.linkedin.com/jobs/search/?currentJobId=3750825763&f_E=4&keywords=Analista%20de%20datos&origin=JOB_SEARCH_PAGE_JOB_FILTER&refresh=true'

In [80]:
html = req.get(url).text

sopa = bs(html, 'html.parser')

In [85]:
tarjetas = sopa.find('ul', class_='jobs-search__results-list')

In [86]:
type(tarjetas)

bs4.element.Tag

In [88]:
len(tarjetas.find_all('li'))

25

In [113]:
curro = tarjetas.find_all('li')[0]


titulo = curro.find('span').text.strip()

empresa = curro.find('h4').text.strip()

link_curro = curro.find('a').attrs['href']

link_comp = curro.find('h4').find('a').attrs['href']

pais = curro.find('span', class_="job-search-card__location").text.strip()

fecha = curro.find('time').attrs['datetime']

In [114]:
{'titulo': titulo,

'empresa': empresa,

'link_curro': link_curro,

'link_comp': link_comp, 

'pais': pais, 

'fecha': fecha}

{'titulo': 'Junior Data Analyst',
 'empresa': 'The Hatcher Group',
 'link_curro': 'https://www.linkedin.com/jobs/view/junior-data-analyst-at-the-hatcher-group-3741802236?refId=ZN90hZrkzHt%2FmD%2FipPz1xw%3D%3D&trackingId=5xwQNqVBS6bR8c9EUykpAQ%3D%3D&position=1&pageNum=0&trk=public_jobs_jserp-result_search-card',
 'link_comp': 'https://www.linkedin.com/company/the-hatcher-group?trk=public_jobs_jserp-result_job-search-card-subtitle',
 'pais': 'Bethesda, MD',
 'fecha': '2023-10-17'}
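
Each of those lookups raises `AttributeError` if `find` returns `None`, which happens whenever a card lacks one of the tags. One way to tolerate incomplete cards, sketched here with two hypothetical helpers (`text_or_none`, `attr_or_none`) and a simplified hand-written card, is to check for `None` before reading text or attributes.

```python
from bs4 import BeautifulSoup


def text_or_none(tag):
    """Return the stripped text of a tag, or None when find() found nothing."""
    return tag.get_text(strip=True) if tag is not None else None


def attr_or_none(tag, name):
    """Return an attribute value, or None when the tag or attribute is missing."""
    return tag.get(name) if tag is not None else None


# Simplified stand-in for one <li> card; it has no <time> tag on purpose.
card_html = """
<li>
  <a href="https://www.example.com/jobs/1"><span>Junior Data Analyst</span></a>
  <h4><a href="https://www.example.com/company/1">Example Corp</a></h4>
  <span class="job-search-card__location">Bethesda, MD</span>
</li>
"""

card = BeautifulSoup(card_html, "html.parser").find("li")
h4 = card.find("h4")

job = {
    "titulo": text_or_none(card.find("span")),
    "empresa": text_or_none(h4),
    "link_curro": attr_or_none(card.find("a"), "href"),
    "link_comp": attr_or_none(h4.find("a") if h4 is not None else None, "href"),
    "pais": text_or_none(card.find("span", class_="job-search-card__location")),
    "fecha": attr_or_none(card.find("time"), "datetime"),
}
print(job)
```

Missing fields end up as `None`, so the card can still be stored instead of being discarded.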

In [115]:
def linkedin(keywords, num_pages, loc, n_secs):
    
    URL = 'https://www.linkedin.com/jobs/search/'
    
    data = []
    
    for i in range(num_pages):
        
        scrape_url = ''.join([URL,    # base URL
                              f'?keywords={keywords}',
                              '&geoId=105646813',
                              f'&location={loc}',
                              f'&f_TPR={n_secs}',
                              f'&start={i*25}'
                             ])
        
        
        html = req.get(scrape_url).text

        sopa = bs(html, 'html.parser')
        
        
        tarjetas = sopa.find('ul', class_='jobs-search__results-list').find_all('li')
        
        
        # loop over the li elements, one card per job posting
        for curro in tarjetas:
            
            
            try:
                titulo = curro.find('span').text.strip()

                empresa = curro.find('h4').text.strip()

                link_curro = curro.find('a').attrs['href']

                link_comp = curro.find('h4').find('a').attrs['href']

                pais = curro.find('span', class_='job-search-card__location').text.strip()

                fecha = curro.find('time').attrs['datetime']
                
                
                data.append({'titulo': titulo,

                            'empresa': empresa,

                            'link_curro': link_curro,

                            'link_comp': link_comp, 

                            'pais': pais, 

                            'fecha': fecha})
                
                
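            except (AttributeError, KeyError) as error:
                # find() returned None or an attribute was missing: skip this card
                print(f'Skipping card: {error}')
                continue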
                
                
    return pd.DataFrame(data)

In [116]:
df = linkedin('analista%20de%20datos', 5, 'spain', 30000)

In [117]:
df.head()

Unnamed: 0,titulo,empresa,link_curro,link_comp,pais,fecha
0,Analista de Datos (POWER BI),Agromillora,https://es.linkedin.com/jobs/view/analista-de-...,https://es.linkedin.com/company/agromilloragro...,"Sant Sadurní d'Anoia, Catalonia, Spain",2023-10-11
1,"Data Analyst – Bilbao, Madrid",Teknei,https://es.linkedin.com/jobs/view/data-analyst...,https://es.linkedin.com/company/teknei-group?t...,"Madrid, Community of Madrid, Spain",2023-10-25
2,Analista de datos Junior,Krell Consulting,https://es.linkedin.com/jobs/view/analista-de-...,https://es.linkedin.com/company/krellconsultin...,"Madrid, Community of Madrid, Spain",2023-07-06
3,Data Analyst Power Bi Remoto Híbrido,Malthus Darwin,https://es.linkedin.com/jobs/view/data-analyst...,https://es.linkedin.com/company/malthus-darwin...,"Madrid, Community of Madrid, Spain",2023-10-26
4,Jr Data Analyst,Smadex,https://es.linkedin.com/jobs/view/jr-data-anal...,https://es.linkedin.com/company/smadex?trk=pub...,"Barcelona, Catalonia, Spain",2023-10-26


In [118]:
df.shape

(125, 6)
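
The function builds the query string by hand, so the keywords have to be percent-encoded by the caller (`'analista%20de%20datos'`). As a sketch of an alternative, the standard library can build and escape the query string; `build_search_url` and `page_size` are hypothetical names, and the parameters are the ones used in the function above.

```python
from urllib.parse import quote, urlencode

BASE_URL = 'https://www.linkedin.com/jobs/search/'


def build_search_url(keywords, loc, n_secs, page, page_size=25):
    """Build the search URL for one results page (hypothetical helper)."""
    params = {
        'keywords': keywords,          # plain text; urlencode escapes it
        'geoId': '105646813',
        'location': loc,
        'f_TPR': n_secs,
        'start': page * page_size,     # offset of the first result on the page
    }
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return BASE_URL + '?' + urlencode(params, quote_via=quote)


url = build_search_url('analista de datos', 'spain', 30000, page=2)
print(url)
```

The `start` offset advances by 25 per page, which matches the 125 rows obtained from 5 pages.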

In [119]:
from IPython.display import HTML

In [120]:
HTML(df.head().to_html(render_links=True))

Unnamed: 0,titulo,empresa,link_curro,link_comp,pais,fecha
0,Analista de Datos (POWER BI),Agromillora,https://es.linkedin.com/jobs/view/analista-de-datos-power-bi-at-agromillora-3735671051?refId=uy%2FGar6H7w%2FJepbebs5JYA%3D%3D&trackingId=j%2FcJuW6cVlel487%2FSR6zAw%3D%3D&position=1&pageNum=0&trk=public_jobs_jserp-result_search-card,https://es.linkedin.com/company/agromilloragroup?trk=public_jobs_jserp-result_job-search-card-subtitle,"Sant Sadurní d'Anoia, Catalonia, Spain",2023-10-11
1,"Data Analyst – Bilbao, Madrid",Teknei,https://es.linkedin.com/jobs/view/data-analyst-%E2%80%93-bilbao-madrid-at-teknei-3746819548?refId=uy%2FGar6H7w%2FJepbebs5JYA%3D%3D&trackingId=mmyU0wTz%2BVh5%2FTxYdKESWQ%3D%3D&position=2&pageNum=0&trk=public_jobs_jserp-result_search-card,https://es.linkedin.com/company/teknei-group?trk=public_jobs_jserp-result_job-search-card-subtitle,"Madrid, Community of Madrid, Spain",2023-10-25
2,Analista de datos Junior,Krell Consulting,https://es.linkedin.com/jobs/view/analista-de-datos-junior-at-krell-consulting-3657204428?refId=uy%2FGar6H7w%2FJepbebs5JYA%3D%3D&trackingId=cZQxvPJTIXc1GiBD4AMd%2BA%3D%3D&position=3&pageNum=0&trk=public_jobs_jserp-result_search-card,https://es.linkedin.com/company/krellconsulting?trk=public_jobs_jserp-result_job-search-card-subtitle,"Madrid, Community of Madrid, Spain",2023-07-06
3,Data Analyst Power Bi Remoto Híbrido,Malthus Darwin,https://es.linkedin.com/jobs/view/data-analyst-power-bi-remoto-h%C3%ADbrido-at-malthus-darwin-3747705809?refId=uy%2FGar6H7w%2FJepbebs5JYA%3D%3D&trackingId=QnH3q%2Bsxh2jJ0ZmoZD9yNQ%3D%3D&position=4&pageNum=0&trk=public_jobs_jserp-result_search-card,https://es.linkedin.com/company/malthus-darwin-sl?trk=public_jobs_jserp-result_job-search-card-subtitle,"Madrid, Community of Madrid, Spain",2023-10-26
4,Jr Data Analyst,Smadex,https://es.linkedin.com/jobs/view/jr-data-analyst-at-smadex-3749500727?refId=uy%2FGar6H7w%2FJepbebs5JYA%3D%3D&trackingId=0sWDcEt84z6oe94xzMGERw%3D%3D&position=5&pageNum=0&trk=public_jobs_jserp-result_search-card,https://es.linkedin.com/company/smadex?trk=public_jobs_jserp-result_job-search-card-subtitle,"Barcelona, Catalonia, Spain",2023-10-26


In [121]:
html = req.get('https://www.linkedin.com/in/yonatan-rodriguez/').text

html

'<html><head>\n<script type="text/javascript">\nwindow.onload = function() {\n  // Parse the tracking code from cookies.\n  var trk = "bf";\n  var trkInfo = "bf";\n  var cookies = document.cookie.split("; ");\n  for (var i = 0; i < cookies.length; ++i) {\n    if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {\n      trk = cookies[i].substring(8);\n    }\n    else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {\n      trkInfo = cookies[i].substring(8);\n    }\n  }\n\n  if (window.location.protocol == "http:") {\n    // If "sl" cookie is set, redirect to https.\n    for (var i = 0; i < cookies.length; ++i) {\n      if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {\n        window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);\n        return;\n      }\n    }\n  }\n\n  // Get the new domain. For international domains such as\n  // fr.linkedin.com, we convert it to www.linkedin.com\n 