# Web Scraping Lab

This notebook contains web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each url and take a look at its source through Chrome DevTools. You'll need to identify the html tags, special class names etc. used for the html content you are expected to extract.

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)
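
The first tip, checking the status code before parsing, can be sketched with the standard library alone. The helper name `describe_status` is hypothetical and only illustrates the idea; in the exercises the code would come from `response.status_code`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, human-readable summary of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Non-standard codes are not members of the HTTPStatus enum.
        phrase = "Unknown"
    # 2xx means success; anything else deserves a closer look before parsing.
    verdict = "ok to parse" if 200 <= code < 300 else "do not parse"
    return f"{code} {phrase}: {verdict}"


print(describe_status(200))
print(describe_status(404))
```

With `requests`, calling `response.raise_for_status()` is another way to stop early on a 4xx or 5xx response.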

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are imported for you. If you prefer to use additional libraries, feel free to uncomment them.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# from pprint import pprint
# from lxml import html
# from lxml.html import fromstring
# import urllib.request
# from urllib.request import urlopen
# import random
# import re
# import scrapy

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [4]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [6]:
# Send the HTTP GET request
response = requests.get(url)

# Check that the request succeeded (status code 200)
if response.status_code == 200:
    # Get the HTML content of the response
    html_content = response.text

    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find all <h1> elements with the class "h3 lh-condensed", which hold the developer names
    developer_names = soup.find_all('h1', class_='h3 lh-condensed')

    # Print the developer names
    for name in developer_names:
        print(name.text.strip())


Yiming Cui
Stephen Celis
Nikita Sobolev
Jeff Dickey
Suyeol Jeon
Boni García
Adeeb Shihadeh
Bo-Yi Wu
Abdullah Atta
Dominic Farolino
Jan-Erik Rediger
Ismail Pelaseyed
Steve Macenski
Philipp Schmid
Alessandro Ros
Vladimir Mihailenco
dennis zhuang
Cameron Dutro
MichaIng
二货爱吃白萝卜
Mattt
Justin Clift
Laurent Mazare
Steven Tey
Ha Thach


#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.
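
The string clean-up in step 3 can be sketched as follows. This is one possible approach, not the only one: it collapses every run of whitespace into a single space, so names keep the space between first and last name. The helper name `clean_text` and the sample strings are made up for illustration.

```python
def clean_text(raw):
    """Collapse runs of spaces, tabs and line breaks into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(raw.split())


raw_names = ["\n      Yiming Cui\n", "  Stephen\n   Celis  "]
print([clean_text(name) for name in raw_names])
```

To remove whitespace entirely instead, join with an empty string: `"".join(raw.split())`.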

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```

In [7]:
url = 'https://github.com/trending/developers'
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    developer_names = []
    names_html_elements = soup.find_all('h1', class_='h3 lh-condensed')

    for element in names_html_elements:
        name = element.text.strip().replace('\n', '')
        developer_names.append(name)

    print(developer_names)  # these names contain no HTML tags

['Yiming Cui', 'Stephen Celis', 'Nikita Sobolev', 'Jeff Dickey', 'Suyeol Jeon', 'Boni García', 'Adeeb Shihadeh', 'Bo-Yi Wu', 'Abdullah Atta', 'Dominic Farolino', 'Jan-Erik Rediger', 'Ismail Pelaseyed', 'Steve Macenski', 'Philipp Schmid', 'Alessandro Ros', 'Vladimir Mihailenco', 'dennis zhuang', 'Cameron Dutro', 'MichaIng', '二货爱吃白萝卜', 'Mattt', 'Justin Clift', 'Laurent Mazare', 'Steven Tey', 'Ha Thach']


#### Display the trending Python repositories in GitHub

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [18]:
html = requests.get(url)

if html.status_code == 200:
    soup = BeautifulSoup(html.text, 'html.parser')

    # Find all the elements that contain the repository names
    html_elements = soup.find_all('h1', class_='h3 lh-condensed')
    repo_names = []
    for element in html_elements:
        name = element.text.strip().replace('\n', '')
        repo_names.append(name)
    print(repo_names)

['Yiming Cui', 'Stephen Celis', 'Nikita Sobolev', 'Jeff Dickey', 'Suyeol Jeon', 'Boni García', 'Adeeb Shihadeh', 'Bo-Yi Wu', 'Abdullah Atta', 'Dominic Farolino', 'Jan-Erik Rediger', 'Ismail Pelaseyed', 'Steve Macenski', 'Philipp Schmid', 'Alessandro Ros', 'Vladimir Mihailenco', 'dennis zhuang', 'Cameron Dutro', 'MichaIng', '二货爱吃白萝卜', 'Mattt', 'Justin Clift', 'Laurent Mazare', 'Steven Tey', 'Ha Thach']
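
The output above repeats the developer names, which suggests the request still went to the developers URL: the cell that defines the new `url` is marked `In [None]`, so it was probably never run. A minimal sketch of the repository extraction follows. It assumes, without having verified it against the live page, that each repository name sits in an `<h2 class="h3 lh-condensed">` heading; the sample HTML and the `repo_names` helper are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mimicking one entry of the trending repositories page.
SAMPLE_HTML = """
<article class="Box-row">
  <h2 class="h3 lh-condensed">
    <a href="/example-owner/example-repo">
      <span class="text-normal">example-owner /</span>
      example-repo
    </a>
  </h2>
</article>
"""


def repo_names(html_text):
    """Return 'owner/repo' names found in <h2 class="h3 lh-condensed"> headings."""
    soup = BeautifulSoup(html_text, "html.parser")
    names = []
    for heading in soup.find_all("h2", class_="h3 lh-condensed"):
        # Remove every whitespace character so 'owner /\n repo' becomes 'owner/repo'.
        names.append("".join(heading.text.split()))
    return names


print(repo_names(SAMPLE_HTML))
```

If the real page uses a different tag or class, inspect it with DevTools and adjust the `find_all` arguments.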


#### Display all the image links from Walt Disney wikipedia page

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [19]:
url = 'https://en.wikipedia.org/wiki/Walt_Disney'
html = requests.get(url)

if html.status_code == 200:
    soup = BeautifulSoup(html.text, 'html.parser')

    # Find every <img> element on the page
    image_elements = soup.find_all('img')

    # Extract the URL of each image, skipping any <img> without a src attribute
    image_links = [element['src'] for element in image_elements if element.has_attr('src')]

    # Print the image URLs
    for link in image_links:
        print(link)
else:
    print('Error fetching the page:', html.status_code)

/static/images/icons/wikipedia.png
/static/images/mobile/copyright/wikipedia-wordmark-en.svg
/static/images/mobile/copyright/wikipedia-tagline-en.svg
//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png
//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG
//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png
//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg/220px-Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg
//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg
//upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/220px-Steamboat-willie.jpg
//upload.wiki
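
Many of the links above are relative (`/static/...`) or protocol-relative (`//upload.wikimedia.org/...`), so they cannot be requested as they are. A small sketch of resolving them against the page URL with the standard library's `urljoin`; the helper name `absolute_links` is made up for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = "https://en.wikipedia.org/wiki/Walt_Disney"


def absolute_links(page_url, links):
    """Resolve relative and protocol-relative links against the page URL."""
    return [urljoin(page_url, link) for link in links]


sample = [
    "/static/images/icons/wikipedia.png",
    "//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG",
]
for link in absolute_links(PAGE_URL, sample):
    print(link)
```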

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page

In [None]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python' 

In [21]:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Python'
html = requests.get(url)

if html.status_code == 200:
    soup = BeautifulSoup(html.text, 'html.parser')

    # Each <a> element has an href attribute that holds the link's target URL
    enlace_elements = soup.find_all('a')

    # Extract the href attribute of each <a> to get the link URL
    enlaces = []
    for enlace in enlace_elements:
        if 'href' in enlace.attrs:
            enlaces.append(enlace['href'])

    # Keep only the internal Wikipedia links
    enlaces_wikipedia = [enlace for enlace in enlaces if enlace.startswith('/wiki/')]

    # Print the links
    for enlace in enlaces_wikipedia:
        print(enlace)
else:
    print('Error fetching the page:', html.status_code)

/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
/wiki/Python
/wiki/Talk:Python
/wiki/Python
/wiki/Python
/wiki/Special:WhatLinksHere/Python
/wiki/Special:RecentChangesLinked/Python
/wiki/Wikipedia:File_Upload_Wizard
/wiki/Special:SpecialPages
/wiki/Pythonidae
/wiki/Python_(genus)
/wiki/Python_(mythology)
/wiki/Python_(programming_language)
/wiki/CMU_Common_Lisp
/wiki/PERQ#PERQ_3
/wiki/Python_of_Aenus
/wiki/Python_(painter)
/wiki/Python_of_Byzantium
/wiki/Python_of_Catana
/wiki/Python_Anghelo
/wiki/Python_(Efteling)
/wiki/Python_(Busch_Gardens_Tampa_Bay)
/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)
/wiki/Python_(automobile_maker)
/wiki/Python_(Ford_prototype)
/wiki/Python_(mi
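
The list above contains repeats (`/wiki/Python` appears three times) and maintenance pages such as `/wiki/Special:Random`. One way to tidy it up is sketched below. It assumes that a colon marks a namespace prefix, which is a simplification because some article titles also contain colons. The helper name `unique_article_links` is made up for illustration.

```python
def unique_article_links(links):
    """Keep article links only, without duplicates, in first-seen order."""
    # Links such as /wiki/Special:Random or /wiki/Help:Contents carry a
    # namespace prefix ("Special:", "Help:"); plain articles usually have no colon.
    articles = [link for link in links if ":" not in link]
    # dict.fromkeys preserves insertion order and discards repeated keys.
    return list(dict.fromkeys(articles))


sample = [
    "/wiki/Main_Page",
    "/wiki/Special:Random",
    "/wiki/Python",
    "/wiki/Python",
    "/wiki/Python_(programming_language)",
]
print(unique_article_links(sample))
```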

#### Number of Titles that have changed in the United States Code since its last release point 

In [None]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [23]:
import requests
from bs4 import BeautifulSoup

url = 'http://uscode.house.gov/download/download.shtml' 

html = requests.get(url)

if html.status_code == 200:
    soup = BeautifulSoup(html.text, 'html.parser')

    # Find the element that holds the information about the changed titles
    div_cambios = soup.find('div', id='usctitlechanged')

    if div_cambios is not None:
        # Extract the text of the element
        texto_cambios = div_cambios.get_text()

        # Split the text on commas to estimate how many titles have changed
        titulos_cambiados = len(texto_cambios.split(','))

        # Print the number of changed titles
        print("Number of changed titles:", titulos_cambiados)
    else:
        print("No element with id='usctitlechanged' was found")
else:
    print('Error fetching the page:', html.status_code)

No element with id='usctitlechanged' was found
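
The cell above reports that no element with that `id` exists. A likely cause, stated here as an assumption rather than a verified fact about the live page, is that `usctitlechanged` is a class name, not an id. The sketch below uses a hypothetical HTML snippet to show the difference between searching by `id` and by `class_`, and counts matching elements instead of splitting text on commas.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: changed titles are marked with a class rather than an id.
SAMPLE_HTML = """
<div class="uscitem"><div class="usctitle">Sample Title A</div></div>
<div class="uscitem"><div class="usctitlechanged">Sample Title B</div></div>
<div class="uscitem"><div class="usctitlechanged">Sample Title C</div></div>
"""


def changed_titles(html_text):
    """Return the text of every <div> whose class is 'usctitlechanged'."""
    soup = BeautifulSoup(html_text, "html.parser")
    # class_ matches the class attribute; id= would look for an id attribute instead.
    divs = soup.find_all("div", class_="usctitlechanged")
    return [div.get_text(strip=True) for div in divs]


titles = changed_titles(SAMPLE_HTML)
print(len(titles), titles)
```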


#### A Python list with the top ten FBI's Most Wanted names 

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'

In [26]:
html = requests.get(url)

if html.status_code == 200:
    soup = BeautifulSoup(html.text, 'html.parser')

    # Find every link in the HTML with soup.find_all('a')
    enlaces = soup.find_all('a')

    # Collect the 'title' attribute of each link that has one into the list nombres_mas_buscados
    nombres_mas_buscados = [enlace.get('title') for enlace in enlaces if enlace.get('title') is not None]

    # Print the list of names
    for nombre in nombres_mas_buscados:
        print(nombre)
else:
    print('Error fetching the page:', html.status_code)

An official website of the United States government
Ten Most Wanted Fugitives
Fugitives
Capitol Violence
Terrorism
Kidnappings & Missing Persons
Parental Kidnappings
Seeking Information
Indian Country
ECAP
ViCAP
Bank Robbers
Ten Most Wanted Fugitives FAQ
Ten Most Wanted Fugitives Historical Pictures
Most Wanted
Ten Most Wanted
Fugitives
Terrorism
Kidnappings / Missing Persons
Seeking Information
Bank Robbers
ECAP
ViCAP
FBI Jobs
Submit a Tip
Crime Statistics
History
FOIPA
Scams & Safety
FBI Kids
FBI Tour
News
Stories
Videos
Press Releases
Speeches
Testimony
Podcasts and Radio
Photos
Español
Apps
How We Can Help You
Law Enforcement
Victims
Parents and Caregivers
Students
Businesses
Safety Resources
Need an FBI Service or More Information?
What We Investigate
Terrorism
Counterintelligence
Cyber Crime
Public Corruption
Civil Rights
Organized Crime
White-Collar Crime
Violent Crime
WMD
About
Mission & Priorities
Leadership & Structure
Partnerships
Community Outreach
FAQs
Contact Us
Field Off
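
The output above lists navigation link titles rather than fugitive names, because every `<a>` with a `title` attribute was collected. A narrower selector is needed. The sketch below assumes, without confirmation from the live page, that each name is wrapped in an `<h3 class="title">` element; the sample HTML, the placeholder names and the `wanted_names` helper are all hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one navigation link plus two list entries.
SAMPLE_HTML = """
<a title="Ten Most Wanted Fugitives" href="#">Ten Most Wanted</a>
<ul>
  <li><h3 class="title"><a href="#">EXAMPLE PERSON ONE</a></h3></li>
  <li><h3 class="title"><a href="#">EXAMPLE PERSON TWO</a></h3></li>
</ul>
"""


def wanted_names(html_text):
    """Return the text of every <h3 class="title"> element."""
    soup = BeautifulSoup(html_text, "html.parser")
    headings = soup.find_all("h3", class_="title")
    return [heading.get_text(strip=True) for heading in headings]


print(wanted_names(SAMPLE_HTML))
```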

#### 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [7]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.emsc-csem.org/Earthquake/'

html = requests.get(url)

if html.status_code == 200:
    soup = BeautifulSoup(html.text, 'html.parser')

    # Find the table that contains the earthquake information
    tabla = soup.find('table', class_='eqs table-scroll')

    # Get the table rows, skipping the header row
    filas = tabla.find_all('tr')[1:21]  # 20 rows for the 20 most recent earthquakes

    # Create empty lists to store the data
    fechas = []
    horas = []
    latitudes = []
    longitudes = []
    regiones = []

    # Loop over each table row and extract the required fields
    for fila in filas:
        columna_fecha_hora = fila.find('td', class_='tabev6').text.strip().split('\xa0')
        fecha = columna_fecha_hora[0]
        hora = columna_fecha_hora[1]
        coordenadas = fila.find('td', class_='tabev1').text.strip()
        latitud = coordenadas.split()[0]
        longitud = coordenadas.split()[1]
        region = fila.find('td', class_='tb_region').text.strip()

        # Append the extracted values to their lists
        fechas.append(fecha)
        horas.append(hora)
        latitudes.append(latitud)
        longitudes.append(longitud)
        regiones.append(region)

    # Build a dictionary with the data
    datos = {
        'Fecha': fechas,
        'Hora': horas,
        'Latitud': latitudes,
        'Longitud': longitudes,
        'Región': regiones
    }

    # Create the pandas dataframe from the data
    dataframe = pd.DataFrame(datos)

    print(dataframe)


Empty DataFrame
Columns: [Fecha, Hora, Latitud, Longitud, Región]
Index: []
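
Apart from the empty result, the date and time split in the cell above is fragile: `split('\xa0')` yields empty strings when several non-breaking spaces appear in a row, so index `[1]` may not be the time. A more forgiving sketch follows. The helper name `split_date_time` and the sample value are made up; the real cell format on the EMSC page is not confirmed here.

```python
def split_date_time(cell_text):
    """Split a 'date<nbsp>time' cell into a (date, time) tuple."""
    # str.split() without arguments treats non-breaking spaces (\xa0) as
    # whitespace and ignores repeated separators.
    parts = cell_text.split()
    if len(parts) < 2:
        raise ValueError(f"expected a date and a time, got {cell_text!r}")
    return parts[0], parts[1]


sample_cell = "2023-06-01\xa0\xa0\xa012:34:56.7"
print(split_date_time(sample_cell))
```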


#### Display the date, days, title, city, country of next 25 hackathon events as a Pandas dataframe table

In [None]:
# This is the url you will scrape in this exercise
url ='https://hackevents.co/hackathons'

In [10]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://hackevents.co/hackathons'

# Send a GET request to the URL
response = requests.get(url)

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find the table that contains the hackathon information
table = soup.find('table')

# Create empty lists to store the data
dates = []
days = []
titles = []
cities = []
countries = []

# Iterate over each row in the table (excluding the header row)
for row in table.find_all('tr')[1:]:
    # Extract the data from the columns
    date = row.find('td', class_='start-date').text.strip()
    day = row.find('span', class_='date').text.strip()
    title = row.find('a').text.strip()
    city = row.find('td', class_='city').text.strip()
    country = row.find('td', class_='country').text.strip()

    # Append the data to the respective lists
    dates.append(date)
    days.append(day)
    titles.append(title)
    cities.append(city)
    countries.append(country)

# Create a dictionary with the data
data = {
    'Date': dates,
    'Day': days,
    'Title': titles,
    'City': cities,
    'Country': countries
}

# Create the DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
df.head(25)

# The table is not found; the page may have changed since the exercise was written.

AttributeError: 'NoneType' object has no attribute 'find_all'

#### Count number of tweets by a given Twitter account.

You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account

In [None]:
# This is the url you will scrape in this exercise 
url = 'https://twitter.com/'

In [None]:
import tweepy

# Note that you need a Twitter developer account and API credentials to access the
# Twitter API through tweepy. The API is also rate limited, so be careful when
# sending many requests in a row.

# Add your Twitter API credentials: replace TU_CONSUMER_KEY, TU_CONSUMER_SECRET,
# TU_ACCESS_TOKEN, TU_ACCESS_TOKEN_SECRET and NOMBRE_DE_LA_CUENTA with your own
# credentials and the name of the account whose tweets you want to count.
consumer_key = 'TU_CONSUMER_KEY'
consumer_secret = 'TU_CONSUMER_SECRET'
access_token = 'TU_ACCESS_TOKEN'
access_token_secret = 'TU_ACCESS_TOKEN_SECRET'

# Set up the authentication credentials
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# Create an API instance
api = tweepy.API(auth)

# Name of the Twitter account whose tweets you want to count
nombre_cuenta = 'NOMBRE_DE_LA_CUENTA'

try:
    # Get the user object
    usuario = api.get_user(screen_name=nombre_cuenta)

    # Get the user's number of tweets
    num_tweets = usuario.statuses_count

    print(f"The number of tweets by @{nombre_cuenta} is: {num_tweets}")
except tweepy.TweepyException as error:
    # tweepy 4.x raises TweepyException (older versions use tweepy.TweepError),
    # for example when the account does not exist
    print(f"Could not retrieve @{nombre_cuenta}: {error}")

# Not run: I do not have a Twitter account.


#### Number of followers of a given twitter account

You will need to include a ***try/except block*** in case the account name is not found.
<br>***Hint:*** the program should count the followers for any provided account

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
import tweepy

# Add your Twitter API credentials
consumer_key = 'TU_CONSUMER_KEY'
consumer_secret = 'TU_CONSUMER_SECRET'
access_token = 'TU_ACCESS_TOKEN'
access_token_secret = 'TU_ACCESS_TOKEN_SECRET'

# Set up the authentication credentials
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# Create an API instance
api = tweepy.API(auth)

# Name of the Twitter account whose followers you want to count
nombre_cuenta = 'NOMBRE_DE_LA_CUENTA'

try:
    # Get the user object
    usuario = api.get_user(screen_name=nombre_cuenta)

    # Get the user's number of followers
    num_seguidores = usuario.followers_count

    print(f"The number of followers of @{nombre_cuenta} is: {num_seguidores}")
except tweepy.TweepyException as error:
    # tweepy 4.x raises TweepyException (older versions use tweepy.TweepError),
    # for example when the account does not exist
    print(f"Could not retrieve @{nombre_cuenta}: {error}")

# Not run: I do not have a Twitter account.
  

#### List all language names and number of related articles in the order they appear in wikipedia.org

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

#### A list with the different kind of datasets available in data.gov.uk 

In [None]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [15]:
import requests
from bs4 import BeautifulSoup

# URL to scrape (this cell scrapes wikipedia.org for the language exercise above)
url = 'https://www.wikipedia.org/'

# Send a GET request to the URL
response = requests.get(url)

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find all elements with the class 'central-featured-lang'
langs = soup.find_all('div', class_='central-featured-lang')

# Loop over the elements and extract the language name and the number of related articles
for lang in langs:
    lang_name = lang.find('strong').text
    article_count = lang.find('bdi').text
    print(f'{lang_name}: {article_count}')

English: 6 691 000+
日本語: 1 382 000+
Español: 1 881 000+
Русский: 1 930 000+
Deutsch: 2 822 000+
Français: 2 540 000+
Italiano: 1 820 000+
中文: 1 369 000+
Português: 1 105 000+
فارسی: فارسی
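
The counts above are strings with spaces and a trailing `+`, and the last line shows the language name where a count was expected. A hedged sketch of turning such strings into integers while flagging entries that contain no digits; the helper name `parse_article_count` is made up, and it assumes the separators may be plain or non-breaking spaces.

```python
import re


def parse_article_count(text):
    """Turn a count such as '6 691 000+' into an int, or None if it has no digits."""
    # \D matches every non-digit character, including spaces, \xa0 and '+'.
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


for raw in ["6 691 000+", "1\xa0382\xa0000+", "فارسی"]:
    print(repr(raw), "->", parse_article_count(raw))
```

A `None` result is a hint that the selector picked up the wrong element for that language and should be inspected in DevTools.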


#### Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [17]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# URL to scrape
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

# Send a GET request to the URL
response = requests.get(url)

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find the table that contains the information
table = soup.find('table', class_='wikitable sortable')

# Create empty lists to store the data
rankings = []
languages = []
native_speakers = []

# Iterate over the table rows (skipping the header row)
for row in table.find_all('tr')[1:]:
    # Extract the data from the columns
    columns = row.find_all('td')
    rank = columns[0].text.strip()
    language = columns[1].text.strip()
    speakers = columns[2].text.strip().rstrip('%')  # Remove the trailing "%" sign

    # Append the data to the lists
    rankings.append(rank)
    languages.append(language)
    native_speakers.append(speakers)

# Build a dictionary with the data
data = {
    'Rank': rankings,
    'Language': languages,
    'Native Speakers': native_speakers
}

# Create the DataFrame
df = pd.DataFrame(data)

# Convert the native speakers column to a numeric type
df['Native Speakers'] = df['Native Speakers'].str.replace(',', '').astype(float)

# Sort the DataFrame by the native speakers column in descending order
df = df.sort_values(by='Native Speakers', ascending=False)

# Keep the top 10 languages
top_10 = df.head(10)

# Display the DataFrame
top_10.reset_index(drop=True, inplace=True)
top_10


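|   | Rank | Language | Native Speakers |
|---|------|----------|-----------------|
| 0 | 1 | Mandarin Chinese | 12.3 |
| 1 | 2 | Spanish | 6.0 |
| 2 | 3 | English | 5.1 |
| 3 | 3 | Arabic | 5.1 |
| 4 | 5 | Hindi | 3.5 |
| 5 | 6 | Bengali | 3.3 |
| 6 | 7 | Portuguese | 3.0 |
| 7 | 8 | Russian | 2.1 |
| 8 | 9 | Japanese | 1.7 |
| 9 | 10 | Western Punjabi | 1.3 |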


### BONUS QUESTIONS

#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code

#### IMDB's Top 250 data (movie name, Initial release, director name and stars) as a pandas dataframe

In [None]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

#### Movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [None]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'
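
Building the query string by concatenation breaks for city names with spaces or accents. A sketch of the same URL built with `urllib.parse.urlencode` follows. The parameter names `q`, `APPID` and `units` are taken from the cell above; the helper name `build_weather_url` and the placeholder key are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city, api_key, units="metric"):
    """Return the request URL with the query parameters safely encoded."""
    # urlencode escapes spaces and non-ASCII characters in the city name.
    query = urlencode({"q": city, "APPID": api_key, "units": units})
    return f"{BASE_URL}?{query}"


print(build_weather_url("New York", "YOUR_API_KEY"))
```

With `requests`, passing the same dictionary as `params=` to `requests.get` performs this encoding automatically.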

In [None]:
# your code

#### Book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'