# Lab | Web Scraping Multiple Pages

#### Business goal:

- Check the `case_study_gnod.md` file.
- Make sure you've understood the big picture of your project:

  - the goal of the company (`Gnod`),
  - their current product (`Gnoosic`),
  - their strategy, and
  - how your project fits into this context.

  Re-read the business case and the e-mail from the CTO, take a look at the flowchart and create an initial Trello board with the tasks you think you'll have to accomplish.

#### Instructions 

#### Prioritize the MVP

In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.

If you couldn't finish the first lab, use this time to go back there.

#### Expand the project

If you're done, you can try to expand the project on your own. Here are a few suggestions:

- Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
- Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
- Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs

#### Practice web scraping

As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field:

- Retrieve the Wikipedia page for "Python" and create a list of the links on that page: `url = 'https://en.wikipedia.org/wiki/Python'`
- Find the number of titles that have changed in the United States Code since its last release point: `url = 'http://uscode.house.gov/download/download.shtml'`
- Create a Python list with the names on the FBI's Ten Most Wanted list: `url = 'https://www.fbi.gov/wanted/topten'`
- Display the info for the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe: `url = 'https://www.emsc-csem.org/Earthquake/'`
- List all language names and the number of related articles in the order they appear on [wikipedia.org](https://www.wikipedia.org/): `url = 'https://www.wikipedia.org/'`
- Create a list of the different kinds of datasets available on [data.gov.uk](https://data.gov.uk/): `url = 'https://data.gov.uk/'`
- Display the top 10 languages by number of native speakers as a pandas dataframe: `url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'`

In [17]:
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup

## Expand the project

#### 100 Greatest Hip-Hop Songs of All Time by Rolling Stone

In [18]:
url = 'https://www.rollingstone.com/music/music-lists/100-greatest-hip-hop-songs-of-all-time-105784/ltrimm-cars-with-the-boom-106252/'
page = requests.get(url)
page.status_code

200

In [19]:
%pip install selenium

In [20]:
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [21]:
# Initialize the web browser
driver = webdriver.Chrome()

# Open the page in the browser
driver.get(url)

song_names = []

# Loop to load more content and extract song names
while True:
    # Extract the songs names
    html_content = driver.page_source
    soup = BeautifulSoup(html_content, 'html.parser')
    song_elements = soup.find_all('h2', class_='c-gallery-vertical-featured-image__title')
    song_names.extend([song.get_text() for song in song_elements])
    
    # Click the "Load More" button
    try:
        load_more_button = driver.find_element(By.XPATH, "//a[contains(text(), 'Load More')]")
        load_more_button.click()
    except Exception:
        # No "Load More" link was found (or the click failed), so stop paging
        break
    
    # Wait for new content to load
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//h2[@class='c-gallery-vertical-featured-image__title']")))
    time.sleep(2)

# Close the browser
driver.quit()

for song in song_names:
    print(song)

L’Trimm, “Cars With the Boom”
Lil Jon and the East Side Boyz feat. Ying Yang Twins, “Get Low”
M.I.A., “Paper Planes”
Jay Z and Alicia Keys, “Empire State of Mind”
Brand Nubian, “Slow Down”
Bone Thugs-N-Harmony, “Tha Crossroads”
Missy ‘Misdemeanor’ Elliott, “The Rain (Supa Dupa Fly)”
Souls of Mischief, “93 ’til Infinity”
B.G. feat. Big Tymers and Hot Boys, “Bling Bling”
Rick Ross feat. Styles P, “B.M.F. (Blowin’ Money Fast)”
Biz Markie, “Just a Friend”
UGK feat. Outkast, “Int’l Players Anthem (I Choose You)”
MC Shan, “The Bridge”
Digital Underground, “The Humpty Dance”
Jermaine Dupri feat. Jay Z, “Money Ain’t a Thang”
Roxanne Shanté, “Roxanne’s Revenge”
U.T.F.O., “Roxanne, Roxanne”
Too $hort, “Freaky Tales”
Raekwon feat. Ghostface Killah, Method Man, Cappadonna, “Ice Cream”
Nas, “It Ain’t Hard to Tell”
Naughty by Nature, “O.P.P.”
Rammellzee and K-Rob, “Beat Bop”
Ultramagnetic MC’s, “Ego Trippin'”
A Tribe Called Quest, “Can I Kick It?”
Slick Rick, “Children’s Story”
M.O.P., “Ante Up (Rob

In [22]:
song_names[1]

'Lil Jon and the East Side Boyz feat. Ying Yang Twins, “Get Low”'
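
Each pass through the loop above re-parses the complete `page_source`, so if the page keeps earlier headings in the DOM after "Load More" is clicked, those headings would be appended a second time. A minimal sketch of an order-preserving de-duplication step, assuming exact repeats are the only problem; `unique_in_order` is a hypothetical helper name and the sample list is made up for illustration:

```python
def unique_in_order(items):
    """Drop exact repeats while keeping the first occurrence of each item."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


# Placeholder headings standing in for a scrape that saw one entry twice
scraped = [
    "M.I.A., “Paper Planes”",
    "Brand Nubian, “Slow Down”",
    "M.I.A., “Paper Planes”",
    "Biz Markie, “Just a Friend”",
]
deduplicated = unique_in_order(scraped)
print(len(scraped), len(deduplicated))
```

Applied to `song_names` right after the loop, this leaves the list unchanged when no repeats were collected.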

In [23]:
# Lists to store songs and artists
hip_hop_songs = []
hip_hop_artists = []

# Split each entry into artist and song title
for item in song_names:
    # Split at the first comma that is directly followed by an opening quote, so that
    # commas inside the artist credit or the title (e.g. “Roxanne, Roxanne”) are kept.
    parts = re.split(r',\s*(?=[“‘])', item, maxsplit=1)
    if len(parts) == 2:  # Skip entries that do not look like: Artist, “Title”
        artist = parts[0].strip()  # Remove surrounding whitespace
        song = parts[1].strip().strip('“”')  # Remove the quotes around the song title
        hip_hop_artists.append(artist)
        hip_hop_songs.append(song)

In [24]:
print(len(hip_hop_songs))
print(len(hip_hop_artists))

100
100


#### The 100 Best Songs of 2022 by Rolling Stone

In [25]:
url = 'https://www.rollingstone.com/music/music-lists/best-songs-2022-list-1234632381/'
page = requests.get(url)
page.status_code

200

In [26]:
# Initialize the web browser
driver = webdriver.Chrome()

# Open the page in the browser
driver.get(url)

song_names = []

# Loop to load more content and extract song names
while True:
    # Extract the songs names
    html_content = driver.page_source
    soup = BeautifulSoup(html_content, 'html.parser')
    song_elements = soup.find_all('h2', class_='c-gallery-vertical-featured-image__title')
    song_names.extend([song.get_text() for song in song_elements])
    
    # Click the "Load More" button
    try:
        load_more_button = driver.find_element(By.XPATH, "//a[contains(text(), 'Load More')]")
        load_more_button.click()
    except Exception:
        # No "Load More" link was found (or the click failed), so stop paging
        break
    
    # Wait for new content to load
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//h2[@class='c-gallery-vertical-featured-image__title']")))
    time.sleep(2)

# Close the browser
driver.quit()

for song in song_names:
    print(song)

Lainey Wilson, ‘Heart Like a Truck’
Chronixx, ‘Never Give Up’
Plains, ‘Problem With It’
Hurray for the Riff Raff, ‘Saga’
Camilo ft. Grupo Firme, ‘Alaska’
Ingrid Andress, ‘Yearbook’
Jack Harlow, ‘First Class’
Psy feat. Suga, ‘That That’
Dead Cross, ‘Reign of Error’
Ethel Cain, ‘American Teenager’
Gladie, ‘Nothing’
Guided by Voices, ‘Alex Bell’
(G)I-dle, ‘Nxde’
The Weeknd, ‘Take My Breath’
Nayeon, ‘Pop!’
Bill Callahan, ‘Coyotes’
Protoje feat. Lila Iké, ‘Late at Night’
Blood Orange, ‘Jesus Freak Lighter’
Camila Cabello feat. Maria Becerra, ‘Hasta Los Dientes’
Charli XCX and Tiësto, ‘Hot in It’
Daddy Yankee and Bad Bunny, ‘X Ultima Vez’
The 1975, ‘Part of the Band’
Rauw Alejandro and Baby Rasta, ‘Punto 40’
Florence + the Machine, ‘Choreomania’
Saba feat. Day Wave, ‘2012’
Le Sserafim, ‘Antifragile’
Alvvays, ‘Pomeranian Spinster’
Big Bang, ‘Still Life’
Yeah Yeah Yeahs, ‘Blacktop’
NCT 127, ‘2 Baddies’
Guitarricadelafuente, ‘Mil Y Una Noches’
Meghan Trainor, ‘Made You Look’
Jin, ‘The Astronaut

In [27]:
# Lists to store songs and artists
best_2022_songs = []
best_2022_artists = []

# Split each entry into artist and song title
for item in song_names:
    # Split at the first comma that is directly followed by an opening quote, so that
    # commas inside the artist credit or the title are kept.
    parts = re.split(r',\s*(?=[“‘])', item, maxsplit=1)
    if len(parts) == 2:  # Skip entries that do not look like: Artist, ‘Title’
        artist = parts[0].strip()  # Remove surrounding whitespace
        song = parts[1].strip().strip('‘’')  # Remove the quotes around the song title
        best_2022_artists.append(artist)
        best_2022_songs.append(song)

In [28]:
len(best_2022_songs)

100

#### The 500 Greatest Songs of All Time

In [29]:
url = 'https://www.rollingstone.com/music/music-lists/best-songs-of-all-time-1224767/'
page = requests.get(url)
page.status_code

200

In [30]:
# Initialize the web browser
driver = webdriver.Chrome()

# Open the page in the browser
driver.get(url)

song_names = []

# Loop to load more content and extract song names
while True:
    # Extract the songs names
    html_content = driver.page_source
    soup = BeautifulSoup(html_content, 'html.parser')
    song_elements = soup.find_all('h2', class_='c-gallery-vertical-album__title')
    song_names.extend([song.get_text() for song in song_elements])
    
    # Click the "Load More" button
    try:
        load_more_button = driver.find_element(By.XPATH, "//a[contains(@class, 'a-content-ignore') and contains(text(), 'Load More')]")
        load_more_button.click()
    except Exception:
        # No "Load More" link was found (or the click failed), so stop paging
        break

    
    # Wait for new content to load
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//h2[@class='c-gallery-vertical-album__title']")))
    time.sleep(2)

# Close the browser
driver.quit()

for song in song_names:
    print(song)

Kanye West, ‘Stronger’
The Supremes, ‘Baby Love’
Townes Van Zandt, ‘Pancho and Lefty’
Lizzo, ‘Truth Hurts’
Harry Nilsson, ‘Without You’
Carly Simon, ‘You’re So Vain’
Cyndi Lauper, ‘Time After Time’
The Pixies, ‘Where Is My Mind?’
Miles Davis, ‘So What’
Guns N’ Roses, ‘Welcome to the Jungle’
Lil Nas X, ‘Old Town Road’
The Breeders, ‘Cannonball’
The Weeknd, ‘House of Balloons’
Solange, ‘Cranes in the Sky’
Lil Wayne, ‘A Milli’
Azealia Banks, ‘212’
Weezer, ‘Buddy Holly’
The Four Tops, ‘I Can’t Help Myself (Sugar Pie, Honey Bunch)’
Lady Gaga, ‘Bad Romance’
Robert Johnson, ‘Cross Road Blues’
Biz Markie, ‘Just a Friend’
Santana, ‘Oye Como Va’
Juvenile feat. Lil Wayne and Mannie Fresh, ‘Back That Azz Up’
The Go-Gos, ‘Our Lips Are Sealed’
Kris Kristofferson, ‘Sunday Mornin’ Comin’ Down’
Janet Jackson, ‘Rhythm Nation’
Curtis Mayfield, ‘Move On Up’
Tammy Wynette, ‘Stand by Your Man’
Peter Gabriel, ‘Solsbury Hill’
The Animals, ‘The House of the Rising Sun’
Gladys Knight and the Pips, ‘Midnight Tra

In [31]:
# Lists to store songs and artists
five_hundred_greatest_songs = []
five_hundred_greatest_artists = []

# Split each entry into artist and song title
for item in song_names:
    # Split at the first comma that is directly followed by an opening quote, so that
    # commas inside the artist credit or the title (e.g. ‘... (Sugar Pie, Honey Bunch)’) are kept.
    parts = re.split(r',\s*(?=[“‘])', item, maxsplit=1)
    if len(parts) == 2:  # Skip entries that do not look like: Artist, ‘Title’
        artist = parts[0].strip()  # Remove surrounding whitespace
        song = parts[1].strip().strip('‘’')  # Remove the quotes around the song title
        five_hundred_greatest_artists.append(artist)
        five_hundred_greatest_songs.append(song)

In [32]:
len(five_hundred_greatest_songs)

500

#### The 200 Greatest Dance Songs of All Time by Rolling Stone

In [33]:
url = 'https://www.rollingstone.com/music/music-lists/200-greatest-dance-songs-of-all-time-1372888/'
page = requests.get(url)
page.status_code

200

In [34]:
# Initialize the web browser
driver = webdriver.Chrome()

# Open the page in the browser
driver.get(url)

song_names = []

# Loop to load more content and extract song names
while True:
    # Extract the songs names
    html_content = driver.page_source
    soup = BeautifulSoup(html_content, 'html.parser')
    song_elements = soup.find_all('h2', class_='c-gallery-vertical-featured-image__title')
    song_names.extend([song.get_text() for song in song_elements])
        
    # Click the "Load More" button
    try:
        load_more_button = driver.find_element(By.XPATH, "//a[contains(@class, 'a-content-ignore') and contains(text(), 'Load More')]")
        load_more_button.click()
    except Exception:
        # No "Load More" link was found (or the click failed), so stop paging
        break
        
    # Wait for new content to load
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//h2[@class='c-gallery-vertical-featured-image__title']")))
    time.sleep(2)

# Close the browser
driver.quit()

for song in song_names:
    print(song)

Donna Summer, ‘Last Dance’ (1979)
Fatboy Slim, ‘The Rockafeller Skank’ (1998)
Mescalinum United, ‘We Have Arrived’ (1991)
Oliver Heldens, ‘Melody’ (2016)
Kerri Chandler, ‘Rain’ (1998)
Detroit Grand Pubahs, ‘Sandwiches’ (2000)
Black Box, ‘Everybody Everybody’ (1990)
Big Freedia, ‘Azz Everywhere’ (2010)
Joy Orbison, ‘Hyph Mngo’ (2009)
ESG, ‘Moody’ (1981)
La Roux, ‘In for the Kill (Skream’s Let’s Get Ravey Remix)’ (2009)
Double 99, ‘RIP Groove’ (1997) ​​
Snap!, ‘The Power’ (1990)
DJ Frosty feat. Fatman Scoop, DJ Webstar, Young B. & Smooth, ‘Ride That Wave (Remix)’ (2010)
Todd Terje, “Inspector Norse” (2012)
The Rapture, ‘House of Jealous Lovers’ (2002)
TNGHT, ‘Higher Ground’ (2012)
Roni Size and Reprazent, ‘Brown Paper Bag’ (1997)
Soul II Soul, ‘Back to Life (However Do You Want Me)’ (1989)
Felix da Housecat, ‘Silver Screen Shower Scene’ (2001)
Dntel feat. Ben Gibbard, “(This Is) The Dream of Evan and Chan (Superpitcher Kompakt Remix)” (2001)
Patrick Cowley feat. Sylvester, ‘Do Ya Wanna F

In [35]:
# Lists to store songs and artists
two_hundred_greatest_songs = []
two_hundred_greatest_artists = []

# Split songs and artists
for item in song_names:
    parts = item.split(',')
    if len(parts) >= 2: # Making sure we have at least two parts after splitting the string.
        artist = parts[0].strip() # .strip() is used to remove white space around separate parts (artist and song)
        song = parts[1].strip().strip('‘’').strip('“”').strip('’')  # Remove quotes around the song
        
        two_hundred_greatest_artists.append(artist)
        two_hundred_greatest_songs.append(song)

In [36]:
two_hundred_greatest_songs[0:21]

['Last Dance’ (1979)',
 'The Rockafeller Skank’ (1998)',
 'We Have Arrived’ (1991)',
 'Melody’ (2016)',
 'Rain’ (1998)',
 'Sandwiches’ (2000)',
 'Everybody Everybody’ (1990)',
 'Azz Everywhere’ (2010)',
 'Hyph Mngo’ (2009)',
 'Moody’ (1981)',
 'In for the Kill (Skream’s Let’s Get Ravey Remix)’ (2009)',
 'RIP Groove’ (1997) \u200b\u200b',
 'The Power’ (1990)',
 'DJ Webstar',
 'Inspector Norse” (2012)',
 'House of Jealous Lovers’ (2002)',
 'Higher Ground’ (2012)',
 'Brown Paper Bag’ (1997)',
 'Back to Life (However Do You Want Me)’ (1989)',
 'Silver Screen Shower Scene’ (2001)',
 '(This Is) The Dream of Evan and Chan (Superpitcher Kompakt Remix)” (2001)']

In [37]:
cleaned_two_hundred_greatest_songs = [song.split("’")[0].split("”")[0] for song in two_hundred_greatest_songs]

In [38]:
cleaned_two_hundred_greatest_songs[0:21]

['Last Dance',
 'The Rockafeller Skank',
 'We Have Arrived',
 'Melody',
 'Rain',
 'Sandwiches',
 'Everybody Everybody',
 'Azz Everywhere',
 'Hyph Mngo',
 'Moody',
 'In for the Kill (Skream',
 'RIP Groove',
 'The Power',
 'DJ Webstar',
 'Inspector Norse',
 'House of Jealous Lovers',
 'Higher Ground',
 'Brown Paper Bag',
 'Back to Life (However Do You Want Me)',
 'Silver Screen Shower Scene',
 '(This Is) The Dream of Evan and Chan (Superpitcher Kompakt Remix)']
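
The output above shows two side effects of the string handling in this section: splitting on every comma turns the DJ Frosty entry into the "song" `DJ Webstar`, and cutting at the first `’` truncates `In for the Kill (Skream’s Let’s Get Ravey Remix)`. One possible alternative is a single regular expression that reads artist, title and optional year in one pass. This is only a sketch that assumes every heading follows the `Artist, ‘Title’ (Year)` shape seen above; `ENTRY_PATTERN` and `split_entry` are hypothetical names, not part of the notebook:

```python
import re

# Artist: everything up to the first comma that is followed by an opening quote.
# Title: everything up to the last closing quote. Year: optional "(YYYY)" suffix.
ENTRY_PATTERN = re.compile(
    r"^(?P<artist>.+?),\s*[“‘](?P<title>.+)[”’]\s*(?:\((?P<year>\d{4})\))?$"
)


def split_entry(text):
    """Return (artist, title, year) for one heading, or None if it does not match."""
    # Remove zero-width spaces (U+200B), which str.strip() does not treat as whitespace
    cleaned = text.replace("\u200b", "").strip()
    match = ENTRY_PATTERN.match(cleaned)
    if match is None:
        return None
    return match.group("artist"), match.group("title"), match.group("year")


# Headings copied from the scraped output above
examples = [
    "La Roux, ‘In for the Kill (Skream’s Let’s Get Ravey Remix)’ (2009)",
    "DJ Frosty feat. Fatman Scoop, DJ Webstar, Young B. & Smooth, ‘Ride That Wave (Remix)’ (2010)",
    "Todd Terje, “Inspector Norse” (2012)",
    "Double 99, ‘RIP Groove’ (1997) \u200b\u200b",
]
for entry in examples:
    print(split_entry(entry))
```

Entries that return `None` can be collected in a separate list and inspected by hand instead of being silently dropped.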

#### Add the songs to the database

In [39]:
df1 = pd.read_csv('top_100_billboard_songs_2023-08-12.csv')
print(df1.shape)
df1.head()

(100, 2)


Unnamed: 0,song,artist
0,Last Night,Morgan Wallen
1,Fast Car,Luke Combs
2,Meltdown,Travis Scott Featuring Drake
3,Cruel Summer,Taylor Swift
4,FE!N,Travis Scott Featuring Playboi Carti


In [40]:
print(len(hip_hop_songs))
print(len(best_2022_songs))
print(len(five_hundred_greatest_songs))
print(len(cleaned_two_hundred_greatest_songs))

100
100
500
50
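
The last count stands out: the dance list is titled as 200 songs, but only 50 entries were collected, which may mean the "Load More" loop stopped early on that page. A small sanity check before merging can surface this kind of shortfall. This is a sketch; `find_count_mismatches` is a hypothetical helper, the expected sizes are assumptions taken from the list titles, and the lists are placeholders:

```python
def find_count_mismatches(scraped, expected):
    """Return {name: (actual, expected)} for every list with an unexpected length."""
    return {
        name: (len(scraped[name]), size)
        for name, size in expected.items()
        if len(scraped[name]) != size
    }


# Placeholder lists with the same lengths as the counts printed above
scraped = {
    "hip_hop": ["x"] * 100,
    "best_2022": ["x"] * 100,
    "five_hundred_greatest": ["x"] * 500,
    "two_hundred_dance": ["x"] * 50,
}
# Expected sizes, assumed from the titles of the scraped lists
expected = {
    "hip_hop": 100,
    "best_2022": 100,
    "five_hundred_greatest": 500,
    "two_hundred_dance": 200,
}
mismatches = find_count_mismatches(scraped, expected)
print(mismatches)
```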


In [41]:
# Create a new DataFrame with the data from the new lists
new_data = {'song': hip_hop_songs + best_2022_songs + five_hundred_greatest_songs + cleaned_two_hundred_greatest_songs,
               'artist': hip_hop_artists + best_2022_artists + five_hundred_greatest_artists + two_hundred_greatest_artists}

df2 = pd.DataFrame(new_data)
df2

Unnamed: 0,song,artist
0,Cars With the Boom,L’Trimm
1,Get Low,Lil Jon and the East Side Boyz feat. Ying Yang...
2,Paper Planes,M.I.A.
3,Empire State of Mind,Jay Z and Alicia Keys
4,Slow Down,Brand Nubian
...,...,...
745,Losing My Edge,LCD Soundsystem
746,Ojos Asi (Thunder Mix),Shakira
747,My Red Hot Car,Squarepusher
748,Sing It Back (Boris Musical Mix),Moloko


In [42]:
merged_df = pd.concat([df1, df2], ignore_index=True, axis=0)
merged_df

Unnamed: 0,song,artist
0,Last Night,Morgan Wallen
1,Fast Car,Luke Combs
2,Meltdown,Travis Scott Featuring Drake
3,Cruel Summer,Taylor Swift
4,FE!N,Travis Scott Featuring Playboi Carti
...,...,...
845,Losing My Edge,LCD Soundsystem
846,Ojos Asi (Thunder Mix),Shakira
847,My Red Hot Car,Squarepusher
848,Sing It Back (Boris Musical Mix),Moloko


In [43]:
merged_df.duplicated().sum()

26

In [44]:
cleaned_merged_df = merged_df.drop_duplicates()
print(len(cleaned_merged_df))

824


In [45]:
import datetime

now = datetime.date.today()
namefile = 'songs_df_'+str(now)+'.csv'
namefile

'songs_df_2023-08-19.csv'

In [46]:
cleaned_merged_df.to_csv(namefile, index=False)
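
The dated file name is built the same way in several cells of this notebook. A small helper could keep the pattern in one place. This is a sketch; `dated_filename` and its parameters are hypothetical names:

```python
import datetime


def dated_filename(prefix, day=None, extension="csv"):
    """Build '<prefix>_<YYYY-MM-DD>.<extension>', defaulting to today's date."""
    if day is None:
        day = datetime.date.today()
    # date.isoformat() gives the same YYYY-MM-DD text as str(date)
    return f"{prefix}_{day.isoformat()}.{extension}"


print(dated_filename("songs_df", datetime.date(2023, 8, 19)))
```

Passing the date explicitly makes the result reproducible; omitting it uses the current day, as the cells above do.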

#### Top 1000 songs from 1920 to 2020 from Kaggle

To expand the database further, we add a Kaggle dataset of top songs from 1920 to 2020.

In [47]:
df3 = pd.read_csv('top1000_songs.csv')
print(len(df3))
df3.head()

999


Unnamed: 0,track_id,track_name,artist,album,acousticness,danceability,energy,key,loudness,mode,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,decade
0,7dOz8RrPWP9UgJ8X8p1vU7,West End Blues,Louis Armstrong,The Essential Louis Armstrong,0.986,0.602,0.206,3,-12.063,1,0.0457,0.372,0.0729,0.41,83.693,196600,4,1920
1,2WZxlrWF4LUC2ROlrdbQM1,St Louis Blues,Bessie Smith,The Anthology,0.993,0.568,0.205,10,-5.325,1,0.0404,5.3e-05,0.106,0.271,69.954,190720,4,1920
2,4M6I6aTorpVswszBYtkiUb,Blue Yodel #1 (t For Texas),Jimmie Rodgers,The Very Best Of,0.972,0.527,0.228,9,-15.461,1,0.0392,0.735,0.147,0.561,161.157,204200,4,1920
3,3agLlt9JpTB330JWO3IA5v,Ain't Misbehavin,Fats Waller,1934-1943 - Ain't Misbehavin,0.989,0.546,0.299,0,-14.836,0,0.0635,0.791,0.192,0.635,96.154,241427,4,1920
4,0nr7SuSNymfeyfe09ozVsu,Wildwood Flower,A. P. Carter,The Carter Family 1927 - 1934 Disc A,0.947,0.609,0.257,5,-13.194,1,0.0295,0.00307,0.132,0.47,109.379,190507,4,1920


In [48]:
columns_to_keep = ['track_name', 'artist']

cleaned_df3 = df3.loc[:, columns_to_keep].copy()  # df3 holds the Kaggle data loaded above
cleaned_df3.columns = ['song', 'artist']
cleaned_df3.head()

Unnamed: 0,song,artist
0,West End Blues,Louis Armstrong
1,St Louis Blues,Bessie Smith
2,Blue Yodel #1 (t For Texas),Jimmie Rodgers
3,Ain't Misbehavin,Fats Waller
4,Wildwood Flower,A. P. Carter


In [55]:
final_merged_df = pd.concat([cleaned_df3, cleaned_merged_df], ignore_index=True, axis=0)
final_merged_df

Unnamed: 0,song,artist
0,West End Blues,Louis Armstrong
1,St Louis Blues,Bessie Smith
2,Blue Yodel #1 (t For Texas),Jimmie Rodgers
3,Ain't Misbehavin,Fats Waller
4,Wildwood Flower,A. P. Carter
...,...,...
1818,Losing My Edge,LCD Soundsystem
1819,Ojos Asi (Thunder Mix),Shakira
1820,My Red Hot Car,Squarepusher
1821,Sing It Back (Boris Musical Mix),Moloko


In [56]:
final_merged_df.duplicated().sum()

29

In [57]:
final_merged_df = final_merged_df.drop_duplicates()
print(len(final_merged_df))

1794
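
`duplicated()` only flags rows that match exactly, so the same song can survive twice when two sources differ in capitalisation, spacing or apostrophe style (the Kaggle file uses `'`, the Rolling Stone headings use `’`). A possible normalisation step is sketched below in plain Python; `song_key` is a hypothetical helper and the sample rows are made up for illustration:

```python
import re


def song_key(song, artist):
    """Build a comparison key that ignores case, apostrophe style and extra spaces."""

    def normalise(text):
        text = text.replace("’", "'").replace("‘", "'")
        return re.sub(r"\s+", " ", text).strip().casefold()

    return normalise(song), normalise(artist)


# Illustrative rows: the first two differ only in apostrophe style and case
rows = [
    ("Ain't Misbehavin", "Fats Waller"),
    ("Ain’t misbehavin ", "Fats Waller"),
    ("Fast Car", "Luke Combs"),
]

seen = set()
unique_rows = []
for song, artist in rows:
    key = song_key(song, artist)
    if key not in seen:  # keep only the first row for each normalised key
        seen.add(key)
        unique_rows.append((song, artist))
print(unique_rows)
```

In pandas, the same keys could be stored in helper columns and passed to `drop_duplicates(subset=...)`.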


In [58]:
now = datetime.date.today()
namefile = 'full_song_df_'+str(now)+'.csv'
namefile

'full_song_df_2023-08-19.csv'

In [59]:
final_merged_df.to_csv(namefile, index=False)

## Practice web scraping

### List of links

In [27]:
url = 'https://en.wikipedia.org/wiki/Python'
page = requests.get(url)
page.status_code

200

In [32]:
soup = BeautifulSoup(page.content, "html.parser")

In [34]:
list_of_links = soup.find_all('a', href=True)
list_of_links

[<a class="mw-jump-link" href="#bodyContent">Jump to content</a>,
 <a accesskey="z" href="/wiki/Main_Page" title="Visit the main page [z]"><span>Main page</span></a>,
 <a href="/wiki/Wikipedia:Contents" title="Guides to browsing Wikipedia"><span>Contents</span></a>,
 <a href="/wiki/Portal:Current_events" title="Articles related to current events"><span>Current events</span></a>,
 <a accesskey="x" href="/wiki/Special:Random" title="Visit a randomly selected article [x]"><span>Random article</span></a>,
 <a href="/wiki/Wikipedia:About" title="Learn about Wikipedia and how it works"><span>About Wikipedia</span></a>,
 <a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us" title="How to contact Wikipedia"><span>Contact us</span></a>,
 <a href="https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&amp;utm_medium=sidebar&amp;utm_campaign=C13_en.wikipedia.org&amp;uselang=en" title="Support us by donating to the Wikimedia Foundation"><span>Donate</span></a>,
 <a href=

In [40]:
info = []
for link in list_of_links:
    href = link['href']
    info.append(href)
info

['#bodyContent',
 '/wiki/Main_Page',
 '/wiki/Wikipedia:Contents',
 '/wiki/Portal:Current_events',
 '/wiki/Special:Random',
 '/wiki/Wikipedia:About',
 '//en.wikipedia.org/wiki/Wikipedia:Contact_us',
 'https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en',
 '/wiki/Help:Contents',
 '/wiki/Help:Introduction',
 '/wiki/Wikipedia:Community_portal',
 '/wiki/Special:RecentChanges',
 '/wiki/Wikipedia:File_upload_wizard',
 '/wiki/Main_Page',
 '/wiki/Special:Search',
 '/w/index.php?title=Special:CreateAccount&returnto=Python',
 '/w/index.php?title=Special:UserLogin&returnto=Python',
 '/w/index.php?title=Special:CreateAccount&returnto=Python',
 '/w/index.php?title=Special:UserLogin&returnto=Python',
 '/wiki/Help:Introduction',
 '/wiki/Special:MyContributions',
 '/wiki/Special:MyTalk',
 '#',
 '#Snakes',
 '#Computing',
 '#People',
 '#Roller_coasters',
 '#Vehicles',
 '#Weaponry',
 '#Other_uses',
 '#See_also',
 

In [41]:
len(info)

160

In [53]:
filtered_list = [element for element in info if not element.startswith("#")]
len(filtered_list)

150

In [54]:
python_links_df = pd.DataFrame({'link' : filtered_list})
python_links_df

Unnamed: 0,link
0,/wiki/Main_Page
1,/wiki/Wikipedia:Contents
2,/wiki/Portal:Current_events
3,/wiki/Special:Random
4,/wiki/Wikipedia:About
...,...
145,https://developer.wikimedia.org
146,https://stats.wikimedia.org/#/en.wikipedia.org
147,https://foundation.wikimedia.org/wiki/Special:...
148,https://wikimediafoundation.org/
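
Most of the collected links are relative (`/wiki/...`) or protocol-relative (`//en.wikipedia.org/...`), so they cannot be requested as they are. A short sketch of how they could be resolved against the page URL with the standard library's `urljoin`; the three sample links are taken from the output above:

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/wiki/Python"

sample_links = [
    "/wiki/Main_Page",                               # relative to the site root
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",  # protocol-relative
    "https://developer.wikimedia.org",               # already absolute
]

# urljoin resolves each href against the base URL and leaves absolute URLs untouched
absolute_links = [urljoin(base_url, href) for href in sample_links]
for link in absolute_links:
    print(link)
```

The same list comprehension could be applied to `filtered_list` before building the dataframe.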


### Number of titles that have changed in the United States Code

In [55]:
url = 'http://uscode.house.gov/download/download.shtml'
page = requests.get(url)
page.status_code

200

In [56]:
soup = BeautifulSoup(page.content, "html.parser")

In [59]:
list_of_titles = soup.find_all("div", {"class" : "usctitle"})
list_of_titles

[<div class="usctitle" id="alltitles">
 
           All titles in the format selected compressed into a zip archive.
 
         </div>,
 <div class="usctitle" id="heading">
 
               
 
 	        </div>,
 <div class="usctitle" id="us/usc/t1">
 
           Title 1 - General Provisions <span class="footnote"><a class="fn" href="#fn">٭</a></span>
 </div>,
 <div class="usctitle" id="us/usc/t2">
 
           Title 2 - The Congress
 
         </div>,
 <div class="usctitle" id="us/usc/t3">
 
           Title 3 - The President <span class="footnote"><a class="fn" href="#fn">٭</a></span>
 </div>,
 <div class="usctitle" id="us/usc/t4">
 
           Title 4 - Flag and Seal, Seat of Government, and the States <span class="footnote"><a class="fn" href="#fn">٭</a></span>
 </div>,
 <div class="usctitle" id="us/usc/t5">
 
           Title 5 - Government Organization and Employees <span class="footnote"><a class="fn" href="#fn">٭</a></span>
 </div>,
 <div class="usctitle" id="us/usc/t6">
 
     

In [60]:
info = []
for element in list_of_titles:
    title = element.get_text()
    info.append(title)
info

['\n\n          All titles in the format selected compressed into a zip archive.\n\n        ',
 '\n\n             \xa0\n\n\t        ',
 '\n\n          Title 1 - General Provisions ٭\n',
 '\n\n          Title 2 - The Congress\n\n        ',
 '\n\n          Title 3 - The President ٭\n',
 '\n\n          Title 4 - Flag and Seal, Seat of Government, and the States ٭\n',
 '\n\n          Title 5 - Government Organization and Employees ٭\n',
 '\n\n          Title 6 - Domestic Security\n\n        ',
 '\n\n          Title 7 - Agriculture\n\n        ',
 '\n\n          Title 8 - Aliens and Nationality\n\n        ',
 '\n\n          Title 9 - Arbitration ٭\n',
 '\n\n          Title 10 - Armed Forces ٭\n',
 '\n\n          Title 11 - Bankruptcy ٭\n',
 '\n\n          Title 12 - Banks and Banking\n\n        ',
 '\n\n          Title 13 - Census ٭\n',
 '\n\n          Title 14 - Coast Guard ٭\n',
 '\n\n          Title 15 - Commerce and Trade\n\n        ',
 '\n\n          Title 16 - Conservation\n\n        '

In [73]:
clean_info = []
for element in info:
    clean_name = element.replace('\n', '').replace(' ', '')
    clean_info.append(clean_name)
clean_info

['Alltitlesintheformatselectedcompressedintoaziparchive.',
 '\xa0\t',
 'Title1-GeneralProvisions٭',
 'Title2-TheCongress',
 'Title3-ThePresident٭',
 'Title4-FlagandSeal,SeatofGovernment,andtheStates٭',
 'Title5-GovernmentOrganizationandEmployees٭',
 'Title6-DomesticSecurity',
 'Title7-Agriculture',
 'Title8-AliensandNationality',
 'Title9-Arbitration٭',
 'Title10-ArmedForces٭',
 'Title11-Bankruptcy٭',
 'Title12-BanksandBanking',
 'Title13-Census٭',
 'Title14-CoastGuard٭',
 'Title15-CommerceandTrade',
 'Title16-Conservation',
 'Title17-Copyrights٭',
 'Title18-CrimesandCriminalProcedure٭',
 'Title20-Education',
 'Title21-FoodandDrugs',
 'Title22-ForeignRelationsandIntercourse',
 'Title23-Highways٭',
 'Title24-HospitalsandAsylums',
 'Title25-Indians',
 'Title26-InternalRevenueCode',
 'Title27-IntoxicatingLiquors',
 'Title28-JudiciaryandJudicialProcedure٭',
 'Title29-Labor',
 'Title30-MineralLandsandMining',
 'Title31-MoneyandFinance٭',
 'Title32-NationalGuard٭',
 'Title33-NavigationandNav
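
Removing every space also removes the spaces between words, which is why the cleaned titles above read `Title1-GeneralProvisions`. A gentler clean-up collapses runs of whitespace instead. This is a sketch; `clean_title` is a hypothetical helper, and dropping the `٭` footnote marker is an optional extra step:

```python
def clean_title(raw):
    """Collapse runs of whitespace to single spaces and drop the footnote marker."""
    # str.split() with no argument splits on any whitespace, including \n, \t and \xa0
    text = " ".join(raw.split())
    # '٭' is the footnote marker that follows some titles in the scraped text
    return text.rstrip("٭").strip()


# Raw strings copied from the scraped output above
raw_titles = [
    "\n\n          Title 1 - General Provisions ٭\n",
    "\n\n          Title 2 - The Congress\n\n        ",
    "\n\n             \xa0\n\n\t        ",
]
# Keep only entries that still contain text after cleaning
titles = [title for title in map(clean_title, raw_titles) if title]
print(titles)
```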

In [69]:
title_df = pd.DataFrame({"titles": clean_info[2:]})
title_df

Unnamed: 0,titles
0,Title1-GeneralProvisions٭
1,Title2-TheCongress
2,Title3-ThePresident٭
3,"Title4-FlagandSeal,SeatofGovernment,andtheStates٭"
4,Title5-GovernmentOrganizationandEmployees٭
5,Title6-DomesticSecurity
6,Title7-Agriculture
7,Title8-AliensandNationality
8,Title9-Arbitration٭
9,Title10-ArmedForces٭
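
The cells above collect every title, whereas the exercise asks how many titles have changed. One possible way to count them, assuming the download page marks changed titles with a separate CSS class: the class name `usctitlechanged` used here is an assumption that should be checked against the live page. To stay runnable without a network connection, the sketch uses the standard library's `html.parser` on a small made-up HTML snippet rather than BeautifulSoup on the real page:

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count opening tags of a given name that carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated class names
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


# Made-up snippet that imitates the structure of the scraped <div> elements
sample_html = """
<div class="usctitle" id="us/usc/t1">Title 1 - General Provisions</div>
<div class="usctitlechanged" id="us/usc/t2">Title 2 - The Congress</div>
<div class="usctitle" id="us/usc/t3">Title 3 - The President</div>
<div class="usctitlechanged" id="us/usc/t6">Title 6 - Domestic Security</div>
"""

counter = ClassCounter("div", "usctitlechanged")  # assumed class name
counter.feed(sample_html)
print(counter.count)
```

With the notebook's `soup` object, the analogous lookup would be `soup.find_all("div", {"class": "usctitlechanged"})`, again assuming that class name.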


### Top 10 FBI's Most Wanted names

In [2]:
url = 'https://www.fbi.gov/wanted/topten'
page = requests.get(url)
page.status_code

200

In [8]:
soup = BeautifulSoup(page.content, "html.parser")

In [10]:
list_names = soup.find_all("h3", {"class" : "title"})
list_names

[<h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/donald-eugene-fields-ii">DONALD EUGENE FIELDS II</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/alexis-flores">ALEXIS FLORES</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/arnoldo-jimenez">ARNOLDO JIMENEZ</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/omar-alexander-cardenas">OMAR ALEXANDER CARDENAS</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/yulan-adonay-archaga-carias">YULAN ADONAY ARCHAGA CARIAS</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/bhadreshkumar-chetanbhai-patel">BHADRESHKUMAR CHETANBHAI PATEL</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/wilver-villegas-palomino">WILVER VILLEGAS-PALOMINO</a>
 </h3>,
 <h3 class="title">
 <a href="https://www.fbi.gov/wanted/topten/alejandro-castillo">ALEJANDRO ROSALES CASTILLO</a>
 </h3>,
 <h3 class="ti

In [11]:
info = []
for element in list_names:
    name = element.get_text()
    info.append(name)
info

['\nDONALD EUGENE FIELDS II\n',
 '\nALEXIS FLORES\n',
 '\nARNOLDO JIMENEZ\n',
 '\nOMAR ALEXANDER CARDENAS\n',
 '\nYULAN ADONAY ARCHAGA CARIAS\n',
 '\nBHADRESHKUMAR CHETANBHAI PATEL\n',
 '\nWILVER VILLEGAS-PALOMINO\n',
 '\nALEJANDRO ROSALES CASTILLO\n',
 '\nRUJA IGNATOVA\n',
 '\nJOSE RODOLFO VILLARREAL-HERNANDEZ\n']

In [13]:
clean_info = []
for element in info:
    clean_name = element.replace('\n', '')
    clean_info.append(clean_name)
clean_info

['DONALD EUGENE FIELDS II',
 'ALEXIS FLORES',
 'ARNOLDO JIMENEZ',
 'OMAR ALEXANDER CARDENAS',
 'YULAN ADONAY ARCHAGA CARIAS',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'WILVER VILLEGAS-PALOMINO',
 'ALEJANDRO ROSALES CASTILLO',
 'RUJA IGNATOVA',
 'JOSE RODOLFO VILLARREAL-HERNANDEZ']

In [16]:
top_ten_df = pd.DataFrame({'name' : clean_info})
top_ten_df

Unnamed: 0,name
0,DONALD EUGENE FIELDS II
1,ALEXIS FLORES
2,ARNOLDO JIMENEZ
3,OMAR ALEXANDER CARDENAS
4,YULAN ADONAY ARCHAGA CARIAS
5,BHADRESHKUMAR CHETANBHAI PATEL
6,WILVER VILLEGAS-PALOMINO
7,ALEJANDRO ROSALES CASTILLO
8,RUJA IGNATOVA
9,JOSE RODOLFO VILLARREAL-HERNANDEZ


In [19]:
import datetime

now = datetime.date.today()
namefile = 'ten_most_wanted_fugitives_'+str(now)+'.csv'
namefile

'ten_most_wanted_fugitives_2023-08-10.csv'

In [20]:
top_ten_df.to_csv(namefile)

### List all language names and number of related articles on Wikipedia

In [3]:
url = 'https://www.wikipedia.org/'
page = requests.get(url)
page.status_code

200

In [4]:
soup = BeautifulSoup(page.content, "html.parser")

In [5]:
list_languages_wikipedia = soup.find_all("strong")
list_languages_wikipedia

[<strong class="jsl10n localized-slogan" data-jsl10n="portal.slogan">The Free Encyclopedia</strong>,
 <strong>English</strong>,
 <strong>日本語</strong>,
 <strong>Español</strong>,
 <strong>Русский</strong>,
 <strong>Deutsch</strong>,
 <strong>Français</strong>,
 <strong>Italiano</strong>,
 <strong>中文</strong>,
 <strong>Português</strong>,
 <strong><bdi dir="rtl">فارسی</bdi></strong>,
 <strong class="jsl10n" data-jsl10n="portal.app-links.title">
 <a class="jsl10n" data-jsl10n="portal.app-links.url" href="https://en.wikipedia.org/wiki/List_of_Wikipedia_mobile_applications">
 Download Wikipedia for Android or iOS
 </a>
 </strong>]

In [6]:
info_wikipedia = []
for element in list_languages_wikipedia:
    language_wikipedia = element.get_text()
    info_wikipedia.append(language_wikipedia)
info_wikipedia

['The Free Encyclopedia',
 'English',
 '日本語',
 'Español',
 'Русский',
 'Deutsch',
 'Français',
 'Italiano',
 '中文',
 'Português',
 'فارسی',
 '\n\nDownload Wikipedia for Android or iOS\n\n']

In [7]:
print(len(info_wikipedia))

12


In [8]:
languages = info_wikipedia[1:11]
languages

['English',
 '日本語',
 'Español',
 'Русский',
 'Deutsch',
 'Français',
 'Italiano',
 '中文',
 'Português',
 'فارسی']

In [9]:
len(languages)

10

In [10]:
list_number_of_articles = soup.find_all("bdi")
list_number_of_articles

[<bdi dir="ltr">6 691 000+</bdi>,
 <bdi dir="ltr">1 382 000+</bdi>,
 <bdi dir="ltr">1 881 000+</bdi>,
 <bdi dir="ltr">1 930 000+</bdi>,
 <bdi dir="ltr">2 822 000+</bdi>,
 <bdi dir="ltr">2 540 000+</bdi>,
 <bdi dir="ltr">1 820 000+</bdi>,
 <bdi dir="ltr">1 369 000+</bdi>,
 <bdi dir="ltr">1 105 000+</bdi>,
 <bdi dir="rtl">فارسی</bdi>,
 <bdi dir="ltr">969 000+</bdi>,
 <bdi dir="ltr">
 1 000 000+
 </bdi>,
 <bdi dir="rtl">العربية</bdi>,
 <bdi dir="rtl">مصرى</bdi>,
 <bdi dir="ltr">
 100 000+
 </bdi>,
 <bdi dir="rtl">فارسی</bdi>,
 <bdi dir="rtl">עברית</bdi>,
 <bdi dir="rtl" lang="kk-Arab">قازاقشا</bdi>,
 <bdi dir="rtl">تۆرکجه</bdi>,
 <bdi dir="rtl">اردو</bdi>,
 <bdi dir="ltr">
 10 000+
 </bdi>,
 <bdi dir="rtl" lang="ku-Arab">كوردی</bdi>,
 <bdi dir="rtl">کوردیی ناوەندی</bdi>,
 <bdi dir="rtl">مازِرونی</bdi>,
 <bdi dir="rtl">پنجابی (شاہ مکھی)</bdi>,
 <bdi dir="rtl">پښتو</bdi>,
 <bdi dir="rtl">سنڌي</bdi>,
 <bdi dir="rtl">ייִדיש</bdi>,
 <bdi dir="ltr">
 1 000+
 </bdi>,
 <bdi dir="rtl" lang="lad-He

In [11]:
info_numbers_of_articles = []
for element in list_number_of_articles:
    number = element.get_text()
    info_numbers_of_articles.append(number)
info_numbers_of_articles

['6\xa0691\xa0000+',
 '1\xa0382\xa0000+',
 '1\xa0881\xa0000+',
 '1\xa0930\xa0000+',
 '2\xa0822\xa0000+',
 '2\xa0540\xa0000+',
 '1\xa0820\xa0000+',
 '1\xa0369\xa0000+',
 '1\xa0105\xa0000+',
 'فارسی',
 '969\xa0000+',
 '\n1\xa0000\xa0000+\n',
 'العربية',
 'مصرى',
 '\n100\xa0000+\n',
 'فارسی',
 'עברית',
 'قازاقشا',
 'تۆرکجه',
 'اردو',
 '\n10\xa0000+\n',
 'كوردی',
 'کوردیی ناوەندی',
 'مازِرونی',
 'پنجابی (شاہ مکھی)',
 'پښتو',
 'سنڌي',
 'ייִדיש',
 '\n1\xa0000+\n',
 'לאדינו',
 'ܐܬܘܪܝܐ',
 'الدارجة',
 'ދިވެހިބަސް',
 'گیلکی',
 'كٲشُر',
 'ئۇيغۇرچه',
 '\n100+\n']

In [12]:
clean_info_numbers_of_articles = []
for element in info_numbers_of_articles:
    # re.sub(pattern, repl, string) replaces every match of pattern in string with repl.
    # Here it removes newlines, non-breaking spaces (\xa0) and '+' signs.
    clean_name = re.sub(r'[\n\xa0+]', '', element)
    clean_info_numbers_of_articles.append(clean_name)
print(clean_info_numbers_of_articles)

['6691000', '1382000', '1881000', '1930000', '2822000', '2540000', '1820000', '1369000', '1105000', 'فارسی', '969000', '1000000', 'العربية', 'مصرى', '100000', 'فارسی', 'עברית', 'قازاقشا', 'تۆرکجه', 'اردو', '10000', 'كوردی', 'کوردیی ناوەندی', 'مازِرونی', 'پنجابی (شاہ مکھی)', 'پښتو', 'سنڌي', 'ייִדיש', '1000', 'לאדינו', 'ܐܬܘܪܝܐ', 'الدارجة', 'ދިވެހިބަސް', 'گیلکی', 'كٲشُر', 'ئۇيغۇرچه', '100']


In [23]:
number_of_articles = clean_info_numbers_of_articles[0:9]
number_of_articles

['6691000',
 '1382000',
 '1881000',
 '1930000',
 '2822000',
 '2540000',
 '1820000',
 '1369000',
 '1105000']

In [14]:
clean_info_numbers_of_articles[10]

'969000'

In [24]:
number_of_articles.append(clean_info_numbers_of_articles[10])
number_of_articles

['6691000',
 '1382000',
 '1881000',
 '1930000',
 '2822000',
 '2540000',
 '1820000',
 '1369000',
 '1105000',
 '969000']

In [25]:
len(number_of_articles)

10
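
The slice `[0:9]` followed by a separate `append` of index 10 works because the tenth `<bdi>` element holds a language name rather than a count. As a hedged alternative, the sketch below selects the counts by pattern instead of by position. The sample list copies the first entries of the output shown earlier; `raw_values`, `article_counts` and the cutoff of ten are illustrative assumptions, not part of the original notebook.

```python
import re

# First entries of info_numbers_of_articles, copied from the output above.
raw_values = [
    '6\xa0691\xa0000+', '1\xa0382\xa0000+', '1\xa0881\xa0000+',
    '1\xa0930\xa0000+', '2\xa0822\xa0000+', '2\xa0540\xa0000+',
    '1\xa0820\xa0000+', '1\xa0369\xa0000+', '1\xa0105\xa0000+',
    'فارسی', '969\xa0000+', '\n1\xa0000\xa0000+\n', 'العربية',
]

# Same cleaning rule as the notebook: drop newlines, non-breaking spaces and '+'.
cleaned = [re.sub(r'[\n\xa0+]', '', value) for value in raw_values]

# Keep only strings made entirely of ASCII digits, then take the first ten
# (assumption: the ten featured languages come first on the page).
article_counts = [
    int(value) for value in cleaned if re.fullmatch(r'[0-9]+', value)
][:10]

print(article_counts)
```

Filtering this way skips the stray language name without depending on its exact index, although it still assumes the featured counts precede the grouped totals such as `1 000 000+`.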

In [26]:
languages_wikipedia_df = pd.DataFrame({'languages' : languages, 'number_of_articles' : number_of_articles})
languages_wikipedia_df

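|   | languages | number_of_articles |
|---|-----------|--------------------|
| 0 | English | 6691000 |
| 1 | 日本語 | 1382000 |
| 2 | Español | 1881000 |
| 3 | Русский | 1930000 |
| 4 | Deutsch | 2822000 |
| 5 | Français | 2540000 |
| 6 | Italiano | 1820000 |
| 7 | 中文 | 1369000 |
| 8 | Português | 1105000 |
| 9 | فارسی | 969000 |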
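
Because `number_of_articles` holds strings, sorting on that column would compare text rather than numbers. The sketch below, using the values from the table above, pairs each language with an integer count and ranks them. The names `pairs` and `ranked` are hypothetical, and the pairing assumes both lists follow the same page order.

```python
languages = ['English', '日本語', 'Español', 'Русский', 'Deutsch',
             'Français', 'Italiano', '中文', 'Português', 'فارسی']
number_of_articles = ['6691000', '1382000', '1881000', '1930000', '2822000',
                      '2540000', '1820000', '1369000', '1105000', '969000']

# Guard against a silent mismatch before pairing by position.
assert len(languages) == len(number_of_articles)

# Convert counts to int so that sorting is numeric, not lexicographic.
pairs = [(language, int(count))
         for language, count in zip(languages, number_of_articles)]
ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)

for language, count in ranked[:3]:
    print(f'{language}: {count}')
```

In the notebook, passing `ranked` to `pd.DataFrame(ranked, columns=['languages', 'number_of_articles'])` should give the same table with an integer column.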


### List of the different kinds of datasets

In [52]:
url = "https://www.data.gov.uk/"
page = requests.get(url)
page.status_code

200

In [53]:
soup = BeautifulSoup(page.content, "html.parser")

In [57]:
list_datasets = soup.find_all("a", {"class" : "govuk-link"})
list_datasets

[<a class="govuk-link" href="/cookies">cookies to collect information</a>,
 <a class="govuk-link" href="/cookies">View cookies</a>,
 <a class="govuk-link" data-module="gem-track-click" data-track-action="Cookie banner settings clicked from confirmation" data-track-category="cookieBanner" href="/cookies">change your cookie settings</a>,
 <a class="govuk-link" href="http://www.smartsurvey.co.uk/s/3SEXD/">feedback</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Business+and+economy">Business and economy</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Crime+and+justice">Crime and justice</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Defence">Defence</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Education">Education</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Environment">Environment</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Government">Government</a>,
 <a class="govuk-link" href="/search?filters%5Bt

In [58]:
datasets_info = []
for element in list_datasets:
    dataset = element.get_text()
    datasets_info.append(dataset)
print(datasets_info)

['cookies to collect information', 'View cookies', 'change your cookie settings', 'feedback', 'Business and economy', 'Crime and justice', 'Defence', 'Education', 'Environment', 'Government', 'Government spending', 'Health', 'Mapping', 'Society', 'Towns and cities', 'Transport', 'Digital service performance', 'Government reference data']


In [59]:
datasets_info[4:]

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport',
 'Digital service performance',
 'Government reference data']
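
The slice `[4:]` depends on exactly four cookie and feedback links appearing first. As a hedged alternative, the sketch below keeps only links whose `href` carries a `filters[topic]` query parameter, using the standard library's `urllib.parse`. The `(text, href)` pairs are copied from the visible part of the `list_datasets` output. The output is truncated, so whether the last entries (for example "Government reference data") follow the same URL pattern is not known, and `topic_from_href` is a hypothetical helper.

```python
from urllib.parse import urlsplit, parse_qs

# (text, href) pairs copied from the list_datasets output above.
links = [
    ('cookies to collect information', '/cookies'),
    ('View cookies', '/cookies'),
    ('change your cookie settings', '/cookies'),
    ('feedback', 'http://www.smartsurvey.co.uk/s/3SEXD/'),
    ('Business and economy', '/search?filters%5Btopic%5D=Business+and+economy'),
    ('Crime and justice', '/search?filters%5Btopic%5D=Crime+and+justice'),
    ('Defence', '/search?filters%5Btopic%5D=Defence'),
]


def topic_from_href(href):
    """Return the decoded 'filters[topic]' value, or None if the link has none."""
    query = parse_qs(urlsplit(href).query)
    values = query.get('filters[topic]')
    return values[0] if values else None


# Keep the link text only when the href points at a topic search.
topics = [text for text, href in links if topic_from_href(href) is not None]
print(topics)
```

In the notebook, the pairs could be built from the scraped tags with `(a.get_text(), a.get('href', ''))` for each `a` in `list_datasets`.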

In [66]:
list_datasets_df = pd.DataFrame({'dataset' : datasets_info[4:]})
list_datasets_df

|   | dataset |
|---|---------|
| 0 | Business and economy |
| 1 | Crime and justice |
| 2 | Defence |
| 3 | Education |
| 4 | Environment |
| 5 | Government |
| 6 | Government spending |
| 7 | Health |
| 8 | Mapping |
| 9 | Society |


### Top 10 languages by number of native speakers

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
page = requests.get(url)
page.status_code

200

In [4]:
soup = BeautifulSoup(page.content, "html.parser")

In [19]:
list_languages = soup.find_all("a", {"class" : "mw-redirect"})
list_languages

[<a class="mw-redirect" href="/wiki/Native_speaker" title="Native speaker">native speakers</a>,
 <a class="mw-redirect" href="/wiki/Mutually_intelligible" title="Mutually intelligible">mutually intelligible</a>,
 <a class="mw-redirect" href="/wiki/Arabic_language" title="Arabic language">Arabic</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:cmn" title="ISO 639:cmn">Mandarin Chinese</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:spa" title="ISO 639:spa">Spanish</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:eng" title="ISO 639:eng">English</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:hin" title="ISO 639:hin">Hindi</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:por" title="ISO 639:por">Portuguese</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:ben" title="ISO 639:ben">Bengali</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:rus" title="ISO 639:rus">Russian</a>,
 <a class="mw-redirect" href="/wiki/ISO_639:jpn" title="ISO 639:jpn">Japanese</a>,
 <a class="mw-redirect" href="/w

In [20]:
info = []
for element in list_languages:
    language = element.get_text()
    info.append(language)
print(info)

['native speakers', 'mutually intelligible', 'Arabic', 'Mandarin Chinese', 'Spanish', 'English', 'Hindi', 'Portuguese', 'Bengali', 'Russian', 'Japanese', 'Yue Chinese', 'Vietnamese', 'Turkish', 'Wu Chinese', 'Marathi', 'Telugu', 'Korean', 'French', 'Tamil', 'Egyptian Spoken Arabic', 'Standard German', 'Urdu', 'Javanese', 'Western Punjabi', 'Eastern Punjabi', 'Italian', 'Gujarati', 'Iranian Persian', 'Bhojpuri', 'Hausa', 'CIA World Factbook', 'Arabic', 'Hindi', 'ISBN', 'ISBN', 'ISBN', 'ISBN', 'ISBN', 'Countries', 'Arabic', 'Spanish', 'Europe', 'List of Afro-Asiatic languages', 'List of Austronesian languages', 'List of Tungusic languages']


In [38]:
info[3:13]

['Mandarin Chinese',
 'Spanish',
 'English',
 'Hindi',
 'Portuguese',
 'Bengali',
 'Russian',
 'Japanese',
 'Yue Chinese',
 'Vietnamese']
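
The slice `info[3:13]` relies on the three unrelated links ("native speakers", "mutually intelligible", "Arabic") staying at the front of the list. In the visible output, the links inside the table all point to `/wiki/ISO_639:...`, so one possible alternative is to filter on that prefix. This is a sketch under that assumption: the output is truncated, so the pattern is only confirmed for the entries shown, and `ISO_PREFIX` and `table_languages` are illustrative names.

```python
# (text, href) pairs copied from the list_languages output above.
links = [
    ('native speakers', '/wiki/Native_speaker'),
    ('mutually intelligible', '/wiki/Mutually_intelligible'),
    ('Arabic', '/wiki/Arabic_language'),
    ('Mandarin Chinese', '/wiki/ISO_639:cmn'),
    ('Spanish', '/wiki/ISO_639:spa'),
    ('English', '/wiki/ISO_639:eng'),
    ('Hindi', '/wiki/ISO_639:hin'),
    ('Portuguese', '/wiki/ISO_639:por'),
    ('Bengali', '/wiki/ISO_639:ben'),
    ('Russian', '/wiki/ISO_639:rus'),
    ('Japanese', '/wiki/ISO_639:jpn'),
]

ISO_PREFIX = '/wiki/ISO_639:'

# Keep only the links that target an ISO 639 redirect page.
table_languages = [text for text, href in links if href.startswith(ISO_PREFIX)]
print(table_languages)
```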

In [40]:
top_ten_languages_df = pd.DataFrame({'language' : info[3:13]})
top_ten_languages_df

|   | language |
|---|----------|
| 0 | Mandarin Chinese |
| 1 | Spanish |
| 2 | English |
| 3 | Hindi |
| 4 | Portuguese |
| 5 | Bengali |
| 6 | Russian |
| 7 | Japanese |
| 8 | Yue Chinese |
| 9 | Vietnamese |


#### Another way: obtaining all the data in the table

In [21]:
td_elements = soup.find_all('td')
td_elements

[<td><a class="mw-redirect" href="/wiki/ISO_639:cmn" title="ISO 639:cmn">Mandarin Chinese</a><br/>(incl. <a href="/wiki/Standard_Chinese" title="Standard Chinese">Standard Chinese</a>, but excl. <a href="/wiki/Varieties_of_Chinese" title="Varieties of Chinese">other varieties</a>)
 </td>,
 <td>939
 </td>,
 <td><a href="/wiki/Sino-Tibetan_languages" title="Sino-Tibetan languages">Sino-Tibetan</a>
 </td>,
 <td><a href="/wiki/Varieties_of_Chinese" title="Varieties of Chinese">Sinitic</a>
 </td>,
 <td><a class="mw-redirect" href="/wiki/ISO_639:spa" title="ISO 639:spa">Spanish</a>
 </td>,
 <td>485
 </td>,
 <td><a href="/wiki/Indo-European_languages" title="Indo-European languages">Indo-European</a>
 </td>,
 <td><a href="/wiki/Romance_languages" title="Romance languages">Romance</a>
 </td>,
 <td><a class="mw-redirect" href="/wiki/ISO_639:eng" title="ISO 639:eng">English</a>
 </td>,
 <td>380
 </td>,
 <td><a href="/wiki/Indo-European_languages" title="Indo-European languages">Indo-European</a>

In [22]:
info2 = []
for element in td_elements:
    td = element.get_text()
    info2.append(td)
print(info2)

['Mandarin Chinese(incl. Standard Chinese, but excl. other varieties)\n', '939\n', 'Sino-Tibetan\n', 'Sinitic\n', 'Spanish\n', '485\n', 'Indo-European\n', 'Romance\n', 'English\n', '380\n', 'Indo-European\n', 'Germanic\n', 'Hindi(excl. Urdu, and other languages)\n', '345\n', 'Indo-European\n', 'Indo-Aryan\n', 'Portuguese\n', '236\n', 'Indo-European\n', 'Romance\n', 'Bengali\n', '234\n', 'Indo-European\n', 'Indo-Aryan\n', 'Russian\n', '147\n', 'Indo-European\n', 'Balto-Slavic\n', 'Japanese\n', '123\n', 'Japonic\n', 'Japanese\n', 'Yue Chinese(incl. Cantonese)\n', '86.1\n', 'Sino-Tibetan\n', 'Sinitic\n', 'Vietnamese\n', '85.0\n', 'Austroasiatic\n', 'Vietic\n', 'Turkish\n', '84.0\n', 'Turkic\n', 'Oghuz\n', 'Wu Chinese(incl. Shanghainese)\n', '83.4\n', 'Sino-Tibetan\n', 'Sinitic\n', 'Marathi\n', '83.2\n', 'Indo-European\n', 'Indo-Aryan\n', 'Telugu\n', '83.0\n', 'Dravidian\n', 'South-Central\n', 'Korean\n', '81.7\n', 'Koreanic\n', '—\n', 'French\n', '80.8\n', 'Indo-European\n', 'Romance\n', 

In [24]:
clean_info2 = []
for element in info2:
    clean_name = element.replace('\n', '')
    clean_info2.append(clean_name)
print(clean_info2)

['Mandarin Chinese(incl. Standard Chinese, but excl. other varieties)', '939', 'Sino-Tibetan', 'Sinitic', 'Spanish', '485', 'Indo-European', 'Romance', 'English', '380', 'Indo-European', 'Germanic', 'Hindi(excl. Urdu, and other languages)', '345', 'Indo-European', 'Indo-Aryan', 'Portuguese', '236', 'Indo-European', 'Romance', 'Bengali', '234', 'Indo-European', 'Indo-Aryan', 'Russian', '147', 'Indo-European', 'Balto-Slavic', 'Japanese', '123', 'Japonic', 'Japanese', 'Yue Chinese(incl. Cantonese)', '86.1', 'Sino-Tibetan', 'Sinitic', 'Vietnamese', '85.0', 'Austroasiatic', 'Vietic', 'Turkish', '84.0', 'Turkic', 'Oghuz', 'Wu Chinese(incl. Shanghainese)', '83.4', 'Sino-Tibetan', 'Sinitic', 'Marathi', '83.2', 'Indo-European', 'Indo-Aryan', 'Telugu', '83.0', 'Dravidian', 'South-Central', 'Korean', '81.7', 'Koreanic', '—', 'French', '80.8', 'Indo-European', 'Romance', 'Tamil', '78.6', 'Dravidian', 'South', 'Egyptian Spoken Arabic(excl. Saʽidi Arabic)', '77.4', 'Afroasiatic', 'Semitic', 'Standar

In [25]:
len(clean_info2)

155

In [47]:
print(clean_info2[0:108])

['Mandarin Chinese(incl. Standard Chinese, but excl. other varieties)', '939', 'Sino-Tibetan', 'Sinitic', 'Spanish', '485', 'Indo-European', 'Romance', 'English', '380', 'Indo-European', 'Germanic', 'Hindi(excl. Urdu, and other languages)', '345', 'Indo-European', 'Indo-Aryan', 'Portuguese', '236', 'Indo-European', 'Romance', 'Bengali', '234', 'Indo-European', 'Indo-Aryan', 'Russian', '147', 'Indo-European', 'Balto-Slavic', 'Japanese', '123', 'Japonic', 'Japanese', 'Yue Chinese(incl. Cantonese)', '86.1', 'Sino-Tibetan', 'Sinitic', 'Vietnamese', '85.0', 'Austroasiatic', 'Vietic', 'Turkish', '84.0', 'Turkic', 'Oghuz', 'Wu Chinese(incl. Shanghainese)', '83.4', 'Sino-Tibetan', 'Sinitic', 'Marathi', '83.2', 'Indo-European', 'Indo-Aryan', 'Telugu', '83.0', 'Dravidian', 'South-Central', 'Korean', '81.7', 'Koreanic', '—', 'French', '80.8', 'Indo-European', 'Romance', 'Tamil', '78.6', 'Dravidian', 'South', 'Egyptian Spoken Arabic(excl. Saʽidi Arabic)', '77.4', 'Afroasiatic', 'Semitic', 'Standar

In [50]:
groups = [clean_info2[i:i+4] for i in range(0, 108, 4)]
groups

[['Mandarin Chinese(incl. Standard Chinese, but excl. other varieties)',
  '939',
  'Sino-Tibetan',
  'Sinitic'],
 ['Spanish', '485', 'Indo-European', 'Romance'],
 ['English', '380', 'Indo-European', 'Germanic'],
 ['Hindi(excl. Urdu, and other languages)',
  '345',
  'Indo-European',
  'Indo-Aryan'],
 ['Portuguese', '236', 'Indo-European', 'Romance'],
 ['Bengali', '234', 'Indo-European', 'Indo-Aryan'],
 ['Russian', '147', 'Indo-European', 'Balto-Slavic'],
 ['Japanese', '123', 'Japonic', 'Japanese'],
 ['Yue Chinese(incl. Cantonese)', '86.1', 'Sino-Tibetan', 'Sinitic'],
 ['Vietnamese', '85.0', 'Austroasiatic', 'Vietic'],
 ['Turkish', '84.0', 'Turkic', 'Oghuz'],
 ['Wu Chinese(incl. Shanghainese)', '83.4', 'Sino-Tibetan', 'Sinitic'],
 ['Marathi', '83.2', 'Indo-European', 'Indo-Aryan'],
 ['Telugu', '83.0', 'Dravidian', 'South-Central'],
 ['Korean', '81.7', 'Koreanic', '—'],
 ['French', '80.8', 'Indo-European', 'Romance'],
 ['Tamil', '78.6', 'Dravidian', 'South'],
 ['Egyptian Spoken Arabic(e
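
The comprehension above walks through the first 108 cells in steps of 4, which yields 108 / 4 = 27 rows of four columns each; the remaining 47 of the 155 cells are left out by the slice. The same idea can be sketched as a small helper that also checks that the cells divide evenly into rows. `chunk_rows` is a hypothetical name, and the sample cells are the first twelve values from the output above.

```python
def chunk_rows(cells, width):
    """Split a flat list of cells into consecutive rows of `width` items."""
    if len(cells) % width != 0:
        raise ValueError(
            f'{len(cells)} cells do not divide evenly into rows of {width}'
        )
    return [cells[start:start + width] for start in range(0, len(cells), width)]


# First twelve cells of clean_info2, copied from the output above.
cells = [
    'Mandarin Chinese(incl. Standard Chinese, but excl. other varieties)',
    '939', 'Sino-Tibetan', 'Sinitic',
    'Spanish', '485', 'Indo-European', 'Romance',
    'English', '380', 'Indo-European', 'Germanic',
]

rows = chunk_rows(cells, 4)
for row in rows:
    print(row)
```

In the notebook, the equivalent call would be `chunk_rows(clean_info2[0:108], 4)`.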

In [51]:
languages_df = pd.DataFrame(groups, columns=['language', 'native_speakers(millions)', 'language_family', 'branch'])
languages_df

|   | language | native_speakers(millions) | language_family | branch |
|---|----------|---------------------------|-----------------|--------|
| 0 | Mandarin Chinese(incl. Standard Chinese, but e... | 939.0 | Sino-Tibetan | Sinitic |
| 1 | Spanish | 485.0 | Indo-European | Romance |
| 2 | English | 380.0 | Indo-European | Germanic |
| 3 | Hindi(excl. Urdu, and other languages) | 345.0 | Indo-European | Indo-Aryan |
| 4 | Portuguese | 236.0 | Indo-European | Romance |
| 5 | Bengali | 234.0 | Indo-European | Indo-Aryan |
| 6 | Russian | 147.0 | Indo-European | Balto-Slavic |
| 7 | Japanese | 123.0 | Japonic | Japanese |
| 8 | Yue Chinese(incl. Cantonese) | 86.1 | Sino-Tibetan | Sinitic |
| 9 | Vietnamese | 85.0 | Austroasiatic | Vietic |
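
Some language cells fuse the name with a parenthetical note (for example `Yue Chinese(incl. Cantonese)`), and the speaker counts were scraped as strings. A possible clean-up step, sketched here on three rows from `groups`, splits the note from the name and converts the count to `float`. `split_language` and the `note` and `native_speakers_millions` keys are hypothetical and not part of the original notebook.

```python
def split_language(cell):
    """Split 'Name(note)' into ('Name', 'note'); the note is None if absent."""
    name, _, note = cell.partition('(')
    note = note.rstrip(')').strip()
    return name.strip(), (note or None)


# Three rows copied from the groups output above.
rows = [
    ['Spanish', '485', 'Indo-European', 'Romance'],
    ['Hindi(excl. Urdu, and other languages)', '345', 'Indo-European', 'Indo-Aryan'],
    ['Yue Chinese(incl. Cantonese)', '86.1', 'Sino-Tibetan', 'Sinitic'],
]

records = []
for language_cell, speakers, family, branch in rows:
    name, note = split_language(language_cell)
    records.append({
        'language': name,
        'note': note,
        'native_speakers_millions': float(speakers),  # string to number
        'language_family': family,
        'branch': branch,
    })

for record in records:
    print(record)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame(records)` in the notebook.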
