### Instructions Part 2
Practice web scraping. This exercise is not related to this week's GNOD project.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

- Retrieve the Wikipedia page for "Python" and create a list of the links on that page.

In [2]:
url = "https://en.wikipedia.org/wiki/Python"

In [3]:
response = requests.get(url)
response.status_code

200

In [4]:
soup = BeautifulSoup(response.content, "html.parser")
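The request-and-parse steps above recur for every page in this exercise. As a minimal sketch, they could be wrapped in a helper. The name `fetch_soup` and its `timeout` default are assumptions made for illustration and are not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper, not in the notebook)."""
    # A timeout keeps the call from waiting indefinitely on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on a 4xx/5xx status instead of checking status_code by eye.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

With such a helper, each section would start with something like `soup = fetch_soup(url)`.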

In [5]:
# soup

We look for the CSS selector path that matches the links.

In [6]:
# soup.select("#mw-content-text > div.mw-parser-output > ul > li > a")

We collect the relative wiki paths (the `href` attribute of each link).

In [7]:
wikis = []
for w in soup.select("#mw-content-text > div.mw-parser-output > ul > li > a"):
    wiki = w.get("href")
    # Skip anchors that have no href attribute
    if wiki is not None:
        wikis.append(wiki)

In [8]:
len(wikis)

23

We prepend the Wikipedia domain to each relative path to obtain the full URL.

In [9]:
links = []
for wiki in wikis:
    url = "https://en.wikipedia.org" + wiki
    links.append(url)

links

['https://en.wikipedia.org/wiki/Pythonidae',
 'https://en.wikipedia.org/wiki/Python_(mythology)',
 'https://en.wikipedia.org/wiki/Python_(programming_language)',
 'https://en.wikipedia.org/wiki/CMU_Common_Lisp',
 'https://en.wikipedia.org/wiki/PERQ#PERQ_3',
 'https://en.wikipedia.org/wiki/Python_of_Aenus',
 'https://en.wikipedia.org/wiki/Python_(painter)',
 'https://en.wikipedia.org/wiki/Python_of_Byzantium',
 'https://en.wikipedia.org/wiki/Python_of_Catana',
 'https://en.wikipedia.org/wiki/Python_Anghelo',
 'https://en.wikipedia.org/wiki/Python_(Efteling)',
 'https://en.wikipedia.org/wiki/Python_(Busch_Gardens_Tampa_Bay)',
 'https://en.wikipedia.org/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)',
 'https://en.wikipedia.org/wiki/Python_(automobile_maker)',
 'https://en.wikipedia.org/wiki/Python_(Ford_prototype)',
 'https://en.wikipedia.org/wiki/Python_(missile)',
 'https://en.wikipedia.org/wiki/Python_(nuclear_primary)',
 'https://en.wikipedia.org/wiki/Colt_Python',
 'https://en.wikiped
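Plain concatenation works here because every `href` is a site-relative path. As a hedged alternative, `urllib.parse.urljoin` from the standard library also handles links that are already absolute or protocol-relative. The last two entries in `hrefs` below are hypothetical examples and were not taken from the page.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"

# Example hrefs of the kind stored in `wikis`, plus two cases that
# plain string concatenation would get wrong.
hrefs = [
    "/wiki/Pythonidae",
    "/wiki/PERQ#PERQ_3",
    "https://example.org/page",        # already absolute (hypothetical)
    "//commons.wikimedia.org/wiki/X",  # protocol-relative (hypothetical)
]

# urljoin resolves each href against the base URL.
full_links = [urljoin(BASE_URL, href) for href in hrefs]
```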

We collect the titles (the `title` attribute of each link).

In [10]:
titles = []
for t in soup.select("#mw-content-text > div.mw-parser-output > ul > li > a"):
    title = t.get("title")
    # Skip anchors that have no title attribute
    if title is not None:
        titles.append(title)

In [11]:
len(titles)

23
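Both lists have 23 entries here, so they line up. They are filtered independently, though, so a link with an `href` but no `title` would shift one list against the other. A minimal sketch of collecting both attributes in one pass follows. The HTML is a small stand-in for the real page, and `sample_soup` and `rows` are names introduced only for this example.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the Wikipedia page; the second link has no title attribute.
html = """
<div id="mw-content-text"><div class="mw-parser-output"><ul>
  <li><a href="/wiki/Pythonidae" title="Pythonidae">Pythonidae</a></li>
  <li><a href="/wiki/No_title">No title here</a></li>
  <li><a href="/wiki/Colt_Python" title="Colt Python">Colt Python</a></li>
</ul></div></div>
"""
sample_soup = BeautifulSoup(html, "html.parser")

rows = []
for a in sample_soup.select("#mw-content-text > div.mw-parser-output > ul > li > a"):
    href, title = a.get("href"), a.get("title")
    # Keep a link only when both attributes exist, so each title stays paired with its own link.
    if href is not None and title is not None:
        rows.append({"title": title, "link": "https://en.wikipedia.org" + href})
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame` to get the `title` and `link` columns.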

We combine the titles and links into a DataFrame.

In [12]:
python_links = pd.DataFrame({"title": titles,
                             "link": links})

In [13]:
python_links

| | title | link |
|---|---|---|
| 0 | Pythonidae | https://en.wikipedia.org/wiki/Pythonidae |
| 1 | Python (mythology) | https://en.wikipedia.org/wiki/Python_(mythology) |
| 2 | Python (programming language) | https://en.wikipedia.org/wiki/Python_(programm... |
| 3 | CMU Common Lisp | https://en.wikipedia.org/wiki/CMU_Common_Lisp |
| 4 | PERQ | https://en.wikipedia.org/wiki/PERQ#PERQ_3 |
| 5 | Python of Aenus | https://en.wikipedia.org/wiki/Python_of_Aenus |
| 6 | Python (painter) | https://en.wikipedia.org/wiki/Python_(painter) |
| 7 | Python of Byzantium | https://en.wikipedia.org/wiki/Python_of_Byzantium |
| 8 | Python of Catana | https://en.wikipedia.org/wiki/Python_of_Catana |
| 9 | Python Anghelo | https://en.wikipedia.org/wiki/Python_Anghelo |


- Create a list of the different kinds of datasets available on data.gov.uk.

In [14]:
url = "https://www.data.gov.uk/"

In [15]:
response = requests.get(url)
response.status_code

200

In [16]:
soup = BeautifulSoup(response.content, "html.parser")

In [17]:
# soup

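We look for the CSS selector path that matches the dataset topics: first a specific path for one item, then a shorter selector that matches all of them.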

In [18]:
soup.select("#main-content > div:nth-child(3) > div > ul > li:nth-child(1) > h3 > a")

[<a class="govuk-link" href="/search?filters%5Btopic%5D=Business+and+economy">Business and economy</a>]

In [19]:
soup.select("h3 > a")

[<a class="govuk-link" href="/search?filters%5Btopic%5D=Business+and+economy">Business and economy</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Crime+and+justice">Crime and justice</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Defence">Defence</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Education">Education</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Environment">Environment</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Government">Government</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Government+spending">Government spending</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Health">Health</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Mapping">Mapping</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Society">Society</a>,
 <a class="govuk-link" href="/search?filters%5Btopic%5D=Towns+and+cities">Towns and cities</a>,
 <a class="govuk-link" href="/search?f

We extract the topic names and store them in a DataFrame.

In [20]:
datasets = []
for e in soup.select("h3 > a"):
    dataset = e.get_text()
    # get_text() always returns a string, so skip empty names rather than None
    if dataset:
        datasets.append(dataset)
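The same extraction can be sketched as a list comprehension. The HTML below is a shortened stand-in for the data.gov.uk markup, with extra whitespace and an empty link added on purpose. `sample_soup` and `names` are names introduced only for this example.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the topic list; the last link is deliberately empty.
html = """
<ul>
  <li><h3><a class="govuk-link" href="/search?filters%5Btopic%5D=Defence"> Defence </a></h3></li>
  <li><h3><a class="govuk-link" href="/search?filters%5Btopic%5D=Health">Health</a></h3></li>
  <li><h3><a class="govuk-link" href="#"></a></h3></li>
</ul>
"""
sample_soup = BeautifulSoup(html, "html.parser")

# strip=True trims surrounding whitespace from each link text.
names = [a.get_text(strip=True) for a in sample_soup.select("h3 > a")]
# Drop links whose text is empty after stripping.
names = [name for name in names if name]
```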

In [21]:
dataset_types = pd.DataFrame({"type_of_dataset":datasets})

In [22]:
dataset_types

| | type_of_dataset |
|---|---|
| 0 | Business and economy |
| 1 | Crime and justice |
| 2 | Defence |
| 3 | Education |
| 4 | Environment |
| 5 | Government |
| 6 | Government spending |
| 7 | Health |
| 8 | Mapping |
| 9 | Society |


- Display the top 10 languages by number of native speakers, stored in a pandas DataFrame.

In [23]:
url = "https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers"

In [24]:
response = requests.get(url)
response.status_code

200

In [25]:
soup = BeautifulSoup(response.content, "html.parser")

In [26]:
# soup

We search within the first table on the page, since it contains the information we want.

In [27]:
table = soup.select("table")[0]

# table.select("tbody tr td a")

In [28]:
souplist = table.select('a.mw-redirect')
# souplist

With the path we found, we collect each language name, keeping only the first 10 elements.

In [29]:
languages = []
# Slice the matches so the loop stops after the first 10 links
for e in table.select('a.mw-redirect')[:10]:
    languages.append(e.get_text())
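Selecting on the `mw-redirect` class works for this page, but it would skip any language whose link does not carry that class. As a hedged alternative, the rows of the table can be walked and the first link in each row taken. The table below is an invented stand-in with placeholder names, and `TOP_N`, `sample_table`, `names` and `top_names` are introduced only for this sketch.

```python
from bs4 import BeautifulSoup

# Invented stand-in table; the second row's link has no mw-redirect class.
html = """
<table>
  <tr><th>Language</th><th>Speakers</th></tr>
  <tr><td><a href="/wiki/A" class="mw-redirect">Alpha</a></td><td>3</td></tr>
  <tr><td><a href="/wiki/B">Beta</a></td><td>2</td></tr>
  <tr><td><a href="/wiki/C" class="mw-redirect">Gamma</a></td><td>1</td></tr>
</table>
"""
sample_table = BeautifulSoup(html, "html.parser").select("table")[0]

TOP_N = 2  # the notebook keeps 10

names = []
for row in sample_table.select("tr"):
    # select_one returns None for the header row, which has only <th> cells.
    first_link = row.select_one("td a")
    if first_link is not None:
        names.append(first_link.get_text(strip=True))

# Keep only the first TOP_N names, in table order.
top_names = names[:TOP_N]
```

This assumes the language name is the first link in each data row, which should be checked against the real table before relying on it.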

We store them in a DataFrame.

In [30]:
top10_languages = pd.DataFrame({"top10_languages":languages})

In [31]:
top10_languages

| | top10_languages |
|---|---|
| 0 | Mandarin Chinese |
| 1 | Spanish |
| 2 | English |
| 3 | Hindi |
| 4 | Portuguese |
| 5 | Bengali |
| 6 | Russian |
| 7 | Japanese |
| 8 | Yue Chinese |
| 9 | Vietnamese |
