# Web scraping

This notebook is an example of web scraping publicly available [*Wikipedia*](https://en.wikipedia.org/) and [*Inyourpocket*](https://www.inyourpocket.com) pages.

Table of Contents:
1. [Loading required libraries](#1)<br>
    1.1 [Installing external packages](#1)<br>
    1.2 [Importing the packages that will be used](#1.2)<br>
2. [Downloading data from *Inyourpocket*](#2)<br>
3. [Downloading data from *Wikipedia*](#3)<br>

I install packages with '!*[pip](https://pypi.org/project/pip/)*' because I'm working in a *Jupyter Notebook* in the *[Anaconda](https://www.anaconda.com/)* environment, where the `!` prefix runs a shell command from a notebook cell.

<a name='1'></a>
## 1. Loading required libraries
### 1.1 Installing external packages

In [1]:
# !pip install pandas
# !pip install tqdm

Versions of the libraries used:

|library name|version|
|:-:|-:|
|[pandas](https://pandas.pydata.org/)|2.0.2|
|[tqdm](https://github.com/tqdm/tqdm)|4.65.0|
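
To compare a local environment against the table above, the installed versions can be read programmatically. This is a minimal sketch using only the standard library; the helper name `installed_versions` is my own and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # the package is not installed in the current environment
            found[name] = None
    return found


print(installed_versions(["pandas", "tqdm"]))
```

A `None` value means the package still has to be installed before running the cells below.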

<a name='1.2'></a>
### 1.2 Importing the packages that will be used

In [2]:
# web scraping
import urllib.request # handling URL requests
from bs4 import BeautifulSoup as bs # parsing HTML
import json # handling json format
from collections import ChainMap # conversion list(dict()) -> dict()
from time import sleep # forcing a time delay

# local data handling 
import re
import pandas as pd

# for loops progress tracking
from tqdm.autonotebook import tqdm


<a name='2'></a>
## 2. Downloading data from *Inyourpocket*

In [3]:
def get_url_for_targets(
    root:str="https://www.inyourpocket.com",
    city:str="warsaw",
    category:str="restaurants",
    no_page:int=1
):
    """
    get_url_for_targets function collects the URLs of the target subpages listed on one page of the 'Inyourpocket' site
    
    :param root: str; the beginning of the URL; do not change it when scraping the Inyourpocket site
    :param city: str; city name, all lowercase, e.g. {'sopot', 'gdansk', 'gdynia', 'warsaw'}
    :param category: str; category name, all lowercase, e.g. {'restaurants', 'sightseeing', 'museum'}
    :param no_page: int; the number of the page from which we want to scrape the URLs
    :return: list of str that are URL links
    """
    # merging function input data into one URL address
    url = ''.join([root,'/',city,'/',category,'?p=',str(no_page)])
    
    # loading requested page
    sauce = urllib.request.urlopen(url).read()
    soup = bs(sauce,'lxml')
    
    # finding the relative URL (suffix) of every link that leads to a target subpage
    output =  [destination['href'] for destination in soup.find_all("a", {"class": "read_more"})]
    
    # prepending the root to each suffix to create usable URL links
    output = [root+suffix for suffix in output]
    
    return output

Test of the function '__get_url_for_targets__':

In [4]:
target_urls = get_url_for_targets()
target_urls

['https://www.inyourpocket.com/warsaw/aioli-cantine_111468v',
 'https://www.inyourpocket.com/warsaw/Amrit-Kebab_133462v',
 'https://www.inyourpocket.com/warsaw/argo-kuchnia-gruzinska_70057v',
 'https://www.inyourpocket.com/warsaw/aruba-sophisticated-ribs-bar_132463v',
 'https://www.inyourpocket.com/warsaw/azia-concept_152055v',
 'https://www.inyourpocket.com/warsaw/baila-show-dining_169713v',
 'https://www.inyourpocket.com/warsaw/banjaluka_107470v',
 'https://www.inyourpocket.com/warsaw/bar-bambino_152791v',
 'https://www.inyourpocket.com/warsaw/bar-mleczny-familijny_36804v',
 'https://www.inyourpocket.com/warsaw/barn-burger_111573v']
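
The two string operations inside `get_url_for_targets` (assembling the listing URL and prepending the root to each relative link) can be tried without any network access. The sketch below uses `urllib.parse` instead of plain concatenation; the helper names `build_listing_url` and `absolutize` are hypothetical and only illustrate the idea.

```python
from urllib.parse import urlencode, urljoin

ROOT = "https://www.inyourpocket.com"


def build_listing_url(city, category, no_page, root=ROOT):
    """Assemble the listing URL: <root>/<city>/<category>?p=<no_page>."""
    return f"{root}/{city}/{category}?{urlencode({'p': no_page})}"


def absolutize(hrefs, root=ROOT):
    """Turn relative hrefs into absolute URLs; absolute hrefs pass through unchanged."""
    return [urljoin(root + "/", href) for href in hrefs]


listing_url = build_listing_url("warsaw", "restaurants", 1)
links = absolutize(["/warsaw/banjaluka_107470v"])
print(listing_url)
print(links)
```

Compared with `root+suffix`, `urljoin` also copes with links that are already absolute, which may matter if the site ever mixes both kinds.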

In [5]:
def get_json(soup:"bs4.BeautifulSoup"):
    """
    get_json function converts the JSON-LD script embedded in the 'soup' into a Python dictionary
    
    :param soup: BeautifulSoup object
    
    :return script: a dictionary with the same hierarchy as the JSON script
    """
    script = soup.find("script", {"type" : "application/ld+json"})
    script = script.contents
    script = ''.join(script)
    script = json.loads(script)
    
    return script

Test of the function '__get_json__':

In [6]:
url = target_urls[2]
print(url)

sauce = urllib.request.urlopen(url).read()
soup = bs(sauce,'lxml')

get_json(soup)

https://www.inyourpocket.com/warsaw/argo-kuchnia-gruzinska_70057v


{'@context': 'http://schema.org',
 '@type': 'Restaurant',
 'name': 'ARGO - Kuchnia Gruzińska',
 'url': 'https://www.inyourpocket.com/warsaw/argo-kuchnia-gruzinska_70057v',
 'openingHours': 'Tu,We,Th,Fr 16:00-22:00, Sa 13:00-22:00, Su 13:00-22:00',
 'servesCuisine': 'Georgian',
 'address': {'@type': 'PostalAddress',
  'streetAddress': 'ul. Piwna 46',
  'addressLocality': 'Warsaw',
  'addressCountry': 'PL'},
 'geo': {'@type': 'GeoCoordinates',
  'latitude': '52.249595566292',
  'longitude': '21.011036038399'},
 'telephone': '(+48) 22 635 06 03',
 'amenityFeature': [{'@type': 'LocationFeatureSpecification',
   'value': 'True',
   'name': 'Outside seating'}],
 'image': ['https://s.inyourpocket.com/gallery/295173.jpg'],
 'description': 'Where once was the smallest curry house in all Poland now stands a Georgian chop house which serves brilliant food at cracking prices. The lamb in plum sauce is top notch, and the chinkali (Georgian dumplings) will give any pierogi in town a run for their mo
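
For readers who want to try the extraction step offline, here is a minimal sketch of the same idea: find the `<script type="application/ld+json">` element and parse its text as JSON. It swaps in `html.parser` from the standard library for BeautifulSoup so that it runs without extra packages; the class `JsonLdExtractor`, the function `extract_json_ld`, and the sample HTML are all assumptions made for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        # script text may arrive in several chunks, so it is accumulated
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def extract_json_ld(html):
    """Return the parsed content of all JSON-LD scripts found in the HTML text."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [json.loads(block) for block in parser.blocks]


# hypothetical page fragment, not copied from the Inyourpocket site
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Restaurant", "name": "Example Bistro",
 "geo": {"@type": "GeoCoordinates", "latitude": "52.25", "longitude": "21.01"}}
</script>
</head><body></body></html>
"""

records = extract_json_ld(SAMPLE_HTML)
print(records[0]["name"])
```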

In [7]:
def flatten_nested_json(input):
    """
    flatten_nested_json function converts the nested dictionary to a dictionary of depth 1 (structured)
    
    :param input: the nested dictionary
    
    :return output: a dictionary of depth 1 (not nested)
    """
    output = list()
    
    nested_dict = list()
    
    for key, value in input.items():
        if isinstance(value, dict):
            nested_dict.append(value)
            
        elif isinstance(value, list):
            # keeping only dictionaries; plain values (e.g. image URLs) are skipped
            nested_dict.extend(v for v in value if isinstance(v, dict))
        else:
            output.append({key:value})
    
    for d in nested_dict:
        
        if "latitude" in d:
            output.extend([{"latitude": float(d["latitude"])}, {"longitude": float(d["longitude"])}])
            
        elif "streetAddress" in d:
            output.extend([{i:d[i]} for i in d])
        elif "name" in d:
            # the value is a string ('True'/'False'); bool() of any non-empty string is True
            output.append({d["name"]: str(d["value"]).lower() == "true"})
            
    
    # merging the list of one-item dictionaries into a single dictionary
    output =  dict(ChainMap(*output))
    
    return output
    
def get_data(soup):
    """
    get_data function converts from JSON script in BeautifulSoup object to a structured Python dictionary
    
    :param soup: BeautifulSoup object
    
    :return data: a dictionary of depth 1 (structured)
    """
    
    data = get_json(soup)
    
    data = flatten_nested_json(data)
    
    return data

Test of the functions '__flatten_nested_json__' and '__get_data__':

In [8]:
get_data(soup)

{'Outside seating': True,
 'longitude': 21.011036038399,
 'latitude': 52.249595566292,
 'addressCountry': 'PL',
 'addressLocality': 'Warsaw',
 'streetAddress': 'ul. Piwna 46',
 '@type': 'Restaurant',
 'description': 'Where once was the smallest curry house in all Poland now stands a Georgian chop house which serves brilliant food at cracking prices. The lamb in plum sauce is top notch, and the chinkali (Georgian dumplings) will give any pierogi in town a run for their money. Only a handful of tables, so reserve ',
 'telephone': '(+48) 22 635 06 03',
 'servesCuisine': 'Georgian',
 'openingHours': 'Tu,We,Th,Fr 16:00-22:00, Sa 13:00-22:00, Su 13:00-22:00',
 'url': 'https://www.inyourpocket.com/warsaw/argo-kuchnia-gruzinska_70057v',
 'name': 'ARGO - Kuchnia Gruzińska',
 '@context': 'http://schema.org'}
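
One detail worth keeping in mind when converting amenity flags: in the JSON they arrive as the strings `'True'` or `'False'`, and Python's `bool()` returns `True` for any non-empty string. A minimal sketch of an explicit conversion follows; the helper name `parse_flag` is hypothetical.

```python
def parse_flag(value):
    """Interpret 'True'/'False' strings (in any letter case) as booleans."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"


# bool() only checks whether the string is empty, not what it says
print(bool("False"), parse_flag("False"))
```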

In [9]:
def scrap_inyourpocket(
    city:str="warsaw",
    category:str="restaurants",
    pages:list=None,
    sleep_time=6,
    log:bool=True
):
    """
    scrap_inyourpocket fetches parameters from each subpage of the given URL and returns them in a pandas DataFrame
    
    :param city: str; city name, all lowercase, e.g. {'sopot', 'gdansk', 'gdynia', 'warsaw'}
    :param category: str; category name, all lowercase, e.g. {'restaurants', 'sightseeing', 'museum'}
    :param pages: iterable of int; the numbers of the pages from which we want to scrape the URLs
    :param sleep_time: non-negative number; the number of seconds to wait after each URL connection (successful or unsuccessful)
    :param log: bool; if True, the error log is returned at the end of the scrap_inyourpocket function
    
    :return: pandas.DataFrame and, if log is True, list(); pandas.DataFrame with found data and list of errors that occurred at connections to each URL
    """
    if pages is None:
        raise ValueError("no pages have been given")
    
    # 'errors' collects the error log; the name 'log' stays reserved for the flag parameter
    output_data, errors = list(), []
    
    for no_page in pages:
        try:
            target_urls = get_url_for_targets(city=city,category=category,no_page=no_page)
        except Exception as e:
            sleep(sleep_time)
            errors.append(e)
            continue

        tqdm_desc_message = "city:{}, page:{}".format(city, no_page)
        for url in tqdm(target_urls, desc=tqdm_desc_message):
            try:
                # the request is inside 'try' so that a failed connection is logged instead of stopping the scrape
                sauce = urllib.request.urlopen(url).read()
                soup = bs(sauce,'lxml')
                output_data.append(
                    get_data(soup)
                )
            except Exception as e:
                errors.append("error occurred for URL :{}\n{}\n\n\n".format(url, e))
            # waiting after each connection, successful or not
            sleep(sleep_time)
            
                
    if log:
        return pd.DataFrame(output_data), errors
    return pd.DataFrame(output_data)

I'm using the '__scrap_inyourpocket__' function in order to collect data on restaurants and sightseeing from the Tri-City (a combination of cities in Poland: Sopot, Gdańsk, and Gdynia).

In [10]:
tricity = ('sopot', 'gdansk', 'gdynia')
pages = (5, 15, 5)

# collecting data on restaurants in Tri-City
for city,r_pages in zip(tricity, pages):
    df, log = scrap_inyourpocket(city=city, pages=range(r_pages))
    
    # saving a pandas DataFrame to a file in CSV format
    df.to_csv('inyourpocket_{}.csv'.format(city), index=False)
    
    # printing occurred errors
    print(*log)

100%|██████████| 18/18 [01:59<00:00,  6.65s/it]
100%|██████████| 10/10 [01:06<00:00,  6.69s/it]
100%|██████████| 10/10 [01:06<00:00,  6.63s/it]
100%|██████████| 10/10 [01:05<00:00,  6.54s/it]
100%|██████████| 10/10 [01:07<00:00,  6.71s/it]





100%|██████████| 13/13 [01:23<00:00,  6.44s/it]
100%|██████████| 10/10 [01:03<00:00,  6.37s/it]
100%|██████████| 10/10 [01:04<00:00,  6.42s/it]
100%|██████████| 10/10 [01:04<00:00,  6.47s/it]
100%|██████████| 10/10 [01:05<00:00,  6.52s/it]
100%|██████████| 10/10 [01:05<00:00,  6.58s/it]
100%|██████████| 10/10 [01:07<00:00,  6.70s/it]
100%|██████████| 10/10 [01:07<00:00,  6.71s/it]
100%|██████████| 10/10 [01:05<00:00,  6.58s/it]
100%|██████████| 10/10 [01:06<00:00,  6.62s/it]
100%|██████████| 10/10 [01:06<00:00,  6.64s/it]
100%|██████████| 10/10 [01:06<00:00,  6.64s/it]
100%|██████████| 10/10 [01:06<00:00,  6.65s/it]
100%|██████████| 10/10 [01:05<00:00,  6.55s/it]
100%|██████████| 2/2 [00:13<00:00,  6.52s/it]





100%|██████████| 9/9 [00:54<00:00,  6.02s/it]
100%|██████████| 10/10 [01:04<00:00,  6.49s/it]
100%|██████████| 10/10 [00:59<00:00,  5.92s/it]
100%|██████████| 10/10 [01:06<00:00,  6.61s/it]
100%|██████████| 10/10 [01:05<00:00,  6.51s/it]

error ocured for url :https://www.inyourpocket.com/gdynia/luis-mexicantina_165958v
Invalid control character at: line 51 column 61 (char 1518)


 error ocured for url :https://www.inyourpocket.com/gdynia/falla_152773v
Invalid control character at: line 53 column 17 (char 1546)
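
The `Invalid control character` messages above are raised by `json.loads`, which by default rejects raw control characters (for example a literal newline or tab) inside JSON strings. One possible workaround, which this notebook does not use, is to pass `strict=False`. The sketch below demonstrates the difference on a made-up fragment; I have not verified that it recovers every failing Inyourpocket page.

```python
import json

# hypothetical JSON fragment with a raw newline inside the string value
raw = '{"description": "first line\nsecond line"}'

strict_error = None
try:
    json.loads(raw)  # default strict=True rejects the raw newline
except json.JSONDecodeError as e:
    strict_error = str(e)

# strict=False allows control characters inside strings
parsed = json.loads(raw, strict=False)

print(strict_error)
print(parsed)
```

If this approach were adopted, the change would be confined to the `json.loads` call inside `get_json`.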








In [11]:
tricity = ('sopot', 'gdansk', 'gdynia')
pages = (20, 19, 20)

# collecting data on sightseeing in Tri-City
for city,r_pages in zip(tricity, pages):
    df, log = scrap_inyourpocket(city=city, pages=range(r_pages), category='sightseeing')
    df.to_csv('inyourpocket_{}_sightseeing.csv'.format(city), index=False)
    print(*log)

city:sopot, page:0: 100%|██████████| 11/11 [01:10<00:00,  6.43s/it]
city:sopot, page:1: 100%|██████████| 10/10 [01:05<00:00,  6.58s/it]
city:sopot, page:2: 100%|██████████| 8/8 [00:50<00:00,  6.32s/it]
city:sopot, page:3: 0it [00:00, ?it/s]


error ocured for url :https://www.inyourpocket.com/sopot/sopot-fort_20140v
Invalid control character at: line 42 column 112 (char 1150)


 HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found


city:gdansk, page:0: 100%|██████████| 21/21 [02:16<00:00,  6.49s/it]
city:gdansk, page:1: 100%|██████████| 10/10 [01:06<00:00,  6.62s/it]
city:gdansk, page:2: 100%|██████████| 10/10 [01:05<00:00,  6.52s/it]
city:gdansk, page:3: 100%|██████████| 10/10 [01:04<00:00,  6.42s/it]
city:gdansk, page:4: 100%|██████████| 10/10 [01:06<00:00,  6.67s/it]
city:gdansk, page:5: 100%|██████████| 10/10 [01:03<00:00,  6.32s/it]
city:gdansk, page:6:   0%|          | 0/10 [00:00<?, ?it/s]
city:gdansk, page:7: 100%|██████████| 10/10 [01:04<00:00,  6.41s/it]
city:gdansk, page:8: 100%|██████████| 10/10 [01:06<00:00,  6.60s/it]
city:gdansk, page:9: 100%|██████████| 10/10 [01:06<00:00,  6.68s/it]
city:gdansk, page:10: 100%|██████████| 10/10 [01:04<00:00,  6.50s/it]
city:gdansk, page:11: 100%|██████████| 10/10 [01:04<00:00,  6.50s/it]
city:gdansk, page:12: 100%|██████████| 10/10 [01:05<00:00,  6.52s/it]
city:gdansk, page:13: 100%|██████████| 10/10 [01:06<00:00,  6.65s/it]
city:gdansk, page:14: 100%|██████████| 

error ocured for url :https://www.inyourpocket.com/gdansk/amber-museum_20840v
Invalid control character at: line 53 column 259 (char 1959)


 error ocured for url :https://www.inyourpocket.com/gdansk/european-solidarity-centre-ecs_33549v
Invalid control character at: line 51 column 175 (char 1842)


 error ocured for url :https://www.inyourpocket.com/gdansk/father-jankowski-statue_116324v
Invalid control character at: line 32 column 138 (char 724)


 error ocured for url :https://www.inyourpocket.com/gdansk/gdansk-photography-gallery_39959v
Invalid control character at: line 35 column 313 (char 1044)


 HTTP Error 404: Not Found error ocured for url :https://www.inyourpocket.com/gdansk/monument-to-the-fallen-shipyard-workers-of-1970_16205v
Invalid control character at: line 34 column 224 (char 975)


 error ocured for url :https://www.inyourpocket.com/gdansk/monument-to-the-shipyard-tragedy-of-1994_119395v
Invalid control character at: line 33 column 135 (char 816)


 error ocured for 

city:gdynia, page:0: 100%|██████████| 7/7 [00:45<00:00,  6.50s/it]
city:gdynia, page:1: 100%|██████████| 10/10 [01:05<00:00,  6.53s/it]
city:gdynia, page:2: 100%|██████████| 10/10 [01:07<00:00,  6.71s/it]
city:gdynia, page:3:  89%|████████▉ | 8/9 [00:53<00:06,  6.70s/it]
city:gdynia, page:4: 0it [00:00, ?it/s]


error ocured for url :https://www.inyourpocket.com/gdynia/emigration-museum_98969v
Invalid control character at: line 51 column 139 (char 1774)


 error ocured for url :https://www.inyourpocket.com/gdynia/antoni-abraham-monument_107933v
Invalid control character at: line 31 column 239 (char 770)


 error ocured for url :https://www.inyourpocket.com/gdynia/displaced-gdynians-monument_132578v
Invalid control character at: line 34 column 150 (char 870)


 HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found HTTP Error 404: Not Found
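
The long runs of `HTTP Error 404: Not Found` in the output above appear to come from page ranges that extend past the last existing listing page, with every missing page still costing one request and one delay. A minimal sketch of stopping at the first 404 is shown below. The function `scrape_until_missing` and the stand-in `fake_fetch` are hypothetical; in this notebook the `fetch_page` argument could be a small wrapper around `get_url_for_targets`.

```python
from urllib.error import HTTPError


def scrape_until_missing(fetch_page, first_page=1, max_pages=50):
    """Call fetch_page(no_page) for consecutive pages and stop at the first 404."""
    results = []
    for no_page in range(first_page, first_page + max_pages):
        try:
            results.extend(fetch_page(no_page))
        except HTTPError as e:
            if e.code == 404:
                break  # no such page, so later pages are assumed missing too
            raise  # any other HTTP error is passed on to the caller
    return results


def fake_fetch(no_page):
    """Stand-in for a real page request: pages 1-3 exist, later ones return 404."""
    if no_page > 3:
        raise HTTPError(f"https://example.com/?p={no_page}", 404, "Not Found", None, None)
    return [f"item-{no_page}-{i}" for i in range(2)]


print(scrape_until_missing(fake_fetch))
```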


<a name='3'></a>
## 3. Downloading data from *Wikipedia*
In this part, I download the coordinates of the *SKM* railway stations in the Tri-City.

Since I only need a small number of URLs from the Wikipedia page [*List of SKM stops*](https://en.wikipedia.org/wiki/List_of_SKM_stops) and the Polish subpages, I copied and pasted them below to avoid multi-level redirects:

In [12]:
stations_url =[
    "https://pl.wikipedia.org/wiki/Gda%C5%84sk_G%C5%82%C3%B3wny",
    "https://pl.wikipedia.org/wiki/Gda%C5%84sk_Stocznia",
    "https://pl.wikipedia.org/wiki/Gda%C5%84sk_Politechnika",
    "https://pl.wikipedia.org/wiki/Gda%C5%84sk_Wrzeszcz",
    "https://pl.wikipedia.org/wiki/Gda%C5%84sk_Zaspa",
    "https://pl.wikipedia.org/wiki/Gda%C5%84sk_Przymorze-Uniwersytet",
    "https://pl.wikipedia.org/wiki/Gda%C5%84sk_Oliwa",
    "https://pl.wikipedia.org/wiki/Gda%C5%84sk_%C5%BBabianka-AWFiS",
    "https://pl.wikipedia.org/wiki/Sopot_Wy%C5%9Bcigi",
    "https://pl.wikipedia.org/wiki/Sopot_(stacja_kolejowa)",
    "https://pl.wikipedia.org/wiki/Sopot_Kamienny_Potok",
    "https://pl.wikipedia.org/wiki/Gdynia_Or%C5%82owo",
    "https://pl.wikipedia.org/wiki/Gdynia_Red%C5%82owo",
    "https://pl.wikipedia.org/wiki/Gdynia_Wzg%C3%B3rze_%C5%9Aw._Maksymiliana",
    "https://pl.wikipedia.org/wiki/Gdynia_G%C5%82%C3%B3wna",
    "https://pl.wikipedia.org/wiki/Gdynia_Stocznia_%E2%80%93_Uniwersytet_Morski",
    "https://pl.wikipedia.org/wiki/Gdynia_Grab%C3%B3wek",
    "https://pl.wikipedia.org/wiki/Gdynia_Leszczynki",
    "https://pl.wikipedia.org/wiki/Gdynia_Chylonia",
    "https://pl.wikipedia.org/wiki/Gdynia_Cisowa"
]

Below, for each URL, I download the page title, latitude, and longitude:

In [13]:
stations = list()

for url in tqdm(stations_url):
    # loading requested page
    sauce = urllib.request.urlopen(url).read()
    soup = bs(sauce,'lxml')
    station = dict()
    station["name"] = soup.find("span", {"class": "mw-page-title-main"}).text
    station["latitude"] = soup.find("span", {"class": "latitude"}).text
    station["longitude"] = soup.find("span", {"class": "longitude"}).text
    stations.append(station)
    sleep(6)


100%|██████████| 20/20 [02:06<00:00,  6.32s/it]
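
Besides the delay between requests, a scraper can identify itself and bound how long it waits for a reply. The sketch below builds a request with an explicit `User-Agent` header and a timeout using only `urllib.request`; the helper names and the user-agent string are placeholders of my own, and the loop above does not do this.

```python
import urllib.request


def build_request(url, user_agent):
    """Create a request object that carries a descriptive User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=30):
    """Download the page body, giving up after `timeout` seconds."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as response:
        return response.read()


# building the request does not contact the server; only fetch() does
request = build_request(
    "https://pl.wikipedia.org/wiki/Gdynia_Cisowa",
    "tricity-notebook/0.1 (contact: you@example.com)",  # placeholder value
)
print(request.full_url)
```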


Below is a preview of the collected data:

In [14]:
df = pd.DataFrame(stations)
df.head()

| |name|latitude|longitude|
|-:|:-|:-|:-|
|0|Gdańsk Główny|54°21′26″N|18°38′40″E|
|1|Gdańsk Stocznia|54°21′55″N|18°38′27″E|
|2|Gdańsk Politechnika|54°22′27″N|18°37′38″E|
|3|Gdańsk Wrzeszcz|54°22′55″N|18°36′19″E|
|4|Gdańsk Zaspa|54°23′22″N|18°35′30″E|


Since I will need latitude and longitude in floating point format in the future, I need to convert the coordinates:

In [15]:
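def coord_to_float(x:str):
    """
    coord_to_float function converts a coordinate from degrees, minutes, and seconds to decimal degrees
    
    :param x: str; a coordinate such as '54°21′26″N'
    
    :return output: float; the coordinate in decimal degrees (the hemisphere letter is ignored)
    """
    # extracting degrees, minutes, and seconds
    output = re.findall(r'\d+',x)
    # degrees + minutes/60 + seconds/3600
    output = sum(float(output[i])/60**i for i in range(len(output)))
    
    return output

df["latitude"] = df["latitude"].map(coord_to_float)
df["longitude"] = df["longitude"].map(coord_to_float)
df.head()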

| |name|latitude|longitude|
|-:|:-|-:|-:|
|0|Gdańsk Główny|54.357222|18.644444|
|1|Gdańsk Stocznia|54.365278|18.640833|
|2|Gdańsk Politechnika|54.374167|18.627222|
|3|Gdańsk Wrzeszcz|54.381944|18.605278|
|4|Gdańsk Zaspa|54.389444|18.591667|
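
As a check of the arithmetic, the first row works out as 54 + 21/60 + 26/3600 ≈ 54.357222. The sketch below repeats the conversion as a standalone function and, as an extension that the Tri-City data does not need, negates southern and western coordinates. The name `dms_to_decimal` is hypothetical.

```python
import re


def dms_to_decimal(coord):
    """Convert a string such as '54°21′26″N' to signed decimal degrees."""
    # degrees, minutes, seconds (decimal fractions allowed)
    parts = [float(p) for p in re.findall(r"\d+(?:\.\d+)?", coord)]
    # degrees + minutes/60 + seconds/3600
    degrees = sum(value / 60 ** i for i, value in enumerate(parts))
    # southern and western hemispheres get a negative sign
    return -degrees if coord.strip()[-1] in "SW" else degrees


print(round(dms_to_decimal("54°21′26″N"), 6))
print(round(dms_to_decimal("18°38′40″E"), 6))
```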


Since the railway stations data looks fine, I'm saving it to a text file in CSV format:

In [16]:
df.to_csv('tricity_stations_coordinates.csv', index=False)