# About
* **Author**: Adil Rashitov
* **Created at**: 21.06.2021
* **Goal**: Prepare an extraction plan for web scraping [118 direct](http://www.118.direct)
* **Deliverables**:
    1. **Which** business categories to scrape?
    1. **For which** locations to perform the scraping?

In [1]:
# Imports / Configs / Global vars

# Import of native python tools
import os
import json
from functools import reduce

# Import of base ML stack libs
import numpy as np
import sklearn as sc

# Multiprocessing for Mac / Linux
import platform
platform.system()
if platform.system() == 'Darwin':
    from multiprocess import Pool
else:
    from multiprocessing import Pool

# Visualization libraries
import plotly.express as px

# Logging configuration
import logging
logging.basicConfig(format='[ %(asctime)s ][ %(levelname)s ]: %(message)s', datefmt='%m/%d/%Y %I:%M:%S %p')
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Ipython configs
from IPython.display import display, HTML
from IPython.core.interactiveshell import InteractiveShell
display(HTML("<style>.container { width:100% !important; }</style>"))
InteractiveShell.ast_node_interactivity = 'all'

# Pandas configs
import pandas as pd
import geopandas as gpd
pd.options.display.max_rows = 350
pd.options.display.max_columns = 250

# Jupyter configs
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi = False

# GLOBAL VARS

# Steps

1. Extract the business categories I will search for
2. Prepare the input locations for extraction

### 1. Extraction of business categories

Before extracting by location, I need to prepare the business categories I will use for the search.

link: http://www.118.direct/popularsearches/

In [2]:
# Generate URLs of the business category pages (one page per letter)
import string
from bs4 import BeautifulSoup
import requests


business_categs = pd.Series([f"http://www.118.direct/popularsearches/{letter}"
                             for letter in string.ascii_lowercase])
business_categs[:2]

0    http://www.118.direct/popularsearches/a
1    http://www.118.direct/popularsearches/b
dtype: object

In [3]:
# Parsing business categories
import time


def parse_page_of_categories(url):
    """Return a DataFrame of the category links found on one popular-searches page."""

    class_element = "popTermsList"

    # Fail fast on network or HTTP errors instead of parsing an error page
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Take the links from the first <ul class="popTermsList"> on the page
    categories = soup \
        .find_all("ul", class_=class_element)[0] \
        .find_all("a")

    df = pd.DataFrame({
        'url': url,
        'business_category': [x.get_text(strip=True) for x in categories],
        'href': [x.attrs['href'] for x in categories],
    })
    return df


categs_extracted = []
for _id, category in enumerate(business_categs):
    logger.info(f"[{_id}] Start processing category: {category}")
    categs_extracted.append(parse_page_of_categories(category))
    # Random pause between requests; clamp at 0 because a normal draw can be
    # negative and time.sleep rejects negative values
    time.sleep(max(0.0, np.random.normal(1, 0.2)))
    logger.info(f"[{_id}] Finish processing category: {category}\n")


categs_extracted = pd.concat(categs_extracted).reset_index(drop=True)

[ 06/21/2021 09:16:23 AM ][ INFO ]: [0] Start processing category: http://www.118.direct/popularsearches/a
[ 06/21/2021 09:16:24 AM ][ INFO ]: [0] Finish processing category: http://www.118.direct/popularsearches/a

[ 06/21/2021 09:16:24 AM ][ INFO ]: [1] Start processing category: http://www.118.direct/popularsearches/b
[ 06/21/2021 09:16:25 AM ][ INFO ]: [1] Finish processing category: http://www.118.direct/popularsearches/b

[ 06/21/2021 09:16:25 AM ][ INFO ]: [2] Start processing category: http://www.118.direct/popularsearches/c
[ 06/21/2021 09:16:26 AM ][ INFO ]: [2] Finish processing category: http://www.118.direct/popularsearches/c

[ 06/21/2021 09:16:26 AM ][ INFO ]: [3] Start processing category: http://www.118.direct/popularsearches/d
[ 06/21/2021 09:16:27 AM ][ INFO ]: [3] Finish processing category: http://www.118.direct/popularsearches/d

[ 06/21/2021 09:16:27 AM ][ INFO ]: [4] Start processing category: http://www.118.direct/popularsearches/e
[ 06/21/2021 09:16:28 AM ][ I
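
The pause between requests is drawn from a normal distribution with mean 1 s and standard deviation 0.2 s. A normal draw can in principle be negative, and `time.sleep` raises `ValueError` for negative values, so it may be worth wrapping the delay in a small helper with a floor. The sketch below is one possible way to do that using only the standard library; the name `polite_delay` and the `minimum` parameter are hypothetical and not part of the notebook.

```python
import random


def polite_delay(mean=1.0, stddev=0.2, minimum=0.1):
    """Return a jittered pause in seconds that is never below `minimum`.

    Hypothetical helper: the name and the default floor of 0.1 s are
    assumptions made for this sketch.
    """
    # random.gauss can return values below zero; clamping guarantees that
    # the result is always safe to pass to time.sleep.
    return max(minimum, random.gauss(mean, stddev))


# Draw a few delays to see the spread (no actual sleeping here).
delays = [polite_delay() for _ in range(5)]
print([round(d, 2) for d in delays])
```

Inside the scraping loop the helper would be used as `time.sleep(polite_delay())`.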

In [4]:
!mkdir -p /WORKDIR/data/sources/bussiness_categories_to_extract/
fname = '/WORKDIR/data/sources/bussiness_categories_to_extract/categories.csv.zip'
categs_extracted.to_csv(fname, index=False, compression='zip')

### 2. Locations to scrape

* Manchester

In [11]:
locations_to_extract = pd.DataFrame({
    'Locations': ['Manchester',
                  'Manchester Airport',
                  'Manchester Science Park']}).reset_index(drop=True)

!mkdir -p /WORKDIR/data/sources/locations_to_extract/
fname = '/WORKDIR/data/sources/locations_to_extract/locations.csv.zip'
locations_to_extract.to_csv(fname, index=False, compression='zip')
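
With both deliverables saved, one plausible way to turn them into an extraction plan is to pair every business category with every location. The sketch below illustrates that idea with the standard library only. The category names are made-up samples standing in for the `business_category` column of `categs_extracted`, and `build_extraction_plan` is a hypothetical helper, not something defined in this notebook.

```python
from itertools import product

# Hypothetical sample of category names; in the notebook these would come
# from the `business_category` column of `categs_extracted`.
categories = ["Accountants", "Bakers"]

# Locations listed in this section.
locations = ["Manchester", "Manchester Airport", "Manchester Science Park"]


def build_extraction_plan(categories, locations):
    """Return one search task per (category, location) combination."""
    return [
        {"business_category": category, "location": location}
        for category, location in product(categories, locations)
    ]


plan = build_extraction_plan(categories, locations)
print(len(plan))  # 2 categories x 3 locations = 6 tasks
```

The number of tasks grows as the product of both lists, so a full category list combined with many locations can become a long crawl; that is one reason to keep the location list short at first.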