# SWP Index Web Scraping

Jupyter Notebook for the creation of the SWP Index with Web Scraping.

SWP stands for Size of the Wikipedia Page: it is an estimate of the importance of a city based on the length of its Wikipedia page.

____________________________________________________________

This process requires the following CSV file: ```out_COCI_citations_and_locations_analysis_ready.csv```.

The CSV file must be generated by running the ```Citation Datasets Separation for the Analysis.ipynb``` notebook, which is contained in the ```8 - Citation Datasets Separation for the Analysis``` folder of this repository.

____________________________________________________________

In particular, the following operations are going to be executed:
* Opening of the CSV dataset
* Extraction of the distinct cities
* Web scraping of the Wikipedia page's size

Lastly, the entire processed dump is saved to disk in CSV format.

In [1]:
# Libraries Import
import pandas as pd
import numpy as np
import platform
import multiprocessing as mp
import concurrent.futures  # the futures submodule must be imported explicitly
from turistic_scraper_multithread_utils import *

pd.set_option('display.max_columns', None)

## File Paths
Please set your working directory paths.

In [2]:
# ******************* PATHS ********************+

# Dumps Directory Path
path_file_import = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Import/'

# CSV Exports Directory Path
path_file_export = r'/Users/marcoterzulli/File/Scuola Local/Magistrale/Materiale Corsi Attuali/Tirocinio/Cartella di Lavoro/Archivi Dump di Lavoro/Export/'

### Multithreading Settings
Settings needed for the parallel queries that gather the size of the Wikipedia pages.

Please specify the maximum number of workers below.

We suggest **setting the number of workers to 1 if you have a high-bandwidth connection** (over 100 Mbps), to limit the rate of requests to the website. Otherwise, you can try a higher value to make requests in parallel.

In [3]:
MAX_WORKERS = 4

You can also set a **sleep delay** (in seconds) between requests if that's not enough:

In [4]:
SLEEP_DELAY = 0.2 # Seconds

Special setting for specific operating systems.

**Note**: Due to security measures in recent macOS releases, we need to use the spawn method instead of fork there. Windows does not support fork at all, so spawn is used on it as well.

In [5]:
print(f"Notebook running on {platform.system()} OS: ")

if platform.system() == "Darwin" or platform.system() == "Windows": # MacOS and windows
    mp_ctx = mp.get_context("spawn")
    print("Spawn method has been set")
    
else: # other unix systems
    mp_ctx = mp.get_context("fork")
    print("Fork method has been set")

Notebook running on Darwin OS: 
Spawn method has been set


## Read of the CSV Dataset

In [6]:
df_coci_dataset = pd.read_csv(path_file_import + 'out_COCI_citations_and_locations_analysis_ready.csv', low_memory=False, index_col=[0])
df_coci_dataset

Unnamed: 0,CitationCount,ConferenceLocation,ConferenceNormalizedName,ConferenceSeriesNormalizedName,Doi,Year
0,10,"Austin, Texas, United States",disc 2014,disc,10.1007/978-3-662-45174-8_28,2014
1,5,"Wrocław, Lower Silesian Voivodeship, Poland",esa 2014,esa,10.1007/978-3-662-44777-2_60,2014
2,11,"Innsbruck, Tyrol, Austria",enter 2013,enter,10.1007/978-3-319-03973-2_13,2013
3,1,"Villefranche-sur-Saône, Auvergne-Rhône-Alpes, ...",dexa 2002,dexa,10.1007/3-540-46146-9_77,2002
4,9,"Zakopane, Lesser Poland Voivodeship, Poland",icaisc 2006,icaisc,10.1007/11785231_94,2006
...,...,...,...,...,...,...
3107878,4,"Thessaloniki, Macedonia and Thrace, Greece",sapere 2011,sapere,10.1007/978-3-642-31674-6_9,2011
3107879,4,"Thessaloniki, Macedonia and Thrace, Greece",sapere 2011,sapere,10.1007/978-3-642-31674-6_20,2011
3107880,2,"Thessaloniki, Macedonia and Thrace, Greece",sapere 2011,sapere,10.1007/978-3-642-31674-6_25,2011
3107881,0,"Thessaloniki, Macedonia and Thrace, Greece",sapere 2011,sapere,10.1007/978-3-642-31674-6_12,2011


## Obtaining the Size of the Wikipedia Page
The size of the Wikipedia page of each conference location is obtained by querying the Wikipedia website.

First, we need to obtain the distinct city names:

In [7]:
df_cities = df_coci_dataset['ConferenceLocation'].copy().to_frame()
df_cities = df_cities.drop_duplicates()

df_cities["City"] = np.nan

df_cities.City = df_cities.ConferenceLocation.str.split(',').str[0].str.replace(' ', '_')
df_cities = df_cities.drop_duplicates()

df_cities

Unnamed: 0,ConferenceLocation,City
0,"Austin, Texas, United States",Austin
1,"Wrocław, Lower Silesian Voivodeship, Poland",Wrocław
2,"Innsbruck, Tyrol, Austria",Innsbruck
3,"Villefranche-sur-Saône, Auvergne-Rhône-Alpes, ...",Villefranche-sur-Saône
4,"Zakopane, Lesser Poland Voivodeship, Poland",Zakopane
...,...,...
3103379,"Veneto, Italy",Veneto
3103468,"Bastia, Corsica, France",Bastia
3103514,"Laramie, Wyoming, United States",Laramie
3103806,"Longyearbyen, Norway",Longyearbyen
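The city-name extraction above can be tried on a small sample. This is a minimal sketch with hypothetical rows (the real values come from the COCI CSV); it keeps the text before the first comma and replaces spaces with underscores, which is the form Wikipedia page URLs use.

```python
import pandas as pd

# Hypothetical sample; the notebook reads these values from the COCI dataset.
sample = pd.DataFrame({
    "ConferenceLocation": [
        "Austin, Texas, United States",
        "New York City, New York, United States",
        "Austin, Texas, United States",  # repeated on purpose
    ]
})

# Drop repeated locations, then derive the city name from each location string.
cities = sample.drop_duplicates().copy()
cities["City"] = (
    cities["ConferenceLocation"]
    .str.split(",").str[0]      # text before the first comma
    .str.replace(" ", "_")      # spaces become underscores, as in page URLs
)

print(cities["City"].tolist())
```

Note that a location without a city component, such as "Veneto, Italy" in the output above, yields the region name instead.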


### Definition of the Web Scraping Function
The function that performs the web request is passed as a parameter.

In [8]:
def swp_scraper(cities_dataframe, mt_downloader_operation_function, wikipedia_url = "https://en.wikipedia.org/wiki/"):
    dict_city_swp = {}
    # Each work item is a (ConferenceLocation, City) tuple
    download_list = zip(list(cities_dataframe.ConferenceLocation.values), list(cities_dataframe.City.values))

    # The context manager shuts the worker pool down once all the futures have completed
    with concurrent.futures.ProcessPoolExecutor(max_workers=int(MAX_WORKERS), mp_context=mp_ctx) as executor:
        # Map each future to its work item, so that a failure can be reported with its location
        futures = {executor.submit(mt_downloader_operation_function, conf_location, wikipedia_url, SLEEP_DELAY): conf_location for conf_location in download_list}

        for future in concurrent.futures.as_completed(futures):
            try:
                k, v = future.result()
            except Exception as e:
                print(f"{futures[future]} throws {e}")
            else:
                dict_city_swp[k] = v

    # Converting the resulting dictionary to a dataframe
    df_city_swp = pd.DataFrame(dict_city_swp.items(), columns=['Location', 'SWP'])

    return df_city_swp
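The operation functions passed to `swp_scraper` are imported from `turistic_scraper_multithread_utils` and are not shown in this notebook. As a rough sketch of the contract that `swp_scraper` relies on (the function receives a `(ConferenceLocation, City)` tuple, the base URL and the sleep delay, and returns a key/value pair), a worker could look like the one below. The names `fetch_page_bytes` and `example_swp_operation` are hypothetical, and measuring SWP as the number of bytes in the response body is an assumption that may differ from the real helper.

```python
import time
import urllib.request
from urllib.parse import quote


def fetch_page_bytes(url):
    """Download a page and return its raw body (requires network access)."""
    request = urllib.request.Request(url, headers={"User-Agent": "swp-example/0.1"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def example_swp_operation(conf_location, wikipedia_url, sleep_delay, fetch=fetch_page_bytes):
    """Hypothetical worker: returns ((ConferenceLocation, City), page size or None)."""
    _, city = conf_location
    time.sleep(sleep_delay)  # throttle the requests to the website
    try:
        # Assumption: SWP is the length in bytes of the downloaded page.
        size = len(fetch(wikipedia_url + quote(city)))
    except OSError:
        # urllib's URLError and HTTPError are OSError subclasses:
        # a missing or unreachable page yields no size.
        size = None
    return conf_location, size
```

The `fetch` parameter exists only so the sketch can be exercised without network access, for example `example_swp_operation(("Austin, Texas, United States", "Austin"), "https://en.wikipedia.org/wiki/", 0, fetch=lambda url: b"abc")`. Because the function is defined at module level, it can be pickled and sent to worker processes.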

#### Queries to the Wikipedia Website

Parallel execution of the queries to the Wikipedia website.

**Note**: this operation should take less than 5 minutes, depending on your Internet speed.

In [9]:
df_city_swp = swp_scraper(df_cities, mt_get_city_swp_operation_v1, "https://en.wikipedia.org/wiki/")

Unzip of the location tuple and drop of the NaN values:

In [None]:
# unzip of the location tuple
df_city_swp["ConferenceLocation"] = np.nan
df_city_swp["City"] = np.nan
df_city_swp.ConferenceLocation = df_city_swp.Location.str[0]
df_city_swp.City = df_city_swp.Location.str[1]

# dataframe cleanup
df_city_swp.drop(columns="Location", inplace=True)
df_city_swp = df_city_swp.reindex(sorted(df_city_swp.columns), axis=1)
df_city_swp = df_city_swp.dropna()

df_city_swp
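The unpacking step can be illustrated on a toy frame. This is a minimal sketch with made-up values; it relies on the pandas `.str[i]` accessor, which also indexes into tuples stored in an object column.

```python
import pandas as pd

# Made-up scraper output: one (ConferenceLocation, City) tuple per row.
toy = pd.DataFrame({
    "Location": [
        ("Austin, Texas, United States", "Austin"),
        ("Bastia, Corsica, France", "Bastia"),
    ],
    "SWP": [500000, None],  # None stands for a page that could not be measured
})

# .str[i] picks the i-th element of each tuple.
toy["ConferenceLocation"] = toy["Location"].str[0]
toy["City"] = toy["Location"].str[1]

# Drop the tuple column and the rows without an SWP value.
toy = toy.drop(columns="Location").dropna()

print(toy)
```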

Fix of the column type

In [None]:
df_city_swp = df_city_swp.astype({"SWP": int})

### Addressing the Duplicates

There are cases of homonymy between cities, and they are not always easy to resolve.

In some cases, Wikipedia uses the URL scheme "/cityname,_country_name", so we'll try to fix these cities this way.

**Note**: we're going to do this only for the cities with an SWP value under 25k, because that usually means the page is a Wikipedia disambiguation page.
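As an illustration of the "/cityname,_country_name" scheme, the sketch below builds such a page title from a `ConferenceLocation` string. The helper `disambiguated_title` is hypothetical: the actual logic lives in `mt_get_city_swp_operation_v2`, which is not shown here, and the resulting title is not guaranteed to exist on Wikipedia.

```python
def disambiguated_title(conference_location):
    """Build a 'City,_Country' page title from 'City, Region, Country' (hypothetical helper)."""
    parts = [part.strip() for part in conference_location.split(",")]
    city, country = parts[0], parts[-1]
    # A location with a single component has no country to append.
    title = city if len(parts) == 1 else f"{city},_{country}"
    return title.replace(" ", "_")


print(disambiguated_title("Cambridge, England, United Kingdom"))
```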

In [None]:
df_city_swp_duplicated_cities = df_city_swp[df_city_swp["City"].isin(df_city_swp["City"][df_city_swp["City"].duplicated()])].sort_values(by="City")
df_city_swp_duplicated_cities = df_city_swp_duplicated_cities.loc[df_city_swp_duplicated_cities["SWP"] < 25000]
df_city_swp_duplicated_cities
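The selection above can be checked on a toy frame. This sketch uses made-up SWP values and `duplicated(keep=False)`, which marks every row of a duplicated group and is an equivalent, shorter way of writing the `isin(...duplicated())` filter.

```python
import pandas as pd

# Made-up SWP values, for illustration only.
toy = pd.DataFrame({
    "City": ["Cambridge", "Cambridge", "Austin", "Springfield", "Springfield"],
    "ConferenceLocation": [
        "Cambridge, England, United Kingdom",
        "Cambridge, Massachusetts, United States",
        "Austin, Texas, United States",
        "Springfield, Illinois, United States",
        "Springfield, Massachusetts, United States",
    ],
    "SWP": [300000, 300000, 500000, 12000, 12000],
})

# keep=False flags all rows whose city name occurs more than once.
homonyms = toy[toy["City"].duplicated(keep=False)]

# A small page for a shared name is treated as a likely disambiguation page.
suspects = homonyms[homonyms["SWP"] < 25000].sort_values(by=["City", "ConferenceLocation"])

print(suspects)
```

Here only the two Springfield rows are selected: the Cambridge rows share a name but have a large page, and Austin is not duplicated.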

#### Queries to the Wikipedia Website

Parallel execution of the queries to the Wikipedia website.

**Note**: this operation should take less than 5 minutes, depending on your Internet speed.

In [None]:
df_city_swp_fixed = swp_scraper(df_city_swp_duplicated_cities, mt_get_city_swp_operation_v2, "https://en.wikipedia.org/wiki/")

Unzip of the location tuple and drop of the NaN values:

In [None]:
# unzip of the location tuple
df_city_swp_fixed["ConferenceLocation"] = np.nan
df_city_swp_fixed["City"] = np.nan
df_city_swp_fixed.ConferenceLocation = df_city_swp_fixed.Location.str[0]
df_city_swp_fixed.City = df_city_swp_fixed.Location.str[1]

# dataframe cleanup
df_city_swp_fixed.drop(columns="Location", inplace=True)
df_city_swp_fixed = df_city_swp_fixed.reindex(sorted(df_city_swp_fixed.columns), axis=1)
df_city_swp_fixed = df_city_swp_fixed.dropna()

df_city_swp_fixed

In [None]:
df_city_swp_fixed.head(50)

#### Merge with the First SWP Dataframe

In [None]:
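# DataFrame.update works in place (it returns None) and aligns rows on the index,
# so both dataframes are keyed by ConferenceLocation before updating
df_city_swp = df_city_swp.set_index("ConferenceLocation")
df_city_swp.update(df_city_swp_fixed.set_index("ConferenceLocation"))
df_city_swp = df_city_swp.reset_index()
df_city_swp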
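`DataFrame.update` modifies the calling dataframe in place, returns `None`, and matches rows by index label. The sketch below, with made-up values, shows one way to merge a second pass into the first when the two frames carry unrelated indexes: key both by `ConferenceLocation`, update, then restore the column.

```python
import pandas as pd

# First pass, with an arbitrary index left over from earlier filtering.
first_pass = pd.DataFrame(
    {
        "City": ["Springfield", "Austin"],
        "ConferenceLocation": [
            "Springfield, Illinois, United States",
            "Austin, Texas, United States",
        ],
        "SWP": [12000, 500000],
    },
    index=[7, 3],
)

# Second pass, built from scratch, so its index restarts at 0.
second_pass = pd.DataFrame({
    "City": ["Springfield"],
    "ConferenceLocation": ["Springfield, Illinois, United States"],
    "SWP": [410000],
})

# Key both frames by location so that update() matches the right rows.
merged = first_pass.set_index("ConferenceLocation")
result = merged.update(second_pass.set_index("ConferenceLocation"))  # result is None
merged = merged.reset_index()

print(merged)
```

Without the shared key, the row at index 0 of the second pass would be matched against index 0 of the first pass, which here does not exist, so the fixed value would be silently ignored.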

## Write of the Final CSV on Disk

Saving the resulting dataframe on disk in CSV format.

In [None]:
# Write of the resulting CSV on Disk
df_city_swp.to_csv(path_file_export + 'out_city_swp.csv')
print(f'Successfully Exported the Processed CSV to {path_file_export}out_city_swp.csv')

Check of the Exported CSV to be sure that everything went fine.

In [None]:
# Check of the Exported CSV
df_swp_exported_csv = pd.read_csv(path_file_export + 'out_city_swp.csv', low_memory=False, index_col=[0])
df_swp_exported_csv