# Social Factors by Country: Clustering Analysis - Web Scraping Component Using Selenium and BeautifulSoup

This is a personal project to develop my clustering techniques and their interpretation using Python. The data is sourced from a variety of websites, so merging and cleaning will be a pivotal step in producing a sound analysis.

The purpose of this project is to deploy data I have scraped from the internet onto a usable dashboard in Power BI so that one can survey the factors of each country. In addition, it investigates whether these factors are distinctive enough between countries to generate accurate clusters, and what useful information can be gathered within these clusters. This can be used for personal enjoyment, or to predict the qualities of future countries and determine which cluster they fit into.

The focus is on social factors because other factors, such as economic data, could dominate the clustering: they differ so sharply between countries that the resulting clusters would be obvious. This is why I chose to concentrate on the more subtle data found in social factors.



In [2]:
# Import required libraries
import requests
import json
import pandas as pd
import random
import time
import numpy as np 
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

## Continents and country

When gathering continental and country data from different sources around the web, one of the challenges that will inevitably occur is merging data that arrive in varying formats. Effective cleaning and formatting are needed so that the merging process goes smoothly and important information is not lost along the way.

To initialise the process, a baseline of all countries with their respective continents is extracted from GitHub. The names of major countries are then cleaned.

In [None]:
url = 'https://raw.githubusercontent.com/dbouquin/IS_608/master/NanosatDB_munging/Countries-Continents.csv'
countries_contin = pd.read_csv(url, index_col=0)
df = pd.DataFrame(countries_contin)
df = df.reset_index()
df = df.rename(columns = {'Country':'country'})

In [202]:
# Harmonise country names with the spelling used by the other sources
df['country'] = df['country'].replace({
    'Korea, North': 'North Korea',
    'Korea, South': 'South Korea',
    'US': 'United States',
    'Russian Federation': 'Russia',
    'CZ': 'Czech Republic',
})
df['country'] = df['country'].str.replace('&', 'and', regex=False)
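
One way to find which names still need cleaning is to merge with `indicator=True` and list the rows that found no partner. This is a minimal sketch on made-up toy frames; `baseline`, `source`, and the helper `unmatched_countries` are hypothetical names, not part of this notebook.

```python
import pandas as pd

def unmatched_countries(left, right, key='country'):
    """Return the key values in `left` that have no match in `right`."""
    # indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'
    check = pd.merge(left[[key]], right[[key]].drop_duplicates(),
                     how='left', on=key, indicator=True)
    return check.loc[check['_merge'] == 'left_only', key].tolist()

# Toy data standing in for the baseline table and a scraped source
baseline = pd.DataFrame({'country': ['Vietnam', 'Peru', 'United States']})
source = pd.DataFrame({'country': ['Viet Nam', 'Peru', 'United States of America']})

missing = unmatched_countries(baseline, source)
print(missing)
```

Any name this returns is a candidate for a manual rename before the real merge.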

### CO2 emissions

In [203]:
url = 'https://www.worldometers.info/co2-emissions/co2-emissions-per-capita/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Identify the whole table, segmented by tr
row = soup.find_all('tr')
country = []
CO2_per_capita_2016 = []

# Go through each element in the table
for item in row:
    # element in each row defined by td
    element = item.find_all('td')
    # skip instances of empty rows
    if len(element) > 1:
        # get country and its CO2 per capita value
        country.append(element[1].text)
        CO2_per_capita_2016.append(element[2].text)

d = {'country' : country, 'CO2_per_capita_2016':CO2_per_capita_2016}
df_co2_capita = pd.DataFrame(d)
# Convert to numbers, dropping any thousands separators first
df_co2_capita['CO2_per_capita_2016'] = pd.to_numeric(df_co2_capita['CO2_per_capita_2016'].str.replace(',', '', regex=False), errors='coerce')
# Drop parenthetical suffixes and surrounding whitespace from country names
df_co2_capita['country'] = df_co2_capita['country'].str.split('(').str[0].str.strip()
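
The row-by-row pattern used above can be tried offline on a small HTML string before pointing it at a live page. This is a minimal sketch: the HTML snippet, its values, and the helper name `table_to_frame` are made up for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

def table_to_frame(html, columns):
    """Collect selected <td> cells from every data row of an HTML table.

    `columns` maps an output column name to the position of the <td> in a row.
    """
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for tr in soup.find_all('tr'):
        cells = tr.find_all('td')
        # Header rows use <th>, so they have too few <td> cells and are skipped
        if len(cells) <= max(columns.values()):
            continue
        records.append({name: cells[pos].get_text(strip=True)
                        for name, pos in columns.items()})
    return pd.DataFrame(records)

html = """
<table>
  <tr><th>#</th><th>Country</th><th>Value</th></tr>
  <tr><td>1</td><td>Aland (example)</td><td>1,234.5</td></tr>
  <tr><td>2</td><td>Borduria</td><td>0.6</td></tr>
</table>
"""

toy = table_to_frame(html, {'country': 1, 'value': 2})
# Remove thousands separators before converting to a number
toy['value'] = pd.to_numeric(toy['value'].str.replace(',', '', regex=False), errors='coerce')
print(toy)
```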

In [204]:
merged_data = pd.merge(df,df_co2_capita, how = 'left', on = 'country')
merged_data['CO2_per_capita_2016'] = pd.to_numeric(merged_data.CO2_per_capita_2016.astype(str).str.replace(',',''), errors='coerce')
merged_data['CO2_per_capita_2016'] =  merged_data['CO2_per_capita_2016'].fillna(merged_data.groupby('Continent')['CO2_per_capita_2016'].transform('mean')).astype(float)
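
The fill step above replaces each missing value with the mean of the country's continent. A toy illustration with made-up numbers (the frame `toy` is not the project data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Continent': ['Africa', 'Africa', 'Africa', 'Europe', 'Europe'],
    'country': ['A', 'B', 'C', 'D', 'E'],
    'value': [1.0, 3.0, np.nan, 10.0, np.nan],
})

# transform('mean') returns one value per row: the mean of that row's continent
continent_mean = toy.groupby('Continent')['value'].transform('mean')
toy['value'] = toy['value'].fillna(continent_mean)
print(toy)
```

A continent whose values are all missing would stay `NaN` after this step, so a later `astype(int)` on such a column would raise an error.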

### Crime index data

In [205]:
crime_data = pd.read_csv('csvData.csv')
crime_data = crime_data.drop(columns = ['rank','pop2022'])
crime_data['country'] = [item.replace('And','and') for item in crime_data['country']]

In [206]:
merged_data = pd.merge(merged_data,crime_data, on  = 'country' , how = 'left')
merged_data['crimeIndex'] =  merged_data['crimeIndex'].fillna(merged_data.groupby('Continent')['crimeIndex'].transform('mean')).astype(float)

### Water index data

In [207]:
url = 'https://epi.yale.edu/epi-results/2020/component/h2o'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Identify the whole table, segmented by tr
row = soup.find_all('tr')
country = []
epi_score_2022 = []

for item in row:
    # element in each row defined by td
    element = item.find_all('td')
    # skip instances of empty rows
    if len(element) > 1:
        # get country and its water score
        country.append(element[0].text.strip())
        epi_score_2022.append(element[2].text.strip())

d = {'country' : country, 'epi_score2022': epi_score_2022}
water_score_2022 = pd.DataFrame(d)

water_score_2022['country'] = water_score_2022['country'].replace({
    'Dem. Rep. Congo': 'Congo',
    'United States of America': 'United States',
    'Viet Nam': 'Vietnam',
})

In [208]:
merged_data = pd.merge(merged_data,water_score_2022, how = 'left', on = 'country')
merged_data['epi_score2022'] = pd.to_numeric(merged_data.epi_score2022.astype(str).str.replace(',',''), errors='coerce')
merged_data['epi_score2022'] =  merged_data['epi_score2022'].fillna(merged_data.groupby('Continent')['epi_score2022'].transform('mean')).astype(float)

### Happiness index data

In [209]:
happy_data = pd.read_csv('happiness_index.csv')
happy_data = happy_data.drop(columns = ['rank','pop2022','happiness2020'])
happy_data['country'] = happy_data['country'].replace({'Republic of the Congo': 'Congo'})

In [210]:
merged_data = pd.merge(merged_data,happy_data, how = 'left', on = 'country')
merged_data['happiness2021'] =  merged_data['happiness2021'].fillna(merged_data.groupby('Continent')['happiness2021'].transform('mean')).astype(float)


### Life expectancy data

In [211]:
url = 'https://www.worldometers.info/demographics/life-expectancy/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Identify the table with all the row elements
rows = soup.find_all('tr')
country = []
fem_life = []
male_life = []

for item in rows:
    element = item.find_all('td')
    if len(element) > 0:
        country.append(element[1].text)
        fem_life.append(element[3].text)
        male_life.append(element[4].text)

d = {'country' : country, 'fem_life_expectancy': fem_life, 'male_life_expectancy': male_life}
life_expectancy_2022 = pd.DataFrame(d)

life_expectancy_2022['country'] = [item.replace('&' , 'and') for item in life_expectancy_2022['country']]
life_expectancy_2022['country'][life_expectancy_2022['country'] == 'Czech Republic (Czechia)'] = 'Czech Republic'

In [225]:
merged_data = pd.merge(merged_data,life_expectancy_2022, how = 'left' , on = 'country')
cols = ['fem_life_expectancy','male_life_expectancy']
merged_data[cols] = merged_data[cols].apply(pd.to_numeric, errors = 'coerce', axis= 1)
merged_data[cols] = merged_data[cols].fillna(merged_data.groupby('Continent')[cols].transform('mean')).astype(float)

### Religion Data

In [227]:
reli_data = pd.read_csv('religion.csv')
reli_data['religion'] = reli_data.loc[:, reli_data.columns != 'country'].idxmax(axis=1)
merged_data = pd.merge(merged_data, reli_data[['country','religion']], on = 'country', how = 'left')
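# Fill missing religions with the most common religion of the country's continent
continent_mode = merged_data.groupby('Continent')['religion'].agg(lambda s: s.mode().iloc[0] if s.notna().any() else np.nan)
merged_data['religion'] = merged_data['religion'].fillna(merged_data['Continent'].map(continent_mode))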

### Homosexuality data

Unlike the other sources, this data comes from Wikipedia, and it is not simple to scrape. Homosexuality is not a quantitative measure, so factors such as the legality of same-sex activity, marriage, and criminalisation must be weighted. The page presents this information as a table with check marks.

I used Selenium to detect the presence of these check marks and weighted each one according to my personal scale of severity. Some countries are lenient in some areas but harsh in others: if a country permits same-sex activity but bans marriage, how should that be weighted? Here are the weights I used (Moderate: 1, Severe: 2):


* Same-sex sexual activity: 1

* Recognition of same-sex unions: 2

* Same-sex marriage: 2 

* Adoption by same-sex couples: 1

* LGBT people allowed to serve openly in military: 1

* Anti-discrimination laws concerning sexual orientation: 2 

* Laws concerning gender identity/expression: 1 
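
The weighting can be written down as a small pure-Python function. This is only a sketch of the scheme above: the names `rights_score` and `WEIGHTS` and the example row are made up, and each cell is assumed to be the HTML of one table column in the order listed.

```python
# Weights in the same order as the list above
WEIGHTS = [1, 2, 2, 1, 1, 2, 1]

def rights_score(cells, weights=WEIGHTS, marker='title="Yes"'):
    """Sum the weights of every column whose cell contains the check mark marker."""
    return sum(w for cell, w in zip(cells, weights) if marker in cell)

# Hypothetical row: activity legal, unions recognised, everything else not
example_row = [
    '<img title="Yes">', '<img title="Yes">', '<img title="No">',
    '<img title="No">', '<img title="No">', '<img title="No">', '<img title="No">',
]
print(rights_score(example_row))  # 1 + 2 = 3
```

With these weights, a country with a check mark in every column reaches the maximum score of 10.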



In [232]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

PATH = "C:\\Users\\Ivan Shamoon\\Desktop\\chromedriver.exe"
url = 'https://en.wikipedia.org/wiki/LGBT_rights_by_country_or_territory'

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(PATH))
driver.get(url)
driver.maximize_window()


In [233]:
countries = []
score = []

tables = driver.find_elements(By.XPATH, "//td[@style ='border: solid 1px silver; padding: 8px; background-color: white;']")
tables = tables[20:25]

for table in tables:
    slots = table.find_elements(By.TAG_NAME, 'tr')
    for item in slots:
        try:
            # First cell of the row holds the country name; rows without td cells are skipped
            name = item.find_elements(By.TAG_NAME, 'td')[0].text
            # Split the raw row HTML into cells and drop everything up to the country cell
            row = item.get_attribute('innerHTML').split('<td>')
            row = row[2:99]
        except Exception:
            continue

        counter = 0
        for index, element in enumerate(row):
            if 'title="Yes"' in element:
                # Unions, marriage and anti-discrimination laws (columns 1, 2, 5) carry weight 2
                counter += 2 if index in (1, 2, 5) else 1
        # Append name and score together so both lists stay the same length
        countries.append(name)
        score.append(counter)

# Keep only the first line of each country cell
countries = [item.split('\n')[0] for item in countries]
d = {'country' : countries, 'homosexuality_score': score}
homosexuality_score = pd.DataFrame(d)

In [238]:
merged_data = pd.merge(merged_data,homosexuality_score, on = 'country', how = 'left')
merged_data['homosexuality_score'] = merged_data['homosexuality_score'].fillna(merged_data.groupby('Continent')['homosexuality_score'].transform('mean')).astype(int)

### Average schooling data

In [251]:
url = 'https://www.worldeconomics.com/Indicator-Data/ESG/Social/Mean-Years-of-Schooling/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

first_row = True
country = []
avg_schooling = []

table = soup.find_all('tr')

for item in table:
    if first_row:
        first_row = False
    else:
        element = item.find_all('td')
        country.append(element[0].text)
        avg_schooling.append(element[1].text)

d = {'country' : country, 'avg_schooling': avg_schooling}
avg_schooling = pd.DataFrame(d)

avg_schooling['country'] = avg_schooling['country'].replace({'Congo, Dem. Rep': 'Congo'})

In [None]:
merged_data = pd.merge(merged_data, avg_schooling, on = 'country' , how = 'left')
merged_data['avg_schooling'] = pd.to_numeric(merged_data.avg_schooling.astype(str).str.replace(',',''), errors='coerce')
merged_data['avg_schooling'] = merged_data['avg_schooling'].fillna(merged_data.groupby('Continent')['avg_schooling'].transform('mean')).astype(int)


### Adult obesity data

In [None]:
obesity = pd.read_csv('obesity_adult.csv')
# Rename on a fresh frame rather than in place on a slice, which avoids SettingWithCopyWarning
obesity_2016 = obesity[['Area', 'Value']].rename(columns={'Value':'obesity_2016'})


### Percentage of people considered malnourished


In [265]:
malnourisment_2016_18  = pd.read_csv('undernourishment.csv')
malnourisment_2016_18 = malnourisment_2016_18[['Area', 'Value']].rename(columns={'Value':'malnourishment_2016_18'})


### Percentage of areas with moderate to severe food insecurity

In [277]:
food_insecurity_2017  = pd.read_csv('food_insecurity_2017.csv')
food_insecurity_2017 = food_insecurity_2017[['Area', 'Value', 'Item']].rename(columns={'Value':'food_insecurity_2017'})
food_insecurity_2017 = food_insecurity_2017[food_insecurity_2017['Item'].str.contains('total')]
food_insecurity_2017 = food_insecurity_2017.drop('Item', axis = 1)

### Metric used to determine political stability

In [267]:
political_stability_2020  = pd.read_csv('political_stability_2020.csv')
political_stability_2020 = political_stability_2020[['Area', 'Value']].rename(columns={'Value':'political_stability_2020'})

### Protein supply by country

In [268]:
protein_supply_2017  = pd.read_csv('protein_supply_2017.csv')
protein_supply_2017 = protein_supply_2017[['Area', 'Value']].rename(columns={'Value':'protein_supply_2017'})

### Merging FAOSTAT Data

In [292]:
import functools as ft

dfs = [malnourisment_2016_18,protein_supply_2017,political_stability_2020,food_insecurity_2017]
df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='Area'), dfs)
df_final = pd.merge(df_final, obesity_2016, how = 'left' , on = 'Area')
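
Note that `pd.merge` defaults to an inner join, so the `reduce` above keeps only the areas present in every table. A toy sketch comparing this with an outer join; the frames `a`, `b`, `c` and the helper `merge_all` are made up for illustration.

```python
import functools as ft
import pandas as pd

a = pd.DataFrame({'Area': ['X', 'Y', 'Z'], 'a': [1, 2, 3]})
b = pd.DataFrame({'Area': ['X', 'Y'], 'b': [4, 5]})
c = pd.DataFrame({'Area': ['Y', 'Z'], 'c': [6, 7]})

def merge_all(frames, how):
    """Merge a list of frames on 'Area' from left to right."""
    return ft.reduce(lambda left, right: pd.merge(left, right, on='Area', how=how), frames)

inner = merge_all([a, b, c], how='inner')  # only 'Y' appears in all three frames
outer = merge_all([a, b, c], how='outer')  # keeps X, Y and Z, with NaN where data is missing
print(inner)
print(outer)
```

Whether dropping incomplete areas is acceptable depends on how much the later continent-level fill is meant to cover.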

In [None]:
# Harmonise FAOSTAT area names with the baseline country names
df_final['Area'] = df_final['Area'].replace({
    'Czechia': 'Czech Republic',
    'Viet Nam': 'Vietnam',
    'Russian Federation': 'Russia',
    "Democratic People's Republic of Korea": 'North Korea',
    'Republic of Korea': 'South Korea',
    'United States of America': 'United States',
    'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
    'China, Hong Kong SAR': 'China',
})

# Drop parenthetical suffixes such as '(...)' and write the cleaned names back to the frame
df_final['Area'] = df_final['Area'].str.split('(').str[0].str.strip()


df_final.rename(columns = {'Area':'country'}, inplace = True)


### Final Merge

In [314]:
merged_data = pd.merge(merged_data, df_final, how = 'left', on = 'country')
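# Some values carry a leading '<'; strip it before converting to numbers
merged_data['malnourishment_2016_18'] = merged_data['malnourishment_2016_18'].astype(str).str.strip('<')

cols = ['malnourishment_2016_18','protein_supply_2017','political_stability_2020','food_insecurity_2017','obesity_2016']
# Make every column numeric so the continent means are well defined
merged_data[cols] = merged_data[cols].apply(pd.to_numeric, errors = 'coerce')
merged_data[cols] = merged_data[cols].fillna(merged_data.groupby('Continent')[cols].transform('mean')).astype(float)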


In [328]:
merged_data.head(8)

| | Continent | country | CO2_per_capita_2016 | crimeIndex | epi_score2022 | happiness2021 | fem_life_expectancy | male_life_expectancy | religion | homosexuality_score | avg_schooling | malnourishment_2016_18 | protein_supply_2017 | political_stability_2020 | food_insecurity_2017 | obesity_2016 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Africa | Algeria | 3.850000 | 52.030000 | 53.200 | 4.887000 | 78.760000 | 76.300000 | muslims | 0 | 8 | 2.700000 | 89.300000 | -0.860000 | 19.700000 | 27.400000 |
| 1 | Africa | Angola | 1.060000 | 66.480000 | 12.800 | 4.532128 | 65.120000 | 59.460000 | christians | 3 | 5 | 15.400000 | 52.400000 | -0.520000 | 52.490909 | 8.200000 |
| 2 | Africa | Benin | 0.600000 | 53.974091 | 13.400 | 5.045000 | 64.450000 | 61.230000 | christians | 1 | 4 | 7.900000 | 64.300000 | -0.440000 | 66.300000 | 9.600000 |
| 3 | Africa | Botswana | 2.980000 | 52.980000 | 20.800 | 3.467000 | 72.690000 | 66.720000 | christians | 3 | 10 | 21.300000 | 71.000000 | 1.090000 | 53.400000 | 18.900000 |
| 4 | Africa | Burkina | 1.140625 | 53.974091 | 19.325 | 4.532128 | 67.183878 | 63.171429 | christians | 0 | 5 | 16.957895 | 68.213158 | -0.724082 | 52.490909 | 12.446809 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 211 | South America | Paraguay | 0.890000 | 49.370000 | 47.500 | 5.653000 | 76.780000 | 72.550000 | christians | 2 | 9 | 7.800000 | 68.000000 | 0.020000 | 15.600000 | 20.300000 |
| 212 | South America | Peru | 1.870000 | 66.720000 | 43.000 | 5.840000 | 80.150000 | 74.870000 | christians | 5 | 10 | 7.600000 | 87.400000 | -0.290000 | 42.900000 | 19.700000 |
| 213 | South America | Suriname | 3.810000 | 61.356364 | 39.300 | 5.873900 | 75.550000 | 68.880000 | christians | 3 | 9 | 8.300000 | 62.000000 | 0.420000 | 25.371429 | 26.400000 |
| 214 | South America | Uruguay | 1.900000 | 51.730000 | 70.800 | 6.431000 | 81.880000 | 74.750000 | christians | 10 | 9 | 2.500000 | 84.000000 | 1.050000 | 25.100000 | 27.900000 |


In [327]:
merged_data.to_csv('social_factors.csv')
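
Because `to_csv` writes the row index by default, reading the file back yields an extra `Unnamed: 0` column. A small sketch using in-memory buffers and toy data (not the project file) shows the effect of `index=False`:

```python
import io
import pandas as pd

toy = pd.DataFrame({'country': ['A', 'B'], 'score': [1, 2]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_clean = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_clean)
```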

## Final words

This is a non-exhaustive list of factors, so feel free to add more. The data will be analysed, further cleaned, and visualised in the next component.