# Classify web news by country and continent
This notebook is part of our data exploration. Our group plans to do further research based on metadata.csv. Here I classify each web news article by the countries mentioned in its tags and by the corresponding continents, then generate a CSV file with added Country and Continent columns to support analysis of geographic characteristics.

Note: I installed geotext and pycountry_convert with pip from the Anaconda prompt. geotext extracts country and city mentions from text. pycountry_convert returns the continent that corresponds to a country name.

**Geotext**

Free software: MIT license

Documentation: https://geotext.readthedocs.org

**Pycountry_convert**

Open source: Tsinghua Open Source Mirror

Source: https://mirrors.tuna.tsinghua.edu.cn/

Analyzed by: Sun Wengyi

In [1]:
%pip install pandas



In [2]:
%pip install geotext

Collecting geotext
  Downloading geotext-0.4.0-py2.py3-none-any.whl (2.0 MB)
Installing collected packages: geotext
Successfully installed geotext-0.4.0


In [3]:
%pip install pycountry_convert

Collecting pycountry_convert
  Downloading pycountry_convert-0.7.2-py3-none-any.whl (13 kB)
Collecting pprintpp>=0.3.0 (from pycountry_convert)
  Downloading pprintpp-0.4.0-py2.py3-none-any.whl (16 kB)
Collecting pycountry>=16.11.27.1 (from pycountry_convert)
  Downloading pycountry-23.12.11-py3-none-any.whl (6.2 MB)
Collecting pytest-mock>=1.6.3 (from pycountry_convert)
  Downloading pytest_mock-3.12.0-py3-none-any.whl (9.8 kB)
Collecting pytest-cov>=2.5.1 (from pycountry_convert)
  Downloading pytest_cov-4.1.0-py3-none-any.whl (21 kB)
Collecting repoze.lru>=0.7 (from pycountry_convert)
  Downloading repoze.lru-0.7-py3-none-any.whl (10 kB)
Collecting coverage[toml]>=5.2.1 (from pytest-cov>=2.5.1->pycountry_convert)
  Downloading coverage-7.4.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (233 kB)

In [4]:
import pandas as pd
from geotext import GeoText
import pycountry_convert as pc

In [5]:
# Convert a country name to its continent name
def country_to_continent(country_name):
    try:
        country_alpha2 = pc.country_name_to_country_alpha2(country_name)
        country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
        return pc.convert_continent_code_to_continent_name(country_continent_code)
    except Exception:
        # The name is not recognized or has no continent mapping
        return None

In [6]:
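# Extract countries and their corresponding continents from the tags
def extract_country_and_continent(tags):
    # Missing tags are read by pandas as NaN (a float), which GeoText cannot parse
    if not isinstance(tags, str):
        return [], []
    places = GeoText(tags)
    countries = places.countries
    continents = [country_to_continent(country) for country in countries]
    # Drop failed lookups and duplicates; sort so the output is reproducible
    return countries, sorted(set(filter(None, continents)))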
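
One detail worth checking: the country list may contain the same country several times when the tags mention it more than once, and a plain `set` does not guarantee any particular order. The sketch below shows one possible way to de-duplicate while keeping results stable. It uses only the standard library, and the sample lists are hypothetical, not taken from the dataset.

```python
# Hypothetical mentions for one article, in the order they were found.
mentions = ["Ukraine", "Russia", "Ukraine", "Poland", "Russia"]

# dict.fromkeys keeps the first occurrence of each item in insertion order.
unique_countries = list(dict.fromkeys(mentions))

# Hypothetical continent lookups, including one failed lookup (None).
continents = ["Europe", "Asia", None, "Europe"]

# Drop the failed lookups, remove duplicates, and sort for a stable order.
unique_continents = sorted({c for c in continents if c})

print(unique_countries)   # ['Ukraine', 'Russia', 'Poland']
print(unique_continents)  # ['Asia', 'Europe']
```

Whether repeated mentions should be kept depends on the analysis. Keeping them preserves how often a country is mentioned, while de-duplicating gives a cleaner Country column.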


In [7]:
def main(csv_file, output_file):
    #Read the CSV file into a DataFrame
    df = pd.read_csv(csv_file)
    #Initialize two lists to store country and continent information
    all_countries = []
    all_continents = []

    #Iterate over each row of the DataFrame
    for tags in df['Tags']:
        #Extract countries and continents
        countries, continents = extract_country_and_continent(tags)

        #Convert all countries and continents to strings, separated by commas
        country_str = ', '.join(countries) if countries else None
        continent_str = ', '.join(filter(None, continents)) if continents else None

        all_countries.append(country_str)
        all_continents.append(continent_str)

    #Add country and continent information to the DataFrame
    df['Country'] = all_countries
    df['Continent'] = all_continents

    #Write the processed DataFrame to a new CSV file
    df.to_csv(output_file, index=False)

    return df
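
Because the Country and Continent columns hold comma-separated strings, a later analysis step usually has to split them again. The sketch below shows one way to count articles per continent with pandas. The DataFrame is a small hypothetical stand-in for the output of `main`, not real results.

```python
import pandas as pd

# Hypothetical rows shaped like the output of main(); not real data.
df_example = pd.DataFrame({
    "WebTitle": ["a", "b", "c", "d", "e"],
    "Continent": ["Europe", "Asia, Europe", None, "Asia", "Europe"],
})

# Split each "Asia, Europe" style string into a list, give every
# continent its own row, then count how often each one appears.
continent_counts = (
    df_example["Continent"]
    .dropna()            # rows where no country was detected
    .str.split(", ")
    .explode()
    .value_counts()
)
print(continent_counts)
```

The same pattern should work for the Country column. An article that mentions two continents is counted once for each of them.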

In [8]:
from google.colab import files

uploaded = files.upload()



Saving metadata_improved.csv to metadata_improved.csv


In [10]:
#Output the new csv file
input_file_colab = '/content/metadata_improved.csv'
output_file_colab = '/content/new_metadata_with_country_and_continent.csv'
df_classified = main(input_file_colab, output_file_colab)
if df_classified is not None and isinstance(df_classified, pd.DataFrame):
    print("Processed data written to:", output_file_colab)
    print(df_classified.head())
else:
    print("Error: The main function did not return a valid DataFrame.")

Processed data written to: /content/new_metadata_with_country_and_continent.csv
   Unnamed: 0                                           WebTitle  \
0           0  Volodymyr Zelenskiy stands defiant in face of ...   
1           1  Record Covid cases in Russia and Ukraine compl...   
2           2  Back in the USSR: Lenin statues and Soviet fla...   
3           3  The artists of Ukraine find their voice in a c...   
4           4  Rail staff killed in ‘unprecedented’ attack on...   

                                              WebUrl  \
0  https://www.theguardian.com/world/2022/feb/26/...   
1  https://www.theguardian.com/world/2022/feb/04/...   
2  https://www.theguardian.com/world/2022/apr/23/...   
3  https://www.theguardian.com/world/2022/apr/23/...   
4  https://www.theguardian.com/world/2022/mar/29/...   

                     PubTime  \
0  2022-02-26 13:52:35+00:00   
1  2022-02-04 15:06:11+00:00   
2  2022-04-23 16:01:52+00:00   
3  2022-04-23 07:00:47+00:00   
4  2022-03-29 