# Motivation
This notebook suggests several filters that can be applied to the large number of sources in the dataset.

# Filtering

In [None]:
import numpy as np
import pandas as pd 
import seaborn as sns
pd.set_option('max_colwidth', None)

### Metadata-only

The first big filter is to consider only texts for which there is metadata information.

In [None]:
path = '/kaggle/input/hackathon'
files = [f'{path}/task_1-google_search_english_original_metadata.csv',
         f'{path}/task_1-google_search_translated_to_english_metadata.csv']

dfs = []
for file in files:
    df = pd.read_csv(file, encoding = "ISO-8859-1")
    dfs.append(df)
    
df = pd.concat(dfs, ignore_index=True)

In [None]:
f"Considering only {df.shape[0]} sources"

In [None]:
df.head(1)

Drop some redundant columns:

In [None]:
df.drop(['Is Processed', 'Comments', 'language', 'query'], axis=1, inplace=True)

Fix alpha_2_code NaN values for Namibia, whose code 'NA' pandas reads as a missing value by default:

In [None]:
df[df['alpha_2_code'].isna()].head()

In [None]:
assert all(df[df['alpha_2_code'].isna()]['country']=='Namibia')
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['alpha_2_code'] = df['alpha_2_code'].fillna('NA')
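
The NaN values arise because `pd.read_csv` treats the literal string `NA` as missing by default. The sketch below uses a two-row inline CSV, made up for illustration and not taken from the dataset, to show the effect and one possible way to avoid it at read time. Note that `keep_default_na=False` switches off the default missing-value markers for every column, so it may not suit the real files.

```python
import io
import pandas as pd

# Made-up two-row CSV, only to illustrate the parsing behaviour
csv_text = "country,alpha_2_code\nNamibia,NA\nNorway,NO\n"

# Default parsing: the string 'NA' is converted to NaN
default_read = pd.read_csv(io.StringIO(csv_text))

# With the default NaN markers disabled, 'NA' stays a plain string
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read['alpha_2_code'].isna().tolist())
print(literal_read['alpha_2_code'].tolist())
```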

### Remove empty documents

In [None]:
df.drop(df[df['is_downloaded']==False].index, inplace=True)
df['char_number'] = pd.to_numeric(df['char_number'], errors='coerce')
df.drop(df[df['char_number']==0].index, inplace=True)
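
One detail of `errors='coerce'` is worth keeping in mind: values that cannot be parsed become NaN, and NaN is not equal to 0, so such rows survive the filter above. A minimal sketch on made-up values (whether the real column contains unparseable entries is not known):

```python
import pandas as pd

# Toy values standing in for the raw 'char_number' column (illustrative only)
raw = pd.Series(['895', '0', 'n/a'])
numeric = pd.to_numeric(raw, errors='coerce')  # 'n/a' becomes NaN

# NaN != 0 evaluates to True, so only the genuine zero is removed
kept = numeric[numeric != 0]

# Dropping NaN as well would be an explicit extra step
kept_strict = kept.dropna()
print(len(kept), len(kept_strict))
```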

In [None]:
f"Considering only {df.shape[0]} sources"

### Remove duplicated urls

In [None]:
df.drop_duplicates('url', keep=False, inplace=True)
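
Note that `keep=False` removes every row whose URL occurs more than once, including the first occurrence. The sketch below contrasts this with `keep='first'` on a made-up frame; which behaviour is preferable depends on whether duplicated URLs are considered untrustworthy as a whole.

```python
import pandas as pd

# Made-up frame: URL 'a' appears twice, URL 'b' once
toy = pd.DataFrame({'url': ['a', 'a', 'b'], 'char_number': [10, 10, 20]})

# keep=False drops all copies of a duplicated URL
dropped_all = toy.drop_duplicates('url', keep=False)

# keep='first' retains one copy of each URL
kept_first = toy.drop_duplicates('url', keep='first')

print(dropped_all['url'].tolist(), kept_first['url'].tolist())
```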

In [None]:
f"Considering only {df.shape[0]} sources"

### Analyse char number

In [None]:
df['char_number'].value_counts().head()

It is suspicious that 80 articles have exactly 895 characters. Let's see which URLs they correspond to.

In [None]:
df[df['char_number']==895].head()

In [None]:
# regex=False matches the domain literally; otherwise '.' would match any character
df[df['char_number']==895]['url'].str.contains('researchgate.net', regex=False).mean()

It turns out that most of those are from the website *researchgate.net*. Let's look at the content of one of the files.

In [None]:
row = df[df['char_number']==895].iloc[0]
code = row['alpha_2_code']
filename = row['filename']
filepath = f'/kaggle/input/hackathon/task_1-google_search_txt_files_v2/{code}/{filename}.txt'

with open(filepath, 'r') as file:
    data = file.read()

data

This looks like a predefined message that is not useful for us, so we can drop the *researchgate.net* sources that have exactly 895 characters.

In [None]:
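# regex=False matches the domain literally; otherwise '.' would match any character
is_researchgate = df['url'].str.contains('researchgate.net', regex=False)
df.drop(df[is_researchgate & (df['char_number']==895)].index, inplace=True)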
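
`str.contains` interprets its pattern as a regular expression unless `regex=False` is passed, so an unescaped dot matches any character. A small sketch with two made-up URLs shows how the two modes can differ:

```python
import pandas as pd

urls = pd.Series([
    'https://www.researchgate.net/publication/1',   # the intended domain
    'https://researchgate-network.example/page',    # made-up lookalike
])

# As a regex, '.' matches the '-' in 'researchgate-network'
as_regex = urls.str.contains('researchgate.net')

# As a literal string, only the real domain matches
as_literal = urls.str.contains('researchgate.net', regex=False)

print(as_regex.tolist(), as_literal.tolist())
```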

In [None]:
f"Considering only {df.shape[0]} sources"

### Analyse url domains

In [None]:
from urllib.parse import urlparse
df['url_domain'] = df['url'].apply(lambda x: urlparse(x).netloc)

In [None]:
df['url_domain'].value_counts().head()

The domain with the most entries in the filtered dataset is *www.ncbi.nlm.nih.gov*, the website of the National Center for Biotechnology Information, which provides access to biomedical and genomic information. It's a government site, which suggests that the information on it is reliable. Let's apply another filter and look only at the data from this website.
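
Counting by `netloc` treats hosts such as *www.ncbi.nlm.nih.gov* and *ncbi.nlm.nih.gov* as different domains. Whether the dataset contains both forms is not known; if it does, a normalisation step along these lines might help. `normalize_domain` is a hypothetical helper and the article URL is a placeholder.

```python
from urllib.parse import urlparse

def normalize_domain(url):
    """Hypothetical helper: lower-case the host and strip a leading 'www.'."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith('www.') else netloc

# netloc is returned exactly as written in the URL
with_www = urlparse('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000000/').netloc

# The helper folds case and drops the 'www.' prefix
normalized = normalize_domain('https://WWW.ncbi.nlm.nih.gov/pmc/articles/PMC0000000/')

print(with_www, normalized)
```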

### NCBI sources only

In [None]:
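# .copy() makes an independent frame, so adding columns later does not trigger SettingWithCopyWarning
df = df[df['url_domain']=='www.ncbi.nlm.nih.gov'].copy()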

In [None]:
f"Considering only {df.shape[0]} sources"

Let's enrich the metadata information for these sources by extracting the title for each article.

In [None]:
! pip install pandarallel

In [None]:
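import bs4 as bs
import urllib.error
import urllib.request
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)

def get_url_title(url):
    """Return the <title> text of the page at `url`, or "" if it cannot be retrieved."""
    try:
        # Send a browser-like User-Agent header with the request
        req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        source = urllib.request.urlopen(req).read()
        soup = bs.BeautifulSoup(source, 'lxml')
        if not soup.title:
            print('No title')
            print(url)
            return ""
        return soup.title.text
    # URLError is the parent class of HTTPError, so connection failures are caught too
    except urllib.error.URLError as e:
        print(e)
        print(url)
        return ""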

In [None]:
df['url_title'] = df['url'].parallel_apply(get_url_title)

In [None]:
df[['country', 'url_title']].head()

Looking at the titles, we can see that some of the articles cover more than one country, e.g. 'European countries'. This might cause problems, because when extracting the answer to a question we would not know which country it applies to. Let's filter out such articles by keeping only those whose title contains the name of the country.

In [None]:
df['title_has_country'] = df.apply(lambda row: row['country'] in row['url_title'], axis=1)
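
The `in` check above is a case-sensitive substring test, which has two limitations: it misses titles that write the country in a different case, and it accepts a country name that is only part of another one. The sketch below illustrates both with made-up titles; `title_mentions_country` is a hypothetical helper, not part of the notebook.

```python
def title_mentions_country(title, country):
    """Hypothetical helper: case-insensitive substring check."""
    return country.casefold() in title.casefold()

# The case-sensitive check misses an upper-case country name
strict = 'South Korea' in 'BCG vaccination in SOUTH KOREA'
relaxed = title_mentions_country('BCG vaccination in SOUTH KOREA', 'South Korea')

# Any substring check still accepts 'Niger' inside 'Nigeria'
partial = title_mentions_country('Vaccination coverage in Nigeria', 'Niger')

print(strict, relaxed, partial)
```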

In [None]:
df['title_has_country'].value_counts()

The good news is that most titles contain their country's name, which suggests that those articles focus on a single country.

In [None]:
df.drop(df[df['title_has_country'] == False].index, inplace=True)

In [None]:
f"Considering only {df.shape[0]} sources"

### Remove articles with the same title

In [None]:
df['url_title'].value_counts().head()

In [None]:
df[df['url_title']=='The Current Status of BCG Vaccination in Young Children in South Korea']['url']

Some articles share a title because their source URLs are almost identical, differing only in the section tag at the end. This means we can remove those duplicates.

In [None]:
df.drop_duplicates('url_title', inplace=True)
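
An alternative that does not depend on the downloaded titles is to strip the section tag from each URL before deduplicating. The sketch below assumes the tag is a URL fragment (the part after `#`), which may not hold for every source, and uses placeholder article URLs.

```python
from urllib.parse import urldefrag
import pandas as pd

# Placeholder URLs: the first two differ only in their fragment
toy = pd.DataFrame({'url': [
    'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000000/#sec1',
    'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000000/#sec2',
    'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000001/',
]})

# urldefrag splits a URL into the part before '#' and the fragment
toy['url_no_fragment'] = toy['url'].apply(lambda u: urldefrag(u).url)

# Keep one row per fragment-free URL
deduplicated = toy.drop_duplicates('url_no_fragment')
print(len(deduplicated))
```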

In [None]:
f"Considering only {df.shape[0]} sources"

In [None]:
df['country'].value_counts()

In [None]:
df['char_number'].plot.box()

Finally, we end up with just 85 sources for 73 countries with an average of about 20K characters per article.
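
The summary figures can be computed directly from the frame rather than read off the outputs. A minimal sketch, shown on a made-up frame with the same column names; `summarize` is a hypothetical helper.

```python
import pandas as pd

def summarize(frame):
    """Hypothetical helper: (number of sources, number of countries, mean characters)."""
    return len(frame), frame['country'].nunique(), frame['char_number'].mean()

# Made-up rows, only to show the shape of the result
toy = pd.DataFrame({
    'country': ['Namibia', 'Namibia', 'Norway'],
    'char_number': [10000, 20000, 30000],
})

n_sources, n_countries, mean_chars = summarize(toy)
print(n_sources, n_countries, mean_chars)
```

Calling `summarize(df)` on the filtered frame would return the corresponding values for the real data.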