# About this notebook:

- We are cleaning our main data: Articles.
- Data cleaning processes include:
    - Addressing missing values
    - Checking for duplicates
    - Converting to lowercase
    - Removing stopwords and stemming the articles' abstracts
    - Removing special characters from the abstracts

[Part 1: Library Imports & Functions Creation](#ID_1)<br>
[Part 2: Data Cleaning](#ID_2)<br>
[Part 3: Summary & Export](#ID_3)


# 1. Library Imports & Functions Creation <a class="anchor" id="ID_1"></a>

In [1]:
import numpy as np 
import pandas as pd

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


#Visualisation:
import seaborn               as sns
import matplotlib.pyplot     as plt
sns.set_theme(style="whitegrid")

#Progress tracking
from tqdm import tqdm
tqdm.pandas()

#Text processing
import ast

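#Show shape, duplicates, dtypes and missing values
def df_summary(df):
    """Print shape, duplicate count, dtype counts and columns with missing values."""
    # df.shape returns (n_rows, n_columns), so the tuple printed below is in that order
    print(f"Shape(col,rows): {df.shape}")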
    print(f"Number of duplicates: {df.duplicated().sum()}")
    print('---'*20)
    print(f'Number of each unqiue datatypes:\n{df.dtypes.value_counts()}')
    print('---'*20)
    print("Columns with missing values:")
    isnull_df = pd.DataFrame(df.isnull().sum()).reset_index()
    isnull_df.columns = ['col','num_nulls']
    isnull_df['perc_null'] = ((isnull_df['num_nulls'])/(len(df))).round(2)
    print(isnull_df[isnull_df['num_nulls']>0])

In [2]:
#Preprocessing
import re
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

from nltk.tokenize import word_tokenize

#remove stopwords
import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [3]:
df = pd.read_csv('01_Data Collection/Step_3_Merge_dataframes/master_df.csv')
df
df_summary(df)

Unnamed: 0,ID,title,Pub_Date,abstract,MeSH_term
0,21210353,Human leukocyte antigen-G (HLA-G) as a marker ...,2011/01/07,Human leukocyte antigen-G (HLA-G) is a non-cla...,"['', ('Biomarkers, Tumor', ['immunology', 'met..."
1,21265258,Head and neck follicular dendritic cell sarcom...,2011/01/27,"Currently, of less than 50 cases of head and n...","['', ('Aged', []), ('Castleman Disease', ['com..."
2,21205401,A discrete choice experiment investigating pre...,2011/01/06,Policy debate about funding criteria for drugs...,"['', ('Biomedical Research', []), ('Choice Beh..."
3,21245633,Effectiveness of repeated intragastric balloon...,2011/01/20,A 19-year-old Japanese male with a BMI of 55.4...,"['', ('Gastric Balloon', ['trends']), ('Humans..."
4,21194024,Golden retriever muscular dystrophy (GRMD): De...,2011/01/05,Studies of canine models of Duchenne muscular ...,"['', ('Animals', []), ('Breeding', []), ('Dise..."
...,...,...,...,...,...
355799,26709456,Reactive oxygen species production by human de...,2015/12/29,Tuberculosis remains the single largest infect...,"['', ('Dendritic Cells', ['immunology', 'metab..."
355800,26675461,Evaluating the Use of Commercial West Nile Vir...,2015/12/18,We evaluated the utility of 2 types of commerc...,"['', ('Animals', []), ('Antigens, Viral', ['im..."
355801,26709605,Efficacy of protease inhibitor monotherapy vs....,2015/12/29,The aim of this analysis was to review the evi...,"['', ('Atazanavir Sulfate', ['therapeutic use'..."
355802,26662151,The occurrence of chronic lymphocytic leukemia...,2015/12/15,The occurrence of chronic myeloid leukemia (CM...,"['', ('Aged', []), ('Aged, 80 and over', []), ..."


Shape(col,rows): (355804, 5)
Number of duplicates: 0
------------------------------------------------------------
Number of each unqiue datatypes:
object    4
int64     1
dtype: int64
------------------------------------------------------------
Columns with missing values:
        col  num_nulls  perc_null
3  abstract      32640       0.09


# 2. Data Cleaning <a class="anchor" id="ID_2"></a>

## 2.1 Remove rows with missing Abstracts

Rows without an abstract are dropped because they cannot help our modelling later on, which requires predicting the labels of articles from their abstract content.

In [4]:
df.dropna(subset=['abstract'], inplace=True)

In [5]:
#Number of rows remaining after dropping missing abstracts
len(df)

323164
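As a quick sanity check, the row count above should equal the original row count minus the number of missing abstracts reported by `df_summary`. The sketch below only re-uses the three numbers printed in this notebook; the variable names are illustrative.

```python
# Figures copied from the outputs above
rows_before = 355804        # rows in master_df.csv
missing_abstracts = 32640   # nulls reported for the 'abstract' column
rows_after = 323164         # len(df) after dropna

# The drop should remove exactly the rows with a missing abstract
expected_rows = rows_before - missing_abstracts
print(expected_rows == rows_after)
```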

### Convert Pub_Date to datetime format

In [6]:
df['Pub_Date'] = pd.to_datetime(df['Pub_Date'], yearfirst=True)
df.dtypes

ID                    int64
title                object
Pub_Date     datetime64[ns]
abstract             object
MeSH_term            object
dtype: object
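The `Pub_Date` strings in the preview above follow a `YYYY/MM/DD` pattern. A minimal sketch, assuming every row uses that pattern, is to pass an explicit `format` to `pd.to_datetime` so that no format inference is needed. The two sample dates are taken from the preview; the `dates` and `parsed` names are illustrative.

```python
import pandas as pd

# Two dates as they appear in the raw Pub_Date column
dates = pd.Series(['2011/01/07', '2015/12/29'])

# An explicit format avoids per-value inference; a string that does not
# match the pattern raises an error instead of being guessed at
parsed = pd.to_datetime(dates, format='%Y/%m/%d')
print(parsed.dt.year.tolist())
```

If some rows might not match the pattern, `errors='coerce'` turns them into `NaT` so they can be inspected afterwards.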

## 2.2 Rename columns

The MeSH column is renamed to avoid confusion later on, when these article-level terms are matched against the NLM PubMed MeSH vocabulary.

In [7]:
df.rename(columns = {'MeSH_term':'Article_Given_MeSH'},inplace=True)

## 2.3 Convert Article_Given_MeSH to a column of lists

In [8]:
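def extract_terms(term_list):
    """
    Builds 'term qualifier' strings from a parsed MeSH term list.
    Non-tuple items (such as the leading '') are skipped, and only
    the first qualifier of each term is kept.
    """
    terms = []
    for term in term_list:
        if isinstance(term, tuple):
            # term[0] is the MeSH term, term[1] its (possibly empty) list of qualifiers
            term_str = term[0].strip("'") + (' ' + term[1][0].strip("'") if term[1] else '')
            terms.append(term_str)
    return terms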

# loop over rows and extract terms
for i, row in tqdm(df.iterrows(), total=len(df)):
    term_str = row['Article_Given_MeSH']
    # convert string to list of tuples
    term_list = ast.literal_eval(term_str)
    # extract terms from list of tuples
    terms = extract_terms(term_list)
    # update the row with the new list of terms
    df.at[i, 'Article_Given_MeSH'] = terms

100%|██████████| 323164/323164 [29:28<00:00, 182.70it/s]
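To make the conversion concrete, here is a small worked example on a single value. It is a compact variant of the cell above, not a replacement for it: `parse_mesh` is a hypothetical helper name, and `sample` is an illustrative string written in the same shape as the raw `MeSH_term` values.

```python
import ast

def parse_mesh(raw):
    """Turn one stringified MeSH entry into a list of 'term qualifier' strings."""
    parsed = ast.literal_eval(raw)  # string -> list holding '' and (term, [qualifiers]) tuples
    terms = []
    for item in parsed:
        if isinstance(item, tuple):  # skips the leading '' placeholder
            term, qualifiers = item
            # As in the cell above, only the first qualifier (if any) is kept
            terms.append(f"{term} {qualifiers[0]}" if qualifiers else term)
    return terms

sample = "['', ('Gastric Balloon', ['trends']), ('Humans', [])]"  # illustrative value
print(parse_mesh(sample))
```

A function of this shape could also be passed to `Series.apply` (or `progress_apply` with tqdm) on the column, which avoids writing back row by row with `df.at`; whether that is noticeably faster on this dataset has not been measured here.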


## 2.4 Convert Title, Abstract and Article_Given_MeSH to lowercase

In [9]:
df['abstract'] = df['abstract'].str.lower()
df['title'] = df['title'].str.lower()
df

Unnamed: 0,ID,title,Pub_Date,abstract,Article_Given_MeSH
0,21210353,human leukocyte antigen-g (hla-g) as a marker ...,2011-01-07,human leukocyte antigen-g (hla-g) is a non-cla...,"[Biomarkers, Tumor immunology, HLA Antigens bi..."
1,21265258,head and neck follicular dendritic cell sarcom...,2011-01-27,"currently, of less than 50 cases of head and n...","[Aged, Castleman Disease complications, Dendri..."
2,21205401,a discrete choice experiment investigating pre...,2011-01-06,policy debate about funding criteria for drugs...,"[Biomedical Research, Choice Behavior, Cost-Be..."
3,21245633,effectiveness of repeated intragastric balloon...,2011-01-20,a 19-year-old japanese male with a bmi of 55.4...,"[Gastric Balloon trends, Humans, Male, Obesity..."
4,21194024,golden retriever muscular dystrophy (grmd): de...,2011-01-05,studies of canine models of duchenne muscular ...,"[Animals, Breeding, Disease Models, Animal, Do..."
...,...,...,...,...,...
355799,26709456,reactive oxygen species production by human de...,2015-12-29,tuberculosis remains the single largest infect...,"[Dendritic Cells immunology, Host-Pathogen Int..."
355800,26675461,evaluating the use of commercial west nile vir...,2015-12-18,we evaluated the utility of 2 types of commerc...,"[Animals, Antigens, Viral immunology, Culicida..."
355801,26709605,efficacy of protease inhibitor monotherapy vs....,2015-12-29,the aim of this analysis was to review the evi...,"[Atazanavir Sulfate therapeutic use, Cerebrosp..."
355802,26662151,the occurrence of chronic lymphocytic leukemia...,2015-12-15,the occurrence of chronic myeloid leukemia (cm...,"[Aged, Aged, 80 and over, B-Lymphocytes pathol..."


In [10]:
for i, row in df.iterrows():
    for j, term in enumerate(row['Article_Given_MeSH']):
        term = term.lower()
        df.at[i, 'Article_Given_MeSH'][j] = term

df

Unnamed: 0,ID,title,Pub_Date,abstract,Article_Given_MeSH
0,21210353,human leukocyte antigen-g (hla-g) as a marker ...,2011-01-07,human leukocyte antigen-g (hla-g) is a non-cla...,"[biomarkers, tumor immunology, hla antigens bi..."
1,21265258,head and neck follicular dendritic cell sarcom...,2011-01-27,"currently, of less than 50 cases of head and n...","[aged, castleman disease complications, dendri..."
2,21205401,a discrete choice experiment investigating pre...,2011-01-06,policy debate about funding criteria for drugs...,"[biomedical research, choice behavior, cost-be..."
3,21245633,effectiveness of repeated intragastric balloon...,2011-01-20,a 19-year-old japanese male with a bmi of 55.4...,"[gastric balloon trends, humans, male, obesity..."
4,21194024,golden retriever muscular dystrophy (grmd): de...,2011-01-05,studies of canine models of duchenne muscular ...,"[animals, breeding, disease models, animal, do..."
...,...,...,...,...,...
355799,26709456,reactive oxygen species production by human de...,2015-12-29,tuberculosis remains the single largest infect...,"[dendritic cells immunology, host-pathogen int..."
355800,26675461,evaluating the use of commercial west nile vir...,2015-12-18,we evaluated the utility of 2 types of commerc...,"[animals, antigens, viral immunology, culicida..."
355801,26709605,efficacy of protease inhibitor monotherapy vs....,2015-12-29,the aim of this analysis was to review the evi...,"[atazanavir sulfate therapeutic use, cerebrosp..."
355802,26662151,the occurrence of chronic lymphocytic leukemia...,2015-12-15,the occurrence of chronic myeloid leukemia (cm...,"[aged, aged, 80 and over, b-lymphocytes pathol..."
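The nested loop above lowercases each term in place. The same step can be sketched as a function that returns a new lowercased list per row; `lower_terms` is a hypothetical helper name and the sample list is taken from the first row of the earlier preview.

```python
def lower_terms(terms):
    """Return a new list with every MeSH term lowercased."""
    return [term.lower() for term in terms]

example_terms = ['Gastric Balloon trends', 'Humans', 'Male']
print(lower_terms(example_terms))
```

Applied to the DataFrame, this would read `df['Article_Given_MeSH'] = df['Article_Given_MeSH'].apply(lower_terms)`, which leaves the original lists untouched instead of mutating them through `df.at`.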


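## 2.5 Removing stopwords and stemming for Abstract

NOTE: **Remove stop words first**, then do stemming. The stopword list contains unstemmed words, so some stopwords would no longer match the list once they have been stemmed.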

In [11]:
df.reset_index(drop=True,inplace=True)

In [12]:
df['abstract'][50]
df['abstract'][5]

'dengue fever, including dengue hemorrhagic fever, has become a re-emerging public health threat in the caribbean in the absence of a comprehensive regional surveillance system. in this deficiency, a project entitled aricaba, strives to implement a pilot surveillance system across three islands: martinique, st. lucia, and dominica. the aim of this project is to establish a network for epidemiological surveillance of infectious diseases, utilizing information and communication technology. this paper describes the system design and development strategies of a "network of networks" surveillance system for infectious diseases in the caribbean. also described are benefits, challenges, and limitations of this approach across the three island nations identified through direct observation, open-ended interviews, and email communications with an on-site it consultant, key informants, and the project director. identified core systems design of the aricaba data warehouse include a disease monitor

'b and t lymphocyte attenuator (btla) is a co-inhibitory receptor that interacts with herpesvirus entry mediator (hvem), and this interaction regulates pathogenesis in various immunologic diseases. in graft-versus-host disease (gvhd), btla unexpectedly mediates positive effects on donor t-cell survival, whereas immunologic mechanisms of this function have yet to be explored. in this study, we elucidated a role of btla in gvhd by applying the newly established agonistic anti-btla monoclonal antibody that stimulates btla signal without antagonizing btla-hvem interaction. our results revealed that provision of btla signal inhibited donor antihost t-cell responses and ameliorated gvhd with a successful engraftment of donor hematopoietic cells. these effects were dependent on btla signal into donor t cells but neither donor non-t cells nor recipient cells. on the other hand, expression of btla mutant lacking an intracellular signaling domain restored impaired survival of btla-deficient t ce

In [13]:
#Function to tokenise, stem words, then join back
def stem_sentences(sentence):
    tokens = word_tokenize(sentence)
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return ' '.join(stemmed_tokens)

In [14]:
#remove stopwords
df['abstract'] = df['abstract'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
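One caveat with filtering on `x.split()`: whitespace splitting leaves punctuation attached to words, so a stopword followed by a comma or full stop does not match the stopword set. The sketch below illustrates this with a small hand-picked stopword subset and a made-up sentence (both are assumptions for illustration, not data from the notebook), and shows one possible punctuation-aware comparison.

```python
import string

stop_words_demo = {"of", "these", "the", "was"}  # small illustrative subset
text = "of these, the btla signal was studied"   # made-up example sentence

# Same approach as the cell above: 'these,' keeps its comma and is not removed
whitespace_filtered = ' '.join(w for w in text.split() if w not in stop_words_demo)

# Variant: strip surrounding punctuation before comparing against the stopword set
punct_aware = ' '.join(
    w for w in text.split()
    if w.strip(string.punctuation) not in stop_words_demo
)

print(whitespace_filtered)
print(punct_aware)
```

The variant drops the attached punctuation together with the stopword, which is harmless here because special characters are removed in a later step anyway.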

In [15]:
%%time
#tokenise, stem, then join back
df['abstract'] = df['abstract'].apply(stem_sentences)

CPU times: total: 18min 37s
Wall time: 18min 39s


In [16]:
df['abstract'][50]
df['abstract'][5]

"dengu fever , includ dengu hemorrhag fever , becom re-emerg public health threat caribbean absenc comprehens region surveil system . defici , project entitl aricaba , strive implement pilot surveil system across three island : martiniqu , st. lucia , dominica . aim project establish network epidemiolog surveil infecti diseas , util inform commun technolog . paper describ system design develop strategi `` network network '' surveil system infecti diseas caribbean . also describ benefit , challeng , limit approach across three island nation identifi direct observ , open-end interview , email commun on-sit consult , key inform , project director . identifi core system design aricaba data warehous includ diseas monitor system syndrom surveil system . three compon compris develop strategi : data warehous server , geograph inform system , forecast algorithm ; recogn technic prioriti surveil system . main benefit aricaba surveil system improv respons repres exist health system autom data col

'b lymphocyt attenu ( btla ) co-inhibitori receptor interact herpesviru entri mediat ( hvem ) , interact regul pathogenesi variou immunolog diseas . graft-versus-host diseas ( gvhd ) , btla unexpectedli mediat posit effect donor t-cell surviv , wherea immunolog mechan function yet explor . studi , elucid role btla gvhd appli newli establish agonist anti-btla monoclon antibodi stimul btla signal without antagon btla-hvem interact . result reveal provis btla signal inhibit donor antihost t-cell respons amelior gvhd success engraft donor hematopoiet cell . effect depend btla signal donor cell neither donor non-t cell recipi cell . hand , express btla mutant lack intracellular signal domain restor impair surviv btla-defici cell , suggest btla also serv ligand deliv hvem prosurviv signal donor cell . collect , current studi elucid dichotom function btla gvhd serv costimulatori ligand hvem transmit inhibitori signal receptor .'

## 2.6 Remove special characters

In [17]:
def remove_spec_char(text):
    return re.sub('[^A-Za-z0-9 ]+', '',text)    

In [18]:
df['abstract'][5]

'b lymphocyt attenu ( btla ) co-inhibitori receptor interact herpesviru entri mediat ( hvem ) , interact regul pathogenesi variou immunolog diseas . graft-versus-host diseas ( gvhd ) , btla unexpectedli mediat posit effect donor t-cell surviv , wherea immunolog mechan function yet explor . studi , elucid role btla gvhd appli newli establish agonist anti-btla monoclon antibodi stimul btla signal without antagon btla-hvem interact . result reveal provis btla signal inhibit donor antihost t-cell respons amelior gvhd success engraft donor hematopoiet cell . effect depend btla signal donor cell neither donor non-t cell recipi cell . hand , express btla mutant lack intracellular signal domain restor impair surviv btla-defici cell , suggest btla also serv ligand deliv hvem prosurviv signal donor cell . collect , current studi elucid dichotom function btla gvhd serv costimulatori ligand hvem transmit inhibitori signal receptor .'

In [19]:
df['abstract'] = df['abstract'].apply(remove_spec_char)

In [20]:
df['abstract'][5]

'b lymphocyt attenu  btla  coinhibitori receptor interact herpesviru entri mediat  hvem   interact regul pathogenesi variou immunolog diseas  graftversushost diseas  gvhd   btla unexpectedli mediat posit effect donor tcell surviv  wherea immunolog mechan function yet explor  studi  elucid role btla gvhd appli newli establish agonist antibtla monoclon antibodi stimul btla signal without antagon btlahvem interact  result reveal provis btla signal inhibit donor antihost tcell respons amelior gvhd success engraft donor hematopoiet cell  effect depend btla signal donor cell neither donor nont cell recipi cell  hand  express btla mutant lack intracellular signal domain restor impair surviv btladefici cell  suggest btla also serv ligand deliv hvem prosurviv signal donor cell  collect  current studi elucid dichotom function btla gvhd serv costimulatori ligand hvem transmit inhibitori signal receptor '
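As the output above shows, deleting special characters outright merges hyphenated words (for example `graftversushost`) and leaves runs of spaces behind. If that is not wanted, one alternative is to replace special characters with a space and then collapse repeated spaces. This is a sketch of a different design choice, not the method used in this notebook; `remove_spec_char_spaced` is a hypothetical name and `sample` is a fragment of the abstract shown above.

```python
import re

def remove_spec_char_spaced(text):
    """Replace runs of special characters with one space, then tidy whitespace."""
    text = re.sub(r'[^A-Za-z0-9 ]+', ' ', text)  # special characters -> space
    return re.sub(r' +', ' ', text).strip()      # collapse repeated spaces

sample = 'graft-versus-host diseas ( gvhd ) , btla'

# Behaviour of the notebook's remove_spec_char, for comparison
original_style = re.sub('[^A-Za-z0-9 ]+', '', sample)

print(original_style)
print(remove_spec_char_spaced(sample))
```

Which behaviour is preferable depends on whether hyphenated compounds should count as one token or several in the downstream model.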

# 3. Summary and Export Cleaned Data <a class="anchor" id="ID_3"></a>

After carrying out these processes:<br>
- Addressing missing values
- Checking for duplicates
- Converting to lowercase
- Removing stopwords and stemming the articles' abstracts
- Removing special characters from the abstracts

we now have our cleaned Articles data.

In [None]:
df.head()

In [30]:
df.to_csv("DATA_cleaned.csv",index=False)
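Note that CSV has no list type, so the `Article_Given_MeSH` lists are written out as their string representation and come back as plain strings on reload. A minimal sketch of restoring them, assuming the file is read back with pandas: the `converters` argument of `pd.read_csv` can run `ast.literal_eval` on that column. The tiny `demo` frame and the in-memory buffer stand in for the real file.

```python
import ast
import io

import pandas as pd

# Tiny stand-in for the cleaned data; the real export has more columns
demo = pd.DataFrame({'ID': [1], 'Article_Given_MeSH': [['aged', 'humans']]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# converters restores the list column while the CSV is being parsed
reloaded = pd.read_csv(buffer, converters={'Article_Given_MeSH': ast.literal_eval})
restored_terms = reloaded['Article_Given_MeSH'][0]
print(type(restored_terms).__name__, restored_terms)
```

In the same way, `Pub_Date` would need `parse_dates=['Pub_Date']` on reload to come back as a datetime column.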