# Data Preprocessing and Understanding

## Set Workspace

In [1]:
import numpy as np
import pandas as pd

In [2]:
pd.options.display.max_colwidth=None
pd.options.display.max_rows=None

## Preprocessing Data

In [3]:
# Document-type codes that prefix some titles in the dataset
codes = ['[PDF][PDF]', '[BOOK][B]', '[HTML][HTML]', '[DOC][DOC]', '[CITATION][C]']

# Create a replacement dictionary
replace_dict = {'[PDF][PDF]':'PDF', '[BOOK][B]':'BOOK', 
                '[HTML][HTML]':'HTML', '[DOC][DOC]':'DOC', 
                '[CITATION][C]': 'CITE', 'ART':'ART'}

In [4]:
# Data on horse colic
horse_df = pd.read_csv('data/horse_colic.csv')
horse_df.head(2)

Unnamed: 0,Title,TruncAbstr,AugmTitle,Info
0,Prospective study of equine colic risk factors,A1 year prospective study was conducted on 31 horse farms to identify risk factors for equine \ncolic. Farms were randomly selected from a list from 2 adjacent counties of Virginia and …,"('Prospective study of equine colic risk factors', 'A1 year prospective study was conducted on 31 horse farms to identify risk factors for equine \ncolic. Farms were randomly selected from a list from 2 adjacent counties of Virginia and …')","MK Tinker, NA White, P Lessard… - Equine veterinary …, 1997 - Wiley Online Library"
1,[PDF][PDF] Dietary and other management factors associated with equine colic,"… Equine colic is an important cause of disease and death in horses. Relatively few … , and to \nexamine other management factors associated with equine colic. Because horses examined at …","('[PDF][PDF] Dietary and other management factors associated with equine colic', '… Equine colic is an important cause of disease and death in horses. Relatively few … , and to \nexamine other management factors associated with equine colic. Because horses examined at …')","N Cohen, P Gibbs, A Woods - J. Am. Vet. Med. Assoc, 1999 - researchgate.net"


In [5]:
# Get the data shape
horse_df.shape

(1050, 4)

In [6]:
# Count the missing values in each column
horse_df.isna().sum()

Title           0
TruncAbstr    190
AugmTitle       0
Info            0
dtype: int64

In [7]:
# Fill in the missing values with empty string
horse_df = horse_df.fillna('')

# Check for success
horse_df.isna().sum()

Title         0
TruncAbstr    0
AugmTitle     0
Info          0
dtype: int64

In [8]:
# Create a new column to hold the labels
horse_df['Label'] = 'ART'

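# Populate the Label column with the codes found in the Title column.
# Caution: str.contains treats each code as a regular expression by default, so
# '[DOC][DOC]' is read as two character classes and matches any two adjacent
# letters from D, O, C (such as the 'OO' in '[BOOK][B]'). Some rows therefore get
# the wrong label; passing regex=False would match the codes literally.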
for c in codes:
    horse_df['Label'] = np.where(horse_df['Title'].str.contains(c), c, horse_df['Label'] )

# Check the output
horse_df.head(10)

Unnamed: 0,Title,TruncAbstr,AugmTitle,Info,Label
0,Prospective study of equine colic risk factors,A1 year prospective study was conducted on 31 horse farms to identify risk factors for equine \ncolic. Farms were randomly selected from a list from 2 adjacent counties of Virginia and …,"('Prospective study of equine colic risk factors', 'A1 year prospective study was conducted on 31 horse farms to identify risk factors for equine \ncolic. Farms were randomly selected from a list from 2 adjacent counties of Virginia and …')","MK Tinker, NA White, P Lessard… - Equine veterinary …, 1997 - Wiley Online Library",ART
1,[PDF][PDF] Dietary and other management factors associated with equine colic,"… Equine colic is an important cause of disease and death in horses. Relatively few … , and to \nexamine other management factors associated with equine colic. Because horses examined at …","('[PDF][PDF] Dietary and other management factors associated with equine colic', '… Equine colic is an important cause of disease and death in horses. Relatively few … , and to \nexamine other management factors associated with equine colic. Because horses examined at …')","N Cohen, P Gibbs, A Woods - J. Am. Vet. Med. Assoc, 1999 - researchgate.net",[PDF][PDF]
2,Prospective study of equine colic incidence and mortality,"A prospective study of one year was conducted on 31 horse farms to obtain population based \nestimates of incidence, morbidity and mortality rates of equine colic. Farms with greater …","('Prospective study of equine colic incidence and mortality', 'A prospective study of one year was conducted on 31 horse farms to obtain population based \nestimates of incidence, morbidity and mortality rates of equine colic. Farms with greater …')","MK Tinker, NA White, P Lessard… - Equine veterinary …, 1997 - Wiley Online Library",ART
3,Case-control study of the association between various management factors and development of colic in horses. Texas Equine Colic Study Group.,The association between various management factors and development of colic was studied \nin 821 horses treated for colic and 821 control horses treated for noncolic emergencies by …,"('Case-control study of the association between various management factors and development of colic in horses. Texas Equine Colic Study Group.', 'The association between various management factors and development of colic was studied \nin 821 horses treated for colic and 821 control horses treated for noncolic emergencies by …')","ND Cohen, PL Matejka, CM Honnas… - Journal of the American …, 1995 - europepmc.org",ART
4,[BOOK][B] Practical guide to equine colic,"… should stimulate research ideas to improve equine colic patient management. I personally \nlearned … When asked by WileyBlackwell to write a small 200-page book about equine colic, it …","('[BOOK][B] Practical guide to equine colic', '… should stimulate research ideas to improve equine colic patient management. I personally \nlearned … When asked by WileyBlackwell to write a small 200-page book about equine colic, it …')","LL Southwood, J Fehr - 2012 - books.google.com",[DOC][DOC]
5,"A two year, prospective survey of equine colic in general practice",… VERY few epidemiological studies of equine colic have been performed on an unbiased \npopulation. Most published surveys have been conducted on horses referred to clinics for …,"('A two year, prospective survey of equine colic in general practice', '… VERY few epidemiological studies of equine colic have been performed on an unbiased \npopulation. Most published surveys have been conducted on horses referred to clinics for …')","CJ Proudman - Equine Veterinary Journal, 1992 - Wiley Online Library",ART
6,Development of a colic severity score for predicting the outcome of equine colic,"… of survival for equine colic patients. Numerous variables, used independently or in combination, \nare related to survival,’ but they have been most successful in predicting survival when …","('Development of a colic severity score for predicting the outcome of equine colic', '… of survival for equine colic patients. Numerous variables, used independently or in combination, \nare related to survival,’ but they have been most successful in predicting survival when …')","MO FURR, P LESSARD, NAW II - Veterinary Surgery, 1995 - Wiley Online Library",ART
7,Prognosticating equine colic,Prognosticating survival in horses with colic is challenging because of the number of diseases \nand pathophysiologic processes that can cause the behavior. Although the treatment of …,"('Prognosticating equine colic', 'Prognosticating survival in horses with colic is challenging because of the number of diseases \nand pathophysiologic processes that can cause the behavior. Although the treatment of …')","S Dukti, NA White - Veterinary clinics: Equine practice, 2009 - vetequine.theclinics.com",ART
8,Clinical evaluation of blood lactate levels in equine colic,"… Donawick and Hiza (1973) reported that in 9 cases of equine colic, blood lactate levels were \n… evaluate the rise of blood lactate levels in equine colic cases and to determine if such levels …","('Clinical evaluation of blood lactate levels in equine colic', '… Donawick and Hiza (1973) reported that in 9 cases of equine colic, blood lactate levels were \n… evaluate the rise of blood lactate levels in equine colic cases and to determine if such levels …')","JN Moore, RR Owen, JH Lumsden - Equine Veterinary Journal, 1976 - Wiley Online Library",ART
9,Prognosis in equine colic: a comparative study of variables used to assess individual cases,The present retrospective study compared objectively the prognostic value of many variables \nroutinely used in the assessment of equine colic cases. The best prognostic variables were …,"('Prognosis in equine colic: a comparative study of variables used to assess individual cases', 'The present retrospective study compared objectively the prognostic value of many variables \nroutinely used in the assessment of equine colic cases. The best prognostic variables were …')","BW Parry, GA Anderson, CC Gay - Equine Veterinary Journal, 1983 - Wiley Online Library",ART
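
Row 4 in the output above is a book, yet it is tagged `[DOC][DOC]`. A likely cause is that `str.contains` interprets its pattern as a regular expression unless told otherwise, so the square brackets act as character classes. The following is a minimal sketch of literal matching with `regex=False`; the `toy` frame is illustrative data, not the real `horse_df`.

```python
import numpy as np
import pandas as pd

# Toy titles standing in for horse_df['Title'] (illustrative data only)
toy = pd.DataFrame({'Title': [
    'Prospective study of equine colic risk factors',
    '[PDF][PDF] Dietary and other management factors',
    '[BOOK][B] Practical guide to equine colic',
]})
codes = ['[PDF][PDF]', '[BOOK][B]', '[HTML][HTML]', '[DOC][DOC]', '[CITATION][C]']

toy['Label'] = 'ART'
for c in codes:
    # regex=False makes str.contains search for the code as a plain substring
    has_code = toy['Title'].str.contains(c, regex=False)
    toy['Label'] = np.where(has_code, c, toy['Label'])

print(toy['Label'].tolist())
# ['ART', '[PDF][PDF]', '[BOOK][B]']
```

With literal matching, the book title keeps its `[BOOK][B]` code. Rerunning the real notebook this way would be expected to change the label counts reported further down.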


In [9]:
# Replace the labels with shorter expressions
horse_df['Label'] = horse_df['Label'].replace(replace_dict)
horse_df.head(2)

Unnamed: 0,Title,TruncAbstr,AugmTitle,Info,Label
0,Prospective study of equine colic risk factors,A1 year prospective study was conducted on 31 horse farms to identify risk factors for equine \ncolic. Farms were randomly selected from a list from 2 adjacent counties of Virginia and …,"('Prospective study of equine colic risk factors', 'A1 year prospective study was conducted on 31 horse farms to identify risk factors for equine \ncolic. Farms were randomly selected from a list from 2 adjacent counties of Virginia and …')","MK Tinker, NA White, P Lessard… - Equine veterinary …, 1997 - Wiley Online Library",ART
1,[PDF][PDF] Dietary and other management factors associated with equine colic,"… Equine colic is an important cause of disease and death in horses. Relatively few … , and to \nexamine other management factors associated with equine colic. Because horses examined at …","('[PDF][PDF] Dietary and other management factors associated with equine colic', '… Equine colic is an important cause of disease and death in horses. Relatively few … , and to \nexamine other management factors associated with equine colic. Because horses examined at …')","N Cohen, P Gibbs, A Woods - J. Am. Vet. Med. Assoc, 1999 - researchgate.net",PDF


In [10]:
# Delete the labels from Title column
for c in codes:
    horse_df.Title = horse_df.Title.str.replace(c,'', regex=False)
horse_df.head(2)

Unnamed: 0,Title,TruncAbstr,AugmTitle,Info,Label
0,Prospective study of equine colic risk factors,A1 year prospective study was conducted on 31 horse farms to identify risk factors for equine \ncolic. Farms were randomly selected from a list from 2 adjacent counties of Virginia and …,"('Prospective study of equine colic risk factors', 'A1 year prospective study was conducted on 31 horse farms to identify risk factors for equine \ncolic. Farms were randomly selected from a list from 2 adjacent counties of Virginia and …')","MK Tinker, NA White, P Lessard… - Equine veterinary …, 1997 - Wiley Online Library",ART
1,Dietary and other management factors associated with equine colic,"… Equine colic is an important cause of disease and death in horses. Relatively few … , and to \nexamine other management factors associated with equine colic. Because horses examined at …","('[PDF][PDF] Dietary and other management factors associated with equine colic', '… Equine colic is an important cause of disease and death in horses. Relatively few … , and to \nexamine other management factors associated with equine colic. Because horses examined at …')","N Cohen, P Gibbs, A Woods - J. Am. Vet. Med. Assoc, 1999 - researchgate.net",PDF
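
As an alternative to the three separate steps above (label, shorten, strip), the code can be extracted, mapped, and removed in one pass. This is only a sketch under the assumption that each title carries at most one code; the `pattern` variable and the `toy` frame are hypothetical names introduced for illustration.

```python
import re
import pandas as pd

replace_dict = {'[PDF][PDF]': 'PDF', '[BOOK][B]': 'BOOK', '[HTML][HTML]': 'HTML',
                '[DOC][DOC]': 'DOC', '[CITATION][C]': 'CITE'}

# re.escape turns the brackets into literal characters before building the regex
pattern = '(' + '|'.join(re.escape(code) for code in replace_dict) + ')'

toy = pd.DataFrame({'Title': [
    'Prospective study of equine colic risk factors',
    '[PDF][PDF] Dietary and other management factors',
    '[BOOK][B] Practical guide to equine colic',
]})

# Extract the first code (NaN when absent), map it to the short label, default to ART
found = toy['Title'].str.extract(pattern, expand=False)
toy['Label'] = found.map(replace_dict).fillna('ART')

# Remove the code from the title and trim the leftover leading space
toy['Title'] = toy['Title'].str.replace(pattern, '', regex=True).str.strip()

print(toy['Label'].tolist())
# ['ART', 'PDF', 'BOOK']
```

Escaping the codes keeps the regex engine from reading `[BOOK]` as a character class, which is the same pitfall that affects unescaped `str.contains` calls.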


In [11]:
# Group items by type. No BOOK label appears: In [8] matches the codes as
# regexes, so book titles such as row 4 of its output end up labelled DOC
horse_df.groupby('Label').size()

Label
ART     845
CITE     43
DOC      18
HTML     48
PDF      96
dtype: int64

In [15]:
# Combine Title and TruncAbstr, overwriting the AugmTitle column
horse_df['AugmTitle'] = horse_df['Title'] + ' ' + ' . '+ ' ' + horse_df['TruncAbstr']
horse_df.head(2)

Unnamed: 0,Title,TruncAbstr,AugmTitle,Info,Label
0,Prospective study of equine colic risk factors,A1 year prospective study was conducted on 31 horse farms to identify risk factors for equine \ncolic. Farms were randomly selected from a list from 2 adjacent counties of Virginia and …,Prospective study of equine colic risk factors . A1 year prospective study was conducted on 31 horse farms to identify risk factors for equine \ncolic. Farms were randomly selected from a list from 2 adjacent counties of Virginia and …,"MK Tinker, NA White, P Lessard… - Equine veterinary …, 1997 - Wiley Online Library",ART
1,Dietary and other management factors associated with equine colic,"… Equine colic is an important cause of disease and death in horses. Relatively few … , and to \nexamine other management factors associated with equine colic. Because horses examined at …","Dietary and other management factors associated with equine colic . … Equine colic is an important cause of disease and death in horses. Relatively few … , and to \nexamine other management factors associated with equine colic. Because horses examined at …","N Cohen, P Gibbs, A Woods - J. Am. Vet. Med. Assoc, 1999 - researchgate.net",PDF


In [16]:
# Remove the newline characters from the AugmTitle column
horse_df.AugmTitle = horse_df.AugmTitle.str.replace('\n','', regex=False)
horse_df.head(2)

Unnamed: 0,Title,TruncAbstr,AugmTitle,Info,Label
0,Prospective study of equine colic risk factors,A1 year prospective study was conducted on 31 horse farms to identify risk factors for equine \ncolic. Farms were randomly selected from a list from 2 adjacent counties of Virginia and …,Prospective study of equine colic risk factors . A1 year prospective study was conducted on 31 horse farms to identify risk factors for equine colic. Farms were randomly selected from a list from 2 adjacent counties of Virginia and …,"MK Tinker, NA White, P Lessard… - Equine veterinary …, 1997 - Wiley Online Library",ART
1,Dietary and other management factors associated with equine colic,"… Equine colic is an important cause of disease and death in horses. Relatively few … , and to \nexamine other management factors associated with equine colic. Because horses examined at …","Dietary and other management factors associated with equine colic . … Equine colic is an important cause of disease and death in horses. Relatively few … , and to examine other management factors associated with equine colic. Because horses examined at …","N Cohen, P Gibbs, A Woods - J. Am. Vet. Med. Assoc, 1999 - researchgate.net",PDF
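
Deleting `'\n'` outright works here because each newline in the abstracts shown is preceded by a space. If that cannot be guaranteed, one option is to collapse every run of whitespace into a single space instead. The sketch below uses made-up strings and is not part of the original pipeline.

```python
import pandas as pd

# Illustrative strings: a newline after a space, and doubled spaces around the separator
text = pd.Series(['equine \ncolic. Farms', 'Title  .  abstract\n'])

# \s+ matches any run of spaces, tabs, and newlines; strip() trims the ends
clean = text.str.replace(r'\s+', ' ', regex=True).str.strip()

print(clean.tolist())
# ['equine colic. Farms', 'Title . abstract']
```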


## Prepare Titles Data

In [17]:
# Prepare the text column
df_titles = horse_df[['Title']]

# Rename the title column
df_titles = df_titles.rename(columns = {'Title': 'text'})

# Check the outcome
df_titles.head(2)

Unnamed: 0,text
0,Prospective study of equine colic risk factors
1,Dietary and other management factors associated with equine colic


In [18]:
# Count the titles that mention colic ('olic' matches both 'colic' and 'Colic')
contain_values = df_titles[df_titles['text'].str.contains('olic')]
len(contain_values)


682

In [19]:
# Save the text column to a file
df_titles.to_csv('data/horse_titles.csv')  
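
By default `to_csv` also writes the row index as an unnamed first column, which `read_csv` later loads as `Unnamed: 0`. If downstream code does not need that index, passing `index=False` avoids the extra column. The sketch below writes to in-memory buffers rather than to `data/horse_titles.csv`, and the two-row frame is illustrative.

```python
import io
import pandas as pd

df_titles = pd.DataFrame({'text': ['Prognosticating equine colic', 'Internal Parasites']})

# Default behaviour: the index is written as a first column with an empty header
with_index = io.StringIO()
df_titles.to_csv(with_index)

# index=False writes only the data columns
without_index = io.StringIO()
df_titles.to_csv(without_index, index=False)

with_index.seek(0)
without_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'text']
print(cols_without)  # ['text']
```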

## Augmented Title Data

In [20]:
# Prepare the text column 
df = horse_df[['AugmTitle']]

In [21]:
# Rename the text column 
df = df.rename(columns={'AugmTitle': 'text'})
df.head(4)

Unnamed: 0,text
0,Prospective study of equine colic risk factors . A1 year prospective study was conducted on 31 horse farms to identify risk factors for equine colic. Farms were randomly selected from a list from 2 adjacent counties of Virginia and …
1,"Dietary and other management factors associated with equine colic . … Equine colic is an important cause of disease and death in horses. Relatively few … , and to examine other management factors associated with equine colic. Because horses examined at …"
2,"Prospective study of equine colic incidence and mortality . A prospective study of one year was conducted on 31 horse farms to obtain population based estimates of incidence, morbidity and mortality rates of equine colic. Farms with greater …"
3,Case-control study of the association between various management factors and development of colic in horses. Texas Equine Colic Study Group. . The association between various management factors and development of colic was studied in 821 horses treated for colic and 821 control horses treated for noncolic emergencies by …


In [22]:
# Count the augmented titles that mention colic ('olic' matches 'colic' and 'Colic')
contains_list = df[df['text'].str.contains('olic')]
len(contains_list)

919

In [23]:
# Inspect the items that mention colic in neither mixed case ('olic') nor upper case ('OLIC')
list_nocolic = df[~df['text'].str.contains('olic')]
list_nocolic = list_nocolic[~list_nocolic['text'].str.contains('OLIC') ]
list_nocolic

Unnamed: 0,text
112,"The equine acute abdomen . Written and edited by leading experts on equine digestive diseases, The Equine Acute Abdomen, Third Editionis the preeminent text on diagnosing and treating acute abdominal …"
155,"Cisapride in the prophylaxis of equine post operative ileus . Cisapride and domperidone were both effective in restoring electrical and mechanical activity, coordination between gastric and small intestinal activity cycles and the stomach to anus …"
167,Effect of butorphanol on equine antroduodenal motility . Six healthy six to eight‐month‐old horses were surgically prepared with Ag bipolar electrodes sutured to the gastric antrum and duodenum. Leads from the electrodes were exteriorised …
187,Plasma endotoxin concentrations in experimental and clinical equine subjects . Endotoxin (LPS) was quantitated in experimental subjects and in horses with naturally occurring gastrointestinal strangulation obstruction and/or septicaemic diseases to establish the …
223,"Factors influencing equine gut microbiota: Current knowledge . Gastrointestinal microbiota play a crucial role in nutrient digestion, maintaining animal health and welfare. Various factors may affect microbial balance often leading to disturbances that …"
247,"Survey of equine nutrition: perceptions and practices of veterinarians in Georgia, USA . Equine nutrition plays a critical role in equine health. The veterinarian is an expected major source of equine nutrition information, yet little evidence exists to evaluate this assumed role, …"
264,"Monitoring acute equine visceral pain with the equine Utrecht University scale for composite pain assessment (EQUUS-COMPASS) and the equine Utrecht University … . This study presents the validation of two recently described pain scales, the Equine Utrecht University Scale for Composite Pain Assessment (EQUUS-COMPASS) and the Equine …"
266,"Adult Equine Diarrhea Workup . Basic elements involved in the clinical evaluation of cases of acute and chronic diarrhea are covered. Major etiological considerations are listed, and important aspects of …"
287,Application of an equine composite pain scale and its association with plasma adrenocorticotropic hormone concentrations and serum cortisol concentrations in … . This study assessed the application of a modified equine composite pain scale (CPS) and identified the inter‐observer reliability. Associations between CPS scores and the measured …
294,Internal Parasites .


In [24]:
len(list_nocolic)

107
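
The two-step filter above (first `'olic'`, then `'OLIC'`) can be written as a single case-insensitive test. This is a sketch on toy data; note that searching for the full word `'colic'` is slightly stricter than `'olic'`, which would also match words such as "metabolic", so counts on the real data could differ from those reported above.

```python
import pandas as pd

# Illustrative rows standing in for the augmented titles
toy = pd.DataFrame({'text': [
    'Prognosticating equine colic',
    'EQUINE COLIC: a review',
    'Texas Equine Colic Study Group',
    'Internal Parasites',
]})

# case=False ignores letter case; regex=False treats 'colic' as a plain substring
mask = toy['text'].str.contains('colic', case=False, regex=False)

n_colic = int(mask.sum())
no_colic = toy.loc[~mask, 'text'].tolist()

print(n_colic)   # 3
print(no_colic)  # ['Internal Parasites']
```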

In [25]:
# Save the text column to a file
df.to_csv('data/horse_augm_titles.csv')  

