### Data Pre-processing and Cleaning

In [36]:
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib notebook

In [37]:
import math

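def isOpen(line):
    '''
    Label a PMC ID value: NaN (no PMC ID) becomes 'closed' and a
    string ID becomes 'open'; any other number is returned unchanged
    '''
    try:
        if math.isnan(line):
            line = 'closed'
    except TypeError:
        # math.isnan rejects strings, so a string ID ends up here
        line = 'open'
    return line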

# Create a new column that checks whether the paper is open or not
#data['isopen'] = data.pmd.apply(lambda l:isOpen(l))
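
A quick check of how `isOpen` labels values, as a minimal sketch. The three-value column is made up for illustration (the two IDs are copied from the output further down), and the vectorised variant is only an assumed alternative for columns that hold nothing but strings and NaN, not what the notebook runs.

```python
import math
import pandas as pd

def isOpen(line):
    # Same logic as the cell above: NaN means no PMC ID was found
    try:
        if math.isnan(line):
            line = 'closed'
    except TypeError:
        line = 'open'
    return line

# Made-up column: two string IDs and one missing value
pmcid = pd.Series(['PMC6173229', float('nan'), 'NIHMS980335'])

labels = pmcid.apply(isOpen).tolist()

# Vectorised equivalent for a column holding only strings and NaN
labels_vectorised = pmcid.notna().map({True: 'open', False: 'closed'}).tolist()

print(labels)
```

Note that any string counts as open here, so an `NIHMS` manuscript ID is labelled the same way as a `PMC` ID.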

#### Tried this bash approach first

This did not work: the `join` command could not handle the tab-separated files.

In [4]:
%%bash
LANG=en_EN sort -d ../Data/CleanedPubidJournalYear.txt \
>../Data/CleanedPubidJournalYearSorted.txt

LANG=en_EN sort -d ../Data/pmid_pmc_check.txt \
>../Data/pmid_pmc_check_sorted.txt

LANG=en_EN join ../Data/pmid_pmc_check_sorted.txt \
../Data/CleanedPubidJournalYearSorted.txt \
>../Data/CleanedPubidJournalYearPmic.txt

#### So I wrote Python code to parse the abstracts for useful information

In [14]:
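def parseAbstracts(infile,outfile):
    '''
    Parse a PubMed abstract export and write one tab-separated
    line (PMID, journal, date) per record to outfile
    '''
    with open(outfile,'w') as clean:
        with open(infile) as abstract:
            # tag is True once a citation line has been read and we
            # are waiting for the matching PMID line
            tag = False
            for line in abstract:
                # A record starts with a numbered line, e.g. '12. Journal. Date;...'
                if line[0].isdigit() and (
                    line[1:3] == '. ' or line[2:4] == '. ' or line[3:5] == '. '):
                    if tag:
                        continue
                    else:
                        try:
                            # Treat ';' and ':' as '.', so the fields split
                            # as [number, journal, date, ...]
                            fields = line.replace(
                                ';','.').replace(':','.').split('.')
                            date = fields[2]
                            journal = fields[1]
                            tag = True
                        except IndexError:
                            # Too few fields: print the line for inspection
                            print(line)
                            tag = False
                # The PMID line closes the record
                if tag and line.startswith('PMID:'):
                    pubid = line.split()[1]
                    tag=False
                    clean.write('%s\t%s\t%s\n' % (pubid, journal, date.strip()))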
            

Keep in mind that 4 papers had been retracted, so their details were not parsed correctly and they were not included in the analysis.
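
To see why a retracted record falls out, here is a minimal sketch of the field split used in `parseAbstracts`. The helper name `split_citation` and the sample citation line are made up for illustration; the retracted line is the form printed by the parsing runs further down. Stripping the journal name is also an addition here, since the function above strips only the date.

```python
def split_citation(line):
    """Return (journal, date) from a numbered citation line, or None."""
    # Same trick as parseAbstracts: treat ';' and ':' as '.' before splitting
    fields = line.replace(';', '.').replace(':', '.').split('.')
    try:
        return fields[1].strip(), fields[2].strip()
    except IndexError:
        # Fewer than three fields: no journal/date pair can be recovered
        return None

# Hypothetical citation line in the expected 'N. Journal. Date;...' shape
citation = '1. PLoS One. 2018 Oct 11;13(10):e0000000. doi: 10.0000/example.\n'
# A retracted record has no journal or date after the number
retracted = '9. RETRACTED ARTICLE\n'

parsed = split_citation(citation)
skipped = split_citation(retracted)

print(parsed)
print(skipped)
```

The retracted line splits into only two fields, which is what triggers the `IndexError` branch and leaves the record out of the cleaned file.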

Used the commands below to recheck the articles dated October 2018 (those matching the `2018 Oct` pattern). Most had not yet been assigned a PMC ID even though they were open.

In [28]:
!grep '2018 Oct' ../Data/CleanedKenyanPaps.txt |cut -f1

30312417
30312309
30311515
30311422
30311117
30309787
30309738
30309402
30309343
30308064
30307994
30307567
30305924
30305236
30305159
30305127
30305123
30305069
30305067
30304206
30304043
30303978
30302585
30300838
30300797
30300389
30298229
30297412
30297124
30296964
30295231
30287676
30286763
30285848
30285768
30285745
30285684
30283498
30282787
30282493
30281594
30277311
30275421
30273342
30270054
30269584
30248195
30224199
30223982
30220395
30219666
30173559
30173395
30143829
30120168
30109391
30105967
30105965
30105964
30084344
30077450
30041058
30031730
30017827
30006028
30005020
29985263
29961102
29885751
29877169
29792760
29752729
29751313
29667077
29664764
29661259
29603110
29594962
29464876
29453584
29397542


In [29]:
%%bash
for line in $(grep '2018 Oct' ../Data/CleanedKenyanPaps.txt |cut -f1)
    do
        efetch -db pubmed -id $line \
        -format xml | xtract \
        -pattern ArticleIdList -element ArticleId |cut -f1,4
    done

30312417
30312309
30311515
30311422
30311117
30309787
30309738
30309402
30309343
30308064
30307994
30307567
30305924	PMC6173229
30305236
30305159
30305127
30305123
30305069
30305067
30304206
30304043
30303978
30302585
30300838
30300797
30300389
30298229
30297412
30297124
30296964
30295231
30287676
30286763
30285848	PMC6167850
30285768	PMC6171301
30285745	PMC6167779
30285684	PMC6167894
30283498	PMC6131765
30282787
30282493
30281594
30277311
30275421
30273342
30270054
30269584
30248195
30224199
30223982
30220395	PMC6152584
30219666
30173559
30173395
30143829
30120168
30109391
30105967
30105965
30105964
30084344
30077450
30041058	PMC6139638
30031730
30017827
30006028
30005020	NIHMS980335
29985263
29961102	PMC6154034
29885751
29877169
29792760
29752729
29751313
29667077
29664764
29661259
29603110	PMC6146064
29594962	PMC6156752
29464876
29453584
29397542	PMC6131128


#### Merge the data 

In [6]:
def mergeData(pmcPMID,outfile):
    '''
    Takes a PMC_PMID check file and merges it with the parsed
    journal/date file (outfile) on the pmid column
    '''
    pmc_pmid = pd.read_table(
        pmcPMID,header=None, names=['pmid', 'pmcid'])

    pmc_pmid['isopen'] = pmc_pmid['pmcid'].apply(
        lambda l:isOpen(l))
    
    journal_year = pd.read_table(
        outfile, header=None, names=['pmid','journal','date'])
    
    data = pd.merge(pmc_pmid, journal_year, on="pmid")
    
    return data
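
`pd.merge` defaults to an inner join, so a PMID present in only one of the two files is dropped from `data` without any message. A minimal sketch of how to list those PMIDs, using made-up PMIDs, journals and dates; the outer join with `indicator=True` is a suggested check, not part of the original pipeline.

```python
import pandas as pd

# Made-up tables for illustration only
pmc_pmid = pd.DataFrame({'pmid': [101, 102, 103],
                         'pmcid': ['PMC0000001', None, 'PMC0000003']})
journal_year = pd.DataFrame({'pmid': [101, 102, 104],
                             'journal': ['Journal A', 'Journal B', 'Journal C'],
                             'date': ['2018 Oct 11', '2018 Oct', '2018 Sep 3']})

# Default inner join, as in mergeData: only PMIDs in both tables survive
inner = pd.merge(pmc_pmid, journal_year, on='pmid')

# Outer join with indicator=True adds a _merge column naming the source table
outer = pd.merge(pmc_pmid, journal_year, on='pmid', how='outer', indicator=True)
dropped = sorted(outer.loc[outer['_merge'] != 'both', 'pmid'].tolist())

print(sorted(inner['pmid'].tolist()))
print(dropped)
```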

In [7]:
def convertDate(data,outcsv):
    '''
    Given a dataframe, convert the date column to datetime, add
    separate year and month columns, and save the result to outcsv
    '''
    data.set_index('pmid', inplace=True)
    # Convert the date column to datetime; unparseable values become NaT
    data['date'] = pd.to_datetime(data['date'], errors='coerce')
    data['year'] = data.date.dt.year
    data['month'] = data.date.dt.month
    ### Save the data in a csv for future re-use
    data.to_csv(outcsv)
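
With `errors='coerce'`, a date string that cannot be parsed becomes `NaT` instead of raising, so it is worth checking which rows were lost. A minimal sketch on a made-up two-row frame (the PMIDs and date strings are invented for illustration):

```python
import pandas as pd

# Made-up rows: one parseable date and one that is not a date at all
toy = pd.DataFrame({
    'pmid': [101, 102],
    'date': ['2018 Oct 11', 'RETRACTED ARTICLE'],
}).set_index('pmid')

# Same steps as convertDate, without writing a csv
toy['date'] = pd.to_datetime(toy['date'], errors='coerce')
toy['year'] = toy.date.dt.year
toy['month'] = toy.date.dt.month

# PMIDs whose date could not be parsed
unparsed = toy[toy['date'].isna()].index.tolist()

print(toy)
print(unparsed)
```

Rows with `NaT` also get missing `year` and `month` values, so they drop out of any later grouping by year.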

#### Initial data, with Kenya as KeyWord

In [23]:
infile = '../Data/abstracts.txt'
outfile = '../Data/CleanedPubidJournalYear.txt'
pmcPMID = '../Data/pmid_pmc_check.txt'
outcsv = '../Data/PMID_PMC_Journal_Year.csv'

In [24]:
parseAbstracts(infile,outfile)
data = mergeData(pmcPMID,outfile)
convertDate(data,outcsv)

11. Providing Sustainable Mental and Neurological Health Care in Ghana and

9. RETRACTED ARTICLE

40. RETRACTED ARTICLE

68. RETRACTED ARTICLE

84. RETRACTED ARTICLE



#### Updated data, with Kenya as KeyWord in affiliation

In [19]:
infile = '../Data/kenpaps.txt'
outfile = '../Data/CleanedKenyanPaps.txt'
pmcPMID = '../Data/pmid_pmc_check.txt'
outcsv = '../Data/PMID_PMC_Journal_Year_Kenya.csv'

In [22]:
parseAbstracts(infile,outfile)
data = mergeData(pmcPMID,outfile)
convertDate(data,outcsv)

99. RETRACTED ARTICLE

72. RETRACTED ARTICLE

35. RETRACTED ARTICLE



### Processing Preprint data

The data we are working with were downloaded from [prepub](https://raw.githubusercontent.com/OmnesRes/prepub/master/biorxiv/biorxiv_licenses.tsv).

1. First, we get the pre-prints with Kenyan authors

In [27]:
%%bash
    echo `cut -f 7 ../Data/biorxiv_licenses.tsv|grep -c -f \
    ../Data/countries` \
    >>../Data/countries_papers2.txt


2\. Downloaded a list of all the countries from [this gist](https://gist.github.com/kalinchernev/486393efcca01623b18d)

3\. Counted the number of preprints affiliated with each of the countries

In [52]:
%%bash
while read country; 
do
  echo -e "$country\t"`cut -f 7 ../Data/biorxiv_licenses.tsv|grep -c "\b$country\b"`
done <../Data/countries >../Data/preprint_country_counts.txt
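
The `\b` word boundaries in the `grep` above keep a country name from matching inside a longer word. A minimal Python sketch of the same per-country line count, assuming made-up text lines in place of column 7 of `biorxiv_licenses.tsv`; the helper name `count_matching_lines` is hypothetical, and `re.escape` treats the country name literally, whereas `grep` reads it as a pattern.

```python
import re

def count_matching_lines(lines, country):
    # Count lines that contain the country as a whole word (one per line,
    # however many times it appears), mirroring grep -c with \b boundaries
    pattern = re.compile(r'\b' + re.escape(country) + r'\b')
    return sum(1 for line in lines if pattern.search(line))

# Made-up lines for illustration only
affiliations = [
    'University of Nairobi, Kenya',
    'Kenyatta University',
    'Kenya; Kenya',
    'Kampala, Uganda',
]

counts = {c: count_matching_lines(affiliations, c)
          for c in ['Kenya', 'Uganda', 'Ghana']}

print(counts)
```

Here `Kenyatta University` is not counted for Kenya, and a line that mentions a country twice is counted once.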

##### See Data_Analysis_and_Visualization notebook for further analysis