### Data Set Overview:

In an attempt to make the debate around the costs of open access publishing more evidence-based, the Wellcome Trust is releasing into the public domain details of its open access spend in the year 2012-2013, as reported by UK institutions and the Trust’s Major Overseas Programmes in receipt of an OA block grant.

### Assignment:

* Determine the five most common journals and the total articles for each. 

* Calculate the mean, median, and standard deviation of the open-access cost per article for each journal.

In [1]:
import pandas as pd
import numpy as np

In [4]:
wellcome = pd.read_csv('WELLCOME_APCspend2013_forThinkful.csv', header=0,encoding = 'unicode_escape')

In [5]:
wellcome.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged)
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,£0.00
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,£685.88


In [6]:
wellcome.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2127 entries, 0 to 2126
Data columns (total 5 columns):
PMID/PMCID                                             1928 non-null object
Publisher                                              2127 non-null object
Journal title                                          2126 non-null object
Article title                                          2127 non-null object
COST (£) charged to Wellcome (inc VAT when charged)    2127 non-null object
dtypes: object(5)
memory usage: 83.2+ KB


_________________________________________________________________________________________________________________

##### Part 1

Determine the five most common journals and the total articles for each.

In [7]:
#find the five most common journals...
wellcome['Journal title'].value_counts()

PLoS One                                           92
PLoS ONE                                           62
Journal of Biological Chemistry                    48
Nucleic Acids Research                             21
Proceedings of the National Academy of Sciences    19
                                                   ..
Development                                         1
Journal of Visulaized expermiments                  1
PLOS Computational Biology                          1
Journal of Hospital Infections                      1
Journal of Cultural Economy                         1
Name: Journal title, Length: 984, dtype: int64

Observing the above, it is apparent that some journal titles are capitalized inconsistently throughout the dataset (for example, 'PLoS One' and 'PLoS ONE'). Let's solve this by making every title lowercase.

In [8]:
wellcome['Journal title'] = wellcome['Journal title'].str.lower()
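
Lowercasing handles differences in capitalization, but titles can also differ by stray leading, trailing, or repeated spaces. The block below is a minimal sketch of a slightly broader normalization, run on made-up titles rather than the real file; the names `titles`, `normalized`, and `counts` are illustrative assumptions. It does not attempt to merge abbreviations such as 'j med chem' with their full titles.

```python
import pandas as pd

# Made-up sample standing in for the real 'Journal title' column.
titles = pd.Series(["PLoS One", "PLoS ONE ", "plos  one", "Neuroimage", None])

# Lowercase, trim both ends, and collapse internal runs of whitespace.
normalized = (
    titles.str.lower()
          .str.strip()
          .str.replace(r"\s+", " ", regex=True)
)

# value_counts() ignores missing titles by default.
counts = normalized.value_counts()
print(counts)
```

Whether the extra steps change the top five for the real data would need to be checked against the file itself.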

In [9]:
wellcome.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged)
0,,CUP,psychological medicine,Reduced parahippocampal cortical thickness in ...,£0.00
1,PMC3679557,ACS,biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04
2,23043264 PMC3506128,ACS,j med chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56
3,23438330 PMC3646402,ACS,j med chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64
4,23438216 PMC3601604,ACS,j org chem,Regioselective opening of myo-inositol orthoes...,£685.88


In [13]:
#find the five most common journals...
wellcome['Journal title'].value_counts().nlargest(5)

plos one                           190
journal of biological chemistry     53
neuroimage                          29
plos pathogens                      24
plos genetics                       24
Name: Journal title, dtype: int64

The five most common journals and their article counts are as follows:

* plos one: 190
* journal of biological chemistry: 53
* neuroimage: 29
* plos pathogens: 24
* plos genetics: 24

_________________________________________________________________________________________________________________

##### Part 2

Calculate the mean, median, and standard deviation of the open-access cost per article for each journal.

Let's first check whether there are null/missing values in the 'Journal title' and 'COST...' columns. We'll first need to rename the 'COST...' column to something friendlier.

In [25]:
wellcome.rename(columns = {'COST (£) charged to Wellcome (inc VAT when charged)':'COST'}, inplace=True)

In [26]:
wellcome.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST
0,,CUP,psychological medicine,Reduced parahippocampal cortical thickness in ...,£0.00
1,PMC3679557,ACS,biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04
2,23043264 PMC3506128,ACS,j med chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56
3,23438330 PMC3646402,ACS,j med chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64
4,23438216 PMC3601604,ACS,j org chem,Regioselective opening of myo-inositol orthoes...,£685.88


In [32]:
not_null_cost_series = pd.notnull(wellcome["COST"])  

In [33]:
not_null_cost_series.value_counts()

True    2127
Name: COST, dtype: int64

In [35]:
not_null_JournalTitle_series = pd.notnull(wellcome['Journal title'])

In [36]:
not_null_JournalTitle_series.value_counts()

True     2126
False       1
Name: Journal title, dtype: int64

There are no missing values in the 'COST' column and one missing value in the 'Journal title' column. Let's just remove the row with the missing 'Journal title'. We will then be able to move on to summary statistics of article cost per journal title.

In [37]:
#drop rows with a missing 'Journal title' (dropna returns a new DataFrame, so assign it back)...
wellcome = wellcome.dropna(subset=['Journal title'])
wellcome

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST
0,,CUP,psychological medicine,Reduced parahippocampal cortical thickness in ...,£0.00
1,PMC3679557,ACS,biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04
2,23043264 PMC3506128,ACS,j med chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56
3,23438330 PMC3646402,ACS,j med chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64
4,23438216 PMC3601604,ACS,j org chem,Regioselective opening of myo-inositol orthoes...,£685.88
...,...,...,...,...,...
2122,2901593,Wolters Kluwer Health,circulation research,Mechanistic Links Between Na+ Channel (SCN5A) ...,£1334.15
2123,3748854,Wolters Kluwer Health,aids,Evaluation of an empiric risk screening score ...,£1834.77
2124,3785148,Wolters Kluwer Health,pediatr infect dis j,Topical umbilical cord care for prevention of ...,£1834.77
2125,PMCID:\n PMC3647051\n,Wolters Kluwer N.V./Lippinott,aids,Grassroots Community Organisations' Contributi...,£2374.52
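
A detail worth keeping in mind at this step: by default `dropna` returns a new DataFrame and leaves the original untouched unless the result is assigned back. The sketch below shows this on a toy frame (the values are made up, and `df`, `cleaned`, and `means` are illustrative names). It also shows that `groupby` skips missing keys by default, so the per-journal statistics are not affected by a row with no title.

```python
import pandas as pd

# Toy data: one row has no journal title.
df = pd.DataFrame({
    "Journal title": ["aids", None, "aids"],
    "COST": [10.0, 20.0, 30.0],
})

# Calling dropna without assigning the result leaves df unchanged.
df.dropna(subset=["Journal title"])
rows_before = len(df)

# Assigning the result keeps the filtered frame.
cleaned = df.dropna(subset=["Journal title"])
rows_after = len(cleaned)

# groupby excludes rows whose key is missing (dropna=True is the default).
means = df.groupby("Journal title")["COST"].mean()

print(rows_before, rows_after)
print(means)
```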


Calculate the mean, median, and standard deviation of the open-access cost per article for each journal.

We first need to remove the currency symbols from the COST column, and then convert the values to numeric.

In [45]:
#remove the £ sign...
wellcome.COST = wellcome.COST.str.replace('£', '')

In [58]:
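#remove the $ sign (regex=False so that '$' is treated as a literal character, not the end-of-string anchor)...
wellcome.COST = wellcome.COST.str.replace('$', '', regex=False)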

In [59]:
wellcome.head()

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST
0,,CUP,psychological medicine,Reduced parahippocampal cortical thickness in ...,0.0
1,PMC3679557,ACS,biomacromolecules,Structural characterization of a Model Gram-ne...,2381.04
2,23043264 PMC3506128,ACS,j med chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",642.56
3,23438330 PMC3646402,ACS,j med chem,Orvinols with mixed kappa/mu opioid receptor a...,669.64
4,23438216 PMC3601604,ACS,j org chem,Regioselective opening of myo-inositol orthoes...,685.88


In [63]:
#change column 'COST' to numeric value...
wellcome['COST'] = wellcome['COST'].astype(float)
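
The cleaning steps above can also be sketched in one pass. This is an illustration on made-up strings, not the notebook's actual cell: `raw`, `currency`, and `cost` are assumed names, and the 'n/a' entry is invented to show how `pd.to_numeric(..., errors="coerce")` turns unparseable values into NaN instead of raising. Recording the currency symbol before stripping it is one way to avoid silently mixing pounds and dollars; whether a conversion is needed depends on the real data.

```python
import pandas as pd

# Made-up cost strings standing in for the real COST column.
raw = pd.Series(["£2381.04", "$642.56", "£0.00", "n/a"])

# Remember which currency symbol each value carried (NaN if none).
first_char = raw.str[0]
currency = first_char.where(first_char.isin(["£", "$"]))

# Strip the symbols literally; regex=False avoids treating '$' as an anchor.
stripped = (
    raw.str.replace("£", "", regex=False)
       .str.replace("$", "", regex=False)
)

# Convert to floats; values that cannot be parsed become NaN.
cost = pd.to_numeric(stripped, errors="coerce")

print(pd.DataFrame({"raw": raw, "currency": currency, "cost": cost}))
```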

Let's define the mean, median, and standard deviation functions first...

In [38]:
# A simple implementation that iterates over the array
def mean(numbers):
  total_sum = 0  # initialize the sum to zero
  for n in numbers:
    total_sum += n  # add up the numbers in the array
  count = len(numbers)  # find the length of the array
  avg = total_sum / count  # calculate the mean
  return avg  # return the result

##print('The mean something is {}'.format(mean(series)))

In [39]:
import math

def median(numbers):
  ordered = sorted(numbers)  # sort a copy so the caller's data is left untouched
  count = len(ordered)  # get the length of the array
  isEven = count % 2 == 0  # check if this list is of even length
  mid = count // 2  # index of the (upper) middle element

  if isEven:
    # find the two numbers in the middle of the array
    a = ordered[mid - 1]
    b = ordered[mid]
    # find the average of these two numbers
    ans = (a + b) / 2
  else:
    ans = ordered[mid]

  return ans

##print('The median of the something is {}'.format(median(series)))

In [40]:
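def standard_deviation(numbers):
  N = len(numbers)
  if N < 2:
    # the sample standard deviation is undefined for fewer than two values
    return float('nan')
  X_bar = mean(numbers)
  total_sum = 0
  for X in numbers:
    diff = X - X_bar
    squared = math.pow(diff, 2)
    total_sum += squared
  sigma = math.sqrt(total_sum / (N - 1))  # divide by N - 1 for the sample standard deviation
  return sigma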

##print('The standard deviation of the something is {}'.format(standard_deviation(series)))
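
One way to gain confidence in hand-written statistics like these is to compare them with Python's standard `statistics` module. The sketch below restates the three functions in compact form so that it runs on its own (these are condensed equivalents, not the cells above verbatim) and checks them against `statistics.mean`, `statistics.median`, and `statistics.stdev`, which also divides by N - 1. The sample values are the five costs shown by `wellcome.head()`.

```python
import math
import statistics

def mean(numbers):
    return sum(numbers) / len(numbers)

def median(numbers):
    ordered = sorted(numbers)  # sorted() works on any iterable and copies
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]

def standard_deviation(numbers):
    n = len(numbers)
    if n < 2:
        return float("nan")  # undefined for a single observation
    x_bar = mean(numbers)
    return math.sqrt(sum((x - x_bar) ** 2 for x in numbers) / (n - 1))

# The five costs displayed by wellcome.head().
sample = [0.00, 2381.04, 642.56, 669.64, 685.88]

print(math.isclose(mean(sample), statistics.mean(sample)))
print(math.isclose(median(sample), statistics.median(sample)))
print(math.isclose(standard_deviation(sample), statistics.stdev(sample)))
```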

In [64]:
wellcome.groupby('Journal title')['COST'].sum()

Journal title
academy of nutrition and dietetics                                   2379.54
acs chemical biology                                                 7090.93
acs chemical neuroscience                                            1186.80
acs nano                                                             1336.28
acta crystallographica section d,  biological crystallography         771.42
                                                                     ...    
virology journal                                                     1242.00
virus research                                                       1947.09
vision research                                                   1001455.18
visual neuroscience                                                  2034.00
zoonoses and public health                                           2272.15
Name: COST, Length: 928, dtype: float64

In [66]:
wellcome.groupby('Journal title')['COST'].mean()

Journal title
academy of nutrition and dietetics                                  2379.540
acs chemical biology                                                1418.186
acs chemical neuroscience                                           1186.800
acs nano                                                             668.140
acta crystallographica section d,  biological crystallography        771.420
                                                                     ...    
virology journal                                                    1242.000
virus research                                                      1947.090
vision research                                                   500727.590
visual neuroscience                                                 2034.000
zoonoses and public health                                          2272.150
Name: COST, Length: 928, dtype: float64

In [72]:
wellcome.groupby('Journal title')['COST'].median()

Journal title
academy of nutrition and dietetics                                  2379.54
acs chemical biology                                                1294.59
acs chemical neuroscience                                           1186.80
acs nano                                                             668.14
acta crystallographica section d,  biological crystallography        771.42
                                                                    ...    
virology journal                                                    1242.00
virus research                                                      1947.09
vision research                                                   500727.59
visual neuroscience                                                 2034.00
zoonoses and public health                                          2272.15
Name: COST, Length: 928, dtype: float64

In [73]:
wellcome.groupby('Journal title')['COST'].std()

Journal title
academy of nutrition and dietetics                                          NaN
acs chemical biology                                                 507.309560
acs chemical neuroscience                                                   NaN
acs nano                                                              35.708892
acta crystallographica section d,  biological crystallography               NaN
                                                                      ...      
virology journal                                                            NaN
virus research                                                              NaN
vision research                                                   706076.399327
visual neuroscience                                                         NaN
zoonoses and public health                                                  NaN
Name: COST, Length: 928, dtype: float64
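
Two things stand out in these results. Journals with a single article have a NaN standard deviation, because the sample standard deviation is undefined for one observation. The 'vision research' figures (a mean of about £500,000) suggest that at least one cost is a placeholder or data-entry error rather than a real charge. The sketch below, on toy values chosen to resemble the aggregates printed above, computes all statistics in one `agg` call together with the article count, and then repeats the calculation after excluding implausibly large costs. The threshold `PLAUSIBLE_MAX` is a hypothetical cutoff introduced only for illustration; the right rule depends on inspecting the real rows.

```python
import pandas as pd

# Toy rows, not the real data.
df = pd.DataFrame({
    "Journal title": ["vision research", "vision research",
                      "acs nano", "acs nano", "virus research"],
    "COST": [999999.00, 1456.18, 642.89, 693.39, 1947.09],
})

stats = ["count", "mean", "median", "std"]

# All summary statistics per journal in a single pass.
summary = df.groupby("Journal title")["COST"].agg(stats)

# Assumption: costs at or above this value are treated as placeholders.
PLAUSIBLE_MAX = 100_000
filtered = df[df["COST"] < PLAUSIBLE_MAX]
summary_filtered = filtered.groupby("Journal title")["COST"].agg(stats)

print(summary)
print(summary_filtered)
```

Keeping the `count` column next to the other statistics makes it easy to see which means and standard deviations rest on only one or two articles.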