In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Let's start by loading our data.

In [22]:
data = pd.read_csv('wellcome.csv',encoding='Latin-1')

We'll use `head` to take a quick look at the first five rows.

In [23]:
data.head(5)

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged)
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,£0.00
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,£685.88


The currency symbols keep pandas from treating the cost column as numeric. Let's rename the column to `cost`, strip the symbols, and convert the values to floats.

In [29]:
# Work on a copy so the raw data stays untouched
dataclean = data.copy()
dataclean.rename(columns={'COST (£) charged to Wellcome (inc VAT when charged)':'cost'}, inplace=True)
# Strip the pound and dollar signs; non-string values (e.g. NaN) pass through unchanged
dataclean['cost'] = dataclean['cost'].apply(lambda x: x.replace('£', '') if '£' in str(x) else x)
dataclean['cost'] = dataclean['cost'].apply(lambda x: x.replace('$', '') if '$' in str(x) else x)

# Convert the cleaned strings to floats
dataclean['cost'] = dataclean['cost'].astype(float)

dataclean.head(5)

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,cost
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,0.0
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,685.88
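As a side note, the same cleanup can be written with pandas' vectorized string methods. This is only a sketch on a made-up toy frame (the `toy` name and its values are illustrative, not taken from `wellcome.csv`), and it assumes that any entry that still fails to parse should become `NaN` rather than raise an error.

```python
import pandas as pd

# Hypothetical toy frame standing in for the renamed 'cost' column.
toy = pd.DataFrame({'cost': ['£0.00', '£2381.04', '$642.56', None]})

# Remove the currency symbols as literal characters.
# regex=False matters for '$', which is a special character in regular expressions.
stripped = (
    toy['cost']
    .str.replace('£', '', regex=False)
    .str.replace('$', '', regex=False)
)

# Convert to float; anything that cannot be parsed becomes NaN (assumption).
toy['cost'] = pd.to_numeric(stripped, errors='coerce')
print(toy)
```

Whether silently coercing bad entries is acceptable depends on the analysis; the `apply`-based version above would raise on them instead.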


Do the values in the cost column make sense?

In [36]:
dataclean.describe() 

Unnamed: 0,cost
count,2127.0
mean,24067.339972
std,146860.665559
min,0.0
25%,1280.0
50%,1884.01
75%,2321.305
max,999999.0
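One way to dig into a suspicious maximum is to list the largest values and count how often the maximum repeats. The sketch below uses a made-up series (`costs` and its values are hypothetical); a value such as 999999 appearing several times would suggest a placeholder rather than a real charge, but that is an assumption to check against the actual data.

```python
import pandas as pd

# Hypothetical cost values, including a repeated suspicious maximum.
costs = pd.Series([0.0, 1280.0, 1884.01, 2321.30, 999999.0, 999999.0], name='cost')

# The three largest values, largest first.
top = costs.nlargest(3)

# How many rows share the maximum value.
repeats = int((costs == costs.max()).sum())

print(top)
print(repeats)
```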


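Apparently we have articles that cost almost a million pounds! The mean and standard deviation are far above the 75th percentile, so the largest values are clearly distorting the summary. Let's clean this column so it makes sense: we drop rows with a missing cost and remove everything above the 75th percentile (2321.305).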

In [38]:
# Drop rows with no cost recorded
dataclean.dropna(subset=['cost'], inplace=True)
# Drop every row above the 75th percentile reported by describe()
dataclean.drop(dataclean[(dataclean.cost > 2321.305)].index, inplace=True)
dataclean.describe()

Unnamed: 0,cost
count,1595.0
mean,1528.023154
std,530.280293
min,0.0
25%,1060.65
50%,1604.82
75%,1999.97
max,2321.23
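The cutoff above is typed in by hand from the `describe()` output. A small sketch of computing it from the data instead is shown below, on a hypothetical toy frame (`toy`, `cutoff`, and `kept` are illustrative names). It assumes the same rule as the cell above: keep rows at or below the 75th percentile.

```python
import pandas as pd

# Hypothetical toy data with one missing value and one extreme value.
toy = pd.DataFrame({'cost': [100.0, 200.0, 300.0, 400.0, 999999.0, None]})

# Drop rows with no cost recorded.
toy = toy.dropna(subset=['cost'])

# Compute the 75th percentile rather than hard-coding it.
cutoff = toy['cost'].quantile(0.75)

# Keep rows at or below the cutoff (same rule as dropping cost > cutoff).
kept = toy[toy['cost'] <= cutoff]
print(cutoff)
print(kept)
```

Boolean indexing returns a new frame, which avoids chaining `drop(..., inplace=True)` on a computed index.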


Let's normalize the text in the `Publisher`, `Journal title`, and `Article title` columns so that only the first letter is capitalized and the rest is lowercase.

In [32]:
dataclean['Publisher'] = dataclean['Publisher'].apply(lambda x: str(x).capitalize())
dataclean['Journal title'] = dataclean['Journal title'].apply(lambda x: str(x).capitalize())
dataclean['Article title'] = dataclean['Article title'].apply(lambda x: str(x).capitalize())
dataclean.head(5)

Unnamed: 0,PMID/PMCID,Publisher,Journal title,Article title,cost
0,,Cup,Psychological medicine,Reduced parahippocampal cortical thickness in ...,0.0
1,PMC3679557,Acs,Biomacromolecules,Structural characterization of a model gram-ne...,2381.04
2,23043264 PMC3506128,Acs,J med chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",642.56
3,23438330 PMC3646402,Acs,J med chem,Orvinols with mixed kappa/mu opioid receptor a...,669.64
4,23438216 PMC3601604,Acs,J org chem,Regioselective opening of myo-inositol orthoes...,685.88
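One detail worth knowing about the `str(x).capitalize()` approach is that it also converts missing values into the literal text `'Nan'`. The sketch below, on made-up values, compares it with the `.str.capitalize()` accessor, which leaves missing entries missing. Which behavior is preferable here is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical publisher names with one missing value.
toy = pd.DataFrame({'Publisher': ['CUP', 'ACS', np.nan]})

# Approach used above: the missing value becomes the string 'Nan'.
via_apply = toy['Publisher'].apply(lambda x: str(x).capitalize())

# Vectorized accessor: the missing value stays missing.
via_str = toy['Publisher'].str.capitalize()

print(via_apply.tolist())
print(via_str.tolist())
```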


Now we can answer the questions.

## Question 1: The five most common journals and the total articles for each

In [40]:
q1 = dataclean['Journal title'].value_counts().head(5)
q1

Plos one                           182
Journal of biological chemistry     50
Nucleic acids research              23
Plos genetics                       22
Plos pathogens                      22
Name: Journal title, dtype: int64
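The capitalization step matters for this count because `value_counts` treats differently cased spellings as different journals. A minimal sketch with made-up titles (the spellings below are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical journal titles with inconsistent casing.
titles = pd.Series(['PLoS One', 'PLOS ONE', 'PLoS ONE',
                    'J Med Chem', 'J Med Chem', 'Biomacromolecules'])

# Without normalization, each spelling is counted separately.
raw_counts = titles.value_counts()

# After capitalizing, the variants collapse into one entry.
clean_counts = titles.str.capitalize().value_counts().head(2)

print(raw_counts)
print(clean_counts)
```

Note that this only merges casing differences; abbreviations or stray whitespace would still be counted as separate titles.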

## Question 2: The mean, median, and standard deviation of the open-access cost per article

In [42]:
q2 = dataclean.describe()
q2

Unnamed: 0,cost
count,1595.0
mean,1528.023154
std,530.280293
min,0.0
25%,1060.65
50%,1604.82
75%,1999.97
max,2321.23


The mean, median, and standard deviation of the cleaned cost column are 1528.02, 1604.82, and 530.28, respectively. The median is the 50% row of the `describe()` output.
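If only these three statistics are needed, they can also be requested directly instead of being read off the `describe()` table. A short sketch on hypothetical values (`toy` and `summary` are illustrative names):

```python
import pandas as pd

# Hypothetical cleaned cost values.
toy = pd.DataFrame({'cost': [1000.0, 1500.0, 2000.0, 2300.0]})

# Request just the three statistics; 'std' is the sample standard deviation (ddof=1),
# the same definition describe() uses.
summary = toy['cost'].agg(['mean', 'median', 'std'])
print(summary)
```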