### Deeper into data visualization and exploration
Data cleaning is definitely a "practice makes perfect" skill. In this challenge, you'll use this dataset of open-access article prices paid by the Wellcome Trust between 2012 and 2013.

To complete this challenge, determine the five most common journals and the total articles for each. Next, calculate the mean, median, and standard deviation of the open-access cost per article for each journal.

You will need to do considerable data cleaning in order to extract accurate estimates. You may want to look into data encoding methods if you get stuck. For a real bonus round, identify the open-access prices paid by subject area.

Remember not to modify the data directly. Instead, write a cleaning script that will load the raw data and whip it into shape. Jupyter notebooks are a great format for this. Keep a record of your decisions: well-commented code is a must for recording your data cleaning decision-making progress. Submit a link to your script and results below and discuss it with your mentor at your next session.
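
The summary the challenge asks for (the five most common journals, with the mean, median, and standard deviation of cost for each) could take roughly the following shape once the data is clean. This is only a sketch: the `toy` frame, its `journal` and `cost` column names, and all of its values are invented for illustration and are not taken from the Wellcome file.

```python
import pandas as pd

# Toy stand-in for the cleaned data; every row here is invented.
toy = pd.DataFrame({
    "journal": ["PLoS One", "PLoS One", "PLoS One", "J Med Chem", "J Med Chem", "Cell"],
    "cost": [1000.0, 1200.0, 1400.0, 600.0, 700.0, 3000.0],
})

# Article counts per journal, most common first (at most five kept).
top_journals = toy["journal"].value_counts().head(5)

# Count, mean, median, and sample standard deviation of cost per journal.
summary = (
    toy[toy["journal"].isin(top_journals.index)]
    .groupby("journal")["cost"]
    .agg(["count", "mean", "median", "std"])
    .sort_values("count", ascending=False)
)
print(summary)
```

A journal with a single article gets `NaN` for `std`, since the sample standard deviation is undefined for one observation. The counts are also only meaningful after journal titles have been normalised, because spelling variants of one journal would otherwise be counted separately.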

In [1]:
#Major libraries to be used
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Load the raw data into a DataFrame, decoding the file with the 'unicode_escape' codec
df = pd.read_csv("data/WELLCOME_APCspend2013_forThinkful.csv", encoding='unicode_escape')
df.head()

,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged)
0,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,£0.00
1,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04
2,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56
3,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64
4,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,£685.88


The first part of data cleaning is to deal with NaN, empty-string, and NA values, probably by replacing them with 0.

In [3]:
df.index = df['PMID/PMCID'];

In [4]:
# Note: fillna returns a new DataFrame, so without assignment this call leaves df unchanged
df.fillna(0);
df['COST'] = df['COST (£) charged to Wellcome (inc VAT when charged)']
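
As a minimal sketch of how `fillna` behaves, the toy example below (with invented values) counts missing entries when the result is discarded and when it is kept. It assumes nothing about the real file beyond the presence of NaN values.

```python
import numpy as np
import pandas as pd

# Invented two-row frame with one missing value in each column.
toy = pd.DataFrame({"Publisher": ["ACS", np.nan], "COST": ["£10.00", np.nan]})

toy.fillna(0)  # the filled copy is thrown away; toy itself is untouched
missing_before = int(toy.isna().sum().sum())

filled = toy.fillna(0)  # keep the returned copy instead
missing_after = int(filled.isna().sum().sum())

print(missing_before, missing_after)
```

Whether 0 is the right fill value is a separate decision: a cost of 0 will pull down means and medians, so leaving costs as NaN, or filling only selected columns, may be preferable.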

In [5]:
# The costs can't be summed while they are strings, so strip the "£" sign and store the result back in COST
df['COST'] = df['COST'].str.strip('£')
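
Stripping the symbol still leaves strings, so a numeric conversion is needed before any arithmetic. One way to do both steps while keeping the pence is sketched below on invented values. The `$` entry is an assumption, prompted only by the later cell that splits on `'$'`; the real file may differ.

```python
import pandas as pd

# Invented cost strings, including one hypothetical dollar-marked entry.
raw = pd.Series(["£2381.04", "£642.56", "1294.59$", "£0.00"])

# Remove currency symbols anywhere in the string, then trim whitespace.
cleaned = raw.str.replace(r"[£$]", "", regex=True).str.strip()

# Convert to float; anything that still fails to parse becomes NaN.
cost = pd.to_numeric(cleaned, errors="coerce")
print(cost.tolist())
```

Using `errors="coerce"` makes unparseable entries show up as NaN, which can then be counted and inspected rather than silently dropped.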

In [6]:
# Group the major disciplines under a common keyword so the journals can be filtered by topic throughout
topics = ['Biology', 'Chem','Vir','AIDS','Neuro','Psych','Biochem','Vet','Act','Gene','Nutrition',
          'Health','Arthr','Acs','Immu','PLoS','Preventive']
df['topics_fixed'] = df['Journal title']
for topic in topics:
    df['topics_fixed'] = df['topics_fixed'].str.replace('.*{}.*'.format(topic), topic, case=False, regex=True)
df

PMID/PMCID,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged),COST,topics_fixed
,,CUP,Psychological Medicine,Reduced parahippocampal cortical thickness in ...,£0.00,0.00,Psych
PMC3679557,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04,2381.04,Biomacromolecules
23043264 PMC3506128,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56,642.56,Chem
23438330 PMC3646402,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64,669.64,Chem
23438216 PMC3601604,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,£685.88,685.88,Chem
PMC3579457,PMC3579457,ACS,Journal of Medicinal Chemistry,Comparative Structural and Functional Studies ...,£2392.20,2392.20,Chem
PMC3709265,PMC3709265,ACS,Journal of Proteome Research,Mapping Proteolytic Processing in the Secretom...,£2367.95,2367.95,Journal of Proteome Research
23057412 PMC3495574,23057412 PMC3495574,ACS,Mol Pharm,Quantitative silencing of EGFP reporter gene b...,£649.33,649.33,Mol Pharm
PMCID: PMC3780468,PMCID: PMC3780468,ACS (Amercian Chemical Society) Publications,ACS Chemical Biology,A Novel Allosteric Inhibitor of the Uridine Di...,£1294.59,1294.59,Biology
PMCID: PMC3621575,PMCID: PMC3621575,ACS (Amercian Chemical Society) Publications,ACS Chemical Biology,Chemical proteomic analysis reveals the drugab...,£1294.78,1294.78,Biology
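
The keyword replacement is sensitive to both the order of `topics` and to accidental substring matches. The sketch below reruns the same loop on a few titles with a three-keyword subset of the list. "Journal of Bacteriology" is a hypothetical title chosen to show a false match and is not claimed to be in the data.

```python
import pandas as pd

topics = ["Biology", "Chem", "Act"]  # subset of the full list, same order
titles = pd.Series([
    "ACS Chemical Biology",     # matches both "Biology" and "Chem"
    "J Med Chem",
    "Journal of Bacteriology",  # hypothetical title; contains "act"
    "Mol Pharm",                # matches no keyword
])

fixed = titles.copy()
for topic in topics:
    # Any title containing the keyword (case-insensitive) is replaced wholesale.
    fixed = fixed.str.replace(".*{}.*".format(topic), topic, case=False, regex=True)

print(fixed.tolist())
```

The first title ends up as `Biology` only because that keyword comes first in the list, and the third is labelled `Act` through the letters inside "Bacteriology". Short keywords such as `Act` or `Gene` may therefore need word boundaries or a manual review of what they capture.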


In [7]:
# Drop every row that has a missing value in any column, then view the rows ordered by topic
df.dropna(inplace=True)
df.sort_values('topics_fixed')

PMID/PMCID,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged),COST,topics_fixed
23340916,23340916,Springer-Verlag GMBH & Ci,AGE,Voluntary exercise can strengthen the circadia...,£2002.00,2002.00,AGE
3633780,3633780,Springer,AIDS Behav,Adherence to antiretroviral therapy and clinic...,£1834.77,1834.77,AIDS
PMCID:\n PMC3590645,PMCID:\n PMC3590645,Taylor & Francis,Aids Care,Asset ownership among households caring for or...,£2399.28,2399.28,AIDS
PMCID:\n PMC3687248\n,PMCID:\n PMC3687248\n,Taylor and Francis,AIDS Care,WORLDBANK Special Issue: Evidence for a contri...,£2232.74,2232.74,AIDS
Epub,Epub,Wolters Kluwer,Journal of Acquired Immune Deficiency Syndroms...,Reduction in early mortality on antiretroviral...,£1836.92,1836.92,AIDS
PMC3707567,PMC3707567,Wolters Kluwer,Journal of Aids,Risk factors for seropositivity to Kaposi sarc...,£2009.65,2009.65,AIDS
PMC3765690,PMC3765690,BioMed Central Limited,AIDS Research and Therapy,Collective Patient behaviours derailing ART ro...,£1240.00,1240.00,AIDS
PMC3815011,PMC3815011,Wolters Kluwer,AIDS UK,HIV incidence and survival from age-specific s...,£1836.92,1836.92,AIDS
PMC3819359,PMC3819359,Wolters Kluwer,AIDS Journal,Short title: TB and VL breakthrough and failur...,£2015.72,2015.72,AIDS
PMC3773237,PMC3773237,Wolters Kluwer,AIDS UK,Sexual behaviour in a rural high HIV prevalence,£1836.92,1836.92,AIDS


In [8]:

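# Keep only the whole-pound part of each cost (this drops the pence), then
# discard anything after a "$" sign before converting to float
df['COST'] = df['COST'].str.split('.').str[0]
df['COST'] = df['COST'].str.split('$').str[0].astype(float)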
df

PMID/PMCID,PMID/PMCID,Publisher,Journal title,Article title,COST (£) charged to Wellcome (inc VAT when charged),COST,topics_fixed
PMC3679557,PMC3679557,ACS,Biomacromolecules,Structural characterization of a Model Gram-ne...,£2381.04,2381.0,Biomacromolecules
23043264 PMC3506128,23043264 PMC3506128,ACS,J Med Chem,"Fumaroylamino-4,5-epoxymorphinans and related ...",£642.56,642.0,Chem
23438330 PMC3646402,23438330 PMC3646402,ACS,J Med Chem,Orvinols with mixed kappa/mu opioid receptor a...,£669.64,669.0,Chem
23438216 PMC3601604,23438216 PMC3601604,ACS,J Org Chem,Regioselective opening of myo-inositol orthoes...,£685.88,685.0,Chem
PMC3579457,PMC3579457,ACS,Journal of Medicinal Chemistry,Comparative Structural and Functional Studies ...,£2392.20,2392.0,Chem
PMC3709265,PMC3709265,ACS,Journal of Proteome Research,Mapping Proteolytic Processing in the Secretom...,£2367.95,2367.0,Journal of Proteome Research
23057412 PMC3495574,23057412 PMC3495574,ACS,Mol Pharm,Quantitative silencing of EGFP reporter gene b...,£649.33,649.0,Mol Pharm
PMCID: PMC3780468,PMCID: PMC3780468,ACS (Amercian Chemical Society) Publications,ACS Chemical Biology,A Novel Allosteric Inhibitor of the Uridine Di...,£1294.59,1294.0,Biology
PMCID: PMC3621575,PMCID: PMC3621575,ACS (Amercian Chemical Society) Publications,ACS Chemical Biology,Chemical proteomic analysis reveals the drugab...,£1294.78,1294.0,Biology
PMCID: PMC3739413,PMCID: PMC3739413,ACS (Amercian Chemical Society) Publications,Journal of Chemical Information and Modeling,Locating Sweet Spots for Screening Hits and Ev...,£1329.69,1329.0,Chem


In [13]:
# Lastly, now that the data is grouped by common category, sum the cost per topic
# and sort the totals from highest to lowest
df.groupby('topics_fixed')[['COST']].sum().sort_values(by='COST', ascending=False)

topics_fixed,COST
PLoS,10446885.0
Chem,3257039.0
Gene,3184233.0
Molecular Cell,2011774.0
Neuro,1299184.0
Biology,1244741.0
Nature Communications,1052940.0
Cell,1012684.0
Journal of Cell Science,1012439.0
Journal of Physiology,1011970.0
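
Totals in the millions look large next to the per-article costs shown earlier (roughly £600 to £2,400), which may point to placeholder or mis-entered values in the cost column. A hedged sketch of one way to check is given below. The toy rows, the `999999.0` entry, and the `threshold` cutoff are all assumptions made for illustration and would need to be chosen by inspecting the real data.

```python
import pandas as pd

# Invented rows; 999999.0 stands in for a suspected placeholder value.
toy = pd.DataFrame({
    "topics_fixed": ["PLoS", "PLoS", "PLoS", "Chem"],
    "COST": [1000.0, 1200.0, 999999.0, 650.0],
})

threshold = 10000.0  # hypothetical cutoff for a plausible per-article cost

# Rows above the cutoff are set aside for inspection, not silently deleted.
suspect = toy[toy["COST"] > threshold]
trimmed = toy[toy["COST"] <= threshold]

totals = (
    trimmed.groupby("topics_fixed")["COST"]
    .sum()
    .sort_values(ascending=False)
)
print(suspect)
print(totals)
```

Keeping `suspect` as its own frame preserves a record of which rows were excluded and why, in line with the challenge's request to document cleaning decisions.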
