# Contents

Data cleaning is a skill that improves with practice. Using this dataset of article open-access charges paid by the Wellcome Trust between 2012 and 2013, [determine the five most common journals and the total number of articles for each](#Five-Most-Common-Journals-and-their-Totals). Next, [calculate the mean, median, and standard deviation of the open-access cost per article for each journal](#Mean,-Median,-and-Standard-Deviation-of-the-Cost-per-Article-for-each-Journal). You will need to do considerable data cleaning to get accurate estimates, and may want to look into data encoding methods if you get stuck.

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [0]:
wellcome = pd.read_csv("https://raw.githubusercontent.com/RRamirez21/ThinkfulDrills/master/Wellcome%20Trust%20APCspend2013.csv", encoding='ISO-8859-1')

In [56]:
wellcome.columns

Index(['PMID/PMCID', 'Publisher', 'Journal title', 'Article title',
       'COST (£) charged to Wellcome (inc VAT when charged)'],
      dtype='object')

In [0]:
wellcome.rename(columns={'PMID/PMCID': 'PMCID',
                         'Journal title': 'Journttl',
                         'Article title': 'Article',
                         'COST (£) charged to Wellcome (inc VAT when charged)': 'Cost'},
                inplace=True)

In [58]:
wellcome.isnull().sum(axis = 0)

PMCID        199
Publisher      0
Journttl       1
Article        0
Cost           0
dtype: int64

In [0]:
wellcome.dropna(inplace=True)
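
Called without arguments, `dropna()` removes every row that has a missing value in any column, so the 199 rows without a PMCID are dropped along with the single row missing a journal title. If only the journal title and cost matter for the analysis, restricting the check with `subset` is one option. A minimal sketch on a small hypothetical frame (the values in `toy` are made up, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame using the renamed column names.
toy = pd.DataFrame({
    "PMCID": ["PMC1", np.nan, "PMC3"],
    "Journttl": ["plos one", "neuroimage", np.nan],
    "Cost": ["£1.00", "£2.00", "£3.00"],
})

# Drops any row with a missing value in any column.
drop_any = toy.dropna()

# Drops only rows whose journal title is missing.
drop_title_only = toy.dropna(subset=["Journttl"])

print(len(drop_any), len(drop_title_only))
```

Which variant is appropriate depends on whether rows without a PMCID should count toward the journal totals.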

In [0]:
wellcome['Journttl'] = wellcome['Journttl'].str.lower()

In [0]:
wellcome['Journttl'] = wellcome['Journttl'].str.strip()

In [0]:
wellcome['Journttl'] = wellcome['Journttl'].replace(' ', '', regex=True)

In [0]:
journ_count = wellcome.groupby('Journttl').size()

In [64]:
journ_count.describe()

count    819.000000
mean       2.354090
std        7.517487
min        1.000000
25%        1.000000
50%        1.000000
75%        2.000000
max      197.000000
dtype: float64

--------------------
# Five Most Common Journals and their Totals

In [65]:
journ_count.sort_values(ascending=False)[:5]

Journttl
plosone                         197
journalofbiologicalchemistry     52
neuroimage                       28
nucleicacidsresearch             25
plospathogens                    24
dtype: int64
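
Exact string matching counts each spelling of a title as a separate journal. The per-journal output at the end of this notebook, for example, lists several entries that look like variants of Acta Crystallographica Section D. One way to merge such variants before counting is an explicit lookup table. The sketch below is illustrative only: the `CANONICAL` mapping, the canonical key, and the sample titles are assumptions, and whether merging variants changes the top five here has not been checked.

```python
import pandas as pd

# Hypothetical mapping from variant spellings to one canonical key.
CANONICAL = {
    "actacrystallographica,sectiond": "actacrystallographicasectiond",
    "actacrystallographyd": "actacrystallographicasectiond",
    "actad": "actacrystallographicasectiond",
}

# Hypothetical sample of already-normalized titles.
titles = pd.Series([
    "plosone",
    "actacrystallographica,sectiond",
    "actacrystallographyd",
    "actad",
    "plosone",
])

# Series.replace with a dict swaps exact matches and leaves other values alone.
merged = titles.replace(CANONICAL)
counts = merged.value_counts()
print(counts)
```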

In [66]:
wellcome['Cost'].unique()

array(['£2381.04', '£642.56', '£669.64', ..., '£2015.72', '£1334.15',
       '£2034.75'], dtype=object)

In [0]:
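def numerify(string):
    # Keep only alphanumeric characters, then convert to float.
    # '.' is not alphanumeric, so '£2381.04' becomes 238104.0.
    return float(''.join(e for e in string if e.isalnum()))

wellcome['Cost'] = wellcome['Cost'].apply(numerify)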


In [68]:
int(wellcome['Cost'].mean())

2272467
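
A mean of 2,272,467 is implausibly large next to raw strings such as `'£2381.04'`. Because `str.isalnum()` is false for `'.'`, filtering on it removes the decimal point together with the currency symbol. A minimal sketch of a parser that keeps the decimal point, assuming every cost is a currency symbol followed by a plain decimal number; the name `parse_cost` and the sample values are hypothetical, and the outputs shown in this notebook were not produced with it:

```python
import re

import pandas as pd


def parse_cost(text):
    """Return the numeric part of a cost string such as '£2381.04' as a float."""
    # Keep digits and the decimal point; drop currency symbols, commas, and spaces.
    cleaned = re.sub(r"[^0-9.]", "", text)
    return float(cleaned)


# Hypothetical sample in the same shape as the 'Cost' column.
sample = pd.Series(["£2381.04", "£642.56", "£669.64"])
parsed = sample.apply(parse_cost)
print(parsed.tolist())
```

After parsing, it is worth inspecting the largest values with something like `describe()` before trusting any summary statistic.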

--------------------
# Mean, Median, and Standard Deviation of the Cost per Article for each Journal

In [0]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [70]:
wellcome.groupby('Journttl', as_index=False)['Cost'].mean()

| | Journttl | Cost |
|---|---|---|
| 0 | academyofnutritionanddietetics | 237954.000 |
| 1 | acschemicalbiology | 153596.500 |
| 2 | acschemicalneuroscience | 118680.000 |
| 3 | acsnano | 66814.000 |
| 4 | actacrystallographica,sectiond | 75718.000 |
| 5 | actacrystallographicasectiond,biologicalcrysta... | 77142.000 |
| 6 | actacrystallographicasectiond:biologicalcrysta... | 77374.000 |
| 7 | actacrystallographicasectionf:structuralbiolog... | 79663.500 |
| 8 | actacrystallographyd | 77419.000 |
| 9 | actad | 75016.000 |
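
The cell above reports only the mean, while the task also asks for the median and standard deviation. One way to get all three in a single pass is `groupby(...).agg(...)`. A minimal sketch on a small hypothetical frame that reuses the `Journttl` and `Cost` column names (the values in `toy` are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical data: three articles in one journal, two in another.
toy = pd.DataFrame({
    "Journttl": ["plosone", "plosone", "plosone", "neuroimage", "neuroimage"],
    "Cost": [1000.0, 1200.0, 1400.0, 2000.0, 2400.0],
})

# One row per journal, one column per statistic.
stats = toy.groupby("Journttl")["Cost"].agg(["mean", "median", "std"])
print(stats)
```

pandas computes the sample standard deviation here (`ddof=1`), so a journal with a single article gets `NaN` in the `std` column.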
