# Challenge: Data cleaning & validation

Estimated Time: 2-3 hours

Data cleaning is definitely a "practice makes perfect" skill. Using this dataset of article open-access prices paid by the Wellcome Trust between 2012 and 2013:

1. Determine the five most common journals and the total articles for each.
2. Calculate the mean, median, and standard deviation of the open-access cost per article for each journal. You will need to do considerable data cleaning in order to extract accurate estimates, and may want to look into data encoding methods if you get stuck.
3. For a real bonus round, identify the open access prices paid by subject area.

As noted in the previous assignment, don't modify the data directly. Instead, write a cleaning script that will load the raw data and whip it into shape. Jupyter notebooks are a great format for this. Keep a record of your decisions: well-commented code is a must for recording your data cleaning decision-making process. Submit a link to your script and results below and discuss it with your mentor at your next session.
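The pound sign in the cost column is a typical place where the file encoding matters. A minimal sketch, assuming the raw file is encoded as ISO-8859-1 (Latin-1), which is the encoding passed to `read_csv` in this notebook; the sample string is made up:

```python
# Made-up sample value; in ISO-8859-1 the pound sign is the single byte 0xA3.
raw = "£2379.54".encode("ISO-8859-1")

# Decoding with the matching codec round-trips cleanly.
decoded = raw.decode("ISO-8859-1")

# Decoding the same bytes as UTF-8 fails, because a lone 0xA3 byte
# is not a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(decoded, utf8_ok)
```

If a CSV load raises a `UnicodeDecodeError`, trying a different `encoding` argument is usually the first thing to check.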

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import re
from collections import Counter

In [2]:
df = pd.read_csv('WELLCOME_APCspend2013_forThinkful.csv', encoding="ISO-8859-1")
df.head()

|  | PMID/PMCID | Publisher | Journal title | Article title | COST (£) charged to Wellcome (inc VAT when charged) |
|---|---|---|---|---|---|
| 0 | PMC3378987\n | Elsevier | Academy of Nutrition and Dietetics | Parent support and parent mediated behaviours ... | £2379.54 |
| 1 | PMCID: PMC3780468 | ACS (Amercian Chemical Society) Publications | ACS Chemical Biology | A Novel Allosteric Inhibitor of the Uridine Di... | £1294.59 |
| 2 | PMCID: PMC3621575 | ACS (Amercian Chemical Society) Publications | ACS Chemical Biology | Chemical proteomic analysis reveals the drugab... | £1294.78 |
| 3 | PMID: 24015914 PMC3833349 | American Chemical Society | ACS Chemical Biology | Discovery of an allosteric inhibitor binding s... | £1267.76 |
| 4 | : PMC3805332 | American Chemical Society | ACS Chemical Biology | Synthesis of alpha-glucan in mycobacteria invo... | £2286.73 |


## 1. Determine the five most common journals and the total articles for each.

In [3]:
# Clean the Journal title column
# .copy() gives an independent frame, so the assignment below does not
# trigger pandas' SettingWithCopyWarning
dfn = df[['Journal title', 'Article title']].copy()

def clean_title(x):
    x = str(x).strip()         # remove leading and trailing whitespace
    x = x.lower()              # normalize capitalization
    x = x.replace('&', 'and')  # replace & with 'and'
    # Note: str.replace matches its first argument literally, not as a regex,
    # so this call leaves special characters in place; re.sub would be needed
    # to actually strip them.
    x = x.replace('[^a-zA-Z]+', '')
    return x

dfn['Journal title'] = dfn['Journal title'].apply(clean_title)

# Count the occurrences of each journal
JournalCount = dfn.groupby('Journal title').count()
JournalCount.nlargest(5, 'Article title')


| Journal title | Article title (count) |
|---|---|
| plos one | 190 |
| journal of biological chemistry | 53 |
| neuroimage | 29 |
| nucleic acids research | 26 |
| plos genetics | 24 |
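The same kind of count can also be obtained with `Series.value_counts`, which needs no helper column. A minimal sketch on a made-up sample (the titles and counts are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up journal titles with inconsistent case and whitespace
titles = pd.Series(
    ["PLoS ONE", "plos one ", "PLOS One", "Neuroimage", " neuroimage", "Cell"]
)

# Normalize with the vectorized .str accessor, then count each title
cleaned = titles.str.strip().str.lower()
top = cleaned.value_counts().head(2)
print(top)
```

`value_counts` sorts by frequency in descending order, so `head(n)` returns the `n` most common titles.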


## 2. Calculate the mean, median, and standard deviation of the open-access cost per article for each journal.

You will need to do considerable data cleaning in order to extract accurate estimates, and may want to look into data encoding methods if you get stuck.
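Splitting each cost string into a currency symbol and a numeric value is one way to start on this. A minimal sketch on made-up strings, assuming every entry is a single `£` or `$` symbol followed by a number:

```python
import pandas as pd

# Made-up cost strings in the two currencies
costs = pd.Series(["£2379.54", "$1294.59", "£45"])

# Named groups become column names; expand=True returns a DataFrame
parts = costs.str.extract(r"(?P<currency>[£$])(?P<value>\d+(?:\.\d+)?)", expand=True)
parts["value"] = parts["value"].astype(float)
print(parts)
```

Entries that do not fit the assumed pattern come back as `NaN` in both columns, which makes them easy to find and inspect.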

In [10]:
# Clean the Cost column
df = df.rename(columns={df.columns[4]: "Cost_wC"})
# Currency symbol: whatever is left after removing the numeric part
df['Currency'] = df.Cost_wC.str.replace(r'(\d+(?:\.\d+)?)', '', regex=True).astype('category')
# Numeric part of the cost string
df['Value'] = df.Cost_wC.str.extract(r'(\d+(?:\.\d+)?)', expand=False).astype(float)
# Express all costs in dollars; 1.34 is the pound-to-dollar rate assumed here
df['CostinD'] = np.where(df['Currency'] == '$', df['Value'], df['Value'] * 1.34)

# Journal cost data frame

# Handle outliers.
# Assumption: low publishing costs are possible, whereas article costs above $9000 are most likely typos.
# Note that the threshold actually applied below is 1.5 times the 75th percentile (about $4666),
# and that values above it are replaced with the median rather than dropped.
P75 = np.percentile(df['CostinD'], 75)
MedianCost = np.median(df['CostinD'])
print('P75:', P75)
print('Max before:', df['CostinD'].max())
df['CostinD'] = np.where(df['CostinD'] > (P75 * 1.5), MedianCost, df['CostinD'])  # replace high outliers with the median
print('Max after:', df['CostinD'].max())

# Clean Journal title
JC = df[['Journal title', 'CostinD']].copy()  # .copy() avoids SettingWithCopyWarning

JC['Journal title'] = JC['Journal title'].apply(clean_title)

Overview = JC.groupby('Journal title').agg(['count', 'mean', 'median', 'std'])
Overview.head()

P75: 3110.5487
Max before: 1339998.66
Max after: 4663.2


| Journal title | CostinD count | CostinD mean | CostinD median | CostinD std |
|---|---|---|---|---|
| academy of nutrition and dietetics | 1 | 3188.5836 | 3188.5836 | NaN |
| acs chemical biology | 5 | 1900.36924 | 1734.7506 | 679.79481 |
| acs chemical neuroscience | 1 | 1590.312 | 1590.312 | NaN |
| acs nano | 2 | 895.3076 | 895.3076 | 47.849916 |
| acta crystallographica section d, biological crystallography | 1 | 1033.7028 | 1033.7028 | NaN |

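Replacing values above a fixed multiple of the 75th percentile is one option. A common alternative is Tukey's rule, which flags values more than 1.5 interquartile ranges above the third quartile. A minimal sketch on made-up costs (only the largest value echoes the maximum printed above); this is a swapped-in technique, not the rule used in this notebook:

```python
import numpy as np
import pandas as pd

# Made-up costs in dollars, with one implausibly large entry
costs = pd.Series([900.0, 1200.0, 1500.0, 1800.0, 2100.0, 2400.0, 1339998.66])

# Tukey's upper fence: third quartile plus 1.5 times the interquartile range
q1, q3 = np.percentile(costs, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Keep only the rows at or below the fence
kept = costs[costs <= upper_fence]
print(upper_fence, len(kept))
```

Dropping flagged rows instead of overwriting them with the median keeps the per-journal mean and standard deviation from being pulled toward the overall median.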

## 3. For a real bonus round, identify the open access prices paid by subject area.

In [11]:
# Define categories for journals
# Extract all words from the journal titles
df2 = df[['Journal title', 'Article title', 'CostinD']].copy()  # .copy() avoids SettingWithCopyWarning
# clean_title keeps the spaces, so each title can be split into words
df2['Journal title'] = df2['Journal title'].apply(clean_title)

# Split every title on spaces and commas and collect the individual words
SingleWordlist = [
    word
    for title in df2['Journal title']
    for word in title.replace(',', ' ').split()
]

SingleWordCount = Counter(SingleWordlist)
SWC_df = pd.DataFrame.from_dict(SingleWordCount, orient='index')
SWC_df = SWC_df.rename(columns={SWC_df.columns[0]: "WordCount"})
print(SWC_df)
# Choose the 10 most common words and make categories out of them
# Classify the journals and calculate stats


In [16]:
# Categories: keywords looked up in the cleaned journal title.
# Order matters: the first category with a matching keyword wins.
CATEGORIES = [
    ("Biology", ['biology', 'biological']),
    ("Chemistry", ['chemistry', 'chemical']),
    ("Genetics", ['genetics']),
    ("Medicine", ['medicine']),
    ("Neurology", ['neuroscience', 'neuro']),
    ("Health", ['health']),
    ("Human", ['human']),
    ("Clinical", ['clinical']),
    ("Disease", ['disease']),
    ("Brain", ['brain']),
    ("Tropical", ['tropical']),
    ("Immunology", ['immunology']),
    ("Microbiology", ['microbiology']),
    ("Virology", ['virology']),
    ("Endocrinology", ['endocrinology']),
    ("Epidemiology", ['epidemiology']),
]

def check_categories(x):
    x = str(x)  # NaN is a float; without str() the `in` test raises "argument of type 'float' is not iterable"
    for category, keywords in CATEGORIES:
        # Substring match, so 'microbiology' already matches 'biology'
        # and 'neuroimage' matches 'neuro'
        if any(keyword in x for keyword in keywords):
            return category
    return "Other"

# assign returns a new DataFrame, which avoids SettingWithCopyWarning
df2 = df2.assign(Cat=df2['Journal title'].apply(check_categories))
df2.head()

# Aggregate only the cost column per category
CC = df2[['Cat', 'CostinD']]
Overview = CC.groupby('Cat').agg(['mean', 'median', 'std', 'count'])
Overview.head()


| Cat | CostinD mean | CostinD median | CostinD std | CostinD count |
|---|---|---|---|---|
| Biology | 2292.099707 | 2278.0 | 750.196288 | 223 |
| Brain | 2683.120286 | 2733.6 | 606.16288 | 35 |
| Chemistry | 2321.70443 | 2412.0 | 738.725635 | 61 |
| Clinical | 2792.597259 | 2980.8568 | 544.568729 | 41 |
| Disease | 2517.358244 | 2428.4954 | 625.165881 | 54 |
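Substring matching can put a journal in an unexpected category: any title containing `microbiology` also contains `biology`. One possible refinement, sketched here with hypothetical keyword lists and made-up titles, is to match whole words with a regular expression:

```python
import re

# Hypothetical keyword lists; the first matching category wins
CATEGORY_KEYWORDS = {
    "Biology": ["biology", "biological"],
    "Microbiology": ["microbiology"],
}

def classify(title):
    """Return the first category whose keyword appears as a whole word."""
    for category, keywords in CATEGORY_KEYWORDS.items():
        # \b anchors the keyword at word boundaries on both sides
        pattern = r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b"
        if re.search(pattern, title):
            return category
    return "Other"

for title in ["journal of biological chemistry", "molecular microbiology", "neuroimage"]:
    print(title, "->", classify(title))
```

Whole-word matching is stricter, so more titles may fall into "Other"; whether that trade-off is worth it depends on how the categories are meant to be used.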
