# Observed Occupations-Task Dataset

By Paul Duckworth, 11th Dec 2017.

- Read in the Observed Occupation and Task dataset (from Matt Willis). 

- Assign Probability of Automation scores based on a weighted average of DWA ID mappings. 

- Insert additional rows using the "Task also done by" column.

- Output a final dataset.

## Observational Data:

In [159]:
# encoding=utf8
import os
import numpy as np
import pandas as pd
import getpass
import pickle
import string
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## # Input DATA set of observed Occupations and Tasks
datasets = '/home/'+ getpass.getuser() +'/Datasets/'
d = os.path.join(datasets, 'FoHealthcare/Final_dataset_27082018.xlsx')

excel_doc = pd.ExcelFile(d)
data = excel_doc.parse(sheet_name="Tasks")
print(data.shape)
data.rename(columns = {'Occupation title':'Observed Occupation'}, inplace = True)

# # Strip whitespaces and Capitalize Occupations
data["Observed Occupation"] = pd.Series([string.capwords(i.strip()) for i in data["Observed Occupation"]])
data["Task"] = data["Task"].str.strip()
unique_occupations = [string.capwords(i.strip()) for i in data["Observed Occupation"].unique()]+["Scanning Clerk"]

# Encode "Clinical or Clerical Task" as a binary flag (1 = clinical, 0 = clerical), if the column is present
if "Clinical or Clerical Task" in data.columns:
    equiv = {"Clinical": 1, "Clerical": 0}
    data["clinical"] = data["Clinical or Clerical Task"].map(equiv)

# If the Task dataset already contains these derived columns, drop them so they are recomputed below.
# ("Clinical or Clerical Task" is replaced by the "clinical" flag.)
data = data.drop(columns=["Task Weight", "Automation Scores", "Weighted Average Automation Score"],
                 errors="ignore")
data.head()

(137, 22)


Unnamed: 0,Observed Occupation,Task,Task keywords/context,GP Code,DWA Task,DWA ID,Notes,Task also done by,Technology Use,Technology use3,Technology use4,Information work,Information work5,Information work6,Information work7,partial automation,Future automation potential (FAP),FAP Notes,clinical
0,Administrator,Medical Coding of letters and other documents,Structured clinical vocabulary highlight docum...,"BFS, BSC, WMC",Code data or other information.; Convert data ...,4.A.3.b.1.I06.D08; 4.A.3.b.1.I06.D11; 4.A.3.b....,,Summariser; deputy practice manager; Administr...,Desktop,specific software,,reduction,,,,no,,,0
1,Administrator,Register new Patients,,"BFS, BSC",Process healthcare paperwork.; Verify accuracy...,4.A.4.c.1.I01.D03; 4.A.2.a.2.I05.D04; 4.A.3.b....,Summarise patient records that have just moved...,Receptionist; Summariser,Desktop,paper,,transfer,entry,structured,error checking,no,,,0
2,Administrator,Use software to convert printed letters into t...,,BFS,Convert data among multiple digital or analog ...,4.A.3.b.1.I06.D11; 4.A.2.a.3.I01.D02,Uses optical character recognition,,Desktop,software,,reduction,,,,yes,,,0
3,Administrator,Work in Open Exeter online web portal,,WMC,Monitor external affairs or events affecting b...,4.A.1.a.2.I08.D03; 4.A.4.c.1.I01.D03; 4.A.3.b....,,Practice Manager; Deputy Practice Manager; Adm...,Desktop,website,,transfer,,,,yes,,,0
4,Administrator,"Write letters for secondary care, other GPs, o...",,MWMC,Type documents.; Edit written materials.; Read...,4.A.4.c.1.I01.D07; 4.A.2.b.1.I07.D01; 4.A.1.a....,"Letters written as needed, changes day to day ...",Receptionist; Secretary; General Practitioner,Desktop,software,,transfer,bricolage,,,yes,,,0
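
The occupation titles above are normalised with `string.capwords` after stripping whitespace. As a minimal sketch of what that normalisation does, using made-up titles rather than rows from the dataset:

```python
import string

import pandas as pd

# Toy occupation titles with inconsistent case and whitespace (made-up values).
raw = pd.Series(["  practice  nurse", "RECEPTIONIST ", "deputy practice manager"])

# string.capwords splits on whitespace, capitalises each word and rejoins the
# words with single spaces, so repeated spaces are collapsed as well.
normalised = raw.map(lambda s: string.capwords(s.strip()))

print(normalised.tolist())
```

This is why "deputy practice manager" in the "Task also done by" column can later be matched against "Deputy Practice Manager".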


## Add Structured Work Flag

In [160]:
data.columns

Index(['Observed Occupation', 'Task', 'Task keywords/context', 'GP Code',
       'DWA Task', 'DWA ID', 'Notes', 'Task also done by', 'Technology Use',
       'Technology use3', 'Technology use4', 'Information work',
       'Information work5', 'Information work6', 'Information work7',
       'partial automation', 'Future automation potential (FAP)', 'FAP Notes',
       'clinical'],
      dtype='object')

In [161]:
structured_work = []

for i, row in data.iterrows():
    # 1 = some cell mentions "structured", -1 = some cell mentions "unstructured", 0 = neither
    structured_flag = 0

    for value in row:
        if isinstance(value, str):
            text = value.lower()
            if "structured" in text:
                structured_flag = 1
            # "unstructured" also contains "structured", so this check must come second
            if "unstructured" in text:
                structured_flag = -1

    structured_work.append(structured_flag)

print(structured_work)

data["structured_work"] = pd.Series(structured_work)
data.head()

[1, 1, 0, 0, 0, 0, 0, 0, 1, -1, 1, 1, -1, 1, 0, 0, 0, 1, -1, 1, -1, 0, -1, 1, -1, 1, 1, -1, 0, 0, -1, -1, 1, -1, 1, -1, 1, 1, 0, 1, -1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, -1, -1, 1, 1, 1, 1, 1, -1, 1, -1, -1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, -1, 1, 1, 1, 0, 0, 0, 1, 1, 1, -1, 1, 0, 1, 1, 1, 1, 0, 0, 1, -1, 1, -1, 1, -1, 1, -1]


Unnamed: 0,Observed Occupation,Task,Task keywords/context,GP Code,DWA Task,DWA ID,Notes,Task also done by,Technology Use,Technology use3,Technology use4,Information work,Information work5,Information work6,Information work7,partial automation,Future automation potential (FAP),FAP Notes,clinical,structured_work
0,Administrator,Medical Coding of letters and other documents,Structured clinical vocabulary highlight docum...,"BFS, BSC, WMC",Code data or other information.; Convert data ...,4.A.3.b.1.I06.D08; 4.A.3.b.1.I06.D11; 4.A.3.b....,,Summariser; deputy practice manager; Administr...,Desktop,specific software,,reduction,,,,no,,,0,1
1,Administrator,Register new Patients,,"BFS, BSC",Process healthcare paperwork.; Verify accuracy...,4.A.4.c.1.I01.D03; 4.A.2.a.2.I05.D04; 4.A.3.b....,Summarise patient records that have just moved...,Receptionist; Summariser,Desktop,paper,,transfer,entry,structured,error checking,no,,,0,1
2,Administrator,Use software to convert printed letters into t...,,BFS,Convert data among multiple digital or analog ...,4.A.3.b.1.I06.D11; 4.A.2.a.3.I01.D02,Uses optical character recognition,,Desktop,software,,reduction,,,,yes,,,0,0
3,Administrator,Work in Open Exeter online web portal,,WMC,Monitor external affairs or events affecting b...,4.A.1.a.2.I08.D03; 4.A.4.c.1.I01.D03; 4.A.3.b....,,Practice Manager; Deputy Practice Manager; Adm...,Desktop,website,,transfer,,,,yes,,,0,0
4,Administrator,"Write letters for secondary care, other GPs, o...",,MWMC,Type documents.; Edit written materials.; Read...,4.A.4.c.1.I01.D07; 4.A.2.b.1.I07.D01; 4.A.1.a....,"Letters written as needed, changes day to day ...",Receptionist; Secretary; General Practitioner,Desktop,software,,transfer,bricolage,,,yes,,,0,0
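
The same flag can be computed without an explicit row loop. The sketch below is an assumption-laden alternative, not the notebook's method: the helper names `mentions` and `structured_flag` are hypothetical, the data is a toy frame, and "unstructured" always takes priority here, whereas the loop above lets a later column override an earlier one.

```python
import numpy as np
import pandas as pd


def mentions(df, needle):
    """Row-wise: does any string cell contain `needle` (case-insensitive)?"""
    hits = df.apply(
        lambda col: col.map(lambda v: isinstance(v, str) and needle in v.lower())
    )
    return hits.any(axis=1)


def structured_flag(df):
    """-1 if a row mentions 'unstructured', 1 if it mentions 'structured', else 0."""
    has_un = mentions(df, "unstructured").to_numpy()
    has_st = mentions(df, "structured").to_numpy()
    # np.select takes the first matching condition, so 'unstructured' wins.
    return pd.Series(np.select([has_un, has_st], [-1, 1], default=0), index=df.index)


# Toy frame (made-up values) with the same kind of free-text columns.
toy = pd.DataFrame({
    "Information work": ["transfer", "reduction", None],
    "Information work6": ["structured", "Unstructured", None],
})
print(structured_flag(toy).tolist())
```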


## Get Automation Labels:

In [162]:
# onet_datasets = '/home/'+ getpass.getuser() +'/Datasets/ONET/db2016'

y_train = pd.read_csv("icml_data/y_train.csv")
y_test = pd.read_csv("icml_data/y_test.csv")

y_train.rename(columns={"GT Score 1-4":"y"}, inplace=True)
y_test.rename(columns={"y_pred": "y"}, inplace=True)

## Concatenate the ground-truth scores (y_train) with the model predictions (y_test) for the remaining DWAs.
## Predictions are only used where no ground-truth score exists, not for every DWA.
automation_scores = pd.concat([y_train, y_test], ignore_index=True).iloc[:, [1,2]]
print(y_train.shape, y_test.shape, automation_scores.shape)
# automation_scores[automation_scores["DWA ID"] == "4.A.3.b.1.I06.D01"]["y"]

automation_scores.head()

# y_train[y_train["DWA ID"] == "4.A.1.a.1.I23.D06"]
# automation_scores[automation_scores["DWA ID"] == "4.A.1.a.1.I23.D06"]


(314, 3) (1753, 3) (2067, 2)


Unnamed: 0,DWA ID,y
0,4.A.1.a.1.I01.D01,2.3125
1,4.A.1.a.1.I01.D04,2.233974
2,4.A.1.a.1.I02.D08,3.333333
3,4.A.1.a.1.I02.D09,2.696154
4,4.A.1.a.1.I02.D10,3.059091


In [163]:
# Investigate why some DWAs don't have Automation Scores: their composite Tasks don't have Importance scores.

# onet_tasks = pd.read_table(os.path.join(datasets, 'ONET/databases/db2016/Task Statements.txt'), sep='\t')
# taskDWA = pd.read_table(os.path.join(datasets, 'ONET/databases/db2016/Tasks to DWAs.txt'), sep='\t')
DWArefs = pd.read_table(os.path.join(datasets, 'ONET/databases/db2016/DWA Reference.txt'), sep='\t')
automation_scores = automation_scores.merge(DWArefs[['DWA ID', 'DWA Title']], on=['DWA ID'])
automation_scores.head()
# onet_tasks_dwa = onet_tasks[['Task ID', 'Task']].merge(DWA_sup, on=['Task ID'])\
#                                                          .sort_values(by='Task ID')\
#                                                          .reset_index().drop('index', axis=1)
# others = onet_tasks_dwa[onet_tasks_dwa["IWA ID"] == "4.A.3.b.1.I06"][["DWA ID", "DWA Title"]].drop_duplicates()
# others.merge(automation_scores, on=["DWA ID"], how='left').rename(columns={"y":"automation score"})

Unnamed: 0,DWA ID,y,DWA Title
0,4.A.1.a.1.I01.D01,2.3125,Review art or design materials.
1,4.A.1.a.1.I01.D04,2.233974,Study scripts to determine project requirements.
2,4.A.1.a.1.I02.D08,3.333333,Review technical documents to plan work.
3,4.A.1.a.1.I02.D09,2.696154,Review blueprints or specifications to determi...
4,4.A.1.a.1.I02.D10,3.059091,Review work orders or schedules to determine o...
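
The merge above is an inner join, so any DWA ID present in only one of the two tables is silently dropped. A minimal sketch of how such IDs could be listed with `indicator=True`, using hypothetical toy IDs and titles rather than the O*NET files:

```python
import pandas as pd

# Toy stand-ins for automation_scores and DWArefs (all values are made up).
scores = pd.DataFrame({"DWA ID": ["A", "B", "C"], "y": [2.3, 3.1, 2.9]})
refs = pd.DataFrame({"DWA ID": ["A", "B", "D"], "DWA Title": ["t_a", "t_b", "t_d"]})

# An outer merge keeps every ID; the _merge column records where each row came from.
checked = scores.merge(refs, on="DWA ID", how="outer", indicator=True)

scores_without_title = checked.loc[checked["_merge"] == "left_only", "DWA ID"].tolist()
titles_without_score = checked.loc[checked["_merge"] == "right_only", "DWA ID"].tolist()

print(scores_without_title, titles_without_score)
```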


## Add Task Weights and DWA Scores:

In [164]:
data["Task Weight"] =  data.apply(lambda _: '', axis=1)
data["Automation Scores"] =  data.apply(lambda _: '', axis=1)
data["Weighted Average Automation Score"] =  data.apply(lambda _: '', axis=1)

# data = data.reindex(columns = columns_required + ["Automation Scores", "Average Automation Score"])    

for i, (occ, task, str_dwas) in data[["Observed Occupation", "Task", "DWA ID"]].iterrows():

    # Every DWA matched to a task gets the same weight: 1 / (number of matched DWAs)
    dwas = str_dwas.split(";")
    weight = 1. / len(dwas)

    # Collect the automation score of each DWA ID
    str_auto_scores = ""
    flo_auto_scores = []
    str_weights = ""
    for dwa in dwas:
        # .values[0] raises an IndexError if the DWA ID has no automation score
        float_y = automation_scores[automation_scores["DWA ID"] == dwa.strip()]["y"].values[0]

        flo_auto_scores.append(float_y)
        str_auto_scores += " %0.2f;" % float_y
        str_weights += " %0.2f;" % weight

    data.at[i, "Task Weight"] = str_weights
    data.at[i, "Automation Scores"] = str_auto_scores
    # With equal weights, the weighted average reduces to the plain mean of the DWA scores
    data.at[i, "Weighted Average Automation Score"] = np.mean(flo_auto_scores)
print(data.shape)
data.head()

(137, 23)


Unnamed: 0,Observed Occupation,Task,Task keywords/context,GP Code,DWA Task,DWA ID,Notes,Task also done by,Technology Use,Technology use3,...,Information work6,Information work7,partial automation,Future automation potential (FAP),FAP Notes,clinical,structured_work,Task Weight,Automation Scores,Weighted Average Automation Score
0,Administrator,Medical Coding of letters and other documents,Structured clinical vocabulary highlight docum...,"BFS, BSC, WMC",Code data or other information.; Convert data ...,4.A.3.b.1.I06.D08; 4.A.3.b.1.I06.D11; 4.A.3.b....,,Summariser; deputy practice manager; Administr...,Desktop,specific software,...,,,no,,,0,1,0.25; 0.25; 0.25; 0.25;,3.08; 2.91; 3.50; 2.94;,3.10935
1,Administrator,Register new Patients,,"BFS, BSC",Process healthcare paperwork.; Verify accuracy...,4.A.4.c.1.I01.D03; 4.A.2.a.2.I05.D04; 4.A.3.b....,Summarise patient records that have just moved...,Receptionist; Summariser,Desktop,paper,...,structured,error checking,no,,,0,1,0.33; 0.33; 0.33;,2.91; 3.15; 3.50;,3.18928
2,Administrator,Use software to convert printed letters into t...,,BFS,Convert data among multiple digital or analog ...,4.A.3.b.1.I06.D11; 4.A.2.a.3.I01.D02,Uses optical character recognition,,Desktop,software,...,,,yes,,,0,0,0.50; 0.50;,2.91; 2.94;,2.92687
3,Administrator,Work in Open Exeter online web portal,,WMC,Monitor external affairs or events affecting b...,4.A.1.a.2.I08.D03; 4.A.4.c.1.I01.D03; 4.A.3.b....,,Practice Manager; Deputy Practice Manager; Adm...,Desktop,website,...,,,yes,,,0,0,0.33; 0.33; 0.33;,2.20; 2.91; 2.94;,2.68329
4,Administrator,"Write letters for secondary care, other GPs, o...",,MWMC,Type documents.; Edit written materials.; Read...,4.A.4.c.1.I01.D07; 4.A.2.b.1.I07.D01; 4.A.1.a....,"Letters written as needed, changes day to day ...",Receptionist; Secretary; General Practitioner,Desktop,software,...,,,yes,,,0,0,0.20; 0.20; 0.20; 0.20; 0.20;,3.50; 2.42; 3.32; 3.40; 2.57;,3.04252
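
Because every DWA of a task carries the same weight, the per-task score is simply the mean of its DWA scores. The sketch below shows one way this could be computed without a loop, on toy data with hypothetical IDs (`X1`, `X2`, ...). It uses a left merge so that DWA IDs without a score are reported instead of raising an error; note that such IDs are then skipped by `mean`, which effectively renormalises the weights over the DWAs that do have a score.

```python
import pandas as pd

# Toy data (made-up IDs and scores); "X9" deliberately has no score.
tasks = pd.DataFrame({
    "Task": ["task one", "task two"],
    "DWA ID": ["X1; X2", "X2; X3; X9"],
})
scores = pd.DataFrame({"DWA ID": ["X1", "X2", "X3"], "y": [3.0, 2.0, 4.0]})

# One row per (task, DWA ID) pair.
long = tasks.assign(**{"DWA ID": tasks["DWA ID"].str.split(";")}).explode("DWA ID")
long["DWA ID"] = long["DWA ID"].str.strip()

# A left merge keeps pairs whose DWA ID has no score (y is NaN for those).
long = long.merge(scores, on="DWA ID", how="left")
missing = long.loc[long["y"].isna(), "DWA ID"].tolist()

# Equal weights: the weighted average is the mean of the available scores.
task_scores = long.groupby("Task", sort=False)["y"].mean()

print(missing)
print(task_scores)
```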


In [165]:
m = data["Weighted Average Automation Score"].mean()
s = data["Weighted Average Automation Score"].std()
print("Average Automation of Tasks, %0.3f, (%0.3f)" % (m, s))

Average Automation of Tasks, 2.874, (0.331)


## Also Done By: add extra rows

In [166]:
unique_occupations

['Administrator',
 'Deputy Practice Manager',
 'General Practitioner',
 'Healthcare Assistant',
 'Nurse Practitioner',
 'Pharmacy Technician',
 'Phlebotomist',
 'Practice Manager',
 'Practice Nurse',
 'Practice Pharmacist',
 'Prescription Clerk',
 'Receptionist',
 'Secretary',
 'Summariser',
 'Scanning Clerk']

In [167]:
data[data["Observed Occupation"] == "Summariser"]

Unnamed: 0,Observed Occupation,Task,Task keywords/context,GP Code,DWA Task,DWA ID,Notes,Task also done by,Technology Use,Technology use3,...,Information work6,Information work7,partial automation,Future automation potential (FAP),FAP Notes,clinical,structured_work,Task Weight,Automation Scores,Weighted Average Automation Score
136,Summariser,Cleaning up information in the patients electr...,,,Process healthcare paperwork.,4.A.4.c.1.I01.D03,Summarisers do this while they are working wit...,Practice Nurse; Nurse Practitioner; Healthcare...,,,...,,,,,,0,-1,1.00;,2.91;,2.91399


In [168]:
print(data.shape)
duplicate_task_rows = data.copy()
new_rows = []

for i, row in data.iterrows():
    occu = string.capwords(row['Observed Occupation'].strip())

    if isinstance(row['Task also done by'], str):
        for also_occ in row['Task also done by'].split(";"):
            a = string.capwords(also_occ.strip())

            if a == occu:
                continue

            elif a in unique_occupations:
                # create a copy of the row, but change the occupation:
                new_row = row.copy()
                new_row['Observed Occupation'] = a
                new_rows.append(new_row)
            else:
                # an empty name means an empty entry in the cell, e.g. a stray ";"
                print("***", i, "different occupation:", a)

# DataFrame.append was removed in pandas 2.0; concatenate the collected rows once instead
duplicate_task_rows = pd.concat([duplicate_task_rows, pd.DataFrame(new_rows)], ignore_index=True)
duplicate_task_rows = duplicate_task_rows.sort_values(by=['Observed Occupation', 'Task'])
duplicate_task_rows = duplicate_task_rows.reset_index(drop=True)

duplicate_task_rows[duplicate_task_rows["Observed Occupation"] == "Administrator"]


(137, 23)
*** 24 different occupation: 
*** 39 different occupation: 
*** 44 different occupation: 
*** 46 different occupation: 
*** 98 different occupation: 
*** 111 different occupation: 


Unnamed: 0,Observed Occupation,Task,Task keywords/context,GP Code,DWA Task,DWA ID,Notes,Task also done by,Technology Use,Technology use3,...,Information work6,Information work7,partial automation,Future automation potential (FAP),FAP Notes,clinical,structured_work,Task Weight,Automation Scores,Weighted Average Automation Score
0,Administrator,Address problems that arise with building,,BSC,"Notify others of emergencies, problems, or haz...",4.A.4.a.2.I08.D07; 4.A.4.a.2.I08.D04; 4.A.4.a....,,Deputy Practice Manager; Administrator,,,...,,,no,no,,0,-1,0.20; 0.20; 0.20; 0.20; 0.20;,3.27; 2.90; 3.30; 3.10; 2.46;,3.00464
1,Administrator,Answer phone,,,Answer telephones to direct calls or provide i...,4.A.4.a.3.I03.D11,,Practice Nurse; Nurse Practitioner; Healthcare...,phone,desktop,...,reference,,no,no,,0,-1,1.00;,3.15;,3.15315
2,Administrator,Checking for errors in paperwork,,BSC,Check data for recording errors.,4.A.2.a.2.I01.D08,,Administrator; Practice Manager; Secretary,desktop,paper forms,...,,,no,,,0,-1,1.00;,3.27;,3.27283
3,Administrator,Cleaning up information in the patients electr...,,,Process healthcare paperwork.,4.A.4.c.1.I01.D03,Summarisers do this while they are working wit...,Practice Nurse; Nurse Practitioner; Healthcare...,,,...,,,,,,0,-1,1.00;,2.91;,2.91399
4,Administrator,Connecting human resources/making introduction...,,BSC,Relay information between personnel.,4.A.4.a.2.I03.D11,,Deputy Practice Manager; Administrator,,,...,,,no,no,,0,-1,1.00;,3.38;,3.375
5,Administrator,Deal with complaints,,,Respond to customer problems or complaints.; R...,4.A.4.a.8.I03.D01; 4.A.4.a.8.I03.D05,,Practice Manager; Administrator; General Pract...,,,...,,,,,,0,0,0.50; 0.50;,3.32; 2.34;,2.83122
6,Administrator,Enter data for enhanced services,,WMC,Develop procedures for data entry or processin...,4.A.2.b.4.I03.D16; 4.A.3.b.6.I10.D02,,,Desktop,website,...,,,yes,,,0,0,0.50; 0.50;,2.70; 3.46;,3.08005
7,Administrator,"Fill out or process audit forms, conduct an au...",,WMC,Process healthcare paperwork.,4.A.4.c.1.I01.D03,,,Desktop,,...,unstructured,,no,,,0,-1,1.00;,2.91;,2.91399
8,Administrator,Invoicing (private insurance),,,Prepare financial documents.; Maintain records...,4.A.3.b.6.I01.D01; 4.A.3.b.6.I08.D15; 4.A.4.b....,"Quicked used for private insurance, or another...",,Desktop,,...,structured,,yes,yes,,0,1,0.25; 0.25; 0.25; 0.25;,2.85; 2.46; 1.77; 3.80;,2.72054
9,Administrator,Manage pension schemes,,BFS,Manage organizational or program finances.; Ma...,4.A.4.b.4.I09.D08; 4.A.3.b.6.I08.D14,Usually called PCSE. like shopping on amazon f...,Deputy Practice Manager; Administrator,desktop,website,...,,,no,yes,put RFID on everything to track when things ne...,0,1,0.50; 0.50;,1.77; 2.94;,2.35545
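
The row duplication above can also be expressed with `explode`. This is a sketch on toy data under stated assumptions: the frame, the `known` list and the helper column `also` are hypothetical, and unknown or empty names are dropped silently rather than printed.

```python
import string

import pandas as pd

known = ["Administrator", "Receptionist", "Secretary"]  # hypothetical occupation list
tasks = pd.DataFrame({
    "Observed Occupation": ["Administrator", "Secretary"],
    "Task": ["Answer phone", "Type letters"],
    "Task also done by": ["receptionist; Administrator; ", None],
})

# One row per (task, also-done-by name); missing cells become a single empty name.
expanded = (
    tasks.assign(also=tasks["Task also done by"].fillna("").str.split(";"))
    .explode("also")
    .reset_index(drop=True)
)
expanded["also"] = expanded["also"].map(lambda s: string.capwords(s.strip()))

# Keep only known occupations that differ from the row's own occupation.
keep = expanded["also"].isin(known) & (expanded["also"] != expanded["Observed Occupation"])
extra = expanded.loc[keep].copy()
extra["Observed Occupation"] = extra["also"]

result = (
    pd.concat([tasks, extra.drop(columns="also")], ignore_index=True)
    .sort_values(["Observed Occupation", "Task"])
    .reset_index(drop=True)
)
print(result[["Observed Occupation", "Task"]])
```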


In [169]:
# duplicate_task_rows.sort_values(by = ['Task', 'Observed Occupation'], inplace=True)
output_doc = os.path.join(datasets, 'FoHealthcare/Final_also_done_by_tasks_040818.xlsx')
# duplicate_task_rows.to_csv(output_doc, sep='\t', encoding='utf-8')
duplicate_task_rows.to_excel(output_doc)

automation_scores[["DWA ID", "y"]].to_excel(os.path.join(datasets, 'FoHealthcare/automation_scores.xlsx'))

m = duplicate_task_rows["Weighted Average Automation Score"].mean()
s = duplicate_task_rows["Weighted Average Automation Score"].std()

print("Average Automation of Tasks (duplicated by occupations), %0.3f, (%0.3f)" % (m, s))

Average Automation of Tasks (duplicated by occupations), 2.877, (0.320)


## Employment Figures? 

- We have employment numbers from NHS Digital, so we can produce a lasagna plot of risk-affected employment.


In [170]:
d = os.path.join(datasets, 'FoHealthcare/NHS_GP_services_exp_statistics_Dec17.csv')
nhs_employment_figures = pd.read_csv(d)
# Take an explicit copy so the column assignment below does not raise a SettingWithCopyWarning
nhs_employment_figures1 = nhs_employment_figures.dropna(axis=0).copy()

# Figures are stored as strings with thousands separators, e.g. "64,565"
nhs_employment_figures1["December 2017"] = nhs_employment_figures1["December 2017"].str.replace(",","").astype(float)

nhs_employment_figures2 = nhs_employment_figures1[["Observed Occupation", "December 2017"]].groupby(by="Observed Occupation").sum().reset_index()

output_doc = os.path.join(datasets, 'FoHealthcare/employment_figures_Dec17_040818.xlsx')
nhs_employment_figures2.to_excel(output_doc)


In [171]:
nhs_employment_figures3 = nhs_employment_figures2.merge(occupation_n_tasks, on="Observed Occupation").rename(columns = {"Task":"n_tasks"})
nhs_employment_figures3["Employment"] = nhs_employment_figures3["December 2017"] / nhs_employment_figures3["n_tasks"]
nhs_employment_figures3

Unnamed: 0,Observed Occupation,December 2017,n_tasks,Employment
0,Administrator,64565.0,40,1614.125
1,Deputy Practice Manager,9585.0,33,290.454545
2,General Practitioner,33947.0,37,917.486486
3,Healthcare Assistant,6580.0,27,243.703704
4,Nurse Practitioner,4670.0,27,172.962963
5,Pharmacy Technician,2418.0,6,403.0
6,Phlebotomist,740.0,5,148.0
7,Practice Manager,9585.0,37,259.054054
8,Practice Nurse,18307.0,37,494.783784
9,Practice Pharmacist,658.0,12,54.833333
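
The cell above relies on `occupation_n_tasks`, which is not defined anywhere in this notebook as shown. One plausible construction, offered purely as an assumption, is a count of tasks per occupation in the expanded task table. The sketch below uses toy data and hypothetical variable names to show that construction and the per-task employment figure derived from it.

```python
import pandas as pd

# Toy stand-ins (made-up values) for the expanded task table and employment figures.
expanded_tasks = pd.DataFrame({
    "Observed Occupation": ["Administrator", "Administrator", "Receptionist"],
    "Task": ["Answer phone", "Register new Patients", "Answer phone"],
})
employment = pd.DataFrame({
    "Observed Occupation": ["Administrator", "Receptionist"],
    "December 2017": [100.0, 30.0],
})

# Assumed definition: number of task rows per occupation.
n_tasks = (
    expanded_tasks.groupby("Observed Occupation")["Task"]
    .count()
    .reset_index()
    .rename(columns={"Task": "n_tasks"})
)

# Spread each occupation's headcount evenly over its tasks.
per_task = employment.merge(n_tasks, on="Observed Occupation")
per_task["Employment"] = per_task["December 2017"] / per_task["n_tasks"]
print(per_task)
```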


## Output All Data to the Same Excel Workbook


In [172]:
import xlsxwriter

## Create a tab in the excel document for each part-dataset
tabs = [("Tasks", data) ,
        ("SortedTaskScores", sorted_task_scores), 
        ("ExpandedTasks", duplicate_task_rows), 
        ("DWA_AutomationScores", automation_scores[["DWA ID", "DWA Title", "y"]]), 
        ("NHS_stats_Dec17", nhs_employment_figures ),
        ("employment_figures", nhs_employment_figures3) ]

## The context manager saves and closes the file (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter(xls_path, engine='xlsxwriter') as writer:  # path defined above.

    ## Text-wrap format shared by all sheets
    wrap_format = writer.book.add_format()
    wrap_format.set_text_wrap()

    for tab_name, dataset in tabs:
        df_ = pd.DataFrame(data = dataset)
        df_.to_excel(writer, sheet_name=tab_name)

        ## Format the Excel Sheet: apply text wrapping to the index column and all data columns
        writer.sheets[tab_name].set_column(0, df_.shape[1], None, wrap_format)

## For analysis - see other workbooks...