In [1]:
import pandas as pd
import numpy as np
import janitor as jntr  # importing pyjanitor registers the .clean_names() DataFrame method
import difflib

characteristic_data = pd.read_csv("hego_data/hego_17_18_salary_by_characteristic.csv", skiprows=13).clean_names()
institution_data = pd.read_csv("hego_data/hego_17_18_salary_by_institution.csv", skiprows=13).clean_names()
subject_data = pd.read_csv("hego_data/hego_17_18_salary_by_subject.csv", skiprows=13).clean_names()
uni_codes = pd.read_excel("hego_data/institution-and-campus-codes-2018-entry.xls").clean_names()
rankings_data = pd.read_csv("hego_data/Complete_University_Guide_University_League_Table.csv", skiprows=4).clean_names()

To begin with, and to make sure everything is working, only one year will be used. All the other data that has been read in refers only to this year (17/18), so only the rankings dataset needs to be filtered.

In [2]:
rankings_2017 = rankings_data.loc[rankings_data["year"] == 2017].copy()

I've been having problems joining these datasets on the institution/provider names. I've tried to do this with what's known as fuzzy merging, which matches names that are similar rather than identical. This has worked to a degree, but has thrown up some significant issues. I'm now going to create a more concise merging process so that I can check whether the merging and subsequent corrections have worked.

In [3]:
institution_data_for_merge = institution_data.copy()

In [4]:
just_2017_rankings = rankings_2017.loc[:,["rank", "institution"]].copy()

Now to define our fuzzy merge function, based on the following Stack Overflow answer:

https://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas/60908516#60908516

In [5]:
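def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
    # Work on a copy so the caller's df2 is left untouched.
    df_other = df2.copy()
    # For each name in df2, find the most similar name in df1 and store it
    # under df1's key column, so an ordinary exact merge can follow.
    df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff)
                         for x in df_other[right_on]]
    return df1.merge(df_other, on=left_on, how=how)

def get_closest_match(x, other, cutoff):
    # difflib returns candidates scoring at least `cutoff`, best first;
    # take the top one, or None when nothing is similar enough.
    matches = difflib.get_close_matches(x, other, cutoff=cutoff)
    return matches[0] if matches else None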
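
To get a feel for how the matching helper behaves, here is a minimal standalone sketch. The candidate names are made up for illustration and are not taken from the datasets, and `closest_or_none` is a hypothetical variant of `get_closest_match` that also returns None for non-string input (for example a NaN from an empty cell), since difflib expects strings.

```python
import difflib

# Made-up provider names, used only to illustrate the behaviour.
candidates = ["University of Bath", "Aston University", "Cardiff University"]

def closest_or_none(name, options, cutoff=0.6):
    """Return the best difflib match for name, or None if nothing clears the cutoff."""
    if not isinstance(name, str):
        # Assumption: missing values should simply stay unmatched.
        return None
    matches = difflib.get_close_matches(name, options, n=1, cutoff=cutoff)
    return matches[0] if matches else None

exact = closest_or_none("University of Bath", candidates)   # identical name
no_match = closest_or_none("zzzz", candidates)              # nothing similar
missing = closest_or_none(float("nan"), candidates)         # non-string input
print(exact, no_match, missing)
```

Names that fall below the cutoff come back as None, which is why they silently drop out of an inner merge.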

Through trial and error, I've worked out that the function gets confused by certain titles, especially "The University of ...". With this in mind, I removed the word "University" from the provider names before the merge.

In [6]:
institution_data_for_merge["provider_name"] = institution_data_for_merge["provider_name"].str.replace("University", '')
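
As a rough illustration of why the shared word causes trouble, the sketch below compares two different (made-up) institution names with `difflib.SequenceMatcher`, which is the class that `get_close_matches` uses for scoring. For simplicity it strips "University" from both names with plain `str.replace`, whereas the cell above only alters the provider names, so treat it as an illustration rather than a reproduction of the merge.

```python
import difflib

# Two different institutions whose names share a long common prefix.
name_a = "University of Bath"
name_b = "University of York"

def similarity(a, b):
    """Similarity score between 0 and 1, as used by difflib for ranking matches."""
    return difflib.SequenceMatcher(None, a, b).ratio()

score_with_word = similarity(name_a, name_b)

# Remove the shared word, leaving only the distinguishing part of each name.
stripped_a = name_a.replace("University", "")
stripped_b = name_b.replace("University", "")
score_without_word = similarity(stripped_a, stripped_b)

print(round(score_with_word, 3), round(score_without_word, 3))
```

With the word present, the two distinct names score above the default cutoff of 0.6 and could be matched to each other; once it is removed, the score falls below the cutoff.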

Now for the merge. The result will need further correction, which is why it is labelled provisional:

In [7]:
provisional_merge = fuzzy_merge(institution_data_for_merge, just_2017_rankings, left_on='provider_name', right_on='institution').copy()
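
One way to see where a provisional merge like this needs correcting is to list the names that found no match and the provider names claimed by more than one ranking entry. The sketch below does this on plain lists of made-up names standing in for the two name columns; the variable names are hypothetical and the same idea would need adapting to the real DataFrames.

```python
import difflib
from collections import Counter

# Made-up names standing in for the provider and ranking name columns.
provider_names = ["University of Bath", "Aston University"]
ranking_names = ["University of Bath", "Univ of Bath", "zzzz"]

# Map each ranking name to its closest provider name (or None).
matched = {}
for name in ranking_names:
    hits = difflib.get_close_matches(name, provider_names, n=1, cutoff=0.6)
    matched[name] = hits[0] if hits else None

# Ranking names that would be dropped by an inner merge.
unmatched = [name for name, hit in matched.items() if hit is None]

# Provider names matched by more than one ranking name (possible false matches).
counts = Counter(hit for hit in matched.values() if hit is not None)
contested = [provider for provider, n in counts.items() if n > 1]

print(unmatched, contested)
```

Both lists are candidates for manual review: the first shows what was lost, the second shows where two different ranking entries collapsed onto one provider.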