# Merging the Rating Data into the Main Data
In this notebook we merge the rating data into the main data.
The challenge is that company names are spelled differently in the two datasets.

## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from datetime import date

In [2]:
# !pip install Levenshtein
# !pip install fuzzywuzzy

In [3]:
# We will try to match the company names using different libraries and see which method serves us best
import difflib
from difflib import SequenceMatcher
from fuzzywuzzy import fuzz

## Loading files

In [4]:
ratings_df = pd.read_csv('Prepared Frames/rating_data.csv')
df_companies = pd.read_csv('Prepared Frames/company_data.csv')

## Looking for matches
We will create one column per method in the company dataframe and fill it with the closest match found in the ratings dataframe.

In [5]:
# Making a list of the companies in the ratings dataframe
r_companies = list(ratings_df['Company'].unique())
len(r_companies)

17674

#### Making everything lowercase for better results


In [6]:
def makelow(x):
    return x.lower()

In [7]:
ratings_df['Company_lower'] = ratings_df['Company'].apply(makelow)

In [8]:
df_companies['Company_lower'] = df_companies['Company Name'].apply(makelow)

In [9]:
r_companies = list(ratings_df['Company_lower'].unique())

#### Difflib

In [10]:
# We want the matches to be very close: it is crucial to avoid false matches.
# Losing some data is not ideal, but has to be expected in this case.
def getmatch(company):
    # Look for very close matches and return at most one value.
    # Note: `cutoff` is read from the global scope and must be set before calling.
    matches = difflib.get_close_matches(company, r_companies, n=1, cutoff=cutoff)
    # If nothing is found, return None so the column gets a missing value that is easy to track
    return matches[0] if matches else None

In [11]:
# Test
# The cutoff is just for testing purposes; we will use a much tighter one on our dataset.
cutoff = 0.5
getmatch('Barrick Gold')

'archrock inc. (old)'
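The effect of the cutoff can be seen on a small example. The sketch below uses a hypothetical candidate list and a hypothetical helper `best_match` that takes the cutoff as an argument; it is not the function used in this notebook.

```python
import difflib

# Hypothetical candidate list, not taken from the ratings data
candidates = ['barrick gold corp', 'gold fields limited', 'archrock inc. (old)']

def best_match(name, choices, cutoff):
    # Return the single closest candidate, or None if nothing reaches the cutoff
    matches = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None

loose = best_match('barrick gold', candidates, cutoff=0.8)
strict = best_match('barrick gold', candidates, cutoff=0.9)

print(loose)   # a match is returned at the looser cutoff
print(strict)  # None: the same candidate no longer reaches 0.9
```

A tighter cutoff therefore trades recall for precision: plausible matches with a missing suffix such as "corp" are dropped along with the false ones.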

In [12]:
# Applying it to our dataset

In [13]:
df_companies.columns

Index(['Ticker', 'SimFinId', 'Company Name', 'IndustryId', 'tokens',
       'current_search', 'Company_lower'],
      dtype='object')

In [14]:
cutoff = 0.9
df_companies['difflib'] = df_companies['Company_lower'].apply(getmatch)

In [15]:
# Checking how well it worked
df_companies['difflib'].notna().sum()
# 570 matches, which looks better than the roughly 200 we had before.

570

In [16]:
# Check some examples on how good the matches are
# df_companies[df_companies['difflib'].isna()==False][['Company Name','difflib']].head(50)

All matches now look correct. To get there, we had to raise the cutoff from the initial 0.8 to 0.9.
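When tuning a cutoff like this, it can help to keep the similarity score next to each match so that borderline cases can be reviewed by eye. The following is a rough sketch with a hypothetical helper `match_with_score` and made-up candidates; it scores every candidate with `SequenceMatcher` directly instead of using `get_close_matches`.

```python
from difflib import SequenceMatcher

def match_with_score(name, choices):
    # Score every candidate and keep the best one together with its ratio
    scored = [(SequenceMatcher(None, c, name).ratio(), c) for c in choices]
    score, match = max(scored)
    return match, round(score, 3)

# Hypothetical candidates, not taken from the ratings data
choices = ['barrick gold corp', 'gold fields limited']
match, score = match_with_score('barrick gold', choices)
print(match, score)
```

Sorting the resulting pairs by score would show where correct matches stop and false ones begin, which is one way to justify a value such as 0.9.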

#### Fuzzywuzzy
We try the same with the fuzzywuzzy library.

In [17]:
from fuzzywuzzy import process

In [18]:
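# We again define a function to get the closest match, this time with the fuzzywuzzy library
def getfuzz(company):
    # extractOne returns a (match, score) tuple, or None if nothing reaches score_cutoff.
    # Note: fuzzywuzzy scores range from 0 to 100, so score_cutoff=1 filters out almost nothing.
    result = process.extractOne(company, r_companies, score_cutoff=1)
    if result is None:
        return None
    return result[0]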

In [19]:
print(getfuzz('Barrick Gold'))

gold fields limited


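The fuzzywuzzy results were far too vague and contained many mismatches with the
cutoff set to 1. Since fuzzywuzzy scores range from 0 to 100, this cutoff filters
out almost nothing, so a fair comparison would need a much higher value. As
configured here, we cannot use this.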

In [20]:
# This took a long time, so we save the results:
df_companies.to_csv('Prepared Frames/companies_approx_match.csv', index=False)
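With the matched names stored in the `difflib` column, the merge itself could look roughly like the sketch below. The two small frames are made up, and the `Rating` column name is an assumption; only `Company Name`, `difflib` and `Company_lower` appear in this notebook.

```python
import pandas as pd

# Tiny hypothetical frames; the 'Rating' column is an assumed name
companies = pd.DataFrame({
    'Company Name': ['Barrick Gold Corp', 'Unmatched Inc'],
    'difflib': ['barrick gold corp', None],
})
ratings = pd.DataFrame({
    'Company_lower': ['barrick gold corp', 'gold fields limited'],
    'Rating': ['A', 'B'],
})

# A left join keeps every company; rows without a match get a missing rating
merged = companies.merge(
    ratings, left_on='difflib', right_on='Company_lower', how='left'
)
print(merged[['Company Name', 'Rating']])
```

If the ratings data holds several rows per company, a left join like this repeats the company row once per rating, so the ratings may need to be aggregated or deduplicated first.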