# Merging the Rating Data into the Main Data
In this notebook we merge the ratings into the main data.
The difficulty is that company names are spelled differently in the two datasets.

## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from datetime import date

In [2]:
# !pip install Levenshtein
# !pip install fuzzywuzzy

In [3]:
# We will try to match the company names using different libraries and see which method serves us best
import difflib
from difflib import SequenceMatcher
from fuzzywuzzy import fuzz

## Loading files

In [4]:
ratings_df = pd.read_csv('Prepared Frames/rating_data.csv')
df_companies = pd.read_csv('Prepared Frames/company_data.csv')

## Looking for matches
We will create a column for each method in our dataframe and fill it with the closest match found in the ratings dataframe.

In [5]:
# Making a list of the companies in the ratings dataframe
r_companies = list(ratings_df['Company'].unique())
len(r_companies)

17674

#### Making everything lowercase for better results


In [6]:
def makelow(x):
    return x.lower()

In [7]:
ratings_df['Company_lower'] = ratings_df['Company'].apply(makelow)

In [8]:
df_companies['Company_lower'] = df_companies['Company Name'].apply(makelow)

In [9]:
r_companies = list(ratings_df['Company_lower'].unique())

#### Difflib

In [10]:
# We want the matches to be very close: it is crucial not to get false matches.
# Losing data is not ideal, but has to be expected in this case.
def getmatch(company):
    # Uses the notebook-level variables r_companies and cutoff.
    # We look for very close matches and return at most one value.
    try:
        return difflib.get_close_matches(company, r_companies, 1, cutoff)[0]
    # If nothing is found, the list is empty; we return None to get a NaN, which is easy to track
    except IndexError:
        return None
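
The helper above reads `r_companies` and `cutoff` from the notebook's global scope. A minimal sketch of the same idea with both passed in explicitly could look like this; the name `closest_match` and the sample names are made up for illustration and are not part of the original notebook.

```python
import difflib

def closest_match(name, candidates, cutoff=0.9):
    """Return the single closest candidate, or None if nothing reaches the cutoff."""
    # get_close_matches returns a list sorted by similarity, best first
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical sample names, not taken from the real data
sample = ["american airlines group inc.", "abbvie inc.", "tereos"]
print(closest_match("american airlines group inc", sample))
print(closest_match("barrick gold", sample))
```

Passing the candidate list and cutoff as arguments makes it harder to apply the function with a stale cutoff left over from an earlier test cell.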

In [11]:
# Test
# This cutoff is just for testing purposes; we will use a much tighter one on our dataset.
cutoff = 0.5
getmatch('Barrick Gold')

'archrock inc. (old)'

In [12]:
# Applying it to our dataset

In [13]:
df_companies.columns

Index(['Ticker', 'Company Name', 'IndustryId', 'Company_lower'], dtype='object')

In [14]:
cutoff = 0.9
df_companies['difflib'] = df_companies['Company_lower'].apply(getmatch)

In [15]:
# Checking how well it worked
df_companies['difflib'].notna().sum()
# 570 matches, which looks better than the roughly 200 we had before.

570

In [16]:
# Check some examples on how good the matches are
# df_companies[df_companies['difflib'].isna()==False][['Company Name','difflib']].head(50)

All matches now look correct. To get there, we had to raise the cutoff from the initial 0.8 to 0.9.
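
The effect of the cutoff can be seen on a small example. The sketch below uses two hypothetical names (not taken from the data) that belong to different companies but differ only in the last word. Their `SequenceMatcher` ratio falls between 0.8 and 0.9, so the looser cutoff accepts the wrong company while the tighter one rejects it.

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical names: two different companies that differ only in the last word
query = "first national bank of ohio"
candidates = ["first national bank of utah"]

# ratio() is the similarity score that get_close_matches compares to the cutoff
score = SequenceMatcher(None, query, candidates[0]).ratio()
print(round(score, 3))

# The looser cutoff accepts the wrong company, the tighter one rejects it
loose = get_close_matches(query, candidates, n=1, cutoff=0.8)
tight = get_close_matches(query, candidates, n=1, cutoff=0.9)
print(loose, tight)
```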

#### Fuzzywuzzy
We try the same with the fuzzywuzzy library.

In [17]:
from fuzzywuzzy import process

In [18]:
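# We again define a function to get the closest match, this time with the fuzzywuzzy library
def getfuzz(company):
    # fuzzywuzzy scores range from 0 to 100, so score_cutoff=1 accepts almost any candidate
    match = process.extractOne(company, r_companies, score_cutoff=1)
    # extractOne returns None when no candidate reaches the cutoff
    if match is None:
        return None
    return match[0]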

In [19]:
print(getfuzz('Barrick Gold'))

gold fields limited


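The fuzzywuzzy results were far too vague and contained many mismatches, even with
the cutoff set to one. Note, however, that fuzzywuzzy scores range from 0 to 100, so a cutoff
of 1 filters out almost nothing; a cutoff on the same order as the difflib value of 0.9 would be around 90.
With the settings used here, we cannot use these results.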
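
When comparing the two libraries, the score scale matters: `difflib` ratios lie between 0 and 1, while fuzzywuzzy reports scores between 0 and 100. The sketch below does not call fuzzywuzzy. It rescales a `difflib` ratio to 0 to 100 as a stand-in (the helpers `percent_score` and `best_match` and the candidate list are hypothetical) to show how differently a cutoff of 1 and a cutoff of 90 behave on that scale.

```python
from difflib import SequenceMatcher

def percent_score(a, b):
    """Similarity on a 0 to 100 scale (stand-in for a fuzzywuzzy-style score)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def best_match(name, candidates, score_cutoff):
    """Return (candidate, score) for the best candidate, or None below the cutoff."""
    best = max(candidates, key=lambda c: percent_score(name, c))
    score = percent_score(name, best)
    return (best, score) if score >= score_cutoff else None

# Hypothetical candidates, none of which is the company we are looking for
candidates = ["gold fields limited", "tereos"]
weak = best_match("barrick gold", candidates, score_cutoff=1)     # accepts a weak match
strict = best_match("barrick gold", candidates, score_cutoff=90)  # None
print(weak, strict)
```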

In [20]:
# This took a long time, so we save the results:
df_companies.to_csv('Prepared Frames/companies_approx_match.csv', index=False)

# Another attempt

While building the model, I realized that 500 companies are not enough to train a model for all the rating classes.


We create yet another column in which we strip the company names of a lot of unnecessary ballast, hoping to get more matches this way.

In [21]:
def cleaner(s):
    # We define a list of words that are often used in company names but don't help to distinguish them.
    # Note: str.replace removes these as substrings, so e.g. 'co' is also stripped from inside
    # other words ('alcoa' becomes 'ala'). The same rule is applied to both dataframes.
    bloat_list = ['incorporated','company','corporation','pharmaceuticals','limited','resources','plc','inc','corp','co',
                  'group','holdings','.',',','-','ltd',"'",'&']
    
    s = s.lower()
    # We remove the bloat from our targeted string
    for word in bloat_list:
        s = s.replace(word,'')
    # We return the cleaned phrase
    return s
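
Because `str.replace` removes substrings wherever they occur, short entries such as `'co'` are also stripped from inside other words, which is why 'Alcoa Corp' shows up as 'ala' in the output. A token-based variant avoids mangling the names. The following is a sketch only: `clean_tokens` and `BLOAT_WORDS` are assumptions for illustration and were not used to produce the results in this notebook.

```python
import re

# Legal-form and filler words to drop (assumed list, adapt as needed)
BLOAT_WORDS = {"incorporated", "company", "corporation", "pharmaceuticals",
               "limited", "resources", "plc", "inc", "corp", "co",
               "group", "holdings", "ltd"}

def clean_tokens(name):
    """Lowercase, strip punctuation, and drop whole bloat words only."""
    name = name.lower()
    # Drop apostrophes so "aaron's" becomes "aarons"
    name = name.replace("'", "")
    # Turn every other character outside a-z, 0-9 and space into a space
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    # Keep only the tokens that are not in the bloat set
    tokens = [t for t in name.split() if t not in BLOAT_WORDS]
    return " ".join(tokens)

print(clean_tokens("Alcoa Corp"))             # alcoa
print(clean_tokens("Aaron's Company, Inc."))  # aarons
print(clean_tokens("Trip.com Group Ltd"))     # trip com
```

Splitting into tokens also collapses repeated and trailing spaces, which keeps the cleaned names consistent between the two dataframes.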

In [22]:
# We create the new column
df_companies['reduced_name'] = df_companies['Company Name'].apply(cleaner)
df_companies.head(10)

Unnamed: 0,Ticker,Company Name,IndustryId,Company_lower,difflib,reduced_name
0,A,AGILENT TECHNOLOGIES INC,106001.0,agilent technologies inc,,agilent technologies
1,A18,Trip.com Group Ltd,108004.0,trip.com group ltd,,tripm
2,A21,Li Auto Inc.,108004.0,li auto inc.,,li auto
3,AA,Alcoa Corp,110004.0,alcoa corp,,ala
4,AAC_delist,"AAC Holdings, Inc.",106011.0,"aac holdings, inc.","aac holdings, inc.",aac
5,AAL,American Airlines Group Inc.,100006.0,american airlines group inc.,american airlines group inc.,american airlines
6,AAMC,Altisource Asset Management Corp,104001.0,altisource asset management corp,,altisource asset management
7,AAME,ATLANTIC AMERICAN CORP,104004.0,atlantic american corp,,atlantic american
8,AAN,"Aaron's Company, Inc.",100007.0,"aaron's company, inc.",,aarons
9,AAOI,"APPLIED OPTOELECTRONICS, INC.",101004.0,"applied optoelectronics, inc.",,applied optoelectronics


In [23]:
# We do the same for the ratings
ratings_df['reduced_name'] = ratings_df['Company'].apply(cleaner)
ratings_df.head(10)

Unnamed: 0,Company,Date,Rating,Company_lower,reduced_name
0,"Becton, Dickinson and Company",2017-12-28,Ba1,"becton, dickinson and company",becton dickinson and
1,"Becton, Dickinson and Company",2012-06-15,A2,"becton, dickinson and company",becton dickinson and
2,"Becton, Dickinson and Company",2012-10-03,A3,"becton, dickinson and company",becton dickinson and
3,"Becton, Dickinson and Company",2015-03-17,Baa2,"becton, dickinson and company",becton dickinson and
4,"Becton, Dickinson and Company",2017-12-28,Ba1,"becton, dickinson and company",becton dickinson and
5,Tereos,2012-06-15,Ba3,tereos,tereos
6,Tereos,2012-08-21,Ba2,tereos,tereos
7,Tereos,2015-09-22,B1,tereos,tereos
8,Tereos,2012-06-15,Ba3,tereos,tereos
9,Tereos,2012-08-21,Ba2,tereos,tereos


In [24]:
# We try to match again
r_companies = list(ratings_df['reduced_name'].unique())

In [25]:
df_companies['reduced_matches'] = df_companies['reduced_name'].apply(getmatch)

In [26]:
# Checking how well it worked
df_companies['reduced_matches'].notna().sum()

851

In [27]:
df_companies[df_companies['reduced_matches'].isna()==False][['Company Name','reduced_name','reduced_matches']].head(60)

Unnamed: 0,Company Name,reduced_name,reduced_matches
4,"AAC Holdings, Inc.",aac,aac
5,American Airlines Group Inc.,american airlines,american airlines
15,AbbVie Inc.,abbvie,abbvie
18,"CAMBIUM LEARNING GROUP, INC.",cambium learning,cambium learning
21,ASBURY AUTOMOTIVE GROUP INC,asbury automotive,asbury automotive
34,"Arcosa, Inc.",arsa,arsa
38,American Campus Communities Inc,american campus mmunities,american campus mmunities
39,ACCO BRANDS Corp,ac brands,ac brands
44,Acadia Healthcare Company Inc,acadia healthcare,acadia healthcare
45,"Albertsons Companies, Inc.",albertsons mpanies,albertsons mpanies


In [28]:
df_companies[df_companies['reduced_matches'].isna()==False][['Company Name','reduced_name','reduced_matches']].tail(60)

Unnamed: 0,Company Name,reduced_name,reduced_matches
2898,Validus Holdings LTD,validus,validus
2902,Varex Imaging Corp,varex imaging,varex imaging
2905,VERINT SYSTEMS INC,verint systems,verint systems
2907,Verisk Analytics,verisk analytics,verisk analytics
2908,Verisign Inc.,verisign,verisign
2911,VIASAT INC,viasat,viasat
2912,Victoria's Secret & Co.,victorias secret,victorias secret
2913,VISHAY INTERTECHNOLOGY INC,vishay intertechnology,vishay intertechnology
2915,"Versum Materials, Inc.",versum materials,versum materials
2917,Vista Outdoor Inc.,vista outdoor,vista outdoor


# Adding the names of the matched companies to our companies dataframe

In [29]:
# After looking through some examples, I am confident that the matches are reliable.
# With each word added to the list, we were able to increase the number of matches for our dataframe.
# From initially 250 matched companies, we are now up to about 850.
df_companies.to_csv('Prepared Frames/companies_approx_match.csv', index=False)

In [30]:
# We also have to save the ratings dataframe with the matched versions of the company names
ratings_df.to_csv('Prepared Frames/rating_data.csv', index=False)
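
With the matched names stored in `reduced_matches`, the ratings can later be joined to the companies on that column. The merge itself is not part of this notebook, so the following is only a sketch on toy data: the column names follow the frames above, while the rows and the variable names `companies`, `ratings`, and `merged` are made up.

```python
import pandas as pd

# Toy stand-ins for df_companies and ratings_df (values are made up)
companies = pd.DataFrame({
    "Ticker": ["AAL", "AA"],
    "reduced_matches": ["american airlines", None],
})
ratings = pd.DataFrame({
    "reduced_name": ["american airlines", "american airlines", "tereos"],
    "Date": ["2012-06-15", "2015-03-17", "2012-06-15"],
    "Rating": ["Ba3", "Ba1", "Ba3"],
})

# Inner join keeps only companies with a match and yields one row per rating
merged = companies.dropna(subset=["reduced_matches"]).merge(
    ratings, left_on="reduced_matches", right_on="reduced_name", how="inner"
)
print(merged[["Ticker", "Date", "Rating"]])
```

Dropping the unmatched rows first keeps missing values from taking part in the join.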