## 1.5 Combining Data from Multiple Tables

In [22]:
import pandas as pd

## Introduction

Welcome to our introduction to cleaning data! Real-world data is messy. Business data, for instance, is often entered manually, and people are inconsistent in how they enter it. When writing company names, they might write Apple, Inc. in one place and just Apple in another. If we count the occurrences of Apple in a dataset but only look for the first spelling (Apple, Inc.), we will undercount.
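The undercounting problem can be sketched with a handful of made-up names. The list below is purely illustrative and is not taken from the lesson's dataset; the prefix match is only one possible loosening and would also match unrelated names that happen to start with "apple".

```python
import pandas as pd

# Hypothetical company names, entered inconsistently by hand
names = pd.Series(["Apple, Inc.", "Apple", "Microsoft", "apple", "Apple, Inc."])

# Exact matching only finds one spelling
exact_count = (names == "Apple, Inc.").sum()

# A looser match: lower-case the names, then check for the prefix "apple"
loose_count = names.str.lower().str.startswith("apple").sum()

print(exact_count, loose_count)
```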

Luckily, we can take some steps to clean up datasets. In this lesson, we will go through a synthetic dataset of companies that have been selected for a recent regulatory investigation. For each industry represented in the dataset, we will seek to obtain the average tax revenue produced by the companies within that industry.

## Regulatory Investigation: Tax Revenue Analysis

You will get your first taste for data cleaning by seeing it in action. There are some core concepts of data cleaning that we will lay out later on, but data cleaning ultimately relies on an intimate and careful understanding of the data we are working with. By looking at our data and thinking about what we want to do with it, we can determine how we have to clean it.

First, we load the dataset from the file and take a look at it.

In [26]:
investigation_companies = pd.read_csv('investigation_iran_companies.csv',index_col = [0])
display(investigation_companies)

Unnamed: 0,company_name,sector,tax_revenue,number_of_employees,location,ceo_name
0,S&F Co.,Electronics,50000000,300,Tehran,John Smith
1,Ahvazِ Oil Refining,Oil & Gas,1500000000,5000,Ahvaz,Ali Khan
2,Arzan Shargh Trading Ltd.,Trading,3000000,20,Mashhad,Leila Hamidi
3,Amin Pharmaceutical Company - شرکت امین دارویی,Pharma,80000000,200,Tehran,Hamid Rezaei
4,South-Pars Gas Field - میدان گازی جنوب پارس,Energy & Gas,10000000000,2500,Asaluyeh,Ahmad Rahimi
5,Raja Rail Transportation,Logistics,30000000,150,Tehran,Mohammad Gharib
6,بافتِ بزرگ,Textile,5000000,100,Esfahan,Sara Jafari
7,Re&Co,Services,10000000,75,Tehran,Reza Mousavi
8,Sadr Holding,Investment Holding,2000000000,100,Tehran,Seyed Ahmad Hashemi
9,Sepehr Sazeh,Construction,50000000,500,Tehran,Ali Rezaei


We notice that some companies appear more than once in the dataset, and they appear under slightly different names. For example, we see that row 18 and row 44 both correspond to Arta Energy, and row 3 and row 29 both correspond to Amin Pharmaceutical. We need to eliminate duplicate entries such as these so that we can get the proper value for the average tax revenue corresponding to each industry.

Since this dataset is quite small, we could manually go through it to remove duplicates. However, we will seek to remove duplicates in an automated manner so that you will have the skills to work with larger datasets. 

As a first step, we can take care of all instances where the exact same company name occurs more than once in the dataset. We can do this using the pandas drop_duplicates method as follows:

In [27]:
investigation_companies_dropped = investigation_companies.drop_duplicates(subset = 'company_name')
display(investigation_companies_dropped)
len(investigation_companies_dropped)

Unnamed: 0,company_name,sector,tax_revenue,number_of_employees,location,ceo_name
0,S&F Co.,Electronics,50000000,300,Tehran,John Smith
1,Ahvazِ Oil Refining,Oil & Gas,1500000000,5000,Ahvaz,Ali Khan
2,Arzan Shargh Trading Ltd.,Trading,3000000,20,Mashhad,Leila Hamidi
3,Amin Pharmaceutical Company - شرکت امین دارویی,Pharma,80000000,200,Tehran,Hamid Rezaei
4,South-Pars Gas Field - میدان گازی جنوب پارس,Energy & Gas,10000000000,2500,Asaluyeh,Ahmad Rahimi
5,Raja Rail Transportation,Logistics,30000000,150,Tehran,Mohammad Gharib
6,بافتِ بزرگ,Textile,5000000,100,Esfahan,Sara Jafari
7,Re&Co,Services,10000000,75,Tehran,Reza Mousavi
8,Sadr Holding,Investment Holding,2000000000,100,Tehran,Seyed Ahmad Hashemi
9,Sepehr Sazeh,Construction,50000000,500,Tehran,Ali Rezaei


45
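Before dropping rows, it can help to look at which rows pandas treats as exact duplicates. The sketch below uses a small made-up sample rather than the lesson's file. It relies on the duplicated method with keep=False, which marks every member of a duplicate group.

```python
import pandas as pd

# Small made-up sample; the real dataset has more rows and columns
sample = pd.DataFrame({
    "company_name": ["Sadr Holding", "Re&Co", "Sadr Holding", "re&co"],
    "ceo_name": ["Seyed Ahmad Hashemi", "Reza Mousavi", "Seyed Ahmad Hashemi", "Reza Mousavi"],
})

# keep=False marks every row in a duplicate group, not just the later ones
exact_duplicates = sample[sample.duplicated(subset="company_name", keep=False)]
print(exact_duplicates)

# drop_duplicates keeps the first occurrence of each name by default
deduplicated = sample.drop_duplicates(subset="company_name")
print(len(deduplicated))
```

Note that "Re&Co" and "re&co" are not treated as duplicates here, because the comparison is exact and case-sensitive.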

We see that there are now 45 rows in our dataset. However, there are still some duplicates, such as Tehran Pharmaceuticals. We now need a way to standardize company names. In other words, whenever two rows correspond to the same company, they should share the same company_id, even when their company_name values differ. We could make the company_id simply the lower-case version of the company_name. This would take care of the Arta Energy case, where we have a row for Arta Energy and a row for arta energy. However, in other cases, this would not be enough. For example, rows 15 and 41 refer to Persian Gulf Steel, but row 15 would become "persian gulf steel & co." and row 41 would become "persian gulf steel".
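One way to go beyond lower-casing is to also strip punctuation and common legal suffixes when building the company_id. This is a minimal sketch, not the approach the lesson goes on to use: the function name make_company_id and the suffix list are assumptions made for illustration, and the handling of non-Latin names is only approximate.

```python
import re

# Hypothetical list of legal suffixes to ignore; extend it for your own data
SUFFIXES = {"co", "inc", "ltd", "company"}

def make_company_id(company_name):
    # Lower-case, then replace punctuation and symbols with spaces
    cleaned = re.sub(r"[\W_]+", " ", company_name.lower())
    # Drop suffix words such as "co" or "ltd"
    words = [word for word in cleaned.split() if word not in SUFFIXES]
    return " ".join(words)

print(make_company_id("Persian Gulf Steel & Co."))
print(make_company_id("persian gulf steel"))
```

Both calls produce the same identifier, so the two Persian Gulf Steel spellings would be grouped together. A suffix list like this needs care, since removing words can also merge companies that are genuinely different.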

What if we can use the ceo_name to determine when two companies are in fact the same? This relies on the assumption that no two companies have a CEO with the same name.

In [28]:
investigation_companies_dropped = investigation_companies.drop_duplicates(subset = 'ceo_name')
display(investigation_companies_dropped)
print(len(investigation_companies_dropped))

Unnamed: 0,company_name,sector,tax_revenue,number_of_employees,location,ceo_name
0,S&F Co.,Electronics,50000000,300,Tehran,John Smith
1,Ahvazِ Oil Refining,Oil & Gas,1500000000,5000,Ahvaz,Ali Khan
2,Arzan Shargh Trading Ltd.,Trading,3000000,20,Mashhad,Leila Hamidi
3,Amin Pharmaceutical Company - شرکت امین دارویی,Pharma,80000000,200,Tehran,Hamid Rezaei
4,South-Pars Gas Field - میدان گازی جنوب پارس,Energy & Gas,10000000000,2500,Asaluyeh,Ahmad Rahimi
5,Raja Rail Transportation,Logistics,30000000,150,Tehran,Mohammad Gharib
6,بافتِ بزرگ,Textile,5000000,100,Esfahan,Sara Jafari
7,Re&Co,Services,10000000,75,Tehran,Reza Mousavi
8,Sadr Holding,Investment Holding,2000000000,100,Tehran,Seyed Ahmad Hashemi
9,Sepehr Sazeh,Construction,50000000,500,Tehran,Ali Rezaei


51


This led to the removal of just two rows. This could be due to the fact that some of the CEO names are lower-cased. We create a new column with the lower-cased version of the CEO name and see if we can use this as a basis for removing duplicates; this way we might capture more duplicates.

In [29]:
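# Plain assignment does not copy: both names refer to the same DataFrame,
# so the new column is also added to investigation_companies and can be reused below
investigation_companies_dropped = investigation_companies
investigation_companies_dropped['ceo_name_lower'] = investigation_companies_dropped['ceo_name'].str.lower()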

In [30]:
investigation_companies_dropped = investigation_companies_dropped.drop_duplicates(subset = 'ceo_name_lower')
display(investigation_companies_dropped)
print(len(investigation_companies_dropped))

Unnamed: 0,company_name,sector,tax_revenue,number_of_employees,location,ceo_name,ceo_name_lower
0,S&F Co.,Electronics,50000000,300,Tehran,John Smith,john smith
1,Ahvazِ Oil Refining,Oil & Gas,1500000000,5000,Ahvaz,Ali Khan,ali khan
2,Arzan Shargh Trading Ltd.,Trading,3000000,20,Mashhad,Leila Hamidi,leila hamidi
3,Amin Pharmaceutical Company - شرکت امین دارویی,Pharma,80000000,200,Tehran,Hamid Rezaei,hamid rezaei
4,South-Pars Gas Field - میدان گازی جنوب پارس,Energy & Gas,10000000000,2500,Asaluyeh,Ahmad Rahimi,ahmad rahimi
5,Raja Rail Transportation,Logistics,30000000,150,Tehran,Mohammad Gharib,mohammad gharib
6,بافتِ بزرگ,Textile,5000000,100,Esfahan,Sara Jafari,sara jafari
7,Re&Co,Services,10000000,75,Tehran,Reza Mousavi,reza mousavi
8,Sadr Holding,Investment Holding,2000000000,100,Tehran,Seyed Ahmad Hashemi,seyed ahmad hashemi
9,Sepehr Sazeh,Construction,50000000,500,Tehran,Ali Rezaei,ali rezaei


51


We see that no additional duplicates are being dropped. What is going on here? It appears that the first and last names of the CEOs are getting switched at certain points. For example, for one entry of Persian Gulf Steel & Co., the CEO is abdollah azimi, and for the other it is azimi abdollah. We need to make sure the parts of each CEO's name appear in the same order. To accomplish this, we can write a function that puts the words of each name in alphabetical order. This way, the same name always appears in the same form, regardless of the order in which it was entered.

In [31]:
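def put_names_in_order(ceo_name):
    """Return the words of a CEO name sorted alphabetically, e.g. 'azimi abdollah' -> 'abdollah azimi'."""
    name_list = ceo_name.split()  # split the name into its individual words
    name_list.sort()  # put the words in alphabetical order
    return ' '.join(name_list)  # join the words back into a single string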
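A quick check of the idea, using a compact restatement of the function so that the snippet runs on its own. The inputs are the two spellings mentioned in the text plus one three-word name from the dataset.

```python
def put_names_in_order(ceo_name):
    # Split into words, sort alphabetically, and join back together
    return " ".join(sorted(ceo_name.split()))

first = put_names_in_order("abdollah azimi")
second = put_names_in_order("azimi abdollah")
three_words = put_names_in_order("seyed ahmad hashemi")

print(first)
print(second)
print(three_words)
```

Both spellings map to the same string, which is what makes them comparable. The trade-off is that two different people whose names contain the same words in a different order would also be merged.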

In [32]:
# Start again from the full dataset, which already holds the ceo_name_lower column
investigation_companies_dropped = investigation_companies
investigation_companies_dropped['ceo_name_lower_ordered'] = investigation_companies_dropped['ceo_name_lower'].apply(put_names_in_order)

Hopefully, we've now standardized the CEO names. Let's see if we can now use ceo_name_lower_ordered as a basis for dropping duplicates.

In [33]:
investigation_companies_dropped = investigation_companies_dropped.drop_duplicates(subset = 'ceo_name_lower_ordered')
display(investigation_companies_dropped)
print(len(investigation_companies_dropped))

Unnamed: 0,company_name,sector,tax_revenue,number_of_employees,location,ceo_name,ceo_name_lower,ceo_name_lower_ordered
0,S&F Co.,Electronics,50000000,300,Tehran,John Smith,john smith,john smith
1,Ahvazِ Oil Refining,Oil & Gas,1500000000,5000,Ahvaz,Ali Khan,ali khan,ali khan
2,Arzan Shargh Trading Ltd.,Trading,3000000,20,Mashhad,Leila Hamidi,leila hamidi,hamidi leila
3,Amin Pharmaceutical Company - شرکت امین دارویی,Pharma,80000000,200,Tehran,Hamid Rezaei,hamid rezaei,hamid rezaei
4,South-Pars Gas Field - میدان گازی جنوب پارس,Energy & Gas,10000000000,2500,Asaluyeh,Ahmad Rahimi,ahmad rahimi,ahmad rahimi
5,Raja Rail Transportation,Logistics,30000000,150,Tehran,Mohammad Gharib,mohammad gharib,gharib mohammad
6,بافتِ بزرگ,Textile,5000000,100,Esfahan,Sara Jafari,sara jafari,jafari sara
7,Re&Co,Services,10000000,75,Tehran,Reza Mousavi,reza mousavi,mousavi reza
8,Sadr Holding,Investment Holding,2000000000,100,Tehran,Seyed Ahmad Hashemi,seyed ahmad hashemi,ahmad hashemi seyed
9,Sepehr Sazeh,Construction,50000000,500,Tehran,Ali Rezaei,ali rezaei,ali rezaei


27


We see that we are down to 27 rows! We have made major headway in removing duplicate companies from this dataset.
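Deduplicating on the CEO name rests on the assumption that no two companies share a CEO name. One way to probe that assumption is to check whether rows with the same CEO key disagree on another field, such as the sector. The rows and the ceo_key column below are made up for illustration.

```python
import pandas as pd

# Made-up rows: the first two share a CEO key and a sector,
# the last two share a CEO key but belong to different sectors
sample = pd.DataFrame({
    "company_name": ["Arta Energy", "arta energy", "Sepehr Sazeh", "Pars Insurance"],
    "sector": ["Energy", "Energy", "Construction", "Insurance"],
    "ceo_key": ["ahmadi reza", "ahmadi reza", "ali rezaei", "ali rezaei"],
})

# Count distinct sectors per CEO key; more than one suggests two different companies
sectors_per_ceo = sample.groupby("ceo_key")["sector"].nunique()
suspicious_keys = sectors_per_ceo[sectors_per_ceo > 1].index.tolist()
print(suspicious_keys)
```

A flagged key is only a prompt for manual review, since sector labels can themselves be entered inconsistently.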

We now move on to calculating the average tax revenue for the Insurance industry within the dataset.

In [34]:
investigation_companies_dropped[investigation_companies_dropped['sector'] == 'Insurance']['tax_revenue'].mean()

30000000.0

This call succeeds when the tax_revenue column is already numeric, as in the output above. If the column instead holds text such as $30,000,000, the same call fails with the following error:

TypeError: Could not convert $30,000,000 to numeric

How can we address this error? The issue is that the tax_revenue column is then of type str (string). To compute the mean of that column, we need to remove the dollar signs and commas and convert the values to numbers. We can do so as follows:

In [35]:
# Strip "$" and "," before converting; assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
investigation_companies_dropped = investigation_companies_dropped.assign(tax_revenue=investigation_companies_dropped['tax_revenue'].astype(str).str.replace(r'[$,]', '', regex=True).astype('int64'))
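To tie the conversion back to the goal of the lesson, here is a small self-contained sketch that cleans currency strings and then computes the average tax revenue for each sector. The rows are made up and formatted like the value in the error message. groupby is one common way to compute per-group averages and is not necessarily the step this lesson had in mind.

```python
import pandas as pd

# Made-up rows with revenue stored as text
raw = pd.DataFrame({
    "sector": ["Insurance", "Insurance", "Textile"],
    "tax_revenue": ["$30,000,000", "$10,000,000", "$5,000,000"],
})

# Remove "$" and "," and convert the result to integers
raw["tax_revenue"] = (
    raw["tax_revenue"].str.replace(r"[$,]", "", regex=True).astype("int64")
)

# Average tax revenue for each sector
average_by_sector = raw.groupby("sector")["tax_revenue"].mean()
print(average_by_sector)
```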


## Core Concepts In Data Cleaning

Now that we've gotten a taste for data cleaning, we can go through some additional relevant core concepts.


Missing values: Missing values can lead to inaccurate or biased results. Data cleaning involves identifying and addressing missing values, either by imputing them with appropriate estimates or by deleting the affected records.
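A minimal sketch of both options on made-up data. Imputing with the median is just one possible estimate, and the right choice depends on the data.

```python
import pandas as pd

# Made-up data with one missing employee count
df = pd.DataFrame({
    "company_name": ["A", "B", "C"],
    "number_of_employees": [100.0, None, 300.0],
})

# Identify missing values
missing_count = df["number_of_employees"].isna().sum()

# Option 1: impute with the median of the observed values
imputed = df["number_of_employees"].fillna(df["number_of_employees"].median())

# Option 2: delete the affected records
dropped = df.dropna(subset=["number_of_employees"])

print(missing_count, imputed.tolist(), len(dropped))
```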

Duplicate records: Identify and remove any duplicate records in the dataset. Duplicate records can skew results and lead to incorrect conclusions.

Outliers: Outliers are data points that significantly deviate from the rest of the data. Detecting and handling outliers is important to ensure accurate analysis and modeling.
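One common rule of thumb flags values that lie more than 1.5 times the interquartile range (IQR) outside the middle half of the data. The sketch below applies it to made-up revenue figures. The 1.5 multiplier is a convention rather than a requirement, and a flagged value may still be legitimate.

```python
import pandas as pd

# Made-up tax revenues; the last value is far larger than the rest
revenue = pd.Series([3_000_000, 5_000_000, 10_000_000, 30_000_000, 50_000_000, 10_000_000_000])

q1 = revenue.quantile(0.25)
q3 = revenue.quantile(0.75)
iqr = q3 - q1

# Flag values more than 1.5 * IQR below the first quartile or above the third
is_outlier = (revenue < q1 - 1.5 * iqr) | (revenue > q3 + 1.5 * iqr)
outliers = revenue[is_outlier].tolist()
print(outliers)
```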

Data transformation: Transforming data involves converting it into a suitable format for analysis. This may involve scaling, normalizing, or encoding categorical variables.
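Two of these transformations can be sketched on made-up data: min-max scaling of a numeric column and one-hot encoding of a categorical one. The column name employees_scaled is an assumption introduced for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "sector": ["Textile", "Pharma", "Textile"],
    "number_of_employees": [100, 200, 300],
})

# Min-max scaling: map the smallest value to 0 and the largest to 1
col = df["number_of_employees"]
df["employees_scaled"] = (col - col.min()) / (col.max() - col.min())

# One-hot encoding: one indicator column per sector
encoded = pd.get_dummies(df, columns=["sector"])
print(encoded)
```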

Data consistency: Ensure that data is consistent across all records, particularly when merging or integrating data from multiple sources. This may involve standardizing units, currency, or date formats, as well as reconciling conflicting data points.
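As a small illustration of standardizing labels, the sketch below maps several spellings of a sector onto one canonical label. The variants and the CANONICAL mapping are hypothetical. In practice such a mapping is built by inspecting the distinct values in the data.

```python
import pandas as pd

df = pd.DataFrame({"sector": ["Pharma", "pharmaceuticals ", "Oil & Gas", "oil and gas"]})

# Hypothetical mapping from normalized variants to one canonical label
CANONICAL = {
    "pharma": "Pharma",
    "pharmaceuticals": "Pharma",
    "oil & gas": "Oil & Gas",
    "oil and gas": "Oil & Gas",
}

# Trim whitespace and lower-case before looking up the canonical label
normalized = df["sector"].str.strip().str.lower()
df["sector_clean"] = normalized.map(CANONICAL)
print(df)
```

Values that are missing from the mapping come back as NaN, which makes unmapped variants easy to spot.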

Data accuracy: Verify that the data is accurate and free from errors, such as incorrect entries or typos. This may involve cross-referencing with other reliable sources or employing domain expertise to identify potential inaccuracies.

#### Using ChatGPT to assist with data cleaning