# Introduction to data cleaning

In [None]:
import pandas as pd

## Introduction

Welcome to our introduction to cleaning data! Real-world data is messy. For example, business data is often entered manually by humans. Humans can be somewhat inconsistent in the way that they enter the data. For example, when writing company names, at times they might write `Apple, Inc.`, and at other times they might just write `Apple`. If we are trying to find the number of times `Apple` is included in a dataset and only consider the first possibility—`Apple, Inc.`—we will not obtain a proper count of the number of occurrences of `Apple`.
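To make the counting problem concrete, here is a minimal sketch. The `companies` series is hypothetical toy data, not part of the lesson's dataset, and the prefix match is only one possible way to catch the variants.

```python
import pandas as pd

# Hypothetical toy data, not taken from the lesson's dataset
companies = pd.Series(["Apple, Inc.", "Apple", "apple", "Microsoft"])

# Exact matching only finds one spelling
exact_count = (companies == "Apple, Inc.").sum()

# A case-insensitive prefix match catches all three spellings in this toy data
loose_count = companies.str.lower().str.startswith("apple").sum()

print(exact_count, loose_count)
```

A prefix match like this would also pick up unrelated names that happen to start with "apple", which is why cleaning usually depends on knowing the data well.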

Luckily, we can take some steps to clean up datasets. In this lesson, we will go through a synthetic dataset of companies that have been selected for a recent regulatory investigation. For each industry represented in the dataset, we will seek to obtain the average tax revenue generated by the companies within that industry.

## Regulatory Investigation: Tax Revenue Analysis

You will get your first taste of data cleaning by seeing it in action. There are some core concepts of data cleaning that we will lay out later on, but data cleaning ultimately relies on an intimate and careful understanding of the data we are working with. By looking at our data and thinking about what we want to do with it, we can determine how we have to clean it.

First, we load the dataset from the file and take a look at it.

In [None]:
investigation_companies = pd.read_csv('../../input/investigation_iran_companies.csv', index_col = [0])
display(investigation_companies)

We notice that some companies appear more than once in the dataset, and they appear under slightly different names. For example, we see that row 18 and row 44 both correspond to Arta Energy, and row 3 and row 29 both correspond to Amin Pharmaceutical. We need to eliminate duplicate entries such as these so that we can get the proper value for the average tax revenue corresponding to each industry.

#### Dropping Duplicates

Since this dataset is quite small, we could manually go through it to remove duplicates. However, we will seek to remove duplicates in an automated manner so that you will have the skills to work with larger datasets. 

As a first step, we can simply take care of all instances where the exact same company name occurs more than once in the dataset. We can do this using the `drop_duplicates` method of a `pandas` DataFrame as follows:

In [None]:
investigation_companies_dropped = investigation_companies.drop_duplicates(subset = 'company_name')
display(investigation_companies_dropped)
len(investigation_companies_dropped)

We see that there are now 45 rows in our dataset. However, there are still some duplicates in the dataset, such as Tehran Pharmaceuticals. We now need to find a way to standardize company names. In other words, whenever two rows correspond to the same company, there should be a `company_id` that is the same in both rows, even when the `company_name` differs.

We could make the `company_id` simply the lower-case version of the `company_name`. This would take care of the Arta Energy case, where we have one row for "Arta Energy" and one for "arta energy". However, in other cases this would not be enough. For example, rows 15 and 41 both refer to Persian Gulf Steel, but row 15 would become "persian gulf steel & co." and row 41 would become "persian gulf steel".
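The limits of lowercasing can be sketched as follows. The sample names are hypothetical stand-ins for the cases described above, and `normalize_company_name` with its short suffix list is an assumption made for illustration, not something the dataset provides.

```python
import re
import pandas as pd

# Hypothetical sample names modeled on the cases described above
names = pd.Series([
    "Arta Energy",
    "arta energy",
    "Persian Gulf Steel & Co.",
    "Persian Gulf Steel",
])

def normalize_company_name(name):
    """Lowercase a name and strip an assumed set of trailing suffixes."""
    cleaned = name.lower().strip()
    # Assumed suffix list for illustration; a real dataset may need more
    cleaned = re.sub(r"\s*(&\s*co\.?|,?\s*inc\.?)$", "", cleaned)
    return cleaned

lowered = names.str.lower()
company_id = names.apply(normalize_company_name)

print(lowered.nunique())     # distinct names after lowercasing only
print(company_id.nunique())  # distinct names after also stripping suffixes
```

A hand-written suffix list like this tends to grow with every new variant we discover, which is one reason to look for a more reliable key.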

What if we use the `ceo_name` to determine when two rows in fact describe the same company? This relies on the assumption that no two distinct companies have CEOs with the same name.
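One rough way to probe that assumption is to check whether any CEO name is attached to more than one sector. This is only a sketch: the rows and the CEO name below are hypothetical, and only the column names follow the lesson's dataset.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the lesson's dataset
df = pd.DataFrame({
    "company_name": ["Arta Energy", "arta energy", "Amin Pharmaceutical"],
    "ceo_name": ["Sara Karimi", "Sara Karimi", "Sara Karimi"],
    "sector": ["Energy", "Energy", "Pharmaceuticals"],
})

# Count how many distinct sectors each CEO name is attached to
sectors_per_ceo = df.groupby("ceo_name")["sector"].nunique()

# A CEO name linked to more than one sector hints at two different people
suspicious = sectors_per_ceo[sectors_per_ceo > 1]
print(suspicious)
```

An empty result would not prove the assumption, but a non-empty one is a good reason to inspect those rows by hand.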

In [None]:
investigation_companies_dropped = investigation_companies.drop_duplicates(subset = 'ceo_name')
display(investigation_companies_dropped)
print(len(investigation_companies_dropped))

This led to the removal of just two rows. One possible reason is that some CEO names differ only in capitalization. We create a new column with the lower-cased version of the CEO name and use it as the basis for removing duplicates, which might capture more of them.

In [None]:
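# Note: this assignment creates an alias, not a copy, so the new column is also added to investigation_companies
investigation_companies_dropped = investigation_companies
investigation_companies_dropped['ceo_name_lower'] = investigation_companies_dropped['ceo_name'].str.lower()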

In [None]:
investigation_companies_dropped = investigation_companies_dropped.drop_duplicates(subset = 'ceo_name_lower')
display(investigation_companies_dropped)
print(len(investigation_companies_dropped))

We see that no additional duplicates are being dropped. What is going on here? It appears that the first and last names of the CEOs are switched in some rows. For example, for one entry of Persian Gulf Steel & Co., the CEO is abdollah azimi, and for the other it is azimi abdollah. We need to make sure the two names (first and last) of the CEO appear in the same order each time the name appears. To accomplish this, we can write a function that puts the two names in alphabetical order. This way, every occurrence of the same pair of names produces the same string.

In [None]:
def put_names_in_order(ceo_name):
    name_list = ceo_name.split()  # Split the CEO's name into a list of its parts
    name_list.sort()  # Put the parts in alphabetical order
    return ' '.join(name_list)  # Join the parts back into a single string
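As a quick sanity check, we can try the idea on the two orderings mentioned in the text. The function is repeated here in a compact form so the snippet runs on its own.

```python
def put_names_in_order(ceo_name):
    # Same logic as the function above: split, sort, rejoin
    return " ".join(sorted(ceo_name.split()))

# The two orderings mentioned in the text collapse to the same key
first = put_names_in_order("abdollah azimi")
second = put_names_in_order("azimi abdollah")
print(first)
print(second)
```

Because sorting is case-sensitive, this only works reliably on names that have already been lower-cased.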

In [None]:
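# Start again from the full dataset, working on a copy so the original stays unchanged
investigation_companies_dropped = investigation_companies.copy()
# Lower-case each CEO name, then sort its parts, so neither capitalization nor word order matters
investigation_companies_dropped['ceo_name_lower_ordered'] = investigation_companies_dropped['ceo_name'].str.lower().apply(put_names_in_order)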

Hopefully, we've now standardized the CEO names. Let's see if we can now use `ceo_name_lower_ordered` as a basis for dropping duplicates.

In [None]:
investigation_companies_dropped = investigation_companies_dropped.drop_duplicates(subset = 'ceo_name_lower_ordered')
display(investigation_companies_dropped)
print(len(investigation_companies_dropped))

We have now removed many more rows, and we see that we are down to 27 rows! We have made major headway in removing duplicate companies from this dataset.
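Before trusting an automated drop, it can help to look at the rows that are about to be treated as duplicates. The sketch below uses `duplicated` with `keep=False` on hypothetical rows; the key column mirrors `ceo_name_lower_ordered`.

```python
import pandas as pd

# Hypothetical rows; the key column mirrors ceo_name_lower_ordered
df = pd.DataFrame({
    "company_name": ["Persian Gulf Steel & Co.", "Persian Gulf Steel", "Arta Energy"],
    "ceo_name_lower_ordered": ["abdollah azimi", "abdollah azimi", "karimi sara"],
})

# keep=False marks every member of a duplicate group, not just the later ones
dupes = df[df.duplicated(subset="ceo_name_lower_ordered", keep=False)]
print(dupes.sort_values("ceo_name_lower_ordered"))
```

Sorting by the key places the members of each group next to each other, which makes it easier to spot rows that were matched by mistake.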

#### Data Type Conversion

We now move on to calculating the average tax revenue for the Insurance industry within the dataset.

In [None]:
investigation_companies_dropped[investigation_companies_dropped['sector'] == 'Insurance']['tax_revenue'].mean()

We run into an error here: 

> TypeError: Could not convert $30000000 to numeric

This error relates to a key element of data cleaning: making sure our data types are correct. The issue is that the `tax_revenue` column holds strings such as `$30000000` rather than numbers. To compute the mean of that column, we need to convert its values to numeric values. To accomplish this, we follow two steps:
1. Remove the $ sign from each string.
2. Convert each resulting string to an integer, which Python can do once the $ sign is gone.

In [None]:
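# Remove the literal "$" character (regex=False so "$" is not read as a regex anchor)
investigation_companies_dropped['tax_revenue'] = investigation_companies_dropped['tax_revenue'].str.replace('$', '', regex=False)
# Convert the cleaned strings to integers
investigation_companies_dropped['tax_revenue'] = investigation_companies_dropped['tax_revenue'].astype(int)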
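In messier data, a direct integer conversion fails as soon as one value cannot be parsed. A more forgiving variant, sketched here on hypothetical values, uses `pd.to_numeric` with `errors="coerce"`. The thousands separators and the `unknown` entry are assumptions for illustration and do not come from the lesson's dataset.

```python
import pandas as pd

# Hypothetical values; the last one is deliberately malformed
tax_revenue = pd.Series(["$30000000", "$1,250,000", "unknown"])

# Strip the dollar sign and thousands separators as literal characters
stripped = (
    tax_revenue
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

# errors="coerce" turns unparseable values into NaN instead of raising
numeric = pd.to_numeric(stripped, errors="coerce")
print(numeric)
```

The resulting NaN values can then be counted and inspected, so problem rows are not silently included in the mean.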

Now we see that the `tax_revenue` column no longer contains $ signs and has an integer data type:

In [None]:
investigation_companies_dropped['tax_revenue']

And now we can calculate the desired mean:

In [None]:
investigation_companies_dropped[investigation_companies_dropped['sector'] == 'Insurance']['tax_revenue'].mean()
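The same calculation can be extended to every industry at once with `groupby`, which matches the goal stated at the start of the lesson. The sketch below runs on hypothetical, already-cleaned values; only the column names `sector` and `tax_revenue` come from the dataset.

```python
import pandas as pd

# Hypothetical cleaned data; column names follow the lesson's dataset
df = pd.DataFrame({
    "sector": ["Insurance", "Insurance", "Energy"],
    "tax_revenue": [30000000, 10000000, 50000000],
})

# One mean per sector, computed in a single pass
mean_by_sector = df.groupby("sector")["tax_revenue"].mean()
print(mean_by_sector)
```

Applied to the cleaned DataFrame, the same pattern would be `investigation_companies_dropped.groupby('sector')['tax_revenue'].mean()`.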

#### Advanced Topic: Fuzzy Matching

Fuzzy matching is a valuable technique for tackling advanced data cleaning problems, especially when dealing with messy or inconsistent data. Fuzzy matching allows you to find approximate matches between strings, even when there are variations in spelling, formatting, or other discrepancies.

One popular Python package for fuzzy matching is called `fuzzywuzzy`. It provides various functions and algorithms for fuzzy string matching. You can use it to compare strings and determine their similarity scores.

To get started with `fuzzywuzzy`, you can install it using pip:

In [None]:
!pip install fuzzywuzzy

Once installed, you can import the package and start using its functions in your code. Here's an example of how you can apply fuzzy matching using `fuzzywuzzy`:

In [None]:
from fuzzywuzzy import fuzz

# Compare two strings and get their similarity score
string1 = "apple"
string2 = "apples"
similarity_score = fuzz.ratio(string1, string2)
print(similarity_score)

The above will output a similarity score between 0 and 100, indicating the degree of similarity between the two strings.
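To see how a similarity score could help with the company names from earlier, here is a sketch that needs no extra installation. It swaps in `difflib.SequenceMatcher` from the Python standard library in place of `fuzzywuzzy`; the `similarity` helper is a hypothetical name, and its scores are not guaranteed to match those of `fuzz.ratio`.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score based on difflib's ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

# Two spellings of the same company versus two unrelated companies
same_company = similarity("persian gulf steel & co.", "persian gulf steel")
different_company = similarity("persian gulf steel", "arta energy")

print(same_company)
print(different_company)
```

In practice, we would pick a threshold above which two names count as the same company, and we would review borderline pairs by hand.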

## Conclusion

I hope you enjoyed this lesson on data cleaning. If you clean your data, you will be able to use it to discover fascinating stories.