<center>
    <h1 id='inconsistent-data-entry' style='color:#7159c1'>🔨 Inconsistent Data Entry 🔨</h1>
    <i>Dealing with Inconsistent Categorical Variable Values</i>
</center>

<br />

In this notebook, let's fix some invalid data entries for countries.

In [7]:
# ---- Reading Dataset ----
import pandas as pd # pip install pandas

professors_df = pd.read_csv('./datasets/pakistan_intellectual_capital.csv', index_col='Unnamed: 0')
countries = professors_df.Country.unique()
countries.sort()
countries

array([' Germany', ' New Zealand', ' Sweden', ' USA', 'Australia',
       'Austria', 'Canada', 'China', 'Finland', 'France', 'Greece',
       'HongKong', 'Ireland', 'Italy', 'Japan', 'Macau', 'Malaysia',
       'Mauritius', 'Netherland', 'New Zealand', 'Norway', 'Pakistan',
       'Portugal', 'Russian Federation', 'Saudi Arabia', 'Scotland',
       'Singapore', 'South Korea', 'SouthKorea', 'Spain', 'Sweden',
       'Thailand', 'Turkey', 'UK', 'USA', 'USofA', 'Urbana', 'germany'],
      dtype=object)

---

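Notice the near-duplicates: ' New Zealand' and 'New Zealand'; ' Germany' and 'germany'; ' Sweden' and 'Sweden'; ' USA', 'USA' and 'USofA'; 'South Korea' and 'SouthKorea'. They differ only in leading whitespace, capitalization, or spelling.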

In [9]:
# ---- Converting All Values ----
#
# - lower case
# - strip
#
professors_df['Country'] = professors_df['Country'].str.lower()
professors_df['Country'] = professors_df['Country'].str.strip()

countries = professors_df.Country.unique()
countries.sort()
countries

array(['australia', 'austria', 'canada', 'china', 'finland', 'france',
       'germany', 'greece', 'hongkong', 'ireland', 'italy', 'japan',
       'macau', 'malaysia', 'mauritius', 'netherland', 'new zealand',
       'norway', 'pakistan', 'portugal', 'russian federation',
       'saudi arabia', 'scotland', 'singapore', 'south korea',
       'southkorea', 'spain', 'sweden', 'thailand', 'turkey', 'uk',
       'urbana', 'usa', 'usofa'], dtype=object)
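
To see what lowercasing and stripping do and do not fix, the same two steps can be tried on a handful of the raw values from the first output. This is a minimal sketch in plain Python (no pandas); the `raw_values` list and the `normalize` helper are illustrative names, not part of the notebook.

```python
# Illustrative sample taken from the raw country values shown earlier.
raw_values = [' Germany', 'germany', ' USA', 'USA', 'USofA',
              'South Korea', 'SouthKorea', ' New Zealand', 'New Zealand']

def normalize(value):
    """Mirror .str.lower() followed by .str.strip() for a single string."""
    return value.lower().strip()

# A set drops the duplicates that normalization has made identical.
normalized = sorted({normalize(value) for value in raw_values})
print(normalized)
# ['germany', 'new zealand', 'south korea', 'southkorea', 'usa', 'usofa']
```

Nine raw values collapse to six, but the spelling variants 'southkorea' and 'usofa' survive, which is what the fuzzy matching step addresses.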

---

A few problems remain: 'south korea' and 'southkorea', and 'usa' and 'usofa'. Lowercasing and stripping cannot fix these, because the strings differ in their characters.

In [11]:
# ---- Finding Country Names that are similar to 'south korea' ---- #
#
# Fuzzy matching: the process of automatically finding text strings that are very similar to the target string.
#
# In general, a string is considered "closer" to another one the fewer characters you'd need to change if you
# were transforming one string into the other.
#
# So "apple" and "snapple" are two changes away from each other (add "s" and "n"), while "in" and "on" are one
# change away (replace "i" with "o").
#
# You won't always be able to rely on fuzzy matching 100%, but it will usually end up saving you at least a little
# time.
#
import fuzzywuzzy # pip install fuzzywuzzy
from fuzzywuzzy import fuzz, process # load both submodules so fuzzywuzzy.fuzz and fuzzywuzzy.process resolve

matches = fuzzywuzzy.process.extract(
    'south korea'
    , countries
    , limit=10
    , scorer=fuzzywuzzy.fuzz.token_sort_ratio
)

matches

[('south korea', 100),
 ('southkorea', 48),
 ('saudi arabia', 43),
 ('norway', 35),
 ('austria', 33),
 ('ireland', 33),
 ('pakistan', 32),
 ('portugal', 32),
 ('scotland', 32),
 ('australia', 30)]
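
The "number of changes" idea described in the comments above is commonly formalized as the Levenshtein edit distance. The sketch below is a plain-Python illustration of that distance, not the code fuzzywuzzy runs; the `levenshtein` helper is a hypothetical name, and fuzzywuzzy reports a 0 to 100 similarity score rather than this raw count.

```python
def levenshtein(source, target):
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn source into target."""
    # previous[j] holds the distance between the processed prefix of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("apple", "snapple"))           # 2
print(levenshtein("in", "on"))                   # 1
print(levenshtein("south korea", "southkorea"))  # 1
```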

In [15]:
# ---- Creating function to replace the similar countries' names ---- #
def replace_matches_in_column(df, column, string_to_match, min_ratio=47):
    """
    Description:
        - applies fuzzywuzzy to find the strings most similar to the 'string_to_match' parameter;
        - keeps only the matches whose score is greater than or equal to the 'min_ratio' parameter;
        - replaces those matches in the column with the 'string_to_match' parameter, modifying 'df' in place.

    Parameters:
        - df: pandas dataframe;
        - column: string;
        - string_to_match: string;
        - min_ratio: integer.
    """
    # get a list of unique strings
    strings = df[column].unique()

    # get the top 10 closest matches to the input string
    matches = fuzzywuzzy.process.extract(
        string_to_match
        , strings
        , limit=10
        , scorer=fuzzywuzzy.fuzz.token_sort_ratio
    )

    # only keep matches with a ratio >= min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # get the rows of all the close matches in the dataframe
    rows_with_matches = df[column].isin(close_matches)

    # replace all rows with close matches with the input match
    df.loc[rows_with_matches, column] = string_to_match

In [16]:
# ---- Fixing Inconsistent Data Entries for 'South Korea' ----
replace_matches_in_column(
    df=professors_df
    , column='Country'
    , string_to_match='south korea'
    , min_ratio=47
)
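
A threshold of 47 looks low for two strings that differ by a single space. A likely explanation is the scorer: `token_sort_ratio` sorts the words of each string before comparing, so 'south korea' becomes 'korea south' while 'southkorea' stays one word. The sketch below approximates this with the standard library's `difflib`. `plain_ratio_sketch` and `token_sort_ratio_sketch` are hypothetical helpers; fuzzywuzzy's real scorers also preprocess the strings and may use a different backend, so its scores are not guaranteed to match these.

```python
from difflib import SequenceMatcher

def plain_ratio_sketch(a, b):
    """Similarity of the raw strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_sort_ratio_sketch(a, b):
    """Sort the whitespace-separated tokens of each string, then compare."""
    sorted_a = " ".join(sorted(a.split()))
    sorted_b = " ".join(sorted(b.split()))
    return plain_ratio_sketch(sorted_a, sorted_b)

print(plain_ratio_sketch("south korea", "southkorea"))       # 95
print(token_sort_ratio_sketch("south korea", "southkorea"))  # 48
```

Under these assumptions, a scorer that does not reorder tokens would rate this pair much higher, which would allow a stricter threshold.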

In [17]:
# ---- Finding Country Names that are similar to 'usa' ----
matches = fuzzywuzzy.process.extract(
    'usa'
    , countries
    , limit=2
    , scorer=fuzzywuzzy.fuzz.token_sort_ratio
)

matches

[('usa', 100), ('usofa', 75)]
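
Before replacing anything, it can help to list which values a given threshold would capture, since a short target such as 'usa' shares letters with many country names. This is a minimal sketch that again approximates the scorer with `difflib` rather than calling fuzzywuzzy; `score_sketch`, `preview_matches`, and the `sample` list are illustrative names.

```python
from difflib import SequenceMatcher

def score_sketch(a, b):
    """Approximate 0-100 similarity using the standard library."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def preview_matches(target, candidates, min_ratio):
    """Return the candidates that would be replaced at this threshold."""
    return [c for c in candidates if score_sketch(target, c) >= min_ratio]

# Illustrative subset of the normalized country values.
sample = ['usa', 'usofa', 'austria', 'australia', 'spain', 'uk', 'canada']

print(preview_matches('usa', sample, min_ratio=50))
# ['usa', 'usofa', 'austria', 'australia', 'spain']
print(preview_matches('usa', sample, min_ratio=70))
# ['usa', 'usofa']
```

In this sketch a threshold of 50 would also rewrite 'austria', 'australia' and 'spain' to 'usa', while 70 keeps only the intended variant.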

In [19]:
# ---- Fixing Inconsistent Data Entries for 'USA' ----
#
# 'usofa' scores 75, so a threshold of 70 captures it. A threshold of 50 is too
# loose here: it also rewrites 'australia', 'austria' and 'spain' to 'usa'.
#
replace_matches_in_column(
    df=professors_df
    , column='Country'
    , string_to_match='usa'
    , min_ratio=70
)

countries = professors_df.Country.unique()
countries.sort()
countries

array(['australia', 'austria', 'canada', 'china', 'finland', 'france',
       'germany', 'greece', 'hongkong', 'ireland', 'italy', 'japan',
       'macau', 'malaysia', 'mauritius', 'netherland', 'new zealand',
       'norway', 'pakistan', 'portugal', 'russian federation',
       'saudi arabia', 'scotland', 'singapore', 'south korea', 'spain',
       'sweden', 'thailand', 'turkey', 'uk', 'urbana', 'usa'],
      dtype=object)

---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).