In this exercise, you'll apply what you learned in the **Inconsistent data entry** tutorial.

# Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

In [None]:
from learntools.core import binder
binder.bind(globals())
from learntools.data_cleaning.ex5 import *
print("Setup Complete")

# Get our environment set up

The first thing we'll need to do is load the libraries and dataset we'll be using. We'll use the same dataset as in the tutorial.

In [None]:
# modules we'll use
import pandas as pd
import numpy as np

# helpful modules
import fuzzywuzzy
from fuzzywuzzy import process
import chardet

# read in all our data
suicide_attacks = pd.read_csv("../input/pakistansuicideattacks/PakistanSuicideAttacks Ver 11 (30-November-2017).csv", encoding='Windows-1252')

# set seed for reproducibility
np.random.seed(0)

Next, we'll redo all of the work that we did in the tutorial.

In [None]:
# convert to lower case
suicide_attacks['City'] = suicide_attacks['City'].str.lower()
# remove leading and trailing whitespace
suicide_attacks['City'] = suicide_attacks['City'].str.strip()

# get the top 10 closest matches to "d.i khan"
cities = suicide_attacks['City'].unique()
matches = fuzzywuzzy.process.extract("d.i khan", cities, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

def replace_matches_in_column(df, column, string_to_match, min_ratio=90):
    # get a list of unique strings
    strings = df[column].unique()
    
    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings, 
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # only keep matches with a ratio >= min_ratio (90 by default)
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)

    # replace every close match in the column with the input string
    df.loc[rows_with_matches, column] = string_to_match
    
    # let us know the function's done
    print("All done!")
    
replace_matches_in_column(df=suicide_attacks, column='City', string_to_match="d.i khan")
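To build some intuition for the scoring step inside `replace_matches_in_column`, here is a minimal sketch of a word-order-insensitive similarity score. It is not fuzzywuzzy's implementation: it swaps in `difflib.SequenceMatcher` from the standard library, the helper name `token_sort_score` is hypothetical, and the candidate strings are made up for illustration. The real `token_sort_ratio` scorer may preprocess strings differently, so its scores will not necessarily match these.

```python
from difflib import SequenceMatcher

def token_sort_score(a, b):
    """Rough 0-100 similarity that ignores word order (illustrative only)."""
    def normalize(text):
        # lowercase, split on whitespace, sort the tokens, and rejoin them
        return " ".join(sorted(text.lower().split()))
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return round(100 * ratio)

# made-up candidates, not taken from the dataset
candidates = ["d.i khan", "khan d.i", "d. i khan", "lahore"]

# score every candidate against the string we want to standardize on
scores = {city: token_sort_score("d.i khan", city) for city in candidates}

# keep only candidates at or above the same cutoff the function uses by default
close_matches = [city for city, score in scores.items() if score >= 90]

print(scores)
print(close_matches)
```

Sorting the tokens before comparing is what lets 'khan d.i' score the same as 'd.i khan', while the `>= 90` cutoff keeps unrelated names like 'lahore' out of the replacement.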

# 1) Examine another column

Write code below to take a look at all the unique values in the "Province" column.

In [None]:
# TODO: Your code here

In [None]:
#%%RM_IF(PROD)%%
provinces = suicide_attacks['Province'].unique()

# sort them alphabetically and then take a closer look
provinces.sort()
provinces

Do you notice any inconsistencies in the data?  (You might need to Google some of the entries.)  Can any of the inconsistencies in the data be fixed by making everything lowercase?

Once you have answered these questions, run the code cell below to get credit for your work.

In [None]:
# Check your answer (Run this code cell to receive credit!)
q1.check()

In [None]:
# Line below will give you a hint
#_COMMENT_IF(PROD)_
q1.hint()

# 2) Do some text pre-processing

Convert every entry in the "Province" column in the `suicide_attacks` DataFrame to lowercase.

In [None]:
# TODO: Your code here
____

# Check your answer
q2.check()

In [None]:
#%%RM_IF(PROD)%%
q2.assert_check_failed()

In [None]:
#%%RM_IF(PROD)%%
suicide_attacks['Province'] = suicide_attacks['Province'].str.lower()
q2.assert_check_passed()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
q2.hint()
#_COMMENT_IF(PROD)_
q2.solution()

# 3) Continue working with cities

In the tutorial, we focused on cleaning up inconsistencies in the "City" column.  Run the code cell below to view the list of unique values that we ended with.

In [None]:
# get all the unique values in the 'City' column
cities = suicide_attacks['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities

Take another look at the "City" column and see if there's any more data cleaning we need to do.

It looks like 'kuram agency' and 'kurram agency' should be the same city.  Correct the "City" column in the dataframe so that every match to 'kuram agency' appears instead as 'kurram agency'.

In [None]:
# TODO: Your code here!
____

# Check your answer
q3.check()

In [None]:
#%%RM_IF(PROD)%%
q3.assert_check_failed()

In [None]:
#%%RM_IF(PROD)%%
rows_with_matches = (suicide_attacks['City'] == 'kuram agency')
suicide_attacks.loc[rows_with_matches, 'City'] = 'kurram agency'
q3.assert_check_passed()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
q3.hint()
#_COMMENT_IF(PROD)_
q3.solution()

# (Optional) More practice

Do any other columns in this dataframe have inconsistent data entry? If you can find any, try to tidy them up.

You can also try reading in the `PakistanSuicideAttacks Ver 6 (10-October-2017).csv` file from this dataset and tidying up any inconsistent columns in that data file.
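If you'd like a starting point for hunting down inconsistent columns in either file, one possible approach is to group the raw values that become identical after lowercasing and stripping whitespace. The sketch below uses only the standard library; the helper name `find_case_whitespace_variants` is hypothetical, and the toy values are made up rather than taken from the dataset.

```python
from collections import defaultdict

def find_case_whitespace_variants(values):
    """Map each normalized value to the raw spellings that collapse onto it."""
    groups = defaultdict(set)
    for value in values:
        # skip missing values such as NaN, which are not strings
        if isinstance(value, str):
            groups[value.strip().lower()].add(value)
    # only report normalized values that have more than one raw spelling
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}

# made-up values for illustration
toy_values = ["Punjab", "punjab ", "Sindh", "KPK", "Sindh", float("nan")]
variants = find_case_whitespace_variants(toy_values)
print(variants)
```

To try it on the real data, you could pass in something like `suicide_attacks[column].unique()` for each text column. Note that it only catches differences in case and surrounding whitespace; misspellings and abbreviations still need a manual look or fuzzy matching.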

# Congratulations!

Congratulations on completing the **Data Cleaning** course on Kaggle Learn!

To practice your new skills, you're encouraged to download and investigate some of [Kaggle's Datasets](https://www.kaggle.com/datasets).