# In this report,
### I will present a possible solution to reduce the number of unique city names in the dataset.

#### In particular, we will see:
- How to translate from Urdu to English using Python modules
- How to create pandas DataFrames from Wikipedia tables
- What fuzzy matching is and how to find the best match for noisy categorical data


In [None]:
import pandas as pd
%config IPCompleter.use_jedi = False

In [None]:
df = pd.read_csv('/kaggle/input/gufhtugu-publications-dataset-challenge/GP Orders - 5.csv')
df.head()

In [None]:
df.dropna(subset=["City"], inplace=True)
df.City.isna().sum()

In [None]:
df["City"] = df.City.astype(str)
raw_cities = df.iloc[:,[0,4]].copy()
raw_cities.columns = ["oid","city"]
raw_cities.city.value_counts()

In [None]:
N = 150

before_ev = raw_cities.city.nunique()
before_ev_p = raw_cities.city.value_counts()[:N].sum()/raw_cities.shape[0]

print(f"Initially we have {before_ev} unique values for city names.")
print(f"Top {N} cities with most orders are {before_ev_p:.2%} of the whole dataset")
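
The coverage figure above is the share of rows that belong to the `N` most frequent city values. Here is a minimal sketch of the same arithmetic in plain Python. The `orders` list and `top_n` are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical city column: 10 orders spread over 4 distinct spellings
orders = ["Lahore"] * 5 + ["Karachi"] * 3 + ["lahore"] + ["Khi"]

counts = Counter(orders)
top_n = 2

# Rows covered by the top_n most frequent values, as a fraction of all rows
coverage = sum(count for _, count in counts.most_common(top_n)) / len(orders)

# The ".2%" format spec multiplies the fraction by 100 and appends "%"
print(f"{len(counts)} unique values, top {top_n} cover {coverage:.2%} of the rows")
```

Since `coverage` is a fraction between 0 and 1, it has to be scaled by 100 (or formatted with `%`) before it is reported as a percentage.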

# Translation from Urdu to English

In [None]:
# Names without any Latin letter are assumed to be written in Urdu
mask = raw_cities.city.str.contains("[a-zA-Z]")
urdu_names = raw_cities[~mask].city.unique()
print("We have {} unique city names in Urdu, a sample is shown below:\n{}".format(len(urdu_names), urdu_names[:10]))
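
The mask above treats every value without a Latin letter as Urdu. A small sketch of what that test accepts is shown below, next to a stricter check against the Arabic Unicode block. The sample values are hypothetical, and the stricter check rests on the assumption that all Urdu letters fall in the range U+0600 to U+06FF.

```python
import re

# Hypothetical sample values; the real ones come from raw_cities.city
sample = ["Lahore", "لاہور", "karachi", "کراچی", "0300"]

has_latin = re.compile(r"[a-zA-Z]")
# Assumption: Urdu text is written with characters from U+0600 to U+06FF
has_arabic_script = re.compile(r"[\u0600-\u06FF]")

# Same rule as the notebook: anything without a Latin letter
no_latin = [s for s in sample if not has_latin.search(s)]

# Stricter rule: must contain at least one Arabic-script character
urdu_only = [s for s in sample if has_arabic_script.search(s)]

print(len(no_latin), len(urdu_only))
```

The first rule also picks up entries made only of digits or punctuation, so the count of "Urdu" names may be slightly inflated if such entries exist.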

#### Manually translating these names is time-consuming. We can use TextBlob, a popular module that provides basic translation.


In [None]:
from textblob import TextBlob
from time import sleep

In [None]:
to_en = {}
for x in urdu_names:
    # Sleep between calls so we do not exceed the request limit
    sleep(0.5)
    try:
        to_en[x] = TextBlob(x).translate().string
    except Exception:
        # Names that cannot be translated are left unchanged
        pass


In [None]:
len(to_en), to_en

#### TextBlob was able to translate 122 out of 156 names. As expected, most of the translations are correct.
We can now replace these names in the raw dataset.

In [None]:
tr = raw_cities.city.replace(to_en)
raw_cities["city"] = tr

after_tr = raw_cities.city.nunique()
after_tr_p = raw_cities.city.value_counts()[:N].sum()/raw_cities.shape[0]

print(f"After translation, we have {after_tr} unique values for city names.")
print(f"Top {N} cities with most orders are {after_tr_p:.2%} of the whole dataset.")
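
`Series.replace` with a dictionary swaps only the values that match a key exactly and leaves everything else untouched. The same behavior can be sketched in plain Python with `dict.get`. The names and translations below are illustrative and are not the actual contents of `to_en`.

```python
# Hypothetical translation table (Urdu spelling -> English spelling)
to_en_example = {"لاہور": "Lahore", "کراچی": "Karachi"}

# Hypothetical city column; the last entry has no translation available
cities_example = ["لاہور", "Lahore", "کراچی", "پشاور"]

# Look each name up, falling back to the original when there is no entry
translated = [to_en_example.get(name, name) for name in cities_example]

print(len(set(cities_example)), "->", len(set(translated)), "unique values")
```

Names that failed to translate stay in Urdu, which is why some unique values remain after this step.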

# Case Normalization
#### We will convert all city names to title case to remove differences caused by letter case.

In [None]:
raw_cities["city"] = raw_cities.city.str.strip().str.title()

after_cn = raw_cities.city.nunique()
after_cn_p = raw_cities.city.value_counts()[:N].sum()/raw_cities.shape[0]

print(f"After case normalization, we have {after_cn} unique values for city names.")
print(f"Top {N} cities with most orders are {after_cn_p:.2%} of the whole dataset.")
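
The pandas `.str.strip()` and `.str.title()` accessors apply Python's own `str.strip` and `str.title` to each value. A short sketch with made-up spellings shows how several variants collapse into one:

```python
# Hypothetical variants of the same two cities
variants = ["lahore", "LAHORE", " Lahore ", "Lahore", "dera ghazi khan"]

# Remove surrounding whitespace, then capitalize the first letter of each word
normalized = [v.strip().title() for v in variants]

print(sorted(set(normalized)))
```

Five raw spellings become two distinct names here. Misspellings are not affected by this step and still count as separate values.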

# Fuzzy matching
#### In this section I will try to correct spelling mistakes and extract city names from detailed addresses.  
In particular, I will  
- Create a list of cities using Wikipedia articles
- Match city entries from the raw data to the newly created list

I could not find a way to automatically replace abbreviations of city names with the correct names.  
I will manually replace abbreviations like {"lhr", "khi"} with their full names.

In [None]:
# List names with fewer than 5 characters
mask = raw_cities.city.str.len() < 5
raw_cities[mask].city.unique()

In [None]:
short_full = {
    "Rwp": "Rawalpindi",
    "Isb": "Islamabad",
    "Fsd": "Faisalabad",
    "Khi": "Karachi",
    "Lhr": "Lahore",
    "D I Khan": "Dera Ismail Khan",
    "G G Khan": "Dera Ghazi Khan"
}

raw_cities["city"] = raw_cities.city.replace(short_full)

In [None]:
from fuzzywuzzy import process, fuzz

In [None]:
# Create one DataFrame for each province

urls = {
    "kpk": "https://en.wikipedia.org/wiki/List_of_cities_in_Khyber_Pakhtunkhwa_by_population",
    "balochistan":"https://en.wikipedia.org/wiki/List_of_cities_in_Balochistan,_Pakistan_by_population",
    "punjab":"https://en.wikipedia.org/wiki/List_of_cities_in_Punjab,_Pakistan_by_population",
    "sindh": "https://en.wikipedia.org/wiki/List_of_cities_in_Sindh_by_population",
    "gb": "https://en.wikipedia.org/wiki/List_of_cities_in_Gilgit-Baltistan_by_population"
}

tabel_idx = {
    "kpk": 0,
    "balochistan":0,
    "punjab":0,
    "sindh": 0,
    "gb": 1
}

df_list = {}
for pname in urls:
    df_list[pname] = pd.read_html(urls[pname])[tabel_idx[pname]]

In [None]:
# Create list of cities and city-to-province mapping

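# Start with Islamabad, which is not part of any provincial list
cities = pd.Series(["Islamabad"], dtype=object)
city_prov = {"Islamabad": "Federal"}
for pname in df_list:
    p = df_list[pname]
    # Some tables call the column "City Name"; use "City" everywhere
    p.rename(columns={"City Name": "City"}, inplace=True)
    # Series.append was removed in pandas 2.0, so use pd.concat instead
    cities = pd.concat([cities, p.City], ignore_index=True)
    city_prov.update({city: pname.title() for city in p.City})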

#### To match two strings, we will use the weighted ratio score (details about the weighted ratio are [here](https://stackoverflow.com/questions/31806695/when-to-use-which-fuzz-function-to-compare-2-strings)).
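
To give a feel for what a similarity score looks like, here is a rough sketch that uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.WRatio`. It is not the same algorithm, so the numbers will differ from fuzzywuzzy's. The helpers `similarity` and `best_match` and the short `reference` list are hypothetical names introduced only for this illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score, ignoring case and outer whitespace."""
    a, b = a.strip().lower(), b.strip().lower()
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(name, choices):
    """Return the (choice, score) pair with the highest similarity to name."""
    scored = ((choice, similarity(name, choice)) for choice in choices)
    return max(scored, key=lambda pair: pair[1])


# Hypothetical reference list; the notebook builds its list from Wikipedia
reference = ["Lahore", "Karachi", "Islamabad", "Rawalpindi"]

match, score = best_match("Rawalpndi", reference)
print(match, score)
```

A misspelling that drops a single letter still scores close to 100 against the correct name, while unrelated names score much lower. This gap is what makes a score threshold usable.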

In [None]:
to_city = {}
for city in raw_cities.city.unique():
    res, score, _ = process.extractOne(city, cities, scorer = fuzz.WRatio)
    to_city[city] = f"{res};{score}"

In [None]:
new_names = raw_cities.city.replace(to_city)
raw_cities[["proposed_city_name", "similarity_score"]] = new_names.str.split(";", n = 1, expand = True)
raw_cities["similarity_score"] = raw_cities.similarity_score.astype(int)

raw_cities.head()

The real magic happens between scores of 85 and 90, where most spelling mistakes are corrected.  
The results include some wrong matches, which add noise to the data.  
The threshold can be set according to the problem at hand; here we will rename every city whose score is above 85.
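
The effect of a threshold can be sketched with `difflib.get_close_matches` from the standard library, which returns a candidate only when its similarity reaches a cutoff. This is a stand-in for the fuzzywuzzy workflow and not the method used in this notebook. Its cutoff is on a 0 to 1 scale, it is inclusive, and it is case-sensitive. The input names and the `reference` list are made up.

```python
from difflib import get_close_matches

# Hypothetical reference list of correct city names
reference = ["Lahore", "Karachi", "Islamabad", "Rawalpindi"]

# Hypothetical raw entries: two misspellings and one unrelated value
raw_names = ["Lahor", "Karachii", "Unknown Town"]

CUTOFF = 0.85  # roughly comparable to a score of 85 out of 100

cleaned = {}
for name in raw_names:
    # Returns a list with the best candidate, or an empty list below the cutoff
    matches = get_close_matches(name, reference, n=1, cutoff=CUTOFF)
    # Keep the original name when nothing is similar enough
    cleaned[name] = matches[0] if matches else name

print(cleaned)
```

Lowering the cutoff corrects more misspellings but also lets more wrong matches through, which is the trade-off described above.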

In [None]:
cols = ["city", "proposed_city_name", "similarity_score"]
mask_85 = raw_cities.similarity_score > 85
mask = mask_85 & (raw_cities.similarity_score < 90)
raw_cities[mask][cols]

In [None]:
# Rename every city whose best match scores above the threshold
raw_cities.loc[mask_85, "city"] = raw_cities.loc[mask_85, "proposed_city_name"]
raw_cities["fuzzy_name"] = mask_85

print("{} names are mapped to their fuzzy match.".format(raw_cities.fuzzy_name.sum()))

raw_cities.head()

In [None]:
final = raw_cities.city.nunique()
final_p = raw_cities.city.value_counts()[:N].sum()/raw_cities.shape[0]

print(f"Finally, we have {final} unique values for city names.")
print(f"Top {N} cities are {final_p:.2%} of the whole dataset.")

## Results

More than the fact that we have fewer unique values, it is satisfying to see that almost 88% of the dataset now has standard city names.  
We have made some errors, but the affected city names were rare in our dataset.  
The results can be further improved by expanding the reference list of true city names.


### If you find this report useful 🧐, please upvote ☝. Adios. 