# Step.1 Processing Geolocation Dataset:  
Processing the geolocation dataset is time-consuming, so it is done once here as a separate step. The result is saved as **'olist_geolocation_dataset_2.0'**, and all later uses of geolocation data load from version 2.0.

**1.1 Import necessary libraries**

In [39]:
import pandas as pd
# fuzzy string matching, used to detect misspelled city names
from fuzzywuzzy import fuzz
import unidecode

**1.2 Read the geolocation dataset**

In [42]:
df_geolocation = pd.read_csv("olist_geolocation_dataset.csv")

**1.3 Combining rows with same zipcode:**  
Each zip code prefix appears with more than one latitude/longitude pair, so the coordinates are averaged to leave a single row per zip code prefix. The state and city are taken from the first row of each group.

In [45]:
df_geolocation = df_geolocation.groupby('geolocation_zip_code_prefix').agg({
    'geolocation_lat': 'mean',
    'geolocation_lng': 'mean',
    # keep the first state seen for this zip code prefix
    'geolocation_state': 'first',
    # keep the first city seen for this zip code prefix
    'geolocation_city': 'first'
}).reset_index()
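To make the effect of this aggregation concrete, here is a minimal sketch on a hand-made toy frame. The values below are invented for illustration and are not taken from the Olist data; only the column names match the dataset.

```python
import pandas as pd

# Toy data (hypothetical values): two rows share zip code prefix 1001
toy = pd.DataFrame({
    'geolocation_zip_code_prefix': [1001, 1001, 1002],
    'geolocation_lat': [-23.50, -23.60, -22.90],
    'geolocation_lng': [-46.60, -46.70, -43.20],
    'geolocation_state': ['SP', 'SP', 'RJ'],
    'geolocation_city': ['sao paulo', 'são paulo', 'rio de janeiro'],
})

# Same aggregation as above: mean of the coordinates, first state and city
collapsed = toy.groupby('geolocation_zip_code_prefix').agg({
    'geolocation_lat': 'mean',
    'geolocation_lng': 'mean',
    'geolocation_state': 'first',
    'geolocation_city': 'first'
}).reset_index()

print(collapsed)
```

The two rows for prefix 1001 collapse into one row whose coordinates are the mean of the two points. Because `'first'` keeps whichever city spelling appears first, inconsistent spellings survive this step and still need the normalization in 1.4.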

**1.4 Processing spelling inconsistencies in 'City' Column:**  
In the city column, the same city can appear under several spellings. For example, São Paulo, Sao Paulo, and San Paulo refer to the same city but may be written differently in the dataset. To keep the data consistent, the city names are first normalized by removing accents, and near-identical spellings are then mapped to a single name.

In [48]:
# Normalize function to remove accents and diacritical marks, useful for standardising text data
def normalize_string(s):
    return unidecode.unidecode(s)
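For readers without `unidecode` installed, the accent-removal idea can be sketched with the standard library alone. This is a swapped-in technique, not the function used above: `strip_accents` is a hypothetical helper that only removes combining marks, whereas `unidecode` also transliterates non-Latin characters.

```python
import unicodedata

def strip_accents(text):
    """Remove accents by decomposing characters and dropping combining marks."""
    # NFKD splits 'ã' into 'a' followed by a combining tilde
    decomposed = unicodedata.normalize('NFKD', text)
    # Keep only the characters that are not combining marks
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('São Paulo'))  # → Sao Paulo
```

Either approach makes 'São Paulo' and 'Sao Paulo' compare as equal, which is the property the later matching step relies on.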

In [50]:
# Get the unique city names
unique_cities = df_geolocation['geolocation_city'].unique()

# Apply the normalize_string function to the 'geolocation_city' column
df_geolocation['normalized_city'] = df_geolocation['geolocation_city'].apply(normalize_string)

# Function to find similar cities (use normalized version for both)
def find_similar_cities(city_name, unique_normalized_cities, threshold=95):
    similar_cities = []
    for other_city in unique_normalized_cities:
        # Compare the normalized city names directly (no need to normalize again)
        similarity = fuzz.ratio(city_name, other_city)
        if similarity >= threshold:
            similar_cities.append(other_city)
    return similar_cities

# Create a dictionary to store the mapping of misspellings to corrected city names
city_mapping = {}

# Get the unique normalized city names from the DataFrame (no need to normalize again)
unique_normalized_cities = df_geolocation['normalized_city'].unique()

# Count how often each normalized spelling occurs (one row per zip code prefix)
city_counts = df_geolocation['normalized_city'].value_counts()

# Iterate through the unique normalized cities to find similar ones
for city in unique_normalized_cities:
    similar_cities = find_similar_cities(city, unique_normalized_cities)
    if len(similar_cities) > 1:  # the list always contains `city` itself
        # Assumption: treat the most frequent spelling as the correct one
        # (ties broken alphabetically), so every variant maps in one direction
        corrected_city = min(similar_cities, key=lambda x: (-city_counts[x], x))
        if city != corrected_city:
            city_mapping[city] = corrected_city  # map this variant to the chosen spelling

# Create a new column with the corrected city names
df_geolocation['corrected_city'] = df_geolocation['normalized_city'].map(city_mapping).fillna(df_geolocation['normalized_city'])
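The mapping logic can be tried in isolation with a small standard-library sketch. It assumes `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio` (scores may differ slightly from fuzzywuzzy), and `similarity`, `build_city_mapping`, and the sample city list are hypothetical names and data introduced only for this illustration. Choosing the most frequent spelling as the canonical one is likewise an assumption.

```python
from collections import Counter
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score, a rough stand-in for fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def build_city_mapping(cities, threshold=95):
    """Map each rare spelling to the most frequent similar spelling."""
    counts = Counter(cities)
    names = sorted(counts)
    mapping = {}
    for name in names:
        # Candidates always include `name` itself (similarity 100)
        candidates = [other for other in names
                      if similarity(name, other) >= threshold]
        # Most frequent candidate wins; ties are broken alphabetically
        canonical = min(candidates, key=lambda c: (-counts[c], c))
        if canonical != name:
            mapping[name] = canonical
    return mapping

# Hypothetical sample: one misspelling among already-normalized names
cities = ['florianopolis', 'florianopolis', 'florianopolis',
          'florianopoli', 'curitiba']
mapping = build_city_mapping(cities)
print(mapping)  # → {'florianopoli': 'florianopolis'}
```

With a threshold of 95, only spellings that differ by roughly one character in a long name are merged. Shorter names with a single differing letter can fall below the threshold, so the threshold may need tuning against the real data.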

**1.5 Save as a new file named 'olist_geolocation_dataset_2.0.csv'**

In [15]:
df_geolocation.to_csv('olist_geolocation_dataset_2.0.csv', index=False)