# BabiGuide Data Cleaning Exploration

This notebook explores the data cleaning steps applied to the Abidjan local business dataset.

The **final cleaning steps** and the reproducible code are in the Python script `scripts/Cleaned_data.py`.

The **cleaned datasets** are saved in the `data/processed/` folder in both GeoJSON and CSV formats for further analysis.  

This notebook mainly serves as an **exploratory environment** to test, inspect, and validate the cleaning process before finalizing it in the script.  

Key steps included:
- Filtering and selecting relevant columns.
- Handling missing values (`name`, `amenity`, `shop`, `tourism`).
- Removing duplicates.
- Standardizing business types (amenity mapping and merging categories).
- Adding dummy reviews and ratings for analysis purposes.

In [34]:
import os
import geopandas as gpd

# Data path
file_path = "../data/raw/abidjan_pois.geojson"

# Check if file exists
if os.path.isfile(file_path):
    print("File found!")
    # Load raw data
    gdf = gpd.read_file(file_path)
    # Inspect first few rows
    print(gdf.head())
else:
    print("File not found. Check the path:", file_path)

File found!
            name name:en           amenity man_made  shop tourism  \
0          Powex    None              fuel     None  None    None   
1           None    None            police     None  None    None   
2  ASA Formation    None           college     None  None    None   
3    Lavage Auto    None          car_wash     None  None    None   
4   Orange Money    None  bureau_de_change     None  None    None   

  opening_hours  beds rooms addr:full addr:housenumber addr:street addr:city  \
0          None  None  None      None             None        None      None   
1          None  None  None      None             None        None      None   
2          None  None  None      None             None        None      None   
3          None  None  None      None             None        None      None   
4          None  None  None      None             None        None      None   

  source name:fr       osm_id osm_type                  geometry  
0   None    None  1193538

In [35]:
print("Original rows:", len(gdf))

Original rows: 55875


In [36]:
print("Columns:", gdf.columns)

Columns: Index(['name', 'name:en', 'amenity', 'man_made', 'shop', 'tourism',
       'opening_hours', 'beds', 'rooms', 'addr:full', 'addr:housenumber',
       'addr:street', 'addr:city', 'source', 'name:fr', 'osm_id', 'osm_type',
       'geometry'],
      dtype='object')


In [37]:
print("Missing values per column:\n", gdf.isna().sum())

Missing values per column:
 name                15605
name:en             55630
amenity             29305
man_made            53244
shop                29613
tourism             54151
opening_hours       54428
beds                55874
rooms               55863
addr:full           55844
addr:housenumber    55720
addr:street         54943
addr:city           55049
source              45858
name:fr             54995
osm_id                  0
osm_type                0
geometry                0
dtype: int64
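
The raw counts above are easier to compare as fractions of the total row count (for example, `beds` is empty in 55874 of 55875 rows). Below is a minimal sketch of that calculation on a made-up toy frame. The `toy` data and the 90% `threshold` are assumptions for illustration, not values used by the cleaning script.

```python
import pandas as pd

# Toy frame standing in for the raw GeoDataFrame (values are made up).
toy = pd.DataFrame({
    "name": ["Powex", None, "ASA Formation", None],
    "amenity": ["fuel", "police", None, None],
    "beds": [None, None, None, None],
    "osm_id": [1, 2, 3, 4],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Hypothetical cutoff: flag columns that are more than 90% empty.
threshold = 0.9
sparse_columns = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print("Mostly empty columns:", sparse_columns)
```

Running the same two lines on `gdf` instead of `toy` would give a quick shortlist of columns that carry too little information to keep.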


In [38]:
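# Keep only the columns needed downstream
columns_to_keep = ['name', 'amenity', 'shop', 'tourism', 'osm_id', 'osm_type', 'geometry']
gdf_clean = gdf[columns_to_keep]
# Drop rows without an amenity tag (this also removes shop-only and tourism-only rows)
gdf_clean = gdf_clean.dropna(subset=['amenity'])
# Remove exact duplicate rows, then renumber the index
gdf_clean = gdf_clean.drop_duplicates()
gdf_clean = gdf_clean.reset_index(drop=True)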

In [39]:
print("Number of rows before cleaning:", len(gdf))
print("Number of rows after cleaning:", len(gdf_clean))
print("Missing values per column:\n", gdf_clean.isnull().sum())
print("Columns in cleaned data:\n", gdf_clean.columns)

Number of rows before cleaning: 55875
Number of rows after cleaning: 26570
Missing values per column:
 name         5836
amenity         0
shop        26435
tourism     26562
osm_id          0
osm_type        0
geometry        0
dtype: int64
Columns in cleaned data:
 Index(['name', 'amenity', 'shop', 'tourism', 'osm_id', 'osm_type', 'geometry'], dtype='object')
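
One side effect of filtering on `amenity` alone is visible in the counts: the raw data has 26262 rows with a `shop` value (55875 minus 29613 missing), but only 135 remain after cleaning (26570 minus 26435 missing). The sketch below contrasts that filter with an alternative that keeps any row carrying at least one of the three tags. The `toy` frame is made up, and whether the alternative is preferable depends on the project's goals, which this notebook does not state.

```python
import pandas as pd

# Toy frame: one row per tagging pattern (values are made up).
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "amenity": ["fuel", None, None, None],
    "shop": [None, "bakery", None, None],
    "tourism": [None, None, "hotel", None],
})

# Same filter as the cleaning cell: keep only rows with an amenity.
amenity_only = toy.dropna(subset=["amenity"])

# Alternative: drop a row only when all three tags are missing.
any_tag = toy.dropna(subset=["amenity", "shop", "tourism"], how="all")

print("Rows kept with amenity-only filter:", len(amenity_only))
print("Rows kept with any-tag filter:", len(any_tag))
```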


In [40]:
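# Fill remaining gaps with placeholder strings
# ('None' here is the literal string, not Python's None)
gdf_clean['name'] = gdf_clean['name'].fillna('Unknown')
gdf_clean['shop'] = gdf_clean['shop'].fillna('None')
gdf_clean['tourism'] = gdf_clean['tourism'].fillna('None')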

In [41]:
print("Missing values per column:\n", gdf_clean.isnull().sum())

Missing values per column:
 name        0
amenity     0
shop        0
tourism     0
osm_id      0
osm_type    0
geometry    0
dtype: int64


In [53]:
import random

# Seed the standard-library generator so the dummy values are reproducible
random.seed(42)

# Add dummy review counts (integers from 0 to 500, inclusive)
gdf_clean['reviews'] = [random.randint(0, 500) for _ in range(len(gdf_clean))]

# Add dummy ratings (1.0 to 5.0 stars, rounded to one decimal)
gdf_clean['rating'] = [round(random.uniform(1, 5), 1) for _ in range(len(gdf_clean))]

# Check
print(gdf_clean[['name', 'amenity', 'reviews', 'rating']].head())

            name           amenity  reviews  rating
0          Powex              fuel      327     2.1
1        Unknown            police       57     1.0
2  ASA Formation           college       12     3.7
3    Lavage Auto          car_wash      379     2.7
4   Orange Money  bureau_de_change      140     4.4
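
Seeding the module-level generator affects every later call to `random` in the session. A minimal sketch of an alternative is shown below: a helper that owns its own `random.Random` instance, so the dummy columns stay reproducible regardless of what else runs in the notebook. The function name `make_dummy_scores` is hypothetical and not part of the project script.

```python
import random


def make_dummy_scores(n_rows, seed=42):
    """Return (reviews, ratings) lists of length n_rows from a private generator."""
    rng = random.Random(seed)  # independent of the global random state
    # Review counts: integers from 0 to 500, inclusive.
    reviews = [rng.randint(0, 500) for _ in range(n_rows)]
    # Ratings: floats from 1.0 to 5.0, rounded to one decimal.
    ratings = [round(rng.uniform(1, 5), 1) for _ in range(n_rows)]
    return reviews, ratings


reviews, ratings = make_dummy_scores(5)
print(reviews)
print(ratings)
```

In the notebook this could be used as `gdf_clean['reviews'], gdf_clean['rating'] = make_dummy_scores(len(gdf_clean))`.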


In [43]:
print(gdf_clean['amenity'].unique())

['fuel' 'police' 'college' 'car_wash' 'bureau_de_change' 'restaurant'
 'community_centre' 'events_venue' 'bank' 'place_of_worship' 'food_court'
 'pub' 'post_office' 'school' 'courthouse' 'bus_station' 'ferry_terminal'
 'clinic' 'marketplace' 'pharmacy' 'university' 'weighbridge' 'doctors'
 'money_transfer' 'bicycle_parking' 'taxi' 'nightclub' 'parking'
 'water_point' 'bar' 'toilets' 'cafe' 'prison' 'driving_school' 'mortuary'
 'ice_cream' 'waste_basket' 'hospital' 'shelter' 'townhall'
 'public_building' 'government' 'nursing_home' 'products' 'atm'
 'waste_disposal' 'internet_cafe' 'drinking_water' 'monastery' 'fast_food'
 'motorcycle_repair' 'social_facility' 'garage auto' 'social_centre'
 'veterinary' 'bench' 'car_rental' 'kindergarten' 'bicycle_repair_station'
 'recycling' 'health_post' 'Salon de coiffure DAME' 'vending_machine'
 'Magasin de meche' 'telephone' 'parking_space' 'post_box'
 'Maria coiffure' 'Garage auto' 'library' 'motocycle_repair' 'stripclub'
 'arts_centre' 'baby_hatc

In [44]:
print(gdf_clean['shop'].unique())

['None' 'yes' 'no' 'beverages' 'hardware' 'car_repair' 'coffee'
 'motorcycle_repair' 'bakery' 'beauty' 'fuel' 'Kiosque café'
 'Buvette traditionnelle' 'money_transfer' 'alcohol' 'kiosk' 'tattoo'
 'computer' 'dry_cleaning' 'Maquis kaplin' 'pastry' 'funeral_directors'
 'religion' 'supermarket' 'ice_cream' 'copyshop' 'chemist' 'car' 'music'
 'optician' 'orange money' 'seafood' 'jewelry']
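
The label lists above contain near-duplicates that differ only in case or spacing (`'garage auto'` and `'Garage auto'`) and at least one misspelling (`'motocycle_repair'`). A minimal sketch of a normalization step follows. The helper `normalize_label` and the `corrections` dictionary are hypothetical, and adopting them would also require updating any list that contains the old spellings, such as the `'garage auto'` entry in `business_amenities`.

```python
# Hypothetical spelling fixes, applied after case and spacing are normalized.
corrections = {
    "motocycle_repair": "motorcycle_repair",
}


def normalize_label(label):
    """Lowercase a tag, collapse whitespace to underscores, fix known typos."""
    cleaned = "_".join(label.strip().lower().split())
    return corrections.get(cleaned, cleaned)


for raw in ["garage auto", "Garage auto", "motocycle_repair", "orange money"]:
    print(repr(raw), "->", repr(normalize_label(raw)))
```

On a pandas column this could be applied with `df["amenity"].map(normalize_label, na_action="ignore")` so that missing values are left untouched.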


In [45]:
print(gdf_clean['tourism'].unique())

['None' 'attraction' 'hotel']


In [46]:
import pandas as pd

df = gdf_clean.copy()

# Replace 'None' strings with real NA
df["amenity"] = df["amenity"].replace("None", pd.NA)
df["shop"] = df["shop"].replace("None", pd.NA)
df["tourism"] = df["tourism"].replace("None", pd.NA)

In [47]:
# Define which amenities are considered local businesses vs non-business/public
business_amenities = [
    'fuel', 'car_wash', 'bureau_de_change', 'restaurant', 'bank', 'food_court',
    'pub', 'clinic', 'pharmacy', 'doctors', 'money_transfer', 'nightclub',
    'bar', 'marketplace', 'driving_school', 'ice_cream', 'atm', 'internet_cafe',
    'cafe', 'fast_food', 'motorcycle_repair', 'garage auto', 'veterinary',
    'car_rental', 'bicycle_repair_station', 'stripclub', 'studio', 'boat_rental',
    'coworking_space', 'cinema', 'dentist', 'brothel', 'casino',
    'mobile_money_agent', 'shipping', 'car_sharing', 'microfinance_bank',
    'theatre', 'music_school', 'conference_centre', "O'TOPAZ, Pâtisserie",
    'charging_station', 'cars', 'tattoos', 'Pressing', 'animal_breeding', 'taxi', 'parking',
    'parking_space', 'motorcycle_parking'
]

non_business_amenities = [
    'college', 'school', 'university', 'kindergarten', 'prep_school', 'language_school',
    'community_centre', 'social_facility', 'social_centre', 'public_building', 'government',
    'townhall', 'reception_desk', 'research_institute', 'Etablissements sanitaires public',
    'place_of_worship', 'monastery', 'church', 'police', 'fire_station', 'ranger_station',
    'prison', 'mortuary', 'nursing_home', 'health_post', 'health_facility', 'medical_imaging',
    'water_point', 'drinking_water', 'waste_basket', 'toilets', 'recycling',
    'waste_transfer_station', 'sanitary_dump_station', 'watering_place', 'arts_centre', 'library', 'baby_hatch',
    'fountain', 'hookah_lounge', 'shelter', 'grave_yard', 'hunting_stand',
    'first_aid', 'fédération', 'transportation', 'clock', 'Administration prive', 'Garba'
]



In [48]:
# Clean shops: treat uninformative "yes"/"no" values as missing
df["shop"] = df["shop"].replace({"yes": pd.NA, "no": pd.NA})

# Tourism options
include_attraction = False  # change to True if you want to keep attractions
if include_attraction:
    business_tourism = ["hotel", "attraction"]
else:
    business_tourism = ["hotel"]

# Create a unified "business_type" column.
# Priority: shop tag first, then a business amenity, then a business tourism tag.
def get_business_type(row):
    if pd.notna(row["shop"]):
        return row["shop"]
    elif pd.notna(row["amenity"]) and row["amenity"] in business_amenities:
        return row["amenity"]
    elif pd.notna(row["tourism"]) and row["tourism"] in business_tourism:
        return row["tourism"]
    else:
        # No usable tag: the row is dropped below
        return None

df["business_type"] = df.apply(get_business_type, axis=1)

# Drop rows without a business type
df = df.dropna(subset=["business_type"])
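
The priority rule in `get_business_type` (shop first, then a whitelisted amenity, then a whitelisted tourism tag) can also be written without a row-wise `apply`. The sketch below shows one column-wise way to do it on a made-up toy frame. The short whitelists and the `toy` data are assumptions for illustration, and this is an alternative formulation, not the code used in the script.

```python
import pandas as pd

# Shortened, hypothetical whitelists for the toy example.
business_amenities = ["fuel", "restaurant"]
business_tourism = ["hotel"]

toy = pd.DataFrame({
    "shop": ["bakery", None, None, None],
    "amenity": ["restaurant", "fuel", "police", "police"],
    "tourism": [None, None, "hotel", "attraction"],
})

# Keep a tag only where it is whitelisted; other entries become missing.
amenity_part = toy["amenity"].where(toy["amenity"].isin(business_amenities))
tourism_part = toy["tourism"].where(toy["tourism"].isin(business_tourism))

# combine_first fills missing values from the next source, in priority order.
toy["business_type"] = (
    toy["shop"].combine_first(amenity_part).combine_first(tourism_part)
)

print(toy)
```

Rows with no usable tag end up with a missing `business_type`, so the same `dropna(subset=["business_type"])` step applies afterwards.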

In [49]:
print("Remaining rows after filtering:", len(df))
print(df["business_type"].value_counts().head(20))

Remaining rows after filtering: 17568
business_type
restaurant        4219
pub               3441
money_transfer    1930
cafe              1646
pharmacy           967
bank               756
doctors            694
fuel               694
bar                581
car_wash           471
internet_cafe      407
marketplace        331
clinic             297
fast_food          210
driving_school     113
nightclub          108
food_court          98
ice_cream           86
parking             72
atm                 49
Name: count, dtype: int64


In [50]:
# Count each business_type
counts = df['business_type'].value_counts()

# Show only those with fewer than 10 occurrences
rare_types = counts[counts < 10]
print(rare_types)


business_type
bakery                    9
pastry                    9
motorcycle_parking        7
casino                    7
brothel                   5
beverages                 5
animal_breeding           5
parking_space             4
car_repair                4
beauty                    4
chemist                   4
charging_station          3
copyshop                  3
optician                  3
stripclub                 2
alcohol                   2
boat_rental               2
microfinance_bank         2
religion                  2
computer                  2
music_school              2
Kiosque café              2
Buvette traditionnelle    1
coffee                    1
hardware                  1
coworking_space           1
shipping                  1
dry_cleaning              1
Maquis kaplin             1
tattoo                    1
funeral_directors         1
theatre                   1
car_sharing               1
supermarket               1
car                       1
music 

In [51]:
# Mapping for fixing / merging business types
mapping = {
    # Merge into restaurant
    'bakery': 'restaurant',
    'pastry': 'restaurant',
    'Maquis kaplin': 'restaurant',
    "O'TOPAZ, Pâtisserie": 'restaurant',
    'seafood': 'restaurant',

    # Merge into cafe
    'coffee': 'cafe',
    'Kiosque café': 'cafe',
    'Buvette traditionnelle': 'cafe',

    # Merge into bar
    'alcohol': 'bar',
    'beverages': 'bar',

    # Merge into pharmacy
    'chemist': 'pharmacy',
    'optician': 'pharmacy',

    # Merge into internet cafe
    'copyshop': 'internet_cafe',
    'computer': 'internet_cafe',

    # Merge into bank
    'microfinance_bank': 'bank',

    # Merge into mobile money agent
    'orange money': 'mobile_money_agent',

    # Merge into marketplace
    'hardware': 'marketplace',
    'jewelry': 'marketplace',
    'supermarket': 'marketplace',

    # Merge into studio
    'tattoo': 'studio',
    'music_school': 'studio',

    # Merge into cinema
    'theatre': 'cinema',

    # Merge into car_rental
    'car_sharing': 'car_rental',

    # Merge into parking
    'motorcycle_parking': 'parking',
    'parking_space': 'parking',

    # Merge into clinic
    'beauty': 'clinic',

    # Merge into veterinary
    'animal_breeding': 'veterinary'

}

to_drop = [
    'casino',
    'brothel',
    'stripclub',
    'religion',
    'car',
    'music',
    'shipping',
    'dry_cleaning',
    'funeral_directors',
    'coworking_space',
    'conference_centre',
    'charging_station',
    'boat_rental',
]

# Apply mapping + dropping
df['business_type'] = df['business_type'].replace(mapping)
df = df[~df['business_type'].isin(to_drop)]

# Check result counts again
print(df['business_type'].value_counts())


business_type
restaurant                4240
pub                       3441
money_transfer            1930
cafe                      1650
pharmacy                   974
bank                       758
fuel                       694
doctors                    694
bar                        588
car_wash                   471
internet_cafe              412
marketplace                334
clinic                     301
fast_food                  210
driving_school             113
nightclub                  108
food_court                  98
ice_cream                   86
parking                     83
atm                         49
bicycle_repair_station      46
dentist                     41
car_rental                  39
bureau_de_change            34
kiosk                       30
veterinary                  25
mobile_money_agent          23
studio                      20
taxi                        17
motorcycle_repair           14
cinema                      13
car_repair               

In [52]:
print("Remaining rows after filtering:", len(df))

Remaining rows after filtering: 17540
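
A merge mapping and a drop list like the ones in cell 51 are easy to get out of sync with the data. Below is a minimal sketch of three sanity checks, run on shortened made-up inputs. The variable names `observed`, `overlap`, `unused_keys`, and `dropped_targets` are hypothetical.

```python
# Shortened, made-up stand-ins for the real mapping, drop list, and data.
mapping = {"bakery": "restaurant", "coffee": "cafe", "theatre": "cinema"}
to_drop = ["casino", "religion"]
observed = {"bakery", "coffee", "casino", "restaurant", "cafe"}

# 1. A label should not be both remapped and dropped.
overlap = set(mapping) & set(to_drop)

# 2. Mapping keys that never occur in the data may be typos or stale entries.
unused_keys = set(mapping) - observed

# 3. A merge target that is itself in the drop list would silently discard rows.
dropped_targets = set(mapping.values()) & set(to_drop)

print("Remapped and dropped:", sorted(overlap))
print("Mapping keys not in data:", sorted(unused_keys))
print("Merge targets that get dropped:", sorted(dropped_targets))
```

In the notebook, `observed` could be built with `set(df["business_type"].unique())` before the mapping is applied.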
