# Accommodations Task

***TASK:*** Using two sources (BOOKING & GOOGLE), identify accommodations that appear in both sources and merge them into a single consolidated list of accommodations.

<!-- RAW DATA -->
<!-- QGIS USED -->
<img src='Assets/photos/visualize_data_points.png' style="max-width: 50%">

## Setup

In [1]:
# !pip install geopandas
# !pip install thefuzz
# !pip install geopy

In [2]:
# Import libraries 
import geopandas as gpd
from thefuzz import fuzz
from geopy.distance import distance

In [3]:
# Load the data (geopandas can read the GeoJSON path directly)
df = gpd.read_file('Assets/data/accomodations-riyadh-exercise.geojson')

### Exploring & Cleaning

#### QGIS visualization

<!-- RAW DATA -->
<!-- QGIS USED -->
<img src='Assets/photos/visualize_data_points.png' style="max-width: 40%">

In [4]:
df.sample(5)

Unnamed: 0,U_ID,DATA_SOURCE,SOURCE_IDENTIFIER,FACILITY_NAME,geometry
1014,0,GOOGLE,ChIJ7fvoKmEDLz4R5-89l69s1g8,Fraser Suites Riyadh,POINT (46.69184 24.67921)
1188,0,GOOGLE,ChIJife6x50DLz4RDG90X7sNAPM,Suite Inn,POINT (46.71119 24.69406)
1498,0,GOOGLE,ChIJlUJ-nEkXLz4RFHdtWb1kD_E,The palm chalets,POINT (46.58615 24.55353)
210,3596,BOOKING,1251517,Mocador Aparthotel - Al Nuzha Branch,POINT (46.70107 24.75314)
1167,0,GOOGLE,ChIJwxQ4WRwHLz4RkH18UEBMZ_w,Liwan Gulf Hotel Suites,POINT (46.78982 24.68302)


In [5]:
df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 1578 entries, 0 to 1577
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   U_ID               1578 non-null   int32   
 1   DATA_SOURCE        1578 non-null   object  
 2   SOURCE_IDENTIFIER  1578 non-null   object  
 3   FACILITY_NAME      1578 non-null   object  
 4   geometry           1578 non-null   geometry
dtypes: geometry(1), int32(1), object(3)
memory usage: 55.6+ KB


In [6]:
df['DATA_SOURCE'].value_counts()

DATA_SOURCE
BOOKING    879
GOOGLE     699
Name: count, dtype: int64

In [7]:
# gdf = df.drop(columns=['U_ID'])
df['U_ID'].value_counts()


U_ID
0       699
1240      1
699       1
2547      1
2693      1
       ... 
2755      1
2888      1
2674      1
2727      1
4108      1
Name: count, Length: 880, dtype: int64

In [8]:
print(f'"0" as an id: {(df["U_ID"] == 0).sum()}')
print(f'"0" as an id FROM GOOGLE: {((df["U_ID"] == 0) & (df["DATA_SOURCE"] == "GOOGLE")).sum()}')

"0" as an id: 699
"0" as an id FROM GOOGLE: 699


In [9]:
# # Drop U_ID
# df = df.drop(columns=['U_ID'])

In [10]:
# Making sure SOURCE_IDENTIFIER is not needed
# df['SOURCE_IDENTIFIER'].value_counts().max()

In [11]:
# Drop SOURCE_IDENTIFIER
# df = df.drop(columns=['SOURCE_IDENTIFIER'])

In [12]:
df.sample(3)

Unnamed: 0,U_ID,DATA_SOURCE,SOURCE_IDENTIFIER,FACILITY_NAME,geometry
883,0,GOOGLE,ChIJ-6h4WHYDLz4RGg30Xeah6no,Garden City 1 Furnished Apartments,POINT (46.7066 24.69208)
1344,0,GOOGLE,ChIJhRszH1IdLz4RUCbIsXBcPmI,Farnas Furnished Apartment,POINT (46.66609 24.73268)
632,2633,BOOKING,6011521,فيفيان بارك للأجنحة الفندقية,POINT (46.78698 24.68336)


#### Handling duplicates

In [13]:
# Normalizing the text
df['FACILITY_NAME'] = df['FACILITY_NAME'].str.lower()

##### Removing exact duplicates

In [14]:
# Check exact duplicates inside Google rows
df[df["DATA_SOURCE"] == 'GOOGLE'].duplicated(subset=['FACILITY_NAME', 'geometry']).sum()

0

In [15]:
# Check exact duplicates inside Booking rows
df[df["DATA_SOURCE"] == 'BOOKING'].duplicated(subset=['FACILITY_NAME', 'geometry']).sum()

0

Neither data source contains exact duplicates within itself.

In [16]:
# Check exact duplicates across both data sources
df.duplicated(subset=['FACILITY_NAME', 'geometry']).sum()

0

There are no exact duplicates across the GOOGLE and BOOKING data sources based on `FACILITY_NAME` and `geometry`.

##### Removing near duplicates

In [21]:
threshold_similarity = 80  # threshold for similarity score
threshold_distance = 30   # threshold for spatial distance in meters

In [18]:
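def calculate_similarity(name1, name2):
    # Partial-ratio score (0-100): high when one name closely matches a substring of the other
    return fuzz.partial_ratio(name1, name2)

def calculate_distance(coords1, coords2):
    # Geodesic distance in meters; geopy expects (latitude, longitude) tuples
    return distance(coords1, coords2).meters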
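
To illustrate what a partial-ratio score rewards, here is a simplified, standard-library approximation. It is not thefuzz's implementation: `simple_partial_ratio` is a hypothetical helper that slides the shorter name across the longer one and keeps the best `difflib` ratio, and the example names are taken from the sample rows above.

```python
from difflib import SequenceMatcher


def simple_partial_ratio(a: str, b: str) -> float:
    """Best 0-100 similarity between the shorter string and any
    equally long window of the longer string (illustrative only)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0.0
    best = 0.0
    # Slide a window of len(shorter) across the longer string and keep the best score
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return 100 * best


# A name fully contained in another gets the maximum score
substring_score = simple_partial_ratio("suite inn", "suite inn riyadh")
# Unrelated names score well below the 80 threshold
unrelated_score = simple_partial_ratio("badeel", "white house")
print(substring_score, unrelated_score)
```

Because a contained name scores the maximum, a short generic name can match a longer unrelated one, which is one reason to combine the name score with a distance threshold.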

In [19]:
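# Reproject to a metric CRS so the distance threshold is expressed in meters.
# NOTE: EPSG:32632 is UTM zone 32N, but Riyadh (~46.7°E) lies in zone 38N (EPSG:32638), so distances here are distorted.
df.sindex
df = df.to_crs(epsg=32632)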
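
A distance threshold in meters is only meaningful in a projected CRS suited to the study area. The sketch below shows one way to choose a WGS 84 / UTM EPSG code from a longitude and latitude using the standard 6-degree zone rule. `utm_epsg` is a hypothetical helper, it ignores the special zones around Norway and Svalbard, and the test coordinates come from a sample row above.

```python
import math


def utm_epsg(lon: float, lat: float) -> int:
    """Return the WGS 84 / UTM EPSG code for a lon/lat pair (illustrative only)."""
    # UTM zones are 6 degrees wide, numbered 1..60 starting at 180°W
    zone = int(math.floor((lon + 180) / 6)) + 1
    zone = min(max(zone, 1), 60)  # clamp lon = 180 into zone 60
    # 326xx codes are northern-hemisphere zones, 327xx southern
    base = 32600 if lat >= 0 else 32700
    return base + zone


# Fraser Suites Riyadh from the sample output: POINT (46.69184 24.67921)
riyadh_epsg = utm_epsg(46.69184, 24.67921)
print(riyadh_epsg)
```

geopandas also provides `estimate_utm_crs()` on a GeoDataFrame, which may be a more convenient way to get a suitable CRS.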

In [22]:
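# Collect rows that have no near duplicate in the other data source
non_near_duplicates = []

for idx, row in df.iterrows():

    # Candidate rows whose geometry intersects the bounding box of a buffer around this row
    nearby_idx = list(df.sindex.intersection(row.geometry.buffer(threshold_distance).bounds))
    nearby = df.iloc[nearby_idx].copy()

    # Fuzzy name similarity between each candidate and the current row
    nearby['name_similarity'] = nearby['FACILITY_NAME'].apply(calculate_similarity, args=(row['FACILITY_NAME'],))

    # Near duplicates: similar name, within the distance threshold, from the other source
    filtered_nearby = nearby[
        (nearby['name_similarity'] >= threshold_similarity) &
        (nearby.geometry.distance(row.geometry) <= threshold_distance) &
        (nearby['DATA_SOURCE'] != row['DATA_SOURCE'])
    ]

    # Keep the row only if it has no cross-source match
    # (a row with a match is dropped, whichever source it comes from)
    if filtered_nearby.empty:
        non_near_duplicates.append(row.to_dict())

non_near_duplicates_df = gpd.GeoDataFrame(non_near_duplicates, crs=df.crs)
non_near_duplicates_df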

Unnamed: 0,U_ID,DATA_SOURCE,SOURCE_IDENTIFIER,FACILITY_NAME,geometry
0,3802,BOOKING,8606014,arid h -6 luxury gold balcony az79,POINT (4481536.22 3362349.288)
1,4236,BOOKING,6071063,badeel,POINT (4504859.21 3325241.355)
2,2922,BOOKING,5714353,180 executive suites alnarjes,POINT (4488998.136 3359441.188)
3,2561,BOOKING,2331549,al eairy apartments - al riyad 4,POINT (4509287.631 3329295.503)
4,2564,BOOKING,6026585,almakan suites 111,POINT (4505865.665 3329531.321)
...,...,...,...,...,...
1409,0,GOOGLE,ChIJ_ziB6_QDLz4R0ByM3wjdKP0,صيدلية بيت الصحة ( انوفا ) - health house phar...,POINT (4500163.949 3350120.522)
1410,0,GOOGLE,ChIJhWLNgj8FLz4RSwYLBsS4RCU,vivienda hotel villas accommodation,POINT (4496868.538 3333735.233)
1411,0,GOOGLE,ChIJT-Ue9eH_Lj4RJtUkKfY5aTs,عائلة البكري,POINT (4500823.34 3354025.228)
1412,0,GOOGLE,ChIJ3Sil7MoBLz4RNDJTI4kUVr4,white house,POINT (4503029.782 3350416.581)


In [24]:
non_near_duplicates_df.to_file('Assets/data/non_near_duplicates.geojson', driver='GeoJSON')

In [25]:
print(f'The total number of cleaned records is {len(non_near_duplicates_df)}')

The total number of cleaned records is 1414
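
The loop above keeps only rows that have no cross-source match. As a complementary illustration of the task statement, here is a minimal, standard-library sketch of merging matched pairs into one record instead of dropping them. Everything in it is an assumption for illustration: the record layout, the `consolidate` helper, the toy coordinates, and the use of a haversine distance and a `difflib` ratio as stand-ins for the geopandas distance and `fuzz.partial_ratio` used in the notebook.

```python
from difflib import SequenceMatcher
from math import asin, cos, radians, sin, sqrt


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters on a sphere of radius 6,371 km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))


def similarity(a, b):
    """0-100 name similarity (difflib stand-in for a fuzzy ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def consolidate(records, max_dist_m=30, min_sim=80):
    """Keep every BOOKING record, attach matching GOOGLE ids,
    then append the GOOGLE records that matched nothing."""
    booking = [r for r in records if r["source"] == "BOOKING"]
    google = [r for r in records if r["source"] == "GOOGLE"]
    used = set()  # indices of GOOGLE records already merged
    merged = []
    for b in booking:
        entry = dict(b, matched_ids=[])
        for i, g in enumerate(google):
            if i in used:
                continue
            close = haversine_m(b["lat"], b["lon"], g["lat"], g["lon"]) <= max_dist_m
            if close and similarity(b["name"], g["name"]) >= min_sim:
                entry["matched_ids"].append(g["id"])
                used.add(i)
        merged.append(entry)
    merged.extend(dict(g, matched_ids=[]) for i, g in enumerate(google) if i not in used)
    return merged


# Hypothetical toy records (ids and coordinates are made up for the example)
records = [
    {"id": "b1", "source": "BOOKING", "name": "suite inn", "lat": 24.69406, "lon": 46.71119},
    {"id": "b2", "source": "BOOKING", "name": "badeel", "lat": 24.60000, "lon": 46.80000},
    {"id": "g1", "source": "GOOGLE", "name": "suite inn", "lat": 24.69410, "lon": 46.71125},
    {"id": "g2", "source": "GOOGLE", "name": "white house", "lat": 24.70000, "lon": 46.70000},
]
consolidated = consolidate(records)
print(len(consolidated), [r["id"] for r in consolidated])
```

In this sketch the BOOKING record is kept as the representative of each matched pair; which source to prefer is a design choice the task statement leaves open.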
