# Matching Feature
* The purpose of this notebook is to design a matching feature that finds the kindergartens most similar to the data entered by the user
* This feature was developed and deployed as one of many features of the [Kiddy](https://github.com/MaysaM-M-Mousa/GraduationProject-Backend) graduation project
* In this feature, we use the [Euclidean Distance](https://en.wikipedia.org/wiki/Euclidean_distance) as the basis for measuring the similarity between kindergartens

## Table of Contents
* EDA
* Preprocessing
* Finding Similarities
* Evaluation & Testing
* Trying New Data Input

In [1]:
# importing necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

from datetime import datetime
from datetime import timedelta

import json
import joblib

In [2]:
# forward slash avoids the invalid '\k' escape sequence and works on every OS
original_df = pd.read_csv('Date/kindergarten-all.csv')

In [3]:
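# pd.to_datetime parses the date strings; unit-less astype('datetime64') is rejected by recent pandas versions
original_df['start_date'] = pd.to_datetime(original_df['start_date'])
original_df['registration_expiration'] = pd.to_datetime(original_df['registration_expiration'])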

In [4]:
dates_df = original_df[['id', 'start_date', 'registration_expiration']].set_index('id').copy()
dates_df

id,start_date,registration_expiration
2,2023-02-25,2023-03-25
9,2022-12-31,2023-01-19
8,2023-01-20,2023-02-04
14,2023-01-20,2023-02-04
13,2023-02-20,2023-03-07
...,...,...
272,2022-12-26,2023-01-07
273,2023-02-02,2023-02-14
274,2022-12-28,2023-01-09
275,2023-01-25,2023-02-06


In [5]:
df = original_df.copy()
df

Unnamed: 0,id,name,location_formatted,latitude,longitude,email,phone,country,city,website,about,createdAt,id.1,start_date,end_date,registration_expiration,name.1,tuition,createdAt.1,kindergartenId
0,2,Al- Aqsa Kindergarten,"803 Nablus, Palestinian Territory",32.237197,35.252466,Aqsa@gmail.com,594177742,Palestine,Nablus,aqsa.edu,Nice Kindergarten,2022-09-26 14:02:16,5,2023-02-25,2023-07-25,2023-03-25,2023 First,100,2022-11-30 18:13:04,2
1,9,Al-Makhfeya,"Palestine, Nablus",32.217492,35.236420,makhfeya@edu.com,45342189,Palestine,Nablus,www.jaberi.com,summary,2022-12-01 04:32:32,6,2022-12-31,2023-04-09,2023-01-19,2023 First,350,2022-12-01 04:33:32,9
2,8,Al-Jaberi Kindergarten,"Palestine, Nablus",32.221399,35.238845,jaberi@edu.com,123456789,Palestine,Nablus,www.jaberi.com,summary,2022-12-01 04:31:32,7,2023-01-20,2023-05-12,2023-02-04,2022-2023 First,320,2022-12-01 04:34:10,8
3,14,Ammany,"Jordann, Amman",31.934158,35.930048,ammany@edu.com,56497542,Jordan,Amman,www.ammany.com,summary,2022-12-01 04:43:46,8,2023-01-20,2023-05-12,2023-02-04,2022-2023 First,150,2022-12-01 04:46:34,14
4,13,Amman,"Jordan, Amman",31.899435,35.212263,amman@edu.com,56497542,Jordan,Amman,www.amman.com,summary,2022-12-01 04:40:09,9,2023-02-20,2023-06-23,2023-03-07,2023 First,150,2022-12-01 04:47:31,13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263,272,Nuwaybi‘a kindergarten,4937 Old Gate Junction,34.653433,28.973444,rscurryk@gmail.com,8139837451,Egypt,Nuwaybi‘a,,summary,2022-06-28 18:05:00,273,2022-12-26,2023-04-25,2023-01-07,2023 Semester,331,2023-01-25 14:56:00,272
264,273,Ibshawāy kindergarten,257 Laurel Point,30.681616,29.359413,mdobbynl@gmail.com,9313134099,Egypt,Ibshawāy,,summary,2022-05-23 00:23:00,274,2023-02-02,2023-06-04,2023-02-14,2023 Semester,262,2023-03-04 08:10:00,273
265,274,Bilbays kindergarten,39 Starling Pass,31.562118,30.415838,mhaythornem@gmail.com,2907423757,Egypt,Bilbays,,summary,2019-03-29 17:31:00,275,2022-12-28,2023-04-27,2023-01-09,2023 Semester,377,2023-01-27 18:29:00,274
266,275,Bi’r al ‘Abd kindergarten,95 Dakota Circle,28.247778,31.014722,lkinkaidn@gmail.com,8655943585,Egypt,Bi’r al ‘Abd,,summary,2019-01-02 17:25:00,276,2023-01-25,2023-05-25,2023-02-06,2023 Semester,285,2023-02-24 04:31:00,275


# EDA

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 268 entries, 0 to 267
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   id                       268 non-null    int64         
 1   name                     268 non-null    object        
 2   location_formatted       268 non-null    object        
 3   latitude                 268 non-null    float64       
 4   longitude                268 non-null    float64       
 5   email                    268 non-null    object        
 6   phone                    268 non-null    int64         
 7   country                  268 non-null    object        
 8   city                     268 non-null    object        
 9   website                  21 non-null     object        
 10  about                    268 non-null    object        
 11  createdAt                268 non-null    object        
 12  id.1                     268 non-nul

In [7]:
df.describe()

Unnamed: 0,id,latitude,longitude,phone,id.1,tuition,kindergartenId
count,268.0,268.0,268.0,268.0,268.0,268.0,268.0
mean,158.570896,34.57349,33.181202,4934170000.0,141.41791,302.283582,158.570896
std,82.782637,4.685456,3.551576,2781518000.0,79.137135,106.3849,82.782637
min,2.0,28.247778,16.701895,12515420.0,5.0,100.0,2.0
25%,89.75,31.899435,32.031686,2826187000.0,72.75,205.75,89.75
50%,161.5,32.223331,35.04529,4777875000.0,140.5,312.0,161.5
75%,229.25,35.895902,35.21569,7205504000.0,210.25,387.5,229.25
max,298.0,49.912345,36.890343,9947318000.0,277.0,499.0,298.0


# Preprocessing

### 1. Dropping unnecessary columns
To find the kindergartens most similar to the user's input, we rely on the following features:
* `latitude` 
* `longitude`
* `country`
* `city`
* `tuition`

In [8]:
df.drop(columns=['name', 'email', 'phone', 'website', 'about', 'createdAt', 'location_formatted',
                 'id.1', 'name.1', 'createdAt.1','kindergartenId', 'end_date', 'start_date', 'registration_expiration'],
       inplace=True)
df

Unnamed: 0,id,latitude,longitude,country,city,tuition
0,2,32.237197,35.252466,Palestine,Nablus,100
1,9,32.217492,35.236420,Palestine,Nablus,350
2,8,32.221399,35.238845,Palestine,Nablus,320
3,14,31.934158,35.930048,Jordan,Amman,150
4,13,31.899435,35.212263,Jordan,Amman,150
...,...,...,...,...,...,...
263,272,34.653433,28.973444,Egypt,Nuwaybi‘a,331
264,273,30.681616,29.359413,Egypt,Ibshawāy,262
265,274,31.562118,30.415838,Egypt,Bilbays,377
266,275,28.247778,31.014722,Egypt,Bi’r al ‘Abd,285


In [9]:
# df['start_date'] = df['start_date'].apply(lambda start_date: 1 if (start_date > datetime.now()) else 0)
# df['registration_expiration'] = df['registration_expiration'].apply(lambda registration_expiration: 1 if (registration_expiration > datetime.now()) else 0)

### 2. Defining One-Hot-Encoders for `city` and `country` features

In [10]:
city_encoder = OneHotEncoder()
country_encoder = OneHotEncoder()
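
A default `OneHotEncoder` raises an error at transform time when it meets a category it did not see during fitting, which could happen if a user enters a city that is absent from the dataset. The sketch below, on toy data with illustrative city names, shows one possible way to tolerate that with `handle_unknown='ignore'`. The name `tolerant_encoder` is hypothetical and not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training data; the city names are illustrative only.
train = pd.DataFrame({'city': ['Nablus', 'Amman', 'Beirut']})

# handle_unknown='ignore' encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
tolerant_encoder = OneHotEncoder(handle_unknown='ignore')
tolerant_encoder.fit(train[['city']])

known = tolerant_encoder.transform(pd.DataFrame({'city': ['Nablus']})).toarray()
unseen = tolerant_encoder.transform(pd.DataFrame({'city': ['Ramallah']})).toarray()

print(known)   # a one-hot row with a single 1
print(unseen)  # a row of zeros
```

With this setting an unknown city simply contributes nothing to the city block of the feature vector, so the match falls back on location, country, and tuition.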

##### 2.1 Fitting city encoder to cities in our dataset

In [11]:
city_encoder.fit(df[['city']])
city_encoder.categories_

[array(['Abha', 'Ad Dīwānīyah', 'Akhtarīn', 'Al Balyanā', 'Al Buq‘ah',
        'Al Bīrah', 'Al Fayyūm', 'Al Hufūf', 'Al Judayrah', 'Al Jumūm',
        'Al Jīb', 'Al Karmil', 'Al Karāmah', 'Al Khafjī',
        'Al Lubban al Gharbī', 'Al Majd', 'Al Mazra‘ah ash Sharqīyah',
        'Al Midyah', 'Al Mughayyir', 'Al Muwayh', 'Al Muţayrifī',
        'Al Qanāţir al Khayrīyah', 'Al Qarārah', 'Al ‘Awjah', 'Al ‘Awjā',
        'Al ‘Ayzarīyah', 'Al ‘Ulá', 'Amman', 'An Naşr', 'An Naşşārīyah',
        'An Nāşirīyah', 'Aqaba', 'Ar Ramthā', 'Ar Rass', 'Ar Ruţbah',
        'Ar Ruḩaybah', 'As Salţ', 'As Sulayyil', 'Ash Shaddādah',
        'Ash Shuhadā’', 'Ash Shuyūkh', 'Ashmūn', 'Az Zarqā', 'Azun Atme',
        'Aţ Ţafīlah', 'Aţ Ţaybah', 'Baghdad', 'Balīlā', 'Banhā', 'Banān',
        'Banī Mazār', 'Bardalah', 'Batroûn', 'Bayt Maqdūm', 'Bayt Ta‘mar',
        'Bayt Ūmmar', 'Bayt ‘Īnūn', 'Bayt ‘Ūr at Taḩtā', 'Baytā al Fawqā',
        'Bazzāryah', 'Beirut', 'Bent Jbaïl', 'Bethlehem', 'Bilbays',
        'Bil

##### 2.2 Transforming cities to OHE vectors

In [12]:
encoded_cities = city_encoder.transform(df[['city']]).toarray()
encoded_cities.shape

(268, 187)

In [13]:
encoded_cities

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

##### 2.3 Fitting country encoder to countries in our dataset

In [14]:
country_encoder.fit(df[['country']])
country_encoder.categories_

[array(['Egypt', 'Iraq', 'Jordan', 'Lebanon', 'Palestine', 'Saudi Arabia',
        'Syria'], dtype=object)]

##### 2.4 Transforming countries to OHE vectors

In [15]:
encoded_countries = country_encoder.transform(df[['country']]).toarray()
encoded_countries.shape

(268, 7)

In [16]:
encoded_countries

array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]])

### 3. Concatenating cities and countries OHE vectors to the dataframe

In [17]:
df = pd.concat([df, pd.DataFrame(encoded_cities, columns=city_encoder.categories_)], axis=1)
df = pd.concat([df, pd.DataFrame(encoded_countries, columns=country_encoder.categories_)], axis=1)

df.drop(columns=['country', 'city'], inplace=True)
df

Unnamed: 0,id,latitude,longitude,tuition,"(Abha,)","(Ad Dīwānīyah,)","(Akhtarīn,)","(Al Balyanā,)","(Al Buq‘ah,)","(Al Bīrah,)",...,"(‘Uqayribāt,)","(‘Ābūd,)","(‘Ūrīf,)","(Egypt,)","(Iraq,)","(Jordan,)","(Lebanon,)","(Palestine,)","(Saudi Arabia,)","(Syria,)"
0,2,32.237197,35.252466,100,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,9,32.217492,35.236420,350,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,8,32.221399,35.238845,320,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,14,31.934158,35.930048,150,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,13,31.899435,35.212263,150,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263,272,34.653433,28.973444,331,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
264,273,30.681616,29.359413,262,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
265,274,31.562118,30.415838,377,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
266,275,28.247778,31.014722,285,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
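
The column labels above appear as 1-tuples such as `(Abha,)` because `categories_` is a list holding one array per encoded column, and pandas interprets a list of arrays as levels of a multi-level header. A minimal sketch on toy data, assuming flat string labels are preferred, indexes the first (and only) array instead. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame standing in for the kindergarten data.
toy = pd.DataFrame({'id': [1, 2, 3], 'city': ['Nablus', 'Amman', 'Nablus']})

encoder = OneHotEncoder().fit(toy[['city']])
encoded = encoder.transform(toy[['city']]).toarray()

# categories_[0] is the array of labels for the single encoded column,
# so the new columns get plain string names.
flat = pd.DataFrame(encoded, columns=encoder.categories_[0], index=toy.index)
result = pd.concat([toy.drop(columns=['city']), flat], axis=1)

print(list(result.columns))
```

If this variant were adopted, the same labels would have to be used when encoding the user input, since the scaler fitted later compares column names between fit and transform.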


### 4. Normalizing our features

In [18]:
# normalizer = Normalizer().fit(df.drop(columns=['id']))
# normalizer

In [19]:
normalizer = MinMaxScaler().fit(df.drop(columns=['id']))
normalizer



In [20]:
normalizer.transform(df.drop(columns=['id']))



array([[0.18414488, 0.91887058, 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.18323533, 0.91807577, 0.62656642, ..., 1.        , 0.        ,
        0.        ],
       [0.18341567, 0.91819589, 0.55137845, ..., 1.        , 0.        ,
        0.        ],
       ...,
       [0.15298436, 0.67929655, 0.69423559, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.70896123, 0.46365915, ..., 0.        , 0.        ,
        0.        ],
       [0.15634543, 0.71855464, 0.24561404, ..., 0.        , 0.        ,
        0.        ]])

In [21]:
normalized_data = normalizer.transform(df.drop(columns=['id']))
normalized_data



array([[0.18414488, 0.91887058, 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.18323533, 0.91807577, 0.62656642, ..., 1.        , 0.        ,
        0.        ],
       [0.18341567, 0.91819589, 0.55137845, ..., 1.        , 0.        ,
        0.        ],
       ...,
       [0.15298436, 0.67929655, 0.69423559, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.70896123, 0.46365915, ..., 0.        , 0.        ,
        0.        ],
       [0.15634543, 0.71855464, 0.24561404, ..., 0.        , 0.        ,
        0.        ]])

In [22]:
normalized_data.shape

(268, 197)
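
As a quick sanity check on what `MinMaxScaler` does, the sketch below rescales a few tuition values by hand using (x - min) / (max - min). The minimum (100) and maximum (499) are taken from the `df.describe()` output above; the three-row array itself is a toy input, not the real column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Tuition extremes as reported by df.describe(): min 100, max 499.
tuition = np.array([[100.0], [350.0], [499.0]])

scaler = MinMaxScaler().fit(tuition)
scaled = scaler.transform(tuition)

# The same result computed by hand: (x - min) / (max - min)
manual = (tuition - tuition.min()) / (tuition.max() - tuition.min())

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

A tuition of 350 maps to 250 / 399, roughly 0.6266, which agrees with the third column of the normalized rows shown above.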

In [23]:
normalized_df = pd.DataFrame(normalized_data)
normalized_df = normalized_df.set_index(df['id'])
normalized_df

id,0,1,2,3,4,5,6,7,8,9,...,187,188,189,190,191,192,193,194,195,196
2,0.184145,0.918871,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
9,0.183235,0.918076,0.626566,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
8,0.183416,0.918196,0.551378,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
14,0.170157,0.952433,0.125313,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
13,0.168554,0.916879,0.125313,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
272,0.295674,0.607850,0.578947,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
273,0.112342,0.626968,0.406015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
274,0.152984,0.679297,0.694236,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
275,0.000000,0.708961,0.463659,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
normalized_df_copy = normalized_df.copy()

# Finding Similarity

##### Columns: (latitude), (longitude), (tuition), (cities), (countries), (start_date), (registration_expiration)

In [25]:
# all cols length + 2 (start_date and registration_expiration)
dims_numb = len(normalized_df.columns) + 2

In [26]:
# defining the (exclusive) end index of each feature group; each group starts where the previous one ends
distance_dims = 2
tuition_dims = distance_dims + 1
city_dims = tuition_dims + len(city_encoder.categories_[0])
country_dims = city_dims + len(country_encoder.categories_[0])
start_date_dims = country_dims + 1
registration_expiration_dims = start_date_dims + 1

In [27]:
weights = np.zeros(dims_numb)
weights

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [28]:
# reading weights_dict which represents the importance of features
weights_dict = {
    'tuition': 5,
    'location': 1,
    'city': 1,
    'country': 1,
    'start_date': 1,
    'registration_expiration': 1
}

In [29]:
# giving importance to features
weights[0:distance_dims] = weights_dict['location']
weights

array([1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [30]:
weights[distance_dims:tuition_dims] = weights_dict['tuition']
weights

array([1., 1., 5., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [31]:
weights[tuition_dims:city_dims] = weights_dict['city']
weights

array([1., 1., 5., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [32]:
weights[city_dims:country_dims] = weights_dict['country']
weights

array([1., 1., 5., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0.])

In [33]:
weights[country_dims:start_date_dims] = weights_dict['start_date']
weights

array([1., 1., 5., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0.])

In [34]:
weights[start_date_dims:registration_expiration_dims] = weights_dict['registration_expiration']
weights

array([1., 1., 5., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
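
The slice-by-slice assignments above can also be written in one step. The following is only a sketch: `build_weights` is a hypothetical helper, and the counts 187 and 7 are the city and country widths reported by the encoder outputs earlier in the notebook.

```python
import numpy as np

def build_weights(weights_dict, n_cities, n_countries):
    """Hypothetical helper: expand per-feature weights to one weight per column."""
    return np.concatenate([
        np.full(2, weights_dict['location']),            # latitude, longitude
        np.full(1, weights_dict['tuition']),
        np.full(n_cities, weights_dict['city']),         # one-hot city block
        np.full(n_countries, weights_dict['country']),   # one-hot country block
        np.full(1, weights_dict['start_date']),
        np.full(1, weights_dict['registration_expiration']),
    ]).astype(float)

weights_dict = {
    'tuition': 5,
    'location': 1,
    'city': 1,
    'country': 1,
    'start_date': 1,
    'registration_expiration': 1,
}

weights = build_weights(weights_dict, n_cities=187, n_countries=7)
print(weights.shape)
print(weights[:4])
```

The column order must stay identical to the order of the feature matrix, otherwise a weight would be applied to the wrong feature.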

In [35]:
# building a diagonal matrix whose diagonal holds the weights
a = np.zeros((dims_numb, dims_numb))
np.fill_diagonal(a, weights)
a

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 5., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])
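
Multiplying by a diagonal matrix scales each feature column by its weight, so the same effect can be obtained with element-wise broadcasting and without allocating a `dims_numb x dims_numb` matrix. The sketch below checks this equivalence on random toy rows; the array sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([1.0, 1.0, 5.0, 1.0])
rows = rng.random((3, 4))              # three toy feature rows

diag = np.diag(weights)                # same matrix that fill_diagonal builds
via_matmul = np.matmul(diag, rows.T).T
via_broadcast = rows * weights         # scales each column by its weight

print(np.allclose(via_matmul, via_broadcast))
```

Either form gives the same weighted features; the broadcast version is simply cheaper when the number of one-hot columns grows.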

In [36]:
user_input = {
    'latitude': 32.2458139,
    'longitude': 35.227928,
    'country': 'Palestine',
    'city': 'Nablus',
    'tuition': 350,
}

In [37]:
user_input_df = pd.DataFrame(user_input, index=[0])
user_input_df

Unnamed: 0,latitude,longitude,country,city,tuition
0,32.245814,35.227928,Palestine,Nablus,350


In [38]:
# encoding cities and countries
user_input_city_encoded  = city_encoder.transform(user_input_df[['city']]).toarray()
user_input_country_encoded = country_encoder.transform(user_input_df[['country']]).toarray()

In [39]:
user_input_df = pd.concat([user_input_df, pd.DataFrame(user_input_city_encoded, columns=city_encoder.categories_)], axis=1)
user_input_df = pd.concat([user_input_df, pd.DataFrame(user_input_country_encoded, columns=country_encoder.categories_)], axis=1)

In [40]:
user_input_df.drop(columns=['city', 'country'], inplace=True)
user_input_df

Unnamed: 0,latitude,longitude,tuition,"(Abha,)","(Ad Dīwānīyah,)","(Akhtarīn,)","(Al Balyanā,)","(Al Buq‘ah,)","(Al Bīrah,)","(Al Fayyūm,)",...,"(‘Uqayribāt,)","(‘Ābūd,)","(‘Ūrīf,)","(Egypt,)","(Iraq,)","(Jordan,)","(Lebanon,)","(Palestine,)","(Saudi Arabia,)","(Syria,)"
0,32.245814,35.227928,350,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [41]:
# scaling
user_input_scaled = normalizer.transform(user_input_df)



In [42]:
user_input_scaled

array([[0.18454262, 0.91765514, 0.62656642, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.  

In [43]:
user_input_scaled = np.append(user_input_scaled[0], [1,1], axis=0).reshape(1,user_input_scaled.shape[1] + 2)
user_input_scaled

array([[0.18454262, 0.91765514, 0.62656642, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.  

In [44]:
# multiplying user input by weights
user_input_weighted = np.matmul(a, user_input_scaled.T).T
user_input_weighted

array([[0.18454262, 0.91765514, 3.13283208, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.  

In [45]:
# building the `start_date` and `registration_expiration` flag columns (1 = date still in the future) to append to normalized_df
start_date_series = dates_df['start_date'].apply(lambda start_date: 1 if (start_date > datetime.now()) else 0)
registration_expiration_series = dates_df['registration_expiration'].apply(lambda date: 1 if (date > datetime.now()) else 0)
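
Because the flags above depend on `datetime.now()`, rerunning the cell on a later day changes the result. A possible alternative, sketched here on three dates from the table above, compares against an explicit reference date in a vectorized way. The value of `search_date` is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# Three start dates taken from the dates_df preview, indexed by kindergarten id.
dates = pd.Series(
    pd.to_datetime(['2023-02-25', '2022-12-31', '2023-01-20']),
    index=[2, 9, 8],
    name='start_date',
)

# Assumed reference date; any fixed timestamp makes the flags reproducible.
search_date = pd.Timestamp('2023-01-22')

# 1 if the date is still in the future relative to search_date, else 0.
is_upcoming = (dates > search_date).astype(int)
print(is_upcoming)
```

Passing the reference date in explicitly also makes it easy to test the matching logic against known dates.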

In [46]:
normalized_df_copy = pd.concat([normalized_df_copy, start_date_series], axis=1)
normalized_df_copy = pd.concat([normalized_df_copy, registration_expiration_series], axis=1)

In [47]:
normalized_df_copy.isna().sum().sum()

0

In [48]:
normalized_df_copy

id,0,1,2,3,4,5,6,7,8,9,...,189,190,191,192,193,194,195,196,start_date,registration_expiration
2,0.184145,0.918871,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1,1
9,0.183235,0.918076,0.626566,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0,0
8,0.183416,0.918196,0.551378,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0,1
14,0.170157,0.952433,0.125313,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0,1
13,0.168554,0.916879,0.125313,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
272,0.295674,0.607850,0.578947,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
273,0.112342,0.626968,0.406015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1
274,0.152984,0.679297,0.694236,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
275,0.000000,0.708961,0.463659,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1


In [49]:
weighted_df = np.matmul(a, normalized_df_copy.T).T
weighted_df

id,0,1,2,3,4,5,6,7,8,9,...,189,190,191,192,193,194,195,196,197,198
2,0.184145,0.918871,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
9,0.183235,0.918076,3.132832,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
8,0.183416,0.918196,2.756892,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
14,0.170157,0.952433,0.626566,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
13,0.168554,0.916879,0.626566,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
272,0.295674,0.607850,2.894737,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
273,0.112342,0.626968,2.030075,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
274,0.152984,0.679297,3.471178,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
275,0.000000,0.708961,2.318296,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0


In [50]:
# computing the Euclidean distance between the user input and each kindergarten
# (the column is named 'similarity', but a smaller value means a closer match)
similarity = pd.DataFrame(euclidean_distances(user_input_weighted, weighted_df).reshape(-1, 1), columns=['similarity'])
similarity['id'] = df['id']

In [51]:
# sorting in ascending order
similarity = similarity.sort_values('similarity', ascending=True)
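
To illustrate how the weights influence the ranking, the sketch below applies the same idea to a two-feature toy example. Scaling a feature by a weight w before taking the Euclidean distance multiplies that feature's difference by w, so its contribution to the squared distance grows by w squared. All values and names here are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

weights = np.array([1.0, 5.0])          # lightly weighted axis, heavily weighted axis
query = np.array([[0.5, 0.5]])
candidates = np.array([
    [0.5, 0.6],   # differs by 0.1 on the heavily weighted axis
    [0.8, 0.5],   # differs by 0.3 on the lightly weighted axis
    [0.5, 0.5],   # identical to the query
])

# Distance after scaling both sides by the weights.
dist = euclidean_distances(query * weights, candidates * weights).ravel()

# The same quantity written out: sqrt(sum((w * (q - c)) ** 2))
manual = np.sqrt(((weights * (query - candidates)) ** 2).sum(axis=1))

order = np.argsort(dist)                # closest candidate first
print(order)
```

Without weights the first candidate would be closer than the second (0.1 versus 0.3), but with a weight of 5 on the second axis its distance becomes 0.5 and it drops to last place.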

In [52]:
ids_to_return = similarity['id'][:10].values


In [53]:
ids_to_return

array([  5,   8,   9, 186, 130,  89, 174, 161,  80, 166], dtype=int64)

In [54]:
# getting results most similar to user_input dict and weighted by weights_dict
original_df.set_index('id').loc[ids_to_return][['latitude', 'longitude', 'country', 'city', 'tuition', 'start_date', 'registration_expiration']]

id,latitude,longitude,country,city,tuition,start_date,registration_expiration
5,32.04,35.98,Palestine,Nablus,340,2023-02-02,2023-02-13
8,32.221399,35.238845,Palestine,Nablus,320,2023-01-20,2023-02-04
9,32.217492,35.23642,Palestine,Nablus,350,2022-12-31,2023-01-19
186,32.328224,35.292369,Palestine,Sīrīs,353,2023-02-23,2023-03-07
130,32.17842,35.21569,Palestine,‘Aşīrah al Qiblīyah,357,2023-02-01,2023-02-12
89,31.53537,34.97192,Palestine,Bayt Maqdūm,357,2023-01-27,2023-02-07
174,32.283121,35.385789,Palestine,Ţammūn,342,2023-03-11,2023-03-23
161,32.38613,35.2878,Palestine,Mislīyah,335,2023-01-27,2023-02-07
80,31.927317,35.299468,Palestine,Rammūn,369,2023-02-01,2023-02-12
166,31.440961,34.99744,Palestine,Khursā,328,2023-02-21,2023-03-05


# Evaluation & Testing Results - Trying New Data Input (Pipeline)

In [55]:
def find_top_N_similar_With_Weights(user_input, weights_dict, search_date, topN, normalized_df):
    # constants for slicing features
    distance_dims = 2
    tuition_dims = distance_dims + 1
    city_dims = tuition_dims + len(city_encoder.categories_[0])
    country_dims = city_dims + len(country_encoder.categories_[0])
    start_date_dims = country_dims + 1
    registration_expiration_dims = start_date_dims + 1
    """ 
    All preprocessed kindergarten data part:
        1. Reading importance of features from user input (weights_dict)
        2. Initializing an array representing the importance of features (weights)
        3. Initializing matrix and diagonalizing it with the previously defined weights array (weighted_matrix)
        4. Appending start_date and registration_expiration cols to preprocessed normalized_df 
        5. Multiplying the normalized_df with the weighted_matrix to get a weighted_normalized_df (weighted_normalized_df)
    """
    # 1. computing the total number of feature dimensions
    # columns in order: [latitude, longitude, tuition, cities, countries, start_date, registration_expiration]
    # the + 2 accounts for the start_date and registration_expiration flags appended in step 4
    dims_numb = len(normalized_df.columns) + 2
    
    # 2. Initializing an array representing the importance of features (weights)
    weights = np.zeros(dims_numb)
    weights[0:distance_dims] = weights_dict['location']
    weights[distance_dims:tuition_dims] = weights_dict['tuition']
    weights[tuition_dims:city_dims] = weights_dict['city']
    weights[city_dims:country_dims] = weights_dict['country']
    weights[country_dims:start_date_dims] = weights_dict['start_date']
    weights[start_date_dims:registration_expiration_dims] = weights_dict['registration_expiration']
    
    # 3. Initializing matrix and diagonalizing it with the previously defined weights array (weighted_matrix)
    weighted_matrix = np.zeros((dims_numb, dims_numb))
    np.fill_diagonal(weighted_matrix, weights)
    
    # 4. Appending start_date and registration_expiration cols to preprocessed normalized_df 
    # Flagging with 1 the kindergartens whose start date / registration deadline is still after search_date
    start_date_series = dates_df['start_date'].apply(lambda start_date: 1 if (start_date > search_date) else 0)
    registration_expiration_series = dates_df['registration_expiration'].apply(lambda date: 1 if (date > search_date) else 0)
    
    normalized_df = pd.concat([normalized_df, start_date_series], axis=1)
    normalized_df = pd.concat([normalized_df, registration_expiration_series], axis=1)
    
    # 5. Multiplying by the diagonal weight matrix scales every feature column by its weight
    weighted_normalized_df = np.matmul(weighted_matrix, normalized_df.T).T
    
    """
    User input Part:
        1. Storing user input data in a dataframe (user_input_df)
        2. Encoding city and country columns in user_input
        3. Concatenating encoded city and country to user_input dataframe and dropping old cols
        4. Normalizing user input data (user_input_normalized)
        5. Appending 1's as `start_date` and `registration_expiration` cols to user_input_normalized 
        6. Multiplying weighted matrix by user input after normalizing it
    """
    
    # 1. storing user input data in a dataframe (user_input_df)
    user_input_df = pd.DataFrame(user_input, index=[0])
    
    # 2. encoding city and country columns in user_input
    user_input_city_encoded  = city_encoder.transform(user_input_df[['city']]).toarray()
    user_input_country_encoded = country_encoder.transform(user_input_df[['country']]).toarray()
    
    # 3. concatenating encoded city and country to user_input dataframe
    user_input_df = pd.concat([user_input_df, pd.DataFrame(user_input_city_encoded, columns=city_encoder.categories_)], axis=1)
    user_input_df = pd.concat([user_input_df, pd.DataFrame(user_input_country_encoded, columns=country_encoder.categories_)], axis=1)

    user_input_df.drop(columns=['city', 'country'], inplace=True)
    
    # 4. scaling user_input 
    user_input_normalized = normalizer.transform(user_input_df)
    
    # 5. appending `start_date` and `registration_expiration` features to user input and initializing values to 1's
    user_input_normalized = np.append(user_input_normalized[0], [1,1], axis=0).reshape(1,user_input_normalized.shape[1] + 2)
    
    # 6. multiplying weighted matrix by user input after scaling it
    user_input_normalized_weighted = np.matmul(weighted_matrix, user_input_normalized.T).T
    
    """
    Similarity Part:
        1. Finding similarity between all pre-processed kindergartens and the processed user_input_df
        2. Sorting the result in ascending order of distance (smallest distance = most similar)
        3. Getting the Ids of the top N similar kindergarten to the user_input df
        4. Returning the needed cols
    """
    # 1. finding similarity between all pre-processed kindergartens and the processed user_input df
    # Euclidean distance is the metric used to measure the distance between vectors
    similarity = pd.DataFrame(euclidean_distances(user_input_normalized_weighted, weighted_normalized_df).reshape(-1, 1), columns=['similarity'])
    similarity['id'] = df['id']
    
    # 2. sorting the result in ascending order of distance
    # the smaller the Euclidean distance, the more similar the vectors
    # with a cosine similarity metric, we would sort in descending order instead
    similarity = similarity.sort_values('similarity', ascending=True)
    
    # 3. getting the Ids of the top N similar kindergarten to the user_input df
    ids_to_return = similarity['id'].iloc[:topN].values
    
    # 4. returning the needed cols 
    return pd.concat([
        original_df.set_index('id').loc[ids_to_return][['latitude', 'longitude', 'country', 'city', 'tuition', 'start_date', 'registration_expiration']], 
        pd.DataFrame(similarity.iloc[:topN].set_index('id').to_dict())], 
        axis=1)
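
Multiplying by a diagonal weight matrix, as the function above does, scales each feature column by its weight. A minimal sketch on hypothetical toy arrays shows that NumPy broadcasting gives the same result, and how the weights change the distance between two rows. None of the values below come from the notebook's data.

```python
import numpy as np

# Hypothetical toy data: 3 rows, 4 already-scaled features
features = np.array([
    [0.2, 0.4, 1.0, 0.0],
    [0.6, 0.1, 0.0, 1.0],
    [0.9, 0.9, 1.0, 1.0],
])
weights = np.array([1.0, 5.0, 1.0, 5.0])

# Approach used in the function: diagonal matrix times the transposed features
weight_matrix = np.diag(weights)
weighted_via_matmul = np.matmul(weight_matrix, features.T).T

# Equivalent shortcut: broadcasting multiplies every column by its weight
weighted_via_broadcast = features * weights

same_result = np.allclose(weighted_via_matmul, weighted_via_broadcast)

# Effect on the distance between rows 0 and 1:
# a feature with weight w contributes w times its raw difference
plain_distance = np.linalg.norm(features[0] - features[1])
weighted_distance = np.linalg.norm(
    weighted_via_broadcast[0] - weighted_via_broadcast[1]
)
```

Broadcasting avoids building a `dims_numb x dims_numb` matrix, which may matter once the one-hot city columns grow; whether that is worth changing in the pipeline is a judgment call.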

In [56]:
user_input = {
    'latitude': 32.2458139,
    'longitude': 35.227928,
    'country': 'Palestine',
    'city': 'Nablus',
    'tuition': 120,
}

weights_dict = {
    'tuition': 5,
    'location': 1,
    'city': 1,
    'country': 1,
    'start_date': 1,
    'registration_expiration': 1
}

search_date = datetime.fromisoformat("2023-01-20")

In [57]:
find_top_N_similar_With_Weights(user_input, weights_dict, search_date, 10, normalized_df)


id,latitude,longitude,country,city,tuition,start_date,registration_expiration,similarity
2,32.237197,35.252466,Palestine,Nablus,100,2023-02-25,2023-03-25,0.25063
172,32.035776,35.038365,Palestine,Al Lubban al Gharbī,127,2023-02-07,2023-02-19,1.416996
62,32.158702,35.224177,Palestine,‘Ūrīf,152,2023-02-05,2023-02-16,1.469973
154,32.446834,35.167274,Palestine,Ya‘bad,152,2023-01-25,2023-02-05,1.469999
181,31.851182,35.200835,Palestine,Bīr Nabālā,158,2023-01-25,2023-02-06,1.492344
52,32.23647,35.38711,Palestine,An Naşşārīyah,160,2023-01-24,2023-02-04,1.500439
116,32.38051,35.50838,Palestine,‘Ayn al Bayḑā,164,2023-01-26,2023-02-06,1.517976
159,32.152799,35.04529,Palestine,Kafr Thulth,166,2023-01-26,2023-02-06,1.527215
180,32.0671,35.08287,Palestine,Kafr ad Dīk,167,2023-02-04,2023-02-16,1.531995
179,32.00018,35.28152,Palestine,Al Mazra‘ah ash Sharqīyah,167,2023-02-02,2023-02-14,1.532
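
The function encodes each date as a binary flag: 1 if the date is still after `search_date`, 0 otherwise, while the user input always gets 1 for both flags. A minimal sketch of that flag, using the dates of kindergarten id 2 from the table above and a hypothetical helper name `is_upcoming`:

```python
from datetime import datetime

def is_upcoming(date, search_date):
    """Return 1 if date is strictly after search_date, else 0."""
    return 1 if date > search_date else 0

search_date = datetime.fromisoformat("2023-01-20")

# Dates of kindergarten id 2 from the table above
start_flag = is_upcoming(datetime.fromisoformat("2023-02-25"), search_date)
registration_flag = is_upcoming(datetime.fromisoformat("2023-03-25"), search_date)

# A date equal to the search date does not count as upcoming (strict comparison)
same_day_flag = is_upcoming(datetime.fromisoformat("2023-01-20"), search_date)
```

Because the comparison is strict, a kindergarten whose registration expires on the search date itself is treated as already expired.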


In [58]:
user_input = {
    'latitude': 32.2458139,
    'longitude': 35.227928,
    'country': 'Palestine',
    'city': 'Ramallah',
    'tuition': 400,
}

weights_dict = {
    'tuition': 5,
    'location': 1,
    'city': 5,
    'country': 1,
    'start_date': 1,
    'registration_expiration': 5
}

search_date = datetime.fromisoformat("2023-01-22")

find_top_N_similar_With_Weights(user_input, weights_dict, search_date, 10, normalized_df)


id,latitude,longitude,country,city,tuition,start_date,registration_expiration,similarity
10,31.89448,35.197146,Palestine,Ramallah,177,2023-02-02,2023-02-13,2.794534
11,31.899435,35.212263,Palestine,Ramallah,404,2023-01-11,2023-01-22,5.099291
109,31.89609,35.08178,Palestine,Bayt ‘Ūr at Taḩtā,400,2023-02-04,2023-02-15,7.07109
60,31.705382,35.202443,Palestine,Bethlehem,400,2023-01-23,2023-02-03,7.071112
168,32.16523,34.97748,Palestine,Ḩablah,395,2023-03-04,2023-03-16,7.071357
122,31.396843,34.365268,Palestine,Wādī as Salqā,405,2023-01-28,2023-02-08,7.071583
171,32.244032,35.064616,Palestine,Kafr Şūr,392,2023-02-10,2023-02-22,7.071783
141,31.91001,35.21645,Palestine,Al Bīrah,410,2023-02-02,2023-02-13,7.072195
40,31.963463,35.215092,Palestine,Jifnā,412,2023-02-05,2023-02-16,7.072679
95,31.57112,35.23227,Palestine,‘Arab ar Rashāydah,387,2023-01-26,2023-02-06,7.073013


In [59]:
user_input = {
    'latitude': 31.899435,
    'longitude': 35.930048,
    'country': 'Jordan',
    'city': 'Amman',
    'tuition': 300,
}

weights_dict = {
    'tuition': 5,
    'location': 5,
    'city': 5,
    'country': 5,
    'start_date': 5,
    'registration_expiration': 5
}

search_date = datetime.fromisoformat("2023-01-21")

find_top_N_similar_With_Weights(user_input, weights_dict, search_date, 10, normalized_df)


id,latitude,longitude,country,city,tuition,start_date,registration_expiration,similarity
13,31.899435,35.212263,Jordan,Amman,150,2023-02-20,2023-03-07,1.888087
4,31.934158,35.890048,Jordan,Amman,497,2023-01-23,2023-02-03,2.468705
14,31.934158,35.930048,Jordan,Amman,150,2023-01-20,2023-02-04,5.34166
193,36.225112,32.844941,Jordan,‘Izrā,316,2023-02-11,2023-02-23,7.184753
205,35.936286,32.389131,Jordan,Balīlā,295,2023-02-09,2023-02-21,7.186168
198,35.849203,31.978024,Jordan,Umm as Summāq,281,2023-01-31,2023-02-12,7.200394
196,35.936286,32.389131,Jordan,Balīlā,241,2023-01-26,2023-02-07,7.22383
192,35.616056,30.833706,Jordan,Aţ Ţafīlah,321,2023-01-30,2023-02-11,7.238656
187,35.006321,29.532052,Jordan,Aqaba,311,2023-01-23,2023-02-04,7.283132
204,35.77722,32.64519,Jordan,Ḩātim,184,2023-01-29,2023-02-10,7.319554


In [60]:
user_input = {
    'latitude': 31.899435,
    'longitude': 35.212263,
    'country': 'Egypt',
    'city': 'Cairo',
    'tuition': 300,
}

weights_dict = {
    'tuition': 5,
    'location': 1,
    'city': 5,
    'country': 5,
    'start_date': 1,
    'registration_expiration': 1
}

search_date = datetime.fromisoformat("2023-01-21")

find_top_N_similar_With_Weights(user_input, weights_dict, search_date, 10, normalized_df)


id,latitude,longitude,country,city,tuition,start_date,registration_expiration,similarity
259,31.235712,30.04442,Egypt,Cairo,370,2023-01-28,2023-02-09,0.914293
261,31.235712,30.04442,Egypt,Cairo,207,2022-12-30,2023-01-11,1.850582
275,28.247778,31.014722,Egypt,Bi’r al ‘Abd,285,2023-01-25,2023-02-06,7.078628
269,31.79778,30.729272,Egypt,Fāqūs,326,2023-02-10,2023-02-22,7.082054
273,30.681616,29.359413,Egypt,Ibshawāy,262,2023-02-02,2023-02-14,7.093234
267,31.37934,30.36088,Egypt,Mashtūl as Sūq,365,2023-01-29,2023-02-10,7.121923
257,30.734995,28.42336,Egypt,Maţāy,374,2023-02-12,2023-02-24,7.13974
255,32.94626,24.47669,Egypt,Kawm Umbū,346,2023-01-12,2023-01-24,7.184525
271,33.651144,27.402484,Egypt,El Gouna,366,2023-01-10,2023-01-22,7.200016
264,30.299236,31.305222,Egypt,Idkū,313,2023-01-06,2023-01-18,7.215916


In [61]:
user_input = {
    'latitude': 44.799952,
    'longitude': 26.684419,
    'country': 'Saudi Arabia',
    'city': 'reyad',
    'tuition': 350,
}

weights_dict = {
    'tuition': 1,
    'location': 5,
    'city': 1,
    'country': 5,
    'start_date': 1,
    'registration_expiration': 1
}

search_date = datetime.fromisoformat("2023-01-21")

find_top_N_similar_With_Weights(user_input, weights_dict, search_date, 10, normalized_df)


id,latitude,longitude,country,city,tuition,start_date,registration_expiration,similarity
30,46.429664,24.720824,Saudi Arabia,reyad,374,2023-02-02,2023-02-13,0.61773
34,46.664641,24.827211,Saudi Arabia,reyad,341,2023-01-18,2023-01-29,1.182068
31,46.708166,24.490877,Saudi Arabia,reyad,449,2023-01-17,2023-01-28,1.245253
33,46.783108,24.794877,Saudi Arabia,reyad,491,2023-01-13,2023-01-24,1.246341
36,46.860567,24.772688,Saudi Arabia,reyad,170,2023-01-20,2023-01-31,1.286025
32,46.76861,24.688413,Saudi Arabia,reyad,167,2023-01-11,2023-01-22,1.288862
291,46.675296,24.713552,Saudi Arabia,Riyadh,474,2023-01-29,2023-02-09,1.588134
298,49.629908,25.314156,Saudi Arabia,Al Hufūf,420,2023-01-28,2023-02-08,1.840797
282,49.912345,26.651956,Saudi Arabia,Umm as Sāhik,272,2023-01-24,2023-02-04,1.852144
292,48.488722,28.425662,Saudi Arabia,Al Khafjī,364,2023-01-12,2023-01-23,1.977873


# Saving Models

##### Saving city encoder

In [62]:
model_name = 'Output/city_encoder.sav'
joblib.dump(city_encoder, model_name)

['Output/city_encoder.sav']

##### Saving country encoder

In [63]:
model_name = 'Output/country_encoder.sav'
joblib.dump(country_encoder, model_name)

['Output/country_encoder.sav']

##### Saving scaler 

In [64]:
model_name = 'Output/normalizer.sav'
joblib.dump(normalizer, model_name)

['Output/normalizer.sav']

##### Saving scaled dataframe

In [65]:
normalized_df.to_csv('Output/normalized_df.csv')

##### Saving dates dataframe

In [66]:
dates_df.to_csv('Output/dates_df.csv')
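
The saved encoders and scaler can later be restored with `joblib.load`. A minimal round-trip sketch with a stand-in object and a temporary directory follows; the dictionary and the file name `example.sav` are hypothetical placeholders, not the notebook's fitted objects.

```python
import os
import tempfile

import joblib

# Stand-in for a fitted encoder or scaler (hypothetical placeholder object)
stand_in_model = {"categories": ["Amman", "Cairo", "Nablus"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.sav")
    # joblib.dump returns the list of file names it wrote
    written_files = joblib.dump(stand_in_model, path)
    # joblib.load restores the object from disk
    restored_model = joblib.load(path)
```

When loading real scikit-learn objects this way, it is safest to use the same scikit-learn version that produced the files.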

#### Coded by [Maysam M. Mousa](https://github.com/MaysaM-M-Mousa)