### Assumptions made
1. This is an MVP, so the scraping module is not robust enough for a full production deployment: there is very little validation of the scraped data. It works today, but it will likely break if the structure of the Wikipedia page changes.
2. The datasets used here are small enough to cache locally and load when needed. In a production environment, they would likely have to be stored or cached elsewhere.
3. The model output is printed to the console. Depending on the customer's needs, it would be written somewhere more useful. That would be done in a Python module rather than a notebook, so it is omitted for now.
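
The first assumption mentions that the scraped data gets very little validation. A minimal sketch of what such a check might look like follows. `validate_museum_data` and `REQUIRED_COLUMNS` are hypothetical names, not part of the `scraper` module, and the column names are simply taken from the museum table printed in this notebook.

```python
import pandas as pd

# Hypothetical schema: column names copied from the museum table in this notebook.
REQUIRED_COLUMNS = ["name", "type", "collection_size", "visitors", "city", "country"]


def validate_museum_data(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means the frame passed."""
    problems = []

    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        # Without the expected columns the remaining checks cannot run.
        return [f"missing columns: {missing}"]

    if df.empty:
        problems.append("no rows scraped")

    # Visitor counts should be present and strictly positive numbers.
    visitors = pd.to_numeric(df["visitors"], errors="coerce")
    bad_visitors = int((visitors.isna() | (visitors <= 0)).sum())
    if bad_visitors:
        problems.append(f"{bad_visitors} row(s) with missing or non-positive visitors")

    # Name and city are needed for identification and for the city join.
    for col in ("name", "city"):
        blank = int(df[col].isna().sum())
        if blank:
            problems.append(f"{blank} row(s) with missing {col}")

    return problems


# Two rows copied from the museum table, used here only as sample input.
sample = pd.DataFrame(
    {
        "name": ["Louvre", "Vatican Museums"],
        "type": ["Art museum and historic site", "Art museum"],
        "collection_size": [615797.0, None],
        "visitors": [8700000, 6800000],
        "city": ["Paris", "Vatican City"],
        "country": ["France", "Vatican"],
    }
)
print(validate_museum_data(sample))  # prints an empty list: both rows pass
```

A check like this could run right after scraping, so that a change to the page layout fails loudly instead of silently producing bad training data.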

In [28]:
from scraper import get_museum_data

museum_df = get_museum_data()

print("Museum Data")
print(museum_df.to_string(max_rows=5))

Museum Data
    Unnamed: 0               name                                                                                     type  collection_size  visitors          city  country
0            0             Louvre                                                             Art museum and historic site         615797.0   8700000         Paris   France
1            1    Vatican Museums                                                                               Art museum              NaN   6800000  Vatican City  Vatican
..         ...                ...                                                                                      ...              ...       ...           ...      ...
79          79  Museo Reina Sofía                                                                              Non-movable              NaN   1253183        Madrid    Spain
80          80       Galata Tower  Touristic buildingmuseumexhibition placeFormerly: watchtowerobservation towerfire tower 

In [29]:
import pandas as pd
import os
from pathlib import Path

# Dataset downloaded from https://www.kaggle.com/datasets/dataanalyst001/world-population-growth-rate-by-cities-2024
# Since this is an MVP, we use the locally cached version of the dataset

city_df = pd.read_csv(os.path.abspath('../data/population_data.csv'))
print("City Data")
print(city_df.to_string(max_rows=5))

City Data
               City Country      Continent  Population_2024  Population_2023  Growth Rate
0             Tokyo   Japan           Asia         37115035         37194105      -0.0021
1             Delhi   India           Asia         33807403         32941309       0.0263
..              ...     ...            ...              ...              ...          ...
799  Ribeirao Preto  Brazil  South America           750174           742115       0.0109
800       Panzhihua   China           Asia           750036           738495       0.0156


### Here we assume all museum data is from 2023 and will try to predict the visitors for 2024
The Wikipedia table has visitor data from both 2023 and 2024; however, I was only able to find population growth values for 2023-2024.
Therefore, to simplify model training, I treat all the museum data as being from 2023 and use population data from
2023 and 2024.

In [30]:
joined_data = museum_df.merge(city_df, left_on='city', right_on='City')
joined_data = joined_data[['name', 'type', 'collection_size', 'visitors', 'city', 'Population_2024', 'Population_2023', 'Growth Rate']]


# Clean the data to prepare for model training.
# Fill missing collection sizes with the column mean, then cast to integers.
mean_items = joined_data['collection_size'].mean()
joined_data['collection_size'] = joined_data['collection_size'].fillna(mean_items)
joined_data['collection_size'] = joined_data['collection_size'].astype('int64')

# Fill missing museum types with the most common type.
mode_type = joined_data['type'].mode()
joined_data['type'] = joined_data['type'].fillna(mode_type.iloc[0])

print(joined_data.to_string(max_rows=5))

                        name                                                                                     type  collection_size  visitors      city  Population_2024  Population_2023  Growth Rate
0                     Louvre                                                             Art museum and historic site           615797   8700000     Paris         11276701         11208440       0.0061
1   National Museum of China                                                               Art museum, history museum          1300000   6300000   Beijing         22189082         21766214       0.0194
..                       ...                                                                                      ...              ...       ...       ...              ...              ...          ...
62         Museo Reina Sofía                                                                              Non-movable         10816549   1253183    Madrid          6783241          6751374    
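
The museum table is indexed up to 80, while the joined table is indexed only up to 63, so the default inner join appears to drop museums whose city has no exact match in the population data. One way to see which rows are lost, sketched here on toy stand-ins for `museum_df` and `city_df` (the two small frames below are illustrative, not the real data), is a left join with pandas' `indicator=True` option:

```python
import pandas as pd

# Toy stand-ins for museum_df and city_df; the real frames come from earlier cells.
museums = pd.DataFrame(
    {"name": ["Louvre", "Vatican Museums"], "city": ["Paris", "Vatican City"]}
)
cities = pd.DataFrame({"City": ["Paris", "Tokyo"], "Growth Rate": [0.0061, -0.0021]})

# A left join keeps every museum; the _merge column records whether a city matched.
checked = museums.merge(
    cities, how="left", left_on="city", right_on="City", indicator=True
)

# Rows marked "left_only" are museums with no matching city in the population data.
unmatched = checked.loc[checked["_merge"] == "left_only", ["name", "city"]]
print(unmatched.to_string(index=False))
```

Listing the unmatched rows makes it easier to decide whether they need a name mapping or can be excluded deliberately.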

### Here we begin preparing the data for model training
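A key simplification is that the "real" 2024 visitor values are estimated as the 2023 values scaled by the city's population growth rate, i.e. `visitors * (1 + Growth Rate)`.
This estimated 2024 value is the training target and is also used to validate the model's predictions. Because it is computed from `visitors` and `Growth Rate`, which are both model inputs, the scores below show how well the model reproduces this formula rather than real 2024 attendance.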
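
As a worked example of this estimate, the sketch below applies it to two rows copied from the joined table. The helper name `estimate_2024_visitors` is hypothetical and used only for illustration.

```python
def estimate_2024_visitors(visitors_2023: int, growth_rate: float) -> int:
    """Scale 2023 visitors by the city's 2023-2024 population growth rate."""
    return round(visitors_2023 * (1 + growth_rate))


# Louvre (Paris, growth rate 0.0061) and National Museum of China (Beijing, 0.0194).
louvre = estimate_2024_visitors(8_700_000, 0.0061)
national_museum_of_china = estimate_2024_visitors(6_300_000, 0.0194)
print(louvre, national_museum_of_china)  # 8753070 6422220
```

These values match the `visitors_2024` column that the notebook prints for the same two museums.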

In [31]:
from sklearn.preprocessing import LabelEncoder

# Creating data for 2024 by multiplying visitors by the growth rate of the city
joined_data["visitors_2024"] = joined_data["visitors"] * (1 + joined_data["Growth Rate"])
joined_data["visitors_2024"] = joined_data["visitors_2024"].round().astype('int64')

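# Encoding features, but keeping an unencoded copy of the df for validation later.
# Each column gets its own encoder so that both mappings can be inverted afterwards.
data_copy = joined_data.copy()
type_encoder = LabelEncoder()
city_encoder = LabelEncoder()
joined_data['type'] = type_encoder.fit_transform(joined_data['type'])
joined_data['city'] = city_encoder.fit_transform(joined_data['city'])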

print(joined_data.to_string(max_rows=5))

                        name  type  collection_size  visitors  city  Population_2024  Population_2023  Growth Rate  visitors_2024
0                     Louvre     4           615797   8700000    20         11276701         11208440       0.0061        8753070
1   National Museum of China     6          1300000   6300000     2         22189082         21766214       0.0194        6422220
..                       ...   ...              ...       ...   ...              ...              ...          ...            ...
62         Museo Reina Sofía    23         10816549   1253183    14          6783241          6751374       0.0047        1259073
63              Galata Tower    26         10816549   1250000    10         16047350         15847768       0.0126        1265750


In [32]:
from sklearn.model_selection import train_test_split

# Splitting features (X) and target variable (Y)
# The museum name is used solely for identification, so we exclude it from the features.
X = joined_data.drop(columns=['name', 'visitors_2024'])
Y = joined_data['visitors_2024']

# Splitting the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=2)

In [33]:
from xgboost import XGBRegressor

# Training model
regressor = XGBRegressor()
regressor.fit(X_train, Y_train)

In [34]:
from sklearn import metrics
# Testing model on training data
training_data_prediction = regressor.predict(X_train)
r2_train = metrics.r2_score(Y_train, training_data_prediction)
print('R Squared (Training Data) = ', r2_train)

R Squared (Training Data) =  1.0


In [35]:
# Evaluate model on test data
test_data_prediction = regressor.predict(X_test)
r2_test = metrics.r2_score(Y_test, test_data_prediction)
print('R Squared (Test Data) = ', r2_test)

R Squared (Test Data) =  0.883658230304718


In [36]:
# Predicting values using full dataset
prediction = regressor.predict(X)
data_copy['predicted_2024'] = prediction
data_copy['delta'] = data_copy['predicted_2024'] - data_copy['visitors_2024']
data_copy = data_copy[['name', 'city', 'Growth Rate', 'visitors', 'visitors_2024', 'predicted_2024', 'delta']]
print(data_copy.to_string())

                                                  name              city  Growth Rate  visitors  visitors_2024  predicted_2024        delta
0                                               Louvre             Paris       0.0061   8700000        8753070     6502534.500 -2250535.500
1                             National Museum of China           Beijing       0.0194   6300000        6422220     6422220.000        0.000
2                                       British Museum            London       0.0104   6479952        6547344     6547343.500       -0.500
3             Natural History Museum, South Kensington            London       0.0104   6301972        6367513     6367514.000        1.000
4                  China Science and Technology Museum           Beijing       0.0194   5315000        5418111     5418110.500       -0.500
5                                       Nanjing Museum           Nanjing       0.0257   5007000        5135680     5135681.000        1.000
6                   