In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn as skl

### First steps

First, we have to fetch the data and clean it a bit.

In [3]:
def fetchData():
    # Here we fetch the data from my github
    url = "https://raw.githubusercontent.com/D0nG10vanni/DataScientistBlog/main/Sources.csv"
    df = pd.read_csv(url)
    return df

df = fetchData()
print(df.head())


               Province  Monuments_Count  Cities_Count  Country
0                   NaN              1.0           NaN      NaN
1                Achaea             38.0           6.0  Albania
2                Achaea           1056.0         103.0   Greece
3              Aegyptus            144.0          47.0    Egypt
4  Africa Proconsularis             74.0           5.0    Libya


In [9]:
def cleaning(df):
    # Light cleaning: drop rows without a province (in effect only the invalid first row,
    # see the printed df above), convert both count columns to numeric, and drop remaining null rows
    df = df.dropna(subset=["Province"]).copy()  
    df["Monuments_Count"] = pd.to_numeric(df["Monuments_Count"], errors="coerce")
    df["Cities_Count"] = pd.to_numeric(df["Cities_Count"], errors="coerce")
    df = df.dropna()
    return df

df = cleaning(df)
print(df.describe())

       Monuments_Count  Cities_Count
count       102.000000    102.000000
mean         92.852941     13.598039
std         151.608302     18.605471
min           1.000000      1.000000
25%          12.000000      2.000000
50%          45.000000      6.000000
75%          96.750000     16.750000
max        1056.000000    103.000000


### Quick Findings

As can be seen in this overview, we have 102 data rows (one per province and modern country, as the two `Achaea` rows above show), averaging about 93 monuments and around 14 cities each.

However, we can also see that the number of cities and monuments per row varies heavily: the standard deviations (about 19 and 152) exceed the means, and the maxima are 103 and 1056 respectively. We should take this into account later on.
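
The spread can be quantified directly from the `describe()` output. The following is a minimal sketch that only reuses the printed summary statistics; the names `stats` and `summary` are illustrative and not part of the notebook.

```python
# Summary statistics copied from the describe() output above.
stats = {
    "Monuments_Count": {"mean": 92.852941, "std": 151.608302, "median": 45.0},
    "Cities_Count": {"mean": 13.598039, "std": 18.605471, "median": 6.0},
}

summary = {}
for name, s in stats.items():
    cv = s["std"] / s["mean"]                 # coefficient of variation
    mean_to_median = s["mean"] / s["median"]  # > 1 hints at a right-skewed column
    summary[name] = (cv, mean_to_median)
    print(f"{name}: CV = {cv:.2f}, mean/median = {mean_to_median:.2f}")
```

A coefficient of variation above 1 together with a mean well above the median suggests that a few very large rows dominate both columns.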

### Proceeding

Now that we have the data neatly organized, we can start the analysis. First, we want to define the feature and target variables.

After that, we can create the splits and start training the model.
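
With only 102 rows, it helps to know how many of them end up in each part of an 80:20 split. A small sketch, using a dummy index array in place of the real data (the array `rows` is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 102 cleaned rows; only the row count matters here.
rows = np.arange(102)

train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)
print(len(train_rows), len(test_rows))
```

The test part holds only about a fifth of the rows, so a single unusual province landing in it can shift the reported metrics noticeably.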

In [10]:
def def_vars(df, *features, target="Monuments_Count"):
    # helper to select the feature columns (X) and the target column (y)
    X = df[list(features)] 
    y = df[target]
    return X, y

In [11]:
from sklearn.model_selection import train_test_split  
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [12]:
def split(X, y):
    # now we create the training and testing split (80:20)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test

In [13]:
# Now we train a linear regression model on the data we prepared before. First, we re-run all the steps we took.
df = fetchData()
df = cleaning(df)
X, y = def_vars(df, "Cities_Count")
X_train, X_test, y_train, y_test = split(X, y)

print("Preparation completed!")

regression = LinearRegression()
regression.fit(X_train, y_train)

Preparation completed!


In [14]:
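def eval(model):
    # Here we test the prediction accuracy of the model on the held-out test set.
    # Relies on the global X_test and y_test from the most recent split.
    # Note: this name shadows Python's built-in eval().
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print("MSE:", mse)
    print("R²:", r2_score(y_test, y_pred))
    rmse = np.sqrt(mse)  # root mean squared error, in units of monuments
    print(f"Average uncertainty of ± {rmse:.1f} monuments")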

eval(regression)

MSE: 2724.38625855613
R²: 0.6725331281447817
Average uncertainty of ± 52.2 monuments


### First one down

Now we have trained our first model and evaluated it. With an R² of about 0.67 and an RMSE of about 52 monuments, the scores are not bad. However, I think we can do better.
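
The "average uncertainty" printed by `eval` is the root mean squared error (RMSE), i.e. the square root of the MSE. A quick check with the MSE value printed above:

```python
import math

mse = 2724.38625855613   # MSE printed by eval(regression)
rmse = math.sqrt(mse)    # same units as the target (monuments)
print(f"{rmse:.1f}")     # 52.2
```

Because the errors are squared before averaging, a few rows with very high monument counts can dominate this figure.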

### Improvements

1. I want to introduce a new model that can handle the large spread in our data (see the description of the DataFrame above) much better: the gradient boosting regression model.
2. I will introduce two new variables, `Cities_per_Monument` and `Monuments_per_City`, which are meant to help the model better capture the relationship between monuments and cities.


In [15]:
from sklearn.ensemble import GradientBoostingRegressor

In [16]:
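def ratios(df):
    # here we add the new features for gradient boosting regression (modifies df in place);
    # the +1 in each denominator guards against division by zero.
    # Note: both ratios use Monuments_Count, which is also the prediction target.
    df["Cities_per_Monument"] = df["Cities_Count"] / (df["Monuments_Count"] + 1)
    df["Monuments_per_City"] = df["Monuments_Count"] / (df["Cities_Count"] + 1)
    return df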

In [38]:
# first, we recreate the status quo from earlier
df = fetchData()
df = cleaning(df)
df = ratios(df)  # adding the new variables
# and finally we can create the variables and the train-test split
X, y = def_vars(df, "Cities_Count", "Cities_per_Monument", "Monuments_per_City")  # now including the two new variables
X_train, X_test, y_train, y_test = split(X, y)

In [39]:
# now, we will train the model again, this time the new one
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)

In [40]:
eval(gbr) # testing the model

MSE: 607.9130422851662
R²: 0.9269298243991966
Average uncertainty of ± 24.7 monuments


### Big Steps forward

As can be seen, the model improved significantly: R² rose from 0.67 to 0.93 and the RMSE dropped from 52.2 to 24.7 monuments. However, I think there is still room for improvement, since there is one more thing that is fairly easy to do.

Before we do that, let's look at the weights (feature importances) of our new model, since we have more than one feature now.
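
For scikit-learn's tree ensembles, `feature_importances_` holds impurity-based scores that are normalised to sum to 1, so each value can be read as a share of the total. A minimal sketch on synthetic data (the toy arrays and names below are assumptions for illustration, not the monuments data):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic toy data: the target depends on the first column only,
# the second column is pure noise.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 0.1, size=200)

toy_model = GradientBoostingRegressor(random_state=0).fit(X_toy, y_toy)
importances = toy_model.feature_importances_

print(importances)        # one value per feature
print(importances.sum())  # shares of the total importance
```

A high importance only says that the trees split on a feature often and profitably; it says nothing about the direction of the effect.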

In [31]:
def weights(model, decimals):
    # print each feature's importance; relies on the global X for the column names
    importances = model.feature_importances_
    for name, imp in zip(X.columns, importances):
        print(f"{name}: {imp:.{decimals}f}")

weights(gbr, 9)

Cities_Count: 0.861645298
Cities_per_Monument: 0.051980893
Monuments_per_City: 0.086373810


### Weights weighing

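`Cities_Count` dominates with about 86% of the total importance, while the two new variables account for roughly 5% and 9%. Despite the new variables not having much weight, their impact is not to be neglected.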
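
One caveat about the ratio features: both are computed from `Monuments_Count`, which is also the target. Together with `Cities_Count`, `Monuments_per_City` is enough to recover the target exactly, which may explain part of the jump in R². A minimal sketch using the four valid rows printed by `df.head()` above (the frame `sample` exists only for illustration):

```python
import pandas as pd

# The four valid rows shown in the df.head() output.
sample = pd.DataFrame({
    "Monuments_Count": [38.0, 1056.0, 144.0, 74.0],
    "Cities_Count": [6.0, 103.0, 47.0, 5.0],
})

# Same definition as in ratios().
sample["Monuments_per_City"] = sample["Monuments_Count"] / (sample["Cities_Count"] + 1)

# Inverting the ratio gives back the target without any model.
recovered = sample["Monuments_per_City"] * (sample["Cities_Count"] + 1)
max_diff = (recovered - sample["Monuments_Count"]).abs().max()
print(max_diff)
```

For a province whose monument count is unknown, neither ratio can be computed, so a model meant for real predictions would need features that do not involve the target.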

### Striving for Perfection

As teased earlier, I have an idea for improving accuracy even further. It involves a variable that we have not taken into account at all yet: __*the countries*__!
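
Since `Country` is a text column, it has to be turned into numbers first. One-hot encoding with `pd.get_dummies` creates one indicator column per country. A small sketch on the four valid rows from the `df.head()` output (the frame `sample` is illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    "Province": ["Achaea", "Achaea", "Aegyptus", "Africa Proconsularis"],
    "Cities_Count": [6.0, 103.0, 47.0, 5.0],
    "Country": ["Albania", "Greece", "Egypt", "Libya"],
})

# Replace the Country column with one indicator column per country.
encoded = pd.get_dummies(sample, columns=["Country"])
country_cols = [c for c in encoded.columns if c.startswith("Country_")]

print(country_cols)
# Each row has exactly one active country indicator.
active_per_row = encoded[country_cols].sum(axis=1).tolist()
print(active_per_row)
```

Each indicator is active only for the rows of that country, so with 102 rows spread over many countries, most of these columns can carry very little signal.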

In [32]:
# as before, we need to reset everything again
df = fetchData()
df = cleaning(df)
df = ratios(df)  # adding the two ratio variables
# now countries are taken into consideration as well
df_encoded = pd.get_dummies(df, columns=["Country"])
feature_cols = ["Cities_Count", "Cities_per_Monument", "Monuments_per_City"] + [
    col for col in df_encoded.columns if col.startswith("Country_")
]
# Create feature matrix and target vector
X, y = def_vars(df_encoded, *feature_cols)
X_train, X_test, y_train, y_test = split(X, y)

In [33]:
# training the model
gbr2 = GradientBoostingRegressor()
gbr2.fit(X_train, y_train)

In [34]:
# now we can evaluate again
eval(gbr2)
weights(gbr2, 9)

MSE: 608.102768736564
R²: 0.9269070195831202
Average uncertainty of ± 24.7 monuments
Cities_Count: 0.859244255
Cities_per_Monument: 0.059838399
Monuments_per_City: 0.055526582
Country_Albania: 0.000000295
Country_Algeria: 0.000361178
Country_Austria: 0.000006694
Country_Belgium: 0.000000000
Country_Bosnia and Herzegovina: 0.000000000
Country_Bulgaria: 0.000000000
Country_Croatia: 0.000001194
Country_Cyprus: 0.000000693
Country_Egypt: 0.000011019
Country_France: 0.000043917
Country_Germany: 0.000000000
Country_Greece: 0.024067186
Country_Hungary: 0.000002305
Country_Israel: 0.000000000
Country_Italy: 0.000161770
Country_Jordan: 0.000000000
Country_Kosovo: 0.000000000
Country_Lebanon: 0.000000000
Country_Libya: 0.000001040
Country_Macedonia: 0.000000000
Country_Malta: 0.000000000
Country_Montenegro: 0.000032648
Country_Morocco: 0.000000000
Country_Netherlands: 0.000000009
Country_Northern Cyprus: 0.000000000
Country_Portugal: 0.000000000
Country_Romania: 0.000026300
Country_Serbia: 0.000

### Not better

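Despite having more information to work with, the model did not produce better predictions than our second one: the MSE (608.1 vs. 607.9) and R² (about 0.927 in both cases) are practically unchanged. Among the country columns shown, only `Country_Greece` (about 0.024) carries any noticeable weight. Still, an R² of almost 0.93 on the test split is a strong result, so I am very content with my first ML model.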

### Final remarks

I was able to create a model that predicts the number of monuments in a region _pretty accurately_ based on the cities it has. The only thing left to do is to use the model to create some actual predictions. For this, I will use the second model, since the third is more complex without adding any benefit.