### Preparation

In [1]:
include("../../src/modeling/embedded_correlation_lib.jl")
include("../../src/modeling/correlation_lib.jl")
include("../../src/modeling/util.jl")
using SQLite
using CSV
using DataFrames: DataFrame, Missing
using Statistics, LinearAlgebra, Logging, Random
using Gen

# Another rent model
- Columns are ```state, city, zip, construct_year, living_space, rent```
- Joint probability distribution function:

$Pr(state, city, zip, c\_y, l\_s, rent) = $

$Pr(state)\,Pr(city|state)\,Pr(zip|state,city)\,Pr(l\_s|state, city)\,Pr(c\_y|city, l\_s)\,Pr(rent|l\_s, state, city, c\_y)$
- Can be interpreted as a Bayes Net
- Current assumptions:
    - High cardinality
    - Occurrence counts carry little information
    - No ground truth about $dom(X)$ and $P(X)$

### Conditional probability distributions that benefit from vectorized categorical variables:
- $Pr(l\_s|state, city)$: If the database contains only a single entry for ```(Bayern, Muenchen)```, e.g. ```(Bayern, Muenchen, 50m^2)```, we can also take its neighbors into account, e.g. ```(Bayern, Ingolstadt)```. Without vector representations, there is no easy way to evaluate such a neighborhood.
- $Pr(rent|l\_s, state, city, c\_y)$

### Read and parse data
- Each city in $dom(city)$ appears 30 times
- ```Berlin``` occurs once with very high rent
- ```Frankfurt``` occurs once with very low rent
- ```Starnberg``` occurs twice with normal values
- ```Frankfurt am Main``` (```FaM```) occurs once with a wrong zip
- ```Muenchen``` occurs once in ```Niedersachsen```

In [28]:
CAT_ATTRS = [:state, 
            :city, 
            :construct_year,
            ]
NUM_ATTRS = [:living_space, :zip, :rent]
TSV_PATH_PREFIX = "../../data/tsv/20191217_nb_neo_30_oversampling/neo_30_os_"
CAT_EMBEDDING_DICT = merge([read_tsv("$(TSV_PATH_PREFIX)$(cat_attr)_meta.tsv",
                            "$(TSV_PATH_PREFIX)$(cat_attr)_vec.tsv")
                            for cat_attr in CAT_ATTRS]...)

db = SQLite.DB()
rent_table = CSV.File("../../data/rent_data/neo_enriched_rent_30_per_city.csv") |> SQLite.load!(db, "rent_table")
realization_df = SQLite.Query(db, "SELECT * FROM rent_table") |> DataFrame
embedding_df = replace_with_emb(realization_df, CAT_ATTRS, NUM_ATTRS, CAT_EMBEDDING_DICT)
realization_df

| | state | city | zip | living_space | construct_year | rent |
|---|---|---|---|---|---|---|
| | String⍰ | String⍰ | Int64⍰ | Float64⍰ | Int64⍰ | Float64⍰ |
| 1 | Bayern | Starnberg | 82319 | 115.37 | 1970 | 1650.24 |
| 2 | Bayern | Starnberg | 82319 | 135.37 | 1970 | 1900.24 |
| 3 | Hessen | Frankfurt | 60596 | 130.0 | 2000 | 810.3 |
| 4 | Berlin | Berlin | 10115 | 30.4 | 2001 | 1500.2 |
| 5 | Niedersachsen | Muenchen | 84337 | 115.37 | 1970 | 1650.24 |
| 6 | Hessen | Frankfurt am Main | 83217 | 88.75 | 1943 | 1609.9 |
| 7 | Niedersachsen | Braunschweig | 38113 | 144.53 | 1947 | 1446.11 |
| 8 | Bayern | Schweinfurt | 97421 | 128.82 | 1979 | 894.89 |
| 9 | Nordrhein-Westfalen | Aachen | 52064 | 118.54 | 1963 | 943.12 |
| 10 | Bayern | Kempten | 87437 | 50.51 | 1982 | 623.55 |


### Model with categorical variables
- ```categorical_co_occurrence(("Bayern", "Starnberg"), rent)``` models $Pr(rent|state, city)$:
    - It looks up the entries with ```Bayern, Starnberg``` in the given database;
    - It samples rent from a probability distribution whose mean and variance are computed from the entries found above;
    - If, e.g., only one entry is found, the variance cannot be estimated. The function therefore takes a "minimum variance" hyperparameter; if none is given, it falls back to the variance of rent over the whole database.
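
The three steps above can be sketched in a few lines of plain Julia. This is only an illustration under stated assumptions, not the code in ```correlation_lib.jl```: ```rent_params``` and ```sample_rent``` are hypothetical names, the table is a vector of NamedTuples instead of a DataFrame, and the sampling distribution is assumed to be normal.

```julia
using Statistics, Random

# Hypothetical helper, NOT the library's `numerical_co_occurrence`.
# `rows` is a vector of NamedTuples standing in for the database table.
function rent_params(rows, state, city; min_var = nothing)
    # Look up the entries that match the conditioning values.
    rents = [r.rent for r in rows if r.state == state && r.city == city]
    isempty(rents) && error("no entry for ($state, $city)")
    mu = mean(rents)
    # A single entry has no sample variance; treat it as zero.
    sigma2 = length(rents) > 1 ? var(rents) : 0.0
    if sigma2 == 0.0
        # Fallback: minimum variance if given, else variance of all rents.
        sigma2 = min_var === nothing ? var([r.rent for r in rows]) : min_var
    end
    return mu, sigma2
end

# Assumption: rent is drawn from a normal distribution.
sample_rent(rng, mu, sigma2) = mu + sqrt(sigma2) * randn(rng)

rows = [
    (state = "Bayern", city = "Starnberg", rent = 1650.24),
    (state = "Bayern", city = "Starnberg", rent = 1900.24),
    (state = "Hessen", city = "Frankfurt", rent = 810.3),
    (state = "Berlin", city = "Berlin", rent = 1500.2),
]

# Frankfurt has a single entry, so the minimum variance is used.
mu, sigma2 = rent_params(rows, "Hessen", "Frankfurt"; min_var = 100.0)
rent = sample_rent(MersenneTwister(1), mu, sigma2)
```

With two ```Starnberg``` entries the variance is estimated from the data; with the single ```Frankfurt``` entry the fallback applies.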

In [29]:
@gen function neo_rent_plain_model(df)
    @info "-----------------NEOPLAIN"
    # with frequency info
    #states = unique(df.state)
    #occurrence = [sum(df.state .== s) for s in states]
    #probs = LinearAlgebra.normalize(occurrence, 1)
    #state = @trace(categorical_named(states, probs), :state => :realization)

    # without frequency info
    states = unique(df.state)
    state = @trace(uniform_categorical(states), :state => :realization)
    @info "$state"

    city = @trace(categorical_co_occurrence(df,
                                            [:state,],
                                            ["categorical"],
                                            :city,
                                            [state],
                                            true), :city)

    @info "$city"

    zip = @trace(categorical_co_occurrence(df,
                                            [:state, :city],
                                            ["categorical", "categorical"],
                                            :zip,
                                            [state, city],
                                            true), :zip)
    @info "$zip"

    living_space = @trace(numerical_co_occurrence(df,
                                                    [:state, :city],
                                                    ["categorical", "categorical"],
                                                    :living_space,
                                                    [state, city],
                                                    false,
                                                    true), :living_space)
    @info "$living_space"

    construct_year = @trace(categorical_co_occurrence(df,
                                                    [:city, :living_space],
                                                    ["categorical", "numerical"],
                                                    :construct_year,
                                                    [city, (living_space, 10.)],
                                                    true), :construct_year)
    @info "$construct_year"

    total_rent = @trace(numerical_co_occurrence(df,
                                                [:living_space, :state, :city],
                                                ["numerical", "categorical", "categorical"],
                                                :rent,
                                                [(living_space, 10.), state, city],
                                                false,
                                                true), :rent)

    @info "Totally $total_rent"
end
println("SUCCESS")

SUCCESS


### Model with embedded categorical variables
- ```embedding_co_occurrence(("Bayern", "Starnberg"), rent)```:
    - It looks up the entries with ```Bayern, Starnberg``` in the given database;
    - If fewer than $k$ entries are found, where $k$ is the mandatory size of the neighborhood, the function adds the nearest neighbors of ```Bayern, Starnberg``` in embedding space:
    $x_{state}, x_{city} = \arg\min_{x_{state}, x_{city}} \; cos\_dist(Bayern, x_{state}) + cos\_dist(Starnberg, x_{city})$, with $x_{state} \in dom(state)$, $x_{city} \in dom(city)$
    - It samples rent from a probability distribution whose mean and variance are computed from the entries found above.
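
A minimal sketch of the neighborhood lookup described above, assuming toy two-dimensional embeddings. The names ```cos_dist``` and ```neighborhood``` and the vectors are hypothetical; the real vectors come from ```CAT_EMBEDDING_DICT```, and the library implementation may differ. Ranking all rows by the summed cosine distance puts exact matches (distance 0) first and fills the remaining slots with the closest neighbors.

```julia
using LinearAlgebra

# Cosine distance between two embedding vectors.
cos_dist(a, b) = 1 - dot(a, b) / (norm(a) * norm(b))

# Toy embeddings (assumption, for illustration only).
state_emb = Dict("Bayern" => [1.0, 0.0], "Hessen" => [0.0, 1.0])
city_emb = Dict("Starnberg" => [1.0, 0.1],
                "Muenchen"  => [0.9, 0.2],
                "Frankfurt" => [0.0, 1.0])

rows = [
    (state = "Bayern", city = "Starnberg", rent = 1650.24),
    (state = "Bayern", city = "Starnberg", rent = 1900.24),
    (state = "Bayern", city = "Muenchen",  rent = 1686.48),
    (state = "Bayern", city = "Muenchen",  rent = 1720.00),
    (state = "Bayern", city = "Muenchen",  rent = 1590.00),
    (state = "Hessen", city = "Frankfurt", rent = 810.30),
    (state = "Hessen", city = "Frankfurt", rent = 1609.90),
]

# Hypothetical helper: return the k rows closest to (state, city).
function neighborhood(rows, state_emb, city_emb, state, city, k)
    dist(r) = cos_dist(state_emb[state], state_emb[r.state]) +
              cos_dist(city_emb[city], city_emb[r.city])
    # Exact matches have distance ~0 and therefore sort first.
    order = sortperm([dist(r) for r in rows])
    return rows[order[1:min(k, length(rows))]]
end

# Only two Starnberg rows exist, so two Muenchen rows fill the neighborhood.
nb = neighborhood(rows, state_emb, city_emb, "Bayern", "Starnberg", 4)
```

Mean and variance of ```rent``` would then be computed over ```nb``` instead of over the exact matches alone.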

In [37]:
@gen function neo_rent_emb_model(df, emb_df, emb_dict)
    @info "-----------------NEOEMB"
    # with frequency info
    #states = unique(df.state)
    #occurrence = [sum(df.state .== s) for s in states]
    #probs = LinearAlgebra.normalize(occurrence, 1)
    #state = @trace(categorical_named(states, probs), :state => :realization)

    # without frequency info
    states = unique(df.state)
    state = @trace(uniform_categorical(states), :state => :realization)
    @info "$state"

    # sample city from neighborhood, this is a hyperparameter...
    city_neighborhood_size = 1
    city = @trace(embedding_co_occurrence(df,
                                            emb_df,
                                            emb_dict,
                                            [:state,],
                                            ["embedding"],
                                            :city,
                                            "categorical",
                                            [state],
                                            city_neighborhood_size), :city)
    @info "$city"

    # neighborhood size == 1 means we want no emb based neighbors
    zip_neighborhood_size = 1
    zip = @trace(embedding_co_occurrence(df,
                                            emb_df,
                                            emb_dict,
                                            [:state, :city],
                                            ["embedding", "embedding"],
                                            :zip,
                                            "categorical",
                                            [state, city],
                                            zip_neighborhood_size), :zip)
    @info "$zip"

    ls_neighborhood_size = 30
    living_space = @trace(embedding_co_occurrence(df,
                                                    emb_df,
                                                    emb_dict,
                                                    [:state, :city],
                                                    ["embedding", "embedding"],
                                                    :living_space,
                                                    "numerical",
                                                    [state, city],
                                                    ls_neighborhood_size), :living_space)
    @info "$living_space"

    cy_neighborhood_size = 30
    construct_year = @trace(embedding_co_occurrence(df,
                                                    emb_df,
                                                    emb_dict,
                                                    [:city, :living_space],
                                                    ["embedding", "numerical"],
                                                    :construct_year,
                                                    "categorical",
                                                    [city, (living_space, 30)],
                                                    cy_neighborhood_size), :construct_year)
    @info "$construct_year"

    rent_neighborhood_size = 30
    total_rent = @trace(embedding_co_occurrence(df,
                                                emb_df,
                                                emb_dict,
                                                [:living_space,
                                                :state, 
                                                :city,
                                                ],
                                                ["numerical", 
                                                "embedding", 
                                                "embedding",
                                                ],
                                                :rent,
                                                "numerical",
                                                [(living_space, 10.), 
                                                state, 
                                                city,
                                                ],
                                                rent_neighborhood_size), :rent)

    @info "Totally $total_rent"
end
println("SUCCESS")

SUCCESS


### Error detection result
- Each city in $dom(city)$ appears 30 times
- ```Berlin``` occurs once with very high rent
- ```Frankfurt``` occurs once with very low rent
- ```Starnberg``` occurs twice with normal values
- ```Frankfurt am Main``` (```FaM```) occurs once with a wrong zip
- ```Muenchen``` occurs once in ```Niedersachsen```

In [38]:
disable_logging(LogLevel(10))
(emb_traces, emb_scores) = k_most_improbable_neo(10,
                                    realization_df,
                                    realization_df,
                                    vcat(CAT_ATTRS, NUM_ATTRS),
                                    neo_rent_emb_model,
                                    (realization_df, embedding_df, CAT_EMBEDDING_DICT))

(plain_traces, plain_scores) = k_most_improbable_neo(10,
                                    realization_df,
                                    realization_df,
                                    vcat(CAT_ATTRS, NUM_ATTRS),
                                    neo_rent_plain_model,
                                    (realization_df, ))
println("SUCCESS")

SUCCESS


In [25]:
for (i, trace) in enumerate(plain_traces)
    println("---------------")
    for attr in vcat(CAT_ATTRS, NUM_ATTRS)
        println("$(attr): $(trace[attr => :realization])")
    end
    println(plain_scores[i])
end

---------------
state: Bayern
city: Bayreuth
living_space: 141.66
zip: 95444
rent: 2749.65
-28.180814601724393
---------------
state: Bayern
city: Wuerzburg
living_space: 112.28
zip: 97082
rent: 2620.63
-20.452810843434733
---------------
state: Bayern
city: Passau
living_space: 142.29
zip: 94033
rent: 2228.49
-24.521683836166023
---------------
state: Nordrhein-Westfalen
city: Essen
living_space: 149.51
zip: 45260
rent: 663.27
-20.58754725884107
---------------
state: Bayern
city: Fuerth
living_space: 136.39
zip: 90768
rent: 1283.33
-20.304718818793486
---------------
state: Nordrhein-Westfalen
city: Essen
living_space: 55.15
zip: 45275
rent: 440.74
-20.45696681135314
---------------
state: Bayern
city: Muenchen
living_space: 124.77
zip: 82723
rent: 1686.48
-21.027550659986353
---------------
state: Bayern
city: Hof
living_space: 145.69
zip: 95031
rent: 2497.0
-20.414149767919522
---------------
state: Bayern
city: Ingolstadt
living_space: 146.88
zip: 85052
rent: 2590.91
-20.303519486

In [39]:
for (i, trace) in enumerate(emb_traces)
    println("---------------")
    for attr in vcat(CAT_ATTRS, NUM_ATTRS)
        println("$(attr): $(trace[attr => :realization])")
    end
    println(emb_scores[i])
end

---------------
state: Bayern
city: Bayreuth
construct_year: 2034
living_space: 141.66
zip: 95444
rent: 2749.65
-24.475249931405074
---------------
state: Bayern
city: Landshut
construct_year: 2032
living_space: 146.45
zip: 84032
rent: 2815.18
-24.918539764499496
---------------
state: Bayern
city: Wuerzburg
construct_year: 2040
living_space: 112.28
zip: 97082
rent: 2620.63
-26.40630044854532
---------------
state: Bayern
city: Passau
construct_year: 2021
living_space: 142.29
zip: 94033
rent: 2228.49
-24.525031669634807
---------------
state: Bayern
city: Fuerth
construct_year: 2045
living_space: 90.66
zip: 90763
rent: 2270.89
-25.08628497378998
---------------
state: Bayern
city: Aschaffenburg
construct_year: 2031
living_space: 122.85
zip: 63741
rent: 2363.67
-24.740265086546998
---------------
state: Baden-Wuerttemberg
city: Ulm
construct_year: 2014
living_space: 148.46
zip: 89153
rent: 2557.51
-25.518152523049615
---------------
state: Bayern
city: Nuernberg
construct_year: 2023
liv

![title](../../resource/nb_neo_30_oversampling.png)

### Next steps
- Dataset with higher cardinality / fewer data points
    - Check both embedding training results and sampling
- For embedding training: naive oversampling is currently used to handle imbalanced classes (with too few ```Starnberg``` entries, the NN model makes little effort to fit them)
    - Try negative sampling or other methods
- Repair suggestion: think about a custom update/proposal function
- Consider a real dataset
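
For reference, the naive oversampling mentioned above can be sketched as follows. This is an assumed, simplified version: ```oversample``` is a hypothetical helper that is not part of the notebook's library, and the target of 30 entries per city is taken from the dataset description.

```julia
using Random

# Hypothetical helper: return row indices such that every label
# occurs at least `target` times (rare labels are drawn with replacement).
function oversample(rng, labels, target)
    idx = Int[]
    for label in unique(labels)
        members = findall(==(label), labels)
        append!(idx, members)
        # Draw extra copies until the class reaches `target`.
        n_extra = max(target - length(members), 0)
        append!(idx, rand(rng, members, n_extra))
    end
    return idx
end

cities = vcat(fill("Muenchen", 30), fill("Starnberg", 2))
idx = oversample(MersenneTwister(0), cities, 30)
balanced = cities[idx]
```

Every duplicated ```Starnberg``` row is an exact copy, which is why negative sampling or other methods are listed as alternatives.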