# 20190923 - Rent Model
Implements a generative model with four random variables:
- __State__: Drawn from a categorical distribution; the weight of each state is proportional to its population and its edit score against the observation
- __City__: Drawn from a categorical distribution conditioned on the previously chosen state; the weight of each city is proportional to its population and its edit score against the observation
- __Zip__: Paired with the city; weighted by its edit score against the observed zip
- __Rent__: Drawn from a normal distribution whose mean is the city's mean rent and whose standard deviation is fixed

Idea: For any observed tuple (dirty or not), the rent model takes it as input and samples a new tuple. Its attribute values are drawn from the corresponding distributions, whose parameters are computed from prior knowledge (a given list of states, cities, populations and mean rents) and from the observation (via edit distance). If no observation is given, the model samples based on external knowledge only.
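
The sampling order described above can be summarized as a factorization. This is only a sketch: "obs" stands for the observed tuple, which enters as a parameter of the categorical weights, and $\mu_{\text{city}}$ and $\sigma$ are names introduced here for the city's mean rent and the fixed standard deviation.

$$P(\text{state}, \text{city}, \text{zip}, \text{rent} \mid \text{obs}) = P(\text{state} \mid \text{obs})\; P(\text{city}, \text{zip} \mid \text{state}, \text{obs})\; P(\text{rent} \mid \text{city}), \qquad \text{rent} \mid \text{city} \sim \mathcal{N}(\mu_{\text{city}}, \sigma^2)$$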

With this model, 3 things can be done:
- Sample fake data with ```Gen.simulate(model, args)```
- Sample tuples under given constraints, e.g. ```rent!=7 and state != "Bayern"```, and evaluate the log probability of each tuple
- Run importance resampling, MCMC or other inference algorithms to iteratively find the most probable tuple given an observation and constraints

#### 1. Preparations
Import packages, implement some helper functions for reading external data etc.

In [None]:
using CSV
using StringDistances
using Gen
using DataFrames

function read_csv(file_path::String)
    # Materialize the parsed file as a DataFrame
    return DataFrame(CSV.File(file_path))
end

function word_to_distance_dict(word, word_list)
    # Map every candidate word to its Levenshtein distance from `word`
    return Dict(w => StringDistances.evaluate(Levenshtein(), word, w) for w in word_list)
end

function get_word_with_typo(word::String, action, pos, chr)
    # Action: 1 for insertion, 2 for removal, 3 for replacement
    # Work on characters instead of bytes, so that non-ASCII names (e.g. "München") are handled
    chars = collect(word)
    converted_chr = convert(Char, chr+96)  # 1..26 -> 'a'..'z'
    if action==1
        insert!(chars, pos, converted_chr)
    elseif action==2
        deleteat!(chars, pos)
    else
        # Replace only the character at `pos`, not every occurrence of it
        chars[pos] = converted_chr
    end
    return String(chars)
end

struct ExtData
    state_population_df::DataFrames.DataFrame
    state_city_population_mean_rent_df::DataFrames.DataFrame
end

function init_ext_data(state_population_df, state_city_population_mean_rent_df)
    global ext_data
    ext_data = ExtData(state_population_df, state_city_population_mean_rent_df)
end

@gen function add_typo(intention, prob)
    if @trace(Gen.bernoulli(prob), :add_more_typo)
        # Which action
        action = @trace(Gen.uniform_discrete(1, 3), :action)
        chr = @trace(Gen.uniform_discrete(1, 26), :chr)
        if action == 1
            # insertion
            pos = @trace(Gen.uniform_discrete(1, length(intention) + 1), :pos)
        else
            # deletion or replacement
            pos = @trace(Gen.uniform_discrete(1, length(intention)), :pos)
        end
        dirty_intention = get_word_with_typo(intention, action, pos, chr)
        return @trace(add_typo(dirty_intention, prob / 2.), :next_typo)
    else
        return intention
    end
end

function get_edit_score(string1::String, string2::String)
    return compare(string1, string2, TokenMax(Levenshtein()))
end

function get_edit_score(string_list::Array{String}, string::String)
    return [compare(i, string, TokenMax(Levenshtein()))  for i in string_list]
end
;

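For the categorical distributions of state and city, the intended weight is population times edit score (a score of 1 is an exact match). The current implementation uses only the exponentiated edit score, $\exp(5 \cdot \text{score})$; the population factor is commented out.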
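
To make the effect of the exponential gain concrete, here is a minimal sketch in plain Julia. The helper name `scores_to_probs` and the edit scores are made up for illustration and are not part of the notebook.

```julia
# Hypothetical helper (not part of the notebook): exponentiate and normalize edit scores
function scores_to_probs(scores; gain=5.0)
    weights = exp.(gain .* scores)
    return weights ./ sum(weights)
end

# Made-up edit scores for three candidates; 1.0 means an exact match
probs = scores_to_probs([1.0, 0.6, 0.2])
```

With a gain of 5, a candidate with score 1.0 is $e^{2} \approx 7.4$ times as likely as one with score 0.6, so the gain directly controls how strongly near matches are preferred.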

In [None]:
function get_weighted_prob_bef_norm(states_or_cities::Array{String}, populations, realization; mask=nothing)
    # Unnormalized categorical weights; without an observation every candidate gets score 1
    edit_dist_score = [1. for _=1:length(states_or_cities)]
    if realization !== nothing
        edit_dist_score = get_edit_score(states_or_cities, realization)
    end
    # Exponential gain of 5, so that close matches dominate
    weighted_pop = [exp(i*5.) for i in edit_dist_score]
    # Population weighting is currently disabled:
    # weighted_pop = populations .* edit_dist_score
    if mask !== nothing
        # Exclude masked candidates, e.g. cities outside the chosen state
        weighted_pop[mask] .= 0.
    end
    return weighted_pop
end
;

#### 2. The model itself
- Input: Observed data tuple (may contain missing values), constraints (optional)
- Output: A suggested clean tuple sampled from the probabilistic intention model
- Procedure:
    1. Fetch external data (list of states and cities)
    2. Choose state proportionally to population and edit score
    3. For chosen state, choose city accordingly
        - The current model doesn't allow choosing cities outside the state, as its output should be free of constraint violation.
    4. Sample rent

- The only prior currently considered is the state population; once a state is chosen, city, zip and rent are conditioned on it. Possible extensions:
    1. Incorporate priors for city $P(city)$, zip and rent, and use e.g. $P(city|state="Bayern")P(city)$
    2. For real data sets, find reasonable prior knowledge, since e.g. population alone is not sufficient

In [None]:
@gen function rent_model(realized_state, realized_city, realized_zip)
    # fetch external data
    global ext_data
    states_df = ext_data.state_population_df
    cities_df = ext_data.state_city_population_mean_rent_df

    # CHOOSE STATE PROPORTIONALLY TO POPULATION AND NUMBER OF TYPOS
    states_weights = get_weighted_prob_bef_norm(convert(Array{String}, states_df[:, :state]), states_df[:, :population], realized_state)
    states_prob = states_weights ./ sum(states_weights)
    intended_state_ind = @trace(Gen.categorical(states_prob), :intended_state_ind)
    intended_state = states_df[intended_state_ind, :state]

    # CHOOSE CITY AND ZIP PROPORTIONALLY TO POPULATION AND NUMBER OF TYPOS
    # Mask out every city that does not belong to the chosen state
    mask = cities_df[:, :state] .!= intended_state
    cities_weights = get_weighted_prob_bef_norm(convert(Array{String}, cities_df[:, :city]), cities_df[:, :population], realized_city, mask=mask)
    if realized_zip !== nothing
        zip_edit_score = get_edit_score([string(z) for z in cities_df[:, :zip]], string(realized_zip))
        zip_edit_score = [exp(i*5.) for i in zip_edit_score]
        cities_weights .*= zip_edit_score
    end
    cities_prob = cities_weights ./ sum(cities_weights)
    intended_city_ind = @trace(Gen.categorical(cities_prob), :intended_city_ind)
    intended_city = cities_df[intended_city_ind, :city]
    intended_zip = cities_df[intended_city_ind, :zip]

    # SAMPLE RENT
    rent = @trace(Gen.normal(cities_df[intended_city_ind, :mean_rent], 2.), :rent)

    # ADD TYPOS (only for attributes that were not observed)
    if realized_city === nothing
        realized_city = @trace(add_typo(cities_df[intended_city_ind, :city], 0.2), :realized_city)
    end
    if realized_state === nothing
        realized_state = @trace(add_typo(cities_df[intended_city_ind, :state], 0.2), :realized_state)
    end
    # Return the suggested clean tuple
    return (intended_state, intended_city, intended_zip, rent)
end
;

Read data

In [None]:
init_ext_data(read_csv(joinpath(@__DIR__, "../data/states.csv")), read_csv(joinpath(@__DIR__, "../data/cities.csv")))

#### 3. Simulation - Sample ```ys = intended_tuple``` for ```xs = observed_tuple```
- Note that we have no constraints here, i.e. for no attribute value do we know whether it is clean.
- Results are quite bad. Reasons:
    - Without constraints, the model has too much freedom
    - The parameters of the distributions don't fit. Population should not outweigh the edit score this much.
    - The edit score measure (```TokenMax(Levenshtein())```) is too weak; a new measure is needed
    - New strategies are needed, e.g. if ```edit_score(realization, city_a) > threshold```, we set the weights of all other cities to zero. __This would bring us back to heuristic decision rules!__
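
For illustration only, the threshold strategy from the last bullet could be sketched as follows. The function name `apply_threshold_rule`, the default threshold of 0.9 and the example numbers are assumptions made here, not part of the notebook.

```julia
# Hypothetical helper: if any candidate's edit score exceeds the threshold,
# zero out the weights of all candidates below it.
function apply_threshold_rule(weights, scores; threshold=0.9)
    hits = scores .> threshold
    # No candidate is close enough: leave the weights untouched
    any(hits) || return copy(weights)
    # Otherwise keep only the candidates above the threshold
    return ifelse.(hits, weights, zero(eltype(weights)))
end

# Made-up weights and edit scores for three cities
weights = [2.0, 5.0, 1.0]
scores = [0.3, 0.95, 0.5]
pruned = apply_threshold_rule(weights, scores)
```

Such a rule makes the distribution deterministic whenever one candidate is close enough, which is exactly the heuristic behaviour the bullet above warns about.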

In [None]:
# Sample 10 traces
traces = [Gen.simulate(rent_model, ("Hessen", "Frankfurt am Main", "60596")) for _=1:10];

#### 4. Simulation with constraints - Sample ```ys = intended_tuple``` for ```xs = dirty_or_NA_attr``` and ```constraints = clean_attr```
- For each sample that comes out here, we also get a weight

$w = \log\frac{p(t,r;x)}{q(t;u,x)\,q(r;x,t)}$,

where $u$ denotes the constraints, $x$ the dirty attribute values, and $t, r$ the sampled values.

- If constraints conflict with each other, ```weight = -Inf```

Here we test ```["Hesssen", NaN, "60596", "4."]``` as an observation. We consider the rent trustworthy and set it as a constraint.
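
When only the rent is constrained, and assuming that the unconstrained choices (state and city) are proposed from the model itself, the weight reduces to the log density of the constrained rent under the sampled city's normal distribution. The sketch below illustrates this in plain Julia; `normal_logpdf` and the two mean rents are made up for illustration and do not come from the data files.

```julia
# Log density of a normal distribution, written out by hand (hypothetical helper)
normal_logpdf(x, mu, sigma) = -0.5 * ((x - mu) / sigma)^2 - log(sigma) - 0.5 * log(2π)

constrained_rent = 4.0       # trusted rent in EUR/m2
mean_rent_expensive = 13.0   # made-up mean rent of a sampled city
mean_rent_cheap = 7.0        # made-up mean rent of another sampled city

# Weight of a sample = log density of the constraint given the sampled city
weight_expensive = normal_logpdf(constrained_rent, mean_rent_expensive, 2.0)
weight_cheap = normal_logpdf(constrained_rent, mean_rent_cheap, 2.0)
```

A sample whose city has a mean rent close to the constraint receives a much larger weight, which is why a low trusted rent favours samples from cheaper regions.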

In [None]:
constraints = Gen.choicemap()

# conflicting constraints will result in -inf weight
# constraints[:intended_state_ind] = 5 # Hessen
constraints[:rent] = 15.0    # trusted rent in EUR/m2
#constraints[:intended_city_ind] = 56 # FaM

for _=1:10
    (trace, weight) = Gen.generate(rent_model, ("Bayern", "Frankfurt am Main", "60306"), constraints)
    println(weight)
end

The model considers the rent too low for Hessen and therefore assigns greater weights to samples with ```"Niedersachsen"```.

A notable open task is to implement __outlier detection__ or to use training data to tune the parameters of the model.

#### 5. Inference
- We set up constraints and the number of iterations, and use importance resampling to infer the most probable tuple for a given observation.
- __Problem__: If the model is large and/or its attributes have large sets of admissible values, the inference program may fail to explore all possible value combinations and get stuck in local maxima.
- Inference problems for such models are non-convex in general.
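
To make the procedure concrete, the following is a toy sketch of importance resampling in plain Julia, independent of Gen: candidates are proposed from a uniform prior, weighted by the likelihood of the constrained rent, and one candidate is resampled in proportion to its weight. All names and numbers are hypothetical, and Gen's own implementation may differ in detail.

```julia
using Random

normal_logpdf(x, mu, sigma) = -0.5 * ((x - mu) / sigma)^2 - log(sigma) - 0.5 * log(2π)

# Numerically stable log(sum(exp.(xs)))
function logsumexp(xs)
    m = maximum(xs)
    return m + log(sum(exp.(xs .- m)))
end

# Draw an index from a normalized probability vector
function sample_index(rng, probs)
    u = rand(rng)
    idx = findfirst(c -> c >= u, cumsum(probs))
    return something(idx, length(probs))  # guard against rounding at the upper end
end

function toy_importance_resampling(rng, mean_rents, observed_rent, num_particles; sigma=2.0)
    # Proposal: uniform prior over the candidate cities
    particles = rand(rng, 1:length(mean_rents), num_particles)
    # Log weight of each particle: likelihood of the constrained rent
    log_weights = [normal_logpdf(observed_rent, mean_rents[i], sigma) for i in particles]
    log_total = logsumexp(log_weights)
    # Resample one particle in proportion to its weight
    chosen = particles[sample_index(rng, exp.(log_weights .- log_total))]
    # Estimate of the log marginal likelihood: log of the average weight
    lml_est = log_total - log(num_particles)
    return chosen, lml_est
end

rng = MersenneTwister(1)
mean_rents = [8.0, 12.0, 17.0]  # made-up mean rents of three cities
chosen, lml_est = toy_importance_resampling(rng, mean_rents, 12.25, 1000)
```

The sketch also shows where the scalability problem comes from: a value combination that is never proposed can never be selected, so the number of particles has to grow with the size of the admissible value sets.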

In [None]:
function do_inference(model, args, constraints, num_iter)
    # Use the model passed in as argument instead of the hard-coded rent_model
    (trace, lml_est) = Gen.importance_resampling(model, args, constraints, num_iter)
    return (trace, lml_est)
end

constraints = Gen.choicemap()
constraints[:rent] = 12.25
#constraints[:intended_state_ind] = 5

trace, lml_est = do_inference(rent_model, (nothing, "munich", "80331"), constraints, 100000)
println([ext_data.state_city_population_mean_rent_df[trace[:intended_city_ind], :state], ext_data.state_city_population_mean_rent_df[trace[:intended_city_ind], :city],
ext_data.state_city_population_mean_rent_df[trace[:intended_city_ind], :zip], trace[:rent]])
println(lml_est)

#### 6. Immediate tasks
- Incorporate different types of integrity constraints (quantitative statistics, denial rules etc.)
- Try out parameter learning using training data
- Try out realistic data sets!
- Test out inference performance and scalability by enlarging the model (incorporate more attributes)
- Find other packages/libraries that are useful for this project
- Decide whether to stay with Julia or switch to Python for flexibility and extensibility

#### 7. Future directions
- How to generalize the process of model building? How to allow users to specify a model without hard coding? --> Constraint definition using specific expressions
- How to combine mixed types of integrity constraints? How to specify them?
- Improve inference program:
    - Pruning sample space
    - Performance comparison between algorithms (MCMC, SVI, MH)
    - Neural network?
- Supervised, weakly-supervised and unsupervised learning --> How? Possible?

__More buzzwords__: Featurization, error detection, outlier detection, grounding, graphical models