OTP and ETL exercise to deserialize several large CSVs, perform transformations, and encode into JSON.

HouTax

OTP and ETL demo with a command line interface, written in Elixir. The original open dataset comes from the City of Houston's Tax Rolls by Year.

But What Does It Do?

This application reads property/building tax data from the City of Houston's Tax Rolls by Year as CSV files. For each building, and for each year in which a record of that property exists, the application calculates how much the building's value increased or decreased. This data is then serialized as JSON and written to disk in a single export file.
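As a rough illustration of the per-building calculation described above, here is a minimal sketch. The module name `ValueChangeSketch`, the function `yearly_changes/1`, and the year-to-value map shape are all hypothetical and are not taken from the HouTax source. It also assumes that each year is compared against the previous year on record, which may not be how the application handles gaps.

```elixir
defmodule ValueChangeSketch do
  # Hypothetical helper, not part of HouTax.
  # Takes a map of year => assessed value and returns a list of
  # {year, change} tuples, one for each year that has an earlier record.
  def yearly_changes(values_by_year) do
    values_by_year
    |> Enum.sort_by(fn {year, _value} -> year end)
    # Pair each record with the one that follows it.
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.map(fn [{_prev_year, prev}, {year, curr}] -> {year, curr - prev} end)
  end
end

ValueChangeSketch.yearly_changes(%{2012 => 100_000, 2013 => 110_000, 2014 => 105_000})
# => [{2013, 10000}, {2014, -5000}]
```

A building with only one year on record yields an empty list, since there is nothing to compare against.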

To Build

  1. Install Erlang and Elixir

  2. Clone this repository into a working directory: git clone git@github.com:GeoffreyPS/HouTax.git

  3. CD into the cloned repository and fetch dependencies with mix deps.get. If you don't already have the Hex package manager, you will be prompted to install it.

  4. Build the application with mix escript.build

To Run

If you already have any instance of Erlang installed on your machine, you can run the CLI without futzing around with installing Elixir or the project's dependencies.

  1. Enter ./hou_tax for help or ./hou_tax --path path/to/csvs to get the application started.

  2. Sample data is located in the test/data directory.

  3. An export file titled houtax_export.json should appear.
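For readers curious how an escript might handle the --path flag from step 1, the following is a sketch using Elixir's standard `OptionParser`. The module `CliSketch` and its return values are hypothetical; the actual HouTax CLI may parse its arguments differently.

```elixir
defmodule CliSketch do
  # Hypothetical argument handling, not the HouTax implementation.
  # Returns {:run, path} when --path is given, and :help otherwise.
  def parse(argv) do
    {opts, _rest, _invalid} =
      OptionParser.parse(argv, strict: [path: :string, help: :boolean])

    case Keyword.fetch(opts, :path) do
      {:ok, path} -> {:run, path}
      :error -> :help
    end
  end
end

CliSketch.parse(["--path", "test/data/100"])
# => {:run, "test/data/100"}
```

Falling back to help when no path is supplied mirrors the behaviour described in step 1, where running ./hou_tax with no arguments prints help.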

To Look Under the Hood

You must have completed steps 1-3 of To Build to do this part.

  1. CD into the cloned repository and run iex -S mix. This starts the application in iex, Elixir's REPL.

  2. Enter :observer.start to see the process tree started by the application in a GUI. (Screenshot: application tree.)

$ iex -S mix
Erlang/OTP 18 [erts-7.3] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

Interactive Elixir (1.3.0) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> :observer.start
:ok
iex(2)> File.cd "./test/data/100"
:ok
iex(3)> {:ok, files} = File.ls
{:ok, ["2012.csv", "2013.csv", "2014.csv", "2015.csv"]}
iex(4)> csvs = Enum.map(files, &(Path.expand(&1, __DIR__)))
["/Users/geoff/Projects/sandbox/elixir-learning/HouTax/test/data/100/2012.csv",
 "/Users/geoff/Projects/sandbox/elixir-learning/HouTax/test/data/100/2013.csv",
 "/Users/geoff/Projects/sandbox/elixir-learning/HouTax/test/data/100/2014.csv",
 "/Users/geoff/Projects/sandbox/elixir-learning/HouTax/test/data/100/2015.csv"]
iex(5)> HouTax.process csvs
[:ok, :ok, :ok, :ok]
iex(6)> HouTax.write_all
Finished write!
:ok
iex(7)>

Useful to note:

  • Each building's data exists in its own distinct process. In the event of a crash or a misread, the other processes are unaffected.

  • For the full datasets, the BEAM VM might need tuning to allow 750,000+ processes (for example, raising the process limit with the erl +P flag), and the default timeouts might need to be changed. However, for the sample dataset of 20k+ properties, this technique works fine. Alternatively, one could solve this problem with a pool of workers, caching the property information in an ETS table.

  • Saša Jurić's book Elixir in Action provided the model for much of this exercise, with a few departures. Benjamin Tan Wei Hao's forthcoming book The Little Elixir and OTP Guidebook was also very helpful.
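The process-per-building isolation noted in the first bullet can be illustrated with a small sketch. The module `BuildingSketch` is hypothetical and uses plain unlinked Agents for brevity; HouTax's own process tree, visible in the observer, may be structured quite differently.

```elixir
defmodule BuildingSketch do
  # Hypothetical module, not the HouTax implementation.
  # Starts an unlinked Agent that holds one building's records.
  def start(records), do: Agent.start(fn -> records end)

  def records(pid), do: Agent.get(pid, & &1)
end

{:ok, a} = BuildingSketch.start(%{2012 => 100_000})
{:ok, b} = BuildingSketch.start(%{2012 => 250_000})

# Simulate a crash in one building's process and wait until it is down.
ref = Process.monitor(a)
Process.exit(a, :kill)

receive do
  {:DOWN, ^ref, :process, _pid, _reason} -> :ok
after
  1_000 -> raise "process a did not terminate"
end

Process.alive?(a)
# => false
Process.alive?(b)
# => true
BuildingSketch.records(b)
# => %{2012 => 250000}
```

Because the two processes share no state and are not linked, killing one leaves the other's data intact. A real application would normally place such processes under a supervisor so that crashed ones can be restarted.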
