Explore the Stack Overflow data set with the Elastic Stack using this gentle introduction. Stack Overflow data is indexed using .NET Core, a cross-platform, open source framework for building applications, and NEST, the official Elasticsearch client for .NET.
- Download Elasticsearch 7.4.2 or later
- Download Kibana 7.4.2 or later (the version must match the Elasticsearch version)
- Install .NET Core 3.0
- Download the latest Stack Overflow data set
  - Under 7Z files, choose stackoverflow.com-Posts.7z, stackoverflow.com-Users.7z and stackoverflow.com-Badges.7z
- Unzip the Stack Overflow data set to a directory. You'll need around 90GB of available space!
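Before extracting the archives, it can help to confirm that the target drive really has enough free space. The following is a minimal sketch for a POSIX shell; the function name `has_free_space_gb` and the `/path/to/data` directory are placeholders for illustration and are not part of this repository.

```shell
# Succeeds (exit status 0) when DIR has at least REQUIRED_GB gigabytes free.
# Usage: has_free_space_gb DIR REQUIRED_GB
has_free_space_gb() {
  dir=$1
  required_gb=$2
  # df -P prints one POSIX-format line per file system; -k reports 1024-byte blocks.
  # The fourth column of the second line is the available space in kilobytes.
  available_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$available_kb" -ge $((required_gb * 1024 * 1024)) ]
}
```

For example, `has_free_space_gb /path/to/data 90 && echo "enough space"` prints the message only when at least 90GB is available in that directory.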
- Restore project NuGet package dependencies. In the solution root directory
    dotnet restore
- Build the solution in Release configuration. In the solution root directory
    dotnet build -c Release
- Set the JVM heap size to at least 8GB by adding the following to the jvm.options file in the config directory within the Elasticsearch home directory, and saving the file
    -Xms8g
    -Xmx8g
- Start Elasticsearch using the elasticsearch.[sh|bat] file in the bin directory within the Elasticsearch home directory
    ./elasticsearch.bat
- Navigate to the StackOverflow.Indexer/bin/Release/netcoreapp3.0 directory from the root of the solution. It should contain the StackOverflow.Indexer.dll file produced by compiling the solution in the previous steps.
- Check available options for indexing posts or users using the --help argument
    dotnet .\StackOverflow.Indexer.dll --help
    dotnet .\StackOverflow.Indexer.dll posts --help
    dotnet .\StackOverflow.Indexer.dll users --help
    dotnet .\StackOverflow.Indexer.dll tags --help
- Index posts data
    dotnet .\StackOverflow.Indexer.dll posts -e "http://localhost:9200" -f "/path/to/Posts.xml"
  Indexing all questions and answers takes around 90 minutes on a local single-node Elasticsearch cluster.
- Index users data
    dotnet .\StackOverflow.Indexer.dll users -e "http://localhost:9200" -f "/path/to/Users.xml" -b "/path/to/Badges.xml"
  Indexing all users and their badges takes around 15 minutes on a local single-node Elasticsearch cluster.
- (Optional) Update answers with tags
  If you'd like to filter both questions and answers by tag, it can be useful to denormalize question tags onto answers. The source data could be transformed before ingestion to do this, but this command achieves it using the update by query API.
    dotnet .\StackOverflow.Indexer.dll tags -e "http://localhost:9200" -f "/path/to/Posts.xml"
  This can take a few hours. The -s argument changes the number of concurrent updates, so depending on the performance of the cluster you're indexing into, you may be able to increase it to speed up the process.
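The exact request that the tags command sends is not shown here. As a rough illustration of denormalizing with the update by query API, the sketch below builds a request that copies one question's tags onto all of its answers. The `posts` index name, the `parentId` and `tags` field names, and both function names are assumptions made for the example, so check the index mapping before adapting it.

```shell
# Print an update by query request body that sets the tags of every answer
# whose parent is the given question.
# ASSUMPTION: the "parentId" and "tags" field names are hypothetical.
tags_update_body() {
  question_id=$1
  tags_json=$2   # a JSON array, for example '["c#","linq"]'
  printf '{"query":{"term":{"parentId":%s}},"script":{"lang":"painless","source":"ctx._source.tags = params.tags","params":{"tags":%s}}}\n' \
    "$question_id" "$tags_json"
}

# Send the update for one question to Elasticsearch.
# ASSUMPTION: the "posts" index name is hypothetical.
# Usage: update_answer_tags ES_URL QUESTION_ID TAGS_JSON
update_answer_tags() {
  es_url=$1
  # conflicts=proceed keeps going if a document changes during the update.
  curl -s -X POST "$es_url/posts/_update_by_query?conflicts=proceed" \
    -H 'Content-Type: application/json' \
    -d "$(tags_update_body "$2" "$3")"
}
```

Calling `update_answer_tags "http://localhost:9200" 4 '["c#","linq"]'` would issue one such update. Passing the tags as script parameters rather than embedding them in the script source lets Elasticsearch reuse the compiled script across questions.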
The kibana_saved_objects_742.ndjson file can be imported into Kibana to apply some preconfigured saved queries, visualizations and a dashboard:
- Navigate to the Management menu item within Kibana
- Under Kibana, select Saved Objects
- Select Import and choose the kibana_saved_objects_742.ndjson file.
There should now be
- a Dashboard under the Dashboard menu item
- a collection of Visualizations under the Visualize menu item
- a collection of Saved Queries under the Discover menu item
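As an alternative to importing through the UI, Kibana 7.x also exposes an HTTP endpoint for importing saved objects. The sketch below assumes a POSIX shell with curl and a Kibana instance without authentication; the function names are hypothetical, and `http://localhost:5601` in the usage note is only Kibana's default address.

```shell
# Build the URL of Kibana's saved objects import endpoint.
kibana_import_url() {
  # Strip one trailing slash so the result never contains "//api".
  printf '%s/api/saved_objects/_import\n' "${1%/}"
}

# Upload an .ndjson export to Kibana as a multipart form.
# Usage: import_saved_objects KIBANA_URL NDJSON_FILE
import_saved_objects() {
  kibana_url=$1
  ndjson_file=$2
  # Kibana rejects API writes that lack a kbn-xsrf header.
  curl -s -X POST "$(kibana_import_url "$kibana_url")" \
    -H 'kbn-xsrf: true' \
    --form "file=@$ndjson_file"
}
```

For example, `import_saved_objects "http://localhost:5601" kibana_saved_objects_742.ndjson` uploads the file from the current directory.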
- Content of this repository is made available under the Apache 2.0 license.
- Stack Overflow data is made available under the Creative Commons Attribution-ShareAlike 4.0 International license.