![Datadog Logo](images/dd_logo.png)

# Distributed Tracing with Datadog APM

Welcome to the Distributed Tracing with APM workshop! 

This repo contains a "dummy" Water System microservices project, a single page web application with microservices, ready to be instrumented and analyzed using Datadog's APM.

Using features like Service Map and Trace Search, we'll see how APM can help alleviate some of the complexity of writing and deploying code for a distributed system.

![Dashboard Image](images/service-map.png)

The repo itself is run using Docker, along with `docker-compose`, which should now be bundled with [Windows](https://store.docker.com/editions/community/docker-ce-desktop-windows) and [MacOS](https://store.docker.com/editions/community/docker-ce-desktop-mac) versions of Docker. 

On Linux, Docker can be installed with your preferred method, and `docker-compose` should be available via `pip install docker-compose`.

Using Docker allows us to set up multiple microservices locally, giving us an environment to rapidly build, test, and instrument our demo distributed system.

For the workshop, we'll be using some fresh demo accounts. These don't have any other services submitting to their Datadog account.

If you'd like to play around on your own, or on an account your company provides, and don't want to collide with other people's work, pay attention to the `env` we set in the workshop. You can set it to whatever you prefer, and your data will be isolated under that value in the Datadog drop-down.

[I've](http://www.twitter.com/burningion) worked hard to make this workshop as helpful as possible, but if you see something that could be improved, please feel free to create a [GitHub issue](https://github.com/burningion/distributed-tracing-with-apm-workshop) on the repo, or reach out via the [Datadog public slack](https://chat.datadoghq.com/).

# Getting Started 

The first thing you'll want to do is create a [Datadog account](https://www.datadoghq.com), and download this repo. 

You'll then want to run `docker-compose up`, using the API key from your new trial account.

Your command should look like the following on MacOS/Linux:

```bash
$ POSTGRES_USER=postgres POSTGRES_PASSWORD=<pg password> DD_API_KEY=<api key> docker-compose up
```

For Windows, the process of setting environment variables is a bit different:

```powershell
PS C:\dev> $env:POSTGRES_USER="postgres"
PS C:\dev> $env:POSTGRES_PASSWORD="<pg password>"
PS C:\dev> $env:DD_API_KEY="<YOUR_API_KEY>"
PS C:\dev> docker-compose up
```

With the command run, you should see Docker start pulling down container images for the code. Afterwards, you'll be able to go to [http://localhost:5000/](http://localhost:5000/) and see the single page web app.

![Our Single Page App](images/dashboard.png)

Refresh the page, click around, add a pump, try adding a city. This will begin to generate APM traffic to send back to Datadog, to be processed. 

Tab back over to your console, and look over the container logs. Notice there's a polling web request generating APM traffic that will begin showing up in the Datadog APM backend.

Before we walk through instrumenting our code in the workshop, let's quickly take a look at what Datadog's APM gives us out of the box for our code, and see how APM instrumentation gives us rapid insight into our software's architecture and performance.

For reference, here's a rough image of our architecture. Compare it to what we see out of the box with APM and service maps:

![Demo Architecture](images/workshop-architecture.png)

# Traces List, Services List, and Service Maps

The APM product itself integrates across multiple places in the Datadog user interface. It uses Traces to discover and track services across your entire infrastructure, updating and adapting to rapidly changing services and infrastructure.

The first thing we'll want to do is go directly to the Trace List, to check that our traces are showing up properly and that we've successfully established a connection to Datadog.

![Trace List](images/trace-list.png)

Here, we can see the traces as they come in, and get our first feel for how APM works.

If we hover over one of the traces, we'll see the option to view the trace. Clicking into this, we stay on the same page, but a flame graph of our trace comes up, along with some accompanying information.

Having established our traces are coming in properly, let's jump over to the Service List, and see which services are running on our new repo.

If we look at the list of services, we can get an overview for our application, and see all the pieces running within it:

![Service List](images/service-list.png)

From the service list, we can get a quick glance at the health of the services we run internally, along with their request load.

This allows us to quickly see the places within our code where latency has gone up, or services which have a raised error rate.

Diving into one of the services, we can see a list of the specific endpoints within that service.

![Service Drill Down](images/service-drilldown.png)

From here, we can then view the specific traces for this service, drilling down to just that service and its traces. 

![Trace Service Drill Down](images/service-trace-drilldown.png)

Notice we can see a list of specific traces here. Clicking into one of these traces gives us a view of the entire request as it was processed through our distributed system:

![Individual Trace](images/individual-trace.png)

Besides the flame graph, there's also an option to view the list of spans that comprise the entire trace. Clicking into this view, we can see specific database calls, and sort by latency to see the bottlenecks within our systems.
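To make the span list a little more concrete, here is a minimal sketch of a trace as a list of spans, sorted by duration to surface the slowest ones. The `Span` class, the `slowest_spans` helper, and every service name except `users-api` are hypothetical and only for illustration; this is not Datadog's data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """Simplified, hypothetical stand-in for one span in a trace."""
    span_id: int
    parent_id: Optional[int]  # None marks the root span of the trace
    service: str
    resource: str
    duration_ms: float


def slowest_spans(spans: List[Span], limit: int = 3) -> List[Span]:
    """Return the spans with the highest duration, slowest first."""
    return sorted(spans, key=lambda s: s.duration_ms, reverse=True)[:limit]


# Made-up trace: one frontend request fanning out to two downstream calls.
example_trace = [
    Span(1, None, "frontend", "GET /status", 120.0),
    Span(2, 1, "pumps-service", "GET /devices", 35.0),
    Span(3, 2, "postgres", "SELECT * FROM pumps", 28.0),
    Span(4, 1, "users-api", "GET /users", 70.0),
]

for span in slowest_spans(example_trace):
    print(f"{span.duration_ms:>7.1f} ms  {span.service}  {span.resource}")
```

Note that a parent span's duration includes the time spent in its children, so the root span will normally sort to the top; the interesting bottlenecks are usually the slowest spans beneath it.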

Datadog APM integrates across your hosts and logs, so there are options to view logs and host level metrics at a specific trace too. Clicking on hosts shows this state, and allows you to drill down into the underlying hosts' status at the time of your trace.

![Individual Trace with List](images/individual-trace-list.png)

Finally, because traces are structured views of your system, they allow us to build a Service Map, describing your service level infrastructure as it changes over time. Clicking into the Service Map shows us a snapshot of the infrastructure we have running, and the relationships across our distributed system.

![Service Map Drilldown](images/service-map-drill.png)

Clicking on any of our services shows a pop up, allowing us to inspect it further and see the relationship between it and any services it communicates with or relies upon.

Service maps take in any services detected over the past two weeks. Any newly created service should show up soon.

![Service map Options](images/service-map-option.png)
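As a rough illustration of why traces are enough to draw a map like this, the sketch below derives caller-to-callee edges from the parent/child links between spans. The tuple layout, the `service_edges` helper, and the service names other than `users-api` are assumptions made for this example, not how Datadog builds its Service Map.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

# Each span is (span_id, parent_id, service); a simplified, hypothetical shape.
SpanRow = Tuple[int, Optional[int], str]


def service_edges(spans: Iterable[SpanRow]) -> Dict[str, Set[str]]:
    """Map each calling service to the set of services it calls."""
    rows = list(spans)
    service_of = {span_id: service for span_id, _, service in rows}
    edges: Dict[str, Set[str]] = defaultdict(set)
    for _, parent_id, service in rows:
        parent_service = service_of.get(parent_id)
        # An edge exists only when a span's parent belongs to another service.
        if parent_service is not None and parent_service != service:
            edges[parent_service].add(service)
    return dict(edges)


example_spans = [
    (1, None, "frontend"),
    (2, 1, "frontend"),       # internal work, no cross-service edge
    (3, 1, "users-api"),
    (4, 1, "pumps-service"),
    (5, 4, "postgres"),
]

for caller, callees in sorted(service_edges(example_spans).items()):
    print(caller, "->", sorted(callees))
```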

# Trace Search and Analytics With Tagging

Looking at the list of our services, we can now get a feel for how they all connect. We now have a feel for the architecture of our application, along with any bottlenecks that may exist.

But beyond this, when we're debugging our applications in the real world, we need to know more details beyond a service level, or an abstract idea for where bottlenecks occur in our application.

For example, we may have a specific customer who keeps complaining about errors nobody has been able to reproduce.

By adding a `tag` to our trace, we can then add a facet to Trace Search, allowing us to drill down on specific requests for a single customer.

If we're in our Python API, it's as simple as grabbing the current span, and then setting tags:

```python
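from ddtrace import tracer
# Tag the active span so its trace can be filtered by user in Trace Search
span = tracer.current_span()
span.set_tags({'user_id': user['id']})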
```

Note that `set_tags` takes a dictionary, so we can set multiple tags at once and drill down further if needed.

From here, we can now submit requests that generate traffic for Trace Search and Analytics. Once we have some tagged traces, we can then add facets, allowing us to drill down even further.

![Trace Search Facet Menu](images/trace-search-facet-menu.png)

Notice how we can add a facet to any tag we've set, allowing us to analyze and track how our services perform for any specific user or use case.

From here, we can now watch as every request we generate with our frontend's `Generate Requests for Random User` populates throughout the system.

With a few requests, we can then switch back over to the service, and filter by each individual `User Id`.

![Trace Search User Id](images/trace-search-userid.png)
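Conceptually, a facet is just a tag we can group and filter by. The sketch below imitates that on a handful of made-up traces, each reduced to a dictionary of its tags. The `filter_by_facet` and `facet_counts` helpers, the resources, and the values are all hypothetical; they only illustrate the idea and are not part of Datadog's API.

```python
from collections import Counter
from typing import Dict, List

# Each trace is reduced to a dict of its tags; all values are made up.
tagged_traces: List[Dict[str, object]] = [
    {"user_id": 7, "resource": "GET /users", "error": False},
    {"user_id": 7, "resource": "POST /add_pump", "error": True},
    {"user_id": 12, "resource": "GET /users", "error": False},
    {"resource": "GET /status", "error": False},  # untagged trace
]


def filter_by_facet(traces, facet, value):
    """Keep only traces whose tag `facet` equals `value`."""
    return [t for t in traces if t.get(facet) == value]


def facet_counts(traces, facet):
    """Count traces per facet value, skipping traces without the tag."""
    return Counter(t[facet] for t in traces if facet in t)


print(facet_counts(tagged_traces, "user_id"))
print(filter_by_facet(tagged_traces, "user_id", 7))
```

Traces that were never tagged simply don't appear under the facet, which is why it matters to set the tag on every request we may later want to search for.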

# How Observability and APM Support Each Other

Although the focus of today's workshop is specifically APM, we've already seen APM's deep integration with monitors, logs, and metrics. 

This integration allows us to build a more complete vision of our software systems, as we build and deploy them across time. 

As more of our software products become distributed systems, and more of our infrastructure becomes code, having the ability to see across layers of your deployment stack is what will help when diagnosing issues, or making plans for change.

And this is at the heart of what Datadog is built to do: provide insight into the process of building, deploying, and supporting systems in production.

![Three Pillars of Observability](images/observability.png)

A distributed system is fundamentally a complex system to build, maintain, and deploy. Datadog gives you the tools to observe the entire system, allowing you to make better decisions about code, infrastructure, and errors.

APM itself fits well into a developer's workflow, adding a way to visualize requests across all systems internally, even those they may not work on.

By adding tags and using trace search and analytics, they can drill down and see customer level problems, as they pass through all systems, and more quickly diagnose what's happening, and where.

# Let's Start Breaking Things!

Let's introduce latency into our downstream services.

The best candidate service for introducing latency is our `users-api` service. I've already added an async sleep library, called `sleep-promise`.

Since our `/users` api endpoint is getting called every few seconds, let's try adding some latency there to simulate the end user experience. 

This allows us to get a feel for which pieces of our infrastructure are true bottlenecks for the end user experience.
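The workshop's `users-api` service uses `sleep-promise` for this. To keep the examples in this README in one language, here is the same idea sketched in Python with `asyncio`: a handler that awaits an artificial delay before responding. The `get_users` and `timed_call` functions and the 0.2 second delay are assumptions for illustration, not the workshop's actual code.

```python
import asyncio
import time


async def get_users(delay_s: float = 0.0):
    """Hypothetical handler standing in for the `/users` endpoint."""
    if delay_s > 0:
        # Artificial latency: yields to the event loop instead of blocking it.
        await asyncio.sleep(delay_s)
    return [{"id": 1, "name": "example-user"}]


async def timed_call(delay_s: float) -> float:
    """Return how long one call to the handler takes, in seconds."""
    start = time.perf_counter()
    await get_users(delay_s)
    return time.perf_counter() - start


baseline = asyncio.run(timed_call(0.0))
slowed = asyncio.run(timed_call(0.2))
print(f"baseline: {baseline * 1000:.1f} ms, with sleep: {slowed * 1000:.1f} ms")
```

Because the sleep is awaited rather than blocking, other requests can still be served while this one waits, so the added latency shows up on this endpoint alone.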

## Creating Traffic and Viewing Our Request

![Trace Analytics Duration](images/trace-analytics-duration.png)

Under Trace Search, we can switch over to analytics, and drill down by duration on our endpoint.

Here, we can see how adding our sleep increased our response time, along with the trend over time.

Also note, we can export any of our traces we're looking at as Timeboards into Datadog.

![Trace Analytics Timeboard](images/trace-analytics-timeboard.png)

With the ability to monitor your most critical endpoints for true latency, you'll have a chance to catch degraded user experience in production sooner rather than later.

# Alright, show me how to instrument my code

Here we'll walk through how our code has been instrumented with Datadog APM.

In general, for production code, launching a supported app through `ddtrace-run` should be sufficient.

Let's step through our Python Flask and NodeJS code, and explain how things are instrumented the way they are.
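Before looking at the real code, it can help to see what an instrumentation wrapper does conceptually: time a call, note whether it failed, and record the result as a span. The sketch below uses only the standard library. The `traced` decorator, the `recorded_spans` list, and the function names are hypothetical; this is not how `ddtrace` is implemented and not its API.

```python
import functools
import time

# Collected span records; a real tracer would ship these to an agent instead.
recorded_spans = []


def traced(name):
    """Hypothetical decorator that records one span per call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            error = None
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Remember the error type, then let the exception propagate.
                error = type(exc).__name__
                raise
            finally:
                recorded_spans.append({
                    "name": name,
                    "duration_s": time.perf_counter() - start,
                    "error": error,
                })
        return wrapper
    return decorator


@traced("pumps.list")
def list_pumps():
    return ["pump-1", "pump-2"]


@traced("pumps.fail")
def broken_call():
    raise ValueError("simulated failure")


list_pumps()
try:
    broken_call()
except ValueError:
    pass

for span in recorded_spans:
    print(span["name"], span["error"], f"{span['duration_s']:.6f}s")
```

The wrapped functions keep their normal behavior: return values and exceptions pass through unchanged, and the span is recorded either way.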

# More Resources and Help with APM

There is a Datadog community [APM Slack channel]. There's also an [earlier version of this workshop](https://github.com/burningion/dash-apm-workshop) that goes into a bit more detail on the structure of traces.

For more information on the structure of individual traces, read the [Dapper paper](https://ai.google/research/pubs/pub36356) from Google.

For more specific instrumentation of a single application, check out Andrew McBurney's article on [instrumenting Homebrew](https://www.datadoghq.com/blog/engineering/using-datadog-apm-to-find-bottlenecks-and-performance-benchmarking/).

Finally, feel free to reach out to me on [Twitter](https://twitter.com/burningion).