# Understanding the Problem

## 1. "It Doesn't Work"

As we called out, the first step to solving a problem is getting enough information so that we can understand the current state of things. To do this, we'll need to know what the actual issue we're solving is. This starts when we first come across the issue, which can be through a report in a ticketing system or by encountering the problem ourselves.

When working with users, it's pretty common to receive reports of failures that just boil down to, "It doesn't work." These reports usually don't include a lot of useful information, but it's still important that the problem gets reported and solved. Which information is useful or not might depend on the problem. But there are some common questions that we can ask a user who simply reports that something doesn't work.

*What were you trying to do? What steps did you follow? What was the expected result? What was the actual result?* If the ticketing system your company uses allows this, it's a good idea to include these questions in the form that users have to fill out when reporting an issue. This way we save time and can start asking more specific questions right away. Otherwise, these are almost always going to be the first questions you ask. Another thing to keep in mind is that when debugging a problem, we want to consider the simplest explanations first and avoid jumping into complex or time-consuming solutions unless we really have to. That's why when a device doesn't turn on, we first check if it's correctly plugged in and that there's electricity coming from the plug before taking it apart or replacing it with a new device. 

Say you got a call from a user that tells you the internal website used by the sales team to track customer interactions doesn't work. The user is super stressed because they need to access the information on the website for a meeting happening in a few minutes. So you tell them that you'll look into the problem right away, but then you need more information. *What were they trying to do?* The user tells you that they're trying to access the website. *What steps did they follow?* They tell you that they opened the website URL and entered their credentials. *What was the expected result?* They expected to see the sales system's landing page. *What did they get instead?* The web page just keeps loading. It stays blank forever. 

Okay. So now you've gone from, "it doesn't work," to, "when I tried to log in, the page keeps loading and never shows the landing page." That's great. Now that you have a basic idea of what the problem is, it's time to start figuring out the root cause. For that, you'll apply a process of elimination, starting with the simplest explanations first and testing those until you can isolate the root cause. 
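One way to picture this step is as filling in a small form with the four questions. The sketch below is only an illustration: the `ProblemReport` class and its field names are made up for this example and don't come from any particular ticketing system.

```python
from dataclasses import dataclass


@dataclass
class ProblemReport:
    """Answers to the four basic questions (hypothetical field names)."""
    goal: str       # What were you trying to do?
    steps: str      # What steps did you follow?
    expected: str   # What was the expected result?
    actual: str     # What was the actual result?

    def missing(self) -> list[str]:
        """Return the names of the questions that are still unanswered."""
        return [name for name, value in vars(self).items() if not value.strip()]

    def summary(self) -> str:
        """Turn the answers into a one-line problem statement."""
        return (f"While trying to {self.goal}, after {self.steps}, "
                f"expected {self.expected} but got: {self.actual}")


# The answers the user gave in the sales website example.
report = ProblemReport(
    goal="access the sales website",
    steps="opening the URL and entering credentials",
    expected="the landing page",
    actual="the page keeps loading and stays blank",
)
print(report.missing())   # an empty list means all four questions are answered
print(report.summary())
```

If `missing()` returns anything, those are the questions to ask before you start debugging.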

For example, you check if you can reproduce the issue on your own computer. So you navigate to the website, enter your credentials, and sure enough, the page just keeps loading, never showing the landing page. This is enough information that you can tell the user that you'll work on it and investigate on your own. There's no need to keep them on the line. 

By reproducing the problem on your computer, **you've taken a simple and quick action that rules out the user or the user's computer as the cause of the problem.** This cuts the troubleshooting process in half, since you now know there's a problem with the service and you can focus on solving that. Before jumping into the server that's hosting the application, you run a few quick checks to verify if the problem is isolated to that specific website or not. You check if your Internet access is working by accessing an external website, which loads just fine. Then you check if other internal websites, like the inventory website or the ticketing system, are working okay. Doing this, you discover that while the ticketing system loads with no issues, the inventory website never finishes loading. It turns out that the sales website and the inventory website are hosted on the same server.

Again, it's important to highlight that doing these quick checks, to verify that the Internet works correctly and to see which sites are affected by the problem, helps you isolate the root cause. By looking at possible simple explanations first, you avoid losing time chasing the wrong problem. At this point, you know that the websites running on one specific server are failing to load, while the rest of the systems and the Internet are working correctly.
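These quick checks can also be scripted. Here's a minimal sketch, assuming the sites can be probed with a plain HTTP request and a timeout. The hostnames are hypothetical placeholders, and the demo uses simulated results from the story so that it runs without a network.

```python
import urllib.request
from typing import Callable, Dict, Iterable, List


def fetch_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def check_sites(urls: Iterable[str],
                fetch: Callable[[str], bool] = fetch_ok) -> Dict[str, bool]:
    """Probe every URL and record whether it loaded."""
    return {url: fetch(url) for url in urls}


def failing(results: Dict[str, bool]) -> List[str]:
    """List the URLs that did not load, in alphabetical order."""
    return sorted(url for url, ok in results.items() if not ok)


# Simulated outcome matching the story (hypothetical hostnames).
simulated = {
    "https://example.com": True,                  # external site: fine
    "https://tickets.internal.example": True,     # ticketing system: fine
    "https://sales.internal.example": False,      # sales site: never loads
    "https://inventory.internal.example": False,  # inventory site: never loads
}
results = check_sites(simulated, fetch=simulated.get)
print(failing(results))
```

To probe real sites, call `check_sites` with your own list of URLs and leave out the `fetch` argument. A timeout matters here because the failure in the story is a page that never finishes loading.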

Next up, you need to check what's going on with that server. The server running the websites is a Linux machine, so you'll connect to it using SSH. You run the `top` command, which shows the state of the computer and the processes using the most CPU, and see that the computer is super overloaded. The load average in the first line says 40. The **load average** on Linux *shows, roughly, how many processes were on average running, waiting for a processor, or blocked on I/O over a recent period (`top` reports it for the last 1, 5, and 15 minutes), with one meaning a single processor was busy for the whole period.* So normally this number shouldn't be above the number of processors in the computer. A number higher than the number of processors means the computer is overloaded.

You know this computer has four cores, so 40 is a really high number. You also see that most of the CPU time is spent waiting for I/O. This means that processes are stuck waiting for the operating system to return from system calls. This usually happens when processes get stuck gathering data from the hard drive or the network.
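The comparison between load average and core count is simple arithmetic. The sketch below shows it with the numbers from the story, and then with live values read through Python's `os.getloadavg()` and `os.cpu_count()`. It assumes a Unix-like system, and the helper names are made up for this example.

```python
import os


def load_per_core(load_average: float, cores: int) -> float:
    """How many processes, on average, were competing for each core."""
    return load_average / cores


def is_overloaded(load_average: float, cores: int) -> bool:
    """Rule of thumb from the text: load above the core count means overload."""
    return load_average > cores


# Numbers from the story: a load average of 40 on a four-core server.
print(load_per_core(40.0, 4))   # 10.0
print(is_overloaded(40.0, 4))   # True

if __name__ == "__main__":
    # Live values for the machine this runs on (Unix-like systems only).
    one_minute, five_minutes, fifteen_minutes = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"1-minute load {one_minute:.2f} on {cores} cores, "
          f"overloaded: {is_overloaded(one_minute, cores)}")
```

A ratio of 10 per core is what makes 40 "a really high number" here. The same load on a 64-core machine wouldn't be a concern.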

By looking at the list of processes, you realize that the backup system is currently running on the server, and it seems to be using a lot of processing time. Backing up the data on the system is super important. But currently, the whole system is unusable. So you decide to stop the backup system by calling `kill -STOP` on its process. This will suspend the execution of the program until you let it continue or decide to terminate it. After doing this, you run `top` once again and you see that the load is going down, and processes are no longer stuck waiting for I/O. Then you try logging into the website, and this time the landing page loads.
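Suspending a process sends it the `SIGSTOP` signal, and `SIGCONT` lets it continue later. Here's a small sketch of the same idea in Python, assuming a Unix-like system. A sleeping child process stands in for the backup job, and the function names are made up for this example.

```python
import os
import signal
import subprocess
import sys


def suspend(pid: int) -> None:
    """Pause a process without ending it, like `kill -STOP <pid>`."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Let a suspended process continue, like `kill -CONT <pid>`."""
    os.kill(pid, signal.SIGCONT)


def demo() -> int:
    """Suspend, resume, and finally terminate a stand-in process."""
    # Stand-in for the backup job: a child process that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    suspend(child.pid)
    # A suspended process is still alive; it just isn't being scheduled.
    assert child.poll() is None
    resume(child.pid)
    child.terminate()      # clean up the stand-in process
    child.wait()
    return child.returncode


if __name__ == "__main__":
    print(demo())
```

Suspending instead of killing is a reasonable choice for immediate remediation because the backup can pick up where it left off once the server has recovered.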

Success. You let the user know that they can use the website once again. At this point, you've applied the immediate remediation. We'll talk about long-term remediation in a later video. 

Before moving on to the next topic, imagine that the following week another user calls you and tells you the sales website doesn't work. Remembering the previous incident, you tell them you'll fix it right away. You SSH onto the server and try to find the backup process to stop it, but it's not running. Oops. You forgot to ask the user what they meant when they said it didn't work. When you call back to ask them, they tell you that they're trying to generate a monthly sales report and they get an error saying the product category column doesn't exist. Totally different problem, totally different actions to take. So remember to always have a clear picture of what the problem is before you start solving it. Up next, we'll talk about what reproduction cases are and how to come up with them.

## 2. Creating a Reproduction Case

When we're dealing with an issue that's tricky to debug, we want to have a clear reproduction case for the problem. **A reproduction case** *is a way to verify if the problem is present or not.* We want to make the reproduction case as simple as possible. That way, we can clearly understand when it happens, and it makes it really easy to check if the problem is fixed or not, when we try to solve it. 
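When the problem can be triggered from the command line, the reproduction case can be a tiny script that answers one question: is the problem present or not? The sketch below assumes the failure shows up as a non-zero exit code or a hang. The function name is made up, and the two commands are stand-ins for a real application.

```python
import subprocess
import sys
from typing import Sequence


def problem_present(command: Sequence[str], timeout: float = 10.0) -> bool:
    """Return True if the command fails or hangs, meaning the problem reproduces."""
    try:
        completed = subprocess.run(command, timeout=timeout, capture_output=True)
    except subprocess.TimeoutExpired:
        return True   # never finished: treat a hang as a failure
    return completed.returncode != 0


# Stand-ins for a real application: one command that fails, one that works.
broken = [sys.executable, "-c", "raise SystemExit(1)"]
working = [sys.executable, "-c", "pass"]

print(problem_present(broken))    # True
print(problem_present(working))   # False
```

Running the same check before and after a change makes it easy to tell whether that change fixed the problem.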

Sometimes, the reproduction case is pretty obvious. In our example where the program failed to start because of a missing directory, the reproduction case was to open the program without that directory on the computer. In our overloaded server example, the reproduction case for the failure was to try to log in to the website and see that the page never finishes loading. But sometimes the reproduction case might be much more complex to discover.

Imagine you're trying to help a user with an application that won't start. This time when you run the same version of the application on your computer, the application starts just fine. So you suspect that the problem has to do with something in the user's environment or configuration. 

There could be a bunch of reasons why this could happen. It could be problems with the network routing, old config files interfering with a new version of the program, a permissions problem blocking the user from accessing some required resource, or even some faulty piece of hardware acting up. So how can you figure out what's causing the problem?

**The first step is to read the logs available to you.** Which logs to read will depend on the operating system and the application that you're trying to debug. On Linux, you'd read system logs like `/var/log/syslog` and user-specific logs like the `.xsession-errors` file located in the user's home directory. On macOS, on top of the system logs, you'd go through the logs stored in the `/Library/Logs` directory. On Windows, you'd use the **Event Viewer tool** to go through the event logs.

No matter the operating system, remember to look at the logs when something isn't behaving as it should. Lots of times, you'll find an error message that will help you understand what's going on like, **unable to reach server, invalid file format, or permission denied.** 
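Scanning a long log for messages like these is easy to script. Here's a minimal sketch that filters lines for the three example messages. The sample log lines are invented for illustration and aren't real output from any application.

```python
import re
from typing import Iterable, List

# The example messages from the text, matched case-insensitively.
ERROR_PATTERN = re.compile(
    r"unable to reach server|invalid file format|permission denied",
    re.IGNORECASE,
)


def find_error_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that contain one of the known error messages."""
    return [line.rstrip("\n") for line in lines if ERROR_PATTERN.search(line)]


# Invented sample lines, standing in for a real log file.
sample_log = [
    "application starting\n",
    "loading configuration\n",
    "Permission denied: cannot open data file\n",
    "application exiting\n",
]
print(find_error_lines(sample_log))

# With a real file, you'd pass the open file object instead, for example:
#     with open("/var/log/syslog", errors="replace") as log:
#         print(find_error_lines(log))
```

The pattern list is just a starting point. You'd extend it with whatever messages the application you're debugging is known to produce.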

But what if you're unlucky, and there's no error message, or the error message is super unhelpful like **internal system error**? The next step is to try to isolate the conditions that trigger the issue. Do other users in the same office also experience the problem? Does the same thing happen if the same user logs into a different computer? Does the problem happen if the application's config directory is moved away?

Let's say that it's the config directory's fault. You ask the user to move it away without deleting it, and now the application starts correctly. So you ask the user to send you the contents of that directory. You copy them onto your computer, and the program fails to start. Bingo, you've got your reproduction case. It's starting the program with that config in place.
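Moving the config directory away without deleting it can be done with a rename, so it's easy to put back. The sketch below assumes a `.bak` suffix for the moved copy and uses a hypothetical application name. It runs inside a temporary directory so it doesn't touch any real configuration.

```python
import shutil
import tempfile
from pathlib import Path


def move_aside(config_dir: Path) -> Path:
    """Rename the directory to <name>.bak so that nothing is deleted."""
    backup = config_dir.with_name(config_dir.name + ".bak")
    if backup.exists():
        raise FileExistsError(f"{backup} already exists; pick another name")
    shutil.move(str(config_dir), str(backup))
    return backup


def restore(backup: Path) -> Path:
    """Put a directory moved by move_aside() back in its original place."""
    original = backup.with_name(backup.name[: -len(".bak")])
    shutil.move(str(backup), str(original))
    return original


# Demo in a throwaway directory with a hypothetical application name.
with tempfile.TemporaryDirectory() as fake_home:
    config = Path(fake_home) / ".config" / "exampleapp"
    config.mkdir(parents=True)
    (config / "settings.ini").write_text("[main]\n")

    backup = move_aside(config)
    print(config.exists(), backup.exists())   # False True

    restore(backup)
    print(config.exists())                    # True
```

Keeping the original contents matters twice over: the user can get their settings back, and you can copy that directory to your own machine to reproduce the failure.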

Having a clear reproduction case lets you investigate the issue and quickly see what changes it. For example, does the problem go away if you revert the application to the previous version? Are there any differences in the strace logs or the ltrace logs when running the application with the bad config and without it? On top of that, having a clear reproduction case lets you share it with others when asking for help. As long as you aren't sharing any confidential information, of course.
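Raw traces are noisy, so one way to compare two runs is to pull out just the files each run tried to open and look at the difference. This is a rough sketch, assuming the traces were saved with something like `strace -o` and contain `open` or `openat` calls. The sample lines are invented to show the shape of such calls, and the paths are hypothetical.

```python
import re
from typing import Iterable, List, Set

# Matches the first quoted path in an open(...) or openat(...) call.
OPEN_CALL = re.compile(r'open(?:at)?\(.*?"([^"]+)"')


def opened_paths(trace_lines: Iterable[str]) -> Set[str]:
    """Collect every path that appears in an open/openat call."""
    paths = set()
    for line in trace_lines:
        match = OPEN_CALL.search(line)
        if match:
            paths.add(match.group(1))
    return paths


def only_in_first(first: Iterable[str], second: Iterable[str]) -> List[str]:
    """Paths opened in the first trace but not in the second."""
    return sorted(opened_paths(first) - opened_paths(second))


# Invented sample traces: one run with the bad config, one without it.
with_bad_config = [
    'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    'openat(AT_FDCWD, "/home/user/.config/exampleapp/settings.ini", O_RDONLY) = 4',
    'openat(AT_FDCWD, "/home/user/.config/exampleapp/plugins/old.so", O_RDONLY) = 5',
]
without_config = [
    'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    'openat(AT_FDCWD, "/home/user/.config/exampleapp/settings.ini", O_RDONLY) = -1 ENOENT (No such file or directory)',
]
print(only_in_first(with_bad_config, without_config))
```

Here the only path unique to the failing run is the old plugin file, which would be a good place to look next.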

You could use it to report a bug to the application's developers, to ask for help from a colleague, or even to ask for help on an Internet forum about the application if it's publicly available. So when trying to create a reproduction case, we want to find the actions that reproduce the issue, and we want these to be as simple as possible. The smaller the change in the environment and the shorter the list of steps to follow, the better. To get there, we might need to dig deeper into the problem until we have a small enough set of instructions. Once you have a reproduction case, you're ready to move on to the next step, finding the root cause. We'll talk about that in our next video.