
Introduction

We all love AppScale, but like all software, it occasionally has problems. This page outlines what to do when you run into a problem with AppScale, how to debug it, and how to fix it. Of course, you can always ask us for help on IRC (#appscale on freenode.net). Let's start with some common problems we've seen people run into and how to get past them, then look at what to do when the going gets tough.

Most common problem: Ran out of memory?

AppScale runs many processes, each of which takes up memory. If there is not enough memory, the OOM killer will come along and start killing processes, and AppScale will start acting very strangely. If AppScale is not working correctly, make sure you didn't run out of memory: check '/var/log/kern.log' and '/var/log/syslog' on your AppScale nodes.

$ tail /var/log/kern.log
Feb 14 00:10:54 appscale-image0 kernel: [203916.804124] Out of memory: Kill process 28026 (python) score 182 or sacrifice child
Feb 14 00:10:54 appscale-image0 kernel: [203916.804320] Killed process 28036 (python) total-vm:810672kB, anon-rss:550012kB, file-rss:0kB
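
To see how much memory is currently free on a node, and which processes are using the most, a quick check with standard Linux tools (nothing AppScale-specific) is:

# how much memory is free right now
$ free -m
# the processes using the most memory, largest first
$ ps aux --sort=-rss | head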

Another common problem: Ran out of disk space?

root@appscale-image0:/# df
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/xvda1       8256952 7784864     52660 100% /
udev             3806468      12   3806456   1% /dev
tmpfs            1525896     212   1525684   1% /run
none                5120       0      5120   0% /run/lock
none             3814732     160   3814572   1% /run/shm
/dev/xvdb       30956028  176196  29207352   1% /mnt

You have some options here to clear up disk space:

  1. Free up ZooKeeper usage (answered in FAQ)

  2. Run the AppScale groomer to do disk garbage collection

  3. Remove logs or do log rolling in /var/log/appscale/

  4. Run the Cassandra nodetool repair/cleanup (see the sketch after this list)
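
As a rough sketch of options 3 and 4, something like the following can reclaim space (the rotated-log pattern is hypothetical and the nodetool path is the one used later on this page; check both before running anything destructive):

# see how much space the AppScale logs are taking (option 3)
$ du -sh /var/log/appscale
# remove rotated logs, assuming the usual *.log.1 style naming -- verify the pattern first
$ rm /var/log/appscale/*.log.[0-9]*
# reclaim space held by Cassandra after deletes or topology changes (option 4)
$ /root/appscale/AppDB/cassandra/cassandra/bin/nodetool cleanup
$ /root/appscale/AppDB/cassandra/cassandra/bin/nodetool repair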

Are all the processes running?

AppScale uses monit to monitor all the processes on the node. If monit itself gets killed off, it will no longer restart downed processes. Is monit running? Did it get killed off for some reason?

$ ps aux | grep monit
root     25043  0.1  0.0 103532  2796 ?        Sl   Feb12   3:32 /usr/local/bin/monit
root     28906  0.0  0.0   9396   900 pts/0    S+   14:57   0:00 grep --color=auto monit

If you do "appscale status" do you get "[Errno 111] Connection refused"? If so, that generally means that the AppController is no longer running. This could be a bug in the AppController, check the logs. Most commonly, its because it was killed off by the OOM killer.

To bring the processes back up, just restart monit.

$ service monit start
Starting monit daemon with http interface at [*:2812]

If monit is already running, run the following to see the processes it monitors and their status:

$ monit summary
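
If a single service shows up as not running in the summary, you can ask monit to start it by name (the service name below is just the example app from the monit log further down this page; use a name from your own summary output):

$ monit start app___memhungry-20004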

My app keeps getting killed off

Monit makes sure that apps don't go over a certain memory limit. You can set this limit in your AppScalefile like so:

max_memory: 600

The number is in megabytes (MB). Check the monit log at /var/log/appscale/monit.log to see whether monit is indeed restarting your app over and over again:

[EST Feb 12 22:40:07] error    : 'app___memhungry-20004' total mem amount of 882676kB matches resource limit [total mem amount>512001kB]
[EST Feb 12 22:40:07] info     : 'app___memhungry-20004' trying to restart
[EST Feb 12 22:40:07] info     : 'app___memhungry-20004' stop: /usr/bin/python
[EST Feb 12 22:40:08] info     : 'app___memhungry-20004' start: /bin/bash
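
For example, to see just the entries for the app from the excerpt above, you can grep the monit log (substitute your own app id):

$ grep 'memhungry' /var/log/appscale/monit.log | tail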

Don't set your max_memory too high; otherwise, you may run out of memory on the nodes hosting your applications.

Are Cassandra and ZooKeeper running?

These two are critical services for data storage. You can find their logs in /var/log/zookeeper and /var/log/cassandra. Monit monitors these services and restarts them if they are down. Sometimes you run into bugs in Cassandra or ZooKeeper that need manual intervention (monit will keep trying to start them, but they fail on restart repeatedly).
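
A quick way to see whether monit thinks both services are up, and to peek at their most recent log output (the exact log file names can vary by version, so list the log directories first if these don't exist):

$ monit summary | grep -iE 'cassandra|zookeeper'
$ tail -n 20 /var/log/cassandra/system.log
$ tail -n 20 /var/log/zookeeper/zookeeper.log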

Is Cassandra or ZooKeeper acting slow?

If you suspect or observe slow response times from the Datastore, one or more of the database nodes might be running compactions. Run the following to see more details:

/root/appscale/AppDB/cassandra/cassandra/bin/nodetool compactionstats

You can also run a stress test on a particular database node to determine latency:

cd /root/appscale/AppDB/cassandra/; python stress.py

Also make sure to test the ZooKeeper node for disk I/O latency by running:

echo stat | nc 127.0.0.1 2181
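
Another quick health probe is ZooKeeper's standard 'ruok' four-letter command; a healthy server answers 'imok':

$ echo ruok | nc 127.0.0.1 2181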

Users API

If you have problems logging in or using the Users API, try going to your datastore nodes and killing the user/apps SOAP server (monit will restart it). It has been known to get stuck silently.

ps aux | grep soap_server.py | grep -v grep | awk '{print $2}' | xargs kill -9
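
A minute or so later, monit should have brought the SOAP server back; you can confirm with the same ps pipeline:

$ ps aux | grep soap_server.py | grep -v grep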

AppScale didn't come up successfully

If you ran "appscale up" to start AppScale and it didn't start, it could have failed for any of the following reasons:

  • (VirtualBox) AppScale hung at "Please wait for AppScale to start your machines."
  • (EC2) You're using Spot Instances but AppScale is hung at "Waiting for machines to become available."
  • (Eucalyptus) AppScale hung at "Waiting for machines to become available."

Let's look at each of these individually.

AppScale on VirtualBox

When running AppScale on VirtualBox, we've seen problems when VirtualBox 4.1.X is used. Specifically, the AppScale Tools will start up the AppController on port 17443 and then hang at "Please wait for AppScale to start your machines." In this case, the AppScale Tools are waiting for port 17443 to open on the VM but can't actually reach the VM, even though the VM has that port open. Upgrade to VirtualBox 4.2 or newer to fix the problem.

Another common problem is that the wrong IP is given in your AppScalefile. AppScale will complain that the node could not be found in the nodelist. Please check your IPs and try again from a clean state (run 'appscale clean').

AppScale on EC2

If you're using Spot Instances (you've set "use_spot_instances : True" in your AppScalefile), there is a possibility that Amazon won't have any spare machines available at the price and instance type you requested. Typically it takes us about 5 minutes to get a Spot Instance, so if it takes you substantially longer than that (say, 10 minutes), then you can log into the AWS Dashboard, click on EC2, and then click on Spot Instances. There, you can see why your machines aren't available. You can cancel your Spot Instance Request and try again with a higher price or a different instance type, depending on the message the dashboard reports.
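
For reference, a Spot Instance deployment is driven from the AppScalefile; a minimal fragment might look like the following (only use_spot_instances is quoted from this page; the other keys and values are typical EC2 settings and may differ for your tools version):

infrastructure : ec2
machine : ami-xxxxxxxx
instance_type : m3.large
use_spot_instances : True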

AppScale on Eucalyptus

When running on Eucalyptus, if there are no virtual machines available, AppScale won't be able to start up. For example, if you tell AppScale to run over 8 machines, and you only have 6 available, then that won't work! In this case, you'll see a message from the tools saying "Spawning 7 virtual machines" (since we spawn one machine and delegate the responsibility of starting up the other 7 to it), and the tools will eventually crash, since the AppController won't be able to get the remaining 7 machines. In this case, the solution is simple - make sure you have enough virtual machines available before you start AppScale! In Eucalyptus, an administrator can find out how many virtual machines are free by running "euca-describe-availability-zones verbose".
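
For example, run the following on the Eucalyptus front end (this is the command quoted in the paragraph above; the output format depends on your Eucalyptus version):

$ euca-describe-availability-zones verbose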

Forcefully cleaning up AppScale state

If, for some reason, running "appscale down" isn't able to terminate your AppScale deployment, you can bring your VMs back to a pristine state by running:

appscale clean

This command forcefully kills all of the AppScale-related processes.

Debugging AppScale Deployments

So you've run into a problem we don't normally see: how do you find out what's going on? For this case, we have a special command you can run. On the machine where you have the AppScale Tools installed, run "appscale logs ~/Desktop/baz", and this will copy all of the logs from each machine in your AppScale deployment to ~/Desktop/baz (of course, change that path if you want your logs copied somewhere else). If this doesn't work for some reason, you can always use "scp" to copy the contents of the "/var/log/appscale" directory on each machine.
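
If you fall back to scp, something along these lines works; the node IP and the per-node destination directory here are placeholders:

$ mkdir -p ~/Desktop/baz/192.168.1.2
$ scp -r root@192.168.1.2:/var/log/appscale ~/Desktop/baz/192.168.1.2/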

Logs you will find interesting include:

  • controller-17443.log: The most interesting log! This log belongs to the AppController, our provisioning daemon. Since it sets up every other service in AppScale, this log can show exceptions if Cassandra couldn't be started, if the autoscaling algorithm ran into problems, and so on. This is the first place you want to look if you're having problems with AppScale. You'll find one of these on each machine in an AppScale deployment, since this service runs on all machines.
  • app___app_id-*.log: These logs correspond to Google App Engine apps that AppScale is hosting. You'll want to check these out if you're running into problems with your App Engine apps, like if you want to include special libraries that App Engine doesn't normally support or are debugging your application at high load. You'll find one of these for each App Server process that runs on each machine running the "App Engine" role (see which machines are running this service by running "appscale status").
  • datastore_server-400*.log: The logs from the implementation of the AppScale datastore.
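
Once you have the logs, a quick way to find problems is to grep them for exceptions and errors; this assumes you copied them to ~/Desktop/baz as in the example above:

$ grep -ri 'exception' ~/Desktop/baz --include='controller*.log'
$ grep -ri 'error' ~/Desktop/baz --include='app___*.log'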