FSP/CONSOLE: Workaround for unresponsive ipmi daemon
We use a TCE-mapped area to write data to the console. The console
header (fsp_serbuf_hdr) is modified by both FSP and OPAL: OPAL updates
the next_in pointer in fsp_serbuf_hdr and FSP updates the next_out
pointer.

The kernel makes the opal_console_write() OPAL call to write data to
the console. OPAL writes the data to the TCE-mapped area and sends an
MBOX command to FSP. If the console buffer is full and we still have
data to write, we keep waiting until FSP reads data.
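
The split ownership of the two pointers can be illustrated with a small ring buffer. This is only a sketch under stated assumptions: struct demo_serbuf, DEMO_BUF_SIZE and the demo_* helpers are hypothetical names, and the layout is not the real fsp_serbuf_hdr. It shows the producer advancing only next_in, the consumer advancing only next_out, and a write returning 0 once the buffer is full.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BUF_SIZE 8u /* hypothetical size; one slot is always kept free */

/* Hypothetical stand-in for the shared console buffer.
 * This is NOT the real fsp_serbuf_hdr layout. */
struct demo_serbuf {
	uint16_t next_in;  /* advanced only by the producer (OPAL) */
	uint16_t next_out; /* advanced only by the consumer (FSP) */
	char data[DEMO_BUF_SIZE];
};

/* Free bytes currently available to the producer. */
static size_t demo_space(const struct demo_serbuf *sb)
{
	return (sb->next_out + DEMO_BUF_SIZE - sb->next_in - 1) % DEMO_BUF_SIZE;
}

/* Producer side: copy as much as fits; returns bytes written (0 when full). */
static size_t demo_write(struct demo_serbuf *sb, const char *src, size_t len)
{
	size_t space = demo_space(sb);
	size_t n = len < space ? len : space;
	size_t i;

	for (i = 0; i < n; i++) {
		sb->data[sb->next_in] = src[i];
		sb->next_in = (uint16_t)((sb->next_in + 1) % DEMO_BUF_SIZE);
	}
	assert(sb->next_in < DEMO_BUF_SIZE);
	return n;
}

/* Consumer side: drain up to len bytes by advancing next_out only. */
static size_t demo_consume(struct demo_serbuf *sb, size_t len)
{
	size_t used = DEMO_BUF_SIZE - 1 - demo_space(sb);
	size_t n = len < used ? len : used;

	sb->next_out = (uint16_t)((sb->next_out + n) % DEMO_BUF_SIZE);
	return n;
}
```

In this sketch, if demo_consume() is never called, every demo_write() after the buffer fills returns 0, which is the situation the rest of this message describes.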

In some corner cases, FSP is active but not responding to console
MBOX messages (due to a buggy IPMI daemon) while the kernel is writing
heavily to the console, so the console buffer eventually becomes full.
At this point OPAL starts returning OPAL_BUSY_EVENT to the kernel, and
the kernel keeps retrying. This causes kernel soft lockups. In extreme
cases, when every CPU is trying to write to the console, the user
cannot ssh in and thinks the system has hung.
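
To see why retrying on a busy return never finishes when the consumer is stuck, consider the following simulation. It is a sketch, not kernel or OPAL code: enum demo_rc, demo_console_write() and demo_retry_write() are hypothetical, the numeric return values are placeholders rather than the real OPAL codes, and the max_spins cap exists only so the example terminates.

```c
#include <assert.h>

/* Placeholder return codes for this sketch; the real values are
 * defined by the OPAL API headers. */
enum demo_rc { DEMO_SUCCESS = 0, DEMO_BUSY_EVENT = -1 };

/* Hypothetical firmware call: succeeds only if the buffer has space. */
static enum demo_rc demo_console_write(int buffer_has_space)
{
	return buffer_has_space ? DEMO_SUCCESS : DEMO_BUSY_EVENT;
}

/*
 * Hypothetical caller that retries for as long as the call reports busy.
 * fsp_drains_after is the number of failed attempts before the consumer
 * finally frees space; max_spins bounds the loop for the simulation.
 * Returns the number of busy retries performed.
 */
static unsigned demo_retry_write(unsigned fsp_drains_after, unsigned max_spins)
{
	unsigned spins = 0;

	while (spins < max_spins) {
		int has_space = spins >= fsp_drains_after;

		if (demo_console_write(has_space) == DEMO_SUCCESS)
			break;
		spins++;
	}
	assert(spins <= max_spins);
	return spins;
}
```

With a responsive consumer the loop ends after a few retries. When the consumer never drains, the loop only stops because of the artificial cap, which a real retry loop would not have.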

If we reset FSP or restart the IPMI daemon on FSP, the system recovers
and everything returns to normal.

This patch works around the issue by returning OPAL_HARDWARE when the
console is full. The side effect is that we may end up dropping the
latest console data, but dropping console data is better than hanging
the system.
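
The effect of the workaround on a caller can be sketched as follows. This is an illustration under assumptions, not the skiboot implementation: struct wa_console, wa_console_write() and the WA_* codes are hypothetical, and the dropped counter is added only to make the data loss visible.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder return codes for this sketch, not the real OPAL values. */
enum wa_rc { WA_SUCCESS = 0, WA_HARDWARE = -1 };

/* Hypothetical console state. */
struct wa_console {
	size_t space;   /* free bytes left in the buffer */
	size_t dropped; /* bytes discarded because the buffer was full */
};

/*
 * Write up to *length bytes. On return *length holds the bytes accepted.
 * As in the patch, zero bytes written is reported as a hard error so
 * the caller gives up instead of retrying.
 */
static enum wa_rc wa_console_write(struct wa_console *con, size_t *length)
{
	size_t written;

	assert(con && length);
	written = *length < con->space ? *length : con->space;
	con->space -= written;

	if (written) {
		*length = written;
		return WA_SUCCESS;
	}

	/* Buffer is full: drop the new data rather than ask for a retry. */
	con->dropped += *length;
	*length = 0;
	return WA_HARDWARE;
}
```

A partial write still counts as success here. Only a write that makes no progress at all is turned into an error and counted as dropped.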

An alternative approach is to drop old data from the console buffer to
make space for new data. But under normal conditions only FSP can
update the 'next_out' pointer, and if we touch that pointer we may
introduce other race conditions. Hence we decided to just drop the new
console write request.

Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Acked-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Vasant Hegde authored and stewartsmith committed Jun 14, 2017
1 parent 4cef4d8 commit c8a7535
Showing 2 changed files with 20 additions and 1 deletion.
18 changes: 17 additions & 1 deletion hw/fsp/fsp-console.c
@@ -26,6 +26,11 @@
 #include <timebase.h>
 #include <device.h>
 #include <fsp-sysparam.h>
+#include <errorlog.h>
+
+DEFINE_LOG_ENTRY(OPAL_RC_CONSOLE_HANG, OPAL_PLATFORM_ERR_EVT, OPAL_CONSOLE,
+		 OPAL_PLATFORM_FIRMWARE,
+		 OPAL_PREDICTIVE_ERR_GENERAL, OPAL_NA);
 
 struct fsp_serbuf_hdr {
 	u16 partition_id;
@@ -611,7 +616,18 @@ static int64_t fsp_console_write(int64_t term_number, int64_t *length,
 	*length = written;
 	unlock(&fsp_con_lock);
 
-	return written ? OPAL_SUCCESS : OPAL_BUSY_EVENT;
+	if (written)
+		return OPAL_SUCCESS;
+
+	/*
+	 * FSP is still active but not reading console data. Hence
+	 * our console buffer became full. Most likely IPMI daemon
+	 * on FSP is buggy. Lets log error and return OPAL_HARDWARE
+	 * to payload (Linux).
+	 */
+	log_simple_error(&e_info(OPAL_RC_CONSOLE_HANG), "FSPCON: Console "
+			 "buffer is full, dropping console data\n");
+	return OPAL_HARDWARE;
 }
 
 static int64_t fsp_console_write_buffer_space(int64_t term_number,
3 changes: 3 additions & 0 deletions include/errorlog.h
@@ -332,6 +332,9 @@ enum opal_reasoncode {
 
 	/* Platform error */
 	OPAL_RC_ABNORMAL_REBOOT = OPAL_SRC_COMPONENT_CEC | 0x10,
+
+	/* FSP console */
+	OPAL_RC_CONSOLE_HANG = OPAL_SRC_COMPONENT_CONSOLE | 0x10,
 };
 
 #define DEFINE_LOG_ENTRY(reason, type, id, subsys, \
