## Overall descriptive feature statistics

These values are reported before transformations.

In [None]:
# feature descriptives table
df_desc = pd.read_csv(join(output_dir, '{}_feature_descriptives.csv'.format(experiment_id)), index_col=0)
HTML(df_desc.to_html(classes=['sortable'], float_format=float_format_func))

The table in the "Feature value distribution" section below shows additional statistics for the data. Quantiles are computed using the type=3 method used in SAS. Mild outliers are data points that lie [1.5, 3) * IQR away from the nearest quartile. Extreme outliers are data points that lie >= 3 * IQR away from the nearest quartile.
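
The outlier definitions above can be illustrated with a minimal sketch. The function name `classify_outliers` and the sample values are hypothetical, and the quartiles here use NumPy's default linear interpolation rather than the SAS type=3 method, so counts may differ slightly from the reported table.

```python
import numpy as np


def classify_outliers(values):
    """Flag mild and extreme outliers by distance from the nearest quartile.

    Assumption: quartiles use NumPy's default (linear) interpolation,
    not the SAS type=3 method mentioned in the text.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1

    # distance below Q1 or above Q3; zero for points inside the box
    distance = np.maximum(np.maximum(q1 - values, values - q3), 0)

    # mild: [1.5, 3) * IQR away; extreme: >= 3 * IQR away
    mild = (distance >= 1.5 * iqr) & (distance < 3 * iqr)
    # the distance > 0 guard avoids flagging everything when IQR is 0
    extreme = (distance >= 3 * iqr) & (distance > 0)
    return mild, extreme


# hypothetical feature values with two unusually large cases
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 40]
mild, extreme = classify_outliers(sample)
print(int(mild.sum()), int(extreme.sum()))
```

For this sample the quartiles are 3.5 and 8.5 (IQR = 5), so the value 20 falls in the mild band and 40 in the extreme band.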

### Prevalence of recoded cases

This section shows the number and percentage of cases truncated to mean +/- 4 SD for each feature.
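
As a rough illustration of what these percentages measure, the sketch below truncates a single feature to mean +/- 4 SD and reports the share of cases affected on each side. The function name `truncation_rates`, the use of the sample standard deviation (`ddof=1`), and computing the mean and SD from the same data being truncated are all assumptions made for illustration; the pipeline that produced the CSV file may differ.

```python
import numpy as np


def truncation_rates(values, n_sd=4):
    """Clip values to mean +/- n_sd * SD and report % of cases clipped.

    Assumptions: mean and SD come from the same sample, and SD is the
    sample standard deviation (ddof=1).
    """
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sd = values.std(ddof=1)
    lower, upper = mean - n_sd * sd, mean + n_sd * sd

    truncated = np.clip(values, lower, upper)

    n_cases = len(values)
    perc_lower = 100.0 * np.sum(values < lower) / n_cases
    perc_upper = 100.0 * np.sum(values > upper) / n_cases
    rates = {'lower': perc_lower,
             'upper': perc_upper,
             'both': perc_lower + perc_upper}
    return truncated, rates


# hypothetical feature: 99 zeros and one very large value
feature_values = [0] * 99 + [100]
truncated, rates = truncation_rates(feature_values)
print(rates)
```

In this example the mean is 1 and the SD is 10, so the upper bound is 41 and the single large value is recoded to it.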

In [None]:
df_outliers = pd.read_csv(join(output_dir, '{}_feature_outliers.csv'.format(experiment_id)), index_col=0)
df_outliers.index.name = 'feature'
df_outliers = df_outliers.reset_index()
df_outliers = pd.melt(df_outliers, id_vars=['feature'])
df_outliers = df_outliers[df_outliers.variable.str.contains(r'[ulb].*?perc')]

# we need a higher aspect if we have more than 40 features
aspect = 3 if len(features_used) > 40 else 2

# colors for the plot
colors = sns.color_palette("Greys", 3)

# what's the largest value in the data frame
maxperc = df_outliers['value'].max()

# compute the limits for the graph
limits = (0, max(2.5, maxperc))

with sns.axes_style('whitegrid'):
    # create a barplot without a legend since we will manually
    # add one later
    p = sns.factorplot(x="feature", y="value", hue="variable", kind="bar",
                       palette=colors, data=df_outliers, size=3,
                       aspect=aspect, legend=False)
    p.set_axis_labels('', '% cases truncated to mean +/- 4*sd')
    p.set_xticklabels(rotation=90)
    p.set(ylim=limits)
    
    # add a line at 2%
    axis = p.axes[0][0]
    axis.axhline(y=2.0, linestyle='--', linewidth=1.5, color='black')
    
    # add a legend with the right colors
    legend = axis.legend(('both', 'lower', 'upper'), title='', frameon=True, fancybox=True, ncol=3)
    legend.legendHandles[0].set_color(colors[0])
    legend.legendHandles[1].set_color(colors[1])
    legend.legendHandles[2].set_color(colors[2])
    plt.savefig(join(figure_dir, '{}_outliers.svg'.format(experiment_id)))

### Feature value distribution

In [None]:
# feature descriptives extra table
df_desce = pd.read_csv(join(output_dir, '{}_feature_descriptivesExtra.csv'.format(experiment_id)), index_col=0)
HTML(df_desce.to_html(classes=['sortable'], float_format=float_format_func))