# Experimentation with data analysis

In this notebook, we analyze the CM1 data from the NASA MDP data set. The data set is available [here](https://github.com/klainfo/NASADefectDataset/tree/master).

The dataset contains 327 modules (rows) and 38 columns: 37 software metrics plus a final `Defective` label that marks each module as defective or not defective.


In [1]:
import pandas as pd
import pyarrow

In [2]:
cm_1_dataset = pd.read_csv("../Datasets/MDP_CSV/CM1.csv", engine='pyarrow')
cm_1_dataset

Unnamed: 0,LOC_BLANK,BRANCH_COUNT,CALL_PAIRS,LOC_CODE_AND_COMMENT,LOC_COMMENTS,CONDITION_COUNT,CYCLOMATIC_COMPLEXITY,CYCLOMATIC_DENSITY,DECISION_COUNT,DECISION_DENSITY,...,NODE_COUNT,NORMALIZED_CYLOMATIC_COMPLEXITY,NUM_OPERANDS,NUM_OPERATORS,NUM_UNIQUE_OPERANDS,NUM_UNIQUE_OPERATORS,NUMBER_OF_LINES,PERCENT_COMMENTS,LOC_TOTAL,Defective
0,2.0,3.0,0.0,0.0,8.0,4.0,2.0,0.22,2.0,2.00,...,6.0,0.22,5.0,10.0,4.0,7.0,9.0,47.06,9.0,b'N'
1,3.0,3.0,0.0,2.0,2.0,4.0,2.0,0.15,2.0,2.00,...,5.0,0.11,10.0,22.0,5.0,12.0,19.0,26.67,13.0,b'N'
2,38.0,35.0,4.0,5.0,70.0,58.0,18.0,0.17,24.0,2.42,...,51.0,0.08,150.0,222.0,58.0,32.0,218.0,41.90,109.0,b'N'
3,1.0,7.0,5.0,0.0,12.0,12.0,4.0,0.10,6.0,2.00,...,18.0,0.06,50.0,79.0,36.0,19.0,68.0,22.64,41.0,b'Y'
4,9.0,15.0,4.0,14.0,22.0,28.0,8.0,0.20,14.0,2.00,...,24.0,0.11,29.0,64.0,19.0,18.0,73.0,57.14,41.0,b'N'
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
322,67.0,29.0,10.0,27.0,41.0,56.0,15.0,0.13,28.0,2.00,...,63.0,0.07,192.0,274.0,32.0,19.0,228.0,42.50,119.0,b'N'
323,9.0,3.0,5.0,0.0,12.0,4.0,2.0,0.11,2.0,2.00,...,10.0,0.05,27.0,43.0,21.0,16.0,40.0,40.00,18.0,b'N'
324,3.0,3.0,1.0,1.0,2.0,4.0,2.0,0.17,2.0,2.00,...,6.0,0.11,11.0,21.0,8.0,14.0,18.0,21.43,12.0,b'N'
325,6.0,9.0,3.0,10.0,22.0,16.0,5.0,0.16,8.0,2.00,...,21.0,0.08,42.0,72.0,19.0,20.0,61.0,59.26,32.0,b'N'


### Attributes of the dataset


In [3]:
cm_1_dataset.describe()

Unnamed: 0,LOC_BLANK,BRANCH_COUNT,CALL_PAIRS,LOC_CODE_AND_COMMENT,LOC_COMMENTS,CONDITION_COUNT,CYCLOMATIC_COMPLEXITY,CYCLOMATIC_DENSITY,DECISION_COUNT,DECISION_DENSITY,...,MULTIPLE_CONDITION_COUNT,NODE_COUNT,NORMALIZED_CYLOMATIC_COMPLEXITY,NUM_OPERANDS,NUM_OPERATORS,NUM_UNIQUE_OPERANDS,NUM_UNIQUE_OPERATORS,NUMBER_OF_LINES,PERCENT_COMMENTS,LOC_TOTAL
count,327.0,327.0,327.0,327.0,327.0,327.0,327.0,327.0,327.0,327.0,...,327.0,327.0,327.0,327.0,327.0,327.0,327.0,327.0,327.0,327.0
mean,16.767584,13.015291,3.795107,5.657492,17.470948,20.917431,7.324159,0.166422,9.87156,2.119113,...,10.492355,25.480122,0.104771,76.963303,120.189602,35.443425,19.088685,82.428135,29.584343,46.844037
std,23.162161,16.843098,3.941782,10.095501,30.508948,27.275282,9.455683,0.061612,12.704113,0.320483,...,13.67768,29.187338,0.055726,98.531729,150.976639,38.176965,9.3044,95.848744,18.805039,56.157181
min,0.0,3.0,0.0,0.0,0.0,4.0,2.0,0.04,2.0,2.0,...,2.0,5.0,0.02,4.0,7.0,3.0,4.0,9.0,0.0,7.0
25%,4.0,5.0,1.0,0.0,1.0,8.0,3.0,0.13,4.0,2.0,...,4.0,9.0,0.07,22.0,38.0,13.0,13.0,28.0,13.89,18.0
50%,9.0,7.0,3.0,2.0,8.0,12.0,4.0,0.16,6.0,2.0,...,6.0,15.0,0.09,40.0,68.0,22.0,17.0,52.0,31.25,28.0
75%,20.0,15.0,5.0,6.0,20.5,24.0,8.0,0.19,12.0,2.085,...,12.0,31.0,0.13,92.5,136.5,42.0,23.0,90.5,45.115,53.0
max,164.0,162.0,26.0,80.0,339.0,248.0,96.0,0.56,118.0,5.0,...,124.0,251.0,0.5,798.0,1229.0,314.0,72.0,764.0,71.93,503.0


In [4]:
column_names = cm_1_dataset.columns
column_names

Index(['LOC_BLANK', 'BRANCH_COUNT', 'CALL_PAIRS', 'LOC_CODE_AND_COMMENT',
       'LOC_COMMENTS', 'CONDITION_COUNT', 'CYCLOMATIC_COMPLEXITY',
       'CYCLOMATIC_DENSITY', 'DECISION_COUNT', 'DECISION_DENSITY',
       'DESIGN_COMPLEXITY', 'DESIGN_DENSITY', 'EDGE_COUNT',
       'ESSENTIAL_COMPLEXITY', 'ESSENTIAL_DENSITY', 'LOC_EXECUTABLE',
       'PARAMETER_COUNT', 'HALSTEAD_CONTENT', 'HALSTEAD_DIFFICULTY',
       'HALSTEAD_EFFORT', 'HALSTEAD_ERROR_EST', 'HALSTEAD_LENGTH',
       'HALSTEAD_LEVEL', 'HALSTEAD_PROG_TIME', 'HALSTEAD_VOLUME',
       'MAINTENANCE_SEVERITY', 'MODIFIED_CONDITION_COUNT',
       'MULTIPLE_CONDITION_COUNT', 'NODE_COUNT',
       'NORMALIZED_CYLOMATIC_COMPLEXITY', 'NUM_OPERANDS', 'NUM_OPERATORS',
       'NUM_UNIQUE_OPERANDS', 'NUM_UNIQUE_OPERATORS', 'NUMBER_OF_LINES',
       'PERCENT_COMMENTS', 'LOC_TOTAL', 'Defective'],
      dtype='object')

### Selecting Attributes for Analysis

We selected the following attributes for analysis:

- **Halstead Metrics:** Halstead Difficulty, Halstead Effort, Halstead Error Estimate
- **Code Attributes:** Lines of Code (Executable), Cyclomatic Complexity, Decision Count
- **Design Attributes:** Essential Complexity, Design Complexity, Design Density


In [5]:
analysis_dataset = cm_1_dataset[[
    "HALSTEAD_DIFFICULTY", 
    "HALSTEAD_EFFORT", 
    "HALSTEAD_ERROR_EST",
    "LOC_EXECUTABLE",
    "CYCLOMATIC_COMPLEXITY",
    "DECISION_COUNT",
    "ESSENTIAL_COMPLEXITY",
    "DESIGN_DENSITY",
    "DESIGN_COMPLEXITY",
    "Defective"
    ]]
# Encode the label: the string "b'Y'" (defective) becomes 1 and "b'N'" (not defective) becomes 0
analysis_dataset = analysis_dataset.replace(to_replace='b\'Y\'', value=1)
analysis_dataset = analysis_dataset.replace(to_replace='b\'N\'', value=0)
analysis_dataset

Unnamed: 0,HALSTEAD_DIFFICULTY,HALSTEAD_EFFORT,HALSTEAD_ERROR_EST,LOC_EXECUTABLE,CYCLOMATIC_COMPLEXITY,DECISION_COUNT,ESSENTIAL_COMPLEXITY,DESIGN_DENSITY,DESIGN_COMPLEXITY,Defective
0,4.38,227.03,0.02,9.0,2.0,2.0,1.0,0.50,1.0,0
1,12.00,1569.59,0.04,11.0,2.0,2.0,1.0,0.50,1.0,0
2,41.38,99929.77,0.80,104.0,18.0,24.0,13.0,0.39,7.0,0
3,13.19,9840.36,0.25,41.0,4.0,6.0,3.0,0.75,3.0,1
4,13.74,6655.21,0.16,27.0,8.0,14.0,4.0,0.75,6.0,0
...,...,...,...,...,...,...,...,...,...,...
322,57.00,150670.96,0.88,92.0,15.0,28.0,15.0,1.00,15.0,0
323,10.29,3750.81,0.12,18.0,2.0,2.0,1.0,1.00,2.0,0
324,9.63,1373.50,0.05,11.0,2.0,2.0,1.0,0.50,1.0,0
325,22.11,13319.21,0.20,22.0,5.0,8.0,1.0,0.40,2.0,0
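
The `Defective` column arrives as text such as `b'N'`, which looks like a Python byte string written out as text. As a minimal sketch, assuming the labels may show up as real bytes, in the `b'...'` text form, or as plain `Y`/`N`, a small helper can normalise all three and fail loudly on anything else. The name `decode_defective` and the sample values are hypothetical and not part of the notebook.

```python
import pandas as pd


def decode_defective(label):
    """Map b'Y', "b'Y'" or 'Y' to 1 and the matching 'N' forms to 0."""
    if isinstance(label, bytes):
        label = label.decode("utf-8")
    text = str(label).strip()
    # Strip the textual byte-string wrapper, e.g. "b'Y'" -> "Y".
    if text.startswith("b'") and text.endswith("'"):
        text = text[2:-1]
    mapping = {"Y": 1, "N": 0}
    if text not in mapping:
        raise ValueError(f"Unexpected label: {label!r}")
    return mapping[text]


# Hypothetical sample covering the three label forms.
raw_labels = pd.Series(["b'N'", "b'Y'", b"N", "Y"])
encoded = raw_labels.map(decode_defective)
print(encoded.tolist())
```

Raising on unknown labels, rather than leaving them untouched, avoids silently mixing strings and integers in the label column.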


# Plotting the Data

In this section we plot several attributes of the dataset in different ways to get a better understanding of the data.
We use Cufflinks, a library that connects pandas DataFrames with Plotly so that visualizations can be created directly from a DataFrame.


### Box Plot

A box plot summarizes the distribution of the data through five values: minimum, first quartile (Q1), median (Q2), third quartile (Q3) and maximum. The box extends from Q1 to Q3, with a line at the median. The whiskers extend from the edges of the box to the most extreme data points that lie within 1.5 \* IQR (IQR = Q3 - Q1) of the box, which is the default setting. Points beyond the ends of the whiskers are drawn individually as outliers.
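
To make the whisker rule concrete, here is a minimal sketch that computes the quartiles, the 1.5 \* IQR fences, the whisker ends and the outliers for a small sample. The helper name `box_plot_summary` is hypothetical, and the sample is only the nine Halstead Error Estimate values visible in the table preview above, not the full column. Plotting libraries may use a slightly different quartile interpolation, so their boxes need not match these numbers exactly.

```python
import statistics


def box_plot_summary(values):
    """Return quartiles, IQR, whisker ends and outliers for a sample."""
    data = sorted(values)
    # "inclusive" treats the sample as containing its own min and max.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme points that are still inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Illustrative sample: the HALSTEAD_ERROR_EST values shown in the preview rows.
sample = [0.02, 0.04, 0.80, 0.25, 0.16, 0.88, 0.12, 0.05, 0.20]
summary = box_plot_summary(sample)
print(summary)
```

In this sample the two largest values fall outside the upper fence, so they would be drawn as individual outlier points rather than being covered by the whisker.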


In [6]:
import cufflinks as cf
import nbformat
cf.go_offline()

For example, we will take a look at the distribution of Halstead Error Estimate for the defective and non-defective modules.


In [7]:
# Split the dataset into non-defective and defective modules for the plots below
non_defective_group = analysis_dataset[analysis_dataset['Defective'] == 0]
defective_group = analysis_dataset[analysis_dataset['Defective'] == 1]

In [8]:

non_defective_group[["HALSTEAD_ERROR_EST"]].iplot(
    kind='box', 
    title='Box plot of Halstead Error Estimate of Non-Defective Modules',
    dimensions=(900, 500),
    orientation='h',
    theme="solar"
    )

In [9]:
defective_group[["HALSTEAD_ERROR_EST"]].iplot(
    kind='box', 
    title='Box plot of Halstead Error Estimate of Defective Modules',
    dimensions=(900, 500),
    orientation='h',
    theme="solar"
    )

### Bar Plot

A bar plot compares a measured value across the levels of a categorical variable. One axis shows the categories being compared, and the other axis shows the measured value. For example, we will compare the average design complexity of defective and non-defective modules.

In [10]:
avg_design_complexity_of_defective = defective_group["DESIGN_COMPLEXITY"].mean()
avg_design_complexity_of_non_defective = non_defective_group["DESIGN_COMPLEXITY"].mean()

pd.DataFrame({
    "Defective": [avg_design_complexity_of_defective],
    "Non-Defective": [avg_design_complexity_of_non_defective]
}).iplot(
    kind='bar',
    title='Bar plot of Average Design Complexity of Defective and Non-Defective Modules',
    dimensions=(900, 500),
    theme="solar"
)
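
The per-group averages can also be computed in one step with `groupby`. The sketch below uses a tiny hand-made frame built from the first five rows displayed earlier, not the full dataset, so the numbers are illustrative only. The names `sample` and `group_means` are assumptions introduced for this example.

```python
import pandas as pd

# Illustrative sample: the first five rows of the selected attributes.
sample = pd.DataFrame({
    "HALSTEAD_ERROR_EST": [0.02, 0.04, 0.80, 0.25, 0.16],
    "DESIGN_COMPLEXITY": [1.0, 1.0, 7.0, 3.0, 6.0],
    "Defective": [0, 0, 0, 1, 0],
})

# One row per class, one column per averaged metric.
group_means = (
    sample.groupby("Defective")[["DESIGN_COMPLEXITY", "HALSTEAD_ERROR_EST"]].mean()
)
# Replace the 0/1 index with readable category names for plotting.
group_means.index = group_means.index.map({0: "Non-Defective", 1: "Defective"})
print(group_means)
```

Grouping keeps both classes in a single DataFrame, so adding another metric only means adding a column name to the selection.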


As another example, we compare the average Halstead Error Estimate of defective and non-defective modules.

In [11]:
average_error_for_defective = defective_group["HALSTEAD_ERROR_EST"].mean()
average_error_for_non_defective = non_defective_group["HALSTEAD_ERROR_EST"].mean()

pd.DataFrame({
    "Defective": [average_error_for_defective],
    "Non-Defective": [average_error_for_non_defective]
}).iplot(
    kind='bar',
    title='Bar plot of Average Halstead Error Estimate of Defective and Non-Defective Modules',
    dimensions=(900, 500),
    theme="solar"
)

### Scatter Plot

A scatter plot shows the relationship between two numeric variables. Each point represents one observation (here, one module), positioned by its value of one variable on the x-axis and its value of the other variable on the y-axis.

The bar plots above suggest that there might be a relation between Design Complexity and Halstead Error Estimate. We will use a scatter plot to see whether the two attributes are related.

In [18]:
analysis_dataset.iplot(
    kind='scatter',
    x='HALSTEAD_ERROR_EST',
    y='DESIGN_COMPLEXITY',
    dimensions=(900, 500),
    theme="solar",
    xTitle='Halstead Error Estimate',
    yTitle='Design Complexity',
    title='Scatter plot of Halstead Error Estimate vs Design Complexity',
    mode='markers',  # draw individual points; the default mode connects them with a line
)

Next, the same plot for the defective and non-defective groups separately:

In [19]:
defective_group.iplot(
    kind='scatter',
    x='HALSTEAD_ERROR_EST',
    y='DESIGN_COMPLEXITY',
    dimensions=(900, 500),
    theme="solar",
    xTitle='Halstead Error Estimate',
    yTitle='Design Complexity',
    title='Scatter plot of Halstead Error Estimate vs Design Complexity of Defective Modules',
    mode='markers',  # draw individual points; the default mode connects them with a line
)

In [20]:
non_defective_group.iplot(
    kind='scatter',
    x='HALSTEAD_ERROR_EST',
    y='DESIGN_COMPLEXITY',
    dimensions=(900, 500),
    theme="solar",
    xTitle='Halstead Error Estimate',
    yTitle='Design Complexity',
    title='Scatter plot of Halstead Error Estimate vs Design Complexity of Non-Defective Modules',
    mode='markers',  # draw individual points; the default mode connects them with a line
)
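
A scatter plot gives a visual impression of the relation; a correlation coefficient puts a number on it. As a minimal sketch, the hypothetical helper `pearson_r` below computes the Pearson correlation by hand. The input is only the nine rows visible in the table preview above, so the result says nothing definitive about the full dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Illustrative sample: the rows shown in the preview of analysis_dataset.
halstead_error_est = [0.02, 0.04, 0.80, 0.25, 0.16, 0.88, 0.12, 0.05, 0.20]
design_complexity = [1.0, 1.0, 7.0, 3.0, 6.0, 15.0, 2.0, 1.0, 2.0]

r = pearson_r(halstead_error_est, design_complexity)
print(f"Pearson r on the sample rows: {r:.2f}")
```

On the full DataFrame, `analysis_dataset["HALSTEAD_ERROR_EST"].corr(analysis_dataset["DESIGN_COMPLEXITY"])` computes the same statistic with pandas.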

In [14]:
cf.help("scatter")

SCATTER
Scatter plot
2D chart based in an x and y axis.
Can be a line chart (default) or a set of scatter points (markers)


Parameters:
    bestfit : bool or list
        Displays a best fit line
        If list, then a best fit line will be generated
        for each trace key in the list
    bestfit_colors : dict or list
        Sets the color for each best fit line
        	{key:color} to set the color for each trace
        	[color1, color2...] to set the colors in the specified order
    categories : string
        Name of the column that contains the categories
    connectgaps : bool
        If True, then empty values are connected
    dash : string, list or dict
        Line style
        	string : applies to all traces
        	list : applies to each trace in the order specified
        	dict : {column:value} for each column in the dataframe
        values    :
        	solid
        	dash
        	dashdot
        	dot
    fill : bool
        Fills the trace (area)
    interpo