<h2>Exercise 5 - Building a Defect Prediction System</h2>

**Introduction.** In this exercise you will build a defect prediction system. Because building one from scratch would take more time than we have available, your task is to select the features of the data that the model uses during both training and prediction.

Please read the background information for each step and then execute the code block. The steps that require you to modify the code will be clearly marked.

Thank you to prabhdeep123, whose defect prediction system heavily influenced this exercise. If you wish, you can [view the original code](https://www.kaggle.com/code/prabhdeep123/software-defect-prediction).

**Step 1.** The first step is to import the libraries needed to implement our defect prediction approach. We then suppress all warnings, since none of them matter for this exercise.

In [13]:
# Import any needed libraries.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import warnings

# Disable all warnings
warnings.filterwarnings("ignore")

**Step 2.** We then load the datasets that are stored on the local filesystem. *jm1.csv* contains source code metrics and defect presence information for a NASA real-time predictive ground system written in C and will be used to train the model. *cm1.csv* contains the same information for a NASA spacecraft instrument written in C. Note that the code as written loads and cleans *cm1.csv*, but the evaluation in Step 4 uses a held-out 20% of *jm1.csv*.

We then mark `?` placeholders as missing values and drop the training rows that are missing a value in the fifth or sixth column.

In [14]:
# Load the train and test datasets.
train_df = pd.read_csv('jm1.csv')
test_df = pd.read_csv('cm1.csv')

# Prepare the data to be processed.
indexes = test_df.index
train_df.replace('?', pd.NA, inplace=True)
test_df.replace('?', pd.NA, inplace=True)
train_df.dropna(subset=train_df.columns[4:6], inplace=True)
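The cleanup in Step 2 can be tried on a small scale. The sketch below uses a made-up three-row table (the values are invented for illustration and are not taken from *jm1.csv*) to show how replacing `?` with `pd.NA` lets `dropna` remove the affected rows.

```python
import pandas as pd

# Hypothetical toy data: column names are borrowed from the metric list,
# but the values are invented for illustration only.
toy_df = pd.DataFrame({
    "loc": [10.0, 25.0, 7.0],
    "uniq_Op": ["5", "?", "3"],
    "defects": [False, True, False],
})

# Mark the '?' placeholder as missing, then drop rows missing a value
# in the chosen column.
cleaned = toy_df.replace("?", pd.NA).dropna(subset=["uniq_Op"])
print(cleaned)
```

Only the row containing `?` is removed; the other two rows are left unchanged.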

**Step 3.** We then convert a few columns so that they are represented as numeric values. After that we separate the `defects` column from the rest of the data, since it is the label the model learns to predict rather than an input feature.

We then select the subset of columns (features) that should be used during training and testing. **You can leave this line as-is for now, but you will later be modifying it to determine what combination of features optimizes model performance.**

In [15]:
# Make the values in the five columns preceding the final column numeric.
# Drop any rows for which such conversion fails.
train_df[train_df.columns[16:21]] = train_df[train_df.columns[16:21]].apply(pd.to_numeric, errors='coerce')
test_df[test_df.columns[16:21]] = test_df[test_df.columns[16:21]].apply(pd.to_numeric, errors='coerce')
train_df.dropna(inplace=True)
test_df.dropna(inplace=True)
train_df[train_df.columns[16:21]] = train_df[train_df.columns[16:21]].astype(int)
test_df[test_df.columns[16:21]] = test_df[test_df.columns[16:21]].astype(int)

# Store the defects column from the training data separately.
X = train_df.drop('defects', axis=1)
y = train_df['defects'].astype('int')

# Select some subset of the below columns to use during training and testing.
# 1.  loc               : numeric % McCabe's line count of code
# 2.  v(g)              : numeric % McCabe "cyclomatic complexity"
# 3.  ev(g)             : numeric % McCabe "essential complexity"
# 4.  iv(g)             : numeric % McCabe "design complexity"
# 5.  n                 : numeric % Halstead total operators + operands
# 6.  v                 : numeric % Halstead "volume"
# 7.  l                 : numeric % Halstead "program length"
# 8.  d                 : numeric % Halstead "difficulty"
# 9.  i                 : numeric % Halstead "intelligence"
# 10. e                 : numeric % Halstead "effort"
# 11. b                 : numeric % Halstead
# 12. t                 : numeric % Halstead's time estimator
# 13. lOCode            : numeric % Halstead's line count
# 14. lOComment         : numeric % Halstead's count of lines of comments
# 15. lOBlank           : numeric % Halstead's count of blank lines
# 16. locCodeAndComment : numeric
# 17. uniq_Op           : numeric % unique operators
# 18. uniq_Opnd         : numeric % unique operands
# 19. total_Op          : numeric % total operators
# 20. total_Opnd        : numeric % total operands
# 21. branchCount       : numeric % of the flow graph
X = X[['loc', 'd', 'locCodeAndComment', 'v(g)', 'uniq_Opnd', 'i']]

# Hold out 20% of the rows for testing and train on the remaining 80%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
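The numeric conversion in Step 3 relies on `pd.to_numeric` with `errors='coerce'`, which turns any value that cannot be parsed into `NaN` instead of raising an error. A minimal sketch on an invented series (the values are illustrative, not from the datasets):

```python
import pandas as pd

# Invented values: three parse as numbers, one does not.
raw = pd.Series(["12", "7", "n/a", "3"])

# Unparseable entries become NaN rather than raising an exception.
numeric = pd.to_numeric(raw, errors="coerce")

# Once the NaN rows are dropped, the values can be cast to integers.
as_int = numeric.dropna().astype(int)
print(as_int.tolist())  # [12, 7, 3]
```

This is why the notebook calls `dropna` between the conversion and the cast to `int`: a column that still contains `NaN` cannot be cast to a plain integer type.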

**Step 4.** The final step is to create a logistic regression model using the training data and then evaluate it against the test data. The metric we use to evaluate its performance is AUC-ROC, the area under the Receiver Operating Characteristic curve. This value lies between 0 and 1. A value of 0.5 indicates a model with no predictive power (no better than random guessing), a value of 1 means the model ranks every defective module above every non-defective one, and values below 0.5 mean the model does worse than random guessing.
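To make the AUC-ROC scale concrete, here is a small sketch using made-up labels and scores (none of these numbers come from the datasets). It shows the value for a perfect ranking, a fully inverted ranking, and scores that carry no information.

```python
from sklearn.metrics import roc_auc_score

# Invented ground truth: two non-defective (0) and two defective (1) modules.
y_true = [0, 0, 1, 1]

# Defective modules always score higher than non-defective ones.
auc_perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])

# Defective modules always score lower: the ranking is exactly backwards.
auc_inverted = roc_auc_score(y_true, [0.9, 0.8, 0.2, 0.1])

# Every module gets the same score, so the ranking is uninformative.
auc_constant = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])

print(auc_perfect)   # 1.0
print(auc_inverted)  # 0.0
print(auc_constant)  # 0.5
```

An AUC-ROC of 0 therefore describes a model whose ranking is consistently wrong, while an uninformative model sits at 0.5.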

**Now that you have a sense for how this defect prediction system works, you should perform some experimentation to determine what combination of features (columns) in the training data yields the highest quality predictions. You can go to Step 3, modify the list of features, and then re-run both Steps 3 and 4 to see how the AUC-ROC value changes. Alternatively, you can press 'Ctrl-F9' to re-run all of the blocks.**

**You may wish to start by using individual features so you can see which of them have greater predictive power and then combining those that seem promising. Please be prepared to discuss with the class what you've learned as well as the maximum AUC-ROC value you achieved.**

In [16]:
# Train a logistic regression model on the training split.
logistic_model = LogisticRegression(random_state=42)
logistic_model.fit(X_train, y_train)

# Score the test split using the predicted probability of the defect class.
logistic_predictions = logistic_model.predict_proba(X_test)[:, 1]
logistic_auc_roc = roc_auc_score(y_test, logistic_predictions)
print(f"Logistic Regression AUC-ROC: {logistic_auc_roc}")

Logistic Regression AUC-ROC: 0.7271697390458622
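The suggestion to start with individual features can be automated. The following is one possible sketch, not part of the original notebook: `auc_per_feature` is a hypothetical helper, the demo data is synthetic, and `max_iter=1000` is an added assumption to help the solver converge. It fits one logistic regression per column, using the same split settings as Step 3, and ranks the columns by AUC-ROC.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def auc_per_feature(X, y, test_size=0.2, random_state=42):
    """Fit one model per column and return AUC-ROC values, best first."""
    scores = {}
    for column in X.columns:
        # Use a single column as the only feature.
        X_train, X_test, y_train, y_test = train_test_split(
            X[[column]], y, test_size=test_size, random_state=random_state
        )
        model = LogisticRegression(random_state=random_state, max_iter=1000)
        model.fit(X_train, y_train)
        probabilities = model.predict_proba(X_test)[:, 1]
        scores[column] = roc_auc_score(y_test, probabilities)
    return pd.Series(scores).sort_values(ascending=False)


# Synthetic stand-in data: one column related to the label, one pure noise.
rng = np.random.default_rng(0)
n_rows = 500
signal = rng.normal(size=n_rows)
demo_X = pd.DataFrame({"informative": signal, "noise": rng.normal(size=n_rows)})
demo_y = pd.Series((signal + 0.5 * rng.normal(size=n_rows) > 0).astype(int))

ranking = auc_per_feature(demo_X, demo_y)
print(ranking)
```

In the notebook itself, a call such as `auc_per_feature(train_df.drop('defects', axis=1), y)` would rank all 21 metrics, assuming Step 3 has already been run. Features that score well on their own are reasonable candidates to combine, although a combination does not always beat its best single member.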
