## Software Fault Prediction Using Classification Algorithms

### We use the PC1 dataset because the reference paper reports results on it, which gives us a baseline for comparison

**To get results for another dataset, load that dataset's file instead of `pc1` in the loading step.**

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

### The cell above imports the libraries used throughout the notebook; run it before any of the other cells

In [None]:
# 1. Load Dataset

data = pd.read_csv('../data/pc1.csv')

In [4]:
# 2. Explore the dataset (optional)
print(data.head())
print(data.info())
print(data.describe())

    loc  v(g)  ev(g)  iv(g)     n       v     l      d      i        e  ...  \
0   1.1   1.4    1.4    1.4   1.3    1.30  1.30   1.30   1.30     1.30  ...   
1   1.0   1.0    1.0    1.0   1.0    1.00  1.00   1.00   1.00     1.00  ...   
2  24.0   5.0    1.0    3.0  63.0  309.13  0.11   9.50  32.54  2936.77  ...   
3  20.0   4.0    4.0    2.0  47.0  215.49  0.06  16.00  13.47  3447.89  ...   
4  24.0   6.0    6.0    2.0  72.0  346.13  0.06  17.33  19.97  5999.58  ...   

   lOCode  lOComment  lOBlank  locCodeAndComment  uniq_Op  uniq_Opnd  \
0       2          2        2                  2      1.2        1.2   
1       1          1        1                  1      1.0        1.0   
2       1          0        6                  0     15.0       15.0   
3       0          0        3                  0     16.0        8.0   
4       0          0        3                  0     16.0       12.0   

   total_Op  total_Opnd  branchCount  defects  
0       1.2         1.2          1.4    Fals

---

### We explore the dataset here only to inspect its columns, types, and value ranges
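
Since the target is the `defects` column, it can also help to look at how many rows fall into each class before training. The snippet below is a minimal sketch: `class_balance` is a hypothetical helper, and the toy frame uses made-up values rather than rows from `pc1.csv`.

```python
import pandas as pd


def class_balance(df: pd.DataFrame, target: str = "defects") -> pd.DataFrame:
    """Return the row count and the share of each class in the target column."""
    counts = df[target].value_counts()
    shares = df[target].value_counts(normalize=True)
    return pd.DataFrame({"count": counts, "share": shares})


# Toy stand-in for the real data (hypothetical values, not taken from pc1.csv)
toy = pd.DataFrame({
    "loc": [24.0, 20.0, 24.0, 7.0, 12.0],
    "defects": [False, False, True, False, False],
})

balance = class_balance(toy)
print(balance)
```

On the real data, `class_balance(data)` would show how strongly the non-faulty class dominates.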

In [None]:
# 3. Preprocessing


imputer = SimpleImputer(strategy='mean')  # or 'median' depending on your data
data_imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

### Here we handle missing values, if any exist in the dataset
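
The imputation cell passes every column, including `defects`, through the imputer. A possible variant, sketched below, imputes only the feature columns and leaves the target untouched. `impute_features` is a hypothetical helper and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_features(df, target="defects", strategy="mean"):
    """Fill missing feature values column by column; the target is copied as-is."""
    features = df.drop(columns=[target])
    imputer = SimpleImputer(strategy=strategy)
    filled = pd.DataFrame(
        imputer.fit_transform(features),
        columns=features.columns,
        index=df.index,
    )
    filled[target] = df[target]
    return filled


# Toy frame with two missing feature values (hypothetical data)
toy = pd.DataFrame({
    "loc": [10.0, np.nan, 30.0],
    "v(g)": [1.0, 3.0, np.nan],
    "defects": [False, True, False],
})

clean = impute_features(toy)
print(clean)
```

Keeping the target out of the imputer means its original labels survive, so the later encoding step still sees them unchanged.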

In [6]:
# Separate features and target
X = data_imputed.drop('defects', axis=1)  
y = data_imputed['defects']

In [7]:
# Encode categorical target if necessary
if y.dtype == 'object':
    le = LabelEncoder()
    y = le.fit_transform(y)

In [8]:
# Feature scaling (important for KNN and SVM)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
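
Fitting the scaler on the full dataset lets statistics from the test folds influence training during cross-validation. One way to avoid this, sketched here on synthetic data from `make_classification` (an assumption, not the PC1 data), is to put the scaler and the classifier in a scikit-learn pipeline so that scaling is refit on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real features and labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.9, 0.1], random_state=42
)

# The scaler is fit only on the training part of each fold
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

metrics = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(model, X_demo, y_demo, cv=cv, scoring=metrics)

# Average each metric over the 10 folds
mean_scores = {m: scores[f"test_{m}"].mean() for m in metrics}
print(mean_scores)
```

The same pattern should apply to the unscaled `X` and `y` from this notebook, with any of the four classifiers in place of `KNeighborsClassifier`.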

In [9]:
# 4. Define classifiers
classifiers = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Support Vector Machine': SVC(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42)
}

In [11]:
# 5. Evaluation using 10-fold cross-validation
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
y_arr = np.asarray(y)  # positional indexing for both Series and arrays

results = {}  # classifier name -> mean metrics across the 10 folds

for name, clf in classifiers.items():
    accuracies, precisions, recalls, f1s = [], [], [], []

    for train_idx, test_idx in cv.split(X_scaled, y_arr):
        X_train, X_test = X_scaled[train_idx], X_scaled[test_idx]
        y_train, y_test = y_arr[train_idx], y_arr[test_idx]

        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)

        accuracies.append(accuracy_score(y_test, y_pred))
        precisions.append(precision_score(y_test, y_pred, zero_division=0))
        recalls.append(recall_score(y_test, y_pred, zero_division=0))
        f1s.append(f1_score(y_test, y_pred, zero_division=0))

    results[name] = {
        'Accuracy': np.mean(accuracies),
        'Precision': np.mean(precisions),
        'Recall': np.mean(recalls),
        'F1-score': np.mean(f1s),
    }

# Metrics of the last classifier evaluated (Random Forest)
print(f"Accuracy: {np.mean(accuracies):.4f}")
print(f"Precision: {np.mean(precisions):.4f}")
print(f"Recall: {np.mean(recalls):.4f}")
print(f"F1-score: {np.mean(f1s):.4f}")


Accuracy: 0.8896
Precision: 0.0500
Recall: 0.0200
F1-score: 0.0286


In [12]:
# 6. Display results
results_df = pd.DataFrame(results).T
print(results_df)

                        Accuracy  Precision  Recall  F1-score
Decision Tree           0.851551   0.265952   0.205  0.207172
K-Nearest Neighbors     0.887592   0.200000   0.040  0.066667
Support Vector Machine  0.899633   0.000000   0.000  0.000000
Random Forest           0.889633   0.050000   0.020  0.028571


# 7. Compare our results with the paper's reported results
# (The paper's results are entered manually below for the comparison in the report)

In [16]:
import pandas as pd

# My results 
my_results = {
    'Classifier': ['Decision Tree', 'K-Nearest Neighbor', 'Support Vector Machine', 'Random Forest'],
    'Accuracy_Your': [0.851551, 0.887592, 0.899633, 0.889633],
    'Precision_Your': [0.265952, 0.200000, 0.000000, 0.050000],
    'Recall_Your': [0.205, 0.040, 0.000, 0.020],
    'F1-score_Your': [0.207172, 0.066667, 0.000000, 0.028571]
}

# Paper's reported results for PC1 dataset (from Table 2 in the paper)
paper_results = {
    'Classifier': ['Decision Tree', 'K-Nearest Neighbor', 'Support Vector Machine', 'Random Forest'],
    'Accuracy_Paper': [0.912, 0.923, 0.924, 0.930],
    'Precision_Paper': [0.583, 0.615, 0.635, 0.653],
    'Recall_Paper': [0.512, 0.564, 0.573, 0.598],
    'F1-score_Paper': [0.545, 0.588, 0.602, 0.624]
}

# Create DataFrames
df_your = pd.DataFrame(my_results)
df_paper = pd.DataFrame(paper_results)

# Merge on Classifier
comparison_df = pd.merge(df_your, df_paper, on='Classifier')

# Display the comparison table
print("Comparison of Your Results vs Paper's Reported Results (PC1 Dataset):")
display(comparison_df)


Comparison of Your Results vs Paper's Reported Results (PC1 Dataset):


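|   | Classifier | Accuracy_Your | Precision_Your | Recall_Your | F1-score_Your | Accuracy_Paper | Precision_Paper | Recall_Paper | F1-score_Paper |
|---|------------|---------------|----------------|-------------|---------------|----------------|-----------------|--------------|----------------|
| 0 | Decision Tree | 0.851551 | 0.265952 | 0.205 | 0.207172 | 0.912 | 0.583 | 0.512 | 0.545 |
| 1 | K-Nearest Neighbor | 0.887592 | 0.2 | 0.04 | 0.066667 | 0.923 | 0.615 | 0.564 | 0.588 |
| 2 | Support Vector Machine | 0.899633 | 0.0 | 0.0 | 0.0 | 0.924 | 0.635 | 0.573 | 0.602 |
| 3 | Random Forest | 0.889633 | 0.05 | 0.02 | 0.028571 | 0.93 | 0.653 | 0.598 | 0.624 |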


### Comparison of Our Results with the Paper’s Reported Results

The table below presents a side-by-side comparison of the performance metrics obtained from our implementation of classical classifiers on the PC1 dataset with the results reported in the paper *“A hybrid approach based on k-nearest neighbors and decision tree for software fault prediction.”*

| Classifier             | Accuracy (Ours) | Accuracy (Paper) | Precision (Ours) | Precision (Paper) | Recall (Ours) | Recall (Paper) | F1-score (Ours) | F1-score (Paper) |
|------------------------|-----------------|------------------|------------------|-------------------|---------------|----------------|-----------------|------------------|
| Decision Tree          | 0.852           | 0.912            | 0.266            | 0.583             | 0.205         | 0.512          | 0.207           | 0.545            |
| K-Nearest Neighbor     | 0.888           | 0.923            | 0.200            | 0.615             | 0.040         | 0.564          | 0.067           | 0.588            |
| Support Vector Machine | 0.900           | 0.924            | 0.000            | 0.635             | 0.000         | 0.573          | 0.000           | 0.602            |
| Random Forest          | 0.890           | 0.930            | 0.050            | 0.653             | 0.020         | 0.598          | 0.029           | 0.624            |
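
The size of each gap can be computed directly from the two sets of numbers. The sketch below rebuilds both tables with the classifier name as the index and subtracts them; the variable names `ours`, `paper`, and `gaps` are illustrative only.

```python
import pandas as pd

classifiers = ["Decision Tree", "K-Nearest Neighbor",
               "Support Vector Machine", "Random Forest"]

# Our cross-validated means, as reported in this notebook
ours = pd.DataFrame({
    "Accuracy": [0.851551, 0.887592, 0.899633, 0.889633],
    "Precision": [0.265952, 0.200000, 0.000000, 0.050000],
    "Recall": [0.205, 0.040, 0.000, 0.020],
    "F1-score": [0.207172, 0.066667, 0.000000, 0.028571],
}, index=classifiers)

# The paper's values, as entered manually in this notebook
paper = pd.DataFrame({
    "Accuracy": [0.912, 0.923, 0.924, 0.930],
    "Precision": [0.583, 0.615, 0.635, 0.653],
    "Recall": [0.512, 0.564, 0.573, 0.598],
    "F1-score": [0.545, 0.588, 0.602, 0.624],
}, index=classifiers)

# Positive values mean the paper's score is higher than ours
gaps = (paper - ours).round(3)
print(gaps)
```

Subtracting two frames with the same index and columns aligns them by label, so the row order does not matter.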

#### Analysis:

- Our accuracy scores are 0.02 to 0.06 below those reported in the paper. High accuracy here mostly reflects correct classification of the majority (non-faulty) class: the SVM reaches 0.900 accuracy while its precision and recall are both zero.
- Our precision, recall, and F1-scores are far lower than the paper's, especially for SVM and Random Forest. Our best recall is 0.205 (Decision Tree), against 0.512 or higher for every classifier in the paper, which shows that our models struggle to identify the minority (faulty) class.
- This discrepancy may be due to class imbalance in the dataset, differences in preprocessing steps, parameter tuning, or the use of additional techniques such as class weighting or resampling in the paper.
- To improve minority-class detection, future work could apply class balancing methods (e.g., SMOTE), tune hyperparameters, or implement the hybrid approach proposed in the paper.
- Overall, our results establish a baseline for the classical classifiers with default settings and highlight the potential benefit of the hybrid model for software fault prediction.
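
Two of the points above can be checked with scikit-learn alone: how much accuracy a majority-class predictor already achieves, and whether class weighting changes recall. The following is a rough sketch on synthetic data from `make_classification` (an assumption, not the PC1 data), so its numbers say nothing about PC1 itself.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in: roughly 90% negative, 10% positive
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

candidates = {
    # Always predicts the most frequent class
    "Majority baseline": DummyClassifier(strategy="most_frequent"),
    "SVM": make_pipeline(StandardScaler(), SVC(random_state=42)),
    # Errors on the rare class are weighted more heavily
    "SVM (balanced)": make_pipeline(
        StandardScaler(), SVC(class_weight="balanced", random_state=42)
    ),
}

summary = {}
for name, model in candidates.items():
    scores = cross_validate(model, X_demo, y_demo, cv=cv,
                            scoring=["accuracy", "recall"])
    summary[name] = {
        "accuracy": scores["test_accuracy"].mean(),
        "recall": scores["test_recall"].mean(),
    }

for name, row in summary.items():
    print(f"{name}: accuracy={row['accuracy']:.3f}, recall={row['recall']:.3f}")
```

The majority baseline never predicts the positive class, so its recall is zero by construction, which makes it a useful reference point when reading accuracy on imbalanced data.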



In [None]:
# 8. Save results to CSV 
results_df.to_csv('classification_results.csv')
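
When the saved file is read back, the classifier names come back as an unnamed first column unless the index is labelled. A minimal sketch of a round trip that keeps them as the index is shown below; `demo_results` holds two rows from the results table and the `Classifier` label is an assumption, not part of the saved file above.

```python
import os
import tempfile

import pandas as pd

# Two rows from the results table, used as a small example
demo_results = pd.DataFrame(
    {"Accuracy": [0.851551, 0.887592], "F1-score": [0.207172, 0.066667]},
    index=["Decision Tree", "K-Nearest Neighbors"],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "classification_results.csv")
    # Name the index column so it is not written with an empty header
    demo_results.to_csv(path, index_label="Classifier")
    # Restore the classifier names as the index when loading
    reloaded = pd.read_csv(path, index_col="Classifier")

print(reloaded)
```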

*These are my results, saved to the CSV file `classification_results.csv`.*
---