# Software Defects Analysis & Prediction

## Step 1: Load data

## Step 2: Data exploration

## Step 3: Data visualization

## Step 4: Defects prediction

### Model 1: Logistic Regression

#### The Core Idea

Logistic Regression models the **probability** that an instance belongs to class 1 using the **logistic (sigmoid) function**.

#### Mathematical Formulation

##### 1. Linear Combination

First, compute a linear combination of features:

```
z = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
z = β₀ + Σ(βᵢ * xᵢ)
```

Where:

- `x₁, x₂, ..., xₙ` are the features
- `β₀, β₁, ..., βₙ` are the coefficients (weights) to be learned
- `β₀` is the intercept (bias)

##### 2. Sigmoid Function

Transform the linear output to a probability using the **sigmoid function**:

```
P(y=1|x) = σ(z) = 1 / (1 + e^(-z))
```

Where:

- `P(y=1|x)` is the probability that y=1 given features x
- `e` is Euler's number (≈2.718)
- The output is always between 0 and 1

**Why sigmoid?** It maps any real number into the open interval (0, 1), so the output can be read as a probability:

```
z → -∞    ⟹  P → 0
z = 0     ⟹  P = 0.5
z → +∞    ⟹  P → 1
```
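The two steps above can be sketched in plain Python. The coefficient and feature values are made up for illustration, and the helper names `linear_combination` and `sigmoid` are hypothetical, not taken from any library.

```python
import math

def linear_combination(beta0, betas, x):
    """z = β₀ + Σ(βᵢ * xᵢ)"""
    return beta0 + sum(b * xi for b, xi in zip(betas, x))

def sigmoid(z):
    """Logistic function, written so that exp() never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical intercept, weights, and feature vector
beta0, betas = -1.0, [0.5, 0.25]
x = [2.0, 4.0]

z = linear_combination(beta0, betas, x)  # -1 + 0.5*2 + 0.25*4 = 1.0
p = sigmoid(z)                           # about 0.731
```

Splitting the sigmoid into two branches is a design choice for numerical safety: a very negative `z` would otherwise make `math.exp(-z)` overflow.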

##### 3. Decision Rule

```
Predict class 1 if P(y=1|x) ≥ 0.5
Predict class 0 if P(y=1|x) < 0.5
```

#### Loss Function: Log Loss (Cross-Entropy)

The model learns by minimizing the **log loss**:

```
L(β) = -1/m * Σ[yᵢ * log(P(yᵢ=1|xᵢ)) + (1-yᵢ) * log(1-P(yᵢ=1|xᵢ))]
```

Where:

- `m` is the number of training examples
- `yᵢ` is the actual label (0 or 1)
- `P(yᵢ=1|xᵢ)` is the predicted probability

**Intuition**:

- If actual y=1 and we predict P=0.9, loss is small (-log(0.9) ≈ 0.1)
- If actual y=1 and we predict P=0.1, loss is large (-log(0.1) ≈ 2.3)
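The intuition above can be reproduced with a minimal log-loss sketch. The `eps` clipping is an added assumption (it keeps `log(0)` from occurring) and is not part of the formula itself.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy over paired labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # assumption: clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

confident_right = log_loss([1], [0.9])  # about 0.105
confident_wrong = log_loss([1], [0.1])  # about 2.303
```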

#### Regularization

To prevent overfitting, add a penalty term:

**L2 Regularization (Ridge):**

```
L(β) = Log Loss + λ * Σ(βᵢ²)
```

**L1 Regularization (Lasso):**

```
L(β) = Log Loss + λ * Σ|βᵢ|
```

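Where `λ` controls regularization strength (larger λ = more regularization). The sum usually runs over `β₁, ..., βₙ` only, leaving the intercept `β₀` unpenalized. Note that scikit-learn's `LogisticRegression` exposes the inverse, `C = 1/λ`, so a smaller `C` means stronger regularization.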
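A small sketch of how the two penalties change the objective, assuming (as is common) that the intercept is left out of the penalty. The weights, `lam`, and the unregularized loss value are hypothetical numbers chosen for illustration.

```python
def l2_penalty(betas, lam):
    """Ridge penalty: λ * Σ(βᵢ²)"""
    return lam * sum(b * b for b in betas)

def l1_penalty(betas, lam):
    """Lasso penalty: λ * Σ|βᵢ|"""
    return lam * sum(abs(b) for b in betas)

betas = [0.5, -2.0, 0.0]  # hypothetical β₁..β₃ (intercept excluded)
lam = 0.1                 # hypothetical regularization strength
base_loss = 0.40          # hypothetical unregularized log loss

ridge_objective = base_loss + l2_penalty(betas, lam)  # 0.40 + 0.1 * 4.25 = 0.825
lasso_objective = base_loss + l1_penalty(betas, lam)  # 0.40 + 0.1 * 2.50 = 0.650
```

The large weight (-2.0) dominates the L2 penalty because it is squared, while L1 charges every unit of weight equally.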

### Model 2: Random Forest

#### The Core Idea

Build many **decision trees** on random subsets of data and features, then **combine** their predictions by voting or averaging.

#### Mathematical Formulation

##### 1. Single Decision Tree

A decision tree splits data recursively to maximize **information gain** or minimize **impurity**.

**Gini Impurity** (measure of randomness):

```
Gini(node) = 1 - Σ(pᵢ²)
```

Where:

- `pᵢ` is the proportion of class i samples in the node
- Perfect purity: Gini = 0 (all samples belong to the same class)
- Maximum impurity for two classes: Gini = 0.5 (a 50/50 mix); with K classes the maximum is 1 - 1/K

**Example:**

```
Node with 60 class-0 and 40 class-1 samples:
p₀ = 60/100 = 0.6
p₁ = 40/100 = 0.4
Gini = 1 - (0.6² + 0.4²) = 1 - 0.52 = 0.48
```
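The worked example can be checked with a few lines of Python. The function name `gini` and its count-based signature are illustrative choices.

```python
def gini(class_counts):
    """Gini impurity from per-class sample counts (node must be non-empty)."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

node = gini([60, 40])  # 1 - (0.36 + 0.16) = 0.48
pure = gini([100, 0])  # 0.0, all samples in one class
even = gini([50, 50])  # 0.5, the two-class maximum
```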

**Information Gain** when splitting:

```
IG = Gini(parent) - [ (n_left/n) * Gini(left_child) + (n_right/n) * Gini(right_child) ]
```

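The algorithm chooses the split that **maximizes this gain**, with each child's impurity weighted by its share of the parent's samples. Strictly, "information gain" refers to the entropy-based version; with Gini impurity the same quantity is often called Gini gain or impurity decrease.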
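As a sketch, the gain of one candidate split can be computed from class counts alone. The split below is hypothetical, and both children are assumed to be non-empty.

```python
def gini(class_counts):
    """Gini impurity from per-class sample counts."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gini_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    weighted = (n_left / n) * gini(left) + (n_right / n) * gini(right)
    return gini(parent) - weighted

# Hypothetical split of a 60/40 parent node into two children
parent = [60, 40]
useful_gain = gini_gain(parent, [50, 10], [10, 30])   # about 0.163
useless_gain = gini_gain(parent, [30, 20], [30, 20])  # about 0: same class mix as parent
```

A split whose children have the same class proportions as the parent gains nothing, which is why the tree prefers the first candidate.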

##### 2. Bootstrap Aggregating (Bagging)

Random Forest creates diversity through **bagging** plus random feature selection:

1. **Randomly sample** m instances from the training data with replacement (a bootstrap sample, where m is the training-set size)
2. Build a decision tree on this sample, **randomly selecting** k candidate features at each split (typically k = √n_features)
3. Repeat for n_trees
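The two sources of randomness can be sketched with the standard library alone. The dataset sizes and the seed are arbitrary assumptions, and only indices are drawn here (no tree is actually grown).

```python
import math
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

n_samples, n_features = 10, 9            # hypothetical dataset shape
k = max(1, int(math.sqrt(n_features)))   # k = √n_features = 3

# Bootstrap sample: draw n_samples row indices WITH replacement
bootstrap_rows = [rng.randrange(n_samples) for _ in range(n_samples)]

# Rows that were never drawn are "out-of-bag" for this tree
out_of_bag = sorted(set(range(n_samples)) - set(bootstrap_rows))

# Feature subset: k distinct feature indices, redrawn at every split
split_features = rng.sample(range(n_features), k)
```

Because rows are drawn with replacement, some appear several times and others not at all, so each tree sees a slightly different dataset.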

##### 3. Prediction by Voting

For binary classification:

```
P(y=1|x) = 1/N * Σ[tree_i predicts 1]
```

Where N is the number of trees.

**Final prediction:**

```
ŷ = 1 if P(y=1|x) ≥ 0.5, else 0
```
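A minimal sketch of the voting rule, using a made-up list of per-tree votes. The name `forest_predict` is hypothetical.

```python
def forest_predict(tree_votes, threshold=0.5):
    """Return (fraction of trees voting 1, final class label)."""
    p = sum(tree_votes) / len(tree_votes)
    return p, int(p >= threshold)

votes = [1, 0, 1, 1, 0, 1, 1]     # hypothetical votes from 7 trees
p, y_hat = forest_predict(votes)  # p = 5/7 ≈ 0.714, y_hat = 1
```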

#### Why Does It Work?

**Law of Large Numbers:** The average of many noisy predictions is more stable than any single prediction, provided the trees' errors are not strongly correlated.

**Bias-Variance Tradeoff:**

- Single deep tree: low bias, high variance (overfits)
- Random Forest: similarly low bias, lower variance (averaging decorrelated trees reduces variance)
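The variance-reduction effect can be illustrated with a small simulation. This is an idealized sketch: each "tree" is modeled as the true value plus independent Gaussian noise, whereas real trees are correlated, so the reduction in practice is smaller. All numbers are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def noisy_prediction(true_value=0.7, noise=0.2):
    """Stand-in for one tree: the right answer plus independent noise."""
    return random.gauss(true_value, noise)

n_trials, n_trees = 2000, 100

# Spread of a single "tree" versus the average of n_trees "trees"
single = [noisy_prediction() for _ in range(n_trials)]
averaged = [
    statistics.fmean(noisy_prediction() for _ in range(n_trees))
    for _ in range(n_trials)
]

var_single = statistics.pvariance(single)
var_avg = statistics.pvariance(averaged)  # much smaller than var_single
```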

### Model 3: Gradient Boosting

#### The Core Idea

Build trees **sequentially**, where each tree corrects the errors of the previous trees.

#### Mathematical Formulation

##### 1. Additive Model

The prediction is a **sum** of multiple weak learners:

```
F(x) = f₀(x) + η*f₁(x) + η*f₂(x) + ... + η*fₘ(x)
```

Where:

- `F(x)` is the final prediction
- `f₀(x)` is the initial constant prediction (the mean of y for regression, the log-odds of the positive class for binary classification)
- `fᵢ(x)` are decision trees (weak learners)
- `η` is the learning rate (0 < η ≤ 1)
- `m` is the number of trees

##### 2. Sequential Learning

At each iteration t (shown here for squared-error loss):

**Step 1: Calculate residuals (errors)**

```
rᵢ⁽ᵗ⁾ = yᵢ - F⁽ᵗ⁻¹⁾(xᵢ)
```

**Step 2: Fit a new tree to residuals**

```
fₜ(x) ← fit_tree(X, r⁽ᵗ⁾)
```

**Step 3: Update the model**

```
F⁽ᵗ⁾(x) = F⁽ᵗ⁻¹⁾(x) + η * fₜ(x)
```
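The three steps can be sketched end to end for squared-error regression on one feature. This is a toy illustration, not a library implementation: the weak learner is a depth-1 tree (`fit_stump`, a hypothetical helper), the data are made up, and the feature is assumed to have at least two distinct values.

```python
def fit_stump(xs, residuals):
    """Depth-1 regression tree: pick the threshold with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_trees=50, eta=0.1):
    f0 = sum(ys) / len(ys)            # initial prediction: mean of y
    trees = []
    preds = [f0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]          # Step 1
        tree = fit_stump(xs, residuals)                         # Step 2
        trees.append(tree)
        preds = [p + eta * tree(x) for p, x in zip(preds, xs)]  # Step 3
    return lambda x: f0 + eta * sum(tree(x) for tree in trees)

def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training data
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]

model = gradient_boost(xs, ys)
baseline = sum(ys) / len(ys)
mse_before = mse(lambda x: baseline, xs, ys)  # error of the constant f₀
mse_after = mse(model, xs, ys)                # error after 50 boosted stumps
```

Each stump is weak on its own, but because every round fits what the previous rounds got wrong, the training error shrinks steadily.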

##### 3. For Binary Classification

For classification, `F(x)` models the **log-odds** of the positive class rather than the class label itself:

```
F(x) = log(P(y=1|x) / P(y=0|x))
```

Convert to probability:

```
P(y=1|x) = 1 / (1 + e^(-F(x)))
```
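The two conversions are inverses of each other, which a short sketch can confirm. The function names are illustrative.

```python
import math

def log_odds(p):
    """F = log(P / (1 - P)), defined for 0 < P < 1."""
    return math.log(p / (1.0 - p))

def to_probability(f):
    """P = 1 / (1 + e^(-F))"""
    return 1.0 / (1.0 + math.exp(-f))

f = log_odds(0.8)      # log(0.8 / 0.2) = log(4) ≈ 1.386
p = to_probability(f)  # back to 0.8 (up to rounding)
```

A log-odds of 0 corresponds to a probability of exactly 0.5, so the sign of `F(x)` alone decides the predicted class at the default threshold.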

##### 4. Gradient Descent Intuition

Gradient Boosting minimizes loss by following the **negative gradient**:

```
fₜ(x) ≈ -∂L/∂F(x)
```

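Where L is the loss function (e.g., log loss) and the gradient is evaluated at the current model F⁽ᵗ⁻¹⁾. For squared error the negative gradient is the residual yᵢ - F⁽ᵗ⁻¹⁾(xᵢ); for log loss it is yᵢ - P(yᵢ=1|xᵢ).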

**Why "Gradient" Boosting?** Each new tree fits the **negative gradient** of the loss function.
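For log loss with `F(x)` as the log-odds, the negative gradient works out to the label minus the predicted probability. The sketch below checks that identity numerically with a finite difference; the values of `y`, `f`, and the step size `h` are arbitrary choices.

```python
import math

def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

def log_loss_single(y, f):
    """Log loss of one example, as a function of the log-odds f."""
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def negative_gradient(y, f):
    """Analytic -∂L/∂F for log loss: label minus predicted probability."""
    return y - sigmoid(f)

y, f, h = 1, 0.3, 1e-6

# Central finite difference approximation of -∂L/∂F
numeric = -(log_loss_single(y, f + h) - log_loss_single(y, f - h)) / (2 * h)
analytic = negative_gradient(y, f)  # should agree with `numeric`
```

This is why the targets each new tree fits in classification look like residuals: they are the gap between the true label and the current predicted probability.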

### Model 4: Support Vector Machine (SVM)

### Model 5: K-Nearest Neighbors (KNN)

### Model 6: Naive Bayes

### Model 7: Neural Network (MLPClassifier)