## **Machine Learning Algorithm**

## **Type**: Supervised Learning 

## **Regression + Classification**

## **Day 1**: Linear Regression + Logistic Regression

## **Student**: Muhammad Shafiq

-------------------------------------------

# **Topic 1: Linear Regression**

**Linear regression** is a supervised machine-learning algorithm that learns from labelled data and fits the linear function that best maps the input features to the output. The fitted function can then be used to make predictions on new data. It assumes a linear relationship between the input and output, meaning the output changes at a constant rate as the input changes. With a single input, this relationship is a straight line.

**For example**:

Suppose we want to predict a student's exam score based on how many hours they studied. We observe that as students study more hours, their scores go up. In this example:

- **Independent variable (input)**: Hours studied because it's the factor we control or observe.

- **Dependent variable (output)**: Exam score because it depends on how many hours were studied.

#### **Equation**:

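$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n, \qquad y = \hat{y} + \epsilon$$

where 

 - $\hat{y}$ = predicted value
 - $y$ = actual (observed) value
 - $\beta_0$ = intercept; $\beta_1, \dots, \beta_n$ = weights (learned from data)
 - $x_1, x_2, \dots, x_n$ = input features
 - $\epsilon$ = error/noise, the part of $y$ that the linear function does not explain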
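To make the equation concrete, here is a minimal sketch that evaluates the prediction for a single sample. The weights and feature values are made up for illustration and are not learned from any dataset.

```python
# Hypothetical weights, chosen only for illustration (not learned from data)
beta_0 = 2.0                 # intercept
betas = [0.5, -1.0, 3.0]     # one weight per input feature

# One sample with three input features
x = [4.0, 1.0, 2.0]

# Prediction: intercept plus the weighted sum of the features
y_hat = beta_0 + sum(b * xi for b, xi in zip(betas, x))
print(y_hat)  # 9.0
```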

### **Why Is Linear Regression Important?**

Linear regression matters for several reasons:

- **Simplicity and Interpretability**: It’s easy to understand and interpret, making it a starting point for learning about machine learning.

- **Predictive Ability**: Helps predict future outcomes based on past data, making it useful in various fields like finance, healthcare and marketing.

- **Basis for Other Models**: Many advanced algorithms, like logistic regression or neural networks, build on the concepts of linear regression.

- **Efficiency**: It’s computationally efficient and works well for problems with a linear relationship.

- **Widely Used**: It’s one of the most widely used techniques in both statistics and machine learning for regression tasks.

- **Analysis**: It provides insights into relationships between variables (e.g., how much one variable influences another).            

### **How Linear Regression Works**:

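Linear regression chooses the weights that minimize the mean squared error (MSE) between the predicted and actual outputs:

$$\min_{\beta} \; \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

Where 

  - $\hat{y}_i$ = predicted output for observation $i$
  - $y_i$ = actual output for observation $i$
  - $n$ = total number of observations
  - $\min$ = indicates that the goal is to minimize the error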
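The quantity being minimized can be computed by hand. The sketch below uses made-up values and only the standard library, so each step of the formula is visible.

```python
# Toy values, made up for illustration
actual = [3.0, 5.0, 7.0, 9.0]      # y
predicted = [2.5, 5.0, 8.0, 9.5]   # y-hat

n = len(actual)

# Mean squared error: the average of the squared differences
mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n
print(mse)  # 0.375
```

Training searches for the weights that make this number as small as possible on the training data.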


## **Best Fit Line in Linear Regression**

In linear regression, the **best-fit line** is the straight line that most accurately represents the relationship between the independent variable (input) and the dependent variable (output). It is the line that minimizes the difference between the actual data points and the predicted values from the model.


### 1. **Goal of the Best-Fit Line**

The goal of linear regression is to find a straight line that minimizes the error (the difference) between the observed data points and the predicted values. This line helps us predict the dependent variable for new, unseen data.

### 2. **Equation of the Best-Fit Line**

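For simple linear regression (with one independent variable), the best-fit line is represented by the equation:

$$y = mx + b$$

This is the general equation above with a single feature, where $m = \beta_1$ and $b = \beta_0$.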

Where:

- y is the predicted value (dependent variable)
- x is the input (independent variable)
- m is the slope of the line (how much y changes when x changes)
- b is the intercept (the value of y when x = 0)

The best-fit line will be the one that optimizes the values of m (slope) and b (intercept) so that the predicted y values are as close as possible to the actual data points.
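For a single input, the least-squares slope and intercept have a closed form: `m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²` and `b = ȳ - m·x̄`. The sketch below applies it to a made-up hours-studied data set; the numbers are purely illustrative.

```python
# Made-up data: hours studied and exam scores
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

mean_x = sum(hours) / len(hours)
mean_y = sum(scores) / len(scores)

# Slope: covariance of x and y divided by the variance of x
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
sxx = sum((x - mean_x) ** 2 for x in hours)
m = sxy / sxx

# Intercept: the line passes through the point of means
b = mean_y - m * mean_x

print(round(m, 2), round(b, 2))  # 4.1 47.7
```

Under these made-up numbers, each extra hour of study adds about 4.1 points to the predicted score.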

## **Implementation + Detail Explanation**

#### **Import libraries**

In [16]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt

## **Code Breakdown**

### 1. **sklearn.linear_model**

This module contains algorithms that assume a linear relationship between input features and output.

**Popular Classes:**

- **LinearRegression** → For regression tasks

- **LogisticRegression** → For binary/multiclass classification

- **Ridge, Lasso, ElasticNet** → Regularized regression variants

- **SGDClassifier, SGDRegressor** → Linear models trained with stochastic gradient descent


### 2. **sklearn.model_selection**

This module provides functions for:

- Splitting datasets

- Cross-validation

- Hyperparameter tuning (e.g., GridSearchCV, RandomizedSearchCV)


**Popular Functions/Classes:**

- **train_test_split()** → Splits dataset into train/test

- **KFold, StratifiedKFold** → For cross-validation

- **GridSearchCV, RandomizedSearchCV** → For tuning model parameters


### **3. sklearn.datasets**

This module provides preloaded toy datasets for learning and testing models.

**Popular Functions:**

- **load_diabetes()** → Regression dataset for predicting disease progression

- **load_iris()** → Classification dataset with 3 flower types

- **load_boston()** → House price regression (deprecated in scikit-learn 1.0 and removed in 1.2)

- **load_digits(), load_wine(), load_breast_cancer()** → Others

## **Other Useful Modules**:


| Module                  | Purpose                                                |
| ----------------------- | ------------------------------------------------------ |
| `sklearn.preprocessing` | Scaling, encoding, imputing missing values             |
| `sklearn.metrics`       | Classification metrics (accuracy, precision, recall, F1, ROC, confusion matrix) and regression metrics (MAE, MSE, R²) |
| `sklearn.pipeline`      | Create ML pipelines to chain preprocessing and models  |
| `sklearn.ensemble`      | Ensemble models like Random Forest, Gradient Boosting  |
| `sklearn.svm`           | Support Vector Machines                                |
| `sklearn.tree`          | Decision Trees                                         |
| `sklearn.neighbors`     | KNN for classification and regression                  |
| `sklearn.naive_bayes`   | Naive Bayes algorithms                                 |


### **Load DataSet**

In [6]:
diabetes_data = load_diabetes()

# Separate the features (X) from the target (y)
X = diabetes_data.data    # all input features
y = diabetes_data.target  # output (target values)

print(f"Input feature {X}")
print(f"Output {y}")



Input feature [[ 0.03807591  0.05068012  0.06169621 ... -0.00259226  0.01990749
  -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 ... -0.03949338 -0.06833155
  -0.09220405]
 [ 0.08529891  0.05068012  0.04445121 ... -0.00259226  0.00286131
  -0.02593034]
 ...
 [ 0.04170844  0.05068012 -0.01590626 ... -0.01107952 -0.04688253
   0.01549073]
 [-0.04547248 -0.04464164  0.03906215 ...  0.02655962  0.04452873
  -0.02593034]
 [-0.04547248 -0.04464164 -0.0730303  ... -0.03949338 -0.00422151
   0.00306441]]
Output [151.  75. 141. 206. 135.  97. 138.  63. 110. 310. 101.  69. 179. 185.
 118. 171. 166. 144.  97. 168.  68.  49.  68. 245. 184. 202. 137.  85.
 131. 283. 129.  59. 341.  87.  65. 102. 265. 276. 252.  90. 100.  55.
  61.  92. 259.  53. 190. 142.  75. 142. 155. 225.  59. 104. 182. 128.
  52.  37. 170. 170.  61. 144.  52. 128.  71. 163. 150.  97. 160. 178.
  48. 270. 202. 111.  85.  42. 170. 200. 252. 113. 143.  51.  52. 210.
  65. 141.  55. 134.  42. 111.  98. 164.  48.  96.  90. 162. 

## **Code Breakdown**

- `load_diabetes()` returns a **Bunch object**, similar to a dictionary.

- `data` → contains the feature matrix (shape: `(442, 10)`)

- `target` → contains target values (disease progression)

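### What is a Bunch object?

A `Bunch` is a dictionary subclass whose keys can also be read as attributes with dot notation, so `diabetes_data.data` and `diabetes_data["data"]` return the same array.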


In [7]:
print(diabetes_data.keys())

dict_keys(['data', 'target', 'frame', 'DESCR', 'feature_names', 'data_filename', 'target_filename', 'data_module'])


### **All Ways to load and extract X, y**

| Method                                                     | Description                 |
| ---------------------------------------------------------- | --------------------------- |
| `X, y = load_diabetes(return_X_y=True)`                    | Quick one-liner             |
| `bunch = load_diabetes(); X, y = bunch.data, bunch.target` | More flexible               |
| `bunch["data"]`, `bunch["target"]`                         | Dictionary-style access     |
| `pd.DataFrame(bunch.data)`                                 | Convert to pandas DataFrame |
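As a quick check, the sketch below loads the data with the first three methods from the table and confirms they return the same arrays. The variable names (`X1`, `X2`, `X3`, and so on) are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Method 1: one-liner that returns the arrays directly
X1, y1 = load_diabetes(return_X_y=True)

# Method 2: Bunch with attribute (dot) access
bunch = load_diabetes()
X2, y2 = bunch.data, bunch.target

# Method 3: Bunch with dictionary-style access
X3, y3 = bunch["data"], bunch["target"]

print(X1.shape, y1.shape)
print(np.array_equal(X1, X2), np.array_equal(X2, X3))
```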


### **Load as Pandas DataFrame (for EDA)**



In [12]:
import pandas as pd


# Convert to DataFrame
df = pd.DataFrame(diabetes_data.data, columns=diabetes_data.feature_names)
df["target"] = diabetes_data.target  # Add target column

print(df.head(5))


        age       sex       bmi        bp        s1        s2        s3  \
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356   
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   

         s4        s5        s6  target  
0 -0.002592  0.019907 -0.017646   151.0  
1 -0.039493 -0.068332 -0.092204    75.0  
2 -0.002592  0.002861 -0.025930   141.0  
3  0.034309  0.022688 -0.009362   206.0  
4 -0.002592 -0.031988 -0.046641   135.0  


### **Splitting the data:**

In [13]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

### **Code Breakdown**

| Variable        | Description                                    |
| --------------- | ---------------------------------------------- |
| `X`             | Feature matrix (input) — full dataset          |
| `y`             | Target vector (output) — full dataset          |
| `test_size=0.2` | 20% of data goes to test set, 80% to train set |
| `X_train`       | Features for training (80% of the data here)   |
| `X_test`        | Features for testing (20% of the data here)    |
| `y_train`       | Labels for training                            |
| `y_test`        | Labels for testing                             |


### **Parameters of `train_test_split()`**

In [None]:
"""train_test_split(*arrays, 
                 test_size=None, 
                 train_size=None, 
                 random_state=None, 
                 shuffle=True, 
                 stratify=None)"""


### 1. `test_size` (float or int)
- Proportion (float like 0.2) → 20% test data

- Count (int like 100) → 100 samples in test set

❗ If `train_size` is also given, the two together must not exceed the whole dataset. If both are left as `None`, `test_size` defaults to 0.25.

### 2. `train_size` (optional)
- Accepts a float (proportion) or an int (sample count), just like `test_size`

- Not mandatory: if omitted, it is set to the complement of `test_size`, so `test_size=0.2` already implies `train_size=0.8`

### 3. `random_state` (int, optional)
- Sets the random seed so split is reproducible

- Use any int, like random_state=42 (common practice)

### 4. `shuffle` (bool, default=True)
- Shuffles the data before splitting

- Set shuffle=False for time series or ordered data

### 5. `stratify` (array-like, default=None)
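- Pass the class labels (e.g. `stratify=y`) to keep the same proportion of classes in train and test

- Useful for imbalanced classification; it is meant for class labels, not for continuous regression targets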
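The effect of `random_state` and `stratify` can be seen in a small sketch. It uses the breast cancer dataset because `stratify` needs class labels; the seed value 42 and the variable names are arbitrary choices made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# A classification dataset, because stratify needs class labels
X_clf, y_clf = load_breast_cancer(return_X_y=True)

def split():
    # random_state fixes the shuffle; stratify keeps class proportions
    return train_test_split(
        X_clf, y_clf, test_size=0.2, random_state=42, stratify=y_clf
    )

Xa_train, Xa_test, ya_train, ya_test = split()
Xb_train, Xb_test, yb_train, yb_test = split()

print(Xa_train.shape, Xa_test.shape)                  # sizes of the 80/20 split
print((Xa_test == Xb_test).all())                     # same seed, same rows
print(y_clf.mean(), ya_train.mean(), ya_test.mean())  # share of class 1
```

Without `random_state`, each run produces a different split, so scores change slightly from run to run.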

## **Train Model:**

In [15]:
# Train model
model = LinearRegression()
model.fit(X_train, y_train)


# Predict
y_pred = model.predict(X_test)

# Score
print("R2 Score:", model.score(X_test, y_test))

R2 Score: 0.43550586496261223


## **Code Breakdown**

### 1. `model = LinearRegression()`

Creates an instance of the LinearRegression model.

This step:

- **Initializes the model**

- Sets up internal parameters (e.g. fit_intercept=True by default)

- Model is not trained yet

### 2. `model.fit(X_train, y_train)`

This is where the training happens.

It:

- Computes the coefficients (β values) that minimize the squared error (ordinary least squares)

- Learns the mapping between X_train (input) and y_train (output)

- Stores those learned weights inside the model for prediction



## **Important Functions & Attributes of ML Models (e.g., `LinearRegression`)**

| Type         | Name                   | Description                               |
| ------------ | ---------------------- | ----------------------------------------- |
|  Function  | `fit(X, y)`            | Train the model                           |
|  Function  | `predict(X)`           | Predict output for new data               |
|  Function  | `score(X, y)`          | Return R² score (for regression)          |
|  Function  | `get_params()`         | Get model hyperparameters                 |
|  Function  | `set_params(**kwargs)` | Set model hyperparameters manually        |
|  Attribute | `coef_`                | Array of learned weights (slopes)         |
|  Attribute | `intercept_`           | Learned intercept (β₀)                    |
|  Attribute | `n_iter_` *(for SGD)*  | Number of iterations used (if applicable) |
|  Attribute | `rank_`, `singular_`   | Info about the X matrix (for analysis)    |
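The sketch below shows how `coef_` and `intercept_` relate to `predict()`. It fits on the full diabetes dataset purely for illustration (no train/test split), and `demo_model` is a name chosen here so it does not clash with the `model` variable above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

data = load_diabetes()
demo_model = LinearRegression().fit(data.data, data.target)

# One learned weight per feature, plus the intercept
for name, weight in zip(data.feature_names, demo_model.coef_):
    print(f"{name:>4}: {weight:9.2f}")
print("intercept:", demo_model.intercept_)

# predict() is the intercept plus the weighted sum of the features
manual = demo_model.intercept_ + data.data @ demo_model.coef_
print(np.allclose(manual, demo_model.predict(data.data)))
```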



In [17]:

print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("R2 Score:", r2_score(y_test, y_pred))


MAE: 48.677449593771186
MSE: 3425.115662804317
R2 Score: 0.43550586496261223
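To see what the three metrics measure, the sketch below recomputes them from their definitions on made-up values and compares the results with scikit-learn. The arrays `actual` and `predicted` are illustrative and unrelated to the diabetes results above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values, made up for illustration
actual = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 18.0, 33.0, 37.0])

residuals = actual - predicted
mae = np.mean(np.abs(residuals))                # average absolute error
mse = np.mean(residuals ** 2)                   # average squared error
ss_res = np.sum(residuals ** 2)                 # variation left unexplained
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation in the target
r2 = 1 - ss_res / ss_tot                        # share of variation explained

print(mae, mse, r2)
print(np.isclose(mae, mean_absolute_error(actual, predicted)),
      np.isclose(mse, mean_squared_error(actual, predicted)),
      np.isclose(r2, r2_score(actual, predicted)))
```

MAE is in the same units as the target, MSE is in squared units, and R² is unitless, with 1.0 meaning a perfect fit.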


--------------------------

# **Topic 2: Logistic Regression** (Classification)

`Logistic Regression` is a supervised machine learning algorithm used for classification problems. Unlike linear regression, which predicts continuous values, it predicts the probability that an input belongs to a specific class. It is most often used for binary classification, where the output is one of two possible categories such as Yes/No, True/False or 0/1. It uses the sigmoid function to convert a linear combination of the inputs into a probability value between 0 and 1.

#### **Types of Logistic Regression**

Logistic regression can be classified into three main types based on the nature of the dependent variable:

- **Binomial Logistic Regression**: This type is used when the dependent variable has only two possible categories. Examples include Yes/No, Pass/Fail or 0/1. It is the most common form of logistic regression and is used for binary classification problems.

- **Multinomial Logistic Regression**: This is used when the dependent variable has three or more possible categories that are not ordered. For example, classifying animals into categories like "cat," "dog" or "sheep." It extends the binary logistic regression to handle multiple classes.

- **Ordinal Logistic Regression**: This type applies when the dependent variable has three or more categories with a natural order or ranking. Examples include ratings like "low," "medium" and "high." It takes the order of the categories into account when modeling.

##  **Assumptions of Logistic Regression**

Understanding the assumptions behind logistic regression is important to ensure the model is applied correctly. The main assumptions are:

- **Independent observations**: Each data point is assumed to be independent of the others, meaning there should be no correlation or dependence between the input samples.

- **Binary dependent variable**: Standard (binomial) logistic regression assumes that the dependent variable is binary, meaning it can take only two values. For more than two categories, the softmax function is used instead of the sigmoid.

- **Linear relationship between independent variables and log odds**: The model assumes a linear relationship between the independent variables and the log odds of the dependent variable, which means the predictors affect the log odds in a linear way.

- **No outliers**: The dataset should not contain extreme outliers as they can distort the estimation of the logistic regression coefficients.

- **Large sample size**: It requires a sufficiently large sample size to produce reliable and stable results.
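Written out, the log-odds assumption says the following. The symbols $p$ (probability of class 1) and $z$ (the linear score) are introduced here for illustration, and the $\beta$ notation is borrowed from the linear regression equation in Topic 1.

$$
z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n, \qquad
p = \frac{1}{1 + e^{-z}} \quad\Longrightarrow\quad \log\frac{p}{1-p} = z
$$

The right-hand side follows because $\frac{p}{1-p} = e^{z}$, so the log odds are exactly the linear score.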

## **Understanding Sigmoid Function**

 1. The sigmoid function is an important part of logistic regression. It converts the raw output of the model into a probability value between 0 and 1.

 2. This function takes any real number and maps it into the range 0 to 1, forming an "S"-shaped curve called the sigmoid curve or logistic curve. Because probabilities must lie between 0 and 1, the sigmoid function is well suited to this purpose.

 3. In logistic regression, we use a threshold value, usually 0.5, to decide the class label.

 - If the sigmoid output is equal to or above the threshold, the input is classified as Class 1.
 - If it is below the threshold, the input is classified as Class 0.
 - This approach helps to transform continuous input values into meaningful class predictions.
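The sigmoid and the threshold rule can be sketched in a few lines of plain Python. The helper names `sigmoid` and `classify` are chosen here for illustration and are not part of scikit-learn.

```python
import math

def sigmoid(z):
    """Map a real number to the open interval (0, 1)."""
    # Fine for moderate z; very large negative z would overflow math.exp
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Return 1 if the sigmoid output reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

A raw score of 0 maps to a probability of exactly 0.5, so with the usual threshold the sign of the raw score decides the class.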


**Read From Here For Full Details of Logistic Regression**

[Machine Learning Logistic Regression](https://www.geeksforgeeks.org/machine-learning/ml-linear-regression/)


## **Use Cases**:
- Spam detection

- Customer churn prediction

- Disease diagnosis (yes/no)

- Fraud detection

## **Strengths**:
- Simple and fast

- Outputs probabilities

- Good baseline classifier

## **Limitations**:
- Assumes linear boundary between classes

- Struggles with non-linear relationships

- Needs regularization to cope well with high-dimensional or sparse data



## **Implementation**


### **Import libraries**

In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

### **Load Dataset**

In [20]:
# Load dataset
X, y = load_breast_cancer(return_X_y=True)

### **Splitting the data**

In [None]:

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


### **Train and predict from model**

In [23]:

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.96      0.93      0.95        46
           1       0.96      0.97      0.96        68

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114



**Note:** Training also printed a convergence warning (`STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT`). The solver reached the default limit of `max_iter=100` before it converged. The warning suggests increasing the number of iterations or scaling the data as shown in:

    https://scikit-learn.org/stable/modules/preprocessing.html

Alternative solver options are described in the documentation:

    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
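One way to act on that advice is sketched below: standardize the features and raise `max_iter`. The values `max_iter=1000` and `random_state=42`, and the variable names, are assumptions made for this illustration, so the resulting score will not match the report above exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=42, stratify=y_bc
)

# Scale features to zero mean and unit variance, then fit the classifier
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_clf.fit(Xtr, ytr)

print("Test accuracy:", scaled_clf.score(Xte, yte))
print("Iterations used:", scaled_clf[-1].n_iter_[0])
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so no information from the test set leaks into training.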
