
# LINEAR REGRESSION

**What:**  
A supervised regression algorithm that models the relationship between features and a continuous target using a straight line (or hyperplane).

**Why:**  
Simple, interpretable baseline for numeric prediction.

**Where (Applications):**  
- Housing price prediction  
- Sales forecasting  
- Demand estimation  
- Stock trend approximation (baseline)

**When (Ideal Conditions):**  
- Relationship between input and output is linear  
- Few outliers  
- Low multicollinearity  

**How (Mechanism):**  
Minimizes the sum of squared errors (ordinary least squares, OLS) to learn the coefficients.  
Solved either in closed form (normal equations) or iteratively with gradient descent.
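
A minimal sketch of the closed-form route, assuming a tiny made-up dataset (the arrays and variable names below are illustrative and not part of the notebook). It solves the least-squares problem directly with NumPy instead of iterating:

```python
import numpy as np

# Toy data generated from y = 1 + 2*x (illustrative values, no noise).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Design matrix with a leading column of ones for the intercept.
A = np.column_stack([np.ones_like(x), x])

# Least-squares solution of A @ beta ≈ y. This is equivalent to the normal
# equations but avoids inverting A.T @ A by hand.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = beta

print("intercept:", intercept, "slope:", slope)
```

Because the toy data is noise-free, the recovered intercept and slope should match the generating values up to floating-point error.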

**Validation Metrics:**  
- MSE  
- RMSE  
- MAE  
- R² Score  
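
To make the four metrics concrete, here is a small sketch that computes them by hand. The `y_true` and `y_pred` values are invented for illustration and are not outputs of the model in this notebook:

```python
import math

# Illustrative values only.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mse = sum(e ** 2 for e in errors) / n    # mean squared error
rmse = math.sqrt(mse)                    # same units as the target
mae = sum(abs(e) for e in errors) / n    # mean absolute error

# R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true.
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse, "RMSE:", rmse, "MAE:", mae, "R2:", r2)
```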

---


In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state=1)


model=LinearRegression()
model.fit(X_train, y_train)


preds = model.predict(X_test)


mse = mean_squared_error(y_test, preds)
print("MSE: ", mse)

import matplotlib.pyplot as plt
plt.scatter(y_test, preds, alpha = 0.5)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title("Actual vs Predicted")
plt.show()

residuals = y_test - preds
plt.scatter(preds, residuals, alpha = 0.5)
plt.axhline(0, color = 'red')
plt.xlabel("Predicted Prices")
plt.ylabel("Residuals")
plt.title("Residuals Plot")
plt.show()

# LOGISTIC REGRESSION

**What:**  
A supervised classification algorithm that predicts probabilities using the sigmoid function.

**Why:**  
Interpretable and fast; works well when the classes are approximately linearly separable.

**Where:**  
- Spam vs not spam  
- Disease detection  
- Fraud detection  
- Customer churn prediction

**When:**  
- Binary or multi-class problems  
- Need probability outputs  
- Dataset is medium-sized and linear-ish

**How:**  
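Models the log-odds as a linear function of the features; the sigmoid maps them to a probability → coefficients are fit by maximum likelihood (minimizing log loss) with an iterative optimizer such as gradient descent.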
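
A short sketch of the relationship between the linear score, the sigmoid, and the log loss. The weights `w`, bias `b`, and input `x` are hypothetical numbers chosen for illustration:

```python
import math

def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Inverse of the sigmoid: probability -> log-odds (logit)."""
    return math.log(p / (1.0 - p))

# Hypothetical weights, bias, and one input for a two-feature model.
w = [0.8, -0.5]
b = 0.1
x = [2.0, 1.0]

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # linear score = log-odds
p = sigmoid(z)                                 # predicted P(y = 1 | x)

# Log loss (negative log-likelihood) for this example if its true label is 1.
loss = -math.log(p)

print("log-odds:", z, "probability:", p, "log loss:", loss)
```

Applying `log_odds` to `p` returns the original score `z`, which is why the model is said to be linear in the log-odds.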

**Validation Metrics:**  
- Accuracy  
- Precision  
- Recall  
- F1 Score  
- ROC–AUC  
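
As a sketch of how the first four metrics follow from confusion-matrix counts (ROC–AUC needs predicted scores and is left out here), using invented binary labels:

```python
# Illustrative binary labels (1 = positive class); not taken from the notebook.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```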

---

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8, random_state=2)


model = LogisticRegression(max_iter=100)
model.fit(X_train, y_train)


preds = model.predict(X_test)

acc = accuracy_score(y_test, preds)
print("Accuracy: ",  acc)



from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

cm = confusion_matrix(y_test, preds)

sns.heatmap(cm, annot=True, cmap="Blues", fmt='d')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix (Logistic Regression)")
plt.show()

# K-MEANS CLUSTERING

**What:**  
An unsupervised clustering algorithm that groups data into K clusters by minimizing the within-cluster sum of squared distances to each cluster's centroid.

**Why:**  
Simple, scalable, widely used for segmentation.

**Where:**  
- Customer segmentation  
- Image compression  
- Market basket analysis  
- Anomaly grouping

**When:**  
- Spherical clusters  
- Medium-to-large datasets  
- Unlabeled data

**How:**  
Initialize centroids → assign points → update centroids → repeat until convergence.
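
The loop above can be sketched in a few lines of NumPy. This is a plain version of Lloyd's algorithm with random initialization from the data points. The helper name `kmeans` and the toy data are assumptions for illustration, and scikit-learn's `KMeans` uses a smarter default initialization than this:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy blobs (illustrative data).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
centroids, labels = kmeans(X, k=2)
print(labels)
```

Which blob receives label 0 depends on the random initialization, so only the grouping is meaningful, not the label values.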

**Validation Metrics:**  
- Inertia (Within-Cluster Sum of Squares)  
- Silhouette Score  

---

In [None]:
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

data = load_iris()

X = data.data

kmeans = KMeans(n_clusters = 3, random_state = 1)
kmeans.fit(X)

labels = kmeans.labels_

plt.scatter(X[:, 2], X[:, 3], c =labels)
plt.show()


# SUPPORT VECTOR MACHINE (SVM)

**What:**  
A supervised algorithm that finds the best separating hyperplane with maximum margin between classes.

**Why:**  
Effective for small/medium datasets & high-dimensional spaces.

**Where:**  
- Text classification  
- Image classification  
- Bioinformatics  
- Handwritten digit recognition

**When:**  
- Datasets with clear class margins  
- Small or medium-sized datasets  
- High dimensionality (e.g., text)  

**How:**  
Maximizes the margin; the soft-margin form minimizes hinge loss plus a regularization term.  
Uses the kernel trick for non-linear separation.
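
A small sketch of the hinge loss for a linear separator, assuming labels in {-1, +1}. The weights, bias, and points are hypothetical, and the regularization term is left out. Note that scikit-learn's `SVC` solves the dual problem rather than minimizing this expression directly:

```python
def hinge_loss(w, b, X, y):
    """Average hinge loss max(0, 1 - y * (w.x + b)) with labels y in {-1, +1}."""
    total = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        total += max(0.0, 1.0 - yi * score)
    return total / len(X)

# Hypothetical linear separator and toy points.
w, b = [1.0, 1.0], -3.0
X = [[3.0, 2.0], [0.0, 1.0], [2.0, 1.5], [1.0, 1.0]]
y = [+1, -1, +1, -1]

avg_loss = hinge_loss(w, b, X, y)
print(avg_loss)
```

Only the third point contributes: it is on the correct side of the boundary but inside the margin, so it is still penalized.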

**Validation Metrics:**  
- Accuracy  
- Precision  
- Recall  
- F1 Score  
- ROC–AUC  

---


In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score


data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size= 0.8, random_state = 1)

model = SVC()
model.fit(X_train, y_train)

preds = model.predict(X_test)
print("Accuracy :", accuracy_score(y_test, preds))


# DECISION TREE

**What:**  
A supervised model that splits data based on feature values to make decisions in a tree-like structure.

**Why:**  
Interpretable, handles non-linearity, no scaling needed.

**Where:**  
- Credit scoring  
- Risk analysis  
- Medical diagnosis  
- Loan approval

**When:**  
- Mixed data types (categorical + numeric)  
- Interpretability required  
- Non-linear decision boundaries

**How:**  
Splits using Gini impurity or entropy → maximize information gain.
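
A sketch of how a candidate split is scored, with invented label lists. The helper names (`gini`, `entropy`, `information_gain`) are illustrative and not a library API:

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

# Illustrative split of 10 samples into two children of 5 samples each.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print("Gini(parent):", gini(parent))
print("Entropy(parent):", entropy(parent))
print("Gain (entropy):", information_gain(parent, left, right))
print("Gain (Gini):", information_gain(parent, left, right, impurity=gini))
```

A tree builder would evaluate many candidate thresholds this way and keep the split with the largest gain.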

**Validation Metrics:**  
- Accuracy  
- Precision/Recall/F1  
- MAE/MSE (regression trees)  

---

In [None]:
from sklearn.datasets  import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 1)


model = DecisionTreeClassifier()
model.fit(X_train, y_train)

preds = model.predict(X_test)

print("Accuracy :", accuracy_score(y_test, preds))


# RANDOM FOREST

**What:**  
Ensemble of decision trees trained on bootstrapped datasets with feature randomness.

**Why:**  
Reduces overfitting, improves accuracy vs a single tree.

**Where:**  
- Credit scoring  
- HR attrition  
- Sales prediction  
- Medical diagnosis  
- Tabular Kaggle competitions

**When:**  
- Need robust general-purpose model  
- Avoid overfitting of single trees  
- Large feature space  

**How:**  
Bagging (data sampling) + random features → aggregate predictions (majority vote/mean).
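
The three ingredients can be sketched separately without building real trees. The helper names and the votes below are hypothetical, and in the usual formulation the feature subset is redrawn at every split rather than once per tree:

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bagging' step)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(predictions):
    """Aggregate per-tree class predictions into one label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))                 # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)   # some rows repeat, some are left out
features = random_feature_subset(n_features=4, k=2, rng=rng)

# Hypothetical votes from five trees for a single test point.
votes = ["setosa", "virginica", "setosa", "setosa", "versicolor"]
print(sample, features, majority_vote(votes))
```

For regression, the aggregation step would average the per-tree predictions instead of voting.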

**Validation Metrics:**  
- Accuracy (classification)  
- Precision/Recall/F1  
- ROC–AUC  
- MSE/RMSE (regression)

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8, random_state = 23)

model = RandomForestClassifier()
model.fit(X_train, y_train)

preds = model.predict(X_test)

print("Accuracy: ", accuracy_score(y_test, preds))

## K-Nearest Neighbors (KNN)

**What:**  
An instance-based ML algorithm that predicts by using the closest K points.

**Why:**  
No explicit training phase (the model just stores the data; the cost moves to prediction time), simple, handles non-linear decision boundaries.

**Where:**  
Recommendation, pattern recognition, medical baseline, image classification.

**When:**  
Small datasets, local patterns matter, no assumed functional form for the decision boundary.

**How:**  
Calculate distance → pick K neighbors → voting (classification) or averaging (regression).
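
A minimal sketch of the classification case in plain Python, using Euclidean distance. The function name `knn_predict` and the toy points are assumptions for illustration:

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every stored training point.
    dists = [(math.dist(x, query), label) for x, label in zip(X_train, y_train)]
    # Keep the k closest points and vote over their labels.
    nearest = sorted(dists, key=lambda d: d[0])[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Two small, well-separated groups (illustrative data).
X_train = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
           [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]]
y_train = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(X_train, y_train, [1.1, 1.0]))
print(knn_predict(X_train, y_train, [3.9, 4.1]))
```

Every prediction scans the whole training set, which is why KNN gets slow as the dataset grows.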

**Validation Metrics:**  
Accuracy, Precision/Recall/F1, MSE (regression)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)
print("Accuracy: ", accuracy_score(y_test, knn.predict(X_test)))



Accuracy:  0.9333333333333333


##  Naive Bayes (NB)

**What:**  
A probabilistic classifier based on Bayes' theorem that assumes the features are conditionally independent given the class.

**Why:**  
Extremely fast, great on text & sparse data.

**Where:**  
Spam detection, sentiment analysis, document tagging.

**When:**  
High dimensional text, low compute, limited data.

**How:**  
Computes the posterior P(class | features) ∝ P(class) · ∏ P(featureᵢ | class), where the product form follows from the independence assumption.
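
A sketch of the posterior computation for the Gaussian variant. The priors, means, and variances in `params` are made-up numbers standing in for values that would normally be estimated from training data:

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical per-class priors and per-feature (mean, variance) estimates.
params = {
    "A": {"prior": 0.5, "stats": [(1.0, 0.5), (2.0, 0.5)]},
    "B": {"prior": 0.5, "stats": [(3.0, 0.5), (4.0, 0.5)]},
}

def posterior(x):
    """Return P(class | x) for every class under the independence assumption."""
    scores = {}
    for cls, p in params.items():
        likelihood = 1.0
        for xi, (mean, var) in zip(x, p["stats"]):
            likelihood *= gaussian_pdf(xi, mean, var)  # product over features
        scores[cls] = p["prior"] * likelihood
    total = sum(scores.values())  # normalizing constant P(features)
    return {cls: s / total for cls, s in scores.items()}

post = posterior([1.2, 2.1])
print(post)
```

Practical implementations usually sum log-probabilities instead of multiplying densities, to avoid underflow when there are many features.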

**Validation Metrics:**  
Accuracy, Precision, Recall, F1, ROC–AUC

---

In [None]:
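from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load and split the data here so the cell does not depend on variables
# left over from an earlier cell.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

nb = GaussianNB()
nb.fit(X_train, y_train)

print('Accuracy: ', accuracy_score(y_test, nb.predict(X_test)))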

Accuracy:  0.9333333333333333


##  Principal Component Analysis (PCA)

**What:**  
Dimensionality reduction by projecting data onto the directions of maximum variance.

**Why:**  
Reduce redundancy, compression, visualization, speed-up ML.

**Where:**  
Preprocessing, compression, noise reduction, 2D/3D visualization.

**When:**  
High dimensional data, multicollinearity, visualization required.

**How:**  
Center the data → covariance matrix → eigendecomposition → keep the top k eigenvectors (components).
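
These steps can be sketched directly in NumPy. The function name `pca` and the toy data are assumptions for illustration, and scikit-learn's `PCA` reaches the same result through an SVD rather than an explicit covariance matrix:

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal directions."""
    # Center the data so variance is measured around the mean.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k]          # indices of the k largest
    components = eigvecs[:, order]                 # columns = top-k directions
    explained_ratio = eigvals[order] / eigvals.sum()
    return Xc @ components, explained_ratio

# Toy 2-D data lying close to the line y = x (illustrative values).
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]])
X_proj, ratio = pca(X, k=1)
print(X_proj.shape, ratio)
```

Because the points lie almost on a line, a single component captures nearly all of the variance.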

**Validation Metrics:**  
Explained variance ratio, reconstruction error (optional)

---

In [None]:
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

data = load_iris()
X= data.data

pca = PCA(n_components = 2)

X_pca = pca.fit_transform(X)

print("PCA Variance Ratio: ", pca.explained_variance_ratio_)


PCA Variance Ratio:  [0.92461872 0.05306648]


## Gradient Boosting Family (XGB / LightGBM / CatBoost)

**What:**  
Ensemble method where new trees correct previous errors sequentially.

**Why:**  
State-of-the-art performance on tabular data.

**Where:**  
Kaggle, credit scoring, forecasting, churn modeling.

**When:**  
Large data, complex relationships, need max accuracy.

**How:**  
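Gradient descent in function space: each new weak learner (tree) is fit to the negative gradient of the loss (the residuals, for squared error) and added to the ensemble, scaled by a learning rate.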
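
A bare-bones sketch of this idea for squared-error regression, using depth-1 trees (stumps) on a single feature. The function names and data are illustrative, and the libraries named in the heading add many refinements (regularization, smarter split finding, and more) that this sketch omits:

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-threshold split on 1-D x that minimizes squared error."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left, right = residual[x <= t], residual[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, left_val, right_val = best
    return lambda q: np.where(q <= t, left_val, right_val)

def gradient_boost(x, y, n_rounds=20, lr=0.3):
    """Boost stumps on residuals; returns final predictions and MSE per round."""
    pred = np.full_like(y, y.mean())  # start from a constant model
    history = []
    for _ in range(n_rounds):
        residual = y - pred           # negative gradient of 0.5 * (y - pred)**2
        stump = fit_stump(x, residual)
        pred = pred + lr * stump(x)   # shrunken additive update
        history.append(np.mean((y - pred) ** 2))
    return pred, history

x = np.linspace(0.0, 6.0, 30)
y = np.sin(x)
pred, history = gradient_boost(x, y)
print("MSE after round 1:", history[0], "after round 20:", history[-1])
```

The training error should shrink round by round, since each stump is fit to whatever the current ensemble still gets wrong.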

**Validation Metrics:**  
Accuracy/F1 (classification), ROC–AUC, RMSE/MAE (regression)

---

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 1)

xgb = XGBClassifier(eval_metric='mlogloss')  # multi-class log loss; iris has 3 classes
xgb.fit(X_train, y_train)

print('Accuracy: ', accuracy_score(y_test, xgb.predict(X_test)))


Accuracy:  0.9333333333333333


##  Artificial Neural Network (ANN)

**What:**  
Network of neurons that learns hierarchical patterns through layers.

**Why:**  
Captures complex non-linear relationships.

**Where:**  
Images, NLP, tabular prediction, time series.

**When:**  
Large datasets, interactions unknown, non-linear spaces.

**How:**  
Forward pass → loss → backpropagation updates weights.
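
The three steps can be written out by hand for a tiny network. This is a sketch with one hidden `tanh` layer and a mean-squared-error loss on random toy data; the layer sizes, learning rate, and data are arbitrary choices for illustration, not the Keras model used in this section:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (illustrative): 8 samples, 2 features.
X = rng.normal(size=(8, 2))
y = X[:, :1] * X[:, 1:2]  # target = product of the two features, shape (8, 1)

# One hidden layer with 4 tanh units, followed by a linear output.
W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)

lr = 0.05
losses = []
for _ in range(200):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    out = h @ W2 + b2
    losses.append(np.mean((out - y) ** 2))

    # Backward pass: apply the chain rule layer by layer.
    d_out = 2 * (out - y) / len(X)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (1 - h ** 2)  # derivative of tanh is 1 - tanh^2
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)

    # Gradient descent update of all weights.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print("first loss:", losses[0], "last loss:", losses[-1])
```

Frameworks such as Keras compute the backward pass automatically, so in practice only the forward structure and the loss are written by hand.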

**Validation Metrics:**  
Accuracy, Cross-Entropy Loss, MSE (regression)

---

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

X, y = load_iris(return_X_y=True)
y_cat = to_categorical(y)  # one-hot labels for categorical_crossentropy
X_train, X_test, y_train, y_test = train_test_split(X, y_cat, train_size=0.8, random_state=1)

# Input(...) is the recommended way to declare the input shape of a
# Sequential model in recent Keras versions.
ann = Sequential([
    Input(shape=(4,)),
    Dense(8, activation='relu'),
    Dense(3, activation='softmax')
])

ann.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
ann.fit(X_train, y_train, epochs=10, verbose=0)

loss, acc = ann.evaluate(X_test, y_test, verbose=0)

print("Accuracy: ", acc)


Accuracy:  0.36666667461395264


##  Gradient Descent (GD)

**What:**  
Optimization algorithm used to minimize cost functions.

**Why:**  
Works for large models where a closed-form solution does not exist or is too expensive to compute.

**Where:**  
Logistic/Linear Regression, SVM, Neural Networks.

**When:**  
Many parameters, iterative optimization needed.

**How:**  
Compute gradient of loss → move in negative gradient direction.

**Validation Metrics:**  
Not a model → observe decreasing loss & convergence.
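
One way to observe convergence is to record the loss at every step. The sketch below reuses the toy data and learning rate from this section's code cell and adds a `losses` list (a name introduced here for illustration):

```python
import numpy as np

X = np.array([1, 2, 3, 4], dtype=float)
y = np.array([2, 4, 6, 8], dtype=float)

w, lr = 0.0, 0.01
losses = []
for _ in range(100):
    pred = w * X
    # Record the loss whose gradient drives the update below.
    losses.append(0.5 * np.mean((pred - y) ** 2))
    w -= lr * np.mean((pred - y) * X)

# Convergence check: the loss should shrink at every step for this step size.
decreasing = all(b < a for a, b in zip(losses, losses[1:]))
print("first loss:", losses[0], "last loss:", losses[-1], "monotone:", decreasing)
```

If the loss grows or oscillates instead, the learning rate is usually too large.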

In [None]:
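import numpy as np

# Toy data generated by y = 2 * x, so the ideal weight is 2.
X = np.array([1, 2, 3, 4], dtype=float)
y = np.array([2, 4, 6, 8], dtype=float)

w = 0.0    # initial weight
lr = 0.01  # learning rate (step size)

for _ in range(100):
    pred = w * X
    # Gradient of the loss 0.5 * mean((pred - y)**2) with respect to w.
    grad = np.mean((pred - y) * X)
    w -= lr * grad  # step against the gradient

print("GD Learned Weight: ", w)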

GD Learned Weight:  1.9991773724130237
