<span style="font-size:30px">Classification</span> 

# LOGISTIC REGRESSION:
- Logistic Regression (also called Logit Regression) is commonly used to estimate the probability that an instance belongs to a particular class (e.g., what is the probability that this email is spam?). If the estimated probability is greater than 50%, then the model predicts that the instance belongs to that class (called the positive class, labeled “1”), or else it predicts that it does not (i.e., it belongs to the negative class, labeled “0”). This makes it a binary classifier. <br>

<span style="color:blue">
from sklearn.linear_model import LogisticRegression<br>
log_reg = LogisticRegression() <br>
log_reg.fit(X, y)<br>
 </span> <BR>
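
The 50% rule described above can be checked directly. This is a minimal sketch, assuming the iris dataset and an illustrative binary target ("is this flower Iris virginica?"); neither comes from the notes above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Binary problem (assumed for illustration): is the flower Iris virginica (class 2)?
iris = load_iris()
X = iris.data
y = (iris.target == 2).astype(int)  # 1 = positive class, 0 = negative class

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X, y)

# Column 1 of predict_proba holds the estimated probability of the positive class.
proba_positive = log_reg.predict_proba(X)[:, 1]

# Thresholding that probability at 0.5 should reproduce predict().
manual_pred = (proba_positive > 0.5).astype(int)
print(np.array_equal(manual_pred, log_reg.predict(X)))
```
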
 - WHAT IF WE WANT TO CLASSIFY MORE THAN 2 CATEGORIES WITH LOGISTIC REGRESSION ? --> WE USE "one-vs-rest or multinomial methods" <br>
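 <span style="color:blue">
- Load data <br>
- from sklearn import datasets <br>
- from sklearn.linear_model import LogisticRegression <br>
- from sklearn.preprocessing import StandardScaler <br>
- iris = datasets.load_iris() <br>
- features = iris.data <br>
- target = iris.target <br>
- Standardize features <br>
- scaler = StandardScaler() <br>
- features_standardized = scaler.fit_transform(features) <br>
- Create one-vs-rest logistic regression object (note: the multi_class argument is deprecated in recent scikit-learn releases, where wrapping the estimator in OneVsRestClassifier is the suggested alternative) <br>
- logistic_regression = LogisticRegression(random_state=0, multi_class="ovr") <br>
- Train model <br>
- model = logistic_regression.fit(features_standardized, target) <br>
</span>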

- WHAT IF WE NEED TO REDUCE VARIANCE OF OUR LOGISTIC REGRESSION MODEL ? --> Tune the regularization strength hyperparameter, C (C is the inverse of the regularization strength, so a smaller C means stronger regularization):
- logistic_regression = LogisticRegressionCV(penalty='l2', Cs=10, random_state=0, n_jobs=-1) # LogisticRegressionCV() picks C by cross-validation; Cs=10 tries a grid of 10 candidate C values.
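
A minimal sketch of the tuning step above, assuming the iris dataset and 5-fold cross-validation (both chosen here for illustration). It shows where the candidate grid and the selected value end up after fitting.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Cs=10 builds a logarithmic grid of 10 candidate C values;
# cv=5 scores each candidate with 5-fold cross-validation.
log_reg_cv = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000, random_state=0)
log_reg_cv.fit(X_std, y)

print("Candidate C values:", log_reg_cv.Cs_)
print("Selected C:", log_reg_cv.C_)
print("Training accuracy:", log_reg_cv.score(X_std, y))
```
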

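- WHAT IF WE NEED TO CLASSIFY VERY LARGE DATA WITH LOGISTIC REGRESSION ? --> CHANGE SOLVER TO "sag" (stochastic average gradient). Fast convergence is only expected when the features are on roughly the same scale, so standardize them first.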
<br>
- <span style="color:blue">logistic_regression = LogisticRegression(random_state=0, solver="sag") </span>

# STOCHASTIC GRADIENT DESCENT:
- Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function. An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, then the algorithm will have to go through many iterations to converge, which will take a long time. If it is too large, the algorithm may overshoot the minimum and fail to converge.

- SGDClassifier implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing strength schedule (the learning rate). SGD allows minibatch (online/out-of-core) learning via the partial_fit method. For best results using the default learning rate schedule, the data should have zero mean and unit variance.<br>

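<span style="color:blue">
- from sklearn.linear_model import SGDClassifier <br>
- Recommended starting values for minibatch SGD in general (these are rules of thumb from neural-network training; SGDClassifier itself has no batch size or momentum parameter, and its closest settings are eta0 for the initial learning rate and alpha for the regularization strength): <br>
- Learning rate: 0.01 (SGD) or 0.001 (with momentum/Adam). <br>
- Batch size: 32–128. <br>
- Momentum: 0.9. <br>
- Weight decay: 1e-4. <br>
 </span>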
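
The minibatch learning via partial_fit mentioned above might look like the sketch below. The synthetic dataset, the batch size of 64, and the five passes over the data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a dataset too large to fit at once (an assumption).
X, y = make_classification(
    n_samples=2000, n_features=20, n_clusters_per_class=1,
    class_sep=2.0, random_state=0,
)
X = StandardScaler().fit_transform(X)  # zero mean, unit variance

sgd = SGDClassifier(alpha=1e-4, random_state=0)  # alpha is the regularization strength
classes = np.unique(y)  # partial_fit needs the full label set on the first call

batch_size = 64  # illustrative minibatch size
for epoch in range(5):
    # Reshuffle the rows on every pass over the data.
    order = np.random.default_rng(epoch).permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        sgd.partial_fit(X[idx], y[idx], classes=classes)

sgd_acc = sgd.score(X, y)
print("Training accuracy:", sgd_acc)
```

In a real out-of-core setting each minibatch would be read from disk or a stream instead of sliced from an in-memory array.
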

# SUPPORT VECTOR MACHINES:
- Support vector machines classify data by finding the hyperplane that maximizes the margin between the classes in the training data. In a two-dimensional example with two classes, we can think of a hyperplane as the widest straight “band” that separates the two classes.
- SVC, NuSVC and LinearSVC are classes capable of performing binary and multi-class classification on a dataset.
- SVC and NuSVC are similar methods, but accept slightly different sets of parameters and have different mathematical formulations. LinearSVC, on the other hand, is another (faster) implementation of Support Vector Classification for the case of a linear kernel.

- WHAT IF WE HAVE DATA THAT IS NOT LINEARLY SEPARABLE, HOW CAN WE CLASSIFY IT ?
- We can change the kernel parameter:
- Polynomial kernel – good for data with polynomial relations.
- Radial Basis Function (RBF) kernel – handles complex, nonlinear boundaries.
- Sigmoid kernel – sometimes used in neural-net–like scenarios.

- The C parameter controls the penalty for misclassification:
- High C → less tolerance for errors (harder margin), with a higher risk of overfitting.
- Low C → more tolerance, allows some misclassified points, often better generalization.
- Feature engineering / transformation and data preprocessing (for example, scaling the features) can also help. <br>

<span style="color:blue">
from sklearn.svm import SVC <br>
svc = SVC(kernel="rbf", random_state=0, gamma=1, C=1) </span>
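
To see why the kernel matters, here is a small sketch comparing a linear and an RBF kernel on data that no straight line can separate. The concentric-rings dataset from make_circles is an assumption chosen for illustration; gamma=1 and C=1 match the snippet above.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line separates the classes.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svc = SVC(kernel="linear", C=1).fit(X_train, y_train)
rbf_svc = SVC(kernel="rbf", gamma=1, C=1).fit(X_train, y_train)

linear_acc = linear_svc.score(X_test, y_test)
rbf_acc = rbf_svc.score(X_test, y_test)
print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```
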

# Decision Tree Classifier:
- Decision tree learners attempt to find a decision rule that produces the greatest decrease in impurity at a node. While there are a number of measurements of impurity, by default DecisionTreeClassifier uses Gini impurity:
- $G(t) = 1 - \sum_{i=1}^{C} p_i^2$
- where $G(t)$ is the Gini impurity at node $t$, $C$ is the number of classes, and $p_i$ is the proportion of observations of class $i$ at node $t$. This process of finding the decision rules that create splits to decrease impurity is repeated recursively until all leaf nodes are pure ... <br>
<span style="color:blue">
 from sklearn.tree import DecisionTreeClassifier <br >
 clf = DecisionTreeClassifier(random_state=0)
   </span>
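
The Gini formula above is short enough to compute by hand. The sketch below uses a hypothetical helper, gini_impurity (not part of scikit-learn), and compares it with the impurity that a fitted tree stores for its root node on the iris data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def gini_impurity(labels):
    """Return 1 - sum_i p_i^2 for the class labels at one node."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

print(gini_impurity([0, 0, 1, 1]))  # evenly mixed two-class node
print(gini_impurity([1, 1, 1, 1]))  # pure node

# The root node of a fitted tree contains every training sample,
# so its stored impurity should match the helper applied to all labels.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
root_impurity = clf.tree_.impurity[0]
print(root_impurity, gini_impurity(y))
```
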



# RANDOM FOREST CLASSIFIER:
- A random forest is an ensemble of decision trees: each tree is trained on a bootstrap sample of the training data and considers only a random subset of the features at each split, and the forest combines the predictions of all trees. Combining many decorrelated trees reduces variance compared with a single decision tree.
- HOW IT WORKS:
- Generate B bootstrap samples
- For each sample, build decision tree with random feature selection
- Train all trees independently
- For prediction, pass input through all trees
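- Aggregate predictions using majority vote (scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes)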

<span style="color:blue">
from sklearn.ensemble import RandomForestClassifier  </span> 
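
A minimal usage sketch, assuming the iris dataset and 100 trees. The out-of-bag score is added here as a convenient check: each tree is evaluated on the rows its bootstrap sample left out.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,      # number of bootstrap samples / trees (B)
    max_features="sqrt",   # random subset of features tried at each split
    oob_score=True,        # score each tree on the rows its bootstrap sample left out
    random_state=0,
)
forest.fit(X, y)

print("Number of trees:", len(forest.estimators_))
print("Out-of-bag accuracy:", forest.oob_score_)
```
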

# HISTGRADIENTBOOSTINGCLASSIFIER:
- This estimator is much faster than GradientBoostingClassifier for big datasets (n_samples >= 10 000).
- This estimator has native support for missing values (NaNs). During training, the tree grower learns at each split point whether samples with missing values should go to the left or right child, based on the potential gain. When predicting, samples with missing values are assigned to the left or right child accordingly.

<span style="color:blue">
from sklearn.ensemble import HistGradientBoostingClassifier
</span>

# K-NEIGHBORS CLASSIFIER :
- Step-1: Select the number K of neighbors.
- Step-2: Calculate the distance (Euclidean, Manhattan, or Minkowski) from the new data point to every training point.
- Step-3: Take the K nearest neighbors according to the calculated distance.
- Step-4: Among these K neighbors, count the number of data points in each category.
- Step-5: Assign the new data point to the category with the most neighbors.
- Step-6: Our model is ready.
 
 <span style="color:blue">
 from sklearn.neighbors import KNeighborsClassifier </span>
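
The steps above can be sketched in a few lines of NumPy. The helper knn_predict and the toy points are hypothetical and only meant to mirror the steps; in practice the scikit-learn classifier does the same job with faster neighbor search.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: count labels among the neighbors and return the most common one.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two small clusters.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.8, 1.6])))  # point near the first cluster
print(knn_predict(X_train, y_train, np.array([6.4, 6.6])))  # point near the second cluster
```
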



# NAIVE BAYES CLASSIFIER:

- How it works:
- Bayes' Theorem: Apply P(A|B) = P(B|A) × P(A) / P(B)
- Independence Assumption: Assume features are conditionally independent
- Prior Probability: Calculate prior probability for each class
- Likelihood: Calculate likelihood of features given each class
- Posterior: Calculate posterior probability for each class
- Classification: Choose class with highest posterior probability

The most common type of naive Bayes classifier is Gaussian naive Bayes. In Gaussian naive Bayes, we assume that the likelihood of the feature values, x, given that an observation is of class y, follows a normal distribution.

- WHEN WE HAVE discrete or count data, WE USE THE MULTINOMIAL NAIVE BAYES CLASSIFIER.
- WHEN WE HAVE binary feature data, WE USE THE BERNOULLI NAIVE BAYES CLASSIFIER.
- WHEN WE WANT TO calibrate the predicted probabilities from naive Bayes classifiers so they are interpretable, WE USE CalibratedClassifierCV. <br>
<span style="color:blue">
from sklearn.naive_bayes import GaussianNB <br>
from sklearn.naive_bayes import MultinomialNB <br>
from sklearn.naive_bayes import BernoulliNB <br>
from sklearn.calibration import CalibratedClassifierCV </span>
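
A minimal sketch of Gaussian naive Bayes with and without calibration, assuming the iris dataset, sigmoid calibration, and a made-up observation. It only shows how the two sets of probabilities are obtained, not which one is better.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Plain Gaussian naive Bayes.
gnb = GaussianNB().fit(X, y)

# Same model wrapped in sigmoid calibration with 5-fold cross-validation.
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated.fit(X, y)

new_observation = np.array([[2.6, 2.6, 2.6, 0.4]])  # made-up measurements
raw_proba = gnb.predict_proba(new_observation)
calibrated_proba = calibrated.predict_proba(new_observation)
print("raw:       ", raw_proba.round(3))
print("calibrated:", calibrated_proba.round(3))
```
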

# LinearDiscriminantAnalysis - QuadraticDiscriminantAnalysis
- LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis are two classic classifiers, with, as their names suggest, a linear and a quadratic decision surface, respectively.
- These classifiers are attractive because they have closed-form solutions that can be easily computed, are inherently multiclass, have proven to work well in practice, and have no hyperparameters to tune.
- How It Works: LDA performs classification by finding linear boundaries between classes. It uses Bayes' theorem and assumes each class follows a Gaussian distribution with a covariance matrix shared by all classes.
- Step-by-Step Working Process:
- Calculate the mean for each class
- Compute the covariance matrix shared by all classes (the pooled within-class covariance)
- Calculate prior probabilities for each class
- For new data points, compute discriminant function value for each class
- Assign to the class with the highest discriminant value
- Quadratic Discriminant Analysis (QDA):
- How It Works: QDA is a generalized version of LDA. It uses separate covariance matrices for each class and creates quadratic decision boundaries.
- Calculate the mean for each class
- Compute separate covariance matrix for each class
- Calculate prior probabilities for each class
- For new data points, compute quadratic discriminant function value for each class
- Assign to the class with the highest discriminant value<br>
<span style="color:blue">
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis<br>
LDA <br>
lda = LinearDiscriminantAnalysis()<br>
QDA   <br>
qda = QuadraticDiscriminantAnalysis() <br>
</span>
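
Both classifiers can be compared side by side. This sketch assumes the iris dataset and 5-fold cross-validation; it also prints the per-class means and priors that LDA estimates in the steps above.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis()
qda = QuadraticDiscriminantAnalysis()

lda_acc = cross_val_score(lda, X, y, cv=5).mean()
qda_acc = cross_val_score(qda, X, y, cv=5).mean()
print(f"LDA accuracy: {lda_acc:.3f}")
print(f"QDA accuracy: {qda_acc:.3f}")

# After fitting, LDA exposes the per-class means and class priors it estimated.
lda.fit(X, y)
print(lda.means_.shape, lda.priors_)
```
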

# BAGGING CLASSIFIER:
- In ensemble algorithms, bagging methods form a class of algorithms which build several instances of a black-box estimator on random subsets of the original training set and then aggregate their individual predictions to form a final prediction. These methods are used as a way to reduce the variance of a base estimator.
-  Step-by-step process:
- Generate B bootstrap samples from training data (typically B=10-100)
- Train a base classifier on each bootstrap sample
- For feature bagging, randomly select subset of features for each model
- Store all trained models
- For new predictions, get prediction from each model
- Combine predictions using majority voting (classification) or averaging (regression)
- Return final ensemble prediction <br>

<span style="color:blue">
from sklearn.ensemble import BaggingClassifier </span>
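
A minimal sketch of bagging decision trees, assuming the breast cancer dataset, 50 bootstrap samples, and 80% feature bagging (all illustrative choices). The base estimator is passed positionally because the keyword name has changed between scikit-learn versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
bagging = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),  # base estimator, passed positionally
    n_estimators=50,     # B bootstrap samples, one model per sample
    max_features=0.8,    # feature bagging: each model sees 80% of the features
    random_state=0,
)

tree_acc = cross_val_score(single_tree, X, y, cv=5).mean()
bagging_acc = cross_val_score(bagging, X, y, cv=5).mean()
print(f"single tree: {tree_acc:.3f}")
print(f"bagging:     {bagging_acc:.3f}")
```
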

# AdaBoostClassifier:
- AdaBoost (Adaptive Boosting) sequentially trains weak learners, where each subsequent model focuses on correcting the mistakes of previous models by giving higher weights to misclassified examples.
- Step-by-step process:
- Initialize equal weights for all training examples
- Train first weak learner on weighted dataset
- Calculate error rate of current model
- Calculate model weight based on error rate (lower error = higher weight)
- Update example weights (increase for misclassified, decrease for correct)
- Normalize weights so they sum to 1
- Repeat steps 2-6 for specified number of iterations
- Final prediction = weighted vote of all weak learners <br>

<span style="color:blue">
from sklearn.ensemble import AdaBoostClassifier <br>
from sklearn.tree import DecisionTreeClassifier <br>
adaboost = AdaBoostClassifier( <br>
    estimator=DecisionTreeClassifier(max_depth=1),  (this argument was named base_estimator before scikit-learn 1.2) <br>
    n_estimators=50, <br>
    learning_rate=1.0, <br>
    random_state=42 <br>
)   </span>
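
One round of the weight update can be worked through by hand. This sketch uses the classic discrete AdaBoost formulas with labels coded as -1/+1 and made-up predictions; scikit-learn's implementation differs in some details, so treat it as an illustration of the idea only.

```python
import numpy as np

# Labels and one weak learner's predictions, coded as -1/+1 (made-up values).
y_true = np.array([ 1,  1, -1, -1,  1, -1,  1, -1])
y_pred = np.array([ 1,  1, -1, -1, -1, -1,  1,  1])  # two mistakes

# Initialize equal weights for all examples.
weights = np.full(len(y_true), 1 / len(y_true))

# Weighted error rate of the learner.
misclassified = y_pred != y_true
error = weights[misclassified].sum()

# Model weight: lower error gives a larger alpha.
alpha = 0.5 * np.log((1 - error) / error)

# Raise the weights of misclassified examples, lower the rest.
# y_true * y_pred is +1 for a correct prediction and -1 for a mistake.
weights = weights * np.exp(-alpha * y_true * y_pred)

# Normalize so the weights sum to 1.
weights = weights / weights.sum()

print("error:", error, "alpha:", round(alpha, 3))
print("updated weights:", weights.round(3))
```

With these formulas the misclassified examples end up carrying half of the total weight, which is what forces the next weak learner to focus on them.
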

# VotingClassifier
- How it works: VotingClassifier combines predictions from multiple different algorithms using either majority voting (hard voting) or averaging predicted probabilities (soft voting).
- Step-by-step process:
- Train each base model independently on the same training data
- Store all trained models
- For new predictions, get prediction from each model
- Hard Voting: Count votes for each class, assign majority class
- Soft Voting: Average predicted probabilities, assign class with highest average probability
- Return ensemble prediction
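```
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

voting = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('svc', SVC(probability=True)),  # probability=True for soft voting
        ('rf', RandomForestClassifier())
    ],
    voting='soft'  # or 'hard'
)
```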
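
Hard and soft voting can disagree on the same sample. The sketch below uses made-up probabilities from three hypothetical models to show one such case, computed with plain NumPy rather than through VotingClassifier.

```python
import numpy as np

# Predicted probabilities for one sample from three models (made-up numbers).
# Columns are [P(class 0), P(class 1)].
probas = np.array([
    [0.90, 0.10],  # model A: very confident in class 0
    [0.40, 0.60],  # model B: leans toward class 1
    [0.45, 0.55],  # model C: leans toward class 1
])

# Hard voting: each model casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=2).argmax()

# Soft voting: average the probabilities, then pick the highest average.
mean_proba = probas.mean(axis=0)
soft_prediction = mean_proba.argmax()

print("hard voting picks class", hard_prediction)
print("soft voting picks class", soft_prediction, "with averages", mean_proba.round(3))
```

Two lukewarm votes outnumber one confident vote under hard voting, while soft voting lets the confident model dominate the average.
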

# StackingClassifier
- How it works: StackingClassifier uses a meta-model (meta-learner) to learn how to best combine predictions from multiple base models. Base models make predictions, which become features for the meta-model.
- Step-by-step process:
- Level-0 Training: Train base models using cross-validation
    - Split training data into K folds
    - For each fold, train base models on the other K-1 folds
    - Predict on the held-out fold
    - Combine predictions to create meta-features
- Meta-Feature Creation:
    - Base model predictions become the new feature matrix
- Level-1 Training: Train meta-model on meta-features and original targets
- Final Model: Store both base models and meta-model
- Prediction:
    - Get predictions from all base models
    - Use these predictions as input to meta-model
    - Return meta-model prediction
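
The steps above can be sketched by hand with cross_val_predict. This is an illustration of the idea, assuming the iris dataset, 5 folds, and class probabilities as meta-features; the helper stacked_predict is hypothetical, and StackingClassifier handles all of this internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

base_models = [
    LogisticRegression(max_iter=1000),
    SVC(probability=True, random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

# Level 0: out-of-fold class probabilities from each base model (K=5 folds).
oof_blocks = [
    cross_val_predict(model, X, y, cv=5, method="predict_proba")
    for model in base_models
]

# Meta-features: one column per (model, class) pair.
meta_features = np.hstack(oof_blocks)

# Level 1: the meta-model learns how to combine the base predictions.
meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(meta_features, y)

# Final model: refit the base models on all data for use at prediction time.
for model in base_models:
    model.fit(X, y)

def stacked_predict(X_new):
    """Feed base-model probabilities for X_new into the meta-model."""
    blocks = [model.predict_proba(X_new) for model in base_models]
    return meta_model.predict(np.hstack(blocks))

print(meta_features.shape)
print(stacked_predict(X[:5]))
```

The scikit-learn version of the same setup:
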
```
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

stacking = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('svc', SVC()),
        ('rf', RandomForestClassifier())
    ],
    final_estimator=LogisticRegression(),
    cv=5  # Number of cross-validation folds
)
```


| Method     | Training        | Combination     | Complexity | Best For                 |
|------------|-----------------|-----------------|------------|--------------------------|
| Bagging    | Parallel        | Majority Vote   | Low        | High-variance models     |
| AdaBoost   | Sequential      | Weighted Vote   | Medium     | Weak learners            |
| Voting     | Parallel        | Vote/Average    | Low        | Diverse good models      |
| Stacking   | CV-based        | Meta-learning   | High       | Maximum performance      |
