Python packages commonly used in ML:

- NumPy
- SciPy
- Matplotlib
- pandas
- scikit-learn


Machine Learning (ML) is broadly categorized into two main types: 
- <b>Supervised Learning</b>:
    - The model is trained on a labeled dataset, meaning that each training example is paired with an output label.
    - The goal is to learn a mapping from inputs to outputs.
    - Types of Supervised Learning:
        - <b>Regression</b>: 
            - The output variable is continuous and the goal is to predict a numerical value. 
            - Examples include predicting house prices, stock prices, etc.
            - Example Algorithms: Linear Regression, Ridge Regression, Lasso Regression, Polynomial Regression.
        - <b>Classification</b>: 
            - The output variable is categorical and the goal is to predict a category label. 
            - Examples include spam detection, image classification, etc.
            - Example Algorithms: Logistic Regression, Support Vector Machines (SVM), Decision Trees, Random Forest, k-Nearest Neighbors (k-NN), Neural Networks.
    - Regression algorithms:
        - Ordinal regression
        - Poisson regression
        - Fast forest quantile regression
        - Linear, polynomial, lasso, stepwise, and ridge regression
        - Bayesian linear regression
        - Neural network regression
        - Decision forest regression
        - Boosted decision tree regression
        - k-NN (k-nearest neighbors) regression
- <b>Unsupervised Learning</b>:
    - The model is trained on data that does not have labeled responses.
    - The goal is to infer the natural structure present within a set of data points.
    - Types of Unsupervised Learning:
        - <b>Clustering</b>:
            - The goal is to group a set of objects so that objects in the same group are more similar to each other than to those in other groups.
            - Clustering is commonly used for:
                - Discovering structure
                - Summarization
                - Anomaly detection
            - Example Algorithms: k-Means, Hierarchical Clustering, DBSCAN.
        - <b>Dimensionality Reduction</b>:
            - The goal is to reduce the number of variables (features) under consideration by deriving a smaller set of principal variables.
            - Example Algorithms: Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), Singular Value Decomposition (SVD).
    - Unsupervised learning techniques include:
        - Dimension reduction
        - Density estimation
        - Market basket analysis
        - Clustering
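
As a rough illustration of the two categories, the sketch below fits one supervised and two unsupervised models from scikit-learn on the Iris dataset. The dataset, the specific estimators (`LogisticRegression`, `KMeans`, `PCA`), and their settings are choices made here for illustration only. The point to notice is that the supervised model is given the labels `y`, while the unsupervised models see only `X`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised (classification): the labels y are used during training.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
predicted_labels = clf.predict(X[:5])

# Unsupervised (clustering): only X is used; no labels are given.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Unsupervised (dimensionality reduction): 4 features -> 2 components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("Predicted labels:", predicted_labels)
print("Cluster ids (first 5):", cluster_ids[:5])
print("Reduced shape:", X_2d.shape)
```

Note that the cluster ids are arbitrary group numbers; they are not guaranteed to line up with the class labels in `y`.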
           





Difference between supervised and unsupervised learning

| Aspect | Supervised | Unsupervised |
| --- | --- | --- |
| Labeled data | Requires labeled data. | Does not require labeled data. |
| Typical tasks | Classification: assigns labels learned from labeled data. Regression: predicts trends using previously labeled data. | Clustering: finds patterns and groupings in unlabeled data. |
| Goal | Predict an output based on input data. | Discover the underlying structure or distribution in the data. |
| Evaluation | Has more evaluation methods than unsupervised learning. | Has fewer evaluation methods than supervised learning. |
| Environment | Controlled environment. | Less controlled environment. |
| Applications | Suitable for tasks where historical data with labels is available. | Suitable for tasks where labels are not available, or where we want to understand the structure of the data. |

<b>Regression</b>
- <b>Simple Linear Regression</b>
    - Only one independent variable is used to estimate one dependent variable.
    - There are two variables:
        - Y: the dependent variable. It should be a continuous variable, not a categorical one.
        - X: the independent variable.
    - y = β₀ + β₁x + ε
        - y: The dependent variable (the outcome or response variable you are trying to predict).
        - x: The independent variable (the predictor or explanatory variable).
        - β₀: The intercept of the regression line (the value of y when x is 0).
        - β₁: The slope of the regression line (the change in y for a one-unit change in x).
        - ε: The error term (the part of y that the line does not explain; for a fitted model it is estimated by the residual, the difference between the observed and predicted values of y).
- <b>Multiple Linear Regression</b>
    - More than one independent variable is used to estimate one dependent variable.
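
A minimal sketch of both cases, assuming scikit-learn's `LinearRegression` and synthetic data. The "true" coefficients (2 and 3 in the simple case; 1, 2, and -1.5 in the multiple case) and the noise level are made up for illustration, so the fitted values should land close to them but not exactly on them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simple linear regression: y = 2 + 3x + noise (made-up coefficients).
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=200)

simple = LinearRegression()
simple.fit(x.reshape(-1, 1), y)  # scikit-learn expects a 2-D feature array
print(f"intercept (estimate of β0): {simple.intercept_:.2f}")
print(f"slope (estimate of β1): {simple.coef_[0]:.2f}")

# Multiple linear regression: y = 1 + 2*x1 - 1.5*x2 + noise.
X_multi = rng.uniform(0, 10, size=(200, 2))
noise = rng.normal(0, 1, size=200)
y_multi = 1.0 + 2.0 * X_multi[:, 0] - 1.5 * X_multi[:, 1] + noise

multiple = LinearRegression()
multiple.fit(X_multi, y_multi)
print(f"intercept: {multiple.intercept_:.2f}")
print(f"coefficients: {np.round(multiple.coef_, 2)}")
```

The only structural difference between the two cases is the number of columns in the feature array: one column for simple regression, several for multiple regression.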

<b>Training Accuracy</b>

- Training accuracy is the accuracy of the model on the same dataset it was trained on. 
- It is calculated as the proportion of correctly predicted instances out of the total instances in the training set.
- High training accuracy indicates that the model is able to capture the patterns in the training data well. 
- However, very high training accuracy could also indicate overfitting, where the model has learned the noise in the training data rather than the underlying patterns.

<b>Out-of-Sample Accuracy</b>

- Out-of-sample accuracy (also known as test accuracy) is the accuracy of the model on a dataset that was not used during training.
- This held-out dataset is usually called the test set (a validation set is a similar held-out split, typically used while tuning or selecting models). Out-of-sample accuracy is a better indicator of the model's performance on new, unseen data and helps to assess how well the model generalizes to other data.

Importance of Out-of-Sample Accuracy
- <b>Generalization</b>: High out-of-sample accuracy suggests that the model generalizes well to new data, which is the ultimate goal of most predictive models.
- <b>Avoiding Overfitting</b>: Comparing training and out-of-sample accuracy helps detect overfitting. A large gap between training accuracy and out-of-sample accuracy typically indicates overfitting.
- <b>Model Selection</b>: Out-of-sample accuracy is used to compare different models and select the best one.

Example: training accuracy vs. out-of-sample accuracy

In [4]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict on the training set
y_train_pred = model.predict(X_train)
# Calculate training accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)

# Predict on the test set
y_test_pred = model.predict(X_test)
# Calculate out-of-sample (test) accuracy
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"Training Accuracy: {train_accuracy:.2f}")
print(f"Out-of-Sample Accuracy: {test_accuracy:.2f}")


Training Accuracy: 0.96
Out-of-Sample Accuracy: 1.00
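
In the run above the test accuracy is slightly higher than the training accuracy, which can happen by chance when the test set is small (here 45 samples). The gap that signals overfitting looks different. The following sketch, which assumes a synthetic dataset with deliberately noisy labels and an unrestricted `DecisionTreeClassifier` (both chosen here for illustration, not taken from the example above), shows a model that fits its training data almost perfectly but scores noticeably lower on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 assigns a random class to about 20% of samples,
# so part of the training signal is pure noise.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# With no depth limit, the tree keeps splitting until it fits the training set.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_accuracy = accuracy_score(y_train, tree.predict(X_train))
test_accuracy = accuracy_score(y_test, tree.predict(X_test))

print(f"Training Accuracy: {train_accuracy:.2f}")
print(f"Out-of-Sample Accuracy: {test_accuracy:.2f}")
print(f"Gap: {train_accuracy - test_accuracy:.2f}")
```

Limiting the tree (for example with `max_depth`) would be one way to narrow the gap, at the cost of a lower training accuracy.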


<b>K-fold cross-validation</b>

- K-fold cross-validation is a technique used in machine learning to evaluate the performance of a model in a more robust way than a simple train-test split. 
- It helps in assessing how the model will generalize to an independent dataset.

How K-Fold Cross-Validation Works

1. <b>Split the Data</b>: The entire dataset is divided (usually after shuffling) into k folds of approximately equal size.
2. <b>Training and Validation</b>:
    - In each iteration, one fold is held out and the model is trained on the other k-1 folds.
    - The model is then validated on the held-out fold.
3. <b>Repeat</b>: This process is repeated k times, with each fold used exactly once as the validation data.
4. <b>Aggregate Results</b>: The results (e.g., accuracy) from each of the k iterations are averaged to produce a single performance metric.
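
The four steps can be sketched by hand to make the mechanics visible. This is an illustrative version that assumes the Iris dataset, k = 5, and a logistic regression model; in practice scikit-learn's `KFold` and `cross_val_score` do the same bookkeeping.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
k = 5

# Step 1: shuffle the row indices and split them into k folds.
rng = np.random.default_rng(42)
folds = np.array_split(rng.permutation(len(X)), k)

scores = []
for i in range(k):
    # Step 2: hold out fold i and train on the other k-1 folds.
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])

    # Validate on the held-out fold.
    scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))
# Step 3 is the loop itself: each fold serves as validation data exactly once.

# Step 4: aggregate the per-fold results into one metric.
print("Per-fold accuracy:", np.round(scores, 3))
print(f"Mean accuracy: {np.mean(scores):.2f}")
```

A fresh model is created inside the loop so that nothing learned from one fold's training data carries over to the next iteration.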

Benefits of K-Fold Cross-Validation

- <b>Better Utilization of Data</b>: Every observation is used for both training and validation.
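- <b>More Reliable Estimate</b>: Because the score is averaged over multiple train-validation splits, it depends less on one lucky or unlucky split. Cross-validation does not prevent overfitting by itself, but it makes overfitting easier to detect.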
- <b>Generalization</b>: Provides a better indication of how the model will perform on unseen data.

Example

In [5]:
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create a logistic regression model
model = LogisticRegression(max_iter=200)

# Set up K-Fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate the model using cross-validation
cv_scores = cross_val_score(model, X, y, cv=kf)

# Print the cross-validation scores
print(f"Cross-Validation Scores: {cv_scores}")
print(f"Mean CV Score: {cv_scores.mean():.2f}")
print(f"Standard Deviation of CV Score: {cv_scores.std():.2f}")


Cross-Validation Scores: [1.         1.         0.93333333 0.96666667 0.96666667]
Mean CV Score: 0.97
Standard Deviation of CV Score: 0.02


<b>Evaluation Metrics for Regression</b>

1. <b>R-Squared (R²)</b>:
    - This metric indicates the proportion of the variance in the dependent variable that is predictable from the independent variables.
    - Values closer to 1 indicate a better fit.
2. <b>Mean Absolute Error (MAE)</b>:
    - The average of the absolute differences between predicted and actual values.
3. <b>Mean Squared Error (MSE)</b>:
    - The average of the squared differences between predicted and actual values.
4. <b>Root Mean Squared Error (RMSE)</b>:
    - The square root of the mean squared error, expressed in the same units as the dependent variable.
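
To make the definitions concrete, the sketch below computes all four metrics by hand with NumPy on a small set of made-up values and cross-checks them against the corresponding functions in `sklearn.metrics`. The numbers in `y_true` and `y_pred` are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Small made-up example: actual values and a model's predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # average absolute difference
mse = np.mean(errors ** 2)      # average squared difference
rmse = np.sqrt(mse)             # square root of MSE

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.3f}")   # 0.500
print(f"MSE:  {mse:.3f}")   # 0.375
print(f"RMSE: {rmse:.3f}")  # 0.612
print(f"R²:   {r2:.3f}")    # 0.949

# Cross-check the hand-written formulas against scikit-learn.
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
```

Because MSE squares each error, the single error of 1.0 contributes more to MSE than the two errors of 0.5 combined, which is why MSE and RMSE are more sensitive to large errors than MAE.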