# Machine Learning

__Machine learning__ is a method of _data analysis_ that automates
_analytical model building_.
It is a branch of _artificial intelligence_ based on the idea that systems
can _learn from data_, _identify patterns_ and _make decisions_ with
minimal human intervention.

## Scikit-learn (sklearn, SciPy, NumPy)

__Scikit-learn__ is a _Python package_ that provides a wide range of _machine learning algorithms_ and tools. 
It is built on top of _NumPy_, _SciPy_, and _Matplotlib_, and is designed to be simple and efficient for data analysis and modeling.

__Scikit-learn__ offers various modules for tasks such as _classification_, _regression_, _clustering_, _dimensionality reduction_, and _model selection_.
It also provides utilities for _preprocessing data_, _evaluating models_, and _handling datasets_.

With its extensive documentation and user-friendly interface, __Scikit-learn__ is widely used in the field of machine learning and data science.

In [None]:
#!pip install scikit-learn


In [None]:
# Splitting the data into training and testing data

# importing the model

# Assuming football_df is your DataFrame and it has a 'target' column
# columns_list is a list of column names to be used as features

# Splitting the data into features and target

# Splitting the dataset into training and testing sets
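
The commented steps in the cell above could be filled in roughly as follows. This is a minimal sketch: the small random `football_df`, its `goals` and `shots` columns, and the 80/20 split are assumptions made up for illustration, not the original data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for football_df: 100 rows, two features, binary target
rng = np.random.default_rng(0)
football_df = pd.DataFrame({
    "goals": rng.integers(0, 5, size=100),
    "shots": rng.integers(0, 20, size=100),
    "target": rng.integers(0, 2, size=100),
})
columns_list = ["goals", "shots"]

# Splitting the data into features and target
X = football_df[columns_list]
y = football_df["target"]

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```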


### K-Nearest Neighbors

__K-Nearest Neighbors__ is a simple algorithm that _stores all available
cases_ and _classifies_ new cases based on a similarity measure.

It is a type of _instance-based learning_, or _lazy learning_, where the
function is only approximated locally and all computation is deferred
until function evaluation.

In [None]:
# Classification of the data

# Feature scaling for better performance of KNN

# Creating the KNN model

# Fitting the model with the training data

# Testing model performance

# Predicting the class for the new input
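
One way the classification steps above might look in code. The synthetic dataset from `make_classification`, the choice of `n_neighbors=5`, and the values in `new_input` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 samples, 4 features, 2 classes
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature scaling: KNN is distance-based, so features should share a scale.
# The scaler is fitted on the training data only and reused on the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Creating and fitting the KNN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Testing model performance (mean accuracy on the held-out set)
accuracy = knn.score(X_test_scaled, y_test)
print("Test accuracy:", accuracy)

# Predicting the class for a new, hypothetical input (scaled the same way)
new_input = [[0.1, -0.3, 1.2, 0.5]]
prediction = knn.predict(scaler.transform(new_input))
print("Predicted class:", prediction[0])
```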


In [None]:
# Regression of the data

# Feature scaling for better performance of KNN

# Creating the KNN model for regression

# Fitting the model with the training data

# Assuming new_input is a new data point you want to predict
# new_input should be a list of values corresponding to columns_list
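
The regression variant could be sketched in the same way. The data from `make_regression`, `n_neighbors=5`, and the `new_input` values are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 samples, 3 features
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature scaling for better performance of KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Creating and fitting the KNN model for regression
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train_scaled, y_train)

# new_input is a hypothetical data point with one value per feature
new_input = [[0.5, -1.0, 0.2]]
prediction = knn_reg.predict(scaler.transform(new_input))
print("Predicted value:", prediction[0])
```

The prediction is the average target value of the five nearest training points.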


### Linear Regression with Least Squares

__Linear regression__ is a type of _regression analysis_ used for predicting the value of a _continuous dependent variable_. It works by finding the _line that best fits the data_.

_Least squares_ is a method for finding the _best-fitting_ line by __minimizing__ the _sum of the squared differences_ between the predicted and actual values.

In [None]:
# Creating the Linear Regression model

# Fitting the model with the training data

# Assuming new_input is a new data point you want to predict
# new_input should be a list of values corresponding to columns_list

# Predicting the target for the new input
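
A small sketch of these steps, assuming noiseless toy data that lies exactly on the line $y = 3x + 2$ (chosen here only so the fitted slope and intercept are easy to check). For brevity the model is fitted on all ten points rather than on a separate training split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (assumption): ten points lying exactly on y = 3x + 2
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0

# Creating the Linear Regression model and fitting it by least squares
model = LinearRegression()
model.fit(X, y)

# Predicting the target for a new, hypothetical input
new_input = [[4.0]]
prediction = model.predict(new_input)

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("prediction:", prediction[0])
```

Because the data has no noise, the least squares fit should recover a slope close to 3 and an intercept close to 2.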


### Regularization with Ridge and Lasso

__Ridge regression__ (_L2_) and __Lasso regression__ (_L1_) are types of _linear regression_ that include a _penalty_ term to __prevent overfitting__. They work by adding a _regularization term_ to the least squares objective function.
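
As a rough sketch of what the two penalties look like, the objectives can be written as follows. The symbols $w$ (coefficients), $x_i$ and $y_i$ (features and target of sample $i$), $p$ (number of features), and $\alpha$ (regularization strength) are notation assumed here, the intercept is omitted for brevity, and libraries differ in how they scale the squared-error term.

$$
\text{Ridge:}\quad \min_{w} \; \sum_{i=1}^{n} \left(y_i - w^\top x_i\right)^2 + \alpha \sum_{j=1}^{p} w_j^2
$$

$$
\text{Lasso:}\quad \min_{w} \; \sum_{i=1}^{n} \left(y_i - w^\top x_i\right)^2 + \alpha \sum_{j=1}^{p} \lvert w_j \rvert
$$

With $\alpha = 0$ both reduce to ordinary least squares. The L2 penalty shrinks coefficients toward zero, while the L1 penalty can set some of them exactly to zero.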

In [None]:
# implementing Ridge Regression (L2 regularization)

# Creating the Ridge Regression model
# alpha is the regularization strength; larger values specify stronger regularization.

# Fitting the model with the training data

# Making predictions on the test set

# Calculating the mean squared error of the predictions
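
A possible version of the Ridge steps above. The synthetic data and `alpha=1.0` are assumptions for illustration; in practice `alpha` is usually tuned.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 samples, 5 features
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Creating the Ridge Regression model
# alpha is the regularization strength; larger values specify stronger regularization.
ridge = Ridge(alpha=1.0)

# Fitting the model with the training data
ridge.fit(X_train, y_train)

# Making predictions on the test set
y_pred = ridge.predict(X_test)

# Calculating the mean squared error of the predictions
mse = mean_squared_error(y_test, y_pred)
print("Ridge test MSE:", mse)
```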


In [1]:
# implementing Lasso Regression (L1 regularization)
# Creating the Lasso Regression model
# alpha is the regularization strength; larger values specify stronger regularization.

# Fitting the model with the training data

# Making predictions on the test set

# Calculating the mean squared error of the predictions

# To understand feature sensitivity, you can look at the coefficients
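
The Lasso steps could be sketched like this. The synthetic data (ten features, of which only three are informative) and `alpha=1.0` are assumptions chosen to make the effect on the coefficients visible.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 10 features, only 3 carry signal
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Creating the Lasso Regression model
# alpha is the regularization strength; larger values specify stronger regularization.
lasso = Lasso(alpha=1.0)

# Fitting the model with the training data
lasso.fit(X_train, y_train)

# Making predictions on the test set
y_pred = lasso.predict(X_test)

# Calculating the mean squared error of the predictions
mse = mean_squared_error(y_test, y_pred)
print("Lasso test MSE:", mse)

# To understand feature sensitivity, look at the coefficients:
# features whose coefficient is exactly 0 were dropped by the L1 penalty
print("Coefficients:", lasso.coef_)
print("Zeroed-out features:", int((lasso.coef_ == 0).sum()))
```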


### Polynomial Regression

__Polynomial regression__ is a type of _regression analysis_ that models
the _relationship_ between the independent and dependent variables as
an $n$th-degree _polynomial_. It can capture _non-linear relationships_ between the variables.

In [None]:

# Assuming football_df is your DataFrame and it has a 'target' column
# columns_list is a list of column names to be used as features

# Splitting the data into features and target

# Splitting the dataset into training and testing sets

# Transforming the features into polynomial features

# Creating the Linear Regression model

# Fitting the model with the polynomial features and the training data

# Making predictions on the test set

# Calculating the mean squared error of the predictions
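
A sketch of the polynomial workflow above, assuming noiseless toy data generated from $y = 2x^2 - x + 1$ and a degree of 2 (both are illustrative choices, not taken from the original notebook).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Toy data (assumption): a single feature with a quadratic relationship
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 2.0 * X.ravel() ** 2 - X.ravel() + 1.0

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Transforming the features into polynomial features (x and x^2);
# include_bias=False because LinearRegression fits its own intercept
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Creating the Linear Regression model and fitting it on the polynomial features
model = LinearRegression()
model.fit(X_train_poly, y_train)

# Making predictions on the test set
y_pred = model.predict(X_test_poly)

# Calculating the mean squared error of the predictions
mse = mean_squared_error(y_test, y_pred)
print("Polynomial regression test MSE:", mse)
```

The model is still linear in its coefficients; only the inputs have been expanded, which is why `LinearRegression` can be reused unchanged.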


### Logistic Regression

__Logistic regression__ is a type of _regression analysis_ used for predicting the outcome of a _categorical dependent variable_.
It is most often used for __binary classification__ tasks, where the model outputs a
probability between $0$ and $1$ that is then thresholded into a class label.

In [None]:

# Creating the Logistic Regression model

# Fitting the model with the training data

# Making predictions on the test set

# Calculating the accuracy of the predictions
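
The steps above might be written as follows. The synthetic dataset and `max_iter=1000` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 samples, 4 features, 2 classes
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Creating the Logistic Regression model
log_reg = LogisticRegression(max_iter=1000)

# Fitting the model with the training data
log_reg.fit(X_train, y_train)

# Making predictions on the test set (class labels and class probabilities)
y_pred = log_reg.predict(X_test)
y_proba = log_reg.predict_proba(X_test)

# Calculating the accuracy of the predictions
accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy:", accuracy)
print("First row of probabilities:", y_proba[0])
```

`predict_proba` returns one column per class, and each row sums to 1; `predict` picks the class with the larger probability.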


### Cross-Validation

__Cross-validation__ is a technique for _assessing the performance_ of a
model. It involves _splitting_ the data into multiple subsets, training the model on some subsets, and evaluating it on others.

__Cross-validation__ helps to _detect overfitting_ and provides a more
reliable estimate of the model’s performance on unseen data than a single train/test split.

In [None]:

# Create a logistic regression model

# Perform cross-validation

# Print the accuracy for each fold

# Print the mean accuracy of all 5 folds
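
A minimal sketch of 5-fold cross-validation along these lines, assuming a synthetic dataset in place of the real one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 200 samples, 4 features, 2 classes
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Create a logistic regression model
model = LogisticRegression(max_iter=1000)

# Perform cross-validation: the data is split into 5 folds, and each fold
# is used once for evaluation while the other 4 are used for training
scores = cross_val_score(model, X, y, cv=5)

# Print the accuracy for each fold
for fold, score in enumerate(scores, start=1):
    print(f"Fold {fold}: {score:.3f}")

# Print the mean accuracy of all 5 folds
print(f"Mean accuracy: {scores.mean():.3f}")
```

For a classifier, `cross_val_score` uses accuracy by default; another metric can be selected with its `scoring` parameter.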


### Encoding

__One-hot encoding__ is a technique for _converting_ _categorical_ variables into _numerical_ variables.

It creates a _binary vector_ for each _category_, with a $1$ for the
category and $0$s for all other categories.

In [3]:

# Creating a fake DataFrame

# Display the original DataFrame

# Applying OneHotEncoder

# Creating a DataFrame with the encoded data

# Concatenating the encoded columns with the original DataFrame (excluding the original 'Color' and 'Size' columns)

# Display the final DataFrame after one-hot encoding


# Supervised Machine Learning Algorithms

### Random Forest

__Random forest__ is an _ensemble learning_ method that combines
_multiple decision trees_ to create a strong predictive model.

It works by building _multiple trees_ on random subsets of the data and features, then averaging their
predictions (or taking a majority vote for classification) to _reduce overfitting_.
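
A short sketch of a random forest classifier in scikit-learn. The synthetic data and the choice of 100 trees are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 300 samples, 6 features, 2 classes
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Build 100 decision trees, each on a bootstrap sample of the training data
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# The forest combines the predictions of its individual trees
accuracy = forest.score(X_test, y_test)
print("Number of trees:", len(forest.estimators_))
print("Test accuracy:", accuracy)
```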

### Gradient Boosted Decision Trees

__Gradient boosted decision trees__ are an _ensemble learning_ method
that combines _multiple decision trees_ and _gradient descent
optimization_ to create a strong predictive model.

They work by building _trees sequentially_, with each tree _correcting the
errors_ of the previous trees.
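
A minimal sketch using scikit-learn's `GradientBoostingClassifier`. The synthetic data and the hyperparameter values (`n_estimators`, `learning_rate`, `max_depth`) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 300 samples, 6 features, 2 classes
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Trees are added one at a time; each new shallow tree is fitted to the
# errors left by the trees before it, scaled by learning_rate
gbdt = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
gbdt.fit(X_train, y_train)

accuracy = gbdt.score(X_test, y_test)
print("Boosting stages:", len(gbdt.estimators_))
print("Test accuracy:", accuracy)
```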

### Neural Networks

__Neural networks__ are a type of _machine learning_ model inspired by
the _human brain_.

They consist of _layers of interconnected nodes_ that process input data
and produce output data.
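
As a small illustration, scikit-learn's `MLPClassifier` builds a feed-forward network of this kind. The synthetic data and the two hidden layers of 16 and 8 nodes are assumptions chosen only to keep the sketch short.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 300 samples, 6 features, 2 classes
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Neural networks usually train better on scaled inputs
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Two hidden layers with 16 and 8 nodes between the input and output layers
mlp = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000, random_state=42)
mlp.fit(X_train_scaled, y_train)

accuracy = mlp.score(X_test_scaled, y_test)
# One weight matrix per connection between consecutive layers
print("Weight matrix shapes:", [w.shape for w in mlp.coefs_])
print("Test accuracy:", accuracy)
```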

# Model Evaluation

### Confusion Matrices

A __confusion matrix__ is a table that _summarizes the performance_ of a
classification model.

It shows the number of _true positives_, _true negatives_, _false positives_,
and _false negatives_.
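
A small worked example with scikit-learn's `confusion_matrix`. The label lists are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes (both ordered 0, 1)
cm = confusion_matrix(y_true, y_pred)
print(cm)

# For binary labels the flattened matrix is ordered TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Here the model gets three negatives and three positives right, raises one false alarm, and misses one positive.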

### Basic Metrics

- __Accuracy:__ The proportion of correct predictions.
- __Precision:__ The proportion of true positives among all positive
predictions.
- __Recall:__ The proportion of true positives among all actual positives.
- __F1 Score:__ The harmonic mean of precision and recall.
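
These four metrics can be computed with `sklearn.metrics`. The labels below are made up so that each metric comes out different (2 true positives, 2 false negatives, 1 false positive, 5 true negatives).

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 4 actual positives followed by 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 7 / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true, y_pred)                # 2 * P * R / (P + R) = 4 / 7

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 score:  {f1:.3f}")
```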

### Classifier Decision Metrics

- __ROC Curve:__ A plot of the true positive rate against the false positive rate as the classification threshold varies.
- __Precision-Recall Curve:__ A plot of precision against recall as the classification threshold varies.
- __AUC-ROC:__ The area under the ROC curve.
- __AUC-PR:__ The area under the precision-recall curve.
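
A sketch of how these curves and areas can be computed from predicted scores. The four labels and scores are made up for illustration, and `average_precision_score` is used here as a step-wise summary of the precision-recall curve, which is one common way (not the only one) to report AUC-PR.

```python
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)

# Hypothetical true labels and predicted scores for the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Points of the ROC curve: one (FPR, TPR) pair per threshold
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)

# Points of the precision-recall curve
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)

# Area summaries of the two curves
roc_auc = roc_auc_score(y_true, y_score)
avg_precision = average_precision_score(y_true, y_score)

print("AUC-ROC:", roc_auc)
print("Average precision (AUC-PR summary):", avg_precision)
```

The arrays returned by `roc_curve` and `precision_recall_curve` can be passed directly to a plotting library to draw the curves.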

### Regression Evaluation Metrics

- __Mean Squared Error:__ The average of the squared differences between the predicted and actual values.
- __Mean Absolute Error:__ The average of the absolute differences between the predicted and actual values.
- __R-Squared:__ The proportion of the variance in the dependent variable that is predictable from the independent variables.
- __Adjusted R-Squared:__ A modified version of R-squared that adjusts for the number of predictors in the model.
- __Root Mean Squared Error:__ The square root of the mean squared error.
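
These metrics could be computed as follows. The four values are made up, and the number of predictors `p = 1` used for adjusted R-squared is an assumption, since no model is attached to the example. Adjusted R-squared is computed by hand from the formula $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
rmse = np.sqrt(mse)                        # square root of the MSE
r2 = r2_score(y_true, y_pred)

# Adjusted R-squared penalizes extra predictors
n = len(y_true)  # number of samples
p = 1            # number of predictors (assumed for this example)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MSE:  {mse:.4f}")
print(f"MAE:  {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R^2:  {r2:.4f}")
print(f"Adjusted R^2: {adjusted_r2:.4f}")
```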