In [3]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Machine Learning

Machine learning is a branch of artificial intelligence (AI) that involves developing algorithms and statistical models that enable computers to automatically learn from data without being explicitly programmed. In machine learning, algorithms use data to identify patterns and make decisions with minimal human intervention. Machine learning is used in a wide range of applications, from self-driving cars and facial recognition technology to medical diagnosis and natural language processing.

## Major Approaches of ML

1. **Supervised learning:** This approach involves using a labeled dataset to train a model that can predict output values for new input data. It is called "supervised" because the training data includes the correct answers, and the goal is for the model to learn how to produce the correct output for new, unseen data.

2. **Unsupervised learning:** This approach involves using an unlabeled dataset to train a model that can identify patterns or structure in the data. There is no "correct answer" to learn from, so the model is typically used for tasks like clustering, where similar data points are grouped together.

3. **Reinforcement learning:** This approach involves training a model to make decisions based on rewards or penalties received in response to its actions. The goal is for the model to learn to take actions that maximize its cumulative reward over time. This approach is often used in areas like game-playing and robotics, where an agent must learn to navigate an environment and make decisions based on feedback.

## Terminology

![image.png](attachment:f9d3cbd8-e467-4423-97e3-f20891aaa385.png)

## Supervised Learning

### Classification

In supervised learning, classification is a type of task where the goal is to predict the categorical label of new observations based on past observations with known labels. The algorithm learns from a labeled dataset, where each data point is associated with a class label, and then predicts the class labels for unseen data.

Classification can be binary, where there are only two classes, or multiclass, where there are more than two classes. Some common algorithms for classification include logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. These algorithms use different approaches to learn the patterns in the data and make predictions about the classes of new observations.

1. Logistic Regression
2. Naive Bayes
3. Decision Trees
4. Support Vector Machines
5. Random Forest
6. Neural Networks
7. K-Nearest Neighbors (KNN)

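**Assumptions:**
1. The target should be discrete (categorical).
2. Some models, such as Naive Bayes, additionally assume that the features are conditionally independent given the class; others, such as decision trees, do not require this.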
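To make the idea of learning from labeled points and predicting labels for unseen points concrete, here is a minimal sketch of a one-nearest-neighbor classifier, the simplest case of the KNN entry in the list above. The data and the helper name `predict_1nn` are made up for illustration; this is not code from the notebook.

```python
import numpy as np

def predict_1nn(X_train, y_train, X_new):
    """Label each row of X_new with the label of its closest training row."""
    # Pairwise squared Euclidean distances, shape (n_new, n_train)
    diffs = X_new[:, None, :] - X_train[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    return y_train[dists.argmin(axis=1)]

# Toy labeled data (hypothetical): two features, binary class label
X_train_demo = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 7.5]])
y_train_demo = np.array([0, 0, 1, 1])

# Unseen observations whose class we want to predict
X_new_demo = np.array([[1.2, 1.4], [8.5, 8.2]])

predicted_labels = predict_1nn(X_train_demo, y_train_demo, X_new_demo)
print(predicted_labels)
```

The first new point sits next to the class-0 cluster and the second next to the class-1 cluster, so they receive labels 0 and 1 respectively.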

### Regression

In supervised learning, regression is a type of task where the goal is to predict a continuous value for new observations based on past observations with known values. The algorithm learns from a labeled dataset, where each data point is associated with a real-valued label, and then predicts the continuous value for unseen data.

Regression tasks can involve predicting a single value (univariate regression) or multiple values (multivariate regression). Some common algorithms for regression include linear regression, polynomial regression, decision tree regression, random forest regression, support vector regression (SVR), and neural network regression.

These algorithms use different mathematical techniques to learn the relationships between the input features and the target variable and make predictions about the continuous values of new observations.

#### Common Regression Models
1. Linear Regression
2. Support Vector Regression
3. Ridge Regression and Lasso Regression
4. Decision Trees
5. Random Forest Regression
6. Gradient Boosting Regression

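#### Assumptions
1. The target should be continuous.
2. For linear models, the features should not be strongly correlated with one another (no multicollinearity); tree-based models are less sensitive to this.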

<br>

## Linear Regression

Linear regression is a widely used statistical technique for modeling the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the variables and aims to find the best-fit line that minimizes the difference between the predicted and actual values. The goal of linear regression is to make predictions or understand the impact of independent variables on the dependent variable.

In simple linear regression, we consider a single independent variable and a single dependent variable. The relationship between the variables can be represented by the equation:

$y = mx + c$

Where:

- `y` is the dependent variable
- `x` is the independent variable
- `c` is the y-intercept (the value of `y` when `x` is 0)
- `m` is the slope (the change in `y` for a unit change in `x`)

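In higher dimensions, with several independent variables, this equation becomes:

$y = w^\top x + b$

Here $x$ is the vector of independent variables, $w$ is the vector of weights (one slope per variable), and $b$ is the intercept. The goal is to estimate the values of $w$ and $b$ that best fit the data.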
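With several features, the prediction for each observation is the dot product of its feature values with the weights, plus the intercept. A minimal NumPy sketch, where the values of `w_demo`, `b_demo`, and `X_demo` are made up for illustration:

```python
import numpy as np

# Hypothetical weights and intercept for a model with three features
w_demo = np.array([2.0, -1.0, 0.5])
b_demo = 10.0

# Two observations, one per row, three features each
X_demo = np.array([[1.0, 2.0, 4.0],
                   [0.0, 1.0, 2.0]])

# One prediction per row: dot product with the weights, plus the intercept
y_hat_demo = X_demo @ w_demo + b_demo
print(y_hat_demo)  # [12. 10.]
```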

## Ordinary Least Squares (OLS) Estimation

The most common method for estimating the coefficients (`b` and `w`) in linear regression is Ordinary Least Squares (OLS). It finds the values of `b` and `w` that minimize the residual sum of squares (RSS):

$RSS = \sum_{i=1}^{n} (y_i - b - w x_i)^2$

Setting the partial derivatives of RSS with respect to `b` and `w` to zero gives:

$\frac{\partial RSS}{\partial b} = -2\sum_{i=1}^{n} (y_i - b - w x_i) = 0$

$\frac{\partial RSS}{\partial w} = -2\sum_{i=1}^{n} x_i (y_i - b - w x_i) = 0$

Solving these equations simultaneously yields the estimated coefficients `w` and `b`:

$w = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

$b = \bar{y} - w\bar{x}$

Where:

- `x̄` is the mean of the independent variable `x`
- `ȳ` is the mean of the dependent variable `y`

These formulas can be computed efficiently, providing the best-fit line.
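The closed-form estimates for the slope and intercept can be sketched in a few lines of NumPy. The data below is made up (it lies roughly on `y = 2x + 1`), and the names `x_demo`, `y_demo`, `w_hat`, and `b_hat` are illustrative. `np.polyfit` is used only as an independent cross-check.

```python
import numpy as np

# Made-up data lying roughly on y = 2x + 1
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

x_mean, y_mean = x_demo.mean(), y_demo.mean()

# Slope: sum of cross-deviations divided by sum of squared x-deviations
w_hat = ((x_demo - x_mean) * (y_demo - y_mean)).sum() / ((x_demo - x_mean) ** 2).sum()
# Intercept: the fitted line passes through the point of means
b_hat = y_mean - w_hat * x_mean

# Cross-check against NumPy's degree-1 least-squares fit
w_ref, b_ref = np.polyfit(x_demo, y_demo, deg=1)

print(w_hat, b_hat)  # approximately 1.97 and 1.06
```

Both approaches should agree up to floating-point rounding, since a degree-1 polynomial fit solves the same least-squares problem.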

In [6]:
data = pd.read_csv("clean_stat_data.csv")

In [7]:
data.head()

Unnamed: 0,matric_gpa_%,accommodation_status,monthly_allowance,scholarship_bursary_2023,study_hours_week,socialising_week,drinks_night,classes_missed_alcohol,modules_failed,in_relationship,...,faculty_Law,faculty_Medicine and Health Services,faculty_Science,year_in_2023_0th Year,year_in_2023_1st Year,year_in_2023_2nd Year,year_in_2023_3rd Year,year_in_2023_4th Year,year_in_2023_Postgraduate,target
0,-0.365008,-0.395407,-1.095457,-0.360505,-1.155375,-1.205451,-1.407833,-2.507187,-0.753244,-1.148268,...,-0.172062,-0.172062,-0.399073,-0.056614,-0.762108,1.076392,-0.38278,-0.151248,-0.127412,72.0
1,1.794677,-0.395407,-0.464006,2.773886,-1.155375,-1.205451,-0.571682,-1.676764,-0.753244,0.870877,...,-0.172062,-0.172062,-0.399073,-0.056614,-0.762108,1.076392,-0.38278,-0.151248,-0.127412,75.0
2,-0.365008,-0.395407,-1.095457,-0.360505,-0.362322,-0.498885,-1.407833,-2.507187,-0.753244,0.870877,...,-0.172062,-0.172062,-0.399073,-0.056614,1.312151,-0.929029,-0.38278,-0.151248,-0.127412,55.0
3,1.794677,-0.395407,0.167446,-0.360505,-0.362322,0.207681,-1.407833,-0.846341,-0.753244,-1.148268,...,-0.172062,-0.172062,-0.399073,-0.056614,-0.762108,1.076392,-0.38278,-0.151248,-0.127412,84.0
4,-0.697268,-0.395407,-1.095457,-0.360505,-0.362322,-1.205451,0.26447,-0.015919,-0.077699,0.870877,...,-0.172062,-0.172062,-0.399073,-0.056614,-0.762108,1.076392,-0.38278,-0.151248,-0.127412,52.0


In [8]:
X = data.drop('target', axis=1).to_numpy()
y = data['target'].to_numpy()

In [9]:
y

array([72.  , 75.  , 55.  , 84.  , 52.  , 54.  , 75.  , 75.  , 64.  ,
       76.  , 75.  , 65.  , 55.  , 54.  , 62.  , 55.  , 76.  , 65.  ,
       69.  , 60.  , 55.  , 74.  , 70.  , 60.  , 54.  , 75.  , 63.  ,
       73.  , 57.  , 75.  , 78.  , 55.  , 70.  , 60.  , 61.  , 63.  ,
       60.  , 57.  , 74.  , 76.  , 80.  , 66.  , 80.  , 70.  , 65.  ,
       75.  , 63.  , 80.  , 65.  , 75.  , 65.  , 61.  , 76.  , 65.  ,
       58.  , 60.  , 65.  , 60.  , 58.  , 55.  , 71.  , 60.  , 70.  ,
       60.  , 64.  , 84.  , 60.  , 61.  , 66.  , 65.  , 53.  , 62.  ,
       65.  , 62.  , 61.  , 73.  , 60.  , 60.  , 64.  , 62.  , 69.  ,
       57.  , 60.  , 60.  , 66.  , 50.  , 70.  , 65.  , 72.  , 62.  ,
       76.  , 75.  , 88.  , 61.  , 52.  , 79.  , 56.  , 70.  , 60.  ,
       57.  , 64.  , 60.  , 74.  , 73.  , 51.  , 60.  , 60.  , 65.  ,
       54.  , 65.  , 64.  , 61.  , 68.  , 53.  , 73.  , 50.  , 64.  ,
       73.  , 73.  , 50.  , 68.  , 65.  , 71.  , 65.  , 60.  , 72.  ,
       70.  , 77.  ,

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

In [12]:
linear_model.coef_

array([ 1.96534281e+00, -5.22989387e-01,  2.39514883e-01, -3.51243860e-01,
        6.52832364e-02, -1.38268481e-01, -2.48539243e-01,  8.03119691e-01,
       -2.90799578e+00,  5.77401861e-02, -2.38650570e-01,  2.80275015e-01,
        3.21393407e+11,  3.21393407e+11, -4.41203015e+11, -5.74949322e+11,
       -9.20408213e+11, -2.31658567e+11, -4.91634723e+11, -3.08777789e+11,
       -3.08777789e+11, -6.36068833e+11,  1.69824101e+12,  1.45078662e+13,
        1.50058523e+13,  1.00469293e+13,  4.44971053e+12,  3.77296162e+12])

In [13]:
linear_model.intercept_

66.40056456542969

In [14]:
X.shape

(313, 28)
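Several of the coefficients printed above are on the order of 1e11 to 1e13, far larger than the target values shown. One common cause of this pattern is exact collinearity among the features, for example when every level of a categorical variable keeps its own indicator column. This is only a hypothesis here, since the notebook does not investigate it. The sketch below uses synthetic data, not the notebook's dataset, to show how such a design matrix loses rank.

```python
import numpy as np

# Synthetic example (not the notebook's data): one categorical variable
# with three levels, one-hot encoded with ALL three indicator columns kept
categories = np.array([0, 1, 2, 0, 1, 2, 0, 1])
one_hot = np.eye(3)[categories]

# Prepend the column of ones that an intercept term corresponds to
design = np.column_stack([np.ones(len(categories)), one_hot])

# The three indicator columns sum to the column of ones, so one column
# is redundant and the rank is lower than the number of columns
n_columns = design.shape[1]
rank = np.linalg.matrix_rank(design)
print(n_columns, rank)  # 4 3
```

When the rank is below the number of columns, the least-squares solution is not unique, and individual coefficients can become arbitrarily large while the predictions stay the same. Dropping one indicator per categorical variable is one usual remedy.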

### Performance Metrics

* Mean Absolute Error (MAE): $\frac{\sum_{i=1}^{N}|y_{actual} - y_{predicted}|}{N}$
* Mean Squared Error (MSE): $\frac{\sum_{i=1}^{N}(y_{actual} - y_{predicted})^2}{N}$
* $R^2$: $1 - \frac{\sum_{i=1}^{N}(y_{actual} - y_{predicted})^2}{\sum_{i=1}^{N}(y_{actual} - y_{mean})^2}$

#### Mean Absolute Error (MAE)

Mean Absolute Error is a commonly used metric for regression problems. It measures the average absolute difference between the predicted and actual values. MAE provides an easily interpretable measure of how close the predictions are to the actual values.

$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$

where:

- `n` is the number of instances
- `y` represents the actual values
- `ŷ` represents the predicted values

MAE is simple to understand and compute. Because it does not square the errors, it is less sensitive to large errors (outliers) than MSE.

`Example:` For a housing price regression evaluated on 100 observations, we calculate the MAE as follows:

$MAE = \frac{|250{,}000 - 260{,}000| + |300{,}000 - 295{,}000| + |350{,}000 - 360{,}000| + \dots}{100}$

The MAE provides an average of the absolute differences between the actual and predicted values. In this example, it measures the average difference between the predicted and actual housing prices.

- **When to use:** MAE is a common metric for evaluating regression models and is suitable when you want to understand the average magnitude of the errors without considering their direction.
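As a small worked sketch of the MAE formula, the snippet below uses only the three (actual, predicted) pairs written out in the housing example. The remaining pairs are elided in the text, so the result is the MAE of these three pairs, not of the full example.

```python
import numpy as np

# The three pairs spelled out in the housing example; the rest are elided
actual = np.array([250_000.0, 300_000.0, 350_000.0])
predicted = np.array([260_000.0, 295_000.0, 360_000.0])

# Average of the absolute errors: (10000 + 5000 + 10000) / 3
mae = np.abs(actual - predicted).mean()
print(mae)
```

On average these three predictions miss the actual price by roughly 8,333, in the same units as the prices themselves.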

#### Mean Squared Error (MSE)

Mean Squared Error is another popular metric for regression evaluation. It measures the average of the squared differences between the predicted and actual values. MSE gives more weight to larger errors compared to MAE.

$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

Because the errors are squared before averaging, a few large errors can dominate MSE. It is also expressed in squared units of the target variable, so it is not directly interpretable on the original scale.

`Example:` Using the same housing price regression example, we calculate the MSE:

$MSE = \frac{(250{,}000 - 260{,}000)^2 + (300{,}000 - 295{,}000)^2 + (350{,}000 - 360{,}000)^2 + \dots}{100}$

MSE calculates the average of the squared differences between the predicted and actual values. It penalizes larger errors more heavily than MAE.

- **When to use:** MSE is widely used in regression tasks and is beneficial when you want to emphasize larger errors and penalize them more compared to smaller errors.
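The same three housing pairs can illustrate how squaring changes the picture. This is a sketch over only the pairs written out in the text, so it is not the MSE of the full 100-observation example.

```python
import numpy as np

# The three pairs spelled out in the housing example; the rest are elided
actual = np.array([250_000.0, 300_000.0, 350_000.0])
predicted = np.array([260_000.0, 295_000.0, 360_000.0])

squared_errors = (actual - predicted) ** 2
mse = squared_errors.mean()

# Share of the total squared error contributed by each observation
error_share = squared_errors / squared_errors.sum()
print(mse)
print(error_share)
```

The 5,000 error is half the size of the two 10,000 errors, yet it contributes only about 11% of the total squared error. This is what penalizing larger errors more heavily means in practice.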

#### Root Mean Squared Error (RMSE)

RMSE is derived from MSE and is widely used due to its interpretability in the original scale of the target variable. It represents the square root of the average of the squared differences between the predicted and actual values.

$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$

RMSE shares the same properties as MSE but provides a more easily understandable and interpretable metric.

`Example:` Based on the previous example, we calculate the RMSE as follows:

$RMSE = \sqrt{MSE}$

RMSE provides the square root of the MSE, giving a measure of the average magnitude of the errors in the original unit of the target variable (e.g., dollars in the housing price example).

- **When to use:** RMSE is a commonly used metric for regression problems and is especially useful when you want to interpret the errors in the original unit of the target variable.
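A short sketch of the relationship between MSE and RMSE, again limited to the three housing pairs written out in the text:

```python
import numpy as np

# The three pairs spelled out in the housing example; the rest are elided
actual = np.array([250_000.0, 300_000.0, 350_000.0])
predicted = np.array([260_000.0, 295_000.0, 360_000.0])

mse = ((actual - predicted) ** 2).mean()   # squared price units
rmse = np.sqrt(mse)                        # back in price units
print(mse, rmse)
```

The MSE of 75,000,000 is in squared price units, while the RMSE of roughly 8,660 can be read directly as a typical error size in price units.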

#### R-squared (Coefficient of Determination)

R-squared (R²) is a statistical metric that represents the proportion of variance in the target variable explained by the regression model. It indicates how well the model fits the data. For a least-squares fit with an intercept, evaluated on its training data, R² lies between 0 and 1, and a higher value indicates a better fit. On held-out data it can be negative when the model predicts worse than the mean of the actual values.

$R^2 = 1 - \frac{SSE}{SST}$

where:

- `SSE` is the sum of squared residuals, $\sum (y - \hat{y})^2$ (actual values minus predicted values)
- `SST` is the total sum of squares, $\sum (y - \bar{y})^2$ (actual values minus the mean of the actual values)

R-squared measures the goodness of fit but does not consider the complexity of the model or the number of predictors.

`Example:` In the housing price regression example, we can calculate R-squared as follows:

Calculate the total sum of squares (SST):

$SST = \sum (\text{actual price} - \text{mean of actual prices})^2$

Calculate the residual sum of squares (SSE):

$SSE = \sum (\text{actual price} - \text{predicted price})^2$

Calculate R-squared:

$R^2 = 1 - \frac{SSE}{SST}$

R-squared measures the proportion of the variance in the target variable that the regression model explains. A value of 1 means the model explains all the variability in the target variable, and a value of 0 means it does no better than always predicting the mean.

- **When to use:** R-squared is commonly used to assess the goodness of fit of a regression model. It provides an indication of how well the model fits the data.
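The SST and SSE steps can be sketched directly in NumPy. The helper name `r_squared` is illustrative, the first set of predictions reuses the three housing pairs written out earlier, and the second set is a deliberately bad, made-up prediction that shows how the value can drop below zero.

```python
import numpy as np

def r_squared(actual, predicted):
    """R^2 = 1 - SSE / SST, computed from the definitions."""
    sse = ((actual - predicted) ** 2).sum()        # residual sum of squares
    sst = ((actual - actual.mean()) ** 2).sum()    # total sum of squares
    return 1 - sse / sst

actual = np.array([250_000.0, 300_000.0, 350_000.0])

# Predictions close to the actual prices
good_predicted = np.array([260_000.0, 295_000.0, 360_000.0])
# Made-up predictions that move in the wrong direction
bad_predicted = np.array([350_000.0, 300_000.0, 250_000.0])

r2_good = r_squared(actual, good_predicted)
r2_bad = r_squared(actual, bad_predicted)
print(r2_good, r2_bad)
```

The close predictions give an R² of about 0.955, while the reversed predictions give -3, because their squared error is four times larger than that of simply predicting the mean.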

In [15]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [16]:
y_actual = y_test

In [17]:
y_predicted = linear_model.predict(X_test)

In [18]:
mean_absolute_error(y_actual, y_predicted)

6.705649965510293

In [23]:
mean_squared_error(y_actual, y_predicted)

69.59770639694536

### R2 Coefficient

R-squared is a goodness-of-fit measure for linear regression models. It is the proportion of the variance in the dependent variable that the independent variables collectively explain, usually reported on a 0 to 1 (or 0 to 100%) scale, where 1 indicates a perfect fit. It is calculated as:

$R^2 = 1 - \frac{RSS}{TSS}$

Where RSS is the residual sum of squares and TSS is the total sum of squares.

The RSS (Residual Sum of Squares) represents the sum of squared differences between the observed dependent variable values (y) and the predicted values (ŷ) obtained from the linear regression model. Mathematically, it is calculated as follows:

$RSS = \sum (y - \hat{y})^2$

On the other hand, the TSS (Total Sum of Squares) represents the total variation in the dependent variable (y) from its mean (ȳ). It measures the sum of squared differences between each observed dependent variable value (y) and the mean of the dependent variable (ȳ). Mathematically, it is calculated as follows:

$TSS = \sum (y - \bar{y})^2$

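For a least-squares model with an intercept, evaluated on the data it was fitted to, R-squared is between 0 and 100%:

* 0% represents a model that does not explain any of the variation in the response variable around its mean. The mean of the dependent variable predicts the dependent variable as well as the regression model does.
* 100% represents a model that explains all the variation in the response variable around its mean.

On held-out data, R-squared can fall below 0, which means the model predicts worse than the mean of the dependent variable.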


In [24]:
r2_score(y_actual, y_predicted)

0.057995850874791754

In [25]:
# Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1), with n test rows and p features
r2_adj = 1 - (1 - linear_model.score(X_test, y_test)) * (len(y_test) - 1) / (len(y_test) - X.shape[1] - 1)

In [26]:
r2_adj

-0.7177722719342032
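Adjusted R² rescales R² by the number of observations `n` and the number of features `p`, so a model with many features relative to its sample size is penalized. The sketch below wraps the formula from the cell above in a hypothetical helper, `adjusted_r2`, and plugs in the R² printed earlier together with 28 features (from `X.shape`). The figure of 63 test rows is an assumption: it is what a 20% split of 313 rows rounds up to.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# R2 printed above, 28 features, and an assumed 63 test rows
r2_adj_demo = adjusted_r2(0.057995850874791754, n_samples=63, n_features=28)
print(r2_adj_demo)

# The same R2 with far more rows is penalized much less
r2_adj_large_n = adjusted_r2(0.057995850874791754, n_samples=6300, n_features=28)
print(r2_adj_large_n)
```

Under that assumption the helper reproduces the value of about -0.718 shown above. With only 63 rows and 28 features the correction factor `(n - 1) / (n - p - 1)` is 62/34, which is why a small positive R² turns strongly negative.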