In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Machine Learning

Machine learning is a branch of artificial intelligence (AI) that involves developing algorithms and statistical models that enable computers to automatically learn from data without being explicitly programmed. In machine learning, algorithms use data to identify patterns and make decisions with minimal human intervention. Machine learning is used in a wide range of applications, from self-driving cars and facial recognition technology to medical diagnosis and natural language processing.

## Major Approaches of ML

1. **Supervised learning:** This approach involves using a labeled dataset to train a model that can predict output values for new input data. It is called "supervised" because the training data includes the correct answers, and the goal is for the model to learn how to produce the correct output for new, unseen data.

2. **Unsupervised learning:** This approach involves using an unlabeled dataset to train a model that can identify patterns or structure in the data. There is no "correct answer" to learn from, so the model is typically used for tasks like clustering, where similar data points are grouped together.

3. **Reinforcement learning:** This approach involves training a model to make decisions based on rewards or penalties received in response to its actions. The goal is for the model to learn to take actions that maximize its cumulative reward over time. This approach is often used in areas like game-playing and robotics, where an agent must learn to navigate an environment and make decisions based on feedback.
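The difference between the first two approaches can be sketched in a few lines. This is a minimal illustration on synthetic data, not part of the notebook's dataset: the arrays, the noise levels, and the choice of `KMeans` as the clustering algorithm are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Supervised: every input comes with a known target (here y = 3x + 2 plus noise).
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.1, size=100)
supervised = LinearRegression().fit(X, y)
slope = supervised.coef_[0]  # learned from the labels, close to 3

# Unsupervised: only inputs, no targets. KMeans groups similar points together.
blobs = np.vstack([
    rng.normal(0, 0.5, size=(50, 2)),  # first synthetic group
    rng.normal(5, 0.5, size=(50, 2)),  # second synthetic group
])
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(blobs)
```

The supervised model is given `y` during `fit`, while the clustering model only ever sees the inputs and has to discover the two groups itself.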

## Terminology

In [15]:
### Dataset

![image.png](attachment:f9d3cbd8-e467-4423-97e3-f20891aaa385.png)

In [18]:
## Training, Testing, and Validation Sets
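A common convention is to fit the model on a training set, tune choices such as hyperparameters on a validation set, and report final performance on a test set that is touched only once. The sketch below shows one way to get three sets by calling `train_test_split` twice. The 60/20/20 proportions and the toy arrays are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each (illustrative only).
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 1: hold out 20% of the samples as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: split the remaining 80% into training and validation sets.
# 25% of the remaining 80 samples is 20, giving a 60/20/20 split overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```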

### Supervised Learning

#### `Classification`: 
In supervised learning, classification is a type of task where the goal is to predict the categorical label of new observations based on past observations with known labels. The algorithm learns from a labeled dataset, where each data point is associated with a class label, and then predicts the class labels for unseen data.

Classification can be binary, where there are only two classes, or multiclass, where there are more than two classes. Some common algorithms for classification include logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks. These algorithms use different approaches to learn the patterns in the data and make predictions about the classes of new observations.

1. Logistic Regression
2. Naive Bayes
3. Decision Trees
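As a minimal sketch of binary classification, the example below fits scikit-learn's `LogisticRegression` on two synthetic, well-separated groups of points. The data and the two query points are made up for illustration and are unrelated to the dataset used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two synthetic classes centred at (-2, -2) and (2, 2).
X_neg = rng.normal(-2, 1, size=(100, 2))
X_pos = rng.normal(2, 1, size=(100, 2))
X = np.vstack([X_neg, X_pos])
y = np.array([0] * 100 + [1] * 100)  # known class labels

clf = LogisticRegression().fit(X, y)

# Predict the class of two new, unseen observations.
pred = clf.predict(np.array([[-3.0, -3.0], [3.0, 3.0]]))

# Fraction of training points classified correctly.
accuracy = clf.score(X, y)
```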

#### `Regression`:
In supervised learning, regression is a type of task where the goal is to predict a continuous value for new observations based on past observations with known values. The algorithm learns from a labeled dataset, where each data point is associated with a real-valued label, and then predicts the continuous value for unseen data.

Regression tasks can involve predicting a single value (univariate regression) or multiple values (multivariate regression). Some common algorithms for regression include linear regression, polynomial regression, decision tree regression, random forest regression, support vector regression (SVR), and neural network regression.

These algorithms use different mathematical techniques to learn the relationships between the input features and the target variable and make predictions about the continuous values of new observations.

1. Linear Regression

<br>

## Linear Regression

Linear regression is a widely used statistical technique for modeling the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the variables and finds the best-fit line (or hyperplane, when there are several independent variables) that minimizes the sum of squared differences between the predicted and actual values. Linear regression is used both to make predictions and to understand how each independent variable affects the dependent variable.

In simple linear regression, we consider a single independent variable and a single dependent variable. The relationship between the variables can be represented by the equation:

$y = mx + c$

Where:

- `y` is the dependent variable
- `x` is the independent variable
- `c` is the y-intercept (the value of `y` when `x` is 0)
- `m` is the slope (the change in `y` for a unit change in `x`)

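With more than one independent variable, the slope `m` becomes a weight vector $w$ (one weight per feature) and the intercept `c` becomes the bias $b$, so the equation generalizes to:

$y = w^\top x + b$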

The goal is to estimate the values of $w$ and $b$ that best fit the data.
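To make the notation concrete, here is a small NumPy sketch of how a prediction is computed once $w$ and $b$ are known. The weight values and inputs are invented for illustration; each row of `X` is one observation.

```python
import numpy as np

# Hypothetical weights (one per feature) and bias.
w = np.array([2.0, -1.0, 0.5])
b = 4.0

# Two observations with three features each.
X = np.array([
    [1.0, 2.0, 3.0],
    [0.0, 0.0, 0.0],
])

# Matrix form of y = w.x + b, applied to every row at once.
y_pred = X @ w + b
```

For the all-zero row the prediction is just the bias `b`, which matches the interpretation of the intercept as the value of `y` when every input is 0.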

## Ordinary Least Squares (OLS) Estimation

The most common method for estimating the coefficients (`w` and `b`) in linear regression is Ordinary Least Squares (OLS). For simple linear regression, it chooses the values of `w` and `b` that minimize the residual sum of squares (RSS):

$RSS = \sum_i (y_i - b - w x_i)^2$

Setting the partial derivatives of RSS with respect to `b` and `w` to zero gives:

$\frac{\partial RSS}{\partial b} = -2\sum_i (y_i - b - w x_i) = 0$

$\frac{\partial RSS}{\partial w} = -2\sum_i x_i (y_i - b - w x_i) = 0$

Solving these two equations simultaneously yields the estimated coefficients `w` and `b`:

$w = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$

$b = \bar{y} - w \bar{x}$

Where:

- $\bar{x}$ is the mean of the independent variable `x`
- $\bar{y}$ is the mean of the dependent variable `y`

These formulas can be computed efficiently, providing the best-fit line.
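The closed-form solution translates directly into NumPy. The helper `fit_simple_ols` below is a hypothetical name introduced for this sketch, and the five data points are made up. The result is compared against `np.polyfit` as a sanity check.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return slope w and intercept b of the least-squares line y = w*x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # w = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
    w = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # b = y_mean - w * x_mean
    b = y_mean - w * x_mean
    return w, b

# Illustrative data lying roughly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

w, b = fit_simple_ols(x, y)

# Cross-check with NumPy's degree-1 polynomial fit (returns slope, then intercept).
w_ref, b_ref = np.polyfit(x, y, deg=1)
```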

In [25]:
stat = pd.read_csv("clean_stat_data.csv")

In [26]:
stat.head()

Unnamed: 0,matric_gpa_%,accommodation_status,monthly_allowance,scholarship_bursary_2023,study_hours_week,socialising_week,drinks_night,classes_missed_alcohol,modules_failed,in_relationship,...,faculty_Engineering,faculty_Law,faculty_Medicine and Health Services,faculty_Science,year_in_2023_0th Year,year_in_2023_1st Year,year_in_2023_2nd Year,year_in_2023_3rd Year,year_in_2023_4th Year,year_in_2023_Postgraduate
0,-0.365008,-0.395407,-1.095457,-0.360505,-1.155375,-1.205451,-1.407833,-2.507187,-0.753244,-1.148268,...,-0.288175,-0.172062,-0.172062,-0.399073,-0.056614,-0.762108,1.076392,-0.38278,-0.151248,-0.127412
1,1.794677,-0.395407,-0.464006,2.773886,-1.155375,-1.205451,-0.571682,-1.676764,-0.753244,0.870877,...,-0.288175,-0.172062,-0.172062,-0.399073,-0.056614,-0.762108,1.076392,-0.38278,-0.151248,-0.127412
2,-0.365008,-0.395407,-1.095457,-0.360505,-0.362322,-0.498885,-1.407833,-2.507187,-0.753244,0.870877,...,-0.288175,-0.172062,-0.172062,-0.399073,-0.056614,1.312151,-0.929029,-0.38278,-0.151248,-0.127412
3,1.794677,-0.395407,0.167446,-0.360505,-0.362322,0.207681,-1.407833,-0.846341,-0.753244,-1.148268,...,3.47011,-0.172062,-0.172062,-0.399073,-0.056614,-0.762108,1.076392,-0.38278,-0.151248,-0.127412
4,-0.697268,-0.395407,-1.095457,-0.360505,-0.362322,-1.205451,0.26447,-0.015919,-0.077699,0.870877,...,-0.288175,-0.172062,-0.172062,-0.399073,-0.056614,-0.762108,1.076392,-0.38278,-0.151248,-0.127412


In [27]:
# Separate the feature columns from the target column.
X = stat.drop('target', axis=1).to_numpy()
y = stat['target'].to_numpy()

In [29]:
# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [30]:
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

In [31]:
linear_model.coef_

array([ 1.96534281e+00, -5.22989387e-01,  2.39514883e-01, -3.51243860e-01,
        6.52832364e-02, -1.38268481e-01, -2.48539243e-01,  8.03119691e-01,
       -2.90799578e+00,  5.77401861e-02, -2.38650570e-01,  2.80275015e-01,
        3.21393407e+11,  3.21393407e+11, -4.41203015e+11, -5.74949322e+11,
       -9.20408213e+11, -2.31658567e+11, -4.91634723e+11, -3.08777789e+11,
       -3.08777789e+11, -6.36068833e+11,  1.69824101e+12,  1.45078662e+13,
        1.50058523e+13,  1.00469293e+13,  4.44971053e+12,  3.77296162e+12])

In [32]:
linear_model.intercept_

66.40056456542969

In [33]:
X.shape

(313, 28)

#### Performance Metrics
- Mean Absolute Error
- R2 Score
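Both metrics are available in `sklearn.metrics`. The sketch below uses small made-up arrays so the numbers are easy to check by hand. In this notebook the same calls would presumably be applied to `y_test` and `linear_model.predict(X_test)`, but that step is not shown in the original, so treat it as an assumption.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Illustrative true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

# Mean Absolute Error: average of |y_true - y_pred|; lower is better.
mae = mean_absolute_error(y_true, y_pred)

# R2 score: 1 - (residual sum of squares / total sum of squares).
# 1.0 is a perfect fit; 0.0 is no better than always predicting the mean.
r2 = r2_score(y_true, y_pred)

print(f"MAE: {mae:.3f}, R2: {r2:.3f}")
```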
