# Deep Learning Fundamentals Lab 1 -- Introduction to scikit-learn

# Course Overview

![Projekt bez nazwy.png](attachment:79ea6b6d-4833-47e1-a3a8-c8d30bd6b8b7.png)

Welcome to the second installment of our Deep Learning Fundamentals laboratory series, presented by [TheLion.AI](https://www.thelion.ai/) — an interdisciplinary research group specializing in AI-based healthcare solutions. This comprehensive program is designed to equip you with practical skills in implementing deep learning models across various domains, with a special emphasis on natural language processing and computer vision. We like to make our software and teaching materials as accessible as possible. If you like what we do, consider supporting us at [https://buymeacoffee.com/thelionai](https://buymeacoffee.com/thelionai).

The course follows a progressive learning path, starting with foundational concepts and gradually building up to advanced techniques. Each lab session includes a brief overview of key concepts and hands-on coding exercises.

**New notebooks will be added weekly!**

### Syllabus
1. [Introduction to scikit-learn](https://www.kaggle.com/code/basia25/introduction-to-scikit-learn/)
2. [Introduction to linear algebra in PyTorch](https://www.kaggle.com/code/basia25/introduction-to-linear-algebra-in-pytorch/)
3. [Neural network from scratch](https://www.kaggle.com/code/basia25/neural-network-from-scratch/)
4. [Neural network in pure PyTorch](https://www.kaggle.com/code/basia25/neural-network-in-pure-pytorch/)
5. Neural network in PyTorch Lightning
6. Regularization methods
7. Convolutional neural networks
8. State-of-the-art CNNs
9. Image segmentation
10. NLP fundamentals
11. HuggingFace
12. Sentence transformers
13. Explainable AI
14. Image transformer and AI in Healthcare
15. Running experiments in ClearML
16. Creating smart configuration files with Hydra

# Introduction

Welcome to the first laboratory session. In this exercise, we will explore linear regression, a fundamental concept in machine learning and statistics. Linear regression is a supervised learning algorithm used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between these variables and aims to find the best-fitting line that minimizes the sum of squared differences between observed and predicted values.

We will use the diabetes dataset, which contains various physiological measurements and disease progression indicators. This dataset is commonly used in machine learning tutorials and provides a good foundation for understanding regression problems. Through this exercise, you will learn how to implement linear regression using two libraries, scikit-learn and statsmodels, interpret the results, and evaluate model performance.

# Key Concepts
**Linear Regression**

Linear regression models the relationship between variables using a linear equation. In its simplest form (simple linear regression), it uses one independent variable to predict a dependent variable. The equation is:
$$y = b_0 + b_1 x$$

Where:

- $y$ - dependent variable (target)
- $x$ - independent variable (feature)
- $b_0$ - intercept (value of $y$ when $x = 0$)
- $b_1$ - coefficient (slope of the line)

In multiple linear regression, multiple independent variables are used:
$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$
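
As a rough illustration of what "best-fitting line" means, the sketch below recovers the intercept and slope of a simple linear regression with NumPy's least-squares solver. The data are synthetic and the names `x_demo`, `y_demo`, and `design` are made up for this example; they are not part of the lab's dataset.

```python
import numpy as np

# Synthetic data drawn from y = 2 + 3*x plus small noise (illustrative values only)
rng = np.random.default_rng(0)
x_demo = rng.uniform(-1.0, 1.0, size=200)
y_demo = 2.0 + 3.0 * x_demo + rng.normal(scale=0.1, size=200)

# Design matrix: a column of ones (for the intercept b_0) next to the feature x
design = np.column_stack([np.ones_like(x_demo), x_demo])

# The least-squares solution minimizes the sum of squared residuals
coef, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
b0, b1 = coef

print(f"intercept b0 = {b0:.3f}, slope b1 = {b1:.3f}")
```

The estimates should land close to the values used to generate the data (2 and 3), with small deviations caused by the noise.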

**Model Evaluation Metrics**
- Mean Squared Error (MSE): Measures the average squared difference between predicted and actual values. Lower values indicate better performance.
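- R-squared (R²): Represents the proportion of variance in the dependent variable that is predictable from the independent variables. On the training data of a model with an intercept, values range from 0 to 1, with higher values indicating a better fit. On unseen data, R² can be negative when the model predicts worse than simply using the mean of the target.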
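
To make the two metrics concrete, here is a small sketch that computes them by hand on made-up numbers and checks the results against scikit-learn. The arrays `y_true_demo`, `y_hat_demo`, and `y_bad_demo` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and two sets of predictions
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_hat_demo = np.array([2.5, 5.5, 7.0, 8.0])   # reasonable predictions
y_bad_demo = np.array([9.0, 7.0, 5.0, 3.0])   # predictions in reversed order

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true_demo - y_hat_demo) ** 2)

# R^2: 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true_demo - y_hat_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MSE: {mse_manual:.3f}")   # 0.375
print(f"R^2: {r2_manual:.3f}")    # 0.925

# The manual results agree with scikit-learn's implementations
print(np.isclose(mse_manual, mean_squared_error(y_true_demo, y_hat_demo)))
print(np.isclose(r2_manual, r2_score(y_true_demo, y_hat_demo)))

# Predictions worse than the mean of the target give a negative R^2
print(f"R^2 for bad predictions: {r2_score(y_true_demo, y_bad_demo):.1f}")   # -3.0
```

Note that both scikit-learn functions take the true values first and the predictions second.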
  
**Training and Test Sets**

Splitting data into training and test sets helps evaluate model performance on unseen data. Typically, 70-80% of data is used for training, and 20-30% for testing.
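
The sketch below shows one way such a split can look with scikit-learn's `train_test_split` on a toy array. The arrays `features_demo` and `targets_demo`, the 80/20 ratio, and the seed are assumptions made for this illustration, not values prescribed by the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each; row i is [2*i, 2*i + 1]
features_demo = np.arange(20).reshape(10, 2)
targets_demo = np.arange(10)

# Hold out 20% of the samples; random_state makes the shuffle reproducible
f_train, f_test, t_train, t_test = train_test_split(
    features_demo, targets_demo, test_size=0.2, random_state=42
)

print(f_train.shape, f_test.shape)   # (8, 2) (2, 2)

# Features and targets are shuffled together, so rows stay aligned
print(np.array_equal(f_train[:, 0], 2 * t_train))   # True
```

Fixing `random_state` is optional, but it keeps the split identical between runs, which makes results easier to compare.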

**Regression Analysis**

Regression analysis helps understand relationships between variables, make predictions, and assess model significance through statistical measures like p-values and confidence intervals.
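
As a hedged illustration of p-values and confidence intervals, the sketch below fits two simple regressions on synthetic data, one where the target really depends on the feature and one where it does not. It uses `scipy.stats.linregress` rather than the statsmodels API used later in this lab, and all variable names and data are invented for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_samples = 100
x_sig = rng.normal(size=n_samples)

# One target that depends on x_sig (true slope 1.5) and one that is pure noise
y_related = 1.5 * x_sig + rng.normal(size=n_samples)
y_unrelated = rng.normal(size=n_samples)

fit_related = stats.linregress(x_sig, y_related)
fit_unrelated = stats.linregress(x_sig, y_unrelated)

# The p-value tests the null hypothesis that the slope is zero
print(f"related:   slope={fit_related.slope:.3f}, p-value={fit_related.pvalue:.2e}")
print(f"unrelated: slope={fit_unrelated.slope:.3f}, p-value={fit_unrelated.pvalue:.2e}")

# 95% confidence interval for the slope: estimate +/- t_crit * standard error,
# using a t distribution with n - 2 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n_samples - 2)
ci_low = fit_related.slope - t_crit * fit_related.stderr
ci_high = fit_related.slope + t_crit * fit_related.stderr
print(f"95% CI for the related slope: [{ci_low:.3f}, {ci_high:.3f}]")
```

A very small p-value for the related target suggests the slope is unlikely to be zero, and a confidence interval that excludes zero tells the same story from a different angle.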

# Exercise

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error, r2_score

import statsmodels.api as sm
from statsmodels.regression import linear_model

## 1. Data Preparation

In [None]:
# Load diabetes dataset
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

In [None]:
X

In [None]:
diabetes.feature_names

In [None]:
y

### 1.1 Limit the number of features to BMI and blood pressure.

In [None]:
X = # TODO
feature_names = # TODO

### 1.2 Split data into training and test sets with scikit-learn

In [None]:
# Task 1
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = # TODO


## 2. Linear regression in scikit-learn

### 2.1 Add fit and predict

In [None]:
from sklearn.linear_model import LinearRegression

reg_sklearn = # TODO: Fit the model on the training set
y_pred = # TODO: Predict the results for the test data

### 2.2 Compare the predicted values with the true values.

In [None]:
print(f"Mean squared error: {mean_squared_error(# TODO)}")
print(f"Coefficient of determination: {r2_score(# TODO)}")

In [None]:
print("Coefficients: ", reg_sklearn.coef_)
print("Intercept: ", reg_sklearn.intercept_)

In [None]:
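# One panel per selected feature (two features: BMI and blood pressure)
fig, ax = plt.subplots(2)
for i in range(len(feature_names)):
    # True targets and model predictions plotted against the same feature
    ax[i].scatter(X_test[:, i], y_test, label="true")
    ax[i].scatter(X_test[:, i], y_pred, label="predicted")
    ax[i].set_title(feature_names[i])
    ax[i].legend()
fig.tight_layout()
plt.show()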

### 2.3 Modify the plot so that the predicted values are represented by blue stars and the true values are black dots

In [None]:
# TODO:

## 3. Linear regression in statsmodels

In [None]:
reg_sm = linear_model.OLS(y_train, X_train).fit()
reg_sm.summary()

### 3.1 Why does the regression from statsmodels return different results than the regression from scikit-learn? Find the error in the statsmodels implementation and fix it.

In [None]:
# TODO:


### 3.2 How can you evaluate whether the linear regression model returns reasonable results? Analyze the output of the .summary() function. Describe five selected parameters and explain how to interpret their values.

TODO: