# Data Scaling

- **Scaling** puts the features on a comparable range, which helps distance-based and gradient-based ML models train well.

- We do **not** scale the target column y.
- We do **not** scale categorical columns; leave the categorical data unchanged.
  
1. **Normalization** (Min-Max scaler)
    - Scales each feature to a fixed range, by default [0, 1].
    - Used when the data does not have a normal distribution.
    - Used when the data has an unknown distribution.
    - Sensitive to outliers.
    - Min-Max Scaler is commonly used with KNN, SVM, models trained by gradient descent, and neural networks.

    - If the target was scaled, apply the inverse transform to the predictions before computing the MSE of a linear regression, so that the error is in the original units.
    - Normalization does not always preserve the relationships in the data: one extreme value squeezes all other values of that feature into a narrow part of [0, 1].

2. **Standardization** (Standard scaler)
    - Rescales each feature to mean 0 and standard deviation 1; the result is not bounded to a fixed range.
    - Used when the data has a normal distribution.
    - Used when the data has an unknown distribution.
    - Less sensitive to outliers than Min-Max scaling.
    - StandardScaler is used with Linear Regression, KNN classification, Logistic Regression, PCA, SVM, and clustering.

    - Shifts and rescales the distribution without changing its shape.
    - Standardization preserves the relationships between the data (for example, correlations between features).

**MinMaxScaler vs StandardScaler**
- StandardScaler is preferred when you have outliers.
- MinMaxScaler is not preferred for data with outliers.
- MinMaxScaler is often used for deep learning models, especially with activation functions like **sigmoid** or **tanh**.



In [30]:
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Original Data
data = np.array([[10, 100], [20, 200], [30, 300], [40, 400], [50, 500]])

# Min-Max Scaling
min_max_scaler = MinMaxScaler()
scaled_data = min_max_scaler.fit_transform(data)

# Standardization
standard_scaler = StandardScaler()
standardized_data = standard_scaler.fit_transform(data)

# Print Results
print("Original Data:\n", data)
print("\nMin-Max Scaled Data:\n", scaled_data)
print("\nStandardized Data:\n", standardized_data)

Original Data:
 [[ 10 100]
 [ 20 200]
 [ 30 300]
 [ 40 400]
 [ 50 500]]

Min-Max Scaled Data:
 [[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [0.75 0.75]
 [1.   1.  ]]

Standardized Data:
 [[-1.41421356 -1.41421356]
 [-0.70710678 -0.70710678]
 [ 0.          0.        ]
 [ 0.70710678  0.70710678]
 [ 1.41421356  1.41421356]]
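
The example above uses evenly spaced data. As a rough sketch of how the two scalers behave when one value is extreme, the snippet below scales a small made-up feature (the numbers are hypothetical, chosen only for illustration).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with one extreme value (100)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

min_max_scaled = MinMaxScaler().fit_transform(values)
standardized = StandardScaler().fit_transform(values)

# Min-Max: the four ordinary values end up close to 0, the outlier maps to 1
print(min_max_scaled.ravel())

# Standardization: mean 0 and standard deviation 1, values may fall outside [-1, 1]
print(standardized.ravel())
```

Note that the standardized outlier lands well above 1, which shows that StandardScaler does not bound the data to a fixed interval.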


![a84e9924-b0f7-4998-a6d4-f9084e59d303.jpg](attachment:2e6ec048-73ba-4295-bf02-dd13202afa63.jpg)

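- .fit() learns the scaling parameters and is called only on the training data.
- .transform() applies those parameters; it is used on both the training data and the test data (.fit_transform() does both steps at once for the training data).
- X_train = scaler.fit_transform(X_train)
- X_test = scaler.transform(X_test)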
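
A minimal sketch of this pattern, assuming a made-up feature matrix `X` and target `y` (both hypothetical) and a StandardScaler; the same calls work with MinMaxScaler.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix and target, for illustration only
X = np.arange(40, dtype=float).reshape(20, 2)
y = X[:, 0] * 2.0 + 1.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data, then scale it
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std on the test data

print(scaler.mean_)  # per-feature means learned from X_train only
```

Fitting on the training data alone keeps information about the test set out of the preprocessing step.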

# Linear Regression

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train,y_train)

- Check the R2 score:

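model.score(X_train, y_train)

The closer the R2 score is to 1, the closer the predicted y is to the original y.

- model.coef_ gives the coefficients b1, b2, ..., bn of y = b0 + b1*x1 + ... + bn*xn (the slope m in y = mx + b when there is a single feature). The intercept is stored in model.intercept_.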
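
As a small sketch of these attributes, the snippet below fits a line to hypothetical data generated from y = 3x + 2, so the expected slope and intercept are known in advance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data that follows y = 3 * x + 2 exactly
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = 3.0 * X_train[:, 0] + 2.0

model = LinearRegression()
model.fit(X_train, y_train)

r2 = model.score(X_train, y_train)  # R2 on the training data
print(r2, model.coef_, model.intercept_)
```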

![Screenshot 2025-01-15 134636.png](attachment:7aab874d-3fa2-4f0e-8292-5548bdde2ad6.png)

- Predict the slope:

![Screenshot 2025-01-15 135234.png](attachment:a1d67cfe-0ed8-4bfe-8f8d-8989d607f39c.png)

# KNN-Classification

![Screenshot 2025-01-15 140543.png](attachment:78fc0089-17da-441b-8e27-e67c6755e602.png)

- This means that the values are mapped only to a few distinct points, so they behave like categories.

![Screenshot 2025-01-15 140905.png](attachment:9502e8b2-5135-4c98-bfea-167c53b8b160.png) 

- 'Personal Loan' is the target column. **What type of classification is used?** We check which categories it has: there are two classes (0 and 1), so this is binary classification.

![Screenshot 2025-01-15 141854.png](attachment:325730f4-8e20-45bc-bd02-d01d5a8aebde.png)

![Screenshot 2025-01-15 142032.png](attachment:be5c101f-ee6d-4ca2-a1ff-cd155ca64f5f.png)

- Because the target values are 0 and 1, we import classification_report, confusion_matrix, and f1_score to evaluate the classifier.

![Screenshot 2025-01-15 142337.png](attachment:7aab3e3d-ca5e-4c84-ac4d-d19601590caa.png)

The confusion matrix tells us that 895 of the actual 0s were correctly classified as 0 and 5 were mistakenly classified as 1. 39 of the actual 1s were correctly classified as 1 and 61 were mistakenly classified as 0. With 1 as the positive class, scikit-learn lays the matrix out as:
[TN, FP,
FN, TP ]
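
To make the layout concrete, the sketch below rebuilds label arrays from the four counts quoted above (not from the original dataset, which is not available here) and recomputes the matrix and the usual metrics.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Labels rebuilt from the counts in the text, not the real data:
# 895 actual 0 -> predicted 0, 5 actual 0 -> predicted 1,
# 61 actual 1 -> predicted 0, 39 actual 1 -> predicted 1
y_true = np.array([0] * 900 + [1] * 100)
y_pred = np.array([0] * 895 + [1] * 5 + [0] * 61 + [1] * 39)

cm = confusion_matrix(y_true, y_pred)  # rows: actual class, columns: predicted class
tn, fp, fn, tp = cm.ravel()
print(cm)

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)
print(precision, recall, f1)
```

With these counts the recall for class 1 is 39 / 100, which is why accuracy alone would be a misleading summary for this target.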

![Screenshot 2025-01-15 142721.png](attachment:d3eabb6f-c3a0-4835-8847-0a4c91c2af47.png)

![Screenshot 2025-01-15 142825.png](attachment:01478c83-e407-4234-9a38-414fbba12123.png)

![Screenshot 2025-01-15 142923.png](attachment:b243a2fc-bed2-494a-9c94-aa0d0b616353.png)

![Screenshot 2025-01-15 143047.png](attachment:014849b2-534d-4d42-ba25-3bd83d499fcf.png)

![Screenshot 2025-01-15 143101.png](attachment:645f3a66-3faf-4d54-adab-5cd08a303146.png)

# Logistic Regression

![Screenshot 2025-01-15 180038.png](attachment:9f066d04-8ada-413f-a6d1-d1b3097fde88.png)

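- Just like the KNN classifier, it predicts class labels (0s and 1s). Internally it estimates the probability of class 1 and assigns the label 1 when that probability is above 0.5.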
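
A minimal sketch of this behaviour, assuming a tiny hypothetical one-feature dataset; `predict` returns the labels and `predict_proba` returns the estimated class probabilities behind them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: small feature values belong to class 0, large ones to class 1
X = np.array([[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

labels = clf.predict(X)               # hard 0/1 predictions
probabilities = clf.predict_proba(X)  # column 0: P(class 0), column 1: P(class 1)

print(labels)
print(probabilities[:, 1].round(3))
```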

![Screenshot 2025-01-15 182702.png](attachment:3477b6a8-d1b9-4a67-be6b-28e473e80c14.png)

![Screenshot 2025-01-15 183112.png](attachment:adc9e5e2-92e6-44a0-9842-c52de3ae1f0a.png)

![Screenshot 2025-01-15 183801.png](attachment:9df6c428-6b3e-4ce6-8f7e-cf44dc339ccb.png)

# Exercise 6

![Screenshot 2025-01-15 184337.png](attachment:bce35c9c-1b05-4dd2-82fb-4d3158c69c04.png)

![Screenshot 2025-01-15 190758.png](attachment:704f21f5-c006-41eb-b5f6-322ecb633643.png)

![Screenshot 2025-01-15 190855.png](attachment:25ee5185-bb35-43df-be30-37b0a3e8974a.png)

 ![Screenshot 2025-01-15 191106.png](attachment:e556fa2c-73d6-4c63-ac6b-9abe03471064.png)

![Screenshot 2025-01-15 191203.png](attachment:c9a45aff-3c54-4c2b-9a0a-32ccf30c4777.png)

## Polynomial Regression

![Screenshot 2025-01-15 191719.png](attachment:660a2a9d-348a-4138-b8a5-194ebc531018.png)

 1. Split the dataset

![Screenshot 2025-01-15 192114.png](attachment:2679cf83-2505-416b-ad03-d8acbb5da190.png)

2. Scale the dataset

![Screenshot 2025-01-15 192530.png](attachment:f0679c1b-4a25-4be2-96fb-378141c0e887.png)

 3. Call the Polynomial Regressor

In [59]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
# poly.fit_transform(X) expands X into all polynomial terms up to the given degree
# 2 is the degree of the polynomial
# the shape in the graph is like x^2
# degree 2 or degree 4 is a good choice for this data
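
The three steps above (split, scale, polynomial features) can be sketched end to end as follows. The data is hypothetical and noise-free, generated from a known quadratic, so this is an illustration of the workflow and not the notebook's own dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical noise-free quadratic data: y = 2 * x^2 - x + 5
X = np.linspace(-3, 3, 60).reshape(-1, 1)
y = 2.0 * X[:, 0] ** 2 - X[:, 0] + 5.0

# 1. Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Scale the features (fit on the training data only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Expand to polynomial terms, then fit an ordinary linear regression on them
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)

model = LinearRegression().fit(X_train_poly, y_train)
test_mse = mean_squared_error(y_test, model.predict(X_test_poly))
print(test_mse)
```

Polynomial regression here is simply a linear regression on the expanded columns (1, x, x^2), which is why `LinearRegression` is reused.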

![Screenshot 2025-01-15 195542.png](attachment:403f694e-0560-4b3f-a44a-213b72353373.png)

![Screenshot 2025-01-15 195715.png](attachment:e8a7325a-b261-4f88-9a3d-137e73b006fe.png)

In [63]:
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
# y = 1 * x_0 + 2 * x_1 + 3
y = np.dot(X, np.array([1, 2])) + 3
reg = LinearRegression().fit(X, y)

![Screenshot 2025-01-15 202551.png](attachment:5f4e7238-1326-4cbd-bb94-8c4cc67b875e.png)

- ↑ we have only two values

![Screenshot 2025-01-15 203055.png](attachment:8b782fed-e916-4c96-a507-10535322138a.png)

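- The MSE is a large value because it is measured in the squared units of the unscaled target. Scaling only the features does not change the predictions of an ordinary linear regression; the MSE becomes numerically smaller only if the target is scaled, and then it is no longer in the original units.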
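
One way to check how scaling affects the MSE of an ordinary linear regression is to compare three fits on the same hypothetical data: raw features, standardized features, and a rescaled target. The data, coefficients, and the factor of 1000 below are all assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: two features on very different scales, large target values
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 10_000, 200)])
y = 50_000 + 3_000 * X[:, 0] + 2.5 * X[:, 1] + rng.normal(0, 500, 200)

# Fit on the raw features
mse_raw = mean_squared_error(y, LinearRegression().fit(X, y).predict(X))

# Fit on standardized features, same target
X_scaled = StandardScaler().fit_transform(X)
mse_scaled_features = mean_squared_error(
    y, LinearRegression().fit(X_scaled, y).predict(X_scaled)
)

# Dividing y by 1000 shrinks the MSE by 1000^2, but only because the units changed
y_k = y / 1000.0
mse_scaled_target = mean_squared_error(
    y_k, LinearRegression().fit(X, y_k).predict(X)
)

print(mse_raw, mse_scaled_features, mse_scaled_target)
```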

**How to find the best degree?** Which model will give the best result?

![Screenshot 2025-01-15 211443.png](attachment:d1eb0f25-e36b-4fdf-a78a-1be9bb3e9b8b.png)

- If you read the graph below, all the powers x^n (x^2, x^4, ...) give the same results, so any of them is fine; even the linear model is fine.

![Screenshot 2025-01-15 211711.png](attachment:a5afe407-5099-41d1-aae4-c80360dc0f2c.png)

- In the graph below, the MSE is the same from degree 2 up to 10, and from 4 to 10 the relation is almost linear. This means that degree 4 (optionally up to 10) would be a good fit.

![Screenshot 2025-01-15 212108.png](attachment:866bb243-c3f9-4322-8001-6f3dfedfb65b.png)
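
A degree sweep of this kind could be sketched as follows. The data is synthetic (a noisy quadratic) and the degree range 1 to 6 is an arbitrary choice for the illustration, so the numbers will not match the graphs above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: y = x^2 plus a little noise
X = rng.uniform(-2, 2, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

test_mse = {}
for degree in range(1, 7):
    # Scale, expand to the given degree, then fit a linear regression
    model = make_pipeline(
        StandardScaler(), PolynomialFeatures(degree), LinearRegression()
    )
    model.fit(X_train, y_train)
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))

for degree, mse in test_mse.items():
    print(degree, round(mse, 4))
```

When several degrees give nearly the same test MSE, the smallest of them is usually the safer pick because it is less prone to overfitting.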

### Ridge Regularization & K-Fold

![Screenshot 2025-01-16 134424.png](attachment:6fa2f1dd-6ec7-4b87-a0bc-8cea2c70e3d6.png)

![Screenshot 2025-01-16 141751.png](attachment:b16d42bd-f076-4905-8266-52e1689b8e71.png)

![Screenshot 2025-01-16 142010.png](attachment:5903b0f3-50e9-46de-a067-6eee54e5e122.png)

![Screenshot 2025-01-16 142245.png](attachment:2aae4287-b054-459c-b768-8a0f0b07c1c6.png)
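
As an independent sketch of how Ridge regression can be evaluated with K-fold cross-validation in scikit-learn: the data, the three `alpha` values, and the choice of 5 folds below are assumptions for illustration and are not taken from the screenshots.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical linear data with three features and a little noise
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

mean_r2 = {}
for alpha in (0.1, 1.0, 10.0):
    # The scaler sits inside the pipeline so it is refit on each training fold
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    scores = cross_val_score(model, X, y, cv=kfold, scoring="r2")
    mean_r2[alpha] = scores.mean()

for alpha, score in mean_r2.items():
    print(alpha, round(score, 4))
```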

## Decision Trees for Classification

![Screenshot 2025-01-16 142747.png](attachment:4b6348ff-8e6c-4d3a-8a61-98463f3c0f95.png)

![Screenshot 2025-01-16 152108.png](attachment:40f2a2ed-2b7f-439d-aea4-139262031a99.png)

 - Optimize the decision tree performance:

![Screenshot 2025-01-16 154045.png](attachment:754953fa-7c21-4517-aeaa-306de9351daf.png)
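
One common way to tune a decision tree is to limit its depth and compare test accuracy. The sketch below assumes that approach; the synthetic dataset from `make_classification` and the list of depths are hypothetical choices, not the notebook's own settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary classification data
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

accuracy_by_depth = {}
for max_depth in (1, 2, 3, 5, None):  # None lets the tree grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X_train, y_train)
    accuracy_by_depth[max_depth] = tree.score(X_test, y_test)

for max_depth, accuracy in accuracy_by_depth.items():
    print(max_depth, round(accuracy, 3))
```

An unrestricted tree typically fits the training data perfectly, so the comparison has to be made on held-out data.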

## Decision Trees for Regression

 ![Screenshot 2025-01-16 173432.png](attachment:29420e48-20a2-44f1-8915-c36b1d29dc5a.png)

![Screenshot 2025-01-16 173600.png](attachment:c7f04cc9-7c59-4af1-bd75-37a0d84481f6.png)

![Screenshot 2025-01-16 173806.png](attachment:1d79ddbd-879a-47c1-8e95-b1478cc281cb.png)

![Screenshot 2025-01-16 174139.png](attachment:7c4502fc-0d2c-4842-8302-9b49f6c23905.png)

- The graph looks like this because we have only one value for X_train