
# **Sklearn Regression**

### Scikit-Learn (sklearn) is one of the most popular machine learning libraries in Python.

**It provides simple and efficient tools for:**

1. Data preprocessing (cleaning, scaling, encoding).
2. Splitting data into training and testing sets.
3. Machine learning models (classification, regression, clustering).
4. Model evaluation and metrics.
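
As a rough illustration of how these four tool groups fit together, the sketch below chains scaling, splitting, a model, and a metric. The synthetic data from `make_regression` and all parameter values are assumptions chosen for the example, not part of this notebook's dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumed for illustration only)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 2. Splitting: hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 1 + 3. Preprocessing and model combined in one pipeline
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

# 4. Evaluation on data the model has not seen
score = r2_score(y_test, pipeline.predict(X_test))
print(f"R^2 on the test set: {score:.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, which avoids leaking information from the test set.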


### What is Regression in Scikit-Learn?

*   Regression is a type of supervised machine learning where the goal is to predict a continuous numerical value (not a category).

*   Example: Predicting house prices, temperature, sales, or exam scores.
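
To make "continuous numerical value" concrete, here is a minimal sketch with made-up numbers: hours studied as the single feature and an exam score as the target. The data follows an exact straight line purely to keep the result easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: hours studied (feature) and exam score (target)
hours = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
scores = 10.0 * hours.ravel() + 40.0  # exact linear relation, for illustration

toy_model = LinearRegression()
toy_model.fit(hours, scores)

# The prediction for 6 hours is a real number, not a class label
predicted = toy_model.predict(np.array([[6.0]]))
print(predicted)
```

Because the toy data is perfectly linear, the prediction lands on the line `10 * hours + 40`, up to floating-point rounding.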

### Why use Scikit-Learn for Regression?


*   Easy to use: a model can be trained in just a few lines of code.
*   Built-in estimators such as `LinearRegression`, `Ridge`, and `Lasso`; polynomial regression is available by combining `PolynomialFeatures` with a linear model.
*   Comes with helper functions to split data, train models, and evaluate results.
*   Integrates well with NumPy and pandas.
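
The sketch below shows how several of these regressors share the same `fit` / `score` interface. The `alpha` values and the polynomial degree are arbitrary assumptions for illustration, not tuned settings, and the variable names are hypothetical.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Every estimator exposes the same interface, so they can be swapped freely
candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),   # alpha chosen arbitrarily
    "lasso": Lasso(alpha=0.1),   # alpha chosen arbitrarily
    # Polynomial regression = polynomial feature expansion + a linear model
    "polynomial (degree 2)": make_pipeline(
        PolynomialFeatures(degree=2), LinearRegression()
    ),
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    scores[name] = estimator.score(X_test, y_test)  # R^2 on held-out data
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

For regressors, `score` returns R^2, so the printed values are directly comparable across the four models.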







In [32]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_diabetes

# Load our dataset
diabetes = load_diabetes()

# Convert the data to a DataFrame and give each column its proper feature name
df = pd.DataFrame(diabetes.data , columns=diabetes.feature_names)

# Add the target
df['target'] = diabetes.target

# Show the first five rows
df.head()

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|-----|-----|-----|----|----|----|----|----|----|----|--------|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |


### Split data into features (x) and target (y)

In [36]:
x = df.drop('target' , axis=1)   # x = all features (drop the target column)
y = df['target']                 # y = the target column we want to predict
x.shape, y.shape                 # check the shapes of x and y

((442, 10), (442,))

### Split data into train and test

In [37]:
from sklearn.model_selection import train_test_split

# Split the data: 80% for training and 20% for testing (a fixed random_state makes the split reproducible)
x_train , x_test , y_train , y_test = train_test_split(x , y , test_size=0.2 , random_state=42)

# Print the shapes of the training and testing sets
print("Training set shape : ", x_train.shape , y_train.shape) # remaining rows: 442 - 89 = 353
print("Testing set shape : ", x_test.shape , y_test.shape)    # 0.2 * 442 = 88.4, rounded up to 89

Training set shape :  (353, 10) (353,)
Testing set shape :  (89, 10) (89,)
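
The split sizes and the effect of `random_state` can be checked with a small sketch. It splits a plain array of row indices rather than the DataFrame, which is an assumption made to keep the example self-contained.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 442
indices = np.arange(n_samples)  # stand-in for the rows of the dataset

# Two splits with the same random_state
first_train, first_test = train_test_split(indices, test_size=0.2, random_state=42)
second_train, second_test = train_test_split(indices, test_size=0.2, random_state=42)

# The test size is rounded up; the training set takes the remaining rows
print(len(first_train), len(first_test))
print(math.ceil(0.2 * n_samples))

# Same seed, same shuffle: both calls select identical rows
print(np.array_equal(first_test, second_test))
```

Changing or omitting `random_state` would generally give a different shuffle, and therefore slightly different evaluation numbers on each run.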


In [38]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train , y_train)  # train the model on training data

In [39]:
y_pred = model.predict(x_test) # Predict on the test set
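
To see what the fitted model has learned, one option is to inspect its `coef_` and `intercept_` attributes. The sketch below refits the same model from scratch so that it runs on its own; the names `features`, `fitted`, `coefficients`, and `manual` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
features = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
X_train, X_test, y_train, y_test = train_test_split(
    features, diabetes.target, test_size=0.2, random_state=42
)
fitted = LinearRegression().fit(X_train, y_train)

# One learned weight per feature, labelled with the column names
coefficients = pd.Series(fitted.coef_, index=features.columns).sort_values()
print("Intercept:", fitted.intercept_)
print(coefficients)

# A prediction is the intercept plus the weighted sum of the feature values
manual = fitted.intercept_ + X_test.to_numpy() @ fitted.coef_
print(np.allclose(manual, fitted.predict(X_test)))
```

The final check confirms that `predict` for a linear model is the intercept plus a dot product between the features and the coefficients.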

### Evaluate the model (check performance)

In [40]:
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, y_pred)   # average squared prediction error (lower is better)
r2 = r2_score(y_test, y_pred)              # share of target variance explained (1.0 is a perfect fit)

print("Mean Squared Error:", mse)
print("R^2 Score:", r2)

Mean Squared Error: 2900.193628493482
R^2 Score: 0.4526027629719195
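
For reference, both metrics can be reproduced by hand from their definitions, MSE as the mean of the squared residuals and R^2 as one minus the ratio of residual to total sum of squares. The four values below are hypothetical numbers chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values (not from the diabetes data)
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.0, 8.0, 9.5])

residuals = actual - predicted

# MSE: mean of the squared residuals
mse_manual = np.mean(residuals ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(actual, predicted))
print(r2_manual, r2_score(actual, predicted))
```

Here the squared residuals sum to 1.5 over four points, so the MSE is 0.375, and with a total sum of squares of 20 the R^2 is 0.925.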


In [41]:
# Predict on the test data (same predictions as in the previous step)
y_pred = model.predict(x_test)

# R^2: share of the target's variance explained by the model (not a classification accuracy)
r2 = r2_score(y_test, y_pred)

# RMSE: square root of MSE, expressed in the same units as the target
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("R^2 Score:", r2)
print("RMSE (Root Mean Squared Error):", rmse)

R^2 Score: 0.4526027629719195
RMSE (Root Mean Squared Error): 53.85344583676593
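
An RMSE value is easier to judge next to a simple baseline. The sketch below compares the linear model with a `DummyRegressor` that always predicts the training mean. The comparison is an illustrative addition, not part of the original notebook, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: ignore the features and always predict the mean training target
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

rmse_baseline = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))
rmse_linear = np.sqrt(mean_squared_error(y_test, linear.predict(X_test)))

print("Baseline RMSE:", rmse_baseline)
print("Linear regression RMSE:", rmse_linear)
```

A model whose RMSE is not clearly below the baseline is learning little from the features; an R^2 near zero expresses the same thing.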
