# **Diabetes Progression Prediction using Linear Regression**

## **Problem Statement**
The **Diabetes Dataset** is one of the datasets available in `sklearn`.  
It consists of **10 physiological variables** (age, sex, body mass index, average blood pressure, and six blood serum measurements) measured on **442 patients**, along with a **quantitative measure of disease progression one year after baseline**.

You are given a **training dataset (CSV files)** containing `X_train` and `Y_train` data.  
Your task is to **train a Linear Regression model** and use it to make predictions on `X_test`.

---

## **Instructions**
1. Use **Linear Regression** from `scikit-learn` as the training algorithm.
2. Load the CSV files using **NumPy’s `genfromtxt()`** function.
3. Save the predictions using **NumPy’s `savetxt()`** function.
4. **Submission Format:**
   - Submit a CSV file with **only the predictions** for `X_test`.
   - The file **should not contain headers** and should have **only one column**.
   - Ensure that the **prediction values are rounded to 5 decimal places**.
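
The steps above can be sketched end to end as follows. This is a minimal sketch, not the official solution: the real CSV files are not reproduced here, so the example writes hypothetical stand-in files (`train_demo.csv`, `test_demo.csv`, `pred_demo.csv`) from sklearn's bundled copy of the dataset, and it assumes the target is the last column of the training file.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Stand-in files (hypothetical names): built from sklearn's copy of the data
# so the sketch runs without the actual competition CSVs.
X, y = load_diabetes(return_X_y=True)
workdir = tempfile.mkdtemp()
train_path = os.path.join(workdir, "train_demo.csv")
test_path = os.path.join(workdir, "test_demo.csv")
pred_path = os.path.join(workdir, "pred_demo.csv")
np.savetxt(train_path, np.column_stack([X[:331], y[:331]]), delimiter=",")
np.savetxt(test_path, X[331:], delimiter=",")

# 1. Load the CSVs. Assumption: the last training column is the target.
train = np.genfromtxt(train_path, delimiter=",")
X_train, y_train = train[:, :-1], train[:, -1]
X_test = np.genfromtxt(test_path, delimiter=",")

# 2. Train the model and predict on the test features.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# 3. Save one prediction per line, no header, 5 decimal places.
np.savetxt(pred_path, y_pred, delimiter=",", fmt="%.5f")

# Read the submission back to check its shape.
saved = np.genfromtxt(pred_path, delimiter=",")
print(X_train.shape, X_test.shape, saved.shape)  # (331, 10) (111, 10) (111,)
```

With the real files, only the three paths would change; the load, fit, predict, and save calls stay the same.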

---

## **Evaluation Criteria**
- Your submission will be evaluated based on the **coefficient of determination (R² score)**.
- A **higher R² score** indicates better model performance.
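
To make the metric concrete, here is a small sketch that computes R² by hand as 1 - SS_res / SS_tot. The helper name `r2` and the sample values are made up for illustration; `sklearn.metrics.r2_score` computes the same quantity.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([100.0, 150.0, 200.0, 250.0])     # made-up targets

perfect = r2(y_true, y_true)                        # exact predictions
baseline = r2(y_true, np.full(4, y_true.mean()))    # always predict the mean
worse = r2(y_true, y_true[::-1])                    # predictions in reverse order

print(perfect, baseline, worse)  # 1.0 0.0 -3.0
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean, and the score can go negative when the predictions are worse than that baseline.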

---

## **Hints**
- Use `LinearRegression` from `sklearn.linear_model`.
- Ensure data is correctly loaded and formatted before training.
- Use `np.round(Y_pred, 5)` to round predictions to **5 decimal places**.
- Use `np.savetxt("Y_pred.csv", Y_pred, delimiter=",", fmt="%.5f")` to save predictions correctly.
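
The two formatting hints above do different jobs: `np.round` changes the values in memory, while `fmt="%.5f"` controls the text that ends up in the file (without it, `np.savetxt` defaults to `%.18e` scientific notation). The sketch below uses made-up prediction values and a hypothetical output path to show what the saved file looks like.

```python
import os
import tempfile

import numpy as np

y_pred = np.array([105.529751234, 178.744, 79.3828849])  # made-up predictions

# Hypothetical output path inside a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "Y_pred_demo.csv")
np.savetxt(out_path, np.round(y_pred, 5), delimiter=",", fmt="%.5f")

with open(out_path) as f:
    lines = f.read().splitlines()

print(lines)  # ['105.52975', '178.74400', '79.38288']
```

Each line holds a single value with exactly five decimals and there is no header row, which matches the submission format.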

---

🚀 **Good Luck!**

In [3]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LinearRegression  # regression algorithm used to train the model
from sklearn.datasets import load_diabetes  # loads the bundled diabetes dataset
from sklearn.model_selection import train_test_split  # splits data into training and test sets
from sklearn.metrics import r2_score  # R² metric for evaluating model performance

In [4]:
# importing the data set
diabetes = load_diabetes()
# diabetes

In [5]:
# diabetes.data holds the input values used to predict the target variable.
x = diabetes.data      # independent variables (features)
# Each value in y is the disease progression score for one patient.
y = diabetes.target    # dependent variable (target)

In [6]:
data = pd.DataFrame(x,columns = diabetes.feature_names)
data['target'] = y
data.head(5)

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |


In [83]:
train = np.genfromtxt('train_data.csv', delimiter=',')
print(train.shape)

(331, 11)


In [93]:
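# Features are every column except the last; the target is the last column.
x_train = train[:, :-1]
print(x_train.shape)
y_train = train[:, -1]
print(y_train.shape)

# Create the (still unfitted) linear regression model.
model = LinearRegression()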

(331, 10)
(331,)


In [95]:
model.fit(x_train,y_train)

In [101]:
test = np.genfromtxt('Test_diabaties.csv',delimiter=',')
test.shape

(111, 10)

In [103]:
y_predict = model.predict(test)
y_predict = np.round(y_predict,5)

In [104]:
y_predict

array([105.52975, 105.80392, 178.60857,  79.38288,  52.95869,  98.87264,
       150.71755,  34.86515, 113.13536, 161.50184, 135.86156,  94.71592,
       138.48094, 141.37442, 158.76863, 171.65289, 106.44836, 103.9266 ,
        95.38694, 167.40118, 166.53426, 101.53465, 252.45036, 147.02259,
       214.78907, 161.27557, 210.61315,  71.78092, 189.65032, 206.61343,
       219.98643, 168.80193, 116.84846, 178.744  ,  77.03247,  59.54633,
       111.56738, 156.95187, 154.59591, 198.94457, 115.53884, 153.46699,
        84.9618 , 113.70337, 142.14156, 147.3104 ,  82.78072,  77.89237,
       128.99006, 261.58712, 213.31188, 243.98791, 167.68132, 183.69712,
       166.85927, 202.1144 , 220.39236, 172.40288, 176.60898, 109.04657,
       276.3779 ,  90.99942, 289.37221, 119.56253,  75.45688, 180.78599,
       146.62093, 156.42382,  41.07904, 247.90645, 207.99121,  90.09641,
       222.2417 , 189.86378, 182.26954, 164.39881, 190.27067, 105.44868,
       199.86136, 245.79639, 123.20281, 119.41032, 

### **Understanding train_test_split() Parameters**

- **X**: Feature variables (independent variables, input data).
- **Y**: Target variable (dependent variable, output labels).
- **test_size=0.2**: 20% of the data is used for testing, while 80% is used for training.
- **random_state=42**: Ensures reproducibility (so the same split happens every time you run the code).
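
As a quick illustration of these parameters, the sketch below splits sklearn's bundled diabetes data (used here as a stand-in for whatever `X` and `Y` you have) and checks that repeating the call with the same `random_state` gives the same split.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, Y = load_diabetes(return_X_y=True)  # 442 samples, 10 features

# 20% of the rows go to the test set, the remaining 80% to the training set.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (353, 10) (89, 10)

# Same seed, same shuffle: the second call reproduces the first split.
X_train_again, _, _, _ = train_test_split(X, Y, test_size=0.2, random_state=42)
same_split = np.array_equal(X_train, X_train_again)
print(same_split)  # True
```

The test set gets 89 rows rather than 88 because the fractional size (442 × 0.2 = 88.4) is rounded up.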


#### Why Use train_test_split()?
- ✅ Helps detect overfitting (the model is evaluated on data it did not see during training).
- ✅ Estimates generalization (the test score approximates how the model performs on new data).
- ✅ Makes training/testing separation easy (one call shuffles and splits the data).
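
One way to act on these points is to compare R² on the training split with R² on the held-out split; a large gap between the two would suggest overfitting. The sketch below assumes sklearn's bundled diabetes data and the same `test_size=0.2`, `random_state=42` settings described above.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, Y = load_diabetes(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

# Fit only on the training split so the test rows stay unseen.
model = LinearRegression().fit(X_train, Y_train)

r2_train = r2_score(Y_train, model.predict(X_train))  # fit on seen data
r2_test = r2_score(Y_test, model.predict(X_test))     # fit on unseen data

print(f"train R2 = {r2_train:.3f}, test R2 = {r2_test:.3f}")
```

The test score is the one to trust as an estimate of how the model would do on new patients.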

