# ***Building Our First Machine Learning Model***

In the previous project, we cleaned, pre-processed, and scaled our data so that it is ready for training a machine learning model. In this project, we will build our first machine learning model and use it to predict the value of charges for new people.

## ***Step 01: Importing Dataset***

First, let's import our cleaned dataset.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("final_df.csv")

df.head(10)

| Unnamed: 0 | age | is_female | bmi | children | is_smoker | charges | region_southeast | bmi_category_Obese |
|---|---|---|---|---|---|---|---|---|
| 0 | -1.440418 | 1 | -0.517949 | -0.909234 | 1 | 16884 | 0 | 0 |
| 1 | -1.511647 | 0 | 0.462463 | -0.079442 | 0 | 1725 | 1 | 1 |
| 2 | -0.79935 | 0 | 0.462463 | 1.580143 | 0 | 4449 | 1 | 1 |
| 3 | -0.443201 | 0 | -1.33496 | -0.909234 | 0 | 21984 | 0 | 0 |
| 4 | -0.514431 | 0 | -0.354547 | -0.909234 | 0 | 3866 | 0 | 0 |
| 5 | -0.585661 | 1 | -0.844753 | -0.909234 | 0 | 3756 | 1 | 0 |
| 6 | 0.482785 | 1 | 0.462463 | -0.079442 | 0 | 8240 | 1 | 1 |
| 7 | -0.158282 | 1 | -0.517949 | 1.580143 | 0 | 7281 | 0 | 0 |
| 8 | -0.158282 | 0 | -0.191145 | 0.750351 | 0 | 6406 | 0 | 0 |
| 9 | 1.480002 | 1 | -0.844753 | -0.909234 | 0 | 28923 | 0 | 0 |


As you can see, the data is cleaned and scaled, as we did in our earlier project.
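
The `Unnamed: 0` column in the preview looks like a row index that was saved along with the CSV. Assuming that is the case, it carries no information about charges and is best removed before building the feature matrix, otherwise it would be treated as an input feature. The sketch below uses a tiny made-up frame (not the real `final_df.csv`) to show where such a column comes from and two common ways to get rid of it.

```python
import io
import pandas as pd

# Hypothetical stand-in for final_df.csv: a small frame saved with its index.
original = pd.DataFrame({"age": [-1.44, -1.51], "charges": [16884, 1725]})
csv_text = original.to_csv()  # the index is written out by default

# Reading the file back naively produces an extra "Unnamed: 0" column.
naive = pd.read_csv(io.StringIO(csv_text))
print(naive.columns.tolist())

# Option 1: tell pandas that the first column is the index.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: drop the column after loading.
dropped = naive.drop(columns=["Unnamed: 0"])
print(clean.columns.tolist())
```

Saving with `to_csv(..., index=False)` in the earlier project would avoid the extra column in the first place.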

## ***Step 02: Splitting the Data for Training and Testing***

In the next step, we will split our data for training and testing.
- First, we split our data into two general parts.
- **'X'** contains all our input features, whereas **'y'** contains our target feature / output variable.

In [3]:
from sklearn.model_selection import train_test_split

X = df.drop("charges", axis=1)
y = df['charges']

### Next, we will split our data into 4 parts.

The `train_test_split` function from the `sklearn.model_selection` module is used to split your data into training and testing sets.

- `X_train`: This contains the features (input variables) that will be used to train the machine learning model. This is a subset of the original `X` data.
- `X_test`: This contains the features that will be used to test the trained machine learning model's performance. This is a separate subset of the original `X` data that the model has not seen during training.
- `y_train`: This contains the corresponding target variable (output) for the `X_train` data. The model learns to predict these values during training.
- `y_test`: This contains the corresponding target variable for the `X_test` data. These are the actual values that the model's predictions (`y_pred`) will be compared against to evaluate its performance.

Splitting the data this way does not prevent overfitting by itself, but it lets us detect it: evaluating on data the model has never seen shows whether it generalizes to new, unseen data or has only memorized the training set.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
    )
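
As a quick sanity check, it can help to look at the shapes of the four pieces after splitting. The sketch below uses synthetic data (the names `X_demo` and `y_demo` are made up for illustration) rather than our real dataset, and confirms that `test_size=0.2` keeps 20% of the rows for testing and that the two sets do not overlap.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 100 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series(rng.normal(size=100), name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (80, 3) (20, 3)

# Features and targets stay aligned, and no row appears in both sets.
print(X_tr.index.equals(y_tr.index))
print(set(X_tr.index).isdisjoint(X_te.index))
```

Fixing `random_state` makes the split reproducible, so the same rows land in the test set on every run.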

## ***Step 03: Building the Machine Learning Model***

Now that our data is split, we can build our first machine learning model. We will use **Linear Regression**, a simple yet powerful algorithm for predicting a continuous target variable (in this case, 'charges') based on the input features.

Linear Regression works by finding the best-fitting straight line (or hyperplane in higher dimensions), that is, the one that minimizes the sum of squared differences between the predicted values and the actual values in the training data.
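
To make the "best-fitting" idea concrete, here is a minimal sketch on synthetic data (all names and numbers are made up for illustration). It solves the least-squares problem directly with NumPy and compares the result with what `LinearRegression` finds; the two should agree up to floating-point error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship plus noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([1.5, -0.5]) + 2.0 + rng.normal(scale=0.2, size=50)

# Least squares by hand: append a column of ones so the last weight is the intercept.
A = np.column_stack([X_demo, np.ones(len(X_demo))])
solution, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

fitted = LinearRegression().fit(X_demo, y_demo)

print(solution)                         # [w1, w2, intercept] from lstsq
print(fitted.coef_, fitted.intercept_)  # the same numbers from scikit-learn
```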

We chose Linear Regression for this initial model because:
- It is a good starting point for regression problems and provides a baseline performance.
- It is easy to interpret, allowing us to understand the relationship between each feature and the target variable.
- It is computationally efficient to train.

The code below initializes a `LinearRegression` model and then uses the `fit()` method to train the model on the `X_train` and `y_train` data.

In [5]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(X_train, y_train)
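
One practical benefit of the interpretability mentioned above is that a fitted model exposes its learned weights through `coef_` and `intercept_`. The sketch below uses synthetic data with a known relationship (the names `X_demo`, `y_demo`, and `demo_model` are hypothetical) so we can see that the learned coefficients come out close to the true ones.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 2)), columns=["f1", "f2"])

# Known relationship: y = 3*f1 - 2*f2 + 5, plus a little noise.
y_demo = 3 * X_demo["f1"] - 2 * X_demo["f2"] + 5 + rng.normal(scale=0.1, size=200)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with its feature name for easier reading.
coefficients = pd.Series(demo_model.coef_, index=X_demo.columns)
print(coefficients)
print("Intercept:", demo_model.intercept_)
```

On our real model, the same pattern would be `pd.Series(model.coef_, index=X_train.columns)`. Because the numeric features are scaled, each of those coefficients would describe the change in predicted charges per one standard deviation of the feature, not per raw unit.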

## ***Step 04: Testing our model***

Now that we have built our first model and trained it on our training data, it's time to test how well it performs. We test our model on the **'X_test'** features, which it has not seen during training.

In [6]:
y_pred = model.predict(X_test)

Now, **'y_pred'** contains the values that our model predicted for the **'X_test'** features, whereas **'y_test'** contains the original output values for **'X_test'** given in our dataset. In the next step, we will compare the values in **'y_pred'** and **'y_test'** to see how well our model has performed.
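
Before computing any metric, a side-by-side table of actual and predicted values is often a useful first look. The sketch below uses a few made-up numbers in place of the real `y_test` and `y_pred`; the column names are also just an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for y_test and y_pred.
y_true_demo = pd.Series([16884, 1725, 4449], name="actual")
y_pred_demo = np.array([15000.0, 2500.0, 4000.0])

comparison = pd.DataFrame({
    "actual": y_true_demo.to_numpy(),
    "predicted": y_pred_demo,
})

# Positive error means the model over-predicted, negative means it under-predicted.
comparison["error"] = comparison["predicted"] - comparison["actual"]
print(comparison)
```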

## ***Step 05: Evaluating our Model***

Now that we have our predictions (`y_pred`) and the actual values (`y_test`), we need to evaluate how well our model performed. A common metric for regression problems is the **R-squared (R2) score**.

The R-squared score measures the proportion of the variance in the dependent variable (our 'charges') that is predictable from the independent variables (our input features). In simpler terms, it tells us how well our model fits the data.

- An R-squared score of 1 means the model perfectly predicts the target variable.
- An R-squared score of 0 means the model does not explain any of the variability in the target variable.
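- A score between 0 and 1 indicates the proportion of variance explained by the model.
- On held-out data the score can also be negative, which means the model predicts worse than always guessing the mean of the target.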
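
The score can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below (with a hypothetical helper `manual_r2` and made-up numbers) checks this against `r2_score` for three cases: good predictions, always predicting the mean (which scores 0), and predictions bad enough to score below 0.

```python
import numpy as np
from sklearn.metrics import r2_score

def manual_r2(y_true, y_pred):
    """R-squared as 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1 - ss_res / ss_tot

y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
good = np.array([1.1, 1.9, 3.2, 3.8])
mean_only = np.full(4, y_true_demo.mean())
bad = np.array([4.0, 3.0, 2.0, 1.0])

for name, pred in [("good", good), ("mean_only", mean_only), ("bad", bad)]:
    print(name, manual_r2(y_true_demo, pred), r2_score(y_true_demo, pred))
```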

The code below calculates the R-squared score using the `r2_score` function from `sklearn.metrics`, comparing the predicted values (`y_pred`) to the actual values (`y_test`). A higher R-squared score indicates a better performing model.

In [12]:
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)

print("R-squared:", r2)

R-squared: 0.8040712413347118


The R-squared score of **0.804** indicates that approximately 80.4% of the variance in 'charges' on the test set can be explained by the features in our model. This is a reasonably good score for a first model, suggesting that the selected features carry substantial information for predicting medical charges.

## Adjusted R-Squared Score

The Adjusted R-squared score is another metric used to evaluate the performance of a regression model. While R-squared measures the proportion of the variance in the dependent variable explained by the independent variables, Adjusted R-squared takes into account the number of predictors in the model and the number of observations.

The Adjusted R-squared is calculated using the following formula:

$Adjusted\ R^2 = 1 - \frac{(1 - R^2) \times (n - 1)}{n - p - 1}$

Where:
- $R^2$ is the R-squared value
- $n$ is the number of observations (data points)
- $p$ is the number of predictor variables

Unlike R-squared, which can increase with the addition of more predictors (even if they don't improve the model), Adjusted R-squared will only increase if the new predictor improves the model more than would be expected by chance. This makes Adjusted R-squared a more reliable metric when comparing models with different numbers of predictors, as it penalizes the inclusion of unnecessary features.
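
To see the penalty at work, the sketch below wraps the formula in a small helper (the name `adjusted_r2_score` is hypothetical, not a scikit-learn function) and evaluates it for a fixed R-squared of 0.80 with 100 observations and an increasing number of predictors. These numbers are made up for illustration and are not taken from our model.

```python
def adjusted_r2_score(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R-squared, same sample size, more and more predictors.
results = {p: adjusted_r2_score(0.80, n=100, p=p) for p in (2, 10, 50)}
for p, value in results.items():
    print(f"p = {p:2d}: adjusted R-squared = {value:.4f}")
```

With only 2 predictors the adjusted score stays close to 0.80, while with 50 predictors it drops to roughly 0.60, even though the plain R-squared is unchanged.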

In [13]:
n = X_test.shape[0]
p = X_test.shape[1]

adjusted_r2 = 1 - ((1 - r2) * (n - 1) / (n - p - 1))

print("Adjusted R-squared:", adjusted_r2)

Adjusted R-squared: 0.7987962362937232


The Adjusted R-squared score of **0.799** (computed here on the test set) is very close to the R-squared score of 0.804. The small gap shows that the model is not being penalized heavily for its number of predictors, because the number of features is small relative to the number of observations. Note that this closeness alone does not prove that every feature is useful; it only tells us that the penalty for model size is small here, so the R-squared score is not being inflated by a large number of predictors.

## ***Step 06: Predicting the Unknown***

Now that our model fits the data reasonably well, let's try to predict **'charges'** for some completely new data points.

In [15]:
import numpy as np

# Create some new data points with the same features as the training data
# Ensure the features are scaled and encoded in the same way as the training data
new_data = np.array([
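    # Expected column order: age, is_female, bmi, children, is_smoker, region_southeast, bmi_category_Obese
    # age, bmi and children are scaled values, so children = 1 does not mean "one child"
    # Example 1: a non-smoking male in the southeast, flagged obese, below-average age, above-average BMI
    [-1.0, 0, 0.5, 1, 0, 1, 1],
    # Example 2: a smoking female outside the southeast, not flagged obese, above-average age, slightly below-average BMI
    [0.8, 1, -0.2, 0, 1, 0, 0]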
])

# Make predictions using the trained model
new_predictions = model.predict(new_data)

# Display the predictions
print("Predicted charges for new data points:")
for i, prediction in enumerate(new_predictions):
    print(f"Data point {i+1}: ${prediction:.2f}")

Predicted charges for new data points:
Data point 1: $7229.98
Data point 2: $32599.49




The output shows the predicted charges for the two new data points we created:

- **Data point 1:** Predicted charge of approximately 7229.98 USD. This data point represents a non-smoking male in the southeast who is flagged as obese, with below-average age and above-average BMI.
- **Data point 2:** Predicted charge of approximately 32599.49 USD. This data point represents a smoking female outside the southeast who is not flagged as obese, with above-average age and slightly below-average BMI.

These predictions demonstrate how the trained Linear Regression model estimates medical charges from the input features. The large gap between the two predictions is consistent with smoking status having a strong effect on predicted cost, although the two points also differ in age, BMI, and other features.
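
In practice, a new person would be described in raw units (years, BMI, number of children), so the raw values have to go through the same scaling used on the training data before prediction. The sketch below assumes the earlier project used scikit-learn's `StandardScaler` on `age`, `bmi`, and `children`, which is a guess based on the scaled values in the preview; the raw training numbers and the new person are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw training values for the three scaled columns.
raw_train = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 31, 46, 37],
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9, 25.7, 33.4, 27.7],
    "children": [0, 1, 3, 0, 0, 0, 1, 3],
})
scaler = StandardScaler().fit(raw_train)  # in a real project, reuse the saved scaler

# A new person described in raw units.
raw_new = pd.DataFrame({"age": [40], "bmi": [30.0], "children": [1]})
scaled_new = pd.DataFrame(scaler.transform(raw_new), columns=raw_new.columns)

# Add the 0/1 flags unchanged.
scaled_new["is_female"] = 0
scaled_new["is_smoker"] = 0
scaled_new["region_southeast"] = 1
scaled_new["bmi_category_Obese"] = 1

# Put the columns in the order the model was trained on (assumed order).
feature_order = ["age", "is_female", "bmi", "children",
                 "is_smoker", "region_southeast", "bmi_category_Obese"]
new_row = scaled_new[feature_order]
print(new_row)
```

A row built this way could then be passed to `model.predict(new_row)`. Keeping it as a DataFrame with the training column names also makes a mismatch in feature order easier to spot.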

## ***Conclusion***

In this notebook, we built our first machine learning model, using Linear Regression to predict medical charges. We followed these key steps:

1.  **Data Loading:** Loaded the pre-processed and scaled dataset.
2.  **Data Splitting:** Divided the data into training and testing sets (`X_train`, `X_test`, `y_train`, `y_test`) to ensure the model is evaluated on unseen data.
3.  **Model Building and Training:** Initialized and trained a `LinearRegression` model on the training data.
4.  **Model Testing:** Used the trained model to make predictions on the test set (`y_pred`).
5.  **Model Evaluation:** Assessed the model's performance using the R-squared and Adjusted R-squared scores, which indicated a reasonably good fit for the data.
6.  **Prediction on New Data:** Demonstrated how to use the trained model to predict charges for new, unseen data points.
