# 📊 Project One: Predicting Medical Expenses with Linear Regression  

## 🔹 Overview  
This project walks through a real-world regression problem using **Scikit-learn**.  
We work with a dataset of medical expenses and explore how numerical and categorical features influence the predicted charges.  

---

## 🔹 Key Steps in this Notebook  
1. **Problem Statement** → Define the machine learning objective and why regression is appropriate.  
2. **Data Collection & EDA** → Load the dataset, explore distributions, detect patterns, and visualize relationships.  
3. **Linear Regression (Single Variable)** → Build and interpret a simple regression model with one predictor.  
4. **Linear Regression (Multiple Variables)** → Extend to multivariate regression and evaluate performance.  
5. **Handling Categorical Features** → Apply encoding techniques (e.g., OneHotEncoder) to use non-numerical data.  
6. **Interpreting Model Coefficients** → Understand the impact of features and their importance in prediction.  
7. **Exploring Other Regression Models** → Briefly compare with additional regressors available in Scikit-learn (e.g., Decision Tree, Random Forest).  
8. **Applying to New Datasets** → Generalize the workflow by testing linear regression on another dataset.  
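
As a rough illustration of steps 3 to 5, the sketch below fits a one-variable model, one-hot encodes `smoker`, and then fits a two-feature model. It uses only five rows copied from the dataset preview shown later in this notebook, which is far too few for meaningful coefficients. The names `sample`, `single`, `multi`, and `encoder` are illustrative assumptions, not names taken from the notebook.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Five rows copied from the dataset preview; a stand-in for the full CSV.
sample_csv = """age,sex,bmi,children,smoker,region,charges
19,female,27.900,0,yes,southwest,16884.92400
18,male,33.770,1,no,southeast,1725.55230
28,male,33.000,3,no,southeast,4449.46200
33,male,22.705,0,no,northwest,21984.47061
32,male,28.880,0,no,northwest,3866.85520
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Step 3: one predictor (age) against the target (charges).
single = LinearRegression().fit(sample[["age"]], sample["charges"])

# Step 5: one-hot encode the categorical `smoker` column.
# The encoder returns a sparse matrix by default, so convert it to a dense array.
encoder = OneHotEncoder()
smoker_encoded = encoder.fit_transform(sample[["smoker"]]).toarray()
# Pick the column that corresponds to the "yes" category.
smoker_yes = smoker_encoded[:, list(encoder.categories_[0]).index("yes")]

# Step 4: several predictors at once (age plus the encoded smoker flag).
X_multi = np.column_stack([sample["age"].to_numpy(), smoker_yes])
multi = LinearRegression().fit(X_multi, sample["charges"])

print("single-variable coefficient:", single.coef_)
print("multi-variable coefficients:", multi.coef_)
```

With the full dataset, the same calls would be applied to `data` instead of `sample`, ideally after a train/test split.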

---

## 🔹 Skills Demonstrated  
- Exploratory Data Analysis (EDA)  
- Feature Engineering (numerical + categorical)  
- Linear Regression (univariate & multivariate)  
- Model Evaluation & Interpretation  
- Comparing Multiple Regression Models in Scikit-learn  


---


## 🔹 Problem Statement  

Healthcare costs are influenced by multiple factors such as age, gender, BMI, smoking habits, and region of residence.  
Accurately predicting medical expenses can help insurance companies design fair policies and assist individuals in planning their healthcare budgets.  

In this project, our goal is to **build a regression model** that can estimate a person’s annual medical charges based on their demographic and lifestyle attributes.  
We will start with simple linear regression, extend to multiple variables, and then compare performance against other regression models available in Scikit-learn.  


In [11]:
# Loading Project Dataset

import pandas as pd

data = pd.read_csv('../datasets/medical_expensives.csv')

data

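|      | age | sex    | bmi    | children | smoker | region    | charges     |
|------|-----|--------|--------|----------|--------|-----------|-------------|
| 0    | 19  | female | 27.900 | 0        | yes    | southwest | 16884.92400 |
| 1    | 18  | male   | 33.770 | 1        | no     | southeast | 1725.55230  |
| 2    | 28  | male   | 33.000 | 3        | no     | southeast | 4449.46200  |
| 3    | 33  | male   | 22.705 | 0        | no     | northwest | 21984.47061 |
| 4    | 32  | male   | 28.880 | 0        | no     | northwest | 3866.85520  |
| ...  | ... | ...    | ...    | ...      | ...    | ...       | ...         |
| 1333 | 50  | male   | 30.970 | 3        | no     | northwest | 10600.54830 |
| 1334 | 18  | female | 31.920 | 0        | no     | northeast | 2205.98080  |
| 1335 | 18  | female | 36.850 | 0        | no     | southeast | 1629.83350  |
| 1336 | 21  | female | 25.800 | 0        | no     | southwest | 2007.94500  |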


*✅ The dataset contains 1338 rows and 7 columns. Each row describes one customer.*

In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


*✅ The `data.info()` output shows a mix of numerical columns (`age`, `bmi`, `children`, `charges`) and categorical columns (`sex`, `smoker`, `region`). Every column has 1338 non-null values, so there are no missing values to impute before modelling.*
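
The same two observations can also be checked programmatically. The sketch below is a minimal example that uses five rows copied from the dataset preview above instead of the full CSV, so `sample` and the other variable names are illustrative assumptions. On the real `data` frame the same pandas calls apply.

```python
import io

import pandas as pd

# Five rows copied from the dataset preview; a stand-in for the full CSV.
sample_csv = """age,sex,bmi,children,smoker,region,charges
19,female,27.900,0,yes,southwest,16884.92400
18,male,33.770,1,no,southeast,1725.55230
28,male,33.000,3,no,southeast,4449.46200
33,male,22.705,0,no,northwest,21984.47061
32,male,28.880,0,no,northwest,3866.85520
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Count missing values per column; all zeros means nothing to impute.
missing_per_column = sample.isna().sum()

# Split the columns by type: numeric ones can feed a linear model directly,
# the remaining (text) ones need encoding first.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("shape:", sample.shape)
print("missing values:", int(missing_per_column.sum()))
print("numeric columns:", numeric_cols)
print("categorical columns:", categorical_cols)
```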

In [14]:
# Some statistics for the numerical columns:
data.describe()

|       | age       | bmi       | children | charges      |
|-------|-----------|-----------|----------|--------------|
| count | 1338.0    | 1338.0    | 1338.0   | 1338.0       |
| mean  | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std   | 14.04996  | 6.098187  | 1.205493 | 12110.011237 |
| min   | 18.0      | 15.96     | 0.0      | 1121.8739    |
| 25%   | 27.0      | 26.29625  | 0.0      | 4740.28715   |
| 50%   | 39.0      | 30.4      | 1.0      | 9382.033     |
| 75%   | 51.0      | 34.69375  | 2.0      | 16639.912515 |
| max   | 64.0      | 53.13     | 5.0      | 63770.42801  |
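
One way to read these statistics is to compare the mean of `charges` with its median (the 50% row), and the maximum with the 75th percentile. The sketch below does that arithmetic with values copied from the table. The variable names are illustrative assumptions. A mean well above the median suggests a right-skewed distribution, which may be worth confirming with a histogram.

```python
# Values copied from the describe() output for the `charges` column.
charges_mean = 13270.422265
charges_median = 9382.033      # the 50% row
charges_q3 = 16639.912515      # the 75% row
charges_max = 63770.42801

# Ratios above 1 show how far the mean and the maximum sit above
# the middle and upper-middle of the distribution.
mean_to_median = charges_mean / charges_median
max_to_q3 = charges_max / charges_q3

print(f"mean / median: {mean_to_median:.2f}")
print(f"max / 75th percentile: {max_to_q3:.2f}")
```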
