##### `Medical Cost Analysis Using Hypothesis Testing and Regression`

#### `OBJECTIVE` 
The company wants to better understand what drives medical insurance costs so that they can:
- Price insurance premiums more accurately.
- Identify high-risk groups for health interventions.
- Design policies that are fair and competitive.


In [29]:
# Import the required library and load the dataset

import pandas as pd 

df = pd.read_csv('insurance.csv')
df.head()



Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [30]:
df['children'].value_counts()

children
0    574
1    324
2    240
3    157
4     25
5     18
Name: count, dtype: int64

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [32]:
data_set = df.shape

print(f"This dataset has {data_set[0]} rows and {data_set[1]} columns.")

This dataset has 1338 rows and 7 columns.


In [33]:
df.describe()

Unnamed: 0,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


### Question 1: Do smokers incur significantly higher medical charges than non-smokers?

`Null Hypothesis (H₀)`

Smokers do not incur higher medical charges than non-smokers.
(or: there is no significant difference in charges)

`Alternative Hypothesis (H₁)`

Smokers incur significantly higher medical charges than non-smokers.

*This is a one-tailed hypothesis because you expect smokers to have higher charges.*

In [34]:
from scipy.stats import ttest_ind

smokers = df[df['smoker'] == 'yes']['charges']
nonsmokers = df[df['smoker'] == 'no']['charges']

t_stat, p_value = ttest_ind(smokers, nonsmokers, equal_var=False)

print("t-statistic:", t_stat)
print("p-value:", p_value)

t-statistic: 32.751887766341824
p-value: 5.88946444671698e-103


In [10]:
p_value_one_tailed = p_value / 2
print("One-tailed p-value:", p_value_one_tailed)

One-tailed p-value: 2.94473222335849e-103
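
Halving the two-sided p-value is valid only when the t-statistic points in the hypothesized direction (positive here). As a minimal sketch, assuming SciPy 1.6 or later, the one-sided Welch test can also be requested directly through the `alternative` argument. The arrays below are synthetic stand-ins, not the insurance data, and the `_demo` names are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-in data (NOT the insurance dataset), seeded for repeatability
rng = np.random.default_rng(0)
smokers_demo = rng.normal(loc=32000, scale=11500, size=60)
nonsmokers_demo = rng.normal(loc=8400, scale=6000, size=240)

# Two-sided Welch test, as in the cell above
t_two, p_two = ttest_ind(smokers_demo, nonsmokers_demo, equal_var=False)

# One-sided Welch test: H1 is mean(smokers) > mean(non-smokers)
t_one, p_one = ttest_ind(smokers_demo, nonsmokers_demo, equal_var=False,
                         alternative='greater')

print("two-sided p:", p_two)
print("one-sided p:", p_one)
```

When the t-statistic is positive, the one-sided p-value printed here should equal half of the two-sided one, which is the shortcut used in the cell above.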


#### Decision:
**Reject the null hypothesis (H₀)** 
*There is overwhelming statistical evidence that smokers incur higher medical charges than non-smokers. The positive t-statistic confirms the direction of the effect (smokers > non-smokers).*

*Welch's independent-samples t-test (`equal_var=False`) indicates a highly significant difference in mean medical charges between smokers and non-smokers (t = 32.75, p ≪ 0.05), leading to rejection of the null hypothesis.*

### Question 2: Is there a significant difference in average medical charges across different regions?

`Null Hypothesis (H₀)`
- The average medical charges are the same across all regions.

\[
\mu_1 = \mu_2 = \mu_3 = \mu_4
\]


`Alternative Hypothesis (H₁)`
- At least one region has a different average medical charge.


In [35]:
summary_stats = df.groupby("region")["charges"].agg(
    Mean="mean",
    Standard_Deviation="std",
    Count="count"
)

print(summary_stats)

                   Mean  Standard_Deviation  Count
region                                            
northeast  13406.384516        11255.803066    324
northwest  12417.575374        11072.276928    325
southeast  14735.411438        13971.098589    364
southwest  12346.937377        11557.179101    325


In [36]:

from scipy.stats import f_oneway
# Split charges by region
groups = [df[df['region']==r]['charges'] for r in df['region'].unique()]

# One-way ANOVA
f_stat, p_value = f_oneway(*groups)

print("F-statistic:", f_stat)
print("p-value:", p_value)

F-statistic: 2.96962669358912
p-value: 0.0308933560705201


1. `Given Results`
- F-statistic = 2.97
- p-value = 0.0309

2. Decision Rule
At α = 0.05: p = 0.0309 < 0.05

   `Decision: Reject the null hypothesis (H₀)`

3. Interpretation
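- There is statistically significant evidence that average medical charges differ across regions.
In other words, region is associated with medical costs, although the ANOVA alone does not identify which regions differ and does not adjust for other factors.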
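
To make the F-statistic less of a black box, the sketch below recomputes it from the per-region means, standard deviations, and counts printed earlier, using the usual between-group and within-group sums of squares. It also reports eta squared (the share of total variation attributable to region) as a rough effect size. The variable names are illustrative only.

```python
# Region summary statistics copied from the groupby output above:
# region -> (mean, standard deviation, count)
summary = {
    "northeast": (13406.384516, 11255.803066, 324),
    "northwest": (12417.575374, 11072.276928, 325),
    "southeast": (14735.411438, 13971.098589, 364),
    "southwest": (12346.937377, 11557.179101, 325),
}

n_total = sum(n for _, _, n in summary.values())
k = len(summary)
grand_mean = sum(m * n for m, _, n in summary.values()) / n_total

# Between-group and within-group sums of squares
ss_between = sum(n * (m - grand_mean) ** 2 for m, _, n in summary.values())
ss_within = sum((n - 1) * s ** 2 for _, s, n in summary.values())

f_stat_check = (ss_between / (k - 1)) / (ss_within / (n_total - k))
eta_squared = ss_between / (ss_between + ss_within)

print(f"F = {f_stat_check:.2f}")
print(f"eta squared = {eta_squared:.4f}")
```

The recomputed F should agree with the `f_oneway` result up to rounding. Eta squared comes out below 1%, which suggests that the regional differences, while statistically significant, are small relative to the overall spread in charges.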

### Question 3: Is there a statistically significant relationship between BMI and medical charges?

### Hypotheses

**Null Hypothesis (H₀):**  

> There is no linear relationship between BMI and medical charges.

**Alternative Hypothesis (H₁):**  
> There is a statistically significant linear relationship between BMI and medical charges.


In [37]:
from scipy.stats import pearsonr

# Compute Pearson correlation
r, p_value = pearsonr(df['bmi'], df['charges'])

print("Pearson correlation coefficient (r):", r)
print("p-value:", p_value)

Pearson correlation coefficient (r): 0.1983409688336289
p-value: 2.459085535116766e-13


1. Given Results
- Pearson correlation coefficient (r) = 0.198
- p-value = 2.46 × 10⁻¹³

2. Decision Rule

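At α = 0.05:
p = 2.46 × 10⁻¹³ < 0.05

`Decision: Reject the null hypothesis (H₀)`

There is a statistically significant linear relationship between BMI and medical charges. The relationship is positive but weak: with r = 0.198, BMI alone accounts for only about 4% of the variation in charges (r² ≈ 0.039).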
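
A short arithmetic sketch, using only the reported r and the sample size of 1338, shows why such a small correlation can still be highly significant. The test statistic for a Pearson correlation is t = r·√(n − 2) / √(1 − r²), so it grows with the square root of the sample size.

```python
import math

r = 0.1983409688336289  # Pearson r reported above
n = 1338                # number of rows in the dataset

# Share of variance in charges linearly associated with BMI
r_squared = r ** 2

# t-statistic for testing "no linear relationship", with n - 2 degrees of freedom
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r_squared)

print(f"r^2 = {r_squared:.4f}")
print(f"t = {t_stat:.2f} on {n - 2} degrees of freedom")
```

With more than a thousand observations, even a weak correlation yields a large t-statistic, so statistical significance here should not be read as a strong relationship.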

### Question 4: Can we predict medical charges using age, BMI, number of children, and smoking status?
1. Identify Variables

> Dependent variable (Y): Medical charges (continuous)

> Independent variables (X):

- Age (continuous)

- BMI (continuous)

- Number of children (discrete/continuous)

- Smoking status (categorical: yes/no → needs encoding)

***Since we have multiple predictors, we use multiple linear regression.***

### Hypotheses for Multiple Linear Regression

**Null Hypothesis (H₀):**  
$H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$

None of the independent variables (age, BMI, number of children, smoking status) significantly predict medical charges.

**Alternative Hypothesis (H₁):**  
$H_1: \exists \, \beta_i \neq 0$

At least one predictor has a significant effect on medical charges.

> Where $\beta_i$ is the regression coefficient for predictor $i$.


In [38]:
# Convert categorical variable to numeric (0 = non-smoker, 1 = smoker)
df['smoker_encoded'] = df['smoker'].map({'no': 0, 'yes': 1})

In [42]:
# Check that smoker status was converted to numeric (groups 0 and 1 should both appear)
df.groupby('smoker_encoded')['charges'].sum()


smoker_encoded
0    8.974061e+06
1    8.781764e+06
Name: charges, dtype: float64
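
Summing charges by encoded group confirms that both codes exist, but it does not show that each label mapped to the intended code. A more direct check, sketched here on a small made-up frame (the `demo` data is hypothetical, not the insurance data), counts unmapped rows and cross-tabulates the original labels against the encoded values.

```python
import pandas as pd

# Small made-up frame for illustration (not the insurance data)
demo = pd.DataFrame({"smoker": ["yes", "no", "no", "yes", "no"]})
demo["smoker_encoded"] = demo["smoker"].map({"no": 0, "yes": 1})

# Labels missing from the mapping would become NaN, so count them explicitly
n_unmapped = int(demo["smoker_encoded"].isna().sum())

# Cross-tabulate original labels against encoded values
check = pd.crosstab(demo["smoker"], demo["smoker_encoded"])

print("unmapped rows:", n_unmapped)
print(check)
```

A correct encoding shows zero unmapped rows and non-zero counts only where "no" meets 0 and "yes" meets 1.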

**Select predictors and dependent variable**

In [40]:
X = df[['age', 'bmi', 'children', 'smoker_encoded']]
Y = df['charges']

###  Fit Multiple Linear Regression

In [43]:
import statsmodels.api as sm

# Add constant for intercept
X = sm.add_constant(X)

# Fit model
model = sm.OLS(Y, X).fit()

# View summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.750
Model:                            OLS   Adj. R-squared:                  0.749
Method:                 Least Squares   F-statistic:                     998.1
Date:                Thu, 15 Jan 2026   Prob (F-statistic):               0.00
Time:                        11:43:53   Log-Likelihood:                -13551.
No. Observations:                1338   AIC:                         2.711e+04
Df Residuals:                    1333   BIC:                         2.714e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const           -1.21e+04    941.984    -12.

**1. Model Summary**

| Statistic          | Value   | Interpretation                                            |
| ------------------ | ------- | --------------------------------------------------------- |
| Dependent variable | charges | What we are predicting                                    |
| R-squared          | 0.750   | 75% of the variation in charges is explained by the model |
| Adjusted R-squared | 0.749   | Adjusted for 4 predictors                                 |
| F-statistic        | 998.1   | Tests overall significance of the model                   |
| Prob(F-statistic)  | 0.000   | Model is highly significant                               |





- > `The model fits very well; age, BMI, children, and smoking status explain a large portion of medical charges variation`


**2. Regression Coefficients**

| Predictor      | Coefficient (β) | Std. Error | t-stat | p-value | Interpretation                                                                                |
| -------------- | --------------- | ---------- | ------ | ------- | --------------------------------------------------------------------------------------------- |
| const          | -12,100         | 941.98     | -12.85 | 0.000   | Base charge when all predictors are 0 (not practically meaningful)                            |
| age            | 257.85          | 11.90      | 21.68  | 0.000   | Each additional year of age **increases charges by ~Ksh 258**, holding other factors constant |
| bmi            | 321.85          | 27.38      | 11.76  | 0.000   | Each unit increase in BMI **increases charges by ~Ksh 322**                                   |
| children       | 473.50          | 137.79     | 3.44   | 0.001   | Each additional child **increases charges by ~Ksh 474**, holding others constant              |
| smoker_encoded | 23,810          | 411.22     | 57.90  | 0.000   | Being a smoker **increases charges by ~Ksh 23,810** compared to non-smokers                   |

- > `All predictors are statistically significant (p < 0.05).`


**3. Regression Equation**


$$
\text{Medical Charges} = -12,100 + 257.85 \cdot \text{Age} + 321.85 \cdot \text{BMI} + 473.50 \cdot \text{Children} + 23,810 \cdot \text{Smoker Status}
$$
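
As a worked example of the equation, the sketch below plugs in a hypothetical profile (40 years old, BMI 30, two children) once as a smoker and once as a non-smoker. The coefficients are the rounded values from the equation, so the results are approximate; `predict_charges` is an illustrative helper, not part of the notebook.

```python
# Rounded coefficients taken from the regression equation above
COEF = {"const": -12100.0, "age": 257.85, "bmi": 321.85,
        "children": 473.50, "smoker": 23810.0}

def predict_charges(age, bmi, children, smoker):
    """Predicted charges from the fitted equation (smoker: 1 = yes, 0 = no)."""
    return (COEF["const"] + COEF["age"] * age + COEF["bmi"] * bmi
            + COEF["children"] * children + COEF["smoker"] * smoker)

# Hypothetical profile: 40 years old, BMI 30, two children
as_smoker = predict_charges(40, 30, 2, smoker=1)
as_nonsmoker = predict_charges(40, 30, 2, smoker=0)

print(round(as_smoker, 2), round(as_nonsmoker, 2))
```

The two predictions differ by exactly the smoker coefficient, which is what "holding other factors constant" means in the coefficient table.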

**4. Key Insights**

- *Smoking status* is the strongest predictor (largest coefficient and largest t-statistic, 57.90).

- *Age* and *BMI* also contribute meaningfully to charges.

- *Children* have a smaller but significant effect.

- The **model** explains 75% of the variation in medical charges.


### Question 5: How does region affect medical charges after controlling for age, BMI, smoking status, and number of children?
**1. Identify Variables**

- Dependent variable (Y): Medical charges

- Independent variables (X):

1. Age

2. BMI

3. Children

4. Smoking status (categorical: yes/no → encoded 0/1)

5. Region (categorical: northeast, northwest, southeast, southwest)

> Since region is categorical with multiple levels, we need dummy encoding for regression.

 **2. Hypotheses**

**Null Hypothesis (H₀):**  

All regional coefficients are zero:  
H₀: β_region = 0  

After controlling for age, BMI, smoking, and children, **region has no effect on medical charges**.

**Alternative Hypothesis (H₁):**  

At least one regional coefficient is not zero:  
H₁: At least one β_region ≠ 0  

After controlling for other variables, **region does have a significant effect on medical charges**.

### 3. Encoding Region

In [48]:
# Convert region to dummy variables.
# drop_first=True avoids the dummy variable trap (perfect multicollinearity
# with the intercept); the dropped category becomes the reference region.
region_dummies = pd.get_dummies(df['region'], drop_first=True)

# Combine with other predictors
X = pd.concat([df[['age', 'bmi', 'children', 'smoker_encoded']], region_dummies], axis=1)

### 4. Fit Regression Model

In [54]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             1338 non-null   int64  
 1   sex             1338 non-null   object 
 2   bmi             1338 non-null   float64
 3   children        1338 non-null   int64  
 4   smoker          1338 non-null   object 
 5   region          1338 non-null   object 
 6   charges         1338 non-null   float64
 7   smoker_encoded  1338 non-null   int64  
dtypes: float64(2), int64(3), object(3)
memory usage: 83.8+ KB


In [55]:
X.dtypes

const             float64
age                 int64
bmi               float64
children            int64
smoker_encoded      int64
northwest            bool
southeast            bool
southwest            bool
dtype: object

In [56]:
X[['northwest', 'southeast', 'southwest']] = X[['northwest', 'southeast', 'southwest']].astype(int)

In [57]:
X = X.apply(pd.to_numeric)

In [59]:
X.dtypes

const             float64
age                 int64
bmi               float64
children            int64
smoker_encoded      int64
northwest           int32
southeast           int32
southwest           int32
dtype: object
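
The boolean-to-integer conversion above can likely be avoided by asking `pd.get_dummies` for integer columns up front via its `dtype` argument. A minimal sketch on made-up region values (the `demo` frame is hypothetical):

```python
import pandas as pd

# Made-up region values for illustration
demo = pd.DataFrame({"region": ["southwest", "southeast", "northwest", "northeast"]})

# dtype=int yields 0/1 integer columns directly, so no astype(int) step is needed
dummies = pd.get_dummies(demo["region"], drop_first=True, dtype=int)

print(dummies.dtypes)
print(list(dummies.columns))
```

The exact integer width (int32 or int64) depends on the platform, but either works as a regression input.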

In [58]:
y = df['charges']
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
model.summary()

0,1,2,3
Dep. Variable:,charges,R-squared:,0.751
Model:,OLS,Adj. R-squared:,0.75
Method:,Least Squares,F-statistic:,572.7
Date:,"Thu, 15 Jan 2026",Prob (F-statistic):,0.0
Time:,12:24:21,Log-Likelihood:,-13548.0
No. Observations:,1338,AIC:,27110.0
Df Residuals:,1330,BIC:,27150.0
Df Model:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.199e+04,978.762,-12.250,0.000,-1.39e+04,-1.01e+04
age,256.9736,11.891,21.610,0.000,233.646,280.301
bmi,338.6646,28.559,11.858,0.000,282.639,394.690
children,474.5665,137.740,3.445,0.001,204.355,744.778
smoker_encoded,2.384e+04,411.856,57.875,0.000,2.3e+04,2.46e+04
northwest,-352.1821,476.120,-0.740,0.460,-1286.211,581.847
southeast,-1034.3601,478.537,-2.162,0.031,-1973.130,-95.590
southwest,-959.3747,477.778,-2.008,0.045,-1896.656,-22.094

0,1,2,3
Omnibus:,300.735,Durbin-Watson:,2.089
Prob(Omnibus):,0.0,Jarque-Bera (JB):,720.516
Skew:,1.212,Prob(JB):,3.48e-157
Kurtosis:,5.654,Cond. No.,309.0


**1. Model Fit**
| Statistic         | Value | Interpretation                                          |
| ----------------- | ----- | ---------------------------------------------------------|
| R-squared         | 0.751 | 75.1% of variation in charges is explained by the model |
| Adj. R-squared    | 0.750 | Adjusted for 7 predictors                               |
| F-statistic       | 572.7 | Tests overall significance of the model                 |
| Prob(F-statistic) | 0.000 | Model is highly significant                             |

***The model fits well. Adding region to age, BMI, smoking, and children improves the fit only slightly (R² rises from 0.750 to 0.751).***

**2. Coefficients for Region**
| Region    | Coefficient (β) | Std. Err | t      | p-value | Interpretation                                                           |
| --------- | --------------- | -------- | ------ | ------- | ------------------------------------------------------------------------ |
| northwest | -352.18         | 476.12   | -0.740 | 0.460   | Charges are **Ksh 352 lower than reference region**, **not significant** |
| southeast | -1034.36        | 478.54   | -2.162 | 0.031   | Charges are **Ksh 1,034 lower than reference region**, **significant**   |
| southwest | -959.37         | 477.78   | -2.008 | 0.045   | Charges are **Ksh 959 lower than reference region**, **significant**     |

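***The reference region (the one dropped in dummy encoding) is northeast: `drop_first=True` drops the first category in alphabetical order, and the remaining dummies are northwest, southeast, and southwest. All other regions are compared to it.***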

**3. Interpretation**

- Northwest: Not significantly different from northeast - p = 0.46

- Southeast: Significantly lower charges than northeast - p = 0.031

- Southwest: Significantly lower charges than northeast - p = 0.045

**After controlling for age, BMI, number of children, and smoking status, region still has a modest but significant effect for southeast and southwest, but not for northwest.**
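
The hypotheses for this question are stated jointly (all regional coefficients equal zero), whereas the t-tests above examine one dummy at a time. A joint check is a partial F-test comparing the model with and without the region dummies. The sketch below implements it with NumPy and SciPy on synthetic data, where region is pure noise by construction, so it illustrates only the mechanics and says nothing about the insurance data. All names are hypothetical. For fitted statsmodels OLS results, `compare_f_test` performs the same kind of nested-model comparison.

```python
import numpy as np
from scipy.stats import f as f_dist

# Synthetic stand-in data (NOT the insurance dataset), seeded for repeatability
rng = np.random.default_rng(42)
n = 400
age = rng.integers(18, 65, size=n)
smoker = rng.integers(0, 2, size=n)
region = rng.integers(0, 4, size=n)          # 0 acts as the reference region
charges = 250 * age + 24000 * smoker + rng.normal(0, 6000, size=n)

def rss(design, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)

base = np.column_stack([np.ones(n), age, smoker])              # reduced model
dummies = np.column_stack([(region == r).astype(float) for r in (1, 2, 3)])
full = np.column_stack([base, dummies])                        # adds region dummies

q = dummies.shape[1]                  # number of coefficients tested jointly
df_resid = n - full.shape[1]          # residual degrees of freedom, full model
rss_reduced, rss_full = rss(base, charges), rss(full, charges)

# Partial F-statistic and its upper-tail p-value
f_value = ((rss_reduced - rss_full) / q) / (rss_full / df_resid)
p_joint = f_dist.sf(f_value, q, df_resid)

print(f"F({q}, {df_resid}) = {f_value:.3f}, p = {p_joint:.3f}")
```

Applied to the real models, a small joint p-value would support rejecting H₀ for region as a whole; a large one would suggest the individually significant dummies deserve a more cautious reading.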

***4. Full Regression Equation (Numeric Example)***

$$
\text{Medical Charges} = -11,990 + 256.97 \cdot \text{Age} + 338.66 \cdot \text{BMI} + 474.57 \cdot \text{Children} + 23,840 \cdot \text{Smoker} - 352.18 \cdot \text{Northwest} - 1,034.36 \cdot \text{Southeast} - 959.37 \cdot \text{Southwest}
$$

**Notes:**
- **Age, BMI, Children** are numeric predictors.  
- **Smoker** = 1 if smoker, 0 if non-smoker.  
- **Northwest, Southeast, Southwest** = 1 if person is in that region, 0 otherwise. Northeast is the reference region.  




### Overall Conclusion

- **Strongest Predictors:** Smoking status, BMI, and age have the largest impact on medical charges.  
- **Moderate Predictors:** Number of children increases costs slightly.  
- **Minor Predictors:** Region has a modest effect, with some regions showing slightly lower charges.  
- **Practical Implication:** Healthcare costs are primarily driven by personal health factors (smoking, BMI, age) rather than geographic location, though regional differences should be considered in cost planning and insurance pricing.