## Linear Regression Assumptions

To check that the [Kaggle Insurance Dataset](https://www.kaggle.com/datasets/mirichoi0218/insurance) is suitable for linear regression, we verify the following assumptions and data prerequisites (Müller & Guido, 2016, Chapter 2):

1. **Continuous Target**: The dependent variable (`charges`) must be numerical and continuous.
2. **Linear Relationships**: Features (e.g., `age`, `bmi`) should relate linearly to `charges`; check with scatter plots.
3. **No Extreme Outliers**: Extreme values can distort the fitted coefficients; assess with boxplots.
4. **Homoscedasticity**: Residuals should have constant variance across fitted values; check after fitting the model.
5. **Normality of Residuals**: Residuals should be approximately normally distributed; check after fitting the model.
6. **Independence**: Observations should be independent of one another (assumed for this insurance dataset).
7. **Data Quality**: The data should have few or no missing values, correct column types, and no duplicate rows.
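
The boxplot check in item 3 can also be expressed numerically. The following is a minimal sketch of the 1.5 × IQR rule that a default boxplot uses to mark points beyond its whiskers; the values in `charges` are hypothetical and are not taken from `insurance.csv`.

```python
import pandas as pd

# Hypothetical sample values for illustration only; not taken from insurance.csv.
charges = pd.Series([1200.0, 2500.0, 3100.0, 4000.0, 4800.0, 5200.0, 6100.0, 48000.0])

# Quartiles and interquartile range.
q1 = charges.quantile(0.25)
q3 = charges.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR beyond the quartiles are flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = charges[(charges < lower_fence) | (charges > upper_fence)]

print(f"Fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print("Flagged values:", outliers.tolist())
```

A flagged value is only a candidate for review. Whether it is an error or a genuine high-cost case is a judgment that belongs in the EDA.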

**Research**:
- Scikit-learn (2021) recommends encoding categorical variables for regression.
- Brownlee (2020) suggests checking skewness for transformations (e.g., log-transformation).

**Implications**:
- Confirm `charges` is continuous and check skewness.
- Plan EDA for linearity and outliers.
- Preprocess categorical variables (`sex`, `smoker`, `region`) via one-hot encoding (Learning Unit 4, LO1, Page 11).
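
As a minimal sketch of the one-hot encoding step, the example below applies `pd.get_dummies` to a small frame whose values mirror the first rows of the dataset. The frame name `sample` and the choice of `drop_first=True` are assumptions for illustration; the plan above only specifies one-hot encoding.

```python
import pandas as pd

# Small illustrative frame; values mirror the first rows of the dataset.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
})

# drop_first=True drops one level per variable so the dummy columns are not
# perfectly collinear with the model intercept (an assumed modeling choice).
encoded = pd.get_dummies(
    sample,
    columns=["sex", "smoker", "region"],
    drop_first=True,
    dtype=int,
)

print(encoded.columns.tolist())
print(encoded)
```

Dropping the first level keeps the encoded columns usable with an intercept term. Keeping all levels (`drop_first=False`) is also common when the model is regularized.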

**References**:
- Müller, A.C., & Guido, S. (2016). *Introduction to Machine Learning with Python*. O’Reilly Media.
- Brownlee, J. (2020). *Data Preparation for Machine Learning*. Machine Learning Mastery.
- Scikit-learn. (2021). *User Guide: Linear Models*. https://scikit-learn.org/stable/modules/linear_model.html

## Dataset Suitability Evaluation

The [Kaggle Insurance Dataset](https://www.kaggle.com/datasets/mirichoi0218/insurance) was evaluated for suitability for linear regression (Müller & Guido, 2016):

1. **Continuous Target**:
   - `charges` is a continuous `float64` column (e.g., $16,884.92).
   - **Conclusion**: Suitable (Learning Unit 2, LO1, Page 9).

2. **Feature Types**:
   - Numerical (`age`, `bmi`, `children`); categorical (`sex`, `smoker`, `region`).
   - **Conclusion**: Categorical variables need one-hot encoding (Learning Unit 4, LO1, Page 11).

3. **Data Quality**:
   - No missing values, 1 duplicate row, and a `charges` skewness of 1.52 (right-skewed).
   - **Conclusion**: Minor cleaning needed; a log transformation of `charges` is suggested (Brownlee, 2020).

4. **Linear Relationships**:
   - To be confirmed via scatter plots in EDA.

5. **Outliers**:
   - To be checked via boxplots.

6. **Homoscedasticity/Normality**:
   - To be assessed post-modeling (Learning Unit 5, LO5, Page 12).

7. **Client Context**:
   - US data is acceptable for a proof of concept; SA data is recommended for future work.
   - **Plan**: Note limitation in report.
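
The post-modeling checks in item 6 can be sketched as follows. This is a rough illustration only: it uses synthetic data with deliberately non-constant noise instead of the insurance data, fits ordinary least squares with NumPy, and uses two simple summaries (a spread ratio and residual skewness) in place of formal tests. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y is linear in x, with
# noise whose spread grows with x, i.e. deliberately heteroscedastic.
x = rng.uniform(18, 64, size=500)
y = 250.0 * x + rng.normal(0.0, 20.0 * x)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
residuals = y - fitted

# Rough homoscedasticity check: residual spread for high vs. low fitted values.
# A ratio far from 1 suggests non-constant variance.
median_fit = np.median(fitted)
low = residuals[fitted <= median_fit]
high = residuals[fitted > median_fit]
spread_ratio = high.std() / low.std()

# Rough normality check: sample skewness of the residuals
# (close to 0 for a symmetric distribution).
z = (residuals - residuals.mean()) / residuals.std()
resid_skew = float(np.mean(z ** 3))

print(f"Spread ratio (high/low fitted): {spread_ratio:.2f}")
print(f"Residual skewness: {resid_skew:.2f}")
```

In practice a residuals-versus-fitted plot and a Q-Q plot give a fuller picture than these two numbers.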

**Implications**:
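- **EDA**: Remove the duplicate row, visualize distributions, and draw scatter plots and boxplots.
- **Feature Selection**: Encode categorical variables, then screen features with p-values and variance inflation factors (VIF).
- **Model Training**: Consider log-transforming `charges` to reduce skew.
- **Report**: Include these findings in the Introduction and EDA sections.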
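
For the VIF screening mentioned under Feature Selection, the sketch below computes VIF directly from its definition, `1 / (1 - R²)`, where `R²` comes from regressing each feature on the others. The helper `vif` and the synthetic columns `a`, `b`, `c` are hypothetical names for illustration; this is not a library function.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Return the variance inflation factor of each column of X.

    X holds the features only (no intercept column).
    """
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on all other columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.1 * rng.normal(size=300)  # nearly a copy of `a`, so collinear with it

factors = vif(np.column_stack([a, b, c]))
print(factors.round(2))
```

Columns `a` and `c` should show large factors because each is almost fully explained by the other, while `b` should stay close to 1.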


In [None]:
### Step 1: Initial Dataset Inspection
# Import pandas for data handling
import pandas as pd

# Load the dataset
df = pd.read_csv('insurance.csv')

# Inspect structure (df.info() prints its summary itself and returns None,
# so it is not wrapped in print())
print("Dataset Info:")
df.info()

# Show first 5 rows
print("\nFirst 5 Rows:")
print(df.head())

# Check size
print("\nDataset Shape:", df.shape)


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

First 5 Rows:
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520

Dataset Shape: (1338, 7)
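
The `Non-Null Count` column above already indicates that no values are missing. As a minimal sketch of how missing values and column types could be checked and repaired explicitly, the example below uses a hypothetical three-row frame `demo` with one gap and one mis-typed column; it does not reflect the actual contents of `insurance.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap and one wrongly typed column, for illustration.
demo = pd.DataFrame({
    "age": [19, 18, 28],
    "bmi": [27.9, np.nan, 33.0],
    "children": ["0", "1", "3"],  # numbers stored as strings
})

# Count missing values per column.
missing = demo.isna().sum()

# Coerce the mis-typed column to numeric; unparseable entries would become NaN.
demo["children"] = pd.to_numeric(demo["children"], errors="coerce")

print(missing)
print(demo.dtypes)
```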


In [None]:
### Step 1: Suitability Criteria Check
# Check duplicates
print("\nDuplicates:", df.duplicated().sum())

# Confirm target type and sample values
print("\nCharges Data Type:", df['charges'].dtype)
print("Sample Charges:", df['charges'].head().tolist())

# Check skewness
print("\nCharges Skewness:", df['charges'].skew())


Duplicates: 1

Charges Data Type: float64
Sample Charges: [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]

Charges Skewness: 1.5158796580240388
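
The two follow-up actions suggested by this output, dropping the duplicate row and log-transforming the skewed target, can be sketched as follows. The frame `demo` is a synthetic stand-in (log-normal draws) used only for illustration, so its skewness values will differ from those of the real `charges` column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed stand-in for `charges` (log-normal draws); an
# assumption for illustration, not the real column from insurance.csv.
demo = pd.DataFrame({"charges": rng.lognormal(mean=9.0, sigma=0.9, size=1000)})

# Append a copy of the first row to mimic the single duplicate found above.
demo = pd.concat([demo, demo.iloc[[0]]], ignore_index=True)
n_dupes = int(demo.duplicated().sum())
demo = demo.drop_duplicates().reset_index(drop=True)

# Compare skewness before and after a natural-log transform.
skew_raw = demo["charges"].skew()
skew_log = np.log(demo["charges"]).skew()

print(f"Duplicates removed: {n_dupes}")
print(f"Skewness raw: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

If the model is trained on `log(charges)`, predictions need to be converted back with `np.exp` before they are reported in currency units.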
