## Linear Regression Assumptions

To ensure the [Kaggle Insurance Dataset](https://www.kaggle.com/datasets/mirichoi0218/insurance) is suitable for linear regression, we verify the following assumptions (Muller & Guido, 2016, Chapter 2):

1. **Continuous Target**: The dependent variable (`charges`) must be numerical and continuous.
2. **Linear Relationships**: Features (e.g., `age`, `bmi`) should have linear relationships with `charges`, checked via scatter plots.
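3. **No Extreme Outliers**: Extreme values can disproportionately influence the fitted coefficients; assessed with boxplots.
4. **Homoscedasticity**: Residuals should have constant variance across the range of predicted values, checked post-modeling with a residuals vs. predicted plot.
5. **Normality of Residuals**: Residuals should be approximately normally distributed (this matters mainly for p-values and confidence intervals), checked post-modeling.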
6. **Independence**: Observations are independent (assumed for insurance data).
7. **Data Quality**: Minimal missing values, correct types, no duplicates.
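
As a rough illustration of assumption 3, the rule a boxplot uses to place its whiskers (by default, points more than 1.5 × IQR beyond the quartiles are drawn as outliers) can also be applied numerically. This is a minimal sketch: the helper name `iqr_outliers` and the sample values are made up for illustration and are not taken from the insurance data.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up sample, not taken from insurance.csv
sample = pd.Series([1, 2, 3, 4, 5, 100])
flagged = iqr_outliers(sample)
print(flagged.tolist())
```

Flagging a point this way does not by itself justify removing it; it only identifies candidates to inspect.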

**Research**:
- Scikit-learn's linear models expect numeric input, so categorical variables must be encoded before fitting (Scikit-learn, 2021).
- Brownlee (2020) suggests checking a variable's skewness to decide whether a transformation (e.g., a log-transformation) is warranted.

**Implications**:
- Confirm `charges` is continuous and check skewness.
- Plan EDA for linearity and outliers.
- Preprocess categorical variables (`sex`, `smoker`, `region`) via one-hot encoding (Learning Unit 4, LO1, Page 11).
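
The encoding step listed above could look roughly like the following. It is a sketch on a four-row toy frame that reuses the dataset's column names; the values are illustrative only, and `dtype=int` is an optional choice made here so the dummy columns print as 0/1.

```python
import pandas as pd

# Toy rows using the dataset's column names; values are illustrative only
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# drop_first=True drops one level per variable so the dummy columns
# are not perfectly collinear with the model's intercept
encoded = pd.get_dummies(
    toy, columns=["sex", "smoker", "region"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```

Each categorical variable with k levels becomes k - 1 indicator columns; the dropped level acts as the reference category.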

**References**:
- Muller, A.C., & Guido, S. (2016). *Introduction to Machine Learning with Python*. O’Reilly Media.
- Brownlee, J. (2020). *Data Preparation for Machine Learning*. Machine Learning Mastery.
- Scikit-learn. (2021). *User Guide: Linear Models*. https://scikit-learn.org/stable/modules/linear_models.html

## Dataset Suitability Evaluation

The [Kaggle Insurance Dataset](https://www.kaggle.com/datasets/mirichoi0218/insurance) was evaluated for linear regression (Muller & Guido, 2016):

1. **Continuous Target**:
   - `charges` is stored as `float64` and is continuous (e.g., $16,884.92).
   - **Conclusion**: Suitable (Learning Unit 2, LO1, Page 9).

2. **Feature Types**:
   - Numerical (`age`, `bmi`, `children`); categorical (`sex`, `smoker`, `region`).
   - **Conclusion**: Categorical variables need one-hot encoding (Learning Unit 4, LO1, Page 11).

3. **Data Quality**:
   - No missing values; 1 duplicate row; `charges` is right-skewed (skewness = 1.52).
   - **Conclusion**: Remove the duplicate row; a log-transformation of `charges` is suggested (Brownlee, 2020).

4. **Linear Relationships**:
   - To be confirmed via scatter plots in EDA.

5. **Outliers**:
   - To be checked via boxplots.

6. **Homoscedasticity/Normality**:
   - To be assessed post-modeling (Learning Unit 5, LO5, Page 12).

7. **Client Context**:
   - The dataset is US-based, which is acceptable for a proof of concept; South African (SA) data is recommended for future work.
   - **Plan**: Note this limitation in the report.

**Implications**:
- **EDA**: Remove the duplicate row; visualize distributions with histograms, scatter plots, and boxplots.
- **Feature Selection**: One-hot encode the categorical variables; screen features using p-values and VIF.
- **Model Training**: Consider log-transforming `charges`.
- **Report**: Include these findings in the Introduction and EDA sections.
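
To illustrate why a log-transformation is considered for a right-skewed target, the sketch below compares skewness before and after `np.log`. The data is synthetic (log-normal draws standing in for `charges`), so the printed values will not match the dataset's skewness of 1.52.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for `charges` (not real data)
rng = np.random.default_rng(0)
charges = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=1000))

skew_raw = charges.skew()
skew_log = np.log(charges).skew()  # valid because all values are > 0

print(f"skew before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

On the real data the same two calls would be applied to `df['charges']`; whether the transformed target actually improves the model still has to be checked during training.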

### Step 1: Initial Dataset Inspection

```python
# Import pandas for data handling
import pandas as pd

# Load the dataset
df = pd.read_csv('insurance.csv')

# Inspect structure (info() prints directly and returns None, so it is not wrapped in print)
print("Dataset Info:")
df.info()

# Show first 5 rows
print("\nFirst 5 Rows:")
print(df.head())

# Check size
print("\nDataset Shape:", df.shape)
```

```text
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

First 5 Rows:
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520

Dataset Shape: (1338, 7)
```


### Step 1: Suitability Criteria Check

```python
# Check duplicates
print("\nDuplicates:", df.duplicated().sum())

# Confirm target type and sample values
print("\nCharges Data Type:", df['charges'].dtype)
print("Sample Charges:", df['charges'].head().tolist())

# Check skewness
print("\nCharges Skewness:", df['charges'].skew())
```

```text
Duplicates: 1

Charges Data Type: float64
Sample Charges: [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]

Charges Skewness: 1.5158796580240388
```
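
The single duplicate reported above can be removed with `drop_duplicates()`. The sketch below shows the pattern on a toy frame with one repeated row; the rows are illustrative, and on the real data the same call would be applied to `df`.

```python
import pandas as pd

# Toy frame with one repeated row; values are illustrative only
toy = pd.DataFrame({
    "age": [19, 18, 18, 28],
    "bmi": [27.9, 33.77, 33.77, 33.0],
    "charges": [16884.924, 1725.5523, 1725.5523, 4449.462],
})

print("Duplicates before:", toy.duplicated().sum())

# keep="first" (the default) retains the first occurrence of each row
cleaned = toy.drop_duplicates().reset_index(drop=True)
print("Duplicates after:", cleaned.duplicated().sum())
print("Shape:", cleaned.shape)
```

Resetting the index afterwards is optional; it simply keeps row labels contiguous for later steps.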


### Step 2: Analysis Plan

The following table outlines the plan for analyzing the [Kaggle Insurance Dataset](https://www.kaggle.com/datasets/mirichoi0218/insurance) to build a linear regression model for predicting medical insurance charges, addressing the South African medical aid scheme’s needs. The plan incorporates findings from Step 1 (e.g., skewness = 1.52, categorical variables) and research (Muller & Guido, 2016; Brownlee, 2020).

| **Step** | **Description** | **Tools/Methods** | **Considerations** | **Research Reference** |
|----------|-----------------|-------------------|--------------------|-----------------------|
| **Exploratory Data Analysis (EDA)** | - Inspect data types, missing values, duplicates.<br>- Summarize statistics (mean, median, quartiles).<br>- Visualize distributions (histograms, boxplots for outliers).<br>- Analyze correlations (heatmap for numerical features).<br>- Plot feature vs. charges (scatter for numerical, boxplots for categorical).<br>- Check skewness of charges (1.52, Step 1). | - Pandas: `info()`, `describe()`, `isnull()`, `drop_duplicates()`.<br>- Seaborn/Matplotlib: histograms, boxplots, scatter plots, heatmap.<br>- Skewness: `df['charges'].skew()`.<br>- Log-transformation if skew > 1 (Brownlee, 2020). | - Remove 1 duplicate (Step 1).<br>- Handle outliers if >3 standard deviations (justify removal).<br>- Visualize skewness to confirm log-transformation need.<br>- Identify strong predictors (e.g., `smoker`) for model training.<br>- Validate linearity for regression assumptions (Learning Unit 2, LO1, Page 9). | - Muller & Guido (2016, Chapter 2).<br>- Brownlee, J. (2020). *Data Preparation for Machine Learning*. |
| **Feature Selection** | - Encode categorical variables (`sex`, `smoker`, `region`) using one-hot encoding.<br>- Correlation analysis for numerical features (`age`, `bmi`, `children`).<br>- Backward elimination with p-values (<0.05) for significance.<br>- Check multicollinearity with Variance Inflation Factor (VIF).<br>- Retain client-relevant features (e.g., `region`) even if p > 0.05. | - Pandas: `get_dummies(drop_first=True)`.<br>- Statsmodels: `OLS` for p-values.<br>- Scikit-learn: `LinearRegression` coefficients.<br>- Statsmodels: `variance_inflation_factor` for VIF.<br>- Drop features if p > 0.05 or VIF > 5, unless retained for client relevance. | - Dropping one level per encoded variable (`drop_first=True`) avoids the dummy variable trap (Learning Unit 4, LO1, Page 11).<br>- Retain `region` for client context (SA medical aid).<br>- Justify feature retention/removal in report.<br>- Ensure features align with model training (e.g., verify the significance of `smoker`). | - Muller & Guido (2016, Chapter 4).<br>- Scikit-learn. (2021). *OneHotEncoder*. https://scikit-learn.org/stable/modules/preprocessing.html |
| **Train Model** | - Split data: 80% training, 20% testing.<br>- Use `LinearRegression` with default hyperparameters.<br>- Test log-transformation of `charges` (skewness = 1.52).<br>- Test polynomial features (degree=2) for non-linearities.<br>- Fit model on training data. | - Scikit-learn: `train_test_split(random_state=42)`, `LinearRegression`, `PolynomialFeatures`.<br>- NumPy: `np.log()` for transformation.<br>- Random seed for reproducibility. | - Default hyperparameters suffice (no tuning needed, Learning Unit 2, LO4, Page 9).<br>- Log-transformation addresses skewness (Step 1).<br>- Polynomial features if R² < 0.7 (address underfitting, Learning Unit 2, LO2, Page 9).<br>- Ensure interpretability for client (e.g., coefficients in $; with a log-transformed target, coefficients instead describe approximate percentage changes in `charges`). | - Muller & Guido (2016, Chapter 2).<br>- Brownlee, J. (2020). *Polynomial Regression*. |
| **Interpret and Evaluate Model** | - Metrics: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R², Adjusted R².<br>- Statistical testing: p-values for coefficients.<br>- Residual analysis: Normality (histogram), homoscedasticity (residuals vs. predicted).<br>- Compare train/test metrics for overfitting/underfitting.<br>- Visualize actual vs. predicted charges.<br>- Test alternative model (e.g., log-transformed) if needed. | - Scikit-learn: `mean_squared_error`, `r2_score`.<br>- Statsmodels: `OLS(...).fit().summary()` for p-values, Adjusted R².<br>- Seaborn: residual plots.<br>- NumPy: Adjusted R² formula.<br>- Matplotlib: scatter plots. | - RMSE for client-interpretable errors (in $; back-transform predictions first if the model was fitted on log `charges`).<br>- R² > 0.7 for good fit; Adjusted R² to account for the number of features.<br>- P-values < 0.05 confirm significant predictors (Module Outcome 1, Page 7).<br>- Address heteroscedasticity via transformation (Learning Unit 5, LO5, Page 12).<br>- Justify metrics (Srivastava, 2019). | - Muller & Guido (2016, Chapter 5).<br>- Srivastava, T. (2019). *11 Important Model Evaluation Metrics*. |
| **Write Report** | - **Introduction**: Client context, dataset, objectives.<br>- **EDA**: Cleaning, visualizations, patterns (e.g., skewness = 1.52).<br>- **Feature Selection**: Features, rationale (p-values, VIF).<br>- **Model Training**: Features, process, hyperparameters.<br>- **Results**: Metrics, statistical significance, plots.<br>- **Discussion**: Effectiveness, limitations (US vs. SA data), recommendations.<br>- **Conclusion**: Client recommendations.<br>- Include code snippets, references. | - Google Docs/Colab for drafting.<br>- Matplotlib: Save plots (`plt.savefig()`).<br>- GitHub: Store code (Activity 1.2, Page 21).<br>- Cite textbook, Scikit-learn, Kaggle (Pages 42–45). | - Client-focused, concise (5–7 pages).<br>- Visualizations highlight patterns (Module Outcome 4, Page 7).<br>- Address US vs. SA data limitations (Step 1).<br>- Include advanced insights (e.g., log-transformation).<br>- Avoid plagiarism (Pages 42–45). | - Muller & Guido (2016).<br>- Kaggle. (2021). *Insurance Dataset*. https://www.kaggle.com/datasets/mirichoi0218/insurance |
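
The multicollinearity check in the Feature Selection row relies on the Variance Inflation Factor. As a sketch of what that number measures, the code below computes it directly from its definition, VIF = 1 / (1 - R²), where R² comes from regressing one feature on the others. The helper `vif` and the synthetic features are assumptions made for illustration; the plan itself names statsmodels' `variance_inflation_factor`, which should give the same values when the design matrix includes a constant column.

```python
import numpy as np


def vif(X: np.ndarray) -> np.ndarray:
    """VIF for each column of X, from the definition 1 / (1 - R^2)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress column j on the remaining columns plus an intercept
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out


# Synthetic features: x3 is nearly a copy of x1, so both should get a high VIF
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + rng.normal(scale=0.1, size=500)
scores = vif(np.column_stack([x1, x2, x3]))
print(np.round(scores, 1))
```

With the plan's threshold of VIF > 5, `x1` and `x3` would be flagged here while `x2` would not.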

**Additional Research**:
- Brownlee, J. (2020). *Data Preparation for Machine Learning*. Machine Learning Mastery.
- Scikit-learn. (2021). *User Guide: Linear Models*. https://scikit-learn.org/stable/modules/linear_models.html
- Srivastava, T. (2019). *11 Important Model Evaluation Metrics*. Analytics Vidhya.
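
The evaluation metrics named in the plan (RMSE, R², Adjusted R²) follow from short formulas. The sketch below writes them out with NumPy; the function name `regression_metrics` and the sample values are made up for illustration, and in practice scikit-learn's `mean_squared_error` and `r2_score` cover the first two.

```python
import numpy as np


def regression_metrics(y_true, y_pred, n_features):
    """Return RMSE, R^2 and adjusted R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    rmse = np.sqrt(ss_res / n)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalizes R^2 for the number of features used
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return rmse, r2, adj_r2


# Made-up actual and predicted values, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.0, 5.0, 8.0, 9.0, 11.0]
rmse, r2, adj_r2 = regression_metrics(y_true, y_pred, n_features=1)
print(f"RMSE={rmse:.3f}, R2={r2:.3f}, adjusted R2={adj_r2:.3f}")
```

Adjusted R² is never larger than R² when at least one feature is used, which is why the plan uses it to compare models with different feature counts.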

**Notes**:
- The plan addresses Step 1 findings: skewness = 1.52 suggests log-transformation; 1 duplicate requires cleaning; categorical variables need encoding.
- Considerations ensure client relevance (SA medical aid) and regression assumptions (Learning Unit 2, Page 9).
- The report will be drafted in Markdown (Colab) and exported as PDF for submission.
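
As a sketch of how the training step in the plan might fit together (80/20 split, `LinearRegression`, log-transformed target), the example below runs on synthetic data rather than `insurance.csv`. The two features, their coefficients, and the noise level are assumptions chosen only so the example is self-contained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded data (not the insurance dataset):
# log(charges) depends linearly on age and a 0/1 smoker flag, plus noise
rng = np.random.default_rng(42)
n = 1000
age = rng.integers(18, 65, size=n)
smoker = rng.integers(0, 2, size=n)
log_charges = 7.0 + 0.03 * age + 1.5 * smoker + rng.normal(scale=0.3, size=n)
charges = np.exp(log_charges)

X = np.column_stack([age, smoker])
X_train, X_test, y_train, y_test = train_test_split(
    X, charges, test_size=0.2, random_state=42
)

# Fit on log(charges), then map predictions back to the original scale
model = LinearRegression().fit(X_train, np.log(y_train))
pred_test = np.exp(model.predict(X_test))

r2_log = r2_score(np.log(y_test), np.log(pred_test))
print("Train size:", X_train.shape[0], "Test size:", X_test.shape[0])
print("Test R2 on log scale:", round(r2_log, 3))
```

Because the model is fitted on the log scale, errors reported to the client in dollars should be computed from the back-transformed predictions (`pred_test`), not from the log-scale residuals.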