# 0701 I04 Linear Regression - Bin Liao

## Project Overview

This project is an exercise in core data science skills: data loading, exploratory data analysis (EDA), and regression modeling. We use the popular 'Medical Cost Personal Dataset' from Kaggle throughout.

In this project, we will:

- Load the data from a GitHub repository.
- Check each column's data type and set any that are missing or wrong.
- Perform exploratory data analysis to understand the data better.
- Formulate a business question and answer it using a regression model.
- Validate the model using appropriate metrics.
- Visualize the data to gain and share insights.

## Dataset Overview - Medical Cost Personal Dataset

The data comes from Kaggle's 'Medical Cost Personal Dataset', a collection of records of individual medical costs billed by health insurance.

The dataset features the following attributes:

- **age**: The age of the primary beneficiary.
- **sex**: The gender of the primary beneficiary - male or female.
- **bmi**: The body mass index (BMI), a measure of body weight relative to height (weight in kilograms divided by the square of height in meters).
- **children**: The number of children covered by health insurance / the number of dependents.
- **smoker**: Indicates whether the beneficiary is a smoker or not.
- **region**: The beneficiary's residential area in the US - northeast, southeast, southwest, or northwest.
- **charges**: Individual medical costs billed by health insurance.
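
To make the column list above concrete, here is a minimal sketch of how the expected data types could be checked and set with pandas. The three rows are made up for illustration and are not taken from the real dataset. Treating `sex`, `smoker`, and `region` as categorical is our assumption about how these columns are best typed.

```python
import pandas as pd

# Made-up rows for illustration only; they are NOT taken from the real dataset.
sample = pd.DataFrame(
    {
        "age": [25, 40, 58],
        "sex": ["female", "male", "female"],
        "bmi": [22.5, 30.1, 27.8],
        "children": [0, 2, 1],
        "smoker": ["no", "yes", "no"],
        "region": ["northeast", "southwest", "southeast"],
        "charges": [3000.0, 21000.0, 12000.0],
    }
)

# Text columns are not categorical by default, so we mark them explicitly.
categorical_columns = ["sex", "smoker", "region"]
typed = sample.astype({name: "category" for name in categorical_columns})

# Inspect the resulting type of every column.
print(typed.dtypes)
```

The same `astype` call should work on the full DataFrame once it is loaded, provided the column names match the list above.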

We will use this dataset to predict medical insurance charges from features such as age, BMI, and the number of children. This is a common, practical use case in the healthcare sector, where stakeholders need to understand which factors drive insurance charges.
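
As a rough sketch of this prediction task, the example below fits scikit-learn's `LinearRegression` to synthetic data. The numbers are generated at random, and the coefficients that produce `charges` are invented for the demonstration, so nothing here reflects the real dataset or the results of this project. It only shows the shape of the workflow: split, fit, then score on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the value ranges are assumptions, not dataset facts.
rng = np.random.default_rng(0)
n = 200
age = rng.integers(18, 65, size=n)
bmi = rng.normal(30, 5, size=n)
children = rng.integers(0, 5, size=n)

# Invented linear relationship plus noise, used only to have something to fit.
charges = 250 * age + 300 * bmi + 500 * children + rng.normal(0, 1000, size=n)

# Feature matrix with one column per predictor.
X = np.column_stack([age, bmi, children])
X_train, X_test, y_train, y_test = train_test_split(
    X, charges, test_size=0.2, random_state=0
)

# Fit on the training split and evaluate on the held-out split.
model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))

print("coefficients (age, bmi, children):", model.coef_)
print("R^2 on held-out data:", r2)
```

With the real data, `X` and `charges` would come from the loaded DataFrame instead, and categorical columns such as `smoker` would need to be encoded numerically before fitting.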


## Project Initialization and Data Loading

In this initial task, we set up our Python environment by importing the necessary libraries for data manipulation, exploration, modeling, and visualization. We then load our chosen dataset, 'Medical Cost Personal Dataset', directly from a GitHub repository into a pandas DataFrame.

In [None]:
import datetime
import os  # used by download_and_load_data below

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns

# If the data file does not exist locally, download it from the given URL, then load it.
def download_and_load_data(file_name, url):
    data_dir = 'data'
    file_path = os.path.join(data_dir, file_name)

    # Check if the data file exists
    if not os.path.isfile(file_path):
        # If not, check if the data directory exists
        if not os.path.isdir(data_dir):
            # If not, create the data directory
            os.makedirs(data_dir)
        
        # Download the data file (`!` is IPython shell syntax, so this needs a notebook and wget)
        print(f'Downloading file from {url} ...')
        !wget -O {file_path} {url}
        print(f'File downloaded and saved to {file_path}')

    # Load data into a dataframe.
    df = pd.read_csv(file_path)
    return df
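
The cell above relies on the `!wget` shell escape, which only works inside IPython or Jupyter with `wget` installed. As an alternative sketch, the download step can be written with the standard library alone. The helper name `ensure_local_copy` and its `data_dir` parameter are hypothetical and not part of the original notebook.

```python
import os
import urllib.request


def ensure_local_copy(file_name, url, data_dir="data"):
    """Download url to data_dir/file_name unless it already exists; return the path.

    Hypothetical helper for illustration; it is not part of the original notebook.
    """
    file_path = os.path.join(data_dir, file_name)

    if not os.path.isfile(file_path):
        # Create the data directory if needed; no error if it already exists.
        os.makedirs(data_dir, exist_ok=True)

        print(f"Downloading file from {url} ...")
        urllib.request.urlretrieve(url, file_path)
        print(f"File downloaded and saved to {file_path}")

    return file_path
```

The returned path can then be passed to `pd.read_csv`, for example `df = pd.read_csv(ensure_local_copy(file_name, url))`, with `file_name` and `url` set to the values used for this project.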


