<a href="https://colab.research.google.com/github/ChewPeng/R/blob/main/R_PreConf_Intro_to_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction to Linear Regression**


Prepared by:
    
Dr. Gan Chew Peng
<br><br>

LinkedIn: [ChewPeng](https://www.linkedin.com/in/chew-peng-gan-03b516a6/)

GitHub: [R](https://github.com/ChewPeng/R/blob/main/R_PreConf_Intro_to_Linear_Regression.ipynb)

## Dataset

In this session, we use the [insurance.csv](https://github.com/ChewPeng/R/blob/main/insurance.csv) dataset, which was downloaded from [Kaggle](https://www.kaggle.com/mirichoi0218/insurance).

------

1: Read `insurance.csv`

In [None]:
# Read the CSV from GitHub; TRUE is spelled out because T can be reassigned
df <- read.csv("https://raw.githubusercontent.com/ChewPeng/R/main/insurance.csv",
               header = TRUE,
               sep = ",",
               strip.white = TRUE,
               stringsAsFactors = TRUE)


head(df)

dim(df)



2: Summary of the data

In [None]:
summary(df)

------

## Linear Regression: Factors Affecting `charges`

The simple linear regression model can be written as `y = b0 + b1*x + e`, where `b0` and `b1` are the regression coefficients (parameters) and `e` is the error term:

*   `b0` is the intercept of the regression line, that is, the predicted value when `x = 0`.
*   `b1` is the slope of the regression line, that is, the change in the predicted `y` for a one-unit increase in `x`.
*   `e` is the error term, the part of `y` that the regression model cannot explain. Its estimates from a fitted model are the residuals.
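
To make the roles of `b0`, `b1`, and `e` concrete, here is a minimal sketch on made-up toy data (the values of `x` and `y` are assumptions for illustration, not taken from `insurance.csv`). It computes the least-squares slope and intercept from their closed-form formulas and compares them with what `lm()` returns.

```r
# Toy data (made up for illustration; not from insurance.csv)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares estimates from the closed-form formulas
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

# Residuals: the part of each observed y that the fitted line does not explain
e <- y - (b0 + b1 * x)

# The same estimates from lm()
fit <- lm(y ~ x)

print(c(b0 = b0, b1 = b1))
print(coef(fit))
```

The two printed vectors should agree up to floating-point rounding, and the residuals in `e` sum to approximately zero because the model includes an intercept.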



In [None]:
# Correlation matrix of the numeric columns only
cor(df[, sapply(df, is.numeric)])


Now, let us investigate the relationship between `charges` and `age`.

**1: Simple Linear Regression**

Simple linear regression finds the best-fit line for the relationship between `y` and `x`, choosing the coefficients that minimize the sum of squared residuals.

The linear model equation can be written as follows:

```
charges = b0 + b1 * age
```

The R function `lm()` estimates the coefficients `b0` and `b1` of the linear model:

In [None]:
model1 <- lm(charges ~  age, data = df)
model1
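
The estimated coefficients plug straight into the equation `charges = b0 + b1 * age`. The sketch below illustrates this on a small synthetic data frame (the `toy` values and the age of 35 are hypothetical, chosen only for illustration): a manual prediction built from `coef()` matches the one returned by `predict()`.

```r
# Synthetic data (hypothetical values, for illustration only)
toy <- data.frame(age     = c(20, 30, 40, 50, 60),
                  charges = c(3000, 5500, 8200, 10400, 13100))

fit <- lm(charges ~ age, data = toy)

# Named vector of estimates: "(Intercept)" is b0 and "age" is b1
b <- coef(fit)

# Manual prediction for a 35-year-old using charges = b0 + b1 * age
manual <- unname(b["(Intercept)"] + b["age"] * 35)

# The same prediction via predict()
auto <- unname(predict(fit, newdata = data.frame(age = 35)))

print(c(manual = manual, predict = auto))
```

The same pattern should work with `model1` by replacing `fit` and supplying any `age` of interest in `newdata`.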

**2: Scatter Plot**

In [None]:
library(ggplot2)

ggplot(df, aes(x = age, y = charges)) +
  geom_point() +
  stat_smooth(method = lm)

**3: Statistical Summary of the Regression Model**




In [None]:
summary(model1)

**4: Outliers**

In [None]:
ggplot(df, aes(y=charges)) +
  geom_boxplot() 

In [None]:
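# Quartiles and interquartile range of charges
Q1 <- quantile(df$charges, .25)
Q3 <- quantile(df$charges, .75)
iqr <- IQR(df$charges)  # lowercase name avoids masking the IQR() function

In [None]:
# Keep rows strictly inside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR
df_no_outliers <- subset(df,
                         charges > (Q1 - 1.5 * iqr) &
                         charges < (Q3 + 1.5 * iqr))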


dim(df_no_outliers)
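
The filter above is the usual boxplot rule: a value is kept only if it lies strictly between `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR`. As a small worked example, the sketch below applies the same rule to a made-up vector `v` (an assumption for illustration) that contains one extreme value.

```r
# Made-up values with one obviously extreme point
v <- c(10, 12, 13, 14, 15, 16, 18, 95)

q1 <- quantile(v, 0.25)
q3 <- quantile(v, 0.75)
spread <- IQR(v)                 # equals q3 - q1

# Lower and upper fences of the boxplot rule
lower <- q1 - 1.5 * spread
upper <- q3 + 1.5 * spread

# Keep only the values strictly inside the fences
kept <- v[v > lower & v < upper]
print(kept)
```

Here the value 95 falls above the upper fence and is dropped, while the other seven values are kept.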

In [None]:
ggplot(df_no_outliers, aes(x=age, y=charges)) +
  geom_point() +
  stat_smooth(method = lm)

**5: Linear Regression Model Fitting Using `df_no_outliers`**

In [None]:
model2 <- lm(charges ~  age,  data = df_no_outliers)
model2

In [None]:
summary(model2)

**6: Multiple Linear Regression**

In [None]:
colnames(df)

In [None]:
model3 <- lm(charges ~  ., 
            data = df_no_outliers)
model3

summary(model3)
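
With `charges ~ .`, every other column becomes a predictor, and factor columns are expanded into 0/1 indicator columns before fitting. The sketch below shows this on a hypothetical mini data set (the `toy` values are made up; only the column names mirror the notebook): `model.matrix()` reveals the design matrix that `lm()` builds, and the coefficient names follow its columns.

```r
# Hypothetical mini data set with one numeric and one factor predictor
toy <- data.frame(
  age     = c(25, 40, 33, 51, 29, 46),
  smoker  = factor(c("no", "yes", "no", "yes", "no", "yes")),
  charges = c(3200, 21000, 4100, 25500, 3600, 23800)
)

# Design matrix built from the formula: the factor becomes a 0/1 column
# "smokeryes"; the first level ("no") is the baseline absorbed by the intercept
X <- model.matrix(charges ~ ., data = toy)
print(colnames(X))

fit <- lm(charges ~ ., data = toy)
print(coef(fit))
```

Under R's default treatment contrasts, the `smokeryes` coefficient is the estimated difference in `charges` between smokers and the baseline level at the same `age`.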

In [None]:
model4 <- lm(charges ~  age + 
                        bmi  + 
                        children + 
                        smoker+
                        region, 
            data = df)
model4

summary(model4)

In [None]:
model5 <- lm(charges ~ . ,
            data = df)
model5

summary(model5)

**7: Diagnostic Plots**

In [None]:
plot(model5)
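
The plots produced by `plot()` on an `lm` object label the most extreme observations by row number. A numeric counterpart, sketched here on synthetic data with one injected outlier (the data, the seed, and the cutoff of 3 are all assumptions for illustration, not the notebook's method), is to look at the standardized residuals from `rstandard()`.

```r
# Synthetic data with one injected outlier (row 10); values are made up
set.seed(1)
x <- 1:20
y <- 2 + 3 * x + rnorm(20)
y[10] <- y[10] + 25

fit <- lm(y ~ x)

# Standardized residuals; |value| > 3 is a common rule of thumb, not a strict test
r <- rstandard(fit)
flagged <- which(abs(r) > 3)
print(flagged)
```

The row indices in `flagged` are candidates for closer inspection; whether to remove them remains a judgment call that depends on the data.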

**8: Improvement**

The diagnostic plots flag observations #1301, #578, #544, and #243 as outliers. Let us remove these rows and refit the linear model.

In [None]:
library(dplyr)
df_new <- df %>% slice(-c(1301,578,544,243))
dim(df_new)

In [None]:
model5_improve <- lm(charges ~  ., 
            data = df_new)

summary(model5_improve)

**9: Real-time Demo**

Go to this [page](https://ganchewpeng2.shinyapps.io/LinearRegressionDemo/) and repeat the analysis.







------