# HiMCM Session 7
# Other Prediction Methods

- **Multiple Linear Regression**: Predict the dependent variable based on a linear combination of independent variables.
- **Polynomial Regression**: Model the relationship between the variables with a polynomial instead of a straight line.
- **Logistic Regression**: Predict the probability of a yes/no outcome; commonly used for classification tasks.

## Multiple Linear Regression

Multiple linear regression extends simple linear regression: it models the dependent variable as a linear function of several independent variables instead of just one. We will use the following example to see how to conduct multiple linear regression.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# The following "magic command" allows figures to be displayed automatically
%matplotlib inline

In [None]:
# Load the insurance prices data set.
url = "https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv"
insurance = pd.read_csv(url)
insurance.head()

To get a feel for the data, it is common to do some quick exploratory analysis.

In [None]:
# Size of data
print(insurance.shape)

In [None]:
# Variable names, data types, and missing values
insurance.info()

In [None]:
# Distribution of categorical variables
insurance['sex'].value_counts()

In [None]:
insurance['sex'].value_counts().plot.bar()

In [None]:
# Visualize the distribution of `smoker` and `region`.



In [None]:
# Descriptive statistics on numerical variables
# BMI: body mass index
insurance.describe()

In [None]:
# Distribution of data
pd.plotting.scatter_matrix(insurance, figsize=(10, 10))
plt.show()

Which variable has a linear relationship with `charges`?

In [None]:
# Use relplot from the seaborn library to inspect the data more closely
import seaborn as sns
sns.relplot(x="bmi", y="charges", hue="smoker", data=insurance)
# plt.show()

In [None]:
# Exercise: Use Seaborn to make a scatter plot with age on the x axis and
# charges on the y axis, colored by whether the person is a smoker.



Let's build a multiple linear regression model.

In [None]:
import statsmodels.formula.api as smf

model1 = smf.ols('charges ~ age + bmi + children', data = insurance).fit()
model1.summary()

- What is the equation of this linear model?
- Is the R^2 value good?
- Could any of the coefficients be 0?
- Do the residuals follow a normal distribution?
- Are the predicted values close to the actual values?
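
To make the first two questions concrete, here is a minimal sketch of where the coefficients and R^2 come from. It uses made-up synthetic data and plain NumPy least squares rather than the insurance data and `statsmodels`; the variable names and the "true" coefficients are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic data (made up for illustration): y depends on two predictors plus noise.
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(20, 60, n)
x2 = rng.uniform(15, 40, n)
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 1.0, n)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares: choose beta to minimize the sum of squared residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ beta
resid = y - fitted

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
r_squared = 1 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)

print("intercept and slopes:", np.round(beta, 2))
print("R^2:", round(r_squared, 3))
```

The equation of the model is read off from `beta`: predicted y = `beta[0]` + `beta[1]` * x1 + `beta[2]` * x2. An R^2 close to 1 means the residuals are small compared with the overall spread of y.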

In [None]:
model1.resid.hist(bins = 20)

In [None]:
plt.scatter(x = insurance["charges"], y = model1.fittedvalues)
plt.xlabel("Actual charges")
plt.ylabel("Predicted charges")

Let's add the remaining columns. For categorical variables, we need to convert them to quantitative data using **dummy variables**.

In [None]:
# dtype=int gives 0/1 columns (recent pandas versions return True/False by default)
insurance_new = pd.get_dummies(insurance, columns=["sex"], drop_first=True, dtype=int)
insurance_new.head()
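
To see what `drop_first=True` does when a column has more than two categories, here is a small sketch on a toy table. The `size` and `price` columns are hypothetical and are not part of the insurance data.

```python
import pandas as pd

# Toy data (hypothetical, not the insurance data set).
toy = pd.DataFrame({
    "size": ["small", "medium", "large", "small"],
    "price": [10, 15, 20, 11],
})

# Without drop_first: one 0/1 column per category.
full = pd.get_dummies(toy, columns=["size"], dtype=int)

# With drop_first=True: the first category in sorted order ("large" here)
# is dropped and becomes the baseline the other columns are compared to.
reduced = pd.get_dummies(toy, columns=["size"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

In `full`, the three dummy columns always add up to 1, which duplicates the information in the model's intercept. Dropping one category removes that redundancy; a row with zeros in every remaining dummy column belongs to the baseline category.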

In [None]:
# Exercise: Convert "smoker" and "region" to quantitative columns.



In [None]:
# Exercise: Build a linear regression model using all these columns.



Evaluate the performance of this linear model.

## Polynomial Regression

If the relationship is not linear, we can use a polynomial to fit the data. A polynomial curve is more flexible since it has more parameters.
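
As a quick illustration of that extra flexibility, the sketch below fits a straight line and a cubic to the same made-up curved data and compares the sum of squared residuals. It uses `np.polyfit` on synthetic data, which is an assumption for illustration and not the method applied to the wage data in this section.

```python
import numpy as np

# Synthetic data (made up): a curved relationship plus noise.
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 100)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(0, 0.5, x.size)

def sse(degree):
    """Sum of squared residuals of the least-squares polynomial of this degree."""
    coefs = np.polyfit(x, y, deg=degree)  # coefficients, highest power first
    return np.sum((y - np.polyval(coefs, x)) ** 2)

sse_line = sse(1)   # straight line: 2 parameters
sse_cubic = sse(3)  # cubic: 4 parameters

print("SSE, straight line:", round(sse_line, 1))
print("SSE, cubic:", round(sse_cubic, 1))
```

A higher-degree polynomial never fits the training data worse than a lower-degree one, so a smaller error here does not by itself mean better predictions on new data.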

In [None]:
url = "https://raw.githubusercontent.com/JWarmenhoven/ISLR-python/master/Notebooks/Data/Wage.csv"
# wage = pd.read_csv(url)
wage = pd.read_csv(url, usecols=['age', 'wage'])
wage.head()

In [None]:
plt.plot(wage['age'], wage['wage'], 'b.', alpha=0.1)

In [None]:
model = smf.ols('wage ~ age + I(age**2) + I(age**3)', wage).fit()
model.summary()

In [None]:
# Visualize the model
pred = model.fittedvalues
plt.plot(wage['age'], pred, 'gs', alpha=0.1)
plt.plot(wage['age'], wage['wage'], 'b.', alpha=0.1)

## Logistic Regression

The Challenger Space Shuttle tragically exploded in 1986, killing all astronauts on board. The explosion was shown to have been caused by an O-ring failure, likely due to cold temperatures on the day of the launch.

<img src="https://www.history.com/.image/ar_16:9%2Cc_fill%2Ccs_srgb%2Cfl_progressive%2Cg_faces:center%2Cq_auto:good%2Cw_768/MTU3ODc4NTk5MjI2NTY1OTYx/image-placeholder-title.jpg" width="300">

Below are the test data for this O-ring at a variety of temperatures:

In [None]:
url = "http://comet.lehman.cuny.edu/owen/teaching/mat328/chall.txt"
data = pd.read_csv(url,
                   sep=r"\s+",  # raw string: columns are separated by whitespace
                   header=None,
                   names=["Temperature", "Failure"])
# data.sort_values(by='Temperature', inplace=True)
# data.reset_index(drop=True, inplace=True)
data

In [None]:
# Plot the data
data.plot.scatter(x="Temperature", y="Failure")

On launch day, the temperature was 30 degrees Fahrenheit. How likely is this O-ring to fail?

A **logistic regression** model aims to predict the probability of an event given a set of independent variables. Let's apply this method.

In [None]:
logistic_model = smf.logit('Failure ~ Temperature',data).fit()
logistic_model.summary()

In [None]:
# Graph the model as a probability curve.
x = np.linspace(50, 85, 200) # sample 200 evenly-spaced values in [50, 85]
params = logistic_model.params
logits = params['Intercept'] + x * params['Temperature']
probs = np.exp(logits) / (1 + np.exp(logits))

data.plot.scatter(x = "Temperature", y = "Failure")
plt.plot(x, probs)
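
The step from log-odds to probabilities in the cell above can be isolated in a small helper. This is a sketch only: the intercept and slope below are hypothetical numbers picked for illustration, not the fitted values from `logistic_model`.

```python
import numpy as np

def logistic(z):
    """Map a log-odds value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for illustration only (NOT the fitted values).
demo_intercept, demo_slope = 15.0, -0.25

demo_temps = np.array([50.0, 60.0, 70.0, 80.0])
demo_logits = demo_intercept + demo_slope * demo_temps
demo_probs = logistic(demo_logits)
print(np.round(demo_probs, 3))

# Same quantity as exp(z) / (1 + exp(z)), written in a different form.
assert np.allclose(demo_probs, np.exp(demo_logits) / (1 + np.exp(demo_logits)))
```

With a negative slope, the predicted probability falls as temperature rises, and it equals 0.5 exactly where the log-odds are zero, that is, at a temperature of -intercept / slope.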


In [None]:
test = pd.DataFrame({"Temperature":[30]})
test

In [None]:
logistic_model.predict(test)