# Core Statistics Using Python
### Hana Choi, Simon Business School, University of Rochester


# Simple Linear Regression in Python

## Topics covered

- Simple linear regression in Python: two ways
- Some other examples (Diamonds, Wines)

## Required packages

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

# Example: House Prices

## Load data: hprices.csv

In [None]:
# Method 1: Save the data file directly to your working directory
# hprices = pd.read_csv('hprices.csv')

# Method 2: Tell Python explicitly where your data file is located
# Below is my file path; replace it with your own.
hprices = pd.read_csv("/Users/hanachoi/Dropbox/teaching/core_statistics/Data/hprices.csv")

# Method 3: We can also import a dataset from the web
# hprices = pd.read_csv("http://hanachoi.github.io/datasets/hprices.csv")

# Display the first few rows of the dataframe
print(hprices.head())

## Describe data

In [None]:
# Display basic descriptive statistics
print(hprices.describe())

In [None]:
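# Correlation between size and price
# (restricting to these two numeric columns keeps the output focused on this pair)
print(hprices[['sqrft', 'price']].corr())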

## Visualize data

In [None]:
# Create the scatterplot with a regression line using regplot()
sns.regplot(x='sqrft', y='price', scatter_kws={'s': 20}, 
            line_kws={'color': 'red', 'label': 'Fitted Line'}, data=hprices)

# Add labels and title
plt.title('House Size vs Price', fontsize=14)
plt.xlabel('Size', fontsize=12)
plt.ylabel('Price', fontsize=12)
plt.legend()

# Show the plot
plt.show()

## Using the formulas directly to estimate regression coefficients
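
The cell below implements the standard least-squares formulas for the simple regression model. Written out with the sample covariance $s_{XY}$ and the sample variance $s_X^2$ (notation introduced here only for convenience), they are:

$$\widehat{\beta}_1 = \frac{s_{XY}}{s_X^2} = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2}, \qquad \widehat{\beta}_0 = \bar{Y} - \widehat{\beta}_1\bar{X}$$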

In [None]:
# Calculating slope 
cov_sqrft_price = hprices[['sqrft', 'price']].cov().iloc[0, 1]  # Covariance between sqrft and price
var_sqrft = hprices['sqrft'].var()  # Variance of sqrft
beta1_hat = cov_sqrft_price / var_sqrft

# Calculating intercept 
mean_price = hprices['price'].mean()
mean_sqrft = hprices['sqrft'].mean()
beta0_hat = mean_price - beta1_hat * mean_sqrft

# Print the result
print(f"Intercept: {beta0_hat}")
print(f"Slope: {beta1_hat}")
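
As a sanity check on the formulas, here is a minimal sketch on made-up data (the `toy` dataframe and its columns `x` and `y` are assumptions for illustration, not part of `hprices.csv`). It compares the covariance/variance calculation with NumPy's least-squares line fit, `np.polyfit`, which should give the same slope and intercept up to rounding error.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: y is roughly 2 + 3x plus random noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=200)
toy = pd.DataFrame({"x": x, "y": y})

# Slope and intercept from the covariance/variance formulas
slope_formula = toy["x"].cov(toy["y"]) / toy["x"].var()
intercept_formula = toy["y"].mean() - slope_formula * toy["x"].mean()

# Same quantities from a degree-1 least-squares fit
# (np.polyfit returns the highest-degree coefficient first)
slope_ls, intercept_ls = np.polyfit(toy["x"], toy["y"], deg=1)

print(f"Formula: intercept={intercept_formula:.4f}, slope={slope_formula:.4f}")
print(f"polyfit: intercept={intercept_ls:.4f}, slope={slope_ls:.4f}")
```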

# Using `statsmodels` to Run Linear Regressions

- Linear regression can be run using the `statsmodels` package.
- Recall our basic linear regression model:

$$Y_i = \beta_0 + \beta_1 X_i + e_i$$

where we want to obtain the coefficient estimates $\widehat{\beta}_0$ and $\widehat{\beta}_1$ from a sample of data.

- With the `statsmodels` package, there are two ways to run a regression in Python.
- Both require you to tell Python what $Y$ and $X$ are.
- The second method is usually more convenient, because it adds the intercept for you.

## Method 1: using `.OLS` functionality

- The first way is to use the `.OLS` functionality in `statsmodels`, which uses matrix notation.
- You need to specify which columns of your dataset are $Y$ and $X$.
- You also need to augment the $X$ "matrix" with a constant column, so that the model includes an intercept.

In [None]:
# Specifying X, Y in your dataset
y = hprices[['price']]
X = hprices[['sqrft']]

# Add a constant (intercept term)
X = sm.add_constant(X) # This adds the intercept term to the model

# Fit the linear regression model
hprices_fit = sm.OLS(y, X).fit() # OLS stands for Ordinary Least Squares

In [None]:
# Print the summary of the regression results
print(hprices_fit.summary())

In [None]:
# Print the intercept and slope of the regression results
print(hprices_fit.params)

In [None]:
intercept = hprices_fit.params.iloc[0]
slope = hprices_fit.params.iloc[1]

print(f"Intercept: {intercept}")
print(f"Slope: {slope}")

## Method 2: using `formula.api` functionality

- The formula API lets you specify a model with a formula string rather than matrix notation (very similar to how it works in R or Stata).
- Note that the coefficient estimates are exactly the same as in the matrix version above; only the label of the intercept changes (`Intercept` instead of `const`).
- It performs the same calculations; you are just calling it with a different syntax.
- I will use this method for the remainder of the course, but the matrix version works just fine too (I just find the formula syntax easier).
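
The equivalence of the two interfaces can be checked with a short sketch on made-up data (the `toy` dataframe and the names `fit_matrix` and `fit_formula` are assumptions for illustration). Both fits should return the same numbers, with the intercept labelled `const` in one case and `Intercept` in the other.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical toy data: y is roughly 1 + 2x plus random noise
rng = np.random.default_rng(42)
toy = pd.DataFrame({"x": rng.uniform(0, 10, size=100)})
toy["y"] = 1.0 + 2.0 * toy["x"] + rng.normal(0, 1, size=100)

# Matrix interface: the constant must be added by hand
fit_matrix = sm.OLS(toy["y"], sm.add_constant(toy[["x"]])).fit()

# Formula interface: the intercept is included automatically
fit_formula = smf.ols("y ~ x", data=toy).fit()

print(fit_matrix.params)
print(fit_formula.params)

# Compare the estimates numerically, ignoring the labels
same_estimates = np.allclose(fit_matrix.params.to_numpy(),
                             fit_formula.params.to_numpy())
print(f"Same estimates: {same_estimates}")
```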

In [None]:
# You need to import the formula API from statsmodels first
import statsmodels.formula.api as smf

In [None]:
# Here's how to run a simple linear regression using the formula api

# Formula format: y ~ x
hprices_fit = smf.ols(formula='price ~ sqrft', data=hprices).fit()

# Print the summary of the regression results
print(hprices_fit.summary())
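
Once a model has been fitted with the formula API, the fitted line can be evaluated at new values of $X$ with `.predict()`. The sketch below uses made-up data (the `toy` and `new_houses` dataframes are hypothetical); the new dataframe only needs a column with the same name as the regressor in the formula.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data lying exactly on the line price = 10 + 0.5 * sqrft
toy = pd.DataFrame({"sqrft": [100.0, 200.0, 300.0, 400.0]})
toy["price"] = 10.0 + 0.5 * toy["sqrft"]

toy_fit = smf.ols(formula="price ~ sqrft", data=toy).fit()

# New observation at which to evaluate the fitted line
new_houses = pd.DataFrame({"sqrft": [250.0]})
predicted = toy_fit.predict(new_houses)

# The prediction equals intercept + slope * 250 for this toy line
print(predicted.iloc[0])
```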

# Some Other Examples

## Example: diamonds

- Here is the scatter plot we saw in Excel, with a linear trend line included: <br> 

<img src='http://hanachoi.github.io/datasets/Lec7Diamond.png' alt="Scatter plot of diamond price against carats with a linear trend line" align="center" style="width: 50%; height: auto"> <br>


- Excel provided an equation with an intercept of 5573.3 and a slope of -1679.2.
- Let's verify whether the linear regression yields the same results.
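
Excel displays its trend-line equation rounded to one decimal place, so a fair comparison rounds the regression estimates in the same way. Here is a minimal sketch of such a check; the helper `matches_reference` and the numbers used to try it out are made up for illustration and are not the diamonds results.

```python
import numpy as np
import pandas as pd

def matches_reference(params, ref_intercept, ref_slope, decimals=1):
    """Return True if (intercept, slope) in `params` equal the reference
    values after rounding to `decimals` decimal places."""
    estimates = np.round(params.to_numpy(dtype=float), decimals)
    return bool(np.allclose(estimates, [ref_intercept, ref_slope]))

# Made-up estimates standing in for a fitted model's .params Series
example_params = pd.Series({"Intercept": 1.234, "x": -5.678})

print(matches_reference(example_params, 1.2, -5.7))
print(matches_reference(example_params, 1.2, -5.6))
```

With a fitted model, the same helper could be called as `matches_reference(fit.params, ...)`, assuming the intercept comes first in `params`, as it does for the formula API.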

In [None]:
# Data
diamonds = pd.read_csv("/Users/hanachoi/Dropbox/teaching/core_statistics/Data/diamonds.csv")

# Run a simple linear regression of price on carats
diamonds_fit = smf.ols(formula='price ~ carats', data=diamonds).fit()

# Print the intercept and slope of the regression results
print(f"Intercept: {diamonds_fit.params.iloc[0]}")
print(f"Slope: {diamonds_fit.params.iloc[1]}")

## Example: wines

In [None]:
# Data
wines = pd.read_csv("/Users/hanachoi/Dropbox/teaching/core_statistics/Data/wines.csv")

# Run a simple linear regression of Price on Score (the rating) and print the summary
print(smf.ols(formula='Price ~ Score', data=wines).fit().summary())