## Linear Regression

In [1]:
import pandas as pd
import numpy as np

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

In [2]:
diamonds = pd.read_csv('Data/Diamonds Prices2022.csv')
diamonds.head()

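| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |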


In [4]:
X = diamonds[['carat']] # Features
y = diamonds['price'] # Target

In [6]:
sm.OLS(y, X).fit().summary() # Summary of our fitted regression (no constant added, so the line is forced through the origin)

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared (uncentered): | 0.881 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.881 |
| Method: | Least Squares | F-statistic: | 400400.0 |
| Date: | Mon, 06 Oct 2025 | Prob (F-statistic): | 0.0 |
| Time: | 15:56:06 | Log-Likelihood: | -484640.0 |
| No. Observations: | 53943 | AIC: | 969300.0 |
| Df Residuals: | 53942 | BIC: | 969300.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| carat | 5666.2131 | 8.955 | 632.764 | 0.000 | 5648.662 | 5683.764 |

| | | | |
|---|---|---|---|
| Omnibus: | 26112.003 | Durbin-Watson: | 0.344 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 146451.869 |
| Skew: | 2.34 | Prob(JB): | 0.0 |
| Kurtosis: | 9.577 | Cond. No. | 1.0 |


In [7]:
model = LinearRegression(fit_intercept=False).fit(X, y)
model.coef_

array([5666.21312432])

In [8]:
model.intercept_

0.0

In [9]:
model.score(X, y)

0.7658869476042685
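
The two R² values above differ (0.881 from statsmodels, 0.766 from scikit-learn) even though the fitted coefficient is the same. When the model has no constant, statsmodels reports the uncentered R², which measures the residuals against the raw sum of squares of `y`. `LinearRegression.score` uses the centered R², which measures them against deviations from the mean of `y`. The sketch below computes both by hand with NumPy. It uses synthetic data: the `carat` and `price` arrays are made up for illustration and are not the diamonds dataset.

```python
import numpy as np

# Synthetic stand-in data (an assumption for illustration, not the real dataset)
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, size=500)
price = -2000 + 7500 * carat + rng.normal(0, 1000, size=500)

# Least-squares slope for a line forced through the origin
slope = (carat @ price) / (carat @ carat)
residuals = price - slope * carat
ss_res = residuals @ residuals

# Uncentered R²: baseline is "predict 0 for every diamond"
r2_uncentered = 1 - ss_res / (price @ price)

# Centered R²: baseline is "predict the mean price for every diamond"
deviations = price - price.mean()
r2_centered = 1 - ss_res / (deviations @ deviations)

print(f"uncentered R²: {r2_uncentered:.3f}")
print(f"centered R²:   {r2_centered:.3f}")
```

Because the raw sum of squares is never smaller than the centered one, the uncentered R² is always at least as large as the centered R², which is why the no-intercept statsmodels summary looks more flattering than `model.score`.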

## Linear Regression in Statsmodels

In [10]:
import pandas as pd
import numpy as np

import statsmodels.api as sm

In [11]:
diamonds = pd.read_csv('Data/Diamonds Prices2022.csv')
diamonds.head()

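| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |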


In [14]:
X = sm.add_constant(diamonds['carat'])
y = diamonds['price']

model = sm.OLS(y, X).fit()
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.849 |
| Model: | OLS | Adj. R-squared: | 0.849 |
| Method: | Least Squares | F-statistic: | 304100.0 |
| Date: | Mon, 06 Oct 2025 | Prob (F-statistic): | 0.0 |
| Time: | 17:06:49 | Log-Likelihood: | -472760.0 |
| No. Observations: | 53943 | AIC: | 945500.0 |
| Df Residuals: | 53941 | BIC: | 945500.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -2256.3950 | 13.055 | -172.840 | 0.000 | -2281.983 | -2230.807 |
| carat | 7756.4362 | 14.066 | 551.423 | 0.000 | 7728.866 | 7784.006 |

| | | | |
|---|---|---|---|
| Omnibus: | 14027.005 | Durbin-Watson: | 0.986 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 153060.389 |
| Skew: | 0.939 | Prob(JB): | 0.0 |
| Kurtosis: | 11.036 | Cond. No. | 3.65 |
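
The `const` row in the summary comes from `sm.add_constant`, which by default prepends a column of ones to the features so that OLS can estimate an intercept. A rough NumPy equivalent is sketched below. It runs on synthetic data (the arrays and the "true" values of -2000 and 7500 are assumptions chosen for illustration) and uses `np.linalg.lstsq` rather than statsmodels.

```python
import numpy as np

# Synthetic stand-in data (an assumption for illustration, not the real dataset)
rng = np.random.default_rng(42)
carat = rng.uniform(0.2, 3.0, size=200)
price = -2000 + 7500 * carat + rng.normal(0, 500, size=200)

# Design matrix: a leading column of ones (the "const" column), then carat
X_design = np.column_stack([np.ones_like(carat), carat])

# Ordinary least squares: the first coefficient is the intercept, the second the slope
coefs, *_ = np.linalg.lstsq(X_design, price, rcond=None)
intercept, slope = coefs

print(f"intercept: {intercept:.1f}, slope: {slope:.1f}")
```

Without the column of ones, the fit has no intercept term and the line is forced through the origin, as in the first model in this notebook.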


**Results from the model, read from the second (coefficients) table:**  
const (intercept) = -2256.3950  
carat (slope) = 7756.4362  

How do we interpret this?  
* A 1-carat increase in a diamond's weight is associated with a $7,756 increase in its price, on average.

* We cannot say that adding 1 carat *causes* a $7,756 increase in price without a more rigorous experiment.

* Technically, a 0-carat diamond is predicted to cost -$2,256. No real diamond weighs 0 carats, so the intercept is an extrapolation and should not be read literally.  
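
Written as an equation, with the coefficients from the table rounded to two decimals, the fitted line is:

$$
\widehat{\text{price}} = -2256.40 + 7756.44 \times \text{carat}
$$

For example, a 1-carat diamond has a predicted price of about $-2256.40 + 7756.44 = 5500.04$ dollars.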

## Making Predictions

In [16]:
new_diamonds = pd.DataFrame({'carat': [0, .1, .3, .5, 1, 2, 3, 5]})
new_diamonds.head()

| | carat |
|---|---|
| 0 | 0.0 |
| 1 | 0.1 |
| 2 | 0.3 |
| 3 | 0.5 |
| 4 | 1.0 |


In [18]:
model.predict(sm.add_constant(new_diamonds))

0    -2256.395048
1    -1480.751432
2       70.535800
3     1621.823032
4     5500.041112
5    13256.477271
6    21012.913431
7    36525.785750
dtype: float64
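
As a sanity check, these predictions are just the intercept plus the slope times each carat value. The sketch below reproduces them by hand with NumPy. It uses the coefficients as printed in the summary table (rounded to four decimals), so the results should agree with `model.predict` only to within a fraction of a cent.

```python
import numpy as np

# Coefficients copied from the regression summary (rounded as printed)
const, slope = -2256.3950, 7756.4362

# Same carat values as new_diamonds
carats = np.array([0, 0.1, 0.3, 0.5, 1, 2, 3, 5])

# Linear prediction: price = const + slope * carat
predicted = const + slope * carats
print(np.round(predicted, 2))
```

The negative prices for 0 and 0.1 carats show where the straight-line fit breaks down for very small diamonds.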