# Predicting Diamond Prices
### Phase 2: Statistical Modeling

### Group 106
### Dylan Hann s3719281
### Edward Pearson s3844470

## Table of Contents
- [Introduction](#intro)
- [Statistical Modeling](#stats)
- [Critique and Limitations](#crithit)
- [Summary and Conclusion](#summary)

## Phase One Summary 
In Phase One we preprocessed the data: we checked for missing or incorrect values and drew a random sample from the large dataset. We then produced basic plots of the trends in the diamonds, with particular interest in how each diamond specification affects price.
We found that price is correlated with two of the specifications, carat and colour. For the other specifications, the evidence was inconclusive as to whether they affect price.

### Report Overview

### Overview of Methodology

## Statistical Modeling

### Model Overview

#### Model imports

In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None) 

%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")

df = pd.read_csv('diamonds.csv')

In [15]:
df.head()

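|   | carat | cut | color | clarity | depth | table | x | y | z | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 3.95 | 3.98 | 2.43 | 326 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 3.89 | 3.84 | 2.31 | 326 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 4.05 | 4.07 | 2.31 | 327 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 4.2 | 4.23 | 2.63 | 334 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 4.34 | 4.35 | 2.75 | 335 |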


In [16]:
formula_string_ind_vars= '+'.join(df.drop(columns='price').columns)
formula_string='price~'+formula_string_ind_vars
print('Formula string: ', formula_string)

Formula string:  price~carat+cut+color+clarity+depth+table+x+y+z


In [22]:
# dtype=int keeps the indicator columns as 0/1 integers (pandas 2.0+ defaults to bool)
df_encoded=pd.get_dummies(df, drop_first=True, dtype=int)
df_encoded=df_encoded.rename(columns={'cut_Very Good': 'cut_very_good'})
df_encoded.head()

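|   | carat | depth | table | x | y | z | price | cut_Good | cut_Ideal | cut_Premium | cut_very_good | color_E | color_F | color_G | color_H | color_I | color_J | clarity_IF | clarity_SI1 | clarity_SI2 | clarity_VS1 | clarity_VS2 | clarity_VVS1 | clarity_VVS2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 61.5 | 55.0 | 3.95 | 3.98 | 2.43 | 326 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0.21 | 59.8 | 61.0 | 3.89 | 3.84 | 2.31 | 326 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0.23 | 56.9 | 65.0 | 4.05 | 4.07 | 2.31 | 327 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0.29 | 62.4 | 58.0 | 4.2 | 4.23 | 2.63 | 334 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0.31 | 63.3 | 58.0 | 4.34 | 4.35 | 2.75 | 335 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |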
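
The effect of `drop_first=True` can be seen on a small example. The following is a minimal sketch on a toy frame; the `toy` name and its values are illustrative assumptions, not rows from `diamonds.csv`. It shows that the alphabetically first level of each categorical column gets no indicator column and so acts as the baseline, which is why the encoded frame above has no `cut_Fair` column.

```python
import pandas as pd

# Toy frame with one categorical column (illustrative values only)
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29, 0.31],
    "cut": ["Ideal", "Premium", "Fair", "Good"],
})

# drop_first=True drops the first level in sorted order ("Fair"),
# which then serves as the baseline category
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded.columns.tolist())

# Rows whose indicator columns are all zero belong to the baseline level
baseline_rows = encoded.filter(like="cut_").sum(axis=1) == 0
print(toy.loc[baseline_rows, "cut"].tolist())
```

Each remaining coefficient is then read relative to the dropped level rather than as an absolute effect.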


In [24]:
formula_string_ind_vars_encoded= '+'.join(df_encoded.drop(columns='price').columns)
formula_string_encoded='price~'+formula_string_ind_vars_encoded
print('Formula string encoded: ', formula_string_encoded)

Formula string encoded:  price~carat+depth+table+x+y+z+cut_Good+cut_Ideal+cut_Premium+cut_very_good+color_E+color_F+color_G+color_H+color_I+color_J+clarity_IF+clarity_SI1+clarity_SI2+clarity_VS1+clarity_VS2+clarity_VVS1+clarity_VVS2
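
The rename of `cut_Very Good` to `cut_very_good` matters because a column name containing a space is not a valid name inside a formula string. As a hedged alternative sketch, patsy's built-in `Q("...")` quoting can wrap such names instead of renaming them. The helpers `quote_term` and `build_formula` below are hypothetical names introduced for illustration, not part of the notebook.

```python
import keyword

def quote_term(name):
    """Wrap names that are not plain Python identifiers in patsy's Q("...")."""
    if name.isidentifier() and not keyword.iskeyword(name):
        return name
    return f'Q("{name}")'

def build_formula(target, predictors):
    """Join predictor names into a 'target ~ a + b' formula string."""
    rhs = " + ".join(quote_term(p) for p in predictors)
    return f"{quote_term(target)} ~ {rhs}"

# Column names as they appear in the encoded frame before the rename
print(build_formula("price", ["carat", "cut_Good", "cut_Very Good"]))
# prints: price ~ carat + cut_Good + Q("cut_Very Good")
```

Renaming, as done here, has the advantage that the fitted term names stay plain strings, which keeps later term-by-term manipulation of the formula simpler.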


In [25]:
model_full = sm.formula.ols(formula=formula_string_encoded, data=df_encoded)
model_full_fitted = model_full.fit()
print(model_full_fitted.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.920
Model:                            OLS   Adj. R-squared:                  0.920
Method:                 Least Squares   F-statistic:                 2.688e+04
Date:                Mon, 18 Oct 2021   Prob (F-statistic):               0.00
Time:                        12:23:58   Log-Likelihood:            -4.5573e+05
No. Observations:               53940   AIC:                         9.115e+05
Df Residuals:                   53916   BIC:                         9.117e+05
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept      2184.4774    408.197      5.352
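
The next cell prunes the full model by repeatedly removing the term with the largest p-value. As a minimal, self-contained sketch of that idea, the example below applies the same rule to synthetic data. The function `backward_eliminate`, the column names `signal` and `noise`, and the generated data are all assumptions made for illustration. It rebuilds the formula string on each pass rather than editing a patsy `ModelDesc`, so it only handles main-effect terms.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def backward_eliminate(data, target, predictors, cutoff=0.05):
    """Drop the least significant predictor until all p-values are below cutoff."""
    kept = list(predictors)
    while kept:
        fit = smf.ols(f"{target} ~ {' + '.join(kept)}", data=data).fit()
        pvals = fit.pvalues.drop(labels="Intercept")
        worst = pvals.idxmax()          # term with the largest p-value
        if pvals[worst] < cutoff:       # every remaining term is significant
            return fit, kept
        kept.remove(worst)
    # No predictor survived: fall back to an intercept-only model
    return smf.ols(f"{target} ~ 1", data=data).fit(), kept

# Synthetic data: y depends on `signal` only; `noise` is unrelated to y
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({"signal": rng.normal(size=n), "noise": rng.normal(size=n)})
toy["y"] = 3.0 * toy["signal"] + rng.normal(size=n)

fit, kept = backward_eliminate(toy, "y", ["signal", "noise"])
print(kept)
print(f"R-squared: {fit.rsquared:.4f}")
```

An unrelated predictor can still pass the cutoff by chance in about 5% of samples, so the surviving terms should be read as a screening result rather than as proof of an effect.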

In [27]:
patsy_description = patsy.ModelDesc.from_formula(formula_string_encoded)

linreg_fit = model_full_fitted

p_val_cutoff = 0.05

print('\nPerforming backwards feature selection using p-values:')

while True:

    # Rank the non-intercept terms from least to most significant
    pval_series = linreg_fit.pvalues.drop(labels='Intercept')
    pval_series = pval_series.sort_values(ascending=False)
    term = pval_series.index[0]
    pval = pval_series.iloc[0]  # positional access; the index holds term names
    # Stop once every remaining term is significant at the cutoff
    if (pval < p_val_cutoff):
        break
    term_components = term.split(':')
    print(f'\nRemoving term "{term}" with p-value {pval:.4}')
    # One component is a main effect; two components form an interaction term
    if (len(term_components) == 1): 
        patsy_description.rhs_termlist.remove(patsy.Term([patsy.EvalFactor(term_components[0])]))    
    else: 
        patsy_description.rhs_termlist.remove(patsy.Term([patsy.EvalFactor(term_components[0]), 
                                                        patsy.EvalFactor(term_components[1])]))    
        
    # Refit the model without the removed term
    linreg_fit = smf.ols(formula=patsy_description, data=df_encoded).fit()
    
model_reduced_fitted = smf.ols(formula = patsy_description, data = df_encoded).fit()

print("\n***")
print(model_reduced_fitted.summary())
print("***")
print(f"Regression number of terms: {len(model_reduced_fitted.model.exog_names)}")
print(f"Regression F-distribution p-value: {model_reduced_fitted.f_pvalue:.4f}")
print(f"Regression R-squared: {model_reduced_fitted.rsquared:.4f}")
print(f"Regression Adjusted R-squared: {model_reduced_fitted.rsquared_adj:.4f}")


Performing backwards feature selection using p-values:

Removing term "y" with p-value 0.6192

Removing term "z" with p-value 0.1488

***
                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.920
Model:                            OLS   Adj. R-squared:                  0.920
Method:                 Least Squares   F-statistic:                 2.944e+04
Date:                Mon, 18 Oct 2021   Prob (F-statistic):               0.00
Time:                        12:45:14   Log-Likelihood:            -4.5573e+05
No. Observations:               53940   AIC:                         9.115e+05
Df Residuals:                   53918   BIC:                         9.117e+05
Df Model:                          21                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025  