In [12]:
import pandas as pd
import sklearn.utils  # provides shuffle, used below
from sklearn.linear_model import SGDRegressor
from sklearn import svm, preprocessing

df = pd.read_csv("../data/diamonds.csv", index_col = 0)
df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


We want to find a formula that takes inputs such as carat, cut, color, clarity, depth, table, x, y, and z and predicts price. Note that the columns holding string values (cut, color, and clarity) must first be converted to numbers.

We start by using linear regression to predict price. First, let's see what kinds of cut we are working with:

In [2]:
df["cut"].unique()

array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)

Ideally these cuts are represented by numerical values that preserve their quality ordering. We can do this with a dictionary.

In [3]:
cut_class_dict = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}

We can do the same with clarity. First we look at the values present in the data:

In [4]:
df['clarity'].unique()

array(['SI2', 'SI1', 'VS1', 'VS2', 'VVS2', 'VVS1', 'I1', 'IF'],
      dtype=object)

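These values appear in order of first occurrence, not in order of quality. The standard clarity scale, from worst to best, is I3, I2, I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF, FL, so we encode that ordering in a dictionary as well, including the grades that do not occur in this dataset.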

In [5]:
clarity_dict = {"I3": 1, "I2": 2, "I1": 3, "SI2": 4, "SI1": 5, "VS2": 6, "VS1": 7, "VVS2": 8, "VVS1": 9, "IF": 10, "FL": 11}

And the same for color.

In [6]:
color_dict = {"J": 1,"I": 2,"H": 3,"G": 4,"F": 5,"E": 6,"D": 7}

Now we apply all three mappings:

In [7]:
df['cut'] = df['cut'].map(cut_class_dict)
df['clarity'] = df['clarity'].map(clarity_dict)
df['color'] = df['color'].map(color_dict)
df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,5,6,4,61.5,55.0,326,3.95,3.98,2.43
2,0.21,4,6,5,59.8,61.0,326,3.89,3.84,2.31
3,0.23,2,6,7,56.9,65.0,327,4.05,4.07,2.31
4,0.29,4,2,6,62.4,58.0,334,4.2,4.23,2.63
5,0.31,2,1,4,63.3,58.0,335,4.34,4.35,2.75
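
One caveat with `Series.map` and a dictionary: any value that has no entry in the dictionary becomes `NaN` without raising an error, so a misspelled key would silently produce missing data. A minimal sketch of a sanity check, using a small made-up frame (`sample` is hypothetical, not the real dataset):

```python
import pandas as pd

# Hypothetical miniature frame; "Excellent" is deliberately absent from the mapping.
sample = pd.DataFrame({"cut": ["Ideal", "Good", "Excellent"]})
cut_class_dict = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}

mapped = sample["cut"].map(cut_class_dict)

# Values without a dictionary entry become NaN, so count them explicitly.
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

Running the same `isna().sum()` check on the mapped `cut`, `clarity`, and `color` columns of `df` should give zero if every category is covered.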


Now we can train a regression model on this data, which is a supervised learning task. To do this we will use the Scikit-learn package. To pick a model, we consult the [choosing the right estimator](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) flowchart, which suggests using an SGD Regressor.

Now we convert the data to features and labels. In machine learning, it's standard practice to store feature sets as a capital X and labels as a lowercase y.

In [8]:
df = sklearn.utils.shuffle(df)  # shuffle the rows to avoid any bias that comes from the original ordering

X = df.drop("price", axis=1).values
y = df["price"].values

For X we want all the columns except price (since it's what we're trying to predict), so we drop the price column. Then we use the `values` attribute to convert to a NumPy array. Our y labels are just the price column's values. We set aside the last 200 rows for testing the model after it has been trained.

In [9]:
test_size = 200

X_train = X[:-test_size]
y_train = y[:-test_size]

X_test = X[-test_size:]
y_test = y[-test_size:]
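
The manual slicing above works because the rows were already shuffled. As an alternative sketch, scikit-learn's `train_test_split` shuffles and splits in one call. The arrays below (`X_demo`, `y_demo`) are hypothetical stand-ins, not the diamonds data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in arrays: row i of X_demo is [2i, 2i+1] and y_demo[i] is i.
X_demo = np.arange(2000, dtype=float).reshape(1000, 2)
y_demo = np.arange(1000, dtype=float)

# An integer test_size holds out exactly that many rows; shuffling is on by default.
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=200, random_state=0
)
print(X_train_demo.shape, X_test_demo.shape)
```

Fixing `random_state` makes the split reproducible, which the plain `shuffle` call does not do unless it is also given a seed.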

Now we can train and test our regressor.

In [10]:
clf = SGDRegressor(max_iter=1000)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))

-11040836.031693865


In [11]:
# Use distinct loop names so the full X and y arrays are not overwritten.
for features, actual in list(zip(X_test, y_test))[:10]:
    print(clf.predict([features])[0], actual)

-28712207.65813279 883
16235429.880697727 1567
-8302124.994375467 4338
1178612.4980704784 5767
24376962.195331097 2782
3067932.803976774 928
22499800.345269203 3669
-845398.3787369728 2351
19644139.130872965 12891
-16527455.150798917 1053


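The score for these regression models is the coefficient of determination, R-squared. A perfect fit scores 1.0 (100%), a model that always predicts the mean price scores 0, and a negative score means the model does worse than that. A score of about -11 million is therefore very bad, and the predictions above are off by several orders of magnitude. SGD is sensitive to feature scaling, so the unscaled features are a likely cause. Instead, let's try support vector regression.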
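
As a brief aside before switching models, the coefficient of determination is defined as 1 - SS_res / SS_tot, which is why it has no lower bound: predictions that are worse than always guessing the mean give a negative value. A minimal sketch with made-up numbers (`y_true` and `y_bad` are illustrative only):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_bad = np.array([10.0, -5.0, 8.0, 0.0])  # hypothetical poor predictions

ss_res = np.sum((y_true - y_bad) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

# The manual value should agree with scikit-learn's implementation.
r2_sklearn = r2_score(y_true, y_bad)
# Always predicting the mean is the baseline that R-squared compares against.
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))
print(r2_manual, r2_sklearn, r2_mean)
```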

In [13]:
clf = svm.SVR()

clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

-0.07291750049658297


In [14]:
# Use distinct loop names so the full X and y arrays are not overwritten.
for features, actual in list(zip(X_test, y_test))[:10]:
    print(clf.predict([features])[0], actual)

2329.800628638714 883
2384.317423870536 1567
2515.0917019293056 4338
2411.7848679711587 5767
2413.836531710348 2782
2324.0556701386718 928
2449.3987157029937 3669
2417.567589976069 2351
2489.5287746618906 12891
2341.60398483887 1053
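
Both models above were fit on raw, unscaled features, and SGD-based and SVM-based estimators are generally sensitive to feature scale. One common remedy, sketched here on synthetic data as an assumption rather than a step taken in the cells above, is to standardize the features in a pipeline before fitting. The data and names (`X_demo`, `y_demo`, `model`) are hypothetical:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: two features on very different scales.
small = rng.normal(0.0, 1.0, size=1000)
large = rng.normal(0.0, 1000.0, size=1000)
X_demo = np.column_stack([small, large])
y_demo = 3.0 * small + 0.002 * large + rng.normal(0.0, 0.1, size=1000)

# Standardize each feature to zero mean and unit variance before SGD sees it.
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, random_state=0))
model.fit(X_demo[:800], y_demo[:800])

scaled_score = model.score(X_demo[800:], y_demo[800:])
print(scaled_score)
```

Putting the scaler inside the pipeline means it is fit on the training rows only and then reused on the test rows, which avoids leaking test statistics into training.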
