In [1]:
import pandas as pd

df = pd.read_csv("diamonds.csv")
df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


**Side-Task**: Shift the index by 1    -->    index.map()

In [2]:
df.index = df.index.map(lambda x: x+1)    # Shift the index by 1
df.index.values

array([    1,     2,     3, ..., 53938, 53939, 53940], dtype=int64)
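For a plain integer shift like this one, adding to the index directly should give the same result as `index.map()`. A minimal sketch on a small made-up frame (the `toy` DataFrame and its values are assumptions for illustration, not the diamonds data):

```python
import pandas as pd

# Toy frame standing in for the diamonds data (illustrative values only)
toy = pd.DataFrame({"carat": [0.23, 0.21, 0.23]})

shifted_map = toy.index.map(lambda i: i + 1)   # element-wise, as in the cell above
shifted_add = toy.index + 1                    # vectorized arithmetic on the index

print(list(shifted_map), list(shifted_add))
```

`index.map()` remains the more general tool when the transformation is not simple arithmetic.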

#### Goal (Prediction)
- Input: {carat,cut,color,clarity,depth,table,x,y,z}
- Output: price

#### Preliminary Plans
- Convert the string columns to numbers so they can be used for ML
- We will use linear regression here, so it helps if the string categories map to numbers in a meaningful order (an ordinal encoding)


In [3]:
cut_to_num_dict = {'Ideal':5, 'Premium':4, 'Very Good':3, 'Good':2, 'Fair':1}
df['cut'] = df["cut"].map(cut_to_num_dict)    # .map can even work by passing a dict!
df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,5,E,SI2,61.5,55.0,326,3.95,3.98,2.43
2,0.21,4,E,SI1,59.8,61.0,326,3.89,3.84,2.31
3,0.23,2,E,VS1,56.9,65.0,327,4.05,4.07,2.31
4,0.29,4,I,VS2,62.4,58.0,334,4.2,4.23,2.63
5,0.31,2,J,SI2,63.3,58.0,335,4.34,4.35,2.75
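One thing to watch with dict-based `.map()`: a label that is missing from the dict becomes `NaN` instead of raising an error. A small sketch of how one might check for that, using a toy Series in which `'Excellent'` is a made-up label that does not appear in the mapping:

```python
import pandas as pd

cut_to_num_dict = {'Ideal': 5, 'Premium': 4, 'Very Good': 3, 'Good': 2, 'Fair': 1}

# Toy column; 'Excellent' is a made-up label that is NOT in the dict
cuts = pd.Series(['Ideal', 'Good', 'Excellent'])
mapped = cuts.map(cut_to_num_dict)

# Labels missing from the dict come back as NaN rather than raising an error
unmapped = cuts[mapped.isna()].unique()
print(mapped.tolist())
print("unmapped labels:", list(unmapped))
```

Running the same `isna()` check on the real column before overwriting it is a cheap way to catch typos in the dict keys.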


Let's do the same for color and clarity

In [4]:
color_to_num_dict = {"J": 1,"I": 2,"H": 3,"G": 4,"F": 5,"E": 6,"D": 7}
clarity_to_num_dict = {"I3": 1, "I2": 2, "I1": 3, "SI2": 4, "SI1": 5, "VS2": 6, "VS1": 7, "VVS2": 8, "VVS1": 9, "IF": 10, "FL": 11}

df['color'] = df['color'].map(color_to_num_dict)
df['clarity'] = df['clarity'].map(clarity_to_num_dict)
df.tail()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
53936,0.72,5,7,5,60.8,57.0,2757,5.75,5.76,3.5
53937,0.72,2,7,5,63.1,55.0,2757,5.69,5.75,3.61
53938,0.7,3,7,5,62.8,60.0,2757,5.66,5.68,3.56
53939,0.86,4,3,4,61.0,58.0,2757,6.15,6.12,3.74
53940,0.75,5,7,4,62.2,55.0,2757,5.83,5.87,3.64
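An alternative to hand-written dicts is an ordered categorical, which derives the numbers from a worst-to-best list. This is only a sketch of that option, not what the notebook uses; the `colors` Series holds toy values:

```python
import pandas as pd

# Worst-to-best order, matching color_to_num_dict above
color_order = ["J", "I", "H", "G", "F", "E", "D"]

colors = pd.Series(["E", "I", "J", "D"])   # toy values
as_cat = pd.Categorical(colors, categories=color_order, ordered=True)

# .codes is 0-based (and -1 for unknown labels), so add 1 to match the dict
color_num = pd.Series(as_cat.codes, index=colors.index) + 1
print(color_num.tolist())
```

The trade-off is that the order lives in one list instead of being spread across dict values, which makes it harder to assign two grades the same number by accident.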


In [5]:
import sklearn
from sklearn.linear_model import SGDRegressor

df = sklearn.utils.shuffle(df)    # Always shuffle data to avoid biases

X = df.drop("price", axis=1).values     # feature sets (all columns except price) stored as X
                                        # .values converts to a NumPy array
y = df["price"].values                  # labels stored as lowercase y

For X we want all of the columns EXCEPT for the price one, so we can just drop it. Then we use .values to convert to a numpy array. Then, for our labels, y, we say this is just the price column's values. 

Great, but we probably want to hold back some of these rows for testing the model after it has been trained. So we'll do something like:

In [6]:
import numpy as np

test_size = 200
 
X_train = X[:-test_size]    # everything in X except the last 200 rows
y_train = y[:-test_size]

X_test = X[-test_size:]     # last 200 items in the array
y_test = y[-test_size:]
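The shuffle-then-slice steps above can also be done in one call with scikit-learn's `train_test_split`. A minimal sketch on toy arrays (`X_toy` and `y_toy` are random stand-ins, not the diamonds features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays standing in for X and y (1,000 rows, 9 features)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 9))
y_toy = rng.normal(size=1000)

# An integer test_size is an absolute row count; shuffling happens by default
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=200, random_state=0
)
print(X_train.shape, X_test.shape)
```

Passing `random_state` makes the split repeatable, which helps when comparing models across cells.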

### Begin training and testing our model!

In [7]:
clf = SGDRegressor(max_iter=1000)
clf.fit(X_train, y_train)

print(clf.score(X_test,y_test))

-17297654.09619937


In [8]:
for features, price in list(zip(X_test, y_test))[:10]:    # avoid overwriting X and y
    print(clf.predict([features])[0], price)

-12245857.977849007 2590
-6725210.215133905 1417
96069.76275777817 717
-8678792.147341728 4242
1428208.1224856377 449
6562826.341778517 1715
17603349.11327815 830
19852465.265902996 561
-12497601.045515537 2812
5632387.048245907 781


**That's not very good...**
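To read that score: for regressors, `.score()` reports R², which is 1.0 for perfect predictions, 0.0 for always predicting the mean, and negative for anything worse than that. A rough sketch of the calculation in plain Python, fed with the first three prediction/price pairs printed above (rounded); the helper name `r_squared` is just for illustration:

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the quantity a regressor's .score() reports."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [2590, 1417, 717]                  # first three real prices printed above
mean_guess = [sum(y_true) / len(y_true)] * 3

perfect = r_squared(y_true, y_true)         # perfect predictions
baseline = r_squared(y_true, mean_guess)    # always predicting the mean
sgd_like = r_squared(y_true, [-12245857.98, -6725210.22, 96069.76])

print(perfect, baseline, sgd_like)
```

Predictions that are off by millions make the residual sum enormous, which is how a score in the negative millions comes about.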

Let's try support vector regression instead:

In [14]:
from sklearn import svm

clf = svm.SVR()

clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

-0.07932457566995432


In [15]:
for features, price in list(zip(X_test, y_test))[:10]:    # avoid overwriting X and y
    print(clf.predict([features])[0], price)

2406.378705071603 2590
2366.570110029726 1417
2347.8683109637223 717
2484.3424388462986 4242
2356.800050547668 449
2408.0331456923095 1715
2360.756982117727 830
2333.106742013439 561
2429.658043213829 2812
2304.441111496569 781


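The good news is that these are at least the right order of magnitude. We're in the same zipcode at least! Note, though, that every prediction lands between roughly 2,300 and 2,500 regardless of the real price, and the score is still slightly negative, so the model is doing no better than guessing a typical price. That took a while to run, too.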

One difference between svm.SVR() and the SGDRegressor according to the docs is that **svm.SVR() by default has an unlimited number of iterations**. Let's try that with the SGDRegressor to be fair, by setting it to something quite large. Apparently -1 isn't allowed! 10,000 it is!
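Rather than relying on the docs alone, the defaults can be read off the estimators themselves with `get_params()`. A quick sketch (the exact `SGDRegressor` default may differ between scikit-learn versions, so it is printed here rather than assumed):

```python
from sklearn import svm
from sklearn.linear_model import SGDRegressor

# Read the defaults from the estimators rather than from memory
svr_default = svm.SVR().get_params()["max_iter"]
sgd_default = SGDRegressor().get_params()["max_iter"]

print("SVR max_iter:", svr_default)            # -1 means no iteration limit
print("SGDRegressor max_iter:", sgd_default)
```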

In [16]:
clf = SGDRegressor(max_iter=10000)

clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

for features, price in list(zip(X_test, y_test))[:10]:    # avoid overwriting X and y
    print(clf.predict([features])[0], price)

-4161241.6638545925
8300670.370905042 2590
5920479.1872678995 1417
2911480.5972977877 717
6701813.947842121 4242
3003546.759865999 449
10744.382373571396 1715
-5701248.493950248 830
-6129497.080839157 561
8872396.805275798 2812
-208766.6729054451 781


Ok no, it just isn't gonna work unless we tweak more. Let's go back to the svm.SVR() model and see if we can improve it.

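A common first step for improving a model like this is to scale the features, since the columns here sit on very different ranges (carat around 0.2 to 0.9 in the rows shown, depth and table around 60). Let's try that.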

In [18]:
import sklearn
from sklearn import svm, preprocessing

df = sklearn.utils.shuffle(df)

X = df.drop("price", axis=1).values
X = preprocessing.scale(X)
y = df["price"].values

test_size = 200

X_train = X[:-test_size]
y_train = y[:-test_size]

X_test = X[-test_size:]
y_test = y[-test_size:]

clf = svm.SVR()

clf.fit(X_train, y_train)
print(clf.score(X_test,y_test))


for features, price in list(zip(X_test, y_test))[:10]:    # avoid overwriting X and y
    print(f"model predicts {clf.predict([features])[0]}, real value:{price}")

0.5398559788337425
model predicts 3751.216765183839, real value:3800
model predicts 1631.9844736941711, real value:971
model predicts 1233.991338145499, real value:1223
model predicts 515.5916089523771, real value:666
model predicts 5927.013726047266, real value:9706
model predicts 3898.759130651379, real value:3772
model predicts 6101.704751579964, real value:5098
model predicts 3186.221456373053, real value:3135
model predicts 848.4182069334834, real value:984
model predicts 4872.862974219195, real value:5339


This improved our score considerably, from about -0.08 to about 0.54, so that's nice. We could keep tweaking things and probably improve this model further, but that's not quite the intention of this series, so this will do for now.
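One such tweak, as a hedged sketch: stochastic gradient descent is known to be sensitive to feature scale, so the same scaling idea may also rescue `SGDRegressor`. The example below uses synthetic data from `make_regression` with deliberately mismatched column scales (an assumption for illustration, not the diamonds set), so it says nothing about the score you would get on the real data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with deliberately mismatched feature scales (not the diamonds set)
X_syn, y_syn = make_regression(n_samples=1000, n_features=3, noise=5.0, random_state=0)
X_syn = X_syn * np.array([1.0, 100.0, 10000.0])

X_tr, X_te = X_syn[:-200], X_syn[-200:]
y_tr, y_te = y_syn[:-200], y_syn[-200:]

# The scaler is fitted on the training rows only, then applied before SGD
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, random_state=0))
model.fit(X_tr, y_tr)

sgd_score = model.score(X_te, y_te)
print(sgd_score)
```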

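Because `preprocessing.scale` computes its mean and standard deviation from whatever array it is given, any new diamond data you got would need to be combined into your main dataset, scaled, then predicted from. A scaler object that is fitted once on the training data and reused, such as scikit-learn's `StandardScaler`, avoids that recombination step.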
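As a sketch of what handling new data could look like without recombining: wrapping a `StandardScaler` and the SVR in a pipeline stores the scaling statistics at fit time and reapplies them at predict time. The arrays below are random toy data (`X_old`, `y_old`, and `X_new` are illustrative names, not part of the notebook):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_old = rng.normal(loc=50.0, scale=10.0, size=(300, 4))   # toy "main dataset"
y_old = X_old @ np.array([3.0, -2.0, 1.0, 0.5])

# fit() learns the scaling statistics from X_old and trains the SVR on scaled data
model = make_pipeline(StandardScaler(), SVR())
model.fit(X_old, y_old)

# New rows are scaled with the stored statistics; no recombining with X_old
X_new = rng.normal(loc=50.0, scale=10.0, size=(5, 4))     # toy "new diamonds"
preds = model.predict(X_new)
print(preds)

scaler = model.named_steps["standardscaler"]
print(scaler.mean_)   # per-feature means learned from X_old only
```

New rows would still need the same string-to-number mappings applied before being passed to `predict`.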