### we will be predicting the price of diamonds based on their features (carat, cut, color, clarity, etc.)

### put all these features into a regression model using scikit-learn!

### rule of thumb: classical ML wants more than 10,000 samples, deep learning more than 100,000.

### sklearn cheatsheet for choosing the right estimator!: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

### in our case we have about 50k rows and we are predicting a quantity, so the cheatsheet points us to SVR (support vector regression, i.e. the SVM for REGRESSION), starting with the LINEAR kernel

In [1]:
import pandas as pd

# raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r"D:\d Documents\datasets for practice\diamonds.csv", index_col=0)
df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


## things to take note in ML: it's easy to cheat, even when you are trying not to!

### in any case, we need to convert the text columns (cut, color, clarity) to numbers before we can fit a model

In [2]:
df['cut'].unique()



array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)

### as you can see, 'cut' is categorical and the categories can be ordered, i.e. "Good" is better than "Fair".

### we could just convert it to categorical using the code below, but this assigns codes in alphabetical order of the labels (Fair=0, Good=1, Ideal=2, Premium=3, Very Good=4), which is not very good if we are trying to do linear regression (a better cut should get a higher number so it can get a positive coeff)

### let's create a dict instead which links the names to increasing values. WE GET THE ORDERING BY LOOKING AT KAGGLE, e.g. clarity I3: 1, I2: 2, etc.

In [3]:
# df['cut'].astype('category').cat.codes
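### to see the ordering problem, here is a small sketch on a toy Series (not the real dataframe). The quality order in `quality_order` is an assumption taken from the dict in the next cell; an ordered `pd.Categorical` is just one alternative way to get the same numbers.

```python
import pandas as pd

# Toy column with the five cut labels, in the order unique() printed them.
cuts = pd.Series(["Ideal", "Premium", "Good", "Very Good", "Fair"])

# Default category codes follow the sorted (alphabetical) label order, not quality.
default_codes = cuts.astype("category").cat.codes.tolist()

# Explicit ordering from worst to best (assumed, same order as cut_class_dict).
quality_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
ordered = pd.Categorical(cuts, categories=quality_order, ordered=True)

# Codes start at 0, so add 1 to get the 1..5 scale used in the dict.
ordered_codes = (ordered.codes + 1).tolist()

print(default_codes)
print(ordered_codes)
```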

In [4]:
cut_class_dict = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}

clarity_dict = {"I3": 1, "I2": 2, "I1": 3, "SI2": 4, "SI1": 5, "VS2": 6, "VS1": 7, "VVS2": 8, "VVS1": 9, "IF": 10, "FL": 11}

color_dict = {"J": 1,"I": 2,"H": 3,"G": 4,"F": 5,"E": 6,"D": 7}

## ONLY RUN THIS CELL ONCE! The columns are overwritten in place, so a second run
## maps the already-numeric values through the dicts and turns them all into NaN.
## copy() does not protect against this, so it is dropped here.
df['cut'] = df['cut'].map(cut_class_dict)
df['clarity'] = df['clarity'].map(clarity_dict)
df['color'] = df['color'].map(color_dict)

df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,5,6,4,61.5,55.0,326,3.95,3.98,2.43
2,0.21,4,6,5,59.8,61.0,326,3.89,3.84,2.31
3,0.23,2,6,7,56.9,65.0,327,4.05,4.07,2.31
4,0.29,4,2,6,62.4,58.0,334,4.2,4.23,2.63
5,0.31,2,1,4,63.3,58.0,335,4.34,4.35,2.75
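
### a small sketch of why the mapping cell must not be run twice, using a toy frame. `map_if_labels` is a hypothetical helper (not part of pandas) that only maps while the column still holds the original labels.

```python
import pandas as pd

cut_class_dict = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}
toy = pd.DataFrame({"cut": ["Ideal", "Fair", "Good"]})

first_run = toy["cut"].map(cut_class_dict)   # labels -> numbers
second_run = first_run.map(cut_class_dict)   # numbers are not dict keys -> NaN


def map_if_labels(series, mapping):
    """Hypothetical guard: map only if every value is still a key of the dict."""
    if series.isin(list(mapping)).all():
        return series.map(mapping)
    return series


safe_once = map_if_labels(toy["cut"], cut_class_dict)
safe_twice = map_if_labels(safe_once, cut_class_dict)  # unchanged on the re-run

print(second_run.isna().all())
print(safe_twice.tolist())
```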


In [5]:
import sklearn
from sklearn import svm, preprocessing

## ALWAYS START BY SHUFFLING THE DATA. It might be ordered even if you don't think it is. In this case, it does appear that this data set is sorted by price

### also imagine if you did not set index_col when you read in diamonds.csv! the row number would then be treated as a feature, and because the file is sorted by price it correlates with price (row 1 is among the lowest prices). watch out for that.

### basically the row number would leak the price to the model

In [6]:
df = sklearn.utils.shuffle(df)

x = df.drop("price", axis = 1).values
y = df['price'].values
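
### a minimal sketch of the leak on made-up data: the prices are random numbers sorted ascending, and `row_id` stands in for the unnamed index column. Passing `random_state` to `shuffle` is optional and only there to make the shuffle repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.utils import shuffle

# Made-up prices, sorted so that they look like a file ordered by price.
rng = np.random.default_rng(0)
prices = np.sort(rng.integers(300, 19000, size=1000))
toy = pd.DataFrame({"row_id": np.arange(1, 1001), "price": prices})

# The row number alone almost determines the price.
leak_corr = toy["row_id"].corr(toy["price"])

# Drop the leaking column, then shuffle the rows reproducibly.
shuffled = shuffle(toy.drop(columns="row_id"), random_state=0)

print(round(leak_corr, 3))
print(shuffled["price"].is_monotonic_increasing)
```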

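### let's also scale the x vals using preprocessing. preprocessing.scale standardizes each column to mean 0 and standard deviation 1, so instead of the cuts having values from 1 to 5 they end up roughly between -1.5 and 1.5.

### this puts all features on a comparable scale, which SVMs are sensitive to and which usually helps them fit faster. you can comment out the preprocessing to see if it actually makes a huge difference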

In [7]:
x = preprocessing.scale(x)
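
### a quick sketch of what the scaling does, on a toy column with the values 1 to 5. The second half is an assumption about how you might avoid "cheating": the notebook scales everything at once, whereas `StandardScaler` fitted on the training rows only keeps the test rows out of the mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, scale

# Toy feature column: the five cut levels.
cut_levels = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# scale() standardizes: mean 0 and standard deviation 1 per column.
scaled = scale(cut_levels)

# Alternative: learn the statistics on the training rows only,
# then apply the same transform to the held-out rows.
train, test = cut_levels[:4], cut_levels[4:]
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(scaled.ravel())
print(scaler.mean_, test_scaled.ravel())
```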

## separating into train and test data

In [8]:
test_size = 200

x_train = x[:-test_size] #all the way up to the last 200
y_train = y[:-test_size]

x_test = x[-test_size:] # the last 200
y_test = y[-test_size:]

clf = svm.SVR(kernel="linear")
clf.fit(x_train, y_train) # training ie fitting the data

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
    gamma='auto_deprecated', kernel='linear', max_iter=-1, shrinking=True,
    tol=0.001, verbose=False)

In [9]:
clf.score(x_test, y_test) # R^2: 1 is perfect, 0 is no better than always predicting the mean, and it can go negative

0.8655622526391615
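
### for a regressor, `score` returns R squared. A minimal sketch of the formula, checked against `sklearn.metrics.r2_score`; the four pairs are rounded values borrowed from the prediction listing further down and are only for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Four (prediction, actual) pairs, rounded, for illustration only.
y_pred = np.array([751.9, -254.5, 1934.3, 5490.5])
y_true = np.array([1069.0, 477.0, 1728.0, 7035.0])

# R^2 = 1 - (sum of squared residuals) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

# Always predicting the mean of y_true gives exactly 0.
baseline_r2 = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(manual_r2, r2_score(y_true, y_pred), baseline_r2)
```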

### classification is much more clear cut than regression: a prediction is either in the correct class or it is not. regression is a bit more vague, because a prediction can be close without being exactly right. let's try and look at the data

In [25]:
#for features, target in zip(x_test, y_test):
 #   print(clf.predict([features])[0], target)

list(zip(clf.predict(x_test), y_test))

[(751.9170683611669, 1069),
 (-254.52774027894156, 477),
 (1934.2725126734936, 1728),
 (5490.475624454063, 7035),
 (3112.005133033288, 2422),
 (5330.47444654415, 6425),
 (546.2661289376388, 645),
 (852.7532708098515, 1080),
 (979.8292868542112, 1399),
 (4238.207338622738, 3080),
 (6036.502669136721, 7038),
 (887.2620781525402, 1087),
 (182.00209785769948, 834),
 (1643.5989046183959, 1243),
 (703.9532361224301, 678),
 (1519.880361364359, 1148),
 (5760.631929781412, 6232),
 (1006.7936384752998, 998),
 (1666.2286179953453, 1152),
 (6201.178746330784, 3788),
 (3522.373658809984, 3209),
 (4053.4472471632466, 3282),
 (1048.880689557986, 990),
 (224.54333453283562, 802),
 (2132.792109875146, 2064),
 (1156.8371413376885, 1076),
 (5253.480023538255, 6645),
 (2470.8165092097634, 2039),
 (5102.931378092467, 7135),
 (26.816401289768237, 507),
 (2044.5888321823084, 1763),
 (1515.9381501792673, 1417),
 (9075.57540459666, 12028),
 (3309.0787341603373, 3077),
 (32.77586966260333, 579),
 (2859.34824495

### we are in the general ballpark, but some of them are weird, e.g. for (-254, 477) the model predicts a negative price, so it wants you to pay people to take the diamond off you
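
### a small sketch for inspecting the odd predictions, using four rounded pairs from the listing above. Clipping at zero is only a patch and is an assumption here, not something this notebook does; the mean absolute error is added because it reads directly in price units.

```python
import numpy as np

# (prediction, actual) pairs, rounded from the output above.
pairs = [(751.9, 1069), (-254.5, 477), (1934.3, 1728), (26.8, 507)]
pred = np.array([p for p, _ in pairs])
actual = np.array([a for _, a in pairs], dtype=float)

# How many predictions are impossible (negative) prices?
n_negative = int((pred < 0).sum())

# Crude fix: never predict below zero.
clipped = np.clip(pred, 0, None)

# Average size of the error, in the same units as the price.
mae = np.mean(np.abs(pred - actual))

print(n_negative, clipped, mae)
```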

In [None]:
clf = svm.SVR(kernel="rbf")
clf.fit(x_train, y_train)

In [None]:
clf.score(x_test, y_test) # R^2: 1 is perfect, 0 is no better than always predicting the mean, and it can go negative

In [None]:
list(zip(clf.predict(x_test), y_test))

### generally for ML, we often use an ensemble of models (i.e. a lot of models, regressors in our case). they all make predictions and you take the average as the final prediction. Also, if a model gives impossible numbers, i.e. negative prices, we throw that model out

### we can even make a voting ensemble in sklearn (VotingRegressor for regression) to try to improve the R squared
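
### a minimal sketch of such a voting ensemble on synthetic data (not the diamonds set). The three member models and their settings are arbitrary picks for illustration, and `VotingRegressor` needs scikit-learn 0.21 or newer.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# Synthetic regression data: a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 1000 + 500 * X[:, 0] + 200 * X[:, 1] + rng.normal(scale=50, size=300)

# Same style of split as above: hold out the last rows for testing.
X_train, X_test = X[:-50], X[-50:]
y_train, y_test = y[:-50], y[-50:]

# Each member is fitted separately; the ensemble averages their predictions.
ensemble = VotingRegressor([
    ("svr", SVR(kernel="linear", C=100.0)),
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
])
ensemble.fit(X_train, y_train)

ensemble_r2 = ensemble.score(X_test, y_test)
member_mean = np.mean([m.predict(X_test) for m in ensemble.estimators_], axis=0)

print(ensemble_r2)
```

### an average is not guaranteed to beat the best single member, so compare the ensemble score against each model's own score before keeping it.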