# Part 6: ML with Pandas

[text tutorial](https://pythonprogramming.net/machine-learning-python3-pandas-data-analysis/)

[Kaggle dataset page](https://www.kaggle.com/shivam2503/diamonds)

We are going to build a model that predicts the price of a diamond from its other attributes, such as depth, using a CSV dataset. Because price is a continuous value, this is a regression model, not a classifier.

### downloading/uploading dataset to colab

In [None]:
# unzipping zipfile
from zipfile import ZipFile

file_name = 'diamonds.zip'
with ZipFile(file_name, 'r') as zf:  # not named `zip`, which would shadow the built-in zip()
  zf.extractall()

In [None]:
# getting cwd
import os 

cwd = os.getcwd()
files = os.listdir(cwd)
print(cwd, files)

### deciding the type of model

consult [this ML map](https://scikit-learn.org/stable/tutorial/machine_learning_map/)

based on the map and the diamond dataset, we chose an SVR (support vector regression) model

## pre-processing dataset

In [1]:
import pandas as pd

# reading dataset from csv file
# the first column of the CSV already holds a row index, so we use it instead of adding another index column
df = pd.read_csv('diamonds.csv', index_col=0)
df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


The 'price' column is the one we want to predict.

Therefore, we need to separate the 'price' column from the rest of the dataset before training.

We also need to convert the text values (cut, color, clarity) into meaningful numbers before feeding them into our model.

In [2]:
df['cut'].unique()

array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)

### automatic assignment

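Usually this method is enough for ML. The codes are assigned in alphabetical order of the labels, so they carry no information about quality.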

In [3]:
# we can use this code to automatically assign numbers
df['cut'].astype('category').cat.codes

1        2
2        3
3        1
4        3
5        1
6        4
7        4
8        4
9        0
10       4
11       1
12       2
13       3
14       2
15       3
16       3
17       2
18       1
19       1
20       4
21       1
22       4
23       4
24       4
25       4
26       4
27       3
28       4
29       4
30       4
        ..
53911    3
53912    3
53913    3
53914    1
53915    1
53916    2
53917    1
53918    4
53919    3
53920    2
53921    4
53922    4
53923    4
53924    2
53925    2
53926    2
53927    2
53928    1
53929    3
53930    2
53931    3
53932    3
53933    4
53934    4
53935    3
53936    2
53937    1
53938    4
53939    3
53940    2
Length: 53940, dtype: int8
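To see which code was given to which label, the category list can be inspected directly. The following is a minimal sketch on a tiny made-up series (the variable names `cuts` and `code_to_label` are illustrative and not part of the original notebook):

```python
import pandas as pd

# Tiny made-up sample using the same labels as the 'cut' column
cuts = pd.Series(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'])

cat = cuts.astype('category')

# For string labels the categories are sorted alphabetically;
# a label's code is its position in that sorted list
code_to_label = dict(enumerate(cat.cat.categories))
print(code_to_label)

codes = cat.cat.codes.tolist()
print(codes)
```

This agrees with the output above, where 'Ideal' became 2, 'Premium' became 3 and 'Good' became 1.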

### manual assignment

Because these labels ('Ideal', 'Premium', 'Good', 'Very Good', 'Fair') have a natural order, we assign the numbers manually.

In this example, the numbers get higher as the quality goes up.

### making dictionary

In [4]:
cut_class_dict = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}

In [5]:
# making dicts for other labels
clarity_dict = {"I3": 1, "I2": 2, "I1": 3, "SI2": 4, "SI1": 5, "VS2": 6, "VS1": 7, "VVS2": 8, "VVS1": 9, "IF": 10, "FL": 11}
color_dict = {"J": 1,"I": 2,"H": 3,"G": 4,"F": 5,"E": 6,"D": 7}

### mapping based on dictionary

In [6]:
df['cut'] = df['cut'].map(cut_class_dict)
df['clarity'] = df['clarity'].map(clarity_dict)
df['color'] = df['color'].map(color_dict)
df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,5,6,4,61.5,55.0,326,3.95,3.98,2.43
2,0.21,4,6,5,59.8,61.0,326,3.89,3.84,2.31
3,0.23,2,6,7,56.9,65.0,327,4.05,4.07,2.31
4,0.29,4,2,6,62.4,58.0,334,4.2,4.23,2.63
5,0.31,2,1,4,63.3,58.0,335,4.34,4.35,2.75


Now our dataset contains only numerical values and is ready to be fed into our ML model.
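One thing worth checking after a dictionary mapping: `Series.map` turns any label that is missing from the dictionary into `NaN` rather than raising an error. A minimal sketch of such a check, using a small made-up frame (`sample` and the label 'Excellent' are assumptions for illustration only):

```python
import pandas as pd

# Made-up sample; 'Excellent' is deliberately absent from the dictionary
sample = pd.DataFrame({
    'cut': ['Ideal', 'Good', 'Excellent'],
    'carat': [0.23, 0.31, 0.40],
})
cut_class_dict = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}

sample['cut'] = sample['cut'].map(cut_class_dict)

# Count missing values per column; anything above 0 means an unmapped label
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Confirm that every column is now numeric
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in sample.dtypes)
print(all_numeric)
```

On the real `df`, running `df.isna().sum()` after the mapping cell should report 0 for every column if the dictionaries cover all labels.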

## building the model

By consulting the [ML map](https://scikit-learn.org/stable/tutorial/machine_learning_map/), we decided to use an SVR model.

Follow [this tutorial](https://scikit-learn.org/stable/modules/svm.html#regression) when building the model.

In [7]:
import sklearn
from sklearn import svm, preprocessing

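### shuffling dataset

It is a good idea to shuffle the dataset before splitting it into training and test sets, so that neither set depends on the order in which the rows are stored in the file.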

In [8]:
df = sklearn.utils.shuffle(df)
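The shuffle above gives a different row order on every run, which is why the outputs below will differ if the notebook is re-executed. If reproducible results are wanted, `sklearn.utils.shuffle` accepts a `random_state` argument. A minimal sketch on a made-up frame (`sample` and the seed value 0 are arbitrary choices for illustration):

```python
import pandas as pd
from sklearn.utils import shuffle

# Made-up sample standing in for the diamonds dataframe
sample = pd.DataFrame({'carat': [0.2, 0.3, 0.4, 0.5, 0.6],
                       'price': [300, 400, 500, 600, 700]})

# The same seed gives the same row order every time
first = shuffle(sample, random_state=0)
second = shuffle(sample, random_state=0)

same_order = first.index.equals(second.index)
print(same_order)
print(first)
```

Rows are moved as a whole, so each price stays with its own carat value and original index.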

### checking shape of X and y (features and labels)

In [9]:
# dropping price column
df.drop('price', axis=1).head()

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z
1872,0.82,5,2,7,61.6,56.0,6.05,6.01,3.72
20663,1.22,4,7,6,59.6,60.0,6.97,6.95,4.15
48213,0.76,3,1,6,59.9,59.0,5.92,5.94,3.55
39688,0.4,4,5,8,62.2,59.0,4.65,4.7,2.91
41333,0.5,2,5,5,59.6,61.0,5.12,5.19,3.07


In [10]:
# converting dataframe into numpy array by adding .values
df.drop('price', axis=1).values

array([[0.82, 5.  , 2.  , ..., 6.05, 6.01, 3.72],
       [1.22, 4.  , 7.  , ..., 6.97, 6.95, 4.15],
       [0.76, 3.  , 1.  , ..., 5.92, 5.94, 3.55],
       ...,
       [2.02, 3.  , 4.  , ..., 8.05, 7.96, 5.07],
       [1.51, 5.  , 5.  , ..., 7.35, 7.41, 4.55],
       [0.31, 5.  , 2.  , ..., 4.38, 4.39, 2.71]])

In [11]:
df['price'].head()

1872     3071
20663    8950
48213    1949
39688    1088
41333    1214
Name: price, dtype: int64

### setting X and y (features and labels)

In [12]:
# X holds the features, such as cut, clarity and color
# we can create X just by dropping the price column (axis=1)
X = df.drop('price', axis=1).values

# y holds the label, which is price
y = df['price'].values

In [13]:
X

array([[0.82, 5.  , 2.  , ..., 6.05, 6.01, 3.72],
       [1.22, 4.  , 7.  , ..., 6.97, 6.95, 4.15],
       [0.76, 3.  , 1.  , ..., 5.92, 5.94, 3.55],
       ...,
       [2.02, 3.  , 4.  , ..., 8.05, 7.96, 5.07],
       [1.51, 5.  , 5.  , ..., 7.35, 7.41, 4.55],
       [0.31, 5.  , 2.  , ..., 4.38, 4.39, 2.71]])

In [14]:
X.shape

(53940, 9)

In [15]:
y.shape

(53940,)

### preprocessing X for efficiency

In [16]:
preprocessed_X = preprocessing.scale(X)

In [17]:
preprocessed_X

array([[ 0.04653994,  0.98147332, -1.41427211, ...,  0.28423685,
         0.2411945 ,  0.25686297],
       [ 0.8904096 ,  0.08588908,  1.52502147, ...,  1.10438369,
         1.06422241,  0.86619369],
       [-0.08004051, -0.80969515, -2.00213083, ...,  0.16834654,
         0.17990519,  0.01596478],
       ...,
       [ 2.57814893, -0.80969515, -0.23855468, ...,  2.06716476,
         1.94853963,  2.16987802],
       [ 1.50221511,  0.98147332,  0.34930404, ...,  1.44313999,
         1.46698075,  1.43301296],
       [-1.02939387,  0.98147332, -1.41427211, ..., -1.20450795,
        -1.1772153 , -1.1743557 ]])
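What `preprocessing.scale` does here is standardize each column: it subtracts the column mean and divides by the column standard deviation, so every feature ends up with mean 0 and standard deviation 1. A minimal sketch comparing it with the manual calculation on a made-up array (`X_demo` is illustrative, not the diamonds data):

```python
import numpy as np
from sklearn import preprocessing

# Made-up features on very different scales (e.g. carat-like and table-like values)
X_demo = np.array([[0.2, 55.0],
                   [0.4, 60.0],
                   [0.9, 65.0]])

scaled = preprocessing.scale(X_demo)

# Same result by hand: per-column mean and (population) standard deviation
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Putting all features on a comparable scale matters for SVR, because the model works with distances between samples and a feature with large raw values would otherwise dominate.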

### setting X and y (final ver)

In [18]:
# X holds the features, such as cut, clarity and color
# we can create X just by dropping the price column (axis=1)
X = df.drop('price', axis=1).values
X = preprocessing.scale(X)

# y holds the label, which is price
y = df['price'].values

In [19]:
# splitting dataset into training set and testing set
test_size = 200

X_train = X[:-test_size]
y_train = y[:-test_size]

X_test = X[-test_size:]
y_test = y[-test_size:]
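The slicing above holds out the last 200 rows of the already shuffled data. As an alternative, scikit-learn's `train_test_split` shuffles and splits in one call. A minimal sketch on made-up arrays (`X_demo`, `y_demo` and the seed are assumptions for illustration; the notebook itself uses the manual slices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: row i of X_demo is [2*i, 2*i + 1], and its label is i
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# An integer test_size means an absolute number of test samples
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=3, random_state=0)

print(X_tr.shape, X_te.shape)
print(y_te)
```

Features and labels are split with the same permutation, so each row of `X_te` still lines up with its entry in `y_te`.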

In [20]:
X_train

array([[ 0.04653994,  0.98147332, -1.41427211, ...,  0.28423685,
         0.2411945 ,  0.25686297],
       [ 0.8904096 ,  0.08588908,  1.52502147, ...,  1.10438369,
         1.06422241,  0.86619369],
       [-0.08004051, -0.80969515, -2.00213083, ...,  0.16834654,
         0.17990519,  0.01596478],
       ...,
       [-1.02939387,  0.98147332,  1.52502147, ..., -1.26691043,
        -1.20348215, -1.1743557 ],
       [-1.00829713,  0.08588908,  1.52502147, ..., -1.18667867,
        -1.20348215, -1.10350329],
       [ 1.60769882,  0.98147332, -0.8264134 , ...,  1.57685958,
         1.57204814,  1.41884248]])

In [21]:
X_train.shape

(53740, 9)

In [22]:
y_test

array([  827,  2728,  4566,  1343, 16232,   826,  1979, 10993,  1008,
         544,  3219,   472,  1031,  1438,  8040,   631,  3893,  4470,
        5279,  2041,  1906, 16297,  1229, 11713,  1240,  1905,   628,
        1235,   698,  1259,  4900,  1183,  6230, 12931,  5510,   492,
       11843,  2206, 16715,   433,   872,   993,  2872, 15581,  2569,
        6785,   616,   574,  1018,  5183,  2075,  1076,  1454,  2192,
       12229,  5167,   706, 18102,   902,   507,  4672,  9165, 11946,
        1257,  2780,  1105,   581,  2693,  4673, 13132,  5050,  1866,
        9777,  1237,  3052, 12872,  3378, 11360,   885,  3096,   402,
        1574,  2288, 18528,  1679,   522, 10082,  4118,   642,  4015,
        3288,  5378,  4606,  2585,   953,  1822,  3332,  1046,   460,
        1050,   829,   914,   899,   469, 13006,   551,   827,   956,
        1295,  2309,  3910, 11580, 10863,  1080,   942, 10439,  2340,
        3082,   730,  1845,  1375,  9009, 11970,   756,  1998,   702,
         969,   746,

In [23]:
y_test.shape

(200,)

### defining the model (a regressor, although the variable is named clf)

In [24]:
clf = svm.SVR(kernel='linear')

### fitting (training)

In [25]:
clf.fit(X_train, y_train)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
  gamma='auto_deprecated', kernel='linear', max_iter=-1, shrinking=True,
  tol=0.001, verbose=False)

### score (R²); 1 is a perfect fit, 0 or below is bad

In [26]:
clf.score(X_test, y_test)

0.8352477773999075
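For a regressor such as SVR, `score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. A minimal sketch that reproduces the score by hand on made-up data (`X_demo`, `y_demo` and `model` are illustrative names, not part of the notebook):

```python
import numpy as np
from sklearn import svm

# Made-up regression problem
rng = np.random.RandomState(0)
X_demo = rng.rand(50, 2)
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1]

model = svm.SVR(kernel='linear').fit(X_demo, y_demo)
pred = model.predict(X_demo)

# R^2 by hand
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_demo, y_demo))

# A model that always predicts the mean scores exactly 0
baseline = np.full_like(y_demo, y_demo.mean())
r2_baseline = 1 - np.sum((y_demo - baseline) ** 2) / ss_tot
print(r2_baseline)
```

A score of 0 therefore means "no better than always predicting the average price", and the score can go below 0 for a model that does worse than that.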

### printing actual prediction

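If `zip` raises `TypeError: 'zip' object is not callable` here, the name `zip` was reassigned earlier in the notebook (for example to a `ZipFile` object); see [stack overflow](https://stackoverflow.com/questions/47583259/typeerror-zip-object-is-not-callable-in-python-3-x)

In [28]:
# loop variables are not named X and y, so the feature and label arrays are not overwritten
for features, actual in zip(X_test, y_test):
  print(f'Model: {clf.predict([features])[0]}, Actual: {actual}')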

Model: 1028.3703563519211, Actual: 827
Model: 2956.1965773950255, Actual: 2728
Model: 4306.8625152456025, Actual: 4566
Model: 1405.426113106731, Actual: 1343
Model: 11347.729363355142, Actual: 16232
Model: 1137.027986016236, Actual: 826
Model: 2494.491689207516, Actual: 1979
Model: 9610.269548297052, Actual: 10993
Model: 921.925536759185, Actual: 1008
Model: -44.015009334172646, Actual: 544
Model: 3251.3632057082878, Actual: 3219
Model: 377.2941163853561, Actual: 472
Model: 1257.7485632139947, Actual: 1031
Model: 1435.6306199907258, Actual: 1438
Model: 18605.197297013714, Actual: 8040
Model: 149.00646130353198, Actual: 631
Model: 4070.1807180906562, Actual: 3893
Model: 4375.911400228481, Actual: 4470
Model: 4865.792759496645, Actual: 5279
Model: 2290.315824349196, Actual: 2041
Model: 2193.1683589942504, Actual: 1906
Model: 9721.05483700602, Actual: 16297
Model: 1786.4319695666838, Actual: 1229
Model: 8679.83386949282, Actual: 11713
Model: 1360.6170487180566, Actual: 1240
Model: 2325.19

Despite the high score, our model sometimes predicts a negative price, which is clearly not a good prediction.

To improve the predictions, we can try other models or kernels, for example the RBF kernel.

In [29]:
clf = svm.SVR(kernel='rbf')
clf.fit(X_train, y_train)
clf.score(X_test, y_test)



0.5850129729719094

In [30]:
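# loop variables are not named X and y, so the feature and label arrays are not overwritten
for features, actual in zip(X_test, y_test):
  print(f'Model: {clf.predict([features])[0]}, Actual: {actual}')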

Model: 1011.0699644926517, Actual: 827
Model: 2464.4423393395086, Actual: 2728
Model: 3619.812760838816, Actual: 4566
Model: 1155.1006749193662, Actual: 1343
Model: 7085.168485318277, Actual: 16232
Model: 1019.5043732883287, Actual: 826
Model: 2149.8618151299725, Actual: 1979
Model: 7824.002266206819, Actual: 10993
Model: 2297.5115428833337, Actual: 1008
Model: 655.741738473354, Actual: 544
Model: 3145.566001343646, Actual: 3219
Model: 2002.1278919380793, Actual: 472
Model: 1050.7609254772028, Actual: 1031
Model: 1448.4108338364347, Actual: 1438
Model: 4318.910489690464, Actual: 8040
Model: 1052.3876135435553, Actual: 631
Model: 4229.585600583269, Actual: 3893
Model: 4259.9620406198665, Actual: 4470
Model: 4465.888390676475, Actual: 5279
Model: 1898.7272467224077, Actual: 2041
Model: 2207.1086636083346, Actual: 1906
Model: 7553.97521540503, Actual: 16297
Model: 1462.260303405384, Actual: 1229
Model: 5877.932998144541, Actual: 11713
Model: 1346.4882360565705, Actual: 1240
Model: 2714.07
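
Reading the predictions line by line makes the two kernels hard to compare. One possible way to summarize them is to predict the whole test set in a single call and compute a few numbers per kernel. The following is only a sketch on made-up data (`X_demo`, `y_demo`, the `results` dictionary and the choice of mean absolute error are assumptions, not part of the original notebook):

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import mean_absolute_error

# Made-up "price" data: 200 samples, 3 standardized features
rng = np.random.RandomState(0)
X_demo = rng.randn(200, 3)
y_demo = 1000 + 500 * X_demo[:, 0] + rng.randn(200) * 10

# Same style of split as in the notebook: hold out the last rows
X_tr, X_te = X_demo[:-50], X_demo[-50:]
y_tr, y_te = y_demo[:-50], y_demo[-50:]

results = {}
for kernel in ('linear', 'rbf'):
    model = svm.SVR(kernel=kernel).fit(X_tr, y_tr)
    pred = model.predict(X_te)  # predict the whole test set at once
    results[kernel] = {
        'r2': model.score(X_te, y_te),
        'mae': mean_absolute_error(y_te, pred),
        'negative': int((pred < 0).sum()),  # number of negative predictions
    }
    print(kernel, results[kernel])
```

With the real data, replacing the made-up arrays with `X_train`, `y_train`, `X_test` and `y_test` would give the same kind of summary for the two models trained above.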