### Linear Regression Task

##### Diamond Price Prediction

- price: price in US dollars ($326 ~ $18,823)
- carat: weight of the diamond (0.2 ~ 5.01)
- cut: cut quality (Fair, Good, Very Good, Premium, Ideal)
- color: diamond color, from J (worst) to D (best)
- clarity: a measure of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- x: length in mm (0 ~ 10.74)
- y: width in mm (0 ~ 58.9)
- z: depth in mm (0 ~ 31.8)
- depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43 ~ 79)
- table: width of the top of the diamond relative to its widest point (43 ~ 95)
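
The depth formula in the list can be checked by hand. The following is a minimal sketch in plain Python: the helper name `depth_percent` is hypothetical, and multiplying by 100 is an assumption inferred from the listed 43 ~ 79 range. The sample values are the x, y and z of the first row of the dataset.

```python
def depth_percent(x: float, y: float, z: float) -> float:
    """Total depth percentage: z divided by the mean of x and y, times 100."""
    if x + y == 0:
        # The listed minimum of x and y is 0, so guard against division by zero.
        return float("nan")
    return 100 * 2 * z / (x + y)


# x, y, z of the first row; the recorded depth for that row is 61.5.
computed = depth_percent(3.95, 3.98, 2.43)
print(round(computed, 2))
```

The computed value is close to, but not exactly, the recorded depth, which may simply reflect rounding in the recorded measurements.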

In [2]:
import pandas as pd
dia_df = pd.read_csv('./datasets/diamond.csv')
dia_df.info()
dia_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53943 entries, 0 to 53942
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  53943 non-null  int64  
 1   carat       53943 non-null  float64
 2   cut         53943 non-null  object 
 3   color       53943 non-null  object 
 4   clarity     53943 non-null  object 
 5   depth       53943 non-null  float64
 6   table       53943 non-null  float64
 7   price       53943 non-null  int64  
 8   x           53943 non-null  float64
 9   y           53943 non-null  float64
 10  z           53943 non-null  float64
dtypes: float64(6), int64(2), object(3)
memory usage: 4.5+ MB


,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,4,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...,...
53938,53939,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74
53939,53940,0.75,Ideal,D,SI2,62.2,55.0,2757,5.83,5.87,3.64
53940,53941,0.71,Premium,E,SI1,60.5,55.0,2756,5.79,5.74,3.49
53941,53942,0.71,Premium,F,SI1,59.8,62.0,2756,5.74,5.73,3.43


In [6]:
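# Drop the row-ID column first: while it is present, no two rows can be duplicates.
dia_df = dia_df.drop('Unnamed: 0', axis=1)
# Print both checks; a notebook cell only echoes its last expression.
print(dia_df.duplicated().sum())
print(dia_df.isna().sum())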

In [8]:
from sklearn.preprocessing import LabelEncoder

# Integer-encode the three categorical columns.
# Note: LabelEncoder assigns codes in alphabetical order, not in quality order.
columns = ['cut', 'color', 'clarity']
for column in columns:
    encoder = LabelEncoder()
    dia_df[column] = encoder.fit_transform(dia_df[column])
    print(encoder.classes_)

['Fair' 'Good' 'Ideal' 'Premium' 'Very Good']
['D' 'E' 'F' 'G' 'H' 'I' 'J']
['I1' 'IF' 'SI1' 'SI2' 'VS1' 'VS2' 'VVS1' 'VVS2']
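
The printed `classes_` arrays show that `LabelEncoder` numbers the categories alphabetically, so `'Ideal'` (code 2) ends up below `'Very Good'` (code 4) even though it is the best cut. If the quality order should be preserved for a linear model, one option is an explicit mapping. The sketch below is an alternative, not the encoding used in this notebook; the `sample` frame is a tiny hypothetical stand-in for `dia_df`, and the orders are taken from the column descriptions at the top.

```python
import pandas as pd

# Quality orders from the column descriptions, worst -> best.
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
color_order = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']

# Tiny hypothetical sample standing in for dia_df.
sample = pd.DataFrame({
    'cut': ['Ideal', 'Premium', 'Good'],
    'color': ['E', 'I', 'J'],
    'clarity': ['SI2', 'VS2', 'IF'],
})

for column, order in [('cut', cut_order), ('color', color_order), ('clarity', clarity_order)]:
    # Map each level to its position in the quality order (0 = worst).
    mapping = {level: code for code, level in enumerate(order)}
    sample[column] = sample[column].map(mapping)

print(sample)
```

With this mapping a larger code always means a better grade, which gives the sign of a fitted coefficient a direct interpretation.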


In [9]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

features, targets = dia_df.drop("price",axis=1), dia_df.price

X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.2, random_state=124)

# y_train = np.log1p(y_train)

linear_regression = LinearRegression()
linear_regression.fit(X_train, y_train)

# Slopes (weights)
print(linear_regression.coef_)
# Intercept (constant)
print(linear_regression.intercept_)

[11255.84776995    64.7758201   -270.06641511   283.52613231
  -158.77385376   -91.89196942 -1435.09696388   225.64314197
   -54.94888592]
16614.448411892306
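
A fitted linear regression predicts with nothing more than the intercept plus the weighted sum of the feature values. The sketch below reproduces that arithmetic in plain Python using the coefficients printed above. The `row` values are the first row of the dataset, with the categorical columns encoded according to the `classes_` output shown earlier (Ideal -> 2, E -> 1, SI2 -> 3); the variable names are illustrative only.

```python
# Coefficients and intercept copied from the printed output,
# in feature order: carat, cut, color, clarity, depth, table, x, y, z.
coef = [11255.84776995, 64.7758201, -270.06641511, 283.52613231,
        -158.77385376, -91.89196942, -1435.09696388, 225.64314197,
        -54.94888592]
intercept = 16614.448411892306

# First row of the dataset after label encoding.
row = [0.23, 2, 1, 3, 61.5, 55.0, 3.95, 3.98, 2.43]

# Prediction = intercept + sum(coefficient * feature value).
manual_prediction = intercept + sum(w * v for w, v in zip(coef, row))
print(manual_prediction)
```

For the same row, `linear_regression.predict` should return the same number up to floating-point rounding.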


In [13]:
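# Feature names ordered from the largest (most positive) to the smallest (most negative) coefficient.
# This is a signed ordering, not a ranking by magnitude.
features.columns[np.argsort(linear_regression.coef_)][::-1]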

Index(['carat', 'clarity', 'y', 'cut', 'z', 'table', 'depth', 'color', 'x'], dtype='object')
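
Because the ordering above is signed, `x` appears last even though its coefficient has the second-largest magnitude. A minimal sketch of ranking by absolute value, using the coefficients printed earlier (the variable names are illustrative):

```python
feature_names = ['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'x', 'y', 'z']
coef = [11255.84776995, 64.7758201, -270.06641511, 283.52613231,
        -158.77385376, -91.89196942, -1435.09696388, 225.64314197,
        -54.94888592]

# Sort (name, coefficient) pairs by the size of the coefficient, largest first.
ranked = sorted(zip(feature_names, coef), key=lambda pair: abs(pair[1]), reverse=True)
for name, weight in ranked:
    print(f"{name:8s} {weight:12.2f}")
```

Even this ranking should be read with care: the features are on different scales (carat spans about 0.2 to 5, depth about 43 to 79), so raw coefficient sizes are only comparable after the features have been standardized.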

In [16]:
prediction = linear_regression.predict(X_test)

In [19]:
print(linear_regression.score(X_test,y_test))
r2_score(y_test,prediction)

0.8789788641373973


0.8789788641373973
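
The two values match because `LinearRegression.score` and `r2_score` both compute the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below spells out that formula in plain Python; the function name `r2` and the four demo values are made up for illustration and are not taken from the diamond data.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical demo values.
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]
r2_demo = r2(y_true_demo, y_pred_demo)
print(round(r2_demo, 4))
```

A score of about 0.879 therefore means the model's squared error on the test set is roughly 12% of the error made by always predicting the mean test price.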

In [20]:
features, targets = dia_df.drop("price",axis=1), dia_df.price

X_train, X_test, y_train, y_test = train_test_split(features, targets, test_size=0.2, random_state=124)

# Train on log1p(price) to compress the wide price range; predictions are then on the log scale.
y_train = np.log1p(y_train)

linear_regression = LinearRegression()
linear_regression.fit(X_train, y_train)

# Slopes (weights)
print(linear_regression.coef_)
# Intercept (constant)
print(linear_regression.intercept_)

[-0.69564817  0.00618665 -0.06624627  0.06328708  0.02831702 -0.00675578
  0.97212303  0.15066492  0.14078379]
-0.037676020006323974


In [21]:
prediction = linear_regression.predict(X_test)

In [None]:
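# The model was fit on log1p(price), so compare on matching scales.
print(linear_regression.score(X_test, np.log1p(y_test)))  # R² on the log scale
r2_score(y_test, np.expm1(prediction))  # R² on the original dollar scale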
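
When a model is trained on a transformed target, the scale bookkeeping can also be delegated to scikit-learn's `TransformedTargetRegressor`, which applies `func` to the target inside `fit` and `inverse_func` to the output of `predict`. The sketch below is an alternative to the manual approach, not what this notebook does, and it runs on synthetic data (an assumption made so the block is self-contained) rather than on the diamond dataset.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: one feature and a target that grows exponentially with it.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.2, 3.0, size=(200, 1))
y_demo = np.exp(5.0 + 1.5 * X_demo[:, 0] + rng.normal(0, 0.1, size=200))

# The wrapper fits LinearRegression on log1p(y) and returns expm1(prediction).
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X_demo, y_demo)

# Predictions come back on the original scale, so they can be scored directly.
pred_demo = model.predict(X_demo)
r2_original_scale = r2_score(y_demo, pred_demo)
print(r2_original_scale)
```

With this wrapper, `model.score(X, y)` and `r2_score(y, model.predict(X))` both operate on the original scale, so there is no need to remember which arrays were transformed.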