***Loading a CSV File with Pandas***
<pre>
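Passing header=None to read_csv(...) keeps the first row of the file from being used as the DataFrame's column names.

To add column names, assign a list with exactly one name per column.
    If the number of names does not match the number of columns, a ValueError is raised.
DataFrame.columns = [Column_Name_String_List]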
</pre>

In [10]:
import pandas as pd

df = pd.read_csv("C:/Users/admin/Desktop/Homework/AI/AI_Class/Data/car_evaluation.csv", header=None)
df.columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
df

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,class
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc
...,...,...,...,...,...,...,...
1723,low,low,5more,more,med,med,good
1724,low,low,5more,more,med,high,vgood
1725,low,low,5more,more,big,low,unacc
1726,low,low,5more,more,big,med,good
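As a minimal sketch of the note on `header=None`, the snippet below reads a small in-memory CSV instead of the real file. The two sample rows are made up for illustration. It also shows the `names=` argument of `read_csv`, which sets the column names in the same call, and the `ValueError` raised when the number of names does not match.

```python
import io

import pandas as pd

# Hypothetical sample rows with no header line (not the real data file).
csv_text = "vhigh,vhigh,2,2,small,low,unacc\nlow,low,5more,more,big,high,vgood\n"
names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

# header=None: the first row stays data and the columns are numbered 0..6.
df_raw = pd.read_csv(io.StringIO(csv_text), header=None)
df_raw.columns = names

# Equivalent one-step form: pass the names directly to read_csv.
df_named = pd.read_csv(io.StringIO(csv_text), header=None, names=names)

# Assigning the wrong number of names raises ValueError.
try:
    df_raw.columns = names[:3]
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(df_named)
print(type(mismatch_error).__name__)
```

Either form gives the same DataFrame. `names=` simply avoids the separate assignment step.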


---
***Inspecting the Data***
<pre>
Use DataFrame.columns to check the column names.
</pre>

In [13]:
df.columns

Index(['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class'], dtype='object')

<pre>
Use DataFrame.isnull().sum() or DataFrame.isna().sum() to count the missing values in each column.
</pre>

In [16]:
label = 'class'
df.isnull().sum()

buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
class       0
dtype: int64
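The car data has no missing values, so every count above is zero. The sketch below uses a hypothetical toy frame with two gaps to show what a non-zero result looks like, and that `isnull()` and `isna()` give the same counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing entries (not the car data).
toy = pd.DataFrame({
    'buying': ['vhigh', None, 'low'],
    'doors': [2.0, 4.0, np.nan],
})

# isnull() and isna() are aliases, so the per-column counts are identical.
null_counts = toy.isnull().sum()
na_counts = toy.isna().sum()

print(null_counts)
```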

<pre>
Use DataFrame['column name'].value_counts() to count how often each value occurs in a given column.
</pre>

In [19]:
df[label].value_counts()

class
unacc    1210
acc       384
good       69
vgood      65
Name: count, dtype: int64

<pre>
DataFrame.values or DataFrame.to_numpy() converts a DataFrame to a NumPy array. The pandas documentation recommends to_numpy().
</pre>

In [22]:
df.values

array([['vhigh', 'vhigh', '2', ..., 'small', 'low', 'unacc'],
       ['vhigh', 'vhigh', '2', ..., 'small', 'med', 'unacc'],
       ['vhigh', 'vhigh', '2', ..., 'small', 'high', 'unacc'],
       ...,
       ['low', 'low', '5more', ..., 'big', 'low', 'unacc'],
       ['low', 'low', '5more', ..., 'big', 'med', 'good'],
       ['low', 'low', '5more', ..., 'big', 'high', 'vgood']], dtype=object)

In [24]:
df.to_numpy()

array([['vhigh', 'vhigh', '2', ..., 'small', 'low', 'unacc'],
       ['vhigh', 'vhigh', '2', ..., 'small', 'med', 'unacc'],
       ['vhigh', 'vhigh', '2', ..., 'small', 'high', 'unacc'],
       ...,
       ['low', 'low', '5more', ..., 'big', 'low', 'unacc'],
       ['low', 'low', '5more', ..., 'big', 'med', 'good'],
       ['low', 'low', '5more', ..., 'big', 'high', 'vgood']], dtype=object)
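Both calls return `dtype=object` here because the columns hold strings. As a small sketch on a hypothetical two-row frame, the snippet below checks that the two conversions agree and that a purely numeric column keeps a numeric dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame mixing a string column and an integer column.
toy = pd.DataFrame({'buying': ['vhigh', 'low'], 'doors': [2, 4]})

arr_values = toy.values
arr_numpy = toy.to_numpy()

# Both conversions produce the same array contents.
same = np.array_equal(arr_values, arr_numpy)

# Mixed column types fall back to a common dtype of object.
print(same, arr_numpy.dtype)

# A single numeric column keeps its numeric dtype.
doors = toy['doors'].to_numpy()
print(doors.dtype)
```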

---
***Transforming the Data***
<pre>
The scikit-learn models used below cannot take strings as input, so every value must be converted to a number.
Use value_counts to check that the conversion worked.
</pre>

In [27]:
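# Convert the string values in every column to integer codes
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

# The same encoder is refit on each column, so afterwards it only
# holds the mapping (classes_) of the last column, 'class'.
for col in df.columns:
    df[col] = label_encoder.fit_transform(df[col])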

df['class'].value_counts()

class
2    1210
0     384
1      69
3      65
Name: count, dtype: int64
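`LabelEncoder` assigns codes in sorted order of the strings, which is why `unacc` became 2 above. One possible alternative, sketched below on hypothetical toy rows, is `OrdinalEncoder` for the feature columns plus a separate `LabelEncoder` for the target, so that every mapping stays available afterwards. The names `toy`, `feature_encoder`, and `target_encoder` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical rows in the style of the car evaluation data.
toy = pd.DataFrame({
    'buying': ['vhigh', 'low', 'med', 'low'],
    'safety': ['low', 'high', 'med', 'high'],
    'class': ['unacc', 'vgood', 'acc', 'good'],
})

feature_cols = ['buying', 'safety']

# OrdinalEncoder handles several feature columns at once and keeps
# one sorted category list per column in categories_.
feature_encoder = OrdinalEncoder()
X_toy = feature_encoder.fit_transform(toy[feature_cols])

# A dedicated LabelEncoder for the target keeps classes_ available,
# so predictions can be mapped back to the original strings.
target_encoder = LabelEncoder()
y_toy = target_encoder.fit_transform(toy['class'])

print(list(target_encoder.classes_))
print(target_encoder.inverse_transform(y_toy))
```

With the target mapping kept, `inverse_transform` turns predicted codes back into labels such as `unacc` or `vgood`.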

<pre>
After the conversion, split the data into features and target:
    X  =>  DataFrame.drop('column name', axis=1).values
    Y  =>  DataFrame['column name'].values
</pre>

In [30]:
X = df.drop('class', axis=1).values
Y = df['class'].values

X, Y

(array([[3, 3, 0, 0, 2, 1],
        [3, 3, 0, 0, 2, 2],
        [3, 3, 0, 0, 2, 0],
        ...,
        [1, 1, 3, 2, 0, 1],
        [1, 1, 3, 2, 0, 2],
        [1, 1, 3, 2, 0, 0]]),
 array([2, 2, 2, ..., 2, 1, 3]))

<pre>
After the conversion, split the data into training and test sets,
    then check the number of samples in each with Data.shape.
    Apply scaling if it is needed.
</pre>

In [33]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1209, 6), (519, 6), (1209,), (519,))

In [90]:
from sklearn.preprocessing import StandardScaler
# Standardize the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
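The sketch below repeats the split-then-scale steps on synthetic stand-in data, so it runs without the CSV file. Two things in it are assumptions rather than part of the cells above: the `stratify=` argument, which keeps class proportions equal across the splits, and the `_demo` variable names.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 samples, 6 integer-coded features, 4 classes.
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 4, size=(100, 6)).astype(float)
y_demo = np.repeat([0, 1, 2, 3], 25)

# stratify=y_demo (an optional addition) keeps the class proportions
# the same in both splits, which helps when classes are imbalanced.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Fit the scaler on the training split only, then reuse it on the test
# split so no information from the test data leaks into preprocessing.
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr.shape, X_te.shape)
print(X_tr_scaled.mean(axis=0).round(6))
```

After scaling, each training feature has mean 0 and standard deviation 1. The test features are only approximately standardized, because they reuse the training statistics.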

---
***Classification***
<pre>
Classification performance is evaluated here with accuracy.
    accuracy_score(y_true, y_pred, ...)
    Compares the predicted values with the true values and returns
        (number of correct predictions) / (number of test samples)
</pre>

In [38]:
from sklearn.metrics import accuracy_score
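To make the accuracy formula concrete, the following sketch computes it by hand on six hypothetical labels and compares the result with `accuracy_score`. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for six test samples.
y_true_demo = np.array([2, 2, 0, 1, 3, 2])
y_pred_demo = np.array([2, 0, 0, 1, 2, 2])

# Accuracy by hand: correct predictions divided by the number of samples.
manual_acc = (y_true_demo == y_pred_demo).sum() / len(y_true_demo)

# The library call returns the same fraction.
sklearn_acc = accuracy_score(y_true_demo, y_pred_demo)

print(manual_acc, sklearn_acc)
```

Four of the six predictions match, so both values are 4/6.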

<pre>
DecisionTreeClassifier
    Decision tree
</pre>

In [41]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'DecisionTree  -  {acc}')

DecisionTree  -  0.9710982658959537


<pre>
RandomForestClassifier
    Random forest
        An ensemble of decision trees
</pre>

In [44]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'RandomForest  -  {acc}')

RandomForest  -  0.9672447013487476
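As a small illustration of the "ensemble of decision trees" idea, the sketch below fits a forest on synthetic data and inspects the individual trees it holds. The data from `make_classification` and the choice of `n_estimators=10` are assumptions made only to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to have something to fit on.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# n_estimators sets how many trees the forest trains.
forest = RandomForestClassifier(n_estimators=10, random_state=42)
forest.fit(X_demo, y_demo)

# After fitting, estimators_ holds the individual decision trees.
trees = forest.estimators_
print(len(trees), type(trees[0]).__name__)
```

In scikit-learn the forest's prediction is formed by averaging the class probabilities predicted by these trees.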


<pre>
Support Vector Machine (SVM)
    sklearn.svm.SVC
</pre>

In [46]:
from sklearn.svm import SVC

model = SVC(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'SVM  -  {acc}')

SVM  -  0.9132947976878613


<pre>
LogisticRegression
    Logistic regression
    Despite its name, a linear model used for classification
</pre>

In [92]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=200, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'LogisticRegression  -  {acc}')

LogisticRegression  -  0.6628131021194605


<pre>
K-Nearest Neighbors (KNN)
    sklearn.neighbors.KNeighborsClassifier
</pre>

In [49]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'KNN  -  {acc}')

KNN  -  0.9402697495183044


---
***Regression***
<pre>
Regression performance is evaluated here with mean_squared_error(y_true, y_pred, ...).
    Σ((y_test - y_pred) ^ 2) / n
        Lower values mean better performance
</pre>

In [71]:
from sklearn.metrics import mean_squared_error
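The sketch below works through the mean squared error on four hypothetical values and compares the result with `mean_squared_error`. It also shows why the order of operations matters: each error is squared first, and only then are the squares averaged.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted values for four test samples.
y_true_demo = np.array([2.0, 0.0, 1.0, 3.0])
y_pred_demo = np.array([1.5, 0.0, 2.0, 3.0])

# Square each error first, then average over the n samples.
errors = y_true_demo - y_pred_demo
manual_mse = (errors ** 2).sum() / len(errors)

# Squaring after summing lets positive and negative errors cancel,
# so this quantity is NOT the mean squared error.
sum_then_square = errors.sum() ** 2 / len(errors)

sklearn_mse = mean_squared_error(y_true_demo, y_pred_demo)
print(manual_mse, sklearn_mse, sum_then_square)
```

Here the errors are 0.5, 0, -1, and 0, so the MSE is (0.25 + 0 + 1 + 0) / 4 = 0.3125, while the sum-then-square variant gives 0.0625.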

<pre>
LinearRegression
    Linear regression
</pre>

In [67]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mean = mean_squared_error(y_test, y_pred)
print(f'LinearRegression  -  {mean}')

LinearRegression  -  0.7338410712893653


<pre>
DecisionTreeRegressor
    Decision tree regression
</pre>

In [75]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mean = mean_squared_error(y_test, y_pred)
print(f'DecisionTreeRegression  -  {mean}')

DecisionTreeRegression  -  0.09441233140655106


<pre>
RandomForestRegressor
    Random forest regression
</pre>

In [82]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mean = mean_squared_error(y_test, y_pred)
print(f'RandomForestRegressor  -  {mean}')

RandomForestRegressor  -  0.0902412331406551
