### Naive Bayes Classifier Task
### 문장에서 느껴지는 감정 예측
##### 다중 분류(Multiclass Classification)
- 비대면 심리 상담사로서 메세지를 전달한 환자에 대한 감정 데이터를 수집했다.
- 각 메세지 별로 감정이 표시되어 있다.
- 미래에 동일한 메세지를 보내는 환자에게 어떤 심리 치료가 적합할 수 있는지 알아보기 위한 모델을 구축한다.

##### 🚩제시된 feature에 알맞은 target이 나올 수 있게 훈련한다.  
- 'Sweat deer': love  
- 'The moment I saw her, I realized something was wrong.': sadness

In [6]:
import pandas as pd
ms_df = pd.read_csv('./datasets/feeling.csv',sep=';')
ms_df

Unnamed: 0,message,feeling
0,im feeling quite sad and sorry for myself but ...,sadness
1,i feel like i am still looking at a blank canv...,sadness
2,i feel like a faithful servant,love
3,i am just feeling cranky and blue,anger
4,i can have for a treat or if i am feeling festive,joy
...,...,...
17995,i just had a very brief time in the beanbag an...,sadness
17996,i am now turning and i feel pathetic that i am...,sadness
17997,i feel strong and good overall,joy
17998,i feel like this was such a rude comment and i...,anger


In [8]:
ms_df.describe().T

Unnamed: 0,count,unique,top,freq
message,18000,17962,i feel a remembrance of the strange by justin ...,2
feeling,18000,6,joy,6066


In [10]:
ms_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18000 entries, 0 to 17999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   message  18000 non-null  object
 1   feeling  18000 non-null  object
dtypes: object(2)
memory usage: 281.4+ KB


In [13]:
from sklearn.preprocessing import LabelEncoder
feeling_encoder = LabelEncoder()
feeling_encoded= feeling_encoder.fit_transform(ms_df["feeling"])
ms_df["feeling"] = feeling_encoded
ms_df

Unnamed: 0,message,feeling
0,im feeling quite sad and sorry for myself but ...,4
1,i feel like i am still looking at a blank canv...,4
2,i feel like a faithful servant,3
3,i am just feeling cranky and blue,0
4,i can have for a treat or if i am feeling festive,2
...,...,...
17995,i just had a very brief time in the beanbag an...,4
17996,i am now turning and i feel pathetic that i am...,4
17997,i feel strong and good overall,2
17998,i feel like this was such a rude comment and i...,0


In [33]:
from sklearn.model_selection import train_test_split
features = ms_df["message"]
targets = ms_df["feeling"]


X_train,X_test,y_train,y_test = train_test_split(features,targets,stratify=targets,random_state=124,test_size=0.1)


In [40]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

countVectorizer = CountVectorizer()
x_train_vectorized = countVectorizer.fit_transform(X_train)

x_train_vectorized


# navie_bayes_pipeline = Pipeline([('count_vectorizer', CountVectorizer()), ('naive_bayes', MultinomialNB())])
# navie_bayes_pipeline.fit(X_train, y_train)

<16200x15210 sparse matrix of type '<class 'numpy.int64'>'
	with 251911 stored elements in Compressed Sparse Row format>

In [41]:
y_pred = navie_bayes_pipeline.predict(X_test)
ms_df_test = ms_df.iloc[X_test.index]
ms_df_test["y_pred"] = y_pred
ms_df_test

navie_bayes_pipeline.score(X_test, y_test)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ms_df_test["y_pred"] = y_pred


0.7588888888888888

In [43]:
navie_bayes_pipeline.predict_proba(X_test)

array([[2.06403529e-03, 1.95613628e-03, 8.97466296e-01, 1.11539913e-02,
        8.73550228e-02, 4.51845757e-06],
       [2.96863304e-02, 1.24318652e-01, 1.91916870e-01, 2.79567962e-02,
        3.89342425e-01, 2.36778926e-01],
       [4.51230414e-03, 5.50272294e-05, 9.91097469e-01, 1.48804486e-04,
        4.18632802e-03, 6.71350253e-08],
       ...,
       [2.47303020e-02, 7.20675888e-02, 8.06790317e-01, 4.41986646e-02,
        2.29297497e-02, 2.92833775e-02],
       [1.39011817e-02, 1.18804995e-02, 8.28005076e-01, 2.85013021e-02,
        1.15744979e-01, 1.96696204e-03],
       [7.78016541e-02, 3.95603492e-02, 1.67944091e-01, 8.26800104e-02,
        6.28204429e-01, 3.80946624e-03]])

In [49]:
from sklearn.metrics import accuracy_score, precision_score , recall_score , confusion_matrix, ConfusionMatrixDisplay, f1_score, roc_auc_score
import matplotlib.pyplot as plt
# 타겟 데이터와 예측 객체를 전달받는다.
def get_evaluation(y_test, prediction, classifier=None, X_test=None):
#     오차 행렬
    confusion = confusion_matrix(y_test, prediction,average='macro')
#     정확도
    accuracy = accuracy_score(y_test , prediction,average='macro')
#     정밀도
    precision = precision_score(y_test , prediction,average='macro')
#     재현율
    recall = recall_score(y_test , prediction,average='macro')
#     F1 score
    f1 = f1_score(y_test, prediction,average='macro')
#     ROC-AUC
    roc_auc = roc_auc_score(y_test, prediction,average='macro')

    print('오차 행렬')
    print(confusion)
    print('정확도: {0:.4f}, 정밀도: {1:.4f}, 재현율: {2:.4f}, F1:{3:.4f}, AUC:{4:.4f}'.format(accuracy , precision ,recall, f1, roc_auc))
    print("#" * 75)
    
    if classifier is not None and  X_test is not None:
        fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8,4))
        titles_options = [("Confusion matrix", None), ("Normalized confusion matrix", "true")]

        for (title, normalize), ax in zip(titles_options, axes.flatten()):
            disp = ConfusionMatrixDisplay.from_estimator(classifier, X_test, y_test, ax=ax, cmap=plt.cm.Blues, normalize=normalize)
            disp.ax_.set_title(title)
        plt.show()

In [45]:
get_evaluation(y_test, y_pred, navie_bayes_pipeline, X_test)

ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].