## Logistic Regression for Binary Classification Tasks

### Basic Formulas

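#### Simple Linear Regression
- $y = a + Bx$
- $g(x) = a + Bx$

#### Multiple Linear Regression
- $y = a + B_1 x_1 + B_2 x_2 + \dots + B_n x_n$
- $g(x) = a + Bx$, where $B$ and $x$ are now vectors and $Bx$ is the sum $B_1 x_1 + \dots + B_n x_n$

#### Logistic Regression
- $g(x) = \mathrm{sigmoid}(a + Bx)$
- $\mathrm{sigmoid}(x) = \dfrac{1}{1 + \exp(-x)}$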
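
As a minimal sketch of the formulas above, the sigmoid and the logistic regression output can be written in plain Python. The function names `sigmoid` and `logistic_g`, as well as the intercept, coefficients and feature values, are illustrative assumptions and are not taken from the dataset used later.

```python
import math

def sigmoid(x):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logistic_g(a, B, x):
    """g(x) = sigmoid(a + Bx), with B and x given as equal-length lists."""
    linear = a + sum(b_i * x_i for b_i, x_i in zip(B, x))
    return sigmoid(linear)

# Hypothetical intercept, coefficients and feature values
print(sigmoid(0))                                # 0.5
print(logistic_g(0.0, [1.0, -1.0], [2.0, 2.0]))  # 0.5, because a + Bx = 0
```

An output above 0.5 would typically be read as the positive class and an output below 0.5 as the negative class. Note that `math.exp` can overflow for very large negative inputs, so this version is only meant for illustration.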

### Dataset: SMS Spam Collection Data Set
- import the pandas module into the script
- load the SMSSpamCollection file into a DataFrame

In [None]:
import pandas as pd

# Raw string so the backslashes in the Windows path are not treated as escapes
df = pd.read_csv(r'C:\Users\Rici\Downloads\SMSSpamCollection',
                 sep='\t',
                 header=None,
                 names=['label', 'sms'])

df.head()

In [None]:
df['label'].value_counts()

### Training & Testing Dataset
- import LabelBinarizer into the script to encode the text labels as 0 and 1

In [None]:
from sklearn.preprocessing import LabelBinarizer

x = df['sms'].values
y = df['label'].values

lb = LabelBinarizer()
y = lb.fit_transform(y).ravel()
lb.classes_

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,
                                                    y,
                                                    test_size=0.25,
                                                    random_state=0)

print(x_train, '\n')
print(y_train)

### Feature Extraction with TF-IDF
- import TfidfVectorizer into the script

In [None]:
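from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: default TfidfVectorizer settings
vectorizer = TfidfVectorizer()

# Fit the vocabulary on the training set only, then reuse it for the test set
x_train_tfidf = vectorizer.fit_transform(x_train)
x_test_tfidf = vectorizer.transform(x_test)

print(x_train_tfidf)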

### Binary Classification with Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(x_train_tfidf, y_train)
y_pred = model.predict(x_test_tfidf)

for pred, sms in zip(y_pred[:5], x_test[:5]):
    print(f'PRED: {pred} - SMS: {sms}\n')

### Evaluation Metrics for Binary Classification
- Confusion Matrix
- Accuracy
- Precision & Recall
- F1 Score
- ROC

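### Basic Terminology
- True Positive (TP): a positive sample predicted as positive
- True Negative (TN): a negative sample predicted as negative
- False Positive (FP): a negative sample predicted as positive
- False Negative (FN): a positive sample predicted as negative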
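
The four terms can be illustrated by counting them directly from a pair of label lists. This is only a sketch: the toy labels below are made up for the example (1 = spam as the positive class, 0 = ham as the negative class) and are not results from the SMS dataset.

```python
# Hypothetical toy labels: 1 = spam (positive), 0 = ham (negative)
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true_toy, y_pred_toy))

# Count each outcome by comparing the true label with the predicted label
tp_toy = sum(1 for t, p in pairs if t == 1 and p == 1)
tn_toy = sum(1 for t, p in pairs if t == 0 and p == 0)
fp_toy = sum(1 for t, p in pairs if t == 0 and p == 1)
fn_toy = sum(1 for t, p in pairs if t == 1 and p == 0)

print(tp_toy, tn_toy, fp_toy, fn_toy)  # 3 3 1 1
```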

### Confusion Matrix
Also known as an error matrix. In scikit-learn, the rows correspond to the true labels and the columns to the predicted labels.

In [None]:
from sklearn.metrics import confusion_matrix

matrix = confusion_matrix(y_test, y_pred)
matrix

In [None]:
tn, fp, fn, tp = matrix.ravel()

print(f'TN: {tn}')
print(f'FP: {fp}')
print(f'FN: {fn}')
print(f'TP: {tp}')

In [None]:
import matplotlib.pyplot as plt

plt.matshow(matrix)
plt.colorbar()

plt.title('Confusion Matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

### Accuracy
Measures the proportion of predictions that are correct.

$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)
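
Accuracy can also be computed by hand from the four confusion-matrix counts, as (TP + TN) divided by the total number of predictions. The helper name `accuracy_from_counts` and the counts passed to it are hypothetical; they are not the results of the model above.

```python
def accuracy_from_counts(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts, not the output of the model trained above
print(accuracy_from_counts(tp=3, tn=3, fp=1, fn=1))  # 0.75
```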

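### Precision & Recall

#### Precision or Positive Predictive Value (PPV)

$\text{Precision} = \dfrac{TP}{TP + FP}$

#### Recall or True Positive Rate (TPR) or Sensitivity

$\text{Recall} = \dfrac{TP}{TP + FN}$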

In [None]:
from sklearn.metrics import recall_score

recall_score(y_test, y_pred)
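
Precision, TP / (TP + FP), and recall, TP / (TP + FN), can be checked by hand from the confusion-matrix counts. The sketch below uses hypothetical helper names and made-up counts, not the results of the model above.

```python
def precision_from_counts(tp, fp):
    """Precision = TP / (TP + FP): how many predicted positives are correct."""
    return tp / (tp + fp)

def recall_from_counts(tp, fn):
    """Recall = TP / (TP + FN): how many actual positives are found."""
    return tp / (tp + fn)

# Hypothetical counts
print(precision_from_counts(tp=6, fp=2))  # 0.75
print(recall_from_counts(tp=6, fn=4))     # 0.6
```

scikit-learn also provides `precision_score` in `sklearn.metrics`, which is called the same way as `recall_score`.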

### F1-Score

The F1 score is the harmonic mean of precision and recall.

$F_1 = 2 \times \dfrac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

In [None]:
from sklearn.metrics import f1_score

f1_score(y_test, y_pred)
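
To illustrate why a harmonic mean is used, the sketch below compares it with the ordinary average for an unbalanced pair of values. The helper name `f1_from_precision_recall` and the input values are assumptions chosen for the example.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical values: perfect recall but mediocre precision
precision_toy, recall_toy = 0.5, 1.0

print(round(f1_from_precision_recall(precision_toy, recall_toy), 3))  # 0.667
print((precision_toy + recall_toy) / 2)                               # 0.75
```

The F1 score is pulled towards the weaker of the two values, so a classifier cannot score well by maximising only precision or only recall.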

### ROC: Receiver Operating Characteristic

The ROC curve visualizes the performance of a classifier by plotting recall (true positive rate) against fallout (false positive rate) at different decision thresholds.

$\text{Fallout} = \dfrac{FP}{TN + FP}$

In [None]:
from sklearn.metrics import roc_curve, auc

prob_estimates = model.predict_proba(x_test_tfidf)

fpr, tpr, threshold = roc_curve(y_test, prob_estimates[:, 1])
nilai_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, 'b', label=f'AUC={nilai_auc}')
plt.plot([0, 1], [0, 1], 'r--', label='Random Classifier')

plt.title('ROC: Receiver Operating Characteristic')
plt.xlabel('Fallout or False Positive Rate')
plt.ylabel('Recall or True Positive Rate')
plt.legend()
plt.show()
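
To see where the points on a ROC curve come from, the following sketch computes the (fallout, recall) pair by hand at several thresholds. The function `roc_point`, the toy labels and the toy scores are all hypothetical and only illustrate the idea; this is not how `roc_curve` is implemented internally.

```python
def roc_point(y_true, scores, threshold):
    """Return (fallout, recall) when scores >= threshold are predicted as 1."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, scores):
        pred = 1 if score >= threshold else 0
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return fp / (fp + tn), tp / (tp + fn)

# Hypothetical labels and predicted probabilities for the positive class
y_true_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]

# Lowering the threshold moves the point from the bottom left to the top right
for thr in (0.8, 0.4, 0.35, 0.1):
    print(thr, roc_point(y_true_toy, scores_toy, thr))
```

Each threshold gives one point on the curve: lowering it increases recall, but usually at the cost of a higher fallout.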