# 11 - Logistic Regression for Binary Classification Tasks

## Basic Formulas
### Simple Linear Regression

Uses only a single feature to estimate the target value.

- $y = \alpha + \beta x$
- $g(x) = \alpha + \beta x$

### Multiple Linear Regression

Can use more than one feature to estimate the target value.

- $y = \alpha + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n$
- $g(X) = \alpha + \beta X$

### Logistic Regression

- $g(X) = \text{sigmoid}(\alpha + \beta X)$
- $\text{sigmoid}(x) = \frac{1}{1 + \exp(-x)}$
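
The two formulas above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `logistic_output` and the coefficient values are assumptions chosen for the example, not values fitted on the SMS data.

```python
import math

def sigmoid(x: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    # Naive form; very large negative x would overflow math.exp.
    return 1.0 / (1.0 + math.exp(-x))

def logistic_output(features, alpha, betas):
    """g(X) = sigmoid(alpha + beta . X) for a single sample."""
    linear = alpha + sum(b * x for b, x in zip(betas, features))
    return sigmoid(linear)

# Hypothetical coefficients, picked so the linear part is exactly 0:
# -1.0 + 0.5 * 1.0 + 0.25 * 2.0 = 0.0
p = logistic_output([1.0, 2.0], alpha=-1.0, betas=[0.5, 0.25])
print(p)  # sigmoid(0) = 0.5
```

Because the output always lies between 0 and 1, it can be read as the estimated probability of the positive class, which is what makes the linear model usable for classification.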

## Dataset: SMS Spam Collection Data Set

In [1]:
import pandas as pd
df = pd.read_csv('./dataset/SMSSpamCollection',
                 sep='\t',
                 header=None,
                 names=['label', 'sms'])
df.head()

|   | label | sms |
|---|-------|-----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [2]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

The counts above show that the dataset is imbalanced: ham messages (4825) far outnumber spam messages (747).
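
To put the imbalance in numbers, the sketch below turns the counts printed above into class shares. The variable names are illustrative; only the two counts come from the output above.

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

for label, n in counts.items():
    print(f"{label}: {n / total:.1%}")

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(counts.values()) / total
print(f"majority baseline: {majority_baseline:.1%}")
```

A model that labels every message as ham would already be right for roughly 87% of the messages, so accuracy alone is a weak indicator on this dataset.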

## Training & Testing Dataset

In [None]:
from sklearn.preprocessing import LabelBinarizer

X = df['sms'].values
y = df['label'].values

lb = LabelBinarizer()
y = lb.fit_transform(y).ravel()
lb.classes_

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    random_state=0)

print(X_train, '\n')
print(y_train)

## Feature Extraction with TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(X_train_tfidf)
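
To give a feel for what a TF-IDF weight expresses, here is a minimal sketch of the textbook variant (term frequency times `log(N / df)`) on a made-up three-message corpus. This is an assumption-laden illustration: `TfidfVectorizer` uses a smoothed IDF and normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "free entry win prize",
    "ok see you later",
    "win free prize now",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Textbook TF-IDF: (count / doc length) * log(N / df)."""
    tf = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in tf.items()
    }

weights = tfidf(tokenized[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {w:.4f}")
```

In the first message, "entry" gets the highest weight because it appears in only one document, while "free", "win" and "prize" are shared with another message and are weighted lower.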

## Binary Classification with Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_tfidf, y_train)
y_pred = model.predict(X_test_tfidf)

for pred, sms in zip(y_pred[:5], X_test[:5]):
    print(f'PRED: {pred} - SMS: {sms}\n')

## Evaluation Metrics for Binary Classification

- Confusion Matrix
- Accuracy
- Precision & Recall
- F1 Score
- ROC

## Basic Terminology

- True Positive (TP)
- True Negative (TN)
- False Positive (FP)
- False Negative (FN)
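
The four terms can be made concrete by counting them by hand on a small example. The labels below are made up, and the sketch assumes spam is the positive class encoded as 1 and ham as 0.

```python
# Toy labels, invented for illustration: 1 = spam (positive), 0 = ham (negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat  = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam predicted as spam
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham predicted as ham
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham predicted as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam predicted as ham

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=3 FP=1 FN=1
```

"True" and "False" say whether the prediction was correct; "Positive" and "Negative" say what the model predicted.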

## Confusion Matrix
A confusion matrix is also often known as an error matrix.

In [None]:
from sklearn.metrics import confusion_matrix

matrix = confusion_matrix(y_test, y_pred)
matrix

In [None]:
tn, fp, fn, tp = matrix.ravel()

print(f'TN: {tn}')
print(f'FP: {fp}')
print(f'FN: {fn}')
print(f'TP: {tp}')

In [None]:
import matplotlib.pyplot as plt

plt.matshow(matrix)
plt.colorbar()

plt.title('Confusion Matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

## Accuracy
Accuracy measures the proportion of predictions that are correct.

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} = \frac{correct}{total}$

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)
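
As a cross-check on the formula, accuracy can also be computed directly from the four confusion-matrix counts. The counts below are hypothetical and are not the result of the model trained above.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts, for illustration only.
acc = accuracy(tp=3, tn=3, fp=1, fn=1)
print(acc)  # 6 correct out of 8 -> 0.75
```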

## Precision & Recall
#### Precision or Positive Predictive Value (PPV)
$Precision = \frac{TP}{TP + FP}$

In [None]:
from sklearn.metrics import precision_score

precision_score(y_test, y_pred)

#### Recall or True Positive Rate (TPR) or Sensitivity
$Recall = \frac{TP}{TP + FN}$

In [None]:
from sklearn.metrics import recall_score

recall_score(y_test, y_pred)
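
The two ratios answer different questions, which is easier to see when they are computed side by side from the same counts. The counts and helper names below are hypothetical; returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
def precision(tp: int, fp: int) -> float:
    """Of everything predicted positive, how much really was positive?"""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Of everything really positive, how much did the model find?"""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts: 8 spam caught, 2 ham wrongly flagged, 4 spam missed.
tp, fp, fn = 8, 2, 4
print(f"precision = {precision(tp, fp):.3f}")  # 8 / 10
print(f"recall    = {recall(tp, fn):.3f}")     # 8 / 12
```

In a spam filter, low precision means legitimate messages get flagged, while low recall means spam slips through.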

### F1-Score
The F1 score is the harmonic mean of precision and recall.

$F1\ score = 2 \times \frac{precision\ \times\ recall}{precision\ +\ recall}$

In [None]:
from sklearn.metrics import f1_score

f1_score(y_test, y_pred)
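
The F1 score combines precision and recall as their harmonic mean, which stays low unless both are reasonably high. A minimal sketch with hypothetical counts, which also checks the equivalent count-based form `2TP / (2TP + FP + FN)`:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, for illustration only.
tp, fp, fn = 8, 2, 8
p = tp / (tp + fp)   # 0.8
r = tp / (tp + fn)   # 0.5

score = f1(p, r)
score_from_counts = 2 * tp / (2 * tp + fp + fn)
print(f"{score:.4f} {score_from_counts:.4f}")
```

Note that the result is lower than the arithmetic mean of 0.8 and 0.5 (0.65): the harmonic mean penalizes the weaker of the two values.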

### ROC: Receiver Operating Characteristic
$fallout = \frac{FP}{TN+FP}$

In [None]:
from sklearn.metrics import roc_curve, auc

prob_estimates = model.predict_proba(X_test_tfidf)

fpr, tpr, thresholds = roc_curve(y_test, prob_estimates[:, 1])
nilai_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, 'b', label=f'AUC={nilai_auc}')
plt.plot([0, 1], [0, 1], 'r--', label='Random Classifier')

plt.title('ROC: Receiver Operating Characteristic')
plt.xlabel('Fallout or False Positive Rate')
plt.ylabel('Recall or True Positive Rate')
plt.legend()
plt.show()
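
To show where the points of a ROC curve come from, the sketch below sweeps a decision threshold over a handful of scores and records fallout (FPR) and recall (TPR) at each step, then integrates with the trapezoid rule. The function names and the toy scores are assumptions for illustration; this is not how `roc_curve` is implemented internally.

```python
def roc_points(y_true, scores):
    """(fpr, tpr) pairs obtained by lowering the threshold step by step."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy data, invented for illustration: 1 = positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
area = auc_trapezoid(points)
print(points)
print(area)  # 0.75
```

An AUC of 0.5 corresponds to the diagonal of a random classifier, and 1.0 to a model that ranks every positive above every negative.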