<h1><center> Identifying Spam in Email Using Machine Learning </center></h1> 

# <a id="toc">Table of Contents</a>
### 1. [Importing Modules and Preparing the Data](#)
### 2. [Data Preprocessing](#)
- 2.1. [Processing Email Text](#)
- 2.2. [Stemming and URL Extraction](#)
- 2.3. [Transforming Emails into Word Counts](#)
- 2.4. [Transforming Word Counts into Vector Representations](#)
### 3. [Classification Models and Evaluation](#)
- 3.1. [Logistic Regression Model](#)
- 3.2. [AdaBoost Model](#)
- 3.3. [Support Vector Classifier (SVC) Model](#)
- 3.4. [Random Forest Model](#)
- 3.5. [Ensemble Voting Classifier](#)
### 4. [Example of Using the Model on Test Data](#)
### 5. [Example of Creating an Email Object](#)

### Importing the required libraries.
> The program imports pandas, numpy, matplotlib, seaborn, os, tarfile, urllib.request, email, re, Counter, Pipeline, and several classes from sklearn.

In [1]:
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from scipy.sparse import csr_matrix
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.svm import SVC
from html import unescape
import urllib.request
import seaborn as sns
import email.message
import email.parser
import pandas as pd
import email.policy
import numpy as np
import tarfile
import email
import os
import re

### Reading the file "spam.csv" with pd.read_csv from pandas.
> The data is stored in the variable sms. Its columns are then renamed to 'label' and 'message' with sms.columns = ['label', 'message'], and sms.head() displays the first 5 rows.

In [2]:
sms = pd.read_csv("spam.csv",encoding='latin-1')
sms.columns = ['label', 'message']

sms.head()

Unnamed: 0,label,message


### Defining the function fetch_spam_data() to download the spam dataset.
> The function downloads the .tar.bz2 archives from the server by URL and extracts them into the directory given by spam_path.

In [3]:
DOWNLOAD_ROOT = "http://spamassassin.apache.org/old/publiccorpus/"
HAM_URL = DOWNLOAD_ROOT + "20030228_easy_ham.tar.bz2"
SPAM_URL = DOWNLOAD_ROOT + "20030228_spam.tar.bz2"
SPAM_PATH = os.path.join("datasets", "spam")

def fetch_spam_data(ham_url=HAM_URL, spam_url=SPAM_URL, spam_path=SPAM_PATH):
    # Create the target directory if it does not exist yet.
    os.makedirs(spam_path, exist_ok=True)
    for filename, url in (("ham.tar.bz2", ham_url), ("spam.tar.bz2", spam_url)):
        path = os.path.join(spam_path, filename)
        # Download each archive only once.
        if not os.path.isfile(path):
            urllib.request.urlretrieve(url, path)
        # Extract the archive into spam_path; the context manager closes it.
        with tarfile.open(path) as tar_bz2_file:
            tar_bz2_file.extractall(path=spam_path)

In [4]:
fetch_spam_data()

### Defining the directories for the ham and spam emails (HAM_DIR and SPAM_DIR).
> The file names in each directory are stored in ham_filenames and spam_filenames, keeping only names longer than 20 characters. The numbers of spam and ham files are then displayed.

In [5]:
HAM_DIR = os.path.join(SPAM_PATH, "easy_ham")
SPAM_DIR = os.path.join(SPAM_PATH, "spam")
ham_filenames = [name for name in sorted(os.listdir(HAM_DIR)) if len(name) > 20]
spam_filenames = [name for name in sorted(os.listdir(SPAM_DIR)) if len(name) > 20]

In [6]:
len(spam_filenames), len(ham_filenames), 

(500, 2500)
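
> The length filter above can be illustrated on a small directory listing. The file names below are hypothetical stand-ins, assuming each message is stored under a long hash-style name next to a short helper file; they are not read from the dataset.

```python
# Hypothetical directory listing: two long message file names and one short
# helper file (all names invented for illustration).
listing = [
    "cmds",
    "00002.9c4069e25e1ef370c078db7ee85ff9ac",
    "00001.7c53336b37003a9286aba55d2945844c",
]

# Same filter as in the notebook: sort, then keep names longer than 20 characters.
message_files = [name for name in sorted(listing) if len(name) > 20]
print(message_files)
```

> Under this assumption the filter keeps only the message files and silently drops any short auxiliary file.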

### Defining the function load_email() to read an email file in binary mode.
> The function takes is_spam, which selects the spam or ham directory, and filename, the name of the email file to read. The file is opened in binary mode and parsed with email.parser.BytesParser. The parsed messages are stored in the lists ham_emails and spam_emails, and the first element of ham_emails is displayed.

In [7]:
def load_email(is_spam, filename, spam_path=SPAM_PATH):
    directory = "spam" if is_spam else "easy_ham"
    with open(os.path.join(spam_path, directory, filename), "rb") as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)

In [8]:
ham_emails = [load_email(is_spam=False, filename=name) for name in ham_filenames]
spam_emails = [load_email(is_spam=True, filename=name) for name in spam_filenames]
ham_emails[0]

<email.message.EmailMessage at 0x27a12fd4fd0>
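
> To see what the parser returns without touching the dataset, the following minimal sketch parses a hand-written message from bytes. The raw message is invented for illustration; parsebytes() is used here instead of parse() only because the input is an in-memory bytes object rather than an open file.

```python
import email.parser
import email.policy

# A hand-written raw message (invented for illustration).
raw = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Lunch on Friday?\r\n"
    b"\r\n"
    b"Are you free at noon?\r\n"
)

# Same parser class and policy as in load_email().
parser = email.parser.BytesParser(policy=email.policy.default)
msg = parser.parsebytes(raw)

print(type(msg).__name__)         # class of the parsed object
print(msg["Subject"])             # headers are accessed like dictionary keys
print(msg.get_content().strip())  # body text
```

> With email.policy.default the parser produces EmailMessage objects, which is why methods such as get_content() are available in the later cells.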

### Printing the content of the 3rd ham email and the 6th spam email.
> .get_content() returns the email content, and .strip() removes leading and trailing whitespace.

In [9]:
print(ham_emails[2].get_content().strip())

Man Threatens Explosion In Moscow 

Thursday August 22, 2002 1:40 PM
MOSCOW (AP) - Security officers on Thursday seized an unidentified man who
said he was armed with explosives and threatened to blow up his truck in
front of Russia's Federal Security Services headquarters in Moscow, NTV
television reported.
The officers seized an automatic rifle the man was carrying, then the man
got out of the truck and was taken into custody, NTV said. No other details
were immediately available.
The man had demanded talks with high government officials, the Interfax and
ITAR-Tass news agencies said. Ekho Moskvy radio reported that he wanted to
talk with Russian President Vladimir Putin.
Police and security forces rushed to the Security Service building, within
blocks of the Kremlin, Red Square and the Bolshoi Ballet, and surrounded the
man, who claimed to have one and a half tons of explosives, the news
agencies said. Negotiations continued for about one and a half hours outside
the building, ITAR-

In [10]:
print(spam_emails[5].get_content().strip())

A POWERHOUSE GIFTING PROGRAM You Don't Want To Miss! 
 
  GET IN WITH THE FOUNDERS! 
The MAJOR PLAYERS are on This ONE
For ONCE be where the PlayerS are
This is YOUR Private Invitation

EXPERTS ARE CALLING THIS THE FASTEST WAY 
TO HUGE CASH FLOW EVER CONCEIVED
Leverage $1,000 into $50,000 Over and Over Again

THE QUESTION HERE IS:
YOU EITHER WANT TO BE WEALTHY 
OR YOU DON'T!!!
WHICH ONE ARE YOU?
I am tossing you a financial lifeline and for your sake I 
Hope you GRAB onto it and hold on tight For the Ride of youR life!

Testimonials

Hear what average people are doing their first few days:
"We've received 8,000 in 1 day and we are doing that over and over again!' Q.S. in AL
 "I'm a single mother in FL and I've received 12,000 in the last 4 days." D. S. in FL
"I was not sure about this when I sent off my $1,000 pledge, but I got back $2,000 the very next day!" L.L. in KY
"I didn't have the money, so I found myself a partner to work this with. We have received $4,000 over the last 2 days

### Defining two functions: get_email_structure() to obtain an email's structure and structures_counter() to count how often each structure occurs.
> structures_counter() is applied to the ham and spam email lists, and the results are displayed with the .most_common() method of Counter.

In [11]:
def get_email_structure(email):
    if isinstance(email, str):
        return email
    payload = email.get_payload()
    if isinstance(payload, list):
        return "multipart({})".format(", ".join([
            get_email_structure(sub_email)
            for sub_email in payload
        ]))
    else:
        return email.get_content_type()

In [12]:
def structures_counter(emails):
    structures = Counter()
    for email in emails:
        structure = get_email_structure(email)
        structures[structure] += 1
    return structures

In [13]:
structures_counter(ham_emails).most_common()

[('text/plain', 2408),
 ('multipart(text/plain, application/pgp-signature)', 66),
 ('multipart(text/plain, text/html)', 8),
 ('multipart(text/plain, text/plain)', 4),
 ('multipart(text/plain)', 3),
 ('multipart(text/plain, application/octet-stream)', 2),
 ('multipart(text/plain, text/enriched)', 1),
 ('multipart(text/plain, application/ms-tnef, text/plain)', 1),
 ('multipart(multipart(text/plain, text/plain, text/plain), application/pgp-signature)',
  1),
 ('multipart(text/plain, video/mng)', 1),
 ('multipart(text/plain, multipart(text/plain))', 1),
 ('multipart(text/plain, application/x-pkcs7-signature)', 1),
 ('multipart(text/plain, multipart(text/plain, text/plain), text/rfc822-headers)',
  1),
 ('multipart(text/plain, multipart(text/plain, text/plain), multipart(multipart(text/plain, application/x-pkcs7-signature)))',
  1),
 ('multipart(text/plain, application/x-java-applet)', 1)]

In [14]:
structures_counter(spam_emails).most_common()

[('text/plain', 218),
 ('text/html', 183),
 ('multipart(text/plain, text/html)', 45),
 ('multipart(text/html)', 20),
 ('multipart(text/plain)', 19),
 ('multipart(multipart(text/html))', 5),
 ('multipart(text/plain, image/jpeg)', 3),
 ('multipart(text/html, application/octet-stream)', 2),
 ('multipart(text/plain, application/octet-stream)', 1),
 ('multipart(text/html, text/plain)', 1),
 ('multipart(multipart(text/html), application/octet-stream, image/jpeg)', 1),
 ('multipart(multipart(text/plain, text/html), image/gif)', 1),
 ('multipart/alternative', 1)]
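
> The structure strings listed above can be reproduced on a small hand-built message. The sketch below restates the recursive idea under the hypothetical name describe_structure and applies it to an email with a plain-text and an HTML alternative; the message content is invented for illustration.

```python
from email.message import EmailMessage


def describe_structure(message):
    """Return a string describing the MIME structure of a message."""
    payload = message.get_payload()
    if isinstance(payload, list):
        # Multipart message: describe every sub-part recursively.
        inner = ", ".join(describe_structure(part) for part in payload)
        return "multipart({})".format(inner)
    # Leaf part: report its content type.
    return message.get_content_type()


msg = EmailMessage()
msg.set_content("Plain-text version")
msg.add_alternative("<p>HTML version</p>", subtype="html")

structure = describe_structure(msg)
print(structure)
```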

### Printing every header and its value for the first spam email in spam_emails.
> The subject of the first spam email is also displayed.

In [15]:
for header, value in spam_emails[0].items():
    print(header,":",value)

Return-Path : <12a1mailbot1@web.de>
Delivered-To : zzzz@localhost.spamassassin.taint.org
Received : from localhost (localhost [127.0.0.1])	by phobos.labs.spamassassin.taint.org (Postfix) with ESMTP id 136B943C32	for <zzzz@localhost>; Thu, 22 Aug 2002 08:17:21 -0400 (EDT)
Received : from mail.webnote.net [193.120.211.219]	by localhost with POP3 (fetchmail-5.9.0)	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 13:17:21 +0100 (IST)
Received : from dd_it7 ([210.97.77.167])	by webnote.net (8.9.3/8.9.3) with ESMTP id NAA04623	for <zzzz@spamassassin.taint.org>; Thu, 22 Aug 2002 13:09:41 +0100
From : 12a1mailbot1@web.de
Received : from r-smtp.korea.com - 203.122.2.197 by dd_it7  with Microsoft SMTPSVC(5.5.1775.675.6);	 Sat, 24 Aug 2002 09:42:10 +0900
To : dcek1a1@netsgo.com
Subject : Life Insurance - Why Pay More?
Date : Wed, 21 Aug 2002 20:31:57 -1600
MIME-Version : 1.0
Message-ID : <0103c1042001882DD_IT7@dd_it7>
Content-Type : text/html; charset="iso-8859-1"
Content-Transfer-Encoding : qu

In [16]:
spam_emails[0]["Subject"]

'Life Insurance - Why Pay More?'

### Combining the ham and spam emails into a single array X, with the labels stored in the array y.
> The data is split into training and test sets with train_test_split() in an 80:20 ratio. X_train, X_test, y_train, and y_test hold the resulting splits.

In [17]:
X = np.array(ham_emails + spam_emails, dtype=object)
y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Defining the function html_to_plain_text() to convert HTML into plain text.
> The function uses regular expressions to drop the head section, replace each opening hyperlink tag with the word "HYPERLINK", and strip all remaining HTML tags. unescape() then converts HTML entities back to the characters they represent.

In [18]:
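def html_to_plain_text(html):
    # Drop the whole <head> section.
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    # Replace each opening <a ...> tag with a placeholder word.
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    # Strip all remaining tags.
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)
    # Collapse runs of blank lines into a single newline.
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    # Convert HTML entities such as &nbsp; or &amp; to characters.
    return unescape(text)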
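
> The effect of each substitution is easier to follow on a one-line document. The sketch below repeats the function so that it runs on its own and applies it to a small HTML snippet invented for illustration.

```python
import re
from html import unescape


def html_to_plain_text(html):
    text = re.sub(r'<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I)
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    text = re.sub(r'<.*?>', '', text, flags=re.M | re.S)
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text)


snippet = (
    '<html><head><title>Deal</title></head>'
    '<body><p>Save &amp; win <a href="http://example.com">here</a></p></body></html>'
)

plain = html_to_plain_text(snippet)
print(plain.split())
```

> The title inside the head section disappears, the link target is lost and only the placeholder word remains, and the entity &amp;amp; becomes a plain ampersand.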

### Finding the HTML-format spam emails in the training data (X_train) and storing them in the list html_spam_emails.
> The eighth HTML spam email in the list is selected, and its content is printed both in its original form and as plain text after processing with html_to_plain_text().

In [19]:
html_spam_emails = [email for email in X_train[y_train==1]
                    if get_email_structure(email) == "text/html"]
sample_html_spam = html_spam_emails[7]
print(sample_html_spam.get_content().strip()[:1000], "...")

<HTML><HEAD><TITLE></TITLE><META http-equiv="Content-Type" content="text/html; charset=windows-1252"><STYLE>A:link {TEX-DECORATION: none}A:active {TEXT-DECORATION: none}A:visited {TEXT-DECORATION: none}A:hover {COLOR: #0033ff; TEXT-DECORATION: underline}</STYLE><META content="MSHTML 6.00.2713.1100" name="GENERATOR"></HEAD>
<BODY text="#000000" vLink="#0033ff" link="#0033ff" bgColor="#CCCC99"><TABLE borderColor="#660000" cellSpacing="0" cellPadding="0" border="0" width="100%"><TR><TD bgColor="#CCCC99" valign="top" colspan="2" height="27">
<font size="6" face="Arial, Helvetica, sans-serif" color="#660000">
<b>OTC</b></font></TD></TR><TR><TD height="2" bgcolor="#6a694f">
<font size="5" face="Times New Roman, Times, serif" color="#FFFFFF">
<b>&nbsp;Newsletter</b></font></TD><TD height="2" bgcolor="#6a694f"><div align="right"><font color="#FFFFFF">
<b>Discover Tomorrow's Winners&nbsp;</b></font></div></TD></TR><TR><TD height="25" colspan="2" bgcolor="#CCCC99"><table width="100%" border="0" 

In [20]:
print(html_to_plain_text(sample_html_spam.get_content())[:1000], "...")


OTC
 Newsletter
Discover Tomorrow's Winners 
For Immediate Release
Cal-Bay (Stock Symbol: CBYI)
Watch for analyst "Strong Buy Recommendations" and several advisory newsletters picking CBYI.  CBYI has filed to be traded on the OTCBB, share prices historically INCREASE when companies get listed on this larger trading exchange. CBYI is trading around 25 cents and should skyrocket to $2.66 - $3.25 a share in the near future.
Put CBYI on your watch list, acquire a position TODAY.
REASONS TO INVEST IN CBYI
A profitable company and is on track to beat ALL earnings estimates!
One of the FASTEST growing distributors in environmental & safety equipment instruments.
Excellent management team, several EXCLUSIVE contracts.  IMPRESSIVE client list including the U.S. Air Force, Anheuser-Busch, Chevron Refining and Mitsubishi Heavy Industries, GE-Energy & Environmental Research.
RAPIDLY GROWING INDUSTRY
Industry revenues exceed $900 million, estimates indicate that there could be as much as $25 billi

### Defining the function email_to_text() to extract the text of an email.
> The function walks through every part of the email and looks for content of type "text/plain" or "text/html". The first plain-text part found is returned as is. If the email has no plain-text part, the last HTML part found is converted to plain text with html_to_plain_text(). The beginning of the resulting text is then printed.

In [21]:
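def email_to_text(email):
    html = None
    for part in email.walk():
        ctype = part.get_content_type()
        # Only text parts are of interest; skip attachments, images, etc.
        if ctype not in ("text/plain", "text/html"):
            continue
        try:
            content = part.get_content()
        except Exception:  # in case of encoding issues
            content = str(part.get_payload())
        if ctype == "text/plain":
            # Prefer plain text: return the first plain-text part found.
            return content
        else:
            html = content
    # No plain-text part: fall back to the last HTML part, if any.
    if html:
        return html_to_plain_text(html)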

In [22]:
print(email_to_text(sample_html_spam)[:100], "...")


OTC
 Newsletter
Discover Tomorrow's Winners 
For Immediate Release
Cal-Bay (Stock Symbol: CBYI)
Wat ...
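
> The traversal order that email_to_text() relies on can be checked on a small multipart message. This is a minimal sketch with an invented message: walk() yields the container first and then each part, so a plain-text part is reached before its HTML alternative here.

```python
from email.message import EmailMessage

# Build a message with a plain-text body and an HTML alternative.
msg = EmailMessage()
msg["Subject"] = "Weekly offer"
msg.set_content("Plain-text body")
msg.add_alternative("<html><body><b>HTML body</b></body></html>", subtype="html")

# walk() visits the multipart container, then its parts in order.
content_types = [part.get_content_type() for part in msg.walk()]
print(content_types)

# The first text/plain part is what email_to_text() would return.
first_plain = next(
    part.get_content()
    for part in msg.walk()
    if part.get_content_type() == "text/plain"
)
print(first_plain.strip())
```

> Note that only body parts are visited; header values such as the subject are not part of the extracted text.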


### Importing the nltk and urlextract libraries.
> Both libraries must be installed first, for example with pip install nltk and pip install urlextract. If a library is not installed, an error message is printed and the corresponding step (stemming or URL replacement) is skipped.

In [23]:
try:
    import nltk

    stemmer = nltk.PorterStemmer()
    for word in ("Computations", "Computation", "Computing", "Computed", "Compute", "Compulsive"):
        print(word, "=>", stemmer.stem(word))
except ImportError:
    print("Error: stemming requires the NLTK module.")
    stemmer = None

Computations => comput
Computation => comput
Computing => comput
Computed => comput
Compute => comput
Compulsive => compuls


In [24]:
try:
    import urlextract # may require an Internet connection to download root domain names
    
    url_extractor = urlextract.URLExtract()
    print(url_extractor.find_urls("Will it detect github.com and https://youtu.be/7Pq-S557XQU?t=3m32s"))
except ImportError:
    print("Error: replacing URLs requires the urlextract module.")
    url_extractor = None

Error: replacing URLs requires the urlextract module.
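
> The output above shows that urlextract was not available in this run, so URLs are left in the text and later split into tokens such as "http" and "com". As a rough stand-in, and not an equivalent of urlextract, a regular expression can replace URLs that carry an explicit scheme. The names URL_PATTERN and replace_urls are hypothetical and do not appear in the notebook.

```python
import re

# Rough pattern: "http://" or "https://" followed by non-whitespace characters.
# Unlike urlextract, it does not detect bare domains such as "github.com".
URL_PATTERN = re.compile(r"https?://\S+")


def replace_urls(text):
    """Replace every scheme-prefixed URL with the placeholder ' URL '."""
    return URL_PATTERN.sub(" URL ", text)


sample = "Claim your prize at https://example.com/win?id=42 today"
replaced = replace_urls(sample)
print(replaced.split())
```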


### Defining the class EmailToWordCounterTransformer, a subclass of scikit-learn's BaseEstimator and TransformerMixin.
> The class transforms emails into word counts. Its constructor takes optional arguments that control the text processing: header stripping, lowercasing, punctuation removal, URL replacement, number replacement, and stemming. The strip_headers flag is stored but not used by transform in this implementation.

> The fit method does nothing except return the object itself. The transform method converts each email into word counts by applying the configured processing steps.

In [25]:
from sklearn.base import BaseEstimator, TransformerMixin

class EmailToWordCounterTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, strip_headers=True, lower_case=True, remove_punctuation=True,
                 replace_urls=True, replace_numbers=True, stemming=True):
        self.strip_headers = strip_headers
        self.lower_case = lower_case
        self.remove_punctuation = remove_punctuation
        self.replace_urls = replace_urls
        self.replace_numbers = replace_numbers
        self.stemming = stemming
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X_transformed = []
        for email in X:
            text = email_to_text(email) or ""
            if self.lower_case:
                text = text.lower()
            if self.replace_urls and url_extractor is not None:
                urls = list(set(url_extractor.find_urls(text)))
                urls.sort(key=lambda url: len(url), reverse=True)
                for url in urls:
                    text = text.replace(url, " URL ")
            if self.replace_numbers:
                text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text)
            if self.remove_punctuation:
                text = re.sub(r'\W+', ' ', text, flags=re.M)
            word_counts = Counter(text.split())
            if self.stemming and stemmer is not None:
                stemmed_word_counts = Counter()
                for word, count in word_counts.items():
                    stemmed_word = stemmer.stem(word)
                    stemmed_word_counts[stemmed_word] += count
                word_counts = stemmed_word_counts
            X_transformed.append(word_counts)
        return np.array(X_transformed)
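
> The text-cleaning steps inside transform can be traced on a single sentence. The sketch below applies only lowercasing, number replacement, and punctuation removal, with the same regular expressions as the class; URL replacement and stemming are left out, and the sample sentence is invented.

```python
import re
from collections import Counter

text = "Win $1,000 now! Call 555.1234 NOW"

# 1. Lowercase everything.
text = text.lower()
# 2. Replace numbers (optionally with a decimal part or exponent) by NUMBER.
text = re.sub(r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', 'NUMBER', text)
# 3. Replace every run of non-word characters by a single space.
text = re.sub(r'\W+', ' ', text, flags=re.M)

word_counts = Counter(text.split())
print(text)
print(word_counts)
```

> Because the comma is not part of the number pattern, "1,000" turns into two NUMBER tokens, whereas "555.1234" is matched as one.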

### Creating an EmailToWordCounterTransformer object and applying it to the first three emails in X_train. The resulting word counts are displayed.

In [26]:
X_few = X_train[:3]
X_few_wordcounts = EmailToWordCounterTransformer().fit_transform(X_few)
X_few_wordcounts

array([Counter({'chuck': 1, 'murcko': 1, 'wrote': 1, 'stuff': 1, 'yawn': 1, 'r': 1}),
       Counter({'the': 11, 'of': 9, 'and': 8, 'all': 3, 'christian': 3, 'to': 3, 'by': 3, 'jefferson': 2, 'i': 2, 'have': 2, 'superstit': 2, 'one': 2, 'on': 2, 'been': 2, 'ha': 2, 'half': 2, 'rogueri': 2, 'teach': 2, 'jesu': 2, 'some': 1, 'interest': 1, 'quot': 1, 'http': 1, 'www': 1, 'postfun': 1, 'com': 1, 'pfp': 1, 'worboi': 1, 'html': 1, 'thoma': 1, 'examin': 1, 'known': 1, 'word': 1, 'do': 1, 'not': 1, 'find': 1, 'in': 1, 'our': 1, 'particular': 1, 'redeem': 1, 'featur': 1, 'they': 1, 'are': 1, 'alik': 1, 'found': 1, 'fabl': 1, 'mytholog': 1, 'million': 1, 'innoc': 1, 'men': 1, 'women': 1, 'children': 1, 'sinc': 1, 'introduct': 1, 'burnt': 1, 'tortur': 1, 'fine': 1, 'imprison': 1, 'what': 1, 'effect': 1, 'thi': 1, 'coercion': 1, 'make': 1, 'world': 1, 'fool': 1, 'other': 1, 'hypocrit': 1, 'support': 1, 'error': 1, 'over': 1, 'earth': 1, 'six': 1, 'histor': 1, 'american': 1, 'john': 1, 'e': 1, 're

### Defining the class WordCounterToVectorTransformer, also a subclass of scikit-learn's BaseEstimator and TransformerMixin.
> The class transforms word counts into vectors. Its constructor takes the optional argument vocabulary_size, which sets the size of the vocabulary.

> The fit method totals the word counts over the input X, capping each word's contribution at 10 per email, and builds a vocabulary from the most frequent words. The transform method converts the word counts into a sparse matrix using that vocabulary; words outside the vocabulary are counted in column 0.

In [27]:
class WordCounterToVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, vocabulary_size=1000):
        self.vocabulary_size = vocabulary_size
    def fit(self, X, y=None):
        total_count = Counter()
        for word_count in X:
            for word, count in word_count.items():
                total_count[word] += min(count, 10)
        most_common = total_count.most_common()[:self.vocabulary_size]
        self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
        return self
    def transform(self, X, y=None):
        rows = []
        cols = []
        data = []
        for row, word_count in enumerate(X):
            for word, count in word_count.items():
                rows.append(row)
                cols.append(self.vocabulary_.get(word, 0))
                data.append(count)
        return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1))

### Creating a WordCounterToVectorTransformer with a vocabulary size of 10 and applying it to the word counts obtained earlier (X_few_wordcounts).
> The resulting vectors are displayed.

In [28]:
vocab_transformer = WordCounterToVectorTransformer(vocabulary_size=10)
X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)
X_few_vectors

<3x11 sparse matrix of type '<class 'numpy.intc'>'
	with 20 stored elements in Compressed Sparse Row format>

In [29]:
X_few_vectors.toarray()

array([[  6,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [112,  11,   9,   8,   3,   1,   0,   1,   3,   0,   1],
       [ 92,   0,   1,   2,   3,   4,   5,   3,   1,   4,   2]],
      dtype=int32)

In [30]:
vocab_transformer.vocabulary_

{'the': 1,
 'of': 2,
 'and': 3,
 'to': 4,
 'http': 5,
 'number': 6,
 'com': 7,
 'all': 8,
 'yahoo': 9,
 'in': 10}
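
> The vocabulary and the vectors shown above follow two rules: each word's contribution to the vocabulary ranking is capped at 10 per email, and every word outside the vocabulary is accumulated in column 0. The sketch below restates both rules in plain Python with dense lists instead of a SciPy sparse matrix; the helper names build_vocabulary and to_dense_rows and the two sample documents are hypothetical.

```python
from collections import Counter


def build_vocabulary(word_counts, vocabulary_size):
    """Rank words by total count, capping each document's contribution at 10."""
    total = Counter()
    for counts in word_counts:
        for word, count in counts.items():
            total[word] += min(count, 10)
    most_common = total.most_common(vocabulary_size)
    # Index 0 is reserved for out-of-vocabulary words.
    return {word: index + 1 for index, (word, _) in enumerate(most_common)}


def to_dense_rows(word_counts, vocabulary, vocabulary_size):
    """Turn each Counter into a list of length vocabulary_size + 1."""
    rows = []
    for counts in word_counts:
        row = [0] * (vocabulary_size + 1)
        for word, count in counts.items():
            # Unknown words all land in column 0 and are summed there.
            row[vocabulary.get(word, 0)] += count
        rows.append(row)
    return rows


docs = [
    Counter({"free": 12, "money": 2}),
    Counter({"free": 1, "meeting": 3, "agenda": 1}),
]
vocab = build_vocabulary(docs, vocabulary_size=2)
rows = to_dense_rows(docs, vocab, vocabulary_size=2)
print(vocab)
print(rows)
```

> The cap affects only the ranking: the vector for the first document still stores the uncapped count of 12 for "free". This also explains the large first column in the array shown earlier, where most words of each email fall outside the ten-word vocabulary.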

### Using scikit-learn's Pipeline to chain the two transformers so that they are applied in sequence.
> preprocess_pipeline consists of two transformation steps:
> 1. EmailToWordCounterTransformer.
> 2. WordCounterToVectorTransformer.

> The pipeline is then used to transform X_train.

In [31]:
preprocess_pipeline = Pipeline([
    ("email_to_wordcount", EmailToWordCounterTransformer()),
    ("wordcount_to_vector", WordCounterToVectorTransformer()),
])

X_train_transformed = preprocess_pipeline.fit_transform(X_train)

# Logistic Regression

### Creating a scikit-learn LogisticRegression classifier and evaluating it with 3-fold cross-validation on the transformed data (X_train_transformed) and labels (y_train).
> The mean cross-validation accuracy is displayed.

In [32]:
log_clf = LogisticRegression(solver="lbfgs", max_iter=1000, random_state=42)
score = cross_val_score(log_clf, X_train_transformed, y_train, cv=3, verbose=3)
score.mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


[CV] END ................................ score: (test=0.983) total time=   0.0s
[CV] END ................................ score: (test=0.985) total time=   0.0s
[CV] END ................................ score: (test=0.993) total time=   0.1s


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.3s finished


0.9866666666666667

### Transforming the test data (X_test) with the previously fitted pipeline.
> The LogisticRegression model is trained on the transformed training data (X_train_transformed) and training labels (y_train). Predictions are made on the transformed test data, and the precision and recall of those predictions are displayed.

In [33]:
X_test_transformed = preprocess_pipeline.transform(X_test)

log_clf = LogisticRegression(solver="lbfgs", max_iter=1000, random_state=42)
log_clf.fit(X_train_transformed, y_train)

y_pred = log_clf.predict(X_test_transformed)

print("Precision: {:.2f}%".format(100 * precision_score(y_test, y_pred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred)))

Precision: 95.88%
Recall: 97.89%
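
> Precision is the share of emails flagged as spam that really are spam, TP / (TP + FP), and recall is the share of actual spam that gets flagged, TP / (TP + FN). The notebook does not show the confusion matrix, so the counts below are hypothetical values chosen only because they reproduce the two percentages printed above.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are true positives."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn)


# Hypothetical counts (assumption, not taken from the notebook output).
tp, fp, fn = 93, 4, 2

precision_pct = round(100 * precision(tp, fp), 2)
recall_pct = round(100 * recall(tp, fn), 2)
print("Precision: {:.2f}%".format(precision_pct))
print("Recall: {:.2f}%".format(recall_pct))
```

> For a spam filter, low precision means legitimate mail ends up flagged as spam, while low recall means spam reaches the inbox, so the two figures are worth reading together rather than relying on accuracy alone.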


### Repeating the same steps with different classifiers: AdaBoostClassifier, SVC (Support Vector Classifier), and RandomForestClassifier. The precision and recall of each model are also displayed.

# AdaBoost

In [34]:
X_train_transformed = preprocess_pipeline.fit_transform(X_train)

ada_clf = AdaBoostClassifier(n_estimators=100, random_state=42)
score_adaboost = cross_val_score(ada_clf, X_train_transformed, y_train, cv=3, verbose=3)
score_adaboost.mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] END ................................ score: (test=0.980) total time=   0.6s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.6s remaining:    0.0s


[CV] END ................................ score: (test=0.986) total time=   0.6s


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.3s remaining:    0.0s


[CV] END ................................ score: (test=0.985) total time=   0.6s


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    1.9s finished


0.98375

In [35]:
X_test_transformed = preprocess_pipeline.transform(X_test)

ada_clf = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_clf.fit(X_train_transformed, y_train)

y_pred_adaboost = ada_clf.predict(X_test_transformed)

print("Precision: {:.2f}%".format(100 * precision_score(y_test, y_pred_adaboost)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred_adaboost)))

Precision: 96.94%
Recall: 100.00%


# Support Vector Machine

In [36]:
X_train_transformed = preprocess_pipeline.fit_transform(X_train)

svc_clf = SVC(gamma='auto')
score_svc = cross_val_score(svc_clf, X_train_transformed, y_train, cv=3, verbose=3)
score_svc.mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] END ................................ score: (test=0.948) total time=   0.5s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s


[CV] END ................................ score: (test=0.954) total time=   0.5s


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.2s remaining:    0.0s


[CV] END ................................ score: (test=0.951) total time=   0.5s


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    1.8s finished


0.9508333333333333

In [37]:
X_test_transformed = preprocess_pipeline.transform(X_test)

svc_clf = SVC(gamma='auto')
svc_clf.fit(X_train_transformed, y_train)

y_pred_svc = svc_clf.predict(X_test_transformed)

print("Precision: {:.2f}%".format(100 * precision_score(y_test, y_pred_svc)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred_svc)))

Precision: 100.00%
Recall: 75.79%


# Random Forest

In [38]:
X_train_transformed = preprocess_pipeline.fit_transform(X_train)

ran_clf = RandomForestClassifier(random_state=42)
score_randfor = cross_val_score(ran_clf, X_train_transformed, y_train, cv=3, verbose=3)
score_randfor.mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] END ................................ score: (test=0.983) total time=   0.4s


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.4s remaining:    0.0s


[CV] END ................................ score: (test=0.986) total time=   0.4s


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.9s remaining:    0.0s


[CV] END ................................ score: (test=0.985) total time=   0.4s


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    1.4s finished


0.9845833333333333

In [39]:
X_test_transformed = preprocess_pipeline.transform(X_test)

ran_clf = RandomForestClassifier(random_state=42)
ran_clf.fit(X_train_transformed, y_train)

y_pred_randfor = ran_clf.predict(X_test_transformed)

print("Precision: {:.2f}%".format(100 * precision_score(y_test, y_pred_randfor)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_pred_randfor)))

Precision: 98.84%
Recall: 89.47%


In [40]:
X_train

array([<email.message.EmailMessage object at 0x0000027A18304250>,
       <email.message.EmailMessage object at 0x0000027A18304F70>,
       <email.message.EmailMessage object at 0x0000027A12FD70D0>, ...,
       <email.message.EmailMessage object at 0x0000027A1970E0E0>,
       <email.message.EmailMessage object at 0x0000027A199A5030>,
       <email.message.EmailMessage object at 0x0000027A18306AA0>],
      dtype=object)

In [41]:
X_test_transformed

<600x1001 sparse matrix of type '<class 'numpy.intc'>'
	with 51594 stored elements in Compressed Sparse Row format>

# Ensemble Learning

In [42]:
random_forest_clf = RandomForestClassifier(random_state=42)
lg_reg            = LogisticRegression(solver="lbfgs", max_iter=1000, random_state=42)
svm_clf           = SVC(gamma='auto')
ada_clf           = AdaBoostClassifier(n_estimators=100, random_state=42)

In [43]:
estimators = [
    random_forest_clf, 
    lg_reg, 
    svm_clf, 
    ada_clf
]

In [44]:
for estimator in estimators:
    print("Training the", estimator)
    estimator.fit(X_train_transformed, y_train)

Training the RandomForestClassifier(random_state=42)
Training the LogisticRegression(max_iter=1000, random_state=42)
Training the SVC(gamma='auto')
Training the AdaBoostClassifier(n_estimators=100, random_state=42)


In [45]:
for estimator in estimators:
    print("Precision "+str(estimator)+" : {:.2f}%".format(100 * precision_score(y_test, estimator.predict(X_test_transformed) )))
    print("Recall  "+str(estimator)+" : {:.2f}%".format(100 * recall_score(y_test, estimator.predict(X_test_transformed) )))
    print("")

Precision RandomForestClassifier(random_state=42) : 98.84%
Recall  RandomForestClassifier(random_state=42) : 89.47%

Precision LogisticRegression(max_iter=1000, random_state=42) : 95.88%
Recall  LogisticRegression(max_iter=1000, random_state=42) : 97.89%

Precision SVC(gamma='auto') : 100.00%
Recall  SVC(gamma='auto') : 75.79%

Precision AdaBoostClassifier(n_estimators=100, random_state=42) : 96.94%
Recall  AdaBoostClassifier(n_estimators=100, random_state=42) : 100.00%



In [46]:
named_estimators = [
    ("random_forest_clf", random_forest_clf),
    ("logistic_regression", lg_reg),
    ("svm_clf", svm_clf),
    ("adaboost_clf", ada_clf),
]

In [47]:
voting_clf = VotingClassifier(named_estimators)

In [48]:
voting_clf.fit(X_train_transformed, y_train)

In [49]:
X_test_transformed

<600x1001 sparse matrix of type '<class 'numpy.intc'>'
	with 51594 stored elements in Compressed Sparse Row format>

In [50]:
y_votingpred = voting_clf.predict(X_test_transformed)

In [51]:
print("Precision: {:.2f}%".format(100 * precision_score(y_test, y_votingpred)))
print("Recall: {:.2f}%".format(100 * recall_score(y_test, y_votingpred)))

Precision: 100.00%
Recall: 93.68%
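
> VotingClassifier uses hard (majority) voting by default, and scikit-learn documents that ties are resolved in favour of the class label that comes first in ascending order. The following plain-Python sketch mimics that rule for a single email; the function name hard_vote and the vote lists are hypothetical.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the majority label; on a tie, return the smallest label."""
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)


# Votes from four classifiers for one email (0 = ham, 1 = spam).
print(hard_vote([1, 1, 1, 0]))  # three of four vote spam
print(hard_vote([1, 1, 0, 0]))  # two against two
```

> With four estimators a two-against-two split is possible, and under this rule it resolves to ham (label 0). That may be one reason the ensemble errs toward precision over recall, although the notebook does not analyse this.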


### Defining the function create_email, which builds an email object with Python's built-in email module.
> The email object then serves as sample data to be tested with the pipeline built earlier. The result is the list test, which contains the email object.

In [52]:
def create_email(sender, recipient, subject, message):
    email_msg = email.message.EmailMessage()
    email_msg['From'] = sender
    email_msg['To'] = recipient
    email_msg['Subject'] = subject
    email_msg.set_content(message)

    return email_msg

In [53]:
sender = 'spammer@example.com'
recipient = 'yehezkiel@example.com'
subject = 'Life Insurance - Why Pay More?'
message = "Watch for analyst 'Strong Buy Recommendations' and several advisory newsletters picking CBYI.  CBYI has filed to be traded on the OTCBB, share prices historically INCREASE when companies get listed on this larger trading exchange. CBYI is trading around 25 cents and should skyrocket to $2.66 - $3.25 a share in the near future. Put CBYI on your watch list, acquire a position TODAY."

email_obj = create_email(sender, recipient, subject, message)
test = [email_obj]
test

[<email.message.EmailMessage at 0x27a1cc56380>]
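
> Before passing such an object to the pipeline, it can be useful to confirm that it looks like the parsed dataset emails. The sketch below repeats create_email so that it runs on its own and inspects the result; the sender, recipient, subject, and message values are invented.

```python
import email.message


def create_email(sender, recipient, subject, message):
    email_msg = email.message.EmailMessage()
    email_msg['From'] = sender
    email_msg['To'] = recipient
    email_msg['Subject'] = subject
    email_msg.set_content(message)
    return email_msg


sample = create_email(
    "spammer@example.com",
    "user@example.com",
    "Limited offer",
    "Acquire a position TODAY.",
)

print(sample["From"], "->", sample["To"])
print(sample.get_content_type())     # set_content() with a str gives text/plain
print(repr(sample.get_content()))    # body text, with a trailing newline
```

> Assuming the fitted preprocess_pipeline and voting_clf from the earlier cells are still in memory, a prediction for the list test could presumably be obtained with voting_clf.predict(preprocess_pipeline.transform(test)); the notebook does not show this step or its result. Since email_to_text() reads only the body, the subject line would not influence that prediction.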

<h1><center> The End </center></h1> 