# Prepare the AI Spam Classifier Dataset
We'll combine two open-source datasets curated by [The University of California, Irvine (UCI)](https://archive.ics.uci.edu):

- Spam SMS ([source](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection))
- YouTube Spam ([source](https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection))


#### Requirements
- Python
- Jupyter (set up with [this video](https://www.youtube.com/watch?v=9tPS-7TWjq0))
- Pandas

### Step 1. Download Datasets

#### Create destination folders

In [64]:
import pathlib
import pandas as pd

USE_PROJECT_ROOT = True
BASE_DIR = pathlib.Path(".").resolve()
if USE_PROJECT_ROOT:
    BASE_DIR = BASE_DIR.parent.parent
DATASET_DIR = BASE_DIR / "datasets"
ZIPS_DIR = DATASET_DIR / 'zips'
EXPORT_DIR = DATASET_DIR / "exports"
SMS_SPAM_DIR = DATASET_DIR / 'imports' / 'sms-spam'
YOUTUBE_SPAM_DIR = DATASET_DIR / 'imports' / 'youtube-spam'
SMS2_SPAM_DIR = DATASET_DIR / 'imports' / 'sms-spam-2'
LATIH_SPAM_DIR = DATASET_DIR / 'imports' / 'spam-latih'


print(ZIPS_DIR)

E:\code\SpamDetector_FastAPI\AI-as-an-API\datasets\zips


In [65]:
ZIPS_DIR.mkdir(exist_ok=True, parents=True)

EXPORT_DIR.mkdir(exist_ok=True, parents=True)

SMS_SPAM_DIR.mkdir(exist_ok=True, parents=True)

YOUTUBE_SPAM_DIR.mkdir(exist_ok=True, parents=True)

SMS2_SPAM_DIR.mkdir(exist_ok=True, parents=True)

LATIH_SPAM_DIR.mkdir(exist_ok=True, parents=True)

On macOS or Linux, you could also create the directories with shell commands:

```
!mkdir -p $DATASET_DIR/zips/
!mkdir -p $SMS_SPAM_DIR
!mkdir -p $YOUTUBE_SPAM_DIR
!mkdir -p $EXPORT_DIR
```

#### UCI Spam SMS
Source: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

In [66]:
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip -o $ZIPS_DIR/uci-sms-spam.zip

import zipfile

# Extract the SMS archive into its import folder.
with zipfile.ZipFile(ZIPS_DIR / "uci-sms-spam.zip") as zip_ref:
    zip_ref.extractall(SMS_SPAM_DIR)


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
 23  198k   23 47616    0     0  28552      0  0:00:07  0:00:01  0:00:06 28563
100  198k  100  198k    0     0  91948      0  0:00:02  0:00:02 --:--:-- 92001


#### YouTube Spam
Source: https://archive.ics.uci.edu/ml/datasets/YouTube+Spam+Collection

In [67]:
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip -o $ZIPS_DIR/uci-youtube-spam.zip

import zipfile
with zipfile.ZipFile(ZIPS_DIR / "uci-youtube-spam.zip") as zip_ref:
  zip_ref.extractall(YOUTUBE_SPAM_DIR)

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  9  159k    9 15872    0     0  11801      0  0:00:13  0:00:01  0:00:12 11800
100  159k  100  159k    0     0  76709      0  0:00:02  0:00:02 --:--:-- 76755
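
The download cells above rely on `curl` being available on the command line. As a possible alternative, here is a minimal sketch that does the same download and extraction with the Python standard library only. The helper name `download_and_extract` is hypothetical and is not part of the original notebook.

```python
import pathlib
import urllib.request
import zipfile


def download_and_extract(url, zip_path, extract_dir):
    """Download a zip archive to zip_path and unpack it into extract_dir."""
    zip_path = pathlib.Path(zip_path)
    extract_dir = pathlib.Path(extract_dir)
    # Make sure both target folders exist before writing anything.
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    extract_dir.mkdir(parents=True, exist_ok=True)
    # urlretrieve writes the response body straight to the given file.
    urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as zip_ref:
        zip_ref.extractall(extract_dir)
    # Return the extracted entry names so the caller can inspect them.
    return sorted(p.name for p in extract_dir.iterdir())
```

In the notebook it could be called as `download_and_extract(url, ZIPS_DIR / "uci-youtube-spam.zip", YOUTUBE_SPAM_DIR)`, with `url` set to the same address passed to `curl` above.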


### Step 2. Load Datasets into a Pandas DataFrame

**Load the `sms-spam` dataset into a pandas dataframe**

In [69]:
sms_path = SMS_SPAM_DIR / 'SMSSpamCollection'
sms_df = pd.read_csv(str(sms_path), sep='\t', header=None)

Now set the headers

In [70]:
sms_df.columns = ['label', 'text']
sms_df['source'] = 'uci-spam-sms'
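
The load and header steps can also be done in one call by passing `names=` to `read_csv`. The sketch below shows this on two made-up rows in the same tab-separated, header-less layout; the sample text is illustrative and not taken from the dataset.

```python
import io

import pandas as pd

# Two made-up rows: label, a tab, then the message text. No header row.
sample = io.StringIO("ham\tSee you at lunch\nspam\tWIN a free prize now\n")

# header=None tells pandas the first line is data; names= sets the columns.
sample_df = pd.read_csv(sample, sep="\t", header=None, names=["label", "text"])
sample_df["source"] = "uci-spam-sms"

# Count how many rows fall under each label.
print(sample_df["label"].value_counts())
```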

**Load the `youtube-spam` datasets into a pandas dataframe**

The youtube-spam dataset is stored across multiple CSV files. Let's combine them into a single DataFrame.

In [71]:
location = YOUTUBE_SPAM_DIR
csvs = list(location.glob("*.csv"))

In [72]:
new_dfs = []
for csv in csvs:
    # Read each CSV in turn, keeping only the class and comment text columns.
    csv_df = pd.read_csv(csv, usecols=['CLASS', 'CONTENT'])
    csv_df.rename(columns={'CLASS': 'class', "CONTENT": 'text'}, inplace=True)
    # A class of 1 means spam; anything else is treated as ham.
    csv_df['label'] = csv_df['class'].apply(lambda x: "spam" if str(x) == "1" else "ham")
    sub_df = csv_df[['label', 'text']].copy()
    new_dfs.append(sub_df)

yt_df = pd.concat(new_dfs)
yt_df['source'] = 'uci-youtube-spam'
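
The per-file loading logic can be wrapped in a small function so it is easy to check on toy data before pointing it at the real folder. This is a sketch under that assumption: `load_youtube_csvs` is a hypothetical helper name, and the two demo files are made up.

```python
import pathlib
import tempfile

import pandas as pd


def load_youtube_csvs(folder):
    """Read every CSV in folder and return one DataFrame of label/text rows."""
    frames = []
    for csv_path in sorted(pathlib.Path(folder).glob("*.csv")):
        csv_df = pd.read_csv(csv_path, usecols=["CLASS", "CONTENT"])
        frames.append(pd.DataFrame({
            # CLASS of 1 maps to spam; anything else is treated as ham.
            "label": csv_df["CLASS"].map(lambda x: "spam" if str(x) == "1" else "ham"),
            "text": csv_df["CONTENT"],
        }))
    # ignore_index gives the combined frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


# Demo on two tiny made-up files with the same column names.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = pathlib.Path(tmp)
    (tmp_dir / "a.csv").write_text("CONTENT,CLASS\nsubscribe to me,1\n")
    (tmp_dir / "b.csv").write_text("CONTENT,CLASS\ngreat song,0\n")
    demo_df = load_youtube_csvs(tmp_dir)

print(demo_df)
```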

In [73]:
location = SMS2_SPAM_DIR
csvs = list(location.glob("*.csv"))

In [74]:
new_dfs = []
for csv in csvs:
    # Read each CSV in turn; this dataset uses 'label' and 'Teks' columns.
    csv_df = pd.read_csv(csv, usecols=['label', 'Teks'])
    csv_df.rename(columns={'label': 'class', "Teks": 'text'}, inplace=True)
    # "penipuan" (Indonesian for "fraud") marks spam; every other class is ham.
    csv_df['label'] = csv_df['class'].apply(lambda x: "spam" if str(x) == "penipuan" else "ham")
    sub_df = csv_df[['label', 'text']].copy()
    new_dfs.append(sub_df)
sms2_df = pd.concat(new_dfs)

In [75]:
location = LATIH_SPAM_DIR
csvs = list(location.glob("*.csv"))

**Combine the `sms-spam` dataset and the `youtube-spam` dataset**

In [76]:
df = pd.concat([sms_df, yt_df])
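
After combining, it can help to check how many rows each source and label contributes. The sketch below does this on small made-up frames with the same three columns; passing `ignore_index=True` is an optional choice, not something the original cell does, and it avoids repeated index values in the combined frame.

```python
import pandas as pd

# Made-up stand-ins for sms_df and yt_df with the same columns.
sms_demo = pd.DataFrame({
    "label": ["ham"],
    "text": ["See you soon"],
    "source": ["uci-spam-sms"],
})
yt_demo = pd.DataFrame({
    "label": ["spam", "ham"],
    "text": ["Check my channel", "Nice video"],
    "source": ["uci-youtube-spam"] * 2,
})

# ignore_index=True renumbers the rows 0..n-1 instead of keeping each frame's index.
combined = pd.concat([sms_demo, yt_demo], ignore_index=True)

# Row counts per (source, label) pair.
counts = combined.groupby(["source", "label"]).size()
print(counts)
```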

### Step 3. Export Complete Dataset

In [77]:
df = pd.concat([sms_df, yt_df])
df.to_csv(EXPORT_DIR / 'spam-dataset.csv', index=False)
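
One way to confirm the export worked is to read the file back and compare it with the frame that was written. This is a minimal sketch on made-up data, written to a temporary folder rather than `EXPORT_DIR`.

```python
import pathlib
import tempfile

import pandas as pd

# Made-up rows; the comma in the first text checks that quoting survives.
demo = pd.DataFrame({
    "label": ["ham", "spam"],
    "text": ["hello, world", "free prize"],
    "source": ["uci-spam-sms", "uci-youtube-spam"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = pathlib.Path(tmp) / "spam-dataset.csv"
    # index=False keeps the row index out of the file, as in the export cell.
    demo.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# The reloaded frame should have the same columns and the same number of rows.
print(list(reloaded.columns), len(reloaded))
```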