# **Philippine Scam SMS**

**Author/s: [Anton Reyes](https://www.github.com/AGR-yes)**

## **Introduction**

### **Requirements and Imports**

#### **Imports**

**Basic Libraries**

* `numpy` contains a large collection of mathematical functions
* `pandas` contains functions that are designed for data manipulation and data analysis

In [1]:
import numpy as np
import pandas as pd

**Visualization Libraries**

* `matplotlib.pyplot` contains functions to create interactive plots
* `seaborn` is a library based on matplotlib that allows for data visualization
* `plotly` is an open-source graphing library for Python

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

**Natural Language Processing Libraries**

* `re` is a module that allows the use of regular expressions

In [3]:
import re

#### **Datasets and Files**

The following files were used for this project:

- `Scam_SMS_Reports.xlsx` contains user reports with the phone number, type of scam, proof, and whether the sender knew the recipient's name.
- `SPAM_SMS.csv` contains the text messages received by one person, with the sender's number, the message text, and the date and time received.
- `text-scams-incidents-philippines-2019-by-region.xlsx` contains the number of scam messages (in thousands) received per region.
- `networks.csv` contains the first 4-5 digits of Philippine phone numbers, used to identify the network each number belongs to.

## **Data Collection**

We import the datasets using pandas.

In [4]:
dataset = "Raw Datasets/Scam_SMS_Reports.xlsx"

report = pd.read_excel(dataset)
report.head()

Unnamed: 0,Unnamed: 1,Number,Network (Auto-Generates),Type of Scam,Proofs,Knows your name?\nCheck if yes,Unnamed: 6,GRAPHS AND STUFF,Unnamed: 8,Unnamed: 9,...,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27
0,1,9103239417,,Work from home,,False,,,,,...,,,,,,,,,,
1,2,95348643,,Dating Scam,,False,,,,,...,,,,,,,,,,
2,3,931804865,,work,,False,,,,,...,,,,,,,,,,
3,4,981197529,,nanalo sa lotto,,False,,,,,...,,,,,,,,,,
4,5,981369614,,Abroad Opportunity kuno,,False,,,,,...,,,,,,,,,,


In [5]:
dataset = "Raw Datasets/SPAM_SMS.csv"

spam = pd.read_csv(dataset)
spam.head()

Unnamed: 0.1,Unnamed: 0,_id,address,date,text,threadId
0,0,8787,+6396***32373,2022-11-12 14:02:10.079,"Welcome ! your have P1222 for S!ot , \nWeb: 11...",836
1,1,8788,+6398***78852,2022-11-12 14:33:48.916,"My god, at least 999P rewards waiting for you\...",837
2,2,8789,+6394***80113,2022-11-13 23:03:15.023,"DEAR VIP <REAL NAME>, No. 1 Online Sabong Site...",838
3,3,8790,+6395***34934,2022-11-14 00:07:18.715,"<REAL NAME>! Today, you can win the iphone14PR...",839
4,4,8791,+6396***74401,2022-11-15 02:28:56.636,"Welcome ! your have P1222 for S!ot , \nWeb: gr...",841


In [6]:
dataset = "Raw Datasets/text-scams-incidents-philippines-2019-by-region.xlsx"

incidents = pd.read_excel(dataset, sheet_name="Data")
incidents.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2
0,,,
1,,Number of SMS fraud or text scams incidents Ph...,
2,,Total number of SMS fraud or text scam inciden...,
3,,,
4,,Region 3,3484.73


## **Description of the Dataset**

Here, we find the shape of each dataset.

In [7]:
sets = [report, spam, incidents]

# Use a loop variable that does not shadow the built-in `set`
for frame in sets:
    print(frame.shape)

(10493, 28)
(170, 6)
(21, 3)


The `info` summary of each dataframe shows that `report` and `incidents` contain null values.

In [8]:
report.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10493 entries, 0 to 10492
Data columns (total 28 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0                                  10493 non-null  int64  
 1   Number                         4883 non-null   object 
 2   Network (Auto-Generates)       9974 non-null   object 
 3   Type of Scam                   4686 non-null   object 
 4   Proofs                         1257 non-null   object 
 5   Knows your name?
Check if yes  10492 non-null  object 
 6   Unnamed: 6                     0 non-null      float64
 7   GRAPHS AND STUFF               27 non-null     object 
 8   Unnamed: 8                     8 non-null      object 
 9   Unnamed: 9                     1 non-null      object 
 10  Unnamed: 10                    0 non-null      float64
 11  Unnamed: 11                    1 non-null      object 
 12  Unnamed: 12                    1 non-null     

In [9]:
spam.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170 entries, 0 to 169
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  170 non-null    int64 
 1   _id         170 non-null    int64 
 2   address     170 non-null    object
 3   date        170 non-null    object
 4   text        170 non-null    object
 5   threadId    170 non-null    int64 
dtypes: int64(3), object(3)
memory usage: 8.1+ KB


In [10]:
incidents.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  0 non-null      float64
 1   Unnamed: 1  19 non-null     object 
 2   Unnamed: 2  17 non-null     float64
dtypes: float64(2), object(1)
memory usage: 632.0+ bytes
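
The `info` summaries above list non-null counts, so the number of missing values has to be inferred by subtraction. A minimal sketch of counting nulls directly is shown here. It uses a small toy frame (`toy`, with made-up values) rather than the real `report` dataframe, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `report`; the values are illustrative only.
toy = pd.DataFrame({
    "Number": ["9103239417", None, "931804865"],
    "Type of Scam": ["Work from home", "Dating Scam", None],
    "Unnamed: 6": [np.nan, np.nan, np.nan],
})

# Count missing values per column, largest first.
null_counts = toy.isna().sum().sort_values(ascending=False)
print(null_counts)

# Columns with no values at all are candidates for dropping.
empty_columns = null_counts[null_counts == len(toy)].index.tolist()
print(empty_columns)
```

The same two expressions could be applied to `report`, `spam`, and `incidents` in turn.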


## **Exploratory Data Analysis**

### **Report**

In [11]:
report.columns

Index([' ', 'Number', 'Network (Auto-Generates)', 'Type of Scam', 'Proofs',
       'Knows your name?\nCheck if yes', 'Unnamed: 6', 'GRAPHS AND STUFF',
       'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12',
       'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15', 'Unnamed: 16',
       'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20',
       'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24',
       'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27'],
      dtype='object')

In [12]:
report['Network (Auto-Generates)'].value_counts()

                         5106
Smart or Talk ‘N Text    3205
Globe or TM              1240
Smart                     289
Sun Cellular              134
Name: Network (Auto-Generates), dtype: int64

In [13]:
report['Proofs'].value_counts()

https://t.ly/stsH -compiled spam message screenshot                                                                                                                                                                                                                  62
Row 469 - Proof                                                                                                                                                                                                                                                      39
Screenshot                                                                                                                                                                                                                                                           37
'                                                                                                                                                                                                               

In [14]:
report['Knows your name?\nCheck if yes'].value_counts()

False                                                                                                                          10279
True                                                                                                                             195
linky-ph/-BingoPlus                                                                                                                2
Atin/B.e.t/  is  B.e.s.t On/l.i.n.e Ca/si.n0 in the Philippines. http://okadaonline.pics/KsN G/e/t 80 % for every de/po/sit        2
Mentioned my name and the link. Ayaw nya daw ako maging mahirap HAHHAHA                                                            1
98 Games; Cash in                                                                                                                  1
http://gcxvd2ny.com                                                                                                                1
mentioned my name                                                    

In [15]:
report.iloc[:, 0:6].dtypes

                                   int64
Number                            object
Network (Auto-Generates)          object
Type of Scam                      object
Proofs                            object
Knows your name?\nCheck if yes    object
dtype: object

In [16]:
report['Number'].value_counts()[report['Number'].value_counts() > 2]

9602956931    4
9171832274    4
9811905645    4
9813126760    4
9173211259    4
9812913402    4
9852597532    4
9317418500    4
9750223903    4
9177095646    3
9813696569    3
9504889147    3
9813810743    3
9261762496    3
9811905386    3
9270512120    3
9171050224    3
9171874218    3
9761276078    3
9813458245    3
9171473405    3
9811993224    3
9389379718    3
9811905673    3
9178255960    3
9125514092    3
9702629918    3
9813810774    3
9171453599    3
9811905410    3
9097709911    3
9812142562    3
9813458247    3
9171780185    3
9096938537    3
9171344686    3
9171342236    3
9171299857    3
Name: Number, dtype: int64

### **Spam**

In [17]:
spam.columns

Index(['Unnamed: 0', '_id', 'address', 'date', 'text', 'threadId'], dtype='object')

In [18]:
spam['date'].describe()

count                         170
unique                        170
top       2022-11-12 14:02:10.079
freq                            1
Name: date, dtype: object

In [19]:
spam['date'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 170 entries, 0 to 169
Series name: date
Non-Null Count  Dtype 
--------------  ----- 
170 non-null    object
dtypes: object(1)
memory usage: 1.5+ KB
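
The `date` column is stored with the `object` dtype, meaning the timestamps are plain strings. A possible next step, sketched here on a toy series built from three of the timestamps shown in `spam.head()`, is to parse them with `pd.to_datetime` so that time-based features become available. The variable names are hypothetical and not part of the notebook.

```python
import pandas as pd

# Three timestamps copied from the head of the spam dataset.
dates = pd.Series([
    "2022-11-12 14:02:10.079",
    "2022-11-13 23:03:15.023",
    "2022-11-15 02:28:56.636",
])

# Parse the strings into a datetime dtype.
parsed = pd.to_datetime(dates)
print(parsed.dtype)

# With a datetime dtype, the .dt accessor exposes time components.
hours = parsed.dt.hour
day_names = parsed.dt.day_name()
print(hours.tolist())
print(day_names.tolist())
```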


## **Data Preprocessing**

### **Networks**

In [20]:
network = pd.read_csv("Raw Datasets/networks.csv")
network['Network'].value_counts()

Globe/TM          25
Smart             14
Sun               14
TNT                9
Globe PostPaid     9
DITO               8
Globe              1
Globe/GOMO         1
Name: Network, dtype: int64

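The network labels are inconsistent (for example, `Globe`, `Globe/TM`, and `Globe PostPaid`), so we consolidate them to match the labels used in the report dataset.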

In [21]:
#replace Globe/TM, Globe PostPaid, Globe/GOMO, and Globe with "Globe or TM"
network['Network'] = network['Network'].replace(['Globe/TM', 'Globe PostPaid', 'Globe/GOMO', 'Globe'], 'Globe or TM')
network['Network'] = network['Network'].replace(['Smart', 'TNT'], "Smart or Talk 'N Text")
network['Network'] = network['Network'].replace(['Sun'], "Sun Cellular")

network['Network'].value_counts()

Globe or TM              36
Smart or Talk 'N Text    23
Sun Cellular             14
DITO                      8
Name: Network, dtype: int64

In [22]:
#keep the first three digits of the number
network['Prefix'] = network['Prefix'].astype(str).str[:3]

network

Unnamed: 0,Prefix,Network
0,817,Globe or TM
1,895,DITO
2,896,DITO
3,897,DITO
4,898,DITO
...,...,...
76,925,Globe or TM
77,925,Globe or TM
78,925,Globe or TM
79,925,Globe or TM


### **Reports**

#### **Data Preprocessing**

##### **Dropping**

In [23]:
report.head()

Unnamed: 0,Unnamed: 1,Number,Network (Auto-Generates),Type of Scam,Proofs,Knows your name?\nCheck if yes,Unnamed: 6,GRAPHS AND STUFF,Unnamed: 8,Unnamed: 9,...,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27
0,1,9103239417,,Work from home,,False,,,,,...,,,,,,,,,,
1,2,95348643,,Dating Scam,,False,,,,,...,,,,,,,,,,
2,3,931804865,,work,,False,,,,,...,,,,,,,,,,
3,4,981197529,,nanalo sa lotto,,False,,,,,...,,,,,,,,,,
4,5,981369614,,Abroad Opportunity kuno,,False,,,,,...,,,,,,,,,,


In [24]:
select = report.iloc[:, 0:6]
select.head()

Unnamed: 0,Unnamed: 1,Number,Network (Auto-Generates),Type of Scam,Proofs,Knows your name?\nCheck if yes
0,1,9103239417,,Work from home,,False
1,2,95348643,,Dating Scam,,False
2,3,931804865,,work,,False
3,4,981197529,,nanalo sa lotto,,False
4,5,981369614,,Abroad Opportunity kuno,,False


In [25]:
select = select.dropna(subset=['Number'], axis = 0)
select.tail()

Unnamed: 0,Unnamed: 1,Number,Network (Auto-Generates),Type of Scam,Proofs,Knows your name?\nCheck if yes
4885,4886,9207721859,Smart or Talk ‘N Text,JACKPOT CITY,"*Insert my name*., Just a minimum deposit, you...",True
4886,4887,9854472269,Smart or Talk ‘N Text,SBET,"*insert my name*., Experience SBET, STABLE SYS...",True
4887,4888,9811248577,Smart or Talk ‘N Text,JACKPOT CITY,"*Insert my name*., JACKPOT CITY has the best g...",True
4888,4889,9855665323,Smart or Talk ‘N Text,Dear VIP,"Why are you still waiting, DEAR VIP Get your P...",False
4890,4891,9264224386,Globe or TM,fake seller,,True


##### **Columns**

In [26]:
#rename all columns
select.columns = ['id','number', 'network', 'type', 'proof', 'name']


In [27]:
select.head()

Unnamed: 0,id,number,network,type,proof,name
0,1,9103239417,,Work from home,,False
1,2,95348643,,Dating Scam,,False
2,3,931804865,,work,,False
3,4,981197529,,nanalo sa lotto,,False
4,5,981369614,,Abroad Opportunity kuno,,False


In [28]:
#get the first 3 digits of the number
select['number'] = select['number'].astype(str).str[:3]
select['number']

0       910
1       953
2       931
3       981
4       981
       ... 
4885    920
4886    985
4887    981
4888    985
4890    926
Name: number, Length: 4883, dtype: object
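
Some reported numbers in the head of the dataset, such as `95348643`, have fewer digits than the others, so their first three digits may not be a real prefix. A minimal sketch of flagging such entries follows. It assumes that a mobile number written without the leading `0` has ten digits and starts with `9`; that rule is an assumption of this sketch, not something stated in the dataset.

```python
import pandas as pd

# Three numbers taken from the head of the report dataset.
numbers = pd.Series([9103239417, 95348643, 931804865]).astype(str)

# Assumed format: ten digits, starting with 9.
is_well_formed = numbers.str.fullmatch(r"9\d{9}")
print(is_well_formed.tolist())

# Extract the prefix only where the number looks complete;
# malformed entries become missing values.
prefixes = numbers.where(is_well_formed).str[:3]
print(prefixes.tolist())
```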

#### **Data Cleaning**

##### **Network Column**

In [29]:
# Create a dictionary mapping prefixes to networks from the second dataframe
prefix_network_map = dict(zip(network['Prefix'], network['Network']))

# Map each prefix to its network; where a prefix is not in the map, keep the existing network value

select['network'] = select['number'].map(prefix_network_map).fillna(select['network'])


# Print the updated first dataframe
select

Unnamed: 0,id,number,network,type,proof,name
0,1,910,Smart or Talk 'N Text,Work from home,,False
1,2,953,Globe or TM,Dating Scam,,False
2,3,931,Sun Cellular,work,,False
3,4,981,,nanalo sa lotto,,False
4,5,981,,Abroad Opportunity kuno,,False
...,...,...,...,...,...,...
4885,4886,920,Smart or Talk 'N Text,JACKPOT CITY,"*Insert my name*., Just a minimum deposit, you...",True
4886,4887,985,Smart or Talk ‘N Text,SBET,"*insert my name*., Experience SBET, STABLE SYS...",True
4887,4888,981,Smart or Talk ‘N Text,JACKPOT CITY,"*Insert my name*., JACKPOT CITY has the best g...",True
4888,4889,985,Smart or Talk ‘N Text,Dear VIP,"Why are you still waiting, DEAR VIP Get your P...",False


In [30]:
select['network'].value_counts()

Smart or Talk ‘N Text    2110
Globe or TM              1245
Smart or Talk 'N Text    1018
Smart                     289
Sun Cellular              131
DITO                       78
Name: network, dtype: int64

In [31]:
#making the Smart sim consistent
select['network'] = select['network'].replace(['Smart or Talk ‘N Text', 'Smart'], "Smart or Talk 'N Text")

#fill null values with "Unknown"
select['network'] = select['network'].fillna('Unknown')
select['network'].value_counts()

Smart or Talk 'N Text    3417
Globe or TM              1245
Sun Cellular              131
DITO                       78
Unknown                    12
Name: network, dtype: int64
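
The two `Smart or Talk 'N Text` labels differ only in the apostrophe character: one uses a curly quote (`‘`) and the other a straight one (`'`). Instead of listing each variant by hand, the quotes could be normalised before comparing labels. A small sketch on a toy series:

```python
import pandas as pd

# Toy labels: the first uses a curly left quote (U+2018).
labels = pd.Series([
    "Smart or Talk \u2018N Text",
    "Smart or Talk 'N Text",
    "Globe or TM",
])

# Replace curly single quotes (U+2018, U+2019) with a straight apostrophe.
normalised = labels.str.replace("[\u2018\u2019]", "'", regex=True)
print(normalised.value_counts())
```

This only handles the quote characters; merging `Smart` into the combined label still needs an explicit replacement as in the cell above.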

##### **Type of Scam column**

In [40]:
select.head()

Unnamed: 0,id,number,network,type,proof,name
0,1,910,Smart or Talk 'N Text,Work from home,,False
1,2,953,Globe or TM,Dating Scam,,False
2,3,931,Sun Cellular,work,,False
3,4,981,Unknown,nanalo sa lotto,,False
4,5,981,Unknown,Abroad Opportunity kuno,,False


In [46]:
#get the value counts less than 10 in the type column
select['type'].value_counts()[select['type'].value_counts() < 10]

Fake COVID 19 Cash Grant    9
Cash Support                9
T1bet7                      9
Paload                      9
Verification Code           9
                           ..
6.9m                        1
5.8m                        1
5000 php                    1
good offer                  1
fake seller                 1
Name: type, Length: 249, dtype: int64

In [52]:
#show top 50 value counts in the type column
select['type'].value_counts()[:50]

Online Games                     352
Not Specified                    322
Casino                           273
Online Casino                    211
Solar Lights                     206
Lazada Kuno                      150
Email                            149
Work                             146
Unclaimed Bonus                  110
Work from home                   109
Loan/Pautang                     108
Register and win                 106
Bank Scam                         99
Raffle                            97
Play to Win                       82
Job Offer                         72
Rewards                           64
Deposit scam                      63
Bonus                             57
Name                              55
BINGO                             51
Passive Income                    51
Funds                             48
Investment                        45
Gcash scam                        44
Claim B0nus                       41
Nanalo Sa Lotto                   40
C

In [53]:
select['type'].value_counts()[51:100]

Cashback                               18
BINGO PLUS                             18
Fake News                              17
Resgister and win                      17
Deposit Bonus                          17
Phone Call Scam                        16
T1BET                                  16
FastCashVIP                            15
Interview                              15
Magbukas ng account                    15
Relief/Assistance Fund                 15
URGENT                                 14
PAGCOR licensed                        14
Okbet                                  14
Free (luckyphil)                       13
Pindutin Para Kunin                    13
Salary Claim                           13
Dating Scam                            12
Blank Message                          12
Shoppee Kuno                           12
cash prize                             11
Cash In/Out                            10
Mega win                               10
Pacquiao Foundation               

In [47]:
#get all unique values in the type column
select['type'].unique()

array(['Work from home', 'Dating Scam', 'work', 'nanalo sa lotto',
       'Abroad Opportunity kuno', 'Fake News', 'bill ease', 'Raffle',
       'Bank Scam', 'Online Games', 'Lazada Kuno', 'Unclaimed Bonus',
       'Paload', 'Name', 'email', 'investment', 'Nanalo Sa Lotto',
       'Web Platform', 'Email', 'Work', 'Not Specified',
       'Relief/Assistance Fund', 'Investment', 'Passive Income',
       'Online Casino', 'No message text', 'Online sale', 'Casino winner',
       'Lazada kuno', 'salary claim', 'Globe', 'Casino', 'Pampapayat',
       'Rewards', 'Loan/Pautang', 'Cryptocurrency',
       'Pacquiao Foundation ', 'Legit Site daw', 'JILI GAmes',
       'Phone Discounts', 'POINTs', 'Product', 'Funds', 'Shoppee Kuno',
       'loan', 'casino', 'Vaccum Flask kuno', 'Wrong Number Scam',
       'Property Details', 'Netflix kuno', 'unclaimed Bonus', 'Bank scam',
       'Solar Lights', 'Can assist in cash essentials', 'Deposit scam',
       'Political', 'Gcash scam', 'PAGCOR licensed', 'Pro
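
The unique values show many labels that differ only in letter case or trailing whitespace (`Work` and `work`, `Lazada Kuno` and `Lazada kuno`, `Pacquiao Foundation ` with a trailing space). One possible first cleaning pass, sketched here on a toy series of labels taken from the output above, is to strip whitespace and lowercase everything before counting.

```python
import pandas as pd

# Labels copied from the unique values of the type column.
types = pd.Series([
    "Work", "work", "Casino", "casino", "Pacquiao Foundation ",
    "Lazada Kuno", "Lazada kuno", "Bank Scam", "Bank scam",
])

# Strip surrounding whitespace and lowercase so near-duplicates collapse.
cleaned = types.str.strip().str.lower()
counts = cleaned.value_counts()
print(counts)
```

This does not merge misspellings (such as `Resgister and win`) or synonyms (such as `Casino` and `Online Casino`); those would need a manual mapping.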

##### **Proof column**

In [32]:
select.head()

Unnamed: 0,id,number,network,type,proof,name
0,1,910,Smart or Talk 'N Text,Work from home,,False
1,2,953,Globe or TM,Dating Scam,,False
2,3,931,Sun Cellular,work,,False
3,4,981,Unknown,nanalo sa lotto,,False
4,5,981,Unknown,Abroad Opportunity kuno,,False


##### **Name column**

In [33]:
select['name'].value_counts()

False                                                                                                                          4669
True                                                                                                                            195
linky-ph/-BingoPlus                                                                                                               2
Atin/B.e.t/  is  B.e.s.t On/l.i.n.e Ca/si.n0 in the Philippines. http://okadaonline.pics/KsN G/e/t 80 % for every de/po/sit       2
Mentioned my name and the link. Ayaw nya daw ako maging mahirap HAHHAHA                                                           1
98 Games; Cash in                                                                                                                 1
http://gcxvd2ny.com                                                                                                               1
mentioned my name                                                           

In [34]:
select['name'].value_counts().sum()

4882

In [35]:
df = select.copy()

In [36]:
df['name'] = df['name'].apply(lambda x: True if isinstance(x, str) and re.search(r'\bname\b', x) else x)

In [37]:
df['name'].value_counts()

False                                                                                                                          4669
True                                                                                                                            199
linky-ph/-BingoPlus                                                                                                               2
Atin/B.e.t/  is  B.e.s.t On/l.i.n.e Ca/si.n0 in the Philippines. http://okadaonline.pics/KsN G/e/t 80 % for every de/po/sit       2
98 Games; Cash in                                                                                                                 1
http://gcxvd2ny.com                                                                                                               1
https://linnki.in/WNgzX                                                                                                           1
Mybitglobal (https://t.co/9GDnInpR0M                                        

In [38]:
df['name'] = df['name'].astype(str)

# Replace non-Boolean values with False
df.loc[~df['name'].str.lower().isin(['true', 'false']), 'name'] = 'False'

df['name'].value_counts()

False    4684
True      199
Name: name, dtype: int64
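
At this point the `name` column holds the strings `'True'` and `'False'` rather than real booleans. If a boolean dtype is wanted for later analysis, one option is an explicit mapping, sketched here on a toy series with hypothetical variable names.

```python
import pandas as pd

# Toy stand-in for df['name'] after the cleaning step above.
name_flags = pd.Series(["False", "True", "False", "True"])

# Map the strings to booleans. Any unexpected value would become NaN,
# so the mapping should only be applied after the cleaning step.
mapped = name_flags.map({"True": True, "False": False})

# Safe here because every value was either 'True' or 'False'.
as_bool = mapped.astype(bool)
print(as_bool.dtype)

# With booleans, the share of messages that used a name is just the mean.
share = as_bool.mean()
print(share)
```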

#### **Feature Selection**

### **Spam**

#### **Data Preprocessing**

#### **Data Cleaning**

#### **Feature Selection**

### **Incidents**

#### **Data Preprocessing**

#### **Data Cleaning**

#### **Feature Selection**

## **Saving Dataframes as CSVs**

In [39]:
#.to_csv('.csv')
