# **Philippine Scam SMS**

**Author/s: [Anton Reyes](https://www.github.com/AGR-yes)**

## **Introduction**

### **Requirements and Imports**

#### **Imports**

**Basic Libraries**

* `numpy` contains a large collection of mathematical functions
* `pandas` contains functions that are designed for data manipulation and data analysis

In [None]:
import numpy as np
import pandas as pd

**Visualization Libraries**

* `matplotlib.pyplot` contains functions to create interactive plots
* `seaborn` is a library based on matplotlib that allows for data visualization
* `plotly` is an open-source graphing library for Python.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

**Natural Language Processing Libraries**
* `re` is a module that allows the use of regular expressions

In [None]:
import re

#### **Datasets and Files**

The following files were used for this project:

- `Scam_SMS_Reports.xlsx` contains user-submitted reports with the phone number, the type of scam, the proof, and whether the message included the recipient's name.
- `SPAM_SMS.csv` contains the text messages received by one person, with the number, the message text, and the date and time received.
- `text-scams-incidents-philippines-2019-by-region.xlsx` contains the number of scam messages (in thousands) received per region.
- `networks.csv` contains the first 4-5 digits of Philippine phone numbers, which identify the network a number belongs to.

## **Data Collection**

We import each dataset using pandas.

In [None]:
dataset = "Raw Datasets/Scam_SMS_Reports.xlsx"

report = pd.read_excel(dataset)
report.head()

In [None]:
dataset = "Raw Datasets/SPAM_SMS.csv"

spam = pd.read_csv(dataset)
spam.head()

In [None]:
dataset = "Raw Datasets/text-scams-incidents-philippines-2019-by-region.xlsx"

incidents = pd.read_excel(dataset, sheet_name="Data")
incidents.head()

## **Description of the Dataset**

Here, we find the shape (rows, columns) of each dataset.

In [None]:
sets = [report, spam, incidents]

# use a loop variable that does not shadow the built-in `set`
for frame in sets:
    print(frame.shape)

The `info` output of each dataframe shows that there are `null` values.

In [None]:
report.info()

In [None]:
spam.info()

In [None]:
incidents.info()
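
`info()` reports non-null counts, so the number of missing values has to be worked out by subtraction. A minimal sketch of counting them directly is shown here on a toy dataframe; the column names and values are made up for illustration and are not taken from the real files.

```python
import pandas as pd

# Toy frame standing in for `report`; all values are invented for illustration.
toy = pd.DataFrame({
    "Number": ["9171234567", None, "9181234567"],
    "Proofs": [None, None, "screenshot"],
})

# isna() marks missing cells; sum() counts them per column.
null_counts = toy.isna().sum()
print(null_counts)
```

The same call, for example `report.isna().sum()`, should work on any of the three dataframes.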

## **Exploratory Data Analysis**

### **Report**

In [None]:
report.columns

In [None]:
report['Network (Auto-Generates)'].value_counts()

In [None]:
report['Proofs'].value_counts()

In [None]:
report['Knows your name?\nCheck if yes'].value_counts()

In [None]:
report.iloc[:, 0:6].dtypes

In [None]:
# numbers that were reported more than twice
number_counts = report['Number'].value_counts()
number_counts[number_counts > 2]

### **Spam**

In [None]:
spam.columns

In [None]:
spam['date'].describe()

In [None]:
spam['date'].info()

## **Data Preprocessing**

### **Networks**

In [None]:
network = pd.read_csv("Raw Datasets/networks.csv")
network['Network'].value_counts()

The network labels are inconsistent (for example, there are several variants of `Globe`), so we map them to one label per network.

In [None]:
# replace the Globe variants (Globe/TM, Globe PostPaid, Globe/GOMO, Globe) with "Globe or TM"
network['Network'] = network['Network'].replace(['Globe/TM', 'Globe PostPaid', 'Globe/GOMO', 'Globe'], 'Globe or TM')
network['Network'] = network['Network'].replace(['Smart', 'TNT'], "Smart or Talk 'N Text")
network['Network'] = network['Network'].replace(['Sun'], "Sun Cellular")

network['Network'].value_counts()
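
The same label cleanup can also be written with a single dictionary, which keeps every variant and its canonical label in one place. This is a sketch on a toy Series, not the notebook's own method; the input values (including the unmapped `DITO` entry) are assumptions for illustration.

```python
import pandas as pd

# Toy labels standing in for network['Network']; values are illustrative.
labels = pd.Series(["Globe/TM", "Globe PostPaid", "Smart", "TNT", "Sun", "DITO"])

# One mapping from each raw variant to its canonical label.
canonical = {
    "Globe/TM": "Globe or TM",
    "Globe PostPaid": "Globe or TM",
    "Globe/GOMO": "Globe or TM",
    "Globe": "Globe or TM",
    "Smart": "Smart or Talk 'N Text",
    "TNT": "Smart or Talk 'N Text",
    "Sun": "Sun Cellular",
}

# replace() swaps exact matches and leaves unmapped labels untouched.
normalized = labels.replace(canonical)
print(normalized.value_counts())
```

Labels missing from the dictionary pass through unchanged, so a follow-up `value_counts()` makes any unhandled variant easy to spot.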

In [None]:
#keep the first three digits of the number
network['Prefix'] = network['Prefix'].astype(str).str[:3]

network

### **Reports**

#### **Data Preprocessing**

##### **Dropping**

In [None]:
report.head()

In [None]:
select = report.iloc[:, 0:6]
select.head()

In [None]:
select = select.dropna(subset=['Number'], axis = 0)
select.tail()

##### **Columns**

In [None]:
#rename all columns
select.columns = ['id','number', 'network', 'type', 'proof', 'name']


In [None]:
select.head()

In [None]:
# keep the first 3 digits of the number so it matches the 3-digit prefixes in `network`
select['number'] = select['number'].astype(str).str[:3]
select['number']

#### **Data Cleaning**

##### **Network Column**

In [None]:
# Create a dictionary mapping prefixes to networks from the `network` dataframe
prefix_network_map = dict(zip(network['Prefix'], network['Network']))

# Alternative (not used): label numbers without a matching prefix as "unknown"
#select['Network'] = select['Number'].map(prefix_network_map).fillna('unknown')

# Map each prefix to its network; where the prefix has no match, keep the reported network
select['network'] = select['number'].map(prefix_network_map).fillna(select['network'])


# Display the updated dataframe
select
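
To make the lookup-then-fallback behavior concrete, here is a small self-contained sketch of the same `map` and `fillna` pattern. The two toy dataframes and their values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the `network` and `select` dataframes.
network_toy = pd.DataFrame({
    "Prefix": ["917", "918"],
    "Network": ["Globe or TM", "Smart or Talk 'N Text"],
})
reports_toy = pd.DataFrame({
    "number": ["917", "918", "999"],
    "network": [None, "Smart", "Reported label"],
})

# Prefix -> network lookup table.
lookup = dict(zip(network_toy["Prefix"], network_toy["Network"]))

# map() returns NaN for prefixes missing from the lookup;
# fillna() then falls back to the value already in the column.
reports_toy["network"] = (
    reports_toy["number"].map(lookup).fillna(reports_toy["network"])
)
print(reports_toy)
```

A matched prefix overrides whatever was reported (row 2 changes from `Smart` to the canonical label), while the unmatched prefix `999` keeps its reported label.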

In [None]:
select['network'].value_counts()

In [None]:
# make the Smart labels consistent (including the curly-apostrophe variant)
select['network'] = select['network'].replace(['Smart or Talk ‘N Text', 'Smart'], "Smart or Talk 'N Text")

#fill null values with "Unknown"
select['network'] = select['network'].fillna('Unknown')
select['network'].value_counts()

##### **Proof column**

In [None]:
select.head()

##### **Name column**

In [None]:
select['name'].value_counts()

In [None]:
select['name'].value_counts().sum()

In [None]:
df = select.copy()

In [None]:
df['name'] = df['name'].apply(lambda x: True if isinstance(x, str) and re.search(r'\bname\b', x) else x)

In [None]:
df['name'].value_counts()

In [None]:
df['name'] = df['name'].astype(str)

# Replace non-Boolean values with False
df.loc[~df['name'].str.lower().isin(['true', 'false']), 'name'] = 'False'

df['name'].value_counts()
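
The two steps above end with the strings `'True'` and `'False'`. As a variation, the sketch below produces real booleans in a single pass. The helper `mentions_name` is hypothetical, and it assumes each cell is a boolean, free text, or missing.

```python
import re
import pandas as pd

def mentions_name(value):
    """Hypothetical helper: True for a True cell or text containing the whole word 'name'."""
    if isinstance(value, bool):
        return value
    return isinstance(value, str) and re.search(r"\bname\b", value) is not None

# Toy column standing in for df['name']; values are illustrative.
raw = pd.Series([True, "full name", "nickname only", None, False])

# apply() runs the helper on every cell and returns a boolean Series.
flags = raw.apply(mentions_name)
print(flags.value_counts())
```

Because of the `\b` word boundaries, `"nickname only"` does not count as a match, which mirrors the regular expression used in the cell above.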

#### **Feature Selection**

### **Spam**

#### **Data Preprocessing**

#### **Data Cleaning**

#### **Feature Selection**

### **Incidents**

#### **Data Preprocessing**

#### **Data Cleaning**

#### **Feature Selection**

## **Saving Dataframes as CSVs**

In [None]:
#.to_csv('.csv')
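
A minimal sketch of the saving step is shown below on a toy dataframe. The output folder and the file name `reports_cleaned.csv` are hypothetical; the notebook does not specify where the cleaned data should go.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for a cleaned dataframe such as `df`.
cleaned = pd.DataFrame({
    "number": ["917"],
    "network": ["Globe or TM"],
    "name": ["True"],
})

out_dir = Path(tempfile.mkdtemp())          # hypothetical output folder
out_path = out_dir / "reports_cleaned.csv"  # hypothetical file name

# index=False keeps the row index out of the file.
cleaned.to_csv(out_path, index=False)

# dtype=str stops pandas from turning prefixes like "917" into integers on reload.
restored = pd.read_csv(out_path, dtype=str)
print(restored)
```

Reading the file back with `dtype=str` is a design choice for this sketch: it keeps number prefixes as text, so any leading zeros would survive the round trip.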
