# **Philippine Scam SMS**
**Phase 3: Data Visualization**

**Author/s: [Anton Reyes](https://www.github.com/AGR-yes)**

## **Introduction**

### **Requirements and Imports**

#### **Imports**

**Basic Libraries**

* `numpy` contains a large collection of mathematical functions
* `pandas` contains functions that are designed for data manipulation and data analysis

In [34]:
import numpy as np
import pandas as pd

**Visualization Libraries**

* `matplotlib.pyplot` contains functions for creating static and interactive plots
* `seaborn` is a library built on matplotlib that provides a high-level interface for statistical graphics
* `plotly` is an open-source graphing library for interactive charts in Python

In [57]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

**Natural Language Processing Libraries**

* `re` is a module that allows the use of regular expressions

In [36]:
import re

#### **Datasets and Files**

The following `csv` files were used for this project:

- `incidents.csv` contains 2019 data on how many spam texts were received in each region of the Philippines.
- `proof_cleaned.csv` contains data from `select.csv` and `spam.csv` that has already been processed with Natural Language Processing methods.
- `select.csv` contains the necessary columns from Google Sheets.
- `spam.csv` contains the necessary columns from a Kaggle user's collection of spam texts they received.
- `top100_words.csv` contains the 100 most common words in `proof_cleaned.csv`.

## **Data Collection**

Importing the datasets using pandas.

In [37]:
incidents = pd.read_csv("Processed Datasets/incidents.csv")
proof = pd.read_csv("Processed Datasets/proof_cleaned.csv")
select = pd.read_csv("Processed Datasets/select.csv")
spam = pd.read_csv("Processed Datasets/spam.csv")
top100 = pd.read_csv("Processed Datasets/top100_words.csv")

# Collect the DataFrames in one list so they can be inspected in a loop
datasets = [incidents, proof, select, spam, top100]

In [38]:
for i in datasets:
    display(i.head())

Unnamed: 0,region,number
0,BARMM,390.48
1,CAR,112.6
2,CARAGA,500.0
3,NCR,2739.52
4,Region 1,113.3


Unnamed: 0,proof,name,type,token
0,poea,False,others,['POEA']
1,poea,False,others,['POEA']
2,federal partylist,False,others,"['Federal', 'Partylist']"
3,build build build,False,others,"['Build', 'Build', 'Build']"
4,luckyphilcomlogin,False,casino/gambling,['luckyphilcomlogin']


Unnamed: 0,id,number,network,type,proof,name,indicator
0,1,9103239417,Unknown,work,,False,910
1,2,95348643,Unknown,others,,False,953
2,3,931804865,Unknown,work,,False,931
3,4,981197529,Unknown,lotto,,False,981
4,5,981369614,Unknown,work,,False,981


Unnamed: 0,proof,Date,Time,name
0,"Welcome ! your have P1222 for S!ot , \nWeb: 11...",2022-11-12,14:02,False
1,"My god, at least 999P rewards waiting for you\...",2022-11-12,14:33,False
2,"DEAR VIP , No. 1 Online Sabong Site here in SB...",2022-11-13,23:03,True
3,"! Today, you can win the iphone14PROMAX while ...",2022-11-14,00:07,True
4,"Welcome ! your have P1222 for S!ot , \nWeb: gr...",2022-11-15,02:28,False


Unnamed: 0,word,count
0,now,157
1,bonus,122
2,message,68
3,LIGHTS,67
4,3,66


## **Description of the Dataset**

Here, we find the shape (rows, columns) of each dataset, in the same order as the `datasets` list.

In [39]:
#printing shape of each dataset from the list
for i in datasets:
    print(i.shape)

(17, 2)
(1414, 4)
(4883, 7)
(159, 4)
(100, 2)
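
The shapes above are printed without labels, so they have to be matched to the list order by eye. A minimal sketch of one way to label them, assuming the DataFrames are kept in a dictionary instead of a list; the small frames below are hypothetical stand-ins, not the real files:

```python
import pandas as pd

# Hypothetical miniature stand-ins for two of the loaded DataFrames.
datasets_by_name = {
    "incidents": pd.DataFrame({"region": ["BARMM", "CAR"], "number": [390.48, 112.6]}),
    "top100": pd.DataFrame({"word": ["now", "bonus", "message"], "count": [157, 122, 68]}),
}

# Build one summary row per dataset so each shape is labeled by name.
summary = pd.DataFrame(
    [
        {"dataset": name, "rows": df.shape[0], "columns": df.shape[1]}
        for name, df in datasets_by_name.items()
    ]
)
print(summary)
```

In the notebook itself, the dictionary values would be the `incidents`, `proof`, `select`, `spam`, and `top100` DataFrames.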


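The `info` output lists the non-null count and dtype of every column. Comparing those counts with the number of entries shows that `proof` and `type` have missing values in both `proof_cleaned.csv` and `select.csv`, while `incidents.csv`, `spam.csv`, and `top100_words.csv` have none.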

In [40]:
for i in datasets:
    display(i.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   region  17 non-null     object 
 1   number  17 non-null     float64
dtypes: float64(1), object(1)
memory usage: 400.0+ bytes


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1414 entries, 0 to 1413
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   proof   1404 non-null   object
 1   name    1414 non-null   bool  
 2   type    1228 non-null   object
 3   token   1414 non-null   object
dtypes: bool(1), object(3)
memory usage: 34.6+ KB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4883 entries, 0 to 4882
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         4883 non-null   int64 
 1   number     4883 non-null   object
 2   network    4883 non-null   object
 3   type       4679 non-null   object
 4   proof      1255 non-null   object
 5   name       4883 non-null   bool  
 6   indicator  4883 non-null   object
dtypes: bool(1), int64(1), object(5)
memory usage: 233.8+ KB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   proof   159 non-null    object
 1   Date    159 non-null    object
 2   Time    159 non-null    object
 3   name    159 non-null    bool  
dtypes: bool(1), object(3)
memory usage: 4.0+ KB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   word    100 non-null    object
 1   count   100 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 1.7+ KB


None
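
Reading missing values off the `info` output means subtracting each non-null count from the number of entries. A short sketch of a more direct check with `isna()`, using a hypothetical three-row stand-in for `proof`:

```python
import pandas as pd

# Hypothetical stand-in for `proof`; the notebook would use the loaded DataFrame.
sample = pd.DataFrame(
    {
        "proof": ["poea", None, "build build build"],
        "name": [False, False, True],
        "type": ["others", None, None],
    }
)

# Count missing values per column and express them as a share of all rows.
missing = sample.isna().sum()
missing_share = (missing / len(sample)).round(2)

print(missing.to_dict())
print(missing_share.to_dict())
```

Applied to each DataFrame in `datasets`, the same two lines would give the missing counts without any manual subtraction.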

## **Charts**

### **Type & Name**

In [41]:
proof.head()

Unnamed: 0,proof,name,type,token
0,poea,False,others,['POEA']
1,poea,False,others,['POEA']
2,federal partylist,False,others,"['Federal', 'Partylist']"
3,build build build,False,others,"['Build', 'Build', 'Build']"
4,luckyphilcomlogin,False,casino/gambling,['luckyphilcomlogin']


#### **Type**

In [50]:
proof_type = pd.DataFrame(proof['type'].value_counts()).reset_index()
proof_type

Unnamed: 0,index,type
0,others,341
1,casino/gambling,304
2,online activity,209
3,bank/money,186
4,free,96
5,work,92


In [54]:
#change column name by column index
proof_type.rename(columns = {proof_type.columns[0]:'type', proof_type.columns[1]:'count'}, inplace = True)
proof_type

Unnamed: 0,type,count
0,others,341
1,casino/gambling,304
2,online activity,209
3,bank/money,186
4,free,96
5,work,92
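
The count-then-rename steps above can also be written as a single chain. This is a sketch of an alternative, not the notebook's method: `rename_axis` names the index and `reset_index(name=...)` names the count column, so the result does not depend on the default column names that `value_counts` produces. The sample series is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for proof['type'].
types = pd.Series(["others", "others", "work", "free", "others", "work"], name="type")

# Count each category, then name both output columns explicitly.
type_counts = (
    types.value_counts()
    .rename_axis("type")
    .reset_index(name="count")
)
print(type_counts)
```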


In [61]:
# Pie chart of message types using plotly express
fig = px.pie(proof_type, values = 'count', names = 'type', title = 'Proof Type')
fig.show()
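
`value_counts` skips missing values by default, so the chart covers only the 1,228 rows of `proof` that have a `type`, not all 1,414. If the unlabeled rows should appear as their own slice, one option is to fill them before counting. A minimal sketch with a hypothetical series; the label `"unlabeled"` is an assumption chosen for illustration:

```python
import pandas as pd

# Hypothetical stand-in for proof['type'], including missing labels.
types = pd.Series(["others", None, "work", "others", None], name="type")

# Replace missing labels with an explicit category, then count.
counts_with_missing = types.fillna("unlabeled").value_counts()
print(counts_with_missing.to_dict())
```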

#### **Name**

In [62]:
proof_name = pd.DataFrame(proof['name'].value_counts()).reset_index()
proof_name

Unnamed: 0,index,name
0,False,1204
1,True,210


In [63]:
#change column name by column index
proof_name.rename(columns = {proof_name.columns[0]:'type', proof_name.columns[1]:'count'}, inplace = True)
proof_name

Unnamed: 0,type,count
0,False,1204
1,True,210


In [64]:
#changing False to "No name", True to "Includes name"
proof_name['type'] = proof_name['type'].replace([False, True], ['No name', 'Includes name'])

In [69]:
fig = px.pie(proof_name, values = 'count', names = 'type', title = 'Proof Name')
fig.show()

#### **Types for Texts With Names**

In [83]:
nametype = pd.DataFrame(proof[['type']][proof['name'] == True].value_counts()).reset_index()
#change column name by column index
nametype.rename(columns = {nametype.columns[0]:'type', nametype.columns[1]:'count'}, inplace = True)
nametype

Unnamed: 0,type,count
0,others,75
1,casino/gambling,24
2,free,17
3,online activity,13
4,bank/money,10


In [84]:
fig = px.pie(nametype, values = 'count', names = 'type', title = 'Proof Name Type')
fig.show()
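
The chart above shows how named texts are distributed across types, but not how common names are within each type. A sketch of one way to see both at once with `pd.crosstab`, using a hypothetical five-row stand-in for `proof`:

```python
import pandas as pd

# Hypothetical stand-in for the `type` and `name` columns of `proof`.
sample = pd.DataFrame(
    {
        "type": ["others", "others", "casino/gambling", "casino/gambling", "free"],
        "name": [True, False, False, False, True],
    }
)

# Rows are message types, columns are whether the text includes a name.
table = pd.crosstab(sample["type"], sample["name"])

# Share of texts with and without a name inside each type (each row sums to 1).
share = pd.crosstab(sample["type"], sample["name"], normalize="index")

print(table)
print(share)
```

Note that `crosstab` also leaves out rows whose `type` is missing, so named texts without a type label would not be counted here either.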

### **Number**

### **Words**

#### **Top 100 Words**

#### **WordCloud**

### **Peak Time of Texts**

### **Incidents per Region**