# **Philippine Scam SMS**
**Phase 3: Data Visualization**

**Author/s: [Anton Reyes](https://www.github.com/AGR-yes)**

## **Introduction**

### **Requirements and Imports**

#### **Imports**

**Basic Libraries**

* `numpy` contains a large collection of mathematical functions
* `pandas` contains functions that are designed for data manipulation and data analysis

In [1]:
import numpy as np
import pandas as pd

**Visualization Libraries**

* `matplotlib.pyplot` contains functions to create static, animated, and interactive plots
* `seaborn` is a library built on matplotlib that provides a higher-level interface for statistical data visualization
* `plotly` is an open-source graphing library for Python that produces interactive charts

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

**Natural Language Processing Libraries**

* `re` is a module that allows the use of regular expressions

In [None]:
import re

#### **Datasets and Files**

The following `csv` files were used for this project:

- `incidents.csv` contains 2019 data on how many spam texts were received in each region of the Philippines.
- `proof_cleaned.csv` contains data from `select.csv` and `spam.csv` that has already been processed with Natural Language Processing methods.
- `select.csv` contains the necessary columns from Google Sheets.
- `spam.csv` contains the necessary columns from a Kaggle user's collection of spam texts that they received.
- `top100_words.csv` contains the 100 most common words from `proof_cleaned.csv`.

## **Data Collection**

The datasets are imported using pandas.

In [None]:
incidents = pd.read_csv("Processed Datasets/incidents.csv")
proof = pd.read_csv("Processed Datasets/proof_cleaned.csv")
select = pd.read_csv("Processed Datasets/select.csv")
spam = pd.read_csv("Processed Datasets/spam.csv")
top100 = pd.read_csv("Processed Datasets/top100_words.csv")

datasets = [incidents, proof, select, spam, top100]

In [None]:
for i in datasets:
    display(i.head())

## **Description of the Dataset**

Here, we find the shape (rows, columns) of each dataset.

In [None]:
#printing shape of each dataset from the list
for i in datasets:
    print(i.shape)
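
Printing bare shapes makes it hard to tell which tuple belongs to which dataset. A minimal sketch of one way to label them is shown here; the toy dataframes and the `named_datasets` dictionary are hypothetical stand-ins, not the project's data.

```python
import pandas as pd

# Toy stand-ins for the real datasets (hypothetical data, for illustration only)
named_datasets = {
    "incidents": pd.DataFrame({"region": ["NCR", "CAR"], "count": [10, 3]}),
    "top100": pd.DataFrame({"word": ["free", "win", "claim"], "count": [5, 4, 2]}),
}

# Map each dataset name to its (rows, columns) shape
shapes = {name: df.shape for name, df in named_datasets.items()}

for name, (rows, cols) in shapes.items():
    print(f"{name}: {rows} rows x {cols} columns")
```

Keeping the names next to the dataframes in a dictionary also makes later loops easier to read than a plain list.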

Calling `info` on each dataframe shows the dtype of every column and how many `non-null` values it holds.

In [None]:
for i in datasets:
    #info() prints its summary directly and returns None, so display() is not needed
    i.info()
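
If the goal is to check for missing values, the counts can also be collected into a small table instead of being read off the `info` printout. The sketch below assumes a hypothetical helper, `missing_summary`, and a toy dataframe; neither is part of the original notebook.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype and missing-value count of every column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
    })

# Toy data with one missing text (hypothetical, for illustration only)
toy = pd.DataFrame({"text": ["hi", None, "promo"], "type": ["a", "b", "c"]})
summary = missing_summary(toy)
print(summary)
```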

## **Charts**

### **Type & Name**

In [None]:
proof.head()

#### **Type**

In [None]:
proof_type = pd.DataFrame(proof['type'].value_counts()).reset_index()
proof_type

In [None]:
#change column name by column index
proof_type.rename(columns = {proof_type.columns[0]:'type', proof_type.columns[1]:'count'}, inplace = True)
proof_type
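
The count-then-rename pattern above recurs throughout this notebook. One possible way to wrap it is sketched here; `count_values` is a hypothetical helper name and the toy dataframe is made up. It names the columns explicitly, so it should not depend on how a given pandas version labels the output of `value_counts`.

```python
import pandas as pd

def count_values(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Count occurrences of each value in `column`, most frequent first."""
    counts = df[column].value_counts()
    # Name the index after the column and the values 'count' explicitly
    return counts.rename_axis(column).reset_index(name="count")

# Toy data (hypothetical, for illustration only)
toy = pd.DataFrame({"type": ["loan", "loan", "prize"]})
result = count_values(toy, "type")
print(result)
```

With such a helper, a cell like the one above could be reduced to `proof_type = count_values(proof, 'type')`.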

In [None]:
#using plotly, make a pie chart
fig = px.pie(proof_type, values = 'count', names = 'type', title = 'Proof Type')
fig.show()

#### **Name**

In [None]:
proof_name = pd.DataFrame(proof['name'].value_counts()).reset_index()
proof_name

In [None]:
#change column name by column index
proof_name.rename(columns = {proof_name.columns[0]:'type', proof_name.columns[1]:'count'}, inplace = True)
proof_name

In [None]:
#changing False to "No name", True to "Includes name"
proof_name['type'] = proof_name['type'].replace([False, True], ['No name', 'Includes name'])

In [None]:
fig = px.pie(proof_name, values = 'count', names = 'type', title = 'Proof Name')
fig.show()

#### **Types for Texts With Names**

In [None]:
nametype = pd.DataFrame(proof[['type']][proof['name'] == True].value_counts()).reset_index()
#change column name by column index
nametype.rename(columns = {nametype.columns[0]:'type', nametype.columns[1]:'count'}, inplace = True)
nametype

In [None]:
fig = px.pie(nametype, values = 'count', names = 'type', title = 'Proof Name Type')
fig.show()
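
The filter above keeps only the texts that include a name. As an alternative view, a cross-tabulation shows the type counts for both groups at once. This is a sketch on made-up data and a swapped-in technique (`pd.crosstab`), not the approach used in the notebook.

```python
import pandas as pd

# Toy data (hypothetical, for illustration only)
toy = pd.DataFrame({
    "type": ["loan", "prize", "loan", "job"],
    "name": [True, False, True, False],
})

# Replace the booleans with readable labels before tabulating
name_label = toy["name"].map({True: "Includes name", False: "No name"})

# Rows are scam types, columns are the two name groups
table = pd.crosstab(toy["type"], name_label)
print(table)
```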

### **Number**

In [None]:
select.head()

In [None]:
number = pd.DataFrame(select['network'].value_counts()).reset_index()
#change column name by column index
number.rename(columns = {number.columns[0]:'network', number.columns[1]:'count'}, inplace = True)
number

In [None]:
fig = px.bar(number, x = 'network', y = 'count', title = 'Number of Incidents by Network')
fig.show()

### **Top 100 Words**

In [None]:
top100.head()

In [None]:
#plot the top 20 words
fig = px.bar(top100[:20], x = 'word', y = 'count', title = 'Top 20 Words')
fig.show()
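
The word counts in `top100_words.csv` were produced in an earlier phase. For readers who want to see the general idea, here is a rough sketch of counting words with `re` and `collections.Counter`. The tokenizer (lowercasing and keeping runs of letters) is an assumption and very likely simpler than the processing actually used; `top_words` and the toy messages are hypothetical.

```python
import re
from collections import Counter

def top_words(messages, n=20):
    """Return the n most common words across all messages."""
    counter = Counter()
    for message in messages:
        # Assumed tokenizer: lowercase, then keep runs of letters and apostrophes
        counter.update(re.findall(r"[a-z']+", message.lower()))
    return counter.most_common(n)

# Toy messages (hypothetical, for illustration only)
toy_messages = ["FREE prize, free load!", "Claim free prize"]
top = top_words(toy_messages, n=2)
print(top)
```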

### **Peak Time of Texts**

In [None]:
spam.head()

#### **Date**

In [None]:
#get the day of the week in the Date column
spam['day'] = pd.to_datetime(spam['Date']).dt.day_name()
spam.head()

In [None]:
spam_day = pd.DataFrame(spam['day'].value_counts()).reset_index()

#change column name by column index
spam_day.rename(columns = {spam_day.columns[0]:'day', spam_day.columns[1]:'count'}, inplace = True)
spam_day

In [None]:
#plotting the day column as a pie
fig = px.pie(spam_day, values = 'count', names = 'day', title = 'Day of the Week')
fig.show()
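
`value_counts` orders the days by frequency, not by their position in the week. If a calendar order is preferred, the counts can be reindexed, as in this sketch on made-up dates. `DAY_ORDER` is a hypothetical constant introduced for the example.

```python
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Toy dates (hypothetical): two Mondays, one Tuesday, one Sunday
toy = pd.DataFrame({"Date": ["2023-01-02", "2023-01-03", "2023-01-02", "2023-01-08"]})
toy["day"] = pd.to_datetime(toy["Date"]).dt.day_name()

day_counts = (
    toy["day"]
    .value_counts()
    .reindex(DAY_ORDER, fill_value=0)  # calendar order; days with no texts get 0
    .rename_axis("day")
    .reset_index(name="count")
)
print(day_counts)
```

An ordered table like this may read more naturally as a bar chart than as a pie.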

#### **Time**

In [None]:
spam['Time'].describe()

In [None]:
#convert the Time column to datetime
spam['Time'] = pd.to_datetime(spam['Time'])
spam['Time'] = spam['Time'].dt.time

#truncate the time to the start of its hour (minutes and seconds are dropped)
spam['time_of_day'] = spam['Time'].apply(lambda dt: dt.replace(minute=0, second=0, microsecond=0))

spam['Time'].describe()
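
Note that `replace(minute=0, second=0, microsecond=0)` drops the minutes, so 13:47 becomes 13:00 and not 14:00. The sketch below contrasts that truncation with rounding to the nearest hour, using only the standard library. Both function names are hypothetical.

```python
from datetime import time

def truncate_to_hour(t: time) -> time:
    """Drop minutes and seconds, e.g. 13:47:05 -> 13:00:00."""
    return t.replace(minute=0, second=0, microsecond=0)

def round_to_nearest_hour(t: time) -> time:
    """Round half up to the nearest hour, wrapping past midnight."""
    hour = (t.hour + (1 if t.minute >= 30 else 0)) % 24
    return time(hour=hour)

sample = time(13, 47, 5)
truncated = truncate_to_hour(sample)
rounded = round_to_nearest_hour(sample)
print(truncated, rounded)
```

Truncation is probably the better fit for an "hour of day" bucket, since every text then lands in the hour in which it was actually received.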

In [None]:
#new dataframe with the time of day and the count of each time
spam_time = pd.DataFrame(spam['time_of_day'].value_counts()).reset_index()

spam_time.rename(columns = {spam_time.columns[0]:'time_of_day', spam_time.columns[1]:'count'}, inplace = True)

#order the time of day column by time 
spam_time = spam_time.sort_values(by = 'time_of_day')

spam_time

In [None]:
#plot the time of day column as a time series
fig = px.line(spam_time, x = 'time_of_day', y = 'count', title = 'Time of Day')
fig.show()
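
Hours in which no texts were received do not appear in `value_counts`, so a line chart would connect the neighbouring hours directly and hide the gap. One possible fix, sketched here on made-up counts, is to reindex over all 24 hours and fill the missing ones with zero.

```python
from datetime import time

import pandas as pd

# Toy hourly counts (hypothetical, for illustration only)
toy_time = pd.DataFrame({
    "time_of_day": [time(8), time(9), time(13)],
    "count": [4, 7, 2],
})

all_hours = [time(hour) for hour in range(24)]

full_day = (
    toy_time.set_index("time_of_day")["count"]
    .reindex(all_hours, fill_value=0)  # hours without texts get a count of 0
    .rename_axis("time_of_day")
    .reset_index()
)

# String labels such as '08:00' are often easier to plot than time objects
full_day["hour"] = full_day["time_of_day"].map(lambda t: t.strftime("%H:%M"))
print(full_day.head(10))
```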

### **Incidents per Region**

In [None]:
incidents

In [None]:
#forward slashes avoid the invalid "\m" escape sequence and work on every OS
px.set_mapbox_access_token(open("Supplemental Files/mapbox_token.txt").read())
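
Token files often end with a newline, and an unclosed `open()` call leaves the file handle to the garbage collector. A small sketch of a more defensive read is given below; `read_token` is a hypothetical helper and the token value is a placeholder written to a temporary file so that the example runs on its own.

```python
import tempfile
from pathlib import Path

def read_token(path) -> str:
    """Read a token file and strip surrounding whitespace and newlines."""
    return Path(path).read_text(encoding="utf-8").strip()

# Demonstrate with a temporary file holding a placeholder token
with tempfile.TemporaryDirectory() as tmp:
    token_file = Path(tmp) / "mapbox_token.txt"
    token_file.write_text("pk.example-token\n", encoding="utf-8")
    token = read_token(token_file)

print(token)
```

In the notebook, the result could then be passed to `px.set_mapbox_access_token`.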

In [None]:
#plot incidents as philippines map
#fig = px.scatter_geo(incidents, lat = 'latitude', lon = 'longitude', 
#                     color = 'network', hover_name = 'network', 
#                     size = 'count', projection = 'natural earth', 
#                     title = 'Incidents in the Philippines')
#fig.show()

In [None]:
df = pd.read_json("Supplemental Files/Regions.json")
df

In [None]:
import json

In [None]:
with open('Supplemental Files/Regions.json') as file:
    data = json.load(file)

# Extract the necessary information from the JSON data
features = data['features']
region_names = [feature['properties']['REGION'] for feature in features]
region_geometry = [feature['geometry'] for feature in features]

# Create a DataFrame
df = pd.DataFrame({'name': region_names, 'geometry': region_geometry})

# Create the choropleth map, matching each row's 'name' to the feature's REGION property
fig = px.choropleth(df, geojson=data, locations='name',
                    featureidkey='properties.REGION',
                    color='name',
                    color_discrete_sequence=px.colors.qualitative.Plotly,
                    labels={'name': 'Region'},
                    title='Regions in the Philippines')
fig.update_geos(fitbounds="locations", visible=False)  # Fit map to the region bounds
fig.show()
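
Plotly Express links data rows to GeoJSON shapes by comparing each `locations` value with a feature identifier, and the `featureidkey` argument can point that identifier at a property such as `properties.REGION`. The plain-Python sketch below only illustrates the matching idea; it is not plotly's implementation, and the function names and toy GeoJSON are hypothetical.

```python
def resolve_key(feature: dict, dotted_key: str):
    """Follow a dotted path such as 'properties.REGION' into a feature."""
    value = feature
    for part in dotted_key.split("."):
        value = value[part]
    return value

def match_rows_to_features(locations, geojson: dict, featureidkey: str) -> dict:
    """Map each location value to its matching feature, or None if absent."""
    by_id = {resolve_key(f, featureidkey): f for f in geojson["features"]}
    return {loc: by_id.get(loc) for loc in locations}

# Toy FeatureCollection with two regions (hypothetical geometry)
square = {"type": "Polygon", "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]}
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"REGION": "Region A"}, "geometry": square},
        {"type": "Feature", "properties": {"REGION": "Region B"}, "geometry": square},
    ],
}

matches = match_rows_to_features(["Region A", "Region C"], toy_geojson, "properties.REGION")
print({loc: feature is not None for loc, feature in matches.items()})
```

A location with no matching feature, like "Region C" here, is typically why a region fails to appear on a choropleth, so checking the spelling of region names on both sides is a sensible first debugging step.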

In [None]:
#geojson