# Data visualization

---

This data set contains no categorical features, so we can plot everything directly as scatter plots, violin plots, and so on.

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px

## Import the data

In [8]:
df = pd.read_csv("spambase_prep_norm.csv", sep=";", index_col="Unnamed: 0")
df

Unnamed: 0,make,address,all,3d,our,over,remove,internet,order,mail,...,;,(,[,!,$,#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,spa
0,0.000000,0.195719,0.281938,0.0,0.108844,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0,0.295705,0.000000,0.000000,0.063038,0.089820,0.090939,1.0
1,0.166667,0.085627,0.220264,0.0,0.047619,0.231405,0.107143,0.043210,0.000000,0.354717,...,0.000000,0.128155,0.0,0.141391,0.201342,0.062910,0.094099,0.149701,0.338955,1.0
2,0.047619,0.000000,0.312775,0.0,0.418367,0.157025,0.096939,0.074074,0.481203,0.094340,...,0.015504,0.138835,0.0,0.104903,0.205817,0.013106,0.201761,0.724551,0.746032,1.0
3,0.000000,0.000000,0.000000,0.0,0.214286,0.000000,0.158163,0.388889,0.233083,0.237736,...,0.000000,0.133010,0.0,0.052071,0.000000,0.000000,0.058028,0.058383,0.062169,1.0
4,0.000000,0.000000,0.000000,0.0,0.214286,0.000000,0.158163,0.388889,0.233083,0.237736,...,0.000000,0.131068,0.0,0.051311,0.000000,0.000000,0.058028,0.058383,0.062169,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4596,0.246032,0.000000,0.273128,0.0,0.000000,0.256198,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.225243,0.0,0.000000,0.000000,0.000000,0.003248,0.002994,0.028108,0.0
4597,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0,0.134170,0.000000,0.000000,0.012694,0.004491,0.003638,0.0
4598,0.238095,0.000000,0.132159,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.158140,0.697087,0.0,0.000000,0.000000,0.000000,0.009241,0.007485,0.038029,0.0
4599,0.761905,0.000000,0.000000,0.0,0.108844,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.055340,0.0,0.000000,0.000000,0.000000,0.003362,0.005988,0.024802,0.0


## Scatter plots

To do in the Streamlit app: let the user choose which variables to plot, in 2D or 3D.

In [10]:
fig1 = px.scatter(df, 
                    x = "make",
                    y = "address")
fig1

A roughly linear trend is visible, although the overall relationship is more complex.

Here are two more figures, this time colored by $spa$.

In [16]:
fig2 = px.scatter(df, 
                    x = "make",
                    y = "address",
                    color="spa",
                    title="address as a function of make, colored by spa")
fig2.show()

fig3 = px.scatter(df, 
                    x = "all",
                    y = "our",
                    color="spa",
                    title="our as a function of all, colored by spa")
fig3.show()

## Correlation

---

In [25]:
corr = df.corr()

In [39]:
fig4 = px.imshow(  corr,
            title="Correlation plot")
fig4.update_layout(
    margin=dict(l=5, r=0, t=30, b=0),
)

We can observe some high positive correlations around the center of the matrix. Interestingly, there appears to be no strongly negative correlation.

Let's sum the correlations per variable and sort the result for the next plots.
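The visual impression from the heatmap can be checked numerically by listing each pair of variables once and sorting by correlation. The sketch below is one way to do that; `ranked_pairs` is a hypothetical helper and the `toy` frame holds made-up values, not the spambase data.

```python
import numpy as np
import pandas as pd


def ranked_pairs(corr):
    """List each unordered column pair once, sorted by correlation."""
    # Strict upper triangle: drops the diagonal and the mirrored duplicates.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False)


# Made-up data for illustration only.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # moves exactly with a
    "c": [4.0, 3.0, 2.0, 1.0],   # moves exactly against a
    "d": [1.0, 3.0, 2.0, 4.0],
})
pairs = ranked_pairs(toy.corr())
print(pairs.head(3))   # strongest positive pairs
print(pairs.tail(3))   # strongest negative pairs
```

Applied to `corr`, the head of the result would show the strongly positive pairs and the tail would show how negative the most negative pair really is.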

In [51]:
corr_sum = corr.sum().sort_values(ascending=False)
corr_sum[:5]

857                           6.888718
415                           6.858819
direct                        6.825664
telnet                        6.460967
capital_run_length_longest    5.913286
dtype: float64

In [52]:
corr_cols = corr_sum.keys()
corr_cols[:5]

Index(['857', '415', 'direct', 'telnet', 'capital_run_length_longest'], dtype='object')
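Note that `corr.sum()` includes each variable's self-correlation of $1$ and lets positive and negative values cancel. A possible alternative ranking, which is not what this notebook uses, is the mean absolute correlation with all other variables. The helper `mean_abs_offdiag` and the matrix `toy_corr` below are hypothetical, with made-up values.

```python
import numpy as np
import pandas as pd


def mean_abs_offdiag(corr):
    """Mean absolute correlation of each column with all the others."""
    abs_corr = corr.abs()
    # Remove the diagonal (self-correlation) before averaging.
    diag = pd.Series(np.diag(abs_corr), index=corr.index)
    score = (abs_corr.sum() - diag) / (len(corr) - 1)
    return score.sort_values(ascending=False)


# Made-up correlation matrix for illustration only.
toy_corr = pd.DataFrame(
    [[1.0, 0.8, -0.6],
     [0.8, 1.0, 0.1],
     [-0.6, 0.1, 1.0]],
    index=["x", "y", "z"], columns=["x", "y", "z"],
)
plain_sum = toy_corr.sum().sort_values(ascending=False)
abs_score = mean_abs_offdiag(toy_corr)
```

On this toy matrix the plain sum ranks `y` first, while the absolute score ranks `x` first because its strong negative correlation with `z` no longer cancels out.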

## Histograms and box plots

---

In [54]:
fig5 = px.histogram(df[corr_cols[:5]],
                    title="Stacked histogram of the most correlated variables")
fig5.show()

The most correlated variables appear to be concentrated mostly at $0$. An exception is $capital\_run\_length\_longest$. One key factor is that we handled the outliers by capping them at the $99$th percentile, which can explain the behavior at the upper end, near $1$.
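The preprocessing itself is not shown in this notebook, so the following is only a sketch of how capping at the 99th percentile followed by min-max scaling could produce a pile-up at $1$. The function `cap_and_scale`, the order of the two steps, and the synthetic log-normal data are all assumptions.

```python
import numpy as np
import pandas as pd


def cap_and_scale(s, q=0.99):
    """Clip values above the q-quantile, then min-max scale to [0, 1]."""
    capped = s.clip(upper=s.quantile(q))
    return (capped - capped.min()) / (capped.max() - capped.min())


# Synthetic heavy-tailed data, not the spambase values.
rng = np.random.default_rng(0)
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.5, size=1000))
scaled = cap_and_scale(raw)

# Every value that was above the cap now sits exactly at 1.
n_at_one = int((scaled == 1.0).sum())
print(n_at_one)
```

All values beyond the cap collapse onto the same maximum, so a histogram of the scaled variable shows a spike in its last bin.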

In [57]:
columns = df.columns

In [61]:
fig6 = px.box(  df[columns[-15:]],
                color="spa",
                title="Box plot of last 14 variables")
fig6.show()

Suddenly, it is much easier to see that spam behaves in a very particular way. For example, a lot of spam seems to relate to money, judging by the occurrences of the character $\$$; this makes sense, as scams often try to steal money. On the other hand, table does not seem to be a subject of interest.
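The same comparison can be made numerically by averaging each feature per class. This is a minimal sketch: `class_summary` is a hypothetical helper, it assumes that `spa` equals `1.0` for spam and `0.0` otherwise, and the `toy` frame contains made-up values.

```python
import pandas as pd


def class_summary(df, label="spa"):
    """Per-class mean of every feature, plus the spam-minus-non-spam gap."""
    means = df.groupby(label).mean().T
    # Assumption: 1.0 marks spam and 0.0 marks non-spam.
    means["gap"] = means[1.0] - means[0.0]
    return means.sort_values("gap", ascending=False)


# Made-up values for illustration only.
toy = pd.DataFrame({
    "$": [0.4, 0.5, 0.0, 0.1],
    "table": [0.0, 0.0, 0.0, 0.0],
    "spa": [1.0, 1.0, 0.0, 0.0],
})
summary = class_summary(toy)
print(summary)
```

Features with a large positive gap are more frequent in spam, and features with a gap near zero carry little class information on their own.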