# Introduction

Phishing is a significant online security threat in which attackers use deceptive URLs to obtain confidential information from users. This project develops a phishing detector based on URL features, using the Naive Bayes (NB) algorithm.

### Objective

The main objective is to design and evaluate a classification model that identifies phishing URLs with high accuracy and a low number of false positives. To that end, we focus on:

1. **Feature Extraction:** URL length, special characters, subdomains, use of HTTPS, among others.
2. **Model Training:** Use of a labeled dataset to train the NB model.
3. **Model Evaluation:** Measuring performance with metrics such as precision, recall, F1-score, and false-positive rate.
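
The metrics named in the evaluation step can all be derived from the four confusion-matrix counts. The following is a minimal sketch in plain Python; `binary_metrics` is a hypothetical helper introduced here, and the label vectors are made up for illustration (1 = phishing, 0 = legitimate), not results from this project.

```python
def binary_metrics(y_true, y_pred):
    """Return precision, recall, F1 and false-positive rate for 0/1 labels (1 = phishing)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate URL flagged as phishing
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing URL missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate URL correctly passed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Illustrative labels only; these are not results from this project.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting the false-positive rate alongside precision makes it visible how many legitimate URLs would be blocked, which accuracy alone does not show.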

### Methodology

1. **Data Collection:** Legitimate and phishing URLs.
2. **Preprocessing:** Cleaning the URLs and extracting their features.
3. **Training and Evaluation:** Implementation of the NB algorithm and cross-validation of the model.
4. **Optimization:** Parameter tuning to improve performance.
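
Cross-validation and parameter tuning can be combined in a single step with scikit-learn. The sketch below is an illustration under stated assumptions: the data is synthetic (two Gaussian clusters standing in for the URL features), and the `var_smoothing` grid is an arbitrary example rather than values used in this project.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the URL features (assumption: 200 rows, 5 numeric features).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 5)),  # class 0, "legitimate"
    rng.normal(1.5, 1.0, size=(100, 5)),  # class 1, "phishing"
])
y = np.array([0] * 100 + [1] * 100)

# Stratified 5-fold cross-validation keeps the class ratio in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# var_smoothing is the main tunable parameter of GaussianNB; the grid is an arbitrary example.
param_grid = {"var_smoothing": [1e-9, 1e-7, 1e-5, 1e-3]}
search = GridSearchCV(GaussianNB(), param_grid, cv=cv, scoring="f1")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Scoring on F1 rather than accuracy is a design choice that weighs missed phishing URLs and false alarms together; another scorer could be substituted.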

### Document Structure

1. **Literature Review:** Existing methods for phishing detection.
2. **Dataset Description:** Details on how the dataset was collected and its characteristics.
3. **Methodology:** Steps in the implementation and evaluation of the model.
4. **Results:** Analysis of the results obtained.

In [1]:
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
df = pd.read_csv("/content/drive/MyDrive/Nueva carpeta/dataset_full.csv")

In [3]:
df

Unnamed: 0,qty_dot_url,qty_hyphen_url,qty_underline_url,qty_slash_url,qty_questionmark_url,qty_equal_url,qty_at_url,qty_and_url,qty_exclamation_url,qty_space_url,...,qty_ip_resolved,qty_nameservers,qty_mx_servers,ttl_hostname,tls_ssl_certificate,qty_redirects,url_google_index,domain_google_index,url_shortened,phishing
0,3,0,0,1,0,0,0,0,0,0,...,1,2,0,892,0,0,0,0,0,1
1,5,0,1,3,0,3,0,2,0,0,...,1,2,1,9540,1,0,0,0,0,1
2,2,0,0,1,0,0,0,0,0,0,...,1,2,3,589,1,0,0,0,0,0
3,4,0,2,5,0,0,0,0,0,0,...,1,2,0,292,1,0,0,0,0,1
4,2,0,0,0,0,0,0,0,0,0,...,1,2,1,3597,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88642,3,1,0,0,0,0,0,0,0,0,...,1,3,1,3597,0,0,0,0,0,0
88643,2,0,0,0,0,0,0,0,0,0,...,1,2,2,591,0,2,0,0,0,0
88644,2,1,0,5,0,0,0,0,0,0,...,1,2,5,14391,1,0,0,0,0,1
88645,2,0,0,1,0,0,0,0,0,0,...,1,1,1,52,1,0,0,0,0,1


In [4]:
df_filtered = df.loc[:, ['length_url', 'domain_length', 'qty_dot_url', 'qty_hyphen_url',
       'qty_underline_url', 'qty_slash_url', 'qty_questionmark_url',
       'qty_equal_url', 'qty_at_url', 'qty_and_url', 'qty_exclamation_url',
       'qty_space_url', 'qty_tilde_url', 'qty_comma_url', 'qty_plus_url',
       'qty_asterisk_url', 'qty_hashtag_url', 'qty_dollar_url',
       'directory_length', 'qty_dot_directory', 'qty_underline_directory',
       'qty_slash_directory', 'qty_questionmark_directory',
       'qty_equal_directory', 'qty_at_directory', 'qty_and_directory',
       'qty_exclamation_directory', 'qty_space_directory',
       'qty_tilde_directory', 'qty_comma_directory', 'qty_plus_directory',
       'qty_asterisk_directory', 'qty_hashtag_directory',
       'qty_dollar_directory', 'file_length', 'qty_dot_file',
       'qty_underline_file', 'qty_slash_file', 'qty_questionmark_file',
       'qty_equal_file', 'qty_at_file', 'qty_and_file', 'qty_exclamation_file',
       'qty_space_file', 'qty_tilde_file', 'qty_comma_file', 'qty_plus_file',
       'qty_asterisk_file', 'qty_hashtag_file', 'qty_dollar_file','phishing']
]

In [5]:
df_filtered.shape

(88647, 51)

In [6]:
import os
from urllib.parse import urlparse, parse_qs
import pandas as pd

def url_to_dataframe(url):
    # Split the URL path into directory and file
    parsed_url = urlparse(url)
    ruta = parsed_url.path
    directorio, archivo = os.path.split(ruta)

    # Rebuild the query string from its parsed parameters
    # (note: parse_qs decodes the values, e.g. '+' becomes a space)
    consulta = parsed_url.query
    parametros = parse_qs(consulta)
    consulta_completa = "?" + "&".join([f"{k}={','.join(v)}" for k, v in parametros.items()])

    # Check whether the last path segment is a file (i.e. has an extension)
    nombre_archivo, extension = os.path.splitext(archivo)
    if extension:
        # File present: join the file name and the query (if any) into one column
        archivo_y_consulta = archivo + consulta_completa if consulta_completa != "?" else archivo
    else:
        # No file: the column holds only the query, or is empty when there is no query
        archivo_y_consulta = consulta_completa if consulta_completa != "?" else ""

    # A root-only directory counts as no directory
    if directorio == "/":
        directorio = ""

    # Build a one-row DataFrame with the corresponding columns
    df = pd.DataFrame({
        "URL completa": [url],
        "Directorio": [directorio],
        "Archivo y Consulta": [archivo_y_consulta]
    })

    return df


urls = ["http://horizonsgallery.com/js/bin/ssl1/_id/www.paypal.com/fr/cgi-bin/webscr/cmd=_registration-run/login.php?cmd=_login-run&amp;dispatch=1471c4bdb044ae2be9e2fc3ec514b88b1471c4bdb044ae2be9e2fc3ec514b88b"]
# Build one row per URL and stack the rows into a single DataFrame
dfs = pd.concat([url_to_dataframe(url) for url in urls], ignore_index=True)
dfs


Unnamed: 0,URL completa,Directorio,Archivo y Consulta
0,http://horizonsgallery.com/js/bin/ssl1/_id/www...,/js/bin/ssl1/_id/www.paypal.com/fr/cgi-bin/web...,login.php?cmd=_login-run&amp;dispatch=1471c4bd...
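
To make the split shown in the output above easier to follow, the sketch below shows how `urlparse` and a path split divide a URL into host, directory, file, and query. The URL is hypothetical, and `posixpath.split` is used here instead of `os.path.split` as an assumption, so that the result does not depend on the operating system.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical URL used only for illustration.
url = "http://example.com/a/b/login.php?user=1&next=home"

parts = urlparse(url)
# Everything before the last "/" is the directory; the rest is the file name.
directory, filename = posixpath.split(parts.path)

print(parts.netloc)  # example.com
print(directory)     # /a/b
print(filename)      # login.php
print(parts.query)   # user=1&next=home
```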


In [7]:
import urllib.parse
import re

def analyze_url(row):
    url = row["URL completa"]
    # Parse the URL and extract the relevant components
    parsed_url = urllib.parse.urlparse(url)
    length_url = len(url)
    domain_length = len(parsed_url.netloc)
    qty_dot_url = url.count('.')
    qty_hyphen_url = url.count('-')
    qty_underline_url = url.count('_')
    qty_slash_url = url.count('/')
    qty_questionmark_url = url.count('?')
    qty_equal_url = url.count('=')
    qty_at_url = url.count('@')
    
    # Count the remaining special characters over the whole URL, matching the *_url feature names
    qty_and_url = url.count('&')
    qty_exclamation_url = url.count('!')
    qty_space_url = url.count(' ')
    qty_tilde_url = url.count('~')
    qty_comma_url = url.count(',')
    qty_plus_url = url.count('+')
    qty_asterisk_url = url.count('*')
    qty_hashtag_url = url.count('#')
    qty_dollar_url = url.count('$')
    
    # Build a dictionary with the extracted features
    url_info = {
        'length_url': length_url,
        'domain_length': domain_length,
        'qty_dot_url': qty_dot_url,
        'qty_hyphen_url': qty_hyphen_url,
        'qty_underline_url': qty_underline_url,
        'qty_slash_url': qty_slash_url,
        'qty_questionmark_url': qty_questionmark_url,
        'qty_equal_url': qty_equal_url,
        'qty_at_url': qty_at_url,
        'qty_and_url': qty_and_url,
        'qty_exclamation_url': qty_exclamation_url,
        'qty_space_url': qty_space_url,
        'qty_tilde_url': qty_tilde_url,
        'qty_comma_url': qty_comma_url,
        'qty_plus_url': qty_plus_url,
        'qty_asterisk_url': qty_asterisk_url,
        'qty_hashtag_url': qty_hashtag_url,
        'qty_dollar_url': qty_dollar_url
    }
    
    # Return the dictionary
    return url_info
df1 = dfs.apply(analyze_url, axis=1)
df1

0    {'length_url': 200, 'domain_length': 19, 'qty_...
dtype: object

In [8]:
def analyze_directory(directory):
    # Characters counted in the directory, keyed by the name fragment used in the feature columns
    chars = {'dot': '.', 'underline': '_', 'slash': '/', 'questionmark': '?', 'equal': '=',
             'at': '@', 'and': '&', 'exclamation': '!', 'space': ' ', 'tilde': '~',
             'comma': ',', 'plus': '+', 'asterisk': '*', 'hashtag': '#', 'dollar': '$'}
    index = ['directory_length'] + [f'qty_{name}_directory' for name in chars]

    if not directory:
        # No directory: mark every feature with the sentinel value -1
        return pd.Series([-1] * len(index), index=index)

    # Directory length followed by one count per character, in the same order as the index
    values = [len(directory)] + [directory.count(char) for char in chars.values()]
    return pd.Series(values, index=index)

In [9]:
def analyze_file(filename):
    # Characters counted in the file part, keyed by the name fragment used in the feature columns
    chars = {'dot': '.', 'underline': '_', 'slash': '/', 'questionmark': '?', 'equal': '=',
             'at': '@', 'and': '&', 'exclamation': '!', 'space': ' ', 'tilde': '~',
             'comma': ',', 'plus': '+', 'asterisk': '*', 'hashtag': '#', 'dollar': '$'}
    index = ['file_length'] + [f'qty_{name}_file' for name in chars]

    if not filename:
        # No file: mark every feature with the sentinel value -1
        return pd.Series([-1] * len(index), index=index)

    # File-name length followed by one count per character, in the same order as the index
    values = [len(filename)] + [filename.count(char) for char in chars.values()]
    return pd.Series(values, index=index)

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [11]:
modelnb = GaussianNB()
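
`GaussianNB` treats the features as conditionally independent given the class and models each one with a per-class normal distribution. As a reminder of the standard formulation (the symbols $\mu_{c,i}$ and $\sigma_{c,i}^2$ are notation introduced here for the per-class mean and variance of feature $i$, not names from this notebook), the predicted class for a feature vector $x = (x_1, \dots, x_n)$ is:

```latex
\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
P(x_i \mid c) = \frac{1}{\sqrt{2\pi\sigma_{c,i}^{2}}}
\exp\!\left( -\frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}} \right)
```

Here $c \in \{0, 1\}$ (legitimate or phishing) and $n = 50$ for the feature set selected above.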

In [12]:
df_features=df_filtered

In [13]:
df_features.shape

(88647, 51)

In [14]:
array=df_features.values
x_class = array[ : , 0:50]
y_class = array[ : , 50]
X_trainnb, X_testnb, y_trainnb, y_testnb = train_test_split(x_class, y_class, test_size=0.2,random_state=0)

In [15]:
modelnb.fit(X_trainnb, y_trainnb)

In [16]:
y_pred = modelnb.predict(X_testnb)

In [17]:
accuracy = accuracy_score(y_testnb, y_pred)


In [18]:
print("Accuracy: {:.2f}%".format(accuracy*100))

Accuracy: 86.46%


In [19]:
df_filtered.columns

Index(['length_url', 'domain_length', 'qty_dot_url', 'qty_hyphen_url',
       'qty_underline_url', 'qty_slash_url', 'qty_questionmark_url',
       'qty_equal_url', 'qty_at_url', 'qty_and_url', 'qty_exclamation_url',
       'qty_space_url', 'qty_tilde_url', 'qty_comma_url', 'qty_plus_url',
       'qty_asterisk_url', 'qty_hashtag_url', 'qty_dollar_url',
       'directory_length', 'qty_dot_directory', 'qty_underline_directory',
       'qty_slash_directory', 'qty_questionmark_directory',
       'qty_equal_directory', 'qty_at_directory', 'qty_and_directory',
       'qty_exclamation_directory', 'qty_space_directory',
       'qty_tilde_directory', 'qty_comma_directory', 'qty_plus_directory',
       'qty_asterisk_directory', 'qty_hashtag_directory',
       'qty_dollar_directory', 'file_length', 'qty_dot_file',
       'qty_underline_file', 'qty_slash_file', 'qty_questionmark_file',
       'qty_equal_file', 'qty_at_file', 'qty_and_file', 'qty_exclamation_file',
       'qty_space_file', 'qty_tild

In [20]:
df_filtered2 = df_filtered.loc[: ,['length_url', 'domain_length', 'qty_dot_url', 'qty_hyphen_url',
       'qty_underline_url', 'qty_slash_url', 'qty_questionmark_url',
       'qty_equal_url', 'qty_at_url', 'qty_and_url', 'qty_exclamation_url',
       'qty_space_url', 'qty_tilde_url', 'qty_comma_url', 'qty_plus_url',
       'qty_asterisk_url', 'qty_hashtag_url', 'qty_dollar_url',
       'directory_length', 'qty_dot_directory', 'qty_underline_directory',
       'qty_slash_directory', 'qty_questionmark_directory',
       'qty_equal_directory', 'qty_at_directory', 'qty_and_directory',
       'qty_exclamation_directory', 'qty_space_directory',
       'qty_tilde_directory', 'qty_comma_directory', 'qty_plus_directory',
       'qty_asterisk_directory', 'qty_hashtag_directory',
       'qty_dollar_directory', 'file_length', 'qty_dot_file',
       'qty_underline_file', 'qty_slash_file', 'qty_questionmark_file',
       'qty_equal_file', 'qty_at_file', 'qty_and_file', 'qty_exclamation_file',
       'qty_space_file', 'qty_tilde_file', 'qty_comma_file', 'qty_plus_file',
       'qty_asterisk_file', 'qty_hashtag_file', 'qty_dollar_file']]

In [21]:
y_predp = modelnb.predict(df_filtered2)



In [23]:
df_url = pd.read_csv("/content/drive/MyDrive/Nueva carpeta/phishing_site_urls.csv")

In [35]:
def mezcla(dfs):
    # Append the URL, directory, and file features to the split-URL DataFrame
    df1 = pd.DataFrame(dfs.apply(analyze_url, axis=1).tolist(), index=dfs.index)
    dfs = dfs.join(df1)
    df2 = dfs["Directorio"].apply(analyze_directory)
    dfs = dfs.join(df2)
    df3 = dfs["Archivo y Consulta"].apply(analyze_file)
    dfs = dfs.join(df3)
    return dfs

In [24]:
df_url

Unnamed: 0,URL,Label
0,nobell.it/70ffb52d079109dca5664cce6f317373782/...,bad
1,www.dghjdgf.com/paypal.co.uk/cycgi-bin/webscrc...,bad
2,serviciosbys.com/paypal.cgi.bin.get-into.herf....,bad
3,mail.printakid.com/www.online.americanexpress....,bad
4,thewhiskeydregs.com/wp-content/themes/widescre...,bad
...,...,...
549841,cam.ac.uk,good
549842,over-blog-kiwi.com,good
549843,merriam-webster.com,good
549844,bp3.blogger.com,good




In [30]:
def actualizar_url(df):
    # Find the URLs that do not start with "http://" or "https://"
    sin_esquema = ~df["URL"].str.match(r"https?://", na=False)
    # Prepend "http://" to those URLs in place, independently of the index labels
    df.loc[sin_esquema, "URL"] = "http://" + df.loc[sin_esquema, "URL"]
    return df
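
The scheme matters because `urlparse` only recognizes the host when the URL has one. The sketch below, using a hypothetical URL, shows what happens with and without the `http://` prefix.

```python
from urllib.parse import urlparse

# Hypothetical URL used only for illustration.
bare = "example.com/login.php"

without_scheme = urlparse(bare)
with_scheme = urlparse("http://" + bare)

# Without a scheme the host ends up in the path and netloc stays empty.
print(repr(without_scheme.netloc), without_scheme.path)  # '' example.com/login.php
# With a scheme the host and path are separated as expected.
print(repr(with_scheme.netloc), with_scheme.path)        # 'example.com' /login.php
```

Without the prefix, a feature such as `domain_length` would be computed from an empty host.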

In [25]:
df_nuevo = pd.DataFrame(columns=["URL completa", "Directorio", "Archivo y Consulta"])

In [26]:
df_sample = df_url.sample(n=10000, random_state=42)
df_sample.reset_index(drop=True, inplace=True)

In [27]:
df_sample

Unnamed: 0,URL,Label
0,zimbio.com/Video+of+the+Day/articles/bgGtrcF2e...,good
1,mylife.com/kathylynnbaker,good
2,linkedin.com/company/parc-de-la-chute-montmore...,good
3,1.179.170.7:4493,bad
4,linkedin.com/in/tinapugh,good
...,...,...
9995,hfboards.com/showthread.php?p=1044686,good
9996,harpers.org/search.php?q=David+Foster+W,good
9997,theskopelosproject.com/trees/tree_polizos.html,good
9998,allflagsautoexp.com/document/,bad


In [31]:

df_actus = actualizar_url(df_sample)

In [32]:
df_actus

Unnamed: 0,URL,Label
0,http://zimbio.com/Video+of+the+Day/articles/bg...,good
1,http://mylife.com/kathylynnbaker,good
2,http://linkedin.com/company/parc-de-la-chute-m...,good
3,http://1.179.170.7:4493,bad
4,http://linkedin.com/in/tinapugh,good
...,...,...
9995,http://hfboards.com/showthread.php?p=1044686,good
9996,http://harpers.org/search.php?q=David+Foster+W,good
9997,http://theskopelosproject.com/trees/tree_poliz...,good
9998,http://allflagsautoexp.com/document/,bad


In [None]:
missing_count = df_actus["URL"].isnull().sum()
print("Number of missing values in the URL column:", missing_count)

Number of missing values in the URL column: 0


In [None]:
import time
import pandas as pd

df_nuevo = pd.DataFrame(columns=["URL completa", "Directorio", "Archivo y Consulta"])
start_time = time.time()  # Record the start time

for url in df_actus["URL"]:
    try:
        df_temp = url_to_dataframe(url)
        df_nuevo = pd.concat([df_nuevo, df_temp])
    except ValueError:
        print(f"Invalid URL: {url}")

df_nuevo.reset_index(drop=True, inplace=True)

end_time = time.time()  # Record the end time
execution_time = end_time - start_time  # Compute the elapsed time
print(f"Execution time: {execution_time} seconds")

Execution time: 8.652297735214233 seconds


In [None]:
import os
from urllib.parse import urlparse, parse_qs
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
import time

def process_url(url):
    try:
        df_temp = url_to_dataframe(url)
        return df_temp
    except ValueError:
        print(f"Invalid URL: {url}")
        return None

df_nuevo = pd.DataFrame(columns=["URL completa", "Directorio", "Archivo y Consulta"])
urls = df_actus["URL"]

start_time = time.time()  # Record the start time

with ThreadPoolExecutor() as executor:
    results = executor.map(process_url, urls)
    for result in results:
        if result is not None:
            df_nuevo = pd.concat([df_nuevo, result])

df_nuevo.reset_index(drop=True, inplace=True)

end_time = time.time()  # Record the end time
execution_time = end_time - start_time  # Compute the elapsed time
print(f"Execution time: {execution_time} seconds")

Execution time: 11.297534942626953 seconds
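
The threaded run above took longer than the sequential one (about 11.3 s versus 8.7 s). A likely reason, stated here as an assumption rather than a measurement, is that URL parsing is CPU-bound Python code, so threads gain little under the global interpreter lock, while both versions still call `pd.concat` once per URL. A common alternative is to collect plain dictionaries and build the DataFrame once. The sketch below uses a simplified, hypothetical `split_url` helper and made-up URLs; it is not the notebook's `url_to_dataframe`.

```python
import posixpath
from urllib.parse import urlparse

import pandas as pd


def split_url(url):
    """Simplified, hypothetical splitter: returns one plain dict per URL."""
    parts = urlparse(url)
    directory, filename = posixpath.split(parts.path)
    return {"URL completa": url, "Directorio": directory, "Archivo": filename}


# Hypothetical URLs used only for illustration.
urls = [
    "http://example.com/a/b/login.php",
    "http://example.org/docs/index.html",
]

# Collect plain dicts first, then build the DataFrame in a single call
# instead of concatenating one-row DataFrames inside the loop.
rows = [split_url(u) for u in urls]
df_rows = pd.DataFrame(rows)
print(df_rows.shape)
```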


In [39]:
def process_url(url):
    try:
        df_temp = url_to_dataframe(url)
        return df_temp
    except ValueError:
        print(f"Invalid URL: {url}")
        return None

In [None]:
df_nuevo

Unnamed: 0,URL completa,Directorio,Archivo y Consulta
0,http://zimbio.com/Video+of+the+Day/articles/bg...,/Video+of+the+Day/articles/bgGtrcF2e4S,
1,http://mylife.com/kathylynnbaker,,
2,http://linkedin.com/company/parc-de-la-chute-m...,/company,
3,http://1.179.170.7:4493,,
4,http://linkedin.com/in/tinapugh,/in,
...,...,...,...
9995,http://hfboards.com/showthread.php?p=1044686,,showthread.php?p=1044686
9996,http://harpers.org/search.php?q=David+Foster+W,,search.php?q=David Foster W
9997,http://theskopelosproject.com/trees/tree_poliz...,/trees,tree_polizos.html
9998,http://allflagsautoexp.com/document/,/document,
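
Row 9996 above shows spaces in the "Archivo y Consulta" column where the original URL has `+` signs. This follows from `parse_qs`, which decodes query values. The short check below reproduces the effect with that same URL.

```python
from urllib.parse import parse_qs, urlparse

url = "http://harpers.org/search.php?q=David+Foster+W"

raw_query = urlparse(url).query  # the query exactly as written in the URL
decoded = parse_qs(raw_query)    # values are decoded: '+' becomes a space

print(raw_query)                   # q=David+Foster+W
print(decoded)                     # {'q': ['David Foster W']}
print(raw_query.count("+"))        # 2
print(decoded["q"][0].count(" "))  # 2
```

As a consequence, character counts taken from the rebuilt query (for example `qty_space_file` and `qty_plus_file`) can differ from counts taken on the raw URL.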


In [None]:
start_time = time.time()  # Record the start time

df_mezcla = None
df_mezcla = mezcla(df_nuevo)

end_time = time.time()  # Record the end time
execution_time = end_time - start_time  # Compute the elapsed time
print(f"Execution time: {execution_time} seconds")

Execution time: 6.253378868103027 seconds


In [None]:
df_mezcla = None

start_time = time.time()  # Record the start time

with ThreadPoolExecutor() as executor:
    future = executor.submit(mezcla, df_nuevo)
    df_mezcla = future.result()

end_time = time.time()  # Record the end time
execution_time = end_time - start_time  # Compute the elapsed time
print(f"Execution time: {execution_time} seconds")

Execution time: 6.3521177768707275 seconds


In [None]:
df_mezcla

Unnamed: 0,URL completa,Directorio,Archivo y Consulta,length_url,domain_length,qty_dot_url,qty_hyphen_url,qty_underline_url,qty_slash_url,qty_questionmark_url,...,qty_at_file,qty_and_file,qty_exclamation_file,qty_space_file,qty_tilde_file,qty_comma_file,qty_plus_file,qty_asterisk_file,qty_hashtag_file,qty_dollar_file
0,http://zimbio.com/Video+of+the+Day/articles/bg...,/Video+of+the+Day/articles/bgGtrcF2e4S,,101,10,1,0,0,6,0,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
1,http://mylife.com/kathylynnbaker,,,32,10,1,0,0,3,0,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
2,http://linkedin.com/company/parc-de-la-chute-m...,/company,,63,12,1,6,0,4,0,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
3,http://1.179.170.7:4493,,,23,16,3,0,0,2,0,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
4,http://linkedin.com/in/tinapugh,/in,,31,12,1,0,0,4,0,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,http://hfboards.com/showthread.php?p=1044686,,showthread.php?p=1044686,44,12,2,0,0,3,1,...,0,0,0,0,0,0,0,0,0,0
9996,http://harpers.org/search.php?q=David+Foster+W,,search.php?q=David Foster W,46,11,2,0,0,3,1,...,0,0,0,2,0,0,0,0,0,0
9997,http://theskopelosproject.com/trees/tree_poliz...,/trees,tree_polizos.html,53,22,2,0,1,4,0,...,0,0,0,0,0,0,0,0,0,0
9998,http://allflagsautoexp.com/document/,/document,,36,19,1,0,0,4,0,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1


In [None]:
df_filtered2 = df_mezcla.loc[: ,['length_url', 'domain_length', 'qty_dot_url', 'qty_hyphen_url',
       'qty_underline_url', 'qty_slash_url', 'qty_questionmark_url',
       'qty_equal_url', 'qty_at_url', 'qty_and_url', 'qty_exclamation_url',
       'qty_space_url', 'qty_tilde_url', 'qty_comma_url', 'qty_plus_url',
       'qty_asterisk_url', 'qty_hashtag_url', 'qty_dollar_url',
       'directory_length', 'qty_dot_directory', 'qty_underline_directory',
       'qty_slash_directory', 'qty_questionmark_directory',
       'qty_equal_directory', 'qty_at_directory', 'qty_and_directory',
       'qty_exclamation_directory', 'qty_space_directory',
       'qty_tilde_directory', 'qty_comma_directory', 'qty_plus_directory',
       'qty_asterisk_directory', 'qty_hashtag_directory',
       'qty_dollar_directory', 'file_length', 'qty_dot_file',
       'qty_underline_file', 'qty_slash_file', 'qty_questionmark_file',
       'qty_equal_file', 'qty_at_file', 'qty_and_file', 'qty_exclamation_file',
       'qty_space_file', 'qty_tilde_file', 'qty_comma_file', 'qty_plus_file',
       'qty_asterisk_file', 'qty_hashtag_file', 'qty_dollar_file']]

In [None]:
y_predp=[]

In [None]:
y_predp = modelnb.predict(df_filtered2)



In [None]:
y_predp

array([1, 0, 0, ..., 1, 0, 1])

In [None]:
df_mezcla['prediccion'] = y_predp


In [None]:
df_mezcla

Unnamed: 0,URL completa,Directorio,Archivo y Consulta,length_url,domain_length,qty_dot_url,qty_hyphen_url,qty_underline_url,qty_slash_url,qty_questionmark_url,...,qty_and_file,qty_exclamation_file,qty_space_file,qty_tilde_file,qty_comma_file,qty_plus_file,qty_asterisk_file,qty_hashtag_file,qty_dollar_file,prediccion
0,http://zimbio.com/Video+of+the+Day/articles/bg...,/Video+of+the+Day/articles/bgGtrcF2e4S,,101,10,1,0,0,6,0,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,1
1,http://mylife.com/kathylynnbaker,,,32,10,1,0,0,3,0,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,0
2,http://linkedin.com/company/parc-de-la-chute-m...,/company,,63,12,1,6,0,4,0,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,0
3,http://1.179.170.7:4493,,,23,16,3,0,0,2,0,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,0
4,http://linkedin.com/in/tinapugh,/in,,31,12,1,0,0,4,0,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,http://hfboards.com/showthread.php?p=1044686,,showthread.php?p=1044686,44,12,2,0,0,3,1,...,0,0,0,0,0,0,0,0,0,1
9996,http://harpers.org/search.php?q=David+Foster+W,,search.php?q=David Foster W,46,11,2,0,0,3,1,...,0,0,2,0,0,0,0,0,0,1
9997,http://theskopelosproject.com/trees/tree_poliz...,/trees,tree_polizos.html,53,22,2,0,1,4,0,...,0,0,0,0,0,0,0,0,0,1
9998,http://allflagsautoexp.com/document/,/document,,36,19,1,0,0,4,0,...,-1,-1,-1,-1,-1,-1,-1,-1,-1,0




In [None]:
df7 = process_url("http://asecvech.cl/wp-content/languages/wetransfer/wetransfer/kola.html?ICAgIDxkaXYgY2xhc3M9Im5vdHByaW50YWJsZSI%20PHRhYmxlIGNlbGxzcGFjaW5nPSIwIiBjZWxscGFkZGluZz0iMCIgd2lkdGg9IjEwMCUiPjx0cj48dGQgY2xhc3M9ImJjcm93IiBzdHlsZT0icGFkZGluZy1sZWZ0OiAwcHg7Ij48L3RkPjwvdHI")
df7 = mezcla(df7)
df7 = df7.loc[: ,['length_url', 'domain_length', 'qty_dot_url', 'qty_hyphen_url',
       'qty_underline_url', 'qty_slash_url', 'qty_questionmark_url',
       'qty_equal_url', 'qty_at_url', 'qty_and_url', 'qty_exclamation_url',
       'qty_space_url', 'qty_tilde_url', 'qty_comma_url', 'qty_plus_url',
       'qty_asterisk_url', 'qty_hashtag_url', 'qty_dollar_url',
       'directory_length', 'qty_dot_directory', 'qty_underline_directory',
       'qty_slash_directory', 'qty_questionmark_directory',
       'qty_equal_directory', 'qty_at_directory', 'qty_and_directory',
       'qty_exclamation_directory', 'qty_space_directory',
       'qty_tilde_directory', 'qty_comma_directory', 'qty_plus_directory',
       'qty_asterisk_directory', 'qty_hashtag_directory',
       'qty_dollar_directory', 'file_length', 'qty_dot_file',
       'qty_underline_file', 'qty_slash_file', 'qty_questionmark_file',
       'qty_equal_file', 'qty_at_file', 'qty_and_file', 'qty_exclamation_file',
       'qty_space_file', 'qty_tilde_file', 'qty_comma_file', 'qty_plus_file',
       'qty_asterisk_file', 'qty_hashtag_file', 'qty_dollar_file']]
y_pred2 = modelnb.predict(df7)
y_pred2

Unnamed: 0,length_url,domain_length,qty_dot_url,qty_hyphen_url,qty_underline_url,qty_slash_url,qty_questionmark_url,qty_equal_url,qty_at_url,qty_and_url,...,qty_at_file,qty_and_file,qty_exclamation_file,qty_space_file,qty_tilde_file,qty_comma_file,qty_plus_file,qty_asterisk_file,qty_hashtag_file,qty_dollar_file
0,261,11,2,1,0,7,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
def predecir_phishing(df):
    df = mezcla(df)
    df = df.loc[:, ['length_url', 'domain_length', 'qty_dot_url', 'qty_hyphen_url',
                    'qty_underline_url', 'qty_slash_url', 'qty_questionmark_url',
                    'qty_equal_url', 'qty_at_url', 'qty_and_url', 'qty_exclamation_url',
                    'qty_space_url', 'qty_tilde_url', 'qty_comma_url', 'qty_plus_url',
                    'qty_asterisk_url', 'qty_hashtag_url', 'qty_dollar_url',
                    'directory_length', 'qty_dot_directory', 'qty_underline_directory',
                    'qty_slash_directory', 'qty_questionmark_directory',
                    'qty_equal_directory', 'qty_at_directory', 'qty_and_directory',
                    'qty_exclamation_directory', 'qty_space_directory',
                    'qty_tilde_directory', 'qty_comma_directory', 'qty_plus_directory',
                    'qty_asterisk_directory', 'qty_hashtag_directory',
                    'qty_dollar_directory', 'file_length', 'qty_dot_file',
                    'qty_underline_file', 'qty_slash_file', 'qty_questionmark_file',
                    'qty_equal_file', 'qty_at_file', 'qty_and_file', 'qty_exclamation_file',
                    'qty_space_file', 'qty_tilde_file', 'qty_comma_file', 'qty_plus_file',
                    'qty_asterisk_file', 'qty_hashtag_file', 'qty_dollar_file']]
    y_pred2 = modelnb.predict(df)
    return y_pred2
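
Because the stated goal includes a low number of false positives, it can help to look at class probabilities instead of hard labels. The sketch below is an illustration on synthetic data, not the notebook's model: it uses `predict_proba` and an arbitrary example threshold of 0.9 to flag a URL only when the model is confident.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the URL features (assumption: 3 numeric features).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 3)),  # class 0, "legitimate"
    rng.normal(2.0, 1.0, size=(100, 3)),  # class 1, "phishing"
])
y = np.array([0] * 100 + [1] * 100)

model = GaussianNB().fit(X, y)

# Column 1 of predict_proba is P(class == 1) because model.classes_ is [0, 1].
proba_phishing = model.predict_proba(X)[:, 1]

default_flags = proba_phishing >= 0.5  # threshold at one half
strict_flags = proba_phishing >= 0.9   # arbitrary stricter example threshold

print(default_flags.sum(), strict_flags.sum())
```

A stricter threshold flags fewer URLs, trading some recall for fewer false alarms; the right value would have to be chosen on validation data.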

In [None]:
def menu():
    while True:
        print("==== Menu ====")
        print("1. Predict whether a URL is phishing")
        print("2. Exit")
        
        opcion = input("Select an option: ")
        
        if opcion == "1":
            url = input("Enter the URL to analyze: ")
            df = process_url(url)
            if df is None:
                # process_url returns None when the URL cannot be parsed
                print("Invalid URL. Please try again.")
                print()
                continue
            y_pred = predecir_phishing(df)
            
            if y_pred[0] == 1:
                print("The URL is phishing.")
            else:
                print("The URL is not phishing.")
            
            print()
        
        elif opcion == "2":
            print("Exiting the program...")
            break
        
        else:
            print("Invalid option. Please try again.")
            print()

# Call the menu function
menu()

==== Menu ====
1. Predict whether a URL is phishing
2. Exit




The URL is phishing.

==== Menu ====
1. Predict whether a URL is phishing
2. Exit




The URL is not phishing.

==== Menu ====
1. Predict whether a URL is phishing
2. Exit
