# lab 1 deteccion de pishing
- Marco Jurado 20308
- Oscar Lopez 20679

## Parte 1 – Ingeniería de características
### Exploración de datos

In [76]:
import pandas as pd

In [77]:
data_set_original = pd.read_csv('dataset_pishing.csv')
data_set_original

Unnamed: 0,url,status
0,http://www.crestonwood.com/router.php,legitimate
1,http://shadetreetechnology.com/V4/validation/a...,phishing
2,https://support-appleld.com.secureupdate.duila...,phishing
3,http://rgipt.ac.in,legitimate
4,http://www.iracing.com/tracks/gateway-motorspo...,legitimate
...,...,...
11425,http://www.fontspace.com/category/blackletter,legitimate
11426,http://www.budgetbots.com/server.php/Server%20...,phishing
11427,https://www.facebook.com/Interactive-Televisio...,legitimate
11428,http://www.mypublicdomainpictures.com/,legitimate


In [78]:
data_set_original['status'].value_counts()

legitimate    5715
phishing      5715
Name: status, dtype: int64

Podemos observar que el dataset si se encuentra balanceado teniendo exactamente 5715 elementos para legitimate y la misma parte para phishing.

## Derivación de características
1. ¿Qué ventajas tiene el análisis de una URL contra el análisis de otros datos, cómo el tiempo de vida del dominio, o las características de la página Web?
2. ¿Qué características de una URL son más prometedoras para la detección de phishing?

## Preprocesamiento

In [79]:
data_set_original['status'] = data_set_original['status'].map({'legitimate' : 0, 'phishing' : 1})
data_set_original

Unnamed: 0,url,status
0,http://www.crestonwood.com/router.php,0
1,http://shadetreetechnology.com/V4/validation/a...,1
2,https://support-appleld.com.secureupdate.duila...,1
3,http://rgipt.ac.in,0
4,http://www.iracing.com/tracks/gateway-motorspo...,0
...,...,...
11425,http://www.fontspace.com/category/blackletter,0
11426,http://www.budgetbots.com/server.php/Server%20...,1
11427,https://www.facebook.com/Interactive-Televisio...,0
11428,http://www.mypublicdomainpictures.com/,0


In [80]:
# Protocolo
data_set_original['protocol'] = data_set_original['url'].apply(lambda x: 1 if x.startswith('http://') else 0)
data_set_original['protocol'].value_counts()

1    6983
0    4447
Name: protocol, dtype: int64

In [81]:
def check_tld_corrected(url):
    accepted_tlds = ['.com', '.net', '.org', '.edu', '.gov']
    parts = url.split('/')
    domain = parts[2] if len(parts) > 2 else parts[0]
    tld = domain.split('.')[-1]
    return 0 if any(domain.endswith('.' + tld) for tld in accepted_tlds) else 1

# Aplicar
data_set_original['tld_check'] = data_set_original['url'].apply(check_tld_corrected)
data_set_original

Unnamed: 0,url,status,protocol,tld_check
0,http://www.crestonwood.com/router.php,0,1,1
1,http://shadetreetechnology.com/V4/validation/a...,1,1,1
2,https://support-appleld.com.secureupdate.duila...,1,0,1
3,http://rgipt.ac.in,0,1,1
4,http://www.iracing.com/tracks/gateway-motorspo...,0,1,1
...,...,...,...,...
11425,http://www.fontspace.com/category/blackletter,0,1,1
11426,http://www.budgetbots.com/server.php/Server%20...,1,1,1
11427,https://www.facebook.com/Interactive-Televisio...,0,0,1
11428,http://www.mypublicdomainpictures.com/,0,1,1


In [82]:
def tld_repetition(url):
    accepted_tlds = ['.com', '.net', '.org', '.edu', '.gov']
    tld_counts = sum(url.count(tld) for tld in accepted_tlds)
    return 1 if tld_counts > 1 else 0

# Aplicar
data_set_original['tld_repetition'] = data_set_original['url'].apply(tld_repetition)
data_set_original

Unnamed: 0,url,status,protocol,tld_check,tld_repetition
0,http://www.crestonwood.com/router.php,0,1,1,0
1,http://shadetreetechnology.com/V4/validation/a...,1,1,1,0
2,https://support-appleld.com.secureupdate.duila...,1,0,1,1
3,http://rgipt.ac.in,0,1,1,0
4,http://www.iracing.com/tracks/gateway-motorspo...,0,1,1,0
...,...,...,...,...,...
11425,http://www.fontspace.com/category/blackletter,0,1,1,0
11426,http://www.budgetbots.com/server.php/Server%20...,1,1,1,1
11427,https://www.facebook.com/Interactive-Televisio...,0,0,1,0
11428,http://www.mypublicdomainpictures.com/,0,1,1,0


In [83]:
def long_sld(url):
    domain = url.split("//")[-1].split("/")[0].split('?')[0]
    sld = domain.split('.')[-2] if len(domain.split('.')) > 1 else domain
    threshold = 10
    return 1 if len(sld) > threshold else 0

# Aplicar
data_set_original['long_sld'] = data_set_original['url'].apply(long_sld)
data_set_original

Unnamed: 0,url,status,protocol,tld_check,tld_repetition,long_sld
0,http://www.crestonwood.com/router.php,0,1,1,0,1
1,http://shadetreetechnology.com/V4/validation/a...,1,1,1,0,1
2,https://support-appleld.com.secureupdate.duila...,1,0,1,1,1
3,http://rgipt.ac.in,0,1,1,0,0
4,http://www.iracing.com/tracks/gateway-motorspo...,0,1,1,0,0
...,...,...,...,...,...,...
11425,http://www.fontspace.com/category/blackletter,0,1,1,0,0
11426,http://www.budgetbots.com/server.php/Server%20...,1,1,1,1,0
11427,https://www.facebook.com/Interactive-Televisio...,0,0,1,0,0
11428,http://www.mypublicdomainpictures.com/,0,1,1,0,1
