# Website Phishing Detection using Logistic Regression and the KNN Algorithm

## Task 1 | Linear Regression

##### 1. Consider a linear regression model with two features, X₁ and X₂, and their corresponding weights w₁ and w₂. If the model predicts an output y through the equation y = 2w₁X₁ + 3w₂X₂ + 1, how should the coefficient 3w₂ be interpreted in the context of the model?


##### 2. Explain the concept of multicollinearity in the context of linear regression. How does multicollinearity affect the interpretation of the individual regression coefficients?



# Task 2
## Exploratory Analysis

In [3]:
import matplotlib
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

In [4]:
df = pd.read_csv('data/dataset_phishing.csv')
df.head()

Unnamed: 0,url,length_url,length_hostname,ip,nb_dots,nb_hyphens,nb_at,nb_qm,nb_and,nb_or,...,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank,status
0,http://www.crestonwood.com/router.php,37,19,0,3,0,0,0,0,0,...,0,1,0,45,-1,0,1,1,4,legitimate
1,http://shadetreetechnology.com/V4/validation/a...,77,23,1,1,0,0,0,0,0,...,1,0,0,77,5767,0,0,1,2,phishing
2,https://support-appleld.com.secureupdate.duila...,126,50,1,4,1,0,1,2,0,...,1,0,0,14,4004,5828815,0,1,0,phishing
3,http://rgipt.ac.in,18,11,0,2,0,0,0,0,0,...,1,0,0,62,-1,107721,0,0,3,legitimate
4,http://www.iracing.com/tracks/gateway-motorspo...,55,15,0,2,2,0,0,0,0,...,0,1,0,224,8175,8725,0,0,6,legitimate


In [5]:
features = df[[
    # 'url', 
    'length_url', 
    'length_hostname', 
    'ip', 
    # 'nb_www', 
    # 'nb_com',
    # 'nb_dslash', 
    'http_in_path', 
    'https_token', 
    # 'ratio_digits_url',
    # 'ratio_digits_host', 
    'punycode', 
    'port', 
    'tld_in_path',
    'tld_in_subdomain', 
    'abnormal_subdomain',
    # 'nb_subdomains',
    'prefix_suffix', 
    'random_domain', 
    'shortening_service',
    'path_extension', 
    # 'nb_redirection', 'nb_external_redirection',
    'phish_hints', 
    'domain_in_brand',
    'brand_in_subdomain', 
    'brand_in_path', 
    'suspecious_tld',
    'statistical_report', 
    # 'nb_hyperlinks', 
    # 'ratio_intHyperlinks',
    # 'ratio_extHyperlinks', 'ratio_nullHyperlinks', 'nb_extCSS',
    # 'ratio_intRedirection', 'ratio_extRedirection', 'ratio_intErrors',
    # 'ratio_extErrors', 
    'login_form', 
    'external_favicon', 
    # 'links_in_tags',
    # 'submit_email',  0's across the board
    # 'ratio_intMedia', 'ratio_extMedia', 
    # 'sfh', 0's across the board
    'iframe',
    'popup_window', 
    'onmouseover', 
    'right_clic',
    'empty_title', 
    'domain_in_title', 
    'domain_with_copyright',
    'whois_registered_domain', 
    'domain_registration_length', 'domain_age',
    'web_traffic', 'dns_record', 'google_index', 'page_rank', 'status']].copy()

### Encoding

Encoding is only needed for the target, since it is the only categorical variable among the selected columns.

In [6]:
features['status'] = features['status'].map({'phishing': 1, 'legitimate': 0})
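
A small sanity check can guard this step, since `Series.map` returns `NaN` for any label that is missing from the dictionary. The sketch below uses a toy series (the `raw` values, including the `'unknown'` label, are made up for illustration) rather than the real dataset:

```python
import pandas as pd

# Toy labels standing in for the 'status' column (illustrative values only)
raw = pd.Series(['phishing', 'legitimate', 'phishing', 'unknown'])

mapping = {'phishing': 1, 'legitimate': 0}
encoded = raw.map(mapping)

# Labels absent from the mapping become NaN, so count them explicitly
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)
```

If the count is greater than zero, the mapping does not cover every label present in the column.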

In [7]:
features.describe()

Unnamed: 0,length_url,length_hostname,ip,http_in_path,https_token,punycode,port,tld_in_path,tld_in_subdomain,abnormal_subdomain,...,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank,status
count,11430.0,11430.0,11430.0,11430.0,11430.0,11430.0,11430.0,11430.0,11430.0,11430.0,...,11430.0,11430.0,11430.0,11430.0,11430.0,11430.0,11430.0,11430.0,11430.0,11430.0
mean,61.126684,21.090289,0.150569,0.01671,0.610936,0.00035,0.002362,0.065617,0.050131,0.02161,...,0.775853,0.439545,0.072878,492.532196,4062.543745,856756.6,0.020122,0.533946,3.185739,0.5
std,55.297318,10.777171,0.357644,0.169358,0.487559,0.018705,0.048547,0.247622,0.218225,0.145412,...,0.417038,0.496353,0.259948,814.769415,3107.7846,1995606.0,0.140425,0.498868,2.536955,0.500022
min,12.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,-1.0,-12.0,0.0,0.0,0.0,0.0,0.0
25%,33.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,84.0,972.25,0.0,0.0,0.0,1.0,0.0
50%,47.0,19.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,242.0,3993.0,1651.0,0.0,1.0,3.0,0.5
75%,71.0,24.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,0.0,449.0,7026.75,373845.5,0.0,1.0,5.0,1.0
max,1641.0,214.0,1.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,29829.0,12874.0,10767990.0,1.0,1.0,10.0,1.0


In [8]:
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11430 entries, 0 to 11429
Data columns (total 37 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   length_url                  11430 non-null  int64
 1   length_hostname             11430 non-null  int64
 2   ip                          11430 non-null  int64
 3   http_in_path                11430 non-null  int64
 4   https_token                 11430 non-null  int64
 5   punycode                    11430 non-null  int64
 6   port                        11430 non-null  int64
 7   tld_in_path                 11430 non-null  int64
 8   tld_in_subdomain            11430 non-null  int64
 9   abnormal_subdomain          11430 non-null  int64
 10  prefix_suffix               11430 non-null  int64
 11  random_domain               11430 non-null  int64
 12  shortening_service          11430 non-null  int64
 13  path_extension              11430 non-null  int64
 14  phish_

### Class Balance

No class balancing is needed, since both classes have the same number of samples.

In [9]:
features['status'].value_counts()

status
0    5715
1    5715
Name: count, dtype: int64

### Scaling

The features have very different magnitudes, so they need to be scaled.
The following features will be scaled: length_url, length_hostname, domain_registration_length, domain_age, web_traffic, page_rank

In [10]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [11]:
# List of features to scale
features_to_scale = ['length_url', 'length_hostname', 'domain_registration_length', 'domain_age', 'web_traffic', 'page_rank']

# Apply the scaler to the features in the list
features[features_to_scale] = scaler.fit_transform(features[features_to_scale])

In [12]:
features[features_to_scale].mean()

length_url                   -5.221836e-17
length_hostname              -1.367624e-16
domain_registration_length    2.237930e-17
domain_age                   -1.243294e-18
web_traffic                  -1.429789e-17
page_rank                    -8.703061e-18
dtype: float64
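
The near-zero means above are what standardization is expected to produce. As a rough illustration on made-up numbers, `StandardScaler` computes `z = (x - mean) / std` per column using the population standard deviation (`ddof=0`), which can be reproduced with NumPy:

```python
import numpy as np

# Made-up column of URL lengths, for illustration only
x = np.array([12.0, 33.0, 47.0, 71.0, 137.0])

# Standardize: subtract the mean, divide by the population std (ddof=0)
z = (x - x.mean()) / x.std()

print(z.mean())  # numerically close to 0
print(z.std())   # numerically close to 1
```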

### Data Splitting

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
X = features.drop('status', axis=1)
y = features['status']

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21562)
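
Note that the cells above fit the scaler on the whole dataset before splitting, so test-set statistics influence the scaled training features. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the `*_demo`, `X_tr` and `X_te` names are hypothetical and are not part of the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (not the real data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=21562
)

# Fit the scaler on the training rows only, then reuse it on the test rows
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # close to 0 by construction
print(X_te_scaled.mean(axis=0))  # near 0, but generally not exactly 0
```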

## Task 2.1 | Logistic Regression

### Training without libraries

In [16]:
class LogisticRegression:
    def __init__(self, learning_rate=0.01, iterations=1000):
        self.learning_rate = learning_rate
        self.iterations = iterations

        self.X = None
        self.y = None
        self.m = None
        self.n = None
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        self.X = X
        self.y = y

        # m = number of training examples
        # n = number of features
        self.m, self.n = X.shape

        # Initialize the weights and the bias
        self.weights = np.zeros(self.n)
        self.bias = 0

        # Gradient descent
        for i in range(self.iterations):
            self.update_weights()

    def update_weights(self):
        # Linear model: z = w1*x1 + w2*x2 + ... + wn*xn + b
        z = np.dot(self.X, self.weights) + self.bias

        # Sigmoid
        sigmoid = 1 / (1 + np.exp(-z))

        # Gradients of the loss with respect to the weights and the bias
        dw = (1 / self.m) * np.dot(self.X.T, (sigmoid - self.y))
        db = (1 / self.m) * np.sum(sigmoid - self.y)

        # Update the weights and the bias
        self.weights -= self.learning_rate * dw
        self.bias -= self.learning_rate * db

    def predict(self, X):
        temp_z = np.dot(X, self.weights) + self.bias
        temp_sigmoid = 1 / (1 + np.exp(-temp_z))

        # Predict class 1 when the estimated probability reaches the threshold
        decision_boundary = 0.5
        y_predicted = np.where(temp_sigmoid >= decision_boundary, 1, 0)

        return y_predicted
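
For reference, the update in `update_weights` matches gradient descent on the mean binary cross-entropy loss. The symbols $\hat{y}$, $J$ and $\alpha$ (the learning rate) are introduced here for illustration and do not appear in the notebook:

$$
\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = Xw + b
$$

$$
J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]
$$

$$
\frac{\partial J}{\partial w} = \frac{1}{m} X^{\top} (\hat{y} - y), \qquad
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)
$$

$$
w \leftarrow w - \alpha \frac{\partial J}{\partial w}, \qquad
b \leftarrow b - \alpha \frac{\partial J}{\partial b}
$$

The two gradients correspond to `dw` and `db` in the code, with `sigmoid` playing the role of $\hat{y}$.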

In [17]:
classifier = LogisticRegression()

In [18]:
classifier.fit(X_train, y_train)

In [19]:
X_train_predictions = classifier.predict(X_train)

### Training with libraries

In [29]:
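# Import under an alias so the from-scratch LogisticRegression class above is not shadowed
from sklearn.linear_model import LogisticRegression as SkLogisticRegression

In [30]:
sk_classifier = SkLogisticRegression()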

In [31]:
sk_classifier.fit(X_train, y_train)

In [32]:
sk_X_train_predictions = sk_classifier.predict(X_train)

### Performance measurement

The _precision_ metric is used to measure model performance, because we want to know what share of the model's positive (phishing) predictions were actually correct.
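
As a quick worked example, precision is the number of true positives divided by all positive predictions, TP / (TP + FP). The labels below are toy values chosen for illustration, not model output:

```python
import numpy as np

# Toy labels, for illustration only: 1 = phishing, 0 = legitimate
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives

precision = tp / (tp + fp)
print(precision)  # 0.75
```

Here three of the four positive predictions are correct. The missed phishing sample (a false negative) does not affect precision; it would lower recall instead.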

In [27]:
from sklearn.metrics import precision_score

In [34]:
print('Precision without libraries: {:.3f}'.format(precision_score(y_train, X_train_predictions)))
print('Precision with libraries: {:.3f}'.format(precision_score(y_train, sk_X_train_predictions)))

Precision without libraries: 0.883
Precision with libraries: 0.910
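
The scores above are computed on the training split. A sketch of the same measurement on held-out data follows, using a synthetic dataset from `make_classification` in place of the phishing features. The `X_syn`, `y_syn` and `model` names are hypothetical, and the numbers it prints say nothing about the phishing models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression as SkLogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the phishing dataset)
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

model = SkLogisticRegression(max_iter=1000)
model.fit(X_syn_train, y_syn_train)

# Compare precision on data the model has seen and on data it has not
train_precision = precision_score(y_syn_train, model.predict(X_syn_train))
test_precision = precision_score(y_syn_test, model.predict(X_syn_test))
print('train: {:.3f}  test: {:.3f}'.format(train_precision, test_precision))
```

A large gap between the two values would suggest overfitting.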


### Which implementation performed better? Why?

The _sklearn_ model scored better than the from-scratch implementation (0.910 vs. 0.883 precision on the training split). The most likely reason is that the library uses a more efficient optimization algorithm than the fixed-step gradient descent implemented here, which runs for only 1000 iterations with a learning rate of 0.01 and may not have converged. The library also applies regularization to reduce overfitting, and it is developed and maintained by experts in the field. In short, it has many more techniques "_under the hood_" than the from-scratch implementation, backed by the _expertise_ of professionals.
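
The library defaults behind this comparison can be inspected directly. The snippet below is a small sketch that assumes a reasonably recent scikit-learn release, since defaults can change between versions:

```python
from sklearn.linear_model import LogisticRegression as SkLogisticRegression

# Read the default hyperparameters of an unconfigured estimator
params = SkLogisticRegression().get_params()

# Solver, inverse regularization strength and iteration cap
print(params['solver'], params['C'], params['max_iter'])
```

In current releases the default solver is L-BFGS, a quasi-Newton method, and `C` is the inverse of the regularization strength, so smaller values mean stronger regularization.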

### Plots of the groups found

## Task 2.2 | KNN

### Training without libraries

### Training with libraries