Feature selection is the process of identifying and keeping the features that are most relevant to the prediction task. Feature engineering is the manual creation of new features from existing ones by applying a transformation or operation to them.
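To make the distinction concrete, here is a minimal sketch on a tiny hand-made table. The DataFrame `toy` and its column names are hypothetical and are not taken from the phishing dataset used below.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "url": ["a.com/login", "b-c.org", "x.net/a/b"],
    "rank": [10, 2000, 350],
    "noise": [0.3, 0.1, 0.9],
})

# Feature engineering: derive new columns from an existing one
toy["url_len"] = toy["url"].str.len()
toy["has_dash"] = toy["url"].str.contains("-").astype(int)

# Feature selection: keep only the columns judged relevant
selected = toy[["rank", "url_len", "has_dash"]]
print(selected)
```

Engineering adds columns, selection removes them; in practice the two are usually applied one after the other.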

# Importing Required Libraries

- **Pandas** : For data processing and CSV file I/O (e.g. pd.read_csv)
- **Numpy** : For linear algebra
- **Matplotlib** : For data visualization
- **sklearn.model_selection** : For splitting data into train and test sets
- **sklearn.linear_model.LogisticRegression** : For logistic regression
- **sklearn.metrics** : For evaluation metrics

In [None]:
import pandas as pd
import numpy as np
import copy
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression  # For Logistic Regression
from sklearn.ensemble import RandomForestClassifier # For RFC
from sklearn.svm import SVC                               #For SVM
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import matthews_corrcoef
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import roc_curve, auc
sns.set(style="ticks", color_codes=True)

## Loading the complete data into a pandas DataFrame

In [None]:
df = pd.read_csv("../input/phishing-data/combined_dataset.csv")
df.head()

The data columns are described below:
- Domain: The URL itself.
- Ranking: Page ranking.
- isIp: Whether the web link contains an IP address.
- valid: Fetched from Google's whois API; tells us more about the current status of the URL's registration.
- activeDuration: Also from the whois API. Gives the time elapsed since registration up until now.
- urlLen: The length of the URL.
- is@: 1 if the link contains an '@' character.
- isredirect: If the link has double dashes, there is a chance that it is a redirect. 1 -> multiple dashes present together.
- haveDash: Whether there are any dashes in the domain name.
- domainLen: The length of just the domain name.
- noOfSubdomain: The number of subdomains present in the URL.
- Labels: 0 -> Legitimate website, 1 -> Phishing link / spam link

# EDA

In [None]:
# isnull() and isna() are aliases, so a single call is enough
df.isna().sum()
#df.info()

- The dataset has no null values

In [None]:
df.describe()

In [None]:
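# Pass the column as x explicitly; newer seaborn versions treat the first positional argument as `data`
sns.countplot(x=df['label'])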

- The target classes are distributed in an approximately 4:6 ratio

## Preparation of Data

In [None]:
X= df.drop(['label', 'domain'], axis=1)
Y= df.label
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.40)
print("Training set has {} samples.".format(x_train.shape[0]))
print("Testing set has {} samples.".format(x_test.shape[0]))
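Because the classes are not perfectly balanced, it can help to keep the class ratio identical in both splits and to fix the random seed so the results are repeatable. The sketch below uses synthetic data (`X_demo` and `y_demo` are made up, with a 4:6 class ratio assumed for illustration) and the `stratify` and `random_state` parameters of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 4 features, classes in a 4:6 ratio
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([0] * 400 + [1] * 600)

# stratify keeps the class ratio in both splits; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.40, stratify=y_demo, random_state=42
)

print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test: ", y_te.mean())
```

The same two arguments could be added to the split above; whether that changes the reported scores depends on the data.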


# Dimensionality Reduction



## Feature Selection Methods

###  Filter Methods


![title](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/08/Screenshot-from-2018-08-10-12-07-43.png)

![title](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2019/11/How-to-Choose-Feature-Selection-Methods-For-Machine-Learning.png)

-- Filter feature selection methods apply a statistical measure to assign a score to each feature. The features are ranked by score and then either kept or removed from the dataset. These methods are often univariate: they consider each feature independently, or only with regard to the dependent variable.
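As a rough illustration of a univariate filter, the sketch below scores each feature with an ANOVA F-test and keeps the top three, using scikit-learn's `SelectKBest` and `f_classif`. The data comes from `make_classification` and is purely synthetic; the choice of `k=3` is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 8 features, of which 3 carry signal
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3,
    n_redundant=0, random_state=0, shuffle=False
)

# Score every feature against the target and keep the 3 highest-scoring ones
selector = SelectKBest(score_func=f_classif, k=3)
X_sel = selector.fit_transform(X_demo, y_demo)

print("F-scores per feature:", selector.scores_.round(1))
print("Indices of kept features:", selector.get_support(indices=True))
```

Other scoring functions, such as `mutual_info_classif`, can be swapped in when the relationship with the target is not linear.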

-- A heatmap of the correlations between features is a useful first look.

In [None]:
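# Drop the text column first; recent pandas versions raise an error on non-numeric columns in corr()
sns.heatmap(df.drop(columns=['domain']).corr(), annot=True)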

In [None]:
# Correlation of each numeric feature with the target, sorted ascending
df.drop(columns=['domain']).corr()['label'].sort_values()

- We can select features by considering their correlation with the target variable
- We can also drop one of two features if they have a high correlation with each other, since both have much the same effect on the target
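The second point can be sketched as follows: look at the upper triangle of the absolute correlation matrix and drop any feature that is almost a copy of an earlier one. The frame `demo`, its columns, and the 0.9 cutoff are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Synthetic data: f2 is a near-duplicate of f1, f3 is independent
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "f1": base,
    "f2": base * 2 + rng.normal(scale=0.01, size=200),
    "f3": rng.normal(size=200),
})

# Keep only the upper triangle so each feature pair is inspected once
corr = demo.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop every column that is highly correlated with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = demo.drop(columns=to_drop)

print("Dropped:", to_drop)
print("Remaining:", list(reduced.columns))
```

Which member of a correlated pair to keep is a judgment call; this sketch simply keeps the one that appears first.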

In [None]:
# Correlation of every numeric feature with the target
a = df.drop(columns=['domain']).corr()['label']
# Saving column names in a variable
variables = x_train.columns
# Keep a feature only if its absolute correlation with the target exceeds the 0.1 threshold
variable = [col for col in variables if abs(a[col]) > 0.1]
variable

In [None]:
def RFC(x_train, y_train, x_test, y_test):
    # Create the random forest classifier
    RFClass1 = RandomForestClassifier(max_depth=5, random_state=0)
    # Train the model using training data
    RFClass1.fit(x_train, y_train)

    # Test the model using testing data
    y_pred_rfc1 = RFClass1.predict(x_test)

    # Plot the confusion matrix; fmt='d' shows the counts as integers
    cm = confusion_matrix(y_test, y_pred_rfc1)
    sns.heatmap(cm, annot=True, fmt='d')
    print("f1 score is ", f1_score(y_test, y_pred_rfc1, average='weighted'))
    print("matthews correlation coefficient is ", matthews_corrcoef(y_test, y_pred_rfc1))
    print("The accuracy of the Random forest classifier on testing data is: ", 100.0 * accuracy_score(y_test, y_pred_rfc1))

def SVM_C(x_train, y_train, x_test, y_test):
    # Create the SVM classifier and train it
    svc = SVC()
    svc.fit(x_train, y_train)
    # Test the model using testing data
    y_pred_svc = svc.predict(x_test)
    cm = confusion_matrix(y_test, y_pred_svc)
    sns.heatmap(cm, annot=True, fmt='d')
    print("f1 score is ", f1_score(y_test, y_pred_svc, average='weighted'))
    print("matthews correlation coefficient is ", matthews_corrcoef(y_test, y_pred_svc))
    print("The accuracy of SVC on testing data is: ", 100.0 * accuracy_score(y_test, y_pred_svc))

def LogReg(x_train, y_train, x_test, y_test):
    # The target is binary, so the multi_class option (deprecated in recent scikit-learn) is left out
    LogReg1 = LogisticRegression(random_state=0, solver='newton-cg')
    # Train the model using training data
    LogReg1.fit(x_train, y_train)
    # Test the model using testing data
    y_pred_log = LogReg1.predict(x_test)
    cm = confusion_matrix(y_test, y_pred_log)
    sns.heatmap(cm, annot=True, fmt='d')
    print("f1 score is ", f1_score(y_test, y_pred_log, average='weighted'))
    print("matthews correlation coefficient is ", matthews_corrcoef(y_test, y_pred_log))
    print("The accuracy of Logistic Regression on testing data is: ", 100.0 * accuracy_score(y_test, y_pred_log))

In [None]:
RFC(x_train=x_train[variable],y_train=y_train,x_test=x_test[variable],y_test=y_test)
#SVM_C(x_train=x_train[variable],y_train=y_train,x_test=x_test[variable],y_test=y_test)

In [None]:
variable.remove("nosOfSubdomain")

In [None]:
RFC(x_train=x_train[variable],y_train=y_train,x_test=x_test[variable],y_test=y_test)

- Feature selection using filter methods did not improve our accuracy

## Linear Dimensionality Reduction Methods
- PCA (Principal Component Analysis) : Popularly used for dimensionality reduction in continuous data, PCA rotates the data and projects it onto the directions of greatest variance. These directions, which are linear combinations of the original features, are the principal components.
- Factor Analysis : A technique used to reduce a large number of variables to a smaller number of factors. The observed values are expressed as functions of a number of possible causes in order to find which are the most important. The observations are assumed to be generated by a linear transformation of lower-dimensional latent factors plus added Gaussian noise.
- LDA (Linear Discriminant Analysis) : Projects the data so that class separability is maximised. Examples from the same class are placed close together by the projection, and examples from different classes are placed far apart.
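Unlike PCA, LDA uses the class labels when it looks for a projection. A minimal sketch with scikit-learn's `LinearDiscriminantAnalysis` on synthetic two-class data (the data and variable names are assumptions for illustration) looks like this. With two classes, LDA can produce at most one component.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data with 6 features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3,
    n_redundant=0, random_state=0
)

# LDA is supervised: it needs the labels to find the separating direction
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X_demo, y_demo)

print("Shape after projection:", X_lda.shape)
```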

### Steps Involved in PCA

    - Standardize the data. (with mean =0 and variance = 1)
    - Compute the Covariance matrix of dimensions.
    - Obtain the Eigenvectors and Eigenvalues from the covariance matrix (we can also use the correlation matrix or even singular value decomposition; however, this post focuses on the covariance matrix).
    - Sort eigenvalues in descending order and choose the top k Eigenvectors that correspond to the k largest eigenvalues (k will become the number of dimensions of the new feature subspace k≤d, d is the number of original dimensions).
    - Construct the projection matrix W from the selected k Eigenvectors.
    - Transform the original data set X via W to obtain the new k-dimensional feature subspace Y.
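The steps above can be sketched end to end in a few lines of NumPy and checked against scikit-learn's `PCA`. The data here is synthetic, and `k = 2` is an arbitrary choice; the two results should agree up to the sign of each component, since an eigenvector and its negation are equally valid.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated data: 200 samples, d = 5 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Step 1: standardize to mean 0 and variance 1
X_s = StandardScaler().fit_transform(X_demo)
# Step 2: covariance matrix of the features
cov = np.cov(X_s, rowvar=False)
# Step 3: eigen decomposition (eigh is meant for symmetric matrices)
vals, vecs = np.linalg.eigh(cov)
# Step 4: sort eigenvalues in descending order and keep the top k
order = np.argsort(vals)[::-1]
k = 2
# Step 5: projection matrix W from the k selected eigenvectors (as columns)
W = vecs[:, order[:k]]
# Step 6: project the data onto the new k-dimensional subspace
Y_demo = X_s @ W

# Compare with scikit-learn, ignoring the sign of each component
Y_sk = PCA(n_components=k).fit_transform(X_s)
print("Matches sklearn up to sign:", np.allclose(np.abs(Y_demo), np.abs(Y_sk), atol=1e-6))
```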

### Standardization of data
![title](https://miro.medium.com/max/700/0*IQtYrjdIiNl88F7K)

In [None]:
X= df.drop(['label', 'domain'], axis=1).values
Y= df.label.values
x_std = StandardScaler().fit_transform(X)
# x_std values are standardized to mean 0 and variance 1 per feature (they are not confined to the range -1 to +1).
x_std
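As a quick check of what standardization does, the sketch below compares `StandardScaler` with the formula z = (x - mean) / std computed by hand on a small made-up array. Standardized values have mean 0 and variance 1 in each column, but they are not bounded, so an outlier can land well outside -1 to +1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up array; the first column contains an outlier (10.0)
X_demo = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [10.0, 500.0]])

# Manual standardization with the population standard deviation (ddof=0)
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
scaled = StandardScaler().fit_transform(X_demo)

print("Same as manual formula:", np.allclose(manual, scaled))
print("Column means:", scaled.mean(axis=0).round(6))
print("Column variances:", scaled.var(axis=0).round(6))
print("Largest standardized value:", scaled.max())
```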

### Create a covariance matrix for Eigen decomposition

In [None]:
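mean_vec = np.mean(x_std, axis=0)
# Covariance matrix computed by hand from the centered data
cov_mat = (x_std - mean_vec).T.dot((x_std - mean_vec)) / (x_std.shape[0]-1)
#print('Covariance matrix \n%s' %cov_mat)
print('Covariance matrix \n')
# np.cov gives the same matrix (rowvar=False: each column is a variable)
cov_mat= np.cov(x_std, rowvar=False)
cov_mat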

In [None]:
cov_mat = np.cov(x_std.T) 
eig_vals, eig_vecs = np.linalg.eig(cov_mat) 
print('Eigenvectors \n%s' %eig_vecs) 
print('\nEigenvalues \n%s' %eig_vals)

In [None]:
# Each eigenvector is a column of eig_vecs, so iterate over the transpose
sq_eig = [np.sum(ev**2) for ev in eig_vecs.T]
print("sum of squares of the values in each eigenvector is \n", sq_eig)
# Every eigenvector should have unit length
for ev in eig_vecs.T:
    np.testing.assert_array_almost_equal(1.0, np.linalg.norm(ev))

### Sorting eigenvalues

In [None]:
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
#print(type(eig_pairs))
# Sort the (eigenvalue, eigenvector) tuples from high to low.
# Sorting on the eigenvalue only avoids comparing arrays when two eigenvalues tie.
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)
# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('\n\n\nEigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])

### Explained Variance 

In [None]:
tot = sum(eig_vals) 
print("\n",tot) 
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)] 
print("\n\n1. Variance Explained\n",var_exp) 
cum_var_exp = np.cumsum(var_exp) 
print("\n\n2. Cumulative Variance Explained\n",cum_var_exp) 
print("\n\n3. Percentage of variance the first 2 principal components each contain\n ",var_exp[0:2]) 
print("\n\n4. Percentage of variance the first 2 principal components together contain\n",sum(var_exp[0:2]))

### Construct the projection matrix W from the selected k eigenvectors

In [None]:
#print(eig_pairs[i][1] for i in range(0,5)) 
#print(eig_pairs[2][1]) 
# hstack: stacks arrays in sequence horizontally (column wise)
matrix_w = np.hstack((eig_pairs[0][1].reshape(-1,1), eig_pairs[1][1].reshape(-1,1)))
print('Matrix W:\n', matrix_w)

In [None]:
a = x_std.dot(matrix_w) 
principalDf = pd.DataFrame(data = a , columns = ['principal component 1', 'principal component 2']) 
principalDf.head()

In [None]:
finalDf = pd.concat([principalDf,pd.DataFrame(Y,columns = ['label'])], axis = 1) 
finalDf.head()
finalDf.isna().sum()

In [None]:
# Dropping null values
finalDf.dropna(inplace=True)
sns.countplot(x=finalDf['label'])

In [None]:
X=finalDf.drop(['label'], axis=1)
Y=finalDf.label
x_train_pc, x_test_pc, y_train_pc, y_test_pc = train_test_split(X,Y,test_size=0.40)
print("Training set has {} samples.".format(x_train_pc.shape[0]))
print("Testing set has {} samples.".format(x_test_pc.shape[0]))


In [None]:
LogReg(x_train=x_train_pc,y_train=y_train_pc,x_test=x_test_pc,y_test=y_test_pc)
#RFC(x_train=x_train_pc,y_train=y_train_pc,x_test=x_test_pc,y_test=y_test_pc)


In [None]:
fig = plt.figure(figsize = (8,5)) 
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel('Principal Component 1', fontsize = 15) 
ax.set_ylabel('Principal Component 2', fontsize = 15) 
ax.set_title('2 Component PCA', fontsize = 20) 
targets = [0, 1]
colors = ['r', 'g']
for target, color in zip(targets,colors):
    indicesToKeep = finalDf['label'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1'] , finalDf.loc[indicesToKeep, 'principal component 2'] , c = color , s = 50)
# Add the legend and grid once, after all classes are plotted
# (calling ax.grid() on every pass would toggle the grid back off)
ax.legend(targets)
ax.grid()

- PCA is not helpful here because the dataset already has few dimensions

In [None]:
pca = PCA(n_components=2)
# We can also pass a fraction to PCA, e.g. pca = PCA(.95), which means we want
# to keep 95% of the variance; PCA then returns the number of components needed
# to describe 95% of the variance. We know from the computation above that
# 2 components are enough, so we pass 2 components here.
principalComponents = pca.fit_transform(x_std)
principalDDf = pd.DataFrame(data = principalComponents , columns = ['principal component 1', 'principal component 2'])
finalDDf = pd.concat([principalDDf, pd.DataFrame(Y,columns = ['label'])], axis = 1)
finalDDf.head(5) # prints the top 5 rows
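The fractional form of `n_components` can be sketched as follows. The data is synthetic, so the number of components it reports says nothing about the phishing dataset; the point is only to show how `n_components_` and `explained_variance_ratio_` can be read back after fitting.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated data with 10 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))
X_s = StandardScaler().fit_transform(X_demo)

# A float between 0 and 1 asks PCA for as many components as are needed
# to explain at least that fraction of the variance
pca_95 = PCA(n_components=0.95)
X_red = pca_95.fit_transform(X_s)

print("Components kept:", pca_95.n_components_)
print("Cumulative explained variance:", pca_95.explained_variance_ratio_.cumsum())
```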