In [1]:
import pandas as pd

*Importing pandas to load the dataset and explore it*

In [7]:
df = pd.read_csv("malicious_phish.csv")

In [8]:
df

,url,type
0,br-icloud.com.br,phishing
1,mp3raid.com/music/krizz_kaliko.html,benign
2,bopsecrets.org/rexroth/cr/1.htm,benign
3,http://www.garage-pirenne.be/index.php?option=...,defacement
4,http://adventure-nicaragua.net/index.php?optio...,defacement
...,...,...
651186,xbox360.ign.com/objects/850/850402.html,phishing
651187,games.teamxbox.com/xbox-360/1860/Dead-Space/,phishing
651188,www.gamespot.com/xbox360/action/deadspace/,phishing
651189,en.wikipedia.org/wiki/Dead_Space_(video_game),phishing
651190,www.angelfire.com/goth/devilmaycrytonite/,phishing


*Inspecting column types and missing values, then encoding the labels*

In [9]:
df.dtypes

url     object
type    object
dtype: object

In [10]:
df.isnull().sum()

url     0
type    0
dtype: int64

In [11]:
X = df['url']
X

0                                          br-icloud.com.br
1                       mp3raid.com/music/krizz_kaliko.html
2                           bopsecrets.org/rexroth/cr/1.htm
3         http://www.garage-pirenne.be/index.php?option=...
4         http://adventure-nicaragua.net/index.php?optio...
                                ...                        
651186              xbox360.ign.com/objects/850/850402.html
651187         games.teamxbox.com/xbox-360/1860/Dead-Space/
651188           www.gamespot.com/xbox360/action/deadspace/
651189        en.wikipedia.org/wiki/Dead_Space_(video_game)
651190            www.angelfire.com/goth/devilmaycrytonite/
Name: url, Length: 651191, dtype: object

In [12]:
# Encode the four URL classes as integer codes
df.loc[df['type'] == 'phishing', 'type'] = 0
df.loc[df['type'] == 'benign', 'type'] = 1
df.loc[df['type'] == 'defacement', 'type'] = 2
df.loc[df['type'] == 'malware', 'type'] = 3
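
*The same encoding can be written as a single dictionary lookup. The sketch below is a minimal illustration on a toy frame: the rows and the names `toy`, `label_map` and `encoded` are made up for the example and are not part of the notebook.*

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative rows only).
toy = pd.DataFrame({
    "url": ["a.example/login", "b.example/index.html",
            "c.example/?id=1", "d.example/x.exe"],
    "type": ["phishing", "benign", "defacement", "malware"],
})

# Same integer codes as the assignments above.
label_map = {"phishing": 0, "benign": 1, "defacement": 2, "malware": 3}

# .map returns NaN for any label missing from the dictionary,
# so unexpected categories are easy to spot afterwards.
encoded = toy["type"].map(label_map)

print(encoded.tolist())
print(encoded.isna().sum())
```

*With this approach the column comes out numeric when every label is found in the dictionary, so a later `astype('int')` cast would likely be unnecessary.*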

In [13]:
y = df['type']
y

0         0
1         1
2         1
3         2
4         2
         ..
651186    0
651187    0
651188    0
651189    0
651190    0
Name: type, Length: 651191, dtype: object

*Splitting the data into a training set (80%) and a test set (20%)*

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state = 1)
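
*If the classes are imbalanced, passing `stratify` keeps the class proportions the same in both splits. The sketch below shows the idea on made-up data; `urls`, `labels` and the split variable names are placeholders, and whether stratification changes the results for this dataset has not been checked here.*

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up, balanced toy data: 10 samples of class 0 and 10 of class 1.
urls = [f"site{i}.example/page" for i in range(20)]
labels = [0] * 10 + [1] * 10

# stratify=labels asks for the same class ratio in train and test.
tr_x, te_x, tr_y, te_y = train_test_split(
    urls, labels, test_size=0.2, random_state=1, stratify=labels
)

print(Counter(tr_y))
print(Counter(te_y))
```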

*Importing TfidfVectorizer from sklearn.feature_extraction.text, which removes English stop words and converts a collection of raw documents (here, URLs) to a matrix of TF-IDF features*

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [17]:
feature_extraction = TfidfVectorizer(min_df=1,stop_words='english',lowercase=True)
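
*To see what the vectorizer does with URLs, the sketch below fits the same configuration on three URLs taken from the dataset preview and prints the learned vocabulary. With the default token pattern, the URL is split on punctuation such as `.`, `/` and `-`, and single-character tokens are dropped. The names `sample_urls`, `vec` and `matrix` are illustrative only.*

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three URLs from the dataset preview shown earlier.
sample_urls = [
    "br-icloud.com.br",
    "mp3raid.com/music/krizz_kaliko.html",
    "bopsecrets.org/rexroth/cr/1.htm",
]

# Same settings as the notebook's vectorizer.
vec = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
matrix = vec.fit_transform(sample_urls)

# vocabulary_ maps each token to its column index in the matrix.
print(sorted(vec.vocabulary_))
print(matrix.shape)
```

*English stop-word removal probably has little effect on URL tokens. A character n-gram analyzer, for example `analyzer="char_wb"` with `ngram_range=(3, 5)`, is one alternative that might be worth trying; it has not been evaluated here.*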

*Fitting the vectorizer on the training URLs only, then transforming both the training and test sets*

In [18]:
x_train_tfidf = feature_extraction.fit_transform(x_train)
x_test_tfidf = feature_extraction.transform(x_test)

*Converting y_train and y_test from object dtype to int, since the encoded labels were stored in an object column*

In [19]:
y_train = y_train.astype('int')
y_test = y_test.astype('int')

*Importing LinearSVC from scikit-learn and fitting the model on the TF-IDF features*

In [20]:
from sklearn.svm import LinearSVC

In [21]:
model = LinearSVC()
model.fit(x_train_tfidf,y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)
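
*The vectorizing and fitting steps can also be bundled into a single scikit-learn pipeline, so that raw URL strings go in and predictions come out. This is a minimal sketch on made-up data, not the notebook's trained model; `toy_urls`, `toy_labels` and `pipe` are hypothetical names.*

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up training data: 0 = phishing, 1 = benign (illustrative only).
toy_urls = [
    "secure-login.example/verify/account",
    "update-account.example/login/confirm",
    "news.example/world/article.html",
    "docs.example/guide/intro.html",
]
toy_labels = [0, 0, 1, 1]

# The pipeline fits the vectorizer and the classifier together, and
# applies the fitted vectorizer automatically at prediction time.
pipe = make_pipeline(TfidfVectorizer(), LinearSVC())
pipe.fit(toy_urls, toy_labels)

prediction = pipe.predict(["login.example/verify"])
print(prediction)
```

*A pipeline also makes it harder to fit the vectorizer on test data by accident, because only `fit` ever updates it.*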

*Importing accuracy_score to measure accuracy on the training and test sets*

In [22]:
from sklearn.metrics import accuracy_score

In [23]:
prediction_on_train_data = model.predict(x_train_tfidf)
accuracy_on_train_data = accuracy_score(y_train,prediction_on_train_data)

*The model reaches about 99.7% accuracy on the training data*

In [24]:
accuracy_on_train_data

0.9973644404858797

In [25]:
prediction_on_test_data = model.predict(x_test_tfidf)
accuracy_on_test_data = accuracy_score(y_test,prediction_on_test_data)

*The model reaches about 95.6% accuracy on the test data*

In [26]:
accuracy_on_test_data

0.9562880550372777
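
*Accuracy alone can hide weak performance on rarer classes, so a per-class breakdown is often useful alongside it. The sketch below uses made-up label arrays (not results from this notebook) to show how `confusion_matrix` and `classification_report` could be applied; `y_true`, `y_pred`, `label_codes` and `names` are placeholders.*

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true and predicted codes, using the notebook's encoding:
# 0 = phishing, 1 = benign, 2 = defacement, 3 = malware.
y_true = [0, 0, 1, 1, 1, 1, 2, 3]
y_pred = [0, 1, 1, 1, 1, 1, 2, 1]

label_codes = [0, 1, 2, 3]
names = ["phishing", "benign", "defacement", "malware"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=label_codes)
print(cm)

# Precision, recall and F1 per class; zero_division=0 avoids warnings
# for classes that are never predicted.
print(classification_report(
    y_true, y_pred, labels=label_codes, target_names=names, zero_division=0
))
```

*On the real data, `y_test` and `prediction_on_test_data` would take the place of the toy arrays.*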

*Reading a URL from the user and classifying it with the trained model*

In [27]:
input_web = [input()]
input_web_features = feature_extraction.transform(input_web)

pred = model.predict(input_web_features)

# Map the predicted integer code back to its class name
if pred[0] == 0:
    print("phishing !!")
elif pred[0] == 1:
    print("benign !!")
elif pred[0] == 2:
    print("defacement!!!")
else:
    print("malware!!")

bopsecrets.org/rexroth/cr/1.htm
benign !!
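
*The prediction step can be wrapped in a small helper that returns the class name instead of printing it. The sketch below is one possible shape for it: `classify_url` and `LABEL_NAMES` are hypothetical names, and the tiny demo model is trained on made-up URLs only so that the block runs on its own.*

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Inverse of the notebook's label encoding.
LABEL_NAMES = {0: "phishing", 1: "benign", 2: "defacement", 3: "malware"}


def classify_url(url, vectorizer, classifier):
    """Return the class name predicted for a single URL string."""
    features = vectorizer.transform([url])
    code = int(classifier.predict(features)[0])
    return LABEL_NAMES.get(code, "unknown")


# Tiny stand-in model trained on made-up URLs (illustrative only).
demo_urls = [
    "secure-login.example/verify/account",
    "update-account.example/login/confirm",
    "news.example/world/article.html",
    "docs.example/guide/intro.html",
]
demo_codes = [0, 0, 1, 1]

demo_vectorizer = TfidfVectorizer()
demo_model = LinearSVC().fit(demo_vectorizer.fit_transform(demo_urls), demo_codes)

result = classify_url("docs.example/guide/intro.html", demo_vectorizer, demo_model)
print(result)
```

*With the notebook's objects, the call would be `classify_url(input(), feature_extraction, model)`.*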


*The output above is the predicted type of the URL that was entered*