
##Password Strength Classification.
Password strength is a measure of the effectiveness of a password against guessing or brute-force attacks.

In its usual form, it estimates how many trials an attacker who does not have direct access to the password would need, on average, to guess it correctly.
The strength of a password is a function of length, complexity, and unpredictability.

Using strong passwords lowers overall risk of a security breach, but strong passwords do not replace the need for other effective security controls.

The effectiveness of a password of a given strength is strongly determined by the design and implementation of the factors (knowledge, ownership, inherence).
In this project we will classify strength of password as three values(0 , 1 , 2) i.e. 0 for weak, 1 for medium, 2 for strong using machine learning algorithm.

Steps followed in this project are:

Step 1 Data preprocessing & exploration.
Step 2 Data preparation.
Step 3 Data Training & model creation.
Step 4 Performance evaluation using error & accuracy check.
For the dataset being used in this project **[here](https://www.kaggle.com/bhavikbb/password-strength-classifier-dataset
)** 

---



##Step 1: Data Preprocessing & Exploration.

In [50]:
import pandas as pd

In [51]:
data=pd.read_csv('data.csv', error_bad_lines=False)
data

b'Skipping line 2810: expected 2 fields, saw 5\nSkipping line 4641: expected 2 fields, saw 5\nSkipping line 7171: expected 2 fields, saw 5\nSkipping line 11220: expected 2 fields, saw 5\nSkipping line 13809: expected 2 fields, saw 5\nSkipping line 14132: expected 2 fields, saw 5\nSkipping line 14293: expected 2 fields, saw 5\nSkipping line 14865: expected 2 fields, saw 5\nSkipping line 17419: expected 2 fields, saw 5\nSkipping line 22801: expected 2 fields, saw 5\nSkipping line 25001: expected 2 fields, saw 5\nSkipping line 26603: expected 2 fields, saw 5\nSkipping line 26742: expected 2 fields, saw 5\nSkipping line 29702: expected 2 fields, saw 5\nSkipping line 32767: expected 2 fields, saw 5\nSkipping line 32878: expected 2 fields, saw 5\nSkipping line 35643: expected 2 fields, saw 5\nSkipping line 36550: expected 2 fields, saw 5\nSkipping line 38732: expected 2 fields, saw 5\nSkipping line 40567: expected 2 fields, saw 5\nSkipping line 40576: expected 2 fields, saw 5\nSkipping line 

Unnamed: 0,password,strength
0,kzde5577,1
1,kino3434,1
2,visi7k1yr,1
3,megzy123,1
4,lamborghin1,1
...,...,...
669635,10redtux10,1
669636,infrared1,1
669637,184520socram,1
669638,marken22a,1


In [52]:
data.head()

Unnamed: 0,password,strength
0,kzde5577,1
1,kino3434,1
2,visi7k1yr,1
3,megzy123,1
4,lamborghin1,1


In [53]:
data.tail()

Unnamed: 0,password,strength
669635,10redtux10,1
669636,infrared1,1
669637,184520socram,1
669638,marken22a,1
669639,fxx4pw4g,1


In [54]:
data.columns

Index(['password', 'strength'], dtype='object')

In [55]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 669640 entries, 0 to 669639
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   password  669639 non-null  object
 1   strength  669640 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 10.2+ MB


In [56]:
data.shape

(669640, 2)

##step 2 : Data Preparation.

In [57]:
data.isnull().sum()
# check for null values

password    1
strength    0
dtype: int64

In [58]:
data.dropna(inplace=True) # drop the null value.

In [59]:
import numpy as np
password_tuple = np.array(data) # transforming the data.
password_tuple

array([['kzde5577', 1],
       ['kino3434', 1],
       ['visi7k1yr', 1],
       ...,
       ['184520socram', 1],
       ['marken22a', 1],
       ['fxx4pw4g', 1]], dtype=object)

In [60]:
import random
random.shuffle(password_tuple)    # to shuffle the data.

In [61]:
y = [labels[1] for labels in password_tuple ] # this will be in y list

In [62]:
x = [labels[0] for labels in password_tuple ]  # this will be in x list

In [63]:
def word_char(inputs):    # defining a function to divide or split the word in characters.
    a= []
    for i in inputs:
        a.append(i)
    return a

In [64]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(tokenizer = word_char)                    
x =vect.fit_transform(x)
# tokenize the data and getting all features of the dataset.


##Step 3: Data Training & Model Creation.

In [65]:

# Split data into training data and testing data.
#Ratio used for splitting training and testing data is 8:2 respectively
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,random_state=1000)

In [66]:
from sklearn.linear_model import LogisticRegression
lr= LogisticRegression(random_state=0,multi_class='multinomial')

In [67]:

lr.fit(x_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [68]:
# Making predictions 
pred = lr.predict(x_test)

In [69]:
pred

array([1, 1, 1, ..., 1, 1, 1])

##Step 4: Performance Evaluation.

In [70]:

import numpy as np
from sklearn.metrics import mean_squared_error
print("Model\t\t\t    RootMeanSquareError   \t\t   Accuracy of the model") 
print("""Logistic Regression             \t\t {:.4f}  \t \t\t {:.4f}""".format(  np.sqrt(mean_squared_error(y_test, pred)), lr.score(x_train,y_train)))

Model			    RootMeanSquareError   		   Accuracy of the model
Logistic Regression             		 0.4257  	 		 0.8193
