## Password Strength Check

This notebook is a small project exploring how password strength can be assessed by training an ML model on a dataset of passwords of varying length and complexity, each labelled "weak", "medium" or "strong".

To try this out, I've used a free dataset available on [Kaggle](https://www.kaggle.com/datasets/bhavikbb/password-strength-classifier-dataset?resource=download).

To start with, we import the necessary packages: `matplotlib` and `seaborn` for some appealing plotting, `pandas` for data reading and manipulation, and `numpy` for numerical processing. For the machine learning itself, we will try a random forest classifier. `sklearn` covers all of this.

In [50]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd
import getpass
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Some rows contain extra ','s that are part of the actual passwords, so they parse as having
# too many fields. Ideally we would keep these rows, but the option below makes pandas skip
# them silently instead of raising an error or warning for each one.
# on_bad_lines needs pandas >= 1.3; older versions used error_bad_lines=False, warn_bad_lines=False
# (both removed in pandas 2.0).
df = pd.read_csv("password.csv", on_bad_lines="skip")
print(df.head(10))
print('\nNumber of entries;\t ' + str(len(df)))

           password  strength
0          kzde5577         1
1          kino3434         1
2         visi7k1yr         1
3          megzy123         1
4       lamborghin1         1
5  AVYq1lDE4MgAZfNt         2
6          u6c8vhow         1
7          v1118714         1
8      universe2908         1
9          as326159         1

Number of entries;	 669640


Above, we see that we have ~670,000 passwords in our dataframe, each with an assigned strength value from 0 to 2:

##### 0: Password is weak
##### 1: Password is medium
##### 2: Password is strong

To make things easier, we can convert these from a numeric to a categorical type by replacing the integer values with their corresponding strength labels.
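
As a quick illustration of what `Series.map` does with a dictionary, here is a minimal sketch on a hypothetical toy frame (the `toy` and `labels` names and the rows are made up for the example, not taken from the dataset). It also counts the rows per class, which is a cheap way to see how balanced the labels are.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "password": ["abc", "kzde5577", "AVYq1lDE4MgAZfNt", "qwerty", "megzy123"],
    "strength": [0, 1, 2, 0, 1],
})

# Replace each integer code with its label; codes missing from the
# dictionary would become NaN, so we check for that afterwards
labels = {0: "Weak", 1: "Medium", 2: "Strong"}
toy["strength"] = toy["strength"].map(labels)
unmapped = int(toy["strength"].isna().sum())

# Count how many passwords fall into each class
counts = toy["strength"].value_counts()
print(counts)
print("Unmapped rows:", unmapped)
```

Running the same `value_counts()` call on the real `df` would show whether one strength class dominates the dataset.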

In [None]:
df = df.dropna()
df["strength"] = df["strength"].map({0: "Weak", 1: "Medium", 2: "Strong"})
df.sample(10)

Good. Now our dataset is both free of NaN values and readily classifiable based on the strength property. We can now look at actually training an ML model on it as a means of predicting strength.

One of the first things we need to do is tokenisation: breaking each password into its individual characters, so that the model can learn how combinations of letters, numbers and other characters lead to higher or lower strength.

A small appendix on tokenisation will be at the end of this notebook, just for posterity's sake.
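
To make the idea concrete on a small scale, here is a minimal sketch of character-level TF-IDF on three made-up passwords (`toy_passwords` and `split_chars` are hypothetical names used only for this example). One detail worth knowing is that `TfidfVectorizer` lowercases its input by default, even when a custom tokeniser is supplied, so `A` and `a` end up as the same feature unless `lowercase=False` is passed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up passwords; the first two differ only in the case of one letter
toy_passwords = ["abc123", "Abc123", "zz!9"]

def split_chars(password):
    """Tokenise a password into a list of its characters."""
    return list(password)

# Default settings: input is lowercased before tokenising.
# token_pattern=None avoids a warning about the unused default pattern.
default_vec = TfidfVectorizer(tokenizer=split_chars, token_pattern=None)
default_matrix = default_vec.fit_transform(toy_passwords)

# Case-preserving variant: upper- and lower-case letters are separate features
cased_vec = TfidfVectorizer(tokenizer=split_chars, token_pattern=None, lowercase=False)
cased_matrix = cased_vec.fit_transform(toy_passwords)

print(sorted(default_vec.vocabulary_))
print(sorted(cased_vec.vocabulary_))
print(default_matrix.shape, cased_matrix.shape)
```

Each row of the resulting sparse matrix is one password and each column is one character. With the default settings the first two passwords get identical rows, whereas the case-preserving variant has an extra column for `A`.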

In [48]:
# To start tokenising our data, we first define a function that splits a password into
# a list of its characters

def chars(password):
    return list(password)

x_data = np.array(df['password'])
y_data = np.array(df['strength'])

# With our data separated into arrays, we can vectorise it using our chars function (the tokeniser)
# and the TfidfVectorizer from sklearn (the vectoriser).
tdif = TfidfVectorizer(tokenizer=chars)
x_data = tdif.fit_transform(x_data)
xtrain, xtest, ytrain, ytest = train_test_split(x_data, y_data, test_size=0.05, random_state=27)


Now that we've tokenised the passwords and split the dataset into training and test sets, we can train our model and evaluate it.
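
As an aside, the vectorising and training steps can also be bundled into a single object. The sketch below is one possible arrangement rather than what this notebook does: it uses `make_pipeline` from sklearn on a handful of made-up passwords with made-up labels (the `passwords`, `labels` and `pipeline` names are hypothetical), so that raw strings can be passed straight to `predict`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up training data; the labels are for illustration only
passwords = [
    "abc", "12345", "qwerty",
    "kzde5577", "megzy123", "visi7k1yr",
    "AVYq1lDE4MgAZfNt", "Xk9#mP2$vL7@qR4!", "tR8&nW3*zB6^hJ1%",
]
labels = ["Weak"] * 3 + ["Medium"] * 3 + ["Strong"] * 3

# The built-in list() splits a string into characters, so it can act as the tokeniser
pipeline = make_pipeline(
    TfidfVectorizer(tokenizer=list, token_pattern=None),
    RandomForestClassifier(n_estimators=20, random_state=27),
)
pipeline.fit(passwords, labels)

# Raw strings go in; the pipeline vectorises them before predicting
pred = pipeline.predict(["hunter2"])
print(pred[0])
```

With so few training examples the prediction itself means little. The point is only that the fitted vectoriser and model travel together, which removes the need to call `transform` by hand before every prediction.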

In [49]:
model = RandomForestClassifier(n_jobs=10, verbose=3) #verbose will give updates on current tree being built
model.fit(xtrain, ytrain)
print('Model score: {}'.format(str(model.score(xtest, ytest))))

[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.


building tree 1 of 100
building tree 2 of 100
building tree 3 of 100
building tree 4 of 100
building tree 5 of 100
building tree 6 of 100
building tree 7 of 100
building tree 8 of 100
building tree 9 of 100
building tree 10 of 100
building tree 11 of 100
building tree 12 of 100
building tree 13 of 100
building tree 14 of 100
building tree 15 of 100
building tree 16 of 100
building tree 17 of 100
building tree 18 of 100
building tree 19 of 100
building tree 20 of 100
building tree 21 of 100
building tree 22 of 100


[Parallel(n_jobs=10)]: Done  12 tasks      | elapsed:   32.3s


building tree 23 of 100
building tree 24 of 100
building tree 25 of 100
building tree 26 of 100
building tree 27 of 100
building tree 28 of 100
building tree 29 of 100
building tree 30 of 100
building tree 31 of 100
building tree 32 of 100
building tree 33 of 100
building tree 34 of 100
building tree 35 of 100
building tree 36 of 100
building tree 37 of 100
building tree 38 of 100
building tree 39 of 100
building tree 40 of 100
building tree 41 of 100
building tree 42 of 100
building tree 43 of 100
building tree 44 of 100
building tree 45 of 100
building tree 46 of 100
building tree 47 of 100
building tree 48 of 100
building tree 49 of 100
building tree 50 of 100
building tree 51 of 100
building tree 52 of 100
building tree 53 of 100
building tree 54 of 100
building tree 55 of 100
building tree 56 of 100
building tree 57 of 100
building tree 58 of 100
building tree 59 of 100
building tree 60 of 100
building tree 61 of 100
building tree 62 of 100
building tree 63 of 100
building tree 64

[Parallel(n_jobs=10)]: Done 100 out of 100 | elapsed:  2.6min finished
[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done  12 tasks      | elapsed:    0.0s


Model score: 0.9566334149692372


[Parallel(n_jobs=10)]: Done 100 out of 100 | elapsed:    0.3s finished


From the above, we see that our random forest classifies the test data with around 96% accuracy (a score of 0.957), which is decent enough and could potentially be improved by shrinking the test split a little further to leave more data for training.
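
A single accuracy figure can hide weak performance on the rarer classes, and the first ten rows printed earlier are mostly class 1, which hints that the classes may be unevenly sized. The sketch below shows how a per-class breakdown could be obtained with sklearn's metrics. The `y_true` and `y_pred` lists are made-up stand-ins for `ytest` and `model.predict(xtest)`, not results from this model.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels standing in for ytest and model.predict(xtest)
y_true = ["Weak", "Weak", "Medium", "Medium", "Medium",
          "Medium", "Medium", "Medium", "Strong", "Strong"]
y_pred = ["Medium", "Weak", "Medium", "Medium", "Medium",
          "Medium", "Medium", "Medium", "Strong", "Medium"]

order = ["Weak", "Medium", "Strong"]

# Overall fraction of correct predictions
acc = accuracy_score(y_true, y_pred)

# Rows are the true classes, columns the predicted classes, both in `order`
cm = confusion_matrix(y_true, y_pred, labels=order)

print("Accuracy:", acc)
print(cm)
print(classification_report(y_true, y_pred, labels=order))
```

In this toy case the overall accuracy is 0.8, yet only half of the "Weak" and "Strong" passwords are recovered, which is the kind of gap the per-class report makes visible.
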
With this acceptable level of accuracy, we can now try it out for ourselves by passing our own custom password to our fitted model.

In [67]:
# getpass was already imported at the top of the notebook
usr = getpass.getpass('Please enter password: ')
dat = tdif.transform([usr]).toarray()
est = model.predict(dat)[0]  # predict returns an array; take its single label

# Map each label to a bar length. A dict lookup is used rather than match/case,
# which needs Python 3.10+ and is a SyntaxError on older versions.
bar_lengths = {"Weak": 3, "Medium": 6, "Strong": 9}
bar = u"\u25A2" * bar_lengths[est]

print('Password Strength: {}  {}'.format(bar, est))

In [58]:
print('Password Strength: u"\u25A2"*3  {}'.format(est[0]))

Password Strength: u"▢"*3  Strong
