## Password Strength Check

This notebook is a small project exploring how password strength can be assessed by training an ML model on a dataset of passwords of varying length and complexity, each labelled "weak", "medium" or "strong".

To try this out, I've used a free dataset available on [Kaggle](https://www.kaggle.com/datasets/bhavikbb/password-strength-classifier-dataset?resource=download).

To start with, we can import the necessary packages: `matplotlib` and `seaborn` for some appealing plotting, `pandas` for data reading and manipulation, and `numpy` for numerical processing. For the actual machine learning itself, we will try a random forest classifier; `sklearn` covers the classifier, the vectoriser and the train/test split.

In [11]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import pandas as pd
import getpass
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

#When reading in the data, passwords that contain extra ','s make a line parse into more fields than the
#two columns expected. These malformed lines are skipped rather than raising an error, so a small number
#of passwords are lost. on_bad_lines="skip" (pandas >= 1.3) replaces the deprecated
#error_bad_lines=False, warn_bad_lines=False pair and also suppresses the per-line warnings
df = pd.read_csv("password.csv", on_bad_lines="skip")
print(df.head(10))
print('\nNumber of entries;\t ' + str(len(df)))

           password  strength
0          kzde5577         1
1          kino3434         1
2         visi7k1yr         1
3          megzy123         1
4       lamborghin1         1
5  AVYq1lDE4MgAZfNt         2
6          u6c8vhow         1
7          v1118714         1
8      universe2908         1
9          as326159         1

Number of entries;	 669640
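
The bad-line handling in the cell above can be illustrated without pandas. This is a minimal sketch using only the standard library and a made-up four-line file (the password `ab,cd1234` is hypothetical); it shows why an unquoted comma inside a password produces a row with too many fields, which is the kind of row that gets skipped.

```python
import csv
import io

# Hypothetical two-column file; the second data row's password contains a comma.
raw = "password,strength\nkzde5577,1\nab,cd1234,1\nAVYq1lDE4MgAZfNt,2\n"

rows = list(csv.reader(io.StringIO(raw)))
header, body = rows[0], rows[1:]

# A row is "bad" when its field count differs from the header's.
good = [r for r in body if len(r) == len(header)]
bad = [r for r in body if len(r) != len(header)]

print(len(good), "rows kept;", len(bad), "row skipped:", bad)
```

So skipping bad lines keeps the load from failing, at the cost of dropping the affected passwords rather than keeping them.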


Above, we see that we have ~670,000 passwords in our dataframe, each with an assigned strength value from 0-2;

##### 0: Password is weak
##### 1: Password is medium
##### 2: Password is strong

To make things easier, we can convert these from a numeric to a categorical type by replacing the int values with their corresponding strength label.

In [12]:
df = df.dropna()
df["strength"] = df["strength"].map({0: "Weak", 1: "Medium", 2: "Strong"})
df.sample(10)

                 password  strength
212733   ROkqr6TQ2NAL9s3G    Strong
507969          zefife332    Medium
601414    Hamma.abbes1995    Strong
661131     513UHoxUJOkETY    Strong
91450    7BSgLQDc0NwQZygP    Strong
471432             f1adb1      Weak
384114  s0p0rt3@pj.gob.pe    Strong
483779          24JUILLET    Medium
291846          utoryg272    Medium
632106         vumisuz520    Medium


Good. Now we can see that our dataset is free of NaN values and has a readable strength label for each password. We can now look at actually training an ML model on it as a means of predicting strength.
One of the first things we need to do is tokenisation: breaking each password into its individual characters (letters, digits and symbols) so the model can learn how their combination leads to higher or lower strength.

A small appendix on tokenisation is at the end of this notebook, just for posterity's sake.

In [13]:
#To start tokenising our data, we first define a function that deconstructs a password into a
#list of its characters

def chars(password):
    return list(password)

x_data = np.array(df['password'])
y_data = np.array(df['strength'])

#With our data separated into arrays, we can run the vectorisation using our chars function (the tokeniser)
#and the TfidfVectorizer class from sklearn (the vectoriser).
tdif = TfidfVectorizer(tokenizer=chars)
x_data = tdif.fit_transform(x_data)
xtrain, xtest, ytrain, ytest = train_test_split(x_data,y_data,test_size=0.05,random_state=27)
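
One detail of the vectoriser step is worth checking: `TfidfVectorizer` lowercases its input by default (`lowercase=True`), and this happens before a custom `tokenizer` is called. The sketch below is a small check of that behaviour, fitting on three passwords taken from the outputs above rather than on the full dataset, and comparing the learned vocabulary with and without lowercasing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def chars(password):
    # Same character-level tokeniser as in the notebook.
    return list(password)

# Three passwords copied from the notebook outputs, for illustration only.
passwords = ["kzde5577", "AVYq1lDE4MgAZfNt", "Hamma.abbes1995"]

# Default settings: the text is lowercased before chars() sees it.
default_vec = TfidfVectorizer(tokenizer=chars)
default_vec.fit(passwords)

# Case preserved: upper- and lower-case letters become separate features.
cased_vec = TfidfVectorizer(tokenizer=chars, lowercase=False)
cased_vec.fit(passwords)

default_vocab = sorted(default_vec.vocabulary_)
cased_vocab = sorted(cased_vec.vocabulary_)
has_upper_default = any(c.isupper() for c in default_vocab)
has_upper_cased = any(c.isupper() for c in cased_vocab)

print(has_upper_default, has_upper_cased)
```

Since the mix of upper- and lower-case letters is plausibly a signal of password strength, passing `lowercase=False` in the notebook may be worth trying; whether it changes the model score is not something this sketch measures.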


Now that we've tokenised the passwords and split the dataset into training and test sets, we can train our
model and evaluate it.

In [14]:
model = RandomForestClassifier(n_jobs=10, verbose=3) #verbose will give updates on current tree being built
model.fit(xtrain, ytrain)
print('Model score: {}'.format(str(model.score(xtest, ytest))))

[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.


building tree 1 of 100building tree 2 of 100

building tree 3 of 100
building tree 4 of 100
building tree 5 of 100
building tree 6 of 100
building tree 7 of 100
building tree 8 of 100
building tree 9 of 100
building tree 10 of 100
building tree 11 of 100
building tree 12 of 100
building tree 13 of 100
building tree 14 of 100
building tree 15 of 100
building tree 16 of 100
building tree 17 of 100
building tree 18 of 100
building tree 19 of 100
building tree 20 of 100
building tree 21 of 100
building tree 22 of 100
building tree 23 of 100


[Parallel(n_jobs=10)]: Done  12 tasks      | elapsed:   31.5s


building tree 24 of 100
building tree 25 of 100
building tree 26 of 100
building tree 27 of 100
building tree 28 of 100
building tree 29 of 100
building tree 30 of 100
building tree 31 of 100
building tree 32 of 100
building tree 33 of 100
building tree 34 of 100
building tree 35 of 100
building tree 36 of 100
building tree 37 of 100
building tree 38 of 100
building tree 39 of 100
building tree 40 of 100
building tree 41 of 100
building tree 42 of 100
building tree 43 of 100
building tree 44 of 100
building tree 45 of 100
building tree 46 of 100
building tree 47 of 100
building tree 48 of 100
building tree 49 of 100
building tree 50 of 100
building tree 51 of 100
building tree 52 of 100
building tree 53 of 100
building tree 54 of 100
building tree 55 of 100
building tree 56 of 100
building tree 57 of 100
building tree 58 of 100
building tree 59 of 100
building tree 60 of 100
building tree 61 of 100
building tree 62 of 100
building tree 63 of 100
building tree 64 of 100
building tree 65

[Parallel(n_jobs=10)]: Done 100 out of 100 | elapsed:  2.5min finished
[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done  12 tasks      | elapsed:    0.0s


Model score: 0.9559464787049758


[Parallel(n_jobs=10)]: Done 100 out of 100 | elapsed:    0.2s finished


From above, we see that our random forest classifies the test data with around 95.6% accuracy (a score of 0.956), which is decent enough. It could potentially be improved a little by giving the model more training data, although the test split is already only 5% of the dataset.
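
The number returned by `model.score` is the mean accuracy: the fraction of test passwords whose predicted label matches the true one. As a rough sketch of what that means, and of how a single accuracy figure can hide differences between classes, the example below computes accuracy and per-class recall by hand. The labels are made up for illustration and are not results from the model.

```python
from collections import Counter

# Hypothetical true and predicted labels, for illustration only.
y_true = ["Medium", "Medium", "Medium", "Medium", "Weak", "Weak", "Strong", "Strong"]
y_pred = ["Medium", "Medium", "Medium", "Medium", "Medium", "Weak", "Strong", "Medium"]

# Overall accuracy: the fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-class recall: of the passwords truly in a class, the fraction predicted as that class.
totals = Counter(y_true)
hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
recall = {label: hits[label] / totals[label] for label in totals}

print(accuracy)
print(recall)
```

Here every "Medium" password is found while half of the "Weak" and "Strong" ones are missed, so checking per-class figures alongside the overall score may be worthwhile if the classes are unevenly sized.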
With this acceptable level of accuracy, we can try it out for ourselves by entering our own custom password and passing it to our fitted model.

In [15]:
#already imported the getpass module
usr = getpass.getpass('Please enter password: ')
dat = tdif.transform([usr]).toarray()
est = model.predict(dat)[0]

#coloured bar per class; named strength_bars so the built-in dict is not shadowed
strength_bars = {"Weak": "\U0001F7E5"*3, "Medium": "\U0001F7E8"*6, "Strong": "\U0001F7E9"*9}

print('Password Strength: {}  {}'.format(strength_bars[est], est))

Please enter password:  ········


Password Strength: 🟨🟨🟨🟨🟨🟨  Medium


[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done  12 tasks      | elapsed:    0.0s
[Parallel(n_jobs=10)]: Done 100 out of 100 | elapsed:    0.0s finished


## Appendix A.1: Tokenisation

In the method above, we talked about how we first had to tokenise our password data in order to make it classifiable for our random forest classifier. As such, it's worth giving some superficial background on tokenisation (both word and sentence).

Tokenisation is just the method of breaking up text into separable units, such as words, strings or characters, with any of these units being referred to as **tokens**. In natural language processing, there are typically two types:

<ol>
    <li>Sentence Tokenisation: This is the breaking of large pieces of text into sentences, most often paragraphs into sentences. This can be easy in NLP since, for many languages, sentences are delimited by the period mark, so a tokeniser can just find all the full stops in a piece of text and break the paragraph into tokens separated by these periods.</li>
    <li>Word Tokenisation: This is the most common form seen in NLP. While paragraphs are composed of sentences, sentences are composed of words, so tokenising a sentence requires finding the separator, here a space, with which it can break the sentence into word tokens.</li>
</ol>
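
The two types above, plus the character-level tokenisation used for the passwords in this notebook, can be sketched with the standard library alone. The sentence splitter here is deliberately naive (it would mis-split abbreviations such as "e.g."), and the example text is made up.

```python
import re

text = "Passwords should be long. Avoid dictionary words. Mix in digits!"

# Sentence tokenisation: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokenisation: split the first sentence on whitespace.
words = sentences[0].split()

# Character tokenisation: what the notebook's chars() does to a password.
characters = list("kino3434")

print(sentences)
print(words)
print(characters)
```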
    
    
    


## Appendix A.2: TF-IDF

When we separated our data into x and y sets from `password` and `strength`, we set up a vectoriser using `TfidfVectorizer`, which took our `chars` function as its tokeniser. This class is based on the **Term Frequency-Inverse Document Frequency** algorithm, a popular and robust method of converting textual data into vectors of numerical data. It combines both the *term frequency* and the *document frequency*:

<ul>
    <li> Term Frequency: This is the frequency with which a specific term occurs in a document, i.e. how significant that term is within it. It works by representing textual information as a matrix, with rows being the different documents in the available dataset and columns being all the distinct terms</li>
    <li> Document Frequency: This is the number of documents that contain a specific term, and gives a measure of how common (and so how uninformative) said term is likely to be</li>
</ul>
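
As a small sketch of these two quantities, the example below treats three passwords from the outputs above as the documents and single characters as the terms, then builds the term frequency matrix and the document frequency counts by hand. The variable names are illustrative and are not part of `sklearn`.

```python
from collections import Counter

documents = ["kzde5577", "kino3434", "f1adb1"]  # each password is one document

# Term frequency: one row of counts per document, one column per distinct term.
vocabulary = sorted(set("".join(documents)))
tf_rows = [Counter(doc) for doc in documents]
tf_matrix = [[row[term] for term in vocabulary] for row in tf_rows]

# Document frequency: the number of documents containing each term at least once.
df_counts = {term: sum(term in doc for doc in documents) for term in vocabulary}

print(vocabulary)
print(tf_matrix[0])
print(df_counts)
```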

IDF (Inverse Document Frequency) is the weight of a given term (how much information the term provides): it attributes a lower weight to a term that occurs across a large number of documents, and a higher weight to a term found in only a few. It is on a logarithmic scale and is given by:

$$idf_i = \log\left(\frac{N}{df_i}\right), \qquad df_i = |\{d \in D : i \in d\}|$$

where $idf_i$ is the IDF score for a term i, N is the total number of documents in the corpus D, and $df_i$ is the number of documents d in which the term i appears.
From this, we see the inverse relation: the bigger the DF, the lower the IDF of a term. When the DF is equal to N, i.e. when the term i is in every single document in the corpus, we get an IDF of 0, meaning the term provides no useful information for our vectorisation interests.
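
As a quick worked example (assuming the natural logarithm and a hypothetical corpus of $N = 4$ documents, and writing $df_i$ for the DF of term i), the IDF falls as the DF rises and reaches 0 when the term is in every document:

$$
\begin{aligned}
df_i = 1 &: \quad idf_i = \log(4/1) \approx 1.386 \\
df_i = 2 &: \quad idf_i = \log(4/2) \approx 0.693 \\
df_i = 4 &: \quad idf_i = \log(4/4) = 0
\end{aligned}
$$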

So with the IDF giving us a weight for a term's informational importance, and the TF giving us a measure of term frequency, the TF-IDF value is just the product of the IDF value with the TF matrix:

$$tfidf(t,d,D) = tf(t,d) \cdot idf(t,D)$$

where t is the term of interest, d is a document and D is the entire corpus of documents assessed.
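
The product above can be sketched directly in plain Python. This is the textbook form with a raw count for TF and an unsmoothed natural-log IDF, applied to three passwords from the outputs above; the function names are illustrative and this is not how `sklearn` implements it internally.

```python
import math
from collections import Counter

documents = ["kzde5577", "kino3434", "f1adb1"]  # each password is one document

def tf(term, doc):
    # Raw count of the term in one document.
    return Counter(doc)[term]

def idf(term, docs):
    # Textbook IDF: log(N / number of documents containing the term).
    # Assumes the term occurs in at least one document.
    containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_7 = tfidf("7", documents[0], documents)  # "7" appears twice, in one document
score_k = tfidf("k", documents[0], documents)  # "k" appears once, in two documents

print(score_7, score_k)
```

Note that `TfidfVectorizer` differs from this by default: with `smooth_idf=True` it uses $\ln\frac{1+N}{1+df} + 1$ for the IDF and then L2-normalises each row, so a term present in every document keeps a small non-zero weight rather than dropping to 0.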
