## **Modelling for Password Strength Checker**

### **A. Load Libraries**

In [41]:
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

import pickle 


### **B. Load Data**

In [26]:
# Load the CSV data, skipping rows that cannot be parsed
data = pd.read_csv("data.csv", on_bad_lines="skip")

# Show a random sample of five rows
print(data.sample(5))

               password  strength
645240        komyat300         1
246667       rasa3rakuw         1
300996      asp34345656         1
372531        17e17e23i         1
586115  INILiHAMuzyh463         2


**Column Information**:
- password: the password string to be rated.
- strength: an integer label for the password's security level. Values range from 0 to 2, where 0 is the weakest and 2 is the strongest.
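Before modelling, it can help to see how the rows are spread across the three strength labels, since a skewed distribution affects how an accuracy score should be read. The following is a minimal sketch on a toy DataFrame; the `toy` frame and its rows are hypothetical stand-ins for `data`, and on the real data the same call would be made on `data["strength"]`.

```python
import pandas as pd

# Toy stand-in for data.csv (hypothetical rows, same two columns)
toy = pd.DataFrame({
    "password": ["abc123", "komyat300", "rasa3rakuw", "INILiHAMuzyh463"],
    "strength": [0, 1, 1, 2],
})

# Share of rows per strength label, ordered by label
class_share = toy["strength"].value_counts(normalize=True).sort_index()
print(class_share)
```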

### **C. Feature Engineering**

#### 1. Checking Missing Values

In [27]:
# Count missing values per column
print('Number of missing values in each column:')
data.isnull().sum()

Number of missing values in each column:


password    1
strength    0
dtype: int64

In [28]:
# Drop rows that contain a missing value
data.dropna(inplace=True)

#### 2. Checking Duplicate Rows

In [29]:
# Count fully duplicated rows
print('Number of duplicated rows:', data.duplicated().sum())

Number of duplicated rows: 0


#### 3. Data Split

In [30]:
# Tokenizer: split a password into a list of single characters
def word(password):
    return list(password)

# Separate the feature (password) from the target (strength)
x = data["password"]
y = data["strength"]

# Convert passwords to character-level TF-IDF features
# (the vectorizer is fitted on the full dataset, before the split)
tdif = TfidfVectorizer(tokenizer=word)
x = tdif.fit_transform(x)

# Hold out 15% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.15, random_state=42)

# Report the resulting shapes
print('Train Size   : ', X_train.shape)
print('Test Size    : ', X_test.shape)



Train Size   :  (569193, 153)
Test Size    :  (100446, 153)
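Two details of the cell above are worth noting. `TfidfVectorizer` lowercases its input by default (`lowercase=True`), so upper- and lowercase letters end up as the same feature, and the vectorizer is fitted before the split, so the test rows contribute to the IDF weights. The sketch below shows one possible variant, not the notebook's method: it uses the built-in `analyzer="char"` in place of the `word` tokenizer (close, but not identical, since the built-in analyzer also collapses runs of whitespace), keeps case, and fits on the training strings only. The toy passwords, labels, and the names `char_tfidf`, `pw_train`, and `pw_test` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy passwords and labels (hypothetical, for illustration only)
passwords = ["abc123", "komyat300", "rasa3rakuw", "17e17e23i",
             "INILiHAMuzyh463", "asp34345656", "Pa55w0rd!x", "qwerty"]
labels = [0, 1, 1, 1, 2, 1, 2, 0]

# Split the raw strings first so the test set never influences the IDF weights
pw_train, pw_test, y_tr, y_te = train_test_split(
    passwords, labels, test_size=0.25, random_state=42
)

# analyzer="char" with the default ngram_range=(1, 1) yields single characters;
# lowercase=False keeps 'A' and 'a' as distinct features
char_tfidf = TfidfVectorizer(analyzer="char", lowercase=False)
X_tr = char_tfidf.fit_transform(pw_train)  # learn vocabulary and IDF from train only
X_te = char_tfidf.transform(pw_test)       # reuse them on the held-out strings

print(X_tr.shape, X_te.shape)
```

Whether keeping case improves the score on the real data is not something this sketch establishes; it would need to be measured.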


#### 4. Modelling

In [32]:
# Train a random forest with default hyperparameters, then print test accuracy
model = RandomForestClassifier()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

0.9565039921948112
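`model.score` reports overall accuracy only. If the strength labels are unevenly distributed, a per-class view can show whether the weakest or strongest class is predicted less reliably. The sketch below uses short hypothetical `y_true` and `y_pred` arrays so that it runs on its own; in the notebook they would correspond to `y_test` and `model.predict(X_test)`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels standing in for y_test and model.predict(X_test)
y_true = [0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 1, 1, 2, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)

# Precision, recall, and F1 for each strength label
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
print(classification_report(y_true, y_pred, zero_division=0))
```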


### **D. Export Model**

In [42]:
# Save the trained model and the fitted vectorizer
# (the pickle/ directory must already exist)
with open('pickle/model.pkl', 'wb') as file:
    pickle.dump(model, file)
with open('pickle/tdif.pkl', 'wb') as file:
    pickle.dump(tdif, file)
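The exported files are meant to be loaded elsewhere to rate new passwords. Because pickle stores functions by reference, a vectorizer built with `tokenizer=word` can only be unpickled where a `word` function is importable under the same module path. The sketch below sidesteps that by using `analyzer="char"`, which is an assumption and not the notebook's setup; the toy data, the temporary directory, and the `predict_strength` helper are also hypothetical. Pickle files should only be loaded from trusted sources.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training data (hypothetical)
passwords = ["abc", "qwerty", "komyat300", "rasa3rakuw",
             "INILiHAMuzyh463", "Pa55w0rd!xYz99"]
labels = [0, 0, 1, 1, 2, 2]

# Built-in character analyzer, so the pickle needs no custom function
tdif = TfidfVectorizer(analyzer="char")
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(tdif.fit_transform(passwords), labels)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "pickle"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing

    # Export both artifacts
    with open(out_dir / "model.pkl", "wb") as f:
        pickle.dump(model, f)
    with open(out_dir / "tdif.pkl", "wb") as f:
        pickle.dump(tdif, f)

    # Load them back, as a separate application would
    with open(out_dir / "model.pkl", "rb") as f:
        loaded_model = pickle.load(f)
    with open(out_dir / "tdif.pkl", "rb") as f:
        loaded_tdif = pickle.load(f)


def predict_strength(password):
    """Return the predicted strength label (0, 1, or 2) for one password."""
    features = loaded_tdif.transform([password])
    return int(loaded_model.predict(features)[0])


print(predict_strength("komyat300"))
```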