# Description

## 🔐 Password Strength Classification - Model Training  

### **📌 Overview**  
In this notebook, we train a machine learning model to classify passwords by strength (**Weak, Medium, or Strong**).  
We use the **password_data_ready** dataset produced by our preprocessing pipeline.  

### **📌 Steps in This Notebook**  
1️⃣ Load the password_data_ready dataset.  
2️⃣ Perform statistical analysis & check for class imbalance.  
3️⃣ Train different machine learning models.  
4️⃣ Evaluate model performance using accuracy, precision, recall, and F1-score.  
5️⃣ Optimize the best-performing model for deployment.  

Let's begin! 🚀


# Importing Essential Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

print('Done')

Done


# Loading Dataset 

In [2]:
# Raw string, so each path separator is written as a single backslash
path = r'D:\Data_Projects\Password_Strength_Checker\data\processed\password_data_ready.csv'
data = pd.read_csv(path, on_bad_lines='skip')

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 669639 entries, 0 to 669638
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   password  669639 non-null  object
 1   strength  669639 non-null  object
 2   length    669639 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 15.3+ MB


In [5]:
data.describe()

         length
count  669639.0
mean   9.991648
std    2.819954
min         1.0
25%         8.0
50%         9.0
75%        11.0
max       220.0


In [4]:
class_count = data['strength'].value_counts()
print(class_count)

strength
medium    496801
weak       89701
strong     83137
Name: count, dtype: int64
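To make the imbalance concrete, the counts printed above can be turned into class shares and a majority-to-minority ratio. This is a minimal sketch in plain Python; the dictionary simply copies the printed counts, and the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
class_counts = {"medium": 496801, "weak": 89701, "strong": 83137}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# Ratio between the largest and the smallest class
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for label, share in shares.items():
    print(f"{label:<7}{share:6.1%}")
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

On the DataFrame itself, `data['strength'].value_counts(normalize=True)` returns the same shares directly.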


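### 📌 Handling Class Imbalance

The `medium` class dominates the dataset, so we oversample the minority classes with SMOTE, which generates synthetic samples until every class matches the majority count.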
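SMOTE builds each synthetic sample by picking a minority-class point, choosing one of its nearest same-class neighbors, and interpolating between the two. The sketch below shows only that interpolation step in plain Python; the function name and the two feature vectors are hypothetical, and the neighbor search that the real algorithm performs is left out.

```python
import random


def smote_like_sample(x, neighbor, rng):
    """Return a synthetic point on the segment between x and a same-class neighbor."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(42)
minority_point = [1.0, 2.0]    # hypothetical minority-class feature vector
nearest_neighbor = [3.0, 6.0]  # hypothetical same-class neighbor

synthetic = smote_like_sample(minority_point, nearest_neighbor, rng)
print(synthetic)
```

Because the new point always lies between two existing minority samples, SMOTE adds variety without copying rows verbatim.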


In [8]:
from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import TfidfVectorizer

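# Convert passwords to a sparse TF-IDF matrix.
# Note: the default analyzer tokenizes on word boundaries, not on characters.
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(data['password'])

y = data['strength']

# 'auto' oversamples every class except the majority up to the majority count
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_smote, y_smote = smote.fit_resample(X_tfidf, y)

# inverse_transform returns the non-zero tokens of each row, not the original password strings
data_balanced = pd.DataFrame({"password": vectorizer.inverse_transform(X_smote), "strength": y_smote})
data_balanced['strength'].value_counts()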

strength
medium    496801
strong    496801
weak      496801
Name: count, dtype: int64
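With default settings, `TfidfVectorizer` splits text on word boundaries, so a typical password ends up as a single token. A character n-gram view is a common alternative for password data. The sketch below illustrates the idea in plain Python; `char_ngrams` is a hypothetical helper, not part of the notebook's pipeline.

```python
def char_ngrams(text, n_min=1, n_max=2):
    """Return all character n-grams of text with n_min <= n <= n_max."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


print(char_ngrams("p@ss1"))
```

In scikit-learn, a similar tokenization can be requested with `TfidfVectorizer(analyzer='char', ngram_range=(1, 2))`. Whether it improves this particular model would need to be checked experimentally.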

In [11]:
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report

models = {
    "Logistic Regression": LogisticRegression(max_iter=500),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')
}
param_grids = {
    "Logistic Regression": {
        'model__C': [0.1, 1, 10],
        'model__solver': ['liblinear', 'lbfgs']
    },
    "Random Forest": {
        'model__n_estimators': [50, 100, 200],
        'model__max_depth': [None, 10, 20]
    },
    "XGBoost": {
        'model__n_estimators': [50, 100, 200],
        'model__learning_rate': [0.01, 0.1, 0.2],
        'model__max_depth': [3, 6, 9]
    }
}
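One point to check before the XGBoost model is fitted: recent XGBoost releases expect class labels as integers `0..n_classes-1`, while the `strength` column holds strings. A minimal sketch of an explicit mapping follows; `label_to_id` is a hypothetical name, and the ordering weak < medium < strong is an assumption.

```python
# Hypothetical mapping; the order weak < medium < strong is an assumption.
label_to_id = {"weak": 0, "medium": 1, "strong": 2}
id_to_label = {idx: label for label, idx in label_to_id.items()}

sample_labels = ["medium", "weak", "strong", "medium"]  # illustrative values

encoded = [label_to_id[label] for label in sample_labels]
decoded = [id_to_label[idx] for idx in encoded]  # map predictions back to names

print(encoded)
print(decoded)
```

On a pandas Series the same mapping can be applied with `y.map(label_to_id)`.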

In [None]:
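best_models = {}  # best fitted pipeline per model name

for name, model in models.items():
    print(f"Training {name}...")

    # with_mean=False keeps the TF-IDF matrix sparse during scaling
    pipeline = Pipeline([
        ('scaler', StandardScaler(with_mean=False)),
        ('model', model)
    ])

    # Exhaustive search over the parameter grid with 3-fold cross-validation
    search = GridSearchCV(pipeline, param_grids[name], cv=3, scoring='accuracy', n_jobs=-1)
    search.fit(X_smote, y_smote)

    # Keep each tuned pipeline so it is not lost on the next iteration
    best_models[name] = search.best_estimator_
    print(f"Best params for {name}: {search.best_params_}")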

Training Logistic Regression...
Best params for Logistic Regression: {'model__C': 10, 'model__solver': 'lbfgs'}
Training Random Forest...
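The grid search is expensive on roughly 1.5 million resampled rows because every parameter combination is fitted once per cross-validation fold. The sketch below counts those fits for the grids defined above; `grids` and `fits` are illustrative names, and the final refit that `GridSearchCV` performs on the full data by default is not included.

```python
from itertools import product

cv_folds = 3  # matches cv=3 in the GridSearchCV call

# Parameter grids copied from the cell that defines param_grids
grids = {
    "Logistic Regression": {
        "model__C": [0.1, 1, 10],
        "model__solver": ["liblinear", "lbfgs"],
    },
    "Random Forest": {
        "model__n_estimators": [50, 100, 200],
        "model__max_depth": [None, 10, 20],
    },
    "XGBoost": {
        "model__n_estimators": [50, 100, 200],
        "model__learning_rate": [0.01, 0.1, 0.2],
        "model__max_depth": [3, 6, 9],
    },
}

fits = {}
for name, grid in grids.items():
    n_candidates = len(list(product(*grid.values())))
    fits[name] = n_candidates * cv_folds
    print(f"{name}: {n_candidates} candidates x {cv_folds} folds = {fits[name]} fits")
```

If the runtime is prohibitive, the already imported `RandomizedSearchCV` samples a fixed number of candidates instead of trying all of them.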


In [None]:
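# `search` refers to the last model fitted in the loop above.
# The scores are computed on the resampled training data, so they are optimistic.
y_pred = search.best_estimator_.predict(X_smote)
print(classification_report(y_smote, y_pred))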
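For reference, the per-class numbers in a classification report come from simple counts of true positives, false positives, and false negatives. The sketch below recomputes them by hand in plain Python; the label lists are made-up illustrative values, not model output, and `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative labels only
y_true = ["weak", "weak", "medium", "medium", "strong", "strong"]
y_pred = ["weak", "medium", "medium", "medium", "strong", "medium"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.3f}")
for label in ("weak", "medium", "strong"):
    precision, recall, f1 = per_class_metrics(y_true, y_pred, label)
    print(f"{label:<7} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy example the model over-predicts `medium`, which shows up as perfect recall but low precision for that class. That pattern is worth watching for on an imbalanced dataset.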