# Notebook 01: Fetch data

###  Introduction

In this notebook, we fetch the "rockyou.txt" password list, load it into a DataFrame, and compute a strength score for each password. The resulting dataset will serve as the foundation for our "Passwordometer" project.


###  Setup

Let's start by setting up the necessary environment and importing the required libraries.


In [1]:
import warnings

import opendatasets as od
import pandas as pd
from password_strength import PasswordStats
from tqdm import tqdm

tqdm.pandas()
warnings.filterwarnings("ignore")

###  Data Fetching

To begin our analysis, we need to download the "rockyou.txt" dataset. We will use the `opendatasets` library to simplify the download process.


In [2]:
database_url = "https://www.kaggle.com/datasets/wjburns/common-password-list-rockyoutxt"
od.download(database_url, data_dir="./raw_dataset")

Skipping, found downloaded files in "./raw_dataset/common-password-list-rockyoutxt" (use force=True to force download)
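
For Kaggle URLs, `opendatasets` needs Kaggle API credentials: it reads a `kaggle.json` file from the working directory if one is present and otherwise prompts for a username and key. The sketch below is a small pre-flight check we could run before the download. The helper name `has_kaggle_credentials` is hypothetical and is not part of `opendatasets`.

```python
import json
from pathlib import Path


def has_kaggle_credentials(path: str = "kaggle.json") -> bool:
    """Return True if `path` is a JSON file with non-empty "username" and "key" fields.

    Hypothetical helper for illustration; not part of opendatasets.
    """
    file = Path(path)
    if not file.is_file():
        return False
    try:
        data = json.loads(file.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and bool(data.get("username")) and bool(data.get("key"))


if not has_kaggle_credentials():
    print("No usable kaggle.json found; the download may prompt for credentials.")
```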


###  Data Loading

Let's load the dataset into a Pandas DataFrame for further processing.


In [3]:
df = pd.read_csv(
    "raw_dataset/common-password-list-rockyoutxt/rockyou.txt",
    header=None,
    names=["password"],
    sep="\t",
    encoding="ISO-8859-1",
)
df.head()

|   | password |
|---|-----------|
| 0 | 123456 |
| 1 | 12345 |
| 2 | 123456789 |
| 3 | password |
| 4 | iloveyou |
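
The `encoding="ISO-8859-1"` argument matters because ISO-8859-1 maps every possible byte to a character, so decoding never fails, whereas UTF-8 rejects byte sequences that are not valid UTF-8. Presumably the file contains such bytes, which is an assumption we have not verified here. The bytes in this sketch are made up for illustration and are not taken from the dataset.

```python
# 0xE9 is "é" in ISO-8859-1 but is not valid on its own in UTF-8.
raw = b"caf\xe9\n"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 decodes the same bytes without error.
latin1_text = raw.decode("ISO-8859-1")

# Every byte value from 0 to 255 has a mapping, so decoding cannot fail.
all_bytes_text = bytes(range(256)).decode("ISO-8859-1")

print(utf8_ok, latin1_text.strip(), len(all_bytes_text))
```

The trade-off is that passwords originally written in another encoding are decoded to the wrong characters rather than rejected.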


###  Data Cleaning

Before we proceed, let's inspect the DataFrame and drop any rows with missing values.


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14344097 entries, 0 to 14344096
Data columns (total 1 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   password  object
dtypes: object(1)
memory usage: 109.4+ MB
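
One thing worth checking about the row count: by default, `pd.read_csv` treats `"` as a quote character, so a line that starts with a double quote can absorb the following lines into a single field. The sketch below shows the effect on a made-up four-line sample and how `quoting=csv.QUOTE_NONE` keeps every line as its own row. Whether this affects the count reported above is not something we have verified.

```python
import csv
import io

import pandas as pd

# Four lines of text; the second line opens a quote that closes on the fourth.
sample = 'abc\n"def\nghi\njkl"\n'

# Default quoting: lines 2 to 4 are merged into one quoted field.
default_df = pd.read_csv(io.StringIO(sample), header=None, names=["password"], sep="\t")

# QUOTE_NONE: quote characters are ordinary text, so each line is one row.
raw_df = pd.read_csv(
    io.StringIO(sample),
    header=None,
    names=["password"],
    sep="\t",
    quoting=csv.QUOTE_NONE,
)

print(len(default_df), len(raw_df))
```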


In [5]:
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14344094 entries, 0 to 14344096
Data columns (total 1 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   password  object
dtypes: object(1)
memory usage: 218.9+ MB
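
The three rows removed by `dropna` are not necessarily empty: `pd.read_csv` converts strings such as `null`, `NA`, and `nan` to missing values by default, and those strings can be real passwords. We have not checked which rows were dropped here, so treat this as a possibility rather than a finding. The sketch below uses a made-up sample to show the default behavior and how `keep_default_na=False` preserves the literal strings.

```python
import io

import pandas as pd

sample = "password\nnull\nNA\nnan\n"

# Default parsing: "null", "NA" and "nan" are read as missing values.
default_df = pd.read_csv(
    io.StringIO(sample), header=None, names=["password"], sep="\t", dtype=str
)

# keep_default_na=False keeps them as ordinary strings.
literal_df = pd.read_csv(
    io.StringIO(sample),
    header=None,
    names=["password"],
    sep="\t",
    dtype=str,
    keep_default_na=False,
)

print(default_df["password"].isna().sum(), literal_df["password"].isna().sum())
```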


###  Password Strength Calculation

To evaluate the strength of each password in our dataset, we will use the `PasswordStats` class from the `password_strength` library.


In [6]:
def cal_strength(text: str) -> float:
    """Return the password_strength score for `text`, a float between 0 and 1."""
    return PasswordStats(text).strength()

###  Strength Calculation and Results

Let's calculate the strength of each password in our dataset and examine the results.


In [7]:
df["strength"] = df["password"].progress_apply(cal_strength)
df.head()

100%|██████████| 14344094/14344094 [02:23<00:00, 100209.00it/s]


|   | password | strength |
|---|-----------|----------|
| 0 | 123456 | 0.172331 |
| 1 | 12345 | 0.128996 |
| 2 | 123456789 | 0.316992 |
| 3 | password | 0.249543 |
| 4 | iloveyou | 0.249543 |
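
To build some intuition for these scores, note that "password" and "iloveyou" get the same value: both have 8 characters and 7 distinct characters. The five scores above are consistent with a simple rule, namely entropy in bits (length times log2 of the number of distinct characters) divided by 90, which agrees with them to within display precision. The sketch below is our own reconstruction inferred from those outputs, not code copied from `password_strength`. The function names are hypothetical, and the 30-bit limit reflects the library's documented "weak" range as we understand it, so treat it as an assumption.

```python
import math


def entropy_bits(password: str) -> float:
    """Length times log2 of the number of distinct characters (0.0 for an empty string)."""
    if not password:
        return 0.0
    return len(password) * math.log2(len(set(password)))


def weak_range_strength(password: str, weak_bits: float = 30.0) -> float:
    """Inferred linear score for passwords with at most `weak_bits` bits of entropy.

    Assumption: reconstructed from the scores shown in this notebook.
    Passwords above `weak_bits` are outside what this sketch was checked against.
    """
    bits = entropy_bits(password)
    if bits > weak_bits:
        raise ValueError("outside the range this sketch was checked against")
    return bits / (3 * weak_bits)


for pw in ["123456", "12345", "123456789", "password", "iloveyou"]:
    print(pw, weak_range_strength(pw))
```

A consequence of this rule is that the score depends only on length and character variety, not on how common the password is, which is why the most frequently used passwords are not all rated equally weak.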


###  Saving Processed Data

Finally, we will save the processed dataset as a compressed Parquet file for future use.


In [8]:
df.to_parquet("./data/passwords.gzip", compression="gzip", engine="pyarrow")
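
Note that `to_parquet` does not create missing directories, so the call above fails if `./data` does not exist yet. A minimal sketch of a guard, assuming we are free to create the directory; the helper name `ensure_parent_dir` is hypothetical.

```python
from pathlib import Path


def ensure_parent_dir(path: str) -> Path:
    """Create the parent directory of `path` if needed and return `path` as a Path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


# Usage with the DataFrame from this notebook (commented out so the sketch runs standalone):
# df.to_parquet(ensure_parent_dir("./data/passwords.gzip"), compression="gzip", engine="pyarrow")
```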

🎉 Congratulations! We have successfully fetched and processed the "rockyou.txt" dataset, calculating the strength of each password. The resulting dataset will be instrumental in our password strength prediction model.

In the next notebook, we will clean the password data.