# Chapter 1: Fetch data

In this notebook, we download the raw data and generate the dataset for our project.

## Import packages 📦

These are the packages used in this notebook.

In [1]:
import opendatasets as od
import pandas as pd
from password_strength import PasswordStats

from tqdm import tqdm
import warnings

tqdm.pandas()
warnings.filterwarnings("ignore")

## Download dataset 📥

We use `opendatasets` to download the dataset from Kaggle.

In [2]:
database_url = "https://www.kaggle.com/datasets/wjburns/common-password-list-rockyoutxt"
od.download(database_url, data_dir='./raw_dataset')

Skipping, found downloaded files in "./raw_dataset/common-password-list-rockyoutxt" (use force=True to force download)


### Load txt file into dataframe

Load the `rockyou.txt` file into a dataframe using `pandas`.

In [3]:
df = pd.read_csv('raw_dataset/common-password-list-rockyoutxt/rockyou.txt', header=None, names=['password'], sep='\t', encoding='ISO-8859-1')
df.head()

|   | password  |
|---|-----------|
| 0 | 123456    |
| 1 | 12345     |
| 2 | 123456789 |
| 3 | password  |
| 4 | iloveyou  |


Get info about the dataset

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14344097 entries, 0 to 14344096
Data columns (total 1 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   password  object
dtypes: object(1)
memory usage: 109.4+ MB


Drop null values

In [5]:
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14344094 entries, 0 to 14344096
Data columns (total 1 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   password  object
dtypes: object(1)
memory usage: 218.9+ MB
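
The three rows dropped here may not be empty lines. By default, `pandas.read_csv` converts tokens such as `NA`, `NaN`, and `null` to missing values and treats `"` as a quote character, so literal passwords with those spellings can be lost or merged. The following is a minimal sketch of an alternative that reads the file line by line with the standard library, assuming one password per line. The helper name `read_passwords` and the sample file are hypothetical and not part of the original notebook.

```python
import tempfile
from pathlib import Path


def read_passwords(path):
    """Return one string per non-empty line, with no CSV or NA parsing."""
    # newline="\n" splits on "\n" only, so other control bytes stay in the password.
    with open(path, encoding="ISO-8859-1", newline="\n") as fh:
        lines = (line.rstrip("\n") for line in fh)
        return [line for line in lines if line]


# Hypothetical sample standing in for rockyou.txt.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes(b'123456\nnull\nNA\n"quoted\npass\xe9\n')
    passwords = read_passwords(sample)

print(passwords)
```

The resulting list can be wrapped with `pd.DataFrame({'password': passwords})`. Passing `keep_default_na=False` and `quoting=csv.QUOTE_NONE` to `read_csv` is another option worth testing.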


Create a function that takes a password as input and returns its strength as output. The strength is a score in the range 0 to 1.

In [6]:
def cal_strength(text) -> float:
    return PasswordStats(text).strength()
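
To build intuition for the score, the sketch below approximates it for short passwords as `length * log2(number of distinct characters) / 90`. This formula is an assumption inferred from how the score behaves on low-entropy inputs, not the library's documented API, and `approx_strength` is a hypothetical name. It agrees to about six decimal places with the scores this notebook reports for `123456`, `12345`, and `password`, and it is only meant for passwords whose entropy estimate stays at or below 30 bits.

```python
from math import log2


def approx_strength(password: str) -> float:
    """Approximate score for low-entropy passwords (assumed formula)."""
    distinct = len(set(password))
    if distinct == 0:
        return 0.0
    entropy_bits = len(password) * log2(distinct)
    if entropy_bits > 30:
        # Outside the range this sketch covers; use PasswordStats instead.
        raise ValueError("approximation only covers entropy up to 30 bits")
    # Assumption: up to 30 bits the score grows linearly, reaching 1/3.
    return entropy_bits / 90


for pw in ["123456", "password", "iloveyou"]:
    print(pw, round(approx_strength(pw), 4))
```

Under this reading, `password` and `iloveyou` get the same score because both have eight characters drawn from seven distinct ones.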

Apply this function to every password in the dataframe, with a progress bar from `tqdm`.

In [7]:
df['strength'] = df['password'].progress_apply(cal_strength)
df.head()

100%|██████████| 14344094/14344094 [02:23<00:00, 100209.00it/s]


|   | password  | strength |
|---|-----------|----------|
| 0 | 123456    | 0.172331 |
| 1 | 12345     | 0.128996 |
| 2 | 123456789 | 0.316992 |
| 3 | password  | 0.249543 |
| 4 | iloveyou  | 0.249543 |


Save the dataframe in Parquet format with gzip compression.

In [8]:
df.to_parquet('./data/passwords.gzip', compression='gzip', engine='pyarrow') 
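
`to_parquet` does not create missing directories, so the call above will typically fail if `./data` does not exist yet. The following is a minimal sketch of a guard using only the standard library. `ensure_parent_dir` is a hypothetical helper name, not part of the original notebook.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the directory that will hold file_path, if it is missing."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstrated in a temporary directory to avoid touching the project tree.
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(Path(tmp) / "data" / "passwords.gzip")
    parent_exists = target.parent.is_dir()

print(parent_exists)
```

In the notebook, this would be called as `ensure_parent_dir('./data/passwords.gzip')` just before `df.to_parquet(...)`.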