# Notebook 02.2: Create Sample Data

### Introduction

In this notebook, we create a stratified sample of the clean password dataset produced in the previous notebook. Stratifying by strength ensures that every strength level is represented in the sample, including levels that are rare in the full dataset, which makes later analysis and modeling less dominated by the most common strength values.

### Setup

Let's start by importing the necessary libraries.


In [1]:
import pandas as pd

### Data Loading
Let's load the clean password dataset from the previous notebook.

In [2]:
df = pd.read_parquet("./data/clean_passwords.gzip", engine="pyarrow")
df.head()

,password,strength
0,123456,0.172331
1,12345,0.128996
2,123456789,0.316992
3,password,0.249543
4,iloveyou,0.249543


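### Stratified Sampling
To create a stratified sample, we split the range of `strength` values into `num_bins` equal-width bins with `pd.cut` and draw up to `sample_size` passwords from each bin. A bin that holds fewer than `sample_size` passwords contributes all of its rows.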

In [3]:
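num_bins = 5  # number of equal-width strength bins (strata)
sample_size = 2_000  # maximum number of passwords drawn per bin

# Assign each password an integer bin code (0 to num_bins - 1).
df["bin"] = pd.cut(df["strength"], num_bins, labels=False)
# Draw up to sample_size rows per bin; min() guards against small bins.
sample_df = df.groupby("bin", group_keys=False).apply(
    lambda x: x.sample(min(len(x), sample_size), random_state=24)
)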

sample_df.head()

,password,strength,bin
8901913,csillik,0.180594,0
7505062,huniihuu,0.177778,0
1931018,chaipy,0.172331,0
11728518,876876b,0.155556,0
5548757,miiwhy,0.154795,0
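
To see how the binning and per-bin sampling behave, here is a minimal sketch on toy data. The `toy` DataFrame, its beta-distributed scores, and the `per_bin` value are assumptions made for illustration only; they are not taken from the password dataset. The sketch prints the bin edges chosen by `pd.cut` and compares how many rows each bin holds with how many were sampled.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 1,000 strength scores skewed toward low values.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"strength": rng.beta(2, 5, size=1_000)})

# Equal-width bins over the observed range; retbins=True also returns the edges.
toy["bin"], edges = pd.cut(toy["strength"], 5, labels=False, retbins=True)

# Rows available in each bin before sampling.
bin_counts = toy["bin"].value_counts().sort_index()

# Draw at most per_bin rows from each bin, mirroring the min() guard above.
per_bin = 50
toy_sample = pd.concat(
    [
        group.sample(min(len(group), per_bin), random_state=24)
        for _, group in toy.groupby("bin")
    ]
)
sample_counts = toy_sample["bin"].value_counts().sort_index()

print("bin edges:", edges)
print(pd.DataFrame({"available": bin_counts, "sampled": sample_counts}))
```

Because the bins are equal-width rather than equal-frequency, sparse bins can hold fewer rows than the cap, in which case the total sample is smaller than the number of bins times the per-bin cap.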


### Dropping 'bin' Column
The 'bin' column was only needed to form the strata, so we can now drop it from the sample DataFrame.

In [4]:
sample_df.drop(["bin"], axis=1, inplace=True)
sample_df.head()

,password,strength
8901913,csillik,0.180594
7505062,huniihuu,0.177778
1931018,chaipy,0.172331
11728518,876876b,0.155556
5548757,miiwhy,0.154795


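### Sample Data Information
Let's inspect the structure of the stratified sample: its row count, columns, and dtypes. With 5 bins and up to 2,000 passwords per bin, we expect at most 10,000 rows.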

In [5]:
sample_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 8901913 to 5783544
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   password  10000 non-null  object 
 1   strength  10000 non-null  float64
dtypes: float64(1), object(1)
memory usage: 234.4+ KB


### Saving Stratified Sample Data
Finally, we save the stratified sample as a CSV file for further analysis and modeling. Passing `index=False` leaves the original row index out of the file.

In [6]:
sample_df.to_csv("./data/stratified_sample_data.csv", index=False)
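
One thing to watch when the CSV is read back: by default, `pd.read_csv` treats strings such as `null` or `NA` as missing values. Whether the sample actually contains such passwords is an assumption, not something shown above. The sketch below uses a hypothetical three-row DataFrame and an in-memory buffer instead of the real file to show the effect, and one possible way to read the `password` column back as plain text.

```python
import io

import pandas as pd

# Hypothetical mini-sample with passwords that CSV parsing can misread.
mini = pd.DataFrame(
    {"password": ["012345", "null", "iloveyou"], "strength": [0.17, 0.12, 0.25]}
)

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
mini.to_csv(buffer, index=False)

# Default parsing: the password "null" is interpreted as a missing value.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Safer parsing: force text dtype and disable the default NA markers.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"password": str}, keep_default_na=False)

print(naive["password"].isna().sum())
print(restored["password"].tolist())
```

Note that `keep_default_na=False` applies to every column, which is acceptable here because `strength` has no missing values in the sample.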

🎉 Congratulations! We have created a stratified sample of the clean password dataset. The sample draws 2,000 passwords from each of the five strength bins and will serve as the basis for analysis and password strength modeling.

In the next notebook, we will apply feature engineering to the sample data.