# Generating Fake Data with Scikit-Learn

Many libraries, including scikit-learn and PyTorch, let you import real-world datasets to test against.

However, it sometimes helps to generate data with the specific properties you are interested in. For learning, or for refreshing skills you haven't used in a while, it can be useful to add random inputs to your data generation function, generate a new dataset, and practice on it.

## Contents

[Import](#import) <br>
[Boolean variables](#boolean-variables) <br>
[Categorical variables](#categorical-variables)

## To Do

- Continuous variables
    - Blobs
    - Different distributions
    - Integer
    - Float
- Categorical variables
- Date variables
- Text variables
- Boolean variables

## Notes

For each variable type, I'll generate three versions:
1. a: Simply the data type
2. b: Slightly more realistic data that needs conversion
3. c: Messy data that needs cleaning

## Resources
[Sklearn generate datasets](https://scikit-learn.org/stable/datasets/sample_generators.html)


## Import

In [17]:
import pandas as pd
import numpy as np

from sklearn import datasets

In [18]:
class CFG:
    n_rows=1000
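
Every cell below draws from NumPy's global random state, so the outputs change on each run. If you want the same dataset back, one option is to fix a seed first. This is a minimal sketch, not part of the original notebook; `SEED` is a hypothetical name and its value is arbitrary.

```python
import numpy as np

SEED = 42  # hypothetical value; any integer works

# Option 1: seed NumPy's global state, which np.random.choice reads from
np.random.seed(SEED)
first = np.random.choice([True, False], size=5)

np.random.seed(SEED)
second = np.random.choice([True, False], size=5)

# the two draws are identical because the state was reset in between
print((first == second).all())

# Option 2: a local Generator, which leaves the global state untouched
rng = np.random.default_rng(SEED)
third = rng.choice([True, False], size=5)
```

Leaving the seed out is still the better choice when the goal is a fresh practice dataset on every run.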

## Boolean variables

Let's start with the simplest and create a boolean variable.

In [19]:
df_bool_a = pd.DataFrame(np.random.choice([True, False], 
                                          size=CFG.n_rows),
                         columns=['bool_simple'])
df_bool_a.head(3)

Unnamed: 0,bool_simple
0,False
1,True
2,True


In a real-world situation, the same information might arrive as text that we need to convert to boolean.

In [20]:
df_bool_b = pd.DataFrame(np.random.choice(['Inside', 'Outside'], 
                                          size=CFG.n_rows), 
                         columns=['bool_text'])
df_bool_b.head(3)

Unnamed: 0,bool_text
0,Inside
1,Outside
2,Outside
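
The conversion itself could look something like the sketch below. The column names `is_inside` and `is_inside_mapped` are my own, and I'm assuming `'Inside'` should become `True`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice(['Inside', 'Outside'], size=1000),
                  columns=['bool_text'])

# Option 1: a comparison gives a boolean column directly
df['is_inside'] = df['bool_text'] == 'Inside'

# Option 2: an explicit mapping; any value missing from the dict becomes NaN,
# which makes unexpected labels easy to spot
mapping = {'Inside': True, 'Outside': False}
df['is_inside_mapped'] = df['bool_text'].map(mapping)

print(df.head(3))
```

The comparison is shorter, but it silently turns anything that isn't exactly `'Inside'` into `False`, so the mapping is probably the safer habit once the data gets messy.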


In an even messier scenario, there might be typos or alternative spellings of the same value.

For example, an old dataset may have been combined with a new one in which the values were renamed.

In [21]:
def typo(s: str) -> str:
    """
    Replaces a random character in the input string with a random lowercase letter.

    Args:
        s (str): The input string.

    Returns:
        str: The input string with a single character replaced by a random lowercase letter.
    """
    idx = np.random.randint(len(s))
    replacement_letter = chr(np.random.choice(range(97, 123)))
    return s[:idx] + replacement_letter + s[idx + 1:]
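
A quick way to see what `typo` does is to call it a few times on the same word. The function is repeated here so the snippet runs on its own, and the seed is only there to make the run repeatable.

```python
import numpy as np


def typo(s: str) -> str:
    # same logic as the cell above: swap one random character
    # for a random lowercase letter (ASCII 97-122)
    idx = np.random.randint(len(s))
    replacement_letter = chr(np.random.choice(range(97, 123)))
    return s[:idx] + replacement_letter + s[idx + 1:]


np.random.seed(0)
samples = [typo('Inside') for _ in range(5)]
print(samples)

# count how many positions differ from the original word
n_diff = [sum(a != b for a, b in zip('Inside', s)) for s in samples]
print(n_diff)
```

The length never changes and at most one position differs. Because the replacement is drawn from all 26 lowercase letters, it can occasionally equal the original character, in which case the string comes back unchanged.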

In [22]:
# same principle as b for starting point
df_bool_c = pd.DataFrame(np.random.choice(['Inside', 'Outside'], 
                                          size=CFG.n_rows), 
                         columns=['bool_typo'])

# set a random subset
mask_case = np.random.choice([True, False], size=CFG.n_rows, p=[0.1, 0.9])
mask_typo = np.random.choice([True, False], size=CFG.n_rows, p=[0.05, 0.95])

df_bool_c.loc[mask_case, 'bool_typo'] = (df_bool_c.loc[mask_case, 'bool_typo']
                                         .str
                                         .lower())

df_bool_c.loc[mask_typo, 'bool_typo'] = (df_bool_c.loc[mask_typo, 'bool_typo']
                                         .apply(typo))

print(df_bool_c.value_counts().sort_values(ascending=False)[:6])

df_bool_c.head(3)

bool_typo
Outside      444
Inside       429
inside        48
outside       46
Ivside         1
Inpide         1
Name: count, dtype: int64


Unnamed: 0,bool_typo
0,Inside
1,Outside
2,Inside
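
To practice on this column, one possible cleaning approach is to lowercase everything and then snap each value to the closest known label with the standard library's `difflib`. This is only a sketch: `closest_label`, the sample values, and the `0.6` cutoff are my own assumptions, and fuzzy matching can pick the wrong label when the valid labels are similar to each other.

```python
import difflib

import pandas as pd

# a small hand-made sample in the style of the output above
messy = pd.Series(['Inside', 'outside', 'Ivside', 'Inpide', 'Outside', 'xxxxxx'])
valid = ['inside', 'outside']


def closest_label(value, choices=valid, cutoff=0.6):
    """Return the most similar valid label, or None if nothing is close enough."""
    matches = difflib.get_close_matches(value.lower(), choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


cleaned = messy.apply(closest_label)

# once the labels are clean, the boolean conversion is a plain mapping
is_inside = cleaned.map({'inside': True, 'outside': False})

print(cleaned.tolist())
```

Values that match nothing come back as `None`, so they can be reviewed by hand instead of being guessed.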


Now that we have our boolean examples, we can concatenate them into a single DataFrame.

In [23]:
df_bool = pd.concat([df_bool_a, df_bool_b, df_bool_c], axis=1)
df_bool.head(3)

Unnamed: 0,bool_simple,bool_text,bool_typo
0,False,Inside,Inside
1,True,Outside,Outside
2,True,Outside,Inside


## Categorical variables 

In [24]:
df_cat_a = pd.DataFrame(np.random.choice(['cat', 'dog', 'fish', 'frog'],
                                         size=CFG.n_rows),
                        columns=['cat_simple'])
df_cat_a = df_cat_a.astype('category')
df_cat_a.loc[:, 'cat_simple'].dtype

CategoricalDtype(categories=['cat', 'dog', 'fish', 'frog'], ordered=False, categories_dtype=object)

For a DataFrame with more columns, it's more convenient and less error-prone to use the following syntax:

```df_cat_a = df_cat_a.astype({'cat_simple': 'category'})```

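It is worth noting that, at least in the pandas version used here, assigning the converted column back through `.loc[:, 'cat_simple']` does not change the column's dtype: the values are written into the existing column, which stays `object`.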

In [25]:
# example if interested
example = pd.DataFrame(np.random.choice(['cat', 'dog', 'fish', 'frog'],
                                         size=CFG.n_rows),
                        columns=['cat_simple'])
example.loc[:, 'cat_simple'] = example.loc[:, 'cat_simple'].astype('category')
print(example.loc[:, 'cat_simple'].dtype)
del example

object
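
For contrast, plain column assignment replaces the whole column, so the new dtype sticks. A minimal sketch using the same setup as the cell above:

```python
import numpy as np
import pandas as pd

example = pd.DataFrame(np.random.choice(['cat', 'dog', 'fish', 'frog'],
                                        size=1000),
                       columns=['cat_simple'])

# df[col] = ... swaps in the new column object, dtype included
example['cat_simple'] = example['cat_simple'].astype('category')
print(example['cat_simple'].dtype)
```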


Out in the wild, these often get handed over as strings:

In [26]:
df_cat_b = pd.DataFrame(np.random.choice(['cat', 'dog', 'fish', 'frog'],
                                         size=CFG.n_rows),
                        columns=['cat_simple'])
df_cat_b['cat_simple'].dtype

dtype('O')
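
Converting such a string column back is a single `astype` call. The sketch below passes an explicit `pd.CategoricalDtype` so the set of allowed categories is written down in one place; the names `animals` and `animal_dtype` are mine.

```python
import numpy as np
import pandas as pd

animals = ['cat', 'dog', 'fish', 'frog']
df = pd.DataFrame(np.random.choice(animals, size=1000),
                  columns=['cat_simple'])

# spelling out the categories fixes both their names and their order
animal_dtype = pd.CategoricalDtype(categories=animals, ordered=False)
df['cat_simple'] = df['cat_simple'].astype(animal_dtype)

print(df['cat_simple'].cat.categories.tolist())
```

With an explicit dtype, any value outside the listed categories becomes `NaN` during the conversion, which is worth knowing before applying it to messy data.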

These strings can also have typos:

In [27]:
def introduce_typos(text_series, p_cap_error=0.1, p_typo=0.05):
    """
    Introduce typos and capitalization errors into a Pandas Series of strings.
    
    Parameters:
    - text_series: Input series containing strings.
    - p_cap_error: Probability of a capitalization error. Default is 0.1.
    - p_typo: Probability of a typo. Default is 0.05.
    
    Returns:
    - Series with introduced typos and capitalization errors.
    """
    mask_case = np.random.choice([True, False], 
                                 size=text_series.shape[0], 
                                 p=[p_cap_error, 1 - p_cap_error])
    mask_typo = np.random.choice([True, False], 
                                 size=text_series.shape[0], 
                                 p=[p_typo, 1 - p_typo])
    
    # work on a copy so the caller's Series is not modified in place
    text_series = text_series.copy()

    text_series.loc[mask_case] = text_series.loc[mask_case].str.lower()
    text_series.loc[mask_typo] = text_series.loc[mask_typo].apply(typo)
    
    return text_series


In [28]:
df_cat_c = pd.DataFrame(np.random.choice(['cat', 'dog', 'fish', 'frog'],
                                         size=CFG.n_rows),
                        columns=['cat_simple'])
df_cat_c['cat_simple'] = introduce_typos(df_cat_c['cat_simple'])
df_cat_c

Unnamed: 0,cat_simple
0,cat
1,cat
2,frog
3,dog
4,bat
...,...
995,frog
996,fish
997,frog
998,cat
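
A first cleaning step for a column like this could be to check every value against the list of known categories. The sketch below uses a small hand-made sample rather than the random output above; `animals`, `is_valid`, and `flagged` are my own names.

```python
import pandas as pd

animals = ['cat', 'dog', 'fish', 'frog']
messy = pd.Series(['cat', 'cat', 'frog', 'dog', 'bat', 'fish'],
                  name='cat_simple')

# True where the value is one of the known categories
is_valid = messy.isin(animals)

# list the offending values and how often each occurs
print(messy[~is_valid].value_counts())

# keep valid values, replace everything else with NaN for later review
flagged = messy.where(is_valid)
```

Since the labels here are already lowercase, the capitalization step in `introduce_typos` leaves them unchanged, so only the character typos show up as invalid values.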


## Continuous Variables
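
This section is still on the To Do list. As a starting point for the "Blobs" item, here is a rough sketch using `make_blobs` from the `sklearn.datasets` module imported at the top. The column names, the number of centers, and the `random_state` are my own choices, not something the notebook has settled on.

```python
import pandas as pd
from sklearn import datasets

# 1000 points in 2 dimensions, grouped around 3 randomly placed centers
X, y = datasets.make_blobs(n_samples=1000,
                           n_features=2,
                           centers=3,
                           random_state=0)

df_cont_a = pd.DataFrame(X, columns=['cont_x0', 'cont_x1'])
df_cont_a['blob'] = y  # integer label of the center each point was drawn from

print(df_cont_a.head(3))
```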