# Generating Fake Data with Scikit-Learn

Many libraries, including scikit-learn and PyTorch, let you load real-world datasets for testing.

However, sometimes it helps to generate data with the specific properties you are interested in. For learning and for refreshing your skills, it can be useful to add random inputs to your data generation function, generate a new dataset each time, and practice on it.
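As a minimal sketch of that idea, the function below draws a few dataset properties at random and passes them to scikit-learn's `make_classification`. The helper name `random_classification_dataset`, the parameter ranges, and the column names are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification


def random_classification_dataset(n_rows=1000, seed=None):
    """Build a binary classification dataset with partly random properties.

    Hypothetical helper: the ranges below are arbitrary choices.
    """
    rng = np.random.default_rng(seed)

    # Draw the dataset's shape at random.
    n_features = int(rng.integers(4, 11))                  # 4 to 10 features
    n_informative = int(rng.integers(2, n_features + 1))   # at least 2 informative

    X, y = make_classification(
        n_samples=n_rows,
        n_features=n_features,
        n_informative=n_informative,
        n_redundant=0,
        n_classes=2,
        random_state=int(rng.integers(0, 2**31 - 1)),
    )

    df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(n_features)])
    df["target"] = y
    return df


df_random = random_classification_dataset(n_rows=200, seed=0)
```

Passing the same `seed` should reproduce the same dataset, while `seed=None` gives a fresh one on every call.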

## Contents

[Import](#import) <br>
[Boolean variables](#boolean-variables) <br>
[Categorical variables](#categorical-variables)

## To Do

- continuous variables
    - Blobs
    - Different distributions
    - Integer
    - Float
- categorical variables
- Date variables
- Text variables
- Boolean variables

## Notes

For each data type, I'll generate three versions:
1. a: the plain data type
2. b: slightly more realistic data that needs conversion
3. c: messy data that needs cleaning

## Resources
[Sklearn generate datasets](https://scikit-learn.org/stable/datasets/sample_generators.html)


## Import

In [1]:
import pandas as pd
import numpy as np

from sklearn import datasets

In [2]:
class CFG:
    n_rows=1000

## Boolean variables

Let's start with the simplest and create a boolean variable.

In [3]:
df_bool_a = pd.DataFrame(np.random.choice([True, False], 
                                          size=CFG.n_rows),
                         columns=['bool_simple'])
df_bool_a.head()

| | bool_simple |
|---|---|
| 0 | True |
| 1 | False |
| 2 | True |
| 3 | False |
| 4 | False |


In a real-world situation, the same information might arrive as text that we need to convert to boolean.

In [4]:
df_bool_b = pd.DataFrame(np.random.choice(['Inside', 'Outside'], 
                                          size=CFG.n_rows), 
                         columns=['bool_text'])
df_bool_b.head()

| | bool_text |
|---|---|
| 0 | Outside |
| 1 | Inside |
| 2 | Outside |
| 3 | Outside |
| 4 | Outside |
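The conversion step itself is not shown in the notebook. One possible sketch, assuming that `'Inside'` should become `True` and `'Outside'` should become `False` (the column name `is_inside` is also an assumption):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
df = pd.DataFrame({"bool_text": rng.choice(["Inside", "Outside"], size=10)})

# Assumption: 'Inside' means True and 'Outside' means False.
mapping = {"Inside": True, "Outside": False}
df["is_inside"] = df["bool_text"].map(mapping)

# Values missing from the mapping become NaN, so check before casting.
assert df["is_inside"].notna().all()
df["is_inside"] = df["is_inside"].astype(bool)
```

Using an explicit mapping rather than `df["bool_text"] == "Inside"` means that unexpected values surface as missing data instead of silently turning into `False`.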


In an even messier scenario, there might be typos or several spellings of the same value.

Example: an old dataset was combined with a new one in which the values had been renamed.

In [5]:
def typo(s):
    """
    Replaces a random character in the input string with a random lowercase letter.

    Args:
        s (str): The input string.

    Returns:
        str: The input string with a single character replaced by a random lowercase letter.
    """
    idx = np.random.randint(len(s))
    replacement_letter = chr(np.random.choice(range(97, 123)))
    return s[:idx] + replacement_letter + s[idx + 1:]
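If you want the generated typos to be repeatable between runs, one option is to pass a seeded NumPy `Generator` into the function. The variant below is a sketch of that idea; the name `typo_with_rng` and the seed value are assumptions, and like the original it expects a non-empty string.

```python
import numpy as np


def typo_with_rng(s, rng):
    """Replace one random character of s with a random lowercase letter.

    Hypothetical variant of typo() that draws from an explicit Generator.
    """
    idx = int(rng.integers(len(s)))                       # position to replace
    replacement_letter = chr(int(rng.integers(97, 123)))  # ASCII codes for 'a'..'z'
    return s[:idx] + replacement_letter + s[idx + 1:]


rng = np.random.default_rng(42)
messy = typo_with_rng("Outside", rng)

# The result keeps its length, and at most one position differs from the input
# (zero positions if the random letter happens to equal the original one).
n_diff = sum(a != b for a, b in zip("Outside", messy))
```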

In [6]:
# Start from the same text values as in b
df_bool_c = pd.DataFrame(np.random.choice(['Inside', 'Outside'], 
                                          size=CFG.n_rows), 
                         columns=['bool_typo'])

# Pick random subsets of rows: about 10% to lowercase, about 5% to receive a typo
mask_case = np.random.choice([True, False], size=CFG.n_rows, p=[0.1, 0.9])
mask_typo = np.random.choice([True, False], size=CFG.n_rows, p=[0.05, 0.95])

df_bool_c.loc[mask_case, 'bool_typo'] = (df_bool_c.loc[mask_case, 'bool_typo']
                                         .str
                                         .lower())

df_bool_c.loc[mask_typo, 'bool_typo'] = (df_bool_c.loc[mask_typo, 'bool_typo']
                                         .apply(typo))

print(df_bool_c.value_counts().sort_values(ascending=False))

df_bool_c

bool_typo
Inside       439
Outside      430
inside        51
outside       31
ynside         3
Itside         1
Outsidh        1
Insidd         1
Innide         1
Insidv         1
Insidz         1
Insike         1
Insode         1
Inspde         1
Inssde         1
Ojtside        1
Oftside        1
Outsids        1
Oqtside        1
Ouaside        1
Oumside        1
Ouoside        1
Outcide        1
Outkide        1
Outqide        1
Ikside         1
Outsidj        1
wnside         1
Outtide        1
knside         1
Insiae         1
onside         1
Outwide        1
Outxide        1
futside        1
insidb         1
Inlide         1
insive         1
iuside         1
nnside         1
wutside        1
outaide        1
Insida         1
outsbde        1
Iniide         1
outsidt        1
outsqde        1
outstde        1
ouuside        1
qutside        1
Idside         1
Name: count, dtype: int64


| | bool_typo |
|---|---|
| 0 | Inside |
| 1 | Outside |
| 2 | Insida |
| 3 | Inside |
| 4 | outside |
| ... | ... |
| 995 | Outside |
| 996 | Outside |
| 997 | Inside |
| 998 | Inside |
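Cleaning this column is left as practice. As one possible approach, and not necessarily the one this notebook intends, the sketch below lowercases each value and maps it to the closest canonical label with the standard library's `difflib.get_close_matches`. The helper name `clean_label`, the `cutoff` value, and the sample values are assumptions.

```python
import difflib

import pandas as pd

CANONICAL = ["Inside", "Outside"]


def clean_label(value, cutoff=0.6):
    """Map a messy label to its closest canonical value, or None if nothing is close enough."""
    lookup = {c.lower(): c for c in CANONICAL}
    # Compare case-insensitively and keep only the single best match.
    matches = difflib.get_close_matches(value.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


messy = pd.Series(["Inside", "outside", "Insida", "wutside", "Oqtside", "ynside"])
cleaned = messy.apply(clean_label)

# After cleaning, the column can be converted to boolean as in situation b.
is_inside = cleaned == "Inside"
```

Because 'Inside' and 'Outside' share the ending 'side', it is worth inspecting `cleaned.value_counts(dropna=False)` afterwards and checking any value that was mapped to `None` or looks suspicious.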


Now that we have our boolean examples, we can concatenate them into a single dataframe.

In [9]:
df_bool = pd.concat([df_bool_a, df_bool_b, df_bool_c], axis=1)
df_bool

| | bool_simple | bool_text | bool_typo |
|---|---|---|---|
| 0 | True | Outside | Inside |
| 1 | False | Inside | Outside |
| 2 | True | Outside | Insida |
| 3 | False | Outside | Inside |
| 4 | False | Outside | outside |
| ... | ... | ... | ... |
| 995 | True | Inside | Outside |
| 996 | True | Outside | Outside |
| 997 | False | Outside | Inside |
| 998 | True | Inside | Inside |


## Categorical variables 

## Continuous variables