<h1 id='faker-dataset'><center>📋 Faker Dataset 📋</center></h1>
<i><center>An amazing library for generating fake data</center></i>

----

<h2 id='problem-description'>📝 Problem Description</h2>

<br />

<figure>
    <img src='https://files.realpython.com/media/A-Guide-to-Pandas-Dataframes_Watermarked.7330c8fd51bb.jpg' alt='Pandas Cartoon' />
    <figcaption><i>Fig. 1 - Pandas Library Cartoon. <sup>©</sup><a href='https://realpython.com/pandas-dataframe/' target='_blank'>Real Python</a></i></figcaption>
</figure>

<br />

You have been hired for a Data Science job where you must explore the <a href='https://faker.readthedocs.io/en/master/' target='_blank'>Faker Python Library</a> to generate a dataset of fake data.

Your dataset file must be named `faker-dataset.csv` and stored in the `dataset` folder of this project directory. It must contain at least `5 features` and `3,000 rows`, and every feature's values must be generated with the Faker Library.

After generating the data and saving it to a CSV file, you must read the file with the `utf-8` charset and show its first 5 rows to check that the dataset is ready for other Data Scientists to use.

Good Luck!! 🍀 ☘️

----

<h2 id='files-description'>📁 Files Description</h2>

> **faker-dataset.csv** - contains at least 3,000 rows of fake data generated by the `Faker Python Library`.

----

<h2 id='library-features'>❓ Library Features</h2>

> **region/locale** - the person's locale;

> **name** - the person's name;

> **email** - the person's e-mail;

> **address** - the person's address;

> **license_plate** - the license plate of the person's vehicle;

> **company** - the company where the generated person works;

> **job** - the job held by the person;

> **color_name** - the person's favorite color.

<i>You can check out all the possible features here: <a href='https://faker.readthedocs.io/en/master/providers.html' target='_blank'>Faker Library - Standard Providers</a></i>

----

<h2 id='goals'>🎯 Goals</h2>

> **Goal 1** - create a CSV file named `faker-dataset.csv` containing at least `5 features` and `3,000 rows`, encoded as `utf-8`;

> **Goal 2** - generate all data with the `Faker Library`;

> **Goal 3** - be able to read and display the first five rows of the dataset using `Pandas Library`.

----

<h2 id='setup'>⚙️ Setup</h2>

***Tools***

> Python Version 3.9.x+;

> Jupyter Notebook.

<br />

***Libraries***

> Faker;

> Numpy, Pandas.

----

<h2 id='acknowledgments'>🎉 Acknowledgments</h2>

> <a href='https://github.com/fzaninotto' target='_blank'>Fzaninotto</a> and the <a href='https://faker.readthedocs.io/en/master/' target='_blank'>Faker Library team</a>!

----

In [137]:
import pandas as pd          # pip install pandas
import numpy as np           # pip install numpy
from faker import Faker      # pip install faker
from random import randrange # native library

SEED = 5779
NUMBER_OF_ROWS = 7500
FAKE_FEATURES = ['region', 'name', 'email', 'address', 'license_plate', 'company', 'job', 'color_name']
FAKE_REGIONS = ['en_US', 'pt_BR', 'ja_JP']
FILE_PATH = './dataset/faker-dataset.csv'

np.random.seed(SEED)
pd.set_option('display.max_rows', 5)
pd.set_option('display.max_columns', 8)

# Pass a list of locales for a multi-locale generator; use Faker() for the default locale
fake = Faker(FAKE_REGIONS)
Faker.seed(SEED)
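
One detail worth noting: `np.random.seed` seeds only NumPy's generator, while `randrange` comes from Python's built-in `random` module, which keeps its own state. A minimal sketch, assuming the goal is a repeatable sequence of regions, seeds `random` directly. The `pick_regions` helper is hypothetical and exists only for this illustration.

```python
import random

SEED = 5779
FAKE_REGIONS = ['en_US', 'pt_BR', 'ja_JP']


def pick_regions(count, seed):
    """Seed the built-in random module, then draw `count` regions."""
    random.seed(seed)
    return [FAKE_REGIONS[random.randrange(len(FAKE_REGIONS))] for _ in range(count)]


first_run = pick_regions(5, SEED)
second_run = pick_regions(5, SEED)
print(first_run == second_run)  # True: same seed, same sequence
```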

----

<h2 id='functions'>0) Functions</h2>

For this step, we need to create two functions: one to select one of the three regions (*en_US*, *pt_BR* and *ja_JP*) randomly; and another one to generate the fake dataset with x number of rows.

Let's go!!

In [127]:
def choose_person_region(regions_list):
    """Return one region picked at random from regions_list."""
    return regions_list[randrange(len(regions_list))]

In [129]:
def generate_fake_person(number_of_rows):
    """
    Return a list with `number_of_rows` rows of fake people.

    Each row is generated by the Faker library, with a randomly chosen region selecting the locale of the data.
    """
    people_list = []    
    
    for row in range(number_of_rows):
        region = choose_person_region(FAKE_REGIONS)
        
        temp_person = [
            region
            , fake[region].name()
            , fake[region].email()
            , fake[region].address()
            , fake[region].license_plate()
            , fake[region].company()
            , fake[region].job()
            , fake[region].color_name()
        ]
        
        people_list.append(temp_person)
    
    return people_list

----

<h2 id='creating-and-exporting-dataset'>1) Creating and Exporting Dataset</h2>

With our functions set, we will create a dataset containing `7,500 rows` and export it in `.csv` format to the `dataset` folder.

In [146]:
df = pd.DataFrame(generate_fake_person(NUMBER_OF_ROWS), columns=FAKE_FEATURES)

print(f'# of Rows: {df.shape[0]}')
print(f'# of Columns: {df.shape[1]}')
print('-----')
df.head()

# of Rows: 7500
# of Columns: 8
-----


| | region | name | email | address | license_plate | company | job | color_name |
|---|---|---|---|---|---|---|---|---|
| 0 | en_US | Margaret Marshall | bonnie92@example.com | 1490 Brett Dam Suite 041\nSouth Kristopher, MH... | 7HQ 955 | Garcia, Brady and Bishop | Food technologist | DarkSlateBlue |
| 1 | pt_BR | Alana Cavalcanti | almeidagiovanna@example.net | Estação de da Mata, 45\nBetânia\n15045-069 Tei... | SSK-7I92 | Moreira | Psicomotricista | Eminência |
| 2 | ja_JP | 渡辺 陽子 | tanakayui@example.com | 宮崎県川崎市川崎区日本堤7丁目24番4号 アーバン橋場820 | B86-16Q | 有限会社山口銀行 | アニメーター | Crimson |
| 3 | ja_JP | 松本 直子 | hanako51@example.com | 北海道夷隅郡大多喜町東三島5丁目19番9号 土呂部シャルム199 | 054P | 有限会社石井保険 | 調理師 | AliceBlue |
| 4 | pt_BR | Erick Pereira | tpinto@example.net | Loteamento de Lima, 27\nVila Tirol\n13738-786 ... | DSS-5A15 | Fernandes S.A. | Estilista | Carmim |


Uh-oh! Look at the `address` column: some values contain `\n` characters, for example `Estação de da Mata, 45\nBetânia\n...` in the second row. Let's replace every `\n` with a single space in all columns that can hold more than one word.

In [147]:
# Replace embedded newlines with a single space in every multi-word text column
for column in ['name', 'address', 'license_plate', 'company', 'job', 'color_name']:
    df[column] = df[column].str.replace('\n', ' ', regex=False)

df.head()

| | region | name | email | address | license_plate | company | job | color_name |
|---|---|---|---|---|---|---|---|---|
| 0 | en_US | Margaret Marshall | bonnie92@example.com | 1490 Brett Dam Suite 041 South Kristopher, MH ... | 7HQ 955 | Garcia, Brady and Bishop | Food technologist | DarkSlateBlue |
| 1 | pt_BR | Alana Cavalcanti | almeidagiovanna@example.net | Estação de da Mata, 45 Betânia 15045-069 Teixe... | SSK-7I92 | Moreira | Psicomotricista | Eminência |
| 2 | ja_JP | 渡辺 陽子 | tanakayui@example.com | 宮崎県川崎市川崎区日本堤7丁目24番4号 アーバン橋場820 | B86-16Q | 有限会社山口銀行 | アニメーター | Crimson |
| 3 | ja_JP | 松本 直子 | hanako51@example.com | 北海道夷隅郡大多喜町東三島5丁目19番9号 土呂部シャルム199 | 054P | 有限会社石井保険 | 調理師 | AliceBlue |
| 4 | pt_BR | Erick Pereira | tpinto@example.net | Loteamento de Lima, 27 Vila Tirol 13738-786 Po... | DSS-5A15 | Fernandes S.A. | Estilista | Carmim |
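
The same cleanup can be sketched without naming each column. This variant assumes that no column should keep its newlines and relies on `DataFrame.replace` with `regex=True`, which replaces matching substrings in every string cell. The sample values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Tiny hand-made frame; the values are illustrative only.
sample = pd.DataFrame({
    'name': ['Ana Lima', 'John Doe'],
    'address': ['Rua A, 10\nCentro\n00000-000 Cidade', '1 Main St\nSpringfield'],
})

# With regex=True the pattern is matched inside each string,
# so every newline becomes a single space across all columns.
cleaned = sample.replace('\n', ' ', regex=True)

print(cleaned.loc[1, 'address'])  # 1 Main St Springfield
```

Naming the columns explicitly, as the notebook does, is safer when some text columns must keep their line breaks.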


Now the dataset is ready to be exported to CSV!

In [148]:
df.to_csv(FILE_PATH, header=True, encoding='utf-8')
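
`to_csv` does not create missing folders, so the export above assumes that `./dataset` already exists. A minimal sketch of a more defensive export follows. It uses a temporary directory as a stand-in for the project folder and made-up sample rows, and it passes `index=False`, a design choice that differs from the notebook, which keeps the index column and reads it back later.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative rows only; not taken from the generated dataset.
sample = pd.DataFrame({'region': ['en_US', 'ja_JP'], 'name': ['John Doe', '山田 太郎']})

# A temporary directory stands in for the project folder in this sketch.
with tempfile.TemporaryDirectory() as project_dir:
    file_path = Path(project_dir) / 'dataset' / 'faker-dataset.csv'
    file_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    sample.to_csv(file_path, index=False, encoding='utf-8')
    restored = pd.read_csv(file_path, encoding='utf-8')

print(restored.columns.tolist())  # ['region', 'name']
```

Without the index column in the file, reading it back needs no `index_col` argument.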

----

<h2 id='reading-dataset-file'>2) Reading Dataset File</h2>

With our `faker-dataset.csv` generated, let's read it and display the first five rows to check out whether everything's okay!!

In [149]:
# The first CSV column holds the saved index; read the file explicitly as utf-8
faker_dataset = pd.read_csv(FILE_PATH, index_col=0, encoding='utf-8')

print(f'# of Rows: {faker_dataset.shape[0]}')
print(f'# of Columns: {faker_dataset.shape[1]}')
print('-----')
faker_dataset.head()

# of Rows: 7500
# of Columns: 8
-----


| | region | name | email | address | license_plate | company | job | color_name |
|---|---|---|---|---|---|---|---|---|
| 0 | en_US | Margaret Marshall | bonnie92@example.com | 1490 Brett Dam Suite 041 South Kristopher, MH ... | 7HQ 955 | Garcia, Brady and Bishop | Food technologist | DarkSlateBlue |
| 1 | pt_BR | Alana Cavalcanti | almeidagiovanna@example.net | Estação de da Mata, 45 Betânia 15045-069 Teixe... | SSK-7I92 | Moreira | Psicomotricista | Eminência |
| 2 | ja_JP | 渡辺 陽子 | tanakayui@example.com | 宮崎県川崎市川崎区日本堤7丁目24番4号 アーバン橋場820 | B86-16Q | 有限会社山口銀行 | アニメーター | Crimson |
| 3 | ja_JP | 松本 直子 | hanako51@example.com | 北海道夷隅郡大多喜町東三島5丁目19番9号 土呂部シャルム199 | 054P | 有限会社石井保険 | 調理師 | AliceBlue |
| 4 | pt_BR | Erick Pereira | tpinto@example.net | Loteamento de Lima, 27 Vila Tirol 13738-786 Po... | DSS-5A15 | Fernandes S.A. | Estilista | Carmim |
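
The size requirements from the goals (at least `5 features` and `3,000 rows`) can also be checked programmatically. The `meets_goals` helper and the stand-in frames below are assumptions made for this sketch; in the notebook the check would be applied to `faker_dataset`.

```python
import pandas as pd

MIN_FEATURES = 5
MIN_ROWS = 3_000


def meets_goals(dataset):
    """Return True when the frame has at least MIN_FEATURES columns and MIN_ROWS rows."""
    n_rows, n_columns = dataset.shape
    return n_columns >= MIN_FEATURES and n_rows >= MIN_ROWS


# Stand-in frames with the same shape as the notebook's dataset and a too-small one.
big = pd.DataFrame({f'feature_{i}': range(7500) for i in range(8)})
small = big.head(10)

print(meets_goals(big), meets_goals(small))  # True False
```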


Yeah!! Our job is done for now.

See you in the next notebook 👋👋

----

<h2 id="reach-me">📫 Reach Me</h2>

> Email: csfelix08@gmail.com

> LinkedIn: [linkedin.com/in/csfelix/](https://linkedin.com/in/csfelix/)

> Portfolio: [CSFelix.io](https://csfelix.github.io)

> Kaggle: [DSFelix](https://www.kaggle.com/dsfelix)