## `Faker`

Faker ([https://faker.readthedocs.io/](https://faker.readthedocs.io/)) is a library for generating fake data.

In [1]:
import pandas as pd
from faker import Faker
Faker.seed(42)

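Create a `fake` instance with the Russian locale `ru_RU`:

In [2]:
fake = Faker('ru_RU')          # canonical spelling of the locale code
fake.seed_locale('ru_RU', 42)  # seed the generator for this locale
fake.seed_instance(42)         # seed this Faker instance as a whole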

Define a function `create_row_faker` that generates test rows using the following Faker methods:

 - `fake.name()`
 - `fake.random_int()`
 - `fake.postcode()`
 - `fake.email()`
 - `fake.country()`

In [3]:
def create_row_faker(num=1):
    """Return a list of `num` dicts, each holding one fake person record."""
    output = [{"name": fake.name(),
               "age": fake.random_int(0, 100),
               "postcode": fake.postcode(),
               "email": fake.email(),
               "nationality": fake.country(),
              } for _ in range(num)]
    return output

In [11]:
create_row_faker()

[{'name': 'Сидор Вячеславович Селезнев',
  'age': 13,
  'postcode': '893252',
  'email': 'selivan_01@hotmail.com',
  'nationality': 'Республика Конго'}]

Using this function, build a fake dataset `df_fake` of 50 rows generated by `create_row_faker`.

In [12]:
%%time
df_fake = pd.DataFrame(create_row_faker(50))

Wall time: 48 ms


In [13]:
df_fake.head()

| | name | age | postcode | email | nationality |
|---|---|---|---|---|---|
| 0 | Журавлев Ферапонт Артурович | 10 | 182278 | stojan96@yahoo.com | Северная Македония |
| 1 | Евпраксия Константиновна Власова | 85 | 787133 | adrian_98@yahoo.com | Соломоновы Острова |
| 2 | Богданов Аристарх Давыдович | 8 | 518347 | vlas_2019@yandex.ru | Словакия |
| 3 | Савватий Устинович Пономарев | 12 | 656670 | maslovalukija@gmail.com | Италия |
| 4 | Беляков Прохор Викентьевич | 54 | 473178 | kulaginasofija@gmail.com | Бразилия |


## Test data in Spark

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("fake data") \
    .getOrCreate()

In [None]:
df = spark.createDataFrame(create_row_faker(50))

In [10]:
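# define an explicit schema for the Spark DataFrame
# (DateType and DecimalType are used by a later schema in this notebook)
from pyspark.sql.types import (DateType, DecimalType, IntegerType,
                               StringType, StructField, StructType)
schema = StructType([StructField('name', StringType()),
                     StructField('age', IntegerType()),
                     StructField('postcode', StringType()),
                     StructField('email', StringType()),
                     StructField('nationality', StringType())])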

In [11]:
df = spark.createDataFrame(create_row_faker(50), schema)

In [12]:
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- postcode: string (nullable = true)
 |-- email: string (nullable = true)
 |-- nationality: string (nullable = true)



In [13]:
%%time
n = 5*10**4        # what if there is even more data?
df = spark.createDataFrame(create_row_faker(n), schema)

CPU times: user 27 s, sys: 137 ms, total: 27.1 s
Wall time: 27.1 s
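
If `n` grows further, one option is to produce the rows in fixed-size batches instead of one large list. The sketch below shows only the batching step. `make_row` is a hypothetical stand-in for the Faker-based row, and `batched` is a helper written for this illustration, not part of Faker or PySpark.

```python
from itertools import islice


def make_row(i):
    # Hypothetical stand-in for one Faker-generated record
    return {"name": f"user{i}", "age": i % 101}


def batched(rows, size):
    """Yield lists of at most `size` items taken from the iterable `rows`."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


rows = (make_row(i) for i in range(10))       # lazy source of rows
batch_sizes = [len(batch) for batch in batched(rows, 4)]
print(batch_sizes)
```

Each batch could then be passed to `spark.createDataFrame` separately. Whether that helps depends on where the time is actually spent, which the timings above do not break down.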


In [14]:
d = ({"name": fake.name(),
      "age": fake.random_int(0, 100),
      "postcode": fake.postcode(),
      "email": fake.email(),
      "nationality": fake.country()} for i in range(5))
# note: this is a generator expression, so the rows are produced lazily
type(d)

generator
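
A generator can be iterated only once, which is the main thing to watch for here. The following sketch uses plain dictionaries instead of Faker calls to show the effect:

```python
# Generator expression: no rows exist until something iterates over it
rows = ({"id": i} for i in range(3))

first_pass = list(rows)    # consumes the generator
second_pass = list(rows)   # the generator is already exhausted

print(len(first_pass), len(second_pass))
```

If the same rows are needed twice, for example for both a pandas and a Spark DataFrame, materialize them into a list first or re-create the generator.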

In [16]:
%%time
n = 5*10**4
d = ({"name": fake.name(),
      "age": fake.random_int(0, 100),
      "postcode": fake.postcode(),
      "email": fake.email(),
      "nationality": fake.country()} 
     for i in range(n))
df = spark.createDataFrame(d, schema)

CPU times: user 27.5 s, sys: 158 ms, total: 27.6 s
Wall time: 27.6 s


In [None]:
df.show(n=5)

Run the following operations:
 - group by
 - count
 - filter
 - sort by
 - show

In [19]:
import pyspark.sql.functions as F
df.groupBy('postcode') \
  .agg(F.count('postcode').alias('Count'), F.round(F.avg('age'), 2).alias('Average age')) \
  .filter('Count>1') \
  .orderBy('Average age', ascending=False) \
  .show(5)  

+--------+-----+-----------+
|postcode|Count|Average age|
+--------+-----+-----------+
|   86678|    4|       90.0|
|   23084|    4|       87.5|
|   89884|    4|      86.75|
|   99646|    4|       84.5|
|   96353|    4|       82.0|
+--------+-----+-----------+
only showing top 5 rows
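
To make the meaning of each step explicit, the same pipeline can be traced in plain Python. This is only an illustrative sketch on a small hand-written list. The `rows` values are hypothetical and are not taken from the DataFrame above.

```python
from collections import defaultdict

# Hypothetical rows standing in for the Spark DataFrame
rows = [
    {"postcode": "101000", "age": 30},
    {"postcode": "101000", "age": 50},
    {"postcode": "190000", "age": 20},
    {"postcode": "190000", "age": 21},
    {"postcode": "630000", "age": 99},
]

# group by postcode
ages_by_postcode = defaultdict(list)
for row in rows:
    ages_by_postcode[row["postcode"]].append(row["age"])

# count and average per group, keeping only groups with more than one row
summary = [
    (postcode, len(ages), round(sum(ages) / len(ages), 2))
    for postcode, ages in ages_by_postcode.items()
    if len(ages) > 1
]

# sort by average age, descending, and keep the top 5
summary.sort(key=lambda item: item[2], reverse=True)
top = summary[:5]
print(top)
```

The postcode with a single row is dropped by the filter, just as `.filter('Count>1')` drops it in the Spark version.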



## Creating a `Faker` instance for tests

In [16]:
from collections import OrderedDict
# locale -> weight: a higher weight makes that locale more likely to be picked
locales = OrderedDict([
    ('ru_RU', 5),
    ('de_DE', 2),
])
fake = Faker(locales)
fake.seed_instance(42)
fake.locales

['ru_RU', 'de_DE']

In [17]:
fake.seed_locale('de_DE', 0)
fake.seed_locale('ru_RU', 0)

In [24]:
fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group', 
                     'mail', 'current_location'])

{'current_location': (Decimal('-85.6489225'), Decimal('-34.487601')),
 'blood_group': 'A-',
 'name': 'Magrit Graf',
 'sex': 'F',
 'mail': 'hartungclaudio@web.de',
 'birthdate': datetime.date(1946, 4, 7)}

In [25]:
# schemas for the test Spark DF
# note: DecimalType() defaults to decimal(10,0), so fractional digits are not kept

location = StructField('current_location',
                       StructType([StructField('lat', DecimalType()),
                                   StructField('lon', DecimalType())])
                      )

schema = StructType([StructField('name', StringType()),
                     StructField('birthdate', DateType()),
                     StructField('sex', StringType()),
                     StructField('blood_group', StringType()),
                     StructField('mail', StringType()), 
                     location
                     ])

In [32]:
fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group', 
                     'mail', 'current_location'])

{'current_location': (Decimal('-85.6489225'), Decimal('-34.487601')),
 'blood_group': 'A-',
 'name': 'Magrit Graf',
 'sex': 'F',
 'mail': 'hartungclaudio@web.de',
 'birthdate': datetime.date(1946, 4, 6)}

In [28]:
%%time
n = 5*10**3
d = (fake.profile(fields=['name', 'birthdate', 'sex', 'blood_group', 
                          'mail', 'current_location']) 
     for i in range(n))
df = spark.createDataFrame(d, schema)

CPU times: user 9.45 s, sys: 50 ms, total: 9.5 s
Wall time: 9.54 s


In [29]:
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- birthdate: date (nullable = true)
 |-- sex: string (nullable = true)
 |-- blood_group: string (nullable = true)
 |-- mail: string (nullable = true)
 |-- current_location: struct (nullable = true)
 |    |-- lat: decimal(10,0) (nullable = true)
 |    |-- lon: decimal(10,0) (nullable = true)
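
The `decimal(10,0)` shown for `lat` and `lon` comes from calling `DecimalType()` without arguments: the default is precision 10 and scale 0. This is why the coordinates appear as whole numbers in the `df.show` output. The sketch below uses only the standard `decimal` module to show what scale 0 and scale 7 mean for a Faker-style latitude. The rounding mode is an assumption made for the illustration, not a statement about how Spark converts values.

```python
from decimal import Decimal, ROUND_HALF_UP

lat = Decimal("-85.6489225")   # latitude value from the profile output above

# Scale 0: no digits after the decimal point (rounding mode assumed here)
scale_0 = lat.quantize(Decimal("1"), rounding=ROUND_HALF_UP)

# Scale 7: seven digits after the decimal point, enough for this value
scale_7 = lat.quantize(Decimal("0.0000001"))

# Total number of digits needed at scale 7, i.e. the required precision
digits_needed = len(scale_7.as_tuple().digits)

print(scale_0, scale_7, digits_needed)
```

Under these assumptions, declaring the fields as `DecimalType(10, 7)` instead of `DecimalType()` would leave room for three integer digits and seven fractional digits, which covers longitudes up to 180 as well.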



In [30]:
df.show(n=10)

+--------------------+----------+---+-----------+--------------------+----------------+
|                name| birthdate|sex|blood_group|                mail|current_location|
+--------------------+----------+---+-----------+--------------------+----------------+
|         Ilija Stroh|1986-02-06|  M|        AB-|emueller@googlema...|       [-5, 148]|
|     Philomena Hesse|2002-09-06|  F|         A+|naserhans-hermann...|        [82, 12]|
| Prof. Annelise Mude|1910-03-30|  F|         A-|ftschentscher@aol.de|       [29, 177]|
|       Branka Hamann|1970-06-08|  F|         A+|fheinz@googlemail...|      [78, -136]|
|       Lilli Lercher|1996-12-01|  F|         A-|arnoldlouise@kabs...|     [-15, -108]|
|  Hans-Karl Fröhlich|2000-12-23|  M|         A+|    wlosekann@aol.de|       [-32, 47]|
|Hanife Mitschke MBA.|1986-11-15|  F|        AB-|    hkramer@yahoo.de|      [64, -131]|
|Ing. Susi Weiß B....|1914-07-09|  F|         O-|        lwiek@aol.de|      [85, -113]|
|         Anika Knoll|2007-01-02

In [33]:
# don't forget to stop the Spark session
spark.stop()