# Creating a MOCK dataset 
Author: Karina Condeixa
__________________________________________

The fake dataset was built in stages: first the dataframes for `users` and `item`, then a complete dataframe that can be used in the `model`.

`numpy.random` and `faker` were used to create randomized series. Latitude/longitude points were picked with a polygon method, using four points chosen manually in Google Maps.

Dependent variables are kept together in a single series and are split into separate columns later. Their combined name joins the parts with an underscore or a hyphen (for example `user_lat_lng` or `item_category-item_name`).

Addresses are illustrative: they come from all over Germany. This is a limitation of the Faker package, which only works at country level and has no data specific to Berlin.

_________________________________

**TODO:**
- split geopoints into lat and lng files for users
- split geopoints into lat and lng files for items
- create final dataset, organising by columns
- save the files as csv

In [208]:
# import packages
import pandas as pd
from faker import Faker, providers
from faker.providers.address.de_DE import Provider as DeDeAddressProvider
from faker.generator import random
from faker.providers import BaseProvider
# import random

import folium

### Postcodes

In [209]:
# import and clean the original dataset, keeping only the Berlin postcodes out of all German postcodes
postcodes_de = pd.read_excel(r'data/German-Zip-Codes.xlsx', sheet_name='Berlin')
df = pd.DataFrame(postcodes_de)
df = df.set_axis(["postcodes_berlin"], axis=1)  # inplace=True is deprecated in pandas 1.5 and removed in 2.0
df = (df["postcodes_berlin"].str[8:-11])  # keep only the postcode part of each string
df.to_csv('data/postcodes_berlin.csv', index=False)  # same path that the next cell reads

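The fixed slice `str[8:-11]` only works while every raw cell has the same layout. As a hedged alternative, the sketch below pulls the first five-digit run out of each string with a regular expression. The strings in `raw` are hypothetical, since the actual content of the Excel sheet is not shown here.

```python
import pandas as pd

# Hypothetical raw cells; the real layout of the Excel sheet is an assumption.
raw = pd.Series(["Berlin, 10117 (Mitte)", "Berlin, 10119 (Mitte)", "no postcode here"])

# Extract the first run of exactly five digits; rows without one become NaN and are dropped.
postcodes = raw.str.extract(r"(?<!\d)(\d{5})(?!\d)", expand=False).dropna()

print(postcodes.tolist())
```

A pattern-based extraction tolerates prefixes and suffixes of varying length, at the cost of silently dropping rows that contain no postcode.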


### Creating the postcode series

In [210]:
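# import the Berlin postcodes and turn the single column into a series
postcodes_berlin = pd.read_csv('data/postcodes_berlin.csv')
print(postcodes_berlin)
# read_csv already consumes the header row, so no row is skipped here
# (the earlier [1:] slice dropped the first postcode, 10117)
postcodes_berlin_series = postcodes_berlin["postcodes_berlin"]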

     postcodes_berlin
0               10117
1               10119
2               10178
3               10179
4               10243
..                ...
184             14169
185             14193
186             14195
187             14197
188             14199

[189 rows x 1 columns]


### Creating lists

#### lat lng: choosing a random point based on a polygon
Reference: [A quick trick to create random lat/long coordinates in python (within a defined polygon)](https://medium.com/the-data-journal/a-quick-trick-to-create-random-lat-long-coordinates-in-python-within-a-defined-polygon-e8997f05123a)

In [211]:
# Importing Modules
import numpy as np
import random
# Use 'conda install shapely' to install the shapely library.
from shapely.geometry import Polygon, Point

num_records = 10

### Creating the datasets

In [212]:
# Note: # multi_locale_generator = Faker(['it_IT', 'en_US', 'de-DE', 'pt_BR', 'es-ES', 'fr-FR', 'ru-RU', 'tr-TR'])

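# Instantiate Faker: a German-locale generator for addresses and a default-locale generator for everything else
german_locale_generator = Faker(['de_DE'])
fake = Faker()
Faker.seed(0)  # seeds Faker only; numpy and the random module are not affected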
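`Faker.seed(0)` makes the Faker output repeatable, but the `np.random.choice` and `random.uniform` calls used later draw from separate generators. A minimal sketch, assuming full reproducibility is wanted, seeds those two as well. The helper name `seed_everything` is hypothetical, and the seed value 0 simply mirrors the Faker call.

```python
import random
import numpy as np

def seed_everything(seed=0):
    # Hypothetical helper: seeds the stdlib and NumPy global generators.
    # Faker.seed(seed) would still be called separately, as in the cell above.
    random.seed(seed)
    np.random.seed(seed)

seed_everything(0)
first = (random.uniform(0, 1), np.random.choice([10117, 10119, 10178]))

seed_everything(0)
second = (random.uniform(0, 1), np.random.choice([10117, 10119, 10178]))

print(first == second)  # the same seed gives the same draws
```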


In [214]:
# Define the desired polygon: points chosen in Google Maps.
# Coordinates are (lat, lng), so in this notebook x is latitude and y is longitude.

poly = Polygon([(52.645883, 13.395869), 
                (52.526568, 13.645808),
                (52.381789, 13.405482),
                (52.484773, 13.136317)])


# Bounding box of the polygon, for reference only:
# the function below reads the same values from poly.bounds
min_x = 52.381789
max_x = 52.645883
min_y = 13.136317
max_y = 13.645808

# Define the random point generator (rejection sampling inside the bounding box)
def polygon_random_points(poly, num_records):
    min_x, min_y, max_x, max_y = poly.bounds
    points = []
    while len(points) < num_records:
        random_point = Point([random.uniform(min_x, max_x), random.uniform(min_y, max_y)])
        if random_point.within(poly):
            points.append(random_point)
    return points  # always a list, even when num_records is 1

# Choose the number of points desired.
points = polygon_random_points(poly, num_records)
# Test the results.
for p in points:
    print(p.x, ",", p.y)

                             

52.51468923314108 , 13.443433496679852
52.48083455387833 , 13.324635956781632
52.54068013426472 , 13.348663709485534
52.526450026590865 , 13.517410349162015
52.47926254962274 , 13.41680634766404
52.52525214023197 , 13.478763560118274
52.545200100761996 , 13.292108640382077
52.546826551287495 , 13.333600167524281
52.515011473059054 , 13.604610508684335
52.49998179940619 , 13.400180189057899


### The polygon in the map

In [215]:
mapObj = folium.Map(location=[52.520008,13.404954], zoom_start=5)

In [216]:
folium.Polygon([(52.645883, 13.395869), 
                (52.526568, 13.645808), 
                (52.381789, 13.405482), 
                (52.484773, 13.136317)],
                fill=True, 
                weight=3, 
                fill_color='orange',
                color='green',
                fill_opacity=0.8).add_to(mapObj)

mapObj.save('output.html')

### User

In [217]:
# define a function to create user data

user_type = ['giver', 'looker']


def create_user_data(num_records): 
  
    # dictionary 
    user ={} 
    for i in range(0, num_records): 
        user[i] = {} 
        user[i]['name'] = fake.name()
#         user[i]['email'] = fake.email()
#         user[i]['email'] = fake.ascii_free_email()
        user[i]['email'] = fake.ascii_email()
        user[i]['address'] = german_locale_generator.address()  # these addresses come from all over Germany; a Berlin-only address list is still needed
        user[i]['user_type'] = fake.random_element(user_type)
        user[i]['user_lat_lng'] = polygon_random_points(poly,1)
        user[i]['user_postcode'] = np.random.choice(postcodes_berlin_series)

    return user

In [218]:
user_mock_df = pd.DataFrame(create_user_data(1000)).transpose()
user_mock_df.head(5)

Unnamed: 0,name,email,address,user_type,user_lat_lng,user_postcode
0,Norma Fisher,ysullivan@yahoo.com,Alwina-Etzold-Ring 19\n89241 Sömmerda,looker,[POINT (52.49270299944904 13.2632600025517)],13158
1,Brian Hamilton,hramos@brown-sellers.com,Eleonore-Oderwald-Ring 51\n93328 Bremen,giver,[POINT (52.55783661408981 13.27197206301923)],13409
2,Sheri Bolton DDS,jasmine85@hotmail.com,Conradistr. 5/9\n42320 Naila,giver,[POINT (52.59564570175657 13.40603977312539)],10439
3,Peter Mcdowell,villanuevasandra@vega.net,Salzring 7/5\n59179 Erfurt,giver,[POINT (52.43368398014155 13.29297219220949)],13057
4,Devin Thornton,marvincabrera@gmail.com,Margot-Ruppert-Allee 013\n61510 Euskirchen,giver,[POINT (52.43427814851649 13.48465490616225)],12309


In [219]:
# add user_ids 
user_mock_df['user_id'] = user_mock_df.index + 1
user_id_series = user_mock_df['user_id']

### Item

In [220]:
item_status = ['avaliable', 'not_available']
item_condition = ['good','medium','bad']

category_item = ['furniture-sofa',
                 'furniture-armchair',
                 'furniture-chair',
                 'furniture-table',
                 'furniture-desk',
                 'furniture-bed',
                 'furniture-bookcase',
                 'furniture-bedside_table',
                 'furniture-cabinet',
                 'furniture-wardrobe',
                 'furniture-shelf',
                 'furniture-cupboard',
                 'furniture-rollcontainers',
                 'furniture-shoe_rack',
                 'furniture-mirror',
                 'furniture-cot',
                 'furniture-trolley',
                 'appliance-washing_machine',
                 'appliance-dish_washer',
                 'appliance-drying_rack',
                 'appliance-refrigerator',
                 'appliance-blender',
                 'appliance-extractor_hood',
                 'appliance-clothes_iron',
                 'appliance-vacuum_cleaner',
                 'appliance-sandwich_maker',
                 'appliance-kettle',
                 'appliance-air_conditioner',
                 'appliance-heater',
                 'appliance-pan',
                 'appliance-popcorn_maker',
                 'appliance-coffee_machine',
                 'appliance-stove',
                 'lighting-lighting',
                 'lighting-chandelier',
                 'lighting-lightbulb',
                 'musical_equipment-guitar',
                 'musical_equipment-sound_amplifier',
                 'musical_equipment-contrabass',
                 'musical_equipment-battery',
                 'musical_equipment-piano',
                 'tech-desktop',
                 'tech-laptop',
                 'tech-phone',
                 'tech-keyboard',
                 'clothes-woman_jacket',
                 'clothes-man_jacket',
                 'clothes-child_jacket',
                 'clothes-woman_clothes',
                 'clothes-man_clothes',
                 'clothes-child_clothes',
                 'shoes-woman_shoes',
                 'shoes-man_shoes',
                 'shoes-child_shoes',
                 'miscelaneaous-ironing_board',
                 'miscelaneaous-picture_frame',
                 'miscelaneaous-bicycle',
                 'miscelaneaous-plant',
                 'miscelaneaous-carpet',
                 'miscelaneaous-roller_skates',
                 'miscelaneaous-ski_skates',
                 'miscelaneaous-books',
                 'miscelaneaous-purse',
                 'miscelaneaous-suitcase',
                 'miscelaneaous-shopping_venture',
                 'miscelaneaous-board',
                 'miscelaneaous-frame',
                 'home-mattress', 
                 'home-carpet',
                 'kids-stroller',
                 'kids-baby_carriage']


In [221]:
# define a function to create item data
limit = '-30d'  # limit of 30 days of item in the app

def create_item_data(num_records): 
  
    # dictionary 
    item ={} 
    for i in range(0, num_records): 
        item[i] = {}
        item[i]['item_category-item_name'] = np.random.choice(category_item)
        item[i]['item_condition'] = np.random.choice(item_condition)
        item[i]['item_postcode'] = np.random.choice(postcodes_berlin_series)
        item[i]['item_status'] = np.random.choice(item_status)
        item[i]['user_lat-user_lng'] = polygon_random_points(poly,1)
        item[i]['userwhochangeditemstatus_id'] = np.random.choice(user_id_series)
        datetime_iteration1 = fake.date_between_dates(limit,'now')
        datetime_iteration2 = fake.date_between_dates(limit,'now')
        if datetime_iteration1 <= datetime_iteration2:
            item[i]['item_datetime_posted'] = datetime_iteration1
            item[i]['item_datetimechangeditemstatus'] = datetime_iteration2
        else:
            item[i]['item_datetime_posted'] = datetime_iteration2
            item[i]['item_datetimechangeditemstatus'] = datetime_iteration1  
        # The status-change date should not be earlier than the posting date
        

    return item

In [222]:
item_mock_df = pd.DataFrame(create_item_data(100)).transpose()
item_mock_df.head(5)

Unnamed: 0,item_category-item_name,item_condition,item_postcode,item_status,user_lat-user_lng,userwhochangeditemstatus_id,item_datetime_posted,item_datetimechangeditemstatus
0,furniture-cabinet,bad,10179,avaliable,[POINT (52.51248619826432 13.57587095032318)],24,2023-02-17,2023-02-18
1,shoes-man_shoes,good,10317,not_available,[POINT (52.49582265281618 13.16530272130326)],53,2023-03-05,2023-03-07
2,musical_equipment-contrabass,good,12047,not_available,[POINT (52.46024485098363 13.3163368358419)],509,2023-03-04,2023-03-07
3,lighting-lighting,medium,13583,avaliable,[POINT (52.5115786468483 13.18207418671305)],986,2023-02-24,2023-03-04
4,miscelaneaous-frame,good,14053,avaliable,[POINT (52.57691985974373 13.34614095648883)],485,2023-02-13,2023-03-01
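
The `if`/`else` in `create_item_data` only makes sure the posting date is not later than the status-change date. The same ordering can be sketched more compactly with `sorted`. The dates here are drawn with the standard library instead of Faker to keep the example self-contained; `random_date_within` and the fixed `today` are assumptions made for the example.

```python
import random
from datetime import date, timedelta

def random_date_within(days, today):
    # Hypothetical stand-in for fake.date_between_dates('-30d', 'now').
    return today - timedelta(days=random.randint(0, days))

today = date(2023, 3, 8)  # fixed reference day, chosen only for the example
random.seed(0)

# Draw two dates and let sorted() put the earlier one first.
posted, status_changed = sorted(random_date_within(30, today) for _ in range(2))

print(posted, status_changed)
```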


In [223]:
item_mock_df['item_id'] = item_mock_df.index + 1  # add item_id
item_id_series = item_mock_df['item_id']  # store in a variable to use later

### Model

In [224]:
# define a function to create model data


def create_model_data(num_records): 
  
    # dictionary 
    model ={} 
    for i in range(0, num_records): 
        model[i] = {} 
        model[i]['item_id'] = np.random.choice(item_id_series)
        model[i]['item_category-item_name'] = np.random.choice(category_item)
        model[i]['item_condition'] = np.random.choice(item_condition)
        model[i]['item_lat-item_lng'] = polygon_random_points(poly,1)
        model[i]['item_postcode'] = np.random.choice(postcodes_berlin_series)
        model[i]['item_status'] = np.random.choice(item_status)
        model[i]['userwhoposted_id'] = np.random.choice(user_id_series)
        model[i]['userwhopickedup_id'] = np.random.choice(user_id_series)
        model[i]['userwhochangeditemstatus_id'] = np.random.choice(user_id_series)
        model[i]['userwhochangeditemstatus_lat-userwhochangeditemstatus_lng'] = polygon_random_points(poly,1)
        model[i]['searched_item_name-searched_item_category-searched_item'] = np.random.choice(category_item)
        model[i]['searched_postcode'] = np.random.choice(postcodes_berlin_series)
        datetime_iteration1 = fake.date_between_dates(limit,'now')
        datetime_iteration2 = fake.date_between_dates(limit,'now')
        if datetime_iteration1 <= datetime_iteration2:
            model[i]['item_datetime_posted'] = datetime_iteration1
            model[i]['item_datetimechangeditemstatus'] = datetime_iteration2
        else:
            model[i]['item_datetime_posted'] = datetime_iteration2
            model[i]['item_datetimechangeditemstatus'] = datetime_iteration1  
       
    return model


In [225]:
model_mock_df = pd.DataFrame(create_model_data(1000)).transpose()
# model_mock_df.head(5)
model_mock_df.shape

(1000, 14)

In [226]:
model_mock_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column                                                     Non-Null Count  Dtype 
---  ------                                                     --------------  ----- 
 0   item_id                                                    1000 non-null   object
 1   item_category-item_name                                    1000 non-null   object
 2   item_condition                                             1000 non-null   object
 3   item_lat-item_lng                                          1000 non-null   object
 4   item_postcode                                              1000 non-null   object
 5   item_status                                                1000 non-null   object
 6   userwhoposted_id                                           1000 non-null   object
 7   userwhopickedup_id                                         1000 non-null   object
 8   userwhochangeditems
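
Every column shows up as `object` because the dict of dicts is transposed into the dataframe, which mixes types within each intermediate column. A minimal sketch of one alternative, building a list of row dicts instead, lets pandas infer a dtype per column. The two columns here are a reduced, illustrative version of the model data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
item_ids = np.arange(1, 101)
conditions = ["good", "medium", "bad"]

# One dict per row, collected in a list (no transpose needed).
rows = [
    {"item_id": rng.choice(item_ids), "item_condition": rng.choice(conditions)}
    for _ in range(5)
]
model_df = pd.DataFrame(rows)

print(model_df.dtypes)
```

With this layout `item_id` comes out as an integer column, which avoids a later `astype` step before saving or modelling.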

### Split columns of dependent variables

In [227]:
# Values look like 'category-name', so the first part is the category and the second the item name
model_mock_df[['item_category','item_name']] = model_mock_df['item_category-item_name'].apply(lambda x: pd.Series(str(x).split("-")))

model_mock_df[['searched_item_category','searched_item_name']] = model_mock_df['searched_item_name-searched_item_category-searched_item'].apply(lambda x: pd.Series(str(x).split("-")))
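
For reference, the same kind of split can be sketched with the vectorized `str.split(..., expand=True)`, which returns one column per part. Passing `n=1` limits it to a single split, so a name containing a further hyphen would stay in one piece. The small dataframe is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"item_category-item_name": ["furniture-sofa", "tech-laptop", "kids-baby_carriage"]}
)

# Split on the first hyphen only and spread the two parts over new columns.
parts = df["item_category-item_name"].str.split("-", n=1, expand=True)
df[["item_category", "item_name"]] = parts

print(df[["item_category", "item_name"]])
```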


# CONTINUE FROM HERE

In [228]:
# # # model_mock_df['item_lat-item_lng']
# model_mock_df['lon'] = model_mock_df.point_object.apply(lambda p: p.x)
# model_mock_df['lat'] = model_mock_df.point_object.apply(lambda p: p.y)


# # [8:-3]
# # 0 to 8
# # -3 to -19

In [229]:

# # 'item_lat-item_lng'
# model_mock_df[['item_lat','item_lng']] = model_mock_df['item_lat-item_lng'].apply(lambda x: pd.Series(str(x).split("-")))


# # 'userwhochangeditemstatus_lat-userwhochangeditemstatus_lng'
# model_mock_df[['userwhochangeditemstatus_lat','userwhochangeditemstatus_lng']] = model_mock_df['userwhochangeditemstatus_lat-userwhochangeditemstatus_lng'].apply(lambda x: pd.Series(str(x).split("-")))


# model_mock_df.head(5)

ValueError: Columns must be same length as key

In [231]:

model_mock_df.info(5)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 18 columns):
 #   Column                                                     Non-Null Count  Dtype 
---  ------                                                     --------------  ----- 
 0   item_id                                                    1000 non-null   object
 1   item_category-item_name                                    1000 non-null   object
 2   item_condition                                             1000 non-null   object
 3   item_lat-item_lng                                          1000 non-null   object
 4   item_postcode                                              1000 non-null   object
 5   item_status                                                1000 non-null   object
 6   userwhoposted_id                                           1000 non-null   object
 7   userwhopickedup_id                                         1000 non-null   object
 8   userwhochangeditems

In [232]:
# points_item = model_mock_df['item_lat-item_lng'] 

# list_item_lat_lng = model_mock_df['item_lat-item_lng'] 
    
for p in model_mock_df['item_lat-item_lng']:
    print(p.x,",",p.y)

AttributeError: 'list' object has no attribute 'x'
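
The error comes from `polygon_random_points(poly, 1)`, which returns a list holding one `Point`, so each cell is a list rather than a point. A minimal sketch for the lat/lng split in the TODO list, assuming the (lat, lng) order used for the polygon, takes the first element of each list and reads `x` and `y`. The two sample points are rounded values in the style of the generated data.

```python
import pandas as pd
from shapely.geometry import Point

# Each cell is a one-element list, as produced by polygon_random_points(poly, 1).
df = pd.DataFrame(
    {"item_lat-item_lng": [[Point(52.4927, 13.2633)], [Point(52.5578, 13.2720)]]}
)

# Take the single Point out of its list, then read x (latitude here) and y (longitude here).
points = df["item_lat-item_lng"].apply(lambda cell: cell[0])
df["item_lat"] = points.apply(lambda p: p.x)
df["item_lng"] = points.apply(lambda p: p.y)

print(df[["item_lat", "item_lng"]])
```

The earlier `ValueError` has the same root: the string form of such a cell contains no hyphen, so `split("-")` yields a single part for two target columns.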

In [None]:
# Create a final dataset
columns = []

### Creating csv files

In [84]:
# user_mock_df.to_csv('data/user_mock_data.csv', index=False)

In [85]:
# item_mock_df.to_csv('data/item_mock_data.csv', index=False)

In [86]:
# model_mock_df.to_csv('data/model_mock_data.csv', index=False)

## References:
- [Generate custom datasets using Python Faker](https://blogs.sap.com/2021/05/26/generate-custom-datasets-using-python-faker/)
- [folium_polygon_rectangle_layers](https://www.youtube.com/watch?v=9E9FTJrOJ1E&t=752s)
- [Faker](https://github.com/joke2k/faker/issues/1183)
- [Generating Mock Data with Python! (NumPy, Pandas, & Datetime Libraries)](https://www.youtube.com/watch?v=VJBY2eVtf7o)