<h1>Test data generation</h1>

This script shows how we generated the test data for you. Feel free to create your own data or extend the existing data.

<h2>How to generate custom test data</h2>

Prerequisites: install the modules listed in `requirements.txt`. Open a shell and execute:

```
pip install -r requirements.txt
```

<h2>Import and clean data</h2>

The list of addresses from the Ratisbona region in `data/ratisbona-addresses.csv` should be plenty for this challenge. If you really need more data, note that the file with all German addresses is too large to be shared on GitHub. Please download the full file [here](https://www.wiwald.com/ds/10-millionen-deutsche-adressen-mit-geokoordinaten-urls-und-bezeichnungen/id/adr-20181110) and place it here: `raw/addresses-201811.csv`.



In [None]:
import pandas as pd

# read the raw address file (semicolon-separated)
data = pd.read_csv('raw/addresses-201811.csv', delimiter=';')
# postal_code arrives with mixed types (we do not know why), so we coerce it to a
# nullable integer; values that cannot be parsed become <NA>
data['postal_code'] = pd.to_numeric(data['postal_code'], errors='coerce', downcast='integer').round().astype('Int64')
# there are two columns we are not interested in, so we drop them
data.drop(columns=["tag_name","tag_website"], inplace=True)
data.to_csv('raw/all-addresses.csv', index=False)
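
The postal code clean-up above can be tried in isolation. The following is a minimal sketch, assuming a small made-up column (the sample values are hypothetical and not taken from the real file) that mixes strings, integers, and unparseable entries:

```python
import pandas as pd

# Hypothetical sample values mimicking a mixed-type postal_code column
raw = pd.DataFrame({"postal_code": ["93047", 93049, "93051.0", "n/a", None]})

# errors='coerce' turns unparseable values into NaN instead of raising
numeric = pd.to_numeric(raw["postal_code"], errors="coerce")

# 'Int64' (capital I) is pandas' nullable integer dtype, so NaN becomes <NA>
raw["postal_code"] = numeric.round().astype("Int64")

# Count how many entries survived the conversion
valid_count = int(raw["postal_code"].notna().sum())
print(raw["postal_code"].tolist())
print("valid postal codes:", valid_count)
```

Checking the number of `<NA>` values after the conversion is a cheap way to see how many rows of the raw file had unusable postal codes.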

<h2>Generate sample data</h2>

<h3>Fake people with valid addresses</h3>

We retrieve 10000 valid addresses from the Ratisbona region and merge them with 10000 fake names. The result is a dataset of 10000 fake people with valid addresses and geolocations.

In [None]:
data = pd.read_csv('raw/all-addresses.csv')
# keep only postal codes in the Regensburg area
data.query('postal_code > 93000 & postal_code < 93300', inplace=True)

# draw 10000 sample entries without replacement;
# sample() returns a new DataFrame, so the result must be assigned
data = data.sample(n=10000, replace=False)

# generate new id column
data.reset_index(drop=True, inplace=True)
data["id"] = data.index+1
data['postal_code'] = pd.to_numeric(data['postal_code'], errors='coerce', downcast='integer').round().astype('Int64')

# write to new CSV file
data.to_csv('data/ratisbona-addresses.csv', index=False)

# Combine addresses and people to generate 10000 valid entries
new_data = pd.merge(pd.read_csv('data/people.csv'), data, on='id', how='inner')
new_data.to_csv('data/people-with-addresses.csv', index=False)
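
The sample-then-merge step can be illustrated on a toy scale. This is only a sketch under stated assumptions: the two small tables and their column values are hypothetical stand-ins for the address and people files, and `random_state=42` is an arbitrary seed added to make the draw repeatable.

```python
import pandas as pd

# Hypothetical stand-ins for the address and people files
addresses = pd.DataFrame({
    "street": ["A-Str.", "B-Str.", "C-Str.", "D-Str.", "E-Str."],
    "postal_code": [93047, 93049, 93051, 93053, 93055],
})
people = pd.DataFrame({"id": [1, 2, 3], "first_name": ["Ann", "Bob", "Cem"]})

# sample() returns a new DataFrame; random_state makes the draw repeatable
sampled = addresses.sample(n=3, replace=False, random_state=42)

# Rebuild a 1-based id so it lines up with the ids in the people table
sampled = sampled.reset_index(drop=True)
sampled["id"] = sampled.index + 1

# An inner join silently drops unmatched ids, so compare the row counts
merged = pd.merge(people, sampled, on="id", how="inner")
rows_lost = len(people) - len(merged)
print(merged)
print("rows lost in merge:", rows_lost)
```

Comparing the row counts before and after an inner join is a quick sanity check that every person actually received an address.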


<h3>Companies</h3>

The file `data/ratisbona-companies.csv` contains valid entries for around 2000 companies located in Ratisbona, including geographic data.

<h4>Combine companies with people</h4>

In [None]:
import random

company_data = pd.read_csv('data/ratisbona-companies.csv')
number_of_companies = len(company_data.index)

people_data = pd.read_csv('data/people-with-addresses.csv')
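# assign every person a random company id in 1..number_of_companies
# (assumes company ids are sequential and start at 1; range() excludes its upper bound)
people_data['company_id'] = random.choices(range(1, number_of_companies + 1), k=len(people_data.index))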

people_data.to_csv('data/people-with-companies.csv', index=False)
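
If you want the company assignment to be repeatable between runs, one option is a seeded random generator. The snippet below is a minimal sketch using only the standard library; the counts and the seed value are hypothetical, and it assumes company ids run from 1 to the number of companies.

```python
import random

number_of_companies = 5   # hypothetical; the notebook derives this from the CSV
number_of_people = 20     # hypothetical

# A dedicated generator with a fixed seed (an assumption) gives repeatable output
rng = random.Random(1234)

# range(1, n + 1) covers ids 1..n because range() excludes its upper bound
company_ids = rng.choices(range(1, number_of_companies + 1), k=number_of_people)

print(company_ids)
```

Because `choices` draws with replacement, several people can end up at the same company, which matches the many-people-per-company shape of the generated data.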


## Plot sample data

You can use plotly or any other tool to get a visual representation of the sample data.

### Mapbox token

To create a map with plotly you need a Mapbox access token. There is a free tier that should suffice for our tests.
Go to [mapbox.com](https://www.mapbox.com), create an account, and save your token in a file named `.mapbox_token`.
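
Before plotting, it can help to verify that the token file is in place. The helper below is a hypothetical convenience function (`load_mapbox_token` is not part of plotly or Mapbox), sketched with the standard library only:

```python
from pathlib import Path


def load_mapbox_token(path=".mapbox_token"):
    """Return the token stored at `path`, or None if the file is missing or empty."""
    token_file = Path(path)
    if not token_file.is_file():
        return None
    # strip() removes the trailing newline that most editors append
    token = token_file.read_text(encoding="utf-8").strip()
    return token or None


token = load_mapbox_token()
if token is None:
    print("No Mapbox token found; create .mapbox_token first.")
```

Returning `None` instead of raising lets the notebook print a readable hint rather than failing with a file error in the plotting cell.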

### Plot data on map

In [41]:
import plotly.express as px
import plotly.graph_objects as go

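# read the Mapbox token and strip the trailing newline; the with-block closes the file
with open(".mapbox_token") as token_file:
    mapbox_access_token = token_file.read().strip()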

people_data = pd.read_csv('data/people-with-addresses.csv')
people_data['full_name'] = people_data[['first_name','last_name']].apply(lambda x: ' '.join(x), axis=1)

company_data = pd.read_csv('data/ratisbona-companies.csv')

fig = go.Figure()

# People
fig.add_trace(go.Scattermapbox(
        lat=people_data.lat,
        lon=people_data.lon,
        mode='markers',
        marker=go.scattermapbox.Marker(
            size=10,
            color='rgb(255, 60, 60)',
            opacity=0.7
        ),
        text=people_data.full_name,
        name="People"
    ))

# Companies
fig.add_trace(go.Scattermapbox(
        lat=company_data.lat,
        lon=company_data.lon,
        mode='markers',
        marker=go.scattermapbox.Marker(
            size=10,
            color='rgb(60, 60, 255)',
            opacity=0.7
        ),
        text=company_data.name,
        name="Companies"
    ))

fig.update_layout(
    hovermode='closest',
    width=1024,
    height=768,
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=go.layout.mapbox.Center(
            lat=49.0134,
            lon=12.1016
        ),
        pitch=0,
        zoom=9
    )
)

fig.show()
