# Project: 01-2024 LaborDE Analysis
## Author: Nadia Ordonez
## Step 1: LD Project brief, data sources, and data cleaning

# Table of Contents
* [1. Introduction](#1.-Introduction)
    * [1.1 Research questions](#1.1-Research-questions)
    * [1.2 Project scripts](#1.2-Project-scripts)
    * [1.3 Data limitations and ethics](#1.3-Data-limitations-and-ethics)
* [2. Data sources](#2.-Data-sources)
    * [2.1 German companies](#2.1-German-companies)
    * [2.2 German population Bundesland](#2.2-German-population-Bundesland)
    * [2.3 German population regional](#2.3-German-population-regional)
        * [2.3.1 Habitants per postcode](#2.3.1-Habitants-per-postcode)
        * [2.3.2 Regions per postcode](#2.3.2-Regions-per-postcode)
* [3. Data wrangling and consistency checks](#3.-Data-wrangling-and-consistency-checks)
    * [3.1 German companies](#3.1-German-companies)
    * [3.2 German population Bundesland](#3.2-German-population-Bundesland)
    * [3.3 German population regional](#3.3-German-population-regional)
        * [3.3.1 Habitants per postcode](#3.3.1-Habitants-per-postcode)
        * [3.3.2 Regions per postcode](#3.3.2-Regions-per-postcode)
* [4. Merging German population regional](#4.-Merging-German-population-regional)
    * [4.1 Key variable](#4.1-Key-variable)
    * [4.2 Merge](#4.2-Merge) 
* [5. Exporting dataframes](#5.-Exporting-dataframes)

# 1. Introduction

LaborDE, a laboratory supplier based in Nordrhein-Westfalen (NRW), Germany, provides diagnostic laboratories with equipment and materials and runs a specialized department for cancer diagnostics products. In 2019, the German NGO [Open Knowledge Foundation Deutschland e.V.](https://okfn.de/) and the British NGO [Opencorporates](https://opencorporates.com/) released German Trade Register data via [OffeneRegister.de](https://offeneregister.de/), making a list of over 5 million companies available to the public. LaborDE aims to use this data to identify potential laboratory customers, both in its stronghold of NRW and in other Bundesländer where it sees new market opportunities. The project gives LaborDE a data-driven basis for decision-making and growth.

## 1.1 Research questions

To strengthen our market presence in NRW and expand our reach into other Bundesländer, stakeholders at LaborDE are committed to addressing key research data analysis questions. These inquiries will not only fortify our customer base but also strategically position us in the highly competitive landscape of laboratory diagnostics.

* Regional Focus in NRW:
    * What is the distribution of potential customer companies specializing in laboratory diagnostics across NRW, and which regions exhibit the highest concentration? Rationale: By understanding the geographical concentration of potential customer companies in NRW, we can tailor our outreach strategies to maximize impact in key cities, thereby enhancing our regional presence. 
    * What is the extent of the customer base served by laboratory diagnostic companies in NRW in terms of male and female population? Rationale: By quantifying the current customer base in NRW, we can refine our service offerings and better meet the needs of the existing and potential customer companies that serve this population.


* National Landscape in Germany:
    * What is the nationwide distribution of potential customer companies specializing in laboratory diagnostics, and which Bundesland has the highest concentration of potential customers? Rationale: Expanding our scope beyond NRW, this analysis aims to identify promising markets in other Bundesländer, providing valuable insights to inform our strategic expansion plans across Germany.
    * Within the identified companies, what is the prevalence of those offering diagnostic services specifically for cancer, and where are they situated in Germany? Rationale: Recognizing the demand for cancer-related diagnostic services is crucial for tailoring our offerings to meet specific healthcare needs, thereby positioning LaborDE as a specialized and sought-after service provider.


## 1.2 Project scripts

* This data analysis project was executed using Jupyter and is organized into 7 scripts outlined below:
    * Step 1: LD Project brief, data sources, and initial data cleaning
        * Comprehensive project description, research questions, and data sources
        * Data wrangling, cleaning and inconsistency checks.
        * Merging of German population regional dataframes.
        

    * Step 2: LD Merging dataframes, Bundesland and regional
        * Merging companies and Bundesland dataframes.
        * Merging companies and regional dataframes.
        
    
    * Step 3: LD Exploring relationships between companies and Bundesland
        * Correlation, scatterplots, pairplots and categorical plots.
        * Constructing the research hypothesis.
        
        
    * Step 4: LD Geographical visualization
        * Bundesland company mapping.
        * NRW company mapping.
        

    * Step 5: LD Supervised Machine Learning: Regression
        * Testing the research hypothesis with a regression model.
        * Model performance evaluation.
        

    * Step 6: Unsupervised Machine Learning: Clustering
        * Clustering analysis of Bundesländer. 
        
    
    * Step 7: German GDP time-series analysis
        * Subsetting, wrangling, and cleaning time series data.
        * Decomposition, Dickey-Fuller test, autocorrelation, stationarizing data.

## 1.3 Data limitations and ethics

The German companies dataset obtained through OpenCorporates has notable limitations in data collection. The terms of use for Bundesanzeiger and Unternehmensregister, two additional German company registers, restrict OpenCorporates from publishing data sourced from these registers. Consequently, information on companies that appear only in these registers is unavailable. Moreover, the dataset is only current up until January 2019, so companies established thereafter are missing. The company details are also basic, which limits the precision with which laboratory diagnostics companies can be identified in the full list. In terms of data ethics, the original German companies dataset contains personal data, which may fall under GDPR regulations. Careful attention to privacy and compliance with relevant data protection laws is therefore required when managing and using this dataset.

# 2. Data sources

### Importing libraries

In [1]:
# Libraries to unzip downloaded SQLite Datenbank for German companies df
import gzip
import shutil

# Import library to access SQLite files for German companies df
import sqlite3

# Importing libraries for data analysis for all dfs
import pandas as pd
import numpy as np
import os

In [2]:
# Project folder path into a string to easily retrieve data
path = r'C:\Users\Ich\Documents\01_2024_LaborDE_analysis'

## 2.1 German companies

The files, compiled by OpenCorporates primarily between June 2017 and January 2019, contain basic information on over 5 million German companies and their officers, sourced mainly from [Handelsregisterbekanntmachungen](https://www.handelsregister.de/rp_web/welcome.xhtml) and to a lesser extent, [Handelsregister](https://www.handelsregister.de/rp_web/welcome.xhtml) search results listings. OpenCorporates is generously sharing this dataset under an [open license](https://creativecommons.org/licenses/by/4.0/), aiming to demonstrate the advantages of releasing company information as open data. As a social enterprise committed to enhancing societal benefits, OpenCorporates actively advocates for open company data, collaborating with civil society, businesses, and governments to promote accessibility and transparency in critical information without financial barriers.

The dataset was downloaded as a [SQLite Datenbank](https://offeneregister.de/#download) and later opened with Python using Jupyter notebooks. Relevant variables such as company names and addresses were selected from the original table named "company" and saved as a pickle file named "all_companies_de".

### Extracting company dataframe

In [2]:
# Create a path to unzip
gzip_file_path = r'C:\Users\Ich\Documents\01_2024_LaborDE_analysis\02_Data\Original_data\openregister.db.gz'

In [3]:
# Create a path to extract file
extracted_file_path = r'C:\Users\Ich\Documents\01_2024_LaborDE_analysis\02_Data\Original_data\openregister.db'

In [4]:
# Unzip file
with gzip.open(gzip_file_path, 'rb') as f_in:
    with open(extracted_file_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
print("File extracted successfully.")

File extracted successfully.
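The unzip step can also be wrapped in a small reusable helper. This is a minimal sketch, not the notebook's original code: the function name `gunzip` and the temporary demo file are assumptions for illustration only.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip(src, dst):
    """Decompress the gzip archive `src` into the file `dst` and return `dst`."""
    src, dst = Path(src), Path(dst)
    # Stream the bytes so that a large database never has to fit in memory
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst


# Demo on a throwaway archive (placeholder content, not the real database)
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "demo.db.gz"
    with gzip.open(archive, "wb") as f:
        f.write(b"placeholder bytes")
    # Dropping the last suffix turns "demo.db.gz" into "demo.db"
    extracted = gunzip(archive, archive.with_suffix(""))
    extracted_name = extracted.name
    extracted_bytes = extracted.read_bytes()
```

Deriving the target path from the archive name avoids typing the same folder path twice.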


In [5]:
# Create a path to execute SQLite commands
sqlite_file_path = r'C:\Users\Ich\Documents\01_2024_LaborDE_analysis\02_Data\Original_data\openregister.db'

In [8]:
# Get shapes of all tables in the database

# Connect to the SQLite database
conn = sqlite3.connect(sqlite_file_path)
cursor = conn.cursor()

# Get the names of all tables in the database
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = cursor.fetchall()

# Iterate through tables and retrieve information
for table in tables:
    table_name = table[0]

    # Get the number of rows in the table
    cursor.execute(f"SELECT COUNT(*) FROM {table_name};")
    row_count = cursor.fetchone()[0]

    # Get the number of columns in the table
    cursor.execute(f"PRAGMA table_info({table_name});")
    columns = cursor.fetchall()
    column_count = len(columns)

    # Print information about the table
    print(f"Table: {table_name}, Rows: {row_count}, Columns: {column_count}")

# Close the connection
conn.close()

Table: name, Rows: 1409429, Columns: 3
Table: registrations, Rows: 375180, Columns: 21
Table: officer, Rows: 4803514, Columns: 15
Table: company, Rows: 5305727, Columns: 25
Table: company_fts, Rows: 5305727, Columns: 2
Table: company_fts_data, Rows: 38643, Columns: 2
Table: company_fts_idx, Rows: 23360, Columns: 3
Table: company_fts_docsize, Rows: 5305727, Columns: 2
Table: company_fts_config, Rows: 1, Columns: 2
Table: officer_fts, Rows: 4803514, Columns: 1
Table: officer_fts_data, Rows: 15757, Columns: 2
Table: officer_fts_idx, Rows: 10764, Columns: 3
Table: officer_fts_docsize, Rows: 4803514, Columns: 2
Table: officer_fts_config, Rows: 1, Columns: 2
Table: name_fts, Rows: 1409429, Columns: 1
Table: name_fts_data, Rows: 7855, Columns: 2
Table: name_fts_idx, Rows: 5770, Columns: 3
Table: name_fts_docsize, Rows: 1409429, Columns: 2
Table: name_fts_config, Rows: 1, Columns: 2


Based on this output, I can see that the "company" table contains 5.3M observations regarding all companies compiled in this dataset.  
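The same table overview can be collected into a dictionary instead of being printed. The sketch below is an illustration under assumptions: the helper name `table_shapes` and the toy in-memory table are hypothetical. It quotes table names, which is a cautious choice because some identifiers in this database contain spaces or punctuation.

```python
import sqlite3


def table_shapes(conn):
    """Return {table_name: (row_count, column_count)} for every table."""
    shapes = {}
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table';")]
    for name in names:
        # Double quotes make the identifier safe even if it contains spaces
        quoted = '"' + name.replace('"', '""') + '"'
        rows = conn.execute(f"SELECT COUNT(*) FROM {quoted};").fetchone()[0]
        cols = len(conn.execute(f"PRAGMA table_info({quoted});").fetchall())
        shapes[name] = (rows, cols)
    return shapes


# Demo on a toy in-memory database (not the real openregister.db)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (id INTEGER, name TEXT, federal_state TEXT)")
conn.executemany(
    "INSERT INTO company VALUES (?, ?, ?)",
    [(1, "A GmbH", "Hamburg"), (2, "B GmbH", "Bremen")],
)
shapes = table_shapes(conn)
conn.close()
print(shapes)
```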

In [9]:
# Check the variables contained in the company table

# Connect to the SQLite database
conn = sqlite3.connect(sqlite_file_path)
cursor = conn.cursor()

# Specify the table whose column names should be retrieved
table_name = 'company'

# Get the column names for the specified table
cursor.execute(f"PRAGMA table_info({table_name});")
columns = cursor.fetchall()

# Print the column names
print(f"Column names for table '{table_name}':")
for column in columns:
    print(column[1])  # Column name is at index 1

# Close the connection
conn.close()

Column names for table 'company':
id
company_number
current_status
jurisdiction_code
name
registered_address
retrieved_at
register_flag_AD
register_flag_CD
register_flag_DK
register_flag_HD
register_flag_SI
register_flag_UT
register_flag_VOE
federal_state
native_company_number
registered_office
registrar
register_art
register_nummer
former_registrar
register_flag_
register_flag_Note:
_registerNummerSuffix
register_flag_Status information


In [10]:
# See top results for all variables

# Connect to the SQLite database
conn = sqlite3.connect(sqlite_file_path)
cursor = conn.cursor()

# Example: Retrieve data from a table
cursor.execute("SELECT * FROM company LIMIT 5;")
rows = cursor.fetchall()

for row in rows:
    print(row)

# Close the connection
conn.close()

(1, 'K1101R_HRB150148', 'currently registered', 'de', 'olly UG (haftungsbeschränkt)', 'Waidmannstraße 1, 22769 Hamburg.', '2018-11-09T18:03:03Z', 1, 1, 1, 0, 1, 1, 0, 'Hamburg', 'Hamburg HRB 150148', 'Hamburg', 'Hamburg', 'HRB', '150148', None, None, None, None, None)
(2, 'R1101_HRB81092', 'currently registered', 'de', 'BLUECHILLED Verwaltungs GmbH', 'Oststr.', '2018-07-25T11:14:02Z', 1, 1, 1, 0, 1, 1, 1, 'North Rhine-Westphalia', 'Düsseldorf HRB 81092', 'Düsseldorf', 'Düsseldorf', 'HRB', '81092', None, None, None, None, None)
(3, 'H1101_H1101_HRB18423', 'currently registered', 'de', 'Mittelständische Beteiligungsgesellschaft Bremen mbH', 'Langenstraße 2-4, 28195 Bremen.', '2018-06-24T21:12:00Z', 1, 1, 1, 1, 1, 1, 1, 'Bremen', 'Bremen früher Bremen HRB 18423', 'Bremen', 'Bremen', 'HRB', '18423', 'Bremen', None, None, None, None)
(4, 'R1101_HRB45109', 'currently registered', 'de', 'Albert Barufe GmbH', 'Hans-Sachs-Straße 11, 40721 Hilden.', '2018-07-25T11:15:01Z', 1, 1, 1, 1, 1, 1, 1, '

Several of the listed variables do not play a major role in our project. Thus, I will select only the variables from the "company" table that are relevant for this project and save them as a pickle file.

In [6]:
# Create a path to save pickle file
output_pickle_path = r'C:\Users\Ich\Documents\01_2024_LaborDE_analysis\02_Data\Prepared_data\all_companies_de.pkl'

In [8]:
# Connect to the SQLite database
conn = sqlite3.connect(sqlite_file_path)

# Retrieve the selected variables from the company table
query = "SELECT company_number, current_status, name, registered_address, federal_state FROM company;"
df = pd.read_sql_query(query, conn)

# Close the connection
conn.close()

# Save the DataFrame as a pickle file with headers
df.to_pickle(output_pickle_path)

print(f"Output exported to {output_pickle_path}")

Output exported to C:\Users\Ich\Documents\01_2024_LaborDE_analysis\02_Data\Prepared_data\all_companies_de.pkl


### Importing extracted company dataframe

In [3]:
# Import "all_companies_de.pkl"
all_companies_de = pd.read_pickle(os.path.join(path, '02_Data', 'Prepared_data', 'all_companies_de.pkl')) 

In [4]:
# See results
all_companies_de.shape
# The pickle file contains all 5305727 expected observations as recorded on the original database and the 5 selected variables

(5305727, 5)

In [5]:
# See headers
all_companies_de.head()

Unnamed: 0,company_number,current_status,name,registered_address,federal_state
0,K1101R_HRB150148,currently registered,olly UG (haftungsbeschränkt),"Waidmannstraße 1, 22769 Hamburg.",Hamburg
1,R1101_HRB81092,currently registered,BLUECHILLED Verwaltungs GmbH,Oststr.,North Rhine-Westphalia
2,H1101_H1101_HRB18423,currently registered,Mittelständische Beteiligungsgesellschaft Brem...,"Langenstraße 2-4, 28195 Bremen.",Bremen
3,R1101_HRB45109,currently registered,Albert Barufe GmbH,"Hans-Sachs-Straße 11, 40721 Hilden.",North Rhine-Westphalia
4,R1101_HRB37996,currently registered,ITERGO Informationstechnologie GmbH,"ERGO-Platz 1, 40477 Düsseldorf.",North Rhine-Westphalia


In [6]:
# See last rows
all_companies_de.tail()

Unnamed: 0,company_number,current_status,name,registered_address,federal_state
5305722,Y1301_VR300853,currently registered,Recht auf Heimat e.V.,,Thuringia
5305723,Y1301_VR300854,currently registered,"""SEI LEBENSWERT - Verein für Gesundheitssport ...",,Thuringia
5305724,Y1302_VR320869,currently registered,Kirmesverein Veilsdorf e. V.,,Thuringia
5305725,Y1304_VR351602,currently registered,"""Reha- und Gesundheitssport Brotterode e. V.""",,Thuringia
5305726,Y1308_VR330869,currently registered,Waldfrieden Outdoor Crew e.V.,,Thuringia


In [7]:
# See variables
all_companies_de.dtypes

company_number        object
current_status        object
name                  object
registered_address    object
federal_state         object
dtype: object

## 2.2 German population Bundesland

This dataset was downloaded from the [Kaggle](https://www.kaggle.com/datasets/ralfschukey/districts-and-towns-in-germany2019) community. The original data owner is the [Statistisches Bundesamt](https://www.destatis.de/DE/Home/_inhalt.html), a federal authority of Germany responsible for collecting, processing, presenting, and analyzing statistical information on the economy, society, and environment. The dataset contains the population (total and by gender), area, and population density for all German towns and districts as of 31.12.2018.

In [3]:
# Read the CSV file with UTF-8 encoding
# The encoding='utf-8' parameter ensures that the file is read in UTF-8 format, which supports German characters.
population_de = pd.read_csv(os.path.join(path, '02_Data', 'Original_data', 'Districts and Towns in Germany 2019.csv'), encoding='utf-8')

In [4]:
# See results
population_de.shape

(470, 9)

In [5]:
# See headers
population_de.head()

Unnamed: 0,Schlüsselnummer,Regionale Bezeichnung,Kreisfreie Stadt – Landkreis,NUTS3,Fläche in qkm,Bevölkerung,Bev.M,Bev.W,Bev.qkm
0,0,Staat,Deutschland,,357574.84,83019213.0,40966691.0,42052522.0,232.0
1,1,Bundesland,Schleswig-Holstein,,15804.3,2896712.0,1419457.0,1477255.0,183.0
2,1001,Kreisfreie Stadt,Flensburg,DEF01,56.73,89504.0,44599.0,44905.0,1578.0
3,1002,Landeshauptstadt,Kiel,DEF02,118.65,247548.0,120566.0,126982.0,2086.0
4,1003,Hansestadt,Lübeck,DEF03,214.19,217198.0,104371.0,112827.0,1014.0


In [6]:
# See last rows
population_de.tail()

Unnamed: 0,Schlüsselnummer,Regionale Bezeichnung,Kreisfreie Stadt – Landkreis,NUTS3,Fläche in qkm,Bevölkerung,Bev.M,Bev.W,Bev.qkm
465,16073,Landkreis,Saalfeld-Rudolstadt,DEG0I,1036.03,106356.0,52388.0,53968.0,103.0
466,16074,Landkreis,Saale-Holzland-Kreis,DEG0J,815.24,83051.0,41360.0,41691.0,102.0
467,16075,Landkreis,Saale-Orla-Kreis,DEG0K,1151.3,80868.0,40119.0,40749.0,70.0
468,16076,Landkreis,Greiz,DEG0L,845.98,98159.0,48326.0,49833.0,116.0
469,16077,Landkreis,Altenburger Land,DEG0M,569.4,90118.0,44138.0,45980.0,158.0


In [7]:
# See variables
population_de.dtypes

Schlüsselnummer                   int64
Regionale Bezeichnung            object
Kreisfreie Stadt – Landkreis     object
NUTS3                            object
Fläche in qkm                   float64
Bevölkerung                     float64
Bev.M                           float64
Bev.W                           float64
Bev.qkm                         float64
dtype: object

## 2.3 German population regional

I downloaded two files from [Postleitzahlen Deutschland](https://www.suche-postleitzahl.org/), containing postcode information in Germany. I later unified them by merging them using the postcode data as the key variable. These files were last updated on July 15, 2023, and are freely available under the [Open Database License](https://www.openstreetmap.org/copyright). Their source of raw data is linked to [OpenStreetMap](https://osmfoundation.org/) contributors. Population figures as the basis for calculations were originally extracted from the [Statistical Offices of the Federal Government and the States](https://www.statistikportal.de/de).

### 2.3.1 Habitants per postcode

In [3]:
# Read the CSV file with UTF-8 encoding
# The encoding='utf-8' parameter ensures that the file is read in UTF-8 format, which supports German characters.
plz_de = pd.read_csv(os.path.join(path,'02_Data', 'Original_data', 'plz_einwohner.csv'), dtype={'plz': object},  encoding='utf-8')

In [34]:
# See results
plz_de.shape
# The German 5-digit postal code system consists of 8699 areas,
# so the 8170 postcodes in this file cover most of it

(8170, 6)

In [35]:
# See headers
plz_de.head()

Unnamed: 0,plz,note,einwohner,qkm,lat,lon
0,1067,01067 Dresden,11957,6.866839,51.06019,13.71117
1,1069,01069 Dresden,25483,5.339213,51.03964,13.7303
2,1097,01097 Dresden,14821,3.298022,51.06945,13.73781
3,1099,01099 Dresden,28018,58.505818,51.09272,13.82842
4,1108,01108 Dresden,5876,16.447222,51.1518,13.79227


In [36]:
# See last rows
plz_de.tail()

Unnamed: 0,plz,note,einwohner,qkm,lat,lon
8165,99988,99988 Südeichsfeld,4866,40.320629,51.17553,10.26564
8166,99991,99991 Unstrut-Hainich,5269,96.922943,51.12179,10.49575
8167,99994,"99994 Marolterode, Nottertal-Heilinger Höhen",6384,81.700264,51.23117,10.66455
8168,99996,99996 Unstruttal,6751,100.496013,51.28005,10.43125
8169,99998,"99998 Körner, Weinbergen",4942,74.291525,51.20936,10.56753


In [4]:
# See variables
plz_de.dtypes

plz           object
note          object
einwohner      int64
qkm          float64
lat          float64
lon          float64
dtype: object
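Reading `plz` as text matters because German postcodes can start with a zero. The following sketch illustrates the effect with two rows taken from the preview above; the in-memory CSV text is a stand-in for the real file.

```python
import io

import pandas as pd

# Two rows copied from the plz_einwohner preview, reduced to two columns
csv_text = "plz,einwohner\n01067,11957\n01069,25483\n"

# Default parsing treats the postcode as an integer and drops the leading zero
as_number = pd.read_csv(io.StringIO(csv_text))

# Declaring the column as text keeps all five characters
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"plz": str})

print(as_number["plz"].tolist())
print(as_text["plz"].tolist())
```

Keeping the postcode as text on both sides also means a later merge on `plz` compares like with like.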

### 2.3.2 Regions per postcode

In [5]:
# Read the CSV file with UTF-8 encoding
# The encoding='utf-8' parameter ensures that the file is read in UTF-8 format, which supports German characters.
ort_de = pd.read_csv(os.path.join(path,'02_Data', 'Original_data', 'zuordnung_plz_ort.csv'), dtype={'plz': object},  encoding='utf-8')

In [53]:
# See results
ort_de.shape
# There are more rows here than in the plz_de df
# This is expected, as some German postcodes are associated with several places;
# in ort_de each of those places is listed separately under the "ort" variable
# However, the same postcode consistently belongs to a single "landkreis" (region level) and a single bundesland

(12854, 6)
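The claim that a postcode can map to several places but only one Landkreis and Bundesland can be checked with a group-wise count of distinct values. This is a sketch on toy rows, not a check that was run on the real file: the place names come from the postcode preview above, and the "Landkreis X" label is a placeholder assumption.

```python
import pandas as pd

# Toy rows shaped like ort_de; landkreis values are placeholders
ort = pd.DataFrame({
    "ort": ["Körner", "Weinbergen", "Aachen"],
    "plz": ["99998", "99998", "52062"],
    "landkreis": ["Landkreis X", "Landkreis X", "Städteregion Aachen"],
    "bundesland": ["Thüringen", "Thüringen", "Nordrhein-Westfalen"],
})

# Number of distinct places, districts and states per postcode
per_plz = ort.groupby("plz")[["ort", "landkreis", "bundesland"]].nunique()

# Postcodes that span more than one district or state would appear here
ambiguous = per_plz[(per_plz["landkreis"] > 1) | (per_plz["bundesland"] > 1)]

print(per_plz)
print(ambiguous)
```

An empty `ambiguous` frame on the real data would support using `plz` as the merge key at region level.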

In [54]:
# See headers
ort_de.head()

Unnamed: 0,osm_id,ags,ort,plz,landkreis,bundesland
0,1104550,8335001,Aach,78267,Landkreis Konstanz,Baden-Württemberg
1,1255910,7235001,Aach,54298,Landkreis Trier-Saarburg,Rheinland-Pfalz
2,62564,5334002,Aachen,52062,Städteregion Aachen,Nordrhein-Westfalen
3,62564,5334002,Aachen,52064,Städteregion Aachen,Nordrhein-Westfalen
4,62564,5334002,Aachen,52066,Städteregion Aachen,Nordrhein-Westfalen


In [55]:
# See last rows
ort_de.tail()

Unnamed: 0,osm_id,ags,ort,plz,landkreis,bundesland
12849,2778122,8415085,Zwiefalten,88529,Landkreis Reutlingen,Baden-Württemberg
12850,959161,9276148,Zwiesel,94227,Landkreis Regen,Bayern
12851,403750,8225113,Zwingenberg,69439,Neckar-Odenwald-Kreis,Baden-Württemberg
12852,537054,6431022,Zwingenberg,64673,Kreis Bergstraße,Hessen
12853,536015,14521710,Zwönitz,8297,Erzgebirgskreis,Sachsen


In [6]:
# See variables
ort_de.dtypes

osm_id         int64
ags            int64
ort           object
plz           object
landkreis     object
bundesland    object
dtype: object

# 3. Data wrangling and consistency checks

## 3.1 German companies

In [8]:
# See headers
all_companies_de.head()

Unnamed: 0,company_number,current_status,name,registered_address,federal_state
0,K1101R_HRB150148,currently registered,olly UG (haftungsbeschränkt),"Waidmannstraße 1, 22769 Hamburg.",Hamburg
1,R1101_HRB81092,currently registered,BLUECHILLED Verwaltungs GmbH,Oststr.,North Rhine-Westphalia
2,H1101_H1101_HRB18423,currently registered,Mittelständische Beteiligungsgesellschaft Brem...,"Langenstraße 2-4, 28195 Bremen.",Bremen
3,R1101_HRB45109,currently registered,Albert Barufe GmbH,"Hans-Sachs-Straße 11, 40721 Hilden.",North Rhine-Westphalia
4,R1101_HRB37996,currently registered,ITERGO Informationstechnologie GmbH,"ERGO-Platz 1, 40477 Düsseldorf.",North Rhine-Westphalia


### Renaming columns

In [9]:
# Rename variable using df.rename(columns = {'old_name' : 'new_name'}, inplace = True)
all_companies_de.rename(columns = {'federal_state' : 'bundesland_en'}, inplace = True)

In [10]:
# Rename variable using df.rename(columns = {'old_name' : 'new_name'}, inplace = True)
all_companies_de.rename(columns = {'name' : 'company_name'}, inplace = True)

In [11]:
# See results
all_companies_de.head()

Unnamed: 0,company_number,current_status,company_name,registered_address,bundesland_en
0,K1101R_HRB150148,currently registered,olly UG (haftungsbeschränkt),"Waidmannstraße 1, 22769 Hamburg.",Hamburg
1,R1101_HRB81092,currently registered,BLUECHILLED Verwaltungs GmbH,Oststr.,North Rhine-Westphalia
2,H1101_H1101_HRB18423,currently registered,Mittelständische Beteiligungsgesellschaft Brem...,"Langenstraße 2-4, 28195 Bremen.",Bremen
3,R1101_HRB45109,currently registered,Albert Barufe GmbH,"Hans-Sachs-Straße 11, 40721 Hilden.",North Rhine-Westphalia
4,R1101_HRB37996,currently registered,ITERGO Informationstechnologie GmbH,"ERGO-Platz 1, 40477 Düsseldorf.",North Rhine-Westphalia


### Subsetting companies

The value "currently registered" in the column "current_status" marks companies that were active as of January 2019, the cut-off date of the dataset. These are the companies of interest, and I will select them for the subsequent analysis.

In [12]:
# Subsetting for currently registered companies
# Checking values within the "current_status" variable
all_companies_de['current_status'].value_counts(dropna = False)
# Companies with the status 'removed' are no longer active
# About 2.39M companies are active (status 'currently registered')

current_status
removed                 2912209
currently registered    2393517
None                          1
Name: count, dtype: int64

In [13]:
# Creating a subset to select only companies currently registered
companies_reg = all_companies_de[all_companies_de['current_status'] == 'currently registered']

In [14]:
# See results
companies_reg.shape
# The output is as expected: about 2.39M companies are currently registered (as of January 2019, the most recent data)

(2393517, 5)

### Deriving labor and cancer columns

Our client LaborDE is interested in companies that provide laboratory diagnostic services and cancer-related diagnostic services. I select these companies by searching the "company_name" variable for specific keywords, matched as case-insensitive substrings:

* Create a variable named "labor" and assign 1 if the "company_name" contains any of the keywords below, else assign 0:
   * 'labor': To identify all companies involved in laboratory activities.
   * 'diagnosti', 'analytik', 'medizin': To pinpoint companies engaged in laboratory diagnostics services.
* Create a variable named "cancer" and assign 1 if the "company_name" contains any of the keywords below, else assign 0:
   * 'kreb', 'cancer', 'onkolog': To identify companies related to cancer research, diagnostics, or services.

This refined approach allows for a more targeted selection of companies that meet our criteria for laboratory diagnostics services and ensures that they are currently registered based on the dataset's release date.
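Because the keywords are matched as substrings, some names can be flagged by accident. The sketch below uses the same two patterns as the code cells in this section on a handful of invented company names (all names are hypothetical) to show both intended hits and likely false positives.

```python
import re

# Same keyword patterns as in the notebook cells, compiled case-insensitively
LABOR_PATTERN = re.compile(r"labor|diagnosti|analytik|medizin", re.IGNORECASE)
CANCER_PATTERN = re.compile(r"kreb|cancer|onkolog", re.IGNORECASE)

# Invented names for illustration only
names = [
    "Labor Muster GmbH",           # intended laboratory hit
    "Muster Diagnostik AG",        # intended diagnostics hit
    "Kollaboration Muster UG",     # contains "labor" by coincidence
    "Krebs Bau GmbH",              # "Krebs" is also a common surname
    "Onkologie Muster MVZ GmbH",   # intended cancer hit
]

# Map each name to its (labor, cancer) flags
flags = {
    name: (
        int(bool(LABOR_PATTERN.search(name))),
        int(bool(CANCER_PATTERN.search(name))),
    )
    for name in names
}

for name, (labor, cancer) in flags.items():
    print(f"{name}: labor={labor}, cancer={cancer}")
```

The flagged counts should therefore be read as an upper bound on potential customers rather than an exact number.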

In [15]:
# Adding the "labor" variable
companies_reg.loc[:, 'labor'] = companies_reg['company_name'].str.contains(r'labor|diagnosti|analytik|medizin', case=False, regex=True).astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  companies_reg.loc[:, 'labor'] = companies_reg['company_name'].str.contains(r'labor|diagnosti|analytik|medizin', case=False, regex=True).astype(int)


In [16]:
# Adding the "cancer" variable
companies_reg.loc[:, 'cancer'] = companies_reg['company_name'].str.contains(r'kreb|cancer|onkolog', case=False, regex=True).astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  companies_reg.loc[:, 'cancer'] = companies_reg['company_name'].str.contains(r'kreb|cancer|onkolog', case=False, regex=True).astype(int)
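The warning above appears because the subset was created by filtering another DataFrame, so pandas cannot tell whether the assignment targets a view or a copy. One common way to avoid it is to take an explicit copy when subsetting. The following is a sketch on invented rows, not a rerun of the notebook; `na=False` is an added assumption that treats missing names as non-matches.

```python
import pandas as pd

# Invented rows shaped like all_companies_de
companies = pd.DataFrame({
    "company_name": [
        "Labor Muster GmbH",
        "Muster Bau GmbH",
        "Onkologie Muster MVZ",
        None,
    ],
    "current_status": [
        "currently registered",
        "removed",
        "currently registered",
        "currently registered",
    ],
})

# .copy() makes the subset an independent DataFrame
registered = companies[companies["current_status"] == "currently registered"].copy()

# Keyword flags; missing names count as "no match"
registered["labor"] = registered["company_name"].str.contains(
    r"labor|diagnosti|analytik|medizin", case=False, regex=True, na=False
).astype(int)
registered["cancer"] = registered["company_name"].str.contains(
    r"kreb|cancer|onkolog", case=False, regex=True, na=False
).astype(int)

print(registered)
```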


In [17]:
# See results
companies_reg.head()

Unnamed: 0,company_number,current_status,company_name,registered_address,bundesland_en,labor,cancer
0,K1101R_HRB150148,currently registered,olly UG (haftungsbeschränkt),"Waidmannstraße 1, 22769 Hamburg.",Hamburg,0,0
1,R1101_HRB81092,currently registered,BLUECHILLED Verwaltungs GmbH,Oststr.,North Rhine-Westphalia,0,0
2,H1101_H1101_HRB18423,currently registered,Mittelständische Beteiligungsgesellschaft Brem...,"Langenstraße 2-4, 28195 Bremen.",Bremen,0,0
3,R1101_HRB45109,currently registered,Albert Barufe GmbH,"Hans-Sachs-Straße 11, 40721 Hilden.",North Rhine-Westphalia,0,0
4,R1101_HRB37996,currently registered,ITERGO Informationstechnologie GmbH,"ERGO-Platz 1, 40477 Düsseldorf.",North Rhine-Westphalia,0,0


In [18]:
# Total number of laboratories that can be potential customers for our LaborDE client = 10015
companies_reg['labor'].value_counts(dropna = False)

labor
0    2383502
1      10015
Name: count, dtype: int64

In [19]:
# Total number of cancer related companies that can be potential customers for our LaborDE client = 1212
companies_reg['cancer'].value_counts(dropna = False)

cancer
0    2392305
1       1212
Name: count, dtype: int64

### Dropping columns

In [20]:
# Dropping the "current_status" column, which is no longer needed
# Columns are dropped with df.drop(columns = ['variable'])
companies_reg = companies_reg.drop(columns = ['current_status'])

In [21]:
# See results
companies_reg.head()
# The "current_status" column is gone; the remaining variables are unchanged

Unnamed: 0,company_number,company_name,registered_address,bundesland_en,labor,cancer
0,K1101R_HRB150148,olly UG (haftungsbeschränkt),"Waidmannstraße 1, 22769 Hamburg.",Hamburg,0,0
1,R1101_HRB81092,BLUECHILLED Verwaltungs GmbH,Oststr.,North Rhine-Westphalia,0,0
2,H1101_H1101_HRB18423,Mittelständische Beteiligungsgesellschaft Brem...,"Langenstraße 2-4, 28195 Bremen.",Bremen,0,0
3,R1101_HRB45109,Albert Barufe GmbH,"Hans-Sachs-Straße 11, 40721 Hilden.",North Rhine-Westphalia,0,0
4,R1101_HRB37996,ITERGO Informationstechnologie GmbH,"ERGO-Platz 1, 40477 Düsseldorf.",North Rhine-Westphalia,0,0


### Deriving postcode data

In [22]:
# Define a regular expression that captures the first run of five digits in the 'registered_address' variable
regex_pattern = r'(\d{5})'
# Use str.extract to apply the regex pattern and create a new column for the postcode
companies_reg['plz_original'] = companies_reg['registered_address'].str.extract(regex_pattern, expand=False)

In [23]:
# See results
companies_reg.head()

Unnamed: 0,company_number,company_name,registered_address,bundesland_en,labor,cancer,plz_original
0,K1101R_HRB150148,olly UG (haftungsbeschränkt),"Waidmannstraße 1, 22769 Hamburg.",Hamburg,0,0,22769.0
1,R1101_HRB81092,BLUECHILLED Verwaltungs GmbH,Oststr.,North Rhine-Westphalia,0,0,
2,H1101_H1101_HRB18423,Mittelständische Beteiligungsgesellschaft Brem...,"Langenstraße 2-4, 28195 Bremen.",Bremen,0,0,28195.0
3,R1101_HRB45109,Albert Barufe GmbH,"Hans-Sachs-Straße 11, 40721 Hilden.",North Rhine-Westphalia,0,0,40721.0
4,R1101_HRB37996,ITERGO Informationstechnologie GmbH,"ERGO-Platz 1, 40477 Düsseldorf.",North Rhine-Westphalia,0,0,40477.0
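The pattern `(\d{5})` takes the first five consecutive digits it finds, even when they are part of a longer number. A stricter variant that requires exactly five digits is sketched below as an alternative, not as the method used in this notebook. The third address is hypothetical, and the lookaround pattern is my assumption of a reasonable tightening.

```python
import pandas as pd

addresses = pd.Series([
    "Waidmannstraße 1, 22769 Hamburg.",
    "Oststr.",
    "Postfach 1234567, 40477 Düsseldorf.",  # hypothetical address
])

# Pattern from the notebook: first run of five digits
naive = addresses.str.extract(r"(\d{5})", expand=False)

# Stricter sketch: five digits not preceded or followed by another digit
strict = addresses.str.extract(r"(?<!\d)(\d{5})(?!\d)", expand=False)

print(pd.DataFrame({"address": addresses, "naive": naive, "strict": strict}))
```

Switching patterns on the real data would change the number of extracted postcodes, so the counts reported in this notebook apply to the original pattern only.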


In [24]:
# See results
companies_reg.dtypes

company_number        object
company_name          object
registered_address    object
bundesland_en         object
labor                  int32
cancer                 int32
plz_original          object
dtype: object

In [25]:
# Counting NaN within plz_original
nan_count = companies_reg['plz_original'].isna().sum()

print(f'The number of NaN values in the "plz" column is: {nan_count}')
# 58.2% of active companies do not contain postcode information

The number of NaN values in the "plz" column is: 1393866
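The share of missing postcodes follows directly from the two counts reported above. As a small sketch, the same share can be read off a column with `isna().mean()`; the toy Series is invented for illustration.

```python
import pandas as pd

# Toy column with three missing postcodes out of five
plz = pd.Series(["22769", None, "28195", None, None])

# The mean of a boolean mask is the share of True values
share_missing = plz.isna().mean()

# Counts reported in this notebook: missing postcodes / registered companies
reported_share = 1393866 / 2393517

print(f"toy share: {share_missing:.1%}, reported share: {reported_share:.1%}")
```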


### Imputing postcodes

In [26]:
# Imputing NaN for the 'plz_original' variable, since there are several empty addresses
# Duplicate the 'plz_original' column and name it 'plz_imputed'
companies_reg['plz_imputed'] = companies_reg['plz_original'].copy()

In [27]:
# See results
companies_reg.head()

Unnamed: 0,company_number,company_name,registered_address,bundesland_en,labor,cancer,plz_original,plz_imputed
0,K1101R_HRB150148,olly UG (haftungsbeschränkt),"Waidmannstraße 1, 22769 Hamburg.",Hamburg,0,0,22769.0,22769.0
1,R1101_HRB81092,BLUECHILLED Verwaltungs GmbH,Oststr.,North Rhine-Westphalia,0,0,,
2,H1101_H1101_HRB18423,Mittelständische Beteiligungsgesellschaft Brem...,"Langenstraße 2-4, 28195 Bremen.",Bremen,0,0,28195.0,28195.0
3,R1101_HRB45109,Albert Barufe GmbH,"Hans-Sachs-Straße 11, 40721 Hilden.",North Rhine-Westphalia,0,0,40721.0,40721.0
4,R1101_HRB37996,ITERGO Informationstechnologie GmbH,"ERGO-Platz 1, 40477 Düsseldorf.",North Rhine-Westphalia,0,0,40477.0,40477.0


In [28]:
# Replacing NaN in the 'plz_imputed' variable by using the mode 'plz_original' for each bundesland listed in 'bundesland_en'
companies_reg['plz_imputed'] = companies_reg.groupby('bundesland_en')['plz_imputed'].transform(lambda x: x.fillna(x.mode().iloc[0]))
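
The group-wise imputation can be illustrated on a toy example. This is a sketch under the assumption that a Bundesland might have no postcode at all, in which case `x.mode().iloc[0]` would raise an `IndexError`. The helper `fill_with_group_mode` and the toy rows are hypothetical.

```python
import pandas as pd

def fill_with_group_mode(values: pd.Series) -> pd.Series:
    """Fill missing entries with the most frequent value of the group, if one exists."""
    modes = values.mode()
    if modes.empty:
        # The group contains only missing values, so nothing can be imputed
        return values
    return values.fillna(modes.iloc[0])

# Hypothetical toy data: Saarland has no postcode at all
toy = pd.DataFrame({
    'bundesland_en': ['Hamburg', 'Hamburg', 'Hamburg', 'Bremen', 'Bremen', 'Saarland'],
    'plz_original': ['22769', '22769', None, '28195', None, None],
})

toy['plz_imputed'] = (
    toy.groupby('bundesland_en')['plz_original'].transform(fill_with_group_mode)
)
print(toy)
```

Keep in mind that this kind of imputation assigns every company without an address to a single postcode per Bundesland, so counts per postcode based on `plz_imputed` are inflated for that postcode.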

In [29]:
# See results
subset = companies_reg[['bundesland_en','plz_original', 'plz_imputed']]
subset
# the most common postcode was assigned

Unnamed: 0,bundesland_en,plz_original,plz_imputed
0,Hamburg,22769,22769
1,North Rhine-Westphalia,,40549
2,Bremen,28195,28195
3,North Rhine-Westphalia,40721,40721
4,North Rhine-Westphalia,40477,40477
...,...,...,...
5305722,Thuringia,,99084
5305723,Thuringia,,99084
5305724,Thuringia,,99084
5305725,Thuringia,,99084


In [31]:
# Checking whether the extracted postcodes fall within a valid range
# The current postal codes (Postleitzahlen) in Germany (DE) range from 01067 to 99998
# Check for values less than '01067' or greater than '99998'
invalid_values = companies_reg[~companies_reg['plz_imputed'].between('01067', '99998')]
# Display the invalid values
print(invalid_values)
# In two instances the extracted value was not the actual postcode
# To avoid further incompatibilities with the German postcodes, and because these two companies are of no interest to our client,
# I will delete these two companies using their indexes

          company_number                            company_name  \
39959    K1101R_HRB72526  TDH - GmbH Technischer Dämmstoffhandel   
4317088  F1103R_HRB59581  Masterplan Informationsmanagement GmbH   

                                        registered_address bundesland_en  \
39959                       Sternstraße 14, 00139 Dresden.       Hamburg   
4317088  Ehrenbergstraße 16 a, Scanbox #00472, 10245 Be...        Berlin   

         labor  cancer plz_original plz_imputed  
39959        0       0        00139       00139  
4317088      0       0        00472       00472  
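
Note that `between` compares strings here. That works because every postcode is a zero-padded 5-character string, so lexicographic order matches numeric order. A minimal sketch using the two invalid values found above, the two range limits, and one imputed value from this notebook:

```python
import pandas as pd

# Values taken from this notebook: two invalid extractions, both limits, one imputed postcode
plz = pd.Series(['00139', '01067', '40549', '99998', '00472'])

# String comparison; both limits are inclusive by default
in_range = plz.between('01067', '99998')
print(pd.DataFrame({'plz': plz, 'in_range': in_range}))
```

The check only flags values outside the range. A 5-digit number that lies inside the range but is not the company's real postcode would pass unnoticed.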


In [32]:
# Drop rows by indices
companies_reg = companies_reg.drop([39959, 4317088])

# Reset the index after dropping rows
companies_reg = companies_reg.reset_index(drop=True)

In [33]:
# See results
companies_reg.shape
# As expected, the two rows were deleted; previously there were 2393517 rows

(2393515, 8)

In [34]:
# See results
companies_reg.dtypes

company_number        object
company_name          object
registered_address    object
bundesland_en         object
labor                  int32
cancer                 int32
plz_original          object
plz_imputed           object
dtype: object

In [35]:
# Count the number of digits in each entry of 'plz_imputed'
companies_reg['plz_digit_count'] = companies_reg['plz_imputed'].apply(lambda x: len(x) if isinstance(x, str) else None)

# Count the occurrences of each digit count
digit_count_distribution = companies_reg['plz_digit_count'].value_counts().sort_index()

# Display the results
print(digit_count_distribution) 
# As expected, every plz_imputed entry is a 5-digit postcode, which is the standard format in Germany

plz_digit_count
5    2393515
Name: count, dtype: int64
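
Counting characters confirms the length but not that every character is a digit. A sketch of a stricter format check with `str.fullmatch`, which avoids the temporary helper column; the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy values: two valid postcodes, one too short, one with a letter, one missing
plz = pd.Series(['22769', '40549', '1067', '4054a', None])

# True only for entries that consist of exactly five digits; missing values count as invalid
is_five_digits = plz.str.fullmatch(r'\d{5}', na=False)
print(is_five_digits.value_counts())
```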


In [36]:
# Dropping the column created to check the length of the variable, as it is no longer needed
# Columns are dropped with df.drop(columns = ['variable'])
companies_reg = companies_reg.drop(columns = ['plz_digit_count'])

In [37]:
# See results
companies_reg.head()

Unnamed: 0,company_number,company_name,registered_address,bundesland_en,labor,cancer,plz_original,plz_imputed
0,K1101R_HRB150148,olly UG (haftungsbeschränkt),"Waidmannstraße 1, 22769 Hamburg.",Hamburg,0,0,22769.0,22769
1,R1101_HRB81092,BLUECHILLED Verwaltungs GmbH,Oststr.,North Rhine-Westphalia,0,0,,40549
2,H1101_H1101_HRB18423,Mittelständische Beteiligungsgesellschaft Brem...,"Langenstraße 2-4, 28195 Bremen.",Bremen,0,0,28195.0,28195
3,R1101_HRB45109,Albert Barufe GmbH,"Hans-Sachs-Straße 11, 40721 Hilden.",North Rhine-Westphalia,0,0,40721.0,40721
4,R1101_HRB37996,ITERGO Informationstechnologie GmbH,"ERGO-Platz 1, 40477 Düsseldorf.",North Rhine-Westphalia,0,0,40477.0,40477


### Missing data

In [39]:
# Finding missing values
# isnull() flags missing observations, where an “observation” is an entry in the df (the equivalent of a cell in Excel)
companies_reg.isnull().sum()
# Nearly half of the companies have no registered address, and some registered addresses
# do not contain a postcode, hence the large number of missing values in plz_original

company_number              0
company_name                0
registered_address    1195258
bundesland_en               0
labor                       0
cancer                      0
plz_original          1393866
plz_imputed                 0
dtype: int64

### Mixed data type

In [40]:
# Check for mixed data types
for col in companies_reg.columns.tolist():
  weird = (companies_reg[[col]].map(type) != companies_reg[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (companies_reg[weird]) > 0:
    print (col)
# As explored before, the mixed types in these two variables come from missing values
# These variables won't be used in the subsequent analysis, so I won't change them

registered_address
plz_original
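
The same check is repeated for several dataframes in this notebook, so it could be wrapped in a small helper. A minimal sketch; the function name `mixed_type_columns` and the toy dataframe are hypothetical.

```python
import numpy as np
import pandas as pd

def mixed_type_columns(df: pd.DataFrame) -> list:
    """Return the names of columns whose cells hold more than one Python type."""
    return [col for col in df.columns if df[col].apply(type).nunique() > 1]

# Hypothetical toy data: a missing address is stored as a float NaN among strings
toy = pd.DataFrame({
    'company_name': ['A GmbH', 'B UG'],
    'registered_address': ['Oststr.', np.nan],
}, dtype=object)

print(mixed_type_columns(toy))
```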


### Duplicates

In [41]:
# The following command will look for full duplicates within your df
dups = companies_reg[companies_reg.duplicated()]

In [42]:
# See results
dups.head()
# No duplicate rows

Unnamed: 0,company_number,company_name,registered_address,bundesland_en,labor,cancer,plz_original,plz_imputed


### Inconsistency checks

In [43]:
# Checking how many Bundesländer are listed
companies_reg['bundesland_en'].value_counts(dropna = False)
# There are 16 Bundesländer listed, which corresponds to the total number of Bundesländer in Germany

bundesland_en
North Rhine-Westphalia           505294
Bavaria                          412937
Baden-Württemberg                299894
Lower Saxony                     219162
Hesse                            195520
Berlin                           142570
Rhineland-Palatinate             109797
Saxony                            92364
Hamburg                           90427
Schleswig-Holstein                83243
Brandenburg                       57793
Saxony-Anhalt                     48918
Thuringia                         48541
Mecklenburg-Western Pomerania     37426
Saarland                          27738
Bremen                            21891
Name: count, dtype: int64

## 3.2 German population Bundesland

This dataset was downloaded from [Kaggle](https://www.kaggle.com/datasets/ralfschukey/districts-and-towns-in-germany2019); the original data owner is the [Statistisches Bundesamt](https://www.destatis.de/DE/Home/_inhalt.html), the German federal authority responsible for collecting, processing, presenting, and analyzing statistical information on the economy, society, and environment. The dataset contains the area, total population, population by gender, and population density for all German towns and districts, released on 31.12.2018.

In [8]:
# See headers
population_de.head()

Unnamed: 0,Schlüsselnummer,Regionale Bezeichnung,Kreisfreie Stadt – Landkreis,NUTS3,Fläche in qkm,Bevölkerung,Bev.M,Bev.W,Bev.qkm
0,0,Staat,Deutschland,,357574.84,83019213.0,40966691.0,42052522.0,232.0
1,1,Bundesland,Schleswig-Holstein,,15804.3,2896712.0,1419457.0,1477255.0,183.0
2,1001,Kreisfreie Stadt,Flensburg,DEF01,56.73,89504.0,44599.0,44905.0,1578.0
3,1002,Landeshauptstadt,Kiel,DEF02,118.65,247548.0,120566.0,126982.0,2086.0
4,1003,Hansestadt,Lübeck,DEF03,214.19,217198.0,104371.0,112827.0,1014.0


In [9]:
# Dropping the "NUTS3" column, as this variable is not needed
# Columns are dropped with df.drop(columns = ['variable'])
population_de = population_de.drop(columns = ['NUTS3'])

In [10]:
# See results
population_de.head()

Unnamed: 0,Schlüsselnummer,Regionale Bezeichnung,Kreisfreie Stadt – Landkreis,Fläche in qkm,Bevölkerung,Bev.M,Bev.W,Bev.qkm
0,0,Staat,Deutschland,357574.84,83019213.0,40966691.0,42052522.0,232.0
1,1,Bundesland,Schleswig-Holstein,15804.3,2896712.0,1419457.0,1477255.0,183.0
2,1001,Kreisfreie Stadt,Flensburg,56.73,89504.0,44599.0,44905.0,1578.0
3,1002,Landeshauptstadt,Kiel,118.65,247548.0,120566.0,126982.0,2086.0
4,1003,Hansestadt,Lübeck,214.19,217198.0,104371.0,112827.0,1014.0


### Renaming variables

In [11]:
# Rename variable using df.rename(columns = {'old_name' : 'new_name'}, inplace = True)
population_de.rename(columns = {'Schlüsselnummer' : 'administration_unit_id'}, inplace = True)

In [12]:
# Rename variable using df.rename(columns = {'old_name' : 'new_name'}, inplace = True)
population_de.rename(columns = {'Regionale Bezeichnung' : 'administration_level'}, inplace = True)

In [13]:
# Rename variable using df.rename(columns = {'old_name' : 'new_name'}, inplace = True)
population_de.rename(columns = {'Kreisfreie Stadt – Landkreis' : 'bundesland_de'}, inplace = True)

In [14]:
# Rename variable using df.rename(columns = {'old_name' : 'new_name'}, inplace = True)
population_de.rename(columns = {'Fläche in qkm' : 'area_sqkm'}, inplace = True)

In [15]:
# Rename variable using df.rename(columns = {'old_name' : 'new_name'}, inplace = True)
population_de.rename(columns = {'Bevölkerung' : 'population'}, inplace = True)

In [16]:
# Rename variable using df.rename(columns = {'old_name' : 'new_name'}, inplace = True)
population_de.rename(columns = {'Bev.M' : 'male'}, inplace = True)

In [17]:
# Rename variable using df.rename(columns = {'old_name' : 'new_name'}, inplace = True)
population_de.rename(columns = {'Bev.W' : 'female'}, inplace = True)

In [18]:
# Rename variable using df.rename(columns = {'old_name' : 'new_name'}, inplace = True)
population_de.rename(columns = {'Bev.qkm' : 'population_per_sqkm'}, inplace = True)
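
The eight renaming steps above could also be written as a single `rename` call with one mapping. A sketch, using one row copied from the earlier output as a stand-in for `population_de`; the names `column_names`, `sample`, and `renamed` are only illustrative.

```python
import pandas as pd

# Old column name -> new column name, as used in the cells above
column_names = {
    'Schlüsselnummer': 'administration_unit_id',
    'Regionale Bezeichnung': 'administration_level',
    'Kreisfreie Stadt – Landkreis': 'bundesland_de',
    'Fläche in qkm': 'area_sqkm',
    'Bevölkerung': 'population',
    'Bev.M': 'male',
    'Bev.W': 'female',
    'Bev.qkm': 'population_per_sqkm',
}

# One row copied from the output above, used as a stand-in for population_de
sample = pd.DataFrame(
    [[1, 'Bundesland', 'Schleswig-Holstein', 15804.3, 2896712.0, 1419457.0, 1477255.0, 183.0]],
    columns=list(column_names),
)

# A single call renames all columns and returns a new dataframe
renamed = sample.rename(columns=column_names)
print(renamed.columns.tolist())
```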

In [19]:
# See results
population_de.head()

Unnamed: 0,administration_unit_id,administration_level,bundesland_de,area_sqkm,population,male,female,population_per_sqkm
0,0,Staat,Deutschland,357574.84,83019213.0,40966691.0,42052522.0,232.0
1,1,Bundesland,Schleswig-Holstein,15804.3,2896712.0,1419457.0,1477255.0,183.0
2,1001,Kreisfreie Stadt,Flensburg,56.73,89504.0,44599.0,44905.0,1578.0
3,1002,Landeshauptstadt,Kiel,118.65,247548.0,120566.0,126982.0,2086.0
4,1003,Hansestadt,Lübeck,214.19,217198.0,104371.0,112827.0,1014.0


### Subsetting Bundesland

From this dataframe, I am only interested in retrieving data at the Bundesland level.

In [20]:
# Creating a subset to select only information at the Bundesland level
# Bundesländer are listed from 1 to 16 within the administration_unit_id variable
bundesland_de = population_de[population_de['administration_unit_id'].isin([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])]
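
Filtering on the ID rather than on `administration_level` matters because not every Bundesland carries the label "Bundesland": Bayern, for instance, is labelled "Freistaat" in this dataset. A minimal sketch with four rows taken from the outputs in this section (only the first three columns are reproduced):

```python
import pandas as pd

# Rows copied from the outputs in this section
sample = pd.DataFrame({
    'administration_unit_id': [0, 1, 1001, 9],
    'administration_level': ['Staat', 'Bundesland', 'Kreisfreie Stadt', 'Freistaat'],
    'bundesland_de': ['Deutschland', 'Schleswig-Holstein', 'Flensburg', 'Bayern'],
})

# Filter used in this notebook: IDs 1 to 16 identify the Bundesländer
by_id = sample[sample['administration_unit_id'].isin(range(1, 17))]

# Alternative filter on the label, which misses Bayern
by_level = sample[sample['administration_level'] == 'Bundesland']

print(by_id['bundesland_de'].tolist())
print(by_level['bundesland_de'].tolist())
```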

In [24]:
bundesland_de
# the 16 Bundesländer are listed

Unnamed: 0,administration_unit_id,administration_level,urban_rural_districts,area_sqkm,population,male,female,population_per_sqkm
1,1,Bundesland,Schleswig-Holstein,15804.3,2896712.0,1419457.0,1477255.0,183.0
17,2,Bundesland,Hamburg,755.09,1841179.0,902048.0,939131.0,2438.0
19,3,Bundesland,Niedersachsen,47709.51,7982448.0,3943243.0,4039205.0,167.0
69,4,Bundesland,Bremen,419.36,682986.0,338035.0,344951.0,1629.0
72,5,Bundesland,Nordrhein-Westfalen,34112.31,17932651.0,8798631.0,9134020.0,526.0
131,6,Bundesland,Hessen,21115.67,6265809.0,3093044.0,3172765.0,297.0
161,7,Bundesland,Rheinland-Pfalz,19851.82,4084844.0,2017576.0,2067268.0,206.0
201,8,Bundesland,Baden-Württemberg,35748.2,11069533.0,5501693.0,5567840.0,310.0
262,9,Freistaat,Bayern,70541.61,13076721.0,6483793.0,6592928.0,185.0
366,10,Bundesland,Saarland,2571.11,990509.0,486159.0,504350.0,385.0


### Dropping columns

In [21]:
# Columns are dropped with df.drop(columns = ['variable'])
bundesland_de = bundesland_de.drop(columns = ['administration_level'])

In [22]:
# See results
bundesland_de

Unnamed: 0,administration_unit_id,bundesland_de,area_sqkm,population,male,female,population_per_sqkm
1,1,Schleswig-Holstein,15804.3,2896712.0,1419457.0,1477255.0,183.0
17,2,Hamburg,755.09,1841179.0,902048.0,939131.0,2438.0
19,3,Niedersachsen,47709.51,7982448.0,3943243.0,4039205.0,167.0
69,4,Bremen,419.36,682986.0,338035.0,344951.0,1629.0
72,5,Nordrhein-Westfalen,34112.31,17932651.0,8798631.0,9134020.0,526.0
131,6,Hessen,21115.67,6265809.0,3093044.0,3172765.0,297.0
161,7,Rheinland-Pfalz,19851.82,4084844.0,2017576.0,2067268.0,206.0
201,8,Baden-Württemberg,35748.2,11069533.0,5501693.0,5567840.0,310.0
262,9,Bayern,70541.61,13076721.0,6483793.0,6592928.0,185.0
366,10,Saarland,2571.11,990509.0,486159.0,504350.0,385.0


### Mixed datatypes

In [23]:
# Check for mixed data types
for col in bundesland_de.columns.tolist():
  weird = (bundesland_de[[col]].map(type) != bundesland_de[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (bundesland_de[weird]) > 0:
    print (col)
    #no mixed data type
    # no missing values

### Adding GDP information

To complement the information at the Bundesland level, I will add GDP information. I retrieved the 2022 GDP data at the Bundesland level from the [Statistische Ämter des Bundes und der Länder](https://www.statistikportal.de/de).

In [28]:
# Information about the GDP of each Bundesland
GDP_data = {
    'bundesland_de': ['Baden-Württemberg', 'Bayern', 'Berlin', 'Brandenburg', 'Bremen', 'Hamburg', 'Hessen', 'Mecklenburg-Vorpommern', 'Niedersachsen', 'Nordrhein-Westfalen', 'Rheinland-Pfalz', 'Saarland', 'Sachsen', 'Sachsen-Anhalt', 'Schleswig-Holstein', 'Thüringen'],
    'gdp_mill_euro': [572837, 716784, 179379, 88800, 38698, 144220, 323352, 53440, 339414, 793790, 171699, 38505, 146511, 75436, 112755, 71430]
}

# Create a DataFrame from the provided GDP data
gdp = pd.DataFrame(GDP_data)

# See results
gdp

Unnamed: 0,bundesland_de,gdp_mill_euro
0,Baden-Württemberg,572837
1,Bayern,716784
2,Berlin,179379
3,Brandenburg,88800
4,Bremen,38698
5,Hamburg,144220
6,Hessen,323352
7,Mecklenburg-Vorpommern,53440
8,Niedersachsen,339414
9,Nordrhein-Westfalen,793790


In [29]:
# Merge the existing DataFrame with the GDP DataFrame on the 'bundesland_de' column
bundesland_gdp = pd.merge(bundesland_de, gdp, on='bundesland_de')

# See results
bundesland_gdp

Unnamed: 0,administration_unit_id,bundesland_de,area_sqkm,population,male,female,population_per_sqkm,gdp_mill_euro
0,1,Schleswig-Holstein,15804.3,2896712.0,1419457.0,1477255.0,183.0,112755
1,2,Hamburg,755.09,1841179.0,902048.0,939131.0,2438.0,144220
2,3,Niedersachsen,47709.51,7982448.0,3943243.0,4039205.0,167.0,339414
3,4,Bremen,419.36,682986.0,338035.0,344951.0,1629.0,38698
4,5,Nordrhein-Westfalen,34112.31,17932651.0,8798631.0,9134020.0,526.0,793790
5,6,Hessen,21115.67,6265809.0,3093044.0,3172765.0,297.0,323352
6,7,Rheinland-Pfalz,19851.82,4084844.0,2017576.0,2067268.0,206.0,171699
7,8,Baden-Württemberg,35748.2,11069533.0,5501693.0,5567840.0,310.0,572837
8,9,Bayern,70541.61,13076721.0,6483793.0,6592928.0,185.0,716784
9,10,Saarland,2571.11,990509.0,486159.0,504350.0,385.0,38505
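
Since the GDP table is typed by hand, a spelling difference in a Bundesland name would make the default inner merge drop that row silently. A sketch of how such cases could be detected with `how='left'`, `indicator=True`, and `validate='one_to_one'`; the toy frames are hypothetical and the misspelling "Thueringen" is introduced on purpose.

```python
import pandas as pd

# Hypothetical toy frames; GDP values are copied from the cell above
states = pd.DataFrame({'bundesland_de': ['Hessen', 'Bayern', 'Thüringen']})
gdp_toy = pd.DataFrame({
    'bundesland_de': ['Hessen', 'Bayern', 'Thueringen'],  # deliberate misspelling
    'gdp_mill_euro': [323352, 716784, 71430],
})

# Keep every state, record where each row came from, and fail on duplicated keys
checked = pd.merge(
    states, gdp_toy,
    on='bundesland_de', how='left',
    indicator=True, validate='one_to_one',
)

# States that found no GDP partner
unmatched = checked.loc[checked['_merge'] == 'left_only', 'bundesland_de'].tolist()
print(unmatched)
```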


## 3.3 German population regional

### 3.3.1 Habitants per postcode

In [38]:
# See headers
plz_de.head()

Unnamed: 0,plz,note,einwohner,qkm,lat,lon
0,1067,01067 Dresden,11957,6.866839,51.06019,13.71117
1,1069,01069 Dresden,25483,5.339213,51.03964,13.7303
2,1097,01097 Dresden,14821,3.298022,51.06945,13.73781
3,1099,01099 Dresden,28018,58.505818,51.09272,13.82842
4,1108,01108 Dresden,5876,16.447222,51.1518,13.79227


### Dropping columns

In [7]:
# Columns are dropped with df.drop(columns = ['variable'])
plz_de = plz_de.drop(columns = ['lat', 'lon', 'note'])

In [8]:
# See results
plz_de.head()

Unnamed: 0,plz,einwohner,qkm
0,1067,11957,6.866839
1,1069,25483,5.339213
2,1097,14821,3.298022
3,1099,28018,58.505818
4,1108,5876,16.447222


### Renaming variables

In [9]:
# Rename variable using df.rename(columns = {'old_name' : 'new_name'}, inplace = True)
plz_de.rename(columns = {'einwohner' : 'habitants'}, inplace = True)

In [10]:
# Rename variable using df.rename(columns = {'old_name' : 'new_name'}, inplace = True)
plz_de.rename(columns = {'qkm' : 'area_sqkm'}, inplace = True)

In [11]:
# See results
plz_de.head()

Unnamed: 0,plz,habitants,area_sqkm
0,1067,11957,6.866839
1,1069,25483,5.339213
2,1097,14821,3.298022
3,1099,28018,58.505818
4,1108,5876,16.447222


### Mixed types

In [44]:
# Check for mixed data types
for col in plz_de.columns.tolist():
  weird = (plz_de[[col]].map(type) != plz_de[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (plz_de[weird]) > 0:
    print (col)
    #no mixed data type

### Duplicates

In [45]:
# The following command will look for full duplicates within your df
plz_dups = plz_de[plz_de.duplicated()]

In [46]:
# See results
plz_dups
# no duplicate data

Unnamed: 0,plz,habitants,area_sqkm


### Missing data

In [47]:
# Finding missing values
# isnull() flags missing observations, where an “observation” is an entry in the df (the equivalent of a cell in Excel)
plz_de.isnull().sum()
# no missing data

plz          0
habitants    0
area_sqkm    0
dtype: int64

### Basic descriptive statistics

In [12]:
# Evaluate general trends of variables
plz_de.describe()
# "habitants": some postcodes have no habitants associated with them (ISSUE 1)
# "area_sqkm": some postcodes cover a very small area (ISSUE 2)

Unnamed: 0,habitants,area_sqkm
count,8170.0,8170.0
mean,9831.355202,43.730794
std,8866.105266,52.630298
min,0.0,0.001829
25%,2789.25,12.738848
50%,6559.5,27.27194
75%,15162.25,56.228272
max,58782.0,891.943577


In [13]:
# ISSUE 1 & ISSUE 2
# Subset for postcodes with fewer than 10 habitants
subset_df = plz_de[plz_de['habitants'] < 10]
subset_df
# After further investigation, I found that in Germany individual buildings can have their own postcode,
# e.g. the Münchner Haus has the mailing address and postcode: “Münchner Haus, 82475, Zugspitze”
# So it is not unusual for some German postcodes to have no habitants
# Similarly, the area in sqkm of a single building would be small

Unnamed: 0,plz,habitants,area_sqkm
1956,27499,8,7.62113
2160,30669,0,4.694415
2349,33333,3,0.413182
4188,60306,0,0.005761
4189,60308,0,0.004477
4190,60310,0,0.007112
4192,60312,0,0.001829
4195,60315,0,0.017285
4419,64743,3,0.082066
4917,70629,8,4.368498


In [14]:
# Calculate the sum of the 'area_sqkm' and 'habitants' variables
total_sqkm = plz_de['area_sqkm'].sum()
total_habitants = plz_de['habitants'].sum()

# Print the results
print(f"Total area in sqkm of Germany: {total_sqkm}")
print(f"Total habitants in Germany: {total_habitants}")

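# The total area closely matches that of Germany (357574.84 sqkm in population_de)
# The total of about 80.3 million habitants is roughly 3% below the 83.0 million reported in population_de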

Total area in sqkm of Germany: 357280.58764199994
Total habitants in Germany: 80322172
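
How closely these totals match the national figures can be quantified with the numbers already shown in this notebook: the sums above and the "Deutschland" row of `population_de`. A small sketch; the variable names are only illustrative.

```python
# Totals from the postcode dataset (output above)
area_plz = 357280.58764199994
habitants_plz = 80322172

# National figures from the "Deutschland" row of population_de
area_reference = 357574.84
population_reference = 83019213.0

# Relative gaps between the two sources
area_gap = (area_reference - area_plz) / area_reference
population_gap = (population_reference - habitants_plz) / population_reference

print(f"Area gap: {area_gap:.2%}")
print(f"Population gap: {population_gap:.2%}")
```

The two datasets come from different sources, so a gap of a few percent in population may simply reflect different reference dates; this is an assumption, not something verified here.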


### 3.3.2 Regions per postcode

In [57]:
# See headers
ort_de.head()

Unnamed: 0,osm_id,ags,ort,plz,landkreis,bundesland
0,1104550,8335001,Aach,78267,Landkreis Konstanz,Baden-Württemberg
1,1255910,7235001,Aach,54298,Landkreis Trier-Saarburg,Rheinland-Pfalz
2,62564,5334002,Aachen,52062,Städteregion Aachen,Nordrhein-Westfalen
3,62564,5334002,Aachen,52064,Städteregion Aachen,Nordrhein-Westfalen
4,62564,5334002,Aachen,52066,Städteregion Aachen,Nordrhein-Westfalen


### Dropping columns

In [15]:
# Columns are dropped with df.drop(columns = ['variable'])
ort_de = ort_de.drop(columns = ['osm_id', 'ags'])

In [16]:
# see results
ort_de.head()

Unnamed: 0,ort,plz,landkreis,bundesland
0,Aach,78267,Landkreis Konstanz,Baden-Württemberg
1,Aach,54298,Landkreis Trier-Saarburg,Rheinland-Pfalz
2,Aachen,52062,Städteregion Aachen,Nordrhein-Westfalen
3,Aachen,52064,Städteregion Aachen,Nordrhein-Westfalen
4,Aachen,52066,Städteregion Aachen,Nordrhein-Westfalen


### Renaming variables

In [17]:
# Rename variable using df.rename(columns = {'old_name' : 'new_name'}, inplace = True)
ort_de.rename(columns = {'ort' : 'city'}, inplace = True)

In [18]:
# Rename variable using df.rename(columns = {'old_name' : 'new_name'}, inplace = True)
ort_de.rename(columns = {'landkreis' : 'region'}, inplace = True)

In [19]:
# see results
ort_de.head()

Unnamed: 0,city,plz,region,bundesland
0,Aach,78267,Landkreis Konstanz,Baden-Württemberg
1,Aach,54298,Landkreis Trier-Saarburg,Rheinland-Pfalz
2,Aachen,52062,Städteregion Aachen,Nordrhein-Westfalen
3,Aachen,52064,Städteregion Aachen,Nordrhein-Westfalen
4,Aachen,52066,Städteregion Aachen,Nordrhein-Westfalen


### Duplicates

In [20]:
# The following command will look for full duplicates within your df
ort_dups = ort_de[ort_de.duplicated()]
ort_dups # no dups

Unnamed: 0,city,plz,region,bundesland


### Missing data

In [21]:
# isnull() flags missing observations, where an “observation” is an entry in the df (the equivalent of a cell in Excel)
ort_de.isnull().sum()
# Missing regions will be imputed with the value of the city variable

city             0
plz              0
region        1404
bundesland       0
dtype: int64

In [22]:
# Filling missing values with corresponding city values
ort_de['region'] = ort_de['region'].fillna(ort_de['city'])

In [23]:
# check for missing values
ort_de.isnull().sum() # no missing values

city          0
plz           0
region        0
bundesland    0
dtype: int64

### Mixed data types

In [24]:
# Check for mixed data types
for col in ort_de.columns.tolist():
  weird = (ort_de[[col]].map(type) != ort_de[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (ort_de[weird]) > 0:
    print (col) # no mixed data types

# 4. Merging German population regional

## 4.1 Key variable

In [25]:
# Are plz values unique on the plz_de df?
are_plz_values_unique = not plz_de['plz'].duplicated().any()

# Print the result
print(f"Are PLZ values unique? {are_plz_values_unique}")

Are PLZ values unique? True


In [26]:
# Are plz values unique on the ort_de df?
are_plz_values_unique = not ort_de['plz'].duplicated().any()

# Print the result
print(f"Are PLZ values unique? {are_plz_values_unique}")
# This is expected, as some German postcodes are associated with several places,
# e.g. 13 places in Germany share the postal code 99947
# Although a single postcode can be linked to several cities in Germany,
# the postcode is generally linked to the same region and Bundesland

Are PLZ values unique? False
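
The assumption that every postcode maps to a single region and Bundesland can be checked before the duplicates are dropped. A minimal sketch on hypothetical toy data with the same column names as `ort_de`; postcode 22222 is deliberately linked to two regions.

```python
import pandas as pd

# Hypothetical toy data with the same columns as ort_de
toy = pd.DataFrame({
    'city': ['A-Dorf', 'B-Dorf', 'C-Dorf', 'D-Stadt'],
    'plz': [11111, 11111, 22222, 22222],
    'region': ['Kreis X', 'Kreis X', 'Kreis Y', 'Kreis Z'],
    'bundesland': ['Land 1', 'Land 1', 'Land 2', 'Land 2'],
})

# Number of distinct regions and Bundesländer per postcode
distinct = toy.groupby('plz')[['region', 'bundesland']].nunique()

# Postcodes linked to more than one region or Bundesland
ambiguous = distinct[(distinct > 1).any(axis=1)]
print(ambiguous)
```

If such postcodes exist in the real data, keeping only the first row per postcode assigns them to one of several possible regions.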


In [27]:
# Investigating duplicates in the ort_de df
duplicate_plz_subset = ort_de[ort_de['plz'].duplicated(keep=False)]
duplicate_plz_subset
# there are 6071 rows whose postcode appears more than once

Unnamed: 0,city,plz,region,bundesland
1,Aach,54298,Landkreis Trier-Saarburg,Rheinland-Pfalz
16,Aalen,73434,Ostalbkreis,Baden-Württemberg
18,Aasbüttel,25560,Kreis Steinburg,Schleswig-Holstein
21,Abentheuer,55767,Landkreis Birkenfeld,Rheinland-Pfalz
24,Abtsbessingen,99713,Kyffhäuserkreis,Thüringen
...,...,...,...,...
12834,Züsch,54422,Landkreis Trier-Saarburg,Rheinland-Pfalz
12836,Züsow,23992,Landkreis Nordwestmecklenburg,Mecklenburg-Vorpommern
12837,Züssow,17495,Landkreis Vorpommern-Greifswald,Mecklenburg-Vorpommern
12840,Zweifelscheid,54673,Eifelkreis Bitburg-Prüm,Rheinland-Pfalz


In [28]:
# To simplify the merging process, I will keep only the first row for each postcode
# Drop duplicates based on 'plz' column
ort_de_unique_plz = ort_de.drop_duplicates(subset='plz', keep='first')

# Print the result
print(f"Number of rows before removing duplicates: {len(ort_de)}")
print(f"Number of rows after removing duplicates: {len(ort_de_unique_plz)}")
# The number of rows after removing duplicates equals the number of rows in the plz_de df

Number of rows before removing duplicates: 12854
Number of rows after removing duplicates: 8170
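
Keeping the first row discards the other city names that share a postcode. If those names were needed later, an alternative would be to collapse them into one row per postcode. A sketch on hypothetical toy data; the column name `cities` is an illustrative choice.

```python
import pandas as pd

# Hypothetical toy data with the same columns as ort_de
toy = pd.DataFrame({
    'city': ['A-Dorf', 'B-Dorf', 'C-Stadt'],
    'plz': [11111, 11111, 22222],
    'region': ['Kreis X', 'Kreis X', 'Kreis Y'],
    'bundesland': ['Land 1', 'Land 1', 'Land 2'],
})

# One row per postcode: join all city names, keep the first region and Bundesland
collapsed = (
    toy.groupby('plz', as_index=False)
       .agg(
           cities=('city', ', '.join),
           region=('region', 'first'),
           bundesland=('bundesland', 'first'),
       )
)
print(collapsed)
```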


In [29]:
# Check removal of duplicates
# Are plz values unique on the ort_de_unique_plz df?
are_plz_values_unique = not ort_de_unique_plz['plz'].duplicated().any()

# Print the result
print(f"Are PLZ values unique? {are_plz_values_unique}")

Are PLZ values unique? True


## 4.2 Merge

In [30]:
# check size before merge for plz_de
plz_de.shape
# For comparison, there are 8203 different postal codes for ordinary delivery, i.e. postcodes associated with a geographical area; plz_de contains 8170 postcodes

(8170, 3)

In [31]:
# check size before merge for ort_de
ort_de_unique_plz.shape
# In 2018, 28,278 different postal codes were assigned in Germany, including 8,181 for places,
# 16,137 for mailboxes, 3,095 for major customers and 865 so-called "Aktions-PLZ" (e.g. for lottery games).
# Buildings might also have their own postcode

(8170, 4)

In [32]:
# Perform the merge
population_plz = plz_de.merge(ort_de_unique_plz, on='plz')

In [33]:
# See results
population_plz.head()

Unnamed: 0,plz,habitants,area_sqkm,city,region,bundesland
0,1067,11957,6.866839,Dresden,Dresden,Sachsen
1,1069,25483,5.339213,Dresden,Dresden,Sachsen
2,1097,14821,3.298022,Dresden,Dresden,Sachsen
3,1099,28018,58.505818,Dresden,Dresden,Sachsen
4,1108,5876,16.447222,Dresden,Dresden,Sachsen
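
The `plz` values are displayed without their leading zero (1067 instead of 01067), which suggests they are stored as integers, whereas `plz_imputed` in the companies dataframe holds zero-padded strings. A sketch of how the two could be aligned before a later join, assuming an integer `plz` column; the company row and the helper column `plz_str` are hypothetical.

```python
import pandas as pd

# Two rows copied from the output above
population_sample = pd.DataFrame({'plz': [1067, 1069], 'habitants': [11957, 25483]})

# Hypothetical company row with a zero-padded string postcode
companies_sample = pd.DataFrame({
    'company_name': ['Hypothetical GmbH'],
    'plz_imputed': ['01067'],
})

# Convert the integer postcode to a zero-padded 5-character string
population_sample['plz_str'] = population_sample['plz'].astype(str).str.zfill(5)

# Join on the string representation
merged = companies_sample.merge(
    population_sample, left_on='plz_imputed', right_on='plz_str', how='left'
)
print(merged)
```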


In [34]:
# Checking for missing values after merge
# isnull() flags missing observations, where an “observation” is an entry in the df (the equivalent of a cell in Excel)
population_plz.isnull().sum()
# no missing data

plz           0
habitants     0
area_sqkm     0
city          0
region        0
bundesland    0
dtype: int64

In [35]:
# Check results
population_plz.shape
# all observations in plz_de are present in the merged dataframe

(8170, 6)

In [36]:
# Check results
# Evaluate general trends of variables
population_plz.describe()
# the basic statistics match those originally observed for the plz_de df

Unnamed: 0,habitants,area_sqkm
count,8170.0,8170.0
mean,9831.355202,43.730794
std,8866.105266,52.630298
min,0.0,0.001829
25%,2789.25,12.738848
50%,6559.5,27.27194
75%,15162.25,56.228272
max,58782.0,891.943577


# 5. Exporting dataframes

## German companies

In [44]:
# Check shape before exporting
companies_reg.shape
# nearly 2.4 million active companies were extracted from the original file for Germany

(2393515, 8)

In [45]:
# Check header before exporting
companies_reg.head()

Unnamed: 0,company_number,company_name,registered_address,bundesland_en,labor,cancer,plz_original,plz_imputed
0,K1101R_HRB150148,olly UG (haftungsbeschränkt),"Waidmannstraße 1, 22769 Hamburg.",Hamburg,0,0,22769.0,22769
1,R1101_HRB81092,BLUECHILLED Verwaltungs GmbH,Oststr.,North Rhine-Westphalia,0,0,,40549
2,H1101_H1101_HRB18423,Mittelständische Beteiligungsgesellschaft Brem...,"Langenstraße 2-4, 28195 Bremen.",Bremen,0,0,28195.0,28195
3,R1101_HRB45109,Albert Barufe GmbH,"Hans-Sachs-Straße 11, 40721 Hilden.",North Rhine-Westphalia,0,0,40721.0,40721
4,R1101_HRB37996,ITERGO Informationstechnologie GmbH,"ERGO-Platz 1, 40477 Düsseldorf.",North Rhine-Westphalia,0,0,40477.0,40477


In [46]:
# Exporting to prepared data folder
#"index = False" avoids the automatic creation of an unnamed column in the exported csv file
companies_reg.to_csv(os.path.join(path, '02_Data','Prepared_data', 'all_company_de_step1.csv'), index = False)
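
When the exported file is read back in a later step, `read_csv` infers numeric types by default, so postcodes such as 01067 would lose their leading zero unless the column is read as text. A minimal sketch using an in-memory buffer instead of the real file; the company row is hypothetical.

```python
import io
import pandas as pd

# Hypothetical row with a postcode that starts with a zero
sample = pd.DataFrame({'company_name': ['Hypothetical GmbH'], 'plz_imputed': ['01067']})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Default read: the postcode is parsed as an integer
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Reading the column as str keeps the leading zero
buffer.seek(0)
string_read = pd.read_csv(buffer, dtype={'plz_imputed': str})

print(default_read['plz_imputed'].iloc[0], string_read['plz_imputed'].iloc[0])
```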

## German population Bundesland

In [30]:
# Check shape before exporting
bundesland_gdp.shape
# Financial and demographic information for all 16 Bundesländer in Germany

(16, 8)

In [31]:
# Check header before exporting
bundesland_gdp.head()

Unnamed: 0,administration_unit_id,bundesland_de,area_sqkm,population,male,female,population_per_sqkm,gdp_mill_euro
0,1,Schleswig-Holstein,15804.3,2896712.0,1419457.0,1477255.0,183.0,112755
1,2,Hamburg,755.09,1841179.0,902048.0,939131.0,2438.0,144220
2,3,Niedersachsen,47709.51,7982448.0,3943243.0,4039205.0,167.0,339414
3,4,Bremen,419.36,682986.0,338035.0,344951.0,1629.0,38698
4,5,Nordrhein-Westfalen,34112.31,17932651.0,8798631.0,9134020.0,526.0,793790


In [32]:
# Exporting to prepared data folder
#"index = False" avoids the automatic creation of an unnamed column in the exported csv file
bundesland_gdp.to_csv(os.path.join(path, '02_Data','Prepared_data', 'population_bundesland_step1.csv'), index = False)

## German population regional

In [77]:
# Check shape before exporting
population_plz.shape
# there is demographic information for 8170 German postcodes

(8170, 6)

In [78]:
# Check header before exporting
population_plz.head()

Unnamed: 0,plz,habitants,area_sqkm,city,region,bundesland
0,1067,11957,6.866839,Dresden,Dresden,Sachsen
1,1069,25483,5.339213,Dresden,Dresden,Sachsen
2,1097,14821,3.298022,Dresden,Dresden,Sachsen
3,1099,28018,58.505818,Dresden,Dresden,Sachsen
4,1108,5876,16.447222,Dresden,Dresden,Sachsen


In [79]:
# Exporting to prepared data folder
#"index = False" avoids the automatic creation of an unnamed column in the exported csv file
population_plz.to_csv(os.path.join(path, '02_Data','Prepared_data', 'population_regional_step1.csv'), index = False)