# Data Science Hackathon Newcastle 2023

By Abdullah Alshadadi

References:

```bibtex
@article{PAPAGIANNIDIS2018355,
title = {Identifying industrial clusters with a novel big-data methodology: Are SIC codes (not) fit for purpose in the Internet age?},
journal = {Computers & Operations Research},
volume = {98},
pages = {355-366},
year = {2018},
issn = {0305-0548},
doi = {https://doi.org/10.1016/j.cor.2017.06.010},
url = {https://www.sciencedirect.com/science/article/pii/S030505481730148X},
author = {Savvas Papagiannidis and Eric W.K. See-To and Dimitris G. Assimakopoulos and Yang Yang},
keywords = {Industry classification, SIC codes, Big data analytics, Clusters, Operations, Strategic co-operation, Regional policy, North East of England},
abstract = {In this paper we propose using a novel big-data-mining methodology and the Internet as a new source of useful meta-data for industry classification. The proposed methodology can be utilised as a decision support system for identifying industrial clusters in almost real time in a specific geographic region, contributing to strategic co-operation and policy development for operations and supply chain management across organisational boundaries through big data analytics. Our theoretical discussion on discerning industrial activity of firms in geographical regions starts by highlighting the limitations of the Standard Industrial Classification (SIC) codes. This discussion is followed by the proposed methodology, which has three main steps revolving around web-based data collection, pre-processing and analysis, and reporting of clusters. We discuss each step in detail, presenting the experimental approaches tested. We apply our methodology to a regional case, in the North East of England, in order to demonstrate how such a big data decision support system/analytics can work in practice. Implications for theory, policy and practice are discussed, as well as potential avenues for further research.}
}
```

## Introduction

The Data Science Hackathon aims to develop a better alternative to Standard Industrial Classification (SIC) codes, the conventional scheme used to classify firms by industry and, from that, to identify industrial clusters in a geographic region.

The task is to use the [Elevate Greece website](https://elevategreece.gov.gr/startup-database/) to gather data on startups in Greece and to cluster them by industry without relying on SIC codes.

### Investigating the website of Elevate Greece Startup Database

First, the website appears to update its startup listings over time. At the start of the Data Science Hackathon the database held 700 startups; it now holds around 713. It is therefore important to agree on a checkpoint date and to use only the snapshot of the dataset retrieved at that date for the hackathon.

![elevate-26-03-2023](assets/imgs/updated-data.png "The number of entries on the Elevate Greece website has increased to 711")


* **The website layout**:

    1. The [Elevate Greece Startup Database](https://elevategreece.gov.gr/startup-database) webpage contains a table that shows 15 startup companies per page by default, with clickable elements for stepping through the next 15 entries until the 713th entry. Retrieving the full table therefore requires a web scraper that can click through these elements and collect the contents of each page.

    ![elevate-webpage-26-03-2023](assets/imgs/elevate-webpage.png "The webpage layout, which contains a table of 15 entries per page with clickable elements for reaching the remaining entries (713 in total)")

    2. Clicking a startup's name in the `Startup` column opens that company's details page. However, these are the details the startup provided to the government-run Elevate Greece registry, and they follow Greece's SIC codes. It is therefore important to cross-check them against the company's own website to gather up-to-date information.

    ![elevate-column-startup-26-03-2023](assets/imgs/column-to-view-details.png "An example showing where the `Startup` column is")

    [![elevate-startup-details-26-03-2023](assets/imgs/startup-details.png "An example of the Elevate's webpage details on the selected startup company")](https://registry.elevategreece.gov.gr/company/i-love-dyslexia-english-language-innovation-eli-ike/)

    3. The startup company's own website can be reached either through the link in the `Website` column of the Elevate Greece startup database table (point 1 above) or through the details page opened from the `Startup` column (point 2 above). The company website is therefore one of the main sources of information on each startup.

    ![elevate-column-website-26-03-2023](assets/imgs/column-to-view-website.png "An example showing where the `Website` column is")

    [![elevate-startup-details-website-26-03-2023](assets/imgs/startup-details-website.png "An example showing where the website link is in the Elevate's webpage details of a startup company")](https://registry.elevategreece.gov.gr/company/i-love-dyslexia-english-language-innovation-eli-ike/)
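
The pagination described in point 1 can be sketched as a small loop. This is a minimal sketch rather than the team's scraper: `fetch_page` is a hypothetical callable standing in for whatever actually clicks a pagination element and parses the visible table (for example a browser-automation step), and `fake_fetch_page` is a stand-in so the example runs offline. The figures 713 and 15 are taken from the description above.

```python
import math
from typing import Callable, Iterator, List


def page_count(total_entries: int, page_size: int = 15) -> int:
    """Number of table pages needed to show every entry."""
    return math.ceil(total_entries / page_size)


def iter_all_rows(
    fetch_page: Callable[[int], List[dict]],
    total_entries: int,
    page_size: int = 15,
) -> Iterator[dict]:
    """Yield every table row by visiting pages 1..N in order.

    `fetch_page` is hypothetical: given a 1-based page number it should
    return the rows shown on that page of the startup table.
    """
    for page in range(1, page_count(total_entries, page_size) + 1):
        yield from fetch_page(page)


def fake_fetch_page(page: int, total: int = 713, size: int = 15) -> List[dict]:
    """Offline stand-in for the real page loader (assumption, not the site's API)."""
    start = (page - 1) * size
    stop = min(start + size, total)
    return [{"row": i} for i in range(start, stop)]


rows = list(iter_all_rows(fake_fetch_page, total_entries=713))
print(page_count(713), len(rows))  # 48 713
```

With 713 entries at 15 per page, the scraper has to visit 48 pages, the last of which holds only 8 entries. Because the entry count changes over time, the total is best read from the page at the agreed checkpoint rather than hard-coded.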

    

## Investigating the [Data Gathered So Far](https://github.com/LeeTaylorNewcastle/Data-Science-Hackathon-02-03-2023/tree/a145b1925d6ed9f7d5810e29c542fdc4139aac99)

This is the publicly available data the team has gathered so far, hosted in a repository by [Lee Taylor](https://github.com/LeeTaylorNewcastle/Data-Science-Hackathon-02-03-2023/tree/a145b1925d6ed9f7d5810e29c542fdc4139aac99).

The gathered data is structured as follows:

```
    data
    └── Data-Science-Hackathon-02-03-2023
        ├── Deprecated  # Used for experiments on creating a web scraper
        │   ├── HTML Files  # Elevate Greece details on a startup company webpage
        │   ├── PDFs  # The downloadable PDF file that contains the table of startup database gathered by Elevate Greece
        │   ├── python_  # Different Python code 
        │   ├── Soup Output  # 
        │   └── templates  # Elevate Greece Startup Database table
        ├── Greek Startups  # Main Data
        │   ├── Copy Paste Website Text  # Manually extracted text data from the website pages
        │   ├── Excel Files Originals  # Manually extracted Elevate Greece Startup Database table
        │   └── Excel Files & Processing  # Retrieved data using the webscraper
        │       ├── soup_objects  # The html of each Elevate Greece details on a startup company webpage
        │       └── soup_to_text  # Extracted text from the Elevate Greece details on a startup company webpage
```

Looking at the manually retrieved Elevate Greece Startup Database table, the `Website` column appears to be completely empty (`NaN`):

```python
import pandas as pd

mainpage_dataset = pd.read_csv("../Data-Science-Hackathon-02-03-2023/Greek Startups/Excel Files Originals/combined_csv.csv")
mainpage_dataset.head(5)
```

| Unnamed: 0 | Startup | Industry | Technology | Region | Employee count | Total funding € | Website |
|---|---|---|---|---|---|---|---|
| 0 | 'I LOVE DYSLEXIA' ENGLISH LANGUAGE INNOVATION ... | EduTech - Education | AI, Software, Web or Mobile Application | Attica | 13 | 215 | NaN |
| 1 | 100 MENTORS IKE | EduTech - Education | AI, Blockchain, Data Analytics - Big Data, Sof... | Crete | 21 | - | NaN |
| 2 | 12GODS PC | AgriTech / FoodTech | Other | Attica | 1 | 37K | NaN |
| 3 | 27 Research | Maritime | AI, Data Analytics - Big Data | Attica | - | - | NaN |
| 4 | 33 CLOUDS IKE | Data Analytics - Big Data | Data Analytics - Big Data, Software, Web or Mo... | Attica | 9 | - | NaN |
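
The preview only shows five rows, so the claim that `Website` is entirely missing is worth checking over the whole column. The following is a minimal sketch, assuming pandas is installed; `sample` is a toy stand-in built from a few values in the preview above (the real check would run on `mainpage_dataset`), and `missing_share` is a helper name introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for a few rows of combined_csv.csv (values copied from the preview).
sample = pd.DataFrame(
    {
        "Startup": ["100 MENTORS IKE", "12GODS PC", "27 Research"],
        "Region": ["Crete", "Attica", "Attica"],
        "Website": [float("nan")] * 3,
    }
)


def missing_share(df: pd.DataFrame, column: str) -> float:
    """Fraction of rows in `column` that are NaN (0.0 = none, 1.0 = all)."""
    return float(df[column].isna().mean())


website_missing = missing_share(sample, "Website")
all_missing = bool(sample["Website"].isna().all())
print(website_missing, all_missing)  # 1.0 True
```

A share of `1.0` would confirm that the manual extraction dropped every link, in which case the websites would have to be recovered from another source, such as the scraped details pages.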


Furthermore, the column headers are sometimes repeated as data rows, as seen below:

```python
mainpage_dataset.iloc[99]
```

```
Startup                    Startup
Industry                  Industry
Technology              Technology
Region                      Region
Employee count      Employee count
Total funding €    Total funding €
Website                    Website
Name: 99, dtype: object
```
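
One way to remove such stray header rows is to drop every row whose cells equal their own column names. This is a minimal sketch, assuming pandas is installed; `drop_repeated_headers` is a helper name introduced here, and `raw` is a toy frame that mimics the problem rather than the real dataset.

```python
from typing import Optional, Sequence

import pandas as pd


def drop_repeated_headers(
    df: pd.DataFrame, columns: Optional[Sequence[str]] = None
) -> pd.DataFrame:
    """Drop rows in which every checked cell equals its own column name."""
    cols = list(df.columns) if columns is None else list(columns)
    is_header = pd.Series(True, index=df.index)
    for col in cols:
        # Compare as strings so numeric columns do not raise or mismatch.
        is_header &= df[col].astype(str) == str(col)
    return df.loc[~is_header].reset_index(drop=True)


# Toy frame with one header row repeated in the middle of the data.
raw = pd.DataFrame(
    {
        "Startup": ["12GODS PC", "Startup", "27 Research"],
        "Industry": ["AgriTech / FoodTech", "Industry", "Maritime"],
        "Region": ["Attica", "Region", "Attica"],
    }
)

clean = drop_repeated_headers(raw)
print(len(raw), len(clean))  # 3 2
```

If the frame also carries an index-like column (such as `Unnamed: 0`), the `columns` argument can restrict the comparison to the real table columns, for example `drop_repeated_headers(mainpage_dataset, columns=["Startup", "Industry"])`.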