This work is a preliminary exercise in preparation for my thesis. It uses a publicly available HSE dataset to refine my coding skills and to set up a suitable coding environment for the upcoming research. The work draws inspiration from the methods used by Christopher Thiele but does not directly replicate them. It is an independent attempt to make the existing models more efficient, with particular emphasis on reducing computation time without sacrificing accuracy. Note that this preliminary work was not carried out to academic or professional standards and does not use proprietary Uniper data. The aim is to sketch improvements based on Christopher's approaches, reimagined and implemented in my own way.

## Overview of Dataset
Column descriptions:
1. Data: timestamp (date/time information)
2. Countries: the country in which the accident occurred (anonymized)
3. Local: the city where the manufacturing plant is located (anonymized)
4. Industry sector: the sector the plant belongs to
5. Accident level: severity of the accident, from I to VI (I is the least severe, VI the most severe)
6. Potential Accident Level: how severe the accident could have been, given the other factors involved in the accident
7. Genre: whether the person is male or female
8. Employee or Third Party: whether the injured person is an employee or a third party
9. Critical Risk: a short description of the risk involved in the accident
10. Description: detailed description of how the accident happened
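
Several column names differ from the descriptions above (for example `Data` holds the date and `Genre` the gender). The following is a minimal sketch of how such columns could be renamed with pandas. The two-row sample and the English target names in `RENAMES` are assumptions made for illustration; only the original column names are taken from the notebook code and output.

```python
import pandas as pd

# Hypothetical two-row sample; values outside the printed output are made up.
sample = pd.DataFrame({
    "Data": ["2016-01-01 00:00:00", "2016-01-10 00:00:00"],
    "Countries": ["Country_01", "Country_01"],
    "Local": ["Local_01", "Local_04"],
    "Industry Sector": ["Mining", "Mining"],
    "Accident Level": ["I", "IV"],
    "Potential Accident Level": ["IV", "IV"],
    "Genre": ["Male", "Male"],
    "Risco Critico": ["Others", "Others"],
})

# Assumed English names; these are my own choice, not part of the dataset.
RENAMES = {
    "Data": "Date",
    "Countries": "Country",
    "Genre": "Gender",
    "Risco Critico": "Critical Risk",
}

# rename() returns a new DataFrame and leaves the original untouched.
renamed = sample.rename(columns=RENAMES)
print(renamed.columns.tolist())
```

The analysis cells in this notebook keep the original column names, so a rename like this would have to be applied consistently before any later cell accesses the columns.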

Accident Level (Severity) Classification: since the dataset provides only the levels themselves (I to VI) without descriptions, we infer the following:

- Level 1 (I): Minor Accident
- Level 2 (II): Moderate Accident
- Level 3 (III): Major Accident
- Level 4 (IV): Serious Accident
- Level 5 (V): Severe Accident
- Level 6 (VI): Catastrophic Accident


Potential Accident Level (Severity) Classification: we infer the following:

- Level 1 (I): Low Potential
- Level 2 (II): Moderate Potential
- Level 3 (III): High Potential
- Level 4 (IV): Very High Potential
- Level 5 (V): Extreme Potential
- Level 6 (VI): Critical Potential
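
The two inferred classifications can be kept as lookup tables and applied to the numeric levels. This is a minimal sketch, assuming the levels have already been converted to integers; the dictionary names and the three-value sample series are hypothetical, and the labels are the inferred ones listed above, not values from the dataset.

```python
import pandas as pd

# Inferred labels from the lists above (not part of the dataset itself).
ACCIDENT_LABELS = {
    1: "Minor", 2: "Moderate", 3: "Major",
    4: "Serious", 5: "Severe", 6: "Catastrophic",
}
POTENTIAL_LABELS = {
    1: "Low", 2: "Moderate", 3: "High",
    4: "Very High", 5: "Extreme", 6: "Critical",
}

# Hypothetical sample of numeric accident levels.
levels = pd.Series([1, 1, 4], name="Accident Level")

# Series.map looks each value up in the dictionary; unknown values become NaN.
labels = levels.map(ACCIDENT_LABELS)
print(labels.tolist())
```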

## References
- https://www.kaggle.com/code/williamsamadi/hse-data-analytics-for-beginners
- https://www.kaggle.com/datasets/ihmstefanini/industrial-safety-and-health-analytics-database

## Analysis

In [31]:
#keep this cell
#pip install roman
#pip install scikit-learn

In [32]:
import pandas as pd
import roman 

import matplotlib.pyplot as plt
import plotly.express as px

import numpy as np
import scipy.stats as stats
from sklearn.linear_model import LogisticRegression

import os

In [33]:
print(os.getcwd())

c:\Users\M02555\OneDrive - Uniper SE\Uniper\Codes\Thesis


In [34]:
# Build the path with os.path.join to avoid the invalid "\D" escape sequence
df = pd.read_csv(os.path.join('Datasets', 'Dataset_1.csv'))
print(df)

                    Data   Countries     Local Industry Sector Accident Level  \
0    2016-01-01 00:00:00  Country_01  Local_01          Mining              I   
1    2016-01-02 00:00:00  Country_02  Local_02          Mining              I   
2    2016-01-06 00:00:00  Country_01  Local_03          Mining              I   
3    2016-01-08 00:00:00  Country_01  Local_04          Mining              I   
4    2016-01-10 00:00:00  Country_01  Local_04          Mining             IV   
5    2016-01-12 00:00:00  Country_02  Local_05          Metals              I   
6    2016-01-16 00:00:00  Country_02  Local_05          Metals              I   
7    2016-01-17 00:00:00  Country_01  Local_04          Mining              I   
8    2016-01-19 00:00:00  Country_02  Local_02          Mining              I   
9    2016-01-26 00:00:00  Country_01  Local_06          Metals              I   
10   2016-01-28 00:00:00  Country_01  Local_03          Mining              I   
11   2016-01-30 00:00:00  Co

In [35]:
# Convert Accident Level and Potential Accident Level from Roman numerals to integers

df["Accident Level"] = df["Accident Level"].apply(roman.fromRoman)

df["Potential Accident Level"] = df["Potential Accident Level"].apply(roman.fromRoman)


print(df)

                    Data   Countries     Local Industry Sector  \
0    2016-01-01 00:00:00  Country_01  Local_01          Mining   
1    2016-01-02 00:00:00  Country_02  Local_02          Mining   
2    2016-01-06 00:00:00  Country_01  Local_03          Mining   
3    2016-01-08 00:00:00  Country_01  Local_04          Mining   
4    2016-01-10 00:00:00  Country_01  Local_04          Mining   
5    2016-01-12 00:00:00  Country_02  Local_05          Metals   
6    2016-01-16 00:00:00  Country_02  Local_05          Metals   
7    2016-01-17 00:00:00  Country_01  Local_04          Mining   
8    2016-01-19 00:00:00  Country_02  Local_02          Mining   
9    2016-01-26 00:00:00  Country_01  Local_06          Metals   
10   2016-01-28 00:00:00  Country_01  Local_03          Mining   
11   2016-01-30 00:00:00  Country_01  Local_03          Mining   
12   2016-02-01 00:00:00  Country_02  Local_05          Metals   
13   2016-02-02 00:00:00  Country_01  Local_01          Mining   
14   2016-
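
As a possible alternative to `roman.fromRoman`, the conversion can be sketched with a plain dictionary and `Series.map`, which avoids the extra dependency and turns missing or unexpected values into `<NA>` instead of raising an error. The helper name `roman_levels_to_int` and the sample series are hypothetical, and the lookup table assumes that only levels I to VI occur.

```python
import pandas as pd

# Lookup table for the six levels; assumes no other numerals appear in the data.
ROMAN_TO_INT = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6}


def roman_levels_to_int(series):
    """Map Roman-numeral strings to nullable integers (hypothetical helper)."""
    # Normalize whitespace and case before the lookup.
    cleaned = series.str.strip().str.upper()
    # Values missing from the table become NaN, then <NA> in the Int64 dtype.
    return cleaned.map(ROMAN_TO_INT).astype("Int64")


# Hypothetical sample, including a messy value and a missing one.
raw = pd.Series(["I", " iv ", "VI", None])
converted = roman_levels_to_int(raw)
print(converted)
```

The nullable `Int64` dtype keeps the levels as integers even when some rows are missing, which a plain `int64` column cannot do.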

In [36]:
# Convert the columns to the correct data types
df["Data"] = pd.to_datetime(df["Data"])
df["Accident Level"] = df["Accident Level"].astype("category")
df["Potential Accident Level"] = df["Potential Accident Level"].astype("category")
df["Genre"] = df["Genre"].astype("category")
df["Risco Critico"] = df["Risco Critico"].astype("category")

# Print the DataFrame to check the converted columns
print(df)

          Data   Countries     Local Industry Sector Accident Level  \
0   2016-01-01  Country_01  Local_01          Mining              1   
1   2016-01-02  Country_02  Local_02          Mining              1   
2   2016-01-06  Country_01  Local_03          Mining              1   
3   2016-01-08  Country_01  Local_04          Mining              1   
4   2016-01-10  Country_01  Local_04          Mining              4   
5   2016-01-12  Country_02  Local_05          Metals              1   
6   2016-01-16  Country_02  Local_05          Metals              1   
7   2016-01-17  Country_01  Local_04          Mining              1   
8   2016-01-19  Country_02  Local_02          Mining              1   
9   2016-01-26  Country_01  Local_06          Metals              1   
10  2016-01-28  Country_01  Local_03          Mining              1   
11  2016-01-30  Country_01  Local_03          Mining              1   
12  2016-02-01  Country_02  Local_05          Metals              1   
13  20