
# **PROBLEM STATEMENT**

**DOMAIN:** Industrial safety. NLP-based Chatbot.

**CONTEXT:**  
The database comes from one of the biggest industries in Brazil and in the world.  
Industries and companies around the globe urgently need to understand why employees still suffer injuries and accidents in plants, some of which are fatal.

**DATA DESCRIPTION:**  
The database consists of accident records from 12 different plants in 3 different countries.  
Each row in the data is one accident occurrence.

**Columns Description:**
- **Data**: Timestamp or time/date information  
- **Countries**: The country in which the accident occurred (anonymized)  
- **Local**: The city where the manufacturing plant is located (anonymized)  
- **Industry sector**: Which sector the plant belongs to  
- **Accident level**: From I to VI, it registers how severe the accident was (I = not severe, VI = very severe)  
- **Potential Accident Level**: Depending on the accident level, the database also registers how severe the accident *could* have been (due to other factors)  
- **Genre**: Whether the person is male or female  
- **Employee or Third Party**: Whether the injured person is an employee or a third party  
- **Critical Risk**: Description of the risk involved in the accident  
- **Description**: Detailed description of how the accident happened  
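
The I to VI severity scale described above is ordinal, so it could be encoded as an ordered categorical rather than plain strings. The following is a minimal sketch, assuming the levels are stored as the Roman-numeral strings seen in the sample rows; the `toy` frame and its values are hypothetical stand-ins, not taken from the dataset.

```python
import pandas as pd

# Severity scale from the column description (I = not severe, VI = very severe).
levels = ["I", "II", "III", "IV", "V", "VI"]
severity = pd.CategoricalDtype(categories=levels, ordered=True)

# Hypothetical toy rows standing in for the real accident records.
toy = pd.DataFrame({
    "Accident Level": ["I", "IV", "II"],
    "Potential Accident Level": ["IV", "IV", "III"],
})

# Convert both level columns to the same ordered categorical dtype.
for col in ["Accident Level", "Potential Accident Level"]:
    toy[col] = toy[col].astype(severity)

# With an ordered dtype, comparisons follow severity rather than alphabetical order.
could_have_been_worse = toy["Potential Accident Level"] > toy["Accident Level"]
worst_actual = toy["Accident Level"].max()
```

With this encoding, `could_have_been_worse` flags the rows where the potential level exceeds the recorded level.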


# **Importing necessary libraries, creating dataframe**

## **Import Libraries**


## **Importing Data**

In [3]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
import pandas as pd
data = pd.read_excel('/content/drive/MyDrive/G7_Project-NLP_ChatBot/Data Set - industrial_safety_and_health_database_with_accidents_description.xlsx')

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 425 entries, 0 to 424
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Unnamed: 0                425 non-null    int64         
 1   Data                      425 non-null    datetime64[ns]
 2   Countries                 425 non-null    object        
 3   Local                     425 non-null    object        
 4   Industry Sector           425 non-null    object        
 5   Accident Level            425 non-null    object        
 6   Potential Accident Level  425 non-null    object        
 7   Genre                     425 non-null    object        
 8   Employee or Third Party   425 non-null    object        
 9   Critical Risk             425 non-null    object        
 10  Description               425 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(9)
memory usage: 36.7+ KB


In [6]:
data.shape

(425, 11)

In [7]:
data.head()

,Unnamed: 0,Data,Countries,Local,Industry Sector,Accident Level,Potential Accident Level,Genre,Employee or Third Party,Critical Risk,Description
0,0,2016-01-01,Country_01,Local_01,Mining,I,IV,Male,Third Party,Pressed,While removing the drill rod of the Jumbo 08 f...
1,1,2016-01-02,Country_02,Local_02,Mining,I,IV,Male,Employee,Pressurized Systems,During the activation of a sodium sulphide pum...
2,2,2016-01-06,Country_01,Local_03,Mining,I,III,Male,Third Party (Remote),Manual Tools,In the sub-station MILPO located at level +170...
3,3,2016-01-08,Country_01,Local_04,Mining,I,I,Male,Third Party,Others,Being 9:45 am. approximately in the Nv. 1880 C...
4,4,2016-01-10,Country_01,Local_04,Mining,IV,IV,Male,Third Party,Others,Approximately at 11:45 a.m. in circumstances t...


Observations:

The DataFrame has 425 rows and 11 columns, none of which contain null values, with the following data types:

1.   datetime64[ns]: 1 column (`Data`)
2.   int64: 1 column (`Unnamed: 0`)
3.   object: 9 columns
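
The dtype tally in these observations can also be computed directly instead of being read off the `info()` footer. This is a small sketch on a hypothetical three-column `toy` frame (the values are illustrative only); the same expression would apply to `data`.

```python
import pandas as pd

# Hypothetical toy frame with one integer, one datetime and one text column.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Data": pd.to_datetime(["2016-01-01", "2016-01-02"]),
    "Countries": ["Country_01", "Country_02"],
})

# Count how many columns share each dtype.
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)
```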



# **Data Cleansing**

Based on the inspection above, we need to cleanse the data with the following steps:

1. Remove the "Unnamed: 0" column as it is unnecessary.
2. Rename the "Data" column to "Date" since it represents date values.
3. Rename the "Countries" column to "Country".
4. Rename the "Genre" column to "Gender" to correctly reflect its content.
5. Rename the "Employee or Third Party" column to "Employee Type".
6. Check for and drop duplicate rows.
7. Check for and drop null values.
8. Extract the date components (year, month, day, weekday) as new columns.

In [8]:
# Remove 'Unnamed: 0' column from Data frame
data.drop("Unnamed: 0", axis=1, inplace=True)
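
The notebook does not say where the `Unnamed: 0` column came from. A common cause, assumed here, is a file that was saved with its row index included. The sketch below reproduces that effect with an in-memory CSV (used only to keep the example self-contained; `pd.read_excel` also accepts `index_col`) and a hypothetical `toy` frame.

```python
import io
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
toy = pd.DataFrame({"Data": ["2016-01-01", "2016-01-02"], "Genre": ["Male", "Male"]})

# Saving with the default index=True writes the row index as an unlabeled first column.
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading it back without index_col yields an extra "Unnamed: 0" column.
buffer.seek(0)
with_extra = pd.read_csv(buffer)

# index_col=0 tells pandas to use that first column as the index instead.
buffer.seek(0)
without_extra = pd.read_csv(buffer, index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

If the assumption holds for this dataset, passing `index_col=0` at load time would make the later `drop` unnecessary; dropping the column afterwards, as done here, gives the same result.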

In [9]:
# Rename columns in the Data frame as mentioned above
data.rename(columns={'Data':'Date', 'Countries':'Country', 'Genre':'Gender', 'Employee or Third Party':'Employee Type'}, inplace=True)

In [10]:
# Check duplicates in a data frame
data.duplicated().sum()

np.int64(7)

In [11]:
#Viewing the duplicates in the Data frame
duplicates = data.duplicated()
data[duplicates]

,Date,Country,Local,Industry Sector,Accident Level,Potential Accident Level,Gender,Employee Type,Critical Risk,Description
77,2016-04-01,Country_01,Local_01,Mining,I,V,Male,Third Party (Remote),Others,In circumstances that two workers of the Abrat...
262,2016-12-01,Country_01,Local_03,Mining,I,IV,Male,Employee,Others,During the activity of chuteo of ore in hopper...
303,2017-01-21,Country_02,Local_02,Mining,I,I,Male,Third Party (Remote),Others,Employees engaged in the removal of material f...
345,2017-03-02,Country_03,Local_10,Others,I,I,Male,Third Party,Venomous Animals,On 02/03/17 during the soil sampling in the re...
346,2017-03-02,Country_03,Local_10,Others,I,I,Male,Third Party,Venomous Animals,On 02/03/17 during the soil sampling in the re...
355,2017-03-15,Country_03,Local_10,Others,I,I,Male,Third Party,Venomous Animals,Team of the VMS Project performed soil collect...
397,2017-05-23,Country_01,Local_04,Mining,I,IV,Male,Third Party,Projection of fragments,In moments when the 02 collaborators carried o...


In [12]:
# Drop the duplicates; removing them does not affect the overall quality of the DataFrame
data.drop_duplicates(inplace=True)
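
By default, `duplicated()` flags only the later copies of a repeated row, so the first occurrence is not shown when reviewing them. A minimal sketch of reviewing every member of a duplicate group with `keep=False` before dropping, on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical toy frame in which rows 0, 1 and 3 are identical.
toy = pd.DataFrame({
    "Country": ["Country_01", "Country_01", "Country_02", "Country_01"],
    "Description": ["fall", "fall", "cut", "fall"],
})

# Default keep="first": only the repeats after the first occurrence are flagged.
later_copies = toy[toy.duplicated()]

# keep=False: every row that has at least one identical twin is flagged.
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each group.
deduplicated = toy.drop_duplicates()
```

Here `later_copies` holds rows 1 and 3, `all_copies` holds rows 0, 1 and 3, and `deduplicated` keeps rows 0 and 2.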

In [15]:
#Checking null values
data.isnull().sum()

,0
Date,0
Country,0
Local,0
Industry Sector,0
Accident Level,0
Potential Accident Level,0
Gender,0
Employee Type,0
Critical Risk,0
Description,0


In [20]:
# Ensure 'Date' is in datetime format
data['Date'] = pd.to_datetime(data['Date'])

# Extract time-based features as new columns
data['Year'] = data['Date'].apply(lambda x: x.year)
data['Month'] = data['Date'].apply(lambda x: x.month)
data['Day'] = data['Date'].apply(lambda x: x.day)
data['Weekday'] = data['Date'].apply(lambda x: x.day_name())
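
The same features can be extracted with the vectorized `.dt` accessor instead of `apply` with a lambda. This is an alternative sketch on a hypothetical `toy` frame, not the notebook's own code. One caveat: in recent pandas versions the `.dt` integer fields may come out as `int32` rather than the `int64` that `apply` produces.

```python
import pandas as pd

# Hypothetical toy frame with two dates.
toy = pd.DataFrame({"Date": pd.to_datetime(["2016-01-01", "2017-03-02"])})

# Vectorized extraction of the time-based features.
toy["Year"] = toy["Date"].dt.year
toy["Month"] = toy["Date"].dt.month
toy["Day"] = toy["Date"].dt.day
toy["Weekday"] = toy["Date"].dt.day_name()

print(toy)
```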

In [23]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 418 entries, 0 to 424
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Date                      418 non-null    datetime64[ns]
 1   Country                   418 non-null    object        
 2   Local                     418 non-null    object        
 3   Industry Sector           418 non-null    object        
 4   Accident Level            418 non-null    object        
 5   Potential Accident Level  418 non-null    object        
 6   Gender                    418 non-null    object        
 7   Employee Type             418 non-null    object        
 8   Critical Risk             418 non-null    object        
 9   Description               418 non-null    object        
 10  Year                      418 non-null    int64         
 11  Month                     418 non-null    int64         
 12  Day                       4

## Data Cleaning Summary

- **Removed unnecessary column:** Dropped the redundant `"Unnamed: 0"` column.
- **Renamed columns for clarity and consistency:** `Data` to `Date`, `Countries` to `Country`, `Genre` to `Gender`, and `Employee or Third Party` to `Employee Type`.
- **Duplicate data handling:** Removed 7 duplicate rows after review.
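- **Column List:** The original fields kept are `Date`, `Country`, `Local`, `Industry Sector`, `Accident Level`, `Potential Accident Level`, `Gender`, `Employee Type`, `Critical Risk`, and `Description`.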
- **Null Value Check:** No missing values found in the dataset.
- **Date Feature Extraction:** Extracted `Year`, `Month`, `Day`, and `Weekday` from the `Date` column.
- **Final DataFrame Overview (After Cleansing):** Final dataset contains **418 entries** and **14 columns**.
- **Data Types:** Dataset includes **1 datetime column**, **3 integer columns** (`Year`, `Month`, `Day`), and **10 object columns** (the 9 original object columns plus `Weekday`).
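
The points in this summary can be turned into automated checks. Below is a minimal sketch; `check_cleaned` is a hypothetical helper introduced for illustration and is not part of the notebook, and the two toy frames are illustrative only.

```python
import pandas as pd

def check_cleaned(frame: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if "Unnamed: 0" in frame.columns:
        problems.append("leftover 'Unnamed: 0' column")
    if frame.duplicated().any():
        problems.append("duplicate rows remain")
    if frame.isnull().any().any():
        problems.append("null values remain")
    missing = {"Year", "Month", "Day", "Weekday"} - set(frame.columns)
    if missing:
        problems.append(f"missing date features: {sorted(missing)}")
    return problems

# Hypothetical frame that satisfies every check.
clean_toy = pd.DataFrame({
    "Date": pd.to_datetime(["2016-01-01"]),
    "Year": [2016], "Month": [1], "Day": [1], "Weekday": ["Friday"],
})

# Hypothetical frame that still has the index column, duplicates, and no date features.
dirty_toy = pd.DataFrame({
    "Unnamed: 0": [0, 0],
    "Date": pd.to_datetime(["2016-01-01", "2016-01-01"]),
})

print(check_cleaned(clean_toy))
print(check_cleaned(dirty_toy))
```

Calling `check_cleaned(data)` at the end of the cleansing section would be one way to confirm the summary before moving on.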