## 🧭 Data Loading — USGS Global Earthquake Dataset (1900–Present)

This dataset contains detailed information on significant global earthquakes
(magnitude ≥ 5.0) recorded since 1900. Sourced weekly from the United States
Geological Survey (USGS), it includes comprehensive data such as timestamp,
location (latitude & longitude), magnitude, depth, and measurement network details.

Earthquakes originate from tectonic activity across both highly seismic regions, such
as the Pacific Ring of Fire, and less active zones like Europe and Africa. The dataset
serves as a valuable resource for analyzing long-term seismic patterns, understanding
geological behavior, and developing predictive models.

### Key Columns Explained:
- **time**: Event origin time as an ISO 8601 UTC string (e.g., `1900-10-09T12:25:00.000Z`); pandas loads it as `object` until it is parsed.
- **latitude / longitude**: Coordinates of the earthquake's epicenter (in decimal degrees).
- **depth**: Depth of the event in kilometers.
- **mag**: Reported earthquake magnitude.
- **magType**: Type of magnitude measurement (e.g., mb, ml, mw).
- **nst**: Number of stations used to compute the earthquake solution.
- **gap**: Largest azimuthal gap between stations, in degrees.
- **dmin**: Distance to the nearest recording station (in degrees).
- **rms**: Root-mean-square residual of seismic station readings.
- **net / id**: Network and unique event identifiers.
- **updated**: Last update timestamp for the event record.
- **place**: Human-readable location description.
- **type**: Event type (e.g., “earthquake”, “explosion”, “quarry blast”).
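- **horizontalError / depthError / magError**: Reported uncertainty of the horizontal location (km), the depth (km), and the magnitude.
- **magNst**: Number of stations used to compute the magnitude.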
- **status**: Event status within USGS catalog (e.g., “reviewed”, “automatic”).
- **locationSource / magSource**: Networks providing the location and magnitude data.

This dataset supports exploratory analysis, time-series pattern detection, and
predictive modeling of seismic events worldwide, which makes it applicable to
earthquake research and risk assessment projects.
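Time-series work needs `time` as a real datetime rather than a string. The following is a minimal sketch on a two-row sample frame (the `sample` frame and the derived `year` column are illustrative, not part of the dataset); it assumes every timestamp follows the same ISO 8601 pattern seen in the preview rows.

```python
import pandas as pd

# Hypothetical two-row sample that mimics the `time` strings in the dataset.
sample = pd.DataFrame({
    "time": ["1900-10-09T12:25:00.000Z", "1901-03-03T07:45:00.000Z"],
    "mag": [7.86, 6.4],
})

# Parse the ISO 8601 strings into timezone-aware UTC datetimes.
sample["time"] = pd.to_datetime(sample["time"], utc=True)

# Derive a year column (illustrative) for grouping or resampling.
sample["year"] = sample["time"].dt.year

print(sample.dtypes)
print(sample)
```

The same call could be applied to `earthquake_df["time"]` and `earthquake_df["updated"]`; if the full file mixes timestamp formats, the parsing options may need adjusting.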

Source Link: https://www.kaggle.com/datasets/usamabuttar/significant-earthquakes?resource=download


In [None]:
# Mount Google Drive so the notebook can read the CSV stored there
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:

# Load USGS Earthquake Data
import pandas as pd

# Replace this path or URL with your actual dataset location
data_path = "/content/drive/MyDrive/ImpactSense_Oct25/data/earthquakes_data.csv"
earthquake_df = pd.read_csv(data_path)

# Display basic information and preview
earthquake_df.info()
earthquake_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110133 entries, 0 to 110132
Data columns (total 23 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   Unnamed: 0       110133 non-null  int64  
 1   time             110133 non-null  object 
 2   latitude         110133 non-null  float64
 3   longitude        110133 non-null  float64
 4   depth            109848 non-null  float64
 5   mag              110133 non-null  float64
 6   magType          110133 non-null  object 
 7   nst              39492 non-null   float64
 8   gap              49783 non-null   float64
 9   dmin             29839 non-null   float64
 10  rms              81386 non-null   float64
 11  net              110133 non-null  object 
 12  id               110133 non-null  object 
 13  updated          110133 non-null  object 
 14  place            109242 non-null  object 
 15  type             110133 non-null  object 
 16  horizontalError  28459 non-null   floa

Unnamed: 0.1,Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,...,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
0,0,1900-10-09T12:25:00.000Z,57.09,-153.48,,7.86,mw,,,,...,2022-05-09T14:44:17.838Z,"16 km SW of Old Harbor, Alaska",earthquake,,,,,reviewed,ushis,pt
1,1,1901-03-03T07:45:00.000Z,36.0,-120.5,,6.4,ms,,,,...,2018-06-04T20:43:44.000Z,"12 km NNW of Parkfield, California",earthquake,,,,,reviewed,ushis,ell
2,2,1901-07-26T22:20:00.000Z,40.8,-115.7,,5.0,fa,,,,...,2018-06-04T20:43:44.000Z,"6 km SE of Elko, Nevada",earthquake,,,,,reviewed,ushis,sjg
3,3,1901-12-30T22:34:00.000Z,52.0,-160.0,,7.0,ms,,,,...,2018-06-04T20:43:44.000Z,south of Alaska,earthquake,,,,,reviewed,ushis,abe
4,4,1902-01-01T05:20:30.000Z,52.38,-167.45,,7.0,ms,,,,...,2018-06-04T20:43:44.000Z,"113 km ESE of Nikolski, Alaska",earthquake,,,,,reviewed,ushis,abe
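The `Unnamed: 0` column in the output above simply counts rows, which is what pandas produces when a CSV was saved with its index and then read back without one. A small sketch, using an in-memory CSV as a stand-in for the real file, shows how `index_col=0` could absorb that column. Whether the real file should be loaded this way is an assumption to verify, since it changes the column count from 23 to 22.

```python
import io
import pandas as pd

# Stand-in CSV with an unnamed first column, like a file saved with its index.
csv_text = (
    ",time,mag\n"
    "0,1900-10-09T12:25:00.000Z,7.86\n"
    "1,1901-03-03T07:45:00.000Z,6.4\n"
)

# Default read: the unnamed column becomes "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# Treat the first column as the index instead of a data column.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```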


## 🔍 Data Inspection — Exploring Earthquake Dataset Structure

In this stage, we perform an initial inspection of the earthquake data
to understand its overall structure, completeness, and summary statistics.

Typical inspection steps include:
1. Viewing the dataset shape and column names.
2. Identifying data types and non-null counts.
3. Checking for missing values.
4. Reviewing unique entries in categorical columns.
5. Exploring basic descriptive statistics to detect scale variations or anomalies.

In [None]:
# View the dataset shape (rows × columns)
print("Dataset shape:", earthquake_df.shape)

Dataset shape: (110133, 23)


In [None]:
# Display the first few records
display(earthquake_df.head())

Unnamed: 0.1,Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,...,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
0,0,1900-10-09T12:25:00.000Z,57.09,-153.48,,7.86,mw,,,,...,2022-05-09T14:44:17.838Z,"16 km SW of Old Harbor, Alaska",earthquake,,,,,reviewed,ushis,pt
1,1,1901-03-03T07:45:00.000Z,36.0,-120.5,,6.4,ms,,,,...,2018-06-04T20:43:44.000Z,"12 km NNW of Parkfield, California",earthquake,,,,,reviewed,ushis,ell
2,2,1901-07-26T22:20:00.000Z,40.8,-115.7,,5.0,fa,,,,...,2018-06-04T20:43:44.000Z,"6 km SE of Elko, Nevada",earthquake,,,,,reviewed,ushis,sjg
3,3,1901-12-30T22:34:00.000Z,52.0,-160.0,,7.0,ms,,,,...,2018-06-04T20:43:44.000Z,south of Alaska,earthquake,,,,,reviewed,ushis,abe
4,4,1902-01-01T05:20:30.000Z,52.38,-167.45,,7.0,ms,,,,...,2018-06-04T20:43:44.000Z,"113 km ESE of Nikolski, Alaska",earthquake,,,,,reviewed,ushis,abe


In [None]:
# Display column names and their data types
print("\n--- Column Details ---")
earthquake_df.info()


--- Column Details ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110133 entries, 0 to 110132
Data columns (total 23 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   Unnamed: 0       110133 non-null  int64  
 1   time             110133 non-null  object 
 2   latitude         110133 non-null  float64
 3   longitude        110133 non-null  float64
 4   depth            109848 non-null  float64
 5   mag              110133 non-null  float64
 6   magType          110133 non-null  object 
 7   nst              39492 non-null   float64
 8   gap              49783 non-null   float64
 9   dmin             29839 non-null   float64
 10  rms              81386 non-null   float64
 11  net              110133 non-null  object 
 12  id               110133 non-null  object 
 13  updated          110133 non-null  object 
 14  place            109242 non-null  object 
 15  type             110133 non-null  object 
 16  horizontalErro

In [None]:
# Check for missing values across all columns
print("\n--- Missing Values Summary ---")
missing_summary = earthquake_df.isnull().sum().sort_values(ascending=False)
display(missing_summary[missing_summary > 0])


--- Missing Values Summary ---


Unnamed: 0,0
horizontalError,81674
dmin,80294
nst,70641
magError,67061
gap,60350
magNst,60025
depthError,49715
rms,28747
place,891
depth,285
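Raw counts are easier to judge next to the share of rows they represent. The sketch below builds a count-and-percentage table on a small made-up frame (`sample` and its values are illustrative); the same expression should work on `earthquake_df`.

```python
import numpy as np
import pandas as pd

# Illustrative frame with gaps in two of three columns.
sample = pd.DataFrame({
    "mag": [7.86, 6.4, 5.0, 7.0],
    "depth": [np.nan, 10.0, 33.0, 35.0],
    "nst": [np.nan, np.nan, np.nan, 12.0],
})

# Missing count and percentage per column.
missing = pd.DataFrame({
    "missing": sample.isna().sum(),
    "percent": (sample.isna().mean() * 100).round(1),
})

# Keep only columns that have gaps, worst first.
missing = missing[missing["missing"] > 0].sort_values("percent", ascending=False)
print(missing)
```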


In [None]:
# View unique event types
print("\n--- Unique Event Types ---")
print(earthquake_df["type"].unique())


--- Unique Event Types ---
['earthquake' 'nuclear explosion' 'explosion' 'rock burst' 'mine collapse'
 'volcanic eruption' 'landslide']
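Because the catalog mixes earthquakes with explosions, landslides, and other events, it helps to see how many rows each type contributes before filtering. A minimal sketch on an illustrative frame (the `sample` rows are made up):

```python
import pandas as pd

# Illustrative rows covering a few of the event types listed above.
sample = pd.DataFrame({
    "type": ["earthquake", "nuclear explosion", "earthquake", "landslide"],
    "mag": [7.86, 5.2, 6.4, 5.0],
})

# Count rows per event type.
type_counts = sample["type"].value_counts()
print(type_counts)

# Keep only genuine earthquakes for seismic analysis.
earthquakes_only = sample[sample["type"] == "earthquake"].reset_index(drop=True)
print(len(earthquakes_only), "of", len(sample), "rows are earthquakes")
```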


In [None]:
# View unique magnitude types
print("\n--- Unique Magnitude Types ---")
print(earthquake_df["magType"].unique())


--- Unique Magnitude Types ---
['mw' 'ms' 'fa' 'ml' 'mint' 'mb' 'lg' 'uk' 'mh' 'mwc' 'md' 'mb_lg' 'mc'
 'ma' 'mwb' 'mww' 'mwr' 'mlg' 'm' 'Md' 'Ml' 'ms_20' 'mwp' 'Mi' 'Mb'
 'ml(texnet)']
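Several labels in this list differ only by letter case (`md`/`Md`, `ml`/`Ml`, `mb`/`Mb`). Assuming those case variants denote the same magnitude scale, which should be confirmed against the USGS documentation, a lowercased copy of the column could be used for grouping. The `magType_norm` column name is illustrative.

```python
import pandas as pd

# Illustrative labels, including case variants seen in the output above.
sample = pd.DataFrame({"magType": ["mw", "Md", "md", "Ml", "ml", "Mb"]})

# Normalize whitespace and case in a new column, keeping the original intact.
sample["magType_norm"] = sample["magType"].str.strip().str.lower()

print(sample["magType"].nunique(), "->", sample["magType_norm"].nunique())
```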


In [None]:
# Get statistical overview for numerical columns
print("\n--- Descriptive Statistics ---")
display(earthquake_df.describe())


--- Descriptive Statistics ---


Unnamed: 0.1,Unnamed: 0,latitude,longitude,depth,mag,nst,gap,dmin,rms,horizontalError,depthError,magError,magNst
count,110133.0,110133.0,110133.0,109848.0,110133.0,39492.0,49783.0,29839.0,81386.0,28459.0,60418.0,43072.0,50108.0
mean,55066.0,3.692374,41.402767,61.340488,5.44138,141.146283,65.583967,4.199523,0.947658,7.86424,7.282601,0.148807,64.465774
std,31792.802936,30.432835,121.960837,107.639693,0.479188,117.023302,39.212698,5.191667,0.365004,3.912147,10.139548,0.145688,101.396169
min,0.0,-77.08,-179.997,-4.0,5.0,0.0,6.5,0.0,-1.0,0.0,-1.0,0.0,0.0
25%,27533.0,-17.855,-72.002,10.0,5.1,63.0,37.0,1.296,0.8,6.2,1.9,0.055,13.0
50%,55066.0,-0.696,100.391,33.0,5.3,103.0,57.0,2.514,0.95,7.75,4.6,0.081,30.0
75%,82599.0,29.8835,143.175,51.0,5.63,179.0,84.4,4.915,1.1,9.375,8.2,0.2,68.0
max,110132.0,87.386,180.0,700.0,9.5,929.0,360.0,50.901,69.32,99.0,1091.9,1.84,1144.0
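The minimum of -1.0 for both `rms` and `depthError` stands out, since neither a root-mean-square residual nor an uncertainty can be negative. One possible treatment, sketched here on made-up values, is to count the negative entries and convert them to missing values. Treating -1.0 as a placeholder is an assumption, not something the dataset documents.

```python
import pandas as pd

# Illustrative values, including the suspicious -1.0 entries.
sample = pd.DataFrame({
    "rms": [0.95, -1.0, 1.1],
    "depthError": [4.6, 1.9, -1.0],
})

# Columns that should never be negative (assumption for this sketch).
nonneg_cols = ["rms", "depthError"]

# Count negative entries per column before changing anything.
negative_counts = (sample[nonneg_cols] < 0).sum()
print(negative_counts)

# Replace negative entries with NaN so they are treated as missing.
sample[nonneg_cols] = sample[nonneg_cols].mask(sample[nonneg_cols] < 0)
print(sample)
```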


In [None]:
# Checking for duplicate records using unique ID
duplicate_count = earthquake_df.duplicated(subset=['id']).sum()
print(f"\nDuplicate Records Based on 'id': {duplicate_count}")


Duplicate Records Based on 'id': 9143
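A non-zero duplicate count means some events would be counted more than once. One way to resolve this, sketched on a three-row sample, is to keep the most recently updated row per `id`. Preferring the latest `updated` value is an assumption of this sketch; the duplicates should be inspected first to confirm they are repeated versions of the same event.

```python
import pandas as pd

# Illustrative rows: event "a1" appears twice with different update times.
sample = pd.DataFrame({
    "id": ["a1", "a1", "b2"],
    "updated": [
        "2018-06-04T20:43:44.000Z",
        "2022-05-09T14:44:17.838Z",
        "2018-06-04T20:43:44.000Z",
    ],
    "mag": [6.9, 7.0, 5.0],
})
sample["updated"] = pd.to_datetime(sample["updated"], utc=True)

# Sort so the newest record per id comes last, then keep that one.
deduped = (
    sample.sort_values("updated")
    .drop_duplicates(subset="id", keep="last")
    .sort_index()
)
print(deduped)
```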


### Observations

- **Dataset and structure**: 110,133 rows and 23 columns. `Unnamed: 0` runs from 0 to 110,132 and mirrors the row index, so it looks like an index saved during CSV export. `time` and `updated` load as `object` (ISO 8601 strings), not datetimes.
- **Missing values**: The station and uncertainty columns are sparse: `horizontalError` (81,674 missing, about 74%), `dmin` (80,294, 73%), `nst` (70,641, 64%), `magError` (67,061, 61%), `gap` (60,350, 55%), `magNst` (60,025, 55%), `depthError` (49,715, 45%), and `rms` (28,747, 26%). `place` (891) and `depth` (285) are nearly complete, while `time`, `latitude`, `longitude`, and `mag` have no nulls. The five preview rows (1900 to 1902) lack most of these fields, which suggests the gaps may be concentrated in older events; this should be checked against `time` before choosing an imputation strategy.
- **Event types**: `type` holds seven values. Besides `earthquake`, the catalog includes `nuclear explosion`, `explosion`, `rock burst`, `mine collapse`, `volcanic eruption`, and `landslide`, so earthquake-only analyses need to filter on this column.
- **Magnitude types**: `magType` holds 26 labels, some differing only by case (`md`/`Md`, `ml`/`Ml`, `mb`/`Mb`). Labels should be normalized before grouping.
- **Summary statistics**: `mag` ranges from 5.0 to 9.5 with a median of 5.3 and a mean of 5.44, so most events sit just above the 5.0 cutoff. `depth` has a median of 33 km, a maximum of 700 km, and a minimum of -4.0 km.
- **Suspicious values**: `rms` and `depthError` both have a minimum of -1.0, which cannot be a genuine residual or uncertainty and may be a placeholder. Their maxima (69.32 and 1,091.9) lie far above their 75th percentiles (1.1 and 8.2), and `gap` reaches 360 degrees. These rows deserve review before modeling.
- **Duplicates**: 9,143 rows (about 8.3%) repeat an `id` that already appears earlier in the file, so the data should be deduplicated on `id` before counting events.
