data-description

**time** : Time when the event occurred. Times are reported in milliseconds since the epoch (1970-01-01T00:00:00.000Z) and do not include leap seconds. In certain output formats, the date is formatted for readability; in the CSV used here it appears as an ISO 8601 string such as 2020-04-22T18:47:00.201Z.
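
As a rough illustration of the two representations, the sketch below converts between epoch milliseconds and the ISO 8601 form. The helper names `epoch_ms_to_iso` and `iso_to_epoch_ms` are made up for this example and are not part of any USGS tooling.

```python
from datetime import datetime, timedelta, timezone

# The Unix epoch as a timezone-aware UTC datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_ms_to_iso(ms: int) -> str:
    """Convert milliseconds since the epoch to an ISO 8601 UTC string."""
    moment = EPOCH + timedelta(milliseconds=ms)
    # isoformat() ends in "+00:00" for UTC; swap it for the "Z" suffix.
    return moment.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def iso_to_epoch_ms(text: str) -> int:
    """Convert an ISO 8601 UTC string (with milliseconds) to epoch milliseconds."""
    moment = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")
    moment = moment.replace(tzinfo=timezone.utc)
    # Integer division of two timedeltas gives a whole number of milliseconds.
    return (moment - EPOCH) // timedelta(milliseconds=1)


print(epoch_ms_to_iso(0))
print(iso_to_epoch_ms("2020-04-22T18:47:00.201Z"))
```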

**latitude** : Decimal degrees latitude. Negative values for southern latitudes. Together with longitude, this gives the coordinates of the epicenter.

**longitude** : Decimal degrees longitude. Negative values for western longitudes.

**depth** : Depth of the event in kilometers.

**mag** : The magnitude for the event.

**magType** : The method or algorithm used to calculate the preferred magnitude for the event. Typical values are:
"Md", "Ml", "Ms", "Mw", "Me", "Mi", "Mb", "MLg"
https://www.usgs.gov/natural-hazards/earthquake-hazards/science/magnitude-types

**nst** : The total number of seismic stations used to determine the earthquake location.

**gap** : The largest azimuthal gap between azimuthally adjacent stations (in degrees). In general, the smaller this number, the more reliable is the calculated horizontal position of the earthquake. Earthquake locations in which the azimuthal gap exceeds 180 degrees typically have large location and depth uncertainties.
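
To make the definition concrete, here is a minimal sketch of how the largest azimuthal gap could be computed from a list of station azimuths (degrees clockwise from north, as seen from the epicenter). The function name and the sample azimuths are assumptions made for illustration; the dataset itself only contains the resulting `gap` value.

```python
def largest_azimuthal_gap(azimuths_deg):
    """Return the largest angle (degrees) between azimuthally adjacent stations."""
    if not azimuths_deg:
        raise ValueError("need at least one station azimuth")
    # Normalise to [0, 360) and sort so that neighbours are adjacent in azimuth.
    az = sorted(a % 360.0 for a in azimuths_deg)
    # Gaps between consecutive stations ...
    gaps = [b - a for a, b in zip(az, az[1:])]
    # ... plus the wrap-around gap from the last station back to the first.
    gaps.append(360.0 - az[-1] + az[0])
    return max(gaps)


# Stations spread evenly around the epicenter: small gap.
print(largest_azimuthal_gap([0, 90, 180, 270]))
# Stations all on one side: the gap exceeds 180 degrees.
print(largest_azimuthal_gap([10, 20, 30]))
```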

**dmin** : Horizontal distance from the epicenter to the nearest station (in degrees). 1 degree is approximately 111.2 kilometers. In general, the smaller this number, the more reliable is the calculated depth of the earthquake.
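
For a feel of the scale, the conversion quoted above can be applied directly. This is only a small sketch using the approximate factor of 111.2 km per degree; the helper name `degrees_to_km` is hypothetical.

```python
# Approximate conversion factor quoted in the field description.
KM_PER_DEGREE = 111.2


def degrees_to_km(degrees: float) -> float:
    """Convert an angular distance in degrees to approximate kilometers."""
    return degrees * KM_PER_DEGREE


# dmin of the first event in this dataset's preview (2.919 degrees).
print(round(degrees_to_km(2.919), 1))  # 324.6
```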

**rms** : The root-mean-square (RMS) travel time residual, in sec, using all weights. This parameter provides a measure of the fit of the observed arrival times to the predicted arrival times for this location. Smaller numbers reflect a better fit of the data. The value is dependent on the accuracy of the velocity model used to compute the earthquake location, the quality weights assigned to the arrival time data, and the procedure used to locate the earthquake.
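
The sketch below shows one common way a weighted RMS residual can be computed. The exact weighting scheme depends on the location program used by each network, so the formula here (squared residuals weighted and normalised by the sum of weights) is an assumption for illustration, as are the function name and sample numbers.

```python
import math


def weighted_rms(residuals_s, weights=None):
    """Weighted root-mean-square of travel-time residuals (seconds)."""
    if weights is None:
        # Without weights this reduces to the ordinary RMS.
        weights = [1.0] * len(residuals_s)
    if len(weights) != len(residuals_s):
        raise ValueError("residuals and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Sum of weighted squared residuals (observed minus predicted arrival time).
    weighted_sq = sum(w * r * r for r, w in zip(residuals_s, weights))
    return math.sqrt(weighted_sq / total_weight)


# Hypothetical residuals; a smaller result indicates a better fit.
print(weighted_rms([0.4, -0.6, 0.2]))
print(weighted_rms([0.4, -0.6, 0.2], weights=[1.0, 0.5, 1.0]))
```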

**net** : The ID of a data contributor. Identifies the network considered to be the preferred source of information for this event.

**id** : A unique identifier for the event. This is the current preferred id for the event, and may change over time.

**updated** : Time when the event was most recently updated. Times are reported in milliseconds since the epoch. In certain output formats, the date is formatted for readability.

**place** : Textual description of a named geographic region near the event. This may be a city name or a broader region name.

**type** : Type of seismic event.

**horizontalError** : Uncertainty of the reported location of the event in kilometers.

**magError** : Uncertainty of the reported magnitude of the event. The estimated standard error of the magnitude. The uncertainty corresponds to the specific magnitude type being reported and does not take into account magnitude variations and biases between different magnitude scales. We report an "unknown" value if the contributing seismic network does not supply uncertainty estimates.

**magNst** : The total number of seismic stations used to calculate the magnitude for this earthquake.

**status** : Indicates whether the event has been reviewed by a human. Status is either automatic or reviewed. Automatic events are directly posted by automatic processing systems and have not been verified or altered by a human. Reviewed events have been looked at by a human. The level of review can range from a quick validity check to a careful reanalysis of the event.

**locationSource** : The network that originally authored the reported location of this event. Typical values: ak, at, ci, hv, ld, mb, nc, nm, nn, pr, pt, se, us, uu, uw

**magSource** : The network that originally authored the reported magnitude for this event. Typical values: ak, at, ci, hv, ld, mb, nc, nm, nn, pr, pt, se, us, uu, uw

In [432]:
import pandas as pd
import numpy as np
import re
import requests

In [433]:
df = pd.read_csv("india.csv")
df1 = df.copy()  # independent copy of the raw data; plain "df1 = df" would only alias df

In [434]:
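# use the full option names; the short forms can be ambiguous in newer pandas versions
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)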

In [435]:
df.head()

Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,net,id,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
0,2020-04-22T18:47:00.201Z,17.0062,96.2746,10.0,3.1,ml,,275,2.919,0.62,us,us7000921c,2020-07-11T22:19:22.040Z,"25 km NNE of Yangon, Myanmar",earthquake,7.0,2.0,0.148,6,reviewed,us,us
1,2021-09-09T13:27:04.336Z,31.9732,72.2567,10.0,3.4,mb,,285,1.877,0.75,us,us7000f9u0,2021-09-09T20:57:54.470Z,"7 km W of Sahiwal, Pakistan",earthquake,10.6,2.0,0.49,1,reviewed,us,us
2,2020-06-23T13:47:32.941Z,23.9312,93.0128,10.0,3.6,mb,,166,1.213,0.48,us,us6000ahfx,2020-08-30T00:01:10.040Z,"12 km SE of Darlawn, India",earthquake,7.3,2.0,0.5,1,reviewed,us,us
3,2021-08-10T08:12:46.310Z,30.2767,78.1005,10.0,3.6,mb,,291,8.756,0.93,us,us6000f3yt,2021-08-12T05:49:20.385Z,"3 km SSE of R?ipur, India",earthquake,11.4,2.0,0.287,3,reviewed,us,us
4,2020-05-10T08:15:27.178Z,28.7268,77.3516,10.0,3.7,mb,,203,9.149,0.52,us,us70009ebf,2020-07-29T22:14:27.040Z,"6 km ESE of Loni, India",earthquake,8.9,2.0,0.253,4,reviewed,us,us


In [436]:
df.dtypes

time                object
latitude           float64
longitude          float64
depth              float64
mag                float64
magType             object
nst                float64
gap                  int64
dmin               float64
rms                float64
net                 object
id                  object
updated             object
place               object
type                object
horizontalError    float64
depthError         float64
magError           float64
magNst               int64
status              object
locationSource      object
magSource           object
dtype: object

In [437]:
df.drop(columns = ["nst"], inplace = True)

In [438]:
df.head()

Unnamed: 0,time,latitude,longitude,depth,mag,magType,gap,dmin,rms,net,id,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
0,2020-04-22T18:47:00.201Z,17.0062,96.2746,10.0,3.1,ml,275,2.919,0.62,us,us7000921c,2020-07-11T22:19:22.040Z,"25 km NNE of Yangon, Myanmar",earthquake,7.0,2.0,0.148,6,reviewed,us,us
1,2021-09-09T13:27:04.336Z,31.9732,72.2567,10.0,3.4,mb,285,1.877,0.75,us,us7000f9u0,2021-09-09T20:57:54.470Z,"7 km W of Sahiwal, Pakistan",earthquake,10.6,2.0,0.49,1,reviewed,us,us
2,2020-06-23T13:47:32.941Z,23.9312,93.0128,10.0,3.6,mb,166,1.213,0.48,us,us6000ahfx,2020-08-30T00:01:10.040Z,"12 km SE of Darlawn, India",earthquake,7.3,2.0,0.5,1,reviewed,us,us
3,2021-08-10T08:12:46.310Z,30.2767,78.1005,10.0,3.6,mb,291,8.756,0.93,us,us6000f3yt,2021-08-12T05:49:20.385Z,"3 km SSE of R?ipur, India",earthquake,11.4,2.0,0.287,3,reviewed,us,us
4,2020-05-10T08:15:27.178Z,28.7268,77.3516,10.0,3.7,mb,203,9.149,0.52,us,us70009ebf,2020-07-29T22:14:27.040Z,"6 km ESE of Loni, India",earthquake,8.9,2.0,0.253,4,reviewed,us,us


In [439]:
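# clean the time column: strip the fractional seconds and the trailing "Z"
# (e.g. ".201Z"), since a specific time format is needed to fetch the
# weather data through an API call

pattern = r'\.\d\d\dZ'


# regex=True is required in pandas >= 2.0, where the pattern is literal by default
df["time"] = df["time"].str.replace(pattern, "", regex=True)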
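
The pattern can be checked on a single value with the standard `re` module before applying it to the whole column. This is only a quick sanity check on one timestamp taken from the preview above. Note that in pandas 2.0 and later, `Series.str.replace` treats the pattern as a literal string unless `regex=True` is passed.

```python
import re

# Same pattern as in the cell above: a dot, three digits, then "Z".
pattern = r'\.\d\d\dZ'

raw = "2020-04-22T18:47:00.201Z"
cleaned = re.sub(pattern, "", raw)

print(cleaned)  # 2020-04-22T18:47:00
```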



In [440]:
df.head()

Unnamed: 0,time,latitude,longitude,depth,mag,magType,gap,dmin,rms,net,id,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
0,2020-04-22T18:47:00,17.0062,96.2746,10.0,3.1,ml,275,2.919,0.62,us,us7000921c,2020-07-11T22:19:22.040Z,"25 km NNE of Yangon, Myanmar",earthquake,7.0,2.0,0.148,6,reviewed,us,us
1,2021-09-09T13:27:04,31.9732,72.2567,10.0,3.4,mb,285,1.877,0.75,us,us7000f9u0,2021-09-09T20:57:54.470Z,"7 km W of Sahiwal, Pakistan",earthquake,10.6,2.0,0.49,1,reviewed,us,us
2,2020-06-23T13:47:32,23.9312,93.0128,10.0,3.6,mb,166,1.213,0.48,us,us6000ahfx,2020-08-30T00:01:10.040Z,"12 km SE of Darlawn, India",earthquake,7.3,2.0,0.5,1,reviewed,us,us
3,2021-08-10T08:12:46,30.2767,78.1005,10.0,3.6,mb,291,8.756,0.93,us,us6000f3yt,2021-08-12T05:49:20.385Z,"3 km SSE of R?ipur, India",earthquake,11.4,2.0,0.287,3,reviewed,us,us
4,2020-05-10T08:15:27,28.7268,77.3516,10.0,3.7,mb,203,9.149,0.52,us,us70009ebf,2020-07-29T22:14:27.040Z,"6 km ESE of Loni, India",earthquake,8.9,2.0,0.253,4,reviewed,us,us


In [441]:
# drop columns that do not contribute to the analysis for now

df.drop(columns = ["gap", "dmin", "rms", "net", "updated", "horizontalError", "depthError", "magError"], inplace = True)

In [442]:
df.head()

Unnamed: 0,time,latitude,longitude,depth,mag,magType,id,place,type,magNst,status,locationSource,magSource
0,2020-04-22T18:47:00,17.0062,96.2746,10.0,3.1,ml,us7000921c,"25 km NNE of Yangon, Myanmar",earthquake,6,reviewed,us,us
1,2021-09-09T13:27:04,31.9732,72.2567,10.0,3.4,mb,us7000f9u0,"7 km W of Sahiwal, Pakistan",earthquake,1,reviewed,us,us
2,2020-06-23T13:47:32,23.9312,93.0128,10.0,3.6,mb,us6000ahfx,"12 km SE of Darlawn, India",earthquake,1,reviewed,us,us
3,2021-08-10T08:12:46,30.2767,78.1005,10.0,3.6,mb,us6000f3yt,"3 km SSE of R?ipur, India",earthquake,3,reviewed,us,us
4,2020-05-10T08:15:27,28.7268,77.3516,10.0,3.7,mb,us70009ebf,"6 km ESE of Loni, India",earthquake,4,reviewed,us,us


In [443]:
# some place names contain "?" where a character was lost (e.g. "R?ipur");
# replace each "?" with "a" (regex=False keeps "?" a literal character)

df["place"] = df["place"].str.replace("?", "a", regex=False)


In [444]:
df["place"] = df["place"].str.replace("-", " ", regex=False)

In [445]:
places_lst = df["place"].str.split(",")

In [446]:
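loc_lst = []
country_lst = []
# loop over every row instead of a hard-coded row count (614)
for parts in places_lst:
    # the text before the first comma is always the location
    loc_lst.append(parts[0])

    if len(parts) == 2:
        country_lst.append(parts[1])
    else:
        # no single "location, country" pair could be split out
        country_lst.append("None")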
        

In [447]:
df["country"] = country_lst
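
The same split could probably be done without an explicit loop. The sketch below is one possible vectorised version on a few sample strings taken from this notebook; the variable names are illustrative. Unlike the loop, it splits on the last comma only and strips surrounding whitespace, so results can differ for place names containing more than one comma.

```python
import pandas as pd

# Stand-in for the notebook's "place" column (sample values from this dataset).
places = pd.Series([
    "25 km NNE of Yangon, Myanmar",
    "12 km SE of Darlawn, India",
    "Haryana Delhi Uttar Pradesh region",
])

# Split on the last comma only; rows without a comma get a missing second part.
parts = places.str.rsplit(",", n=1, expand=True)

loc = parts[0].str.strip()
# Fill the missing country with the same "None" placeholder the loop uses.
country = parts[1].str.strip().fillna("None")

print(loc.tolist())
print(country.tolist())
```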

In [448]:
df.head()

Unnamed: 0,time,latitude,longitude,depth,mag,magType,id,place,type,magNst,status,locationSource,magSource,country
0,2020-04-22T18:47:00,17.0062,96.2746,10.0,3.1,ml,us7000921c,"25 km NNE of Yangon, Myanmar",earthquake,6,reviewed,us,us,Myanmar
1,2021-09-09T13:27:04,31.9732,72.2567,10.0,3.4,mb,us7000f9u0,"7 km W of Sahiwal, Pakistan",earthquake,1,reviewed,us,us,Pakistan
2,2020-06-23T13:47:32,23.9312,93.0128,10.0,3.6,mb,us6000ahfx,"12 km SE of Darlawn, India",earthquake,1,reviewed,us,us,India
3,2021-08-10T08:12:46,30.2767,78.1005,10.0,3.6,mb,us6000f3yt,"3 km SSE of Raipur, India",earthquake,3,reviewed,us,us,India
4,2020-05-10T08:15:27,28.7268,77.3516,10.0,3.7,mb,us70009ebf,"6 km ESE of Loni, India",earthquake,4,reviewed,us,us,India


In [449]:
df["place"] = loc_lst

In [450]:
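# strip the leading distance and direction prefix, e.g. "25 km NNE of "

pattern = r'\d{1,4}\s[a-z]*\s[A-Z]*\sof\s'

# regex=True is required in pandas >= 2.0, where the pattern is literal by default
df["place"] = df["place"].str.replace(pattern, "", regex=True)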
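
To see what the prefix pattern matches, it can be tried on individual strings with the standard `re` module. This is a small check on sample values from this dataset, not a replacement for the column-wide call.

```python
import re

# Same pattern as above: digits, a lowercase unit, an uppercase direction, "of".
pattern = r'\d{1,4}\s[a-z]*\s[A-Z]*\sof\s'

samples = [
    "25 km NNE of Yangon",
    "7 km W of Sahiwal",
    "Haryana Delhi Uttar Pradesh region",  # no prefix, so left unchanged
]

stripped = [re.sub(pattern, "", s) for s in samples]
print(stripped)
```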

In [451]:
df["place"] = df["place"].str.replace("Haryana Delhi","")

In [452]:
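# remove the word "region" and trim the whitespace left behind by the removals
df["place"] = df["place"].str.replace("region", "", regex=False).str.strip()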

In [453]:
df.head()

Unnamed: 0,time,latitude,longitude,depth,mag,magType,id,place,type,magNst,status,locationSource,magSource,country
0,2020-04-22T18:47:00,17.0062,96.2746,10.0,3.1,ml,us7000921c,Yangon,earthquake,6,reviewed,us,us,Myanmar
1,2021-09-09T13:27:04,31.9732,72.2567,10.0,3.4,mb,us7000f9u0,Sahiwal,earthquake,1,reviewed,us,us,Pakistan
2,2020-06-23T13:47:32,23.9312,93.0128,10.0,3.6,mb,us6000ahfx,Darlawn,earthquake,1,reviewed,us,us,India
3,2021-08-10T08:12:46,30.2767,78.1005,10.0,3.6,mb,us6000f3yt,Raipur,earthquake,3,reviewed,us,us,India
4,2020-05-10T08:15:27,28.7268,77.3516,10.0,3.7,mb,us70009ebf,Loni,earthquake,4,reviewed,us,us,India


In [454]:
# keep only the date part (YYYY-MM-DD) of each timestamp

date_lst = df["time"].str.split("T")
lst_t = []
# loop over every row instead of a hard-coded row count
for parts in date_lst:
    lst_t.append(parts[0])
    

In [455]:
df["time"] = lst_t

In [456]:
df.head()

Unnamed: 0,time,latitude,longitude,depth,mag,magType,id,place,type,magNst,status,locationSource,magSource,country
0,2020-04-22,17.0062,96.2746,10.0,3.1,ml,us7000921c,Yangon,earthquake,6,reviewed,us,us,Myanmar
1,2021-09-09,31.9732,72.2567,10.0,3.4,mb,us7000f9u0,Sahiwal,earthquake,1,reviewed,us,us,Pakistan
2,2020-06-23,23.9312,93.0128,10.0,3.6,mb,us6000ahfx,Darlawn,earthquake,1,reviewed,us,us,India
3,2021-08-10,30.2767,78.1005,10.0,3.6,mb,us6000f3yt,Raipur,earthquake,3,reviewed,us,us,India
4,2020-05-10,28.7268,77.3516,10.0,3.7,mb,us70009ebf,Loni,earthquake,4,reviewed,us,us,India


In [457]:
# make the column names consistent (snake_case)

df.rename(columns = {'magType':'mag_type', 'magNst':'mag_nst', 'magSource':'mag_source','locationSource':'loc_source'}, inplace = True)

In [458]:
df.dtypes

time           object
latitude      float64
longitude     float64
depth         float64
mag           float64
mag_type       object
id             object
place          object
type           object
mag_nst         int64
status         object
loc_source     object
mag_source     object
country        object
dtype: object

In [459]:
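# rows where no country could be parsed from "place" are labelled "Tibet"
# (an assumption about this particular dataset)
df["country"] = df["country"].replace("None","Tibet")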

In [460]:
df["country"] = df["country"].str.replace("region","")

In [461]:
# trim stray whitespace around country names (e.g. "India " left after removing "region")
df["country"] = df["country"].str.strip()

In [462]:
time_lst = df["time"].str.split("-")
year = []
month = []

# loop over every row instead of a hard-coded row count
for parts in time_lst:
    year.append(parts[0])
    month.append(parts[1])
    

In [463]:
df["year"] = year
df["month"] = month
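
A possible alternative, assuming the `time` column holds `YYYY-MM-DD` strings as in the preview above, is to parse the dates with pandas and format the parts. The sketch keeps year and month as zero-padded strings to match what the loop produces; the variable names are illustrative.

```python
import pandas as pd

# Stand-in for the notebook's "time" column after the date extraction step.
dates = pd.Series(["2020-04-22", "2021-09-09"])

# Parse once, then format the pieces; invalid dates would raise an error here.
parsed = pd.to_datetime(dates, format="%Y-%m-%d")
year = parsed.dt.strftime("%Y")
month = parsed.dt.strftime("%m")

print(year.tolist())
print(month.tolist())
```

Using `parsed.dt.year` and `parsed.dt.month` instead would give integers, which may be more convenient for grouping or sorting.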

In [465]:
df.drop(columns = ["time"], inplace = True)

In [466]:
df.to_csv("cleaned_earthquake_data.csv")