# Scanning Earthquake Dataset

The aim of this Jupyter notebook is to give an overview of the earthquake data set and to identify the main cleaning steps needed to obtain the clean data set that will later be enriched with GDP data from FRED.

In [1]:
import pandas as pd

In [3]:
df = pd.read_csv("../data/consolidated_data.csv") 
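
A column named "Unnamed: 0" is what pandas usually produces when a DataFrame index was written to CSV without a label. The following is a minimal sketch, assuming that is the case for this file, of reading that column back as the index instead of as data. The inline CSV text is a made-up stand-in for `../data/consolidated_data.csv` with only a few of its columns.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for ../data/consolidated_data.csv; the leading
# empty header cell is what pandas would otherwise name "Unnamed: 0".
csv_text = (
    ",time,latitude,longitude,mag,magType\n"
    "0,2003-01-23T19:45:44.231Z,63.4739,-151.2431,1.7,ml\n"
    "1,1988-02-26T08:37:46.880Z,46.7405,-119.353333,1.0,md\n"
)

# index_col=0 uses the first CSV column as the row index.
demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(demo.columns.tolist())
```

Whether to keep the key as a regular column or as the index is a design choice; the rest of this notebook keeps it as a column.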

The dataset has 23 columns and 3,272,774 rows, with 0% duplicate rows. The columns are:

* **"Unnamed: 0":** This column is just a key that distinguishes every row of the data set.
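* **time:** This is the date when the magnitude was measured. Times are reported in milliseconds since the epoch, although in this file they appear as ISO 8601 strings (ex.: '2003-01-23T19:45:44.231Z').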
* **latitude:** The latitude coordinate where the magnitude was measured in decimal degrees.
* **longitude:** The longitude coordinate where the magnitude was measured in decimal degrees.
* **depth:** Depth of the event in kilometers.
* **mag:** The magnitude for the event. Its scale depends on the 'magType' column.
* **magType:** The method or algorithm used to calculate the preferred magnitude for the event (“Md”, “Ml”, “Ms”, “Mw”, “Me”, “Mi”, “Mb”, “MLg”)
* **nst:** The total number of seismic stations used to determine earthquake location.
* **gap:** The largest azimuthal gap between azimuthally adjacent stations (in degrees). In general, the smaller this number, the more reliable is the calculated horizontal position of the earthquake. Earthquake locations in which the azimuthal gap exceeds 180 degrees typically have large location and depth uncertainties.
* **dmin:** Horizontal distance from the epicenter to the nearest station (in degrees). 1 degree is approximately 111.2 kilometers. In general, the smaller this number, the more reliable is the calculated depth of the earthquake.
* **rms:** The root-mean-square (RMS) travel time residual, in sec, using all weights. This parameter provides a measure of the fit of the observed arrival times to the predicted arrival times for this location. Smaller numbers reflect a better fit of the data. The value is dependent on the accuracy of the velocity model used to compute the earthquake location, the quality weights assigned to the arrival time data, and the procedure used to locate the earthquake.
* **net:** The ID of a data contributor. Identifies the network considered to be the preferred source of information for this event. (ak, at, ci, hv, ld, mb, nc, nm, nn, pr, pt, se, us, uu, uw)
* **id:** A unique identifier for the event. This is the current preferred id for the event, and may change over time.
* **updated:** Time when the event was most recently updated. Times are reported in milliseconds since the epoch. In certain output formats, the date is formatted for readability.
* **place:** Textual description of named geographic region near to the event. (ex.: '29km NE of Independence, CA', '11km SSW of Lake Nacimiento, CA','4km S of La Canada Flintridge, CA', '9km S of Cabazon, CA'...)
* **type:** Type of seismic event (ex.: 'sonic boom', 'earthquake', 'quarry blast', 'explosion','nuclear explosion', 'mine collapse', 'other event','chemical explosion', 'rock burst', 'ice quake'...)
* **horizontalError:** Uncertainty of reported location of the event in kilometers.
* **depthError:** Uncertainty of reported depth of the event in kilometers.
* **magError:** Uncertainty of reported magnitude of the event. The estimated standard error of the magnitude. The uncertainty corresponds to the specific magnitude type being reported and does not take into account magnitude variations and biases between different magnitude scales. We report an "unknown" value if the contributing seismic network does not supply uncertainty estimates.
* **magNst:** The total number of seismic stations used to calculate the magnitude for this earthquake.
* **status:** Indicates whether the event has been reviewed by a human (ex.: 'reviewed', 'automatic', 'manual')
* **locationSource:** The network that originally authored the reported location of this event.
* **magSource:** Network that originally authored the reported magnitude for this event. (ex.: ak, at, ci, hv, ld, mb, nc, nm, nn, pr, pt, se, us, uu, uw...)
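
Two of the descriptions above translate directly into preprocessing steps: the `time` and `updated` strings can be parsed into timestamps, and `dmin` can be converted from degrees to kilometers with the stated factor of about 111.2 km per degree. The sketch below is one possible way to do this, not the notebook's actual cleaning code; the `events` frame and the `dmin_km` column name are illustrative assumptions.

```python
import pandas as pd

# Two illustrative rows copied from the sample shown in this notebook.
events = pd.DataFrame({
    "time": ["1988-02-26T08:37:46.880Z", "1983-05-12T12:35:00.540Z"],
    "updated": ["2016-07-24T20:30:47.930Z", "2016-02-02T18:51:26.500Z"],
    "dmin": [0.05904, 0.507],
})

# Parse the ISO 8601 strings into timezone-aware (UTC) timestamps.
for col in ("time", "updated"):
    events[col] = pd.to_datetime(events[col], utc=True)

# 1 degree is approximately 111.2 km (factor taken from the dmin description).
KM_PER_DEGREE = 111.2
events["dmin_km"] = events["dmin"] * KM_PER_DEGREE  # hypothetical column name

print(events.dtypes)
print(events[["dmin", "dmin_km"]])
```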

In [6]:
df.shape

(3272774, 23)

In [7]:
df.columns

Index(['Unnamed: 0', 'time', 'latitude', 'longitude', 'depth', 'mag',
       'magType', 'nst', 'gap', 'dmin', 'rms', 'net', 'id', 'updated', 'place',
       'type', 'horizontalError', 'depthError', 'magError', 'magNst', 'status',
       'locationSource', 'magSource'],
      dtype='object')

In [15]:
percentage = df.duplicated(keep=False).value_counts(normalize=True) * 100
print(percentage)

False    100.0
dtype: float64
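
One caveat about this check: if "Unnamed: 0" really is unique for every row, a whole-row comparison can never find duplicates, whatever the other columns contain. A hedged sketch of a complementary check compares only the event `id` column. The rows and id values below are made up for illustration.

```python
import pandas as pd

# Made-up rows: the second and third describe the same event id but carry
# different row keys, so a whole-row comparison cannot flag them.
events = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "id": ["ci0001", "nc0002", "nc0002"],  # hypothetical ids
    "mag": [2.48, 1.16, 1.16],
})

# Share of rows that have an exact copy across all columns.
full_row_pct = events.duplicated(keep=False).mean() * 100

# Share of rows whose event id appears more than once.
by_id_pct = events.duplicated(subset=["id"], keep=False).mean() * 100

print(f"full-row duplicates: {full_row_pct:.1f}%")
print(f"duplicates by id:    {by_id_pct:.1f}%")
```

On the real data the same two expressions could be applied to `df`; whether repeated ids actually occur there is not shown in this notebook.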


In [22]:
df['status'].unique()[:10]

array(['reviewed', 'automatic', 'manual', nan], dtype=object)

In [21]:
df['magSource'].unique()[:10]

array(['ci', 'uw', 'iscgem', 'nc', 'us', 'erl', 'slm', 'ath', 'pas',
       'hvo'], dtype=object)

In [23]:
df['place'].unique()[:10]

array(['29km NE of Independence, CA', '11km SSW of Lake Nacimiento, CA',
       '4km S of La Canada Flintridge, CA', '9km S of Cabazon, CA',
       '5km SE of Niland, CA', '13km SW of Ocotillo, CA',
       '30km SSW of Primm, NV', 'Washington',
       '8km NE of Mexicali, B.C., MX', 'Kermadec Islands, New Zealand'],
      dtype=object)

In [24]:
df['type'].unique()[:10]

array(['sonic boom', 'earthquake', 'quarry blast', 'explosion',
       'nuclear explosion', 'mine collapse', 'other event',
       'chemical explosion', 'rock burst', 'ice quake'], dtype=object)

Here you can see a sample of the dataset:

In [10]:
df.sample(7)

| Unnamed: 0.1 | Unnamed: 0 | time | latitude | longitude | depth | mag | magType | nst | gap | dmin | ... | updated | place | type | horizontalError | depthError | magError | magNst | status | locationSource | magSource |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1457355 | 2241313 | 2003-01-23T19:45:44.231Z | 63.4739 | -151.2431 | 8.8 | 1.7 | ml | | | | ... | 2019-02-12T20:59:32.287Z | Central Alaska | earthquake | | 1.7 | | | reviewed | ak | ak |
| 1677988 | 1423966 | 2005-02-07T06:31:55.950Z | 38.801167 | -122.7375 | 0.723 | | | 6.0 | 144.0 | 0.01712 | ... | 2017-01-10T16:11:12.423Z | Northern California | earthquake | 0.39 | 1.08 | | | reviewed | nc | nc |
| 483296 | 2905502 | 1988-02-26T08:37:46.880Z | 46.7405 | -119.353333 | -0.217 | 1.0 | md | 8.0 | 116.0 | 0.05904 | ... | 2016-07-24T20:30:47.930Z | Washington | earthquake | 0.036 | 0.07 | 0.1 | 4.0 | reviewed | uw | uw |
| 157058 | 2227893 | 1981-08-20T02:12:27.180Z | 38.799833 | -122.803167 | 1.012 | 1.16 | md | 11.0 | 79.0 | 0.01081 | ... | 2016-12-13T04:07:01.896Z | Northern California | earthquake | 0.31 | 0.47 | 0.13 | 9.0 | reviewed | nc | nc |
| 1834676 | 3227071 | 2006-08-03T06:34:50.420Z | 37.005333 | -121.471 | 3.543 | 0.82 | md | 11.0 | 101.0 | 0.02252 | ... | 2017-01-15T02:18:52.118Z | Northern California | earthquake | 0.4 | 0.51 | 0.23 | 2.0 | reviewed | nc | nc |
| 225628 | 672636 | 1983-05-12T12:35:00.540Z | 32.145 | -115.803 | 5.694 | 2.48 | ml | 13.0 | 272.0 | 0.507 | ... | 2016-02-02T18:51:26.500Z | 53km SSW of Progreso, B.C., MX | earthquake | 3.92 | 31.61 | 0.131 | 2.0 | reviewed | ci | ci |
| 1353602 | 535684 | 2001-12-18T08:14:18.700Z | 44.8085 | -111.0435 | 7.06 | 0.61 | md | 11.0 | 203.0 | 0.04651 | ... | 2018-08-28T19:31:54.600Z | Yellowstone National Park, Wyoming | earthquake | 0.63 | 0.68 | | 4.0 | reviewed | uu | uu |
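
Several cells in the sample above are empty (for example in `nst`, `gap`, and `magError`), so quantifying missing values per column is a natural next step. The sketch below shows one way to do that with `isna()`; the small `events` frame reuses values from three of the sampled rows and stands in for the full `df`.

```python
import numpy as np
import pandas as pd

# Values copied from three of the sampled rows above; NaN marks an empty cell.
events = pd.DataFrame({
    "mag": [1.7, np.nan, 1.0],
    "magType": ["ml", np.nan, "md"],
    "nst": [np.nan, 6.0, 8.0],
    "magError": [np.nan, np.nan, 0.1],
})

# Percentage of missing values per column, highest first.
missing_pct = events.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct.round(1))
```

Running `df.isna().mean()` on the full data set would give the corresponding shares for all 23 columns.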


In the next steps, the data set will be filtered to rows where "type" is "earthquake" and "magType" is "ml", because **ml** is considered the most accurate magnitude type for these measurements and, apart from "md", it is the type with the most records.
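
The planned filter can be sketched as follows. This is an illustration under stated assumptions rather than the final cleaning code: the rows are made up, and the lowercasing step assumes that both "Ml" (as in the column description) and "ml" (as in the sample) may occur in the full file.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the data set.
events = pd.DataFrame({
    "type": ["earthquake", "earthquake", "quarry blast", "earthquake"],
    "magType": ["ml", "Ml", "ml", None],
    "mag": [1.7, 2.48, 1.2, None],
})

# Normalise case so that "Ml" and "ml" are counted together (assumption:
# both spellings may occur in the full file).
mag_type = events["magType"].str.lower()

# Record counts per magnitude type, keeping missing values visible.
print(mag_type.value_counts(dropna=False))

# Keep only earthquakes whose preferred magnitude type is ML.
mask = (events["type"] == "earthquake") & (mag_type == "ml")
ml_quakes = events.loc[mask].copy()
print(ml_quakes)
```

The `value_counts` call is also a quick way to check the record counts per magnitude type on the real `df` before committing to the filter.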

**ML magnitude type definition**: *The original magnitude relationship defined by Richter and Gutenberg in 1935 for local earthquakes. It is based on the maximum amplitude of a seismogram recorded on a Wood-Anderson torsion seismograph. Although these instruments are no longer widely in use, ML values are calculated using modern instrumentation with appropriate adjustments. Reported by NEIC for all earthquakes in the US and Canada. Only authoritative for smaller events, typically M<4.0 for which there is no mb or moment magnitude. In the central and eastern United States, NEIC also computes ML, but restricts the distance range to 0-150 km. In that area it is only authoritative if there is no mb_Lg as well as no mb or moment magnitude.*


For more information about this data set, see the [USGS ComCat documentation](https://earthquake.usgs.gov/data/comcat/index.php#mag).