# Analyzing the earthquake dataset for Greece considering the actual territory of Greece as of 2019
**This article analyzes the [earthquake catalogue for Greece](https://www.kaggle.com/astefopoulos/earthquakes-in-greece-19012018) for the period 1901 - 2018, restricted to the actual territory of Greece as of 2019, using a geojson file of the [Greek boundaries](https://www.kaggle.com/lsind18/greeceborders).**

> In brief:
* Understand the data
* Remove the outliers - records that do not belong to the modern Greek area
* Select the more recent and more accurate records, which reflect improvements in seismographs
* Group the recent data by Richter magnitude value
* Show the data on a real map

# First steps
### 1. Getting started

Import the necessary modules and load the dataframe to inspect the first five records.

**Most used modules in the notebook:**
- pandas - data manipulation and analysis of dataframes
- matplotlib - visualization of data
- json - reads the actual Greek borders from geojson data
- shapely - manipulation of planar features using functions from the GEOS library; we need it to build polygons as areas
- folium - interactive maps built on the leaflet.js library.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

eq_df = pd.read_csv('/kaggle/input/earthquakes-in-greece-19012018/EarthQuakes in Greece.csv')
eq_df.head(5)

### 2. Get some statistics

* rename dataframe columns `LATATITUDE (N), LONGITUDE  (E), MAGNITUDE (Richter)` into something more readable: `Lat, Long, Magn`;
* check types and None values using *info()*;
* find out some descriptive statistics using *describe()*:

In [None]:
eq_df.rename(columns={'Date':'Day', 'LATATITUDE (N)':'Lat', 'LONGITUDE  (E)' : 'Long', 'MAGNITUDE (Richter)' : 'Magn' }, inplace=True)
eq_df.info()
eq_df.describe()

> Result:
* dataset contains 256 655 rows and 8 columns;
* there are no `None` values in the dataset;
* `Year, Month, Day, Hours, Minutes` are of int64 type, `Lat, Long and Magn` are of float type;
* `Year` is in range 1901 - 2018;
* `Magn` is in range 0.0 - 8.0, mean magnitude is 2.42.


# Richter Magnitude Scale and outliers

### 3. Understand Richter Magnitude Scale

According to the [Richter Magnitude Scale](http://www.geo.mtu.edu/UPSeis/magnitude.html), earthquakes can be classed as micro, minor, light, moderate, strong, major and great. After some cleaning we organize the earthquakes into smaller dataframes by Richter magnitude value:
* eq_minor = (0; 3.9]
* eq_light = (3.9; 4.9\]
* eq_moder = (4.9; 5.9\]
* eq_major = (5.9; 7.9\]
* eq_great > 7.9
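
The same intervals can also be expressed as a single categorical column. The sketch below is only an illustration on toy magnitudes, assuming `pd.cut` with right-closed bins; the label names mirror the dataframe names above, and `toy_magn` is a hypothetical series, not part of the dataset.

```python
import pandas as pd

# Bin edges follow the intervals listed above; bins are right-closed by default,
# so 3.9 falls into (0, 3.9] and 4.0 into (3.9, 4.9].
bins = [0, 3.9, 4.9, 5.9, 7.9, float("inf")]
labels = ["minor", "light", "moder", "major", "great"]

# Hypothetical magnitudes used only for illustration
toy_magn = pd.Series([2.4, 3.9, 4.0, 5.5, 6.3, 8.0])

magn_class = pd.cut(toy_magn, bins=bins, labels=labels)
print(magn_class.tolist())
```

A magnitude of exactly 0.0 falls outside the first bin and would become `NaN`, which matches the decision to ignore such rows.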


### 4. Find the outliers

The table of outliers and the full histogram of magnitude frequencies are:

In [None]:
eq_df.hist(column='Magn', bins=100)
eq_df[(eq_df['Magn']> 7.3) | (eq_df['Magn']==0.0)].sort_values('Magn', ascending=False)

>Result:
The most destructive earthquake occurred on 11 August 1903, with its epicenter at sea south of [Kythera](https://www.kythera.gr/en/about_kythera/history.php); it devastated the island. Another took place in present-day [Kresna, Bulgaria, on 4 April 1904](https://www.researchgate.net/publication/50301271_The_Kresna_earthquake_of_1904_in_Bulgaria). There are also rows with magnitude = 0.0, which will be excluded from the smaller datasets.
Rows to ignore in further analysis:
* magnitude = 0.0 (only 2 rows);
* earthquakes from adjacent areas.

In [None]:
eq_df = eq_df.loc[eq_df['Magn'] != 0]

# Plot actual Greek area on 2019

### 5. Create a multipolygon of actual Greek area

* Load the coordinates of the actual borders of Greece;

The full boundaries of Greece were found at the [OSM Admin Boundaries Map](https://wambachers-osm.website/boundaries/) and downloaded [as a geojson file](https://www.kaggle.com/lsind18/greeceborders). *For further analysis, this dataset can be extended with the boundary coordinates of adjoining countries.*
* Create a multipolygon geometry from the geojson file.

Modules used:
* the json module reads the json file `Greece_AL2.GeoJson`
* the [**shapely**](https://pypi.org/project/Shapely/) module helps to manipulate and analyse geometric objects in the Cartesian plane. Shapely populates the `countries` dictionary with a PreparedGeometry object that represents the modern Greek area. The dictionary `countries = {}` has only one key, 'Greece', and can be extended later with keys for the names of other countries.

In [None]:
import json
from shapely.geometry import mapping, shape
from shapely.prepared import prep
from shapely.geometry import Point

with open('/kaggle/input/greeceborders/Greece_AL2.GeoJson') as json_file:
    data = json.load(json_file)

countries = {}
gr = '' # multipolygon
for feature in data['features']:
    geom = feature['geometry']
    gr = shape(geom)
    countries['Greece'] = prep(gr)

> Result:
The shapely geometry `gr` (built with `shape(geom)`) and its prepared version in `countries['Greece']` were created for use in the next steps.
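
To make the role of the prepared geometry concrete, here is a minimal sketch on a toy square rather than the real border. `toy_geom` is a hypothetical GeoJSON-like dictionary; the calls `shape`, `prep` and `contains` are the same ones used in this notebook. Note that shapely points take coordinates as (x, y), that is (longitude, latitude).

```python
from shapely.geometry import Point, shape
from shapely.prepared import prep

# Toy GeoJSON-like geometry: a 2 x 2 square, NOT the real Greek border
toy_geom = {
    "type": "Polygon",
    "coordinates": [[[0, 0], [0, 2], [2, 2], [2, 0], [0, 0]]],
}

# prep() builds a prepared geometry that speeds up repeated contains() checks
toy_area = prep(shape(toy_geom))

inside = toy_area.contains(Point(1, 1))   # point in the middle of the square
outside = toy_area.contains(Point(3, 1))  # point to the right of the square
print(inside, outside)
```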

### 6. Combine area of Greece, real map and some entries from dataset

* combine the dictionary `countries = {}` with the full dataset `eq_df` and a real map using the module [**folium**](https://python-visualization.github.io/folium/). Folium builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the leaflet.js library.
* Markers on the map are earthquakes placed by latitude and longitude, with popups showing the year in which the earthquake happened, for the first 25 rows.

In [None]:
import folium

m = folium.Map([eq_df['Lat'].mean(), eq_df['Long'].mean()], #center of a map
               zoom_start=6, min_zoom = 5, max_zoom = 7) # max zoom is 18; restrict zooms not to scroll much
               
folium.GeoJson(gr).add_to(m) # add gr - multipolygon of greek boundaries
folium.LatLngPopup().add_to(m) # add custom popup of lat/long of selected point

for i in range(0,25): # add markers for the first 25 earthquakes of the dataset
    folium.Marker([eq_df.iloc[i]['Lat'], eq_df.iloc[i]['Long']], 
                  popup=str(int(eq_df.iloc[i]['Year']))).add_to(m) # popup shows the year as text, e.g. 1901 instead of 1901.0
m

> Result:
Map shows earthquakes in Greece and adjacent areas - Balkans, Turkey, etc.

# Drop the earthquakes not from Greek territory

### 7. Cut adjacent areas

* create a function `get_country(row)` which takes one row of the dataset as an argument, extracts the longitude and latitude and checks whether the point belongs to the Greek area or not. For further analysis that includes the areas of adjoining countries, it could return the country name instead. The function returns a bool value showing whether the point is inside or outside Greece - this value is added to the `eq_df` dataset as a new column.
* create a dataframe of the earthquakes that happened in the actual area of Greece.

In [None]:
def get_country(row):
    point = Point(row['Long'], row['Lat'])
    for country, geom in countries.items():
        if geom.contains(point):
            return True # country name
    return False # unknown

eq_df['Country'] = eq_df.apply(get_country, axis=1)

In [None]:
eq_gr = eq_df[eq_df['Country']] # keep only the rows flagged as inside Greece
eq_gr = eq_gr.drop(columns='Country')
eq_gr

> Result: The dataframe `eq_gr` contains 197 059 entries for the same years 1901 - 2018 (59 596 entries belong to other countries as of 2019).
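
Calling `contains` once per row through `apply` can be slow on a dataset of this size. One optional way to reduce the work, sketched here on toy rows, is a vectorized bounding-box test that discards points which cannot lie inside the polygon before the exact check. The box values below are hypothetical placeholders; in the notebook they could be taken from `gr.bounds`, which returns `(minx, miny, maxx, maxy)`.

```python
import pandas as pd

# Toy rows; column names follow the notebook (Lat, Long)
toy = pd.DataFrame({
    "Lat":  [38.0, 42.5, 36.1, 39.9],
    "Long": [23.7, 23.3, 28.5, 19.0],
})

# Hypothetical bounding box (placeholder values, not the real border extent)
min_lon, min_lat, max_lon, max_lat = 19.3, 34.8, 29.7, 41.8

# Vectorized test: True only for points inside the box
in_box = toy["Long"].between(min_lon, max_lon) & toy["Lat"].between(min_lat, max_lat)

# Only these candidates would still need the exact polygon check
candidates = toy[in_box]
print(candidates)
```

Points inside the box are not necessarily inside the polygon, so the exact `contains` check is still required for the remaining candidates.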

# Create datasets based on Richter magnitude value

### 8. Create different datasets based on Richter magnitude value
* eq_minor = (0; 3.9]; eq_light = (3.9; 4.9\];  eq_moder = (4.9; 5.9\];  eq_major = (5.9; 7.9\]; eq_great > 7.9
* create scatter plot `Year-Month` with all earthquakes excluding minor (`Magn <= 3.9`). **Orange and red dots show the destructive earthquakes, lightgreen and lightblue - light (which are most common) and moderate respectively.**

In [None]:
eq_minor = eq_gr.loc[(eq_gr['Magn'] > 0) & (eq_gr['Magn'] <=3.9)]
eq_light = eq_gr.loc[(eq_gr['Magn'] > 3.9) & (eq_gr['Magn'] <=4.9)]
eq_moder = eq_gr.loc[(eq_gr['Magn'] > 4.9) & (eq_gr['Magn'] <= 5.9)]
eq_major = eq_gr.loc[(eq_gr['Magn'] > 5.9) & (eq_gr['Magn'] <=7.9)]
eq_great = eq_gr.loc[eq_gr['Magn'] > 7.9]

ax1 = eq_light.plot(kind='scatter', x='Year', y='Month', color='lightgreen', label='Light')
ax2 = eq_moder.plot(kind='scatter', x='Year', y='Month', color='lightblue',  label='Moder', ax=ax1)
ax3 = eq_major.plot(kind='scatter', x='Year', y='Month', color='orange', label='Major', ax=ax1)
ax4 = eq_great.plot(kind='scatter', x='Year', y='Month', color='r', label='Great',  ax=ax1)
ax1.legend(bbox_to_anchor=(1., 1.))

>Result: 
Plotting the types of earthquakes over the years shows that many light earthquakes were registered in the last 40-50 years, which is connected with the ability of modern seismographs to detect even very weak earthquakes.

# Analyze dataset only with modern data

### 9. Create dataset only with modern data
* understand when the records of full dataframe become more accurate

In [None]:
count_y = eq_gr['Year'].value_counts().sort_index() # sort by year so the line runs chronologically
count_y.plot(grid=True)

> Result: the graph above shows that the number of records starts to grow sharply towards the end of the 20th century.
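
One detail matters for a per-year line plot: `value_counts()` orders its result by frequency, not by year, so the index should be sorted before plotting. A small sketch on a hypothetical `years` series shows the difference:

```python
import pandas as pd

# Hypothetical years used only for illustration
years = pd.Series([2014, 2014, 2014, 1998, 2005, 2005], name="Year")

by_count = years.value_counts()               # ordered by frequency, most common first
by_year = years.value_counts().sort_index()   # ordered chronologically

print(by_count.index.tolist())
print(by_year.index.tolist())
```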

> Steps:
* Create a dataframe for 1998 - 2018;
* Create a *year-count* bar plot that summarizes the number of records per year.

In [None]:
eq_df_20y = eq_gr[eq_gr['Year']>=1998]
pd.crosstab(index=eq_df_20y['Year'], columns='count').plot(kind='bar', figsize=(5,5), grid=True)
eq_df_20y

> Result: 
* the 1998 - 2018 dataset has 172 987 records, which means that almost the whole preceding century (from 1901 to 1997) accounts for only 24 072 records
  * for example, about 5000 entries were made for magnitudes less than 3. 
* According to the crosstab plot, 2014 has the most records of any single year (more than were recorded during the whole preceding century!)
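
The figures above can be cross-checked with simple arithmetic. The sketch below uses only the record counts quoted in this notebook; the variable names are illustrative.

```python
# Record counts quoted in the results above
total_gr = 197_059   # earthquakes inside Greece, 1901 - 2018
recent = 172_987     # earthquakes inside Greece, 1998 - 2018

earlier = total_gr - recent        # records left for 1901 - 1997
recent_share = recent / total_gr   # fraction of all records from 1998 - 2018

print(earlier)                 # 24072
print(f"{recent_share:.1%}")   # 87.8%
```

Under these counts, 1998 - 2018 holds close to nine tenths of all records for the Greek territory.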

### 10. Create a dataset for the year 2014
* find descriptive info about it
* mark epicenters on the map

In [None]:
eq_2014 = eq_gr[eq_gr['Year']==2014]
eq_2014.info()

Mark the epicenters of earthquakes with magnitude greater than 3.9 using CircleMarker:

In [None]:
from folium.plugins import MarkerCluster
mc = MarkerCluster(name="Marker Cluster")

folium_map = folium.Map([eq_2014['Lat'].mean(), eq_2014['Long'].mean()], # center of the map, based on the 2014 data only
               zoom_start=6, min_zoom = 5, tiles='Stamen Terrain') # max zoom is 18; restrict zooms not to scroll much
               
for index, row in eq_2014[eq_2014.Magn>3.9].iterrows():
    popup_text = "Day: {} <br> Month: {}".format(
                      int(row["Day"]),
                      int(row["Month"])
                      )
    folium.CircleMarker(location=[row["Lat"],row["Long"]],
                        radius= 1.5 * row['Magn'],
                        color="red",
                        popup=popup_text,
                        fill=True).add_to(mc)

mc.add_to(folium_map)

folium.LayerControl().add_to(folium_map)
folium_map

In [None]:
eq_light14 = eq_light[eq_light.Year==2014]
eq_moder14 = eq_moder[eq_moder.Year==2014]
eq_major14 = eq_major[eq_major.Year==2014]

Mark the light, moderate and major earthquakes (green, blue, red) of 2014 on a map:

In [None]:
m = folium.Map([eq_2014['Lat'].mean(), eq_2014['Long'].mean()], # center of the map, based on the 2014 data only
               zoom_start=6, min_zoom = 5) # max zoom is 18;

# one set of circles per magnitude class: light - green, moderate - blue, major - red
for eq_class, color in [(eq_light14, 'green'), (eq_moder14, 'blue'), (eq_major14, 'red')]:
    for index, row in eq_class.iterrows(): # take day and month from the current row, not from a leftover `row` of an earlier cell
        folium.Circle(
            radius=1000 * row['Magn'], # radius in meters, scaled by magnitude
            location=[row['Lat'], row['Long']],
            popup="D: {} <br> Mo: {}".format(
                          int(row["Day"]),
                          int(row["Month"])),
            color=color,
            fill=False,
        ).add_to(m)
m
m

All these datasets can be used for further (for example, predictive) analysis.

# Conclusion

* #### All these datasets can be used for further (for example, predictive) analysis.
* #### Visualization of the data helps to identify the more dangerous regions.
* #### According to the crosstab plot, 2014 has the most records of any single year (more than were recorded during the whole preceding century!)

Please upvote this kernel if you find it useful 🙋   
Feel free to give any suggestions to improve my code.
