# 2. Univariate Analysis

This notebook contains: 

    2.1. Statistical Overview of the Entire Dataset
    2.2. Univariate Analysis for Categorical Variables
    2.3. Univariate Analysis for Numerical Variables
    2.4. Insights

## 2.1. Statistical Overview of the Entire Dataset

This section provides a brief statistical overview of the entire preprocessed dataset. 

The required packages are imported here:

In [1]:
import pandas as pd
import numpy as np
import plotly as ply
import plotly.express as px

The downloaded preprocessed dataset *Dataset for Analysis* is then read into a Pandas DataFrame:

In [2]:
univariate_analysis = pd.read_csv("C:\\Users\\Sharmila\\Sharmila_Kuthunur Analysis of NOAA Significant Earthquakes Dataset\\1. Preprocessing\\Dataset for Analysis.csv")
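
The `info()` output further down shows an `Unnamed: 0` column and several columns stored as `object` because of `?` placeholders. A minimal sketch of how both could be handled at load time, assuming the first CSV column is the saved index and `?` is the only missing-value marker. The inline sample is hypothetical and stands in for the real file:

```python
import io
import pandas as pd

# Tiny stand-in for "Dataset for Analysis.csv" (hypothetical sample rows)
sample_csv = io.StringIO(
    ",Year,Tsu,Mag\n"
    "0,-2150,?,7.3\n"
    "1,-2000,Tsu,?\n"
    "2,1960,Tsu,9.5\n"
)

# index_col=0 reuses the saved index instead of creating 'Unnamed: 0';
# na_values='?' turns the placeholder into NaN so Mag parses as float
sample = pd.read_csv(sample_csv, index_col=0, na_values="?")

print(sample.dtypes)
print(sample["Mag"].max())
```

The rest of this notebook keeps the `?` placeholders, so this is an alternative loading strategy rather than a description of what the cells below do.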

In [3]:
# Viewing the entire dataset

print(univariate_analysis)

      Unnamed: 0  Year  Tsu  Vol                              Name  Latitude  \
0              0 -2150    ?    ?      JORDAN: BAB-A-DARAA,AL-KARAK    31.100   
1              1 -2000  Tsu    ?                     SYRIA: UGARIT    35.683   
2              2 -2000    ?    ?                   TURKMENISTAN: W    38.000   
3              3 -1610  Tsu  Vol  GREECE: THERA ISLAND (SANTORINI)    36.400   
4              4 -1566    ?    ?           ISRAEL: ARIHA (JERICHO)    31.500   
...          ...   ...  ...  ...                               ...       ...   
6139        6139  2019    ?    ?         PAKISTAN: MIRPUR DISTRICT    33.106   
6140        6140  2019    ?    ?          INDONESIA: MALUKU: AMBON    -3.450   
6141        6141  2019    ?    ?                  TURKEY: ISTANBUL    40.890   
6142        6142  2019    ?    ?              CHILE: SOUTH CENTRAL   -40.815   
6143        6143  2019    ?    ?                 CHILE: CONCEPCION   -35.473   

      Longitude  Mag MMI Int       Coun

In [4]:
print(univariate_analysis.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6144 entries, 0 to 6143
Data columns (total 12 columns):
Unnamed: 0     6144 non-null int64
Year           6144 non-null int64
Tsu            6144 non-null object
Vol            6144 non-null object
Name           6144 non-null object
Latitude       6144 non-null float64
Longitude      6144 non-null float64
Mag            6144 non-null object
MMI Int        6144 non-null object
Country        6144 non-null object
Region         5415 non-null object
UpdatedYear    6144 non-null int64
dtypes: float64(2), int64(3), object(7)
memory usage: 576.1+ KB
None


In [5]:
# First five rows

print(univariate_analysis.head())

   Unnamed: 0  Year  Tsu  Vol                              Name  Latitude  \
0           0 -2150    ?    ?      JORDAN: BAB-A-DARAA,AL-KARAK    31.100   
1           1 -2000  Tsu    ?                     SYRIA: UGARIT    35.683   
2           2 -2000    ?    ?                   TURKMENISTAN: W    38.000   
3           3 -1610  Tsu  Vol  GREECE: THERA ISLAND (SANTORINI)    36.400   
4           4 -1566    ?    ?           ISRAEL: ARIHA (JERICHO)    31.500   

   Longitude  Mag MMI Int       Country                     Region  \
0       35.5  7.3       ?        JORDAN       BAB-A-DARAA,AL-KARAK   
1       35.8    ?    10.0         SYRIA                     UGARIT   
2       58.2  7.1    10.0  TURKMENISTAN                          W   
3       25.4    ?       ?        GREECE   THERA ISLAND (SANTORINI)   
4       35.3    ?    10.0        ISRAEL            ARIHA (JERICHO)   

   UpdatedYear  
0            0  
1          150  
2          150  
3          540  
4          584  


In [6]:
# Last five rows

print(univariate_analysis.tail())

      Unnamed: 0  Year Tsu Vol                       Name  Latitude  \
6139        6139  2019   ?   ?  PAKISTAN: MIRPUR DISTRICT    33.106   
6140        6140  2019   ?   ?   INDONESIA: MALUKU: AMBON    -3.450   
6141        6141  2019   ?   ?           TURKEY: ISTANBUL    40.890   
6142        6142  2019   ?   ?       CHILE: SOUTH CENTRAL   -40.815   
6143        6143  2019   ?   ?          CHILE: CONCEPCION   -35.473   

      Longitude  Mag MMI Int    Country            Region  UpdatedYear  
6139     73.766  5.6       ?   PAKISTAN   MIRPUR DISTRICT         4169  
6140    128.347  6.5       ?  INDONESIA            MALUKU         4169  
6141     28.173  5.7       ?     TURKEY          ISTANBUL         4169  
6142    -72.002  6.1       ?      CHILE     SOUTH CENTRAL         4169  
6143    -73.162  6.8       ?      CHILE        CONCEPCION         4169  


## 2.2. Univariate Analysis of Categorical Variables 

This section analyzes four categorical variables: Country, Region, Tsu, and Vol. Briefly, they are:
    
    Country: The name of the country where the earthquake occurred.
    Region: The name of the specific region in the country where the earthquake occurred.
    Tsu: Occurrence of a tsunami. Denoted with 'Tsu' if a tsunami was recorded, or '?' otherwise.
    Vol: Occurrence of a volcanic eruption. Denoted with 'Vol' if an eruption was recorded, or '?' otherwise.

Tasks performed in this section are:

    2.2.1. Histogram of earthquakes in each country
    2.2.2. Histogram of earthquakes in each region
    2.2.3. Histogram of earthquakes in the same country and region
    2.2.4. Understanding Tsu and Vol columns
    


### 2.2.1. Histogram of earthquakes in each country

In [7]:
# Number of unique countries

print(len(univariate_analysis['Country'].unique()))

340


In [8]:
# Frequency of earthquakes in each country

univariate_analysis['Country'].value_counts()

CHINA                     598
IRAN                      378
INDONESIA                 369
JAPAN                     345
ITALY                     325
                         ... 
BRITISH VIRGIN ISLANDS      1
FIJI                        1
NEW HAMPSHIRE               1
INDONESIA-MALAYSIA          1
AFGHANISTAN; INDIA          1
Name: Country, Length: 340, dtype: int64

The count of earthquakes in each country is shown in the histogram below:

In [10]:
# Plotting the histogram

histogram_countries = univariate_analysis['Country']
fig = px.histogram(histogram_countries, x= "Country")
fig.update_layout(
    title_text='Histogram of earthquakes in each country', # title of plot
    xaxis_title_text='Name of the Country', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)
fig.show()
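
With 340 categories on the x-axis, individual bars are hard to read. One option is to plot only the most frequent countries. The sketch below prepares such a table from the five counts printed above; `top_n` is a hypothetical cutoff, not a value used elsewhere in this notebook:

```python
import pandas as pd

# Top five counts copied from the value_counts() output above
country_counts = pd.Series(
    {"CHINA": 598, "IRAN": 378, "INDONESIA": 369, "JAPAN": 345, "ITALY": 325},
    name="Count",
)

# On the full data this would be:
#   country_counts = univariate_analysis['Country'].value_counts()
top_n = 3  # hypothetical cutoff
top_countries = country_counts.head(top_n).rename_axis("Country").reset_index()
print(top_countries)
```

The resulting frame has one row per country, so it can be passed to a bar chart such as `px.bar(top_countries, x="Country", y="Count")`.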

### 2.2.2. Histogram of earthquakes in each region

In [11]:
# Number of unique regions

print(len(univariate_analysis['Region'].unique()))

2873


In [12]:
# Frequency of earthquakes in each region

print(univariate_analysis['Region'].value_counts())

 YUNNAN PROVINCE                      144
 SICHUAN PROVINCE                      77
 HONSHU                                60
 CENTRAL                               59
 NORTHERN                              51
                                     ... 
 ZANGEZUR, NAKHITCHEVAN                 1
 MEXICO CITY,ACAPULCO                   1
 GANSU PROVINCE, SHANXI PROVINCE        1
 QUEZALTENANGO, SAN MARCOS, SOLOLA      1
 DAMGHAN-TORUD                          1
Name: Region, Length: 2872, dtype: int64
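
The two cells above report 2873 and 2872 because `unique()` counts the missing value `NaN` as a distinct entry, while `value_counts()` skips it. The region names in the output also appear to carry a leading space. A small sketch of both effects on a hypothetical sample (whether the real data contains padded and unpadded variants of the same region is an assumption):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: leading spaces and a missing region
region = pd.Series([" YUNNAN PROVINCE", "YUNNAN PROVINCE", " HONSHU", np.nan])

print(len(region.unique()))   # NaN is included in the result
print(region.nunique())       # NaN is excluded by default

# Stripping whitespace merges variants that differ only by padding
region_clean = region.str.strip()
print(region_clean.nunique())
print(region_clean.value_counts())
```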


The count of earthquakes in each region is shown in the histogram below:

In [13]:
# Plotting the histogram

histogram_region = univariate_analysis['Region']
fig = px.histogram(histogram_region, x="Region")
fig.update_layout(
    title_text='Histogram of earthquakes in each region', # title of plot
    xaxis_title_text='Name of the Region', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)
fig.show()

### 2.2.3. Histogram of earthquakes in the same country and region

In [14]:
# Number of unique Name values (country and region combined)

print(len(univariate_analysis['Name'].unique()))

3790


In [15]:
# Rows whose Name repeats an earlier row (every occurrence after the first)

multiple_earthquakes = pd.DataFrame(univariate_analysis, columns=['Name'])
multiple_earthquakes_output = multiple_earthquakes[multiple_earthquakes.duplicated()]
print(multiple_earthquakes_output)

                            Name
7        ISRAEL: ARIHA (JERICHO)
26                        GREECE
37    PORTUGAL: CABO SAN VICENTE
51                        GREECE
55             ISRAEL: PALESTINE
...                          ...
6139   PAKISTAN: MIRPUR DISTRICT
6140    INDONESIA: MALUKU: AMBON
6141            TURKEY: ISTANBUL
6142        CHILE: SOUTH CENTRAL
6143           CHILE: CONCEPCION

[2354 rows x 1 columns]
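
`duplicated()` flags every occurrence after the first, so the 2354 rows above are repeat entries (6144 rows minus 3790 unique names), not the number of distinct places with more than one earthquake. A sketch of the difference on a hypothetical sample, using `value_counts()` to find the distinct repeated names:

```python
import pandas as pd

# Hypothetical sample of Name values
names = pd.Series(
    ["GREECE", "TURKEY", "GREECE", "GREECE", "CHILE", "TURKEY"], name="Name"
)

# duplicated() flags repeats only: 2 for GREECE + 1 for TURKEY
repeat_rows = names[names.duplicated()]
print(len(repeat_rows))  # equals len(names) - names.nunique()

# Distinct names that occur more than once
counts = names.value_counts()
repeated_names = counts[counts > 1]
print(repeated_names)
```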


In [17]:
# Frequency of earthquakes for each Name (country and region)

print(univariate_analysis['Name'].value_counts())

CHINA: YUNNAN PROVINCE                               79
TURKEY                                               49
CHINA: SICHUAN PROVINCE                              48
RUSSIA: KURIL ISLANDS                                39
SOLOMON ISLANDS                                      34
                                                     ..
GREECE: SAGIADHA-KONISPOLIS (THESPROTIA)              1
CHINA: SHANXI PROVINCE: LINFEN                        1
FRANCE: CORRENCON,CHATEAU-BERNARD,LE GUA,RENCUREL     1
CALIFORNIA: NORTHRIDGE                                1
FRANCE: TARBES, LOURDES                               1
Name: Name, Length: 3790, dtype: int64


In [18]:
# Plotting the histogram

histogram_region = univariate_analysis['Name']
fig = px.histogram(histogram_region, x="Name")
fig.update_layout(
    title_text='Histogram of earthquakes in each country and region', # title of plot
    xaxis_title_text='Name of the Country along with region', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)
fig.show()

### 2.2.4. Understanding Tsu and Vol columns

The *Tsu* and *Vol* columns in the dataset record whether a tsunami or a volcanic eruption was associated with the earthquake. A basic statistical overview of these variables shows how many earthquakes were accompanied by a tsunami or an eruption.

In [19]:
# Count of tsunami occurrences along with missing value count

tsunami_analysis = univariate_analysis['Tsu'].value_counts()
print(tsunami_analysis)

?      4291
Tsu    1853
Name: Tsu, dtype: int64


In [20]:
# Count of volcano occurrences along with missing value count

volcano_analysis = univariate_analysis['Vol'].value_counts()
print(volcano_analysis)

?      6080
Vol      64
Name: Vol, dtype: int64
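
From the counts above, about 30.2% of entries (1853 of 6144) record a tsunami and about 1.0% (64 of 6144) record a volcanic eruption. A sketch of how such shares could be computed directly, by turning the flags into booleans. The sample frame is hypothetical and only mimics the `'Tsu'` / `'Vol'` / `'?'` encoding:

```python
import pandas as pd

# Hypothetical sample using the same encoding as the dataset
flags = pd.DataFrame(
    {"Tsu": ["?", "Tsu", "?", "Tsu"], "Vol": ["?", "?", "?", "Vol"]}
)

# True where the event is recorded, False where the entry is '?'
has_tsunami = flags["Tsu"].eq("Tsu")
has_volcano = flags["Vol"].eq("Vol")

# The mean of a boolean Series is the share of True values
print(has_tsunami.mean())
print(has_volcano.mean())

# Cross-tabulation of the two flags
print(pd.crosstab(has_tsunami, has_volcano))
```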


## 2.3. Univariate Analysis of Numerical Variables

This section analyzes six numerical variables: Year, Latitude, Longitude, Magnitude, MMI Int, and Updated Year. Briefly, they are: 

    Year: The year in which the earthquake occurred.
    Latitude and Longitude: Geographical coordinates of the earthquake location.
    Magnitude (Mag): The size of the earthquake as measured by seismological instruments.
    MMI Int: The Modified Mercalli Intensity, an ascending scale of the observed shaking and damage.
    Updated Year (UpdatedYear): The Year column shifted by a constant offset so that the earliest year (-2150) becomes 0.
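
In the rows printed in section 2.1, UpdatedYear equals Year plus 2150 (for example, -2150 becomes 0 and 2019 becomes 4169). A quick check of that relationship on those rows; the constant name `YEAR_OFFSET` is made up for this sketch, and the offset is inferred from the printed rows rather than documented:

```python
import pandas as pd

# Year / UpdatedYear pairs copied from the head() and tail() outputs in 2.1
pairs = pd.DataFrame({
    "Year": [-2150, -2000, -1610, -1566, 2019],
    "UpdatedYear": [0, 150, 540, 584, 4169],
})

YEAR_OFFSET = 2150  # assumed offset: shifts the earliest year to 0

# True if every row satisfies UpdatedYear == Year + offset
offset_holds = (pairs["Year"] + YEAR_OFFSET).eq(pairs["UpdatedYear"]).all()
print(offset_holds)
```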

The rest of this section covers the following variables: 

    2.3.1. Year
    2.3.2. Latitude and Longitude
    2.3.3. Magnitude

### 2.3.1. Year

The Year values range from -2150 to 2019. A statistical overview of the column is followed by a histogram.

In [21]:
# Statistical overview of the column 'Year'

univariate_analysis.Year.describe()

count    6144.000000
mean     1804.691243
std       376.404999
min     -2150.000000
25%      1820.000000
50%      1928.000000
75%      1988.000000
max      2019.000000
Name: Year, dtype: float64

In [22]:
# Plotting the histogram

histogram_year = univariate_analysis['Year']
fig = px.histogram(histogram_year, x="Year")
fig.update_layout(
    title_text='Histogram of earthquake occurrence', # title of plot
    xaxis_title_text='Year', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)
fig.show()
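
`px.histogram` chooses its bins automatically. To count earthquakes in fixed periods, such as the 50-year spans quoted in the Insights section, the years could be binned explicitly with `pd.cut`. A minimal sketch on a hypothetical sample; the bin edges are an assumption:

```python
import pandas as pd

# Hypothetical sample of years
years = pd.Series([1905, 1948, 1950, 1960, 1999, 2000, 2019], name="Year")

# Left-closed 50-year bins: [1900, 1950), [1950, 2000), [2000, 2050)
bin_edges = list(range(1900, 2051, 50))
periods = pd.cut(years, bins=bin_edges, right=False)

# Number of earthquakes per period, in chronological order
period_counts = periods.value_counts().sort_index()
print(period_counts)
```

With `right=False`, a year such as 1950 falls in the 1950-1999 bin rather than the 1900-1949 bin.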

### 2.3.2. Latitude and Longitude

Latitude ranges from -90 to +90 and longitude from -180 to +180. This section contains the following subsections: 

    2.3.2.1. Minimum latitude
    2.3.2.2. Maximum latitude
    2.3.2.3. Minimum longitude
    2.3.2.4. Maximum longitude
    2.3.2.5. Entries with same latitude and longitude

#### 2.3.2.1. Minimum latitude

In [23]:
# Finding the minimum latitude value

print(univariate_analysis['Latitude'].min())

-62.877


In [26]:
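# Attempting to access the row with that value. The lookup matches nothing: the float64
# Latitude column is compared with a string, and the string is not the minimum printed above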

latitude_min = univariate_analysis.loc[univariate_analysis['Latitude'] == '-0.018000000000000002']
print(latitude_min)

Empty DataFrame
Columns: [Unnamed: 0, Year, Tsu, Vol, Name, Latitude, Longitude, Mag, MMI Int, Country, Region, UpdatedYear]
Index: []
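
The lookup above returns an empty DataFrame because `Latitude` is a float64 column, so comparing it with a string matches no rows, and the value tested is not the minimum printed in the previous cell. A sketch of a lookup that avoids typing the value by hand, using `idxmin()` and `idxmax()`. The sample rows are copied from the `tail()` output in 2.1, so the result applies to this sample only, not to the full dataset:

```python
import pandas as pd

# Sample rows copied from the tail() output in 2.1
sample = pd.DataFrame({
    "Name": ["PAKISTAN: MIRPUR DISTRICT", "INDONESIA: MALUKU: AMBON",
             "TURKEY: ISTANBUL", "CHILE: SOUTH CENTRAL", "CHILE: CONCEPCION"],
    "Latitude": [33.106, -3.450, 40.890, -40.815, -35.473],
}, index=[6139, 6140, 6141, 6142, 6143])

# idxmin()/idxmax() return the index label of the extreme value,
# so no float-to-string comparison is needed
row_min = sample.loc[[sample["Latitude"].idxmin()]]
row_max = sample.loc[[sample["Latitude"].idxmax()]]
print(row_min)
print(row_max)
```

The same pattern would apply to the `Longitude` column.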


#### 2.3.2.2. Maximum latitude

In [27]:
# Finding maximum latitude

print(univariate_analysis['Latitude'].max())

73.122


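The Latitude column is stored as float64 with no missing entries (see the `info()` output in 2.1), so 73.122 is the maximum latitude. As a check, the cell below drops any rows marked with '?' in Latitude, stores the result in another dataframe, and prints the maximum of every column. The Latitude entry confirms 73.122. The other entries are per-column maxima (for example, Name is the alphabetically last name), not the fields of a single row.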

In [28]:
# Creating a new dataframe with all the values except the missing entries in Latitude column

latitude_max = univariate_analysis[univariate_analysis.Latitude != '?']

# Printing the maximum of every column; the Latitude entry is the maximum latitude

print(latitude_max.max())

Unnamed: 0               6143
Year                     2019
Tsu                       Tsu
Vol                       Vol
Name           ZAMBIA: KAPUTA
Latitude               73.122
Longitude                 180
Mag                         ?
MMI Int                     ?
Country                ZAMBIA
UpdatedYear              4169
dtype: object


#### 2.3.2.3. Minimum longitude

In [29]:
# Finding the minimum longitude value

print(univariate_analysis['Longitude'].min())

-179.984


In [30]:
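# Attempting to access the row with that value. The lookup matches nothing: the float64
# Longitude column is compared with a string, and the string is not the minimum printed above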

longitude_min = univariate_analysis.loc[univariate_analysis['Longitude'] == '-0.05']
print(longitude_min)

Empty DataFrame
Columns: [Unnamed: 0, Year, Tsu, Vol, Name, Latitude, Longitude, Mag, MMI Int, Country, Region, UpdatedYear]
Index: []


#### 2.3.2.4. Maximum longitude

In [31]:
# Finding maximum longitude

print(univariate_analysis['Longitude'].max())

180.0


As with latitude, the Longitude column is float64 with no missing entries, so 180.0 is the maximum longitude. As a check, the cell below drops any rows marked with '?' in Longitude, stores the result in another dataframe, and prints the maximum of every column. The Longitude entry confirms 180.

In [32]:
# Creating a new dataframe with all the values except the missing entries in longitude column

longitude_max = univariate_analysis[univariate_analysis.Longitude != '?']

# Printing the maximum of every column; the Longitude entry is the maximum longitude

print(longitude_max.max())

Unnamed: 0               6143
Year                     2019
Tsu                       Tsu
Vol                       Vol
Name           ZAMBIA: KAPUTA
Latitude               73.122
Longitude                 180
Mag                         ?
MMI Int                     ?
Country                ZAMBIA
UpdatedYear              4169
dtype: object


#### 2.3.2.5. Entries with same latitude and longitude

The next piece of code checks whether multiple earthquakes occurred at exactly the same geographical coordinates.

In [35]:
# Creating a new dataframe with latitude and longitude values from the main dataframe univariate_analysis

multiple_earthquakes = pd.DataFrame(univariate_analysis, columns=['Latitude', 'Longitude'])

# Creating another dataframe with only the duplicate values from multiple_earthquakes

rows = multiple_earthquakes[multiple_earthquakes.duplicated()]
print(rows)

      Latitude  Longitude
6       35.683     35.800
15      37.000     22.500
34       0.000      0.000
43      32.000     35.500
60      36.100     36.100
...        ...        ...
4198    -5.000    130.000
4205   -29.900    -71.300
4276    39.000     44.000
4307     0.000      0.000
5530    23.970     97.569

[721 rows x 2 columns]


As a check, rows in which latitude, longitude, or both are missing (marked '?') are dropped. Both columns are float64 with no missing entries, so the result is unchanged at 721 rows.

In [37]:
# Creating a new dataframe by excluding missing values from Longitude column

rows_updated_longitude = rows[rows.Longitude != '?']
#print(rows_updated_longitude)

# Creating a new dataframe by excluding missing Latitude values from rows_updated_longitude

rows_updated_latitude = rows_updated_longitude[rows_updated_longitude.Latitude != '?']
print(rows_updated_latitude)

      Latitude  Longitude
6       35.683     35.800
15      37.000     22.500
34       0.000      0.000
43      32.000     35.500
60      36.100     36.100
...        ...        ...
4198    -5.000    130.000
4205   -29.900    -71.300
4276    39.000     44.000
4307     0.000      0.000
5530    23.970     97.569

[721 rows x 2 columns]
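
The 721 rows above are repeat entries: a location with three earthquakes contributes two rows. To count distinct locations with more than one earthquake, the coordinates could be grouped instead. A sketch on a hypothetical sample:

```python
import pandas as pd

# Hypothetical sample of coordinates
coords = pd.DataFrame({
    "Latitude":  [35.683, 35.683, 0.0, 0.0, 0.0, 23.97],
    "Longitude": [35.8,   35.8,   0.0, 0.0, 0.0, 97.569],
})

# Number of events at each distinct coordinate pair
events_per_location = coords.groupby(["Latitude", "Longitude"]).size()

# Locations with more than one event
repeated_locations = events_per_location[events_per_location > 1]
print(len(repeated_locations))    # distinct repeated locations
print(coords.duplicated().sum())  # repeat rows, as counted by duplicated()
```

In this sample, two distinct locations repeat, while `duplicated()` reports three repeat rows.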


### 2.3.3. Magnitude

Magnitude represents the energy released from the earthquake at its source. This section contains: 

    2.3.3.1. Minimum magnitude
    2.3.3.2. Maximum magnitude
    2.3.3.3. Frequency of magnitudes
    2.3.3.4. Histogram of magnitudes

In [38]:
# Overview of the magnitude column

print(univariate_analysis['Mag'].describe())

count     6144
unique      65
top          ?
freq      1783
Name: Mag, dtype: object
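
Because `Mag` holds strings (including `?`), `describe()` reports only count, unique, top, and freq, and `min()`/`max()` compare strings character by character. A sketch of converting the column to numbers first, assuming `?` is the only non-numeric marker. The sample values are hypothetical:

```python
import pandas as pd

# Hypothetical sample using '?' for missing magnitudes, stored as strings
mag = pd.Series(["7.3", "?", "7.1", "9.5", "1.6", "?"], name="Mag")

# errors='coerce' turns anything non-numeric (here '?') into NaN
mag_numeric = pd.to_numeric(mag, errors="coerce")

print(mag_numeric.isna().sum())              # number of missing magnitudes
print(mag_numeric.min(), mag_numeric.max())  # NaN is skipped by default
print(mag_numeric.describe())                # now includes mean, std and quartiles
```

With a numeric column, the maximum can be taken without first filtering out the `?` rows.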


#### 2.3.3.1. Minimum magnitude

In [39]:
# Finding the minimum magnitude

print(univariate_analysis['Mag'].min())

1.6


In [40]:
# Accessing the entire row of this magnitude

magnitude_min = univariate_analysis.loc[univariate_analysis['Mag'] == '1.6']
print(magnitude_min)

      Unnamed: 0  Year Tsu Vol  Name  Latitude  Longitude  Mag MMI Int  \
5427        5427  2007   ?   ?  UTAH    39.464   -111.207  1.6       ?   

     Country Region  UpdatedYear  
5427    UTAH    NaN         4157  


#### 2.3.3.2. Maximum magnitude

In [41]:
# Creating a new dataframe by excluding missing entries from Mag column 

magnitude_maximum = univariate_analysis[univariate_analysis.Mag != '?']

# Printing the maximum value of the Mag column in the new dataframe

print(magnitude_maximum.Mag.max())

9.5


In [42]:
# Accessing the entire row of this magnitude


magnitude_max = univariate_analysis.loc[univariate_analysis['Mag'] == '9.5']
print(magnitude_max)

      Unnamed: 0  Year  Tsu  Vol                           Name  Latitude  \
3775        3775  1960  Tsu  Vol  CHILE: PUERTO MONTT, VALDIVIA   -38.143   

      Longitude  Mag MMI Int Country                   Region  UpdatedYear  
3775    -73.407  9.5    12.0   CHILE   PUERTO MONTT, VALDIVIA         4110  


#### 2.3.3.3. Frequency of magnitudes

In [43]:
# Counting the frequency of earthquake magnitudes

magnitude_count = univariate_analysis['Mag'].value_counts()
print(magnitude_count)

?      1783
7.5     259
7.0     224
6.0     203
6.5     199
       ... 
3.7       2
9.5       1
2.2       1
9.2       1
1.6       1
Name: Mag, Length: 65, dtype: int64


In [44]:
# Frequency of magnitudes in ascending order

print(univariate_analysis.groupby('Mag').size())

Mag
1.6       1
2.1       2
2.2       1
3.1       3
3.2       4
       ... 
9.0       2
9.1       3
9.2       1
9.5       1
?      1783
Length: 65, dtype: int64


#### 2.3.3.4. Histogram of magnitudes

In [45]:
# Plotting the histogram

histogram_magnitude = univariate_analysis['Mag']
fig = px.histogram(histogram_magnitude, x="Mag")
fig.update_layout(
    title_text='Histogram of magnitudes', # title of plot
    xaxis_title_text='Magnitude', # xaxis label
    yaxis_title_text='Count', # yaxis label
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)
fig.show()

In [46]:
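import plotly.graph_objects as go

# Each distinct magnitude (including '?') is a slice label; its frequency is the slice size
magnitude_count = univariate_analysis['Mag'].value_counts()
labels = magnitude_count.index
values = magnitude_count.values
print(len(values))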


65


In [47]:
# A pie chart of the distribution of magnitudes

fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.show()

## 2.4. Insights

### Univariate analysis of categorical variables 

    1. There are four categorical variables:  Country, Region, Tsu, and Vol.
    
#### Country

    1. The Country column has 340 unique values for significant earthquakes between -2150 and 2019. 
    2. Out of these 340, China has the highest number of earthquakes (598) over the entire period from -2150 to 2019. The trend in its count is analyzed in greater detail in the multivariate section.

#### Region

    1. 2872 unique regions have experienced earthquakes during the years -2150 to 2019 (2873 distinct values when the missing value NaN is counted). 
    2. Out of these, Yunnan Province in China has the highest number of earthquakes (144). 
    3. 2354 entries repeat a Name (country and region) that already appears earlier in the dataset: 6144 entries against 3790 unique names. This is the number of repeat entries, not the number of distinct places with more than one earthquake.
    
#### Tsu and Vol

    1. There are 1853 recorded tsunamis and 64 recorded volcanic eruptions. 
    2. There are 6080 entries marked '?' for Vol, which means that an eruption either did not occur or was not documented.
    

### Univariate analysis of numerical variables

#### Year

    1. Years 1950 - 1999 experienced the maximum number of earthquakes: 1464
    
#### Latitude and Longitude

    1. The minimum latitude in the dataset is -62.877 and the maximum latitude is 73.122.
    2. The minimum longitude in the dataset is -179.984 and the maximum longitude is 180.0.
    3. The places at these extremes are not identified in section 2.3.2. The row lookups tested values near zero (-0.018 for latitude, -0.05 for longitude), which are not the minima, and returned no rows. ZAMBIA: KAPUTA in the max() outputs is the alphabetically last Name, not the place at maximum latitude or longitude.
    4. 721 entries share their latitude and longitude with an earlier entry in the dataset.
   
#### Magnitude

    1. Magnitude 7.5 is the most frequent recorded magnitude, occurring 259 times. A further 1783 entries have no recorded magnitude.
    2. Utah experienced the lowest magnitude earthquake of 1.6 in 2007.
    3. Chile: Puerto Montt, Valdivia experienced the highest magnitude earthquake of 9.5 in 1960.    