# Exploratory Data Analysis (EDA) on Lightning Strike Data with Missing Values

This notebook explores lightning strike data collected by the **National Oceanic and Atmospheric Administration (NOAA)** for August 2018.

### Dataset 1
The first dataset comprises five columns:

| date | center_point_geom | longitude | latitude | number_of_strikes |
|------|-------------------|-----------|----------|-------------------|
|      |                   |           |          |                   |

- **date**: Date of the lightning strike.
- **center_point_geom**: Geographic location of the lightning strike.
- **longitude**: Longitude of the lightning strike.
- **latitude**: Latitude of the lightning strike.
- **number_of_strikes**: Number of strikes at that location.

### Dataset 2

The second dataset comprises seven columns:

| date | zip_code | city | state | state_code |  center_point_geom |  number_of_strikes |
|------|----------|------|-------|------------|--------------------|--------------------|
|      |          |      |       |            |                    |                    |

- **date**: Date of the lightning strike.
- **zip_code**: ZIP code of the location.
- **city**: Name of the city.
- **state**: Name of the state.
- **state_code**: Two-letter state code.
- **center_point_geom**: Geographic location of the lightning strike.
- **number_of_strikes**: Number of strikes at that location.


The second dataset has four columns that do not appear in the first: `zip_code`, `city`, `state`, and `state_code`.

Three columns are common to both datasets: `date`, `center_point_geom`, and `number_of_strikes`.
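The column comparison above can also be checked programmatically. The following is a minimal sketch, assuming two empty placeholder frames (`df1` and `df2` are illustrative names, not the notebook's variables) that carry only the column names listed in the tables:

```python
import pandas as pd

# Placeholder frames holding only the column names described above (no data).
cols1 = ["date", "center_point_geom", "longitude", "latitude", "number_of_strikes"]
cols2 = ["date", "zip_code", "city", "state", "state_code",
         "center_point_geom", "number_of_strikes"]
df1 = pd.DataFrame(columns=cols1)
df2 = pd.DataFrame(columns=cols2)

# Columns present in both frames, in the order they appear in df1.
common = [c for c in df1.columns if c in df2.columns]

# Columns that appear only in the second frame.
only_in_df2 = [c for c in df2.columns if c not in df1.columns]

print("Common:", common)
print("Only in second dataset:", only_in_df2)
```

The same two comprehensions should work unchanged on the loaded DataFrames once the CSV files have been read.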




In [1]:
# Import the necessary libraries

import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sb 

In [2]:
# Load the two datasets

Dataset1 = pd.read_csv('eda_missing_data_dataset1.csv')
Dataset2 = pd.read_csv('eda_missing_data_dataset2.csv')

In [3]:
Dataset1.head(5)

Unnamed: 0,date,center_point_geom,longitude,latitude,number_of_strikes
0,2018-08-01,POINT(-81.6 22.6),-81.6,22.6,48
1,2018-08-01,POINT(-81.1 22.6),-81.1,22.6,32
2,2018-08-01,POINT(-80.9 22.6),-80.9,22.6,118
3,2018-08-01,POINT(-80.8 22.6),-80.8,22.6,69
4,2018-08-01,POINT(-98.4 22.8),-98.4,22.8,44


In [4]:
Dataset2.head(5)

Unnamed: 0,date,zip_code,city,state,state_code,center_point_geom,number_of_strikes
0,2018-08-08,3281,Weare,New Hampshire,NH,POINT(-71.7 43.1),1
1,2018-08-14,6488,Heritage Village CDP,Connecticut,CT,POINT(-73.2 41.5),3
2,2018-08-16,97759,"Sisters city, Black Butte Ranch CDP",Oregon,OR,POINT(-121.4 44.3),3
3,2018-08-18,6776,New Milford CDP,Connecticut,CT,POINT(-73.4 41.6),48
4,2018-08-08,1077,Southwick,Massachusetts,MA,POINT(-72.8 42),2


In [8]:
# Number of rows and columns in each dataset

print(f"The number of rows and columns of Dataset1 is {Dataset1.shape}")
print(f"The number of rows and columns of Dataset2 is {Dataset2.shape}")

The number of rows and columns of Dataset1 is (717530, 5)
The number of rows and columns of Dataset2 is (323700, 7)


In [11]:
# Left-join the two datasets on date and location, then preview the result

Dataset_joined = Dataset1.merge(Dataset2, how='left', on=['date','center_point_geom'])
Dataset_joined.head(5)

Unnamed: 0,date,center_point_geom,longitude,latitude,number_of_strikes_x,zip_code,city,state,state_code,number_of_strikes_y
0,2018-08-01,POINT(-81.6 22.6),-81.6,22.6,48,,,,,
1,2018-08-01,POINT(-81.1 22.6),-81.1,22.6,32,,,,,
2,2018-08-01,POINT(-80.9 22.6),-80.9,22.6,118,,,,,
3,2018-08-01,POINT(-80.8 22.6),-80.8,22.6,69,,,,,
4,2018-08-01,POINT(-98.4 22.8),-98.4,22.8,44,,,,,


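The preview shows that rows from `Dataset1` with no match in `Dataset2` have missing values in every column that comes from `Dataset2` (`zip_code`, `city`, `state`, `state_code`, and `number_of_strikes_y`). Because `number_of_strikes` exists in both datasets but is not a join key, pandas suffixes the two copies as `number_of_strikes_x` and `number_of_strikes_y`.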
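One way to quantify how many rows found no match is the `indicator` parameter of `DataFrame.merge`, which adds a `_merge` column marking each row as `left_only`, `right_only`, or `both`. The following is a minimal sketch on made-up rows (the frames `left` and `right` and their values are illustrative, not taken from the datasets):

```python
import pandas as pd

# Illustrative stand-ins for the two datasets.
left = pd.DataFrame({
    "date": ["2018-08-01", "2018-08-01", "2018-08-08"],
    "center_point_geom": ["POINT(-81.6 22.6)", "POINT(-81.1 22.6)", "POINT(-71.7 43.1)"],
    "number_of_strikes": [48, 32, 1],
})
right = pd.DataFrame({
    "date": ["2018-08-08"],
    "center_point_geom": ["POINT(-71.7 43.1)"],
    "state_code": ["NH"],
    "number_of_strikes": [1],
})

# indicator=True adds a "_merge" column describing where each row came from.
joined = left.merge(right, how="left", on=["date", "center_point_geom"], indicator=True)

# Rows marked "left_only" had no matching (date, center_point_geom) pair in `right`.
unmatched = int((joined["_merge"] == "left_only").sum())

print(joined["_merge"].value_counts())
print("Rows without a match:", unmatched)
```

In a left join every row of the left frame is kept, so the `left_only` count equals the number of rows whose right-hand columns are missing.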

## Data Overview and Joining

Check for missing data in the joined dataset and preview the joined dataset information.

In [17]:
# Keep only the rows where state_code is missing
Dataset_null = Dataset_joined[pd.isnull(Dataset_joined['state_code'])]
print(f'The number of rows and columns of the null-value DataFrame is {Dataset_null.shape}')

The number of rows and columns of the null-value DataFrame is (393830, 10)


In [20]:
# Summarize the columns, dtypes, and non-null counts

Dataset_joined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 717530 entries, 0 to 717529
Data columns (total 10 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   date                 717530 non-null  object 
 1   center_point_geom    717530 non-null  object 
 2   longitude            717530 non-null  float64
 3   latitude             717530 non-null  float64
 4   number_of_strikes_x  717530 non-null  int64  
 5   zip_code             323700 non-null  float64
 6   city                 323700 non-null  object 
 7   state                323700 non-null  object 
 8   state_code           323700 non-null  object 
 9   number_of_strikes_y  323700 non-null  float64
dtypes: float64(4), int64(1), object(5)
memory usage: 60.2+ MB


The non-null count of `state_code` is 323700 out of 717530 rows, so 393830 rows are missing a state code. This matches the number of rows in `Dataset_null`.
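Counts like these can be cross-checked by computing the null and non-null totals per column, which should add up to the number of rows. The following is a minimal sketch on a toy frame (`toy` and its values are illustrative, not taken from the joined dataset):

```python
import pandas as pd

# Toy frame with some missing state codes.
toy = pd.DataFrame({
    "center_point_geom": ["POINT(-81.6 22.6)", "POINT(-71.7 43.1)", "POINT(-80.9 22.6)"],
    "state_code": [None, "NH", None],
})

null_counts = toy.isna().sum()       # missing values per column
non_null_counts = toy.notna().sum()  # present values per column
total_rows = len(toy)

print(null_counts)
print(non_null_counts)
print("Rows:", total_rows)
```

For any column, the two counts always sum to the row count, which makes it easy to spot a null count quoted as a non-null count.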

## Data Analysis

Create a new DataFrame containing only latitude, longitude, and the number of strikes for the rows with missing data, aggregate it, and plot the geographic position of those strikes using Plotly Express.

In [23]:
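# Build a frame of longitude, latitude, and strike count for the rows with missing data.
# Note: this groups by number_of_strikes_x, so longitude and latitude are summed across
# all locations sharing a strike count; grouping by ['latitude', 'longitude'] matches the intent.

Dataset_null_Lat_Long = (
    Dataset_null[['longitude', 'latitude', 'number_of_strikes_x']]
    .groupby(by='number_of_strikes_x')
    .sum()
    .sort_values('number_of_strikes_x', ascending=False)
    .reset_index()
)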

In [24]:
#Preview it 

Dataset_null_Lat_Long

Unnamed: 0,number_of_strikes_x,longitude,latitude
0,1494,-80.2,21.3
1,1260,-85.5,24.3
2,1044,-75.5,36.9
3,1026,-80.4,22.4
4,1023,-84.2,22.2
...,...,...,...
458,5,-794464.0,270635.3
459,4,-1015619.5,345967.1
460,3,-6883831.3,2366209.4
461,2,-2246552.8,766160.6
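The last rows of this preview show coordinate values far outside the valid longitude and latitude ranges. This is what happens when the grouping key is `number_of_strikes_x`: the coordinates of every location that shares a strike count are summed. The following is a minimal sketch, on toy data, of grouping by location instead and summing the strikes (the frame `toy`, its values, and the names `by_location` and `by_count` are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy rows: the first two share a location; the last two share a strike count.
toy = pd.DataFrame({
    "longitude": [-80.2, -80.2, -98.4, -85.5],
    "latitude": [21.3, 21.3, 22.8, 24.3],
    "number_of_strikes_x": [7, 4, 5, 5],
})

# Group by location and sum the strikes recorded at each point.
by_location = (
    toy.groupby(["latitude", "longitude"], as_index=False)["number_of_strikes_x"]
    .sum()
    .sort_values("number_of_strikes_x", ascending=False)
    .reset_index(drop=True)
)

# For contrast: grouping by strike count sums the coordinates themselves.
by_count = toy.groupby("number_of_strikes_x").sum().reset_index()

print(by_location)
print(by_count)
```

In `by_location` each row is a real point with its total strike count, whereas in `by_count` the row for a strike count of 5 has a longitude below -180, which cannot be plotted meaningfully.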


In [28]:
# Plot the locations of the lightning strikes with missing data

import plotly.express as pltexp

Figure = pltexp.scatter_geo(
    Dataset_null_Lat_Long[Dataset_null_Lat_Long.number_of_strikes_x >= 300],
    lat='latitude',
    lon='longitude',
    size='number_of_strikes_x'
)

Figure.update_layout(
    title_text='Missing data',
    geo_scope='usa'
)

Figure.show()






## Conclusion

It can be observed that most of the lightning strikes with missing location data occurred over water, which explains why `zip_code`, `city`, `state`, and `state_code` are missing for those rows.