# Work with missing data in Python.
How to identify and decide how to work with missing data in a Python Notebook.
- Data
    - A group of NOAA lightning strike datasets: two slices of data from August 2018.
- Goal:
    - Identify missing data.
    - Compare the two datasets to find where data is missing.

In [1]:
# import Python libraries and packages.
import numpy as np
import pandas as pd
import seaborn as sns
import datetime
from matplotlib import pyplot as plt

In [None]:
# import the first dataset
df = pd.read_csv('put the first dataset here.')

### Look only at the number of lightning strikes for August 2018.

In [None]:
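# Let's explore the columns and the overall size of the first dataset.
# The shape is printed explicitly because a cell only displays its last expression.
print(df.shape)  # (rows, columns)

# First five rows of the dataset.
df.head()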

In [None]:
# we'll use the second dataset for August 2018.
df_zip = pd.read_csv('place df2 here')

In [None]:
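# Explore the columns and the overall size of the second DataFrame.
# The shape is printed explicitly because a cell only displays its last expression.
print(df_zip.shape)  # (rows, columns)
df_zip.head()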

In [None]:
# Let's merge the two DataFrames with a left join on date and location.
df_joined = df.merge(df_zip, how='left', on=['date', 'center_point_geom'])
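
A left join keeps every row of the left DataFrame and fills the right-hand columns with `NaN` wherever no match exists, which is where the missing values in this notebook come from. The sketch below illustrates this on a small, made-up example; the toy rows and the `strikes`/`zips` names are assumptions for illustration, not the real NOAA data. It also shows why a column present in both inputs ends up with `_x` and `_y` suffixes.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror this notebook.
strikes = pd.DataFrame({
    "date": ["2018-08-01", "2018-08-01", "2018-08-02"],
    "center_point_geom": ["POINT(-81.9 22.2)", "POINT(-94.5 35.1)", "POINT(-70.0 40.0)"],
    "number_of_strikes": [48, 12, 7],
})
zips = pd.DataFrame({
    "date": ["2018-08-01"],
    "center_point_geom": ["POINT(-94.5 35.1)"],
    "state_code": ["AR"],
    "number_of_strikes": [12],
})

# indicator=True adds a `_merge` column that records where each row came from.
joined = strikes.merge(
    zips, how="left", on=["date", "center_point_geom"], indicator=True
)
print(joined)

# Rows found only in the left DataFrame have NaN in the right-hand columns.
unmatched = int((joined["_merge"] == "left_only").sum())
print(unmatched)
```

The `indicator` flag is optional, but it can make it easier to tell rows that failed to match apart from rows whose values were already missing before the merge.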

In [None]:
# explore the dataset.
df_joined.head()

In [None]:
# Summary statistics: a lower `count` in some columns hints at missing values.
df_joined.describe()
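
As a rough illustration of how `describe()` can reveal missing values, the sketch below compares the `count` row with the total number of rows. The toy values are made up for this example. Note that `describe()` summarizes only numeric columns by default, so a text column such as `state_code` would not appear in it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the merged DataFrame.
toy = pd.DataFrame({
    "number_of_strikes_x": [48, 12, 7, 30],
    "number_of_strikes_y": [np.nan, 12.0, np.nan, 30.0],
})

summary = toy.describe()

# `count` excludes NaN, so the gap to the row total is the number of missing values.
missing_per_column = len(toy) - summary.loc["count"]
print(missing_per_column)
```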

In [None]:
# Select the rows where state_code is missing.
df_null_geo = df_joined[pd.isnull(df_joined.state_code)]
df_null_geo.shape # (rows with a missing state_code, total columns)
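
Filtering on one column shows the rows affected, but it can also help to count the missing values in every column at once. The following is a minimal sketch on made-up data; the `zip_code` column and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the merged DataFrame.
toy = pd.DataFrame({
    "state_code": ["AR", None, "TX", None],
    "zip_code": ["72901", None, "75001", None],
    "number_of_strikes_x": [12, 48, 30, 7],
})

# Number of missing values in each column.
null_counts = toy.isna().sum()

# Fraction of rows that are missing in each column.
null_share = toy.isna().mean()

# Same row filter as in the cell above, written with the Series method.
rows_missing_state = toy[toy["state_code"].isna()]

print(null_counts)
print(null_share)
print(rows_missing_state.shape)
```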

In [None]:
# Column dtypes and non-null counts for the rows with a missing state_code.
df_null_geo.info()

In [None]:
# Let's look at which values are missing in these rows.
df_null_geo.head()

## Learn how these missing values impact our data.
How? By creating a data visualization.
- A plotted map will help us see where the majority of the missing values are located geographically.

In [None]:
# To build the map, let's total the strikes in the rows with missing data
# by latitude and longitude, sorted from most to fewest strikes.
top_missing = (
    df_null_geo[['latitude', 'longitude', 'number_of_strikes_x']]
    .groupby(['latitude', 'longitude'])
    .sum()
    .sort_values('number_of_strikes_x', ascending=False)
    .reset_index()
)
top_missing.head(10)

In [None]:
# Let's plot the locations with missing data that have at least 300 strikes.
import plotly.express as px

fig = px.scatter_geo(top_missing[top_missing.number_of_strikes_x>=300],
                    lat='latitude',
                    lon='longitude',
                    size='number_of_strikes_x')
fig.update_layout(
    title_text = 'Missing data',
)

fig.show()

In [None]:
# Let's plot the data again on a smaller scale by limiting the geographic scope to the USA.
# This shows that the missing data sits along borders or over bodies of water such as lakes.
import plotly.express as px

fig = px.scatter_geo(top_missing[top_missing.number_of_strikes_x>=300],
                    lat='latitude',
                    lon='longitude',
                    size='number_of_strikes_x')
fig.update_layout(
    title_text = 'Missing data',
    geo_scope='usa',
)

fig.show()

Given that the missing data occurs where there are no zip codes, it makes sense that it appears around borders and near bodies of water. You'll also notice some other locations with missing data that are not over bodies of water; they are on land. For these, you'll need to reach out to NOAA.