## Integrated Public Alert and Warning System Data

The Integrated Public Alert and Warning System (IPAWS) Archived Alerts V1 dataset from FEMA provides historical data on public alerts and warnings issued through the IPAWS system. This system is a national platform designed to disseminate emergency alerts to the public through various channels, including radio, television, cell phones, and the internet.

The dataset documents past alerts issued for emergencies such as natural disasters, public safety incidents, and other critical events. It supports research, analysis, and improvement of emergency communication and response.

It contains several important data points, including:

- Issuing agency or authority
- Date and time of alert issuance
- Type of alert (e.g., AMBER Alert, Severe Weather, Evacuation Notice)
- Geographical coverage (e.g., counties, states, ZIP codes)
- Alert language and category
- CAP (Common Alerting Protocol) elements such as urgency, severity, and certainty

It also links to full alert messages or additional resources.

The data is publicly available as a machine-readable dataset, in formats including JSON and CSV, for integration with analytical tools.

### Data Usage

Here, we extract the data from its API and clean it for our use case of **Decision Paralysis** for *Disaster Response and Emergency Management*.

This is a very large dataset, so for proof-of-concept purposes we fetch only part of it, bounded by the RAM available on a typical PC.

### Data Extraction

In [None]:
!pip install psutil



In [None]:
import requests
import time
import psutil

# Initializing parameters
url = "https://www.fema.gov/api/open/v1/IpawsArchivedAlerts"
top = 1000  # Number of records to fetch per request
skip = 0    # Starting point for records to skip
all_records = []

# Setting the timeout duration (20 minutes)
timeout_duration = 20 * 60  # 20 minutes in seconds
start_time = time.time()  # Record the start time

# Define a RAM usage limit in bytes (e.g., 70% of total RAM)
ram_limit = psutil.virtual_memory().total * 0.7

# Checking current RAM usage and returning True if usage exceeds the limit.
def check_ram_usage():
    used_ram = psutil.virtual_memory().used
    return used_ram > ram_limit

try:
    while True:
        # Check for timeout
        elapsed_time = time.time() - start_time
        if elapsed_time > timeout_duration:
            print("Timeout occurred. Stopping execution.")
            break

        # Check for RAM usage
        if check_ram_usage():
            print("RAM usage exceeded the limit. Stopping execution.")
            break

        # Constructing the API request URL with $skip and $top parameters
        request_url = f"{url}?$skip={skip}&$top={top}"
        # A per-request timeout keeps a stalled connection from blocking the loop
        response = requests.get(request_url, timeout=60)

        if response.status_code == 200:
            data = response.json()
            alerts = data.get('IpawsArchivedAlerts', [])
            if not alerts:
                # No more records to fetch
                break
            all_records.extend(alerts)

            # Increment skip by the number of records fetched
            skip += len(alerts)
        else:
            print(f"Error: {response.status_code}")
            break
except Exception as e:
    print(f"An error occurred: {e}")

print(f"Total records fetched: {len(all_records)}")

RAM usage exceeded the limit. Stopping execution.
Total records fetched: 414000
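
The `$skip`/`$top` loop above can also be written as a generator that is independent of the HTTP call, which makes the paging logic testable offline. This is only a sketch: `paginate`, `fetch_page`, `fake_fetch`, and `max_records` are hypothetical names introduced for illustration, and the record cap stands in for the RAM and time limits used above.

```python
from typing import Callable, Dict, Iterator, List

Record = Dict[str, object]


def paginate(fetch_page: Callable[[int, int], List[Record]],
             top: int = 1000,
             max_records: int = 10_000) -> Iterator[Record]:
    """Yield records page by page until a page is empty or max_records is reached."""
    skip = 0
    while skip < max_records:
        # Never request more than the remaining budget
        page = fetch_page(skip, min(top, max_records - skip))
        if not page:
            break  # No more records to fetch
        for record in page:
            yield record
        skip += len(page)


def fake_fetch(skip: int, top: int) -> List[Record]:
    """Offline stand-in for the API (assumption): serves 2,500 numbered records."""
    total = 2500
    return [{"id": i} for i in range(skip, min(skip + top, total))]


records = list(paginate(fake_fetch, top=1000, max_records=10_000))
print(len(records))
```

In the notebook, `fetch_page` would wrap the `requests.get` call shown above and return the `IpawsArchivedAlerts` list from the JSON response.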


### Data Preprocessing

In [None]:
import pandas as pd
df = pd.DataFrame(all_records)
print(len(df))

414000


Expanding the fields nested inside the `info` attribute into flat columns

In [None]:
def extract_info(row):
    info = row.get('info')  # Get the 'info' column value
    if not info or not isinstance(info, list):  # Check if info is valid and a list
        return pd.Series(dtype=object)  # Return empty Series (explicit dtype) for invalid rows

    # Extract the first dictionary from the list
    info = info[0] if len(info) > 0 else {}

    result = {
        'web': info.get('web'),
        'event': info.get('event'),
        'onset': info.get('onset'),
        'expires': info.get('expires'),
        'urgency': info.get('urgency'),
        'category': ', '.join(info.get('category', [])),  # Join list to single string if needed
        'headline': info.get('headline'),
        'severity': info.get('severity'),
        'certainty': info.get('certainty'),
        'effective': info.get('effective'),
        'senderName': info.get('senderName'),
        'description': info.get('description'),
        'instruction': info.get('instruction'),
        'responseType': ', '.join(info.get('responseType', [])),
    }

    # Extract nested areas information
    area_info = info.get('areas', [{}])[0] if info.get('areas') else {}
    result['areaDesc'] = area_info.get('areaDesc')

    # Extract polygon coordinates if present
    polygon = area_info.get('polygon', {})
    result['polygon_type'] = polygon.get('type')
    result['polygon_coordinates'] = polygon.get('coordinates')

    # Extract eventCode information
    event_codes = info.get('eventCode', [])
    for i, event_code in enumerate(event_codes):
        result[f'eventCode_{i}_name'] = event_code.get('name')
        result[f'eventCode_{i}_value'] = event_code.get('value')

    # Extract parameters information (one column per parameter name)
    parameters = info.get('parameters', [])
    for param in parameters:
        result[f'param_{param.get("name")}'] = param.get('value')

    return pd.Series(result)

# Apply function to DataFrame
extracted_info = df.apply(extract_info, axis=1)
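
As a possible alternative to a row-wise `apply` that returns a `pd.Series` per row, the flattening can be done on plain dictionaries first and turned into a DataFrame once. The sketch below covers only a subset of the fields; `flatten_info` is a hypothetical helper, not part of the original notebook. It also uses `or []` so that a key present with a `None` value does not break the `join`.

```python
from typing import Any, Dict


def flatten_info(record: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten the first `info` block of one alert into a flat dict."""
    info_list = record.get("info")
    if not isinstance(info_list, list) or not info_list:
        return {}  # Invalid or missing info: no extracted fields
    info = info_list[0] or {}

    # Scalar fields are copied as-is (None when absent)
    flat = {key: info.get(key) for key in
            ("event", "headline", "description", "instruction",
             "urgency", "severity", "certainty", "effective", "senderName")}

    # List-valued fields are joined into a single string
    flat["category"] = ", ".join(info.get("category") or [])
    flat["responseType"] = ", ".join(info.get("responseType") or [])

    # Only the first area is kept, as in the function above
    areas = info.get("areas") or [{}]
    flat["areaDesc"] = areas[0].get("areaDesc")
    return flat


sample = {"info": [{"event": "Gale Warning", "category": ["Met"],
                    "areas": [{"areaDesc": "Kiska to Attu"}]}]}
print(flatten_info(sample)["areaDesc"])
```

With the fetched data this could be used as `pd.DataFrame([flatten_info(r) for r in all_records])`, which builds the frame in one step.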

In [None]:
# Combine the extracted columns with the original DataFrame. drop() must not use
# inplace=True here: it would return None, which pd.concat silently skips.
dropped_columns = ['info', 'originalMessage', 'sender', 'source', 'scope', 'restriction', 'addresses', 'code', 'note', 'incidents', 'cogId', 'xmlns']
df = pd.concat([df.drop(columns=dropped_columns), extracted_info], axis=1)


In [None]:
df

Unnamed: 0,areaDesc,category,certainty,description,effective,event,eventCode_0_name,eventCode_0_value,eventCode_1_name,eventCode_1_value,...,param_waterspoutDetection,param_windGust,param_windThreat,polygon_coordinates,polygon_type,responseType,senderName,severity,urgency,web
0,Port Heiden to Cape Sarichef,Met,Observed,AAA\n\n.TONIGHT...SW wind 40 kt increasing to ...,2021-10-10T18:33:00-08:00,Storm Warning,SAME,NWS,NationalWeatherService,SRW,...,,,,,,Avoid,NWS Anchorage AK,Severe,Immediate,http://www.weather.gov
1,Kiska to Attu,Met,Likely,AAA\n\n.TONIGHT...SW wind 20 kt becoming S 45 ...,2021-10-10T18:33:00-08:00,Gale Warning,SAME,NWS,NationalWeatherService,GLW,...,,,,,,Avoid,NWS Anchorage AK,Moderate,Expected,http://www.weather.gov
2,"Carter, OK; Coal, OK; Garvin, OK; Johnston, OK...",Met,Observed,The National Weather Service in Norman has iss...,2021-10-10T20:59:00-05:00,Severe Thunderstorm Warning,SAME,SVR,NationalWeatherService,SVW,...,,,RADAR INDICATED,"[[[-96.43, 34.41], [-97.36, 34.19], [-97.24, 3...",Polygon,Shelter,NWS Norman OK,Severe,Immediate,http://www.weather.gov
3,Green Bay south of line from Cedar River to R...,Met,Likely,Winds have subsided over the Bay of Green Bay ...,2021-10-10T21:45:00-05:00,Small Craft Advisory,SAME,NWS,NationalWeatherService,SCY,...,,,,,,Avoid,NWS Green Bay WI,Minor,Expected,http://www.weather.gov
4,Coastal waters from Baffin Bay to Port Aransas...,Met,Likely,* WHAT...South winds 20 to 25 knots with gusts...,2021-10-10T21:56:00-05:00,Small Craft Advisory,SAME,NWS,NationalWeatherService,SCY,...,,,,,,Avoid,NWS Corpus Christi TX,Minor,Expected,http://www.weather.gov
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
413995,Marion; Wahkiakum; Clackamas; Columbia; Washin...,Met,Likely,THE NATIONAL WEATHER SERVICE IN PORTLAND HAS C...,2016-12-17T14:37:34-08:00,Winter Weather Advisory,SAME,NWS,NationalWeatherService,WWY,...,,,,,,Execute,NWS Portland OR,Moderate,Expected,http://www.weather.gov
413996,Marion,Met,Possible,KSC115-252200-\n/O.CAN.KICT.SV.A.0520.000000T0...,2016-12-25T14:58:14-06:00,Severe Thunderstorm Watch,SAME,SVA,,,...,,,,,,Monitor,NWS Wichita KS,Severe,Future,http://www.weather.gov
413997,Rich; Utah; Weber; Wasatch; Cache; Morgan; Jua...,Met,Likely,* AFFECTED AREA...THE WASATCH MOUNTAINS OF NOR...,2016-12-25T15:33:36-07:00,Winter Storm Warning,SAME,WSW,,,...,,,,,,Prepare,NWS Salt Lake City UT,Severe,Expected,http://www.weather.gov
413998,Marquette to Munising MI; Huron Islands to Mar...,Met,Likely,A SMALL CRAFT ADVISORY REMAINS IN EFFECT UNTIL...,2016-12-27T10:29:49-05:00,Small Craft Advisory,SAME,NWS,NationalWeatherService,SCY,...,,,,,,Avoid,NWS Marquette MI,Minor,Expected,http://www.weather.gov


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414000 entries, 0 to 413999
Data columns (total 80 columns):
 #   Column                          Non-Null Count   Dtype 
---  ------                          --------------   ----- 
 0   areaDesc                        378233 non-null  object
 1   category                        413805 non-null  object
 2   certainty                       413805 non-null  object
 3   description                     413183 non-null  object
 4   effective                       413007 non-null  object
 5   event                           413805 non-null  object
 6   eventCode_0_name                413805 non-null  object
 7   eventCode_0_value               413805 non-null  object
 8   eventCode_1_name                279535 non-null  object
 9   eventCode_1_value               279535 non-null  object
 10  expires                         413805 non-null  object
 11  headline                        413530 non-null  object
 12  instruction                   

#### Extracting Relevant Columns

In [None]:
important_columns = ['areaDesc', 'category', 'certainty', 'description', 'effective', 'event', 'headline', 'instruction', 'responseType', 'senderName', 'severity', 'urgency']
df = df[important_columns]
df

Unnamed: 0,areaDesc,category,certainty,description,effective,event,headline,instruction,responseType,senderName,severity,urgency
0,Port Heiden to Cape Sarichef,Met,Observed,AAA\n\n.TONIGHT...SW wind 40 kt increasing to ...,2021-10-10T18:33:00-08:00,Storm Warning,Storm Warning issued October 10 at 6:33PM AKDT...,,Avoid,NWS Anchorage AK,Severe,Immediate
1,Kiska to Attu,Met,Likely,AAA\n\n.TONIGHT...SW wind 20 kt becoming S 45 ...,2021-10-10T18:33:00-08:00,Gale Warning,Gale Warning issued October 10 at 6:33PM AKDT ...,,Avoid,NWS Anchorage AK,Moderate,Expected
2,"Carter, OK; Coal, OK; Garvin, OK; Johnston, OK...",Met,Observed,The National Weather Service in Norman has iss...,2021-10-10T20:59:00-05:00,Severe Thunderstorm Warning,Severe Thunderstorm Warning issued October 10 ...,A Tornado Watch remains in effect for the warn...,Shelter,NWS Norman OK,Severe,Immediate
3,Green Bay south of line from Cedar River to R...,Met,Likely,Winds have subsided over the Bay of Green Bay ...,2021-10-10T21:45:00-05:00,Small Craft Advisory,Small Craft Advisory issued October 10 at 9:45...,,Avoid,NWS Green Bay WI,Minor,Expected
4,Coastal waters from Baffin Bay to Port Aransas...,Met,Likely,* WHAT...South winds 20 to 25 knots with gusts...,2021-10-10T21:56:00-05:00,Small Craft Advisory,Small Craft Advisory issued October 10 at 9:56...,"Inexperienced mariners, especially those opera...",Avoid,NWS Corpus Christi TX,Minor,Expected
...,...,...,...,...,...,...,...,...,...,...,...,...
413995,Marion; Wahkiakum; Clackamas; Columbia; Washin...,Met,Likely,THE NATIONAL WEATHER SERVICE IN PORTLAND HAS C...,2016-12-17T14:37:34-08:00,Winter Weather Advisory,Winter Weather Advisory issued December 17 at ...,\n\n,Execute,NWS Portland OR,Moderate,Expected
413996,Marion,Met,Possible,KSC115-252200-\n/O.CAN.KICT.SV.A.0520.000000T0...,2016-12-25T14:58:14-06:00,Severe Thunderstorm Watch,Severe Thunderstorm Watch issued December 25 a...,\n\n,Monitor,NWS Wichita KS,Severe,Future
413997,Rich; Utah; Weber; Wasatch; Cache; Morgan; Jua...,Met,Likely,* AFFECTED AREA...THE WASATCH MOUNTAINS OF NOR...,2016-12-25T15:33:36-07:00,Winter Storm Warning,Winter Storm Warning issued December 25 at 3:3...,A WINTER STORM WARNING FOR HEAVY SNOW MEANS TH...,Prepare,NWS Salt Lake City UT,Severe,Expected
413998,Marquette to Munising MI; Huron Islands to Mar...,Met,Likely,A SMALL CRAFT ADVISORY REMAINS IN EFFECT UNTIL...,2016-12-27T10:29:49-05:00,Small Craft Advisory,Small Craft Advisory issued December 27 at 10:...,A SMALL CRAFT ADVISORY MEANS THAT WIND SPEEDS ...,Avoid,NWS Marquette MI,Minor,Expected


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414000 entries, 0 to 413999
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   areaDesc      378233 non-null  object
 1   category      413805 non-null  object
 2   certainty     413805 non-null  object
 3   description   413183 non-null  object
 4   effective     413007 non-null  object
 5   event         413805 non-null  object
 6   headline      413530 non-null  object
 7   instruction   369840 non-null  object
 8   responseType  413805 non-null  object
 9   senderName    413790 non-null  object
 10  severity      413805 non-null  object
 11  urgency       413805 non-null  object
dtypes: object(12)
memory usage: 37.9+ MB


In [None]:
df = df.dropna(subset=['category'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 413805 entries, 0 to 413999
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   areaDesc      378233 non-null  object
 1   category      413805 non-null  object
 2   certainty     413805 non-null  object
 3   description   413183 non-null  object
 4   effective     413007 non-null  object
 5   event         413805 non-null  object
 6   headline      413530 non-null  object
 7   instruction   369840 non-null  object
 8   responseType  413805 non-null  object
 9   senderName    413790 non-null  object
 10  severity      413805 non-null  object
 11  urgency       413805 non-null  object
dtypes: object(12)
memory usage: 41.0+ MB


In [None]:
df = df.dropna(subset=['areaDesc'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 378233 entries, 0 to 413999
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   areaDesc      378233 non-null  object
 1   category      378233 non-null  object
 2   certainty     378233 non-null  object
 3   description   377813 non-null  object
 4   effective     377700 non-null  object
 5   event         378233 non-null  object
 6   headline      378041 non-null  object
 7   instruction   343739 non-null  object
 8   responseType  378233 non-null  object
 9   senderName    378218 non-null  object
 10  severity      378233 non-null  object
 11  urgency       378233 non-null  object
dtypes: object(12)
memory usage: 37.5+ MB


In [None]:
df = df.dropna(subset=['senderName'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 378218 entries, 0 to 413999
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   areaDesc      378218 non-null  object
 1   category      378218 non-null  object
 2   certainty     378218 non-null  object
 3   description   377801 non-null  object
 4   effective     377696 non-null  object
 5   event         378218 non-null  object
 6   headline      378037 non-null  object
 7   instruction   343737 non-null  object
 8   responseType  378218 non-null  object
 9   senderName    378218 non-null  object
 10  severity      378218 non-null  object
 11  urgency       378218 non-null  object
dtypes: object(12)
memory usage: 37.5+ MB


In [None]:
unique_counts = df.nunique()
print(unique_counts)

areaDesc         95212
category            22
certainty            5
description     301009
effective       257084
event              206
headline        281505
instruction      38560
responseType        13
senderName         855
severity             5
urgency              5
dtype: int64


In [None]:
df['category'].unique()

array(['Met', 'Other', 'Rescue', 'Safety', 'Security', 'Geo', 'Health',
       'Infra', 'Env', 'Transport', 'Fire', 'Safety, Geo',
       'Safety, Rescue', 'Met, Safety', 'Met, Health', 'Other, Safety',
       'Met, Safety, Health', 'Safety, Health', 'Safety, Met',
       'Safety, Security', 'Safety, Other', 'Health, Env'], dtype=object)
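
The list above contains both `'Met, Safety'` and `'Safety, Met'`, which describe the same pair of categories in a different order. One way to merge such variants is to sort the parts of each joined string. This is a sketch; `normalize_categories` is a hypothetical helper.

```python
def normalize_categories(value: str) -> str:
    """Return the comma-separated categories de-duplicated and in sorted order."""
    parts = sorted({part.strip() for part in value.split(",") if part.strip()})
    return ", ".join(parts)


print(normalize_categories("Safety, Met"))
print(normalize_categories("Met, Safety"))
```

Applied with `df['category'].map(normalize_categories)`, both spellings would collapse to `'Met, Safety'`.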

In [None]:
df['certainty'].unique()

array(['Observed', 'Likely', 'Possible', 'Unknown', 'Unlikely'],
      dtype=object)

In [None]:
df['event'].unique()

       'Small Craft Advisory', 'Tornado Watch',
       'Marine Weather Statement', 'Coastal Flood Advisory',
       'High Surf Advisory', 'Flood Watch', 'required monthly test',
       'Flash Flood Watch', 'Coastal Flood Statement',
       'Child Abduction Emergency', 'Hard Freeze Watch',
       'required weekly test', 'Dense Fog Advisory',
       'Local Area Emergency', 'Civil Emergency Message',
       'Required Weekly Test', 'Blowing Dust Advisory',
       'Required Monthly Test', 'REQUIRED WEEKLY TEST', 'RWT',
       'The Police Unity Tour will be visiting D', 'Dust Advisory',
       'practice/demo message', 'Law Enforcement Blue Alert',
       'Low Water Advisory', 'Severe Thunderstorm Watch',
       'Administrative Message', 'Routine Weekly Test',
       'Water Main Break in Denton', 'IPAWS Amber Alert Template',
       'Hydrologic Outlook', 'Lakeshore Flood Advisory', 'Storm Watch',
       'Hazardous Seas Watch', 'Air Stagnation Advisory',
       'Silver Alert Raul Hernandez Dye

Removing test alerts, which are irrelevant for our use case

In [None]:
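# Drop rows where 'event' column contains the word 'test' (case-insensitive).
# .copy() makes filtered_df an independent DataFrame, so the column assignments
# below do not trigger SettingWithCopyWarning.
filtered_df = df[~df['event'].str.contains('test', case=False, na=False)].copy()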

# Display the updated dataframe
filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 375714 entries, 0 to 413999
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   areaDesc      375714 non-null  object
 1   category      375714 non-null  object
 2   certainty     375714 non-null  object
 3   description   375324 non-null  object
 4   effective     375444 non-null  object
 5   event         375714 non-null  object
 6   headline      375552 non-null  object
 7   instruction   342784 non-null  object
 8   responseType  375714 non-null  object
 9   senderName    375714 non-null  object
 10  severity      375714 non-null  object
 11  urgency       375714 non-null  object
dtypes: object(12)
memory usage: 37.3+ MB
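
The substring filter above keeps events such as `'RWT'` and `'practice/demo message'`, which appear in the event list and also look like test traffic. A broader, word-boundary pattern is sketched below. Which events count as tests is a judgment call, and the pattern, `TEST_EVENT_PATTERN`, and `is_test_event` are assumptions for illustration (RWT and RMT are the EAS codes for Required Weekly Test and Required Monthly Test).

```python
import re

# Whole-word match, so a word like "Protest" does not count as "test"
TEST_EVENT_PATTERN = re.compile(r"\b(test|rwt|rmt|practice|demo)\b", re.IGNORECASE)


def is_test_event(event: str) -> bool:
    """Return True if the event name looks like a test or practice message."""
    return bool(TEST_EVENT_PATTERN.search(event))


for name in ["REQUIRED WEEKLY TEST", "RWT", "practice/demo message",
             "Severe Thunderstorm Watch"]:
    print(name, is_test_event(name))
```

With pandas this could be applied as `df[~df['event'].map(is_test_event)]`.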


In [None]:
unique_counts = filtered_df.nunique()
print(unique_counts)

areaDesc         94988
category            21
certainty            5
description     300473
effective       254945
event              174
headline        281314
instruction      38432
responseType        13
senderName         595
severity             5
urgency              5
dtype: int64


In [None]:
filtered_df['responseType'].unique()

array(['Avoid', 'Shelter', 'AllClear', 'Execute', 'Monitor', 'Prepare',
       '', 'None', 'Evacuate', 'Prepare, Monitor', 'Assess',
       'Prepare, Monitor, Avoid', 'Prepare, Avoid'], dtype=object)

In [None]:
filtered_df['severity'].unique()

array(['Severe', 'Moderate', 'Minor', 'Extreme', 'Unknown'], dtype=object)

In [None]:
filtered_df['urgency'].unique()

array(['Immediate', 'Expected', 'Past', 'Future', 'Unknown'], dtype=object)

In [None]:
filtered_df['description'].head(10)

Unnamed: 0,description
0,AAA\n\n.TONIGHT...SW wind 40 kt increasing to ...
1,AAA\n\n.TONIGHT...SW wind 20 kt becoming S 45 ...
2,The National Weather Service in Norman has iss...
3,Winds have subsided over the Bay of Green Bay ...
4,* WHAT...South winds 20 to 25 knots with gusts...
5,The Tornado Watch has been cancelled and is no...
6,AAA\n\n.TONIGHT...N wind 35 kt. Seas 12 ft. Ra...
7,The Severe Thunderstorm Warning has been cance...
8,The Severe Thunderstorm Warning has been cance...
9,* WHAT...South winds 15 to 20 kts with gusts t...


In [None]:
filtered_df['description'] = filtered_df['description'].str.replace(r'\*|\n+|\.{3}', ' ', regex=True)
filtered_df['description'][0]


'AAA .TONIGHT SW wind 40 kt increasing to 50 kt after midnight. From Port Moller E, SW wind 35 kt increasing to 45 kt after midnight. Seas 22 ft. .MON W wind 40 kt. Seas 21 ft. Rain showers. .MON NIGHT W wind 35 kt. Seas 18 ft subsiding to 12 ft after midnight. .TUE AND TUE NIGHT W wind 30 kt. Seas 12 ft. .WED THROUGH FRI W wind 30 kt. Seas 14 ft. '

In [None]:
filtered_df['description'] = (
    filtered_df['description']
    .str.strip()                       # Remove leading and trailing spaces
    .str.replace(r'\s+', ' ', regex=True)  # Replace multiple spaces with a single space
)
filtered_df['description'][0]


'AAA .TONIGHT SW wind 40 kt increasing to 50 kt after midnight. From Port Moller E, SW wind 35 kt increasing to 45 kt after midnight. Seas 22 ft. .MON W wind 40 kt. Seas 21 ft. Rain showers. .MON NIGHT W wind 35 kt. Seas 18 ft subsiding to 12 ft after midnight. .TUE AND TUE NIGHT W wind 30 kt. Seas 12 ft. .WED THROUGH FRI W wind 30 kt. Seas 14 ft.'
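
The two cleaning steps above can be folded into one reusable function, so the same rules can be applied to `description`, `headline`, and `instruction`. A minimal sketch using the same patterns; `clean_text` is a hypothetical helper.

```python
import re

_MARKUP = re.compile(r"\*|\.{3}")   # Asterisks and "..." separators
_WHITESPACE = re.compile(r"\s+")    # Runs of spaces, tabs, and newlines


def clean_text(text: str) -> str:
    """Replace markup with spaces, collapse whitespace, and trim the ends."""
    text = _MARKUP.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()


raw = "AAA\n\n.TONIGHT...SW wind 40 kt increasing to 50 kt after midnight."
print(clean_text(raw))
```

Because missing values are not strings, the function would need to be applied only to non-null entries, for example with `filtered_df['description'].map(clean_text, na_action='ignore')`.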

In [None]:
filtered_df['instruction'] = filtered_df['instruction'].str.replace(r'\n', ' ', regex=True)
filtered_df['instruction'][0]


In [None]:
filtered_df['headline'] = filtered_df['headline'].str.replace(r'\*|\n+|\.{3}', ' ', regex=True)


In [None]:
filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 375714 entries, 0 to 413999
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   areaDesc      375714 non-null  object
 1   category      375714 non-null  object
 2   certainty     375714 non-null  object
 3   description   375324 non-null  object
 4   effective     375444 non-null  object
 5   event         375714 non-null  object
 6   headline      375552 non-null  object
 7   instruction   342784 non-null  object
 8   responseType  375714 non-null  object
 9   senderName    375714 non-null  object
 10  severity      375714 non-null  object
 11  urgency       375714 non-null  object
dtypes: object(12)
memory usage: 45.3+ MB


In [None]:
filtered_df = filtered_df.dropna(subset=['description', 'headline', 'instruction'])
filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 342758 entries, 2 to 413999
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   areaDesc      342758 non-null  object
 1   category      342758 non-null  object
 2   certainty     342758 non-null  object
 3   description   342758 non-null  object
 4   effective     342681 non-null  object
 5   event         342758 non-null  object
 6   headline      342758 non-null  object
 7   instruction   342758 non-null  object
 8   responseType  342758 non-null  object
 9   senderName    342758 non-null  object
 10  severity      342758 non-null  object
 11  urgency       342758 non-null  object
dtypes: object(12)
memory usage: 34.0+ MB


In [None]:
filtered_df.nunique()

Unnamed: 0,0
areaDesc,88645
category,21
certainty,5
description,276872
effective,239157
event,149
headline,263848
instruction,37337
responseType,13
senderName,336


In [None]:
# Check for duplicate rows in the entire DataFrame
duplicates = filtered_df.duplicated(keep=False)

# Display rows that are duplicates (if any)
duplicate_rows = filtered_df[duplicates]

print("Are there any duplicate rows in the dataset?")
print(duplicates.any())  # True if there are duplicate rows, False otherwise

if duplicates.any():
    print("\nDuplicate rows:")
    print(duplicate_rows)

Are there any duplicate rows in the dataset?
True

Duplicate rows:
                                                 areaDesc category certainty  \
3448    Southern Brevard County; Indian River; St. Luc...      Met    Likely   
3471    Southern Brevard County; Indian River; St. Luc...      Met    Likely   
6398                                        Affected Area    Other  Unlikely   
7279         Juneau Borough and Northern Admiralty Island      Met    Likely   
7280    Grand Traverse Bay south of a line Grand Trave...      Met    Likely   
...                                                   ...      ...       ...   
412189                                            Jackson      Met  Observed   
412190      Dyer; Lake; Pemiscot; Mississippi; Lauderdale      Met  Observed   
412191                       Wheeler; Telfair; Montgomery      Met  Observed   
412192             Knox; Whitley; Clay; Bell; Owsley; Lee      Met  Possible   
412193  Emmet; Calhoun; Humboldt; Franklin; Pocahonta

In [None]:
filtered_df = filtered_df.drop_duplicates(keep='first')
filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 325901 entries, 2 to 413999
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   areaDesc      325901 non-null  object
 1   category      325901 non-null  object
 2   certainty     325901 non-null  object
 3   description   325901 non-null  object
 4   effective     325847 non-null  object
 5   event         325901 non-null  object
 6   headline      325901 non-null  object
 7   instruction   325901 non-null  object
 8   responseType  325901 non-null  object
 9   senderName    325901 non-null  object
 10  severity      325901 non-null  object
 11  urgency       325901 non-null  object
dtypes: object(12)
memory usage: 32.3+ MB
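
`duplicated(keep=False)` flags every member of a duplicate group, while `drop_duplicates(keep='first')` retains the first row of each group, so fewer rows are removed than were flagged. The keep-first rule can be illustrated on plain dictionaries. This is a sketch of the semantics, not of pandas internals; `drop_duplicate_rows` is a hypothetical helper.

```python
from typing import Dict, Hashable, List


def drop_duplicate_rows(rows: List[Dict[str, Hashable]]) -> List[Dict[str, Hashable]]:
    """Keep the first occurrence of each distinct row, preserving order."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(sorted(row.items()))  # Order-independent fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


rows = [
    {"event": "Gale Warning", "areaDesc": "Kiska to Attu"},
    {"event": "Storm Warning", "areaDesc": "Port Heiden to Cape Sarichef"},
    {"event": "Gale Warning", "areaDesc": "Kiska to Attu"},
]
print(len(drop_duplicate_rows(rows)))
```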


In [None]:
filtered_df['effective'] = pd.to_datetime(filtered_df['effective'], utc=True)
print(filtered_df['effective'])


In [None]:
filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 325901 entries, 2 to 413999
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype              
---  ------        --------------   -----              
 0   areaDesc      325901 non-null  object             
 1   category      325901 non-null  object             
 2   certainty     325901 non-null  object             
 3   description   325901 non-null  object             
 4   effective     325847 non-null  datetime64[ns, UTC]
 5   event         325901 non-null  object             
 6   headline      325901 non-null  object             
 7   instruction   325901 non-null  object             
 8   responseType  325901 non-null  object             
 9   senderName    325901 non-null  object             
 10  severity      325901 non-null  object             
 11  urgency       325901 non-null  object             
dtypes: datetime64[ns, UTC](1), object(11)
memory usage: 40.4+ MB
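
The `effective` timestamps carry different UTC offsets (for example `-08:00` and `-05:00`), which is why they are converted with `utc=True`. The same conversion for a single value can be sketched with the standard library; `to_utc` is a hypothetical helper and assumes the input string includes an offset.

```python
from datetime import datetime, timezone


def to_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with a UTC offset and convert it to UTC."""
    return datetime.fromisoformat(timestamp).astimezone(timezone.utc)


print(to_utc("2021-10-10T18:33:00-08:00").isoformat())  # 2021-10-11T02:33:00+00:00
```

An evening alert in Alaska therefore falls on the next calendar day in UTC, which matters when grouping alerts by date.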


In [None]:
filtered_df.to_csv('IPAWS_Filtered_Data.csv', index=False)