# FEMA Flood Insurance Claim Dataset

Data wrangling, cleaning, and preprocessing for Springboard Capstone Project 2.

In [2]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns



In [3]:
# Read in dataset into dataframe called claims
claims = pd.read_csv('openFEMA_claims20190331.csv')

FileNotFoundError: [Errno 2] File openFEMA_claims20190331.csv does not exist: 'openFEMA_claims20190331.csv'

## Data Wrangling and Cleaning <br>
Explore variables, clean up invalid entries and fill missing values based on variable type.

In [3]:
claims.isna().sum()

agriculturestructureindicator                 2264124
asofdate                                            0
basefloodelevation                            1937506
basementenclosurecrawlspacetype                    45
reportedcity                                     4506
condominiumindicator                            58112
policycount                                         8
countycode                                       6746
crsdiscount                                         8
dateofloss                                          0
elevatedbuildingindicator                       55352
elevationcertificateindicator                 1813374
elevationdifference                                 8
censustract                                     58879
floodzone                                      162683
houseworship                                  2189014
latitude                                        53114
locationofcontents                             894425
longitude                   

**Initial review of variables:** <br>
There are many missing values to deal with, but all variables seem relevant. The only variable to drop initially is asofdate, which just records the date the dataset was downloaded.
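
Raw missing counts are easier to compare across columns when expressed as a share of rows. The sketch below shows one way to do that; the `toy` frame and its values are hypothetical stand-ins for `claims`, not data from the FEMA file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `claims`
toy = pd.DataFrame({
    "asofdate": ["2019-03-31"] * 4,
    "floodzone": ["AE", None, "X", None],
    "policycount": [1.0, 1.0, np.nan, 2.0],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real dataframe the same expression would be `claims.isna().mean()`.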

In [4]:
# Drop asofdate
claims.drop(columns=['asofdate'], inplace=True)

**Categorical Variables** <br>
The categorical variables differ in their number of categories and in how those categories are encoded. Using the metadata downloaded with the dataset, clean up and reclassify the categories to make them easier to understand or look up. This includes turning arbitrary numerical indicators into string indicators. Then fill all NaNs with a new category, "not_given".

In [5]:
# Reclassify categories for readability

# First, fix invalid entries in nonprofitindicator, obstructiontype
claims['nonprofitindicator'] = claims['nonprofitindicator'].replace('0', np.nan)
claims['obstructiontype'] = claims['obstructiontype'].replace({'*':np.nan, 0:np.nan})

# Define function to reclassify categories by column
def reclass(df, col_list, r_dict_list):
    # Pair each column with its replacement dictionary and recode in place
    for col, r_dict in zip(col_list, r_dict_list):
        df[col] = df[col].replace(r_dict)

# Define replacement dictionaries by column
bdcst_d = {0.0:'N', 1.0:'F', 2.0:'U', 3.0:'C', 4.0:'S'}
crsd_d = {0.0: 'SHFA0', 0.5:'SHFA5', 1.0:'SHFA10', 1.5:'SHFA15', 2.0:'SHFA20', 2.5:'SHFA25', 3.0:'SHFA30', 3.5:'SHFA35', 4.0:'SHFA40', 4.5:'SHFA45'}
eci_d = {1.0:'nocert_pre82', 2.0:'nocert_post82', 3.0:'cert_bfe', 4.0:'cert_nobfe'}
loc_d = {'Lowest floor only above ground level (No basement/enclosure/crawlspace/subgrade crawlspace)':'lowest_only', 'Lowest floor above ground level and higher floors (No basement/enclosure/crawlspace/subgrade crawlspace)':'lowest_above', 'Basement/Enclosure/Crawlspace/Subgrade Crawlspace and above':'bec_above', 'Manufactured (mobile) home or travel trailer on foundation':'mobile', 'Above ground level more than one full floor':'above', 'Basement/Enclosure/Crawlspace/Subgrade Crawlspace only':'bec_only'}
ot_d = {1:'SF', 2:'MF_4', 3: 'MF_5+', 4:'NR'}
rm_d = {1: 'M', 2: 'Sp', 3:'Al', 4:'V', 5:'U', 6:'Pr', 7:'R', 8:'Te', 9:'Mp'}
obt_d = {1.0:'A', 10.0: 'B', 15.0:'C', 20.0: 'D', 24.0:'E', 30.0:'F', 34.0:'G', 40.0:'H', 50.0:'I', 54.0:'J', 60.0:'K', 70.0:'L', 80.0:'M', 90.0:'N', 92.0:'O', 94.0:'P', 95.0:'Q', 96.0:'R', 97.0:'S', 98.0:'T'}


# Define list of necessary columns and associated dicts
reclass_col = ['basementenclosurecrawlspacetype','crsdiscount', 'elevationcertificateindicator', 'locationofcontents', 'occupancytype', 'ratemethod', 'obstructiontype']

reclass_dict = [bdcst_d, crsd_d, eci_d, loc_d, ot_d, rm_d, obt_d]



In [6]:
# Call reclass function
reclass(claims, reclass_col, reclass_dict)

In [7]:
# Categorical columns: fill NaNs with 'not_given'

# Define columns to fill NaNs
cat_cols = ['agriculturestructureindicator', 'elevatedbuildingindicator', 'houseworship', 'nonprofitindicator', 'postfirmconstructionindicator', 'smallbusinessindicatorbuilding', 'primaryresidence', 'basementenclosurecrawlspacetype','condominiumindicator', 'crsdiscount', 'floodzone', 'elevationcertificateindicator', 'locationofcontents', 'occupancytype','obstructiontype', 'ratemethod']

# Fill NaNs
claims[cat_cols] = claims[cat_cols].fillna('not_given')
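
To make the recode-then-fill pattern concrete, here is a minimal sketch on a hypothetical four-row frame. It reuses the `eci_d` mapping from above; the `toy` frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one coded column and one string column
toy = pd.DataFrame({
    "elevationcertificateindicator": [1.0, 3.0, np.nan, 4.0],
    "floodzone": ["AE", None, "X", "A05"],
})

# Same mapping as eci_d in the notebook
eci_d = {1.0: "nocert_pre82", 2.0: "nocert_post82", 3.0: "cert_bfe", 4.0: "cert_nobfe"}

# Step 1: turn numeric codes into readable labels (NaN is left untouched)
toy["elevationcertificateindicator"] = toy["elevationcertificateindicator"].replace(eci_d)

# Step 2: give missing entries their own explicit category
cols = ["elevationcertificateindicator", "floodzone"]
toy[cols] = toy[cols].fillna("not_given")
print(toy)
```

Recoding before filling keeps "not_given" out of the replacement dictionaries, so each dictionary only needs the codes listed in the metadata.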

**Numerical Continuous Variables** <br>
Numerical variables have many NaNs to fill. First, create a new '_NaN' indicator column for each variable to keep track of where values were filled. Then fill with the mode of each column. Note that some of these variables contain implausibly large values that don't make sense for the dataset. These may indicate additional missing values, but leave them for now.

In [8]:
# Numerical continuous columns: create NaN indicator column, then fill with mode

# Define fill function
def create_nancol_num(df, col_list):
    for col in col_list:
        df[col+'_NaN'] = np.where(np.isnan(df[col].values), 1, 0)
        df[col] = df[col].fillna(df[col].mode()[0])

# Create list of applicable columns
num_vars = ['lowestadjacentgrade', 'lowestfloorelevation','amountpaidonbuildingclaim', 'amountpaidoncontentsclaim', 'amountpaidonincreasedcostofcomplianceclaim', 'totalbuildinginsurancecoverage', 'totalcontentsinsurancecoverage', 'elevationdifference', 'basefloodelevation']

In [9]:
# Call fill function
create_nancol_num(claims, num_vars)
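
The effect of the indicator-plus-mode fill is easiest to see on a small example. The sketch below uses a hypothetical helper, `add_nan_flag_and_fill_mode`, that mirrors `create_nancol_num` with pandas-only calls; the toy values are made up.

```python
import numpy as np
import pandas as pd

def add_nan_flag_and_fill_mode(df, cols):
    """Hypothetical variant of create_nancol_num for illustration."""
    for col in cols:
        # 1 where the original value was missing, 0 otherwise
        df[col + "_NaN"] = df[col].isna().astype(int)
        # mode() ignores NaN by default and returns sorted values,
        # so [0] is the smallest of the most frequent values
        df[col] = df[col].fillna(df[col].mode()[0])

toy = pd.DataFrame({"policycount": [1.0, np.nan, 1.0, 3.0]})
add_nan_flag_and_fill_mode(toy, ["policycount"])
print(toy)
```

The indicator must be created before the fill; once the NaNs are replaced, the information about which rows were imputed is gone.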

**Numerical Discrete Variables** <br>
Handle the same as continuous variables. Create a '_NaN' column and then fill with mode.

In [10]:
# Numerical discrete columns: create NaN indicator column, then fill with mode
# Use create_nancol_num function as defined above

#Create list of applicable columns
disc_vars = ['numberoffloorsintheinsuredbuilding', 'policycount']

In [11]:
# Call fill function
create_nancol_num(claims, disc_vars)

**Date Variables** <br>
Convert the date variables into datetime objects in order to extract features. The yearofloss variable already exists, but month of loss may also be relevant, along with year of construction and year of nb (or policy). Then create a '_NaN' column for each new feature to keep track of missing values and fill with mode.

In [12]:
# Convert dateofloss, originalconstructiondate, originalnbdate to datetime objects

# first fix invalid entry in constructiondate
claims['originalconstructiondate'] = claims['originalconstructiondate'].replace('1111-11-11', np.nan)

# convert to datetime
claims['originalnbdate'] = pd.to_datetime(claims['originalnbdate'])
claims['originalconstructiondate'] = pd.to_datetime(claims['originalconstructiondate'])
claims['dateofloss'] = pd.to_datetime(claims['dateofloss'])

In [13]:
# create new column for monthofloss and constructionyear, nbyear
claims['monthofloss'] = claims['dateofloss'].dt.month
claims['constructionyear'] = claims['originalconstructiondate'].dt.year
claims['nbyear'] = claims['originalnbdate'].dt.year
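
The following sketch walks through the same conversion on two hypothetical rows. The entry '1111-11-11' looks like a sentinel value; with nanosecond-resolution datetimes it also falls before the earliest representable timestamp (1677-09-21), which is one reason to replace it with NaN before converting. The toy dates are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one sentinel construction date
toy = pd.DataFrame({
    "dateofloss": ["2005-08-29", "2012-10-29"],
    "originalconstructiondate": ["1111-11-11", "1978-06-01"],
})

# Treat the sentinel as missing, then parse both columns
toy["originalconstructiondate"] = toy["originalconstructiondate"].replace("1111-11-11", np.nan)
toy["dateofloss"] = pd.to_datetime(toy["dateofloss"])
toy["originalconstructiondate"] = pd.to_datetime(toy["originalconstructiondate"])

# Extract calendar features; a missing date (NaT) yields NaN
toy["monthofloss"] = toy["dateofloss"].dt.month
toy["constructionyear"] = toy["originalconstructiondate"].dt.year
print(toy)
```

Because a missing date produces NaN in the extracted year, the new columns still need the '_NaN' indicator and mode fill applied afterwards.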

In [14]:
# New date columns: create NaN indicator column, then fill with mode
# Use create_nancol_num function as defined above

# Create list of applicable columns
date_vars = ['monthofloss', 'constructionyear', 'nbyear']

In [15]:
# call fill function
create_nancol_num(claims, date_vars)

**Location Variables** <br>
The location variables all describe essentially the same thing and will likely be highly correlated with each other. Note that the reportedzip variable has the fewest missing values and may be best suited for modeling if we need to drop correlated variables. For now, fill with 'not_given' except for lat/long.

In [16]:
# Location variables: fill NaNs with 'not_given'

# Define columns to fill NaNs
loc_vars = ['reportedcity', 'countycode', 'censustract', 'state', 'reportedzip']

# Fill NaNs
claims[loc_vars] = claims[loc_vars].fillna('not_given')

In [17]:
claims.isna().sum()

agriculturestructureindicator                          0
basefloodelevation                                     0
basementenclosurecrawlspacetype                        0
reportedcity                                           0
condominiumindicator                                   0
policycount                                            0
countycode                                             0
crsdiscount                                            0
dateofloss                                             0
elevatedbuildingindicator                              0
elevationcertificateindicator                          0
elevationdifference                                    0
censustract                                            0
floodzone                                              0
houseworship                                           0
latitude                                           53114
locationofcontents                                     0
longitude                      

## Preprocessing <br>
Some categorical features have many classes; simplify them so they are easier to interpret. Note that additional features were developed based on date columns during the cleaning process above (e.g. year of construction).

**Floodzone** <br>
Simplify floodzone categorical variable 

In [18]:
# Create simplified floodzone variable

conditions = [
    (claims['floodzone'] == 'AE'),
    (claims['floodzone'] == 'X'),
    (claims['floodzone'] == 'VE'),
    (claims['floodzone'] == 'not_given'),
    (claims['floodzone'].str[0] == 'A'),
    (claims['floodzone'].str[0] == 'V'),
    (claims['floodzone'].str[0] == 'C')
]

options = ['AE', 'X', 'VE', 'not_given', 'A', 'V', 'C']

claims['floodzone_simp'] = np.select(conditions, options, default='other') 
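
`np.select` returns the choice for the first condition that is true, so the exact matches ('AE', 'VE') must be listed before the first-letter rules ('A', 'V'). A minimal sketch with hypothetical zone codes:

```python
import numpy as np
import pandas as pd

# Hypothetical flood zone codes, including one that matches no rule
zones = pd.Series(["AE", "A05", "VE", "V12", "X", "C", "not_given", "D"])

conditions = [
    zones == "AE",
    zones == "X",
    zones == "VE",
    zones == "not_given",
    zones.str[0] == "A",   # any other zone starting with A
    zones.str[0] == "V",   # any other zone starting with V
    zones.str[0] == "C",
]
options = ["AE", "X", "VE", "not_given", "A", "V", "C"]

# The first matching condition wins; unmatched rows get the default
simplified = np.select(conditions, options, default="other")
print(simplified)
```

If the first-letter conditions came first, 'AE' would be absorbed into 'A' and 'VE' into 'V'.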

**Obstruction Type** <br>
Simplify obstructiontype variable

In [19]:
# Create simplified obstructiontype variable
# Map detailed obstruction codes to simplified groups; unmapped values are kept as-is.
# A single dict-based replace avoids inplace edits on a selected column,
# which newer pandas versions may not propagate back to the dataframe.
obst_simp_d = {
    'C': 'WO_cs',
    'D': 'WO_bw', 'E': 'WO_bw', 'F': 'WO_bw', 'G': 'WO_bw',
    'B': 'WO_nw', 'H': 'WO_nw',
    'I': 'WO_nbw', 'J': 'WO_nbw',
    'K': 'WO', 'L': 'WC', 'M': 'WOC',
    'N': 'WE', 'O': 'WE',
    'P': 'WO_el', 'Q': 'WO_el', 'R': 'WO_el', 'S': 'WO_el', 'T': 'WO_el',
}
claims['obstype_simp'] = claims['obstructiontype'].replace(obst_simp_d)

**Create final dataframe** <br>
Select relevant features; exclude lat/long, datetime variables, and repetitive location variables. Use this new dataframe to get dummies.

In [20]:
# check data types
claims.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2418007 entries, 0 to 2418006
Data columns (total 57 columns):
 #   Column                                          Dtype         
---  ------                                          -----         
 0   agriculturestructureindicator                   object        
 1   basefloodelevation                              float64       
 2   basementenclosurecrawlspacetype                 object        
 3   reportedcity                                    object        
 4   condominiumindicator                            object        
 5   policycount                                     float64       
 6   countycode                                      object        
 7   crsdiscount                                     object        
 8   dateofloss                                      datetime64[ns]
 9   elevatedbuildingindicator                       object        
 10  elevationcertificateindicator                   object        
 11

In [36]:
# Identify columns to drop, drop them

cols_to_drop = ['dateofloss', 'originalconstructiondate', 'originalnbdate', 'longitude', 'latitude', 'reportedcity', 'countycode', 'censustract', 'floodzone', 'obstructiontype', 'reportedzip']
df = claims.drop(columns=cols_to_drop)

In [37]:
df.shape

(2418007, 46)

In [38]:
# Get dummies for categorical columns
d_cat = pd.get_dummies(df[['agriculturestructureindicator', 'elevatedbuildingindicator', 'houseworship', 'nonprofitindicator', 'postfirmconstructionindicator','smallbusinessindicatorbuilding', 'primaryresidence','basementenclosurecrawlspacetype', 'condominiumindicator', 'locationofcontents', 'crsdiscount', 'elevationcertificateindicator', 'occupancytype', 'ratemethod', 'obstype_simp', 'floodzone_simp', 'state']])

In [39]:
d_cat.shape

(2418007, 174)

In [40]:
# Concat dummy columns and drop original cat columns

df = df.drop(columns=['agriculturestructureindicator', 'elevatedbuildingindicator', 'houseworship', 'nonprofitindicator', 'postfirmconstructionindicator','smallbusinessindicatorbuilding', 'primaryresidence','basementenclosurecrawlspacetype', 'condominiumindicator', 'locationofcontents', 'crsdiscount', 'elevationcertificateindicator', 'occupancytype', 'ratemethod', 'obstype_simp', 'floodzone_simp', 'state'])
df = pd.concat([df, d_cat], axis=1)

In [41]:
df.shape

(2418007, 203)
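
The encode, drop, and concat steps above can be checked on a small frame. This is a sketch with hypothetical values; it also shows that passing `columns=` to `pd.get_dummies` gives the same column layout in a single call.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and two categorical columns
toy = pd.DataFrame({
    "policycount": [1.0, 2.0, 1.0],
    "floodzone_simp": ["AE", "X", "AE"],
    "state": ["TX", "FL", "not_given"],
})
cat_cols = ["floodzone_simp", "state"]

# Notebook approach: encode, drop the originals, then concatenate
dummies = pd.get_dummies(toy[cat_cols])
encoded = pd.concat([toy.drop(columns=cat_cols), dummies], axis=1)

# Equivalent single call: encode only the listed columns, keep the rest
one_call = pd.get_dummies(toy, columns=cat_cols)
print(encoded.columns.tolist())
```

Each category, including 'not_given', becomes its own indicator column, which is why 17 categorical columns expand to 174 dummy columns on the full dataset.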

In [43]:
# check datatypes again
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2418007 entries, 0 to 2418006
Columns: 203 entries, basefloodelevation to state_not_given
dtypes: float64(13), int64(16), uint8(174)
memory usage: 936.2 MB


In [45]:
# separate predictor and response variables into X and y

X = df.drop(columns=['amountpaidonbuildingclaim', 'amountpaidonbuildingclaim_NaN'])
y = df[['amountpaidonbuildingclaim']]

**Feature Scaling** <br>
Use StandardScaler to scale features for modeling. 


In [47]:
# import packages
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [48]:
# split into train and test 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

In [49]:
# instantiate scaler and fit it on the train set only
scaler = StandardScaler()
scaler.fit(X_train)

# scale train and test sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
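
Two details of this step are worth illustrating: the scaler is fit on the training rows only, so no test-set statistics leak into the transformation, and `transform` returns a NumPy array rather than a DataFrame. The sketch below uses a hypothetical eight-row frame and wraps the scaled arrays back into DataFrames to keep the column names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy predictors and response
X = pd.DataFrame({"a": np.arange(8, dtype=float), "b": np.arange(8, dtype=float) * 10})
y = pd.DataFrame({"target": np.arange(8, dtype=float)})
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

# Fit on the train set only, then apply the same shift and scale to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

print(X_train_scaled.mean().round(6))  # approximately zero for each column
```

Only the training columns are guaranteed to have zero mean and unit variance after scaling; the test columns will be close but not exact.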

**Export preprocessed datasets**

In [1]:
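# scaler.transform returns NumPy arrays, so wrap them in DataFrames before exporting
pd.DataFrame(X_train_scaled, columns=X_train.columns).to_csv('X_train_scaled.csv', index=False)
pd.DataFrame(X_test_scaled, columns=X_test.columns).to_csv('X_test_scaled.csv', index=False)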
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)

NameError: name 'X_train_scaled' is not defined