## Client Project: Leveraging Social Media to Map Disasters

### Contents:
- [Mapping Outline](#Mapping-Outline)
- [Import Packages and Data](#Import-Packages-and-Data)
- [Preprocessing for Modelling](#Preprocessing-for-Modelling)
- [Mapping](#Mapping)
    - [Mapping Libraries](#Mapping-Libraries)
- [Limitations](#Limitations)
- [Findings and Conclusions](#Findings-and-Conclusions)

## Mapping Outline

Our Steps to Mapping:
1. Convert each county to latitude and longitude.
2. Create new columns for latitude and longitude.
3. Randomize a certain number of points by adding/subtracting a random number of degrees.
4. Plot the points on a Folium map.

## Import Packages and Data

In [1]:
import pandas as pd

In [2]:
#!pip install pgeocode

In [3]:
#!pip install geopy

In [4]:
#!pip install addfips

In [5]:
#!pip install geocoder

In [6]:
import addfips

In [7]:
import pgeocode

In [8]:
from pgeocode import Nominatim 

In [9]:
import geopy

In [10]:
nomi = pgeocode.Nominatim('us')

In [11]:
df = pd.read_csv('../datasets/DisasterDeclarationsSummaries.csv')

## Preprocessing for Modelling

In [12]:
#previewing columns that need to be dropped
df.head()

Unnamed: 0,disasterNumber,ihProgramDeclared,iaProgramDeclared,paProgramDeclared,hmProgramDeclared,state,declarationDate,fyDeclared,disasterType,incidentType,title,incidentBeginDate,incidentEndDate,disasterCloseOutDate,declaredCountyArea,placeCode,hash,lastRefresh,id
0,1,0,1,1,1,GA,1953-05-02T00:00:00.000Z,1953,DR,Tornado,TORNADO,1953-05-02T00:00:00.000Z,1953-05-02T00:00:00.000Z,1954-06-01T00:00:00.000Z,,,1dcb40d0664d22d39de787b706b0fa69,2019-07-26T18:08:57.368Z,5d1bbd8c8bdcfa6efb32fd8d
1,2,0,1,1,1,TX,1953-05-15T00:00:00.000Z,1953,DR,Tornado,TORNADO & HEAVY RAINFALL,1953-05-15T00:00:00.000Z,1953-05-15T00:00:00.000Z,1958-01-01T00:00:00.000Z,,,61612cea5779e361b429799098974b6a,2019-07-26T18:08:57.370Z,5d1bbd8c8bdcfa6efb32fd8e
2,3,0,1,1,1,LA,1953-05-29T00:00:00.000Z,1953,DR,Flood,FLOOD,1953-05-29T00:00:00.000Z,1953-05-29T00:00:00.000Z,1960-02-01T00:00:00.000Z,,,86f3e47785cb7acc51364d4535d36101,2019-07-26T18:08:57.369Z,5d1bbd8c8bdcfa6efb32fd8f
3,6,0,1,1,1,MI,1953-06-09T00:00:00.000Z,1953,DR,Tornado,TORNADO,1953-06-09T00:00:00.000Z,1953-06-09T00:00:00.000Z,1956-03-30T00:00:00.000Z,,,2208518c84c44f8e4164248d47f89ead,2019-07-26T18:08:57.369Z,5d1bbd8c8bdcfa6efb32fd92
4,4,0,1,1,1,MI,1953-06-02T00:00:00.000Z,1953,DR,Tornado,TORNADO,1953-06-02T00:00:00.000Z,1953-06-02T00:00:00.000Z,1956-02-01T00:00:00.000Z,,,1dbe5937a01fc74c8e699912e3f555cb,2019-07-26T18:08:57.370Z,5d1bbd8c8bdcfa6efb32fd91


In [13]:
#filtering our data frame because we are only concerned with floods from 2019
df = df[(df['incidentType'] =='Flood') & (df['fyDeclared'] == 2019) ]

In [14]:
#only including relevant columns
df = df[['state', 'fyDeclared', 'incidentType', 'declaredCountyArea', 'placeCode']]

In [15]:
df.head()

Unnamed: 0,state,fyDeclared,incidentType,declaredCountyArea,placeCode
38348,OK,2019,Flood,Wagoner (County),99145.0
38514,LA,2019,Flood,Pointe Coupee (Parish),99077.0
39609,OK,2019,Flood,Osage (County),99113.0
39643,LA,2019,Flood,St. Martin (Parish),99099.0
40137,LA,2019,Flood,Assumption (Parish),99007.0


We are not entirely sure what placeCode is; it could possibly translate to a ZIP code.

In [16]:
#checking for nulls

df.isnull().sum()

state                 0
fyDeclared            0
incidentType          0
declaredCountyArea    3
placeCode             3
dtype: int64

Because there are only 3 nulls, we will not lose much data by dropping them.

In [17]:
df.dropna(inplace = True)

In [18]:
#converting column to appropriate data type to perform functions on.
df = df.astype({"placeCode": str})

In [19]:
df = df.astype({"state": str})

In [20]:
df.head()

Unnamed: 0,state,fyDeclared,incidentType,declaredCountyArea,placeCode
38348,OK,2019,Flood,Wagoner (County),99145.0
38514,LA,2019,Flood,Pointe Coupee (Parish),99077.0
39609,OK,2019,Flood,Osage (County),99113.0
39643,LA,2019,Flood,St. Martin (Parish),99099.0
40137,LA,2019,Flood,Assumption (Parish),99007.0


In [21]:
#Take out 99 from placeCode
df['placeCode'] = df['placeCode'].str[2:]

We learned that FEMA has its own numbering system, where 99 is prefixed to US county codes and the 3 digits that follow identify the county. Those digits follow FIPS, the government system for assigning location codes.
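
As a minimal sketch of how the two codes combine: a five-digit county FIPS code is the two-digit state code followed by the three-digit county code, so both parts need their leading zeros. The helper name `county_fips` is hypothetical, and the sketch assumes every `placeCode` carries the 99 prefix described above. Keeping the result as a string avoids losing a leading zero when the value is later cast to an integer.

```python
def county_fips(state_fips, place_code):
    """Build a 5-digit county FIPS string (hypothetical helper).

    Assumes place_code is a FEMA placeCode such as 99077, "99077" or
    "99077.0", where the last three digits are the county code.
    """
    # Normalise floats and strings, then keep only the last three digits.
    county = int(float(place_code)) % 1000
    # Zero-pad both parts: 2 digits for the state, 3 for the county.
    return f"{int(state_fips):02d}{county:03d}"
```

For example, `county_fips("05", "99047.0")` gives `"05047"`, whereas the same code stored as an integer would read 5047.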

In [22]:
df.head()

Unnamed: 0,state,fyDeclared,incidentType,declaredCountyArea,placeCode
38348,OK,2019,Flood,Wagoner (County),145.0
38514,LA,2019,Flood,Pointe Coupee (Parish),77.0
39609,OK,2019,Flood,Osage (County),113.0
39643,LA,2019,Flood,St. Martin (Parish),99.0
40137,LA,2019,Flood,Assumption (Parish),7.0


In [23]:
#instantiating addfips to see if we can convert to zipcode.
af = addfips.AddFIPS()

In [24]:
#creating a function that can get the state fip from our state column.
def fip_state(x):
    return af.get_state_fips(x)

In [25]:
df.dtypes

state                 object
fyDeclared             int64
incidentType          object
declaredCountyArea    object
placeCode             object
dtype: object

In [26]:
#get fip_state to then find the zipcode 
df['state_fip'] = df['state'].apply(lambda x: af.get_state_fips(x))

In [27]:
df.head()

Unnamed: 0,state,fyDeclared,incidentType,declaredCountyArea,placeCode,state_fip
38348,OK,2019,Flood,Wagoner (County),145.0,40
38514,LA,2019,Flood,Pointe Coupee (Parish),77.0,22
39609,OK,2019,Flood,Osage (County),113.0,40
39643,LA,2019,Flood,St. Martin (Parish),99.0,22
40137,LA,2019,Flood,Assumption (Parish),7.0,22


In [28]:
#combining state and county codes to eventually get zipcode.
df['fip_code'] = df['state_fip'] + df['placeCode']

In [29]:
df = df.astype({"fip_code": float})

In [30]:
df = df.astype({"fip_code": int})

In [31]:
#dropping placeCode and state_fip because we don't need them anymore.
df.drop(columns=['placeCode','state_fip'], inplace = True)

In [32]:
df.head()

Unnamed: 0,state,fyDeclared,incidentType,declaredCountyArea,fip_code
38348,OK,2019,Flood,Wagoner (County),40145
38514,LA,2019,Flood,Pointe Coupee (Parish),22077
39609,OK,2019,Flood,Osage (County),40113
39643,LA,2019,Flood,St. Martin (Parish),22099
40137,LA,2019,Flood,Assumption (Parish),22007


At this point we are trying to convert the FIPS code to a ZIP code, or directly to latitude and longitude. However, there doesn't seem to be a library that does this quickly, so we are looking to use geocoding to convert the county column to latitude and longitude instead.
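
One alternative to a dedicated library would be a plain lookup table keyed by FIPS code. The sketch below is not what this notebook does; it assumes a tab-delimited table with `GEOID`, `INTPTLAT` and `INTPTLONG` columns, which to our knowledge is the layout of the Census Bureau's county gazetteer files. The function name `load_centroids` is hypothetical.

```python
import csv

def load_centroids(lines):
    """Map 5-digit county FIPS codes to (latitude, longitude) tuples.

    `lines` is any iterable of text lines (for example an open file) from a
    tab-delimited table with GEOID, INTPTLAT and INTPTLONG columns. The
    column names are an assumption about the input, not checked against a
    specific file.
    """
    centroids = {}
    for row in csv.DictReader(lines, delimiter="\t"):
        # Header names and values may carry stray whitespace, so strip both.
        clean = {k.strip(): v.strip() for k, v in row.items() if k and v is not None}
        fips = clean["GEOID"].zfill(5)  # restore a dropped leading zero
        centroids[fips] = (float(clean["INTPTLAT"]), float(clean["INTPTLONG"]))
    return centroids
```

With such a table loaded, each `fip_code` (zero-padded to five digits) could be looked up directly, with no geocoding request per county.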

In [33]:
import geocoder
g = geocoder.osm('United States')

In [34]:
df['declaredCountyArea']

38348          Wagoner (County)
38514    Pointe Coupee (Parish)
39609            Osage (County)
39643       St. Martin (Parish)
40137       Assumption (Parish)
                  ...          
50759           Arthur (County)
50760            Grant (County)
50771         St. Mary (Parish)
50774        Marinette (County)
50796              Lee (County)
Name: declaredCountyArea, Length: 651, dtype: object

Initially we had some hesitation about using geocoding, because we assumed that the way the county names are formatted, in addition to there being duplicate county names, might lead to errors or missed data. However, there are already very few rows, which most likely means less chance of duplicates. The areas affected by floods are also few relative to the total number of counties in the United States, so the chance of duplicates is even lower.
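
One way to reduce the ambiguity described above is to qualify each query with its state before geocoding, since the same county name can occur in several states. The sketch below only builds the query string. The helper name `county_query` and the output format are assumptions, and how well a given geocoder handles such strings would need to be checked.

```python
import re

# Matches names such as "Wagoner (County)" or "St. Martin (Parish)".
DESIGNATION = re.compile(r"^(?P<name>.+?)\s*\((?P<kind>County|Parish)\)\s*$")

def county_query(area, state):
    """Turn 'Wagoner (County)' and 'OK' into 'Wagoner County, OK, USA'."""
    match = DESIGNATION.match(area.strip())
    if match:
        name = f"{match.group('name')} {match.group('kind')}"
    else:
        # Reservations, TDSAs and similar areas: drop any parenthetical note.
        name = re.sub(r"\s*\([^)]*\)", "", area).strip()
    return f"{name}, {state}, USA"
```

The resulting strings could then be passed to the geocoder in place of the raw `declaredCountyArea` values.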

In [35]:
#taking each county, printing the resulting geocoded county, and returning it.
def Geocode(county): 
    result = geocoder.osm(county) 
    print(result)
    return result

In [36]:
# check what locations have the error 
#df['location'] = df['declaredCountyArea'].apply(Geocode)

In [37]:
df.head(180)

Unnamed: 0,state,fyDeclared,incidentType,declaredCountyArea,fip_code
38348,OK,2019,Flood,Wagoner (County),40145
38514,LA,2019,Flood,Pointe Coupee (Parish),22077
39609,OK,2019,Flood,Osage (County),40113
39643,LA,2019,Flood,St. Martin (Parish),22099
40137,LA,2019,Flood,Assumption (Parish),22007
...,...,...,...,...,...
47181,SD,2019,Flood,Brule (County),46015
47187,SD,2019,Flood,Jones (County),46075
47194,IA,2019,Flood,Kossuth (County),19109
47197,MN,2019,Flood,McLeod (County),27085


In [38]:
from geopy.extra.rate_limiter import RateLimiter

In [39]:
#doing a sample function and test to see if we can return latitude coordinates
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="specify_your_app_name_here", timeout=10)


location = geolocator.geocode("St. Croix Indian Reservation")

In [40]:
location.latitude

38.2239489

In [41]:
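# function in order to get latitude for each county (None when the geocoder finds nothing)
def latitude(loc): 
    location = geolocator.geocode(loc)
    return location.latitude if location is not None else None

In [42]:
#function in order to get longitude for each county (None when the geocoder finds nothing)
def longitude(loc): 
    location = geolocator.geocode(loc)
    return location.longitude if location is not None else None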

In [43]:
df.head(190)

Unnamed: 0,state,fyDeclared,incidentType,declaredCountyArea,fip_code
38348,OK,2019,Flood,Wagoner (County),40145
38514,LA,2019,Flood,Pointe Coupee (Parish),22077
39609,OK,2019,Flood,Osage (County),40113
39643,LA,2019,Flood,St. Martin (Parish),22099
40137,LA,2019,Flood,Assumption (Parish),22007
...,...,...,...,...,...
47296,NE,2019,Flood,Pierce (County),31139
47307,ND,2019,Flood,Pembina (County),38067
47310,MN,2019,Flood,Traverse (County),27155
47311,SD,2019,Flood,Clark (County),46025


In [44]:
# drop the nonetype objects - error in searching the location
null = df[df['declaredCountyArea'] == "Sac and Fox Indian Reservation (Also KS)"].index

In [45]:
#these were the nulls where we could not find coordinates.
null_1 = df[df['declaredCountyArea'] == "Mitchell (County)"].index

In [46]:
null_2 = df[df['declaredCountyArea'] == "Ponca (TDSA)"].index

In [47]:
null_3 = df[df['declaredCountyArea'] == "Oglala Sioux Tribe of the Pine Ridge Reservation"].index

In [48]:
df[df['declaredCountyArea'] == "Redwood (County)"]

Unnamed: 0,state,fyDeclared,incidentType,declaredCountyArea,fip_code
48831,MN,2019,Flood,Redwood (County),27127


In [49]:
df[df['declaredCountyArea'] == "Franklin (County)"]

Unnamed: 0,state,fyDeclared,incidentType,declaredCountyArea,fip_code
41368,AR,2019,Flood,Franklin (County),5047
44912,IA,2019,Flood,Franklin (County),19069
45995,NE,2019,Flood,Franklin (County),31061
47328,TX,2019,Flood,Franklin (County),48159
49062,AR,2019,Flood,Franklin (County),5047


In [50]:
df[df['declaredCountyArea'] == "Cocke (County)"]

Unnamed: 0,state,fyDeclared,incidentType,declaredCountyArea,fip_code
48069,TN,2019,Flood,Cocke (County),47029


In [51]:
#dropping counties with errors.
df.drop(null, inplace = True)

In [52]:
df.drop(null_1, inplace = True)

In [53]:
df.drop(null_2, inplace = True)

In [54]:
df.drop(null_3, inplace = True)
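
Instead of dropping failed lookups by name, the lookup itself could report a miss. The sketch below is one possible shape, not the approach used in this notebook. The helper `lookup_coords` is hypothetical and takes the geocoding callable as an argument, so each county is geocoded once for both coordinates.

```python
def lookup_coords(geocode, query):
    """Return (latitude, longitude) for a query, or (None, None) on a miss.

    `geocode` is any callable that returns an object with `.latitude` and
    `.longitude` attributes, or None when nothing is found (the behaviour
    of geopy's `geolocator.geocode`).
    """
    location = geocode(query)
    if location is None:
        return None, None
    return location.latitude, location.longitude
```

With geopy, the callable could be `RateLimiter(geolocator.geocode, min_delay_seconds=1)`, which also spaces out the requests. Rows where both values come back as `None` can then be removed with `dropna`.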

In [55]:
#used the first 100 rows of the dataframe as a baseline for the map
#(.copy() gives an independent frame, so adding columns does not trigger a SettingWithCopyWarning)
df_portion = df.head(100).copy()

In [56]:
#creating a latitude column
df_portion['latitude'] = df_portion['declaredCountyArea'].apply(latitude)

In [57]:
#creating a longitude column
df_portion['longitude'] = df_portion['declaredCountyArea'].apply(longitude)


In [58]:
#checking to make sure our coordinates look correct.
df_portion.head()

Unnamed: 0,state,fyDeclared,incidentType,declaredCountyArea,fip_code,latitude,longitude
38348,OK,2019,Flood,Wagoner (County),40145,35.954222,-95.563624
38514,LA,2019,Flood,Pointe Coupee (Parish),22077,30.619428,-91.574089
39609,OK,2019,Flood,Osage (County),40113,38.673605,-95.736774
39643,LA,2019,Flood,St. Martin (Parish),22099,30.181143,-91.608112
40137,LA,2019,Flood,Assumption (Parish),22007,29.918543,-91.0534


In [59]:
#saving the dataframe
df_portion.to_csv('./dfmap.csv')

## Mapping

In [60]:
#https://towardsdatascience.com/geocode-with-python-161ec1e62b89

### Mapping Libraries

In [61]:
## Mapping Libraries
from shapely.geometry import Point
from random import random
from random import uniform
import geopandas as gpd
from folium.plugins import MarkerCluster,FeatureGroupSubGroup,Fullscreen
from ipyleaflet import basemaps, basemap_to_tiles, Marker, Map, Popup
from folium.map import Layer,FeatureGroup
import folium
from branca.element import Template, MacroElement

## Process strings for html (for popups)
import html
import re

In [62]:
# creating a location list to add the circles
locations = df_portion[['latitude', 'longitude']]
locationlist = locations.values.tolist()
len(locationlist)
locationlist[7]

[33.155212, -95.2255189]

In [63]:
#this function creates a uniformly distributed random point within the given boundaries.
def newpoint():
    return uniform(30.50139, 40.85694), uniform(-100.01197, -83.75583)
#specifies how many 'new points' we want; here we generate 100.
points = [newpoint() for x in range(100)]
#collecting the generated coordinates in their own dataframe (so that df is not overwritten)
df_points = pd.DataFrame(points)
pointlist = df_points.values.tolist()
len(pointlist)

Because we did not have specific coordinate or county data from the tweets, we decided to add uniformly distributed random points within a bounding box around the FEMA data points to simulate what our map would look like if we had access to that data.
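
The outline's step of adding or subtracting a random number of degrees could be sketched as follows. This is an illustrative variant, not the code this notebook runs: the function names, the 0.5 degree default and the number of points per location are all assumptions.

```python
import random

def jitter(lat, lon, max_offset_deg, rng):
    """Shift a point by up to max_offset_deg degrees in each direction."""
    return (
        lat + rng.uniform(-max_offset_deg, max_offset_deg),
        lon + rng.uniform(-max_offset_deg, max_offset_deg),
    )

def simulate_points(centers, per_center=3, max_offset_deg=0.5, seed=None):
    """Generate per_center jittered points around each (lat, lon) center."""
    rng = random.Random(seed)  # a seed makes the simulation reproducible
    return [
        jitter(lat, lon, max_offset_deg, rng)
        for lat, lon in centers
        for _ in range(per_center)
    ]
```

Called with `locationlist` as `centers`, this would keep the simulated points close to the counties that were actually affected, rather than spreading them across one large box.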

In [84]:
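#imports
from random import random
from random import uniform

#instantiating a folium map centered on the first FEMA data point
#note: folium.Map has no title argument, so the intended title is kept here as a comment:
#'Detected Floods in the U.S.'
map = folium.Map(location = [35.954222, -95.563624], 
                          zoom_start = 3) ## zoom level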

#adding our original points for disasters (floods) in 2019
for point in locationlist:
    # Add a circle marker for the point
    folium.vector_layers.CircleMarker(point,
                        color = 'blue', 
                        fill = True,             
                        fill_color = 'blue',
                        radius=5,
                        weight=3, 
                        control = True,
                        overlay = True).add_to(map)
#adding the randomly generated points that simulate tweet locations
for point in pointlist:
    folium.vector_layers.CircleMarker(point,
                        color = 'blue',
                        fill = True,
                        fill_color = 'blue',
                        radius=5,
                        weight=3,
                        control = True,
                        overlay = True).add_to(map)


In [85]:
map

Our generated fake data did what we wanted: it produces a map that appears to have a random distribution of data within the areas affected by floods. If we were to make a full production model with real data, it would probably be better to use much smaller markers, or perhaps a heat map to indicate which areas were affected more, since having so many points leads to a cluttered map that is hard to interpret unless you zoom in a lot.
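
As a rough sketch of the heat map idea, the points can first be aggregated into grid cells so that each cell carries a count instead of many overlapping markers. The function name and the 0.5 degree cell size are assumptions. The resulting `[lat, lon, weight]` rows match the input format accepted by `folium.plugins.HeatMap`.

```python
import math
from collections import Counter

def bin_points(points, cell_deg=0.5):
    """Count points per grid cell.

    Returns [lat, lon, count] rows, where lat/lon is the centre of each
    cell_deg x cell_deg cell that contains at least one point.
    """
    counts = Counter(
        (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        for lat, lon in points
    )
    return [
        [(row + 0.5) * cell_deg, (col + 0.5) * cell_deg, count]
        for (row, col), count in sorted(counts.items())
    ]
```

A coarser `cell_deg` gives a smoother national overview, while a finer one preserves local detail at the cost of more cells.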

## Limitations

From our map it's evident that most of the floods in the United States in 2019 occurred in the Midwest and South, with very few elsewhere, which means that the robustness of our model will vary from state to state: we won't have as much data to train and test on in states where floods are rare. Additionally, we found that not many people tweeted about floods. Most of our tweets were from weather accounts and government agencies, so the actual 'social mapping' concept is limited: very few people live-tweet a flood, and those who did provided little information beyond the tweets generated by weather and government accounts. It is possible that the intended functionality of the model will improve as the popularity of Twitter increases. Finally, privacy is increasingly a concern for all social media users, and most people opt not to include their city, county, or state in their tweets, which also limits the functionality of our model.

One way to get around this would be to scrape a user's following or followers list and scan those users' tweets and bios to pinpoint the user's probable location from their friends' data. For example, if a large majority of a user's followers and followed accounts have city X in their bio or as their tweet location, we can predict that this person is likely from city X. We could also scrape the user's past tweets to see whether they ever had their location on when tweeting, or scrape the content of their tweets and liked tweets to see which location is mentioned the most.
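
The majority-vote idea could look roughly like this, assuming the follower locations have already been collected as plain strings. The function name, the lowercase normalisation and the 50% threshold are illustrative assumptions. The scraping itself is out of scope here.

```python
from collections import Counter

def infer_location(friend_locations, min_share=0.5):
    """Guess a user's location from the locations of their connections.

    Returns the most common non-empty location if it accounts for more than
    min_share of the known locations, otherwise None.
    """
    # Ignore missing or blank entries and compare case-insensitively.
    known = [loc.strip().lower() for loc in friend_locations if loc and loc.strip()]
    if not known:
        return None
    place, count = Counter(known).most_common(1)[0]
    return place if count / len(known) > min_share else None
```

Requiring a clear majority keeps the function from guessing when a user's connections are spread evenly across several places.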

## Findings and Conclusions

The primary finding from the model is that it is easier to validate tweets pertaining to floods on a state-by-state basis than on a national level. This is most likely because discussions on Twitter can be state specific, so NLP trained on one state is not the best to apply to another. If we were to make a production model, we would most likely have to build state-specific models, as incorporating all states into one model caused a significant decrease in our scores.

Another thing to consider is that we can't optimize ONLY for specificity or ONLY for sensitivity given the context of our problem. False negatives and false positives both have adverse effects, and highly favoring one over the other is detrimental. The goal of this model would be to inform the population of the scope of a natural disaster and to direct aid to the locations affected. If we have too many false negatives, more people will not receive aid, and many others will not know the full scope of an affected area, which could cause further injury or damage. On the flip side, if we have too many false positives, we impose costs on a population that spends extra time and resources preparing for something that is not imminent. More importantly, false positives can worsen the adverse effects of false negatives, because aid and resources directed to an area that is not actually affected are taken away from areas that are (true positives). Both of these things had to be considered when selecting the 'best' model.
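
To make the trade-off concrete, sensitivity and specificity can be computed from the four confusion-matrix counts and combined into one score that rewards neither at the expense of the other. Balanced accuracy is used below purely as an illustration. It is not necessarily the criterion that was used to select the model.

```python
def sensitivity_specificity(tp, fp, tn, fn):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    # Sensitivity: share of real flood tweets that the model catches.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of non-flood tweets that the model correctly rejects.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity

def balanced_accuracy(tp, fp, tn, fn):
    """Average of sensitivity and specificity."""
    sensitivity, specificity = sensitivity_specificity(tp, fp, tn, fn)
    return (sensitivity + specificity) / 2
```

A model that flags every tweet as a flood reaches a sensitivity of 1.0 but a specificity of 0.0, so its balanced accuracy is only 0.5.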