# City-level Analysis of Human Sentiments of Heat Exposure Using Twitter Data

Author: Fangzheng Lyu

This notebook is related to the paper [Mapping dynamic human sentiments of heat exposure with location-based social media data](https://www.tandfonline.com/doi/full/10.1080/13658816.2024.2343063)

A city-scale analysis was conducted using data collected in the City of Chicago, where unexpectedly hot weather was detected, with a highest temperature of 88 degrees Fahrenheit. In this case study, two spatial units are used, census tracts and a 1 km grid, each with approximately 800 spatial units for Cook County. They were selected so that the study area is well represented given the amount of social media data collected. The two spatial units are:
- Census Tracts
- 1km Spatial Resolution

## Notebook Outline
- [Processing Twitter/X Data](#processing)
- [Understanding Human Sentiments of Heat Exposure from Tweet Posts](#understand)
- [Aggregate the result to the census tract](#aggregate)
- [Visualization - Census Tract](#census)
- [Visualization - 1km Spatial Resolution](#1km)

In [1]:
## Import Library
import pytz
from datetime import datetime, timedelta
import os
import geopandas as gpd
import json
from shapely.geometry import Polygon, Point, MultiPolygon
import re
import shapefile as shp  # Requires the pyshp package
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.colors as mcolors
import numpy as np
import random
import csv

<a id='processing'></a>

## 1. Processing Twitter/X Data

The following cells extract and filter the Twitter/X social media data.

In [2]:
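## Load the Chicago boundary shapefile
## (named chicago_boundary so it does not shadow the pyshp `shapefile` module)
chicago_boundary = gpd.read_file("./geo/geo_export_5bb8636f-65b7-450a-8fd9-7f01027fd84b.shp")
chicago_shape = chicago_boundary["geometry"][0]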

Filter all the Twitter/X data by location to find the posts within the City of Chicago.

In [None]:
## Get the tweets in Chicago (city-scale analysis)
## This block of code takes a long time to run
## We iterate through all the collected tweets to find those in Chicago
## Get the filenames
filelist = os.listdir('./data/')

twitter_in_chicago = []

# Open each JSON file
for filename in filelist:
    filepath = "./data/"+filename
    print(filepath)
    # The context manager closes the file even if loading fails
    with open(filepath) as f:
        data = json.load(f)

    ## Keep a tweet if its exact point, or the centroid of its bounding box,
    ## lies within the boundary of the City of Chicago
    for i in range(0, len(data)):
        try:
            ## Need to deal with the case when the shapefile is too big
            text = data[i]["text"]
            t = data[i]['created_at']
            ## Case 1: tweet with an exact geospatial location
            if data[i]['geo'] is not None:
                lat = data[i]['geo']['coordinates'][0]
                lon = data[i]['geo']['coordinates'][1]
                exact_loc = Point(lon, lat)
                if chicago_shape.contains(exact_loc):
                    twitter_in_chicago.append((exact_loc, t, text))
            else:
                ## Case 2: tweet with a polygon bounding box
                poly = data[i]['place']['bounding_box']["coordinates"][0]
                lon = [p[0] for p in poly]
                lat = [p[1] for p in poly]
                centroid = (sum(lon) / len(poly), sum(lat) / len(poly))
                point = Point(centroid)
                ## Check whether the centroid lies within the Chicago boundary
                if chicago_shape.contains(point):
                    twitter_in_chicago.append((poly, t, text))
        except Exception:
            ## No usable geographical location; skip this tweet
            pass

./data/250000-tweets-2021-09-25_04-59-49.json
./data/250000-tweets-2021-09-26_01-46-49.json
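
The fallback for Tweets without an exact coordinate can be illustrated in isolation. The sketch below assumes a `place.bounding_box` structure like the one read in the cell above and averages the corner coordinates to get the centroid that is then tested against the city boundary. The helper name `bbox_centroid` and the sample record are hypothetical and only serve as an illustration.

```python
def bbox_centroid(poly):
    """Return the (lon, lat) mean of the vertices of a bounding-box ring."""
    lons = [p[0] for p in poly]
    lats = [p[1] for p in poly]
    return (sum(lons) / len(poly), sum(lats) / len(poly))

# Hypothetical tweet record with only a place bounding box (values are made up)
sample_tweet = {
    "geo": None,
    "place": {"bounding_box": {"coordinates": [[
        [-87.75, 41.75], [-87.50, 41.75], [-87.50, 42.00], [-87.75, 42.00],
    ]]}},
}

ring = sample_tweet["place"]["bounding_box"]["coordinates"][0]
centroid = bbox_centroid(ring)
print(centroid)
```

The vertex mean equals the geometric centre for a rectangular box, which is the shape these bounding boxes are assumed to have here.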


In [None]:
print("There are in total "+str(len(twitter_in_chicago))+" geo-tagged tweets collected in Chicago on 9/25/2021 and 9/26/2021")

In [None]:
## Example of Twitter Message
twitter_in_chicago[3]

<a id='understand'></a>

## 2. Understanding Human Sentiments of Heat Exposure from Tweet Posts

The following cells apply a heat dictionary, generated using a pretrained NLP model, to interpret the Twitter posts.

A keyword-based NLP method is adopted to generate the heat dictionary. The heat dictionary is then used to assess whether each Tweet is talking about the weather and how strongly it refers to hot or cold weather.

In [None]:
## Read the heat dictionary (word -> heat score)
with open('./geo/data20000.txt','r') as f:
    content = f.read()
dict_word = {}
content_list = content.split(",")
for i in range(0,len(content_list)):
    try:
        ## Each entry is expected to look like 'word': value
        word = content_list[i].split(":")[0].split("'")[1]
        val = float(content_list[i].split(":")[1])
        dict_word[word] = val
    except (IndexError, ValueError):
        ## Skip entries that do not match the expected pattern
        pass

Apply the heat dictionary onto all the Tweets found in the city of Chicago.

In [None]:
## Iterate through all the tweets in Chicago

d_twitter = []
for i in range(0, len(twitter_in_chicago)):
    loc = twitter_in_chicago[i][0]
    t = twitter_in_chicago[i][1]
    text = twitter_in_chicago[i][2]
    ## Split the lowercased text into word tokens
    res = re.findall(r'\w+', text.lower())
    ## Sum the heat scores of all dictionary words in the tweet
    val = 0
    for word in res:
        if word in dict_word:
            val = val + dict_word[word]
    ## Remove weather-irrelevant tweets, i.e. those whose total score is zero
    ## (for example, when none of the words in the heat dictionary show up)
    if (val!=0):
        d_twitter.append((loc, t, val))
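
To make the scoring step concrete, here is a minimal sketch of how a single Tweet text is turned into a heat value. The dictionary `toy_dict` and its scores are made up for illustration (including the assumption that cold-related words carry negative values); the real dictionary is the one loaded from `./geo/data20000.txt`. The helper name `heat_score` is also hypothetical.

```python
import re

# Hypothetical heat dictionary; the real one is loaded from ./geo/data20000.txt
toy_dict = {"hot": 1.0, "sweating": 0.5, "freezing": -1.0}

def heat_score(text, dictionary):
    """Sum the dictionary values of every word token in the lowercased text."""
    tokens = re.findall(r'\w+', text.lower())
    return sum(dictionary.get(token, 0) for token in tokens)

print(heat_score("So HOT today, I'm sweating!", toy_dict))
print(heat_score("Great pizza downtown", toy_dict))
```

A Tweet whose score is zero, like the second one, is treated as weather-irrelevant and dropped.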

In [None]:
print("There are "+str(len(d_twitter))+" weather-related tweets in Chicago")

In [None]:
## Map month abbreviations to month numbers
m_dic = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
         'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

In [None]:
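## Find the time difference, in minutes, between the reference time
## (2021-09-25 00:00) and the time each tweet was posted
today = datetime(2021,9, 25, 0)
weather_related_twitter = []
sec = []  ## holds the time differences in minutes, despite the name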
for twitter in d_twitter:
    loc = twitter[0]
    t = twitter[1].split()
    val = twitter[2]
    month = m_dic[t[1]]
    day = int(t[2])
    year = int(t[5])
    hour = int(t[3].split(":")[0])
    minute = int(t[3].split(":")[1])
    twitter_t = datetime(year, month, day, hour, minute)
    diff_minute = abs(twitter_t - today).total_seconds() / 60.0
    weather_related_twitter.append((loc, diff_minute, val))
    sec.append(diff_minute)
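
As an alternative to splitting the timestamp by hand, the standard library can parse it directly. This is a sketch under two assumptions: that `created_at` follows the pattern `Sat Sep 25 04:59:49 +0000 2021` (consistent with the field positions used above), and that the reference time is meant as midnight UTC. The function name `minutes_since` is hypothetical, and `%a`/`%b` rely on an English locale.

```python
from datetime import datetime, timezone

def minutes_since(created_at, reference):
    """Minutes between a Twitter `created_at` string and a reference time."""
    posted = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    return abs((posted - reference).total_seconds()) / 60.0

# Reference time assumed to be 2021-09-25 00:00 UTC
reference = datetime(2021, 9, 25, 0, 0, tzinfo=timezone.utc)
elapsed = minutes_since("Sat Sep 25 04:59:49 +0000 2021", reference)
print(elapsed)
```

Unlike the cell above, this version keeps the seconds, so the two results can differ by up to one minute.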

In [None]:
# Let's take a look at the temporal distribution of the weather-related Tweets

plt.hist(sec, bins=100, alpha=0.5)
plt.title('When the tweets are posted')
plt.xlabel('Minutes from 2021-09-25 00:00')
plt.ylabel('Count')

plt.show()

<a id='aggregate'></a>

## 3. Aggregate the result to the census tract

The following cells aggregate the human sentiments of heat exposure from each Tweet to the spatial domain in the City of Chicago.

Aggregate the result into census tracts. [Inverse Distance Weighting (IDW)](https://en.wikipedia.org/wiki/Inverse_distance_weighting) is used to estimate the value at each census tract centroid, so tracts whose value would otherwise be missing still receive one.

In [None]:
## Integrate into census tract level
chicago = gpd.read_file("./Census_tract/geo_export_dc0b9c70-c036-4bcc-a602-8e9b9d36ea9f.shp")

In [None]:
chicago

In [None]:
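## Generate a random point within the bounding rectangle of a polygon
## (the `number` argument is unused; it is kept so existing calls still work)
import random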

def generate_random(number, polygon):
    minx, miny, maxx, maxy = polygon.bounds
    pnt = Point(random.uniform(minx, maxx), random.uniform(miny, maxy))
    return pnt

In [None]:
## Function to assign a random location to each tweet
## For the Monte Carlo experiment
## Enable exact and poly if you want to count how many tweets have an exact location and how many come with a polygon
#exact = 0
#poly = 0
def generate_random_loc(weather_related_twitter):
    random_loc_twitter = []
    for ele in weather_related_twitter:
        loc = ele[0]
        if isinstance(loc, Point):
            ## Exact location available
            point = loc
            #exact = exact+1
        else:
            ## Select a random point from the bounding-box polygon
            point = generate_random(1, Polygon(loc))
            #poly = poly+1
        random_loc_twitter.append([point, ele[2]])
    return random_loc_twitter
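
The random placement can be sketched without shapely to show what one Monte Carlo draw looks like. The helper `sample_in_bounds` and the box coordinates below are hypothetical; the fixed seed is an assumption added here so that repeated runs give the same draws.

```python
import random

def sample_in_bounds(poly, rng):
    """Draw a uniform random (lon, lat) inside the bounding rectangle of `poly`."""
    lons = [p[0] for p in poly]
    lats = [p[1] for p in poly]
    return (rng.uniform(min(lons), max(lons)), rng.uniform(min(lats), max(lats)))

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
box = [[-87.75, 41.75], [-87.50, 41.75], [-87.50, 42.00], [-87.75, 42.00]]
draws = [sample_in_bounds(box, rng) for _ in range(10)]
print(draws[0])
```

Because the sampling uses the bounding rectangle, a draw is only guaranteed to fall inside the polygon when the polygon is itself a rectangle, as is assumed for these bounding boxes.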

In [None]:
d_final_census_track = {}

for i in range(0, 10):
    ## Repeat the random placement 10 times (Monte Carlo runs)
    print("current "+str(i))

    ## Draw one random location for each tweet
    random_loc_twitter = generate_random_loc(weather_related_twitter)

    ### Estimate each tract's value with inverse distance weighting

    for index, row in chicago.iterrows():
        key = index
        ele = row['geometry']
        lon = ele.centroid.x
        lat = ele.centroid.y
        ## Iterate through all the tweets
        up = 0
        down = 0
        for twitter in random_loc_twitter:
            pt = twitter[0]
            curr_x = pt.x
            curr_y = pt.y
            curr_val = twitter[1]

            ## Approximate distance in km (1 degree lon = 82 km, 1 degree lat = 111 km)
            distx = (curr_x-lon)*82
            disty = (curr_y-lat)*111

            ## The weight is the inverse of the distance
            w = 1/np.sqrt(distx*distx+disty*disty)

            down = down+w
            up = up+w*curr_val
        ## Weighted average of the tweet values
        rt = up/down

        if (key not in d_final_census_track.keys()):
            d_final_census_track[index]=[rt]
        else:
            d_final_census_track[index].append(rt)
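
The interpolation in the inner loop is a weighted mean, sum(w_i * v_i) / sum(w_i) with w_i = 1 / d_i, where d_i is the approximate distance in km from the target location to Tweet i. The following standalone sketch restates it as a function. The name `idw_estimate`, the sample values, and the early return for a zero distance are assumptions added for illustration and are not part of the notebook's loop.

```python
import math

def idw_estimate(target, samples, km_per_deg_lon=82.0, km_per_deg_lat=111.0):
    """Inverse-distance-weighted mean of sample values at `target` (lon, lat).

    `samples` is a list of ((lon, lat), value) pairs. If a sample coincides
    with the target, its value is returned directly to avoid dividing by zero.
    """
    up = 0.0
    down = 0.0
    for (lon, lat), value in samples:
        dx = (lon - target[0]) * km_per_deg_lon
        dy = (lat - target[1]) * km_per_deg_lat
        dist = math.hypot(dx, dy)
        if dist == 0:
            return value
        w = 1.0 / dist
        up += w * value
        down += w
    return up / down

# Two made-up samples placed symmetrically around the target
samples = [((-87.70, 41.90), 2.0), ((-87.60, 41.90), 4.0)]
estimate = idw_estimate((-87.65, 41.90), samples)
print(estimate)
```

With the target halfway between the two samples, both weights are essentially equal and the estimate is close to the plain average of the two values.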

Calculate Normalized Human Sentiments of Heat Exposure

In [None]:
heat_exposure_map_census_track = {}
for key in d_final_census_track.keys():
    ## Get the average heat exposure
    heat_exposure_map_census_track[key] = np.mean(d_final_census_track[key])

In [None]:
## Normalization to 0-1
mn = min(heat_exposure_map_census_track.values())
mx = max(heat_exposure_map_census_track.values())
for key in heat_exposure_map_census_track.keys():
    norm = (heat_exposure_map_census_track[key]-mn)/(mx-mn)
    heat_exposure_map_census_track[key] = norm
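
The min-max normalization above maps the smallest value to 0 and the largest to 1. A small self-contained sketch follows, with made-up input values and a hypothetical helper name `min_max_normalize`. The guard for the case where all values are equal is an addition that the cell above does not include.

```python
def min_max_normalize(values):
    """Rescale a dict of numbers to the range [0, 1]."""
    mn = min(values.values())
    mx = max(values.values())
    if mx == mn:
        # All values identical: return zeros rather than dividing by zero
        return {key: 0.0 for key in values}
    return {key: (val - mn) / (mx - mn) for key, val in values.items()}

raw = {"tract_a": 2.0, "tract_b": 3.0, "tract_c": 6.0}
normalized = min_max_normalize(raw)
print(normalized)
```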

In [None]:
chicago["he_val"]=list(heat_exposure_map_census_track.values())

Show the resulting GeoDataFrame used for visualization.

In [None]:
chicago

<a id='census'></a>

## 4. Visualization - Census Tract

The following cells conduct a census-tract-level analysis of human sentiments of urban heat in the City of Chicago.

In [None]:
# Let's take a look at how the heat exposure variable is distributed with a histogram
chicago["he_val"].hist(bins=40)
plt.xlabel("Normalized Heat Exposure")
plt.ylabel("Number of census tracts")
plt.title("Human Sentiments of Heat Exposure")
plt.show()

In [None]:
## Creating Choropleth Map with geopandas 
chicago.plot(column = 'he_val', #Assign numerical data column
                      legend = True, #Decide to show legend or not
                      figsize = [20,10],
                      cmap = 'YlOrRd',
                      legend_kwds = {'label': "Normalized Heat Exposure"}) #Name the legend

In [None]:
he_val= list(chicago['he_val'])

In [None]:
## Show the percentile value for 5 classes

print("The 20th percentile is " + str(np.percentile(he_val, 20)))
print("The 40th percentile is " + str(np.percentile(he_val, 40)))
print("The 60th percentile is " + str(np.percentile(he_val, 60)))
print("The 80th percentile is " + str(np.percentile(he_val, 80)))
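
For a quick check of quantile breaks without NumPy, the standard library offers `statistics.quantiles`. This sketch uses made-up values; with `method="inclusive"` the cut points are obtained by linear interpolation between sorted data points, which should correspond to the default behaviour of `np.percentile`, although that equivalence is stated here as an assumption rather than verified against the notebook's data.

```python
import statistics

values = [0.0, 0.25, 0.5, 0.75, 1.0]  # made-up normalized values
# n=5 returns the 4 interior cut points (20th, 40th, 60th, 80th percentiles)
breaks = statistics.quantiles(values, n=5, method="inclusive")
print(breaks)
```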

In [None]:
## Quantile Map
chicago.plot(column = 'he_val', #Assign numerical data column
                      scheme="quantiles", 
                      k=5,
                      legend = True, #Decide to show legend or not
                      figsize = [20,10],
                      cmap = 'YlOrRd') #Color map

In [None]:
he_val= list(chicago['he_val'])
mn, mx = min(he_val), max(he_val)

In [None]:
mn,mx

In [None]:
## Show the equal-interval break values for 5 classes
## (breaks are offset from the minimum; with normalized data mn is 0)

print("The 1st value is " + str(mn + 1*(mx-mn)/5))
print("The 2nd value is " + str(mn + 2*(mx-mn)/5))
print("The 3rd value is " + str(mn + 3*(mx-mn)/5))
print("The 4th value is " + str(mn + 4*(mx-mn)/5))
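
Equal-interval breaks split the range from the minimum to the maximum into bins of the same width, so the i-th break sits at `mn + i*(mx-mn)/k`. For data normalized to 0-1 the offset `mn` is zero, but it matters for other data. A minimal sketch with made-up values (the helper name `equal_interval_breaks` is hypothetical):

```python
def equal_interval_breaks(values, k=5):
    """Return the k-1 interior break values that split [min, max] into k equal bins."""
    mn, mx = min(values), max(values)
    width = (mx - mn) / k
    return [mn + i * width for i in range(1, k)]

interval_breaks = equal_interval_breaks([2.0, 3.0, 7.0], k=5)
print(interval_breaks)
```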

In [None]:
## Equal Interval
chicago.plot(column = 'he_val', #Assign numerical data column
                      scheme="equal_interval", 
                      k=5,
                      legend = True, #Decide to show legend or not
                      figsize = [20,10],
                      cmap = 'YlOrRd') #Color map

In [None]:
!pip install jenkspy

In [None]:
import jenkspy
he_val= list(chicago['he_val'])

[a0, a1, a2, a3, a4, a5] = jenkspy.jenks_breaks(he_val, n_classes=5)
print("The 1st value is " + str(a1))
print("The 2nd value is " + str(a2))
print("The 3rd value is " + str(a3))
print("The 4th value is " + str(a4))

In [None]:
## Natural Break
chicago.plot(column = 'he_val', #Assign numerical data column
                      scheme="natural_breaks", 
                      k=5,
                      legend = True, #Decide to show legend or not
                      figsize = [20,10],
                      cmap = 'YlOrRd') #Color map

<a id='1km'></a>

## 5. Visualization - 1km Spatial Resolution

The following cells conduct an analysis of human sentiments of urban heat at 1 km spatial resolution in the City of Chicago.

In [None]:
## Generate a raster at about 1 km spatial resolution
## One degree of latitude = 111 km
## In Chicago, where latitude = 41.881832, one degree of longitude = 82 km
## We use this estimate for the following analysis
## This works because the City of Chicago is small
lat_start = 41.05
lon_start = -87.96

incre_lat = 1/111
incre_lon = 1/82

lat_end = 42.05
lon_end = -87.5

raster = []

lat = lat_start

while(lat<lat_end):
    lon = lon_start
    while(lon<lon_end):
        curr_point = Point(lon, lat)
        if (curr_point.within(chicago_shape)):
            raster.append([lon, lat])
        lon = lon+incre_lon
    lat = lat+incre_lat
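
The 82 km figure can be sanity-checked with the usual spherical approximation, in which one degree of longitude spans about 111.32 km at the equator and shrinks with the cosine of the latitude. The sketch below is an assumption-based check rather than the method used to derive the notebook's constants; the function name `km_per_deg_lon` is hypothetical.

```python
import math

KM_PER_DEG_LAT = 111.0  # approximation used in this notebook

def km_per_deg_lon(lat_deg):
    """Approximate length in km of one degree of longitude at a given latitude."""
    return 111.32 * math.cos(math.radians(lat_deg))

chicago_lat = 41.881832
lon_km = km_per_deg_lon(chicago_lat)
print(round(lon_km, 1))

# Grid steps in degrees that correspond to roughly 1 km in each direction
print(1 / KM_PER_DEG_LAT, 1 / lon_km)
```

The result is within about one kilometre of the 82 km used above, which is a small difference at the scale of a single city.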

Generate results for the human sentiments of heat exposure at different timeframes at fine temporal granularity.

Calculate the human sentiments of heat exposure. Inverse Distance Weighting (IDW) is used for spatial units that do not contain any points, and Monte Carlo simulation is used to handle social media posts whose location is a multi-polygon rather than an exact point.

In [None]:
## Set seed for reproducibility (0 is an arbitrary choice)
## This may take a while
random.seed(0)

d_final = {}

for i in range(0, 10):
    ## Repeat the random placement 10 times (Monte Carlo runs)
    print("current "+str(i))

    ## Draw one random location for each tweet
    random_loc_twitter = generate_random_loc(weather_related_twitter)

    ### Estimate each grid cell's value with inverse distance weighting


    for ele in raster:
        lon = ele[0]
        lat = ele[1]
        ## iterate through all the values in the existing twitter
        up = 0
        down = 0
        IDW = 0
        for twitter in random_loc_twitter:
            pt = twitter[0]
            curr_x = pt.x
            curr_y = pt.y
            curr_val = twitter[1]

            distx = (curr_x-lon)*82
            disty = (curr_y-lat)*111

            w = 1/np.sqrt(distx*distx+disty*disty)

            down = down+w
            up = up+w*curr_val
        rt = up/down
        
        key = (ele[0],ele[1])
        if (key not in d_final.keys()):
            d_final[key]=[rt]
        else:
            d_final[key].append(rt)

In [None]:
heat_exposure_map = {}
for key in d_final.keys():
    ## Get the average heat exposure
    heat_exposure_map[key] = np.mean(d_final[key])

In [None]:
## Normalization to 0-1
mn = min(heat_exposure_map.values())
mx = max(heat_exposure_map.values())
for key in heat_exposure_map.keys():
    norm = (heat_exposure_map[key]-mn)/(mx-mn)
    heat_exposure_map[key] = norm

In [None]:
lonl = []
latl = []
var = []

final_heat_exposure_map = {}
for key in heat_exposure_map.keys():
    
    lonl.append(key[0])
    latl.append(key[1])
    var.append(heat_exposure_map[key])

In [None]:
df = pd.DataFrame(np.column_stack([lonl, latl, var]), 
                  columns=['lon', 'lat', 'val'])

In [None]:
df

In [None]:
df.plot.scatter(title='Human Sentiments of Heat Exposure from 6 to 9', x='lon', y='lat', c='val', figsize = [10,10], subplots=True, marker="s", s = 155, colormap='viridis')