# Analysis of Safecast radiation data

In [17]:
import pandas as pd
import plotly.express as px

## Data loading

### Load the dataset

In [18]:
# read_csv already returns a DataFrame, so no extra wrapping is needed
df = pd.read_csv('data/safecast.csv')

df

Unnamed: 0,value,unit,location_name,captured_at,device_id,height,devicetype_id,station_id,latitude,longitude
0,19.500000,cpm,,2024-03-20T03:00:03.000Z,100221.0,13.0,Pointcast V1,,31.833193,130.301922
1,19.100000,status,,2024-03-20T03:00:05.000Z,100229.0,13.0,"DeviceID:10022,Temperature:19.1,Battery Voltag...",,31.833193,130.301922
2,14.666667,cpm,"Phoenix,AZ",2024-03-20T03:00:35.000Z,4841.0,,,,33.665600,-112.182700
3,26.000000,cpm,"Earl's House, Johns Creek, GA, USA",2024-03-20T06:00:12.066Z,90.0,615.0,,,34.067301,-84.211610
4,0.172000,usv,"Earl's House, Johns Creek, GA, USA",2024-03-20T06:00:12.449Z,90.0,615.0,,,34.067301,-84.211610
...,...,...,...,...,...,...,...,...,...,...
447,19.000000,cpm,"Bad Pyrmont, DE",2024-03-20T06:00:13.000Z,108.0,,,,51.980700,9.234500
448,17.333333,cpm,"Waterland, NL",2024-03-20T03:00:12.000Z,205.0,,,,52.427600,4.971100
449,17.000000,cpm,"Berlin, DE",2024-03-20T00:00:18.000Z,204.0,,,,52.449400,13.312700
450,10.650000,PM10 ug/m3,,2024-03-20T06:00:16.385Z,244.0,,,,53.864000,-3.047000


## Data preprocessing

Next, we drop every row whose unit is not `cpm` (counts per minute). Most of the measurements use `cpm`, so we keep only those and remove the measurements reported in other units.

In [19]:
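# .copy() makes the filtered frame independent, so later assignments do not raise SettingWithCopyWarning
df = df[df['unit'] == 'cpm'].copy()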
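
Before filtering, it can help to check which units are present and how often each one occurs. The following is a minimal sketch on a small hypothetical frame (`sample`, `unit_counts`, and `cpm_only` are illustrative names, and the rows are made up to resemble the dataset rather than read from the file):

```python
import pandas as pd

# Hypothetical sample that mimics the 'value' and 'unit' columns of the dataset
sample = pd.DataFrame({
    'value': [19.5, 19.1, 14.67, 0.172, 10.65],
    'unit': ['cpm', 'status', 'cpm', 'usv', 'PM10 ug/m3'],
})

# Count how many measurements use each unit
unit_counts = sample['unit'].value_counts()
print(unit_counts)

# Keep only the cpm rows; .copy() returns an independent frame that is safe to modify
cpm_only = sample[sample['unit'] == 'cpm'].copy()
print(len(cpm_only))
```

Running the same `value_counts()` call on the real `df['unit']` column would show whether `cpm` really dominates before any rows are dropped.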

In [20]:
df

Unnamed: 0,value,unit,location_name,captured_at,device_id,height,devicetype_id,station_id,latitude,longitude
0,19.500000,cpm,,2024-03-20T03:00:03.000Z,100221.0,13.0,Pointcast V1,,31.833193,130.301922
2,14.666667,cpm,"Phoenix,AZ",2024-03-20T03:00:35.000Z,4841.0,,,,33.665600,-112.182700
3,26.000000,cpm,"Earl's House, Johns Creek, GA, USA",2024-03-20T06:00:12.066Z,90.0,615.0,,,34.067301,-84.211610
5,38.000000,cpm,,2024-03-20T09:00:07.000Z,65008.0,,,,34.482545,136.163097
6,44.000000,cpm,,2024-03-20T09:00:37.000Z,65008.0,,,,34.482545,136.163113
...,...,...,...,...,...,...,...,...,...,...
444,22.500000,cpm,,2024-03-20T12:00:01.000Z,200091.0,5.0,Pointcast V1,,42.381242,-71.111946
446,14.500000,cpm,"Wadsworth, IL",2024-03-20T03:00:18.000Z,216.0,,,,42.434700,-87.901200
447,19.000000,cpm,"Bad Pyrmont, DE",2024-03-20T06:00:13.000Z,108.0,,,,51.980700,9.234500
448,17.333333,cpm,"Waterland, NL",2024-03-20T03:00:12.000Z,205.0,,,,52.427600,4.971100


The location name is missing for some measurements. We replace these `NaN` values with the string `'Unknown location'`.

In [25]:
df.loc[df['location_name'].isnull(), 'location_name'] = 'Unknown location'
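
An equivalent way to fill the missing names is `Series.fillna`. The sketch below assumes a small hypothetical frame (`sample`) in place of the real data and counts the missing entries before and after the replacement:

```python
import pandas as pd

# Hypothetical sample with two missing location names
sample = pd.DataFrame({
    'location_name': ['Phoenix,AZ', None, 'Berlin, DE', None],
    'value': [14.67, 19.5, 17.0, 38.0],
})

# Number of missing names before the replacement
missing_before = int(sample['location_name'].isna().sum())

# Replace every missing name with a placeholder string
sample['location_name'] = sample['location_name'].fillna('Unknown location')

# Number of missing names after the replacement
missing_after = int(sample['location_name'].isna().sum())
print(missing_before, missing_after)
```

Both forms give the same result here; `fillna` is simply shorter when the replacement is a single constant.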

We can convert the `float64` columns to `float32`, which halves their memory footprint and can speed up calculations.

In [26]:
df.dtypes

value            float32
unit              object
location_name     object
captured_at       object
device_id        float32
height           float32
devicetype_id     object
station_id       float32
latitude         float32
longitude        float32
dtype: object

In [27]:
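to_convert = ['value', 'device_id', 'height', 'station_id', 'latitude', 'longitude']
for col in to_convert:
    # Coerce unparseable entries to NaN, then downcast to float32.
    # Assigning the whole column (rather than df.loc[:, col]) lets the dtype change.
    df[col] = pd.to_numeric(df[col], errors='coerce').astype('float32')

df.dtypes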

value            float32
unit              object
location_name     object
captured_at       object
device_id        float32
height           float32
devicetype_id     object
station_id       float32
latitude         float32
longitude        float32
dtype: object
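
To see what the downcast buys, the memory use of the numeric columns can be compared before and after. This is a minimal sketch on a hypothetical three-row frame (`sample`, `bytes_before`, and `bytes_after` are illustrative names), not a measurement of the real dataset:

```python
import pandas as pd

# Hypothetical sample of numeric columns; pandas stores them as float64 by default
sample = pd.DataFrame({
    'value': [19.5, 14.666667, 26.0],
    'latitude': [31.833193, 33.6656, 34.067301],
    'longitude': [130.301922, -112.1827, -84.21161],
})

# Memory used by the column data (index excluded) before the conversion
bytes_before = int(sample.memory_usage(index=False).sum())

# Downcast each column to float32 by assigning the whole column
for col in ['value', 'latitude', 'longitude']:
    sample[col] = pd.to_numeric(sample[col], errors='coerce').astype('float32')

# Memory used after the conversion
bytes_after = int(sample.memory_usage(index=False).sum())
print(sample.dtypes)
print(bytes_before, bytes_after)
```

Note that `float32` keeps only about seven significant digits, so coordinates lose some precision beyond that point.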

## Data visualization

In [44]:
fig = px.scatter_geo(
    df, 
    lat='latitude', 
    lon='longitude', 
    color='value',
    hover_name='location_name',
    title='Radiation levels (cpm)',
    color_continuous_scale=['green', 'yellow', 'red', 'purple'],
)
fig.update_layout(height=800)
fig.show()