# Weather Station Clustering using DBSCAN

The objective here is to cluster weather stations in Canada by location. DBSCAN can be used here, for instance, to find groups of stations that show the same weather conditions.

In [1]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
#importing dataset
df=pd.read_csv(r"G:\Data science\Datasets\weather-stations.csv")
df.head()

Unnamed: 0,Stn_Name,Lat,Long,Prov,Tm,DwTm,D,Tx,DwTx,Tn,...,DwP,P%N,S_G,Pd,BS,DwBS,BS%,HDD,CDD,Stn_No
0,CHEMAINUS,48.935,-123.742,BC,8.2,0.0,,13.5,0.0,1.0,...,0.0,,0.0,12.0,,,,273.3,0.0,1011500
1,COWICHAN LAKE FORESTRY,48.824,-124.133,BC,7.0,0.0,3.0,15.0,0.0,-3.0,...,0.0,104.0,0.0,12.0,,,,307.0,0.0,1012040
2,LAKE COWICHAN,48.829,-124.052,BC,6.8,13.0,2.8,16.0,9.0,-2.5,...,9.0,,,11.0,,,,168.1,0.0,1012055
3,DISCOVERY ISLAND,48.425,-123.226,BC,,,,12.5,0.0,,...,,,,,,,,,,1012475
4,DUNCAN KELVIN CREEK,48.735,-123.728,BC,7.7,2.0,3.4,14.5,2.0,-1.0,...,2.0,,,11.0,,,,267.7,0.0,1012573


The columns in the dataset are:

    Name in the table-Meaning
    
    Stn_Name-Station Name
    Lat-Latitude (North+, degrees)
    Long-Longitude (West - , degrees)
    Prov-Province
    Tm-Mean Temperature (°C)
    DwTm-Days without Valid Mean Temperature
    D-Mean Temperature difference from Normal (1981-2010) (°C)
    Tx-Highest Monthly Maximum Temperature (°C)
    DwTx-Days without Valid Maximum Temperature
    Tn-Lowest Monthly Minimum Temperature (°C)
    DwTn-Days without Valid Minimum Temperature
    S-Snowfall (cm)
    DwS-Days without Valid Snowfall
    S%N-Percent of Normal (1981-2010) Snowfall
    P-Total Precipitation (mm)
    DwP-Days without Valid Precipitation
    P%N-Percent of Normal (1981-2010) Precipitation
    S_G-Snow on the ground at the end of the month (cm)
    Pd-Number of days with Precipitation 1.0 mm or more
    BS-Bright Sunshine (hours)
    DwBS-Days without Valid Bright Sunshine
    BS%-Percent of Normal (1981-2010) Bright Sunshine
    HDD-Degree Days below 18 °C
    CDD-Degree Days above 18 °C
    Stn_No-Climate station identifier (first 3 digits indicate drainage basin, last 4 characters are for sorting alphabetically).
    NA-Not Available

In [3]:
#general information on dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1341 entries, 0 to 1340
Data columns (total 25 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Stn_Name  1341 non-null   object 
 1   Lat       1341 non-null   float64
 2   Long      1341 non-null   float64
 3   Prov      1341 non-null   object 
 4   Tm        1256 non-null   float64
 5   DwTm      1256 non-null   float64
 6   D         357 non-null    float64
 7   Tx        1260 non-null   float64
 8   DwTx      1260 non-null   float64
 9   Tn        1260 non-null   float64
 10  DwTn      1260 non-null   float64
 11  S         586 non-null    float64
 12  DwS       586 non-null    float64
 13  S%N       198 non-null    float64
 14  P         1227 non-null   float64
 15  DwP       1227 non-null   float64
 16  P%N       209 non-null    float64
 17  S_G       798 non-null    float64
 18  Pd        1227 non-null   float64
 19  BS        0 non-null      float64
 20  DwBS      0 non-null      floa

In [5]:
#data cleaning
df=df[pd.notnull(df['Tm'])].reset_index()
df.head()

Unnamed: 0,index,Stn_Name,Lat,Long,Prov,Tm,DwTm,D,Tx,DwTx,...,DwP,P%N,S_G,Pd,BS,DwBS,BS%,HDD,CDD,Stn_No
0,0,CHEMAINUS,48.935,-123.742,BC,8.2,0.0,,13.5,0.0,...,0.0,,0.0,12.0,,,,273.3,0.0,1011500
1,1,COWICHAN LAKE FORESTRY,48.824,-124.133,BC,7.0,0.0,3.0,15.0,0.0,...,0.0,104.0,0.0,12.0,,,,307.0,0.0,1012040
2,2,LAKE COWICHAN,48.829,-124.052,BC,6.8,13.0,2.8,16.0,9.0,...,9.0,,,11.0,,,,168.1,0.0,1012055
3,4,DUNCAN KELVIN CREEK,48.735,-123.728,BC,7.7,2.0,3.4,14.5,2.0,...,2.0,,,11.0,,,,267.7,0.0,1012573
4,5,ESQUIMALT HARBOUR,48.432,-123.439,BC,8.8,0.0,,13.1,0.0,...,8.0,,,12.0,,,,258.6,0.0,1012710


In [55]:
#statistical analysis
df.describe()

Unnamed: 0,index,Lat,Long,Tm,DwTm,D,Tx,DwTx,Tn,DwTn,...,S_G,Pd,BS,DwBS,BS%,HDD,CDD,x,y,clus
count,1256.0,1256.0,1256.0,1256.0,1256.0,357.0,1256.0,1256.0,1255.0,1255.0,...,733.0,1144.0,0.0,0.0,0.0,1256.0,1256.0,1256.0,1256.0,1256.0
mean,665.015127,51.322661,-97.054425,-12.062341,2.186306,-2.768908,2.625717,1.772293,-26.310438,1.737052,...,30.425648,7.443182,,,,773.27715,0.0,51.322661,-97.054425,1.872611
std,386.828023,6.216936,23.368497,10.416366,4.903077,4.840769,8.853532,4.161374,12.591393,4.22203,...,33.066732,4.761164,,,,311.119277,0.0,6.216936,23.368497,2.556083
min,0.0,41.949,-140.868,-38.2,0.0,-12.0,-29.8,0.0,-49.7,0.0,...,0.0,0.0,,,,26.0,0.0,41.949,-140.868,-1.0
25%,328.75,47.31575,-117.0165,-18.5,0.0,-7.1,-3.5,0.0,-35.45,0.0,...,2.0,4.0,,,,560.1,0.0,47.31575,-117.0165,-1.0
50%,656.5,49.922,-103.125,-13.8,0.0,-4.6,2.5,0.0,-29.1,0.0,...,25.0,7.0,,,,813.5,0.0,49.922,-103.125,3.0
75%,1005.25,53.39375,-73.825,-5.6,2.0,2.7,10.0,1.0,-21.7,1.0,...,45.0,11.0,,,,988.0,0.0,53.39375,-73.825,5.0
max,1340.0,82.5,-52.753,9.6,27.0,7.8,22.0,27.0,5.3,27.0,...,253.0,28.0,,,,1523.4,0.0,82.5,-52.753,7.0


## Clustering based on location and mean, maximum and minimum temperature

In [7]:
#data preprocessing
df['x']=np.asarray(df['Lat'])
df['y']=np.asarray(df['Long'])

In [33]:
from sklearn.preprocessing import StandardScaler
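#select features; x and y hold latitude and longitude
xy=df[['x','y','Tm','Tx','Tn']]
#replace missing values with 0 (note that 0 is also a valid temperature reading)
xy=np.nan_to_num(xy)
#standardize each feature to zero mean and unit variance
xy=StandardScaler().fit_transform(xy)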
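
`np.nan_to_num` replaces a missing value with 0, so a station with no `Tn` reading is treated as having a minimum of 0°C. The sketch below contrasts that with dropping incomplete rows. It uses a made-up three-row frame (`toy` and its values are illustrative, not taken from the dataset), and dropping rows is shown as one possible alternative, not as the notebook's method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "Tm": [-12.0, -5.0, -20.0],
    "Tn": [-26.0, np.nan, -35.0],
})

# nan_to_num turns the missing Tn into 0.0, which looks like a real reading.
filled = np.nan_to_num(toy.to_numpy())

# dropna removes the incomplete row instead of inventing a value.
complete = toy.dropna(subset=["Tm", "Tn"])

print(filled)
print(complete)
```

Dropping rows changes the number of stations that get a cluster label, so the later counts would differ slightly from the ones shown in this notebook.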

In [34]:
#modeling
from sklearn.cluster import DBSCAN
db=DBSCAN(eps=0.3,min_samples=10).fit(xy)
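
To make the two parameters concrete: `eps` is the neighbourhood radius and `min_samples` is the number of points (the point itself included) that must lie within that radius for a point to count as a core point. The sketch below runs the same settings on made-up 2-D points (two tight groups of 12 points and one isolated point, all hypothetical) to show how dense groups receive labels and an isolated point is marked as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# 3 x 4 grid of offsets with 0.05 spacing: 12 tightly packed points.
offsets = np.array([[i * 0.05, j * 0.05] for i in range(3) for j in range(4)])

blob_a = offsets + np.array([0.0, 0.0])   # dense group near the origin
blob_b = offsets + np.array([5.0, 5.0])   # second dense group, far from the first
loner = np.array([[10.0, -10.0]])         # isolated point with no neighbours
points = np.vstack([blob_a, blob_b, loner])

# Same parameter values as the notebook cell above, applied to the toy points.
toy_labels = DBSCAN(eps=0.3, min_samples=10).fit(points).labels_
print(toy_labels)
```

Every point in a group has all 12 group members within 0.3, so each group forms a cluster. The isolated point has only itself in range and gets the label -1.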

In [39]:
df['clus']=db.labels_
df[['Stn_Name','x','y','Tm','Tx','Tn','clus']].head()

Unnamed: 0,Stn_Name,x,y,Tm,Tx,Tn,clus
0,CHEMAINUS,48.935,-123.742,8.2,13.5,1.0,0
1,COWICHAN LAKE FORESTRY,48.824,-124.133,7.0,15.0,-3.0,0
2,LAKE COWICHAN,48.829,-124.052,6.8,16.0,-2.5,0
3,DUNCAN KELVIN CREEK,48.735,-123.728,7.7,14.5,-1.0,0
4,ESQUIMALT HARBOUR,48.432,-123.439,8.8,13.1,1.9,0


In [40]:
df.groupby(['clus'])['Stn_Name'].count()

clus
-1    402
 0    178
 1     21
 2      9
 3    254
 4     54
 5    305
 6     19
 7     14
Name: Stn_Name, dtype: int64

Outliers (noise points) carry the cluster label -1. With 402 of 1256 stations, noise is the largest single group, although most stations (about 68%) do fall inside a cluster.
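
A quick check of how the stations split between noise and clusters, using the counts printed above (the dictionary `counts` is copied by hand from that output):

```python
# Cluster sizes copied from the groupby output (label -> number of stations).
counts = {-1: 402, 0: 178, 1: 21, 2: 9, 3: 254, 4: 54, 5: 305, 6: 19, 7: 14}

total = sum(counts.values())
noise_share = counts[-1] / total

# Stations that were assigned to some cluster (every label except -1).
clustered = total - counts[-1]

# Share of clustered stations held by the three largest clusters.
top3 = sorted((n for label, n in counts.items() if label != -1), reverse=True)[:3]
top3_share = sum(top3) / clustered

print(total, round(noise_share, 3), round(top3_share, 3))
```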

In [56]:
df.groupby(['clus'])[['x','y','Tm','Tx','Tn']].mean()


clus,x,y,Tm,Tx,Tn
-1,55.9301,-94.840269,-18.011194,-1.368408,-32.793267
0,50.09309,-123.54914,6.238202,13.194944,-1.534831
1,52.082238,-121.11519,-0.552381,9.895238,-13.2
2,54.294111,-125.540444,-3.244444,8.6,-17.255556
3,52.175157,-108.163642,-13.748425,4.342913,-30.42126
4,49.95913,-112.797222,-4.153704,15.162963,-22.944444
5,45.897525,-74.033518,-16.420984,-2.920328,-31.381967
6,45.492053,-63.485947,-10.178947,4.321053,-22.421053
7,47.502929,-54.176357,-4.371429,7.2,-14.307143


Excluding outliers, three clusters (0, 3 and 5) hold most of the clustered stations: 737 of the 854 stations assigned to a cluster. The figures below are averages over the stations in each cluster.

    Cluster 0 : Has a mean temperature of 6.2°C and average maximum and minimum temperatures of 13.2°C and -1.5°C respectively, centred near latitude 50.09 degrees and longitude -123.55 degrees.

    Cluster 3 : Has a mean temperature of -13.7°C and average maximum and minimum temperatures of 4.3°C and -30.4°C respectively, centred near latitude 52.18 degrees and longitude -108.16 degrees.

    Cluster 5 : Has a mean temperature of -16.4°C and average maximum and minimum temperatures of -2.9°C and -31.4°C respectively, centred near latitude 45.90 degrees and longitude -74.03 degrees.


## Test Case

Let's consider an imaginary weather station A with Lat=51.3, Long=-97, Tm=-12, Tx=2.6 and Tn=-26 (values close to the dataset means) and find its cluster.

Our imaginary station A belongs to Cluster 3, which has a mean temperature of -13.7°C and average maximum and minimum temperatures of 4.3°C and -30.4°C respectively, centred near latitude 52.18 degrees and longitude -108.16 degrees.
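
The notebook does not show the code behind this assignment, and scikit-learn's `DBSCAN` has no `predict` method. One common approach, shown here as an assumption and not necessarily what was done above, is to scale the new point with the same fitted scaler and give it the label of the nearest core sample if that sample lies within `eps`, and -1 otherwise. The helper `assign_cluster` and the toy points are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def assign_cluster(model, scaler, point):
    """Hypothetical helper: label of the nearest core sample within eps, else -1."""
    if model.components_.shape[0] == 0:
        return -1  # the model found no core samples at all
    scaled = scaler.transform(np.asarray(point, dtype=float).reshape(1, -1))
    dists = np.linalg.norm(model.components_ - scaled, axis=1)
    nearest = int(np.argmin(dists))
    if dists[nearest] > model.eps:
        return -1  # too far from every core sample: treat as noise
    return int(model.labels_[model.core_sample_indices_[nearest]])

# Toy data (illustrative only): two tight groups of 12 points each.
offsets = np.array([[i * 0.05, j * 0.05] for i in range(3) for j in range(4)])
points = np.vstack([offsets, offsets + 5.0])

scaler = StandardScaler().fit(points)  # keep the fitted scaler for new points
model = DBSCAN(eps=0.3, min_samples=10).fit(scaler.transform(points))

label_near_a = assign_cluster(model, scaler, [0.05, 0.05])
label_near_b = assign_cluster(model, scaler, [5.05, 5.10])
label_between = assign_cluster(model, scaler, [2.5, 2.5])
print(label_near_a, label_near_b, label_between)
```

Applying this to the weather data would require keeping the fitted `StandardScaler` object instead of calling `fit_transform` on a throwaway instance, so that station A is scaled with the same means and standard deviations as the training stations.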