[Reference](https://towardsdatascience.com/contact-tracing-using-less-than-30-lines-of-python-code-6c5175f5385f)

[1] “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise” Ester, M., H. P. Kriegel, J. Sander, and X. Xu, In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996

[2] “DBSCAN revisited, revisited: why and how you should (still) use DBSCAN” Schubert, E., J. Sander, M. Ester, H. P. Kriegel, and X. Xu, ACM Transactions on Database Systems (TODS), 42(3), 19. 2017

[3] scikit-learn documentation.

Clustering algorithms can be grouped into three families:
- Density-based clustering: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure)
- Hierarchical clustering: CURE (Clustering Using Representatives) and BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
- Partitioning-based clustering: K-means and CLARANS (Clustering Large Applications based upon Randomized Search)

In [12]:
import numpy as np
import pandas as pd
!pip install pygal
import pygal
from sklearn.cluster import DBSCAN



In [3]:
from IPython.display import display, HTML
base_html = """
<!DOCTYPE html>
<html>
  <head>
  <script type="text/javascript" src="http://kozea.github.com/pygal.js/javascripts/svg.jquery.js"></script>
  <script type="text/javascript" src="https://kozea.github.io/pygal.js/2.0.x/pygal-tooltips.min.js"></script>
  </head>
  <body>
    <figure>
      {rendered_chart}
    </figure>
  </body>
</html>
"""

In [6]:
!pip install fsspec

Collecting fsspec
Successfully installed fsspec-0.8.4


In [7]:
dataFrame = pd.read_json("https://raw.githubusercontent.com/Branden-Kang/Python-practice/master/Data/MOCK_DATA.json")
dataFrame.head()

| | User | TimeStamp | Longitude | Latitude |
|---|---|---|---|---|
| 0 | Arthur | 2020-08-27 17:33:33 | 60.077519 | 13.988041 |
| 1 | Walter | 2020-08-27 20:13:18 | 60.029391 | 13.903152 |
| 2 | Arthur | 2020-08-27 18:22:23 | 60.078368 | 13.933152 |
| 3 | Walter | 2020-08-27 03:38:36 | 60.002145 | 13.967506 |
| 4 | James | 2020-08-27 01:11:35 | 60.040521 | 13.966431 |


In [9]:
# Group each user's (Latitude, Longitude) points for plotting
disp_dict = {}
for index, row in dataFrame.iterrows():
    disp_dict.setdefault(row['User'], []).append((row['Latitude'], row['Longitude']))
xy_chart = pygal.XY(stroke=False)
# Add one series per user, in alphabetical order
for k, v in sorted(disp_dict.items()):
    xy_chart.add(k, v)
display(HTML(base_html.format(rendered_chart=xy_chart.render(is_unicode=True))))

In [14]:
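safe_distance = 0.0018288 # a radial distance of 6 feet in kilometers
# Caveat: scikit-learn's 'haversine' metric treats its input as [latitude, longitude] in radians
# and measures eps in radians, so with degree inputs this eps is not literally a 6-foot radius.
model = DBSCAN(eps=safe_distance, min_samples=2, metric='haversine').fit(dataFrame[['Latitude', 'Longitude']])
# Other available metrics include 'euclidean', 'manhattan', and 'minkowski'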
core_samples_mask = np.zeros_like(model.labels_, dtype=bool)
core_samples_mask[model.core_sample_indices_] = True
labels = model.labels_
dataFrame['Cluster'] = model.labels_.tolist()
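
The unit handling in the cell above is easy to get wrong, so here is a minimal sketch of the haversine formula and of the kilometers-to-radians conversion. It is an illustration and not part of the original notebook: `haversine_km`, `km_to_radians`, and the `EARTH_RADIUS_KM` value are assumptions introduced for this example.

```python
from math import radians, sin, cos, asin, sqrt

# Assumed mean Earth radius in kilometers (hypothetical constant for this sketch)
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def km_to_radians(distance_km):
    """Convert a surface distance in kilometers to an angle in radians."""
    return distance_km / EARTH_RADIUS_KM


# 6 feet (0.0018288 km) expressed as an angular distance
eps_rad = km_to_radians(0.0018288)

# One degree of longitude along the equator, in kilometers
one_degree_km = haversine_km(0.0, 0.0, 0.0, 1.0)
```

Under these assumptions, a radians-consistent variant of the clustering call would pass `np.radians(dataFrame[['Latitude', 'Longitude']])` as the input and `eps_rad` as `eps`. Doing so would likely change the cluster labels reported in this notebook.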

In [15]:
# Group the (Latitude, Longitude) points by cluster label for plotting
disp_dict_clust = {}
for index, row in dataFrame.iterrows():
    disp_dict_clust.setdefault(row['Cluster'], []).append((row['Latitude'], row['Longitude']))
# Number of distinct labels (the noise label counts as one)
print(len(disp_dict_clust.keys()))
from pygal.style import LightenStyle
dark_lighten_style = LightenStyle('#F35548')
xy_chart = pygal.XY(stroke=False, style=dark_lighten_style)
# Add one series per cluster label
for k, v in disp_dict_clust.items():
    xy_chart.add(str(k), v)
display(HTML(base_html.format(rendered_chart=xy_chart.render(is_unicode=True))))

18


A cluster label of -1 marks noise: points that DBSCAN did not assign to any cluster.
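
To show where the -1 label comes from, here is a minimal pure-Python sketch of the DBSCAN idea using Euclidean distance. It is a teaching aid and not scikit-learn's implementation: the function `dbscan_sketch` and the toy points are hypothetical. As in scikit-learn, a point counts as its own neighbor when compared against `min_samples`.

```python
from math import dist  # Python 3.8+

NOISE = -1


def dbscan_sketch(points, eps, min_samples):
    """Label each point with a cluster id, or -1 if it is noise."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster  # i is a core point, so it starts a new cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reachable from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:
                queue.extend(j_nbrs)  # j is also a core point, so keep expanding
        cluster += 1
    return labels


toy_points = [(0, 0), (0, 0.5), (0, 1.0), (5, 5), (5, 5.4), (10, 10)]
toy_labels = dbscan_sketch(toy_points, eps=0.6, min_samples=2)
```

In this toy run the isolated point `(10, 10)` has no neighbor within `eps`, so it keeps the label -1, while the other points form two clusters.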

## Obtain all clusters a specific person belongs to

In [17]:
inputName = "William"
inputNameClusters = set()
for i in range(len(dataFrame)):
    if dataFrame['User'][i] == inputName:
        inputNameClusters.add(dataFrame['Cluster'][i])

## Get the other people in those clusters

In [19]:
infected = set()
for cluster in inputNameClusters:
    if cluster != -1:
        namesInCluster = dataFrame.loc[dataFrame['Cluster'] == cluster, 'User']
        for i in range(len(namesInCluster)):
            name = namesInCluster.iloc[i]
            if name != inputName:
                infected.add(name)

In [20]:
print(infected)

{'Doreen', 'John', 'James'}
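
The two loops above can also be written with boolean masks instead of index loops. The sketch below is one possible variant, not the notebook's own code. The helper `contacts_of` and the toy frame are hypothetical, and the code assumes the frame already has the `User` and `Cluster` columns built earlier.

```python
import pandas as pd


def contacts_of(frame, name, noise_label=-1):
    """Return the users who share at least one non-noise cluster with `name`."""
    is_person = frame["User"] == name
    # Clusters the person appears in, ignoring the noise label
    clusters = frame.loc[is_person & (frame["Cluster"] != noise_label), "Cluster"].unique()
    # Rows in those clusters that belong to someone else
    shares_cluster = frame["Cluster"].isin(clusters) & ~is_person
    return set(frame.loc[shares_cluster, "User"])


# Toy data: Ann and Bob share cluster 0; Bob and Dee share cluster 1
toy = pd.DataFrame({
    "User": ["Ann", "Bob", "Ann", "Cid", "Dee", "Bob"],
    "Cluster": [0, 0, -1, -1, 1, 1],
})
ann_contacts = contacts_of(toy, "Ann")
```

Filtering out the noise label before collecting the person's clusters plays the same role as the `if cluster != -1` check in the loop version.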


In [21]:
def contactTracing(dataFrame, inputName):
    # Check that the name is valid
    assert (inputName in dataFrame['User'].tolist()), "User doesn't exist"
    # Social distance
    safe_distance = 0.0018288  # 6 feet in kilometers
    # Apply the model; for a larger or noisier dataset, increase min_samples
    model = DBSCAN(eps=safe_distance, min_samples=2, metric='haversine').fit(dataFrame[['Latitude', 'Longitude']])
    # Get the cluster labels found by the algorithm
    labels = model.labels_
    # Add the cluster labels to the dataframe
    dataFrame['Cluster'] = model.labels_.tolist()
    # Get the clusters that inputName is a part of
    inputNameClusters = set()
    for i in range(len(dataFrame)):
        if dataFrame['User'][i] == inputName:
            inputNameClusters.add(dataFrame['Cluster'][i])
    # Get the people who are in the same cluster as inputName
    infected = set()
    for cluster in inputNameClusters:
        if cluster != -1:  # skip the noise label
            # Get all names in the cluster
            namesInCluster = dataFrame.loc[dataFrame['Cluster'] == cluster, 'User']
            for i in range(len(namesInCluster)):
                # Locate each name in the cluster
                name = namesInCluster.iloc[i]
                if name != inputName:  # don't add the input name to the results
                    infected.add(name)
    print("Potential infections are:", *infected, sep="\n")