# Data Analytics (HX)

We will be using an open-source dataset of Pokemon Go spawns in the San Francisco Bay Area, from [Kaggle](https://www.kaggle.com/datasets/kveykva/sf-bay-area-pokemon-go-spawns).

Using this dataset, we will look for insights and build a comprehensive description of the data by investigating how pokemon are distributed across various factors (name, type, time, etc.).

## Extracting the dataset from Google Drive

Run these cells to download the dataset!

In [None]:
# Download the csv file from google drive into the sample_data folder
! wget "https://drive.google.com/u/0/uc?id=1NGvsf9pwkXW_3G6xM7ar-OhnviYNjFKJ&export=download" -O "pokemon-spawns.csv" 
! mv pokemon-spawns.csv /content/sample_data/

## Data Pre-Processing
1. Read the csv file we have using the Pandas library

2. Display the dataframe obtained

3. Optimise memory usage

4. Filter unnecessary information, noisy data and perform miscellaneous data appending or slicing
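
As a rough illustration of step 3, the sketch below measures how much memory a dataframe uses before and after switching to smaller dtypes. The toy dataframe and its values are made up for this example; only the dtype choices (`uint16`, `category`, `float32`) mirror the ones used later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the spawn data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "num": rng.integers(1, 152, size=10_000),
    "name": rng.choice(["Pidgey", "Zubat", "Rattata"], size=10_000),
    "lat": rng.uniform(37.0, 38.0, size=10_000),
})

# deep=True also counts the Python strings held by text columns
before = toy.memory_usage(deep=True).sum()

# Smaller integer type, categorical labels, single-precision floats
slim = toy.astype({"num": "uint16", "name": "category", "lat": "float32"})
after = slim.memory_usage(deep=True).sum()

print(f"before: {before:,} bytes, after: {after:,} bytes")
```

Note that `float32` keeps only about 7 significant digits, so coordinates lose some precision in exchange for the smaller footprint.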

In [None]:
# Converting the csv file into a pandas dataframe
import pandas as pd
import numpy as np

raw_df = pd.read_csv('sample_data/pokemon-spawns.csv')

In [None]:
# Sampling random indexes of the dataframe to see the kind of data we are dealing with
print(raw_df.shape)
raw_df.sample(5)

In [None]:
# Retrieving all Pidgey spawns
raw_df[raw_df["name"] == "Pidgey"]

In [None]:
# Checking for weird data with invalid encounter_ms
raw_df[raw_df["encounter_ms"] <= 0]

In [None]:
# Column rename and checking for weird data with invalid disappear_ms
raw_df.rename(columns={"disppear_ms":"disappear_ms"}, inplace=True)
raw_df[raw_df["disappear_ms"] <= 0]

In [None]:
# Retrieving all Nidoran ♀ spawns
raw_df[(raw_df['num'] == 29)]

In [None]:
# Retrieving all Nidoran ♂ spawns
raw_df[(raw_df['num'] == 32)]

Interesting Points to note:

1. There exist data points with encounter_ms = -1, but there are no data points with disappear_ms = -1.

2. The encounter_ms timestamps are later than the disappear_ms timestamps.

3. There are 2 different name labels for Nidoran (e.g. Nidoran♂ and Nidoran (m)).

4. There is a lack of information regarding pokemon types.

Assumptions / Conclusions:

1. Data points with encounter_ms = -1 indicate pokemon that weren't encountered by Pokemon Trainers (i.e. left to despawn).

2. The data is probably mislabelled: disappear_ms should be spawn_ms instead.

3. Duplicate labels for Nidoran
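
A minimal sketch of how conclusions 1 and 3 could be turned into cleaning steps, shown on a made-up four-row dataframe. Picking the first label seen for each Pokedex number is an assumption of this sketch, not the approach the notebook ends up using.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that reproduce the two problems described above
toy = pd.DataFrame({
    "num": [29, 29, 32, 16],
    "name": ["Nidoran♀", "Nidoran (f)", "Nidoran♂", "Pidgey"],
    "encounter_ms": [1469520187732, -1, 1469520297172, -1],
})

# Conclusion 1: -1 is a sentinel for "never encountered", so treat it as missing
toy["encounter_ms"] = toy["encounter_ms"].replace(-1, np.nan)

# Conclusion 3: keep one label per Pokedex number (here, the first one seen)
canonical = toy.groupby("num")["name"].first()
toy["name"] = toy["num"].map(canonical)

print(toy)
```

Keying on `num` rather than on the label text means any other spelling variants would be unified as well.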

In [None]:
# Relabelling encounter_ms = -1 as missing (assign back instead of replacing in place on a column)
raw_df['encounter_ms'] = raw_df['encounter_ms'].replace(-1, np.nan)
raw_df

In [None]:
# Relabelling to spawn_ms
raw_df.rename(columns={"disappear_ms":"spawn_ms"}, inplace=True)
raw_df.head()

In [None]:
# Fixing the Nidoran issue
raw_df['name'] = raw_df['name'].replace('Nidoran (f)', 'Nidoran♀')
raw_df[(raw_df['num'] == 29)]

In [None]:
# A more foolproof way of fixing the Nidoran issue + Add Pokemon Types
type_df = pd.read_csv("https://gist.githubusercontent.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6/raw/92200bc0a673d5ce2110aaad4544ed6c4010f687/pokemon.csv")
type_df

In [None]:
# Filtering the type df to get the information that we need
type_df = type_df.loc[type_df['Generation'] == 1, ['#','Name','Type 1','Type 2']].drop_duplicates(subset=['#'])
type_df.head()

In [None]:
# Adding types to the dataframe, and rearranging the dataframe columns
raw_df = raw_df.merge(type_df,how='left',left_on='num',right_on='#').drop(columns=['#','name'])
cols = list(raw_df)
cols = cols[:3] + cols[-3:] + cols[3:7]
raw_df = raw_df[cols]
raw_df
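
The `drop_duplicates(subset=['#'])` call suggests that the type table can list the same Pokedex number more than once. The sketch below uses made-up rows to show why that matters for a left merge: without de-duplication, every matching spawn row would be repeated. The `validate="many_to_one"` argument is an addition of this sketch that makes pandas raise an error if the right-hand keys are not unique.

```python
import pandas as pd

# Hypothetical spawn rows and type rows (illustrative only)
spawns = pd.DataFrame({
    "num": [3, 3, 16],
    "name": ["Venusaur", "Venusaur", "Pidgey"],
})
types = pd.DataFrame({
    "#": [3, 3, 16],
    "Name": ["Venusaur", "VenusaurMega Venusaur", "Pidgey"],
    "Type 1": ["Grass", "Grass", "Normal"],
    "Type 2": ["Poison", "Poison", "Flying"],
})

# Keep the first row for each Pokedex number
types = types.drop_duplicates(subset=["#"])

# Left merge keeps every spawn row; validate checks the key is unique on the right
merged = spawns.merge(types, how="left", left_on="num", right_on="#",
                      validate="many_to_one")
merged = merged.drop(columns=["#", "name"])

print(merged)
```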

In [None]:
# Lowercase column names for consistency
raw_df.columns = raw_df.columns.str.lower()
raw_df.head()

In [None]:
# Information about the datatype used for every column
raw_df.info()

More information on S2 Cells:
https://pokemongohub.net/post/article/comprehensive-guide-s2-cells-pokemon-go/amp/

Pokemon Pokedex:
https://www.pokemon.com/us/pokedex/

In [None]:
# Dropping information about S2 Cells
raw_df.drop(columns = ["s2_id","s2_token"], inplace=True)
raw_df.head()

In [None]:
# Optimising data storage
raw_df['num'] = raw_df['num'].astype('uint16')
raw_df['name'] = raw_df['name'].astype('category')
raw_df['type 1'] = raw_df['type 1'].astype('category')
raw_df['type 2'] = raw_df['type 2'].astype('category')
raw_df['lat'] = raw_df['lat'].astype('float32')
raw_df['lng'] = raw_df['lng'].astype('float32')
raw_df['encounter_ms'] = pd.to_datetime(raw_df['encounter_ms'], unit='ms')
raw_df['spawn_ms'] = pd.to_datetime(raw_df['spawn_ms'], unit='ms')

In [None]:
raw_df.info()

In [None]:
raw_df.sample(5)

In [None]:
raw_df.describe()

In [None]:
df = raw_df.copy()

## Google Colab Data Tables

Colab includes an extension that renders pandas dataframes into interactive displays that can be filtered, sorted, and explored dynamically.

Here, we will be using this built-in Google Colab feature to further filter and refine our analysis.

In [None]:
from google.colab import data_table
data_table.enable_dataframe_formatter()

In [None]:
# Taking another look at our processed dataframe
df

In [None]:
# Sampling 20000 random rows from our dataframe and analysing them using Colab Tables
df.sample(20000)

In [None]:
# Challenge 1: Look for any anomalous data not found near the San Francisco region; determine the location of these anomalous points

In [None]:
data_table.disable_dataframe_formatter()

## Seaborn & Matplotlib

Seaborn and Matplotlib are 2 common libraries used to visualise dataframes in a graphical or chart format. 

Both libraries are sufficient for most use cases. 

Here, we will be deriving various useful insights from our data using visualisation, such as the most common & least common pokemon spawns in the San Francisco Bay Area.
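
The type-count cells in this section add the counts of two columns together. As a small sketch of that idea, the example below uses a made-up four-row dataframe: `Series.add` with `fill_value=0` lines the two counts up by label, so a type that appears in only one column is still counted.

```python
import pandas as pd

# Hypothetical rows; None marks a pokemon with no secondary type
toy = pd.DataFrame({
    "type 1": ["Normal", "Poison", "Bug", "Normal"],
    "type 2": ["Flying", "Flying", "Poison", None],
})

primary = toy["type 1"].value_counts()
secondary = toy["type 2"].value_counts()  # missing values are skipped by default

# Align on the type label; labels absent from one side count as 0 there
counts = primary.add(secondary, fill_value=0)

# Convert raw counts into percentages of all type occurrences
share = 100 * counts / counts.sum()

print(counts.sort_values(ascending=False))
print(share.round(1))
```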

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Taking another look at our processed dataframe
df

In [None]:
# Count plot of all pokemon
df.num.value_counts().plot(kind = 'bar')

In [None]:
# Count plot of the 10 most common pokemon
df.name.value_counts().sort_values()[-10:].plot(kind = 'bar')

In [None]:
# Challenge 2: Count plot of the 10 least common pokemon

In [None]:
# Count plot of different types
counts = df['type 1'].value_counts().add(df['type 2'].value_counts(), fill_value = 0)
counts.plot(kind='bar')

In [None]:
# Normalized Probabilities Plot
(100*counts/counts.sum()).plot(kind='bar')

In [None]:
# Spawn duration in minutes (only for pokemon that were encountered)
encountered = df[df['encounter_ms'].notnull()]
spawn = encountered['encounter_ms'] - encountered['spawn_ms']
sns.histplot(spawn.dt.total_seconds() / 60)

In [None]:
sns.histplot(data=df,x='encounter_ms')

In [None]:
# Scatterplot of encounter times of different pokemon
sns.scatterplot(data=df,x='encounter_ms',y='num')

In [None]:
# Scatterplot of all spawn locations
df.plot(kind='scatter', x='lng', y='lat')

In [None]:
# Selecting only points at San Francisco
sanfrancisco_df = df[(df['lng'] <= -121.674) & (df['lat'] >= 37.196)]
sanfrancisco_df.plot(kind = 'scatter', x = 'lng', y='lat')

In [None]:
sanfrancisco_df.plot(kind='hexbin', x='lng', y='lat', gridsize=20, colormap='binary')

In [None]:
# Using Seaborn to generate a hexbin jointplot
sns.jointplot(x=sanfrancisco_df['lng'],y=sanfrancisco_df['lat'], kind='hex', joint_kws={'gridsize':20})
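
The selection used for `sanfrancisco_df` bounds the data on two sides only. As a hedged variation, the sketch below applies a full bounding box to made-up points. The lower latitude and upper longitude bounds are the ones used above; the other two bounds are hypothetical values chosen for this example.

```python
import pandas as pd

# Hypothetical points: two inside the Bay Area, one far away
toy = pd.DataFrame({
    "label": ["city centre", "south bay", "far-away outlier"],
    "lat": [37.77, 37.34, 0.0],
    "lng": [-122.42, -121.89, 0.0],
})

# 37.196 and -121.674 come from the notebook; 38.5 and -123.2 are assumed
LAT_MIN, LAT_MAX = 37.196, 38.5
LNG_MIN, LNG_MAX = -123.2, -121.674

# between() is inclusive on both ends by default
in_box = toy["lat"].between(LAT_MIN, LAT_MAX) & toy["lng"].between(LNG_MIN, LNG_MAX)

inside = toy[in_box]
outside = toy[~in_box]  # rows worth inspecting as possible anomalies

print(outside)
```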

## Plot.ly
One caveat of both Seaborn & Matplotlib is their lack of interactivity (the ability to zoom in on graphs, pan graph windows, and display information about data points when hovering over them). Plotly provides a viable alternative with relatively high performance.

In [None]:
import plotly.express as px

In [None]:
# Taking another look at our processed dataframe
df

In [None]:
# Plotting a count plot of pokemon types using plotly
fig = px.bar(x=counts.index, y=counts, labels={'y':'Counts','x':'Pokemon Types'}, color=counts.index, title='Histogram of Pokemon Types')
fig.show()

In [None]:
# Plotly Histogram
fig = px.histogram(df,x='encounter_ms')
fig.show()

In [None]:
# Interactive scatterplot with plotly
fig = px.scatter(df.sample(30000), x="encounter_ms", y="num")
fig.show()

In [None]:
fig = px.scatter(sanfrancisco_df.sample(30000),x='lng',y='lat',hover_name='name',color='num',hover_data=['spawn_ms'])
fig.show()

## Kepler.gl

Plotting geospatial data with Matplotlib is viable using a traditional scatterplot or a hexbin plot, but there are far more efficient and useful tools engineered specifically to handle geospatial datasets.

Kepler.gl is a powerful open-source geospatial analysis tool for large-scale data sets that offers both visual appeal and a broad range of functionality, including interactivity, filters and different map types.

For more information, visit their [GitHub](https://github.com/keplergl/kepler.gl)

Here, we will be importing our dataset into keplergl, and exploring the various functionality within this tool.

In [None]:
! pip install keplergl

In [None]:
from keplergl import KeplerGl
# to view Kepler GL output
from google.colab import output
output.enable_custom_widget_manager()

In [None]:
# Drop unnecessary data to avoid bloating
spawn_map_df = df.drop(columns=['encounter_ms','type 1','type 2'])

In [None]:
spawn_map_df.head()

In [None]:
# Displaying a map of all spawns
spawn_map = KeplerGl(height=400)
spawn_map.add_data(data=spawn_map_df.sample(30000), name="Spawn Map Data")
spawn_map

In [None]:
# Drop unnecessary data to avoid bloating
unencountered_map_df = df[df['encounter_ms'].isnull()].drop(columns=['encounter_ms'])
print(unencountered_map_df.shape)
unencountered_map_df

In [None]:
# Challenge 3: Displaying a map of all unencountered pokemon

In [None]:
# Using the kepler.gl webapp
unencountered_map_df.to_csv('unencountered_map.csv')
spawn_map_df.sample(30000).to_csv('spawn_map.csv')

## Insights gained:
1. Anomalous data found in the **Tokyo** region
2. Top 10 most common pokemon: **Pidgey, Zubat, Rattata, Spearow, Weedle, Doduo, Ekans, Paras, Eevee, Magikarp**
3. Top 10 least common pokemon: **Omastar, Muk, Gyarados, Vaporeon, Cloyster, Machamp, Vileplume, Dewgong, Victreebel, Alakazam**
4. Most common spawn types: **Normal, Flying, Poison**
5. Spawn Duration: **15 mins**
6. Dataset clustered on **26 July 2016**, encounters peak at **18:20**
7. **No underlying hotspots** for unencountered pokemon
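
Insight 5 comes from subtracting the two timestamp columns. A minimal sketch of that calculation on made-up millisecond timestamps is shown below; the values are hypothetical and were chosen to sit exactly 15 minutes apart.

```python
import pandas as pd

# Hypothetical epoch timestamps in milliseconds; None marks an unencountered pokemon
toy = pd.DataFrame({
    "spawn_ms": [1469520000000, 1469520600000, 1469521200000],
    "encounter_ms": [1469520900000, 1469521500000, None],
})

spawn = pd.to_datetime(toy["spawn_ms"], unit="ms")
encounter = pd.to_datetime(toy["encounter_ms"], unit="ms")  # missing value becomes NaT

# Timedelta -> seconds -> minutes; rows with a missing encounter stay NaN
minutes = (encounter - spawn).dt.total_seconds() / 60

print(minutes.dropna().median())
```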

# Machine Learning (HC)


## [Netron.app](https://netron.app/)

Netron allows you to view your neural network, deep learning and machine learning models.

You can use it with most of the commonly used machine learning libraries such as PyTorch, TensorFlow and scikit-learn.

For more information, visit their [GitHub](https://github.com/lutzroeder/netron)

Here, we will be building a simple [CNN for MNIST](https://www.tensorflow.org/tutorials/images/cnn) with TensorFlow so that we can analyse the neural network, then we will import a more developed model to see the full capabilities of this tool. [YoloV5](https://github.com/ultralytics/yolov5) is one such model; the cells below load a pre-trained ResNet.


In [None]:
!pip install -q netron
import netron
import portpicker
from google.colab import output

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [None]:
x_train = x_train.reshape(x_train.shape[0], x_train.shape[1], x_train.shape[2], 1)
x_train = x_train/255

y_train = tf.one_hot(y_train.astype(np.int32), depth=10)

In [None]:
plt.imshow(x_train[123][:,:,0], cmap="gray")

In [None]:
batch_size = 64
num_classes = 10
epochs = 2

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, (5,5), padding='same', activation='relu', input_shape=(28,28,1)),
    tf.keras.layers.Conv2D(32, (5,5), padding='same', activation='relu'),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Conv2D(64, (3,3), padding='same', activation='relu'),
    tf.keras.layers.Conv2D(64, (3,3), padding='same', activation='relu'),
    tf.keras.layers.MaxPool2D(strides=(2,2)),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])

In [None]:
model.compile(optimizer=tf.keras.optimizers.RMSprop(epsilon=1e-08), loss='categorical_crossentropy', metrics=['acc'])

In [None]:
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_split=0.1)

In [None]:
model.save('mnist_model.h5')

In [None]:
model.summary()

In [None]:
port = portpicker.pick_unused_port()
with output.temporary():
  netron.start('mnist_model.h5', address=("0.0.0.0", port), browse=False)

output.serve_kernel_port_as_iframe(port, height='500')

In [None]:
!curl --output resnet.h5 -L https://github.com/Hyperparticle/one-pixel-attack-keras/raw/master/networks/models/resnet.h5

In [None]:
resnet = tf.keras.models.load_model('resnet.h5')

In [None]:
resnet.summary()

In [None]:
port = portpicker.pick_unused_port()
with output.temporary():
  netron.start('resnet.h5', address=("0.0.0.0", port), browse=False)

output.serve_kernel_port_as_iframe(port, height='500')

## [ANN Visualizer](https://github.com/RedaOps/ann-visualizer)

ANN Visualizer is more useful if you're using a normal python script instead of running in Google Colab.

It's easy to use, with basically no configuration needed.

In [None]:
!pip install ann_visualizer

In [None]:
from ann_visualizer.visualize import ann_viz

ann_viz(model, title="CNN", view=True, filename="visualized")

## Manual Visualization

In [None]:
layer_names = [layer.name for layer in resnet.layers]

In [None]:
layer_outputs = [layer.output for layer in resnet.layers]

In [None]:
feature_map_model = tf.keras.models.Model(resnet.input, layer_outputs)

In [None]:
!wget https://live.staticflickr.com/65535/49852239266_6d2486b7e3_k_d.jpg -O sample.jpg

In [None]:
from tensorflow.keras.utils import load_img, img_to_array
image = load_img('./sample.jpg')

In [None]:
plt.imshow(image)

In [None]:
image = load_img('./sample.jpg', target_size=(32,32))
plt.imshow(image)
im = img_to_array(image)
im = im.reshape((1,) + im.shape)
im /= 255.

In [None]:
feature_maps = feature_map_model.predict(im)

In [None]:
count = 0
for layer_name, feature_map in zip(layer_names, feature_maps):
  if count >= 10:
    break
  if len(feature_map.shape) == 4:

    n_features = feature_map.shape[-1]  # number of features in the feature map
    size       = feature_map.shape[ 1]  # feature map shape (1, size, size, n_features)

    display_grid = np.zeros((size, size * n_features))

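    for i in range(n_features):
      # Copy the channel so normalising does not modify feature_map in place
      x  = feature_map[0, :, :, i].copy()
      x -= x.mean()
      x /= (x.std() + 1e-5)  # small epsilon avoids dividing by zero on flat channels
      x *=  64
      x += 128
      x  = np.clip(x, 0, 255).astype('uint8')
      # Tile each filter into a horizontal grid
      display_grid[:, i * size : (i + 1) * size] = x
    # Display the grid
    scale = 20. / n_features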
    plt.figure( figsize=(scale * n_features, scale) )
    plt.title ( layer_name )
    plt.grid  ( False )
    plt.imshow( display_grid, aspect='auto', cmap='viridis' )
    count += 1
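
The per-channel scaling in the loop above can be tried out without a trained model. The sketch below runs the same idea on a random array standing in for one layer's output. The helper name `to_display` and the epsilon value are assumptions of this sketch.

```python
import numpy as np

def to_display(channel, eps=1e-5):
    """Standardise one 2-D channel and map it into the 0..255 range."""
    x = channel.astype("float32")  # astype returns a copy, so the input is untouched
    x = (x - x.mean()) / (x.std() + eps)  # zero mean, roughly unit variance
    x = x * 64 + 128  # centre on mid-grey, spread the contrast
    return np.clip(x, 0, 255).astype("uint8")

# Random stand-in for one layer's output: (batch, height, width, channels)
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(1, 8, 8, 4)).astype("float32")

# One tile per channel, laid side by side like display_grid in the loop above
tiles = [to_display(feature_map[0, :, :, i]) for i in range(feature_map.shape[-1])]
display_grid = np.hstack(tiles)

print(display_grid.shape, display_grid.dtype)
```

Centring on 128 with a spread of 64 per standard deviation means values within about two standard deviations of the mean survive the clip to 0..255.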

## [Tensorboard](https://www.tensorflow.org/tensorboard)

Tensorboard is much better known than the other tools we've introduced. Nevertheless, it is such a useful resource for visualising machine learning models that we've decided to include it here.

Mastering the use of Tensorboard will make your Machine Learning journey much easier, especially if you find it hard to visualise models in your mind.

In [None]:
%load_ext tensorboard
import datetime

In [None]:
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

In [None]:
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape(x_train.shape[0], x_train.shape[1], x_train.shape[2], 1)
x_train = x_train/255

y_train = tf.one_hot(y_train.astype(np.int32), depth=10)

batch_size = 256 # Changed this to make it go faster
num_classes = 10
epochs = 5

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, (5,5), padding='same', activation='relu', input_shape=(28,28,1)),
    tf.keras.layers.Conv2D(32, (5,5), padding='same', activation='relu'),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Conv2D(64, (3,3), padding='same', activation='relu'),
    tf.keras.layers.Conv2D(64, (3,3), padding='same', activation='relu'),
    tf.keras.layers.MaxPool2D(strides=(2,2)),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])

model.compile(optimizer=tf.keras.optimizers.RMSprop(epsilon=1e-08), loss='categorical_crossentropy', metrics=['acc'])

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_split=0.1,
                    callbacks=[tensorboard_callback] # New addition
                    )

In [None]:
!ls

In [None]:
%tensorboard --logdir logs/fit

### Tensorboard with PyTorch

In [None]:
import torch
import torchvision
from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets, transforms

# Writer will output to ./runs/ directory by default
writer = SummaryWriter()

x = torch.arange(-5, 5, 0.1).view(-1, 1)
y = -5 * x + 0.1 * torch.randn(x.size())

model = torch.nn.Linear(1, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr = 0.1)

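def train_model(n_epochs):  # renamed from `iter` to avoid shadowing the built-in
    for epoch in range(n_epochs):
        y1 = model(x)
        loss = criterion(y1, y)
        # Log the loss as a plain Python float
        writer.add_scalar("Loss/train", loss.item(), epoch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()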

train_model(10)
writer.flush()

writer.close()

In [None]:
%tensorboard --logdir runs

In [None]:
!ls