# Plotly - Exercises
Welcome to this notebook on Plotly, an interactive, web-based plotting library for Python. It lets you create interactive visualizations that can be easily shared and published to the web.

Through a series of exercises, we will cover the basics of creating different types of plots using Plotly. The exercises will cover various topics such as time series, scatter plots, bar charts, heatmaps, and more. We will use real-world datasets to create interactive visualizations and explore the various customization options available in Plotly.

Whether you are a beginner or an experienced data scientist, this notebook will provide you with the knowledge and skills needed to create engaging and informative visualizations using Plotly. 

**So let's get started!**

#### Package imports

In [1]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

## Part 1: Machine Learning and AI Visualizations
In this part, we will focus on building visualizations commonly used in machine learning. Specifically, we will create a confusion matrix and a K-Nearest Neighbors visualization. These visualizations are important for understanding the performance of machine learning models and for interpreting their results. By the end of this part, you will have gained experience creating different types of charts and will be able to use them to evaluate and improve your own machine learning models.

### 1.1: Constructing a confusion matrix
In machine learning, a confusion matrix is a table that is used to evaluate the performance of a classification algorithm. The table contains information about actual and predicted classifications of a dataset. It is a way to measure the performance of a classification model by showing the number of correct and incorrect predictions, broken down by each class.

Confusion matrices are easiest to read for binary classification problems, where the output takes one of two possible values (e.g., true or false, positive or negative). In that case the matrix is 2×2, with four cells: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). With more than two classes, the matrix has one row and one column per class.

- **True positives (TP)** are the cases where the actual value is positive and the model correctly predicts it as positive.
- **False positives (FP)** are the cases where the actual value is negative, but the model incorrectly predicts it as positive.
- **True negatives (TN)** are the cases where the actual value is negative, and the model correctly predicts it as negative.
- **False negatives (FN)** are the cases where the actual value is positive, but the model incorrectly predicts it as negative.

By analyzing the values in the confusion matrix, one can calculate various performance metrics of the classifier such as accuracy, precision, recall, F1 score, and specificity. These metrics help to evaluate how well the model is performing and identify areas where it needs improvement.
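
The metrics listed above follow directly from the four cells. The snippet below is a minimal sketch of those formulas; the function name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common metrics from the four cells of a binary confusion matrix."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        'accuracy': (tp + tn) / total,
        'precision': precision,
        'recall': recall,
        'specificity': specificity,
        'f1': f1,
    }

# Hypothetical counts, not taken from the Titanic model
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```

Precision and recall usually pull in opposite directions, which is why the F1 score is often reported alongside accuracy.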

The **Titanic dataset** is a popular dataset in the field of machine learning. It is a classic example used to demonstrate the application of different machine learning models. The dataset contains information about the passengers who were aboard the ill-fated Titanic when it sank in 1912. The information includes the passenger's age, sex, class, fare, and whether or not they survived the disaster. The dataset has been widely used to train machine learning models to predict whether a passenger survived or not based on the other available information.

In [2]:
titanic = pd.read_csv('https://kuleuven-mda.s3.eu-central-1.amazonaws.com/titanic_train.csv')
y = titanic.pop('Survived').values
X_train, X_test, y_train, y_test = train_test_split(titanic, y, test_size=0.33, random_state=42)
X_train.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
6,7,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
718,719,3,"McEvoy, Mr. Michael",male,,0,0,36568,15.5,,Q
685,686,2,"Laroche, Mr. Joseph Philippe Lemercier",male,25.0,1,2,SC/Paris 2123,41.5792,,C
73,74,3,"Chronopoulos, Mr. Apostolos",male,26.0,1,0,2680,14.4542,,C
882,883,3,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0,0,7552,10.5167,,S


The sklearn pipelines will be used to train a basic classifier:

In [3]:
categorical_cols = ['Embarked', 'Sex', 'Pclass', 'Cabin']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
    ('pca', PCA(n_components=10))
])

numeric_cols = ['Age', 'Fare', 'SibSp', 'Parch']
numeric_transformer = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=5)),
    ('scaler', RobustScaler())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier())])
clf.fit(X_train, y_train);

To keep the example basic, we skip cross-validation and hyperparameter tuning. The fitted model is now applied to the test data:

In [4]:
X_test.loc[:, 'Prediction'] = clf.predict(X_test)
X_test.loc[:, 'Survived'] = y_test
X_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Prediction,Survived
709,710,3,"Moubarek, Master. Halim Gonios (""William George"")",male,,1,1,2661,15.2458,,C,1,1
439,440,2,"Kvillner, Mr. Johan Henrik Johannesson",male,31.0,0,0,C.A. 18723,10.5,,S,0,0
840,841,3,"Alhomaki, Mr. Ilmari Rudolf",male,20.0,0,0,SOTON/O2 3101287,7.925,,S,0,0
720,721,2,"Harper, Miss. Annie Jessie ""Nina""",female,6.0,0,1,248727,33.0,,S,1,1
39,40,3,"Nicola-Yarred, Miss. Jamila",female,14.0,1,0,2651,11.2417,,C,0,1


**Task 1**: construct a confusion matrix based on the Prediction and Survived labels. Make sure that the colors range from #caf0f8 to #03045e. Hide the legend/color scale, and provide meaningful labels and text annotations.

Tip: use the confusion_matrix function from sklearn to obtain the values of the confusion matrix.
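
When placing the values in a heatmap, it helps to know how the matrix is laid out. scikit-learn's `confusion_matrix(y_true, y_pred)` puts the true labels on the rows and the predictions on the columns. The pure-Python sketch below mimics that layout on a handful of made-up labels; `count_confusion` is a hypothetical helper, not part of scikit-learn.

```python
def count_confusion(y_true, y_pred, labels=(0, 1)):
    """Count outcomes with rows = true labels and columns = predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

# Made-up labels for illustration (0 = died, 1 = survived)
y_true = [0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1]

cm_example = count_confusion(y_true, y_pred)
(tn, fp), (fn, tp) = cm_example
print(cm_example)
print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')
```

Because `go.Heatmap` draws `z[i][j]` at position `y[i]`, `x[j]`, passing such a matrix unchanged puts the true labels on the y axis and the predictions on the x axis.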

In [None]:
cm = confusion_matrix(X_test.Survived, X_test.Prediction)

fig = go.Figure(
        data=go.Heatmap(
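            # confusion_matrix returns rows = true labels, columns = predictions.
            # Transpose so that x is the true label and y the prediction,
            # matching the text annotations and axis titles below.
            z=cm.T,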
            x=['Died', 'Survived'],
            y=['Died', 'Survived'],
            colorscale=[[0, '#caf0f8'], [1, '#03045e']],
            showscale=False,
            text=[[f'True Negative<br>{cm[0,0]}', f'False Negative<br>{cm[1,0]}'],
                  [f'False Positive<br>{cm[0,1]}', f'True Positive<br>{cm[1,1]}']],
            texttemplate="%{text}",
            textfont={"size":20}
            )
        )
fig.update_layout(
    title={
        'text': "Confusion Matrix",
        'x': 0.5
    }
)

fig.update_xaxes(title_text="True Label")
fig.update_yaxes(title_text="Prediction")

fig.show()

### 1.2: K-Nearest Neighbors Visualization
The K-Nearest Neighbors (K-NN) algorithm is a supervised machine learning algorithm that can be used for both classification and regression problems. Given a new data point, the algorithm classifies the data point by finding the K nearest data points in the training data and selecting the majority class among the K nearest neighbors. In the K-NN algorithm, the value of K is a hyperparameter that needs to be set before training the model. Choosing the right value of K is important, as a low value of K may result in overfitting while a high value of K may result in underfitting.
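
The majority vote described above can be sketched in a few lines of plain Python. This is an illustrative toy, not how scikit-learn implements `KNeighborsClassifier`; the helper `knn_predict` and the data points are hypothetical.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy data: class "B" clusters near (-1, -1), class "A" near (1, 1),
# plus one stray "A" point sitting inside the "B" cluster.
train_points = [(-1.0, -1.0), (-1.2, -0.8), (-0.8, -1.1), (-0.9, -0.9),
                (1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (-1.0, -0.9)]
train_labels = ['B', 'B', 'B', 'B', 'A', 'A', 'A', 'A']

query = (-1.0, -0.92)
print(knn_predict(train_points, train_labels, query, k=1))  # follows the stray point
print(knn_predict(train_points, train_labels, query, k=5))  # follows the cluster
```

With K = 1 the prediction is driven by a single, possibly noisy neighbor, which is the overfitting risk mentioned above; a larger K smooths that noise out.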

The K-NN algorithm is a simple and effective algorithm that can be used for a variety of applications, such as image recognition, recommender systems, and natural language processing.

#### Generating the data
In this exercise, we will explore the K-Nearest Neighbors algorithm using simulated data. The dataset consists of two bivariate normal distributions, one for each class, with different means and covariance matrices. We will generate this data using NumPy and visualize it using Plotly.

In [6]:
np.random.seed(12345)
points_1 = np.random.multivariate_normal([1, 1], [[1, 0.2], [0.2, 1]], 250)
points_2 = np.random.multivariate_normal([-1, -1], [[1, 0.5], [0.5, 1]], 250)

points_1_df = pd.DataFrame(points_1, columns=['x', 'y'])
points_1_df.loc[:, 'group'] = 'A'
points_2_df = pd.DataFrame(points_2, columns=['x', 'y'])
points_2_df.loc[:, 'group'] = 'B'

points = pd.concat([points_1_df, points_2_df])
points.sample(5, random_state=12345)

Unnamed: 0,x,y,group
22,-1.188274,-1.034579,B
244,1.203311,0.597267,A
21,-0.42593,-0.89978,B
131,-1.410641,-1.503741,B
52,1.396734,1.350864,A


#### Building the model
Given the simplicity of the model, we will not implement a pipeline.

In [7]:
clf = KNeighborsClassifier(15, weights='uniform')
y = points.pop('group')
clf.fit(points, y) # No hyperparameter tuning

**Task 1**: build a contour plot that shows the classification decision made by the nearest neighbors algorithm over the area x: [-5, 5] and y: [-5, 5]. Make sure that the colours range from '#006d77' to '#e29578' with '#edf6f9' being the midpoint.

In [None]:
xrange = np.arange(-5, 5, 0.05)
yrange = np.arange(-5, 5, 0.05)

# Evaluate the classifier on every (x, y) combination of the grid
xx, yy = np.meshgrid(xrange, yrange)
grid_data = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=['x', 'y'])
# Column 1 of predict_proba is the probability of the second class ('B')
Z = clf.predict_proba(grid_data)[:, 1]
Z = Z.reshape(xx.shape)

# Plot the figure
fig = go.Figure(data=[
    go.Contour(
        x=xrange,
        y=yrange,
        z=Z,
        colorscale=[[0, '#e29578'], [0.5, '#edf6f9'], [1, '#006d77']],
        showscale=False,
    )
])

fig.show()
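
A note on the grid layout: `go.Contour` reads `z[i][j]` as the value at `y[i]`, `x[j]`, so the rows of the score matrix must follow the y values and the columns the x values. `np.meshgrid` with its default indexing, followed by a reshape to `xx.shape`, produces exactly that layout. The pure-Python sketch below illustrates the same convention on a tiny grid; `decision_grid` and `toy_score` are hypothetical stand-ins for the grid construction and the classifier's probability output.

```python
def decision_grid(xs, ys, score):
    """Evaluate `score(x, y)` on a grid; rows follow ys, columns follow xs."""
    return [[score(x, y) for x in xs] for y in ys]

def toy_score(x, y):
    """Stand-in for a class probability: 1.0 below the line y = -x, 0.0 above it."""
    total = x + y
    if total == 0:
        return 0.5
    return 1.0 if total < 0 else 0.0

xs = [-1.0, 0.0, 1.0, 2.0]
ys = [-1.0, 0.0, 1.0]
grid = decision_grid(xs, ys, toy_score)

# One row per y value, one column per x value
print(len(grid), len(grid[0]))
for row in grid:
    print(row)
```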

**Task 2**: Overlay the contour plot with the original datapoints.

In [None]:
points = pd.concat([points_1_df, points_2_df])

for group, color in zip(points.group.unique(),['#006d77', '#e29578']):
    group_data = points[points.group == group].copy()
    fig.add_trace(
        go.Scatter(
            x=group_data['x'], 
            y=group_data['y'],
            mode='markers',
            # fillcolor only affects filled areas; marker colour is set via marker.color
            marker={'color': color, 'opacity': 0.8},
            name=f'Group: {group}'
        )
    )
fig.update_layout(yaxis_range=[-5,5], xaxis_range=[-5, 5])
fig.show()

## Part 2: Geographic Analysis
Geographic data analysis is an important aspect of data science, as it allows for the exploration and visualization of data related to geographic locations. By analyzing geographic data, data scientists can gain insights into various phenomena, such as the distribution of populations, patterns of migration, and the spread of diseases. Geographic data is also important in fields such as marketing and urban planning, where it can be used to identify and analyze trends and patterns.

The next part of this notebook focuses on building geographic visualizations using the Plotly library. These visualizations can range from simple scatterplots with markers indicating location to complex maps with different layers of data. By the end of this section, you will have gained experience creating geographic visualizations and will be able to apply this knowledge to your own projects.

#### Data
The dataset used in this notebook provides information on a sample of water access points in Liberia and Sierra Leone. The data was collected by the Water Point Data Exchange, which is a global platform for sharing data on water points. The dataset includes various attributes related to the water access points, such as their location, type, and functionality. This dataset is particularly useful for exploring the accessibility and quality of water sources in the two countries, and can provide insights into the challenges faced by communities in accessing clean water.

In [10]:
water_points = pd.read_csv('https://kuleuven-mda.s3.eu-central-1.amazonaws.com/water_points.csv')
water_points.head()

Unnamed: 0,id,lat,lon,source,tech,country
0,390008,7.943277,-11.74318,Protected Shallow Well,Hand Pump - India Mark,Sierra Leone
1,451864,9.233856,-11.19589,Surface Water (River/Stream/Lake/Pond/Dam),Hand Pump - Vergnet,Sierra Leone
2,404187,6.27001,-10.701981,Protected Shallow Well,Hand Pump - Afridev,Liberia
3,420002,8.438398,-13.158348,Protected Shallow Well,Hand Pump - Afridev,Sierra Leone
4,410166,6.976923,-9.288272,Protected Shallow Well,Hand Pump - Afridev,Liberia


In [11]:
id_ = 82198

**Task**: build the visualization as shown on the slides of the lecture on data visualization. Highlight the water access point with the id stored in the id_ variable above. The color codes are: #db9862, #58524e, #a6523c, #b29e8f, #fbf7f1, and #f9f4eb. The comments will guide you through the process.

In [12]:
# Extract the information for the id_ above and store it in a dictionary
info = water_points[water_points.id == id_].to_dict(orient='records')[0]

# Initialize the figure, but this time, make use of the make_subplots function of the plotly.subplots interface
# Specify the following parameters:
# - Number of rows: 2
# - Number of columns: 2
# - Column widths: 0.7, 0.3
# - Explore the specs parameter and provide it with the correct values such that it allows both scattergeo and xy plots.

fig = make_subplots(rows=2, cols=2, 
                    specs=[[{'type': 'scattergeo'},{'type': 'xy'}],
                           [{'type': 'scattergeo'},{'type': 'xy'}]],
                   column_widths=[0.7, 0.3],
                   subplot_titles=("", "", "", "Water source"))

# Add the general plot that highlights the locations of water access points using the Scattergeo object
# the marker color is equal to  "#db9862" and the opacity was set to 0.5
# give the geo attribute the value 'geo'
# put the figure on row 1 column 1
fig.add_trace(
    go.Scattergeo(
        lon = water_points['lon'],
        lat = water_points['lat'],
        hoverinfo= 'none',
        marker=dict(color="#db9862", opacity=0.5),
        geo='geo'
    ),
    row=1, col=1
)


# Add the second map that shows the location of the two countries.
# Explore the examples in the plotly documentation to achieve this
# set the geo attribute equal to 'geo2' and put the map in the second row, first column
fig.add_trace(go.Choropleth(
        locationmode = 'country names',
        locations = water_points['country'],
        text = water_points['country'],
        hoverinfo= 'none',
        z = [1]*len(water_points),
        colorscale = [[0,'#58524e'],[1,'#58524e']],
        autocolorscale = False,
        showscale = False,
        geo = 'geo2',
    ),
    row=2, col=1
)

# Create the highlighted point:
# tip: for size, opacity in zip([10, 25, 50, 100], [1, 0.5, 0.25, 0.05]):
#          fig.add_trace(...)
# color of the marker is #a6523c
for size, opacity in zip([10, 25, 50, 100], [1, 0.5, 0.25, 0.05]):
    fig.add_trace(go.Scattergeo(
            lon = [info['lon']],
            lat = [info['lat']],
            name = 'Highlighted point',
            hoverinfo= 'none',
            marker = dict(
                size = size,
                color = '#a6523c',
                opacity = opacity
            )
        ),
        row=1, col=1
    )

# Create the bar chart making sure that the water source for the highlighted point
# is in the color '#a6523c' and the other sources are in the color '#b29e8f'
# Put this in row 2, column 2
# Count each source once, then iterate over (source, count) pairs
source_counts = water_points.source.value_counts().sort_values(ascending=True)
for source, number in source_counts.items():
    color = '#a6523c' if source == info['source'] else '#b29e8f'
    fig.add_trace(
        go.Bar(
            x=[number], 
            y=[source],
            text=[number],
            orientation='h',
            marker=dict(color=color),
            hoverinfo='none'
        ),
        row=2, col=2
    )

# Add the text annotations: just use the fig.add_annotation() method
fig.add_annotation(
    text="Water Access Points: Liberia and Sierra Leone",
    xref="paper", yref="paper",
    font=dict(family='Times New Roman', size=15),
    x=0.55, y=1, showarrow=False, xanchor='left', yanchor='top')

fig.add_annotation(
    text=f"""
    Liberia and Sierra Leone are two West African countries that face significant challenges<br>
    in accessing clean water. Both countries have low access rates to safe drinking water,<br>
    particularly in rural areas, and limited infrastructure for water supply and sanitation<br>
    services. This has resulted in high rates of waterborne diseases and health problems,<br>
    particularly among vulnerable populations.<br>
    <br>
    <b>Selected water access point</b><br>
    Identifier: {info['id']}<br>
    Water source: {info['source']}<br>
    Technology: {info['tech']}<br>
    Country: {info['country']}
    
    """,
    xref="paper", yref="paper",
    align="left",
    font=dict(family='Times New Roman', size=10),
    x=0.512, y=0.93, showarrow=False, xanchor='left', yanchor='top')

# Update the layout (difficult step)
# - Make sure the legend is hidden
# - Make the geo and geo2 layouts based on the examples provided in the example gallery
#     GEO 1:
#       - country color: #e7d3bb, landcolor: #fbf7f1, bgcolor: rgba(255, 255, 255, 0.0)
#       - Other specs should be inferred from picture shown during lecture
#     GEO 2:
#       - landcolor: #f9f4eb
#       - Other specs should be inferred from picture shown during lecture
# - use the simple_white template
# - Set the font equal to Times New Roman
fig.update_layout(
    showlegend=False,
    geo = go.layout.Geo(
        scope = 'africa',
        countrycolor = "#e7d3bb",
        landcolor='#fbf7f1',
        showframe = False,
        showcountries = True,
        lonaxis_range= [-15, -6],
        lataxis_range= [2, 10],
        domain = dict(x = [ 0, 0.5 ], y = [ 0, 1]),
        bgcolor = 'rgba(255, 255, 255, 0.0)',
    ),
    geo2 = go.layout.Geo(
        scope = 'africa',
        showframe = False,
        landcolor = "#f9f4eb",
        showcountries = False,
        domain = dict(x = [ 0, 0.2 ], y = [ 0, 0.7]),
        bgcolor = 'rgba(255, 255, 255, 0.0)',
    ),
    template='simple_white',
    font=dict(
        family="Times New Roman"
    )
)

# Ensure that the x axis of the barplot has the correct label 'Number of Access Points'
fig.update_xaxes(title_text="Number of Access Points", title_font=dict(size=10), row=2, col=2)

fig.show()
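
As a possible alternative to adding one bar trace per water source, the labels, counts, and colours can be prepared up front and drawn in a single trace. The sketch below shows only the preparation step in plain Python; `bar_series` is a hypothetical helper, and the sample values stand in for the `source` column rather than reproducing the real data.

```python
from collections import Counter

HIGHLIGHT, MUTED = '#a6523c', '#b29e8f'

def bar_series(sources, selected_source):
    """Return labels, counts and colours sorted by ascending count."""
    counts = Counter(sources)
    ordered = sorted(counts.items(), key=lambda item: item[1])
    labels = [label for label, _ in ordered]
    values = [value for _, value in ordered]
    # Only the selected source gets the highlight colour
    colors = [HIGHLIGHT if label == selected_source else MUTED
              for label in labels]
    return labels, values, colors

# Hypothetical sample standing in for water_points['source']
sample_sources = (['Protected Shallow Well'] * 3
                  + ['Borehole'] * 2
                  + ['Surface Water'])
labels, values, colors = bar_series(sample_sources, 'Borehole')
print(labels)
print(values)
print(colors)
```

These lists could then be passed to one `go.Bar(x=values, y=labels, orientation='h', marker=dict(color=colors))` trace, since `marker.color` accepts one colour per bar.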