# **STUDENT AI** - PARALLEL PLOTS

## Objectives

Create interactive parallel categories plots to embed in the Streamlit dashboard. These visualize how the features relate to the selected target variables.

## Inputs

standard dataset ... Numerical variables will need to be 'boxed' into discrete buckets to better visualize the relationships.

## Outputs

Interactive parallel plots for later user display, saved as HTML files.


---

# Import required libraries

In [None]:
import os
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
from feature_engine.discretisation import ArbitraryDiscretiser
%matplotlib inline

print('All Libraries Loaded')

# Change working directory

### Set the working directory to notebook parent folder
If the output does not match, click **'clear all outputs'** and then **'restart'** the notebook. 
Then run cells from top to bottom.

In [None]:
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print('If correct, Active Directory should read: /workspace/student-AI')
print(f"Active Directory: {current_dir}")
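
The cell above moves up one folder each time it runs, which is why a restart is needed after a second run. A minimal sketch of a repeatable alternative is shown here, assuming the project root can be recognised by its `outputs` folder. The helper name `find_project_root` and the `jupyter_notebooks` folder name are hypothetical and used only for illustration.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "outputs") -> Path:
    """Walk upward from `start` until a folder containing `marker/` is found."""
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains '{marker}/'")


# Demo on a throwaway directory tree (not the real project layout)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "outputs").mkdir()
    notebooks = root / "jupyter_notebooks"  # hypothetical notebook folder
    notebooks.mkdir()
    found_from_child = find_project_root(notebooks) == root
    found_from_root = find_project_root(root) == root

print(found_from_child, found_from_root)
```

In the notebook this could be used as `os.chdir(find_project_root(Path.cwd()))`, which gives the same result however many times the cell is run.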

### Load cleaned dataset

In [None]:
df = pd.read_csv("outputs/dataset/Expanded_data_with_more_features_clean.csv")
df.head()

### Significant Feature Variables
Based on the previous analysis, I determined that the features 'ParentMaritalStatus', 'PracticeSport', 'IsFirstChild', 'NrSiblings' and 'WklyStudyHours' have little to no influence on the student performance prediction. For a cleaner visualization, these will be dropped.

In [None]:
columns_to_drop = ['ParentMaritalStatus', 'PracticeSport', 'IsFirstChild', 'NrSiblings', 'WklyStudyHours']
df_dropped = df.drop(columns=columns_to_drop)
df_dropped.head()

### Defining numerical variable bins.

To better visualize the data, the continuous numerical scores should be 'discretized' into individual bins that group student performance. This would normally be based on client wishes, or on the grading bands already established in the educational facility. As a baseline, I will use the mean of 68 with a standard deviation of 14 (determined in Notebook 3). Based on these values, I will group student performance into:

* Exceptional   > 96% (more than 2 SD above the mean)
* Above Average 82% - 96% (between 1 and 2 SD above the mean)
* Average       54% - 82% (within 1 SD of the mean)
* Below Average 40% - 54% (between 1 and 2 SD below the mean)
* Failed        < 40% (more than 2 SD below the mean)
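
The band edges follow directly from the mean and standard deviation. The sketch below works through that arithmetic and applies the edges with `pd.cut`, assuming right-inclusive bins (the pandas default), so a score that lands exactly on an edge falls into the lower band. The sample scores are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

mean, std = 68, 14

# Edges: 68 - 28 = 40, 68 - 14 = 54, 68 + 14 = 82, 68 + 28 = 96
edges = [-np.inf, mean - 2 * std, mean - std, mean + std, mean + 2 * std, np.inf]
labels = ["Failed", "Below Average", "Average", "Above Average", "Exceptional"]

# Illustrative scores chosen to sit on and around each edge
sample_scores = pd.Series([35, 40, 41, 54, 68, 82, 83, 96, 97])
binned = pd.cut(sample_scores, bins=edges, labels=labels)

print(pd.DataFrame({"score": sample_scores, "band": binned}))
```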



In [None]:
# from previous notebook
mean = 68  
std = 14   

# define scores_map based on Mean and SD
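# np.inf is used because the np.Inf alias was removed in NumPy 2.0
scores_map = [
    -np.inf,
    mean - 2 * std,  # upper edge of Failed
    mean - std,      # upper edge of Below Average
    mean + std,      # upper edge of Average
    mean + 2 * std,  # upper edge of Above Average
    np.inf           # Exceptional is everything above
]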
discretiser = ArbitraryDiscretiser(binning_dict={
    'MathScore': scores_map,
    'ReadingScore': scores_map,
    'WritingScore': scores_map
    })
df_parallel = discretiser.fit_transform(df_dropped)
df_parallel.head()

### Reorder Columns
Reorder the columns for a clearer diagram (binary categories first, then those that split into more values)

In [None]:
df_parallel = df_parallel[['Gender', 'LunchType', 'TestPrep', 'EthnicGroup', 'ParentEduc', 'MathScore', 'ReadingScore', 'WritingScore']]
df_parallel.head()
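
The column order above is written by hand. As a sketch of the same idea, the order can be derived from the number of distinct values in each feature, with the target columns kept at the end. The toy frame below uses made-up placeholder values and only a few of the notebook's column names.

```python
import pandas as pd

# Toy stand-in for df_parallel (placeholder values, for illustration only)
toy = pd.DataFrame({
    "ParentEduc": ["a", "b", "c", "d"],
    "Gender": ["x", "y", "x", "y"],
    "EthnicGroup": ["g1", "g2", "g3", "g1"],
    "MathScore": [0, 2, 2, 4],
})

targets = ["MathScore"]
features = [col for col in toy.columns if col not in targets]

# sorted() is stable, so features with equal cardinality keep their original order
ordered = sorted(features, key=lambda col: toy[col].nunique()) + targets
toy = toy[ordered]
print(ordered)
```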

### Create Labels and Custom Colors for Plots

In [None]:
labels_map = {
    0: "Failed",
    1: "Below Average",
    2: "Average",
    3: "Above Average",
    4: "Exceptional"
}

color_scale = [
    (0.00, "red"),          # Failed
    (0.25, "orange"),       # Below Average
    (0.50, "yellow"),       # Average
    (0.75, "lightgreen"),   # Above Average
    (1.00, "green")         # Exceptional
]
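
The stop positions in `color_scale` are the bin codes rescaled to the 0 to 1 range (code divided by the highest code). The sketch below rebuilds the same list from `labels_map` to make that link explicit. The `colors` list is an assumption introduced here so the names can be zipped with the codes.

```python
labels_map = {
    0: "Failed",
    1: "Below Average",
    2: "Average",
    3: "Above Average",
    4: "Exceptional",
}
colors = ["red", "orange", "yellow", "lightgreen", "green"]  # one color per code

max_code = max(labels_map)  # 4
color_scale = [
    (code / max_code, color)
    for code, color in zip(sorted(labels_map), colors)
]
print(color_scale)
```

The stops only line up with the labels when the color axis spans the codes 0 to 4. If a band is missing from the data, the axis is likely to be scaled to the observed range instead.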


### Create Parallel Plot for Math Score

In [None]:
# drop the other two target variables
df_parallel_maths = df_parallel.drop(['ReadingScore', 'WritingScore'], axis=1)

# create parallel plot with custom colors
fig_parallel_maths = px.parallel_categories(
    df_parallel_maths,
    color='MathScore',
    color_continuous_scale=color_scale,
    )

# change legend to show categorical names
fig_parallel_maths.update_layout(
    coloraxis_colorbar=dict(
        tickvals=[0, 1, 2, 3, 4],
        ticktext=list(labels_map.values())
    )
)
# save figure as HTML, then show it
fig_parallel_maths.write_html('outputs/html/parallel_plot_maths.html')
fig_parallel_maths.show()

### Create Parallel Plot for Reading Score

In [None]:
# drop the other two target variables
df_parallel_reading = df_parallel.drop(['MathScore', 'WritingScore'], axis=1)

# create parallel plot with custom colors
fig_parallel_reading = px.parallel_categories(df_parallel_reading, color='ReadingScore', color_continuous_scale=color_scale)

# change legend to show categorical names
fig_parallel_reading.update_layout(
    coloraxis_colorbar=dict(
        tickvals=[0, 1, 2, 3, 4],
        ticktext=list(labels_map.values())
    )
)
# save figure as HTML, then show it
fig_parallel_reading.write_html('outputs/html/parallel_plot_reading.html')
fig_parallel_reading.show()

### Create Parallel Plot for Writing Score

In [None]:
# drop the other two target variables
df_parallel_writing = df_parallel.drop(['MathScore', 'ReadingScore'], axis=1)

# create parallel plot with custom colors
fig_parallel_writing = px.parallel_categories(
    df_parallel_writing,
    color='WritingScore',
    color_continuous_scale=color_scale)

# change legend to show categorical names
fig_parallel_writing.update_layout(
    coloraxis_colorbar=dict(
        tickvals=[0, 1, 2, 3, 4],
        ticktext=list(labels_map.values())
    )
)
# save figure as HTML, then show it
fig_parallel_writing.write_html('outputs/html/parallel_plot_writing.html')
fig_parallel_writing.show()