# Interview Overview
## Task
- The following project investigates the UCI Iris dataset, which is an introductory classification dataset.  
- You are tasked with replicating a visualization. The main task is at the bottom of the notebook, within section 6. 
- The rest of the code in this notebook is present for context and variable usage. 
## Permissions
- You are permitted to use any resource that you will have under your normal working conditions.  
- You may ask questions, but I am judging how well you can work alone on a Python project.
- Please talk through your thought process.
 

-----------------------------------------

# Iris Dataset Analysis Checklist

## 1. Load Data
- Read the data from the source (e.g., CSV, Excel, etc.)
- Display the first few rows to ensure correct loading
- Check data types and structure

## 2. Fix Data
- Handle missing values (if any)
  - Drop or fill missing values
  - Ensure the data has no NaNs or NULLs
- Ensure correct data types for each column
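
A minimal sketch of these checks, using a small hypothetical DataFrame (`demo`) instead of the Iris file. Filling with the column median is an assumption made for illustration; dropping the affected rows is the other option listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value and one column read as text
demo = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 4.7],
    "sepal_width": ["3.5", "3.0", "3.2"],
})

# Count missing values per column
missing_counts = demo.isna().sum()

# Coerce the text column to a numeric dtype (unparseable entries would become NaN)
demo["sepal_width"] = pd.to_numeric(demo["sepal_width"], errors="coerce")

# Fill the remaining gaps with each column's median
demo_filled = demo.fillna(demo.median(numeric_only=True))

print(missing_counts)
print(demo_filled.dtypes)
```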

## 3. Preprocess
- Identify numerical and categorical columns
- Normalize numerical features
  - Use StandardScaler or MinMaxScaler
- Encode categorical variables (if needed)
  - One-hot encoding or label encoding
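
As a rough illustration of this step, the sketch below standardizes every numeric column with a single `StandardScaler` call and label-encodes the categorical column. The frame `toy` and its values are hypothetical; the `_norm` suffix follows the naming used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for the Iris measurements
toy = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0, 5.1],
    "petal_width": [0.2, 1.4, 2.5, 1.8],
    "class": ["a", "b", "c", "c"],
})

# Identify the numeric columns before any new columns are added
numeric_cols = toy.select_dtypes(include="number").columns

# Fit once on all numeric columns; each column is scaled independently
scaled = StandardScaler().fit_transform(toy[numeric_cols])
for k, col in enumerate(numeric_cols):
    toy[f"{col}_norm"] = scaled[:, k]

# Label-encode the categorical column using pandas category codes
toy["class_encoded"] = toy["class"].astype("category").cat.codes

print(toy)
```

Scaling all columns in one call gives the same result as scaling them one at a time, because `StandardScaler` computes the mean and standard deviation per column.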

## 4. Visualize
- Visualize feature distributions
  - Plot histograms for each numerical feature
  - Box plots for understanding spread and outliers
- Visualize relationships
  - Scatter plots between features
  - Pair plots to examine correlations

## 5. Model
- Split data into training and testing sets
  - Typically 70-80% for training, 20-30% for testing
- Select a baseline model
  - Simple classifier (e.g., logistic regression, decision tree)
- Train the baseline model
- Evaluate the baseline model
  - Accuracy, precision, recall, F1-score, etc.
- Explore more complex models
  - Decision trees, random forests, support vector machines (SVMs), neural networks, etc.
- Hyperparameter tuning
  - Use GridSearchCV or RandomizedSearchCV to find optimal parameters
- Compare complex models to the baseline
  - Assess improvement in performance metrics
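
The checklist above could be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's own modeling code: it uses scikit-learn's bundled copy of Iris (`load_iris`) rather than the file loaded below, a 75/25 split, and an arbitrary grid of `C` values.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Candidate: scaled logistic regression, tuned over an assumed grid of C values
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    cv=3,
)
search.fit(X_train, y_train)
tuned_acc = search.score(X_test, y_test)

print(f"baseline={baseline_acc:.2f}, tuned={tuned_acc:.2f}, best={search.best_params_}")
```

Putting the scaler inside the pipeline means it is refit on each cross-validation training fold, so the held-out fold never influences the scaling statistics.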

## 6. Inspect
- Confusion Matrix
- Highlight misclassified points on original PCA plot
- Raw Values
  - View the raw values of the misclassified points.
- Nearest Neighbor
  - View the raw values next to the raw values of its nearest neighbor
  - Visualize the misclassified normalized values as a bar stack
    - Compare to the nearest neighbor (NN) in the misclassified class
    - Compare to the mean of the ground-truth (GT) and misclassified classes

## 7. Finalize
- Select the best model based on evaluation metrics
- Save the model for future use
- Prepare documentation and interpretation of results
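
One possible way to save and restore a fitted model is a `pickle` round trip, sketched below. The model, the file name `iris_model.pkl`, and the temporary directory are all hypothetical choices made to keep the example self-contained.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Hypothetical output location inside a temporary directory
model_path = os.path.join(tempfile.mkdtemp(), "iris_model.pkl")

# Serialize the fitted model to disk
with open(model_path, "wb") as fh:
    pickle.dump(model, fh)

# Load it back, as a later session would
with open(model_path, "rb") as fh:
    restored = pickle.load(fh)

# The restored model should reproduce the original predictions
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```

A pickled model should only be loaded with a compatible scikit-learn version, and pickle files from untrusted sources should not be opened.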

In [4]:
import os
import sys
import numpy as np
import pandas as pd

import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.preprocessing import StandardScaler

from sklearn.decomposition import PCA

In [5]:
input_dir = '../datasets/iris/'
file_name = 'iris.data'

file_path = os.path.join(input_dir, file_name)

column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width','class']
df = pd.read_csv(file_path, sep=',', names=column_names)

In [6]:
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [7]:
# Identify numerical features to normalize
numerical_features = df.select_dtypes(include=[float, int]).columns

# Initialize the StandardScaler
scaler = StandardScaler()

# Normalize only the numerical features
for feature in numerical_features:
    df[f'{feature}_norm'] = scaler.fit_transform(df[feature].values.reshape(-1, 1)).flatten()

In [8]:
features = [f for f in df.columns if '_norm' in f]

In [9]:
# Define colors for unique classes
unique_classes = df['class'].unique()
colors = {
    unique_class: f"rgba({(i * 50) % 255}, {(i * 100) % 255}, {(i * 150) % 255}, 0.5)"
    for i, unique_class in enumerate(unique_classes)
}

# Create 4 rows and 3 columns of subplots
fig = make_subplots(
    rows=4, 
    cols=3, 
    # One title per subplot, in row-major order (feature rows, class columns)
    subplot_titles=[
        f"{cls} {feature}" 
        for feature in features 
        for cls in unique_classes
    ]
)

# Histogram bin size for consistent intervals
x_ranges = {
    feature: [df[feature].min(), df[feature].max()] 
    for feature in features
}

num_bins = 100
bin_sizes = {
    feature: (x_ranges[feature][1]-x_ranges[feature][0])/num_bins
    for feature in features
}

# Loop through each feature and class to add histograms to subplots
for i, feature in enumerate(features):
    row = i + 1
    bin_size = bin_sizes[feature]
    for j, sample_class in enumerate(unique_classes):
        col = j + 1
        sample_data = df[df['class'] == sample_class][feature]

        fig.add_trace(
            go.Histogram(
                x=sample_data,
                xbins=dict(start=x_ranges[feature][0], end=x_ranges[feature][1], size=bin_size),
                marker=dict(color=colors[sample_class]),
                name=f"{sample_class} {feature}"
            ),
            row=row,
            col=col
        )

        fig.update_xaxes(
            range=[x_ranges[feature][0], x_ranges[feature][1]],  # Set desired range
            row=row,
            col=col,
        )

# Layout updates for consistent axes and plot size
fig.update_xaxes(showgrid=True)
fig.update_yaxes(showgrid=True)

fig.update_layout(
    height=1000,
    width=1000,
    title_text="Histograms of Features by Class",
    showlegend=True,
)

# Display the plot
fig.show()

In [10]:
features = features

# Create subplot layout with a row for each feature
fig = make_subplots(rows=len(features), cols=1)

# Define colors for unique classes
unique_classes = df['class'].unique()
colors = {
    unique_class: f"rgba({(i * 50) % 255}, {(i * 100) % 255}, {(i * 150) % 255}, 0.5)"
    for i, unique_class in enumerate(unique_classes)
}

# Add a violin plot with point distribution for each feature
for i, feature in enumerate(features):
    row = i + 1  # Row number

    # Add violin plot for each class
    for class_label in unique_classes:
        class_data = df[df['class'] == class_label]

        # Violin plot
        fig.add_trace(
            go.Violin(
                y=class_data[feature],
                x0=class_label,
                line_color=colors[class_label],
                name=f'{class_label} - {feature}',
                box_visible=True,
                meanline_visible=True,
                points="all",
            ),
            row=row,
            col=1
        )

# Set the layout properties
fig.update_layout(
    height=1000,
    width=1000,
    title="Violin Plots with Point Distribution",
    showlegend=True
)

# Display the plot
fig.show()

In [11]:
from sklearn.decomposition import PCA

In [12]:
X_scaled = df[features]

# Perform PCA with 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Create a DataFrame with the PCA components and the class labels
df_pca = pd.DataFrame(X_pca, columns=['PCA_1', 'PCA_2'])
df_pca['class'] = df['class']

# Define colors for unique classes
unique_classes = df['class'].unique()
colors = {
    unique_class: f"rgba({(i * 50) % 255}, {(i * 100) % 255}, {(i * 150) % 255}, 0.5)"
    for i, unique_class in enumerate(unique_classes)
}

# Create a new scatter plot with plotly.graph_objects
fig = go.Figure()

# Add a scatter trace for each class
for class_label in unique_classes:
    class_data = df_pca[df_pca['class'] == class_label]
    fig.add_trace(
        go.Scatter(
            x=class_data['PCA_1'],
            y=class_data['PCA_2'],
            mode='markers',
            marker=dict(color=colors[class_label]),
            name=f'Class {class_label}'
        )
    )

# Set layout properties
fig.update_layout(
    title='Scatter Plot of First and Second PCA Components',
    xaxis_title='PCA 1',
    yaxis_title='PCA 2',
    legend_title='Class',
    showlegend=True,
    width=750,
    height=600,
)

# Display the plot
fig.show()

## 5. Model
- Split data into training and testing sets
  - Typically 70-80% for training, 20-30% for testing
- Select a baseline model
  - Simple classifier (e.g., logistic regression, decision tree)
- Train the baseline model
- Evaluate the baseline model
  - Accuracy, precision, recall, F1-score, etc.
- Explore more complex models
  - Decision trees, random forests, support vector machines (SVMs), neural networks, etc.
- Hyperparameter tuning
  - Use GridSearchCV or RandomizedSearchCV to find optimal parameters
- Compare complex models to the baseline
  - Assess improvement in performance metrics

In [13]:
from sklearn.model_selection import KFold, train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder


In [14]:
from sklearn import linear_model

In [15]:
# Convert categorical target variable into numerical values
label_encoder = LabelEncoder()

target = 'class'
df[f'{target}_encoded'] = label_encoder.fit_transform(df[target])
target_encoded = f'{target}_encoded'

In [16]:
# Create the model: logistic regression
# (uncomment the DummyClassifier line instead for a chance-level reference)
# model = DummyClassifier(strategy="uniform")
model = linear_model.LogisticRegression()

In [17]:
df.dtypes

sepal_length         float64
sepal_width          float64
petal_length         float64
petal_width          float64
class                 object
sepal_length_norm    float64
sepal_width_norm     float64
petal_length_norm    float64
petal_width_norm     float64
class_encoded          int64
dtype: object

In [18]:
# Initialize K-Fold with K=3
kf = KFold(n_splits=3, shuffle=True, random_state=42)

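target = 'class'  # String labels; the integer-encoded version is in 'class_encoded'

# Arrays that collect the held-out labels and predictions for every row
y_test_cumulative = np.empty(df.shape[0], dtype=object)
y_test_pred_cumulative = np.empty(df.shape[0], dtype=object)

# Loop through K-Fold cross-validation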
for fold, (train_index, test_index) in enumerate(kf.split(df)):
    # Create training and test sets
    train_data = df.iloc[train_index]
    test_data = df.iloc[test_index]

    # Create smaller training and validation sets from the training set (20% validation)
    train_subset, val_subset = train_test_split(train_data, test_size=0.2, random_state=42)

    # Extract features and targets for each set
    X_train = train_subset[features]
    y_train = train_subset[target]

    X_val = val_subset[features]
    y_val = val_subset[target]

    X_test = test_data[features]
    y_test = test_data[target]

    

    # Fit the model on the smaller training set
    model.fit(X_train, y_train)

    # Predict on the validation set and test set
    y_val_pred = model.predict(X_val)
    y_test_pred = model.predict(X_test)

    y_test_cumulative[test_index] = y_test
    y_test_pred_cumulative[test_index] = y_test_pred

    # Calculate validation accuracy for this fold (test accuracy is computed after the loop)
    val_accuracy = accuracy_score(y_val, y_val_pred)

    print(f"Fold {fold + 1}")
    print(f"Validation Accuracy: {val_accuracy}")
    print("---")
df['pred'] = y_test_pred_cumulative
print("----------------------------------------------")
test_accuracy = accuracy_score(y_test_cumulative, y_test_pred_cumulative)
print(f"Cumulative Test Accuracy: {test_accuracy:.2f}")

# Generate a classification report to get precision, recall, and F1-score
report = classification_report(df[target], df['pred'], target_names=label_encoder.classes_)

print("Classification Report:")
print(report)



Fold 1
Validation Accuracy: 0.85
---
Fold 2
Validation Accuracy: 1.0
---
Fold 3
Validation Accuracy: 0.85
---
----------------------------------------------
Cumulative Test Accuracy: 0.95
Classification Report:
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        50
Iris-versicolor       0.92      0.92      0.92        50
 Iris-virginica       0.92      0.92      0.92        50

       accuracy                           0.95       150
      macro avg       0.95      0.95      0.95       150
   weighted avg       0.95      0.95      0.95       150



In [19]:
import pickle

In [20]:
output_file = '/Volumes/RyanMercerTB3/dev_RyanMercer/output/iris_dataframe.pkl'

df.to_pickle(output_file)

In [21]:
df = pd.read_pickle(output_file)
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class,sepal_length_norm,sepal_width_norm,petal_length_norm,petal_width_norm,class_encoded,pred
0,5.1,3.5,1.4,0.2,Iris-setosa,-0.900681,1.032057,-1.341272,-1.312977,0,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa,-1.143017,-0.124958,-1.341272,-1.312977,0,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa,-1.385353,0.337848,-1.398138,-1.312977,0,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa,-1.506521,0.106445,-1.284407,-1.312977,0,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa,-1.021849,1.26346,-1.341272,-1.312977,0,Iris-setosa


## 6. Inspect
- Confusion Matrix
- Highlight misclassified points on original PCA plot
- Raw Values
  - View the raw values of the misclassified points.
- Nearest Neighbor
  - View the raw values next to the raw values of its nearest neighbor
  - Visualize the misclassified normalized values as a bar stack
    - Compare to the nearest neighbor (NN) in the misclassified class
    - Compare to the mean of the ground-truth (GT) and misclassified classes
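
Before plotting anything, the misclassified rows and the confusion counts can be pulled straight from the label columns. The sketch below uses a hypothetical frame `toy` with the same `'class'` and `'pred'` column names as this notebook's `df`; the values are made up.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df
toy = pd.DataFrame({
    "class": ["setosa", "versicolor", "versicolor", "virginica", "virginica"],
    "pred": ["setosa", "versicolor", "virginica", "virginica", "versicolor"],
    "petal_length": [1.4, 4.5, 5.0, 5.8, 4.9],
})

# Rows where the prediction disagrees with the ground truth
misclassified = toy[toy["class"] != toy["pred"]]

# Confusion counts: rows are true labels, columns are predicted labels
confusion = pd.crosstab(toy["class"], toy["pred"])

print(misclassified)
print(confusion)
```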

In [22]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [23]:
conf_matrix = confusion_matrix(df['class'], df['pred'])

# Create a heatmap with plotly.graph_objects
fig = go.Figure(
    data=go.Heatmap(
        z=conf_matrix,  # Confusion matrix values
        x=label_encoder.classes_,  # Labels for x-axis (predicted classes)
        y=label_encoder.classes_,  # Labels for y-axis (true classes)
        colorscale='Blues',  # Color scheme for the heatmap
        text=conf_matrix,  # Text for displaying values
        texttemplate="%{z}",  # Template for text (displays integer values)
        textfont={"size": 14},  # Font size for the text
        showscale=False  # Hide the color scale bar
    )
)

# Set axis titles and layout
fig.update_layout(
    title='Confusion Matrix',
    xaxis_title='Predicted Labels',
    yaxis_title='True Labels',
    width=650,
    height=600,
)

fig.show()

In [24]:
from sklearn.neighbors import NearestNeighbors

In [25]:
# Find the extrema for the x-axis limits
def find_extrema(*args):
    min_val = min(min(arr) for arr in args)
    max_val = max(max(arr) for arr in args)
    return min_val, max_val

In [26]:
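# Return the features of the nearest neighbor of `sample` within one class,
# skipping the sample itself if it belongs to that class
def get_nearest_neighbor(features, labeled_class, target_df, sample):
    nn = NearestNeighbors(n_neighbors=2)  # Two neighbors, in case the first is the sample itself
    df_labeled_class = target_df[target_df['class'] == labeled_class]
    nn.fit(df_labeled_class[features])

    # Query with a one-row DataFrame so the feature names match the fitted data
    query = sample[features].to_frame().T.astype(float)

    # Positions of the nearest neighbors within df_labeled_class,
    # mapped back to index labels of the original DataFrame
    positions = nn.kneighbors(query)[1][0]
    indices = [df_labeled_class.index[i] for i in positions]

    # Exclude the sample's own index
    nearest_index = indices[1] if indices[0] == sample.name else indices[0]

    # Look up by label (.loc), since these are index labels rather than positions
    return target_df.loc[nearest_index, features]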

In [27]:
# Helper function to add a horizontal stacked bar plot
def add_bar_plot(fig, data, features, row, col, show_yaxis, x_limits):
    colors = ['green' if v > 0 else 'red' for v in data]

    for feature, value, color in zip(features, data, colors):
        base = 0
        width = value - base

        fig.add_trace(
            go.Bar(
                x=[width],
                y=[feature],
                orientation='h',
                marker_color=color,
                base=base,
                name=feature
            ),
            row=row,
            col=col
        )
    if not show_yaxis:
        fig.update_yaxes(showticklabels=False, row=row, col=col)

    # Set x-axis limits for the subplot
    fig.update_xaxes(range=x_limits, row=row, col=col)

# ---------------- Main Interview Question -----------------
- Your task is to replicate the visualization shown below.
- There is likely more here than you can finish in the time available; I am more interested in your process than in a complete result.

## Purpose
- There are misclassified samples after testing the model on the Iris dataset.
- It is important to judge whether the misclassified samples are reasonable, or if there is still room to improve the model.
- This visualization allows us to judge whether a misclassified sample looks more like other samples in its labeled class (the ground truth in 'class') or in its mislabeled class (the prediction in 'pred').

## Plot structure
### Row 1: Misclassified Sample
- Col 1: The misclassified sample's feature values, previously normalized so that each feature has a mean of 0 and a standard deviation of 1.
- Col 2: 
- Col 3: 
- Col 4:
### Row 2: Nearest Neighbors
- Col 1: The nearest neighbor in the labeled class ('class' values match, 'pred' values mismatch), shown directly below the misclassified sample
- Col 2: The difference between the mislabeled sample and the nearest neighbor in the labeled class
- Col 3: The nearest neighbor in the mislabeled class ('class' values mismatch, 'pred' values match)
- Col 4: The difference between the mislabeled sample and the nearest neighbor in the mislabeled class
### Row 3: Median Samples
- Col 1: The median of each feature value in the labeled class ('class' values match, 'pred' values mismatch)
- Col 2: The difference between the mislabeled sample and the median of each feature value in the labeled class 
- Col 3: The median of each feature value in the mislabeled class ('class' values mismatch, 'pred' values match)
- Col 4: The difference between the mislabeled sample and the median of each feature value in the mislabeled class
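
The values behind rows 2 and 3 could be computed along these lines. This is only a sketch on a hypothetical two-feature frame (`toy`, `feats`, `nearest_in_class`, and `panels` are illustrative names), and it assumes each reference pool is selected by true class alone, as the `get_nearest_neighbor` helper above does; the parenthetical conditions in the list could also be read as stricter filters on `'pred'`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: two normalized features, true and predicted labels
toy = pd.DataFrame({
    "a_norm": [0.0, 0.2, 1.0, 1.2, 0.7],
    "b_norm": [0.0, 0.1, 1.0, 0.9, 0.8],
    "class": ["x", "x", "y", "y", "x"],
    "pred": ["x", "x", "y", "y", "y"],
})
feats = ["a_norm", "b_norm"]

# Take the first misclassified sample (labeled 'x', predicted 'y')
sample = toy[toy["class"] != toy["pred"]].iloc[0]
sample_vals = sample[feats].astype(float)

def nearest_in_class(frame, label, row):
    """Closest other row (Euclidean distance) whose true class is `label`."""
    pool = frame[(frame["class"] == label) & (frame.index != row.name)]
    dists = np.linalg.norm(
        pool[feats].to_numpy() - row[feats].to_numpy(dtype=float), axis=1
    )
    return pool.iloc[int(np.argmin(dists))][feats].astype(float)

# Collect the reference rows and differences for both classes
panels = {}
for side, label in [("labeled", sample["class"]), ("mislabeled", sample["pred"])]:
    nn_row = nearest_in_class(toy, label, sample)
    median_row = toy.loc[toy["class"] == label, feats].median()
    panels[side] = {
        "nn": nn_row,
        "nn_diff": sample_vals - nn_row,
        "median": median_row,
        "median_diff": sample_vals - median_row,
    }

print(panels["mislabeled"]["nn_diff"])
```

Each entry of `panels` is a Series indexed by feature name, so it can be passed as the `data` argument of the `add_bar_plot` helper defined above.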

## Useful Info
- All the needed data is in the DataFrame df
- The relevant features end with the suffix "_norm"
- The class labels are in the 'class' column.
- The class predictions are in the 'pred' column.
- I generated the visualization using plotly graph objects.

## Visualization to Replicate:
![title](Iris_InspectMisclassified.png)

In [32]:
df.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class',
       'sepal_length_norm', 'sepal_width_norm', 'petal_length_norm',
       'petal_width_norm', 'class_encoded', 'pred'],
      dtype='object')

In [31]:
relevant_features = [f for f in df.columns if "_norm" in f]
print(relevant_features)

['sepal_length_norm', 'sepal_width_norm', 'petal_length_norm', 'petal_width_norm']
