## Survivor Data Modeling

### Import Dependencies

In [29]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import preprocessing
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, LeaveOneGroupOut

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import sys
import math
import os

# Get the current working directory
current_dir = os.getcwd()

# Get the path to the survivorData directory
data_dir = os.path.join(current_dir, '..', 'survivorData')

### Read in CSV files as DataFrames

In [30]:
# List of CSV file names
csv_files = [
    'advantage_movement.csv',
    'boot_mapping.csv',
    'castaways.csv',
    'castaway_details.csv',
    'challenge_description.csv',
    'challenge_results.csv',
    'confessionals.csv',
    'jury_votes.csv',
    'screen_time.csv',
    'season_palettes.csv',
    'season_summary.csv',
    'survivor_auction.csv',
    'tribe_colours.csv',
    'tribe_mapping.csv',
    'viewers.csv',
    'vote_history.csv'
]

# Create a dictionary to store the DataFrames
dataframes = {}

# Loop through each CSV file and read its data into a DataFrame
for csv_file in csv_files:
    # Specify the relative path to the CSV file
    file_path = os.path.join(data_dir, csv_file)
    
    # Read the data from the CSV file into a pandas DataFrame
    df = pd.read_csv(file_path)
    
    # Store the DataFrame in the dictionary
    dataframes[csv_file] = df

### Clean up the data and engineer features

In [31]:
# Convert tribe_status column to category type
dataframes['tribe_colours.csv']['tribe_status'] = dataframes['tribe_colours.csv']['tribe_status'].astype('category')

# Flag winners while 'result' still holds its text labels; after the encoding
# below, a comparison against 'Sole Survivor' would never match
dataframes['castaways.csv']['won'] = np.where(dataframes['castaways.csv']['result'] == 'Sole Survivor', 1, 0)

# Convert result to a categorical variable and keep only its integer codes
dataframes['castaways.csv']['result'] = pd.Categorical(dataframes['castaways.csv']['result'])
dataframes['castaways.csv']['result'] = dataframes['castaways.csv']['result'].cat.codes

# Merge castaways and castaway_details dataframes on castaway_id
castawayAll = pd.merge(dataframes['castaways.csv'], dataframes['castaway_details.csv'], on='castaway_id', how='left')

# Encode gender as an integer (0 = missing or unrecognised value)
castawayAll['genderNumber'] = np.where(castawayAll['gender'] == 'Male', 1,
                                     np.where(castawayAll['gender'] == 'Female', 2,
                                              np.where(castawayAll['gender'] == 'Non-binary', 3, 0)))

castawayAll = castawayAll.dropna(subset=['age'])

# Drop rows where 'season' is equal to 44
castawayAll = castawayAll[castawayAll['season'] != 44]

## Predicting Color Values in the Survivor Dataset: A Comparative Analysis of SVM and Logistic Regression Models

### Introduction:
In this overview, we examine the rationale for predicting color values in the Survivor dataset (here, a tribe's colour from its tribe status) and the unusual problem this poses for quantifying results. Specifically, we compare Support Vector Machine (SVM) and Logistic Regression models on this task and evaluate them with three metrics: average accuracy, Hamming distance, and Euclidean distance.

### Predicting Color Values in a Dataset:
Color is an essential aspect of visual data analysis and has a wide range of applications, including image processing, computer vision, and data visualization. By predicting color values in a dataset, we can gain insights into patterns, trends, and relationships that are not immediately apparent to the human eye. This predictive modeling exercise helps us develop the skills to uncover hidden information and make data-driven decisions.

### Comparison of SVM and Logistic Regression Models:
To determine the best approach for predicting color values, we have chosen to compare the performance of two popular machine learning algorithms: Support Vector Machine (SVM) and Logistic Regression. Both models are widely used in classification tasks and have demonstrated success in various domains.

1. Support Vector Machine (SVM):
SVM is a powerful algorithm that aims to find an optimal hyperplane in a high-dimensional feature space. It separates data points into distinct classes by maximizing the margin between them. SVM is known for its ability to handle complex datasets and nonlinear relationships effectively. Its versatility and robustness make it a suitable candidate for predicting color values in a dataset.

2. Logistic Regression:
Logistic Regression is a probabilistic machine learning algorithm that is most often introduced for binary classification but extends to multiple classes, as required here because there are many distinct color values. It models the relationship between the input features and the probability of each outcome. Logistic Regression is known for its simplicity, interpretability, and efficiency. While it may not capture complex nonlinear relationships as effectively as SVM, it can still yield accurate predictions in certain scenarios.
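
Although logistic regression is usually introduced for two classes, a common multiclass formulation computes one linear score per class and converts the scores to probabilities with the softmax function. The following is a minimal sketch of that step only; the scores and the three color labels are hypothetical values chosen for illustration, not output from this notebook's model.

```python
import math

def softmax(scores):
    """Convert a list of raw class scores into probabilities that sum to 1."""
    # Subtract the maximum score first for numerical stability
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical class labels and per-class scores for a single sample
class_labels = ["#FF0000", "#00FF00", "#0000FF"]
scores = [2.0, 1.0, 0.1]

probabilities = softmax(scores)

# The predicted class is the one with the highest probability
predicted_label = class_labels[probabilities.index(max(probabilities))]
print(predicted_label)
```

The class with the largest score always receives the largest probability, so the prediction is simply the highest-scoring label.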

### Quantifying Results:
To evaluate the performance of our SVM and Logistic Regression models in predicting color values, we will utilize several metrics that address different aspects of model performance:

1. Average Accuracy:
Average accuracy measures the fraction of test instances whose predicted color exactly matches the true color, so a near miss counts the same as a wildly wrong prediction.

2. Hamming Distance:
Hamming distance quantifies the dissimilarity between two color values by counting the positions at which they differ. Applied to hex strings of the form #RRGGBB, it counts the differing hex digits, giving a value from 0 (identical) to 6 (every digit differs).

3. Euclidean Distance:
Euclidean distance measures the straight-line distance between two points in a multi-dimensional space. For colors, each value is treated as a point (R, G, B) with components from 0 to 255, so the distance ranges from 0 to about 441.7 (black to white), and smaller values mean the predicted color is closer to the ground truth.
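
The two distance metrics above can be sketched as small helper functions. This is a minimal illustration assuming colors are stored as `#RRGGBB` hex strings; the function names (`hex_to_rgb`, `hamming_distance`, `euclidean_distance`) and the example colors are hypothetical and chosen only for this sketch.

```python
import math

def hex_to_rgb(hex_color):
    """Convert a '#RRGGBB' string to an (R, G, B) tuple of integers in 0-255."""
    digits = hex_color.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

def hamming_distance(color_a, color_b):
    """Count the character positions at which two equal-length strings differ."""
    if len(color_a) != len(color_b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(color_a, color_b))

def euclidean_distance(color_a, color_b):
    """Straight-line distance between two hex colors in RGB space."""
    rgb_a, rgb_b = hex_to_rgb(color_a), hex_to_rgb(color_b)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(rgb_a, rgb_b)))

# Two pairs of colors whose red channels each differ by exactly 1
pair_one = ("#FF0000", "#FE0000")
pair_two = ("#7F0000", "#800000")

for predicted, actual in (pair_one, pair_two):
    print(predicted, actual,
          hamming_distance(predicted, actual),
          euclidean_distance(predicted, actual))
```

Both pairs are equally close in RGB space, yet the second pair differs in two hex digits instead of one. Hamming distance treats hex digits as symbols, so it is a coarser measure of color similarity than Euclidean distance.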

### Conclusion:
Predicting color values in a dataset presents a unique situation due to its visual and perceptual nature. By comparing the performance of SVM and Logistic Regression models, we can gain insights into the effectiveness of these algorithms in accurately predicting color values. Furthermore, by utilizing metrics such as average accuracy, Hamming distance, and Euclidean distance, we can comprehensively evaluate the performance of the models and make informed decisions regarding their suitability for color prediction tasks.
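
When these metrics are aggregated, it matters whether a mean is taken over all predictions or only over the misclassified ones, because every correct prediction contributes a distance of 0. The sketch below shows the difference on a handful of hypothetical (predicted, actual) pairs; the data and variable names are made up for illustration.

```python
def hamming_distance(color_a, color_b):
    """Count the character positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(color_a, color_b))

# Hypothetical (predicted, actual) hex color pairs
pairs = [
    ("#FF0000", "#FF0000"),  # correct
    ("#00FF00", "#00FF0F"),  # one hex digit off
    ("#0000FF", "#FFFF00"),  # every hex digit off
    ("#123456", "#123456"),  # correct
]

distances = [hamming_distance(pred, actual) for pred, actual in pairs]

# Accuracy: fraction of exact matches
accuracy = sum(pred == actual for pred, actual in pairs) / len(pairs)

# Mean over all predictions (correct ones count as 0)
mean_all = sum(distances) / len(distances)

# Mean over incorrect predictions only
incorrect = [d for d in distances if d > 0]
mean_incorrect = sum(incorrect) / len(incorrect) if incorrect else 0.0

print(accuracy, mean_all, mean_incorrect)
```

Reporting which of the two means is used avoids confusion when comparing models with different accuracies.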



## Color prediction with SVM model

In [32]:
# Prepare the data for training
X = dataframes['tribe_colours.csv']['tribe_status'].values.reshape(-1, 1)
y = dataframes['tribe_colours.csv']['tribe_colour']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create an SVM model
svm_model = svm.SVC()

# Train the model
svm_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = svm_model.predict(X_test)

# Calculate accuracy
accuracy_svm = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy_svm)

# Calculate Hamming distance for all predictions
hamming_distances = []

for idx in range(len(y_pred)):
    hamming_dist = sum(el1 != el2 for el1, el2 in zip(y_pred[idx], y_test.iloc[idx]))
    hamming_distances.append(hamming_dist)

# Calculate the average Hamming distance over all predictions
# (correct predictions contribute a distance of 0 to the mean)
avg_hamming_distance_svm = np.mean(hamming_distances)

# The Hamming distance counts the positions at which two equal-length strings differ.
# A hex color has the form #RRGGBB: each pair of hex digits encodes one channel (red, green, or blue),
# so the distance here is the number of hex digits that differ between the predicted and actual values.

# The leading "#" always matches, so the distance over the six hex digits ranges from 0 to 6.

# Print the average Hamming distance. Note that, despite the label, the mean
# covers all predictions, not only the incorrect ones.
print("Average Hamming Distance for Incorrect Predictions:", avg_hamming_distance_svm)

# Calculate Euclidean distance for color similarity
euclidean_distances = []
for idx in range(len(y_pred)):
    predicted_color = y_pred[idx][1:]  # Remove the "#" character from the predicted color
    actual_color = y_test.iloc[idx][1:]  # Remove the "#" character from the actual color
    
    # Convert hex color values to RGB tuples
    predicted_rgb = tuple(int(predicted_color[i:i+2], 16) for i in (0, 2, 4))
    actual_rgb = tuple(int(actual_color[i:i+2], 16) for i in (0, 2, 4))
    
    # Calculate Euclidean distance between RGB tuples
    distance = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted_rgb, actual_rgb)))
    euclidean_distances.append(distance)

# Calculate average Euclidean distance for all predictions
avg_euclidean_distance_svm = np.mean(euclidean_distances)

# The Euclidean distance measures the spatial or geometric distance between two colors in the RGB color space.
# In the RGB color space, each color is represented by three components: red (R), green (G), and blue (B). 
# The Euclidean distance calculates the straight-line distance between two colors in this three-dimensional space.

# Print average Euclidean distance for all predictions
print("Average Euclidean Distance for All Predictions:", avg_euclidean_distance_svm)

Accuracy: 0.21739130434782608
Average Hamming Distance for Incorrect Predictions: 3.891304347826087
Average Euclidean Distance for All Predictions: 205.5951583794654



## Color prediction with logistic regression model

In [33]:
# Create a Logistic Regression model
logreg_model = LogisticRegression()

# Train the model
logreg_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = logreg_model.predict(X_test)

# Calculate accuracy
accuracy_logistic_regression = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy_logistic_regression)

# Calculate Hamming distance for all predictions
hamming_distances = []

for idx in range(len(y_pred)):
    hamming_dist = sum(el1 != el2 for el1, el2 in zip(y_pred[idx], y_test.iloc[idx]))
    hamming_distances.append(hamming_dist)

# Calculate the average Hamming distance over all predictions
# (correct predictions contribute a distance of 0 to the mean)
avg_hamming_distance_logistic_regression = np.mean(hamming_distances)

# The Hamming distance counts the positions at which two equal-length strings differ.
# A hex color has the form #RRGGBB: each pair of hex digits encodes one channel (red, green, or blue),
# so the distance here is the number of hex digits that differ between the predicted and actual values.

# The leading "#" always matches, so the distance over the six hex digits ranges from 0 to 6.

# Print the average Hamming distance. Note that, despite the label, the mean
# covers all predictions, not only the incorrect ones.
print("Average Hamming Distance for Incorrect Predictions:", avg_hamming_distance_logistic_regression)

# Calculate Euclidean distance for color similarity
euclidean_distances = []
for idx in range(len(y_pred)):
    predicted_color = y_pred[idx][1:]  # Remove the "#" character from the predicted color
    actual_color = y_test.iloc[idx][1:]  # Remove the "#" character from the actual color
    
    # Convert hex color values to RGB tuples
    predicted_rgb = tuple(int(predicted_color[i:i+2], 16) for i in (0, 2, 4))
    actual_rgb = tuple(int(actual_color[i:i+2], 16) for i in (0, 2, 4))
    
    # Calculate Euclidean distance between RGB tuples
    distance = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted_rgb, actual_rgb)))
    euclidean_distances.append(distance)

# Calculate average Euclidean distance for all predictions
avg_euclidean_distance_logistic_regression = np.mean(euclidean_distances)

# The Euclidean distance measures the spatial or geometric distance between two colors in the RGB color space.
# In the RGB color space, each color is represented by three components: red (R), green (G), and blue (B). 
# The Euclidean distance calculates the straight-line distance between two colors in this three-dimensional space.

# Print average Euclidean distance for all predictions
print("Average Euclidean Distance for All Predictions:", avg_euclidean_distance_logistic_regression)

Accuracy: 0.21739130434782608
Average Hamming Distance for Incorrect Predictions: 4.0
Average Euclidean Distance for All Predictions: 195.6007378809655


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
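
The warning above suggests two remedies: raising `max_iter` or preprocessing the features. A minimal sketch of both follows, assuming a categorical feature such as `tribe_status` is one-hot encoded before fitting. The toy rows and the `max_iter=1000` value are illustrative assumptions, not settings tuned for this dataset, and the sketch makes no claim about why the run above failed to converge.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical feature and a hex color label per row
X_toy = [["Original"], ["Original"], ["Merged"], ["Merged"], ["Swapped"], ["Swapped"]]
y_toy = ["#FF0000", "#FF0000", "#000000", "#000000", "#0000FF", "#0000FF"]

# One-hot encode the category, then fit with a higher iteration limit
# (the scikit-learn default for max_iter is 100)
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)
model.fit(X_toy, y_toy)

toy_predictions = model.predict([["Merged"], ["Original"], ["Swapped"]])
print(toy_predictions)
```

Wrapping the encoder and the classifier in one pipeline means the same preprocessing is applied at both fit and predict time.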
