# Practice

## Libraries

In [42]:
from termcolor import colored # type: ignore                                          # Colored text
from random import Random  # type: ignore                                             # Random number generator
import math  # type: ignore                                                           # Mathematical functions
import pandas as pd  # type: ignore                                                   # Data manipulation
import numpy as np  # type: ignore                                                    # Scientific computing
import matplotlib.pyplot as plt  # type: ignore                                       # Data visualization
from scipy.stats import binom as binomial  # type: ignore                             # Binomial distribution
from scipy.stats import norm as normal  # type: ignore                                # Normal distribution
from scipy.stats import poisson as poisson  # type: ignore                            # Poisson distribution
from scipy.stats import t as student  # type: ignore                                  # Student distribution
from scipy.stats import chi2  # type: ignore                                          # Chi-squared distribution
from scipy.stats import ttest_1samp  # type: ignore                                   # One-sample t-test
from scipy.stats import chisquare  # type: ignore                                     # Chi-squared test
from scipy.special import comb  # type: ignore                                        # Combinations
from mlxtend.frequent_patterns import apriori  # type: ignore                         # Apriori algorithm
from mlxtend.frequent_patterns import fpgrowth  # type: ignore                        # FP-growth algorithm
from mlxtend.frequent_patterns import association_rules  # type: ignore               # Association rules
from mlxtend.preprocessing import TransactionEncoder  # type: ignore                  # Transaction encoder
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis  # type: ignore  # Discriminant Analysis
from tensorflow import keras  # type: ignore                                          # Deep Learning library
from tensorflow.keras import Model  # type: ignore                                    # Model class
from tensorflow.keras.layers import Input, Dense, BatchNormalization  # type: ignore  # Layers
from tensorflow.keras.utils import to_categorical  # type: ignore                     # One-hot encoding
from tensorflow.keras.optimizers import Adam  # type: ignore                          # Optimizer
from livelossplot import PlotLossesKeras  # type: ignore                              # Live plot
from tensorflow.keras.optimizers import RMSprop  # type: ignore                       # Optimizer
from sklearn.model_selection import train_test_split  # type: ignore                  # Train-test split
from sklearn.metrics import roc_auc_score # type: ignore                              # ROC AUC score
from simanneal import Annealer  # type: ignore                                        # Simulated Annealing
from inspyred import ec  # type: ignore                                               # Evolutionary Computation
import warnings  # type: ignore                                                       # Disable warnings
from Resources.Functions import *  # type: ignore                                     # Custom functions
warnings.filterwarnings("ignore")                                                     # Disable warnings
outputColor = "blue"                                                                  # Color for the output

## Information About the Dataset
- Information about the dataset `../Data/Gesture-Original.csv` used in the following questions.
- This dataset contains data from 64 muscle sensors `V1 - V64` that are placed on the body of test subjects. In addition, the file contains 2 more columns:
    - `gesture` contains the hand gesture performed by the test subject during the measurement of the 64 sensors. The possible values are `rock`, `paper`, `scissors` and `okay`.
    - `okay` does not contain any new data, but simply indicates whether the subject performed the `okay` hand gesture during the measurement.

In [46]:
# Load the data
gestureOriginal = pd.read_csv("../Data/Gesture-Original.csv", delimiter=';')

### Question 1:
- Apply linear discriminant analysis to the `../Data/Gesture-Original.csv` dataset.
- `gesture` is the dependent variable.
- Use all other variables as independent variables, except the column `okay`.

Answer the following questions:
- One of the assumptions for applying LDA is that no dependence exists between the independent variables. Show whether or not this assumption is met here.
- How many discriminant functions are created?
- Why exactly are there so many?
- Which independent variable plays the largest role in the first discriminant function? Explain how you arrived at this answer.

In [47]:
# Create linear discriminant analysis model
# independentVariables = gestureOriginal[['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31', 'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40', 'V41', 'V42', 'V43', 'V44', 'V45', 'V46', 'V47', 'V48', 'V49', 'V50', 'V51', 'V52', 'V53', 'V54', 'V55', 'V56', 'V57', 'V58', 'V59', 'V60', 'V61', 'V62', 'V63', 'V64']]

independentVariables = gestureOriginal.drop(columns=['gesture', 'okay'])                                                        # Independent variables
dependentVariable = gestureOriginal['gesture']                                                                                  # Dependent variable
lda = LinearDiscriminantAnalysis()
lda.fit(independentVariables, dependentVariable)

# independentVariables.describe()
# independentVariables.corr(method='pearson')

# Show some information about the discriminant analysis
print(colored(f"There are {(len(dependentVariable.unique()))-1} dimensions and there are {len(dependentVariable.unique())} different possibilities for the dependent variable and also, there are {len(independentVariables.columns)} independent variables. And there are {min(len(gestureOriginal['gesture'].unique()) - 1, independentVariables.shape[1])} discriminant functions.", outputColor))
print(colored(f"\nThe number of discriminant functions is the minimum of (number of classes of the dependent variable - 1) and the number of independent variables: min({len(dependentVariable.unique())} - 1, {len(independentVariables.columns)}) = {min(len(dependentVariable.unique()) - 1, independentVariables.shape[1])}.", outputColor))

There are 3 dimensions and there are 4 different possibilities for the dependent variable and also, there are 64 independent variables. And there are 3 discriminant functions.

The number of discriminant functions is the minimum of (number of classes of the dependent variable - 1) and the number of independent variables: min(4 - 1, 64) = 3.
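
The independence assumption from the first sub-question can be inspected through the pairwise Pearson correlations between the sensor columns. The sketch below is only an illustration on synthetic data: the helper name `strongest_correlations` and the toy columns are assumptions, not part of the original notebook. On the real data it would be called with `independentVariables`.

```python
import numpy as np
import pandas as pd

def strongest_correlations(features: pd.DataFrame, top: int = 5) -> pd.Series:
    """Return the `top` variable pairs with the largest absolute Pearson correlation."""
    corr = features.corr(method="pearson").abs()
    # Keep only the upper triangle so every pair is listed once and the diagonal is skipped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(top)

# Synthetic stand-in for the sensor data (hypothetical values)
rng = np.random.default_rng(0)
v1 = rng.normal(size=200)
toy = pd.DataFrame({
    "V1": v1,
    "V2": v1 + 0.1 * rng.normal(size=200),  # deliberately dependent on V1
    "V3": rng.normal(size=200),             # independent of the others
})
top_pairs = strongest_correlations(toy, top=3)
print(top_pairs)
```

If several pairs show an absolute correlation close to 1, the assumption of independent predictors is questionable; values close to 0 for all pairs support it.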


In [48]:
# Determine the most important variable
result = most_important_variable(independentVariables, dependentVariable)

# Print the most important variable
print(colored("The independent variable that plays the most significant role in the first discriminant function is:", outputColor))
print(colored(f"Variable: {result['Variable']}, Coefficient: {result['Coefficient']}", outputColor))

The independent variable that plays the most significant role in the first discriminant function is:
Variable: V39, Coefficient: 0.006181404091898121

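The helper `most_important_variable` comes from `Resources.Functions` and its implementation is not shown here. One possible way to obtain such a result, sketched under the assumption that the weights of the first discriminant function are read from the `scalings_` attribute of a fitted `LinearDiscriminantAnalysis`, is given below. The function name and the synthetic data are hypothetical, and the features are standardized first so that the weights are comparable across variables, which the original helper may or may not do.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def largest_first_function_weight(features: pd.DataFrame, target: pd.Series):
    """Return (variable name, weight) with the largest absolute weight in the first discriminant function."""
    # Standardize so that the weights do not depend on the measurement scale
    standardized = (features - features.mean()) / features.std()
    model = LinearDiscriminantAnalysis().fit(standardized, target)
    # Column 0 of scalings_ holds the weights of the first discriminant function
    weights = pd.Series(model.scalings_[:, 0], index=features.columns)
    name = weights.abs().idxmax()
    return name, weights[name]

# Synthetic example: only column "A" separates the three classes
rng = np.random.default_rng(0)
labels = np.repeat(["rock", "paper", "scissors"], 100)
class_means = np.repeat([0.0, 5.0, 10.0], 100)
toy = pd.DataFrame({
    "A": class_means + rng.normal(size=300),
    "B": rng.normal(size=300),
    "C": rng.normal(size=300),
})
name, weight = largest_first_function_weight(toy, pd.Series(labels))
print(name, weight)
```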

### Question 2:
- With 64 independent variables we are dealing with a rather large number.
    - Use a proper technique to limit the number of variables.
    - I want to limit myself to 10 variables. What percentage of the information from the original dataset can I keep with this?
    - Explain how you arrived at this number.
    - Create a data set with those 10 variables.
        - Perform a discriminant analysis. Compare the explained variance with that of the previous exercise.

In [65]:
from sklearn.decomposition import PCA

# Limit the number of variables
pca_dim = min(independentVariables.shape[1], independentVariables.shape[0])                                             # Maximum number of components: min(number of variables, number of observations)
pcamodel = PCA(n_components=pca_dim)                                                                                    # Create PCA model
principalComponents = pcamodel.fit_transform(independentVariables)                                                      # Fit and transform the data
col_names = [f'PC{i}' for i in range(1, 11)]                                                                            # Column names PC1 ... PC10
new_independentVariables = pd.DataFrame(data=principalComponents[:, :10], columns=col_names)                            # Create a new dataframe with the first 10 principal components

# Summary: we combined the 64 independent variables into principal components and keep the first 10 of them as the new independent variables.
print(colored(f"The new dataset contains {new_independentVariables.shape[1]} independent variables.", outputColor))

print(colored(f"We expect to keep {round(pcamodel.explained_variance_ratio_[:10].sum() * 100, 2)}% of the information from the original dataset.", outputColor))

# Explain how we arrived at this number
print(colored("\nWe arrived at this number by calculating the sum of the explained variance ratios of the first 10 principal components.", outputColor))

The new dataset contains 10 independent variables.
We expect to keep 51.83% of the information from the original dataset.

We arrived at this number by calculating the sum of the explained variance ratios of the first 10 principal components.
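
The percentage above is a cumulative sum of explained variance ratios. The sketch below shows the same calculation on synthetic data (the array `data` and the 90% target are assumptions made for illustration), together with the reverse question: how many components are needed to reach a given share of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 20 columns with decreasing spread (hypothetical stand-in for the sensors)
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 20)) * np.linspace(5, 1, 20)

pca = PCA().fit(data)
cumulative = np.cumsum(pca.explained_variance_ratio_)  # share kept by the first k components

kept_by_10 = cumulative[9]                              # share kept by the first 10 components
# First position where the cumulative share reaches 90%, converted to a component count
needed_for_90 = int(np.searchsorted(cumulative, 0.90)) + 1

print(f"First 10 components keep {kept_by_10:.2%} of the variance.")
print(f"{needed_for_90} components are needed to keep at least 90%.")
```

Note that PCA is sensitive to the scale of the columns. If the sensors were measured on different scales, standardizing them first would be a reasonable choice.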


In [82]:
# Create a new discriminant analysis model
lda_new = LinearDiscriminantAnalysis()
lda_new.fit(new_independentVariables, dependentVariable)

# Compare the lda models with lda_new model
print(colored(f"The first discriminant function of the new model explains {round(lda_new.explained_variance_ratio_[0] * 100, 2)}% of the variance.", outputColor))
print(colored(f"The first discriminant function of the old model explains {round(lda.explained_variance_ratio_[0] * 100, 2)}% of the variance.", outputColor))

print(colored(f"\nThe second discriminant function of the new model explains {round(lda_new.explained_variance_ratio_[1] * 100, 2)}% of the variance.", outputColor))
print(colored(f"The second discriminant function of the old model explains {round(lda.explained_variance_ratio_[1] * 100, 2)}% of the variance.", outputColor))

print(colored(f"\nThe third discriminant function of the new model explains {round(lda_new.explained_variance_ratio_[2] * 100, 2)}% of the variance.", outputColor))
print(colored(f"The third discriminant function of the old model explains {round(lda.explained_variance_ratio_[2] * 100, 2)}% of the variance.", outputColor))

print(colored("\nIn the old model with 64 independent variables, the first discriminant function alone explains a larger share of the variance than in the new model with 10 principal components, where the variance is spread more evenly over the three functions.", outputColor))

The first discriminant function of the new model explains 72.63% of the variance.
The first discriminant function of the old model explains 91.15% of the variance.

The second discriminant function of the new model explains 16.87% of the variance.
The second discriminant function of the old model explains 5.33% of the variance.

The third discriminant function of the new model explains 10.5% of the variance.
The third discriminant function of the old model explains 3.52% of the variance.

In the old model with 64 independent variables, the first discriminant function alone explains a larger share of the variance than in the new model with 10 principal components, where the variance is spread more evenly over the three functions.


### Question 3:
- We created our own model with more test data to predict whether someone performs the `okay` hand gesture. The data is available in `../Data/Gesture-Original.csv`. Note that the predicted values are not just `0` or `1`, but can be close to `0`, close to `1` or somewhere in between. Take `0.5` as the threshold.

Answer the following questions:
- You want to know how well the model can predict whether the test subject is performing the `okay` gesture. What metric do you use?
- How much is this?
- Calculate the F-measure in which recall and precision are weighted equally. How much is this?
- Create an ROC curve.
- What is the AUC?
- What can you conclude based on the AUC?
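
The metrics asked for above can be sketched with `sklearn.metrics`. The arrays `y_true` and `y_score` are made-up values, since the column that holds the predictions is not named here, and treating a score of exactly `0.5` as positive is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, roc_curve

# Hypothetical labels (1 = okay gesture) and predicted scores between 0 and 1
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])

# Apply the 0.5 threshold to turn scores into class predictions
y_pred = (y_score >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)  # share of correct predictions
f1 = f1_score(y_true, y_pred)              # F-measure with equal weight for precision and recall
auc = roc_auc_score(y_true, y_score)       # area under the ROC curve, uses the raw scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print(f"Accuracy: {accuracy:.3f}, F1: {f1:.3f}, AUC: {auc:.3f}")
```

The ROC curve itself can be drawn with `plt.plot(fpr, tpr)`. An AUC close to 1 indicates that the model ranks positive cases above negative ones almost always, while 0.5 corresponds to random guessing.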

### Question 4:
Use the dataframe `../Data/Gesture-Original.csv` to train an artificial neural network that predicts the column `okay`.
- Scale the data with min-max normalization
- Create a neural network model for the entire data set with the following values for the parameters:
    - 4 hidden layers: 32, 16, 8 and 4 neurons
    - learning rate = 0.001
    - epochs=100
    - For the other parameters, such as the functions used, make choices that correspond to the objective of the ANN.
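
A minimal sketch of these steps follows. The layer sizes and learning rate come from the question, but the ReLU hidden activations, the sigmoid output, the binary cross-entropy loss and the Adam optimizer are assumptions chosen to fit a binary target, and the function names are hypothetical. TensorFlow is imported inside the builder so that the scaling helper can be used on its own.

```python
import numpy as np

def min_max_scale(values):
    """Rescale every column to the range [0, 1]: (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    lowest = values.min(axis=0)
    span = values.max(axis=0) - lowest
    span = np.where(span == 0, 1.0, span)  # avoid division by zero for constant columns
    return (values - lowest) / span

def build_okay_classifier(n_features, learning_rate=0.001):
    """Build a network with hidden layers of 32, 16, 8 and 4 neurons for a binary target."""
    from tensorflow import keras  # imported lazily; requires TensorFlow to be installed

    inputs = keras.layers.Input(shape=(n_features,))
    x = inputs
    for units in (32, 16, 8, 4):
        x = keras.layers.Dense(units, activation="relu")(x)
    outputs = keras.layers.Dense(1, activation="sigmoid")(x)  # probability of the okay gesture

    model = keras.Model(inputs=inputs, outputs=outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Small demonstration of the scaling step on made-up numbers
scaled = min_max_scale([[1, 10], [2, 20], [3, 30]])
print(scaled)
```

With the real data, the model could then be trained with something like `build_okay_classifier(64).fit(scaled_features, okay_labels, epochs=100)`, where `scaled_features` and `okay_labels` stand for the scaled sensor columns and the `okay` column encoded as `0`/`1`.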

### Question 5:
We want to solve the following optimization problem without doing the calculations ourselves:
- You have `30` cards, each with its own value: 2, 4, 6, ..., 60.
- You must divide the cards into two piles so that the sum of all values of pile `1` is as close as possible to `2` times the sum of pile `2`.

Answer the following questions:
- Solve this problem with a genetic algorithm. Make sure you use the following parameters:
    - popSize=100
    - max_evaluations=5000
    - run=100
    - mutation_rate=0.1
- Which cards are in pile `1`? Also indicate how you arrived at this.
- What is the difference between the sum of pile `1` and `2` times the sum of pile `2`?
- Are you sure there isn't a better solution? And why?
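
The card problem can be sketched with a small genetic algorithm. The version below is written in plain Python rather than with `inspyred`, so the function names, the tournament selection, the one-point crossover and the elitism are all assumptions, and the `run` parameter is not modelled. Each individual is a list of 30 bits, where `1` puts the card in pile `1` and `0` puts it in pile `2`.

```python
from random import Random

CARDS = list(range(2, 61, 2))  # 2, 4, ..., 60

def fitness(bits):
    """Absolute difference between the sum of pile 1 and 2 times the sum of pile 2 (lower is better)."""
    pile1 = sum(card for card, bit in zip(CARDS, bits) if bit)
    pile2 = sum(CARDS) - pile1
    return abs(pile1 - 2 * pile2)

def genetic_search(pop_size=100, max_evaluations=5000, mutation_rate=0.1, seed=42):
    rng = Random(seed)
    population = [[rng.randint(0, 1) for _ in CARDS] for _ in range(pop_size)]
    evaluations = pop_size  # budget: one evaluation counted per newly created individual
    best = min(population, key=fitness)
    while evaluations < max_evaluations and fitness(best) > 0:
        next_generation = [best[:]]  # elitism: the best individual always survives
        while len(next_generation) < pop_size:
            # Tournament selection: the better of 3 random individuals becomes a parent
            parent1 = min(rng.sample(population, 3), key=fitness)
            parent2 = min(rng.sample(population, 3), key=fitness)
            cut = rng.randrange(1, len(CARDS))  # one-point crossover
            child = parent1[:cut] + parent2[cut:]
            # Flip each bit with probability mutation_rate
            child = [1 - gene if rng.random() < mutation_rate else gene for gene in child]
            next_generation.append(child)
        population = next_generation
        evaluations += pop_size
        best = min(population, key=fitness)
    return best, fitness(best)

best, diff = genetic_search()
pile1_cards = [card for card, bit in zip(CARDS, best) if bit]
print("Pile 1:", pile1_cards)
print("Difference:", diff)
```

The total of all cards is 2 + 4 + ... + 60 = 930, so a perfect split needs pile `2` to sum to 310 and pile `1` to sum to 620. A result with a difference of 0 therefore cannot be improved, whereas a larger difference only shows the best solution found within the evaluation budget, since a genetic algorithm gives no guarantee of optimality.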
