# Candidate Test 2022 Analysis

This exercise focuses on the candidate tests from two television networks: DR and TV2. Data from both tests have been given on a scale of five responses (-2, -1, 0, 1, 2). Consider normalizing the data or performing similar scaling transformations as needed.

---

There are 6 datasets included in this exercise:

- `alldata.xlsx`: Contains responses from both TV stations.
- `drdata.xlsx`: Contains responses from DR.
- `drq.xlsx`: Contains questions from DR.
- `tv2data.xlsx`: Contains responses from TV2.
- `tv2q.xlsx`: Contains questions from TV2.
- `electeddata.xlsx`: Contains responses from both TV stations for candidates who were elected to the parliament. Note that 9 members are missing; 7 of them didn't take any of the tests. Additionally, some notable figures like Mette F. and Lars Løkke did not participate in any of the tests.

---

It's entirely up to you how you approach this data, but at a *minimum*, your analysis should include:

- PCA or some other dimensionality reduction technique to plot responses in two dimensions, thereby visualizing the "political landscape". The colors of the plotted points should match the political party colors (see below).
- An analysis/description of which questions are most crucial concerning their placement on the axes.
- Average positions of parties concerning each question, preferably with accompanying plots of each (or selected) question.
- Age of the candidates grouped by parties.
- An overview of the most "confident" candidates, i.e., those with the highest proportion of "strongly agree" or "strongly disagree" responses.
- Differences in responses between candidates, both inter-party and intra-party, along with an explanation of which parties have the most internal disagreements.
- Classification models to predict candidates' party affiliations. Investigate if there are any candidates who seem to be in the "wrong" party based on their political landscape positions. You must use the following three algorithms: **Decision Tree, Random Forest, and Gradient Boosted Tree**.
- A clustering analysis where you attempt various cluster numbers, which would correspond to different parties. Discuss whether there is room for more clusters/parties or if a reduction is needed. Make sure you cover: **K-Means, Hierarchical clustering, and DBSCAN.**
- An overview of the political landscape of the elected candidates, highlighting which members agree or disagree the most and which parties or party members have significant disagreements.
- Feel free to explore further and remember that preprocessing, methodology, and evaluation metrics are not mentioned explicitly, but are implicitly assumed.

---

The following parties are represented:

| Party letter | Party name | Party name (English) | Political position |
| :-: | :-: | :-: | :-: |
| A | Socialdemokratiet | Social Democrats | Centre-left |
| V | Venstre | Danish Liberal Party | Centre-right |
| M | Moderaterne | Moderates | Centre-right |
| F | Socialistisk Folkeparti | Socialist People's Party | Left-wing |
| Æ | Danmarksdemokraterne | Denmark Democrats | Right-wing |
| I | Liberal Alliance | Liberal Alliance | Right-wing |
| C | Konservative | Conservative People's Party | Right-wing |
| Ø | Enhedslisten | Red-Green Alliance | Far-left |
| B | Radikale Venstre | Social Liberal Party | Centre-left |
| D | Nye Borgerlige | New Right | Far-right |
| Z | Alternativet | The Alternative | Centre-left |
| O | Dansk Folkeparti | Danish People's Party | Far-right |
| G | Frie Grønne | Free Greens | Centre-left |
| K | Kristendemokraterne | Christian Democrats | Centre-right |

Below you can see the results and the colors chosen to represent the parties. Use these colors in your analysis above.

![Alt text](image-1.png)


Others have undertaken similar analyses. You can draw inspiration from the following (use Google Translate if your Danish is rusty):

- [Analysis of where individual candidates stand relative to each other and their parties](https://v2022.dumdata.dk/)
- [Candidate Test 2022 – A deep dive into the data](https://kwedel.github.io/kandidattest2022/)
- [The Political Landscape 2019](https://kwedel.github.io/kandidattest2019/)



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

In [None]:
candidates_df = pd.read_excel('alldata.xlsx')

display(candidates_df.head(5))
display(candidates_df.shape)

# Scaling may be worthwhile here, and the candidate and political party names will probably need encoding
scaler = StandardScaler()
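
The scaler above is created but not yet applied. A minimal sketch of one way to apply it, assuming the question columns are the numeric ones and the identifier columns are left alone. The small table, its column names `q1`/`q2`, and its values are made up for illustration and are not taken from `alldata.xlsx`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature of the candidate table (names and values are made up)
demo_df = pd.DataFrame({
    "navn": ["A", "B", "C", "D"],
    "parti": ["P1", "P1", "P2", "P2"],
    "q1": [-2, -1, 1, 2],
    "q2": [2, 2, -2, 0],
})

# Scale only the numeric question columns; identifier columns stay untouched
question_cols = demo_df.select_dtypes(include="number").columns
scaled_q = pd.DataFrame(
    StandardScaler().fit_transform(demo_df[question_cols]),
    columns=question_cols,
    index=demo_df.index,
)

# Each question now has zero mean and unit (population) standard deviation
print(scaled_q.round(3))
```

Since every question already shares the same -2 to 2 scale, standardising mainly changes how much weight low-variance questions get, so it is worth comparing results with and without it.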

### Pre-processing

In [None]:
colour_mapping = {
    'Socialdemokratiet': '#C82518',  
    'Venstre': '#01438E',  
    'Moderaterne': '#B48CD2',  
    'Socialistisk Folkeparti': '#eb94d1',  
    'Danmarksdemokraterne': '#FFA500',  
    'Liberal Alliance': '#3FB2BE', 
    'Det Konservative Folkeparti': '#00583C',  
    'Enhedslisten': '#F7660D', 
    'Radikale Venstre': '#FF00FF',  
    'Nye Borgerlige': '#00505B',  
    'Alternativet': '#00FF00',  
    'Dansk Folkeparti': '#FCD03B',  
    'Frie Grønne, Danmarks Nye Venstrefløjsparti': '#d9b99b',  
    'Kristendemokraterne': '#53619B',  
    'Løsgænger': '#808080'  
}

colours = candidates_df['parti'].map(colour_mapping)

### PCA

In [None]:
pca_candidates_df = candidates_df.copy()

pca_candidates_df.drop("navn", axis=1, inplace=True)
pca_candidates_df.drop("storkreds", axis=1, inplace=True)
pca_candidates_df.drop("alder", axis=1, inplace=True)

# Numeric only df for PCA
X = pca_candidates_df.select_dtypes(include=['number'])
display(X.shape)

# Perform PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(X)
display(pca.explained_variance_ratio_)

# Plot the reduced data
plt.figure(figsize=(10, 6))
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=colours, marker='o', label='Candidates')
plt.title('PCA of Candidate Responses')
plt.legend()
plt.grid()
plt.show()



### DBSCAN

In [None]:
#Use reduced data from PCA 
#Use DBSCAN to cluster data

eps_values = [0.5, 1.0, 1.5]  
min_samples_values = [5, 10, 15]  

best_silhouette_score = -1
best_eps, best_min_samples = 0, 0
best_labels = []

for eps in eps_values:
    for min_samples in min_samples_values:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(reduced_data)

        if len(np.unique(labels)) > 1:  # Ensure more than one cluster is formed
            silhouette_avg = silhouette_score(reduced_data, labels)
            if silhouette_avg > best_silhouette_score:
                best_silhouette_score = silhouette_avg
                best_eps = eps
                best_min_samples = min_samples
                best_labels = labels

display(best_eps, best_min_samples )

plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=best_labels, cmap='viridis')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.title('DBSCAN Clustering')
plt.show()
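
DBSCAN labels noise points as `-1`, and passing those labels straight to `silhouette_score` treats the noise as one more cluster. A hedged sketch of an alternative that scores only the clustered points and reports the noise fraction separately. The synthetic points and the `eps`/`min_samples` values are assumptions for illustration, not tuned for the candidate data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D points standing in for the PCA-reduced data
points, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(points)

# Noise points carry the label -1 and do not form a cluster
clustered_mask = labels != -1
n_clusters = len(set(labels[clustered_mask]))
noise_fraction = 1.0 - clustered_mask.mean()

# The silhouette score needs at least two clusters among the retained points
if n_clusters >= 2:
    score = silhouette_score(points[clustered_mask], labels[clustered_mask])
else:
    score = float("nan")

print(n_clusters, round(noise_fraction, 3), score)
```

A parameter setting that scores well only because it discards most points as noise is easy to spot when the noise fraction is printed next to the score.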

### Crucial question analysis

In [None]:
# How much each original feature contributes to the two principal components, printed in descending order.
# Note that "alder" (age) was dropped before fitting the PCA, so it cannot appear among these loadings

# Get the PCA loadings
loadings = pca.components_

# Calculate the absolute values of loadings for each feature
absolute_loadings = np.abs(loadings)

# Create a DataFrame to show the absolute loadings
loading_df = pd.DataFrame(absolute_loadings, columns=X.columns, index=['PC1', 'PC2'])

# Sort the features based on their loadings in PC1
sorted_features_pc1 = loading_df.loc['PC1'].sort_values(ascending=False)
print("Features contributing the most to PC1:")
display(sorted_features_pc1)

# Sort the features based on their loadings in PC2
sorted_features_pc2 = loading_df.loc['PC2'].sort_values(ascending=False)
print("\nFeatures contributing the most to PC2:")
display(sorted_features_pc2)
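
Absolute loadings show which questions matter, but not in which direction they pull a candidate along an axis. A small sketch that ranks by absolute value while keeping the sign. The loading values and the helper name `top_questions` are hypothetical.

```python
import pandas as pd

# Hypothetical loadings for two components over four questions (values are made up)
demo_loadings = pd.DataFrame(
    [[0.7, -0.6, 0.1, 0.2],
     [0.1, 0.3, -0.8, 0.5]],
    index=["PC1", "PC2"],
    columns=["q1", "q2", "q3", "q4"],
)

def top_questions(loadings_df, component, k=2):
    """Return the k questions with the largest absolute loading, keeping the sign."""
    row = loadings_df.loc[component]
    order = row.abs().sort_values(ascending=False).index[:k]
    return row.loc[order]

print(top_questions(demo_loadings, "PC1"))
print(top_questions(demo_loadings, "PC2"))
```

With the real data, the same helper could be applied to a DataFrame built from `pca.components_` without taking `np.abs` first.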

### Average position of parties concerning each question

In [None]:

#Dropping features that are not relevant for assessing the mean for every party in relation to every question
cleansed_candidates = candidates_df.copy()
cleansed_candidates.drop("navn", axis=1, inplace=True)
cleansed_candidates.drop("storkreds", axis=1, inplace=True)
cleansed_candidates.drop("alder", axis=1, inplace=True)

display(cleansed_candidates.shape)
display(candidates_df.shape)

grouped_parties = cleansed_candidates.groupby("parti")

#Calculate the mean for every feature (question) within every group (party)
mean_by_party = grouped_parties.mean()

display(mean_by_party)

# Plot mean scores by party for each question. Individual questions are hard to read off, but each party's overall tendency towards strong or mild answers is visible
mean_by_party.plot(kind='bar', figsize=(10, 6))
plt.title('Mean Scores by Party for Each Question')
plt.xlabel('Party')
plt.ylabel('Mean Score')
plt.legend(title='Question', title_fontsize='12', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
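
Because the full bar chart is crowded, one option is to plot only the questions on which the parties differ most. A minimal sketch, assuming "differ most" means the largest range between party means. The table below is made up and stands in for `mean_by_party`.

```python
import pandas as pd

# Hypothetical party-by-question means (values are made up)
demo_means = pd.DataFrame(
    {"q1": [1.5, -1.5, 0.0], "q2": [0.2, 0.1, 0.3], "q3": [-1.0, 1.0, 1.5]},
    index=["P1", "P2", "P3"],
)

# Range (max - min) of the party means for each question
spread = demo_means.max() - demo_means.min()

# Keep the two questions with the widest spread between parties
most_divisive = spread.sort_values(ascending=False).head(2).index.tolist()
selected = demo_means[most_divisive]

print(most_divisive)
print(selected)
```

The `selected` frame can then be passed to `.plot(kind='bar')` in the same way as the full table.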

### Age of candidates grouped by parties

In [None]:
# Same as the previous cell, but keeping the "alder" attribute

cleansed_candidates = candidates_df.copy()
cleansed_candidates.drop("navn", axis=1, inplace=True)
cleansed_candidates.drop("storkreds", axis=1, inplace=True)

grouped_parties = cleansed_candidates.groupby("parti")

means_by_party = grouped_parties.mean()

# Select the "alder" (age) column from the means_by_party DataFrame
alder_means = means_by_party['alder']

# Create a bar plot of mean age by party
alder_means.plot(kind='bar', figsize=(10, 6))
plt.title('Mean Age by Party')
plt.xlabel('Party')
plt.ylabel('Mean Age')
plt.show()

### Most confident candidates in terms of their answers to questions

In [None]:
# Count the +2 and -2 answers for each candidate, looking at the question columns only
confidence_df = candidates_df.copy()
question_cols = candidates_df.drop(columns=['navn', 'parti', 'storkreds', 'alder']).columns

confidence_df['+2_count'] = (candidates_df[question_cols] == 2).sum(axis=1)
confidence_df['-2_count'] = (candidates_df[question_cols] == -2).sum(axis=1)

confidence_df['ConfidentVotes'] = confidence_df['+2_count'] + confidence_df['-2_count']

sorted_confidence_df = confidence_df[['ConfidentVotes', 'navn']].sort_values(by='ConfidentVotes', ascending=False)
print(sorted_confidence_df)

# Extract the top 10 rows
top_10 = sorted_confidence_df.head(10)

# Create a bar plot for the 'ConfidentVotes' column
plt.figure(figsize=(10, 6))
plt.barh(top_10['navn'], top_10['ConfidentVotes'], color='skyblue')
plt.xlabel('Confident Votes')
plt.ylabel('Candidate')
plt.title('Top 10 Candidates with the Most Confident Votes')
plt.gca().invert_yaxis()  # Invert the y-axis for better visualization
plt.show()
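
The task describes confidence as the highest proportion of strong answers rather than a raw count. A short sketch of the proportion version on made-up data. The column names `q1` to `q4` and the candidate names are hypothetical.

```python
import pandas as pd

# Hypothetical answers for three candidates (values are made up)
demo = pd.DataFrame({
    "navn": ["A", "B", "C"],
    "q1": [2, 1, -2],
    "q2": [-2, 0, -2],
    "q3": [1, -1, 2],
    "q4": [2, 0, 2],
})
question_cols = ["q1", "q2", "q3", "q4"]

# True where the answer is "strongly agree" (+2) or "strongly disagree" (-2)
strong = demo[question_cols].abs() == 2

# Share of strong answers per candidate
demo["confident_share"] = strong.mean(axis=1)

ranking = demo.sort_values("confident_share", ascending=False)[["navn", "confident_share"]]
print(ranking)
```

When every candidate answers the same number of questions, the ranking by share matches the ranking by count. The share is easier to compare across tests with different numbers of questions.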


### Intra-party disagreements

In [None]:
# 1. Calculate Intra-Party Differences
intra_disagreements = candidates_df.copy()
intra_disagreements.drop("navn", axis=1, inplace=True)
intra_disagreements.drop("storkreds", axis=1, inplace=True)
intra_disagreements.drop("alder", axis=1, inplace=True)
display(intra_disagreements.shape)
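# For each candidate, the mean absolute deviation of their answers from their own party's mean answer per question
question_cols = list(intra_disagreements.columns.drop('parti'))
party_means = intra_disagreements.groupby('parti')[question_cols].transform('mean')
intra_party_diff = (intra_disagreements[question_cols] - party_means).abs().mean(axis=1)
intra_party_diff.index = pd.MultiIndex.from_arrays([intra_disagreements['parti'], candidates_df['navn']], names=['parti', 'navn'])

# Sort by party, then by deviation in descending order within each party
sorted_intra_party_diff = intra_party_diff.reset_index(name='mean_abs_dev').sort_values(['parti', 'mean_abs_dev'], ascending=[True, False])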

# Display the sorted DataFrame with scrollable output
with pd.option_context('display.max_rows', None):
    print(sorted_intra_party_diff)

#Perhaps not the most elegant representation, but it does display which candidates in each party are most disagreeing with the remaining ones in a descending order
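
The task also asks which parties have the most internal disagreement. One possible party-level summary, sketched on made-up data: take the standard deviation of the answers to each question within a party, then average over the questions. This choice of measure is an assumption, not the only reasonable one.

```python
import pandas as pd

# Hypothetical answers: P1 is unanimous, P2 is split (values are made up)
demo = pd.DataFrame({
    "parti": ["P1", "P1", "P1", "P2", "P2", "P2"],
    "q1": [2, 2, 2, -2, 0, 2],
    "q2": [1, 1, 1, 2, -2, 0],
})

# Population standard deviation of the answers per question within each party
per_question_std = demo.groupby("parti").std(ddof=0)

# Average over questions: one disagreement number per party, largest first
party_disagreement = per_question_std.mean(axis=1).sort_values(ascending=False)
print(party_disagreement)
```

Parties with very few candidates give unstable estimates, so it helps to report the group size alongside the figure.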

### Inter-party disagreements

In [None]:
inter_disagreements = candidates_df.copy()
inter_disagreements.drop("navn", axis=1, inplace=True)
inter_disagreements.drop("storkreds", axis=1, inplace=True)
inter_disagreements.drop("alder", axis=1, inplace=True)
inter_disagreements.drop("parti", axis=1, inplace=True)
display(inter_disagreements.shape)

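# For each candidate, the mean absolute deviation of their answers from the mean answer over all candidates
inter_party_diff = (inter_disagreements - inter_disagreements.mean()).abs().mean(axis=1)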

sorted_inter_party_diff = inter_party_diff.sort_values(ascending=False)

#Descending list of all candidates based on their disagreement mean in relation to all other candidates
with pd.option_context('display.max_rows', None):
    print(sorted_inter_party_diff)
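
To compare parties with each other rather than individual candidates, one option is the distance between party mean answer vectors. A minimal sketch using Euclidean distance on made-up data. Both the data and the choice of metric are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical answers for three parties (values are made up)
demo = pd.DataFrame({
    "parti": ["P1", "P1", "P2", "P2", "P3", "P3"],
    "q1": [2, 2, -2, -2, 2, 1],
    "q2": [1, 2, -1, -2, 1, 1],
})

# One mean answer vector per party
party_means = demo.groupby("parti").mean()

# Pairwise Euclidean distances between the party mean vectors
diff = party_means.values[:, None, :] - party_means.values[None, :, :]
distances = pd.DataFrame(
    np.sqrt((diff ** 2).sum(axis=2)),
    index=party_means.index,
    columns=party_means.index,
)
print(distances.round(3))
```

Small values indicate parties whose average answers are close, and large values indicate parties far apart in the questionnaire space.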

### Candidate affiliation prediction

In [None]:
cleansed_df = candidates_df.copy()
cleansed_df.drop("navn", axis=1, inplace=True)
cleansed_df.drop("storkreds", axis=1, inplace=True)

X = cleansed_df.drop('parti', axis=1)
y = cleansed_df['parti']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model building
# Train Decision Tree model
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)

# Train Random Forest model
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

# Train Gradient Boosted Tree model
gbt_model = GradientBoostingClassifier()
gbt_model.fit(X_train, y_train)

# Model evaluation
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    return accuracy, report

dt_accuracy, dt_report = evaluate_model(dt_model, X_test, y_test)
rf_accuracy, rf_report = evaluate_model(rf_model, X_test, y_test)
gb_accuracy, gb_report = evaluate_model(gbt_model, X_test, y_test)

# Print model evaluation results
print("Decision Tree Accuracy:\n", dt_accuracy)
print("Decision Tree Classification Report:\n", dt_report)
print("Random Forest Accuracy:\n", rf_accuracy)
print("Random Forest Classification Report:\n", rf_report)
print("Gradient Boosted Tree Accuracy:\n", gb_accuracy)
print("Gradient Boosted Tree Classification Report:\n", gb_report)

### Identifying switchovers

In [None]:
# Generate predictions
predicted_party = rf_model.predict(X)

# Calculate prediction probabilities for each party
party_probabilities = rf_model.predict_proba(X)
display(party_probabilities.shape)
display(party_probabilities)

# Minimum predicted probability for the new party before a candidate counts as a switchover (adjust as needed)
switch_threshold = 0.7

# Identify candidates for switching parties
switch_candidates = []

for i in range(len(X)):
    if predicted_party[i] != y.iloc[i]:  # Check if predicted party is different from the actual party
        new_party_prob = np.max(party_probabilities[i])  # Highest class probability, i.e. that of the predicted party
        if new_party_prob > switch_threshold:
            switch_candidates.append((i, predicted_party[i]))

# 'switch_candidates' contains the index and predicted new party for candidates who may want to switch.

# Print the results or further analyze them
for index, new_party in switch_candidates:
    print(f"Candidate {candidates_df.loc[index, 'navn']} should consider switching to Party {new_party}")
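
Predicting on the same rows the forest was trained on tends to reproduce the training labels, which can hide candidates who sit closer to another party. A hedged sketch of an alternative using out-of-fold predictions from `cross_val_predict`. The synthetic data stands in for the answers and party labels, and the 0.7 threshold mirrors the one above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the answers (features) and party labels (0, 1, 2)
features, parties = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Each row's probabilities come from a model that never saw that row in training
oof_proba = cross_val_predict(model, features, parties, cv=cv, method="predict_proba")

# Labels here are the integers 0..2, so the column index equals the class label
oof_pred = oof_proba.argmax(axis=1)
oof_conf = oof_proba.max(axis=1)

threshold = 0.7  # hypothetical cut-off
flagged = np.where((oof_pred != parties) & (oof_conf > threshold))[0]
print(len(flagged), "rows confidently predicted as a different class")
```

With string party labels, the column index would need to be mapped back through the sorted unique labels, for example `np.unique(y)[oof_pred]`. Stratified folds also require each party to have at least as many candidates as there are folds.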

### Clustering analysis

In [None]:
# K-Means Clustering
inertia_values = []
silhouette_scores = []
k_values = range(2, 15)  # Try different numbers of clusters

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=0)
    kmeans.fit(reduced_data)
    inertia_values.append(kmeans.inertia_)
    if k > 1:  # Silhouette score is not defined for a single cluster
        silhouette_scores.append(silhouette_score(reduced_data, kmeans.labels_))

# Visualize K-Means results
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(k_values, inertia_values, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('K-Means Elbow Method')

plt.subplot(1, 2, 2)
plt.plot(k_values[0:], silhouette_scores, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('K-Means Silhouette Score')

plt.tight_layout()
plt.show()

# The sweet spot for K-Means appears to be around 8 clusters: look for a high silhouette score together with the elbow in the inertia curve, since inertia keeps decreasing as K grows

# Hierarchical Clustering
linked = linkage(reduced_data, method='ward')  # Try different linkage methods

# Visualize dendrogram
plt.figure(figsize=(10, 6))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.show()
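
The dendrogram shows the merge structure but does not assign candidates to clusters. A small sketch of cutting the tree into a chosen number of flat clusters with SciPy's `fcluster`. The synthetic blobs and the choice of three clusters are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs standing in for the PCA-reduced data
points = np.vstack([
    rng.normal(loc=(-5, 0), scale=0.3, size=(20, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(20, 2)),
    rng.normal(loc=(5, 0), scale=0.3, size=(20, 2)),
])

linked_demo = linkage(points, method="ward")

# Cut the tree so that at most 3 flat clusters remain (labels start at 1)
flat_labels = fcluster(linked_demo, t=3, criterion="maxclust")

# Cluster sizes, skipping the unused label 0
print(np.bincount(flat_labels)[1:])
```

The resulting labels can be compared against the K-Means and DBSCAN assignments, or scored with `silhouette_score` for several values of `t`.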