# Candidate Test 2022 Analysis Part 1

This exercise focuses on the candidate tests from two television networks: DR and TV2. Data from both tests have been given on a scale of five responses (-2, -1, 0, 1, 2).

---

There are 6 datasets included in this exercise:

- `alldata.xlsx`: Contains responses from both TV stations.
- `drdata.xlsx`: Contains responses from DR.
- `drq.xlsx`: Contains questions from DR.
- `tv2data.xlsx`: Contains responses from TV2.
- `tv2q.xlsx`: Contains questions from TV2.
- `electeddata.xlsx`: Contains responses from both TV stations for candidates who were elected to the parliament. Note that 9 members are missing; 7 of them didn't take any of the tests. Additionally, some notable figures like Mette F. and Lars Løkke did not participate in any of the tests.

---

It's entirely up to you how you approach this data, but at a *minimum*, your analysis should include:
- Age of the candidates grouped by parties.
- An overview of the most "confident" candidates, i.e., those with the highest proportion of "strongly agree" or "strongly disagree" responses.
- Differences in responses between candidates, both interparty and intra-party, along with an explanation of which parties have the most internal disagreements.
- Classification models to predict candidates' party affiliations. Investigate if there are any candidates who seem to be in the "wrong" party based on their political landscape positions. You must use the following three algorithms: **Decision Tree, Random Forest, and Gradient Boosted Tree**, and **two other** classification algorithms of your choice, i.e. a total of 5 models are to be trained.
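
As a rough illustration of the "confidence" and internal-disagreement requirements, the sketch below computes both measures on made-up answers. The function names `extreme_share` and `party_disagreement`, the toy columns `q1` and `q2`, and the party labels `X` and `Y` are assumptions introduced for this example; they are not part of the datasets.

```python
import pandas as pd

def extreme_share(answers: pd.DataFrame) -> pd.Series:
    """Share of each candidate's answers that are -2 or 2."""
    return (answers.abs() == 2).mean(axis=1)

def party_disagreement(answers: pd.DataFrame, party: pd.Series) -> pd.Series:
    """Mean per-question standard deviation of answers within each party."""
    return answers.groupby(party).std().mean(axis=1)

# Hypothetical answers on the -2..2 scale, two candidates per party
toy_answers = pd.DataFrame(
    {"q1": [2, 2, -2, 0], "q2": [-2, -2, 1, -1]},
    index=["cand_a", "cand_b", "cand_c", "cand_d"],
)
toy_party = pd.Series(["X", "X", "Y", "Y"], index=toy_answers.index)

shares = extreme_share(toy_answers)
disagreement = party_disagreement(toy_answers, toy_party)
print(shares)
print(disagreement)
```

A higher share means a candidate picks the end points of the scale more often, and a higher mean standard deviation means the members of a party answer less alike.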

---

The following parties are represented:

| Party letter |       Party name        |    Party name (English)     | Political position |
|:------------:|:-----------------------:|:---------------------------:|:------------------:|
|      A       |    Socialdemokratiet    |      Social Democrats       |    Centre-left     |
|      V       |         Venstre         |    Danish Liberal Party     |    Centre-right    |
|      M       |       Moderaterne       |          Moderates          |    Centre-right    |
|      F       | Socialistisk Folkeparti |  Socialist People's Party   |     Left-wing      |
|      Æ       |  Danmarksdemokraterne   |      Denmark Democrats      |     Right-wing     |
|      I       |    Liberal Alliance     |      Liberal Alliance       |     Right-wing     |
|      C       |      Konservative       | Conservative People's Party |     Right-wing     |
|      Ø       |      Enhedslisten       |     Red-Green Alliance      |      Far-left      |
|      B       |    Radikale Venstre     |    Social Liberal Party     |    Centre-left     |
|      D       |     Nye Borgerlige      |          New Right          |     Far-right      |
|      Å       |      Alternativet       |       The Alternative       |    Centre-left     |
|      O       |    Dansk Folkeparti     |    Danish People's Party    |     Far-right      |
|      Q       |       Frie Grønne       |         Free Greens         |    Centre-left     |
|      K       |   Kristendemokraterne   |     Christian Democrats     |    Centre-right    |

Below you can see the results and the colors chosen to represent the parties. Use these colors in your analysis.

![Party colors](image-1.png)

Others have undertaken similar analyses. You can draw inspiration from the following (use Google Translate if your Danish is rusty):

- [Analysis of where individual candidates stand relative to each other and their parties](https://v2022.dumdata.dk/)
- [Candidate Test 2022 – A deep dive into the data](https://kwedel.github.io/kandidattest2022/)
- [The Political Landscape 2019](https://kwedel.github.io/kandidattest2019/)


# Candidate Data

The data we were given has already been partly prepared. The DR and TV2 files
contain all questions and responses, but because TV2 also asked regional
questions, those have already been filtered out of `alldata.xlsx`. This saves us
some cleaning work, which matters because good data is crucial for any analysis.

Based on the image above, we have added a hexadecimal color code for each party
to the party reference table. The codes were sampled from the image with the
eyedropper tool of an image editor.

Additionally, our analysis showed that some candidates are independents
(Danish: *løsgænger*). These candidates are not associated with any party, so
they are grouped into a "fake" party for the sake of analysis. They do not
appear in the image above, so they have been assigned the color white.
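
A minimal sketch of how such a reference table can be turned into color lookups, including the white entry for independents. The column names `parti`, `bogstav` and `farve` and the `UFG` shorthand follow the names used in the code cells of this notebook; the hex codes for the two real parties are placeholders for illustration, not the values sampled from the image.

```python
import pandas as pd

# Placeholder rows: the party hex codes are illustrative, not sampled from the image
party_table = pd.DataFrame({
    "parti": ["Socialdemokratiet", "Venstre", "Løsgænger"],
    "bogstav": ["A", "V", "UFG"],
    "farve": ["#aa0000", "#0000aa", "#ffffff"],  # independents are shown in white
})

# One lookup keyed by full party name and one keyed by party letter
color_by_party = party_table.set_index("parti")["farve"].to_dict()
color_by_letter = party_table.set_index("bogstav")["farve"].to_dict()
print(color_by_party)
print(color_by_letter)
```

Keeping both lookups makes it possible to pass a `palette` to plots whether they are grouped by party name or by party letter.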

For consistency with the original dataset, the column headers of this reference
table are written in Danish.

## Preparation for the Analysis
In order to analyse the data, the first step is to load everything so that we can access it through Python.
We will start by loading the necessary libraries and then load the data.

In [None]:
# Importing the necessary modules
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Import the reference data for parties
party_information = pd.read_csv('partyInformation.csv', header=0)

# Create a color key for parties based on the letter
color_key = party_information[['parti', 'farve']].set_index('parti').to_dict()['farve']
color_key_shorthand = party_information[['bogstav', 'farve']].set_index('bogstav').to_dict()['farve']
# partyInformation

In [None]:
# Reading all the data for candidates and their responses
data = pd.read_excel('alldata.xlsx', header=0)
# data

## Data Cleanup
The data cleanup consists of the following steps:
- Correct party names to match the reference table. Some parties have longer official names that are not in the reference table.
    - Since most of the data already matches the reference table, we only need to correct the few names that don't.
- Check for duplicate candidate names.
- Remove candidates whose age is set to 0.
    - An age of 0 is obviously nonsensical and likely a placeholder for missing data, so we remove these candidates for simplicity.
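
The steps above could also be wrapped in a small reusable function. The following is only a sketch on a made-up table: the function name `clean_candidates` and the toy rows are assumptions for illustration, while the column names `navn`, `parti` and `alder` match the ones used in this notebook.

```python
import pandas as pd

def clean_candidates(df: pd.DataFrame, name_fixes: dict) -> pd.DataFrame:
    """Normalise party names and drop rows with a placeholder age of 0."""
    cleaned = df.copy()
    cleaned["parti"] = cleaned["parti"].replace(name_fixes)
    return cleaned[cleaned["alder"] > 0].reset_index(drop=True)

# Hypothetical candidates: one long party name, one age of 0, one repeated name
toy = pd.DataFrame({
    "navn": ["Anna", "Bo", "Anna"],
    "parti": ["Det Konservative Folkeparti", "Venstre", "Venstre"],
    "alder": [41, 0, 35],
})

cleaned = clean_candidates(toy, {"Det Konservative Folkeparti": "Konservative"})
has_duplicate_names = bool(cleaned["navn"].duplicated().any())
print(cleaned)
print("Duplicate names:", has_duplicate_names)
```

Working on a copy leaves the loaded data untouched, so the cleanup can be rerun with a different mapping without reloading the file.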

In [None]:
# Data Cleanup
# Map the longer official party names to the names used in the reference table
alternative_party_names = {
    'Frie Grønne, Danmarks Nye Venstrefløjsparti': 'Frie Grønne',
    'Det Konservative Folkeparti': 'Konservative',
}
data['parti'] = data['parti'].replace(alternative_party_names)

# Verify duplicate candidates
if data['navn'].duplicated().sum() > 0:
    print("Duplicate candidate names detected.")
else:
    print("No duplicate candidate names detected.")

# Remove candidates whose age is 0, which is treated as a placeholder for missing data
data = data[data['alder'] > 0]

In [None]:
# Extract info from every candidate and attach the party information for easy overview
# Also detaches the questions from the candidate data for easier analysis of the candidates themselves
candidateInfo = data.merge(party_information, on='parti', how='left')[['bogstav', 'navn', 'alder', 'storkreds', 'parti', 'holdning']]
# candidateInfo

In [None]:
# To sort all future charts by party, we can create a plot order based on the letters assigned to each party, UFG is removed and added at the end for unaligned candidates to show up last
plot_order = party_information['bogstav']
plot_order = plot_order[plot_order != 'UFG']
plot_order = plot_order.sort_values(ascending=True)
plot_order = pd.concat([plot_order, pd.Series('UFG')], ignore_index=True)
plot_order = plot_order.to_frame('bogstav').merge(party_information[['parti', 'bogstav']], on='bogstav', how='left')
# plot_order

alignment_order = {
    'Far-left': -3,
    'Left-wing': -2,
    'Centre-left': -1,
    'Non-Aligned': 0,
    'Centre-right': 1,
    'Right-wing': 2,
    'Far-right': 3
}

In [None]:
average_age = candidateInfo['alder'].mean()

average_age_by_party = candidateInfo.groupby('bogstav')['alder'].mean(numeric_only=True).sort_values(ascending=False)
average_age_by_party = average_age_by_party.to_frame()
average_age_by_alignment = candidateInfo.groupby('holdning')['alder'].mean(numeric_only=True).reindex(alignment_order.keys())

# We group all left and right parties together for a more general overview
average_age_by_leaning = candidateInfo[candidateInfo['holdning'] != 'Non-Aligned'].replace({'holdning': {
    'Far-left': 'Left-leaning',
    'Left-wing': 'Left-leaning',
    'Centre-left': 'Left-leaning',
    'Centre-right': 'Right-leaning',
    'Right-wing': 'Right-leaning',
    'Far-right': 'Right-leaning'
}}).groupby('holdning')['alder'].mean(numeric_only=True).sort_values(ascending=False)

print(f"The average age of candidates is {average_age:.2f} years.")

ages_classifications, ages_axis = plt.subplots(ncols=3, figsize=(18, 8))

sns.barplot(x='alder', y='bogstav', data=average_age_by_party, hue='bogstav', palette=color_key_shorthand, order=plot_order['bogstav'], ax=ages_axis[0])
sns.barplot(x='holdning', y='alder', data=average_age_by_alignment.to_frame(), hue='holdning', palette='blend:#f00,#00f', ax=ages_axis[1])
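# Map the colors explicitly so that left is red and right is blue, matching the red-to-blue blend of the middle chart
sns.barplot(x='holdning', y='alder', data=average_age_by_leaning.to_frame(), order=['Left-leaning', 'Right-leaning'], hue='holdning', palette={'Left-leaning': '#ff0000', 'Right-leaning': '#0000ff'}, ax=ages_axis[2])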

ages_axis[0].set_title("Average Age by Party")
ages_axis[0].set_xlabel("Average Age")
ages_axis[0].set_ylabel("Party")

ages_axis[1].set_title("Average Age by Political Orientation")
ages_axis[1].set_ylabel("Average Age")
ages_axis[1].set_xlabel("Political Orientation")
ages_axis[1].tick_params(axis='x', rotation=45)

ages_axis[2].set_title("Average Age by Leaning")
ages_axis[2].set_ylabel("Average Age")
ages_axis[2].set_xlabel("Leaning")

os.makedirs('analysis-images/age', exist_ok=True)
plt.tight_layout()
plt.savefig('analysis-images/age/average_age.png')
plt.show()

In [None]:
# Create a violin plot of the age distribution of candidates by party
age_by_party, age_party_axis = plt.subplots(figsize=(16, 9))
sns.violinplot(x='alder', y='bogstav', hue='parti', order=plot_order['bogstav'], data=candidateInfo, palette=color_key, saturation=1, ax=age_party_axis)

# Customize the display labels
plt.ylabel("Party")
plt.xlabel("Age")
plt.xticks(range(5, 95, 5))
plt.title("Violin Plot of Candidate Age by Party")

# Remove the redundant legend, then save and show the plot
plt.legend().remove()
os.makedirs('analysis-images/age', exist_ok=True)
plt.savefig('analysis-images/age/party.png')
plt.show()

In [None]:
# Create a box plot of the age distribution of candidates by political orientation
alignments_ordered = ['Far-left', 'Left-wing', 'Centre-left', 'Centre-right', 'Right-wing', 'Far-right']
age_by_alignment, alignment_axis = plt.subplots(figsize=(8, 6))
sns.boxplot(x='holdning', y='alder', hue='holdning', order=alignments_ordered, data=candidateInfo, saturation=1, ax=alignment_axis)

# Customize the display labels
plt.ylabel("Age")
plt.yticks(range(15, 85, 5))
plt.xlabel("Political Orientation")
plt.title("Box Plot of Candidate Age by Political Orientation")

# Save and show the plot
os.makedirs('analysis-images/age', exist_ok=True)
plt.savefig('analysis-images/age/alignment.png')
plt.show()

## Analysis of Question Data

In [None]:
# The following function produces a histogram for every column in a given dataset for the first n(dataset_length) rows
def create_histograms(dataset, dataset_length, name='all'):
    numeric_data = dataset.select_dtypes(exclude=['object'])
    row_amount = 10
    col_amount = 5
    questions_graph, question_axes = plt.subplots(nrows=row_amount, ncols=col_amount, figsize=(16, 16))
    
    for i in range(dataset_length):
        subplot = question_axes[i // col_amount, i % col_amount]
        sns.histplot(numeric_data.iloc[:, i], ax=subplot)
        subplot.set_title(numeric_data.columns[i])

    plt.tight_layout()
    os.makedirs('analysis-images/questions', exist_ok=True)
    plt.savefig('analysis-images/questions/histo_' + name + '.png')
    plt.show()

In [None]:
# Count the question columns: every numeric column except the candidate's age
question_amount = len(data.select_dtypes(exclude=['object']).columns) - 1 # Subtract 1 to remove the candidate age

create_histograms(data, question_amount)

In [None]:
# Create a series of histograms of the answers given by candidates of a single party (here party letter A)
selected_party = party_information[party_information['bogstav'] == 'A']['parti'].values[0]

create_histograms(data[data['parti'] == selected_party], question_amount, selected_party)

In [None]:
# The following produces the answer "mode" for every question in the dataset
# That is, a candidate answer that appears most frequently for each party in every question
mode_by_party = data.drop(columns=['navn', 'storkreds', 'alder']).groupby('parti').agg(lambda x: x.mode().iloc[0])
mode_by_party

In [None]:
# To get accurate maximum and minimums from parties, we drop the unaligned candidates
affiliated_data = data[data['parti'] != 'Løsgænger']

deviation_by_question = affiliated_data.select_dtypes(exclude='object').drop(columns='alder').std(numeric_only=True)
deviation_by_question_bottom = round(deviation_by_question.min() - 0.1, 1)
deviation_by_question_top = round(deviation_by_question.max() + 0.1, 1)
deviation_by_question_adjusted = deviation_by_question - deviation_by_question_bottom

question_deviation_plot, question_deviation_axis = plt.subplots(figsize=(18, 5))
sns.barplot(x=deviation_by_question.index, y=deviation_by_question_adjusted, ax=question_deviation_axis, bottom=deviation_by_question_bottom)

plt.xlabel('Question ID')
plt.xticks(rotation=90)
plt.ylabel('Standard Deviation')
plt.yticks(np.arange(start=deviation_by_question_bottom, stop=deviation_by_question.max() + 0.1, step=0.1))
plt.title('Standard Deviation of Answers by Question')
os.makedirs('analysis-images/questions', exist_ok=True)
plt.savefig('analysis-images/questions/answers_deviation.png')
plt.show()

In [None]:
# Find the Standard Deviation of answers for each party
deviation_by_party = affiliated_data.drop(columns='alder').groupby('parti').std(numeric_only=True)

# Find the question with the highest Standard Deviation for each party
max_deviation_question = deviation_by_party.idxmax(axis=1)
max_deviation_values = deviation_by_party.max(axis=1)
mean_deviation_by_party = deviation_by_party.mean(axis=1)

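# f-strings are used because question IDs may be read from Excel as integers
print(f"Question that is most often the most divisive within a party: {max_deviation_question.mode()[0]}")
print(f"Party with the highest standard deviation on a single question: {max_deviation_values.idxmax()}")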

# Combine the results into a DataFrame
parti_disagreement = pd.DataFrame({
    'max_std_id': max_deviation_question,
    'max_std_value': max_deviation_values,
    'mean_std': mean_deviation_by_party
}).sort_values('mean_std', ascending=False)

In [None]:
# Create a bar plot of the standard deviation of answers by party
question_deviation_plot, question_deviation_axis = plt.subplots(figsize=(16, 9))

sns.barplot(data=parti_disagreement, x=parti_disagreement.index, y='mean_std', order=plot_order['parti'], hue=parti_disagreement.index, palette=color_key, ax=question_deviation_axis)

plt.xlabel('Party')
plt.xticks(rotation=45)
plt.ylabel('Standard Deviation')
plt.title('Standard Deviation of Answers by Party')

os.makedirs('analysis-images/questions', exist_ok=True)
plt.savefig('analysis-images/questions/party_deviation.png')
plt.show()

In [None]:
# The confidence here is defined as the number of "extreme" answers (-2 or 2) given by a candidate
data = data.assign(certainty=data.drop(columns='alder').select_dtypes(exclude='object').apply(lambda row: row[np.abs(row) == 2].size, axis=1))

data[['navn', 'parti', 'certainty']]

In [None]:
# Boxplot of the confidence of candidates by party
confidence_by_party, confidence_party_axis = plt.subplots(figsize=(16, 9))

sns.boxplot(data=data, x='certainty', y='parti', order=plot_order['parti'], hue='parti', ax=confidence_party_axis, palette=color_key, saturation=1)

plt.xlabel('Certainty')
plt.ylabel('Party')
plt.title('Certainty of Candidates by Party')
os.makedirs('analysis-images/certainty', exist_ok=True)
plt.savefig('analysis-images/certainty/party_boxplot.png')
plt.show()

In [None]:
# Find the most confident parties
party_confidence = data.groupby('parti').mean(numeric_only=True)

# Turn the average certainty into a percentage of the number of questions
# The numeric columns also include the age and the certainty count itself, so both are dropped before counting
question_count = data.select_dtypes(exclude='object').drop(columns=['alder', 'certainty']).columns.size
party_confidence['certainty_percent'] = (party_confidence['certainty'] / question_count) * 100

party_confidence_p, party_confidence_axis = plt.subplots(figsize=(16, 6))
sns.barplot(data=party_confidence, x='parti', y='certainty_percent', hue='parti', order=plot_order['parti'], palette=color_key, ax=party_confidence_axis)

plt.xlabel('Party')
plt.xticks(rotation=45)
plt.ylabel('Average Certainty (%)')
plt.title('Average Certainty by Party')
os.makedirs('analysis-images/certainty', exist_ok=True)
plt.savefig('analysis-images/certainty/party_barplot.png')
plt.show()

In [None]:
# Train a classification model to predict party affiliation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Prepare the data for training
x = data.drop(columns=['navn', 'parti', 'storkreds', 'certainty'])
y = data['parti']

# Split the data into training and testing sets
# 30% of the candidates are held out for testing; split sizes of 0.2, 0.4 and 0.5 were also compared
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# Standardize the data
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# Train the models
models = {
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'Support Vector Machine': SVC(),
    'Logistic Regression': LogisticRegression(max_iter=1000)
}

model_accuracy = {
    'model': list(models),
    'accuracy': []
}

# Fit every model on the training set and record its accuracy on the test set
for name, model in models.items():
    model.fit(x_train, y_train)
    model_accuracy['accuracy'].append(model.score(x_test, y_test))

model_accuracy = pd.DataFrame(model_accuracy)
model_accuracy

In [None]:
# Find the candidates that are the most misaligned with their party
# Predict on all candidates in their original row order so that names, predictions and actual parties line up
# (the train/test splits are shuffled, so concatenating them would not match the order of data['navn'])
full_x = scaler.transform(x)

party_predictions = pd.DataFrame(index=data.index, data={'name': data['navn']})
party_predictions['prediction'] = models['Support Vector Machine'].predict(full_x)
party_predictions['actual'] = y
party_predictions['is_misaligned'] = party_predictions['prediction'] != party_predictions['actual']

misaligned_candidates = party_predictions[party_predictions['is_misaligned']]
misaligned_candidates = misaligned_candidates.drop(columns='is_misaligned')

# Translate each party into its numeric position on the political spectrum
party_orientation = party_information.set_index('parti')['holdning'].map(alignment_order)
misaligned_candidates['orientation_pred'] = misaligned_candidates['prediction'].map(party_orientation)
misaligned_candidates['orientation_actual'] = misaligned_candidates['actual'].map(party_orientation)
misaligned_candidates['alignment_difference'] = misaligned_candidates['orientation_actual'] - misaligned_candidates['orientation_pred']
misaligned_candidates['alignment_change'] = np.abs(misaligned_candidates['alignment_difference'])

# Remove candidates whose predicted party has the same political orientation as their actual party
misaligned_candidates = misaligned_candidates[misaligned_candidates['alignment_difference'] != 0]

misaligned_candidates.sort_values('alignment_change', ascending=False)