# Candidate Test 2022 Analysis

This exercise focuses on the candidate tests from two television networks: DR and TV2. Responses in both tests are coded on a five-point scale (-2, -1, 0, 1, 2). Consider normalizing the data or applying a similar scaling transformation as needed.

---

There are 6 datasets included in this exercise:

- `alldata.xlsx`: Contains responses from both TV stations.
- `drdata.xlsx`: Contains responses from DR.
- `drq.xlsx`: Contains questions from DR.
- `tv2data.xlsx`: Contains responses from TV2.
- `tv2q.xlsx`: Contains questions from TV2.
- `electeddata.xlsx`: Contains responses from both TV stations for candidates who were elected to the parliament. Note that 9 members are missing; 7 of them didn't take any of the tests. Additionally, some notable figures like Mette F. and Lars Løkke did not participate in any of the tests.

---

It's entirely up to you how you approach this data, but at a *minimum*, your analysis should include:

- PCA or some other dimensionality reduction technique to plot responses in two dimensions, thereby visualizing the "political landscape". The colors of the plotted points should match the political party colors (see below).
- An analysis/description of which questions matter most for the candidates' placement on the axes.
- Average positions of parties concerning each question, preferably with accompanying plots of each (or selected) question.
- Age of the candidates grouped by parties.
- An overview of the most "confident" candidates, i.e., those with the highest proportion of "strongly agree" or "strongly disagree" responses.
- Differences in responses between candidates, both inter-party and intra-party, along with an explanation of which parties have the most internal disagreements.
- Classification models to predict candidates' party affiliations. Investigate if there are any candidates who seem to be in the "wrong" party based on their political landscape positions. You must use the following three algorithms: **Decision Tree, Random Forest, and Gradient Boosted Tree**.
- A clustering analysis where you attempt various cluster numbers, which would correspond to different parties. Discuss whether there is room for more clusters/parties or if a reduction is needed. Make sure you cover: **K-Means, Hierarchical clustering, and DBSCAN.**
- An overview of the political landscape of the elected candidates, highlighting which members agree or disagree the most and which parties or party members have significant disagreements.
- Feel free to explore further and remember that preprocessing, methodology, and evaluation metrics are not mentioned explicitly, but are implicitly assumed.
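
The "confident candidates" item in the list above can be read as the share of a candidate's answers that sit at either end of the scale (-2 or 2). A minimal sketch, assuming the answers are stored one column per question; the frame `toy` and its values are made up for illustration, and only the column name `navn` comes from the data:

```python
import pandas as pd

# Hypothetical toy data: three candidates, four questions on the -2..2 scale
toy = pd.DataFrame(
    {
        "navn": ["Anna", "Bo", "Carl"],
        "q1": [2, 1, -2],
        "q2": [-2, 0, 2],
        "q3": [2, -1, 1],
        "q4": [-2, 1, 2],
    }
).set_index("navn")

# Share of answers at either extreme of the scale, per candidate
confidence = (toy.abs() == 2).mean(axis=1).sort_values(ascending=False)
print(confidence)
```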

---

The following parties are represented:

| Party letter | Party name | Party name (English) | Political position |
| :-: | :-: | :-: | :-: |
| A | Socialdemokratiet | Social Democrats | Centre-left |
| V | Venstre | Danish Liberal Party | Centre-right |
| M | Moderaterne | Moderates | Centre-right |
| F | Socialistisk Folkeparti | Socialist People's Party | Left-wing |
| Æ | Danmarksdemokraterne | Denmark Democrats | Right-wing |
| I | Liberal Alliance | Liberal Alliance | Right-wing |
| C | Konservative | Conservative People's Party | Right-wing |
| Ø | Enhedslisten | Red-Green Alliance | Far-left |
| B | Radikale Venstre | Social Liberal Party | Centre-left |
| D | Nye Borgerlige | New Right | Far-right |
| Å | Alternativet | The Alternative | Centre-left |
| O | Dansk Folkeparti | Danish People's Party | Far-right |
| Q | Frie Grønne | Free Greens | Centre-left |
| K | Kristendemokraterne | Christian Democrats | Centre-right |

Below you can see the results and the colors chosen to represent the parties. Use these colors in your analysis above.

![Alt text](image-1.png)


Others have undertaken similar analyses. You can draw inspiration from the following (use Google Translate if your Danish is rusty):

- [Analysis of where individual candidates stand relative to each other and their parties](https://v2022.dumdata.dk/)
- [Candidate Test 2022 – A deep dive into the data](https://kwedel.github.io/kandidattest2022/)
- [The Political Landscape 2019](https://kwedel.github.io/kandidattest2019/)



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import folium
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA

In [None]:
candidates_df = pd.read_excel('alldata.xlsx')

display(candidates_df.head(5))
display(candidates_df.shape)

# Scaling may be worthwhile (not yet decided). The candidate and political party names will probably need to be encoded
scaler = StandardScaler()
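
Whether to scale is a judgement call: the answers already share the -2..2 range, so scaling matters mostly once a column on a different scale, such as `alder`, is mixed in. A minimal sketch of standardization with `StandardScaler`, assuming a made-up frame `toy` (its values are hypothetical, not taken from `alldata.xlsx`):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame mixing answers (-2..2) with age in years
toy = pd.DataFrame({
    "q1": [-2, -1, 0, 1, 2],
    "q2": [2, 2, 0, -1, -2],
    "alder": [25, 40, 55, 33, 61],
})

# Standardize every column to zero mean and unit variance
scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)
print(scaled.round(2))
```

After this step age no longer outweighs the answers simply because it is measured in larger numbers.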

### Pre-processing

In [None]:
colour_mapping = {
    'Socialdemokratiet': '#C82518',  
    'Venstre': '#01438E',  
    'Moderaterne': '#B48CD2',  
    'Socialistisk Folkeparti': '#eb94d1',  
    'Danmarksdemokraterne': '#FFA500',  
    'Liberal Alliance': '#3FB2BE', 
    'Det Konservative Folkeparti': '#00583C',  
    'Enhedslisten': '#F7660D', 
    'Radikale Venstre': '#FF00FF',  
    'Nye Borgerlige': '#00505B',  
    'Alternativet': '#00FF00',  
    'Dansk Folkeparti': '#FCD03B',  
    'Frie Grønne, Danmarks Nye Venstrefløjsparti': '#d9b99b',  
    'Kristendemokraterne': '#53619B',  
    'Løsgænger': '#808080'  
}

colours = candidates_df['parti'].map(colour_mapping)
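
`Series.map` returns `NaN` for any party name that is missing from the dictionary, which later breaks or silently drops points in a scatter plot. A small sanity check might look like the sketch below; `toy_mapping` and `toy_parties` are hypothetical stand-ins for `colour_mapping` and the `parti` column:

```python
import pandas as pd

# Hypothetical excerpt: a colour mapping and a `parti` column with one unmapped name
toy_mapping = {"Venstre": "#01438E", "Enhedslisten": "#F7660D"}
toy_parties = pd.Series(
    ["Venstre", "Enhedslisten", "Konservative", "Venstre"], name="parti"
)

# Names that did not match any key in the mapping come back as NaN
toy_colours = toy_parties.map(toy_mapping)
unmapped = sorted(toy_parties[toy_colours.isna()].unique())
print(unmapped)

# Fall back to grey so every point can still be drawn
toy_colours = toy_colours.fillna("#808080")
```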

### PCA

In [None]:
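# Numeric response columns only; drop `alder` (age), whose much larger
# variance would otherwise dominate the first principal component
X = candidates_df.select_dtypes(include=['number']).drop(columns=['alder'], errors='ignore')
display(X.shape)

# Perform PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(X)
display(pca.explained_variance_ratio_)

# Plot the reduced data, one scatter call per party so the legend maps colours to parties
plt.figure(figsize=(10, 6))
for party, colour in colour_mapping.items():
    mask = (candidates_df['parti'] == party).to_numpy()
    plt.scatter(reduced_data[mask, 0], reduced_data[mask, 1], color=colour, marker='o', label=party)
plt.title('PCA of Candidate Responses')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid()
plt.show()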
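
To summarize where each party sits in the reduced space, the projected coordinates can be averaged per party. A minimal sketch on made-up data; the frame `toy`, the party labels, and the random answers are all hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical toy data: two parties answering five questions on the -2..2 scale
toy = pd.DataFrame(
    rng.integers(-2, 3, size=(40, 5)), columns=[f"q{i}" for i in range(1, 6)]
)
toy["parti"] = ["Party A"] * 20 + ["Party B"] * 20

# Project the answers onto two components, then average the coordinates per party
coords = PCA(n_components=2).fit_transform(toy.drop(columns="parti"))
centroids = (
    pd.DataFrame(coords, columns=["PC1", "PC2"])
    .assign(parti=toy["parti"].to_numpy())
    .groupby("parti")
    .mean()
)
print(centroids)
```

Plotting these centroids on top of the candidate scatter makes the distance between parties easier to read than the individual points alone.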

### Crucial question analysis

In [None]:
# How much each original feature contributes to the two principal components, printed in descending order. Note: if `alder` (age) is left unscaled in X, PC1 is determined almost entirely by age, because its variance is far larger than that of the -2..2 responses

# Get the PCA loadings
loadings = pca.components_

# Calculate the absolute values of loadings for each feature
absolute_loadings = np.abs(loadings)

# Create a DataFrame to show the absolute loadings
loading_df = pd.DataFrame(absolute_loadings, columns=X.columns, index=['PC1', 'PC2'])

# Sort the features based on their loadings in PC1
sorted_features_pc1 = loading_df.loc['PC1'].sort_values(ascending=False)
print("Features contributing the most to PC1:")
display(sorted_features_pc1)

# Sort the features based on their loadings in PC2
sorted_features_pc2 = loading_df.loc['PC2'].sort_values(ascending=False)
print("\nFeatures contributing the most to PC2:")
display(sorted_features_pc2)
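
Absolute loadings show which questions matter, but not in which direction they push a candidate along the axis. One option is to rank by absolute size while keeping the sign. A minimal sketch on made-up data, where `toy`, `pca_toy`, and the column names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical toy data: q1 varies a lot, q2 and q3 barely vary
toy = pd.DataFrame({
    "q1": rng.normal(0, 2.0, 200),
    "q2": rng.normal(0, 0.3, 200),
    "q3": rng.normal(0, 0.3, 200),
})

pca_toy = PCA(n_components=2).fit(toy)
signed = pd.DataFrame(pca_toy.components_, columns=toy.columns, index=["PC1", "PC2"])

# Rank by absolute size but report the signed loading
top_pc1 = signed.loc["PC1"].sort_values(key=np.abs, ascending=False)
print(top_pc1)
```

Keep in mind that the overall sign of a principal component is arbitrary, so only the relative signs of the loadings within one component carry meaning.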

### Average position of parties concerning each question

In [None]:

# Drop features that are not relevant for assessing the mean for every party in relation to every question
cleansed_candidates = candidates_df.drop(columns=["navn", "storkreds", "alder"])

display(cleansed_candidates.shape)
display(candidates_df.shape)

grouped_parties = cleansed_candidates.groupby("parti")

#Calculate the mean for every feature (question) within every group (party)
mean_by_party = grouped_parties.mean()

display(mean_by_party)

# Plot the mean score per party for each question. Individual questions are hard to read off, but the plot shows whether a party tends toward strong or moderate answers overall
mean_by_party.plot(kind='bar', figsize=(10, 6))
plt.title('Mean Scores by Party for Each Question')
plt.xlabel('Party')
plt.ylabel('Mean Score')
plt.legend(title='Question', title_fontsize='12', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
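
The same grouping can give a rough measure of internal disagreement: take the standard deviation of each question within a party and average it over the questions. This is one possible definition, not the only one. A minimal sketch on made-up data; `toy` and the party labels are hypothetical:

```python
import pandas as pd

# Hypothetical toy data: Party A answers in unison, Party B is split
toy = pd.DataFrame({
    "parti": ["Party A", "Party A", "Party A", "Party B", "Party B", "Party B"],
    "q1": [2, 2, 2, -2, 0, 2],
    "q2": [-1, -1, -1, 2, -2, 1],
})

# Standard deviation per question within each party, averaged over questions
std_by_party = toy.groupby("parti").std()
disagreement = std_by_party.mean(axis=1).sort_values(ascending=False)
print(disagreement)
```

Parties with very few candidates give unstable estimates, so it may be worth reporting the group size alongside the score.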