### Importing Python Libraries

In [1]:
import pandas as pd
import numpy as np

# For Task 3
# from sklearn.model_selection import train_test_split
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.metrics import accuracy_score

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

### Importing Dataset

In [2]:
# Importing and loading the dataset
df = pd.read_csv("dataset.csv")

## Task 1

#### To get a feel for the employed dataset, do some initial exploration and answer the following questions:

### 1.1: How many questions exist in the survey?

In the dataset's header, the answer options of a question share the same prefix. For example, `q3_1`, `q3_2`, ... `q3_10` are all possible answers to the same question. We can therefore strip everything after the underscore from each column name, which leaves `q1`, `q2`, ... `q10`, and count the unique values.

In [3]:
# Column names; each column represents one answer option of a question
columns = df.columns

# Unique question prefixes (the part of the column name before the underscore)
uniq_questions = {col.split('_')[0] for col in columns}

# Total number of actual questions
num_questions = len(uniq_questions)
print("Number of questions in the survey:", num_questions)

Number of questions in the survey: 10


### 1.2: How many respondents are in the survey?

Each row in the dataset represents one respondent and their answers. The header row only describes the content of each column, and `pd.read_csv` consumes it as the column names, so the number of rows in `df` equals the number of respondents.

In [4]:
# Returns the number of rows without header, which represents the number of respondents
num_respondents = df.shape[0]

print("Number of respondents in the survey:", num_respondents)

Number of respondents in the survey: 2288


### 1.3: What are the different question types based on their selection options?

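Questions can be characterised along two dimensions, based on what respondents are able to select:
1. Options: whether the question offers a single option (one column) or multiple options (several columns)
2. Responses: whether a respondent can select only one of those options or several at once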

To determine whether a question has a single option or multiple options, the code iterates through the dataset columns and splits each column name to extract the question prefix. The first time a prefix is seen, it is recorded in the `selection_types` dictionary as `Single`; if the same prefix appears again, the question has more than one column and its entry is updated to `Multiple`.

Next, the code determines whether each question accepts single or multiple responses by iterating through the columns again. For each column, it extracts the question prefix and creates an entry in the `question_types` dictionary if one does not exist yet. It then calls `determine_question_type`, which sums each respondent's selections across the question's columns and returns `Single` if the largest row sum is 1, and `Multiple` otherwise.

<!-- First, we ascertain whether each question allows for a single or multiple responses. This is done by iterating over the dataset columns, calculating the sum of responses for each respondent across related columns. Based on the maximum sum, we categorise the question as either "Single" or "Multiple" responses.

We then identify if each question permits single or multiple options. We traverse the dataset columns, updating a dictionary with the selection type for each question prefix. If multiple selections are detected, the dictionary entry is updated to "Multiple"; otherwise, it defaults to "Single". -->

The two results are then combined for each question. The "responses" result is omitted when a question has only a single option, since a question with one option cannot receive multiple responses.

In [6]:
## todo: optimise!!!

# Function to determine question type
def determine_question_type(question):
    # Get columns related to the question; compare the full prefix so that
    # 'q1' does not also match the 'q10_*' columns
    question_columns = [col for col in df.columns if col.split('_')[0] == question]
    
    # Check if it's a single response or multiple responses question
    responses_sum = df[question_columns].sum(axis=1)
    max_sum = responses_sum.max()
    if max_sum == 1:
        return "Single"
    else:
        return "Multiple"
    
# Get question types for all questions
question_types = {}

# Determine single/multiple selection type for each question
selection_types = {}

for col in df.columns:
    question_prefix = col.split('_')[0]
    if question_prefix in selection_types:
        # If the question prefix already exists, check if it's a multiple-selection question
        if selection_types[question_prefix] != 'Multiple':
            selection_types[question_prefix] = 'Multiple'
    else:
        # If the question prefix is not in the dictionary, assume it's a single-selection question
        selection_types[question_prefix] = 'Single'

for col in df.columns:
    question_prefix = col.split('_')[0]
    if question_prefix not in question_types:
        question_types[question_prefix] = set()
    
    # Add the question type
    question_types[question_prefix].add(determine_question_type(question_prefix))

# Print combined question and selection types
print("Question Types:")
for question, q_types in question_types.items():
    q_type_str = " and ".join(q_types)
    print(f"{question}: {selection_types[question]} options", end = "")
    if selection_types[question] == 'Multiple':
        print(f" and {q_type_str} responses")
    else:
        print("")

Question Types:
q1: Single options
q2: Single options
q3: Multiple options and Single responses
q4: Multiple options and Single responses
q5: Multiple options and Single responses
q6: Multiple options and Multiple responses
q7: Multiple options and Multiple responses
q8: Multiple options and Multiple responses
q9: Multiple options and Multiple responses
q10: Multiple options and Multiple responses
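
The two passes above can also be folded into a single pass per question. The following is only a minimal sketch on a hypothetical toy frame (`toy`, `classify_questions` and `question_summary` are illustrative names, not part of the notebook), assuming each column holds 0/1 indicators as in the survey data:

```python
import pandas as pd

# Hypothetical toy survey: q1 has one column, q3 is single-choice, q6 is multi-select
toy = pd.DataFrame({
    "q1_1": [1, 0, 1],
    "q3_1": [1, 0, 0],
    "q3_2": [0, 1, 1],
    "q6_1": [1, 1, 0],
    "q6_2": [1, 0, 0],
})

def classify_questions(frame):
    """Return {question: (options, responses)} for every question prefix."""
    # Group column names by the prefix before the underscore
    groups = {}
    for col in frame.columns:
        groups.setdefault(col.split("_")[0], []).append(col)

    result = {}
    for question, cols in groups.items():
        # More than one column means the question offers multiple options
        options = "Multiple" if len(cols) > 1 else "Single"
        # More than one selected option in any row means multiple responses
        max_selected = frame[cols].sum(axis=1).max()
        responses = "Multiple" if max_selected > 1 else "Single"
        result[question] = (options, responses)
    return result

question_summary = classify_questions(toy)
for question, (options, responses) in question_summary.items():
    print(f"{question}: {options} options, {responses} responses")
```

Grouping the columns once avoids calling the classification function for every column of the same question.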


## Task 2

#### If you were asked to build a model for a question in the dataset, using the rest of the questions as explanatory variables:

### 2.1: Which performance measures would you use?

When building a model for questions in the dataset, the choice of performance measures hinges upon the nature of the target variable and the specific objectives of the analysis. As shown in Task 1, the dataset comprises questions that fall under different classification categories, each demanding different types of evaluation metrics.

For instance, questions 1 and 2 present Binary Classification Problems, with only two possible outcomes [1]. For these types of questions, measures such as Accuracy, Precision, Recall, F1-score, and the Area under the ROC curve (AUC-ROC) may be used to assess the performance of the model.

On the other hand, questions 3, 4 and 5 can be described as Multi-Class Classification Problems with Mutually Exclusive Classes, as they offer multiple potential outputs, yet only one can be selected [1]. The same measures as for questions 1 and 2 apply here, namely Accuracy, Precision, Recall and F1-score, computed per class and then averaged, together with confusion matrices. The averaging method matters: the Micro F1-score pools the counts of all classes, which in a single-label problem makes it equal to Accuracy, whereas the Macro F1-score averages the per-class F1-scores with equal weight and therefore reflects performance on rare classes [2]. Similarly, while Accuracy provides an overall measure of how correct the model's predictions are, Balanced Accuracy gives equal weight to each class and is therefore not dominated by the majority class [2]. It is useful if the survey responses turn out not to be evenly distributed across the options.

Finally, questions 6 to 10 pose Multi-Class Classification Problems with Non-Exclusive Classes (multi-label problems), which differ from the previous set in that multiple outputs can be selected simultaneously [1]. Here, evaluation metrics such as Hamming Loss, precision at k, recall at k, F1-score at k, and subset accuracy become instrumental in gauging the model's predictive capabilities.
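
To illustrate how the multi-class measures react to class imbalance, here is a small sketch using scikit-learn's metric functions. The labels are made up for illustration and do not come from the survey:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Hypothetical, imbalanced single-label example: class 0 dominates
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 2]  # the only class-1 instance is misclassified

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
micro_f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"Accuracy:          {acc:.3f}")
print(f"Balanced accuracy: {bal_acc:.3f}")
print(f"Micro F1:          {micro_f1:.3f}")
print(f"Macro F1:          {macro_f1:.3f}")
```

Accuracy and Micro F1 coincide here, while Balanced Accuracy and Macro F1 drop because the minority class 1 is never predicted correctly.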

##### todo: add about unbalance - question 1 is unbalanced
##### Todo: add references

### 2.2: Does the selection of the performance metric depend on the type of question?

Yes. As shown in 2.1, the selection of the performance metric depends on the type of question being addressed. Different types of questions present distinct characteristics, which call for the performance metrics best suited to evaluating the model's effectiveness. The choice of metric aligns with the type of question as follows:

Binary Classification (Questions 1 and 2):
- In these problems, where there are only two possible outcomes, metrics such as accuracy, precision, recall, F1-score, and the area under the ROC curve (AUC-ROC) are commonly used. These metrics provide insights into the model's ability to correctly classify instances into one of the two classes. They are derived from four counts: True Positives (TP) are correctly classified positive instances, True Negatives (TN) are correctly classified negative instances, False Positives (FP) are negative instances incorrectly classified as positive, and False Negatives (FN) are positive instances incorrectly classified as negative [1].
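
As a small worked example of these four counts, the sketch below derives them with scikit-learn's `confusion_matrix` and computes precision and recall by hand. The labels are hypothetical and chosen only for illustration:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical binary labels (1 = positive, 0 = negative)
y_true_bin = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_bin = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels {0, 1}, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_bin, y_pred_bin).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Precision={precision:.2f}, Recall={recall:.2f}, Accuracy={accuracy:.2f}")
```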

Multi-Class Classification (Questions 3 to 5):
- In the case of Multi-Class Classification Problems, where there are multiple possible outputs but only one can be selected, accuracy, precision, recall, and F1-score can also be used, along with confusion matrices. These metrics enable the assessment of the model's performance across multiple classes, providing insights into its ability to distinguish between different categories.

Multi-Label Classification (Questions 6 to 10):
- For multi-label classification problems, where multiple outputs can be simultaneously selected, performance metrics such as Hamming Loss, precision at k, recall at k, F1-score at k, and subset accuracy are often employed. These metrics are tailored to evaluate the model's ability to predict multiple labels for each instance accurately, considering the complexities of multi-label prediction tasks.

Each performance metric has its strengths and weaknesses, and the choice depends on the specific characteristics of the dataset and the objectives of the modeling task. Therefore, it's essential to carefully consider the nature of the target variable and the context of the problem before selecting the appropriate performance metric for evaluation.
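
For the multi-label case, two of the metrics named above can be computed directly with scikit-learn. This is a minimal sketch on made-up indicator matrices (rows are respondents, columns are options); note that `accuracy_score` applied to a multi-label indicator matrix returns the subset accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Hypothetical multi-label targets and predictions (3 respondents, 3 options)
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 1]])

# Fraction of individual option labels that are wrong (2 of 9 here)
h_loss = hamming_loss(Y_true, Y_pred)

# Fraction of respondents whose full set of options is predicted exactly (1 of 3 here)
subset_acc = accuracy_score(Y_true, Y_pred)

print(f"Hamming loss:    {h_loss:.3f}")
print(f"Subset accuracy: {subset_acc:.3f}")
```

Subset accuracy is the stricter of the two: a single wrong option makes the whole row count as an error, whereas Hamming Loss penalises each wrong option separately.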

##### todo: add references

## Task 3

#### Consider the fifth question (q5) as a response variable and build a predictive model using some or all other questions as explanatory variables.

### 3.1.1: Implement a suitable algorithm for this task. 

##### From [1]:
The size of your dataset will greatly influence the type of algorithm
you can use. That said there is no golden rule to the amount of data required to train
an ML model. However, some ML algorithms cope better where there is less
available data. For example, an image classification task can require tens of thousands of images. Regression tasks require many more observations than features
(10× as much) while NLP can require 1000’s of examples due to a large number of
words and complex phrases. Some algorithms such as neural networks can be
extremely data-hungry while other algorithms such as a random forest can produce
reasonably good results with limited amounts of data. The process of finding the
correct model based on the size of available data is an empirical one

##### todo: change

We chose to use the Random Forest algorithm for several reasons:

Ensemble Learning: Random Forest is an ensemble learning method that combines multiple decision trees to create a more robust and accurate model. It works by building multiple decision trees during training and outputting the mode of the classes (classification) or the mean prediction (regression) of the individual trees. This ensemble approach helps to reduce overfitting and increase the generalization of the model.

Handles Non-linearity and Interactions: Random Forest can effectively capture non-linear relationships and interactions between variables in the dataset. This is particularly useful when dealing with complex data where the relationship between the features and the target variable is not straightforward.

Feature Importance: Random Forest provides a measure of feature importance, which can be valuable for understanding the contribution of each feature to the predictive performance of the model. This can aid in feature selection and feature engineering, leading to better model performance.

Robust to Overfitting: Random Forest is less prone to overfitting compared to individual decision trees, especially when the number of trees in the forest is sufficiently large. By aggregating the predictions from multiple trees, Random Forest reduces the variance of the model and improves its generalization ability.

Works well with Mixed Data Types: Random Forest does not require feature scaling, because tree splits depend only on the ordering of feature values. Categorical features still have to be encoded numerically for scikit-learn's implementation, but this is not an obstacle for the survey dataset provided, where each response option is already stored in its own column.

Overall, Random Forest is a versatile and powerful algorithm that tends to perform well across a wide range of datasets and can be a suitable choice for building predictive models in many real-world scenarios.
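
As a rough illustration of the points above, the following sketch trains a `RandomForestClassifier` on synthetic data shaped like the survey (binary explanatory columns and a one-hot encoded response). All data and names here (`X_toy`, `y_onehot`, `y_toy`) are hypothetical stand-ins, and the hyperparameters are assumptions rather than tuned choices:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in for the survey: six binary explanatory columns
X_toy = pd.DataFrame(rng.integers(0, 2, size=(n, 6)),
                     columns=[f"q3_{i}" for i in range(1, 7)])

# Synthetic one-hot response with three options
labels = rng.integers(0, 3, size=n)
y_onehot = pd.DataFrame(np.eye(3, dtype=int)[labels],
                        columns=["q5_1", "q5_2", "q5_3"])

# Collapse the one-hot columns into a single class label per respondent
y_toy = y_onehot.idxmax(axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, pred))
print("Balanced accuracy:", balanced_accuracy_score(y_test, pred))

# Impurity-based feature importances, one value per explanatory column
importances = pd.Series(model.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```

Because the synthetic features are random, the scores are expected to be near chance level; the sketch only shows the mechanics of collapsing the one-hot response, splitting, fitting and reading feature importances.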

The process:

1. Extract Independent and Dependent Variables in Dataset
2. Identify and handle Missing Values
3. Encoding Categorical Variables
4. Splitting Dataset
5. Feature Scaling
6. Data Imbalance

#### Pre-Processing

In [3]:
# Extracting response variable (q5) and explanatory variables
response_columns = df.filter(regex='^q5_').columns.tolist()
explanatory_columns = df.drop(columns=response_columns)
# y and X are used by the missing-value checks further below
y = df[response_columns]
X = explanatory_columns

#### Handling Missing Values

Next, let's group the columns by question and check which questions have missing responses, if any. For each question, `isnull().sum()` gives the number of missing values in each of its columns. Since a missing response leaves every column of that question empty rather than a single option, we divide the total number of missing values by the number of options of the question, which gives the number of missing responses per question.

In [10]:
# Create a dictionary to store sections
sections = {}

# Iterate through columns and group them by question number prefix
for column in explanatory_columns.columns.tolist():
    question_number = column.split('_')[0]  # Extract question number prefix
    if question_number in sections:
        sections[question_number].append(column)
    else:
        sections[question_number] = [column]

# Iterate through sections to calculate missing values
missing_values_per_question = {}
for section, columns in sections.items():
    section_data = explanatory_columns[columns]
    missing_values_per_question[section] = section_data.isnull().sum()

# Calculate average missing values per column for each question
average_missing_values_per_question = {}
for question, missing_values in missing_values_per_question.items():
    num_columns = len(sections[question])
    total_missing_values = missing_values.sum()
    average_missing_values_per_question[question] = total_missing_values / num_columns

# Print average missing values per column for each question
print("Missing Responses in Explanatory Variables:")
for question, average_missing_values in average_missing_values_per_question.items():
    print(f"{question}: {int(average_missing_values)}")
    
# Checking Missing Values in Response Variable
missing_values_response = df[response_columns].isna().sum().sum()
print("\nMissing Responses in Response Variable:")
print(f"q5: {missing_values_response}")

Missing Responses in Explanatory Variables:
q1: 0
q2: 0
q3: 0
q4: 0
q6: 811
q7: 0
q8: 1255
q9: 0
q10: 0

Missing Responses in Response Variable:
q5: 0


Questions q6 and q8 have a notable number of missing responses, 811 and 1255 respectively. Given that the dataset contains 2288 respondents, this amounts to roughly 35% and 55% of the responses to those questions.
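
The same count can be obtained without the division step by checking, per respondent, whether every column of a question is empty. The sketch below uses a hypothetical toy frame and assumes, as described above, that a skipped question leaves all of its columns missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: respondents 1 and 3 skipped q6 entirely
toy = pd.DataFrame({
    "q6_1": [1, np.nan, 0, np.nan],
    "q6_2": [0, np.nan, 1, np.nan],
    "q7_1": [1, 0, 1, 1],
})

# Group column names by question prefix
groups = {}
for col in toy.columns:
    groups.setdefault(col.split("_")[0], []).append(col)

# A response is missing when all columns of the question are NaN for that row
missing_per_question = {
    question: int(toy[cols].isna().all(axis=1).sum())
    for question, cols in groups.items()
}
missing_share = {
    question: count / len(toy)
    for question, count in missing_per_question.items()
}

for question in groups:
    print(f"{question}: {missing_per_question[question]} missing "
          f"({missing_share[question]:.0%})")
```

Reporting the share alongside the count makes it easier to judge how much data each handling strategy would discard or impute.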

Handling missing values in a dataset is a crucial step in data preprocessing before building a predictive model. Here are some common techniques to handle missing values:

1. Removal: You can simply remove rows or columns with missing values. However, this approach might lead to loss of valuable data, especially if the missing values are not randomly distributed.
2. Imputation: Replace missing values with a statistical measure such as mean, median, or mode of the column. This method maintains the structure of the dataset but may introduce bias, especially if the missing values are not missing at random.
3. Predictive Imputation: Use predictive models to estimate missing values based on other available features. This method can provide more accurate imputations but requires more computational resources.
4. K-nearest neighbors (KNN) imputation: Replace missing values with the average of nearest neighbors. This method is effective for datasets with non-linear patterns.
5. Interpolation: Replace missing values by estimating values between existing data points. This method is useful for time-series data or datasets with a clear order.
6. Multiple Imputation: Generate multiple imputations for missing values to capture uncertainty. This method is suitable for complex datasets with high variability.

The choice of method depends on the nature of the data, the extent of missingness, and the assumptions about the missing data mechanism. It's often advisable to explore the data distribution and understand the reasons for missingness before selecting an appropriate imputation strategy. Additionally, it's good practice to evaluate the impact of imputation on the model performance.

Imputation with Mode: Since the survey dataset likely consists of categorical variables (responses to various questions), imputing missing values with the mode (most frequent response) of each respective question could be a suitable approach. This maintains the categorical nature of the data and doesn't introduce new values that might not accurately represent the respondents' choices.

Create Separate Category for Missing Values: Alternatively, you could treat missing values as a separate category within each question. This approach allows the model to learn from the absence of a response as its own distinct category. However, this method may not be suitable if missingness is not informative or is too prevalent.

Predictive Imputation: If missing values are not too prevalent and there are enough complete cases, you could use predictive imputation methods like K-nearest neighbors (KNN) or decision trees to estimate missing values based on the responses to other questions. However, this approach might be computationally expensive and may not be necessary unless there's a compelling reason to believe that missing values follow a specific pattern.
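
The first two strategies can be sketched as follows on a hypothetical toy frame. `SimpleImputer` with `strategy="most_frequent"` performs the mode imputation; the `q6_missing` indicator column is an illustrative name for the "separate category" approach, not something present in the dataset:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: the second respondent skipped q6
toy = pd.DataFrame({
    "q6_1": [1, np.nan, 1, 0],
    "q6_2": [0, np.nan, 0, 1],
})

# Option 1: replace missing values with the most frequent value of each column
imputer = SimpleImputer(strategy="most_frequent")
toy_mode = pd.DataFrame(imputer.fit_transform(toy),
                        columns=toy.columns, index=toy.index)

# Option 2: keep missingness as its own category via an indicator column,
# then fill the original columns with 0 (no option selected)
toy_flagged = toy.fillna(0).assign(
    q6_missing=toy.isna().all(axis=1).astype(int))

print(toy_mode)
print(toy_flagged)
```

When used for modelling, the imputer would normally be fitted on the training split only and then applied to the test split, so that no information leaks from the test data.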

In [50]:
# from sklearn.impute import SimpleImputer

# # Define the imputer with strategy as 'most_frequent' to impute with mode
# imputer = SimpleImputer(strategy='most_frequent')

# # Apply the imputer to the dataset
# df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# # Check if there are any missing values left
# missing_values = df_imputed.isnull().sum().sum()
# if missing_values == 0:
#     print("No missing values left after imputation.")
# else:
#     print("There are still missing values after imputation.")

# # Verify if there are any remaining missing values
# print("Remaining missing values:")
# print(df_imputed.isnull().sum())

In [20]:
# Checking Missing Values in Response Variable
y.isna().sum()

q5_1     0
q5_2     0
q5_3     0
q5_4     0
q5_5     0
q5_6     0
q5_7     0
q5_8     0
q5_9     0
q5_10    0
dtype: int64

In [19]:
# Checking Amount of Missing Values in Explanatory Variable
X.isna().sum() #*100/len(X)

q1_1      0
q2_1      0
q3_1      0
q3_2      0
q3_3      0
         ..
q10_6     0
q10_7     0
q10_8     0
q10_9     0
q10_10    0
Length: 72, dtype: int64

In [12]:
# Step 3: Check for missing values per question
missing_values_per_question = df.isnull().sum()

# Step 4: Display the missing values per question
print("Missing Values per Question:")
print(missing_values_per_question)

Missing Values per Question:
q1_1      0
q2_1      0
q3_1      0
q3_2      0
q3_3      0
         ..
q10_6     0
q10_7     0
q10_8     0
q10_9     0
q10_10    0
Length: 82, dtype: int64


### 3.1.2: Evaluate the implemented model using the metrics you proposed in Task 2.

### 3.1.3: Interpret the trained model (if possible) and the obtained results.

#### Let’s assume that the options of the fifth question (q5) are semantically ordered, e.g. q5 corresponds to the question “How much do you like this photograph from 1-10?” with options q5_1=1, q5_2=2 and so on. 

### 3.2.1: Implement a predictive model incorporating this new piece of information.

### 3.2.2: Select appropriate metrics to evaluate the performance of the model in this scenario.

## Task 4

#### In this task, you should identify similar respondents. Specifically, you should do the following:

### 4.1: Create groups of “like-minded”/similar respondents by proposing and implementing one or more suitable algorithms.

### 4.2: Evaluate whether the clusters you created are good enough.

### 4.3: Compare different algorithms against each other to select one of them.

### 4.4: Describe the groups you identified.

### 4.5: Create visualisations for the clustering results.

### 4.6: How would you assign a new respondent to an existing group?

## Task 5

#### Based on the results obtained from Task 4

### 5.1: Explain whether the clustering information could be used to build more accurate models for Task 3, and describe what you would do to build such models.

# References

[1] P. Fergus and C. Chalmers, ‘Performance Evaluation Metrics’, in Applied Deep Learning: Tools, Techniques, and Implementation, P. Fergus and C. Chalmers, Eds., Cham: Springer International Publishing, 2022, pp. 115–138. doi: 10.1007/978-3-031-04420-5_5.
[2] M. Grandini, E. Bagli, and G. Visani, ‘Metrics for Multi-Class Classification: an Overview’. arXiv, Aug. 13, 2020. doi: 10.48550/arXiv.2008.05756.