# Final Project Proposal - Big Five personality trait test

## Group members:
- Omar M. Hussein.
- Julian Ruggiero.
- Eli Weiss.

# Introduction

In psychological trait theory, the Big Five personality traits, also known as the five-factor model (FFM) or the OCEAN model, are a proposed taxonomy, or grouping, of personality traits developed from the 1980s onwards. The theory states that personality can be reduced to five core factors, known by the acronym CANOE or OCEAN:

1. Conscientiousness
    - Describes a person’s ability to regulate their impulse control in order to engage in goal-directed behaviors (Grohol, 2019). It measures elements such as control, inhibition, and persistence of behavior.
2. Agreeableness
    - Refers to how people tend to treat relationships with others. Unlike extraversion, which consists of the pursuit of relationships, agreeableness focuses on people’s orientation towards and interactions with others (Ackerman, 2017).
3. Neuroticism
    - Describes the overall emotional stability of an individual through how they perceive the world. It considers how likely a person is to interpret events as threatening or difficult (John & Srivastava, 1999).
4. Openness to Experience
    - Refers to one’s willingness to try new things as well as engage in imaginative and intellectual activities. It includes the ability to “think outside of the box” (John & Srivastava, 1999).
5. Extraversion
    - Reflects how readily and how intensely someone seeks interaction with their environment, particularly socially. It encompasses people's comfort and assertiveness in social situations, and it also reflects the sources from which someone draws energy (John & Srivastava, 1999).

A test based on the FFM was used by Cambridge Analytica and was part of the "psychographic profiling" controversy during the 2016 US presidential election.
In our project we will take data corresponding to answers related to personality traits and, after creating clusters or groups, predict which type of personality a person has according to their survey answers.
 
<img src="big5.png">

# Research questions
__What is your personality type based on the answers you provide to a questionnaire?__

Given the structure of the data (more details in the next section), the answers to each survey question contribute a score towards one of the 5 personality traits. Trait scores support conclusions such as the following:

- In marriages where one partner scores lower than the other on agreeableness, stability, and openness, there is likely to be marital dissatisfaction (Myers, 2011).

- A high score on conscientiousness predicts better high school and university grades (Myers, 2011). Conversely, low agreeableness and low conscientiousness predict juvenile delinquency (John & Srivastava, 1999). However, this does not imply that low conscientiousness and low agreeableness would ruin a career, as there are many disagreeable people in positions of power at international companies.

However, some critics think that more than five traits are needed to account for the wide personality differences among people.
Other critics argue that five traits are too many. For example, they point out that openness correlates positively with extraversion. These critics argue that just three traits (neuroticism, extraversion, and agreeableness) should be enough to fully describe personality.

In our research we want to address this kind of criticism: we will evaluate whether 5 is an appropriate number of groups and decide, based on our data, whether more or fewer than 5 personality traits would be preferable. After grouping, we will be able to classify new survey responses into one of our personality types.
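
One way to compare candidate numbers of groups, sketched here on synthetic data rather than the survey itself, is to fit K-means for a range of K and record both the inertia and the mean silhouette score. The blob data, the range 2 to 8, and all variable names are illustrative assumptions; only scikit-learn is required.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the survey answers: 500 rows, 10 features,
# generated around 5 well-separated centers (an assumption, not real data).
X, _ = make_blobs(n_samples=500, centers=5, n_features=10,
                  cluster_std=1.0, random_state=0)

inertias = {}     # within-cluster sum of squares, the quantity shown in an elbow plot
silhouettes = {}  # mean silhouette coefficient, higher means better-separated clusters

for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    silhouettes[k] = silhouette_score(X, model.labels_)

# Pick the K with the highest mean silhouette score.
best_k = max(silhouettes, key=silhouettes.get)

for k in inertias:
    print(f"k={k}  inertia={inertias[k]:.1f}  silhouette={silhouettes[k]:.3f}")
print("k with the highest silhouette score:", best_k)
```

On the real answers the silhouette curve may be much flatter than on these well-separated blobs, so the selected K should be read alongside the elbow plot and domain knowledge.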

# Data to be used

The dataset we will be using is big-five-personality-test, hosted on [Kaggle](https://www.kaggle.com/tunguz/big-five-personality-test). It contains 1,015,342 questionnaire answers collected online by Open Psychometrics and stored in CSV format.
Each question was posed as a statement rated on a five-point ordered scale using radio buttons. The scale was labeled from 1 = Disagree to 5 = Agree.

Examples from the questionnaire:
-	I am the life of the party.
-	I don't talk a lot.

Each of the 5 personality traits has its own set of questions, and each column name identifies which of the 5 traits the question belongs to.
Each answer is 1, 2, 3, 4, or 5, depending on how much the person agrees with the statement.
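
As a minimal sketch of how column names could be used to turn answers into per-trait scores, the snippet below averages the 1 to 5 answers that share a trait prefix. The prefixes (`EXT`, `EST`, `AGR`, `CSN`, `OPN`), the toy rows, and the `trait_scores` helper are assumptions made for illustration and should be checked against the actual file. It also assumes that negatively worded statements such as "I don't talk a lot" are reverse-scored, which is a choice the proposal itself does not specify.

```python
import pandas as pd

# Hypothetical mapping from trait to column-name prefix (verify against the real CSV).
TRAIT_PREFIXES = {
    "extraversion": "EXT",
    "neuroticism": "EST",
    "agreeableness": "AGR",
    "conscientiousness": "CSN",
    "openness": "OPN",
}


def trait_scores(answers: pd.DataFrame, reverse_keyed=()) -> pd.DataFrame:
    """Return one mean score per trait and respondent from 1-5 answers."""
    scored = answers.copy()
    for col in reverse_keyed:
        # Flip the 1-5 scale so that 1 becomes 5, 2 becomes 4, and so on.
        scored[col] = 6 - scored[col]
    return pd.DataFrame({
        trait: scored.filter(regex=rf"^{prefix}\d+$").mean(axis=1)
        for trait, prefix in TRAIT_PREFIXES.items()
    })


# Toy data: two respondents, made-up answers.
toy = pd.DataFrame({
    "EXT1": [5, 1], "EXT2": [1, 5],   # EXT2 is treated as negatively worded
    "EST1": [2, 4], "AGR1": [4, 2],
    "CSN1": [3, 3], "OPN1": [5, 2],
})
scores = trait_scores(toy, reverse_keyed=["EXT2"])
print(scores)
```

Averaging rather than summing keeps every trait on the original 1 to 5 scale even if traits have different numbers of questions.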

# Approach

The following bullet points describe how we plan to approach the work.

- Considering how large the data is (around 400 MB), we are going to upload it to an Amazon S3 bucket and then load it from there into a Pandas data frame.
- Before implementing a clustering algorithm, we will perform appropriate EDA, creating bar charts, histograms, and other graphics to better understand the nature of our data.
- Perform any required data preparation work, including any feature engineering adjustments we consider necessary for our work.
- Apply feature selection and/or dimensionality reduction techniques to identify explanatory variables for inclusion within our models.
- Apply a hierarchical clustering algorithm to the data, interpreting the dendrogram to get a sense of the number of clusters we think should be imposed on the data.
- Implement a K-means clustering algorithm. We will use a range of values for K to create an elbow plot and a silhouette plot for the data set and use the plots to select an appropriate value for K.
- Compare the output of these plots and determine whether the value of K is in line with the number of clusters we selected from the hierarchical dendrogram. Use our domain knowledge to contrast the results and define the final number of clusters.
- Apply K-means clustering using the selected value of K.
- Perform EDA on the groupings and, since we don't have actual labels for each record, use our domain knowledge to name each group/class.
- Apply our knowledge of feature selection and/or dimensionality reduction techniques to identify explanatory variables for inclusion within our 3 different supervised machine learning classification algorithms.
- For our supervised learning models we will create 3 types:
    - KNN
    - LightGBM
    - Random forests <br>
    
  We will split the data into training and testing sets and cross-validate each model to obtain several metrics (AUC, F1, Recall and Accuracy). Depending on whether the classes in our response are balanced or imbalanced, we will select which of these metrics are most appropriate. In this context we believe we should aim for models with high Recall, that is, a high proportion of the people who belong to a specific personality type being classified by our algorithms as having that personality. <br>
  
  
- Create an ensemble model. To try to reduce the bias and/or variance of our initial models, we plan to combine them into a strong learner (or ensemble model) that achieves better performance on the metrics mentioned above. We will use stacking, which takes heterogeneous weak learners, trains them in parallel, and combines them by training a meta-model that outputs a prediction based on the weak learners' predictions. This technique mainly aims to produce strong models that are less biased than their components (although variance can also be reduced). <br>

  In this project the weak learners are the KNN, LightGBM and Random Forest models, and we have decided to train a neural network as the meta-model. The neural network will take the outputs of our three weak learners as inputs and learn to return final predictions based on them. <br>
  
    
- After all the previous steps are complete, we will provide conclusions based on our results.
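
The stacking step described in the list above could look roughly like the following sketch, which uses scikit-learn's `StackingClassifier` with a small `MLPClassifier` as the meta-model. Several things here are assumptions made only for illustration: the data is synthetic, the 3 classes are arbitrary, the hyperparameters are untuned, and `GradientBoostingClassifier` is swapped in for LightGBM so that the example depends on scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for (survey answers, cluster label) pairs.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

base_learners = [
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    # Stand-in for LightGBM (assumption made to avoid an extra dependency).
    ("gbm", GradientBoostingClassifier(random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]

# The meta-model is trained on cross-validated predictions of the base learners.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                                  random_state=0),
    cv=5,
)
stack.fit(X_train, y_train)

# Macro-averaged recall weights every class equally, whatever its size.
macro_recall = recall_score(y_test, stack.predict(X_test), average="macro")
print(f"macro-averaged recall on the held-out set: {macro_recall:.3f}")
```

Macro averaging is used here because the proposal emphasizes recall for each personality type; with imbalanced clusters, a per-class breakdown would be worth inspecting as well.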

# Responsibilities

### Omar

### Eli

### Julian