<a href="https://www.kaggle.com/code/marcelocruzeta/machine-learning-experiment?scriptVersionId=248543600" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 1. Personality Classification Project

## Project Overview

This project aims to classify individuals as either Introverts or Extroverts based on various behavioral and social characteristics using machine learning techniques. The primary objective is to build and validate models that can effectively predict personality types from input features derived from a dataset.

## Dataset

The dataset used in this project comprises responses to questions related to social behavior, emotional experiences, and personal preferences. Key features include:

- **Stage_fear**: Indicating fear of public speaking.
- **Drained_after_socializing**: Reflecting how individuals feel after interacting socially.
- **Time_spent_Alone**: The amount of time spent alone.
- **Social_event_attendance**: Frequency of attendance at social events.
- **Going_outside**: How often individuals engage in outdoor activities.
- **Friends_circle_size**: The size of their social circle.
- **Post_frequency**: How often they post on social media.

## Methodology

### Data Preprocessing
- **Handling Missing Values**: Missing values in categorical variables were filled with 'Unknown' and then mapped to numeric codes. Missing values in numeric features were imputed with `IterativeImputer`.
- **Train/Validation Split**: Data was split into training and validation sets so that each model is evaluated on rows it was not trained on.

### Model Training
Three distinct machine learning models were utilized to predict personality types:
1. **Random Forest Classifier**: A robust ensemble learning method used for classification tasks.
2. **CatBoost Classifier**: A gradient boosting library that efficiently handles categorical features without extensive pre-processing.
3. **XGBoost Classifier**: An optimized gradient boosting framework designed for speed and performance.

The models underwent training on the training dataset, followed by evaluation on the validation set to assess their accuracy and classification metrics. 

### Majority Voting for Consensus
To enhance prediction reliability, predictions from the three models were combined using a majority voting approach. For each prediction:
- If at least two out of three models agree on a personality type, that type is selected as the final prediction.
- This method can reduce the influence of any single model's errors. The notebook does not measure the ensemble's validation accuracy separately from the individual models.
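The two-out-of-three rule can be sketched on binary predictions as follows. This is a minimal illustration, assuming each model's output is available as an equal-length array of 0/1 codes (0 for Introvert, 1 for Extrovert); the array names and toy values are hypothetical and not taken from the notebook.

```python
import numpy as np

# Hypothetical 0/1 predictions (0 = Introvert, 1 = Extrovert) from three models
rf_pred = np.array([1, 0, 1, 0, 1])
cb_pred = np.array([1, 0, 0, 0, 1])
xgb_pred = np.array([0, 0, 1, 1, 1])

# With three binary voters, the majority class is 1 whenever at least two votes are 1
votes = rf_pred + cb_pred + xgb_pred
majority = (votes >= 2).astype(int)

# Rows where the vote was not unanimous (neither 0 nor 3 votes for class 1)
split_vote = (votes != 0) & (votes != 3)

print(majority)
print(split_vote)
```

Because there are three voters and two classes, a tie cannot occur, so no tie-breaking rule is needed.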

## Final Submission
The final predictions were compiled into a submission file named `submission.csv`, containing two columns: `id` and `Predicted_Personality`, ready for evaluation and further analysis.

## Unique Aspects of the Project
- **Ensemble Modeling**: Three different models are combined by majority voting. On the validation split each model individually reaches about 96.7% accuracy (Random Forest 0.9671, CatBoost 0.9673, XGBoost 0.9665), and the vote resolves the test rows where they disagree.
- **Data Imputation Strategy**: `IterativeImputer` fills missing numeric values by modeling each feature that has missing entries as a function of the other features, which tends to preserve relationships between features better than a constant fill.
- **Multi-Model Comparison**: This project emphasizes the importance of evaluating and comparing multiple models to ensure the best possible results, reflecting a thorough understanding of model performance and selection.

## Conclusion
This project illustrates a complete machine learning pipeline, from data preprocessing to model training and final prediction submission, demonstrating effective techniques for personality classification based on behavioral characteristics. Future enhancements could include hyperparameter tuning, model optimization, and exploring additional algorithms or features to further improve classification accuracy.

# 2. Imports

In [1]:
# Import necessary libraries
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical operations
from sklearn.experimental import enable_iterative_imputer  # Needed to use IterativeImputer
from sklearn.impute import IterativeImputer  # For iterative imputation of missing values
from sklearn.model_selection import train_test_split  # For splitting the dataset into training and validation sets
from sklearn.ensemble import RandomForestClassifier  # Importing the Random Forest classifier for model training
from sklearn.metrics import classification_report, accuracy_score  # For evaluating model performance metrics
from catboost import CatBoostClassifier  # Importing the CatBoost classifier for gradient boosting on decision trees
import xgboost as xgb  # Importing the XGBoost classifier
from collections import Counter  # Importing Counter for counting hashable objects, useful for tallying predictions
from tabulate import tabulate  # For displaying the DataFrame with borders
from colorama import Fore, Style  # Import Colorama for coloring text

# 3. Data Preprocessing for Personality Classification

This notebook outlines the preprocessing steps taken to prepare data for a personality classification project on Kaggle, focusing on distinguishing between Introverts and Extroverts.

## Preprocessing Function: `preprocess`

1. **Copy DataFrame**: Creates a copy of the input DataFrame to avoid modifying the original.

2. **Handle Categorical Variables**: Fills missing values in `Stage_fear` and `Drained_after_socializing` with 'Unknown' and maps them to numerical values:
   - `Yes` → 1
   - `No` → 0
   - `Unknown` → -1

3. **Select Numeric Columns**: Specifies columns that will have missing values imputed:
   - `Time_spent_Alone`, `Social_event_attendance`, `Going_outside`, `Friends_circle_size`, `Post_frequency`
   
4. **Impute Missing Numeric Values**: Uses `IterativeImputer` to fill missing values in the selected numeric columns.

5. **Return Preprocessed DataFrame**: Outputs the processed DataFrame.
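The steps above can be sketched on toy data. Note that the notebook's `preprocess` calls `fit_transform` on whichever DataFrame it receives, so the training and test sets are imputed by separately fitted imputers. The variant below is an assumption-labeled alternative, not the notebook's method: it fits the imputer once on the training frame and reuses it on the test frame. The toy values, the two-column subset, and the helper name `encode_categoricals` are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

NUM_COLS = ["Time_spent_Alone", "Social_event_attendance"]  # subset, for brevity
CAT_MAP = {"Yes": 1, "No": 0, "Unknown": -1}

# Toy frames standing in for train/test (values are made up)
toy_train = pd.DataFrame({
    "Stage_fear": ["Yes", "No", None, "No"],
    "Time_spent_Alone": [9.0, 1.0, np.nan, 2.0],
    "Social_event_attendance": [1.0, 8.0, 2.0, 7.0],
})
toy_test = pd.DataFrame({
    "Stage_fear": ["No", None],
    "Time_spent_Alone": [np.nan, 8.0],
    "Social_event_attendance": [6.0, np.nan],
})

def encode_categoricals(df):
    """Fill missing Yes/No answers with 'Unknown' and map them to 1/0/-1."""
    out = df.copy()
    out["Stage_fear"] = out["Stage_fear"].fillna("Unknown").map(CAT_MAP)
    return out

train_enc = encode_categoricals(toy_train)
test_enc = encode_categoricals(toy_test)

# Fit on the training frame only, then apply the same fitted imputer to the test frame
imputer = IterativeImputer(random_state=42)
train_enc[NUM_COLS] = imputer.fit_transform(train_enc[NUM_COLS])
test_enc[NUM_COLS] = imputer.transform(test_enc[NUM_COLS])

print(train_enc)
print(test_enc)
```

Reusing one fitted imputer means a given test row is filled in the same way regardless of which other rows happen to be in the test file.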

## Loading and Preprocessing the Data

- Loads training and testing datasets.
- Preprocesses the training features after dropping the `id` and `Personality` columns, and separately maps `Personality` to a numeric target (0 for Introvert, 1 for Extrovert).
- Preprocesses the test data by removing the `id` column.

## Conclusion

This code effectively prepares the dataset by handling missing values and transforming categorical variables into numerical format, which is crucial for training machine learning models.

In [2]:
def preprocess(df):
    # Make a copy of the DataFrame
    df = df.copy()
    
    # Fill categorical variables and map them to numeric values
    df['Stage_fear'] = df['Stage_fear'].fillna('Unknown').map({'Yes': 1, 'No': 0, 'Unknown': -1})
    df['Drained_after_socializing'] = df['Drained_after_socializing'].fillna('Unknown').map({'Yes': 1, 'No': 0, 'Unknown': -1})
    
    # Select numeric columns for imputation
    num_cols = ['Time_spent_Alone', 'Social_event_attendance', 
                'Going_outside', 'Friends_circle_size', 
                'Post_frequency']
    
    # Use IterativeImputer to fill missing numeric values
    imputer = IterativeImputer(random_state=42)
    df[num_cols] = imputer.fit_transform(df[num_cols])
    
    return df

# Load the datasets
train = pd.read_csv('/kaggle/input/playground-series-s5e7/train.csv')
test = pd.read_csv('/kaggle/input/playground-series-s5e7/test.csv')

# Preprocess training data
X = preprocess(train.drop(['id', 'Personality'], axis=1))
y = train['Personality'].map({'Introvert': 0, 'Extrovert': 1})  # Target variable

# Preprocess test data
X_test = preprocess(test.drop('id', axis=1))

# Optionally print the first few rows of preprocessed data to verify
print(X.head())
print(y.head())
print(X_test.head())

   Time_spent_Alone  Stage_fear  Social_event_attendance  Going_outside  \
0               0.0           0                      6.0            4.0   
1               1.0           0                      7.0            3.0   
2               6.0           1                      1.0            0.0   
3               3.0           0                      7.0            3.0   
4               1.0           0                      4.0            4.0   

   Drained_after_socializing  Friends_circle_size  Post_frequency  
0                          0                 15.0        5.000000  
1                          0                 10.0        8.000000  
2                         -1                  3.0        0.000000  
3                          0                 11.0        5.000000  
4                          0                 13.0        5.664129  
0    1
1    1
2    0
3    1
4    1
Name: Personality, dtype: int64
   Time_spent_Alone  Stage_fear  Social_event_attendance  Going_outside  \

# 4. Model Training and Prediction for Personality Classification

This section outlines the process of training a Random Forest classifier and making predictions for personality types.

1. Split the preprocessed data into training and validation sets.
2. Initialize and train a Random Forest Classifier on the training data.
3. Evaluate the model's accuracy and performance on the validation set.
4. Predict personality types for the test dataset.
5. Map the predictions from numerical values back to 'Introvert' or 'Extrovert'.
6. Create a DataFrame with the test IDs and their corresponding predicted personalities.
7. Save the results to a CSV file for further analysis.

This process allows for the classification of personalities based on behavioral traits using a machine learning approach.
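Steps 1 to 3 can be tried in isolation on synthetic data. The sketch below is illustrative only: the synthetic features stand in for the preprocessed dataset, and the `stratify` argument is an optional addition that the notebook's own split does not use. It may be worth considering because the validation report shows unequal class sizes (952 versus 2753 rows).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features: 7 columns, roughly 25% minority class
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=7, n_informative=4, n_redundant=0,
    weights=[0.25, 0.75], random_state=42,
)

# stratify keeps the class ratio similar in the training and validation partitions
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

rf_demo = RandomForestClassifier(random_state=42)
rf_demo.fit(X_tr, y_tr)

val_acc = accuracy_score(y_va, rf_demo.predict(X_va))
print(f"Validation accuracy: {val_acc:.3f}")
```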

In [3]:
# Now proceed with the next steps
# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_val)
print("Accuracy:", accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

# Make predictions on the test data
test_predictions = model.predict(X_test)

# Map back to original labels
test_predictions_labels = np.where(test_predictions == 0, 'Introvert', 'Extrovert')

# Create a DataFrame to save the results with descriptive labels
results = pd.DataFrame({'id': test['id'], 'Predicted_Personality': test_predictions_labels})

# Save to CSV
results.to_csv('result_r_forest.csv', index=False)


Accuracy: 0.9670715249662618
              precision    recall  f1-score   support

           0       0.95      0.92      0.94       952
           1       0.97      0.98      0.98      2753

    accuracy                           0.97      3705
   macro avg       0.96      0.95      0.96      3705
weighted avg       0.97      0.97      0.97      3705



# 5. CatBoost Model Training and Prediction for Personality Classification

This section describes the steps for using the CatBoost classifier to predict personality types.

1. Initialize the CatBoost model with specific parameters.
2. Fit the model on the training data.
3. Make predictions on the validation set and evaluate the model's accuracy and performance.
4. Predict personality types for the test dataset.
5. Map predictions back to 'Introvert' or 'Extrovert'.
6. Create a DataFrame to store the test IDs and predicted personalities.
7. Save the results to a CSV file for further analysis.

This process employs the CatBoost algorithm for effective personality classification based on behavioral data.

In [4]:
# Initialize the CatBoost model
catboost_model = CatBoostClassifier(iterations=500, learning_rate=0.1, depth=6, random_state=42, verbose=0)

# Fit the model on the training data
catboost_model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = catboost_model.predict(X_val)

# Evaluate the model
print("Accuracy:", accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

# Make predictions on the test data
catboost_predictions = catboost_model.predict(X_test)

# Map back to original labels
catboost_predictions_labels = np.where(catboost_predictions == 0, 'Introvert', 'Extrovert')

# Create a DataFrame to save the results with descriptive labels
catboost_results = pd.DataFrame({'id': test['id'], 'Predicted_Personality': catboost_predictions_labels})

# Save to CSV
catboost_results.to_csv('result_c_boost.csv', index=False)  # Save results

Accuracy: 0.9673414304993252
              precision    recall  f1-score   support

           0       0.94      0.93      0.94       952
           1       0.98      0.98      0.98      2753

    accuracy                           0.97      3705
   macro avg       0.96      0.95      0.96      3705
weighted avg       0.97      0.97      0.97      3705



# 6. XGBoost Model Training and Prediction for Personality Classification

This section outlines the use of the XGBoost classifier to predict personality types.

1. Initialize the XGBoost model with specified parameters.
2. Fit the model on the training data.
3. Make predictions on the validation set and evaluate accuracy and performance.
4. Predict personality types for the test dataset.
5. Map predictions back to 'Introvert' or 'Extrovert'.
6. Create a DataFrame to store the test IDs and corresponding predicted personalities.
7. Save the results to a CSV file for further analysis.

This process applies the XGBoost algorithm for effective personality classification based on behavioral data.

In [5]:
# Initialize the XGBoost model
# Note: use_label_encoder is deprecated in recent XGBoost releases and can be omitted there
xgboost_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)

# Fit the model on the training data
xgboost_model.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = xgboost_model.predict(X_val)

# Evaluate the model
print("Accuracy:", accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

# Make predictions on the test data
xgboost_predictions = xgboost_model.predict(X_test)

# Map back to original labels
xgboost_predictions_labels = np.where(xgboost_predictions == 0, 'Introvert', 'Extrovert')

# Create a DataFrame to save the results with descriptive labels
xgboost_results = pd.DataFrame({'id': test['id'], 'Predicted_Personality': xgboost_predictions_labels})

# Save to CSV
xgboost_results.to_csv('result_xg_boost.csv', index=False)  # Save results

Accuracy: 0.9665317139001349
              precision    recall  f1-score   support

           0       0.95      0.92      0.93       952
           1       0.97      0.98      0.98      2753

    accuracy                           0.97      3705
   macro avg       0.96      0.95      0.96      3705
weighted avg       0.97      0.97      0.97      3705



# 7. Merging Model Predictions and Final Voting

This section of the notebook focuses on merging predictions from different models (Random Forest, CatBoost, and XGBoost) and applying majority voting to finalize the personality classification results.

## Steps Overview

1. **Load Results**: Import prediction results from the previously trained models.
2. **Load Test Data**: Import the original feature data associated with the test dataset.
3. **Ensure ID Consistency**: Convert all 'id' columns to string type to facilitate merging.
4. **Rename Prediction Columns**: Change prediction column names for clarity and consistency across models.
5. **Merge Predictions**: Combine the predictions from different models into a single DataFrame based on the 'id'.
6. **Define Majority Voting Function**: Establish a function to determine the final prediction based on the majority of model votes.
7. **Apply Majority Voting**: Use the voting function to derive the final predictions for each entry.
8. **Check for Divergence**: Create a new column indicating if the models disagreed on the predictions.
9. **Filter Divergent Predictions**: Retain only those rows where model predictions differ.
10. **Prepare Features DataFrame**: Create a DataFrame containing the relevant features for further analysis.
11. **Merge Feature Data with Divergent Predictions**: Combine the features with the predictions to analyze cases of divergence.
12. **Print Results**: Display the DataFrame containing divergent predictions along with their corresponding feature data.

This approach allows for a comprehensive analysis of model predictions and helps ensure robust classification outcomes.
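Steps 5 to 9 can also be expressed with vectorized pandas operations instead of a row-wise function. The following is a sketch under assumptions: the three small DataFrames and their ids are hypothetical stand-ins for the per-model result files, and the use of `mode` and `nunique` is an alternative to the `Counter`-based function in the cell below, not the notebook's own code.

```python
import pandas as pd

# Hypothetical per-model label predictions keyed by id
rf = pd.DataFrame({"id": ["1", "2", "3"], "RF": ["Introvert", "Extrovert", "Extrovert"]})
cb = pd.DataFrame({"id": ["1", "2", "3"], "CB": ["Introvert", "Extrovert", "Introvert"]})
xg = pd.DataFrame({"id": ["1", "2", "3"], "XGB": ["Introvert", "Introvert", "Introvert"]})

merged = rf.merge(cb, on="id").merge(xg, on="id")
model_cols = ["RF", "CB", "XGB"]

# Most frequent label per row; with three voters and two labels the mode is unique
merged["Final_Prediction"] = merged[model_cols].mode(axis=1)[0]

# A row diverges when the models produced more than one distinct label
merged["Divergence"] = merged[model_cols].nunique(axis=1) > 1

print(merged)
```

Counting distinct labels per row gives the same divergence flag as comparing each model's column against the final prediction.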

In [6]:
# Load the results from previous models
rf_results = pd.read_csv('/kaggle/working/result_r_forest.csv')  # Random Forest results
catboost_results = pd.read_csv('/kaggle/working/result_c_boost.csv')  # CatBoost results
xgboost_results = pd.read_csv('/kaggle/working/result_xg_boost.csv')  # XGBoost results
# Load the original feature data (test set)
test_data = pd.read_csv('/kaggle/input/playground-series-s5e7/test.csv')  # Load test.csv
# Ensure all 'id' columns are of the same type (string)
rf_results['id'] = rf_results['id'].astype(str)
catboost_results['id'] = catboost_results['id'].astype(str)
xgboost_results['id'] = xgboost_results['id'].astype(str)
test_data['id'] = test_data['id'].astype(str)  # Ensure test data's 'id' is a string
# Rename the prediction columns for clarity
rf_results.rename(columns={'Predicted_Personality': 'RF'}, inplace=True)  # Rename for consistency
catboost_results.rename(columns={'Predicted_Personality': 'CB'}, inplace=True)  # Rename for consistency
xgboost_results.rename(columns={'Predicted_Personality': 'XGB'}, inplace=True)  # Rename for consistency
# Merge results on 'id'
comparison_df = rf_results.merge(catboost_results, on='id')
comparison_df = comparison_df.merge(xgboost_results, on='id')
# Define a function to determine majority voting
def majority_vote(row):
    predictions = [row['RF'], row['CB'], row['XGB']]
    return Counter(predictions).most_common(1)[0][0]  # Get the most common prediction
# Apply the majority vote function to each row
comparison_df['Final_Prediction'] = comparison_df.apply(majority_vote, axis=1)
# Create a new column to check if predictions differ
comparison_df['Divergence'] = (comparison_df['RF'] != comparison_df['Final_Prediction']) | \
                              (comparison_df['CB'] != comparison_df['Final_Prediction']) | \
                              (comparison_df['XGB'] != comparison_df['Final_Prediction'])
# Filter to keep only those rows where there is a divergence
divergent_predictions_df = comparison_df[comparison_df['Divergence']]
# Prepare features DataFrame
feature_columns = ['Time_spent_Alone', 'Stage_fear', 
                   'Social_event_attendance', 'Going_outside', 
                   'Drained_after_socializing', 'Friends_circle_size', 
                   'Post_frequency']  # Original feature names
# Create a features DataFrame from the test data
features_df = test_data[feature_columns].copy()
features_df.columns = ['TSA', 'SF', 'SEA', 'GO', 'DAS', 'FCS', 'PF']  # Rename to abbreviations
# Ensure 'id' column in features_df for merging
features_df['id'] = test_data['id'].astype(str)  # Add the id for merging
# Merge the feature columns with the divergent predictions DataFrame
divergent_predictions_df = divergent_predictions_df.merge(features_df, on='id', how='left')

# Prepare the DataFrame for printing
divergent_predictions_to_print = divergent_predictions_df[['id', 'RF', 'CB', 'XGB', 'TSA', 'SF', 'SEA', 'GO', 'DAS', 'FCS', 'PF']].copy()

# Apply color formatting to 'Predicted_Personality' columns
def color_predictions(row):
    for column in ['RF', 'CB', 'XGB']:
        if row[column] == 'Extrovert':
            row[column] = f"{Fore.GREEN}{row[column]}{Style.RESET_ALL}"  # Color in green
    return row

divergent_predictions_to_print = divergent_predictions_to_print.apply(color_predictions, axis=1)

# Print the DataFrame with divergent predictions formatted with borders
print("\nDivergent Predictions (with Feature Columns):")
print(tabulate(divergent_predictions_to_print, headers='keys', tablefmt='grid', showindex=False))


Divergent Predictions (with Feature Columns):
+-------+-----------+-----------+-----------+-------+------+-------+------+-------+-------+------+
|    id | RF        | CB        | XGB       |   TSA | SF   |   SEA |   GO | DAS   |   FCS |   PF |
| 18753 | Introvert | Extrovert | Extrovert |   nan | No   |     9 |    6 | No    |     6 |    3 |
+-------+-----------+-----------+-----------+-------+------+-------+------+-------+-------+------+
| 18876 | Extrovert | Extrovert | Introvert |    10 | No   |   nan |    6 | No    |    10 |    7 |
+-------+-----------+-----------+-----------+-------+------+-------+------+-------+-------+------+
| 18973 | Introvert | Introvert | Extrovert |     9 | Yes  |     2 |    0 | Yes   |   nan |    2 |
+-------+-----------+-----------+-----------+-------+------+-------+------+-------+-------+------+
| 18977 | Extrovert | Introvert | Extrovert |     5 | No   |     9 |    7 | No    |    11 |    3 |

# 8. Generating Submission File from Model Predictions

This section describes the process of creating the final submission file `submission.csv` using the predictions obtained from various machine learning models (Random Forest, CatBoost, and XGBoost).

## Steps Overview

1. **Reuse Predictions**: The cell reuses `comparison_df`, the DataFrame built in Section 7 that holds the predicted personality type from each model (columns `RF`, `CB`, and `XGB`).

2. **Define Voting Function**: A function (`final_prediction`) is defined to determine the final prediction for each entry by applying a majority voting mechanism. The function:
   - Counts the predictions from the three models for a given row.
   - Selects the prediction that appears most frequently as the final prediction.

3. **Compute Final Predictions**: The majority vote function is applied to each row of the `comparison_df` to create a new column called `Final_Prediction`. This column encapsulates the final decision on the predicted personality based on the majority.

4. **Prepare Submission DataFrame**: A new DataFrame (`submission_df`) is created containing only the `id` and the `Final_Prediction`, which is renamed to `Predicted_Personality` for clarity.

5. **Save to CSV**: The final DataFrame is saved as a CSV file named `submission.csv`, ready for submission or further analysis.

This approach ensures that the predictions included in the submission file are based on the majority consensus of the models, thereby increasing the chances of obtaining accurate and reliable results.
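Before uploading, the written file can be checked for basic consistency. The sketch below is a hypothetical helper, not part of the notebook: `check_submission`, the toy ids, and the expected column names are assumptions chosen to mirror the two-column layout described above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical final predictions; in the notebook these come from comparison_df
submission_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "Predicted_Personality": ["Extrovert", "Introvert", "Extrovert"],
})

def check_submission(path, expected_ids, label_col="Predicted_Personality"):
    """Return a list of problems found in the CSV at `path` (empty if none)."""
    df = pd.read_csv(path)
    problems = []
    if list(df.columns) != ["id", label_col]:
        problems.append(f"unexpected columns: {list(df.columns)}")
        return problems
    if df["id"].duplicated().any():
        problems.append("duplicate ids")
    if set(df["id"].tolist()) != set(expected_ids):
        problems.append("ids do not match the test set")
    if not df[label_col].isin(["Introvert", "Extrovert"]).all():
        problems.append("labels outside {Introvert, Extrovert}")
    return problems

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "submission.csv")
    submission_demo.to_csv(path, index=False)
    problems = check_submission(path, expected_ids=[1, 2, 3])

print(problems or "submission looks consistent")
```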

In [7]:
import pandas as pd
from collections import Counter

# Assumes a DataFrame 'comparison_df' already populated with the RF, CB, and XGB predictions

# Define a function that determines the final prediction by voting
def final_prediction(row):
    # Collect each model's prediction
    predictions = [row['RF'], row['CB'], row['XGB']]
    # Compute the majority prediction
    majority_vote = Counter(predictions).most_common(1)[0][0]  # Get the most common prediction
    return majority_vote

# Add the final prediction column to the DataFrame
comparison_df['Final_Prediction'] = comparison_df.apply(final_prediction, axis=1)

# Create the submission DataFrame
submission_df = comparison_df[['id', 'Final_Prediction']].copy()
submission_df.rename(columns={'Final_Prediction': 'Predicted_Personality'}, inplace=True)

# Save the DataFrame as CSV
submission_df.to_csv('submission.csv', index=False)

print("Submission file 'submission.csv' has been created.")

Submission file 'submission.csv' has been created.
