# Magic Personality Matcher - Dataset creation

We found several datasets that could serve as source data and decided to merge all of them, so that we can easily increase the amount of training data later if needed.

The datasets used are the following:
- [Twitter MBTI](https://www.kaggle.com/datasets/mazlumi/mbti-personality-type-twitter-dataset)
- [MBTI 1](https://www.kaggle.com/datasets/datasnaek/mbti-type)
- [MBTI 500](https://www.kaggle.com/datasets/zeyadkhalid/mbti-personality-types-500-dataset)

In [10]:
personality_num = {
    'ISTJ': 1,
    'ISFJ': 2,
    'INFJ': 3,
    'INTJ': 4,
    'ISTP': 5,
    'ISFP': 6,
    'INFP': 7,
    'INTP': 8,
    'ESTP': 9,
    'ESFP': 10,
    'ENFP': 11,
    'ENTP': 12,
    'ESTJ': 13,
    'ESFJ': 14,
    'ENFJ': 15,
    'ENTJ': 16
}

num_personality = {
    1: 'ISTJ',
    2: 'ISFJ',
    3: 'INFJ',
    4: 'INTJ',
    5: 'ISTP',
    6: 'ISFP',
    7: 'INFP',
    8: 'INTP',
    9: 'ESTP',
    10: 'ESFP',
    11: 'ENFP',
    12: 'ENTP',
    13: 'ESTJ',
    14: 'ESFJ',
    15: 'ENFJ',
    16: 'ENTJ'
}
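Since `num_personality` is simply the inverse of `personality_num`, the two tables could drift apart if one of them is edited by hand. As a minimal sketch, assuming the same 16 labels in the same order, the inverse can be derived from the forward mapping and checked with a round trip. The list name `MBTI_TYPES` is introduced here only for illustration.

```python
# Hypothetical helper list; same order as the mapping defined above.
MBTI_TYPES = [
    'ISTJ', 'ISFJ', 'INFJ', 'INTJ',
    'ISTP', 'ISFP', 'INFP', 'INTP',
    'ESTP', 'ESFP', 'ENFP', 'ENTP',
    'ESTJ', 'ESFJ', 'ENFJ', 'ENTJ',
]

# Forward mapping: label -> number, starting at 1.
personality_num = {name: i for i, name in enumerate(MBTI_TYPES, start=1)}

# Inverse mapping derived from the forward one, so the two cannot disagree.
num_personality = {num: name for name, num in personality_num.items()}

# Round-trip check: every label maps to a number and back to itself.
assert all(num_personality[personality_num[name]] == name for name in MBTI_TYPES)
```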

First we load the datasets with Pandas.

In [11]:
import pandas as pd

# Read the csv file in
dataset_twittermbti = pd.read_csv('twitter_MBTI.csv')
dataset_mbti1 = pd.read_csv('mbti_1.csv')
dataset_mbti500 = pd.read_csv('mbti_500.csv')

# Print info
print("Dataset Twitter MBTI")
print(dataset_twittermbti.head())
print(dataset_twittermbti.shape)

print("Dataset MBTI 1")
print(dataset_mbti1.head())
print(dataset_mbti1.shape)

print("Dataset MBTI 500")
print(dataset_mbti500.head())
print(dataset_mbti500.shape)

Dataset Twitter MBTI
   Unnamed: 0                                               text label
0           0  @Pericles216 @HierBeforeTheAC @Sachinettiyil T...  intj
1           1  @Hispanthicckk Being you makes you look cute||...  intj
2           2  @Alshymi Les balles sont réelles et sont tirée...  intj
3           3  I'm like entp but idiotic|||Hey boy, do you wa...  intj
4           4  @kaeshurr1 Give it to @ZargarShanif ... He has...  intj
(7811, 3)
Dataset MBTI 1
   type                                              posts
0  INFJ  'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1  ENTP  'I'm finding the lack of me in these posts ver...
2  INTP  'Good one  _____   https://www.youtube.com/wat...
3  INTJ  'Dear INTP,   I enjoyed our conversation the o...
4  ENTJ  'You're fired.|||That's another silly misconce...
(8675, 2)
Dataset MBTI 500
                                               posts  type
0  know intj tool use interaction people excuse a...  INTJ
1  rap music ehh opp yeah kno

We rename the columns so that they match across the three datasets.

In [12]:
# Rename columns
dataset_twittermbti.rename(
    columns={'label': 'personality', 'text': 'post'}, inplace=True)

dataset_mbti1.rename(
    columns={'type': 'personality', 'posts': 'post'}, inplace=True)

dataset_mbti500.rename(
    columns={'type': 'personality', 'posts': 'post'}, inplace=True)

We concatenate the three datasets and keep only the two columns we are interested in, `personality` and `post`. The personality labels are converted to uppercase in a later step, when they are mapped to numbers.

In [15]:
df = pd.concat(
    [dataset_twittermbti, dataset_mbti500, dataset_mbti1])
df = df[['personality', 'post']]
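To make the rename, concatenate, and select sequence easier to follow, here is a small sketch on toy data. The rows are invented and the `toy_` names are hypothetical. It also passes `ignore_index=True` to `pd.concat`, which the cell above does not do (there the index is reset later, after the posts are split), so treat that as an optional variation.

```python
import pandas as pd

# Toy stand-ins for two of the source datasets (invented rows).
toy_twitter = pd.DataFrame({
    'Unnamed: 0': [0],
    'text': ['first tweet|||second tweet'],
    'label': ['intj'],
})
toy_mbti1 = pd.DataFrame({
    'type': ['INFP'],
    'posts': ['a forum post|||another forum post'],
})

# Bring both frames to the same column names.
toy_twitter = toy_twitter.rename(columns={'label': 'personality', 'text': 'post'})
toy_mbti1 = toy_mbti1.rename(columns={'type': 'personality', 'posts': 'post'})

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
toy_df = pd.concat([toy_twitter, toy_mbti1], ignore_index=True)

# Selecting the two columns also drops the leftover 'Unnamed: 0' column.
toy_df = toy_df[['personality', 'post']]
print(toy_df)
```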

Because each entry in the datasets can hold several posts separated by `|||`, we split them so that every post becomes its own entry for training. Posts shorter than 20 characters are also discarded as irrelevant.

In [None]:
# Raw string so the regex escapes are not read as (invalid) Python escapes
df['post'] = df['post'].str.split(r"\|\|\|")
df = df.explode('post').reset_index(drop=True)
df = df[df['post'].str.len() >= 20]
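The effect of the split, explode, and filter steps is easier to see on a tiny example. The sketch below uses two invented rows. Each row is split on `|||`, every piece becomes its own row with the personality label repeated, and pieces shorter than 20 characters are dropped.

```python
import pandas as pd

# Two invented rows, each holding two posts separated by '|||'.
toy_df = pd.DataFrame({
    'personality': ['intj', 'INFP'],
    'post': [
        'short|||this post is long enough to keep',
        'another post that passes the filter|||ok',
    ],
})

# Split each entry into a list of posts (the pattern is a regex, hence the escapes).
toy_df['post'] = toy_df['post'].str.split(r'\|\|\|')

# One row per post; the personality label is repeated for each piece.
toy_df = toy_df.explode('post').reset_index(drop=True)

# Keep only posts with at least 20 characters.
toy_df = toy_df[toy_df['post'].str.len() >= 20]
print(toy_df)
```

Note that the filter runs after `reset_index`, so the surviving rows keep gaps in their index. This is harmless here because the index is not written to the CSV.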

The personality labels are replaced with numbers so that the dataset uses less memory.

In [None]:
df['personality'] = df['personality'].apply(
    lambda x: personality_num[x.upper()])
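To illustrate the memory argument, the following sketch compares the in-memory size of a label column stored as strings, as default integers, and as `int8` (16 classes fit comfortably in 8 bits). It uses a two-entry subset of the mapping and invented labels. The `int8` downcast and the use of `Series.map` instead of `apply` are assumptions added for illustration, not part of the pipeline above.

```python
import pandas as pd

# Subset of the full mapping, for brevity.
personality_num = {'INTJ': 4, 'INFP': 7}

# Invented label column with mixed case, as in the raw datasets.
labels = pd.Series(['intj', 'INFP', 'INTJ', 'infp'] * 1000)

# Normalize case, then look up each label; unknown labels would become NaN.
codes = labels.str.upper().map(personality_num)
assert codes.notna().all()

# Optional downcast: values 1..16 fit in a signed 8-bit integer.
codes_small = codes.astype('int8')

# Compare the memory used by the values alone (index excluded).
print('strings:', labels.memory_usage(index=False, deep=True))
print('default ints:', codes.memory_usage(index=False, deep=True))
print('int8:', codes_small.memory_usage(index=False, deep=True))
```

The saving only applies while the data is in memory. A CSV file stores plain text, so the dtype has to be requested again when the file is read back.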

Finally, we save the result.

In [16]:
df.to_csv(
    "dataset_definitivo.csv", index=False)

print("Dataset Definitivo")
print(df.shape)
print("Head")
print(df.head())
print("Tail")
print(df.tail())

Dataset Definitivo
(1519248, 2)
Head
   personality                                               post
0            4  @Pericles216 @HierBeforeTheAC @Sachinettiyil T...
1            4  @HierBeforeTheAC @Pericles216 @Sachinettiyil A...
2            4  @HierBeforeTheAC @Pericles216 @Sachinettiyil Y...
3            4  @HierBeforeTheAC @Pericles216 @Sachinettiyil Y...
4            4  @HierBeforeTheAC @Pericles216 @Sachinettiyil T...
Tail
         personality                                               post
1622106            7  I was going to close my facebook a few months ...
1622107            7  30 Seconds to Mars - All of my collections. It...
1622108            7  I have seen it, and i agree. I did actually th...
1622109            7  Ok so i have just watched Underworld 4 (Awaken...
1622110            7  I would never want to turn off my emotions. so...
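When the saved file is loaded for training, the numeric column can be read back with a compact dtype and decoded to labels where needed. The sketch below reads from an in-memory string instead of `dataset_definitivo.csv` so that it runs on its own. The rows are invented, and the `label` column name and the `int8` dtype are assumptions for illustration.

```python
import io
import pandas as pd

# Subset of the inverse mapping, for brevity.
num_personality = {4: 'INTJ', 7: 'INFP'}

# Stand-in for the saved CSV (invented rows, same two columns).
csv_text = (
    "personality,post\n"
    "4,some example post text here\n"
    "7,another example post text\n"
)

# Request a small integer dtype explicitly; CSV itself carries no dtype information.
loaded = pd.read_csv(io.StringIO(csv_text), dtype={'personality': 'int8'})

# Decode the numbers back to MBTI labels in a hypothetical 'label' column.
loaded['label'] = loaded['personality'].map(num_personality)
print(loaded)
```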
