# Exploratory data analysis of introvert and extrovert behavior

The data analyzed here come from Kaggle. According to the author, Rakesh Kapilavayi, they were collected through several Google Forms surveys for a university research project focused on students' personality traits and behavioral tendencies.

The dataset has eight columns, grouped as follows:
- **Categorical columns:**
    - **Stage_fear:** whether the respondent has stage fright.
    - **Drained_after_socializing:** whether the respondent feels drained after socializing.
    - **Personality:** whether the respondent is an introvert or an extrovert. This is the target column.

- **Numeric columns:**
    - **Time_spent_Alone:** time spent alone, on a scale from 0 to 11.
    - **Social_event_attendance:** attendance at social events, on a scale from 0 to 10.
    - **Going_outside:** how often the respondent goes outside, on a scale from 0 to 7.
    - **Friends_circle_size:** number of close friends.
    - **Post_frequency:** how often the respondent posts on social media, on a scale from 0 to 10.


    

### EDA - Outline:
#### 0. Imports and data loading
#### 1. Basic information about the data
#### 2. Checking for nulls and duplicates
#### 3. Cleaning and transformation
#### 4. Analysis and visualizations

### 0. Imports and data loading

In [1]:
# Required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load the CSV:
df = pd.read_csv('../data/personality_dataset.csv')

### 1. Basic information about the data

In [3]:
df.shape

(2900, 8)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2900 entries, 0 to 2899
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Time_spent_Alone           2837 non-null   float64
 1   Stage_fear                 2827 non-null   object 
 2   Social_event_attendance    2838 non-null   float64
 3   Going_outside              2834 non-null   float64
 4   Drained_after_socializing  2848 non-null   object 
 5   Friends_circle_size        2823 non-null   float64
 6   Post_frequency             2835 non-null   float64
 7   Personality                2900 non-null   object 
dtypes: float64(5), object(3)
memory usage: 181.4+ KB


### 2. Checking for nulls and duplicates

In [5]:
# Null values per column
df.isna().sum()

Time_spent_Alone             63
Stage_fear                   73
Social_event_attendance      62
Going_outside                66
Drained_after_socializing    52
Friends_circle_size          77
Post_frequency               65
Personality                   0
dtype: int64
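
Raw counts are easier to judge as a share of the rows. The sketch below uses a small hypothetical frame (`toy` is not the real dataset) to show one way of expressing missing values as percentages. For scale, the largest count above, 77 of 2,900 rows, is roughly 2.7%.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "time_spent_alone": [4.0, np.nan, 9.0, 1.0],
    "stage_fear": ["No", "Yes", None, "No"],
    "personality": ["Extrovert", "Introvert", "Introvert", "Extrovert"],
})

# isna() gives a boolean frame; the mean of each boolean column
# is the fraction of missing values in that column
pct_missing = toy.isna().mean().mul(100).round(2)
print(pct_missing)
```

On the notebook's frame the same expression would be `df.isna().mean().mul(100).round(2)`.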

In [6]:
# Duplicated rows
df.duplicated().sum()

388

In [7]:
df[df.duplicated()]

| | Time_spent_Alone | Stage_fear | Social_event_attendance | Going_outside | Drained_after_socializing | Friends_circle_size | Post_frequency | Personality |
|---|---|---|---|---|---|---|---|---|
| 47 | 10.0 | Yes | 1.0 | 2.0 | Yes | 2.0 | 0.0 | Introvert |
| 217 | 5.0 | Yes | 2.0 | 0.0 | Yes | 2.0 | 0.0 | Introvert |
| 246 | 9.0 | Yes | 0.0 | 1.0 | Yes | 2.0 | 1.0 | Introvert |
| 248 | 9.0 | Yes | 0.0 | 2.0 | Yes | 3.0 | 2.0 | Introvert |
| 254 | 7.0 | Yes | 0.0 | 0.0 | Yes | 3.0 | 2.0 | Introvert |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2884 | 11.0 | Yes | 0.0 | 2.0 | Yes | 3.0 | 1.0 | Introvert |
| 2890 | 8.0 | Yes | 2.0 | 0.0 | Yes | 1.0 | 2.0 | Introvert |
| 2891 | 6.0 | Yes | 3.0 | 1.0 | Yes | 5.0 | 1.0 | Introvert |
| 2892 | 9.0 | Yes | 2.0 | 0.0 | Yes | 1.0 | 2.0 | Introvert |
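
It helps to be clear about what the count of 388 measures. With its default `keep="first"`, `duplicated()` flags only the later repeats of a row, not its first occurrence. The sketch below illustrates this on a hypothetical toy frame (`toy` is an assumption, not the real data).

```python
import pandas as pd

# Hypothetical toy frame: rows 0, 1 and 3 are identical
toy = pd.DataFrame({
    "stage_fear": ["Yes", "Yes", "No", "Yes"],
    "personality": ["Introvert", "Introvert", "Extrovert", "Introvert"],
})

# Default keep="first": the first occurrence is not flagged, only later repeats
n_repeats = toy.duplicated().sum()

# keep=False flags every row that belongs to a duplicated group
n_in_groups = toy.duplicated(keep=False).sum()

print(n_repeats, n_in_groups)
```

Passing `keep=False` to the notebook's `df.duplicated()` call would show the first occurrences alongside their repeats.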


### 3. Cleaning and transformation

To avoid errors later on, all column names are converted to lowercase:

In [8]:
df.columns = df.columns.str.lower()
print(df.columns)

Index(['time_spent_alone', 'stage_fear', 'social_event_attendance',
       'going_outside', 'drained_after_socializing', 'friends_circle_size',
       'post_frequency', 'personality'],
      dtype='object')


#### 3.1 Categorical variables

In [9]:
def contar_valores_categoricos(df,columnas):
    for i in columnas:
        print(f'Columna {i}:')
        print(df[i].value_counts())
        print('\n')

In [10]:
categoricas = ['stage_fear', 'drained_after_socializing', 'personality']

In [11]:
contar_valores_categoricos(df,categoricas)

Columna stage_fear:
stage_fear
No     1417
Yes    1410
Name: count, dtype: int64


Columna drained_after_socializing:
drained_after_socializing
No     1441
Yes    1407
Name: count, dtype: int64


Columna personality:
personality
Extrovert    1491
Introvert    1409
Name: count, dtype: int64




In [13]:
def contar_nulos_categoricas(df,columnas):
    for i in columnas:
        nulos = df[i].isna().sum()
        print(f'Valores nulos de {i} = {nulos}')

In [14]:
contar_nulos_categoricas(df,categoricas)

Valores nulos de stage_fear = 73
Valores nulos de drained_after_socializing = 52
Valores nulos de personality = 0


In [15]:
df.groupby('personality')['stage_fear'].value_counts(dropna = False)

personality  stage_fear
Extrovert    No            1338
             Yes            111
             NaN             42
Introvert    Yes           1299
             No              79
             NaN             31
Name: count, dtype: int64

In [16]:
df.groupby('personality')['drained_after_socializing'].value_counts(dropna = False)

personality  drained_after_socializing
Extrovert    No                           1362
             Yes                           111
             NaN                            18
Introvert    Yes                          1296
             No                             79
             NaN                            34
Name: count, dtype: int64

In [19]:
df.groupby(['drained_after_socializing','stage_fear'])['personality'].value_counts()

drained_after_socializing  stage_fear  personality
No                         No          Extrovert      1320
                                       Introvert        79
Yes                        Yes         Introvert      1266
                                       Extrovert       111
Name: count, dtype: int64

In [None]:
df[(df['stage_fear'] == 'Yes')&(df['drained_after_socializing'] == 'No')]

| | time_spent_alone | stage_fear | social_event_attendance | going_outside | drained_after_socializing | friends_circle_size | post_frequency | personality |
|---|---|---|---|---|---|---|---|---|


Among the rows where both answers are present, this grouping shows that every student who feels drained after socializing also has stage fright, and every student who does not feel drained has no stage fright. Null values in either column can therefore be imputed from the other one, in both directions.
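
One way to check this kind of agreement directly is to cross-tabulate the two columns on the rows where both are observed. The sketch below does so on a hypothetical toy frame (`toy`, `both` and `columns_agree` are illustrative names, not part of the notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing answers
toy = pd.DataFrame({
    "stage_fear": ["Yes", "No", "Yes", np.nan, "No"],
    "drained_after_socializing": ["Yes", "No", np.nan, "No", "No"],
})

# Keep only rows where both answers are present
both = toy.dropna(subset=["stage_fear", "drained_after_socializing"])

# Any count off the Yes/Yes and No/No cells would be a mismatch
print(pd.crosstab(both["stage_fear"], both["drained_after_socializing"]))

# True only if the two columns agree on every fully observed row
columns_agree = (both["stage_fear"] == both["drained_after_socializing"]).all()
print(columns_agree)
```

If `columns_agree` were `False` on the real data, imputing one column from the other would not be justified.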

In [20]:
df[(df['stage_fear'] == 'Yes') & (df['drained_after_socializing'].isna())].shape[0]

33

In [21]:
df[(df['drained_after_socializing'] == 'Yes') & (df['stage_fear'].isna())].shape[0]

30

In [22]:
df[(df['drained_after_socializing'] == 'No') & (df['stage_fear'].isna())].shape[0]

42

In [29]:
# Impute stage_fear from drained_after_socializing:
df.loc[(df['drained_after_socializing'] == 'Yes') & (df['stage_fear'].isna()),'stage_fear'] = 'Yes'
df.loc[(df['drained_after_socializing'] == 'No') & (df['stage_fear'].isna()),'stage_fear'] = 'No'

In [32]:
# Impute drained_after_socializing from stage_fear:
df.loc[(df['stage_fear'] == 'Yes') & (df['drained_after_socializing'].isna()),'drained_after_socializing'] = 'Yes'
df.loc[(df['stage_fear'] == 'No') & (df['drained_after_socializing'].isna()),'drained_after_socializing'] = 'No'
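
Because both columns use the same `Yes`/`No` labels and agree wherever both are observed, the same imputation can also be written with `fillna`. This is an alternative sketch on a hypothetical toy frame, not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the last row is missing both answers
toy = pd.DataFrame({
    "stage_fear": ["Yes", np.nan, "No", np.nan],
    "drained_after_socializing": [np.nan, "No", "No", np.nan],
})

# Snapshot both columns first so each fill uses the original values
fear = toy["stage_fear"].copy()
drained = toy["drained_after_socializing"].copy()

# Fill each column's gaps with the value from the other column
toy["stage_fear"] = fear.fillna(drained)
toy["drained_after_socializing"] = drained.fillna(fear)
print(toy)
```

Rows where both answers are missing stay null with either approach, since there is nothing to copy from.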

In [33]:
contar_nulos_categoricas(df,categoricas)

Valores nulos de stage_fear = 1
Valores nulos de drained_after_socializing = 1
Valores nulos de personality = 0


In [35]:
df.loc[df.stage_fear.isna()]

| | time_spent_alone | stage_fear | social_event_attendance | going_outside | drained_after_socializing | friends_circle_size | post_frequency | personality |
|---|---|---|---|---|---|---|---|---|
| 1517 | 4.0 | NaN | 3.0 | 0.0 | NaN | 2.0 | 0.0 | Introvert |


#### 3.2 Numeric variables

In [37]:
df.describe()

| | time_spent_alone | social_event_attendance | going_outside | friends_circle_size | post_frequency |
|---|---|---|---|---|---|
| count | 2837.0 | 2838.0 | 2834.0 | 2823.0 | 2835.0 |
| mean | 4.505816 | 3.963354 | 3.0 | 6.268863 | 3.564727 |
| std | 3.479192 | 2.903827 | 2.247327 | 4.289693 | 2.926582 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 2.0 | 1.0 | 3.0 | 1.0 |
| 50% | 4.0 | 3.0 | 3.0 | 5.0 | 3.0 |
| 75% | 8.0 | 6.0 | 5.0 | 10.0 | 6.0 |
| max | 11.0 | 10.0 | 7.0 | 15.0 | 10.0 |


In [41]:
def medias_numericas_target(df, columnas, target):
    for i in columnas:
        print(f'{i} según {target}:')
        print(df.groupby(target)[i].mean().round())
        print('\n')

In [47]:
target = 'personality'
numericas = [i for i in df.columns if (i not in categoricas) and (i != target)]

In [48]:
numericas

['time_spent_alone',
 'social_event_attendance',
 'going_outside',
 'friends_circle_size',
 'post_frequency']
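
The same list can also be derived from the column dtypes rather than by exclusion. The sketch below shows this with `select_dtypes` on a hypothetical toy frame. It assumes the `Yes`/`No` columns are still stored as strings; if they were later encoded as 0/1 they would be picked up as numeric too.

```python
import pandas as pd

# Hypothetical toy frame mixing numeric and string columns
toy = pd.DataFrame({
    "time_spent_alone": [4.0, 9.0],
    "stage_fear": ["No", "Yes"],
    "friends_circle_size": [13.0, 0.0],
    "personality": ["Extrovert", "Introvert"],
})

# Columns with a numeric dtype, in their original order
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```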

In [49]:
medias_numericas_target(df,numericas,target)

time_spent_alone según personality:
personality
Extrovert    2.0
Introvert    7.0
Name: time_spent_alone, dtype: float64


social_event_attendance según personality:
personality
Extrovert    6.0
Introvert    2.0
Name: social_event_attendance, dtype: float64


going_outside según personality:
personality
Extrovert    5.0
Introvert    1.0
Name: going_outside, dtype: float64


friends_circle_size según personality:
personality
Extrovert    9.0
Introvert    3.0
Name: friends_circle_size, dtype: float64


post_frequency según personality:
personality
Extrovert    6.0
Introvert    1.0
Name: post_frequency, dtype: float64
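
Calling `.round()` with no arguments rounds to zero decimals, which hides part of the gap between the two groups. The sketch below keeps two decimals and computes all group means in a single `groupby` instead of a loop. It is an optional variant on hypothetical toy data, not the notebook's function.

```python
import pandas as pd

# Hypothetical toy frame with two numeric columns and the target
toy = pd.DataFrame({
    "personality": ["Extrovert", "Extrovert", "Introvert", "Introvert"],
    "time_spent_alone": [1.0, 2.0, 7.0, 10.0],
    "post_frequency": [5.0, 8.0, 1.0, 2.0],
})

# One groupby over several columns gives a compact table of group means
means = (
    toy.groupby("personality")[["time_spent_alone", "post_frequency"]]
    .mean()
    .round(2)
)
print(means)
```

With the notebook's names, the equivalent call would be `df.groupby(target)[numericas].mean().round(2)`.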




In [58]:
def moda_numericas_target(df, columnas, target):
    for i in columnas:
        print(f'Moda de {i} según {target}:')
        print(df.groupby(target)[i].agg(lambda x: x.mode()))
        print('\n')

In [59]:
moda_numericas_target(df,numericas,target)

Moda de time_spent_alone según personality:
personality
Extrovert    3.0
Introvert    9.0
Name: time_spent_alone, dtype: float64


Moda de social_event_attendance según personality:
personality
Extrovert    4.0
Introvert    2.0
Name: social_event_attendance, dtype: float64


Moda de going_outside según personality:
personality
Extrovert    5.0
Introvert    0.0
Name: going_outside, dtype: float64


Moda de friends_circle_size según personality:
personality
Extrovert           8.0
Introvert    [2.0, 3.0]
Name: friends_circle_size, dtype: object


Moda de post_frequency según personality:
personality
Extrovert    7.0
Introvert    2.0
Name: post_frequency, dtype: float64
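
The `[2.0, 3.0]` entry and the `object` dtype for `friends_circle_size` appear because `Series.mode()` returns every tied value, not a single number. The sketch below shows two ways to make that explicit on hypothetical toy data: keep all tied modes as a list, or keep only the smallest one.

```python
import pandas as pd

# Hypothetical toy frame: the Introvert group has a tie between 2.0 and 3.0
toy = pd.DataFrame({
    "personality": ["Extrovert"] * 3 + ["Introvert"] * 4,
    "friends_circle_size": [8.0, 8.0, 5.0, 2.0, 2.0, 3.0, 3.0],
})

grouped = toy.groupby("personality")["friends_circle_size"]

# Option 1: keep every tied mode as a list per group
modes = grouped.agg(lambda s: s.mode().tolist())

# Option 2: keep a single value per group (mode() sorts ties, so this is the smallest)
first_mode = grouped.agg(lambda s: s.mode().iloc[0])

print(modes)
print(first_mode)
```

Option 2 keeps a numeric dtype, at the cost of hiding the tie.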




In [72]:
def comparativa_media_mediana_moda(df,columnas,target):
    for i in columnas:
        print(f'Media de {i} según {target}:')
        print(df.groupby(target)[i].mean().round())
        print('\n')

        print(f'Mediana de {i} según {target}:')
        print(df.groupby(target)[i].agg(lambda x: x.median()))
        print('\n')

        print(f'Moda de {i} según {target}:')
        print(df.groupby(target)[i].agg(lambda x: x.mode()))
        print('\n---------------------------------------------------------\n')


In [74]:
comparativa_media_mediana_moda(df,numericas,target)

Media de time_spent_alone según personality:
personality
Extrovert    2.0
Introvert    7.0
Name: time_spent_alone, dtype: float64


Mediana de time_spent_alone según personality:
personality
Extrovert    2.0
Introvert    7.0
Name: time_spent_alone, dtype: float64


Moda de time_spent_alone según personality:
personality
Extrovert    3.0
Introvert    9.0
Name: time_spent_alone, dtype: float64

---------------------------------------------------------

Media de social_event_attendance según personality:
personality
Extrovert    6.0
Introvert    2.0
Name: social_event_attendance, dtype: float64


Mediana de social_event_attendance según personality:
personality
Extrovert    6.0
Introvert    2.0
Name: social_event_attendance, dtype: float64


Moda de social_event_attendance según personality:
personality
Extrovert    4.0
Introvert    2.0
Name: social_event_attendance, dtype: float64

---------------------------------------------------------

Media de going_outside según personality:
personali