## 1. Data Understanding

This section focuses on loading the dataset and understanding its structure, features, and label distribution before any preprocessing or modeling.

The dataset contains **1,480 essays**, each labeled for the **five OCEAN personality traits**.  
Each row pairs one author's written response with a binary label for every trait.


In [4]:
import pandas as pd
import numpy as np

# Load dataset (update path if needed)
df = pd.read_csv("../data/raw/essaytrain.csv", encoding="latin1")

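# Preview data and check dimensions.
# Note: a notebook cell displays only its last expression, so just df.shape is shown below;
# run df.head() in its own cell to see the preview.
df.head()
df.shape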

(1480, 7)

In [5]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1480 entries, 0 to 1479
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   #AUTHID  1480 non-null   object
 1   TEXT     1480 non-null   object
 2   cEXT     1480 non-null   object
 3   cNEU     1480 non-null   object
 4   cAGR     1480 non-null   object
 5   cCON     1480 non-null   object
 6   cOPN     1480 non-null   object
dtypes: object(7)
memory usage: 81.1+ KB


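Every column reports 1,480 non-null entries, so the dataset has no null values. All seven columns, including the trait labels, are stored as strings (`object` dtype). The explicit check below confirms the absence of nulls.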

In [6]:
# Missing values
df.isnull().sum()


#AUTHID    0
TEXT       0
cEXT       0
cNEU       0
cAGR       0
cCON       0
cOPN       0
dtype: int64
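
Note that `isnull()` only counts true missing values (`NaN`/`None`); an empty or whitespace-only string would still register as non-null. The sketch below shows one way to check for both on a small made-up frame. The `toy` DataFrame and its contents are illustrative assumptions, not rows from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({"TEXT": ["some essay text", "   ", "", None]})

# True missing values (NaN / None)
n_null = int(toy["TEXT"].isnull().sum())

# Non-null entries that are empty once surrounding whitespace is stripped
n_blank = int(toy["TEXT"].dropna().str.strip().eq("").sum())

print(f"null: {n_null}, blank: {n_blank}")
```

On the real data, the same two lines could be applied to `df["TEXT"]`.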

### Dataset Columns Description

- **#AUTHID**: A unique identifier for the author (person) who wrote the essay.
- **TEXT**: Written response or essay provided by the individual.
- **cOPN**: Binary label for Openness.
- **cCON**: Binary label for Conscientiousness.
- **cEXT**: Binary label for Extraversion.
- **cAGR**: Binary label for Agreeableness.
- **cNEU**: Binary label for Neuroticism.


In this dataset, each trait column holds either `y` or `n`, indicating a high or low value of that trait, respectively.

In [8]:
df.describe()


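|        | #AUTHID      | TEXT                                               | cEXT | cNEU | cAGR | cCON | cOPN |
|--------|--------------|----------------------------------------------------|------|------|------|------|------|
| count  | 1480         | 1480                                               | 1480 | 1480 | 1480 | 1480 | 1480 |
| unique | 1480         | 1480                                               | 2    | 2    | 2    | 2    | 2    |
| top    | 2003_199.txt | well, here I go writing a stream of consciousn...  | y    | y    | y    | y    | y    |
| freq   | 1            | 1                                                  | 752  | 761  | 799  | 762  | 741  |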
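
The `freq` row gives only the count of the most common label (`y` in every trait column). With 1,480 rows, that puts the `y` share between roughly 50% (741 for cOPN) and 54% (799 for cAGR), so the classes look approximately balanced. A minimal sketch for viewing both classes as proportions follows; the `toy` frame and its values are assumptions made up for illustration, and only two trait columns are included for brevity.

```python
import pandas as pd

# Hypothetical stand-in for df; the labels are invented for illustration.
toy = pd.DataFrame({
    "cEXT": ["y", "n", "y", "y"],
    "cNEU": ["n", "n", "y", "y"],
})
label_cols = ["cEXT", "cNEU"]

# Proportion of each label per trait (rows: label value, columns: trait)
label_share = toy[label_cols].apply(lambda s: s.value_counts(normalize=True))

print(label_share)
```

On the real data, `label_cols` would list all five trait columns.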


Essay length may differ considerably between authors, and verbosity could plausibly relate to personality traits.
As a first check, we compute a word count for each essay; the relationship to the labels will be explored further during EDA.


In [10]:
# Word count
df['word_count'] = df['TEXT'].apply(lambda x: len(str(x).split()))

df['word_count'].describe()


count    1480.000000
mean      651.094595
std       256.670659
min        33.000000
25%       470.000000
50%       622.000000
75%       811.000000
max      2500.000000
Name: word_count, dtype: float64

The 1,480 essays average about 651 words (median 622).
The middle 50% of essays fall between 470 and 811 words, while the full range runs from 33 to 2,500 words.
This wide spread in length does not by itself show that verbosity affects personality prediction, but it makes the question worth testing during EDA.
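
As a rough sketch of how that later EDA step might start, word counts can be summarized separately for each label of a trait. The `toy` frame below is a made-up assumption for illustration; the column names match the dataset, but the rows do not come from it.

```python
import pandas as pd

# Hypothetical stand-in for df; the texts and labels are invented for illustration.
toy = pd.DataFrame({
    "TEXT": ["one two three", "one two", "a b c d e", "x"],
    "cEXT": ["y", "n", "y", "n"],
})

# Whitespace-delimited word count, same definition as in the cell above
toy["word_count"] = toy["TEXT"].str.split().str.len()

# Mean and median essay length for each label of one trait
by_label = toy.groupby("cEXT")["word_count"].agg(["mean", "median"])

print(by_label)
```

A clear gap between the `y` and `n` groups on the real data would suggest that length carries some signal for that trait; similar values would suggest it does not.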



### Summary

- Dataset structure and columns were analyzed.
- Label formats and distributions were examined.
- Initial insights on text length were observed.
- No preprocessing or modeling was performed at this stage.
