### Importing required libraries

In [7]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

### Reading in the dataset

In [5]:
df = pd.read_csv('data/cleaned_grit_data.csv')
df.head()

Unnamed: 0,index,country,surveyelapse,GS1,GS2,GS3,GS4,GS5,GS6,GS7,...,operatingsystem,browser,introelapse,testelapse,Extraversion,Neuroticism,Agreeableness,Conscientiousness,Openness,Grit
0,4,JP,340,5,2,3,3,2,4,2,...,Windows,Firefox,3,337,1.2,2.5,3.3,3.8,3.0,3.083333
1,6,US,126,4,1,3,2,1,5,1,...,Windows,Chrome,36,212,4.0,2.0,3.6,3.4,5.0,2.583333
2,8,EU,130,5,3,3,5,4,5,5,...,Windows,Microsoft Internet Explorer,14,183,4.4,4.5,4.7,4.0,4.3,4.25
3,10,AE,592,5,3,3,2,4,3,3,...,Windows,Chrome,726,311,3.0,4.6,3.6,3.8,3.4,3.166667
4,11,AU,217,3,1,1,2,1,3,1,...,Windows,Firefox,376,407,2.0,1.1,3.4,3.9,4.4,2.0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2200 entries, 0 to 2199
Data columns (total 100 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   index              2200 non-null   int64  
 1   country            2200 non-null   object 
 2   surveyelapse       2200 non-null   int64  
 3   GS1                2200 non-null   int64  
 4   GS2                2200 non-null   int64  
 5   GS3                2200 non-null   int64  
 6   GS4                2200 non-null   int64  
 7   GS5                2200 non-null   int64  
 8   GS6                2200 non-null   int64  
 9   GS7                2200 non-null   int64  
 10  GS8                2200 non-null   int64  
 11  GS9                2200 non-null   int64  
 12  GS10               2200 non-null   int64  
 13  GS11               2200 non-null   int64  
 14  GS12               2200 non-null   int64  
 15  VCL1               2200 non-null   int64  
 16  VCL2               2200

### Splitting the data into Train and Test splits

In [8]:
X = df[['Openness', 'Conscientiousness', 'Extraversion', 'Agreeableness', 'Neuroticism']]
y = df['Grit']

In [19]:
X.head()

Unnamed: 0,Openness,Conscientiousness,Extraversion,Agreeableness,Neuroticism
0,3.0,3.8,1.2,3.3,2.5
1,5.0,3.4,4.0,3.6,2.0
2,4.3,4.0,4.4,4.7,4.5
3,3.4,3.8,3.0,3.6,4.6
4,4.4,3.9,2.0,3.4,1.1


In [20]:
y.head()

0    3.083333
1    2.583333
2    4.250000
3    3.166667
4    2.000000
Name: Grit, dtype: float64

In [17]:
print("X shape: ", X.shape)
print("y shape: ", y.shape)

X shape:  (2200, 5)
y shape:  (2200,)


In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [18]:
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape)
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)


X_train shape:  (1760, 5)
y_train shape:  (1760,)
X_test shape:  (440, 5)
y_test shape:  (440,)


We will now fit a few baseline models to establish a reference point for later work. The models are:

1. Mean model
2. Median model
3. Linear regression

The metrics we will use are R², MAE (mean absolute error), and MSE (mean squared error).
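For reference, the three metrics can be written out by hand. The sketch below uses plain Python and made-up numbers (the `y_true_demo` and `y_pred_demo` lists are illustrative and not taken from the dataset). It is meant to mirror the standard definitions behind `r2_score`, `mean_absolute_error`, and `mean_squared_error`, not to replace them.

```python
def mae(true, pred):
    # Mean absolute error: average size of the errors, in the target's units.
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def mse(true, pred):
    # Mean squared error: average squared error, which penalises large misses more.
    return sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)


def r2(true, pred):
    # R^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of `true`.
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1 - ss_res / ss_tot


# Illustrative values only (hypothetical, not from the grit data).
y_true_demo = [2.0, 3.0, 4.0, 5.0]
y_pred_demo = [2.5, 3.0, 3.5, 5.0]

print(f"MAE : {mae(y_true_demo, y_pred_demo):.4f}")
print(f"MSE : {mse(y_true_demo, y_pred_demo):.4f}")
print(f"R2  : {r2(y_true_demo, y_pred_demo):.4f}")
```

MAE stays in the units of the grit score, while MSE is in squared units, so the two are not directly comparable with each other.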

In [9]:
def evaluate_model(true, pred, name):
    print(f"{name} Model Results")
    print(f"R²  : {r2_score(true, pred):.4f}")
    print(f"MAE : {mean_absolute_error(true, pred):.4f}")
    print(f"MSE : {mean_squared_error(true, pred):.4f}")

### Mean model

In [12]:
mean_pred = np.full(shape=y_test.shape, fill_value=y_train.mean())

evaluate_model(y_test, mean_pred, "Mean Baseline")

Mean Baseline Model Results
R²  : -0.0003
MAE : 0.5718
MSE : 0.4743


### Median model

In [13]:
median_pred = np.full(shape=y_test.shape, fill_value=y_train.median())

evaluate_model(y_test, median_pred, "Median Baseline")

Median Baseline Model Results
R²  : -0.0060
MAE : 0.5718
MSE : 0.4770


### Linear Regression

In [14]:
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)

In [15]:
evaluate_model(y_test, lr_pred, "Linear Regression")

Linear Regression Model Results
R²  : 0.4855
MAE : 0.3951
MSE : 0.2440


The baseline results give a first indication of how well grit can be predicted with simple approaches. The mean and median models both performed poorly: their R² values are slightly negative (-0.0003 and -0.0060), their MAE is the same to four decimal places (0.5718), and their MSE is nearly the same (0.4743 vs. 0.4770). This is expected, since a constant prediction cannot capture any variation in grit across individuals. In contrast, the linear regression model, which uses the Big Five personality traits as predictors, reached an R² of approximately 0.49 on the test set, with MAE dropping to 0.3951 and MSE to 0.2440. This suggests that nearly half of the variance in grit can be explained by personality traits alone, with conscientiousness likely playing a strong role, consistent with our EDA. Overall, the constant baselines offer no predictive value, while linear regression provides a meaningful baseline and justifies moving on to more advanced modeling techniques to improve prediction and better understand the underlying relationships.
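The slightly negative R² of the constant baselines is not a sign of a bug. A constant equal to the mean of the evaluated targets scores exactly 0, and any other constant (such as a training mean that differs a little from the test mean) scores below 0. A minimal sketch with made-up numbers (the list and the value `3.4` are hypothetical, not taken from the dataset):

```python
def r2(true, pred):
    # R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of `true`.
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1 - ss_res / ss_tot


# Hypothetical "test" targets; their mean is 3.5.
y_test_demo = [2.0, 3.0, 4.0, 5.0]
test_mean = sum(y_test_demo) / len(y_test_demo)

# Hypothetical "training" mean that is close to, but not equal to, the test mean.
train_mean_demo = 3.4

# Predicting the test mean itself gives R^2 = 0 by construction.
r2_at_test_mean = r2(y_test_demo, [test_mean] * len(y_test_demo))

# Predicting any other constant increases SS_res, so R^2 drops below 0.
r2_at_train_mean = r2(y_test_demo, [train_mean_demo] * len(y_test_demo))

print(f"R2 using the test mean    : {r2_at_test_mean:.4f}")
print(f"R2 using the training mean: {r2_at_train_mean:.4f}")
```

The further the constant is from the test mean, the more negative R² becomes, which would be consistent with the median baseline scoring a little lower than the mean baseline here.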

### Using only 'Conscientiousness' in the linear model to see its impact

In [21]:
X = df[['Conscientiousness']]
y = df['Grit']

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [23]:
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)

In [24]:
evaluate_model(y_test, lr_pred, "Linear Regression")

Linear Regression Model Results
R²  : 0.4278
MAE : 0.4243
MSE : 0.2713


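We can see that the other traits do contribute to the model's performance: using conscientiousness alone lowers R² from 0.4855 to 0.4278 and raises MAE from 0.3951 to 0.4243, although conscientiousness still accounts for most of the explained variance. We will consider all of the personality traits in our main model, or analyze them further later on.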
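One way to make this kind of comparison concrete is to fit the same least-squares model on nested feature sets and compare R². The sketch below uses NumPy on synthetic data; the coefficients, noise level, and sample size are invented for illustration and say nothing about the survey itself. `fit_r2` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# Synthetic data: five predictors, with the first one dominating (assumed setup).
rng = np.random.default_rng(0)
n_samples = 500
features = rng.normal(size=(n_samples, 5))
assumed_coef = np.array([0.8, 0.3, 0.2, 0.0, -0.1])
target = features @ assumed_coef + rng.normal(scale=0.5, size=n_samples)


def fit_r2(x, y):
    # Ordinary least squares with an intercept; returns R^2 on the data used to fit.
    design = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot


r2_single = fit_r2(features[:, [0]], target)  # dominant predictor only
r2_all = fit_r2(features, target)             # all five predictors

print(f"R2, one predictor  : {r2_single:.4f}")
print(f"R2, five predictors: {r2_all:.4f}")
```

On the data used for fitting, adding predictors to a least-squares model with an intercept cannot lower R². On held-out data, as in the comparison above, an improvement is not guaranteed, which is why the test-set gain is informative.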