
# RIASEC Personality Test - Data Cleanup, Model Selection, and Validation

This notebook cleans the RIASEC dataset, fits a simple model, and validates it on held-out data. It focuses on the "Realistic" trait (questions R1 to R8).

- **Step 1: Data Cleanup**: load the dataset and remove every row that has a missing value (coded as -1) in any of the R1 to R8 columns.
- **Step 2: Model Selection**: use the first 6500 people as training data and fit a linear regression that predicts the R trait score from R1 alone.
- **Step 3: Validation**: evaluate the model on the remaining people (every row after the first 6500) by computing the residual sum of squares.


In [5]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

# Read the dataset (assumed to be named RIASEC.csv; it is tab-separated despite the extension)
df = pd.read_csv('RIASEC.csv', sep='\t')

# Display the first few rows of the dataset
df.head()


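|  | implementation | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | I1 | ... | C5 | C6 | C7 | C8 | accuracy | elapse | country | fromsearch | age | gender |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 3 | 1 | 4 | 2 | 1 | 2 | 1 | 1 | 5 | ... | 2 | 1 | 1 | 2 | 90 | 222 | PT | 0 | -1 | -1 |
| 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 4 | ... | 1 | 1 | 1 | 1 | 100 | 102 | US | 0 | -1 | -1 |
| 2 | 2 | 3 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 5 | ... | 3 | 4 | 4 | 4 | 95 | 264 | US | 1 | -1 | -1 |
| 3 | 2 | 3 | 2 | 1 | 2 | 2 | 3 | 1 | 2 | 5 | ... | 1 | 3 | 2 | 1 | 60 | 189 | SG | 0 | -1 | -1 |
| 4 | 2 | -1 | 2 | 3 | 2 | 3 | 2 | 1 | 3 | 5 | ... | 4 | 3 | 3 | 3 | 90 | 197 | US | 0 | -1 | -1 |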


## Step 1: Data Cleanup

We are only interested in the responses for the "Realistic" trait, which correspond to R1 to R8. We'll extract these columns and remove any rows with missing values (-1).
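The filtering idea can be tried on a small frame first. The sketch below is illustrative only: the toy values are made up, and `drop_missing_rows` is a hypothetical helper name, not something defined elsewhere in this notebook. It shows that a -1 in a column outside the selected ones (here `age`) does not cause a row to be dropped.

```python
import pandas as pd

# Toy stand-in for the survey data; the values are made up for illustration.
toy = pd.DataFrame({
    "R1": [3, 1, -1, 4],
    "R2": [2, -1, 5, 1],
    "age": [-1, 30, 25, -1],  # -1 here must not cause a row to be dropped
})

def drop_missing_rows(frame, columns, missing=-1):
    """Return a copy of frame[columns] without rows that contain `missing`."""
    subset = frame[columns]
    # True for every row where at least one selected column equals `missing`
    has_missing = (subset == missing).any(axis=1)
    return subset.loc[~has_missing].copy()

cleaned_toy = drop_missing_rows(toy, ["R1", "R2"])
print(cleaned_toy)
```

Note that the original row labels are kept, so the index of the cleaned frame can have gaps.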


In [6]:
# Extract columns R1 to R8 (assuming they are named R1 to R8 in the dataset)
r_columns = ['R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8']

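# Keep only R1 to R8 and drop rows where any of them is -1 (missing).
# .copy() makes cleaned_df an independent frame, so adding a column later
# does not trigger pandas' SettingWithCopyWarning.
has_missing = (df[r_columns] == -1).any(axis=1)
cleaned_df = df.loc[~has_missing, r_columns].copy()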

# Display the shape and the first few rows of the cleaned dataset
print(f"Cleaned dataset shape: {cleaned_df.shape}")
cleaned_df.head()


Cleaned dataset shape: (8478, 8)


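|  | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 4 | 2 | 1 | 2 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 3 | 2 | 1 | 1 | 1 | 1 | 2 | 1 |
| 3 | 3 | 2 | 1 | 2 | 2 | 3 | 1 | 2 |
| 5 | 3 | 1 | 3 | 4 | 3 | 4 | 3 | 3 |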


## Step 2: Model Selection

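We now compute each person's R score (the mean of R1 to R8) and use the first 6500 people as training data. We then fit a linear regression that predicts the R score from R1 alone and report its residual sum of squares on the training data. Because R1 is itself one of the eight items averaged into the R score, part of the fit is guaranteed by construction.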
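With a single predictor, the least-squares line has a closed form, which makes it easy to see what the fit is doing. The sketch below uses that closed form with NumPy rather than scikit-learn's `LinearRegression`; the data are made up, and `fit_line` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least squares with one feature: returns (slope, intercept)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # slope = covariance(x, y) / variance(x); the 1/n factors cancel
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Made-up answers on a 1-5 scale and made-up trait scores, for illustration only.
r1 = np.array([1, 2, 3, 4, 5])
r_score = np.array([1.5, 2.0, 2.5, 3.5, 4.0])

slope, intercept = fit_line(r1, r_score)
# Residual sum of squares of the fitted line on the same data
rss = np.sum((r_score - (slope * r1 + intercept)) ** 2)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, RSS={rss:.3f}")
```

For one feature, scikit-learn's `LinearRegression` solves the same least-squares problem, so its `coef_` and `intercept_` should agree with these values up to floating-point error.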


In [7]:
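# Compute the R score (average of R1 to R8).
# Restricting the mean to r_columns keeps the result correct if this cell is
# re-run, when cleaned_df already contains an R_score column.
cleaned_df['R_score'] = cleaned_df[r_columns].mean(axis=1)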

# Use the first 6500 people for training
train_data = cleaned_df[:6500]

# Set R1 as the independent variable (X) and R_score as the dependent variable (y)
X_train = train_data[['R1']].values
y_train = train_data['R_score'].values

# Fit a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Compute the residual sum of squares for the first 6500 people
y_train_pred = model.predict(X_train)
rss_train = np.sum((y_train - y_train_pred) ** 2)

print(f"Residual Sum of Squares (Training Data): {rss_train}")


Residual Sum of Squares (Training Data): 2902.0393474685015


## Step 3: Validation

We now validate the model on the held-out people: every row after the first 6500, which is 8478 - 6500 = 1978 rows of the cleaned data. We compute the residual sum of squares on this test set to check how well the model generalizes.
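For reference, the quantity being computed can be written out explicitly. The notation is an assumption introduced here, not taken from the notebook: $x_i$ is person $i$'s R1 answer, $y_i$ is their R score, and $\hat{\beta}_0$, $\hat{\beta}_1$ are the intercept and slope fitted on the training rows.

```latex
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i,
\qquad
\mathrm{RSS}_{\text{test}} = \sum_{i \in \text{test}} \left( y_i - \hat{y}_i \right)^2
```

The coefficients are not refitted on the test rows; only the residuals are recomputed.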


In [8]:
# Use all remaining rows (position 6500 onward) for testing
test_data = cleaned_df[6500:]

# Set R1 as the independent variable (X) and R_score as the dependent variable (y)
X_test = test_data[['R1']].values
y_test = test_data['R_score'].values

# Predict the R scores using the model
y_test_pred = model.predict(X_test)

# Compute the residual sum of squares for the test set
rss_test = np.sum((y_test - y_test_pred) ** 2)

print(f"Residual Sum of Squares (Test Data): {rss_test}")


Residual Sum of Squares (Test Data): 1028.7757850021012


## Conclusion

- **Training Data Residual Sum of Squares**: 2902.04 (6500 rows)
- **Test Data Residual Sum of Squares**: 1028.78 (1978 rows)

The two sums are not directly comparable, because a residual sum of squares grows with the number of rows and the two sets differ in size. Dividing each by its row count gives a mean squared error of about 0.446 on the training data and about 0.520 on the test data. The test error per row is higher than the training error; the larger this gap, the less well the model generalizes to new data.
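One way to put the training and test figures on the same scale is to divide each residual sum of squares by the number of rows it was computed on. The sketch below does this with the values printed above; `per_row_error` is a hypothetical helper, and the test row count is an assumption derived from the cleaned shape (8478 rows) minus the 6500 training rows.

```python
def per_row_error(rss, n_rows):
    """Residual sum of squares divided by the row count (the mean squared error)."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    return rss / n_rows

# RSS values as printed by the training and validation cells.
rss_train_reported = 2902.0393474685015
rss_test_reported = 1028.7757850021012

# Assumed row counts: 6500 training rows, and the rest of the 8478 cleaned rows.
n_train = 6500
n_test = 8478 - n_train

mse_train = per_row_error(rss_train_reported, n_train)
mse_test = per_row_error(rss_test_reported, n_test)
print(f"train MSE: {mse_train:.3f}, test MSE: {mse_test:.3f}")
```

Taking the square root of each value would express the error in the same units as the R score itself.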
