# CardioSense Component 1: Symptom Analysis & CVD Risk Prediction

This notebook demonstrates how to build a hybrid machine learning model that combines structured cardiovascular risk factors with free-text symptom descriptions.

## 1. Load and inspect the dataset
We load `structured_dataset.csv`, which holds the structured risk factors, the `target` label, and a synthetic `symptom_text` description. Each row corresponds to one patient entry.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from scipy.sparse import hstack
import joblib

# Load the dataset
structured_df = pd.read_csv('structured_dataset.csv')
structured_df.head()

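| Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | symptom_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67 | 0 | 0 | 130 | 150 | 0 | 0 | 159 | 1 | 0.7 | 2 | 0 | 3 | 0 | exercise induced chest pain |
| 1 | 57 | 1 | 1 | 95 | 221 | 1 | 2 | 75 | 1 | 5.1 | 0 | 3 | 2 | 1 | chest pain and fasting blood sugar high and ex... |
| 2 | 43 | 0 | 1 | 114 | 236 | 0 | 2 | 177 | 1 | 0.9 | 2 | 3 | 1 | 0 | chest pain and exercise induced chest pain and... |
| 3 | 71 | 1 | 3 | 102 | 315 | 0 | 2 | 117 | 0 | 4.6 | 1 | 2 | 2 | 1 | chest pain and high cholesterol and ST depress... |
| 4 | 36 | 0 | 2 | 187 | 155 | 1 | 0 | 111 | 0 | 0.3 | 0 | 3 | 2 | 1 | chest pain and high blood pressure and fasting... |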
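
Beyond `head()`, a few quick checks help confirm the data is usable before modelling. The sketch below is a minimal example on a toy DataFrame; the helper `inspect_dataset` and all values in `toy_df` are hypothetical and only mirror the column names shown above.

```python
import pandas as pd

# Toy stand-in for structured_dataset.csv (illustrative values, not real data).
toy_df = pd.DataFrame({
    "age": [67, 57, 43, 71],
    "chol": [150, 221, None, 315],
    "target": [0, 1, 0, 1],
    "symptom_text": [
        "exercise induced chest pain",
        "chest pain",
        "",
        "high cholesterol",
    ],
})


def inspect_dataset(df, label_col="target", text_col="symptom_text"):
    """Summarize shape, missing values, class balance, and empty text rows."""
    return {
        "shape": df.shape,
        "missing": df.isna().sum().to_dict(),
        "class_balance": df[label_col].value_counts(normalize=True).to_dict(),
        "empty_text_rows": int((df[text_col].fillna("").str.strip() == "").sum()),
    }


summary = inspect_dataset(toy_df)
print(summary)
```

The same helper could be called on `structured_df`; a skewed `class_balance` would be a reason to read accuracy with care and to keep `stratify=y` in the split.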


## 2. Generate symptom features
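We split the data into training and test sets, standardize the numeric features with StandardScaler, and vectorize the free-text symptom descriptions with TF-IDF. Both transformers are fitted on the training split only, so no test-set statistics leak into preprocessing. The two feature blocks are then concatenated column-wise.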

In [2]:
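# Extract features and labels.
# 'Unnamed: 0' is the row index saved with the CSV, not a risk factor, so it is dropped as well.
X_structured = structured_df.drop(['target', 'symptom_text', 'Unnamed: 0'], axis=1, errors='ignore')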
X_text = structured_df['symptom_text']
y = structured_df['target']

# Train-test split
X_structured_train, X_structured_test, X_text_train, X_text_test, y_train, y_test = train_test_split(
    X_structured, X_text, y, test_size=0.2, random_state=42, stratify=y
)

# Scale numeric features
scaler = StandardScaler()
X_structured_train_scaled = scaler.fit_transform(X_structured_train)
X_structured_test_scaled = scaler.transform(X_structured_test)

# Vectorize text
tfidf_vectorizer = TfidfVectorizer(max_features=100)
X_text_train_vec = tfidf_vectorizer.fit_transform(X_text_train)
X_text_test_vec = tfidf_vectorizer.transform(X_text_test)

# Combine features
X_train_combined = hstack([X_structured_train_scaled, X_text_train_vec])
X_test_combined = hstack([X_structured_test_scaled, X_text_test_vec])
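
The manual scale, vectorize, and stack steps above can also be expressed as a single scikit-learn `Pipeline` built around a `ColumnTransformer`, which keeps preprocessing and model together in one object. The sketch below is one possible arrangement, not the notebook's method; the toy DataFrame, its values, and the names `numeric_cols`, `preprocess`, and `pipeline` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with two numeric columns and one text column (illustrative values).
toy_df = pd.DataFrame({
    "age": [67, 57, 43, 71, 36, 60],
    "chol": [150, 221, 236, 315, 155, 280],
    "symptom_text": [
        "exercise induced chest pain",
        "chest pain and fasting blood sugar high",
        "chest pain",
        "chest pain and high cholesterol",
        "high blood pressure",
        "exercise induced chest pain and high cholesterol",
    ],
    "target": [0, 1, 0, 1, 1, 0],
})

numeric_cols = ["age", "chol"]

# A scalar column name hands TfidfVectorizer a 1-D series of strings;
# a list of names hands StandardScaler a 2-D block.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("text", TfidfVectorizer(max_features=100), "symptom_text"),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = toy_df[numeric_cols + ["symptom_text"]]
pipeline.fit(X, toy_df["target"])

# Probability of the positive class for each row.
probs = pipeline.predict_proba(X)[:, 1]

# The combined width is the numeric columns plus the learned vocabulary.
fitted_pre = pipeline.named_steps["preprocess"]
vocab = fitted_pre.named_transformers_["text"].vocabulary_
n_features = fitted_pre.transform(X).shape[1]
print(n_features, len(vocab))
```

With this layout, calling `fit` on the training split alone fits the scaler and vectorizer on training data only, which matches the split-then-fit order used in the cell above.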


## 3. Train the model
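We train a logistic regression classifier on the combined features, then evaluate it on the held-out test set using accuracy, precision, recall, F1 score, and ROC AUC.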

In [3]:
# Train logistic regression model
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train_combined, y_train)

# Predict on test set
y_pred = logreg.predict(X_test_combined)
y_prob = logreg.predict_proba(X_test_combined)[:, 1]

# Evaluate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')
print(f'ROC AUC: {roc_auc:.4f}')


Accuracy: 0.9750
Precision: 0.9786
Recall: 0.9946
F1 Score: 0.9865
ROC AUC: 0.9834
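
Accuracy, precision, recall, and F1 all derive from the four confusion-matrix counts, so inspecting those counts shows which kind of error the model makes. The sketch below works through the arithmetic on a small hand-made label vector; `y_true_toy` and `y_pred_toy` are hypothetical and unrelated to the results above.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hand-made labels: 5 positives and 3 negatives (illustrative only).
y_true_toy = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 1, 1, 0, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true_toy, y_pred_toy).ravel())

precision_manual = tp / (tp + fp)   # share of predicted positives that are correct
recall_manual = tp / (tp + fn)      # share of actual positives that are found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)
print(precision_manual, recall_manual, f1_manual, accuracy_manual)
```

In a risk-screening setting, false negatives (`fn`) are usually the costlier error, so recall and the raw `fn` count are worth reading alongside accuracy.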


## 4. Save the model
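We bundle the fitted scaler, the TF-IDF vectorizer, and the trained model into one dictionary and save it with joblib, so that inference can apply exactly the same preprocessing.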

In [4]:
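# joblib was already imported in the first cell.
# Bundle the fitted preprocessing objects together with the model.
model_package = {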
    'scaler': scaler,
    'tfidf_vectorizer': tfidf_vectorizer,
    'model': logreg
}
joblib.dump(model_package, 'cardiosense_component1_model.pkl')
print('Model saved successfully.')

Model saved successfully.
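
For later inference, the saved dictionary has to be loaded and its three objects applied in the same order as during training. The sketch below shows one way to do that; the helper `predict_risk`, the toy training data, and the temporary file name are assumptions made so the example runs on its own, and it uses only two numeric columns rather than the full feature set.

```python
import os
import tempfile

import joblib
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def predict_risk(package, structured_rows, symptom_texts):
    """Apply the saved scaler and vectorizer, then return P(target = 1) per row."""
    X_num = package["scaler"].transform(structured_rows)
    X_txt = package["tfidf_vectorizer"].transform(symptom_texts)
    X = hstack([X_num, X_txt]).tocsr()
    return package["model"].predict_proba(X)[:, 1]


# Toy stand-ins for the fitted objects (illustrative values only).
toy_X = np.array([[67, 150], [57, 221], [43, 236], [71, 315]], dtype=float)
toy_text = [
    "exercise induced chest pain",
    "chest pain and fasting blood sugar high",
    "chest pain",
    "chest pain and high cholesterol",
]
toy_y = [0, 1, 0, 1]

toy_scaler = StandardScaler().fit(toy_X)
toy_tfidf = TfidfVectorizer(max_features=100).fit(toy_text)
toy_features = hstack([toy_scaler.transform(toy_X), toy_tfidf.transform(toy_text)]).tocsr()
toy_model = LogisticRegression(max_iter=1000).fit(toy_features, toy_y)

# Round-trip through joblib using the same dictionary keys as the cell above.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(
        {"scaler": toy_scaler, "tfidf_vectorizer": toy_tfidf, "model": toy_model},
        path,
    )
    loaded = joblib.load(path)

new_rows = np.array([[60.0, 280.0], [40.0, 160.0]])
new_texts = ["chest pain and high cholesterol", "exercise induced chest pain"]
probs = predict_risk(loaded, new_rows, new_texts)
print(probs)
```

With the real package, `structured_rows` would need the same columns in the same order as the data the scaler was fitted on. Files written by joblib should only be loaded from trusted sources, since unpickling can execute arbitrary code.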
