---
comments: true
layout: post
title: ML CPT
courses: { csp: {week: 26} }
---

# Our CPT ML Project

For our CPT project, we decided to build a project centered on mental health and self-care. We set out to find a dataset dealing with depression rates. We found a few we liked, and we also created a dataset of our own. Using it as a tester dataset for practice, we generated fake data about the chances of developing depression. We then built a model that trains on data from this dataset to predict how likely a person is to develop depression based on these factors: age, stress level, exercise, and sleep.

In [None]:
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data
num_samples = 1000

# Age: mean and standard deviation
age_mean = 30                   # Mean age
age_std = 10                    # Standard deviation of age
age = np.random.normal(age_mean, age_std, num_samples)

# Stress level: mean and standard deviation
stress_level_mean = 5           # Mean stress level
stress_level_std = 2            # Standard deviation of stress level
stress_level = np.random.normal(stress_level_mean, stress_level_std, num_samples)

# Daily exercise hours: mean and standard deviation
exercise_hours_mean = 1.5       # Mean daily exercise hours
exercise_hours_std = 0.5        # Standard deviation of daily exercise hours
exercise_hours = np.random.normal(exercise_hours_mean, exercise_hours_std, num_samples)

# Daily sleep hours: mean and standard deviation
sleep_hours_mean = 8            # Mean daily sleep hours
sleep_hours_std = 1             # Standard deviation of daily sleep hours
sleep_hours = np.random.normal(sleep_hours_mean, sleep_hours_std, num_samples)

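# Compute a depression risk score from the factors. Despite the name, this is
# not a true probability: it is clipped at 0 but has no upper bound.
probability = np.maximum(0, (age - age_mean) +
              (stress_level - stress_level_mean) + (exercise_hours_mean - exercise_hours) +
              (sleep_hours_mean - sleep_hours))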

# Create DataFrame without 'Depression' column
data = pd.DataFrame({
    'Age': age,
    'Stress Level': stress_level,
    'Daily Exercise Hours': exercise_hours,
    'Daily Sleep Hours': sleep_hours,
    'Probability of Developing Depression': probability
})

# Save DataFrame to CSV
data.to_csv('depression_dataset.csv', index=False)
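
The score computed above is clipped at zero but has no upper bound, so it is not a probability in the strict sense. One possible way to get values between 0 and 1 is to pass a raw (unclipped) score through the logistic (sigmoid) function and, if a yes/no label is needed, sample it from that probability. This is only a sketch under that assumption: `score_to_probability`, its `scale` parameter, and the example scores are hypothetical and not part of our project code.

```python
import numpy as np

def score_to_probability(score, scale=1.0):
    """Map an unbounded risk score into (0, 1) with the logistic function."""
    score = np.asarray(score, dtype=float)
    return 1.0 / (1.0 + np.exp(-score / scale))

# Hypothetical raw scores: negative means lower risk, positive means higher risk
raw_scores = np.array([-20.0, -2.0, 0.0, 2.0, 20.0])
probabilities = score_to_probability(raw_scores, scale=5.0)

# Draw a binary label (0 or 1) for each row from its probability
rng = np.random.default_rng(42)
labels = rng.binomial(1, probabilities)

print(probabilities.round(3))
print(labels)
```

A larger `scale` flattens the curve, so the same score maps closer to 0.5; the right value would depend on how spread out the real scores are.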

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

class DepressionPredictor:
    def __init__(self):
        self.data = None
        self.model_logreg = None
        self.X_test = None
        self.y_test = None
        self.scaler = None
      
    def load_data(self, filepath):
        self.data = pd.read_csv(filepath)
      
    def preprocess_data(self):
        if self.data is None:
            raise ValueError("Data not loaded. Call load_data() first.")
      
        # Any necessary preprocessing steps can be added here

        # For example, dropping columns, handling missing values, encoding categorical variables, etc.
      
    def train_models(self):
        if self.data is None:
            raise ValueError("Data not loaded. Call load_data() first.")
        
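        # The generated CSV has no 'Depression' column, so derive a binary label
        # from the risk score (assumption: any score above 0 counts as at risk).
        target_col = 'Probability of Developing Depression'
        X = self.data.drop(target_col, axis=1)
        y = (self.data[target_col] > 0).astype(int)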
        X_train, self.X_test, y_train, self.y_test = train_test_split(X, y, test_size=0.3, random_state=42)
      
        self.scaler = StandardScaler()
        X_train_scaled = self.scaler.fit_transform(X_train)
        self.X_test = self.scaler.transform(self.X_test)
      
        self.model_logreg = LogisticRegression()
        self.model_logreg.fit(X_train_scaled, y_train)
      
    def evaluate_models(self):
        if self.model_logreg is None:
            raise ValueError("Models not trained. Call train_models() first.")
      
        y_pred_logreg = self.model_logreg.predict(self.X_test)
        accuracy_logreg = accuracy_score(self.y_test, y_pred_logreg)
        print('LogisticRegression Accuracy: {:.2%}'.format(accuracy_logreg)) 
  
    def predict_depression_probability(self, new_data):
        if self.model_logreg is None:
            raise ValueError("Models not trained. Call train_models() first.")

        # Work on a copy so the caller's DataFrame is not modified
        new_data_processed = new_data.copy()

        # Rename input columns to the feature names used in the training data
        new_data_processed = new_data_processed.rename(columns={'age': 'Age',
                                                                'exercise_hours': 'Daily Exercise Hours',
                                                                'sleep_hours': 'Daily Sleep Hours',
                                                                'stress_level': 'Stress Level'})

        # Feature names the scaler saw at fit time (recorded when it is fit on a DataFrame)
        expected_feature_names = list(self.scaler.feature_names_in_)

        # Fail early if any training feature is missing from the new data
        missing_features = set(expected_feature_names) - set(new_data_processed.columns)
        if missing_features:
            raise ValueError(f"Feature names seen at fit time, yet now missing: {missing_features}")

        # Keep only the training features, in training order. This drops extra
        # columns such as 'family_history', which the synthetic dataset does not include.
        new_data_processed = new_data_processed[expected_feature_names]

        # Transform new data using the scaler
        new_data_scaled = self.scaler.transform(new_data_processed)

        # Predict the probability of depression (probability of class 1)
        probability_of_depression = self.model_logreg.predict_proba(new_data_scaled)[:, 1]
        return probability_of_depression

# Usage
depression_predictor = DepressionPredictor()
depression_predictor.load_data('depression_dataset.csv')
depression_predictor.preprocess_data()
depression_predictor.train_models()
depression_predictor.evaluate_models()

# Define new data for prediction
new_data = pd.DataFrame({
    'age': [30],
    'family_history': ['Yes'],  # Assuming 'Yes' or 'No' as values
    'stress_level': [5],
    'exercise_hours': [1.5],
    'sleep_hours': [8]
})

probability_of_depression = depression_predictor.predict_depression_probability(new_data)
print('Probability of depression:', probability_of_depression)
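
To see which factors drive the predictions, one option is to look at the fitted coefficients of the logistic regression. The sketch below is self-contained and regenerates a small synthetic dataset in the same spirit as ours. The labeling rule (score above zero means at risk) and the lowercase column names are assumptions made for illustration, not part of the project code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 1000
X = pd.DataFrame({
    'age': rng.normal(30, 10, n),
    'stress_level': rng.normal(5, 2, n),
    'exercise_hours': rng.normal(1.5, 0.5, n),
    'sleep_hours': rng.normal(8, 1, n),
})

# Assumed labeling rule: at risk (1) when the combined deviation score is above zero
score = ((X['age'] - 30) + (X['stress_level'] - 5)
         + (1.5 - X['exercise_hours']) + (8 - X['sleep_hours']))
y = (score > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize so coefficients are comparable across features
scaler = StandardScaler()
model = LogisticRegression()
model.fit(scaler.fit_transform(X_train), y_train)

# One coefficient per standardized feature; larger magnitude means more influence
coefficients = pd.Series(model.coef_[0], index=X.columns)
accuracy = model.score(scaler.transform(X_test), y_test)

print(coefficients.sort_values(key=abs, ascending=False))
print('Test accuracy: {:.2%}'.format(accuracy))
```

Because age has by far the largest spread in this synthetic data, it dominates the score, and the coefficient magnitudes should reflect that. With a real dataset the ranking could look very different.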


As for real datasets, we've found a few that we can use to train another model. One of them is about students and their majors and how their major affects them mentally. Another is about workers in the tech industry and how likely their circumstances are to correlate with developing depression.
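
Real datasets like these usually contain categorical columns (a student's major, for instance) that have to be converted to numbers before a model such as `LogisticRegression` can use them. Here is a minimal sketch of one-hot encoding with pandas. The column names and rows are made up for illustration and do not come from either dataset.

```python
import pandas as pd

# Hypothetical rows standing in for a student survey
students = pd.DataFrame({
    'major': ['Biology', 'Computer Science', 'Biology', 'History'],
    'sleep_hours': [7.0, 5.5, 8.0, 6.5],
})

# Replace 'major' with one 0/1 indicator column per distinct major
encoded = pd.get_dummies(students, columns=['major'], dtype=int)
print(encoded.columns.tolist())
```

The encoded frame can then go through the same scaling and training steps as the numeric columns.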