---
comments: true
layout: post
title: ML CPT
courses: { csp: {week: 26} }
---

# Our ML Project: Titanic + CPT

In this blog, we demonstrate our code for the Titanic machine learning model, as well as our own CPT machine learning project.

# Titanic ML:

For the Titanic project, we worked on training the model with the Titanic dataset. Using an API, we received data from the frontend, made the prediction, and sent the prediction back to the frontend to display.

In [None]:
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np


class TitanicPredictor:
   def __init__(self):
       self.data = None
       self.encoder = None
       self.model_dt = None
       self.model_logreg = None
       self.X_test = None
       self.y_test = None
      
   def load_data(self):
       self.data = sns.load_dataset('titanic')
      
   def preprocess_data(self):
       if self.data is None:
           raise ValueError("Data not loaded. Call load_data() first.")
      
       self.data.drop(['alive', 'who', 'adult_male', 'class', 'embark_town', 'deck'], axis=1, inplace=True)
       self.data.dropna(inplace=True)
       self.data['sex'] = self.data['sex'].apply(lambda x: 1 if x == 'male' else 0)
       self.data['alone'] = self.data['alone'].apply(lambda x: 1 if x == True else 0)
      
       self.encoder = OneHotEncoder(handle_unknown='ignore')
       self.encoder.fit(self.data[['embarked']])
       onehot = self.encoder.transform(self.data[['embarked']]).toarray()
       cols = ['embarked_' + val for val in self.encoder.categories_[0]]
       # Reuse the current index so the encoded rows line up after dropna()
       self.data[cols] = pd.DataFrame(onehot, index=self.data.index)
       self.data.drop(['embarked'], axis=1, inplace=True)
       self.data.dropna(inplace=True)
      
   def train_models(self):
       X = self.data.drop('survived', axis=1)
       y = self.data['survived']
       X_train, self.X_test, y_train, self.y_test = train_test_split(X, y, test_size=0.3, random_state=42)
      
       self.model_dt = DecisionTreeClassifier()
       self.model_dt.fit(X_train, y_train)
      
       self.model_logreg = LogisticRegression()
       self.model_logreg.fit(X_train, y_train)
      
   def evaluate_models(self):
       if self.model_dt is None or self.model_logreg is None:
           raise ValueError("Models not trained. Call train_models() first.")
      
       y_pred_dt = self.model_dt.predict(self.X_test)
       accuracy_dt = accuracy_score(self.y_test, y_pred_dt)
       print('DecisionTreeClassifier Accuracy: {:.2%}'.format(accuracy_dt)) 
      
       y_pred_logreg = self.model_logreg.predict(self.X_test)
       accuracy_logreg = accuracy_score(self.y_test, y_pred_logreg)
       print('LogisticRegression Accuracy: {:.2%}'.format(accuracy_logreg)) 
  
   def predict_survival_probability(self, new_passenger):
       if self.model_logreg is None:
           raise ValueError("Models not trained. Call train_models() first.")
      
       new_passenger['sex'] = new_passenger['sex'].apply(lambda x: 1 if x == 'male' else 0)
       new_passenger['alone'] = new_passenger['alone'].apply(lambda x: 1 if x == True else 0)
      
       onehot = self.encoder.transform(new_passenger[['embarked']]).toarray()
       cols = ['embarked_' + val for val in self.encoder.categories_[0]]
       new_passenger[cols] = pd.DataFrame(onehot, index=new_passenger.index)
       new_passenger.drop(['embarked'], axis=1, inplace=True)
       new_passenger.drop(['name'], axis=1, inplace=True)
      
       dead_proba, alive_proba = np.squeeze(self.model_logreg.predict_proba(new_passenger))
       print('Death probability: {:.2%}'.format(dead_proba)) 
       print('Survival probability: {:.2%}'.format(alive_proba)) 
       return dead_proba, alive_proba


# Usage
titanic_predictor = TitanicPredictor()
titanic_predictor.load_data()
titanic_predictor.preprocess_data()
titanic_predictor.train_models()
titanic_predictor.evaluate_models()


# Define a new passenger
passenger = pd.DataFrame({
   'name': ['John Mortensen'],
   'pclass': [2],
   'sex': ['male'],
   'age': [64],
   'sibsp': [1],
   'parch': [1],
   'fare': [16.00],
   'embarked': ['S'],
   'alone': [False]
})


titanic_predictor.predict_survival_probability(passenger)


# Titanic Model: 

Data Loading and Preprocessing:
The model begins by loading the Titanic dataset using Seaborn's load_dataset() function.
It then preprocesses the data by dropping irrelevant columns ('alive', 'who', 'adult_male', 'class', 'embark_town', 'deck') and handling missing values.
Categorical variables like 'sex' and 'alone' are converted into numerical format.

One-Hot Encoding:
The model uses one-hot encoding to convert the categorical variable 'embarked' into binary vectors.
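
To make the idea concrete, here is a minimal plain-Python sketch of one-hot encoding. It is only an illustration and not how scikit-learn's `OneHotEncoder` is implemented; the helper name `one_hot` and the sample values are made up for this example.

```python
def one_hot(values):
    """Return (categories, rows); each row has a single 1 marking its category."""
    categories = sorted(set(values))  # e.g. ['C', 'Q', 'S']
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows


# Hypothetical sample of 'embarked' values
categories, rows = one_hot(['S', 'C', 'S', 'Q'])
columns = ['embarked_' + cat for cat in categories]
print(columns)
for row in rows:
    print(row)
```

Each passenger ends up with exactly one 1 across the `embarked_*` columns, which is the shape the models expect instead of a text value.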

Model Training:
After preprocessing, the data is split into features (X) and the target variable (y), followed by splitting into training and testing sets.
Two models are trained: a Decision Tree Classifier (model_dt) and a Logistic Regression model (model_logreg).

Model Evaluation:
The trained models are evaluated using accuracy scores on the test data.
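
As a rough sketch of what an accuracy score measures, the snippet below computes the fraction of correct predictions by hand. The function name `accuracy` and the labels are made up for illustration; in the project itself the score comes from `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels: 1 = survived, 0 = did not survive
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0]
print('Accuracy: {:.2%}'.format(accuracy(y_true, y_pred)))  # Accuracy: 80.00%
```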

Prediction:
The class provides a method, predict_survival_probability(), to predict the survival probability of a new passenger.
The new passenger's data is preprocessed similarly to the training data.
The survival probability is predicted using the trained Logistic Regression model.
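
For a two-class problem like this one, logistic regression turns a weighted sum of the features into a probability with the sigmoid function. The sketch below shows that calculation by hand. The weights, bias, and the choice of two features are hypothetical and are not taken from our trained model.

```python
import math


def survival_probabilities(features, weights, bias):
    """Return (dead_proba, alive_proba) for one passenger under a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    alive_proba = 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes z into (0, 1)
    return 1.0 - alive_proba, alive_proba


# Hypothetical weights for two features (sex, pclass); not from the trained model
weights = [-2.5, -1.0]
bias = 3.0
dead, alive = survival_probabilities([1, 2], weights, bias)
print('Death probability: {:.2%}'.format(dead))
print('Survival probability: {:.2%}'.format(alive))
```

The two values always add up to 1, which is why the method can unpack a death probability and a survival probability from a single prediction.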

Usage Example:
An instance of the TitanicPredictor class is created.
Data is loaded, preprocessed, models are trained, and then evaluated.
A new passenger's data is defined, and the predict_survival_probability() method is called to estimate their survival probability.

In [None]:
import pandas as pd
from flask import Blueprint, jsonify, request  # jsonify creates an endpoint response object
from flask_restful import Api, Resource # used for REST API building

from model.jokes import *
# TitanicPredictor (defined in the cell above) must also be importable here

joke_api = Blueprint('joke_api', __name__,
                   url_prefix='/api/jokes')

# API generator https://flask-restful.readthedocs.io/en/latest/api.html#id1
api = Api(joke_api)

class TitanicAPI(Resource):
    def post(self):
        # Get passenger data from the API request
        data = request.get_json()  # get the data as JSON
        # Accept a real boolean or the strings "true"/"false" and store a bool
        data['alone'] = str(data['alone']).lower() == 'true'
        converted_dict = {key: [value] for key, value in data.items()}
        pass_in = pd.DataFrame(converted_dict)  # create DataFrame from JSON
        # Note: the model is retrained on every request
        titanic_predictor = TitanicPredictor()
        titanic_predictor.load_data()
        titanic_predictor.preprocess_data()
        titanic_predictor.train_models()
        titanic_predictor.evaluate_models()
        dead_proba, alive_proba = titanic_predictor.predict_survival_probability(pass_in)
        response = {
            'dead_proba': float(dead_proba),  # plain floats serialize cleanly to JSON
            'alive_proba': float(alive_proba)
        }
        return jsonify(response)


# Add resource to the API
api.add_resource(TitanicAPI, '/create')

# Titanic API:

The TitanicAPI class is a Flask-Restful Resource representing an endpoint of the API. It handles POST requests to the /api/jokes/create endpoint.
In the post method, it extracts passenger data from the JSON request using request.get_json().
The passenger data is converted into a single-row DataFrame and then passed to the TitanicPredictor class (defined in the earlier cell) for prediction.
After predicting survival probabilities, a JSON response containing the probabilities is returned to the frontend.
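
The snippet below is a small sketch of the payload handling described above, using only the standard library so it can run without Flask. The request body is a hypothetical example shaped like the passenger defined earlier, and `to_single_row_columns` is a made-up helper name.

```python
import json


def to_single_row_columns(raw_json):
    """Parse a JSON request body and wrap each value in a list (one row per column)."""
    data = json.loads(raw_json)
    # Accept a real boolean or the strings "true"/"false" for 'alone'
    data['alone'] = str(data['alone']).lower() == 'true'
    return {key: [value] for key, value in data.items()}


# Hypothetical request body sent by the frontend
body = ('{"name": "John Mortensen", "pclass": 2, "sex": "male", "age": 64, '
        '"sibsp": 1, "parch": 1, "fare": 16.0, "embarked": "S", "alone": false}')
columns = to_single_row_columns(body)
print(columns['alone'])     # [False]
print(columns['embarked'])  # ['S']
```

A dictionary in this shape can be passed straight to `pd.DataFrame(...)` to get the one-row DataFrame that the predictor expects.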

# CPT ML (Depression):

For our CPT project, we decided to create a project centered on mental health and self-care. We looked for a dataset dealing with depression rates. We found a few we liked, and even created a dataset of our own. We then trained a model on that dataset to predict how likely a person is to develop depression based on these factors: age, stress level, exercise, and sleep.

In [None]:
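import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('depression_dataset.csv')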
# Split the data into features and labels
X = data.drop('Probability of Developing Depression', axis=1)
y = data['Probability of Developing Depression']
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Function to predict the chance of being depressed
def predict_depression(age, stress_level, exercise_hours, sleep_hours):
    input_data = scaler.transform([[age, stress_level, exercise_hours, sleep_hours]])
    chance_of_depression = model.predict(input_data)[0]
    return chance_of_depression

# CPT Model:

Data Loading and Preparation:
The model begins by loading a dataset with the information needed to predict depression. The dataset contains features such as age, stress level, exercise hours, and sleep hours, alongside a target variable indicating the likelihood of developing depression.
The model then splits the data into features (X) and the target variable (y). Features are the input variables used for predictions, while the target variable is what we aim to predict: in this case, the probability of developing depression.

Data Preprocessing:
Before training, the data must be preprocessed. In this model, preprocessing means standardizing the features with scikit-learn's StandardScaler. Standardization rescales each feature to a mean of 0 and a standard deviation of 1, which helps algorithms that are sensitive to the scale of their inputs.
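
As a hedged illustration of what standardization does to one column, the sketch below computes z-scores by hand with the standard library. The sleep-hours values are made up, and `standardize` is a hypothetical helper. It uses the population standard deviation, which is also what `StandardScaler` uses, but unlike the real scaler it simply refuses constant columns.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / std for v in values]


# Made-up sleep hours for five people
sleep_hours = [6.0, 7.0, 8.0, 5.0, 9.0]
scaled = standardize(sleep_hours)
print([round(v, 3) for v in scaled])
```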

Model Training:
The model is trained with a linear regression algorithm. Linear regression is a simple algorithm that models the relationship between a dependent variable (here, the likelihood of depression) and one or more independent variables (the features).
Training uses the preprocessed training data (X_train, y_train). During training, the model learns the relationship between the input features and the target variable by minimizing the difference between predicted and actual values, which is known as minimizing the loss function.
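
To show what "minimizing the loss" means in the simplest case, here is a sketch that fits a straight line to a single feature with the closed-form least-squares formulas. The data points are invented, and `fit_line` and `mean_squared_error` are hypothetical helpers; our actual model uses `LinearRegression` on several features.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between the line's predictions and the real values."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: stress level (1-10) vs. a depression score
stress = [1, 3, 5, 7, 9]
score = [0.10, 0.22, 0.35, 0.41, 0.58]
slope, intercept = fit_line(stress, score)
print('slope={:.4f}, intercept={:.4f}'.format(slope, intercept))
print('MSE={:.6f}'.format(mean_squared_error(stress, score, slope, intercept)))
```

Any other slope or intercept gives a larger mean squared error on these points, which is the sense in which the fitted line is the best one.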

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

class DepressionPredictor:
    def __init__(self):
        self.data = None
        self.model_logreg = None
        self.X_test = None
        self.y_test = None
        self.scaler = None
      
    def load_data(self, filepath):
        self.data = pd.read_csv(filepath)
      
    def preprocess_data(self):
        if self.data is None:
            raise ValueError("Data not loaded. Call load_data() first.")
      
        # Any necessary preprocessing steps can be added here

        # For example, dropping columns, handling missing values, encoding categorical variables, etc.
      
    def train_models(self):
        if self.data is None:
            raise ValueError("Data not loaded. Call load_data() first.")
        
        X = self.data.drop('Depression', axis=1)
        y = self.data['Depression']
        X_train, self.X_test, y_train, self.y_test = train_test_split(X, y, test_size=0.3, random_state=42)
      
        self.scaler = StandardScaler()
        X_train_scaled = self.scaler.fit_transform(X_train)
        self.X_test = self.scaler.transform(self.X_test)
      
        self.model_logreg = LogisticRegression()
        self.model_logreg.fit(X_train_scaled, y_train)
      
    def evaluate_models(self):
        if self.model_logreg is None:
            raise ValueError("Models not trained. Call train_models() first.")
      
        y_pred_logreg = self.model_logreg.predict(self.X_test)
        accuracy_logreg = accuracy_score(self.y_test, y_pred_logreg)
        print('LogisticRegression Accuracy: {:.2%}'.format(accuracy_logreg)) 
  
    def predict_depression_probability(self, new_data):
        if self.model_logreg is None:
            raise ValueError("Models not trained. Call train_models() first.")

        # Preprocess new data similarly to training data
        new_data_processed = new_data.copy()  # Make a copy to avoid modifying the original DataFrame
        # Encode 'Yes'/'No' as 1/0 in place, so no duplicate column is created
        new_data_processed['family_history'] = new_data_processed['family_history'].apply(lambda x: 1 if x == 'Yes' else 0)

        # Rename the input columns to the names used in the training data
        new_data_processed = new_data_processed.rename(columns={'age': 'Age',
                                                                'exercise_hours': 'Daily Exercise Hours',
                                                                'family_history': 'Family History of Depression',
                                                                'sleep_hours': 'Daily Sleep Hours',
                                                                'stress_level': 'Stress Level'})

        # Drop the 'Probability of Developing Depression' column if it was passed in
        if 'Probability of Developing Depression' in new_data_processed.columns:
            new_data_processed = new_data_processed.drop('Probability of Developing Depression', axis=1)

        # Debug: Print feature names in the new data
        print("Feature names in new data:", new_data_processed.columns)

        # Ensure that feature names match those seen during fit time
        # (feature_names_in_ is set when the scaler is fit on a DataFrame, scikit-learn 1.0+)
        expected_feature_names = list(self.scaler.feature_names_in_)
        missing_features = set(expected_feature_names) - set(new_data_processed.columns)
        if missing_features:
            raise ValueError(f"Feature names seen at fit time, yet now missing: {missing_features}")

        # Put the columns in fit-time order, then transform with the scaler
        new_data_scaled = self.scaler.transform(new_data_processed[expected_feature_names])

        # Predict the probability of depression
        probability_of_depression = self.model_logreg.predict_proba(new_data_scaled)[:, 1]
        return probability_of_depression

# Usage
depression_predictor = DepressionPredictor()
depression_predictor.load_data('depression_dataset.csv')
depression_predictor.preprocess_data()
depression_predictor.train_models()
depression_predictor.evaluate_models()

# Define new data for prediction
new_data = pd.DataFrame({
    'age': [30],
    'family_history': ['Yes'],  # Assuming 'Yes' or 'No' as values
    'stress_level': [5],
    'exercise_hours': [1.5],
    'sleep_hours': [8]
})

probability_of_depression = depression_predictor.predict_depression_probability(new_data)
print('Probability of depression:', probability_of_depression)


In terms of a real dataset, we've found a few that we can use to train another model. One of the datasets is about students and their majors and how those affect them mentally. Another is about workers in the tech industry and how likely their circumstances are to correlate with developing depression.