<a href="https://colab.research.google.com/github/JanaSchwarzerova/CyberKnife-Treatment-Data-Analysis-and-Insights/blob/main/CyberKnife_Treatment_Data_Analysis_and_Insights.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction – CyberKnife-Treatment-Data-Analysis-and-Insights.ipynb

This notebook presents a design for the analysis of a dataset collected from patients who underwent CyberKnife treatment. The goal is to explore the relationships between clinical parameters, diagnoses, and treatment outcomes, with a focus on complications, hospitalization duration, and other factors that could influence patient recovery and treatment effectiveness.

The notebook is structured to walk you through the entire process of data handling, from loading and preprocessing the data to applying machine learning models for regression and classification tasks. Key methods used in this analysis include:

1. Data Preprocessing: Cleaning and transforming the data to prepare it for analysis.
2. Association Rule Mining: Using the Apriori algorithm to discover relationships between diagnoses.
3. Regression Analysis: Analyzing the relationship between clinical factors (such as age and sex) and complications or hospitalization duration.
4. Random Forest Regressor: Building an ensemble model to improve predictions and identify key factors influencing patient outcomes.
5. Visualization: Visualizing the performance of the models and the relationships between actual and predicted outcomes.

Through this analysis, we aim to gain insights into the treatment outcomes of CyberKnife patients, providing valuable information to clinicians and medical practitioners.

**Step 1: Set up the environment and import libraries**

Start by importing the required libraries: pandas, NumPy, Matplotlib, seaborn, mlxtend, and scikit-learn.

In [None]:
# Install necessary libraries
!pip install mlxtend

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.frequent_patterns import apriori, association_rules
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score


**Step 2: Load the datasets**

Assuming you have the data in CSV files, you can load them using pandas. You can also add options to handle large datasets efficiently.

In [None]:
# Load the datasets
patients_df = pd.read_csv('patients.csv')
documents_df = pd.read_csv('documents.csv')
diagnoses_df = pd.read_csv('diagnoses.csv')

# Display the first few rows to understand the data
patients_df.head(), documents_df.head(), diagnoses_df.head()
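
As one way to act on the note about large datasets, the sketch below shows `pandas.read_csv` options that reduce memory use: `usecols`, `dtype`, `parse_dates`, and `chunksize`. The in-memory CSV and its column names (`patient_id`, `sex`, `date_of_birth`, `notes`) are assumptions made for illustration and may not match the real files.

```python
import io
import pandas as pd

# Hypothetical stand-in for patients.csv; the real column names may differ.
csv_text = (
    "patient_id,sex,date_of_birth,notes\n"
    "1,F,1950-03-01,a\n"
    "2,M,1962-11-15,b\n"
    "3,F,1948-07-30,c\n"
)

# Read only the needed columns, with compact dtypes and dates parsed on load.
patients_small = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["patient_id", "sex", "date_of_birth"],
    dtype={"patient_id": "int32", "sex": "category"},
    parse_dates=["date_of_birth"],
)

# Stream the file in fixed-size chunks when it does not fit in memory.
row_count = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    row_count += len(chunk)

print(patients_small.dtypes)
print(f"Rows counted in chunks: {row_count}")
```

With a real file, pass the path (for example `'patients.csv'`) in place of the `io.StringIO` object.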


**Step 3: Preprocessing the data**

Here you can clean the data, handle missing values, and standardize or transform the necessary columns (e.g., date of birth to age).

In [None]:
# Convert date of birth to datetime; unparseable values become NaT
patients_df['date_of_birth'] = pd.to_datetime(patients_df['date_of_birth'], errors='coerce')

# Approximate age in whole years as of today (365.25 days per year accounts for leap years)
patients_df['age'] = (pd.Timestamp.today() - patients_df['date_of_birth']).dt.days // 365.25

# Drop rows with a missing or non-positive age (this also removes rows whose date of birth failed to parse)
patients_df = patients_df[patients_df['age'] > 0].copy()

# Check for missing values
patients_df.isnull().sum()
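
Dividing a day count by a fixed year length can be off by one near a birthday. If exact ages matter, a calendar-based calculation is one alternative. The sketch below is a minimal example; the helper name `age_in_years` and the fixed reference date are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def age_in_years(dob: pd.Series, on: pd.Timestamp) -> pd.Series:
    """Completed years between each date of birth and the reference date `on`."""
    # True where the birthday has already occurred in the reference year.
    had_birthday = (on.month > dob.dt.month) | (
        (on.month == dob.dt.month) & (on.day >= dob.dt.day)
    )
    age = on.year - dob.dt.year - (~had_birthday).astype(int)
    # Keep missing dates of birth as missing ages.
    return age.where(dob.notna())

# Toy dates of birth, including one missing value.
dob = pd.to_datetime(pd.Series(["2000-06-15", "2000-06-16", None]), errors="coerce")
ages = age_in_years(dob, pd.Timestamp("2020-06-15"))
print(ages)
```

Passing the reference date explicitly keeps the result reproducible; in this dataset a treatment date, if one is available, may be a more meaningful reference than today's date.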


**Step 4: Association Rule Mining (Apriori Algorithm)**

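Convert the diagnosis data to a binary (one-hot encoded) matrix with one row per patient and one column per diagnosis code, then apply the Apriori algorithm to identify frequent itemsets. Apriori expects only boolean (or 0/1) columns, so identifier columns must not be left in the matrix.

In [None]:
# Build a patient-by-diagnosis boolean matrix: one row per patient, one column per code.
# Assumption: diagnoses_df has a 'patient_id' column identifying the patient; adjust the name to your data.
diagnoses_onehot = pd.crosstab(diagnoses_df['patient_id'], diagnoses_df['diagnosis_code']).astype(bool)

# Apply Apriori algorithm to find frequent itemsets
frequent_itemsets = apriori(diagnoses_onehot, min_support=0.1, use_colnames=True)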

# Generate association rules
association_rules_df = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)

# Display the association rules
association_rules_df.head()
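
To make the rule metrics concrete, the sketch below computes support, confidence, and lift by hand for a single rule on a toy table. The `patient_id` column and the codes `A`, `B`, and `C` are made up for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Toy long-format table: one row per (patient, diagnosis) pair.
toy = pd.DataFrame({
    "patient_id":     [1, 1, 2, 2, 3, 4],
    "diagnosis_code": ["A", "B", "A", "B", "A", "C"],
})

# One row per patient, one boolean column per diagnosis code.
basket = pd.crosstab(toy["patient_id"], toy["diagnosis_code"]).astype(bool)

# Support: fraction of patients whose diagnoses contain the itemset.
support_a = basket["A"].mean()
support_b = basket["B"].mean()
support_ab = (basket["A"] & basket["B"]).mean()

# Rule A -> B: confidence = support(A and B) / support(A); lift = confidence / support(B).
confidence_a_to_b = support_ab / support_a
lift_a_to_b = confidence_a_to_b / support_b

print(f"support(A,B)={support_ab:.3f}, confidence={confidence_a_to_b:.3f}, lift={lift_a_to_b:.3f}")
```

A lift above 1 means the two diagnoses occur together more often than they would if they were independent, which is why the cell above filters rules with `min_threshold=1.0`.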


**Step 5: Regression Analysis**

Use logistic and linear regression to analyze the relationship between clinical factors (age, sex) and treatment outcomes (e.g., complications, hospitalization duration).

*Logistic regression*

In [None]:
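# Logistic regression for complications (binary outcome: 0/1)
# One-hot encode 'sex' so the model receives numeric inputs even if the column holds strings
X = pd.get_dummies(patients_df[['age', 'sex']], columns=['sex'], drop_first=True, dtype=int)
y = patients_df['complications']  # Target variable: complications

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X, y)

# Predictions and evaluation (on the training data, so the score is optimistic)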
y_pred = log_reg.predict(X)
print(f"Logistic Regression Accuracy: {log_reg.score(X, y)}")
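
Accuracy alone does not describe how each factor relates to the outcome. One common way to read a fitted logistic regression is to exponentiate its coefficients into odds ratios. The sketch below does this on synthetic data; the data-generating process and all values are invented for illustration and say nothing about real patients. Note that scikit-learn's `LogisticRegression` applies L2 regularization by default, so these ratios are somewhat shrunk toward 1.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data: complication risk rises with age; sex has no effect by construction.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "age": rng.integers(30, 90, size=n),
    "sex": rng.choice(["F", "M"], size=n),
})
risk = 1 / (1 + np.exp(-(toy["age"] - 60) / 10))
toy["complications"] = (rng.random(n) < risk).astype(int)

# Encode 'sex' as a 0/1 indicator column (sex_M), leaving 'age' numeric.
X = pd.get_dummies(toy[["age", "sex"]], columns=["sex"], drop_first=True, dtype=int)
y = toy["complications"]

model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) = multiplicative change in the odds per one-unit increase.
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)
print(odds_ratios)
```

An odds ratio above 1 for `age` indicates that the estimated odds of a complication increase with each additional year.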


*Linear Regression*

In [None]:
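# Linear regression for hospitalization duration
# One-hot encode 'sex' so the model receives numeric inputs even if the column holds strings
X = pd.get_dummies(patients_df[['age', 'sex']], columns=['sex'], drop_first=True, dtype=int)
y = patients_df['duration']  # Target: duration of hospitalization

lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Predictions and evaluation (on the training data, so the scores are optimistic)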
y_pred = lin_reg.predict(X)
print(f"R-squared: {r2_score(y, y_pred)}")
print(f"Mean Squared Error: {mean_squared_error(y, y_pred)}")
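
The two metrics printed above follow directly from the residuals. As a worked example on made-up numbers, the sketch below computes the mean squared error and R-squared by hand, using the same definitions as `mean_squared_error` and `r2_score`.

```python
import numpy as np

# Made-up actual and predicted durations, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_hat

# MSE: average squared residual.
mse = np.mean(residuals ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE = {mse}, R-squared = {r2}")
```

Here the squared residuals sum to 1.5 and the total sum of squares is 20, giving an MSE of 0.375 and an R-squared of 0.925.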


**Step 6: Random Forest Regressor**

Use Random Forest to model the relationship between clinical factors and the hospitalization duration, and evaluate the feature importance.

In [None]:
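# Random Forest Regressor for hospitalization duration
# Reuses X and y from the linear regression cell above; random_state makes the run reproducible
rf_regressor = RandomForestRegressor(random_state=42)
rf_regressor.fit(X, y)

# Predictions and evaluation (on the training data, so the scores are optimistic)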
y_pred = rf_regressor.predict(X)
print(f"Random Forest R-squared: {r2_score(y, y_pred)}")
print(f"Mean Squared Error: {mean_squared_error(y, y_pred)}")

# Feature importance, paired with the feature names
feature_importance = pd.Series(rf_regressor.feature_importances_, index=X.columns)
print(f"Feature Importance:\n{feature_importance}")
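
Because a random forest can fit its training data very closely, scores computed on the same rows used for fitting tend to overstate performance. The sketch below compares training and held-out R-squared using `train_test_split`. The synthetic data, the `sex_M` indicator column, and the split size are assumptions for illustration, not the notebook's actual data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: duration depends weakly on age plus noise; sex_M has no effect.
rng = np.random.default_rng(0)
n = 300
X_toy = pd.DataFrame({
    "age": rng.integers(30, 90, size=n),
    "sex_M": rng.integers(0, 2, size=n),
})
y_toy = 2.0 + 0.05 * X_toy["age"] + rng.normal(0.0, 1.0, size=n)

# Hold out 25% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

r2_train = r2_score(y_train, rf.predict(X_train))
r2_test = r2_score(y_test, rf.predict(X_test))
print(f"Train R-squared: {r2_train:.3f}, held-out R-squared: {r2_test:.3f}")
```

A large gap between the two scores suggests overfitting; the held-out score is the better estimate of how the model would perform on new patients.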


**Step 7: Visualization**

Visualize the correlation between predicted and actual outcomes using scatter plots.

In [None]:
# Scatter plot of actual vs. predicted duration (y_pred holds the Random Forest predictions from the previous cell)
plt.figure(figsize=(8, 6))
plt.scatter(y, y_pred, alpha=0.7)
plt.title('Predicted vs Actual Duration')
plt.xlabel('Actual Duration')
plt.ylabel('Predicted Duration')
plt.show()
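
A predicted-versus-actual scatter plot is easier to read with a reference line marking perfect prediction. The sketch below wraps the plot in a small helper that adds such a line; the function name `plot_predicted_vs_actual` and the sample values are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_predicted_vs_actual(y_true, y_pred, ax=None):
    """Scatter actual against predicted values with a dashed y = x reference line."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 6))
    ax.scatter(y_true, y_pred, alpha=0.7)
    # Points on this line are predicted exactly; distance from it is the error.
    lo = min(y_true.min(), y_pred.min())
    hi = max(y_true.max(), y_pred.max())
    ax.plot([lo, hi], [lo, hi], linestyle="--", color="gray", label="Perfect prediction")
    ax.set_title("Predicted vs Actual Duration")
    ax.set_xlabel("Actual Duration")
    ax.set_ylabel("Predicted Duration")
    ax.legend()
    return ax

# Made-up values, for illustration only.
ax = plot_predicted_vs_actual([1, 2, 3, 4], [1.2, 1.8, 3.5, 3.9])
```

In the notebook, call `plot_predicted_vs_actual(y, y_pred)` followed by `plt.show()`. Points above the dashed line are over-predictions and points below it are under-predictions.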