# Block Y1B: Creative Brief Template

Please use this template to write down your solutions to the DataLab Tasks. If you have any questions, please contact your mentor or the person responsible for the content.

## Important Notes:
- [ ] Please rename the file to ```CreativeBrief_<your_name>_<studentnumber>.ipynb``` before submitting it.
- [ ] Upload this template to the 'Deliverables' folder in your BUas GitHub repository.
- [ ] You are allowed to add as many (Markdown/Python) cells as you need. 
- [ ] If a task requires only code or only text, please delete the unnecessary cell.
- [ ] Your work should be reproducible, meaning that we should be able to run your code in the template and get the same results as you did. Tip: use relative paths to load your data!
- [ ] Ensure that before you hand in the template, you press ```Restart & Run all```; we should be able to see the results of your code in the notebook (i.e., output cells).
- [ ] Ensure that your code in the template is ```error-free```. In other words, we should not see any error messages when we run your code.

## Project Overview
This project focuses on analysing diabetes patient data and applying machine learning algorithms to it. <br>
Students will work with a wide variety of machine learning techniques, ranging from basic data analysis to the optimisation of advanced models.


## Required Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
...

## Task 1: Exploratory Data Analysis (EDA) with Python and SQL
_________


### **Task 1A: Exploratory Data Analysis with SQL**

### Task Description
Establish a connection between Python and a database, perform basic operations using SQL, and carry out exploratory data analysis with SQL to understand the dataset's characteristics and patterns.
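
As a minimal sketch of this workflow, the snippet below connects to a database with Python's built-in `sqlite3` module and runs two basic SQL queries. The in-memory database, the toy rows, and the `admission_type_id` column are assumptions made purely for illustration; your own database engine, connection details, and schema will probably differ.

```python
import sqlite3

# Hypothetical in-memory database standing in for the real one
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy table with made-up rows (illustration only)
cur.execute(
    "CREATE TABLE encounter (encounter_id INTEGER PRIMARY KEY, admission_type_id INTEGER)"
)
cur.executemany(
    "INSERT INTO encounter (encounter_id, admission_type_id) VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 2), (4, 3), (5, 1)],
)
conn.commit()

# Basic operation: count all records in the table
total_records = cur.execute("SELECT COUNT(*) FROM encounter").fetchone()[0]

# Basic EDA: how often each admission type occurs, most common first
admission_counts = cur.execute(
    """
    SELECT admission_type_id, COUNT(*) AS n
    FROM encounter
    GROUP BY admission_type_id
    ORDER BY n DESC, admission_type_id
    """
).fetchall()

print(total_records)
print(admission_counts)
conn.close()
```

The same `COUNT(*)` and `GROUP BY` pattern should carry over to most of the queries in Task 1A once the connection points at the real database.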

### Task 1A.1: General Overview of the Data

In [None]:
# Count the total number of records in the encounter table
  
    # Your code here.

# Check the distribution of different admission types

    # Your code here.
  
# Explore the top discharge dispositions

    # Your code here

### Task 1A.2: Identifying Missing or Anomalous Data

In [None]:
# Check for missing values in the race column (Hint: Count the occurrence of each unique value in this column)

    # Your code here

#Check for missing or unusual values in the weight column

    #Your code here

### Task 1A.3: Understanding Age Distribution

In [None]:
# Explore the age distribution of the patients

    #Your code here

### Task 1A.4: Admission Trends by Source and Type

In [None]:
# Analyze how different admission sources contribute to hospital admissions 
# (Hint: Calculate the number of admissions per source)

    #Your code here

#Investigate which admission types correspond to specific admission sources
# (Hint: add admission type to your previous query)

    #Your code here

### Task 1A.5: Hospital Stay and Readmission Patterns

In [None]:
# Find the average time in hospital for each admission type

    #Your code here

#Investigate readmission rates by admission type

    #Your code here

### Task 1A.6: Comparing Admission Types and Outcomes

In [None]:
# Compare discharge dispositions across different admission types

    #Your code here

# Compare readmission rates by discharge disposition

    #Your code here

+++++
### **Task 1B: Exploratory Data Analysis with Python**

### Task Description
Perform comprehensive exploratory data analysis to understand the dataset's characteristics, patterns, and potential challenges. <br>
It is crucial to understand the structure and quality of the dataset before diving into any analysis or modeling.

### 1B.1: Loading the data:

In [None]:
import pandas as pd

# Load the dataset

# Display first few rows

### 1B.2: Analysing the dataset shape:

In [None]:
# Dataset shape

# Column names

### 1B.3: Load and Explore a Dataset Using NumPy

In [None]:
import numpy as np

# Load the dataset
data = np.genfromtxt()

# Check the shape of the dataset
print("Dataset shape:",)

# Preview the first few rows
print("First 5 rows of the dataset:\n",)

### 1B.4: Analysing data types:

In [None]:
# Data types
print("Data types:\n", )

# Unique values in categorical columns

    print(f"Unique values in {}:")

### 1B.5: Exploratory Data Analysis (EDA) with Visualisations

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of a numeric column


# Count plot of a categorical variable

plt.show()

# Boxplot to detect outliers

plt.show()

_________
## Task 2: Data Processing
### Task Description
Create a robust data preprocessing pipeline to handle missing values, encode categorical variables, and scale numerical features.
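
One possible shape for such a pipeline is sketched below with scikit-learn's `ColumnTransformer`. The toy DataFrame, the column names `num_medications` and `gender`, and the choice of median imputation are illustrative assumptions, not requirements of the task.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with made-up column names and values (illustration only)
toy_df = pd.DataFrame({
    "num_medications": [10.0, np.nan, 14.0, 20.0],
    "gender": ["Female", "Male", "Male", "Female"],
})

numeric_cols = ["num_medications"]
categorical_cols = ["gender"]

# Numeric columns: fill missing values with the median, then standardise
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Each column group gets its own treatment; categoricals are one-hot encoded
preprocessor = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

processed = preprocessor.fit_transform(toy_df)
print(processed.shape)
```

Wrapping the steps in one object means the same transformations can later be fitted on the training data and reused unchanged on the test data.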

### **Task 2.1: Initial Cleaning and Pre-processing**

#### 2.1.1: Load and Visualise the Data
- Use pandas to load the dataset.
- Visualise the first few rows to understand the structure of the data.

In [None]:
# Importing required Python libraries
import pandas as pd

# Loading the data

  #<add your code here>

# visualise the first few rows of the data

  #<add your code here>
    

#### 2.1.2: Display Data Types and Check for Incorrect Types

- Use the .dtypes attribute to display each column's data type.
- Identify columns that have unexpected data types (e.g., numeric data stored as strings).
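
A small sketch of what "numeric data stored as strings" can look like, and one way to repair it with `pd.to_numeric`. The toy DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Toy data: 'time_in_hospital' holds numbers, but they are stored as text
toy_df = pd.DataFrame({
    "age": ["[50-60)", "[60-70)", "[70-80)"],
    "time_in_hospital": ["3", "7", "unknown"],
})

print(toy_df.dtypes)  # both columns are reported as text, not as numbers

# Convert to numbers; entries that cannot be parsed become NaN
toy_df["time_in_hospital"] = pd.to_numeric(toy_df["time_in_hospital"], errors="coerce")
print(toy_df.dtypes)
```

With `errors="coerce"`, unparseable entries turn into missing values, so it is worth counting them afterwards.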

In [None]:
# Display data types of each column

   #<add your code here>

#### 2.1.3: Identify and Remove Duplicate Rows

- Check for duplicate rows.
- If duplicates exist, remove them 
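
The two steps above can be sketched on a toy DataFrame as follows. The column names and values are invented for illustration.

```python
import pandas as pd

# Toy data in which the second and third rows are identical
toy_df = pd.DataFrame({"patient_id": [1, 2, 2, 3], "age": [50, 61, 61, 47]})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy_df.duplicated().sum())
print(f"Number of duplicate rows: {n_duplicates}")

# Keep the first occurrence of each row and renumber the index
deduplicated = toy_df.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```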

In [None]:
# Check for duplicate rows
 #<add your code here>

print(f"Number of duplicate rows: {}")

# Drop duplicates if any
   #<add your code here>

### Task 2.2: Handling Metadata and Missing Values
#### 2.2.1: Identify Metadata from the Dataset
Metadata includes information like column names, data types, and any additional descriptive information.

In [None]:
# Display metadata including column names and data types
 #<add your code here>


#### 2.2.2: Analyse Different Types of Data
- Categorize columns based on their data types (e.g., numerical, categorical).
- Describe the significance of each data type and how it might impact data processing.

In [None]:
# Describe data to understand different types
 #<add your code here>

#### 2.2.3: Identify Missing Values
- Missing values can significantly impact model performance and analysis.
- Count missing values in each column.
- Calculate the percentage of missing values.
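
A minimal sketch of both calculations on a toy DataFrame. The columns and values are invented; in a real dataset, missing entries may also be encoded as placeholder strings, which would need to be converted to `NaN` first.

```python
import numpy as np
import pandas as pd

# Toy data with missing entries (illustration only)
toy_df = pd.DataFrame({
    "weight": [np.nan, np.nan, 80.0, np.nan],
    "race": ["Caucasian", None, "Asian", "Other"],
})

# Number of missing values per column
missing_count = toy_df.isna().sum()

# The mean of a boolean mask is the fraction of True values
missing_pct = toy_df.isna().mean() * 100

print(missing_count)
print(missing_pct)
```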

In [None]:
# Identify and calculate the percentage of missing values
 #<add your code here>
print("Missing values (in percentage) for each column:\n",  )

#### 2.2.4: Splitting Data for Analysis
Splitting data ensures that models are trained on one portion of the data and tested on another, unseen portion.
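
As a hedged illustration, the snippet below splits a toy array into 80% training and 20% test data. The arrays, the `test_size` of 0.2, and the use of `stratify` are assumptions chosen for the example, not prescribed values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, perfectly balanced binary target
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# random_state makes the split reproducible;
# stratify keeps the class proportions equal in both parts
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_train_toy.shape, X_test_toy.shape)
```

Fixing `random_state` also helps with the reproducibility requirement stated in the notes at the top of this template.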

In [None]:
from sklearn.model_selection import train_test_split

# Features and target
X = df.drop(columns=['target_column_name'])
y = df['target_column_name']

# Split dataset
X_train, X_test, y_train, y_test = train_test_split()

___________
## Task 3: Machine Learning
### Task Description
Implement various machine learning algorithms for regression, classification, and clustering, and evaluate them.

### **Task 3.1: Implementing Regression Baseline**
#### 3.1.1: Loading the data
- Use pandas to load the dataset.

#### 3.1.2: Start to implement Linear Regression as a baseline model

In [None]:
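# Importing the required sklearn packages for implementing linear regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error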

# Defining features (X) and target (y)


# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split()

# Initialising and training Linear Regression model
linear_model = LinearRegression()


# Predictions
y_pred = linear_model.predict()

# Evaluation
mae 
mse 

print(f"Linear Regression MAE: {}")
print(f"Linear Regression MSE: {}")

##### You could try other linear models as well, such as Ridge and Lasso. Useful information can be found [here](https://scikit-learn.org/stable/modules/linear_model.html#linear-model).

#### 3.1.3: Visualise the output for linear regression

In [None]:
for i in range(X.shape[1]):
    plt.scatter()
    plt.xlabel()
    plt.ylabel()
    plt.show()
    plt.close()

#### 3.1.4: Implement Decision Tree Regression for Non-linear Relationships

In [None]:
# Importing the required sklearn packages for implementing decision tree regression
from sklearn.tree import DecisionTreeRegressor

# Initialising and training Decision Tree Regressor
tree_model = DecisionTreeRegressor()  


# Predictions
y_pred_tree = tree_model.predict()

# Evaluation


print(f"Decision Tree Regression MAE: {}")
print(f"Decision Tree Regression MSE: {}")

#### 3.1.5: Implement Gradient Boosting Regression

In [None]:
# Importing the required sklearn packages for implementing Gradient Boosting Regressor
from sklearn.ensemble import GradientBoostingRegressor

# Initialising and training Gradient Boosting Regressor
gboost_model = GradientBoostingRegressor()


# Predictions
y_pred_gboost = gboost_model.predict()

# Evaluation
mae_gboost = mean_absolute_error()
mse_gboost = mean_squared_error()

print(f"Gradient Boosting Regression MAE: {}")
print(f"Gradient Boosting Regression MSE: {}")

#### 3.1.6: Implement XGBoost Regression

In [None]:
# Importing the required packages
import xgboost as xgb

# Initialize and train XGBoost Regressor
xgboost_model = xgb.XGBRegressor()
xgboost_model.fit()

# Predictions
y_pred_xgboost = xgboost_model.predict()

# Evaluation
mae_xgboost = mean_absolute_error()
mse_xgboost = mean_squared_error()

print(f"XGBoost Regression MAE: {}")
print(f"XGBoost Regression MSE: {}")

#### 3.1.7: Visualise the output for non-linear regression
Try plotting histograms of the features. Useful information can be found [here](https://matplotlib.org/stable/gallery/statistics/hist.html).

In [None]:
## Write visualisation code here

#### 3.1.8: Evaluation and Model Comparison
Compare the performance of all four models: Linear Regression, Decision Tree Regression, Gradient Boosting, and XGBoost. Summarise each model’s performance using a comparison table.


In [None]:
# Sample comparison table
results = {
    "Model": ["Linear Regression", "Decision Tree", "Gradient Boosting", "XGBoost"],
    "MAE": [],
    "MSE": []
}

results_df = pd.DataFrame()
print()

#### Expected Output for model comparison in 3.1.8:

| Model | MAE | MSE |
|---|---|---|
| Linear Regression | -- | -- |
| Decision Tree     | -- | -- |
| Gradient Boosting | -- | -- |
| XGBoost           | -- | -- |

++++++++
### **Task 3.2: Implementing Classification Baseline**


#### 3.2.1: Loading the data
- Use pandas to load the dataset.

In [None]:
## Loading the data

#### 3.2.2:  Prepare Data for Classification


In [None]:
# Importing the required sklearn packages
from sklearn.model_selection import train_test_split

# Defining target and features
X = df.drop()  


# Converting target to binary if needed


# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split()

#### 3.2.3:  Implement Logistic Regression as a Baseline Model

In [None]:
# Importing the required sklearn packages
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialising and training Logistic Regression model
logistic_model = 


# Predictions


# Evaluation
accuracy = 


print(f"Logistic Regression Accuracy: {}")

#### 3.2.4: Implement Random Forest Classifier as a Non-linear Baseline

In [None]:
# Importing the required sklearn packages
from sklearn.ensemble import RandomForestClassifier

# Initialising and training Random Forest Classifier


# Predictions


# Evaluation
accuracy =

print(f"Random Forest Accuracy: {}")

#### 3.2.5: Implement K-Nearest Neighbors (KNN) Classifier

In [None]:
# Importing the required sklearn packages
from sklearn.neighbors import KNeighborsClassifier

# Standardising the features for KNN


# Initialising and training the KNN model


# Predictions
y_pred_knn = 

# Evaluation
accuracy = 

print(f"KNN Accuracy: {}")

#### 3.2.6: Evaluation and Model Comparison
Compare the performance of the implemented models using four metrics: "Accuracy", "Precision", "Recall", and "F1-score".
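
The four metrics can be computed with `sklearn.metrics`, as in the sketch below. The label vectors are made up so that the metrics take different values; in the notebook they would be `y_test` and each model's predictions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up ground truth and predictions for a binary problem
y_true_toy = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 1]

# Here: 3 true positives, 2 false positives, 1 false negative, 2 true negatives
toy_metrics = {
    "Accuracy": accuracy_score(y_true_toy, y_pred_toy),    # (TP + TN) / all
    "Precision": precision_score(y_true_toy, y_pred_toy),  # TP / (TP + FP)
    "Recall": recall_score(y_true_toy, y_pred_toy),        # TP / (TP + FN)
    "F1-score": f1_score(y_true_toy, y_pred_toy),          # harmonic mean of the two
}

for name, value in toy_metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy alone can be misleading when classes are imbalanced, which is one reason to report all four.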

In [None]:
results = {
    "Model": ["Logistic Regression", "Random Forest", "K-Nearest Neighbors"],
    "Accuracy":
}

results_df = 
print()

#### Expected Output for Model Comparison in 3.2.6:
| Model | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Logistic Regression | -- | -- | -- | -- |
| Random Forest       | -- | -- | -- | -- |
| K-Nearest Neighbors | -- | -- | -- | -- |

++++++++
### **Task 3.3: Implementing Clustering Baseline**


#### 3.3.1: Load and Preprocess the Data

In [None]:
import pandas as pd


# Load the dataset

# Standardising column names and removing duplicates


# Applying standard scaling to numerical features for clustering

#### 3.3.2: Implement K-Means Clustering

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Initialising and training K-Means model
kmeans = 

# Predict clusters and evaluate


print(f"K-Means Silhouette Score: {}")

#### 3.3.3: Apply PCA for Dimensionality Reduction

In [None]:
from sklearn.decomposition import PCA

# Applying PCA to reduce dimensions
pca = 

# Visualising clusters for K-Means in 2D
# Hint: Set n_components=2 for 2D visualisation or n_components=3 for 3D visualisation.
plt.scatter()
plt.show()

#### 3.3.4: Implement Hierarchical Clustering

In [None]:
from sklearn.cluster import AgglomerativeClustering

# Initialising and fitting hierarchical clustering
hierarchical = 

# Evaluating with Silhouette Score
hierarchical_silhouette = 
print(f"Hierarchical Clustering Silhouette Score: {}")

#### 3.3.5: Implement DBSCAN for Density-Based Clustering

In [None]:
from sklearn.cluster import DBSCAN

# Initialising and fitting DBSCAN
dbscan = 

# Filtering noise points (labeled as -1)


# Evaluate with Silhouette Score (excluding noise)

#### 3.3.6: Evaluation and Model Comparison

In [None]:
# Summary of Silhouette Scores
results = {
    "Clustering Method": ["K-Means", "Hierarchical", "DBSCAN"],
    "Silhouette Score": []
}


print()

#### Expected Output for Model Comparison in 3.3.6:
| Clustering Method | Silhouette Score |
|---|---|
| K-Means           | -- |
| Hierarchical      | -- |
| DBSCAN            | -- |

++++++++++
### **Task 3.4: Feature Engineering and Feature Selection**
Select the most relevant features to improve model performance.

#### 3.4.1: Feature Engineering

In [None]:
# Create age group categories if applicable
X['age_group'] = 

# Interaction feature example

#### 3.4.2: Generate Statistical Features
Calculate aggregate statistics such as mean, max, min, or standard deviation for features if relevant
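
One way to generate such features is a grouped aggregation that is merged back onto the original rows, sketched below. The `patient_id` and `time_in_hospital` columns, the toy values, and the `stay_` prefix are assumptions for illustration.

```python
import pandas as pd

# Toy data: several encounters per patient (made-up values)
toy_df = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 2],
    "time_in_hospital": [3, 5, 2, 4, 6],
})

# Aggregate statistics per patient, with a prefix to mark the new columns
patient_stats = (
    toy_df.groupby("patient_id")["time_in_hospital"]
    .agg(["mean", "max", "min", "std"])
    .add_prefix("stay_")
    .reset_index()
)

# Attach the per-patient statistics to every encounter of that patient
toy_df = toy_df.merge(patient_stats, on="patient_id", how="left")
print(toy_df)
```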

In [None]:
# Calculate aggregate statistics (mean, max, min, standard deviation) for the relevant features

    # Your code here

#### 3.4.3: Feature Transformation
Normalise or standardise features, especially for distance-based models like KNN or clustering
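
The difference between the two transformations can be seen on a toy column, as in the sketch below. The values are arbitrary and only serve to illustrate the effect.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A single toy feature (scalers expect a 2D array: samples x features)
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Standardisation: subtract the mean, divide by the standard deviation
standardised = StandardScaler().fit_transform(values)

# Normalisation: rescale linearly to the range [0, 1]
normalised = MinMaxScaler().fit_transform(values)

print(standardised.ravel())
print(normalised.ravel())
```

In a train/test setting, the scaler would normally be fitted on the training data only and then applied to the test data with `transform`.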

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Standardize numerical columns

#### 3.4.4: Feature Selection
Use feature importance from tree-based models (Random Forest, Gradient Boosting) to select important features
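
As a rough sketch of the idea, the snippet below fits a Random Forest on synthetic data in which only the first feature drives the target, then ranks the features by importance. The data, the model settings, and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 3 features, but only feature 0 influences the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_toy, y_toy)

# Impurity-based importances, one value per feature, summing to 1
importances = forest.feature_importances_

# Feature indices sorted from most to least important
ranking = np.argsort(importances)[::-1]
print(importances, ranking)
```

Features near the bottom of the ranking are candidates for removal, although importances can be unreliable when features are strongly correlated.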


In [None]:
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt

# Fit Random Forest model


# Get feature importance
importances = model.feature_importances_


# Plot feature importance

plt.show()

#### 3.4.5: Model Evaluation After Feature Selection
- After feature engineering and selection, evaluate your model with the new set of features.
- Apply cross-validation to measure the model’s performance with the refined feature set
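
A minimal sketch of cross-validating a model on a reduced feature set. The synthetic dataset, the hypothetical `selected` column indices, and the 5-fold setting are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data; with shuffle=False the informative features come first
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, shuffle=False, random_state=0
)

# Hypothetical outcome of a feature-selection step: keep the first three columns
selected = [0, 1, 2]

# 5-fold cross-validation: one accuracy value per held-out fold
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy[:, selected], y_toy, cv=5, scoring="accuracy"
)
print(scores, scores.mean())
```

Comparing the mean score with the one obtained on all features gives a first indication of whether the selection helped.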

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Fit model on selected features

print("Cross-validated accuracy after feature selection:",)

+++++++++
### **Task 3.5: Hyperparameter Tuning**
Improve the quality of your implementations


#### 3.5.1: GridSearchCV for Systematic Hyperparameter Tuning
- Define Hyperparameter Grid for Different Models
- Define grids of hyperparameters for models such as Logistic Regression (for classification) and K-Means (for clustering).
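
A hyperparameter grid is a dictionary that maps parameter names to lists of candidate values. The sketch below uses `ParameterGrid` to show which combinations such a grid expands to. The grids and their values are placeholders, not recommendations.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grids; keys must match the estimator's constructor arguments
logistic_grid_example = {"C": [0.1, 1.0, 10.0], "solver": ["lbfgs", "liblinear"]}
kmeans_grid_example = {"n_clusters": [2, 3, 4, 5], "n_init": [10]}

# Every combination of values becomes one candidate configuration
logistic_candidates = list(ParameterGrid(logistic_grid_example))
kmeans_candidates = list(ParameterGrid(kmeans_grid_example))

print(len(logistic_candidates), len(kmeans_candidates))
print(logistic_candidates[0])
```

The number of candidates grows multiplicatively with each added parameter, so small grids are a sensible starting point.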

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Define hyperparameter grid for Logistic Regression


# Define hyperparameter grid for KMeans

#### 3.5.2: Apply GridSearchCV with Cross-Validation
- Use GridSearchCV with cross-validation to perform hyperparameter tuning on both models.
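
The sketch below shows what such a search might look like for a classifier on synthetic data. The dataset, the single-parameter grid, and the 5-fold setting are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (illustration only)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical grid over the inverse regularisation strength C
toy_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Each candidate is scored with 5-fold cross-validation
search = GridSearchCV(LogisticRegression(max_iter=1000), toy_grid, cv=5, scoring="accuracy")
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `best_estimator_` holds a model refitted on all the data passed to `fit` with the best parameter combination.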

In [None]:
# Logistic Regression GridSearchCV
logistic_search = GridSearchCV(LogisticRegression(), logistic_grid, scoring='accuracy')
logistic_search.fit(X_train, y_train)

# KMeans GridSearchCV
kmeans_search = GridSearchCV(KMeans(), kmeans_grid, scoring='silhouette_score')
kmeans_search.fit(X_train)

#### 3.5.3: Nested Cross-Validation to Validate Stability of Hyperparameters

In [None]:
from sklearn.model_selection import cross_val_score

# Logistic Regression Nested Cross-Validation
nested_cv_score_logistic = cross_val_score()
print("Nested CV Accuracy for Logistic Regression:")

#### 3.5.4: Evaluate Tuned Models with Multiple Metrics
- Evaluate the tuned Logistic Regression model on both training and test sets.
- Use multiple metrics for classification (accuracy, F1-score) and clustering (silhouette score, Calinski-Harabasz).
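
For the clustering side, both metrics can be computed from the data and the predicted labels, as sketched below. The synthetic blobs and the K-Means settings are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic data with three compact groups (illustration only)
X_toy, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_toy)

# Silhouette lies in [-1, 1]; higher means better separated clusters
sil = silhouette_score(X_toy, labels)

# Calinski-Harabasz has no upper bound; higher means denser, better separated clusters
ch = calinski_harabasz_score(X_toy, labels)

print(f"Silhouette: {sil:.3f}, Calinski-Harabasz: {ch:.1f}")
```

Neither metric needs ground-truth labels, which is why they suit clustering.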

In [None]:
from sklearn.metrics import accuracy_score, f1_score, silhouette_score, calinski_harabasz_score

# Logistic Regression evaluation
y_train_pred = logistic_search.best_estimator_.predict(X_train)
y_test_pred = logistic_search.best_estimator_.predict(X_test)

train_accuracy = accuracy_score()
test_accuracy = accuracy_score()
train_f1 = f1_score()
test_f1 = f1_score()

print(f"Logistic Regression - Train Accuracy: ")
print(f"Logistic Regression - Train F1: ")

# KMeans evaluation
kmeans_labels = kmeans_search.best_estimator_.predict(X_train)
silhouette = silhouette_score()
calinski_harabasz = calinski_harabasz_score()

print(f"KMeans - Silhouette Score: ")
print(f"KMeans - Calinski-Harabasz Score:")

+++++++++
### **Task 3.6: Model Selection**
- Compare and evaluate the implemented models and select the best ones.

#### 3.6.1: Compare and Choose the Best Model
- Compare the models using evaluation metrics (e.g., MSE for regression, F1-score for classification, Silhouette Score for clustering).
- Summarise the performance metrics of each model in a table and choose the model with the best performance for each task.

In [None]:
import pandas as pd

# Sample model comparison
results = {
    "Model": ["Linear Regression", "Random Forest", "KMeans"],
    "MAE": [mae_lr, mae_rf, "N/A"],
    "MSE": [mse_lr, mse_rf, "N/A"],
    "Silhouette Score": ["N/A", "N/A", silhouette_score_kmeans]
}

results_df = pd.DataFrame(results)
print(results_df)

#### 3.6.2: Finalise and Save the Best Model
- Once the best model is identified, retrain it on the full dataset and save it using joblib for future use.
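
A minimal sketch of the save-and-reload round trip with `joblib`. The toy regression model, the temporary directory, and the file name `toy_model.pkl` are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model standing in for the selected best model (y = 2x + 1)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X_toy, y_toy)

# Save to a temporary location and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should behave like the original
prediction = restored.predict(np.array([[4.0]]))[0]
print(prediction)
```

A model saved this way should generally be loaded with the same library versions that were used to save it.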

In [None]:
import joblib

# Save the best model (e.g., the tuned estimator from Task 3.5.2; replace with your own best model)
joblib.dump(logistic_search.best_estimator_, "best_model.pkl")

____________
## Task 4: Mathematics for Machine Learning

#### 4.1: Implementing Linear Regression with Numpy from Scratch

- Add all links to your Jupyter notebook files `LinearRegression.ipynb` here!

#### 4.2: Implement Logistic Regression with Numpy from Scratch

- Add all links to your Jupyter notebook files `LogisticRegression.ipynb` here!

____________
## Task 5: Final Report for Your Creative Brief Project
- Add a link to your `FinalReport_<your_name>_<studentnumber>.pdf` here!