# COSC 437 Data Mining Lab Assignment 3 - Logistic Regression for Pulsar Classification

## Data Information
HTRU2 is a data set which describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey (South). 

Pulsars are a rare type of neutron star that produce radio emission detectable here on Earth. They are of considerable scientific interest as probes of space-time, the inter-stellar medium, and states of matter.

As pulsars rotate, their emission beam sweeps across the sky, and when this crosses our line of sight, it produces a detectable pattern of broadband radio emission. As pulsars rotate rapidly, this pattern repeats periodically. Thus pulsar search involves looking for periodic radio signals with large radio telescopes.

Each pulsar produces a slightly different emission pattern, which varies slightly with each rotation. Thus a potential signal detection, known as a 'candidate', is averaged over many rotations of the pulsar, as determined by the length of an observation. In the absence of additional information, each candidate could potentially describe a real pulsar. However, in practice almost all detections are caused by radio frequency interference (RFI) and noise, making legitimate signals hard to find.

Machine learning tools are now being used to automatically label pulsar candidates to facilitate rapid analysis. Classification systems in particular are being widely adopted, which treat the candidate data sets as binary classification problems. Here the legitimate pulsar examples are a minority positive class, and spurious examples the majority negative class. At present multi-class labels are unavailable, given the costs associated with data annotation.

The data set shared here contains 16,259 spurious examples caused by RFI/noise, and 1,639 real pulsar examples. These examples have all been checked by human annotators. 

## Variable Description
| Variable Name    | Role    | Type       | Description | Units | Missing Values |
| ---------------- | ------- | ---------- | ----------- | ----- | -------------- |
| Profile_mean     | Feature | Continuous |             |       | no             |
| Profile_stdev    | Feature | Continuous |             |       | no             |
| Profile_skewness | Feature | Continuous |             |       | no             |
| Profile_kurtosis | Feature | Continuous |             |       | no             |
| DM_mean          | Feature | Continuous |             |       | no             |
| DM_stdev         | Feature | Continuous |             |       | no             |
| DM_skewness      | Feature | Continuous |             |       | no             |
| DM_kurtosis      | Feature | Continuous |             |       | no             |
| class            | Target  | Binary     |             |       | no             |


## Lab Overview
In this lab, you will apply logistic regression to the HTRU2 dataset to build a classification model. The goal is to predict whether a given candidate is a real pulsar (positive class) or spurious (negative class) using the provided features.


## Part I - Data Loading and Exploration
Load the HTRU2 dataset into your environment and perform an initial exploration of it, including:
- Basic statistics and the distributions of the features.
- Distribution of the target class (note the imbalance between classes).
- A check for missing values (there are none, but confirm).
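
The checks in the list above can be sketched with pandas as follows. This is only an illustration under stated assumptions: the `explore` helper is hypothetical, and the frame it runs on is a small synthetic stand-in that reuses the column names from the Variable Description table, not the real HTRU2 values.

```python
import numpy as np
import pandas as pd

FEATURES = [
    "Profile_mean", "Profile_stdev", "Profile_skewness", "Profile_kurtosis",
    "DM_mean", "DM_stdev", "DM_skewness", "DM_kurtosis",
]

def explore(df: pd.DataFrame, target: str = "class") -> dict:
    """Hypothetical helper: gather the three exploration checks in one dict."""
    return {
        "summary": df.drop(columns=target).describe(),          # basic statistics
        "class_counts": df[target].value_counts(),              # absolute counts
        "class_share": df[target].value_counts(normalize=True), # class proportions
        "missing": df.isna().sum(),                             # NaNs per column
    }

# Synthetic stand-in data (NOT the real HTRU2 values): 90 negatives, 10 positives.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(100, 8)), columns=FEATURES)
demo["class"] = [0] * 90 + [1] * 10

report = explore(demo)
print(report["summary"].T[["mean", "std", "min", "max"]])
print(report["class_counts"])
print("total missing values:", int(report["missing"].sum()))
```

With the real data, `demo` would be replaced by the frame returned from `pd.read_csv`. If the downloaded file turns out to have no header row, passing `header=None, names=FEATURES + ["class"]` would attach the column names used here. Feature distributions can then be inspected with `df[FEATURES].hist(bins=30)`, which requires matplotlib.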

In [None]:
## TODO: Your code here. If appropriate, use multiple code blocks.

## Part II - Data Preprocessing
Normalization: Logistic regression generally performs best with normalized or standardized features, so scale the data using a method of your choice (e.g., min-max scaling or standardization).

When you are done, split the data into a training set (80%) and a test set (20%). Given the class imbalance (roughly 91% negative class, 9% positive class), ensure both sets maintain this imbalance by setting `stratify=y` in the `train_test_split` function.
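
A minimal sketch of a stratified 80/20 split combined with standardization is shown below, on synthetic stand-in data. Note that it splits first and fits the scaler on the training rows only, which is a common variation on the order described above: it keeps test-set statistics out of the scaler. The `random_state` value is an arbitrary choice for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT HTRU2): 1000 rows, 8 features, 10% positives.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 8))
y = np.array([0] * 900 + [1] * 100)

# 80/20 split; stratify=y keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Standardize: learn mean and std from the training rows, apply to both parts.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("positive share (train, test):", y_train.mean(), y_test.mean())
print("train column means after scaling:", X_train_scaled.mean(axis=0).round(6))
```

Swapping `StandardScaler` for `MinMaxScaler` gives min-max scaling with the same `fit_transform` / `transform` pattern.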

In [None]:
## TODO: Your code here. If appropriate, use multiple code blocks.

## Part III - Logistic Regression Model
Train a logistic regression model on the training data. Use LogisticRegression from sklearn.linear_model. Logistic regression works well when the classes are linearly separable or approximately so.

After training the logistic regression model, print the model coefficients, mapping each one to its feature name for clarity, and interpret them. In logistic regression, each coefficient is the change in the log-odds of the positive class per unit increase in that feature, so on scaled features its magnitude gives a rough indication of the feature's weight (importance) in determining the outcome. A positive coefficient suggests that as the feature increases, the probability of the positive class (pulsar) increases, while a negative coefficient suggests the opposite.
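
One way to map coefficients to feature names is sketched below. The data is synthetic and already standardized, and the "ground truth" effect sizes are hypothetical values chosen so that the signs of the fitted coefficients are easy to read; none of this reflects the real HTRU2 results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

FEATURES = [
    "Profile_mean", "Profile_stdev", "Profile_skewness", "Profile_kurtosis",
    "DM_mean", "DM_stdev", "DM_skewness", "DM_kurtosis",
]

# Synthetic, already-standardized stand-in data (NOT HTRU2).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))
# Hypothetical ground truth: column 2 raises the log-odds, column 0 lowers them.
logits = -2.0 + 1.5 * X[:, 2] - 1.0 * X[:, 0]
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# coef_ has shape (1, n_features) for a binary problem; take row 0.
coefs = pd.Series(model.coef_[0], index=FEATURES).sort_values()
print(coefs)
print("intercept:", model.intercept_[0])

# exp(coefficient) is the multiplicative change in the odds per unit increase.
odds_ratios = np.exp(coefs)
print(odds_ratios)
```

Sorting the series puts the strongest negative and positive effects at the two ends, which makes the interpretation step easier to write up.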

In [None]:
## TODO: Your code here. If appropriate, use multiple code blocks.

## Part IV - Model Evaluation
After training the model, it’s essential to assess its performance. Given that the pulsar data is highly imbalanced, evaluation metrics like accuracy alone are not sufficient. Focus on metrics that consider the model’s performance on both classes, particularly the minority class (pulsars).

Use the test set to generate predictions.
- `yhat_pred` contains the predicted labels (0 or 1).
- `yhat_prob` contains the predicted probabilities, which can be used to adjust decision thresholds later if needed.
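
The two outputs can be produced as in the sketch below, which uses synthetic stand-in data and a simple slice instead of a stratified split. It also checks the relationship between them: for a binary model, `predict` agrees with thresholding the positive-class probability at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT HTRU2); the label depends mostly on column 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 1.0).astype(int)
X_train, X_test, y_train = X[:400], X[400:], y[:400]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

yhat_pred = model.predict(X_test)              # hard labels, 0 or 1
yhat_prob = model.predict_proba(X_test)[:, 1]  # P(class == 1) for each row

# predict() corresponds to a 0.5 cut-off on the positive-class probability.
same = np.array_equal(yhat_pred, (yhat_prob > 0.5).astype(int))
print("predict matches 0.5 threshold:", same)
```

Column 1 of `predict_proba` is the positive class because the columns follow the order in `model.classes_`, which is `[0, 1]` here.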

### Metrics
Calculate and interpret the following metrics to evaluate your logistic regression model:
- Accuracy
- Precision
- Recall
- F1-score
- Confusion Matrix
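
To see why accuracy alone can mislead on imbalanced data, the sketch below computes the listed metrics for two hand-made prediction vectors on 18 negatives and 2 positives. The `summarize` helper and both prediction vectors are hypothetical, chosen only so the numbers are easy to verify by hand.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score,
)

def summarize(y_true, y_pred):
    """Hypothetical helper: return the five metrics listed above."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # zero_division=0 avoids a warning when no positives are predicted.
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        # Rows are true classes, columns are predicted classes.
        "confusion": confusion_matrix(y_true, y_pred),
    }

y_true = [0] * 18 + [1] * 2

# A "classifier" that never predicts a pulsar.
always_negative = summarize(y_true, [0] * 20)

# A classifier with 17 TN, 1 FP, 1 TP, 1 FN.
some_model = summarize(y_true, [0] * 17 + [1] + [1, 0])

for name, result in [("always negative", always_negative), ("some model", some_model)]:
    print(name, {k: v for k, v in result.items() if k != "confusion"})
    print(result["confusion"])
```

Both vectors reach an accuracy of 0.9, but the first has a recall of 0 while the second has a recall of 0.5, which is the kind of difference accuracy hides.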

In [None]:
## TODO: Calculate the metrics and save them in the variables below.
## The matrix is stored as `conf_matrix` so it does not shadow sklearn's confusion_matrix function.

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")

### Imbalance Consideration
Given the imbalance in the dataset (many more spurious examples than real pulsars), achieving high precision or recall can be difficult. Pay special attention to the recall of the minority class (pulsars) as it indicates the ability to detect true pulsars.

As can be expected, missing an actual pulsar is more costly than misclassifying a non-pulsar as a pulsar. The default decision threshold is 0.5; what other threshold should we choose? Visualize the model's behavior with a ROC curve, then pick a different threshold that detects more pulsars without hurting the overall accuracy too much.
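
One possible way to pick a threshold from the ROC curve is sketched below. The scores are synthetic stand-ins rather than real model output, and the selection rule (take the highest threshold returned by `roc_curve` whose recall reaches a target of 0.95) is a hypothetical choice for illustration; other rules are equally defensible.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic stand-in scores (NOT model output): positives tend to score higher.
rng = np.random.default_rng(0)
y_test = np.array([0] * 900 + [1] * 100)
yhat_prob = np.concatenate([rng.beta(2, 8, size=900), rng.beta(6, 3, size=100)])

# thresholds are in decreasing order, so tpr and fpr are non-decreasing.
fpr, tpr, thresholds = roc_curve(y_test, yhat_prob)
auc = roc_auc_score(y_test, yhat_prob)

# Hypothetical rule: first (highest) returned threshold whose recall meets the target.
target_recall = 0.95
idx = int(np.argmax(tpr >= target_recall))
chosen = thresholds[idx]

# roc_curve counts a score as positive when score >= threshold.
yhat_custom = (yhat_prob >= chosen).astype(int)
recall_custom = yhat_custom[y_test == 1].mean()
accuracy_custom = (yhat_custom == y_test).mean()

print(f"AUC: {auc:.3f}")
print(f"chosen threshold: {chosen:.3f}")
print(f"recall: {recall_custom:.3f}, accuracy: {accuracy_custom:.3f}")
```

The curve itself can be drawn with matplotlib via `plt.plot(fpr, tpr)`, with the chosen operating point marked at `(fpr[idx], tpr[idx])`.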

In [None]:
## TODO: Your code here. If appropriate, use multiple code blocks.

## Part V - Handling Class Imbalance with Class Weights and Resampling
Imbalanced datasets present a significant challenge in classification. Without addressing the imbalance, the model may be biased toward the majority class, leading to poor performance in identifying the minority class (real pulsars). Two strategies to handle this are using class weights or resampling.

### Using Class Weights
Logistic regression has an option to handle class imbalance by assigning different weights to classes. By setting `class_weight='balanced'`, the model will assign higher weight to the minority class (pulsars), which can help in identifying more pulsars.
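
To make the effect of `class_weight='balanced'` concrete, the sketch below works out the weights from the class counts quoted in the Data Information section, using scikit-learn's formula `n_samples / (n_classes * count_per_class)`. In the lab the weights would be derived from the training split rather than the full data set, so the exact values will differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts quoted in the Data Information section.
n_negative, n_positive = 16259, 1639
y = np.array([0] * n_negative + [1] * n_positive)

# 'balanced' weight for class c: n_samples / (n_classes * count_c)
manual = len(y) / (2 * np.bincount(y))
from_sklearn = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)

print("manual weights:  ", manual.round(4))
print("sklearn weights: ", from_sklearn.round(4))

# After weighting, both classes carry the same total weight.
print("total weight per class:", (manual * np.bincount(y)).round(1))
```

Each pulsar therefore counts roughly ten times as much as each spurious candidate in the loss, which pushes the decision boundary toward higher recall on the positive class.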

Once the balanced model is trained, evaluate it using the same metrics (accuracy, precision, recall, F1-score, confusion matrix) to see if the performance on the minority class improves.

In [None]:
## TODO: Your code here. If appropriate, use multiple code blocks.

### Using Resampling
Alternatively, you can resample the training data to either oversample the minority class or undersample the majority class. For this lab, just use oversampling. You may use the `RandomOverSampler` from the `imblearn` library to oversample your data before training.
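
To illustrate what random oversampling does, the sketch below duplicates randomly chosen minority rows with plain NumPy until both classes have the same size. It is a hand-rolled stand-in for the idea, not `RandomOverSampler` itself; the `oversample_minority` helper is hypothetical and assumes integer labels 0 and 1.

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Hypothetical helper: duplicate random minority rows until classes are equal."""
    rng = np.random.default_rng(seed)
    counts = np.bincount(y)
    minority = int(np.argmin(counts))
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    # Sample with replacement, since more copies are needed than rows exist.
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]

# Synthetic stand-in training data (NOT HTRU2): 180 negatives, 20 positives.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))
y_train = np.array([0] * 180 + [1] * 20)

X_res, y_res = oversample_minority(X_train, y_train)
print("before:", np.bincount(y_train))
print("after: ", np.bincount(y_res))
```

With `imblearn`, the corresponding call would be `RandomOverSampler(random_state=0).fit_resample(X_train, y_train)`. In both cases only the training set is resampled; the test set keeps its original class balance so the evaluation stays realistic.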

Once the resampled model is trained, evaluate it using the same metrics and compare with the original and class-weighted models.

In [None]:
## TODO: Your code here. If appropriate, use multiple code blocks.

### Comparison and Discussion
Compare the results of the three models (baseline, class-weighted, and resampled).
- Which model performs best in terms of recall for the pulsar class?
- Does resampling or using class weights improve the detection of pulsars?


## Acknowledgements
- R. J. Lyon, B. W. Stappers, S. Cooper, J. M. Brooke, J. D. Knowles, Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society 459 (1), 1104-1123, DOI: 10.1093/mnras/stw656
- R. J. Lyon, HTRU2, DOI: 10.6084/m9.figshare.3080389.v1.
- https://archive.ics.uci.edu/dataset/372/htru2