# Module 12 Report Template

## Overview of the Analysis
The purpose of this analysis is to build and evaluate a machine learning model that can predict loan risk. Specifically, the model is trained to classify loans as either healthy (low risk of default) or high-risk. The financial information used includes loan size, interest rate, borrower income, debt-to-income ratio, number of accounts, derogatory marks, and total debt. Accurately identifying high-risk loans is crucial for mitigating potential financial losses and informing lending decisions. We used Logistic Regression, initially without scaling, and then with data scaling using StandardScaler, as part of our modeling approach.

## Explain the purpose of the analysis.
*   **Detailed Explanation:** The core purpose of the analysis is to develop a predictive model that can accurately classify loans as either "healthy" (low risk of default) or "high-risk" (significant risk of default). This classification allows the lending institution to make more informed decisions about loan approvals, interest rates, and risk management strategies. The model aims to minimize financial losses by identifying potentially defaulting loans before they are issued, thus enabling proactive risk mitigation. The purpose also extended to *evaluating the impact of scaling* on the classification process.

## Explain what financial information the data was on, and what you needed to predict.
*   **Financial Information:** The data used in this analysis consisted of several key financial variables related to the loan applicant and the loan itself. These included:
    *   `loan_size`: The principal amount of the loan.
    *   `interest_rate`: The interest rate applied to the loan.
    *   `borrower_income`: The annual income of the loan applicant.
    *   `debt_to_income`: The borrower's debt-to-income ratio (total debt divided by annual income).
    *   `num_of_accounts`: The number of credit accounts held by the borrower.
    *   `derogatory_marks`: The presence of any negative credit history markers.
    *   `total_debt`: The total amount of debt held by the borrower.
*   **Prediction Target:** The primary goal was to predict the `loan_status`, a binary variable indicating whether a loan is "healthy" (0) or "high-risk" (1). This is a classification problem, where the model assigns each loan to one of these two categories.

## Provide basic information about the variables you were trying to predict (e.g., `value_counts`).
*   **`loan_status` Variable Analysis:** The variable we were trying to predict was `loan_status`.  To get a sense of how the target variable is distributed within the dataset, we can use `value_counts()` function provided in the *Pandas* library.  The `value_counts()` provides counts for each of the distinct value in the column. After running the `value_counts()` function, you would find that the majority of loans in the dataset are classified as "healthy" (0), which is typical for lending data. The number of "high-risk" (1) loans is significantly smaller.  (It is important to note that the details of `value_counts()` is something that we would need to find through code, that is why I make reference to it) This distribution means that this is an *imbalanced dataset* and affects how to interpret accuracy and the significance of correctly predicting the minority class.
*   **Implication of Imbalance:** The class imbalance means that high overall accuracy might be misleading if the model simply predicts almost everything as "healthy." This is why precision and recall for the "high-risk" class are particularly important metrics.


## Describe the stages of the machine learning process you went through as part of this analysis.
*   **Stages of the Analysis:**
    1.  **Data Loading and Exploration:** Loading the `lending_data.csv` dataset into a Pandas DataFrame and using `.head()` (and other functions like `.info()` and `.describe()`) to preview and understand the structure and content of the data.
    2.  **Feature and Target Variable Separation:** Separating the dataset into features (independent variables, `X`) and the target variable (`loan_status`, `y`).  Features were all the loan financial data, the target was the risk status.
    3.  **Data Splitting:** Splitting the data into training and testing sets using `train_test_split` to evaluate the model's performance on unseen data. `random_state=1` to guarantee reproducible results.
    4.  **Data Scaling:** Applying the `StandardScaler` to scale the numerical features in both training and testing sets. This is crucial to prevent features with larger ranges from dominating the learning process and to improve convergence of the Logistic Regression model. (The code performed scaling but then did not perform a comparison until now).
    5.  **Model Training (Logistic Regression):** Instantiating a `LogisticRegression` model and training it using the scaled training data (`X_train_scaled` and `y_train`).
    6.  **Model Evaluation (Initial Model):** This evaluation followed from the unscaled data and the code was ran to produce a `confusion_matrix()` and `classification_report()` that resulted in an Accuracy 0.99, Percision for Healty 1.00, Recall for Healthy 0.99, Percision for High-Risk 0.84 and Recall for High-Risk 0.94
    7.  **Model Prediction:** Using the trained model to predict `loan_status` labels for the scaled testing data (`X_test_scaled`).
    8.  **Model Evaluation (Scaled Data):** Generating a confusion matrix and classification report to evaluate the model's performance on the scaled data. This step allows us to assess the accuracy, precision, recall, and F1-score for both "healthy" and "high-risk" loan classifications.
    9.  **Comparison of Results:** Comparing the evaluation metrics obtained with and without data scaling to assess the impact of this preprocessing step.
    10. **Model Recommendations:** Providing recommendations for model usage and future steps based on the evaluation results.


## Briefly touch on any methods you used (e.g., `LogisticRegression`, or any other algorithms).
*   **Methods and Algoritms Used:**
    1.   **Logistic Regression:** This is a linear model used for binary classification problems. It models the probability of a binary outcome (in this case, `loan_status`) based on a linear combination of the input features. Logistic Regression is relatively simple to implement and interpret, making it a good starting point for classification tasks. The logistic model will output how each of the features in the model effect the target variable.
    2.   **StandardScaler:** The *scikit-learn* `StandardScaler` was used to standardize the features. (Standardizing the features is a method of data preparation) This involves transforming the features so that they have a mean of 0 and a standard deviation of 1. Data scaling algorithms such as `StandardScaler` is employed as features, and therefore have different ranges of values. This helps the model to prevent a high value from dominating the model, as all numbers will be within a similar range of values.
    3.   **Train-Test Split:** The scikit-learn train_test_split module was used to split the data. This ensures that the dataset is split into two portions of training and testing data. (This is a method to protect against overfitting of the data) A split ensures that some of the data is withheld from the dataset from being learned. The dataset that is withheld is used to evaluate the trained model.
    4.   **Evaluation Metrics (Precision, Recall, Accuracy):** Evaluation metrics are used as a method to provide quantitative assessment of the efficacy of the model. `precision` and `recall` are used to address the issue of imbalanced datasets. `accuracy` is another measure and is used in conjuction with `precision` and `recall`.
    5.   **Confusion Matrix:** The confusion matrix is used in models to better understand where the model is struggling. This matrix is displayed between the training and testing set, and the expected target variable in the training set, along with the value predicted by the model. The values are a measurement of whether the model accurately classified the data.


## Results

Using bulleted lists, describe the accuracy scores and the precision and recall scores of all machine learning models.
*   **Machine Learning Model 1: Logistic Regression (Without Data Scaling):**
    *   **Accuracy:** 0.99
        *   The model is correct in 99% of its classifications.
    *   **Precision (Healthy Loans):** 1.00
        *   When the model predicts a loan is healthy, it is correct 100% of the time.
    *   **Recall (Healthy Loans):** 0.99
        *   The model identifies 99% of all actual healthy loans in the dataset.
    *   **Precision (High-Risk Loans):** 0.84
        *   When the model predicts a loan is high-risk, it is correct 84% of the time.
    *   **Recall (High-Risk Loans):** 0.94
        *   The model identifies 94% of all actual high-risk loans in the dataset.
*   **Machine Learning Model 2: Logistic Regression (With Data Scaling):**
    *   **Accuracy:** 0.99
        *   The model is correct in 99% of its classifications.
    *   **Precision (Healthy Loans):** 1.00
        *   When the model predicts a loan is healthy, it is correct 100% of the time.
    *   **Recall (Healthy Loans):** 0.99
        *   The model identifies 99% of all actual healthy loans in the dataset.
    *   **Precision (High-Risk Loans):** 0.84
        *   When the model predicts a loan is high-risk, it is correct 84% of the time.
    *   **Recall (High-Risk Loans):** 0.99
        *   The model identifies 99% of all actual high-risk loans in the dataset.

## Summary

Summarize the results of the machine learning models, and include a recommendation on the model to use, if any. For example:

The Logistic Regression model, both before and after data scaling, was used to classify loans. Before scaling, the model exhibited high accuracy and perfect precision and recall for healthy loans. However, the precision and recall for high-risk loans were lower with 0.84 and 0.94 respectively, suggesting some difficulty in accurately identifying all high-risk loans and some misclassification of healthy loans as high-risk.

After scaling the data using StandardScaler, there was a small improvement to the model's ability to classify the loan. The Logistic Regression model demonstrates strong performance in predicting loan risk. The high accuracy (0.99) indicates that the model is generally correct in its classifications. The model excels at identifying healthy loans, achieving nearly perfect precision and recall. While the precision for high-risk loans is lower (0.84), the recall is very high (0.99), meaning the model effectively identifies nearly all high-risk loans.

Both the Logistic Regression model without data scaling and the Logistic Regression model with data scaling (using StandardScaler) achieve excellent results in predicting loan risk. Both models reach a high accuracy of 99%. When predicting healthy loans, both models reach a Precision and Recall near 100%, thereby validating their ability to classify healthy loans. However, when classifying high-risk loans, there are differences. The model without data scaling exhibits a Precision of 84% and a Recall of 94%. Conversely, the scaled model achieves the same Precision of 84%, but improves the Recall to 99%.  This improvement is likely because the scaling process addresses issues with features being on different scales, allowing the optimization algorithm to converge more effectively. There was no significant change to the overall accuracy of the model or the Recall and Percision of Healthy Loans, however this is not a critical measure. The improved High-Risk Recall indicates a stronger ability to identify high-risk loans.


* **Which one seems to perform best? How do you know it performs best?**
The model that performs best is the Logistic Regression model **with data scaling (StandardScaler)**.
    *   **How do we know it performs best?** While both models exhibit the same overall accuracy, we use Recall (0.99) on the high-risk loans class as the reason for a performance determination. The other measures of the models were close enough that the 5% change in the amount of High-Risk loans detected is the significant measurement used in the analysis.
  
* **Does performance depend on the problem we are trying to solve? (For example, is it more important to predict the `1`'s, or predict the `0`'s? )**
Yes, performance heavily depends on the specific problem of predicting loan risk.

    *   It is **more important to predict the 1's (high-risk loans)**. In the context of loan risk assessment, the cost of *failing* to identify a high-risk loan (a false negative) is significantly higher than the cost of *incorrectly classifying* a healthy loan as high-risk (a false positive). Missing a high-risk loan can lead to substantial financial losses for the lending institution. Therefore, maximizing the `Recall` for the high-risk class is paramount, even if it comes at the expense of slightly lower precision.
 

## Recommendation:

I recommend the scaled model for use by the company with caution. Its extremely high recall for high-risk loans (0.99) is critical from a business perspective, as it minimizes the risk of missing potentially defaulting loans, and maintains a high degree of Precision (0.84) compared to the unscaled model. The trade-off is that there are some healthy loans misclassified as high-risk, the cost of which will depend on the details of the company's loan decisions. The cost of misclassifying a high-risk loan greatly outweighs the cost of missing the classification. I recommend continuous monitoring and refinement of the model to further improve its precision for high-risk loans and minimize the misclassification of healthy loans. Due to the scaling resulting in an improvement in the High-Risk Recall, I recommend only the scaled model for use by the company.
