# Logistic Regression

Install the required libraries

In [40]:
!pip install pandas scikit-learn matplotlib seaborn

In [41]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns

In [50]:

df = pd.read_csv('bank-additional-full.csv', delimiter=';')

# Data Preprocessing

# Select a subset of client-level features for the first iteration
selected_columns = ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan']
features = df[selected_columns].copy()  # Copy so later changes do not touch df

target = df['y']  # Target column
# Handling missing values
# Backward-fill any NaN entries. Note that this dataset marks missing categorical
# values with the string 'unknown', which bfill() leaves unchanged.
features = features.bfill()
# Show the first few rows of the processed data
features.head()
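
Before relying on a fill strategy, it can help to count how many values are actually missing. The sketch below uses a small hypothetical frame (`toy`, not the bank data) and assumes the placeholder string is `'unknown'`; it counts both real NaN entries and placeholder strings per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the selected bank features
toy = pd.DataFrame({
    "age": [30, 45, np.nan, 52],
    "job": ["admin.", "unknown", "technician", "services"],
    "default": ["no", "unknown", "no", "unknown"],
})

# True missing values (NaN) per column
nan_counts = toy.isna().sum()

# Placeholder strings per column; 'unknown' is the placeholder assumed here
unknown_counts = (toy == "unknown").sum()

print(nan_counts)
print(unknown_counts)
```

If the placeholder counts are large, treating `'unknown'` as its own category may be a more honest choice than filling it in.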


# Data Preprocessing


1. **Age (Numeric)**: 
   - **Relevance**: Age is a crucial factor in financial planning. Younger clients may be more risk-tolerant and interested in growth-oriented investments, while older clients might prefer safer, income-generating options. Understanding the age distribution of clients can help tailor financial advice and product offerings.

2. **Job (Categorical)**: 
   - **Relevance**: A client's occupation can significantly influence their income level, financial goals, and risk appetite. For example, entrepreneurs might have variable income and a higher risk tolerance, whereas government employees might have stable income and prefer secure investments. This data can be used to develop targeted financial products and advice.

3. **Marital Status (Categorical)**:
   - **Relevance**: Marital status can impact financial responsibilities and goals. Married clients might be more interested in joint investments or long-term financial planning for family needs, while single clients may have different priorities. This information can help in understanding the client's financial commitments and planning needs.

4. **Education (Categorical)**:
   - **Relevance**: Education level can correlate with financial literacy, income potential, and investment preferences. Highly educated clients might be more inclined towards sophisticated investment options, while others might prefer simple, straightforward products. Tailoring communication and advice based on education levels can improve client engagement and satisfaction.

5. **Default on Credit (Categorical)**:
   - **Relevance**: A history of defaulting on credit can be a critical indicator of a client's financial health and creditworthiness. Clients with a history of default might require more cautious financial planning and might not be suitable for certain types of credit or investment products.

6. **Housing Loan (Categorical)**:
   - **Relevance**: Whether a client has a housing loan can inform their financial liabilities and risk profile. Clients with significant housing loans might have less disposable income for investments and might prefer safer, liquid assets.

7. **Personal Loan (Categorical)**:
   - **Relevance**: Similar to housing loans, personal loans can affect a client's financial flexibility. A high level of personal debt might indicate a need for debt management advice and could affect the suitability of certain investment products.


In [53]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
# Separate numeric columns from categorical columns
categorical_columns = ['job', 'marital', 'education', 'default', 'housing', 'loan']
numerical_columns_train = X_train.columns.difference(categorical_columns)
numerical_columns_test = X_test.columns.difference(categorical_columns)

# Apply label encoding to the categorical columns.
# Each column gets its own encoder, fitted on the full column so that the
# training and test sets share one category-to-integer mapping.
X_train, X_test = X_train.copy(), X_test.copy()
for col in categorical_columns:
    label_encoder = LabelEncoder().fit(features[col])
    X_train[col] = label_encoder.transform(X_train[col])
    X_test[col] = label_encoder.transform(X_test[col])

# Standardize all features (important for distance-based models like KNN).
# A single pass covers both the numeric and the label-encoded columns.
# The scaler is fitted on the training set only and then applied to the test
# set, so no test-set statistics leak into preprocessing.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Model Training
# Logistic Regression
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
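
Label encoding assigns arbitrary integers to nominal categories such as `job` or `marital`, which a linear model then treats as ordered. One possible alternative, sketched here on hypothetical toy data (the frame `toy`, the labels `toy_target`, and the step names are all illustrative), is to one-hot encode the categorical columns inside a scikit-learn `Pipeline`, so that every preprocessing step is fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one numeric and two categorical columns
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "marital": rng.choice(["single", "married", "divorced"], size=n),
    "housing": rng.choice(["yes", "no", "unknown"], size=n),
})
toy_target = rng.choice(["no", "yes"], size=n, p=[0.8, 0.2])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_target, test_size=0.2, random_state=42
)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    # handle_unknown="ignore" encodes categories unseen during fit as all zeros
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["marital", "housing"]),
])

pipe = Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])
pipe.fit(X_tr, y_tr)  # scaler and encoder are fitted on the training split only
toy_predictions = pipe.predict(X_te)
```

The toy labels are random, so the predictions themselves carry no meaning; the point is the structure of the preprocessing.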

In [56]:

# K-Nearest Neighbors
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)

# Model Evaluation
# Evaluate Logistic Regression
lr_predictions = lr_model.predict(X_test)
lr_accuracy = accuracy_score(y_test, lr_predictions)
lr_confusion_matrix = confusion_matrix(y_test, lr_predictions)

# Evaluate K-Nearest Neighbors
knn_predictions = knn_model.predict(X_test)
knn_accuracy = accuracy_score(y_test, knn_predictions)
knn_confusion_matrix = confusion_matrix(y_test, knn_predictions)

# Visualization (Confusion Matrix for Logistic Regression)
sns.heatmap(lr_confusion_matrix, annot=True, fmt='d')  # fmt='d' prints counts as plain integers
plt.title('Logistic Regression Confusion Matrix')
plt.ylabel('Actual labels')
plt.xlabel('Predicted labels')
plt.show()

# Visualization (Confusion Matrix for K-Nearest Neighbors)
sns.heatmap(knn_confusion_matrix, annot=True, fmt='d')  # fmt='d' prints counts as plain integers
plt.title('K-Nearest Neighbors Confusion Matrix')
plt.ylabel('Actual labels')
plt.xlabel('Predicted labels')
plt.show()

# Output accuracy scores
print(f"Logistic Regression Accuracy: {lr_accuracy}")
print(f"K-Nearest Neighbors Accuracy: {knn_accuracy}")
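
Accuracy alone hides how a model treats each class. The notebook imports `classification_report` but never calls it; the sketch below shows what it reports, using hypothetical labels (`y_true`, `y_pred`) rather than the models' real predictions.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: 8 negatives and 2 positives, with one positive missed
y_true = ["no"] * 8 + ["yes"] * 2
y_pred = ["no"] * 8 + ["yes", "no"]

# Per-class precision, recall and F1; zero_division=0 avoids warnings when a
# class is never predicted
print(classification_report(y_true, y_pred, zero_division=0))

# The same numbers as a dictionary, convenient for further processing
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
```

In the notebook itself, the equivalent call would take `y_test` and `lr_predictions` or `knn_predictions`.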



# Model Evaluation
### K-Nearest Neighbors Confusion Matrix:

- **True Negative (TN)**: The top-left square (beige) shows the number of negatives correctly classified as negative (class 0, no subscription). Around 7,100 instances were correctly predicted as not subscribing to a term deposit.
- **False Positive (FP)**: The top-right square shows the number of negatives incorrectly classified as positive (class 1, subscription). About 160 instances were incorrectly predicted as subscribing to a term deposit.
- **False Negative (FN)**: The bottom-left square (dark purple) shows the number of positives incorrectly classified as negative. Around 880 instances that did subscribe were predicted as not subscribing.
- **True Positive (TP)**: The bottom-right square (dark purple) shows the number of positives correctly classified. Only 59 instances were correctly identified as subscriptions.
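
To make these four counts easier to interpret, the following sketch derives accuracy, precision, and recall from them. The counts are the approximate values read from the matrix above, so the results are rough estimates rather than exact scores.

```python
# Approximate counts read from the KNN confusion matrix described above
tn, fp, fn, tp = 7100, 160, 880, 59

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # share of predicted subscribers who really subscribed
recall = tp / (tp + fn)     # share of actual subscribers the model found

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```

With these rounded counts, accuracy comes out at about 0.87 while recall is only about 0.06, which shows how little of the positive class KNN actually captures.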

### Logistic Regression Confusion Matrix:

- **True Negative (TN)**: The top-left square shows around 7,300 true negatives, slightly more than KNN because every negative instance is predicted as negative.
- **False Positive (FP)**: The top-right square shows no false positives; the model did not incorrectly predict any subscriptions.
- **False Negative (FN)**: The bottom-left square shows around 940 false negatives; the model missed all of these subscriptions and predicted them as non-subscriptions.
- **True Positive (TP)**: The bottom-right square shows no true positives; the model did not correctly predict a single subscription.

### Observations:

- Both models produce a high number of false negatives. Logistic Regression is the extreme case: it failed to identify any of the positive cases.
- The KNN model identified some true positives, but only 59 of roughly 940 actual subscriptions.
- The classes are imbalanced: roughly 940 of about 8,240 test instances (around 11%) are subscriptions, so a model can reach high accuracy simply by predicting "no subscription" for everyone.
- Logistic Regression predicts that no one will subscribe, which means no predicted subscription probability exceeds the default 0.5 threshold. This suggests that the model is not well-suited to this dataset without further processing or feature engineering.
- The KNN model is still heavily biased towards predicting non-subscriptions, but it at least identified some true subscriptions.
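
One way to put these observations in context is to compare against a majority-class baseline. The sketch below is illustrative only: it builds hypothetical labels from the approximate class counts reported above and uses scikit-learn's `DummyClassifier`, which ignores the features entirely.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical test labels using the approximate class counts reported above
y_true = np.array(["no"] * 7300 + ["yes"] * 940)
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_true)
baseline_pred = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_true, baseline_pred)
baseline_recall = recall_score(y_true, baseline_pred, pos_label="yes")

print(f"baseline accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.3f}")
```

A baseline that never predicts a subscription already reaches about 0.89 accuracy on these counts, so accuracy near that level says little about whether a model has learned anything.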

### Recommendations:

- Address the imbalance in the dataset. If the number of non-subscriptions is much higher than the number of subscriptions, consider techniques such as oversampling, undersampling, class weighting, or algorithms that handle imbalance well.
- Perform feature engineering to improve the models' ability to distinguish between the two classes.
- Adjust the decision threshold of the models to see whether that reduces the number of false negatives.
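
Two of these recommendations can be sketched in a few lines. The example below runs on synthetic imbalanced data from `make_classification` (an assumption, not the bank dataset), and the 0.2 threshold is an arbitrary illustrative value. It compares class weighting with a lowered decision threshold for logistic regression.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: roughly 90% negatives, 10% positives
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=3,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Option 1: weight classes inversely to their frequency during training
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
recall_weighted = recall_score(y_te, weighted.predict(X_te))

# Option 2: keep the default model but lower the decision threshold
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_pos = plain.predict_proba(X_te)[:, 1]  # probability of the positive class
recall_default = recall_score(y_te, (proba_pos >= 0.5).astype(int))
recall_lowered = recall_score(y_te, (proba_pos >= 0.2).astype(int))

print(recall_weighted, recall_default, recall_lowered)
```

Lowering the threshold can only keep or raise recall, but it usually costs precision, so the threshold is best chosen on a validation set with both metrics in view.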