<a href="https://www.kaggle.com/code/udaykotiya/personal-loan-prediction-knn?scriptVersionId=202712287" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Project Title: Personal Loan Prediction

# Problem Statement

**Objective:**

This project predicts whether a customer will accept a personal loan. The dataset describes each customer's age, experience, income, family size, education, and more. Based on these details, we build a model that predicts whether a customer is likely to take a personal loan.



**Importance:**

This problem is important for businesses, especially banks and financial institutions. Predicting whether a customer will take a loan helps them understand customer behavior better and make smarter decisions. For example, it can help reduce the risk of offering loans to customers who may not need them or are less likely to repay. It also helps in targeting the right customers for loan offers, improving the efficiency of marketing and loan approval processes.

# Dataset Explanation

**Overview:**



The dataset used in this project contains information about customers, including their demographic details and financial status. The goal of using this dataset is to predict whether a customer will accept a personal loan based on these details.



**Features:**



1. ID: A unique number assigned to each customer to identify them.

2. Age: The age of the customer.

3. Experience: The number of years the customer has worked in their profession.

4. Income: The customer’s annual income in thousands of dollars.

5. ZIP Code: The postal code of the area where the customer lives.

6. Family: The number of family members the customer has.

7. CCAvg: The customer’s average monthly spending on credit cards, in thousands of dollars.

8. Education: The customer’s education level, which is coded as 1 (Undergraduate), 2 (Graduate), or 3 (Postgraduate).

9. Mortgage: The amount of mortgage the customer has, if any.

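10. Personal Loan: This is the target variable, where 0 means the customer did not accept the loan, and 1 means the customer accepted the loan.

11. Securities Account: Whether the customer holds a securities account (1) or not (0).

12. CD Account: Whether the customer holds a certificate of deposit (CD) account (1) or not (0).

13. Online: Whether the customer uses online banking (1) or not (0).

14. CreditCard: Whether the customer uses a credit card issued by the bank (1) or not (0).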

# Methodology

**1. Data Preprocessing:**



* Handling Missing Values: We first checked for any missing values in the dataset using isnull().sum(). Luckily, there were no missing values, so no further imputation or removal of rows was needed.

* Feature Scaling: The numerical features (such as age, income, and mortgage) sit on very different ranges, so we scaled them. We used StandardScaler to standardize the data, which helps distance-based models like K-Nearest Neighbors (KNN) perform better by ensuring that large values do not dominate smaller ones.

* Encoding Categorical Data: The categorical feature "Education" was already represented as numerical values (1: Undergrad, 2: Graduate, 3: Postgrad), so no further encoding was needed.
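
To see why scaling matters for a distance-based model such as KNN, the sketch below standardizes two toy features by hand and compares distances before and after. The customer values are made up for illustration, and `standardize` and `euclidean` are hypothetical helpers; `standardize` mirrors the z-score formula that StandardScaler applies (population standard deviation).

```python
import math
import statistics

# Toy customers (made-up values): (age in years, mortgage in thousands)
customers = [(25, 0), (45, 150), (35, 300), (55, 100)]

def standardize(column):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

ages, mortgages = zip(*customers)
scaled = list(zip(standardize(ages), standardize(mortgages)))

# Before scaling, the mortgage gap (150) swamps the age gap (20).
raw_distance = euclidean(customers[0], customers[1])
# After scaling, both features contribute on a comparable footing.
scaled_distance = euclidean(scaled[0], scaled[1])

print(f"raw distance:    {raw_distance:.2f}")
print(f"scaled distance: {scaled_distance:.2f}")
```

In the raw data the distance is almost entirely the mortgage difference; after standardization neither feature dominates simply because of its units.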



**2. Feature Selection:**



We selected the following features as predictors: Age, Family, Education, Securities Account, CD Account, Online, CreditCard, Income, CCAvg, Mortgage, and ZIP Code. We excluded Experience to avoid redundancy with Age, and ID because it only identifies the customer. The selected features were chosen because they are likely to influence a customer's decision to accept a personal loan.



**3. Modeling:**



* K-Nearest Neighbors (KNN): We used the KNN algorithm to predict whether a customer would take a personal loan. KNN classifies based on the similarity of new data points to the closest existing data points. We set n_neighbors=3 and trained the model using the scaled training data.

* Logistic Regression: We also applied Logistic Regression, which is a popular algorithm for binary classification problems. The logistic regression model was fitted on the scaled data, and we evaluated its performance in predicting loan acceptance.
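
The KNN idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not scikit-learn's implementation: `knn_predict` is a hypothetical helper, the points are made-up and assumed to be already scaled, and the distance is Euclidean, which is also the default metric of `KNeighborsClassifier`.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up, already-scaled feature vectors: (income, CCAvg)
train_points = [(-1.0, -0.8), (-0.9, -1.1), (-0.7, -0.5), (1.2, 1.0), (1.4, 0.9)]
train_labels = [0, 0, 0, 1, 1]  # 1 = accepted the loan

high = knn_predict(train_points, train_labels, query=(1.1, 1.1))    # high income and spend
low = knn_predict(train_points, train_labels, query=(-0.8, -0.9))   # low income and spend
print(high, low)
```

With `k=3`, a single mislabeled neighbor can be outvoted, which is one reason an odd `k` is a common choice for binary problems.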





**The models were evaluated based on their accuracy, confusion matrix, and classification reports to understand how well they performed on both the training and test data. Additionally, we plotted the distributions of key features (Income, Credit Card Average, and Mortgage) to explore the data visually.**

# Step 1: Library Imports

* **pandas as pd:** Used for data manipulation and analysis, especially for loading and working with datasets in tabular form.

* **numpy as np:** A fundamental library for numerical computations in Python. It's often used for handling arrays and performing mathematical operations.

* **train_test_split from sklearn.model_selection:** This function is used to split your dataset into training and testing sets, an essential part of building and evaluating machine learning models.

* **StandardScaler from sklearn.preprocessing:** Used to standardize or scale numerical data so that all features have a similar range, which improves the performance of certain algorithms (like KNN and logistic regression).

* **KNeighborsClassifier from sklearn.neighbors:** This is the K-Nearest Neighbors classifier, an algorithm that will be used to train a model for predicting whether a customer will take a personal loan.

* **LogisticRegression from sklearn.linear_model:** A machine learning model used for binary classification tasks. In this case, it will help predict whether a customer will take a loan or not.

* **accuracy_score, confusion_matrix, classification_report from sklearn.metrics:** These are metrics that will be used to evaluate the performance of the trained models by measuring their accuracy and providing other performance indicators.

In [None]:
import pandas as pd

import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from sklearn.neighbors import KNeighborsClassifier

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

import seaborn as sns

import matplotlib.pyplot as plt

# Step 2: Data Loading

This code loads the dataset from a CSV file into a pandas DataFrame and then prints the first 5 rows to preview the data. 

In [None]:
tbankdf = pd.read_csv('/kaggle/input/personal-loan-modeling/Bank_Personal_Loan_Modelling.csv')

print(tbankdf.head())

# Pre Data Visualization

**1. Visualize the distribution of Income**

In [None]:
plt.figure(figsize=(12, 6))

sns.histplot(tbankdf['Income'], bins=30, kde=True, color='blue')

plt.title('Income Distribution')

plt.xlabel('Income (in thousands)')

plt.ylabel('Frequency')

plt.axvline(tbankdf['Income'].mean(), color='red', linestyle='dashed', linewidth=1, label='Mean Income')

plt.axvline(tbankdf['Income'].median(), color='yellow', linestyle='dashed', linewidth=1, label='Median Income')

plt.legend()

plt.show()

**2. Visualize the distribution of Credit Card Average Spend**

In [None]:
plt.figure(figsize=(12, 6))

sns.histplot(tbankdf['CCAvg'], bins=30, kde=True, color='green')

plt.title('Credit Card Average Spend Distribution')

plt.xlabel('Average Credit Card Spend (in thousands)')

plt.ylabel('Frequency')

plt.axvline(tbankdf['CCAvg'].mean(), color='red', linestyle='dashed', linewidth=1, label='Mean CCAvg')

plt.axvline(tbankdf['CCAvg'].median(), color='yellow', linestyle='dashed', linewidth=1, label='Median CCAvg')

plt.legend()

plt.show()

**3. Visualize the distribution of Mortgage**

In [None]:
plt.figure(figsize=(12, 6))

sns.histplot(tbankdf['Mortgage'], bins=30, kde=True, color='purple')

plt.title('Mortgage Distribution')

plt.xlabel('Mortgage Amount (in thousands)')

plt.ylabel('Frequency')

plt.axvline(tbankdf['Mortgage'].mean(), color='red', linestyle='dashed', linewidth=1, label='Mean Mortgage')

plt.axvline(tbankdf['Mortgage'].median(), color='yellow', linestyle='dashed', linewidth=1, label='Median Mortgage')

plt.legend()

plt.show()

# Step 3: Data Preprocessing

This code checks for any missing values in each column of the DataFrame and prints the count of missing values for each column. It helps in identifying data quality issues during the Data Preprocessing phase.

In [None]:
# Check for missing values

print("\nMissing values in each column:")

print(tbankdf.isnull().sum())

# Step 4: Feature Selection

This code defines the features (predictor variables) and the target variable for the machine learning model. The features selected are all the predictor columns except "Experience" and the "ID" identifier, while the target variable is "Personal Loan," which indicates whether a customer accepted the loan.



* x: This variable contains the predictor features used to make predictions.

* y: This variable represents the target outcome we want to predict.

In [None]:
# Define features and target variable

# Considering all predictors except the Experience feature and the ID column

x = tbankdf[['Age', 'Family', 'Education', 'Securities Account', 

              'CD Account', 'Online', 'CreditCard', 'Income', 

              'CCAvg', 'Mortgage', 'ZIP Code']]

y = tbankdf['Personal Loan']

# Step 5: Data Splitting

This code splits the dataset into training and testing sets.



**1. train_test_split(): This function randomly divides the features (x) and target variable (y) into two parts: a training set (used to train the model) and a test set (used to evaluate the model's performance).**

* **test_size=0.30: This means 30% of the data will be used for testing, while the remaining 70% will be used for training.**

* **random_state=1: This sets a seed for random number generation to ensure that the split is reproducible every time you run the code.**

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=1)
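
The effect of `test_size` and `random_state` can be illustrated with a small stand-alone sketch. `seeded_split` is a hypothetical helper written only for illustration; it is not how `train_test_split` is implemented internally, but it shows the same two ideas: a fixed seed makes the shuffle repeatable, and the test share is cut from the shuffled rows.

```python
import random

def seeded_split(rows, test_size=0.30, seed=1):
    """Shuffle row indices with a fixed seed, then cut off the test share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))  # stand-in for 10 customer records
train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)

print(len(train_a), len(test_a))  # sizes of the two parts
print(test_a == test_b)           # same seed, same split
```

Changing `seed` gives a different but equally valid split, which is why reported accuracies can shift slightly between seeds.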

# Step 6: Feature Scaling

This code performs feature scaling, which standardizes the numerical features to have a mean of 0 and a standard deviation of 1.



* **StandardScaler(): This initializes the scaler, which will be used to standardize the data.**

* **scaler.fit_transform(x_train): This fits the scaler to the training data and transforms it, scaling the training features.**

* **scaler.transform(x_test): This applies the same scaling transformation to the test data, ensuring that both training and testing datasets are on the same scale.**

In [None]:
# Feature Scaling

scaler = StandardScaler()

x_train_scaled = scaler.fit_transform(x_train)

x_test_scaled = scaler.transform(x_test)
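
The split between "fit" and "transform" can be made concrete with a hand-rolled sketch. The income values are made up, and `transform` here is a hypothetical stand-in for `scaler.transform`; the point is only that the mean and standard deviation come from the training data and are then reused unchanged for any later data.

```python
import statistics

train_income = [20, 40, 60, 80]   # made-up training values
test_income = [50, 100]           # made-up test values

# "fit": learn the statistics from the training data only
mean = statistics.fmean(train_income)
std = statistics.pstdev(train_income)

# "transform": reuse those training statistics for any later data
def transform(values):
    return [(v - mean) / std for v in values]

train_scaled = transform(train_income)
test_scaled = transform(test_income)

print(train_scaled)
print(test_scaled)
```

Fitting a second scaler on the test data (or on new data at prediction time) would put the same raw value at a different scaled position, so the model would see inputs on a scale it was not trained on.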

# Step 7: Model Training 

# 1. Model KNN

This code implements the K-Nearest Neighbors (KNN) algorithm to train a model.



* **KNeighborsClassifier(n_neighbors=3): This initializes the KNN classifier with 3 neighbors. This means that when making a prediction, the algorithm will look at the 3 closest data points in the training set to determine the class of a new data point.**

* **knn.fit(x_train_scaled, y_train): This trains the KNN model using the scaled training data (x_train_scaled) and the corresponding target variable (y_train). The model learns the relationship between the features and the target variable during this step.**

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)

# Fit the KNN model

knn.fit(x_train_scaled, y_train)

# Model Prediction 

This code uses the trained K-Nearest Neighbors (KNN) model to make predictions.



* **knn.predict(x_test_scaled): This applies the KNN model to the scaled test data (x_test_scaled) to predict whether each customer in the test set will accept a personal loan. The predictions are stored in the variable y_pred_knn.**

In [None]:
y_pred_knn = knn.predict(x_test_scaled)

# Model Evaluation

This code evaluates the performance of the trained K-Nearest Neighbors (KNN) model.



* **accuracy_score(y_train, knn.predict(x_train_scaled)): This calculates the accuracy of the KNN model on the training data by comparing the true labels (y_train) with the model's predictions for the training set.**



* **accuracy_score(y_test, y_pred_knn): This calculates the accuracy of the KNN model on the test data by comparing the true labels (y_test) with the predictions made on the test set (y_pred_knn).**



* **print() statements: These print the accuracy scores for both training and test datasets in percentage format.**



* **confusion_matrix(y_test, y_pred_knn): This generates a confusion matrix, which shows the number of correct and incorrect predictions made by the model. It helps visualize the model's performance on the test set.**



* **classification_report(y_test, y_pred_knn): This provides a detailed report on the model’s performance, including precision, recall, F1-score, and support for each class (loan accepted or not).**

In [None]:
# Evaluate KNN accuracy

print("KNN Model:")

print("Accuracy of Training data set: {0:.4f} %".format(accuracy_score(y_train, knn.predict(x_train_scaled)) * 100))

print("Accuracy of Test data set: {0:.4f} %".format(accuracy_score(y_test, y_pred_knn) * 100))

print("\nConfusion Matrix:")

print(confusion_matrix(y_test, y_pred_knn))

print("\nClassification Report:")

print(classification_report(y_test, y_pred_knn))
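
To connect the confusion matrix to the numbers in the classification report, the sketch below derives accuracy, precision, recall, and F1 for the positive class from the four cell counts. The counts are illustrative only and are not the notebook's actual results; `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive the headline metrics for the positive class (1 = loan accepted)."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)   # of predicted acceptances, how many were real
    recall = tp / (tp + fn)      # of real acceptances, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only, not the notebook's actual results.
# Layout matches confusion_matrix(y_true, y_pred): [[tn, fp], [fn, tp]]
matrix = [[90, 2], [4, 4]]
(tn, fp), (fn, tp) = matrix

accuracy, precision, recall, f1 = binary_metrics(tn, fp, fn, tp)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In this made-up case accuracy looks high while recall for class 1 is much lower, which is why the per-class rows of the report are worth reading alongside the accuracy figure.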

# 2. Model Logistic Regression

# Model Training and Model Prediction

**Instantiate the Logistic Regression model:**



* LogisticRegression(max_iter=1000): This initializes the logistic regression model. The max_iter=1000 parameter caps the number of solver iterations; it is raised from scikit-learn's default of 100 to give the solver more room to converge.



**Fit the Logistic Regression model:**



* logreg.fit(x_train_scaled, y_train): This trains the logistic regression model using the scaled training data (x_train_scaled) and the corresponding target variable (y_train). During this step, the model learns the relationship between the features and the target variable.



**Predict the response for test data:**





* logreg.predict(x_test_scaled): This uses the trained logistic regression model to make predictions on the scaled test data (x_test_scaled). The predicted outcomes are stored in the variable y_pred_logreg.

In [None]:
# Instantiate the Logistic Regression model

logreg = LogisticRegression(max_iter=1000)



# Fit the Logistic Regression model

logreg.fit(x_train_scaled, y_train)



# Predict the response for test data with Logistic Regression

y_pred_logreg = logreg.predict(x_test_scaled)
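
As a rough picture of what the fitted logistic regression does at prediction time, the sketch below turns a weighted sum of features into a probability with the sigmoid function and then applies a 0.5 cut-off. The weights and bias are hypothetical numbers chosen for illustration; a fitted model exposes its own values through `logreg.coef_` and `logreg.intercept_`.

```python
import math

def predict_proba(features, weights, bias):
    """Sigmoid of the weighted sum: estimated probability of class 1."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two scaled features (income, family)
weights = [2.0, 0.5]
bias = -3.0

results = []
for features in [(0.0, 0.0), (2.0, 1.0)]:
    p = predict_proba(features, weights, bias)
    label = int(p >= 0.5)  # a 0.5 cut-off, equivalent to the sign of the weighted sum
    results.append((features, round(p, 3), label))
    print(features, round(p, 3), label)
```

A larger positive weight means the feature pushes the probability toward class 1 more strongly, which is what makes the coefficients of a model trained on scaled data comparable with one another.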

# Model Evaluation

**Print Logistic Regression Model Header:**



* print("\nLogistic Regression Model:"): This line simply prints a header to indicate that the following outputs will relate to the logistic regression model.



**Evaluate Accuracy on Training Data:**



* accuracy_score(y_train, logreg.predict(x_train_scaled)): This calculates the accuracy of the logistic regression model on the training dataset by comparing the true labels (y_train) with the predictions made for the training data. The accuracy is formatted as a percentage.



**Evaluate Accuracy on Test Data:**



* accuracy_score(y_test, y_pred_logreg): This calculates the accuracy of the logistic regression model on the test dataset by comparing the true labels (y_test) with the predictions (y_pred_logreg). This also gets printed as a percentage.



**Generate Confusion Matrix:**



* confusion_matrix(y_test, y_pred_logreg): This generates a confusion matrix to visualize the number of correct and incorrect predictions made by the model on the test set.



**Generate Classification Report:**



* classification_report(y_test, y_pred_logreg): This provides a detailed report on the logistic regression model’s performance, including metrics such as precision, recall, F1-score, and support for each class (loan accepted or not).

In [None]:
# Evaluate Logistic Regression accuracy

print("\nLogistic Regression Model:")

print("Accuracy of Training data set: {0:.4f} %".format(accuracy_score(y_train, logreg.predict(x_train_scaled)) * 100))

print("Accuracy of Test data set: {0:.4f} %".format(accuracy_score(y_test, y_pred_logreg) * 100))

print("\nConfusion Matrix:")

print(confusion_matrix(y_test, y_pred_logreg))

print("\nClassification Report:")

print(classification_report(y_test, y_pred_logreg))

# Step 8: Data Visualization

This code imports the libraries needed for data visualization:



**import seaborn as sns: This imports the Seaborn library, which is built on top of Matplotlib. Seaborn provides a high-level interface for drawing attractive statistical graphics, making it easier to create complex visualizations with less code.**



**import matplotlib.pyplot as plt: This imports the Matplotlib library's pyplot module, which is used for creating static, animated, and interactive visualizations in Python. It provides functions to plot data in various forms, such as line plots and bar charts.**

In [None]:
import seaborn as sns

import matplotlib.pyplot as plt

**1. Feature Distribution**

**Overview of Features:**



The three graphs display the distributions of income, credit card spending, and mortgage amounts within the dataset.



**Graph 1: Income Distribution:**



* **The income distribution graph shows a rightward skew, indicating that most individuals earn lower incomes compared to the average.**



* **This may suggest that many people in the dataset sit toward the lower end of the income range, although skew alone does not establish socioeconomic status.**



**Graph 2: Credit Card Average Spend Distribution:**



* **The graph for credit card spending also exhibits a rightward skew, meaning that the majority of individuals spend less on credit cards than the average.**



* **This could indicate cautious spending habits or limited financial resources among the population represented.**



**Graph 3: Mortgage Distribution:**



* **The mortgage distribution graph similarly shows a rightward skew, indicating that most individuals have lower mortgage amounts.**



* **This further reinforces the idea that the dataset may represent individuals with lower financial commitments and possibly lower home ownership levels.**



**Socioeconomic Insights:**



* **The patterns observed in all three graphs suggest that the dataset likely represents a population with lower socioeconomic status.**



* **Understanding this context is important for interpreting the financial behaviors and needs of the individuals in the dataset.**



**Implications for Financial Analysis:**



* **Recognizing that the population has lower income and spending levels can inform decisions about loan approvals and financial product offerings tailored to their circumstances.**



**Guiding Future Research:**



* **Being aware of the socioeconomic context provided by these distributions can guide future analyses, enabling strategies that specifically address the needs and behaviors of the individuals represented in the dataset.**

In [None]:
plt.figure(figsize=(15, 5))



plt.subplot(1, 3, 1)

sns.histplot(tbankdf['Income'], bins=30, kde=True)

plt.title('Income Distribution')



plt.subplot(1, 3, 2)

sns.histplot(tbankdf['CCAvg'], bins=30, kde=True)

plt.title('Credit Card Average Spend Distribution')



plt.subplot(1, 3, 3)

sns.histplot(tbankdf['Mortgage'], bins=30, kde=True)

plt.title('Mortgage Distribution')



plt.tight_layout()

plt.show()


**2. Correlation Heatmap**

**Strong and Moderate Positive Correlations:**



* **Age and Experience:** There is a very strong positive correlation between age and experience (0.99). This makes sense as people generally gain more experience over time.

* **Income and CCAvg:** There's a moderate positive correlation between income and average credit card spending (0.65). This suggests that people with higher incomes tend to have higher average credit card spending.

* **Mortgage and Personal Loan:** There's a moderate positive correlation (0.50) between having a mortgage and having a personal loan. This indicates that people who have a mortgage are more likely to also have a personal loan.



**Weak Negative Correlations:**



* **Income and Education:** There is a weak negative correlation (-0.19) between income and education level. This could be due to factors like income discrepancies in different fields, the cost of higher education, or other complex socioeconomic factors.

* **Family and CCAvg:** A weak negative correlation (-0.11) exists between family size and average credit card spending. This might suggest that larger families have less disposable income for credit card spending.



**Weak Correlations:**



The majority of the correlations are weak, meaning there's little to no linear relationship between those variables. For example, there's a weak correlation between having a credit card and having a CD account (0.17). This doesn't necessarily mean they're unrelated, but rather that the relationship isn't a simple linear one.



**Insights for Your Project:**



* **Understanding Customer Profiles:** The correlations suggest that you might group customers based on factors like income, credit card spending, and mortgage status. This could be useful for targeted marketing campaigns.

* **Predicting Credit Card Spending:** The correlation between income and CCAvg could help you develop models to predict credit card spending based on income.

* **Loan Risk Assessment:** The correlations between mortgage and personal loan status could provide valuable insights for assessing loan risks.

In [None]:
plt.figure(figsize=(12, 8))

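# Restrict to numeric columns so a non-numeric column (e.g. 'Age Group') cannot break corr()

correlation_matrix = tbankdf.select_dtypes(include='number').corr()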

sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm')

plt.title('Correlation Heatmap')

plt.show()
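
Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr()`. The sketch below computes it by hand on made-up values (not rows from the dataset) to show what the number measures; `pearson` is a hypothetical helper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: co-variation divided by the product of spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up values, not rows from the dataset
age = [25, 35, 45, 55]
experience = [1, 10, 20, 30]
family = [4, 1, 3, 2]

r_age_exp = pearson(age, experience)  # nearly linear relationship
r_age_fam = pearson(age, family)      # loose, negative relationship
print(round(r_age_exp, 3), round(r_age_fam, 3))
```

Because the coefficient only captures linear association, a value near zero does not rule out a non-linear relationship between two columns.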

**3. Loan Approval Rate by Demographics**

In [None]:
tbankdf['Age Group'] = pd.cut(tbankdf['Age'], bins=[20, 30, 40, 50, 60, 70], labels=['20-30', '30-40', '40-50', '50-60', '60-70'])

approval_rate_by_age = tbankdf.groupby('Age Group', observed=False)['Personal Loan'].mean().reset_index()



plt.figure(figsize=(8, 5))

sns.barplot(data=approval_rate_by_age, x='Age Group', y='Personal Loan')

plt.title('Loan Approval Rate by Age Group')

plt.ylabel('Approval Rate')

plt.xlabel('Age Group')

plt.show()
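
Two details of the cell above are easy to miss: `pd.cut` uses right-closed bins by default, so an age of exactly 30 lands in '20-30', and the mean of a 0/1 column is simply the share of 1s. The plain-Python sketch below imitates both on made-up records; `age_group` is a hypothetical helper, not part of pandas.

```python
from bisect import bisect_left
from collections import defaultdict

edges = [20, 30, 40, 50, 60, 70]
labels = ['20-30', '30-40', '40-50', '50-60', '60-70']

def age_group(age):
    """Right-closed bins like pd.cut's default: (20, 30], (30, 40], ..."""
    if not edges[0] < age <= edges[-1]:
        return None  # pd.cut would yield NaN for out-of-range values
    return labels[bisect_left(edges, age) - 1]

# Made-up (age, accepted) pairs
records = [(25, 0), (30, 1), (31, 0), (38, 1), (40, 1), (65, 0)]

groups = defaultdict(list)
for age, accepted in records:
    groups[age_group(age)].append(accepted)

# The mean of a 0/1 column is the share of 1s in that group.
rates = {label: sum(values) / len(values) for label, values in groups.items()}
print(rates)
```

Note that an age of exactly 20 would fall outside the first bin under these defaults; passing `include_lowest=True` to `pd.cut` is one way to capture it.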

**4. Model Performance Comparison**

In [None]:
models = ['KNN', 'Logistic Regression']

accuracies = [

    accuracy_score(y_train, knn.predict(x_train_scaled)),

    accuracy_score(y_train, logreg.predict(x_train_scaled))

]



plt.figure(figsize=(8, 5))

sns.barplot(x=models, y=accuracies)

plt.title('Model Training Accuracy Comparison')

plt.ylabel('Accuracy Score')

plt.ylim(0, 1)

plt.show()

**5. Confusion Matrix Visualization**

**The two graphs presented are confusion matrices that visualize the performance of the K-Nearest Neighbors (KNN) and Logistic Regression models. Each matrix displays the actual class labels on the y-axis and the predicted class labels on the x-axis, providing insight into the models' classification accuracy. Both models show strong performance, particularly in correctly identifying class 0, as indicated by the large number of data points in the upper left cell of each matrix. The small values in the upper right cell (actual 0, predicted 1) show that false positives are rare, while the lower left cell (actual 1, predicted 0) shows a somewhat larger number of false negatives, that is, customers who accepted the loan but were predicted not to. Overall, while both models perform well, Logistic Regression achieves slightly higher accuracy with fewer misclassifications than KNN.**



**Analysis from the Graphs:**



* **Performance Comparison:** Both models effectively classify instances, but Logistic Regression has a slight edge in accuracy and fewer misclassifications.

* **Class 0 Classification:** The high number of correctly classified instances for class 0 suggests both models are reliable for this category.

* **Low False Positive Rate:** The small values in the upper right cell (actual 0, predicted 1) indicate that both models rarely predict acceptance for customers who did not accept the loan.

* **False Negative Insight:** The slightly larger value in the lower left cell (actual 1, predicted 0) suggests that both models struggle more with correctly identifying instances of class 1, indicating a need for improvement in that area.

* **Overall Effectiveness:** The confusion matrices highlight the models' strengths and weaknesses, guiding future refinements and model selection based on specific classification needs.

In [None]:
from sklearn.metrics import confusion_matrix



cm_knn = confusion_matrix(y_test, y_pred_knn)

cm_logreg = confusion_matrix(y_test, y_pred_logreg)



fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.heatmap(cm_knn, annot=True, fmt='d', ax=axes[0], cmap='Blues')

axes[0].set_title('Confusion Matrix - KNN')

axes[0].set_xlabel('Predicted')

axes[0].set_ylabel('Actual')



sns.heatmap(cm_logreg, annot=True, fmt='d', ax=axes[1], cmap='Greens')

axes[1].set_title('Confusion Matrix - Logistic Regression')

axes[1].set_xlabel('Predicted')

axes[1].set_ylabel('Actual')



plt.tight_layout()

plt.show()


**6. Bar Plot of Loan Approval by Education Level**

In [None]:
education_approval = tbankdf.groupby('Education')['Personal Loan'].mean().reset_index()

plt.figure(figsize=(8, 5))

sns.barplot(data=education_approval, x='Education', y='Personal Loan')

plt.title('Loan Approval Rate by Education Level')

plt.ylabel('Approval Rate')

plt.xlabel('Education Level')

plt.show()

**7. Box Plot for Income by Loan Approval**

* **Income as a Factor for Approval: The higher median income for approved applicants suggests that lenders tend to favor individuals with higher incomes when deciding on loan approvals. This reinforces the idea that income plays a crucial role in the loan approval process.**



* **Variation Among Approved Applicants: The wider distribution of income among those approved indicates that there is a diverse range of income levels among successful applicants. This suggests that lenders might be open to approving loans for individuals with varying incomes, not just those at the high end of the income spectrum.**



* **Potential for Lower-Income Approvals: The presence of lower-income individuals in the approved category implies that while higher income increases the likelihood of approval, lenders may also consider other factors (such as credit score, employment stability, or debt-to-income ratio) when evaluating loan applications from lower-income individuals.**



* **Understanding Risk: Lenders may use the insights from this boxplot to assess the risk associated with different income levels. For example, while higher income may indicate a lower risk of default, there might be acceptable levels of income for applicants who demonstrate other strong financial indicators.**



* **Identifying Target Markets: The analysis can help financial institutions identify target markets for personal loans. Understanding which income brackets are more likely to be approved can guide marketing strategies and loan product offerings.**

In [None]:
plt.figure(figsize=(10, 6))

sns.boxplot(data=tbankdf, x='Personal Loan', y='Income')

plt.title('Income Distribution by Loan Approval')

plt.xlabel('Personal Loan')

plt.ylabel('Income')

plt.xticks([0, 1], ['Not Approved', 'Approved'])

plt.show()

**8. Model Performance Metrics Visualization**


In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score



# Calculate performance metrics

metrics = {

    'Model': ['KNN', 'Logistic Regression'],

    'Accuracy': [

        accuracy_score(y_test, y_pred_knn),

        accuracy_score(y_test, y_pred_logreg)

    ],

    'Precision': [

        precision_score(y_test, y_pred_knn),

        precision_score(y_test, y_pred_logreg)

    ],

    'Recall': [

        recall_score(y_test, y_pred_knn),

        recall_score(y_test, y_pred_logreg)

    ],

    'F1 Score': [

        f1_score(y_test, y_pred_knn),

        f1_score(y_test, y_pred_logreg)

    ]

}

metrics_df = pd.DataFrame(metrics)



metrics_df.set_index('Model').plot(kind='bar', figsize=(10, 6))

plt.title('Model Performance Metrics Comparison')

plt.ylabel('Score')

plt.ylim(0, 1)

plt.xticks(rotation=0)

plt.show()

**Generating Sample Data**

In [None]:
sample_data = {

    'ID': [1, 2, 3, 4, 5, 

           6, 7, 8, 9, 10, 

           11, 12, 13, 14, 15, 

           16, 17, 18, 19, 20, 

           21, 22, 23, 24, 25, 

           26, 27, 28, 29, 30, 

           31, 32, 33, 34, 35, 

           36, 37, 38, 39, 40, 

           41, 42, 43, 44, 45, 

           46, 47, 48, 49, 50],

    'Age': [25, 45, 39, 35, 35, 

            40, 28, 50, 32, 23, 

            29, 37, 31, 34, 41, 

            30, 55, 38, 27, 44, 

            48, 39, 36, 42, 50, 

            22, 29, 40, 33, 46, 

            35, 41, 28, 39, 36, 

            26, 45, 30, 24, 35, 

            39, 32, 40, 41, 38, 

            44, 27, 35, 45, 36],

    'Experience': [1, 19, 15, 9, 8, 

                   12, 3, 20, 10, 5, 

                   4, 18, 11, 16, 7, 

                   9, 22, 14, 6, 12, 

                   10, 11, 9, 17, 13, 

                   1, 2, 19, 18, 8, 

                   3, 5, 10, 20, 14, 

                   9, 7, 6, 4, 5, 

                   8, 16, 14, 2, 3, 

                   20, 18, 11, 8, 9],

    'Income': [49, 34, 11, 100, 45, 

               80, 65, 30, 75, 20, 

               40, 70, 50, 90, 60, 

               30, 25, 55, 95, 85, 

               45, 60, 70, 90, 100, 

               55, 45, 50, 30, 20, 

               90, 80, 75, 65, 85, 

               60, 95, 50, 40, 30, 

               100, 80, 70, 20, 50, 

               30, 25, 55, 45, 40],

    'ZIP Code': [91107, 90089, 94720, 94112, 91330, 

                 92093, 90001, 92007, 90010, 94000, 

                 90020, 93000, 94110, 90030, 91370, 

                 94020, 91110, 94080, 91340, 91130, 

                 94010, 90040, 94750, 94040, 90050, 

                 92050, 91150, 91350, 94060, 94070, 

                 90090, 90080, 91120, 94030, 90060, 

                 94730, 92080, 91140, 94090, 92010, 

                 94120, 91180, 90030, 91380, 90070, 

                 94130, 91190, 91360, 92040, 90000],

    'Family': [4, 3, 1, 1, 4, 

               2, 3, 1, 2, 2, 

               1, 4, 3, 1, 2, 

               4, 3, 2, 4, 2, 

               3, 2, 1, 4, 1, 

               3, 1, 2, 3, 4, 

               2, 1, 4, 3, 2, 

               1, 4, 2, 1, 3, 

               2, 4, 1, 2, 1, 

               4, 3, 2, 4, 2],

    'CCAvg': [1.6, 1.5, 1.0, 2.7, 1.0, 

              2.2, 1.4, 2.0, 1.8, 2.5, 

              1.9, 1.7, 2.1, 1.3, 1.5, 

              2.0, 1.6, 2.2, 1.4, 1.8, 

              2.3, 1.5, 1.1, 1.6, 2.0, 

              1.4, 1.2, 1.9, 2.1, 1.8, 

              2.0, 1.5, 1.7, 1.4, 1.9, 

              2.1, 1.6, 1.8, 2.2, 1.3, 

              1.9, 2.3, 1.1, 1.5, 2.0, 

              2.1, 1.8, 1.6, 1.9, 2.0],

    'Education': [1, 1, 1, 2, 2, 

                  2, 1, 1, 2, 1, 

                  2, 2, 1, 1, 2, 

                  1, 1, 2, 2, 1, 

                  1, 1, 2, 1, 2, 

                  1, 2, 1, 1, 1, 

                  1, 1, 2, 2, 2, 

                  2, 1, 2, 1, 1, 

                  2, 1, 1, 2, 2, 

                  1, 2, 1, 1, 1],

    'Mortgage': [0, 0, 0, 0, 0, 

                 150, 0, 0, 0, 0, 

                 0, 100, 0, 0, 0, 

                 200, 0, 0, 0, 0, 

                 0, 100, 0, 0, 0, 

                 0, 0, 0, 0, 0, 

                 0, 0, 0, 0, 0, 

                 0, 0, 0, 0, 0, 

                 0, 100, 0, 0, 0, 

                 0, 0, 0, 0, 0],

    'Personal Loan': [0, 0, 0, 0, 0, 

                      1, 0, 0, 1, 0, 

                      1, 0, 0, 1, 0, 

                      0, 1, 0, 1, 0, 

                      0, 1, 0, 1, 0, 

                      0, 0, 0, 0, 1, 

                      1, 0, 1, 0, 0, 

                      1, 0, 1, 0, 1, 

                      0, 0, 0, 1, 0, 

                      1, 1, 0, 1, 0],

    'Securities Account': [1, 1, 0, 0, 0, 

                           1, 1, 0, 0, 1, 

                           1, 1, 1, 0, 1, 

                           1, 1, 0, 0, 1, 

                           0, 1, 0, 0, 1, 

                           1, 1, 0, 1, 1, 

                           0, 1, 1, 0, 0, 

                           1, 0, 1, 1, 0, 

                           0, 1, 0, 1, 1, 

                           1, 0, 1, 0, 1],

    'CD Account': [0, 0, 0, 0, 0, 

                   1, 0, 0, 1, 0, 

                   0, 0, 1, 0, 1, 

                   0, 1, 0, 0, 1, 

                   1, 0, 1, 1, 0, 

                   0, 1, 0, 0, 0, 

                   0, 0, 1, 1, 1, 

                   0, 1, 0, 0, 1, 

                   0, 1, 0, 1, 0, 

                   1, 0, 0, 1, 0],

    'Online': [0, 0, 0, 0, 1, 

               1, 1, 1, 0, 1, 

               1, 0, 0, 0, 1, 

               0, 1, 0, 1, 0, 

               1, 1, 1, 0, 0, 

               1, 1, 1, 1, 0, 

               0, 0, 1, 1, 0, 

               0, 1, 0, 1, 1, 

               0, 1, 0, 0, 1, 

               1, 0, 1, 0, 1],

    'CreditCard': [0, 0, 0, 0, 1, 

                   1, 0, 0, 0, 0, 

                   1, 0, 1, 0, 1, 

                   0, 1, 0, 1, 1, 

                   0, 0, 0, 1, 1, 

                   0, 0, 1, 1, 0, 

                   1, 0, 0, 1, 0, 

                   1, 0, 0, 1, 1, 

                   1, 0, 0, 1, 1, 

                   0, 0, 1, 0, 0]

}

In [None]:
df_sample = pd.DataFrame(sample_data)



# Use the same feature columns, in the same order, that the models were trained on

X = df_sample[['Age', 'Family', 'Education', 'Securities Account', 

                'CD Account', 'Online', 'CreditCard', 'Income', 

                'CCAvg', 'Mortgage', 'ZIP Code']]



# Reuse the scaler fitted on the training data instead of fitting a new one

X_scaled = scaler.transform(X)



knn_predictions = knn.predict(X_scaled)

logistic_predictions = logreg.predict(X_scaled)



df_sample['KNN Prediction'] = knn_predictions

df_sample['Logistic Prediction'] = logistic_predictions



print(df_sample[['ID', 'KNN Prediction', 'Logistic Prediction']])
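
One way to make it harder to mismatch scaling between training and prediction is to bundle the scaler and the model into a single scikit-learn pipeline. The sketch below is a suggestion rather than part of the original workflow: the data is synthetic, the labeling rule is invented, and the variable names (`X_train`, `X_new`, `model`) are placeholders.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: two features on very different scales
X_train = np.column_stack([rng.normal(50, 20, 200), rng.normal(2, 1, 200)])
y_train = (X_train[:, 0] > 60).astype(int)  # made-up rule for the label

# The pipeline fits the scaler on the training data and reuses it at predict time
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)

X_new = np.array([[95.0, 2.0], [10.0, 2.0]])  # unscaled new customers
predictions = model.predict(X_new)
print(predictions)
```

With this arrangement, new records are passed in unscaled and the pipeline applies the training-time scaling itself, so there is no separate scaler object to keep in sync.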