# **Project Summary -**



# **Project Name**   

## **Paisabazaar Banking Fraud Analysis**

---




##### **Project Type**    - Classification (Supervise Ml)
##### **Contribution**    - Individual


# **Project Objective:**



*  The main Objective of this project is  **to analyze customer data and predict their credit score,** which help:

1.   Improve creditworthiness assessment
2.  Reduce loan default risks
3.    Provide personalized financial advice











## **Problem Statement**



•	Paisabazaar is a financial services company offering credit and banking products to customers.

•	One of their key challenges is accurately assessing the creditworthiness of customers.

•	Credit scores are critical in evaluating the risk of loan defaults and making approval decisions.

•	The company wants to classify customers into credit score categories: Good, Standard, or Poor.

•	The classification should be based on features like:

    1. Annual income
    2. Number of loans
    3. Credit card usage
    4. Payment behavior
    5. Occupation, monthly EMI, etc.

•	The goal is to build a supervised machine learning classification model.

•	A successful model will:

    1.	Help reduce default risk

    2.	Improve loan approval accuracy

    3.	Enable personalized financial recommendations







# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
data=pd.read_csv("/content/dataset-2.csv")


### Dataset First View

In [None]:
# To display first 5 rows
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape   #rows=100000 and columns=28

### Dataset Information

In [None]:
data.info() # summary of data set

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

In [None]:
data.columns

### What did you know about your dataset?



1.   No.of rows:100000
2.   No.of columns:28
3.  **Discreate Categorical variable:**

      a.Name

      b.Occupation

      c.Type_of_Loan

      d.Payment_of_Min_Amount

      e.Payment_Behaviour

      f.Credit_Score

4. **Discreate count variable:**

      a.Month

      b.Num_Bank_Accounts

      c.Num_Credit_Card

      d.Num_of_Loan

      e.Num_of_Delayed_Payment

      f.Num_Credit_Inquiries

      g.ID

      h.Customer_ID

5. **continuous variables:**

       a. Age
       b.SSN
       c.Annual_Income
       d.Monthly_Inhand_Salary
       e.Interest_Rate
       f.Delay_from_due_date
       g.Changed_Credit_Limit
       h.Outstanding_Debt
       i.Credit_Utilization_Ratio
       j.Credit_History_Age
       k.Total_EMI_per_month
       l.Amount_invested_monthly
       m.Monthly_Balance

6. There are no duplicated values in data set
7. There are no missing values in data set


## ***2. Understanding Your Variables***

In [None]:
data.columns  # To see all columns of dataset

In [None]:
data.describe()

### Variables Description

1. ID: All unique value
2. Customer_ID: customer id
3. Month: 	Month number (likely used for time series or tracking changes over months).
4. Name: Full name of customer
5. Age:  Age of customer
6. SSN:  Social Security Number
7. Occupation: Customers profession
8. Annual_Income: How much customer earn in year
9. Monthly_Inhand_Salary:How much salary crediated to customer account  in month
10. Num_Bank_Accounts: No. of active bank account a customer have
11. Num_Credit_Card : No. of crediate card customer owned
12. Interest_Rate: Average interest rate on loans/credit.
13. Num_of_Loan: 	Total number of loans availed.
14. Type_of_Loan: type of loans taken by customer
15. Delay_from_due_date: Average days of delay in payment past due date.
16. Num_of_Delayed_Payment: how many time customer did delay payent
17. Changed_Credit_Limit: Amount by which credit limit was increased/decreased.

18. Num_Credit_Inquiries: 	Number of times customer applied for credit recently.
19. Credit_Mix: Type of credit portfolio – e.g., Good, Standard, Bad.
20. Outstanding_Debt: Total remaining unpaid debt.
21. Credit_Utilization_Ratio: Ratio of used credit to total credit limit.
22. Credit_History_Age: Duration (in months or years) of the customer’s credit history.
23. Payment_of_Min_Amount: 	Whether the customer paid the minimum amount due (Yes/No).
24. Total_EMI_per_month: Monthly total of EMIs paid.
25. Amount_invested_monthly: Monthly investment made by the customer.
26. Payment_Behaviour: customer’s pattern of spending and payment behavior.
27. Monthly_Balance: Average monthly balance left after all expenses.
28. Credit_Score: Overall credit score category – e.g., Good, Standard, Poor.
      
      
  

### Check Unique Values for each variable.

In [None]:
# To check unique value in variable use unique() and for unique count use nunique()
data.nunique()

In [None]:
data['Month'].value_counts()

In [None]:
data['Occupation'].value_counts()

In [None]:
data['Num_Bank_Accounts'].value_counts()

In [None]:
data['Credit_Mix'].value_counts()

In [None]:
data['Delay_from_due_date'].value_counts()

In [None]:
data['Annual_Income'].value_counts()

In [None]:
data['Credit_Score'].value_counts()

In [None]:
data['Payment_Behaviour'].value_counts()

In [None]:
data['Monthly_Balance'].value_counts()

In [None]:
data['Total_EMI_per_month'].value_counts()

In [None]:
data['Changed_Credit_Limit'].value_counts()

In [None]:
data['Interest_Rate'].value_counts()

In [None]:
data['Num_of_Loan'].value_counts()

In [None]:
discreate_categorical=['Name','Occupation','Type_of_Loan', 'Credit_Mix','Payment_of_Min_Amount','Payment_Behaviour','Credit_Score']
discreate_count=['Num_Bank_Accounts','Num_Credit_Card','Num_of_Loan','Num_of_Delayed_Payment','Num_Credit_Inquiries']
continuous=['Age','Annual_Income','Monthly_Inhand_Salary','Interest_Rate','Delay_from_due_date','Changed_Credit_Limit',
          'Outstanding_Debt','Credit_Utilization_Ratio','Credit_History_Age','Total_EMI_per_month','Amount_invested_monthly','Monthly_Balance']

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Base on filter methode if all features are unique then drop feature
data.drop(columns=['ID','Customer_ID','SSN','Month','Name'],inplace=True)

In [None]:
#duplicate value
data.duplicated().sum()
#missiing value
data.isnull().sum()

In [None]:
data.head(2)

In [None]:
data['Num_Bank_Accounts'].unique()

In [None]:
data['Num_Credit_Card'].unique()

In [None]:
data['Interest_Rate'].unique()

In [None]:
data['Num_of_Loan'].unique()

In [None]:
data['Delay_from_due_date'].unique()

In [None]:
data['Num_of_Delayed_Payment'].unique()

In [None]:
data['Payment_Behaviour'].unique()

In [None]:
# Fixing data type
data['Age']=data['Age'].astype('int')
data['Num_Bank_Accounts']=data['Num_Bank_Accounts'].astype('int')
data['Num_Credit_Card']=data['Num_Credit_Card'].astype('int')
data['Num_of_Loan']=data['Num_of_Loan'].astype('int')
data['Delay_from_due_date']=data['Delay_from_due_date'].astype('int')
data['Num_of_Delayed_Payment']=data['Num_of_Delayed_Payment'].astype('int')


In [None]:
# Wrong data
data['Payment_of_Min_Amount'] = data['Payment_of_Min_Amount'].replace('NM', 'NA')

In [None]:
continuous


In [None]:
# Fix Skewness
data[['Annual_Income','Monthly_Inhand_Salary','Interest_Rate','Changed_Credit_Limit','Outstanding_Debt','Credit_Utilization_Ratio',
      'Credit_History_Age','Total_EMI_per_month','Amount_invested_monthly','Monthly_Balance']].skew() # right skew

In [None]:
from scipy.stats import boxcox


# Function to apply boxcox with positive shift
def boxcox_transform(col):
    # If minimum value <= 0, shift the column
    min_val = col.min()
    if min_val <= 0:
        col = col + abs(min_val) + 1
    transformed, _ = boxcox(col)
    return transformed

# Apply safely to columns
data['Annual_Income'] = boxcox_transform(data['Annual_Income'])
data['Monthly_Inhand_Salary'] = boxcox_transform(data['Monthly_Inhand_Salary'])
data['Outstanding_Debt'] = boxcox_transform(data['Outstanding_Debt'])
data['Total_EMI_per_month'] = boxcox_transform(data['Total_EMI_per_month'])
data['Amount_invested_monthly'] = boxcox_transform(data['Amount_invested_monthly'])
data['Monthly_Balance'] = boxcox_transform(data['Monthly_Balance'])

In [None]:
data[['Annual_Income','Monthly_Inhand_Salary','Interest_Rate','Changed_Credit_Limit','Outstanding_Debt','Credit_Utilization_Ratio',
      'Credit_History_Age','Total_EMI_per_month','Amount_invested_monthly','Monthly_Balance']].skew() # right skew

In [None]:
sns.boxplot(data)
plt.show()

In [None]:
sns.boxplot(data['Delay_from_due_date'])
plt.show()

In [None]:


# Select only numeric columns
numeric_cols = data.select_dtypes(include=['float64', 'int64']).columns

# Loop to remove outliers in one go
for col in numeric_cols:
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    data = data[(data[col] >= lower) & (data[col] <= upper)]


In [None]:
sns.boxplot(data['Delay_from_due_date'])
plt.show()

In [None]:

# Feature Engineering

# 1. Debt to Income Ratio
data['Debt_to_Income_Ratio'] = data['Outstanding_Debt'] / (data['Annual_Income'] + 1)

# 2. Credit Card Utilization Score
data['Credit_Card_Utilization_Score'] = data['Credit_Utilization_Ratio'] * data['Num_Credit_Card']


# 3. Payment Delay Score
data['Payment_Delay_Score'] = data['Num_of_Delayed_Payment'] * data['Delay_from_due_date']


In [None]:
data.columns

# What all manipulations have you done and insights you found??

#### I have drop those columns which contain unique value beacuse they not useful in ml
#### I have chaek is there are any duplicate or missing values
#### I have convert skew data to normal data for ml
#### I have remove outlier with upper limit and lower limit
#### I have done feature engineeting create a new columns for analysis


# 4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables


In [None]:
data.columns

# Chart - 1

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of Credit Score
sns.countplot(x='Credit_Score', data=data)
plt.title("Distribution of Credit Scores")
plt.show()


## 1. Why did you pick the specific chart?

- To check count of each category of credit_score

##### 2. What is/are the insight(s) found from the chart?

*   most customers have an average credit score.
*   If many users are in the Poor category:

    - The company might face higher lending risk.

- It may need to tighten credit policies or offer financial counseling.

- If most users are Good:

- Indicates a low-risk customer base, ideal for premium credit products.


#### Will the gained insights help creating a positive business impact?


Positive: Most of the user have standard credit score
- If most users are Good:

Indicates a low-risk customer base, ideal for premium credit products.


Negative:
- If many users are in the Poor category which are in median range:

     - The company might face higher lending risk.

- It may need to tighten credit policies or offer financial counseling.




#### Chart - 2

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x='Occupation', y='Num_of_Loan', data=data, estimator='mean')
plt.xticks(rotation=45)
plt.title("Average Number of Loans per Occupation")
plt.show()



##### 1. Why did you pick the specific chart?

- Shows which occupations tend to take more loans.*italicized text*

##### 2. What is/are the insight(s) found from the chart?

- Managers or Entrepreneurs might have more loans due to business needs.
- Scientist and Doctor also have more loans also


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Company can do focus more on:
- Managers or Entrepreneurs might have more loans due to business needs.
- Scientist and Doctor also have more loans also

Negative:
-

#### Chart - 3

In [None]:
num_cols = ['Age', 'Annual_Income', 'Credit_Utilization_Ratio', 'Monthly_Balance']
for col in num_cols:
    plt.figure(figsize=(6,4))
    sns.boxplot(x='Credit_Score', y=col, data=data)
    plt.title(f'{col} vs Credit Score')
    plt.show()


##### 1. Why did you pick the specific chart?

It will give clear idea about min , max, 25%,755,outlier in data

##### 2. What is/are the insight(s) found from the chart?

If Good scores have a higher median age, it suggests credit improves with age/experience.

If Poor scores have more young individuals, it may indicate lack of credit history or financial maturity.

Watch for outliers: older people with poor scores could indicate financial stress in retirement.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### Chart - 4

In [None]:
sns.boxplot(x='Credit_Score', y='Age', data=data)
plt.show()

##### 1. Why did you pick the specific chart?

- analyze how age varies across different credit score categories.

##### 2. What is/are the insight(s) found from the chart?

- Older individuals tend to have better credit scores; poor scores are more common in younger age groups.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
Yes. Helps target older customers for premium products and long-term loans.

Any Negative Insight:
Yes. Younger individuals may be underserved due to lack of credit history — may lead to loss of potential long-term customers.

#### Chart - 5

In [None]:

sns.countplot(x='Credit_Mix', hue='Credit_Score', data=data)
plt.show()


##### 1. Why did you pick the specific chart?

- To compare how different credit mix types (Bad, Standard, Good) are distributed across credit score categories.

##### 2. What is/are the insight(s) found from the chart?

- customers with a "Good" credit mix are mostly in the "Good" credit score group, while a "Bad" mix correlates with poor scores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Yes. Can guide lenders to emphasize maintaining a balanced credit mix when assessing or advising customers.

Negative Insight:

If credit mix is too heavily weighted in scoring, customers with limited product access may be unfairly penalized — risk of bias.

#### Chart - 6

In [None]:
sns.histplot(data['Annual_Income'], kde=True)
plt.show()

##### 1. Why did you pick the specific chart?

- To observe the distribution and skewness of annual income across all customers.

##### 2. What is/are the insight(s) found from the chart?

- Income is right-skewed — most customers earn in the lower range, while a few have very high incomes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Yes. Helps in income-based segmentation — tailor financial products for low, mid, and high earners.

Negative Insight:

High income outliers may distort model training or bias targeting; may also hide underserved low-income groups.

#### Chart - 7

In [None]:
sns.countplot(x='Occupation', data=data)
plt.show()

##### 1. Why did you pick the specific chart?

- To understand the distribution of customers across different occupations.

##### 2. What is/are the insight(s) found from the chart?

- Certain occupations (e.g., "Salaried", "Self_Employed") have a much higher representation, while others are underrepresented.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

- Positive Business Impact:

Yes. Enables targeted marketing and product design for the most common occupations.

- Negative Insight:

Overrepresentation of a few occupations may lead to biased models and neglect of smaller yet valuable customer groups.

#### Chart - 8

In [None]:
sns.lineplot(x='Credit_History_Age', y='Credit_Utilization_Ratio', data=data)
plt.show()

##### 1. Why did you pick the specific chart?

- To see the trend/relationship between the length of credit history and credit utilization.

##### 2. What is/are the insight(s) found from the chart?

- Generally, as credit history age increases, utilization ratio tends to decrease, indicating more responsible usage over time.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
Yes. Encourages rewarding long-term customers with stable credit behavior — improves retention and cross-sell potential.

Negative Insight:
New customers with short history may have high utilization and get low scores unfairly, affecting onboarding and growth.

#### Chart - 9

In [None]:
sns.heatmap(data[continuous].corr(),annot=True)
plt.show()

##### 1. Why did you pick the specific chart?

- To find correlation between only continuous variables, which helps in cleaner, more focused feature analysis.

##### 2. What is/are the insight(s) found from the chart?

- You can clearly see which continuous variables are strongly related (e.g., Outstanding_Debt vs Total_EMI_per_month).

Weak or no correlation means independent influence on target.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:
Yes. Helps in removing redundant features, reducing model complexity, and improving performance.

Negative Insight:
Ignoring low-correlation but important features (e.g., non-linear relationships) may lead to loss of predictive power — should combine with other techniques (like feature importance from models).

#### Chart - 10 - Correlation Heatmap

In [None]:
selected_cols = [
    'Annual_Income',
    'Outstanding_Debt',
    'Credit_Utilization_Ratio',
    'Monthly_Balance',
    'Total_EMI_per_month',
    'Debt_to_Income_Ratio',
    'Payment_Delay_Score'
]
plt.figure(figsize=(10,6))
sns.heatmap(data[selected_cols].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap of Key Financial Features")
plt.show()



##### 1. Why did you pick the specific chart?

- To visually detect strong correlations between numerical features — useful for identifying feature relationships and redundancy.

##### 2. What is/are the insight(s) found from the chart?

- Outstanding_Debt is strongly correlated with Total_EMI_per_month.

Debt_to_Income_Ratio and Credit_Utilization_Ratio also show strong patterns.

Some features are independent — good for diverse modeling inputs.

#### Chart - 11 - Pair Plot

In [None]:
continuous


In [None]:
numeric_col=['Annual_Income',
 'Interest_Rate',]
sns.pairplot(data[numeric_col])
plt.show()

##### 1. Why did you pick the specific chart?

To visualize pairwise relationships and class separability across multiple numeric features.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Customers with higher credit history age have significantly lower credit utilization ratios."

 1. State Hypotheses
Null Hypothesis (H₀):
There is no significant correlation between Credit History Age and Credit Utilization Ratio.

𝐻
0
:
𝜌
=
0


Alternate Hypothesis (H₁):
There is a significant negative correlation between Credit History Age and Credit Utilization Ratio.

𝐻
1
:
𝜌
<
0


#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import pearsonr
import pandas as pd




# Apply Pearson correlation test
corr, p_value = pearsonr(data['Credit_History_Age'], data['Credit_Utilization_Ratio'])

print(f"Correlation Coefficient: {corr:.3f}")
print(f"P-value: {p_value:.4f}")


##### Which statistical test have you done to obtain P-Value?

I used the Pearson Correlation Test (scipy.stats.pearsonr())

##### Why did you choose the specific statistical test?

Both variables are continuous and assumed to have a linear relationship.

Pearson correlation is appropriate to:

Measure the strength and direction of the relationship.

Provide a p-value to test the statistical significance of the correlation.



### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- The average annual income is significantly different across credit score groups
- Null Hypothesis (H₀):
The mean annual income is the same across all credit score groups (Good, Standard, Poor).

𝐻
0
:
𝜇
Good
=
𝜇
Standard
=
𝜇
Poor

Alternative Hypothesis (H₁):
At least one group has a different mean annual income.

𝐻
1
:
At least one
𝜇
 is different


#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import f_oneway

# Prepare data
df2 = data[['Annual_Income', 'Credit_Score']]

# Group the data
group_good = df2[df2['Credit_Score'] == 'Good']['Annual_Income']
group_std = df2[df2['Credit_Score'] == 'Standard']['Annual_Income']
group_poor = df2[df2['Credit_Score'] == 'Poor']['Annual_Income']

# Perform One-Way ANOVA
f_stat, p_value = f_oneway(group_good, group_std, group_poor)

# Results
print(f"F-Statistic: {f_stat:.2f}")
print(f"P-Value: {p_value:.4f}")


##### Which statistical test have you done to obtain P-Value?

-  One-Way ANOVA (Analysis of Variance)

##### Why did you choose the specific statistical test?

- One-way ANOVA is used to compare means across 3 or more groups.

- Our goal is to test whether Annual Income varies significantly between Credit Score groups (Good, Standard, Poor).

- It provides an F-statistic and a p-value to check if the differences are statistically significant.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

- here is a significant correlation between Outstanding Debt and Total EMI per month.
- Null Hypothesis (H₀):
There is no significant correlation between Outstanding Debt and Total EMI per month.

𝐻
0
:
𝜌
=
0

Alternative Hypothesis (H₁):
There is a significant correlation between Outstanding Debt and Total EMI per month.

𝐻
1
:
𝜌
≠
0


#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import pearsonr

df3 = data[['Outstanding_Debt', 'Total_EMI_per_month']]
# Perform Pearson correlation test
corr, p_value = pearsonr(df3['Outstanding_Debt'], df3['Total_EMI_per_month'])

# Output results
print(f"Correlation Coefficient: {corr:.3f}")
print(f"P-Value: {p_value:.4f}")


##### Which statistical test have you done to obtain P-Value?

- Pearson Correlation Test

##### Why did you choose the specific statistical test?

- Both Outstanding_Debt and Total_EMI_per_month are continuous numeric variables.

- Pearson’s test is appropriate to:

- Assess the linear relationship between two continuous variables.

- Provide a p-value to test if that correlation is statistically significant.

### 3. Categorical Encoding

In [None]:
#Encoding
discreate_categorical

In [None]:
# Label Encoding ---
from sklearn.preprocessing import LabelEncoder
label_cols = ['Credit_Mix', 'Payment_of_Min_Amount', 'Credit_Score']
for col in label_cols:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col].astype(str))

In [None]:
# One hot encoding
data = pd.get_dummies(data, columns=['Occupation', 'Payment_Behaviour'], drop_first=True)

In [None]:
# --- Step 4: Encoding 'Type_of_Loan' ---
loan_types_split = data['Type_of_Loan'].dropna().apply(
    lambda x: [lt.strip().replace(" and", "") for lt in x.split(',')]
)

from itertools import chain
all_loan_types = set(chain.from_iterable(loan_types_split))

for loan_type in all_loan_types:
    data[f"Has_{loan_type.replace(' ', '_')}"] = data['Type_of_Loan'].fillna('').apply(
        lambda x: int(loan_type in x)
    )

data.drop(columns=['Type_of_Loan'], inplace=True)

# Preview the final dataset
print(data.head())

#### What all categorical encoding techniques have you used & why did you use those techniques?

- I done encoding for Ml algorithm

In [None]:
# Combine features and target first
df = data.dropna()  # or any preprocessing

x = df.drop('Credit_Score', axis=1)
y = df['Credit_Score']


In [None]:
y

## ***7. ML Model Implementation***

### ML Model -  LogisticRegression

In [None]:
from sklearn.model_selection import train_test_split

random_state=None
best_accuracy=0.0
for random_state in range(100):
  X_train,X_test,Y_train,Y_test=train_test_split(x,y,test_size=0.2,random_state=random_state)

from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
#fit model
model.fit(X_train,Y_train)
#Predication
ypred_train=model.predict(X_train)
ypred_test=model.predict(X_test)
#Evaluation
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
#print("Train Accuracy:",accuracy_score(y_train,ypred_train))
#print("Test Accuracy:",accuracy_score(y_test,ypred_test))
#print("Cross Validation Score:",cross_val_score(model,X_train,y_train,cv=5).mean())
#or
accuracy=accuracy_score(Y_test,ypred_test)
#
if accuracy>best_accuracy:
    best_accuracy=accuracy
    best_random_state=random_state
print("best random state:",best_random_state)
print("Best Accuracy:",best_accuracy)

In [None]:
#model
model1=LogisticRegression()
model1.fit(X_train,Y_train)
#prediction
ypred_train=model1.predict(X_train)
ypred_test=model1.predict(X_test)
#Evaluation
print("Train Accuracy:",accuracy_score(Y_train,ypred_train))
print("Test Accuracy:",accuracy_score(Y_test,ypred_test))
print("Cross Validation score:",cross_val_score(model1,X_train,Y_train,cv=5).mean())

### ML Model - 2

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model2=KNeighborsClassifier()
model2.fit(X_train,Y_train)
#predict
ypred_trainknn=model2.predict(X_train)
ypred_testknn=model2.predict(X_test)
#Evaluation
print("Train Accuracy:",accuracy_score(Y_train,ypred_trainknn))
print("Test Accuracy:",accuracy_score(Y_test,ypred_testknn))
print("Cross Validation Score:",cross_val_score(model2,X_train,Y_train,cv=5).mean())

### ML Model - SVM

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
#Model
estimator=SVC()
#parameter
param_grid={"C":[0.01,0.1,1,10,100],'kernel':['linear','poly','rbf','sigmoid']}
#Best parameter
svm_grid=GridSearchCV(estimator,param_grid,cv=5,scoring='accuracy')
svm_grid.fit(X_train,Y_train)
svm_grid.best_params_
#prediction
ypred_trainsvm=svm_grid.predict(X_train)
ypred_testsvm=svm_grid.predict(X_test)
#Evaluation
print("Train Accuracy:",accuracy_score(Y_train,ypred_trainsvm))
print("Test Accuracy:",accuracy_score(Y_test,ypred_testsvm))
print("Cross Validation score:",cross_val_score(svm_grid,X_train,Y_train,cv=5).mean())

DECISION TREE

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

#estimators
estimator=DecisionTreeClassifier()
#prameter
param_grid={"criterion":['gini','entropy'],'max_depth':list(range(1,10))}
#best parameter
dt_grid=GridSearchCV(estimator,param_grid,cv=5,scoring='accuracy')
dt_grid.fit(X_train,Y_train)
dt_grid.best_params_
#important features
dt_grid.best_estimator_
dt_grid.best_estimator_.feature_importances_
#create dataframe
feats=pd.DataFrame(data=dt_grid.best_estimator_.feature_importances_,
                  index=x.columns,
                  columns=['important_features'])
imp_feats=feats[feats['important_features']>0]
imp_feats_list=imp_feats.index.to_list()
imp_feats_list




In [None]:
dt_grid.best_estimator_.feature_importances_
#create dataframe

In [None]:
X_imp=x[imp_feats_list]
y
# train test split
X_train,X_test,Y_train,Y_test=train_test_split(X_imp,y,test_size=0.2,random_state=99)
#final dt
final_dt=DecisionTreeClassifier()
final_dt.fit(X_train,Y_train)
#prediction
ypred_traindt=final_dt.predict(X_train)
ypred_testdt=final_dt.predict(X_test)
#Evalauation
print("Train Accuracy:",accuracy_score(Y_train,ypred_traindt))
print("Test Accuracy:",accuracy_score(Y_test,ypred_testdt))
print("Cross Validation Score:",cross_val_score(final_dt,X_train,Y_train,cv=5).mean())

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
from joblib import dump
dump(final_dt,'loan_model.joblib')

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
from joblib import load
loaded_model=load('loan_model.joblib')
loaded_model.predict(X_test)

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In [None]:
import numpy as np
from flask import Flask, request, render_template
import pickle

app = Flask(__name__)
model = pickle.load(open('loan_model.pkl', 'rb'))

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict',methods=['POST'])
def predict():
    int_features = [int(x) for x in request.form.values()]
    final_features = [np.array(int_features)]
    prediction = model.predict(final_features)

    output = round(prediction[0], 2)

    return render_template('index.html', prediction_text='Employee Salary should be $ {}'.format(output))

if __name__ == "__main__":
    app.run(debug=True)

In [None]:
<!DOCTYPE html>
<html>
<head>
    <title>Loan Prediction</title>
</head>
<body>
    <h2>Enter Details for Prediction</h2>
    <form action="/predict" method="post">
        <label>Feature 1:</label><input type="text" name="f1"><br>
        <label>Feature 2:</label><input type="text" name="f2"><br>
        <label>Feature 3:</label><input type="text" name="f3"><br>
        <label>Feature 4:</label><input type="text" name="f4"><br>
        <label>Feature 5:</label><input type="text" name="f5"><br><br>
        <input type="submit" value="Predict">
    </form>
    <br>
    {% if prediction_text %}
        <h3>{{ prediction_text }}</h3>
    {% endif %}
</body>
</html>


In [None]:
pip install flask numpy pickle-mixin
python app.py

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***