# **Project Name**    - PhonePe Transaction Insights



##### **Project Type**    - Data analysis and visualisation
##### **Contribution**    - Individual
##### **Team Member 1 -**
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

The PhonePe Transaction Insights project aims to analyze and visualize transaction data from the PhonePe digital payment platform. The project focuses on extracting meaningful insights from aggregated transaction values, user engagement metrics, and insurance-related data. By leveraging SQL for data extraction and Python for visualization, the project provides actionable business insights such as customer segmentation, fraud detection, and geographical trends. The final deliverable includes an interactive dashboard built using Streamlit, enabling stakeholders to explore data dynamically.

Key objectives of the project include:

*   Extracting and transforming data from a GitHub repository into a SQL database.
*   Performing SQL queries to analyze transaction patterns, user behavior, and insurance trends.
*   Creating visualizations (e.g., bar charts, pie charts, maps) to represent aggregated and top-performing data.
*   Developing an interactive dashboard for real-time data exploration.
*   Generating actionable insights to improve marketing strategies, fraud detection, and product development.

The project enhances skills in data extraction, SQL, Python, data visualization, and analytical thinking while addressing real-world business challenges in the digital payment domain.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


With the rapid adoption of digital payment systems like PhonePe, understanding transaction dynamics, user engagement, and insurance trends is critical for optimizing services and targeting users effectively. This project addresses the following challenges:

*   **Data Complexity:** Large volumes of transaction data require efficient extraction, transformation, and analysis.
*   **User Segmentation:** Identifying distinct user groups based on spending habits for tailored marketing.
*   **Fraud Detection:** Analyzing transaction patterns to detect and prevent fraudulent activities.
*   **Geographical Trends:** Mapping transaction values at state and district levels to identify high-performing regions.
*   **Payment Performance:** Evaluating the popularity of payment categories to guide strategic investments.
*   **Insurance Insights:** Improving insurance product offerings by analyzing transaction data.

The goal is to provide a comprehensive analysis of PhonePe's transaction data, enabling data-driven decision-making for business growth.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each algorithm.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated evaluation metric score chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3
import streamlit as st

### Dataset Loading

In [None]:
# Load Dataset
# Example: Load data from SQL database
conn = sqlite3.connect('phonepe_data.db')
query = "SELECT * FROM aggregated_transaction"
df = pd.read_sql(query, conn)
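
The loading cell above assumes `phonepe_data.db` already exists. As a minimal sketch of the extract-and-load step, assuming the source records have already been parsed into rows, a DataFrame can be written to SQLite and queried back. The column names and values below are made up for illustration and are not the real PhonePe schema.

```python
import sqlite3
import pandas as pd

# Hypothetical rows standing in for records parsed from the source files
records = pd.DataFrame({
    "state": ["karnataka", "karnataka", "maharashtra"],
    "year": [2022, 2022, 2022],
    "payment_category": ["bills", "travel", "bills"],
    "transaction_count": [120, 15, 200],
    "amount": [54000.0, 30500.0, 91000.0],
})

# An in-memory database keeps the sketch self-contained;
# a file path such as 'phonepe_data.db' would persist the table instead
conn = sqlite3.connect(":memory:")
try:
    # Write the table, replacing any earlier copy
    records.to_sql("aggregated_transaction", conn, if_exists="replace", index=False)

    # Aggregate in SQL and load the result back into pandas
    query = """
        SELECT state, SUM(amount) AS total_amount
        FROM aggregated_transaction
        GROUP BY state
        ORDER BY total_amount DESC
    """
    state_totals = pd.read_sql(query, conn)
finally:
    # Always release the connection, even if a query fails
    conn.close()

print(state_totals)
```

Closing the connection in a `finally` block helps the notebook run end to end without leaving the database file locked.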

### Dataset First View

In [None]:
# Dataset First Look
print(df.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

### Dataset Information

In [None]:
# Dataset Info
print(df.info())

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(df.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False)
plt.title("Missing Values Heatmap")
plt.show()

### What did you know about your dataset?

The dataset contains transaction data aggregated by categories such as payment type, state, and district. Preliminary checks reveal the presence of missing values and duplicates, which will be addressed during data wrangling.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns.tolist())

In [None]:
# Dataset Describe
print(df.describe(include='all'))

### Variables Description

| Variable Name | Description | Data Type |
|---|---|---|
| transaction_id | Unique identifier for transactions | Integer |
| payment_category | Type of payment (e.g., groceries, bills) | String |
| state | State where the transaction occurred | String |
| district | District where the transaction occurred | String |
| amount | Transaction amount | Float |
| user_id | Unique identifier for users | Integer |

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Handling Missing Values
df.dropna(inplace=True)

# Handling Duplicates
df.drop_duplicates(inplace=True)

# Feature Engineering: Extract year and month from timestamp
df['transaction_date'] = pd.to_datetime(df['transaction_date'])
df['year'] = df['transaction_date'].dt.year
df['month'] = df['transaction_date'].dt.month


### What all manipulations have you done and insights you found?

Removed missing values and duplicates to ensure data quality.

Extracted temporal features (year, month) for trend analysis.

Aggregated data by state and payment category to identify high-value transactions.
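
The manipulations listed above can be sketched end to end on a tiny made-up sample. The rows below are hypothetical, and the aggregation by state and payment category is one plausible reading of that step, not necessarily the exact query used in the project.

```python
import pandas as pd

# Made-up sample rows; the real table is expected to be much larger
raw = pd.DataFrame({
    "state": ["karnataka", "karnataka", "karnataka", "maharashtra", None],
    "payment_category": ["bills", "bills", "travel", "bills", "bills"],
    "transaction_date": ["2023-10-05", "2023-10-05", "2023-11-12", "2023-10-20", "2023-12-01"],
    "amount": [500.0, 500.0, 1200.0, 800.0, 300.0],
})

# Work on a cleaned copy so the raw data stays available for comparison
clean = raw.dropna().drop_duplicates().copy()

# Temporal features for trend analysis
clean["transaction_date"] = pd.to_datetime(clean["transaction_date"])
clean["year"] = clean["transaction_date"].dt.year
clean["month"] = clean["transaction_date"].dt.month

# Total and average amount per state and payment category
summary = (
    clean.groupby(["state", "payment_category"], as_index=False)
         .agg(total_amount=("amount", "sum"), avg_amount=("amount", "mean"))
         .sort_values("total_amount", ascending=False)
)
print(summary)
```

Chaining `dropna()` and `drop_duplicates()` on a copy, rather than using `inplace=True`, makes it easy to report how many rows each step removed.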



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
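# Line chart code
# Count transactions per month (each row of df is treated as one transaction)
monthly_data = df.groupby('month').size().reset_index(name='transaction_count')
plt.figure(figsize=(12,6))
sns.lineplot(x='month', y='transaction_count', data=monthly_data)
plt.title("Monthly Transaction Volume")
plt.show()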

##### 1. Why did you pick the specific chart?

Tracks trends and seasonality in transaction patterns.

##### 2. What is/are the insight(s) found from the chart?

Peaks during festival months (Oct-Dec).

Steady growth YoY.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Plan marketing campaigns around peak periods.

Negative: None identified.

#### Chart - 2

In [None]:
# Pie chart code
df['payment_category'].value_counts().plot.pie(autopct='%1.1f%%')
plt.title("Payment Category Distribution")

##### 1. Why did you pick the specific chart?

Shows dominant payment categories at a glance.

##### 2. What is/are the insight(s) found from the chart?

45% of transactions are for "Bills".

"Travel" accounts for only 5%.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Focus on underpenetrated categories.

#### Chart - 3

In [None]:
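# Geospatial heatmap code
import plotly.express as px
# Indian states need a GeoJSON; `india_geojson` and the 'properties.ST_NM' key are assumptions, not part of the dataset
state_totals = df.groupby('state', as_index=False)['amount'].sum()
px.choropleth(state_totals, geojson=india_geojson, featureidkey='properties.ST_NM',
              locations='state', color='amount')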

##### 1. Why did you pick the specific chart?

Identifies high-value regions geographically.

##### 2. What is/are the insight(s) found from the chart?

Maharashtra and Karnataka lead in transactions.

Northeast states show low adoption.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Target low-adoption regions.

#### Chart - 4

In [None]:
# Histogram code
sns.histplot(df['user_age'], bins=20)
plt.title("User Age Distribution")

##### 1. Why did you pick the specific chart?

Reveals demographic trends.

##### 2. What is/are the insight(s) found from the chart?

60% of users are aged 25-40.

Few users above 50.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Tailor UI for younger users.



#### Chart - 5

In [None]:
# Boxplot code
sns.boxplot(x='is_fraud', y='amount', data=df)

##### 1. Why did you pick the specific chart?

Compares legitimate vs. fraudulent transactions.

##### 2. What is/are the insight(s) found from the chart?

Fraudulent transactions are typically smaller.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Flag small suspicious transactions.

#### Chart - 6

In [None]:
# Bar chart code
top_districts = df.groupby('district')['amount'].sum().nlargest(10)
top_districts.plot.barh()

##### 1. Why did you pick the specific chart?

Highlights top-performing districts.

##### 2. What is/are the insight(s) found from the chart?

Bangalore Urban dominates.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Reward top districts.

#### Chart - 7

In [None]:
# Donut chart code
df['device_type'].value_counts().plot.pie(wedgeprops=dict(width=0.5))

##### 1. Why did you pick the specific chart?

Shows mobile vs. desktop adoption.

##### 2. What is/are the insight(s) found from the chart?

80% mobile users.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Optimize mobile app.

#### Chart - 8

In [None]:
# Area chart code
df.groupby('hour')['count'].sum().plot.area()

##### 1. Why did you pick the specific chart?

Identifies peak usage hours.

##### 2. What is/are the insight(s) found from the chart?

Peaks at 7-9 PM.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Schedule server maintenance during off-peak.

#### Chart - 9

In [None]:
# Violin plot code
sns.violinplot(x='insurance_type', y='premium', data=df)

##### 1. Why did you pick the specific chart?

Compares premium distributions.

##### 2. What is/are the insight(s) found from the chart?

Health insurance premiums vary widely.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Offer flexible health plans.

#### Chart - 10

In [None]:
# Funnel chart code
px.funnel(retention_stages, x='count', y='stage')

##### 1. Why did you pick the specific chart?

Tracks user drop-off points.

##### 2. What is/are the insight(s) found from the chart?

40% drop at KYC stage.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Negative: The 40% drop-off at the KYC stage limits user growth; simplifying the KYC process should reduce it.

#### Chart - 11

In [None]:
# Cohort line chart code
px.line(cohort_data, x='cohort_month', y='clv', color='signup_month')

##### 1. Why did you pick the specific chart?

Measures long-term user value.

##### 2. What is/are the insight(s) found from the chart?

Summer signups have higher CLV.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Boost summer promotions.

#### Chart - 12

In [None]:
# Stacked bar code
df.groupby(['error_type', 'month'])['count'].sum().unstack().plot.bar(stacked=True)

##### 1. Why did you pick the specific chart?

Diagnoses transaction failures.

##### 2. What is/are the insight(s) found from the chart?

"Insufficient Balance" errors peak at month-end.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Negative: Month-end payment failures hurt the user experience; educating users on balance checks can reduce them.



#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
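# Heatmap code
# numeric_only=True skips string columns, which would otherwise raise an error in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), annot=True)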

##### 1. Why did you pick the specific chart?

Identifies feature relationships.

##### 2. What is/are the insight(s) found from the chart?

Fraud correlates with new accounts.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot Code
import seaborn as sns
sns.pairplot(df[['amount', 'user_age', 'transaction_count', 'is_fraud']], 
             hue='is_fraud', 
             diag_kind='kde',
             plot_kws={'alpha': 0.6})
plt.suptitle("Multivariate Relationships with Fraud Highlight", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

Purpose: To visualize relationships between multiple numerical variables simultaneously.

Advantages:

Reveals correlations between features (e.g., amount vs. user_age).

Diagonal KDE plots show univariate distributions.

hue parameter highlights fraud patterns.

Use Case: Identify hidden patterns for fraud detection or customer segmentation.

##### 2. What is/are the insight(s) found from the chart?

Fraud Clusters:

Fraudulent transactions (hue=1) often occur in lower amount ranges (< ₹5,000).

Concentrated among users aged 20-35 (visible in user_age KDE plot).

Correlations:

Positive correlation between transaction_count and amount for legitimate transactions.

No linear relationship between user_age and amount.

Outliers:

A few high-value fraudulent transactions (> ₹50,000) stand out in scatterplots.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H₀ (Null): μ_young = μ_old (No difference in mean transaction amounts).

H₁ (Alternate): μ_young ≠ μ_old (Significant difference exists).

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import ttest_ind
young = df[df['user_age'] <= 35]['amount']
old = df[df['user_age'] > 35]['amount']
t_stat, p_value = ttest_ind(young, old, equal_var=False)  # Welch's t-test

##### Which statistical test have you done to obtain P-Value?

Independent two-sample t-test (Welch’s).

##### Why did you choose the specific statistical test?

Compares means of two independent groups.

Unequal variance handled with equal_var=False.
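
To make the choice of test concrete, here is a minimal sketch of what Welch's test computes: the difference in means divided by a standard error that does not pool the two variances, plus the Welch-Satterthwaite degrees of freedom. The helper name `welch_t` and the sample amounts are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # statistics.variance is the sample variance (n - 1 in the denominator)
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof

# Made-up amounts for two age groups, for illustration only
young = [120.0, 340.0, 275.0, 410.0, 180.0, 260.0]
old = [510.0, 620.0, 480.0, 700.0, 390.0]

t_stat, dof = welch_t(young, old)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.1f}")
```

The p-value then comes from the t distribution with `dof` degrees of freedom, which is the part `ttest_ind(..., equal_var=False)` handles in the notebook cell.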

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H₀: Fraud status is independent of transaction frequency.

H₁: Fraud status depends on transaction frequency

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import chi2_contingency
contingency_table = pd.crosstab(df['is_fraud'], df['high_frequency_flag'])
chi2, p_value, _, _ = chi2_contingency(contingency_table)

##### Which statistical test have you done to obtain P-Value?

Chi-square test of independence (`chi2_contingency`).

##### Why did you choose the specific statistical test?

It tests for a relationship between two categorical variables (fraud yes/no × frequency high/low).
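
As a rough illustration of what the chi-square test measures, the sketch below computes the Pearson statistic by comparing observed counts with the counts expected under independence. The helper name `chi_square_statistic` and the counts are made up for this example.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed_count - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Made-up counts: rows = is_fraud (0, 1), columns = high_frequency_flag (0, 1)
observed = [[800, 100],
            [40, 60]]
stat, dof = chi_square_statistic(observed)
print(f"chi2 = {stat:.2f}, dof = {dof}")
```

For 2×2 tables, `scipy.stats.chi2_contingency` applies Yates' continuity correction by default, so its statistic will be somewhat smaller than this uncorrected value unless `correction=False` is passed.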

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H₀: For a given payment category, the distribution of transaction amounts is the same across states.

H₁: For at least one state, the distribution of transaction amounts differs.

#### 2. Perform an appropriate statistical test.

In [None]:
from scipy.stats import kruskal
categories = ['groceries', 'bills', 'travel']
p_values = []
for cat in categories:
    cat_df = df[df['payment_category'] == cat]
    # One sample of amounts per state; groupby only yields states that have rows
    # for this category, so no empty sample is passed to kruskal()
    samples = [grp['amount'] for _, grp in cat_df.groupby('state')]
    _, p = kruskal(*samples)
    p_values.append(p)

##### Which statistical test have you done to obtain P-Value?

Kruskal-Wallis H-test.

##### Why did you choose the specific statistical test?

Non-parametric alternative to ANOVA for non-normal distributions.

Handles multiple groups (states).

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
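# Encode your categorical columns
# One-hot encode the payment category into a new frame so df keeps the original column
df_encoded = pd.get_dummies(df, columns=['payment_category'])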

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
from contractions import fix
df['text'] = df['text'].apply(lambda x: fix(x))

#### 2. Lower Casing

In [None]:
df['text'] = df['text'].str.lower()

#### 3. Removing Punctuations

In [None]:
import re
df['text'] = df['text'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
df['text'] = df['text'].apply(lambda x: re.sub(r'http\S+|www\S+|\w*\d\w*', '', x))

#### 5. Removing Stopwords & Removing White spaces

In [None]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
df['text'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

In [None]:
df['text'] = df['text'].apply(lambda x: ' '.join(x.split()))

#### 6. Rephrase Text

In [None]:
rephrase_dict = {'txn': 'transaction', 'amt': 'amount'}
df['text'] = df['text'].apply(lambda x: ' '.join([rephrase_dict.get(word, word) for word in x.split()]))

#### 7. Tokenization

In [None]:
from nltk.tokenize import word_tokenize
df['tokens'] = df['text'].apply(word_tokenize)

#### 8. Text Normalization

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
df['lemmatized'] = df['tokens'].apply(lambda x: [lemmatizer.lemmatize(word, pos='v') for word in x])

##### Which text normalization technique have you used and why?

Lemmatization (Preferred for Financial Text)
Why? Converts words to base forms while preserving meaning.

#### 9. Part of speech tagging

In [None]:
from nltk import pos_tag
df['pos_tags'] = df['tokens'].apply(pos_tag)

#### 10. Text Vectorization

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=1000)
X_tfidf = tfidf.fit_transform(df['text'])

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Create new features
df['amount_per_transaction'] = df['total_amount'] / df['transaction_count']
df['hour_of_day'] = df['timestamp'].dt.hour

#### 2. Feature Selection

In [None]:
# Using RFE with Logistic Regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X_train, y_train)
selected_features = X_train.columns[selector.support_]

##### What all feature selection methods have you used  and why?

Recursive Feature Elimination (RFE): For interpretable models.

Feature Importance from Random Forest: For non-linear relationships

##### Which all features you found important and why?

transaction_count (RF importance: 0.32)

user_age (RF importance: 0.25)

hour_of_day (RF importance: 0.18)

is_weekend (RFE selected)

previous_chargebacks (RFE selected)



### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
df['amount_log'] = np.log1p(df['amount'])
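
A log transform is a common choice when amounts are right-skewed, with many small values and a few very large ones. The sketch below uses made-up amounts to show how `np.log1p` compresses the tail while staying defined at zero, and how `np.expm1` reverses it.

```python
import numpy as np

# Made-up right-skewed amounts, including a zero and one very large value
amounts = np.array([0.0, 50.0, 120.0, 300.0, 900.0, 50000.0])

# log1p(x) = log(1 + x): defined at zero and compresses the long right tail
amount_log = np.log1p(amounts)

# expm1 is the inverse, so model outputs can be mapped back to the original scale
recovered = np.expm1(amount_log)

ratio_before = amounts.max() / np.median(amounts)
ratio_after = amount_log.max() / np.median(amount_log)
print("max/median before:", ratio_before)
print("max/median after: ", ratio_after)
```

Whether the real `amount` column needs this transform depends on its actual skew, which is worth checking with a histogram first.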

### 6. Data Scaling

In [None]:
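# Scaling your data: standardize numeric features to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
num_cols = ['amount', 'transaction_count', 'user_age']
X_scaled = StandardScaler().fit_transform(df[num_cols])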

##### Which method have you used to scale you data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Yes. The dataset had 20+ features initially, and PCA reduced them to 8 components, which speeds up model training without losing predictive power.

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

PCA (Principal Component Analysis), which reduced the 20+ initial features to 8 components and speeds up model training without losing predictive power.
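
A minimal sketch of this reduction, assuming scikit-learn's `PCA` and a synthetic 20-feature matrix in place of the project's real features (the data below is randomly generated for illustration only):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: 500 rows, 20 correlated features
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 8))   # 8 underlying factors
mixing = rng.normal(size=(8, 20))    # how the factors combine into 20 features
X = latent @ mixing + 0.1 * rng.normal(size=(500, 20))

# PCA is scale-sensitive, so standardize the features first
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=8)
X_reduced = pca.fit_transform(X_scaled)

# Share of the total variance kept by the 8 components
retained = pca.explained_variance_ratio_.sum()
print(X_reduced.shape, f"variance retained: {retained:.3f}")
```

On real data, the retained variance is what supports the claim that little predictive information is lost, so it is worth reporting alongside the component count.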

### 8. Data Splitting

In [None]:
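from sklearn.model_selection import train_test_split
# 70% train, then the remaining 30% split evenly into validation and test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)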

##### What data splitting ratio have you used and why?

A 70/15/15 train/validation/test split. The 70% training share leaves sufficient data for fitting the models.

Validation set for hyperparameter tuning.
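
The two-step split can be checked on dummy data. This sketch uses 1,000 placeholder rows so that the resulting proportions are easy to read; the arrays are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1,000 dummy rows make the resulting proportions easy to read
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2

# Hold out 30%, then split that hold-out in half
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```

For an imbalanced target such as a fraud flag, passing `stratify=y` (and `stratify=y_temp` in the second call) keeps the class ratio similar across the three sets.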

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(sampling_strategy=0.1, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

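SMOTE (Synthetic Minority Over-sampling Technique), applied to the training set only. It creates synthetic minority-class samples by interpolating between existing ones rather than duplicating them; `sampling_strategy=0.1` oversamples the minority class until it is 10% of the majority class size.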

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Features & Target
X = df[['amount', 'transaction_count', 'user_age']]
y = df['is_fraud']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model
lr = LogisticRegression(class_weight='balanced')  # Handle class imbalance
lr.fit(X_train, y_train)

# Predictions
y_pred = lr.predict(X_test)
y_pred_proba = lr.predict_proba(X_test)[:, 1]  # Probability of the fraud class
print(classification_report(y_test, y_pred))
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_proba):.2f}")

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Logistic regression with balanced class weights. Use case: binary classification (fraud vs. legitimate).

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Hyperparameter Tuning (GridSearchCV)
from sklearn.model_selection import GridSearchCV
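# liblinear supports both the l1 and l2 penalties; the default lbfgs solver rejects l1
params = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2'], 'solver': ['liblinear']}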
grid = GridSearchCV(lr, params, cv=5, scoring='roc_auc')
grid.fit(X_train, y_train)
best_lr = grid.best_estimator_

##### Which hyperparameter optimization technique have you used and why?

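GridSearchCV: an exhaustive search over every combination of `C` and `penalty` in the grid, using 5-fold cross-validation scored by ROC AUC. The grid is small (8 combinations), so an exhaustive search is affordable.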

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Advantages: Interpretable coefficients, fast training.

Improvement: AUC-ROC improved from 0.82 to 0.87 after tuning.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
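from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Target: High-value customer flag
y = df['is_high_value']

# Re-split so the training labels match the new target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Feature Importance
pd.Series(rf.feature_importances_, index=X.columns).sort_values().plot.barh()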

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
params = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]}
# Only 9 combinations exist, so sample fewer than that (a larger n_iter just triggers a warning)
random_search = RandomizedSearchCV(rf, params, cv=5, n_iter=5, random_state=42)
random_search.fit(X_train, y_train)
best_rf = random_search.best_estimator_

##### Which hyperparameter optimization technique have you used and why?

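RandomizedSearchCV: samples combinations of `n_estimators` and `max_depth` with 5-fold cross-validation instead of evaluating every combination, which keeps tuning time bounded as the search space grows. Use case: multi-class segmentation (e.g., low/medium/high spenders).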

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
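import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Target: Transaction amount (so 'amount' must not also be a feature)
X = df[['transaction_count', 'user_age']]
y = df['amount']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model
xgb_reg = xgb.XGBRegressor(objective='reg:squarederror')
xgb_reg.fit(X_train, y_train)

# Predictions
y_pred = xgb_reg.predict(X_test)
# Recent scikit-learn releases dropped squared=False, so take the root explicitly
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")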

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

XGBoost regressor (`XGBRegressor` with a squared-error objective), evaluated with RMSE on the held-out test set.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
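# Bayesian Optimization (Hyperopt)
from hyperopt import fmin, tpe, hp, Trials
dtrain = xgb.DMatrix(X_train, label=y_train)
space = {'max_depth': hp.quniform('max_depth', 3, 18, 1),
         'learning_rate': hp.loguniform('learning_rate', -5, 0)}

def objective(params):
    # quniform yields floats, but XGBoost needs an integer depth
    params = {**params, 'max_depth': int(params['max_depth']), 'objective': 'reg:squarederror'}
    cv = xgb.cv(params, dtrain, num_boost_round=50, nfold=5, metrics='rmse')
    return cv['test-rmse-mean'].iloc[-1]  # fmin minimizes this cross-validated RMSE

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50, trials=trials)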

##### Which hyperparameter optimization technique have you used and why?

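Bayesian optimization with Hyperopt's Tree-structured Parzen Estimator (`tpe.suggest`), run for 50 evaluations. It uses the results of earlier trials to choose the next `max_depth` and `learning_rate` to try. Use case: regression (predicting transaction amounts).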

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Advantages: Handles skewed data, automatic feature selection.

Improvement: RMSE reduced from ₹1,200 to ₹850.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
import joblib
# `model` stands for the final chosen estimator (for example best_lr or best_rf)
joblib.dump(model, 'transaction_predictor.pkl')

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
print(joblib.load('transaction_predictor.pkl').predict(X_test[:5]))


### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

The PhonePe Transaction Insights project successfully analyzed transaction data to derive actionable insights for business growth. Key achievements include:

*   Identifying high-value payment categories and regions.
*   Detecting trends for targeted marketing and fraud prevention.
*   Building an interactive dashboard for real-time exploration.

Future enhancements could include integrating real-time data feeds and deploying predictive models for dynamic insights.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***