# 🚀 GST Analytics Hackathon Project

Welcome to my project submission for the **GST Analytics Hackathon**! This project builds a machine learning model for a binary classification problem: assigning entities within the GST system to one of two classes, using a large dataset provided by the GST Analytics team.

## 📁 Project Overview

- **Objective**: To create a supervised learning model that predicts whether a specific entity is classified as "0" or "1" based on various features.
- **Dataset**: 9 lakh (900,000) records with 23 attributes, split across training and testing sets.
- **Methodology**: The project involves data preprocessing, feature engineering, model training, and evaluation using various performance metrics.
- **Tools & Technologies**: Python, scikit-learn, XGBoost, Pandas, Optuna and more.

## 🔍 Key Evaluation Metrics

- **Accuracy**
- **Precision**
- **Recall**
- **F1 Score**
- **AUC-ROC Curve**
- **Confusion Matrix**
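
To make the relationship between these metrics concrete, here is a minimal sketch in plain Python that derives accuracy, precision, recall and F1 from the four confusion-matrix counts. The function name `binary_metrics` and the toy labels are hypothetical and not part of the submission. AUC-ROC is left out because it needs predicted scores rather than hard 0/1 labels.

```python
def binary_metrics(y_true, y_pred):
    """Derive the confusion-matrix counts and the metrics built on them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, made up purely for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In practice, `sklearn.metrics` provides `accuracy_score`, `precision_score`, `recall_score`, `f1_score`, `roc_auc_score` and `confusion_matrix`, which would normally be preferred over hand-rolled code.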

## 🌟 Why This Project?

This project not only showcases my technical skills in data science but also contributes to the development of solutions that can enhance the efficiency of the GST system in India.

Let's dive in! 💡


## Importing libraries, data and getting started

In [39]:
import pandas as pd
import plotly.express as px

In [51]:
xtrain = pd.read_csv('data/X_Train_Data_Input.csv')
xtest  = pd.read_csv('data/X_Test_Data_Input.csv')
ytrain = pd.read_csv('data/Y_Train_Data_Target.csv')
ytest  = pd.read_csv('data/Y_Test_Data_Target.csv')

In [52]:
round((xtrain.isna().sum()/xtrain.shape[0]) * 100, 2) # Finding % of null values by column for X train

ID           0.00
Column0      0.00
Column1      0.00
Column2      0.00
Column3     16.09
Column4     16.27
Column5     21.29
Column6      0.49
Column7      0.00
Column8      0.49
Column9     93.25
Column10     0.00
Column11     0.00
Column12     0.00
Column13     0.00
Column14    46.58
Column15     2.10
Column16     0.00
Column17     0.00
Column18     0.00
Column19     0.00
Column20     0.00
Column21     0.00
dtype: float64

In [None]:
missing_perc_xtrain = round((xtrain.isna().sum().sum()/xtrain.size) * 100, 2) # Overall % of missing data in X train
print(f'Overall percentage of missing data in X train: {missing_perc_xtrain}%')

Overall percentage of missing data in X train: 8.55%


In [91]:
# Converting to a dictionary with the format = Column : % null
# (named missing_xtrain so the built-in `dict` is not shadowed)
missing_xtrain = ((xtrain.isna().sum()/xtrain.shape[0]) * 100).to_dict()

missing_xtrain_df = pd.DataFrame(data={
    'Column': list(missing_xtrain.keys()),
    'Missing': list(missing_xtrain.values())
}) # Converting dictionary to dataframe for visualization; the default index already runs 0..22

missing_xtrain_df.head()

    Column    Missing
0       ID   0.000000
1  Column0   0.001146
2  Column1   0.000000
3  Column2   0.000000
4  Column3  16.086829
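
The dictionary round trip above is not strictly necessary. As a sketch of one alternative, the helper below goes straight from a DataFrame to the same `Column` / `Missing` layout. The name `missing_summary` and the toy frame are hypothetical, used only to show the idea on data small enough to check by hand.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its percentage of missing values."""
    return (
        df.isna()
          .mean()                        # fraction of nulls per column
          .mul(100)                      # convert to a percentage
          .rename_axis("Column")         # name the index so it becomes a column
          .reset_index(name="Missing")   # Series -> two-column DataFrame
    )


# Toy frame, made up for illustration (not the hackathon data)
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Column0": [0.5, np.nan, 0.1, 0.2],
    "Column1": [np.nan, np.nan, np.nan, 7.0],
})
summary = missing_summary(toy)
print(summary)
```

Because `isna().mean()` divides by the frame's own row count, the same helper can be reused for the test set without having to change a denominator.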


In [126]:
fig = px.bar(data_frame = missing_xtrain_df,
             x = 'Column',   # column names can be passed directly when data_frame is given
             y = 'Missing',
             height = 500,
             color = 'Column',
             title = '% missing data by column for X Train')

fig.update_traces(marker_line_color = 'black',
                  marker_line_width = 1)
fig.update_layout(xaxis_title = 'Column names',
                  yaxis_title = 'Missing Percentage')

fig.show()

In [None]:
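# Divide by xtest.shape[0]; the output below was computed with xtrain.shape[0], so it understates each value about threefold (Column3 is 16.14% of X test rows)
round((xtest.isna().sum()/xtest.shape[0]) * 100, 2) # Finding % of null values by column for X test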

ID           0.00
Column0      0.00
Column1      0.00
Column2      0.00
Column3      5.38
Column4      5.44
Column5      7.09
Column6      0.16
Column7      0.00
Column8      0.16
Column9     31.06
Column10     0.00
Column11     0.00
Column12     0.00
Column13     0.00
Column14    15.50
Column15     0.70
Column16     0.00
Column17     0.00
Column18     0.00
Column19     0.00
Column20     0.00
Column21     0.00
dtype: float64

In [None]:
missing_perc_xtest = round((xtest.isna().sum().sum()/xtest.size) * 100, 2) # Overall % of missing data in X test
print(f'Overall percentage of missing data in X test: {missing_perc_xtest}%')

Overall percentage of missing data in X test: 8.54%
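
Since the overall missing rates of the two sets are nearly identical, it can be useful to compare them column by column as well. The sketch below puts both percentages side by side, with each frame divided by its own row count so the numbers stay comparable. The function name `compare_missing` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd


def compare_missing(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing percentages for two frames, side by side."""
    return pd.DataFrame({
        # .mean() on the boolean mask divides by each frame's own number of rows
        "train_pct": train.isna().mean() * 100,
        "test_pct": test.isna().mean() * 100,
    }).round(2)


# Toy frames of different lengths, made up for illustration
toy_train = pd.DataFrame({"a": [1, np.nan, 3, 4], "b": [np.nan, np.nan, 1, 2]})
toy_test = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})

comparison = compare_missing(toy_train, toy_test)
print(comparison)
```

With the real data this would be called as `compare_missing(xtrain, xtest)`. A large gap between the two columns for any feature would suggest the sets were not drawn the same way.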


In [112]:
# Converting to a dictionary with the format = Column : % null
# (named missing_xtest so the built-in `dict` is not shadowed)
missing_xtest = ((xtest.isna().sum()/xtest.shape[0]) * 100).to_dict()

missing_xtest_df = pd.DataFrame(data={
    'Column': list(missing_xtest.keys()),
    'Missing': list(missing_xtest.values())
}) # Converting dictionary to dataframe for visualization; the default index already runs 0..22

missing_xtest_df.head()

    Column    Missing
0       ID   0.000000
1  Column0   0.000764
2  Column1   0.000000
3  Column2   0.000000
4  Column3  16.137586


In [116]:
fig = px.bar(data_frame = missing_xtest_df,
             x = 'Column',   # column names can be passed directly when data_frame is given
             y = 'Missing',
             height = 500,
             color = 'Column',
             title = '% missing data by column for X Test')

fig.update_traces(marker_line_color = 'black',
                  marker_line_width = 1)
fig.update_layout(xaxis_title = 'Column names',
                  yaxis_title = 'Missing Percentage')

fig.show()

In [None]:
ytrain.isna().sum()/ytrain.shape[0] * 100 # % missing values by column in Y train (divide by rows, not by total cell count)

ID        0.0
target    0.0
dtype: float64

In [None]:
ytest.isna().sum()/ytest.shape[0] * 100 # % missing values by column in Y test (divide by rows, not by total cell count)

ID        0.0
target    0.0
dtype: float64