Problem Statement:

A travel company wants to predict whether a customer will purchase a travel package after receiving marketing communication. The dataset contains customer demographics, financial status, engagement activities, sales interaction details, and pitch satisfaction scores.
However, most customers do not convert, and following up with every lead consumes substantial time and money.
To target follow-ups better, we build a Random Forest classification model (and compare it with several other classifiers) that predicts:

Will the customer purchase the travel package? (ProdTaken = 1 or 0)

| Column Name                 | Description |
|-----------------------------|-------------|
| **CustomerID**              | Unique identifier assigned to each customer. Not used for modeling. |
| **ProdTaken**               | Target variable: 1 = customer purchased the travel package, 0 = did not purchase. |
| **Age**                     | Age of the customer. Contains some missing values. |
| **TypeofContact**           | How the customer was contacted (Company Invited / Self Enquiry). |
| **CityTier**                | City classification of the customer (1, 2, or 3). Indicates development level and lifestyle. |
| **DurationOfPitch**         | Duration (in minutes) of the sales pitch given to the customer. Contains missing values. |
| **Occupation**              | Customer's occupation (Salaried, Small Business, Large Business, Free Lancer). |
| **Gender**                  | Gender of the customer (Male / Female). The raw data also contains the variant spelling "Fe Male". |
| **NumberOfPersonVisiting**  | Number of people visiting along with the customer. |
| **NumberOfFollowups**       | Number of follow-up calls made to the customer. Contains missing values. |
| **ProductPitched**          | Type of travel package pitched (Basic, Standard, Deluxe, Super Deluxe, etc.). |
| **PreferredPropertyStar**   | Preferred hotel/property star rating (3, 4, or 5 star). |
| **MaritalStatus**           | Marital status of the customer (Married / Single / Divorced / Unmarried). |
| **NumberOfTrips**           | Number of previous trips taken by the customer. Indicates travel interest. |
| **Passport**                | Whether the customer has a passport (1 = Yes, 0 = No). |
| **PitchSatisfactionScore**  | Customer’s satisfaction rating of the pitch (1–5). |
| **OwnCar**                  | Whether the customer owns a car. |
| **NumberOfChildrenVisiting**| Number of children accompanying the customer during the session. |
| **Designation**             | Job designation or seniority level (Executive, Manager, AVP, etc.). |
| **MonthlyIncome**           | Customer's monthly income. Continuous numeric feature. |


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline
# display plots inline in the Jupyter notebook
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

In [2]:
ds = pd.read_csv(r"E:\MLP\Machine Learning\Ensemble Techniques\Travel.csv")
ds

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4883,204883,1,49.0,Self Enquiry,3,9.0,Small Business,Male,3,5.0,Deluxe,4.0,Unmarried,2.0,1,1,1,1.0,Manager,26576.0
4884,204884,1,28.0,Company Invited,1,31.0,Salaried,Male,4,5.0,Basic,3.0,Single,3.0,1,3,1,2.0,Executive,21212.0
4885,204885,1,52.0,Self Enquiry,3,17.0,Salaried,Female,4,4.0,Standard,4.0,Married,7.0,0,1,1,3.0,Senior Manager,31820.0
4886,204886,1,19.0,Self Enquiry,3,16.0,Small Business,Male,3,4.0,Basic,3.0,Single,3.0,0,5,0,2.0,Executive,20289.0


In [3]:
ds.sample(10)

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
359,200359,0,36.0,Company Invited,3,9.0,Small Business,Male,2,4.0,Standard,3.0,Divorced,5.0,0,2,0,1.0,Senior Manager,24699.0
2270,202270,1,28.0,Company Invited,1,6.0,Small Business,Male,2,4.0,Basic,4.0,Married,2.0,0,4,1,0.0,Executive,17596.0
4507,204507,0,37.0,Self Enquiry,1,7.0,Salaried,Female,3,4.0,Deluxe,4.0,Married,2.0,0,5,1,2.0,Manager,23906.0
4732,204732,0,32.0,Company Invited,3,27.0,Salaried,Male,4,2.0,Basic,3.0,Married,2.0,0,5,1,1.0,Executive,21469.0
3032,203032,0,51.0,Self Enquiry,1,9.0,Small Business,Male,4,4.0,Super Deluxe,3.0,Divorced,,0,2,0,2.0,AVP,36317.0
765,200765,0,38.0,Company Invited,1,9.0,Salaried,Male,2,3.0,Basic,3.0,Divorced,4.0,0,3,1,1.0,Executive,17821.0
1550,201550,0,30.0,Self Enquiry,1,6.0,Salaried,Male,2,4.0,Deluxe,3.0,Married,2.0,1,1,1,1.0,Manager,20126.0
1388,201388,0,32.0,Company Invited,1,21.0,Small Business,Female,2,3.0,Deluxe,3.0,Married,6.0,0,3,0,1.0,Manager,21667.0
225,200225,0,59.0,Self Enquiry,1,9.0,Salaried,Male,3,4.0,Basic,3.0,Divorced,4.0,0,3,0,1.0,Executive,17177.0
3365,203365,0,29.0,Company Invited,1,7.0,Small Business,Male,3,4.0,Basic,3.0,Single,2.0,1,4,0,1.0,Executive,20832.0


Preprocessing

In [4]:
ds = ds.drop(columns=["CustomerID"])

In [5]:
ds.shape

(4888, 19)

In [6]:
ds.dtypes

ProdTaken                     int64
Age                         float64
TypeofContact                object
CityTier                      int64
DurationOfPitch             float64
Occupation                   object
Gender                       object
NumberOfPersonVisiting        int64
NumberOfFollowups           float64
ProductPitched               object
PreferredPropertyStar       float64
MaritalStatus                object
NumberOfTrips               float64
Passport                      int64
PitchSatisfactionScore        int64
OwnCar                        int64
NumberOfChildrenVisiting    float64
Designation                  object
MonthlyIncome               float64
dtype: object

In [7]:
ds.isna().sum()

ProdTaken                     0
Age                         226
TypeofContact                25
CityTier                      0
DurationOfPitch             251
Occupation                    0
Gender                        0
NumberOfPersonVisiting        0
NumberOfFollowups            45
ProductPitched                0
PreferredPropertyStar        26
MaritalStatus                 0
NumberOfTrips               140
Passport                      0
PitchSatisfactionScore        0
OwnCar                        0
NumberOfChildrenVisiting     66
Designation                   0
MonthlyIncome               233
dtype: int64

In [8]:
ds = ds.fillna(ds.mean(numeric_only=True))      # numeric columns: fill with the column mean
ds = ds.fillna(ds.mode().iloc[0])               # remaining (categorical) columns: fill with the most frequent value

In [9]:
ds.isna().sum()

ProdTaken                   0
Age                         0
TypeofContact               0
CityTier                    0
DurationOfPitch             0
Occupation                  0
Gender                      0
NumberOfPersonVisiting      0
NumberOfFollowups           0
ProductPitched              0
PreferredPropertyStar       0
MaritalStatus               0
NumberOfTrips               0
Passport                    0
PitchSatisfactionScore      0
OwnCar                      0
NumberOfChildrenVisiting    0
Designation                 0
MonthlyIncome               0
dtype: int64
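
Cell 8 computes the mean and mode on all 4888 rows, so the rows later held out for testing contribute to the fill values. A minimal sketch of the alternative, learning the fill values from the training rows only, is shown here. The toy frame and its values are made up for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with hypothetical values; only the column names come from Travel.csv.
toy = pd.DataFrame({
    "Age": [41.0, 49.0, np.nan, 33.0, 37.0, 28.0, np.nan, 52.0],
    "MonthlyIncome": [20993.0, 20130.0, 17090.0, np.nan,
                      18468.0, 21212.0, 31820.0, 20289.0],
    "TypeofContact": ["Self Enquiry", "Company Invited", np.nan, "Self Enquiry",
                      "Self Enquiry", "Company Invited", "Self Enquiry", "Self Enquiry"],
})
num_cols = ["Age", "MonthlyIncome"]
cat_cols = ["TypeofContact"]

train, test = train_test_split(toy, test_size=0.25, random_state=0)

# Learn the fill values from the training rows only.
fill_values = {
    **train[num_cols].mean().to_dict(),          # column means for numeric features
    **train[cat_cols].mode().iloc[0].to_dict(),  # most frequent category
}

# Apply the same values to both splits, so no test-set statistic is used.
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(train_filled.isna().sum().sum(), test_filled.isna().sum().sum())
```

The same idea carries over to the real data by splitting first and filling afterwards; the reported scores would then no longer match the outputs in this notebook exactly.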

In [10]:
x = ds.drop("ProdTaken",axis=1)
y = ds["ProdTaken"]

In [11]:
x

Unnamed: 0,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,41.000000,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,49.000000,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,37.000000,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,33.000000,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,37.622265,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4883,49.000000,Self Enquiry,3,9.0,Small Business,Male,3,5.0,Deluxe,4.0,Unmarried,2.0,1,1,1,1.0,Manager,26576.0
4884,28.000000,Company Invited,1,31.0,Salaried,Male,4,5.0,Basic,3.0,Single,3.0,1,3,1,2.0,Executive,21212.0
4885,52.000000,Self Enquiry,3,17.0,Salaried,Female,4,4.0,Standard,4.0,Married,7.0,0,1,1,3.0,Senior Manager,31820.0
4886,19.000000,Self Enquiry,3,16.0,Small Business,Male,3,4.0,Basic,3.0,Single,3.0,0,5,0,2.0,Executive,20289.0


In [12]:
y

0       1
1       0
2       1
3       0
4       0
       ..
4883    1
4884    1
4885    1
4886    1
4887    1
Name: ProdTaken, Length: 4888, dtype: int64

In [13]:
cat_cols = x.select_dtypes(include=["object"]).columns
num_cols = x.select_dtypes(exclude=["object"]).columns

In [14]:
cat_cols

Index(['TypeofContact', 'Occupation', 'Gender', 'ProductPitched',
       'MaritalStatus', 'Designation'],
      dtype='object')

In [15]:
num_cols

Index(['Age', 'CityTier', 'DurationOfPitch', 'NumberOfPersonVisiting',
       'NumberOfFollowups', 'PreferredPropertyStar', 'NumberOfTrips',
       'Passport', 'PitchSatisfactionScore', 'OwnCar',
       'NumberOfChildrenVisiting', 'MonthlyIncome'],
      dtype='object')

In [16]:
ds[cat_cols].nunique()

TypeofContact     2
Occupation        4
Gender            3
ProductPitched    5
MaritalStatus     4
Designation       5
dtype: int64
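
Gender reports three unique values although the data dictionary lists two, and the encoded feature names include `Gender_Fe Male`. That looks like a misspelling of `Female`, but it is an assumption that should be checked against the raw data. A small sketch of inspecting and merging the variant on a hypothetical sample:

```python
import pandas as pd

# Hypothetical sample mirroring the three Gender spellings seen in the data.
gender = pd.Series(["Male", "Female", "Fe Male", "Male", "Fe Male"], name="Gender")

print(gender.value_counts())   # inspect the raw categories first

# Assumption: "Fe Male" is a data-entry variant of "Female".
gender_clean = gender.replace({"Fe Male": "Female"})

print(gender_clean.nunique())
```

MaritalStatus likewise contains both `Single` and `Unmarried`; whether those are distinct groups is a modelling decision this sketch does not make.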

In [17]:
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(handle_unknown='ignore'), cat_cols)],
    remainder="passthrough"
)

X = ct.fit_transform(x)

In [18]:
# Recover the column names: one-hot columns come first, passthrough (numeric) columns follow.
encoded_cols = ct.named_transformers_["encoder"].get_feature_names_out(cat_cols)
all_feature_names = list(encoded_cols) + list(num_cols)

x = pd.DataFrame(X, columns=all_feature_names)

In [19]:
x

Unnamed: 0,TypeofContact_Company Invited,TypeofContact_Self Enquiry,Occupation_Free Lancer,Occupation_Large Business,Occupation_Salaried,Occupation_Small Business,Gender_Fe Male,Gender_Female,Gender_Male,ProductPitched_Basic,...,DurationOfPitch,NumberOfPersonVisiting,NumberOfFollowups,PreferredPropertyStar,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,MonthlyIncome
0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,6.0,3.0,3.0,3.0,1.0,1.0,2.0,1.0,0.0,20993.0
1,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,14.0,3.0,4.0,4.0,2.0,0.0,3.0,1.0,2.0,20130.0
2,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,8.0,3.0,4.0,3.0,7.0,1.0,3.0,0.0,0.0,17090.0
3,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,...,9.0,2.0,3.0,3.0,2.0,1.0,5.0,1.0,1.0,17909.0
4,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,...,8.0,2.0,3.0,4.0,1.0,0.0,5.0,1.0,0.0,18468.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4883,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,9.0,3.0,5.0,4.0,2.0,1.0,1.0,1.0,1.0,26576.0
4884,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,...,31.0,4.0,5.0,3.0,3.0,1.0,3.0,1.0,2.0,21212.0
4885,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,17.0,4.0,4.0,4.0,7.0,0.0,1.0,1.0,3.0,31820.0
4886,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,...,16.0,3.0,4.0,3.0,3.0,0.0,5.0,0.0,2.0,20289.0


Model Building

In [20]:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=15, shuffle=True
)

In [21]:
x_train.shape,x_test.shape,y_train.shape,y_test.shape

((3910, 35), (978, 35), (3910,), (978,))
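
The split above shuffles without stratifying, so the share of buyers can differ between the two sets (the classification reports show 730 of 3910 in training and 190 of 978 in test). Passing `stratify` keeps the class ratio the same in both. A sketch on synthetic labels with a similar imbalance; `X_demo` and `y_demo` are placeholder names:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the imbalance seen here (about 19% positives).
y_demo = np.array([1] * 190 + [0] * 810)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=15, stratify=y_demo
)

# Share of positives in each split; with stratify these stay close to the overall share.
print(y_tr.mean(), y_te.mean())
```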

In [42]:
model = RandomForestClassifier(
    n_estimators=100,        # number of trees
    random_state=42
)

model.fit(x_train, y_train)


In [44]:
y_pred_test = model.predict(x_test)
y_pred_train = model.predict(x_train)

In [45]:
print("Accuracy:", accuracy_score(y_test, y_pred_test))
print(classification_report(y_test, y_pred_test))

Accuracy: 0.9151329243353783
              precision    recall  f1-score   support

           0       0.91      0.99      0.95       788
           1       0.95      0.59      0.73       190

    accuracy                           0.92       978
   macro avg       0.93      0.79      0.84       978
weighted avg       0.92      0.92      0.91       978
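
Accuracy is high, but recall for class 1 is 0.59, meaning roughly four in ten actual buyers in the test set are missed. If missing a buyer costs more than an extra follow-up, one option is to lower the decision threshold on `predict_proba`. The sketch below uses synthetic imbalanced data, and the 0.3 threshold is an arbitrary illustration, not a recommended value.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the travel dataset.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=15, stratify=y_demo
)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Probability of the positive class (column order follows rf.classes_).
proba = rf.predict_proba(X_te)[:, 1]

recalls = {}
for threshold in (0.5, 0.3):   # 0.5 is roughly the default behaviour of predict()
    pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, pred)
    print(threshold,
          "precision:", round(precision_score(y_te, pred, zero_division=0), 2),
          "recall:", round(recalls[threshold], 2))
```

Lowering the threshold can only keep recall the same or raise it, usually at the cost of precision, so the choice depends on the relative cost of the two error types.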



In [46]:
print("Accuracy:", accuracy_score(y_train, y_pred_train))
print(classification_report(y_train, y_pred_train))

Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3180
           1       1.00      1.00      1.00       730

    accuracy                           1.00      3910
   macro avg       1.00      1.00      1.00      3910
weighted avg       1.00      1.00      1.00      3910
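
Training accuracy is 1.0 while test accuracy is about 0.915, a gap that is common for fully grown random forests and that a single split cannot quantify well. A sketch of estimating the gap with stratified 5-fold cross-validation follows; the data is synthetic and the fold count is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data standing in for the travel dataset.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
rf = RandomForestClassifier(n_estimators=100, random_state=42)

scores = cross_validate(
    rf, X_demo, y_demo, cv=cv,
    scoring=["accuracy", "recall"],
    return_train_score=True,      # also report the score on each training fold
)

for metric in ("accuracy", "recall"):
    print(metric,
          "train:", round(scores[f"train_{metric}"].mean(), 3),
          "validation:", round(scores[f"test_{metric}"].mean(), 3))
```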



Other Models

In [30]:
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(x_train, y_train)

y_pred_lr = log_reg.predict(x_test)
y_pred_train_lr = log_reg.predict(x_train)

print("Test Metrics")
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))

print("Train Metrics")
print("Logistic Regression Accuracy:", accuracy_score(y_train, y_pred_train_lr))
print(classification_report(y_train, y_pred_train_lr))

Test Metrics
Logistic Regression Accuracy: 0.8323108384458078
              precision    recall  f1-score   support

           0       0.85      0.96      0.90       788
           1       0.66      0.28      0.40       190

    accuracy                           0.83       978
   macro avg       0.75      0.62      0.65       978
weighted avg       0.81      0.83      0.80       978

Train Metrics
Logistic Regression Accuracy: 0.8419437340153453
              precision    recall  f1-score   support

           0       0.85      0.97      0.91      3180
           1       0.69      0.28      0.40       730

    accuracy                           0.84      3910
   macro avg       0.77      0.63      0.65      3910
weighted avg       0.82      0.84      0.81      3910
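
Because warnings are silenced at the top of the notebook, a convergence warning from `LogisticRegression` would not be visible here. With unscaled features such as MonthlyIncome, 200 iterations may not be enough. The sketch below, on synthetic data with one income-sized column, standardizes the features first and checks `n_iter_` against `max_iter`; it does not reproduce the scores above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_demo[:, 0] = X_demo[:, 0] * 5000 + 20000   # mimic an income-sized column

max_iter = 200
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=max_iter))
pipe.fit(X_demo, y_demo)

# n_iter_ holds the iterations actually used; hitting max_iter means the
# solver was stopped before it converged.
n_iter = int(pipe.named_steps["logisticregression"].n_iter_[0])
print("iterations used:", n_iter, "of", max_iter)
```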



In [31]:
dt = DecisionTreeClassifier(random_state=42)
dt.fit(x_train, y_train)

y_pred_dt = dt.predict(x_test)

print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print(classification_report(y_test, y_pred_dt))

Decision Tree Accuracy: 0.9100204498977505
              precision    recall  f1-score   support

           0       0.94      0.95      0.94       788
           1       0.78      0.74      0.76       190

    accuracy                           0.91       978
   macro avg       0.86      0.85      0.85       978
weighted avg       0.91      0.91      0.91       978



In [32]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)

y_pred_knn = knn.predict(x_test)

print("KNN Accuracy:", accuracy_score(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))

KNN Accuracy: 0.7822085889570553
              precision    recall  f1-score   support

           0       0.81      0.95      0.88       788
           1       0.31      0.09      0.14       190

    accuracy                           0.78       978
   macro avg       0.56      0.52      0.51       978
weighted avg       0.71      0.78      0.73       978
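
KNN relies on distances, so a feature measured in the tens of thousands (MonthlyIncome) can outweigh every 0/1 indicator column. That may explain part of the low class-1 recall, though this has not been verified on the travel data. The synthetic sketch below shows the effect in isolation: one small-scale informative feature and one large-scale uninformative feature.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 600
signal = rng.uniform(0, 1, size=n)          # small-scale, informative feature
noise = rng.normal(20000, 5000, size=n)     # income-sized, uninformative feature
X_demo = np.column_stack([signal, noise])
y_demo = (signal > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=15
)

# Same classifier, with and without standardizing the features first.
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=5)
).fit(X_tr, y_tr)

acc_raw = raw_knn.score(X_te, y_te)
acc_scaled = scaled_knn.score(X_te, y_te)
print("unscaled:", acc_raw, "scaled:", acc_scaled)
```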



In [34]:
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier()   # no random_state is set, so the score may differ slightly between runs
gb.fit(x_train, y_train)

y_pred_gb = gb.predict(x_test)

print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred_gb))
print(classification_report(y_test, y_pred_gb))


Gradient Boosting Accuracy: 0.8619631901840491
              precision    recall  f1-score   support

           0       0.87      0.97      0.92       788
           1       0.78      0.40      0.53       190

    accuracy                           0.86       978
   macro avg       0.83      0.69      0.72       978
weighted avg       0.85      0.86      0.84       978
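
Taken together, the reports show that the ranking depends on the metric: Random Forest has the highest test accuracy (0.915), while the Decision Tree has the highest class-1 recall (0.74 versus 0.59). Collecting the metrics in one table makes such trade-offs easier to see. A sketch on synthetic data; the model selection and metric columns are illustrative and can be extended.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the travel dataset.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=15, stratify=y_demo
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

rows = []
for name, clf in models.items():
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, pred),
        "recall_1": recall_score(y_te, pred),   # recall for the positive class
        "f1_1": f1_score(y_te, pred),           # F1 for the positive class
    })

summary = pd.DataFrame(rows).set_index("model")
print(summary.round(3))
```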

