# Dataset Overview: Trips & Travel Customer Data

## 1. Context

"Trips & Travel.Com" is aiming to expand its customer base by introducing a **new Wellness Tourism Package**.  
Wellness Tourism focuses on travel that allows individuals to **maintain, enhance, or kick-start a healthy lifestyle**, boosting overall well-being.

Previously, the company offered **five types of packages**:  
**Basic, Standard, Deluxe, Super Deluxe, and King**.  

Last year, only **18% of customers purchased packages**, and marketing costs were high because outreach was done **randomly**, without leveraging available customer information.

The company now wants to **use historical customer data** to identify potential customers more efficiently, reduce marketing costs, and maximize conversions for the new wellness package.

---

## 2. Dataset Content

The dataset contains historical and demographic information about customers, including:

- **Designation** – Job role of the customer (e.g., Executive, Manager)  
- **Passport Status** – Whether the customer has a passport  
- **Tier City** – Customer's city tier (1, 2, 3)  
- **Marital Status** – Single, Married, Divorced, or Unmarried  
- **Occupation** – Nature of the customer’s work/business  
- **Monthly Income** – Customer's earnings  
- **Age** – Customer’s age  
- **Package Purchased** – Whether the customer purchased a package  

### Key Observations from EDA

- **Target Customers:** Executives are more likely to purchase packages.  
- Customers who **own a passport**, live in **tier 3 cities**, and are **single/unmarried** have higher interest.  
- Customers with **large businesses**, monthly income in the **15,000–25,000** range, and age **15–30** tend to prefer **5-star properties** and have higher chances of purchasing the wellness package.
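
Observations like these usually come from comparing the purchase rate across groups. The sketch below shows one way to do that with `groupby`; the `toy` frame and its values are made up for illustration, and only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical toy data; column names follow Travel.csv, values are made up.
toy = pd.DataFrame({
    "Designation": ["Executive", "Executive", "Manager", "Manager", "VP", "Executive"],
    "Passport":    [1, 0, 1, 0, 0, 1],
    "ProdTaken":   [1, 0, 1, 0, 0, 1],
})

# The mean of a 0/1 target within a group is that group's purchase rate.
rate_by_designation = toy.groupby("Designation")["ProdTaken"].mean()
rate_by_passport = toy.groupby("Passport")["ProdTaken"].mean()

print(rate_by_designation.sort_values(ascending=False))
print(rate_by_passport)
```

On the real data, the same two lines applied to `df` would give the per-segment conversion rates behind the bullets above.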

---

## 3. Objective

- **Predict** which customers are most likely to purchase the new Wellness Tourism Package.  
- **Identify** the most significant features influencing purchase behavior.  
- **Recommend** customer segments to target for marketing to maximize ROI.


In [29]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [30]:
df = pd.read_csv('/content/Travel.csv')

In [31]:
df.head()

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0


In [32]:
df.shape

(4888, 20)

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CustomerID                4888 non-null   int64  
 1   ProdTaken                 4888 non-null   int64  
 2   Age                       4662 non-null   float64
 3   TypeofContact             4863 non-null   object 
 4   CityTier                  4888 non-null   int64  
 5   DurationOfPitch           4637 non-null   float64
 6   Occupation                4888 non-null   object 
 7   Gender                    4888 non-null   object 
 8   NumberOfPersonVisiting    4888 non-null   int64  
 9   NumberOfFollowups         4843 non-null   float64
 10  ProductPitched            4888 non-null   object 
 11  PreferredPropertyStar     4862 non-null   float64
 12  MaritalStatus             4888 non-null   object 
 13  NumberOfTrips             4748 non-null   float64
 14  Passport

In [34]:
df.isnull().sum()

Unnamed: 0,0
CustomerID,0
ProdTaken,0
Age,226
TypeofContact,25
CityTier,0
DurationOfPitch,251
Occupation,0
Gender,0
NumberOfPersonVisiting,0
NumberOfFollowups,45


Data cleaning

* missing values
* duplicates
* data type
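
The three checks listed above can be sketched as follows. The `toy` frame is a hypothetical stand-in for `df`, with one deliberately repeated row so the duplicate check has something to find.

```python
import pandas as pd

# Hypothetical toy frame standing in for df; the last row repeats the first.
toy = pd.DataFrame({
    "CustomerID": [200000, 200001, 200000],
    "Age": [41.0, None, 41.0],
    "Gender": ["Female", "Male", "Female"],
})

n_missing = toy.isnull().sum()          # missing values per column
n_duplicates = toy.duplicated().sum()   # rows identical to an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_missing)
print("duplicate rows:", n_duplicates)
print(toy.dtypes)                       # data type of each column
```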


In [35]:
# Data cleaning

# Check the values of all categorical columns

In [36]:
for col in df.select_dtypes(include='object').columns:
    print(f"'{col}':")
    print(df[col].value_counts())
    print("\n")

'TypeofContact':
TypeofContact
Self Enquiry       3444
Company Invited    1419
Name: count, dtype: int64


'Occupation':
Occupation
Salaried          2368
Small Business    2084
Large Business     434
Free Lancer          2
Name: count, dtype: int64


'Gender':
Gender
Male       2916
Female     1817
Fe Male     155
Name: count, dtype: int64


'ProductPitched':
ProductPitched
Basic           1842
Deluxe          1732
Standard         742
Super Deluxe     342
King             230
Name: count, dtype: int64


'MaritalStatus':
MaritalStatus
Married      2340
Divorced      950
Single        916
Unmarried     682
Name: count, dtype: int64


'Designation':
Designation
Executive         1842
Manager           1732
Senior Manager     742
AVP                342
VP                 230
Name: count, dtype: int64




From the output above, two columns need cleaning: Gender contains the inconsistent label 'Fe Male', and MaritalStatus lists 'Single' and 'Unmarried' as separate categories.

In [37]:
df['Gender']=df['Gender'].replace('Fe Male',"Female")

In [38]:
df['Gender'].value_counts()

Unnamed: 0_level_0,count
Gender,Unnamed: 1_level_1
Male,2916
Female,1972


In [39]:
df['MaritalStatus']=df['MaritalStatus'].replace('Single',"Unmarried")

In [40]:
df['MaritalStatus'].value_counts()

Unnamed: 0_level_0,count
MaritalStatus,Unnamed: 1_level_1
Married,2340
Unmarried,1598
Divorced,950


In [41]:
# Calculate missing value percentages using isnull()
missing_percent = df.isnull().sum() / len(df) * 100

for col, pct in missing_percent.items():
  if pct!=0:
    print(f"'{col}': {pct:.2f}% ")


'Age': 4.62% 
'TypeofContact': 0.51% 
'DurationOfPitch': 5.14% 
'NumberOfFollowups': 0.92% 
'PreferredPropertyStar': 0.53% 
'NumberOfTrips': 2.86% 
'NumberOfChildrenVisiting': 1.35% 
'MonthlyIncome': 4.77% 


# Imputing Null values

* Impute Median for Age
* Impute Mode for TypeofContact
* Impute Median for DurationOfPitch
* Impute Mode for NumberOfFollowups, as it is a discrete feature
* Impute Mode for PreferredPropertyStar
* Impute Median for NumberOfTrips
* Impute Mode for NumberOfChildrenVisiting
* Impute Median for MonthlyIncome
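
The plan above computes medians and modes on the full dataset. A possible alternative, sketched here on made-up data, is to learn those statistics from the training rows only and reuse them on the test rows, so nothing from the test set influences the fill values. The `train` and `test` frames are hypothetical; `SimpleImputer` and `ColumnTransformer` are standard scikit-learn classes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy split; values are made up.
train = pd.DataFrame({
    "Age": [30.0, 40.0, np.nan, 50.0],
    "NumberOfFollowups": [3.0, 4.0, 4.0, np.nan],
})
test = pd.DataFrame({
    "Age": [np.nan, 25.0],
    "NumberOfFollowups": [np.nan, 3.0],
})

imputer = ColumnTransformer([
    ("median", SimpleImputer(strategy="median"), ["Age"]),
    ("mode", SimpleImputer(strategy="most_frequent"), ["NumberOfFollowups"]),
])

# Statistics are learned from the training rows, then reused on the test rows.
train_filled = imputer.fit_transform(train)
test_filled = imputer.transform(test)

print(test_filled)
```

The missing test values are filled with the training median of `Age` and the training mode of `NumberOfFollowups`, not with statistics from the test rows.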

In [42]:
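# Assign the result back to the column. Calling fillna(inplace=True) on a column
# pulled out of df is chained assignment and may not update df in newer pandas.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['TypeofContact'] = df['TypeofContact'].fillna(df['TypeofContact'].mode()[0])
df['DurationOfPitch'] = df['DurationOfPitch'].fillna(df['DurationOfPitch'].median())
df['NumberOfFollowups'] = df['NumberOfFollowups'].fillna(df['NumberOfFollowups'].mode()[0])
df['PreferredPropertyStar'] = df['PreferredPropertyStar'].fillna(df['PreferredPropertyStar'].mode()[0])
df['NumberOfTrips'] = df['NumberOfTrips'].fillna(df['NumberOfTrips'].median())
df['NumberOfChildrenVisiting'] = df['NumberOfChildrenVisiting'].fillna(df['NumberOfChildrenVisiting'].mode()[0])
df['MonthlyIncome'] = df['MonthlyIncome'].fillna(df['MonthlyIncome'].median())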

In [43]:
df.describe()

Unnamed: 0,CustomerID,ProdTaken,Age,CityTier,DurationOfPitch,NumberOfPersonVisiting,NumberOfFollowups,PreferredPropertyStar,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,MonthlyIncome
count,4888.0,4888.0,4888.0,4888.0,4888.0,4888.0,4888.0,4888.0,4888.0,4888.0,4888.0,4888.0,4888.0,4888.0
mean,202443.5,0.188216,37.547259,1.654255,15.36293,2.905074,3.711129,3.577946,3.229746,0.290917,3.078151,0.620295,1.184738,23559.179419
std,1411.188388,0.390925,9.104795,0.916583,8.316166,0.724891,0.998271,0.797005,1.822769,0.454232,1.365792,0.485363,0.852323,5257.862921
min,200000.0,0.0,18.0,1.0,5.0,1.0,1.0,3.0,1.0,0.0,1.0,0.0,0.0,1000.0
25%,201221.75,0.0,31.0,1.0,9.0,2.0,3.0,3.0,2.0,0.0,2.0,0.0,1.0,20485.0
50%,202443.5,0.0,36.0,1.0,13.0,3.0,4.0,3.0,3.0,0.0,3.0,1.0,1.0,22347.0
75%,203665.25,0.0,43.0,3.0,19.0,3.0,4.0,4.0,4.0,1.0,4.0,1.0,2.0,25424.75
max,204887.0,1.0,61.0,3.0,127.0,5.0,6.0,5.0,22.0,1.0,5.0,1.0,3.0,98678.0


In [44]:
df.isnull().sum()

Unnamed: 0,0
CustomerID,0
ProdTaken,0
Age,0
TypeofContact,0
CityTier,0
DurationOfPitch,0
Occupation,0
Gender,0
NumberOfPersonVisiting,0
NumberOfFollowups,0


In [45]:
df.drop('CustomerID', inplace=True, axis=1)

Feature Engineering

In [46]:
df.head()
# The number of visitors is split across two columns (NumberOfPersonVisiting and NumberOfChildrenVisiting); we can combine them into a single total column

Unnamed: 0,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Unmarried,1.0,1,2,1,0.0,Manager,20993.0
1,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Unmarried,7.0,1,3,0,0.0,Executive,17090.0
3,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,0,36.0,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0


In [47]:
df['TotalVisiting'] = df['NumberOfPersonVisiting'] + df['NumberOfChildrenVisiting']
df.drop(columns=['NumberOfPersonVisiting', 'NumberOfChildrenVisiting'], inplace=True)

In [48]:
# getting all the numeric features
num_features = [feature for feature in df.columns if df[feature].dtype != 'O']
print('Numerical Features :', len(num_features))

# categorical features
cat_features = [feature for feature in df.columns if df[feature].dtype == 'O']
print('Categorical Features :', len(cat_features))


# Discrete features
discrete_features=[feature for feature in num_features if len(df[feature].unique())<=25]
print('Discrete Features :',len(discrete_features))

# continuous features
continuous_features=[feature for feature in num_features if feature not in discrete_features]
print('Continuous Features :',len(continuous_features))

Numerical Features : 12
Categorical Features : 6
Discrete Features : 9
Continuous Features : 3


In [50]:
df['ProdTaken'].unique()

array([1, 0])

Train test split

In [51]:
from sklearn.model_selection import train_test_split

X = df.drop('ProdTaken', axis=1)
y = df['ProdTaken']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [52]:
X_train.shape,X_test.shape

((3910, 17), (978, 17))
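
Only about 19% of rows have `ProdTaken = 1`, so a plain random split can leave the train and test sets with slightly different positive rates. The sketch below uses the `stratify` argument of `train_test_split` to keep the class proportions equal in both parts. The arrays `X_toy` and `y_toy` are synthetic stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with a 19% positive rate, similar to ProdTaken.
rng = np.random.default_rng(0)
y_toy = np.array([1] * 190 + [0] * 810)
X_toy = rng.normal(size=(1000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```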

In [53]:
# Store the numerical and categorical column names in separate variables, then apply one-hot encoding and scaling with a ColumnTransformer

cat_features = X.select_dtypes(include="object").columns
num_features = X.select_dtypes(exclude="object").columns


In [54]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

scaler = StandardScaler()
ohe = OneHotEncoder()

preprocessor = ColumnTransformer([
    ('OneHot',ohe,cat_features),
    ('Scaler',scaler,num_features)
])


In [55]:
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

In [56]:
X_train

array([[ 0.        ,  1.        ,  0.        , ...,  0.78296635,
        -0.38224537, -0.77415132],
       [ 0.        ,  1.        ,  0.        , ...,  0.78296635,
        -0.4597992 ,  0.64361526],
       [ 0.        ,  1.        ,  0.        , ...,  0.78296635,
        -0.24519557, -0.06526803],
       ...,
       [ 1.        ,  0.        ,  0.        , ...,  0.78296635,
        -0.36057591,  0.64361526],
       [ 0.        ,  1.        ,  0.        , ...,  0.78296635,
        -0.25279888,  0.64361526],
       [ 1.        ,  0.        ,  0.        , ..., -1.2771941 ,
        -1.08251091, -1.48303461]])
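
A rare category such as 'Free Lancer' (2 rows) may end up only in the test split, in which case a default `OneHotEncoder` raises an error at transform time. The sketch below shows one way to guard against that with `handle_unknown="ignore"`, and wraps the preprocessing and the model in a single `Pipeline`. The toy rows and the names `pre` and `clf` are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows; "Free Lancer" appears only at prediction time.
X_fit = pd.DataFrame({
    "Occupation": ["Salaried", "Small Business", "Salaried", "Large Business"],
    "MonthlyIncome": [20993.0, 20130.0, 17909.0, 18468.0],
})
y_fit = [1, 0, 0, 1]
X_new = pd.DataFrame({"Occupation": ["Free Lancer"], "MonthlyIncome": [17090.0]})

pre = ColumnTransformer([
    # handle_unknown="ignore" encodes an unseen category as all zeros
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Occupation"]),
    ("scale", StandardScaler(), ["MonthlyIncome"]),
])
clf = Pipeline([("pre", pre), ("model", LogisticRegression())])
clf.fit(X_fit, y_fit)

encoded_new = clf.named_steps["pre"].transform(X_new)
prediction = clf.predict(X_new)

print(encoded_new)
print(prediction)
```

With a pipeline, `fit` and `predict` take the raw DataFrame, so the same preprocessing is applied consistently at training and prediction time.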

Modelling

In [58]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score


In [63]:
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
}
separator = '--------------------------------------------------------'

for name, model in models.items():
    model.fit(X_train, y_train)  # Train model

    print(name)
    for split, X_split, y_true in [('Training', X_train, y_train),
                                   ('Test', X_test, y_test)]:
        y_pred = model.predict(X_split)  # Make predictions

        print(f'Model performance for {split} set')
        print('- Accuracy: {:.4f}'.format(accuracy_score(y_true, y_pred)))
        # F1 is weighted across both classes; precision and recall cover class 1 only
        print('- F1 score: {:.4f}'.format(f1_score(y_true, y_pred, average='weighted')))
        print('- Precision: {:.4f}'.format(precision_score(y_true, y_pred)))
        print('- Recall: {:.4f}'.format(recall_score(y_true, y_pred)))
        # Computed from hard 0/1 predictions, not from predicted probabilities
        print('- Roc Auc Score: {:.4f}'.format(roc_auc_score(y_true, y_pred)))
        print(separator)
    print(separator)


Logistic Regression
Model performance for Training set
- Accuracy: 0.8463
- F1 score: 0.8208
- Precision: 0.7013
- Recall: 0.3059
- Roc Auc Score: 0.6380
--------------------------------------------------------
Model performance for Test set
- Accuracy: 0.8354
- F1 score: 0.8078
- Precision: 0.6829
- Recall: 0.2932
- Roc Auc Score: 0.6301
--------------------------------------------------------
--------------------------------------------------------
Decision Tree
Model performance for Training set
- Accuracy: 1.0000
- F1 score: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- Roc Auc Score: 1.0000
--------------------------------------------------------
Model performance for Test set
- Accuracy: 0.9131
- F1 score: 0.9121
- Precision: 0.7944
- Recall: 0.7487
- Roc Auc Score: 0.8508
--------------------------------------------------------
--------------------------------------------------------
Random Forest
Model performance for Training set
- Accuracy: 1.0000
- F1 score: 1.0000
- Precis
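
The Roc Auc Score values above are computed from `model.predict`, that is, from hard 0/1 labels. With a single threshold the ROC curve has only one interior point, and the resulting area equals the balanced accuracy. A ranking-based AUC needs scores such as `predict_proba`. The sketch below contrasts the two on synthetic data generated with `make_classification`; it does not reproduce the notebook's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 19% positives) standing in for the travel set.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.81], random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

model = LogisticRegression(max_iter=1000).fit(X_a, y_a)
labels = model.predict(X_b)
scores = model.predict_proba(X_b)[:, 1]  # probability of class 1

auc_from_labels = roc_auc_score(y_b, labels)  # single threshold only
auc_from_scores = roc_auc_score(y_b, scores)  # uses the full ranking
bal_acc = balanced_accuracy_score(y_b, labels)

print(f"AUC from hard labels:   {auc_from_labels:.4f}")
print(f"Balanced accuracy:      {bal_acc:.4f}")
print(f"AUC from probabilities: {auc_from_scores:.4f}")
```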

In [64]:
# Hyperparameter tuning
# "sqrt" replaces "auto", which newer scikit-learn versions no longer accept for max_features
rf_params = {"max_depth": [5, 8, 15, None, 10],
             "max_features": [5, 7, "sqrt", 8],
             "min_samples_split": [2, 8, 15, 20],
             "n_estimators": [100, 200, 500, 1000]}

In [65]:
from sklearn.model_selection import RandomizedSearchCV

In [71]:
rf = RandomForestClassifier()
rs = RandomizedSearchCV(estimator=rf,
                        param_distributions=rf_params,
                        cv=3,
                        n_iter=10,
                        verbose=2,
                        n_jobs=-1)

In [75]:
rs.fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


In [76]:
rs.best_params_

{'n_estimators': 500,
 'min_samples_split': 2,
 'max_features': 7,
 'max_depth': None}

In [77]:
rs.best_score_

np.float64(0.9104873604565209)

In [78]:
y_pred = rs.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"F1 Score: {f1_score(y_test, y_pred)}")
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"Recall: {recall_score(y_test, y_pred)}")

Accuracy: 0.9355828220858896
F1 Score: 0.8061538461538461
Precision: 0.9776119402985075
Recall: 0.6858638743455497
