## Holiday Package Prediction

### 1) Problem statement.
"Trips & Travel.Com" wants to establish a viable business model to expand its customer base.
One way to expand the customer base is to introduce a new package offering. Currently, the company offers five types of packages: Basic, Standard, Deluxe, Super Deluxe, and King. Last year's data show that 18% of customers purchased a package. However, the marketing cost was quite high because customers were contacted at random, without using the available information.
The company is now planning to launch a new product, the Wellness Tourism Package. Wellness tourism is defined as travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle, and to support or increase their sense of well-being.
This time, however, the company wants to use the available data on existing and potential customers to make its marketing expenditure more efficient.
### 2) Data Source.
The dataset is collected from https://www.kaggle.com/datasets/susant4learning/holiday-package-purchase-prediction
The data consists of 20 columns and 4888 rows.

In [2]:
# Necessary Modeling libraries
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


# Data preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Visualization libraries
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

# Metrics
from sklearn import metrics

# Data manipulation
import pandas as pd
import numpy as np


In [3]:
## Reading data from CSV file

df = pd.read_csv('Travel.csv')

In [4]:
## Inspecting initial rows of the dataframe

df.head()

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0


## Data Cleaning

In [5]:
## Inspecting if there are any missing values in the dataset

df.isnull().sum()

CustomerID                    0
ProdTaken                     0
Age                         226
TypeofContact                25
CityTier                      0
DurationOfPitch             251
Occupation                    0
Gender                        0
NumberOfPersonVisiting        0
NumberOfFollowups            45
ProductPitched                0
PreferredPropertyStar        26
MaritalStatus                 0
NumberOfTrips               140
Passport                      0
PitchSatisfactionScore        0
OwnCar                        0
NumberOfChildrenVisiting     66
Designation                   0
MonthlyIncome               233
dtype: int64

In [7]:
## Inspecting if there are any duplicates

df.duplicated().sum()

0

In [107]:
## Inspecting if there are typos or unnecessary spaces in column names

df.columns

Index(['CustomerID', 'ProdTaken', 'Age', 'TypeofContact', 'CityTier',
       'DurationOfPitch', 'Occupation', 'Gender', 'NumberOfPersonVisiting',
       'NumberOfFollowups', 'ProductPitched', 'PreferredPropertyStar',
       'MaritalStatus', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore',
       'OwnCar', 'NumberOfChildrenVisiting', 'Designation', 'MonthlyIncome'],
      dtype='object')

In [108]:
## Inspecting if there are spaces or misspelled categories

df['Gender'].unique()

array(['Female', 'Male', 'Fe Male'], dtype=object)

In [109]:
## Fixing the issue

df['Gender'] = df['Gender'].replace('Fe Male','Female')

In [110]:
## Verifying that the change was applied

df['Gender'].unique()

array(['Female', 'Male'], dtype=object)

In [111]:
## Inspecting if there are spaces or misspelled categories

df['MaritalStatus'].unique()

array(['Single', 'Divorced', 'Married', 'Unmarried'], dtype=object)

In [112]:
## Merging 'Single' into 'Unmarried' so that both labels form one category

df['MaritalStatus'] = df['MaritalStatus'].replace('Single','Unmarried')

In [113]:
df['MaritalStatus'].unique()

array(['Unmarried', 'Divorced', 'Married'], dtype=object)

In [114]:
## Inspecting if there are spaces or misspelled categories

df['TypeofContact'].unique()

array(['Self Enquiry', 'Company Invited', nan], dtype=object)

In [115]:
## Inspecting if there are spaces or misspelled categories

df['Occupation'].unique()

array(['Salaried', 'Free Lancer', 'Small Business', 'Large Business'],
      dtype=object)

In [116]:
## Fixing the misspelled category

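# Assign the result back; calling replace(..., inplace=True) on a selected column may not update df in recent pandas versions
df['Occupation'] = df['Occupation'].replace('Free Lancer', 'Freelancer')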

In [117]:
## Verifying that the change was applied

df['Occupation'].unique()

array(['Salaried', 'Freelancer', 'Small Business', 'Large Business'],
      dtype=object)

In [118]:
## Inspecting if there are spaces or misspelled categories

df['Designation'].unique()

array(['Manager', 'Executive', 'Senior Manager', 'AVP', 'VP'],
      dtype=object)
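
The column-by-column label checks above could also be run in a single pass. The following is a minimal sketch, not the notebook's own code: `toy` is a small made-up frame standing in for `df`, and the `fixes` mapping is a hypothetical helper that mirrors only two of the replacements made above.

```python
import pandas as pd

# Toy stand-in for df; the values are made up for illustration
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Fe Male"],
    "Occupation": ["Salaried", "Free Lancer", "Small Business"],
    "Age": [41.0, 49.0, 37.0],
})

# Print the distinct labels of every object-typed column in one pass
for col in toy.select_dtypes(include="object").columns:
    print(col, toy[col].unique())

# Hypothetical mapping from misspelled labels to their corrected form
fixes = {
    "Gender": {"Fe Male": "Female"},
    "Occupation": {"Free Lancer": "Freelancer"},
}

# Apply each column's corrections and assign the result back
for col, mapping in fixes.items():
    toy[col] = toy[col].replace(mapping)
```

Keeping the corrections in one mapping makes it easy to see, and to review, every label that was changed.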

In [119]:
# Identify features with missing values
features_with_nan = [feature for feature in df.columns if df[feature].isnull().sum() >= 1]

# Print the percentage of missing values for each feature
for feature in features_with_nan:
    print(f'{feature} has {np.round(df[feature].isnull().mean()*100, 3)}% missing values')


Age has 4.624% missing values
TypeofContact has 0.511% missing values
DurationOfPitch has 5.135% missing values
NumberOfFollowups has 0.921% missing values
PreferredPropertyStar has 0.532% missing values
NumberOfTrips has 2.864% missing values
NumberOfChildrenVisiting has 1.35% missing values
MonthlyIncome has 4.767% missing values


In [120]:
# Generate descriptive statistics for numerical features with missing values
# This excludes object (categorical) data types

df[features_with_nan].select_dtypes(exclude='object').describe()


Unnamed: 0,Age,DurationOfPitch,NumberOfFollowups,PreferredPropertyStar,NumberOfTrips,NumberOfChildrenVisiting,MonthlyIncome
count,4662.0,4637.0,4843.0,4862.0,4748.0,4822.0,4655.0
mean,37.622265,15.490835,3.708445,3.581037,3.236521,1.187267,23619.853491
std,9.316387,8.519643,1.002509,0.798009,1.849019,0.857861,5380.698361
min,18.0,5.0,1.0,3.0,1.0,0.0,1000.0
25%,31.0,9.0,3.0,3.0,2.0,1.0,20346.0
50%,36.0,13.0,4.0,3.0,3.0,1.0,22347.0
75%,44.0,20.0,4.0,4.0,4.0,2.0,25571.0
max,61.0,127.0,6.0,5.0,22.0,3.0,98678.0


#### Handling Missing Values with Median and Mode Imputation

In [38]:
# Results are assigned back to each column; fillna(..., inplace=True) on a selected
# column triggers a warning in pandas 2.x and may not modify df

# Fill missing values in the 'Age' column with the median value
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing values in the 'TypeofContact' column with the mode (most frequent value)
df['TypeofContact'] = df['TypeofContact'].fillna(df['TypeofContact'].mode()[0])

# Fill missing values in the 'DurationOfPitch' column with the median value
df['DurationOfPitch'] = df['DurationOfPitch'].fillna(df['DurationOfPitch'].median())

# Fill missing values in the 'NumberOfFollowups' column with the mode (most frequent value)
df['NumberOfFollowups'] = df['NumberOfFollowups'].fillna(df['NumberOfFollowups'].mode()[0])

# Fill missing values in the 'PreferredPropertyStar' column with the mode (most frequent value)
df['PreferredPropertyStar'] = df['PreferredPropertyStar'].fillna(df['PreferredPropertyStar'].mode()[0])

# Fill missing values in the 'NumberOfTrips' column with the median value
df['NumberOfTrips'] = df['NumberOfTrips'].fillna(df['NumberOfTrips'].median())

# Fill missing values in the 'NumberOfChildrenVisiting' column with the mode (most frequent value)
df['NumberOfChildrenVisiting'] = df['NumberOfChildrenVisiting'].fillna(df['NumberOfChildrenVisiting'].mode()[0])

# Fill missing values in the 'MonthlyIncome' column with the median value
df['MonthlyIncome'] = df['MonthlyIncome'].fillna(df['MonthlyIncome'].median())
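
The fills above compute medians and modes on the full dataset, before the train/test split. If leakage from the test rows is a concern, one alternative is to learn the fill values from the training rows only. The sketch below assumes scikit-learn's `SimpleImputer`; the frames `toy_train` and `toy_test` and their values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training and test rows with missing values
toy_train = pd.DataFrame({
    "Age": [30.0, np.nan, 50.0, 40.0],
    "NumberOfFollowups": [3.0, 3.0, np.nan, 4.0],
})
toy_test = pd.DataFrame({"Age": [np.nan], "NumberOfFollowups": [np.nan]})

median_imputer = SimpleImputer(strategy="median")
mode_imputer = SimpleImputer(strategy="most_frequent")

# Learn the fill values from the training rows only
toy_train["Age"] = median_imputer.fit_transform(toy_train[["Age"]]).ravel()
toy_train["NumberOfFollowups"] = mode_imputer.fit_transform(
    toy_train[["NumberOfFollowups"]]
).ravel()

# Reuse the learned values on the test rows without refitting
toy_test["Age"] = median_imputer.transform(toy_test[["Age"]]).ravel()
toy_test["NumberOfFollowups"] = mode_imputer.transform(
    toy_test[["NumberOfFollowups"]]
).ravel()
```

With missing rates of about 5% or less per column, the difference from full-data imputation is likely small, but the train-only version keeps the test set untouched.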


In [39]:
## Checking if there are any missing values left
df.isnull().sum()

CustomerID                  0
ProdTaken                   0
Age                         0
TypeofContact               0
CityTier                    0
DurationOfPitch             0
Occupation                  0
Gender                      0
NumberOfPersonVisiting      0
NumberOfFollowups           0
ProductPitched              0
PreferredPropertyStar       0
MaritalStatus               0
NumberOfTrips               0
Passport                    0
PitchSatisfactionScore      0
OwnCar                      0
NumberOfChildrenVisiting    0
Designation                 0
MonthlyIncome               0
dtype: int64

In [41]:
## Dropping unnecessary features

df.drop('CustomerID',inplace=True,axis=1)

## Feature Engineering

In [44]:
# Create a new feature 'Total Visiting' by adding the number of children visiting and the number of persons visiting
# This combines the two related features into a single feature, simplifying analysis and modeling
df['Total Visiting'] = df['NumberOfChildrenVisiting'] + df['NumberOfPersonVisiting']

# Drop the original features 'NumberOfChildrenVisiting' and 'NumberOfPersonVisiting'
# These are removed because their information has been consolidated into 'Total Visiting', reducing redundancy in the dataset
df.drop(['NumberOfChildrenVisiting', 'NumberOfPersonVisiting'], axis=1, inplace=True)


In [45]:
## Checking the changes

df.head()

Unnamed: 0,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,Designation,MonthlyIncome,Total Visiting
0,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3.0,Deluxe,3.0,Unmarried,1.0,1,2,1,Manager,20993.0,3.0
1,0,49.0,Company Invited,1,14.0,Salaried,Male,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,Manager,20130.0,5.0
2,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,4.0,Basic,3.0,Unmarried,7.0,1,3,0,Executive,17090.0,3.0
3,0,33.0,Company Invited,1,9.0,Salaried,Female,3.0,Basic,3.0,Divorced,2.0,1,5,1,Executive,17909.0,3.0
4,0,36.0,Self Enquiry,1,8.0,Small Business,Male,3.0,Basic,4.0,Divorced,1.0,0,5,1,Executive,18468.0,2.0


In [55]:
## Categorical Features

categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O' ]

print(f'There are {len(categorical_features)} categorical features')

There are 6 categorical features


In [122]:
## Numerical Features

numerical_features = [feature for feature in df.columns if df[feature].dtype != 'O' ]

print(f'There are {len(numerical_features)} numeric features in the dataset')

There are 14 numeric features in the dataset


In [124]:
## Discrete Features

discrete_features = [feature for feature in numerical_features if len(df[feature].unique()) <=25 ]

print(f'There are {len(discrete_features)} discrete features')

There are 10 discrete features
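
Continuous features can be derived as the numeric columns that are not discrete. This is a small sketch on a made-up frame (`toy`); the cut-off `max_levels` is a hypothetical name, set to 3 here only because the toy data is tiny, whereas the notebook uses 25 on the full data.

```python
import pandas as pd

# Toy stand-in for df; the values are made up for illustration
toy = pd.DataFrame({
    "Passport": [1, 0, 1, 0],
    "CityTier": [3, 1, 1, 2],
    "MonthlyIncome": [20993.0, 20130.0, 17090.0, 17909.0],
    "Gender": ["Female", "Male", "Male", "Female"],
})

max_levels = 3  # hypothetical cut-off for this toy frame

# Numeric columns are those whose dtype is not object
numerical = [c for c in toy.columns if toy[c].dtype != "O"]

# Discrete columns have at most max_levels distinct values
discrete = [c for c in numerical if toy[c].nunique() <= max_levels]

# Continuous columns are the remaining numeric columns
continuous = [c for c in numerical if c not in discrete]

print(discrete, continuous)
```

Note that `nunique()` ignores missing values, while `len(df[feature].unique())` counts `NaN` as a value, so the two can differ by one on columns that still contain gaps.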


In [56]:
## Splitting the dataset into independent (X) and dependent (y) variables

X = df.drop('ProdTaken',axis=1)

y = df['ProdTaken']

In [57]:
## Inspecting the independent variables

X

Unnamed: 0,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,Designation,MonthlyIncome,Total Visiting
0,41.0,Self Enquiry,3,6.0,Salaried,Female,3.0,Deluxe,3.0,Unmarried,1.0,1,2,1,Manager,20993.0,3.0
1,49.0,Company Invited,1,14.0,Salaried,Male,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,Manager,20130.0,5.0
2,37.0,Self Enquiry,1,8.0,Free Lancer,Male,4.0,Basic,3.0,Unmarried,7.0,1,3,0,Executive,17090.0,3.0
3,33.0,Company Invited,1,9.0,Salaried,Female,3.0,Basic,3.0,Divorced,2.0,1,5,1,Executive,17909.0,3.0
4,36.0,Self Enquiry,1,8.0,Small Business,Male,3.0,Basic,4.0,Divorced,1.0,0,5,1,Executive,18468.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4883,49.0,Self Enquiry,3,9.0,Small Business,Male,5.0,Deluxe,4.0,Unmarried,2.0,1,1,1,Manager,26576.0,4.0
4884,28.0,Company Invited,1,31.0,Salaried,Male,5.0,Basic,3.0,Unmarried,3.0,1,3,1,Executive,21212.0,6.0
4885,52.0,Self Enquiry,3,17.0,Salaried,Female,4.0,Standard,4.0,Married,7.0,0,1,1,Senior Manager,31820.0,7.0
4886,19.0,Self Enquiry,3,16.0,Small Business,Male,4.0,Basic,3.0,Unmarried,3.0,0,5,0,Executive,20289.0,5.0


In [69]:
# Identify categorical features (those with object data type) in the dataset
cat_features = X.select_dtypes(include='object').columns

# Identify numerical features (those without object data type) in the dataset
num_features = X.select_dtypes(exclude='object').columns

# StandardScaler will normalize numerical features to have a mean of 0 and a standard deviation of 1
num_transformer = StandardScaler()

# OneHotEncoder will convert categorical features into a one-hot numeric array,
# dropping the first category of each feature to avoid collinearity
col_encoder = OneHotEncoder(drop='first')

# ColumnTransformer applies transformations to specific columns:
# - StandardScaler to numerical features
# - OneHotEncoder to categorical features
processor = ColumnTransformer(
    [
        ('StandardScaler', num_transformer, num_features),
        ('OneHotEncoder', col_encoder, cat_features)
    ]
)
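
A preprocessing step like this one can also be bundled with a model in a scikit-learn `Pipeline`, so that fitting and prediction always apply the same transformations. The sketch below is an illustration on made-up data (`toy_X`, `toy_y`), not the notebook's setup. It uses `handle_unknown="ignore"` and omits `drop='first'` to keep the example simple.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up features and target for illustration
toy_X = pd.DataFrame({
    "Age": [41.0, 49.0, 37.0, 33.0, 36.0, 28.0],
    "Gender": ["Female", "Male", "Male", "Female", "Male", "Female"],
})
toy_y = pd.Series([1, 0, 1, 0, 0, 1])

# Scale the numeric column and one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
])

# Chain preprocessing and the classifier into one estimator
clf = Pipeline([("preprocess", preprocess), ("model", LogisticRegression())])
clf.fit(toy_X, toy_y)

# A category not seen during fitting is encoded as all zeros instead of raising
new_rows = pd.DataFrame({"Age": [40.0], "Gender": ["Other"]})
pred = clf.predict(new_rows)
print(pred)
```

Because the pipeline is a single estimator, it can be passed directly to cross-validation or a hyperparameter search, which refits the scaler and encoder inside each fold.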


In [70]:
## Inspecting the ColumnTransformer instance

processor

In [71]:
## Splitting the dataset into training and testing segments

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.30,random_state=42)
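
Since only about 18% of customers purchased a package, a random split can leave the two segments with slightly different class ratios. One option, sketched here on synthetic arrays (`toy_X`, `toy_y`) rather than the travel data, is to pass `stratify` to `train_test_split` so that both segments keep the same share of positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 20% positives
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.array([1] * 20 + [0] * 80)

# stratify=toy_y keeps the class ratio equal in both segments
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.30, random_state=42, stratify=toy_y
)

print(y_tr.mean(), y_te.mean())
```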

In [72]:
# Apply the transformations to the training data and fit the processor
# This will scale numerical features and encode categorical features in the training set

X_train = processor.fit_transform(X_train)

In [82]:
# Apply the same transformations to the test data, using the already fitted processor
# Ensures that the test data undergoes the same scaling and encoding as the training data

X_test = processor.transform(X_test)

## Model Training

In [83]:
# Define a dictionary with different classification models
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boost": GradientBoostingClassifier()
}

# Iterate through the models by name
for name, model in models.items():
    # Train the model on the training data
    model.fit(X_train, y_train)
    
    # Make predictions on the training set
    y_train_pred = model.predict(X_train)
    
    # Make predictions on the test set
    y_test_pred = model.predict(X_test)
    
    # Calculate performance metrics for the training set
    model_train_accuracy = metrics.accuracy_score(y_train, y_train_pred)
    model_train_f1 = metrics.f1_score(y_train, y_train_pred, average='weighted')
    model_train_precision = metrics.precision_score(y_train, y_train_pred)
    model_train_recall = metrics.recall_score(y_train, y_train_pred)
    model_train_rocauc_score = metrics.roc_auc_score(y_train, y_train_pred)
    
    # Calculate performance metrics for the test set
    model_test_accuracy = metrics.accuracy_score(y_test, y_test_pred)
    model_test_f1 = metrics.f1_score(y_test, y_test_pred, average='weighted')
    model_test_precision = metrics.precision_score(y_test, y_test_pred)
    model_test_recall = metrics.recall_score(y_test, y_test_pred)
    model_test_rocauc_score = metrics.roc_auc_score(y_test, y_test_pred)
    
    # Print the model's name
    print(name)
    
    # Print training set performance metrics
    print('Model performance for Training set')
    print("- Accuracy: {:.4f}".format(model_train_accuracy))
    print('- F1 score: {:.4f}'.format(model_train_f1))
    print('- Precision: {:.4f}'.format(model_train_precision))
    print('- Recall: {:.4f}'.format(model_train_recall))
    print('- Roc Auc Score: {:.4f}'.format(model_train_rocauc_score))
    
    print('----------------------------------')
    
    # Print test set performance metrics
    print('Model performance for Test set')
    print('- Accuracy: {:.4f}'.format(model_test_accuracy))
    print('- F1 score: {:.4f}'.format(model_test_f1))
    print('- Precision: {:.4f}'.format(model_test_precision))
    print('- Recall: {:.4f}'.format(model_test_recall))
    print('- Roc Auc Score: {:.4f}'.format(model_test_rocauc_score))
    
    print('='*35)
    print('\n')


Logistic Regression
Model performance for Training set
- Accuracy: 0.8451
- F1 score: 0.8188
- Precision: 0.7101
- Recall: 0.3034
- Roc Auc Score: 0.6373
----------------------------------
Model performance for Test set
- Accuracy: 0.8432
- F1 score: 0.8193
- Precision: 0.6719
- Recall: 0.3139
- Roc Auc Score: 0.6393


Decision Tree
Model performance for Training set
- Accuracy: 1.0000
- F1 score: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- Roc Auc Score: 1.0000
----------------------------------
Model performance for Test set
- Accuracy: 0.9018
- F1 score: 0.9025
- Precision: 0.7289
- Recall: 0.7555
- Roc Auc Score: 0.8455


Random Forest
Model performance for Training set
- Accuracy: 1.0000
- F1 score: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- Roc Auc Score: 1.0000
----------------------------------
Model performance for Test set
- Accuracy: 0.9243
- F1 score: 0.9184
- Precision: 0.9405
- Recall: 0.6350
- Roc Auc Score: 0.8129


Gradient Boost
Model performance for Training se

`Random Forest` achieves the highest `Accuracy Score` among the models (0.9243 on the test set), so it is chosen as the champion model and fine-tuned below to see whether it can improve further.
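
The ROC AUC values reported above are computed from hard class predictions (`model.predict`), which reduces the ROC curve to a single operating point. ROC AUC is usually computed from scores such as predicted probabilities. The sketch below compares the two on synthetic, imbalanced data generated with `make_classification`; the data and the resulting numbers are illustrative only and say nothing about the travel dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 80/20 classes)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# AUC from hard 0/1 labels, as in the loop above
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))

# AUC from the predicted probability of the positive class
auc_from_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(f"AUC from labels: {auc_from_labels:.4f}")
print(f"AUC from probabilities: {auc_from_scores:.4f}")
```

With an imbalanced target, accuracy alone can look strong even when recall is low, so a probability-based AUC alongside precision and recall gives a fuller basis for choosing the champion model.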

In [84]:
## Hyperparameter Tuning
# Define a dictionary of candidate hyperparameter values for the Random Forest model

rf_params = {
    "max_depth": [5, 8, 15, None, 10],  # Maximum depth of the tree
    "max_features": [5, 7, "auto", 8],  # Number of features to consider when looking for the best split ("auto" was removed in scikit-learn 1.3; use "sqrt" there)
    "min_samples_split": [2, 8, 15, 20],  # Minimum number of samples required to split an internal node
    "n_estimators": [100, 200, 500, 1000]  # Number of trees in the forest
}


In [89]:
# Create a list of tuples for RandomizedSearchCV, including the model and its corresponding hyperparameters

randomcv_models = [
    ('RF', RandomForestClassifier(), rf_params)  # Random Forest with its hyperparameter grid
]


In [90]:
## Inspecting the variable

randomcv_models

[('RF',
  RandomForestClassifier(),
  {'max_depth': [5, 8, 15, None, 10],
   'max_features': [5, 7, 'auto', 8],
   'min_samples_split': [2, 8, 15, 20],
   'n_estimators': [100, 200, 500, 1000]})]

In [91]:
# Import RandomizedSearchCV from sklearn for hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV

# Initialize an empty dictionary to store the best parameters for each model
model_param = {}

# Iterate over the list of models and their hyperparameter grids
for name, model, params in randomcv_models:
    # Instantiate RandomizedSearchCV with the current model and its hyperparameters
    random = RandomizedSearchCV(estimator=model,
                                param_distributions=params,
                                n_iter=100,  # Number of parameter settings that are sampled
                                cv=3,  # Number of cross-validation folds
                                verbose=2,  # Verbosity level
                                n_jobs=-1)  # Use all available cores
    
    # Fit RandomizedSearchCV on the training data
    random.fit(X_train, y_train)
    
    # Store the best parameters found for the current model
    model_param[name] = random.best_params_

# Print the best parameters for each model
for model_name in model_param:
    print(f"---------------- Best Params for {model_name} -------------------")
    print(model_param[model_name])


Fitting 3 folds for each of 100 candidates, totalling 300 fits
---------------- Best Params for RF -------------------
{'n_estimators': 1000, 'min_samples_split': 2, 'max_features': 8, 'max_depth': 15}
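
Instead of copying the best parameters by hand into a new model, the fitted search object can be used directly: by default `RandomizedSearchCV` refits the best configuration on the full training data and exposes it as `best_estimator_`. The sketch below runs on synthetic data with a deliberately small grid. The `scoring="f1"` and `random_state=0` arguments are choices made for this example and are not settings used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Small illustrative search space (12 combinations)
param_distributions = {
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 8],
    "n_estimators": [20, 50],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,         # sample 5 of the 12 combinations
    cv=3,             # 3-fold cross-validation
    scoring="f1",     # optimise F1 on the positive class instead of accuracy
    random_state=0,   # make the sampled combinations reproducible
)
search.fit(X_demo, y_demo)

# The best configuration, already refitted on all of X_demo
best_model = search.best_estimator_
print(search.best_params_)
```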


In [94]:
# Define a dictionary with Random Forest classifier and its specified hyperparameters
models = {
    "Random Forest": RandomForestClassifier(n_estimators=1000,  # Number of trees in the forest
                                            min_samples_split=2,  # Minimum number of samples required to split an internal node
                                            max_features=8,  # Number of features to consider when looking for the best split
                                            max_depth=15)  # Maximum depth of the tree
}

# Iterate through the models by name
for name, model in models.items():
    # Train the model on the training data
    model.fit(X_train, y_train)
    
    # Make predictions on the training set
    y_train_pred = model.predict(X_train)
    
    # Make predictions on the test set
    y_test_pred = model.predict(X_test)
    
    # Calculate performance metrics for the training set
    model_train_accuracy = metrics.accuracy_score(y_train, y_train_pred)  # Calculate Accuracy
    model_train_f1 = metrics.f1_score(y_train, y_train_pred, average='weighted')  # Calculate F1-score
    model_train_precision = metrics.precision_score(y_train, y_train_pred)  # Calculate Precision
    model_train_recall = metrics.recall_score(y_train, y_train_pred)  # Calculate Recall
    model_train_rocauc_score = metrics.roc_auc_score(y_train, y_train_pred)  # Calculate ROC AUC Score
    
    # Calculate performance metrics for the test set
    model_test_accuracy = metrics.accuracy_score(y_test, y_test_pred)  # Calculate Accuracy
    model_test_f1 = metrics.f1_score(y_test, y_test_pred, average='weighted')  # Calculate F1-score
    model_test_precision = metrics.precision_score(y_test, y_test_pred)  # Calculate Precision
    model_test_recall = metrics.recall_score(y_test, y_test_pred)  # Calculate Recall
    model_test_rocauc_score = metrics.roc_auc_score(y_test, y_test_pred)  # Calculate ROC AUC Score
    
    # Print the model's name
    print(name)
    
    # Print training set performance metrics
    print('Model performance for Training set')
    print("- Accuracy: {:.4f}".format(model_train_accuracy))
    print('- F1 score: {:.4f}'.format(model_train_f1))
    print('- Precision: {:.4f}'.format(model_train_precision))
    print('- Recall: {:.4f}'.format(model_train_recall))
    print('- ROC AUC Score: {:.4f}'.format(model_train_rocauc_score))
    
    print('----------------------------------')
    
    # Print test set performance metrics
    print('Model performance for Test set')
    print('- Accuracy: {:.4f}'.format(model_test_accuracy))
    print('- F1 score: {:.4f}'.format(model_test_f1))
    print('- Precision: {:.4f}'.format(model_test_precision))
    print('- Recall: {:.4f}'.format(model_test_recall))
    print('- ROC AUC Score: {:.4f}'.format(model_test_rocauc_score))
    
    print('=' * 35)
    print('\n')


Random Forest
Model performance for Training set
- Accuracy: 0.9994
- F1 score: 0.9994
- Precision: 1.0000
- Recall: 0.9969
- ROC AUC Score: 0.9985
----------------------------------
Model performance for Test set
- Accuracy: 0.9250
- F1 score: 0.9198
- Precision: 0.9227
- Recall: 0.6533
- ROC AUC Score: 0.8204
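
The tuned model's test recall (0.6533) means roughly a third of actual buyers are still missed at the default decision threshold. If reaching more buyers matters more than avoiding wasted contacts, the threshold can be lowered. The sketch below illustrates the idea on synthetic data; the thresholds 0.5 and 0.3 are arbitrary examples, not recommendations for the travel dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.8, 0.2], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

# Predicted probability of the positive class
proba = model.predict_proba(X_te)[:, 1]

# Compare precision and recall at two illustrative thresholds
results = {}
for threshold in (0.5, 0.3):
    y_pred = (proba >= threshold).astype(int)
    results[threshold] = {
        "precision": precision_score(y_te, y_pred, zero_division=0),
        "recall": recall_score(y_te, y_pred),
    }
    print(threshold, results[threshold])
```

Lowering the threshold can only keep or increase recall, usually at the cost of precision, so the right value depends on the relative cost of a missed buyer versus an unnecessary contact.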


