### Holiday Package Prediction
#### 1) Problem statement.
"Trips & Travel.Com" wants to establish a viable business model for expanding its customer base. One way to do that is to introduce a new package offering. The company currently offers five package types: Basic, Standard, Deluxe, Super Deluxe, and King. Last year's data shows that 18% of customers purchased a package. However, marketing costs were high because customers were contacted at random, without using the information available about them. The company now plans to launch a new product, the Wellness Tourism Package. Wellness tourism is defined as travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle and to support or increase their sense of well-being. This time, the company wants to use the available data on existing and potential customers to make its marketing expenditure more efficient.

#### 2) Data Collection.
The dataset comes from https://www.kaggle.com/datasets/susant4learning/holiday-package-purchase-prediction and consists of 20 columns and 4888 rows.

In [149]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [150]:
df=pd.read_csv('Travel.csv')
df.head()

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0


Handling missing values

In [151]:
df.isnull().sum()

CustomerID                    0
ProdTaken                     0
Age                         226
TypeofContact                25
CityTier                      0
DurationOfPitch             251
Occupation                    0
Gender                        0
NumberOfPersonVisiting        0
NumberOfFollowups            45
ProductPitched                0
PreferredPropertyStar        26
MaritalStatus                 0
NumberOfTrips               140
Passport                      0
PitchSatisfactionScore        0
OwnCar                        0
NumberOfChildrenVisiting     66
Designation                   0
MonthlyIncome               233
dtype: int64

In [152]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CustomerID                4888 non-null   int64  
 1   ProdTaken                 4888 non-null   int64  
 2   Age                       4662 non-null   float64
 3   TypeofContact             4863 non-null   object 
 4   CityTier                  4888 non-null   int64  
 5   DurationOfPitch           4637 non-null   float64
 6   Occupation                4888 non-null   object 
 7   Gender                    4888 non-null   object 
 8   NumberOfPersonVisiting    4888 non-null   int64  
 9   NumberOfFollowups         4843 non-null   float64
 10  ProductPitched            4888 non-null   object 
 11  PreferredPropertyStar     4862 non-null   float64
 12  MaritalStatus             4888 non-null   object 
 13  NumberOfTrips             4748 non-null   float64
 14  Passport

In [153]:
classification_features=[column for column in df.columns if df[column].dtype == 'object']
classification_features

['TypeofContact',
 'Occupation',
 'Gender',
 'ProductPitched',
 'MaritalStatus',
 'Designation']

In [154]:
numerical_features=[column for column in df.columns if df[column].dtype != 'object']
numerical_features

['CustomerID',
 'ProdTaken',
 'Age',
 'CityTier',
 'DurationOfPitch',
 'NumberOfPersonVisiting',
 'NumberOfFollowups',
 'PreferredPropertyStar',
 'NumberOfTrips',
 'Passport',
 'PitchSatisfactionScore',
 'OwnCar',
 'NumberOfChildrenVisiting',
 'MonthlyIncome']

In [155]:
for feature in classification_features:
	print(df[feature].value_counts())
	print()

TypeofContact
Self Enquiry       3444
Company Invited    1419
Name: count, dtype: int64

Occupation
Salaried          2368
Small Business    2084
Large Business     434
Free Lancer          2
Name: count, dtype: int64

Gender
Male       2916
Female     1817
Fe Male     155
Name: count, dtype: int64

ProductPitched
Basic           1842
Deluxe          1732
Standard         742
Super Deluxe     342
King             230
Name: count, dtype: int64

MaritalStatus
Married      2340
Divorced      950
Single        916
Unmarried     682
Name: count, dtype: int64

Designation
Executive         1842
Manager           1732
Senior Manager     742
AVP                342
VP                 230
Name: count, dtype: int64



In [156]:
df['Gender']=df['Gender'].str.replace("Fe Male","Female")
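
The same clean-up can also be written as an explicit label mapping, which only touches exact matches. The snippet below is a minimal sketch on a made-up Series; the sample values are illustrative and are not taken from Travel.csv.

```python
import pandas as pd

# Toy stand-in for df['Gender']; the values are illustrative only.
gender = pd.Series(["Male", "Female", "Fe Male", "Male", "Fe Male"])

# Map the misspelled label onto the canonical one; other labels are left untouched.
gender_clean = gender.replace({"Fe Male": "Female"})

print(gender_clean.value_counts())
```

A dictionary mapping like this extends naturally if further inconsistent labels turn up in other columns.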

In [157]:
for feature in classification_features:
	print(df[feature].value_counts())
	print()

TypeofContact
Self Enquiry       3444
Company Invited    1419
Name: count, dtype: int64

Occupation
Salaried          2368
Small Business    2084
Large Business     434
Free Lancer          2
Name: count, dtype: int64

Gender
Male      2916
Female    1972
Name: count, dtype: int64

ProductPitched
Basic           1842
Deluxe          1732
Standard         742
Super Deluxe     342
King             230
Name: count, dtype: int64

MaritalStatus
Married      2340
Divorced      950
Single        916
Unmarried     682
Name: count, dtype: int64

Designation
Executive         1842
Manager           1732
Senior Manager     742
AVP                342
VP                 230
Name: count, dtype: int64



Data Handling

In [158]:
# For categorical features, replace missing values with the mode
for feature in classification_features:
	print(feature)
	print(df[feature].isnull().sum())
	mode_value=df[feature].mode()[0]  # mode() ignores NaN by default
	df[feature]=df[feature].fillna(mode_value)
	print(df[feature].isnull().sum())

TypeofContact
25
0
Occupation
0
0
Gender
0
0
ProductPitched
0
0
MaritalStatus
0
0
Designation
0
0


In [159]:
# For numerical features, replace missing values with the mean
for feature in numerical_features:
	print(feature)
	print(df[feature].isnull().sum())
	mean_value=df[feature].mean()
	df[feature]=df[feature].fillna(mean_value)
	print(df[feature].isnull().sum())

CustomerID
0
0
ProdTaken
0
0
Age
226
0
CityTier
0
0
DurationOfPitch
251
0
NumberOfPersonVisiting
0
0
NumberOfFollowups
45
0
PreferredPropertyStar
26
0
NumberOfTrips
140
0
Passport
0
0
PitchSatisfactionScore
0
0
OwnCar
0
0
NumberOfChildrenVisiting
66
0
MonthlyIncome
233
0
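
The cells above compute the fill values on the full dataset, before any train/test split. One possible alternative, shown here only as a sketch, is to learn the fill values from the training rows and reuse them on the test rows. The toy frame, the choice of the median, and the split size are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame with missing values; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "Age": [41.0, 49.0, np.nan, 33.0, 37.0, np.nan, 28.0, 52.0],
    "MonthlyIncome": [20993.0, 20130.0, 17090.0, np.nan, 18468.0, 21500.0, np.nan, 30000.0],
})

train, test = train_test_split(toy, test_size=0.25, random_state=0)

# Learn the fill values from the training rows only, then reuse them on the test rows.
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=toy.columns, index=train.index)
test_filled = pd.DataFrame(imputer.transform(test), columns=toy.columns, index=test.index)

print(imputer.statistics_)  # one fill value per column
```

The median is less sensitive to extreme incomes than the mean; whether that matters for this dataset would need to be checked.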


Feature Extraction

In [160]:
df.head()

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,37.622265,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0


In [161]:
df["TotalVisited"]=df['NumberOfChildrenVisiting']+df['NumberOfPersonVisiting']
df.drop(columns=['NumberOfChildrenVisiting','NumberOfPersonVisiting'],axis=1,inplace=True)

In [162]:
df.head()

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,Designation,MonthlyIncome,TotalVisited
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3.0,Deluxe,3.0,Single,1.0,1,2,1,Manager,20993.0,3.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,Manager,20130.0,5.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,4.0,Basic,3.0,Single,7.0,1,3,0,Executive,17090.0,3.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,3.0,Basic,3.0,Divorced,2.0,1,5,1,Executive,17909.0,3.0
4,200004,0,37.622265,Self Enquiry,1,8.0,Small Business,Male,3.0,Basic,4.0,Divorced,1.0,0,5,1,Executive,18468.0,2.0


In [163]:
classification_features=[column for column in df.columns if df[column].dtype == 'object']
numerical_features=[column for column in df.columns if df[column].dtype != 'object']

In [164]:
discrete_features=[column for column in numerical_features if len(df[column].unique()) <= 15]
continuous_features=[column for column in numerical_features if len(df[column].unique()) > 15]

### Train Test Split And Model Training

In [165]:
X=df.drop(columns=['ProdTaken'],axis=1)
Y=df['ProdTaken']

In [166]:
from sklearn.model_selection  import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=.3,random_state=42)
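
Because only a minority of customers purchased a package, a plain random split can leave the two splits with slightly different purchase rates. The sketch below uses the `stratify` argument of `train_test_split` on a made-up target with an 18% positive rate; the toy data and variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with 18 positives out of 100, mirroring the 18% purchase rate.
y_toy = pd.Series([1] * 18 + [0] * 82)
X_toy = pd.DataFrame({"feature": range(100)})

# stratify keeps the class proportions close to equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

print(y_tr.mean(), y_te.mean())
```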

In [167]:
numerical_features.remove('ProdTaken')
numerical_features

['CustomerID',
 'Age',
 'CityTier',
 'DurationOfPitch',
 'NumberOfFollowups',
 'PreferredPropertyStar',
 'NumberOfTrips',
 'Passport',
 'PitchSatisfactionScore',
 'OwnCar',
 'MonthlyIncome',
 'TotalVisited']

Apply Standardization

In [168]:
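from sklearn.preprocessing import StandardScaler
# Fit each scaler on the training split only, then apply it to the test split.
# Caveat: the outputs recorded below come from a run in which only MonthlyIncome
# was scaled, so rerunning this cell will change them.
for feature in numerical_features:
	scaler=StandardScaler()
	X_train[feature]=scaler.fit_transform(X_train[[feature]])
	X_test[feature]=scaler.transform(X_test[[feature]])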

In [169]:
X_train.head()

Unnamed: 0,CustomerID,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,Designation,MonthlyIncome,TotalVisited
736,200736,48.0,Self Enquiry,1,10.0,Salaried,Male,4.0,Standard,3.0,Unmarried,1.0,0,5,1,Senior Manager,0.449423,4.0
1615,201615,30.0,Self Enquiry,1,11.0,Large Business,Female,3.0,Basic,5.0,Married,6.0,0,5,0,Executive,-1.018461,3.0
336,200336,29.0,Self Enquiry,1,14.0,Salaried,Male,5.0,Basic,5.0,Divorced,2.0,1,3,1,Executive,-1.222778,4.0
4526,204526,29.0,Self Enquiry,3,9.0,Small Business,Female,4.0,Deluxe,4.0,Married,3.0,1,3,1,Manager,-0.029263,5.0
2665,202665,34.0,Self Enquiry,1,11.0,Small Business,Female,5.0,Basic,4.0,Divorced,8.0,0,4,0,Executive,-0.43545,4.0
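
A single scaler can also be fitted on several numeric columns at once, which removes the per-column loop. This is a minimal sketch on made-up values; the frames and the column list are assumptions, not the notebook's actual data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train/test frames; the values are made up for illustration.
X_train_toy = pd.DataFrame({
    "Age": [48.0, 30.0, 29.0, 29.0, 34.0],
    "MonthlyIncome": [25000.0, 17500.0, 16500.0, 22000.0, 20000.0],
})
X_test_toy = pd.DataFrame({
    "Age": [41.0, 37.0],
    "MonthlyIncome": [20993.0, 17090.0],
})

cols = ["Age", "MonthlyIncome"]
scaler = StandardScaler()

# Learn mean and standard deviation from the training rows only.
X_train_toy[cols] = scaler.fit_transform(X_train_toy[cols])
# Reuse the training statistics on the test rows.
X_test_toy[cols] = scaler.transform(X_test_toy[cols])

print(X_train_toy.round(3))
```

After this step every listed training column has mean 0 and unit variance, which is a quick way to check that scaling was applied to all intended columns.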


Label Encoding

In [170]:
from sklearn.preprocessing import LabelEncoder
for feature in classification_features:
	encoder=LabelEncoder()
	# Fit on the full column so every category seen in either split has a code
	encoder.fit(X[feature])
	X_train[feature]=encoder.transform(X_train[feature])
	X_test[feature]=encoder.transform(X_test[feature])

In [171]:
X_train.head()

Unnamed: 0,CustomerID,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,Designation,MonthlyIncome,TotalVisited
736,200736,48.0,1,1,10.0,2,1,4.0,3,3.0,3,1.0,0,5,1,3,0.449423,4.0
1615,201615,30.0,1,1,11.0,1,0,3.0,0,5.0,1,6.0,0,5,0,1,-1.018461,3.0
336,200336,29.0,1,1,14.0,2,1,5.0,0,5.0,0,2.0,1,3,1,1,-1.222778,4.0
4526,204526,29.0,1,3,9.0,3,0,4.0,1,4.0,1,3.0,1,3,1,2,-0.029263,5.0
2665,202665,34.0,1,1,11.0,3,0,5.0,0,4.0,0,8.0,0,4,0,1,-0.43545,4.0


Apply Models

In [172]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,precision_score, recall_score, f1_score, roc_auc_score,roc_curve 

In [173]:
# Evaluate the model
def evaluate_performance(true,pred):
	accuracy=accuracy_score(true,pred) # Calculate accuracy
	f1=f1_score(true,pred,average='weighted') # Calculate F1 score (weighted across both classes)
	precision=precision_score(true,pred) # Calculate precision (positive class)
	recall= recall_score(true,pred) # Calculate recall (positive class)
	rocauc_score=roc_auc_score(true,pred) # Calculate ROC AUC score (from hard predictions)

	return accuracy,f1,precision,recall,rocauc_score
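
`roc_auc_score` is passed hard 0/1 predictions here, which collapses the ROC curve to a single threshold. It is more commonly computed from predicted probabilities (for example the second column of `predict_proba`). The sketch below shows the difference on made-up numbers; the labels and probabilities are hypothetical.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1]
# Hypothetical predicted probabilities of the positive class.
y_proba = [0.10, 0.30, 0.45, 0.40, 0.80]
# Hard predictions at the default 0.5 threshold.
y_label = [int(p >= 0.5) for p in y_proba]

auc_from_proba = roc_auc_score(y_true, y_proba)   # uses the full ranking
auc_from_labels = roc_auc_score(y_true, y_label)  # uses a single threshold

print(auc_from_proba, auc_from_labels)
```

In this toy case the probability-based score is 5/6 while the label-based score is 0.75, because thresholding discards the ranking information among the predicted negatives.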

In [174]:
models={
    "Logistic Regression":LogisticRegression(),
    "Decision Tree":DecisionTreeClassifier(),
    "Random Forest":RandomForestClassifier(),
    "Gradient Boost":GradientBoostingClassifier()
}

In [175]:
for name, model in models.items():
	model.fit(X_train,Y_train)

	# Make predictions
	Y_train_pred = model.predict(X_train)
	Y_test_pred = model.predict(X_test)

	# Training set performance
	model_train_accuracy,model_train_f1,model_train_precision,model_train_recall,model_train_rocauc_score = evaluate_performance(Y_train,Y_train_pred)

	# Test set performance
	model_test_accuracy,model_test_f1,model_test_precision,model_test_recall,model_test_rocauc_score = evaluate_performance(Y_test,Y_test_pred)

	print(name)

	print('Model performance for Training set')
	print(f"- Accuracy: {model_train_accuracy}")
	print(f'- F1 score: {model_train_f1}')
	print(f'- Precision: {model_train_precision}')
	print(f'- Recall: {model_train_recall}')
	print(f'- Roc Auc Score: {model_train_rocauc_score}')

	print("-"*15)
	print('Model performance for Test set')
	print(f"- Accuracy: {model_test_accuracy}")
	print(f'- F1 score: {model_test_f1}')
	print(f'- Precision: {model_test_precision}')
	print(f'- Recall: {model_test_recall}')
	print(f'- Roc Auc Score: {model_test_rocauc_score}')

	print("="*15)
	print()

Logistic Regression
Model performance for Training set
- Accuracy: 0.8251973107278574
- F1 score: 0.7601388407079831
- Precision: 0.9285714285714286
- Recall: 0.0804953560371517
- Roc Auc Score: 0.5395269572978552
---------------
Model performance for Test set
- Accuracy: 0.8214042263122018
- F1 score: 0.7528472509410283
- Precision: 0.8
- Recall: 0.058394160583941604
- Roc Auc Score: 0.5275206343573521

Decision Tree
Model performance for Training set
- Accuracy: 1.0
- F1 score: 1.0
- Precision: 1.0
- Recall: 1.0
- Roc Auc Score: 1.0
---------------
Model performance for Test set
- Accuracy: 0.8916155419222904
- F1 score: 0.8916915306133852
- Precision: 0.7090909090909091
- Recall: 0.7116788321167883
- Roc Auc Score: 0.822310497366022

Random Forest
Model performance for Training set
- Accuracy: 1.0
- F1 score: 1.0
- Precision: 1.0
- Recall: 1.0
- Roc Auc Score: 1.0
---------------
Model performance for Test set
- Accuracy: 0.9059304703476483
- F1 score: 0.895956989978016
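
Accuracy is easy to misread on this data because most customers did not buy. The sketch below compares the held-out accuracy reported above for logistic regression (the second block of its output) with a baseline that always predicts "no purchase". It assumes the 18% purchase rate from the problem statement; the rate in the actual test split may differ slightly.

```python
# Purchase rate from the problem statement (assumed to hold roughly for the test split).
positive_rate = 0.18

# A model that always predicts "no purchase" is right for every non-buyer.
baseline_accuracy = 1 - positive_rate

# Held-out figures copied from the logistic regression output above.
logreg_accuracy = 0.8214042263122018
logreg_recall = 0.058394160583941604

print(f"Baseline accuracy:      {baseline_accuracy:.4f}")
print(f"Logistic regression:    {logreg_accuracy:.4f}")
print(f"Gain over baseline:     {logreg_accuracy - baseline_accuracy:.4f}")
print(f"Share of buyers found:  {logreg_recall:.4f}")
```

Under that assumption, logistic regression barely improves on the trivial baseline and finds only about 6% of buyers, so recall and precision on the positive class are more informative than accuracy for comparing these models.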

In [176]:
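# Hyperparameter tuning
# "sqrt" is used in place of "auto": for classifiers the two were equivalent,
# and newer scikit-learn releases no longer accept "auto".
rf_params={"max_depth": [5, 8, 15, None, 10],
           "max_features": [5, 7, "sqrt", 8],
           "min_samples_split": [2, 8, 15, 20],
           "n_estimators": [100, 200, 500, 1000]
           }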

In [177]:
randomcv_models = [
                   ("RF", RandomForestClassifier(), rf_params)
				   
                   ]

In [178]:
from sklearn.model_selection import RandomizedSearchCV

model_best_params={}
for name,model,parameters in randomcv_models:
	# Named random_search so it is not confused with the stdlib random module
	random_search=RandomizedSearchCV(estimator=model,
		param_distributions=parameters,
		cv=5)

	random_search.fit(X_train,Y_train)
	model_best_params[name]=random_search.best_params_

for model_name in model_best_params:
	print(f"---------------- Best Params for {model_name} -------------------")
	print(model_best_params[model_name])

---------------- Best Params for RF -------------------
{'n_estimators': 500, 'min_samples_split': 8, 'max_features': 8, 'max_depth': 15}
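
The search above samples candidates at random, so the reported best parameters can change between runs. The sketch below shows one way to make the search repeatable and to reuse the refitted winner directly through `best_estimator_`. It runs on synthetic data from `make_classification`; the small grid, `n_iter`, and the `f1` scoring choice are assumptions chosen to keep the example fast, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the training data (roughly 20% positives).
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Deliberately small grid so the example runs quickly.
param_distributions = {
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 8],
    "n_estimators": [10, 25],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled parameter combinations
    scoring="f1",      # score on the positive class rather than accuracy
    cv=3,
    random_state=0,    # makes the sampled candidates repeatable
)
search.fit(X_toy, y_toy)

# The best candidate is refitted on all of X_toy and can be used as-is.
best_model = search.best_estimator_
print(search.best_params_)
```

Using `best_estimator_` (or unpacking `best_params_` into the constructor) avoids copying parameter values by hand into a later cell.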


In [179]:
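# Note: these hard-coded values differ from the best parameters printed above.
# RandomizedSearchCV samples candidates at random, so the reported best set
# can change between runs unless random_state is fixed.
models={
    "Random Forest":RandomForestClassifier(n_estimators=100, min_samples_split=2, max_features=7, max_depth=None)
}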

In [180]:
for name, model in models.items():
	model.fit(X_train,Y_train)

	# Make predictions
	Y_train_pred = model.predict(X_train)
	Y_test_pred = model.predict(X_test)

	# Training set performance
	model_train_accuracy,model_train_f1,model_train_precision,model_train_recall,model_train_rocauc_score = evaluate_performance(Y_train,Y_train_pred)

	# Test set performance
	model_test_accuracy,model_test_f1,model_test_precision,model_test_recall,model_test_rocauc_score = evaluate_performance(Y_test,Y_test_pred)

	print(name)

	print('Model performance for Training set')
	print(f"- Accuracy: {model_train_accuracy}")
	print(f'- F1 score: {model_train_f1}')
	print(f'- Precision: {model_train_precision}')
	print(f'- Recall: {model_train_recall}')
	print(f'- Roc Auc Score: {model_train_rocauc_score}')

	print("-"*15)
	print('Model performance for Test set')
	print(f"- Accuracy: {model_test_accuracy}")
	print(f'- F1 score: {model_test_f1}')
	print(f'- Precision: {model_test_precision}')
	print(f'- Recall: {model_test_recall}')
	print(f'- Roc Auc Score: {model_test_rocauc_score}')

	print("="*15)
	print()

Random Forest
Model performance for Training set
- Accuracy: 1.0
- F1 score: 1.0
- Precision: 1.0
- Recall: 1.0
- Roc Auc Score: 1.0
---------------
Model performance for Test set
- Accuracy: 0.9113837764144512
- F1 score: 0.9030670212302431
- Precision: 0.9186046511627907
- Recall: 0.5766423357664233
- Roc Auc Score: 0.7824536071120465

