## About Dataset
### Context


"Trips & Travel.Com" wants to establish a viable business model for expanding its customer base. One way to do that is to introduce a new package offering. The company currently offers five packages: Basic, Standard, Deluxe, Super Deluxe, and King. Last year's data shows that 18% of customers purchased a package. However, marketing costs were high because customers were contacted at random, without using the available information. The company now plans to launch a new product, the Wellness Tourism Package. Wellness tourism is defined as travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle and to support or increase their sense of well-being. This time, the company wants to use the available data on existing and potential customers to make its marketing spend more efficient.



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import warnings

warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('Travel.csv')
df

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4883,204883,1,49.0,Self Enquiry,3,9.0,Small Business,Male,3,5.0,Deluxe,4.0,Unmarried,2.0,1,1,1,1.0,Manager,26576.0
4884,204884,1,28.0,Company Invited,1,31.0,Salaried,Male,4,5.0,Basic,3.0,Single,3.0,1,3,1,2.0,Executive,21212.0
4885,204885,1,52.0,Self Enquiry,3,17.0,Salaried,Female,4,4.0,Standard,4.0,Married,7.0,0,1,1,3.0,Senior Manager,31820.0
4886,204886,1,19.0,Self Enquiry,3,16.0,Small Business,Male,3,4.0,Basic,3.0,Single,3.0,0,5,0,2.0,Executive,20289.0


### Clean the Data

In [3]:
df['Gender'] = df['Gender'].str.replace('Fe Male', 'Female')

In [4]:
df['Gender'].value_counts()

Gender
Male      2916
Female    1972
Name: count, dtype: int64
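Before replacing a label, it can help to list the distinct values so that every variant spelling is caught. The following is a minimal sketch on a made-up Series; the values are illustrative and not taken from `Travel.csv`, and `Series.replace` with a dict is used here as one possible alternative to `str.replace`.

```python
import pandas as pd

# Hypothetical sample; in the notebook the column is df['Gender'].
gender = pd.Series(['Male', 'Fe Male', 'Female', 'Male', 'Fe Male'])

# Inspect the distinct labels first to spot inconsistent spellings.
labels_before = sorted(gender.unique())

# Map the stray spelling onto the canonical label; other values pass through.
gender_clean = gender.replace({'Fe Male': 'Female'})
labels_after = sorted(gender_clean.unique())

print(labels_before)
print(labels_after)
```

Replacing whole values rather than substrings avoids accidentally rewriting part of a longer label.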

In [5]:
df['MaritalStatus'] = df['MaritalStatus'].str.replace('Single', 'Unmarried')

In [6]:
df['MaritalStatus'].value_counts()

MaritalStatus
Married      2340
Unmarried    1598
Divorced      950
Name: count, dtype: int64

In [7]:
df['TypeofContact'].value_counts()

TypeofContact
Self Enquiry       3444
Company Invited    1419
Name: count, dtype: int64
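The two counts above sum to 4,863, while the frame has 4,888 rows. The difference comes from `value_counts` skipping missing values by default. A small sketch on hypothetical data shows how `dropna=False` makes the gap visible:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing entry.
contact = pd.Series(['Self Enquiry', 'Company Invited', np.nan, 'Self Enquiry'])

counts_default = contact.value_counts()              # NaN is excluded
counts_with_na = contact.value_counts(dropna=False)  # NaN gets its own row

n_missing = int(contact.isna().sum())
print(counts_with_na)
```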

## Find the Features with Missing Values

In [8]:
features_with_na = [feature for feature in df.columns if df[feature].isna().sum() > 0]

In [9]:
features_with_na

['Age',
 'TypeofContact',
 'DurationOfPitch',
 'NumberOfFollowups',
 'PreferredPropertyStar',
 'NumberOfTrips',
 'NumberOfChildrenVisiting',
 'MonthlyIncome']

In [10]:
for feature in features_with_na:
    print(feature, np.round(df[feature].isnull().mean() * 100, 2))

Age 4.62
TypeofContact 0.51
DurationOfPitch 5.14
NumberOfFollowups 0.92
PreferredPropertyStar 0.53
NumberOfTrips 2.86
NumberOfChildrenVisiting 1.35
MonthlyIncome 4.77
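The same percentages can be computed without an explicit loop. This is a sketch on a made-up frame; the column names mirror the dataset, but the values and the name `toy` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; values are invented for illustration.
toy = pd.DataFrame({
    'Age': [41.0, np.nan, 37.0, 33.0],
    'CityTier': [3, 1, 1, 1],
    'MonthlyIncome': [20993.0, 20130.0, np.nan, np.nan],
})

# Share of missing values per column, in percent.
missing_pct = toy.isna().mean().mul(100).round(2)

# Keep only the columns that have gaps, largest share first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```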


In [11]:
df[features_with_na].describe()

Unnamed: 0,Age,DurationOfPitch,NumberOfFollowups,PreferredPropertyStar,NumberOfTrips,NumberOfChildrenVisiting,MonthlyIncome
count,4662.0,4637.0,4843.0,4862.0,4748.0,4822.0,4655.0
mean,37.622265,15.490835,3.708445,3.581037,3.236521,1.187267,23619.853491
std,9.316387,8.519643,1.002509,0.798009,1.849019,0.857861,5380.698361
min,18.0,5.0,1.0,3.0,1.0,0.0,1000.0
25%,31.0,9.0,3.0,3.0,2.0,1.0,20346.0
50%,36.0,13.0,4.0,3.0,3.0,1.0,22347.0
75%,44.0,20.0,4.0,4.0,4.0,2.0,25571.0
max,61.0,127.0,6.0,5.0,22.0,3.0,98678.0


## Fill Null Values

- Age: impute the median (robust to outliers)
- Type of contact: impute the mode (the most frequent contact type)
- Duration of pitch: impute the median
- Number of follow-ups: impute the mode, since it is a discrete feature
- Preferred property star: impute the mode
- Number of trips: impute the median
- Number of children visiting: impute the mode
- Monthly income: impute the median
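To illustrate why the median is preferred for skewed columns and the mode for discrete ones, here is a small sketch on invented numbers (none of these values come from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one extreme value and one gap.
income = pd.Series([20000.0, 21000.0, 22000.0, 23000.0, 98000.0, np.nan])

mean_fill = income.mean()      # pulled upward by the 98000 outlier
median_fill = income.median()  # stays near the bulk of the data
income_filled = income.fillna(median_fill)

# Hypothetical discrete counts: the mode keeps the fill value a real level.
followups = pd.Series([3.0, 4.0, 4.0, np.nan])
mode_fill = followups.mode()[0]
followups_filled = followups.fillna(mode_fill)

print(mean_fill, median_fill, mode_fill)
```

In this toy case the mean lands well above four of the five observed incomes, while the median stays at a typical value.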

In [12]:
# Age: the median is robust to outliers
df['Age'] = df['Age'].fillna(df['Age'].median())

# Type of contact: most frequent category
df['TypeofContact'] = df['TypeofContact'].fillna(df['TypeofContact'].mode()[0])

# Duration of pitch
df['DurationOfPitch'] = df['DurationOfPitch'].fillna(df['DurationOfPitch'].median())

# Number of follow-ups
df['NumberOfFollowups'] = df['NumberOfFollowups'].fillna(df['NumberOfFollowups'].mode()[0])

# Preferred property star
df['PreferredPropertyStar'] = df['PreferredPropertyStar'].fillna(df['PreferredPropertyStar'].mode()[0])

# Number of trips
df['NumberOfTrips'] = df['NumberOfTrips'].fillna(df['NumberOfTrips'].median())

# Number of children visiting
df['NumberOfChildrenVisiting'] = df['NumberOfChildrenVisiting'].fillna(df['NumberOfChildrenVisiting'].mode()[0])

# Monthly income
df['MonthlyIncome'] = df['MonthlyIncome'].fillna(df['MonthlyIncome'].median())

In [13]:
df.drop('CustomerID', axis=1, inplace=True)

In [14]:
df

Unnamed: 0,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Unmarried,1.0,1,2,1,0.0,Manager,20993.0
1,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Unmarried,7.0,1,3,0,0.0,Executive,17090.0
3,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,0,36.0,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4883,1,49.0,Self Enquiry,3,9.0,Small Business,Male,3,5.0,Deluxe,4.0,Unmarried,2.0,1,1,1,1.0,Manager,26576.0
4884,1,28.0,Company Invited,1,31.0,Salaried,Male,4,5.0,Basic,3.0,Unmarried,3.0,1,3,1,2.0,Executive,21212.0
4885,1,52.0,Self Enquiry,3,17.0,Salaried,Female,4,4.0,Standard,4.0,Married,7.0,0,1,1,3.0,Senior Manager,31820.0
4886,1,19.0,Self Enquiry,3,16.0,Small Business,Male,3,4.0,Basic,3.0,Unmarried,3.0,0,5,0,2.0,Executive,20289.0


In [15]:
df['NumberOfVisiting'] = df['NumberOfChildrenVisiting'] + df['NumberOfPersonVisiting'] 

In [16]:
df.drop(columns=['NumberOfChildrenVisiting', 'NumberOfPersonVisiting'], inplace=True)
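When two columns are added with `+`, a missing value in either one makes the sum missing too, so any gaps left in `NumberOfChildrenVisiting` would carry over into `NumberOfVisiting`. The sketch below shows this on a hypothetical frame. Treating a missing child count as zero via `fill_value=0` is an assumption made only for illustration, not a choice made in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second one has no child count recorded.
toy = pd.DataFrame({
    'NumberOfPersonVisiting': [3, 2, 4],
    'NumberOfChildrenVisiting': [0.0, np.nan, 2.0],
})

# Plain addition propagates the NaN.
total_plain = toy['NumberOfChildrenVisiting'] + toy['NumberOfPersonVisiting']

# Series.add with fill_value treats a missing operand as 0 (an assumption).
total_filled = toy['NumberOfPersonVisiting'].add(
    toy['NumberOfChildrenVisiting'], fill_value=0
)

print(total_plain.tolist())
print(total_filled.tolist())
```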

In [17]:
df

Unnamed: 0,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,Designation,MonthlyIncome,NumberOfVisiting
0,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3.0,Deluxe,3.0,Unmarried,1.0,1,2,1,Manager,20993.0,3.0
1,0,49.0,Company Invited,1,14.0,Salaried,Male,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,Manager,20130.0,5.0
2,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,4.0,Basic,3.0,Unmarried,7.0,1,3,0,Executive,17090.0,3.0
3,0,33.0,Company Invited,1,9.0,Salaried,Female,3.0,Basic,3.0,Divorced,2.0,1,5,1,Executive,17909.0,3.0
4,0,36.0,Self Enquiry,1,8.0,Small Business,Male,3.0,Basic,4.0,Divorced,1.0,0,5,1,Executive,18468.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4883,1,49.0,Self Enquiry,3,9.0,Small Business,Male,5.0,Deluxe,4.0,Unmarried,2.0,1,1,1,Manager,26576.0,4.0
4884,1,28.0,Company Invited,1,31.0,Salaried,Male,5.0,Basic,3.0,Unmarried,3.0,1,3,1,Executive,21212.0,6.0
4885,1,52.0,Self Enquiry,3,17.0,Salaried,Female,4.0,Standard,4.0,Married,7.0,0,1,1,Senior Manager,31820.0,7.0
4886,1,19.0,Self Enquiry,3,16.0,Small Business,Male,4.0,Basic,3.0,Unmarried,3.0,0,5,0,Executive,20289.0,5.0


### Get all the numerical features

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   ProdTaken               4888 non-null   int64  
 1   Age                     4888 non-null   float64
 2   TypeofContact           4888 non-null   object 
 3   CityTier                4888 non-null   int64  
 4   DurationOfPitch         4888 non-null   float64
 5   Occupation              4888 non-null   object 
 6   Gender                  4888 non-null   object 
 7   NumberOfFollowups       4888 non-null   float64
 8   ProductPitched          4888 non-null   object 
 9   PreferredPropertyStar   4888 non-null   float64
 10  MaritalStatus           4888 non-null   object 
 11  NumberOfTrips           4888 non-null   float64
 12  Passport                4888 non-null   int64  
 13  PitchSatisfactionScore  4888 non-null   int64  
 14  OwnCar                  4888 non-null   

In [19]:
numerical_data = [feature for feature in df.columns if df[feature].dtype != 'O']

In [20]:
numerical_data

['ProdTaken',
 'Age',
 'CityTier',
 'DurationOfPitch',
 'NumberOfFollowups',
 'PreferredPropertyStar',
 'NumberOfTrips',
 'Passport',
 'PitchSatisfactionScore',
 'OwnCar',
 'MonthlyIncome',
 'NumberOfVisiting']

In [21]:
df[numerical_data]

Unnamed: 0,ProdTaken,Age,CityTier,DurationOfPitch,NumberOfFollowups,PreferredPropertyStar,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,MonthlyIncome,NumberOfVisiting
0,1,41.0,3,6.0,3.0,3.0,1.0,1,2,1,20993.0,3.0
1,0,49.0,1,14.0,4.0,4.0,2.0,0,3,1,20130.0,5.0
2,1,37.0,1,8.0,4.0,3.0,7.0,1,3,0,17090.0,3.0
3,0,33.0,1,9.0,3.0,3.0,2.0,1,5,1,17909.0,3.0
4,0,36.0,1,8.0,3.0,4.0,1.0,0,5,1,18468.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...
4883,1,49.0,3,9.0,5.0,4.0,2.0,1,1,1,26576.0,4.0
4884,1,28.0,1,31.0,5.0,3.0,3.0,1,3,1,21212.0,6.0
4885,1,52.0,3,17.0,4.0,4.0,7.0,0,1,1,31820.0,7.0
4886,1,19.0,3,16.0,4.0,3.0,3.0,0,5,0,20289.0,5.0


In [22]:
categorical_data = [feature for feature in df.columns if df[feature].dtype == 'O']

In [23]:
df[categorical_data]

Unnamed: 0,TypeofContact,Occupation,Gender,ProductPitched,MaritalStatus,Designation
0,Self Enquiry,Salaried,Female,Deluxe,Unmarried,Manager
1,Company Invited,Salaried,Male,Deluxe,Divorced,Manager
2,Self Enquiry,Free Lancer,Male,Basic,Unmarried,Executive
3,Company Invited,Salaried,Female,Basic,Divorced,Executive
4,Self Enquiry,Small Business,Male,Basic,Divorced,Executive
...,...,...,...,...,...,...
4883,Self Enquiry,Small Business,Male,Deluxe,Unmarried,Manager
4884,Company Invited,Salaried,Male,Basic,Unmarried,Executive
4885,Self Enquiry,Salaried,Female,Standard,Married,Senior Manager
4886,Self Enquiry,Small Business,Male,Basic,Unmarried,Executive


In [24]:
# Discrete features: numerical columns with at most 25 unique values; all other numerical columns are treated as continuous
discrete_features = [feature for feature in numerical_data if len(df[feature].unique()) <= 25]

In [25]:
continues_features = [feature for feature in numerical_data if feature not in discrete_features]

In [26]:
continues_features

['Age', 'DurationOfPitch', 'MonthlyIncome']
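The same split can be written with `nunique`. This is a sketch on a tiny hypothetical frame, so the cutoff is lowered to 3 (the notebook uses 25); `MAX_DISCRETE_LEVELS` is a name introduced here for illustration. Note that `nunique` ignores missing values by default, whereas `len(df[col].unique())` counts `NaN` as a level.

```python
import pandas as pd

# Hypothetical frame; values are invented for illustration.
toy = pd.DataFrame({
    'CityTier': [1, 2, 3, 1, 2, 3],
    'MonthlyIncome': [17090.0, 17909.0, 18468.0, 20130.0, 20993.0, 26576.0],
    'Occupation': ['Salaried'] * 6,
})

MAX_DISCRETE_LEVELS = 3  # hypothetical cutoff for this tiny frame

# Restrict to numeric columns, then count distinct values per column.
n_levels = toy.select_dtypes(include='number').nunique()

discrete = n_levels[n_levels <= MAX_DISCRETE_LEVELS].index.tolist()
continuous = n_levels[n_levels > MAX_DISCRETE_LEVELS].index.tolist()

print(discrete, continuous)
```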

### Train Test Split

In [27]:
from sklearn.model_selection import train_test_split

In [28]:
x = df.drop('ProdTaken', axis=1)
y = df['ProdTaken']

In [29]:
y.value_counts()

ProdTaken
0    3968
1     920
Name: count, dtype: int64

In [30]:
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)
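Because only 920 of the 4,888 customers bought a package, a purely random split can leave the two subsets with slightly different purchase rates. One option, not used in this notebook, is to pass `stratify=y`. The sketch below uses synthetic labels with the same class counts as above and a placeholder feature; `X_toy` and `y_toy` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as y.value_counts() above.
y_toy = np.array([0] * 3968 + [1] * 920)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

# stratify keeps the class proportions (almost) identical in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42
)

overall_rate = y_toy.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(round(overall_rate, 4), round(train_rate, 4), round(test_rate, 4))
```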

In [31]:
categorical_data

['TypeofContact',
 'Occupation',
 'Gender',
 'ProductPitched',
 'MaritalStatus',
 'Designation']

In [32]:
cat_features = x.select_dtypes(include='object').columns
num_features = x.select_dtypes(exclude='object').columns

In [33]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# ColumnTransformer applies different transformers to different column subsets (here: one-hot encoding for the categorical columns and standard scaling for the numerical ones)
from sklearn.compose import ColumnTransformer

numerica_transformer = StandardScaler()
oh_transformer = OneHotEncoder(drop='first')

preprocessor = ColumnTransformer(
    [
        ('OneHotEncoder', oh_transformer, cat_features),
        ('StandardScaler', numerica_transformer, num_features)
    ]
)

In [34]:
preprocessor

In [35]:
# Fit the preprocessor on the training data only, then apply the same fitted transform to the test data (this avoids data leakage)
x_train = preprocessor.fit_transform(x_train)
x_test = preprocessor.transform(x_test)
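To make the effect of the preprocessor concrete, here is a sketch on a tiny hypothetical frame. With `drop='first'`, a categorical column with k levels yields k - 1 indicator columns, and the test row is scaled with the mean and standard deviation learned from the training rows. `sparse_threshold=0` is added here only so the result is always a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training and test rows.
train = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'ProductPitched': ['Basic', 'Deluxe', 'Standard', 'Basic'],
    'Age': [30.0, 40.0, 50.0, 60.0],
})
test = pd.DataFrame({
    'Gender': ['Female'],
    'ProductPitched': ['Deluxe'],
    'Age': [45.0],
})

pre = ColumnTransformer(
    [
        ('onehot', OneHotEncoder(drop='first'), ['Gender', 'ProductPitched']),
        ('scale', StandardScaler(), ['Age']),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_arr = pre.fit_transform(train)  # learns categories, mean, and std
test_arr = pre.transform(test)        # reuses what was learned on train

print(train_arr.shape)
print(test_arr)
```

Gender (2 levels) contributes one column, ProductPitched (3 levels) contributes two, and Age contributes one, giving four columns in total.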

### Random forest classifier

In [48]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [55]:
models = [RandomForestClassifier(), DecisionTreeClassifier()]
for i, model in enumerate(models):
    model.fit(x_train, y_train)

    print(f"--------------Model : {i}----------------")
    # training performance :
    y_train_pred = model.predict(x_train)

    # Testing performance :
    y_test_pred = model.predict(x_test)

    # Training Score : 
    print(accuracy_score(y_train, y_train_pred))
    print(classification_report(y_train, y_train_pred))

    # Testing score
    print(accuracy_score(y_test, y_test_pred))
    print(classification_report(y_test, y_test_pred))

--------------Model : 0----------------
1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2969
           1       1.00      1.00      1.00       697

    accuracy                           1.00      3666
   macro avg       1.00      1.00      1.00      3666
weighted avg       1.00      1.00      1.00      3666

0.9214402618657938
              precision    recall  f1-score   support

           0       0.92      0.99      0.95       999
           1       0.95      0.60      0.74       223

    accuracy                           0.92      1222
   macro avg       0.93      0.80      0.85      1222
weighted avg       0.92      0.92      0.91      1222

--------------Model : 1----------------
1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2969
           1       1.00      1.00      1.00       697

    accuracy                           1.00      3666
   macro avg       1.
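With 999 of the 1,222 test rows in class 0, accuracy alone can look flattering. A quick worked calculation, using only the support counts and the random forest test accuracy printed above, compares the model with a baseline that always predicts "no purchase":

```python
# Support counts taken from the test-set classification report above.
n_negative, n_positive = 999, 223
n_total = n_negative + n_positive

# Accuracy of a classifier that always predicts class 0.
majority_baseline = n_negative / n_total

# Reported test accuracy of the random forest (Model 0).
rf_accuracy = 0.9214402618657938
lift_over_baseline = rf_accuracy - majority_baseline

print(round(majority_baseline, 4), round(lift_over_baseline, 3))
```

The baseline already reaches about 82% accuracy, which is why the recall of 0.60 on class 1 is the more informative number for targeting buyers.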

#### Hyperparameter Tuning 

In [56]:
rf_params = {
    'max_depth' : [5, 8, 15, None, 10],
    'max_features' : [5, 7, 'auto', 8],  # 'auto' is only accepted by scikit-learn < 1.3
    'min_samples_split' : [2, 8, 15, 20],
    'n_estimators' : [100, 200, 500, 1000]
}
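It is useful to know how large this search space is before sampling from it. The sketch below counts the combinations in the grid and compares that with `n_iter=100`, the value used in the randomized search later in this notebook.

```python
from math import prod

# Same grid as in the cell above.
rf_params = {
    'max_depth': [5, 8, 15, None, 10],
    'max_features': [5, 7, 'auto', 8],
    'min_samples_split': [2, 8, 15, 20],
    'n_estimators': [100, 200, 500, 1000],
}

# Number of distinct parameter combinations: 5 * 4 * 4 * 4.
n_combinations = prod(len(values) for values in rf_params.values())

n_iter = 100  # sampled candidates in the randomized search
fraction_sampled = n_iter / n_combinations

print(n_combinations, fraction_sampled)
```

A randomized search with 100 iterations therefore tries a little under a third of the 320 possible settings.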

In [68]:
# Model lists

models = [
    ('RF', RandomForestClassifier(), rf_params)
]

In [69]:
models

[('RF',
  RandomForestClassifier(),
  {'max_depth': [5, 8, 15, None, 10],
   'max_features': [5, 7, 'auto', 8],
   'min_samples_split': [2, 8, 15, 20],
   'n_estimators': [100, 200, 500, 1000]})]

In [70]:
from sklearn.model_selection import RandomizedSearchCV

In [71]:
model_params = {}
for name, model, params in models:
    randomcv = RandomizedSearchCV(estimator=model, param_distributions=params, n_iter=100, cv=3, verbose=3, n_jobs=-1)

    randomcv.fit(x_train, y_train)

    model_params[name] = randomcv.best_params_


for model_name in model_params:

    print(f"-----------------------------Best params for the model : {model_name}------------------------------")
    print(model_params[model_name])

Fitting 3 folds for each of 100 candidates, totalling 300 fits
-----------------------------Best params for the model : RF------------------------------
{'n_estimators': 1000, 'min_samples_split': 2, 'max_features': 7, 'max_depth': None}
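The search above samples candidates at random and scores them by accuracy (the default for classifiers), so repeated runs may pick different parameters. The following sketch shows one way to make a search reproducible and to score on F1 instead. It runs on synthetic data from `make_classification` with a deliberately tiny grid; the data, the grid `small_params`, and the choice of `scoring='f1'` are all assumptions for illustration, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced data standing in for the real features.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Hypothetical small grid so the example runs quickly.
small_params = {
    'max_depth': [3, None],
    'n_estimators': [10, 20],
    'min_samples_split': [2, 8],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=small_params,
    n_iter=4,
    cv=3,
    scoring='f1',     # score on the positive class instead of accuracy
    random_state=0,   # fixes which candidates are sampled
)
search.fit(X_toy, y_toy)

best_params = search.best_params_
best_cv_f1 = search.best_score_
print(best_params)
```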


In [72]:
models = [RandomForestClassifier(n_estimators=1000, min_samples_split=2, max_features=7, max_depth=None)]
for i, model in enumerate(models):
    model.fit(x_train, y_train)

    print(f"--------------Model : {i}----------------")
    # training performance :
    y_train_pred = model.predict(x_train)

    # Testing performance :
    y_test_pred = model.predict(x_test)

    # Training Score : 
    print(accuracy_score(y_train, y_train_pred))
    print(classification_report(y_train, y_train_pred))

    # Testing score
    print(accuracy_score(y_test, y_test_pred))
    print(classification_report(y_test, y_test_pred))

--------------Model : 0----------------
1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2969
           1       1.00      1.00      1.00       697

    accuracy                           1.00      3666
   macro avg       1.00      1.00      1.00      3666
weighted avg       1.00      1.00      1.00      3666

0.9320785597381342
              precision    recall  f1-score   support

           0       0.93      0.99      0.96       999
           1       0.95      0.66      0.78       223

    accuracy                           0.93      1222
   macro avg       0.94      0.83      0.87      1222
weighted avg       0.93      0.93      0.93      1222

