
# Travel Package Purchase Prediction
# Description
**Background and Context**
You are a Data Scientist for a tourism company named "Visit with us". The Policy Maker of the company wants to enable and establish a viable business model to expand the customer base.

A viable business model is a central concept that helps you to understand the existing ways of doing the business and how to change the ways for the benefit of the tourism sector.

One of the ways to expand the customer base is to introduce a new offering of packages.

Currently, there are 5 types of packages the company is offering - Basic, Standard, Deluxe, Super Deluxe, King. Looking at the data of the last year, we observed that 18% of the customers purchased the packages.

However, it was difficult to identify the potential customers because customers were contacted at random without looking at the available information.

The company is now planning to launch a new product i.e. Wellness Tourism Package. Wellness Tourism is defined as Travel that allows the traveler to maintain, enhance or kick-start a healthy lifestyle, and support or increase one's sense of well-being.

This time, the company wants to harness the available data of existing and potential customers to target the right customers.

As a Data Scientist at the "Visit with us" travel company, you have to analyze the customers' data and information, provide recommendations to the Policy Maker, and build a model to predict which potential customers are going to purchase the newly introduced travel package. The model will be built to make predictions before a customer is contacted.

**Objective**
To predict which customer is more likely to purchase the newly introduced travel package.

**Data Dictionary**
Customer details:
* CustomerID: Unique customer ID
* ProdTaken: Whether the customer has purchased a package or not (0: No, 1: Yes)
* Age: Age of customer
* TypeofContact: How customer was contacted (Company Invited or Self Inquiry)
* CityTier: City tier depends on the development of a city, population, facilities, and living standards. The categories are ordered, i.e. Tier 1 > Tier 2 > Tier 3. It's the city the customer lives in.
* Occupation: Occupation of customer
* Gender: Gender of customer
* NumberOfPersonVisiting: Total number of persons planning to take the trip with the customer
* PreferredPropertyStar: Preferred hotel property rating by customer
* MaritalStatus: Marital status of customer
* NumberOfTrips: Average number of trips in a year by customer
* Passport: The customer has a passport or not (0: No, 1: Yes)
* OwnCar: Whether the customer owns a car or not (0: No, 1: Yes)
* NumberOfChildrenVisiting: Total number of children with age less than 5 planning to take the trip with the customer
* Designation: Designation of the customer in the current organization
* MonthlyIncome: Gross monthly income of the customer
* PitchSatisfactionScore: Sales pitch satisfaction score
* ProductPitched: Product pitched by the salesperson
* NumberOfFollowups: Total number of follow-ups done by the salesperson after the sales pitch
* DurationOfPitch: Duration of the pitch by a salesperson to the customer
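
Because the tiers are ordered, one option is to store `CityTier` as an ordered categorical so that comparisons follow the tier ranking instead of the raw integers. The snippet below is only a sketch on made-up values (the `city_tier` series and `tier_dtype` name are illustrative and not part of the notebook):

```python
import pandas as pd

# Made-up values for illustration; the real column comes from Tourism.xlsx
city_tier = pd.Series([1, 3, 2, 1, 3], name="CityTier")

# Tier 1 > Tier 2 > Tier 3, so the categories are listed from lowest to highest rank
tier_dtype = pd.CategoricalDtype(categories=[3, 2, 1], ordered=True)
city_tier_ordered = city_tier.astype(tier_dtype)

# Comparisons now respect the tier ranking rather than the integer values
is_above_tier3 = city_tier_ordered > 3
print(is_above_tier3.tolist())
```

The notebook itself later treats `CityTier` as a plain (unordered) category, which is also a reasonable choice for tree-based models.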

# Import Libraries

In [None]:
# libraries to ignore warnings
import warnings
warnings.filterwarnings('ignore')

# libraries to read and manipulate the data
import numpy as np
import pandas as pd

# libraries used for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# to enable plotting graphs in Jupyter notebook
%matplotlib inline

# library to split the data into train and test set
from sklearn.model_selection import train_test_split

# library to impute missing values
from sklearn.impute import SimpleImputer

# Libraries to import decision tree classifier and different ensemble classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor,RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, StackingRegressor
from xgboost import XGBRegressor
from sklearn import tree
import scipy.stats as stats

# libraries for model evaluation and metrics
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# library used for hyper parameter tuning
from sklearn.model_selection import GridSearchCV

# EDA
# Read the data set

In [None]:
# read the excel file and using Tourism sheet as our data
url = 'Tourism.xlsx'
data = pd.read_excel(url, sheet_name='Tourism')

# copying data to another variable to avoid any changes to original data
travel = data.copy()
print(f"There are {travel.shape[0]} rows and {travel.shape[1]} columns.")

# First and last 5 rows of the dataset

In [None]:
travel.head()

In [None]:
travel.tail()

In [None]:
# let's view a sample of the data
travel.sample(n=10, random_state=1)

Check for duplicates in the data set

In [None]:
travel[travel.duplicated()].count()



*   There are no duplicate entries.
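
One caveat worth keeping in mind: a unique ID column can hide rows that are otherwise identical. A minimal sketch on a made-up frame (the `toy` data is an assumption, not taken from the dataset) that counts duplicates with and without the ID column:

```python
import pandas as pd

# Made-up rows: IDs are unique, but rows 1/3 and 2/4 are otherwise identical
toy = pd.DataFrame(
    {
        "CustomerID": [1, 2, 3, 4],
        "Age": [30, 41, 30, 41],
        "Gender": ["Male", "Female", "Male", "Female"],
    }
)

# Fully identical rows (ID included)
n_full_duplicates = int(toy.duplicated().sum())

# Rows that repeat once the ID column is ignored
n_dupes_ignoring_id = int(toy.drop(columns="CustomerID").duplicated().sum())

print(n_full_duplicates, n_dupes_ignoring_id)
```

Whether such repeats are true duplicates or simply different customers with the same attributes is a judgment call that depends on the data.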




Check the data types of the columns for the dataset

In [None]:
travel.info()

* The dataset has 20 columns: 14 numerical columns and 6 object columns.
* We can convert object columns into categorical columns; converting "object" to "category" reduces the space required to store the dataframe.
* 'Designation', 'ProdTaken', 'OwnCar', 'Passport', 'CityTier', 'MaritalStatus', 'ProductPitched', 'Gender', 'Occupation' and 'TypeofContact' are categorical columns.
* The dataset seems to have missing values in a few variables.
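
The memory saving from the category conversion can be checked directly with `memory_usage(deep=True)`. This is a small sketch on a made-up column (the `toy` frame is illustrative only):

```python
import pandas as pd

# Made-up column with a handful of repeated labels, similar in spirit to Occupation
toy = pd.DataFrame({"Occupation": ["Salaried", "Small Business", "Large Business"] * 1000})

# Bytes used when the labels are stored as text
bytes_as_text = toy["Occupation"].memory_usage(deep=True)

# Bytes used when the labels are stored as integer codes plus a small lookup table
bytes_as_category = toy["Occupation"].astype("category").memory_usage(deep=True)

print(bytes_as_text, bytes_as_category)
```

The saving is largest for columns with few distinct values repeated many times.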

Fixing the data types

In [None]:
## Converting the data type of categorical features to 'category'

cat_cols = ['Designation','ProdTaken', 'OwnCar', 'Passport', 'CityTier', 'MaritalStatus', 'ProductPitched', 'Gender', 'Occupation', 'TypeofContact']

In [None]:
for i in cat_cols:
    travel[i] = travel[i].astype('category')

In [None]:
travel.info()

**Check the unique values of the columns**

In [None]:
# check for number of unique values in each column
travel.nunique().sort_values(ascending=False)

**Dropping CustomerID column**

In [None]:
# Dropping columns which are not adding any information.
travel.drop('CustomerID',axis=1,inplace=True)
travel.shape


In [None]:
# let's check the value counts of features which have few unique values
for col in travel.columns:
    if travel[col].nunique() <= 6:
        print('***********************', col, '*******************')
        print(travel[col].value_counts(normalize=True)*100)
        print('-'*50)

* ProdTaken: This is our target variable. Only 18% of the customers took the product.
* TypeofContact: 70% of the customers made a self enquiry and only 30% were contacted by company invite.
* CityTier: 65% of the customers are from Tier 1, 30% are from Tier 3 and only 4% are from Tier 2.
* Occupation: Most of the customers are Salaried (48%) or have a Small Business (42%).
* Gender: 60% of the customers are Male. A few Female values are wrongly entered as Fe Male and should be corrected.
* NumberOfPersonVisiting: Nearly 80% of the customers are travelling with 3 persons.
* NumberOfFollowups: More than 80% of the customers had more than 3 follow-ups.
* ProductPitched: Basic and Deluxe are the most popular of the 5 products. 70% of the customers were pitched either the Basic or the Deluxe package.
* PreferredPropertyStar: 61% of the customers prefer 3 star hotels.
* MaritalStatus: Most of the customers are Married.
* Passport: Most of the customers do not have a passport. Only 30% of the customers have one.
* PitchSatisfactionScore: Nearly 70% of the customers gave a rating of 3 or more.
* OwnCar: 62% of the customers own a car.
* NumberOfChildrenVisiting: 70% of the customers travelled with 1 or 2 children under 5 years. 22% of the customers have no children under 5 years travelling with them.
* Designation: Most of the customers are Executives or Managers.



**Gender column correction**
* Replacing 'Fe Male' with 'Female'

In [None]:
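# Gender is already a categorical column: convert to plain strings, fix the typo, then convert back
travel['Gender'] = travel['Gender'].astype('object').replace({'Fe Male': 'Female'}).astype('category')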

In [None]:
travel['Gender'].value_counts()

Check for the missing values:

In [None]:
travel.isnull().sum().sum()

In [None]:
travel.isnull().sum().sort_values(ascending=False)

In [None]:
# Calculating the percentage of missing values in the dataset
travel.isnull().sum().sort_values(ascending=False)/travel.shape[0] * 100

* Out of 19 columns, 8 columns have missing values.
* DurationOfPitch, MonthlyIncome, Age, NumberOfTrips, NumberOfChildrenVisiting, NumberOfFollowups, PreferredPropertyStar and TypeofContact have missing values.
* DurationOfPitch has 5% missing data; the duration may not have been recorded for some customers.
* MonthlyIncome has 4.7% missing data.
* Age has 4.6% missing data.
* NumberOfTrips, NumberOfChildrenVisiting, NumberOfFollowups, PreferredPropertyStar and TypeofContact each have less than 3% missing data.
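
The percentages above can also be obtained in one step, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps
toy = pd.DataFrame(
    {
        "Age": [25, np.nan, 40, 35],
        "MonthlyIncome": [20000, 21000, np.nan, np.nan],
        "Passport": [0, 1, 0, 1],
    }
)

# Mean of the null mask = share of missing values per column, expressed in percent
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)

# Keep only the columns that actually have gaps
cols_with_missing = missing_pct[missing_pct > 0].index.tolist()

print(missing_pct)
print(cols_with_missing)
```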

**Missing numeric values can be filled with median (real) values.**

In [None]:
# we will replace missing values in these columns with its median
medianFiller = lambda x: x.fillna(x.median())
NaNcolumns = ['DurationOfPitch','MonthlyIncome','Age','NumberOfTrips','NumberOfChildrenVisiting','NumberOfFollowups','PreferredPropertyStar']
print (NaNcolumns)

In [None]:
# apply the lambda function on NaNcolumns for missing values
travel[NaNcolumns] = travel[NaNcolumns].apply(medianFiller,axis=0)

In [None]:
# looking at which columns have missing values
travel.isnull().sum().sort_values(ascending=False)

In [None]:
travel.TypeofContact.unique()

* Remove the rows with a missing TypeofContact: its values are either 'Self Enquiry' or 'Company Invited', so a median cannot be computed.
* Only 25 TypeofContact values are missing (~0.5% of rows), so the impact is negligible.

In [None]:
# Remove the rows of data which have missing value(s)
travel.dropna(inplace=True)

In [None]:
# check if there are any missing values
travel.isnull().sum().sum()

**Missing values are treated.**
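
One thing to be aware of: the median fill above uses the whole dataset, including rows that will later end up in the test set. An alternative, shown here only as a sketch on made-up data and not as what this notebook does, is to split first and fit `SimpleImputer` on the training rows only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up data with gaps in Age
toy = pd.DataFrame(
    {
        "Age": [25.0, np.nan, 40.0, 35.0, 50.0, np.nan, 30.0, 45.0],
        "ProdTaken": [0, 0, 1, 0, 1, 0, 0, 1],
    }
)

X_train, X_test, y_train, y_test = train_test_split(
    toy[["Age"]], toy["ProdTaken"], test_size=0.25, random_state=1
)

# Learn the median from the training rows only, then reuse it for the test rows
imputer = SimpleImputer(strategy="median")
X_train_filled = imputer.fit_transform(X_train)
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)
```

This keeps information from the test rows out of the preprocessing step, at the cost of a slightly more involved pipeline.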

In [None]:
# New dataframe shape
print(f"There are {travel.shape[0]} rows and {travel.shape[1]} columns.")


Summary of the dataset

In [None]:
travel.info()

In [None]:
# View summary of dataset numerical variables
travel.describe().T

Observations

* Age: mean and median are 37.59 and 36 respectively, which suggests a roughly symmetric distribution.
* DurationOfPitch: mean is 15.38 while the median is 13.0, so most pitches are shorter than the mean; the maximum of 127.0 indicates possible outliers.
* NumberOfPersonVisiting: mean and median are 2.91 and 3.0 respectively, which suggests a roughly symmetric distribution.
* NumberOfFollowups: mean is 3.71 vs a median of 4.0, indicating a slight left skew; at least half of the customers had 4 or more follow-ups.
* PreferredPropertyStar: mean is 3.58 vs a median of 3.0, indicating that 3 star hotels are preferred over 4 and 5 star hotels.
* NumberOfTrips: mean is 3.23 vs a median of 3.0, so most customers take 3 trips or fewer; the maximum of 22.0 indicates outliers.
* PitchSatisfactionScore: mean is 3.08 vs a median of 3.0, which suggests a roughly symmetric distribution.
* NumberOfChildrenVisiting: mean is 1.19 vs a median of 1.0, so most customers travel with 1 child or none.
* MonthlyIncome: mean is 23565.41 vs a median of 22347, so most customers earn less than the mean.
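
The mean-versus-median reading above can be backed up with the sample skewness from `Series.skew()`. A small sketch on made-up pitch durations (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up durations with one extreme value, loosely imitating DurationOfPitch
toy = pd.Series([6, 8, 9, 10, 12, 13, 14, 15, 30, 127], name="DurationOfPitch")

summary = {
    "mean": toy.mean(),      # pulled upwards by the extreme value
    "median": toy.median(),  # robust to the extreme value
    "skew": toy.skew(),      # positive for a right-skewed sample
}
print(summary)
```

A mean above the median together with a positive skew points to a long right tail; values near zero suggest a roughly symmetric distribution.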

In [None]:
# View summary of dataset categorical variables
travel.describe(include=['category']).T


Observations

* 81.1% of the customers in the dataset did not take the package.
* 70% of the customers made a self enquiry.
* 65% of the customers are from Tier 1.
* 48% of customers are salaried and they form the biggest occupation type of all customers.
* 60% of customers are male.
* 37.7% of customers were pitched the Basic package which is the most frequent.
* 47.8% of customers are married and they form the biggest marital status type of all customers.
* 70.8% of customers do not have a passport.
* 62% of customers own cars.
* 37.7% of customers are executives and they form the biggest designation type of all customers.

# **Univariate Analysis**


In [None]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

In [None]:
travel.info()

# Univariate analysis for numerical columns
**Observations on Age**

In [None]:
histogram_boxplot(travel, 'Age')

* The distribution of age is approximately normal, with no outliers.
* A significant number of customers are aged around 36.

**Observations on DurationOfPitch**

In [None]:
histogram_boxplot(travel, "DurationOfPitch")




*   The vast majority of pitch durations are below 15, with a small number of outliers indicating extremely long pitches.
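
The points drawn beyond the boxplot whiskers follow the usual 1.5 × IQR rule, so they can be listed explicitly. This is a sketch on made-up values; `iqr_outlier_mask` is a hypothetical helper, not a function from the notebook:

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up durations, loosely imitating DurationOfPitch
toy = pd.Series([6, 8, 9, 10, 12, 13, 14, 15, 30, 127])
outliers = toy[iqr_outlier_mask(toy)]
print(outliers.tolist())
```

Listing the flagged rows makes it easier to decide whether they are data entry errors or genuine extreme cases before capping or dropping anything.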




**Observations on NumberOfPersonVisiting**

In [None]:
histogram_boxplot(travel, "NumberOfPersonVisiting")


In [None]:
travel['NumberOfPersonVisiting'].value_counts()


**Observations on NumberOfFollowups**

In [None]:
histogram_boxplot(travel, "NumberOfFollowups")

In [None]:
travel['NumberOfFollowups'].value_counts()


* Most customers received 4 follow-ups, followed by 3, with outliers at 1 and 6 follow-ups.

**Observations on PreferredPropertyStar**

In [None]:
histogram_boxplot(travel, "PreferredPropertyStar")


In [None]:
travel['PreferredPropertyStar'].value_counts()


* A majority of customers preferred hotel properties rated 3 or 4 stars.

**Observations on NumberOfTrips**

In [None]:
histogram_boxplot(travel, "NumberOfTrips")

In [None]:
travel["NumberOfTrips"].value_counts()

* A majority of trip numbers are 3 and below with a small number of outliers indicating a significantly higher number of trips.

**Observations on PitchSatisfactionScore**

In [None]:
histogram_boxplot(travel, "PitchSatisfactionScore")


In [None]:
travel["PitchSatisfactionScore"].value_counts()




*   Most pitch satisfaction ratings are 3.0 / 5.0.




**Observations on NumberOfChildrenVisiting**

In [None]:
histogram_boxplot(travel, "NumberOfChildrenVisiting")

In [None]:
travel["NumberOfChildrenVisiting"].value_counts()

* Most customers have up to 2 children on the trip with 1 child being the most common.

**Observations on MonthlyIncome**

In [None]:

histogram_boxplot(travel, "MonthlyIncome")


* Most customers have monthly income around 24K with a small number of outliers of significantly higher income customers.

# **Univariate analysis for categorical columns**

In [None]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

In [None]:
travel.info()

**Observations on ProdTaken**

In [None]:
labeled_barplot(travel, 'ProdTaken', perc=True)

* 18.9% of customers in dataset took the package while vast majority at 81.1% did not.
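
With a target this imbalanced, it is usually worth keeping the class ratio the same in the train and test sets. A minimal sketch using the `stratify` argument of `train_test_split`, on made-up data with a 20% positive rate (the `toy` frame is an assumption for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 20 buyers and 80 non-buyers
toy = pd.DataFrame({"Age": range(100), "ProdTaken": [1] * 20 + [0] * 80})

X = toy.drop(columns="ProdTaken")
y = toy["ProdTaken"]

# stratify=y keeps the share of buyers (roughly) equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

print(y_train.mean(), y_test.mean())
```

Without stratification, a random split of a small or imbalanced dataset can leave the test set with noticeably more or fewer buyers than the training set.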

**Observations on TypeofContact**

In [None]:
labeled_barplot(travel, 'TypeofContact', perc=True)

* Most customers (70.8%) contacted the company through self enquiry.

**Observations on CityTier**

In [None]:
labeled_barplot(travel, 'CityTier', perc=True)


* The majority of customers (65.2%) are from Tier 1 cities.

**Observations on Occupation**

In [None]:
labeled_barplot(travel, 'Occupation', perc=True)

* Most customers are salaried employees followed by small business.

**Observations on Gender**

In [None]:
labeled_barplot(travel, 'Gender', perc=True)

* More customers are male.

**Observations on ProductPitched**

In [None]:
labeled_barplot(travel, 'ProductPitched', perc=True)

* Most popular packages pitched were Basic and Deluxe packages.

**Observations on MaritalStatus**

In [None]:
labeled_barplot(travel, 'MaritalStatus', perc=True)

* Majority of the customers are Married.

**Observations on Passport**

In [None]:
labeled_barplot(travel, 'Passport', perc=True)

* Most customers (70.8%) do not have a passport.

**Observations on OwnCar**

In [None]:
labeled_barplot(travel, 'OwnCar', perc=True)

* Most customers (62.1%) own a car.

**Observations on Designation**

In [None]:
labeled_barplot(travel, 'Designation', perc=True)

* Most customers are executives or managers followed by senior managers.

# **Bivariate Analysis**

In [None]:
# function to plot stacked bar chart


def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # place the legend outside the plot area
    plt.show()

In [None]:
plt.figure(figsize=(15, 5))
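# numeric_only=True skips the categorical columns, which recent pandas versions no longer drop silently
sns.heatmap(travel.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")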
plt.show()

**Observations**

* NumberOfPersonVisiting and NumberOfChildrenVisiting seem to be mildly positively correlated, as both count the people travelling with the customer.
* Age and MonthlyIncome show some correlation.
* There does not seem to be any other notable correlation among the numeric variables. Several of them take only a few discrete values, so they can be converted to categorical: 'NumberOfPersonVisiting', 'NumberOfFollowups', 'PreferredPropertyStar', 'PitchSatisfactionScore', 'NumberOfChildrenVisiting'

In [None]:
travel['NumberOfPersonVisiting'] = travel["NumberOfPersonVisiting"].astype("category")
travel["NumberOfFollowups"] = travel["NumberOfFollowups"].astype("category")
travel["PreferredPropertyStar"] = travel["PreferredPropertyStar"].astype("category")
travel["PitchSatisfactionScore"] = travel["PitchSatisfactionScore"].astype("category")
travel["NumberOfChildrenVisiting"] = travel["NumberOfChildrenVisiting"].astype("category")

In [None]:
travel.info()


In [None]:
sns.pairplot(travel, diag_kind='kde', vars=["Age","DurationOfPitch","NumberOfTrips","MonthlyIncome"], hue="ProdTaken")


* There does not seem to be any pattern in the remaining numeric variables.


**Observations on ProdTaken vs Numeric Values**

In [None]:
cols = travel[['Age','DurationOfPitch','NumberOfTrips', 'MonthlyIncome']].columns.tolist()
plt.figure(figsize=(12,12))

for i, variable in enumerate(cols):
    plt.subplot(3, 2, i + 1)
    sns.boxplot(data=travel, x="ProdTaken", y=variable, palette="PuBu")
    plt.tight_layout()
    plt.title(variable)
plt.show()

Observations

* Age median and distribution from 25th to 75th percentile of package taking customers are lower than non-package taking customers. The age range of customers taking up packages is between ~ 28 to just above 40.

* DurationOfPitch median and distribution are rather similar between package and non package takers with non package takers registering a number of outliers with higher pitch durations. Package takers have a marginally higher median and distribution.

* NumberOfTrips, for customer number of trips in a year, median and distribution are rather similar between package and non package takers with both registering a number of outliers on the higher end of number of trips.

* Monthly Income median and distribution are rather similar between package and non package takers with both registering a number of outliers. Non Package takers have a marginally higher median and distribution as well as lower end and much higher end outliers in monthly income.

In [None]:
travel.info()

**Observations on ProdTaken vs TypeofContact**

In [None]:
stacked_barplot(travel, 'TypeofContact', 'ProdTaken')

* Company invited customers are slightly more likely to take up a package.

**Observations on ProdTaken vs CityTier**

In [None]:
stacked_barplot(travel, 'CityTier', 'ProdTaken')


* Tier 2 and 3 customers are slightly more likely to take up a package than Tier 1 residents.


**Observations on ProdTaken vs Occupation**

In [None]:
stacked_barplot(travel, 'Occupation', 'ProdTaken')


* Free Lancers appear very likely to take up a package, followed by Large Business customers. However, this is unlikely to be statistically meaningful given that there are only 2 freelancers in the data set.
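
To keep small groups like this in view, the purchase rate can be reported next to the group size. A sketch on made-up rows (the `toy` frame and the column names `customers` and `purchase_rate` are illustrative):

```python
import pandas as pd

# Made-up data: a tiny group with a 100% rate next to a larger group with a 20% rate
toy = pd.DataFrame(
    {
        "Occupation": ["Free Lancer"] * 2 + ["Salaried"] * 10,
        "ProdTaken": [1, 1] + [1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    }
)

# Group size and purchase rate side by side
rates = toy.groupby("Occupation")["ProdTaken"].agg(customers="count", purchase_rate="mean")
print(rates)
```

A rate computed on a handful of customers should be read with much more caution than one computed on hundreds.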

**Observations on ProdTaken vs Gender**

In [None]:
stacked_barplot(travel, 'Gender', 'ProdTaken')


* There is not much difference in either gender of customers to determine a higher take up of a package.


**Observations on ProdTaken vs NumberOfPersonVisiting**

In [None]:
stacked_barplot(travel, 'NumberOfPersonVisiting', 'ProdTaken')


* Customers travelling with 2 to 4 people are more likely to take up a package.

**Observations on ProdTaken vs NumberOfFollowups**

In [None]:
stacked_barplot(travel, 'NumberOfFollowups', 'ProdTaken')


* A higher number of follow-ups is associated with a higher package take-up rate.

**Observations on ProdTaken vs ProductPitched**

In [None]:
stacked_barplot(travel, 'ProductPitched', 'ProdTaken')

* Basic followed by Standard then Deluxe packages have higher success rate of take up when pitched to customers.

**Observations on ProdTaken vs PreferredPropertyStar**

In [None]:
stacked_barplot(travel, 'PreferredPropertyStar', 'ProdTaken')


*  5 star hotel preferred customers are more likely to take up a package.

**Observations on ProdTaken vs MaritalStatus**

In [None]:
stacked_barplot(travel, 'MaritalStatus', 'ProdTaken')


* Singles are more likely to pick up a package.


**Observations on ProdTaken vs Passport**

In [None]:
stacked_barplot(travel, 'Passport', 'ProdTaken')


* Customers with passports are much more likely to pick up a package.
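
A visual difference like this can be checked with a chi-square test of independence on the contingency table. The counts below are made up for illustration and are not the notebook's actual crosstab:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts: rows are Passport (0/1), columns are ProdTaken (0/1)
table = pd.DataFrame(
    {0: [300, 90], 1: [40, 70]},
    index=pd.Index([0, 1], name="Passport"),
)
table.columns.name = "ProdTaken"

# Null hypothesis: having a passport and taking a package are independent
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p-value={p_value:.4g}, dof={dof}")
```

On the real data, the table would come from `pd.crosstab(travel['Passport'], travel['ProdTaken'])`. A small p-value suggests an association but says nothing about its direction or cause.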

**Observations on ProdTaken vs PitchSatisfactionScore**

In [None]:
stacked_barplot(travel, 'PitchSatisfactionScore', 'ProdTaken')


* Customers who rated 3.0 or more for pitch satisfaction are more likely to pick up a package.


**Observations on ProdTaken vs OwnCar**

In [None]:
stacked_barplot(travel, 'OwnCar', 'ProdTaken')

* There is no difference in either owning or not owning a car to determine a higher take up of a package by a customer.

**Observations on ProdTaken vs NumberOfChildrenVisiting**

In [None]:
stacked_barplot(travel, 'NumberOfChildrenVisiting', 'ProdTaken')

* There is no discernible difference in number of children travelling with customer to determine a higher take up of a package by a customer.

**Observations on ProdTaken vs Designation**

In [None]:
stacked_barplot(travel, 'Designation', 'ProdTaken')

* Executives are more likely to take up a package.

**Build customer profile**
* All who took the different packages are filtered here.

In [None]:

# filter on the cleaned dataframe itself rather than the raw `data` copy
travelProdTaken = travel.loc[travel["ProdTaken"] == 1]

In [None]:
travelProdTaken.shape

In [None]:
travelProdTaken.head()

In [None]:
# let's view a sample of the data
travelProdTaken.sample(n=10, random_state=1)

In [None]:
travelProdTaken.tail()

**Observations on ProductPitched vs Numeric Values**

In [None]:
cols = travelProdTaken[['Age','NumberOfTrips','MonthlyIncome']].columns.tolist()
plt.figure(figsize=(12,12))

for i, variable in enumerate(cols):
    plt.subplot(3, 2, i + 1)
    sns.boxplot(data=travelProdTaken, x="ProductPitched", y=variable)
    plt.tight_layout()
    plt.title(variable)
plt.show()

**Observations**

* **Age**
> * Basic packages are mainly taken by the younger groups, from late 20s to late 30s.
> * Deluxe package takers mainly range in age from early 30s to early 40s.
> * Standard package takers mainly range in age from mid 30s to late 40s.
> * Super Deluxe package takers mainly range in age from early 40s to mid 40s.
> * King package takers mainly range in age from early 40s to mid 50s.


* **NumberOfTrips**
> * Basic package takers mainly travel 2 to 3 times a year, with several outliers.
> * Deluxe package takers mainly travel 2 to 5 times a year.
> * King package takers mainly travel 2 to 3 times a year, with some outliers.
> * Standard package takers mainly travel 2 to 4 times a year, with an outlier.
> * Super Deluxe package takers travel 1 to 5 times a year.

* **MonthlyIncome**
> * Basic package takers' income mostly ranges from 17.5K to ~ 21K.
> * Deluxe package takers' income mostly ranges from 21K to just shy of 25K.
> * Standard package takers' income mostly ranges from 24K to ~ 29K.
> * Super Deluxe package takers' income mostly ranges from just below 28K to ~ 32K.
> * King package takers' income ranges from 35K and above.
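
The ranges above are read off the boxplots; the same quartiles can be computed directly with a grouped quantile. A sketch on made-up incomes (the `toy` frame is an assumption, not the notebook's data):

```python
import pandas as pd

# Made-up incomes for two packages
toy = pd.DataFrame(
    {
        "ProductPitched": ["Basic"] * 4 + ["King"] * 4,
        "MonthlyIncome": [17000, 18000, 20000, 21000, 34000, 36000, 38000, 40000],
    }
)

# 25th and 75th percentile per package: the edges of each box in the boxplot
income_iqr = (
    toy.groupby("ProductPitched")["MonthlyIncome"].quantile([0.25, 0.75]).unstack()
)
print(income_iqr)
```

The same call on `travelProdTaken` with `Age` or `NumberOfTrips` would give numeric bounds for the other profile attributes.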









In [None]:
travelProdTaken.info()

**Observations on ProductPitched vs CityTier**

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(data = travelProdTaken, x = "ProductPitched", hue = "CityTier")
plt.legend(bbox_to_anchor=(1, 1))
plt.show()

* Basic packages are most favored by Tier 1 city residents.
* Deluxe packages are most favored by Tier 3 followed by Tier 1 city residents.
* Standard to King packages are most favored by Tier 1 and 3 city residents.

**Observations on ProductPitched vs Occupation**

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(data = travelProdTaken, x = "ProductPitched", hue = "Occupation")
plt.legend(bbox_to_anchor=(1, 1))
plt.show()


* Basic packages are most favored by salaried people then small business.
* Deluxe packages are most favored by small business then salaried people.
* Standard packages are most favored by both salaried people and small business almost equally.
* Super Deluxe packages are picked up most by salaried people.
* King packages are picked up most by small business.

**Observations on ProductPitched vs Gender**

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(data = travelProdTaken, x = "ProductPitched", hue = "Gender")
plt.legend(bbox_to_anchor=(1, 1))
plt.show()

* Basic packages are taken more by males and are the most popular package for both genders.
* Deluxe packages are taken more by males and are the second most popular package for both genders.
* Standard packages are taken more by males.
* Super Deluxe packages are picked up more by males.
* King packages are picked up slightly more by females.

**Observations on ProductPitched vs NumberOfPersonVisiting**

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(data = travelProdTaken, x = "ProductPitched", hue = "NumberOfPersonVisiting")
plt.legend(bbox_to_anchor=(1, 1))
plt.show()

* Basic, Deluxe and Standard packages are picked up by customers with 2 to 4 travelling companions with 3 being the most popular.
* Super Deluxe and King packages mainly attract 2 or 3 companions groups of customers.

**Observations on ProductPitched vs PreferredPropertyStar**

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(data = travelProdTaken, x = "ProductPitched", hue = "PreferredPropertyStar")
plt.legend(bbox_to_anchor=(1, 1))
plt.show()

* Across Basic, Deluxe and Standard packages, most customers who took up packages preferred 3 star hotels followed by 5 stars.
* In Super Deluxe and King packages, customers who took up packages tend to prefer 3-4 stars hotels.

**Observations on ProductPitched vs MaritalStatus**

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(data = travelProdTaken, x = "ProductPitched", hue = "MaritalStatus")
plt.legend(bbox_to_anchor=(1, 1))
plt.show()

* In Basic, most customers who took up a package are Single, followed by Married customers.
* Across Deluxe and Standard, most customers are Married or Unmarried.
* Across Super Deluxe and King, most customers are Married or Single.

**Observations on ProductPitched vs Passport**

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(data = travelProdTaken, x = "ProductPitched", hue = "Passport")
plt.legend(bbox_to_anchor=(1, 1))
plt.show()

* Basic packages are most likely picked up by a customer with a passport.
* The Standard package is picked up more by those without a passport.
* Deluxe, Super Deluxe and King customers have equal numbers having or not having a passport.

**Observations on ProductPitched vs OwnCar**

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(data = travelProdTaken, x = "ProductPitched", hue = "OwnCar")
plt.legend(bbox_to_anchor=(1, 1))
plt.show()

* Those who own cars are more likely to pick up a package than those who don't; the disparity is larger for the Basic and King packages.

**Observations on ProductPitched vs NumberOfChildrenVisiting**

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(data = travelProdTaken, x = "ProductPitched", hue = "NumberOfChildrenVisiting")
plt.legend(bbox_to_anchor=(1, 1))
plt.show()

* Basic, Deluxe and Super Deluxe packages are most likely picked up by customers who travel with 1-2 kids.
* Standard packages are most likely picked up by customers with 0-1 kids.
* King packages are most likely picked up by customers with 1 child.

**Observations on ProductPitched vs Designation**

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(data = travelProdTaken, x = "ProductPitched", hue = "Designation")
plt.legend(bbox_to_anchor=(1, 1))
plt.show()



* Basic Package attracts Executive level customers.
* Deluxe package attracts Manager level customers.
* Standard package attracts Senior Manager level customers.
* Super Deluxe and King packages are picked up by AVPs and VPs respectively.



**Customer Profiles for package types**
* **Basic Package Customer Profile**

> * Age ranges from late 20s to 30s.
> * Mainly travel between 2 to 3 times a year.
> * Monthly income mostly ranges from 17.5K to ~21K.
> * Executive level designation.
> * Travels with 1-2 kids.
> * Much more likely to own car.
> * Most likely owns a passport.
> * Most likely Single.
> * Preferred 3 star hotels followed by 5 stars hotels.
> * With 2 to 4 travelling companions.
> * More favored by Males.
> * Most favored by salaried people, followed by small business.
> * Most favored by Tier 1 city residents.

* **Deluxe Package Customer Profile**

> * Age range from early 30s to early 40s.
> * Mainly travel between 2 to 5 times a year.
> * Monthly income mostly ranges from 21K to just shy of 25K.
> * Manager level designation.
> * Travels with 1-2 kids.
> * Likely to own car.
> * Having either a passport or none.
> * Most likely Single.
> * Preferred 3 star hotels followed by 5 stars hotels.
> * With 2 to 4 travelling companions.
> * More favored by Males.
> * Most favored by small business, followed by salaried people.
> * Most favored by Tier 3 followed by Tier 1 city residents.

* **Standard Package Customer Profile**

> * Age range from mid 30s to late 40s.
> * Mainly travel between 2 to 4 times a year.
> * Monthly income mostly ranges from 24K to ~29K.
> * Senior Manager level designation.
> * Travels with 0-1 kids.
> * Likely to own car.
> * More likely not owning a passport.
> * More likely Married.
> * Preferred 3 star hotels followed by 5 stars hotels.
> * With 2 to 4 travelling companions.
> * More favored by Males.
> * Favored by both salaried people and small business almost equally.
> * Most favored by Tier 1 and 3 city residents.

* **Super Deluxe Package Customer Profile**

> * Age range from early 40s to mid 40s.
> * Travel between 1 to 5 times a year.
> * Monthly income mostly ranges from just below 28K to ~32K.
> * AVP level designation.
> * Travels with 1-2 kids.
> * Likely to own car.
> * Having either a passport or none.
> * Single or Married.
> * Prefer 3-4 stars hotels.
> * With 2 or 3 travelling companions.
> * More favored by Males.
> * Favored most by salaried people.
> * Most favored by Tier 1 and 3 city residents.

* **King Package Customer Profile**

> * Age range from early 40s to mid 50s.
> * Mainly travel between 2 to 3 times a year.
> * Monthly income of 35K and above.
> * VP level designation.
> * Travels with 1 kid.
> * Likely to own car.
> * Having either a passport or none.
> * Single, Married or Divorced.
> * Prefer 3-4 stars hotels.
> * With 2 or 3 travelling companions.
> * More favored by Females.
> * Favored most by small business.
> * Most favored by Tier 1 and 3 city residents.




**EDA Business Insights**
* > Package acceptance is concentrated among customers aged 28 to 40. This has made our Basic, Standard and Deluxe packages successful, since the right packages were marketed to these customer profiles.

* >Tier 2 and Tier 3 city residents are more accepting of packages than Tier 1 city residents. However, Tier 2 residents make up only 4.07% of our customer pool and Tier 3 residents only 30.68%. Both customer bases need to be expanded.

* >Large Business customers were also found to be more accepting of packages but constitute only 8.9% of our customer base. More expansion is needed among Large Business customers.

* >A higher number of follow-ups by our sales teams leads to greater success in getting a potential customer to accept a package. This holds for 3 follow-ups and above, with each successive follow-up leading to a higher acceptance level. Sales teams should be briefed to make at least 3 follow-ups with potential customers.

* >Basic, followed by Standard and Deluxe, has the highest success rate among our customers. The new Wellness package should have similar features and be marketed in 3-star or 5-star hotel combinations. Although most customers of the Basic, Standard or Deluxe packages preferred 3-star hotels, a number of them picked 5-star over 4-star within these packages, and customers who prefer 5-star hotels are also more accepting of a package.

* >Singles are much more likely than married or divorced customers to take up packages, but our current customer base has only 32.8% singles vs 47.8% married. A marketing campaign to attract more singles as customers could help.

* >Passport holders are much more likely to accept a package than non-passport holders, but 70.8% of our current customer base do not have passports. Incentives can be launched to attract passport holders or to encourage our current customers to obtain a passport.

* >Customers who gave a pitch satisfaction score of 3.0 and above are more likely to accept a package. However, the numbers show that the count of 1.0 ratings is as high as that of 4.0 and 5.0 ratings. Why the sales team receives such an unusually high number of 1.0 ratings should be investigated further to help improve sales conversion.

* >The lowest acceptance rate among packages belongs to the Super Deluxe package, which ties in with its customer profile of AVPs. AVPs are the least likely to accept a package, and Super Deluxe was marketed at them. The package therefore needs a relook and, if needed, can be discontinued. The new Wellness package should not have features similar to the Super Deluxe package.

**Data Preprocessing**
>* Missing values

In [None]:
# checking if there is any missing values in the dataset
travel.isnull().sum()

>* Missing values have been treated earlier.

# **Duplicate values**

In [None]:
# checking if there is any duplicate entries in the dataset
travel[travel.duplicated()].count()



* > No duplicates were found before the customer ID column was dropped; they only appear now that the ID is gone.
* > We can drop duplicates in the dataset.




In [None]:
# dropping the duplicate entries in the dataset
travel.drop_duplicates(inplace=True)

In [None]:
# checking duplicates again
travel[travel.duplicated()].count()



> Duplicate entries are removed from the dataset.
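
Why duplicates appear only after the ID column is dropped can be sketched on a toy frame. The data below is made up for illustration and is not taken from the travel dataset: two rows share every attribute and differ only by their ID.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical except for CustomerID
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Age": [30, 30, 41],
    "ProductPitched": ["Basic", "Basic", "Deluxe"],
})

# With the unique ID present, no row is a full duplicate of another
dupes_with_id = int(toy.duplicated().sum())

# Without the ID, the second row becomes a duplicate of the first
toy_no_id = toy.drop(columns="CustomerID")
dupes_without_id = int(toy_no_id.duplicated().sum())

# drop_duplicates keeps the first occurrence of each repeated row
deduped = toy_no_id.drop_duplicates()

print(dupes_with_id, dupes_without_id, len(deduped))
```

Whether such rows are true duplicates or distinct customers who happen to share attributes is a judgment call that depends on the data.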



**Treating Outliers**

In [None]:
# Plotting boxplots for all numerical columns
numerical_col = travel.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20,30))

for i, variable in enumerate(numerical_col):
    plt.subplot(5,4,i+1)
    plt.boxplot(travel[variable],whis=1.5)
    plt.title(variable)

plt.tight_layout()
plt.show()


* > Three columns have outliers and they are 'DurationOfPitch', 'NumberOfTrips' and 'MonthlyIncome'.
* > Outliers should be treated.

In [None]:
# Let's treat outliers by flooring and capping
def treat_outliers(dataf,col):
    '''
    treats outliers in a variable by clipping them to the boxplot whiskers
    dataf: data frame
    col: str, name of the numerical variable (column)
    '''
    Q1=dataf[col].quantile(0.25) # 25th percentile
    Q3=dataf[col].quantile(0.75)  # 75th percentile
    IQR=Q3-Q1
    Lower_Whisker = Q1 - 1.5*IQR
    Upper_Whisker = Q3 + 1.5*IQR
    # values below Lower_Whisker are set to Lower_Whisker,
    # and values above Upper_Whisker are set to Upper_Whisker
    dataf[col] = np.clip(dataf[col], Lower_Whisker, Upper_Whisker)
    return dataf

def treat_outliers_all(dataf, col_list):
    '''
    treats outliers in all numerical variables
    col_list: list of numerical variables
    dataf: data frame
    '''
    for c in col_list:
        dataf = treat_outliers(dataf,c)

    return dataf

In [None]:
# treating outliers in travel dataset
numerical_col = travel.select_dtypes(include=np.number).columns.tolist()
travel = treat_outliers_all(travel, numerical_col)

In [None]:
# Plotting boxplots for all numerical columns after treatment
numerical_col = travel.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20,30))

for i, variable in enumerate(numerical_col):
    plt.subplot(5,4,i+1)
    plt.boxplot(travel[variable],whis=1.5)
    plt.title(variable)

plt.tight_layout()
plt.show()



> Outliers are treated.
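
As a small worked example of the flooring and capping rule, the sketch below applies the same 1.5 × IQR whiskers to a made-up series. The numbers are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
import pandas as pd

# Hypothetical values with one obvious outlier (100)
toy_values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = toy_values.quantile(0.25)   # 3.0
q3 = toy_values.quantile(0.75)   # 7.0
iqr = q3 - q1                    # 4.0
lower_whisker = q1 - 1.5 * iqr   # -3.0
upper_whisker = q3 + 1.5 * iqr   # 13.0

# Values outside the whiskers are pulled back to the nearest whisker
capped = toy_values.clip(lower_whisker, upper_whisker)

print(lower_whisker, upper_whisker, capped.max())
```

Only the value 100 is changed (to 13.0); everything inside the whiskers is left as it was. Note that capping keeps the row, so the sample size is unchanged.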



In [None]:
travel.info()

In [None]:
# let's view a sample of the data
travel.sample(n=10, random_state=1)

# **Feature Engineering**
## Age Binning

In [None]:
binned_age = pd.cut(travel['Age'], [15,25,35,45,55,65])
binned_age.value_counts(dropna=False)

In [None]:
# can add custom labels
travel['Age_bin'] = pd.cut(
    travel['Age'], [15,25,35,45,55,65],
    labels = ["15-25" , "25-35" , "35-45" , "45-55" , "55-65"]
)
travel.drop(['Age'], axis=1, inplace=True)
travel['Age_bin'].value_counts(dropna=False)

In [None]:
# let's view a sample of the data
travel.sample(n=5, random_state=1)

In [None]:
travel["Age_bin"] = travel["Age_bin"].astype("category")

* Age has been put into 5 groups in Age_bin column.
* Age column is dropped from the data set.
* Age_bin column is converted to categorical variable.
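
One detail of `pd.cut` worth keeping in mind is that its bins are right-closed by default, so the first bin here is (15, 25]. The sketch below uses made-up ages to show where boundary values land; `include_lowest=True` is shown as one option for keeping the lowest edge and is not something the notebook itself uses.

```python
import pandas as pd

# Hypothetical ages, including bin edges and out-of-range values
toy_ages = pd.Series([15, 18, 25, 26, 61, 70])
bins = [15, 25, 35, 45, 55, 65]
labels = ["15-25", "25-35", "35-45", "45-55", "55-65"]

# Default: intervals are (15, 25], (25, 35], ...
binned = pd.cut(toy_ages, bins, labels=labels)

# include_lowest=True makes the first interval include its left edge (15)
binned_incl = pd.cut(toy_ages, bins, labels=labels, include_lowest=True)

print(binned.tolist())
print(binned_incl.tolist())
```

Ages outside the bin range become NaN, which is why checking `value_counts(dropna=False)` after binning, as done above, is a useful habit.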

**Income Binning**

In [None]:
binned_income = pd.cut(travel['MonthlyIncome'], [10000.0,20000.0,30000.0,40000.0])
binned_income.value_counts(dropna=False)

In [None]:
# can add custom labels
travel['Income_bin'] = pd.cut(
    travel['MonthlyIncome'], [10000.0,20000.0,30000.0,40000.0],
    labels = ["10k-20k" , "20k-30k" , "30k-40k"]
)
travel.drop(['MonthlyIncome'], axis=1, inplace=True)
travel['Income_bin'].value_counts(dropna=False)

In [None]:
# let's view a sample of the data
travel.sample(n=5, random_state=1)

In [None]:
travel["Income_bin"] = travel["Income_bin"].astype("category")

>* MonthlyIncome has been put into 3 groups in Income_bin column.
>* MonthlyIncome column is dropped from the data set.
>* Income_bin column is converted to categorical variable.


**Check variable datatypes**



In [None]:
travel.info()


**Prepare data for modeling**
> **Split the data into train and test sets**

In [None]:

X = travel.drop("ProdTaken" , axis=1)
Y = travel["ProdTaken"]

# creating dummy variables
X = pd.get_dummies(X, drop_first=True)

# splitting in training and test set
# stratify=Y keeps the ratio of buyers to non-buyers in the target the same in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.3, random_state = 42, stratify=Y)


print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
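# check the proportion of each class (0 and 1) in the full target
Y.value_counts(normalize=True)

In [None]:
# check the proportion of each class in the test set
y_test.value_counts(normalize=True)

In [None]:
# check the proportion of each class in the training set
y_train.value_counts(normalize=True)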
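
The effect of `stratify` can be sketched on synthetic labels. The 19% positive rate below is an assumption chosen to resemble the class balance discussed in this notebook; the arrays are not the travel data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target: 190 positives out of 1000 rows (19%)
toy_y = np.array([1] * 190 + [0] * 810)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42, stratify=toy_y
)

# With stratification, both splits keep (almost exactly) the original positive rate
print(toy_y.mean(), toy_y_tr.mean(), toy_y_te.mean())
```

Without `stratify`, the positive rate in a 30% test split can drift by a few percentage points from run to run, which makes F1 comparisons between models noisier.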

**Functions Definitions**

In [None]:
# Function to calculate different metric scores of the model - Accuracy, Recall, Precision and F1
def get_metrics_score(model,flag=True):
    '''
    model : classifier to predict values of X
    flag : if True (default), print the scores

    Relies on the global X_train, X_test, y_train and y_test defined above.
    Returns a list: [train_acc, test_acc, train_recall, test_recall,
                     train_precision, test_precision, train_f1, test_f1]
    '''
    # defining an empty list to store train and test results
    score_list=[]

    #Predicting on train and tests
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)

    #Accuracy of the model
    train_acc = model.score(X_train,y_train)
    test_acc = model.score(X_test,y_test)

    #Recall of the model
    train_recall = metrics.recall_score(y_train,pred_train)
    test_recall = metrics.recall_score(y_test,pred_test)

    #Precision of the model
    train_precision = metrics.precision_score(y_train,pred_train)
    test_precision = metrics.precision_score(y_test,pred_test)

    #F1 of the model
    train_f1 = metrics.f1_score(y_train,pred_train)
    test_f1 = metrics.f1_score(y_test,pred_test)

    score_list.extend((train_acc,test_acc,train_recall,test_recall,train_precision,test_precision, train_f1, test_f1))

    # The scores are printed only when flag is True (the default).
    if flag:
        print("Accuracy on training set : ",train_acc)
        print("Accuracy on test set : ",test_acc)
        print("Recall on training set : ",train_recall)
        print("Recall on test set : ",test_recall)
        print("Precision on training set : ",train_precision)
        print("Precision on test set : ",test_precision)
        print("f1 score on training set : ",train_f1)
        print("f1 score on test set : ",test_f1)

    return score_list # returning the list with train and test scores

In [None]:
## Function to create confusion matrix
def make_confusion_matrix(model,y_actual,labels=(0, 1)):
    '''
    model : classifier to predict values of X
    y_actual : ground truth for X_test
    labels : class labels in the order (negative, positive)

    Predictions are always made on the global X_test.
    '''
    y_predict = model.predict(X_test)
    cm=metrics.confusion_matrix( y_actual, y_predict, labels=list(labels))
    df_cm = pd.DataFrame(cm, index = ["Actual - No","Actual - Yes"],
                  columns = ['Predicted - No','Predicted - Yes'])
    group_counts = ["{0:0.0f}".format(value) for value in
                cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in
                         cm.flatten()/np.sum(cm)]
    # cell annotations: count on the first line, share of all test rows on the second
    annot = [f"{v1}\n{v2}" for v1, v2 in
              zip(group_counts,group_percentages)]
    annot = np.asarray(annot).reshape(2,2)
    plt.figure(figsize = (10,7))
    sns.heatmap(df_cm, annot=annot,fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

**Model Evaluation Criterion**
**Model can make wrong predictions as:**
1. False Positive: Predicting that a customer will purchase the travel package when they actually will not.
2. False Negative: Predicting that a customer will not purchase the travel package when they actually would.

**Which case is more important?**
* Both cases are important:
* If we predict that a customer will purchase but they actually will not, the targeted marketing effort goes to the wrong person and resources are wasted.
* If we predict that a customer will not purchase but they actually would, that person receives no targeted marketing, may never learn about the travel package, and the business is lost.

**How to reduce losses?**
* Accuracy could be used, but because the data is imbalanced it is not the right metric for checking model performance.
* Therefore, f1_score should be maximized: the greater the f1_score, the better the balance between precision and recall, that is, between the two kinds of error above.
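
To make the trade-off concrete, the sketch below computes precision, recall and F1 from confusion-matrix counts. The counts are hypothetical and only illustrate the arithmetic; `precision_recall_f1` is a helper written for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for the positive class from raw counts."""
    precision = tp / (tp + fp)   # share of predicted buyers who really buy
    recall = tp / (tp + fn)      # share of real buyers the model finds
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical test set: 100 real buyers, the model flags 80 customers
toy_precision, toy_recall, toy_f1 = precision_recall_f1(tp=60, fp=20, fn=40)
print(round(toy_precision, 3), round(toy_recall, 3), round(toy_f1, 3))

# Why accuracy misleads here: with 18% buyers, always predicting "no purchase"
# is right 82% of the time while finding no buyers at all (recall = 0).
baseline_accuracy = 820 / 1000
print(baseline_accuracy)
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are reasonably high, which is the property wanted when both error types are costly.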


# Model building - Bagging, Random Forest & Decision Tree

# Build Decision Tree Model

* We will build our model using the DecisionTreeClassifier function. Using default 'gini' criteria to split.
* If the frequency of class A is 10% and the frequency of class B is 90%, then class B will become the dominant class and the decision tree will become biased toward the dominant classes.
* In this case, we can pass a dictionary {0:0.19,1:0.81} to the model to specify the weight of each class and the decision tree will give more weightage to class 1 based on the dataset distribution.
* class_weight is a hyperparameter for the decision tree classifier.
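
The weights {0:0.19,1:0.81} are the class frequencies swapped, which gives the same relative weighting as scikit-learn's `"balanced"` heuristic. The sketch below checks this on synthetic labels with an assumed 81/19 split; it is an illustration, not the notebook's data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with an assumed 81% / 19% class split
toy_y = np.array([0] * 81 + [1] * 19)

# "balanced" computes n_samples / (n_classes * count_of_class)
balanced = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=toy_y
)

# Ratio of the minority-class weight to the majority-class weight
ratio_balanced = balanced[1] / balanced[0]   # 81 / 19
ratio_manual = 0.81 / 0.19                   # from the manual dictionary

print(balanced, ratio_balanced, ratio_manual)
```

Both approaches weight a class 1 row about 4.3 times as heavily as a class 0 row, so passing `class_weight="balanced"` would be a reasonable alternative when the exact split is not known in advance.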

In [None]:
dtree = DecisionTreeClassifier(class_weight={0:0.19,1:0.81}, random_state = 1)
dtree.fit(X_train, y_train)

In [None]:
make_confusion_matrix(dtree, y_test)

**Confusion Matrix -**

* Customer takes up package and the model predicted customer takes up package : True Positive (observed=1,predicted=1)
* Customer didn't take up package and the model predicted customer takes up package : False Positive (observed=0,predicted=1)
* Customer didn't take up package and the model predicted customer didn't take up package : True Negative (observed=0,predicted=0)
* Customer takes up package and the model predicted customer didn't take up package : False Negative (observed=1,predicted=0)
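
The four cells can also be read out as numbers. The sketch below uses short made-up label lists; with `labels=[0, 1]`, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]], so `ravel()` returns the cells in that order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical observed and predicted labels (1 = took the package)
toy_true = [1, 0, 0, 1, 0, 1, 0, 0]
toy_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# Rows are observed classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred, labels=[0, 1]).ravel()

print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```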

In [None]:
#Using above defined function to get accuracy, recall, precision and f1 score on train and test set
dtree_score=get_metrics_score(dtree)

**Observations**

* Decision tree is working well on the training data but is not able to generalize well on the test data.
* The same holds for the f1 score (train data: 100% ; test data: 70.7%), which is a sign of overfitting.

**Build Bagging Classifier Model**

In [None]:
#base_estimator for bagging classifier is a decision tree by default
#(scikit-learn 1.2 renamed this parameter to `estimator`; use that name on newer versions)
bagging = BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0:0.19,1:0.81}, random_state=1), random_state=1)
bagging.fit(X_train,y_train)

In [None]:
make_confusion_matrix(bagging,y_test)

In [None]:
bagging_score=get_metrics_score(bagging)

**Observations**

* Bagging classifier is still overfitting on the training set and is not generalizing well on the test data.
* This is so as well for the f1 score (train data: 97.8% ; test data: 66.3%).
* When comparing this model with decision tree model: In test data, False positives show good improvement by reducing from 4.73% to 1.62% but False negatives are increased from 5.93% to 8.68%.

**Build Bagging Classifier Model - Logistic Regression as base estimator**

In [None]:
bagging_lr=BaggingClassifier(base_estimator=LogisticRegression(class_weight={0:0.19,1:0.81},random_state=1, max_iter=1000),random_state=1)
bagging_lr.fit(X_train,y_train)

In [None]:
make_confusion_matrix(bagging_lr,y_test)

In [None]:
bagging_lr_score=get_metrics_score(bagging_lr)

**Observations**

* Bagging classifier with logistic regression as base_estimator is not overfitting the data but the scores are low.
* This is so as well for the f1 score (train data: 55.5% ; test data: 51.4%).
* Ensemble models are less interpretable than a single decision tree, and the bagging classifier is even less interpretable than a random forest: it does not expose a feature importance attribute.
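
When the base estimators are trees, an importance estimate can still be recovered by averaging the importances of the individual fitted trees. The following is a rough sketch on synthetic data, not a built-in scikit-learn feature; it relies on the fitted attributes `estimators_` and `estimators_features_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic data standing in for the training set
toy_X, toy_y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=1
)

# Default base estimator is a decision tree
toy_bag = BaggingClassifier(n_estimators=10, random_state=1).fit(toy_X, toy_y)

# Accumulate each tree's importances onto the columns that tree was given
importances = np.zeros(toy_X.shape[1])
for tree, feat_idx in zip(toy_bag.estimators_, toy_bag.estimators_features_):
    np.add.at(importances, feat_idx, tree.feature_importances_)
importances /= len(toy_bag.estimators_)

print(importances.round(3))
```

This does not apply to the logistic regression variant, whose base estimators have coefficients rather than `feature_importances_`.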

**Build Random Forest Model**

In [None]:
rf = RandomForestClassifier(class_weight={0:0.19,1:0.81},random_state=1)
rf.fit(X_train,y_train)

In [None]:
make_confusion_matrix(rf,y_test)

In [None]:
rf_score=get_metrics_score(rf)

**Observations**

>* Random Forest classifier is still overfitting on the training set and is not generalizing well on the test data.
>* The same holds for the f1 score (train data: 100% ; test data: 58.2%). It does not do as well as the lone decision tree or the bagging classifier.

**Hyperparameter Tuning**

**Decision Tree**



In [None]:
# Choose the type of classifier.
dtree_estimator = DecisionTreeClassifier(class_weight={0:0.19,1:0.81},random_state=1)

# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2,30),
              'min_samples_leaf': [1, 2, 5, 7, 10],
              'max_leaf_nodes' : [2, 3, 5, 10,15],
              'min_impurity_decrease': [0.0001,0.001,0.01,0.1]
             }

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score) # using highest f1_score to choose

# Run the grid search
grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
dtree_estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
dtree_estimator.fit(X_train, y_train)

In [None]:
make_confusion_matrix(dtree_estimator,y_test)

In [None]:
dtree_estimator_score=get_metrics_score(dtree_estimator)

**Observations**

* Overfitting is reduced in the tuned decision tree, but the scores have also dropped.
* This is so as well for the f1 score (train data: 53.9% ; test data: 45.6%).
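
It can help to look at what the grid search selected before judging the tuned scores. The sketch below runs a much smaller grid on synthetic data (both the data and the grid values are assumptions for illustration) and reads `best_params_` and `best_score_`. With the default `refit=True`, `best_estimator_` is already refit on the full training data, so a further `.fit` call repeats that work.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for X_train / y_train
toy_X, toy_y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=1)

toy_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]},
    scoring="f1",   # F1 of the positive class, as with make_scorer(f1_score)
    cv=5,
)
toy_grid.fit(toy_X, toy_y)

best_params = toy_grid.best_params_      # winning combination
best_cv_f1 = toy_grid.best_score_        # mean cross-validated F1 of that combination
best_model = toy_grid.best_estimator_    # already fitted on all of toy_X, toy_y

print(best_params, round(best_cv_f1, 3))
```

If the selected values sit at the edge of the grid (for example the largest `max_leaf_nodes`), the grid itself may be what is limiting the tuned model.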

**Bagging Classifier**

**Some of the important hyperparameters available for bagging classifier are:**

* base_estimator: The base estimator to fit on random subsets of the dataset. If None (default), then the base estimator is a decision tree.
* n_estimators: The number of base estimators in the ensemble, default=10.
* max_features: The number of features to draw from X to train each base estimator. An int is an absolute count; a float is a fraction of the features, default=1.0.
* bootstrap: Whether samples are drawn with replacement. If False, sampling without replacement is performed, default=True.
* bootstrap_features: If True, features are drawn with replacement, default=False.
* max_samples: The number of samples to draw from X to train each base estimator. An int is an absolute count; a float is a fraction of the training rows, default=1.0.
* oob_score: Whether to use out-of-bag samples to estimate the generalization accuracy, default=False.
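
The int-versus-float distinction for `max_samples` is easy to miss, so here is a small check on synthetic data. It uses the fitted attribute `estimators_samples_`, which holds the row indices drawn for each base estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic data standing in for the training set (200 rows)
toy_X, toy_y = make_classification(n_samples=200, random_state=1)

# Integer 1: each base estimator is trained on a single drawn row
bag_int = BaggingClassifier(n_estimators=3, max_samples=1, random_state=1).fit(toy_X, toy_y)

# Float 1.0: each base estimator draws as many rows as the training set has
bag_float = BaggingClassifier(n_estimators=3, max_samples=1.0, random_state=1).fit(toy_X, toy_y)

n_drawn_int = len(bag_int.estimators_samples_[0])
n_drawn_float = len(bag_float.estimators_samples_[0])

print(n_drawn_int, n_drawn_float)
```

The same rule applies to `max_features`, so a grid value written as `1` asks for one sample or one feature rather than all of them.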

In [None]:
# Choose the type of classifier.
bagging_estimator_tuned = BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0:0.19,1:0.81},random_state=1),random_state=1)

# Grid of parameters to choose from
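parameters = {'max_samples': [0.7,0.8,0.9,1],   # note: the integer 1 means 1 sample, not 100%; 1.0 would mean all samples
              'max_features': [0.7,0.8,0.9,1],  # note: the integer 1 means 1 feature; 1.0 would mean all features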
              'n_estimators' : [10,20,30,40,50],
             }

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.f1_score) # using highest f1_score to choose

# Run the grid search
grid_obj = GridSearchCV(bagging_estimator_tuned, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
bagging_estimator_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
bagging_estimator_tuned.fit(X_train, y_train)

In [None]:
make_confusion_matrix(bagging_estimator_tuned,y_test)


In [None]:
bagging_estimator_tuned_score=get_metrics_score(bagging_estimator_tuned)


**Observations**

* The tuned bagging classifier is still overfitting the training set, though slightly less so, and generalizes a little better on the test data for precision.
* The f1 score gap between training and test set remains (train data: 99.3% ; test data: 66.5%).
* There is no improvement in the other scores, but the precision gap between training and test set is smaller by 4 percentage points.

**Bagging Classifier - Logistic Regression as base estimator**

In [None]:
# Choose the type of classifier.
bagging_lr_tuned = BaggingClassifier(base_estimator=LogisticRegression(class_weight={0:0.19,1:0.81},random_state=1, max_iter=1000),random_state=1)

# Grid of parameters to choose from
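parameters = {'max_samples': [0.7,0.8,0.9,1],   # note: the integer 1 means 1 sample, not 100%; 1.0 would mean all samples
              'max_features': [0.7,0.8,0.9,1],  # note: the integer 1 means 1 feature; 1.0 would mean all features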
              'n_estimators' : [10,20,30,40,50],
             }

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.f1_score) # using highest f1_score to choose

# Run the grid search
grid_obj = GridSearchCV(bagging_lr_tuned, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
bagging_lr_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
bagging_lr_tuned.fit(X_train, y_train)

In [None]:
make_confusion_matrix(bagging_lr_tuned,y_test)

In [None]:
bagging_lr_tuned_score=get_metrics_score(bagging_lr_tuned)


**Observations**

* The tuned bagging classifier with logistic regression as base_estimator is not overfitting the data, but its scores remain about as low as before tuning.
* This is so as well for the f1 score (train data: 54.9% ; test data: 51%).

**Random Forest Classifier**

Now, let's see if we can get a better model by tuning the random forest classifier. Some of the important hyperparameters available for random forest classifier are:

* n_estimators: The number of trees in the forest, default = 100.
* max_features: The number of features to consider when looking for the best split.
* class_weight: Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.
* For example: If the frequency of class 0 is 80% and the frequency of class 1 is 20% in the data, then class 0 will become the dominant class and the model will become biased toward the dominant classes. In this case, we can pass a dictionary {0:0.2,1:0.8} to the model to specify the weight of each class and the random forest will give more weightage to class 1.
* bootstrap: Whether bootstrap samples are used when building trees. If False, the entire dataset is used to build each tree, default=True.
* max_samples: If bootstrap is True, then the number of samples to draw from X to train each base estimator. If None (default), then draw N samples, where N is the number of observations in the train data.
* oob_score: Whether to use out-of-bag samples to estimate the generalization accuracy, default=False.

Note: Many hyperparameters of Decision Trees, such as max_depth and min_samples_split, are also available for tuning a Random Forest.
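
As a brief illustration of `oob_score`, the sketch below fits a forest on synthetic data with an assumed 80/20 class split and reads the out-of-bag accuracy from `oob_score_`. Each tree is scored on the rows left out of its bootstrap sample, which gives a generalization estimate without touching the test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data standing in for the training set
toy_X, toy_y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=1)

toy_rf = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,                  # requires bootstrap=True (the default)
    class_weight={0: 0.2, 1: 0.8},   # weights from the example above
    random_state=1,
)
toy_rf.fit(toy_X, toy_y)

oob_accuracy = toy_rf.oob_score_     # accuracy on out-of-bag rows
train_accuracy = toy_rf.score(toy_X, toy_y)

print(round(oob_accuracy, 3), round(train_accuracy, 3))
```

A large gap between training accuracy and the out-of-bag estimate is another hint of the overfitting seen in the untuned forest. Note that `oob_score_` is accuracy by default, not F1.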

In [None]:
# Choose the type of classifier.
rf_estimator_tuned = RandomForestClassifier(class_weight={0:0.19,1:0.81},random_state=1)

# Grid of parameters to choose from
parameters = {
        "n_estimators": [50,100,150],
        "min_samples_leaf": np.arange(5, 10),
        "max_features": np.arange(0.2, 0.7, 0.1),
        "max_samples": np.arange(0.3, 0.7, 0.1),
}

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.f1_score) # using highest f1_score to choose

# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
rf_estimator_tuned.fit(X_train, y_train)

In [None]:
make_confusion_matrix(rf_estimator_tuned,y_test)

In [None]:
rf_estimator_tuned_score=get_metrics_score(rf_estimator_tuned)


**Observations**

>* The tuned Random Forest classifier has reduced much of the overfitting on the training set and generalizes better on the test data.
>* The same holds for the f1 score (train data: 81.2% ; test data: 60.9%): the gap between train and test has narrowed by about 20 percentage points, but the scores are still low.

# **Model building - Boosting & Stacking**


**Build AdaBoost Classifier Model**

In [None]:
#Fitting the model
ab_classifier = AdaBoostClassifier(random_state=1)
ab_classifier.fit(X_train,y_train)

#Calculating different metrics
get_metrics_score(ab_classifier)

#Creating confusion matrix
make_confusion_matrix(ab_classifier,y_test)

**Observations**

* AdaBoost is giving more generalized performance than previous models but the test f1-score is too low.
* F1 score is at (train data: 47.6% ; test data: 43.9%).

**Hyperparameter Tuning**

In [None]:
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    #Let's try different max_depth for base_estimator
    #(scikit-learn 1.2 renamed this parameter to "estimator"; use that key on newer versions)
    "base_estimator":[DecisionTreeClassifier(max_depth=1),DecisionTreeClassifier(max_depth=2),
                      DecisionTreeClassifier(max_depth=3)],
    "n_estimators": np.arange(10,110,10),
    "learning_rate":np.arange(0.1,2,0.1)
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)

In [None]:
#Calculating different metrics
get_metrics_score(abc_tuned)

#Creating confusion matrix
make_confusion_matrix(abc_tuned,y_test)

**Observations**

* The tuned AdaBoost model performance has increased but the model has started to overfit the training data among several metrics.
* F1 score is at (train data: 88.4% ; test data: 56.5%).
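
Because scikit-learn 1.2 renamed the `base_estimator` parameter to `estimator`, a grid keyed on `"base_estimator"` may not run on newer installs. One possible workaround, sketched below on synthetic data, is to look up which name the installed version accepts; `base_key` and the `toy_` names are made up for this example.

```python
import inspect
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Pick whichever parameter name this scikit-learn version supports
init_params = inspect.signature(AdaBoostClassifier.__init__).parameters
base_key = "estimator" if "estimator" in init_params else "base_estimator"

# Synthetic data standing in for the training set
toy_X, toy_y = make_classification(n_samples=300, random_state=1)

# Build the model and a small grid using the detected key
toy_abc = AdaBoostClassifier(
    **{base_key: DecisionTreeClassifier(max_depth=2)},
    n_estimators=20,
    random_state=1,
)
toy_abc.fit(toy_X, toy_y)

toy_param_grid = {
    base_key: [DecisionTreeClassifier(max_depth=d) for d in (1, 2, 3)],
    "n_estimators": [10, 20],
}
print(base_key, list(toy_param_grid))
```

Pinning the scikit-learn version used for the notebook is the simpler alternative if the environment is under your control.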

**Build Gradient Boosting Classifier Model**

In [None]:
#Fitting the model
gb_classifier = GradientBoostingClassifier(random_state=1)
gb_classifier.fit(X_train,y_train)

#Calculating different metrics
get_metrics_score(gb_classifier)

#Creating confusion matrix
make_confusion_matrix(gb_classifier,y_test)

**Observations**

* The Gradient Boosting model does not overfit as much but some of the test data metrics are low.
* F1 score is at (train data: 62% ; test data: 50.8%).

**Hyperparameter Tuning**

In [None]:
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1), random_state=1)

# Grid of parameters to choose from
parameters = {
    "n_estimators": [100,150,200,250],
    "subsample":[0.8,0.9,1],
    "max_features":[0.7,0.8,0.9,1]  # note: the integer 1 means a single feature, not 100%; 1.0 would mean all features
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)

In [None]:
#Calculating different metrics
get_metrics_score(gbc_tuned)

#Creating confusion matrix
make_confusion_matrix(gbc_tuned,y_test)

**Observations**

* The tuned Gradient Boosting model does not overfit yet, and all test data metrics except precision improved, but there is still room for improvement.
* F1 score is at (train data: 74.9% ; test data: 54.6%).

**Build XGBoost Classifier Model**

In [None]:
#Fitting the model
xgb_classifier = XGBClassifier(random_state=1, eval_metric='logloss')
xgb_classifier.fit(X_train,y_train)

#Calculating different metrics
get_metrics_score(xgb_classifier)

#Creating confusion matrix
make_confusion_matrix(xgb_classifier,y_test)

**Observations**

* The XGB model is starting to overfit and all test data metrics have improved significantly.
* F1 score is at (train data: 99.7% ; test data: 72%).

**Hyperparameter Tuning**

In [None]:
# Choose the type of classifier.
xgb_tuned = XGBClassifier(random_state=1, eval_metric='logloss')

# Grid of parameters to choose from
parameters = {
    "n_estimators": [10,30,50],
    "scale_pos_weight":[1,2,5],
    "subsample":[0.7,0.9,1],
    "learning_rate":[0.05, 0.1,0.2],
    "colsample_bytree":[0.7,0.9,1],
    "colsample_bylevel":[0.5,0.7,1]
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters,scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)

In [None]:
#Calculating different metrics
get_metrics_score(xgb_tuned)

#Creating confusion matrix
make_confusion_matrix(xgb_tuned,y_test)

**Observations**

* The tuned XGB model generalizes to the test data better, but its metric scores are lower than those of the untuned XGB model.
* F1 score is at (train data: 94.4% ; test data: 68.4%).
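
For the `scale_pos_weight` values in the grid, the XGBoost documentation suggests the ratio of negative to positive examples as a typical starting point. The sketch below computes that ratio on synthetic labels with an assumed 81/19 split; on the real data it would be computed from `y_train`.

```python
import numpy as np

# Synthetic labels with an assumed 81% / 19% class split
toy_y = np.array([0] * 810 + [1] * 190)

n_negative = int((toy_y == 0).sum())
n_positive = int((toy_y == 1).sum())

# Suggested starting value: negatives divided by positives
suggested_scale_pos_weight = n_negative / n_positive

print(round(suggested_scale_pos_weight, 2))
```

Under that assumption the suggested value is about 4.26, so the grid values 1, 2 and 5 bracket it, with 5 being the closest.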

**Build Stacking Classifier Model**

In [None]:
estimators = [('Random Forest',rf_estimator_tuned), ('Gradient Boosting',gbc_tuned), ('Decision Tree',dtree_estimator)]

final_estimator = xgb_tuned

stacking_classifier= StackingClassifier(estimators=estimators,final_estimator=final_estimator)

stacking_classifier.fit(X_train,y_train)

In [None]:
#Calculating different metrics
get_metrics_score(stacking_classifier)

#Creating confusion matrix
make_confusion_matrix(stacking_classifier,y_test)

**Observations**

* The stacking classifier generalizes to the test data about as well as the tuned XGB model but scores lower on the metrics.
* F1 score is at (train data: 84.4% ; test data: 64.7%).

**Comparing all models**

In [None]:
# defining list of models
models = [dtree, dtree_estimator,rf, rf_estimator_tuned, bagging, bagging_estimator_tuned, bagging_lr, bagging_lr_tuned,
          ab_classifier, abc_tuned, gb_classifier, gbc_tuned, xgb_classifier,xgb_tuned, stacking_classifier]

# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []

# looping through all the models to get the metrics score - Accuracy, Recall, Precision and F1
for model in models:

    j = get_metrics_score(model,False)
    acc_train.append(j[0])
    acc_test.append(j[1])
    recall_train.append(j[2])
    recall_test.append(j[3])
    precision_train.append(j[4])
    precision_test.append(j[5])
    f1_train.append(j[6])
    f1_test.append(j[7])

In [None]:
comparison_frame = pd.DataFrame({'Model':['Decision Tree','Tuned Decision Tree','Random Forest','Tuned Random Forest',
                                          'Bagging Classifier','Tuned Bagging Classifier','Bagging Classifier Logistic Regression',
                                          'Tuned Bagging Classifier Logistic Regression','AdaBoost Classifier','Tuned AdaBoost Classifier',
                                          'Gradient Boosting Classifier', 'Tuned Gradient Boosting Classifier',
                                          'XGBoost Classifier',  'Tuned XGBoost Classifier', 'Stacking Classifier'],
                                          'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
                                          'Train_Recall':recall_train,'Test_Recall':recall_test,
                                          'Train_Precision':precision_train,'Test_Precision':precision_test,
                                           'Train_F1-Score':f1_train, 'Test_F1-Score':f1_test})

#Sorting models in decreasing order of test F1 score
comparison_frame.sort_values(by='Test_F1-Score',ascending=False)

**Observations**

* The models either tend towards overfitting or score poorly on F1 score.
* The XGBoost Classifier is the model to take forward, as it has the highest test F1 score.
* Other hyperparameter combinations not yet attempted may well improve the models. Searching for them more comprehensively would, however, take considerably more time.
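
The trade-off between test score and overfitting can be made concrete with the F1 values quoted in the observations above. The sketch below covers only the four models whose train and test F1 scores are reported in this part of the notebook; the helper `f1_gap` is a hypothetical name used for illustration.

```python
# F1 scores in percent, as reported in the observations above: (train, test).
reported_f1 = {
    "Tuned Gradient Boosting": (74.9, 54.6),
    "XGBoost": (99.7, 72.0),
    "Tuned XGBoost": (94.4, 68.4),
    "Stacking": (84.4, 64.7),
}

def f1_gap(scores):
    """Return {model: train F1 minus test F1}, rounded to one decimal place."""
    return {name: round(train - test, 1) for name, (train, test) in scores.items()}

gaps = f1_gap(reported_f1)

# List models from highest to lowest test F1, with the train/test gap alongside.
ranked = sorted(reported_f1.items(), key=lambda kv: kv[1][1], reverse=True)
for name, (train, test) in ranked:
    print(f"{name:<24} test F1 = {test:5.1f}   gap = {gaps[name]:4.1f}")
```

Among these four, the untuned XGBoost model has both the highest test F1 and the widest gap (27.7 points), so the choice of model depends on how much overfitting is acceptable in exchange for the better test score.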

**Feature Importance of XGBoost Classifier Model**

In [None]:
feature_names = X_train.columns
importances = xgb_classifier.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

**Observations**

* In the XGBoost classifier model, 'Designation_Executive' is the most important feature, followed by 'Passport_1' and 'MaritalStatus_Single'.
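
Because the importances are reported per dummy column, it can help to roll them up to the original variable before drawing conclusions about 'Designation', 'Passport' or 'MaritalStatus' as a whole. The following is a minimal sketch, assuming dummy columns follow the `<variable>_<level>` naming that `pd.get_dummies` produces and that original variable names contain no underscore. The helper name and the numbers are made up for illustration and are not the notebook's output.

```python
from collections import defaultdict

def aggregate_importances(feature_names, importances):
    """Sum the importances of dummy columns back onto their source variable.

    Assumes names like 'Passport_1' or 'Designation_Executive'; a column
    without an underscore (e.g. 'Age') is treated as its own variable.
    """
    totals = defaultdict(float)
    for name, value in zip(feature_names, importances):
        variable = name.rsplit("_", 1)[0]  # 'Passport_1' -> 'Passport'
        totals[variable] += value
    # Largest total first.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

# Illustrative values only (hypothetical, not taken from the fitted model).
example_names = ["Designation_Executive", "Designation_Manager",
                 "Passport_1", "MaritalStatus_Single", "Age"]
example_values = [0.30, 0.10, 0.25, 0.20, 0.15]

aggregated = aggregate_importances(example_names, example_values)
print(aggregated)
```

In the notebook, the same helper could be called as `aggregate_importances(X_train.columns, xgb_classifier.feature_importances_)`, provided the column naming assumption holds.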

**Business Insights and Recommendations**

* Based on the performance of the different models, the XGBoost Classifier performed best. F1 score was the deciding factor because the data is imbalanced, together with the gap between training and test scores.
* Significant variables include 'Designation', 'Passport' and 'MaritalStatus'. Coupled with the EDA insights, Executive-level customers, passport holders and single customers are more likely to accept a package deal.
* However, 70.8% of the current customer base are non-passport holders, and only 18.4% of customers are single, compared with 47.8% who are married.
* Therefore, more targeted marketing should aim at attracting passport holders and singles, to expand the customer base in these segments. Incentives can also be offered to encourage existing customers to obtain a passport, or to attract new passport-holding customers.
* Executive-level customers currently make up the largest share of customers at 37.9%, and more can be done to recruit them and bolster their numbers.
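
The reason for preferring F1 score over accuracy on imbalanced data can be illustrated with the 18% purchase rate quoted in the background. The sketch below uses a hypothetical sample of 1,000 customers and a trivial "model" that predicts no purchase for everyone; the function name `confusion_metrics` is introduced here for illustration only.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1

# Hypothetical sample: 1,000 customers at the 18% purchase rate from the background.
buyers, non_buyers = 180, 820

# Predicting "no purchase" for everyone: every buyer is a false negative.
acc, prec, rec, f1 = confusion_metrics(tp=0, fp=0, fn=buyers, tn=non_buyers)
print(f"accuracy = {acc:.2f}, F1 = {f1:.2f}")  # accuracy = 0.82, F1 = 0.00
```

Accuracy rewards this useless model with 82%, while F1 is zero because no buyer is identified, which is why F1 is the more informative metric here.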

**Comments on additional data sources for model improvement**

* Additional data can be obtained from measured feedback on the initial targeted marketing efforts, in order to strengthen the model.
* Feedback can be gathered from customers who did not convert to a package, for further analysis.

**Model implementation in real world and potential business benefits from model**

* Implemented in the real world, the model will help raise the conversion rate of targeted marketing campaigns, reduce the cost of marketing to likely non-converts, and reduce missed marketing to likely converts. This will increase revenue and reduce both variable marketing costs and opportunity costs.

**Other Recommendations – From EDA**

* More expansion is needed among Tier 2 residents, as well as in the Tier 3 customer base, which is now at only 30.77%.
* More expansion is needed among Large Business customers, who are now 8.9% of the customer base.
* Sales teams should be briefed to make at least 3 follow-ups with potential customers.
* The new Wellness package should have features similar to the Basic, Standard and Deluxe packages and be marketed in 3-star or 5-star hotel combinations.
* The unusually high number of 1.0 ratings should be investigated further to help improve sales conversion.
* The Super Deluxe and King packages perform poorly; their features should be reviewed and the packages discontinued if needed. The new Wellness package should not have features similar to the Super Deluxe and King packages.