'The company is now planning to launch a new product i.e. Wellness Tourism Package. Wellness Tourism is defined as Travel that allows the traveler to maintain, enhance or kick-start a healthy lifestyle, and support or increase one's sense of well-being.

This time company wants to harness the available data of existing and potential customers to target the right customers.'

* There are 5 types of packages the company is offering
1. Basic
2. Standard
3. Deluxe
4. Super Deluxe
5. King

* from the last offering last year- 18% of customers purchased of these packages.

* A new offering from the company called "The Wellness Tourism Package" is under launch and the company is trying to obtain
key insights to be able to reduce the cost and increase their effectiveness in their campaigns.

***OBJECTIVE:***  To predict which customer is more likely to purchase the newly introduced travel package. 

- **CustomerID**              - Unique customer ID
- **ProdTaken**               - Whether the customer has purchased a package or not (0: No, 1: Yes)
- **Age**                     - Age of customer
- **TypeofContact**	          - How customer was contacted (Company Invited or Self Inquiry)
- **CityTier**                -	City tier depends on the development of a city, population, facilities, and living standards. 
> The categories are -   ordered i.e. Tier 1 > Tier 2 > Tier 3
- **DurationOfPitch**         -	Duration of the pitch by a salesperson to the customer
- **Occupation**              -	Occupation of customer
- **Gender**                  -	Gender of customer
- **NumberOfPersonVisiting**  -	Total number of persons planning to take the trip with the customer
- **NumberOfFollowups**       -	Total number of follow-ups has been done by sales person after sales pitch
- **ProductPitched**          -	Product pitched by the salesperson
- **PreferredPropertyStar**	  - Preferred hotel property rating by customer
- **MaritalStatus**           -	Marital status of customer
- **NumberOfTrips**           -	Average number of trips in a year by customer
- **Passport**                -	The customer has a passport or not (0: No, 1: Yes)
- **PitchSatisfactionScore**  -	Sales pitch satisfaction score
- **OwnCar**                  -	Whether the customers own a car or not (0: No, 1: Yes)
- **NumberOfChildrenVisiting**-	Total number of children with age less than 5 planning to take the trip with the customer
- **Designation**             -	Designation of the customer in the current organization
- **MonthlyIncome**           -	Gross monthly income of the customer


In [1]:
# I will load potentially more libraries to insure that I dont have issues later 

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

from sklearn import metrics

# Libraries to help with data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Library to suppress warnings or deprecation notes 
import warnings
warnings.filterwarnings('ignore')

# Libraries to split data, impute missing values 
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Libraries to import decision tree classifier and different ensemble classifiers
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier

# library to help impute missing values in a better method than simply using Mode, Median, or Mean
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

# Libtune to tune model, get different metric scores
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import BaggingRegressor,RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, StackingRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor

# load the LabelEncoder from sklearn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

#To install xgboost library use - !pip install xgboost 
from xgboost import XGBClassifier
from xgboost import XGBRegressor



### Load Dataset

In [2]:
data = pd.read_excel(r'Tourism.xlsx', sheet_name = 'Tourism')
# copy of data
df = data.copy()

### Data Overview
- Observations
- Sanity Checks

### First 5 and last 5 rows of the data

In [3]:
data.head()

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0


In [4]:
data.tail()

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
4883,204883,1,49.0,Self Enquiry,3,9.0,Small Business,Male,3,5.0,Deluxe,4.0,Unmarried,2.0,1,1,1,1.0,Manager,26576.0
4884,204884,1,28.0,Company Invited,1,31.0,Salaried,Male,4,5.0,Basic,3.0,Single,3.0,1,3,1,2.0,Executive,21212.0
4885,204885,1,52.0,Self Enquiry,3,17.0,Salaried,Female,4,4.0,Standard,4.0,Married,7.0,0,1,1,3.0,Senior Manager,31820.0
4886,204886,1,19.0,Self Enquiry,3,16.0,Small Business,Male,3,4.0,Basic,3.0,Single,3.0,0,5,0,2.0,Executive,20289.0
4887,204887,1,36.0,Self Enquiry,1,14.0,Salaried,Male,4,4.0,Basic,4.0,Unmarried,3.0,1,3,1,2.0,Executive,24041.0


In [5]:
# check for duplicate values
data.duplicated().sum()

0

* No duplicated values

### The shape of the data

In [6]:
data.shape

(4888, 20)

* there are 4888 observations with 20 rows of data

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CustomerID                4888 non-null   int64  
 1   ProdTaken                 4888 non-null   int64  
 2   Age                       4662 non-null   float64
 3   TypeofContact             4863 non-null   object 
 4   CityTier                  4888 non-null   int64  
 5   DurationOfPitch           4637 non-null   float64
 6   Occupation                4888 non-null   object 
 7   Gender                    4888 non-null   object 
 8   NumberOfPersonVisiting    4888 non-null   int64  
 9   NumberOfFollowups         4843 non-null   float64
 10  ProductPitched            4888 non-null   object 
 11  PreferredPropertyStar     4862 non-null   float64
 12  MaritalStatus             4888 non-null   object 
 13  NumberOfTrips             4748 non-null   float64
 14  Passport

* Need to drop Customer ID- its a duplicated value 
* TypeofContact,Occupation,Gender, ProductPitched, MaritalStatus, Designation are all objects - we will convert to categories
* the rest of the columns are integers or float

#### We can drop CustomerID since its a duplicate identifier and is not needed. 

In [8]:
data.drop(["CustomerID"], axis=1, inplace=True)

#### Change object to category to improve computational efficiency

In [9]:
for feature in data.columns: # Loop through all columns in the dataframe
    if data[feature].dtype == 'object': # Only apply for columns with categorical strings
        data[feature] = pd.Categorical(data[feature])# Replace strings with an integer
data.head(10)

Unnamed: 0,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0
5,0,32.0,Company Invited,1,8.0,Salaried,Male,3,3.0,Basic,3.0,Single,1.0,0,5,1,1.0,Executive,18068.0
6,0,59.0,Self Enquiry,1,9.0,Small Business,Female,2,2.0,Basic,5.0,Divorced,5.0,1,2,1,1.0,Executive,17670.0
7,0,30.0,Self Enquiry,1,30.0,Salaried,Male,3,3.0,Basic,3.0,Married,2.0,0,2,0,1.0,Executive,17693.0
8,0,38.0,Company Invited,1,29.0,Salaried,Male,2,4.0,Standard,3.0,Unmarried,1.0,0,3,0,0.0,Senior Manager,24526.0
9,0,36.0,Self Enquiry,1,33.0,Small Business,Male,3,3.0,Deluxe,3.0,Divorced,7.0,0,3,1,0.0,Manager,20237.0


#### convert some observations into category which resembles their true nature

In [10]:
data['CityTier']= data['CityTier'].astype('category')
data['PreferredPropertyStar'] = data['PreferredPropertyStar'].astype('category')
data['Passport'] = data['Passport'].astype('category').astype('category')
data['OwnCar'] = data['OwnCar'].astype('category')

#### number breakdown of observations

In [11]:
cat_cols=['ProdTaken','Age','TypeofContact','CityTier','DurationOfPitch','Occupation','Gender','NumberOfPersonVisiting','NumberOfFollowups','ProductPitched','PreferredPropertyStar','MaritalStatus','NumberOfTrips','Passport','PitchSatisfactionScore','OwnCar','NumberOfChildrenVisiting','Designation']

for column in cat_cols:
    print(data[column].value_counts())
    print('-'*30)

0    3968
1     920
Name: ProdTaken, dtype: int64
------------------------------
35.0    237
36.0    231
34.0    211
31.0    203
30.0    199
32.0    197
33.0    189
37.0    185
29.0    178
38.0    176
41.0    155
39.0    150
28.0    147
40.0    146
42.0    142
27.0    138
43.0    130
46.0    121
45.0    116
26.0    106
44.0    105
51.0     90
47.0     88
50.0     86
25.0     74
52.0     68
53.0     66
48.0     65
49.0     65
55.0     64
54.0     61
56.0     58
24.0     56
22.0     46
23.0     46
59.0     44
21.0     41
20.0     38
19.0     32
58.0     31
57.0     29
60.0     29
18.0     14
61.0      9
Name: Age, dtype: int64
------------------------------
Self Enquiry       3444
Company Invited    1419
Name: TypeofContact, dtype: int64
------------------------------
1    3190
3    1500
2     198
Name: CityTier, dtype: int64
------------------------------
9.0      483
7.0      342
8.0      333
6.0      307
16.0     274
15.0     269
14.0     253
10.0     244
13.0     223
11.0     205
12.

#### Number of unique values in each column

In [12]:
data.nunique().sort_values(ascending=False)

MonthlyIncome               2475
Age                           44
DurationOfPitch               34
NumberOfTrips                 12
NumberOfFollowups              6
Designation                    5
PitchSatisfactionScore         5
ProductPitched                 5
NumberOfPersonVisiting         5
MaritalStatus                  4
Occupation                     4
NumberOfChildrenVisiting       4
Gender                         3
PreferredPropertyStar          3
CityTier                       3
Passport                       2
OwnCar                         2
TypeofContact                  2
ProdTaken                      2
dtype: int64

#### Correcting the syntax of Gender Fe Male and Female are combined 

In [13]:
# Correcting Gender- by merging Fe Male and Female
data["Gender"] = data["Gender"].str.replace("Fe Male", "Female")
data["Gender"].describe()

count     4888
unique       2
top       Male
freq      2916
Name: Gender, dtype: object

### Data Preprocessing
- Missing value treatment (if needed)
- Feature engineering
- Outlier detection and treatment (if needed)
- Preparing data for modeling
- Any other preprocessing steps (if needed)

### Check for missingness

In [14]:
#check for missing data
data.isna().sum().sort_values(ascending=False)

DurationOfPitch             251
MonthlyIncome               233
Age                         226
NumberOfTrips               140
NumberOfChildrenVisiting     66
NumberOfFollowups            45
PreferredPropertyStar        26
TypeofContact                25
Designation                   0
OwnCar                        0
PitchSatisfactionScore        0
Passport                      0
ProdTaken                     0
MaritalStatus                 0
NumberOfPersonVisiting        0
Gender                        0
Occupation                    0
CityTier                      0
ProductPitched                0
dtype: int64

In [15]:
missing_type_of_contact = data[data['TypeofContact'].isna()]
print(missing_type_of_contact)

      ProdTaken   Age TypeofContact CityTier  DurationOfPitch      Occupation  \
224           0  31.0           NaN        1              NaN  Small Business   
571           0  26.0           NaN        1              NaN        Salaried   
572           0  29.0           NaN        1              NaN  Small Business   
576           0  27.0           NaN        3              NaN  Small Business   
579           0  34.0           NaN        1              NaN  Small Business   
598           1  28.0           NaN        1              NaN  Small Business   
622           0  32.0           NaN        3              NaN        Salaried   
724           0  24.0           NaN        1              NaN  Small Business   
843           0  26.0           NaN        1              NaN  Small Business   
1021          1  25.0           NaN        3              NaN        Salaried   
1047          0  33.0           NaN        3              NaN  Small Business   
1143          0  45.0       

* Missing observations in 
> - Age, TypeofContact, DurationOfPitch, NumberOfFollowups, PreferredPropertyStar, NumberOfTrips, NumberOfChildrenVisiting, MontlyIncome

In [16]:
# breakdown of number of missing values with other missing values
num_missing = data.isnull().sum(axis=1)
num_missing.value_counts()

0    4128
1     533
2     202
3      25
dtype: int64

In [17]:
# data missing 2 or more 
data[num_missing == 1].sample(n=5)

Unnamed: 0,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
3214,0,47.0,Self Enquiry,1,7.0,Small Business,Male,3,4.0,King,,Married,2.0,0,5,1,2.0,VP,38305.0
396,0,43.0,Self Enquiry,1,,Salaried,Female,3,3.0,Deluxe,3.0,Married,5.0,1,2,0,0.0,Manager,19522.0
1423,0,,Self Enquiry,1,6.0,Salaried,Male,3,3.0,Basic,3.0,Single,1.0,0,3,0,2.0,Executive,18375.0
51,1,,Self Enquiry,1,11.0,Large Business,Male,2,3.0,Basic,3.0,Single,2.0,1,2,1,0.0,Executive,18441.0
820,0,35.0,Company Invited,3,17.0,Small Business,Male,2,,Deluxe,3.0,Married,2.0,0,5,0,0.0,Manager,19968.0


In [18]:
df[num_missing == 2].sample(n=5)

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
1347,201347,0,,Company Invited,2,8.0,Salaried,Male,3,4.0,Deluxe,3.0,Single,2.0,0,3,1,0.0,Manager,
1098,201098,0,,Company Invited,1,14.0,Salaried,Male,3,3.0,Deluxe,3.0,Married,4.0,1,3,1,1.0,Manager,
1149,201149,0,,Self Enquiry,1,25.0,Salaried,Male,3,4.0,Basic,5.0,Married,2.0,0,1,0,0.0,Executive,
2324,202324,0,45.0,Self Enquiry,1,,Small Business,Female,2,3.0,Basic,3.0,Married,5.0,1,3,0,1.0,Executive,
2195,202195,1,,Self Enquiry,1,20.0,Salaried,Male,2,4.0,Basic,4.0,Married,2.0,1,5,1,1.0,Executive,


In [19]:
df[num_missing == 3].sample(n=5)

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
1182,201182,0,36.0,,1,,Small Business,Female,2,4.0,Deluxe,3.0,Married,1.0,0,5,1,1.0,Manager,
2042,202042,0,29.0,,1,,Small Business,Female,3,3.0,Deluxe,3.0,Married,5.0,0,1,0,1.0,Manager,
2068,202068,1,28.0,,1,,Small Business,Male,2,3.0,Basic,3.0,Single,7.0,0,3,1,1.0,Executive,
1694,201694,0,31.0,,1,,Small Business,Male,2,5.0,Deluxe,3.0,Married,1.0,0,3,0,0.0,Manager,
1047,201047,0,33.0,,3,,Small Business,Male,2,3.0,Deluxe,5.0,Divorced,1.0,0,3,0,0.0,Manager,


#### Imputing values into the missing variables
* Use Mode values for categoric data and used rounded mean values for integer and float values. 
* Used round method because there are no decimals in the data, this will stay in alignment with the general data
* the missing data isnt enough to sway the overall performance- we would like to keep the observations for the other data

#### Using Hist_Gradient_Boosting to impute missing values for a more precise adjustment for missingness.

#### imputing for "TypeofContact"

In [20]:
# Create a copy of the DataFrame
impute_data = data.copy()

In [21]:
# Filter the DataFrame to exclude rows with missing 'TypeofContact'
not_missing_data = impute_data[~impute_data['TypeofContact'].isna()]

In [22]:
# Select the relevant features for training the model
features = ['DurationOfPitch', 'MonthlyIncome', 'Age', 'NumberOfTrips', 'NumberOfChildrenVisiting', 'NumberOfFollowups', 'PreferredPropertyStar']


In [23]:
# Create X and y for training the model
X = not_missing_data[features]
y = not_missing_data['TypeofContact']

In [24]:
# Initialize the  classifier
gb_classifierTOC = HistGradientBoostingClassifier()

In [25]:
# Fit the classifier on the imputed data
gb_classifierTOC.fit(X, y)

In [26]:
# Filter the DataFrame to include only rows with missing 'TypeofContact'
missing_data = impute_data[impute_data['TypeofContact'].isna()]

# Use the trained classifier to predict the missing values
imputed_values = gb_classifierTOC.predict(missing_data[features])


In [27]:
# Create a DataFrame to store the imputed values along with their indices
imputed_data = pd.DataFrame({'ImputedTypeofContact': imputed_values}, index=missing_data.index)

# Concatenate the original missing_data DataFrame with the imputed_data DataFrame
imputed_df = pd.concat([missing_data, imputed_data], axis=1)

# Print the imputed DataFrame
print(imputed_df)

      ProdTaken   Age TypeofContact CityTier  DurationOfPitch      Occupation  \
224           0  31.0           NaN        1              NaN  Small Business   
571           0  26.0           NaN        1              NaN        Salaried   
572           0  29.0           NaN        1              NaN  Small Business   
576           0  27.0           NaN        3              NaN  Small Business   
579           0  34.0           NaN        1              NaN  Small Business   
598           1  28.0           NaN        1              NaN  Small Business   
622           0  32.0           NaN        3              NaN        Salaried   
724           0  24.0           NaN        1              NaN  Small Business   
843           0  26.0           NaN        1              NaN  Small Business   
1021          1  25.0           NaN        3              NaN        Salaried   
1047          0  33.0           NaN        3              NaN  Small Business   
1143          0  45.0       

In [28]:
# Update the original dataframe with the imputed values
data.loc[data['TypeofContact'].isna(), 'TypeofContact'] = imputed_values

#### imputing for "PreferredPropertyStar"

In [29]:
# Create a copy of the DataFrame- this may not be necessary but i would like to make it redundant to insure data integrity
impute_data = data.copy()

In [30]:
# Filter the DataFrame to exclude rows with missing 'TypeofContact'
not_missing_data = impute_data[~impute_data['PreferredPropertyStar'].isna()]

In [31]:
# Select the relevant features for training the model
features = ['DurationOfPitch', 'MonthlyIncome', 'Age', 'NumberOfTrips', 'NumberOfChildrenVisiting', 'NumberOfFollowups', 'TypeofContact']


In [32]:
# Convert categorical variables to numerical representations
label_encoder = LabelEncoder()
not_missing_data['TypeofContact'] = label_encoder.fit_transform(not_missing_data['TypeofContact'])

In [33]:
# Perform one-hot encoding on the 'TypeofContact' column
onehot_encoder = OneHotEncoder(sparse=False)
encoded_categories = onehot_encoder.fit_transform(not_missing_data['TypeofContact'].values.reshape(-1, 1))

In [34]:
# Create X and y for training the model
X_categorical = encoded_categories
X_numerical = not_missing_data[features].values
X = np.concatenate((X_numerical, X_categorical), axis=1)
y = not_missing_data['PreferredPropertyStar']

In [35]:
# Initialize the  classifier
gb_classifierPPS = HistGradientBoostingClassifier()

In [36]:
# Fit the classifier on the imputed data
gb_classifierPPS.fit(X, y)

In [37]:
# Filter the DataFrame to include only rows with missing 'PreferredPropertyStar'
missing_data = impute_data[impute_data['PreferredPropertyStar'].isna()]

# Convert categorical variables to numerical representations
missing_data['TypeofContact'] = label_encoder.transform(missing_data['TypeofContact'])

# Perform one-hot encoding on the 'TypeofContact' column
encoded_categories_missing = onehot_encoder.transform(missing_data['TypeofContact'].values.reshape(-1, 1))


In [38]:
# Create X for the missing data
X_categorical_missing = encoded_categories_missing
X_numerical_missing = missing_data[features].values
X_missing = np.concatenate((X_numerical_missing, X_categorical_missing), axis=1)

In [39]:
# Use the trained classifier to predict the missing values
imputed_values = gb_classifierPPS.predict(X_missing)

In [40]:
# Update the original dataframe with the imputed values
data.loc[data['PreferredPropertyStar'].isna(), 'PreferredPropertyStar'] = imputed_values

#### imputing for "TypeofContact"

In [41]:
# Create a copy of the DataFrame - this may not be necessary but i would like to make it redundant to insure data integrity
impute_data = data.copy()

In [42]:
# Filter the DataFrame to exclude rows with missing 'TypeofContact'
not_missing_data = impute_data[~impute_data['NumberOfFollowups'].isna()]

In [43]:
# Select the relevant features for training the model
features = ['DurationOfPitch', 'MonthlyIncome', 'Age', 'NumberOfTrips', 'NumberOfChildrenVisiting', 'PreferredPropertyStar', 'TypeofContact']


In [44]:
# Convert categorical variables to numerical representations
label_encoder = LabelEncoder()
not_missing_data['TypeofContact'] = label_encoder.fit_transform(not_missing_data['TypeofContact'])

In [45]:
# Perform one-hot encoding on the 'TypeofContact' column
onehot_encoder = OneHotEncoder(sparse=False)
encoded_categories = onehot_encoder.fit_transform(not_missing_data['TypeofContact'].values.reshape(-1, 1))

In [46]:
# Create X and y for training the model
X_categorical = encoded_categories
X_numerical = not_missing_data[features].values
X = np.concatenate((X_numerical, X_categorical), axis=1)
y = not_missing_data['TypeofContact']

In [47]:
# Initialize the  classifier
gb_classifierNOF = HistGradientBoostingClassifier()

In [48]:
# Fit the classifier on the imputed data
gb_classifierNOF.fit(X, y)

In [49]:
# Filter the DataFrame to include only rows with missing 'NumberOfFollowups'
missing_data = impute_data[impute_data['NumberOfFollowups'].isna()]

# Convert categorical variables to numerical representations
missing_data['TypeofContact'] = label_encoder.transform(missing_data['TypeofContact'])

# Perform one-hot encoding on the 'TypeofContact' column
encoded_categories_missing = onehot_encoder.transform(missing_data['TypeofContact'].values.reshape(-1, 1))

In [50]:
# Create X for the missing data
X_categorical_missing = encoded_categories_missing
X_numerical_missing = missing_data[features].values
X_missing = np.concatenate((X_numerical_missing, X_categorical_missing), axis=1)

In [51]:
# Use the trained classifier to predict the missing values
imputed_values = gb_classifierNOF.predict(X_missing)

In [52]:
# Update the original dataframe with the imputed values
data.loc[data['NumberOfFollowups'].isna(), 'NumberOfFollowups'] = imputed_values

#### imputing for "NumberOfChildrenVisiting"

In [53]:
# Create a copy of the DataFrame- this may not be necessary but i would like to make it redundant to insure data integrity
impute_data = data.copy()

In [54]:
# Filter the DataFrame to exclude rows with missing 'TypeofContact'
not_missing_data = impute_data[~impute_data['NumberOfChildrenVisiting'].isna()]

In [55]:
# Select the relevant features for training the model
features = ['DurationOfPitch', 'MonthlyIncome', 'Age', 'NumberOfTrips', 'NumberOfFollowups', 'PreferredPropertyStar', 'TypeofContact']


In [56]:
# Convert categorical variables to numerical representations
label_encoder = LabelEncoder()
not_missing_data['TypeofContact'] = label_encoder.fit_transform(not_missing_data['TypeofContact'])

In [57]:
# Perform one-hot encoding on the 'TypeofContact' column
onehot_encoder = OneHotEncoder(sparse=False)
encoded_categories = onehot_encoder.fit_transform(not_missing_data['TypeofContact'].values.reshape(-1, 1))

In [58]:
# Create X and y for training the model
X_categorical = encoded_categories
X_numerical = not_missing_data[features].values
X = np.concatenate((X_numerical, X_categorical), axis=1)
y = not_missing_data['NumberOfChildrenVisiting']

In [59]:
# Initialize the  classifier
gb_classifierNOCV = HistGradientBoostingClassifier()

In [60]:
# Fit the classifier on the imputed data
gb_classifierNOCV.fit(X, y)

In [61]:
# Filter the DataFrame to include only rows with missing NumberOfChildrenVisiting''
missing_data = impute_data[impute_data['NumberOfChildrenVisiting'].isna()]

# Convert categorical variables to numerical representations
missing_data['TypeofContact'] = label_encoder.transform(missing_data['TypeofContact'])

# Perform one-hot encoding on the 'TypeofContact' column
encoded_categories_missing = onehot_encoder.transform(missing_data['TypeofContact'].values.reshape(-1, 1))

In [62]:
# Create X for the missing data
X_categorical_missing = encoded_categories_missing
X_numerical_missing = missing_data[features].values
X_missing = np.concatenate((X_numerical_missing, X_categorical_missing), axis=1)

In [63]:
# Use the trained classifier to predict the missing values
imputed_values = gb_classifierNOCV.predict(X_missing)

In [64]:
# Update the original dataframe with the imputed values
data.loc[data['NumberOfChildrenVisiting'].isna(), 'NumberOfChildrenVisiting'] = imputed_values

#### imputing for "NumberOfTrips"

In [65]:
# Create a copy of the DataFrame- this may not be necessary but i would like to make it redundant to insure data integrity
impute_data = data.copy()

In [66]:
# Filter the DataFrame to exclude rows with missing 'TypeofContact'
not_missing_data = impute_data[~impute_data['NumberOfTrips'].isna()]

In [67]:
# Select the relevant features for training the model
features = ['DurationOfPitch', 'MonthlyIncome', 'Age', 'NumberOfChildrenVisiting', 'NumberOfFollowups', 'PreferredPropertyStar', 'TypeofContact']


In [68]:
# Convert categorical variables to numerical representations
label_encoder = LabelEncoder()
not_missing_data['TypeofContact'] = label_encoder.fit_transform(not_missing_data['TypeofContact'])

In [69]:
# Perform one-hot encoding on the 'TypeofContact' column
onehot_encoder = OneHotEncoder(sparse=False)
encoded_categories = onehot_encoder.fit_transform(not_missing_data['TypeofContact'].values.reshape(-1, 1))

In [70]:
# Create X and y for training the model
X_categorical = encoded_categories
X_numerical = not_missing_data[features].values
X = np.concatenate((X_numerical, X_categorical), axis=1)
y = not_missing_data['NumberOfTrips']

In [71]:
# Initialize the  classifier
gb_classifierNOT = HistGradientBoostingClassifier()

In [72]:
# Fit the classifier on the imputed data
gb_classifierNOT.fit(X, y)

In [73]:
# Filter the DataFrame to include only rows with missing 'NumberOfTrips'
missing_data = impute_data[impute_data['NumberOfTrips'].isna()]

# Convert categorical variables to numerical representations
missing_data['TypeofContact'] = label_encoder.transform(missing_data['TypeofContact'])

# Perform one-hot encoding on the 'TypeofContact' column
encoded_categories_missing = onehot_encoder.transform(missing_data['TypeofContact'].values.reshape(-1, 1))

In [74]:
# Create X for the missing data
X_categorical_missing = encoded_categories_missing
X_numerical_missing = missing_data[features].values
X_missing = np.concatenate((X_numerical_missing, X_categorical_missing), axis=1)

In [75]:
# Use the trained classifier to predict the missing values
imputed_values = gb_classifierNOT.predict(X_missing)

In [76]:
# Update the original dataframe with the imputed values
data.loc[data['NumberOfTrips'].isna(), 'NumberOfTrips'] = imputed_values

#### imputing for "Age"

In [77]:
# Create a copy of the DataFrame- this may not be necessary but i would like to make it redundant to insure data integrity
impute_data = data.copy()

In [78]:
# Filter the DataFrame to exclude rows with missing 'Age'
not_missing_data = impute_data[~impute_data['Age'].isna()]

In [79]:
# Select the relevant features for training the model
features = ['DurationOfPitch', 'MonthlyIncome', 'NumberOfTrips', 'NumberOfChildrenVisiting', 'NumberOfFollowups', 'PreferredPropertyStar', 'TypeofContact']


In [80]:
# Convert categorical variables to numerical representations
label_encoder = LabelEncoder()
not_missing_data['TypeofContact'] = label_encoder.fit_transform(not_missing_data['TypeofContact'])

In [81]:
# Perform one-hot encoding on the 'TypeofContact' column
onehot_encoder = OneHotEncoder(sparse=False)
encoded_categories = onehot_encoder.fit_transform(not_missing_data['TypeofContact'].values.reshape(-1, 1))

In [82]:
# Create X and y for training the model
X_categorical = encoded_categories
X_numerical = not_missing_data[features].values
X = np.concatenate((X_numerical, X_categorical), axis=1)
y = not_missing_data['Age']

In [83]:
# Initialize the  classifier
gb_classifierAGE = HistGradientBoostingClassifier()

In [84]:
# Fit the classifier on the imputed data
gb_classifierAGE.fit(X, y)

In [85]:
# Filter the DataFrame to include only rows with missing 'NumberOfFollowups'
missing_data = impute_data[impute_data['Age'].isna()]

# Convert categorical variables to numerical representations
missing_data['TypeofContact'] = label_encoder.transform(missing_data['TypeofContact'])

# Perform one-hot encoding on the 'TypeofContact' column
encoded_categories_missing = onehot_encoder.transform(missing_data['TypeofContact'].values.reshape(-1, 1))

In [86]:
# Create X for the missing data
X_categorical_missing = encoded_categories_missing
X_numerical_missing = missing_data[features].values
X_missing = np.concatenate((X_numerical_missing, X_categorical_missing), axis=1)

In [87]:
# Use the trained classifier to predict the missing values
imputed_values = gb_classifierAGE.predict(X_missing)

In [88]:
# Update the original dataframe with the imputed values
data.loc[data['Age'].isna(), 'Age'] = imputed_values

#### imputing for "MonthlyIncome"

In [89]:
# Create a copy of the DataFrame- this may not be necessary but i would like to make it redundant to insure data integrity
impute_data = data.copy()

In [90]:
# Filter the DataFrame to exclude rows with missing 'MonthlyIncome'
not_missing_data = impute_data[~impute_data['MonthlyIncome'].isna()]

In [91]:
# Select the relevant features for training the model
features = ['DurationOfPitch', 'Age', 'NumberOfTrips', 'NumberOfChildrenVisiting', 'NumberOfFollowups', 'PreferredPropertyStar', 'TypeofContact']


In [92]:
# Convert categorical variables to numerical representations
label_encoder = LabelEncoder()
not_missing_data['TypeofContact'] = label_encoder.fit_transform(not_missing_data['TypeofContact'])

In [93]:
# Perform one-hot encoding on the 'TypeofContact' column
onehot_encoder = OneHotEncoder(sparse=False)
encoded_categories = onehot_encoder.fit_transform(not_missing_data['TypeofContact'].values.reshape(-1, 1))

In [94]:
# Create X and y for training the model
X_categorical = encoded_categories
X_numerical = not_missing_data[features].values
X = np.concatenate((X_numerical, X_categorical), axis=1)
y = not_missing_data['MonthlyIncome']

In [95]:
# Initialize the  classifier
gb_classifierMI = HistGradientBoostingClassifier()

In [96]:
# Fit the classifier on the imputed data
gb_classifierMI.fit(X, y)

ValueError: could not convert string to float: 'Self Enquiry'

In [None]:
# Filter the DataFrame to include only rows with missing 'NumberOfFollowups'
missing_data = impute_data[impute_data['MonthlyIncome'].isna()]

# Convert categorical variables to numerical representations
missing_data['TypeofContact'] = label_encoder.transform(missing_data['TypeofContact'])

# Perform one-hot encoding on the 'TypeofContact' column
encoded_categories_missing = onehot_encoder.transform(missing_data['TypeofContact'].values.reshape(-1, 1))

In [None]:
# Create X for the missing data
X_categorical_missing = encoded_categories_missing
X_numerical_missing = missing_data[features].values
X_missing = np.concatenate((X_numerical_missing, X_categorical_missing), axis=1)

In [None]:
# Use the trained classifier to predict the missing values
imputed_values = gb_classifierMI.predict(X_missing)

In [None]:
# Update the original dataframe with the imputed values
data.loc[data['MonthlyIncome'].isna(), 'MontlyIncome'] = imputed_values

#### imputing for "DurationOfPitch"

In [None]:
# Create a copy of the DataFrame- this may not be necessary but i would like to make it redundant to insure data integrity
impute_data = data.copy()

In [None]:
# Filter the DataFrame to exclude rows with missing 'DurationOfPitch'
not_missing_data = impute_data[~impute_data['DurationOfPitch'].isna()]

In [None]:
# Select the relevant features for training the model
features = ['MonthlyIncome', 'Age', 'NumberOfTrips', 'NumberOfChildrenVisiting', 'NumberOfFollowups', 'PreferredPropertyStar', 'TypeofContact']


In [None]:
# Convert categorical variables to numerical representations
label_encoder = LabelEncoder()
not_missing_data['DurationOfPitch'] = label_encoder.fit_transform(not_missing_data['DurationOfPitch'])

In [None]:
# Perform one-hot encoding on the 'TypeofContact' column
onehot_encoder = OneHotEncoder(sparse=False)
encoded_categories = onehot_encoder.fit_transform(not_missing_data['TypeofContact'].values.reshape(-1, 1))

In [None]:
# Create X and y for training the model
X_categorical = encoded_categories
X_numerical = not_missing_data[features].values
X = np.concatenate((X_numerical, X_categorical), axis=1)
y = not_missing_data['DurationOfPitch']

In [None]:
# Initialize the  classifier
gb_classifierDOP = HistGradientBoostingClassifier()

In [None]:
# Fit the classifier on the imputed data
gb_classifierDOP.fit(X, y)

In [None]:
# Filter the DataFrame to include only rows with missing 'NumberOfFollowups'
missing_data = impute_data[impute_data['DurationOfPitch'].isna()]

# Convert categorical variables to numerical representations
missing_data['TypeofContact'] = label_encoder.transform(missing_data['TypeofContact'])

# Perform one-hot encoding on the 'TypeofContact' column
encoded_categories_missing = onehot_encoder.transform(missing_data['TypeofContact'].values.reshape(-1, 1))

In [None]:
# Create X for the missing data
X_categorical_missing = encoded_categories_missing
X_numerical_missing = missing_data[features].values
X_missing = np.concatenate((X_numerical_missing, X_categorical_missing), axis=1)

In [None]:
# Use the trained classifier to predict the missing values
imputed_values = gb_classifierDOP.predict(X_missing)

In [None]:
# Update the original dataframe with the imputed values
data.loc[data['DurationOfPitch'].isna(), 'DurationOfPitch'] = imputed_values

In [None]:
#check for missing data
data.isna().sum().sort_values(ascending=False)

In [None]:
#check for missing data
imputed_df.isna().sum().sort_values(ascending=False)

In [None]:
# data['Age']= data['Age'].fillna(data['Age'].mean()).astype(int)
# data['Age'] = data['Age'].fillna(round(data['Age'].mean()))

In [None]:
#data['DurationOfPitch'] = data['DurationOfPitch'].fillna(round(data['DurationOfPitch'].mean()))

In [None]:
#data['NumberOfFollowups'] = data['NumberOfFollowups'].fillna(round(data['NumberOfFollowups'].mean()))

In [None]:
# data['NumberOfTrips'] = data['NumberOfTrips'].fillna(round(data['NumberOfTrips'].mean()))

In [None]:
#data['MonthlyIncome']= data['MonthlyIncome'].fillna(round(data['MonthlyIncome'].mean()))

In [None]:
for n in num_missing.value_counts().sort_index().index:
    if n > 0:
        print(f'For the rows with exactly {n} missing values, NAs are found in:')
        n_miss_per_col = data[num_missing == n].isnull().sum()
        print(n_miss_per_col[n_miss_per_col > 0])
        print('\n\n')
        

In [None]:
# Statistical Summary of Data
data.describe().T

- 18.8% will buy the product
- Age 37 Average and 50% quartile
- Duration of Pitch Average length is 15.46
- Number of People visiting - 2.9 or 3 
- Children- average is just over 1 and 1 fits in the 50 %
- Monthly income - 23619 average

#### Replace Object type with Categorical type 

* missing values have been properly addressed

In [None]:
data.info()

## Exploratory Data Analysis

In [None]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

In [None]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

#### Statistical Summary of the data

In [None]:
data.describe(include='all').T

### Summary of observations

In [None]:
columns = list(data)[0:-1] # Excluding Outcome column which has only 
data[columns].hist(stacked=False, bins=100, figsize=(12,30), layout=(14,2)); 


### Observations on ProdTaken

In [None]:
labeled_barplot(data, "ProdTaken",perc=True)

* 18.8% of customers purchased a package

### Observations on Age

In [None]:
labeled_barplot(data, "Age",perc=True)

In [None]:
histogram_boxplot(data,"Age")

* Age looks to resemble a uniform distribution
* 35 is the most common observation in the Age column

### Observations on TypeofContact 

In [None]:
labeled_barplot(data, "TypeofContact",perc=True)

* 70.5% of the observations are Self Enquiry

### Observations on CityTier

In [None]:
labeled_barplot(data, "CityTier",perc=True)

* City Tier of 1 has 65.3% of the data
* City Tier of 3 has the second most with 30.7
* City Tier of 2 has only 4.1%

### Observations on DurationOfPitch

In [None]:
labeled_barplot(data, "DurationOfPitch",perc=True)

In [None]:
histogram_boxplot(data,"DurationOfPitch")

* DurationOfPitch is Right skewed- most pitches gravitate to shorter timeframes

### Observations on Occupation 

In [None]:
labeled_barplot(data, "Occupation",perc=True)

* Salaried has the most observations with 48.4%, followed closely  by Small Business with 42.6%
* Free Lancer is the smallest of this group 

### Observations on Gender

In [None]:
labeled_barplot(data, "Gender",perc=True)

* Most observations are Male- with 59.7%

### Observations on NumberOfPersonVisiting

In [None]:
labeled_barplot(data, "NumberOfPersonVisiting",perc=True)

In [None]:
histogram_boxplot(data,"NumberOfPersonVisiting")

* Most of the observations have 2 , 3 or 4 people visiting. 3 is the most numerous with 49.1% 
* 5 and 1 are outliers

### Observations on NumberOfFollowups  

In [None]:
histogram_boxplot(data,"NumberOfFollowups")

* NumberOfFollowups- 4 has the most with 42.3%
* 3 has 30% and 5 has 15.7%

### Observations on ProductPitched 

In [None]:
labeled_barplot(data, "ProductPitched",perc=True)

* Basic (37.7%) is the most popular closely followed by the Deluxe package (35.4%)

### Observations on PreferredPropertyStar 

In [None]:
labeled_barplot(data, "PreferredPropertyStar",perc=True)

- 61.8% of respondents prefer a 3 star
- 19.6% prefer a 5 star
- 18.7% prefer a 4 star

### Observations on MaritalStatus

In [None]:
labeled_barplot(data, "MaritalStatus",perc=True)

- 47.9% are Married
- Divorced (19.4%), Single(18.7%) and Unmarried(14%) are smaller than Married by a large sum

### Observations on NumberOfTrips

In [None]:
labeled_barplot(data, "NumberOfTrips",perc=True)

In [None]:
histogram_boxplot(data,"NumberOfTrips")

* Number of Trips is Skewed to the Right
* Most fall between 2 and 4 trips

### Observations on Passport 

In [None]:
labeled_barplot(data, "Passport",perc=True)

* most do not have a passport, however 29.1% do have a passport

### Observations on PitchSatisfactionScore

In [None]:
labeled_barplot(data, "PitchSatisfactionScore",perc=True)

In [None]:
histogram_boxplot(data,"PitchSatisfactionScore")

* the most common observation is a 3 - this may indicate the need to work on the sales pitch to improve the overall rating

### Observations on OwnCar

In [None]:
labeled_barplot(data, "OwnCar",perc=True)

* Most own their own car- 62% versus 38% that do not

### Observations on NumberOfChildrenVisiting

In [None]:
labeled_barplot(data, "NumberOfChildrenVisiting",perc=True)

In [None]:
histogram_boxplot(data,"NumberOfChildrenVisiting")

* 1 or 2 children accounts for 66% of the data

### Observations on Designation

In [None]:
labeled_barplot(data, "Designation",perc=True)

- Executive and Manager account for 73.1% of the data with Executive being the most numerous.

### Observations on MonthlyIncome

In [None]:
histogram_boxplot(data,"MonthlyIncome")

* Montly income is slightly skewed to the right with several outliers the larger the income becomes.
* $23619.853491 is the average income 

## Bivariate Analysis

In [None]:
# function to plot stacked bar chart


def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

In [None]:
### function to plot distributions wrt target


def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

### ProdTaken vs Age

In [None]:
stacked_barplot(data,"Age", "ProdTaken")

In [None]:
distribution_plot_wrt_target(data, "Age", "ProdTaken")

* Age is skewed to the right slightly for taking the product indicating that there may be a slight age preference
* Age resembles a normal distribution for those not taking the product

### ProdTaken vs TypeofContact

In [None]:
stacked_barplot(data,"TypeofContact", "ProdTaken")

In [None]:
distribution_plot_wrt_target(data,"ProdTaken","TypeofContact")

* company invited respondents have a higher percentage of taking the product
* Self Enquiry respondents are more numerous overall 

### ProdTaken vs CityTier

In [None]:
stacked_barplot(data,"CityTier", "ProdTaken")

* Tier 2 and 3 seem to have a similar success rate that is slightly better than tier 1 

### DurationofPitch vs ProdTaken

In [None]:
stacked_barplot(data,"DurationOfPitch", "ProdTaken")

In [None]:
plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
sns.barplot(data=data, y="ProdTaken", x="DurationOfPitch")
plt.xticks(rotation=90)

plt.show()

* Duration of the pitch appears to have important curve. There is a double peak at 19 and 31, however the longer the pitch the less successful generally in converting to take the product

### Occupation vs ProdTaken

In [None]:
stacked_barplot(data,"Occupation","ProdTaken")

* Since there were only 2 observations and in both cases they took the product , we need to consider this very carefully because it could be a outlier
* Large Business clients have a higher percentage over Small Business and Salaried 

### Gender vs ProdTaken

In [None]:
stacked_barplot(data,"ProdTaken","Gender")

* Males have a slightly higher probability of taking the package

### NumberOfPersonVisiting vs ProdTaken

In [None]:
stacked_barplot(data,"ProdTaken","NumberOfPersonVisiting")

* Number of Persons visiting seem to have similar results except when there is only 1 - in this case they have not opted for the product

### NumberOfFollowups vs ProdTaken

In [None]:
stacked_barplot(data,"NumberOfFollowups","ProdTaken")

* The larger the number of Followups , the larger the percentage of respondents that take the product

### ProductPitched  vs ProdTaken

In [None]:
stacked_barplot(data,"ProductPitched","ProdTaken")

* The basic package has a higher percentage and number of acceptance- the higher the product offering the less successful.
* Depending on how this is selected, this could be useful

### PreferredPropertyStar  vs ProdTaken

In [None]:
stacked_barplot(data,"PreferredPropertyStar","ProdTaken")

* the higher the star the higher the probability for success.
* This may indicate that clients that have a more select taste may be more open to the product. 

### MaritalStatus  vs ProdTaken

In [None]:
stacked_barplot(data,"MaritalStatus","ProdTaken")

* this is interesting- Single and Unmarried have a higher percentage of likelihood of taking the package


### NumberOfTrips  vs ProdTaken

In [None]:
stacked_barplot(data,"NumberOfTrips","ProdTaken")

In [None]:
distribution_plot_wrt_target(data,"NumberOfTrips","ProdTaken")

* similar distributions, looks to be a mixed result

### Passport  vs ProdTaken

In [None]:
distribution_plot_wrt_target(data,"ProdTaken","Passport")

* if a obserrvation has a passport, they are more likely to get the package

### PitchSatisfactionScore  vs ProdTaken

In [None]:
stacked_barplot(data,"PitchSatisfactionScore","ProdTaken")

* There seems to be a weak but present indication that the higher the satisfaction on the Pitch, the higher the probability of successfully selling the product

### OwnCar  vs ProdTaken

In [None]:
stacked_barplot(data,"OwnCar","ProdTaken")

In [None]:
distribution_plot_wrt_target(data,"ProdTaken","OwnCar")

* there does not seem to be a strong relationship between owning a car and buying a package

### NumberOfChildrenVisiting vs ProdTaken

In [None]:
stacked_barplot(data,"NumberOfChildrenVisiting","ProdTaken")

* The Number Of Children Visiting does not seem to influence the purchase of the package

### Designation vs ProdTaken

In [None]:
stacked_barplot(data,"Designation","ProdTaken")

Executives and Senior Managers are more likely to buy the package

### MonthlyIncome vs ProdTaken

In [None]:
distribution_plot_wrt_target(data,"MonthlyIncome","ProdTaken")

* when the monthly income is slightly less there is a more probable chance that they will buy the package. This may indicate that those that are making more money cannot take the time to get away. 

In [None]:
stacked_barplot(data,"NumberOfPersonVisiting","NumberOfChildrenVisiting")

In [None]:
distribution_plot_wrt_target(data,"NumberOfPersonVisiting","NumberOfChildrenVisiting")

* there is a relationship between number of kids visiting and number of people visiting.

In [None]:
sns.pairplot(df,diag_kind='kde')

In [None]:
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()

# correlation heatmap
plt.figure(figsize=(20, 14))
sns.heatmap(
    data[numeric_columns].corr(),
    annot=True,
    vmin=-1,
    vmax=1,
    fmt=".2f",
    cmap="Spectral",
)
plt.show()

* The highest correlation is NumberOfChildrenVisiting and NumberOfPersonVisiting. It is .61 so we will not automatically drop this variable but we will watch it.
* The second highest variable is Age to Montly Income- .46

### Outlier Inspection and treatment

In [None]:
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
# dropping Personal_Loan as it is the dependent variable
# dropping ZIPCode as it is a geographic identifier

numeric_columns.remove("ProdTaken")


plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

* there are outliers on NumberOfPersonVisiting however the outliers are real observable data that we would like to maintain for our analysis-

## Building bagging and boosting models

In [None]:
X = data.drop("ProdTaken" , axis=1)
y = data.pop("ProdTaken")

In [None]:
X = pd.get_dummies(X, drop_first=False)

#### the stratify argument maintains the original distribution of classes in the target variable while splitting the data into train and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1,stratify=y)
print(X_train.shape,X_test.shape)

In [None]:
#

In [None]:
y.value_counts(1)

In [None]:
y_test.value_counts(1)

#### Before building the model, let's create functions to calculate different metrics- Accuracy, Recall and Precision and plot the confusion matrix.

In [None]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )

    return df_perf

In [None]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

### Model Prediction 
-. False Positive-  Customer is predicted to buy a travel package but they do not
-  False Negative- Customer is predicted not to buy a travel package but they do  

- Since we are trying to avoid both False Negative and False Positive, we will want to maximize the F1 score

### Decision Tree Classifier

In [None]:
#Fitting the model
d_tree = DecisionTreeClassifier(random_state=1)
d_tree.fit(X_train,y_train)

#Calculating different metrics
d_tree_model_train_perf=model_performance_classification_sklearn(d_tree,X_train,y_train)
print("Training performance:\n",d_tree_model_train_perf)
d_tree_model_test_perf=model_performance_classification_sklearn(d_tree,X_test,y_test)
print("Testing performance:\n",d_tree_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(d_tree,X_test,y_test)

In [None]:
dTree1 = DecisionTreeClassifier(criterion = 'gini',max_depth=3,random_state=1)
dTree1.fit(X_train, y_train)

In [None]:
#Choose the type of classifier. 
dtree_estimator = DecisionTreeClassifier(criterion="entropy", max_depth=3)

# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2), 
              'min_samples_leaf': [1, 2, 5],
              'max_leaf_nodes' : [2, 3, 5],
              'min_impurity_decrease': [0.0001,0.001,0.01,0.1]
             }

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer,n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
dtree_estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
dtree_estimator.fit(X_train, y_train)

In [None]:
#Calculating different metrics
dtree_estimator_model_train_perf=model_performance_classification_sklearn(d_tree,X_train,y_train)
print("Training performance:\n",dtree_estimator_model_train_perf)
dtree_estimator_model_test_perf=model_performance_classification_sklearn(d_tree,X_test,y_test)
print("Testing performance:\n",dtree_estimator_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(dtree_estimator,X_test,y_test)

* Decision trees are known to overfit, modification of the decision tree yeilded a very similar result. 
* F1 score is 70.5, Accuracy is 89.4
* there did not seem to be any added benefit to tuning for decision tree

### Bagging Models

In [None]:
#Fitting the model
bagging_classifier = BaggingClassifier(random_state=1)
bagging_classifier.fit(X_train,y_train)

#Calculating different metrics
bagging_classifier_model_train_perf=model_performance_classification_sklearn(bagging_classifier,X_train,y_train)
print(bagging_classifier_model_train_perf)
bagging_classifier_model_test_perf=model_performance_classification_sklearn(bagging_classifier,X_test,y_test)
print(bagging_classifier_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(bagging_classifier,X_test,y_test)

In [None]:
# Choose the type of classifier. 
bagging_estimator_tuned = BaggingClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {'max_samples': [0.7,0.8,0.9,1], 
              'max_features': [0.7,0.8,0.9,1],
              'n_estimators' : [10,20,30,40,50],
             }

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(bagging_estimator_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
bagging_estimator_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
bagging_estimator_tuned.fit(X_train, y_train)

In [None]:
#Calculating different metrics
bagging_estimator_tuned_model_train_perf=model_performance_classification_sklearn(bagging_estimator_tuned,X_train,y_train)
print(bagging_estimator_tuned_model_train_perf)
bagging_estimator_tuned_model_test_perf=model_performance_classification_sklearn(bagging_estimator_tuned,X_test,y_test)
print(bagging_estimator_tuned_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(bagging_estimator_tuned,X_test,y_test)

* Bagging classifier over fitted on both the original run and the modified run
* F1 score was improved on the tuned model from 69 to 74.73
* Recall was also slightly improved using the tuned model from 57.9 to 62.6
* Accuracy and Precision also benefited from the hypertuning.

### Random Forest Classifier

In [None]:
# Fitting the model
rf_estimator = RandomForestClassifier(random_state=1, class_weight="balanced")
rf_estimator.fit(X_train, y_train)

#Calculating different metrics
rf_estimator_model_train_perf = model_performance_classification_sklearn(rf_estimator,X_train,y_train)
print("Training performance:\n",rf_estimator_model_train_perf)
rf_estimator_model_test_perf = model_performance_classification_sklearn(rf_estimator,X_test,y_test)
print("Testing performance:\n",rf_estimator_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(rf_estimator,X_test,y_test)

#### model performance on the Test set

In [None]:
confusion_matrix_sklearn(bagging_estimator_tuned, X_test, y_test)

In [None]:
#Calculating different metrics
rf_estimator_model_train_perf = model_performance_classification_sklearn(rf_estimator,X_train,y_train)
print("Training performance:\n",rf_estimator_model_train_perf)
rf_estimator_model_test_perf = model_performance_classification_sklearn(rf_estimator,X_test,y_test)
print("Testing performance:\n",rf_estimator_model_test_perf)

### Hyperparameter Tuning- Random Forest

* Random Forest Classifier
- Now, let's see if we can get a better model by tuning the random forest classifier. Some of the important hyperparameters available for random forest classifier are:

* n_estimators: The number of trees in the forest, default = 100.
* max_features: The number of features to consider when looking for the best split.
* class_weight: Weights associated with classes in the form {class_label: weight}.If not given, all classes are supposed to have weight one.
* For example: If the frequency of class 0 is 80% and the frequency of class 1 is 20% in the data, then class 0 will become the dominant class and the model will become biased toward the dominant classes. In this case, we can pass a dictionary {0:0.2,1:0.8} to the model to specify the weight of each class and the random forest will give more weightage to class 1.
* bootstrap: Whether bootstrap samples are used when building trees. If False, the entire dataset is used to build each tree, default=True.
* max_samples: If bootstrap is True, then the number of samples to draw from X to train each base estimator. If None (default), then draw N samples, where N is the number of observations in the train data.
* oob_score: Whether to use out-of-bag samples to estimate the generalization accuracy, default=False.

* Note: A lot of hyperparameters of Decision Trees are also available to tune Random Forest like max_depth, min_sample_split etc.

In [None]:
# Choose the type of classifier.
rf_tuned = RandomForestClassifier(random_state=1, oob_score=True, bootstrap=True)

parameters = {
    "max_depth": list(np.arange(5, 15, 5)),
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [3, 5, 7],
    "n_estimators": np.arange(10, 100, 10),
    'criterion':['gini','entropy'],
}

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(rf_tuned, parameters, scoring=acc_scorer, cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
rf_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
rf_tuned.fit(X_train, y_train)

#### Tuned Model performance on training set

In [None]:
confusion_matrix_sklearn(rf_tuned, X_train, y_train)

In [None]:
rf_tuned_model_train_perf = model_performance_classification_sklearn(
    rf_tuned, X_train, y_train
)
rf_tuned_model_train_perf

In [None]:
confusion_matrix_sklearn(rf_tuned, X_test, y_test)

In [None]:
rf_tuned_model_test_perf = model_performance_classification_sklearn(
    rf_tuned, X_test, y_test
)
rf_tuned_model_test_perf

* Hypertuning yeilded a lower F1 score 
* Train and Test of original model was superior to the hypertuned model, however with the difference in the F1 score and Recall it is overfitting and is not well balanced.
* the original model preformed fairly well despite the overfitting- i would not recommend tuning .

### Random Forest feature importances ranked

#### Original model

In [None]:
importances = rf_estimator.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

#### Tuned Model

In [None]:
importances = rf_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

* Features of importance by rank for both
- 1st Age
- Monthly income
- Duration of Pitch 

### Boosting Models

### Building the Model
* We are going to build 3 ensemble models here - AdaBoost Classifier, Gradient Boosting Classifier and XGBoost Classifier.
* First, let's build these models with default parameters and then use hyperparameter tuning to optimize the model performance.
* We will calculate all three metrics - Accuracy, Precision and Recall but the metric of interest here is recall.
* Recall - It gives the ratio of True positives to Actual positives, so high Recall implies low false negatives
* F1- overall performance indicator as mentioned previously will play a important role in our final selection

### AdaBoost Classifier


* An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits   additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
* Some important hyperparamters are:
>  * base_estimator: The base estimator from which the boosted ensemble is built. By default the base estimator is a decision   tree with max_depth=1
>  * n_estimators: The maximum number of estimators at which boosting is terminated. Default value is 50.
>  * learning_rate: Learning rate shrinks the contribution of each classifier by learning_rate. There is a trade-off between learning_rate and n_estimators.

In [None]:
abc = AdaBoostClassifier(random_state=1)
abc.fit(X_train,y_train)

In [None]:
#Fitting the model
ab_classifier = AdaBoostClassifier(random_state=1)
ab_classifier.fit(X_train,y_train)

#Calculating different metrics
ab_classifier_model_train_perf=model_performance_classification_sklearn(ab_classifier,X_train,y_train)
print(ab_classifier_model_train_perf)
ab_classifier_model_test_perf=model_performance_classification_sklearn(ab_classifier,X_test,y_test)
print(ab_classifier_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(ab_classifier,X_test,y_test)

### Hyperparameter tuning

In [None]:
# Choose the type of classifier. 
abc_tuned = AdaBoostClassifier(random_state=1)

# Grid of parameters to choose from
## add from article
parameters = {
    #Let's try different max_depth for base_estimator
    "base_estimator":[DecisionTreeClassifier(max_depth=1, random_state=1),DecisionTreeClassifier(max_depth=2, random_state=1),DecisionTreeClassifier(max_depth=3, random_state=1)],
    "n_estimators": np.arange(10,110,10),
    "learning_rate":np.arange(0.1,2,0.1)
}

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)

In [None]:
#Calculating different metrics
abc_tuned_model_train_perf=model_performance_classification_sklearn(abc_tuned,X_train,y_train)
print(abc_tuned_model_train_perf)
abc_tuned_model_test_perf=model_performance_classification_sklearn(abc_tuned,X_test,y_test)
print(abc_tuned_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(abc_tuned,X_test,y_test)

* Adaboost performed much better after tuning
- 65.21 for F1 is much better than 44.7 on the original
- Precision for the tuned model lost a few points
- Recall was dramatically better and Accuracy was slightly better

### Adaboost feature importance ranked (tuned model)

In [None]:
importances = abc_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

### Gradient Boosting Classifier

In [None]:
gbc = GradientBoostingClassifier(random_state=1)
gbc.fit(X_train,y_train)

#### model performance on train set

In [None]:
confusion_matrix_sklearn(gbc, X_train, y_train)

In [None]:
gbc_model_train_perf = model_performance_classification_sklearn(
    gbc, X_train, y_train
)
gbc_model_train_perf

#### model performance on test set

In [None]:
confusion_matrix_sklearn(gbc, X_test, y_test)

In [None]:
gbc_model_test_perf = model_performance_classification_sklearn(
    gbc, X_test, y_test
)
gbc_model_test_perf

#### It can also be done this way with one set of code- good to summarize

In [None]:
#Fitting the model
gb_classifier = GradientBoostingClassifier(random_state=1)
gb_classifier.fit(X_train,y_train)

#Calculating different metrics
gb_classifier_model_train_perf=model_performance_classification_sklearn(gb_classifier,X_train,y_train)
print("Training performance:\n",gb_classifier_model_train_perf)
gb_classifier_model_test_perf=model_performance_classification_sklearn(gb_classifier,X_test,y_test)
print("Testing performance:\n",gb_classifier_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(gb_classifier,X_test,y_test)

* Gradient Boost has not overfitted, however the F1 score is low
* Recall is only .402174 on test 


### Gradient Boosting Classifier Tuned
* Most of the hyperparameters available are same as random forest classifier.
* init: An estimator object that is used to compute the initial predictions. If ‘zero’, the initial raw predictions are set to zero. By default, a DummyEstimator predicting the classes priors is used.
* There is no class_weights parameter in gradient boosting.

In [None]:
# Choose the type of classifier. 
gbc_tuned = GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)

# Grid of parameters to choose from
parameters = {
    "n_estimators": [100,150,200,250],
    "subsample":[0.8,0.9,1],
    "max_features":[0.7,0.8,0.9,1]
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)

In [None]:
#Calculating different metrics
gbc_tuned_model_train_perf=model_performance_classification_sklearn(gbc_tuned,X_train,y_train)
print("Training performance:\n",gbc_tuned_model_train_perf)
gbc_tuned_model_test_perf=model_performance_classification_sklearn(gbc_tuned,X_test,y_test)
print("Testing performance:\n",gbc_tuned_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(gbc_tuned,X_test,y_test)

* Gradient Boost with hyperparameters performed better in almost every category
* F1 score between Train and test is a more than we would like , however relative to other models tested this is relatively good
* Recall with hyper parameters is better but still with only marginal success


### Gradient Boosing Features Ranked

In [None]:
importances = gb_classifier.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

### XGBoost Classifier

In [None]:
#Fitting the model
xgb_classifier = XGBClassifier(random_state=1, eval_metric='logloss')
xgb_classifier.fit(X_train,y_train)

#Calculating different metrics
xgb_classifier_model_train_perf=model_performance_classification_sklearn(xgb_classifier,X_train,y_train)
print("Training performance:\n",xgb_classifier_model_train_perf)
xgb_classifier_model_test_perf=model_performance_classification_sklearn(xgb_classifier,X_test,y_test)
print("Testing performance:\n",xgb_classifier_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(xgb_classifier,X_test,y_test)

* XGBoost like the other models is overfitting but relating it to the other models this seems to be superior
* Recall reflects once again overfitting but the results are superior to the other models
* accuracy is good and precision is once again overfitting but superior to other models

### XGBoost Classifier
#### XGBoost has many hyper parameters which can be tuned to increase the model performance. You can read about them in the xgboost documentation here. Some of the important parameters are:

* scale_pos_weight:Control the balance of positive and negative weights, useful for unbalanced classes. It has range from 0 to 
∞
.
* subsample: Corresponds to the fraction of observations (the rows) to subsample at each step. By default it is set to 1 meaning that we use all rows.
* colsample_bytree: Corresponds to the fraction of features (the columns) to use.
* colsample_bylevel: The subsample ratio of columns for each level. Columns are subsampled from the set of columns chosen for the current tree.
* colsample_bynode: The subsample ratio of columns for each node (split). Columns are subsampled from the set of columns chosen for the current level.
* max_depth: is the maximum number of nodes allowed from the root to the farthest leaf of a tree.
* learning_rate/eta: Makes the model more robust by shrinking the weights on each step.
* gamma: A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.

In [None]:
# Choose the type of classifier. 
xgb_tuned = XGBClassifier(random_state=1,eval_metric='logloss')

# Grid of parameters to choose from
## add from
parameters = {
    "n_estimators": np.arange(10,100,20),
    "scale_pos_weight":[0,1,2,5],
    "subsample":[0.5,0.7,0.9,1],
    "learning_rate":[0.01,0.1,0.2,0.05],
    "gamma":[0,1,3],
    "colsample_bytree":[0.5,0.7,0.9,1],
    "colsample_bylevel":[0.5,0.7,0.9,1]
}

# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters,scoring=acc_scorer,cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)

In [None]:
#Calculating different metrics
xgb_tuned_model_train_perf=model_performance_classification_sklearn(xgb_tuned,X_train,y_train)
print("Training performance:\n",xgb_tuned_model_train_perf)
xgb_tuned_model_test_perf=model_performance_classification_sklearn(xgb_tuned,X_test,y_test)
print("Testing performance:\n",xgb_tuned_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(xgb_tuned,X_test,y_test)

* XGBoost preformed comaritively well - however the original model performed better than the tuned model. 
* Overall F1 score of 77.77 is encouraging 
* Accuracy of 92, Recall of 70.2 and Precision of 86.9 are all respectable and compared to other models run - this is certainly a respectable model.
* XGBoost does take longer to run but it appears to be worth the effort

### XGBoost Feature of Importances

In [None]:
importances = xgb_classifier.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()


### Stacking Claissifier

In [None]:
estimators = [('Random Forest',rf_tuned), ('Gradient Boosting',gbc_tuned), ('Decision Tree',dtree_estimator),
             ('AdaBoost', abc_tuned),('Bagging',bagging_estimator_tuned),('XGBoosting',xgb_classifier)]
# using xgb classifier instead of the tuned model due to performance
final_estimator = xgb_tuned

stacking_classifier= StackingClassifier(estimators=estimators,final_estimator=final_estimator)

stacking_classifier.fit(X_train,y_train)

In [None]:
estimators = [('Random Forest',rf_tuned), ('Gradient Boosting',gbc_tuned), ('Decision Tree',dtree_estimator),
             ('AdaBoost', abc_tuned),('Bagging',bagging_estimator_tuned),('XGBoosting',xgb_classifier)]
# using xgb classifier instead of the tuned model due to performance
final_estimator = xgb_classifier

stacking_classifier= StackingClassifier(estimators=estimators,final_estimator=final_estimator)

stacking_classifier.fit(X_train,y_train)

In [None]:
#Calculating different metrics
stacking_classifier_model_train_perf=model_performance_classification_sklearn(stacking_classifier,X_train,y_train)
print("Training performance:\n",stacking_classifier_model_train_perf)
stacking_classifier_model_test_perf=model_performance_classification_sklearn(stacking_classifier,X_test,y_test)
print("Testing performance:\n",stacking_classifier_model_test_perf)

#Creating confusion matrix
confusion_matrix_sklearn(stacking_classifier,X_test,y_test)

* Stacking Classifier overfitted but comparitively better than other models.
* F1 score is respectable comparitively speaking to the other models in comparison with the original F1 score. 
* Maintained a respectable precision, Recall and Accuracy balance ( thus the F1 score responded accordingly)

## Comparing all models

In [None]:
# training performance comparison

models_train_comp_df = pd.concat(
    [d_tree_model_train_perf.T,dtree_estimator_model_train_perf.T,rf_estimator_model_train_perf.T,rf_tuned_model_train_perf.T,
     bagging_classifier_model_train_perf.T,bagging_estimator_tuned_model_train_perf.T,ab_classifier_model_train_perf.T,
     abc_tuned_model_train_perf.T,gb_classifier_model_train_perf.T,gbc_tuned_model_train_perf.T,xgb_classifier_model_train_perf.T,
    xgb_tuned_model_train_perf.T,stacking_classifier_model_train_perf.T],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree",
    "Decision Tree Estimator",
    "Random Forest Estimator",
    "Random Forest Tuned",
    "Bagging Classifier",
    "Bagging Estimator Tuned",
    "Adaboost Classifier",
    "Adabosst Classifier Tuned",
    "Gradient Boost Classifier",
    "Gradient Boost Classifier Tuned",
    "XGBoost Classifier",
    "XGBoost Classifier Tuned",
    "Stacking Classifier"]
print("Training performance comparison:")
models_train_comp_df

In [None]:
# testing performance comparison

models_test_comp_df = pd.concat(
    [d_tree_model_test_perf.T,dtree_estimator_model_test_perf.T,rf_estimator_model_test_perf.T,rf_tuned_model_test_perf.T,
     bagging_classifier_model_test_perf.T,bagging_estimator_tuned_model_test_perf.T,ab_classifier_model_test_perf.T,
     abc_tuned_model_test_perf.T,gb_classifier_model_test_perf.T,gbc_tuned_model_test_perf.T,xgb_classifier_model_test_perf.T,
    xgb_tuned_model_test_perf.T,stacking_classifier_model_test_perf.T],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree",
    "Decision Tree Estimator",
    "Random Forest Estimator",
    "Random Forest Tuned",
    "Bagging Classifier",
    "Bagging Estimator Tuned",
    "Adaboost Classifier",
    "Adabosst Classifier Tuned",
    "Gradient Boost Classifier",
    "Gradient Boost Classifier Tuned",
    "XGBoost Classifier",
    "XGBoost Classifier Tuned",
    "Stacking Classifier"] 
print("Testing performance comparison:")
models_test_comp_df

#### Used XGBoost classifier tuned since it has the best F1 score combined with being the final estimator for the stacking classifier which had the best overall results.
- Stacking reflects the overfitting but handled it best
- overall performance on stacking was better and provided closer variances


In [None]:
importances = xgb_classifier.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

* Feature of important variables
- Having had the basic product pitched 
- Large business is the employer
- Tier 3 city closely followed by Tier 1 city
- Unmarried, or Divorced

#### Other features to consider strongly due to performance in other models
- Age and Monthly income

#### Actionable Insights and Recommendations
1. Stacking performed well , but in the event that a single classifier was to be used for computational efficiency then
i would recommend using XGBOOST_classifier (original) or the Bagging Estimator Tuned (if computer efficiency was a concern) due to overall performance.
2. Recommend the marketing campaign to consider their potential clients to be :
     
    a. Age- close to 37
    b. Montly income is around 23619.86
    c. employed by a large business, but I would also consider executives 
    d. lives in a Tier 3 city or Tier 1 
    e. they do not have a passport
    f. They are single
    
I would market the product similar to basic product's marketing or use the basic product to open the door for potential up selling. 

Data integrity would be a emphasis , there were a lot of missing observations, if better recording was done then the model 
might perform slightly better and may reduce the over fitting. 

Since it is cheaper to advertise in a Tier 3 city, I would consider target marketing in those cities

In EDA it became evident that there is room for improvement on the sales pitch. I would have my team work on trying to obtain better Pitch Satisfaction scores and utilize a survey to see if improvement from the first effort has improved. 

- Finally - Following up has shown to be beneficial. Its not good enough to perform the marketing but the data reflects the importance of following up with each.