This repository contains code and datasets for a machine learning project that predicts customer responses to catalog mailers. It is designed to help a direct marketing firm identify and target profitable customers.
A direct marketing firm distributes catalogs to its customer base of about 5 million households. The objective of this project is to improve the firm's performance by predicting which customers are likely to respond and place profitable orders, so that the printing and mailing costs are justified.
The dmtrain.csv dataset includes information about 2,000 customers from the firm's last mailing campaign. It includes the following variables:
- id: Customer ID
- n24: Number of orders in the last 24 months
- rev24: Total order amount ($) in the last 24 months
- revlast: Amount of last order ($)
- elpsdm: Time elapsed since last order (months)
- ordfreq: Order frequency over the last 24 months
- ordcat: Order amount category
- response: 1 indicates the customer responded, 0 indicates no response
The broad objective is to classify customers who are likely to respond to mailers (i.e., response is the dependent variable). The analysis proceeds through the following steps:
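
As a minimal sketch of what treating response as the dependent variable could look like, the snippet below separates the predictors from the label. It assumes Python with pandas, which this README does not specify; the helper name `split_features_target` is hypothetical, and the column names are taken from the variable list above.

```python
import pandas as pd

# Column names follow the variable list in this README.
TARGET = "response"
ID_COL = "id"


def split_features_target(df: pd.DataFrame):
    """Return (X, y): the predictor columns and the binary response label."""
    y = df[TARGET].astype(int)
    # The customer ID is an identifier, not a predictor, so it is dropped too.
    X = df.drop(columns=[ID_COL, TARGET])
    return X, y


# Usage (assumes dmtrain.csv is in the working directory):
# df = pd.read_csv("dmtrain.csv")
# X, y = split_features_target(df)
```

If ordcat is stored as text rather than as a numeric code, it would need to be encoded before modeling.
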
- Read in the data and review the non-binary variables to see if any are skewed and need to be transformed.
- Generate a decision tree on the entire dataset, without any limitations on the depth of the tree. Use entropy as the metric.
- Identify the best decision tree classifier by pruning the tree at different depths.
- Develop random forest classifiers with 100 trees, using the three best values of tree depth identified in the previous step.
- Repeat this experiment with 50 trees.
- Identify the best value of k for k-nearest neighbor models using 10-fold cross-validation.
- Develop a logistic regression model using 10-fold cross-validation.
- Develop a logistic regression model on the entire training dataset.
- Perform an evaluation with 10-fold cross-validation on the four best models identified (decision tree, random forest, k-nearest neighbor, logistic regression).
- Use the entire dataset to develop a final version of the recommended model for testing.
- Use the final model to make predictions on new data from dmtest.csv.
- Using records in the training set, compare the quality of predictions for "lapsing customers" with the quality of predictions for all other customers.
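
Several of the steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: it assumes Python with scikit-learn, numeric predictors, accuracy as the evaluation metric, and cross-validation as the way to rank tree depths, none of which this README specifies. The helper names, the skewness threshold of 1.0, and the default k = 5 are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def log_transform_skewed(X, threshold=1.0):
    """log1p-transform non-binary, non-negative columns with |skew| > threshold."""
    X = X.copy()
    skew = X.skew(numeric_only=True)
    for col in skew.index[skew.abs() > threshold]:
        if X[col].nunique() > 2 and (X[col] >= 0).all():
            X[col] = np.log1p(X[col])
    return X


def best_tree_depths(X, y, depths=range(1, 11), cv=None, top=3):
    """Rank max_depth values for an entropy tree by mean CV accuracy."""
    cv = cv or StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = {}
    for d in depths:
        tree = DecisionTreeClassifier(criterion="entropy", max_depth=d, random_state=0)
        scores[d] = cross_val_score(tree, X, y, cv=cv).mean()
    return sorted(scores, key=scores.get, reverse=True)[:top]


def compare_models(X, y, depth, k=5, n_trees=100):
    """Mean 10-fold CV accuracy for the four model families in the steps above."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    models = {
        "decision_tree": DecisionTreeClassifier(
            criterion="entropy", max_depth=depth, random_state=0),
        "random_forest": RandomForestClassifier(
            n_estimators=n_trees, max_depth=depth, random_state=0),
        # Distance-based and linear models are scaled inside the pipeline,
        # so the scaler is refit on each training fold.
        "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k)),
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)),
    }
    return {name: cross_val_score(m, X, y, cv=cv).mean() for name, m in models.items()}
```

Given predictors X and labels y from dmtrain.csv, one possible sequence is `X = log_transform_skewed(X)`, then `depths = best_tree_depths(X, y)`, then `compare_models(X, y, depth=depths[0])`. The recommended model would then be refit on the full training set before predicting on dmtest.csv.
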