# Predicting Catalog Demand

# Step 1: Business and Data Understanding

- A description of the key business decisions that need to be made.

Note: Clean data is provided for this project, so you can skip the data preparation step of the Problem Solving Framework.

In [None]:
# Import modules
import pandas as pd
import matplotlib.pyplot as plt

# plt.style.use('seaborn-whitegrid')
plt.rcParams['figure.figsize'] = [11, 7]

In [None]:
PRINT_COST = 6.50
GROSS_MARGIN = 0.5
TOTAL_MAIL = 250

# Reading .xlsx files with pandas requires the openpyxl engine (pip install openpyxl)
old_customers_data = pd.read_excel('p1-customers.xlsx')
mailinglist_data = pd.read_excel('p1-mailinglist.xlsx')

## Customers Data

'Name', 'Customer_Segment', 'Customer_ID', 'Address', 'City', 'State', 'ZIP', 'Avg_Sale_Amount', 'Store_Number', 'Responded_to_Last_Catalog', 'Avg_Num_Products_Purchased', '#_Years_as_Customer'

## Mailing List Data

'Name', 'Customer_Segment', 'Customer_ID', 'Address', 'City', 'State', 'ZIP', 'Store_Number', 'Avg_Num_Products_Purchased', '#_Years_as_Customer', 'Score_No', 'Score_Yes'

- Target Variable: Avg_Sale_Amount

- Unused Variable: Responded_to_Last_Catalog 

In [None]:
# Drop Responded_to_Last_Catalog because it is not present in the mailing list data
old_customers_data = old_customers_data.drop(columns='Responded_to_Last_Catalog')

In [None]:
# Category Data

# Customer_Segment
customer_segment = pd.get_dummies(old_customers_data['Customer_Segment'], prefix='Customer')
# City
city = pd.get_dummies(old_customers_data['City'], prefix='City')
# State -> the data contains a single state, CO (not used in the model)
# ZIP -> has 86 unique values (not used in the model)
# Store_Number
store_number = pd.get_dummies(old_customers_data['Store_Number'], prefix='Store_ID')
# Join Categories to Dataframe
old_customers_data_with_categories = old_customers_data.join([customer_segment, city, store_number])

# Step 2: Analysis, Modeling, and Validation

Build a linear regression model, then use it to predict sales for the 250 customers on the mailing list. We encourage you to use Alteryx to build the best linear model with your data, but feel free to use any tool you’d like. Either way, you should create a linear regression model and come up with a linear regression equation.

Once you have the equation, use it to predict sales for each individual on your mailing list.

Note: For students using software other than Alteryx, if you decide to use Customer Segment as one of your predictor variables, please set the base case to Credit Card Only.
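
The base-case note above can be followed in pandas by dropping the dummy column for Credit Card Only, so that the remaining segment coefficients are measured relative to it. The sketch below is one possible way to do this on toy data; only the label `Credit Card Only` comes from the text, and the other segment labels are placeholders.

```python
import pandas as pd

# Toy data: "Segment B" and "Segment C" are placeholder labels, not real segments.
toy = pd.DataFrame({
    "Customer_Segment": ["Credit Card Only", "Segment B", "Segment C", "Credit Card Only"],
})

dummies = pd.get_dummies(toy["Customer_Segment"], prefix="Customer")

# Dropping the base-case column means each remaining coefficient is read
# as a difference relative to "Credit Card Only".
dummies = dummies.drop(columns="Customer_Credit Card Only")
print(dummies.columns.tolist())
```

Dropping one level per categorical variable also removes the exact linear dependence among the dummy columns, which makes the individual coefficients easier to interpret.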

In [None]:
# Drop columns: 
# 'Name', 'Customer_Segment', 'Customer_ID', 'Address', 'City', 'State', 'ZIP', 'Store_Number'
cleaned_data = old_customers_data_with_categories.drop(columns=['Name', 'Customer_Segment', 'Customer_ID', 'Address', 'City', 'State', 'ZIP', 'Store_Number'])

Y = cleaned_data['Avg_Sale_Amount']
X = cleaned_data.drop(columns='Avg_Sale_Amount')


In [None]:
# Train Model
from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(X, Y)
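
To score the mailing list with a fitted model, its feature columns need to match the training columns in name and order. The mailing list may lack some dummy columns that appeared in training and carries extra columns such as `Score_Yes`. A minimal sketch of one way to align them, assuming toy data with made-up values and hypothetical city names:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the encoded training data (all values are made up).
train = pd.DataFrame({
    "Avg_Num_Products_Purchased": [1, 2, 3, 4, 5, 6],
    "City_Denver": [1, 0, 1, 0, 1, 0],
    "City_Aurora": [0, 1, 0, 1, 0, 1],
    "Avg_Sale_Amount": [100.0, 160.0, 300.0, 360.0, 500.0, 560.0],
})
X_train = train.drop(columns="Avg_Sale_Amount")
y_train = train["Avg_Sale_Amount"]
model = LinearRegression().fit(X_train, y_train)

# Toy mailing list: no City_Aurora column, plus an extra Score_Yes column.
mailing = pd.DataFrame({
    "Avg_Num_Products_Purchased": [2, 4],
    "City_Denver": [1, 1],
    "Score_Yes": [0.3, 0.6],
})

# Reindex to the training columns: missing dummies become 0, extras are dropped.
X_mail = mailing.reindex(columns=X_train.columns, fill_value=0)
mailing["Predicted_Sale_Amount"] = model.predict(X_mail)
print(mailing)
```

The column name `Predicted_Sale_Amount` is an illustrative choice, not something defined in the project data.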

In [None]:
regr.coef_
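
The raw `coef_` array is hard to read on its own because it carries no feature names. As a sketch, the coefficients can be paired with the column names and combined with `intercept_` to write out the regression equation. The example uses toy data generated from a known line, so the variable names here are illustrative only.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on the plane y = 10 + 2*a + 3*b.
X_toy = pd.DataFrame({"a": [0, 1, 0, 1, 2], "b": [0, 0, 1, 1, 3]})
y_toy = 10 + 2 * X_toy["a"] + 3 * X_toy["b"]

model = LinearRegression().fit(X_toy, y_toy)

# Label each coefficient with its feature name.
coefs = pd.Series(model.coef_, index=X_toy.columns)

# Assemble a readable equation string from the intercept and coefficients.
terms = " + ".join(f"{c:.2f}*{name}" for name, c in coefs.items())
equation = f"y = {model.intercept_:.2f} + {terms}"
print(equation)
```

The same pattern applies to the notebook's model by using `regr` in place of `model` and `X.columns` as the index.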

# Step 3: Writeup

Once you have your predicted or expected profit, write a brief report with your recommendation on whether the company should send the catalog or not.

Hint: We want to calculate the expected revenue from these 250 people in order to get expected profit. This means we also need to multiply by the probability that a person will buy from our catalog. For example, if a customer were to buy from us, we predict this customer will buy $450 worth of products. At a 30% chance that this person will actually buy from us, we can expect revenue to be $450 x 30% = $135.
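
The hint's arithmetic can be sketched on a toy mailing list. The first row is the $450 and 30% example from the hint; the other rows are made up. The sketch assumes that `Score_Yes` is the purchase probability, that `PRINT_COST` is the cost per catalog, and that profit is the gross margin on expected revenue minus the cost of every catalog mailed. None of these assumptions is stated explicitly in the notebook.

```python
import pandas as pd

PRINT_COST = 6.50    # assumed to be the cost per catalog mailed
GROSS_MARGIN = 0.5   # assumed to be the fraction of revenue kept as gross profit

# Toy mailing list; Predicted_Sale_Amount is a hypothetical column name.
toy = pd.DataFrame({
    "Predicted_Sale_Amount": [450.0, 200.0, 100.0],
    "Score_Yes": [0.30, 0.50, 0.10],
})

# Expected revenue per person = predicted sale x probability of buying.
toy["Expected_Revenue"] = toy["Predicted_Sale_Amount"] * toy["Score_Yes"]

# Assumed profit formula: margin on total expected revenue minus catalog costs.
expected_profit = toy["Expected_Revenue"].sum() * GROSS_MARGIN - PRINT_COST * len(toy)
print(toy)
print(expected_profit)
```

On the real mailing list, `len(toy)` would correspond to the 250 catalogs (`TOTAL_MAIL`).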