<a href="https://colab.research.google.com/github/Stan-Pugsley/graph_demo/blob/main/Assignments/Scripts/AdviseInvest_FullScript_v1_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AdviseInvest

## Outline

Our Goals with this Project:

1. Import and review the data
2. Perform EDA and Clean Data
3. Fit a model
4. Test the accuracy of the model
5. Use the model to predict on a new dataset (without the target)

## AdviseInvest Data Dictionary

Variable | Description | Type | Code
---- | ------- | ------ | -----
Answered | Customer response | Binary | 0: customer did not answer scheduled call; 1: customer answered scheduled call
Income | Customer income in US dollars | Numeric |
Female | Customer gender | Binary | 0: female; 1: male
Age | Age in years | Numeric |
Job | Nature of job | Categorical | 0: unemployed; 1: entry level position; 2: midlevel position; 3: management/ self-employed/ highly qualified employee/ officer
Num_dependents | Number of people for whom the customer provides maintenance | Numeric |
Rent | Customer rents | Binary | 0: no; 1: yes
Own_res | Customer owns residence | Binary | 0: no; 1: yes
New_car | Recent new car purchase | Binary | New car purchase in the last 3 months: 0: no, 1: yes
Chk_acct | Checking account status | Categorical | 0: no checking account; 1: checking < 200 USD; 2: 200 < checking < 2000 USD; 3: 2000 < checking < 35000 USD; 4: >= 3500 USD
Sav_acct | Average balance in savings account | Categorical | 0: no savings account; 1: 100 <= savings < 500 USD; 2: 500 <= savings < 2000 USD; 3: 2000 < savings < 35000 USD; 4: >= 3500 USD
Num_accts | Number of accounts owned by customer | Numeric |
Mobile | Mobile phone | Binary | 0: customer provided non-mobile phone for follow-up call; 1: customer provided mobile phone for follow-up call
Product | Type of product purchased after conversation with sales rep | Categorical | 0: customer did not answer call; 1: customer answered but did not purchase a product; 2: customer answered and purchased Beginner plan; 3: customer answered and purchased Intermediate plan; 4: customer answered and purchased Advanced plan



## Load Libraries

This notebook uses:
- pandas
- scikit-learn
- Matplotlib
- seaborn


In [None]:
import pandas as pd
import matplotlib as mpl
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz # Import Decision Tree Classifier
from sklearn import metrics  #Import scikit-learn metrics module for accuracy calculation
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt


## Import Data into Dataframe

 - Import data from the AdviseInvest dataset into a dataframe (in GitHub go to Assignments > DataSets)
 - Describe or profile the dataframe


In [None]:
df = pd.read_csv('https://github.com/Stan-Pugsley/is_4487_base/blob/main/Assignments/DataSets/adviseinvest.csv?raw=true')
print(df)

In [None]:
df.info()

In [None]:
df.describe()

## Clean up the data
- Remove the product variable. It is recorded only after the call is answered, so it is not available when predicting whether a customer will answer.
- Put the cleaned data in a new dataframe named "df_clean"
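
As a minimal sketch of these cleaning steps, the toy frame below applies the same pattern: filter rows with a boolean mask, drop a column, then drop rows with missing values. The thresholds mirror the cell that follows; the frame `toy` and its values are invented for illustration and are not the AdviseInvest data.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "income": [52000, -100, 61000, 47000],
    "num_accts": [2, 1, 7, 3],
    "product": [0, 2, 3, 1],
    "age": [34, 41, 29, None],
})

# Keep rows with positive income and fewer than 5 accounts.
mask = (toy["income"] > 0) & (toy["num_accts"] < 5)
toy_clean = toy[mask]

# Drop the post-call column, then any rows that still have missing values.
toy_clean = toy_clean.drop(columns="product").dropna()

print(toy_clean)
```

Only the first row survives here: the second fails the income filter, the third fails the account filter, and the fourth has a missing age.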


In [None]:
#delete rows with outlier data; put it in a new dataframe
df_clean = df[(df['income'] > 0) & (df['num_accts'] < 5) ]

#remove product
df_clean = df_clean.drop('product', axis=1)

#delete any rows with missing values in the clean dataframe
df_clean = df_clean.dropna()

df_clean.describe()

## Standardize attributes

 - Change answered to yes/no categorical
 - Convert new_car to integer

In [None]:
# Create the new variable 'answered_cat' based on the values in 'answered'
df_clean['answered_cat'] = df_clean['answered'].astype('str')

# Replace values
df_clean['answered_cat'] = df_clean['answered_cat'].replace('0', 'no')
df_clean['answered_cat'] = df_clean['answered_cat'].replace('1', 'yes')

df_clean['new_car'] = df_clean['new_car'].astype('int')

df_clean.head(10)

## Convert attributes to categorical

- female
- job
- rent
- own_res
- new_car
- mobile
- chk_acct
- sav_acct

Create a new categorical variable for answered
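
The conversion can also be sketched on a small hypothetical frame. This version maps the 0/1 target to labels with `map` and converts the coded columns in a loop; it is one possible way to write the step, and the frame `toy` and the list `coded_cols` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data standing in for df_clean.
toy = pd.DataFrame({
    "answered": [0, 1, 1, 0],
    "job": [0, 2, 3, 1],
    "mobile": [1, 0, 1, 1],
})

# Map the 0/1 target to readable labels, then store it as a category.
toy["answered_cat"] = toy["answered"].map({0: "no", 1: "yes"}).astype("category")

# Convert each coded column to the category dtype.
coded_cols = ["job", "mobile"]
for col in coded_cols:
    toy[col] = toy[col].astype("category")

print(toy.dtypes)
```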

In [None]:
df_clean['female'] = df_clean['female'].astype('category')
df_clean['job'] = df_clean['job'].astype('category')
df_clean['rent'] = df_clean['rent'].astype('category')
df_clean['own_res'] = df_clean['own_res'].astype('category')
df_clean['new_car'] = df_clean['new_car'].astype('category')
df_clean['mobile'] = df_clean['mobile'].astype('category')
df_clean['chk_acct'] = df_clean['chk_acct'].astype('category')
df_clean['sav_acct'] = df_clean['sav_acct'].astype('category')
# Keep the yes/no labels created earlier; only change the dtype
df_clean['answered_cat'] = df_clean['answered_cat'].astype('category')

df_clean.info()

## What is the base probability of answering?

If we use no model at all, how good is our chance of predicting whether someone answers?
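
With no model, the best constant guess is the more common class, so the accuracy to beat is the larger of p and 1 - p, where p is the share of customers who answered. A minimal sketch in plain Python follows; the helper name `majority_baseline_accuracy` and the example labels are hypothetical.

```python
def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common 0/1 label."""
    p = sum(labels) / len(labels)  # share of 1s (answered)
    return max(p, 1 - p)

# Hypothetical labels: 3 of 5 customers answered.
example = [1, 0, 1, 1, 0]
print(majority_baseline_accuracy(example))
```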

In [None]:
df_clean['answered'].mean()

## Split the training and testing datasets

- split df_clean using the train_test_split function
- all variables except answered (and answered_cat) should be in the X variable
- answered is in the y variable
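
To see what the split returns, here is a small sketch on hypothetical data (`X_toy` and `y_toy` are invented). With `test_size=0.2`, two of the ten rows are held out, and a fixed `random_state` makes the split repeatable.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one feature, alternating labels.
X_toy = [[i] for i in range(10)]
y_toy = ["no", "yes"] * 5

# Hold out 20% of rows for testing; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=59
)

print(len(X_tr), len(X_te))
```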


In [None]:
y = df_clean['answered_cat']
X = df_clean.drop(['answered','answered_cat'], axis=1)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=59)

## Fit a basic tree model

Use all available attributes, except product
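
Before fitting on the real data, it may help to see the same classifier on a tiny hypothetical dataset where the answer is obvious. The sketch below uses `export_text` from scikit-learn to print the learned rule; the data and the feature name `x` are invented for the example.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: one feature that separates the classes cleanly.
X_toy = [[0], [1], [2], [3]]
y_toy = ["no", "no", "yes", "yes"]

clf = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
clf.fit(X_toy, y_toy)

# Print the learned split rules as text.
print(export_text(clf, feature_names=["x"]))

# Predict labels for two new points.
pred = clf.predict([[0.2], [2.8]])
print(pred)
```

`max_depth` is only an upper limit: this toy tree needs a single split, so it stops at depth 1.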

In [None]:
# Create the decision tree classifier
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4)

# Fit the tree on the training data
tree = tree.fit(X_train, y_train)

# Use the tree to predict "answered" on the test data
y_predict = tree.predict(X_test)

## What is the accuracy?

Is it better than the 54.6% base probability?

In [None]:
print("Accuracy:",metrics.accuracy_score(y_test, y_predict))

## Create a confusion matrix

This will show false positives, true positives, etc.
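
Reading the matrix correctly depends on its layout. In scikit-learn's `confusion_matrix`, rows are the actual classes and columns are the predicted classes, in sorted label order unless `labels` is given. The sketch below uses hypothetical labels to show which cell is which.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels.
y_true = ["no", "no", "no", "yes", "yes", "yes", "yes"]
y_pred = ["no", "no", "yes", "yes", "yes", "yes", "no"]

# Passing labels explicitly fixes the row/column order: "no" first, then "yes".
cm = confusion_matrix(y_true, y_pred, labels=["no", "yes"])
print(cm)

# For a 2x2 matrix, ravel() reads the cells row by row:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```

With "yes" as the positive class, the true positives sit in the bottom-right cell, not the top-left.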

In [None]:
# create a confusion matrix
tree_matrix = confusion_matrix(y_test, y_predict)
print(tree_matrix)

Create a more interpretable version of the matrix

In [None]:
cm = sns.heatmap(tree_matrix, annot=True, fmt="d") #fmt will make sure the numbers are formatted as integers
cm.set_title('Confusion Matrix')
plt.ylabel('Observed (Actual)')
plt.xlabel('Predicted')
# confusion_matrix sorts the labels, so "no" (0) comes before "yes" (1)
cm.xaxis.set_ticklabels(['No','Yes'])
cm.yaxis.set_ticklabels(['No','Yes'])
plt.show()

## Calculate Profit

One of the simplifying assumptions we will make in this project is that all the customers who answer the phone will purchase a product. (This assumption is actually verified by the data.) To model "answered" in this case is therefore equivalent to modeling "purchased."

There are costs and benefits in this case. We will assume that customers purchase a product for \$100. This was the average price of AdviseInvest products, according to the Director of Sales. Also, as we learned in the interview, the agent time to make the sale is worth \$25. Profit would therefore be \$75 for an answered call and a purchase. In sum:

**Benefit**: True positive. The customer is predicted to answer, does answer, and purchases a product for \$100 for a profit of 100 - 25 = \$75.

**Cost**: False positive. The customer is predicted to answer, but does not answer, so there is a loss of \$25. (We assume the agent cannot schedule another call at the last minute, or spends the entire time slot trying to make the call.)

For this exercise, we propose that customers who are not predicted to answer will not be called, so there would be no benefits and no costs for them.  
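
The profit rule described above reduces to one line of arithmetic. A minimal sketch follows; the function name `model_profit`, the constant names, and the example counts are hypothetical, while the \$75 and \$25 figures come from the text.

```python
PROFIT_PER_SALE = 75  # $100 price minus $25 of agent time
COST_PER_MISS = 25    # agent time lost on an unanswered call

def model_profit(true_pos, false_pos):
    """Profit when only customers predicted to answer are called."""
    return true_pos * PROFIT_PER_SALE - false_pos * COST_PER_MISS

# Hypothetical counts, not taken from the notebook's results.
print(model_profit(true_pos=10, false_pos=4))
```

True negatives and false negatives do not appear because those customers are never called.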

In [None]:
# True positives * 75 -> predicted to answer, answered, and purchased
# False positives * 25 -> predicted to answer but did not; the agent's time slot is lost
(2218 * 75) - (1130 * 25)


## Default Profit

How much profit (revenue - costs) could be expected if all customers are called? We can consider this a baseline case for profit since it does not require a model.

In other words, to calculate profit in this baseline scenario, treat the customers who answer as true positives and the customers who do not answer as false positives.
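
The baseline and the model can be compared from the same four confusion-matrix counts. The sketch below assumes the \$75 gain and \$25 loss from the previous section; the function names and the counts are hypothetical.

```python
def call_everyone_profit(n_answer, n_no_answer, gain=75, loss=25):
    """Profit if every customer is called, with no model."""
    return n_answer * gain - n_no_answer * loss

def model_profit(true_pos, false_pos, gain=75, loss=25):
    """Profit if only customers predicted to answer are called."""
    return true_pos * gain - false_pos * loss

# Hypothetical confusion-matrix counts.
tp, fp, fn, tn = 10, 2, 3, 10

# Everyone who would answer is tp + fn; everyone who would not is fp + tn.
baseline = call_everyone_profit(n_answer=tp + fn, n_no_answer=fp + tn)
with_model = model_profit(tp, fp)
print(baseline, with_model, with_model - baseline)
```

The difference works out to 25 × TN - 75 × FN: the model saves \$25 for every correctly skipped call and gives up \$75 for every customer it wrongly skips.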

In [None]:
((2218+459) * 75) - ((1130+2093) * 25)

## Did we improve our profit using the model?