{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bank Marketing Dataset\n",
    "- The [Bank Marketing Dataset](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) contains a reasonable large number of data related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The goal is to predict if the client will subscribe a term deposit.\n",
    "- It is a fairly large dataset with 41K+ rows, a mixture of categorical and continuous columns as well as data imperfections to identify and manage.\n",
    "\n",
    "## Dataset\n",
    "The data has the following columns\n",
    "\n",
    "\n",
    "\n",
    "Bank client data:\n",
    "\n",
    "|col num | col name | description |\n",
    "|:---|:---|:---|\n",
    "| 1 | age | (numeric) | \n",
    "| 2 | job | type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown') |\n",
    "| 3 | marital | marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed) |\n",
    "| 4 | education | (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown') |\n",
    "| 5 | default | has credit in default? (categorical: 'no','yes','unknown') |\n",
    "| 6 | housing | has housing loan? (categorical: 'no','yes','unknown') |\n",
    "| 7 | loan | has personal loan? (categorical: 'no','yes','unknown') |\n",
    "\n",
    "Related with the last contact of the current campaign:\n",
    "\n",
    "|col num | col name | description |\n",
    "|:---|:---|:---|\n",
    "| 8 | contact | contact communication type (categorical: 'cellular','telephone') |\n",
    "| 9 | month | last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec') |\n",
    "| 10 | day_of_week | last contact day of the week (categorical: 'mon','tue','wed','thu','fri') |\n",
    "\n",
    "\n",
    "Other attributes:\n",
    "\n",
    "|col num | col name | description |\n",
    "|:---|:---|:---|\n",
    "| 11 | campaign | number of contacts performed during this campaign and for this client (numeric, includes last contact) |\n",
    "| 12 | pdays | number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted) |\n",
    "| 13 | previous | number of contacts performed before this campaign and for this client (numeric) |\n",
    "| 14 | poutcome | outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success') |\n",
    "\n",
    "Social and economic context attributes:\n",
    "\n",
    "|col num | col name | description |\n",
    "|:---|:---|:---|\n",
    "| 15 | emp.var.rate | employment variation rate - quarterly indicator (numeric) |\n",
    "| 16 | cons.price.idx | consumer price index - monthly indicator (numeric) |\n",
    "| 17 | cons.conf.idx | consumer confidence index - monthly indicator (numeric) |\n",
    "| 18 | euribor3m | euribor 3 month rate - daily indicator (numeric) |\n",
    "| 19 | nr.employed | number of employees - quarterly indicator (numeric) |\n",
    "\n",
    "Output variable (desired target):\n",
    "\n",
    "|col num | col name | description |\n",
    "|:---|:---|:---|\n",
    "| 20 | y | This is the target column. Has the client subscribed a term deposit? (binary: 'yes','no') |\n",
    "\n",
    "## Goal\n",
    "The goal of this project is \n",
    "1. Build and Tune the hyperparameters of a Sklearn model to predict the target column `y` using AWS Sagemaker \n",
    "1. Deploy the model as a `Serverless Inference Endpoint` and test it\n",
    "1. Run `Batch Transform` on the entire input dataset\n",
    "1. Calculate the performance of the model predictions on the entire input dataset\n",
    "\n",
    "## Recommended Steps\n",
    "1. **Data Exploration:** Understand the data by looking at distributions and unique values in the columns. Are there any issues with the data?\n",
    "1. **Data Cleaning:** Handle any issues you found with the data.\n",
    "1. **Feature Engineering:** Handle the various datatypes by applying the appropriate feature engineering techniques\n",
    "1. **Model Selection:** Choose an appropriate sklearn model for this problem and implement the sagemaker model training code\n",
    "1. **Hyperparameter tuning:** Choose appropriate hyperparameter ranges and objective metric for the chosen model and implement the sagemaker hyperparameter tuning code\n",
    "1. **Model training:** Submit the hyperparameter tuning job to sagemaker and monitor the execution progress\n",
    "1. **Model deployment as severless inference:** Pick the best model from hyperparameter tuning, deploy it as a sagemaker serverless inference endpoint and test if it works by posting some sample data to it\n",
    "1. **Batch transform:** Store the input dataset to a json lines file, deploy the model as a batch transform and run the batch transform job on the input json lines file.\n",
    "1. **Performance calculation:** Calculate model performance on the entire input dataset using output of the batch transform job.\n",
    "\n",
    "## Tips\n",
    "- You can use the below code to get the S3 bucket to write any artifacts to\n",
    "    ```\n",
    "    import sagemaker\n",
    "    session = sagemaker.Session()\n",
    "    bucket = session.default_bucket()\n",
    "    ```\n",
    "- Are all the columns necessary or can we drop any?\n",
    "- Does the data contain any issues?\n",
    "- What ML task is this? Classification? Regression? Clustering?\n",
    "- What are the data types of the columns? What pre-processing should you apply?\n",
    "- What is the most appropriate metric for this model?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(41188, 20)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>job</th>\n",
       "      <th>marital</th>\n",
       "      <th>education</th>\n",
       "      <th>default</th>\n",
       "      <th>housing</th>\n",
       "      <th>loan</th>\n",
       "      <th>contact</th>\n",
       "      <th>month</th>\n",
       "      <th>day_of_week</th>\n",
       "      <th>campaign</th>\n",
       "      <th>pdays</th>\n",
       "      <th>previous</th>\n",
       "      <th>poutcome</th>\n",
       "      <th>emp.var.rate</th>\n",
       "      <th>cons.price.idx</th>\n",
       "      <th>cons.conf.idx</th>\n",
       "      <th>euribor3m</th>\n",
       "      <th>nr.employed</th>\n",
       "      <th>y</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>56.0</td>\n",
       "      <td>housemaid</td>\n",
       "      <td>married</td>\n",
       "      <td>basic.4y</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>telephone</td>\n",
       "      <td>may</td>\n",
       "      <td>mon</td>\n",
       "      <td>1.0</td>\n",
       "      <td>999.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>1.1</td>\n",
       "      <td>93.994</td>\n",
       "      <td>-36.4</td>\n",
       "      <td>4.857</td>\n",
       "      <td>5191.0</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>57.0</td>\n",
       "      <td>services</td>\n",
       "      <td>married</td>\n",
       "      <td>high.school</td>\n",
       "      <td>unknown</td>\n",
       "      <td>no</td>\n",
       "      <td>NaN</td>\n",
       "      <td>telephone</td>\n",
       "      <td>may</td>\n",
       "      <td>mon</td>\n",
       "      <td>1.0</td>\n",
       "      <td>999.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>1.1</td>\n",
       "      <td>93.994</td>\n",
       "      <td>-36.4</td>\n",
       "      <td>4.857</td>\n",
       "      <td>5191.0</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>37.0</td>\n",
       "      <td>services</td>\n",
       "      <td>married</td>\n",
       "      <td>high.school</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>telephone</td>\n",
       "      <td>may</td>\n",
       "      <td>mon</td>\n",
       "      <td>1.0</td>\n",
       "      <td>999.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>1.1</td>\n",
       "      <td>93.994</td>\n",
       "      <td>-36.4</td>\n",
       "      <td>4.857</td>\n",
       "      <td>5191.0</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>40.0</td>\n",
       "      <td>admin.</td>\n",
       "      <td>married</td>\n",
       "      <td>basic.6y</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>telephone</td>\n",
       "      <td>may</td>\n",
       "      <td>mon</td>\n",
       "      <td>1.0</td>\n",
       "      <td>999.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>1.1</td>\n",
       "      <td>93.994</td>\n",
       "      <td>-36.4</td>\n",
       "      <td>4.857</td>\n",
       "      <td>5191.0</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>56.0</td>\n",
       "      <td>services</td>\n",
       "      <td>married</td>\n",
       "      <td>high.school</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "      <td>NaN</td>\n",
       "      <td>may</td>\n",
       "      <td>mon</td>\n",
       "      <td>1.0</td>\n",
       "      <td>999.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>nonexistent</td>\n",
       "      <td>1.1</td>\n",
       "      <td>93.994</td>\n",
       "      <td>-36.4</td>\n",
       "      <td>4.857</td>\n",
       "      <td>5191.0</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    age        job  marital    education  default housing loan    contact  \\\n",
       "0  56.0  housemaid  married     basic.4y       no      no   no  telephone   \n",
       "1  57.0   services  married  high.school  unknown      no  NaN  telephone   \n",
       "2  37.0   services  married  high.school       no     yes   no  telephone   \n",
       "3  40.0     admin.  married     basic.6y       no      no   no  telephone   \n",
       "4  56.0   services  married  high.school       no      no  yes        NaN   \n",
       "\n",
       "  month day_of_week  campaign  pdays  previous     poutcome  emp.var.rate  \\\n",
       "0   may         mon       1.0  999.0       0.0  nonexistent           1.1   \n",
       "1   may         mon       1.0  999.0       0.0  nonexistent           1.1   \n",
       "2   may         mon       1.0  999.0       0.0  nonexistent           1.1   \n",
       "3   may         mon       1.0  999.0       0.0  nonexistent           1.1   \n",
       "4   may         mon       1.0  999.0       0.0  nonexistent           1.1   \n",
       "\n",
       "   cons.price.idx  cons.conf.idx  euribor3m  nr.employed   y  \n",
       "0          93.994          -36.4      4.857       5191.0  no  \n",
       "1          93.994          -36.4      4.857       5191.0  no  \n",
       "2          93.994          -36.4      4.857       5191.0  no  \n",
       "3          93.994          -36.4      4.857       5191.0  no  \n",
       "4          93.994          -36.4      4.857       5191.0  no  "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "%matplotlib inline\n",
    "\n",
    "df = pd.read_csv(\"https://raw.githubusercontent.com/stephenleo/sagemaker-deployment/main/data/final_project_bank.csv\")\n",
    "\n",
    "print(df.shape)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## All the best!\n",
    "Get started below..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "Python 3 (Data Science)",
   "language": "python",
   "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}