{
"nbformat_minor": 2,
"cells": [
{
"execution_count": null,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": "# Export as slides command\n# jupyter nbconvert Jupyter\\ Slides.ipynb --to slides --post serve"
},
{
"source": "# Credit Card Approval\n\n\nHeba El-Shimy \nIBM **Cloud** Developer Advocate\n\n\n<sub>GitHub: HebaNAS</sub> \n<sub>Twitter: @heba_el_shimy</sub>\n\n-------------------\nLink for the notebook: [https://github.com/HebaNAS/Customer-Churn-Prediction/blob/master/notebook/Customer-Churn-Prediction-Pipeline.ipynb](https://github.com/HebaNAS/Customer-Churn-Prediction/blob/master/notebook/Customer-Churn-Prediction-Pipeline.ipynb)",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"source": "# Pipeline",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"source": "### 1. Loading Libraries",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": 1,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": "import os\nimport pandas as pd\nimport numpy as np\nimport seaborn as sns\nimport matplotlib.pyplot as plt\nfrom sklearn import preprocessing, svm\nfrom itertools import combinations\nfrom sklearn.preprocessing import PolynomialFeatures, LabelEncoder, StandardScaler\nimport sklearn.feature_selection\nfrom sklearn.model_selection import train_test_split\nfrom collections import defaultdict\nfrom sklearn import metrics"
},
{
"source": "### 2. Loading Our Dataset",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"source": "![insert-data](https://github.com/HebaNAS/IBM-Watson-Studio-Enablement/blob/master/CreditCardApprovalModel/imgs/insert-dataframe.jpg?raw=true)",
"cell_type": "markdown",
"metadata": {}
},
{
"execution_count": 2,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": "# The code was removed by DSX for sharing."
},
{
"execution_count": 3,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"execution_count": 3,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>0</th>\n <th>1</th>\n <th>2</th>\n <th>3</th>\n <th>4</th>\n <th>5</th>\n <th>6</th>\n <th>7</th>\n <th>8</th>\n <th>9</th>\n <th>10</th>\n <th>11</th>\n <th>12</th>\n <th>13</th>\n <th>14</th>\n <th>15</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>b</td>\n <td>30.83</td>\n <td>0.000</td>\n <td>u</td>\n <td>g</td>\n <td>w</td>\n <td>v</td>\n <td>1.250</td>\n <td>t</td>\n <td>t</td>\n <td>1</td>\n <td>f</td>\n <td>g</td>\n <td>00202</td>\n <td>0</td>\n <td>+</td>\n </tr>\n <tr>\n <th>1</th>\n <td>a</td>\n <td>58.67</td>\n <td>4.460</td>\n <td>u</td>\n <td>g</td>\n <td>q</td>\n <td>h</td>\n <td>3.040</td>\n <td>t</td>\n <td>t</td>\n <td>6</td>\n <td>f</td>\n <td>g</td>\n <td>00043</td>\n <td>560</td>\n <td>+</td>\n </tr>\n <tr>\n <th>2</th>\n <td>a</td>\n <td>24.50</td>\n <td>0.500</td>\n <td>u</td>\n <td>g</td>\n <td>q</td>\n <td>h</td>\n <td>1.500</td>\n <td>t</td>\n <td>f</td>\n <td>0</td>\n <td>f</td>\n <td>g</td>\n <td>00280</td>\n <td>824</td>\n <td>+</td>\n </tr>\n <tr>\n <th>3</th>\n <td>b</td>\n <td>27.83</td>\n <td>1.540</td>\n <td>u</td>\n <td>g</td>\n <td>w</td>\n <td>v</td>\n <td>3.750</td>\n <td>t</td>\n <td>t</td>\n <td>5</td>\n <td>t</td>\n <td>g</td>\n <td>00100</td>\n <td>3</td>\n <td>+</td>\n </tr>\n <tr>\n <th>4</th>\n <td>b</td>\n <td>20.17</td>\n <td>5.625</td>\n <td>u</td>\n <td>g</td>\n <td>w</td>\n <td>v</td>\n <td>1.710</td>\n <td>t</td>\n <td>f</td>\n <td>0</td>\n <td>f</td>\n <td>s</td>\n <td>00120</td>\n <td>0</td>\n <td>+</td>\n </tr>\n <tr>\n <th>5</th>\n <td>b</td>\n <td>32.08</td>\n <td>4.000</td>\n <td>u</td>\n <td>g</td>\n <td>m</td>\n <td>v</td>\n <td>2.500</td>\n <td>t</td>\n <td>f</td>\n 
<td>0</td>\n <td>t</td>\n <td>g</td>\n <td>00360</td>\n <td>0</td>\n <td>+</td>\n </tr>\n <tr>\n <th>6</th>\n <td>b</td>\n <td>33.17</td>\n <td>1.040</td>\n <td>u</td>\n <td>g</td>\n <td>r</td>\n <td>h</td>\n <td>6.500</td>\n <td>t</td>\n <td>f</td>\n <td>0</td>\n <td>t</td>\n <td>g</td>\n <td>00164</td>\n <td>31285</td>\n <td>+</td>\n </tr>\n <tr>\n <th>7</th>\n <td>a</td>\n <td>22.92</td>\n <td>11.585</td>\n <td>u</td>\n <td>g</td>\n <td>cc</td>\n <td>v</td>\n <td>0.040</td>\n <td>t</td>\n <td>f</td>\n <td>0</td>\n <td>f</td>\n <td>g</td>\n <td>00080</td>\n <td>1349</td>\n <td>+</td>\n </tr>\n <tr>\n <th>8</th>\n <td>b</td>\n <td>54.42</td>\n <td>0.500</td>\n <td>y</td>\n <td>p</td>\n <td>k</td>\n <td>h</td>\n <td>3.960</td>\n <td>t</td>\n <td>f</td>\n <td>0</td>\n <td>f</td>\n <td>g</td>\n <td>00180</td>\n <td>314</td>\n <td>+</td>\n </tr>\n <tr>\n <th>9</th>\n <td>b</td>\n <td>42.50</td>\n <td>4.915</td>\n <td>y</td>\n <td>p</td>\n <td>w</td>\n <td>v</td>\n <td>3.165</td>\n <td>t</td>\n <td>f</td>\n <td>0</td>\n <td>t</td>\n <td>g</td>\n <td>00052</td>\n <td>1442</td>\n <td>+</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15\n0 b 30.83 0.000 u g w v 1.250 t t 1 f g 00202 0 +\n1 a 58.67 4.460 u g q h 3.040 t t 6 f g 00043 560 +\n2 a 24.50 0.500 u g q h 1.500 t f 0 f g 00280 824 +\n3 b 27.83 1.540 u g w v 3.750 t t 5 t g 00100 3 +\n4 b 20.17 5.625 u g w v 1.710 t f 0 f s 00120 0 +\n5 b 32.08 4.000 u g m v 2.500 t f 0 t g 00360 0 +\n6 b 33.17 1.040 u g r h 6.500 t f 0 t g 00164 31285 +\n7 a 22.92 11.585 u g cc v 0.040 t f 0 f g 00080 1349 +\n8 b 54.42 0.500 y p k h 3.960 t f 0 f g 00180 314 +\n9 b 42.50 4.915 y p w v 3.165 t f 0 t g 00052 1442 +"
},
"output_type": "execute_result"
}
],
"source": "# Checking that everything is correct\npd.set_option('display.max_columns', 30)\napplicants.head(10)"
},
{
"source": "### 3. Get some info about our Dataset and whether we have missing values",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": 4,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 690 entries, 0 to 689\nData columns (total 16 columns):\n0 690 non-null object\n1 690 non-null object\n2 690 non-null float64\n3 690 non-null object\n4 690 non-null object\n5 690 non-null object\n6 690 non-null object\n7 690 non-null float64\n8 690 non-null object\n9 690 non-null object\n10 690 non-null int64\n11 690 non-null object\n12 690 non-null object\n13 690 non-null object\n14 690 non-null int64\n15 690 non-null object\ndtypes: float64(2), int64(2), object(12)\nmemory usage: 86.3+ KB\n"
}
],
"source": "# After running this cell we will see that we have no missing values\napplicants.info()"
},
{
"execution_count": 5,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"execution_count": 5,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>0</th>\n <th>1</th>\n <th>2</th>\n <th>3</th>\n <th>4</th>\n <th>5</th>\n <th>6</th>\n <th>7</th>\n <th>8</th>\n <th>9</th>\n <th>10</th>\n <th>11</th>\n <th>12</th>\n <th>13</th>\n <th>14</th>\n <th>15</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>b</td>\n <td>30.83</td>\n <td>0.000</td>\n <td>u</td>\n <td>g</td>\n <td>w</td>\n <td>v</td>\n <td>1.250</td>\n <td>t</td>\n <td>t</td>\n <td>1</td>\n <td>f</td>\n <td>g</td>\n <td>202.0</td>\n <td>0</td>\n <td>+</td>\n </tr>\n <tr>\n <th>1</th>\n <td>a</td>\n <td>58.67</td>\n <td>4.460</td>\n <td>u</td>\n <td>g</td>\n <td>q</td>\n <td>h</td>\n <td>3.040</td>\n <td>t</td>\n <td>t</td>\n <td>6</td>\n <td>f</td>\n <td>g</td>\n <td>43.0</td>\n <td>560</td>\n <td>+</td>\n </tr>\n <tr>\n <th>2</th>\n <td>a</td>\n <td>24.50</td>\n <td>0.500</td>\n <td>u</td>\n <td>g</td>\n <td>q</td>\n <td>h</td>\n <td>1.500</td>\n <td>t</td>\n <td>f</td>\n <td>0</td>\n <td>f</td>\n <td>g</td>\n <td>280.0</td>\n <td>824</td>\n <td>+</td>\n </tr>\n <tr>\n <th>3</th>\n <td>b</td>\n <td>27.83</td>\n <td>1.540</td>\n <td>u</td>\n <td>g</td>\n <td>w</td>\n <td>v</td>\n <td>3.750</td>\n <td>t</td>\n <td>t</td>\n <td>5</td>\n <td>t</td>\n <td>g</td>\n <td>100.0</td>\n <td>3</td>\n <td>+</td>\n </tr>\n <tr>\n <th>4</th>\n <td>b</td>\n <td>20.17</td>\n <td>5.625</td>\n <td>u</td>\n <td>g</td>\n <td>w</td>\n <td>v</td>\n <td>1.710</td>\n <td>t</td>\n <td>f</td>\n <td>0</td>\n <td>f</td>\n <td>s</td>\n <td>120.0</td>\n <td>0</td>\n <td>+</td>\n </tr>\n <tr>\n <th>5</th>\n <td>b</td>\n <td>32.08</td>\n <td>4.000</td>\n <td>u</td>\n <td>g</td>\n <td>m</td>\n <td>v</td>\n <td>2.500</td>\n <td>t</td>\n <td>f</td>\n 
<td>0</td>\n <td>t</td>\n <td>g</td>\n <td>360.0</td>\n <td>0</td>\n <td>+</td>\n </tr>\n <tr>\n <th>6</th>\n <td>b</td>\n <td>33.17</td>\n <td>1.040</td>\n <td>u</td>\n <td>g</td>\n <td>r</td>\n <td>h</td>\n <td>6.500</td>\n <td>t</td>\n <td>f</td>\n <td>0</td>\n <td>t</td>\n <td>g</td>\n <td>164.0</td>\n <td>31285</td>\n <td>+</td>\n </tr>\n <tr>\n <th>7</th>\n <td>a</td>\n <td>22.92</td>\n <td>11.585</td>\n <td>u</td>\n <td>g</td>\n <td>cc</td>\n <td>v</td>\n <td>0.040</td>\n <td>t</td>\n <td>f</td>\n <td>0</td>\n <td>f</td>\n <td>g</td>\n <td>80.0</td>\n <td>1349</td>\n <td>+</td>\n </tr>\n <tr>\n <th>8</th>\n <td>b</td>\n <td>54.42</td>\n <td>0.500</td>\n <td>y</td>\n <td>p</td>\n <td>k</td>\n <td>h</td>\n <td>3.960</td>\n <td>t</td>\n <td>f</td>\n <td>0</td>\n <td>f</td>\n <td>g</td>\n <td>180.0</td>\n <td>314</td>\n <td>+</td>\n </tr>\n <tr>\n <th>9</th>\n <td>b</td>\n <td>42.50</td>\n <td>4.915</td>\n <td>y</td>\n <td>p</td>\n <td>w</td>\n <td>v</td>\n <td>3.165</td>\n <td>t</td>\n <td>f</td>\n <td>0</td>\n <td>t</td>\n <td>g</td>\n <td>52.0</td>\n <td>1442</td>\n <td>+</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15\n0 b 30.83 0.000 u g w v 1.250 t t 1 f g 202.0 0 +\n1 a 58.67 4.460 u g q h 3.040 t t 6 f g 43.0 560 +\n2 a 24.50 0.500 u g q h 1.500 t f 0 f g 280.0 824 +\n3 b 27.83 1.540 u g w v 3.750 t t 5 t g 100.0 3 +\n4 b 20.17 5.625 u g w v 1.710 t f 0 f s 120.0 0 +\n5 b 32.08 4.000 u g m v 2.500 t f 0 t g 360.0 0 +\n6 b 33.17 1.040 u g r h 6.500 t f 0 t g 164.0 31285 +\n7 a 22.92 11.585 u g cc v 0.040 t f 0 f g 80.0 1349 +\n8 b 54.42 0.500 y p k h 3.960 t f 0 f g 180.0 314 +\n9 b 42.50 4.915 y p w v 3.165 t f 0 t g 52.0 1442 +"
},
"output_type": "execute_result"
}
],
"source": "# Convert columns with numbers as values but object as datatype into numeric\ncols = [1, 13]\n\n# Set error level to coerce so any string value will be replaced with NaN\napplicants[cols] = applicants[cols].apply(pd.to_numeric, errors='coerce')\napplicants.head(10)"
},
{
"execution_count": 6,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"execution_count": 6,
"metadata": {},
"data": {
"text/plain": "True"
},
"output_type": "execute_result"
}
],
"source": "# Check if we have any NaN values\napplicants.isnull().values.any()"
},
{
"execution_count": 7,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"execution_count": 7,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>0</th>\n <th>1</th>\n <th>2</th>\n <th>3</th>\n <th>4</th>\n <th>5</th>\n <th>6</th>\n <th>7</th>\n <th>8</th>\n <th>9</th>\n <th>10</th>\n <th>11</th>\n <th>12</th>\n <th>13</th>\n <th>14</th>\n <th>15</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>b</td>\n <td>30.83</td>\n <td>0.000</td>\n <td>u</td>\n <td>g</td>\n <td>w</td>\n <td>v</td>\n <td>1.250</td>\n <td>t</td>\n <td>t</td>\n <td>1</td>\n <td>f</td>\n <td>g</td>\n <td>202.0</td>\n <td>0</td>\n <td>+</td>\n </tr>\n <tr>\n <th>1</th>\n <td>a</td>\n <td>58.67</td>\n <td>4.460</td>\n <td>u</td>\n <td>g</td>\n <td>q</td>\n <td>h</td>\n <td>3.040</td>\n <td>t</td>\n <td>t</td>\n <td>6</td>\n <td>f</td>\n <td>g</td>\n <td>43.0</td>\n <td>560</td>\n <td>+</td>\n </tr>\n <tr>\n <th>2</th>\n <td>a</td>\n <td>24.50</td>\n <td>0.500</td>\n <td>u</td>\n <td>g</td>\n <td>q</td>\n <td>h</td>\n <td>1.500</td>\n <td>t</td>\n <td>f</td>\n <td>0</td>\n <td>f</td>\n <td>g</td>\n <td>280.0</td>\n <td>824</td>\n <td>+</td>\n </tr>\n <tr>\n <th>3</th>\n <td>b</td>\n <td>27.83</td>\n <td>1.540</td>\n <td>u</td>\n <td>g</td>\n <td>w</td>\n <td>v</td>\n <td>3.750</td>\n <td>t</td>\n <td>t</td>\n <td>5</td>\n <td>t</td>\n <td>g</td>\n <td>100.0</td>\n <td>3</td>\n <td>+</td>\n </tr>\n <tr>\n <th>4</th>\n <td>b</td>\n <td>20.17</td>\n <td>5.625</td>\n <td>u</td>\n <td>g</td>\n <td>w</td>\n <td>v</td>\n <td>1.710</td>\n <td>t</td>\n <td>f</td>\n <td>0</td>\n <td>f</td>\n <td>s</td>\n <td>120.0</td>\n <td>0</td>\n <td>+</td>\n </tr>\n <tr>\n <th>5</th>\n <td>b</td>\n <td>32.08</td>\n <td>4.000</td>\n <td>u</td>\n <td>g</td>\n <td>m</td>\n <td>v</td>\n <td>2.500</td>\n <td>t</td>\n <td>f</td>\n 
<td>0</td>\n <td>t</td>\n <td>g</td>\n <td>360.0</td>\n <td>0</td>\n <td>+</td>\n </tr>\n <tr>\n <th>6</th>\n <td>b</td>\n <td>33.17</td>\n <td>1.040</td>\n <td>u</td>\n <td>g</td>\n <td>r</td>\n <td>h</td>\n <td>6.500</td>\n <td>t</td>\n <td>f</td>\n <td>0</td>\n <td>t</td>\n <td>g</td>\n <td>164.0</td>\n <td>31285</td>\n <td>+</td>\n </tr>\n <tr>\n <th>7</th>\n <td>a</td>\n <td>22.92</td>\n <td>11.585</td>\n <td>u</td>\n <td>g</td>\n <td>cc</td>\n <td>v</td>\n <td>0.040</td>\n <td>t</td>\n <td>f</td>\n <td>0</td>\n <td>f</td>\n <td>g</td>\n <td>80.0</td>\n <td>1349</td>\n <td>+</td>\n </tr>\n <tr>\n <th>8</th>\n <td>b</td>\n <td>54.42</td>\n <td>0.500</td>\n <td>y</td>\n <td>p</td>\n <td>k</td>\n <td>h</td>\n <td>3.960</td>\n <td>t</td>\n <td>f</td>\n <td>0</td>\n <td>f</td>\n <td>g</td>\n <td>180.0</td>\n <td>314</td>\n <td>+</td>\n </tr>\n <tr>\n <th>9</th>\n <td>b</td>\n <td>42.50</td>\n <td>4.915</td>\n <td>y</td>\n <td>p</td>\n <td>w</td>\n <td>v</td>\n <td>3.165</td>\n <td>t</td>\n <td>f</td>\n <td>0</td>\n <td>t</td>\n <td>g</td>\n <td>52.0</td>\n <td>1442</td>\n <td>+</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15\n0 b 30.83 0.000 u g w v 1.250 t t 1 f g 202.0 0 +\n1 a 58.67 4.460 u g q h 3.040 t t 6 f g 43.0 560 +\n2 a 24.50 0.500 u g q h 1.500 t f 0 f g 280.0 824 +\n3 b 27.83 1.540 u g w v 3.750 t t 5 t g 100.0 3 +\n4 b 20.17 5.625 u g w v 1.710 t f 0 f s 120.0 0 +\n5 b 32.08 4.000 u g m v 2.500 t f 0 t g 360.0 0 +\n6 b 33.17 1.040 u g r h 6.500 t f 0 t g 164.0 31285 +\n7 a 22.92 11.585 u g cc v 0.040 t f 0 f g 80.0 1349 +\n8 b 54.42 0.500 y p k h 3.960 t f 0 f g 180.0 314 +\n9 b 42.50 4.915 y p w v 3.165 t f 0 t g 52.0 1442 +"
},
"output_type": "execute_result"
}
],
"source": "# Handle missing values using scikit learn Imputer\nfrom sklearn.preprocessing import Imputer\n\n# Define the values to replce and the strategy of choosing the replacement value\nimp = Imputer(missing_values=\"NaN\", strategy=\"mean\")\n\napplicants[cols] = imp.fit_transform(applicants[cols])\napplicants.head(10)"
},
{
"execution_count": 8,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"execution_count": 8,
"metadata": {},
"data": {
"text/plain": "False"
},
"output_type": "execute_result"
}
],
"source": "# Check if we have any NaN values\napplicants.isnull().values.any()"
},
{
"execution_count": 9,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 690 entries, 0 to 689\nData columns (total 16 columns):\n0 690 non-null object\n1 690 non-null float64\n2 690 non-null float64\n3 690 non-null object\n4 690 non-null object\n5 690 non-null object\n6 690 non-null object\n7 690 non-null float64\n8 690 non-null object\n9 690 non-null object\n10 690 non-null int64\n11 690 non-null object\n12 690 non-null object\n13 690 non-null float64\n14 690 non-null int64\n15 690 non-null object\ndtypes: float64(4), int64(2), object(10)\nmemory usage: 86.3+ KB\n"
}
],
"source": "applicants.info()"
},
{
"source": "### 4. Descriptive analytics for our data",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": 10,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"execution_count": 10,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>1</th>\n <th>2</th>\n <th>7</th>\n <th>10</th>\n <th>13</th>\n <th>14</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>count</th>\n <td>690.000</td>\n <td>690.000</td>\n <td>690.000</td>\n <td>690.000</td>\n <td>690.000</td>\n <td>690.000</td>\n </tr>\n <tr>\n <th>mean</th>\n <td>31.568</td>\n <td>4.759</td>\n <td>2.223</td>\n <td>2.400</td>\n <td>184.015</td>\n <td>1017.386</td>\n </tr>\n <tr>\n <th>std</th>\n <td>11.853</td>\n <td>4.978</td>\n <td>3.347</td>\n <td>4.863</td>\n <td>172.159</td>\n <td>5210.103</td>\n </tr>\n <tr>\n <th>min</th>\n <td>13.750</td>\n <td>0.000</td>\n <td>0.000</td>\n <td>0.000</td>\n <td>0.000</td>\n <td>0.000</td>\n </tr>\n <tr>\n <th>25%</th>\n <td>22.670</td>\n <td>1.000</td>\n <td>0.165</td>\n <td>0.000</td>\n <td>80.000</td>\n <td>0.000</td>\n </tr>\n <tr>\n <th>50%</th>\n <td>28.625</td>\n <td>2.750</td>\n <td>1.000</td>\n <td>0.000</td>\n <td>160.000</td>\n <td>5.000</td>\n </tr>\n <tr>\n <th>75%</th>\n <td>37.707</td>\n <td>7.207</td>\n <td>2.625</td>\n <td>3.000</td>\n <td>272.000</td>\n <td>395.500</td>\n </tr>\n <tr>\n <th>max</th>\n <td>80.250</td>\n <td>28.000</td>\n <td>28.500</td>\n <td>67.000</td>\n <td>2000.000</td>\n <td>100000.000</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " 1 2 7 10 13 14\ncount 690.000 690.000 690.000 690.000 690.000 690.000\nmean 31.568 4.759 2.223 2.400 184.015 1017.386\nstd 11.853 4.978 3.347 4.863 172.159 5210.103\nmin 13.750 0.000 0.000 0.000 0.000 0.000\n25% 22.670 1.000 0.165 0.000 80.000 0.000\n50% 28.625 2.750 1.000 0.000 160.000 5.000\n75% 37.707 7.207 2.625 3.000 272.000 395.500\nmax 80.250 28.000 28.500 67.000 2000.000 100000.000"
},
"output_type": "execute_result"
}
],
"source": "# Describe columns with numerical values\npd.set_option('precision', 3)\napplicants.describe()"
},
{
"execution_count": 11,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"execution_count": 11,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>1</th>\n <th>2</th>\n <th>7</th>\n <th>10</th>\n <th>13</th>\n <th>14</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>1</th>\n <td>1.000</td>\n <td>0.201</td>\n <td>0.393</td>\n <td>0.186</td>\n <td>-0.077</td>\n <td>0.019</td>\n </tr>\n <tr>\n <th>2</th>\n <td>0.201</td>\n <td>1.000</td>\n <td>0.299</td>\n <td>0.271</td>\n <td>-0.222</td>\n <td>0.123</td>\n </tr>\n <tr>\n <th>7</th>\n <td>0.393</td>\n <td>0.299</td>\n <td>1.000</td>\n <td>0.322</td>\n <td>-0.076</td>\n <td>0.051</td>\n </tr>\n <tr>\n <th>10</th>\n <td>0.186</td>\n <td>0.271</td>\n <td>0.322</td>\n <td>1.000</td>\n <td>-0.120</td>\n <td>0.064</td>\n </tr>\n <tr>\n <th>13</th>\n <td>-0.077</td>\n <td>-0.222</td>\n <td>-0.076</td>\n <td>-0.120</td>\n <td>1.000</td>\n <td>0.066</td>\n </tr>\n <tr>\n <th>14</th>\n <td>0.019</td>\n <td>0.123</td>\n <td>0.051</td>\n <td>0.064</td>\n <td>0.066</td>\n <td>1.000</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " 1 2 7 10 13 14\n1 1.000 0.201 0.393 0.186 -0.077 0.019\n2 0.201 1.000 0.299 0.271 -0.222 0.123\n7 0.393 0.299 1.000 0.322 -0.076 0.051\n10 0.186 0.271 0.322 1.000 -0.120 0.064\n13 -0.077 -0.222 -0.076 -0.120 1.000 0.066\n14 0.019 0.123 0.051 0.064 0.066 1.000"
},
"output_type": "execute_result"
}
],
"source": "# Find correlations\napplicants.corr(method='pearson')"
},
{
"source": "### 5. Visualize our Data to understand it better",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"source": "#### Plot Relationships",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": 12,
"cell_type": "code",
"metadata": {
"scrolled": false,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": "<matplotlib.figure.Figure at 0x2ac273f3add8>"
},
"metadata": {}
}
],
"source": "# Create Grid for pairwise relationships\ngr = sns.PairGrid(applicants, size=5, hue=15)\ngr = gr.map_diag(plt.hist)\ngr = gr.map_offdiag(plt.scatter)\ngr = gr.add_legend()"
},
{
"source": "#### Understand Data Distribution",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": 13,
"cell_type": "code",
"metadata": {
"scrolled": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": "/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/seaborn/categorical.py:462: FutureWarning: remove_na is deprecated and is a private function. Do not use.\n box_data = remove_na(group_data)\n"
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": "<matplotlib.figure.Figure at 0x2ac27d2a3f98>"
},
"metadata": {}
}
],
"source": "# Set up plot size\nfig, ax = plt.subplots(figsize=(20,10))\n\n# Attributes destribution\na = sns.boxplot(orient=\"v\", palette=\"hls\", data=applicants.iloc[:, :13], fliersize=14)"
},
{
"execution_count": 14,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": "<matplotlib.figure.Figure at 0x2ac27e8f9828>"
},
"metadata": {}
}
],
"source": "# Tenure data distribution\nhistogram = sns.distplot(applicants.iloc[:, 1], hist=True)\nplt.show()"
},
{
"source": "### 6. Encode string values in data into numerical values",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": 15,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"execution_count": 15,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>1</th>\n <th>2</th>\n <th>7</th>\n <th>10</th>\n <th>13</th>\n <th>14</th>\n <th>0_?</th>\n <th>0_a</th>\n <th>0_b</th>\n <th>3_?</th>\n <th>3_l</th>\n <th>3_u</th>\n <th>3_y</th>\n <th>4_?</th>\n <th>4_g</th>\n <th>...</th>\n <th>6_n</th>\n <th>6_o</th>\n <th>6_v</th>\n <th>6_z</th>\n <th>8_f</th>\n <th>8_t</th>\n <th>9_f</th>\n <th>9_t</th>\n <th>11_f</th>\n <th>11_t</th>\n <th>12_g</th>\n <th>12_p</th>\n <th>12_s</th>\n <th>15_+</th>\n <th>15_-</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>30.83</td>\n <td>0.000</td>\n <td>1.250</td>\n <td>1</td>\n <td>202.0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>58.67</td>\n <td>4.460</td>\n <td>3.040</td>\n <td>6</td>\n <td>43.0</td>\n <td>560</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>24.50</td>\n <td>0.500</td>\n <td>1.500</td>\n <td>0</td>\n <td>280.0</td>\n <td>824</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n 
<td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>27.83</td>\n <td>1.540</td>\n <td>3.750</td>\n <td>5</td>\n <td>100.0</td>\n <td>3</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>20.17</td>\n <td>5.625</td>\n <td>1.710</td>\n <td>0</td>\n <td>120.0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>5</th>\n <td>32.08</td>\n <td>4.000</td>\n <td>2.500</td>\n <td>0</td>\n <td>360.0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>6</th>\n <td>33.17</td>\n <td>1.040</td>\n <td>6.500</td>\n <td>0</td>\n <td>164.0</td>\n <td>31285</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>7</th>\n 
<td>22.92</td>\n <td>11.585</td>\n <td>0.040</td>\n <td>0</td>\n <td>80.0</td>\n <td>1349</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>8</th>\n <td>54.42</td>\n <td>0.500</td>\n <td>3.960</td>\n <td>0</td>\n <td>180.0</td>\n <td>314</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>9</th>\n <td>42.50</td>\n <td>4.915</td>\n <td>3.165</td>\n <td>0</td>\n <td>52.0</td>\n <td>1442</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n<p>10 rows \u00d7 53 columns</p>\n</div>",
"text/plain": " 1 2 7 10 13 14 0_? 0_a 0_b 3_? 3_l 3_u 3_y \\\n0 30.83 0.000 1.250 1 202.0 0 0 0 1 0 0 1 0 \n1 58.67 4.460 3.040 6 43.0 560 0 1 0 0 0 1 0 \n2 24.50 0.500 1.500 0 280.0 824 0 1 0 0 0 1 0 \n3 27.83 1.540 3.750 5 100.0 3 0 0 1 0 0 1 0 \n4 20.17 5.625 1.710 0 120.0 0 0 0 1 0 0 1 0 \n5 32.08 4.000 2.500 0 360.0 0 0 0 1 0 0 1 0 \n6 33.17 1.040 6.500 0 164.0 31285 0 0 1 0 0 1 0 \n7 22.92 11.585 0.040 0 80.0 1349 0 1 0 0 0 1 0 \n8 54.42 0.500 3.960 0 180.0 314 0 0 1 0 0 0 1 \n9 42.50 4.915 3.165 0 52.0 1442 0 0 1 0 0 0 1 \n\n 4_? 4_g ... 6_n 6_o 6_v 6_z 8_f 8_t 9_f 9_t 11_f 11_t 12_g \\\n0 0 1 ... 0 0 1 0 0 1 0 1 1 0 1 \n1 0 1 ... 0 0 0 0 0 1 0 1 1 0 1 \n2 0 1 ... 0 0 0 0 0 1 1 0 1 0 1 \n3 0 1 ... 0 0 1 0 0 1 0 1 0 1 1 \n4 0 1 ... 0 0 1 0 0 1 1 0 1 0 0 \n5 0 1 ... 0 0 1 0 0 1 1 0 0 1 1 \n6 0 1 ... 0 0 0 0 0 1 1 0 0 1 1 \n7 0 1 ... 0 0 1 0 0 1 1 0 1 0 1 \n8 0 0 ... 0 0 0 0 0 1 1 0 1 0 1 \n9 0 0 ... 0 0 1 0 0 1 1 0 0 1 1 \n\n 12_p 12_s 15_+ 15_- \n0 0 0 1 0 \n1 0 0 1 0 \n2 0 0 1 0 \n3 0 0 1 0 \n4 0 1 1 0 \n5 0 0 1 0 \n6 0 0 1 0 \n7 0 0 1 0 \n8 0 0 1 0 \n9 0 0 1 0 \n\n[10 rows x 53 columns]"
},
"output_type": "execute_result"
}
],
"source": "# Use pandas get_dummies\napplicants_encoded = pd.get_dummies(applicants)\napplicants_encoded.head(10)"
},
{
"source": "### 7. Create Training Set and Labels ",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": 16,
"cell_type": "code",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"execution_count": 16,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>1</th>\n <th>2</th>\n <th>7</th>\n <th>10</th>\n <th>13</th>\n <th>14</th>\n <th>0_?</th>\n <th>0_a</th>\n <th>0_b</th>\n <th>3_?</th>\n <th>3_l</th>\n <th>3_u</th>\n <th>3_y</th>\n <th>4_?</th>\n <th>4_g</th>\n <th>...</th>\n <th>6_h</th>\n <th>6_j</th>\n <th>6_n</th>\n <th>6_o</th>\n <th>6_v</th>\n <th>6_z</th>\n <th>8_f</th>\n <th>8_t</th>\n <th>9_f</th>\n <th>9_t</th>\n <th>11_f</th>\n <th>11_t</th>\n <th>12_g</th>\n <th>12_p</th>\n <th>12_s</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>30.83</td>\n <td>0.000</td>\n <td>1.250</td>\n <td>1</td>\n <td>202.0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>58.67</td>\n <td>4.460</td>\n <td>3.040</td>\n <td>6</td>\n <td>43.0</td>\n <td>560</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>24.50</td>\n <td>0.500</td>\n <td>1.500</td>\n <td>0</td>\n <td>280.0</td>\n <td>824</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>1</td>\n <td>0</td>\n 
<td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>27.83</td>\n <td>1.540</td>\n <td>3.750</td>\n <td>5</td>\n <td>100.0</td>\n <td>3</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>20.17</td>\n <td>5.625</td>\n <td>1.710</td>\n <td>0</td>\n <td>120.0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>5</th>\n <td>32.08</td>\n <td>4.000</td>\n <td>2.500</td>\n <td>0</td>\n <td>360.0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>6</th>\n <td>33.17</td>\n <td>1.040</td>\n <td>6.500</td>\n <td>0</td>\n <td>164.0</td>\n <td>31285</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>7</th>\n 
<td>22.92</td>\n <td>11.585</td>\n <td>0.040</td>\n <td>0</td>\n <td>80.0</td>\n <td>1349</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>8</th>\n <td>54.42</td>\n <td>0.500</td>\n <td>3.960</td>\n <td>0</td>\n <td>180.0</td>\n <td>314</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>9</th>\n <td>42.50</td>\n <td>4.915</td>\n <td>3.165</td>\n <td>0</td>\n <td>52.0</td>\n <td>1442</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n<p>10 rows \u00d7 51 columns</p>\n</div>",
"text/plain": " 1 2 7 10 13 14 0_? 0_a 0_b 3_? 3_l 3_u 3_y \\\n0 30.83 0.000 1.250 1 202.0 0 0 0 1 0 0 1 0 \n1 58.67 4.460 3.040 6 43.0 560 0 1 0 0 0 1 0 \n2 24.50 0.500 1.500 0 280.0 824 0 1 0 0 0 1 0 \n3 27.83 1.540 3.750 5 100.0 3 0 0 1 0 0 1 0 \n4 20.17 5.625 1.710 0 120.0 0 0 0 1 0 0 1 0 \n5 32.08 4.000 2.500 0 360.0 0 0 0 1 0 0 1 0 \n6 33.17 1.040 6.500 0 164.0 31285 0 0 1 0 0 1 0 \n7 22.92 11.585 0.040 0 80.0 1349 0 1 0 0 0 1 0 \n8 54.42 0.500 3.960 0 180.0 314 0 0 1 0 0 0 1 \n9 42.50 4.915 3.165 0 52.0 1442 0 0 1 0 0 0 1 \n\n 4_? 4_g ... 6_h 6_j 6_n 6_o 6_v 6_z 8_f 8_t 9_f 9_t 11_f \\\n0 0 1 ... 0 0 0 0 1 0 0 1 0 1 1 \n1 0 1 ... 1 0 0 0 0 0 0 1 0 1 1 \n2 0 1 ... 1 0 0 0 0 0 0 1 1 0 1 \n3 0 1 ... 0 0 0 0 1 0 0 1 0 1 0 \n4 0 1 ... 0 0 0 0 1 0 0 1 1 0 1 \n5 0 1 ... 0 0 0 0 1 0 0 1 1 0 0 \n6 0 1 ... 1 0 0 0 0 0 0 1 1 0 0 \n7 0 1 ... 0 0 0 0 1 0 0 1 1 0 1 \n8 0 0 ... 1 0 0 0 0 0 0 1 1 0 1 \n9 0 0 ... 0 0 0 0 1 0 0 1 1 0 0 \n\n 11_t 12_g 12_p 12_s \n0 0 1 0 0 \n1 0 1 0 0 \n2 0 1 0 0 \n3 1 1 0 0 \n4 0 0 0 1 \n5 1 1 0 0 \n6 1 1 0 0 \n7 0 1 0 0 \n8 0 1 0 0 \n9 1 1 0 0 \n\n[10 rows x 51 columns]"
},
"output_type": "execute_result"
}
],
"source": "# Create training data for non-preprocessed approach\nX_npp = applicants_encoded.iloc[:, :-2]\npd.DataFrame(X_npp).head(10)"
},
{
"execution_count": 47,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"execution_count": 47,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>1</th>\n <th>2</th>\n <th>7</th>\n <th>10</th>\n <th>13</th>\n <th>14</th>\n <th>0_?</th>\n <th>0_a</th>\n <th>0_b</th>\n <th>3_?</th>\n <th>3_l</th>\n <th>3_u</th>\n <th>3_y</th>\n <th>4_?</th>\n <th>4_g</th>\n <th>...</th>\n <th>6_h</th>\n <th>6_j</th>\n <th>6_n</th>\n <th>6_o</th>\n <th>6_v</th>\n <th>6_z</th>\n <th>8_f</th>\n <th>8_t</th>\n <th>9_f</th>\n <th>9_t</th>\n <th>11_f</th>\n <th>11_t</th>\n <th>12_g</th>\n <th>12_p</th>\n <th>12_s</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>30.83</td>\n <td>0.000</td>\n <td>1.25</td>\n <td>1</td>\n <td>202.0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>58.67</td>\n <td>4.460</td>\n <td>3.04</td>\n <td>6</td>\n <td>43.0</td>\n <td>560</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>24.50</td>\n <td>0.500</td>\n <td>1.50</td>\n <td>0</td>\n <td>280.0</td>\n <td>824</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n 
<td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>27.83</td>\n <td>1.540</td>\n <td>3.75</td>\n <td>5</td>\n <td>100.0</td>\n <td>3</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>20.17</td>\n <td>5.625</td>\n <td>1.71</td>\n <td>0</td>\n <td>120.0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows \u00d7 51 columns</p>\n</div>",
"text/plain": " 1 2 7 10 13 14 0_? 0_a 0_b 3_? 3_l 3_u 3_y 4_? \\\n0 30.83 0.000 1.25 1 202.0 0 0 0 1 0 0 1 0 0 \n1 58.67 4.460 3.04 6 43.0 560 0 1 0 0 0 1 0 0 \n2 24.50 0.500 1.50 0 280.0 824 0 1 0 0 0 1 0 0 \n3 27.83 1.540 3.75 5 100.0 3 0 0 1 0 0 1 0 0 \n4 20.17 5.625 1.71 0 120.0 0 0 0 1 0 0 1 0 0 \n\n 4_g ... 6_h 6_j 6_n 6_o 6_v 6_z 8_f 8_t 9_f 9_t 11_f 11_t \\\n0 1 ... 0 0 0 0 1 0 0 1 0 1 1 0 \n1 1 ... 1 0 0 0 0 0 0 1 0 1 1 0 \n2 1 ... 1 0 0 0 0 0 0 1 1 0 1 0 \n3 1 ... 0 0 0 0 1 0 0 1 0 1 0 1 \n4 1 ... 0 0 0 0 1 0 0 1 1 0 1 0 \n\n 12_g 12_p 12_s \n0 1 0 0 \n1 1 0 0 \n2 1 0 0 \n3 1 0 0 \n4 0 0 1 \n\n[5 rows x 51 columns]"
},
"output_type": "execute_result"
}
],
"source": "# Create training data for that will undergo preprocessing\nX = applicants_encoded.iloc[:, :-2]\nX.head()"
},
{
"execution_count": 18,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"execution_count": 18,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>0</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>0</td>\n </tr>\n <tr>\n <th>5</th>\n <td>0</td>\n </tr>\n <tr>\n <th>6</th>\n <td>0</td>\n </tr>\n <tr>\n <th>7</th>\n <td>0</td>\n </tr>\n <tr>\n <th>8</th>\n <td>0</td>\n </tr>\n <tr>\n <th>9</th>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " 0\n0 0\n1 0\n2 0\n3 0\n4 0\n5 0\n6 0\n7 0\n8 0\n9 0"
},
"output_type": "execute_result"
}
],
"source": "# Extract labels\nfrom sklearn.preprocessing import LabelEncoder\n\n# Split last column from original dataset as the labels column\ny = applicants[15]\n\n# Apply encoder to transform strings to numeric values 0 and 1\nle = LabelEncoder().fit(y)\n\ny_enc = le.transform(y)\npd.DataFrame(y_enc).head(10)"
},
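{
"source": "A minimal sketch of what `LabelEncoder` does, on made-up labels. The `'+'` and `'-'` strings are an assumption about what the label column looks like, not values taken from the cells above. The encoder sorts the distinct classes and assigns each one its position in that sorted order.\n\n```python\nfrom sklearn.preprocessing import LabelEncoder\n\n# Toy labels, assumed for illustration only\nlabels = ['+', '-', '-', '+']\n\nle = LabelEncoder().fit(labels)\nencoded = le.transform(labels)\n\nprint(list(le.classes_))  # distinct classes in sorted order\nprint(list(encoded))      # integer code for each label\n\n# inverse_transform maps the integer codes back to the original strings\ndecoded = le.inverse_transform(encoded)\nprint(list(decoded))\n```\n\nKeeping the fitted encoder around is what makes it possible to translate model predictions back into the original label strings.",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
}
},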
{
"source": "### 8. Detect outliers in numerical values",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": 41,
"cell_type": "code",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": "# Detect outlier using interquartile method and remove them\ndef find_outliers(df):\n quartile_1, quartile_3 = np.percentile(df, [25, 75])\n iqr = quartile_3 - quartile_1\n lower_bound = quartile_1 - (iqr * 1.5)\n upper_bound = quartile_3 + (iqr * 1.5)\n\n outlier_indices = list(df.index[(df < lower_bound)|(df > upper_bound)])\n outlier_values = list(df[outlier_indices])\n \n df[outlier_indices] = np.NaN\n \n return df"
},
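{
"source": "A minimal sketch of the same IQR rule on a toy Series (the values are made up for illustration). Unlike `find_outliers`, this variant returns a new Series instead of modifying its input, which is one way to sidestep pandas' `SettingWithCopyWarning`. The helper name `mask_outliers_iqr` and the parameter `k` are hypothetical.\n\n```python\nimport numpy as np\nimport pandas as pd\n\ndef mask_outliers_iqr(series, k=1.5):\n    # Quartiles of the column, ignoring any NaN values already present\n    q1, q3 = np.nanpercentile(series, [25, 75])\n    iqr = q3 - q1\n    lower, upper = q1 - k * iqr, q3 + k * iqr\n    # Series.where keeps values where the condition holds and puts NaN elsewhere\n    return series.where((series >= lower) & (series <= upper))\n\n# Toy data: q1 = 2, q3 = 4, so the bounds are -1 and 7 and only 100 falls outside\ntoy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])\nmasked = mask_outliers_iqr(toy)\nprint(masked)\n```\n\nBecause the input is left untouched, the result has to be assigned back explicitly, for example `df[col] = mask_outliers_iqr(df[col])` on a DataFrame `df`.",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
}
},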
{
"execution_count": 48,
"cell_type": "code",
"metadata": {
"scrolled": true
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "0 30.83\n1 58.67\n2 24.50\n3 27.83\n4 20.17\n5 32.08\n6 33.17\n7 22.92\n8 54.42\n9 42.50\n10 22.08\n11 29.92\n12 38.25\n13 48.08\n14 45.83\n15 36.67\n16 28.25\n17 23.25\n18 21.83\n19 19.17\n20 25.00\n21 23.25\n22 47.75\n23 27.42\n24 41.17\n25 15.83\n26 47.00\n27 56.58\n28 57.42\n29 42.08\n ... \n660 22.25\n661 29.83\n662 23.50\n663 32.08\n664 31.08\n665 31.83\n666 21.75\n667 17.92\n668 30.33\n669 51.83\n670 47.17\n671 25.83\n672 50.25\n673 29.50\n674 37.33\n675 41.58\n676 30.58\n677 19.42\n678 17.92\n679 20.08\n680 19.50\n681 27.83\n682 17.08\n683 36.42\n684 40.58\n685 21.08\n686 22.67\n687 25.25\n688 17.92\n689 35.00\nName: 1, Length: 690, dtype: float64\n"
},
{
"output_type": "stream",
"name": "stderr",
"text": "/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning: \nA value is trying to be set on a copy of a slice from a DataFrame\n\nSee the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
}
],
"source": "# Find outliers in first column (continuous values)\nprint(find_outliers(X[1]))"
},
{
"execution_count": 49,
"cell_type": "code",
"metadata": {
"scrolled": true
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "0 0.000\n1 4.460\n2 0.500\n3 1.540\n4 5.625\n5 4.000\n6 1.040\n7 11.585\n8 0.500\n9 4.915\n10 0.830\n11 1.835\n12 6.000\n13 6.040\n14 10.500\n15 4.415\n16 0.875\n17 5.875\n18 0.250\n19 8.585\n20 11.250\n21 1.000\n22 8.000\n23 14.500\n24 6.500\n25 0.585\n26 13.000\n27 NaN\n28 8.500\n29 1.040\n ... \n660 9.000\n661 3.500\n662 1.500\n663 4.000\n664 1.500\n665 0.040\n666 11.750\n667 0.540\n668 0.500\n669 2.040\n670 5.835\n671 12.835\n672 0.835\n673 2.000\n674 2.500\n675 1.040\n676 10.665\n677 7.250\n678 10.210\n679 1.250\n680 0.290\n681 1.000\n682 3.290\n683 0.750\n684 3.290\n685 10.085\n686 0.750\n687 13.500\n688 0.205\n689 3.375\nName: 2, Length: 690, dtype: float64\n"
},
{
"output_type": "stream",
"name": "stderr",
"text": "/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning: \nA value is trying to be set on a copy of a slice from a DataFrame\n\nSee the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
}
],
"source": "# Find outliers in first column (continuous values)\nprint(find_outliers(X[2]))"
},
{
"execution_count": 50,
"cell_type": "code",
"metadata": {
"scrolled": true
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "0 1.250\n1 3.040\n2 1.500\n3 3.750\n4 1.710\n5 2.500\n6 NaN\n7 0.040\n8 3.960\n9 3.165\n10 2.165\n11 4.335\n12 1.000\n13 0.040\n14 5.000\n15 0.250\n16 0.960\n17 3.170\n18 0.665\n19 0.750\n20 2.500\n21 0.835\n22 NaN\n23 3.085\n24 0.500\n25 1.500\n26 5.165\n27 NaN\n28 NaN\n29 5.000\n ... \n660 0.085\n661 0.165\n662 0.875\n663 1.500\n664 0.040\n665 0.040\n666 0.250\n667 1.750\n668 0.085\n669 1.500\n670 5.500\n671 0.500\n672 0.500\n673 2.000\n674 0.210\n675 0.665\n676 0.085\n677 0.040\n678 0.000\n679 0.000\n680 0.290\n681 3.000\n682 0.335\n683 0.585\n684 3.500\n685 1.250\n686 2.000\n687 2.000\n688 0.040\n689 NaN\nName: 7, Length: 690, dtype: float64\n"
},
{
"output_type": "stream",
"name": "stderr",
"text": "/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning: \nA value is trying to be set on a copy of a slice from a DataFrame\n\nSee the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
}
],
"source": "# Find outliers in first column (continuous values)\nprint(find_outliers(X[7]))"
},
{
"execution_count": 51,
"cell_type": "code",
"metadata": {
"scrolled": true
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "0 1.0\n1 6.0\n2 0.0\n3 5.0\n4 0.0\n5 0.0\n6 0.0\n7 0.0\n8 0.0\n9 0.0\n10 0.0\n11 0.0\n12 0.0\n13 0.0\n14 7.0\n15 NaN\n16 3.0\n17 NaN\n18 0.0\n19 7.0\n20 NaN\n21 0.0\n22 6.0\n23 1.0\n24 3.0\n25 2.0\n26 NaN\n27 NaN\n28 3.0\n29 6.0\n ... \n660 0.0\n661 0.0\n662 0.0\n663 0.0\n664 0.0\n665 0.0\n666 0.0\n667 1.0\n668 0.0\n669 0.0\n670 0.0\n671 0.0\n672 0.0\n673 0.0\n674 0.0\n675 0.0\n676 NaN\n677 1.0\n678 0.0\n679 0.0\n680 0.0\n681 0.0\n682 0.0\n683 0.0\n684 0.0\n685 0.0\n686 2.0\n687 1.0\n688 0.0\n689 0.0\nName: 10, Length: 690, dtype: float64\n"
},
{
"output_type": "stream",
"name": "stderr",
"text": "/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning: \nA value is trying to be set on a copy of a slice from a DataFrame\n\nSee the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
}
],
"source": "# Find outliers in first column (continuous values)\nprint(find_outliers(X[10]))"
},
{
"execution_count": 52,
"cell_type": "code",
"metadata": {
"scrolled": true
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "0 202.0\n1 43.0\n2 280.0\n3 100.0\n4 120.0\n5 360.0\n6 164.0\n7 80.0\n8 180.0\n9 52.0\n10 128.0\n11 260.0\n12 0.0\n13 0.0\n14 0.0\n15 320.0\n16 396.0\n17 120.0\n18 0.0\n19 96.0\n20 200.0\n21 300.0\n22 0.0\n23 120.0\n24 145.0\n25 100.0\n26 0.0\n27 0.0\n28 0.0\n29 500.0\n ... \n660 0.0\n661 216.0\n662 160.0\n663 120.0\n664 160.0\n665 0.0\n666 180.0\n667 80.0\n668 252.0\n669 120.0\n670 465.0\n671 0.0\n672 240.0\n673 256.0\n674 260.0\n675 240.0\n676 129.0\n677 100.0\n678 0.0\n679 0.0\n680 280.0\n681 176.0\n682 140.0\n683 240.0\n684 400.0\n685 260.0\n686 200.0\n687 200.0\n688 280.0\n689 0.0\nName: 13, Length: 690, dtype: float64\n"
},
{
"output_type": "stream",
"name": "stderr",
"text": "/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning: \nA value is trying to be set on a copy of a slice from a DataFrame\n\nSee the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
}
],
"source": "# Find outliers in first column (continuous values)\nprint(find_outliers(X[13]))"
},
{
"execution_count": 54,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 54,
"metadata": {},
"data": {
"text/plain": "True"
},
"output_type": "execute_result"
}
],
"source": "# Check for null values\nX.isnull().values.any()"
},
{
"execution_count": 53,
"cell_type": "code",
"metadata": {
"scrolled": true
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "0 0.0\n1 560.0\n2 824.0\n3 3.0\n4 0.0\n5 0.0\n6 NaN\n7 NaN\n8 314.0\n9 NaN\n10 0.0\n11 200.0\n12 0.0\n13 NaN\n14 0.0\n15 0.0\n16 0.0\n17 245.0\n18 0.0\n19 0.0\n20 NaN\n21 0.0\n22 NaN\n23 11.0\n24 0.0\n25 0.0\n26 0.0\n27 0.0\n28 0.0\n29 NaN\n ... \n660 0.0\n661 0.0\n662 0.0\n663 0.0\n664 0.0\n665 0.0\n666 0.0\n667 5.0\n668 0.0\n669 1.0\n670 150.0\n671 2.0\n672 117.0\n673 17.0\n674 246.0\n675 237.0\n676 3.0\n677 1.0\n678 50.0\n679 0.0\n680 364.0\n681 537.0\n682 2.0\n683 3.0\n684 0.0\n685 0.0\n686 394.0\n687 1.0\n688 750.0\n689 0.0\nName: 14, Length: 690, dtype: float64\n"
},
{
"output_type": "stream",
"name": "stderr",
"text": "/opt/conda/envs/DSX-Python35/lib/python3.5/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning: \nA value is trying to be set on a copy of a slice from a DataFrame\n\nSee the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
}
],
"source": "# Find outliers in first column (continuous values)\nprint(find_outliers(X[14]))"
},
{
"execution_count": 56,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 56,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>1</th>\n <th>2</th>\n <th>7</th>\n <th>10</th>\n <th>13</th>\n <th>14</th>\n <th>0_?</th>\n <th>0_a</th>\n <th>0_b</th>\n <th>3_?</th>\n <th>3_l</th>\n <th>3_u</th>\n <th>3_y</th>\n <th>4_?</th>\n <th>4_g</th>\n <th>...</th>\n <th>6_h</th>\n <th>6_j</th>\n <th>6_n</th>\n <th>6_o</th>\n <th>6_v</th>\n <th>6_z</th>\n <th>8_f</th>\n <th>8_t</th>\n <th>9_f</th>\n <th>9_t</th>\n <th>11_f</th>\n <th>11_t</th>\n <th>12_g</th>\n <th>12_p</th>\n <th>12_s</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>30.83</td>\n <td>0.000</td>\n <td>1.250</td>\n <td>1.0</td>\n <td>202.0</td>\n <td>0.000</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>58.67</td>\n <td>4.460</td>\n <td>3.040</td>\n <td>6.0</td>\n <td>43.0</td>\n <td>560.000</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>2</th>\n <td>24.50</td>\n <td>0.500</td>\n <td>1.500</td>\n <td>0.0</td>\n <td>280.0</td>\n <td>824.000</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>1</td>\n 
<td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>27.83</td>\n <td>1.540</td>\n <td>3.750</td>\n <td>5.0</td>\n <td>100.0</td>\n <td>3.000</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>4</th>\n <td>20.17</td>\n <td>5.625</td>\n <td>1.710</td>\n <td>0.0</td>\n <td>120.0</td>\n <td>0.000</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>5</th>\n <td>32.08</td>\n <td>4.000</td>\n <td>2.500</td>\n <td>0.0</td>\n <td>360.0</td>\n <td>0.000</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>6</th>\n <td>33.17</td>\n <td>1.040</td>\n <td>1.362</td>\n <td>0.0</td>\n <td>164.0</td>\n <td>101.047</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n 
<td>0</td>\n </tr>\n <tr>\n <th>7</th>\n <td>22.92</td>\n <td>11.585</td>\n <td>0.040</td>\n <td>0.0</td>\n <td>80.0</td>\n <td>101.047</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>8</th>\n <td>54.42</td>\n <td>0.500</td>\n <td>3.960</td>\n <td>0.0</td>\n <td>180.0</td>\n <td>314.000</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <th>9</th>\n <td>42.50</td>\n <td>4.915</td>\n <td>3.165</td>\n <td>0.0</td>\n <td>52.0</td>\n <td>101.047</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n<p>10 rows \u00d7 51 columns</p>\n</div>",
"text/plain": " 1 2 7 10 13 14 0_? 0_a 0_b 3_? 3_l 3_u \\\n0 30.83 0.000 1.250 1.0 202.0 0.000 0 0 1 0 0 1 \n1 58.67 4.460 3.040 6.0 43.0 560.000 0 1 0 0 0 1 \n2 24.50 0.500 1.500 0.0 280.0 824.000 0 1 0 0 0 1 \n3 27.83 1.540 3.750 5.0 100.0 3.000 0 0 1 0 0 1 \n4 20.17 5.625 1.710 0.0 120.0 0.000 0 0 1 0 0 1 \n5 32.08 4.000 2.500 0.0 360.0 0.000 0 0 1 0 0 1 \n6 33.17 1.040 1.362 0.0 164.0 101.047 0 0 1 0 0 1 \n7 22.92 11.585 0.040 0.0 80.0 101.047 0 1 0 0 0 1 \n8 54.42 0.500 3.960 0.0 180.0 314.000 0 0 1 0 0 0 \n9 42.50 4.915 3.165 0.0 52.0 101.047 0 0 1 0 0 0 \n\n 3_y 4_? 4_g ... 6_h 6_j 6_n 6_o 6_v 6_z 8_f 8_t 9_f 9_t \\\n0 0 0 1 ... 0 0 0 0 1 0 0 1 0 1 \n1 0 0 1 ... 1 0 0 0 0 0 0 1 0 1 \n2 0 0 1 ... 1 0 0 0 0 0 0 1 1 0 \n3 0 0 1 ... 0 0 0 0 1 0 0 1 0 1 \n4 0 0 1 ... 0 0 0 0 1 0 0 1 1 0 \n5 0 0 1 ... 0 0 0 0 1 0 0 1 1 0 \n6 0 0 1 ... 1 0 0 0 0 0 0 1 1 0 \n7 0 0 1 ... 0 0 0 0 1 0 0 1 1 0 \n8 1 0 0 ... 1 0 0 0 0 0 0 1 1 0 \n9 1 0 0 ... 0 0 0 0 1 0 0 1 1 0 \n\n 11_f 11_t 12_g 12_p 12_s \n0 1 0 1 0 0 \n1 1 0 1 0 0 \n2 1 0 1 0 0 \n3 0 1 1 0 0 \n4 1 0 0 0 1 \n5 0 1 1 0 0 \n6 0 1 1 0 0 \n7 1 0 1 0 0 \n8 1 0 1 0 0 \n9 0 1 1 0 0 \n\n[10 rows x 51 columns]"
},
"output_type": "execute_result"
}
],
"source": "# Define the values to replce and the strategy of choosing the replacement value\nsuspected_cols = [1, 2, 7, 10, 13, 14]\nimp = Imputer(missing_values=\"NaN\", strategy=\"mean\")\n\npd.DataFrame(X)[suspected_cols] = imp.fit_transform(pd.DataFrame(X)[suspected_cols])\npd.DataFrame(X).head(10)"
},
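{
"source": "The mean-imputation step can also be sketched with plain pandas. This swaps `fillna` with the column means in place of the `Imputer` class used above, and the toy frame and its column names are made up for illustration.\n\n```python\nimport numpy as np\nimport pandas as pd\n\n# Toy frame with one missing value per column\ntoy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [10.0, 20.0, np.nan]})\n\n# DataFrame.mean skips NaN by default, so each mean uses the observed values only\nfilled = toy.fillna(toy.mean())\nprint(filled)\n\n# No missing values should remain after imputation\nprint(filled.isnull().values.any())\n```\n\nMean imputation keeps every row, but it shrinks the variance of the imputed columns, which is worth keeping in mind when many values were flagged as outliers.",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
}
},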
{
"execution_count": 57,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"execution_count": 57,
"metadata": {},
"data": {
"text/plain": "False"
},
"output_type": "execute_result"
}
],
"source": "# Check for null values\npd.DataFrame(X).isnull().values.any()"
},
{
"source": "### 9. Feature Engineering",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": 108,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"execution_count": 108,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>1</th>\n <th>2</th>\n <th>7</th>\n <th>10</th>\n <th>14</th>\n <th>3_u</th>\n <th>3_y</th>\n <th>4_g</th>\n <th>4_p</th>\n <th>5_cc</th>\n <th>5_ff</th>\n <th>5_i</th>\n <th>5_q</th>\n <th>5_x</th>\n <th>6_ff</th>\n <th>6_h</th>\n <th>8_f</th>\n <th>8_t</th>\n <th>9_f</th>\n <th>9_t</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>30.83</td>\n <td>0.000</td>\n <td>1.250</td>\n <td>1.0</td>\n <td>0.000</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>1</th>\n <td>58.67</td>\n <td>4.460</td>\n <td>3.040</td>\n <td>6.0</td>\n <td>560.000</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>2</th>\n <td>24.50</td>\n <td>0.500</td>\n <td>1.500</td>\n <td>0.0</td>\n <td>824.000</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>3</th>\n <td>27.83</td>\n <td>1.540</td>\n <td>3.750</td>\n <td>5.0</td>\n <td>3.000</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <th>4</th>\n <td>20.17</td>\n <td>5.625</td>\n <td>1.710</td>\n <td>0.0</td>\n <td>0.000</td>\n <td>1</td>\n 
<td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>5</th>\n <td>32.08</td>\n <td>4.000</td>\n <td>2.500</td>\n <td>0.0</td>\n <td>0.000</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>6</th>\n <td>33.17</td>\n <td>1.040</td>\n <td>1.362</td>\n <td>0.0</td>\n <td>101.047</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>7</th>\n <td>22.92</td>\n <td>11.585</td>\n <td>0.040</td>\n <td>0.0</td>\n <td>101.047</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>8</th>\n <td>54.42</td>\n <td>0.500</td>\n <td>3.960</td>\n <td>0.0</td>\n <td>314.000</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <th>9</th>\n <td>42.50</td>\n <td>4.915</td>\n <td>3.165</td>\n <td>0.0</td>\n <td>101.047</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " 1 2 7 10 14 3_u 3_y 4_g 4_p 5_cc 5_ff 5_i \\\n0 30.83 0.000 1.250 1.0 0.000 1 0 1 0 0 0 0 \n1 58.67 4.460 3.040 6.0 560.000 1 0 1 0 0 0 0 \n2 24.50 0.500 1.500 0.0 824.000 1 0 1 0 0 0 0 \n3 27.83 1.540 3.750 5.0 3.000 1 0 1 0 0 0 0 \n4 20.17 5.625 1.710 0.0 0.000 1 0 1 0 0 0 0 \n5 32.08 4.000 2.500 0.0 0.000 1 0 1 0 0 0 0 \n6 33.17 1.040 1.362 0.0 101.047 1 0 1 0 0 0 0 \n7 22.92 11.585 0.040 0.0 101.047 1 0 1 0 1 0 0 \n8 54.42 0.500 3.960 0.0 314.000 0 1 0 1 0 0 0 \n9 42.50 4.915 3.165 0.0 101.047 0 1 0 1 0 0 0 \n\n 5_q 5_x 6_ff 6_h 8_f 8_t 9_f 9_t \n0 0 0 0 0 0 1 0 1 \n1 1 0 0 1 0 1 0 1 \n2 1 0 0 1 0 1 1 0 \n3 0 0 0 0 0 1 0 1 \n4 0 0 0 0 0 1 1 0 \n5 0 0 0 0 0 1 1 0 \n6 0 0 0 1 0 1 1 0 \n7 0 0 0 0 0 1 1 0 \n8 0 0 0 1 0 1 1 0 \n9 0 0 0 0 0 1 1 0 "
},
"output_type": "execute_result"
}
],
"source": "# Select best features\nselect = sklearn.feature_selection.SelectKBest(k=20)\nselected_features = select.fit(X, y_enc)\nindexes = selected_features.get_support(indices=True)\ncol_names_selected = [pd.DataFrame(X).columns[i] for i in indexes]\n\nX_selected = pd.DataFrame(X)[col_names_selected]\npd.DataFrame(X_selected).head(10)"
},
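{
"source": "A small sketch of how `SelectKBest` ranks features, using synthetic data rather than the credit dataset. One column is built to track the label closely and three are pure noise; all names and values here are assumptions made for illustration.\n\n```python\nimport numpy as np\nfrom sklearn.feature_selection import SelectKBest, f_classif\n\nrng = np.random.RandomState(0)\n\n# Binary labels and one feature that follows them closely\ny_toy = np.array([0, 1] * 20)\ninformative = y_toy + rng.normal(scale=0.1, size=40)\n\n# Three columns of noise that carry no information about the label\nnoise = rng.normal(size=(40, 3))\nX_toy = np.column_stack([informative, noise])\n\n# Keep the single feature with the highest ANOVA F-score\nselector = SelectKBest(score_func=f_classif, k=1).fit(X_toy, y_toy)\nkept = selector.get_support(indices=True)\nprint(kept)\nprint(selector.scores_)\n```\n\nThe F-test scores each feature on its own, so it can miss features that only matter in combination with others.",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
}
},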
{
"source": "### 10. Split our dataset into train and test datasets",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"source": "#### Split non-preprocessed data",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": 59,
"cell_type": "code",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "(483, 51) (483,)\n(207, 51) (207,)\n"
}
],
"source": "X_train_npp, X_test_npp, y_train_npp, y_test_npp = train_test_split(X_npp, y_enc,\\\n test_size=0.3, random_state=42)\nprint(X_train_npp.shape, y_train_npp.shape)\nprint(X_test_npp.shape, y_test_npp.shape)"
},
{
"execution_count": 109,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "(483, 20) (483,)\n(207, 20) (207,)\n"
}
],
"source": "X_train, X_test, y_train, y_test = train_test_split(X_selected, y_enc,\\\n test_size=0.3, random_state=42)\nprint(X_train.shape, y_train.shape)\nprint(X_test.shape, y_test.shape)"
},
{
"source": "### 11. Scale our data",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": 110,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"execution_count": 110,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>1</th>\n <th>2</th>\n <th>7</th>\n <th>10</th>\n <th>14</th>\n <th>3_u</th>\n <th>3_y</th>\n <th>4_g</th>\n <th>4_p</th>\n <th>5_cc</th>\n <th>5_ff</th>\n <th>5_i</th>\n <th>5_q</th>\n <th>5_x</th>\n <th>6_ff</th>\n <th>6_h</th>\n <th>8_f</th>\n <th>8_t</th>\n <th>9_f</th>\n <th>9_t</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>-1.184</td>\n <td>1.234</td>\n <td>-0.084</td>\n <td>1.625</td>\n <td>2.406</td>\n <td>0.600</td>\n <td>-0.578</td>\n <td>0.600</td>\n <td>-0.578</td>\n <td>-0.234</td>\n <td>-0.3</td>\n <td>-0.305</td>\n <td>2.790</td>\n <td>-0.224</td>\n <td>-0.313</td>\n <td>-0.492</td>\n <td>-0.977</td>\n <td>0.977</td>\n <td>-1.135</td>\n <td>1.135</td>\n </tr>\n <tr>\n <th>1</th>\n <td>-1.314</td>\n <td>-1.034</td>\n <td>-0.881</td>\n <td>-0.607</td>\n <td>3.407</td>\n <td>-1.665</td>\n <td>1.730</td>\n <td>-1.665</td>\n <td>1.730</td>\n <td>-0.234</td>\n <td>-0.3</td>\n <td>-0.305</td>\n <td>-0.358</td>\n <td>-0.224</td>\n <td>-0.313</td>\n <td>-0.492</td>\n <td>1.023</td>\n <td>-1.023</td>\n <td>0.881</td>\n <td>-0.881</td>\n </tr>\n <tr>\n <th>2</th>\n <td>-0.785</td>\n <td>1.790</td>\n <td>0.113</td>\n <td>-0.607</td>\n <td>-0.009</td>\n <td>-1.665</td>\n <td>1.730</td>\n <td>-1.665</td>\n <td>1.730</td>\n <td>-0.234</td>\n <td>-0.3</td>\n <td>-0.305</td>\n <td>-0.358</td>\n <td>-0.224</td>\n <td>-0.313</td>\n <td>-0.492</td>\n <td>1.023</td>\n <td>-1.023</td>\n <td>0.881</td>\n <td>-0.881</td>\n </tr>\n <tr>\n <th>3</th>\n <td>1.240</td>\n <td>0.017</td>\n <td>-0.765</td>\n <td>-0.049</td>\n <td>-0.069</td>\n <td>0.600</td>\n <td>-0.578</td>\n <td>0.600</td>\n <td>-0.578</td>\n <td>-0.234</td>\n <td>-0.3</td>\n 
<td>-0.305</td>\n <td>2.790</td>\n <td>-0.224</td>\n <td>-0.313</td>\n <td>2.034</td>\n <td>-0.977</td>\n <td>0.977</td>\n <td>-1.135</td>\n <td>1.135</td>\n </tr>\n <tr>\n <th>4</th>\n <td>-1.314</td>\n <td>-0.993</td>\n <td>-0.680</td>\n <td>1.625</td>\n <td>-0.521</td>\n <td>0.600</td>\n <td>-0.578</td>\n <td>0.600</td>\n <td>-0.578</td>\n <td>-0.234</td>\n <td>-0.3</td>\n <td>-0.305</td>\n <td>2.790</td>\n <td>-0.224</td>\n <td>-0.313</td>\n <td>-0.492</td>\n <td>1.023</td>\n <td>-1.023</td>\n <td>-1.135</td>\n <td>1.135</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " 1 2 7 10 14 3_u 3_y 4_g 4_p 5_cc 5_ff \\\n0 -1.184 1.234 -0.084 1.625 2.406 0.600 -0.578 0.600 -0.578 -0.234 -0.3 \n1 -1.314 -1.034 -0.881 -0.607 3.407 -1.665 1.730 -1.665 1.730 -0.234 -0.3 \n2 -0.785 1.790 0.113 -0.607 -0.009 -1.665 1.730 -1.665 1.730 -0.234 -0.3 \n3 1.240 0.017 -0.765 -0.049 -0.069 0.600 -0.578 0.600 -0.578 -0.234 -0.3 \n4 -1.314 -0.993 -0.680 1.625 -0.521 0.600 -0.578 0.600 -0.578 -0.234 -0.3 \n\n 5_i 5_q 5_x 6_ff 6_h 8_f 8_t 9_f 9_t \n0 -0.305 2.790 -0.224 -0.313 -0.492 -0.977 0.977 -1.135 1.135 \n1 -0.305 -0.358 -0.224 -0.313 -0.492 1.023 -1.023 0.881 -0.881 \n2 -0.305 -0.358 -0.224 -0.313 -0.492 1.023 -1.023 0.881 -0.881 \n3 -0.305 2.790 -0.224 -0.313 2.034 -0.977 0.977 -1.135 1.135 \n4 -0.305 2.790 -0.224 -0.313 -0.492 1.023 -1.023 -1.135 1.135 "
},
"output_type": "execute_result"
}
],
"source": "# Use StandardScaler\nscaler = preprocessing.StandardScaler().fit(X_train, y_train)\nX_train_scaled = scaler.transform(X_train)\n\npd.DataFrame(X_train_scaled, columns=pd.DataFrame(X_train).columns).head()"
},
{
"execution_count": 111,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"execution_count": 111,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>0</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1</td>\n </tr>\n <tr>\n <th>2</th>\n <td>1</td>\n </tr>\n <tr>\n <th>3</th>\n <td>1</td>\n </tr>\n <tr>\n <th>4</th>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " 0\n0 0\n1 1\n2 1\n3 1\n4 1"
},
"output_type": "execute_result"
}
],
"source": "pd.DataFrame(y_train).head()"
},
{
"source": "### 12. Start building a classifier",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"source": "#### Logestic Regression on non-preprocessed data",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": 64,
"cell_type": "code",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"execution_count": 64,
"metadata": {},
"data": {
"text/plain": "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n verbose=0, warm_start=False)"
},
"output_type": "execute_result"
}
],
"source": "from sklearn.linear_model import LogisticRegression\n\nclf_lr_npp = LogisticRegression()\nclf_lr_npp.fit(X_train_npp, y_train_npp)"
},
{
"source": "#### Logestic Regression on preprocessed data",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": 112,
"cell_type": "code",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"execution_count": 112,
"metadata": {},
"data": {
"text/plain": "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n verbose=0, warm_start=False)"
},
"output_type": "execute_result"
}
],
"source": "from sklearn.linear_model import LogisticRegression\n\nclf_lr = LogisticRegression()\nmodel = clf_lr.fit(X_train_scaled, y_train)\nmodel"
},
{
"source": "### 13. Evaluate our model",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": 116,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"execution_count": 116,
"metadata": {},
"data": {
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>1</th>\n <th>2</th>\n <th>7</th>\n <th>10</th>\n <th>14</th>\n <th>3_u</th>\n <th>3_y</th>\n <th>4_g</th>\n <th>4_p</th>\n <th>5_cc</th>\n <th>5_ff</th>\n <th>5_i</th>\n <th>5_q</th>\n <th>5_x</th>\n <th>6_ff</th>\n <th>6_h</th>\n <th>8_f</th>\n <th>8_t</th>\n <th>9_f</th>\n <th>9_t</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>0.100</td>\n <td>-0.684</td>\n <td>-0.908</td>\n <td>0.509</td>\n <td>0.013</td>\n <td>0.600</td>\n <td>-0.578</td>\n <td>0.600</td>\n <td>-0.578</td>\n <td>-0.234</td>\n <td>3.328</td>\n <td>-0.305</td>\n <td>-0.358</td>\n <td>-0.224</td>\n <td>3.199</td>\n <td>-0.492</td>\n <td>1.023</td>\n <td>-1.023</td>\n <td>-1.135</td>\n <td>1.135</td>\n </tr>\n <tr>\n <th>1</th>\n <td>1.508</td>\n <td>-0.065</td>\n <td>-0.908</td>\n <td>-0.607</td>\n <td>4.717</td>\n <td>0.600</td>\n <td>-0.578</td>\n <td>0.600</td>\n <td>-0.578</td>\n <td>-0.234</td>\n <td>-0.300</td>\n <td>-0.305</td>\n <td>-0.358</td>\n <td>-0.224</td>\n <td>-0.313</td>\n <td>-0.492</td>\n <td>-0.977</td>\n <td>0.977</td>\n <td>0.881</td>\n <td>-0.881</td>\n </tr>\n <tr>\n <th>2</th>\n <td>-1.029</td>\n <td>-1.055</td>\n <td>-0.568</td>\n <td>-0.607</td>\n <td>-0.565</td>\n <td>0.600</td>\n <td>-0.578</td>\n <td>0.600</td>\n <td>-0.578</td>\n <td>-0.234</td>\n <td>-0.300</td>\n <td>-0.305</td>\n <td>-0.358</td>\n <td>-0.224</td>\n <td>-0.313</td>\n <td>-0.492</td>\n <td>1.023</td>\n <td>-1.023</td>\n <td>0.881</td>\n <td>-0.881</td>\n </tr>\n <tr>\n <th>3</th>\n <td>1.638</td>\n <td>0.553</td>\n <td>-0.227</td>\n <td>-0.607</td>\n <td>0.690</td>\n <td>0.600</td>\n <td>-0.578</td>\n <td>0.600</td>\n <td>-0.578</td>\n <td>-0.234</td>\n 
<td>-0.300</td>\n <td>-0.305</td>\n <td>-0.358</td>\n <td>-0.224</td>\n <td>-0.313</td>\n <td>-0.492</td>\n <td>1.023</td>\n <td>-1.023</td>\n <td>0.881</td>\n <td>-0.881</td>\n </tr>\n <tr>\n <th>4</th>\n <td>-1.110</td>\n <td>-1.055</td>\n <td>-0.908</td>\n <td>-0.607</td>\n <td>-0.559</td>\n <td>-1.665</td>\n <td>1.730</td>\n <td>-1.665</td>\n <td>1.730</td>\n <td>-0.234</td>\n <td>-0.300</td>\n <td>-0.305</td>\n <td>-0.358</td>\n <td>-0.224</td>\n <td>-0.313</td>\n <td>-0.492</td>\n <td>1.023</td>\n <td>-1.023</td>\n <td>0.881</td>\n <td>-0.881</td>\n </tr>\n </tbody>\n</table>\n</div>",
"text/plain": " 1 2 7 10 14 3_u 3_y 4_g 4_p 5_cc \\\n0 0.100 -0.684 -0.908 0.509 0.013 0.600 -0.578 0.600 -0.578 -0.234 \n1 1.508 -0.065 -0.908 -0.607 4.717 0.600 -0.578 0.600 -0.578 -0.234 \n2 -1.029 -1.055 -0.568 -0.607 -0.565 0.600 -0.578 0.600 -0.578 -0.234 \n3 1.638 0.553 -0.227 -0.607 0.690 0.600 -0.578 0.600 -0.578 -0.234 \n4 -1.110 -1.055 -0.908 -0.607 -0.559 -1.665 1.730 -1.665 1.730 -0.234 \n\n 5_ff 5_i 5_q 5_x 6_ff 6_h 8_f 8_t 9_f 9_t \n0 3.328 -0.305 -0.358 -0.224 3.199 -0.492 1.023 -1.023 -1.135 1.135 \n1 -0.300 -0.305 -0.358 -0.224 -0.313 -0.492 -0.977 0.977 0.881 -0.881 \n2 -0.300 -0.305 -0.358 -0.224 -0.313 -0.492 1.023 -1.023 0.881 -0.881 \n3 -0.300 -0.305 -0.358 -0.224 -0.313 -0.492 1.023 -1.023 0.881 -0.881 \n4 -0.300 -0.305 -0.358 -0.224 -0.313 -0.492 1.023 -1.023 0.881 -0.881 "
},
"output_type": "execute_result"
}
],
"source": "# Use the scaler fit on trained data to scale our test data\nX_test_scaled = scaler.transform(X_test)\npd.DataFrame(X_test_scaled, columns=pd.DataFrame(X_train).columns).head()"
},
{
"source": "#### Evaluate Logistic Regression on non-preprocessed data",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": 113,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"execution_count": 113,
"metadata": {},
"data": {
"text/plain": "array([ 3.79114351e+00, -1.53289391e+00, 3.59750686e+00,\n 2.63899790e+00, 5.34015063e+00, 1.91329086e-01,\n 2.98874665e+00, 8.67871930e-02, 3.58419649e+00,\n 3.70771837e+00, 2.70176737e+00, -1.66358540e+00,\n 3.30947758e+00, -2.16630789e-01, 1.65493314e+00,\n -9.88884781e-01, 3.08195370e+00, 4.06404098e+00,\n 1.72760092e+00, 2.24916855e+00, 2.84404228e+00,\n 3.23920898e+00, 3.90750984e+00, -9.35919037e-01,\n 2.75681502e+00, 4.32138435e+00, -3.96619809e+00,\n -2.43401740e+00, 3.17308702e+00, 3.08414031e+00,\n -1.36637422e+00, -4.38155190e+00, -1.47865022e+00,\n -1.53280871e+00, -5.89477010e-03, -3.27439787e+00,\n -3.23184603e+00, -1.70721642e+00, -1.24822963e+00,\n -3.89545810e+01, -2.22081486e+00, -2.03776823e+00,\n 3.28656327e+00, -1.41177687e+00, 3.87553319e+00,\n -5.86624065e+00, 2.55288734e+00, 5.04640615e+00,\n 3.70524984e-01, 3.67618059e+00, 4.14496605e+00,\n -1.06645262e+00, 3.26419631e+00, 2.94093624e+00,\n -1.38959553e+00, 3.30948662e+00, -2.59556398e+00,\n 1.88674146e+00, -1.95916853e+00, 3.09434849e+00,\n -4.07223521e+00, -1.51584072e+00, -1.98688188e+00,\n 4.26446591e-01, 3.34786968e+00, -2.91758212e+00,\n -3.96169228e+00, 9.55569194e-02, -1.08368095e+00,\n 2.97308040e+00, 2.48778124e+00, -1.32618683e+00,\n 3.27962750e+00, -5.24952660e-01, 3.50150442e+00,\n 3.77807493e+00, 2.89412552e+00, 6.84010685e-01,\n -1.24066814e+00, 2.65344525e+00, -2.24530489e+00,\n -2.54550656e+00, 2.97594603e+00, 2.86170658e+00,\n -1.30793237e+00, 2.99635702e+00, -1.52276649e+00,\n -3.55282890e+00, -1.00235837e+00, -1.97330036e+00,\n -7.45244467e-01, 1.41793056e-01, -5.87737322e+00,\n -1.99269067e+00, 5.69623798e-01, -1.21480780e+00,\n -9.50015972e+00, -2.38241709e-01, 3.17514825e+00,\n 5.36445225e+00, 3.62110744e+00, 2.03296351e+00,\n -1.29695567e+00, 1.04823057e+00, -3.82401680e+01,\n -9.51642550e-01, 3.26862368e+00, -3.91844392e+00,\n 4.35637354e+00, -2.12894225e+00, 5.03956329e+00,\n -2.22947647e+00, -2.41773267e+01, 1.22531146e-01,\n -8.84799954e-01, 
2.97010078e+00, -5.24369939e+00,\n -2.80920674e+00, -1.24332530e+00, -4.51715884e+00,\n 3.18712038e+00, 2.09019214e+00, -2.05962173e+00,\n 1.88170191e+00, 4.75400258e+00, 5.29170997e-01,\n 3.28973175e+00, 2.73475574e+00, -1.97119683e+00,\n -3.95391398e+00, -2.46562994e+00, 3.52438955e+00,\n -1.34915306e+00, 2.94900993e+00, -3.33782465e+00,\n -1.49319261e+00, -1.31290712e+00, -1.62866105e+00,\n 6.08706078e+00, -1.68378989e+00, -4.86912903e-01,\n -9.23741589e-02, -2.95888858e+00, -2.26403654e+00,\n 3.46684559e+00, 1.90960822e+00, 4.23758334e+00,\n -3.53997054e-01, -4.38350526e-01, 3.63497785e+00,\n 4.36102280e+00, 2.02311100e+00, 3.89632321e+00,\n -1.68652434e+00, 2.09286610e+00, -3.19852344e+00,\n -9.55022951e+00, -6.46570592e-01, 3.15298232e+00,\n 1.11557156e+00, -2.78222787e+00, 3.22151345e+00,\n 3.26719016e+00, 4.48931011e+00, 3.58667747e+00,\n 3.91830889e+00, 3.27661986e+00, -3.83761354e+00,\n 3.63008882e+00, -7.53135970e-01, 2.66739640e+00,\n 3.94169192e+00, -1.72347787e-01, 3.86725662e+00,\n -9.75240767e-02, 3.81199897e+00, 3.63741719e+00,\n 4.16426730e+00, 2.84012076e+00, 3.82328514e+00,\n 2.21065287e+00, -1.31928430e+00, -1.06864401e+00,\n 2.84181981e+00, -2.58833779e+00, 2.79809424e+00,\n -4.29753873e-01, -2.08858490e+00, -9.42408558e-01,\n -1.42794764e+00, 2.96748129e+00, -2.08301485e+00,\n 3.85207430e+00, -2.90721949e+00, -2.14365352e+00,\n -3.11438657e+00, -3.09122742e+00, -2.51190272e+00,\n -1.92046333e+00, -1.49322540e+00, -9.52598854e+00,\n -3.04087652e+00, 1.99685060e+00, -3.01345380e+00,\n -1.13705121e+00, 2.48054301e+00, -3.68041180e+00])"
},
"output_type": "execute_result"
}
],
"source": "y_score_lr_npp = clf_lr_npp.decision_function(X_test_npp)\ny_score_lr_npp"
},
{
"execution_count": 114,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "0.840579710145\n"
}
],
"source": "# Get accuracy score\nfrom sklearn.metrics import accuracy_score\n\ny_pred_lr_npp = clf_lr_npp.predict(X_test_npp)\nacc_lr_npp = accuracy_score(y_test_npp, y_pred_lr_npp)\nprint(acc_lr_npp)"
},
{
"execution_count": 69,
"cell_type": "code",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "Average precision-recall score: 0.90\n"
}
],
"source": "# Get Precision vs. Recall score\nfrom sklearn.metrics import average_precision_score\n\naverage_precision_lr_npp = average_precision_score(y_test_npp, y_score_lr_npp)\n\nprint('Average precision-recall score: {0:0.2f}'.format(\n average_precision_lr_npp))"
},
{
"source": "#### Evaluate Logistic Regression on preprocessed data",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": 117,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"execution_count": 117,
"metadata": {},
"data": {
"text/plain": "array([ 3.56748043, -2.73442698, 3.8015596 , 2.35479139, 4.59269848,\n -0.8607685 , 2.88428431, -0.91354643, 3.84927625, 3.22768285,\n 3.58946094, -2.71929997, 3.25251482, -1.07305078, 2.51828729,\n -1.64178561, 2.13600714, 3.95220877, 0.79395167, 1.71794504,\n 2.59317613, 1.94079023, 4.50506087, -2.90542169, 2.82553382,\n 3.81591255, -2.09719645, -2.16097604, 4.36303526, 3.27114743,\n -3.09495293, -5.35336695, -1.22127421, -1.27589192, -0.27360866,\n -1.57714257, -1.38991232, -3.80568844, -1.46912589, -2.48294848,\n -3.77811265, -2.14891293, 3.48721316, -1.34557765, 2.97620065,\n -4.42509853, 2.46208432, 4.43784814, -0.26272097, 4.17192989,\n 4.59461887, -1.11900003, 3.50150034, 3.13145608, -1.05864704,\n 3.72408428, -2.92307375, 3.94432024, -1.03795131, 3.72260601,\n -4.55291223, -2.54102105, -1.77485624, 0.70497922, 4.02256182,\n -1.85814971, -3.46653777, -0.3313209 , -2.06875843, 3.46391108,\n 1.29383252, -2.0739646 , 2.70434366, -0.54400562, 3.701686 ,\n 3.90699329, 3.45530915, -0.05884862, -1.90125582, 2.00635753,\n -1.05812054, -4.12532071, 2.69824479, 2.85481591, -1.92293673,\n 2.09366194, -2.44854461, -2.79414406, -2.87179815, -5.54263547,\n -1.76406601, 0.08337258, -1.66297942, -0.63680903, -0.63439523,\n -0.78314549, -0.58262445, -0.40301226, 3.34660656, 5.82457794,\n 4.93722522, -0.4362406 , -3.5896711 , 1.42747008, -3.40840264,\n -1.11702216, 3.78606668, -2.72809978, 4.51490872, -1.64363597,\n 4.24570329, -4.84659164, -0.88499044, -0.86634338, -0.49400349,\n 2.64660804, -4.70446673, -4.48724374, 2.23548011, -4.02540764,\n 3.06718513, 4.91616963, -3.27630201, 1.22469856, 4.68838084,\n -0.31590368, 4.01015405, 1.94124362, -1.62680525, -3.21102093,\n -2.64277405, 3.12815903, -1.34116763, 3.11350676, -2.5876485 ,\n -1.42301576, -2.46891388, -1.06354279, 5.6124747 , -2.21935447,\n -0.35202978, -0.11024982, -2.40132195, -3.29498563, 3.97879929,\n 1.57676641, 4.5991717 , -0.48738402, -0.16322052, 3.48981937,\n 3.14407624, 3.1549296 , 
3.11965969, -3.97812189, 2.18259764,\n -3.88261848, -3.24564837, -0.62834504, 4.42187855, 2.22448123,\n -4.28131873, 4.17346056, 3.83862932, 4.56658714, 3.65156776,\n 3.9966952 , 3.73771057, -1.75206628, 4.25401689, -1.49548997,\n 3.4709233 , 3.68178493, -0.20239066, 3.63625129, 0.36854924,\n 3.25546421, 4.21500234, 3.84720572, 2.58168781, 3.74202326,\n 2.22529624, -1.70873245, -1.39653903, 1.78747374, -2.98883551,\n 1.86003179, 0.11802022, -1.72926967, -0.58968377, -3.25734618,\n 3.44094246, -3.01745767, 4.02030767, -1.07055508, -0.9793049 ,\n -3.03175641, -2.55925347, -2.13823871, -1.76792699, -0.76730825,\n -1.80128357, -2.07730882, 3.99216771, -2.94769407, -1.86277713,\n 3.27253696, -1.90516603])"
},
"output_type": "execute_result"
}
],
"source": "y_score_lr = clf_lr.decision_function(X_test_scaled)\ny_score_lr"
},
{
"execution_count": 118,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "0.850241545894\n"
}
],
"source": "y_pred_lr = clf_lr.predict(X_test_scaled)\nacc_lr = accuracy_score(y_test, y_pred_lr)\nprint(acc_lr)"
},
{
"execution_count": 119,
"cell_type": "code",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "Average precision-recall score: 0.90\n"
}
],
"source": "average_precision_lr = average_precision_score(y_test, y_score_lr)\n\nprint('Average precision-recall score: {0:0.2f}'.format(\n average_precision_lr))"
},
{
"source": "### 14. ROC Curve and models comparisons",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": 122,
"cell_type": "code",
"metadata": {
"scrolled": true,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"execution_count": 122,
"metadata": {},
"data": {
"text/plain": "Text(0,0.5,'True Positives')"
},
"output_type": "execute_result"
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": "<matplotlib.figure.Figure at 0x2ac2845384e0>"
},
"metadata": {}
}
],
"source": "# Plot SVC ROC Curve\nplt.figure(0, figsize=(15,10)).clf()\n\nfpr_lr_npp, tpr_lr_npp, thresh_lr_npp = metrics.roc_curve(y_test_npp, y_score_lr_npp)\nauc_lr_npp = metrics.roc_auc_score(y_test_npp, y_score_lr_npp)\nplt.plot(fpr_lr_npp, tpr_lr_npp, label=\"Logistic Regression on Non-preprocessed Data, auc=\" + str(auc_lr_npp))\n\nfpr_lr, tpr_lr, thresh_lr = metrics.roc_curve(y_test, y_score_lr)\nauc_lr = metrics.roc_auc_score(y_test, y_score_lr)\nplt.plot(fpr_lr, tpr_lr, label=\"Logistic Regression on Preprocessed Data, auc=\" + str(auc_lr))\n\nplt.legend(loc=0)\nplt.xlabel('False Positives')\nplt.ylabel('True Positives')"
},
{
"source": "#### Bonus: Deploy model on the cloud using IBM Watson Machine Learning\n\nWe have our model, but we want to use it through multiple apps. A solution is to deploy it on the cloud as an endpoint (url) and send data collected from a web/mobile app as a REST API call with data sent in the form of a JSON request.",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": 123,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "# The code was removed by DSX for sharing."
},
{
"execution_count": 132,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "Bearer eyJhbGciOiJSUzUxMiIsInR5cCI6IkpXVCJ9.eyJ0ZW5hbnRJZCI6IjQ2MWViYWMyLWNlOGUtNDRlNi1iM2I5LWI2ZjQyYTVmMzFiNiIsImluc3RhbmNlSWQiOiI0NjFlYmFjMi1jZThlLTQ0ZTYtYjNiOS1iNmY0MmE1ZjMxYjYiLCJwbGFuSWQiOiIzZjZhY2Y0My1lZGU4LTQxM2EtYWM2OS1mOGFmM2JiMGNiZmUiLCJyZWdpb24iOiJ1cy1zb3V0aCIsInVzZXJJZCI6IjM1MDRlODgyLWI1NDktNGQwNi04ZWM5LTYxNmI2MjRiYjljYiIsImlzcyI6Imh0dHBzOi8vaWJtLXdhdHNvbi1tbC5teWJsdWVtaXgubmV0L3YzL2lkZW50aXR5IiwiaWF0IjoxNTI1OTc3MTE4LCJleHAiOjE1MjYwMDU5MTh9.ogsHrN01ijtqnIlvpFNu4naVPXqz6ByMik3umBqAToVC9VG3ccMGNniSoKwnQoIPwHYiplr319r5Ey09ciADx_ri4-sBaHR3KIspQuI8o_GMX5IFikgn-JXFKZNMffVAcsMMiDq3cmnKxtxc-cxXKe4vmvr7anxpEAXiViZbkbJNRLaYJbp4JTB8eSrllXSiCAAmnFTQjNaJSbuXEYu7IXlbMRcp20X0iq56L4snKhsAI_A5qmLkjNi6FNlOc1dNifktj3GOT0BnDR6-QSQ9o-Rngwdik8kGUxpg6Mv4JIp_I7kFDevoz4WQ68CIToMQouMkILK0tx6mbUx-ObeY2A\n"
}
],
"source": "# To work with the Watson Machine Learning REST API you must generate a Bearer access token\nimport urllib3, requests, json\n\nheaders = urllib3.util.make_headers(basic_auth='{}:{}'.format(credentials['username'], credentials['password']))\nurl = '{}/v3/identity/token'.format(credentials['url'])\nresponse = requests.get(url, headers=headers)\nml_token = 'Bearer ' + json.loads(response.text).get('token')\nprint(ml_token)"
},
{
"execution_count": 125,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "<Response [200]>\n{\"metadata\":{\"guid\":\"461ebac2-ce8e-44e6-b3b9-b6f42a5f31b6\",\"url\":\"https://instances/v3/wml_instances/461ebac2-ce8e-44e6-b3b9-b6f42a5f31b6\",\"created_at\":\"2018-03-29T16:59:06.075Z\",\"modified_at\":\"2018-03-29T16:59:06.075Z\"},\"entity\":{\"source\":\"Bluemix\",\"published_models\":{\"url\":\"https://ibm-watson-ml.mybluemix.net/v3/wml_instances/461ebac2-ce8e-44e6-b3b9-b6f42a5f31b6/published_models\"},\"usage\":{\"expiration_date\":\"2018-06-01T00:00:00.000Z\",\"computation_time\":{\"limit\":180000,\"current\":0},\"model_count\":{\"limit\":200,\"current\":4},\"prediction_count\":{\"limit\":5000,\"current\":3},\"gpu_count\":{\"limit\":8,\"current\":0},\"capacity_units\":{\"limit\":180000000,\"current\":17},\"deployment_count\":{\"limit\":5,\"current\":5}},\"plan_id\":\"3f6acf43-ede8-413a-ac69-f8af3bb0cbfe\",\"status\":\"Active\",\"organization_guid\":\"acec7554-82ac-49c0-a1d1-2f6803ce2b02\",\"region\":\"us-south\",\"account\":{\"id\":\"13bdb8509a2f1e6aa4bf611f8673a191\",\"name\":\"Heba El-Shimy's Account\",\"type\":\"TRIAL\"},\"owner\":{\"ibm_id\":\"50RX9K19A7\",\"email\":\"Heba.Elshimy1@ibm.com\",\"user_id\":\"45e8c98a-51f4-420b-9202-ecc25050fbb9\",\"country_code\":\"ARE\",\"beta_user\":true},\"deployments\":{\"url\":\"https://ibm-watson-ml.mybluemix.net/v3/wml_instances/461ebac2-ce8e-44e6-b3b9-b6f42a5f31b6/deployments\"},\"space_guid\":\"689631fe-8bef-4f8a-a515-96f7ce010036\",\"plan\":\"lite\"}}\n"
}
],
"source": "# Create an online scoring endpoint\n\nendpoint_instance = credentials['url'] + \"/v3/wml_instances/\" + credentials['instance_id']\nheader = {'Content-Type': 'application/json', 'Authorization': ml_token}\n\nresponse_get_instance = requests.get(endpoint_instance, headers=header)\nprint(response_get_instance)\nprint(response_get_instance.text)"
},
{
"execution_count": 126,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "# Create API client\n\nfrom watson_machine_learning_client import WatsonMachineLearningAPIClient\n\nclient = WatsonMachineLearningAPIClient(credentials)"
},
{
"execution_count": 127,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "# Publish model in Watson Machine Learning repository on Cloud\n\nmodel_props = {client.repository.ModelMetaNames.AUTHOR_NAME: \"Heba El-Shimy\", \n client.repository.ModelMetaNames.NAME: \"Credit Card Approval Model\"}"
},
{
"execution_count": 128,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": "published_model = client.repository.store_model(model=model, meta_props=model_props, \\\n training_data=X_train_scaled, training_target=y_train)"
},
{
"execution_count": 129,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "\n\n#######################################################################################\n\nSynchronous deployment creation for uid: 'b76d078b-667f-4333-96a6-a63a42befecf' started\n\n#######################################################################################\n\n\nINITIALIZING\nDEPLOY_SUCCESS\n\n\n------------------------------------------------------------------------------------------------\nSuccessfully finished deployment creation, deployment_uid='b76d078b-667f-4333-96a6-a63a42befecf'\n------------------------------------------------------------------------------------------------\n\n\n"
}
],
"source": "# Create model deployment\n\npublished_model_uid = client.repository.get_model_uid(published_model)\ncreated_deployment = client.deployments.create(published_model_uid, \"Deployment of Credit Card Approval Model\")"
},
{
"execution_count": 130,
"cell_type": "code",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "https://ibm-watson-ml.mybluemix.net/v3/wml_instances/461ebac2-ce8e-44e6-b3b9-b6f42a5f31b6/published_models/c76ac134-1259-440b-aab6-1009369a1967/deployments/b76d078b-667f-4333-96a6-a63a42befecf/online\n"
}
],
"source": "# Get Scoring URL\nscoring_endpoint = client.deployments.get_scoring_url(created_deployment)\n\nprint(scoring_endpoint)"
},
{
"execution_count": 131,
"cell_type": "code",
"metadata": {
"scrolled": true
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "{\n \"entity\": {\n \"author\": {\n \"name\": \"Heba El-Shimy\"\n },\n \"evaluation_metrics_url\": \"https://ibm-watson-ml.mybluemix.net/v3/wml_instances/461ebac2-ce8e-44e6-b3b9-b6f42a5f31b6/published_models/c76ac134-1259-440b-aab6-1009369a1967/evaluation_metrics\",\n \"learning_iterations_url\": \"https://ibm-watson-ml.mybluemix.net/v3/wml_instances/461ebac2-ce8e-44e6-b3b9-b6f42a5f31b6/published_models/c76ac134-1259-440b-aab6-1009369a1967/learning_iterations\",\n \"training_data_schema\": {\n \"labels\": {\n \"type\": \"ndarray\",\n \"fields\": [\n {\n \"type\": \"int\",\n \"name\": \"l1\"\n }\n ]\n },\n \"features\": {\n \"type\": \"ndarray\",\n \"fields\": [\n {\n \"type\": \"float\",\n \"name\": \"f0\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f1\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f2\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f3\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f4\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f5\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f6\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f7\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f8\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f9\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f10\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f11\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f12\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f13\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f14\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f15\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f16\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f17\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f18\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f19\"\n }\n ]\n }\n },\n \"name\": \"Credit Card Approval Model\",\n \"deployments\": {\n \"count\": 1,\n \"url\": \"https://ibm-watson-ml.mybluemix.net/v3/wml_instances/461ebac2-ce8e-44e6-b3b9-b6f42a5f31b6/published_models/c76ac134-1259-440b-aab6-1009369a1967/deployments\"\n },\n \"label_col\": 
\"l1\",\n \"learning_configuration_url\": \"https://ibm-watson-ml.mybluemix.net/v3/wml_instances/461ebac2-ce8e-44e6-b3b9-b6f42a5f31b6/published_models/c76ac134-1259-440b-aab6-1009369a1967/learning_configuration\",\n \"model_type\": \"scikit-learn-0.19\",\n \"deployed_version\": {\n \"guid\": \"8fbde70a-56af-423b-a38e-f68c5f4731a8\",\n \"url\": \"https://ibm-watson-ml.mybluemix.net/v3/ml_assets/models/c76ac134-1259-440b-aab6-1009369a1967/versions/8fbde70a-56af-423b-a38e-f68c5f4731a8\"\n },\n \"runtime_environment\": \"python-3.5\",\n \"feedback_url\": \"https://ibm-watson-ml.mybluemix.net/v3/wml_instances/461ebac2-ce8e-44e6-b3b9-b6f42a5f31b6/published_models/c76ac134-1259-440b-aab6-1009369a1967/feedback\",\n \"input_data_schema\": {\n \"labels\": {\n \"type\": \"ndarray\",\n \"fields\": [\n {\n \"type\": \"int\",\n \"name\": \"l1\"\n }\n ]\n },\n \"features\": {\n \"type\": \"ndarray\",\n \"fields\": [\n {\n \"type\": \"float\",\n \"name\": \"f0\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f1\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f2\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f3\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f4\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f5\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f6\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f7\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f8\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f9\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f10\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f11\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f12\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f13\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f14\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f15\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f16\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f17\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f18\"\n },\n {\n \"type\": \"float\",\n \"name\": \"f19\"\n }\n ]\n }\n },\n \"latest_version\": {\n \"created_at\": 
\"2018-05-10T18:30:13.631Z\",\n \"guid\": \"8fbde70a-56af-423b-a38e-f68c5f4731a8\",\n \"url\": \"https://ibm-watson-ml.mybluemix.net/v3/ml_assets/models/c76ac134-1259-440b-aab6-1009369a1967/versions/8fbde70a-56af-423b-a38e-f68c5f4731a8\"\n }\n },\n \"metadata\": {\n \"created_at\": \"2018-05-10T18:30:13.578Z\",\n \"guid\": \"c76ac134-1259-440b-aab6-1009369a1967\",\n \"modified_at\": \"2018-05-10T18:30:42.254Z\",\n \"url\": \"https://ibm-watson-ml.mybluemix.net/v3/wml_instances/461ebac2-ce8e-44e6-b3b9-b6f42a5f31b6/published_models/c76ac134-1259-440b-aab6-1009369a1967\"\n }\n}\n"
}
],
"source": "# Get model details and expected input\nmodel_details = client.repository.get_details(published_model_uid)\nprint(json.dumps(model_details, indent=2))"
},
{
"source": "### Sending data to the model\nSending new data (may be collected from web/mobile app) in the format the model is excpecting as shown above.\nWe get back a response with the predicted class (0 - Credit Card Application will be rejected)\nand probabilities of both classes (0 or Application Rejection has a probability of 1 which is very high, 1 or Application Acceptance has a probability of 5.096701256722081e-98 which is very low. This gives us an idea about the model's confidence of its predictions.",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "![postman](https://github.com/HebaNAS/IBM-Watson-Studio-Enablement/blob/master/CreditCardApprovalModel/imgs/API-Call.jpg?raw=true)",
"cell_type": "markdown",
"metadata": {}
},
{
"source": "## References:\n\n#### <a name=\"first\" id=\"first\"></a><sub>[1] https://www.sciencedirect.com/science/article/abs/pii/S0148296318301231 \"Customer churn prediction in telecommunication industry using data certainty\"</sub> \n#### <a name=\"second\" id=\"second\"></a><sub>[2] https://www.signal.co/blog/understanding-customer-churn/ \"10 Stats Expose the Real Connection Between Customer Experience and Customer Churn\"</sub> \n#### <a name=\"third\" id=\"third\"></a><sub>[3] https://www.pinterest.com/pin/456904324667676431/ \"Mobile Telco Churn Infographic\"</sub> \n#### <sub>[4] https://pandas.pydata.org/pandas-docs/stable/ \"Pandas Documentation\"</sub> \n#### <sub>[5] http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html \"Scikit-Learn Imputer\"</sub> \n#### <sub>[6] https://github.com/ibm-watson-data-lab/pixiedust/wiki/Tutorial:-Extending-the-PixieDust-Visualization \"PixieDust Documentation\"</sub>\n#### <sub>[7] https://seaborn.pydata.org/ \"Seaborn Documentation\"</sub>\n#### <sub>[8] http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder \"Scikit-Learn LabelEncoder\"</sub>\n#### <sub>[9] http://colingorrie.github.io/outlier-detection.html \"Outlier Detection Methods\"</sub>\n#### <sub>[10] http://scikit-learn.org/stable/auto_examples/linear_model/plot_polynomial_interpolation.html#sphx-glr-auto-examples-linear-model-plot-polynomial-interpolation-py \"Scikit-Learn Polynomial\"</sub>\n#### <sub>[11] http://scikit-learn.org/stable/modules/feature_selection.html \"Scikit-Learn Feature Selection\"</sub>\n#### <sub>[12] http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler \"Scikit-Learn StandardScaler\"</sub>\n#### <sub>[13] http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC \"Scikit-Learn SVC\"</sub>\n#### <sub>[14] 
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression \"Scikit-Learn Logistic Regression\"</sub>\n#### <sub>[15] http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html \"Scikit-Learn MLP Classifier\"</sub>\n#### <sub>[16] http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score \"Scikit-Learn Accuracy Score\"</sub>\n#### <sub>[17] http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score \"Scikit-Learn Average Precision Score\"</sub>\n#### <sub>[18] https://www.sciencedirect.com/science/article/pii/S016786550500303X \"An introduction to ROC analysis\"</sub>\n#### <sub>[19] https://wml-api-pyclient.mybluemix.net/ \"Watson Machine Learning Client Documentation\"</sub>\n#### <sub>[20] https://dataplatform.ibm.com/docs/content/analyze-data/ml-deploy-notebook.html?context=analytics \"IBM Watson Studio Documentation-Deploy a model from a notebook\"</sub>",
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
}
},
{
"execution_count": null,
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": ""
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.5",
"name": "python3",
"language": "python"
},
"language_info": {
"mimetype": "text/x-python",
"nbconvert_exporter": "python",
"version": "3.5.4",
"name": "python",
"file_extension": ".py",
"pygments_lexer": "ipython3",
"codemirror_mode": {
"version": 3,
"name": "ipython"
}
},
"celltoolbar": "Slideshow"
},
"nbformat": 4
}