{"cells": [{"outputs": [], "metadata": {"_uuid": "f118af7ce42b355fae7faa8db9d5d0157611517d", "_cell_guid": "417ebd30-953c-492a-b185-9bb2742e9fee", "trusted": true, "collapsed": true}, "cell_type": "code", "source": "import numpy as np\nimport pandas as pd\nimport seaborn as sns\nsns.set_palette('husl')\nimport matplotlib.pyplot as plt\n%matplotlib inline\n\nfrom sklearn import metrics\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split\n\ndata = pd.read_csv('../input/Iris.csv')", "execution_count": 35}, {"execution_count": null, "metadata": {"_uuid": "afeecf9d79fc1c4d87f459e7405d79a4f6cbef0a", "_cell_guid": "8c64d903-f69b-4983-8748-8522e8fe2bbf"}, "cell_type": "markdown", "source": "# Preview of Data\n- There are 150 observations with 4 features each (sepal length, sepal width, petal length, petal width).\n- There are no null values, so we don't have to worry about that.\n- There are 50 observations of each species (setosa, versicolor, virginica).", "outputs": []}, {"outputs": [], "metadata": {"_uuid": "07bf049c5cabf2ecb75818c4e103e94fcf1e49d4", "_cell_guid": "ae265ce5-40f4-447c-aea0-d5ddbe4abeb5", "trusted": true}, "cell_type": "code", "source": "data.head()", "execution_count": 36}, {"outputs": [], "metadata": {"_uuid": "84a4b1dbfb7b5a89d3c73e1113482b4b50985b14", "_cell_guid": "9257acb0-2b8b-42f5-bc7d-6c08976469f5", "trusted": true}, "cell_type": "code", "source": "data.info()", "execution_count": 37}, {"outputs": [], "metadata": {"trusted": true, "_uuid": "2f51406bff802aae1c0efa7289434a9b60f13bee"}, "cell_type": "code", "source": "data.describe()", "execution_count": 51}, {"outputs": [], "metadata": {"_uuid": "876817fced0db3d4bbbd95a64359f124ee6707c4", "_cell_guid": "13958377-5c7c-420b-b06f-ad5998c59665", "trusted": true}, "cell_type": "code", "source": "data['Species'].value_counts()", "execution_count": 38}, {"execution_count": null, "metadata": {"_uuid": 
"a67d48fef35b8e62a47774ced7f9a4ae4b562858", "_cell_guid": "2a1077f5-b314-4040-a309-59fc1b8d6c15"}, "cell_type": "markdown", "source": "# Data Visualization\n- After graphing the features in a pair plot, it is clear that the pairwise feature relationships of iris-setosa (in pink) are distinctly different from those of the other two species.\n- There is some overlap in the pairwise relationships of the other two species, iris-versicolor (brown) and iris-virginica (green).\n", "outputs": []}, {"outputs": [], "metadata": {"_uuid": "fd15b6089651f32d213555b27b7ffbc0655b6447", "_cell_guid": "13293bbc-e587-4085-916f-ac5bafbbfaf0", "trusted": true}, "cell_type": "code", "source": "tmp = data.drop('Id', axis=1)\ng = sns.pairplot(tmp, hue='Species', markers='+')\nplt.show()", "execution_count": 55}, {"outputs": [], "metadata": {"trusted": true, "_uuid": "cee3e44ff3c1a3a28ad1eb1df02ead6b331e083a", "scrolled": true}, "cell_type": "code", "source": "g = sns.violinplot(y='Species', x='SepalLengthCm', data=data, inner='quartile')\nplt.show()\ng = sns.violinplot(y='Species', x='SepalWidthCm', data=data, inner='quartile')\nplt.show()\ng = sns.violinplot(y='Species', x='PetalLengthCm', data=data, inner='quartile')\nplt.show()\ng = sns.violinplot(y='Species', x='PetalWidthCm', data=data, inner='quartile')\nplt.show()", "execution_count": 66}, {"execution_count": null, "metadata": {"_uuid": "edf8f09be66977b2258436e6a9128d6639469d01", "_cell_guid": "5fe31716-3cd8-444a-a17f-bed7659afd0f"}, "cell_type": "markdown", "source": "# Modeling with scikit-learn", "outputs": []}, {"outputs": [], "metadata": {"_uuid": "8a9c62f5fe8a7e0a78896d0edac6cf769a6b1751", "_cell_guid": "20c0f613-e162-4473-8292-4eca12c7343f", "trusted": true}, "cell_type": "code", "source": "X = data.drop(['Id', 'Species'], axis=1)\ny = data['Species']\n# print(X.head())\nprint(X.shape)\n# print(y.head())\nprint(y.shape)", "execution_count": 40}, {"execution_count": null, "metadata": {"_uuid": 
"3a94f4ab9ad99a6f3df882e201623d241454ef1c", "_cell_guid": "523d62f1-7606-495d-9baa-31aa386e1cbf"}, "cell_type": "markdown", "source": "## Train and test on the same dataset\n- This method is not recommended since the end goal is to predict iris species using a dataset the model has not seen before.\n- There is also a risk of overfitting the training data.", "outputs": []}, {"outputs": [], "metadata": {"_uuid": "9ecc7d9e7029cfe7f60b83cda36751618b9a7346", "_cell_guid": "b41bb2bc-dfec-4991-8f54-e5cf219f371e", "trusted": true, "scrolled": true}, "cell_type": "code", "source": "# experiment with different values of k\nk_range = list(range(1, 26))\nscores = []\nfor k in k_range:\n    knn = KNeighborsClassifier(n_neighbors=k)\n    knn.fit(X, y)\n    y_pred = knn.predict(X)\n    scores.append(metrics.accuracy_score(y, y_pred))\n    \nplt.plot(k_range, scores)\nplt.xlabel('Value of k for KNN')\nplt.ylabel('Accuracy Score')\nplt.title('Accuracy Scores for Values of k in k-Nearest-Neighbors')\nplt.show()", "execution_count": 41}, {"outputs": [], "metadata": {"_uuid": "db94036d672e0556b6e6cc2182cd12deb4b2759e", "_cell_guid": "f5e7d4d9-a028-416a-ad91-8790871a2fef", "trusted": true, "scrolled": true}, "cell_type": "code", "source": "logreg = LogisticRegression()\nlogreg.fit(X, y)\ny_pred = logreg.predict(X)\nprint(metrics.accuracy_score(y, y_pred))", "execution_count": 42}, {"execution_count": null, "metadata": {"_uuid": "d4a1c0d5b0d7d25fd28e28a8a74e3b75f78e6729", "_cell_guid": "aff6f799-de47-4b8b-936f-0a5179a2f9e4"}, "cell_type": "markdown", "source": "## Split the dataset into a training set and a testing set\n\n### Advantages\n- By splitting the dataset pseudo-randomly into two separate sets, we can train using one set and test using another.\n- This ensures that we won't use the same observations in both sets.\n- More flexible and faster than creating a model using all of the dataset for training.\n\n### Disadvantages\n- The accuracy scores for the testing set can vary 
depending on which observations are in the set.\n- This disadvantage can be countered using k-fold cross-validation.\n\n### Notes\n- The accuracy score of the models depends on the observations in the testing set, which is determined by the seed of the pseudo-random number generator (random_state parameter).\n- As a model's complexity increases, the training accuracy (the accuracy you get when you train and test the model on the same data) increases.\n- If a model is too complex or not complex enough, the testing accuracy is lower.\n- For KNN models, the value of k determines the level of complexity. A lower value of k means that the model is more complex.", "outputs": []}, {"outputs": [], "metadata": {"_uuid": "d94645e7f6bd8b969240ab3803074014fefcf54b", "_cell_guid": "d26420d7-e35c-4bf4-99da-d328993d7a87", "trusted": true, "scrolled": true}, "cell_type": "code", "source": "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=5)\nprint(X_train.shape)\nprint(y_train.shape)\nprint(X_test.shape)\nprint(y_test.shape)", "execution_count": 43}, {"outputs": [], "metadata": {"_uuid": "93af51b74e09a4b0b48b07023e5752e03beb0d69", "_cell_guid": "113f9c35-bf95-4b5e-8515-099bba3ea7d6", "trusted": true}, "cell_type": "code", "source": "# experiment with different values of k\nk_range = list(range(1, 26))\nscores = []\nfor k in k_range:\n    knn = KNeighborsClassifier(n_neighbors=k)\n    knn.fit(X_train, y_train)\n    y_pred = knn.predict(X_test)\n    scores.append(metrics.accuracy_score(y_test, y_pred))\n    \nplt.plot(k_range, scores)\nplt.xlabel('Value of k for KNN')\nplt.ylabel('Accuracy Score')\nplt.title('Accuracy Scores for Values of k in k-Nearest-Neighbors')\nplt.show()", "execution_count": 46}, {"outputs": [], "metadata": {"_uuid": "a6f63f861456de86dbd19bc5943ab29a79fa55ad", "_cell_guid": "30c17d80-1031-4aee-b83e-3f82b2243949", "trusted": true}, "cell_type": "code", "source": "logreg = LogisticRegression()\nlogreg.fit(X_train, y_train)\ny_pred 
= logreg.predict(X_test)\nprint(metrics.accuracy_score(y_test, y_pred))", "execution_count": 45}, {"execution_count": null, "metadata": {"collapsed": true, "_uuid": "b366c043483734afad62a823915112d00a2fe912", "trusted": false, "_cell_guid": "9fa29c1f-fe13-4ae2-804a-0545fa327b63"}, "cell_type": "markdown", "source": "## Choosing KNN to Model Iris Species Prediction with k = 12\nAfter seeing that a value of k = 12 is a pretty good number of neighbors for this model, I used it to fit the model for the entire dataset instead of just the training set.", "outputs": []}, {"outputs": [], "metadata": {"trusted": true, "_uuid": "b94a6f120e4a08ae48c16e8f5c3c7cb2ca387f4a"}, "cell_type": "code", "source": "knn = KNeighborsClassifier(n_neighbors=12)\nknn.fit(X, y)\n\n# make a prediction for an example of an out-of-sample observation\nknn.predict([[6, 3, 4, 2]])", "execution_count": 52}], "nbformat": 4, "metadata": {"language_info": {"version": "3.6.1", "mimetype": "text/x-python", "file_extension": ".py", "nbconvert_exporter": "python", "codemirror_mode": {"version": 3, "name": "ipython"}, "pygments_lexer": "ipython3", "name": "python"}, "kernelspec": {"display_name": "Python 3", "name": "python3", "language": "python"}}, "nbformat_minor": 1}
