# Purpose of Project

To demonstrate a real-life example of what a data science and machine learning proof of concept might look like.

## What is classification?

Classification involves deciding whether a sample belongs to one class or another. If there are only two class options, it's referred to as binary classification. If there are more than two, it's referred to as multi-class classification.

Since we already have a dataset, we'll approach the problem with the following machine learning modelling framework.

- Exploratory data analysis (EDA) - the process of going through a dataset and finding out more about it.
- Model training - create model(s) to learn to predict a target variable based on other variables.
- Model evaluation - evaluating a model's predictions using problem-specific evaluation metrics.
- Model comparison - comparing several different models to find the best one.
- Model fine-tuning - once we've found a good model, how can we improve it?
- Feature importance - since we're predicting the presence of heart disease, are there some things which are more important for prediction?
- Cross-validation - if we do build a good model, can we be sure it will work on unseen data?
- Reporting what we've found - if we had to present our work, what would we show someone?
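
As a rough illustration of the training, evaluation and cross-validation steps in the list above, here is a minimal sketch on synthetic data. The data generated by `make_classification`, the 80/20 split and the choice of `LogisticRegression` are assumptions made for illustration only; the scores it prints say nothing about the heart disease dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption): 303 samples, 13 features, binary target
X, y = make_classification(n_samples=303, n_features=13, random_state=42)

# Model training: hold out 20% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Model evaluation: accuracy on the held-out test set
test_accuracy = model.score(X_test, y_test)

# Cross-validation: accuracy across 5 different train/test splits
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy"
)

print(f"Test accuracy: {test_accuracy:.3f}")
print(f"Mean 5-fold CV accuracy: {cv_scores.mean():.3f}")
```

Comparing the single test score with the cross-validated mean gives a first hint of how much the result depends on one particular split.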

## 1. Problem Definition

In our case, the problem we will be exploring is binary classification (a sample can only be one of two things). This is because we're going to use a number of different features (pieces of information) about a person to predict whether they have heart disease or not.

In a statement: "Given clinical parameters about a patient, can we predict whether or not they have heart disease?"

## 2. Data

What you'll want to do here is dive into the data your problem definition is based on. This may involve sourcing the data, defining different parameters, talking to experts about it and finding out what you should expect.

The original data came from the Cleveland database from the UCI Machine Learning Repository. However, we've downloaded it in a formatted way from Kaggle. The original database contains 76 attributes, but only 14 attributes will be used here.

Attributes (also called features) are the variables we'll use to predict our target variable. Attributes and features are also referred to as independent variables, and a target variable can be referred to as a dependent variable. We use the independent variables to predict our dependent variable. In our case, the independent variables are a patient's different medical attributes and the dependent variable is whether or not they have heart disease.
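
The split between independent and dependent variables can be sketched in pandas as follows. The five rows are copied from the `df.head()` output in this notebook; the names `sample`, `X` and `y` are illustrative choices, not something the project prescribes.

```python
import pandas as pd

# First five rows of the dataset, copied from the df.head() output
sample = pd.DataFrame({
    "age": [63, 37, 41, 56, 57],
    "sex": [1, 1, 0, 1, 0],
    "cp": [3, 2, 1, 1, 0],
    "trestbps": [145, 130, 130, 120, 120],
    "chol": [233, 250, 204, 236, 354],
    "fbs": [1, 0, 0, 0, 0],
    "restecg": [0, 1, 0, 1, 1],
    "thalach": [150, 187, 172, 178, 163],
    "exang": [0, 0, 0, 0, 1],
    "oldpeak": [2.3, 3.5, 1.4, 0.8, 0.6],
    "slope": [0, 0, 2, 2, 2],
    "ca": [0, 0, 0, 0, 0],
    "thal": [1, 2, 2, 2, 2],
    "target": [1, 1, 1, 1, 1],
})

# Independent variables (features): every column except the target
X = sample.drop(columns="target")

# Dependent variable (target): whether or not the patient has heart disease
y = sample["target"]

print(X.shape, y.shape)
```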

## 3. Evaluation

The evaluation metric is something you might define at the start of a project. Since machine learning is very experimental, you might say something like: "If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we'll pursue this project."

This is helpful because it provides a rough goal for a machine learning engineer or data scientist to work towards. However, due to the nature of experimentation, the evaluation metric may change over time.
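
A check against such a target could look like the sketch below. The labels and predictions are made-up values for illustration, and `accuracy` and `TARGET_ACCURACY` are hypothetical names; only the 95% threshold comes from the text above.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty inputs")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


TARGET_ACCURACY = 0.95  # proof-of-concept goal stated in the text

# Hypothetical labels and predictions (not real results)
y_true = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]

acc = accuracy(y_true, y_pred)
meets_target = acc >= TARGET_ACCURACY
print(f"Accuracy: {acc:.2f}, meets target: {meets_target}")
```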

## 4. Features

Features are the different parts of the data. During this step, you'll want to start finding out what you can about the data. One of the most common ways to do this is to create a data dictionary.
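
A data dictionary can be as simple as a mapping from column name to description. The sketch below uses the 14 column names from this dataset; the descriptions follow the commonly published documentation for the UCI heart disease data and are stated here as an assumption, so verify them against the source before relying on them.

```python
# Column names as they appear in the dataset
columns = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target",
]

# Descriptions are assumptions based on the public UCI documentation
data_dictionary = {
    "age": "age in years",
    "sex": "1 = male, 0 = female",
    "cp": "chest pain type",
    "trestbps": "resting blood pressure (mm Hg on admission to hospital)",
    "chol": "serum cholesterol in mg/dl",
    "fbs": "fasting blood sugar > 120 mg/dl (1 = true, 0 = false)",
    "restecg": "resting electrocardiographic results",
    "thalach": "maximum heart rate achieved",
    "exang": "exercise-induced angina (1 = yes, 0 = no)",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "slope of the peak exercise ST segment",
    "ca": "number of major vessels (0-3) colored by fluoroscopy",
    "thal": "thallium stress test result",
    "target": "1 = heart disease, 0 = no heart disease",
}

# Flag any column that has no description yet
undocumented = [name for name in columns if name not in data_dictionary]

for name in columns:
    print(f"{name:>9}: {data_dictionary[name]}")
```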



-------

# Preparing the tools

This project uses pandas, NumPy, Matplotlib and seaborn for data analysis and plotting, and scikit-learn for modelling and evaluation.

In [2]:
#Import libraries needed 

#Regular EDA and plotting libraries
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns 

# Configure plots inside notebook
%matplotlib inline 

#Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import RocCurveDisplay  # plot_roc_curve was removed in scikit-learn 1.2


In [5]:
#Import Data from csv file 
df = pd.read_csv('heart-disease.csv')
df.shape

(303, 14)

## Data Exploration (exploratory data analysis or EDA)

The goal here is to find out more about the data and become a subject matter expert on the dataset you're working with.

- What question(s) are you trying to solve?
- What kind of data do we have and how do we treat different types?
- What's missing from the data and how do you deal with it?
- Where are the outliers and why should you care about them?
- How can you add, change or remove features to get more out of your data?
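
A few of these questions map directly onto pandas calls. The sketch below runs them on ten rows copied from the `df.head()` and `df.tail()` outputs in this notebook, restricted to four columns for brevity. The 1.5 × IQR outlier rule is one common heuristic chosen here as an assumption; it is not something the project prescribes.

```python
import pandas as pd

# Ten rows (first five and last five) copied from the notebook outputs
sample = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 57, 45, 68, 57, 57],
    "sex": [1, 1, 0, 1, 0, 0, 1, 1, 1, 0],
    "chol": [233, 250, 204, 236, 354, 241, 264, 193, 131, 236],
    "target": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

# What kind of data do we have?
column_types = sample.dtypes

# What's missing from the data?
missing_per_column = sample.isna().sum()

# How many samples are there of each class?
class_counts = sample["target"].value_counts()

# Where are the outliers? Flag cholesterol values outside 1.5 * IQR
q1, q3 = sample["chol"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (sample["chol"] < q1 - 1.5 * iqr) | (sample["chol"] > q3 + 1.5 * iqr)
chol_outliers = sample[is_outlier]

print(column_types)
print(missing_per_column)
print(class_counts)
print(chol_outliers)
```

On the full dataset the same calls would be made on `df` instead of `sample`.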

In [6]:
df.head()

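| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |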


In [8]:
df.tail()

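| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 | 0 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 | 0 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 | 0 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 | 0 |
| 302 | 57 | 0 | 1 | 130 | 236 | 0 | 0 | 174 | 0 | 0.0 | 1 | 1 | 2 | 0 |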
