# Machine Learning Project

1. (10 points) The source and a clear description of the dataset under consideration. You need to cite your data source properly and you need to clearly identify your response variable.
2. (5 points) Appropriateness and a clear statement of your goals and objectives for modelling this particular data. 
3. (10 points) Data pre-processing as appropriate (dealing with missing values & outliers & incorrect values (such as negative age), dropping ID-like columns, data aggregation if necessary, etc). 
NOTE: If your dataset is already clean and requires no pre-processing, that's OK. In that case, the next rubric item (Data exploration) will be marked out of 25 points instead of 15, and these additional 10 points will be recorded under the "Data pre-processing" line item.
4. (15 points) Data exploration & visualisation as appropriate: charts, graphs, boxplots, numerical summaries, etc. 
5. (40 points) Predictive modelling:
    - (5 points) A complete overview of your methodology
    - (15 points) Details of feature selection, the algorithms’ fine-tuning process, relevant fine-tuning plots, and detailed performance analysis of each algorithm
    - (10 points) Performance comparison of the algorithms as appropriate (cross-validation, AUC, etc.) using paired t-tests
    - (10 points) A critique of your approach: underlying assumptions, its limitations, its strengths and its weaknesses
6. (10 points) Summary & conclusions: a clear overall summary of your project, a clear and accurate summary of your findings, and your detailed conclusions as they relate to your goals and objectives.
7. (10 points) Presentation: a Table of Contents (3 points by itself), appropriate headers & subheaders, clarity, conciseness, coherence, grammar, punctuation, and other good writing practices.
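
For the performance comparison item in the rubric above, a paired t-test on per-fold cross-validation scores could be sketched roughly as follows. This is a minimal illustration using only the Python standard library. The function name `paired_t_statistic` and the fold scores are hypothetical placeholders chosen for illustration, not results from any real model.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Hypothetical helper: paired t statistic for two sets of per-fold scores.

    Both lists must come from the same cross-validation folds, in the same order.
    Returns the t statistic and the degrees of freedom (number of folds - 1).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("Both models must be scored on the same folds")
    # Per-fold differences between the two models
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1


# Illustrative, made-up AUC scores for two models on the same five folds
model_a_auc = [0.80, 0.82, 0.78, 0.81, 0.79]
model_b_auc = [0.78, 0.80, 0.77, 0.80, 0.76]

t_stat, dof = paired_t_statistic(model_a_auc, model_b_auc)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

In practice, a library routine such as `scipy.stats.ttest_rel` performs the same test and also returns the p-value.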

### 1. (10 points) The source and a clear description of the dataset under consideration. You need to cite your data source properly and you need to clearly identify your response variable.
// See this example for structure: https://www.featureranking.com/tutorials/statistics-tutorials/regression-case-study/

## Introduction
EXAMPLE: The objective of this toy project is to predict the age of an individual from the 1994 US Census Data using multiple linear regression. We use the Statsmodels and Patsy modules for this task with Python version >= 3.6. The dataset was sourced from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/adult (Lichman, 2013).

Table of contents:

- [Overview](#overview) section describes the dataset used and the features in this dataset.
- [Data preparation](#dataprep)
- Data exploration & visualisation
- Predictive modelling


.....

## Overview
<a id='overview'></a>
Describing 
- Data Source
- Project Objective
- Target Feature
- Descriptive Features

## Data Preparation
<a id='dataprep'></a>

In [2]:
import pandas as pd
import numpy as np

# Suppress Python warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# Display all columns when printing a DataFrame
pd.set_option('display.max_columns', None)

df = pd.read_csv('heart.csv', header=None)

In [7]:
# Full column names, kept for reference:
# headers = ['age', 'sex', 'chest_pain_type', 'resting_bp', 'serum_cholestoral', 'fasting_blood_sugar', 'resting_ecg_results', 'max_hr_achieved', 'exercise_induced_angina', 'oldpeak', 'slope_of_peak_exercise', 'no_of_major_vessels', 'thal', 'target']
# Use abbreviated column names so the data is easier to view
headers = ['age', 'sex', 'cpt', 'rb', 'sc', 'fbs', 'rer', 'mha', 'eia', 'old', 'sope', 'nomv', 'thal', 'target'] 
df.columns = headers
df.head()

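|   | age | sex | cpt | rb | sc | fbs | rer | mha | eia | old | sope | nomv | thal | target |
|---|-----|-----|-----|----|----|-----|-----|-----|-----|-----|------|------|------|--------|
| 0 | 70.0 | 1.0 | 4.0 | 130.0 | 322.0 | 0.0 | 2.0 | 109.0 | 0.0 | 2.4 | 2.0 | 3.0 | 3.0 | 2 |
| 1 | 67.0 | 0.0 | 3.0 | 115.0 | 564.0 | 0.0 | 2.0 | 160.0 | 0.0 | 1.6 | 2.0 | 0.0 | 7.0 | 1 |
| 2 | 57.0 | 1.0 | 2.0 | 124.0 | 261.0 | 0.0 | 0.0 | 141.0 | 0.0 | 0.3 | 1.0 | 0.0 | 7.0 | 2 |
| 3 | 64.0 | 1.0 | 4.0 | 128.0 | 263.0 | 0.0 | 0.0 | 105.0 | 1.0 | 0.2 | 2.0 | 1.0 | 7.0 | 1 |
| 4 | 74.0 | 0.0 | 2.0 | 120.0 | 269.0 | 0.0 | 2.0 | 121.0 | 1.0 | 0.2 | 1.0 | 1.0 | 3.0 | 1 |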
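
The pre-processing checks named in the rubric (missing values, incorrect values such as a negative age, duplicated records) could be sketched along these lines. This is a minimal sketch, not a complete cleaning pipeline: the helper name `quality_report` is hypothetical, and the five rows shown above stand in for the full `heart.csv` file so that the example is self-contained.

```python
import pandas as pd

headers = ['age', 'sex', 'cpt', 'rb', 'sc', 'fbs', 'rer', 'mha',
           'eia', 'old', 'sope', 'nomv', 'thal', 'target']

# The five rows displayed by df.head(), used here as a stand-in for heart.csv
sample_rows = [
    [70.0, 1.0, 4.0, 130.0, 322.0, 0.0, 2.0, 109.0, 0.0, 2.4, 2.0, 3.0, 3.0, 2],
    [67.0, 0.0, 3.0, 115.0, 564.0, 0.0, 2.0, 160.0, 0.0, 1.6, 2.0, 0.0, 7.0, 1],
    [57.0, 1.0, 2.0, 124.0, 261.0, 0.0, 0.0, 141.0, 0.0, 0.3, 1.0, 0.0, 7.0, 2],
    [64.0, 1.0, 4.0, 128.0, 263.0, 0.0, 0.0, 105.0, 1.0, 0.2, 2.0, 1.0, 7.0, 1],
    [74.0, 0.0, 2.0, 120.0, 269.0, 0.0, 2.0, 121.0, 1.0, 0.2, 1.0, 1.0, 3.0, 1],
]
sample = pd.DataFrame(sample_rows, columns=headers)


def quality_report(frame):
    """Hypothetical helper: count basic data-quality problems in a DataFrame."""
    return {
        'missing_values': int(frame.isna().sum().sum()),       # NaN cells in any column
        'duplicate_rows': int(frame.duplicated().sum()),       # exact repeated rows
        'non_positive_age': int((frame['age'] <= 0).sum()),    # impossible ages
    }


report = quality_report(sample)
print(report)

# Class balance of the response variable in the sample
print(sample['target'].value_counts().to_dict())
```

On the real data, the same helper would be called as `quality_report(df)` after the column names have been assigned.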
