Conduct supervised learning using a label of your own choice; that is, you are required to identify your own classification task. For instance, you may use one of the following attributes as your class label: Quality-of-Life, Development-Index, Human-Development-Index, Gender, Age-group, and so on. If you choose the Development-Index, you would construct data mining models that contrast the trends across countries using the class labels {developed, developing, underdeveloped}.
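
As a minimal sketch of how a numeric index could be turned into the three class labels named above, the snippet below bins hypothetical HDI values with `pd.cut`. The sample values and the cut points (0.55 and 0.80) are assumptions chosen for illustration only; they are not taken from the assignment or from the database.

```python
import pandas as pd

# Hypothetical HDI values; real values would come from the fact table.
hdi = pd.Series([0.42, 0.55, 0.68, 0.74, 0.81, 0.93], name="hdi")

# Assumed cut points, chosen only for illustration.
bins = [0.0, 0.55, 0.80, float("inf")]
labels = ["underdeveloped", "developing", "developed"]

# right=False makes each bin closed on the left: [0.0, 0.55), [0.55, 0.80), [0.80, inf)
development_index = pd.cut(hdi, bins=bins, labels=labels, right=False)

print(development_index.value_counts().sort_index())
```

Three coarse classes give each label far more examples than ten fine-grained bins would, which matters when the data set is small.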

In [30]:
import psycopg2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
RANDOM_STATE = 42

conn = psycopg2.connect(
    host="localhost",
    database="postgres",
    user="postgres",
    password="password")
cur = conn.cursor()
tables = [
  'country',
  'education',
  'event',
  'fact',
  'health',
  'population',
  'quality_of_life',
  'month'
]

#### Loading Data

In [31]:
def getTable(name):
  cur.execute('SELECT * from ' + name)
  return pd.DataFrame(cur.fetchall(), columns=[desc[0] for desc in cur.description])

def getAll(tables):
  # Select every dimension column plus fact.hdi, keeping the order given in `tables`.
  columns = ['fact.hdi' if table == 'fact' else table + '.*' for table in tables]
  dimensions = [table for table in tables if table != 'fact']
  # Join each dimension to the fact table; the month dimension is keyed by Fact.date_key.
  conditions = [
    'Fact.date_key = month.key' if table == 'month' else 'Fact.' + table + '_key = ' + table + '.key'
    for table in dimensions
  ]
  query = 'SELECT ' + ', '.join(columns)
  query += ' FROM Fact, ' + ', '.join(dimensions)
  query += ' WHERE ' + ' and '.join(conditions)
  print(query)
  cur.execute(query)
  return pd.DataFrame(cur.fetchall(), columns=[desc[0] for desc in cur.description])


def getCorrelation(measure, attribute, factor):
  # DISTINCT ? 
  cur.execute('Select DISTINCT F.' + measure + ', A.' + factor + ' from Fact as F, ' + attribute + ' as A where F.' + attribute + '_key = A.'+ attribute + '_key')
  return pd.DataFrame(cur.fetchall(), columns=[desc[0] for desc in cur.description])

def getMeasureByAttribute(measure, attribute):
  cur.execute('SELECT DISTINCT Fact.' + measure + ', A.* FROM Fact, ' + attribute + ' as A WHERE Fact.' + attribute + '_key = A.'+ attribute + '_key')
  return pd.DataFrame(cur.fetchall(), columns=[desc[0] for desc in cur.description])

def getDataCorrelation(measure, attribute):
  dataset = getMeasureByAttribute(measure, attribute)
  dataset = dataset.apply(pd.to_numeric)
  # Bin the measure into tenths (e.g. 0.68 -> 7.0) so it can serve as a class label.
  dataset[measure] = dataset[measure].round(decimals=1)*10

  train, test = train_test_split(dataset, test_size=0.2, random_state=RANDOM_STATE)

  # Labels stay 1-D, the shape scikit-learn estimators expect for y;
  # a column vector triggers a DataConversionWarning in fit().
  train_labels = train[measure].values
  train_data = train.drop([measure,attribute+'_key'], axis=1).values

  test_labels = test[measure].values
  test_data = test.drop([measure,attribute+'_key'], axis=1).values
  return train_labels, train_data, test_labels, test_data


frames = {}
for table in tables:
  frames[table] = getTable(table)
# print(frames)





In [32]:
train_labels, train_data, test_labels, test_data = getDataCorrelation('hdi', 'education')
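
The split above is a plain random holdout. With few rows and uneven class sizes, a stratified holdout can keep the class proportions similar in both partitions. The sketch below uses synthetic stand-in data (the feature matrix and class sizes are assumptions, not the notebook's data) and the `stratify` argument of `train_test_split`. Stratification needs at least two samples per class, so very rare labels would have to be merged or dropped first.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in data: 130 rows, 4 features, 3 imbalanced classes.
X = rng.normal(size=(130, 4))
y = np.repeat([4, 6, 9], [10, 90, 30])

# stratify=y asks for the same class proportions in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

classes, counts = np.unique(y_test, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))
```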



#### 1. (15 marks) Use the Decision Tree, Gradient Boosting and Random Forest algorithms to construct models against your data, following the so-called train-then-test, or holdout method.  


In [33]:
gradientBoost = GradientBoostingClassifier(
  n_estimators=100,
  learning_rate=0.01,
  max_depth=1,
  random_state=RANDOM_STATE
  )
randomForrest = RandomForestClassifier(
  random_state=RANDOM_STATE,
  max_depth=10,
  min_samples_split=10
)

decisionTree = DecisionTreeClassifier(
  max_depth=10,
  min_samples_split=10,
  random_state=RANDOM_STATE
)

gradientBoost.fit(train_data, train_labels)
randomForrest.fit(train_data, train_labels)
decisionTree.fit(train_data, train_labels)




DecisionTreeClassifier(max_depth=10, min_samples_split=10, random_state=42)

#### 2. (20 marks) Compare the results of the three learning algorithms, in terms of (i) accuracy, (ii) precision, (iii) recall and (iv) time to construct the models.
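
The three quality metrics named above can also be computed individually rather than read off a text report. The sketch below uses hypothetical label arrays (not the notebook's predictions) with `accuracy_score`, `precision_score` and `recall_score`. For single-label classification, support-weighted recall always equals accuracy, which gives a quick sanity check on any comparison table.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true and predicted labels, not taken from the notebook's run.
y_true = np.array([4, 4, 6, 6, 6, 7, 7, 8, 9, 9])
y_pred = np.array([4, 4, 6, 6, 7, 7, 7, 7, 9, 9])

accuracy = accuracy_score(y_true, y_pred)
# average="weighted" weights each class's score by its support;
# zero_division=0 scores classes that were never predicted as 0.
precision = precision_score(y_true, y_pred, average="weighted", zero_division=0)
recall = recall_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```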

In [34]:
# Each timer below spans predict() plus classification_report(), not fit(),
# so it measures prediction time rather than model construction time.
startTime = datetime.now()
gradientPredictions = gradientBoost.predict(test_data)
gradientPredictionsReport = classification_report(test_labels.flatten(), gradientPredictions, zero_division=0)
print('Gradient Boost predict time:', datetime.now()-startTime )

startTime = datetime.now()
randomForrestPredictions = randomForrest.predict(test_data)
randomForrestPredictionsReport = classification_report(test_labels.flatten(), randomForrestPredictions, zero_division=0)
print('Random Forest predict time:', datetime.now()-startTime )

startTime = datetime.now()
decisionTreePredictions = decisionTree.predict(test_data)
decisionTreePredictionsReport = classification_report(test_labels.flatten(), decisionTreePredictions, zero_division=0)
print('Decision Tree predict time:', datetime.now()-startTime )

print('Gradient Boost Report')
print(gradientPredictionsReport)
print('Random Forest Report')
print(randomForrestPredictionsReport)
print('Decision Tree Report')
print(decisionTreePredictionsReport)

Gradient Boost predict time: 0:00:00.010064
Gradient Boost Report
              precision    recall  f1-score   support

         4.0       1.00      1.00      1.00         2
         5.0       1.00      1.00      1.00         1
         6.0       1.00      0.89      0.94         9
         7.0       0.50      0.80      0.62         5
         8.0       0.00      0.00      0.00         3
         9.0       1.00      1.00      1.00         6

    accuracy                           0.81        26
   macro avg       0.75      0.78      0.76        26
weighted avg       0.79      0.81      0.79        26

Random Forest Report
              precision    recall  f1-score   support

         4.0       1.00      1.00      1.00         2
         5.0       1.00      1.00      1.00         1
         6.0       1.00      0.89      0.94         9
         7.0       0.56      1.00      0.71         5
         8.0       0.00      0.00      0.00         3
         9.0       1.00      1.00      1.00   

| Metric                   | Gradient Boosting | Random Forest | Decision Tree |
|--------------------------|-------------------|---------------|---------------|
| accuracy                 |        81%        |      85%      |      81%      |
| precision (weighted avg) |        79%        |      80%      |      73%      |
| recall (weighted avg)    |        81%        |      85%      |      78%      |
| construction time        |                   |               |               |
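
The construction-time row needs a measurement taken around `fit()`, since that is the step that builds each model. A minimal sketch of that measurement follows, using `time.perf_counter` and the same hyperparameters as the notebook. The data come from `make_classification` as a synthetic stand-in for `train_data` and `train_labels`, so the timings it prints are not results for this data set.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for train_data / train_labels (an assumption).
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_classes=3, random_state=42
)

models = {
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.01, max_depth=1, random_state=42
    ),
    "Random Forest": RandomForestClassifier(
        max_depth=10, min_samples_split=10, random_state=42
    ),
    "Decision Tree": DecisionTreeClassifier(
        max_depth=10, min_samples_split=10, random_state=42
    ),
}

fit_seconds = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X, y)  # only the fit call is inside the timed region
    fit_seconds[name] = time.perf_counter() - start
    print(f"{name} construction time: {fit_seconds[name]:.4f} s")
```

`time.perf_counter` is a monotonic, high-resolution clock, which suits short intervals better than differences of `datetime.now()`. Averaging over several repetitions would reduce noise for fits this fast.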


#### 3. (15 marks) Submit a 200 to 300 words summary explaining the actionable knowledge nuggets your team discovered. That is, you should explain what insights you obtained about the data, when investigating the models produced by the three algorithms. 
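
One way to investigate the fitted tree-based models is to look at their `feature_importances_` attribute, which Decision Tree, Random Forest and Gradient Boosting classifiers in scikit-learn all expose after fitting. The sketch below is illustrative only: the data are synthetic and the feature names are hypothetical placeholders, not columns of the education table.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names; the last column is pure noise by construction
# because shuffle=False keeps the informative columns first.
feature_names = ["feature_a", "feature_b", "feature_c", "noise"]
X, y = make_classification(
    n_samples=300,
    n_features=4,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    shuffle=False,
    random_state=42,
)

forest = RandomForestClassifier(max_depth=10, min_samples_split=10, random_state=42)
forest.fit(X, y)

# Impurity-based importances, normalised so that they sum to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Ranking the real attributes this way would show which of them drive the predicted class, which is the kind of observation a summary of insights can build on.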