# CAB330 Case Study 1: Students

Class: CAB330

Students: 

- Christopher Ayling | christopher.ayling@connect.qut.edu.au
- Benjamin Saljooghi | 9448233 | benjamin.saljooghi@connect.qut.edu.au
- Jordi Smit

Due Date: 9th September 2018

Project Demo: Week 8 Wednesday Lab

Weighting: 25%

---
## Setup

In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [None]:
# Manipulating Data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Visualisations
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import SVG, Image
import graphviz

# Algorithms
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPClassifier, MLPRegressor

from sklearn.model_selection import GridSearchCV

In [None]:
np.random.seed(330)

---
## Data Loading

In [None]:
students = pd.read_csv("./STUDENT.csv")
students.head()

rows, columns = students.shape; rows, columns

# organics = pd.read_csv("./organics.csv")
# organics.head()

---
## Task 1. Data Selection and Distribution. (4 marks)

### Variable Descriptions

The following information would assist you in assigning the variables roles.

- There are three target variables namely, G1, G2 and G3, with different types. Choose the target that suits best according to the given task.
- Identify if the variable is an input variable or a supplementary variable.
- Data transformation is required for a few input variables to improve accuracy.

### 1.1 Proportion of Students Who Will Pass

In [None]:
G3_counts = students['G3'].value_counts()
plt.pie(G3_counts, labels=G3_counts.index, startangle=90, shadow=True, explode=(0, 0.1), autopct='%1.1f%%')
plt.title("Proportion of Students Who Will Pass")
plt.axis('equal')
plt.show()

### 1.2 Data Cleaning

In [None]:
cleaned = students.copy()

# impute age NAs with mean
cleaned['age'] = cleaned['age'].fillna(cleaned['age'].mean())
# drop unused columns
cleaned.drop(columns=["id", "InitialName"], inplace=True)

### 1.3 Level of Measurement

In [None]:
descriptions = {
    "Id": ["student's id", False, False, 'nominal'],
    "InitialName": ["student's initial", False, False, 'nominal'],
    "School": ["student's school name", True, True, 'nominal'],
    "Sex": ["student's sex", True, True, 'nominal'],
    "Age": ["student's age", True, True, 'numerical'],
    "Address": ["student's home address type", True, True, 'nominal'],
    "Famsize": ["family size (≤ 3 or > 3)", True, True, 'ordinal'],
    "Pstatus": ["parent's cohabitation status (living together or apart)", True, True, 'nominal'],
    "Medu": ["mother’s education(0 – none, 1 – primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)", True, True, 'ordinal'],
    "Fedu": ["father’s education(0 – none, 1 – primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)", True, True, 'ordinal'],
    "Mjob": ["mother's job", True, True, 'nominal'],
    "Fjob": ["father's job", True, True, 'nominal'],
    "Reason": ["reason to choose this school", True, True, 'nominal'],
    "guardian": ["student's guardian", True, True, 'nominal'],
    "traveltime": ["home to school travel time (1 – < 15 min., 2 – 15 to 30 min., 3 – 30 min. to 1 hour or 4 – > 1 hour)", True, True, 'ordinal'],
    "studytime": ["weekly study time (1 – < 2 hours, 2 – 2 to 5 hours, 3 – 5 to 10 hours or 4 – > 10 hours)", True, True, 'ordinal'],
    "Failures": ["number of past class failures(n if 1 ≤ n < 3, else 4)", True, True, 'ordinal'],
    "schoolsup": ["extra educational school support (yes or no)", True, True, 'nominal'],
    "Famsup": ["family educational support (yes or no)", True, True, 'nominal'],
    "Paid": ["extra paid classes (yes or no)", True, True, 'nominal'],
    "activities": ["extra-curricular activities (yes or no)", True, True, 'nominal'],
    "Nursery": ["attended nursery school (yes or no)", True, True, 'nominal'],
    "Higher": ["wants to take higher education (yes or no)", True, True, 'nominal'],
    "Internet": ["Internet access at home (yes or no)", True, True, 'nominal'],
    "romantic": ["with a romantic relationship (yes or no)", True, True, 'nominal'],
    "Famrel": ["quality of family relationships (1 – very bad to 5 – excellent)", True, True, 'ordinal'],
    "freetime": ["free time after school (1 – very low to 5 – very high)", True, True, 'ordinal'],
    "Gout": ["going out with friends (1 – very low to 5 – very high)", True, True, 'ordinal'],
    "Dalc": ["workday alcohol consumption (1 – very low to 5 – very high)", True, True, 'ordinal'],
    "Walc": ["weekend alcohol consumption (1 – very low to 5 – very high)", True, True, 'ordinal'],
    "Health": ["current health status (1 – very bad to 5 – very good)", True, True, 'ordinal'],
    "absences": ["number of school absences (0 to 75)", True, True, 'numerical'],
    "G1": ["first period grade (0 to 20)", True, True, 'numerical', True],
    "G2": ["second period grade (0 to 20)", True, True, 'numerical', True],
    "G3": ["Final result (PASS/FAIL)", True, True, 'nominal', True],
}

red = 'background-color: Tomato'
green = 'background-color: MediumSeaGreen'
blue = 'background-color: DodgerBlue'
sblue = 'background-color: SlateBlue'
violet = 'background-color: Violet'
gray = 'background-color: Lightgray'
orange = 'background-color: Orange'

def highlight_useful(val):
    """ highlight True as green, false as red """
    return '' if val is None else green if val else red

def highlight_type(val):
    """ highlight based on variable type """
    if val == 'nominal':
        return blue
    elif val == 'ordinal':
        return sblue
    elif val == 'numerical':
        return violet
    return ''  # no styling for unrecognised types
    
def highlight_target(val):
    """ highlight if target variable """
    return orange if val else gray


COLUMN_INFORMATION = pd.DataFrame.from_dict(descriptions, orient='index')
COLUMN_INFORMATION.columns = ['Description', 'For Classification', 'For Regression', 'Variable Type', 'Target']
COLUMN_INFORMATION = COLUMN_INFORMATION[['Description', 'Target', 'Variable Type', 'For Classification', 'For Regression']]
COLUMN_INFORMATION['Target'] = COLUMN_INFORMATION['Target'].apply(lambda cell: bool(cell))

with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', None):
    display(
        COLUMN_INFORMATION.style.\
            applymap(highlight_useful, subset=['For Classification', 'For Regression']).\
            applymap(highlight_type, subset='Variable Type').\
            applymap(highlight_target, subset='Target')
    )

### 1.4 Distribution Scheme

In [None]:
# def split_labels(n: int, test_size=0.2, val_size=0.2):
#     """ return a list containing train/test/val labels """
#     t = int(n * test_size)
#     v = int(n * val_size)
#     labels = np.zeros(n)
#     labels[:t] = 1
#     labels[t:t+v] = 2
#     labels = ['test' if el==1 else 'val' if el==2 else 'train' for el in labels]
#     np.random.shuffle(labels)
#     return labels

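# hold out 20% of the rows for testing and train on the remaining 80%;
# shuffling (seeded by random_state) avoids any ordering bias in the file
distribution = train_test_split(cleaned, test_size=0.2, random_state=10)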
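
One option for the distribution scheme is a stratified split, which keeps the PASS/FAIL ratio the same in the training and test partitions. The sketch below is illustrative only: the `toy` frame and its values are made up, and it assumes `G3` is the target column.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned student data (values are invented for illustration).
toy = pd.DataFrame({
    "age": [15, 16, 17, 18, 15, 16, 17, 18, 15, 16],
    "G3": ["PASS", "FAIL"] * 5,
})

# Stratifying on the target preserves the class ratio in both partitions.
train, test = train_test_split(
    toy, test_size=0.2, random_state=10, stratify=toy["G3"]
)

print(len(train), len(test))
print(test["G3"].value_counts().to_dict())
```

On the real data, the same call could be applied to `cleaned` with `stratify=cleaned["G3"]`.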

---
## Task 2. Predictive Modeling Using Decision Trees

(4 marks)

In [None]:
# def visualise_tree(dt, feature_names=None, class_names=None):
#     """ visualise a decision tree in a Jupyter notebook """
#     graph = graphviz.Source(export_graphviz(dt, out_file=None, filled=True, rounded=True, feature_names=feature_names, class_names=class_names))
#     display(SVG(graph.pipe(format='svg')))

### 2.1 Build a decision tree using the default settings.

In [None]:
# features = ['AGE', 'BILL']

# X = np.column_stack([organics[feature] for feature in features])

# y = np.array(organics['ORGYN'])

# Xtr, Xte, ytr, yte = train_test_split(X, y)

In [None]:
# dt = DecisionTreeClassifier()

In [None]:
# dt.fit(Xtr, ytr)

#### a. What is the classification accuracy on training and test datasets?

In [None]:
# pd.DataFrame([dt.score(Xtr, ytr), dt.score(Xte, yte)], columns=['Accuracy'], index=['Train', 'Test'])

#### b. Which variable is used for the first split? What are the competing splits for the first split?

In [None]:
# importances = pd.DataFrame(dt.feature_importances_, columns=['Importance'], index=features)
# importances

In [None]:
# visualise_tree(dt, feature_names=['AGE', 'BILL'], class_names=[str(bool(c)) for c in dt.classes_])

#### c. What are the 5 important variables in building the tree?

In [None]:
# importances[:5]

#### d. Report if you see any evidence of model overfitting.

The model appears to overfit: accuracy on the training set is higher than accuracy on the test set.
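
The gap between training and test accuracy can be measured directly. The following is a minimal sketch on synthetic data from `make_classification` (not the student data), showing how an unrestricted tree tends to memorise noisy training labels.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise; this is an assumption for illustration.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.2, random_state=330
)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=330)

# A tree with default settings grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=330).fit(Xtr, ytr)

train_acc = full_tree.score(Xtr, ytr)
test_acc = full_tree.score(Xte, yte)
gap = train_acc - test_acc

print(f"train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")
```

A large positive gap suggests overfitting; a gap near zero suggests the tree generalises about as well as it fits.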

### 2.2 Build another decision tree tuned with GridSearchCV.

In [None]:
# parameters = {
#     "criterion": ['gini', 'entropy'],
#     "max_depth": [None, 2, 5, 10, 15],
#     "min_samples_split": [0.2, 0.4, 0.6, 0.8],
#     "min_samples_leaf": [1, 20, 50, 80, 100]
# }

In [None]:
# gs_dt = GridSearchCV(DecisionTreeClassifier(), parameters).fit(Xtr, ytr)

#### a. What is the classification accuracy on training and test datasets?

In [None]:
# gs_dt.score(Xtr, ytr), gs_dt.score(Xte, yte)

#### b. What are the parameters used? Explain your decision.

In [None]:
# print(f"Parameters used: {list(parameters.keys())}")

Explanation

#### c. What are the optimal parameters for this decision tree?

In [None]:
# gs_dt.best_params_

In [None]:
# gs_dt.best_estimator_.get_params()
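
For reference, a runnable sketch of the tuning step on synthetic data is shown here. The grid values are illustrative assumptions, not the grid chosen for the student data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix and target.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=330
)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=330)

# Each combination in the grid is scored with 5-fold cross-validation.
param_grid = {"max_depth": [2, 5, None], "min_samples_leaf": [1, 20, 50]}
search = GridSearchCV(DecisionTreeClassifier(random_state=330), param_grid, cv=5)
search.fit(Xtr, ytr)

best_params = search.best_params_      # best combination found in the grid
cv_accuracy = search.best_score_       # mean cross-validated accuracy for it
test_accuracy = search.score(Xte, yte) # accuracy of the refitted best tree

print(best_params, round(cv_accuracy, 3), round(test_accuracy, 3))
```

`best_params_` only reports the searched parameters, whereas `best_estimator_.get_params()` also lists those left at their defaults.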

#### d. Which variable is used for the first split? What are the competing splits for this first split?

In [None]:
# importances = pd.DataFrame(gs_dt.best_estimator_.feature_importances_, columns=['Importance'], index=features)
# importances

#### e. What are the 5 important variables in building the tree?

In [None]:
# importances = pd.DataFrame(gs_dt.best_estimator_.feature_importances_, columns=['Importance'], index=features)
# importances[:5]

#### f. Report if you see any evidence of model overfitting.

There is no evidence of model overfitting: the accuracy on the train and test splits is comparable.

---
## Task 3. Predictive Modeling Using Regression

(5.5 marks)

### 1. Apply transformation/scaling methods to variables.

In [None]:
# apply 

In [None]:
# list affected variables
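
As a hedged sketch of what the transformation step might look like, the example below log-transforms a skewed count and then standardises it. It assumes `absences` is the skewed variable; the values are made up, and the scaler is fitted on the training portion only so that no test information leaks into it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented absence counts standing in for the train and test partitions.
absences_train = np.array([[0.0], [2.0], [4.0], [10.0], [75.0]])
absences_test = np.array([[1.0], [30.0]])

# log1p compresses the long right tail and is defined at zero.
log_train = np.log1p(absences_train)
log_test = np.log1p(absences_test)

# Fit on training data only, then apply the same shift and scale to the test data.
scaler = StandardScaler().fit(log_train)
scaled_train = scaler.transform(log_train)
scaled_test = scaler.transform(log_test)

print(scaled_train.ravel().round(2))
print(scaled_test.ravel().round(2))
```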

### 2. Build regression models (with and without GridSearchCV)

#### a. Report which variables are included in the regression model.

#### b. Report the top-5 important variables.
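
One common way to rank variables in a regression model is by the absolute size of their coefficients once the inputs are standardised. The sketch below assumes logistic regression is the model in question, which this notebook does not confirm, and uses synthetic data with hypothetical feature names `x0` to `x7`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared student features.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=330
)
feature_names = [f"x{i}" for i in range(X.shape[1])]  # hypothetical names

# Standardising puts the coefficients on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=1000).fit(X_scaled, y)

# Rank features by the magnitude of their coefficient.
coefficients = pd.Series(model.coef_[0], index=feature_names)
top5 = coefficients.abs().sort_values(ascending=False).head(5)

print(top5)
```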

#### c. Report any sign of overfitting.

#### d. What are the parameters used?

In [None]:
# explain your decision

In [None]:
# What are the optimal parameters

In [None]:
# Which regression function is being used?

#### e. What is the classification accuracy on training and test datasets?

---
## Task 4. Predictive Modeling Using Neural Networks

(5.5 marks)

### 1. Build a Neural Network model using the default setting.

#### a. What is the network architecture of the model?

#### b. How many iterations are needed to train this network?
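
Both the architecture and the iteration count can be read from a fitted `MLPClassifier`. The sketch below uses synthetic data with 8 inputs as an assumption; on the student data the input layer size would match the number of prepared features.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the student features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=330
)
X_scaled = StandardScaler().fit_transform(X)

# Default settings: a single hidden layer of 100 units, at most 200 iterations.
mlp = MLPClassifier(random_state=330).fit(X_scaled, y)

# Each weight matrix has shape (units in previous layer, units in next layer).
layer_shapes = [w.shape for w in mlp.coefs_]
iterations = mlp.n_iter_

print(layer_shapes, iterations, mlp.max_iter)
```

If `n_iter_` equals `max_iter`, the optimiser stopped at the iteration cap rather than by meeting its convergence tolerance.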

#### c. Do you see any sign of over-fitting?

#### d. Did the training process converge and result in the best model?

#### e. What is the classification accuracy on the training and test datasets?

### 2. Refine this network by tuning it with GridSearchCV.

In [None]:
# What parameters are used?

In [None]:
# Explain your decision.

### 3. Build another Neural Network with inputs selected by RFE with regression and by decision tree selection.

Use the best regression model from Task 3 for RFE and the best decision tree from Task 2 for tree-based selection.
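
A minimal sketch of recursive feature elimination (RFE) follows. It assumes a logistic regression estimator and keeps 5 features; both choices, the synthetic data, and the feature names are illustrative rather than taken from Task 3.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared student features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=330
)
feature_names = [f"x{i}" for i in range(X.shape[1])]  # hypothetical names
X_scaled = StandardScaler().fit_transform(X)

# RFE repeatedly refits the model and drops the weakest feature.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X_scaled, y)

selected = [name for name, keep in zip(feature_names, selector.support_) if keep]
X_reduced = selector.transform(X_scaled)  # inputs for the smaller network

print(selected, X_reduced.shape)
```

The reduced matrix can then be passed to `MLPClassifier`, whose first weight matrix shrinks to match the number of selected inputs.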

#### a. Did feature selection help here? Any changes in network architecture? What inputs are being used?

#### b. What is the classification accuracy on the train and test datasets? Any improvements?

#### c. How many iterations are needed to train this network?

#### d. Do you see any sign of over-fitting?

#### e. Did the training process converge and result in the best model?

#### f. Use GridSearchCV to tune the network to see whether the change in network architecture can further improve the performance.

### 4. Using the comparison methods, which appears to be better?

From the better model, can you identify which customers to target? Can you provide some descriptive summary of those customers?

---
## Task 5. Comparing Predictive Models

(4 marks)

### 1. Using the comparison methods to compare the best decision tree model, the best regression model...

#### a. Discuss the findings led by (a) ROC Chart and Index; (b) Accuracy Score; (c) Classification Report.
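
The three comparison methods can be computed with `sklearn.metrics`. The sketch below compares two stand-in models on synthetic data; the models, their settings, and the data are assumptions, not the tuned models from the earlier tasks.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=330
)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=330)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=330),
    "logistic regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    predictions = model.predict(Xte)
    scores = model.predict_proba(Xte)[:, 1]  # probability of the positive class
    results[name] = {
        "accuracy": accuracy_score(yte, predictions),
        "roc_auc": roc_auc_score(yte, scores),  # the ROC index
    }
    print(name)
    print(classification_report(yte, predictions))

print(results)
```

The ROC index is computed from predicted probabilities rather than hard labels, so it can rank models differently from accuracy.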

#### b. Do all the models agree on the customers' characteristics? How do they vary?

### 2. Finally, based on all models and analysis, is there a particular model you will use in decision making?

### 3. Can you summarise the positives and negatives of each modelling method based on this analysis?

## Conclusion

In conclusion ...