### Business Analytics - Assignment 1 {-}


---

**Assignment Points**: 100  
**Due Date**: Friday Week 4 (17 March) @ 11.59pm  
**Submission**: Submit your file using the submission link on iLearn


- Put **all your work** into this file;
- Failure to supply solutions in the cells provided below each question will result in a loss of 30 points;
- Follow all instructions closely and **do not** print your variables to the screen unless explicitly asked to do so;
    - Comment out print statements where necessary and make sure that your submitted notebook does not have the output of previously executed print statements;
    - 10 marks will be deducted for every redundant print statement not explicitly asked for.



### About the Assignment

Credit score cards are used as a risk control method in the financial industry. Personal information submitted by credit card applicants is used to predict the probability of future defaults. The bank uses such data to decide whether to issue a credit card to the applicant.




| Feature Name | Explanation | Additional Remarks |
|--------------|-----------|-----------|
| ID | Randomly allocated client number | |
| Income | Annual income | |
| Gender | Applicant's gender | Male = 0, Female = 1 |
| Car | Car ownership | Yes = 1, No = 0 |
| Children | Number of children | |
| Real Estate | Real estate ownership | Yes = 1, No = 0 |
| Days Since Birth | No. of days | Counted backwards from the current day (0); -1 means yesterday |
| Days Employed | No. of days | Counted backwards from the current day (0). A positive value means the person is currently unemployed. |
| Payment Default | Whether a client has overdue credit card payments | Yes = 1, No = 0 |



---


## Problem 1 - (50 points) {-}


**Question 1** 

- Import the `assignment_data.xlsx` file from `data` folder into a pandas DataFrame named `df`; 
- Delete duplicate rows from `df` according to `ID`;
- Delete the `ID` column.
- How many rows are left in `df`?

(10 points)

In [31]:
import pandas as pd

df = pd.read_excel('assignment_data.xlsx')
df = df.drop_duplicates(subset='ID', keep='first')  # keep the first row for each ID
df = df.drop(columns='ID')
# len(df)  # number of remaining rows

5976

There are 5976 rows left in the DataFrame after dropping duplicate IDs.
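
As a small illustration of the de-duplication step, the sketch below applies `drop_duplicates` to a made-up frame. The `toy` DataFrame and its values are hypothetical and are not taken from `assignment_data.xlsx`. With `keep="first"`, only the first row for each repeated `ID` is retained.

```python
import pandas as pd

# Hypothetical toy data: ID 2 appears twice with different incomes.
toy = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "Income": [50000, 60000, 65000, 70000],
})

# Keep the first row for each ID, then drop the ID column.
deduped = toy.drop_duplicates(subset="ID", keep="first").drop(columns="ID")

n_rows = len(deduped)                 # number of remaining rows
incomes = deduped["Income"].tolist()  # the second ID-2 row (65000) is gone
```

Here `n_rows` is 3, because the duplicated `ID` contributes only one row.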


---

**Question 2**

- Reset the index in `df` using an appropriate function from `pandas` so that the new index corresponds to the number of rows (make sure to delete the old index). 
- How many positive values of `Days Employed` are there?
- Replace the positive values of `Days Employed` with 0 (zero) in `df`

(10 points)

In [22]:
ResetDF = df.reset_index(drop=True)  # drop=True discards the old index
# print((ResetDF['Days Employed'] > 0).sum())  # number of positive values (unemployed applicants)
ResetDF['Days Employed'] = ResetDF['Days Employed'].clip(upper=0)  # replace positive values with 0

There were 967 positive values for Days Employed, meaning there were 967 instances of unemployment.
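
Counting and zeroing out positive values can also be done with vectorised pandas operations. The sketch below uses a hypothetical `days_employed` Series with invented values: a boolean mask counts the positive entries, and `clip(upper=0)` caps them at zero while leaving negative values unchanged.

```python
import pandas as pd

# Hypothetical values: negative = days in employment, positive = unemployed.
days_employed = pd.Series([-1200, 365243, -30, 0, 15])

n_positive = int((days_employed > 0).sum())  # count strictly positive entries
cleaned = days_employed.clip(upper=0)        # cap every positive value at 0
```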

---
**Question 3**

Create two new variables in `df` named 

1. `Age`;
2. `Years in Employment`,

which measure age and employment length in **years** (decimal numbers) from `Days Since Birth` and `Days Employed` by applying appropriate transformations on these variables. 

Delete the original variables `Days Since Birth` and `Days Employed`.

(5 points)


In [23]:
# Days are stored as negative counts, so take the absolute value before converting to years
ResetDF['Age'] = ResetDF['Days Since Birth'].abs() / 365
ResetDF['Years in Employment'] = ResetDF['Days Employed'].abs() / 365
ResetDF = ResetDF.drop(columns=['Days Since Birth', 'Days Employed'])
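
To make the day-to-year conversion concrete, here is a minimal sketch on invented values. The `toy` frame and the `DAYS_PER_YEAR` constant are assumptions for illustration; dividing by 365 ignores leap years, matching the answer above.

```python
import pandas as pd

# Hypothetical rows: days are counted backwards from today, hence negative.
toy = pd.DataFrame({
    "Days Since Birth": [-7300, -14600],
    "Days Employed": [-730, 0],  # 0 = unemployed after the earlier clean-up
})

DAYS_PER_YEAR = 365  # assumption: leap years are ignored

toy["Age"] = toy["Days Since Birth"].abs() / DAYS_PER_YEAR
toy["Years in Employment"] = toy["Days Employed"].abs() / DAYS_PER_YEAR
toy = toy.drop(columns=["Days Since Birth", "Days Employed"])
```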

---
**Question 4**

- Create a **one**-dimensional NumPy array named `y` by exporting the first 5,000 observations of `Payment Default`. (Hint: see `ravel()` function)
- Create a NumPy array named `X` by exporting the first 5,000 observations of the following columns `Gender`, `Car`, `Real Estate`, `Children`, `Income`, `Age`, `Years in Employment`.
 
(10 points)


In [24]:
import numpy as np
first_5000 = ResetDF.iloc[:5000]  # first 5,000 observations by position
y = np.ravel(first_5000['Payment Default'])  # flatten to a one-dimensional array
X = first_5000[['Gender', 'Car', 'Real Estate', 'Children', 'Income', 'Age', 'Years in Employment']].to_numpy()
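
When taking the first *n* rows, note that label-based slicing with `.loc` includes the end label, while positional slicing with `.iloc` excludes the end position. The sketch below shows this on a hypothetical six-row frame with a default integer index, and also how `np.ravel` turns a single-column 2-D array into a 1-D array.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the default 0..5 integer index.
toy = pd.DataFrame({"Payment Default": [0, 1, 0, 1, 1, 0],
                    "Income": [10, 20, 30, 40, 50, 60]})

by_label = toy.loc[:2]      # label slice: end label 2 is included -> 3 rows
by_position = toy.iloc[:3]  # positional slice: end 3 is excluded -> 3 rows

# Selecting with a list keeps a 2-D shape; ravel() flattens it to 1-D.
y_2d = by_position[["Payment Default"]].to_numpy()
y_1d = np.ravel(y_2d)
```

The two slices agree only because the index runs from 0 upwards without gaps, which is why resetting the index first matters.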

---

**Question 5** 

- Use an appropriate `scikit-learn` library we learned in class to create the following NumPy arrays: `y_train`, `y_test`, `X_train` and `X_test` by splitting the data into 70% training and 30% test datasets. 
- Set `random_state` to 0 and stratify subsamples so that train and test datasets have roughly equal proportions of the target's class labels. 

(5 points) 

In [25]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0, stratify = y)
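
The effect of `stratify` can be checked by comparing class shares in the two parts of the split. This is a sketch on a hypothetical imbalanced target (`y_toy`, 80 zeros and 20 ones), not on the assignment data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 zeros and 20 ones.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

train_share = y_tr.mean()  # fraction of ones in the training part
test_share = y_te.mean()   # fraction of ones in the test part
```

With stratification, both shares should stay close to the overall 20% of ones.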

---

**Question 6**

- Create new variables by using an appropriate `scikit-learn` library we learned in class to standardize the features from the training and test datasets to mean zero and variance one. Name the new variables by appending '_scaled' to the original variable names.


(10 points)   

In [26]:
from sklearn.preprocessing import StandardScaler
np.set_printoptions(precision=3, suppress = True) # pretty printing

sc = StandardScaler()
sc.fit(X_train)

X_train_scaled = sc.transform(X_train)
X_test_scaled = sc.transform(X_test)
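
The scaler is fitted on the training data only and then reused for the test data, so the test set is standardised with the training mean and standard deviation. A minimal sketch with invented numbers (`X_tr` and `X_te` are hypothetical) shows that the scaled training features have mean zero and unit variance, while the scaled test features generally do not.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features on very different scales.
X_tr = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_te = np.array([[4.0, 400.0]])

scaler = StandardScaler().fit(X_tr)   # learn mean and std from training data only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuse the training statistics

train_means = X_tr_scaled.mean(axis=0)
train_stds = X_tr_scaled.std(axis=0)
```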

---

## Problem 2 - (20 points) {-}

**Question 7**

Fit the following two classifiers to the transformed training dataset using `scikit-learn` libraries.

- Perceptron - name your instance `pc` and set `random_state=1`
- Logistic Regression - name your instance `lr` and set `random_state=1`

When initializing instances of the above classifiers only set the parameters referenced above and nothing else.

(20 points)

In [27]:
from sklearn.linear_model import Perceptron
from sklearn.linear_model import LogisticRegression

pc = Perceptron(random_state=1)
pc.fit(X_train_scaled,y_train)

lr = LogisticRegression(random_state=1)
lr.fit(X_train_scaled, y_train)

LogisticRegression(random_state=1)

---
## Problem 3 - (30 points) {-}


**Question 8**

- Using a method built into each of the two classifiers compute their prediction accuracies on the training data;
- Store the accuracy values into variables named according to the following pattern: `classifier_name_accuracy_train`, e.g. you should have `lr_accuracy_train`; 
- Print the two accuracy **variables** along with their brief descriptions.

(10 points)

In [28]:
from sklearn.metrics import accuracy_score

# score() is the accuracy method built into each classifier; store the numeric values
pc_accuracy_train = pc.score(X_train_scaled, y_train)
lr_accuracy_train = lr.score(X_train_scaled, y_train)

print(f'PC Train Accuracy = {pc_accuracy_train:.3f}', f"\n The Perceptron classifier was able to predict Payment Default from the independent variables in the training set with an accuracy of {pc_accuracy_train:.1%} \n")
print(f'LR Train Accuracy = {lr_accuracy_train:.3f}', f"\n The Logistic Regression classifier was able to predict Payment Default from the independent variables in the training set with an accuracy of {lr_accuracy_train:.1%}")


PC Train Accuracy = 0.496 
 The Perceptron classifier was able to predict Payment Default from the independent variables in the training set with an accuracy of 49.6% 

LR Train Accuracy = 0.558 
 The Logistic Regression classifier was able to predict Payment Default from the independent variables in the training set with an accuracy of 55.8%
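
For scikit-learn classifiers, the built-in `score` method returns the mean accuracy, which is the same quantity `accuracy_score` computes from the predictions. The sketch below checks this on a hypothetical one-feature dataset (`X_toy`, `y_toy`); it is not the assignment data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical one-feature data with two classes.
X_toy = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(random_state=1).fit(X_toy, y_toy)

via_score = clf.score(X_toy, y_toy)                     # built-in method
via_metric = accuracy_score(y_toy, clf.predict(X_toy))  # metrics function
```

Both routes give the same number, so either variable can be stored as the accuracy value.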


---

**Question 9** 

- Using a method built into each of the above classifiers compute their prediction accuracy for the test dataset
- Store the accuracy values into variables named according to the following pattern: `classifier_name_accuracy_test`, e.g. you should have `lr_accuracy_test`. 
- Print the two accuracy **variables** along with brief descriptions.

(10 points)

In [29]:
# score() is the accuracy method built into each classifier; store the numeric values
pc_accuracy_test = pc.score(X_test_scaled, y_test)
lr_accuracy_test = lr.score(X_test_scaled, y_test)

print(f'PC Test Accuracy = {pc_accuracy_test:.3f}', f"\n The Perceptron classifier was able to predict Payment Default from the independent variables in the test set with an accuracy of {pc_accuracy_test:.1%} \n")
print(f'LR Test Accuracy = {lr_accuracy_test:.3f}', f"\n The Logistic Regression classifier was able to predict Payment Default from the independent variables in the test set with an accuracy of {lr_accuracy_test:.1%}")

PC Test Accuracy = 0.495 
 The Perceptron classifier was able to predict Payment Default from the independent variables in the test set with an accuracy of 49.5% 

LR Test Accuracy = 0.507 
 The Logistic Regression classifier was able to predict Payment Default from the independent variables in the test set with an accuracy of 50.7%


---

**Question 10** 

Using nicely formatted text in Markdown, comment on the accuracies computed in Questions 8 & 9, making sure you address:
- training and test datasets; 
- Perceptron and Logistic Regression models. 

Are the results as expected, and why or why not? (Hint: You are not expected to comment on why a particular model is better.) 

(10 marks)


## Payment Default Analysis

To analyse the credit score for credit card applicants, two classifiers were used to predict Payment Default from seven independent variables:
  * Gender
  * Car
  * Real Estate
  * Children
  * Income
  * Age
  * Years in Employment
  
This was done by splitting the data 70/30 into training and test sets, standardising the features, and fitting both a Perceptron and a Logistic Regression model to the training set. Each model assigns a predicted value of 1 (default) or 0 (no default) to every observation, and accuracy is the share of predictions that match the actual value. The training set would be expected to show the higher accuracy, because the models were fitted to those observations, whereas the test set contains data the models have not seen.

### Perceptron Analysis
Although the Perceptron had the lower accuracy of the two models on both the training and test sets, its training and test accuracies were the closest, at 49.6 and 49.5 percent respectively. This means that the Perceptron matched its predicted values to the actual values around 50% of the time. The training accuracy being higher than the test accuracy was expected, even though the gap was only 0.1 percentage points.

### Logistic Regression Analysis
The Logistic Regression model had a higher accuracy on both the training and test sets, but a larger gap between the two: 55.8 and 50.7 percent respectively, a difference of 5.1 percentage points. This means that the Logistic Regression model matched its predicted values to the actual values around 50-55% of the time. This result was expected for the reason given above, although the gap was wider than for the Perceptron.

### Conclusion

Overall, both models matched expectations, with the training accuracy higher than the test accuracy. Despite this, an accuracy of around 50-55 percent is relatively low and would not seem to "accurately" predict a payment default for credit card applicants. Perhaps more data or a different set of predictors is required for these models to fit better.
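
One way to judge whether an accuracy of roughly 50-55 percent is low is to compare it with a trivial baseline that always predicts the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on hypothetical labels (`y_toy`). The class balance of the real data is not shown in this notebook, so the number here is illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 6 non-defaults and 4 defaults.
y_toy = np.array([0] * 6 + [1] * 4)
X_toy = np.zeros((10, 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
```

On these invented labels the baseline reaches 60% without using any features, so a model would need to beat that figure to be considered informative.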