# CPSC 330 - Applied Machine Learning 

## Homework 9: Communication

**Due date: See the [Calendar](https://htmlpreview.github.io/?https://github.com/UBC-CS/cpsc330/blob/master/docs/calendar.html)**

<br><br><br><br>

## Instructions 
<hr>
rubric={points:2}

Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md). 

**You may work on this homework in a group and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 4. 
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).

<br><br><br><br>

## Exercise 1: Communication
<hr>

### 1.1 Blog post 
rubric={points:23}

Write up your analysis from hw5 or any other assignment or your side project on machine learning in a "blog post" or report format. It's fine if you just write it here in this notebook. Alternatively, you can publish your blog post publicly and include a link here. (See exercise 1.3.) The target audience for your blog post is someone like yourself right before you took this course. They don't necessarily have ML knowledge, but they have a solid foundation in technical matters. The post should focus on explaining **your results and what you did** in a way that's understandable to such a person, **not** a lesson trying to teach someone about machine learning. Again: focus on the results and why they are interesting; avoid pedagogical content.

Your post must include the following elements (not necessarily in this order):

- Description of the problem/decision.
- Description of the dataset (the raw data and/or some EDA).
- Description of the model.
- Description of your results, both quantitatively and qualitatively. Make sure to refer to the original problem/decision.
- A section on caveats, describing at least 3 reasons why your results might be incorrect, misleading, overconfident, or otherwise problematic. Make reference to your specific dataset, model, approach, etc. To check that your reasons are specific enough, make sure they would not make sense, if left unchanged, to most students' submissions; for example, do not just say "overfitting" without explaining why you might be worried about overfitting in your specific case.
- At least 3 visualizations. These visualizations must be embedded/interwoven into the text, not pasted at the end. The text must refer directly to each visualization. For example "as shown below" or "the figure demonstrates" or "take a look at Figure 1", etc. It is **not** sufficient to put a visualization in without referring to it directly.

A reasonable length for your entire post would be **800 words**. The maximum allowed is **1000 words**.

#### Example blog posts

Here are some examples of applied ML blog posts that you may find useful as inspiration. The target audiences of these posts aren't necessarily the same as yours, and these posts are longer than yours, but they are well-structured and engaging. You are **not required to read these** posts as part of this assignment - they are here only as examples if you'd find that useful.

From the UBC Master of Data Science blog, written by a past student:

- https://ubc-mds.github.io/2019-07-26-predicting-customer-probabilities/

This next one uses R instead of Python, but that might be good in a way, as you can see what it's like for a reader that doesn't understand the code itself (the target audience for your post here):

- https://rpubs.com/RosieB/taylorswiftlyricanalysis

Finally, here are a couple interviews with winners from Kaggle competitions. The format isn't quite the same as a blog post, but you might find them interesting/relevant:

- https://medium.com/kaggle-blog/instacart-market-basket-analysis-feda2700cded
- https://medium.com/kaggle-blog/winner-interview-with-shivam-bansal-data-science-for-good-challenge-city-of-los-angeles-3294c0ed1fb2


#### A note on plagiarism

You may **NOT** include text or visualizations that were not written/created by you. If you are in any doubt as to what constitutes plagiarism, please just ask. For more information see the [UBC Academic Misconduct policies](http://www.calendar.ubc.ca/vancouver/index.cfm?tree=3,54,111,959). Please don't copy this from somewhere 🙏. If you can't do it.

## Who will default? Using Machine Learning to predict if someone will default on their credit card bill

### Project Background and the Dataset:

Economists have long described the modern economy as credit-based, and today, credit extension plays a central role in the interactions between individuals, corporations, and financial institutions. To mitigate the risks associated with this activity, lenders attempt to predict default risk.

In this project, we carry out this type of analysis. Using the [Default of Credit Card Clients Dataset](https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset), we build a prediction model for determining whether a user will default on their credit card bill. The dataset contains data from Taiwan, collected from April to September 2005. It includes 24 columns.

### Building the model

**Data Wrangling:** We define the variable "default.payment.next.month" as the target variable and split the data using a 30% test size. 
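The split described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the data frame is a small synthetic stand-in for the Kaggle CSV, and `random_state=123` and the variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

TARGET = "default.payment.next.month"

# Synthetic stand-in for the Kaggle CSV; the values are random, not real data.
rng = np.random.default_rng(0)
n_rows = 1000
df = pd.DataFrame(
    {
        "ID": np.arange(1, n_rows + 1),
        "LIMIT_BAL": rng.integers(10_000, 500_000, size=n_rows),
        "AGE": rng.integers(21, 70, size=n_rows),
        TARGET: rng.integers(0, 2, size=n_rows),
    }
)

# Hold out 30% of the rows as the test set (random_state is an assumption).
train_df, test_df = train_test_split(df, test_size=0.3, random_state=123)

# Separate the features from the target in each split.
X_train, y_train = train_df.drop(columns=[TARGET]), train_df[TARGET]
X_test, y_test = test_df.drop(columns=[TARGET]), test_df[TARGET]

print(len(train_df), len(test_df))
```

Splitting before any exploration or preprocessing keeps the test rows unseen until the final evaluation.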

**Exploratory Data Analysis (EDA):** This step allows us to uncover many characteristics of the dataset. 
1. No features have missing values. 
2. There are 21000 examples and no duplicate examples in the train set. 
3. In the target, 0 (no default) is almost 3x more likely than 1 (default). 
4. The marriage feature is a categorical feature, and the small proportion in values 0 and 3 can be ignored.
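Checks like the four above can be run with a few pandas calls. The sketch below uses a tiny hand-made table in place of the real training set, so the printed numbers are illustrative only; the variable names are assumptions.

```python
import pandas as pd

TARGET = "default.payment.next.month"

# Tiny hand-made stand-in for the training set (not real data).
train_df = pd.DataFrame(
    {
        "LIMIT_BAL": [20000, 120000, 90000, 50000, 500000, 100000, 140000, 20000],
        "AGE": [24, 26, 34, 37, 29, 23, 28, 35],
        "MARRIAGE": [1, 2, 2, 1, 2, 2, 1, 3],
        TARGET: [1, 1, 0, 0, 0, 0, 0, 0],
    }
)

# 1. Total number of missing values across all columns.
n_missing = int(train_df.isna().sum().sum())

# 2. Number of rows that exactly duplicate an earlier row.
n_duplicates = int(train_df.duplicated().sum())

# 3. Share of each class in the target.
class_share = train_df[TARGET].value_counts(normalize=True)

# 4. How often each MARRIAGE code occurs.
marriage_counts = train_df["MARRIAGE"].value_counts()

print(n_missing, n_duplicates)
print(class_share)
print(marriage_counts)
```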

**Feature engineering:** Using EDA, we are able to find out information about all 24 features/columns. The features are of different types and have values in very different ranges, so it is important to do feature engineering before we build models to analyze the data. 

For the 24 features in this project, we separate them into numeric, categorical, drop and remainder. Next, we create a column transformer that scales the numeric features and applies a OneHotEncoder to the categorical feature. Using a column transformer, we are able to easily apply different transformations to the 4 types of columns that we identified. 
- Numeric features: this dataset has numerous numeric features, such as 'LIMIT_BAL', 'AGE', 'BILL_AMT1', and none of these columns have missing values. So it is sufficient to use a StandardScaler to process this type of data.

- Categorical features: ‘MARRIAGE’ is the only categorical feature we defined in this dataset. There are 3 values for marriage, 1=married, 2=single, 3=others. We decided to use a OneHotEncoder to separate it into 3 columns.

- Drop features: after EDA, we decided to drop the column ‘ID’. It is only a row identifier, so it carries no information for predicting default payment next month. 

- Remainder features: other than the above mentioned features, we still have 8 columns remaining. 
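The four column groups above map naturally onto scikit-learn's `make_column_transformer`. The sketch below is a hedged illustration on a tiny hand-made table: treating the remainder columns as `passthrough`, the choice of `SEX` and `PAY_0` as example remainder columns, and `handle_unknown="ignore"` are all assumptions, not details taken from the project.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hand-made stand-in for the training features (not real data).
X_train = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5, 6],
        "LIMIT_BAL": [20000, 120000, 90000, 50000, 500000, 100000],
        "AGE": [24, 26, 34, 37, 29, 23],
        "BILL_AMT1": [3913, 2682, 29239, 46990, 8617, 64400],
        "MARRIAGE": [1, 2, 2, 1, 3, 2],
        "SEX": [2, 2, 2, 2, 1, 1],
        "PAY_0": [2, -1, 0, 0, -1, 0],
    }
)

numeric_features = ["LIMIT_BAL", "AGE", "BILL_AMT1"]
categorical_features = ["MARRIAGE"]
drop_features = ["ID"]
# Assumption: every other column (here SEX and PAY_0) is passed through unchanged.

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),  # zero mean, unit variance
    (OneHotEncoder(handle_unknown="ignore"), categorical_features),  # one column per value
    ("drop", drop_features),  # remove the identifier
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

X_transformed = preprocessor.fit_transform(X_train)
print(X_transformed.shape)
```

The output has 3 scaled numeric columns, 3 one-hot columns for ‘MARRIAGE’, and the 2 passed-through columns, in that order.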

**Baseline model:**
We create a baseline model using a DummyClassifier for comparing results. DummyClassifier makes predictions without considering the input feature values, which makes it a good baseline against which to compare the more complex models that we use later.
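A minimal sketch of such a baseline is shown here. The labels are a toy example with an exact 3:1 class ratio, and `strategy="most_frequent"` is an assumption about how the baseline was configured.

```python
from sklearn.dummy import DummyClassifier

# Toy labels with a 3:1 ratio of non-defaults (0) to defaults (1).
y = [0] * 75 + [1] * 25
# DummyClassifier ignores the features, so placeholder values are enough.
X = [[0]] * len(y)

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

baseline_accuracy = baseline.score(X, y)
print(baseline_accuracy)
```

In this toy example, always predicting "no default" already reaches 75% accuracy, which is why any real model should be judged against the baseline rather than against 50%.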

**Creating the Models:**

- Linear model

First, we create a logistic regression model, which assigns probabilities to the outcomes using the sigmoid function. 

- Random forest

Instead of a single DecisionTree model, we decided to use a RandomForest model, which trains many decision trees, each on a random sample of the dataset. Since a RandomForest averages the outputs of multiple randomized decision trees, we believe it is a good candidate for our problem. 

- Catboost

Third, we create a CatBoost model, which uses oblivious decision trees (the same splitting criterion across each level of the tree). CatBoost is a gradient boosted trees model. In the examples we saw in lectures, it gave the best results compared to other gradient boosted trees models, so we decided to try CatBoost. 

- KNN

Finally, we create a KNN model, which uses proximity to make the prediction. We then use the result from hyperparameter optimization to choose the best k for this model, which is 51. 
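Choosing k by hyperparameter optimization can be sketched with `GridSearchCV`. This is an illustration on synthetic data: the candidate values of k, the 5-fold setting, and the scaling step are assumptions, since the search actually used in the project is not described here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the credit card data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# KNN is distance-based, so features are scaled before the classifier.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Hypothetical candidate values for k.
candidate_k = [1, 11, 21, 51]
param_grid = {"kneighborsclassifier__n_neighbors": candidate_k}

# 5-fold cross-validated search using accuracy as the metric.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print(best_k, round(search.best_score_, 3))
```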

The results indicate that logistic regression has the fastest fit time. However, the other 3 models have significantly better test scores.
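A comparison of fit times and scores like this one can be produced with `cross_validate`. The sketch below runs on synthetic data, so its numbers will not match ours; the model settings other than k=51 are assumptions, and CatBoost is left out because it requires the separate `catboost` package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for the credit card data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "KNN (k=51)": KNeighborsClassifier(n_neighbors=51),
}

# Collect mean fit time and mean train/validation accuracy for each model.
results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, return_train_score=True)
    results[name] = {
        "fit_time": cv["fit_time"].mean(),
        "train_score": cv["train_score"].mean(),
        "validation_score": cv["test_score"].mean(),
    }

for name, row in results.items():
    print(name, {key: round(value, 3) for key, value in row.items()})
```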


**Results:**
After training several models on the data, we found that the CatBoost model had the best performance. We tuned the depth and n_estimators of our CatBoost model through hyperparameter optimization. Ultimately, our model achieves a test score of 0.820 with a training score of 0.829, using accuracy as the metric. The gap between the training score and the test score is small, which suggests that the model is not seriously overfitting.

However, we may be overconfident in our results. We did not have enough time to optimize other hyperparameters of the CatBoost model, and tuning them might yield an even better-performing model. We also did not try stacking or averaging the models above, nor models such as a single decision tree or an SVM, so we cannot know whether a better-performing model exists. Finally, although accuracy is a useful metric, we should also try other measures, such as the Brier score, which capture a different aspect of model quality.
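To illustrate how the Brier score differs from accuracy, here is a small sketch with made-up labels and predicted probabilities (none of these numbers come from our models). Accuracy only looks at the thresholded predictions, while the Brier score is the mean squared difference between the predicted probability and the true label, so lower is better.

```python
from sklearn.metrics import accuracy_score, brier_score_loss

# Made-up true labels (1 = default) and predicted probabilities of default.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.8, 0.3]

# Threshold at 0.5 to get hard predictions: [0, 0, 1, 0].
y_pred = [int(p >= 0.5) for p in y_prob]

# Accuracy: 3 of 4 hard predictions are correct.
accuracy = accuracy_score(y_true, y_pred)

# Brier score: mean of (0.01, 0.16, 0.04, 0.49).
brier = brier_score_loss(y_true, y_prob)

print(accuracy, brier)
```

Here accuracy is 0.75 and the Brier score is 0.175; the confident miss on the last example (0.3 for a true default) contributes most of the Brier penalty.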

<br><br>

### 1.2 Effective communication technique
rubric={points:4}

Describe one effective communication technique that you used in your post, or an aspect of the post that you are particularly satisfied with. (Max 3 sentences.)

We explained the ideas in chunks that follow the flow of the project, with clear titles marking each step. Given the word limit and a target audience of someone who has not yet taken this course, we chose not to explain each model in depth; instead, we described the models briefly and focused on the purpose of the project and its results.

<br><br>

### (optional, not for marks) 1.3

Publish your blog post from 1.1 publicly using a tool like Hugo, or somewhere like medium.com, and paste a link here. Be sure to pick a tool in which code and code output look reasonable. This link could be a useful line on your resume!

<br><br><br><br>

## Exercise 2: Your takeaway from the course 
rubric={points:1}

**Your tasks:**

What is your biggest takeaway from this course? 

> Please write thoughtful answers. I'm looking forward to reading your answers 🙂. 

Throughout the course, we were given a lot of tools to deal with different tasks and data sets. My biggest takeaway from the course is that I now know which tools are available to me when I face a new task. Even though we did not go into depth on every piece of code or every tool we used, we gained an understanding of how to make each of them work and how to apply them in our tasks. It will take time to practice and become more familiar with the material, and I will have to revisit the code and the lectures, but I now have a better sense of what machine learning is, as well as its power and applications. NLP is particularly interesting to me. Its applications can be seen everywhere around us now, but only after this course did I really get to know what it is and what has been developed in the field. Working with words and sentences previously seemed very challenging to me even to imagine, but seeing what has been done in the field has broadened my knowledge of what could potentially be done and utilized.

<br><br><br><br>

**PLEASE READ BEFORE YOU SUBMIT:** 

When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from "1" will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission. 
4. Make sure that the plots and output are rendered properly in your submitted file. If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb so that the TAs can view your submission on Gradescope. 

### Congratulations 👏👏

That's all for the assignments! Congratulations on finishing all homework assignments! 

```python
from IPython.display import Image

Image("eva-congrats.png")
```

FileNotFoundError: No such file or directory: 'eva-congrats.png'
