# RMA (Return Merchandise Authorization) Prediction assignment

**Welcome to the 'RMA Prediction' assignment. Below is a brief background:**

Return merchandise authorization (RMA) is the process of returning a product to receive a refund, replacement, or repair during the product's warranty period. 

In the chip manufacturing industry in particular, manufacturers have an interest in detecting defective units before they are shipped and reach the market, for a few reasons:

- The cost of a single unit during production is negligible compared to the cost of replacing an entire end device, which causes several times more damage

- The damage to a company's reputation when a new device is returned as defective can be very significant

- A malfunctioning unit can cause significant damage (you don't want such a bad unit placed in the ABS system of your car :) )

There are detailed test programs in which thousands of tests are performed on chips during production to ensure quality and prevent such cases.

RMA units are units that passed all test cycles properly and yet were found to be defective in the field.

Manufacturers are aware that such units exist, and their goal is to predict which units will malfunction soon even though they currently appear to work properly. Once such units are predicted, they are marked as bad and not shipped to customers.

This use case is called "RMA Prediction" and this is the challenge you are required to deal with in this assignment!

**Technical notes:**

- Please follow the steps in this notebook

- Please write all your code inside this notebook only

- Please feel free to add more cells as you need

- The goal of this assignment is not only the best model:
    - There is also value in clean code and in orderly, clear work
    - Use visualizations and comments to explain the reasons for the actions you take
    - It is also recommended to show in the notebook the experiments you tried along the way


- Avoid over-fitting. 

    - Your prediction results will be tested on a test set that is not available to you


Good Luck!

# Packages

First, import all the packages you'll need during this assignment. You can complete this assignment successfully with these packages only, but feel free to use more libraries if you find them helpful.

- [numpy](https://numpy.org) is the main package for scientific computing with Python.
- [pandas](https://pandas.pydata.org/) is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.
- [matplotlib](http://matplotlib.org) is a library to plot graphs in Python.
- [sklearn](https://scikit-learn.org/stable/) is the most useful and robust library for machine learning in Python.
- [pickle](https://docs.python.org/3/library/pickle.html) is a library for serializing and de-serializing Python object structures, also called marshalling or flattening.
- `np.random.seed(1)` is used to keep all the random function calls consistent. It helps grade your work. Please don't change the seed!

In [1]:
import pickle  # needed by the export cell at the end of the notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn import metrics
np.random.seed(seed=1)
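To illustrate why the seed matters, here is a minimal sketch that reseeds NumPy's global generator and draws the same numbers twice. The array size and the variable names are arbitrary choices made for this illustration.

```python
import numpy as np

# Seeding the global NumPy generator makes the following draws repeatable.
np.random.seed(1)
first_draw = np.random.rand(5)

# Reseeding with the same value restarts the same pseudo-random sequence.
np.random.seed(1)
second_draw = np.random.rand(5)

print(np.array_equal(first_draw, second_draw))  # prints True
```

Note that the seed fixes the sequence, not individual calls: running cells in a different order, or adding extra random draws, changes the numbers that later calls receive.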

## The Metric

First, we define a positive result as a unit predicted to be RMA (RMA = True).

For the reasons listed above, in an RMA prediction problem a false positive (predicting a normal unit as faulty) is preferable because it causes only small damage to the company, whereas a false negative (predicting a faulty unit as normal) causes large damage, as described above.

<img src="images/tp_fn.png" style="width:1000px;height:500px">

**Precision and Recall** 

Precision and Recall are metrics that deal with these two kinds of error: precision is reduced by false positives, and recall is reduced by false negatives:

- Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances (in our case, the percentage of units that are truly RMA out of all the units the model predicted as RMA).

<img src="images/precision.png">

- Recall (also known as sensitivity) is the fraction of relevant instances that were retrieved (in our case, the percentage of RMA units that were correctly classified out of all the units that are RMA).

<img src="images/recall.png">

As described above, in our case the recall score is more important than the precision score.
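The definitions above can be checked on a small example. The sketch below uses ten hypothetical labels (not taken from the RMA data) and computes precision and recall by hand from the confusion matrix counts; the names `y_true` and `y_hat` are introduced only for this illustration.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels for ten units; 1 means RMA (positive), 0 means normal.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# For binary labels [0, 1], ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # share of predicted RMA units that really are RMA
recall = tp / (tp + fn)     # share of real RMA units that the model caught

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here the single missed RMA unit (the false negative) lowers recall to 0.75, while the two good units flagged as RMA (the false positives) lower precision to 0.60.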

The 'F score' gives a weighted score to these two metrics:

<img src="images/fbeta.png">

When beta is equal to 1, the two metrics get equal weight (this is the F1 metric).

In this prediction assignment, we will use the F2 metric (beta = 2), which treats recall as twice as important as precision.
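For reference, the standard definition of the F-beta score, which is what `metrics.fbeta_score` computes, can be written as below. It is assumed here to match the formula in the image above; the second expression is simply the case beta = 2.

```latex
F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}},
\qquad
F_2 = \frac{5 \cdot \mathrm{precision} \cdot \mathrm{recall}}{4 \cdot \mathrm{precision} + \mathrm{recall}}
```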

In [2]:
def f2_score(y_true, y_pred):
    # F-beta with beta=2 weights recall more heavily than precision
    return metrics.fbeta_score(y_true, y_pred, beta=2)

**Here are some examples of how the metric f2_score behaves:**

Suppose we have a series of 1000 labels as follows:

In [3]:
y = (np.random.rand(1000) > 0.5).astype(int)

We will set y_pred to be a series of 1000 positive instances:

In [4]:
y_pred = np.ones(1000).astype(int)

We will examine the results of the metrics described above:

In [5]:
print(f"Recall score is: {metrics.recall_score(y,y_pred)}, Precision score is: {metrics.precision_score(y,y_pred)}, F1 score is: {metrics.f1_score(y,y_pred)}, F2 score is: {f2_score(y,y_pred)}")

Recall score is: 1.0, Precision score is: 0.506, F1 score is: 0.6719787516600265, F2 score is: 0.8366402116402117


We can see that:
- Recall: 1.0, because we did not miss any positive instance.
- Precision: ~0.5, because out of all the positive predictions, we were right in only about half of the cases.
- F2, which gives more weight to recall, is therefore higher than F1.
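As a sanity check, the F1 and F2 values printed above can be reproduced by hand from the precision and recall. The helper `f_beta` is a hypothetical name introduced only for this illustration; it uses the standard F-beta definition.

```python
def f_beta(precision, recall, beta):
    """F-beta computed directly from precision and recall (standard definition)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Values printed above for the all-positive predictor.
precision, recall = 0.506, 1.0

f1 = f_beta(precision, recall, beta=1)
f2 = f_beta(precision, recall, beta=2)
print(f"F1 = {f1:.4f}, F2 = {f2:.4f}")  # prints F1 = 0.6720, F2 = 0.8366
```

Because recall is perfect and precision is poor, raising beta moves the score toward the recall value, which is why F2 is higher than F1 in this example.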

# Load The Data

First, load the data from the CSV file in the assignment folder.

In [7]:
df_rma = pd.read_csv('input/rma_data.csv')

We will now examine the dimensions of the data:

In [8]:
df_rma.shape

(3616, 17)

We will now print the first five rows of the DataFrame to see the values of the data.

In [9]:
df_rma.head()

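| | Test_0 | Test_1 | Test_2 | Test_3 | Test_4 | Test_5 | Test_6 | Test_7 | Test_8 | Test_9 | Test_10 | Test_11 | Test_12 | Test_13 | Test_14 | Test_15 | RMA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 217.8 | B | Low | OP-1 | True | 506.18 | False | True | 2018.0 | 15.5 | feb | 192 | 4 | -1 | 0 | NaN | False |
| 1 | 239.58 | D | Low | OP-1 | True | 32.98 | False | True | 2018.0 | 86.8 | jul | 151 | 3 | -1 | 0 | NaN | False |
| 2 | 225.06 | H | Low | OP-1 | True | 134.64 | False | True | 2018.0 | 52.7 | apr | 279 | 1 | -1 | 0 | NaN | False |
| 3 | 217.8 | F | Mid | OP-17 | True | 260.74 | False | False | NaN | 18.6 | jun | 91 | 18 | -1 | 0 | NaN | False |
| 4 | 283.14 | B | High | OP-1 | True | 231.62 | False | True | 2018.0 | 27.9 | jul | 869 | 1 | -1 | 0 | NaN | True |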


As you can see, there are 16 independent variables (the 'Test_i' columns, i = 0..15) and one dependent variable (the 'RMA' column).

The purpose of the assignment is to predict the value of the 'RMA' variable from the values of the other variables in the data. Please remember to use visualizations, explain the actions you perform with comments, and show your workflow clearly.
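One possible way to separate the independent variables from the target is sketched below. `df_demo` is a hypothetical stand-in built from a few values in the preview above, so the block runs without the CSV file; with the real data you would apply the same two lines to `df_rma`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_rma with a few of the columns shown above.
df_demo = pd.DataFrame({
    "Test_0": [217.8, 239.58, 225.06, 217.8, 283.14],
    "Test_1": ["B", "D", "H", "F", "B"],
    "Test_8": [2018.0, 2018.0, 2018.0, np.nan, 2018.0],
    "RMA": [False, False, False, False, True],
})

# Features are every column except the target; the target is the 'RMA' column.
X = df_demo.drop(columns=["RMA"])
y = df_demo["RMA"].astype(int)  # True/False -> 1/0, so 1 is the positive class

print(X.shape, y.tolist())
```

Casting the target to integers is optional, but it makes it explicit that RMA = True is the positive class used by the precision, recall, and F2 metrics.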

# Exploratory Data Analysis (EDA)

## What is EDA? 

Exploratory Data Analysis (EDA) is an essential early step: you examine the data set from several angles to understand the key characteristics of its columns and rows, using Pandas, NumPy, statistical methods, and data visualization packages. The EDA process should make you familiar with your data set.

- Outcomes of this phase may include, but are not limited to:

    - Dataset characteristics.
    - Feature relationships.
    - Target ('RMA' column) insights.
    - Identification of human errors.
    - Identification of outliers.
    
**The goal of this phase is to implement an EDA process for the RMA data set**
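A few first EDA checks could look like the sketch below. It is only a starting point, not a complete analysis, and `df_demo` is a hypothetical stand-in built from values in the preview so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_rma, built from a few values shown in the preview.
df_demo = pd.DataFrame({
    "Test_0": [217.8, 239.58, 225.06, 217.8, 283.14],
    "Test_2": ["Low", "Low", "Low", "Mid", "High"],
    "Test_8": [2018.0, 2018.0, 2018.0, np.nan, 2018.0],
    "Test_15": [np.nan] * 5,
    "RMA": [False, False, False, False, True],
})

# Dataset characteristics: column types and the fraction of missing values.
print(df_demo.dtypes)
missing_share = df_demo.isna().mean().sort_values(ascending=False)
print(missing_share)

# Target insight: the share of RMA units shows how imbalanced the classes are.
rma_rate = df_demo["RMA"].mean()
print(f"RMA rate: {rma_rate:.2f}")

# Outlier hint: summary statistics of a numeric column.
print(df_demo["Test_0"].describe())
```

A column that is entirely missing, as `Test_15` is in this stand-in, carries no information and is a natural candidate for removal, provided the same holds in the full data.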

In [10]:
### Your code here

# Data pre-processing

## Data preprocessing is a key step in the machine learning pipeline

The quality of the data directly affects the ability of a model to learn useful information from it.

Therefore, it is extremely important that we preprocess our data before feeding it into a model.

In this step, apply your conclusions from the previous phase to the data.


- Outcomes of this phase may include, but are not limited to:

    - Handling Missing Values
    - Data Normalization 
    
**The goal of this phase is to prepare the RMA data set for use in a machine learning model**
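One common way to handle missing values and normalization together is a scikit-learn `ColumnTransformer`. The sketch below is an illustration under assumptions: `X_demo`, the column split, and the imputation strategies are hypothetical choices, not decisions prescribed by the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the feature table; the column choice is an assumption.
X_demo = pd.DataFrame({
    "Test_0": [217.8, 239.58, 225.06, 217.8, 283.14],
    "Test_8": [2018.0, 2018.0, 2018.0, np.nan, 2018.0],
    "Test_2": ["Low", "Low", "Low", "Mid", "High"],
})
numeric_cols = ["Test_0", "Test_8"]
categorical_cols = ["Test_2"]

# Numeric columns: fill missing values with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: fill missing values with the mode, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0.0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    [("num", numeric_steps, numeric_cols), ("cat", categorical_steps, categorical_cols)],
    sparse_threshold=0.0,
)

X_ready = preprocess.fit_transform(X_demo)
print(X_ready.shape)
```

To avoid leaking information, the transformer should be fitted on the training portion only and then applied to validation data with `transform`.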

In [11]:
### Your code here

# Build a ML/DL model

**Build any model(s) you think fit the problem, and validate the results with the F2 metric** 

It is recommended to try different models and techniques and compare their results.

- Outcomes of this phase may include, but are not limited to:

    - Training model(s) on the data set you prepared in the previous phase
    - Evaluation of model results, including performance metric results
    - Hyper-parameter optimization
    - Feature selection
    - Feature engineering 

**The goal of this phase is to build model(s) iteratively until you reach the best results you can. What counts as the "best" model is up to you to decide, based on the performance metric(s) you chose to optimize**
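A minimal sketch of validating a model on the F2 metric with stratified cross-validation is shown below. The data is synthetic (generated with `make_classification`, not the RMA data set), and logistic regression with balanced class weights is only one possible baseline, chosen here as an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (not the RMA data set).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=1
)

# Wrap F-beta with beta=2 so cross-validation reports the assignment metric.
f2_scorer = make_scorer(fbeta_score, beta=2)

# class_weight="balanced" is one common way to favor recall on a rare class.
model = LogisticRegression(class_weight="balanced", max_iter=1000)

# Stratified folds keep the share of positive units similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring=f2_scorer)

print(f"F2 per fold: {np.round(scores, 3)}")
print(f"Mean F2: {scores.mean():.3f}")
```

Comparing the mean and the spread of the fold scores across candidate models gives a less optimistic estimate than a single train/test split, which helps guard against over-fitting.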

In [12]:
### Your code here

# Model Interpretation

**Here is the place to interpret your model**

- Outcomes of this phase may include, but are not limited to:

    - Confusion Matrix
    - Feature Importance
    - Business Insights

**The goal of this phase is to interpret the best model(s) you found in the previous phase** 
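As an illustration of the first two outcomes, the sketch below prints a confusion matrix and impurity-based feature importances. It uses synthetic data and a random forest purely as assumptions; your own model and features will differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the RMA data set).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=1
)

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_val, model.predict(X_val), labels=[0, 1])
print(cm)

# Impurity-based importances, listed from most to least important.
ranking = np.argsort(model.feature_importances_)[::-1]
for idx in ranking:
    print(f"feature {idx}: {model.feature_importances_[idx]:.3f}")
```

In the RMA setting, the bottom-left cell (false negatives) is the costly one, since it counts faulty units that would be shipped to customers.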

In [13]:
### Your code here

# Export the model

Export the model to a pickle (.pkl) file

In [None]:
my_model = ...
with open('my_model.pkl', 'wb') as f:
    pickle.dump(my_model, f)
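Before sending the file, it may be worth checking that the pickle loads back correctly. The sketch below demonstrates the round trip on a hypothetical tiny model (`demo_model`) and writes to a temporary folder, so it does not touch your real `my_model.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical tiny model, used only to demonstrate the round trip.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
demo_model = LogisticRegression().fit(X_demo, y_demo)

# Write to a temporary folder so the real 'my_model.pkl' is not overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(demo_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give the same predictions as the original.
same = np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
print(same)
```

If preprocessing is part of your solution, consider pickling the whole pipeline (preprocessing plus model) so that the loaded object can be applied to raw test data. Pickled scikit-learn models are also sensitive to library versions, so noting the versions you used can help whoever loads the file.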

**You're done! Well done!**

All that is left now is to pack your notebook and model in a zip file and send it as a reply to the email through which you received the assignment.