# RMA (Return Merchandise Authorization) Prediction assignment

**Welcome to the 'RMA Prediction' assignment. Below is a brief background:**

Return merchandise authorization (RMA) is the process of returning a product to receive a refund, replacement, or repair during the product's warranty period. 

In the chip manufacturing industry in particular, manufacturers have an interest in detecting these units before they are shipped and reach the market, for a few reasons:

- The cost of a single unit during production is negligible compared to replacing an entire end device, which causes several times more damage

- The damage done to a company's reputation when a new device is returned as defective can be very significant

- A malfunctioning unit can cause significant damage (you don't want such a bad unit placed in your car's ABS :) )

There are detailed test programs in which thousands of tests are performed on chips during production to ensure quality and prevent such cases.

RMA units, just like non-RMA units, passed all test cycles properly, and yet were found to be defective in the field.

Manufacturers are aware that such units exist, and their goal is to predict which units will malfunction soon even though they currently appear to work properly. Once such units are predicted, they are marked as bad and not shipped to customers.
A secondary goal is to avoid excessive scrap: marking good units as bad causes significant financial loss.


This use case is called "RMA Prediction" and this is the challenge you are required to deal with in this assignment!

**Technical notes:**

- Please follow the steps in this notebook

- Please write all your code inside this notebook only

- Please feel free to add more cells as you need

- The goal of this assignment is not only the best model:
    - There is also value in clean code and in orderly, clear work
    - Use visualizations, and comment on the reasons for the actions you take
    - We recommend that you also show in the notebook the experiments you tried along the way


- Avoid over-fitting. 

    - Your prediction results will be evaluated on a test set whose labels are not available to you


Good Luck!

# Packages

First, import all the packages you'll need during this assignment. You can complete the assignment successfully with these packages only, but feel free to use more libraries if you find them helpful.

- [numpy](https://numpy.org) is the main package for scientific computing with Python.
- [pandas](https://pandas.pydata.org/) is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.
- [matplotlib](http://matplotlib.org) is a library to plot graphs in Python.
- [sklearn](https://scikit-learn.org/stable/) (scikit-learn) is a widely used library for machine learning in Python.
- [f1_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) is the metric you should use to evaluate the performance of your model.
- [pickle](https://docs.python.org/3/library/pickle.html) is a library for serializing and de-serializing Python object structures, also called marshalling or flattening.
- np.random.seed(1) keeps NumPy-based random function calls reproducible, which helps us grade your work. Please don't change the seed!

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn.metrics import f1_score
import pickle
np.random.seed(seed=1)

# Load The Data

First, let's load the data:

In [None]:
url_data_train = (r'https://raw.githubusercontent.com/NI-DS/rma_assignment/main/Input/rma_train_data.csv')
url_data_test = (r'https://raw.githubusercontent.com/NI-DS/rma_assignment/main/Input/rma_test_data.csv')
df_rma_train, df_rma_test = pd.read_csv(url_data_train), pd.read_csv(url_data_test)

We will now examine the dimensions of the data:

In [16]:
df_rma_train.shape, df_rma_test.shape

((3616, 17), (905, 16))

The train set has one more column than the test set.
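To find out which column differs, you can compare the column indexes of the two DataFrames. The snippet below is a minimal sketch on two tiny, made-up frames (`toy_train` and `toy_test` are hypothetical stand-ins for `df_rma_train` and `df_rma_test`); the same call should work on the real data.

```python
import pandas as pd

# Toy stand-ins for df_rma_train / df_rma_test (values are made up for illustration).
toy_train = pd.DataFrame(
    {"Test_0": [261.36, 268.62], "Test_1": ["E", "H"], "RMA": [False, False]}
)
toy_test = pd.DataFrame({"Test_0": [268.62], "Test_1": ["C"]})

# Columns that exist in the train frame but not in the test frame.
extra_columns = toy_train.columns.difference(toy_test.columns).tolist()
print(extra_columns)  # ['RMA']
```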

We will now print the first five rows of the DataFrame to see the values of the data.

In [17]:
df_rma_train.head()

Unnamed: 0,Test_0,Test_1,Test_2,Test_3,Test_4,Test_5,Test_6,Test_7,Test_8,Test_9,Test_10,Test_11,Test_12,Test_13,Test_14,Test_15,RMA
0,261.36,E,Low,OP-1,True,102.14,False,True,2018.0,58.9,nov,136,1,120,1,TP-1549,False
1,268.62,H,Mid,OP-17,True,655.42,False,True,2018.0,62.0,apr,114,1,152,2,TP-1549,False
2,297.66,F,Low,OP-1,True,605.5,False,False,2018.0,65.1,nov,285,3,116,4,TP-1549,False
3,515.46,,Low,OP-17,True,144.78,True,True,2018.0,83.7,jan,208,1,93,2,TP-1549,False
4,290.4,D,Low,OP-4,True,416.22,False,True,2019.0,37.2,may,58,2,334,1,TP-1549,False


In [18]:
df_rma_test.head()

Unnamed: 0,Test_0,Test_1,Test_2,Test_3,Test_4,Test_5,Test_6,Test_7,Test_8,Test_9,Test_10,Test_11,Test_12,Test_13,Test_14,Test_15
0,268.62,C,Low,OP-17,True,1116.14,False,True,2018.0,6.2,feb,289,2,174,4,TP-9941
1,268.62,D,Low,OP-4,True,53.0,False,True,2018.0,18.6,may,226,1,363,3,TP-9941
2,246.84,D,Low,OP-1,True,275.56,False,True,2018.0,62.0,apr,470,1,150,5,TP-9941
3,333.96,B,Low,OP-1,True,53.0,False,True,2018.0,68.2,jun,154,1,124,1,TP-9941
4,261.36,B,Mid,OP-1,True,438.32,False,True,2018.0,55.8,nov,177,2,174,1,TP-9941


As you can see, there are 16 independent variables, the 'Test_i' columns (i: 0-15), and one dependent variable, the 'RMA' column.

The purpose of the assignment is to predict the value of the 'RMA' variable from the values of the other variables in the data. Please remember to use visualizations, explain your actions in comments, and show your workflow clearly.

Note that the test set has no labels. Your project is to build a machine learning model that will predict the labels for it.

# Exploratory Data Analysis (EDA)

## What is EDA? 

Exploratory Data Analysis is an unavoidable and major step in any data project. Its purpose is to understand the key characteristics of the data set, such as its columns and rows, by applying Pandas, NumPy, statistical methods, and data visualization packages. The EDA process should make you familiar with your data set.

- The outcome of this phase can include, but is not limited to:

    - Dataset characteristics.
    - Feature relationships.
    - Target ('RMA' column) insights.
    - Anything else of interest you can learn about the dataset.

**The goal of this phase is to implement an EDA process for the RMA data set**
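As a starting point, the sketch below shows a few typical first checks: the share of missing values per column, the class balance of the target, and the RMA rate per category level. It runs on a small synthetic frame (`toy` is a hypothetical stand-in with made-up values and an assumed 10% RMA rate), so the numbers say nothing about the real data; only the calls carry over.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_rma_train: made-up values, assumed ~10% RMA rate.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "Test_0": rng.normal(300, 50, n),
        "Test_1": rng.choice(["B", "C", "D"], n),
        "Test_2": rng.choice(["Low", "Mid"], n),
        "RMA": rng.random(n) < 0.1,
    }
)
# Inject some missing values, as seen in the Test_1 column of the real data.
toy.loc[rng.random(n) < 0.1, "Test_1"] = np.nan

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Class balance of the target (fractions sum to 1).
class_balance = toy["RMA"].value_counts(normalize=True)

# RMA rate per level of a categorical feature.
rma_rate_by_level = toy.groupby("Test_2")["RMA"].mean()

# Distribution of a numeric feature, split by target value.
numeric_summary = toy.groupby("RMA")["Test_0"].describe()

print(missing_share, class_balance, rma_rate_by_level, numeric_summary, sep="\n\n")
```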

In [19]:
### Your code here

# Data pre-processing

## Data preprocessing is a key step in the machine learning pipeline

Quality of data directly affects the ability of a model to learn useful information out of it.

Therefore, it is extremely important that we preprocess our data before feeding it into a model.

In this step, apply your conclusions from the previous phase to the data.


**The goal of this phase is to prepare the RMA data set for building a machine learning model**
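One possible way to organize this phase is a scikit-learn `ColumnTransformer` that imputes and scales numeric columns and imputes and one-hot encodes categorical ones. The sketch below is only an illustration on a four-row toy frame: the split into numeric and categorical columns and the choice of imputation strategies are assumptions, and your EDA may well suggest different choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny toy frame with one missing numeric and one missing categorical value.
toy = pd.DataFrame(
    {
        "Test_0": [261.36, 268.62, np.nan, 515.46],
        "Test_1": ["E", "H", "F", np.nan],
        "Test_2": ["Low", "Mid", "Low", "Low"],
    }
)

# Assumed column roles; derive the real lists from your EDA.
numeric_cols = ["Test_0"]
categorical_cols = ["Test_1", "Test_2"]

numeric_steps = Pipeline(
    [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
)
categorical_steps = Pipeline(
    [
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Ignore category levels at predict time that were unseen during fit.
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocessor = ColumnTransformer(
    [("num", numeric_steps, numeric_cols), ("cat", categorical_steps, categorical_cols)],
    sparse_threshold=0,  # always return a dense array
)

X_toy = preprocessor.fit_transform(toy)
print(X_toy.shape)
```

Fitting the transformer on the train set only, and reusing the fitted object on the test set, keeps information from the test set out of the preprocessing.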

In [20]:
### Your code here

# Build an ML/DL model

Build any model/s you find useful to solve the task, and evaluate their performance.

Use the f1_score metric imported above to measure the performance of your model. (The positive class is an RMA unit, RMA = True.)

We recommend trying different models and techniques and comparing the results.

- The outcome of this phase can include, but is not limited to:

  - Training model/s on the data set you prepared in the previous phase
  - Feature engineering
  - Feature selection
  - Hyper-parameter optimization
  - Evaluation of model results, including performance metric values


**The goal of this phase is to show your machine learning experimentation skills in working toward the best model/s you can achieve**
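A common way to compare candidate models without touching the test set is stratified cross-validation scored with F1. The sketch below assumes a synthetic, imbalanced data set from `make_classification` as a stand-in for the preprocessed RMA features; the two candidate models and their settings are arbitrary examples, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the preprocessed features (about 10% positives).
X, y = make_classification(
    n_samples=600, n_features=16, n_informative=6, weights=[0.9, 0.1], random_state=1
)

# Example candidates; class_weight="balanced" is one way to handle imbalance.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=1
    ),
}

# Stratified folds keep the positive rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

cv_f1 = {}
for name, model in candidates.items():
    # scoring="f1" treats label 1 (here: the RMA class) as the positive class.
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    cv_f1[name] = scores
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread across folds alongside the mean helps judge whether a difference between two models is meaningful or just noise.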

In [21]:
### Your code here

# Model Interpretation

**This is the place to interpret your model**

- The outcome of this phase can include, but is not limited to:

    - Confusion matrix
    - Feature importance
    - Business insights

**The goal of this phase is to interpret the best model/s you found in the previous phase**
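The sketch below illustrates two of these outcomes on synthetic data: a confusion matrix on a held-out validation split and the impurity-based feature importances of a random forest. The data, the model, and the `Test_i` feature names attached to the synthetic columns are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the preprocessed RMA data.
X, y = make_classification(
    n_samples=600, n_features=16, n_informative=6, weights=[0.9, 0.1], random_state=1
)
feature_names = [f"Test_{i}" for i in range(16)]  # hypothetical names

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=1
).fit(X_train, y_train)

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_val, model.predict(X_val), labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
# fn: defective units that would be shipped; fp: good units that would be scrapped.
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Impurity-based importances of the forest, largest first.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances.head())
```

Reading the two off-diagonal cells in business terms connects the model back to the two goals stated in the background: catching future RMA units and avoiding excessive scrap.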

In [22]:
### Your code here

# Something to think about:

- Why was f1_score selected as the metric?
- Are there metrics that you think would better suit this problem or data?
- What metrics are not appropriate for such a problem?
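A small worked example may help you think about these questions. It assumes a hypothetical batch of 100 units of which 5 are RMA units (the 5% rate is made up) and compares accuracy and F1 for two predictors: one that marks every unit as good, and one that catches 3 of the 5 RMA units at the cost of 2 false alarms.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical batch: 95 good units (0) followed by 5 RMA units (1).
y_true = np.array([0] * 95 + [1] * 5)

# Predictor A: marks every unit as good.
pred_all_good = np.zeros(100, dtype=int)

# Predictor B: catches 3 of the 5 RMA units and raises 2 false alarms.
pred_some_caught = np.zeros(100, dtype=int)
pred_some_caught[[95, 96, 97]] = 1  # true positives
pred_some_caught[[0, 1]] = 1        # false positives

acc_a = accuracy_score(y_true, pred_all_good)
f1_a = f1_score(y_true, pred_all_good, zero_division=0)
acc_b = accuracy_score(y_true, pred_some_caught)
f1_b = f1_score(y_true, pred_some_caught)

print(f"A: accuracy={acc_a:.2f}, F1={f1_a:.2f}")  # A: accuracy=0.95, F1=0.00
print(f"B: accuracy={acc_b:.2f}, F1={f1_b:.2f}")  # B: accuracy=0.96, F1=0.60
```

The two predictors are almost indistinguishable by accuracy, while F1 separates them clearly.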

# Predict on the test and export the model

In [None]:
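my_model = ...  # your best fitted model (ideally a pipeline that includes your preprocessing)
# predict() returns a NumPy array, so wrap it in a Series before writing it to CSV
my_prediction = pd.Series(my_model.predict(df_rma_test), name='RMA')
my_prediction.to_csv('my_prediction.csv', index=False)
with open('my_model.pkl', 'wb') as f:
    pickle.dump(my_model, f)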
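Before you send your files, it may be worth checking that the pickled model can be loaded back and still produces the same predictions. The sketch below does this round trip with a toy model on made-up data; `toy_model`, the file names, and the three feature columns are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data and model (made up for illustration only).
rng = np.random.default_rng(1)
toy_X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["Test_0", "Test_5", "Test_9"])
toy_y = toy_X["Test_0"] > 0
toy_model = LogisticRegression().fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Save the model and load it back.
    model_path = Path(tmp_dir) / "toy_model.pkl"
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)

    # Write predictions the same way as the export cell, then read them back.
    toy_prediction = pd.Series(restored_model.predict(toy_X), name="RMA")
    prediction_path = Path(tmp_dir) / "toy_prediction.csv"
    toy_prediction.to_csv(prediction_path, index=False)
    reloaded = pd.read_csv(prediction_path)

# The restored model should predict exactly what the original model predicts.
same_predictions = bool(
    (restored_model.predict(toy_X) == toy_model.predict(toy_X)).all()
)
print(same_predictions, reloaded.shape)
```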

**Well done!!**

Congratulations!!! You are almost done. Please pack your notebook, model, and prediction file into one zip file and send it back to the e-mail address from which you received the assignment.