# PAKDD 2014 - ASUS Malfunctional Components Prediction

## Table Of Contents
- [Description](#Description)
- [Evaluation](#Evaluation)

### Description 

**Goal:** Predict future malfunctioning components of ASUS notebooks from historical data, which helps estimate how many products will require maintenance. The data covers shipments (sales) and laptops returned for maintenance and repair. From this information we have to estimate how many repairs each module and component combination will require.

### Evaluation

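Predictions are scored with the mean absolute error (MAE), where $y_i$ is the actual number of repairs, $\hat{y}_i$ is the predicted number, and $n$ is the number of predictions:

$$MAE = \frac{1}{n}\sum_{i=1}^{n}| y_i - \hat{y}_i |$$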
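
As a quick check of the metric, the sketch below computes MAE in plain Python. The function name `mean_absolute_error` and the toy values are illustrative assumptions, not part of the competition code.

```python
def mean_absolute_error(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    total = sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted))
    return total / float(len(actual))

# toy values (made up): the absolute errors are 1, 0 and 2
toy_mae = mean_absolute_error([3, 5, 2], [2, 5, 4])
print(toy_mae)  # 1.0
```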

In [1]:
# display inline plots
%matplotlib inline

# import libraries for numerical and scientific computing
import numpy as np
import scipy as sp

# import matplotlib for plotting
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt

# import pandas for data wrangling and munging
import pandas as pd

# set some options for better view
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)

# import plotting library built on top of matplotlib
import seaborn as sns

# set some settings related to style of plots that will render
sns.set_style("whitegrid")
sns.set_context("poster")

import warnings
warnings.filterwarnings('ignore')



In [2]:
# load sales and repair log
sales = pd.read_csv('./data/SaleTrain.csv')
repair = pd.read_csv('./data/RepairTrain.csv')

In [3]:
# output id mapping
output_id_map = pd.read_csv('./data/Output_TargetID_Mapping.csv')

In [4]:
sales.head(3)

Unnamed: 0,module_category,component_category,year/month,number_sale
0,M4,P27,2005/1,0
1,M4,P27,2005/5,1042
2,M4,P27,2005/9,1677


**Sales data for the modules and components of laptops, from January 2005 to February 2008.**

In [5]:
repair.head(3)

Unnamed: 0,module_category,component_category,year/month(sale),year/month(repair),number_repair
0,M6,P16,2007/9,2009/4,1
1,M2,P30,2007/9,2009/8,1
2,M1,P12,2006/10,2008/2,2


**Repair and maintenance data for the modules and components of laptops, from February 2005 to December 2009.**

In [6]:
output_id_map.head(3)

Unnamed: 0,module_category,component_category,year,month
0,M1,P02,2010,1
1,M1,P02,2010,2
2,M1,P02,2010,3


**The task is to predict the number of repairs for each module and component combination in the dataframe above, for the period January 2010 to July 2011.**
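
The forecast horizon can be enumerated explicitly. This is a minimal sketch, assuming the test months run consecutively from January 2010 through July 2011 as stated above; the helper `month_range` is a hypothetical name.

```python
def month_range(start_year, start_month, end_year, end_month):
    """Return consecutive (year, month) tuples, both ends inclusive."""
    months = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        months.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months

horizon = month_range(2010, 1, 2011, 7)
print(len(horizon))  # 19
```

Under that assumption, each module and component pair needs 19 monthly predictions.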

### Pairs of Module and Component 

In [7]:
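def get_all_pairs(df):
    """Return the unique (module, component) pairs in df, in order of first appearance."""
    pairs = []
    seen = set()  # set membership is O(1); scanning the list would be O(n) per row
    for pair in zip(df['module_category'], df['component_category']):
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs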

In [8]:
pairs_sales = get_all_pairs(sales)

In [9]:
# compare categories by their numeric suffix, e.g. 'M4' -> 4 and 'P27' -> 27
min_module_category = min(pairs_sales, key=lambda x: int(x[0][1:]))[0]
max_module_category = max(pairs_sales, key=lambda x: int(x[0][1:]))[0]

min_component_category = min(pairs_sales, key=lambda x: int(x[1][1:]))[1]
max_component_category = max(pairs_sales, key=lambda x: int(x[1][1:]))[1]

In [10]:
print('Min Module category %s, Max Module Category %s ' % (min_module_category, max_module_category))
print('Min Component category %s, Max Component category %s ' % (min_component_category, max_component_category))

Min Module category M0, Max Module Category M9 
Min Component category P01, Max Component category P31 


**Module categories range from M0 to M9, and component categories range from P01 to P31.**
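
Comparing by numeric suffix avoids the pitfalls of string ordering, where a label such as `P9` would sort after `P10`. A small sketch with made-up labels; `category_number` is a hypothetical helper.

```python
def category_number(label):
    """Return the integer after the one-letter prefix, e.g. 'P27' -> 27."""
    return int(label[1:])

labels = ['P10', 'P9', 'P02']  # made-up labels for illustration
print(sorted(labels))                       # ['P02', 'P10', 'P9']
print(sorted(labels, key=category_number))  # ['P02', 'P9', 'P10']
```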

### Modules and Components in test set

In [11]:
pairs_repairs = get_all_pairs(repair)

In [12]:
print('Number of pairs in sales %d' % len(pairs_sales))
print('Number of pairs in repair %d' % len(pairs_repairs))

Number of pairs in sales 310
Number of pairs in repair 224


In [13]:
pairs_test = get_all_pairs(output_id_map)

In [14]:
print('Module and Component pairs not in test set \n\n%s' % (list(set(pairs_repairs) - set(pairs_test)),))

Module and Component pairs not in test set 

[]


**So every (module, component) pair in the repair log is also present in the test set.**
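
The check above covers only one direction, repair pairs that are missing from the test set. A sketch of the two-way comparison on made-up pairs; the variable names are illustrative and the values are not taken from the data.

```python
repair_pairs = {('M1', 'P02'), ('M2', 'P30')}
test_pairs = {('M1', 'P02'), ('M2', 'P30'), ('M3', 'P05')}

# pairs seen in the repair log but never asked for in the test set
missing_from_test = sorted(repair_pairs - test_pairs)
# pairs asked for in the test set that have no repair history
missing_from_repair = sorted(test_pairs - repair_pairs)

print(missing_from_test)    # []
print(missing_from_repair)  # [('M3', 'P05')]
```

A non-empty second list would matter for any model that looks up per-pair history.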

## Renaming columns

In [15]:
import re

In [16]:
def rename_columns(column):
    column = re.sub(r'[/()]', '_', column)
    return column

sales.columns = sales.columns.map(rename_columns)
repair.columns = repair.columns.map(rename_columns)
output_id_map.columns = output_id_map.columns.map(rename_columns)
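
To see what the substitution does, the sketch below applies the same regular expression to the three original column names that contain special characters.

```python
import re

def rename_columns(column):
    # replace '/', '(' and ')' with underscores
    return re.sub(r'[/()]', '_', column)

original = ['year/month', 'year/month(sale)', 'year/month(repair)']
renamed = [rename_columns(name) for name in original]
print(renamed)
# ['year_month', 'year_month_sale_', 'year_month_repair_']
```

The trailing underscore comes from the closing parenthesis.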

In [17]:
output_id_map.head(3)

Unnamed: 0,module_category,component_category,year,month
0,M1,P02,2010,1
1,M1,P02,2010,2
2,M1,P02,2010,3


In [18]:
sales['year_month'] = pd.to_datetime(sales.year_month)

repair['year_month_sale_'] = pd.to_datetime(repair.year_month_sale_)
repair['year_month_repair_'] = pd.to_datetime(repair.year_month_repair_)
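
The raw values look like `2005/1`, a year followed by a non-padded month. Below is a minimal standard-library sketch of how such a string maps to the first day of the month, which matches the `2007-09-01` style values shown later in the notebook. Passing an explicit format is an assumption about the data, not something the notebook does; `parse_year_month` is a hypothetical helper.

```python
from datetime import datetime

def parse_year_month(text):
    """Parse 'YYYY/M' into a datetime at the first day of that month."""
    return datetime.strptime(text, '%Y/%m')

parsed = parse_year_month('2005/1')
print(parsed.strftime('%Y-%m-%d'))  # 2005-01-01
```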

## Baseline Model

**For each (module, component) pair, compute the mean of `number_repair` over the training set, then predict that mean for every test row with the same pair.**

In [19]:
group_by_module_component = repair.groupby(['module_category', 'component_category'])['number_repair'].mean()

In [73]:
def prediction(df):
    preds = []
    for module, component in zip(df['module_category'], df['component_category']):
        preds.append(group_by_module_component[(module, component)])
    
    return preds

def create_submission(preds, filename):
    submission_df = pd.read_csv('./data/SampleSubmission.csv')
    submission_df['target'] = preds
    submission_df.to_csv('./submissions/'+ filename, index=False)

In [21]:
preds = prediction(output_id_map)

**This baseline model scores 5.9085 on the private leaderboard.**
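
The lookup in `prediction` raises a `KeyError` for any pair that never appears in the repair log. Here is a dependency-free sketch of the same group-mean idea with a fallback; the toy rows, the helper `predict_pair` and the fallback value of 0.0 are assumptions for illustration.

```python
from collections import defaultdict

# toy training rows: (module, component, number_repair)
rows = [('M1', 'P02', 2), ('M1', 'P02', 4), ('M2', 'P30', 1)]

totals = defaultdict(float)
counts = defaultdict(int)
for module, component, number_repair in rows:
    totals[(module, component)] += number_repair
    counts[(module, component)] += 1

pair_means = {pair: totals[pair] / counts[pair] for pair in totals}

def predict_pair(pair, default=0.0):
    """Mean repairs for a known pair, or `default` for an unseen one."""
    return pair_means.get(pair, default)

print(predict_pair(('M1', 'P02')))  # 3.0
print(predict_pair(('M9', 'P31')))  # 0.0
```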

## Naive Algorithm

**With the naive method, every future forecast equals the last observed value of the series.**

In [54]:
g = repair.sort_values(by='year_month_repair_')

In [67]:
def naive_algorithm(repair, pairs_test):
    """
    For every (module, component) pair in pairs_test, return the most recent
    number_repair observed in the training set. That value is used as the
    forecast for every future month.
    
    example: for (M1, P02) the repair amount for all the months from January 2010 to July 2011 will be the most
    recent value of number_repair for this (module, component) pair in the training set
    """
    repair = repair.sort_values(by='year_month_repair_')
    
    repair_recent = {}
    
    for module, component in pairs_test:
        mask = (repair.module_category == module) & (repair.component_category == component)
        repair_module_component = repair[mask]
        # take the last row of this pair's history, not of the whole repair log
        repair_recent[(module, component)] = repair_module_component.iloc[-1].number_repair
    
    return repair_recent

In [70]:
def iterate_and_predict(train, test, pairs_test):
    prediction_dict = naive_algorithm(train, pairs_test)
    predictions = []
    
    for module, component in zip(test['module_category'], test['component_category']):
        predictions.append(prediction_dict[(module, component)])
    
    return predictions

preds_naive = iterate_and_predict(repair, output_id_map, pairs_test)

In [74]:
create_submission(preds_naive, 'preds_naive.csv')
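
A pair can have several rows for the same repair month, one per sale month, so the last row of its history may not equal that month's total. Below is a sketch of a variant that first sums repairs per repair month and then repeats the latest total. Treating the monthly total as the target is an assumption, the rows are made up, and `naive_forecast` is a hypothetical helper.

```python
from collections import defaultdict

# toy rows: (module, component, (repair_year, repair_month), number_repair)
rows = [
    ('M1', 'P02', (2009, 11), 3),
    ('M1', 'P02', (2009, 12), 1),
    ('M1', 'P02', (2009, 12), 2),  # same repair month, different sale month
    ('M2', 'P30', (2009, 10), 5),
]

# total repairs per pair and repair month
monthly = defaultdict(lambda: defaultdict(int))
for module, component, repair_month, number_repair in rows:
    monthly[(module, component)][repair_month] += number_repair

def naive_forecast(pair, horizon):
    """Repeat the total of the latest observed repair month `horizon` times."""
    series = monthly[pair]
    last_month = max(series)  # tuples compare by year, then month
    return [series[last_month]] * horizon

print(naive_forecast(('M1', 'P02'), 3))  # [3, 3, 3]
print(naive_forecast(('M2', 'P30'), 2))  # [5, 5]
```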

## Average 

In [46]:
preds_naive

TypeError: 'NoneType' object has no attribute '__getitem__'

In [26]:
output_id_map.head()

Unnamed: 0,module_category,component_category,year,month
0,M1,P02,2010,1
1,M1,P02,2010,2
2,M1,P02,2010,3
3,M1,P02,2010,4
4,M1,P02,2010,5


In [24]:
repair.head(3)

Unnamed: 0,module_category,component_category,year_month_sale_,year_month_repair_,number_repair
0,M6,P16,2007-09-01,2009-04-01,1
1,M2,P30,2007-09-01,2009-08-01,1
2,M1,P12,2006-10-01,2008-02-01,2


In [30]:
x = repair[(repair.module_category == 'M1') & (repair.component_category == 'P02')].sort_values(by='year_month_repair_')

In [31]:
x.shape

(191, 5)

In [43]:
x.iloc[190].number_repair

1