# Question 1

In [1]:
import numpy as np
import pandas as pd

Let's now put X and y together and save them as a CSV file for easier manipulation in R.

In [2]:
def consolidate(name):
    """
    Consolidate X, y and save to a csv file.
    
    Params:
        name: string, base name of the data files, excluding the extension.
    
    """
    def read_eval(file):
        """
        Read in the text and evaluate it.
        
        Params:
            file: string, file name
        """
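        import ast  # standard library; imported here so the cell stays self-contained
        with open(file) as f:
            # literal_eval only parses Python literals (lists, numbers, ...),
            # so unlike eval it cannot execute arbitrary code from the file
            return ast.literal_eval(f.read())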
        
    # Complete the data file names 
    name_X = name + '.xdat'
    name_y = name + '.ydat'
    
    # Read in X and y
    X = read_eval(name_X)
    y = read_eval(name_y)
    
    # Convert both to np arrays
    X = np.array(X)
    y = np.array(y)
    
    # Reshape y to make y 2D
    y = y.reshape([y.shape[0], -1])
    data = np.concatenate((X, y), axis=1)
    
    # Save to csv
    np.savetxt(name + '.csv', data, delimiter=",")

In [3]:
consolidate('data0')
consolidate('data1')
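
To see what `consolidate` does to the arrays before saving, the reshape-and-concatenate step can be tried on toy data. This is only a sketch: the values below are hypothetical stand-ins, not the contents of the real `.xdat` / `.ydat` files.

```python
import numpy as np

# Hypothetical stand-ins for the parsed contents of a .xdat / .ydat pair
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # 3 samples, 2 features
y = np.array([0, 1, 0])                              # 1-D labels, shape (3,)

# Make y a column vector so it can be appended as the last column of X
y_col = y.reshape([y.shape[0], -1])                  # shape (3, 1)
data = np.concatenate((X, y_col), axis=1)            # shape (3, 3)

print(data.shape)
```

Each row of `data` then holds the features of one sample followed by its label, which is the layout written to the CSV file.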

# Question 2

In [4]:
df_a = pd.read_csv('alternator_actions.csv')
df_s = pd.read_csv('starter_actions.csv')

In [5]:
df_a.head()

|   | correction_description | "part_name" |
|---|---|---|
| 0 | CHECK NO START CHARGE BATT ALT NOT CHARGING ... | "ALTERNATOR" |
| 1 | check mil light p3119 stored road test light... | "ALTERNATOR" |
| 2 | check mil light p3119 stored road test light... | "ALTERNATOR" |
| 3 | TEST BATTERY AND CHARGED TESTED ALTERNATOR NEE... | "ALTERNATOR" |
| 4 | DIAGNOSIS FOR NO START. SCAN FOR CODE, U0100, ... | "ALTERNATOR" |


In [6]:
df_s.head()

|   | correction_description | "part_name" |
|---|---|---|
| 0 | inspected vehicle for clicking and no start co... | "STARTER" |
| 1 | CHECK FOR NO START JUST CLICK CHECKED FOR A NO... | "STARTER" |
| 2 | INSPECT NO START CHECK CODES-P127A STARTER CON... | "STARTER" |
| 3 | CHECK NO START OK NOW REMOVE CONNECTIONS OF... | "STARTER" |
| 4 | 7 way connector wiring has melted on exhaust, ... | "STARTER" |


In [7]:
df_a.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 2 columns):
correction_description    11 non-null object
 "part_name"              11 non-null object
dtypes: object(2)
memory usage: 256.0+ bytes


In [8]:
df_s.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38 entries, 0 to 37
Data columns (total 2 columns):
correction_description    38 non-null object
 "part_name"              38 non-null object
dtypes: object(2)
memory usage: 688.0+ bytes


## Steps Needed for Classification

### 1. Gather Data
To classify text descriptions into two classes (labels), alternator and starter, we need to build a Machine Learning model. This model can be trained using a "training set". The training set needs to have samples from both classes and the corresponding class labels, so that the model can learn which sample corresponds to which label. Therefore, the first step is to combine both datasets into one. Steps:

1. Combine datasets into one training set. We can use `pandas` to read in both datasets and its `concat` function to combine them.
2. Shuffle the combined dataset so that samples from the two classes are interleaved rather than grouped by class, and any later split reflects the overall class distribution. To shuffle the rows, we can use the `numpy.random` module's `permutation` function on the row indices, or pandas' `DataFrame.sample(frac=1)`.
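
The two steps above can be sketched as follows. This is a minimal illustration, not the final code: the two small frames are hypothetical stand-ins for the real CSV files, and the column name `part_name` is simplified (the real column name carries extra quotes, as `df_a.info()` shows).

```python
import pandas as pd

# Hypothetical miniature versions of the alternator and starter datasets
df_a = pd.DataFrame({
    "correction_description": ["replaced alternator", "alt not charging"],
    "part_name": ["ALTERNATOR", "ALTERNATOR"],
})
df_s = pd.DataFrame({
    "correction_description": ["no start just click", "replaced starter",
                               "starter motor failed"],
    "part_name": ["STARTER", "STARTER", "STARTER"],
})

# Stack the two frames and renumber the rows 0..n-1
df = pd.concat([df_a, df_s], ignore_index=True)

# sample(frac=1) returns every row once, in random order;
# random_state makes the shuffle repeatable
df = df.sample(frac=1, random_state=0).reset_index(drop=True)

print(df["part_name"].value_counts())
```

Shuffling changes only the row order, so the class counts after `sample` are the same as before it.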

### 2. Preprocess Data

There are many inconsistencies in the text; for example, the same word can appear in lowercase or uppercase. Also, since Machine Learning algorithms only deal with numeric inputs, our data needs to be converted from text to numeric values first. These are examples of the preprocessing needed before we can feed the data into a Machine Learning algorithm. The detailed steps are:

1. Convert to lowercase: This can be done with Python's string method `.lower()`.
2. Remove punctuation: This can be done using regular expressions (Python's `re` module).
3. Remove common words that are not helpful (stopwords): In the English language, there are many words that are so frequently seen that they are not helpful in the classification; examples include "the", "this" etc. This can be done with the Natural Language Toolkit (NLTK) or with scikit-learn's built-in English stopword list.
4. Convert words to root form (stemming): we don't want different forms of the same word to be treated as different words with different meanings, so we need to perform stemming. This can be done with NLTK's stemmers, such as `PorterStemmer`. This way, "charging" and "charged" will be reduced to the same stem.
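5. Split into separate words (tokenization): we want to transform our text samples to separate words so that they can be represented by numbers. To do that, we first need to split the samples. This can be done with Python's string method `.split()`. Because stopword removal and stemming operate on individual words, this split is in practice performed before steps 3 and 4.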
6. Transform into numerical values (vectorization): For the labels, we can use the `LabelEncoder` class from the `sklearn.preprocessing` module to convert them to integers. For the descriptions (features), we can use `scikit-learn`'s `CountVectorizer` or `TfidfVectorizer`.

Once this step is done, we'll have a matrix of numeric values that we can feed into a Machine Learning algorithm.
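
The preprocessing steps above can be sketched as follows. This is a minimal illustration under a few assumptions: the two descriptions are hypothetical examples modeled on the data, the helper `clean` and its `stem` parameter are names introduced here, and stemming is left as a pluggable function (for example `nltk.stem.PorterStemmer().stem`) so that the sketch runs with scikit-learn alone.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.preprocessing import LabelEncoder


def clean(text, stem=lambda word: word):
    """Lowercase, strip punctuation, tokenize, drop stopwords, then stem.

    `stem` is a hypothetical hook: by default it leaves words unchanged.
    """
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # replace punctuation with spaces
    tokens = text.split()                       # whitespace tokenization
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    return " ".join(stem(t) for t in tokens)


# Hypothetical descriptions and labels
docs = ["CHECK NO START, CHARGE BATT. ALT NOT CHARGING",
        "Inspected vehicle for clicking and no start"]
labels = ["ALTERNATOR", "STARTER"]

cleaned = [clean(d) for d in docs]

# Bag-of-words counts: one row per description, one column per word
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(cleaned)

# Map the class names to integers (classes are sorted alphabetically)
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

print(cleaned)
print(X.shape, y)
```

Note that scikit-learn's stopword list also removes negations such as "no" and "not", which may carry signal in descriptions like "no start"; whether to keep them is a choice worth checking against validation results.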

### 3. Model Building and Testing

1. Spot-checking: First, I want to spot-check different algorithms to find the most promising algorithms to use for this specific dataset. I'll try different categories of algorithms including:
 - Linear: Logistic Regression, Linear Discriminant Analysis etc.
 - Nonlinear: Naive Bayes, XGBoost, K-Nearest Neighbors, Support Vector Machines, Neural Networks etc.
 - Ensemble: AdaBoost, Random Forest etc.
   I'll use `scikit-learn` and XGBoost for this part.
2. Model building and tuning: Once I identify several algorithms that are most promising, I'll start building the model using cross-validation. I'll use an 80:20 split for train and validation sets and stratified 10-fold cross-validation. Since the alternator class has only 11 samples, fewer folds may be needed so that every fold contains samples from both classes. I'll use `scikit-learn`'s `GridSearchCV` and `RandomizedSearchCV` for hyperparameter tuning to find the best model.
3. Model testing: Finally, I'll use `pickle` to persist the trained model together with the fitted vectorizer, since new text must be transformed exactly as the training text was. Later on, once I have access to a new dataset, I'll be able to load the trained model and make predictions using its `predict` method.
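
The workflow above can be sketched end to end as follows. This is a rough illustration, not a tuned solution: the toy corpus is hypothetical and repeated to reach a workable size (so the scores mean nothing), only two candidate algorithms are spot-checked, 4 folds are used instead of 10 because the toy data is tiny, and the model is pickled to an in-memory byte string rather than a file.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the real descriptions
texts = ["alternator not charging", "replaced alternator belt",
         "battery light on alternator failed", "alternator output low",
         "charging system fault alternator",
         "starter clicks no crank", "replaced starter motor",
         "no crank starter failed", "starter solenoid clicking",
         "starter relay no crank"] * 2
labels = (["ALTERNATOR"] * 5 + ["STARTER"] * 5) * 2

# 80:20 split, keeping the class proportions in both parts
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0)

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# 1. Spot-check a few candidate algorithms with cross-validation
candidates = {"logreg": LogisticRegression(max_iter=1000),
              "naive_bayes": MultinomialNB()}
for name, clf in candidates.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    scores = cross_val_score(pipe, X_train, y_train, cv=cv)
    print(name, scores.mean())

# 2. Tune one promising candidate; the vectorizer lives inside the pipeline,
#    so it is refit on each training fold and persisted with the model
pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=cv,
                      scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_val, y_val))

# 3. Persist the fitted pipeline and reload it for prediction
blob = pickle.dumps(search.best_estimator_)
restored = pickle.loads(blob)
print(restored.predict(["vehicle has no crank, starter clicking"]))
```

Keeping the vectorizer and the classifier in one `Pipeline` means a single pickled object is enough to go from raw text to a prediction; `pickle.dump` to a file works the same way as `pickle.dumps` here.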