In [1]:
import pandas as pd

data_file = './labeled_data/qbj_2020_strata.csv'

# Read the data into a pandas dataframe
df = pd.read_csv(data_file,           # The data file being read, from the variable assignment above
                 on_bad_lines='warn', # Warn about malformed lines and skip them instead of raising an error
                 dtype='str')         # Read every column as a string, so numeric codes and IDs stay text
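
As a small illustration of what `dtype='str'` changes, the sketch below reads the same CSV text twice. The two-row CSV and its values are made up for the example and are not taken from `qbj_2020_strata.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row CSV, used only to illustrate the dtype argument.
csv_text = "RECORD_ID,DEVICE_PROBLEM_CODE\n000123,2896\n000456,1069\n"

# Default behaviour: pandas infers integer columns.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype='str': every column is kept as text.
as_text = pd.read_csv(io.StringIO(csv_text), dtype='str')

print(inferred['RECORD_ID'].tolist())  # integers, leading zeros are lost
print(as_text['RECORD_ID'].tolist())   # strings, leading zeros are kept
```

Keeping identifiers and problem codes as text avoids treating them as quantities later on.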


In [2]:
# Show the number of rows and columns in the dataframe
df.shape

(100, 7)

In [3]:
# Convert all the text in the dataframe to lowercase
df = df.apply(lambda x: x.astype(str).str.lower())

In [35]:
# Create the independent matrix and show one randomly sampled row
independent_matrix = df.drop('Disposition', axis=1)
independent_matrix.sample()

Unnamed: 0,RECORD_ID,FOI_TEXT,DEVICE_PROBLEM_CODE,DEVICE_PROBLEM_TEXT,DEVICE_REPORT_PRODUCT_CODE,Repeating terms
48,909902,it was reported that the customer received mul...,2896,communication or transmission problem,qbj,\n\nreplacement transmitter


In [36]:
# Create the dependent matrix and show one randomly sampled row
dependent_matrix = df[['Disposition']]
dependent_matrix.sample()

Unnamed: 0,Disposition
92,failure inconclusive


In [6]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(independent_matrix, dependent_matrix)

In [7]:
# Show the number of rows for training and testing
len(x_train), len(x_test)

(75, 25)
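
The 75/25 split comes from `train_test_split` holding out 25% of the rows when no size is given. The call above is also unseeded, so the rows in each part change between runs. The sketch below shows one way to make the split repeatable and keep the class ratio equal in both parts. The toy labels and their 60/40 proportion are assumptions for illustration, not the notebook's data.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for 100 labeled rows (hypothetical 60/40 class balance).
features = list(range(100))
labels = ['component failure'] * 60 + ['failure inconclusive'] * 40

x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels,
    test_size=0.25,    # same as the default when no size is given
    random_state=0,    # fixes the shuffle so reruns give the same split
    stratify=labels)   # keeps the class ratio the same in train and test

print(len(x_tr), len(x_te))
print(y_te.count('component failure'), y_te.count('failure inconclusive'))
```

Whether stratifying is appropriate depends on how the evaluation is meant to reflect the real class balance.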

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', 
                             analyzer='word', 
                             binary=False, 
                             decode_error='ignore', 
                             lowercase=True, 
                             tokenizer=None, 
                             use_idf=True, 
                             vocabulary=None)

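# Fit on the narrative text column; iterating over a whole DataFrame yields only its column names
vectorizer.fit(x_train['FOI_TEXT'])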

TfidfVectorizer(decode_error='ignore', stop_words='english')
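
`TfidfVectorizer` expects an iterable of text documents. The sketch below, on a made-up two-row frame, suggests why the text column rather than a whole DataFrame is the thing to pass: iterating over a DataFrame yields its column names, so those names become the vocabulary.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical rows; only the column names mirror the notebook's data.
toy = pd.DataFrame({
    'RECORD_ID': ['1', '2'],
    'FOI_TEXT': ['transmitter lost signal', 'replacement transmitter sent'],
})

on_frame = TfidfVectorizer().fit(toy)               # documents = column names
on_column = TfidfVectorizer().fit(toy['FOI_TEXT'])  # documents = report texts

print(sorted(on_frame.vocabulary_))
print(sorted(on_column.vocabulary_))
```

Checking `vocabulary_` after fitting is a quick sanity test that the vectorizer saw the intended text.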

## Strategy to use to generate predictions.

[Documentation for sklearn.dummy.DummyClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)

`strategy: {"most_frequent", "prior", "stratified", "uniform", "constant"}, default="prior"`

**“most_frequent”**: the predict method always returns the most frequent class label in the observed y argument passed to fit. The predict_proba method returns the matching one-hot encoded vector.

**“prior”**: the predict method always returns the most frequent class label in the observed y argument passed to fit (like “most_frequent”). predict_proba always returns the empirical class distribution of y also known as the empirical class prior distribution.

**“stratified”**: the predict_proba method randomly samples one-hot vectors from a multinomial distribution parametrized by the empirical class prior probabilities. The predict method returns the class label which got probability one in the one-hot vector of predict_proba. Each sampled row of both methods is therefore independent and identically distributed.

**“uniform”**: generates predictions uniformly at random from the list of unique classes observed in y, i.e. each class has equal probability.

**“constant”**: always predicts a constant label that is provided by the user. This is useful for metrics that evaluate a non-majority class.
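
"most_frequent" and "prior" predict the same label and differ only in `predict_proba`. A minimal sketch on made-up labels (three of one class, one of the other) shows the difference. The feature matrix is a placeholder, since `DummyClassifier` ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((4, 1))  # placeholder features; DummyClassifier ignores them
y = ['component failure'] * 3 + ['failure inconclusive']  # toy labels

results = {}
for strategy in ('most_frequent', 'prior'):
    clf = DummyClassifier(strategy=strategy).fit(X, y)
    label = clf.predict(X[:1])[0]
    proba = clf.predict_proba(X[:1])[0].tolist()
    results[strategy] = (label, proba)
    print(strategy, label, proba)
```

With these toy labels, "most_frequent" reports a one-hot probability vector while "prior" reports the class frequencies (0.75 and 0.25), which matters for metrics computed from probabilities rather than from hard labels.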

In [41]:
# Use the DummyClassifier, iterating over the different strategies
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report

dummy_classifier_scores = {}

strategies = ['most_frequent', 'prior', 'stratified', 'uniform', 'constant']

for strategy in strategies:
    print(f"{strategy.upper()} Strategy\n")
    
    if strategy == 'constant':
        classifier = DummyClassifier(strategy=strategy, random_state=0, constant='component failure')
    else:
        classifier = DummyClassifier(strategy=strategy, random_state=0)
    
    classifier.fit(x_train, y_train)
        
    y_pred = classifier.predict(x_test)
    
    # Print the scores and reports for each classifier
    print(f"Accuracy score = {accuracy_score(y_test, y_pred)}\n")
    
    print(classification_report(y_test, y_pred, zero_division=0))
    
    print('-' * 80, '\n')

MOST_FREQUENT Strategy

Accuracy score = 0.56

                      precision    recall  f1-score   support

   component failure       0.56      1.00      0.72        14
failure inconclusive       0.00      0.00      0.00        11

            accuracy                           0.56        25
           macro avg       0.28      0.50      0.36        25
        weighted avg       0.31      0.56      0.40        25

-------------------------------------------------------------------------------- 

PRIOR Strategy

Accuracy score = 0.56

                      precision    recall  f1-score   support

   component failure       0.56      1.00      0.72        14
failure inconclusive       0.00      0.00      0.00        11

            accuracy                           0.56        25
           macro avg       0.28      0.50      0.36        25
        weighted avg       0.31      0.56      0.40        25

-------------------------------------------------------------------------------- 
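
The MOST_FREQUENT and PRIOR reports are identical because both strategies predict "component failure" for every test row. As a rough check, the averaged figures can be reproduced by hand from the support counts shown in the reports (14 and 11). The helper `f1` below is written for this sketch and is not part of scikit-learn.

```python
# Support counts from the reports; every row is predicted "component failure".
support = {'component failure': 14, 'failure inconclusive': 11}
total = sum(support.values())

# Precision: 14 of the 25 predictions are correct; the other class is never predicted.
precision = {'component failure': 14 / total, 'failure inconclusive': 0.0}
# Recall: every true "component failure" row is found, no "failure inconclusive" row is.
recall = {'component failure': 1.0, 'failure inconclusive': 0.0}

def f1(p, r):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

f1_scores = {k: f1(precision[k], recall[k]) for k in support}

# Macro average: plain mean over classes. Weighted average: mean weighted by support.
macro_precision = sum(precision.values()) / len(precision)
weighted_precision = sum(precision[k] * support[k] for k in support) / total
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
weighted_f1 = sum(f1_scores[k] * support[k] for k in support) / total

print(round(f1_scores['component failure'], 2))
print(round(macro_precision, 2), round(weighted_precision, 2))
print(round(macro_f1, 2), round(weighted_f1, 2))
```

The 0.56 accuracy is simply the share of the majority class in the test set (14 of 25), which is the baseline any real classifier should beat.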