#### Workflow for Operationalizing a Text Classification Model

Steps for classification
(refer to slides)
 1. Reload the model
 2. Reload the Vectorizer
 3. Preprocess the new text
 4. Numerically encode the input
 5. Predict the label



In [1]:
import pickle
import re
import string
#from sklearn.feature_extraction.text import CountVectorizer

#### 1. Reload the model

In [2]:
# Check the name of the latest model
!ls -l ../outputs

total 624
-rw-r--r--  1 tanpohkeam  staff  49714 Jul  1 21:52 classifier-2021-07-01.pkl
-rw-r--r--  1 tanpohkeam  staff  49714 Jul  2 10:15 classifier-2021-07-02.pkl
-rw-r--r--  1 tanpohkeam  staff  74021 Jul  1 21:52 countvectoriser-2021-07-01.pkl
-rw-r--r--  1 tanpohkeam  staff  74021 Jul  2 10:15 countvectoriser-2021-07-02.pkl


In [3]:
classifier_path = '../outputs/classifier-2021-07-02.pkl'

# Update the path above so that it references the latest classifier file
with open(classifier_path, 'rb') as f:
    model = pickle.load(f)
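
Hard-coding the dated file name means editing this cell after every retraining run. Because the file names in the listing above embed an ISO-style date (YYYY-MM-DD), sorting them as strings also sorts them chronologically. The sketch below relies on that assumption; the helper name latest_artifact is hypothetical and is not part of the original notebook.

```python
from pathlib import Path


def latest_artifact(directory, prefix):
    """Return the newest '<prefix>-YYYY-MM-DD.pkl' file in `directory`.

    Assumes the date in the file name is ISO formatted, so that
    lexicographic order matches chronological order.
    """
    candidates = sorted(Path(directory).glob(f"{prefix}-*.pkl"))
    if not candidates:
        raise FileNotFoundError(f"No {prefix}-*.pkl files found in {directory}")
    return candidates[-1]
```

With the directory listing shown earlier, latest_artifact('../outputs', 'classifier') would pick the file dated 2021-07-02, and the same call with the prefix 'countvectoriser' would pick the matching vectorizer.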

#### 2. Reload the vectorizer

In [4]:
# Update this path so that it references the proper file
countvectorizer_path = '../outputs/countvectoriser-2021-07-02.pkl'

with open(countvectorizer_path, 'rb') as f:
    trained_cv = pickle.load(f)

In [5]:
# Check the parameters of the model
model

LogisticRegression()

In [6]:
# Check the parameters of the Count Vectorizer
trained_cv

CountVectorizer(stop_words='english')

#### Function to Preprocess the input text

A new input text must be pre-processed in exactly the same way as the training data was pre-processed.

In [7]:
def preprocess(text):
    # Replace any token that contains a digit (e.g. "87575", "150p") with a space
    text = re.sub(r"\w*\d\w*", ' ', text)
    # Lowercase the text and replace punctuation with spaces
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text.lower())
    return text
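
To make the effect of the two substitutions concrete, here is a small worked example on a made-up spam-like message. The function is repeated so that the snippet runs on its own; the sample string is an assumption chosen for illustration.

```python
import re
import string


def preprocess(text):
    # Same two steps as the function above: drop tokens containing digits,
    # then lowercase and replace punctuation with spaces.
    text = re.sub(r"\w*\d\w*", " ", text)
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text.lower())
    return text


sample = "URGENT! Txt the word: CLAIM to No: 81010"
tokens = preprocess(sample).split()
print(tokens)
# ['urgent', 'txt', 'the', 'word', 'claim', 'to', 'no']
```

The number 81010 disappears entirely, and the punctuation is replaced by spaces, which the split() call then collapses.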

#### Function to numerically encode the input using the Count Vectorizer
Note that this function uses .transform() and not .fit_transform().
The latter would refit the vectorizer on the new text alone, so we would end up with a different set of features / terms from the one the model was trained on.

In [8]:
def encode_text_to_vector(cv, text):
    text_vector = cv.transform([text])
    return text_vector
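
The difference between the two calls can be seen on a toy corpus. This is a minimal sketch, assuming scikit-learn is installed; the two training sentences are stand-ins for the real training messages, not the data used to build the pickled vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in corpus; the real vectorizer was fitted on the training messages.
train_texts = ["free prize call now", "see you at home tonight"]
cv = CountVectorizer()
cv.fit(train_texts)

new_text = "call me when you are home"

# transform() reuses the fitted vocabulary, so the width matches training.
reused = cv.transform([new_text])

# fit_transform() on a fresh vectorizer learns a vocabulary from the new
# text alone, so the columns no longer line up with the trained model.
refit = CountVectorizer().fit_transform([new_text])

print(len(cv.vocabulary_), reused.shape, refit.shape)
# 9 (1, 9) (1, 6)
```

Only the first vector has the nine columns a model trained on this corpus would expect; words such as "me" and "when" that were never seen during training are simply ignored.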

#### 3. Preprocess the new text

In [14]:
new_text = input("Enter the new text > ")
new_text = preprocess(new_text)

Enter the new text > Can I come to your house to eat


#### 4. Numerically encode the input

In [15]:
# Only features (words) that were encountered during the training stage are mapped
new_text_vector = encode_text_to_vector(trained_cv, new_text)

In [16]:
print(new_text_vector)

  (0, 1022)	1
  (0, 1568)	1
  (0, 2445)	1
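
Each line of the sparse output above reads (row, column) count. To see which terms the column indices stand for, the vectorizer's vocabulary_ dictionary (term to column index) can be inverted. The sketch below assumes scikit-learn is installed and uses a toy vectorizer in place of trained_cv, so its column numbers will not match the ones printed above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in vectorizer fitted on a toy corpus (assumption for illustration).
cv = CountVectorizer(stop_words="english")
cv.fit(["come to my house", "eat dinner", "free prize"])

vector = cv.transform(["can i come to your house to eat"])

# vocabulary_ maps term -> column index; invert it to look up terms by column.
index_to_term = {index: term for term, index in cv.vocabulary_.items()}
terms = sorted(index_to_term[col] for col in vector.indices)
print(terms)
# ['come', 'eat', 'house']
```

Stop words such as "can", "to" and "your" are dropped by stop_words='english', and the single-character token "i" is dropped by the default tokenizer, which is consistent with only three non-zero entries appearing in the output above.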


#### 5. Predict the label

The cell below takes the new text that was entered at the prompt and encoded in the steps above, and passes its vector to the model.
The predicted label is then printed.

In [12]:
predicted_label = model.predict(new_text_vector)[0]
print(f"The input text <{new_text}> is a <{predicted_label}> ")

The input text <can i come to your house to eat dinner> is a <ham> 
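
The three steps (preprocess, encode, predict) can also be wrapped in a single function, which is convenient when trying many messages. The sketch below is one possible arrangement, assuming scikit-learn is installed; the classify helper and the toy model and vectorizer are hypothetical stand-ins for the pickled objects, and predict_proba is used to show how confident the model is.

```python
import re
import string

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def preprocess(text):
    text = re.sub(r"\w*\d\w*", " ", text)
    return re.sub("[%s]" % re.escape(string.punctuation), " ", text.lower())


def classify(text, model, cv):
    """Preprocess, encode and predict; return (label, class probabilities)."""
    vector = cv.transform([preprocess(text)])
    label = model.predict(vector)[0]
    probabilities = dict(zip(model.classes_, model.predict_proba(vector)[0]))
    return label, probabilities


# Toy stand-ins for the pickled objects (assumption for illustration only).
texts = ["win free prize now", "free cash prize claim",
         "see you at home", "come home for dinner"]
labels = ["spam", "spam", "ham", "ham"]
toy_cv = CountVectorizer(stop_words="english").fit(texts)
toy_model = LogisticRegression().fit(toy_cv.transform(texts), labels)

label, probabilities = classify("Claim your FREE prize!", toy_model, toy_cv)
print(label, probabilities)
```

In the notebook itself, the call would be classify(new_text, model, trained_cv) with the raw text from the prompt, since the function does its own preprocessing.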


### More Sample Text for Testing

For fun, try copying and pasting different parts of the spam and ham messages below into the prompt.


ham	I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.

ham	I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.

ham	Oh k...i'm watching here:)
ham	Eh u remember how 2 spell his name... Yes i did. He v naughty make until i v wet.
ham	Fine if thats the way u feel. Thats the way its gota b

ham	Is that seriously how you spell his name?

spam	SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info

spam	URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18

spam	XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL

spam	England v Macedonia - dont miss the goals/team news. Txt ur national team to 87077 eg ENGLAND to 87077 Try:WALES, SCOTLAND 4txt/£1.20 POBOXox36504W45WQ 16+



In [13]:
#### Optional step: sanity check

# Confirm that the new text vector has the same number of features as in the development/modelling phase
print(new_text_vector.toarray().shape)
# Note: get_feature_names() was removed in scikit-learn 1.2; on version 1.0 or later use get_feature_names_out()
print(trained_cv.get_feature_names())


(1, 6125)
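
Since the outputs folder holds classifier and vectorizer files from more than one date, it is possible to load a pair that does not belong together. A minimal sketch of an automated check follows, assuming scikit-learn is installed and a linear model such as LogisticRegression that exposes coef_; the helper name check_feature_alignment and the toy data are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def check_feature_alignment(model, cv):
    """Raise if the vectorizer's vocabulary size differs from the model's input width."""
    n_vocab = len(cv.vocabulary_)
    n_model = model.coef_.shape[1]
    if n_vocab != n_model:
        raise ValueError(
            f"Vectorizer has {n_vocab} features but the model expects {n_model}"
        )
    return n_vocab


# Toy stand-ins for the pickled objects (assumption for illustration only).
texts = ["free prize", "come home"]
toy_cv = CountVectorizer().fit(texts)
toy_model = LogisticRegression().fit(toy_cv.transform(texts), ["spam", "ham"])

print(check_feature_alignment(toy_model, toy_cv))
# 4
```

A matching count does not prove that the two files came from the same training run, but a mismatch is a clear sign that they did not.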
