In [1]:
import pandas as pd

xls = pd.ExcelFile('mandrill.xlsx')

sheet_names = xls.sheet_names
sheet_names

['dot. aplikacji Mandrill', 'dot. innych']

## Loading data from Excel into DataFrames

In [2]:
mandrill_posts = pd.read_excel(xls, sheet_name='dot. aplikacji Mandrill')
other_posts = pd.read_excel(xls, sheet_name='dot. innych')

(mandrill_posts.head(), other_posts.head())


(                                                Post
 0  [blog] Using Nullmailer and Mandrill for your ...
 1  [blog] Using Postfix and free Mandrill email s...
 2  @aalbertson There are several reasons emails g...
 3  @adrienneleigh I just switched it over to Mand...
 4  @ankeshk +1 to @mailchimp We use MailChimp for...,
                                                 Post
 0              ¿En donde esta su remontada Mandrill?
 1  .@Katie_PhD Alternate, 'reproachful mandrill' ...
 2  .@theophani can i get "drill" in there? it wou...
 3  “@ChrisJBoyland: Baby Mandrill Paignton Zoo 29...
 4  “@MISSMYA #NameAnAmazingBand MANDRILL!” Mint C...)

## Translating and cleaning the text: removing URLs, numbers, special characters, and stop words

In [3]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import re
from googletrans import Translator

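# Create the translator once and reuse it for every post
translator = Translator()

def translate_to_english(text):
    # Translate a post to English (the source language is auto-detected)
    translation = translator.translate(text, dest='en')
    return translation.text

def clean_text(text):
    text = translate_to_english(text)
    text = re.sub(r'http\S+', '', text)      # remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # keep only ASCII letters and whitespace
    text = text.lower()
    # drop English stop words
    text = ' '.join([word for word in text.split() if word not in ENGLISH_STOP_WORDS])
    return text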

mandrill_posts['Cleaned_Post'] = mandrill_posts['Post'].apply(clean_text)
other_posts['Cleaned_Post'] = other_posts['Post'].apply(clean_text)

mandrill_posts[['Post', 'Cleaned_Post']]

| | Post | Cleaned_Post |
|---|---|---|
| 0 | [blog] Using Nullmailer and Mandrill for your ... | blog using nullmailer mandrill ubuntu linux se... |
| 1 | [blog] Using Postfix and free Mandrill email s... | blog using postfix free mandrill email service... |
| 2 | @aalbertson There are several reasons emails g... | aalbertson reasons emails spam mind submitting... |
| 3 | @adrienneleigh I just switched it over to Mand... | adrienneleigh just switched mandrill lets impr... |
| 4 | @ankeshk +1 to @mailchimp We use MailChimp for... | ankeshk mailchimp use mailchimp marketing emai... |
| ... | ... | ... |
| 145 | We've simplified and reduced pricing for every... | weve simplified reduced pricing hooray |
| 146 | We’re Unifying Your Mandrill and MailChimp Dat... | unifying mandrill mailchimp data mailchimp ema... |
| 147 | Whaaat, I didn't know @MailChimp had an email ... | whaaat didnt know mailchimp email delivery api... |
| 148 | would like to send emails for welcome, passwor... | like send emails welcome password resets payme... |


In [4]:
other_posts[['Post', 'Cleaned_Post']]

| | Post | Cleaned_Post |
|---|---|---|
| 0 | ¿En donde esta su remontada Mandrill? | comeback mandrill |
| 1 | .@Katie_PhD Alternate, 'reproachful mandrill' ... | katiephd alternate reproachful mandrill cover ... |
| 2 | .@theophani can i get "drill" in there? it wou... | theophani drill picture mandrill holding drill... |
| 3 | “@ChrisJBoyland: Baby Mandrill Paignton Zoo 29... | chrisjboyland baby mandrill paignton zoo th ap... |
| 4 | “@MISSMYA #NameAnAmazingBand MANDRILL!” Mint C... | missmya nameanamazingband mandrill mint condit... |
| ... | ... | ... |
| 145 | Why Are Monkey Butts So Colorful?: Mandrill Wi... | monkey butts colorful mandrill wikimedia commo... |
| 146 | You can now experience the thrills of classic ... | experience thrills classic pc gaming mandrill ... |
| 147 | ジャンルごった煮のバンド、Mandrillの75年作！オススメはOddiseeがサンプリング... | years mandrill boiled band genrethe recommenda... |
| 148 | パーカッシヴなビートに重厚なベースやスペイシーなシンセ等が絡むB1が◎な80年の好作！シッカ... | b heavy bass spacey synths percussive beats ye... |
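
To see what each cleaning step does without calling the translation service, here is a minimal offline sketch. The function name `clean_without_translation`, the sample sentence, and the tiny `STOP_WORDS` set are assumptions made for illustration; the notebook itself uses scikit-learn's much larger `ENGLISH_STOP_WORDS` and translates each post first.

```python
import re

# Hypothetical, deliberately tiny stop-word set (illustration only)
STOP_WORDS = {"the", "is", "it", "a", "to", "and", "for"}

def clean_without_translation(text, stop_words=STOP_WORDS):
    text = re.sub(r'http\S+', '', text)      # remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # drop digits, punctuation, apostrophes
    text = text.lower()
    return ' '.join(word for word in text.split() if word not in stop_words)

sample = "We've switched to Mandrill: http://example.com/post 100% done!"
cleaned = clean_without_translation(sample)
print(cleaned)  # weve switched mandrill done
```

Because apostrophes are removed rather than replaced by a space, contractions collapse into a single token, which is why forms such as `weve` and `didnt` appear in the cleaned posts above.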


## Vectorizing the text and training a Naive Bayes classifier

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

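# Label each post with the sheet it came from
mandrill_posts['Label'] = 'Mandrill'
other_posts['Label'] = 'Other'

# Stack both sets into one DataFrame
all_posts = pd.concat([mandrill_posts[['Cleaned_Post', 'Label']], other_posts[['Cleaned_Post', 'Label']]])

# Turn each cleaned post into a vector of word counts.
# Note: the vocabulary is learned from all posts, including those that end up in the test set.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(all_posts['Cleaned_Post'])
y = all_posts['Label']

# Hold out 20% of the posts for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a multinomial Naive Bayes classifier on the word counts
model = MultinomialNB()
model.fit(X_train, y_train)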
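
In the cell above, the vectorizer is fitted on all posts before the split, so words that occur only in test posts still end up in the vocabulary. One possible alternative, sketched below, wraps the vectorizer and the classifier in a scikit-learn pipeline so that both are fitted on the training texts only. The toy texts and labels are invented for illustration and are not taken from the spreadsheet.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, hand-written training data (not from mandrill.xlsx)
train_texts = [
    "mandrill email api pricing",
    "send email smtp mandrill",
    "mandrill monkey zoo",
    "baby mandrill zoo animal",
]
train_labels = ["Mandrill", "Mandrill", "Other", "Other"]
test_texts = ["email api", "zoo animal"]

# fit() learns the vocabulary from the training texts only and then
# trains the classifier on the resulting word counts.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(train_texts, train_labels)

# predict() vectorizes the new texts with the same vocabulary
predictions = pipeline.predict(test_texts).tolist()
print(predictions)  # ['Mandrill', 'Other']
```

With a pipeline, the raw text column would be split with `train_test_split` first and the vectorizer would never see the held-out posts; the scores reported in this notebook were not produced this way.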

## Evaluating the model

In [18]:
y_pred = model.predict(X_test)

report = classification_report(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)

print("Confusion matrix: \n", matrix)
print("\nClass report: \n", report)

Confusion matrix: 
 [[27  2]
 [ 2 29]]

Class report: 
               precision    recall  f1-score   support

    Mandrill       0.93      0.93      0.93        29
       Other       0.94      0.94      0.94        31

    accuracy                           0.93        60
   macro avg       0.93      0.93      0.93        60
weighted avg       0.93      0.93      0.93        60



*Accuracy* is the share of all predictions that are correct: here (27 + 29) / 60 ≈ 0.93, so the model labels about 93% of the test posts correctly. *Precision* is the ratio of true positives to the sum of true positives and false positives; in short, it tells us how often the model is right when it assigns a post to a given class. *Recall* is the ratio of true positives to all actual positives, that is, how many of the posts that really belong to a class the model manages to find. For both classes (Mandrill and Other), precision and recall fall between 0.93 and 0.94, so the model performs about equally well on each, with only two posts misclassified in either direction.
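
The numbers in the report can be reproduced by hand from the confusion matrix. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, in the order `[Mandrill, Other]`; the variable names are chosen here for illustration.

```python
# Confusion matrix copied from the evaluation output above
matrix = [[27, 2],
          [2, 29]]
labels = ["Mandrill", "Other"]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

metrics = {}
for i, label in enumerate(labels):
    true_positives = matrix[i][i]
    predicted_as_label = sum(row[i] for row in matrix)  # column sum
    actually_label = sum(matrix[i])                     # row sum
    metrics[label] = {
        "precision": true_positives / predicted_as_label,
        "recall": true_positives / actually_label,
    }

print(f"accuracy: {accuracy:.2f}")
for label, m in metrics.items():
    print(f"{label}: precision={m['precision']:.2f}, recall={m['recall']:.2f}")
```

Running this gives an accuracy of 0.93, precision and recall of 0.93 for Mandrill, and 0.94 for Other, matching the classification report.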

## Predicting classes for our own data

In [15]:
own_comments = ["animal hair colorful", "mandrill login issues", "mandrill monkey jungle bananas", "domain api pricing"]

print('Text classification\n')
for comment in own_comments:
    my_text_vectorized = vectorizer.transform([comment])
    my_prediction = model.predict(my_text_vectorized)
    print(f'for comment "{comment}" model predicted class: {my_prediction[0]}')

Text classification

for comment "animal hair colorful" model predicted class: Other
for comment "mandrill login issues" model predicted class: Mandrill
for comment "mandrill monkey jungle bananas" model predicted class: Mandrill
for comment "domain api pricing" model predicted class: Mandrill
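
When reading these predictions, it helps to remember that `CountVectorizer.transform` counts only words that were in the vocabulary learned during fitting; any other word is dropped silently. The sketch below shows this on a hypothetical one-document corpus. Whether this affects any particular prediction above depends on the real vocabulary, which the output does not show.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-document corpus, used only to fix a small vocabulary
demo_vectorizer = CountVectorizer()
demo_vectorizer.fit(["mandrill email api"])

vocabulary = sorted(demo_vectorizer.vocabulary_)
# "jungle" and "bananas" were never seen during fit, so they are ignored
counts = demo_vectorizer.transform(["mandrill jungle bananas"]).toarray().tolist()

print(vocabulary)  # ['api', 'email', 'mandrill']
print(counts)      # [[0, 0, 1]]
```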


## How text vectorization works

In [17]:
# Small example corpus
example_corpus = [
    "This is first example",
    "This is second example",
    "And this is third example",
    "Fourth example is also example"
]

# Use separate names so the vectorizer and matrix fitted on the posts are not overwritten
demo_vectorizer = CountVectorizer()
X_demo = demo_vectorizer.fit_transform(example_corpus)

print("Word indexes:", demo_vectorizer.vocabulary_)

print("\nVector matrix:")
print(X_demo.toarray())

Word indexes: {'this': 8, 'is': 5, 'first': 3, 'example': 2, 'second': 6, 'and': 1, 'third': 7, 'fourth': 4, 'also': 0}

Vector matrix:
[[0 0 1 1 0 1 0 0 1]
 [0 0 1 0 0 1 1 0 1]
 [0 1 1 0 0 1 0 1 1]
 [1 0 2 0 1 1 0 0 0]]
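
The same matrix can be rebuilt with the standard library alone, which makes the mechanics explicit: tokenize, sort the distinct words to assign column indexes, then count. This is a simplified sketch, not scikit-learn's actual implementation; the `tokenize` helper is an assumption that mirrors `CountVectorizer`'s default behavior of lowercasing and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

documents = [
    "This is first example",
    "This is second example",
    "And this is third example",
    "Fourth example is also example",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]

# Column indexes follow the alphabetical order of the distinct words
distinct_words = sorted({word for tokens in tokenized for word in tokens})
vocabulary = {word: index for index, word in enumerate(distinct_words)}

# One row per document, one column per vocabulary word
matrix = []
for tokens in tokenized:
    row = [0] * len(vocabulary)
    for word, count in Counter(tokens).items():
        row[vocabulary[word]] = count
    matrix.append(row)

for row in matrix:
    print(row)
```

Each row has one column per word, and the `2` in the last row reflects that "example" appears twice in the fourth sentence.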
