## Practical example 2: email classification

In [169]:
import numpy as np
import pandas as pd
import re
import string
import math


We now load the data and apply some basic preprocessing to get a first look at it.

We use an email spam dataset to demonstrate each cleaning technique. The dataset contains about 5,700 emails and a label column, `spam`, indicating whether each mail is spam (1) or ham (0). This label is the target variable: the goal is to classify the mails based on their content.

In [170]:
data = pd.read_csv('emails.csv', usecols=['spam','text'])



In [171]:
data

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1.0
1,Subject: the stock trading gunslinger fanny i...,1.0
2,Subject: unbelievable new homes made easy im ...,1.0
3,Subject: 4 color printing special request add...,1.0
4,"Subject: do not have money , get software cds ...",1.0
...,...,...
5722,Subject: re : research and development charges...,0.0
5723,"Subject: re : receipts from visit jim , than...",0.0
5724,Subject: re : enron case study update wow ! a...,0.0
5725,"Subject: re : interest david , please , call...",0.0


In [172]:
#frequency distribution of the class attribute
print(pd.crosstab(index=data["spam"],columns="count"))

col_0  count
spam        
0.0     4359
1.0     1367


In [173]:

data.rename(columns={'spam':'class'},inplace=True)
data['label'] = np.where(data['class']==1,'spam','ham')
data.drop_duplicates(inplace=True)

In [174]:
data

Unnamed: 0,text,class,label
0,Subject: naturally irresistible your corporate...,1.0,spam
1,Subject: the stock trading gunslinger fanny i...,1.0,spam
2,Subject: unbelievable new homes made easy im ...,1.0,spam
3,Subject: 4 color printing special request add...,1.0,spam
4,"Subject: do not have money , get software cds ...",1.0,spam
...,...,...,...
5722,Subject: re : research and development charges...,0.0,ham
5723,"Subject: re : receipts from visit jim , than...",0.0,ham
5724,Subject: re : enron case study update wow ! a...,0.0,ham
5725,"Subject: re : interest david , please , call...",0.0,ham


We now apply the text preprocessing techniques one by one to clean the data and make it ready for building a machine learning model. Let us look at the first mail; as we apply each cleaning technique, we will observe how it changes.

In [175]:
data['text'][0]

"Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you  will see logo drafts within three business days . affordability : your  ma

The first mail contains a lot of noise: extra spaces, many hyphens « - », mixed case, and more. Let's go through the different techniques.

1) Expand Contractions

A contraction is the shortened form of one or more words: don’t stands for do not, aren’t stands for are not. We expand these contractions in the text data for better analysis. You can easily find a dictionary of contractions online or create your own, then use the re module to map each contraction to its expansion.

In [176]:
contractions_dict = {"ain't": "are not", "'s": " is", "aren't": "are not"}
# Regular expression that matches any contraction in the dictionary
# (re.escape protects keys that contain regex metacharacters)
contractions_re = re.compile('(%s)' % '|'.join(map(re.escape, contractions_dict)))
def expand_contractions(text, contractions_dict=contractions_dict):
    # Replace each matched contraction with its expanded form
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, text)
# Expand contractions in the emails
data['text'] = data['text'].apply(expand_contractions)
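
To see the mapping at work on a single string, here is a minimal standalone sketch. The three-entry dictionary and the `expand` helper are hypothetical, chosen only for illustration; the word boundaries (`\b`) are an extra precaution that the cell above does not use.

```python
import re

# Hypothetical, deliberately tiny mapping; a real one would be much longer.
contractions = {"don't": "do not", "aren't": "are not", "can't": "cannot"}

# re.escape guards against keys containing regex metacharacters;
# \b keeps the pattern from matching inside longer words.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, contractions)) + r")\b")

def expand(text):
    # Look up each matched contraction in the mapping.
    return pattern.sub(lambda m: contractions[m.group(0)], text)

print(expand("we don't promise and they aren't sure"))
```

Text without any contraction passes through unchanged, which is why the first mail looks identical before and after this step.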

In [177]:
data['text'][0]

"Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you  will see logo drafts within three business days . affordability : your  ma

2) Lower Case

When all text is in the same case, it is easier for a machine to interpret the words, because lower case and upper case are treated as different characters. For example, Ball and ball are treated as two different words. So we convert the text to a single case, and lower case is the usual choice.

In [178]:
data['text'] = data['text'].str.lower()  # on a Series, use .str.lower() rather than .lower()
# equivalent code with a lambda function (each x is a plain Python string)
#data['text'] = data['text'].apply(lambda x: x.lower())

The `.str.lower()` accessor lower-cases every mail in the column. The first mail now looks like this.



In [179]:
data['text'][0]

"subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you  will see logo drafts within three business days . affordability : your  ma

The complete text is now in lower case.

3) Remove punctuation


Another text preprocessing technique is removing punctuation. The string module lists 32 punctuation characters that need to be taken care of. We can use this list directly with a regular expression to replace any punctuation in the text with an empty string. The 32 punctuation characters provided by the string module are listed below.

In [180]:
string.punctuation
#'!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~'

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [181]:
#remove punctuation
data['text'] = data['text'].apply(lambda x: re.sub('[%s]' % re.escape(string.punctuation), '' , x))

We have used the `re.sub` function, which takes three main parameters: the pattern to search for, the replacement, and the string to process. Here the pattern is a character class containing all the punctuation characters, and every match is replaced with an empty string. The first mail now looks like this.



In [182]:
data['text'][0]

'subject naturally irresistible your corporate identity  lt is really hard to recollect a company  the  market is full of suqgestions and the information isoverwhelminq  but a good  catchy logo  stylish statlonery and outstanding website  will make the task much easier   we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader  it isguite ciear that  without good products  effective business organization and practicable aim it  will be hotat nowadays market  but we do promise that your marketing efforts  will become much more effective  here is the list of clear  benefits  creativeness  hand  made  original logos  specially done  to reflect your distinctive company image  convenience  logo and stationery  are provided in all formats  easy  to  use content management system letsyou  change your website content and even its structure  promptness  you  will see logo drafts within three business days  affordability  your  marketing break  through 
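
As a small standalone illustration of the same call, the sketch below applies the pattern to a made-up string and compares it with `str.translate`, an alternative from the standard library that gives the same result. The sample text is hypothetical.

```python
import re
import string

text = "hello , world ! e - mail : a@b.com"

# Character class containing every punctuation character, safely escaped.
pattern = "[%s]" % re.escape(string.punctuation)
no_punct = re.sub(pattern, "", text)

# Alternative without regular expressions: delete the same characters.
table = str.maketrans("", "", string.punctuation)
no_punct_translate = text.translate(table)

print(repr(no_punct))
print(no_punct == no_punct_translate)
```

Note that removing punctuation leaves runs of spaces behind, which is one reason a whitespace clean-up step is needed later.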

4) Remove digits and words containing digits

Text sometimes contains tokens that mix letters and digits, such as game57 or game5ts7, as well as standalone numbers. Such tokens are hard for a model to use, so it is usually better to remove them, that is, replace them with an empty string, using a regular expression. Note that the pattern used below, `\b[0-9]+\b\s*`, only removes standalone numbers; removing mixed tokens such as game57 would need a pattern like `\w*\d\w*`.

The first mail has no digits, but other mails in the dataset do, for example the fourth mail (index 3).

In [183]:
data['text'][3]

'subject 4 color printing special  request additional information now  click here  click here for a printable version of our order form  pdf format   phone   626  338  8090 fax   626  338  8102 e  mail  ramsey  goldengraphix  com  request additional information now  click here  click here for a printable version of our order form  pdf format   golden graphix  printing 5110 azusa canyon rd  irwindale  ca 91706 this e  mail message is an advertisement and  or solicitation  '

In [184]:
#remove standalone numbers
data['text'] = data['text'].apply(lambda x: re.sub(r'\b[0-9]+\b\s*', '',x))


In [185]:
#now observe the changes in the mail.
data['text'][3]

'subject color printing special  request additional information now  click here  click here for a printable version of our order form  pdf format   phone   fax   e  mail  ramsey  goldengraphix  com  request additional information now  click here  click here for a printable version of our order form  pdf format   golden graphix  printing azusa canyon rd  irwindale  ca this e  mail message is an advertisement and  or solicitation  '
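
The difference between removing standalone numbers and removing every token that contains a digit can be checked on a made-up string. The second pattern is an assumption about what a stricter variant could look like; it is not used elsewhere in this notebook.

```python
import re

text = "order 4 items game57 now call 8090 or game5ts7"

# Pattern used in the notebook: only tokens made entirely of digits.
standalone = re.sub(r"\b[0-9]+\b\s*", "", text)

# Stricter variant (assumption): any token containing at least one digit.
mixed = re.sub(r"\b\w*\d\w*\b\s*", "", text).strip()

print(standalone)
print(mixed)
```

With the first pattern, game57 and game5ts7 survive; with the second, they are removed together with the plain numbers.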

5) Remove Stopwords

Stopwords are the most commonly occurring words in a text that carry little useful information, for example they, there, this, and where.

NLTK is a common library for removing stopwords; its English list contains approximately 180 words. Because we store the list as a Python set, adding a new word is easy with the `add` method.

In our example, we want to remove the word subject from every mail, so we add it to the stopwords, along with http to remove remnants of web links (the code also adds aa and aaa).

In [186]:
#remove stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
stop_words.add('subject')
stop_words.add('http')
stop_words.add('aa')
stop_words.add('aaa')
def remove_stopwords(text):
    # split the text into words and keep only those that are not stopwords
    return " ".join([word for word in str(text).split() if word not in stop_words])
data['text'] = data['text'].apply(remove_stopwords)

The email text is now shorter because all stopwords have been removed.

In [187]:
#now observe the changes in the mail.
data['text'][3]

'color printing special request additional information click click printable version order form pdf format phone fax e mail ramsey goldengraphix com request additional information click click printable version order form pdf format golden graphix printing azusa canyon rd irwindale ca e mail message advertisement solicitation'

6) Stemming and Lemmatization

Stemming reduces a word to its root stem; for example, run, running, and runs all derive from run. In practice, stemming strips affixes, typically suffixes such as ing, s, or es, from a word. There are various stemming algorithms, such as the Porter stemmer and the Snowball stemmer; the widely used Porter stemmer is available in the NLTK library. Stemming is often avoided in production because it is a crude heuristic: it can produce stems that are not real words and can alter words that should have been left alone. Lemmatization was introduced to address this problem.

In [188]:
#stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])
#data["text"] = data["text"].apply(lambda x: stem_words(x))

In [189]:
data["text"][3]

'color printing special request additional information click click printable version order form pdf format phone fax e mail ramsey goldengraphix com request additional information click click printable version order form pdf format golden graphix printing azusa canyon rd irwindale ca e mail message advertisement solicitation'

Lemmatization is similar to stemming in that it reduces words to a root form, but it works differently: it is a systematic way of reducing words to their lemma by matching them against a language dictionary.

In [190]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])
data["text"] = data["text"].apply(lambda text: lemmatize_words(text))

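Now observe the result: the lemmatizer only changes words whose dictionary form differs. The fourth mail is unchanged, while plural forms elsewhere in the dataset are reduced (for example, homes becomes home in the third mail). Stemming itself was not applied to the data, since the call is commented out above.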

In [191]:
data["text"][3]

'color printing special request additional information click click printable version order form pdf format phone fax e mail ramsey goldengraphix com request additional information click click printable version order form pdf format golden graphix printing azusa canyon rd irwindale ca e mail message advertisement solicitation'

7) Remove Extra Spaces

Text data often contains extra spaces, and the preprocessing steps above can also leave more than one space between words, so we need to handle this. A regular expression solves the problem well.

In [192]:
data["text"] = data["text"].apply(lambda x: re.sub(' +', ' ', x))
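
The pattern `' +'` only collapses runs of the space character. As a hedged variant, the sketch below uses `\s+` to also cover tabs and newlines and strips the ends; the sample string is made up for illustration.

```python
import re

text = "  golden   graphix \t printing\n azusa  "

# Pattern from the notebook: collapses runs of spaces only.
only_spaces = re.sub(" +", " ", text)

# Variant: collapse any whitespace run and trim both ends.
normalized = re.sub(r"\s+", " ", text).strip()

print(repr(only_spaces))
print(repr(normalized))
```

`" ".join(text.split())` gives the same result as the `\s+` variant, which is why the stopword removal step above, built on `split` and `join`, already normalizes most of the spacing.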

These are the text preprocessing techniques most commonly used when dealing with NLP problems.

In [193]:
data["text"]

0       naturally irresistible corporate identity lt r...
1       stock trading gunslinger fanny merrill muzo co...
2       unbelievable new home made easy im wanting sho...
3       color printing special request additional info...
4       money get software cd software compatibility g...
                              ...                        
5722    research development charge gpg forwarded shir...
5723    receipt visit jim thanks invitation visit lsu ...
5724    enron case study update wow day super thank mu...
5725    interest david please call shirley crenshaw as...
5726    news aurora update aurora version fastest mode...
Name: text, Length: 5694, dtype: object

## Create training and test sets

In [194]:
data

Unnamed: 0,text,class,label
0,naturally irresistible corporate identity lt r...,1.0,spam
1,stock trading gunslinger fanny merrill muzo co...,1.0,spam
2,unbelievable new home made easy im wanting sho...,1.0,spam
3,color printing special request additional info...,1.0,spam
4,money get software cd software compatibility g...,1.0,spam
...,...,...,...
5722,research development charge gpg forwarded shir...,0.0,ham
5723,receipt visit jim thanks invitation visit lsu ...,0.0,ham
5724,enron case study update wow day super thank mu...,0.0,ham
5725,interest david please call shirley crenshaw as...,0.0,ham


In [195]:
#frequency distribution of the class attribute
print(pd.crosstab(index=data["label"],columns="count"))

col_0  count
label       
ham     4327
spam    1367


In [196]:
#**** SPLIT INTO TRAIN AND TEST SETS *****
#We create the training and test corpora with a random partition,
#stratified by class to preserve the proportions of "spam" and "ham" in both subsets.
#We use the train_test_split function from the sklearn.model_selection module.
#subdivision into train and test sets
from sklearn.model_selection import train_test_split
dataTrain, dataTest = train_test_split(data,train_size=0.8,random_state=1,stratify=data['label'])

In [197]:
dataTrain

Unnamed: 0,text,class,label
3484,garp convention invitation speak andreas look ...,0.0,ham
807,localized software language available hello wo...,1.0,spam
5616,weather course joe recent offer lacima weather...,0.0,ham
1797,visible red mike robert hou ect jose marquez c...,0.0,ham
4609,installation equipment ordered completed autom...,0.0,ham
...,...,...,...
3363,fw usaee conference dear mr kaminski attached ...,0.0,ham
1971,hello hello received message still nyc like lo...,0.0,ham
5634,electricity summit u c berkeley sevil yes plea...,0.0,ham
2404,request submitted access request chris clark e...,0.0,ham


In [198]:
dataTrain.groupby(['label']).size()

label
ham     3461
spam    1094
dtype: int64

In [199]:
dataTest.groupby(['label']).size()

label
ham     866
spam    273
dtype: int64
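
The effect of `stratify` can be verified with simple arithmetic on the counts printed above. The sketch below only re-uses those numbers; the `spam_share` helper is a hypothetical name introduced for the check.

```python
# Class counts copied from the notebook outputs above.
full = {"ham": 4327, "spam": 1367}
train = {"ham": 3461, "spam": 1094}
test = {"ham": 866, "spam": 273}

def spam_share(counts):
    # Proportion of spam among all documents of the subset.
    return counts["spam"] / sum(counts.values())

for name, counts in [("full", full), ("train", train), ("test", test)]:
    total = sum(counts.values())
    print(f"{name:5s} size={total:4d} spam share={spam_share(counts):.3f}")

# Share of the data that went to the training set.
print(round(sum(train.values()) / sum(full.values()), 3))
```

All three subsets have a spam share of about 0.240, and the training set holds about 80% of the documents, as requested with `train_size=0.8`.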

## Generate the document term matrix - train set

In [200]:
#*** Generate the document term matrix - train set ***

#import the CountVectorizer tool
from sklearn.feature_extraction.text import CountVectorizer

#instantiate the object
parseur = CountVectorizer(binary=True)

#create the document term matrix
XTrain = parseur.fit_transform(dataTrain['text'])

In [201]:
XTrain

<4555x27938 sparse matrix of type '<class 'numpy.int64'>'
	with 383549 stored elements in Compressed Sparse Row format>
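
To make the meaning of `binary=True` concrete, here is a plain-Python sketch of a binary document-term matrix on three made-up documents. It is not scikit-learn's implementation: among other differences, `CountVectorizer` lower-cases the text and, by default, ignores single-character tokens.

```python
# Hypothetical toy corpus, already cleaned and space-separated.
docs = ["free money free offer", "meeting money report", "free report"]

# Vocabulary: sorted unique tokens, each mapped to a column index.
vocab = sorted({tok for d in docs for tok in d.split()})
col = {tok: j for j, tok in enumerate(vocab)}

# Binary weighting: 1 if the term occurs in the document, however often.
matrix = [[0] * len(vocab) for _ in docs]
for i, d in enumerate(docs):
    for tok in d.split():
        matrix[i][col[tok]] = 1

print(vocab)
for row in matrix:
    print(row)
```

The word free occurs twice in the first document but is still coded 1: the matrix records presence or absence, not counts.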

In [202]:
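#list of tokens (on scikit-learn 1.0 and later, use get_feature_names_out(); get_feature_names() was removed in 1.2)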
print(parseur.get_feature_names())



In [203]:
#number of  tokens
print(len(parseur.get_feature_names()))

27938


We observe 27938 terms. Listing them all would be too tedious.

To compute the term frequencies, we use XTrain. It is stored in sparse matrix format, so we convert it into a NumPy array, which we store in the variable mdtTrain.

In [204]:
#transform the sparse matrix into a numpy matrix
mdtTrain = XTrain.toarray()

#type of the matrix
print(type(mdtTrain))

#size of the matrix
print(mdtTrain.shape)

<class 'numpy.ndarray'>
(4555, 27938)


In [205]:
mdtTrain

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [206]:
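#document frequency of the terms
freq_mots = np.sum(mdtTrain,axis=0)#with binary weighting, the column sum is the number of documents containing the term
print(freq_mots)
print('****')
#indices that sort the terms by increasing frequency (argsort)
index = np.argsort(freq_mots)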
print(index)
print('****')
#print the terms and their frequency
imp = {'terme':np.asarray(parseur.get_feature_names())[index],'freq':freq_mots[index]}
print(pd.DataFrame(imp).sort_values(by='freq', ascending=False))

[1 1 1 ... 1 2 5]
****
[    0 15347 15348 ...  8066 26510 18692]
****
            terme  freq
27937      please  2228
27936       vince  2220
27935       enron  2041
27934          cc  1709
27933       would  1705
...           ...   ...
8687   schoolmate     1
8686     scholtes     1
8685    scholarly     1
8684   rubberized     1
0      aaaenerfax     1

[27938 rows x 2 columns]


The term 'please' appears in 2228 documents, and so on.

### Dimensionality reduction 1: term frequency

In this section, we repeat the previous analysis with an extra option when instantiating the CountVectorizer class: min_df = 10, which removes the terms that appear in (strictly) fewer than 10 documents. The code below also passes stop_words='english'.

In [207]:
#***** MIN FREQUENCY

#rebuild the parser with new options : min_df = 10
parseurBis = CountVectorizer(stop_words='english',binary=True, min_df = 10)
XTrainBis = parseurBis.fit_transform(dataTrain['text'])

#number of tokens
print(len(parseurBis.get_feature_names()))

#mdt_bis
mdtTrainBis = XTrainBis.toarray()

4625


More than 6 times fewer terms (4625 instead of 27938).
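
The filtering rule behind `min_df` can be sketched in plain Python on a toy binary matrix. The vocabulary, the matrix, and the threshold of 2 are made up for illustration; this mimics the rule, not scikit-learn's code.

```python
# Hypothetical binary document-term matrix (rows: documents, columns: terms).
vocab = ["free", "meeting", "money", "offer", "report"]
matrix = [
    [1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1],
    [1, 0, 0, 0, 1],
]

# Document frequency: number of documents in which each term appears.
doc_freq = [sum(column) for column in zip(*matrix)]

# Keep the terms that appear in at least min_df documents.
min_df = 2
keep = [j for j, df in enumerate(doc_freq) if df >= min_df]

reduced_vocab = [vocab[j] for j in keep]
reduced_matrix = [[row[j] for j in keep] for row in matrix]

print(doc_freq)
print(reduced_vocab)
```

Terms seen in a single document (meeting, offer) are dropped, which shrinks the matrix without touching the rows.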

## Train the classifier

### KNN classifier

In [208]:
#import the class KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier

#instantiate the object
knn_classifier = KNeighborsClassifier(n_neighbors=2)

#perform the training process
knn_classifier.fit(mdtTrainBis,dataTrain['label'])



KNeighborsClassifier(n_neighbors=2)

In [209]:
#generate the document term matrix for the test set
#using the object learned from the train set
#import the metrics class for the performance measurement
from sklearn import metrics

#create the document term matrix
mdtTestBis = parseurBis.transform(dataTest['text'])

#prediction for the test set
predTestBis = knn_classifier.predict(mdtTestBis)

#confusion matrix
print('***Confusion matrix')
mcTestBis = metrics.confusion_matrix(dataTest['label'],predTestBis)
print(mcTestBis)

***Confusion matrix
[[814  52]
 [ 49 224]]


In [210]:
#recall
print('Recall')
print(metrics.recall_score(dataTest['label'],predTestBis,pos_label='spam'))

#precision
print('precision')
print(metrics.precision_score(dataTest['label'],predTestBis,pos_label='spam'))

#F1-Score
print('F1-Score')
print(metrics.f1_score(dataTest['label'],predTestBis,pos_label='spam'))

#accuracy rate
print('accuracy rate -->')
print(metrics.accuracy_score(dataTest['label'],predTestBis))

Recall
0.8205128205128205
precision
0.8115942028985508
F1-Score
0.8160291438979964
accuracy rate -->
0.9113257243195786
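
These four indicators can be recomputed by hand from the confusion matrix printed above. The sketch assumes the usual scikit-learn layout (rows are the true classes, columns the predicted ones, labels in sorted order 'ham' then 'spam') and treats 'spam' as the positive class.

```python
# Confusion matrix of the KNN classifier, copied from the output above.
cm = [[814, 52],
      [49, 224]]

tn, fp = cm[0]   # true ham: predicted ham, predicted spam
fn, tp = cm[1]   # true spam: predicted ham, predicted spam

recall = tp / (tp + fn)         # share of actual spam that was detected
precision = tp / (tp + fp)      # share of predicted spam that is spam
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(recall, precision, f1, accuracy)
```

The values agree with those returned by the `metrics` functions, which confirms how the matrix should be read.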


### Logistic regression

In [211]:
#*** train the classifier

#import the class LogisticRegression
from sklearn.linear_model import LogisticRegression

#instantiate the object
modelBis = LogisticRegression()

#perform the training process
modelBis.fit(mdtTrainBis,dataTrain['label'])

#generate the document term matrix for the test set
#using the object learned from the train set


#import the metrics class for the performance measurement
#from sklearn import metrics

#create the document term matrix for test
mdtTestBis = parseurBis.transform(dataTest['text'])

#prediction for the test set
predTestBis = modelBis.predict(mdtTestBis)

#confusion matrix
print('***Confusion matrix')
mcTestBis = metrics.confusion_matrix(dataTest['label'],predTestBis)
print(mcTestBis)

#recall
print('Recall')
print(metrics.recall_score(dataTest['label'],predTestBis,pos_label='spam'))

#precision
print('precision')
print(metrics.precision_score(dataTest['label'],predTestBis,pos_label='spam'))

#F1-Score
print('F1-Score')
print(metrics.f1_score(dataTest['label'],predTestBis,pos_label='spam'))

#accuracy rate
print('accuracy rate -->')
print(metrics.accuracy_score(dataTest['label'],predTestBis))

***Confusion matrix
[[857   9]
 [  2 271]]
Recall
0.9926739926739927
precision
0.9678571428571429
F1-Score
0.9801084990958407
accuracy rate -->
0.990342405618964


## Dimensionality reduction 2: model post-processing

#### Variable selection strategy
Can we reduce the dimensionality further? Another avenue is to look at the properties of the predictive model produced by the logistic regression. Some coefficients of the classification function are almost zero, so they carry negligible weight in the decision.


A simple (I would even say crude) strategy consists of (1) removing the corresponding terms from the dictionary, and (2) re-estimating the parameters of the model built on the remaining terms.

In [None]:
modelBis.coef_

In [None]:
#***** REMOVE TERMS WITH COEFFICIENTS NEARLY ZERO
#First we need to characterize the model coefficients: we take their absolute values and compute several percentiles.

#absolute  value of the coefficients
coef_abs = np.abs(modelBis.coef_[0,:])

coef_abs

In [None]:
#percentiles of the coefficients (absolute value)
thresholds = np.percentile(coef_abs,[0,25,50,75,90,100])
print(thresholds)

The smallest absolute coefficient value is 1.17145514e-06, the largest is 2.85962751.

We use the median (the 50th percentile, `thresholds[2]`) as the threshold, and identify the indices of the terms whose absolute coefficient exceeds it.

In [None]:
#identify the coefficients "significantly different from zero"
#use the median (index 2 of the percentile list above) as threshold
indices = np.where(coef_abs > thresholds[2])
print(len(indices[0]))

2312 descriptors were retained (versus 4... previously, after removing the infrequent terms).
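
The selection rule can be sketched on a handful of made-up coefficients. This is a plain-Python stand-in for the `np.percentile` and `np.where` calls, with `statistics.median` playing the role of the 50th percentile; the coefficient values are hypothetical.

```python
from statistics import median

# Hypothetical coefficients of a fitted linear model (one per term).
coefs = [0.001, -2.0, 0.3, -0.0005, 1.2, 0.05]

# Work on absolute values: the sign only gives the direction of the effect.
coef_abs = [abs(c) for c in coefs]

# Threshold at the median: roughly half of the terms are kept.
threshold = median(coef_abs)
kept = [j for j, a in enumerate(coef_abs) if a > threshold]

print(threshold)
print(kept)
```

Using the median as the cut-off keeps about half of the columns, which matches the order of magnitude reported above.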

We create the corresponding document-term matrices for the training and test sets.

In [None]:
#create the new document term matrices

#document term matrices - train and test sets
mdtTrainTer = mdtTrainBis[:,indices[0]]#all rows, only the selected columns
mdtTestTer = mdtTestBis[:,indices[0]]

#checking
print(mdtTrainTer.shape)
print(mdtTestTer.shape)

In [None]:
#instantiate the object
modelTer = LogisticRegression()

#train a new classifier with selected terms
modelTer.fit(mdtTrainTer,dataTrain['label'])

#prediction on the test set
predTestTer = modelTer.predict(mdtTestTer)

#confusion matrix
mcTestTer = metrics.confusion_matrix(dataTest['label'],predTestTer)
print(mcTestTer)

Let us try to identify the most discriminating terms. To do so, we sort the dictionary by the absolute value of the model coefficients:

In [None]:
#selected terms
sel_terms = np.array(parseurBis.get_feature_names())[indices[0]]

#sorted indices of the absolute value coefficients
sorted_indices = np.argsort(np.abs(modelTer.coef_[0,:]))

#print the terms and their coefficients
imp = {'term':np.asarray(sel_terms)[sorted_indices],'coef':modelTer.coef_[0,:][sorted_indices]}
#the 10 most discriminating terms in the model (with their coefficients):
print(pd.DataFrame(imp).sort_values(by='coef', ascending=False).head(10))

The coefficients of these terms are positive, so all of them push towards the "spam" class: when they are present in a document, the chances that it is spam increase.

**The detailed analysis of the results starts at this stage. The dictionary will most likely need refining to improve the relevance of the system.**

## Deployment

One of the goals of text categorization is to produce a function that automatically assigns a class ("spam" or "ham") to a new document. It can be implemented in the email client.


In this section, we go through the individual steps to show that the task is far from trivial.

We want to classify the sentence "this is a new free service for you only" with our third model, modelTer, keeping in mind that the variable selection we performed complicates things a little.


Description compatible with the document-term matrix. We transform the document into a presence/absence vector over the terms of the dictionary:

In [None]:
#document to classify
doc = 'this is a new free service for you only'

#document preprocessing (same steps as for the training data;
#note that lemmatization, applied to the training data, is not repeated here)
doc = expand_contractions(doc)
doc = doc.lower()
doc = re.sub('[%s]' % re.escape(string.punctuation), '', doc)
doc = re.sub(r'\b[0-9]+\b\s*', '', doc)
doc = remove_stopwords(doc)

#get its description
desc = parseurBis.transform([doc])
print(desc)

In [None]:

doc = "Hello elasri.ikram, You are customer #0836901 by Amazon Rewards and we have been waiting for your confirmation since. This delivery is for elasri.ikram To activate delivery, validate here! Cordially,, Amazon reward"
#document preprocessing
doc = expand_contractions(doc)
doc = doc.lower()
doc = re.sub('[%s]' % re.escape(string.punctuation), '', doc)
doc = re.sub(r'\b[0-9]+\b\s*', '', doc)
doc = remove_stopwords(doc)

#get its description
desc = parseurBis.transform([doc])
print(desc)

In [None]:
doc
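
Since the same cleaning steps are repeated by hand for each new document, one option is to bundle them into a single function shared by training and deployment. The sketch below is a minimal standalone version: `preprocess` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook uses the NLTK list), and contraction expansion and lemmatization are left out to keep it dependency-free.

```python
import re
import string

# Hypothetical, tiny stop-word set so the sketch runs without NLTK data.
STOP_WORDS = {"this", "is", "a", "for", "you", "only", "subject"}

def preprocess(text, stop_words=STOP_WORDS):
    # 1) lower case
    text = text.lower()
    # 2) remove punctuation
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    # 3) remove standalone numbers
    text = re.sub(r"\b[0-9]+\b\s*", "", text)
    # 4) remove stopwords (split/join also normalizes the spacing)
    return " ".join(w for w in text.split() if w not in stop_words)

print(preprocess("This is a NEW free service, for you only: call 24 now!"))
```

Keeping one function for both phases reduces the risk that a new document is cleaned differently from the training data.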

Python tells us it found terms no. ..., ... and ..... We have a sparse description of the data, i.e. only the non-zero values are listed.

Which terms are they?

In [None]:
#which terms
print(np.asarray(parseurBis.get_feature_names())[desc.indices])

In [None]:
#dense representation
dense_desc = desc.toarray()

#apply var. selection
dense_sel = dense_desc[:,indices[0]]

In [None]:
dense_desc 

In [None]:
#prediction of the class membership
pred_doc = modelTer.predict(dense_sel)
print(pred_doc)

In [None]:
#prediction of the class membership probabilities
pred_proba = modelTer.predict_proba(dense_sel)
print(pred_proba)

The message is assigned to the 'spam' class with a membership probability of 0.70.
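
For a binary logistic regression, this probability is the logistic (sigmoid) function applied to a linear score: the intercept plus the coefficients of the terms present in the document. The sketch below illustrates the computation with made-up weights; they are not the coefficients of modelTer.

```python
import math

# Hypothetical coefficients and intercept, NOT taken from modelTer.
weights = {"free": 1.1, "service": 0.4, "new": -0.2}
intercept = -0.5

# Terms present in the document (binary presence/absence description).
present = ["free", "service", "new"]

# Linear score: intercept plus the weights of the terms that are present.
score = intercept + sum(weights[t] for t in present)

# Logistic function: maps the score to a probability for the positive class.
p_spam = 1.0 / (1.0 + math.exp(-score))
p_ham = 1.0 - p_spam

print(round(score, 3), round(p_spam, 3), round(p_ham, 3))
```

With these toy values the spam probability is about 0.69. Terms with positive weights raise it, and terms with negative weights lower it.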