First, we will import a single dataset. The full dataset is split into five files, one per video. Our first set of comments comes from the PSY Gangnam Style video:


In [5]:
import pandas as pd
d = pd.read_csv("Youtube01-Psy.csv")
d.tail()

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
345,z13th1q4yzihf1bll23qxzpjeujterydj,Carmen Racasanu,2014-11-14T13:27:52,How can this have 2 billion views when there's...,0
346,z13fcn1wfpb5e51xe04chdxakpzgchyaxzo0k,diego mogrovejo,2014-11-14T13:28:08,I don't now why I'm watching this in 2014﻿,0
347,z130zd5b3titudkoe04ccbeohojxuzppvbg,BlueYetiPlayz -Call Of Duty and More,2015-05-23T13:04:32,subscribe to me for call of duty vids and give...,1
348,z12he50arvrkivl5u04cctawgxzkjfsjcc4,Photo Editor,2015-06-05T14:14:48,hi guys please my android photo editor downloa...,1
349,z13vhvu54u3ewpp5h04ccb4zuoardrmjlyk0k,Ray Benich,2015-06-05T18:05:16,The first billion viewed this because they tho...,0


Let's count how many rows in the dataset are spam and how many are not:

In [10]:
len(d.query('CLASS==1'))

175

In [11]:
len(d.query('CLASS==0'))

175

In scikit-learn, the bag-of-words technique is implemented by CountVectorizer, which counts how many times each word appears in a text and puts those counts into a vector. To create the vectors, we make a CountVectorizer object and then perform the fit and the transform in a single call:


In [13]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
dve = vectorizer.fit_transform(d['CONTENT'])

In [15]:
dve

<350x1418 sparse matrix of type '<class 'numpy.int64'>'
	with 4354 stored elements in Compressed Sparse Row format>
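
In the output above, 350 is the number of comments, 1418 is the number of distinct words, and the 4354 stored elements are the non-zero counts. To make the idea concrete, here is a minimal pure-Python sketch of bag of words on three made-up comments. The `tokenize` and `to_vector` helpers and the toy data are assumptions for illustration; the regular expression only imitates the default `token_pattern` shown later in the CountVectorizer output.

```python
import re
from collections import Counter

# Imitates the default tokenization: lowercase the text and keep
# tokens made of two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

comments = ["Free money, free gifts", "Nice video", "free video"]  # toy data

# "Fit": collect the sorted vocabulary, one column per distinct word.
vocabulary = sorted({tok for c in comments for tok in tokenize(c)})

# "Transform": one row per comment, one count per vocabulary word.
def to_vector(text):
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in vocabulary]

matrix = [to_vector(c) for c in comments]

# A sparse matrix stores only the non-zero entries.
stored = sum(1 for row in matrix for value in row if value)

print(vocabulary)
for row in matrix:
    print(row)
print(stored, "non-zero entries out of", len(matrix) * len(vocabulary))
```

Most entries are zero even in this tiny example, which is why the real matrix is kept in a sparse format.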

Execute the following command to shuffle the dataset. We sample a fraction of 100% of the rows, that is, we pass frac=1:

In [23]:
dshuffl = d.sample(frac=1)
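
Sampling every row without replacement amounts to a random permutation of the rows. The sketch below shows the same idea with the standard library only; the ten-row list and the seed value are assumptions for illustration. In pandas, `sample` also accepts a `random_state` argument if a repeatable shuffle is wanted.

```python
import random

rows = list(range(10))      # stand-in for the row labels of a DataFrame
rng = random.Random(42)     # fixed seed so the shuffle is repeatable

# Drawing all rows without replacement yields a random permutation.
shuffled = rng.sample(rows, k=len(rows))

# Slice the shuffled rows into a training part and a testing part.
train, test = shuffled[:8], shuffled[8:]
print(train, test)
```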

In [28]:
d_train = dshuffl[:300]
d_test = dshuffl[300:]

d_train_attr = vectorizer.fit_transform(d_train['CONTENT'])
d_test_attr = vectorizer.transform(d_test['CONTENT'])

d_train_label = d_train['CLASS']
d_test_lable = d_test['CLASS']
d_train_attr
d_test_attr

<50x1266 sparse matrix of type '<class 'numpy.int64'>'
	with 447 stored elements in Compressed Sparse Row format>

In the preceding code, CountVectorizer.fit_transform is an important step. On the training set we perform a fit transform, which means the vectorizer learns the words and also produces the matrix. For the testing set, however, we only call transform: we don't fit again, because the test comments must be encoded with the vocabulary learned from the training data.
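
The difference between the two calls can be sketched in plain Python. The helper names `fit_vocabulary` and `transform` and the sample texts are made up for this illustration; the point is that a word never seen during fitting gets no column, so the test matrix always has the same number of columns as the training matrix (1266 in the output above).

```python
import re

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(texts):
    """Learn a word -> column index mapping from the training texts only."""
    words = sorted({tok for t in texts for tok in tokenize(t)})
    return {word: i for i, word in enumerate(words)}

def transform(texts, vocab):
    """Count known words; words missing from vocab are skipped."""
    rows = []
    for t in texts:
        row = [0] * len(vocab)
        for tok in tokenize(t):
            if tok in vocab:
                row[vocab[tok]] += 1
        rows.append(row)
    return rows

train_texts = ["subscribe to my channel", "great song"]
test_texts = ["great channel, subscribe now"]   # "now" is unseen

vocab = fit_vocabulary(train_texts)
test_rows = transform(test_texts, vocab)
print(sorted(vocab), test_rows)
```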

Now we will build the random forest classifier. The forest will consist of 80 decision trees. We fit it on the training set so that we can score its performance on the testing set:

In [29]:
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(n_estimators = 80)

In [30]:
RFC.fit(d_train_attr,d_train_label)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=80,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
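
As a rough sketch of what the forest does with its trees: each tree is trained on a bootstrap sample of the rows (note `bootstrap=True` in the output above), and scikit-learn combines the trees by averaging their predicted class probabilities. The sizes and the per-tree probabilities below are hypothetical values chosen for illustration, not results from the model.

```python
import random

rng = random.Random(0)
n_rows, n_trees = 10, 5   # toy sizes; the text uses 300 rows and 80 trees

# Each tree sees a bootstrap sample: n_rows draws with replacement.
bootstrap_samples = [
    [rng.randrange(n_rows) for _ in range(n_rows)] for _ in range(n_trees)
]

# Hypothetical spam probabilities that five trees assign to one comment.
tree_probabilities = [0.9, 0.6, 0.2, 0.8, 0.7]

# The forest averages the probabilities and picks the more likely class.
forest_probability = sum(tree_probabilities) / len(tree_probabilities)
prediction = 1 if forest_probability > 0.5 else 0
print(forest_probability, prediction)
```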

In [31]:
RFC.score(d_test_attr,d_test_lable)

1.0

Now we will print the confusion matrix:

In [33]:
from sklearn.metrics import confusion_matrix

In [35]:
pred_lables = RFC.predict(d_test_attr)
# confusion_matrix expects the true labels first, then the predictions
confusion_matrix(d_test_lable,pred_lables)

array([[29,  0],
       [ 0, 21]], dtype=int64)
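
In scikit-learn's convention, the rows of a confusion matrix are the true classes and the columns are the predicted classes, so the diagonal holds the correct predictions. The following pure-Python sketch builds such a matrix by hand; the `confusion_counts` helper and the six labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, labels=(0, 1)):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

y_true = [0, 0, 1, 1, 1, 0]   # toy ground truth
y_pred = [0, 1, 1, 1, 0, 0]   # toy predictions

cm = confusion_counts(y_true, y_pred)

# Accuracy is the share of predictions that fall on the diagonal.
accuracy = (cm[0][0] + cm[1][1]) / len(y_true)
print(cm, accuracy)
```

With zeros off the diagonal, as in the output above, every test comment was classified correctly, which agrees with the score of 1.0.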

Now we will load all five files and combine them into a single dataset:

In [37]:
d = pd.concat([
    pd.read_csv("Youtube01-Psy.csv"),
    pd.read_csv("Youtube02-KatyPerry.csv"),
    pd.read_csv("Youtube03-LMFAO.csv"),
    pd.read_csv("Youtube04-Eminem.csv"),
    pd.read_csv("Youtube05-Shakira.csv"),
])

len(d)

1956

Now we shuffle the combined data and separate the comment text from the class labels:

In [40]:
dshuffl  = d.sample(frac=1)
d_content = dshuffl['CONTENT']
d_class = dshuffl['CLASS']

We need to perform two steps here: CountVectorizer followed by the random forest. To chain them, we will use a scikit-learn feature called a Pipeline:

In [39]:
from sklearn.pipeline import Pipeline,make_pipeline
pipeline = Pipeline([
    ('bag-of-words',CountVectorizer()),
    ('random-forest',RandomForestClassifier())
])

Alternatively, we could have used the make_pipeline function, which names the steps automatically:
pipeline = make_pipeline(CountVectorizer(),RandomForestClassifier())
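
Conceptually, a pipeline fits and applies each transformer in order and hands the result to the final estimator. The class below is a deliberately tiny stand-in written for this explanation; `TinyPipeline`, `WordCounter`, and `LengthThreshold` are hypothetical names, and this is not how scikit-learn implements `Pipeline`.

```python
class TinyPipeline:
    """Illustrative stand-in for a pipeline of (name, step) pairs."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        # Every step but the last learns from the data and transforms it.
        for _, step in self.steps[:-1]:
            X = step.fit_transform(X, y)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # At prediction time the earlier steps only transform.
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


class WordCounter:
    """Hypothetical transformer: each text becomes its number of words."""

    def fit_transform(self, X, y=None):
        return self.transform(X)

    def transform(self, X):
        return [len(text.split()) for text in X]


class LengthThreshold:
    """Hypothetical classifier: 1 if longer than the mean training length."""

    def fit(self, X, y):
        self.threshold = sum(X) / len(X)
        return self

    def predict(self, X):
        return [1 if x > self.threshold else 0 for x in X]


pipe = TinyPipeline([("count", WordCounter()), ("clf", LengthThreshold())])
pipe.fit(["nice", "please subscribe to my channel now"], [0, 1])
predictions = pipe.predict(["great song", "check out my new channel for free"])
print(predictions)
```

The benefit is the same as with the real Pipeline: raw text goes in, and the fit-on-train, transform-on-test bookkeeping happens inside one object.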

In [43]:
pipeline.fit(d_content[:1500],d_class[:1500])

Pipeline(memory=None,
         steps=[('bag-of-words',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabu...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                

In [44]:
pipeline.score(d_content[1500:],d_class[1500:])

0.9714912280701754

Now let's predict whether a few sample comments are spam or not:

In [57]:
print(pipeline.predict([' like me']))
print(pipeline.predict(['video is very nice']))
print(pipeline.predict([' give me money']))
print(pipeline.predict(['fuckk off']))

[0]
[0]
[1]
[0]


Now let's add TF-IDF to our model to try to make it more precise:


In [58]:
from sklearn.feature_extraction.text import TfidfTransformer

In [60]:
## this time, use make_pipeline
pipeline2 = make_pipeline(CountVectorizer(),
                          TfidfTransformer(norm=None),
                         RandomForestClassifier())
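
TF-IDF reweights the raw counts so that words appearing in many comments count for less. With `smooth_idf=True` (the default, visible in the step parameters printed below) the documented weight is idf(t) = ln((1 + n) / (1 + df(t))) + 1, and with `norm=None` the result is simply the count multiplied by that weight. The sketch below works this out by hand on three made-up documents; the `smooth_idf` helper name and the data are assumptions for illustration.

```python
import math

def smooth_idf(n_documents, document_frequency):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, the smoothed variant."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

docs = [["free", "money", "free"], ["nice", "video"], ["free", "video"]]
n = len(docs)

# Document frequency: the number of documents containing each term.
df = {}
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

# Unnormalized TF-IDF for the first document: raw count times idf.
first = docs[0]
tfidf_first = {t: first.count(t) * smooth_idf(n, df[t]) for t in set(first)}
print(df, tfidf_first)
```

Here "money" appears in one document and "free" in two, so a single occurrence of "money" carries more weight than a single occurrence of "free".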

In [65]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline2,d_content,d_class,cv = 5)
print(scores.mean())
print(scores.std()*2)

0.9580797536405867
0.00998550057778849
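
The two numbers above are the mean accuracy over the five folds and twice the standard deviation of the fold accuracies. As a simplified sketch of the mechanics, the code below splits 12 sample indices into 5 contiguous folds and summarizes some hypothetical fold scores the same way. Note that for classifiers scikit-learn uses stratified folds when `cv` is an integer, which this sketch does not attempt; the `k_fold_indices` helper and the scores are made up.

```python
import statistics

def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=12, k=5))
for train, test in folds:
    print("test fold:", test)

# Hypothetical per-fold accuracies, summarized as in the text.
fold_scores = [0.95, 0.96, 0.97, 0.95, 0.96]
mean_score = statistics.mean(fold_scores)
spread = 2 * statistics.pstdev(fold_scores)   # population std, like NumPy's default
print(mean_score, spread)
```

Every sample lands in exactly one test fold, so each comment is used for testing once and for training four times.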


In [66]:
## list the pipeline steps and their parameters
pipeline2.steps

[('countvectorizer',
  CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                  dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                  lowercase=True, max_df=1.0, max_features=None, min_df=1,
                  ngram_range=(1, 1), preprocessor=None, stop_words=None,
                  strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                  tokenizer=None, vocabulary=None)),
 ('tfidftransformer',
  TfidfTransformer(norm=None, smooth_idf=True, sublinear_tf=False, use_idf=True)),
 ('randomforestclassifier',
  RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                         criterion='gini', max_depth=None, max_features='auto',
                         max_leaf_nodes=None, max_samples=None,
                         min_impurity_decrease=0.0, min_impurity_split=None,
                         min_samples_leaf=1, min_samples_split=2,
                         min_weight_fraction_leaf=0.0, 

We can search for good parameter values using GridSearchCV (one of several available techniques):

In [70]:
## first, build a dictionary of the parameter values we want to search over;
## each key is '<step name>__<parameter name>'
parameters={
    'countvectorizer__max_features':(None,1000,2000),
    'countvectorizer__ngram_range':((1,1),(1,2)),
    'countvectorizer__stop_words':('english',None),
    'tfidftransformer__use_idf':(True,False),
    'randomforestclassifier__n_estimators':(20,50,100)
}

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(pipeline2,parameters,n_jobs=-1,verbose=1)
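
A grid search tries every combination of one value per parameter. The sketch below enumerates the same grid with `itertools.product` to show where the counts in the fitting log come from. The fold count of 5 is taken from that log, and the variable names are chosen for this illustration.

```python
from itertools import product

grid = {
    "countvectorizer__max_features": (None, 1000, 2000),
    "countvectorizer__ngram_range": ((1, 1), (1, 2)),
    "countvectorizer__stop_words": ("english", None),
    "tfidftransformer__use_idf": (True, False),
    "randomforestclassifier__n_estimators": (20, 50, 100),
}

# Every combination of one value per parameter is one candidate.
names = list(grid)
candidates = [dict(zip(names, values)) for values in product(*grid.values())]

n_folds = 5  # assumed from the log: each candidate is fitted once per fold
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
```

The count grows multiplicatively (3 x 2 x 2 x 2 x 3 here), so adding even one more parameter with a few values can noticeably lengthen the search.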

In [71]:
grid_search.fit(d_content,d_class)

Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   13.8s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   56.8s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:  1.6min finished


GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('countvectorizer',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                              

In [73]:
grid_search.best_score_

0.9601192650973432

In [74]:
## print the best parameter combination found
grid_search.best_params_

{'countvectorizer__max_features': 1000,
 'countvectorizer__ngram_range': (1, 1),
 'countvectorizer__stop_words': 'english',
 'randomforestclassifier__n_estimators': 100,
 'tfidftransformer__use_idf': False}