<h2 align='center'> Text Representation - Bag Of Words (BOW)</h2>

<h4><b>AIM:</b></h4><h6>To represent text using Bag of Words and N-gram Bag of Words, then compare how well each representation works when used with classification models.</h6>

<h4><b>DESCRIPTION:</b></h4><h6>The BoW model captures the frequency of each word in a text corpus. It ignores the order in which words appear in a document; only the individual words and their counts matter.</h6>
<h6>An N-gram is a sequence of N consecutive words in a sentence, where N is an integer giving the number of words in the sequence.</h6>
<h6>For example, N=1 gives a unigram, N=2 a bigram, and N=3 a trigram. Counting N-grams with N greater than 1 lets a BoW representation retain some local word order.</h6>

In [1]:
import pandas as pd
import numpy as np

In [6]:
df = pd.read_csv("proc.csv" , encoding = 'latin-1' )
df.head()

Unnamed: 0,clean_text,is_depression,Processed_Title
0,we understand that most people who reply immed...,1,"['we', 'understand', 'that', 'most', 'people',..."
1,welcome to r depression s check in post a plac...,1,"['welcome', 'to', 'r', 'depression', 's', 'che..."
2,anyone else instead of sleeping more when depr...,1,"['anyone', 'else', 'instead', 'of', 'sleep', '..."
3,i ve kind of stuffed around a lot in my life d...,1,"['i', 've', 'kind', 'of', 'stuff', 'around', '..."
4,sleep is my greatest and most comforting escap...,1,"['sleep', 'be', 'my', 'greatest', 'and', 'most..."


In [48]:
df.is_depression.value_counts()

0    3900
1    3831
Name: is_depression, dtype: int64

In [8]:
df.shape

(7731, 3)


<h3>Train test split</h3>

In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.Processed_Title, df.is_depression, test_size=0.2)

In [17]:
X_train.shape

(6184,)

In [18]:
X_test.shape

(1547,)

In [19]:
type(X_train)

pandas.core.series.Series

In [20]:
X_train[:4]

39      ['i', 'just', 'feel', 'like', 'i', 'll', 'have...
6371    ['no', 'squirrel', 'today', 'they', 'must', 'b...
7478    ['stuff', 'find', 'a', 'small', 'enough', 'pic...
2090    ['sorry', 'to', 'ask', 'again', 'i', 'm', 'jus...
Name: Processed_Title, dtype: object

In [21]:
type(y_train)

pandas.core.series.Series

In [22]:
y_train[:4]

39      1
6371    0
7478    0
2090    1
Name: is_depression, dtype: int64

In [23]:
type(X_train.values)

numpy.ndarray
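The split above draws a different random partition on every run, so the scores reported later will vary slightly between runs. A minimal sketch of a reproducible, class-balanced split follows; the tiny frame, the seed value 42, and the variable names `toy`, `X_tr`, `X_te`, `y_tr`, `y_te` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in frame using the notebook's column names (values are made up)
toy = pd.DataFrame({
    "Processed_Title": [f"['post', 'number', '{i}']" for i in range(20)],
    "is_depression": [0, 1] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.Processed_Title,
    toy.is_depression,
    test_size=0.2,
    random_state=42,             # fixed seed: the same split on every run
    stratify=toy.is_depression,  # keep the 0/1 ratio equal in both parts
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts().to_dict())
```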

<h3>Create bag of words representation using CountVectorizer</h3>

In [24]:
from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()

X_train_cv = v.fit_transform(X_train.values)
X_train_cv

<6184x11579 sparse matrix of type '<class 'numpy.int64'>'
	with 249399 stored elements in Compressed Sparse Row format>

In [25]:
X_train_cv.toarray()[:2][0]

array([0, 0, 0, ..., 0, 0, 0])

In [26]:
X_train_cv.shape

(6184, 11579)

In [27]:
dir(v)

['__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_char_ngrams',
 '_char_wb_ngrams',
 '_check_feature_names',
 '_check_n_features',
 '_check_stop_words_consistency',
 '_check_vocabulary',
 '_count_vocab',
 '_get_param_names',
 '_get_tags',
 '_limit_features',
 '_more_tags',
 '_parameter_constraints',
 '_repr_html_',
 '_repr_html_inner',
 '_repr_mimebundle_',
 '_sort_features',
 '_stop_words_id',
 '_validate_data',
 '_validate_ngram_range',
 '_validate_params',
 '_validate_vocabulary',
 '_warn_for_unused_params',
 '_white_spaces',
 '_word_ngrams',
 'analyzer',
 'binary',
 'build_analyzer',
 'build_preprocessor',
 'bui

In [28]:
v.vocabulary_

{'just': 5597,
 'feel': 3731,
 'like': 5978,
 'll': 6038,
 'have': 4562,
 'depression': 2647,
 'forever': 3952,
 'nothing': 7107,
 'really': 8404,
 'work': 11397,
 'at': 717,
 'least': 5890,
 'not': 7105,
 'for': 3930,
 'long': 6078,
 'too': 10430,
 'tire': 10380,
 'don': 2954,
 'want': 11172,
 'to': 10396,
 'try': 10595,
 'so': 9477,
 'hard': 4530,
 'all': 356,
 'the': 10230,
 'time': 10365,
 'anymore': 541,
 'get': 4235,
 'better': 1063,
 'give': 4263,
 'up': 10874,
 'no': 7056,
 'squirrel': 9663,
 'today': 10399,
 'they': 10273,
 'must': 6822,
 'be': 964,
 'hide': 4690,
 'stuff': 9837,
 'find': 3797,
 'small': 9427,
 'enough': 3348,
 'picture': 7773,
 'will': 11332,
 'this': 10292,
 'weird': 11242,
 'face': 3614,
 'rest': 8643,
 'of': 7234,
 'my': 6830,
 'twitter': 10648,
 'life': 5961,
 'lol': 6068,
 'sorry': 9551,
 'ask': 680,
 'again': 273,
 'do': 2924,
 'great': 4385,
 'moment': 6689,
 'if': 4976,
 'hypothetical': 4932,
 'end': 3312,
 'and': 462,
 'prior': 8066,
 'it': 5376,
 's

In [29]:
X_train_np = X_train_cv.toarray()
X_train_np[0]

array([0, 0, 0, ..., 0, 0, 0])

In [30]:
np.where(X_train_np[0]!=0)

(array([  356,   541,   717,  1063,  2647,  2954,  3731,  3930,  3952,
         4235,  4263,  4530,  4562,  5597,  5890,  5978,  6038,  6078,
         7105,  7107,  8404,  9477, 10230, 10365, 10380, 10396, 10430,
        10595, 10874, 11172, 11397]),)

In [31]:
X_train_np[0][54]

0

<h3>Train the Naive Bayes model</h3>

In [32]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_cv, y_train)

In [33]:
X_test_cv = v.transform(X_test)

<h3>Evaluate Performance</h3>

In [34]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test_cv)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.72      0.83       785
           1       0.77      0.97      0.86       762

    accuracy                           0.85      1547
   macro avg       0.87      0.85      0.84      1547
weighted avg       0.87      0.85      0.84      1547
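The precision and recall columns in the report come straight from the confusion matrix. In the report above, class 1 has high recall (0.97) but lower precision (0.77), which means the model labels a fair number of class 0 posts as class 1. The sketch below shows the arithmetic on a handful of made-up labels; `y_true` and `y_hat` are hypothetical and unrelated to the dataset.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up ground truth and predictions for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields: true neg, false pos, false neg, true pos
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_1 = tp / (tp + fp)  # of everything predicted 1, how much really is 1
recall_1 = tp / (tp + fn)     # of everything that really is 1, how much was found

print(precision_1, precision_score(y_true, y_hat))
print(recall_1, recall_score(y_true, y_hat))
```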



**SVM**

In [35]:
from sklearn.svm import SVC

model = SVC()
model.fit(X_train_cv, y_train)

In [36]:
X_test_cv = v.transform(X_test)

In [37]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test_cv)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.91      1.00      0.95       785
           1       1.00      0.89      0.94       762

    accuracy                           0.95      1547
   macro avg       0.95      0.95      0.95      1547
weighted avg       0.95      0.95      0.95      1547



**Random Forest**

In [38]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train_cv, y_train)

In [39]:
X_test_cv = v.transform(X_test)

In [40]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test_cv)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.99      0.96       785
           1       0.99      0.92      0.95       762

    accuracy                           0.96      1547
   macro avg       0.96      0.96      0.96      1547
weighted avg       0.96      0.96      0.96      1547



In [41]:
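# Two unseen texts in the same stringified-token format as Processed_Title
reviews = [
    "['I','loved','it']",
    "['It','be','disgusting','food']"
]

# Vectorize with the already fitted CountVectorizer, then predict
# (model is the random forest trained above)
reviews_count = v.transform(reviews)
model.predict(reviews_count)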

array([0, 0])
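Each `Processed_Title` value is a string that looks like a Python list, and `CountVectorizer` tokenizes that raw string with its default pattern. A possible alternative, sketched here under the assumption that every token should be kept, is to parse the string back into a list and pass an identity analyzer. The names `raw`, `tokens`, and `keep_all` are illustrative.

```python
import ast
from sklearn.feature_extraction.text import CountVectorizer

# Same shape as one Processed_Title cell
raw = "['i', 've', 'kind', 'of', 'stuff', 'around']"

# Parse the string into a real Python list of tokens
tokens = ast.literal_eval(raw)

# Default analyzer on the raw string: brackets and quotes are ignored,
# and one-character tokens such as 'i' are dropped
default_vocab = sorted(CountVectorizer().fit([raw]).vocabulary_)

# Identity analyzer on the parsed list: every token is kept as is
keep_all = CountVectorizer(analyzer=lambda toks: toks)
full_vocab = sorted(keep_all.fit([tokens]).vocabulary_)

print(default_vocab)
print(full_vocab)
```

Whether keeping one-character tokens helps the classifiers is not something this notebook tests; the sketch only shows the difference in vocabulary.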

<h3>Train the model with an sklearn Pipeline to reduce the amount of code</h3>

In [42]:
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [43]:
clf.fit(X_train, y_train)

In [44]:
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.72      0.83       785
           1       0.77      0.97      0.86       762

    accuracy                           0.85      1547
   macro avg       0.87      0.85      0.84      1547
weighted avg       0.87      0.85      0.84      1547
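Besides saving lines, a pipeline can be handed to cross-validation as a single estimator, so the vectorizer is refit on each training fold and never sees the held-out fold. A minimal sketch on made-up, trivially separable texts follows; `texts`, `labels`, and `pipe` are hypothetical names and the data is not from `proc.csv`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up corpus: 10 texts per class with disjoint vocabularies
texts = ["feel sad tired hopeless"] * 10 + ["happy sunny great day"] * 10
labels = [1] * 10 + [0] * 10

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# 5-fold cross-validation; for classifiers the folds are stratified by default
scores = cross_val_score(pipe, texts, labels, cv=5)
print(scores, scores.mean())
```

Averaging over several folds gives a steadier estimate than the single 80/20 split used in this notebook.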



In [45]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# 1. Create a pipeline: unigram to 6-gram counts feeding Multinomial Naive Bayes
clf = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 6))),
    ('nb', MultinomialNB())
])

# 2. Fit on the training split
clf.fit(X_train, y_train)

# 3. Predict labels for the test split and store them in y_pred
y_pred = clf.predict(X_test)

# 4. Print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.11      0.21       785
           1       0.52      1.00      0.69       762

    accuracy                           0.55      1547
   macro avg       0.76      0.56      0.45      1547
weighted avg       0.77      0.55      0.44      1547
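One plausible contributor to the drop in accuracy with `ngram_range=(1, 6)` is how quickly the feature space grows: most long N-grams occur only once, so they add many sparse columns without adding evidence that generalizes. This is an interpretation, not something the notebook verifies. The sketch below simply counts features on two made-up sentences for several upper bounds of `ngram_range`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences with no repeated words
docs = [
    "i just feel like nothing really works",
    "no squirrel today they must be hiding",
]

sizes = {}
for n in (1, 2, 3, 6):
    vec = CountVectorizer(ngram_range=(1, n))
    vec.fit(docs)
    sizes[n] = len(vec.vocabulary_)  # number of distinct 1..n-grams

print(sizes)
```

On the real training data the same comparison could be made by checking `len(clf.named_steps['vectorizer'].vocabulary_)` after fitting each pipeline.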



In [46]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# 1. Create a pipeline: unigram to trigram counts feeding a random forest
clf = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 3))),
    ('rfc', RandomForestClassifier(n_estimators=250, criterion='gini'))
])

# 2. Fit on the training split
clf.fit(X_train, y_train)

# 3. Predict labels for the test split and store them in y_pred
y_pred = clf.predict(X_test)

# 4. Print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      1.00      0.94       785
           1       1.00      0.88      0.94       762

    accuracy                           0.94      1547
   macro avg       0.95      0.94      0.94      1547
weighted avg       0.95      0.94      0.94      1547



In [47]:
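from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF weights replace raw counts; the classifier is still a random forest
rfc = RandomForestClassifier()
v = TfidfVectorizer()

# Fit the vocabulary and IDF weights on the training split only
X_train_cv = v.fit_transform(X_train.values)
rfc.fit(X_train_cv, y_train)

# Reuse the fitted vectorizer on the test split
X_test_cv = v.transform(X_test)
y_pred = rfc.predict(X_test_cv)

print(classification_report(y_test, y_pred))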

              precision    recall  f1-score   support

           0       0.92      1.00      0.96       785
           1       1.00      0.91      0.95       762

    accuracy                           0.95      1547
   macro avg       0.96      0.95      0.95      1547
weighted avg       0.96      0.95      0.95      1547
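To see what TF-IDF changes relative to raw counts, the sketch below compares the weight of a word that appears in every document with one that appears in a single document. The three short texts and the variable names are made up for illustration; `TfidfVectorizer` is used with its default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "sleep" appears in every document, "escape" in only one
docs = ["sleep is my escape", "sleep is hard", "sleep again"]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray()
vocab = tfidf.vocabulary_

# Both words occur once in document 0, yet they receive different weights
w_sleep = weights[0, vocab["sleep"]]
w_escape = weights[0, vocab["escape"]]
print(w_sleep, w_escape)

# With default settings each row is scaled to unit L2 norm
row_norms = np.linalg.norm(weights, axis=1)
print(row_norms)
```

The word shared by all documents is down-weighted, which is the main difference from the plain count matrix used earlier.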



<h4><b>CONCLUSION:</b></h4><h6>Applying the BoW model to the preprocessed training data produces a sparse matrix of shape 6184x11579.</h6><h6>Three classification models are trained on this sparse matrix: Multinomial Naive Bayes, SVM, and Random Forest.</h6><h6>Of these, Random Forest performs best with an accuracy of 0.96, narrowly ahead of SVM (0.95) and well ahead of Multinomial Naive Bayes (0.85).</h6><h6>The N-gram model (ngram_range 1 to 3) with Random Forest gives a macro-averaged precision of 0.95 and F1 score of 0.94.</h6><h6>The TF-IDF model with Random Forest gives a macro-averaged precision of 0.96 and F1 score of 0.95.</h6>