#Word Representations

###Bag of Words

A bag-of-words is a representation of text that describes the occurrence of words within a document. 

It is called a ***bag*** of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.

If your dataset is small and context is domain specific, BoW may work better than Word Embedding because you may not find the corresponding vector from pre-trained word embedding models for some of the words.

![bow](https://miro.medium.com/max/554/0*B9GC_f3BMtjGMdQ-.png)

[Photo source](https://medium.com/analytics-vidhya/does-tf-idf-work-differently-in-textbooks-and-sklearn-routine-cc7a7d1b580d)

In [None]:
corpus = ["Flora is all the plant life present in a particular region or time, generally the naturally occurring (indigenous) native plants.",
    "The corresponding term for animal life is fauna. Flora, fauna, and other forms of life, such as fungi, are collectively referred to as biota.",
    "Sometimes bacteria and fungi are also referred to as flora, as in the terms gut flora or skin flora."]

corpus

['Flora is all the plant life present in a particular region or time, generally the naturally occurring (indigenous) native plants.',
 'The corresponding term for animal life is fauna. Flora, fauna, and other forms of life, such as fungi, are collectively referred to as biota.',
 'Sometimes bacteria and fungi are also referred to as flora, as in the terms gut flora or skin flora.']

We will create the BoW vectors using `CountVectorizer`

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(lowercase = False)
bow_representation = vectorizer.fit_transform(corpus)
vocabulary = vectorizer.get_feature_names()

print(vocabulary)
print(len(vocabulary))
print(bow_representation.toarray())

['Flora', 'Sometimes', 'The', 'all', 'also', 'and', 'animal', 'are', 'as', 'bacteria', 'biota', 'collectively', 'corresponding', 'fauna', 'flora', 'for', 'forms', 'fungi', 'generally', 'gut', 'in', 'indigenous', 'is', 'life', 'native', 'naturally', 'occurring', 'of', 'or', 'other', 'particular', 'plant', 'plants', 'present', 'referred', 'region', 'skin', 'such', 'term', 'terms', 'the', 'time', 'to']
43
[[1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 1
  0 0 0 0 2 1 0]
 [1 0 1 0 0 1 1 1 2 0 1 1 1 2 0 1 1 1 0 0 0 0 1 2 0 0 0 1 0 1 0 0 0 0 1 0
  0 1 1 0 0 0 1]
 [0 1 0 0 1 1 0 1 2 1 0 0 0 0 3 0 0 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
  1 0 0 1 1 0 1]]


####N-grams encoding

Extracts features from text while capturing local word order by defining
counts over sliding windows.

![ngrams](https://i.stack.imgur.com/8ARA1.png)


In [None]:
bigram = CountVectorizer(lowercase = False, ngram_range=(2, 2))
bigram_representation = bigram.fit_transform(corpus)

bigram_vocabulary = bigram.get_feature_names()

print(bigram_vocabulary)
print(len(bigram_vocabulary))
print(bigram_representation.toarray())

['Flora fauna', 'Flora is', 'Sometimes bacteria', 'The corresponding', 'all the', 'also referred', 'and fungi', 'and other', 'animal life', 'are also', 'are collectively', 'as biota', 'as flora', 'as fungi', 'as in', 'bacteria and', 'collectively referred', 'corresponding term', 'fauna Flora', 'fauna and', 'flora as', 'flora or', 'for animal', 'forms of', 'fungi are', 'generally the', 'gut flora', 'in particular', 'in the', 'indigenous native', 'is all', 'is fauna', 'life is', 'life present', 'life such', 'native plants', 'naturally occurring', 'occurring indigenous', 'of life', 'or skin', 'or time', 'other forms', 'particular region', 'plant life', 'present in', 'referred to', 'region or', 'skin flora', 'such as', 'term for', 'terms gut', 'the naturally', 'the plant', 'the terms', 'time generally', 'to as']
56
[[0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 1 0 1
  1 1 0 0 1 0 1 1 1 0 1 0 0 0 0 1 1 0 1 0]
 [1 0 0 1 0 0 0 1 1 0 1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 0 0 0 0 

###TF-IDF

TF-IDF represents
text data by indicating the importance of the word relative to the other words in
the text.

A problem with scoring word frequency is that highly frequent words start to dominate in the document (e.g. larger score), but may not contain as much “informational content” to the model as rarer but perhaps domain specific words.

One approach is to rescale the frequency of words by how often they appear in all documents, so that the scores for frequent words like “the” that are also frequent across all documents are penalized.

This approach to scoring is called Term Frequency – Inverse Document Frequency, or TF-IDF for short, where:

- **Term Frequency**: the frequency of a given term in a document.



- **Inverse Document Frequency**: the ratio of documents that contain a given term.

![tf](https://www.affde.com/uploads/article/5516/PVpklt43xBCKRFBa.png)

TF-IDF penalizes stopwords, they will not have a high score, but stopwords removal may stil be used to reduce the dimensionality of the input space.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(lowercase = False)
tfidf_representation = tfidf_vectorizer.fit_transform(corpus)

print(tfidf_representation.toarray())

[[0.18334923 0.         0.         0.2410822  0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.2410822  0.         0.18334923 0.2410822  0.18334923 0.18334923
  0.2410822  0.2410822  0.2410822  0.         0.18334923 0.
  0.2410822  0.2410822  0.2410822  0.2410822  0.         0.2410822
  0.         0.         0.         0.         0.36669846 0.2410822
  0.        ]
 [0.15630031 0.         0.20551613 0.         0.         0.15630031
  0.20551613 0.15630031 0.31260063 0.         0.20551613 0.20551613
  0.20551613 0.41103226 0.         0.20551613 0.20551613 0.15630031
  0.         0.         0.         0.         0.15630031 0.31260063
  0.         0.         0.         0.20551613 0.         0.20551613
  0.         0.         0.         0.         0.15630031 0.
  0.         0.20551613 0.20551613 0.         0.         0.
  0.15630031]
 [0.         0.21348818 0.         0.         0.21348818 0.16236326
  

####Limitations

- **Vocabulary**: The vocabulary requires careful design, most specifically in order to manage the size, which impacts the sparsity of the document representations.

- **Sparsity:** Sparse representations are harder to model both for computational reasons (space and time complexity) and also for information reasons, where the challenge is for the models to harness so little information in such a large representational space.

- **Meaning**: Discarding word order ignores the context, and in turn meaning of words in the document (semantics). Context and meaning can offer a lot to the model, that if modeled could tell the difference between the same words differently arranged (“this is interesting” vs “is this interesting”), synonyms (“old bike” vs “used bike”), and much more.

##Sentiment analysis

In [None]:
import nltk
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [None]:
from nltk.corpus  import twitter_samples

pos_tweets = twitter_samples.strings('positive_tweets.json')
print(len(pos_tweets))

neg_tweets = twitter_samples.strings('negative_tweets.json')
print(len(neg_tweets))

5000
5000


In [None]:
import pandas as pd
pos_df = pd.DataFrame(pos_tweets, columns = ['tweet'])
pos_df['label'] = 1

In [None]:
neg_df = pd.DataFrame(neg_tweets, columns = ['tweet'])
neg_df['label'] = 0

In [None]:
data_df = pd.concat([pos_df, neg_df], ignore_index=True)
data_df

Unnamed: 0,tweet,label
0,#FollowFriday @France_Inte @PKuchly57 @Milipol...,1
1,@Lamb2ja Hey James! How odd :/ Please call our...,1
2,@DespiteOfficial we had a listen last night :)...,1
3,@97sides CONGRATS :),1
4,yeaaaah yippppy!!! my accnt verified rqst has...,1
...,...,...
9995,I wanna change my avi but uSanele :(,0
9996,MY PUPPY BROKE HER FOOT :(,0
9997,where's all the jaebum baby pictures :((,0
9998,But but Mr Ahmad Maslan cooks too :( https://t...,0


####Split data into train and test

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(data_df, test_size=0.2, shuffle = True)
print(train_df)
print(test_df)

                                                  tweet  label
8693  @cIaricestarling i know right... :( i hope it'...      0
8098  Struggling like crazy to get into a race fan m...      0
9132                   7pm on a Friday and I am dead :(      0
4439  @AsianMeerkat @johncrossmirror far from it. Be...      1
665   @FooWhiter Ugh. I've never Rt or fade any of y...      1
...                                                 ...    ...
8671  @bumkeyyfel clowns? i'm not scared of clowns t...      0
3441  @SachinKalbag Ah Sachin, why do you bring up u...      1
9915  @annnalucz i'm not going :( kailan ba? may tix...      0
6466                     @horan_lyra done, please me :(      0
8630  I'm craving breakfast food so badly right now ...      0

[8000 rows x 2 columns]
                                                  tweet  label
3688                             @btsmaqnae followed :)      1
671               @4eyedmonk awesome :) I'll be waiting      1
942   Thanks for connecting @g

####TF-IDF Vectorization

In [None]:
tfidf_vectorizer = TfidfVectorizer(lowercase = False)
tfidf_representation = tfidf_vectorizer.fit(train_df['tweet'])

X_train = tfidf_vectorizer.transform(train_df['tweet'])
X_test = tfidf_vectorizer.transform(test_df['tweet'])

y_train = train_df['label']
y_test = test_df['label']

Classification

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
print('Train Score', logreg.score(X_train, y_train))
print('Test Score', logreg.score(X_test, y_test))

Train Score 0.897125
Test Score 0.7635


##Assignment

To be uploaded here: https://forms.gle/qTzLy6F6jkUtQrvy7 until November 17th

Investigate the effect of text normalization.

- Search for a dataset for classification (or experiment with the same dataset from this lab)
- Preprocess the text
- Compare the vocabulary size with and without preprocessing
- Get the numerical representation of the text
- Train a model
- Test your model 
- Compare the performance of your model with and without text normalization

##Pick database

In [96]:
import nltk
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

In [97]:
from nltk.corpus  import twitter_samples

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

## Preprocess text

In [159]:
!pip install -U spacy
!python -m spacy download en_core_web_sm

Collecting spacy
  Downloading spacy-3.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
[K     |████████████████████████████████| 6.0 MB 5.2 MB/s 
[?25hCollecting spacy-loggers<2.0.0,>=1.0.0
  Downloading spacy_loggers-1.0.1-py3-none-any.whl (7.0 kB)
Collecting thinc<8.1.0,>=8.0.12
  Downloading thinc-8.0.13-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (628 kB)
[K     |████████████████████████████████| 628 kB 57.3 MB/s 
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.6-py3-none-any.whl (17 kB)
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
[K     |████████████████████████████████| 181 kB 59.1 MB/s 
[?25hCollecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (451 kB)
[K     |████████████████████████████████| 451 kB 39.2 MB/s 
Collecting spacy-legacy<3.1.0,>=3.0.8
  Downloading spacy_legacy-3.0.8-py2.py3-none-any.whl (14 kB)
Colle

In [98]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [163]:
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
import spacy

nlp = spacy.load('en_core_web_sm')
stop_words_nltk = set(stopwords.words('english'))

In [164]:
def preprocess(tweets):
  # it makes sense to keep hashtags, punctuation, emoticons and emojis
  prepocessed_tweets = []

  for text in tweets:
    # lowercase text, remove newline characters, mentions, links and digits
    text = ' '.join([word.lower() for word in text.split() 
                    if word[0] not in ['@'] and word[:4] != "http" 
                    and not word.isdigit()])

    # tokenize into separate words, remove stopwords, lemmatize
    text = ' '.join([word.lemma_ for word in nlp(text) if word not in stop_words_nltk])

    prepocessed_tweets += [text]

  return prepocessed_tweets

##Compare the vocabulary size with and without preprocessing

In [166]:
def getVocabularySize(tweets):
  vocabulary = set()
  for tweet in tweets:
    for word in tweet:
      vocabulary.add(word)
  return len(vocabulary)

In [167]:
def printTable():
  print('             | Positive Tweets | Negative Tweets')
  print('             | Vocabulary      | Vocabulary     ')
  print('------------------------------------------------')
  print('Original     | ' + 
        str(getVocabularySize(positive_tweets)) + 
        '             | ' + 
        str(getVocabularySize(negative_tweets)))
  print('------------------------------------------------')
  print('Preprocessed | ' + 
        str(getVocabularySize(preprocessed_positive_tweets)) + 
        '             | ' + 
        str(getVocabularySize(preprocessed_negative_tweets)))

In [168]:
preprocessed_positive_tweets = preprocess(positive_tweets)
preprocessed_negative_tweets = preprocess(negative_tweets)

In [169]:
printTable()

             | Positive Tweets | Negative Tweets
             | Vocabulary      | Vocabulary     
------------------------------------------------
Original     | 274             | 248
------------------------------------------------
Preprocessed | 244             | 217


##Get numerical representation of text

In [170]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [171]:
def getDataFrame(positive_data, negative_data):
  pos_df = pd.DataFrame(positive_data, columns = ['tweet'])
  pos_df['label'] = 1

  neg_df = pd.DataFrame(negative_data, columns = ['tweet'])
  neg_df['label'] = 0

  data_df = pd.concat([pos_df, neg_df], ignore_index=True)
  return data_df

In [172]:
original_df = getDataFrame(positive_tweets, negative_tweets)
preprocessed_df = getDataFrame(preprocessed_positive_tweets, preprocessed_negative_tweets)

##Train and test model

In [173]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(lowercase = False)

In [174]:
def splitTrainTest(df):
  return train_test_split(df, test_size=0.2, shuffle = True)

In [175]:
def train(train_df):
  tfidf_representation = tfidf_vectorizer.fit(train_df['tweet'])

  X_train = tfidf_vectorizer.transform(train_df['tweet'])
  y_train = train_df['label']

  return X_train, y_train

In [186]:
def test(test_df):
  X_test = tfidf_vectorizer.transform(test_df['tweet'])
  y_test = test_df['label']

  return X_test, y_test

In [193]:
original_train, original_test = splitTrainTest(original_df)
preprocessed_train, preprocessed_test = splitTrainTest(preprocessed_df)

##Compare performance with and without text normalization

In [194]:
def getScores(train_df, test_df):
  X_train, y_train = train(train_df)
  X_test, y_test = test(test_df)

  logreg = LogisticRegression()
  logreg.fit(X_train, y_train)

  print('Train Score', logreg.score(X_train, y_train))
  print('Test Score', logreg.score(X_test, y_test))

In [195]:
getScores(original_train, original_test)
getScores(preprocessed_train, preprocessed_test)

Train Score 0.9015
Test Score 0.7575
Train Score 0.872625
Test Score 0.767


####Further reading

- [TF-IDF/Term Frequency](https://medium.com/analytics-vidhya/tf-idf-term-frequency-technique-easiest-explanation-for-text-classification-in-nlp-with-code-8ca3912e58c3)
- [Bag of Words](https://www.mygreatlearning.com/blog/bag-of-words/)