# <center>Machine Learning on Textual Data</center>

---

In this exercise, we will preprocess the data with spaCy, using the techniques covered in the spaCy introduction.

Then, we will look at a few applications on the data:

- Text Classification
- Text Clustering
- Sentiment Analysis

### Dataset

The data set contains about 1000 online reviews from each of three sources:

- Amazon,
- Yelp and
- IMDB.

For each source, about 500 reviews are labelled positive (1) and about 500 are labelled negative (0).

Each source is provided as a separate text file, which needs to be loaded into a DataFrame.

<a href="https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences">Dataset Link</a>

## Text Classification

---

#### Load Libraries

In [4]:
import pandas as pd

In [5]:
import warnings
warnings.filterwarnings('ignore')

#### Import Data

I. Yelp Data

In [6]:
data_yelp = pd.read_table('yelp_labelled.txt', header = None)
print(data_yelp.shape)
data_yelp[:3]

(1000, 2)


| | 0 | 1 |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |


II. Amazon Data

In [7]:
data_amazon = pd.read_table('amazon_cells_labelled.txt', header = None)
print(data_amazon.shape)
data_amazon[:3]

(1000, 2)


| | 0 | 1 |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |


III. IMDB Data

In [8]:
data_imdb = pd.read_table('imdb_labelled.txt', header = None)
print(data_imdb.shape)
data_imdb[:3]

(748, 2)


| | 0 | 1 |
|---|---|---|
| 0 | A very, very, very slow-moving, aimless movie ... | 0 |
| 1 | Not sure who was more lost - the flat characte... | 0 |
| 2 | Attempting artiness with black & white and cle... | 0 |
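
The IMDB frame has 748 rows rather than the roughly 1000 described in the dataset summary. One plausible cause, which the source does not confirm, is that `pd.read_table` treats a double quote at the start of a field as the opening of a quoted field, so several physical lines are merged into one row. The sketch below reproduces that effect on a tiny in-memory sample (the sample text is made up for illustration) and shows that `quoting=csv.QUOTE_NONE` keeps one row per line.

```python
import csv
import io

import pandas as pd

# Made-up sample: the second line opens with a double quote that is only
# closed by a quote on a later line, mimicking stray quotes in review text.
sample = (
    "good phone\t1\n"
    "\"So bad it is laughable\t0\n"
    "works fine\t1\n"
    "the \"best\" ever\t1\n"
)

# Default behaviour: a leading quote starts a quoted field that may span lines.
default_df = pd.read_table(io.StringIO(sample), header=None)

# QUOTE_NONE tells the parser to treat quote characters as ordinary text.
raw_df = pd.read_table(io.StringIO(sample), header=None, quoting=csv.QUOTE_NONE)

print(len(default_df), len(raw_df))
```

If the same thing happens with `imdb_labelled.txt`, reloading it with `quoting=csv.QUOTE_NONE` should bring the row count closer to 1000. The shapes and row positions reported in the rest of this notebook assume the 748-row version.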


#### Combine the datasets

In [9]:
reviews = pd.concat([data_amazon, data_imdb, data_yelp], axis = 0,
                   keys = ['Amazon', 'imdb', 'yelp'])

In [10]:
reviews.columns = ['review', 'label']

In [11]:
reviews.shape

(2748, 2)

In [12]:
reviews.iloc[[500, 501, 502, 1500, 1501, 1502, 2500, 2501, 2502]]

| | | review | label |
|---|---|---|---|
| Amazon | 500 | The bose noise cancelling is amazing, which is... | 1 |
| Amazon | 501 | This battery is an excellent bargain! | 1 |
| Amazon | 502 | Defective crap. | 0 |
| imdb | 500 | It's a case of 'so bad it is laughable'. | 0 |
| imdb | 501 | ) very bad performance plays Angela Bennett, a... | 0 |
| imdb | 502 | It is a film about nothing, just a pretext to ... | 0 |
| yelp | 752 | Level 5 spicy was perfect, where spice didn't ... | 1 |
| yelp | 753 | We were sat right on time and our server from ... | 1 |
| yelp | 754 | Main thing I didn't enjoy is that the crowd is... | 0 |
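
Passing `keys` to `pd.concat` is what produces the two-level index seen in the output: the outer level names the source and the inner level keeps each file's original row number. A minimal sketch on made-up stand-in frames (the rows and variable names here are illustrative, not taken from the dataset) shows how that outer level can be used to select or count rows per source.

```python
import pandas as pd

# Tiny stand-ins for the three source frames (made-up rows)
amazon = pd.DataFrame({'review': ['good case', 'broke fast'], 'label': [1, 0]})
imdb = pd.DataFrame({'review': ['dull plot'], 'label': [0]})
yelp = pd.DataFrame({'review': ['great food', 'slow service'], 'label': [1, 0]})

# keys= adds an outer index level naming the source of each row
combined = pd.concat([amazon, imdb, yelp], axis=0, keys=['Amazon', 'imdb', 'yelp'])

# Select every row from one source via the outer level
yelp_rows = combined.loc['yelp']

# Count rows per source by grouping on the outer level
rows_per_source = combined.groupby(level=0).size()

print(yelp_rows)
print(rows_per_source)
```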


### Text Preprocessing

- tokenisation
- lemmatisation
- stop-word removal
- punctuation removal
- vectorisation

#### Create Stop Words list

In [13]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

In [14]:
nlp = spacy.load('en_core_web_md')

In [15]:
stopwords = list(STOP_WORDS)
print(stopwords[:100])

['we', '‘d', 'full', 'therein', 'most', 'less', '‘re', "n't", 'empty', 'against', 'upon', 'everyone', 'often', 'onto', 'hereby', 'although', 'becoming', 'whoever', 'somehow', 'nine', 'is', 'be', 'themselves', 'here', 'part', 'any', 'now', 'out', 'she', 'along', 'same', 'n‘t', 'than', 'else', 'forty', 'four', 'nowhere', 'over', 'what', 'only', 'hers', "'m", 'while', 'there', 'at', 'since', 'whose', 'many', 'whenever', 'whereupon', 'give', 'for', '’d', 'those', 'whereas', 'through', 'their', 'nothing', 'seems', 'should', 'has', 'below', 'next', 'made', 'if', 'seem', 'whereafter', 'regarding', 'an', 'make', 'down', 'further', 'anyhow', 'off', 'as', 'top', 'always', 'been', 'one', 'put', 'or', 'some', 'serious', 'whence', 'into', 'until', 'your', 'such', '‘ve', 'indeed', 'so', 'also', 'via', '‘m', 'how', 'see', 'thereby', 'they', 'everywhere', 'moreover']


Check whether a word is in the list

In [16]:
print('which' in stopwords)
print('thing' in stopwords) 

True
False


Let's add the word `thing` to the stop words list

In [17]:
stopwords = ['thing'] + list(STOP_WORDS)
print(stopwords[:5])

['thing', 'we', '‘d', 'full', 'therein']
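
spaCy's `STOP_WORDS` is a Python set, and converting it to a list means every membership test scans the list from the start. A possible alternative, sketched here with a hypothetical four-word stand-in for `STOP_WORDS`, is to keep the collection as a set and add custom words with a set union.

```python
# Hypothetical mini stop-word list standing in for spaCy's STOP_WORDS
base_stop_words = ['the', 'a', 'is', 'which']

# A set gives constant-time membership tests and ignores duplicates
stop_word_set = set(base_stop_words) | {'thing'}

tokens = ['the', 'thing', 'is', 'a', 'battery', 'which', 'works']

# Keep only the tokens that are not stop words
kept = [tok for tok in tokens if tok not in stop_word_set]
print(kept)
```

For a corpus of a few thousand short reviews the difference is small, so the list used in this notebook works fine; the set mainly helps on larger corpora.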


#### Create Punctuation list

We will use the default punctuation list.

In [18]:
import string

In [19]:
punctuations = string.punctuation

In [20]:
punctuations

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

#### Create preprocessing Pipeline

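Each token is lemmatised and lowercased, then stop words and punctuation are filtered out by the `tokeniser` function defined below. Pronouns, whose lemma in spaCy 2.x is the placeholder `-PRON-`, keep their lowercased text instead of the lemma.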

In [21]:
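def tokeniser(sentence):
    # Run the spaCy pipeline to tokenise and lemmatise the sentence
    doc = nlp(sentence)

    # Use the lowercased lemma; spaCy 2.x lemmatises pronouns to the placeholder
    # '-PRON-', so keep the lowercased token text for those instead
    tokens = [token.lemma_.lower().strip() if token.lemma_ != '-PRON-' else token.lower_ for token in doc]

    # Drop stop words and punctuation (`punctuations` is a string, so this is a substring test)
    tokens = [token for token in tokens if token not in stopwords and token not in punctuations]

    return tokens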

In [22]:
sent = "He was saying some thing but he says a lot of things so I just wasn't paying attention." 

In [23]:
tokeniser(sent)

['lot', 'pay', 'attention']
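
Because `string.punctuation` is a single string, the test `token not in punctuations` checks whether the token is a substring of it rather than whether the token consists of punctuation. The sketch below compares that test with a character-by-character alternative; the helper names are hypothetical and the alternative is a suggestion, not part of the original notebook.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punct_substring(tok):
    # Test used by the tokeniser: substring match against one long string
    return tok in string.punctuation

def is_punct_chars(tok):
    # Alternative: token is non-empty and every character is punctuation
    return bool(tok) and all(ch in PUNCT_CHARS for ch in tok)

# Compare both tests on a few representative tokens
for tok in ['!', '...', '()', '', 'ok']:
    print(repr(tok), is_punct_substring(tok), is_punct_chars(tok))
```

The two tests disagree on `'...'`, which is not a substring of `string.punctuation` and is therefore kept by the substring test, and on the empty string, which the substring test treats as punctuation. Dropping empty strings happens to be useful after `.strip()`, so a character-based filter would need a separate check for empty tokens.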

In [24]:
doc = nlp(sent)

In [25]:
text=[]
lemma=[]
pos=[]
tag=[]
dep=[]
shape=[]
alpha=[]
stop=[]
ner_text=[]
ner_label=[]
for token in doc:
    text.append(token.text)
    lemma.append(token.lemma_)
    pos.append(token.pos_)
    tag.append(token.tag_)
    dep.append(token.dep_)
    shape.append(token.shape_)
    alpha.append(token.is_alpha)
    stop.append(token.is_stop)

In [26]:
df = pd.DataFrame(list(zip(text,lemma,pos,tag,dep,shape,alpha,stop))
                 , columns = ['text','lemma','pos','tag','dep','shape','alpha','stop'])

In [27]:
df

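| | text | lemma | pos | tag | dep | shape | alpha | stop |
|---|---|---|---|---|---|---|---|---|
| 0 | He | -PRON- | PRON | PRP | nsubj | Xx | True | True |
| 1 | was | be | AUX | VBD | aux | xxx | True | True |
| 2 | saying | say | VERB | VBG | ROOT | xxxx | True | False |
| 3 | some | some | DET | DT | det | xxxx | True | True |
| 4 | thing | thing | NOUN | NN | dobj | xxxx | True | False |
| 5 | but | but | CCONJ | CC | cc | xxx | True | True |
| 6 | he | -PRON- | PRON | PRP | nsubj | xx | True | True |
| 7 | says | say | VERB | VBZ | ROOT | xxxx | True | False |
| 8 | a | a | DET | DT | det | x | True | True |
| 9 | lot | lot | NOUN | NN | dobj | xxx | True | False |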


#### Vectorisation

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer 

In [29]:
tfidfvectorizer = TfidfVectorizer(tokenizer = tokeniser)
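
To make the vectorisation step concrete, here is a minimal sketch on a made-up three-document corpus. It uses the default tokeniser rather than the spaCy-based `tokeniser`, so it runs without a language model. It also checks one inverse document frequency value against the smoothed formula that scikit-learn applies by default, $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up three-document corpus
docs = ["great phone great battery", "bad battery", "great movie"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each term to its column; sort terms by column index
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)
print(X.shape)

# With the default smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1
n_docs = len(docs)
df_great = 2  # "great" occurs in two of the three documents
expected_idf = math.log((1 + n_docs) / (1 + df_great)) + 1
actual_idf = vectorizer.idf_[vectorizer.vocabulary_['great']]
print(expected_idf, actual_idf)
```

Each row of `X` is scaled to unit length (`norm='l2'` is the default), which is why long and short reviews end up on a comparable scale.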

#### Build a Support Vector Classifier Object

In [30]:
from sklearn.svm import LinearSVC

In [31]:
classifier = LinearSVC()

#### Train - Test Split

The data is split into training (80%) and test (20%) sets before being fed into the machine learning pipeline.

In [32]:
from sklearn.model_selection import train_test_split

In [33]:

X_train, X_test, y_train, y_test = train_test_split( reviews['review'], reviews['label'], 
                                                   test_size = 0.2, random_state = 42)


Check the shape

In [34]:
print('train shape : ' , X_train.shape)
print('test shape : ', X_test.shape) 

train shape :  (2198,)
test shape :  (550,)
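
The split above is purely random. An optional variant, shown here on made-up data, passes `stratify` so that the positive/negative ratio is the same in the training and test parts. Using `stratify` is an assumption of this sketch, not something the original split does.

```python
from sklearn.model_selection import train_test_split

# Made-up balanced sample: five positive and five negative reviews
texts = [f"review {i}" for i in range(10)]
labels = [1, 0] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.2,      # same 80/20 split as above
    random_state=42,    # fixed seed so the split is reproducible
    stratify=labels,    # assumption: keep the class ratio equal in both parts
)

print(len(X_tr), len(X_te))
print(sorted(y_te))
```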


In [35]:
y_train[1:10]

yelp    739    1
imdb    124    0
yelp    732    1
Amazon  480    1
        879    1
yelp    481    1
imdb    743    0
Amazon  912    0
        785    0
Name: label, dtype: int64

#### Create Machine Learning Pipeline

In [36]:
from sklearn.pipeline import Pipeline

In [37]:
model_pipe = Pipeline( [ ('vectorizer', tfidfvectorizer) ,
                     ('classifier', classifier) ] )  

In [38]:
model_pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokeniser at 0x00000252BCF06D38>,
                                 use_idf=True, vocabulary=None)),
                ('classifier',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_in
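
The point of the pipeline is that one `fit` call trains both steps and one `predict` call accepts raw text. The following sketch illustrates this on a made-up six-review training set; it uses the default `TfidfVectorizer` tokeniser instead of the spaCy-based one so that it runs without a language model, and the names `toy_pipe` and `toy_preds` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up training data: 1 = positive, 0 = negative
train_texts = ["great phone", "excellent battery", "love this case",
               "terrible phone", "awful battery", "hate this case"]
train_labels = [1, 1, 1, 0, 0, 0]

# The vectoriser turns raw strings into TF-IDF features; the classifier
# is trained on those features. Both steps are fitted by a single fit().
toy_pipe = Pipeline([
    ('vectorizer', TfidfVectorizer()),   # default tokeniser, not the spaCy one
    ('classifier', LinearSVC(random_state=0)),
])
toy_pipe.fit(train_texts, train_labels)

# predict() accepts raw text because the pipeline vectorises it first
toy_preds = toy_pipe.predict(["great battery", "awful phone"])
print(toy_preds)
```

Keeping the vectoriser inside the pipeline also means its vocabulary and IDF weights are learned from the training data only, so nothing from the test set leaks into the features.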

#### Predict on test data

In [39]:
preds = model_pipe.predict(X_test)

In [40]:
preds[:10]

array([0, 1, 0, 0, 0, 0, 1, 1, 0, 0], dtype=int64)

In [41]:
X_test[:10]

yelp    768    It's close to my house, it's low-key, non-fanc...
        894    If you stay in Vegas you must get breakfast he...
imdb    359    Let's start with all the problemsthe acting, ...
        702    It's too bad that everyone else involved didn'...
yelp    912    i felt insulted and disrespected, how could yo...
Amazon  564    Yet Plantronincs continues to use the same fla...
imdb    330    Whatever prompted such a documentary is beyond...
yelp    627    Any grandmother can make a roasted chicken bet...
Amazon  695           Do NOT buy if you want to use the holster.
        321    I ordered this product first and was unhappy w...
Name: review, dtype: object

In [42]:
type(X_test[:10]) 

pandas.core.series.Series

#### Compute the accuracy

In [43]:
from sklearn.metrics import accuracy_score

In [44]:
print('Train Accuracy : ', model_pipe.score(X_train, y_train)) 

Train Accuracy :  0.9754322111010009


In [45]:
print('Test Accuracy : ', model_pipe.score(X_test, y_test)) 

Test Accuracy :  0.8127272727272727


In [46]:
accuracy_score(y_test, preds)

0.8127272727272727
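
`model_pipe.score` and `accuracy_score` report the same number because both compute the share of predictions that match the true labels. A small sketch on made-up labels shows the calculation by hand, next to a confusion matrix that breaks the errors down by class (the confusion matrix is an addition for illustration, not part of the original analysis).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions for eight reviews
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy is the share of predictions that match the true label
manual_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
sk_acc = accuracy_score(y_true, y_pred)

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred)

print(manual_acc, sk_acc)
print(cm)
```

In the results above, the gap between the training accuracy (about 0.98) and the test accuracy (about 0.81) suggests that the model fits the training reviews more closely than it generalises.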

---

## Text Clustering

---


#### Compute the tfidf matrix

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [48]:
tfvectorizer = TfidfVectorizer()

In [49]:
tfidf = tfvectorizer.fit_transform(reviews['review'])

In [50]:
tfidf.shape

(2748, 5155)

In [51]:
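# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions
tfvectorizer.get_feature_names()[200:210]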

['amazed',
 'amazing',
 'amazingly',
 'amazon',
 'ambiance',
 'ambience',
 'america',
 'american',
 'americans',
 'among']

#### Convert tfidf matrix to dense form

In [52]:
# todense() returns a numpy.matrix; tfidf.toarray() would return a plain ndarray instead
dense = tfidf.todense()

In [53]:
print(dense)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [54]:
print(dense.shape)

(2748, 5155)


In [55]:
print(type(dense))

<class 'numpy.matrix'>


#### Build the kmeans model and retrieve labels

In [56]:
from sklearn.cluster import KMeans

In [57]:
kmeans = KMeans(n_clusters = 6, init = 'k-means++')

In [58]:
model = kmeans.fit(dense)

In [59]:
labels = model.labels_

In [60]:
print(labels)

[1 2 0 ... 4 3 1]


In [61]:
pd.DataFrame(labels)[0].value_counts()

2    1110
1     605
3     399
5     311
4     199
0     124
Name: 0, dtype: int64
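
Two details of the clustering step are worth a sketch. `KMeans` accepts the sparse TF-IDF matrix directly, so the dense conversion is optional (and newer scikit-learn versions may reject `numpy.matrix` input), and without a `random_state` the cluster numbering and sizes can change from run to run. The example below uses a made-up four-document corpus and two clusters purely for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus with two obvious topics
docs = ["battery phone charger", "phone battery life",
        "movie plot actors", "movie actors script"]

X = TfidfVectorizer().fit_transform(docs)  # stays sparse

# random_state fixes the initialisation; n_init is set explicitly because
# its default differs between scikit-learn versions
km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
cluster_labels = km.fit_predict(X)  # KMeans accepts sparse input directly

print(cluster_labels)
```

The cluster ids themselves are arbitrary; only the grouping of documents is meaningful.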

In [62]:
! pip install textblob

Collecting textblob
  Using cached textblob-0.15.3-py2.py3-none-any.whl (636 kB)
Installing collected packages: textblob
Successfully installed textblob-0.15.3


---

## Sentiment Analysis

---

The `sentiment` property of a TextBlob object returns two values: polarity and subjectivity.

- **Polarity** is a float in the range [-1, 1], where 1 means a positive statement and -1 means a negative statement.

- **Subjectivity** is a float in the range [0, 1], where 0 is very objective and 1 is very subjective. Subjective sentences generally express personal opinion, emotion or judgment, whereas objective sentences convey factual information.

In [63]:
from textblob import TextBlob

In [64]:
blob = TextBlob('you do not have to argue with me')
blob.sentiment

Sentiment(polarity=0.0, subjectivity=0.0)

In [65]:
data_sentiment = reviews.copy()

In [66]:
data_sentiment['polarity'] = data_sentiment.apply(lambda x: TextBlob(x['review']).sentiment.polarity, axis = 1)

In [67]:
data_sentiment['subjectivity'] = data_sentiment.apply(lambda x : TextBlob(x['review']).sentiment.subjectivity, axis =1)

In [68]:
data_sentiment.head(10)

| | | review | label | polarity | subjectivity |
|---|---|---|---|---|---|
| Amazon | 0 | So there is no way for me to plug it in here i... | 0 | 0.0 | 0.0 |
| Amazon | 1 | Good case, Excellent value. | 1 | 0.85 | 0.8 |
| Amazon | 2 | Great for the jawbone. | 1 | 0.8 | 0.75 |
| Amazon | 3 | Tied to charger for conversations lasting more... | 0 | 0.390625 | 0.25 |
| Amazon | 4 | The mic is great. | 1 | 0.8 | 0.75 |
| Amazon | 5 | I have to jiggle the plug to get it to line up... | 0 | 0.22619 | 0.60119 |
| Amazon | 6 | If you have several dozen or several hundred c... | 0 | 0.1 | 0.066667 |
| Amazon | 7 | If you are Razr owner...you must have this! | 1 | 0.0 | 0.0 |
| Amazon | 8 | Needless to say, I wasted my money. | 0 | -0.35 | 0.5 |
| Amazon | 9 | What a waste of money and time!. | 0 | -0.25 | 0.0 |
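
One way to relate polarity to the existing labels is to threshold it and count how often the two agree. The sketch below does this for the ten rows shown above, with the values copied from that table. The rule "polarity above zero means positive" is an assumption made for illustration, and ten rows are far too few to say anything about the full dataset.

```python
# Labels and polarity scores copied from the ten rows shown above
labels = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0]
polarity = [0.0, 0.85, 0.8, 0.390625, 0.8, 0.22619, 0.1, 0.0, -0.35, -0.25]

# Assumed rule: strictly positive polarity means a positive review
predicted = [1 if p > 0 else 0 for p in polarity]

# Share of rows where the thresholded polarity matches the label
matches = sum(int(y == y_hat) for y, y_hat in zip(labels, predicted))
agreement = matches / len(labels)

print(predicted)
print(agreement)
```

On these rows the rule agrees with the label in 6 of 10 cases. The disagreements include negative reviews with mildly positive polarity (for example row 3) and a positive review scored exactly 0.0 (row 7).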


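#### Language Translation

`translate()` and `detect_language()` call the Google Translate service, so they need an internet connection. They work with the TextBlob 0.15.3 release installed above, but were deprecated in TextBlob 0.16.0 and removed in later releases.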

In [69]:
blob = TextBlob("Arise! Awake! and stop not until the goal is reached.")

In [70]:
blob.translate(from_lang = 'en', to ='hi')

TextBlob("उठो! जाग! और तब तक न रुकें जब तक कि लक्ष्य पूरा न हो जाए।")

In [71]:
blob1 = TextBlob('मनुष्य का सबसे बड़ा यदि कोई शत्रु है तो वह है उसका अज्ञान|')

In [72]:
blob1.detect_language()

'hi'

In [73]:
blob1.translate()

TextBlob("If there is any enemy of man, it is his ignorance.")

In [74]:
blob1.translate(from_lang = 'hi', to = 'en')

TextBlob("If there is any enemy of man, it is his ignorance.")

Languages and models supported by spaCy: https://spacy.io/usage/models#languages