# Introduction 
[CheatSheet](https://www.kaggle.com/code/raenish/cheatsheet-text-helper-functions/notebook)

Data Source: [kaggle](https://www.kaggle.com/datasets/cosmos98/twitter-and-reddit-sentimental-analysis-dataset?select=Twitter_Data.csv)
The tweets in this dataset are about politics.

In this notebook we will explore Doc2Vec (DBOW and DM).

Some readings I found helpful:
  - Introduction to Doc2Vec [link](https://medium.com/wisio/a-gentle-introduction-to-doc2vec-db3e8c0cce5e)
  - Understanding Word2Vec [link](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
  - Sentiment Analysis Series by Kim [link](https://medium.com/towards-data-science/another-twitter-sentiment-analysis-with-python-part-11-cnn-word2vec-41f5e28eda74)

This project is not completed.

## Import data & packages

In [7]:
# basic
import numpy as np
import pandas as pd
import re
import string
from time import time
# Preprocessing
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
# visualization
import matplotlib.pyplot as plt
import seaborn as sns

# ML
import nltk
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Preprocessing Text 
<a id="1"></a>
The usual steps are:

1. Scrape text from raw documents
2. Remove punctuation
3. Lowercase
4. Tokenize & remove stop words
5. Lemmatize (lemma or stem)

This dataset is already partly cleaned, so some of the preprocessing is commented out.
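
To make the steps listed above concrete, here is a minimal sketch that runs them on a single made-up tweet. It uses only the standard library: the tiny `STOP_WORDS` set and the `preprocess` helper are illustrative assumptions standing in for NLTK's English stop word list and this notebook's own function, and lemmatization is left out.

```python
import re

# Hypothetical, deliberately tiny stop word set (NLTK's English list is much longer).
STOP_WORDS = {"the", "is", "a", "for", "and", "to", "of", "this"}

def preprocess(text):
    """Lowercase, strip links and number-bearing words, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # remove links
    text = re.sub(r"\w*\d\w*", "", text)               # remove words containing digits
    tokens = re.findall(r"\w+", text)                  # tokenizing on \w+ also drops punctuation
    return [t for t in tokens if t not in STOP_WORDS]

example = "This is a GREAT day for the 2019 election! https://example.com"
print(preprocess(example))  # ['great', 'day', 'election']
```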

In [10]:
def twit_preproc(df,column,now, tokenized=True):
    """Preprocess df[column] and store the result in df[now].

    Steps:
        - lower case, remove links, bracketed text and words with numbers
        - tokenize (which also drops punctuation) & remove stop words
        - lemmatize (currently commented out)
        - optional: join the tokens of each tweet back into one string
          (tokenized=False)
    """
    def clean_text(text):
        '''Make text lowercase, remove text in square brackets, links, HTML tags
        and newlines, and remove words containing numbers.'''
        text = str(text).lower()
        text = re.sub(r'\[.*?\]', '', text)  # remove text in square brackets
        text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove links
        text = re.sub(r'<.*?>+', '', text)  # remove HTML tags
        #text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
        text = re.sub(r'\n', '', text)  # remove newlines
        text = re.sub(r'\w*\d\w*', '', text)  # remove words containing numbers
        return text
    df[now] = df[column].apply(clean_text)
    
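    # Tokenize on word characters (this also drops punctuation)
    tokenizer = RegexpTokenizer(r'\w+')
    df[now] = df[now].apply(lambda x:tokenizer.tokenize(x))

    # Build the stop word set once; calling stopwords.words() for every token is very slow
    stop_words = set(stopwords.words('english'))
    def remove_stopword(x):
        return [y for y in x if y not in stop_words]
    df[now] = df[now].apply(remove_stopword)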
    
    # lemmatize and join the words
    #lemmatizer = WordNetLemmatizer()
    #def sentence_lemmatize(text):
    #    return ([lemmatizer.lemmatize(x) for x in text])
    #df[now] = df[now].apply(lambda text:sentence_lemmatize(text))
    
    # join the text 
    if (not tokenized):
        df[now] = df[now].apply(lambda text: " ".join(x for x in text))
        
    return df

### Read data
- `category`: sentiment
    - -1 = Negative
    - 0 = Neutral
    - 1 = Positive

In [11]:
df = pd.read_csv('data/data_2/Twitter_Data.csv')

In [12]:
twit_preproc(df,'clean_text','text')

Unnamed: 0,clean_text,category,text
0,when modi promised “minimum government maximum...,-1.0,"[modi, promised, minimum, government, maximum,..."
1,talk all the nonsense and continue all the dra...,0.0,"[talk, nonsense, continue, drama, vote, modi]"
2,what did just say vote for modi welcome bjp t...,1.0,"[say, vote, modi, welcome, bjp, told, rahul, m..."
3,asking his supporters prefix chowkidar their n...,1.0,"[asking, supporters, prefix, chowkidar, names,..."
4,answer who among these the most powerful world...,1.0,"[answer, among, powerful, world, leader, today..."
...,...,...,...
162975,why these 456 crores paid neerav modi not reco...,-1.0,"[crores, paid, neerav, modi, recovered, congre..."
162976,dear rss terrorist payal gawar what about modi...,-1.0,"[dear, rss, terrorist, payal, gawar, modi, kil..."
162977,did you cover her interaction forum where she ...,0.0,"[cover, interaction, forum, left]"
162978,there big project came into india modi dream p...,0.0,"[big, project, came, india, modi, dream, proje..."


In [36]:
train, test = train_test_split(df,test_size=0.1)
train.head()

Unnamed: 0,clean_text,category,text
118782,watch shri narendra modis exclusive interview ...,0.0,watch shri narendra modis exclusive interview ...
77023,there nothing against modi attack then personal,0.0,nothing modi attack personal
62407,nothing new modi just ribbon cutting mms work ...,-1.0,nothing new modi ribbon cutting mms work busin...
136902,mumbai march —setting the campaign tone for th...,-1.0,mumbai march setting campaign tone lok sabha e...
137403,you want say that impossible that congress ind...,-1.0,want say impossible congress indulged corrupti...


In [37]:
train = train.dropna()
test = test.dropna()
print("Training set has {} rows, and testing set has {} rows".
     format(train.shape[0],test.shape[0]))

Training set has 146672 rows, and testing set has 16297 rows


In [38]:
X,x_test = train['text'],test['text']
y,y_test = train['category'],test['category']
X.head()

118782    watch shri narendra modis exclusive interview ...
77023                          nothing modi attack personal
62407     nothing new modi ribbon cutting mms work busin...
136902    mumbai march setting campaign tone lok sabha e...
137403    want say impossible congress indulged corrupti...
Name: text, dtype: object

# Doc2Vec

### DBOW

In [47]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

from sklearn import utils

Doc2Vec requires each input document to be a `TaggedDocument`: a list of words plus a list of tags.
```
# EXAMPLE
str01 = "dog love cat"
TaggedDocument(str01.split(), ['H1'])
```
\>>> `TaggedDocument(words=['dog', 'love', 'cat'], tags=['H1'])`
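
The tags need to be a list, not a bare string. A short sketch of why: anything that loops over the tags of a document sees a string as a sequence of characters. The `TaggedDoc` namedtuple below is a stand-in with the same two fields as gensim's `TaggedDocument`, used here only as an assumption so the example runs without gensim installed.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument
# (an assumption for illustration, not the gensim class itself).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

good = TaggedDoc("dog love cat".split(), ["H1"])
bad = TaggedDoc("dog love cat".split(), "H1")

# Looping over the tags yields one tag for the list,
# but two one-character "tags" for the bare string.
print(list(good.tags))  # ['H1']
print(list(bad.tags))   # ['H', '1']
```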


In [41]:
def tag_corpus(tweets,label_head="TAG"):
    """
    Returns a list of TaggedDocument objects tagged '<label_head>_<index>'.
    Each tweet must be a space-separated string (preprocessed with tokenized=False).
    """
    result = []
    for i,tweet in enumerate(tweets):
        result.append(TaggedDocument(tweet.split(), [label_head + '_' + str(i)]))
    return result

In [42]:
X_tagged = tag_corpus(X,"TRAIN")
test_tagged = tag_corpus(x_test,"TEST")

In [69]:
all_tagged = X_tagged + test_tagged


`vector_size` is the dimensionality of the learned document vectors (the size of the hidden layer). It is a hyperparameter.

In [90]:
dbow_model = Doc2Vec(dm=0, vector_size=300, negative=5, min_count=1, alpha=0.065, min_alpha=0.065)
dbow_model.build_vocab([x for x in all_tagged])

In [91]:
%%time
for epoch in range(30):
    dbow_model.train(utils.shuffle([x for x in all_tagged]), total_examples=len(all_tagged), epochs=1)
    dbow_model.alpha -= 0.002
    dbow_model.min_alpha = dbow_model.alpha

CPU times: user 4min 42s, sys: 45.2 s, total: 5min 27s
Wall time: 2min 51s
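
The training loop above passes over the corpus 30 times and lowers `alpha` by 0.002 after each pass. A quick sketch of the resulting learning rate schedule, using only the numbers from that cell (the variable names are illustrative):

```python
start_alpha = 0.065  # alpha passed to Doc2Vec
step = 0.002         # amount subtracted after each pass
epochs = 30          # number of passes in the loop

# Learning rate used during each pass (rounded to hide float noise).
schedule = [round(start_alpha - step * epoch, 3) for epoch in range(epochs)]
# Value left in model.alpha after the final subtraction.
final_alpha = round(start_alpha - step * epochs, 3)

print(schedule[0])   # 0.065
print(schedule[-1])  # 0.007
print(final_alpha)   # 0.005
```

So the last pass trains at a rate of 0.007, and `alpha` is left at 0.005 once the loop ends.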


In [77]:
# get the vectors from model 
def get_vectors(model, corpus_size, vectors_size, label):
    """
    Get document vectors from a trained Doc2Vec model
    :param model: Trained Doc2Vec model
    :param corpus_size: Number of documents carrying this label prefix
    :param vectors_size: Size of the embedding vectors
    :param label: Tag prefix used in tag_corpus, e.g. 'TRAIN' or 'TEST'
    :return: numpy array of shape (corpus_size, vectors_size)
    """
    vectors = np.zeros((corpus_size, vectors_size))
    for i in range(0, corpus_size):
        prefix = label + '_' + str(i)
        vectors[i] = model.dv[prefix]
    return vectors
    
train_vectors_dbow = get_vectors(dbow_model, len(X_tagged), 300, 'TRAIN')
test_vectors_dbow = get_vectors(dbow_model, len(test_tagged), 300, 'TEST')

### DMC

In [117]:
dmc_model = Doc2Vec(dm=1, dm_concat=1, vector_size=300, window=2, negative=5, min_count=1, alpha=0.065, min_alpha=0.065)
dmc_model.build_vocab([x for x in all_tagged])

for epoch in range(30):
    dmc_model.train(utils.shuffle([x for x in all_tagged]), total_examples=len(all_tagged), epochs=1)
    dmc_model.alpha -= 0.002
    dmc_model.min_alpha = dmc_model.alpha

In [119]:
train_vectors_dmc = get_vectors(dmc_model, len(X_tagged), 300, 'TRAIN')
test_vectors_dmc = get_vectors(dmc_model, len(test_tagged), 300, 'TEST')

In [120]:
%%time
# `mlr` is the LogisticRegression instance created in the Modelling section
mlr = mlr.fit(train_vectors_dmc,y)
preds = mlr.predict(test_vectors_dmc)
print("Model Accuracy for DMC")
print(classification_report(y_test, preds))

Model Accuracy for DMC
              precision    recall  f1-score   support

        -1.0       0.64      0.09      0.15      3494
         0.0       0.35      0.20      0.26      5578
         1.0       0.46      0.80      0.58      7225

    accuracy                           0.44     16297
   macro avg       0.48      0.36      0.33     16297
weighted avg       0.46      0.44      0.38     16297

CPU times: user 1min 46s, sys: 4.84 s, total: 1min 51s
Wall time: 28.2 s


# Modelling
<a id="4"></a>
As decided in the previous notebook, the model we will use is logistic regression.

In [109]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
mlr = LogisticRegression(max_iter=500, multi_class='multinomial',solver='lbfgs')

In [110]:
%%time
mlr = mlr.fit(train_vectors_dbow,y)

CPU times: user 15.2 s, sys: 955 ms, total: 16.2 s
Wall time: 4.19 s


In [118]:
preds = mlr.predict(test_vectors_dbow)
print("Model Accuracy for DBOW")
print(classification_report(y_test, preds))

Model Accuracy for DBOW
              precision    recall  f1-score   support

        -1.0       0.51      0.31      0.38      3494
         0.0       0.60      0.63      0.62      5578
         1.0       0.60      0.70      0.65      7225

    accuracy                           0.59     16297
   macro avg       0.57      0.54      0.55     16297
weighted avg       0.58      0.59      0.58     16297
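
To put the accuracies in context, it helps to compare them with a majority-class baseline, that is, always predicting the most frequent label. A minimal sketch using only the `support` column of the classification reports in this notebook (the variable names are illustrative):

```python
# Support counts copied from the classification reports in this notebook.
support = {-1.0: 3494, 0.0: 5578, 1.0: 7225}

total = sum(support.values())
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / total

print(total)                        # 16297
print(majority_label)               # 1.0
print(round(baseline_accuracy, 2))  # 0.44
```

Always predicting the positive class would score about 0.44 on this test split, so the report with 0.44 accuracy is no better than the baseline, while the report with 0.59 accuracy is clearly above it.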

