# Text Classification

In this assignment you are required to perform text classification on an app review dataset consisting of 4 classes:
- Bug reports
- Feature
- Rating
- UserExperience

There are a total of 3733 samples. You need to:
- Split the data into train/validation/test sets (70/15/15) using random seed '777' with shuffling.
- Investigate the effect of stop-words, infrequent words, and text normalization (stemming, lemmatization, and other word tokenization issues such as case normalization and punctuation). Additionally, you can apply techniques to address the data imbalance problem.
- Report appropriate measures such as accuracy, precision, recall, and F1 scores (you can use the classification report API).
- Show the confusion matrix for the validation and test sets.

***Note:***
- The student with the best macro-average F1-score receives a 15% grade bonus
- The student with the second-best macro-average F1-score receives a 10% grade bonus
- The student with the third-best macro-average F1-score receives a 5% grade bonus

## 0. Import

In [242]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import (accuracy_score, log_loss, f1_score, precision_score, recall_score,
                             confusion_matrix, multilabel_confusion_matrix, classification_report)
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from tqdm import tqdm
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from tensorflow import keras
from keras.layers import (LSTM, 
                          Embedding, 
                          BatchNormalization,
                          Dense, 
                          TimeDistributed, 
                          Dropout, 
                          Bidirectional,
                          Flatten, 
                          GlobalMaxPool1D)
from keras.initializers import Constant
from keras.optimizers import adam_v2
from nltk.tokenize import word_tokenize
import transformers
import nltk


## 1. Read the dataset and split it into different sets
**[20 points]** for investigating issues of stop-words, infrequent words, text normalization (stemming, lemmatization, other issues of word tokenization), and dataset imbalance

In [54]:
data = pd.read_csv('AppReviews-FourClasses.csv') # read data 
data.head()

Unnamed: 0,Review,label
0,"Besides the occasional crash, this is an amazi...",Bug
1,This could be a great app if it was predictabl...,Bug
2,I can't open since the last 2 updates Pop-ups ...,Bug
3,Use to love this app but it's not working afte...,Bug
4,"Urrrrm\tAfter my third re installing, it final...",Bug


In [55]:
data.describe()

Unnamed: 0,Review,label
count,3733,3733
unique,3217,4
top,Good,Rating
freq,15,2461


In [56]:
data.isnull().sum() # checking for null values

Review    0
label     0
dtype: int64

### Removing stopwords and punctuation

In [57]:
text = data['Review']
# Raw string avoids the invalid '\]' escape; characters not listed (e.g. apostrophes, '!', '?') are kept.
punctuation = r'"#$%&()+,-./:;<=>@[\]^_`{|}~'

# Load the stop-word list once (one word per line) instead of reopening the file for every review.
with open("stopwords.txt", "r") as sw_file:
    stopwords = set(sw_file.read().split("\n"))

def text_cleaning(review):
    # Drop punctuation characters, then drop stop-words (case-insensitive match).
    no_punc = ''.join(ch for ch in review if ch not in punctuation)
    kept = [w for w in no_punc.split() if w.lower() not in stopwords]
    return ' '.join(kept)


In [58]:
data['cleaned'] = data.Review.apply(text_cleaning)
data.head()


Unnamed: 0,Review,label,cleaned
0,"Besides the occasional crash, this is an amazi...",Bug,Besides occasional crash amazing product tons ...
1,This could be a great app if it was predictabl...,Bug,could great app predictable full bugs unpredic...
2,I can't open since the last 2 updates Pop-ups ...,Bug,can't open since last 2 updates Popups go craz...
3,Use to love this app but it's not working afte...,Bug,Use love app it's working new update Pages won...
4,"Urrrrm\tAfter my third re installing, it final...",Bug,Urrrrm third re installing finally scenery han...
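The assignment also asks about infrequent words, which the cleaning step above does not address. A minimal sketch of counting token frequencies and dropping words that occur fewer times than a threshold; the helper name `drop_rare_words`, the threshold `min_count=2` and the toy reviews are assumptions for illustration, not part of the notebook:

```python
from collections import Counter

def drop_rare_words(texts, min_count=2):
    """Remove tokens whose corpus frequency is below min_count (hypothetical helper)."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    cleaned = [
        ' '.join(tok for tok in t.lower().split() if counts[tok] >= min_count)
        for t in texts
    ]
    return cleaned, counts

# Toy stand-in for the 'cleaned' column
toy_reviews = ["app crashes after update", "great app love it", "update broke the app"]
cleaned, counts = drop_rare_words(toy_reviews, min_count=2)
print(cleaned)
print(counts.most_common(3))
```

In scikit-learn, the `min_df` argument of `CountVectorizer` and `TfidfVectorizer` gives a similar effect at the document-frequency level, so the filtering can also be left to the vectorizer.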


In [251]:
le = LabelEncoder()
le.fit(data['label'])

data['label_encoded'] = le.transform(data['label'])


### Stemming and Lemmatization

In [68]:
stemmer = nltk.SnowballStemmer("english")

def stemm_text(text):
    text = ' '.join(stemmer.stem(word) for word in text.split(' '))
    return text

data['cleaned'] = data['cleaned'].apply(stemm_text)
data.head()

Unnamed: 0,Review,label,cleaned
0,"Besides the occasional crash, this is an amazi...",Bug,besid occas crash amaz product ton potenti dep...
1,This could be a great app if it was predictabl...,Bug,could great app predict full bug unpredict abl...
2,I can't open since the last 2 updates Pop-ups ...,Bug,can't open sinc last 2 updat popup go crazi ip...
3,Use to love this app but it's not working afte...,Bug,use love app it work new updat page won't scro...
4,"Urrrrm\tAfter my third re installing, it final...",Bug,urrrrm third re instal final sceneri hand blac...


In [71]:
# lemmatizer=nltk.stem.WordNetLemmatizer()

# def lemma_text(text):
#     text = ' '.join(lemmatizer.lemmatize(word) for word in text.split(' '))
#     return text

# data['cleaned'] = data['cleaned'].apply(lemma_text)
# data.head()
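To judge whether stemming is too aggressive for this dataset, it can help to look at what the stemmer does to individual tokens. A small sketch using the same Snowball stemmer as above; the word list is illustrative:

```python
import nltk

stemmer = nltk.SnowballStemmer("english")

# Illustrative tokens of the kind found in the reviews
words = ["amazing", "updates", "crazy", "installing"]
stems = {w: stemmer.stem(w) for w in words}
for w, s in stems.items():
    print(f"{w:>10} -> {s}")
```

Stems such as `amaz` or `updat` are not dictionary words, which may matter when tokens are later looked up in pretrained embeddings. Lemmatization keeps real words, at the cost of needing the WordNet data.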

### Label encoding

In [252]:
# lbl_enc = preprocessing.LabelEncoder()
y = data.label_encoded.values
# print(y)

X = data.cleaned.values

### Splitting the data 

In [253]:
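# 70% train, then split the remaining 30% in half: 15% test and 15% validation (shuffle=True is the default).
trainx, restx, trainy, resty = train_test_split(X,y,test_size=0.30,random_state=777)
testx, valx, testy, valy = train_test_split(restx,resty,test_size=0.50,random_state=777)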

print('trainx shape:',trainx.shape)
print('trainy shape:',trainy.shape)
print('valx shape:',valx.shape)
print('valy shape:',valy.shape)
print('testx shape:',testx.shape)
print('testy shape:',testy.shape)

trainx shape: (2613,)
trainy shape: (2613,)
valx shape: (560,)
valy shape: (560,)
testx shape: (560,)
testy shape: (560,)
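Because the classes are heavily imbalanced (2461 of the 3733 reviews are `Rating`), a stratified split keeps the class proportions similar across the three sets. A sketch on synthetic labels, assuming the `stratify` argument is acceptable for the assignment; the toy arrays are placeholders, not the review data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 samples with an imbalanced 4-class label vector
X_toy = np.arange(200)
y_toy = np.repeat([0, 1, 2, 3], [20, 16, 132, 32])

# 70% train, 30% held out, preserving class proportions
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=777, shuffle=True, stratify=y_toy)
# Split the held-out part in half: 15% test, 15% validation
X_te, X_va, y_te, y_va = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=777, shuffle=True, stratify=y_rest)

for name, labels in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(name, len(labels), np.bincount(labels, minlength=4))
```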


### Balancing the data set

In [254]:
from imblearn.under_sampling import NearMiss
from collections import Counter

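# NearMiss is distance-based and expects a numeric 2-D feature matrix, so calling it on the
# raw review strings in trainx is expected to raise an error; it should be applied to
# vectorized features (e.g. the bag-of-words matrix) instead.
undersample = NearMiss(version=1, n_neighbors=3)
# transform the dataset
trainx, trainy = undersample.fit_resample(trainx, trainy)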
counter = Counter(trainy)
print(counter)

Rating            2461
UserExperience     607
Bug                370
Feature            295
Name: label, dtype: int64
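Distance-based samplers such as NearMiss need numeric features, so they cannot be applied to raw strings. As a simpler stand-in that does work on text, the sketch below performs plain random undersampling with NumPy. This is a swapped-in technique, not NearMiss, and the function name and toy data are hypothetical:

```python
from collections import Counter
import numpy as np

def random_undersample(X, y, seed=777):
    """Randomly keep as many samples of each class as the rarest class has."""
    rng = np.random.RandomState(seed)
    X, y = np.asarray(X), np.asarray(y)
    n_min = min(Counter(y.tolist()).values())
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in np.unique(y)
    ])
    keep.sort()  # preserve the original sample order
    return X[keep], y[keep]

# Toy training split: class 2 dominates
X_toy = np.array(["good", "nice", "great app", "crashes", "add dark mode", "love it"])
y_toy = np.array([2, 2, 2, 0, 1, 2])
X_bal, y_bal = random_undersample(X_toy, y_toy)
print(Counter(y_bal.tolist()))
```

Resampling should only touch the training split; the validation and test sets keep their natural class distribution so the reported scores stay representative.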

## 2. [15 points] Perform text classification using bag-of-words features
In this part, you can use any classifier of your choice such as logistic regression or neural networks

In [255]:
count_vectorizer = CountVectorizer()
trainx_bow = count_vectorizer.fit_transform(trainx)
valx_bow = count_vectorizer.transform(valx)

In [256]:
bag_logistic = BaggingClassifier(LogisticRegression(C=1.0,max_iter = 10000))
bag_logistic.fit(trainx_bow, trainy)

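# log_loss is the cross-entropy of the predicted probabilities (lower is better); it is not an accuracy.
logistic_pred = bag_logistic.predict_proba(valx_bow)
print('BoW log_loss accuracy :' ,log_loss(valy, logistic_pred))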



BoW log_loss accuracy : 0.8638643545734025
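Log loss alone does not cover the measures requested in the assignment. A sketch of reporting accuracy, per-class precision/recall/F1, macro-F1 and the confusion matrix from hard predictions. The labels below are made up for illustration; in the notebook they would presumably be replaced by `valy` and `bag_logistic.predict(valx_bow)`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score

# LabelEncoder sorts class names alphabetically, giving this index order
class_names = ["Bug", "Feature", "Rating", "UserExperience"]
y_true = np.array([0, 0, 1, 2, 2, 2, 3, 3])   # made-up ground truth
y_pred = np.array([0, 2, 1, 2, 2, 3, 3, 3])   # made-up predictions

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class

print(f"accuracy={acc:.3f}  macro-F1={macro_f1:.3f}")
print(classification_report(y_true, y_pred, target_names=class_names))
print(cm)
```

Macro-averaged F1 weights all four classes equally, which is why it is more informative than accuracy when one class dominates.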


## 3. [15 points] Perform text classification using tf-idf features
In this part, you can use any classifier of your choice such as logistic regression or neural networks

In [97]:

tfv = TfidfVectorizer(min_df=3,  max_features=None, 
            strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}',
            ngram_range=(1, 3), use_idf=1,smooth_idf=1,sublinear_tf=1,
            stop_words = 'english')

trainx_tfv = tfv.fit_transform(trainx)
test_tfv =  tfv.transform(testx) 

In [98]:
bag_logistic = BaggingClassifier(LogisticRegression(C=1.0,max_iter = 10000))
bag_logistic.fit(trainx_tfv, trainy)

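# Scored on the test split, whereas the bag-of-words model was scored on the validation split,
# so the two log-loss values are not directly comparable.
logistic_pred = bag_logistic.predict_proba(test_tfv)
print('tf-idf log_loss accuracy :' ,log_loss(testy, logistic_pred))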



tf-idf log_loss accuracy : 0.7591844526629771
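One way to keep the vectorizer and classifier together, so the vocabulary is always fitted on training text only, is a scikit-learn `Pipeline`. A sketch on a toy corpus; `class_weight='balanced'` is an assumed alternative way to handle imbalance, not something used in the notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data (placeholders for the cleaned reviews and their labels)
train_texts = ["app crashes on start", "crash after update", "please add dark mode",
               "add export feature", "great app", "love it", "good", "nice app"]
train_labels = ["Bug", "Bug", "Feature", "Feature", "Rating", "Rating", "Rating", "Rating"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2))),
    ("logreg", LogisticRegression(C=1.0, max_iter=10000, class_weight="balanced")),
])
clf.fit(train_texts, train_labels)

# Class probabilities for an unseen review; columns follow clf.classes_
proba = clf.predict_proba(["app crash again"])
print(dict(zip(clf.classes_, proba[0].round(3))))
```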


## 4. [20 points] Perform text classification using dense vectors like word2vec or GloVe embeddings
In this part, you can use any classifier of your choice such as logistic regression or neural networks. You can download and use precomputed embeddings or create your own word2vec style embeddings using libraries such as ```gensim``` (from gensim.models import Word2Vec)

In [111]:
def create_corpus_new(df):
    # Tokenize each cleaned review into a list of lowercase tokens
    corpus=[]
    for review in tqdm(df['cleaned']):
        words=[word.lower() for word in word_tokenize(review)]
        corpus.append(words)
    return corpus 
corpus=create_corpus_new(data)


100%|██████████| 3733/3733 [00:00<00:00, 10905.84it/s]


In [114]:
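embedding_dict={}
# GloVe files are UTF-8 text: one token per line followed by its vector components.
# The with-block closes the file automatically.
with open('glove.6B.100d.txt','r',encoding='utf-8') as f:
    for line in f:
        values=line.split()
        word = values[0]
        vectors=np.asarray(values[1:],'float32')
        embedding_dict[word]=vectors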

In [124]:
MAX_LEN=50
tokenizer_obj=Tokenizer()
tokenizer_obj.fit_on_texts(corpus)
sequences=tokenizer_obj.texts_to_sequences(corpus)

text_pad=pad_sequences(sequences,maxlen=MAX_LEN,truncating='post',padding='post')

In [125]:
word_index=tokenizer_obj.word_index
print('Number of unique words:',len(word_index))

Number of unique words: 4599


In [126]:
num_words=len(word_index)+1
embedding_matrix=np.zeros((num_words,100))

for word,i in tqdm(word_index.items()):
    if i < num_words:
        emb_vec=embedding_dict.get(word)
        if emb_vec is not None:
            embedding_matrix[i]=emb_vec 

100%|██████████| 4599/4599 [00:00<00:00, 706915.53it/s]


In [148]:
# Every padded sequence belongs to the labelled dataset, so the second slice is empty.
train_glove =text_pad[:text.shape[0]]
test=text_pad[text.shape[0]:]

In [149]:
X_train,X_val,y_train,y_val=train_test_split(train_glove,data['label'].values,test_size=0.15,random_state=777)
print('Shape of train',X_train.shape)
print("Shape of Validation ",X_val.shape)
print(test)

Shape of train (3173, 50)
Shape of Validation  (560, 50)
[]


In [162]:
bag_logistic = BaggingClassifier(LogisticRegression(C=1.0,max_iter = 10000))
bag_logistic.fit(X_train,y_train)

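# Note: X_train/X_val hold padded token ids; embedding_matrix is not used in this cell,
# so the classifier is not yet trained on GloVe vectors.
logistic_pred = bag_logistic.predict_proba(X_val)
print('Glove log_loss accuracy :' ,log_loss(y_val, logistic_pred))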


tf-idf log_loss accuracy : 0.9844802437020294
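A common way to feed pretrained word vectors to a linear classifier is to average the vectors of the tokens in each document. A sketch with a tiny hand-made embedding table standing in for `embedding_dict`; the 3-dimensional vectors and the helper name `mean_vector` are illustrative assumptions:

```python
import numpy as np

# Tiny stand-in for the GloVe dictionary (the real vectors have 100 dimensions)
toy_embeddings = {
    "app":   np.array([1.0, 0.0, 0.0], dtype="float32"),
    "crash": np.array([0.0, 1.0, 0.0], dtype="float32"),
    "great": np.array([0.0, 0.0, 1.0], dtype="float32"),
}

def mean_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if no token is known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim, dtype="float32")

# Tokenized documents, including out-of-vocabulary words
docs = [["app", "crash"], ["great", "app", "unknownword"], ["zzz"]]
features = np.vstack([mean_vector(d, toy_embeddings, dim=3) for d in docs])
print(features)
```

The resulting matrix has one dense row per review and can be passed to a classifier in place of the padded token ids.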


## 5. [20 points] Perform text classification using learnt embeddings
Here you should use RNNs. Here you need to start with random embedding vectors that will be learnt together with the main task.

In [213]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences


def tokenization(text):
    num_words = 1000  # keep only the 1000 most frequent words
    pad_type = 'post'
    tokenizer = Tokenizer(num_words=num_words)
    tokenizer.fit_on_texts(text)

    word_index = tokenizer.word_index

    # Convert each text passed in to a sequence of word indices
    train_sequences = tokenizer.texts_to_sequences(text)

    # Pad every sequence with trailing zeros to the length of the longest one
    maxlen = max([len(x) for x in train_sequences])
    train_padded = pad_sequences(train_sequences, maxlen=maxlen, padding=pad_type)

    print(len(word_index))
    print(train_padded.shape)

    print("\nPadded training sequences:\n", train_padded)
    print("\nPadded training shape:", train_padded.shape)
    print("Training sequences data type:", type(train_sequences))
    print("Padded Training sequences data type:", type(train_padded))
    return maxlen,train_sequences,train_padded

In [236]:
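maxlen, tensor, X = tokenization(data['Review'])
# The targets are the integer-encoded labels, not the token sequences returned as `tensor`
X_train, X_test, y_train, y_test = train_test_split(X, data['label_encoded'].values, test_size=0.30, random_state=777)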

5923
(3733, 230)

Padded training sequences:
 [[  1 396   8 ...   0   0   0]
 [  8  93  28 ...   0   0   0]
 [  2  63 131 ...   0   0   0]
 ...
 [ 48 237 562 ...   0   0   0]
 [  2  13   4 ...   0   0   0]
 [747 215  10 ...   0   0   0]]

Padded training shape: (3733, 230)
Training sequences data type: <class 'list'>
Padded Training sequences data type: <class 'numpy.ndarray'>


In [237]:

model = Sequential()

# Randomly initialised embedding layer, learnt jointly with the classifier.
# input_dim only has to exceed the largest token index (the tokenizer keeps 1000 words).
model.add(Embedding(input_dim=X.shape[0], output_dim=X.shape[1], input_length=maxlen))

model.add(Bidirectional(LSTM(maxlen, return_sequences = True, recurrent_dropout=0.2)))

model.add(GlobalMaxPool1D())
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(maxlen, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(maxlen, activation = "relu"))
model.add(Dropout(0.5))
# One output unit per class; integer labels pair with sparse categorical cross-entropy
model.add(Dense(4, activation = 'softmax'))
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])


model.summary()

Model: "sequential_23"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_23 (Embedding)    (None, 230, 230)          858590    
                                                                 
 bidirectional_1 (Bidirectio  (None, 230, 460)         848240    
 nal)                                                            
                                                                 
 global_max_pooling1d_1 (Glo  (None, 460)              0         
 balMaxPooling1D)                                                
                                                                 
 batch_normalization_1 (Batc  (None, 460)              1840      
 hNormalization)                                                 
                                                                 
 dropout_3 (Dropout)         (None, 460)               0         
                                                     

In [None]:
# checkpoint = ModelCheckpoint('model.h5', monitor = 'val_loss', verbose = 1, save_best_only = True)
# reduce_lr = ReduceLROnPlateau(monitor = 'val_loss', factor = 0.2, verbose = 1, patience = 5,min_lr = 0.001)
# y_train/y_test must be the integer class labels aligned row-by-row with X_train/X_test
history = model.fit(X_train, 
    y_train, 
    epochs = 7,
    batch_size = 32,
    validation_data = (X_test, y_test),
    verbose = 1,
)

## 6. [BONUS: 15 points] Perform text classification using contextual embeddings such as BERT
Here you should use RNNs or transformers

In [243]:
from transformers import BertTokenizer
# Use the tokenizer of the same checkpoint as the TFBertModel loaded below
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def bert_encode(data, maximum_length) :
    input_ids = []
    attention_masks = []

    for text in data:
        encoded = tokenizer.encode_plus(
            text, 
            add_special_tokens=True,
            max_length=maximum_length,
            padding='max_length',   # pad short reviews up to maximum_length
            truncation=True,        # cut long reviews so every row has the same length
            return_attention_mask=True,
        )
        input_ids.append(encoded['input_ids'])
        attention_masks.append(encoded['attention_mask'])
        
    return np.array(input_ids),np.array(attention_masks)

Downloading: 100%|██████████| 226k/226k [00:00<00:00, 466kB/s] 
Downloading: 100%|██████████| 28.0/28.0 [00:00<00:00, 18.6kB/s]
Downloading: 100%|██████████| 571/571 [00:00<00:00, 570kB/s]


In [258]:
texts = data['cleaned']
target = data['label_encoded']

train_input_ids, train_attention_masks = bert_encode(texts,60)



In [259]:
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

def create_model(bert_model):
    
    input_ids = tf.keras.Input(shape=(60,),dtype='int32')
    attention_masks = tf.keras.Input(shape=(60,),dtype='int32')

    output = bert_model([input_ids,attention_masks])
    output = output[1]  # pooled representation of the [CLS] token
    output = tf.keras.layers.Dense(32,activation='relu')(output)
    output = tf.keras.layers.Dropout(0.2)(output)
    # Four review classes with integer-encoded targets: softmax + sparse categorical cross-entropy
    output = tf.keras.layers.Dense(4,activation='softmax')(output)
    
    model = tf.keras.models.Model(inputs = [input_ids,attention_masks],outputs = output)
    model.compile(Adam(learning_rate=1e-5), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

In [260]:
from transformers import TFBertModel
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [261]:
model = create_model(bert_model)
model.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_3 (InputLayer)           [(None, 60)]         0           []                               
                                                                                                  
 input_4 (InputLayer)           [(None, 60)]         0           []                               
                                                                                                  
 tf_bert_model_1 (TFBertModel)  TFBaseModelOutputWi  109482240   ['input_3[0][0]',                
                                thPoolingAndCrossAt               'input_4[0][0]']                
                                tentions(last_hidde                                               
                                n_state=(None, 60,                                          



In [None]:
history = model.fit(
    [train_input_ids, train_attention_masks],
    target,
    validation_split=0.2, 
    epochs=3,
    batch_size=10
)

## [10 points] Document your conclusions on:
- General conclusions about the task
- Related to text preprocessing 
- Related to using different features/embeddings
- Effect of different hyperparameters

Write a bullet-point report, assuming you are presenting your conclusions to the project manager.

#### Error analysis and possible improvements 
* The task was quite challenging and interesting at the same time, as it allowed us to explore and implement different NLP techniques.
* We applied different techniques in the preprocessing phase, such as removing punctuation, stemming, lemmatization, and balancing the dataset, and these techniques helped improve the classifiers' accuracy.
* A surprising result is that the model using the bag-of-words approach gave us the highest accuracy score.
* By using the BaggingClassifier ensemble technique (bagging, not boosting), we were able to achieve 87% accuracy.
* The dataset was imbalanced; we dealt with this by making all the classes equal in size, which improved our models' accuracy by 3%.


