# Sentiment analysis using ktrain
In this section, we learn how to perform sentiment analysis with ktrain, using the Amazon product reviews dataset. The dataset can be downloaded from http://jmcauley.ucsd.edu/data/amazon/.

The site offers the complete review data as well as smaller subsets. In this exercise, we will use the subset containing the reviews of digital music, which can be downloaded from http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Digital_Music_5.json.gz. The downloaded file is gzip-compressed, so after downloading we uncompress it to get the reviews in JSON format.
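The uncompress step can be done with Python's standard library. The following is a minimal sketch; the helper name `gunzip_file` is made up for illustration and is not part of ktrain or the dataset tooling.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src_path, dst_path):
    """Decompress the gzip file at src_path into dst_path and return dst_path."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Copy in chunks so the whole archive is never held in memory at once.
        shutil.copyfileobj(src, dst)
    return Path(dst_path)
```

Assuming the archive has been saved in the working directory, calling `gunzip_file("reviews_Digital_Music_5.json.gz", "reviews_Digital_Music_5.json")` would produce the JSON file used in the rest of this section.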

Now, let us get started! First, install ktrain and import the necessary libraries:

In [1]:
%%capture
!pip install ktrain==0.25.3

In [2]:
import ktrain
from ktrain import text
import pandas as pd

Next, download and load the dataset:

In [3]:
!gdown https://drive.google.com/uc?id=1-8urBLVtFuuvAVHi0s000e7r0KPUgt9f

Downloading...
From: https://drive.google.com/uc?id=1-8urBLVtFuuvAVHi0s000e7r0KPUgt9f
To: /content/reviews_Digital_Music_5.json
89.0MB [00:00, 113MB/s] 


In [4]:
df = pd.read_json(r'reviews_Digital_Music_5.json',lines=True)
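The `lines=True` argument tells pandas that the file holds one JSON object per line rather than a single JSON document. As a rough illustration of that layout, the sketch below parses two made-up records (not taken from the dataset) with the standard `json` module.

```python
import io
import json

# Two made-up records in the same shape as the review file:
# each line is a complete JSON object on its own.
sample = io.StringIO(
    '{"reviewText": "Great album", "overall": 5}\n'
    '{"reviewText": "Not for me", "overall": 2}\n'
)

# Parse every non-empty line independently.
records = [json.loads(line) for line in sample if line.strip()]
print(records[1]["overall"])  # 2
```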


Let us have a look at a few rows of our dataset: 

In [5]:
df.head()

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A3EBHHCZO6V2A4 | 5555991584 | Amaranth "music fan" | [3, 3] | It's hard to believe "Memory of Trees" came ou... | 5 | Enya's last great album | 1158019200 | 09 12, 2006 |
| 1 | AZPWAXJG9OJXV | 5555991584 | bethtexas | [0, 0] | A clasically-styled and introverted album, Mem... | 5 | Enya at her most elegant | 991526400 | 06 3, 2001 |
| 2 | A38IRL0X2T4DPF | 5555991584 | bob turnley | [2, 2] | I never thought Enya would reach the sublime h... | 5 | The best so far | 1058140800 | 07 14, 2003 |
| 3 | A22IK3I6U76GX0 | 5555991584 | Calle | [1, 1] | This is the third review of an irish album I w... | 5 | Ireland produces good music. | 957312000 | 05 3, 2000 |
| 4 | A1AISPOIIHTHXX | 5555991584 | Cloud "..." | [1, 1] | Enya, despite being a successful recording art... | 4 | 4.5; music to dream to | 1200528000 | 01 17, 2008 |



We only need the review text and the overall rating, so let us keep only the reviewText and overall columns, as shown below:


In [6]:
df = df[['reviewText','overall']]

In [7]:
df.head()

| | reviewText | overall |
|---|---|---|
| 0 | It's hard to believe "Memory of Trees" came ou... | 5 |
| 1 | A clasically-styled and introverted album, Mem... | 5 |
| 2 | I never thought Enya would reach the sublime h... | 5 |
| 3 | This is the third review of an irish album I w... | 5 |
| 4 | Enya, despite being a successful recording art... | 4 |



The ratings range from 1 to 5. Let us convert them to sentiment labels by mapping ratings 1 to 3 to the negative class and ratings 4 and 5 to the positive class:


In [8]:
sentiment = {1:'negative',2:'negative',3:'negative',
             4:'positive',5:'positive'}

In [9]:

df['sentiment'] = df['overall'].map(sentiment)
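After a mapping like this, it can be worth checking that every rating was covered and how the two classes are balanced. The sketch below uses a small made-up frame (`toy`), not the real dataset, and deliberately includes a rating of 0 that the dictionary does not cover.

```python
import pandas as pd

sentiment = {1: "negative", 2: "negative", 3: "negative",
             4: "positive", 5: "positive"}

# Toy ratings, made up for illustration; 0 is deliberately outside the mapping.
toy = pd.DataFrame({"overall": [5, 4, 3, 1, 0]})
toy["sentiment"] = toy["overall"].map(sentiment)

# Ratings missing from the dictionary become NaN, so count them explicitly.
n_unmapped = int(toy["sentiment"].isna().sum())

# value_counts() skips NaN by default and shows the class balance.
class_counts = toy["sentiment"].value_counts().to_dict()
print(n_unmapped, class_counts)
```

On the real data, the same two checks would be `df['sentiment'].isna().sum()` and `df['sentiment'].value_counts()`.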


Now, let us keep only the reviewText and sentiment columns, as shown below:

In [10]:
df = df[['reviewText','sentiment']]

In [11]:
df.head()

| | reviewText | sentiment |
|---|---|---|
| 0 | It's hard to believe "Memory of Trees" came ou... | positive |
| 1 | A clasically-styled and introverted album, Mem... | positive |
| 2 | I never thought Enya would reach the sublime h... | positive |
| 3 | This is the third review of an irish album I w... | positive |
| 4 | Enya, despite being a successful recording art... | positive |




From the above result, we can see that each review text now has a corresponding sentiment label.

The next step is to create the train and test sets. If our data is in a pandas DataFrame, we can use the texts_from_df function; if it is stored in folders, we can use the texts_from_folder function.

Since our dataset is in a pandas DataFrame, we will use the texts_from_df function. Its arguments include the following:

- train_df - the DataFrame containing the reviews and their sentiment
- text_column - the name of the column containing the reviews
- label_columns - a list with the name of the column (or columns) containing the label
- maxlen - the maximum length of a review
- max_features - the maximum number of words we use in the vocabulary
- preprocess_mode - how the text is preprocessed. For normal tokenization, we set preprocess_mode to standard; to tokenize the text as BERT does, we set preprocess_mode to bert
- val_pct - the fraction of the data held out as the test (validation) set
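To make maxlen and max_features more concrete, here is a simplified sketch of what a vocabulary cap and a length cap do to tokenized text. It illustrates the general idea only and is not ktrain's implementation; in particular, BERT preprocessing uses its own WordPiece vocabulary. The helper names and the use of id 0 for both padding and unknown words are assumptions made for this example.

```python
from collections import Counter


def build_vocab(tokenized_texts, max_features):
    """Keep only the max_features most frequent tokens; ids start at 1."""
    counts = Counter(tok for toks in tokenized_texts for tok in toks)
    most_common = counts.most_common(max_features)
    return {tok: idx for idx, (tok, _) in enumerate(most_common, start=1)}


def encode(tokens, vocab, maxlen):
    """Map tokens to ids (0 = unknown), truncate to maxlen, right-pad with 0."""
    ids = [vocab.get(tok, 0) for tok in tokens][:maxlen]
    return ids + [0] * (maxlen - len(ids))


# Made-up, already tokenized reviews.
texts = [["i", "loved", "the", "song"], ["the", "song", "was", "dull"]]
vocab = build_vocab(texts, max_features=3)
encoded = [encode(toks, vocab, maxlen=5) for toks in texts]
print(encoded)
```

Every encoded review has exactly maxlen entries, and any word outside the top max_features words is mapped to the unknown id.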


In this exercise, we will set maxlen to 100 and max_features to 100000. We use bert as the preprocess_mode since we are going to use the BERT model for classification:


In [12]:
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_df(train_df = df, 
                                                                   text_column = 'reviewText',
                                                                   label_columns=['sentiment'],
                                                                   maxlen=100, 
                                                                   max_features=100000,
                                                                   preprocess_mode='bert',
                                                                   val_pct=0.1)

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en
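The call above passes val_pct=0.1, so 10% of the rows are held out for evaluation and the rest are used for training. A rough sketch of such a holdout split is shown below; `holdout_split` and its fixed seed are hypothetical and only illustrate the idea, not how ktrain performs the split internally.

```python
import random


def holdout_split(rows, val_pct=0.1, seed=42):
    """Shuffle rows and hold out a val_pct fraction for validation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_val = int(len(rows) * val_pct)
    return rows[n_val:], rows[:n_val]


# 1000 made-up row ids stand in for the reviews.
train_rows, val_rows = holdout_split(range(1000), val_pct=0.1)
print(len(train_rows), len(val_rows))  # 900 100
```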



ktrain provides a diverse set of text classifiers, ranging from logistic regression and bidirectional GRU to the BERT model. In this tutorial, we will use the BERT model.



Now, let's define the model using the text_classifier function, which builds and returns a classifier. The following are the important arguments to the function:


- name - the name of the model we want to use; in this case, we will use bert
- train_data - a tuple containing our training data, which is (x_train, y_train)
- preproc - the instance of our preprocessor
- metrics - the metrics with which we want to assess the performance of our model; in this example, we will use accuracy


In [13]:
model = text.text_classifier(name='bert', train_data = (x_train, y_train) , preproc=preproc, metrics=['accuracy'])

Is Multi-Label? False
maxlen is 100
done.



Next, we create an instance called learner, which is used to train our model. We use the get_learner function to create it. The following are the important arguments to the function:

- model - the model we defined in the previous step
- train_data - a tuple containing our training data
- val_data - a tuple containing our test data
- batch_size - the batch size we want to use
- use_multiprocessing - a boolean value indicating whether we want to use multiprocessing


In [14]:
learner = ktrain.get_learner(model = model, 
                             train_data=(x_train, y_train), 
                             val_data=(x_test, y_test), 
                             batch_size=32, 
                             use_multiprocessing = True)
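The batch size determines how many weight updates happen per epoch: each update consumes one batch, and the last batch may be smaller than the rest. A small worked example, with a made-up sample count rather than the size of the real training set:

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


# 1000 samples is a made-up figure for illustration.
print(steps_per_epoch(1000, 32))  # 32
```

Here 31 full batches cover 992 samples, and a final batch of 8 covers the remainder.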


Now, we can finally train the model using the fit_onecycle function. The following are the important arguments to the function: 

    
- lr - the learning rate
- epochs - the number of epochs we want to train for
- checkpoint_folder - the directory where we want to store the model weights

In [15]:
learner.fit_onecycle(lr=2e-5, epochs=1,checkpoint_folder='output')
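The name fit_onecycle refers to the one-cycle learning rate policy, in which the learning rate rises to a peak and then falls again over the course of training. The sketch below shows a simplified triangular version of that idea, treating lr as the peak. It is an illustration under assumptions, not ktrain's code: the function name, the starting ratio `min_ratio`, and the exact shape of the schedule are all made up for this example.

```python
def one_cycle_lr(step, total_steps, max_lr, min_ratio=0.1):
    """Learning rate at a given step of a simplified triangular one-cycle schedule."""
    min_lr = max_lr * min_ratio  # assumed starting and ending rate
    half = total_steps / 2
    if step <= half:
        frac = step / half                  # linear warm-up to the peak
    else:
        frac = (total_steps - step) / half  # linear decay back down
    return min_lr + (max_lr - min_lr) * frac


# 100 steps is a made-up training length; 2e-5 matches the lr used above.
schedule = [one_cycle_lr(step, 100, 2e-5) for step in range(101)]
print(schedule[0], schedule[50], schedule[100])
```

The rate starts low, peaks at the midpoint, and returns to its starting value at the end of training.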


In this example, we train for only one epoch for simplicity. The above code prints the training progress along with the loss and accuracy.

In this run, we obtained 87% accuracy on the test set. That's it. Training a model using ktrain is this simple.
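Accuracy is the fraction of reviews whose predicted sentiment matches the true sentiment. A minimal sketch with made-up labels (not the model's real predictions):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: four of the five predictions are correct.
y_true = ["positive", "negative", "positive", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "positive", "negative"]
print(accuracy(y_true, y_pred))  # 0.8
```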

Now, we can use the trained model to make predictions. We create a predictor with the get_predictor function, passing our trained model and the instance of our preprocessor:


In [None]:
predictor = ktrain.get_predictor(learner.model, preproc)


Next, we can make a prediction with the predict function by passing the text: 

In [None]:
predictor.predict('I loved the song')

The model classifies the given text as positive.
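A classifier of this kind typically scores each class and reports the one with the highest probability. The sketch below illustrates that last step only; the helper `pick_label`, the class order, and the probabilities are made up for this example and are not output from the trained model.

```python
def pick_label(classes, probabilities):
    """Return the class name whose probability is highest."""
    if len(classes) != len(probabilities):
        raise ValueError("classes and probabilities must have the same length")
    best = max(range(len(classes)), key=lambda i: probabilities[i])
    return classes[best]


# Hypothetical class order and scores for a single review.
print(pick_label(["negative", "positive"], [0.08, 0.92]))  # positive
```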