<a href="https://colab.research.google.com/github/Arafat4341/sentiment_analysis_bag_of_words/blob/master/bag_of_words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Download data from Kaggle using the Kaggle API**

In [0]:
!pip install kaggle


In [0]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"arafat4341","key":"<redacted>"}'}

In [0]:
# setting directory
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

#change the permission
!chmod 600 ~/.kaggle/kaggle.json

In [0]:
!kaggle competitions download -c word2vec-nlp-tutorial

Downloading sampleSubmission.csv to /content
  0% 0.00/276k [00:00<?, ?B/s]
100% 276k/276k [00:00<00:00, 37.6MB/s]
Downloading unlabeledTrainData.tsv.zip to /content
 65% 17.0M/26.0M [00:01<00:01, 9.34MB/s]
100% 26.0M/26.0M [00:01<00:00, 15.9MB/s]
Downloading testData.tsv.zip to /content
 40% 5.00M/12.6M [00:00<00:00, 8.99MB/s]
100% 12.6M/12.6M [00:00<00:00, 19.9MB/s]
Downloading labeledTrainData.tsv.zip to /content
 39% 5.00M/13.0M [00:00<00:00, 9.22MB/s]
100% 13.0M/13.0M [00:00<00:00, 20.4MB/s]


**Reading the data**

In [0]:
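# unzipping the data
from zipfile import ZipFile
file1 = 'labeledTrainData.tsv.zip'
file2 = 'unlabeledTrainData.tsv.zip'
file3 = 'testData.tsv.zip'

# extract every archive; testData.tsv is read later, so it must be unzipped too
for path in (file1, file2, file3):
  with ZipFile(path, 'r') as archive:  # "archive" avoids shadowing the built-in zip()
    archive.extractall()
print('Done')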

Done


In [0]:
import pandas as pd

# "header=0" indicates that the first line of the file contains column names,
# "delimiter=\t" indicates that the fields are separated by tabs,
# and quoting=3 (csv.QUOTE_NONE) tells pandas to treat quote characters as ordinary text

train = pd.read_csv('labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)

train['review'][:10]


0    "With all this stuff going down at the moment ...
1    "\"The Classic War of the Worlds\" by Timothy ...
2    "The film starts with a manager (Nicholas Bell...
3    "It must be assumed that those who praised thi...
4    "Superbly trashy and wondrously unpretentious ...
5    "I dont know why people think this is such a b...
6    "This movie could have been very good, but com...
7    "I watched this video at a friend's house. I'm...
8    "A friend of mine bought this film for £1, and...
9    "<br /><br />This movie is full of references....
Name: review, dtype: object
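The leading `"` on every review in the output above is a consequence of `quoting=3`. As a minimal sketch of that behaviour, the standard-library `csv` module (whose `QUOTE_*` constants pandas accepts) can parse one made-up row both ways. The sample row and its `x_1` id are hypothetical, not taken from the dataset.

```python
import csv
import io

# Hypothetical row in the same tab-separated layout as labeledTrainData.tsv
sample = 'id\tsentiment\treview\n"x_1"\t1\t"A \\"classic\\" film"\n'

# quoting=3 is csv.QUOTE_NONE: quote characters are kept as ordinary text
raw_rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE))

# The default (csv.QUOTE_MINIMAL) interprets the surrounding quotes and strips them
default_rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

print(csv.QUOTE_NONE)      # the integer behind quoting=3
print(raw_rows[1])         # quotes and backslashes preserved
print(default_rows[1][0])  # surrounding quotes removed from the id field
```

Keeping the quotes at read time means the embedded `\"` sequences cannot confuse the parser; the cleaning step later removes all non-letter characters anyway.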

**Cleaning and pre-processing the data**

In [0]:
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
from bs4 import BeautifulSoup as bs
import re
from nltk.corpus import stopwords

def review_to_words(raw_review):
    # strip HTML tags; naming the parser avoids BeautifulSoup's "no parser specified" warning
    review_text = bs(raw_review, "html.parser").get_text()

    # replace every non-letter character (punctuation, digits) with a space
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)

    # lowercase, then split into individual words
    words = letters_only.lower().split()

    # a set gives faster membership tests than a list
    stops = set(stopwords.words("english"))

    # drop stop words
    meaningful_words = [w for w in words if w not in stops]

    # join the remaining words back into one space-separated string
    return " ".join(meaningful_words)
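To see what these steps do to a single review without downloading anything, here is a rough stand-in for `review_to_words`. It is only a sketch: the regex tag stripping replaces BeautifulSoup, `DEMO_STOPS` is a hypothetical handful of stop words rather than NLTK's English list, and the sample review is made up.

```python
import re

# Hypothetical stop-word subset; the notebook uses NLTK's full English list
DEMO_STOPS = {"this", "is", "a", "the", "and", "it"}

def clean_demo(raw_review):
    # crude tag stripping stands in for BeautifulSoup's get_text()
    text = re.sub(r"<[^>]+>", " ", raw_review)
    # same non-letter replacement as review_to_words
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    # lowercase and split into words
    words = letters_only.lower().split()
    # drop stop words and rejoin
    return " ".join(w for w in words if w not in DEMO_STOPS)

cleaned = clean_demo('"<br /><br />This movie is full of references, and it cost £1!"')
print(cleaned)  # movie full of references cost
```

Note that digits and currency symbols disappear along with punctuation, so "£1" leaves no trace in the cleaned text.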

In [0]:
clean_train_reviews = []
for i in range(len(train)):
  clean_train_reviews.append(review_to_words(train['review'][i]))

In [0]:
len(clean_train_reviews)

25000

**Creating features from a Bag of Words (using scikit-learn)**

Now that we have our training reviews tidied up, how do we convert them to some kind of numeric representation for machine learning? One common approach is called a Bag of Words. The Bag of Words model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears. For example, consider the following two sentences:

Sentence 1: "The cat sat on the hat"

Sentence 2: "The dog ate the cat and the hat"

From these two sentences, our vocabulary is as follows:

{ the, cat, sat, on, hat, dog, ate, and }

To get our bags of words, we count the number of times each word occurs in each sentence. In Sentence 1, "the" appears twice, and "cat", "sat", "on", and "hat" each appear once, so the feature vector for Sentence 1 is:

{ the, cat, sat, on, hat, dog, ate, and }

Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }

Similarly, the features for Sentence 2 are: { 3, 1, 0, 0, 1, 1, 1, 1}
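The two feature vectors above can be reproduced in a few lines of plain Python. This is a sketch of the counting idea only, not of how `CountVectorizer` is implemented; the helper name `bag_of_words` is made up for illustration.

```python
from collections import Counter

sentences = ["The cat sat on the hat", "The dog ate the cat and the hat"]

# Vocabulary in order of first appearance: the, cat, sat, on, hat, dog, ate, and
vocabulary = list(dict.fromkeys(
    word for sentence in sentences for word in sentence.lower().split()
))

def bag_of_words(sentence, vocab):
    # count each word in the sentence, then read the counts off in vocabulary order
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(s, vocabulary) for s in sentences]
print(vocabulary)
print(vectors[0])  # [2, 1, 1, 1, 1, 0, 0, 0]
print(vectors[1])  # [3, 1, 0, 0, 1, 1, 1, 1]
```

Word order within a sentence is discarded; only the counts survive, which is where the "bag" in the name comes from.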

In [0]:
# creating bag of words
# "CountVectorizer" Converts a collection of text documents to a matrix of token counts

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="word", # feature should be made of words
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

# CountVectorizer comes with its own options to automatically do preprocessing,
# tokenization, and stop word removal -- for each of these,
# instead of specifying "None", we could have used a built-in method or specified our own function to use. 

# fit_transform() does two things: first, it fits the model and learns the vocabulary;
# second, it transforms our training data into feature vectors. The input to fit_transform should be a list of strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
print(vectorizer.get_feature_names_out())



In [0]:
train_data_features = train_data_features.toarray() # converted to numpy arrays
print(train_data_features.shape)

(25000, 5000)
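The second dimension of that shape comes from `max_features=5000`, which keeps only the most frequent terms across the whole corpus. A rough sketch of the idea on a three-document toy corpus with a limit of 2 follows; the documents are made up, and scikit-learn's internal tie-breaking may differ from `Counter.most_common`.

```python
from collections import Counter

# Hypothetical cleaned reviews
docs = ["movie great great acting", "movie bad plot", "great plot movie"]
max_features = 2

# Count every term across the corpus and keep the most frequent ones
term_counts = Counter(word for doc in docs for word in doc.split())
top_terms = sorted(term for term, _ in term_counts.most_common(max_features))

# One row per document, one column per retained term
matrix = [[doc.split().count(term) for term in top_terms] for doc in docs]

print(top_terms)                     # ['great', 'movie']
print(matrix)                        # [[2, 1], [0, 1], [1, 1]]
print((len(matrix), len(matrix[0]))) # (3, 2)
```

Terms outside the retained vocabulary ("acting", "bad", "plot" here) are simply ignored when building the rows, which is also what happens to unseen words when the fitted vectorizer transforms the test set.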


**Model implementation**

In [0]:
from sklearn.ensemble import RandomForestClassifier as rf

forest = rf(n_estimators = 100)

forest.fit(train_data_features, train['sentiment'])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

**Prediction and submission**

In [0]:
test = pd.read_csv('testData.tsv', header=0, delimiter='\t', quoting=3)

# test.shape

clean_test_reviews = []

for i in range(len(test['review'])):
  clean_test_reviews.append(review_to_words(test['review'][i]))
  
# Get a bag of words for the test set, and convert to a numpy array
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()

# predict sentiment labels for the test reviews
result = forest.predict(test_data_features)

output = pd.DataFrame(data = {'id':test['id'], 'sentiment':result})

output.to_csv("Bag_of_Words.csv", index=False, quoting=3)

In [0]:
# submission via api

!kaggle competitions submit -c word2vec-nlp-tutorial -f Bag_of_Words.csv -m "Message"

100% 276k/276k [00:07<00:00, 36.3kB/s]
Successfully submitted to Bag of Words Meets Bags of Popcorn