# Training your own classifier

## Preprocessing

In [1]:
import re
import pandas as pd
import nltk

In [2]:
data = pd.read_csv('train.csv')

In [3]:
data.head(1)

Unnamed: 0,textID,text,sentiment
0,cb774db0d1,"I`d have responded, if I were going",neutral


In [4]:
data.dropna(subset=['text'], inplace=True)

## Remove punctuation

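Punctuation carries little sentiment information for a bag-of-words model, so it is removed to simplify training. The pattern below deletes every character that is neither a word character nor whitespace, so contractions collapse into a single token (for example, "I`d" becomes "Id").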

In [5]:
data.text = data.text.apply(lambda x: re.sub(r'[^\w\s]', '', x) )
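
To make the effect of this pattern concrete, here is a minimal sketch that applies the same regular expression to the sample row shown earlier. The helper name `strip_punctuation` is introduced only for illustration and is not part of the notebook.

```python
import re

# Same pattern as the cell above: drop every character that is neither a
# word character (letter, digit, underscore) nor whitespace.
PUNCTUATION = re.compile(r"[^\w\s]")


def strip_punctuation(text: str) -> str:
    """Return text with all punctuation characters deleted."""
    return PUNCTUATION.sub("", text)


sample = "I`d have responded, if I were going"
print(strip_punctuation(sample))  # Id have responded if I were going
```

Note that characters are deleted rather than replaced with a space, so "don't" becomes "dont" instead of "don t".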

## Remove stopwords

Stopwords are very common English words (such as "the" or "is") that carry little meaning on their own and can be skipped in sentiment analysis. Removing them shrinks the vocabulary and speeds up training.

In [6]:
from nltk.corpus import stopwords
# Rebinds the name to a set of words; a set makes the membership test below fast.
stopwords = set(stopwords.words('english'))
data.text = data.text.apply(lambda x: ' '.join([word for word in nltk.word_tokenize(x) if word.lower() not in stopwords]))
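
One caveat worth illustrating: NLTK's English stop list includes negation words such as "not" and "no", and dropping them can change the apparent polarity of a sentence. The sketch below uses a small hand-picked stop set (a hypothetical stand-in for the NLTK list) and a hypothetical `keep` parameter to show the difference. It is an optional variation, not what the notebook does.

```python
# Hypothetical, hand-picked subset standing in for stopwords.words('english').
STOP_WORDS = {"i", "a", "the", "is", "was", "this", "not", "no"}
NEGATIONS = frozenset({"not", "no"})


def remove_stopwords(text, stop_words=STOP_WORDS, keep=frozenset()):
    """Drop stopwords from whitespace-separated text, except words listed in keep."""
    kept = [
        word for word in text.split()
        if word.lower() not in stop_words or word.lower() in keep
    ]
    return " ".join(kept)


print(remove_stopwords("This is not good"))                  # good
print(remove_stopwords("This is not good", keep=NEGATIONS))  # not good
```

Whether keeping negations helps a Naive Bayes model on this dataset would need to be checked on held-out data.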

## Remove links

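A link's address says little about the content it points to, so links are not useful for sentiment analysis and are removed. Because punctuation was stripped in an earlier step, each URL has already collapsed into a single token beginning with http, which the pattern below still matches.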

In [7]:
data.text = data.text.apply(lambda x: re.sub(r'\(?http\S+', '', x))
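
The following sketch traces a made-up sentence through the punctuation step and then the link step, in the order used in this notebook. The sample text and the example.com addresses are invented for illustration. It also shows a limitation: addresses without an http prefix are not matched.

```python
import re

PUNCTUATION = re.compile(r"[^\w\s]")  # pattern from the punctuation step
LINK = re.compile(r"\(?http\S+")      # pattern from the cell above

sample = "Guide at https://example.com/setup or www.example.com"

# Step 1: punctuation removal glues each URL into one token.
no_punct = PUNCTUATION.sub("", sample)
print(no_punct)  # Guide at httpsexamplecomsetup or wwwexamplecom

# Step 2: the link pattern removes the token that starts with "http",
# but the scheme-less address survives as a junk token.
cleaned = " ".join(LINK.sub("", no_punct).split())
print(cleaned)  # Guide at or wwwexamplecom
```

The pattern is not anchored to a word boundary, so it also deletes from "http" onward inside any other token that happens to contain those letters.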

## Lemmatization

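Lemmatization was chosen over stemming because it is less crude: it maps each word to a dictionary form instead of chopping off suffixes. Note that WordNetLemmatizer.lemmatize treats every word as a noun unless a part-of-speech tag is passed, so verbs and adjectives are mostly left unchanged here.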

In [8]:
lemmatizer = nltk.WordNetLemmatizer()
data.text = data.text.apply(lambda x: ' '.join(
    [lemmatizer.lemmatize(word) for word in nltk.word_tokenize(x)]
))
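
To illustrate why stemming is described as cruder, here is a toy comparison. Both functions are deliberately simplified stand-ins: `toy_stem` is not the Porter algorithm, and `TOY_LEMMAS` is a hypothetical three-word dictionary rather than WordNet.

```python
def toy_stem(word):
    """Crude suffix stripping in the spirit of a stemmer (not Porter's algorithm)."""
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-dictionary standing in for WordNet.
TOY_LEMMAS = {"studies": "study", "services": "service", "geese": "goose"}


def toy_lemmatize(word):
    """Look the word up in a dictionary of known forms; leave unknown words alone."""
    return TOY_LEMMAS.get(word, word)


for word in ("studies", "services", "geese"):
    print(word, toy_stem(word), toy_lemmatize(word))
# studies stud study
# services servic service
# geese geese goose
```

The stemmer produces fragments that are not words and misses irregular plurals, while the dictionary lookup returns real words but only for forms it knows.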

## Training a Naive Bayes sentiment classifier

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn import metrics

In [10]:
vectorizer = CountVectorizer()

In [11]:
data.dropna(subset=['text'], inplace=True)
X = vectorizer.fit_transform(data.text)

In [12]:
nb = MultinomialNB()

# Signature: nb.fit(feature_matrix, target_labels)
nb.fit(X, data.sentiment)

MultinomialNB()
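
The model above is fitted on all of train.csv, so the notebook has no estimate of how accurate it is. A minimal sketch of a hold-out check follows, using the `train_test_split` and `metrics` imports that are already present. The tiny corpus is made up and stands in for data.text and data.sentiment. The `demo_` names are introduced here so they do not clash with the notebook's variables.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up corpus standing in for data.text / data.sentiment.
texts = ["love phone", "great camera", "hate battery",
         "terrible screen", "phone arrived", "box contains charger"] * 5
labels = ["positive", "positive", "negative",
          "negative", "neutral", "neutral"] * 5

# Hold out 20% of the rows, keeping the class proportions the same.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)

demo_vectorizer = CountVectorizer()
X_train = demo_vectorizer.fit_transform(train_texts)  # vocabulary from training rows only
X_val = demo_vectorizer.transform(val_texts)          # reuse that vocabulary

demo_model = MultinomialNB().fit(X_train, train_labels)
val_predictions = demo_model.predict(X_val)

accuracy = metrics.accuracy_score(val_labels, val_predictions)
print(accuracy)
print(metrics.classification_report(val_labels, val_predictions))
```

With the real data, the same pattern would give a per-class precision and recall to set beside the comparison with the prebuilt package later on.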

# Advanced Requirement - Sentiment analysis on data extracted from Reddit

In [13]:
test_data = pd.read_csv('Replies.csv')

In [14]:
test_data.dropna(subset=['Reply'], inplace=True)

## Data Cleaning

In [15]:
test_data.Reply = test_data.Reply.apply(lambda x: re.sub(r'[^\w\s]', '', x) )

In [16]:
test_data.Reply = test_data.Reply.apply(lambda x: ' '.join([word for word in nltk.word_tokenize(x) if word.lower() not in stopwords]))

In [17]:
test_data.Reply = test_data.Reply.apply(lambda x: re.sub(r'\(?http\S+', '', x))

## Lemmatization

In [18]:
test_data.Reply = test_data.Reply.apply(lambda x: ' '.join(
    [lemmatizer.lemmatize(word) for word in nltk.word_tokenize(x)]
))
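
The same four cleaning steps are applied to both train.csv and Replies.csv, so they could be collected into one function to keep the two datasets consistent. The sketch below is one possible shape for that. The function name `clean_text` and its parameters are hypothetical, and the tokenizer and lemmatizer are passed in, so the example runs without NLTK data.

```python
import re

PUNCTUATION = re.compile(r"[^\w\s]")
LINK = re.compile(r"\(?http\S+")


def clean_text(text, stop_words, lemmatize=lambda word: word, tokenize=str.split):
    """Apply the notebook's steps in order: punctuation, stopwords, links, lemmas."""
    text = PUNCTUATION.sub("", text)
    tokens = [word for word in tokenize(text) if word.lower() not in stop_words]
    text = LINK.sub("", " ".join(tokens))
    return " ".join(lemmatize(word) for word in tokenize(text))


print(clean_text("The services, see https://x.com/a!", {"the"}))  # services see
```

In the notebook this could be called as `data.text.apply(clean_text, stop_words=stopwords, lemmatize=lemmatizer.lemmatize, tokenize=nltk.word_tokenize)`, and likewise for `test_data.Reply`.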

## Generating sentiments

In [19]:
test_data.dropna(subset=['Reply'], inplace=True)
test_X = vectorizer.transform(test_data.Reply)

In [20]:
predicted = nb.predict(test_X)

In [21]:
prediction_data = test_data

In [22]:
prediction_data = prediction_data.assign(sentiment = predicted)

In [23]:
prediction_data.to_csv('Replies - Advanced Requirement.csv',index=False)

## Comparing pre-built package and trained model

In [24]:
prebuilt_package = pd.read_csv('Replies - Sentiment analysis.csv')

In [25]:
prediction_data.head(20)

Unnamed: 0,Reply,Upvote,Time,Key,sentiment
0,MicroG Services GMS Huawei Devices running HMS,1.0,2020-08-13 15:10:37,0.0,neutral
1,Google Chrome Firefox Focus,11.0,2020-05-29 01:47:12,0.0,neutral
2,use yandex disk Belarus Unlimited storage phot...,8.0,2020-06-01 09:08:23,0.0,neutral
3,Waze work without Google Play service,5.0,2020-05-29 08:39:25,0.0,neutral
4,6 add webapp tube using favorite browser page ...,6.0,2020-05-29 00:40:56,0.0,neutral
5,Google Drive pCloud Gmail move ProtonMail Tuta...,3.0,2020-05-29 00:01:50,0.0,neutral
6,one hell compromise consumer Cant believe US g...,3.0,2020-08-20 10:53:04,0.0,negative
7,got Google mobile service work P40 Pro using t...,4.0,2020-06-03 05:38:53,0.0,neutral
8,Like Huawei Browser sync desktop Huawei cloud ...,2.0,2020-05-29 02:09:17,0.0,neutral
9,Really want Google Pay least Huawei Pay work b...,2.0,2020-05-31 11:10:33,0.0,neutral


In [26]:
prebuilt_package.head(20)

Unnamed: 0,Reply,Upvote,Time,Key,neg,neu,pos,compound,compound_result
0,* * \ [ MicroG Services\ ] GMS for all Huawei ...,1.0,2020-08-13 15:10:37,0.0,0.0,1.0,0.0,0.0,neutral
1,Google Chrome -- -- > Firefox Focus,11.0,2020-05-29 01:47:12,0.0,0.0,1.0,0.0,0.0,neutral
2,I use yandex disk in Belarus . Unlimited stora...,8.0,2020-06-01 09:08:23,0.0,0.052,0.665,0.283,0.8655,positive
3,Waze works without Google Play services ?,5.0,2020-05-29 08:39:25,0.0,0.289,0.711,0.0,-0.2584,negative
4,6. you and add a webapp of you tube my using y...,6.0,2020-05-29 00:40:56,0.0,0.0,0.893,0.107,0.4588,positive
5,Google Drive - pCloud Gmail - move to ProtonMa...,3.0,2020-05-29 00:01:50,0.0,0.0,1.0,0.0,0.0,neutral
6,This is one hell of a compromise for consumers...,3.0,2020-08-20 10:53:04,0.0,0.117,0.758,0.125,0.1165,neutral
7,I got Google mobile services to work on my P40...,4.0,2020-06-03 05:38:53,0.0,0.0,1.0,0.0,0.0,neutral
8,"Like the Huawei Browser , but there is no sync...",2.0,2020-05-29 02:09:17,0.0,0.093,0.884,0.023,-0.6451,negative
9,Really just want Google Pay ! Or at the very l...,2.0,2020-05-31 11:10:33,0.0,0.056,0.779,0.165,0.7547,positive


In the first 20 rows from each method (prebuilt package and trained model), the trained model labels more replies as neutral than the prebuilt package does. The prebuilt package assigns roughly equal numbers of neutral and positive labels, along with a few negative ones.

Interestingly, the only negative label from the trained model (no. 6) is neutral in the prebuilt package. The positive labels from the trained model, however, agree with those of the prebuilt package (no. 16 and 18).
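
A row-by-row comparison like this can also be quantified. The sketch below copies the labels of the ten rows displayed above into two Series and computes the agreement rate and a cross-tabulation. In the notebook itself the two frames would first have to be aligned on the same replies, for example on a shared key column. That step is an assumption and is not shown here.

```python
import pandas as pd

# Labels copied from the ten rows displayed above (rows 0 to 9).
trained = pd.Series(
    ["neutral", "neutral", "neutral", "neutral", "neutral",
     "neutral", "negative", "neutral", "neutral", "neutral"],
    name="trained_model",
)
prebuilt = pd.Series(
    ["neutral", "neutral", "positive", "negative", "positive",
     "neutral", "neutral", "neutral", "negative", "positive"],
    name="prebuilt_package",
)

# Share of rows where both methods give the same label.
agreement = (trained == prebuilt).mean()
print(agreement)  # 0.4

# Rows: trained model labels; columns: prebuilt package labels.
table = pd.crosstab(trained, prebuilt)
print(table)
```

On these ten rows the two methods agree only where both say neutral.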

In [28]:
prebuilt_package.compound_result.value_counts().positive

1612

In [30]:
prebuilt_package.compound_result.value_counts().negative

673

In [29]:
prebuilt_package.compound_result.value_counts().neutral

1020

In [32]:
prediction_data.sentiment.value_counts().positive

567

In [35]:
prediction_data.sentiment.value_counts().negative

406

In [33]:
prediction_data.sentiment.value_counts().neutral

2345

Exact numbers for each sentiment for each method:

Method,Positive,Negative,Neutral
Prebuilt package,1612,673,1020
Trained model,567,406,2345

Overall, the trained model suggests that sentiment about Huawei is mostly neutral (about 71% of replies), while the prebuilt package suggests that it leans positive (about 49% of replies, its largest class).
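
The counts reported above can be turned into shares with a few lines of arithmetic, which makes the two methods easier to compare because their totals differ slightly. This is a small sketch using only the numbers already printed in this section.

```python
# Counts copied from the value_counts() outputs above.
counts = {
    "Prebuilt package": {"positive": 1612, "negative": 673, "neutral": 1020},
    "Trained model": {"positive": 567, "negative": 406, "neutral": 2345},
}

totals = {}
shares = {}
for method, by_label in counts.items():
    totals[method] = sum(by_label.values())
    # Percentage of replies per label, rounded to one decimal place.
    shares[method] = {
        label: round(100 * n / totals[method], 1) for label, n in by_label.items()
    }
    print(method, totals[method], shares[method])
# Prebuilt package 3305 {'positive': 48.8, 'negative': 20.4, 'neutral': 30.9}
# Trained model 3318 {'positive': 17.1, 'negative': 12.2, 'neutral': 70.7}
```

The totals (3305 and 3318) differ by 13 rows, so the two result files do not cover exactly the same set of replies.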