# **Business Problem**

About Dataset

Perform Text Classification on the data. The tweets have been pulled from Twitter and manual tagging has been done then.
The names and usernames have been given codes to avoid any privacy concerns.

Columns:
1) Location

2) Tweet At

3) Original Tweet

4) Label

# **Loading the Data**

In [1]:
import pandas as pd
import numpy as np
import nltk
import warnings
warnings.filterwarnings('ignore')

In [2]:
df_train = pd.read_csv('Corona_NLP_train.csv', encoding='ISO-8859-1')
df_test = pd.read_csv('Corona_NLP_test.csv')

In [3]:
df_train

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative
...,...,...,...,...,...,...
41152,44951,89903,"Wellington City, New Zealand",14-04-2020,Airline pilots offering to stock supermarket s...,Neutral
41153,44952,89904,,14-04-2020,Response to complaint not provided citing COVI...,Extremely Negative
41154,44953,89905,,14-04-2020,You know itÂs getting tough when @KameronWild...,Positive
41155,44954,89906,,14-04-2020,Is it wrong that the smell of hand sanitizer i...,Neutral


In [4]:
df_test

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,1,44953,NYC,02-03-2020,TRENDING: New Yorkers encounter empty supermar...,Extremely Negative
1,2,44954,"Seattle, WA",02-03-2020,When I couldn't find hand sanitizer at Fred Me...,Positive
2,3,44955,,02-03-2020,Find out how you can protect yourself and love...,Extremely Positive
3,4,44956,Chicagoland,02-03-2020,#Panic buying hits #NewYork City as anxious sh...,Negative
4,5,44957,"Melbourne, Victoria",03-03-2020,#toiletpaper #dunnypaper #coronavirus #coronav...,Neutral
...,...,...,...,...,...,...
3793,3794,48746,Israel ??,16-03-2020,Meanwhile In A Supermarket in Israel -- People...,Positive
3794,3795,48747,"Farmington, NM",16-03-2020,Did you panic buy a lot of non-perishable item...,Negative
3795,3796,48748,"Haverford, PA",16-03-2020,Asst Prof of Economics @cconces was on @NBCPhi...,Neutral
3796,3797,48749,,16-03-2020,Gov need to do somethings instead of biar je r...,Extremely Negative


# **Data Exploration**

In [5]:
df_train.shape

(41157, 6)

In [6]:
df_test.shape

(3798, 6)

In [7]:
df_train.dtypes

UserName          int64
ScreenName        int64
Location         object
TweetAt          object
OriginalTweet    object
Sentiment        object
dtype: object

In [8]:
df_test.dtypes

UserName          int64
ScreenName        int64
Location         object
TweetAt          object
OriginalTweet    object
Sentiment        object
dtype: object

In [9]:
for col in df_train.columns.tolist():
  print(col, ': ', df_train[col].nunique())
  print(df_train[col].unique())
  print()

UserName :  41157
[ 3799  3800  3801 ... 44953 44954 44955]

ScreenName :  41157
[48751 48752 48753 ... 89905 89906 89907]

Location :  12220
['London' 'UK' 'Vagabonds' ... 'Juba south sudan' 'OHIO'
 'i love you so much || he/him']

TweetAt :  30
['16-03-2020' '17-03-2020' '18-03-2020' '19-03-2020' '20-03-2020'
 '21-03-2020' '22-03-2020' '23-03-2020' '24-03-2020' '25-03-2020'
 '26-03-2020' '27-03-2020' '28-03-2020' '29-03-2020' '30-03-2020'
 '31-03-2020' '01-04-2020' '02-04-2020' '03-04-2020' '04-04-2020'
 '05-04-2020' '06-04-2020' '07-04-2020' '08-04-2020' '09-04-2020'
 '10-04-2020' '11-04-2020' '12-04-2020' '13-04-2020' '14-04-2020']

OriginalTweet :  41157
['@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8'
 'advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular m

In [10]:
for col in df_train.columns.tolist():
  print(col, ': ', df_train[col].nunique())
  print(df_train[col].unique())
  print()

UserName :  41157
[ 3799  3800  3801 ... 44953 44954 44955]

ScreenName :  41157
[48751 48752 48753 ... 89905 89906 89907]

Location :  12220
['London' 'UK' 'Vagabonds' ... 'Juba south sudan' 'OHIO'
 'i love you so much || he/him']

TweetAt :  30
['16-03-2020' '17-03-2020' '18-03-2020' '19-03-2020' '20-03-2020'
 '21-03-2020' '22-03-2020' '23-03-2020' '24-03-2020' '25-03-2020'
 '26-03-2020' '27-03-2020' '28-03-2020' '29-03-2020' '30-03-2020'
 '31-03-2020' '01-04-2020' '02-04-2020' '03-04-2020' '04-04-2020'
 '05-04-2020' '06-04-2020' '07-04-2020' '08-04-2020' '09-04-2020'
 '10-04-2020' '11-04-2020' '12-04-2020' '13-04-2020' '14-04-2020']

OriginalTweet :  41157
['@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8'
 'advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular m

In [11]:
df_train.isnull().sum()

UserName            0
ScreenName          0
Location         8590
TweetAt             0
OriginalTweet       0
Sentiment           0
dtype: int64

In [12]:
df_test.isnull().sum()

UserName           0
ScreenName         0
Location         834
TweetAt            0
OriginalTweet      0
Sentiment          0
dtype: int64

In [13]:
df_train.duplicated().sum()

np.int64(0)

In [14]:
df_test.duplicated().sum()

np.int64(0)

In [15]:
df_train = df_train.drop(columns = {'UserName', 'ScreenName', 'Location', 'TweetAt'})
df_test = df_test.drop(columns = {'UserName', 'ScreenName', 'Location', 'TweetAt'})

In [16]:
df_train = df_train.sample(20000, random_state = 42)

In [17]:
df_train['Sentiment'].value_counts()

Sentiment
Positive              5587
Negative              4850
Neutral               3716
Extremely Positive    3201
Extremely Negative    2646
Name: count, dtype: int64

In [18]:
X = df_train.drop(columns = {'Sentiment'})
y = df_train['Sentiment']

In [19]:
y.unique()

array(['Neutral', 'Extremely Negative', 'Positive', 'Negative',
       'Extremely Positive'], dtype=object)

In [20]:
from imblearn.under_sampling import RandomUnderSampler

rs = RandomUnderSampler(random_state = 0)
X, y = rs.fit_resample(X, y)

In [21]:
df_train = pd.concat([X, y], axis = 1)

In [22]:
df_train

Unnamed: 0,OriginalTweet,Sentiment
35564,Rice &amp; wheat prices surge amid fears Covid...,Extremely Negative
8568,"As a carer for elderly parents, will you selfi...",Extremely Negative
28089,#Coronavirus' impact on the global food supply...,Extremely Negative
28198,@9NewsSyd @tiffgenders @woolworths @Coles @ald...,Extremely Negative
13969,Question... anybody ever heard of death cases ...,Extremely Negative
...,...,...
39882,WeÂre keeping a close eye on agriculture pric...,Positive
1090,"Please, if you can afford to, get a basket of ...",Positive
8842,Thank you Brad Paisley,Positive
2267,We encourage all those impacted by supermarket...,Positive


# **Data (Text) Preprocessing**

In [24]:
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = stopwords.words('english')

corpus_train = []
for tweet in df_train['OriginalTweet'].tolist():
  l = tweet.lower()
  l = re.sub(r'http\S+', '', tweet)  # Remove URLs
  l = re.sub(r'@\w+', '', tweet)  # Remove mentions
  l = re.sub(r'#\w+', '', tweet)  # Remove hashtags
  l = re.sub('[^a-zA-Z]', ' ', tweet)
  l = l.split()
  l1 = [stemmer.stem(i) for i in l if i not in stop_words]
  l = ' '.join(l1)
  corpus_train.append(l)

print(corpus_train)

['rice amp wheat price surg amid fear covid lockdown may threaten global food secur increas panic buy food due coronaviru lockdown led price spike world two stapl grain rice amp wheat import rush stockpil good http co qov jap', 'as carer elderli parent selfish stock pile wanker think peopl i tri self isol fear contract viru pass parent howev i ventur tri buy food coronaviru stockpil', 'coronaviru impact global food suppli chain could sky rocket anxieti driven panic say expert http co id zo fn', 'newssyd tiffgend woolworth cole aldi nswhealth vicgovdhh state close contract min danger coronaviru yet one supermarket risk unless huge crowd insid illog overreact senior staff interview fr', 'question anybodi ever heard death case due lack loo paper pl stop panic shop covid covid coronaviru standtogeth oneworld compass empathi solidar stoppanicbuy http co noyz qosc', 'odd could go supermarket buy usual food suppli may may satur coronaviru go extend period exercis outsid home fear catch transm

In [25]:
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = stopwords.words('english')

corpus_test = []
for tweet in df_test['OriginalTweet'].tolist():
  l = tweet.lower()
  l = re.sub(r'http\S+', '', tweet)  # Remove URLs
  l = re.sub(r'@\w+', '', tweet)  # Remove mentions
  l = re.sub(r'#\w+', '', tweet)  # Remove hashtags
  l = re.sub('[^a-zA-Z]', ' ', tweet)
  l = l.split()
  l1 = [stemmer.stem(i) for i in l if i not in stop_words]
  l = ' '.join(l1)
  corpus_test.append(l)

print(corpus_test)

['trend new yorker encount empti supermarket shelv pictur wegman brooklyn sold onlin grocer foodkick maxdeliveri coronaviru fear shopper stock http co gr pcrlwh http co ivmkmsqdt', 'when i find hand sanit fred meyer i turn amazon but pack purel check coronaviru concern drive price http co ygbipbflmi', 'find protect love one coronaviru', 'panic buy hit newyork citi anxiou shopper stock food amp medic suppli healthcar worker becom bigappl st confirm coronaviru patient or bloomberg stage event http co iasiregpc qanon qanon qanon elect cdc http co iszoewxu', 'toiletpap dunnypap coronaviru coronavirusaustralia coronavirusupd covid new corvid newsmelb dunnypaperg costco one week everyon buy babi milk powder next everyon buy toilet paper http co sczryvvsih', 'do rememb last time paid gallon regular ga lo angel price pump go a look coronaviru impact price pm abc http co pyzq ymuv', 'vote age coronaviru hand sanit supertuesday http co z bel o dk', 'drtedro we stop covid without protect healthwo

**Vectorization**

In [26]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train = pd.DataFrame(vectorizer.fit_transform(corpus_train).toarray(), columns = vectorizer.get_feature_names_out())
X_test = pd.DataFrame(vectorizer.transform(corpus_test).toarray(), columns = vectorizer.get_feature_names_out())

In [28]:
y_train = df_train['Sentiment']
y_test = df_test['Sentiment']

In [29]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train)

#Prediction
ypred_train = model.predict(X_train)
ypred_test = model.predict(X_test)

#Evaluation
from sklearn.metrics import accuracy_score

accuracy_train = accuracy_score(y_train, ypred_train)
accuracy_test = accuracy_score(y_test, ypred_test)

print('Accuracy(Train): ', accuracy_train)
print('Accuracy(Test): ', accuracy_test)

Accuracy(Train):  0.8254724111866969
Accuracy(Test):  0.4178515007898894


In [30]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(X_train, y_train)

#Prediction
ypred_train = model.predict(X_train)
ypred_test = model.predict(X_test)

#Evaluation
from sklearn.metrics import accuracy_score

accuracy_train = accuracy_score(y_train, ypred_train)
accuracy_test = accuracy_score(y_test, ypred_test)

print('Accuracy(Train): ', accuracy_train)
print('Accuracy(Test): ', accuracy_test)

Accuracy(Train):  0.9832199546485261
Accuracy(Test):  0.5078988941548184


In [31]:
from sklearn.svm import LinearSVC

model = LinearSVC()
model.fit(X_train, y_train)

#Prediction
ypred_train = model.predict(X_train)
ypred_test = model.predict(X_test)

#Evaluation
from sklearn.metrics import accuracy_score

accuracy_train = accuracy_score(y_train, ypred_train)
accuracy_test = accuracy_score(y_test, ypred_test)

print('Accuracy(Train): ', accuracy_train)
print('Accuracy(Test): ', accuracy_test)

Accuracy(Train):  0.9985638699924414
Accuracy(Test):  0.46340179041600843
