Write complete code that uses TF-IDF with a Random Forest classifier to perform text classification on tweet data about COVID-19. The tweets were pulled from Twitter and then tagged manually. There are two CSV files, one for the train data and one for the test data.

The train data is provided to you in the exam and the test data is kept aside by the instructor to evaluate your model and will not be provided to you for the exam.

Use your own discretion to determine the relevant hyperparameters and model parameters. Make sure to provide justification while dropping a column. Remember this is a multiclass classification problem. [10 Marks]

Columns:

1) Location 2) Tweet At 3) Original Tweet 4) Sentiment (To be predicted)

In [1]:
import re  # regular expressions
import warnings

import numpy as np
import pandas as pd
import nltk  # text manipulation
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from tqdm import tqdm
# from gensim.models.doc2vec import LabeledSentence
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score,roc_auc_score,roc_curve
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier

pd.set_option("display.max_colwidth", 200)
warnings.filterwarnings("ignore")


In [2]:
# !pip3 install scikit-learn

In [3]:
df = pd.read_csv('Corona_NLP_train.csv',encoding = 'latin')

#### 1. Data exploration

In [4]:
df.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate su...,Positive
2,3801,48753,Vagabonds,16-03-2020,"Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P",Positive
3,3802,48754,,16-03-2020,"My food stock is not the only one which is empty...\r\r\n\r\r\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. \r\r\nStay calm, stay safe.\r\r\n\...",Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COVID19 outbreak.\r\r\n\r\r\nNot because I'm paranoid, but because my food stock is litteraly empty. The #coronavirus is a serious thing, but please, don...",Extremely Negative


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41157 entries, 0 to 41156
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   UserName       41157 non-null  int64 
 1   ScreenName     41157 non-null  int64 
 2   Location       32567 non-null  object
 3   TweetAt        41157 non-null  object
 4   OriginalTweet  41157 non-null  object
 5   Sentiment      41157 non-null  object
dtypes: int64(2), object(4)
memory usage: 1.9+ MB


#### ==> The Location column has null values (only 32,567 of 41,157 rows are non-null)

In [6]:
print(f'==> rows {df.shape[0]}, columns: {df.shape[1]}')
print(f'==> columns name: {df.columns.tolist()}')

==> rows 41157, columns: 6
==> columns name: ['UserName', 'ScreenName', 'Location', 'TweetAt', 'OriginalTweet', 'Sentiment']


In [7]:
tmp=pd.DataFrame(df.TweetAt.value_counts()).reset_index()
tmp

Unnamed: 0,index,TweetAt
0,20-03-2020,3448
1,19-03-2020,3215
2,25-03-2020,2979
3,18-03-2020,2742
4,21-03-2020,2653
5,22-03-2020,2114
6,23-03-2020,2062
7,17-03-2020,1977
8,08-04-2020,1881
9,07-04-2020,1843


In [8]:
months=[]
for x in tmp['index']:
    months.append(x.split('-')[1])

print(f'==> unique months: {set(months)}')

==> unique months: {'04', '03'}


#### ==> The data covers March and April 2020


In [9]:
tmp=df['Location'].value_counts()
tmp=pd.DataFrame(tmp).reset_index()
tmp

Unnamed: 0,index,Location
0,London,540
1,United States,528
2,"London, England",520
3,"New York, NY",395
4,"Washington, DC",373
...,...,...
12215,Staffordshire Moorlands,1
12216,Kithchener ON,1
12217,"Tulsa, Ok",1
12218,"Watford, South Oxhey, Bushey",1


#### ==> There are 12,220 unique locations among 41,157 rows


In [10]:
tmp=df['Sentiment'].value_counts()
tmp=pd.DataFrame(tmp).reset_index()
tmp

Unnamed: 0,index,Sentiment
0,Positive,11422
1,Negative,9917
2,Neutral,7713
3,Extremely Positive,6624
4,Extremely Negative,5481


#### ==> There are 5 unique sentiment types


In [11]:
df[df['Sentiment']=='Positive'].head(3)

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate su...,Positive
2,3801,48753,Vagabonds,16-03-2020,"Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P",Positive
3,3802,48754,,16-03-2020,"My food stock is not the only one which is empty...\r\r\n\r\r\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. \r\r\nStay calm, stay safe.\r\r\n\...",Positive


In [12]:
df[df['Sentiment']=='Negative'].head(3)


Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
9,3808,48760,"BHAVNAGAR,GUJRAT",16-03-2020,"For corona prevention,we should stop to buy things with the cash and should use online payment methods because corona can spread through the notes. Also we should prefer online shopping from our h...",Negative
24,3823,48775,Downstage centre,16-03-2020,@10DowningStreet @grantshapps what is being done to ensure food and other essential products are being re-stocked at supermarkets and panic buying actively discouraged? It cannot be left to checko...,Negative
26,3825,48777,"Ketchum, Idaho",16-03-2020,"In preparation for higher demand and a potential food shortage, The Hunger Coalition purchased 10 percent more food and implemented new protocols due to the COVID-19 coronavirus. https://t.co/5Cec...",Negative


In [13]:
df[df['Sentiment']=='Neutral'].head(3)


Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8,Neutral
7,3806,48758,Austria,16-03-2020,Was at the supermarket today. Didn't buy toilet paper. #Rebel\r\r\n\r\r\n#toiletpapercrisis #covid_19 https://t.co/eVXkQLIdAZ,Neutral
10,3809,48761,"Makati, Manila",16-03-2020,"All month there hasn't been crowding in the supermarkets or restaurants, however reducing all the hours and closing the malls means everyone is now using the same entrance and dependent on a singl...",Neutral


In [14]:
df[df['Sentiment']=='Extremely Positive'].head(3)


Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
11,3810,48762,"Pitt Meadows, BC, Canada",16-03-2020,"Due to the Covid-19 situation, we have increased demand for all food products. \r\r\n\r\r\nThe wait time may be longer for all online orders, particularly beef share and freezer packs. \r\r\n\r\r\...",Extremely Positive
12,3811,48763,Horningsea,16-03-2020,"#horningsea is a caring community. LetÂs ALL look after the less capable in our village and ensure they stay healthy. Bringing shopping to their doors, help with online shopping and self isolatio...",Extremely Positive
18,3817,48769,North America,16-03-2020,"Amazon Glitch Stymies Whole Foods, Fresh Grocery Deliveries\r\r\nÂAs COVID-19 has spread, weÂve seen a significant increase in people shopping online for groceries,Â a spokeswoman said in a sta...",Extremely Positive


In [15]:
df[df['Sentiment']=='Extremely Negative'].head(3)


Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COVID19 outbreak.\r\r\n\r\r\nNot because I'm paranoid, but because my food stock is litteraly empty. The #coronavirus is a serious thing, but please, don...",Extremely Negative
20,3819,48771,southampton soxx xxx,16-03-2020,with 100 nations inficted with covid 19 the world must not play fair with china 100 goverments must demand china adopts new guilde lines on food safty the chinese goverment is guilty...,Extremely Negative
27,3826,48778,Everywhere You Are!,16-03-2020,"This morning I tested positive for Covid 19. I feel ok, I have no symptoms so far but have been isolated since I found out about my possible exposure to the virus. Stay home people and be pragmat...",Extremely Negative


#### 2. data preprocessing

In [16]:
## function for removing a regex pattern such as @user handles
import re
from nltk.stem.porter import PorterStemmer

def remove_pattern(input_txt, pattern):
    # delete every match in one pass; substituting matches one by one
    # can leave fragments when one handle is a prefix of another
    return re.sub(pattern, '', input_txt)

    

In [17]:
# create new column with @user handles removed (raw string avoids an invalid-escape warning)
df['Tweet'] = np.vectorize(remove_pattern)(df['OriginalTweet'], r'@[\w]*')
df.head(3)


Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment,Tweet
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8,Neutral,https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate su...,Positive,advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate su...
2,3801,48753,Vagabonds,16-03-2020,"Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P",Positive,"Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P"
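
Handles can be stripped in a single pass with `re.sub`. The sketch below, on a made-up tweet, also shows why substituting each `re.findall` match separately is fragile: when one handle is a prefix of another, a fragment is left behind. `strip_mentions_loop` and `strip_mentions` are illustrative names, not part of the notebook.

```python
import re

MENTION = r'@\w*'  # same idea as the '@[\w]*' pattern used above

def strip_mentions_loop(text):
    # Substitute every match separately (fragile when handles share a prefix).
    for handle in re.findall(MENTION, text):
        text = re.sub(re.escape(handle), '', text)
    return text

def strip_mentions(text):
    # Remove all handles in one pass.
    return re.sub(MENTION, '', text)

tweet = "@Phil hi @Phil_Gahan"            # made-up example
print(repr(strip_mentions_loop(tweet)))   # ' hi _Gahan'
print(repr(strip_mentions(tweet)))        # ' hi '
```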


In [18]:
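## remove URLs: keep only the text before the first https:// link
## (anything after that link, including later words, is discarded too)
df['Tweet'] = df['Tweet'].apply(lambda x: re.split(r'https://.*', str(x))[0])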
df.head(3)


Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment,Tweet
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8,Neutral,
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate su...,Positive,advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate su...
2,3801,48753,Vagabonds,16-03-2020,"Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P",Positive,"Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak"
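
The split in the cell above keeps only the text before the first `https://`, so any words that follow a link are dropped as well. If only the links themselves should go, a substitution is one option. A minimal sketch on a made-up string; `strip_urls` is a hypothetical helper, and the `https?://\S+` pattern is an assumption about what counts as a link:

```python
import re

URL = re.compile(r'https?://\S+')

def strip_urls(text):
    # Delete each link but keep the words around it, then tidy the spacing.
    return ' '.join(URL.sub(' ', text).split())

sample = "shops open https://t.co/abc123 stay safe"   # made-up example
print(repr(re.split(r'https://.*', sample)[0]))       # 'shops open '
print(repr(strip_urls(sample)))                       # 'shops open stay safe'
```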


In [19]:
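# replace everything except letters and # with a space
# (regex=True is needed because pandas 2.x treats the pattern as a literal string by default)
df['Tweet'] = df['Tweet'].str.replace('[^a-zA-Z#]+', ' ', regex=True)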
df.head(3)


Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment,Tweet
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8,Neutral,
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate su...,Positive,advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate su...
2,3801,48753,Vagabonds,16-03-2020,"Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P",Positive,Coronavirus Australia Woolworths to give elderly disabled dedicated shopping hours amid COVID outbreak


In [20]:
# remove short words (2 characters or fewer)
df['Tweet'] = df['Tweet'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 2]))
df.head(3)


Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment,Tweet
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8,Neutral,
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate su...,Positive,advice Talk your neighbours family exchange phone numbers create contact list with phone numbers neighbours schools employer chemist set online shopping accounts poss adequate supplies regular med...
2,3801,48753,Vagabonds,16-03-2020,"Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P",Positive,Coronavirus Australia Woolworths give elderly disabled dedicated shopping hours amid COVID outbreak


In [21]:
## create new variable tokenized tweet 
tokenized_tweet = df['Tweet'].apply(lambda x: x.split())
df.head(3)


Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment,Tweet
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8,Neutral,
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate su...,Positive,advice Talk your neighbours family exchange phone numbers create contact list with phone numbers neighbours schools employer chemist set online shopping accounts poss adequate supplies regular med...
2,3801,48753,Vagabonds,16-03-2020,"Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P",Positive,Coronavirus Australia Woolworths give elderly disabled dedicated shopping hours amid COVID outbreak


In [22]:
stemmer = PorterStemmer()

## apply stemmer for tokenized_tweet
tokenized_tweet = tokenized_tweet.apply(lambda x: [stemmer.stem(i) for i in x])
df.head(3)


Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment,Tweet
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8,Neutral,
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate su...,Positive,advice Talk your neighbours family exchange phone numbers create contact list with phone numbers neighbours schools employer chemist set online shopping accounts poss adequate supplies regular med...
2,3801,48753,Vagabonds,16-03-2020,"Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P",Positive,Coronavirus Australia Woolworths give elderly disabled dedicated shopping hours amid COVID outbreak


In [23]:
# # join tokens into one sentence
# for i in range(len(tokenized_tweet)):
#     tokenized_tweet[i] = ' '.join(tokenized_tweet[i])

In [24]:
# df['Tweet']  = tokenized_tweet
# df.head(3)


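#### ==> UserName, ScreenName, Location and TweetAt are dropped. UserName and ScreenName are numeric identifiers, Location is missing for about 21% of rows and has 12,220 distinct free-text values, and TweetAt only spans March and April 2020, so none of them describes the sentiment expressed in a tweet.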
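
The decision can be backed with numbers. A minimal sketch on a tiny stand-in frame (the rows are invented; on the real data the same calls would be run on `df`), reporting the share of missing values and the number of distinct values per column:

```python
import pandas as pd

# Invented stand-in for the real frame; only the column names match.
toy = pd.DataFrame({
    'UserName': [1, 2, 3, 4],
    'Location': ['London', None, 'UK', None],
    'TweetAt': ['16-03-2020'] * 4,
    'Sentiment': ['Positive', 'Neutral', 'Negative', 'Positive'],
})

summary = pd.DataFrame({
    'missing_share': toy.isna().mean(),   # fraction of nulls per column
    'n_unique': toy.nunique(),            # distinct non-null values per column
})
print(summary)
```

A column with as many distinct values as rows behaves like an identifier, and a sparse free-text column is hard to encode reliably; both patterns are visible in such a summary.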


In [25]:
new_df = df[['Tweet','Sentiment']]
new_df.shape

(41157, 2)

In [26]:
new_df[new_df['Tweet'].str.len()<3]


Unnamed: 0,Tweet,Sentiment
0,,Neutral
16,,Neutral
21,,Neutral
173,,Extremely Positive
186,,Neutral
...,...,...
40799,,Extremely Positive
40811,,Negative
40978,,Positive
41113,,Positive


In [27]:
## drop rows whose cleaned Tweet is empty or too short (3 characters or fewer)
new_df=new_df[new_df['Tweet'].str.len()>3]
new_df.shape

(40962, 2)

In [28]:
new_df.head()

Unnamed: 0,Tweet,Sentiment
1,advice Talk your neighbours family exchange phone numbers create contact list with phone numbers neighbours schools employer chemist set online shopping accounts poss adequate supplies regular med...,Positive
2,Coronavirus Australia Woolworths give elderly disabled dedicated shopping hours amid COVID outbreak,Positive
3,food stock not the only one which empty PLEASE don panic THERE WILL ENOUGH FOOD FOR EVERYONE you not take more than you need Stay calm stay safe #COVID france #COVID #COVID #coronavirus #confineme...,Positive
4,ready supermarket during the #COVID outbreak Not because paranoid but because food stock litteraly empty The #coronavirus serious thing but please don panic causes shortage #CoronavirusFrance #res...,Extremely Negative
5,news the region first confirmed COVID case came out Sullivan County last week people flocked area stores purchase cleaning supplies hand sanitizer food toilet paper and other goods reports,Positive


In [29]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to /home/z/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [30]:
# new_df['Tweet'].apply(lambda x: [item for item in x if item not in stop])
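
The commented-out line would loop over the characters of each tweet, because `x` is a string at that point. Splitting into words first avoids that. A minimal sketch with a tiny hand-written stop list standing in for NLTK's (an assumption made so the example runs without downloads); `drop_stopwords` is a hypothetical helper:

```python
STOP = {'the', 'and', 'for'}   # stand-in for stopwords.words('english')

def drop_stopwords(text, stop=STOP):
    # Compare whole words, case-insensitively, against the stop list.
    return ' '.join(w for w in text.split() if w.lower() not in stop)

print(drop_stopwords('Stay calm and stay safe for the weekend'))
# Stay calm stay safe weekend
```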

In [31]:
new_df.head()

Unnamed: 0,Tweet,Sentiment
1,advice Talk your neighbours family exchange phone numbers create contact list with phone numbers neighbours schools employer chemist set online shopping accounts poss adequate supplies regular med...,Positive
2,Coronavirus Australia Woolworths give elderly disabled dedicated shopping hours amid COVID outbreak,Positive
3,food stock not the only one which empty PLEASE don panic THERE WILL ENOUGH FOOD FOR EVERYONE you not take more than you need Stay calm stay safe #COVID france #COVID #COVID #coronavirus #confineme...,Positive
4,ready supermarket during the #COVID outbreak Not because paranoid but because food stock litteraly empty The #coronavirus serious thing but please don panic causes shortage #CoronavirusFrance #res...,Extremely Negative
5,news the region first confirmed COVID case came out Sullivan County last week people flocked area stores purchase cleaning supplies hand sanitizer food toilet paper and other goods reports,Positive


In [32]:
new_df.shape

(40962, 2)

In [33]:
## final check for null values
new_df.isnull().sum()

Tweet        0
Sentiment    0
dtype: int64

In [34]:
new_df.head()

Unnamed: 0,Tweet,Sentiment
1,advice Talk your neighbours family exchange phone numbers create contact list with phone numbers neighbours schools employer chemist set online shopping accounts poss adequate supplies regular med...,Positive
2,Coronavirus Australia Woolworths give elderly disabled dedicated shopping hours amid COVID outbreak,Positive
3,food stock not the only one which empty PLEASE don panic THERE WILL ENOUGH FOOD FOR EVERYONE you not take more than you need Stay calm stay safe #COVID france #COVID #COVID #coronavirus #confineme...,Positive
4,ready supermarket during the #COVID outbreak Not because paranoid but because food stock litteraly empty The #coronavirus serious thing but please don panic causes shortage #CoronavirusFrance #res...,Extremely Negative
5,news the region first confirmed COVID case came out Sullivan County last week people flocked area stores purchase cleaning supplies hand sanitizer food toilet paper and other goods reports,Positive


In [35]:
## convert labels to numeric values (position in this list = class id)
labels = ['Positive', 'Negative', 'Neutral', 'Extremely Positive', 'Extremely Negative']
new_df = new_df.copy()  # work on a copy so the assignment does not target a slice of df
new_df['Sentiment'] = new_df['Sentiment'].map(labels.index)

In [36]:
new_df.head()

Unnamed: 0,Tweet,Sentiment
1,advice Talk your neighbours family exchange phone numbers create contact list with phone numbers neighbours schools employer chemist set online shopping accounts poss adequate supplies regular med...,0
2,Coronavirus Australia Woolworths give elderly disabled dedicated shopping hours amid COVID outbreak,0
3,food stock not the only one which empty PLEASE don panic THERE WILL ENOUGH FOOD FOR EVERYONE you not take more than you need Stay calm stay safe #COVID france #COVID #COVID #coronavirus #confineme...,0
4,ready supermarket during the #COVID outbreak Not because paranoid but because food stock litteraly empty The #coronavirus serious thing but please don panic causes shortage #CoronavirusFrance #res...,4
5,news the region first confirmed COVID case came out Sullivan County last week people flocked area stores purchase cleaning supplies hand sanitizer food toilet paper and other goods reports,0
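
Since the model outputs integers, it helps to keep the mapping in both directions so predictions can be turned back into label names. A minimal sketch; `label_to_id` and `id_to_label` are illustrative names, and the order follows the `labels` list above:

```python
labels = ['Positive', 'Negative', 'Neutral', 'Extremely Positive', 'Extremely Negative']

label_to_id = {name: i for i, name in enumerate(labels)}
id_to_label = {i: name for name, i in label_to_id.items()}

encoded = [label_to_id[s] for s in ['Positive', 'Extremely Negative', 'Neutral']]
decoded = [id_to_label[i] for i in encoded]
print(encoded)  # [0, 4, 2]
print(decoded)  # ['Positive', 'Extremely Negative', 'Neutral']
```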


#### split train / validation: 80%/ 20%

In [37]:
from sklearn.model_selection import train_test_split

# stratify makes train_test_split return subsets with the same class proportions as the input dataset
train, valid = train_test_split(new_df, test_size=0.2, random_state=0, stratify=new_df.Sentiment.values)
print("==> train shape : ", train.shape)
print("==> valid shape : ", valid.shape)

==> train shape :  (32769, 2)
==> valid shape :  (8193, 2)
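
The effect of `stratify` can be checked by counting labels on each side of the split. A minimal sketch on invented, deliberately imbalanced labels (the data is a placeholder, not the tweet set):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Invented imbalanced labels: 60 of class 0, 30 of class 1, 10 of class 2.
y = [0] * 60 + [1] * 30 + [2] * 10
X = list(range(len(y)))

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

print(Counter(y_tr))  # Counter({0: 48, 1: 24, 2: 8})
print(Counter(y_va))  # Counter({0: 12, 1: 6, 2: 2})
```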


In [38]:
X_train=train['Tweet'].tolist()
X_valid=valid['Tweet'].tolist()
Y_train=train['Sentiment'].tolist()
Y_valid=valid['Sentiment'].tolist()

In [39]:
train.head()

Unnamed: 0,Tweet,Sentiment
25146,pictures panic buyers cramming supermarket trolleys without thought for others reflect mere shoppers overwhelming majority customers were buying few extra items because they and their families mus...,1
4277,Let see services still Insurance protection for Covid riders but nothing about drivers like food delivery car for gold and platinum status drivers only Wtf this Not only demand for rides low but,4
10528,#coronavirus current state grocery store hudson valley,2
6297,The department small business development set unleash estimated billion support package assist small micro and medium sized businesses produce more the critical consumer goods needed for the effec...,0
19848,Smooth Transitions Air Ground #TheCrew #Ubisoft #Boats #Planes #Cars #Beauty #Peace #Art #Poetry #Music #Coronavirus #Quarantine #StayHome #Bored #Fun #Fishing #Bassmasters #WalkingDead #ToDo #Toi...,2


In [40]:
valid.head()

Unnamed: 0,Tweet,Sentiment
13715,Ice cream run the grocery store #Covid #SocialDistancing,2
17983,First look the impact consumer spending the,2
9758,Country was broke amp debt now lifetime taxation ppl put money away ride any bad times Governments amp corporations didnt Quantitive Easing created faked stock prices creating yet another bubble T...,4
6403,are you serious Given the current climate are you actually joking You should ashamed Theyve now shut schools and will probably close businesses and your increasing prices Utterly disgraceful and d...,4
3508,think never seen the gas prices soooo cheap since left Venezuela #Coronavirus #coronavirusvancouver,2


In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

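def vectorize(data, tfidf_vect_fit):
    # transform texts with the already fitted vectorizer and return a dense DataFrame
    X_tfidf = tfidf_vect_fit.transform(data)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    words = tfidf_vect_fit.get_feature_names_out()
    X_tfidf_df = pd.DataFrame(X_tfidf.toarray())
    X_tfidf_df.columns = words
    return X_tfidf_df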

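def clean(text):
    ## the text was already cleaned above, so this analyzer returns it unchanged.
    ## NOTE: a callable analyzer is expected to return a sequence of tokens; returning
    ## the raw string makes TfidfVectorizer iterate over it character by character,
    ## so the features built here are single characters rather than words.
    return text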

In [42]:
tfidf_vect = TfidfVectorizer(analyzer=clean)


In [43]:
tfidf_vect_fit=tfidf_vect.fit(X_train)
X_train=vectorize(X_train,tfidf_vect_fit)
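
It is worth checking which features the vectorizer actually learns. When `analyzer` is a callable, scikit-learn uses whatever it returns as the token sequence, so a function that returns the string unchanged yields one feature per character. The sketch below, on two invented tweets, compares that with the default word analyzer; `identity` is a stand-in for the `clean` function above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['panic buying food', 'stay calm stay safe']   # invented examples

def identity(text):
    # Returns the raw string, so iterating over it yields characters.
    return text

char_vect = TfidfVectorizer(analyzer=identity).fit(docs)
word_vect = TfidfVectorizer().fit(docs)               # default: word tokens

print(sorted(char_vect.vocabulary_))   # single characters, including ' '
print(sorted(word_vect.vocabulary_))   # ['buying', 'calm', 'food', 'panic', 'safe', 'stay']
```

The word-level matrix is sparse and can be passed to `RandomForestClassifier` directly, so the dense `toarray()` conversion is optional.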

In [44]:
## build models
rf = RandomForestClassifier()
scores = cross_val_score(rf,X_train,Y_train,cv=5)

In [45]:
print(scores)
scores.mean()

[0.34009765 0.33246872 0.32773879 0.33307904 0.33023043]


0.3327229243341108
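
One way to keep the vectorizer and the classifier together is a `Pipeline`, which refits the TF-IDF vocabulary inside every cross-validation fold. A minimal sketch on invented toy tweets (the texts, labels and settings are placeholders, and the scores it produces say nothing about the real data):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Invented toy corpus: three classes, ten examples each.
texts = ['great help thanks', 'awful panic crisis', 'store hours today',
         'thanks great support', 'crisis panic fear', 'hours store open'] * 5
y = [0, 1, 2, 0, 1, 2] * 5

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),                                   # word-level features
    ('rf', RandomForestClassifier(n_estimators=20, random_state=0)),
])

scores = cross_val_score(pipe, texts, y, cv=3)   # one accuracy value per fold
print(scores.shape)  # (3,)
```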

In [46]:
def print_results(results):
    print('==> BEST PARAMS: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

In [47]:
# ## determine best hyperparameters
# rf = RandomForestClassifier()
# parameters = {
#     'n_estimators': [5,50,100],
#     'max_depth': [2,10,20,None]
# }

# cv = GridSearchCV(rf,parameters)
# cv.fit(X_train,Y_train)
# print_results(cv)
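
The commented-out search could also be run over a pipeline, with parameters addressed as `step__parameter`. A minimal sketch on invented toy tweets; the grid values, the `f1_macro` scoring choice and the step names `tfidf` and `rf` are assumptions for illustration, not tuned settings:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Invented toy corpus: three classes, ten examples each.
texts = ['great help thanks', 'awful panic crisis', 'store hours today'] * 10
y = [0, 1, 2] * 10

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('rf', RandomForestClassifier(random_state=0)),
])
param_grid = {
    'rf__n_estimators': [5, 20],   # number of trees
    'rf__max_depth': [2, None],    # None lets trees grow until leaves are pure
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring='f1_macro')
search.fit(texts, y)
print(search.best_params_)
```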

In [49]:
# cv.best_estimator_.__dict__


In [50]:
## train model with best hyperparameters
# model = RandomForestClassifier(n_estimators=cv.best_estimator_.__dict__['n_estimators'],
#         max_depth=cv.best_estimator_.__dict__['max_depth'])
model = RandomForestClassifier(n_estimators=100,
        max_depth=None)

model.fit(X_train, Y_train)


RandomForestClassifier()

In [51]:
## test model
X_val=vectorize(X_valid,tfidf_vect_fit)
Y_pred = model.predict(X_val)


In [52]:

## calculate Precision/Recall/Accuracy
from sklearn.metrics import accuracy_score,precision_score,recall_score

# pos_label only applies to average='binary', so it is omitted for this multiclass task
precision = round(precision_score(Y_valid,Y_pred, average='micro'), 3)
recall = round(recall_score(Y_valid,Y_pred, average='micro'), 3)
accuracy = round(accuracy_score(Y_valid,Y_pred), 3)
print('-- A: {} / P: {} / R: {}'.format(accuracy, precision, recall))

-- A: 0.346 / P: 0.346 / R: 0.346
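
For single-label multiclass problems, micro-averaged precision and recall count every prediction once, so both equal accuracy, which is why the three numbers above coincide. Macro averaging or a per-class report is usually more informative. A minimal sketch on invented labels:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             f1_score, classification_report)

y_true = [0, 0, 1, 1, 2, 2, 3, 4]   # invented labels
y_pred = [0, 1, 1, 1, 2, 0, 3, 4]

acc = accuracy_score(y_true, y_pred)
micro_p = precision_score(y_true, y_pred, average='micro')
macro_f1 = f1_score(y_true, y_pred, average='macro')   # every class weighted equally

print(acc, micro_p)   # 0.75 0.75
print(round(macro_f1, 3))
print(classification_report(y_true, y_pred, zero_division=0))
```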
