# Sentiment analysis on the Stanford tweet dataset

#### The data is a CSV file with emoticons removed. Each row has 6 fields:
0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)

1 - the id of the tweet (2087)

2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)

3 - the query (lyx). If there is no query, then this value is NO_QUERY.

4 - the user that tweeted (robotickilldozr)

5 - the text of the tweet (Lyx is cool)

You can download the data from http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
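As a minimal sketch of the six-field layout described above, the column names can be passed straight to `pd.read_csv` through `names=`. The two rows below are made up for illustration (the first reuses the example values from the field list), and `COLUMNS` is a hypothetical helper name. For the real files you would also pass `encoding="ISO-8859-1"`, as the notebook does further down.

```python
import io

import pandas as pd

# Column names for the six documented fields (same names the notebook uses later).
COLUMNS = ["polarity", "id", "date", "query_type", "user", "text"]

# Two illustrative rows in the documented layout; not taken from the real file.
sample_csv = io.StringIO(
    '0,2087,Sat May 16 23:58:44 UTC 2009,lyx,robotickilldozr,Lyx is cool\n'
    '4,2088,Sat May 16 23:59:01 UTC 2009,NO_QUERY,someuser,"Nice day, really"\n'
)

# header=None because the file has no header row; names= labels the columns on load.
sample = pd.read_csv(sample_csv, header=None, names=COLUMNS)
print(sample)
```

Naming the columns at load time removes the need for a separate `rename` call.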

### Importing libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import string

from sklearn.feature_extraction.text import TfidfVectorizer

import re
import nltk

from nltk.corpus import stopwords

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split

import time

### Reading Data

In [30]:
train_data = pd.read_csv("C:\\Users\\kode surendra aba\\Desktop\\Data science\\python\\Sample projects\\sentiment_analysis_tweet_nlp\\training.1600000.processed.noemoticon.csv",header=None,encoding = "ISO-8859-1")

In [31]:
#Add column headers
train_data = train_data.rename(columns={0: 'polarity', 1: 'id', 2: 'date', 3: 'query_type', 4: 'user',5: 'text'})
train_data.shape

(1600000, 6)

In [32]:
train_data.head()

Unnamed: 0,polarity,id,date,query_type,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [33]:
test_data = pd.read_csv("C:\\Users\\kode surendra aba\\Desktop\\Data science\\python\\Sample projects\\sentiment_analysis_tweet_nlp\\testdata.manual.2009.06.14.csv",header=None,encoding = "ISO-8859-1")

In [34]:
test_data = test_data.rename(columns={0: 'polarity', 1: 'id', 2: 'date', 3: 'query_type', 4: 'user',5: 'text'})
test_data.shape

(498, 6)

In [35]:
test_data.head()

Unnamed: 0,polarity,id,date,query_type,user,text
0,4,3,Mon May 11 03:17:40 UTC 2009,kindle2,tpryan,@stellargirl I loooooooovvvvvveee my Kindle2. ...
1,4,4,Mon May 11 03:18:03 UTC 2009,kindle2,vcu451,Reading my kindle2... Love it... Lee childs i...
2,4,5,Mon May 11 03:18:54 UTC 2009,kindle2,chadfu,"Ok, first assesment of the #kindle2 ...it fuck..."
3,4,6,Mon May 11 03:19:04 UTC 2009,kindle2,SIX15,@kenburbary You'll love your Kindle2. I've had...
4,4,7,Mon May 11 03:21:41 UTC 2009,kindle2,yamarama,@mikefish Fair enough. But i have the Kindle2...


In [36]:
#Merging train and test data so both are cleaned together
#(DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
data = pd.concat([train_data, test_data], ignore_index=True)

In [37]:
#Consider only 1% of the dataset due to memory constraints. You can run the algorithm on the full dataset on cloud machines
data = data.sample(frac = 0.01)

In [39]:
data.shape

(16005, 6)
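Because `sample` draws rows at random, every run of the notebook works on a different 1% subset. A small sketch, on a made-up frame, of how `random_state` makes the draw repeatable (the seed value 42 is an arbitrary choice, not something the original notebook uses):

```python
import pandas as pd

# Made-up frame standing in for the merged tweets.
frame = pd.DataFrame({"value": range(1000)})

# The same seed gives the same 1% subset on every call.
first = frame.sample(frac=0.01, random_state=42)
second = frame.sample(frac=0.01, random_state=42)

print(len(first), first.index.equals(second.index))
```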

### Cleaning Data

In [40]:
#Removing duplicates
data.drop_duplicates(inplace = True)
data.shape #No duplicates found in data

(16005, 6)

In [41]:
print (pd.DataFrame(data.isnull().sum())) #No null values found in data

            0
polarity    0
id          0
date        0
query_type  0
user        0
text        0


In [42]:
#Column id
data.id.value_counts() #No id occurs more than twice, so the column is almost entirely unique and useless as a model feature
del data['id']

In [43]:
#Column date
data.date.value_counts()

#split the date column into its elements (raw string avoids the invalid-escape warning for \W)
data['date'] = data['date'].map(lambda date: re.sub(r'\W+', ' ', date)).apply(lambda x: x.lower().split())
#Each date becomes a list such as ['mon', 'apr', '06', '22', '19', '45', 'pdt', '2009']

#extracting weekday from date
data.loc[:, 'weekday'] = data.date.map(lambda x: x[0])
data.weekday.value_counts()


#extracting month from date
data.loc[:, 'month'] = data.date.map(lambda x: x[1])
data.month.value_counts()

#extracting day from date
data.loc[:, 'day'] = data.date.map(lambda x: x[2])
data['day'] = pd.to_numeric(data['day']) #Convert day column to numeric values
data.day.value_counts()

#bin the days into three parts of the month: month_start, month_mid and month_end
conditions = [
    (data['day'] >=0) & (data['day'] <= 10),
    (data['day'] >=11) & (data['day'] <= 20)
    ]
choices = ['month_start','month_mid']
data['monthframe'] = np.select(conditions, choices, default='month_end')

data.monthframe.value_counts()

#Remove day column as we don't need it now after binning it
del data['day']


#extracting hour from date
data.loc[:, 'hour'] = data.date.map(lambda x: x[3])
data['hour'] = pd.to_numeric(data['hour']) #Convert hour column to numeric values
data.hour.value_counts()

#bin the hours into times of day: night, morning, afternoon and evening
conditions = [
    (data['hour'] >=0) & (data['hour'] <= 5),
    (data['hour'] >=6) & (data['hour'] <= 12),
    (data['hour'] >=13) & (data['hour'] <= 16),
    (data['hour'] >=17) & (data['hour'] <= 20)
    ]
choices = ['night', 'morning', 'afternoon','evening']
data['timeframe'] = np.select(conditions, choices, default='night')

data.timeframe.value_counts()

#Remove hour column as we don't need it now after binning it
del data['hour']


#extracting year from date
data.loc[:, 'year'] = data.date.map(lambda x: x[7])
data.year.value_counts()
#The data contains just one year so we remove year column
del data['year']

#We remove date column from data since it is of no use now
del data['date']
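The date handling above can also be sketched with pandas string methods and `pd.cut`. This is an alternative under stated assumptions, not the notebook's method: it splits on whitespace (so the time stays as one `hh:mm:ss` token instead of three), the third date is made up, and the bin edges copy the day ranges used above (1-10, 11-20, 21-31).

```python
import pandas as pd

dates = pd.Series([
    "Mon Apr 06 22:19:45 PDT 2009",
    "Sat May 16 23:58:44 UTC 2009",
    "Thu Jun 25 07:05:10 PDT 2009",  # made-up example
])

# Split on whitespace into one column per date element.
parts = dates.str.lower().str.split(expand=True)
parts.columns = ["weekday", "month", "day", "time", "tz", "year"]

day = parts["day"].astype(int)
hour = parts["time"].str.slice(0, 2).astype(int)

# Right-closed bins: [0, 10], (10, 20], (20, 31].
monthframe = pd.cut(
    day,
    bins=[0, 10, 20, 31],
    labels=["month_start", "month_mid", "month_end"],
    include_lowest=True,
)

print(parts[["weekday", "month"]].assign(hour=hour, monthframe=monthframe))
```

`pd.cut` keeps the bin edges in one place, which makes them easier to change than a list of boolean conditions.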

In [44]:
#Column user
data.user.value_counts() #660120 unique users: too many distinct values to be useful, so we remove user as well
del data['user']

In [45]:
#Column query
data.query_type.value_counts() #All 1600000 training rows are NO_QUERY, so we remove this column too
del data['query_type']

### Convert categorical columns to dummy variables

In [46]:
#Convert weekday column to numeric dummy variables
df_weekday = pd.get_dummies(data['weekday'])

#Convert month column to numeric dummy variables
df_month = pd.get_dummies(data['month'])

#Convert monthframe column to numeric dummy variables
df_monthframe = pd.get_dummies(data['monthframe'])

#Convert timeframe column to numeric dummy variables
df_timeframe = pd.get_dummies(data['timeframe'])

In [47]:
#merging dummy variables with polarity and text to form a single dataframe
data = pd.concat([data['polarity'].reset_index(drop=True),df_weekday.reset_index(drop=True),df_month.reset_index(drop=True),df_monthframe.reset_index(drop=True),df_timeframe.reset_index(drop=True),data['text'].reset_index(drop=True)],axis=1)
list(data)

['polarity',
 'fri',
 'mon',
 'sat',
 'sun',
 'thu',
 'tue',
 'wed',
 'apr',
 'jun',
 'may',
 'month_end',
 'month_mid',
 'month_start',
 'afternoon',
 'evening',
 'morning',
 'night',
 'text']
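For comparison, `pd.get_dummies` can encode several columns in one call through its `columns=` argument, which avoids the manual `concat`. A minimal sketch on a made-up frame: the empty `prefix` and `prefix_sep` keep the bare category names as in the list above, and `dtype=int` asks for 0/1 integers, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Made-up frame with two of the categorical columns built earlier.
frame = pd.DataFrame({
    "polarity": [0, 4, 0],
    "weekday": ["mon", "sat", "mon"],
    "timeframe": ["night", "morning", "evening"],
})

# Columns not listed in columns= are passed through unchanged.
encoded = pd.get_dummies(
    frame,
    columns=["weekday", "timeframe"],
    prefix="",
    prefix_sep="",
    dtype=int,
)

print(list(encoded.columns))
```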

### Applying NLP for text column

In [48]:
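#Strip mentions, URLs, hashtags and every character that is not a letter or quote, then lowercase and split into a list of words
#Note: when '@' or '#' follows a space, the last alternative matches the space and the symbol first, so the handle or tag word itself stays in the text
data['p_text'] = data['text'].map(lambda text: re.sub(r'(@[A-Za-z0-9]+)|(\w+:\/\/\S+)|(#[A-Za-z0-9]+)|([^A-Za-z\'\"]+)', ' ', text)).apply(lambda x: x.lower().split())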

In [49]:
#joining p_text again into one string so stopwords and punctuation can be removed later
data['p_text'] = data['p_text'].apply(lambda x: " ".join(x))
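The single-pass pattern above only removes a mention or hashtag when nothing but a letter (or the start of the string) comes before it. After a space, the non-letter alternative matches the space together with the `@` or `#`, and the word that follows survives. The sketch below compares it with a two-pass variant on a made-up tweet. The helper names are hypothetical, and using `\w+` for handles (so underscores are included) is an assumption rather than the notebook's choice.

```python
import re

# Pattern copied from the cell above (single pass).
SINGLE_PASS = re.compile(r"(@[A-Za-z0-9]+)|(\w+:\/\/\S+)|(#[A-Za-z0-9]+)|([^A-Za-z\'\"]+)")

# Two-pass variant: remove whole entities first, then leftover non-letters.
ENTITIES = re.compile(r"@\w+|#\w+|\w+://\S+")
NON_LETTERS = re.compile(r"[^A-Za-z'\"]+")


def clean_single_pass(text):
    """Apply the notebook's pattern once and normalise whitespace."""
    return " ".join(SINGLE_PASS.sub(" ", text).lower().split())


def clean_two_pass(text):
    """Drop mentions, hashtags and URLs before stripping other characters."""
    text = ENTITIES.sub(" ", text)
    text = NON_LETTERS.sub(" ", text)
    return " ".join(text.lower().split())


tweet = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer #sad cc @bob"
single = clean_single_pass(tweet)
double = clean_two_pass(tweet)
print(single)  # awww that's a bummer sad cc bob
print(double)  # awww that's a bummer cc
```

Whether hashtag words should be kept is a modelling choice, since a tag like `#sad` carries sentiment. The two-pass form just makes that choice explicit.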

#### Cleaning and Stemming text column

In [50]:
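#NLTK's English stopword list (needs nltk.download('stopwords') once); a set gives fast lookups
#and the new name avoids shadowing the imported stopwords module
stop_words = set(nltk.corpus.stopwords.words('english'))

#wn = nltk.WordNetLemmatizer()
ps = nltk.PorterStemmer()

#Remove punctuation and stopwords, then stem every remaining token
def clean_text(text):
    #Drop punctuation character by character
    text = "".join([char for char in text if char not in string.punctuation])
    tokens = text.split(' ')
    text = [word for word in tokens if word not in stop_words]
    text = [ps.stem(word) for word in text]
    #text = [wn.lemmatize(word) for word in text]
    return text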


#Removing original text column
del data['text']

In [51]:
data.shape

(16005, 19)

#### Vectorization of words from p_text column using tfidf vectorizer

In [52]:
#Vectorizing processed text column i.e. p_text
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['p_text'])
print(X_tfidf.shape)

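#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(tfidf_vect.get_feature_names_out().tolist())

X_tfidf_df = pd.DataFrame(X_tfidf.toarray())
X_tfidf_df.columns = tfidf_vect.get_feature_names_out()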

(16005, 14959)
['', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaaaaahhhhhh', 'aaaaaaaah', 'aaaaaag', 'aaaaaalllat', 'aaaaaarrrggggg', 'aaaaah', 'aaaaamaz', 'aaaah', 'aaaahhhh', 'aaaahhhhh', 'aaaahhhhhhh', 'aaaamaz', 'aaaamin', 'aaaargh', 'aaah', 'aaahahha', 'aaahhh', 'aaahhhh', 'aaargh', 'aaarrrggghhh', 'aaasaand', 'aaaw', 'aaawwww', 'aac', 'aaeeeew', 'aafk', 'aafreen', 'aah', 'aahhh', 'aaliyah', 'aaltima', 'aar', 'aaron', 'aarrrggghhh', 'aashrit', 'aaww', 'aawww', 'ab', 'abaaa', 'abandon', 'abandond', 'abba', 'abbey', 'abbi', 'abc', 'abcwhitehousetakeov', 'abd', 'abdomen', 'abdul', 'abel', 'abey', 'abhi', 'abi', 'abil', 'abit', 'abl', 'abmb', 'abmeldebest', 'abo', 'aboard', 'abonden', 'abooba', 'aboulut', 'abound', 'abouttohead', 'abroad', 'absenc', 'absolut', 'absolutli', 'absorb', 'abstract', 'absurd', 'abt', 'abu', 'abus', 'abyss', 'ac', 'aca', 'acab', 'acai', 'acarlo', 'accent', 'accept', 'accesorio', 'access', 'accessori', 'accid', 'accident', 'accomplish', 'accord', 'accordin', 'account
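Calling `toarray()` turns the 16005 x 14959 TF-IDF matrix into a dense array that is mostly zeros. One possible way to save memory, sketched here on made-up data, is to keep the matrix sparse and attach the date indicator columns with `scipy.sparse.hstack`. scikit-learn's random forest accepts sparse input. The texts and the two indicator columns below are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["love this phone", "miss my old phone", "sad and tired today"]
date_dummies = np.array([[1, 0], [0, 1], [1, 0]])  # made-up indicator columns

vectorizer = TfidfVectorizer()
text_matrix = vectorizer.fit_transform(texts)  # stays a scipy sparse matrix

# Stack the indicator columns to the left of the TF-IDF columns without densifying.
features = hstack([csr_matrix(date_dummies), text_matrix]).tocsr()
print(features.shape)
```

The trade-off is that column names are no longer attached to the matrix, so a separate list of names is needed to read feature importances.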

### Divide data into train and test

In [53]:
#Taking independent variables together
X_features = pd.concat([data[data.columns[1:18]].reset_index(drop=True),X_tfidf_df.reset_index(drop=True)], axis=1)

X_features.head()

Unnamed: 0,fri,mon,sat,sun,thu,tue,wed,apr,jun,may,...,zune,zunehd,zynga,zz,zzz,zzzz,zzzzzz,zzzzzzz,zzzzzzzz,zzzzzzzzzzzzzzzzzzz
0,0,0,0,0,1,0,0,0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,0,0,0,0,1,0,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,0,0,1,0,0,0,0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,0,0,0,1,0,0,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,0,1,0,0,0,0,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [54]:
#Divide data in train and test
X_train, X_test, y_train, y_test = train_test_split(X_features, data['polarity'], test_size=0.2)
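The split above is random on every run. As a hedged sketch on made-up data, `random_state` makes the split repeatable and `stratify` keeps the class proportions equal in both parts. Neither argument is used in the original cell, and the seed is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)   # made-up feature matrix
y_demo = np.array([0] * 50 + [4] * 50)    # balanced polarity labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

# Count the negative and positive labels that ended up in the test part.
print(X_tr.shape, X_te.shape, (y_te == 0).sum(), (y_te == 4).sum())
```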

### Applying Random Forest algorithm on train data

In [55]:
rf = RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1)
rf_model = rf.fit(X_train, y_train)

In [56]:
#10 Most important factors affecting our model
sorted(zip(rf_model.feature_importances_, X_train.columns), reverse=True)[0:10]

[(0.05098468590111173, 'month_mid'),
 (0.03645874147362202, 'jun'),
 (0.03503630939980214, 'thank'),
 (0.030718564624643875, 'may'),
 (0.029599926636831194, 'month_start'),
 (0.023799377604030787, 'miss'),
 (0.019845108880364476, 'thu'),
 (0.01619566913354905, 'feel'),
 (0.01471544284202606, 'sad'),
 (0.013608384299254412, 'love')]

In [57]:
y_pred = rf_model.predict(X_test)
precision, recall, fscore, support = score(y_test, y_pred, average='micro')

print('Precision: {} / Recall: {} / Accuracy: {}'.format(round(precision, 3),
                                                        round(recall, 3),
                                                        round((y_pred==y_test).sum() / len(y_pred),3)))

Precision: 0.727 / Recall: 0.727 / Accuracy: 0.727
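Precision, recall and accuracy are identical above because of `average='micro'`. For single-label classification, micro-averaging pools every prediction, so micro precision and micro recall both reduce to the fraction of correct predictions. A small sketch with made-up labels shows this, with `average='macro'` alongside as one way to get per-class behaviour into the summary:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up labels using the dataset's polarity codes (0 = negative, 4 = positive).
y_true = [0, 0, 0, 4, 4, 4, 4, 4]
y_hat = [0, 0, 4, 4, 4, 4, 4, 0]

accuracy = accuracy_score(y_true, y_hat)
micro_p, micro_r, _, _ = precision_recall_fscore_support(y_true, y_hat, average='micro')
macro_p, macro_r, _, _ = precision_recall_fscore_support(y_true, y_hat, average='macro')

print('accuracy:', accuracy)
print('micro   :', micro_p, micro_r)
print('macro   :', round(macro_p, 3), round(macro_r, 3))
```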


In [58]:
#Checking which hyperparameter values give the best results
def train_RF(n_est, depth):
    rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth, n_jobs=-1)
    rf_model = rf.fit(X_train, y_train)
    y_pred = rf_model.predict(X_test)
    precision, recall, fscore, support = score(y_test, y_pred, average='micro')
    print('Est: {} / Depth: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
        n_est, depth, round(precision, 3), round(recall, 3),
        round((y_pred==y_test).sum() / len(y_pred), 3)))
    
for n_est in [50, 100, 300, 400]:
    for depth in [10, 20, 30, None]:
        train_RF(n_est, depth)

Est: 50 / Depth: 10 ---- Precision: 0.708 / Recall: 0.708 / Accuracy: 0.708
Est: 50 / Depth: 20 ---- Precision: 0.728 / Recall: 0.728 / Accuracy: 0.728
Est: 50 / Depth: 30 ---- Precision: 0.726 / Recall: 0.726 / Accuracy: 0.726
Est: 50 / Depth: None ---- Precision: 0.745 / Recall: 0.745 / Accuracy: 0.745
Est: 100 / Depth: 10 ---- Precision: 0.713 / Recall: 0.713 / Accuracy: 0.713
Est: 100 / Depth: 20 ---- Precision: 0.731 / Recall: 0.731 / Accuracy: 0.731
Est: 100 / Depth: 30 ---- Precision: 0.739 / Recall: 0.739 / Accuracy: 0.739
Est: 100 / Depth: None ---- Precision: 0.744 / Recall: 0.744 / Accuracy: 0.744
Est: 300 / Depth: 10 ---- Precision: 0.713 / Recall: 0.713 / Accuracy: 0.713
Est: 300 / Depth: 20 ---- Precision: 0.735 / Recall: 0.735 / Accuracy: 0.735
Est: 300 / Depth: 30 ---- Precision: 0.739 / Recall: 0.739 / Accuracy: 0.739
Est: 300 / Depth: None ---- Precision: 0.75 / Recall: 0.75 / Accuracy: 0.75
Est: 400 / Depth: 10 ---- Precision: 0.72 / Recall: 0.72 / Accuracy: 0.72
Est
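The manual loop above scores every combination on the test set. A common alternative, sketched here on synthetic data from `make_classification`, is scikit-learn's `GridSearchCV`, which picks hyperparameters by cross-validation on the training data only. The grid values, `cv=3` and the seeds are arbitrary choices for a quick demonstration, not the settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {'n_estimators': [10, 30], 'max_depth': [5, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation on the training data
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refit on all the training data with the selected values, which leaves the test set untouched for the final check.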

### Choosing best values of hyperparameters

In [59]:
rf = RandomForestClassifier(n_estimators=400, max_depth=None, n_jobs=-1)

#### Test the chosen values on the test data

In [60]:
start = time.time()
rf_model = rf.fit(X_train, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = rf_model.predict(X_test)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, average='micro')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 116.963 / Predict time: 1.079 ---- Precision: 0.749 / Recall: 0.749 / Accuracy: 0.749
