# Sentiment Analysis

## Problem Statement

Twitter has now become a useful way to build one's business as it helps in giving the brand a voice and a personality. The platform is also a quick, easy and inexpensive way to gain valuable insight from the desired audience. Identifying the sentiments about the product/brand can help the business take better actions.

You have with you evaluated tweets about multiple brands. The evaluators(random audience) were asked if the tweet expressed positive, negative, or no emotion towards a product/brand and labelled accordingly.

# Business Intuation

* Usecase:-
 - Exploite the area of improvement
 - Evaluate the Brand on Ground
 - `Evaluate sentiment of tweet at real time` 
* Stakeholders:-
 - Quqlity Manager
 - CEO
 - marketing head

# Dataset Description
- Data is provided by Hackathon orgniser
- This dataset contains around 7k tweet text with the sentiment label.  

The file train.csv has 3 columns

tweet_id - Unique id for tweets. tweet - Tweet about the brand/product sentiment - 0: Negative, 1: Neutral, 2: Positive, 3: Can't Tell

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords
from string import punctuation
import html
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.metrics import f1_score,accuracy_score
from sklearn.linear_model import LogisticRegression

  import pandas.util.testing as tm


In [2]:
train = pd.read_csv("data/train.csv",encoding='utf-8').set_index('tweet_id')

In [3]:
pd.set_option('max_colwidth',150)
train.head(10)

Unnamed: 0_level_0,tweet,sentiment
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1701,#sxswnui #sxsw #apple defining language of touch with different dialects becoming smaller,1
1851,"Learning ab Google doodles! All doodles should be light, funny &amp; innovative, with exceptions for significant occasions. #GoogleDoodle #sxsw",1
2689,"one of the most in-your-face ex. of stealing the show in yrs RT @mention &quot;At #SXSW, Apple schools the mkt experts&quot; {link}",2
4525,This iPhone #SXSW app would b pretty awesome if it didn't crash every 10mins during extended browsing. #Fuckit #Illmakeitwork,0
3604,Line outside the Apple store in Austin waiting for the new iPad #SXSW {link},1
966,#technews One lone dude awaits iPad 2 at AppleÛªs SXSW store {link} #Tech_News #Apple #iPad_2 #SXSW #tablets #tech,1
1395,"SXSW Tips, Prince, NPR Videos, Toy Shopping With Zuckerberg.\r\n{link} #sxsw #ipad",1
8182,NU user RT @mention New #UberSocial for #iPhone now in the App Store includes UberGuide to #SXSW sponsored by #Mashable,1
8835,Free #SXSW sampler on iTunes {link} #FreeMusic,2
883,I think I might go all weekend without seeing the same iPad case twice... #sxsw,2


WittyWicky Inc. is a consulting firm that designs brand strategy for a lot of product startups. Their modus operandi is to gain the pulse of competing products and the associated sentiment from social media. Social media has profound impact in capturing the potential customers and thus there are a lot of consulting firms that operate in the digital strategy space. Whether it is to design a marketing campaign or look at the effect of marketing campaigns on user engagement or sentiment, it is a very valuable tool.

Manual assessment of sentiment is very time consuming and automatic sentiment analysis would deliver a lot of value. As a team of data scientists consulting for WittyWicky Inc., you are now responsible for meeting their business outcomes.

In [87]:
train['sentiment'].value_counts()

1    4311
2    2382
0     456
3     125
Name: sentiment, dtype: int64

#### Function to remove patterns from data

In [4]:
def remov_pattern(pattern,text):
    text = re.sub(pattern,'',text)
    return text

#### Remove @mention
#### Remove {link}
#### Remove html tags
As there are multiple html tags in the tweet column 

In [5]:
pattern = "@[\w]*"
train['tweet'] = train['tweet'].apply(lambda x:remov_pattern(pattern,str(x)))

pattern = "{[\w }]*"
train['tweet'] = train['tweet'].apply(lambda x:remov_pattern(pattern,x))

#Eg: &quot;,&amp;
train['tweet'] = train['tweet'].apply(lambda x:html.unescape(x))

#### Removing bit.ly/ links

In [159]:
pattern = r"bit.ly/[\w]*"
train['tweet'] = train['tweet'].apply(lambda x:remov_pattern(pattern,x))

In [109]:
#pattern = "&[\w;]*"
#train['tweet'] = train['tweet'].apply(lambda x:remov_pattern(pattern,x))

### Treating Slangs

We scraped the website : "https://www.webopedia.com/quick_ref/textmessageabbreviations.asp"
and obtained the table of slang words with their meanings 

In [6]:
slangs_df = pd.read_csv("slangs_df.csv")
slangs_df.drop(['Unnamed: 0'],axis=1,inplace=True)

In [106]:
slangs_df.head(30)

Unnamed: 0,Slangs,Full_Forms
0,4U,For you
1,@TEOTD,At the end of the day
2,121,One-to-one
3,143,I love you
4,1432,I love you too
5,14AA41,"One for all, and all for one"
6,182,I hate you
7,10X,Thanks
8,10Q,Thank you
9,2B,To be


In [7]:
def df_to_dict(df,df_dict):
    for index,row in df.iterrows():
        df_dict[row['Slangs']] = row['Full_Forms']
    return df_dict

slangs_dict = {}
slangs_dict = df_to_dict(slangs_df,slangs_dict)
slangs_dict['PC'] = "Personal Computer"

In [8]:
def treat_slangs(row,slang_dict):
    words = row.split()
    treated_row = []
    reformed = []
    for word in words:
        if word.upper() in slang_dict.keys():
            new_word = word.replace(word,slang_dict[word.upper()])
            treated_row.append(new_word)
        else:
            treated_row.append(word)
    reformed = " ".join(treated_row)
    return reformed
train['tweet'] = train['tweet'].apply(lambda x:treat_slangs(x,slangs_dict))

In [97]:
#IGNORE....................................


chat_words_str = """AFAIK=As Far As I Know
AFK=Away From Keyboard
ASAP=As Soon As Possible
ATK=At The Keyboard
ATM=At The Moment
A3=Anytime, Anywhere, Anyplace
BAK=Back At Keyboard
BBL=Be Back Later
BBS=Be Back Soon
BFN=Bye For Now
B4N=Bye For Now
BRB=Be Right Back
BRT=Be Right There
BTW=By The Way
B4=Before
B4N=Bye For Now
CU=See You
CUL8R=See You Later
CYA=See You
FAQ=Frequently Asked Questions
FC=Fingers Crossed
FWIW=For What It's Worth
FYI=For Your Information
GAL=Get A Life
GG=Good Game
GN=Good Night
GMTA=Great Minds Think Alike
GR8=Great!
G9=Genius
IC=I See
ICQ=I Seek you 
ILU=I Love You
IMHO=In My Honest/Humble Opinion
IMO=In My Opinion
IOW=In Other Words
IRL=In Real Life
KISS=Keep It Simple, Stupid
LDR=Long Distance Relationship
LMAO=Laugh My Ass Off
LOL=Laughing Out Loud
LTNS=Long Time No See
L8R=Later
MTE=My Thoughts Exactly
M8=Mate
NRN=No Reply Necessary
OIC=Oh I See
PITA=Pain In The A..
PRT=Party
PRW=Parents Are Watching
ROFL=Rolling On The Floor Laughing
ROFLOL=Rolling On The Floor Laughing Out Loud
ROTFLMAO=Rolling On The Floor Laughing My Ass Off
SK8=Skate
STATS=Your sex and age
ASL=Age, Sex, Location
THX=Thank You
TTFN=Ta-Ta For Now!
TTYL=Talk To You Later
U=You
U2=You Too
U4E=Yours For Ever
WB=Welcome Back
WTF=What The Fuck
WTG=Way To Go!
WUF=Where Are You From?
W8=Wait
7K=Sick
YRS=Years
MKT=Market"""

chat_words_str = chat_words_str.replace("\n",":")
chat_words_split = chat_words_str.split(":")
slangs = list(chat_words_split)

In [98]:
#IGNORE............................
slangs_words = []
full_forms = []
for slang in slangs:
    new_slang = slang.split("=")
    slangs_words.append(new_slang[0])
    full_forms.append(new_slang[1])

In [352]:
#Ignore.............................
slang_words_small = {}
for i in range(len(slangs_words)):
    slang_words_small[slangs_words[i]] = full_forms[i]

In [9]:
#from gingerit.gingerit import GingerIt

In [209]:
#parser = GingerIt()
#text = "They're giving away iPad 2's, x boxes and books at @mention #sxsw #techenvy"
#tweet = parser.parse(text)
#print(tweet)

{'text': "They're giving away iPad 2's, x boxes and books at @mention #sxsw #techenvy", 'result': "They're giving away iPad 2's, x boxes and books at @mention #sxsw #techenvy", 'corrections': []}


In [18]:
#a = train.iloc[:20,:]

In [19]:
#parser = GingerIt()
#b = a['tweet'].apply(lambda x:parser.parse(x))

In [22]:
#b.values

array([{'text': '#sxswnui #sxsw #apple defining language of touch with different dialects becoming smaller', 'result': '#sxswnui #sxsw #apple defining language of touch with different dialects becoming smaller', 'corrections': []},
       {'text': 'Learning ab Google doodles! All doodles should be light, funny & innovative, with exceptions for significant occasions. #GoogleDoodle #sxsw', 'result': 'Learning ab Google doodles! All doodles should be light, funny & innovative, with exceptions for significant occasions. #GoogleDoodle #sxsw', 'corrections': []},
       {'text': 'one of the most in-your-face ex. of stealing the show in yrs RT  "At #SXSW, Apple schools the mkt experts"  ', 'result': 'One of the most in-your-face ex. Of stealing the show in yrs RT  "At #SXSW, Apple schools the market experts"  ', 'corrections': [{'start': 94, 'text': 'mkt', 'correct': 'market', 'definition': 'the world of commercial activity where goods and services are bought and sold'}, {'start': 33, 'text':

#### Treating Apostrophes

In [9]:
apostrophes = {
"aren't" : "are not",
"can't" : "cannot",
"couldn't" : "could not",
"didn't" : "did not",
"doesn't" : "does not",
"don't" : "do not",
"hadn't" : "had not",
"hasn't" : "has not",
"haven't" : "have not",
"he'd" : "he would",
"he'll" : "he will",
"he's" : "he is",
"i'd" : "I would",
"i'd" : "I had",
"i'll" : "I will",
"i'm" : "I am",
"isn't" : "is not",
"it's" : "it is",
"it'll":"it will",
"i've" : "I have",
"let's" : "let us",
"mightn't" : "might not",
"mustn't" : "must not",
"shan't" : "shall not",
"she'd" : "she would",
"she'll" : "she will",
"she's" : "she is",
"shouldn't" : "should not",
"that's" : "that is",
"there's" : "there is",
"they'd" : "they would",
"they'll" : "they will",
"they're" : "they are",
"they've" : "they have",
"we'd" : "we would",
"we're" : "we are",
"weren't" : "were not",
"we've" : "we have",
"what'll" : "what will",
"what're" : "what are",
"what's" : "what is",
"what've" : "what have",
"where's" : "where is",
"who'd" : "who would",
"who'll" : "who will",
"who're" : "who are",
"who's" : "who is",
"who've" : "who have",
"won't" : "will not",
"wouldn't" : "would not",
"you'd" : "you would",
"you'll" : "you will",
"you're" : "you are",
"you've" : "you have",
"'re": " are",
"wasn't": "was not",
"we'll":" will",
"didn't": "did not",
"'s" : " ",
"'m" : " am"
}

In [10]:
def apostrophe_correction(row):
    words = row.split()
    cleaned_row = []
    for word in words:
        for apostrophe in apostrophes:
            if apostrophe in word:
                word = word.replace(apostrophe, apostrophes[apostrophe]) 
        cleaned_row.append(word)
    reformed = " ".join(cleaned_row)
    return reformed
train['tweet']=train.tweet.apply(apostrophe_correction)

In [356]:
train.head(10)

Unnamed: 0_level_0,tweet,sentiment
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1701,#sxswnui #sxsw #apple defining language of touch with different dialects becoming smaller,1
1851,"Learning ab Google doodles! All doodles should be light, funny & innovative, with exceptions for significant occasions. #GoogleDoodle #sxsw",1
2689,"one of the most in-your-face ex. of stealing the show in Years RT ""At #SXSW, Apple schools the Market experts""",2
4525,This iPhone #SXSW app would b pretty awesome if it did not crash every 10mins during extended browsing. #Fuckit #Illmakeitwork,0
3604,Line outside the Apple store in Austin waiting for the new iPad #SXSW,1
966,#technews One lone dude awaits iPad 2 at AppleÛªs SXSW store #Tech_News #Apple #iPad_2 #SXSW #tablets #tech,1
1395,"SXSW Tips, Prince, NPR Videos, Toy Shopping With Zuckerberg. #sxsw #ipad",1
8182,NU user RT New #UberSocial for #iPhone now in the App Store includes UberGuide to #SXSW sponsored by #Mashable,1
8835,Free #SXSW sampler on iTunes #FreeMusic,2
883,I think I might go all weekend without seeing the same iPad case twice... #sxsw,2


In [11]:
tb_polarity = []
tb_subjectivity = []
for row in train["tweet"]:
    temp = TextBlob(row)
    tb_polarity.append(temp.sentiment[0])
    tb_subjectivity.append(temp.sentiment[1])
train["tb_polarity"] = tb_polarity
train["tb_subjectivity"] = tb_subjectivity

In [13]:
#IGNORE..................

analyser = SentimentIntensityAnalyzer()
vs_polarity = []
for row in train["tweet"]:
    temp = analyser.polarity_scores(row)['compound']
    vs_polarity.append(temp)
train["vs_polarity"] = vs_polarity

In [408]:
text = "one of the example of stealing the show in Years RT At #SXSW, Apple schools the Market experts"
analyser = SentimentIntensityAnalyzer()
sent = analyser.polarity_scores(text)


In [409]:
sent


{'neg': 0.179, 'neu': 0.821, 'pos': 0.0, 'compound': -0.5719}

In [421]:
train.head()

Unnamed: 0_level_0,tweet,sentiment,tb_polarity,tb_subjectivity,vs_polarity
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1701,#sxswnui #sxsw #apple defining language of touch with different dialects becoming smaller,1,0.15,0.65,0.0
1851,"Learning ab Google doodles! All doodles should be light, funny & innovative, with exceptions for significant occasions. #GoogleDoodle #sxsw",1,0.38125,0.89375,0.784
2689,"one of the most in-your-face ex. of stealing the show in Years RT ""At #SXSW, Apple schools the Market experts""",2,0.5,0.5,-0.5719
4525,This iPhone #SXSW Application would Be pretty awesome if it did not crash every 10mins during extended browsing. #Fuckit #Illmakeitwork,0,0.625,1.0,0.8611
3604,Line outside the Apple store in Austin waiting for the new iPad #SXSW,1,0.068182,0.252273,0.0


#### Remove Special characters
Eg : Ã‰ -> É â€œ -> " â€ -> " Ã‡ -> Ç Ãƒ -> Ã Ã©, 'é Ã -> À Ãº -> ú â€¢ -> - Ã˜ -> Ø Ãµ -> õ Ã­ -> í Ã¢ -> â Ã£ -> ã Ãª -> ê Ã¡ -> á Ã© -> é Ã³ -> ó â€“ -> – Ã§ -> ç Âª -> ª Âº -> º Ã -> à

In [12]:
pattern = "[^a-zA-Z\s#]" 
train['tweet'] = train['tweet'].apply(lambda x:remov_pattern(pattern,x))

In [12]:
train.head(25)

Unnamed: 0_level_0,tweet,sentiment
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1701,#sxswnui #sxsw #apple defining language of touch with different dialects becoming smaller,1
1851,Learning ab Google doodles All doodles should be light funny innovative with exceptions for significant occasions #GoogleDoodle #sxsw,1
2689,one of the most inyourface ex of stealing the show in yrs RT At #SXSW Apple schools the mkt experts,2
4525,This iPhone #SXSW Application would Be pretty awesome if it did not crash every mins during extended browsing #Fuckit #Illmakeitwork,0
3604,Line outside the Apple store in Austin waiting for the new iPad #SXSW,1
966,#technews One lone dude awaits iPad at Apples SXSW store #TechNews #Apple #iPad #SXSW #tablets #tech,1
1395,SXSW Tips Prince NPR Videos Toy Shopping With Zuckerberg #sxsw #ipad,1
8182,NU user RT New #UberSocial for #iPhone now in the Application Store includes UberGuide to #SXSW sponsored by #Mashable,1
8835,Free #SXSW sampler on iTunes #FreeMusic,2
883,I think I might go all weekend without seeing the same iPad case twice #sxsw,2


#### 1. Add hashtags to a seperate column
#### 2. Remove hashtags 

In [13]:
train['hashtags'] = train['tweet'].apply(lambda x:','.join(re.findall("#[\w]*",x)))
train['hashtags'] = train['hashtags'].apply(lambda x:re.sub("[#]*","",x))
train['hashtags'] = train['hashtags'].apply(lambda x:x.lower())

In [164]:
train.head()

Unnamed: 0_level_0,tweet,sentiment,hashtags
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1701,#sxswnui #sxsw #apple defining language of touch with different dialects becoming smaller,1,"sxswnui,sxsw,apple"
1851,Learning ab Google doodles All doodles should be light funny innovative with exceptions for significant occasions #GoogleDoodle #sxsw,1,"googledoodle,sxsw"
2689,one of the most inyourface ex of stealing the show in yrs RT At #SXSW Apple schools the mkt experts,2,sxsw
4525,This iPhone #SXSW app would b pretty awesome if it did not crash every mins during extended browsing #Fuckit #Illmakeitwork,0,"sxsw,fuckit,illmakeitwork"
3604,Line outside the Apple store in Austin waiting for the new iPad #SXSW,1,sxsw


In [14]:
train['hashtags'] = train['hashtags'].apply(lambda x: re.sub(","," ",x))

In [15]:
train['tweet'] = train['tweet'].apply(lambda x:re.sub("#[\w]*",'',x))

In [16]:
train['tweet'] = train['tweet'].str.lower()

#### Removing RT 

In [56]:
train['tweet'] = train['tweet'].apply(lambda x:re.sub(r"\brt","",x,flags=re.IGNORECASE))

In [216]:
train.head(10)

Unnamed: 0_level_0,tweet,sentiment,length,hashtags
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1701,defining language touch different dialect becoming smaller,1,89,sxswnui sxsw apple
1851,learning ab google doodle doodle light funny innovative exception significant occasion,1,143,googledoodle sxsw
2689,one inyourface ex stealing show yr rt apple school mkt expert,2,132,sxsw
4525,iphone application would pretty awesome crash every min extended browsing,0,125,sxsw fuckit illmakeitwork
3604,line outside apple store austin waiting new ipad,1,77,sxsw
966,one lone dude awaits ipad apple sxsw store,1,115,technews technews apple ipad sxsw tablets tech
1395,sxsw tip prince npr video toy shopping zuckerberg,1,82,sxsw ipad
8182,nu user rt new application store includes uberguide sponsored,1,119,ubersocial iphone sxsw mashable
8835,free sampler itunes,2,46,sxsw freemusic
883,think might go weekend without seeing ipad case twice,2,79,sxsw


In [188]:
#text = "_¼ÛÄ___ü ___¡ _____«_µ... &gt;&gt; @mention Google to Launch Major New Social Network Called Circles, Possibly Today sxs"
#text_c = text.decode("utf8").encode('ascii','ignore')

AttributeError: 'str' object has no attribute 'decode'

#### Stopwords Removal

In [17]:
sxsw_patterns = []
def find_pattern(df,pattern):
    r = re.findall(pattern,df)
    for i in r:
        if i not in sxsw_patterns:
            sxsw_patterns.append(i)
pattern = "sxsw[\w]*"
train['tweet'].apply(lambda x:find_pattern(x,pattern))

tweet_id
1701    None
1851    None
2689    None
4525    None
3604    None
        ... 
3343    None
5334    None
5378    None
2173    None
3162    None
Name: tweet, Length: 7274, dtype: object

In [17]:
stop_words = list(set(stopwords.words('english')))+list(punctuation)
train['tweet'] = train['tweet'].apply(lambda x:word_tokenize(x))
train['tweet'] = train['tweet'].apply(lambda row:[word for word in row if word not in stop_words])

In [206]:
train.head(20)

Unnamed: 0_level_0,tweet,sentiment,length,hashtags
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1701,"[defining, language, touch, different, dialects, becoming, smaller]",1,89,sxswnui sxsw apple
1851,"[learning, ab, google, doodles, doodles, light, funny, innovative, exceptions, significant, occasions]",1,143,googledoodle sxsw
2689,"[one, inyourface, ex, stealing, show, yrs, rt, apple, schools, mkt, experts]",2,132,sxsw
4525,"[iphone, application, would, pretty, awesome, crash, every, mins, extended, browsing]",0,125,sxsw fuckit illmakeitwork
3604,"[line, outside, apple, store, austin, waiting, new, ipad]",1,77,sxsw
966,"[one, lone, dude, awaits, ipad, apples, sxsw, store]",1,115,technews technews apple ipad sxsw tablets tech
1395,"[sxsw, tips, prince, npr, videos, toy, shopping, zuckerberg]",1,82,sxsw ipad
8182,"[nu, user, rt, new, application, store, includes, uberguide, sponsored]",1,119,ubersocial iphone sxsw mashable
8835,"[free, sampler, itunes]",2,46,sxsw freemusic
883,"[think, might, go, weekend, without, seeing, ipad, case, twice]",2,79,sxsw


### Lemmatization

In [18]:
lemma = WordNetLemmatizer()
train['tweet'] = train['tweet'].apply(lambda x:[lemma.lemmatize(i) for i in x])

In [19]:
train.head(10)

Unnamed: 0_level_0,tweet,sentiment,tb_polarity,tb_subjectivity,hashtags
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1701,"[defining, language, touch, different, dialect, becoming, smaller]",1,0.15,0.65,sxswnui sxsw apple
1851,"[learning, google, doodle, doodle, light, funny, innovative, exception, significant, occasion]",1,0.38125,0.89375,googledoodle sxsw
2689,"[one, inyourface, ex, stealing, show, year, rt, apple, school, market, expert]",2,0.5,0.5,sxsw
4525,"[iphone, application, would, pretty, awesome, crash, every, min, extended, browsing]",0,0.625,1.0,sxsw fuckit illmakeitwork
3604,"[line, outside, apple, store, austin, waiting, new, ipad]",1,0.068182,0.252273,sxsw
966,"[one, lone, dude, awaits, ipad, apple, sxsw, store]",1,0.0,0.0,technews technews apple ipad sxsw tablets tech
1395,"[sxsw, tip, prince, npr, video, toy, shopping, zuckerberg]",1,0.0,0.0,sxsw ipad
8182,"[nu, user, rt, new, application, store, includes, uberguide, sponsored]",1,0.136364,0.454545,ubersocial iphone sxsw mashable
8835,"[free, sampler, itunes]",2,0.4,0.8,sxsw freemusic
883,"[think, might, go, weekend, without, seeing, ipad, case, twice]",2,0.0,0.125,sxsw


### Stemming

In [67]:
stemmer = PorterStemmer()
train['tweet'] = train['tweet'].apply(lambda x:[stemmer.stem(i) for i in x])

In [20]:
train['tweet'] = train['tweet'].apply(lambda x:' '.join(x))

In [22]:
train.head(10)

Unnamed: 0_level_0,tweet,sentiment,tb_polarity,tb_subjectivity,hashtags
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1701,defining language touch different dialect becoming smaller,1,0.15,0.65,sxswnui sxsw apple
1851,learning ab google doodle doodle light funny innovative exception significant occasion,1,0.38125,0.89375,googledoodle sxsw
2689,one inyourface ex stealing show year rt apple school market expert,2,0.5,0.5,sxsw
4525,iphone application would pretty awesome crash every min extended browsing,0,0.625,1.0,sxsw fuckit illmakeitwork
3604,line outside apple store austin waiting new ipad,1,0.068182,0.252273,sxsw
966,one lone dude awaits ipad apple sxsw store,1,0.0,0.0,technews technews apple ipad sxsw tablets tech
1395,sxsw tip prince npr video toy shopping zuckerberg,1,0.0,0.0,sxsw ipad
8182,nu user rt new application store includes uberguide sponsored,1,0.136364,0.454545,ubersocial iphone sxsw mashable
8835,free sampler itunes,2,0.4,0.8,sxsw freemusic
883,think might go weekend without seeing ipad case twice,2,0.0,0.125,sxsw


#### Count Vectorizer

In [21]:
cv = CountVectorizer()
text = train['tweet']
vector = cv.fit_transform(text)
X = vector.toarray()

#### Tfidf Vectorizer

In [425]:
text = train['tweet']
tfidf = TfidfVectorizer()
vector = tfidf.fit_transform(text)
X = vector.toarray()

In [22]:
y = train['sentiment']

#### As the target column was imbalanced, we decided to use stratify

In [23]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0,stratify=y)

#### Basic Vanilla Model

In [24]:
log_reg = LogisticRegression(max_iter=200,n_jobs=2)
log_reg.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=200,
                   multi_class='auto', n_jobs=2, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [25]:
y_pred = log_reg.predict(X_test)

In [26]:
f1 = f1_score(y_test,y_pred,average="weighted")
f1

0.6543150390114427

#### Test Data Processing

In [35]:
test = pd.read_csv("data/test.csv").set_index("tweet_id")

In [36]:
test.head(10)

Unnamed: 0_level_0,tweet
tweet_id,Unnamed: 1_level_1
7506,"Audience Q: What prototyping tools do you use? Sketchbooks/sharpie pens, photoshop, Balsamic, Google docs, Axsure, etc. #myprototype #sxsw"
7992,At SXSW? Send Your Best Photos &amp; Videos to... {link} #citizen_journalism #cnn #ireport #photography #sxsw #Cyber #iPhone
247,@mention and here's a pic of you winning your ipad! #unsix #sxsw cc @mention @mention {link} (cont) {link}
7688,Google Marissa Mayer: mobile phone as a cursor of physical location - new version of map fast and more real life like #sxsw
3294,#SXSW Google maps is even cooler than I thought
6125,RT @mention In front of @mention popup store at #SXSW last night {link}
6131,RT @mention In my next life I'm coming back as an iPad 2. Women can't keep their hands off this thing. #SXSW
4134,Google celebrating Pi Day in style at #SXSW - {link}
8206,Hmmm is it a bit weird that #sxsw is not tending but Google Circle is?
8552,@mention to launch 'Circles' later today at #SXSW?? gotta love #SXSW - one platform for everything from Independent film to Innovative Tech


#### Remove @mention
#### Remove {link}
#### Remove html tags
As there are multiple html tags in the tweet column 

In [37]:
pattern = "@[\w]*"
test['tweet'] = test['tweet'].apply(lambda x:remov_pattern(pattern,str(x)))

pattern = "{[\w }]*"
test['tweet'] = test['tweet'].apply(lambda x:remov_pattern(pattern,x))

test['tweet'] = test['tweet'].apply(lambda x:html.unescape(x))

#pattern = "&[\w;]*"
#train['tweet'] = train['tweet'].apply(lambda x:remov_pattern(pattern,x))

In [38]:
def treat_slangs(row,slang_dict):
    words = row.split()
    treated_row = []
    reformed = []
    for word in words:
        if word.upper() in slang_dict.keys():
            new_word = word.replace(word,slang_dict[word.upper()])
            treated_row.append(new_word)
        else:
            treated_row.append(word)
    reformed = " ".join(treated_row)
    return reformed
test['tweet'] = test['tweet'].apply(lambda x:treat_slangs(x,slangs_dict))

In [39]:
def apostrophe_correction(row): 
    words = row.split()
    cleaned_row = []
    for word in words:
        for apostrophe in apostrophes:
            if apostrophe in word:
                word = word.replace(apostrophe, apostrophes[apostrophe]) 
        cleaned_row.append(word)
    reformed = " ".join(cleaned_row)
    return reformed
test['tweet']=test.tweet.apply(apostrophe_correction)

In [40]:
tb_polarity = []
tb_subjectivity = []
for row in test["tweet"]:
    temp = TextBlob(row)
    tb_polarity.append(temp.sentiment[0])
    tb_subjectivity.append(temp.sentiment[1])
test["tb_polarity"] = tb_polarity
test["tb_subjectivity"] = tb_subjectivity

In [128]:
#IGNORE..............

analyser = SentimentIntensityAnalyzer()
vs_polarity = []
for row in test["tweet"]:
    temp = analyser.polarity_scores(row)['compound']
    vs_polarity.append(temp)
test["vs_polarity"] = vs_polarity

In [41]:
pattern = "[^a-zA-Z\s#]" 
test['tweet'] = test['tweet'].apply(lambda x:remov_pattern(pattern,x))

In [239]:
test.head()

Unnamed: 0_level_0,tweet,hashtags
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1
7506,"[audience, q, prototyping, tools, use, sketchbooks/sharpie, pens, photoshop, balsamic, google, docs, axsure, etc]",myprototype sxsw
7992,"[sxsw, send, best, photos, amp, videos, ...]",citizen_journalism cnn ireport photography sxsw cyber iphone
247,"[pic, winning, ipad, cc, cont]",unsix sxsw
7688,"[google, marissa, mayer, mobile, phone, cursor, physical, location, new, version, map, fast, real, life, like]",sxsw
3294,"[google, maps, even, cooler, thought]",sxsw


In [42]:
test['hashtags'] = test['tweet'].apply(lambda x:','.join(re.findall("#[\w]*",x)))
test['hashtags'] = test['hashtags'].apply(lambda x:re.sub("[#]*","",x))
test['hashtags'] = test['hashtags'].apply(lambda x:x.lower())

In [43]:
test['hashtags'] = test['hashtags'].apply(lambda x: re.sub(","," ",x))

In [44]:
test['tweet'] = test['tweet'].apply(lambda x:re.sub("#[\w]*",'',x))

In [45]:
test['tweet'] = test['tweet'].apply(lambda x:x.lower())

In [235]:
test['tweet'] = test['tweet'].apply(lambda x:re.sub(r"\brt","",x))

In [237]:
test.head(10)

Unnamed: 0_level_0,tweet,hashtags
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1
7506,"audience q: what prototyping tools do you use? sketchbooks/sharpie pens, photoshop, balsamic, google docs, axsure, etc.",myprototype sxsw
7992,at sxsw? send your best photos &amp; videos to...,citizen_journalism cnn ireport photography sxsw cyber iphone
247,and here is a pic of you winning your ipad! cc (cont),unsix sxsw
7688,google marissa mayer: mobile phone as a cursor of physical location - new version of map fast and more real life like,sxsw
3294,google maps is even cooler than i thought,sxsw
6125,in front of popup store at last night,sxsw
6131,in my next life i'm coming back as an ipad 2. women cannot keep their hands off this thing.,sxsw
4134,google celebrating pi day in style at -,sxsw
8206,hmmm is it a bit weird that is not tending but google circle is?,sxsw
8552,to launch 'circles' later today at ?? gotta love - one platform for everything from independent film to innovative tech,sxsw sxsw


In [46]:
stop_words = list(set(stopwords.words('english')))+list(punctuation)
test['tweet'] = test['tweet'].apply(lambda x:word_tokenize(x))
test['tweet'] = test['tweet'].apply(lambda x:[word for word in x if word not in stop_words])

In [47]:
test['tweet'] = test['tweet'].apply(lambda x:[lemma.lemmatize(i) for i in x])

In [48]:
test['tweet'] = test['tweet'].apply(lambda x:' '.join(x))

In [321]:
test.head()

Unnamed: 0_level_0,tweet,tb_polarity,tb_subjectivity,hashtags
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
7506,audience q prototyping tool use sketchbookssharpie pen photoshop balsamic google doc axsure etc,0.0,0.0,myprototype sxsw
7992,sxsw send best photo video,1.0,0.3,citizenjournalism cnn ireport photography sxsw cyber iphone
247,picture winning ipad cc cont,0.625,0.75,unsix sxsw
7688,google marissa mayer mobile phone cursor physical location new version map fast real life like,0.207273,0.399481,sxsw
3294,google map even cooler thought,0.0,0.0,sxsw


In [49]:
#cv = CountVectorizer()
text = test['tweet']
vector = cv.transform(text)
X = vector.toarray()

In [None]:
tfidf = TfidfVectorizer()
vector = tfidf.transform(text)
X = vector.toarray()

In [50]:
y_pred = log_reg.predict(X)

In [51]:
y_pred

array([1, 1, 2, ..., 2, 1, 1], dtype=int64)

In [66]:
X.shape

(1819, 7081)

#### Exporting Cleaned Data

In [26]:
train.to_csv("treated_train.csv")
test.to_csv("treated_test.csv")

#### Submission File

In [52]:
sample_submission = pd.DataFrame(y_pred,test.index,columns=['sentiment'])

In [53]:
sample_submission

Unnamed: 0_level_0,sentiment
tweet_id,Unnamed: 1_level_1
7506,1
7992,1
247,2
7688,2
3294,2
...,...
1550,2
1933,2
9052,2
4219,1


In [54]:
sample_submission.to_csv("sample_sub.csv")