# Twitter Sentiment Analysis Classification

## Libraries

In [170]:
import numpy as np
import pandas as pd
import regex as re
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer

## Dataset Load

In [2]:
!curl "https://dbdmg.polito.it/dbdmg_web/wp-content/uploads/2021/12/DSL2122_january_dataset.zip" -Lo dataset.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17.7M  100 17.7M    0     0  16.9M      0  0:00:01  0:00:01 --:--:-- 16.9M


In [3]:
!unzip -q dataset.zip; rm dataset.zip; rm -r __MACOSX/

In [244]:
tweets = pd.read_csv("./DSL2122_january_dataset/development.csv")
tweets

Unnamed: 0,sentiment,ids,date,flag,user,text
0,1,1833972543,Mon May 18 01:08:27 PDT 2009,NO_QUERY,Killandra,"@MissBianca76 Yes, talking helps a lot.. going..."
1,1,1980318193,Sun May 31 06:23:17 PDT 2009,NO_QUERY,IMlisacowan,SUNSHINE. livingg itttt. imma lie on the grass...
2,1,1994409198,Mon Jun 01 11:52:54 PDT 2009,NO_QUERY,yaseminx3,@PleaseBeMine Something for your iphone
3,0,1824749377,Sun May 17 02:45:34 PDT 2009,NO_QUERY,no_surprises,@GabrielSaporta couldn't get in to the after p...
4,0,2001199113,Tue Jun 02 00:08:07 PDT 2009,NO_QUERY,Rhi_ShortStack,@bradiewebbstack awww is andy being mean again...
...,...,...,...,...,...,...
224989,0,2261324310,Sat Jun 20 20:36:48 PDT 2009,NO_QUERY,CynthiaBuroughs,@Dropsofreign yeah I hope Iran people reach fr...
224990,1,1989408152,Mon Jun 01 01:25:45 PDT 2009,NO_QUERY,unitechy,Trying the qwerty keypad
224991,0,1991221316,Mon Jun 01 06:38:10 PDT 2009,NO_QUERY,Xaan,I love Jasper &amp; Jackson but that wig in th...
224992,0,2239702807,Fri Jun 19 08:51:56 PDT 2009,NO_QUERY,Ginger_Billie,I am really tired and bored and bleh! I feel c...


## Data exploration

In [245]:
len(tweets["user"].unique())

10647

In [246]:
len(tweets["ids"].unique())

224716
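
The frame has 224994 rows but only 224716 distinct ids, so some ids occur more than once. A minimal sketch of how such repeats could be inspected, using a made-up frame (the values below are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the tweets frame (invented values).
toy = pd.DataFrame({
    "ids": [10, 11, 11, 12],
    "text": ["a", "b", "b again", "c"],
})

n_rows = len(toy)
n_unique_ids = toy["ids"].nunique()        # same result as len(toy["ids"].unique())
n_repeated_rows = n_rows - n_unique_ids    # rows beyond the first occurrence of each id

# keep=False marks every row whose id occurs more than once
repeated = toy[toy["ids"].duplicated(keep=False)]
print(n_rows, n_unique_ids, n_repeated_rows)
print(repeated)
```

Whether the repeated rows should be dropped or kept is a separate decision that this sketch does not make.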

In [247]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224994 entries, 0 to 224993
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   sentiment  224994 non-null  int64 
 1   ids        224994 non-null  int64 
 2   date       224994 non-null  object
 3   flag       224994 non-null  object
 4   user       224994 non-null  object
 5   text       224994 non-null  object
dtypes: int64(2), object(4)
memory usage: 10.3+ MB


In [248]:
tweets["date"]

0         Mon May 18 01:08:27 PDT 2009
1         Sun May 31 06:23:17 PDT 2009
2         Mon Jun 01 11:52:54 PDT 2009
3         Sun May 17 02:45:34 PDT 2009
4         Tue Jun 02 00:08:07 PDT 2009
                      ...             
224989    Sat Jun 20 20:36:48 PDT 2009
224990    Mon Jun 01 01:25:45 PDT 2009
224991    Mon Jun 01 06:38:10 PDT 2009
224992    Fri Jun 19 08:51:56 PDT 2009
224993    Wed Jun 03 06:00:29 PDT 2009
Name: date, Length: 224994, dtype: object

The _date_ feature packs several pieces of information into a single string; they are extracted with the following lines of code.

In [249]:
tweets[["day_of_week", "month_of_year", "day_of_month", "time", "tz", "year"]] = tweets['date'].str.split(' ', expand=True)

In [250]:
tweets[["hour_of_day", "minute", "second"]] = tweets['time'].str.split(':', expand=True)
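
As an alternative to string splitting, the same components could probably be obtained by parsing the column into datetimes. This is only a sketch on two sample strings copied from the output above; it assumes every row ends in `PDT 2009` (so `PDT` can be matched as literal text) and an English locale for the day and month abbreviations. Note that `dt.dayofweek` counts from Monday=0, unlike the string labels used in this notebook.

```python
import pandas as pd

dates = pd.Series([
    "Mon May 18 01:08:27 PDT 2009",
    "Sun May 31 06:23:17 PDT 2009",
])

# "PDT" is treated as literal text in the format; the result is timezone-naive.
parsed = pd.to_datetime(dates, format="%a %b %d %H:%M:%S PDT %Y")

parts = pd.DataFrame({
    "day_of_week": parsed.dt.dayofweek,   # Monday=0 ... Sunday=6
    "month_of_year": parsed.dt.month,
    "day_of_month": parsed.dt.day,
    "hour_of_day": parsed.dt.hour,
})
print(parts)
```

With this approach the components are integers from the start, so no later string-to-integer conversion would be needed.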

At this point, the columns whose information has already been extracted can be removed.

In [251]:
tweets.drop(columns=["date", "time"], inplace=True)

In [252]:
tweets["flag"].unique()

array(['NO_QUERY'], dtype=object)

In [253]:
tweets["tz"].unique()

array(['PDT'], dtype=object)

In [254]:
tweets["year"].unique()

array(['2009'], dtype=object)

Since the dataset contains only dates in Pacific Daylight Time (PDT) and only for the year 2009, these two features are constant and can be dropped.
The flag feature has a single value (NO_QUERY), and minutes and seconds are assumed not to convey any useful information, so they can be removed as well.

In [255]:
tweets.drop(columns=["tz", "year", "minute", "second", "flag"], inplace=True)
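
The constant columns were found above by calling `unique()` one column at a time. A minimal sketch of a more general check, on a made-up frame (column values are illustrative): any column with a single distinct value carries no information for a classifier and can be listed automatically.

```python
import pandas as pd

# Toy frame mimicking the situation above (invented rows).
toy = pd.DataFrame({
    "flag": ["NO_QUERY", "NO_QUERY", "NO_QUERY"],
    "tz": ["PDT", "PDT", "PDT"],
    "hour_of_day": ["01", "06", "11"],
})

n_distinct = toy.nunique()                                   # distinct values per column
constant_columns = n_distinct[n_distinct == 1].index.tolist()
reduced = toy.drop(columns=constant_columns)
print(constant_columns)
```

This only detects strictly constant columns; dropping `minute` and `second` remains a modelling choice.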

In [256]:
#tweets["text"] = tweets["text"][tweets.duplicated(subset=["text"], keep="first")]

Instead of taking into account the specific hour, I decided that it is better to characterize each record by specifying whether it was written during night hours (from 18 to 5) or daylight hours (from 6 to 17).

_night_ is a boolean feature

In [257]:
tweets["night"] = (tweets["hour_of_day"].astype("int") >= 18) | (tweets["hour_of_day"].astype("int") <= 5)
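
A small sketch to check the boundaries of this rule on hand-picked hours (the sample values are chosen for illustration). It expresses night as "not between 6 and 17", which should be equivalent to the condition above for hours in the range 0 to 23.

```python
import pandas as pd

# Hours as zero-padded strings, as produced by the split on ":".
hours = pd.Series(["00", "05", "06", "17", "18", "23"]).astype(int)

# Series.between is inclusive on both ends, so 6 and 17 count as daylight.
night = ~hours.between(6, 17)
print(pd.DataFrame({"hour": hours, "night": night}))
```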

In [258]:
# tweets.drop(columns=["hour_of_day"], inplace=True)  # TODO: decide whether to remove it or not

In [259]:
tweets

Unnamed: 0,sentiment,ids,user,text,day_of_week,month_of_year,day_of_month,hour_of_day,night
0,1,1833972543,Killandra,"@MissBianca76 Yes, talking helps a lot.. going...",Mon,May,18,01,True
1,1,1980318193,IMlisacowan,SUNSHINE. livingg itttt. imma lie on the grass...,Sun,May,31,06,False
2,1,1994409198,yaseminx3,@PleaseBeMine Something for your iphone,Mon,Jun,01,11,False
3,0,1824749377,no_surprises,@GabrielSaporta couldn't get in to the after p...,Sun,May,17,02,True
4,0,2001199113,Rhi_ShortStack,@bradiewebbstack awww is andy being mean again...,Tue,Jun,02,00,True
...,...,...,...,...,...,...,...,...,...
224989,0,2261324310,CynthiaBuroughs,@Dropsofreign yeah I hope Iran people reach fr...,Sat,Jun,20,20,True
224990,1,1989408152,unitechy,Trying the qwerty keypad,Mon,Jun,01,01,True
224991,0,1991221316,Xaan,I love Jasper &amp; Jackson but that wig in th...,Mon,Jun,01,06,False
224992,0,2239702807,Ginger_Billie,I am really tired and bored and bleh! I feel c...,Fri,Jun,19,08,False


### Hashtag and mentioned user extraction

---
I doubt that this can actually be effective. How can we use this information
when several users are mentioned or when there are several hashtags?
For hashtags, something could be done (such as tf-idf or tf-df), whereas for users I really don't know.
---

In [153]:
# Raw strings avoid invalid-escape warnings; \w already covers digits, so \d is redundant.
tweets["hashtags"] = list(map(lambda t : re.findall(r"#\w+", t), tweets["text"]))
tweets["mentioned"] = list(map(lambda t : re.findall(r"@\w+", t), tweets["text"]))
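
One possible way to follow up on the tf-idf idea for hashtags is to treat each tweet's list of hashtags as an already tokenized document. The sketch below is an assumption about how this could be done, not something the notebook implements: the `identity` helper, the lowercasing step, and the sample texts are all hypothetical.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented sample tweets.
texts = [
    "loving the sun #Summer #beach",
    "no tags here",
    "#summer is over",
]

# Lowercasing is an added assumption so that #Summer and #summer coincide.
hashtag_lists = [re.findall(r"#\w+", t.lower()) for t in texts]

def identity(tokens):
    """Hypothetical helper: the input is already a list of tokens."""
    return tokens

# A callable analyzer receives each raw item (here, a list) and returns its tokens.
hashtag_vectorizer = TfidfVectorizer(analyzer=identity)
hashtag_matrix = hashtag_vectorizer.fit_transform(hashtag_lists)

print(sorted(hashtag_vectorizer.vocabulary_))
print(hashtag_matrix.toarray())
```

Tweets without hashtags become all-zero rows, which would be the large majority in a corpus where hashtags are rare.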

In [154]:
tweets["mentioned"]

0            [@MissBianca76]
1                         []
2            [@PleaseBeMine]
3          [@GabrielSaporta]
4         [@bradiewebbstack]
                 ...        
224989       [@Dropsofreign]
224990                    []
224991                    []
224992                    []
224993          [@alyshatan]
Name: mentioned, Length: 224994, dtype: object

In [155]:
tweets["hashtags"]

0         []
1         []
2         []
3         []
4         []
          ..
224989    []
224990    []
224991    []
224992    []
224993    []
Name: hashtags, Length: 224994, dtype: object

In [156]:
np.max(list(map(lambda x: len(x), tweets["hashtags"])))

24

In [157]:
np.mean(list(map(lambda x: len(x), tweets["hashtags"])))

0.03572539712170102

On average there are very few hashtags per tweet, although some tweets contain many of them (24 in the tweet with the most hashtags).
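
The same statistics can also be computed with the pandas string accessor, which works on list-valued columns as well. A minimal sketch on an invented series of hashtag lists:

```python
import pandas as pd

# Invented hashtag lists, one per tweet.
hashtags = pd.Series([["#a", "#b"], [], ["#c"]])

counts = hashtags.str.len()              # number of hashtags per tweet
share_with_hashtag = (counts > 0).mean() # fraction of tweets with at least one hashtag
print(counts.max(), counts.mean(), share_with_hashtag)
```

The share of tweets with at least one hashtag is an extra statistic that may help decide whether the feature is worth keeping.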

## Text mining

In [97]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords as sw
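# Note: word_tokenize, WordNetLemmatizer and the stop-word list need the NLTK
# "punkt", "wordnet" and "stopwords" resources (see nltk.download).
class LemmaTokenizer(object):
    """Callable tokenizer: splits a document into tokens and lemmatizes each one."""
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
    def __call__(self, document):
        lemmas = []
        for t in word_tokenize(document):
            t = t.strip()
            lemma = self.lemmatizer.lemmatize(t)
            lemmas.append(lemma)
        return lemmas

lemmaTokenizer = LemmaTokenizer()
# use_idf=False keeps L2-normalized term frequencies only (no IDF weighting);
# min_df=0.001 discards tokens that appear in fewer than 0.1% of the tweets.
vectorizer = TfidfVectorizer(tokenizer=lemmaTokenizer, stop_words=sw.words('english'), strip_accents="ascii", use_idf=False, min_df=0.001)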
tfidf = vectorizer.fit_transform(tweets["text"])



In [158]:
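# get_feature_names() was removed in recent scikit-learn releases; get_feature_names_out() replaces it.
tweets_text_tfidf = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())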
tweets_text_tfidf

Unnamed: 0,!,#,$,%,&,','d,'ll,'m,'re,...,yet,yo,young,youngq,youtube,yr,yummy,yup,|,~
0,0.000000,0.0,0.0,0.0,0.377964,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
224989,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
224990,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
224991,0.000000,0.0,0.0,0.0,0.408248,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
224992,0.447214,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
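
Because the vectorizer was built with `use_idf=False`, each row holds term counts scaled to unit L2 norm rather than true tf-idf weights. The sketch below illustrates this on two invented documents with default tokenization (no lemmatizer or stop words, which is a simplification): a document with five distinct tokens, each occurring once, gets the weight 1/sqrt(5) ≈ 0.447214 for every token, a value that also appears in the table above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented documents; default tokenization is used for simplicity.
docs = ["tired bored sad cold sick", "happy happy day"]

tf_only = TfidfVectorizer(use_idf=False)    # term frequency + L2 normalization
weights = tf_only.fit_transform(docs).toarray()

nonzero_first_doc = weights[0][weights[0] > 0]
row_norms = np.linalg.norm(weights, axis=1)
print(nonzero_first_doc)
print(row_norms)
```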


In [159]:
tweets = pd.concat((tweets, tweets_text_tfidf), axis=1)
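
Converting the term matrix to a dense DataFrame stores every zero explicitly, which can be costly with about 225k rows and many columns. A possible alternative, sketched here under the assumption that the downstream model accepts SciPy sparse input, is to keep the text features sparse and stack them next to the other numeric features. The sample texts and numeric values are invented.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented sample data.
texts = ["good morning sunshine", "so tired and bored", "good night"]
numeric = np.array([[1, 6], [5, 8], [7, 23]])   # e.g. day_of_week, hour_of_day

text_matrix = TfidfVectorizer(use_idf=False).fit_transform(texts)  # SciPy sparse matrix

# Stack the numeric columns and the text columns side by side, staying sparse.
X_sparse = sparse.hstack([sparse.csr_matrix(numeric), text_matrix]).tocsr()
print(X_sparse.shape)
```

The trade-off is that column names are no longer attached to the matrix and have to be tracked separately.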

---
Could we try to extract useful information from the usernames?
To be implemented later...
---

In [160]:
tweets.drop(columns=["user", "text", "mentioned", "hashtags"], inplace=True)

In [161]:
tweets

Unnamed: 0,sentiment,ids,day_of_week,month_of_year,day_of_month,hour_of_day,night,!,#,$,...,yet,yo,young,youngq,youtube,yr,yummy,yup,|,~
0,1,1833972543,Mon,May,18,01,True,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,1980318193,Sun,May,31,06,False,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,1994409198,Mon,Jun,01,11,False,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,1824749377,Sun,May,17,02,True,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,2001199113,Tue,Jun,02,00,True,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
224989,0,2261324310,Sat,Jun,20,20,True,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
224990,1,1989408152,Mon,Jun,01,01,True,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
224991,0,1991221316,Mon,Jun,01,06,False,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
224992,0,2239702807,Fri,Jun,19,08,False,0.447214,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Preprocessing

---
TODO: implement a weekday/weekend feature

---

Map the days of the week into integers

In [162]:
day_of_week_dict = {"Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5, "Sat": 6, "Sun": 7}
tweets["day_of_week"] = tweets["day_of_week"].map(day_of_week_dict)
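
With the integer encoding above (Mon=1 through Sun=7), the weekday/weekend flag mentioned in the note could be derived with a single comparison. This is a minimal sketch on invented values; the name `is_weekend` is hypothetical, and it assumes Saturday and Sunday are the weekend days.

```python
import pandas as pd

# Invented day codes using the mapping Mon=1 ... Sun=7.
day_of_week = pd.Series([1, 5, 6, 7])

is_weekend = day_of_week >= 6    # True for Saturday (6) and Sunday (7)
print(is_weekend.tolist())
```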

Map the months of the year into integers

In [163]:
months_dict = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6, "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}
tweets["month_of_year"] = tweets["month_of_year"].map(months_dict)

Convert day and hours into integers

In [164]:
tweets["day_of_month"] = tweets["day_of_month"].astype(int)

In [165]:
tweets["hour_of_day"] = tweets["hour_of_day"].astype(int)

Normalize _ids_ attribute

In [None]:
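# Scale only the ids column to [0, 1]. The ColumnTransformer version targeted column 0
# (sentiment) and returned the whole matrix, which cannot be assigned to a single column.
tweets["ids"] = MinMaxScaler().fit_transform(tweets[["ids"]]).ravel()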
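
Min-max scaling maps each value x to (x - min) / (max - min), so the smallest id becomes 0 and the largest becomes 1. A minimal sketch on invented ids, comparing `MinMaxScaler` with the formula written out by hand:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented values for illustration.
toy = pd.DataFrame({"ids": [100, 150, 200], "hour_of_day": [1, 6, 11]})

# The scaler expects a 2-D input, hence the double brackets; ravel() flattens the result.
scaled = MinMaxScaler().fit_transform(toy[["ids"]]).ravel()

# Same transformation written out explicitly.
manual = (toy["ids"] - toy["ids"].min()) / (toy["ids"].max() - toy["ids"].min())
print(scaled, manual.tolist())
```

If the model is later evaluated on separate data, the scaler would normally be fitted on the training portion only and then reused to transform the rest.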

In [173]:
y = tweets.pop("sentiment")
y

0         1
1         1
2         1
3         0
4         0
         ..
224989    0
224990    1
224991    0
224992    0
224993    1
Name: sentiment, Length: 224994, dtype: int64

In [180]:
X = tweets
X

Unnamed: 0,ids,day_of_week,month_of_year,day_of_month,hour_of_day,night,!,#,$,%,...,yet,yo,young,youngq,youtube,yr,yummy,yup,|,~
0,0.42508,1,5,18,1,True,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.594974,7,5,31,6,False,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.611332,1,6,1,11,False,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.414373,7,5,17,2,True,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.619215,2,6,2,0,True,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
224989,0.921197,6,6,20,20,True,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
224990,0.605527,1,6,1,1,True,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
224991,0.607632,1,6,1,6,False,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
224992,0.896096,5,6,19,8,False,0.447214,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
