# Text mining & Search Project

### Università degli Studi di Milano-Bicocca  2020/2021

**Luzzi Federico** (matricola) **Peracchi Marco** 800578

# Text Exploration

In this phase we explore the available data.

### Libraries

In [3]:
# Core libraries
import nltk
import pandas as pd
import re
import string
import matplotlib.pyplot as plt
import sklearn
from wordcloud import WordCloud

In [4]:
# Libraries for text tokenization
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import BlanklineTokenizer

In [5]:
# Libraries for stemming and lemmatization
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [6]:
# Libraries for text representation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [7]:
# Download the required NLTK resources
nltk.download("stopwords")
stop_words = nltk.corpus.stopwords.words("english")
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Marco\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Marco\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Marco\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Dataset

The dataset has the following columns:
- **count**: the number of users who rated the tweet
- **hate_speech**: the number of users who judged the tweet to contain hate speech
- **offensive_language**: the number of users who flagged the tweet as offensive
- **neither**: the number of users who flagged the tweet as neutral
- **class**: the assigned label: 0 for hate speech, 1 for offensive language, 2 for neither
- **tweet**: the text of the tweet
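
The relation between the vote columns and the label can be checked on the rows displayed in this notebook. The sketch below assumes that `class` is the category with the most votes and that the three vote columns add up to `count`. Both hold for the rows shown here, but neither is stated by the dataset description, so treat them as assumptions. `majority_label` is a hypothetical helper, not part of the notebook.

```python
# Vote counts copied from rows displayed in this notebook:
# (count, hate_speech, offensive_language, neither, class)
rows = [
    (3, 0, 0, 3, 2),  # row 0
    (3, 0, 3, 0, 1),  # row 1
    (3, 0, 2, 1, 1),  # row 3
    (6, 0, 6, 0, 1),  # row 4
    (9, 7, 1, 1, 0),  # row 6171
]


def majority_label(hate_speech, offensive_language, neither):
    """Hypothetical helper: index of the category with the most votes."""
    votes = [hate_speech, offensive_language, neither]
    return votes.index(max(votes))


for count, hate, offensive, neither, label in rows:
    # The three vote columns add up to `count` in every displayed row.
    assert hate + offensive + neither == count
    # Assumption being checked: `class` equals the majority category.
    assert majority_label(hate, offensive, neither) == label

print("All displayed rows are consistent with a majority vote.")
```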

In [8]:
df = pd.read_csv("data/labeled_data.csv", sep = ',').drop("Unnamed: 0", axis=1)
df.head()

| | count | hate_speech | offensive_language | neither | class | tweet |
|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 0 | 3 | 2 | !!! RT @mayasolovely: As a woman you shouldn't... |
| 1 | 3 | 0 | 3 | 0 | 1 | !!!!! RT @mleew17: boy dats cold...tyga dwn ba... |
| 2 | 3 | 0 | 3 | 0 | 1 | !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... |
| 3 | 3 | 0 | 2 | 1 | 1 | !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... |
| 4 | 6 | 0 | 6 | 0 | 1 | !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... |


In [10]:
# Column data types
df.dtypes

count                  int64
hate_speech            int64
offensive_language     int64
neither                int64
class                  int64
tweet                 object
dtype: object

In [11]:
# Descriptive statistics of the dataset
df.describe()

| | count | hate_speech | offensive_language | neither | class |
|---|---|---|---|---|---|
| count | 24783.0 | 24783.0 | 24783.0 | 24783.0 | 24783.0 |
| mean | 3.243473 | 0.280515 | 2.413711 | 0.549247 | 1.110277 |
| std | 0.88306 | 0.631851 | 1.399459 | 1.113299 | 0.462089 |
| min | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 3.0 | 0.0 | 2.0 | 0.0 | 1.0 |
| 50% | 3.0 | 0.0 | 3.0 | 0.0 | 1.0 |
| 75% | 3.0 | 0.0 | 3.0 | 0.0 | 1.0 |
| max | 9.0 | 7.0 | 9.0 | 9.0 | 2.0 |


In [12]:
# Label distribution
df["class"].value_counts()

1    19190
2     4163
0     1430
Name: class, dtype: int64

The label distribution shows a clear class imbalance: class 1 (offensive language) accounts for 19,190 of the 24,783 tweets (about 77%), while class 0 (hate speech) has only 1,430.
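
To quantify the imbalance, the sketch below computes the share of each class and a set of "balanced" class weights, `n_samples / (n_classes * class_count)`. Weighting is only one possible way to compensate for imbalance and is an assumption here: the notebook does not say that it will be used. The counts are copied from the `value_counts()` output above.

```python
from collections import Counter

# Label counts copied from the value_counts() output above.
label_counts = Counter({1: 19190, 2: 4163, 0: 1430})
n_samples = sum(label_counts.values())
n_classes = len(label_counts)

# Share of each class in the dataset.
shares = {label: n / n_samples for label, n in label_counts.items()}

# "Balanced" weights: rare classes receive proportionally larger weights.
class_weights = {
    label: n_samples / (n_classes * n) for label, n in label_counts.items()
}

for label in sorted(label_counts):
    print(f"class {label}: share={shares[label]:.3f}, weight={class_weights[label]:.3f}")
```

The rarest class (0, hate speech) receives the largest weight, so misclassifying it would cost more during training if these weights were passed to a classifier.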

In [16]:
# Example of a tweet labeled as hate speech (class 0)
df["tweet"].loc[6171]

'@infidelpamelaLC I\'m going to blame the black man, since they always blame "whitey" I\'m an equal opportunity hater.'

In [17]:
df.loc[6171]

count                                                                 9
hate_speech                                                           7
offensive_language                                                    1
neither                                                               1
class                                                                 0
tweet                 @infidelpamelaLC I'm going to blame the black ...
Name: 6171, dtype: object

## Test sentence

In [6]:
text = df["tweet"][24]
text

'" got ya bitch tip toeing on my hardwood floors " &#128514; http://t.co/cOU2WQ5L4q'

## Tokenization

In [20]:
tt = TweetTokenizer()
wpt = WordPunctTokenizer()
blt = BlanklineTokenizer()

In [21]:
print(tt.tokenize(text.lower())) # Tweet
print(wpt.tokenize(text.lower())) # Wordpunct
print(blt.tokenize(text.lower())) # Blankline

['"', "don't", 'make', 'me', 'make', 'you', 'fall', 'in', 'live', 'with', 'a', 'nigga', 'like', 'meee', '...', '"', 'the', 'birds', '1', '&', '2', 'are', 'my', 'favorite', 'songs', 'by', 'weeknd']
['"', 'don', "'", 't', 'make', 'me', 'make', 'you', 'fall', 'in', 'live', 'with', 'a', 'nigga', 'like', 'meee', '..."', 'the', 'birds', '1', '&', 'amp', ';', '2', 'are', 'my', 'favorite', 'songs', 'by', 'weeknd']
['"don\'t make me make you fall in live with a nigga like meee..." the birds 1&amp;2 are my favorite songs by weeknd']
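
The different splits above follow from the rule each tokenizer applies. As a rough approximation (an assumption, not the NLTK source), `WordPunctTokenizer` behaves like a regular expression that alternates between runs of word characters and runs of punctuation, while `BlanklineTokenizer` only splits on blank lines. `wordpunct_like` and `blankline_like` are hypothetical stand-ins written with the standard library only.

```python
import re


def wordpunct_like(text):
    """Hypothetical stand-in: runs of word characters or runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


def blankline_like(text):
    """Hypothetical stand-in: split only where a blank line separates blocks."""
    return [part for part in re.split(r"\s*\n\s*\n\s*", text) if part]


sample = "don't make me fall... the birds 1&amp;2"
print(wordpunct_like(sample))
print(blankline_like(sample))
print(blankline_like("first block\n\nsecond block"))
```

Because a tweet rarely contains a blank line, the blank-line split returns the whole tweet as a single token, which is why it is of little use on this dataset.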


## Removing stopwords

In [27]:
tokens_text = tt.tokenize(text)
# Keep only the tokens whose lowercase form is not an English stopword
remove_sw = [token for token in tokens_text if token.lower() not in stop_words]
print(remove_sw)

['"', 'make', 'make', 'fall', 'live', 'nigga', 'like', 'meee', '...', '"', 'birds', '1', '&', '2', 'favorite', 'songs', 'Weeknd']


## Remove numbers

This step should run before stopword removal.

In [30]:
remove_num = re.sub(r'\d+', '', text)
print(remove_num)

"Don't make me make you fall in live with a nigga like meee..." The birds &amp; are my favorite songs by Weeknd


## Remove punctuations

This step should also run before stopword removal.

In [83]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [31]:
remove_punc = text.translate(str.maketrans('', '', string.punctuation))
print(remove_punc)

Dont make me make you fall in live with a nigga like meee The birds 1amp2 are my favorite songs by Weeknd


## Remove Extra spaces

Quite useful in practice.

In [32]:
remove_sp = re.sub(r'\s+', ' ', text).strip()  # Collapse each whitespace run into one space
print(remove_sp)

"Don't make me make you fall in live with a nigga like meee..." The birds 1&amp;2 are my favorite songs by Weeknd


## WordCloud

In [None]:
def plot_wordcloud(cnt, file_name="figure1.png"):
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white")
    wordcloud.generate_from_frequencies(dict(cnt))
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    # plt.show()
    plt.savefig(file_name)
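
`plot_wordcloud` expects `cnt` to map each word to its frequency. A minimal sketch of how such a mapping could be built with `collections.Counter` follows. The three cleaned tweets are taken from the processed output later in this notebook, and the final call is left commented out because it needs the `wordcloud` package and writes a file.

```python
from collections import Counter

# Cleaned tweets, i.e. space-separated tokens after preprocessing.
cleaned_tweets = [
    "true early bird get worm first",
    "generate ascii box nodejs see ability bring flappy bird node via ascii tube bird",
    "hick raver venn diagram large intersection",
]

# Count how often each token occurs across all tweets.
cnt = Counter(token for tweet in cleaned_tweets for token in tweet.split())
print(cnt.most_common(3))

# plot_wordcloud(cnt, file_name="figure1.png")
```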

## Stemming

In [None]:
tokenized_text = WordPunctTokenizer().tokenize(text)

porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokenized_text]

## Stanford

In [37]:

import stanfordnlp

# stanfordnlp.download('en')   # This downloads the English models. Run only once!
base = "C:\\Users\\Marco\\stanfordnlp_resources\\en_ewt_models\\"
config = {
    'processors': 'tokenize,pos,lemma,depparse', # Comma-separated list of processors to use
    'lang': 'en', # Language code for the language to build the Pipeline in
    'model_path': base + 'en_ewt_tokenizer.pt', 
    'pos_model_path': base + 'en_ewt_tagger.pt', 
    'pos_pretrain_path': base + 'en_ewt.pretrain.pt', 
    'lemma_model_path': base + 'en_ewt_lemmatizer.pt', 
    'depparse_model_path': base + 'en_ewt_parser.pt', 
    'depparse_pretrain_path': base + 'en_ewt.pretrain.pt', 
    'tokenize_pretokenized': False # Use pretokenized text as input and disable tokenization
}

nlp = stanfordnlp.Pipeline(**config)

doc = nlp("this is a simple text")
all_information_tab_separated = doc.conll_file.conll_as_string()

parsed_text = "\n".join(
    f'text: {word.text} \tlemma: {word.lemma}\tupos: {word.upos}\tdependency: {word.dependency_relation}'
    for sent in doc.sentences
    for word in sent.words
)

Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': 'C:\\Users\\Marco\\stanfordnlp_resources\\en_ewt_models\\en_ewt_tokenizer.pt', 'pretokenized': False, 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: pos
With settings: 
{'model_path': 'C:\\Users\\Marco\\stanfordnlp_resources\\en_ewt_models\\en_ewt_tagger.pt', 'pretrain_path': 'C:\\Users\\Marco\\stanfordnlp_resources\\en_ewt_models\\en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': 'C:\\Users\\Marco\\stanfordnlp_resources\\en_ewt_models\\en_ewt_lemmatizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
[Running seq2seq lemmatizer with edit classifier]
---
Loading: depparse
With settings: 
{'model_path': 'C:\\Users\\Marco\\stanfordnlp_resources\\en_ewt_models\\en_ewt_parser.pt', 'pretrain_pa

  unlabeled_scores.masked_fill_(diag, -float('inf'))


In [41]:
print(dir(doc))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_conll_file', '_sentences', '_text', 'conll_file', 'load_annotations', 'sentences', 'text', 'write_conll_to_file']


In [59]:
print(dir(doc.sentences[0].words[0]))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_dependency_relation', '_feats', '_governor', '_index', '_lemma', '_parent_token', '_text', '_upos', '_xpos', 'dependency_relation', 'feats', 'governor', 'index', 'lemma', 'parent_token', 'pos', 'text', 'upos', 'xpos']


In [55]:
doc.sentences[0].words[0].text

'this'

In [57]:
doc.sentences[0].words[1].lemma

'be'

In [67]:
doc.sentences[0].words[2].upos

'DET'

## Functions

In [18]:
# 1
def preprocessing(text):
    text = text.lower() # Lowering case
    remove_url = re.sub(r'(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})', ' ', text) # Removing url
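    remove_retweet = re.sub(r"@\w+", " ", remove_url) # Remove @mentions (including retweeted handles)
    remove_retweet = re.sub(r"&\w+", " ", remove_retweet) # Remove named HTML entities such as &amp
    remove_retweet = re.sub(r"\b([!#\$%&\\\(\)\*\+,-\./:;<=>\?@\[\]\^_`\{|\}\"~]+)\b", " ", remove_retweet) # Replace punctuation between two word characters with a space (still to be checked)
    remove_retweet = re.sub(r"([a-z])\1{3,}", r"\1", remove_retweet) # Collapse a letter repeated 4+ times into one ("ewwww" -> "ew")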
    remove_punc = remove_retweet.translate(str.maketrans('', '', string.punctuation)) # Remove punctuation
    final_text = re.sub(r'\d+', ' ', remove_punc) # Remove number 
    final_text = re.sub(r'\s+', ' ', final_text) # Removing exceeding spaces
    return final_text
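
Some of the regular expressions in `preprocessing` are easier to follow in isolation. The sketch below applies three of them to a made-up string. The punctuation class is deliberately simplified to `[.,!?]` and a final `strip()` is added, so this is an illustration under those assumptions and not a replacement for the function above.

```python
import re

sample = "@user ewwww meee cold...tyga"

# 1. Replace @mentions with a space.
step1 = re.sub(r"@\w+", " ", sample)

# 2. Replace punctuation between two word characters with a space
#    (simplified character class, assumed for this illustration).
step2 = re.sub(r"\b([.,!?]+)\b", " ", step1)

# 3. Collapse a letter repeated four or more times into a single letter.
#    "meee" has only three repeated letters, so it is left unchanged.
step3 = re.sub(r"([a-z])\1{3,}", r"\1", step2)

# 4. Collapse whitespace runs and trim the ends.
result = re.sub(r"\s+", " ", step3).strip()
print(result)
```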

In [19]:
#2
def tokenization(text_clean, tok = "tweet"):
    if tok == "tweet": # TweetTokenizer
        tokenized_text = TweetTokenizer().tokenize(text_clean)
    elif tok == "wordpunct": # WordPunctTokenizer
        tokenized_text = WordPunctTokenizer().tokenize(text_clean)
    else: # Fail early instead of returning an unbound variable
        raise ValueError(f"Unknown tokenizer: {tok!r}")
    return tokenized_text

In [20]:
#3
def remove_stopwords(tokenized_text):
    # "rt" (the retweet marker) is treated as an extra stopword.
    # The set is built once, without growing the global list on every token.
    sw = set(stop_words) | {"rt"}
    return [token for token in tokenized_text if token.lower() not in sw]

In [21]:
# 4
# pos-tagging (1 document)
def pos_tagging(doc_token):
    return nltk.pos_tag(doc_token)

# conversion of POS tags (Penn Treebank -> WordNet)
def get_wordnet_pos(word_tag):
    if word_tag.startswith('J'):
        return "a"
    elif word_tag.startswith('V'):
        return "v"
    elif word_tag.startswith('R'):
        return "r"
    else:
        return "n"
    
# lemmatizer one word 
def lemmatizer(word):
    pos = get_wordnet_pos(word[1])
    wnl = WordNetLemmatizer()
    return wnl.lemmatize(word[0], pos = pos)

# lemmatizer one document
def lemmatizer_doc(doc_token):
    lemmas = [] 
    
    pos_document = pos_tagging(doc_token) # pos tagging
    for token in pos_document:
        lemmas.append( lemmatizer(token) ) # lemmatization x word
    
    return lemmas

In [22]:
# 5
def stemmer(tokenized_text):
    porter = PorterStemmer()
    return [porter.stem(word) for word in tokenized_text]

In [50]:
text = df["tweet"][1432]
print(text)
text_prep = preprocessing(text)
print("1. PREPROCESSING:\n", text_prep)
text_prep = tokenization(text_prep)
print("2. TOKENIZATION:\n", text_prep)
text_prep = remove_stopwords(text_prep)
print("3. STOP WORDS:\n", text_prep)
text_prep = lemmatizer_doc(text_prep)
print("4. LEMMATIZATION:\n", text_prep)
text_prep = stemmer(text_prep)
print("5. STEMMER:\n", text_prep)

&#8220;@Nadixo__: "@joelxmontoya: I'm dreaming &#128525;&#128525; http://t.co/Y21PyZ5rai" ewwww&#8221;bitch you must be fucked up &#128514;&#128514;
1. PREPROCESSING:
  im dreaming ew bitch you must be fucked up 
2. TOKENIZATION:
 ['im', 'dreaming', 'ew', 'bitch', 'you', 'must', 'be', 'fucked', 'up']
3. STOP WORDS:
 ['im', 'dreaming', 'ew', 'bitch', 'must', 'fucked']
4. LEMMATIZATION:
 ['im', 'dream', 'ew', 'bitch', 'must', 'fuck']
5. STEMMER:
 ['im', 'dream', 'ew', 'bitch', 'must', 'fuck']


In [23]:
def processing(text):
    text_prep = preprocessing(text)
    text_prep = tokenization(text_prep)
    text_prep = remove_stopwords(text_prep)
    text_prep = lemmatizer_doc(text_prep)
    #text_prep = stemmer(text_prep)
    text_prep = " ".join(text_prep)
    print(text_prep)
    return text_prep

In [32]:
df_red = df[0:1000]

In [33]:
df_red["tweet_clean"] = df_red["tweet"].apply(lambda x : processing(x))

woman shouldnt complain cleaning house man always take trash
boy dat cold tyga dwn bad cuffin dat hoe st place
dawg ever fuck bitch start cry confuse shit
look like tranny
shit hear might true might faker bitch tell ya
shit blow claim faithful somebody still fuck hoe
sit hate another bitch get much shit go
cause im tire big bitch come u skinny girl
might get ya bitch back thats
hobby include fight mariam bitch
keeks bitch curve everyone lol walk conversation like smh
murda gang bitch gang land
hoe smoke loser yea go ig
bad bitch thing like
bitch get
bitch nigga miss
bitch plz whatever
bitch love
bitch get cut everyday b
black bottle bad bitch
break bitch cant tell nothing
cancel bitch like nino
cant see hoe wont change
fuck bitch dont even suck dick kermit video bout fuck ig
get ya bitch tip toe hardwood floor
pussy lip like heaven door
hoe hit
meet pussy ocean dr give pussy pill
need trippy bitch fuck hennessy
spend money want bitch business
txt old bitch new bitch pussy wetter
id say

dese hoe lyin u nigga
oh yeah fuck bitch
eli man threw nfl lead th interception season lmao trash
usc senior rb anthony brown call head coach steve sarkisian racist quits team he bitch
guess also zero time mansplaining adios stfu pussy
chill sexy scally lad
aint share bitch mine mine
money impress simple break bitch
pisces never anybodys bitch dont let people control change rigghhht
tuku ho idk blah blah blah
im mad cuh hahaha suure tuku ho ita
hoe crazy
aint ever gonna slice bitch kevin hart
oomf fuckin fine tho dam dont bitch
wonder girl give head boy drink magaluf twitter probly send nude fag
newsatquestions yeah bitch yeah bitch call maybe b steve c later jesse pinkman
hickies thigh right pussy right
would still love im longer young beautiful ugly monkey as
yall come front bitch aint come
faggot bitch
yea im ipad dont work friday let ball faggot work
scally lad would get
gerryshalloweenparty halloween yesterday stupid nigger
chrus still supahot tho walk teacher test drop test bitch

everything look well bitch next
facto care hoe ass look like
every bad bitch nigga tire fucking
foreign chick lie man bitch beautiful
fresh nigga kno need clean bitch kno im work cuz wanna see team rich
fuck make rule suck dragon ball bitch call goku
generate ascii box nodejs see ability bring flappy bird node via ascii tube bird
get shit bitch fine bear
get as back detroit ya fuckin wigger
girl date okay aint mad yo unless stabbed heart love hoe
know twitter password bitch please
see show see hoe wasnt hot would thick
bitch hitting hoe
married white woman lucky sum bitch uncle ruckus
wanna nigga im trinna saving account omg imma funny as bitch man thats real shit
great grandma hoe grandma hoe mom hoe shes hoe long line hoe grandpa carl everyone
hey go look video man find kidnap girl ohio nigger shitmybosssays
hey pussy still
draft gung ho folk send afghanistan war zone show
hotter bitch
u talk hoe bout bro n em talk hoe bout bro n em
crazy bitch hilarious
know shoe dirty fuck whatever

fuck pussy as hater go suck dick die quick someone youtube comment think wed good friend
hick raver venn diagram large intersection
hoe suck dick cu look like john stockton hit em wit choppa luke fatherrr
wasnt work place id deck aye fuck right id give tyson combo nd end uppercut shut fuck bitch
suck dick dip shes keeper true redneck woman kims dad
im gonna rip bitch apart rip apart lol
britney bitch bible leviticus
nigga bitch bitch as nigga dike as hoe black as bright as hoe fag tag scally wag
pussy tryna go studio dont call mufuckin crib like god yo word god bus yo shit god word
oh ya hoe think cute skin tight cat suit assume body boomin dispute
omg bitch fuck stupid swear blah blah blah week later omg ilysm bae ur best friend swear
way fuck bitch name lord mr race
poor whitey
rid bros still dont trust hoe
weekend go fabulous naa man youre see nae cunt theyll koed second flat hate u fuckgm
whatever good one abraham lincoln quote yall hoe
guy say latinas mean kim kardashian look girl

theboondocks nicca
truth david charlie brown scandal
son bitch moment finally get bed bladder decides time piss
son bitch moment turn radio favorite song end song
son bitch moment walk house dark stub little toe wall
thatfuckmemoment leave car window rain car look like hurricane katrina run bitch
themostannoyingthingsinlife people never happy bitch n moan time
thingsiwillteachmychild play deez nigga n bitch dat snake
thingsnottodoonafirstdate give da dick u gotta half stroke da pussy bc u dump dick dat bitch gon become extremely anoyin
tjohngang lil bitch follow
tomyfutureson fuck dem hoe well love dem hoe
trayvonmartin refer zimmerman creepy as cracker racist thug
tweetlikepontiacholmes pontiac sprinkler nigga nigga nigga nigga spic spic spic spic nigga nigga nigga nigga
tweetlikeyourbestfriend make bitch
umightnotgetin fucc nicca beerandtacos wethelastonesleft
vinitahegwood get job naacp ag hear like diversity tolerance long aint cracker tcot
virginia full white trash
wearerepublicno

bid go date zack morris oh dreamy son bitch
bud lites well shots bar tn plus im give away trip la vega youre cunt come visit bourbon st
torch drain cause either high lose one nigga fin hoe fuck ittt
killer line josh smith drop killer trash talk line kenneth
piece pussy
love salad anyway fuck hoe eat salad
plus sheryl crow
every bitch n passenger seat dun fuck least one time
true early bird get worm first
dirty whore love hoe love jesus watching
hoe get school shirt suck dick
lotta fck nigga twitter really talk like bitch
bitch start shoot party fuck new orleans
bitch give back rub
never coon grow bokoo coon tho never coon shit tho
nigga smash hoe hoe pay condo still yep fucboy
guiltypleasure watch nigger fight youtube vine ig
dont like rest bitch wait give fuck
id rather call nigger uncle tom mark
bet yall bitch deaf mc lytes word treat like queen huh
hi ho
onea yall b tches gon end dead screen shot thinkn sh funny aint many niccas gon laugh b
congrats youve turn hoe housewife dont get

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [72]:
df.to_csv("processing_dataset.csv")
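
The `SettingWithCopyWarning` shown above appears because `df_red` is a slice of `df`, and pandas cannot tell whether the new column should also affect the original frame. A minimal sketch of one way to avoid it is to take an explicit copy of the slice. The three-row frame and the `str.lower()` call are hypothetical stand-ins for the real dataset and for `processing`.

```python
import pandas as pd

# Tiny stand-in for the real dataset (hypothetical rows).
df = pd.DataFrame({"tweet": ["Hello World", "Another TWEET", "third one"]})

# An explicit copy makes df_red independent of df, so adding a column
# is unambiguous and the original frame is left untouched.
df_red = df.iloc[0:2].copy()
df_red["tweet_clean"] = df_red["tweet"].str.lower()

print(df_red)
```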

### Interesting rows

- 1432
- 24
- 3129
- 530
- 19999
- 998
- 1605
- 4567
- 555

## Bag of words

In [34]:
df_red.head()

| | count | hate_speech | offensive_language | neither | class | tweet | tweet_clean |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 0 | 3 | 2 | !!! RT @mayasolovely: As a woman you shouldn't... | woman shouldnt complain cleaning house man alw... |
| 1 | 3 | 0 | 3 | 0 | 1 | !!!!! RT @mleew17: boy dats cold...tyga dwn ba... | boy dat cold tyga dwn bad cuffin dat hoe st place |
| 2 | 3 | 0 | 3 | 0 | 1 | !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... | dawg ever fuck bitch start cry confuse shit |
| 3 | 3 | 0 | 2 | 1 | 1 | !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... | look like tranny |
| 4 | 6 | 0 | 6 | 0 | 1 | !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... | shit hear might true might faker bitch tell ya |


In [39]:
corpus = df_red["tweet_clean"]

In [46]:
df_red["tweet_clean"].loc[1]

'boy dat cold tyga dwn bad cuffin dat hoe st place'

In [80]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
#print(vectorizer.get_feature_names())
#for vec in X.toarray():
#    print(vec)
# Entries equal to 2 in the count vector of document 1 ("dat" occurs twice)
X.toarray()[1][X.toarray()[1] == 2]

array([2], dtype=int64)

## Binary count vector

In [79]:
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(corpus)
#print(vectorizer.get_feature_names())

#for vec in X.toarray():
#    print(vec)

# With binary=True every non-zero entry is 1, so no entry equals 2
X.toarray()[1][X.toarray()[1] == 2]

array([], dtype=int64)
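
The difference between the two outputs above (`array([2])` with raw counts, an empty array with `binary=True`) can be reproduced by hand. The sketch below uses only the standard library on a fragment of the cleaned tweet at index 1 and splits on spaces, which is an assumption: `CountVectorizer` applies its own token pattern, so results may differ on other inputs.

```python
from collections import Counter

# Fragment of the cleaned tweet at index 1 ("dat" occurs twice).
doc = "boy dat cold tyga dwn bad cuffin dat"

# Count representation: how many times each term occurs.
counts = Counter(doc.split())

# Binary representation: 1 if the term occurs at all.
binary = {term: 1 for term in counts}

print([c for c in counts.values() if c == 2])
print([b for b in binary.values() if b == 2])
```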

## TF-IDF

In [82]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn versions older than 1.0
for vec in X.toarray():
    print(["{:.3f}".format(v) for v in vec])

['abc', 'ability', 'able', 'abortion', 'abraham', 'absurd', 'accept', 'accessory', 'account', 'across', 'act', 'actin', 'action', 'actor', 'actual', 'actually', 'adidas', 'adios', 'admit', 'adorable', 'advantage', 'advice', 'af', 'afghanistan', 'afterearth', 'ag', 'agg', 'agnes', 'agree', 'ah', 'ahead', 'ahmesehwetness', 'ahoy', 'aid', 'ainn', 'ainna', 'aint', 'aintnolevelz', 'air', 'airport', 'ajmi', 'ajumma', 'aka', 'al', 'alarm', 'albino', 'alexa', 'alexfromtarget', 'alicia', 'alive', 'allahs', 'allegation', 'allisons', 'allow', 'alls', 'alone', 'along', 'alot', 'already', 'alright', 'alsarabsss', 'also', 'alu', 'alway', 'always', 'amateur', 'amaze', 'amen', 'america', 'american', 'amo', 'amos', 'anal', 'andrewbryant', 'android', 'andy', 'angelique', 'angry', 'animal', 'anncoulter', 'annoy', 'anonymous', 'another', 'anoyin', 'ant', 'anthem', 'anthony', 'antonio', 'anybody', 'anybodys', 'anymore', 'anyone', 'anything', 'anyway', 'anyways', 'anywere', 'ap', 'apart', 'ape', 'app', 'app
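
To make the weights in the TF-IDF matrix less opaque, the sketch below computes them by hand on a hypothetical three-document corpus. It assumes a smoothed inverse document frequency, $\ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, followed by L2 normalisation of each document vector, which should correspond to the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`). Treat that correspondence as an assumption and check it against the installed scikit-learn version.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-cleaned documents.
corpus = [
    "early bird get worm",
    "bird see bird",
    "worm see worm first",
]
docs = [doc.split() for doc in corpus]
vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df_counts = Counter(tok for doc in docs for tok in set(doc))

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}


def tfidf_vector(doc):
    """Raw term counts times idf, scaled to unit L2 norm."""
    tf = Counter(doc)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


print(vocab)
for doc in docs:
    print(["{:.3f}".format(v) for v in tfidf_vector(doc)])
```

Terms that occur in fewer documents (such as `early`) get a larger idf than terms shared by several documents (such as `bird`), and every row has unit length after normalisation.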




In [None]:
# lemmatization ?
# stemming ?
# normalization ?
# n-grams ?

# representation (word2vec - gensim)
# text clustering/classification
# evaluation with comparisons
# conclusions