The unsupervised sentiment analysis below takes a corpus of 514,027 rows of Tweets and YouTube comments and finds associations between words, bigrams, and sentences. It then vectorizes the tokens based on those associations. The process begins with cleaning: lemmatization and the removal of NaN values and stop words. Next, the Gensim library's Phrases() and Phraser() merge frequently co-occurring words into bigrams. Phrases() runs with its default min_count=5, which ignores words and bigrams that occur fewer than 5 times in the corpus (note to self: this minimum count might need adjusting). Finally, the associations between the resulting tokens are vectorized.

The biggest leap here is that, because this is unsupervised learning, none of the rows or words has a predetermined classification. There are no predefined positive, negative, or neutral words or statements. Building associations between the words, and labeling them on that basis, is therefore the first major step before judging sentiment (i.e., positive versus negative versus neutral).

In [3]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_sm

Collecting pip
  Downloading pip-21.0.1-py3-none-any.whl (1.5 MB)
Collecting setuptools
  Downloading setuptools-54.2.0-py3-none-any.whl (785 kB)
Collecting wheel
  Downloading wheel-0.36.2-py2.py3-none-any.whl (35 kB)
Installing collected packages: pip, setuptools, wheel
  Attempting uninstall: pip
    Found existing installation: pip 20.2.1
    Uninstalling pip-20.2.1:
      Successfully uninstalled pip-20.2.1
  Attempting uninstall: setuptools
    Found existing installation: setuptools 49.2.1.post20200802
    Uninstalling setuptools-49.2.1.post20200802:
      Successfully uninstalled setuptools-49.2.1.post20200802
  Attempting uninstall: wheel
    Found existing installation: wheel 0.34.2
    Uninstalling wheel-0.34.2:
      Successfully uninstalled wheel-0.34.2
Successfully installed pip-21.0.1 setuptools-54.2.0 wheel-0.36.2
Coll

Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.0.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [4]:
!pip install xlrd



In [5]:
!pip install --upgrade gensim

Collecting gensim
  Downloading gensim-4.0.1-cp36-cp36m-macosx_10_9_x86_64.whl (23.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.0.1


In [6]:
import re  # For preprocessing
import pandas as pd  # For data handling
from time import time  # To time our operations
from collections import defaultdict  # For word frequency

import spacy  # For preprocessing

import logging  # Setting up the loggings to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

In [8]:
pd.set_option('display.width', None)
pd.set_option('display.max_columns', None)  # full option name avoids ambiguous matches in newer pandas
pd.set_option('display.max_colwidth', 200)

In [7]:
df = pd.read_csv('lemm.csv')
df.shape

(514027, 7)

In [9]:
df.head()

,Unnamed: 0,text,favorite_count,user_id,mentions,repost_count,post_id
0,0,deployment 60 starlink satellite confirmed,97534.0,34743251.0,,9272.0,1.367407e+18
1,1,sn11 almost ready fly,60997.0,44196397.0,,4389.0,1.371995e+18
2,2,ive continued driving scout spot ill drop mar helicopter area get certified fli,56739.0,1.23278323762312e+18,,5605.0,1.369068e+18
3,3,honestly hadnt seen eye didnt footage would 100 think cg,40107.0,3167257102.0,,5219.0,1.371988e+18
4,4,really doe look like something 80 sci fi show credit spacex,36838.0,9.294728238480302e+17,,2216.0,1.367993e+18


In [10]:
df.drop(['Unnamed: 0'], axis=1, inplace=True)

In [11]:
df.head()

,text,favorite_count,user_id,mentions,repost_count,post_id
0,deployment 60 starlink satellite confirmed,97534.0,34743251.0,,9272.0,1.367407e+18
1,sn11 almost ready fly,60997.0,44196397.0,,4389.0,1.371995e+18
2,ive continued driving scout spot ill drop mar helicopter area get certified fli,56739.0,1.23278323762312e+18,,5605.0,1.369068e+18
3,honestly hadnt seen eye didnt footage would 100 think cg,40107.0,3167257102.0,,5219.0,1.371988e+18
4,really doe look like something 80 sci fi show credit spacex,36838.0,9.294728238480302e+17,,2216.0,1.367993e+18


In [12]:
df.isnull().sum()

text                  81
favorite_count         0
user_id                3
mentions          181446
repost_count           0
post_id                1
dtype: int64

In [15]:
df_comments = df.drop(['favorite_count', 'user_id', 'mentions', 'repost_count'], axis=1)

In [18]:
df_comments = df_comments.dropna().reset_index(drop=True)
df_comments.isnull().sum()

text       0
post_id    0
dtype: int64

In [19]:
df_comments.head()

,text,post_id
0,deployment 60 starlink satellite confirmed,1.367407e+18
1,sn11 almost ready fly,1.371995e+18
2,ive continued driving scout spot ill drop mar helicopter area get certified fli,1.369068e+18
3,honestly hadnt seen eye didnt footage would 100 think cg,1.371988e+18
4,really doe look like something 80 sci fi show credit spacex,1.367993e+18


In [22]:
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])  # disable NER and the dependency parser for speed

def cleaning(doc):
    # Lemmatizes and removes stopwords
    # doc needs to be a spacy Doc object
    txt = [token.lemma_ for token in doc if not token.is_stop]
    # Word2Vec uses context words to learn the vector representation of a target word,
    # if a sentence is only one or two words long,
    # the benefit for the training is very small
    if len(txt) > 2:
        return ' '.join(txt)
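
To see the effect of the length filter without loading a spaCy model, here is a minimal sketch that mimics `cleaning` on plain strings. The `STOP_WORDS` set and the `cleaning_sketch` helper are hypothetical stand-ins (no lemmatization, a tiny invented stop list), not spaCy's actual behavior.

```python
# Hypothetical stand-in for spaCy's stop-word list; the real list is much larger.
STOP_WORDS = {"i", "the", "a", "is", "to", "it", "and", "really"}


def cleaning_sketch(text):
    """Drop stop words, then keep the row only if more than two tokens remain."""
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    if len(tokens) > 2:
        return " ".join(tokens)
    return None  # short rows become None, just like the implicit return in cleaning()


kept = cleaning_sketch("the deployment starlink satellite confirmed")
dropped = cleaning_sketch("it is really ready")

print(kept)     # four content tokens remain, so the row survives
print(dropped)  # a single content token remains, so the row is discarded
```

Rows that come back as `None` are the ones a later `dropna()` call removes from the cleaned DataFrame.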

In [25]:
# remove non-alphabetical characters in 'text' column

brief_cleaning = (re.sub("[^A-Za-z']+", ' ', str(row)).lower() for row in df_comments['text'])
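
As a quick illustration of what this regular expression keeps and drops, the sketch below applies the same pattern to one invented sample string (the sample text is hypothetical, not a row from the corpus). Digits, punctuation, and URL separators all collapse into single spaces, which may be why fragments such as `sn` and URL pieces show up among the cleaned tokens.

```python
import re

# Invented sample row; not taken from the dataset.
sample = "SN11 almost ready to fly! https://t.co/abc123 (don't miss it)"

# Same pattern as above: every run of characters that is not a letter
# or an apostrophe is replaced by a single space, then lowercased.
cleaned = re.sub("[^A-Za-z']+", " ", sample).lower()

tokens = cleaned.split()
print(tokens)
```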

In [27]:
# Process texts as a stream, and yield `Doc` objects in order.

t = time()

txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning)]

print('Time to clean up everything: {} mins'.format(round((time() - t) / 60, 2)))

Time to clean up everything: 8.75 mins


In [28]:
df_clean = pd.DataFrame({'clean': txt})
df_clean = df_clean.dropna().drop_duplicates()
df_clean.shape

(500924, 1)

In [29]:
df_clean.head()

,clean
0,deployment starlink satellite confirm
1,sn ready fly
2,ve continue drive scout spot ill drop mar helicopter area certify fli
3,honestly nt see eye nt footage think cg
4,doe look like sci fi credit spacex


In [30]:
# Utilizing Gensim Phrases package to automatically detect common phrases (bigrams)
# from a list of sentences. https://radimrehurek.com/gensim/models/phrases.html

from gensim.models.phrases import Phrases, Phraser



In [31]:
sent = [row.split() for row in df_clean['clean']]

In [32]:
# Detect phrases based on collocation counts.
phrases = Phrases(sent)

INFO - 06:56:31: collecting all words and their counts
INFO - 06:56:31: PROGRESS: at sentence #0, processed 0 words and 0 word types
INFO - 06:56:32: PROGRESS: at sentence #10000, processed 100114 words and 87342 word types
INFO - 06:56:32: PROGRESS: at sentence #20000, processed 198794 words and 162683 word types
INFO - 06:56:32: PROGRESS: at sentence #30000, processed 297712 words and 234295 word types
INFO - 06:56:32: PROGRESS: at sentence #40000, processed 395270 words and 301893 word types
INFO - 06:56:32: PROGRESS: at sentence #50000, processed 492039 words and 365707 word types
INFO - 06:56:32: PROGRESS: at sentence #60000, processed 588500 words and 427226 word types
INFO - 06:56:33: PROGRESS: at sentence #70000, processed 683840 words and 486520 word types
INFO - 06:56:33: PROGRESS: at sentence #80000, processed 778614 words and 544369 word types
INFO - 06:56:33: PROGRESS: at sentence #90000, processed 871368 words and 601094 word types
INFO - 06:56:33: PROGRESS: at sentence #

In [33]:
# Phraser() cuts down the memory consumption of Phrases() by discarding
# model state not strictly needed for phrase detection.

bigram = Phraser(phrases)

INFO - 06:58:47: exporting phrases from Phrases<2361650 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>
INFO - 06:58:53: FrozenPhrases lifecycle event {'msg': 'exported FrozenPhrases<24851 phrases, min_count=5, threshold=10.0> from Phrases<2361650 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000> in 6.10s', 'datetime': '2021-04-07T06:58:53.684451', 'gensim': '4.0.1', 'python': '3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 13:42:17) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]', 'platform': 'Darwin-19.6.0-x86_64-i386-64bit', 'event': 'created'}
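
The log above reports `min_count=5` and `threshold=10.0`. As a rough sketch of how those two settings interact, the snippet below reimplements what is, to my understanding, Gensim's default scoring formula: a word pair is kept as a phrase when its score exceeds the threshold. The function name `original_score` and all of the counts are hypothetical and chosen only for illustration. Only `min_count`, `threshold`, and the vocabulary size come from the log.

```python
def original_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Score = (count(a, b) - min_count) * len_vocab / (count(a) * count(b))."""
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


MIN_COUNT = 5        # from the log above
THRESHOLD = 10.0     # from the log above
LEN_VOCAB = 2361650  # vocabulary size reported in the log above

# Hypothetical counts: two fairly rare words that nearly always appear together.
rare_pair = original_score(1000, 800, 400, LEN_VOCAB, MIN_COUNT)

# Hypothetical counts: two very common words that co-occur just as often.
common_words = original_score(50000, 40000, 400, LEN_VOCAB, MIN_COUNT)

print(rare_pair > THRESHOLD)     # the rare pair would be merged into a bigram
print(common_words > THRESHOLD)  # the common pair would stay as separate words
```

Raising `threshold` or `min_count` yields fewer, more reliable phrases. Lowering them yields more phrases at the cost of noisier ones.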


In [34]:
# transform the corpus based upon bigrams detected
sentences = bigram[sent]
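
Conceptually, this transformation rewrites each sentence so that every detected word pair becomes one token joined by an underscore. The sketch below is a simplified stand-in for that behavior, not Gensim's implementation. The `apply_bigrams` helper and the `phrase_pairs` set are hypothetical.

```python
def apply_bigrams(tokens, phrase_pairs, delimiter="_"):
    """Merge adjacent tokens that form a known phrase, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrase_pairs:
            merged.append(delimiter.join(pair))
            i += 2  # skip the second word so merged pairs never overlap
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical set of detected phrases, for illustration only.
phrase_pairs = {("sci", "fi"), ("http", "co")}

result = apply_bigrams(
    ["doe", "look", "like", "sci", "fi", "credit", "spacex"], phrase_pairs
)
print(result)
```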

In [35]:
# Count how often each token occurs, as a check that lemmatization,
# stop-word removal, and bigram detection reduced the vocabulary enough
# for sentiment to be measured more accurately.

word_freq = defaultdict(int)
for sentence in sentences:  # avoid rebinding `sent`, the list that `sentences` wraps
    for token in sentence:
        word_freq[token] += 1
len(word_freq)

369182

In [36]:
sorted(word_freq, key=word_freq.get, reverse=True)[:10]

['space', 'mar', 'nasa', 'http_co', 'nt', 'spacex', 'wa', 'm', 'like', 'ha']
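
The same frequency count can also be written with `collections.Counter`, which returns the counts alongside the tokens. This is a minimal sketch on a toy corpus. The `toy_sentences` list is invented and only stands in for the bigram-transformed sentences.

```python
from collections import Counter

# Toy stand-in for the bigram-transformed corpus; not real data.
toy_sentences = [
    ["deployment", "starlink", "satellite", "confirm"],
    ["starlink", "satellite", "launch", "spacex"],
    ["spacex", "starlink", "sci_fi"],
]

# Flatten the sentences and count every token in one pass.
word_counts = Counter(token for sentence in toy_sentences for token in sentence)

vocabulary_size = len(word_counts)
most_frequent = word_counts.most_common(1)
print(vocabulary_size, most_frequent)
```

Seeing the counts next to the tokens makes it easier to judge whether artifacts such as `nt`, `wa`, or `ha` are frequent enough to warrant another cleaning pass.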

In [None]:
The approach of interest here is the unsupervised one. The main idea behind unsupervised learning is that you give the model no prior assumptions or definitions about the outcome of the variables you feed into it. You simply insert the (preprocessed) data and let the model learn the structure of the data itself. This is extremely useful when you don't have labeled data, or when you are unsure about the structure of the data and want to learn more about the nature of the process you are analyzing without making any prior assumptions about its outcome.