**Project 4 - Elon Buys Twitter**

DATA 620 Team "Lucky 7": Bonnie Cooper, George Cruz Deschamps, Rob Hodde

*Part 3: Topical Analysis By Impact Tier*

*A comparison of the top eight tweet topics related to Elon Musk's purchase of Twitter, stratified by author impact.*

<br>

In [105]:
# Import the required packages

import pandas as pd

import cleantext  
from emoji import demojize
import re
import nltk
from nltk.tokenize import word_tokenize
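# One-time setup (uncomment on first run); word_tokenize also needs the 'punkt' tokenizer data
#nltk.download('wordnet')
#nltk.download('punkt')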
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

from bertopic import BERTopic

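import os
os.chdir('C:\\Data\\')  # working directory for the CSV output (machine-specific path)

# Connect to the SQL Server database with Windows authentication
import pyodbc
sServer = 'localhost'
sDB = 'CUNY'
cnxn = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
                      "Server=" + sServer + ";"
                      "Database=" + sDB + ";"
                      "Trusted_Connection=yes;")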


In [106]:
# The next three functions clean up the tweets.

# clean_text:
#    translates emojis into text aliases
#    changes text to lower case
#    removes numbers, punctuation, extra spaces and stop words
#    removes e-mail addresses (via the reg pattern)
def clean_text(x):
    x = demojize(x, language='alias') 
    x = re.sub(r"[:]+\ *", " ", x) #removes emoji colons and separates them with a space
    return cleantext.clean(x, extra_spaces=True, lowercase=True, numbers=True, punct=True, stopwords=True,
                     reg=r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", reg_replace=' ')
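
The colon-stripping step in `clean_text` can be tried on its own without the `emoji` or `cleantext` packages. The sketch below is only an illustration: the function name `strip_emoji_colons` and the sample string are hypothetical, the input is a hand-written stand-in for demojized text, and the whitespace collapse stands in for `extra_spaces=True`.

```python
import re

def strip_emoji_colons(text):
    """Replace each run of colons (plus trailing spaces) with one space, then collapse whitespace."""
    spaced = re.sub(r"[:]+\ *", " ", text)  # same pattern as in clean_text
    return " ".join(spaced.split())         # stand-in for extra_spaces=True

# Hypothetical, hand-written stand-in for demojized tweet text
sample = "to the moon :rocket::rocket: soon"
result = strip_emoji_colons(sample)
print(result)
```

Because the pattern matches every colon, colons in ordinary text are removed as well: `strip_emoji_colons("breaking: deal done")` gives `"breaking deal done"`.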

In [107]:
# Lemmatize a list of tokens (convert various forms to root words).
# lemmatize() defaults to pos='n', so every token is treated as a noun.
def lemmatize_word(tokens):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in tokens]

In [109]:
# Rationalize the text: clean, tokenize and lemmatize, then rejoin the tokens into one string
def rationalize_text(txt):
    return (txt.apply(clean_text)
               .apply(word_tokenize)
               .apply(lemmatize_word)
               .apply(' '.join))


In [113]:
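# Build a BERTopic model from the English tweets of the authors in one impact tier.
#   filename: CSV file that receives the raw tweets; topic info goes to 'topic_info_' + filename
#   where:    SQL condition applied to b.LikeCount, e.g. " > 70000"
def make_model(filename, where):
    sSQL = """SELECT UserContent 
          FROM elonmusktwitter_tweets a 
          INNER JOIN tbl_Musk b ON a.UserName = b.UserName 
          WHERE a.UserLanguage = 'en' AND b.LikeCount """ + where
    df = pd.read_sql_query(sSQL, cnxn)
    df.columns = ['UserContent']
    df.to_csv(filename, encoding='utf-8')  # keep a copy of the raw tweets
    df['UserContent'] = rationalize_text(df['UserContent'])
    docs = df['UserContent'].tolist()
    model = BERTopic(verbose=True)
    topics, probabilities = model.fit_transform(docs)
    topic_info = model.get_topic_info()  # one row per topic
    topic_info.to_csv('topic_info_' + filename, encoding='utf-8')
    return model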
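
`make_model` builds its query by appending the `where` string to the SQL text. An alternative, sketched here as an assumption rather than as the notebook's method, is to pass the tier bounds as query parameters. In-memory SQLite stands in for the SQL Server connection, the rows are toy data invented for the illustration, and `fetch_tier` is a hypothetical helper. The table and column names come from the query above.

```python
import sqlite3

# In-memory SQLite stands in for the SQL Server connection; rows are toy data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_Musk (UserName TEXT, LikeCount INTEGER);
    CREATE TABLE elonmusktwitter_tweets (UserName TEXT, UserLanguage TEXT, UserContent TEXT);
    INSERT INTO tbl_Musk VALUES ('alice', 90000), ('bob', 5000), ('carol', 150);
    INSERT INTO elonmusktwitter_tweets VALUES
        ('alice', 'en', 'tweet A'), ('bob', 'en', 'tweet B'),
        ('bob', 'fr', 'tweet C'), ('carol', 'en', 'tweet D');
""")

def fetch_tier(conn, low, high=None):
    """Return English tweets from authors with low <= LikeCount <= high.

    If high is None, no upper bound is applied.
    """
    sql = """SELECT a.UserContent
             FROM elonmusktwitter_tweets a
             INNER JOIN tbl_Musk b ON a.UserName = b.UserName
             WHERE a.UserLanguage = 'en' AND b.LikeCount >= ?"""
    params = [low]
    if high is not None:
        sql += " AND b.LikeCount <= ?"
        params.append(high)
    sql += " ORDER BY a.UserContent"
    return [row[0] for row in conn.execute(sql, params)]

top = fetch_tier(conn, 70001)        # same as "> 70000" for integer counts
mid = fetch_tier(conn, 1000, 70000)
low = fetch_tier(conn, 100, 999)
print(top, mid, low)
```

pyodbc uses the same `?` placeholder style, and `pd.read_sql_query` accepts the values through its `params` argument, so the idea should carry over to the connection used in this notebook.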
    

<br>

**Top Impact**

The first group consists of the top 100 authors by number of Likes within the dataset, which corresponds to more than 70,000 Likes.


In [116]:
filename = "SourcesTop1.csv"
model = make_model(filename, " > 70000")
model.visualize_barchart()

Batches:   0%|          | 0/84 [00:00<?, ?it/s]

2022-05-24 18:40:35,766 - BERTopic - Transformed documents to Embeddings
2022-05-24 18:40:42,720 - BERTopic - Reduced dimensionality
2022-05-24 18:40:42,841 - BERTopic - Clustered reduced embeddings


<br>

**Mid-Tier**

The second group consists of authors who garnered between 1,000 and 70,000 Likes (ranked 101 to 2,779).

In [117]:
filename = "SourcesMid1.csv"
model = make_model(filename, " BETWEEN 1000 AND 70000 ")
model.visualize_barchart()



Batches:   0%|          | 0/511 [00:00<?, ?it/s]

2022-05-24 18:44:21,176 - BERTopic - Transformed documents to Embeddings
2022-05-24 18:44:27,396 - BERTopic - Reduced dimensionality
2022-05-24 18:44:28,137 - BERTopic - Clustered reduced embeddings


<br>

**Bottom Tier**

The third group consists of authors who garnered between 100 and 999 Likes (ranked 2,780 to 12,884).


In [118]:
filename = "SourcesLow1.csv"
model = make_model(filename, " BETWEEN 100 AND 999 ")
model.visualize_barchart()

Batches:   0%|          | 0/1379 [00:00<?, ?it/s]

2022-05-24 18:53:53,515 - BERTopic - Transformed documents to Embeddings
2022-05-24 18:54:10,935 - BERTopic - Reduced dimensionality
2022-05-24 18:54:14,429 - BERTopic - Clustered reduced embeddings
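
The three Like-count boundaries used in this notebook can be collected into a single lookup. This is a minimal sketch: `impact_tier` is a hypothetical helper, not part of the notebook, and it assumes integer Like counts with the same inclusive bounds as the SQL `BETWEEN` filters.

```python
def impact_tier(like_count):
    """Map an author's Like count to the tier names used in this notebook."""
    if like_count > 70000:
        return "Top Impact"
    if like_count >= 1000:
        return "Mid-Tier"
    if like_count >= 100:
        return "Bottom Tier"
    return None  # fewer than 100 Likes: not covered by any of the three models

for count in (250000, 70000, 999, 42):
    print(count, impact_tier(count))
```

Authors with exactly 70,000 Likes fall into the Mid-Tier, because the top tier is selected with `> 70000` while `BETWEEN 1000 AND 70000` includes both endpoints.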
