<p style="background-color:#EB7101; font-family: arial; color: #ffffff; font-size: 200%; text-align: center; border-radius: 15px 15px;">0 - Libraries</p>

In [1]:
from dotenv import load_dotenv
from sma_modules import analysis, scraper, transformer_pipe

import altair as alt
import os




<p style="background-color:#EB7101; font-family: arial; color: #ffffff; font-size: 200%; text-align: center; border-radius: 15px 15px;">1 - Scenario</p>

At Chegg, we strive to empower students to reach their full potential. Our mission therefore entails supporting students throughout school and beyond. We bring this mission to life through our e-learning platform, which offers programs spanning academics, personal growth, and skill-building, and we seek to deliver the greatest possible educational value for learners' subscription fees. That value proposition may be at risk, however, as indicated by reports (data visualization project) on the future of educational delivery. Their findings suggest that the landscape is shifting towards dialogue-optimized large language models.

While a competitive large language model of our own remains out of reach, the recently launched customizable versions of OpenAI's ChatGPT, known as GPTs, may be a viable alternative. Such tailored versions might be introduced to complement existing programs and adapt to the shifting landscape. Nevertheless, any investment in exploring this resource must undergo thorough due diligence. Market research provides a preliminary indicator of investment potential.

This project serves as that market research; its goal is to ascertain public perceptions of GPTs. The emphasis is therefore on the prevailing social media sentiment as an indicator of those perceptions.

<p style="background-color:#EB7101; font-family: arial; color: #ffffff; font-size: 200%; text-align: center; border-radius: 15px 15px;">2 - Data</p>

Data for this project was sourced from the social media platform X, formerly known as Twitter. X was chosen for its large and diverse user base, which is conducive to sentiment analysis given the myriad opinions expressed on various topics. Furthermore, research on social media sentiment commonly relies on X, so downstream methodologies lend themselves to tweets.

Since X revised its API, free-tier access to tweets has been curtailed, necessitating web scraping to acquire sufficient data for the project. X's dynamic elements, such as login masks, search boxes, and infinite scrolling, call for browser automation, here with **_Selenium_**. Because searches over extended horizons return curated feeds, **_Selenium_** runs the searches day by day, the finest time scale available. Each search is thus bounded by two adjacent days, trading execution time for greater granularity. To mitigate rate limiting, scroll depths and intervals were chosen deliberately. To forestall disruptions, search histories are cleared; otherwise the history grows with every search and obstructs the scraping of relevant elements.
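The day-by-day search described above can be sketched as follows. This is a minimal illustration, not the code of the `scraper` class: the helper names `daily_windows` and `build_query` are hypothetical, the query layout built from X's `lang:`, `since:` and `until:` search operators is an assumption, and whether the end date itself is searched is left open.

```python
from datetime import datetime, timedelta


def daily_windows(start, end, fmt="%d.%m.%Y"):
    """Yield (since, until) pairs of adjacent days from start up to end."""
    day = datetime.strptime(start, fmt).date()
    last = datetime.strptime(end, fmt).date()
    while day < last:
        yield day, day + timedelta(days=1)
        day += timedelta(days=1)


def build_query(keyword, language, since, until):
    """Hypothetical query string; the scraper's real query format is not shown."""
    return f"{keyword} lang:{language} since:{since:%Y-%m-%d} until:{until:%Y-%m-%d}"


# One search per day between the two dates passed to the scraper.
windows = list(daily_windows("09.11.2023", "20.12.2023"))
queries = [build_query("GPTs", "en", since, until) for since, until in windows]
print(len(queries))
print(queries[0])
```

Each query would then be entered into the search box once, with scrolling and history clearing handled per search.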

In [2]:
load_dotenv("secret.env")

scrape = scraper(
    os.getenv("user_identifier"),
    os.getenv("password"),
    "GPTs", 
    "en", 
    "%d.%m.%Y",
    "09.11.2023",
    "20.12.2023" 
)

#scrape tweets
scrape.scraping()

The search scope is the research target, GPTs. Tweets featuring "GPTs" as a keyword are therefore sourced, restricted to English in line with our target audience. The tweets range from the release date of GPTs (09.11.2023) to the project's start date (20.12.2023).

For each tweet, the author, publication date, and text are collected. Authorship and date attribute each tweet to its source, a practice crucial to research credibility. The texts of original tweets capture primary opinions and grant insight into the broader conversation. The core that fuels the underlying sentiment is thus retained, while replies are sacrificed for the sake of scraper execution time.

In [3]:
scrape.tweets_df.to_csv("tweets_unprocessed.csv",
                        index = False)

Following these criteria, 13,368 tweets were scraped and stored in a CSV file for reference. The file serves as a checkpoint for potential future research beyond this project.

<p style="background-color:#EB7101; font-family: arial; color: #ffffff; font-size: 200%; text-align: center; border-radius: 15px 15px;">3 - Processing</p>

The texts of the aforementioned checkpoint are the input for natural language processing. Since sentiment is the concern, text classification takes center stage.

In [4]:
nlp = transformer_pipe("tweets_unprocessed.csv", "text")

In [5]:
#predict sentiments
nlp.sens()

sen_clas: 100%|██████████| 13368/13368 [38:50<00:00,  5.74texts/s]


For classification, transformers qualify on account of their state-of-the-art performance. Among them, RoBERTa stood out for its strong performance. Fine-tuned RoBERTa checkpoints are available in a variety of potentially compatible versions. Of these, __["twitter-roberta-base-sentiment-latest"](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest)__ by __["Cardiff NLP"](https://huggingface.co/cardiffnlp)__ seems a reasonable choice.

This model is reasonable because it was fine-tuned on English tweets, matching the scraped data. Those tweets are fairly recent, which supports the temporal relevance of the predictions given the ongoing evolution of language. The predictions also match the intended notion of sentiment and showed promising quality in ad-hoc tests. The model's popularity, itself indicative of quality, lends further support to these tests.

Such quality, however, requires preprocessing. __[Research](https://towardsdatascience.com/does-bert-need-clean-data-part-2-classification-d29adf9f745a)__ advocates light preprocessing, as BERT and its derivatives, such as RoBERTa, rely heavily on contextual information. Accordingly, the emphasis is on noise. The first source is padding, which is rooted in typographic norms and organizes text without carrying semantics, yet is still picked up by transformers. Emails, mentions, and websites are similar: their names are rich in semantics that were not necessarily intended as part of the message. Numerals, typically factual in nature, may also cause confusion because they are neutral with respect to sentiment. Alphanumerics, finally, tend to lack linguistic coherence, fostering ambiguity and thus misinterpretation. All of this noise calls for mitigation.
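A minimal sketch of such noise mitigation is given below. It is not the code of `transformer_pipe`: the function name `strip_noise`, the regular expressions, and the decision to delete noise outright (rather than replace it with placeholder tokens) are assumptions made for illustration, as is treating padding as surplus whitespace.

```python
import re

# Illustrative patterns; the notebook's actual rules are not shown.
EMAIL = re.compile(r"\S+@\S+\.\S+")
URL = re.compile(r"(?:https?://|www\.)\S+")
MENTION = re.compile(r"@\w+")
# Tokens that mix letters and digits, e.g. "gpt4".
ALPHANUMERIC = re.compile(r"\b(?=\w*\d)(?=\w*[A-Za-z])\w+\b")
NUMERAL = re.compile(r"\b\d+(?:[.,]\d+)*\b")
WHITESPACE = re.compile(r"\s+")


def strip_noise(text):
    """Remove emails, websites, mentions, alphanumerics and numerals,
    then collapse the remaining whitespace."""
    # Emails go first so that their "@" part is not mistaken for a mention.
    for pattern in (EMAIL, URL, MENTION, ALPHANUMERIC, NUMERAL):
        text = pattern.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


print(strip_noise("@sam  see https://x.com/a   and 42 gpt4 tools"))
```

Keeping the rules this narrow leaves punctuation, casing, and word order untouched, in line with the light preprocessing advocated above.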

Apart from noise, diverse representations pose a risk to quality: semantically equivalent words may be interpreted differently because of variations in their written form. Here, the emphasis is on prevailing standards, that is, the variants that received the greatest exposure in pre-training, since greater exposure comes with greater contextual awareness. As English texts were the predominant pre-training source, ASCII character encoding is enforced where feasible. Those English texts were mainly drawn from BookCorpus and Wikipedia, where formal writing dominates, which favors uncontracted forms. Words in such texts were lowercased, a convention carried over to preprocessing.
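The standardization described above might look roughly like this. The sketch is an assumption, not the notebook's implementation: `normalize` is a hypothetical name and the contraction table is a tiny illustrative excerpt.

```python
import re
import unicodedata

# Illustrative excerpt only; a real table would be far longer.
CONTRACTIONS = {
    "i'm": "i am",
    "don't": "do not",
    "doesn't": "does not",
    "can't": "cannot",
}


def normalize(text):
    """Enforce ASCII where feasible, lowercase, and expand contractions."""
    # Map curly apostrophes to ASCII before other non-ASCII characters are dropped.
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    # Decompose accented letters so that their base letter survives the ASCII step.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    return text


print(normalize("I\u2019m sure the caf\u00e9 doesn\u2019t close"))
```

Characters without an ASCII counterpart are simply dropped here, which is one possible reading of "where feasible".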

With these aspects addressed, the texts are classified by sentiment.

In [6]:
#extract part-of-speech
for lexeme in ["NOUN", "VERB"]:
    nlp.lexemes(lexeme)

noun_rec: 100%|██████████| 13368/13368 [41:14<00:00,  5.40texts/s] 
verb_rec: 100%|██████████| 13368/13368 [41:00<00:00,  5.43texts/s] 


Some of these sentiments are more pertinent to the project than others. The emphasis is therefore on sentiments expressed in tweets pertaining to education, a niche we are eager to explore. Topics such as education are essentially captured by nouns, which designate objects, and verbs, which express actions. These parts of speech can be tagged with **_spaCy_**. Given the tweets' language, English **_spaCy_** models are considered. Of those, __["en_core_web_trf"](https://github.com/explosion/spacy-models/releases/tag/en_core_web_trf-3.7.3)__ stood out as the most accurate. Achieving such accuracy again calls for preprocessing.

The primary focus of this preprocessing is to maximize the recognition of parts of speech by the **_spaCy_** model. Gerundial forms are therefore resolved into common standard forms. So that all conjugations are handled, apostrophes are encoded as ASCII characters. Letter reduplications in the resulting regular forms, which are largely indistinguishable from consonant doubling, are at least reduced to doubled letters. This yields orthographically valid, or at least close, renditions. Those renditions exclude entities such as emails, mentions, and websites to avoid the semantic mismatch mentioned previously.
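The reduction of reduplicated letters can be illustrated with a single regular expression. This is a sketch under assumptions: the helper name `squeeze_repeats` is hypothetical, and the notebook's actual rule may differ.

```python
import re

# Three or more repetitions of the same letter.
REPEATED_LETTER = re.compile(r"([A-Za-z])\1{2,}")


def squeeze_repeats(text):
    """Reduce runs of three or more identical letters to a double letter."""
    return REPEATED_LETTER.sub(r"\1\1", text)


print(squeeze_repeats("learnnning is soooo good"))
```

Runs are cut to two letters rather than one because legitimate doubling (as in "good" or "running") cannot be told apart from emphasis by pattern alone.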

The recognized parts of speech are then reduced with the most aggressive stemmer, "Lancaster". Nouns and verbs of the same word family are thereby brought to a common denominator (e.g. noun: education, verb: educate, stem: educ). Downstream matching is thus simplified, as noun and verb share a stem.

In [7]:
nlp.df.to_csv("tweets_processed.csv",
              index = False)

The resulting data is then stored in a CSV file for reference, as a checkpoint for future analysis beyond this project's scope.

<p style="background-color:#EB7101; font-family: arial; color: #ffffff; font-size: 200%; text-align: center; border-radius: 15px 15px;">4 - Analysis</p>

The processed checkpoint is the foundation for analysis, which draws its insights from that foundation.

In [8]:
ana = analysis("tweets_processed.csv", ["noun_stems", "verb_stems"])

In [9]:
#filter for keywords
ana.match("education", ["education", "expert", "learning", "teacher", "tutor"])

matching: 100%|██████████| 13368/13368 [00:01<00:00, 10830.99texts/s]


The first step is to align this foundation with the niche of interest, so the scope of analysis is narrowed to education. The parts of speech are therefore filtered using stems assumed to be representative of education, which are listed below.

<center>

|Vocabulary|Stem|
|:----:|:----:|
|education|educ|
|expert|expert|
|learning|learn|
|teacher|teach|
|tutor|tut|

</center>

Yet, as stemmers do not always perform as intended, stems that encapsulate a keyword stem are also matched via **_TheFuzz_**'s partial ratio. Such matching may, however, pick up widespread short stems rather than actual encapsulations. For this reason, only stems at least as long as the shortest keyword stem qualify as matches. The resulting matches (1,172) are deemed educationally relevant.
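The matching rule can be sketched without third-party dependencies. The sketch below is a simplified stand-in, not the `analysis.match` implementation: it assumes that only a perfect partial ratio counts, which roughly corresponds to the shorter string occurring verbatim inside the longer one, and the names `encapsulates` and `match_stem` are hypothetical. The score threshold actually used with **_TheFuzz_** is not stated in this notebook.

```python
KEYWORD_STEMS = ["educ", "expert", "learn", "teach", "tut"]
# Length of the shortest keyword stem ("tut").
MIN_LENGTH = min(len(stem) for stem in KEYWORD_STEMS)


def encapsulates(keyword_stem, candidate):
    """True if the shorter of the two strings occurs inside the longer one."""
    shorter, longer = sorted((keyword_stem, candidate), key=len)
    return shorter in longer


def match_stem(candidate):
    """Return 'keyword$candidate' pairs for every keyword stem that matches."""
    if len(candidate) < MIN_LENGTH:
        return []
    return [
        f"{stem}${candidate}"
        for stem in KEYWORD_STEMS
        if encapsulates(stem, candidate)
    ]


for candidate in ["machinelearn", "reduc", "ear", "ed"]:
    print(candidate, match_stem(candidate))
```

The pairs follow the `keyword$candidate` layout of the `match` column. Candidates such as "reduc" and "ear" still pass the length filter, which is why a manual review of the matched stems remains necessary.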

In [10]:
#remove spam and enable individual keyword tracking
df_vis = (ana.df[ana.df["education_relevancy"]]
          .drop_duplicates(
              subset = [
                  "user_identifier", 
                  "text"
              ])
          .explode(
              column = "match"
          )
)

The educationally relevant tweets are screened for duplicates by author and text. Duplicates are discarded to mitigate spam-induced bias, leaving 1,013 tweets.

In [11]:
print(
    df_vis["match"].unique()
)
irrelevant = [
    "machinelearn",
    "reduc",
    "earn",
    "tea",
    "deeplearn",
    "ear",
    "institut",
    "teachertwit",
    "substitut",
    "deduc",
    "duc",
    "constitut",
    "unlearn",
    "teachers&student"
]

#separate matching pairs
df_vis[["match_with", "match_base"]] = df_vis["match"].str.split(
    pat = "$", 
    n = 1, 
    expand = True
)

#remove erroneously matched keywords
df_vis.drop(
    df_vis[df_vis["match_base"].isin(irrelevant)].index,
    inplace = True
)

['educ$educ' 'learn$learn' 'tut$tut' 'expert$expert' 'learn$machinelearn'
 'teach$teachersoftwit' 'learn$languagelearn' 'educ$reduc' 'teach$teach'
 'learn$learnt' 'learn$englishlearn' 'learn$learningwitha' 'learn$earn'
 'teach$tea' 'learn$deeplearn' 'learn$ear' 'expert$kqlexpertise'
 'educ$techineduc' 'tut$techtut' 'tut$institut' 'learn$ailearn'
 'teach$teachertwit' 'tut$substitut' 'educ$aieduc' 'educ$deduc' 'educ$edu'
 'educ$earlyeduc' 'educ$duc' 'learn$learningmadeeasy'
 'learn$lifelonglearn' 'learn$create2learn' 'tut$constitut' 'tut$futut'
 'tut$photoshoptut' 'educ$educationinnov' 'learn$learningjourney'
 'learn$learningwithlaugh' 'educ$cannabiseduc' 'learn$laughandlearn'
 'educ$educationrevolv' 'learn$learning-' 'learn$-learning'
 'learn$-learned' 'expert$seoexpert' 'learn$unlearn' 'learn$gamifiedlearn'
 'teach$teachers&student' 'educ$highereduc']


The remaining tweets are scrutinized because the filtered stems are ambiguous: fuzzy matching with **_TheFuzz_** may introduce semantic mismatches. Stems deemed irrelevant to our niche are manually omitted, leaving 780 tweets.

In [12]:
#prepare data for share in tooltip
alt_vis = (
    df_vis.groupby(
        [
            "match_with", 
            "sentiment"
        ]
    )
    .size()
    .reset_index(
        name = "count"
    )
    .assign(
        share = lambda row: row["count"] / row.groupby("match_with")["count"].transform("sum"),
        sentiment = lambda row: row["sentiment"].str.title()
    )
    .sort_values(
        [
            "match_with", 
            "sentiment"
        ]
    )
)

alt.Chart(alt_vis).mark_bar(size = 49).encode(
    x = alt.X(
        shorthand = "sum(count):Q",
        stack = "normalize",
        axis = alt.Axis(
            title = "Public Perception in Percentage (%)",
            labelExpr = "datum.value * 100",
        )
    ),
    y = alt.Y(
        shorthand = "match_with:N",
        axis = alt.Axis(
            title = "Stem",
        ),
    ),
    color = alt.Color(
        shorthand = "sentiment:N",
        scale = alt.Scale(
            range = [
                "#EB7101", 
                "#DEDEDE",
                "#75A99C"
            ]
        ),
        legend = alt.Legend(
            title = "Sentiment"
        )
    ),
    order = alt.Order(
        "color_sentiment_index:Q"
    ),
    tooltip = [
        alt.Tooltip(
            shorthand = "count:Q",
            title = "Quantity"
        ),
        alt.Tooltip(
            shorthand = "share:Q",
            title = "Share",
            format = ".0%"
        )
    ]
).properties(
    title = "Public Perception Distribution of Tweets on GPTs in Education",
    width = 500,
    height = 250
)

The remaining tweets are deemed representative of the niche's subject matter. Their sentiments are illustrated in a normalized stacked bar chart. Because population sizes vary per stem and could obscure the picture, the distribution of each stem is examined on its own. Even so, the sentiment distributions exhibit a clear common structure across all stems: a modest majority gravitates towards GPTs, slightly fewer tweets are neutral, and only a minority receives GPTs poorly.
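What normalization does per stem can be shown with a few lines of plain Python. The records below are hypothetical toy values, not taken from the scraped data, and `sentiment_shares` is an illustrative helper rather than part of the notebook.

```python
from collections import Counter, defaultdict

# Hypothetical toy records of (stem, sentiment); not the project's data.
records = [
    ("educ", "positive"), ("educ", "positive"),
    ("educ", "neutral"), ("educ", "negative"),
    ("tut", "positive"), ("tut", "neutral"),
]


def sentiment_shares(records):
    """Return, per stem, each sentiment's share of that stem's tweets."""
    counts = defaultdict(Counter)
    for stem, sentiment in records:
        counts[stem][sentiment] += 1
    return {
        stem: {s: n / sum(counter.values()) for s, n in counter.items()}
        for stem, counter in counts.items()
    }


shares = sentiment_shares(records)
print(shares["educ"])
print(shares["tut"])
```

Because every stem's shares sum to one, stems with few tweets occupy the same bar length as stems with many, which makes their distributions comparable but hides the difference in sample size. The tooltip's quantity field compensates for that.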

<p style="background-color:#EB7101; font-family: arial; color: #ffffff; font-size: 200%; text-align: center; border-radius: 15px 15px;">5 - Conclusion</p>

Dialogue-optimized large language models have surged in popularity of late. In the wake of that hype, research revealed dynamics that challenge traditional e-learning platforms. Against this backdrop, GPTs were researched as a more feasible alternative to in-house approaches. The research centered on the education vertical of GPTs in pursuit of investment indicators. The indicator explored in this project is public perception, measured by sentiment.

Sentiments were derived from scraped tweets classified by a fine-tuned RoBERTa model, with relevance anchored in stem-based filtering. The stems were taken from words tagged as nouns and verbs. The findings indicate a clear majority: public perception is far more positive than negative. At first glance, the indicator therefore supports exploratory investments in GPTs.

This finding should, however, be treated with caution, as the perceptions captured are evidently localized: they reflect a snapshot grounded in hypothesis-driven filtering and a platform-specific scope. Within that snapshot, it remained uncertain whether perceptions were voiced by end-users or by vendors, a distinction indicative of the overall health of the ecosystem.

As a result, there is ample room for future research. The scope may be broadened to include other platforms, while hypothesis-based filtering may be replaced with topic classification or topic modeling. Further investigation into the type of user might also be beneficial for grasping the ecosystem's health. Besides the users who post tweets, replies may also be sourced for weighting purposes.

Nonetheless, the project sheds light on an indicator, albeit one to be taken with a grain of salt, while laying a cornerstone for future research.