### 1. Library imports

In [182]:
!pip install vaderSentiment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

#plt.rc('figure',figsize=(17,13))
#import plotly.express as px
#import plotly.graph_objs as go
#import plotly.offline as pyo
#from plotly.subplots import make_subplots
#pyo.init_notebook_mode()

import re
import string

import nltk
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

from wordcloud import WordCloud
from tqdm.auto import tqdm



### 2. Data import

In [183]:
# Skip this cell if you load the CSV dataset from somewhere other than Google Drive
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [184]:
# Replace this path with the location of your copy of the dataset
# (the tweets.csv file provided on Kaggle).
file_path = '/content/drive/My Drive/data_group_2.csv'
df = pd.read_csv(file_path)

### 3. Data cleaning and processing

In [185]:
# Cast every tweet to a string, then replace colons with spaces
df['text'] = df['text'].astype(str)
df['text'] = df['text'].str.replace(':', ' ', regex=False)

In [186]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Score each tweet with VADER and keep only the compound score (range -1 to 1)
analyser = SentimentIntensityAnalyzer()
scores = [analyser.polarity_scores(text)['compound'] for text in df['text']]

# Label each tweet from its compound score using cut-offs of +/-0.05
sentiment = []
for score in scores:
    if score >= 0.05:
        sentiment.append('Positive')
    elif score <= -0.05:
        sentiment.append('Negative')
    else:
        sentiment.append('Neutral')
df['sentiment'] = sentiment
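
The labelling rule in this cell can also be factored into a small helper, which makes the cut-offs explicit and easy to check on boundary values. This is a minimal sketch: `label_sentiment` and its `threshold` parameter are names introduced here for illustration and are not part of the notebook or of vaderSentiment.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score (in [-1, 1]) to a sentiment label.

    `label_sentiment` and `threshold` are illustrative names; the 0.05
    cut-off mirrors the value used in the notebook cell.
    """
    if compound >= threshold:
        return 'Positive'
    if compound <= -threshold:
        return 'Negative'
    return 'Neutral'


# Example: label a few hand-picked compound scores, including the boundaries
example_scores = [0.62, 0.05, 0.0, -0.049, -0.05, -0.8]
example_labels = [label_sentiment(s) for s in example_scores]
print(example_labels)
```

In the notebook, the helper could be applied with `df['sentiment'] = [label_sentiment(s) for s in scores]`.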

In [187]:
df = df[df['sentiment'] == 'Negative']

In [188]:
df = df.dropna()

In [189]:
df.drop(['hashtags', 'source', 'user_name', 'sentiment', 'user_friends', 'user_verified'], axis=1, inplace=True)

In [190]:
df.shape

(26237, 7)

In [191]:
# Randomly select 200 rows (fixed seed for reproducibility)
df_sample = df.sample(n=200, random_state=42)

### 4. Count the tokens in the final prompt and in the sample dataset to check that GPT-turbo-0125 can process the request

In [192]:
prompt = '''
<context>
I am analyzing a dataset of tweets about the development of large language models (LLMs) like GPT-4 to identify public concerns.
I aim to understand how these concerns are influenced by the tweet's date, the geographical location of the user, their profile, and the tweet's popularity.
</context>

<persona>
Adopt the role of a researcher with a strong background in LLM and more generally on AI.
</persona>

<research question>
What are the primary concerns expressed by the public regarding the development of large language models (LLMs), and how do
these concerns correlate with specific factors such as date, geographical location, user profile and popularity ?
</research question>

<instructions>
1. Take the time to do an in-depth analyze of the dataset. Then propose a small summary of the main keypoints that can bring an answer to the research question ? This summary should be 100 words long.
2. Analysis of Concerns: Identify recurring themes in the concerns expressed about LLMs in the tweets. Use examples from tweets to illustrate each identified theme.
3. Correlation with Date: Examine how themes vary with the date of the tweet's publication. Describe observed trends and their possible significance.
4. Influence of Geographical Location: Link specific concerns to particular regions or countries, if possible. Explain how location might influence perceptions of LLMs.
5. Impact of User Profile: Analyze whether the user description and the user creation date seem to influence the type of concerns expressed.
6. Relationship with Popularity: Discuss the relationship between the tweet's popularity (measured by user’s followers, friends, favourites and if he is verified) and the nature of concerns.
    Determine whether more popular tweets reflect concerns that are more widely shared or not.
7. According to the results of your analysis, propose 2 hypothesis that explain the findings. This part should include argumentation and explanation.
From Part 2 to Part 6, consider each part should be of 200-250 words. This number of words is purely indicative, if a part needs more words it can be longer.

Example for part 1 :
"After examining the dataset, primary public concerns about LLMs include data privacy, algorithmic bias, and misuse of technology. Data shows a spike
in concerns related to privacy following high-profile data breaches. Geographically, concerns vary with higher anxiety in regions with strict data protection laws. Popular tweets, often
from influencers, suggest wider shared concerns, particularly about transparency and the ethical use of AI."

Example for part 2 :
"Recurring themes in the concerns about LLMs include:
•	Data Privacy: Many users express fear about personal data misuse. For example, a tweet from January 2021 states, 'Worried about how GPT-4 might handle my data. #PrivacyMatters.'
•	Algorithmic Bias: Concerns about bias in AI decisions are common, exemplified by a tweet from March 2022: 'How do we know AI isn't biased? #TechEthics.'
•	Misuse of AI: Fears of AI being used for deceptive purposes, highlighted by a tweet: 'Could LLMs be the next tool for misinformation? #AIResponsibility.'"

Example for part 3 :
"Analysis reveals an uptick in privacy concerns coinciding with news of data breaches or policy changes. For instance, tweets from late 2021 show
heightened anxiety when a major tech company faced a data scandal. Themes of ethical use of AI surged around elections or significant political events, suggesting that public awareness
is heightened by contextual global events."

Example for part 4 :
"After reviewing the dataset, there does not appear to be a clear or consistent relationship between the geographical location of the tweet
authors and the specific concerns they express about LLMs. For instance, concerns about data privacy are equally prevalent in tweets from users in Europe,
Asia, and North America, without significant variation that aligns with local data protection laws. Similarly, worries about algorithmic bias and AI misuse are
scattered across diverse locations, suggesting these concerns are globally shared and not particularly influenced by regional factors."

Example for part 5 :
"Analysis reveals a clear relationship between user profiles and the concerns they express about LLMs. Accounts created during key AI
milestones, such as the release of GPT-4, often tweet about advanced AI issues like transparency. For example, an account from 2021: 'How safe are these new AIs?' Furthermore,
users with tech-related roles in their descriptions, like 'AI researcher,' frequently discuss ethical implications, shown by a tweet: 'We need stricter AI oversight #AIethics.'
These patterns indicate that both account age and professional interests shape the concerns users articulate."

Example for part 6 :
"Analysis of the tweet data shows no clear correlation between the popularity of tweets (measured by factors such as retweets,
likes, and whether the user is verified) and the nature of concerns expressed about LLMs. Both highly popular and less popular tweets vary widely in their focus, with high-engagement
posts sometimes addressing general topics about AI, while other, less popular tweets might delve into specific issues like privacy or bias. This indicates that the reach of a tweet does
not necessarily reflect a broader or more significant concern among the public."

Example for part 7 :
Hypothesis 1: "The heightened concern about data privacy correlates strongly with geographical locations that have strict data protection laws, suggesting that regulatory environments
significantly influence public sentiment about LLMs."
Hypothesis 2: "The prominence of concerns related to ethical AI use among influencers and verified accounts may amplify these issues, leading to increased public awareness and potentially
 influencing policy discussions. This indicates that public figures play a crucial role in shaping discourse around AI ethics."

The number of words given are purely indicative. If a part need more text to be relevant it can be longer.
</instructions>

<format>
Minimum 1000 words, markdown syntax
</format>
'''

In [193]:
# Serialize the sample dataset to a CSV string (no index, no header row)
csv_string = df_sample.to_csv(index=False, sep=',', header=False)

In [194]:
# Count tokens with NLTK's word tokenizer (an approximation of the model's token count)
def count_tokens(text):
    nltk.download('punkt')  # does nothing if the tokenizer data is already present
    tokens = word_tokenize(text)
    return len(tokens)
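
Note that `word_tokenize` counts NLTK word tokens, whereas OpenAI models measure their context in byte-pair-encoding (BPE) tokens, so the two counts can differ. The sketch below shows one way to get a closer estimate. It assumes the `tiktoken` package is installed and that the target model uses the `cl100k_base` encoding (an assumption, since the notebook names the model only as GPT-turbo-0125). If `tiktoken` is unavailable, it falls back to a rough rule of thumb of about four characters per token. `count_model_tokens` is a hypothetical helper name.

```python
import math


def count_model_tokens(text, encoding_name="cl100k_base"):
    """Estimate how many model tokens `text` occupies.

    Hypothetical helper: uses tiktoken when available, otherwise falls back
    to a crude ~4-characters-per-token heuristic for English text.
    """
    try:
        import tiktoken  # optional dependency: pip install tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
        # disallowed_special=() makes strings such as "<|endoftext|>" inside
        # a tweet be encoded as ordinary text instead of raising an error.
        return len(encoding.encode(text, disallowed_special=()))
    except Exception:
        # tiktoken is missing or its encoding files could not be loaded.
        return math.ceil(len(text) / 4)


print(count_model_tokens("Worried about how GPT-4 might handle my data."))
```

Calling `count_model_tokens(csv_string)` and `count_model_tokens(prompt)` would give figures that are more directly comparable to the model's context limit than the NLTK counts.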

In [195]:
token_count_data = count_tokens(csv_string)
print("Number of tokens in the CSV string:", token_count_data)

Number of tokens in the CSV string: 12879


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [196]:
token_count_prompt = count_tokens(prompt)
print("Number of tokens in the final prompt:", token_count_prompt)

Number of tokens in the final prompt: 1119


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [197]:
print("Total of tokens :", token_count_data + token_count_prompt)

Total of tokens : 13998
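
To turn this total into a pass/fail check, it can be compared against the model's context window while leaving room for the response, which the prompt asks to be at least 1000 words long. A minimal sketch follows. `remaining_tokens` is a hypothetical helper, and the 16,385-token window is an assumed value that should be replaced with the documented limit of the model actually used. Because the counts above are NLTK word tokens rather than the model's own tokens, the real margin may be smaller.

```python
def remaining_tokens(prompt_tokens, data_tokens, context_window):
    """Return how many tokens are left for the model's answer.

    A negative result means the prompt plus the data exceed the window.
    """
    return context_window - (prompt_tokens + data_tokens)


# Counts printed by the notebook cells
PROMPT_TOKENS = 1119
DATA_TOKENS = 12879
# Assumed context window; check the documentation of the target model
ASSUMED_CONTEXT_WINDOW = 16385

left = remaining_tokens(PROMPT_TOKENS, DATA_TOKENS, ASSUMED_CONTEXT_WINDOW)
print("Tokens left for the response:", left)
```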


### 5. Export the sample dataset to a CSV file

In [198]:
df_sample.to_csv('sample_data.csv', index=False)