#### Project Title

- Leslie Cohrt: put their contribution here
- Sarah Auther: put their contribution here
- Shoshana Medved: put their contribution here

#### Introduction

In this study, we aim to understand how people spoke about ChatGPT on Twitter during the first month after its launch. Through this project, we hope to pinpoint key differences in responses as ChatGPT became more mainstream. By analyzing a sample of over 200,000 tweets posted from November 30th to December 31st, 2022, we can search for key communication trends among users and recurring beliefs about ChatGPT as an emerging platform.

We are interested in discovering how people view ChatGPT, based on data from Twitter:
- Through the lens of computer-mediated communication, how has public opinion of ChatGPT evolved as its usage has become more normalized?  
- What generates the most reaction between real people when talking about ChatGPT?

Detailed description of dataset

In [2]:
import pandas as pd

df = pd.read_csv("chatgpt.data.csv")
df.head()

Unnamed: 0,tweet_id,created_at,like_count,quote_count,reply_count,retweet_count,tweet,country,photo_url,city,country_code
0,1.59801e+18,2022-11-30 18:00:15+00:00,2,0,0,0,ChatGPT: Optimizing Language Models for Dialog...,,,,
1,1.59801e+18,2022-11-30 18:02:06+00:00,12179,889,1130,3252,"Try talking with ChatGPT, our new AI system wh...",,,,
2,1.59801e+18,2022-11-30 18:02:58+00:00,2,0,0,1,ChatGPT: Optimizing Language Models for Dialog...,,https://pbs.twimg.com/media/Fi1J8HbWAAMv_yi.jpg,,
3,1.59802e+18,2022-11-30 18:05:58+00:00,561,8,25,66,"THRILLED to share that ChatGPT, our new model ...",,https://pbs.twimg.com/media/Fi1Km3WUYAAfzHS.jpg,,
4,1.59802e+18,2022-11-30 18:06:01+00:00,1,0,0,0,"As of 2 minutes ago, @OpenAI released their ne...",,,,


In [3]:
# preprocessing

import numpy as np

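df1 = df.fillna(value=0)
# Keep only tweets with at least one like; .copy() avoids SettingWithCopyWarning on later column assignments
df1 = df1.loc[df1["like_count"] != 0].copy()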

df1.describe()

Unnamed: 0,tweet_id,like_count,quote_count,reply_count,retweet_count
count,120136.0,120136.0,120136.0,120136.0,120136.0
mean,1.602453e+18,30.716463,0.716072,2.120688,4.155083
std,2985152000000000.0,658.769995,19.037848,34.355344,85.684421
min,1.59801e+18,1.0,0.0,0.0,0.0
25%,1.60002e+18,1.0,0.0,0.0,0.0
50%,1.60165e+18,2.0,0.0,0.0,0.0
75%,1.60464e+18,6.0,0.0,1.0,1.0
max,1.60934e+18,119321.0,4598.0,5184.0,10593.0


Our research questions concern widely shared public opinion and the opinions that generate conversation and response. Tweets with no likes neither indicate an opinion shared by many nor generate a response, so we removed them from the dataset.

In [4]:
!pip install nltk

import nltk

nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

!pip install textblob
from textblob import TextBlob

import re

Collecting nltk
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting click
  Using cached click-8.1.3-py3-none-any.whl (96 kB)
Collecting regex>=2021.8.3
  Downloading regex-2023.6.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (770 kB)
Installing collected packages: regex, click, nltk
Successfully installed click-8.1.3 nltk-3.8.1 regex-2023.6.3


[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Collecting textblob
  Downloading textblob-0.17.1-py2.py3-none-any.whl (636 kB)
Installing collected packages: textblob
Successfully installed textblob-0.17.1


In [None]:
# Word tokenization    
text = df1['tweet'].apply(word_tokenize)


In [None]:
print(text)

In [None]:
import string

In [None]:
stop_words = set(stopwords.words('english'))

# Remove stop words from the tweet text only; other text columns (dates, URLs, locations) are left untouched
df1['tweet'] = df1['tweet'].apply(lambda x: ' '.join(word for word in str(x).split() if word.lower() not in stop_words))

In [None]:
df1.head()

In [None]:
import nltk
nltk.download('vader_lexicon')  # lexicon required by SentimentIntensityAnalyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

In [None]:
def get_sentiment(tweet):
    sentiment = sid.polarity_scores(tweet)
    return sentiment['compound']
    
df1['sentiment'] = df1['tweet'].apply(get_sentiment)

In [None]:
df1['sentiment']

In [None]:
df1.describe()

In [None]:
postive_tweets = df1[df1['sentiment'] > 0]
negative_tweets = df1[df1['sentiment'] < 0]
neutral_tweets = df1[df1['sentiment'] == 0]

postive_percentage = len(postive_tweets) / len(df1) * 100
negative_percentage = len(negative_tweets) / len(df1) * 100
neutral_percentage = len(neutral_tweets) / len(df1) * 100
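The three percentages above can also be derived in one pass by labelling each score first. The following is a minimal sketch, assuming a column of compound scores like `df1['sentiment']`; the helper names `label_sentiment` and `sentiment_shares` and the sample scores are hypothetical and only serve as illustration.

```python
import pandas as pd


def label_sentiment(score: float) -> str:
    """Map a compound score to a label using the same cut-offs as above (> 0, < 0, == 0)."""
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


def sentiment_shares(scores: pd.Series) -> pd.Series:
    """Return the percentage of scores that fall under each sentiment label."""
    labels = scores.apply(label_sentiment)
    shares = labels.value_counts(normalize=True) * 100
    # Fix the label order and report 0.0 for labels that never occur.
    return shares.reindex(["Positive", "Negative", "Neutral"], fill_value=0.0)


# Made-up scores standing in for df1['sentiment'].
example_scores = pd.Series([0.6, -0.3, 0.0, 0.2])
shares = sentiment_shares(example_scores)
print(shares)
```

Calling `sentiment_shares(df1['sentiment'])` on the real data should give the same numbers as the three separate variables, in a form that can be passed directly to a bar chart.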

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

#### How have people's opinions of ChatGPT evolved as its usage has become more normalized?
- sentiment analysis
- dependency parsing

In [None]:
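# Parse timestamps so the x-axis is treated as dates rather than as text categories
df1['created_at'] = pd.to_datetime(df1['created_at'])
df1.sort_values(by=['created_at'], ascending=True, inplace=True)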

df1.head()

data = df1

sns.scatterplot(x="created_at", y="sentiment", hue="sentiment", data=data)
plt.title("Sentiment Analysis Over Time")
plt.xlabel("Date")
plt.ylabel("Sentiment")
plt.legend(title="Sentiment")
plt.show()

### To do
- Sort by creation date, then compute average sentiment over time.
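One possible way to approach this step is to average the sentiment score per calendar day. The sketch below assumes the `created_at` and `sentiment` columns used in this notebook; the function name `daily_mean_sentiment` and the three sample rows are hypothetical, and daily bins are only one choice of granularity.

```python
import pandas as pd


def daily_mean_sentiment(frame: pd.DataFrame) -> pd.Series:
    """Average the `sentiment` column per calendar day (UTC)."""
    timestamps = pd.to_datetime(frame["created_at"], utc=True)
    return (
        frame.assign(created_at=timestamps)
        .sort_values("created_at")
        .set_index("created_at")["sentiment"]
        .resample("D")  # one bin per day
        .mean()
    )


# Made-up rows in the same timestamp format as the dataset.
example = pd.DataFrame({
    "created_at": [
        "2022-11-30 18:00:15+00:00",
        "2022-11-30 20:00:00+00:00",
        "2022-12-01 09:30:00+00:00",
    ],
    "sentiment": [0.5, -0.1, 0.3],
})
daily = daily_mean_sentiment(example)
print(daily)
```

On the real data, `daily_mean_sentiment(df1).plot()` would draw one point per day, which may be easier to read than a scatter plot of every tweet.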

#### What generates the most reaction between real people when talking about ChatGPT?
- sentiment analysis

In [None]:
# Reduce the data set to the roughly 250 tweets with the most interaction by requiring minimum like, reply, retweet, and quote counts.

df_r2 = df1.loc[(df1["like_count"] > 299) & (df1["reply_count"] > 49) & (df1["retweet_count"] > 49) & (df1["quote_count"] > 24)]

df_r2.describe()
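The threshold filter above keeps however many tweets clear all four cut-offs, which is not guaranteed to be exactly 250. As an alternative sketch, the tweets could be ranked by a combined engagement score and the top `n` taken. The unweighted sum used here, the column name `engagement`, and the function name `top_by_engagement` are assumptions for illustration, not the method used in this notebook.

```python
import pandas as pd

ENGAGEMENT_COLUMNS = ["like_count", "reply_count", "retweet_count", "quote_count"]


def top_by_engagement(frame: pd.DataFrame, n: int = 250) -> pd.DataFrame:
    """Return the n rows with the highest summed engagement counts."""
    # Assumption: every interaction type counts equally toward the score.
    scored = frame.assign(engagement=frame[ENGAGEMENT_COLUMNS].sum(axis=1))
    return scored.nlargest(n, "engagement")


# Made-up rows with the same count columns as the dataset.
example = pd.DataFrame({
    "tweet": ["a", "b", "c"],
    "like_count": [10, 500, 3],
    "reply_count": [1, 60, 0],
    "retweet_count": [2, 80, 1],
    "quote_count": [0, 30, 0],
})
top = top_by_engagement(example, n=2)
print(top[["tweet", "engagement"]])
```

The two approaches answer slightly different questions: thresholds require a tweet to perform well on every metric, while a summed score lets a very high like count compensate for few replies.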

In [None]:
postive_top_tweets = df_r2[df_r2['sentiment'] > 0]
negative_top_tweets = df_r2[df_r2['sentiment'] < 0]
neutral_top_tweets = df_r2[df_r2['sentiment'] == 0]

postive_percentage_top = len(postive_top_tweets) / len(df_r2) * 100
negative_percentage_top = len(negative_top_tweets) / len(df_r2) * 100
neutral_percentage_top = len(neutral_top_tweets) / len(df_r2) * 100

In [None]:
df_r2

In [None]:
sentiment_labels_2 = ['Positive', 'Negative', 'Neutral']
sentiment_percentages_2 = [postive_percentage_top, negative_percentage_top, neutral_percentage_top]

plt.bar(sentiment_labels_2, sentiment_percentages_2)
plt.xlabel('Sentiment')
plt.ylabel('Percentage')
plt.title('Sentiment Analysis Results of Top 250 Tweets')

plt.show()

In [None]:
# Use a new name so the original `df` is not overwritten
top_sentiment = pd.DataFrame([postive_percentage_top, negative_percentage_top, neutral_percentage_top],
                             index=['Positive Sentiment', 'Negative Sentiment', 'Neutral'], columns=['Sentiment Analysis Results of Top 250 Tweets'])
top_sentiment.plot(kind='pie', subplots=True, figsize=(8, 8))
plt.show()

### To do, part 2
- Build word clouds for the whole dataset and for df_r2, then compare the differences: are there any buzzwords that get more attention?
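The buzzword comparison can be approximated with plain word counts before any word cloud is drawn. The sketch below uses only the standard library as a stand-in for a word cloud; the function names `word_counts` and `overrepresented`, the simple regular-expression tokenizer, and the sample tweets are all hypothetical.

```python
import re
from collections import Counter


def word_counts(tweets, stop_words=frozenset()):
    """Count lowercase word occurrences across tweets, skipping stop words."""
    counts = Counter()
    for tweet in tweets:
        for word in re.findall(r"[a-z']+", tweet.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts


def overrepresented(subset_counts, full_counts, top_n=10):
    """Rank words by their share in the subset minus their share in the full set."""
    subset_total = sum(subset_counts.values()) or 1
    full_total = sum(full_counts.values()) or 1
    diff = {
        word: count / subset_total - full_counts[word] / full_total
        for word, count in subset_counts.items()
    }
    return sorted(diff, key=diff.get, reverse=True)[:top_n]


# Made-up tweets standing in for df1['tweet'] and df_r2['tweet'].
all_tweets = ["chatgpt is fun", "chatgpt wrote my essay", "the essay is fine"]
top_tweets = ["chatgpt wrote my essay"]
stop = {"is", "my", "the"}

full = word_counts(all_tweets, stop)
subset = word_counts(top_tweets, stop)
print(overrepresented(subset, full, top_n=2))
```

Words near the top of the returned list appear proportionally more often in the high-engagement subset than in the full dataset, which is the same signal a side-by-side word cloud comparison would show visually.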

### Results
- The sentiment breakdown of the top tweets does not vary from that of the full set.