# Analysis of the dataset 

### Research question: What is the impact of the emergence of AI on the population, based on tweets?
#### Analysis: Sentiment Analysis through NLP, focus on ChatGPT
- Categorize tweets: positive, negative, neutral
- Frequency of words used, and what types of words
- Evolution of sentiment: from the release of ChatGPT until recently
- Frequency of tweets about AI (mean number of tweets per person)
- Try to work with threads (to be continued...)
- Categorize tweets by author group (workers, students, researchers/professors) and see whether sentiment differs between these categories


### Some references for the sentiment analysis
- https://www.analyticsvidhya.com/blog/2021/06/twitter-sentiment-analysis-a-nlp-use-case-for-beginners/
- https://huggingface.co/blog/sentiment-analysis-twitter

In [1]:
import re
import numpy as np
import pandas as pd
# plotting
import seaborn as sns
#from wordcloud import WordCloud
import matplotlib.pyplot as plt
# nltk
from nltk.stem import WordNetLemmatizer
# sklearn
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report

## Exploratory analysis with labeled dataset
* dataset contains only English tweets
* tweets about ChatGPT gathered over one month

In [2]:
df = pd.read_csv('datasets/sentiment.csv', usecols=['tweets','labels'])

In [3]:
df.sample(5)

Unnamed: 0,tweets,labels
195068,"News about https://t.co/jn2ucvZO1g! Wordle, C...",bad
55694,I wonder how long before the first Ecology opi...,bad
18167,Future for creators is going to be incredible🏆...,bad
79322,Why do women in marriage/relationships cheat?\...,bad
66030,#ChatGPT is an atheist https://t.co/LdYs4pBnLp,bad


In [4]:
df.columns

Index(['tweets', 'labels'], dtype='object')

In [5]:
print("length of data is", len(df))

length of data is 219294


In [6]:
df.shape

(219294, 2)

In [7]:
df.dtypes

tweets    object
labels    object
dtype: object

In [8]:
# binary target: 1 for negative ('bad') tweets, 0 for 'neutral' and 'good'
df.labels = df.labels.replace('bad', 1)
df.labels = df.labels.replace('neutral', 0)
df.labels = df.labels.replace('good', 0)

df.tail()

Unnamed: 0,tweets,labels
219289,Other Software Projects Are Now Trying to Repl...,1
219290,I asked #ChatGPT to write a #NYE Joke for SEOs...,0
219291,chatgpt is being disassembled until it can onl...,1
219292,2023 predictions by #chatGPT. Nothing really s...,1
219293,"From ChatGPT, neat stuff https://t.co/qjjUF2Z2m0",0
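
The recoding above can also be written as a single dictionary lookup. The following is a minimal sketch on a made-up toy Series, assuming the only label values are `bad`, `neutral` and `good`; the names `label_map` and `binary` are illustrative.

```python
import pandas as pd

# Toy labels standing in for the `labels` column (values assumed: bad / neutral / good).
labels = pd.Series(["bad", "good", "neutral", "bad", "good"])

# Binary target: 1 for negative tweets, 0 for everything else.
label_map = {"bad": 1, "neutral": 0, "good": 0}
binary = labels.map(label_map)

# map() yields NaN for any label missing from the dict,
# so an unexpected label value is easy to detect.
assert binary.notna().all()

print(binary.value_counts().to_dict())
```

Keeping the mapping in one dictionary makes the choice to merge `neutral` and `good` into a single class explicit.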


In [9]:
df.head()

Unnamed: 0,tweets,labels
0,ChatGPT: Optimizing Language Models for Dialog...,0
1,"Try talking with ChatGPT, our new AI system wh...",0
2,ChatGPT: Optimizing Language Models for Dialog...,0
3,"THRILLED to share that ChatGPT, our new model ...",0
4,"As of 2 minutes ago, @OpenAI released their ne...",1


In [10]:
df.sample(5)

Unnamed: 0,tweets,labels
125880,chatGPT killer use case confirmed https://t.co...,1
35008,I asked ChatGPT to write about itself in my st...,1
105806,"""Are you ready for a wild ride? The Bearded Ti...",0
131595,As someone who has personally used the chatGPT...,0
200667,Sabine got me thinking of Gödel's incompletene...,0


## Splitting into train and test

In [11]:
from sklearn.model_selection import train_test_split

tweets_train, tweets_test, target_train, target_test = train_test_split(df.tweets,df.labels,test_size = 0.2)

In [12]:
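# note: the assignment aligns on the index, so rows held out for the test set become NaN in df['tweets']
df['tweets'] = tweets_train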
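
Assigning the training subset back into the full frame explains the `NaN` entries that show up in later outputs: pandas aligns on the index, so every held-out row is left empty. The sketch below reproduces this on a made-up toy frame and shows one possible alternative, applying the same cleaning function to each split separately. The `clean` function is a hypothetical stand-in for the preprocessing steps, not the notebook's actual code.

```python
import pandas as pd

# Toy frame standing in for the labeled tweets (contents are made up).
toy = pd.DataFrame({
    "tweets": ["A b", "C d", "E f", "G h", "I j"],
    "labels": [1, 0, 0, 1, 0],
})

# Hold out 20% of the rows; sample() keeps the original index labels.
test_part = toy.sample(frac=0.2, random_state=0)
train_part = toy.drop(test_part.index)

# Assigning the training subset back to the full frame aligns on the index,
# so every held-out row becomes NaN.
aligned = toy.copy()
aligned["tweets"] = train_part["tweets"]
print(aligned["tweets"].isna().sum())

# Alternative: leave the frame alone and clean each split with the same function.
def clean(text):  # hypothetical stand-in for the preprocessing steps
    return text.lower()

train_clean = train_part["tweets"].apply(clean)
test_clean = test_part["tweets"].apply(clean)
```

With this layout the test tweets go through the same preprocessing as the training tweets, and no `NaN` rows are turned into the string `"nan"` by later `str(text)` calls.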

## Data preprocessing

In [13]:
df['tweets'] = df['tweets'].str.lower()
df['tweets'].tail()

219289    other software projects are now trying to repl...
219290    i asked #chatgpt to write a #nye joke for seos...
219291    chatgpt is being disassembled until it can onl...
219292    2023 predictions by #chatgpt. nothing really s...
219293                                                  NaN
Name: tweets, dtype: object

In [14]:
# defining the list of English stopwords

stopwordlist = ['a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an',
             'and','any','are', 'as', 'at', 'be', 'because', 'been', 'before',
             'being', 'below', 'between','both', 'by', 'can', 'd', 'did', 'do',
             'does', 'doing', 'down', 'during', 'each','few', 'for', 'from',
             'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here',
             'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in',
             'into','is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma',
             'me', 'more', 'most','my', 'myself', 'now', 'o', 'of', 'on', 'once',
             'only', 'or', 'other', 'our', 'ours','ourselves', 'out', 'own', 're',
             's', 'same', 'she', "shes", 'should', "shouldve",'so', 'some', 'such',
             't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them',
             'themselves', 'then', 'there', 'these', 'they', 'this', 'those',
             'through', 'to', 'too','under', 'until', 'up', 've', 'very', 'was',
             'we', 'were', 'what', 'when', 'where','which','while', 'who', 'whom',
             'why', 'will', 'with', 'won', 'y', 'you', "youd","youll", "youre",
             "youve", 'your', 'yours', 'yourself', 'yourselves']

# cleaning and removing the above stop words list from the tweet text
STOPWORDS = set(stopwordlist)

def cleaning_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

df['tweets'] = df['tweets'].apply(lambda text: cleaning_stopwords(text))
df['tweets'].head()

0    chatgpt: optimizing language models dialogue h...
1                                                  nan
2    chatgpt: optimizing language models dialogue h...
3                                                  nan
4    2 minutes ago, @openai released new chatgpt. \...
Name: tweets, dtype: object

In [15]:
# cleaning and removing punctuation

import string

english_punctuations = string.punctuation
punctuations_list = english_punctuations

def cleaning_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)

df['tweets']= df['tweets'].apply(lambda x: cleaning_punctuations(x))
df['tweets'].tail()

219289    software projects trying replicate chatgpt htt...
219290    asked chatgpt write nye joke seos delivered nn...
219291                       chatgpt disassembled dissemble
219292    2023 predictions chatgpt nothing really specif...
219293                                                  nan
Name: tweets, dtype: object

In [16]:
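# cleaning and removing repeating characters
# note: without a backslash, "1" is a literal digit rather than a backreference,
# so this pattern only rewrites a character followed by one or more "1"s;
# collapsing repeated characters needs r'(.)\1+' with replacement r'\1'

def cleaning_repeating_char(text):
    return re.sub(r'(.)1+', r'1', text)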

df['tweets'] = df['tweets'].apply(lambda x: cleaning_repeating_char(x))
df['tweets'].tail()

219289    software projects trying replicate chatgpt htt...
219290    asked chatgpt write nye joke seos delivered nn...
219291                       chatgpt disassembled dissemble
219292    2023 predictions chatgpt nothing really specif...
219293                                                  nan
Name: tweets, dtype: object
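
The output above is identical to the previous step, which suggests the pattern never fired. A version that does collapse repeated characters could look like the sketch below. The function names are illustrative, and the second variant (shortening only runs of three or more) is an assumption about what is desirable for tweets, not something the notebook specifies.

```python
import re

def collapse_repeats(text):
    # \1 refers back to the character captured by (.), so (.)\1+ matches a run
    # of two or more identical characters; the run is replaced by one copy.
    return re.sub(r"(.)\1+", r"\1", text)

def collapse_long_repeats(text):
    # Gentler variant: only runs of three or more are shortened, down to two,
    # which leaves ordinary double letters ("cool", "stuff") untouched.
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

print(collapse_repeats("soooo coool"))
print(collapse_long_repeats("soooo coool"))
```

The first variant also shortens legitimate double letters (for example `stuff` becomes `stuf`), which changes what the stemmer sees afterwards; the second variant avoids that at the cost of leaving some elongations at two characters.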

In [17]:
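# cleaning and removing URLs
# \S+ consumes the rest of the URL up to the next whitespace;
# the pattern relies on "://" and "." still being present in the text

def cleaning_URLs(data):
    return re.sub(r'((www\.\S+)|(https?://\S+))', ' ', data)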

df['tweets'] = df['tweets'].apply(lambda x: cleaning_URLs(x))

def remove_hyperlink(word):
    return re.sub(r"http\S+", "", word)

df['tweets'] = df['tweets'].apply(lambda x: remove_hyperlink(x))

df['tweets'].tail()

219289          software projects trying replicate chatgpt 
219290    asked chatgpt write nye joke seos delivered nn...
219291                       chatgpt disassembled dissemble
219292    2023 predictions chatgpt nothing really specif...
219293                                                  nan
Name: tweets, dtype: object

In [178]:
# cleaning and removing mentions 

def remove_mentions(word):
    return re.sub(r"@\S+", "", word)

df['tweets'] = df['tweets'].apply(lambda x: remove_mentions(x))
df['tweets'].tail()

219289          software projects trying replicate chatgpt 
219290    asked chatgpt write nye joke seos delivered nn...
219291                       chatgpt disassembled dissemble
219292    2023 predictions chatgpt nothing really specif...
219293                                  chatgpt neat stuff 
Name: tweets, dtype: object
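
The order of the cleaning steps matters here: punctuation was stripped before URLs and mentions were handled, so by that point `://`, `.` and `@` are already gone, and a handle such as `@openai` survives as the plain word `openai`. The following is a minimal sketch of one possible ordering, with URLs and mentions removed first. The names `clean_tweet`, `URL_RE`, `MENTION_RE` and `PUNCT_TABLE` are illustrative, and `www.example.com` is a placeholder address.

```python
import re
import string

URL_RE = re.compile(r"(https?://\S+|www\.\S+)")
MENTION_RE = re.compile(r"@\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text):
    text = text.lower()
    text = URL_RE.sub(" ", text)         # needs "://" and "." still present
    text = MENTION_RE.sub(" ", text)     # needs "@" still present
    text = text.translate(PUNCT_TABLE)   # strips "#", ".", "!" and the rest
    return " ".join(text.split())        # normalize whitespace

print(clean_tweet("From ChatGPT, neat stuff https://t.co/qjjUF2Z2m0"))
print(clean_tweet("Thanks @OpenAI! #ChatGPT rocks www.example.com"))
```

Compiling the patterns once and running all steps in a single function also avoids several passes of `apply` over the full column.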

In [179]:
import nltk 

nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Ajkuna
[nltk_data]     Seipi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [180]:
# tokenization of tweet text: the process of splitting text into
# smaller chunks, called tokens
# each token is an input to the machine learning algorithm as a feature

from nltk.tokenize import word_tokenize

test2 = df['tweets'].apply(word_tokenize)
test2.tail()

219289     [software, projects, trying, replicate, chatgpt]
219290    [asked, chatgpt, write, nye, joke, seos, deliv...
219291                   [chatgpt, disassembled, dissemble]
219292    [2023, predictions, chatgpt, nothing, really, ...
219293                               [chatgpt, neat, stuff]
Name: tweets, dtype: object

In [181]:
df['tweets'] = test2

In [182]:
# applying stemming : process of removing and replacing suffixes from a token
# to obtain the root or base form of the word 

# Porter stemmer is a widely used stemming technique 

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text])


test = df['tweets'].apply(lambda text: stem_words(text))

test.tail()

219289                   softwar project tri replic chatgpt
219290    ask chatgpt write nye joke seo deliv nnwhi seo...
219291                          chatgpt disassembl dissembl
219292    2023 predict chatgpt noth realli specif trend ...
219293                                   chatgpt neat stuff
Name: tweets, dtype: object

In [183]:
df['tweets'] = test

## Word embedding techniques

In [94]:
# use of Bag of Words as a word embedding technique 

from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer(min_df = 2, max_features = 100000)

bow.fit(test)

# toarray() converts the sparse count matrix to a dense one, which needs far more memory
tweets_processed = bow.transform(test).toarray()

MemoryError: Unable to allocate 69.6 GiB for an array with shape (219294, 42608) and data type int64
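
The size in the error message follows directly from the matrix shape: rows times columns times 8 bytes per `int64` entry. The sketch below checks that arithmetic and shows, on a made-up toy corpus, that `CountVectorizer` already returns a SciPy sparse matrix that can be kept as is. The toy documents and variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Dense size from the error message: rows x columns x 8 bytes (int64).
rows, cols = 219294, 42608
dense_gib = rows * cols * 8 / 2**30
print(round(dense_gib, 1))  # 69.6, matching the MemoryError

# Toy corpus standing in for the stemmed tweets.
docs = ["chatgpt neat stuff", "ask chatgpt write joke", "chatgpt write code"]

bow = CountVectorizer()
X = bow.fit_transform(docs)  # SciPy sparse matrix; only non-zero counts are stored

print(X.shape, X.nnz)
```

Scikit-learn estimators such as `LogisticRegression` accept sparse input directly, so skipping the dense conversion is usually enough to stay within memory.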

In [None]:
# Word2Vec embedding technique

import gensim

tokenize = test.apply(lambda x: x.split())
w2vec_model = gensim.models.Word2Vec(tokenize, min_count = 1, vector_size = 100, window = 5, sg = 1)
w2vec_model.train(tokenize, total_examples = len(test), epochs = 20)

In [None]:
tweets_train = tokenize
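
A classifier needs one fixed-length numeric vector per tweet, whereas `tokenize` holds lists of strings. One common option is to average the word vectors of each tweet's tokens. The sketch below uses a plain dictionary as a stand-in for the trained vectors; with gensim, the model's `wv` attribute should support the same `token in ...` and `...[token]` lookups, but that substitution is an assumption and is not run here. The helper `tweet_vector` is hypothetical.

```python
import numpy as np

def tweet_vector(tokens, vectors, dim):
    """Average the vectors of the tokens that have one; zeros if none do."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy 3-dimensional lookup standing in for a trained model's word vectors.
toy_vectors = {
    "chatgpt": np.array([1.0, 0.0, 0.0]),
    "neat": np.array([0.0, 1.0, 0.0]),
}

tokenized = [["chatgpt", "neat", "stuff"], ["unknown"]]
X = np.vstack([tweet_vector(toks, toy_vectors, dim=3) for toks in tokenized])
print(X.shape)  # (2, 3)
```

Averaging discards word order, but it yields a dense matrix of shape (number of tweets, vector size) that a model's `fit` method can consume.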

### Model Fitting 

* Logistic Regression: a simple and efficient baseline for classification

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# training the model
model.fit(tweets_train, target_train)

# predicting class probabilities on the test set
prediction = model.predict_proba(tweets_test)

# label a tweet 1 when the predicted probability of class 1 is at least 0.3, else 0
prediction_int = prediction[:, 1] >= 0.3
prediction_int = prediction_int.astype(int)  # np.int was removed in NumPy 1.24
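
For the fit and predict calls to work, both the training and the test tweets have to be turned into numeric features by the same fitted vectorizer. The following is a small end-to-end sketch under stated assumptions: made-up toy texts, TF-IDF features instead of Word2Vec, and the 0.3 cut-off from the cell above. It illustrates the flow only and is not the notebook's final model.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the cleaned tweets; 1 = negative, 0 = other.
train_texts = ["chatgpt bad wrong answer", "chatgpt neat stuff",
               "terribl wrong output", "love chatgpt great"]
train_labels = [1, 0, 1, 0]
test_texts = ["wrong answer again", "great stuff"]

# Fit the vectorizer on training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

model = LogisticRegression()
model.fit(X_train, train_labels)

# Column 1 of predict_proba holds the probability of class 1.
proba = model.predict_proba(X_test)[:, 1]
threshold = 0.3
pred = (proba >= threshold).astype(int)
print(pred)
```

A threshold below 0.5 trades precision for recall on the negative class; whether 0.3 is appropriate would need to be checked against the confusion matrix on held-out data.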

## Analysis of the dataset from the first month after launch

In [2]:
df = pd.read_csv('datasets/chatgptfirst.csv')

In [3]:
df.head()

Unnamed: 0,tweet_id,created_at,like_count,quote_count,reply_count,retweet_count,tweet,country,photo_url,city,country_code
0,1598014056790622225,2022-11-30 18:00:15+00:00,2,0,0,0,ChatGPT: Optimizing Language Models for Dialog...,,,,
1,1598014522098208769,2022-11-30 18:02:06+00:00,12179,889,1130,3252,"Try talking with ChatGPT, our new AI system wh...",,,,
2,1598014741527527435,2022-11-30 18:02:58+00:00,2,0,0,1,ChatGPT: Optimizing Language Models for Dialog...,,https://pbs.twimg.com/media/Fi1J8HbWAAMv_yi.jpg,,
3,1598015493666766849,2022-11-30 18:05:58+00:00,561,8,25,66,"THRILLED to share that ChatGPT, our new model ...",,https://pbs.twimg.com/media/Fi1Km3WUYAAfzHS.jpg,,
4,1598015509420994561,2022-11-30 18:06:01+00:00,1,0,0,0,"As of 2 minutes ago, @OpenAI released their ne...",,,,


In [4]:
df = df['tweet']

In [5]:
df.sample(5)

81326     There's a whole genre on bilibili teaching you...
113541                    Okay this ChatGPT AI tool.. wow!!
171849    I wish chatGPT existed when I was in college. ...
67417     What is ChatGPT, the viral social media AI? – ...
145738    Check out my latest blog post, where I see if ...
Name: tweet, dtype: object

In [6]:
# check if there is any NaN value

df.isnull().values.any()
df.isnull().sum()

0

In [7]:
len(df)

219294

In [8]:
df.head()

0    ChatGPT: Optimizing Language Models for Dialog...
1    Try talking with ChatGPT, our new AI system wh...
2    ChatGPT: Optimizing Language Models for Dialog...
3    THRILLED to share that ChatGPT, our new model ...
4    As of 2 minutes ago, @OpenAI released their ne...
Name: tweet, dtype: object