# Analysis of the dataset 

### Research question: How has the emergence of AI affected the public, as reflected in tweets?
#### Analysis: sentiment analysis through NLP, with a focus on ChatGPT
- Categorize tweets as positive, negative, or neutral
- Measure word frequencies and the types of words used
- Track how sentiment evolves from the release of ChatGPT until recently
- Measure the frequency of tweets about AI (mean number of tweets per person)
- Try to work with threads (to be continued...)
- Categorize tweet authors as workers, students, or researchers (professors) and check whether sentiment differs across these categories


### Some references for the sentiment analysis
- https://www.analyticsvidhya.com/blog/2021/06/twitter-sentiment-analysis-a-nlp-use-case-for-beginners/
- https://huggingface.co/blog/sentiment-analysis-twitter

In [29]:
import re
import numpy as np
import pandas as pd
# plotting
import seaborn as sns
#from wordcloud import WordCloud
import matplotlib.pyplot as plt
# nltk
from nltk.stem import WordNetLemmatizer
# sklearn
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report

## Exploratory analysis with labeled dataset
* The dataset contains only English tweets
* The tweets are about ChatGPT and were gathered over one month

In [59]:
df = pd.read_csv('datasets/sentiment.csv')

In [60]:
df.sample(5)

Unnamed: 0.1,Unnamed: 0,tweets,labels
162676,162676,What Yann LeCun thinks of ChatGPT? https://t.c...,bad
133155,133155,What does the future of Cybersecurity look lik...,neutral
103244,103244,OpenAI releasing ChatGPT will probably go down...,neutral
39072,39072,"it's over! you guys broke ChatGPT, and now I c...",bad
159712,159712,Tryna get chatGPT to write my project for me 🔥🔥🔥,bad


In [61]:
df.columns

Index(['Unnamed: 0', 'tweets', 'labels'], dtype='object')

In [62]:
print("length of data is", len(df))

length of data is 219294


In [63]:
df.shape

(219294, 3)

In [64]:
df.dtypes

Unnamed: 0     int64
tweets        object
labels        object
dtype: object
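
Before preprocessing, it can help to check how the labels are distributed and to drop the leftover index column. The following is a minimal sketch on a small made-up frame: the rows in `toy` and the `"good"` label are assumptions for illustration (only `"bad"` and `"neutral"` appear in the sample above).

```python
import pandas as pd

# Toy stand-in for datasets/sentiment.csv; the rows are invented for illustration.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "tweets": ["love chatgpt", "chatgpt is down", "chatgpt released", "so slow today"],
    "labels": ["good", "bad", "neutral", "bad"],
})

# "Unnamed: 0" usually comes from an index that was written out by an earlier
# to_csv call, so it carries no information and can be dropped.
toy = toy.drop(columns=["Unnamed: 0"])

# Share of each label; normalize=True returns fractions that sum to 1.
label_share = toy["labels"].value_counts(normalize=True)
print(label_share)
```

On the real data the same two calls would apply to `df`. Passing `index_col=0` to `pd.read_csv` is another way to avoid loading the extra column in the first place.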

## Data preprocessing

In [65]:
df['tweets'] = df['tweets'].str.lower()
df['tweets'].tail()

219289    other software projects are now trying to repl...
219290    i asked #chatgpt to write a #nye joke for seos...
219291    chatgpt is being disassembled until it can onl...
219292    2023 predictions by #chatgpt. nothing really s...
219293     from chatgpt, neat stuff https://t.co/qjjuf2z2m0
Name: tweets, dtype: object

In [49]:
# defining set containing all stopwords in english

stopwordlist = ['a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an',
             'and','any','are', 'as', 'at', 'be', 'because', 'been', 'before',
             'being', 'below', 'between','both', 'by', 'can', 'd', 'did', 'do',
             'does', 'doing', 'down', 'during', 'each','few', 'for', 'from',
             'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here',
             'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in',
             'into','is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma',
             'me', 'more', 'most','my', 'myself', 'now', 'o', 'of', 'on', 'once',
             'only', 'or', 'other', 'our', 'ours','ourselves', 'out', 'own', 're',
             's', 'same', 'she', "shes", 'should', "shouldve",'so', 'some', 'such',
             't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them',
             'themselves', 'then', 'there', 'these', 'they', 'this', 'those',
             'through', 'to', 'too','under', 'until', 'up', 've', 'very', 'was',
             'we', 'were', 'what', 'when', 'where','which','while', 'who', 'whom',
             'why', 'will', 'with', 'won', 'y', 'you', "youd","youll", "youre",
             "youve", 'your', 'yours', 'yourself', 'yourselves']

# build a set for fast membership tests, then drop the stopwords from the tweet text
STOPWORDS = set(stopwordlist)

def cleaning_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

# read the lowercased 'tweets' column and store the cleaned text in a new 'tweet' column
df['tweet'] = df['tweets'].apply(cleaning_stopwords)
df['tweet'].head()

0    chatgpt: optimizing language models dialogue h...
1    try talking chatgpt, new ai system optimized d...
2    chatgpt: optimizing language models dialogue h...
3    thrilled share chatgpt, new model optimized di...
4    2 minutes ago, @openai released new chatgpt. \...
Name: tweet, dtype: object

In [50]:
# cleaning and removing punctuation

import string

punctuations_list = string.punctuation

def cleaning_punctuations(text):
    # map every punctuation character to None so translate() deletes it
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)

df['tweet'] = df['tweet'].apply(cleaning_punctuations)
df['tweet'].tail()

219289    software projects trying replicate chatgpt htt...
219290    asked chatgpt write nye joke seos delivered nn...
219291                       chatgpt disassembled dissemble
219292    2023 predictions chatgpt nothing really specif...
219293                chatgpt neat stuff httpstcoqjjuf2z2m0
Name: tweet, dtype: object

In [51]:
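# cleaning and removing repeating characters
# runs of three or more identical characters are collapsed to two
# (e.g. "sooooo" -> "soo"), so legitimate double letters are left alone

def cleaning_repeating_char(text):
    # \1 is a backreference to the captured character; without the backslash
    # the pattern would match a literal "1" instead
    return re.sub(r'(.)\1{2,}', r'\1\1', text)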

df['tweet'] = df['tweet'].apply(lambda x: cleaning_repeating_char(x))
df['tweet'].tail()

219289    software projects trying replicate chatgpt htt...
219290    asked chatgpt write nye joke seos delivered nn...
219291                       chatgpt disassembled dissemble
219292    2023 predictions chatgpt nothing really specif...
219293                chatgpt neat stuff httpstcoqjjuf2z2m0
Name: tweet, dtype: object

In [52]:
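# cleaning and removing URLs
# note: punctuation was already stripped above, so links now look like
# "httpstcoqjjuf2z2m0" and no longer match; this step needs to run before
# cleaning_punctuations to have any effect

def cleaning_URLs(data):
    # \s matches whitespace, so [^\s]+ consumes the rest of the link
    return re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', ' ', data)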

df['tweet'] = df['tweet'].apply(lambda x: cleaning_URLs(x))
df['tweet'].tail()

219289    software projects trying replicate chatgpt htt...
219290    asked chatgpt write nye joke seos delivered nn...
219291                       chatgpt disassembled dissemble
219292    2023 predictions chatgpt nothing really specif...
219293                chatgpt neat stuff httpstcoqjjuf2z2m0
Name: tweet, dtype: object

In [53]:
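# cleaning and removing mentions
# note: "@" is stripped by cleaning_punctuations, so this step only has an
# effect if it runs before punctuation removal

def remove_mentions(text):
    return re.sub(r"@\S+", "", text)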

df['tweet'] = df['tweet'].apply(lambda x: remove_mentions(x))
df['tweet'].tail()

219289    software projects trying replicate chatgpt htt...
219290    asked chatgpt write nye joke seos delivered nn...
219291                       chatgpt disassembled dissemble
219292    2023 predictions chatgpt nothing really specif...
219293                chatgpt neat stuff httpstcoqjjuf2z2m0
Name: tweet, dtype: object
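
The outputs above still contain `httpstcoqjjuf2z2m0`, because the URL and mention patterns rely on `://` and `@`, which punctuation removal has already deleted. One possible way to combine the steps in an order where each pattern can still match is sketched below. The function name `clean_tweet` and the compiled pattern names are assumptions made for this example, not part of the notebook, and stopword removal is left out to keep it short.

```python
import re
import string

URL_RE = re.compile(r"(www\.\S+|https?://\S+)")
MENTION_RE = re.compile(r"@\S+")
REPEAT_RE = re.compile(r"(.)\1{2,}")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text):
    text = text.lower()
    text = URL_RE.sub(" ", text)         # URLs first, while "://" is intact
    text = MENTION_RE.sub(" ", text)     # mentions next, while "@" is intact
    text = text.translate(PUNCT_TABLE)   # only now strip punctuation
    text = REPEAT_RE.sub(r"\1\1", text)  # collapse runs of 3+ characters to 2
    return " ".join(text.split())        # normalise whitespace

print(clean_tweet("From @OpenAI: ChatGPT is sooooo neat!!! https://t.co/qjjuf2z2m0"))
# prints: from chatgpt is soo neat
```

Applied to the notebook's data this would be a single `df['tweets'].apply(clean_tweet)` call, followed by the stopword step.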

In [56]:
import nltk 

nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Ajkuna
[nltk_data]     Seipi\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [57]:
# tokenization of tweet text: the process of splitting text into
# smaller chunks, called tokens
# each token is an input to the machine learning algorithm as a feature

from nltk.tokenize import word_tokenize

test2 = df['tweet'].apply(word_tokenize)
test2.tail()

219289    [software, projects, trying, replicate, chatgp...
219290    [asked, chatgpt, write, nye, joke, seos, deliv...
219291                   [chatgpt, disassembled, dissemble]
219292    [2023, predictions, chatgpt, nothing, really, ...
219293           [chatgpt, neat, stuff, httpstcoqjjuf2z2m0]
Name: tweet, dtype: object

In [58]:
df['tweet'] = test2
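
Once the tweets are tokenized, the word-frequency analysis listed in the plan can be approached by counting tokens. The sketch below uses a few hypothetical token lists (the `tokenized` data is invented and only mimics the shape of the `word_tokenize` output) and the standard-library `Counter`.

```python
from collections import Counter

# Hypothetical tokenized tweets, shaped like the lists produced by word_tokenize.
tokenized = [
    ["chatgpt", "neat", "stuff"],
    ["asked", "chatgpt", "write", "joke"],
    ["chatgpt", "write", "project"],
]

# Flatten the token lists and count how often each token occurs.
word_counts = Counter(token for tokens in tokenized for token in tokens)

print(word_counts.most_common(2))
# prints: [('chatgpt', 3), ('write', 2)]
```

With the notebook's DataFrame, the same generator expression could iterate over `df['tweet']` instead of `tokenized`.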

In [None]:
# applying stemming
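
The stemming step is not filled in yet. As one possible approach, the sketch below uses NLTK's `PorterStemmer`, which needs no extra data download. The choice of stemmer and the helper name `stem_tokens` are assumptions for illustration; the notebook's imports suggest `WordNetLemmatizer` may be intended instead.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens):
    # Reduce each token in a tokenized tweet to its stem.
    return [stemmer.stem(token) for token in tokens]

print(stem_tokens(["asked", "projects"]))
# prints: ['ask', 'project']
```

Since `df['tweet']` holds token lists at this point, the helper could be applied with `df['tweet'].apply(stem_tokens)`.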

