### Note on large .csv files ###
The Reddit Climate Change Dataset (pavellexyr) and Sentiment140 (kazanova) are very large files with over 1 million observations. They are too large to upload to GitHub, so the Kaggle links to each are listed below. One more dataset is mentioned but has no associated code: Twitter US Airline Sentiment (crowdflower). Its link is included below in case it is needed in the future.

The Reddit Climate Change Dataset (pavellexyr): https://www.kaggle.com/datasets/pavellexyr/the-reddit-climate-change-dataset

Sentiment140 dataset with 1.6 million tweets (kazanova): https://www.kaggle.com/datasets/kazanova/sentiment140

Twitter US Airline Sentiment (crowdflower): https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Without this limit the kernel crashes. Each dataset is read in chunks of row_count rows,
# stopping after max_obv chunks, so only the first row_count * max_obv = 100K observations are kept.
row_count = 1000
max_obv = 100
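
The cap above works out to `row_count * max_obv` rows. As a minimal sketch of the same limit, and not the approach used in this notebook, `pd.read_csv` can stop after a fixed number of rows with `nrows` and skip unneeded columns with `usecols`. The in-memory CSV below is a hypothetical stand-in for a large file on disk.

```python
import io

import pandas as pd

row_count = 1000
max_obv = 100
row_limit = row_count * max_obv  # 100,000 rows

# Hypothetical stand-in for a large CSV on disk: 250,000 tiny rows.
fake_csv = io.StringIO("body,sentiment\n" + "some text,0.5\n" * 250_000)

# nrows stops parsing after row_limit data rows;
# usecols keeps only the columns of interest in memory.
sample = pd.read_csv(fake_csv, usecols=["body", "sentiment"], nrows=row_limit)
print(len(sample))  # 100000
```

Reading in chunks, as the cells below do, gives the same row cap and also allows per-chunk processing if that is ever needed.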

### (pavellexyr) The Reddit Climate Change Dataset ###
Only the-reddit-climate-change-dataset-comments.csv is used, since it has a sentiment column for the text. This file alone is over 4GB and processing all of it crashes the kernel, so only the first 100K observations are processed. The columns of interest are 'body' for the comment's body text and 'sentiment' for the analyzed sentiment of each text. NaN values are dropped, leaving a little under 100K usable observations.

This is the only dataset that has continuous sentiment values in [-1, 1] instead of discrete values in {-1, 0, 1}. If ALL of these datasets are combined into one, the result may therefore be biased toward the discrete values rather than anything in between.
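
One way to put the continuous scores on the same footing as the discrete datasets is to keep only their sign. This is a minimal sketch under that assumption; the values are the first five scores shown in the output below, and the zero threshold is the same one the later labeling cell uses.

```python
import numpy as np
import pandas as pd

# First five continuous scores from the rclimate output.
continuous = pd.Series([0.5719, -0.9877, -0.1143, 0.0, 0.6634])

# np.sign maps negative -> -1, zero -> 0, positive -> 1.
discrete = np.sign(continuous).astype(int)
print(discrete.tolist())  # [1, -1, -1, 0, 1]
```

Taking the sign discards how strong each score is, so it only makes sense when the combined data is used for three-class classification.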

In [3]:
rclimate_text = []
rclimate_sentiment = []
i = 0
# file_directory = '118Adatasets/the-reddit-climate-change-dataset-comments.csv'
file_directory = 'the-reddit-climate-change-dataset-comments.csv'
for chunk in pd.read_csv(file_directory, chunksize=row_count):
    if i < max_obv:
        rclimate_text += chunk['body'].tolist()
        rclimate_sentiment += chunk['sentiment'].tolist()
        i += 1
    else:
        break

In [4]:
rclimate_df = pd.DataFrame(data={'text': rclimate_text, 'sentiment': rclimate_sentiment}).dropna()
rclimate_df

Unnamed: 0,text,sentiment
0,Yeah but what the above commenter is saying is...,0.5719
1,Any comparison of efficiency between solar and...,-0.9877
2,I'm honestly waiting for climate change and th...,-0.1143
3,Not just Sacramento. It's actually happening a...,0.0000
4,I think climate change tends to get some peopl...,0.6634
...,...,...
99995,"Yet despite that misinformation, Joe fucking P...",0.2411
99996,That's oversimplified. \n\nThird world countri...,-0.7080
99997,The net loss between Lake Powell and Lake Mead...,0.6882
99998,Got news for ya. Democrats have been in power ...,-0.0521


### (cosmos98) Twitter and Reddit Sentimental analysis Dataset ###
Like with the dataset above, observations are limited to the first 100K and reduced to not have NaN values.

In [12]:
cosmos_twitter_text = []
cosmos_twitter_sentiment = []
i = 0
for chunk in pd.read_csv('cosmos98_Twitter_Data.csv', chunksize=row_count):
    if i < max_obv:
        cosmos_twitter_text += chunk['clean_text'].tolist()
        cosmos_twitter_sentiment += chunk['category'].tolist()
        i += 1
    else:
        break

cosmos_reddit_text = []
cosmos_reddit_sentiment = []
i = 0
for chunk in pd.read_csv('cosmos98_Reddit_Data.csv', chunksize=row_count):
    if i < max_obv:
        cosmos_reddit_text += chunk['clean_comment'].tolist()
        cosmos_reddit_sentiment += chunk['category'].tolist()
        i += 1
    else:
        break

In [13]:
cosmos_twitter_df = pd.DataFrame(data={'text': cosmos_twitter_text, 'sentiment': cosmos_twitter_sentiment}).dropna()
cosmos_reddit_df = pd.DataFrame(data={'text': cosmos_reddit_text, 'sentiment': cosmos_reddit_sentiment}).dropna()

### (kazanova) Sentiment140 dataset with 1.6 million tweets -- not usable ###
The dataset originally has 1.6 million tweets, limited here to 100K. NaN values are dropped.

Unlike the previous datasets, sentiment is recorded as 0=negative, 2=neutral, 4=positive.
However, the target column that is supposed to record this is entirely 0 in the rows read here, which is unlikely for a set of 1.6 million tweets. Therefore, the sentiment from this dataset cannot be used with the others. It might be saved for a separate purpose.
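
Since only the first 100K rows are read, a quick way to check whether the label column really holds a single value is to tally it chunk by chunk over the whole file. The sketch below assumes a headerless CSV with the label in column 0, as in the commented-out code in the next cell; `count_labels` is a hypothetical helper, and the three sample rows are made up for illustration.

```python
import io
from collections import Counter

import pandas as pd


def count_labels(csv_source, label_col=0, chunksize=1000):
    """Tally the values of one column without loading the whole file."""
    counts = Counter()
    # header=None: treat the first line as data, not as column names.
    for chunk in pd.read_csv(csv_source, header=None,
                             usecols=[label_col], chunksize=chunksize):
        counts.update(chunk[label_col].tolist())
    return counts


# Made-up rows in a label,id,date,query,user,text layout.
sample = io.StringIO(
    "0,1,Mon,NO_QUERY,user_a,bad day\n"
    "0,2,Mon,NO_QUERY,user_b,so tired\n"
    "4,3,Tue,NO_QUERY,user_c,great news\n"
)
counts = count_labels(sample)
print(dict(counts))  # {0: 2, 4: 1}
```

If the tally over the full file shows more than one label, the rows are likely grouped by label and a random sample would be needed instead of the first 100K.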

In [14]:
# kazanova_text = []
# kazanova_sentiment = []
# # this file has no header row; without header=None pandas uses the first observation as the column names
# col_text = 5
# col_sentiment = 0
# i = 0
# for chunk in pd.read_csv('118Adatasets/kazanova_sentiment140.csv', header=None, chunksize=row_count):
#     if i < max_obv:
#         kazanova_text += chunk.iloc[:, col_text].tolist()
#         kazanova_sentiment += chunk.iloc[:, col_sentiment].tolist()
#         i += 1
#     else:
#         break

In [15]:
# kazanova_df = pd.DataFrame(data={'text':kazanova_text, 'sentiment':kazanova_sentiment})
# kazanova_df['sentiment'].unique()

### (crowdflower) Twitter US Airline Sentiment -- not usable ###
The dataset has a Tweet id for each text, but not the text itself. Sentiment is recorded as the strings 'positive', 'negative' and 'neutral', which can be converted to the numerical values 1, -1, and 0 respectively. Since no text is readily available in the file, no code for the numeric conversion is written here. Unlike the kazanova dataset, this one likely cannot be used for another purpose without spending significant time retrieving the text from the Tweet id of each observation.

### (tirendazacademy) FIFA World Cup 2022 Tweets ###
Has Tweet text body and sentiment in strings of 'positive', 'negative' and 'neutral' which can be converted to 1, -1, and 0 respectively.
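
The string-to-number conversion described above can also be sketched with `Series.map` and a dictionary. This is an illustrative alternative, not the notebook's own function; the sample labels are made up, and any label missing from the dictionary becomes NaN.

```python
import pandas as pd

# Mapping from the label strings to the numeric convention used here.
label_to_number = {"positive": 1, "neutral": 0, "negative": -1}

labels = pd.Series(["positive", "neutral", "negative", "unknown"])

# Series.map looks up each value; unmapped labels turn into NaN.
numeric = labels.map(label_to_number)
print(numeric.dropna().tolist())  # [1.0, 0.0, -1.0]
```

Because NaN forces a float column, the values print as floats. Calling `.dropna()` afterwards, as the notebook does, removes the unrecognized labels.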

In [17]:
fifa_text = []
fifa_sentiment = []
i = 0
for chunk in pd.read_csv('fifa_world_cup_2022_tweets.csv', chunksize=row_count):
    if i < max_obv:
        fifa_text += chunk['Tweet'].tolist()
        fifa_sentiment += chunk['Sentiment'].tolist()
        i += 1
    else:
        break

In [18]:
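def str_num_sentiment(x):
    # Convert a sentiment label to a number: 1, 0, or -1.
    # Non-string values (e.g. NaN from an empty cell) pass through unchanged.
    if not isinstance(x, str):
        return x
    if x.isnumeric():
        # Numeric text such as '1': return it as a number, not a string.
        return int(x)
    elif x == 'positive':
        return 1
    elif x == 'neutral':
        return 0
    elif x == 'negative':
        return -1
    else:
        # Unrecognized label; dropped later by dropna().
        return np.nan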

In [19]:
fifa_df = pd.DataFrame(data={'text':fifa_text,'sentiment':[str_num_sentiment(x) for x in fifa_sentiment]}).dropna()
fifa_df

Unnamed: 0,text,sentiment
0,What are we drinking today @TucanTribe \n@MadB...,0
1,Amazing @CanadaSoccerEN #WorldCup2022 launch ...,1
2,Worth reading while watching #WorldCup2022 htt...,1
3,Golden Maknae shinning bright\n\nhttps://t.co/...,1
4,"If the BBC cares so much about human rights, h...",-1
...,...,...
22519,Here We go World cup 2022 #WorldCup2022,1
22520,Anderlecht confirms former Viborg FF's Jesper ...,0
22521,Great thread to read before the start of #Worl...,1
22522,Raphinha wants Brazil to be united at the #Wor...,1


Below are the cleaned DataFrames. Each has only the columns 'text' (unchanged) and 'sentiment', whose values lie in [-1, 1]. Only rclimate_df has continuous values.
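
If the cleaned DataFrames are ever used together, they can be stacked because they share the same two columns. The sketch below uses tiny made-up frames in place of the real ones, and the `source` column is an assumption added for illustration so that each row can still be traced to its dataset.

```python
import pandas as pd

# Made-up stand-ins for the cleaned DataFrames.
frames = {
    "rclimate": pd.DataFrame({"text": ["a"], "sentiment": [0.57]}),
    "fifa": pd.DataFrame({"text": ["b", "c"], "sentiment": [1, -1]}),
}

# Tag each frame with its origin, then stack them with a fresh index.
combined = pd.concat(
    [df.assign(source=name) for name, df in frames.items()],
    ignore_index=True,
)
print(combined)
```

Keeping the origin makes it possible to check later whether the one continuous-valued dataset behaves differently from the discrete ones.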

In [12]:
# rclimate_df
# cosmos_reddit_df
# cosmos_twitter_df
# fifa_df

In [20]:
rclimate_df

Unnamed: 0,text,sentiment
0,Yeah but what the above commenter is saying is...,0.5719
1,Any comparison of efficiency between solar and...,-0.9877
2,I'm honestly waiting for climate change and th...,-0.1143
3,Not just Sacramento. It's actually happening a...,0.0000
4,I think climate change tends to get some peopl...,0.6634
...,...,...
99995,"Yet despite that misinformation, Joe fucking P...",0.2411
99996,That's oversimplified. \n\nThird world countri...,-0.7080
99997,The net loss between Lake Powell and Lake Mead...,0.6882
99998,Got news for ya. Democrats have been in power ...,-0.0521


In [32]:
# Training a supervised ML sentiment classifier on the Reddit climate change DataFrame.

In [5]:
rclimate_df["sentiment"] = rclimate_df["sentiment"].apply(lambda curr: "negative" if float(curr) < 0 else ("positive" if float(curr) > 0 else "neutral"))
rclimate_df

Unnamed: 0,text,sentiment
0,Yeah but what the above commenter is saying is...,positive
1,Any comparison of efficiency between solar and...,negative
2,I'm honestly waiting for climate change and th...,negative
3,Not just Sacramento. It's actually happening a...,neutral
4,I think climate change tends to get some peopl...,positive
...,...,...
99995,"Yet despite that misinformation, Joe fucking P...",positive
99996,That's oversimplified. \n\nThird world countri...,negative
99997,The net loss between Lake Powell and Lake Mead...,positive
99998,Got news for ya. Democrats have been in power ...,negative


In [6]:
# TODO: more data cleaning is needed here, including removing unwanted words.

In [7]:
# storing the x's and y's
X = rclimate_df["text"]
Y = rclimate_df["sentiment"]

In [8]:
# train test split 
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=15, train_size=.6)

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.neighbors import KNeighborsClassifier

countVectorizer = CountVectorizer()
X_train = countVectorizer.fit_transform(X_train)

In [29]:
# countVectorizer.vocabulary_.get('climate')

23281

In [10]:
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train)
X_train = tf_transformer.transform(X_train)
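
With `use_idf=False` the transformer applies no inverse-document-frequency weighting. Under scikit-learn's default `norm='l2'`, it only rescales each row of counts to unit length. The two-document corpus below is made up to show this on a small scale.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["climate climate change", "change"]

# Vocabulary is sorted alphabetically: change -> 0, climate -> 1.
counts = CountVectorizer().fit_transform(docs)

# No IDF weighting; each row is divided by its Euclidean (L2) norm.
tf = TfidfTransformer(use_idf=False).fit_transform(counts)

print(counts.toarray())  # raw counts: [[1 2], [1 0]]
print(tf.toarray())      # rows scaled to length 1
```

The first row becomes [1, 2] divided by the square root of 5, so long and short texts end up on the same scale before the nearest-neighbor distance is computed.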

In [11]:
clf = KNeighborsClassifier(3).fit(X_train, Y_train)

In [24]:
tester = ['yayyy!', 'terrible', "she walked to the right", "woohoo", "I don't feel good", "sad", "feel kinda blue"]
X_new_counts = countVectorizer.transform(tester)
X_new_tfidf = tf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)

In [25]:
print(predicted)

['positive' 'negative' 'positive' 'positive' 'positive' 'negative'
 'negative']
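
The cells above split off `X_test` and `Y_test` but never score the classifier on them. As a minimal sketch of how that could be done, the same three steps can be wrapped in a scikit-learn `Pipeline` and evaluated with `accuracy_score`. The toy texts and labels are made up and stand in for `rclimate_df`, so the accuracy printed here says nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Made-up stand-in corpus; the notebook would use rclimate_df instead.
texts = [
    "great news today", "this is wonderful", "really good result",
    "love this idea", "terrible outcome", "this is awful",
    "really bad result", "hate this idea", "the report is out",
    "meeting at noon", "it is a table", "the bus arrived",
]
labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

X_train, X_test, Y_train, Y_test = train_test_split(
    texts, labels, random_state=15, train_size=.6)

# Same steps as the cells above, chained so the test text is
# transformed with the vocabulary learned from the training text.
model = Pipeline([
    ("counts", CountVectorizer()),
    ("tf", TfidfTransformer(use_idf=False)),
    ("knn", KNeighborsClassifier(3)),
])
model.fit(X_train, Y_train)

predicted_test = model.predict(X_test)
accuracy = accuracy_score(Y_test, predicted_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The pipeline calls only `transform`, never `fit`, on the test text, which matches how the `tester` list is handled above.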
