<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modeling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

<a id="one"></a>

# 1. Importing Packages

In [1]:
import os
import re
import string

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.style as style
%matplotlib inline

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Download the NLTK resources used in this notebook (each one only once)
for resource in ('stopwords', 'wordnet', 'omw-1.4', 'punkt', 'vader_lexicon'):
    nltk.download(resource)

# On Kaggle, input data files are available in the read-only "../input/" directory.
# The loop below lists them; it prints nothing when the notebook runs outside Kaggle.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


<a id="two"></a>

# 2. Loading Data

In [2]:
df_test = pd.read_csv(r"test_with_no_labels.csv")
df_test.head()

| | message | tweetid |
|---|---|---|
| 0 | Europe will now be looking to China to make su... | 169760 |
| 1 | Combine this with the polling of staffers re c... | 35326 |
| 2 | The scary, unimpeachable evidence that climate... | 224985 |
| 3 | @Karoli @morgfair @OsborneInk @dailykos \r\nPu... | 476263 |
| 4 | RT @FakeWillMoore: 'Female orgasms cause globa... | 872928 |


In [3]:
df_train = pd.read_csv(r"train.csv")
df_train.head()

| | sentiment | message | tweetid |
|---|---|---|---|
| 0 | 1 | PolySciMajor EPA chief doesn't think carbon di... | 625221 |
| 1 | 1 | It's not like we lack evidence of anthropogeni... | 126103 |
| 2 | 2 | RT @RawStory: Researchers say we have three ye... | 698562 |
| 3 | 1 | #TodayinMaker# WIRED : 2016 was a pivotal year... | 573736 |
| 4 | 1 | RT @SoyNovioDeTodas: It's 2016, and a racist, ... | 466954 |


<a id="three"></a>

# 3. Exploratory Data Analysis (EDA)

In [4]:
df_test.describe()


| | tweetid |
|---|---|
| count | 10546.0 |
| mean | 496899.936943 |
| std | 288115.677148 |
| min | 231.0 |
| 25% | 246162.5 |
| 50% | 495923.0 |
| 75% | 742250.0 |
| max | 999983.0 |


In [5]:
df_train.describe()


| | sentiment | tweetid |
|---|---|---|
| count | 15819.0 | 15819.0 |
| mean | 0.917504 | 501719.433656 |
| std | 0.836537 | 289045.983132 |
| min | -1.0 | 6.0 |
| 25% | 1.0 | 253207.5 |
| 50% | 1.0 | 502291.0 |
| 75% | 1.0 | 753769.0 |
| max | 2.0 | 999888.0 |
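
The summary above already hints at class imbalance: the 25th, 50th and 75th percentiles of `sentiment` are all 1. A minimal sketch of how the label distribution could be checked directly; the `labels` list is hypothetical toy data standing in for `df_train['sentiment']`, not values taken from the dataset.

```python
from collections import Counter

# Hypothetical toy labels standing in for df_train['sentiment']
labels = [1, 1, 2, 1, 0, 1, -1, 2, 1, 1]

counts = Counter(labels)
total = len(labels)

# Share of each class, ordered by label value
shares = {label: counts[label] / total for label in sorted(counts)}
for label, share in shares.items():
    print(f"sentiment {label:>2}: {counts[label]:>3} tweets ({share:.0%})")
```

On the real data, `df_train['sentiment'].value_counts(normalize=True)` should report the same kind of shares in one call.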


*Null values*

In [6]:
df_test.isnull().sum()

message    0
tweetid    0
dtype: int64

In [7]:
df_test.shape

(10546, 2)

In [8]:
df_train.isnull().sum()

sentiment    0
message      0
tweetid      0
dtype: int64

In [9]:
df_train.shape

(15819, 3)

In [10]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10546 entries, 0 to 10545
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   message  10546 non-null  object
 1   tweetid  10546 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 164.9+ KB


In [11]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15819 entries, 0 to 15818
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  15819 non-null  int64 
 1   message    15819 non-null  object
 2   tweetid    15819 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 370.9+ KB


## Correlation

<a id="four"></a>

# 4. Data Engineering

In [12]:
# Check for missing values
print(df_train.isnull().sum())

sentiment    0
message      0
tweetid      0
dtype: int64


In [15]:
# Tokenize each message on whitespace. str.split() leaves punctuation attached to words
# (e.g. "2016,"); a dedicated tokenizer such as nltk.word_tokenize would separate it.
df_train['tokenized_message'] = df_train['message'].apply(lambda x: x.split())
df_train['tokenized_message']

0        [PolySciMajor, EPA, chief, doesn't, think, car...
1        [It's, not, like, we, lack, evidence, of, anth...
2        [RT, @RawStory:, Researchers, say, we, have, t...
3        [#TodayinMaker#, WIRED, :, 2016, was, a, pivot...
4        [RT, @SoyNovioDeTodas:, It's, 2016,, and, a, r...
                               ...                        
15814    [RT, @ezlusztig:, They, took, down, the, mater...
15815    [RT, @washingtonpost:, How, climate, change, c...
15816    [notiven:, RT:, nytimesworld, :What, does, Tru...
15817    [RT, @sara8smiles:, Hey, liberals, the, climat...
15818    [RT, @Chet_Cannon:, .@kurteichenwald's, 'clima...
Name: tokenized_message, Length: 15819, dtype: object

In [16]:
stop_words = set(stopwords.words('english'))
# Drop stop words; word.isalnum() also discards tokens with punctuation attached (e.g. "doesn't", "2016,")
df_train['filtered_message'] = df_train['tokenized_message'].apply(
    lambda tokens: [w for w in tokens if w.lower() not in stop_words and w.isalnum()])
df_train['filtered_message'].head()

0    [PolySciMajor, EPA, chief, think, carbon, diox...
1    [like, lack, evidence, anthropogenic, global, ...
2    [RT, Researchers, say, three, years, act, clim...
3    [WIRED, 2016, pivotal, year, war, climate, cha...
4       [RT, climate, change, denying, bigot, leading]
Name: filtered_message, dtype: object
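
Because the filter above relies on `word.isalnum()`, every token that still carries punctuation is discarded, so "doesn't", "2016," and "@RawStory:" never reach the model. A minimal sketch of one alternative, assuming it is acceptable to lowercase the text and strip URLs, mentions and punctuation before filtering. The `clean_tokens` helper and the tiny `STOP_WORDS` set are hypothetical stand-ins; the notebook itself uses NLTK's English stop-word list.

```python
import re

# Hypothetical, tiny stop-word set; the notebook uses stopwords.words('english')
STOP_WORDS = {"a", "and", "is", "it", "not", "of", "the", "to", "we"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")


def clean_tokens(message):
    """Lowercase, drop URLs, mentions and punctuation, then remove stop words."""
    text = message.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_WORD_RE.sub("", text)  # "doesn't" -> "doesnt", "2016," -> "2016"
    return [tok for tok in text.split() if tok not in STOP_WORDS]


print(clean_tokens("RT @RawStory: Researchers say we have three years, not ten! https://t.co/abc"))
```

Stripping punctuation first keeps words such as "years" and "2016" that the `isalnum()` check would otherwise throw away.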

In [17]:
# Combine the words back into sentences
df_train['processed_message'] = df_train['filtered_message'].apply(lambda x: ' '.join(x))
df_train['processed_message']

0        PolySciMajor EPA chief think carbon dioxide ma...
1          like lack evidence anthropogenic global warming
2        RT Researchers say three years act climate cha...
3               WIRED 2016 pivotal year war climate change
4                  RT climate change denying bigot leading
                               ...                        
15814          RT took material global LGBT health hocking
15815        RT climate change could breaking relationship
15816    nytimesworld Trump actually believe climate Ri...
15817    RT Hey liberals climate change crap hoax ties ...
15818                              RT change 4 screenshots
Name: processed_message, Length: 15819, dtype: object

In [21]:
# Convert text data into a format suitable for machine learning algorithms (TF-IDF)
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df_train['processed_message'])
X

<15819x12020 sparse matrix of type '<class 'numpy.float64'>'
	with 130461 stored elements in Compressed Sparse Row format>
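
To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation per document. The toy corpus and the `tfidf` helper are hypothetical, and the sketch ignores the vectorizer's own tokenisation and lowercasing rules.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-processed messages
corpus = [
    "climate change hoax",
    "climate change real",
    "global warming real",
]


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


rows = tfidf(corpus)
for doc, row in zip(corpus, rows):
    print(doc, {term: round(w, 3) for term, w in row.items()})
```

Terms that occur in fewer documents (here "hoax") receive a larger weight than terms shared across documents (here "climate").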

In [23]:
# Encode the sentiment labels as consecutive integers: -1, 0, 1, 2 become 0, 1, 2, 3
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df_train['sentiment'])
y

array([2, 2, 3, ..., 1, 0, 1], dtype=int64)
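
`LabelEncoder` sorts the distinct labels and replaces each one with its position, which is why a sentiment of 1 shows up as 2 in the array above. A minimal pure-Python sketch of that mapping and its inverse; the `sample` list is hypothetical.

```python
# The four sentiment classes present in the training data, in sorted order
classes = sorted({-1, 0, 1, 2})

# Forward and inverse mappings, mirroring LabelEncoder's sorted-class behaviour
to_index = {label: idx for idx, label in enumerate(classes)}
to_label = {idx: label for label, idx in to_index.items()}

sample = [1, 1, 2, 0, -1]                # hypothetical sentiment values
encoded = [to_index[s] for s in sample]  # positions in the sorted class list
decoded = [to_label[i] for i in encoded]

print(encoded)
print(decoded)
```

With the fitted encoder from the notebook, `label_encoder.inverse_transform(...)` performs the same reverse lookup, which is useful when predictions need to be reported on the original -1 to 2 scale.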

In [24]:
# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
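
Two aspects of the split above may be worth revisiting. The vectorizer was fitted on all rows before splitting, so IDF statistics from the validation tweets leak into the features, and the split is not stratified even though the classes are imbalanced. A minimal sketch of one way to address both, assuming a scikit-learn `Pipeline` is acceptable here; the toy `texts` and `labels` are hypothetical stand-ins for `df_train['processed_message']` and `df_train['sentiment']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data
texts = [
    "climate change is real", "global warming is real", "we must act on climate",
    "scientists warn about warming", "climate action now", "renewable energy helps climate",
    "climate change is a hoax", "global warming is fake", "warming is a scam",
    "hoax pushed by liberals",
]
labels = [1, 1, 1, 1, 1, 1, -1, -1, -1, -1]

# Stratify so both splits keep roughly the same class proportions
train_texts, val_texts, y_tr, y_va = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training split only
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
pipeline.fit(train_texts, y_tr)
val_pred = pipeline.predict(val_texts)
print(len(val_pred), sorted(set(y_va)))
```

Keeping the vectorizer inside the pipeline also means the same fitted object can later be applied to `df_test` without refitting.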

<a id="five"></a>

# 5. Modeling

## Multinomial Naive Bayes Model

In [25]:
import sklearn
print("scikit-learn version:", sklearn.__version__)

scikit-learn version: 1.2.2


In [26]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Initialize the Multinomial Naive Bayes model
model = MultinomialNB()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the validation set
y_val_pred = model.predict(X_val)

# Convert sentiment labels to strings for classification_report
class_names = label_encoder.classes_.astype(str)

# Evaluate the model
report = classification_report(y_val, y_val_pred, target_names=class_names)
print("Classification Report:\n", report)

Classification Report:
               precision    recall  f1-score   support

          -1       0.88      0.03      0.05       278
           0       0.94      0.08      0.14       425
           1       0.63      0.97      0.76      1755
           2       0.82      0.47      0.60       706

    accuracy                           0.66      3164
   macro avg       0.82      0.39      0.39      3164
weighted avg       0.74      0.66      0.58      3164
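
The gap between the macro and weighted averages comes from how each is computed: the macro average treats all four classes equally, while the weighted average weights each class by its support. A short sketch that recomputes both recall averages from the rounded per-class values in the report above, so the results agree with the report only to about two decimals:

```python
# Per-class recall and support, copied from the classification report above
recall = {-1: 0.03, 0: 0.08, 1: 0.97, 2: 0.47}
support = {-1: 278, 0: 425, 1: 1755, 2: 706}

total = sum(support.values())  # number of validation tweets

# Macro average: unweighted mean over classes
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: mean weighted by class support (for recall this equals accuracy)
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The low macro recall shows that the model mostly predicts class 1; the 0.66 accuracy is driven by that class making up 1755 of the 3164 validation tweets.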



<a id="six"></a>

# 6. Model Performance

<a id="seven"></a>

# 7. Model Explanations