<a href="https://colab.research.google.com/github/Dansah2/Classifying-Disaster-Tweets/blob/main/Vader_Classifying_Disaster_Tweets_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classifying Disaster Tweets

Kaggle Dataset Download API Command:

kaggle competitions download -c nlp-getting-started

I will classify a tweet as either a 'Disaster Tweet' or 'Non-Disaster Tweet'.

##Project Outline:

1) Download the dataset

2) Explore/Analyze the data

3) Preprocess and organize the data

4) Classify using Vader

5) Classify using Bag of Words

6) Classify using Hugging Face

## Download the Dataset

1) Install required libraries

2) Import required libraries

3) Upload Data from Google Drive


#### Install Required Libraries

In [1]:
!pip install -q -U numpy
!pip install -q -U vaderSentiment

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/18.2 MB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.4/18.2 MB[0m [31m10.6 MB/s[0m eta [36m0:00:02[0m[2K     [91m━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.2/18.2 MB[0m [31m67.4 MB/s[0m eta [36m0:00:01[0m[2K     [91m━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━[0m [32m8.9/18.2 MB[0m [31m85.6 MB/s[0m eta [36m0:00:01[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━[0m [32m15.5/18.2 MB[0m [31m179.8 MB/s[0m eta [36m0:00:01[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m18.2/18.2 MB[0m [31m181.8 MB/s[0m eta [36m0:00:01[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m18.2/18.2 MB[0m [31m181.8 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m18.2/18.2 MB[0m 

#### Import Required Libraries

In [23]:
# cleaning txt data
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# vadar sentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# evaluation metrics
from sklearn.metrics import confusion_matrix, classification_report

# handeling data
import numpy as np
import pandas as pd

# reading the data
from google.colab import drive

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


#### Upload Data from Google Drive

In [2]:
# Mount google drive to store Kaggle API for future use
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# read in the data
vader_train = pd.read_csv('/content/drive/My Drive/Disaster_Tweets/train_df.csv')
vader_test = pd.read_csv('/content/drive/My Drive/Disaster_Tweets/test_df.csv')

##**Find Sentiment with Vader Library**

1) Text Preprocessing

2) Predict the Sentiment

3) Evaluate


###Text Preprocessing

In [4]:
vader_train.head()

Unnamed: 0,text,target
0,Our Deeds are the Reason of this earthquake Ma...,1
1,Forest fire near La Ronge Sask Canada,1
2,All residents asked to shelter in place are be...,1
3,13000 people receive wildfires evacuation orde...,1
4,Just got sent this photo from Ruby Alaska as s...,1


In [5]:
def cleaning_text(sentence) :
  sentence = sentence.lower()                    # lower text
  sentence = re.sub('http\S+\s*', '', sentence)  # remove URLs
  sentence = re.sub('\W+', ' ', sentence)        # remove commas
  sentence= re.sub('RT|cc', '', sentence)  # remove RT and cc
  sentence = re.sub('#\S+', '', sentence)  # remove hashtags
  sentence = re.sub('@\S+', '', sentence)  # remove mentions
  sentence = re.sub('[%s]' % re.escape("""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""), '',sentence)  # remove punctuations
  sentence = re.sub('\s+', ' ', sentence)  # remove extra whitespace
  sentence = re.sub(r'[0-9]', '', sentence) # remove digits from text
  return sentence

# apply the word_tokens method to the training dataframe
vader_train['text'] = vader_train['text'].apply(lambda x: cleaning_text(x))

# apply the word_tokens method to the testing dataframe
vader_test['text'] = vader_test['text'].apply(lambda x: cleaning_text(x))

In [6]:
vader_train.head()

Unnamed: 0,text,target
0,our deeds are the reason of this earthquake ma...,1
1,forest fire near la ronge sask canada,1
2,all residents asked to shelter in place are be...,1
3,people receive wildfires evacuation orders in...,1
4,just got sent this photo from ruby alaska as s...,1


In [7]:
def tokenization_lemmatize_stopwording(sentence):
  lemmatizer = WordNetLemmatizer()
  stop_words = stopwords()
  sentence = word_tokenize(sentence)
  sentence = [lemmatizer.lemmatize(i) for i in sentence if not i in stop_words]
  sentence = ' '.join(sentence)
  return sentence

In [8]:
vader_train.head()

Unnamed: 0,text,target
0,our deeds are the reason of this earthquake ma...,1
1,forest fire near la ronge sask canada,1
2,all residents asked to shelter in place are be...,1
3,people receive wildfires evacuation orders in...,1
4,just got sent this photo from ruby alaska as s...,1


### Predict the Sentiment

In [9]:
# create an instance of the vadar sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# create a method to analyze each sentiment in the dataframe
def analyze_sentiments(text):
  total_polarity = analyzer.polarity_scores(text)
  if total_polarity['compound'] >= 0.05:
    return 0
  elif total_polarity['compound'] <= -0.05:
    return 1
  else:
    return "Unknown"

# apply the method to each of the samples in the data frame
vader_train['vader_sentiment'] = vader_train['text'].apply(lambda x: analyze_sentiments(x))

# apply the method to each of the samples in the data frame
vader_test['vader_sentiment'] = vader_test['text'].apply(lambda x: analyze_sentiments(x))

Notice that they vadar method is not the best in its predictions on this specific dataset. I will try an alternative method.

In [10]:
vader_test.head(10)

Unnamed: 0,text,vader_sentiment
0,just happened a terrible car crash,1
1,heard about earthquake is different cities sta...,0
2,there is a forest fire at spot pond geese are ...,1
3,apocalypse lighting spokane wildfires,Unknown
4,typhoon soudelor kills in china and taiwan,1
5,were shakingits an earthquake,Unknown
6,theyd probably still show more life than arsen...,Unknown
7,hey how are you,Unknown
8,what a nice hat,0
9,fuck off,1


In [11]:
vader_train.head(20)

Unnamed: 0,text,target,vader_sentiment
0,our deeds are the reason of this earthquake ma...,1,0
1,forest fire near la ronge sask canada,1,1
2,all residents asked to shelter in place are be...,1,1
3,people receive wildfires evacuation orders in...,1,Unknown
4,just got sent this photo from ruby alaska as s...,1,Unknown
5,rockyfire update california hwy closed in bot...,1,1
6,flood disaster heavy rain causes flash floodin...,1,1
7,im on top of the hill and i can see a fire in ...,1,1
8,theres an emergency evacuation happening now i...,1,1
9,im afraid that the tornado is coming to our area,1,Unknown


###Evaluate

In [22]:
def eval_metrics(vader_train, preds_col, target):
  y_pred = vader_train[preds_col]
  y_pred = [i if i != "Unknown" else 3 for i in y_pred]

  # Calculate confusion matrix for the test set
  confusion_mat = confusion_matrix(vader_train[target], y_pred)

  # Generate a classification report for the test set
  classification_rep = classification_report(vader_train[target], y_pred, target_names=["Class 0", "Class 1", "Unknown"])

  print(f"Confusion Matrix:\n {confusion_mat}")
  print(f"\nClassification Report:\n {classification_rep}")

eval_metrics(vader_train, 'vader_sentiment', 'target')

Confusion Matrix:
 [[1420 1853 1069]
 [ 541 1861  869]
 [   0    0    0]]


Classification Report:
               precision    recall  f1-score   support

     Class 0       0.72      0.33      0.45      4342
     Class 1       0.50      0.57      0.53      3271
     Unknown       0.00      0.00      0.00         0

    accuracy                           0.43      7613
   macro avg       0.41      0.30      0.33      7613
weighted avg       0.63      0.43      0.49      7613



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
