<a href="https://colab.research.google.com/github/Dansah2/Classifying-Disaster-Tweets/blob/main/Preprocess_Classifying_Disaster_Tweets_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classifying Disaster Tweets

Kaggle Dataset Download API Command:

kaggle competitions download -c nlp-getting-started

I will classify a tweet as either a 'Disaster Tweet' or 'Non-Disaster Tweet'.

## Project Outline:

1) Download the dataset

2) Explore/Analyze the data

3) Preprocess and organize the data

4) Classify using VADER

5) Classify using Bag of Words

6) Classify using Hugging Face

## Download the Dataset

1) Install required libraries

2) Import required libraries

3) Download data from Kaggle


#### Install Required Libraries

In [None]:
!pip install -q -U kaggle
!pip install -q -U numpy

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 48.5 MB/s eta 0:00:00
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.56.4 requires numpy<1.24,>=1.18, but you have numpy 1.25.2 which is incompatible.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.25.2 which is incompatible.

#### Import Required Libraries

In [None]:
# cleaning text data
import string

# handling data
import numpy as np
import pandas as pd

# downloading data
from google.colab import drive

#### Download Data From Kaggle
https://github.com/bnsreenu/python_for_microscopists/blob/master/Tips_tricks_35_loading_kaggle_data_to_colab.ipynb

https://www.youtube.com/watch?v=yEXkEUqK52Q&t=628s

In [None]:
# Mount Google Drive, where the Kaggle API token is stored for reuse
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# make a directory for the Kaggle token in the temporary Colab instance
# (-p avoids an error if the directory already exists)
! mkdir -p ~/.kaggle

In [None]:
# upload the JSON file to Google Drive, then copy it to the temporary location
!cp /content/drive/MyDrive/Kaggle_API/kaggle.json ~/.kaggle/kaggle.json

In [None]:
# change the file permissions to read/write for the owner only
! chmod 600 ~/.kaggle/kaggle.json
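The three shell steps above (create the directory, copy the token, restrict its permissions) can also be done from Python. This is a minimal sketch, not part of the original notebook; `install_kaggle_token` and its `kaggle_dir` parameter are hypothetical names introduced here for illustration.

```python
import os
import shutil
import stat


def install_kaggle_token(source_path, kaggle_dir=None):
    """Copy a Kaggle API token into place with owner-only permissions.

    Hypothetical helper (not part of the kaggle package).
    `kaggle_dir` defaults to ~/.kaggle.
    """
    if kaggle_dir is None:
        kaggle_dir = os.path.join(os.path.expanduser("~"), ".kaggle")
    # exist_ok=True makes the step safe to re-run, like `mkdir -p`
    os.makedirs(kaggle_dir, exist_ok=True)
    target = os.path.join(kaggle_dir, "kaggle.json")
    shutil.copyfile(source_path, target)
    # read/write for the owner only, like `chmod 600`
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target
```

In this notebook the call would presumably be `install_kaggle_token('/content/drive/MyDrive/Kaggle_API/kaggle.json')`, using the same Drive path as the `cp` command.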

In [None]:
# download the kaggle data
! kaggle competitions download -c nlp-getting-started

Downloading nlp-getting-started.zip to /content
100% 593k/593k [00:00<00:00, 744kB/s]


In [None]:
# unzip the data
! unzip nlp-getting-started.zip

Archive:  nlp-getting-started.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               
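If a shell is not available, the same extraction can be done with Python's standard `zipfile` module. A minimal sketch, assuming the archive sits in the working directory; `extract_archive` is a hypothetical helper name.

```python
import zipfile


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of a zip archive and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(dest_dir)
    return names
```

Calling `extract_archive('nlp-getting-started.zip', '/content')` should list the same three CSV files shown in the `unzip` output.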


In [None]:
# create a function to read the data into a dataframe

def read_function(csv_file):

    return pd.read_csv(csv_file)

raw_train = read_function('/content/train.csv')
raw_test = read_function('/content/test.csv')

## Preprocess and Organize the Data

1) Drop the necessary rows / columns

2) Drop Null Values

3) Text Preprocessing

4) Export dataframe as CSV to Google Drive

### Drop the necessary rows / columns

In [None]:
# remove redundant information from the dataframe

def remove_redundant_rows(data_frame):
  return data_frame.drop_duplicates().reset_index(drop=True)

raw_train = remove_redundant_rows(raw_train)
raw_test = remove_redundant_rows(raw_test)
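`drop_duplicates()` compares whole rows, so if every row carries a distinct `id`, two rows with identical text are still treated as different. The sketch below uses made-up rows to show the difference between whole-row comparison and comparison on the `text` column only. Whether repeated tweets should be removed is a modelling choice the notebook does not state, so treat this as an illustration rather than a recommended step.

```python
import pandas as pd

# Toy frame (made-up rows): two rows share the same text but have different ids
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["fire downtown", "fire downtown", "sunny day"],
    "target": [1, 1, 0],
})

# Whole-row comparison: the unique id makes every row distinct
whole_row = toy.drop_duplicates().reset_index(drop=True)

# Comparing only the text column removes the repeated tweet
by_text = toy.drop_duplicates(subset=["text"]).reset_index(drop=True)

print(len(whole_row), len(by_text))  # prints: 3 2
```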

In [None]:
# drop the desired columns from the dataframe
def drop_columns(data_frame, col_name: list):
  try:
    data_frame = data_frame.drop(columns=col_name)
    print(f'The remaining columns are: {data_frame.columns}')
  except KeyError:
    # drop() raises KeyError when any of the requested columns is missing
    print(f'The column(s) have already been dropped, the remaining are: {data_frame.columns}')
  return data_frame

In [None]:
raw_train = drop_columns(raw_train, ['id', 'location', 'keyword'])

The remaining columns are: Index(['text', 'target'], dtype='object')


In [None]:
raw_test = drop_columns(raw_test, ['id', 'location', 'keyword'])

The remaining columns are: Index(['text'], dtype='object')
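`drop_columns` guards against re-running the cell after the columns are gone. pandas also offers the `errors="ignore"` argument to `DataFrame.drop`, which skips missing labels instead of raising. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Toy frame (made-up values) with the same kind of columns as the dataset
toy = pd.DataFrame({"id": [1, 2], "keyword": ["a", "b"], "text": ["x", "y"]})

# errors="ignore" skips labels that are not present instead of raising KeyError
once = toy.drop(columns=["id", "keyword"], errors="ignore")
twice = once.drop(columns=["id", "keyword"], errors="ignore")

print(list(twice.columns))  # prints: ['text']
```

One difference worth noting: with `errors="ignore"`, columns that are present are still dropped even when others in the list are missing, whereas a failed `drop` inside `try`/`except` leaves the frame untouched.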


### Drop Null Values

In [None]:
# drop null values from a dataframe
def drop_na(data_frame):
  # dropna() does not raise when there are no nulls, so no try/except is needed
  data_frame = data_frame.dropna().reset_index(drop=True)
  rows, _ = data_frame.shape
  print(f'Number of samples remaining after nulls have been dropped: {rows}')

  return data_frame

In [None]:
raw_train = drop_na(raw_train)

Number of samples remaining after nulls have been dropped: 7613


In [None]:
raw_test = drop_na(raw_test)

Number of samples remaining after nulls have been dropped: 3263
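The message above reports only the remaining row count, which does not show whether any rows were actually removed. A minimal sketch, on made-up rows, of counting nulls per column before dropping and then reporting how many rows were lost:

```python
import pandas as pd

# Toy frame (made-up rows) with one missing text value
toy = pd.DataFrame({
    "text": ["flood warning", None, "nice weather"],
    "target": [1, 0, 0],
})

# Count missing values per column before dropping anything
null_counts = toy.isna().sum()

# Drop rows with any missing value and report how many were removed
cleaned = toy.dropna().reset_index(drop=True)
removed = len(toy) - len(cleaned)

print(null_counts)
print(f"Rows removed: {removed}")
```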


### Text Preprocessing

In [None]:
# create a function to remove all punctuation
def remove_punctuation(text):
  no_punct = "".join([i for i in text if i not in string.punctuation])
  return no_punct

# apply the remove_punctuation function to the training dataframe
raw_train['text'] = raw_train['text'].apply(remove_punctuation)

# apply the remove_punctuation function to the test dataframe
raw_test['text'] = raw_test['text'].apply(remove_punctuation)
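An equivalent way to strip punctuation is `str.translate` with a table built once by `str.maketrans`, which avoids the per-character membership test. This is a sketch of an alternative, not the notebook's own function; `remove_punctuation_fast` is a hypothetical name.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation_fast(text):
    """Hypothetical variant of remove_punctuation using str.translate."""
    return text.translate(_PUNCT_TABLE)


print(remove_punctuation_fast("I'm afraid: tornado!"))  # prints: Im afraid tornado
```

Like the original, this only covers the ASCII characters in `string.punctuation`; typographic quotes and other Unicode punctuation are left in place.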

In [None]:
raw_train.head(10)

|   | text | target |
|---|------|--------|
| 0 | Our Deeds are the Reason of this earthquake Ma... | 1 |
| 1 | Forest fire near La Ronge Sask Canada | 1 |
| 2 | All residents asked to shelter in place are be... | 1 |
| 3 | 13000 people receive wildfires evacuation orde... | 1 |
| 4 | Just got sent this photo from Ruby Alaska as s... | 1 |
| 5 | RockyFire Update California Hwy 20 closed in ... | 1 |
| 6 | flood disaster Heavy rain causes flash floodin... | 1 |
| 7 | Im on top of the hill and I can see a fire in ... | 1 |
| 8 | Theres an emergency evacuation happening now i... | 1 |
| 9 | Im afraid that the tornado is coming to our area | 1 |
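Removing punctuation from a link such as `http://t.co/abc123` leaves a fused token (`httptcoabc123`); the exported frame later in this notebook shows a token beginning `httptco`. If those tokens are unwanted, one option is to delete links before stripping punctuation. This is a sketch of that option, not a step the notebook performs; `strip_urls_then_punctuation` is a hypothetical helper, and the regular expression is a simple assumption that only matches `http://` and `https://` links.

```python
import re
import string

# Matches http:// or https:// followed by any run of non-whitespace characters
URL_PATTERN = re.compile(r"https?://\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_urls_then_punctuation(text):
    """Hypothetical helper: drop links first, then punctuation."""
    without_urls = URL_PATTERN.sub("", text)
    no_punct = without_urls.translate(PUNCT_TABLE)
    # Collapse the extra whitespace left behind by removed tokens
    return " ".join(no_punct.split())


print(strip_urls_then_punctuation("Wildfire update: http://t.co/abc123 stay safe!"))
# prints: Wildfire update stay safe
```

Unlike `remove_punctuation`, this version also collapses repeated spaces, so the two are not drop-in equivalents.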


### Export Dataframe as CSV to Google Drive
https://saturncloud.io/blog/exporting-dataframes-as-csv-files-from-google-colab-to-google-drive/

In [None]:
# export the data
raw_train.to_csv('/content/drive/My Drive/Disaster_Tweets/train_df.csv', index=False)
raw_test.to_csv('/content/drive/My Drive/Disaster_Tweets/test_df.csv', index=False)

# verify it was exported
df_verify = pd.read_csv('/content/drive/My Drive/Disaster_Tweets/train_df.csv')
print(df_verify)

                                                   text  target
0     Our Deeds are the Reason of this earthquake Ma...       1
1                 Forest fire near La Ronge Sask Canada       1
2     All residents asked to shelter in place are be...       1
3     13000 people receive wildfires evacuation orde...       1
4     Just got sent this photo from Ruby Alaska as s...       1
...                                                 ...     ...
7608  Two giant cranes holding a bridge collapse int...       1
7609  ariaahrary TheTawniest The out of control wild...       1
7610  M194 0104 UTC5km S of Volcano Hawaii httptcozD...       1
7611  Police investigating after an ebike collided w...       1
7612  The Latest More Homes Razed by Northern Califo...       1

[7613 rows x 2 columns]
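Printing the reloaded frame confirms that the file exists, but a programmatic comparison catches silent changes. A minimal sketch of a round-trip check, using a temporary directory as a stand-in for the Google Drive folder and made-up rows in place of the real data:

```python
import os
import tempfile

import pandas as pd

# Toy frame (made-up rows) standing in for the cleaned training data
toy = pd.DataFrame({"text": ["Forest fire near town", "sunny day"], "target": [1, 0]})

# Write to a temporary directory (stand-in for the Google Drive folder)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_df.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Same shape, dtypes and values means the export round-tripped cleanly
round_trip_ok = reloaded.equals(toy)
print(round_trip_ok)
```

One case where such a check can fail: by default `pd.read_csv` reads empty fields as `NaN`, so a tweet that became an empty string after punctuation removal would come back as a null value.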
