<a href="https://colab.research.google.com/github/Dansah2/Classifying_Disaster_Tweets/blob/main/1_EDA_Classifying_Disaster_Tweets_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classifying Disaster Tweets

Kaggle dataset download API command:

`kaggle competitions download -c nlp-getting-started`

Use three different techniques to classify a tweet as either a 'Disaster Tweet' or 'Non-Disaster Tweet'.

## Project Outline:

1) Download the dataset

2) Explore/Analyze the data

3) Preprocess and organize the data

4) Classify using VADER

5) Classify using Bag of Words

6) Classify using Hugging Face

## Download the Dataset

1) Install required libraries

2) Import required libraries

3) Download data from Kaggle


#### Install Required Libraries

In [None]:
!pip install kaggle numpy > /dev/null 2>&1

#### Import Required Libraries

In [None]:
# handling data
import numpy as np
import pandas as pd

# graphing data
pd.options.plotting.backend = "plotly"
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# downloading data
from google.colab import drive

#### Download Data From Kaggle
https://github.com/bnsreenu/python_for_microscopists/blob/master/Tips_tricks_35_loading_kaggle_data_to_colab.ipynb

https://www.youtube.com/watch?v=yEXkEUqK52Q&t=628s

In [None]:
# Mount google drive to store Kaggle API for future use
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# make a directory for the Kaggle credentials in the temporary Colab instance
# (-p avoids an error if the directory already exists)
! mkdir -p ~/.kaggle

In [None]:
# upload the kaggle.json file to Google Drive, then copy it to the temporary location
!cp /content/drive/MyDrive/Kaggle_API/kaggle.json ~/.kaggle/kaggle.json

In [None]:
# restrict the file permissions to read/write for the owner only
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
# download the kaggle data
! kaggle competitions download -c nlp-getting-started

Downloading nlp-getting-started.zip to /content
  0% 0.00/593k [00:00<?, ?B/s]
100% 593k/593k [00:00<00:00, 156MB/s]


In [None]:
# unzip the data
! unzip nlp-getting-started.zip

Archive:  nlp-getting-started.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


In [None]:
# create a function to read the data into a dataframe

def read_function(csv_file):

    return pd.read_csv(csv_file)

raw_train = read_function('/content/train.csv')

## Explore/Analyze the Data
1) Obtain info about the training / testing set.

2) Visualize the data.


### Obtain info about the training / testing set.

Notice that every column except `id` and `target` (both int64) has the object dtype.

Training data: 7613 rows x 5 columns

Testing data: 3263 rows x 4 columns
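
The shape and dtype checks described above can also be done without a plot. The following is a minimal sketch, not part of the original notebook: `sample` is a hypothetical three-row stand-in that only borrows the training set's column names, and `summarize_frame` is an illustrative helper name.

```python
import pandas as pd

# Hypothetical stand-in for raw_train: same column names, made-up rows
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "keyword": [None, "flood", None],
    "location": [None, "Somewhere", None],
    "text": ["made-up tweet one", "made-up tweet two", "made-up tweet three"],
    "target": [1, 1, 0],
})

def summarize_frame(data_frame):
    """Return the (rows, columns) shape and a dict of column name -> dtype string."""
    return data_frame.shape, data_frame.dtypes.astype(str).to_dict()

shape, dtypes = summarize_frame(sample)
print(f"Shape: {shape}")
for column, dtype in dtypes.items():
    print(f"{column}: {dtype}")
```

Calling `summarize_frame(raw_train)` in place of the sample should report the row and column counts listed above, assuming `raw_train` has been loaded.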

In [None]:
# Create a function that displays each column's data type in a table
def show_type(data_frame):
  # obtain data_types
  data_types = data_frame.dtypes.astype(str)

  fig = go.Figure(data=[go.Table(
    header=dict(values=['Column Name', 'Data Type']),
    cells=dict(values=[data_types.index, data_types.values]))])

  # Customize the table layout
  fig.update_layout(
      title='Data Types of DataFrame Columns',
  )

  # Show the plot
  fig.show()

In [None]:
show_type(raw_train)

In [None]:
def show_dataframe_head(data_frame):
  # Get the first few rows of the DataFrame
  head_data = data_frame.head()

  # Create a table using Plotly
  fig = go.Figure(data=[go.Table(
    header=dict(values=head_data.columns),
    cells=dict(values=[head_data[col] for col in head_data.columns]))
  ])

  # Customize the table layout
  fig.update_layout(
    title='First five Data Samples',
  )

  # Show the plot
  fig.show()

show_dataframe_head(raw_train)

Note that the 'location' column contains many null values in both the training and testing sets, so I will drop it. I will also drop the 'keyword' column, although it has far fewer null values.
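
As a rough sketch of that step (an assumption about how the drop could be done, not the notebook's actual preprocessing code), the share of missing values per column can be computed with `isna().mean()` and the two columns removed with `drop(columns=...)`. The `sample` frame below is hypothetical, made-up data that reuses the dataset's column names.

```python
import pandas as pd

# Hypothetical stand-in for raw_train: same column names, made-up rows
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "keyword": [None, "flood", "flood"],
    "location": [None, "Somewhere", None],
    "text": ["made-up tweet one", "made-up tweet two", "made-up tweet three"],
    "target": [1, 1, 0],
})

# Fraction of missing values in each column (0.0 means no nulls)
missing_share = sample.isna().mean()
print(missing_share)

# Drop the two metadata columns; drop() returns a new frame and leaves `sample` unchanged
cleaned = sample.drop(columns=["location", "keyword"])
print(list(cleaned.columns))
```

Keeping the original frame untouched makes it easy to revisit the decision later, for example if 'keyword' turns out to be a useful feature.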

In [None]:
def show_missing(data_frame):
  # Calculate the missing values in the DataFrame
  missing_values = data_frame.isna().sum()

  # Create a heatmap
  fig = go.Figure(data=go.Heatmap(
      z=[missing_values.values],  # Provide the missing values as the heatmap data
      x=missing_values.index,     # Feature names as x-axis
      y=["Missing Values"],      # Label for y-axis
      colorscale='Turbo',      # Choose a colorscale (you can customize it)
  ))

  # Add some styling
  fig.update_layout(
      title="Missing Data Heatmap",
      xaxis_title="Features",
      yaxis_title="",
      xaxis_showticklabels=True,
      yaxis_showticklabels=False,
  )

  # Show the plot
  fig.show()

show_missing(raw_train)

In [None]:
# Function to display duplicates in a table
def show_duplicates(df):

  if df.duplicated().sum() == 0:
    print(f'Number of Duplicates:\n {df.duplicated().sum()}')

  else:
    duplicates_data = []

    for column in df.columns:
      duplicated_values = df[df.duplicated(subset=column, keep=False)]
      duplicated_counts = duplicated_values[column].value_counts()

      for value, count in duplicated_counts.items():
        duplicates_data.append([column, value, count])

    duplicates_df = pd.DataFrame(duplicates_data, columns=["Feature", "Duplicated Value", "Count"])

    # Create a table using Plotly
    fig = go.Figure(data=[go.Table(
      header=dict(values=duplicates_df.columns),
      cells=dict(values=[duplicates_df[col] for col in duplicates_df.columns]))
    ])

    # Customize the table layout
    fig.update_layout(
      title='Duplicate Values in DataFrame',
    )

    # Show the table
    fig.show()

# Example usage:
show_duplicates(raw_train)

Number of Duplicates:
 0


### Visualize the data.

Note that the classes are imbalanced (4,342 non-disaster vs. 3,271 disaster tweets, roughly 57% to 43%), but the imbalance does not appear to be severe.

In [None]:
# create a bar graph that displays the count of each class

def exp_graph_data(data_frame, target_col_name=None):

  print(f"Data shape: {data_frame.shape}\n")
  print(f'Column Names: {list(data_frame.columns)}\n')

  if target_col_name:
    class_counts = data_frame[target_col_name].value_counts()

    print(f'Label Count:\n{class_counts}')

    fig = go.Figure(go.Bar(x=class_counts.index,
                           y=class_counts.values))

    fig.update_layout(xaxis_title_text='Classes',
                      yaxis_title_text='Count',
                      title_text='Number of Samples of Each Class')
    fig.show()

In [None]:
exp_graph_data(raw_train, 'target')

Data shape: (7613, 5)

Column Names: ['id', 'keyword', 'location', 'text', 'target']

Label Count:
0    4342
1    3271
Name: target, dtype: int64
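
To put a number on the imbalance, the label counts printed above can be converted to proportions. This is a minimal sketch: the `target` Series is rebuilt here from those two counts so the cell runs on its own, and in the notebook `raw_train['target']` could be used directly.

```python
import pandas as pd

# Rebuild the label column from the counts printed above (4342 zeros, 3271 ones)
target = pd.Series([0] * 4342 + [1] * 3271, name="target")

# Share of each class as a fraction of all samples
proportions = target.value_counts(normalize=True)

# Ratio of the larger class to the smaller one
counts = target.value_counts()
majority_to_minority = counts.max() / counts.min()

print(proportions)
print(f"Majority-to-minority ratio: {majority_to_minority:.2f}")
```

A ratio this close to 1 is usually handled without resampling, though a class-aware metric such as F1 may still be worth reporting alongside accuracy.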
