### 1. Introduction

Provide a brief overview of the project.
Describe the problem you are addressing.
State the goals and objectives of the analysis.

### 2. Data Description

Introduce the dataset.
Describe the features and their meanings.
Discuss the target variable.
Mention the source of the data, if applicable.

### 3. EDA (Exploratory Data Analysis)

Import libraries and load the dataset.
Display basic statistics (e.g., mean, median, standard deviation).
Visualize the data using plots and graphs (histograms, box plots, scatter plots, etc.).
Examine data distributions and relationships.
Handle missing data and perform data preprocessing if needed.

### 4. Data Preprocessing

Data cleaning (e.g., handling missing values, outliers).
Feature engineering (if necessary).
Encoding categorical variables.
Train-test split for model evaluation.
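
A minimal sketch of the split step, assuming scikit-learn's `train_test_split` and a toy DataFrame with made-up `text` and `target` values. The 80/20 ratio, the `stratify` argument and `random_state=42` are illustrative choices, not requirements of this outline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy data standing in for the real dataset (hypothetical values)
toy_df = pd.DataFrame({
    "text": [f"sample tweet {i}" for i in range(10)],
    "target": [0, 1] * 5,
})

# hold out 20% of the rows; stratify keeps the class ratio equal in both parts
X_train, X_val, y_train, y_val = train_test_split(
    toy_df["text"],
    toy_df["target"],
    test_size=0.2,
    stratify=toy_df["target"],
    random_state=42,
)

print(len(X_train), len(X_val))
```

Stratifying on the target is one reasonable default when the classes are imbalanced, since it avoids a held-out set that under-represents one class.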

### 5. Model Building and Training

Define the machine learning algorithm(s) you will use.
Set up the model(s) with appropriate hyperparameters.
Train the model(s) on the training data.
Evaluate model performance on the validation set using relevant metrics.
Hyperparameter tuning (if applicable).

### 6. Results

Present the results of your analysis.
Display key performance metrics (e.g., accuracy, precision, recall, F1-score).
Visualize results using relevant plots (e.g., ROC curve, confusion matrix).
Compare the performance of different models if applicable.
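
As a worked illustration of the metrics listed above, the sketch below derives accuracy, precision, recall and F1-score from the four cells of a binary confusion matrix. The helper name `classification_metrics` and the counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # precision: share of predicted positives that are truly positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # recall: share of true positives that the model found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# hypothetical counts, for illustration only
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```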

### 7. Discussion and Conclusion

Interpret the results and their significance.
Discuss any insights or patterns observed during the analysis.
Address any limitations of the analysis.
Provide recommendations or next steps for further research.
Conclude with a summary of the project's outcomes and contributions.


### Exploratory Data Analysis (EDA) — Inspect, Visualize and Clean the Data
First, we import the necessary libraries and load the datasets.

In [None]:


# import libraries for linear algebra, data processing and dictionaries
import numpy as np 
import pandas as pd 
from collections import defaultdict
import re

# libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# libraries for NLP preprocessing
import string
from nltk.corpus import stopwords
from nltk.util import ngrams
from nltk.tokenize import word_tokenize


# libraries for data splitting, feature extraction and evaluation
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix


# libraries for NN models creation
import tensorflow as tf
import tensorflow_hub as hub

from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.models import Model
from tensorflow.keras import layers
from tensorflow.keras.initializers import Constant
from tensorflow.keras.optimizers import Adam

# BERT tokenizer (requires: pip3 install bert-for-tf2)
from bert import bert_tokenization

# load the datasets
train_df = pd.read_csv('nlp-getting-started/train.csv')
test_df = pd.read_csv('nlp-getting-started/test.csv')
submission = pd.read_csv('nlp-getting-started/sample_submission.csv')

### Data inspection and visualization
Check data information, data types and missing values.

The "keyword" and "location" columns contain missing values in both the training and test datasets.

In [None]:
train_df.info()
# count the disaster and non-disaster tweets
disaster = train_df[train_df['target'] == 1].shape[0]
not_disaster = train_df[train_df['target'] == 0].shape[0]
print(f'There are {disaster} disaster tweets and {not_disaster} general tweets in the training dataset')
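
The missing-value check and the class counts above can also be read off directly with `isnull().sum()` and `value_counts()`. The sketch below uses a small made-up frame with the same column names as `train.csv`; the values are hypothetical and only illustrate the calls.

```python
import numpy as np
import pandas as pd

# toy frame mimicking the columns of train.csv (values are made up)
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "keyword": [np.nan, "fire", np.nan, "flood"],
    "location": [np.nan, np.nan, "Paris", np.nan],
    "text": ["a", "b", "c", "d"],
    "target": [0, 1, 0, 1],
})

# number of missing entries per column
missing_counts = toy_df.isnull().sum()
print(missing_counts)

# number of tweets per class in a single call
class_counts = toy_df["target"].value_counts()
print(class_counts)
```

Running the same two calls on `test_df` (without the `target` column) would confirm the missing values in the test set as well.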

In [None]:
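# English stop words to drop from the corpus
# (requires the NLTK 'stopwords' corpus and tokenizer data to be downloaded)
stop = set(stopwords.words('english'))

# build a corpus: one list of lowercased, non-stop-word tokens per tweet
def create_corpus(df, target='opt'):
    # 'opt' (the default) keeps every tweet; any other value selects that class
    tweets = df['text'] if target == 'opt' else df[df['target'] == target]['text']
    corpus = []
    for tweet in tweets:
        words = [i for i in word_tokenize(tweet.lower()) if i not in stop]
        corpus.append(words)
    return corpus

# count word frequencies in non-disaster tweets and plot the ten most frequent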
dic = defaultdict(int)
for tweet in create_corpus(train_df, 0):
    for word in tweet:
        dic[word] += 1

top = sorted(dic.items(),key=lambda item:item[1], reverse=True)[:10]

plt.figure(figsize=(10,5))
x,y=zip(*top)
plt.bar(x,y)
plt.title('Most frequent words in general tweets')
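
The counting step above can also be written with `collections.Counter`, whose `most_common` method replaces the manual sort. This is a sketch on a made-up corpus in the same shape that `create_corpus` returns (one token list per tweet); the tokens are hypothetical.

```python
from collections import Counter

# toy corpus: one list of tokens per tweet (made-up values)
toy_corpus = [
    ["storm", "hits", "city"],
    ["city", "council", "meets"],
    ["storm", "warning", "city"],
]

# flatten the token lists and count each word
word_counts = Counter(word for tweet in toy_corpus for word in tweet)

# most_common(n) returns the n highest counts, highest first
top_words = word_counts.most_common(2)
print(top_words)
```

The resulting list of `(word, count)` pairs can be unpacked with `zip(*top_words)` and passed to `plt.bar` in the same way as `top` in the cell above.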