### DATA SCIENCE FELLOWSHIP PROGRAM - NATURAL DISASTERS DATASET
### Oyekemi Abioye

## Loading and preprocessing the data

Before analysing, classifying and transforming the data, I loaded it to answer two questions:

* What format is it in?
* What recognisable features does it have?

I used Python and the spaCy library to load, explore and manipulate the textual data.

## Loading the data

The dataset (https://www.kaggle.com/competitions/nlp-getting-started/data) contains $7613$ training rows and $3263$ test rows. Each row holds the text of a tweet, a keyword, and the location the tweet was sent from.

The training set also contains a `target` column indicating whether a tweet is about a natural disaster (`1`) or not (`0`).

In [6]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt



In [9]:
train = pd.read_csv("data/train.csv")

In [33]:
test = pd.read_csv("data/test.csv")

In [140]:
train_data = train.copy()

In [141]:
test_data = test.copy()

## Preprocessing

The loaded text comes as a single sequence of symbols (a string of characters) for each tweet (`id`).

The first step is a brief exploratory analysis to build an understanding of the data.

### Then
1. Removal of URLs, emails & HTML tags
2. Removal of special characters
3. Removal of numeric values
4. Removal of hashtags and handles

#### Using `.info()`
allows us to understand more about the dataset: how many
values there are, and what might be missing.
It also tells us about dtypes: the pandas data type of each column,
as inferred by pandas when reading the CSV file.

In [78]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB
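
The `.info()` output above implies 61 missing `keyword` values (7613 - 7552) and 2533 missing `location` values (7613 - 5080). As a minimal sketch of how such counts could be read off directly, the block below uses a small made-up frame (`toy` and its values are illustrative assumptions, not the real data) with the same column names:

```python
import pandas as pd

# Toy frame with the same columns as train.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 4, 5],
    "keyword": [None, "flood", None],
    "location": [None, None, "Lagos"],
    "text": ["a", "b", "c"],
    "target": [1, 0, 1],
})

# Number of missing entries per column.
missing_counts = toy.isna().sum()

# Fraction of missing entries per column.
missing_share = toy.isna().mean().round(2)

print(missing_counts)
print(missing_share)
```

Applied to `train_data`, the same two calls should report the missing counts that `.info()` only shows indirectly.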


#### The `.shape`
attribute gives the number of rows and columns the DataFrame contains.

In [79]:
print(train_data.shape[0])
print(train_data.shape[1])

7613
5


In [80]:
print(f'Number of training examples: {train_data.shape[0]}')

print('\n===\n')

print(train_data.head())

print('\n===\n')

print(f'Total number of keywords: {len(train_data.keyword.unique())}')

print('\n===\n')

print(f'Examples per keywords:\n\n{train_data.keyword.value_counts()}')

print('\n===\n')

print(f'Total number of location: {len(train_data.location.unique())}')

print('\n===\n')

print(f'Examples per location:\n\n{train_data.location.value_counts()}')

Number of training examples: 7613

===

   id keyword location                                               text  \
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...   
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada   
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...   
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...   
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...   

   target  
0       1  
1       1  
2       1  
3       1  
4       1  

===

Total number of keywords: 222

===

Examples per keywords:

fatalities               45
deluge                   42
armageddon               42
sinking                  41
damage                   41
                         ..
forest%20fire            19
epicentre                12
threat                   11
inundation               10
radiation%20emergency     9
Name: keyword, Length: 221, dtype: int64

===

Tota

This shows the following:

* Keyword: `fatalities` is the most frequent keyword and `radiation%20emergency` the least frequent.
* Location: `USA` is the most common location value.
* Both the `keyword` and `location` columns need preprocessing; for example, spaces in keywords are encoded as `%20`.
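
One possible way to clean the keywords is to decode the URL-encoded `%20` seen in values such as `forest%20fire`. The sketch below uses the standard library's `urllib.parse.unquote` on a hand-picked list (`raw_keywords` is an illustrative sample taken from the output above, not the full column):

```python
from urllib.parse import unquote

# A few keyword values as they appear in the value_counts() output.
raw_keywords = ["forest%20fire", "radiation%20emergency", "deluge"]

# unquote() turns percent-encoded sequences such as %20 back into characters.
decoded = [unquote(k) for k in raw_keywords]

print(decoded)
```

On the DataFrame itself, something like `train_data['keyword'].map(unquote, na_action='ignore')` would be one way to apply this while leaving the missing values untouched.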

In [81]:
#checking for duplicates in the data
train_data.duplicated().sum()

0

Checking the distribution of the `target` column:

In [82]:
train_data['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64

**The classes appear reasonably balanced, with class `0` somewhat larger than class `1`.**
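
To put a number on the balance, the class shares can be worked out from the counts printed above. This is a small arithmetic sketch; the dictionary `counts` simply copies the two values from the `value_counts()` output:

```python
# Counts copied from the value_counts() output above.
counts = {0: 4342, 1: 3271}

total = sum(counts.values())

# Share of each class, rounded to three decimal places.
shares = {label: round(n / total, 3) for label, n in counts.items()}

print(total)
print(shares)
```

With pandas, `train_data['target'].value_counts(normalize=True)` should give the same proportions directly.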

Removal of URLs, emails & HTML tags 

In [142]:
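def remove_tags(text):
    # Strip URLs (http, https, ftp, file)
    text = re.sub(r'((?:https?|ftp|file)://[-\w\d+=&@#/%?~|!:;\.,]*)', '', text)
    # Strip email addresses (the dot before the domain suffix is escaped as \.)
    text = re.sub(r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)', '', text)
    # Strip HTML tags
    text = re.sub(r'<.*?>', '', text)
    return text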

train_data['clean_text'] = train_data['text'].apply(remove_tags)
test_data['clean_text'] = test_data['text'].apply(remove_tags)
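
A quick way to sanity-check this step is to run it on one hand-written string. The sketch below repeats the three substitutions with the dot in the email pattern escaped as `\.`; the function name `strip_urls_emails_tags` and the sample text are made up for illustration:

```python
import re

def strip_urls_emails_tags(text):
    # URLs
    text = re.sub(r'(?:https?|ftp|file)://[-\w\d+=&@#/%?~|!:;\.,]*', '', text)
    # Email addresses
    text = re.sub(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', '', text)
    # HTML tags
    return re.sub(r'<.*?>', '', text)

sample = "Fire update <b>now</b> http://t.co/abc123 mail me at info@example.com"
cleaned = strip_urls_emails_tags(sample)

# split() ignores the extra spaces left behind by the removals.
print(cleaned.split())
```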



Removal of special characters

In [143]:
import string

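def remove_Specialcharacters(text):
    # Drop digit.digit sequences and every character that is neither whitespace nor a word character
    text = re.sub(r'(\d\.\d)|[^\s\w]', '', text)
    # Keep only printable ASCII characters
    text = ''.join([char for char in text if char in string.printable])
    return text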

train_data['clean_text'] = train_data['clean_text'].apply(remove_Specialcharacters)
test_data['clean_text'] = test_data['clean_text'].apply(remove_Specialcharacters)
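
It may help to see what the two lines of this function do to a single string. The sketch below applies the same regular expression and the same `string.printable` filter to a made-up sample; `strip_special` is an illustrative name:

```python
import re
import string

def strip_special(text):
    # Same substitution as above: digit.digit sequences and punctuation are removed.
    text = re.sub(r'(\d\.\d)|[^\s\w]', '', text)
    # Same filter as above: anything outside printable ASCII is dropped.
    return ''.join(ch for ch in text if ch in string.printable)

result = strip_special("Café fire 3.5 km!")

print(result.split())
```

Note that the accented letter survives the regular expression (it counts as a word character) but is then removed by the printable-ASCII filter, so non-English words can end up truncated.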

#[ ](?=[ ])|[^-_,A-Za-z0-9 ]+

Removal of numeric values

In [144]:
def remove_numbers(text):
    text = re.sub(r'\b\d+\b', '', text)
    return text

train_data['clean_text'] = train_data['clean_text'].apply(remove_numbers)
test_data['clean_text'] = test_data['clean_text'].apply(remove_numbers)


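Removal of hashtags and handles

Note that the special-character step above already strips the `@` and `#` symbols, so in this order the two patterns below find nothing to match and the handle and hashtag words stay in `clean_text`. To drop them entirely, this step would have to run before the special-character step.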

In [145]:
def remove_hastags_handles(text):
    text = re.sub(r'@[\w]+','', text)
    text = re.sub(r'#[\w]+','', text)
    return text

train_data['clean_text'] = train_data['clean_text'].apply(remove_hastags_handles)
test_data['clean_text'] = test_data['clean_text'].apply(remove_hastags_handles)
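
The effect of step order can be checked on a single string. The sketch below uses simplified stand-ins for the two cleaning functions (the names `strip_handles_and_hashtags` and `strip_special` and the sample tweet are illustrative assumptions, and the special-character pattern omits the decimal-number branch):

```python
import re

def strip_handles_and_hashtags(text):
    # Remove @handles and #hashtags while the marker characters are still present.
    text = re.sub(r'@\w+', '', text)
    return re.sub(r'#\w+', '', text)

def strip_special(text):
    # Drop every character that is neither whitespace nor a word character.
    return re.sub(r'[^\s\w]', '', text)

sample = "@user Fire near #Alaska!"

# Handles and hashtags first, then special characters.
handles_first = strip_special(strip_handles_and_hashtags(sample))

# Special characters first: '@' and '#' are gone before the patterns can match.
special_first = strip_handles_and_hashtags(strip_special(sample))

print(handles_first.split())
print(special_first.split())
```

Which behaviour is preferable depends on the task: keeping the hashtag word (for example `Alaska` or `earthquake`) may well carry useful signal for the classifier.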


## Preprocessing Continued

### Tokenisation

The first preprocessing step is *tokenisation*: splitting the text into words by using whitespace, punctuation marks, or both.
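
As a rough illustration of splitting on whitespace and punctuation, the sketch below uses a single regular expression. It is a simplified stand-in, not how spaCy tokenises, and `simple_tokenise` is a made-up name:

```python
import re

def simple_tokenise(text):
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenise("Forest fire near La Ronge Sask. Canada")

print(tokens)
```

A rule this simple separates `Sask` from its full stop and would split contractions awkwardly, which is one reason to rely on a dedicated tokeniser instead.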

## Then

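### Capitalisation
Converting every token to lowercase so that, for example, `Fire` and `fire` are treated as the same word.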

### Lemmatisation
A lemmatiser tries to convert different word forms to their base representations (lemmas) by looking up whether the lemma is in the vocabulary of the language in question.
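
The lookup idea can be sketched with a tiny hand-written table. This is a toy under stated assumptions: `LEMMA_TABLE` and `lookup_lemma` are hypothetical, and a real lemmatiser such as spaCy's draws on far larger resources and on part-of-speech information:

```python
# Hypothetical lookup table mapping inflected forms to their lemmas.
LEMMA_TABLE = {"deeds": "deed", "are": "be", "is": "be", "fires": "fire"}

def lookup_lemma(word):
    # Lowercase first, then fall back to the word itself if it is not in the table.
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

lemmas = [lookup_lemma(w) for w in "Our Deeds are the Reason".split()]

print(lemmas)
```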

In [147]:
import spacy
nlp = spacy.load('en_core_web_md')

Tokenisation

In [178]:
def tokenisation(text):
    doc = nlp(text)
    tokens = [token.text for token in doc]
    return tokens
    
train_data['clean_text'] = train_data['clean_text'].apply(tokenisation)

Capitalisation - converting string to lowercase

In [194]:
def capitalisation(text):
    text = [x.lower() for x in text]
    return text

train_data['clean_text'] = train_data['clean_text'].apply(capitalisation)

Lemmatisation

In [172]:
def preprocess(sentence, lemmatise):
    text = nlp(sentence)
    if lemmatise:
        lemmas = [token.lemma_ for token in text]
        return lemmas
    else:
        tokens = [token.text.lower() for token in text]
        return tokens
                                            
print(preprocess("Our Deeds are the Reason of this earthquake M", True))
print(preprocess("Our Deeds are the Reason of this earthquake M", False))

['our', 'deed', 'be', 'the', 'reason', 'of', 'this', 'earthquake', 'm']
['our', 'deeds', 'are', 'the', 'reason', 'of', 'this', 'earthquake', 'm']


In [195]:
def show_lemmas(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')

In [197]:
t = nlp('Our Deeds are the Reason of this earthquake M')
show_lemmas(t)

Our          PRON   17577015140245084783   our
Deeds        NOUN   14439616265815796507   deed
are          AUX    10382539506755952630   be
the          DET    7425985699627899538    the
Reason       NOUN   8276932932706964455    reason
of           ADP    886050111519832510     of
this         DET    1995909169258310477    this
earthquake   NOUN   2311630255468200397    earthquake
M            NUM    646772771845179972     m
