### Read CSV

In [1]:
import pandas
raw = pandas.read_csv("./data_kaggle/train.csv")

In [2]:
raw

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |
| ... | ... | ... | ... | ... | ... |
| 7608 | 10869 | | | Two giant cranes holding a bridge collapse int... | 1 |
| 7609 | 10870 | | | @aria_ahrary @TheTawniest The out of control w... | 1 |
| 7610 | 10871 | | | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | 1 |
| 7611 | 10872 | | | Police investigating after an e-bike collided ... | 1 |


### Acquiring the dataset

Ideally, the dataset should already be labelled, be large enough, and contain posts about a variety of disasters.
My first approach was to use [the dataset provided by the Kaggle competition 'Natural Language Processing with Disaster Tweets'](https://www.kaggle.com/competitions/nlp-getting-started/overview), which claims to contain about 10k labelled tweets that may or may not be related to actual disasters.

Each row of data contains the following columns:
* "target" indicates whether the tweet is about an actual disaster happening
* "keyword" indicates the keyword related to the disaster (has missing values)
* "location" indicates location provided by the user with the tweet (has missing values)

From the description, the "location" column is of little use to us, since we want the trained model to work anywhere, not just in a particular location.
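Before deciding which columns to keep, it can help to check how sparse they are. The following is only a sketch on a tiny stand-in frame: the rows in `sample` are invented for illustration, and in the notebook the same calls would be applied to `raw` instead.

```python
import pandas

# Tiny stand-in for `raw`; these rows are invented for illustration only.
sample = pandas.DataFrame(
    {
        "id": [1, 4, 5, 6],
        "keyword": [None, None, "ablaze", "accident"],
        "location": [None, "Canada", None, None],
        "text": ["tweet a", "tweet b", "tweet c", "tweet d"],
        "target": [1, 1, 0, 1],
    }
)

# Fraction of missing values in the two columns that "have missing values".
missing_share = sample[["keyword", "location"]].isna().mean()
print(missing_share)

# Drop the column we do not intend to use as a feature.
features = sample.drop(columns=["location"])
print(features.columns.to_list())
```
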

#### Taking a peek at the values

In [12]:
keywords = raw["keyword"].value_counts().index.to_list()
keywords.sort()
keywords

['ablaze',
 'accident',
 'aftershock',
 'airplane%20accident',
 'ambulance',
 'annihilated',
 'annihilation',
 'apocalypse',
 'armageddon',
 'army',
 'arson',
 'arsonist',
 'attack',
 'attacked',
 'avalanche',
 'battle',
 'bioterror',
 'bioterrorism',
 'blaze',
 'blazing',
 'bleeding',
 'blew%20up',
 'blight',
 'blizzard',
 'blood',
 'bloody',
 'blown%20up',
 'body%20bag',
 'body%20bagging',
 'body%20bags',
 'bomb',
 'bombed',
 'bombing',
 'bridge%20collapse',
 'buildings%20burning',
 'buildings%20on%20fire',
 'burned',
 'burning',
 'burning%20buildings',
 'bush%20fires',
 'casualties',
 'casualty',
 'catastrophe',
 'catastrophic',
 'chemical%20emergency',
 'cliff%20fall',
 'collapse',
 'collapsed',
 'collide',
 'collided',
 'collision',
 'crash',
 'crashed',
 'crush',
 'crushed',
 'curfew',
 'cyclone',
 'damage',
 'danger',
 'dead',
 'death',
 'deaths',
 'debris',
 'deluge',
 'deluged',
 'demolish',
 'demolished',
 'demolition',
 'derail',
 'derailed',
 'derailment',
 'desolate',
 'de

Looking at the keywords in the original dataset, it is clear that:
* Not all keywords are meaningful for predicting disasters. For example, "armageddon" is very unlikely to actually happen, and the tweets under "battle" are mostly chatter about fictional "wars".
* The keywords do not necessarily indicate the kind of disaster. For example, "survive" and "survivors" say nothing about what actually happened.
* Many keywords are just different forms of the same thing: "buildings burning", "buildings on fire", "bush fires", "fire", "fire truck", "forest fire", "forest fires", "hellfire", and "ablaze" are all different ways of saying "fire".
* Many tweets contain URLs, which point to either (A) Twitter media or (B) external websites.
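These observations can be checked with a few small helpers. This is a minimal sketch, not part of the original notebook: `FIRE_VARIANTS`, `canonical_keyword`, `contains_url` and the URL pattern are names and choices assumed here for illustration, and the example URL is made up.

```python
import re
from urllib.parse import unquote

# Keyword variants that all amount to "fire" (taken from the list above).
FIRE_VARIANTS = {
    "buildings burning", "buildings on fire", "bush fires", "fire",
    "fire truck", "forest fire", "forest fires", "hellfire", "ablaze",
}

# Simple pattern for http/https links; an assumption, not a full URL grammar.
URL_PATTERN = re.compile(r"https?://\S+")


def canonical_keyword(keyword: str) -> str:
    """Decode '%20' escapes and collapse the fire variants into 'fire'."""
    decoded = unquote(keyword)
    return "fire" if decoded in FIRE_VARIANTS else decoded


def contains_url(text: str) -> bool:
    """Return True if the text contains at least one http(s) link."""
    return URL_PATTERN.search(text) is not None


print(canonical_keyword("buildings%20on%20fire"))
print(canonical_keyword("airplane%20accident"))
print(contains_url("Photo here: http://example.com/photo"))  # made-up URL
```

Applied to the DataFrame, `raw["text"].map(contains_url).mean()` would give the share of tweets containing a link, and `raw["keyword"].dropna().map(canonical_keyword)` would give the grouped keywords.
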

It is also clear that:
* Many rows labelled as disasters are not about "actual" disasters. For example, id 6132, "The Prophet (peace be upon him) said 'Save yourself from Hellfire even if it is by giving half a date in charity.'", is a quote from Islamic Hadith, yet it is marked as a "disaster".
* While the dataset provider claims that the data is "hand-labelled", this does not appear to be the case: [there are discussions suggesting that some sort of heuristic was used to label the sentences](https://www.kaggle.com/competitions/nlp-getting-started/discussion/130458).

While it would be very convenient to use pre-labelled data, the points above lead us to conclude that using this dataset would be largely impractical and would not give reliable results for classifying "actual disasters".

Therefore, unfortunately, we have to collect and label the data ourselves.

### Collecting & labelling data manually

We collect data from DCInside by searching for specific terms on its search page.
For each term we collect 20 pages' worth of results (20 pages × 20 posts per page = 400 posts).
For each post, we collect:
* The gallery name
* The title
* Part of the post's text content
* The post date

To each post we add the following labels:
* The search keyword
* Whether the post is about an actual disaster or not
  * This does not have to be a current disaster; for example, posts talking about past disasters also qualify
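One possible shape for a collected and labelled record is sketched below. The class name `LabelledPost`, its field names, and the example values are all hypothetical; only the list of fields and the 20 × 20 page arithmetic come from the description above.

```python
from dataclasses import asdict, dataclass
from datetime import date

PAGES_PER_TERM = 20   # search result pages collected per term
POSTS_PER_PAGE = 20   # posts listed on each result page


@dataclass
class LabelledPost:
    # Fields collected from the search results
    gallery: str
    title: str
    excerpt: str          # only part of the post text is available
    posted_on: date
    # Labels added afterwards
    search_keyword: str
    is_disaster: bool     # True also for posts about past disasters


posts_per_term = PAGES_PER_TERM * POSTS_PER_PAGE
print(posts_per_term)

# Placeholder values, not real collected data.
example = LabelledPost(
    gallery="example-gallery",
    title="Example title",
    excerpt="Example excerpt of the post body",
    posted_on=date(2023, 1, 1),
    search_keyword="earthquake",
    is_disaster=True,
)
print(asdict(example))
```

Keeping the labels in the same record as the collected fields means each row can be written straight to a CSV file and later loaded with `pandas.read_csv`, the same way as the Kaggle data.
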