# Getting the CrisisLexT26 Dataset
This notebook shows how to download, unzip and combine the relevant CSV files into a single data frame. The source dataset can be found [here](https://crisislex.org/data-collections.html#CrisisLexT26). The tweets in the dataset have been labelled with:

- **Information Source**,
- **Informativeness**, and
- **Information Type**.

Additional metadata, such as the crisis event, crisis type and country impacted, has been included in the dataset for reference.

### Clone the project GitHub repo
The project uses code stored in the publicly available GitHub repo [here](https://github.com/Crisitunity-Lab/ARDC-Project). To access the bespoke functions, clone the repo into the local environment.

In [1]:
# Clone GitHub repo
user = "Crisitunity-Lab"
repo = "ARDC-Project"

!git clone https://github.com/$user/$repo /content/repo

Cloning into '/content/repo'...
remote: Enumerating objects: 494, done.
remote: Counting objects: 100% (57/57), done.
remote: Compressing objects: 100% (38/38), done.
remote: Total 494 (delta 25), reused 34 (delta 14), pack-reused 437
Receiving objects: 100% (494/494), 5.20 MiB | 10.05 MiB/s, done.
Resolving deltas: 100% (253/253), done.


### Install requirements
All Python library requirements are listed in a requirements.txt file and can be installed with the command below.

In [2]:
%pip install -r /content/repo/requirements.txt

Collecting pycountry==22.3.5 (from -r /content/repo/requirements.txt (line 5))
  Downloading pycountry-22.3.5.tar.gz (10.1 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting transformers (from -r /content/repo/requirements.txt (line 11))
  Downloading transformers-4.34.1-py3-none-any.whl (7.7 MB)
Collecting accelerate (from -r /content/repo/requirements.txt (line 12))
  Downloading accelerate-0.24.0-py3-none-any.whl (260 kB)
[?25hCollecting einops (from -r /content/repo/requirements.t

In [3]:
# Import langchain library
from langchain import HuggingFacePipeline

# Import bespoke functions
import repo.src.structure_extractor.data_utils as du
import repo.src.structure_extractor.model_utils as mu
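
The `repo.src...` imports above resolve only when the parent directory of the clone (`/content` in this notebook) is on Python's import path. In Google Colab the working directory is normally `/content`, so this usually works without extra steps. In other environments a small helper along the following lines may help. `ensure_on_path` is a hypothetical name used for illustration and is not part of the project repo.

```python
import sys
from pathlib import Path


def ensure_on_path(parent_dir: str) -> str:
    """Add parent_dir to sys.path if it is missing and return the resolved path."""
    resolved = str(Path(parent_dir).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Assumption: the repo was cloned to /content/repo, so its parent is /content.
ensure_on_path("/content")
```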

### Get data
Using the pre-built functions in the data utils file, provide a link to the zipped data and a destination directory. The data will be downloaded and unzipped to the destination directory. The source dataset contains multiple files in each directory, including:
- A README.md file containing general information about the files in the folder.
- A JSON file containing information about the event and other metadata.
- A comma-separated values (CSV) file of labelled tweets, named \<crisis_event>-tweets_labeled.csv.
- A CSV file listing the identifiers of all tweets associated with the event, named \<crisis_event>-tweetids_entire_period.csv.

For this example only the _\<crisis_event>-tweets_labeled.csv_ file is required.
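
Once the archive has been unzipped, the labelled files can be located by their file-name suffix. The following is a minimal sketch that assumes each event directory sits directly under the data root. `find_labeled_csvs` is a hypothetical helper, not a function from the project repo.

```python
from pathlib import Path


def find_labeled_csvs(data_root: str) -> list[Path]:
    """Return the labelled-tweet CSVs found one level below data_root."""
    root = Path(data_root)
    # Assumption: each event directory holds one "<crisis_event>-tweets_labeled.csv" file.
    return sorted(root.glob("*/*-tweets_labeled.csv"))
```

For example, `find_labeled_csvs("/content/data/CrisisLexT26")` would return one path per event directory, and `path.parent.name` gives the name of the event folder.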

In [4]:
# Get data from online source and unzip to local directory
zip_file_url = "https://github.com/sajao/CrisisLex/blob/master/releases/CrisisLexT26-v1.0.zip?raw=true"
dest_folder = "/content/data/"

du.unzip_from_url(src=zip_file_url, dest=dest_folder)

Download and Unzip complete


Data is now stored at /content/data/CrisisLexT26 within the Google Colab environment.
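
For readers who want to see what a download-and-unzip step can look like without the project repo, here is a minimal sketch using only the Python standard library. It is not the implementation of `du.unzip_from_url`, which may differ. The names `extract_zip_bytes` and `download_and_unzip` are hypothetical.

```python
import io
import zipfile
from pathlib import Path
from urllib.request import urlopen


def extract_zip_bytes(payload: bytes, dest: str) -> list[str]:
    """Extract an in-memory zip archive into dest and return the member names."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest)
        return archive.namelist()


def download_and_unzip(src: str, dest: str) -> list[str]:
    """Download the archive at src into memory and extract it into dest."""
    with urlopen(src) as response:
        payload = response.read()
    return extract_zip_bytes(payload, dest)
```

Calling `download_and_unzip(zip_file_url, dest_folder)` with the variables defined in the cell above would be the equivalent usage. The whole archive is held in memory before extraction, which is acceptable for a sketch but may not suit very large files.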

### Combine data into single dataframe
As the data is spread across multiple CSV files, the *combine_csv_files* function brings together the *tweets_labeled* file from each directory into a single data frame.

**NOTE**: Some tweets in the dataset have very few words. With so little text, it is difficult for anyone, let alone a large language model, to understand what the tweet is about. A minimum tweet length can be set, and tweets with fewer words than the minimum are excluded. Hashtags, links and user tags are not counted as words.
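
The word-counting rule described in the note can be sketched as follows. This is an illustrative approximation, not the logic used inside `combine_csv_files`. It assumes that tokens are separated by whitespace and that a token is a hashtag, user tag or link when it starts with `#`, `@`, `http://`, `https://` or `www.`. Both function names are hypothetical.

```python
import re

# Tokens treated as non-words: hashtags, user tags and links (an assumed rule).
NON_WORD = re.compile(r"^(#|@|https?://|www\.)", re.IGNORECASE)


def count_words(tweet: str) -> int:
    """Count whitespace-separated tokens that are not hashtags, user tags or links."""
    return sum(1 for token in tweet.split() if not NON_WORD.match(token))


def is_long_enough(tweet: str, min_len: int = 6) -> bool:
    """Return True when the tweet has at least min_len counted words."""
    return count_words(tweet) >= min_len
```

Under this rule, standalone punctuation such as `!` still counts as a word, so a stricter tokenizer may be preferable in practice.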

In [5]:
# Set the folder name for where the data is stored
data_folder_name = "CrisisLexT26"
data_loc = dest_folder + data_folder_name
min_tweet_length = 6

# Combine the CSVs into a single dataframe. By default, the name of the folder the data is stored in
# is added as a new field called "label". Set the retrieve_label parameter to False to turn this off.
df = du.combine_csv_files(data_loc, min_tweet_len=min_tweet_length)
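
The combination step can be approximated with pandas as shown below. This is a sketch under stated assumptions rather than the repo's `combine_csv_files`: it assumes each event directory sits directly under the data root, it adds only the folder name as `label`, and it does not apply the minimum-length filter or add the `year`, `country_code` and `crisis_type` fields. `combine_labeled_csvs` is a hypothetical name.

```python
from pathlib import Path

import pandas as pd


def combine_labeled_csvs(data_root: str) -> pd.DataFrame:
    """Concatenate every *-tweets_labeled.csv under data_root into one data frame."""
    frames = []
    for csv_path in sorted(Path(data_root).glob("*/*-tweets_labeled.csv")):
        frame = pd.read_csv(csv_path)
        # Record the source folder (the crisis event) alongside each tweet.
        frame["label"] = csv_path.parent.name
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"No labelled CSV files found under {data_root}")
    return pd.concat(frames, ignore_index=True)
```

Passing `ignore_index=True` gives the combined frame a fresh, continuous index instead of repeating each file's own row numbers.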

In [6]:
# Data is now in a dataframe
df.head(10)

| Unnamed: 0 | Tweet ID | Tweet Text | Information Source | Information Type | Informativeness | label | year | country_code | crisis_type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 203440928084602880 | E #Navelli... a che punto siamo? Dov'è il pian... | Outsiders | Other Useful Information | Related - but not informative | 2012_Italy_earthquakes | 2012 | IT | Earthquake |
| 1 | 203843778409283584 | RT @andmarini: #Modena Strage di Brindisi, all... | Not labeled | Not labeled | Not related | 2012_Italy_earthquakes | 2012 | IT | Earthquake |
| 2 | 204030290715348993 | #Terremoto ! in alto si sente parecchio | Outsiders | Sympathy and support | Related - but not informative | 2012_Italy_earthquakes | 2012 | IT | Earthquake |
| 3 | 204030617850089472 | Inequivocabilmente questa era una scossa di #t... | Eyewitness | Sympathy and support | Related - but not informative | 2012_Italy_earthquakes | 2012 | IT | Earthquake |
| 5 | 204032853439283201 | “@perugini: 44.956°N, 11.241°E - 4.2 Magnitudo... | Eyewitness | Other Useful Information | Related and informative | 2012_Italy_earthquakes | 2012 | IT | Earthquake |
| 6 | 204033939772407808 | RT @Reuters: BREAKING NEWS: 6.3 magnitude eart... | Media | Other Useful Information | Related and informative | 2012_Italy_earthquakes | 2012 | IT | Earthquake |
| 7 | 204033969124155392 | RT @Reuters: BREAKING NEWS: 6.3 magnitude eart... | Media | Other Useful Information | Related and informative | 2012_Italy_earthquakes | 2012 | IT | Earthquake |
| 9 | 204034510176780288 | RT @Reuters: BREAKING NEWS: 6.3 magnitude eart... | Media | Other Useful Information | Related and informative | 2012_Italy_earthquakes | 2012 | IT | Earthquake |
| 11 | 204034585670062080 | RT @USGSted: Strong earthquake, NORTHERN ITALY... | Media | Caution and advice | Related and informative | 2012_Italy_earthquakes | 2012 | IT | Earthquake |
| 12 | 204034644398702592 | Oddio #terremoto. L'ho sentito solo io alle 4... | Eyewitness | Sympathy and support | Related - but not informative | 2012_Italy_earthquakes | 2012 | IT | Earthquake |


In [7]:
# How many records are there?
len(df)

26001