# Project Findings

# Exploratory Data Analysis

- The dataset contains 7613 entries.
- 0.80% of the entries are missing a keyword entry.
    - These entries will be dropped since they make a small fraction of the dataset.
    - The keyword is found within the tweet text hence the keyword column will be disregarded for training.
- 33.37% of the entries are missing a location entry.
    - This represents a large fraction of the dataset hence dropping is not an option.
    - The models will be trained without taking the location into consideration.
    - Another model will be trained in which the missing location entries are replaced with the top 5 most common locations.
    - The model with the best performance will be used.
    - Retweets have different locations.
    - The effect of retweets from different locations will be investigated.
    - It is worth investigating whether location is informative, e.g. tweets from a city centre are unlikely to report on a veld fire.
- There are duplicate tweets which are labelled differently.
    - These entries will be dropped due to the negative impact which they can have on the model.
    - An alternate approach is to hand label the duplicated tweets then drop those that are mislabelled.
- The data is ordered by the keyword which was used for extraction.
    - Batch training will be done hence data will be shuffled to ensure tweets with the same keyword do not end up in the same batch.
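
The checks above can be sketched in a few lines of dependency-free Python. This is only an illustration, not the project's actual code: the column names `text` and `target`, the toy rows, and the helper names are assumptions made for the example, and only conflicting duplicates (same text, different labels) are dropped.

```python
import random
from collections import defaultdict


def missing_fraction(rows, column):
    """Fraction of rows whose value in `column` is None or an empty string."""
    missing = sum(1 for row in rows if not row.get(column))
    return missing / len(rows)


def drop_conflicting_duplicates(rows):
    """Drop every tweet whose exact text appears with more than one label."""
    labels_by_text = defaultdict(set)
    for row in rows:
        labels_by_text[row["text"]].add(row["target"])
    return [row for row in rows if len(labels_by_text[row["text"]]) == 1]


def shuffled(rows, seed=42):
    """Return a shuffled copy so tweets sharing a keyword are spread across batches."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows


# Toy rows standing in for the real dataset (hypothetical values).
rows = [
    {"keyword": "fire", "location": "", "text": "forest fire near town", "target": 1},
    {"keyword": "fire", "location": "Harare", "text": "this mixtape is fire", "target": 0},
    {"keyword": "flood", "location": None, "text": "streets are flooded", "target": 1},
    {"keyword": "flood", "location": "Cape Town", "text": "streets are flooded", "target": 0},
]

location_missing = missing_fraction(rows, "location")
consistent_rows = drop_conflicting_duplicates(rows)
training_order = shuffled(consistent_rows)
```

The fixed seed only makes the shuffle reproducible; any seed would serve the purpose of breaking up the keyword ordering.
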
### Testing: 
    - Data will be split into training, validation and testing sets.
    - The split ratio is 80% training, 10% validation and 10% testing.
    - Training data: This set will be used for training the model.
    - Validation data: Validation data will be used to evaluate the model's performance in between batches. This data will be crucial for detecting over-fitting. An early stopping mechanism will be used to prevent over-fitting.
    - Testing data: Upon completion of training, the model's performance will be evaluated using the testing data.
    - The training dataset has a 3:4 ratio of real to fake tweets, so the classes are not perfectly balanced; the performance metrics that will be used are therefore accuracy, precision and recall. The main aim will be to balance the precision and recall, hence the F1 score will be the most definitive performance metric.
    - For choosing an appropriate performance metric, the view of an emergency team using the model to track natural disasters around a country has been assumed. If real tweets are classified as fake, then the users who send the tweets are in danger. If fake tweets are classified as real, then the emergency team will waste resources commuting to an area where there is no natural disaster occurring.
    - **Precision**: Measures how precise the model's positive predictions are, i.e. out of the tweets which are predicted as real, how many of them are actually real?
    - Precision is a good measure when the cost of a False Positive is high.
    - If the precision is not high, a lot of fake tweets will get classified as real.
    - **Recall**: Measures how many of the actual positives the model captures, i.e. out of the tweets which are actually real, how many of them are predicted as real?
    - Recall is the preferred performance metric when there is a high cost associated with False Negatives.
    - Classifying real tweets as fake could be detrimental to the tweet sender's safety, hence the recall is also important.
    - **F1**: Since both the precision and recall have been found to be important metrics, it is better to observe the F1 score in order to try and strike a balance between the 2 metrics.
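
As a rough illustration of the split and the metrics described above, the following sketch uses plain Python. It assumes the data has already been shuffled and that label `1` means a real disaster tweet and `0` a fake one; the function names are hypothetical and not taken from the project.

```python
def split_80_10_10(rows):
    """Split already-shuffled rows into 80% training, 10% validation, 10% testing."""
    n = len(rows)
    train_end = n * 8 // 10
    val_end = n * 9 // 10
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1 = real tweet)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


train, val, test = split_80_10_10(list(range(10)))
precision, recall, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

F1 is the harmonic mean of precision and recall, so it is only high when both are high, which matches the aim of balancing the two costs described above.
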

# Data Cleaning

- All text will be set to lower case to ensure that the same words with different capitalisation are not read as different words.
- Words will be lemmatized (i.e. inflected forms of a word will be grouped together and processed as a single word).
- All links will be removed, using `http` and `www` to identify them.
- Words containing numbers will also be removed.
- Punctuation will also be removed.
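
A minimal sketch of these cleaning steps, using only the standard library, might look as follows. The order of the steps, the regular expression, and the optional `lemmatize` callable are assumptions made for illustration; the notes do not say which lemmatizer is used, so it is left as a pluggable function here.

```python
import re
import string

# Anything starting with http://, https:// or www. up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
# Translation table that deletes ASCII punctuation only.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text, lemmatize=None):
    """Lower-case, strip links, drop tokens containing digits, remove punctuation."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    tokens = [tok for tok in text.split() if not any(ch.isdigit() for ch in tok)]
    tokens = [tok.translate(PUNCT_TABLE) for tok in tokens]
    tokens = [tok for tok in tokens if tok]  # discard tokens that were only punctuation
    if lemmatize is not None:
        # `lemmatize` is a hypothetical hook: any callable mapping a word to its lemma.
        tokens = [lemmatize(tok) for tok in tokens]
    return " ".join(tokens)


cleaned = clean_tweet("Forest FIRE near La Ronge, Sask. http://t.co/abc123 evacuating 13,000 people!")
```

Links are removed before punctuation so that the `://` and `.` characters are still available to identify them.
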

# Data Processing

- The best way to compare different feature extraction techniques is to feed the extracted features to a Neural Network model and evaluate the model's performance.
- The feature extraction techniques that have been used for this application are one-hot encodings and word embeddings. 
- One-Hot Encodings: Each sentence is represented by a one-hot encoded matrix in which the column associated with each word found in the sentence is set to 1 and all other entries are 0.
- The one-hot encoded matrices are sparse.
- Word Embeddings: Word embeddings are able to capture more semantic information in text by representing the information using dense matrices.
- **add more defining information for one-hot encodings and word embeddings**
- The word embeddings have been split into embeddings learned while training the model itself, using the Keras embedding layer, and pre-trained embeddings, using GloVe and gensim Word2Vec.
- **Keras Embedding Layer**: The embeddings are trained using a supervised Dense(MLP) neural network during the process of training the final model.
    - The weights are updated during the training of the model using back-propagation.
    - The embedding layer therefore contributes to the trainable parameters.
    - This method requires many epochs for training as the embedding weights are optimized along with the neural network's own weights.
- **Pre-Trained Embeddings**: These embeddings are trained before training the main model.
    - These were produced using GloVe embeddings, which are trained on the Glove Twitter 6B corpus, and Word2Vec, which was trained on the natural disasters corpus.
    - Both the GloVe and Word2Vec models yield similar performance.
    - The Word2Vec models were optimized by varying the window size and the number of training iterations, and by using both CBOW and Skip-Gram.
    - Performance is poor when the number of iterations is low.
    - Skip-Gram is better for learning infrequent words.
    - In CBOW the vectors from the context words are averaged before predicting the center word.
    - In Skip-Gram there is no averaging of embedding vectors.
    - It seems that the model can learn better representations for rare words when their vectors are not averaged with the other context words in the process of making the predictions.
    - This dataset is composed of mostly repeating words, especially the keywords that are used to extract tweets, therefore it is probably better to use CBOW.
    - CBOW also yields a better F1 score at 50 and 100 iterations, although Skip-Gram scores higher at 30 iterations.
    - Pre-trained embeddings have fewer trainable parameters than the Keras embedding layer.

    **30 iterations**
    - CBOW: 59 F1, 69.9 ACC
    - SG: 68 F1, 73.6 ACC


    **50 iterations**
    - CBOW: 69 F1, 72 ACC
    - SG: 64 F1, 70 ACC

    **100 iterations**
    - CBOW: 67 F1, 72 ACC
    - SG: 63 F1, 72 ACC
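
To make the feature extraction ideas in this section more concrete, the sketch below shows one possible reading of a one-hot encoded sentence (one row per token) and how CBOW and Skip-Gram build their training examples from the same sentence. It is a toy illustration in plain Python, not the Keras or gensim implementation; the function names, the vocabulary, and the window size are assumptions.

```python
def one_hot_matrix(tokens, vocab):
    """One row per token, with a 1 in the column of that token's vocabulary index."""
    index = {word: i for i, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in tokens]
    for row, token in enumerate(tokens):
        matrix[row][index[token]] = 1
    return matrix


def cbow_examples(tokens, window):
    """(context words, centre word): the context vectors are averaged into one input."""
    examples = []
    for i, centre in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, centre))
    return examples


def skipgram_examples(tokens, window):
    """(centre word, context word): one pair per context word, with no averaging."""
    return [
        (centre, ctx)
        for context, centre in cbow_examples(tokens, window)
        for ctx in context
    ]


sentence = ["wild", "fire", "spreads", "fast"]
vocab = sorted(set(sentence))  # ['fast', 'fire', 'spreads', 'wild']
encoded = one_hot_matrix(sentence, vocab)
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)
```

With a window of 1, the four-word sentence gives four CBOW examples but six Skip-Gram pairs, because each context word becomes its own training pair instead of being averaged with its neighbours. This is consistent with the observation that Skip-Gram gives rare words more individual updates.
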