## Table of Contents
- [Competition Description Overview](#overview)
- [Dataset Description](#describe)
- [Data Gathering](#gather)
- [Modeling and Results](#model)
- [Conclusion](#conclude)

<a id='overview'></a>
## Competition Description Overview
Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

But, it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:


The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.

<a id='describe'></a>
## Dataset Description

Three files are available: train.csv, test.csv and sample_submission.csv.
Each sample in the train and test set has the following information:

- The text of a tweet
- A keyword from that tweet (although this may be blank!)
- The location the tweet was sent from (may also be blank)

You are predicting whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0.

Files

- train.csv - the training set
- test.csv - the test set
- sample_submission.csv - a sample submission file in the correct format

Columns

- id - a unique identifier for each tweet
- text - the text of the tweet
- location - the location the tweet was sent from (may be blank)
- keyword - a particular keyword from the tweet (may be blank)
- target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

In [1]:
# Import essential libraries
import sys, os
import warnings
import pickle
warnings.filterwarnings("ignore", category=UserWarning)

import nltk

import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from sklearn.metrics import accuracy_score, f1_score, \
recall_score, precision_score

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

<a id='gather'></a>
### Data Gathering 

In [2]:
# Load data
samp_sub_df = pd.read_csv('sample_submission.csv')
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

### Data Exploration

In [3]:
samp_sub_df.head()

| | id | target |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | 0 |
| 2 | 3 | 0 |
| 3 | 9 | 0 |
| 4 | 11 | 0 |


In [4]:
train_df.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [5]:
# check distribution of 0s and 1s
train_df.target.mean()

0.4296597924602653

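The dataset is moderately imbalanced: about 43% of the training tweets are labeled as disasters (1) and about 57% are not (0). For that reason, accuracy alone is not used to judge how well the model predicts this data; precision, recall, and F1 are reported instead.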
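
To illustrate why accuracy can mislead on a split like this, here is a minimal sketch that scores a baseline which always predicts 0. The label list is made up to mirror the roughly 0.43 mean of `train_df.target`, and the helper names `accuracy` and `recall` are illustrative assumptions, not part of the notebook.

```python
# Hypothetical labels: 43 disaster tweets (1) and 57 non-disaster tweets (0),
# chosen only to mirror the ~0.43 mean of the training target.
y_true = [1] * 43 + [0] * 57

# Baseline that ignores the text and always predicts "not a disaster".
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == 1 for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)

print("accuracy:", baseline_accuracy)
print("recall:", baseline_recall)
```

The baseline is right more than half the time without finding a single disaster tweet, so a score that looks at the positive class is the more informative check here.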

In [6]:
# train data info
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [7]:
test_df.head()

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | | | Just happened a terrible car crash |
| 1 | 2 | | | Heard about #earthquake is different cities, s... |
| 2 | 3 | | | there is a forest fire at spot pond, geese are... |
| 3 | 9 | | | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | | | Typhoon Soudelor kills 28 in China and Taiwan |


In [8]:
# test data info
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3263 entries, 0 to 3262
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        3263 non-null   int64 
 1   keyword   3237 non-null   object
 2   location  2158 non-null   object
 3   text      3263 non-null   object
dtypes: int64(1), object(3)
memory usage: 102.1+ KB


<a id='model'></a>
### Modeling and Results

- I use only the text and target columns for modeling.

In [9]:
# Extract text and target columns from train_df
X_train = train_df.text.values
y_train = train_df.target

In [10]:
X_test = test_df.text.values

In [11]:
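# Get target column from samp_sub_df
# Note: these are the sample submission's targets (all 0 in the rows shown
# above), not ground-truth labels for test.csv
y_test = samp_sub_df.target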

In [12]:
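def tokenize(text):
    """
    Split a tweet into a list of cleaned tokens.

    Requires the NLTK tokenizer and WordNet data to be downloaded
    (see nltk.download).

    Parameters:
        text: plain-text string
    Returns:
        list of lemmatized, lower-cased, stripped tokens
    """
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    # Lemmatize each token, then lower-case it and strip whitespace
    clean_tokens = [lemmatizer.lemmatize(word).lower().strip() for word in tokens]
    return clean_tokens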

In [13]:
def build_model():
    """
    Build the machine learning pipeline.

    Parameters:
        None
    Returns:
        model: unfitted Pipeline (token counts -> TF-IDF -> random forest)
    """
    model = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize, stop_words='english')),
        ('tfidf', TfidfTransformer()),
        ('clf', RandomForestClassifier())
    ])

    return model
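
As a rough illustration of what the first two pipeline steps compute, the sketch below builds TF-IDF weights in plain Python. It assumes raw term counts, the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, and L2 normalization, which to my understanding matches the default settings of `TfidfTransformer`. The function name `tfidf` and the two toy documents are hypothetical, and the stop-word removal and lemmatization done by the real pipeline are left out.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf, so that terms present in every document keep weight 1.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    weighted = []
    for doc in docs:
        counts = Counter(doc)
        vec = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize so every non-empty document vector has length 1.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        weighted.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return weighted


# Hypothetical, already tokenized tweets.
docs = [["forest", "fire"], ["fire", "party"]]
weights = tfidf(docs)
print(weights[0])
```

In the first toy document, "forest" receives a larger weight than "fire" because "fire" appears in both documents and is therefore less distinctive.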

In [14]:
def evaluate_model(model, X_test, y_test):
    """
    Test the pipeline.
    Prints precision, recall, and f1_score.

    Parameters:
        model: fitted pipeline
        X_test: array of tweet texts
        y_test: targets to compare the predictions against
    Returns:
        None
    """
    y_pred = model.predict(X_test)

    # average='micro' pools the counts of both classes, so for this
    # two-class problem all three scores equal plain accuracy
    precision = precision_score(y_test, y_pred, average='micro')
    recall = recall_score(y_test, y_pred, average='micro')
    f1_scores = f1_score(y_test, y_pred, average='micro')

    print("Precision:", precision)
    print("Recall:", recall)
    print('fi_score', f1_scores)
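
With `average='micro'` on a two-class, single-label problem, precision, recall, and F1 all reduce to accuracy, so the three values printed by this function are always identical. The pure-Python sketch below shows this on made-up labels and contrasts it with scores for the positive class only, which is what `average='binary'` reports in scikit-learn. The helpers `counts` and `prf` are hypothetical names introduced for the illustration.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def counts(y_true, y_pred, label):
    """True positives, false positives, and false negatives for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn


# Made-up labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

# Micro average: add up the counts of every label, then compute the scores.
per_label = [counts(y_true, y_pred, label) for label in (0, 1)]
micro = prf(*(sum(column) for column in zip(*per_label)))

# Binary average: counts of the positive class (1) only.
binary = prf(*counts(y_true, y_pred, 1))

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print("micro   :", micro)
print("binary  :", binary)
print("accuracy:", accuracy)
```

On these labels the micro scores all equal the accuracy, while the positive-class scores are noticeably lower, which is the gap that matters for spotting disaster tweets.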

In [15]:
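def improve_model(model):
    """
    Tune the random forest hyperparameters with a grid search.
    Fits on the notebook-level X_train and y_train.

    Parameters:
        model: pipeline returned by build_model
    Returns:
        cv: fitted GridSearchCV object
    """
    parameters = {
        'clf__min_samples_leaf': [5, 10],
        'clf__min_samples_split': [6, 8, 10],
        'clf__max_depth': [3, 5, 7],
    }

    cv = GridSearchCV(model, param_grid=parameters)
    cv.fit(X_train, y_train)
    return cv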


def display_results(cv, X_train, X_test, y_train, y_test):
    """
    Display results after improving the model and save the predictions.
    Prints precision, recall, f1_score, and the best parameters, then
    writes the predictions to result.csv.

    Parameters:
        cv, X_train, X_test, y_train, y_test
    Returns:
        y_pred
    """
    y_pred = cv.predict(X_test)

    precision = precision_score(y_test, y_pred, average='micro')
    recall = recall_score(y_test, y_pred, average='micro')
    f1_scores = f1_score(y_test, y_pred, average='micro')

    print("Precision:", precision)
    print("Recall:", recall)
    print('fi_score', f1_scores)
    print("\nBest Parameters:", cv.best_params_)

    # Write the submission file: one row per test tweet
    result_df = pd.DataFrame({'id': test_df.id, 'target': y_pred})
    result_df.to_csv('result.csv', index=False)

    return y_pred
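
To get a feel for how much work the grid search in `improve_model` does, this short sketch enumerates the same parameter grid with `itertools.product`. The fold count of 5 is an assumption based on the default cross-validation setting of `GridSearchCV` in recent scikit-learn versions, since the notebook does not pass `cv` explicitly.

```python
from itertools import product

# Same grid as in improve_model.
parameters = {
    'clf__min_samples_leaf': [5, 10],
    'clf__min_samples_split': [6, 8, 10],
    'clf__max_depth': [3, 5, 7],
}

# Every combination of values that an exhaustive grid search would try.
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

n_folds = 5  # assumption: default 5-fold cross-validation
n_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

Each fit retrains the whole pipeline, including tokenization, so adding one more value to any list increases the run time noticeably.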
    

In [16]:
def save_model(y_pred, filepath):
    """
    Save the predictions to a pickle file.

    Parameters:
        y_pred: array of predicted targets
        filepath: path of the pickle file to write
    Returns:
        None
    """
    with open(filepath, 'wb') as f:
        pickle.dump(y_pred, f)
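
Reading the saved predictions back is the mirror image of the function above. The sketch below does a full round trip in a temporary directory; the list `predictions` is a hypothetical stand-in for `y_pred`.

```python
import os
import pickle
import tempfile

predictions = [0, 1, 0, 0, 1]  # hypothetical stand-in for y_pred

with tempfile.TemporaryDirectory() as tmp_dir:
    filepath = os.path.join(tmp_dir, 'y_pred.pkl')

    # Write the object in binary mode, as save_model does.
    with open(filepath, 'wb') as f:
        pickle.dump(predictions, f)

    # Read it back, also in binary mode.
    with open(filepath, 'rb') as f:
        restored = pickle.load(f)

print(restored == predictions)
```

Pickle files can execute code when loaded, so only unpickle files that come from a trusted source.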

In [17]:
def main():
    # No command-line arguments are read, so the steps run unconditionally
    print('Building model...')
    model = build_model()

    print('Training model...')
    model.fit(X_train, y_train)

    print('Evaluating model...')
    evaluate_model(model, X_test, y_test)

    print('Improving model...')
    cv = improve_model(model)

    print('Display_results...')
    y_pred = display_results(cv, X_train, X_test, y_train, y_test)

    # Despite the message, this step saves the predictions, not the model
    print('Saving model...')
    save_model(y_pred, 'y_pred.pkl')

    print('y_pred saved!')

if __name__ == '__main__':
    main()

Building model...
Training model...
Evaluating model...
Precision: 0.7159056083358872
Recall: 0.7159056083358872
fi_score 0.7159056083358872
Improving model...
Display_results...
Precision: 0.9561752988047809
Recall: 0.9561752988047809
fi_score 0.9561752988047809

Best Parameters: {'clf__max_depth': 7, 'clf__min_samples_leaf': 5, 'clf__min_samples_split': 6}
Saving model...
y_pred saved!


In [18]:
result_df = pd.read_csv('result.csv')
result_df.head()

| | id | target |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | 0 |
| 2 | 3 | 0 |
| 3 | 9 | 0 |
| 4 | 11 | 0 |


In [19]:
result_df.target.mean()

0.043824701195219126

<a id='conclude'></a>
## Conclusion

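I used the text and target columns to build my model. The initial random forest pipeline reached a micro-averaged f1_score of about 72%. After a grid search over max_depth, min_samples_leaf, and min_samples_split, the tuned model reached about 96%. Both scores were computed against the target column of sample_submission.csv rather than true labels for the test set, and micro-averaged F1 on a two-class problem equals accuracy, so these numbers measure agreement with that file and not real predictive performance. The tuned model flags only about 4% of the test tweets as disasters, compared with about 43% in the training data.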