## NLP Project
Building a machine learning model that predicts which tweets are about real disasters and which ones aren't. We have access to a dataset of 10,000 tweets that were hand-classified.

In [1]:
# Importing necessary libraries
import numpy as np
import pandas as pd 
from sklearn.metrics import f1_score
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

In [2]:
#Get the data
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

In [3]:
x = train_df.drop(["target"], axis = 1)
y = train_df["target"]

In [4]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
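To make the split sizes concrete, here is a minimal sketch on placeholder data with the same number of rows as `train_df` (7613). The arrays `X_demo` and `y_demo` are stand-ins, not the real tweets, and `stratify=` is an optional extra that the notebook does not use; it keeps the class ratio the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 7613  # number of rows in train_df
X_demo = np.arange(n_rows).reshape(-1, 1)  # placeholder features
y_demo = np.arange(n_rows) % 2             # placeholder binary labels

# Same 80/20 split as above; stratify is an assumption added for illustration
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

With `test_size=0.2`, scikit-learn rounds the held-out set up, so the two parts always add back up to the full row count.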

In [5]:
x_train.head()

Unnamed: 0,id,keyword,location,text
4996,7128,military,Texas,Courageous and honest analysis of need to use ...
3263,4688,engulfed,,@ZachZaidman @670TheScore wld b a shame if tha...
4907,6984,massacre,Cottonwood Arizona,Tell @BarackObama to rescind medals of 'honor'...
2855,4103,drought,"Spokane, WA",Worried about how the CA drought might affect ...
4716,6706,lava,"Medan,Indonesia",@YoungHeroesID Lava Blast &amp; Power Red #Pan...


In [6]:
train_df.shape

(7613, 5)

In [7]:
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [8]:
#Preprocessing of the data
vectorizer = feature_extraction.text.CountVectorizer()

In [9]:
#Get the count of words using CountVectorizer
x_train_vectors = vectorizer.fit_transform(x_train['text'])
x_test_vectors = vectorizer.transform(x_test['text'])
test_vector = vectorizer.transform(test_df['text'])
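As a small illustration of what `fit_transform` and `transform` do here, the sketch below runs `CountVectorizer` on a few made-up sentences (they are not rows from the dataset). The vocabulary is learned only from the text passed to `fit_transform`; words that appear later in `transform` but were never seen during fitting are silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example sentences, for illustration only
train_texts = ["forest fire near the town", "fire drill at school today"]
new_texts = ["earthquake near the school"]

demo_vectorizer = CountVectorizer()
train_counts = demo_vectorizer.fit_transform(train_texts)  # learns the vocabulary
new_counts = demo_vectorizer.transform(new_texts)          # reuses that vocabulary

print(sorted(demo_vectorizer.vocabulary_))
print(train_counts.shape, new_counts.shape)
print(new_counts.toarray())
```

This is why the held-out split and `test_df` are passed through `transform` rather than `fit_transform`: all three matrices must share the same columns.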

In [10]:
# Train Models
models = {"Logistic regression" : LogisticRegression(max_iter=1000),
         "Random forest Classifier" : RandomForestClassifier(),
         "RidgeClassifier" : linear_model.RidgeClassifier(),
         " SVM " : svm.SVC()}

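# Create a function to fit and score models
def fit_and_score(models, x_train, y_train, x_test, y_test):
    """Return the mean 5-fold cross-validated F1 score per model (x_test and y_test are unused)."""
    np.random.seed(42)
    model_scores = {}
    for name, model in models.items():
        # Fit on the data passed in rather than on the global x_train_vectors
        model.fit(x_train, y_train)
        # cross_val_score clones the model and refits it on each fold
        model_scores[name] = model_selection.cross_val_score(
            model, x_train, y_train, cv=5, scoring="f1").mean()
    return model_scores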

In [11]:
model_scores = fit_and_score(models, x_train_vectors, y_train, x_test_vectors, y_test)

In [12]:
model_scores

{'Logistic regression': 0.7442418912732831,
 'Random forest Classifier': 0.6913749243884325,
 'RidgeClassifier': 0.7168373449442356,
 ' SVM ': 0.7185106393226827}

We received the best cross-validated F1 score with logistic regression, so let's tune the hyperparameters of LogisticRegression to find the best parameters.
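The comparison above cross-validates on vectors whose vocabulary was built from the whole training split, so each validation fold contributes words to the vocabulary before it is scored. One possible alternative, sketched below on made-up sentences, is to wrap the vectorizer and the model in a scikit-learn `Pipeline` so that the vocabulary is refit inside every fold. The texts, labels and `cv=3` are assumptions for the sake of a small runnable example, not values from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up texts and labels (1 = disaster, 0 = not), for illustration only
texts = [
    "forest fire spreading near the town", "earthquake damages several homes",
    "flood waters rising in the city", "wildfire forces evacuation tonight",
    "storm destroys homes near the coast", "fire crews battle the wildfire",
    "this new song is fire", "my exam was a disaster lol",
    "flooded with emails this morning", "that concert was the bomb",
    "storm of applause at the show", "traffic was a nightmare today",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The vectorizer is fit only on the training part of each fold
pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
fold_scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="f1")
mean_f1 = fold_scores.mean()
print(fold_scores, mean_f1)
```

On the real data this would mean passing `x_train['text']` to the pipeline instead of `x_train_vectors`; the scores may shift slightly compared with the ones reported above.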

## Hyperparameter tuning with RandomizedSearchCV

In [13]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [14]:
# Create a hyperparameter grid for LogisticRegression
log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}

In [15]:
# Tune LogisticRegression

np.random.seed(42)

# Setup random hyperparameter search for LogisticRegression
rs_log_reg = RandomizedSearchCV(LogisticRegression(),
                                param_distributions=log_reg_grid,
                                cv=5,
                                n_iter=20,
                                verbose=True)

# Fit random hyperparameter search model for LogisticRegression
rs_log_reg.fit(x_train_vectors, y_train)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
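The grid holds 20 values of `C` and a single solver, so there are only 20 distinct combinations, and `n_iter=20` asks for 20 samples. When every parameter is given as a list or array, scikit-learn samples without replacement, which means this randomized search should end up trying every grid point once. The sketch below checks this with `ParameterSampler` and `ParameterGrid`; `random_state=42` is an arbitrary choice for reproducibility.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}

# Every combination in the grid
grid_candidates = list(ParameterGrid(log_reg_grid))

# What a randomized search with n_iter=20 would draw from the same grid
sampled = list(ParameterSampler(log_reg_grid, n_iter=20, random_state=42))

grid_C = sorted(p["C"] for p in grid_candidates)
sampled_C = sorted(p["C"] for p in sampled)
covers_grid = np.allclose(grid_C, sampled_C)
print(len(grid_candidates), len(sampled), covers_grid)
```

A randomized search only saves work over a grid search when `n_iter` is smaller than the number of combinations, or when a parameter is drawn from a continuous distribution.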


In [16]:
rs_log_reg.best_params_

{'solver': 'liblinear', 'C': 0.23357214690901212}
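To see where this value sits in the search space, the sketch below rebuilds the same `np.logspace(-4, 4, 20)` grid and locates the reported best `C` in it. The values are spaced evenly on a log scale, so each candidate is a constant factor larger than the previous one.

```python
import numpy as np

c_grid = np.logspace(-4, 4, 20)  # 20 values from 1e-4 to 1e4
best_c = 0.23357214690901212     # value reported by best_params_ above

# Position of the best value within the grid
best_index = int(np.argmin(np.abs(c_grid - best_c)))
# Constant multiplicative step between neighbouring candidates
step_ratio = c_grid[1] / c_grid[0]

print(c_grid[0], c_grid[-1])
print(best_index, c_grid[best_index])
print(step_ratio)
```

Because the best value lies in the interior of the grid rather than at either end, the range from 1e-4 to 1e4 appears wide enough for this model.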

In [17]:
rs_log_reg.score(x_test_vectors, y_test)

0.8168089297439265
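Note that `score` on a search object uses the estimator's default scorer when no `scoring` argument is given, which for a classifier is accuracy, whereas the model comparison earlier used F1. The two metrics can differ noticeably, as the made-up labels below show; `f1_score` is already imported at the top of the notebook. The arrays `y_true` and `y_pred` are illustrative only.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 3 positives, of which the model finds only 1
y_true = [1, 0, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)  # 6 of 8 correct
f1 = f1_score(y_true, y_pred)              # precision 1.0, recall 1/3

print(accuracy, f1)
```

To report F1 on the held-out split, one option is `f1_score(y_test, rs_log_reg.predict(x_test_vectors))`; another is to pass `scoring="f1"` when constructing the search so that tuning and evaluation use the same metric.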

## Hyperparameter tuning with GridSearchCV
Since the LogisticRegression model provides the best scores so far, we'll try to improve them further using GridSearchCV.

In [18]:
# Different hyperparameters for our LogisticRegression model
log_reg_grid = {"C": np.logspace(-4, 4, 30),
                "solver": ["liblinear"]}

# Setup grid hyperparameter search for LogisticRegression
gs_log_reg = GridSearchCV(LogisticRegression(),
                          param_grid=log_reg_grid,
                          cv=5,
                          verbose=True)

# Fit grid hyperparameter search model
gs_log_reg.fit(x_train_vectors, y_train);

Fitting 5 folds for each of 30 candidates, totalling 150 fits


In [19]:
# Check the best hyperparameters
gs_log_reg.best_params_

{'C': 0.38566204211634725, 'solver': 'liblinear'}

In [20]:
# Evaluate the grid search LogisticRegression model
gs_log_reg.score(x_test_vectors, y_test)

0.8154957321076822

The accuracy after tuning with RandomizedSearchCV (0.8168) and GridSearchCV (0.8155) is nearly identical, hence we go with RandomizedSearchCV.

In [21]:
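# Load the submission template and fill in the predicted labels
submission_file = pd.read_csv("submission.csv")
# Note: these predictions come from the GridSearchCV model (gs_log_reg);
# swap in rs_log_reg to use the RandomizedSearchCV model instead
submission_file['target'] = gs_log_reg.predict(test_vector)
submission_file.head()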

Unnamed: 0.1,Unnamed: 0,id,target
0,0,0,1
1,1,2,1
2,2,3,1
3,3,9,0
4,4,11,1


In [25]:
# index=False keeps the row index from being written as an extra unnamed column
submission_file.to_csv('submission.csv', index=False)
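The `Unnamed: 0` columns in the output of the previous cell are what pandas produces when a CSV that was saved with its row index is read back in. The sketch below reproduces this with an in-memory buffer, so no files are touched; the small `demo_submission` frame is made up for illustration.

```python
import io
import pandas as pd

# Made-up frame with the two columns a submission needs
demo_submission = pd.DataFrame({"id": [0, 2, 3], "target": [1, 1, 0]})

# Default behaviour: the row index is written as an extra, unnamed column
with_index = io.StringIO()
demo_submission.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False writes only the real columns
without_index = io.StringIO()
demo_submission.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

Each save and reload cycle without `index=False` adds one more such column, which is consistent with the two unnamed columns shown above.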