# Natural Language Processing with Disaster Tweets

The idea is to use a TF-IDF vectorizer and a Ridge classifier to classify the text of each tweet as disaster-related or not.

In [1]:
import numpy as np
import pandas as pd 
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [2]:
trainPath = '../input/nlp-getting-started/train.csv'
testPath = '../input/nlp-getting-started/test.csv'

train_df = pd.read_csv(trainPath)
test_df  = pd.read_csv(testPath)

In [3]:
train_df.head()

,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


# TF-IDF Vectorizer

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training tweets only,
# then reuse them to transform the test tweets.
train_vectors = vectorizer.fit_transform(train_df["text"])
test_vectors = vectorizer.transform(test_df["text"])
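To make the fit/transform split concrete, here is a minimal sketch on a toy corpus. The sentences and the `toy_` names are made up for illustration and are not part of the competition data. It shows that the vocabulary is learned from the training texts only, so a word that appears only in the test text is simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration (not from the competition data)
toy_train = [
    "forest fire near the town",
    "what a lovely sunny day",
    "fire crews evacuate the town",
]
toy_test = ["wildfire near the town"]

toy_vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and IDF weights, then vectorizes
toy_train_vectors = toy_vectorizer.fit_transform(toy_train)

# transform reuses the learned vocabulary; unseen words get no column
toy_test_vectors = toy_vectorizer.transform(toy_test)

print(toy_train_vectors.shape)  # (number of texts, vocabulary size)
print(toy_test_vectors.shape)   # same number of columns as the train matrix
print("wildfire" in toy_vectorizer.vocabulary_)
```

Both matrices are sparse and share the same columns, which is what lets a model trained on one be applied to the other.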

Let's see how well a model scores on these vectors. The metric for this competition is F1, so we use it for cross-validation.
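As a quick reminder of what F1 measures, here is a small sketch that computes it by hand on toy labels and compares the result with `sklearn.metrics.f1_score`. The labels are made up for illustration.

```python
from sklearn.metrics import f1_score

# Toy labels, invented for illustration: 1 = disaster, 0 = not a disaster
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

# Count true positives, false positives and false negatives
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall
manual_f1 = 2 * precision * recall / (precision + recall)

print(manual_f1, f1_score(y_true, y_pred))
```

Because F1 combines precision and recall, a model cannot score well by predicting only one class.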

# Model

In [5]:
clf = linear_model.RidgeClassifier()

scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.63366337, 0.6122449 , 0.68442211])
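The three fold scores are easier to read as a mean and a spread. A small sketch, reusing the values printed above (the variable names are illustrative):

```python
import numpy as np

# Fold scores copied from the cross-validation output above
fold_scores = np.array([0.63366337, 0.6122449, 0.68442211])

mean_f1 = fold_scores.mean()
std_f1 = fold_scores.std()

print(f"F1: {mean_f1:.3f} +/- {std_f1:.3f}")
```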

The mean F1 over the three folds is about 0.64, which is not bad for a baseline. Let's fit the model on the full training set.

In [6]:
clf.fit(train_vectors, train_df["target"])

RidgeClassifier()
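One side note on the cross-validation step: the vectorizer was fitted on the whole training set before the folds were split, so each validation fold contributed to the IDF weights. No labels are involved, so the effect is probably mild, but wrapping both steps in a pipeline refits the vectorizer inside every fold. The sketch below is an alternative setup rather than what this notebook does, and it runs on a toy corpus invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration: 1 = disaster, 0 = not a disaster
toy_texts = [
    "forest fire spreads near the town",
    "earthquake damages homes in the city",
    "flood waters force residents to evacuate",
    "wildfire evacuation ordered for the valley",
    "storm destroys houses along the coast",
    "fire crews battle the blaze overnight",
    "what a lovely sunny day at the beach",
    "my new phone is on fire with notifications",
    "this concert was a total blast",
    "traffic was a disaster but dinner was great",
    "watching a movie with friends tonight",
    "the cake I baked turned out great",
]
toy_labels = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# The vectorizer is refitted on the training part of every fold
toy_pipeline = make_pipeline(TfidfVectorizer(), RidgeClassifier())

toy_cv_scores = cross_val_score(
    toy_pipeline, toy_texts, toy_labels, cv=3, scoring="f1"
)
print(toy_cv_scores)
```

With real data, the same pipeline object can be passed raw text such as `train_df["text"]` in place of precomputed vectors.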

# Submission

In [7]:
sample_submission = pd.read_csv("../input/nlp-getting-started/sample_submission.csv")

In [8]:
sample_submission["target"] = clf.predict(test_vectors)

In [9]:
sample_submission.head()

,id,target
0,0,1
1,2,1
2,3,1
3,9,0
4,11,1


In [10]:
sample_submission.to_csv("submission.csv", index=False)
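Before uploading, it can be worth checking that the submission has the expected shape. The helper below, `check_submission`, is a hypothetical name introduced here for illustration. The toy frame reuses the first five rows shown above.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, test_ids: pd.Series) -> None:
    """Hypothetical helper: raise AssertionError if the frame looks malformed."""
    # Exactly the two columns shown in the sample submission
    assert list(submission.columns) == ["id", "target"]
    # One row per test tweet, in the same order
    assert submission["id"].tolist() == test_ids.tolist()
    # Predictions must be binary class labels
    assert submission["target"].isin([0, 1]).all()


# First five rows of the submission shown above
toy_submission = pd.DataFrame(
    {"id": [0, 2, 3, 9, 11], "target": [1, 1, 1, 0, 1]}
)
toy_test_ids = pd.Series([0, 2, 3, 9, 11])

check_submission(toy_submission, toy_test_ids)
print("submission looks fine")
```

In the notebook itself, the equivalent call would be `check_submission(sample_submission, test_df["id"])`.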