# Real or Not
This notebook works on the problem posed in https://www.kaggle.com/c/nlp-getting-started/ (deciding whether a tweet is about a real disaster or not).

It is an introductory project on NLP and ML.


In [15]:
import numpy as np
import pandas as pd


In [20]:
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [33]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
submission = pd.read_csv('sample_submission.csv')

First, an example of a tweet that is NOT about a disaster:

In [18]:
train_df[train_df['target'] == 0]['text'].values[1]

'I love fruits'

Now, one that is about a disaster:

In [19]:
train_df[train_df['target'] == 1]['text'].values[1]

'Forest fire near La Ronge Sask. Canada'

#### Building Vectors
We use scikit-learn's `CountVectorizer` to count the words in each tweet and turn those counts into vectors our ML model can process.

In [21]:
count_vectorizer = feature_extraction.text.CountVectorizer()

# get the token counts for the first 5 tweets in the training data
example_train_vectors = count_vectorizer.fit_transform(train_df['text'][0:5])

In [29]:
# we use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print(example_train_vectors[0].shape)
print(example_train_vectors[0].todense())

(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]


The above tells us that:

1. There are 54 unique words (or "tokens") in the first five tweets.
2. The first tweet contains only some of those unique tokens: every non-zero count above marks a token that DOES appear in the first tweet.
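To make the mapping between tokens and columns concrete, here is a minimal sketch on three made-up tweets (the `toy_*` names and the texts are illustrative, not taken from `train.csv`). With its default settings, `CountVectorizer` lowercases the text and ignores single-character tokens, so "I" does not get a column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy tweets, not taken from train.csv
toy_tweets = [
    "Forest fire near the town",
    "I love fruits",
    "fire fire everywhere",
]

toy_vectorizer = CountVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_tweets)

# vocabulary_ maps each token to its column index; list tokens in column order
toy_tokens = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_tokens)
# ['everywhere', 'fire', 'forest', 'fruits', 'love', 'near', 'the', 'town']

# One row per tweet, one column per token; "fire" is counted twice in the last tweet
print(toy_vectors.toarray())
# [[0 1 1 0 0 1 1 1]
#  [0 0 0 1 1 0 0 0]
#  [1 2 0 0 0 0 0 0]]
```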

Now let's create vectors for all of our tweets.

In [30]:
train_vectors = count_vectorizer.fit_transform(train_df['text'])
# only transform (no refit) the test tweets, so train and test vectors share the same tokens
test_vectors = count_vectorizer.transform(test_df['text'])
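The difference between `fit_transform` and `transform` matters: the vocabulary is learned only during fitting, and `transform` reuses it. The sketch below, on made-up texts (`demo_train` and `demo_test` are hypothetical), shows that tokens never seen during fitting are simply dropped and that the column layout stays the same.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the train and test tweets
demo_train = ["fire in the forest", "love the summer"]
demo_test = ["earthquake in the city"]

demo_vectorizer = CountVectorizer()
demo_train_vectors = demo_vectorizer.fit_transform(demo_train)  # learns the vocabulary
demo_test_vectors = demo_vectorizer.transform(demo_test)        # reuses it unchanged

# Columns are: fire, forest, in, love, summer, the
print(demo_train_vectors.shape, demo_test_vectors.shape)
# (2, 6) (1, 6)

# "earthquake" and "city" were never seen while fitting, so only "in" and "the" are counted
print(demo_test_vectors.toarray())
# [[0 0 1 0 0 1]]
```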

#### Our Model
Our working assumption is that the words contained in each tweet are a good indicator of whether it is about a real disaster or not. The presence of a particular word (or set of words) in a tweet might link directly to whether or not that tweet is about a real disaster.

What we're assuming here is a linear connection. So let's build a linear model and see!
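To illustrate what a "linear connection" means here, the sketch below fits a `RidgeClassifier` on four made-up tweets (texts and labels are hypothetical). The model learns one weight per token; a tweet's score is the weighted sum of its token counts plus an intercept, and the sign of the score decides the class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier

# Hypothetical toy data: 1 = about a disaster, 0 = not
sketch_texts = [
    "forest fire spreading fast",
    "fire destroys homes",
    "love this sunny day",
    "love my new shoes",
]
sketch_labels = [1, 1, 0, 0]

sketch_vectorizer = CountVectorizer()
sketch_vectors = sketch_vectorizer.fit_transform(sketch_texts)

sketch_clf = RidgeClassifier()
sketch_clf.fit(sketch_vectors, sketch_labels)

# coef_ has shape (1, n_tokens) for a binary problem: one weight per token
weights = sketch_clf.coef_[0]
fire_weight = weights[sketch_vectorizer.vocabulary_["fire"]]
love_weight = weights[sketch_vectorizer.vocabulary_["love"]]

# "fire" appears only in disaster tweets and "love" only in the others,
# so their weights push the score in opposite directions
print(fire_weight > 0, love_weight < 0)
```

On the real data the weights are far less clear-cut, since most words appear in both kinds of tweets.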



In [31]:
clf = linear_model.RidgeClassifier()

In [32]:
# 3-fold cross-validation on the training data, scored with F1
scores = model_selection.cross_val_score(clf, train_vectors, train_df['target'], cv=3, scoring='f1')
scores

array([0.59453669, 0.56455572, 0.64051005])
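Each value is the F1 score on one of the three folds; they average roughly 0.60. F1 is the harmonic mean of precision and recall. As a rough illustration, the sketch below computes it by hand on made-up labels (`y_true` and `y_pred` are hypothetical) and compares the result with `sklearn.metrics.f1_score`.

```python
from sklearn.metrics import f1_score

# Hypothetical true labels and predictions, for illustration only
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # disasters correctly flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed disasters

precision = tp / (tp + fp)
recall = tp / (tp + fn)
manual_f1 = 2 * precision * recall / (precision + recall)

# Both values are 2/3 for this toy example
print(manual_f1, f1_score(y_true, y_pred))
```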