## Chapter 1 NLP Tutorial


In our first NLP tutorial we'll look at a competition dataset from www.kaggle.com. We will use a simple technique to convert the text data into a numeric format that machine learning algorithms can process. We will then apply the Random Forest algorithm to build a machine learning model. Finally, we will make predictions for the target variable and evaluate how well our model performs. This book expects you to have some prior knowledge of Python and of coding machine learning and deep learning (neural network) models.

We will jump straight into writing our code and try to understand it as we write it.


In [4]:
import numpy as np # Linear algebra.
import pandas as pd # Data processing, text file I/O. 

# scikit-learn modules used for building and evaluating the machine learning model.
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [35]:
# Read training data file, kept in the same directory as the code file.
train_df = pd.read_csv("train.csv")
# Read testing data file, kept in the same directory as the code file.
test_df = pd.read_csv("test.csv") 

In [92]:
train_df.columns

Index(['id', 'keyword', 'location', 'text', 'target'], dtype='object')

In [106]:
# Display three random rows of the train dataset.
train_df[["text","target"]].sample(n=3, random_state=0) 

Unnamed: 0,text,target
311,@KatieKatCubs you already know how this shit g...,0
4970,@LeMaireLee @danharmon People Near Meltdown Co...,0
527,1-6 TIX Calgary Flames vs COL Avalanche Presea...,0


- The sample above shows that the train dataset consists of tweets.

In [108]:
# Display three random rows of the test dataset.
test_df[["text"]].sample(n=3, random_state=0) 

Unnamed: 0,text
2464,Matt Baume Digs Into the the Controversial Û÷...
1515,Dutch crane collapses demolishes houses: Drama...
2756,More details: Bomber kills at least 13 at #Sau...


### A quick look at our data

- Let's look at our data... 

In [38]:
# A quick look at our data. First, a non-disaster tweet.
train_df[train_df["target"] == 0]["text"].values[3] 

'My car is so fast'

In [40]:
# And one that represents a disaster tweet.
train_df[train_df["target"] == 1]["text"].values[1]

'Forest fire near La Ronge Sask. Canada'

### Building vectors

The words contained in each tweet are a good indicator of whether the tweet is about a real disaster or not. This is not entirely correct, but we will use it as the starting point in our first NLP tutorial. Below we use scikit-learn's CountVectorizer to count the words in each tweet and turn the counts into a data format that our machine learning model can understand. A vector, in this context, is a set of numbers that a machine learning algorithm can work with. The related code, its usage, and its output are given below.

- How CountVectorizer works: https://www.educative.io/answers/countvectorizer-in-python


### An example.

In [7]:
demo_text = ["Stella is a good girl. She loves to swim"] # Demo sentence

In [8]:
count_vectorizer = feature_extraction.text.CountVectorizer() # Instantiate CountVectorizer()
count_vectorizer.fit(demo_text) # Fit the demo text
print(count_vectorizer.vocabulary_) # Print results

{'stella': 5, 'is': 2, 'good': 1, 'girl': 0, 'she': 4, 'loves': 3, 'to': 7, 'swim': 6}


In [9]:
# encode document
demo_vector = count_vectorizer.transform(demo_text)
print(demo_vector.shape)
print(demo_vector.toarray())

(1, 8)
[[1 1 1 1 1 1 1 1]]


- The vector representing demo_text, [[1 1 1 1 1 1 1 1]], has eight elements (positions 0 to 7) because the vocabulary contains eight distinct words. The single-letter word "a" is not counted, since CountVectorizer by default keeps only tokens of two or more characters.
- All the entries in the vector are 1 because no word in demo_text is repeated (the frequency of every word is 1).

### Let's try another example

In [53]:
demo_text2 = ["The sky is blue. I wish to fly in the blue sky"] # Demo sentence 2
count_vectorizer = feature_extraction.text.CountVectorizer() # Instantiate CountVectorizer()
count_vectorizer.fit(demo_text2) # Fit the demo text
print(count_vectorizer.vocabulary_) # Print results

{'the': 5, 'sky': 4, 'is': 3, 'blue': 0, 'wish': 7, 'to': 6, 'fly': 1, 'in': 2}


In [54]:
# encode document
demo_vector = count_vectorizer.transform(demo_text2)
print(demo_vector.shape)
print(demo_vector.toarray())

(1, 8)
[[2 1 1 1 2 2 1 1]]


- We are simply counting how often each word occurs and putting the counts in the vector.
- The words "the", "sky", and "blue" each have a frequency of 2 in demo_text2.
- So, these three words are each represented by 2 in the demo vector [[2 1 1 1 2 2 1 1]].
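
The counting itself can be reproduced with the standard library alone. The sketch below is an approximation of CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, sorted vocabulary), not the library's actual implementation; the names `tokens`, `vocabulary`, and `vector` are ours.

```python
import re
from collections import Counter

demo_text2 = "The sky is blue. I wish to fly in the blue sky"

# Lowercase, then keep tokens of two or more word characters.
tokens = re.findall(r"\b\w\w+\b", demo_text2.lower())

# The sorted distinct tokens define the column order.
vocabulary = sorted(set(tokens))
token_counts = Counter(tokens)
vector = [token_counts[word] for word in vocabulary]

print(vocabulary)
print(vector)
```

The resulting list matches the vector printed above, which confirms that "The" and "the" are counted as the same word and that the single-letter "I" is dropped.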

### Let's get counts for the first 5 tweets in the data

In [10]:
# Let's get counts for the first 5 tweets in the data.
demo_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])

In [69]:
type(demo_train_vectors) # Checking data type.

scipy.sparse._csr.csr_matrix

In [70]:
demo_train_vectors.shape # Checking shape.

(5, 54)

In [73]:
# we use .todense() here because these vectors are "sparse" 
# (only non-zero elements are kept to save space)
print(demo_train_vectors[0].todense().shape)
print(demo_train_vectors[0].todense())

(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]


The above code and result show us the following.
- There are 54 distinct words (called "tokens") in the first five tweets.
- The first tweet (like every other tweet) contains only some of those 54 tokens.
- The vector contains 54 elements because there are 54 distinct tokens.
- Each non-zero count in the vector above marks a token that occurs in the first tweet.

Now let's create vectors for all of our tweets, not just the first five. We will do it for both the train and the test tweets.

In [77]:
# We use count_vectorizer.fit_transform() for train_df
# and only count_vectorizer.transform() for test_df.
# This is standard practice in machine learning: the vocabulary is learned
# from the training data alone, so train and test vectors share the same columns.
train_vectors = count_vectorizer.fit_transform(train_df["text"])
test_vectors = count_vectorizer.transform(test_df["text"])
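
A small sketch of what `fit_transform()` followed by `transform()` does in practice: the vocabulary comes from the training text only, the test vectors get the same columns, and words never seen during training are ignored. The sentences and variable names below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["forest fire near town", "my car is fast"]
test_texts = ["fire truck is fast"]

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)  # learns the vocabulary
test_matrix = vectorizer.transform(test_texts)        # reuses that vocabulary

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)  # same number of columns
print(test_matrix.toarray())

# "truck" never appeared in the training text, so it has no column.
print("truck" in vectorizer.vocabulary_)
```

Calling `fit_transform()` on the test text instead would build a different vocabulary, and the model trained on the train columns could no longer interpret the test vectors.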

### Let's build a simple machine learning model to predict the "target" variable.

In [80]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [83]:
# Initialize the Random Forest model
clf = RandomForestClassifier(n_estimators=100, random_state=42)

- Below we will train the model and evaluate it.
- The evaluation metric we use for this purpose is the F1 score.
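
For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it by hand for a small set of made-up labels; the function name `f1_from_labels` is ours and is not part of scikit-learn.

```python
def f1_from_labels(y_true, y_pred):
    """Compute F1 for the positive class (label 1) from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no true positives: treat F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels, used only for illustration.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
f1 = f1_from_labels(y_true, y_pred)
print(round(f1, 3))
```

Because F1 combines precision and recall, it penalizes both false alarms and missed disaster tweets.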

In [85]:
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.52986023, 0.49075216, 0.62749446])

- The above F1 scores aren't horrible! You can get better results with other NLP techniques such as TF-IDF, LSA, LSTMs / RNNs, and many others. We will play with them in the upcoming chapters.
- For now, let's make the much-needed predictions on the test data.

In [86]:
clf.fit(train_vectors, train_df["target"])

In [87]:
# Make predictions
y_pred = clf.predict(test_vectors)

In [109]:
y_pred.shape

(3263,)

In [88]:
test_df.head(2)

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."


In [89]:
test_df["target_predict"] = y_pred

In [104]:
# Display three random rows of the test dataset with their predictions.
test_df[["text","target_predict"]].sample(n=3, random_state=0) 

Unnamed: 0,text,target_predict
2464,Matt Baume Digs Into the the Controversial Û÷...,0
1515,Dutch crane collapses demolishes houses: Drama...,1
2756,More details: Bomber kills at least 13 at #Sau...,1


In [110]:
test_df["target_predict"].shape

(3263,)

In [111]:
# Read the test target variable, given in a separate file.
# Load the CSV file into a DataFrame.
target_df = pd.read_csv('test_target_values.csv')

In [112]:
# Add the target variable to test_df.
test_df["target"] = target_df["target"]

In [113]:
test_df.columns

Index(['id', 'keyword', 'location', 'text', 'target_predict', 'target'], dtype='object')

In [114]:
# Compare original target and predicted target values in the test data.
print(test_df[['target','target_predict']].head())

   target  target_predict
0       0               0
1       0               1
2       0               1
3       0               0
4       0               1
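
As a rough check of the comparison above, the share of matching rows can be computed directly. The sketch below hard-codes the five rows just printed. Five rows are far too few to judge the model, and on the full data one would compare `test_df["target"]` with `test_df["target_predict"]` instead.

```python
# The five rows printed above, copied by hand.
target = [0, 0, 0, 0, 0]
target_predict = [0, 1, 1, 0, 1]

# Count the rows where the prediction equals the original target.
matches = sum(1 for t, p in zip(target, target_predict) if t == p)
accuracy = matches / len(target)
print(f"{matches} of {len(target)} rows match, accuracy = {accuracy:.2f}")
```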


- We are finishing the solution here, though the predictions do not seem to be very accurate.
- For real-world analysis, the F1 score should be calculated on the test data. We are skipping this step.

- This completes our job for now. Remember, this was an oversimplified example, created only for demonstration.