<div align='center'><font size="6">Natural Language Processing</font></div>

<hr>


## Dataset
The dataset contains a set of tweets divided into a training set and a test set. The training set has a target column (`label`) identifying whether the tweet pertains to violence or not.

The job is to build an ML model that predicts whether each test-set tweet is about violence or not, in the form of 1 or 0. This is a classic binary classification problem.



# Table of Contents
- [1. Importing the necessary libraries](#imports)
- [2. Reading the datasets](#reading)
- [3. Basic EDA](#eda)
- [4. Text data processing](#processing)
- [5. Transforming tokens to vectors](#vectorization)
- [6. Building a Text Classification model](#model)

# <a name="imports"></a>1. Importing the necessary libraries

In [1]:
import numpy as np 
import pandas as pd 

# text processing libraries
import re
import string
import nltk
from nltk.corpus import stopwords

# sklearn 
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn import preprocessing, decomposition, metrics
from sklearn.metrics import f1_score,classification_report,confusion_matrix
from sklearn.model_selection import train_test_split



# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# File system management
import os
# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')

# <a name="reading"></a> 2. Reading the datasets

In [18]:
#Training data
train = pd.read_csv('train_prj.csv')
print('Training data shape: ', train.shape)
train.head()

Training data shape:  (31962, 3)


Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [19]:
# Testing data 
test = pd.read_csv('test_prj.csv')
print('Testing data shape: ', test.shape)
test.head()

Testing data shape:  (17197, 2)


Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...
2,31965,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew..."


# <a name="eda"></a> 3. Basic Exploratory Data Analysis (EDA)

## Missing values

In [4]:
#Missing values in training set
train.isnull().sum()

id       0
label    0
tweet    0
dtype: int64

In [5]:
#Missing values in test set
test.isnull().sum()

id       0
tweet    0
dtype: int64

## Exploring the Label Column

* **Distribution of the label column**

We have to predict whether a given tweet is about violence or not: if so, predict 1; if not, predict 0.

In [6]:
train['label'].value_counts()

label
0    29720
1     2242
Name: count, dtype: int64

In [None]:
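# x and y are passed as keywords; recent seaborn versions no longer accept them positionally
label_counts = train['label'].value_counts()
sns.barplot(x=label_counts.index, y=label_counts.values, palette='rocket')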

**Exploring the Target Column**

Let's look at what the violence and the non-violence tweets look like.

In [7]:
# A Sexist/Violence tweet
violence_tweets = train[train['label']==1]['tweet']
violence_tweets.values[1]

'no comment!  in #australia   #opkillingbay #seashepherd #helpcovedolphins #thecove  #helpcovedolphins'

In [8]:
#not a Sexist/Violence tweet
non_violence_tweets = train[train['label']==0]['tweet']
non_violence_tweets.values[1]

"@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked"

Let's see how often the word 'kill' occurs in the dataset and whether it helps us determine whether a tweet belongs to the violence category or not.

In [9]:
train.loc[train['tweet'].str.contains('kill', na=False, case=False)].label.value_counts()


label
0    219
1     27
Name: count, dtype: int64
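
To judge whether the keyword is informative, it helps to compare the share of label-1 tweets among those containing 'kill' with the share in the whole training set. The sketch below does only that arithmetic, using the counts printed above; the variable and function names are illustrative.

```python
# Counts copied from the outputs above
kill_counts = {0: 219, 1: 27}      # tweets containing 'kill', by label
all_counts = {0: 29720, 1: 2242}   # whole training set, by label

def positive_rate(counts):
    """Share of label-1 tweets in a {label: count} mapping."""
    return counts[1] / (counts[0] + counts[1])

kill_rate = positive_rate(kill_counts)
base_rate = positive_rate(all_counts)

print(f"label-1 rate among 'kill' tweets: {kill_rate:.3f}")
print(f"label-1 rate overall:             {base_rate:.3f}")
print(f"lift: {kill_rate / base_rate:.2f}x")
```

The keyword raises the label-1 rate from roughly 7% to roughly 11%, so on its own it is only a weak signal.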

# <a name="processing"></a> 4. Text Data Preprocessing

## 1. Data Cleaning

Before we start with any NLP project, we need to pre-process the data to get it all into a consistent format. We need to clean, tokenize and convert our data into a matrix. Some of the basic text pre-processing techniques include:

* Make text all **lower case** or **uppercase** 
* **Removing Noise** 
* **Tokenization**
* **Stopword Removal**

### More data cleaning steps after tokenization:

* **Stemming**
* **Lemmatization**
* Parts of speech tagging
* Create bi-grams or tri-grams
And more...
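
Of the later steps listed above, n-gram creation is the easiest to show without extra libraries. This is a minimal sketch, assuming the tokens are already available as a list; `make_ngrams` is a hypothetical helper, not a function from NLTK or scikit-learn.

```python
def make_ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["thanks", "for", "lyft", "credit"]
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)   # [('thanks', 'for'), ('for', 'lyft'), ('lyft', 'credit')]
print(trigrams)  # [('thanks', 'for', 'lyft'), ('for', 'lyft', 'credit')]
```

A token list shorter than `n` simply yields an empty list.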


In [10]:
# A quick glance over the existing data
train['tweet'][:5]

0     @user when a father is dysfunctional and is s...
1    @user @user thanks for #lyft credit i can't us...
2                                  bihday your majesty
3    #model   i love u take with u all the time in ...
4               factsguide: society now    #motivation
Name: tweet, dtype: object

In [20]:
# Applying a first round of text cleaning techniques

def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove links, remove HTML-like tags,
    remove punctuation, remove newlines and remove words containing numbers.'''
    text = text.lower()
    # Raw strings keep the regex escapes from being read as string escapes
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

# Applying the cleaning function to both test and training datasets
train['tweet'] = train['tweet'].apply(lambda x: clean_text(x))
test['tweet'] = test['tweet'].apply(lambda x: clean_text(x))

# Let's take a look at the updated text
train['tweet'].head()

0     user when a father is dysfunctional and is so...
1    user user thanks for lyft credit i cant use ca...
2                                  bihday your majesty
3    model   i love u take with u all the time in u...
4                 factsguide society now    motivation
Name: tweet, dtype: object

## 2. Tokenization

Tokenization is a process that splits an input sequence into so-called tokens, where a token can be a word, a sentence, a paragraph, etc. Based upon the type of tokens we want, tokenization can be of various types.
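
The difference between token types can be sketched with regular expressions alone. The string below is made up for illustration, and the sentence splitter is a naive assumption (split on whitespace that follows `.`, `!` or `?`), not a substitute for a real sentence tokenizer.

```python
import re

text = "Thanks for the credit. I cannot use it! Why not?"

# Word-level tokens: runs of word characters (the r'\w+' pattern)
word_tokens = re.findall(r"\w+", text)

# Sentence-level tokens: naive split on whitespace that follows ., ! or ?
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

print(word_tokens)
print(sentence_tokens)
```

The same text gives ten word tokens but only three sentence tokens, so the choice of token type determines what the later steps operate on.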

In [21]:
# Tokenizing the training and the test set
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
train['tweet'] = train['tweet'].apply(lambda x: tokenizer.tokenize(x))
test['tweet'] = test['tweet'].apply(lambda x: tokenizer.tokenize(x))
train['tweet'].head()

0    [user, when, a, father, is, dysfunctional, and...
1    [user, user, thanks, for, lyft, credit, i, can...
2                              [bihday, your, majesty]
3    [model, i, love, u, take, with, u, all, the, t...
4               [factsguide, society, now, motivation]
Name: tweet, dtype: object

## 3. Stopwords Removal

Now, let's get rid of the stopwords, i.e. words which occur very frequently but carry little information, for example a, an, the, are.

In [55]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...


True

In [22]:
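# Build the stopword set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    """
    Remove English stopwords from a list of tokens.
    """
    words = [w for w in text if w not in stop_words]
    return words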


train['tweet'] = train['tweet'].apply(lambda x : remove_stopwords(x))
test['tweet'] = test['tweet'].apply(lambda x : remove_stopwords(x))
train.head()

Unnamed: 0,id,label,tweet
0,1,0,"[user, father, dysfunctional, selfish, drags, ..."
1,2,0,"[user, user, thanks, lyft, credit, cant, use, ..."
2,3,0,"[bihday, majesty]"
3,4,0,"[model, love, u, take, u, time, urð, ð, ð, ð, ..."
4,5,0,"[factsguide, society, motivation]"


## 4. Token normalization

Token normalisation means converting different tokens to their base forms. This can be done either by:

- **Stemming**: removing and replacing suffixes to get to the root form of the word, which is called the **stem**, for instance cats -> cat, wolves -> wolv
- **Lemmatization** : Returns the base or dictionary form of a word, which is known as the **lemma** 
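
The contrast between the two can be illustrated with a toy example. The suffix rules and the lookup table below are deliberately naive assumptions made for this sketch; they are not the Porter stemmer or the WordNet lemmatizer.

```python
def toy_stem(word):
    """Strip a plural suffix by rule; the result need not be a real word."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for a real lemma lexicon
TOY_LEMMAS = {"cats": "cat", "wolves": "wolf", "drags": "drag"}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return TOY_LEMMAS.get(word, word)

for w in ["cats", "wolves"]:
    print(w, "->", toy_stem(w), "(stem),", toy_lemmatize(w), "(lemma)")
```

The rule-based stem of "wolves" is the non-word "wolv", while the dictionary lookup returns "wolf"; that is the practical difference between the two approaches.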


In [23]:
#tokenizer = nltk.tokenize.TreebankWordTokenizer()
lemmatizer=nltk.stem.WordNetLemmatizer()
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in text]

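# Lemmatize both sets so the test data goes through the same steps as the training data
train['tweet']=train.tweet.apply(lemmatize_text)
test['tweet']=test.tweet.apply(lemmatize_text)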
train.head()

Unnamed: 0,id,label,tweet
0,1,0,"[user, father, dysfunctional, selfish, drag, k..."
1,2,0,"[user, user, thanks, lyft, credit, cant, use, ..."
2,3,0,"[bihday, majesty]"
3,4,0,"[model, love, u, take, u, time, urð, ð, ð, ð, ..."
4,5,0,"[factsguide, society, motivation]"


In [24]:
# After preprocessing, join the tokens of each tweet back into a single string
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

train['tweet'] = train['tweet'].apply(lambda x : combine_text(x))
test['tweet'] = test['tweet'].apply(lambda x : combine_text(x))
train.head()

Unnamed: 0,id,label,tweet
0,1,0,user father dysfunctional selfish drag kid dys...
1,2,0,user user thanks lyft credit cant use cause do...
2,3,0,bihday majesty
3,4,0,model love u take u time urð ð ð ð ð ð ð ð
4,5,0,factsguide society motivation


# <a name="vectorization"></a>  5. Transforming tokens to a vector
After the initial preprocessing phase, we need to transform the text into a meaningful vector (or array) of numbers. This can be done with a number of techniques:

## Bag of Words


### Bag of Words - Countvectorizer Features


In [25]:
count_vectorizer = CountVectorizer()
train_vectors = count_vectorizer.fit_transform(train['tweet'])
test_vectors = count_vectorizer.transform(test["tweet"])

# The result is a sparse matrix (only non-zero elements are stored); densify one row to inspect it
print(train_vectors[0].todense())

[[0 0 0 ... 0 0 0]]
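
What `CountVectorizer` produces can be sketched by hand: build a sorted vocabulary, then count each vocabulary word per document. The two documents below are shortened versions of the cleaned tweets shown earlier. The code is a simplified stand-in and ignores `CountVectorizer` details such as its default token pattern.

```python
from collections import Counter

docs = ["bihday majesty", "user user thanks lyft credit"]

# Vocabulary: every distinct word in sorted order, mapped to a column index
all_words = sorted({w for d in docs for w in d.split()})
vocabulary = {word: idx for idx, word in enumerate(all_words)}

def to_count_vector(doc):
    """Return the word counts of one document as a dense list."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocabulary]

vectors = [to_count_vector(d) for d in docs]
print(vocabulary)
print(vectors)
```

With a real vocabulary of thousands of words, almost every entry in a row is zero, which is why scikit-learn returns a sparse matrix.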


### Bag of Words - TFIDF Features

**Term Frequency: is a scoring of the frequency of the word in the current document.**

```
TF = (Number of times term t appears in a document)/(Number of terms in the document)
```

**Inverse Document Frequency: is a scoring of how rare the word is across documents.**

```
IDF = 1+log(N/n), where, N is the number of documents and n is the number of documents a term t has appeared in.
```
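
The two formulas can be worked through on a small corpus. The sketch uses the definitions exactly as written above, with the natural logarithm, on three made-up token lists. `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math

docs = [
    ["user", "thanks", "lyft", "credit"],
    ["user", "love", "time"],
    ["society", "motivation"],
]

def tf(term, doc):
    """Term frequency: occurrences of the term divided by document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency, 1 + log(N / n); assumes the term occurs somewhere."""
    n_docs_with_term = sum(1 for d in docs if term in d)
    return 1 + math.log(len(docs) / n_docs_with_term)

def tfidf(term, doc, docs):
    """TF-IDF weight of a term in one document."""
    return tf(term, doc) * idf(term, docs)

# 'user' appears in 2 of the 3 documents, 'lyft' in only 1
print(round(tfidf("user", docs[0], docs), 3))
print(round(tfidf("lyft", docs[0], docs), 3))
```

Both words have the same term frequency in the first document, but the rarer word "lyft" receives the larger weight.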

In [26]:
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
train_tfidf = tfidf.fit_transform(train['tweet'])
test_tfidf = tfidf.transform(test["tweet"])


# <a name="model"></a> 6. Building a Text Classification model
Now the data is ready to be fed into a classification model. Let's create a basic classification model using commonly used classification algorithms and see how our model performs.

## Logistic Regression Classifier

In [27]:
# Fitting a simple Logistic Regression on counts (Random Forest could also be used here)
clf = LogisticRegression(C=1.5)
scores = model_selection.cross_val_score(clf, train_vectors, train["label"], cv=5, scoring="f1")#evaluating the model
scores

array([0.6759388 , 0.65294925, 0.66298343, 0.63923182, 0.63247863])

In [28]:
clf.fit(train_vectors, train["label"])#Building the model

In [29]:
# Fitting a simple Logistic Regression on TF-IDF (Random Forest could also be used here)
clf_tfidf = LogisticRegression(C=1.5)
scores = model_selection.cross_val_score(clf_tfidf, train_tfidf, train["label"], cv=5, scoring="f1")
scores

array([0.47412354, 0.48852459, 0.48264463, 0.43686007, 0.44444444])

It appears that CountVectorizer gives better performance than TF-IDF in this case.

## Naive Bayes Classifier
This is a decent score. Let's try another model that is said to work well with text data: Naive Bayes.

In [30]:
# Fitting a simple Naive Bayes on Counts
clf_NB = MultinomialNB()
scores = model_selection.cross_val_score(clf_NB, train_vectors, train["label"], cv=5, scoring="f1")#Evaluating the model
scores

array([0.58190709, 0.58768873, 0.59069767, 0.55841121, 0.56972586])

In [31]:
clf_NB.fit(train_vectors, train["label"])#Building the model

In [33]:
# Fitting a simple Naive Bayes on TFIDF
clf_NB_TFIDF = MultinomialNB()
scores = model_selection.cross_val_score(clf_NB_TFIDF, train_tfidf, train["label"], cv=5, scoring="f1")
scores

array([0.41622575, 0.43402778, 0.47098976, 0.41958042, 0.41901408])

The Naive Bayes scores are not better than those of the logistic regression model.

In [None]:
clf_NB_TFIDF.fit(train_tfidf, train["label"])  # the target column is named 'label'

Conclusion:

* CountVectorizer works better than the TF-IDF vectorizer on this dataset.
* So we are going to use the count-vectorized data for the other algorithms.
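
The comparison is easier to read from the mean of each set of five fold scores. The numbers below are copied from the cross-validation outputs above; only the averaging and the dictionary labels are new.

```python
from statistics import mean

cv_f1 = {
    "LogReg + counts": [0.6759388, 0.65294925, 0.66298343, 0.63923182, 0.63247863],
    "LogReg + TF-IDF": [0.47412354, 0.48852459, 0.48264463, 0.43686007, 0.44444444],
    "NaiveBayes + counts": [0.58190709, 0.58768873, 0.59069767, 0.55841121, 0.56972586],
    "NaiveBayes + TF-IDF": [0.41622575, 0.43402778, 0.47098976, 0.41958042, 0.41901408],
}

mean_f1 = {name: mean(scores) for name, scores in cv_f1.items()}

# Print from best to worst mean F1
for name, score in sorted(mean_f1.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22} {score:.3f}")
```

For both classifiers the count features score higher than the TF-IDF features, and Logistic Regression on counts has the highest mean F1 of the four.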
           

## Random Forest Classifier & SVC
Neither score is as good as hoped. Let's try other models: Random Forest Classifier, Decision Tree and SVC.

## Splitting the dataset

In [34]:
X_train, X_test, y_train, y_test = train_test_split(train_vectors, train["label"], test_size=0.3, random_state=42)

## Models

In [37]:
# Logistic Regression and Naive Bayes were already evaluated above,
# so this part uses Random Forest, SVC and Decision Tree.
models = { 'Random Forest': RandomForestClassifier(), 'Support Vector Machine': SVC(), 'Decision Tree Classifier': DecisionTreeClassifier() }

## Train and Evaluate Models

In [38]:
for model_name, model in models.items(): 
    model.fit(X_train, y_train) 
    y_pred = model.predict(X_test) 
    print(f"Model: {model_name}") 
    print(classification_report(y_test, y_pred))#Evaluating the model using Classification Report 
    print(confusion_matrix(y_test, y_pred))#Evaluating the model using Confusion Matrix
    print("-" * 50) 

Model: Random Forest
              precision    recall  f1-score   support

           0       0.96      0.99      0.98      8905
           1       0.87      0.45      0.60       684

    accuracy                           0.96      9589
   macro avg       0.92      0.72      0.79      9589
weighted avg       0.95      0.96      0.95      9589

[[8860   45]
 [ 375  309]]
--------------------------------------------------
Model: Support Vector Machine
              precision    recall  f1-score   support

           0       0.95      1.00      0.97      8905
           1       0.93      0.36      0.52       684

    accuracy                           0.95      9589
   macro avg       0.94      0.68      0.75      9589
weighted avg       0.95      0.95      0.94      9589

[[8887   18]
 [ 438  246]]
--------------------------------------------------
Model: Decision Tree Classifier
              precision    recall  f1-score   support

           0       0.96      0.98      0.97      890

In [39]:
#Instead of giving all the models in dictionary and looping it one by one to evaluate it, we can give the model here individually.
model=MultinomialNB()
model_name="Naives Bayes"
model.fit(X_train, y_train) 
y_pred = model.predict(X_test) 
print(f"Model: {model_name}") 
print(classification_report(y_test, y_pred)) 
print(confusion_matrix(y_test, y_pred)) 
print("-" * 50)

Model: Naives Bayes
              precision    recall  f1-score   support

           0       0.96      0.98      0.97      8905
           1       0.63      0.49      0.55       684

    accuracy                           0.94      9589
   macro avg       0.79      0.73      0.76      9589
weighted avg       0.94      0.94      0.94      9589

[[8705  200]
 [ 348  336]]
--------------------------------------------------
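
The figures in each classification report follow directly from the confusion matrix printed below it. As a check, the sketch recomputes precision, recall and F1 for class 1 from the Random Forest and Naive Bayes matrices above, assuming scikit-learn's layout of rows as true labels and columns as predicted labels.

```python
def class1_metrics(matrix):
    """Precision, recall, F1 and accuracy from a 2x2 matrix [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return precision, recall, f1, accuracy

# Matrices copied from the outputs above
matrices = {
    "Random Forest": [[8860, 45], [375, 309]],
    "Naive Bayes": [[8705, 200], [348, 336]],
}

for name, m in matrices.items():
    p, r, f1, acc = class1_metrics(m)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} accuracy={acc:.2f}")
```

Accuracy stays near 0.95 for both models because class 0 dominates the data, while recall on class 1 is below 0.5. This is why the F1 score on class 1 is the more informative number here.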


## Function to make Prediction

* In this case I have used a tweet that has already been pre-processed and vectorized to verify whether the model is working or not.
* Any tweet can also be given manually by extending the function below with the preprocessing functions (with some changes); those lines are commented out for now.

In [40]:
train.loc[train['label']==1,:][-5:]  # Last five violent tweets in the training set, used to check the prediction
#train.loc[train['label']==0,:][:5]  # First five non-violent tweets, for the same check

Unnamed: 0,id,label,tweet
31934,31935,1,lady banned kentucky mall user jcpenny kentucky
31946,31947,1,user omfg im offended im mailbox im proud mail...
31947,31948,1,user user dont ball hashtag say weasel away lu...
31948,31949,1,make ask anybody god oh thank god
31960,31961,1,user sikh temple vandalised calgary wso condem...


In [41]:
def predict_tweet(model, tweet_vect):
    '''Classify one already-vectorized tweet with any of the fitted models.'''
    # To accept raw text instead, preprocess and vectorize it first, e.g.:
    # tweet1 = preprocess_tweet(tweet_vect)
    # tweet_vect = vectorizer.transform([tweet1])
    pred = model.predict(tweet_vect)
    return 'Violent' if pred[0] == 1 else 'Not Violent'

In [42]:
#manual_tweet = input("Enter a tweet to classify: ")
manual_tweet=train_vectors[31934]
chosen_model = models['Random Forest']#You can also give the model name directly in here.
result = predict_tweet(chosen_model, manual_tweet)#Calling the above function to make prediction. 
print(f'The tweet is classified as: {result}')

The tweet is classified as: Violent
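
To classify a raw tweet instead of an already vectorized row, the input first has to pass through the same cleaning steps as the training data. A minimal sketch of such a `preprocess_tweet` helper follows. It reuses the regular expressions from `clean_text`, but the stopword set is a tiny hypothetical stand-in for NLTK's English list, and lemmatization is left out to keep the block self-contained.

```python
import re
import string

# Assumption: small illustrative stopword set, not NLTK's full English list
TOY_STOPWORDS = {"a", "an", "the", "is", "are", "and", "this", "to", "of"}

def preprocess_tweet(text):
    """Clean, tokenize and re-join a raw tweet, mirroring the steps used on the training data."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", "", text)                              # text in square brackets
    text = re.sub(r"https?://\S+|www\.\S+", "", text)                # links
    text = re.sub(r"<.*?>+", "", text)                               # HTML-like tags
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # punctuation
    text = re.sub(r"\n", "", text)                                   # newlines
    text = re.sub(r"\w*\d\w*", "", text)                             # words containing digits
    tokens = re.findall(r"\w+", text)                                # same pattern as the RegexpTokenizer
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    return " ".join(tokens)

cleaned = preprocess_tweet("@user Check THIS out: https://t.co/abc #Wow 2day!!")
print(cleaned)  # user check out wow
```

In the notebook, the returned string would then be passed through `count_vectorizer.transform([cleaned])` before being handed to `predict_tweet`.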


## Result

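* The pipeline runs end to end. Among the models tried, Logistic Regression on count features gives the best cross-validated F1 (about 0.63 to 0.68), and RandomForestClassifier gives the highest hold-out accuracy among the complete reports shown (0.96), although its recall on the violence class is only 0.45.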