# *This particular challenge is perfect for data scientists looking to get started with Natural Language Processing*.

<img src='https://i.morioh.com/94c2283427.png' width='600'>

<div class='alert alert-info'>
    <h2><center>AIM: To predict whether a given tweet is about a real disaster. If it is, predict 1; otherwise, predict 0.</center></h2>
    </div>


<div class='alert alert-info'>
    <h2><center>What is Natural Language Processing?</center></h2>

<h3>Natural Language Processing (NLP) is the technology that helps computers understand human language.</h3><br>
    <h3>Teaching machines to understand how we communicate is not an easy task.</h3>

<h3>NLP is a branch of artificial intelligence that deals with the interaction between computers and humans through natural language.
The ultimate objective of NLP is to read, decipher, understand, and make sense of human languages in a way that is valuable.</h3>
<h3>Most NLP techniques rely on machine learning to derive meaning from human languages.</h3>
</div>

<div class='alert alert-warning'>
    <h3><center>Each sample in the train and test set has the following information:</center></h3>

    
1. The text of a tweet
2. A keyword from that tweet (although this may be blank!)
3. The location the tweet was sent from (may also be blank)
    </div>
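Because `keyword` and `location` can be blank, it helps to count the missing values per column before modelling. Here is a minimal sketch on a toy frame; the rows are made up for illustration, and the column names `keyword`, `location` and `text` are assumed to mirror the competition files.

```python
import pandas as pd

# Toy rows standing in for the real data (made up for illustration)
toy = pd.DataFrame({
    "keyword": ["fire", None, "flood"],
    "location": [None, None, "Chennai"],
    "text": [
        "Forest fire near the lake",
        "What a lovely day",
        "Flood warning issued",
    ],
})

# Count the missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)
```

The same `isna().sum()` call can be run on the real `train` and `test` frames once they are loaded.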

<div class='alert alert-warning'>
    
<h3><center>I am going to keep it as simple as possible! We will follow 5 steps to obtain the desired predictions</center></h3>

STEP 1: INPUT (Obtain the input files - train and test)<br>
STEP 2: EDA (Visualize the class distribution of the target variable)<br>
STEP 3: FEATURE ENGINEERING (Extracting the features from the given text using TfidfVectorizer)<br>
STEP 4: MODEL BUILDING & MAKING PREDICTIONS (Using logistic regression to start with)<br>
STEP 5: OUTPUT<br>
    </div>

<div class='alert alert-info'>
    <h3><center>STEP 1</center></h3>
So let's start with STEP 1 - Obtaining the inputs
    </div>

Let's import the necessary packages!!

In [None]:


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O
import missingno as mno
from sklearn.feature_extraction.text import TfidfVectorizer # For extracting the features from the tweet text
from sklearn.linear_model import LogisticRegression #To build a logistic model
from sklearn.metrics import accuracy_score  #To obtain the evaluation metrics
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

**Now that we have imported the necessary packages, let's load the data**

In [None]:
train=pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test=pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')

<div class='alert alert-info'>
    <h2><center>STEP 2 : EDA ON THE CLASS DISTRIBUTION OF THE TARGET VARIABLE</center></h2>
    </div>

In [None]:
train['location'].isna().sum(),train.shape

In [None]:

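# Count the rows per class; rename_axis + reset_index(name=...) yields the same column names on old and new pandas versions
occurences=train['target'].value_counts().rename_axis('Class').reset_index(name='Number of Occurences')
sns.barplot(x=occurences['Class'],y=occurences['Number of Occurences'])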


In [None]:
occurences['Percentage(%)']=(occurences['Number of Occurences']/occurences['Number of Occurences'].sum())*100
occurences.set_index('Class')

**Class 0 has more records than class 1, but the difference is fairly small!**
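As a cross-check on the percentages, `value_counts(normalize=True)` returns the class shares directly. This is a small sketch on made-up labels, not the competition data:

```python
import pandas as pd

# Made-up labels, only to illustrate the call
toy_target = pd.Series([0, 0, 0, 1, 1], name="target")

# normalize=True returns fractions; multiply by 100 for percentages
class_share = toy_target.value_counts(normalize=True) * 100
print(class_share.round(1))
```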


<div class='alert alert-info'>
    <h2><center>STEP 3: EXTRACTING THE FEATURES FROM THE TWEET TEXT</center></h2>
    </div>

In [None]:
traindata = list(np.array(train.iloc[:,3])) # Extracting the text column alone from the train data
testdata = list(np.array(test.iloc[:,3])) # Extracting the text column alone from the test data
y = np.array(train.iloc[:,4]).astype(int) # Extracting the target variable from the train data

X_all = traindata + testdata # Combining the train and test texts (train first, then test)
lentrain = len(traindata)

In [None]:
# Using TF-IDF to extract the features from the text
# Booleans are passed as True rather than 1, since recent scikit-learn versions validate parameter types
tfidf = TfidfVectorizer(min_df=3,  max_features=None, strip_accents='unicode',  
        analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), use_idf=True,smooth_idf=True,sublinear_tf=True)

print("Fitting TF-IDF on both the train and test data")
tfidf.fit(X_all)
print("Transforming the data")
X_all = tfidf.transform(X_all)

**Voila! The TFIDF features are now ready and we can proceed with step 4** 
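To see what the vectorizer produces, here is a small sketch on a made-up three-tweet corpus. It uses unigrams and bigrams as above, but sets `min_df=1` and keeps the default token pattern because the toy corpus is tiny; those settings are assumptions for the illustration, not the notebook's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up tweets, only to illustrate the output
toy_tweets = [
    "forest fire near the lake",
    "fire crews battle the forest fire",
    "what a lovely sunny day",
]

# Unigrams and bigrams; min_df=1 so that no n-gram is dropped from this tiny corpus
toy_tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True)
toy_matrix = toy_tfidf.fit_transform(toy_tweets)

print(toy_matrix.shape)  # (number of tweets, number of distinct n-grams)
print(sorted(toy_tfidf.vocabulary_)[:5])  # a few of the learned n-grams
```

Each row of the sparse matrix is one tweet and each column is one n-gram, so the matrix can be passed straight to a scikit-learn classifier.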

<div class='alert alert-info'>
    <h2><center>STEP 4: MODEL BUILDING AND PREDICTION</center></h2>
    </div> 


In [None]:
X = X_all[:lentrain] # Separating the train rows from the combined data
X_test = X_all[lentrain:] # Separating the test rows from the combined data

log = LogisticRegression(penalty='l2',dual=False, tol=0.0001, 
                             C=1, fit_intercept=True, intercept_scaling=1.0, 
                             class_weight=None, random_state=None) #initialising the logistic regression function with the respective parameters

print("Training on the train data")
log.fit(X,y)

# Evaluating against the train data's target variable to obtain the training accuracy
y_pred_X=log.predict(X)
print('Training accuracy is {}'.format(accuracy_score(y, y_pred_X)))

predictions = log.predict(X_test) # Predicting the target for the test data

predictions
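Training accuracy is measured on the same rows the model was fitted on, so it tends to overstate how well the model will do on unseen tweets. One common way to get a more honest estimate is to hold out part of the labelled data. The sketch below uses made-up tweets and labels to stay runnable, fits TF-IDF on the training split only (unlike the combined fit above), and reports F1 alongside accuracy as a second view of the errors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Made-up tweets and labels (1 = disaster, 0 = not), only for illustration
texts = [
    "Forest fire spreading near the highway",
    "Earthquake damages several buildings downtown",
    "Flood waters rising after heavy rain",
    "Wildfire forces evacuation of the town",
    "Explosion reported at the chemical plant",
    "Tornado destroys homes in the county",
    "I love this sunny weather today",
    "That concert last night was fire",
    "My exam results were a disaster lol",
    "Just watched a great movie",
    "Coffee and a good book this morning",
    "Traffic is flooding my commute again",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Keep 25% of the labelled rows aside for validation, preserving the class balance
X_tr, X_val, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Fit the vectorizer on the training split only, then reuse it on the validation split
vec = TfidfVectorizer(ngram_range=(1, 2))
clf = LogisticRegression(C=1.0)
clf.fit(vec.fit_transform(X_tr), y_tr)
val_pred = clf.predict(vec.transform(X_val))

val_acc = accuracy_score(y_val, val_pred)
val_f1 = f1_score(y_val, val_pred)
print("Validation accuracy:", val_acc)
print("Validation F1:", val_f1)
```

On a toy set this small the scores mean little; the point is the pattern of splitting first and scoring on rows the model has not seen.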

<div class='alert alert-info'>
    <h2><center>STEP 5: GENERATING THE OUTPUT FILE </center></h2>
    </div> 



In [None]:
test

In [None]:
test_ids=test['id']
submission = pd.DataFrame(predictions,index=test_ids,columns=['target'])
submission.to_csv('submission_nlprnot.csv')
print("submission file created..")

<div class='alert alert-warning'>
    <h3><center>To improve the score, you could try different classifiers and tune their parameters with a cross-validated search (for example, grid search)</center></h3>
    </div>
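One possible way to set up such a search is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline` and pass it to `GridSearchCV`. The sketch below is an illustration on made-up tweets; the parameter grid, the step names `tfidf` and `clf`, and the choice of `scoring="f1"` are assumptions, not settings taken from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up tweets and labels (1 = disaster, 0 = not), only for illustration
texts = [
    "Forest fire spreading near the highway",
    "Earthquake damages several buildings downtown",
    "Flood waters rising after heavy rain",
    "Wildfire forces evacuation of the town",
    "Explosion reported at the chemical plant",
    "Tornado destroys homes in the county",
    "I love this sunny weather today",
    "That concert last night was fire",
    "My exam results were a disaster lol",
    "Just watched a great movie",
    "Coffee and a good book this morning",
    "Traffic is flooding my commute again",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Vectorizer and classifier in one pipeline, so each CV fold fits TF-IDF on its own training part
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: n-gram range and regularisation strength
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1")
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` is a pipeline refitted on all the supplied rows, so it can predict directly from raw text.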

## Here are some other notebooks and datasets for you to explore more on Twitter text analysis👍

### Notebooks:
- [Pfizer tweets EDA and text analysis](https://www.kaggle.com/kaushiksuresh147/pfizer-tweets-eda-and-text-analysis)
- [IPL 20-2021 Twitter analysis & EDA](https://www.kaggle.com/kaushiksuresh147/ipl-20-2021-twitter-analysis-eda)
- [How to extract tweets from twitter using twitter API in python](https://www.kaggle.com/kaushiksuresh147/twitter-data-extraction-for-ipl2020)
- [Covid Vaccine EDA](https://www.kaggle.com/kaushiksuresh147/covid-vaccine-eda)


### Datasets
- [Bitcoin Tweets](https://www.kaggle.com/kaushiksuresh147/bitcoin-tweets)
- [IPL 2020 & 2021 Tweets](https://www.kaggle.com/kaushiksuresh147/ipl2020-tweets)
- [Covid Vaccine Tweets](https://www.kaggle.com/kaushiksuresh147/covidvaccine-tweets)
- [The Social Dilemma Tweets - Text Classification](https://www.kaggle.com/kaushiksuresh147/the-social-dilemma-tweets)
