![What most tweets look like](https://web.stanford.edu/class/cs224n/reports/final_summaries/images/image000.png)

# Introduction

Are most of the tweets on Twitter positive, negative, or neutral?

Well, you know what they say ... if you want, you can find a correlation anywhere you look ... if you're really, deeply paying attention.

> 🟢 <b>Goal</b>: This notebook analyses tweets and predicts whether their language is positive, negative, or neutral.


## 📚 Libraries & Functions

In [None]:
import numpy as np 
import pandas as pd

import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
from wordcloud import STOPWORDS as stopwords_wc

import re
import os
import string  # string.punctuation is used in clean_tweets

# nltk

import nltk
from nltk.corpus import stopwords
from  nltk.stem import SnowballStemmer

# ML & preprocessing tools 

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline 
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_curve 
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

import xgboost as xgb

# Color palette

my_colors = ["#ce8f5a", "#efd199", "#80c8bc", "#5ec0ca", "#6287a2"]
sns.palplot(sns.color_palette(my_colors))

# Set Style

sns.set_style("white")
mpl.rcParams['xtick.labelsize'] = 16
mpl.rcParams['ytick.labelsize'] = 16
mpl.rcParams['axes.spines.left'] = False
mpl.rcParams['axes.spines.right'] = False
mpl.rcParams['axes.spines.top'] = False

class color:
    BOLD = '\033[1m' + '\033[93m'
    END = '\033[0m'



for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        


### 📥 Read in Data

In [None]:
col = ["target", "ids", "date", "flag", "user", "text"]
df = pd.read_csv('../input/sentiment140/training.1600000.processed.noemoticon.csv',
                 header=None, encoding='ISO-8859-1', names=col,
                 skiprows=795000, nrows=10000)
df.head()


**Map target label to String**

   * 0 -> NEGATIVE
   * 2 -> NEUTRAL
   * 4 -> POSITIVE
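
The legend above can be applied with a plain dictionary lookup. The following is only a minimal sketch on a toy Series: the names `DECODE_MAP` and `decode_sentiment` are hypothetical and do not appear elsewhere in this notebook, which later recodes the numeric target instead.

```python
import pandas as pd

# Hypothetical mapping that mirrors the label legend above.
DECODE_MAP = {0: "NEGATIVE", 2: "NEUTRAL", 4: "POSITIVE"}

def decode_sentiment(labels: pd.Series) -> pd.Series:
    """Translate numeric sentiment labels into readable strings."""
    return labels.map(DECODE_MAP)

# Toy labels, not taken from the real dataset.
toy_targets = pd.Series([0, 4, 4, 2, 0])
toy_labels = decode_sentiment(toy_targets)
print(toy_labels.tolist())
```

Any label missing from the dictionary would come back as `NaN`, which makes unexpected codes easy to spot.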



In [None]:
df.shape

In [None]:
df.isnull().sum() # seems like we don't have any null data points 

## **Data Preprocessing**

In [None]:
df2 = df[['text', 'target']].copy()  # explicit copy avoids SettingWithCopyWarning on later assignments
df2.head()

In [None]:
df2['target'] = df2['target'].replace(4,1)

#### **Cleaning text column** 

>The process of converting data to something a computer can understand is referred to as pre-processing. One of the major forms of pre-processing is to filter out useless data. In natural language processing, useless words (data) are referred to as stop words.

***What are Stop words?***

**Stop Words**: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. 

![](https://media.geeksforgeeks.org/wp-content/cdn-uploads/Stop-word-removal-using-NLTK.png)
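
To make the idea concrete, here is a minimal sketch of stop-word filtering in plain Python. The stop-word set `TOY_STOP_WORDS` and the helper `remove_stop_words` are assumptions made for illustration; the notebook itself relies on the much longer NLTK English list.

```python
# A tiny hand-written stop-word set, for illustration only.
TOY_STOP_WORDS = {"the", "a", "an", "in", "is", "on"}

def remove_stop_words(sentence, stop_words=TOY_STOP_WORDS):
    """Lowercase the sentence, split on whitespace, and drop stop words."""
    return [tok for tok in sentence.lower().split() if tok not in stop_words]

filtered = remove_stop_words("The cat is sleeping on a sofa in the sun")
print(filtered)  # ['cat', 'sleeping', 'sofa', 'sun']
```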

In [None]:
def emoji_extractor(string, remove=False):
    '''Extracts emoji from a text, or strips them from it when remove=True.'''
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, arrows
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"  # very broad: also covers CJK and other non-emoji text
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"  # all supplementary planes
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"  # zero-width joiner
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  # variation selector
                               u"\u3030"
                               "]+", flags=re.UNICODE)
    if not remove:
        # Extract emoji
        return emoji_pattern.findall(string)
    else:
        # Remove emoji from text
        return emoji_pattern.sub(r'', string)



In [None]:
def clean_emoji(x):
    if len(x) == 0:
        return ''
    else:
        return x[0] 
    


In [None]:
def clean_tweets(df,col):
    '''Returns the dataframe with the tweet column cleaned.'''
    
    # ----- Remove \n, \t, \xa0 -----
    df[col] = df[col].apply(lambda x: x.replace('\n', ''))
    df[col] = df[col].apply(lambda x: x.replace('\xa0', ''))
    df[col] = df[col].apply(lambda x: x.replace('\t', ''))
    
    # ----- Remove pic.twitter and http:// + https:// links -----
    # 'https?\S+' covers both http and https links in a single pass
    df[col] = df[col].apply(lambda x: re.sub(r'https?\S+', '', x))
    df[col] = df[col].apply(lambda x: re.sub(r'pic\.twitter\S+', '', x))
    
    # ----- Remove mentions and hashtags -----
    df[col] = df[col].apply(lambda x: re.sub(r'#\S+', '', x))
    df[col] = df[col].apply(lambda x: re.sub(r'@\S+', '', x))
    
    # ----- Extract Emojis and Remove from Tweet -----
    df['tweet_emojis'] = df[col].apply(lambda x: emoji_extractor(x, remove=False))
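    # emoji_extractor returns a list, so mark empty lists (no emoji found) as NaN
    df['tweet_emojis'] = df['tweet_emojis'].apply(lambda x: x if x else np.nan)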
#     df["tweet_emojis"] = df["tweet_emojis"].apply(lambda x: clean_emoji(x))
    
    df[col] = df[col].apply(lambda x: emoji_extractor(x, remove=True))
    
    # ----- Strip of whitespaces -----
    df[col] = df[col].apply(lambda x: x.strip())
    df[col] = df[col].apply(lambda x: ' '.join(x.split()))
    
    # ----- Remove punctuation & Make lowercase -----
    df[col] = df[col].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
    df[col] = df[col].apply(lambda x: x.lower())
    
    return df

In [None]:
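# TEXT CLEANING
# Matches mentions, links, and any run of non-alphanumeric characters
TEXT_CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9]+"

# A set makes the per-token membership test in text_process fast
stop_words = set(stopwords.words("english"))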
stemmer = SnowballStemmer("english")

In [None]:
def text_process(text, stem = False):
    
    text = re.sub(TEXT_CLEANING_RE, " ", str(text).lower()).strip()
    
    tokens = []
    
    for token in text.split():
        
        if token not in stop_words:
            if stem:
                tokens.append(stemmer.stem(token))
            else:
                tokens.append(token)
                
    return " ".join(tokens)


In [None]:
%%time

df2['text'] = df2.text.apply(lambda x: text_process(x))

In [None]:
df2.head()

### Most Frequent Words 

In [None]:

#Make wordcloud

all_tweets = ' '.join(df2['text'])  # use the cleaned text, not the raw tweets
stopwords_wc = set(stopwords_wc)

FONT_PATH = "../input/ace-font/acetone_font.otf"

wordcloud = WordCloud(stopwords = stopwords_wc, font_path= FONT_PATH,
                      max_words =1500, 
                      max_font_size = 350, random_state=42,
                      width = 2000, height=1000,
                      colormap = 'twilight')

wordcloud.generate(all_tweets)

plt.figure(figsize = (16, 8))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis("off")
plt.show();

**Representing text numerically**

*How do we take the **tokens** we have created and turn them into an array that we can feed into a machine learning algorithm?*

**Bag of words** is the choice as it:
* Is a simple way to represent text in machine learning.
* Discards information about grammar and word order.
* Computes the frequency of occurrence, assuming that the number of times a word occurs is enough information.

**CountVectorizer()** works by taking an array of strings and doing three things:
1. Tokenizes all the strings 
2. Builds a "Vocabulary" --> makes note of all the words that appear.
3. Counts the occurrences of each token in the vocabulary.
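
The three steps can be seen on a tiny corpus. This is a minimal sketch with made-up sentences (`toy_corpus` is not part of the dataset), using the default `CountVectorizer` settings:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up "tweets", for illustration only.
toy_corpus = ["good morning twitter", "good night", "bad morning"]

vec = CountVectorizer()
counts = vec.fit_transform(toy_corpus)   # sparse document-term matrix

# Vocabulary ordered by column index (alphabetical by default).
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
dense = counts.toarray()

print(vocab)
print(dense)
```

Each row is one document and each column one vocabulary word, so the first row counts `good`, `morning`, and `twitter` once each.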

![](https://s3.ap-south-1.amazonaws.com/s3.studytonight.com/curious/uploads/pictures/1590391511-1.jpg)

### **Splitting the dataset into Training and test sets**

In [None]:
X = df2['text']

y = df2['target']

X_train,X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)

In [None]:
y.values

### **Creating a pipeline**


#### **Pipeline workflow**:
* Repeatable way to go from raw data to trained model

* Pipeline object takes sequential list of steps, where the output of one step is input to the next.

* Each step is a tuple with two elements: ("Name of the step", "Transform object")
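
The step names matter because they become prefixes for the parameters of each step, written as `<step name>__<parameter>`. The following minimal sketch, with a hypothetical `toy_pipeline`, shows how that naming works:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical two-step pipeline: vectorizer first, classifier second.
toy_pipeline = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", LogisticRegression()),
])

step_names = [name for name, _ in toy_pipeline.steps]

# Parameters of a step are addressed as "<step name>__<parameter>".
toy_pipeline.set_params(clf__C=0.5)
print(step_names, toy_pipeline.named_steps["clf"].C)
```

The same double-underscore convention is what a grid search over a pipeline expects in its parameter grid.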

#### **One-Vs-The-Rest** is a multiclass strategy.

*Quoting from the official scikit-learn documentation:*

> Also known as one-vs-all, this strategy consists in fitting one classifier per class. For each classifier, the class is fitted against all the other classes. In addition to its computational efficiency (only n_classes classifiers are needed), one advantage of this approach is its interpretability. Since each class is represented by one and one classifier only, it is possible to gain knowledge about the class by inspecting its corresponding classifier. This is the most commonly used strategy for multiclass classification and is a fair default choice.
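
As a small illustration of "one classifier per class", the sketch below fits a one-vs-rest model on synthetic data with three well-separated clusters. The data and the names `X_toy`, `y_toy`, and `ovr` are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic data: three tight clusters, one per class.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_toy = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])
y_toy = np.repeat([0, 1, 2], 20)

ovr = OneVsRestClassifier(LogisticRegression()).fit(X_toy, y_toy)

# One binary classifier is stored per class.
n_binary_models = len(ovr.estimators_)
toy_pred = ovr.predict(centers)
print(n_binary_models, toy_pred)
```

Each entry of `ovr.estimators_` is an ordinary binary `LogisticRegression`, which is what makes the per-class inspection mentioned in the quote possible.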

In [None]:
pl = Pipeline([('vec', CountVectorizer()), 
               ('clf', OneVsRestClassifier(LogisticRegression()))])

pl.fit(X_train, y_train)

y_pred = pl.predict(X_test)

In [None]:
print("Pipeline with accuracy score: ", accuracy_score(y_test, y_pred))

### This is not the best accuracy score we can get. Let's try some different preprocessing techniques to reach a higher accuracy for our model.

In [None]:
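y_pred_proba = pl.predict_proba(X_test)[:,1]

fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

plt.plot(fpr, tpr, label = "Logistic Regression")
plt.plot([0,1], [0,1], "k--")  # chance-level diagonal
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()  # without this call the curve label is never shown
plt.show()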

> I used the probabilities with which the model assigns the label (1) to each observation. This is because, to compute the ROC curve, we don't merely want the predictions on the test set; we want the probability that our logistic regression model outputs before a threshold is applied to predict the label.

#### **Now the question is:** given the ROC curve, can we extract a metric of interest?

* The larger the area under the curve, the better the model.

**The way to think about this is the following:** if we had a model whose ROC curve passed through the point (0, 1), the upper-left corner, representing a "True positive" rate of one and a "False positive" rate of zero, this would be a great model!

* For this reason the area under the ROC curve, commonly denoted as "AUC", is another popular metric for classification models.
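
A minimal sketch of this relationship on four made-up predictions: integrating the ROC curve with `sklearn.metrics.auc` gives the same number as calling `roc_auc_score` directly. The toy labels and scores are assumptions for illustration.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and predicted probabilities for the positive class.
toy_y_true = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_fpr, toy_tpr, _ = roc_curve(toy_y_true, toy_scores)

area_from_curve = auc(toy_fpr, toy_tpr)                # trapezoidal area under the curve
area_direct = roc_auc_score(toy_y_true, toy_scores)    # same metric in one call
print(area_from_curve, area_direct)  # 0.75 0.75
```

Here three of the four positive/negative pairs are ranked correctly, which is exactly what an AUC of 0.75 expresses.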

In [None]:
roc_auc_score(y_test, y_pred_proba)

### When comparing different hyperparameter values, it is essential to use cross-validation: relying on a single "train_test_split" risks tuning the model to that one particular split.

In [None]:
cv_scores = cross_val_score(pl, X, y, cv = 10, scoring = 'roc_auc')
cv_scores

### Comparing the single "train_test_split" score with "cross_val_score", we can see that the ROC AUC varies from fold to fold; the 9th fold scores highest, at about 89%. The mean across folds is the more reliable estimate.
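
Fold scores are usually summarized by their mean and standard deviation rather than by the best fold. The sketch below shows this on synthetic data from `make_classification`; the data and the variable names are assumptions and are unrelated to the tweets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data, for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=42)

toy_cv_scores = cross_val_score(LogisticRegression(max_iter=1000),
                                X_toy, y_toy, cv=5, scoring="roc_auc")

# Report the spread across folds instead of a single best fold.
mean_auc, std_auc = toy_cv_scores.mean(), toy_cv_scores.std()
print(f"ROC AUC: {mean_auc:.3f} +/- {std_auc:.3f}")
```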

### **We could try a different way to make sure that we get the most accurate model.**

>By trying **XGBoost**, one of the best-known models in competitions for this kind of problem ("multiclass classification").

In [None]:

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer

# specifying the parameters 

params = {
    'max_depth': 6,
    
    'objective': 'multi:softmax', ## Learning objective for multiclass classification
    
    'num_class': 3,
    
}

param_grid = {
    'clf__max_depth': [2, 3, 5, 7, 10],
    'clf__n_estimators': [10, 100, 500],
}

clf = xgb.XGBClassifier(**params)

pl_2 = Pipeline([('vec', CountVectorizer()), 
               ('clf', clf)])

grid = GridSearchCV(pl_2, param_grid, cv=5, n_jobs = -1, 
                    scoring = 'accuracy')


In [None]:
%%time
grid.fit(X_train, y_train)


In [None]:
pred = grid.predict(X_test)
pred[:10]

In [None]:
print(classification_report(y_test, pred))

In [None]:
cm = confusion_matrix(y_test, pred)
cm

In [None]:
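def plot_confusion_matrix(cm, classes, normalized=True, cmap='bone'):
    '''Plots the confusion matrix as a heatmap annotated with the raw counts.'''
    plt.figure(figsize=[7,6])
    
    norm_cm = cm
    if normalized:
        # Color each cell by its share of the true class (row-wise normalization)
        norm_cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    
    # Draw the heatmap whether or not the colors are normalized
    sns.heatmap(norm_cm, annot=cm, fmt='g', xticklabels = classes,
                yticklabels = classes, cmap=cmap)
    plt.xlabel('Predicted label')
    plt.ylabel('True label')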

In [None]:
plot_confusion_matrix(cm, ['NEGATIVE', 'POSITIVE'])  # class 0 and class 1 after recoding 4 -> 1
