### Problem Statement

Monoclonal antibodies (mAbs) are highly effective in treating mild to moderate COVID-19 among nonhospitalized patients. We are asked to perform the following tasks:
1. Scrape data from a social media platform such as Facebook, Twitter, or Reddit
2. Perform sentiment analysis and label comments as positive or negative
3. Extract the keywords used most often across the platform
4. Identify the influencers and their geography
5. Present the insights obtained using visualization

I chose to scrape Twitter.

### Summary:


1. __<a href='#1' target='_self'>Import Libraries</a>__
1. __<a href='#2' target='_self'>Fetch Tweets & Sentiments</a>__
    1. __<a href='#2A' target='_self'>Fetch Tweets</a>__
    1. __<a href='#2B' target='_self'>Fetch sentiments</a>__
1. __<a href='#3' target='_self'>Text Pre-processing</a>__
    1. __<a href='#3A' target='_self'>Pre-processing 'Key Words'</a>__
        1. <a href='#3Aa' target='_self'>Removing '@names'</a>
        1. <a href='#3Ab' target='_self'>Removing links (http | https)</a>
        1. <a href='#3Ac' target='_self'>Removing empty tweets</a>
        1. <a href='#3Ad' target='_self'>Removing Punctuations, Numbers and Special characters</a>
        1. <a href='#3Ae' target='_self'>Removing Stop words</a>
        1. <a href='#3Af' target='_self'>Tokenizing</a>
        1. <a href='#3Ag' target='_self'>Lemmatization</a>
        1. <a href='#3Ah' target='_self'>Joining all tokens into sentences</a>
        1. <a href='#3Ai' target='_self'>Dropping redundant rows</a>
        1. <a href='#3Aj' target='_self'>Resetting index</a>
    1. __<a href='#3B' target='_self'>Pre-processing 'Key Phrases'</a>__
        1. <a href='#3Ba' target='_self'>Helper class, will help in preprocessing phrase terms</a>
        1. <a href='#3Bb' target='_self'>Defining the grammar of the phrases</a>
        1. <a href='#3Bc' target='_self'>New feature called 'key_phrases', will contain phrases for corresponding tweet</a>
1. __<a href='#4' target='_self'>Story Generation and Visualization</a>__
    1. __<a href='#4A' target='_self'>Most common words in positive tweets</a>__
    1. __<a href='#4B' target='_self'>Most common words in negative tweets</a>__
    1. __<a href='#4C' target='_self'>Most commonly used Hashtags</a>__
    1. __<a href='#4D' target='_self'>Most common influencers</a>__
1. __<a href='#5' target='_self'>Feature Extraction</a>__
    1. __<a href='#5A' target='_self'>Feature Extraction for 'Key Words'</a>__
    1. __<a href='#5B' target='_self'>Feature Extraction for 'Key Phrases'</a>__
1. __<a href='#6' target='_self'>Model Building: Sentiment Analysis</a>__
    1. __<a href='#6A' target='_self'>Predictions on 'key words' based features</a>__
        1. <a href='#6Aa' target='_self'> BOW word features</a>
        1. <a href='#6Ab' target='_self'>TF-IDF word features</a>
    1. __<a href='#6B' target='_self'>Predictions on 'key phrases' based features</a>__
        1. <a href='#6Ba' target='_self'>BOW phrase features</a>
        1. <a href='#6Bb' target='_self'>TF-IDF phrase features</a>
       

## <a id='1'>1. Import Libraries</a>

In [1]:
# !pip install tweepy
# !pip install wordcloud

In [2]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
import re
import warnings

# Web Scraping
import tweepy
from tweepy import OAuthHandler 

# Text mining
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import *
from nltk.classify import NaiveBayesClassifier
from wordcloud import WordCloud

# Model Building
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score
from sklearn.naive_bayes import GaussianNB

# Sentiment Analysis
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
from textblob.np_extractors import ConllExtractor

# Ignoring deprecation warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Downloading the NLTK corpora and models used below
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')
nltk.download('averaged_perceptron_tagger')
nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('conll2000')
nltk.download('brown')
stopwords = set(stopwords.words("english"))

# For showing all the plots inline
%matplotlib inline

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nivedharakigmail.com/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/nivedharakigmail.com/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/nivedharakigmail.com/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/nivedharakigmail.com/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/nivedharakigmail.com/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/nivedharakigmail.com/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

## <a id='2'>2. Fetch Tweets & Sentiments</a>
### <a id='2A'>A. Fetch Tweets</a>

In [3]:
# Enter the Keys and secrets generated from the Twitter Dev platform
consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''

In [4]:
# # Keys and secrets generated from the Twitter Dev platform
# consumer_key = input("Enter consumer key: ")
# consumer_secret = input("Enter consumer secret: ")
# access_token = input("Enter access token: ")
# access_token_secret = input("Enter access token secret: ")

In [5]:
# Class that fetches tweets from the Twitter API using the tweepy library
class Fetch_Tweet(object): 
    #Initialization method 
    def __init__(self): 
        try: 
            # creating OAuth1UserHandler object
            auth = tweepy.OAuth1UserHandler(consumer_key, consumer_secret,
                                            access_token, access_token_secret)
            self.api = tweepy.API(auth, wait_on_rate_limit=True)
            
        except tweepy.Unauthorized as e:
            print(f"Error: Authentication Failed - \n{str(e)}")

    #Defining the function to fetch tweets        
    def get_tweets(self, text, maxTweets = 1000): 
        # empty list to store parsed tweets 
        tweets = [] 
        sinceId = None
        max_id = -1
        tweetCnt = 0
        tweetsPerTxt = 100

        while tweetCnt < maxTweets:
            try:
                if (max_id <= 0):
                    if (not sinceId):
                        new_tweets = self.api.search_tweets(q=text, count=tweetsPerTxt)
                    else:
                        new_tweets = self.api.search_tweets(q=text, count=tweetsPerTxt,
                                                since_id=sinceId)
                else:
                    if (not sinceId):
                        new_tweets = self.api.search_tweets(q=text, count=tweetsPerTxt,
                                                max_id=str(max_id - 1))
                    else:
                        new_tweets = self.api.search_tweets(q=text, count=tweetsPerTxt,
                                                max_id=str(max_id - 1),
                                                since_id=sinceId)
                if not new_tweets:
                    print("End of search")
                    break

                for tweet in new_tweets:
                    parsed_tweet = {} 
                    parsed_tweet['tweets'] = tweet.text 

                    # appending parsed tweet to tweets list 
                    if tweet.retweet_count > 0: 
                        # if tweet has retweets, ensure that it is appended only once 
                        if parsed_tweet not in tweets: 
                            tweets.append(parsed_tweet) 
                    else: 
                        tweets.append(parsed_tweet) 
                        
                tweetCnt += len(new_tweets)
                print("Downloaded {0} tweets".format(tweetCnt))
                max_id = new_tweets[-1].id

            except tweepy.TweepyException as e:
                # Stop fetching if an error occurs; tweets collected so far are still returned
                print("Tweepy error : " + str(e))
                break
        
        return pd.DataFrame(tweets)
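
The loop above pages backwards through the search results: each request passes a `max_id` one less than the oldest id seen so far, so no tweet is fetched twice. The following is a minimal simulation of that cursor logic only. `fake_search`, `fetch_all`, and `ALL_IDS` are hypothetical stand-ins invented for illustration and are not Tweepy calls; unlike the original loop, this sketch also trims the result to the requested maximum.

```python
# Hypothetical in-memory stand-in for a search endpoint; not a Tweepy call.
ALL_IDS = list(range(1000, 0, -7))  # ids in descending order, newest first


def fake_search(count, max_id=None):
    """Return up to `count` ids that are <= max_id, newest first."""
    eligible = [i for i in ALL_IDS if max_id is None or i <= max_id]
    return eligible[:count]


def fetch_all(max_items, page_size=100):
    """Collect ids page by page until max_items is reached or results run out."""
    collected = []
    max_id = None
    while len(collected) < max_items:
        page = fake_search(page_size, max_id=max_id)
        if not page:
            break  # no older results left
        collected.extend(page)
        # The next page must be strictly older than the oldest id seen so far.
        max_id = page[-1] - 1
    return collected[:max_items]


ids = fetch_all(max_items=250, page_size=100)
print(len(ids), ids[0], ids[-1])
```

Because the cursor always moves strictly below the oldest id, the pages cannot overlap, which is the property the real loop relies on.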

In [6]:
twitter_client = Fetch_Tweet()

# calling function to get tweets
tweets_df = twitter_client.get_tweets('monoclonal antibody treatment', maxTweets=9000)
print(f'tweets_df Shape - {tweets_df.shape}')
tweets_df.to_csv('tweets.csv')
tweets_df.head(10)

Tweepy error : 400 Bad Request
215 - Bad Authentication data.
tweets_df Shape - (0, 0)


### <a id='2B'>B. Fetch sentiments</a>

In [7]:
#Fetching sentiments using Textblob
def textblob_fetch_sentiment(text):
    analysis = TextBlob(text)
    return 'pos' if analysis.sentiment.polarity >= 0 else 'neg'
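
TextBlob polarity lies in the range -1.0 to 1.0, and the function above labels a score of exactly 0.0 as 'pos', so neutral tweets count as positive. As a hedged alternative, the sketch below maps a polarity score to three labels. `label_polarity` and its `neutral_band` parameter are hypothetical names, and the function takes a plain float so that it runs without TextBlob.

```python
def label_polarity(polarity, neutral_band=0.0):
    """Map a polarity score in [-1.0, 1.0] to 'pos', 'neg' or 'neu'."""
    if polarity > neutral_band:
        return 'pos'
    if polarity < -neutral_band:
        return 'neg'
    # Scores inside [-neutral_band, neutral_band] are treated as neutral.
    return 'neu'


for score in (0.6, 0.0, -0.25):
    print(score, label_polarity(score))
```

Whether neutral tweets should be kept as a third class or dropped depends on the downstream model; this notebook uses two classes.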

In [8]:
textblob_sentiments = tweets_df.tweets.apply(lambda tweet: textblob_fetch_sentiment(tweet))
pd.DataFrame(textblob_sentiments.value_counts())

AttributeError: 'DataFrame' object has no attribute 'tweets'

This gave 225 tweets labeled positive and 34 labeled negative.

In [None]:
tweets_df['sentiment'] = textblob_sentiments
tweets_df.head()

## <a id='3'>3. Text Pre-processing</a> 
### <a id='3A'>A. Pre-processing Key Words</a>

In [None]:
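def add_pattern(text, pattern_regex):
    # Return the last match of pattern_regex in text (e.g. the last @mention);
    # if nothing matches, the original text is returned unchanged.
    matches = re.findall(pattern_regex, text)
    return matches[-1] if matches else text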

In [None]:
def remove_pattern(text, pattern_regex):
    # Substitute with the pattern itself; re-using matched text as a regex
    # breaks when the match contains special characters such as '*' or '('.
    return re.sub(pattern_regex, '', text)

In [None]:
# Adding a column named Influencers holding the last @mention found in each tweet
tweets_df['Influencers'] = np.vectorize(add_pattern)(tweets_df['tweets'], r"@\w+")
tweets_df.head(10)

#### <a id='3Aa'>a. Removing '@names'</a>
Names are not required for the analysis, so we remove them.

In [None]:
# Adding a column named clean_tweets with the retweet marker and @mentions removed
tweets_df['clean_tweets'] = np.vectorize(remove_pattern)(tweets_df['tweets'], r"\bRT\b|@\w+:?")
tweets_df.head(10)
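
To show what mention patterns of this kind match, here is a small standalone sketch on a made-up tweet. The sample text and the exact patterns are assumptions chosen for illustration, not necessarily the ones the notebook ran with.

```python
import re

sample = "RT @WHO: New data on mAbs, thanks @CDCgov! (R&D update)"

# Every @mention in the tweet, in order of appearance.
mentions = re.findall(r"@\w+", sample)

# Remove mentions, including a trailing colon as in "RT @name:".
without_mentions = re.sub(r"@\w+:?", "", sample)
# \b keeps "RT" from matching inside longer words.
without_rt = re.sub(r"\bRT\b", "", without_mentions)
# Collapse the runs of whitespace left behind by the removals.
cleaned = " ".join(without_rt.split())

print(mentions)
print(cleaned)
```

Note that the capital R in "R&D" survives, because the retweet marker is matched as a whole word.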

#### <a id='3Ab'>b. Removing links from tweets </a>

In [None]:
cleaned_tweets = []

for index, row in tweets_df.iterrows():
    # 'http' also matches 'https', so a single membership test is enough
    words_without_links = [word for word in row.clean_tweets.split() if 'http' not in word]
    cleaned_tweets.append(' '.join(words_without_links))

tweets_df['clean_tweets'] = cleaned_tweets
tweets_df.head(10)
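
An alternative sketch for stripping links uses a regular expression rather than word-by-word filtering. `strip_links` is a hypothetical helper, and the pattern assumes links start with `http://` or `https://`.

```python
import re


def strip_links(text):
    """Remove http/https URLs and collapse leftover whitespace."""
    no_links = re.sub(r"https?://\S+", "", text)
    return " ".join(no_links.split())


print(strip_links("Evusheld update https://t.co/abc123 via http://example.org now"))
```

On a DataFrame column, a function like this could be applied with `Series.apply`, which avoids the explicit `iterrows` loop.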

#### <a id='3Ac'>c. Removing empty tweets</a>
Tweets that are empty after cleaning are dropped.

In [None]:
tweets_df = tweets_df[tweets_df['clean_tweets']!='']
tweets_df.head()

#### <a id='3Ad'>d. Removing Punctuations, Numbers and Special characters</a>
Key-phrase extraction later relies on the sentence structure in 'clean_tweets', so the stripped text goes into a new column called 'clean_tweets_final'.

In [None]:
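# regex=True is required in recent pandas versions, where str.replace is literal by default
tweets_df['clean_tweets_final'] = tweets_df['clean_tweets'].str.replace("[^a-zA-Z# ]", "", regex=True)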


#### <a id='3Ae'>e. Removing Stop words</a>
Stop words don't contribute to the analysis and hence are removed.

In [None]:
stopwords_set = set(stopwords)
cleaned_tweets = []

for index, row in tweets_df.iterrows():
    
    # filtering out all stop words (compared in lowercase) and hashtags
    words_without_stopwords = [word for word in row.clean_tweets_final.split() if word.lower() not in stopwords_set and '#' not in word]
    
    # joining the remaining words back into a single string
    cleaned_tweets.append(' '.join(words_without_stopwords))
    
tweets_df['clean_tweets_final'] = cleaned_tweets
tweets_df.head(10)
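
A small sketch of stop-word filtering on a toy sentence. It uses a hypothetical three-word stop list instead of NLTK's, so it runs without any downloads; comparing lowercased tokens avoids missing capitalized stop words such as "The".

```python
TOY_STOPWORDS = {"the", "is", "for"}  # hypothetical; NLTK's list is much longer


def drop_stopwords(text, stop_set):
    """Keep tokens that are neither stop words nor hashtags."""
    kept = [w for w in text.split()
            if w.lower() not in stop_set and '#' not in w]
    return ' '.join(kept)


print(drop_stopwords("The treatment is promising for #COVID patients", TOY_STOPWORDS))
```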

#### <a id='3Af'>f. Tokenize *'clean_tweets_final'*</a> 
Tokenization breaks the raw text into smaller units, such as words or sentences, called tokens. Here each tweet is split into words.

In [None]:
tokenized_tweet = tweets_df['clean_tweets_final'].apply(lambda x: x.split())
tokenized_tweet.head()

#### <a id='3Ag'>g. Lemmatization </a>
Lemmatization is a common normalization technique in text pre-processing. It replaces each word with its dictionary form (lemma), so that inflected variants of a word are treated as the same token.

In [None]:
word_lemmatizer = WordNetLemmatizer()

tokenized_tweet = tokenized_tweet.apply(lambda x: [word_lemmatizer.lemmatize(i) for i in x])
tokenized_tweet.head()

#### <a id='3Ah'>h. Joining all tokens into sentences</a>

In [None]:
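# Join each token list back into a single string; apply() keeps the original index,
# whereas assigning by position breaks once rows have been dropped
tokenized_tweet = tokenized_tweet.apply(' '.join)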

tweets_df['clean_tweets_final'] = tokenized_tweet
tweets_df.head(10)

#### <a id='3Ai'>i. Dropping redundant rows</a>

In [None]:
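# drop_duplicates returns a new DataFrame, so assign it back; keep the first copy of each duplicated tweet
tweets_df = tweets_df.drop_duplicates(subset=['clean_tweets'], keep='first')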
tweets_df.head()

#### <a id='3Aj'>j. Resetting index</a>
After dropping rows, some index values are missing, which can cause problems in later operations, so we reset the index.

In [None]:
tweets_df = tweets_df.reset_index(drop=True)
tweets_df.head()

### <a id='3B'>B. Pre-processing 'Key Phrases'</a> 

#### <a id='3Ba'>a. Helper class, will help in preprocessing phrase terms</a>
The pre-processing techniques considered are:
1. Lemmatization
2. Stemming, which reduces inflected words to their stem. A stemmer is instantiated in the helper class, but `normalise` applies only lemmatization
3. Case normalization
4. Limiting the length of the words accepted into a phrase


In [None]:
class PhraseExtractHelper(object):
    def __init__(self):
        self.lemmatizer = nltk.WordNetLemmatizer()
        self.stemmer = nltk.stem.porter.PorterStemmer()
    # Finds NP (nounphrase) leaf nodes of a chunk tree
    def leaves(self, tree):
        for subtree in tree.subtrees(filter = lambda t: t.label()=='NP'):
            yield subtree.leaves()
            
    #Normalises words to lowercase and lemmatizes them (the stemmer is not applied)
    def normalise(self, word): 
        word = word.lower()
        word = self.lemmatizer.lemmatize(word)
        return word
    #Checks whether a word is acceptable: length between 3 and 40, not a stop word, link or hashtag.
    #The upper bound can be raised to allow longer words.
    def acceptable_word(self, word):
        accepted = bool(3 <= len(word) <= 40
            and word.lower() not in stopwords
            and 'https' not in word.lower()
            and 'http' not in word.lower()
            and '#' not in word.lower()
            )
        return accepted

    def get_terms(self, tree):
        for leaf in self.leaves(tree):
            term = [ self.normalise(w) for w,t in leaf if self.acceptable_word(w) ]
            yield term

#### <a id='3Bb'>b. Defining the grammar of the phrases</a>

In [None]:
sentence_re = r'(?:(?:[A-Z])(?:.[A-Z])+.?)|(?:\w+(?:-\w+)*)|(?:\$?\d+(?:.\d+)?%?)|(?:...|)(?:[][.,;"\'?():-_`])'
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns
        
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
"""
chunker = nltk.RegexpParser(grammar)

#### <a id='3Bc'>c. New feature called 'key_phrases', will contain phrases for corresponding tweet</a>

In [None]:
key_phrases = []
phrase_extract_helper = PhraseExtractHelper()

for index, row in tweets_df.iterrows(): 
    toks = nltk.regexp_tokenize(row.clean_tweets, sentence_re)
    postoks = nltk.tag.pos_tag(toks)
    tree = chunker.parse(postoks)

    terms = phrase_extract_helper.get_terms(tree)
    tweet_phrases = []

    for term in terms:
        if len(term):
            tweet_phrases.append(' '.join(term))
    
    key_phrases.append(tweet_phrases)
    
key_phrases[:10]

In [None]:
textblob_key_phrases = []
extractor = ConllExtractor()

for index, row in tweets_df.iterrows():
    # filtering out all the hashtags
    words_without_hash = [word for word in row.clean_tweets.split() if '#' not in word.lower()]
    
    hash_removed_sentence = ' '.join(words_without_hash)
    
    blob = TextBlob(hash_removed_sentence, np_extractor=extractor)
    textblob_key_phrases.append(list(blob.noun_phrases))

textblob_key_phrases[:10]

In [None]:
tweets_df['key_phrases'] = textblob_key_phrases
tweets_df.head(10)

## <a id='4'>4. Story Generation and Visualization</a>

In [None]:
# function to generate wordcloud
def generate_wordcloud(all_words):
    wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=100, relative_scaling=0.5, colormap='Dark2').generate(all_words)

    plt.figure(figsize=(14, 10))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis('off')
    plt.show()

In [None]:
# function to collect hashtags
def hashtag_extract(text_list):
    hashtags = []
    # Loop over the tweets and collect the hashtags in each one
    for text in text_list:
        ht = re.findall(r"#(\w+)", text)
        hashtags.append(ht)

    return hashtags

def generate_hashtag_freqdist(hashtags):
    a = nltk.FreqDist(hashtags)
    d = pd.DataFrame({'Hashtag': list(a.keys()),
                      'Count': list(a.values())})
    # selecting the 20 most frequent hashtags
    d = d.nlargest(columns="Count", n = 20)
    plt.figure(figsize=(16,7))
    ax = sns.barplot(data=d, x= "Hashtag", y = "Count")
    plt.xticks(rotation=80)
    ax.set(ylabel = 'Count')
    plt.show()
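
The extraction and counting steps can also be sketched with the standard library alone. The toy tweets below are made up, and `collections.Counter` stands in for `nltk.FreqDist` here; this is an illustration, not a replacement for the plotting function above.

```python
import re
from collections import Counter

toy_tweets = [
    "Got #Evusheld today #COVID19",
    "#Paxlovid or #Evusheld?",
    "No hashtags here",
]

# One list of hashtags per tweet, then flatten into a single list.
per_tweet = [re.findall(r"#(\w+)", t) for t in toy_tweets]
flat = [tag for tags in per_tweet for tag in tags]

# The two most frequent hashtags with their counts.
top = Counter(flat).most_common(2)
print(top)
```

Flattening with a comprehension scales better than `sum(hashtags, [])`, which copies the accumulated list on every addition.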

#### <a id='4A'>A. Most common words in positive tweets</a>

In [None]:
all_words = ' '.join([text for text in tweets_df['clean_tweets_final'][tweets_df.sentiment == 'pos']])
generate_wordcloud(all_words)

The word cloud shows that the prominent words in positive tweets include monoclonal antibody, treatment, prion protein, Covid, drug, and antiviral therapy.

#### <a id='4B'>B. Most common words in negative tweets</a>

In [None]:
all_words = ' '.join([text for text in tweets_df['clean_tweets_final'][tweets_df.sentiment == 'neg']])
generate_wordcloud(all_words)

The word cloud shows that the prominent words in negative tweets include monoclonal antibody, covid, vaccine hesitant, skeptical, prevent, immunocompromised, and letters.

#### <a id='4C'>C. Most commonly used Hashtags</a>

In [None]:
hashtags = hashtag_extract(tweets_df['clean_tweets'])
hashtags = sum(hashtags, [])

In [None]:
generate_hashtag_freqdist(hashtags)

The bar chart shows the most commonly used hashtags. Drugs such as Evusheld, Paxlovid, and Medvix appear, as do Covid-19 variants such as Omicron. The mention of passive genocide may reflect disappointment in third-world countries. Florida appears as a geographic factor among the key topics of discussion.

#### <a id='4D'>D. Most common influencers </a>

In [None]:
generate_hashtag_freqdist(tweets_df['Influencers'])

The most frequently tagged accounts include B52malmet, Sophos_Veritate, theLancetNeuro, and CDRMaguire.

In [None]:
# For the sake of consistency, discard records with no phrases, i.e. where tweets_df['key_phrases'] is []
tweets_df2 = tweets_df[tweets_df['key_phrases'].str.len()>0]

## <a id='5'>5. Feature Extraction</a>

We extract features using two methods:

1. __Bag of words (Simple vectorization)__
2. __TF-IDF (Term Frequency - Inverse Document Frequency)__
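
To make the difference between the two methods concrete, here is a minimal standard-library sketch on three made-up documents. The IDF formula used, ln((1 + n) / (1 + df)) + 1, is the smoothed variant that scikit-learn documents as the default for `TfidfVectorizer`; the sketch omits the L2 row normalization that the vectorizer also applies, so the numbers will not match its output exactly.

```python
import math
from collections import Counter

docs = [
    "antibody treatment works",
    "antibody treatment delayed",
    "vaccine works",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Bag of words: raw term counts per document.
bow = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# Document frequency: number of documents containing each term.
n_docs = len(tokenized)
df = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed inverse document frequency (assumed formula, see the text above).
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

# TF-IDF without normalization: term count times IDF weight.
tfidf = [[count * idf[w] for count, w in zip(row, vocab)] for row in bow]

for w in vocab:
    print(f"{w:10s} df={df[w]} idf={idf[w]:.3f}")
```

Terms that occur in fewer documents, such as "delayed", receive a larger weight than terms shared by several documents, such as "antibody", whereas bag of words weights them equally.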


### <a id='5A'>A. Feature Extraction for 'Key Words'</a>

In [None]:
# BOW features
bow_word_vectorizer = CountVectorizer(max_df=0.90, min_df=2, stop_words='english')
# bag-of-words feature matrix
bow_word_feature = bow_word_vectorizer.fit_transform(tweets_df2['clean_tweets_final'])

# TF-IDF features
tfidf_word_vectorizer = TfidfVectorizer(max_df=0.90, min_df=2, stop_words='english')
# TF-IDF feature matrix
tfidf_word_feature = tfidf_word_vectorizer.fit_transform(tweets_df2['clean_tweets_final'])


### <a id='5B'>B. Feature Extraction for 'Key Phrases'</a>

In [None]:
phrase_sents = tweets_df2['key_phrases'].apply(lambda x: ' '.join(x))

# BOW phrase features
bow_phrase_vectorizer = CountVectorizer(max_df=0.90, min_df=2)
bow_phrase_feature = bow_phrase_vectorizer.fit_transform(phrase_sents)

# TF-IDF phrase feature
tfidf_phrase_vectorizer = TfidfVectorizer(max_df=0.90, min_df=2)
tfidf_phrase_feature = tfidf_phrase_vectorizer.fit_transform(phrase_sents)

## <a id='6'>6. Model Building: Sentiment Analysis</a>

In [None]:
# Mapping target variables to  {0, 1}
target_variable = tweets_df2['sentiment'].apply(lambda x: 0 if x=='neg' else 1)

In [None]:
def plot_confusion_matrix(matrix):
    plt.clf()
    plt.imshow(matrix, interpolation='nearest', cmap=plt.cm.Set2_r)
    classNames = ['Positive', 'Negative']
    plt.title('Confusion Matrix')
    # confusion_matrix puts actual classes on rows (y axis) and predictions on columns (x axis)
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    tick_marks = np.arange(len(classNames))
    plt.xticks(tick_marks, classNames)
    plt.yticks(tick_marks, classNames)
    s = [['TP','FN'], ['FP', 'TN']]

    for i in range(2):
        for j in range(2):
            plt.text(j,i, str(s[i][j])+" = "+str(matrix[i][j]))
    plt.show()

In [None]:
def naive_model(X_train, X_test, y_train, y_test):
    naive_classifier = GaussianNB()
    naive_classifier.fit(X_train.toarray(), y_train)

    # predictions over test set
    predictions = naive_classifier.predict(X_test.toarray())

    # calculating Accuracy Score
    print(f'Accuracy Score - {accuracy_score(y_test, predictions)}')
    # labels=[1, 0] orders rows and columns as positive first, then negative
    conf_matrix = confusion_matrix(y_test, predictions, labels=[1, 0])
    plot_confusion_matrix(conf_matrix)
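
Because the labels are imbalanced (the notebook reports 225 positive against 34 negative tweets), accuracy alone can look good even when the minority class is never predicted. The sketch below computes accuracy, precision, recall, and F1 in plain Python on made-up labels. `binary_metrics` is a hypothetical helper written for illustration; in the notebook itself the already imported `f1_score` would serve the same purpose.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels: 8 positives, 2 negatives; the "model" always predicts positive.
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 10

print(binary_metrics(y_true, y_pred, positive=1))
print(binary_metrics(y_true, y_pred, positive=0))
```

In this toy case accuracy is the same from both points of view, but the F1 score for the negative class is zero, which exposes that the classifier never identifies a negative tweet.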

### <a id='6A'>A. Predictions on 'key words' based features</a>

#### <a id='6Aa'>a. BOW word features</a>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(bow_word_feature, target_variable, test_size=0.3, random_state=272)
naive_model(X_train, X_test, y_train, y_test)

#### <a id='6Ab'>b. TF-IDF word features</a>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(tfidf_word_feature, target_variable, test_size=0.3, random_state=272)
naive_model(X_train, X_test, y_train, y_test)

### <a id='6B'>B. Predictions on 'key phrases' based features</a>

#### <a id='6Ba'>a. BOW Phrase features</a>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(bow_phrase_feature, target_variable, test_size=0.3, random_state=272)
naive_model(X_train, X_test, y_train, y_test)

#### <a id='6Bb'>b. TF-IDF Phrase features</a>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(tfidf_phrase_feature, target_variable, test_size=0.3, random_state=272)
naive_model(X_train, X_test, y_train, y_test)

Based on the accuracy scores, features extracted from 'key words' help the model perform better and give better positive predictions than features extracted from 'key phrases'. Between BOW and TF-IDF, BOW gives the better result. We therefore choose the BOW, key-words-based model with a Gaussian Naive Bayes classifier.
