# Study of Adverse Reactions from Tweets

We are going to identify adverse drug reactions from tweets. These tweets were collected with an R script using the Twitter API.

# ---------------------------------------------------------------------------------------------------


## Importing Libraries

In [84]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.feature_extraction.text import TfidfVectorizer 

"twitter_data_All_Final.csv" is the file that contains all information extracted from Twitter API.

In [85]:
rawdata = pd.read_csv("twitter_data_All_Final.csv",header=0,quoting=0)
data = pd.DataFrame(rawdata.iloc[:,[3,4,5,12,13,27,42,43]])

`data` is the dataframe containing the tweets and the attributes we keep for analysis.

This is how `data` looks:

In [86]:
data.head(10)

Unnamed: 0,screen_name,text,source,retweet_count,hashtags,mentions_screen_name,searchTerm,disease
0,buysextoys71,Buy top levothyroxine online on our free compa...,twittbot.net,0,,,Levothyroxine,Cancer
1,rachillionaire,@femmebostonian @Cherrell_Brown This has been ...,Twitter for iPhone,0,,femmebostonian Cherrell_Brown,Levothyroxine,Cancer
2,renae_dePerio,For a decade I was on Levothyroxine 137 mcg. I...,Twitter for iPhone,0,,,Levothyroxine,Cancer
3,junglismastiff,@yourAAH Hi could you tell me which of your UK...,Twitter for Android,0,,yourAAH,Levothyroxine,Cancer
4,notsydvicious,And my levothyroxine,Twitter for iPhone,0,,,Levothyroxine,Cancer
5,buysextoys71,Buy top levothyroxine online on our free compa...,twittbot.net,0,,,Levothyroxine,Cancer
6,Juniper_publish,Telephone Survey to Assess Substitution of Ele...,Twitter Web Client,0,,,Levothyroxine,Cancer
7,buysextoys71,Buy top levothyroxine online on our free compa...,twittbot.net,0,,,Levothyroxine,Cancer
8,Princeaurora200,"Dear pfiser, Thyroxine.fatalities occur, Ny p...",Mobile Web (M2),0,,,Levothyroxine,Cancer
9,Princeaurora200,"Dear pfiser, Thyroxine. Ny puberty. Growth hor...",Mobile Web (M2),0,,,Levothyroxine,Cancer


Some information about what is inside the `data` dataframe:

In [87]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3941 entries, 0 to 3940
Data columns (total 8 columns):
screen_name             3941 non-null object
text                    3941 non-null object
source                  3941 non-null object
retweet_count           3941 non-null int64
hashtags                647 non-null object
mentions_screen_name    2764 non-null object
searchTerm              3941 non-null object
disease                 3941 non-null object
dtypes: int64(1), object(7)
memory usage: 246.4+ KB


# ---------------------------------------------------------------------------------------------------


## Preprocessing the data

1) Removing `mentions`, `urls` and `hashtags` from all the tweets

2) Removing `punctuation`

3) Converting data to `lowercase`

In [88]:
import re
import string
import nltk
from nltk.stem.lancaster import LancasterStemmer
from nltk.corpus import stopwords
nltk.download("stopwords") 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

#removing mentions, URL and hashtag symbols from all the tweets followed by removing punctuations, lower-casing the data


mention = '(?<=^|(?<=[^a-zA-Z0-9-_\.]))@([A-Za-z-_\.]+[A-Za-z0-9])|(?<=^|(?<=[^a-zA-Z0-9-_\.])) @([A-Za-z-_\.]+[A-Za-z0-9])|(?<=^|(?<=[^a-zA-Z0-9-_\.])).@([A-Za-z-_\.]+[A-Za-z0-9])'
url = 'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
hashtag = '(?:^|\s)[＃#]{1}(\w+)'


# build the stop-word set once; stop words are not removed here, but the set
# is kept in "stops" in case it is required for further analysis
stops = set(stopwords.words("english"))

for i in range(len(data['text'])):

    x = data.loc[i,'text']
    x = re.sub(mention,'',x)                # remove @mentions
    x = re.sub(url,'',x)                    # remove URLs
    x = re.sub('#','',x)                    # remove the hashtag symbol, keep the word
    x = re.sub(r'[^\x00-\x7f]',r' ',x)      # replace non-ASCII characters
    x = re.sub("[^a-zA-Z]", " ", x)         # keep letters only
    x = re.sub('[ \t\n]+', ' ',x)           # collapse runs of whitespace
    x = x.lower().strip().rstrip(string.punctuation).lstrip(string.punctuation).strip()

    # split the lower-cased text into words
    words = x.split()

    # remove the retweet abbreviation 'rt' from tweets
    words = [w for w in words if w != 'rt']

    # drop single-character tokens and re-join the remaining words
    cleaned_word_list = " ".join([w for w in words if len(w)>1])

    # store the result in the new column 'Preprocesstext'
    data.loc[i,'Preprocesstext'] = cleaned_word_list

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\chirag\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
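
To see what the cleaning steps do to a single tweet, here is a minimal sketch. It assumes simplified patterns (`@\w+` for mentions, `https?://\S+` for URLs) instead of the longer regular expressions used in the cell above, and the function name `clean_tweet` and the example tweet are hypothetical.

```python
import re

def clean_tweet(text):
    """Simplified version of the cleaning steps (assumed patterns, not the notebook's exact regexes)."""
    text = re.sub(r'@\w+', '', text)          # drop mentions
    text = re.sub(r'https?://\S+', '', text)  # drop URLs
    text = text.replace('#', '')              # drop the hashtag symbol, keep the word
    text = re.sub(r'[^a-zA-Z]', ' ', text)    # keep letters only
    words = text.lower().split()
    # drop the retweet marker and single-character tokens
    words = [w for w in words if w != 'rt' and len(w) > 1]
    return ' '.join(words)

example = "RT @handmade_dorset: Feeling tired on #Levothyroxine 137 mcg https://t.co/abc"
cleaned = clean_tweet(example)
print(cleaned)  # feeling tired on levothyroxine mcg
```

Wrapping the steps in a function like this also makes it possible to apply them with `data['text'].apply(clean_tweet)` rather than looping over row indices.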


Let's have a look at the data

In [97]:
data.head(10)


Unnamed: 0,screen_name,text,source,retweet_count,hashtags,mentions_screen_name,searchTerm,disease,Preprocesstext
0,buysextoys71,Buy top levothyroxine online on our free compa...,twittbot.net,0,,,Levothyroxine,Cancer,buy top levothyroxine online on our free compa...
1,rachillionaire,@femmebostonian @Cherrell_Brown This has been ...,Twitter for iPhone,0,,femmebostonian Cherrell_Brown,Levothyroxine,Cancer,this has been super helpful switched from levo...
2,renae_dePerio,For a decade I was on Levothyroxine 137 mcg. I...,Twitter for iPhone,0,,,Levothyroxine,Cancer,for decade was on levothyroxine mcg was not se...
3,junglismastiff,@yourAAH Hi could you tell me which of your UK...,Twitter for Android,0,,yourAAH,Levothyroxine,Cancer,hi could you tell me which of your uk pharmaci...
4,notsydvicious,And my levothyroxine,Twitter for iPhone,0,,,Levothyroxine,Cancer,and my levothyroxine
5,buysextoys71,Buy top levothyroxine online on our free compa...,twittbot.net,0,,,Levothyroxine,Cancer,buy top levothyroxine online on our free compa...
6,Juniper_publish,Telephone Survey to Assess Substitution of Ele...,Twitter Web Client,0,,,Levothyroxine,Cancer,telephone survey to assess substitution of ele...
7,buysextoys71,Buy top levothyroxine online on our free compa...,twittbot.net,0,,,Levothyroxine,Cancer,buy top levothyroxine online on our free compa...
8,Princeaurora200,"Dear pfiser, Thyroxine.fatalities occur, Ny p...",Mobile Web (M2),0,,,Levothyroxine,Cancer,dear pfiser thyroxine fatalities occur ny pube...
9,Princeaurora200,"Dear pfiser, Thyroxine. Ny puberty. Growth hor...",Mobile Web (M2),0,,,Levothyroxine,Cancer,dear pfiser thyroxine ny puberty growth hormon...


In [98]:
print("We have a total of {} tweets".format(data.shape[0]))

We have a total of 3941 tweets


Removing duplicate items to avoid redundancy and then resetting the index of the dataframe

In [91]:
# Remove duplicates based on the cleaned text, then reset the index
dataUnique = data.drop_duplicates(['Preprocesstext'])
dataUnique = dataUnique.reset_index(drop = True)

In [92]:
dataUnique

Unnamed: 0,screen_name,text,source,retweet_count,hashtags,mentions_screen_name,searchTerm,disease,Preprocesstext
0,buysextoys71,Buy top levothyroxine online on our free compa...,twittbot.net,0,,,Levothyroxine,Cancer,buy top levothyroxine online on our free compa...
1,rachillionaire,@femmebostonian @Cherrell_Brown This has been ...,Twitter for iPhone,0,,femmebostonian Cherrell_Brown,Levothyroxine,Cancer,this has been super helpful switched from levo...
2,renae_dePerio,For a decade I was on Levothyroxine 137 mcg. I...,Twitter for iPhone,0,,,Levothyroxine,Cancer,for decade was on levothyroxine mcg was not se...
3,junglismastiff,@yourAAH Hi could you tell me which of your UK...,Twitter for Android,0,,yourAAH,Levothyroxine,Cancer,hi could you tell me which of your uk pharmaci...
4,notsydvicious,And my levothyroxine,Twitter for iPhone,0,,,Levothyroxine,Cancer,and my levothyroxine
5,Juniper_publish,Telephone Survey to Assess Substitution of Ele...,Twitter Web Client,0,,,Levothyroxine,Cancer,telephone survey to assess substitution of ele...
6,Princeaurora200,"Dear pfiser, Thyroxine.fatalities occur, Ny p...",Mobile Web (M2),0,,,Levothyroxine,Cancer,dear pfiser thyroxine fatalities occur ny pube...
7,Princeaurora200,"Dear pfiser, Thyroxine. Ny puberty. Growth hor...",Mobile Web (M2),0,,,Levothyroxine,Cancer,dear pfiser thyroxine ny puberty growth hormon...
8,microRNA_papers,Serum microRNA profiles in athyroid patients o...,dlvr.it,0,,,Levothyroxine,Cancer,serum microrna profiles in athyroid patients o...
9,ImagineThatBaby,RT @handmade_dorset: And there I was home agai...,Twitter Web Client,1,thyroidcancer thyroid hashimotos livingwell gr...,handmade_dorset,Levothyroxine,Cancer,and there was home again what an adventure thy...


An important step is to remove tweets posted for marketing purposes, as they may not contain the information we need.

In [101]:
# removing potential market tweets
market = ['buy','purchase','purchased','insurance','insured','sell','advertise','resell','retail','endorse','endorsed','discount','consumer','pharmacy']
dataUnique = dataUnique[~dataUnique.Preprocesstext.str.contains('|'.join(market))]

dataUnique = pd.DataFrame(dataUnique).reset_index(drop=True)
dataUnique.to_csv('Preprocessed_without_markets.csv',sep=',')
print("We are left with {} tweets after removing potential market tweets".format(dataUnique.shape[0]))

We are left with 1457 tweets after removing potential market tweets
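
One thing to keep in mind is that `str.contains` with a joined pattern matches substrings, so `buy` also matches `buyer`. The sketch below contrasts that with a word-boundary pattern. It is only an illustration on made-up tweets, with a shortened keyword list, and is not the filter used to produce the count above.

```python
import re

market = ['buy', 'sell', 'discount']  # shortened keyword list for illustration
tweets = [
    "buy top levothyroxine online",
    "my busy buyer friend felt dizzy",
    "huge discount today",
]

# substring matching: 'buy' is found inside 'buyer'
substring_pattern = re.compile('|'.join(market))
kept_substring = [t for t in tweets if not substring_pattern.search(t)]

# whole-word matching: \b requires a word boundary on both sides
word_pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, market)) + r')\b')
kept_whole_word = [t for t in tweets if not word_pattern.search(t)]

print(kept_substring)   # []
print(kept_whole_word)  # ['my busy buyer friend felt dizzy']
```

Whether the stricter or the looser match is preferable depends on how aggressively marketing tweets should be filtered.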


# ---------------------------------------------------------------------------------------------------


## Sentiment Analysis Using TextBlob (Easiest Way to Do Sentiment Analysis)

Let's compute the `polarity` and `subjectivity` of the normalized tweet text and add each as a column in our dataframe.

`Polarity` : a score from -1 (negative) to +1 (positive) that reflects the positive and negative terms in a sentence.

`Subjectivity` : a score from 0 (objective) to 1 (subjective); a subjective sentence expresses personal feelings, views, or beliefs.

In [43]:
from textblob import TextBlob

dataUnique['polarity'] = dataUnique.apply(lambda x: TextBlob(x['Preprocesstext']).sentiment.polarity, axis=1)
dataUnique['subjectivity'] = dataUnique.apply(lambda x: TextBlob(x['Preprocesstext']).sentiment.subjectivity, axis=1)
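
To build some intuition for lexicon-based polarity, here is a toy sketch. It is not TextBlob's actual algorithm: the lexicon, its scores and the function name `toy_polarity` are made up for illustration. It only shows the general idea of averaging the scores of known sentiment words.

```python
# Hypothetical mini-lexicon: word -> polarity score in [-1, 1]
LEXICON = {'super': 0.25, 'helpful': 0.5, 'bad': -0.75, 'great': 0.75}

def toy_polarity(text):
    """Average the scores of the words found in the lexicon (0.0 if none are found)."""
    scores = [LEXICON[w] for w in text.split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("this has been super helpful"))            # 0.375
print(toy_polarity("levothyroxine and ringing in the ears"))  # 0.0
```

The second tweet describes a side effect but contains no word from the lexicon, so it scores as neutral. A general-purpose sentiment lexicon can miss drug-related complaints in the same way.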

In [104]:
dataUnique.head(10)

Unnamed: 0,screen_name,text,source,retweet_count,hashtags,mentions_screen_name,searchTerm,disease,Preprocesstext
0,rachillionaire,@femmebostonian @Cherrell_Brown This has been ...,Twitter for iPhone,0,,femmebostonian Cherrell_Brown,Levothyroxine,Cancer,this has been super helpful switched from levo...
1,renae_dePerio,For a decade I was on Levothyroxine 137 mcg. I...,Twitter for iPhone,0,,,Levothyroxine,Cancer,for decade was on levothyroxine mcg was not se...
2,junglismastiff,@yourAAH Hi could you tell me which of your UK...,Twitter for Android,0,,yourAAH,Levothyroxine,Cancer,hi could you tell me which of your uk pharmaci...
3,notsydvicious,And my levothyroxine,Twitter for iPhone,0,,,Levothyroxine,Cancer,and my levothyroxine
4,Juniper_publish,Telephone Survey to Assess Substitution of Ele...,Twitter Web Client,0,,,Levothyroxine,Cancer,telephone survey to assess substitution of ele...
5,Princeaurora200,"Dear pfiser, Thyroxine.fatalities occur, Ny p...",Mobile Web (M2),0,,,Levothyroxine,Cancer,dear pfiser thyroxine fatalities occur ny pube...
6,Princeaurora200,"Dear pfiser, Thyroxine. Ny puberty. Growth hor...",Mobile Web (M2),0,,,Levothyroxine,Cancer,dear pfiser thyroxine ny puberty growth hormon...
7,microRNA_papers,Serum microRNA profiles in athyroid patients o...,dlvr.it,0,,,Levothyroxine,Cancer,serum microrna profiles in athyroid patients o...
8,ImagineThatBaby,RT @handmade_dorset: And there I was home agai...,Twitter Web Client,1,thyroidcancer thyroid hashimotos livingwell gr...,handmade_dorset,Levothyroxine,Cancer,and there was home again what an adventure thy...
9,medschat,Levothyroxine And Ringing In The Ears https://...,MedsChat.com,0,,,Levothyroxine,Cancer,levothyroxine and ringing in the ears


## Building a Sentiment Classifier from Hand-Labelled Tweets

The built-in sentiment function used above performed very poorly on these tweets, so the next section uses a sentiment file that I labelled by hand. With that manual labelling done, let's build a model that can label tweets on its own.

Files / Dataframes:
    1) `sentiment` : Hand-labelled sentiment file.
    2) `sentiment_test` : Another set of tweets, which the model will label based on its training.

a) We apply `LabelEncoder` to the `Sentiment` column, which maps the values `positive` and `negative` to integers (say `0` and `1`), and then convert the result to type `string`.

b) Then we shuffle the rows of the training data with `sample(frac=1)`.

c) We convert `Sentiment` in `sentiment_test` to a `categorical variable`, `label encode` it in the same way, and convert it to `string`.

d) We vectorize the tweets using `Term Frequency-Inverse Document Frequency`, i.e. `TF-IDF`. For further understanding, visit: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
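
Steps a), c) and d) can be sketched on toy data as follows. The example texts and labels are made up. The point is that both the label encoder and the vectorizer are fitted on the training data and then reused unchanged for the test data.

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# toy data (hypothetical tweets and labels)
train_text = ["this drug made me dizzy", "feeling great on the new dose", "terrible headache again"]
train_labels = ["negative", "positive", "negative"]
valid_text = ["dizzy and tired today"]
valid_labels = ["negative"]

le = LabelEncoder()
y_train = le.fit_transform(train_labels)  # learn the label -> integer mapping
y_valid = le.transform(valid_labels)      # reuse the same mapping

vect = TfidfVectorizer(min_df=1)
X_train = vect.fit_transform(train_text)  # learn vocabulary and IDF weights
X_valid = vect.transform(valid_text)      # words unseen in training are ignored

print(list(le.classes_), y_train.tolist(), y_valid.tolist())
print(X_train.shape, X_valid.shape)
```

Because the vocabulary is fixed at fit time, `X_train` and `X_valid` have the same number of columns, which the classifier requires.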


In [49]:
## Training model for sentiments
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

## Training dataset built on every iteration of testing
sentiment = pd.read_csv("DatasetsWithSentiments.csv", header=0, delimiter=",", quoting=0)
sentiment = sentiment.reset_index(drop=True)
## Testing data for sentiment classification
sentiment_test = pd.read_csv("Sentiment_test50.csv", header=0, delimiter=",", quoting=0, encoding='latin-1')

# Encode the training labels as integers, then store them as strings.
# astype(str) converts element-wise; str(series) would turn the whole
# column into one identical string for every row.
sentiment['Sentiment'] = sentiment['Sentiment'].astype('category')
sentiment['Sentiment'] = le.fit_transform(sentiment['Sentiment'])
sentiment['Sentiment'] = sentiment['Sentiment'].astype(str)

sentiment['text'] = sentiment.text.str.lower()
# shuffle the training rows
sentiment = sentiment.sample(frac=1, random_state=100)

# Encode the test labels with the encoder fitted on the training labels,
# so both sets share the same label-to-integer mapping.
sentiment_test['Sentiment'] = sentiment_test['Sentiment'].astype('category')
sentiment_test['Sentiment'] = le.transform(sentiment_test['Sentiment'])
sentiment_test['Sentiment'] = sentiment_test['Sentiment'].astype(str)

X_train_text = sentiment['text']
y_train = sentiment['Sentiment']

X_valid_text = sentiment_test['text']
y_valid = sentiment_test['Sentiment']

# Fit TF-IDF on the training text only, then reuse it for the validation text
vect = TfidfVectorizer(min_df = 1)
X_train = vect.fit_transform(X_train_text.values.astype('U'))
X_valid = vect.transform(X_valid_text.values.astype('U'))


<class 'pandas.core.series.Series'>


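Using `NaiveBayes` for sentiment classification

Why use `NaiveBayes`?

The `sentiment` features are now in vector format, so each tweet is a point in feature space and classification amounts to separating the classes in that space. The figure below, from the article linked underneath, contrasts linear and non-linear classification problems: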



In [109]:
%%html
<img src="https://sebastianraschka.com/images/blog/2014/naive_bayes_1/linear_vs_nonlinear_problems.png" width="200" height="200">


For more detail, see: https://sebastianraschka.com/Articles/2014_naive_bayes_1.html
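
As a rough illustration of Naive Bayes on TF-IDF vectors, here is a minimal sketch on a handful of made-up tweets. The texts and labels are hypothetical and far too few for a real model. The sketch only shows the fit and predict flow.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy training data (hypothetical)
texts = [
    "awful headache after my dose",
    "nausea and headache all day",
    "feeling great and full of energy",
    "this medicine is really helpful",
]
labels = ["negative", "negative", "positive", "positive"]

vect = TfidfVectorizer()
X = vect.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)

# classify a new tweet using the vocabulary learned above
new = vect.transform(["awful headache and nausea"])
pred = clf.predict(new)
proba = clf.predict_proba(new)
print(pred[0], dict(zip(clf.classes_, proba[0].round(3))))
```

`predict_proba` returns one probability per class in the order given by `clf.classes_`, which is handy when a confidence threshold is needed rather than a hard label.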

# ---------------------------------------------------------------------------------------------------


## Scoring and Metric Evaluation

Types of Metrics:

    1) Accuracy : the ratio of correctly predicted observations to the total number of observations.
    
    2) Confusion Matrix : a table that describes the performance of a model on a set of test data for which the true values are known.
    
    3) Classification Report : a text report showing the main classification metrics.
    
    4) Precision : the proportion of the data points the model labelled as relevant that actually were relevant.
    
    5) Recall : the number of correct results divided by the number of results that should have been returned.
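
The definitions above can be worked through by hand on a small example. The label lists below are made up, with `1` standing for the class of interest, and the confusion matrix uses the layout of rows for true labels and columns for predicted labels.

```python
# toy labels (hypothetical): 1 = class of interest, 0 = other class
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = (tp + tn) / len(y_true)   # correct predictions / all predictions
precision = tp / (tp + fp)           # of the predicted positives, how many were right
recall = tp / (tp + fn)              # of the actual positives, how many were found
confusion = [[tn, fp], [fn, tp]]     # rows: true 0/1, columns: predicted 0/1

print(confusion)
print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

The same numbers should come out of `accuracy_score`, `precision_score`, `recall_score` and `confusion_matrix` in `sklearn.metrics` when they are called with `y_true` first and `y_pred` second.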

In [115]:
from sklearn.metrics  import accuracy_score, classification_report, confusion_matrix, precision_score, recall_score, f1_score
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train, y_train)


p_valid = clf.predict(X_valid)
#print(accuracy_score(y_valid, p_valid))

sentiment_test['Sentiment_NB'] =np.array(p_valid) 
sentiment_test.to_csv('sentiment_test50_addedNBScore.csv', sep=',')




`DatasetsWithSentiments.csv` is the file containing the hand-labelled sentiments; it is loaded into the `Sentiment` dataframe.

Here, we extract all the `negative` tweets and remove those that refer to ongoing research or studies.

In [117]:
import pandas as pd
Sentiment = pd.read_csv("DatasetsWithSentiments.csv",header=0)

#Negative tweets
dataUniqueNeg = Sentiment[Sentiment.Sentiment=='negative']
# print(dataUniqueNeg.shape)

# removing the tweets about research or studies
# (drop missing text first so the boolean mask contains no NaN)
study = ['research','study','survey','clinical trial']
dataUniqueNeg = dataUniqueNeg.dropna(subset=['Preprocesstext'])
dataUniqueNeg = dataUniqueNeg[~dataUniqueNeg.Preprocesstext.str.contains('|'.join(study))]

dataUniqueNeg = dataUniqueNeg.reset_index(drop=True)
print("We are left with {} tweets.".format(dataUniqueNeg.shape[0]))

We are left with 139 tweets.


# ---------------------------------------------------------------------------------------------------


## Stemming and Lemmatization

`Stemming`: `Stemming` is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.

When we stem a mushroom, we chop off its stem and keep the cap that most people think of as the edible portion. Similarly, when we stem a word, we chop off its inflections and keep what hopefully represents the main essence of the word. Technically, it depends on the type of mushroom, and we’re throwing away the mushroom stems while keeping the word stems. Nonetheless, I hope the metaphor is useful.

`Lemmatization` : `Lemmatization` is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. Going back to our mushrooms, even an amateur chef knows that you shouldn’t just chop off the stems with a knife. Instead, you should carefully remove the stems, cutting around them with a paring knife or gently twisting them off. `Lemmatization` tries to take a similarly careful approach to removing inflections: it does not simply chop off inflections, but instead relies on a lexical knowledge base like WordNet to obtain the correct base forms of words.

Many thanks to Mr. Daniel Tunkelang for this article: https://queryunderstanding.com/stemming-and-lemmatization-6c086742fe45

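There are several `stemming` techniques available, out of which we are using `PorterStemmer` because, according to the discussion linked below:

a) It is one of the gentlest stemmers, so it alters the data less aggressively than, for example, Lancaster.

b) It is more computationally intensive than the alternatives, but not by a significant margin.

More info at: https://stackoverflow.com/questions/10554052/what-are-the-major-differences-and-benefits-of-porter-and-lancaster-stemming-alg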
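
A quick way to compare the two stemmers is to run both on a few words. The sketch below uses words that appear in the tweets. The Porter stems are the ones this notebook relies on later, while the Lancaster output is only printed for comparison. The lemmatizer is left out because it needs the WordNet data to be downloaded first.

```python
from nltk.stem import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

words = ["fatalities", "addicted", "sleeps"]
porter_stems = [porter.stem(w) for w in words]
lancaster_stems = [lancaster.stem(w) for w in words]

# print the stems side by side: original word, Porter stem, Lancaster stem
for w, p, l in zip(words, porter_stems, lancaster_stems):
    print(w, p, l)
```

With Porter, `fatalities` becomes `fatal` and `addicted` becomes `addict`, which is why those stems show up in the extracted adverse reactions further down.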



`General_ADR` is a CSV file that contains a list of `Adverse Reactions`. Steps followed:

1) Lowercasing

2) Lemmatizing the words and appending them to a list

3) Creating a dataframe and a set from that list.

In [68]:
import nltk
from nltk.stem.lancaster import LancasterStemmer
from nltk.corpus import stopwords, wordnet
from nltk.util import ngrams
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import sent_tokenize

nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")

ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

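# reading the file with common adverse effects

ADR = pd.read_csv("General_ADR.csv")
ADR['Adr'] = ADR.adr.str.lower()
ADR_stem = []
# iterate over the lower-cased column so that matching is case-insensitive
for adr in ADR['Adr']:
    adr = lemmatizer.lemmatize(adr)
    ADR_stem.append(adr)
    
# one-column dataframe; apply(set) collects the terms of each column into a set
ADR_stem = pd.DataFrame(ADR_stem)
ADRset = ADR_stem.apply(set)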

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\chirag\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\chirag\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\chirag\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
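
The idea behind the ADR set can be sketched without any files. The term list below is a hypothetical stand-in for `General_ADR.csv`, the lemmatization step is skipped, and `extract_adr` is an illustrative name rather than a function from this notebook.

```python
# hypothetical stand-in for the terms in General_ADR.csv
adr_terms = ["Insomnia", "Nausea", "Headache", "Fatigue"]

# lower-case once and store in a set for fast exact-match lookups
adr_set = {term.lower() for term in adr_terms}

def extract_adr(tokens):
    """Keep the tokens that exactly match a known adverse reaction."""
    return [t for t in tokens if t in adr_set]

tokens = "woke up with headache and nausea again".split()
found = extract_adr(tokens)
print(found)  # ['headache', 'nausea']
```

Lower-casing the terms matters because the preprocessed tweets are all lower-case, so `Headache` in the list would otherwise never match.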


# ---------------------------------------------------------------------------------------------------


Several functions are created in this block. They work as follows:

1) featureExtraction: Loops over each word in a tweet and keeps the word if it appears in the set of adverse reactions.

2) search: Searches the text for a word and retrieves the `n` words on either side of it, which are returned as two separate phrases.

3) phraseBuilder: Builds the phrases around each adverse reaction using the `search` function.

4) Drug_text_wostem : Lemmatizes the text (without stemming) and extracts the adverse reactions.

5) Drug_text : Lemmatizes and stems the text, then extracts the adverse reactions.

In [65]:
def featureExtraction(vocab):
    nwords = []

    for word in vocab:
        
        if any(word in s for s in ADRset):
            nwords.append(word)
            #featureVec = np.add(featureVec,model[word])
                
    return nwords


def search(text,search_word,n):
    
    word = r"\W*([\w]+)"
    phrase = re.search(r'{}\W*{}{}'.format(word*n,search_word, word*n), text)
    if phrase is not None:
        groups = phrase.groups()
        phrase1 = " ".join(groups[:n])
        phrase2 = search_word
        phrase3 = " ".join(groups[n:])
        finalphrase1 = phrase1 + ' '+ phrase2
        finalphrase2 = phrase2+' ' +phrase3
        return finalphrase1, finalphrase2

def phraseBuilder(text_drug):
    
    Symptoms = []
    if not text_drug.empty:
        for i in range(len(text_drug)):
            ADR = text_drug['ADRE'][i]
            text = text_drug['Preprocesstext'][i]
            finalphrase = []
            if len(ADR) and len(text)>0:
                for adr in ADR:
                    phrase1 = search(text,adr,2)
                    finalphrase.append(phrase1)
            Symptoms.append(finalphrase)
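        # assign the list of phrase lists directly (one entry per tweet);
        # np.array() on ragged nested lists raises an error in recent NumPy versions
        text_drug['Symptoms'] = Symptoms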
        return text_drug
##treating text with lemmatizer and extracting ADR
def Drug_text_wostem(text_drug):
    if not text_drug.empty:
        TwitterText_Drug = []
        for i in range(len(text_drug['Preprocesstext'])):
            tweet = text_drug.loc[i,'Preprocesstext']

            token =nltk.word_tokenize(tweet)
           
            # lemmatize the words (no stemming in this variant)
            text = [lemmatizer.lemmatize(t) for t in token]

            TwitterText_Drug.append(text)
        # assign the token lists directly (one list per tweet)
        text_drug['words'] = TwitterText_Drug

        #### calling the featureExtraction function to extract ADR from each tweet ###
        ADRE = []
        ADRTweet=[]
        for i in range(len(text_drug['words'])):
            ADR = featureExtraction(text_drug['words'][i])
            ADRE.append(ADR)

        #ADRE 
        text_drug['ADRE'] = ADRE
        text_drug = phraseBuilder(text_drug)
        text_drug = pd.DataFrame(text_drug)
        text_drug = text_drug[text_drug['ADRE'].map(lambda d: len(d)) > 0]
        return text_drug


def Drug_text(text_drug):
    if not text_drug.empty:
        TwitterText_Drug = []
        for i in range(len(text_drug['Preprocesstext'])):
            tweet = text_drug.loc[i,'Preprocesstext']

            token =nltk.word_tokenize(tweet)
            
            # lemmatize and stem the words
            text = [lemmatizer.lemmatize(t) for t in token]
            text = [ps.stem(t) for t in text]
            
            TwitterText_Drug.append(text)
        # assign the token lists directly (one list per tweet)
        text_drug['words'] = TwitterText_Drug

        #### calling the featureExtraction function to extract ADR from each tweet ###
        ADRE = []
        ADRTweet=[]
        for i in range(len(text_drug['words'])):
            ADR = featureExtraction(text_drug['words'][i])

            ADRE.append(ADR)

        #ADRE 
        text_drug['ADRE'] = ADRE
        text_drug = phraseBuilder(text_drug)
        text_drug = pd.DataFrame(text_drug)
        ##keeping the rows with the tweets with ADRE
        text_drug = text_drug[text_drug['ADRE'].map(lambda d: len(d)) > 0]
        return text_drug
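
To make the behaviour of `search` concrete, here is a standalone copy of its logic run on a made-up sentence. The only deliberate change is `re.escape` around the search word, which is an addition so that special characters in the word are treated literally.

```python
import re

def search(text, search_word, n):
    """Return (n words + search_word, search_word + n words), or None if there is no match."""
    word = r"\W*([\w]+)"
    # re.escape is an addition here; the notebook's version inserts the word as-is
    pattern = r"{}\W*{}{}".format(word * n, re.escape(search_word), word * n)
    match = re.search(pattern, text)
    if match is None:
        return None
    groups = match.groups()
    before = " ".join(groups[:n])
    after = " ".join(groups[n:])
    return before + " " + search_word, search_word + " " + after

print(search("it caused bad heart palpitations today", "heart", 2))
# ('caused bad heart', 'heart palpitations today')
print(search("heart hurts", "heart", 2))
# None
```

The second call shows a limitation: the pattern needs `n` words on both sides, so a match near the start or end of a tweet returns `None`. The search word is also matched as a substring, which is why phrases such as `fatal ities occur` appear in the results.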

### Lemmatized Text Only

Extracting the tweets that contain adverse reactions from the lemmatized text and saving them to a CSV.

In [None]:
a1 = Drug_text_wostem(dataUniqueNeg)
a1.to_csv('Tweets with ADR_with Lem.csv')

### Lemmatized and Stemmed Data

Extracting the tweets that contain adverse reactions from the lemmatized and stemmed text and saving them to a CSV.

In [None]:
a2 = Drug_text(dataUniqueNeg)
a2.to_csv('Tweets with ADR_with Lem and Stem.csv')

`FDA_Drug_effects.csv` contains some of the known adverse side effects for each drug.

We lemmatize and stem each word so it can be compared with the words extracted in the previous step.

In [73]:
specific = pd.read_csv('FDA_Drug_effects.csv')
specific_stem = []
for i in range(len(specific)):
    words = specific.SideEffects[i]
    # lower-case, split on whitespace, then lemmatize and stem each token
    text = words.lower().split()
    text = [lemmatizer.lemmatize(t) for t in text]
    text = [ps.stem(t) for t in text]
    specific_stem.append(text)
    
# one token list per drug
specific['SideEffectToken'] = specific_stem

Here, we create a new dataset by joining the tweets that contain adverse reactions with the drug-specific side-effect dataset on `searchTerm`.

In [75]:
result = pd.merge(a2, specific, on='searchTerm')
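
`pd.merge` performs an inner join by default, so tweets whose `searchTerm` has no entry in the side-effect file are dropped. A minimal sketch with made-up rows (the drug `Metformin` and all values are hypothetical):

```python
import pandas as pd

# toy stand-ins for the two dataframes being joined
tweets = pd.DataFrame({
    "searchTerm": ["Levothyroxine", "Levothyroxine", "Metformin"],
    "ADRE": [["insomnia"], ["weight"], ["nausea"]],
})
effects = pd.DataFrame({
    "searchTerm": ["Levothyroxine"],
    "SideEffects": ["insomnia,headache"],
})

# inner join: only rows whose searchTerm appears in both frames survive
merged = pd.merge(tweets, effects, on="searchTerm")
print(merged)
```

Here the `Metformin` tweet disappears from `merged`. Passing `how="left"` would keep it, with a missing value in `SideEffects`.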

In [123]:
result.head(10)

Unnamed: 0,screen_name,text,source,retweet_count,hashtags,mentions_screen_name,searchTerm,disease,Preprocesstext,Sentiment,ADRE,Symptoms,SideEffects,Unspecified_ADR
0,renae_dePerio,For a decade I was on Levothyroxine 137 mcg. I...,Twitter for iPhone,0,,,Levothyroxine,Cancer,for decade was on levothyroxine mcg was not se...,negative,"[up, bad, heart]","[(even bumped up, up to mcg), (it caused bad, ...","tummy trouble,boner kryptonite,chest infection...","[up, bad]"
1,Princeaurora200,"Dear pfiser, Thyroxine.fatalities occur, Ny p...",Mobile Web (M2),0,,,Levothyroxine,Cancer,dear pfiser thyroxine fatalities occur ny pube...,negative,[fatal],"[(pfiser thyroxine fatal, fatal ities occur)]","tummy trouble,boner kryptonite,chest infection...",[fatal]
2,scottisaacsmd,Treating borderline low thyroid levels can in...,LinkedIn,0,,,Levothyroxine,Cancer,treating borderline low thyroid levels can inc...,negative,"[low, thyroid, can, risk, death]","[(treating borderline low, low thyroid levels)...","tummy trouble,boner kryptonite,chest infection...","[low, thyroid, can, risk, death]"
3,MINXY_ROCHELLE,@TaylorHelga Ive always been prescribed levoth...,Twitter for Android,0,,TaylorHelga,Levothyroxine,Cancer,ive always been prescribed levothyroxine for u...,negative,"[thyroid, weight]","[(for underactive thyroid, thyroid still strug...","tummy trouble,boner kryptonite,chest infection...","[thyroid, weight]"
4,scchapterofacp,Levothyroxine therapy may increase risk for de...,Twitter Web Client,0,,,Levothyroxine,Cancer,levothyroxine therapy may increase risk for de...,negative,"[may, risk, death]","[(levothyroxine therapy may, may increase risk...","tummy trouble,boner kryptonite,chest infection...","[may, risk, death]"
5,sarahoelker,"The implicit ""OMG you're addicted"" message her...",TweetDeck,0,,,Levothyroxine,Cancer,the implicit omg you re addicted message here ...,negative,"[addict, addict, up]","[(you re addict, addict ed message), (you re a...","tummy trouble,boner kryptonite,chest infection...","[addict, addict, up]"
6,bringinsexybach,@C_GraceT I also couldn't stop taking my levot...,Twitter Web Client,0,,C_GraceT,Levothyroxine,Cancer,also couldn stop taking my levothyroxine witho...,negative,[thyroid],"[(on synthetic thyroid, thyroid hormone huh)]","tummy trouble,boner kryptonite,chest infection...",[thyroid]
7,priellan,would you get call someone with hypothyroidism...,TweetDeck,0,,,Levothyroxine,Cancer,would you get call someone with hypothyroidism...,negative,[thyroid],"[(with hypo thyroid, thyroid ism dependent)]","tummy trouble,boner kryptonite,chest infection...",[thyroid]
8,KijonaiaArt,"@kittykaya Nonono, he's sometimes up UNTIL 2-3...",Twitter Web Client,0,,kittykaya,Levothyroxine,Cancer,nonono he sometimes up until am actually just ...,negative,"[up, insomnia, back, blood]","[(he sometimes up, up until am), (might be ins...","tummy trouble,boner kryptonite,chest infection...","[up, insomnia, back, blood]"
9,KijonaiaArt,"@bunnieandbea See, Scott sleeps in so late no ...",Twitter Web Client,0,,bunnieandbea,Levothyroxine,Cancer,see scott sleeps in so late no matter when he ...,negative,"[sleep, up]","[(see scott sleep, sleep s in), (to wake up, u...","tummy trouble,boner kryptonite,chest infection...",[up]


This block of code compares the words extracted from each tweet with the bag of words for the corresponding `searchTerm`, i.e. the drug.

The words that do not match are stored in a column called `Unspecified_ADR`.

In [78]:
Unidentified_ADR = []
for i in range(len(result)):
 
    adr = result['ADRE'][i]
    UnADR = []
    for word in adr:
        # keep the word if it is not a substring of any known side-effect token
        if not any(word in s for s in result['SideEffectToken'][i]):
            UnADR.append(word)
    Unidentified_ADR.append(UnADR)
# one list of unmatched words per tweet
result['Unspecified_ADR'] = Unidentified_ADR
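
The comparison for a single tweet can be traced with plain lists. The side-effect tokens below are made up. The check `word in s` is a substring test against each token, which the last two lines illustrate.

```python
adr_words = ["up", "bad", "heart"]                   # words extracted from one tweet
side_effect_tokens = ["heart", "palpit", "nausea"]   # hypothetical stemmed side-effect tokens

# a word is "unspecified" if it is not contained in any side-effect token
unmatched = [w for w in adr_words
             if not any(w in tok for tok in side_effect_tokens)]
print(unmatched)  # ['up', 'bad']

# substring matching can hide short words: 'up' is contained in 'rupture'
hidden = any("up" in tok for tok in ["rupture"])
print(hidden)  # True
```

Comparing with `==` instead of `in` would make the match exact, at the cost of missing tokens that still carry punctuation or extra characters.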

Creating a CSV for the final result after dropping the `SideEffectToken` and `words` columns, since they are redundant with the other columns.


In [79]:
result = result.drop(['SideEffectToken','words'],axis = 1).reset_index(drop=True)
#result.to_csv('TweetsWithUnidentifiedSideEffects.csv', sep=',')
result.to_csv('TweetsWithUnidentifiedSideEffectsStem.csv', sep=',')
print('CSV with ADRs done.')

CSV with ADRs done.
