### Problem statement: tweet classification
Company X identifies adverse drug reactions (ADRs) after a drug’s release. Comprehensive knowledge of ADRs can reduce their detrimental impact on patients and on the health system. In practice, clinical trials cannot investigate every setting in which a drug will be used, so a drug’s adverse-effect profile cannot be fully characterized before its approval. Company X’s methods therefore continuously analyse frequently updated data sources, Twitter in particular, because of its large user base, demographic variability, and publicly available data.

ADR detection in social media requires automated methods to process the high data volume. It would greatly help Company X to automate the classification of future tweets as either ADR or NON-ADR, based on the drug, symptom, and effect they mention.

In [53]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# sklearn 
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
# models
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
# vectorisation objects
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

import nltk 
import string
import re
%matplotlib inline

pd.set_option('display.max_colwidth', 100)
import warnings
warnings.filterwarnings('ignore')
# requires a one-time nltk.download('stopwords')
stopword = nltk.corpus.stopwords.words('english')
# from sklearn.metrics import SCORERS 
# SCORERS.keys()

In [None]:
# words (irrelevant) +
# stop words
# hyperparameter tuning


In [2]:
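# Uncomment the line below to also remove irrelevant (domain filler) words from the data.
# clean_text lower-cases the text before filtering stopwords, so the entries must be lower case.
# stopword += ['therapy', 'associated', 'year', 'old', 'use', 'effect', 'following', 'man', 'woman', 'induced', 'conclusion', 'results', 'methods', 'background', 'patient']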

In [3]:
# Load dataset
def load_data():
    data = pd.read_csv('data_text.csv')
    return data
tweet_df = load_data()
print('Dataset size:',tweet_df.shape)
print('Columns are:',tweet_df.columns)

df = tweet_df.copy()

Dataset size: (23516, 3)
Columns are: Index(['ID', 'tweets', 'label'], dtype='object')


### Pre-processing text data
1. Remove punctuation and lower-case the text
2. Remove stopwords
3. Remove numbers
4. Stemming/lemmatization (the code below uses the Porter stemmer)
5. Tokenization


In [4]:
# stopword
stemmer = nltk.PorterStemmer()  # create once and reuse for every word

def clean_text(text):
    # 1. remove punctuation (character by character) and lower-case
    text_p = "".join([char for char in text if char not in string.punctuation]).lower()
    # 2. remove stopwords
    text_s = " ".join([word for word in text_p.split() if word not in stopword])
    # 3. remove numbers
    text_n = re.sub(pattern="[0-9]+", repl='', string=text_s)
    # 4. stemming (Porter stemmer, not lemmatization)
    text_l = " ".join([stemmer.stem(word) for word in text_n.split()])
    # 5. tokenization (raw string avoids an invalid-escape warning for \W)
    tokens = re.split(r'\W+', text_l)
    return tokens
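
The stemming step of `clean_text` uses `nltk.PorterStemmer`, which strips suffixes by rule and can return strings that are not dictionary words. A minimal sketch of its effect, assuming a few illustrative words that are not taken from a specific tweet (their stems do appear in the vocabulary printed further down):

```python
import nltk

# PorterStemmer is rule-based and needs no corpus download
stemmer = nltk.PorterStemmer()

# illustrative input words (an assumption, chosen only for demonstration)
words = ['accordingly', 'abruptly', 'absorptiometry']
stems = [stemmer.stem(word) for word in words]

# show each word next to its stem
print(dict(zip(words, stems)))
```

A lemmatizer such as `nltk.WordNetLemmatizer` would instead map words to dictionary forms, at the cost of needing the WordNet corpus.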

In [5]:
df.head()

| | ID | tweets | label |
|---|---|---|---|
| 0 | 413205 | Intravenous azithromycin-induced ototoxicity. | 1 |
| 1 | 528244 | Immobilization, while Paget's bone disease was present, and perhaps enhanced activation of dihyd... | 1 |
| 2 | 361834 | Unaccountable severe hypercalcemia in a patient treated for hypoparathyroidism with dihydrotachy... | 1 |
| 3 | 292240 | METHODS: We report two cases of pseudoporphyria caused by naproxen and oxaprozin. | 1 |
| 4 | 467101 | METHODS: We report two cases of pseudoporphyria caused by naproxen and oxaprozin. | 1 |


In [6]:
# df.tweets.apply(clean_text)

In [7]:
# ### vectorisation
# rank = ['This is blue sky', 'This is not blue sky', 'third']
# 6 distinct words

# ### vectorisation concept
# # sn  this  is  blue  sky  not  third
# # 1.  1     1   1     1    0    0
# # 2.  1     1   1     1    1    0
# # 3.  0     0   0     0    0    1
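
The table sketched in the comments above can be reproduced with `CountVectorizer` on the same three sentences. This is a minimal sketch; the variable names `docs`, `columns`, and `bow` are illustrative. Note that scikit-learn orders the columns alphabetically rather than by first appearance.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['This is blue sky', 'This is not blue sky', 'third']

vectoriser = CountVectorizer()       # default analyzer lower-cases and extracts word tokens
X = vectoriser.fit_transform(docs)   # sparse document-term matrix of counts

# vocabulary_ maps each word to its column index; order the words by that index
columns = sorted(vectoriser.vocabulary_, key=vectoriser.vocabulary_.get)
bow = pd.DataFrame(X.toarray(), columns=columns, index=['1.', '2.', '3.'])
print(bow)
```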

In [8]:
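vectoriser = CountVectorizer()
X = vectoriser.fit_transform(df['tweets'])
print(X.toarray())  # dense copy of the sparse matrix; memory-heavy on a corpus this size
print(vectoriser.get_feature_names())  # use get_feature_names_out() on scikit-learn >= 1.2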

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [9]:
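# a callable analyzer replaces scikit-learn's own preprocessing and tokenization
vectoriser = CountVectorizer(analyzer=clean_text)
X = vectoriser.fit_transform(df['tweets'])
print(X.toarray())
print(vectoriser.get_feature_names())  # use get_feature_names_out() on scikit-learn >= 1.2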

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
['a', 'aa', 'aaa', 'aav', 'ab', 'aba', 'abab', 'ababc', 'abacavir', 'abacavirlamivudin', 'abandon', 'abat', 'abdomen', 'abdomin', 'abdominocentesi', 'abdominopelv', 'abdominu', 'abemayor', 'aberr', 'abil', 'abl', 'ablat', 'ablc', 'ablmut', 'abmt', 'abnorm', 'abo', 'aboincompat', 'abolish', 'abomismatch', 'abort', 'abovement', 'abpa', 'abpct', 'abroad', 'abrog', 'abrupt', 'abruptli', 'abscess', 'abscessu', 'absenc', 'absent', 'absolut', 'absorb', 'absorpt', 'absorptiometri', 'abstain', 'abstin', 'abstract', 'abund', 'abus', 'abvd', 'ac', 'acad', 'academ', 'acanthameb', 'acanthameoba', 'acanthamoeba', 'acantholysi', 'acanthosi', 'acarbos', 'acc', 'acceler', 'accelerometri', 'accentu', 'accept', 'access', 'accessori', 'accid', 'accident', 'accol', 'accommod', 'accompani', 'accomplic', 'accomplish', 'accord', 'accordingli', 'account', 'accredit', 'accumul', 'accur', 'accu

In [10]:
# print(len(vectoriser.get_feature_names()))

In [11]:
# ### n-grams
# 'This is severe case'
# 'This is not severe case'
# 'This is good', 'This is not good'
# ['this', 'is', 'not', 'good']                  # unigrams (n = 1)
# ['this is', 'is not', 'is good', 'not good']   # bigrams (n = 2)
# 'United States of America'
In [12]:
### n_grams in nlp (read)

In [13]:
# Note: ngram_range is ignored when analyzer is a callable, so these features are still unigrams
vectoriser = CountVectorizer(analyzer=clean_text, ngram_range=(2,2))
X = vectoriser.fit_transform(df['tweets'])
print(X.toarray())
# print(vectoriser.get_feature_names())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
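
scikit-learn applies `ngram_range` only when it builds the analyzer itself; a callable passed as `analyzer` is used as-is. The sketch below contrasts the two ways of plugging in a custom function. `simple_tokens` is a hypothetical stand-in for `clean_text`, used so the example runs without the NLTK stopword list. Switching the notebook's cells to the `tokenizer` route would change the feature set and therefore the scores recorded here.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['This is good', 'This is not good']

def simple_tokens(text):
    # hypothetical stand-in for clean_text: lower-case and split on whitespace
    return text.lower().split()

# callable analyzer: ngram_range is ignored, the features stay unigrams
as_analyzer = CountVectorizer(analyzer=simple_tokens, ngram_range=(2, 2)).fit(docs)

# callable tokenizer with the default 'word' analyzer: ngram_range is applied
as_tokenizer = CountVectorizer(tokenizer=simple_tokens, token_pattern=None,
                               ngram_range=(2, 2)).fit(docs)

print(sorted(as_analyzer.vocabulary_))
print(sorted(as_tokenizer.vocabulary_))
```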


In [14]:
### tf-idf => term frequency - inverse document frequency  (read)
# Note: ngram_range has no effect when analyzer is a callable, so the features are unigrams
vectoriser = TfidfVectorizer(analyzer=clean_text, ngram_range=(2,2))
X = vectoriser.fit_transform(df['tweets'])
print(X.toarray())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
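
With scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), each count is multiplied by `idf(t) = ln((1 + n) / (1 + df(t))) + 1` and every row is then scaled to unit length. The sketch below, on the three toy sentences used earlier, checks that formula by hand. It also shows that `TfidfVectorizer` gives the same matrix as `CountVectorizer` followed by `TfidfTransformer`, the combination used in the pipelines later on.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ['This is blue sky', 'This is not blue sky', 'third']

counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts).toarray()   # counts -> tf-idf
one_step = TfidfVectorizer().fit_transform(docs).toarray()      # text -> tf-idf directly

# manual computation with the default smoothed idf and L2 row normalisation
tf = counts.toarray().astype(float)
n_docs = tf.shape[0]
doc_freq = (tf > 0).sum(axis=0)                  # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

print(np.allclose(one_step, two_step), np.allclose(manual, one_step))
```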


In [15]:
df.shape[0] # no. of rows

23516

In [16]:
len(vectoriser.get_feature_names()) # no. of columns (use get_feature_names_out() on scikit-learn >= 1.2)

14330

In [17]:
# pd.DataFrame(X.toarray()[:100]).to_csv('data.csv')

In [18]:
list(nltk.ngrams(['This','is','blue','sky','not'], n = 1))

[('This',), ('is',), ('blue',), ('sky',), ('not',)]

In [19]:
list(nltk.ngrams(['This','is','blue','sky','not'], n = 2))

[('This', 'is'), ('is', 'blue'), ('blue', 'sky'), ('sky', 'not')]

In [20]:
list(nltk.ngrams(['This','is','blue','sky','not'], n = 3))

[('This', 'is', 'blue'), ('is', 'blue', 'sky'), ('blue', 'sky', 'not')]

In [22]:
# ### Modeling
# Logistic regression
# SVC
# Naive Bayes classification
# KNN
# RandomForestClassifier
In [None]:
### which model works better on text data?

### Model Pipelines

In [23]:
## splitting the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df.tweets, df.label, test_size = 0.2, stratify = df.label, random_state = 42)

In [24]:
# compare three estimators on raw counts with 3-fold cross-validated ROC AUC
for est in [RandomForestClassifier(random_state=42), MultinomialNB(), GradientBoostingClassifier()]:
    pipe = Pipeline(steps=[('vectorisation',CountVectorizer(analyzer=clean_text)),
                            ('estimator',est)])
    scores = cross_val_score(pipe,X_train,y_train, cv = 3, scoring= 'roc_auc')
    print(scores)              # per-fold AUC
    print(np.average(scores))  # mean AUC
    print(np.std(scores))      # spread across folds

[0.90429804 0.91093464 0.90490249]
0.9067117243856068
0.002996234061521033
[0.89502133 0.88590367 0.89795895]
0.8929613179034078
0.005132585854676006
[0.80249736 0.81243242 0.81809753]
0.8110091055518168
0.006447775342958134


In [None]:
# overfitting
# Ex1: trainset - 0.9
# Ex2: trainset - 0.85
# bagging -> bootstrap aggregation
# boosting
# train accuracy - test accuracy = delta; a high delta suggests overfitting
# what value of delta should draw the line for overfitting?
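
The delta in the notes above is the gap between train and test accuracy. There is no universal cut-off for it, so the sketch below only shows how to measure it. It assumes synthetic, noisy data from `make_classification` as a stand-in for the tweet features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic stand-in data with noisy labels (an assumption, not the tweet dataset)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# fully grown trees can memorise the training set, noise included
model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
train_acc = accuracy_score(y_tr, model.predict(X_tr))
test_acc = accuracy_score(y_te, model.predict(X_te))
delta = train_acc - test_acc
print(f'train={train_acc:.3f} test={test_acc:.3f} delta={delta:.3f}')
```

Limiting tree depth or raising `min_samples_leaf` usually shrinks the gap, sometimes at the cost of test accuracy, which is why delta is best read alongside the test score.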

In [25]:
# same comparison with tf-idf weighting applied to the counts
for est in [RandomForestClassifier(random_state=42), MultinomialNB(), GradientBoostingClassifier()]:
    pipe = Pipeline(steps=[('vectorisation',CountVectorizer(analyzer=clean_text)),
                             ('tfidf', TfidfTransformer()),
                            ('estimator',est)])
    scores = cross_val_score(pipe,X_train,y_train, cv = 3, scoring= 'roc_auc')
    print(scores)              # per-fold AUC
    print(np.average(scores))  # mean AUC
    print(np.std(scores))      # spread across folds

[0.90579584 0.91128052 0.91004406]
0.9090401404257159
0.0023489468307391114
[0.89234783 0.88287219 0.89083205]
0.8886840250971507
0.004155914722731575
[0.80712117 0.81532332 0.82052856]
0.8143243501489205
0.005518934562165035


In [49]:
# df.shape

In [37]:
## Base model
rf= Pipeline(steps=[('vectorisation',CountVectorizer(analyzer=clean_text)),
                         ('tfidfTransformation',TfidfTransformer()),
                        ('estimator',RandomForestClassifier(random_state=42, n_jobs= -1))])
rf_pip = rf.fit(X_train,y_train) ## training
print(accuracy_score(y_train, rf_pip.predict(X_train))) ## train accuracy
print(accuracy_score(y_test, rf_pip.predict(X_test))) ## test accuracy

0.9999468424409951
0.8756377551020408


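## 1. Hyperparameter tuning using GridSearchCV

Hyperparameters of RandomForestClassifier():
 - n_estimators: number of trees in the forest
 - max_features: number of features considered when looking for the best split
 - max_depth: maximum depth of each tree (CART)
 - min_samples_leaf: minimum number of samples required at a leaf node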

In [None]:
# for est in [1,50,100,500,1000,2000]:
#     {'1':0.2, '50':0.5,'100':0.8,'500':0.79,'1000':0.70,'2000':0.69}

In [None]:
# for est in [100,150,200,250,300,350,400,450,500]:
# #     {'1':0.2, '50':0.5,'100':0.8,'500':0.79,'1000':0.70,'2000':0.69}

In [None]:
# for est in [250,260,270,280,290,300]:
# #     {'1':0.2, '50':0.5,'100':0.8,'500':0.79,'1000':0.70,'2000':0.69}

In [None]:
# for est in [260,261,262,263,264,265,266,267,268,269,270]:
# #     {'1':0.2, '50':0.5,'100':0.8,'500':0.79,'1000':0.70,'2000':0.69}
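
The commented loops above narrow the range of `n_estimators` step by step: a coarse grid first, then finer grids around the best value. A minimal sketch of that coarse-to-fine idea follows. It assumes synthetic data, and the helper `best_n_estimators` and the grids are hypothetical, kept small so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in data (an assumption, not the tweet dataset)
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

def best_n_estimators(candidates):
    # hypothetical helper: pick the tree count with the best cross-validated ROC AUC
    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          param_grid={'n_estimators': candidates},
                          cv=3, scoring='roc_auc')
    search.fit(X, y)
    return search.best_params_['n_estimators']

# first pass: wide, sparse grid (smaller than in the notes, to run fast)
coarse = [1, 10, 50, 100]
best_coarse = best_n_estimators(coarse)

# second pass: finer grid around the coarse winner
step = max(best_coarse // 4, 1)
fine = sorted({max(best_coarse - step, 1), best_coarse, best_coarse + step})
best_fine = best_n_estimators(fine)
print(best_coarse, best_fine)
```

Each pass costs one full cross-validation per candidate, so the grids are best kept short at every stage.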

In [71]:
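## sample tuning code using GridSearchCV
## note: range(1, 2) contains only the value 1, so this grid has a single combination
params = {'estimator__n_estimators':range(1,2),
         'estimator__max_depth':range(1,2)}

gs_clf = GridSearchCV(estimator = rf, param_grid=params, cv = 3, scoring='roc_auc', n_jobs = -1)
gs_clf.fit(X_train,y_train)
## AUC below is computed from hard class labels; pass gs_clf.predict_proba(X_test)[:, 1] for a ranking-based AUC
print(roc_auc_score(y_test, gs_clf.predict(X_test))) ## auc score
print(accuracy_score(y_test, gs_clf.predict(X_test))) ## accuracy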

0.5003663003663004
0.7100340136054422


In [72]:
# pd.DataFrame(gs_clf.cv_results_).sort_values(by = 'mean_test_score', ascending = False).to_csv('result.csv')

In [73]:
gs_clf.best_params_

{'estimator__max_depth': 1, 'estimator__n_estimators': 1}
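
An AUC of about 0.50 alongside an accuracy of 0.71 is consistent with a single depth-1 tree that predicts nearly every tweet as the majority class, although the notebook does not print the predictions to confirm this. It also matters that `roc_auc_score` was given hard labels: in that case the AUC reduces to balanced accuracy. The sketch below illustrates both points on synthetic, imbalanced data, which is an assumption rather than the tweet dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic, imbalanced stand-in data (an assumption, not the tweet dataset)
X, y = make_classification(n_samples=600, n_features=10, weights=[0.7, 0.3], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# same degenerate setting as the sample grid: one tree of depth 1
clf = RandomForestClassifier(n_estimators=1, max_depth=1, random_state=42).fit(X_tr, y_tr)

hard = clf.predict(X_te)                  # 0/1 class labels
scores = clf.predict_proba(X_te)[:, 1]    # estimated probability of the positive class

auc_hard = roc_auc_score(y_te, hard)      # equals balanced accuracy for binary labels
auc_scores = roc_auc_score(y_te, scores)  # uses the ranking of the scores
print(auc_hard, balanced_accuracy_score(y_te, hard), auc_scores)
```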

## 2. Tuning CountVectorizer(): change the n-gram range and record the scores
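
One way to record a score per n-gram range is to put `ngram_range` into the grid, using the pipeline step name as a prefix. The sketch below assumes a tiny made-up corpus and the default analyzer, because `ngram_range` has no effect when `analyzer` is a callable such as `clean_text`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# tiny made-up corpus (an assumption, only to make the sketch runnable)
texts = ['drug caused severe rash', 'no rash after drug', 'severe nausea induced by drug',
         'drug was well tolerated', 'headache induced by treatment', 'treatment was well tolerated',
         'rash caused by treatment', 'no nausea after treatment', 'drug induced severe headache',
         'no headache after drug', 'treatment caused nausea', 'patient tolerated drug well']
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

pipe = Pipeline(steps=[('vectorisation', CountVectorizer()),
                       ('estimator', MultinomialNB())])

# parameters of a pipeline step are addressed as <step name>__<parameter>
params = {'vectorisation__ngram_range': [(1, 1), (1, 2), (2, 2)]}
search = GridSearchCV(pipe, param_grid=params, cv=3, scoring='accuracy')
search.fit(texts, labels)

# one mean cross-validated score per n-gram range
for candidate, score in zip(search.cv_results_['params'], search.cv_results_['mean_test_score']):
    print(candidate['vectorisation__ngram_range'], round(score, 3))
```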

## 3. Feature engineering

## 4. Feature selection

## 5. PCA