## 2. Sentiment Analysis - Modeling
---

In [2]:
#importing libraries 

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns 

from sklearn.model_selection import train_test_split, GridSearchCV 
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier,GradientBoostingClassifier,RandomForestClassifier

import joblib
import xgboost as xgb


In [9]:
# !pip install xgboost

Collecting xgboost
  Downloading https://files.pythonhosted.org/packages/c9/73/884e2d50ba8fc95bcb92c47910a8fbb8c80627f53a76250080c780c91185/xgboost-1.0.1-py3-none-win_amd64.whl (24.6MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.0.1


In [2]:
# !pip install joblib

## Modeling
---

In this segment, we will use the following models to classify the tweets. 

##### Models Used: 
1. Logistic Regression
2. Multinomial Naive Bayes
3. Decision Tree
4. Random Forest
5. AdaBoost
6. Gradient Boosting
7. XGBoost

Each model is tuned with a grid search over the count vectoriser, TF-IDF transformer and model hyperparameters in an attempt to improve accuracy.

##### Metric to Validate Model: 

Accuracy is used as the metric here because misclassifying a tweet is treated as equally costly regardless of its sentiment class.

##### Outcome:

The Logistic Regression model performed best, correctly classifying 78.1% of the tweets in the test set, against a baseline accuracy of 62.7%.

In [3]:
filepath = '../datasets/tweets_clean_1.csv'

In [4]:
df_1 = pd.read_csv(filepath)

In [5]:
df_1.head()

Unnamed: 0,text,airline_sentiment
0,said,1
1,plu ad commerci experi tacki,2
2,today must mean need take anoth trip,1
3,realli aggress blast obnoxi entertain guest fa...,0
4,realli big bad thing,0


### Modeling 

General Approach
- Split data into X and y 
- Train test split for model validation
- Hyperparameter optimisation for count vectoriser, Tfidf transformer and model hyperparameters


##### Split data into `X` and `y`.

In [6]:
X = df_1['text']
y = df_1['airline_sentiment']

In [7]:
X.head()

0                                                 said
1                         plu ad commerci experi tacki
2                 today must mean need take anoth trip
3    realli aggress blast obnoxi entertain guest fa...
4                                 realli big bad thing
Name: text, dtype: object

#### Baseline Accuracy 
Baseline accuracy is 62.7%, the share of the majority class (`0`). A model that always predicts class `0` would reach this score.

In [8]:
y.value_counts(normalize=True)

0    0.627149
1    0.211698
2    0.161153
Name: airline_sentiment, dtype: float64
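
The baseline can also be computed directly rather than read off `value_counts`. The sketch below is a minimal illustration: `majority_baseline` is a hypothetical helper, and `demo_labels` is made-up data that only mimics the rounded class shares shown above. In the notebook, the helper could be called on `y` instead.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Hypothetical labels mimicking the class shares above (63% / 21% / 16%)
demo_labels = [0] * 63 + [1] * 21 + [2] * 16

majority_label, baseline_accuracy = majority_baseline(demo_labels)
print(f"Majority class: {majority_label}, baseline accuracy: {baseline_accuracy:.2f}")
```

Any model worth keeping should beat this figure on the test set.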

##### Function for model validation

In [10]:
def run_model(model_type, X,y):
    
    #train test split data for model validation
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42,
                                                    stratify = y)

    #specifying models
    models = {'lr': LogisticRegression(), 
              'nb': MultinomialNB(),
              'dt': DecisionTreeClassifier(),
              'rf': RandomForestClassifier(),
              'ada': AdaBoostClassifier(base_estimator=DecisionTreeClassifier()),
              'gb': GradientBoostingClassifier(),
              'xgb': xgb.XGBClassifier(objective='multi:softmax',num_class=3) 
             }
    
    #creating a pipeline
    pipe = Pipeline([
        ('cvec', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        (model_type, models[model_type])
    ])
    
    #pipeline parameters
    pipe_params = {
    #'cvec__max_df' was listed twice; only the last value is kept by the dict
    'cvec__max_features':[2500,3000,3500,5000],
    'cvec__max_df':[0.9,0.95],
    'cvec__ngram_range':[(1,1),(1,2)],
    'tfidf__use_idf':[True,False],
    }
        
    #additional parameters for each model 
    if model_type == 'nb':
        pipe_params.update({'nb__alpha':[1,2]
                           })
        
    elif model_type == 'lr':
        pipe_params.update({'lr__penalty':['none','l2'],
                            'lr__max_iter':[1000]
                           })
        
    elif model_type == 'dt':
        pipe_params.update({'dt__max_depth':[3,5,10],
                           'dt__min_samples_split':[5,10,20],
                           'dt__min_samples_leaf':[10,50,70]         #changed
                           })
    
    elif model_type == 'rf':
        pipe_params.update({'rf__n_estimators': [100,150,200],
                            'rf__max_depth':[3,5,7]                  #changed
                           })
    elif model_type == 'ada':
        pipe_params.update({'ada__n_estimators':[50,100],
                            'ada__learning_rate':[0.9,1],             #changed param
                            'ada__base_estimator__max_depth':[1,2,3]
                           })
    elif model_type == 'gb':
        pipe_params.update({'gb__n_estimators': [50,100,150],
                            'gb__learning_rate':[0.08, 0.1, 0.12],       #changed param
                            'gb__max_depth':[1,2,3]
        })
        
    elif model_type == 'xgb':
        pipe_params.update({'xgb__max_depth':[1,2,3]
        })


    #grid search 
    gs = GridSearchCV(estimator = pipe, 
                        param_grid = pipe_params,
                        verbose = 1,
                        n_jobs = -1, 
                        cv = 5)
    
    print("Grid Search for " + model_type)
    print('===========================================================================')
    
    gs.fit(X_train,y_train)

    print(f"Best Score: {gs.best_score_}")
    print(f"Best Parameters: {gs.best_params_}")
    
    #train model on entire train data set using  best parameters
    pipeline_final= Pipeline([
    ('cvec', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    (model_type, models[model_type])
    ])
    
    pipeline_final.set_params(**gs.best_params_)
    final_model = pipeline_final.fit(X_train,y_train)
    
    print("Final Model Score")
    print()
    print(f'Train Score:{final_model.score(X_train,y_train)}')
    print(f'Test Score:{final_model.score(X_test,y_test)}')
    
    #save model with best parameters for use later
    filename = 'model_' + model_type +'.pkl'
    joblib.dump(final_model,filename)
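
With the default `refit=True`, `GridSearchCV` already refits the best pipeline on the full training data and exposes it as `best_estimator_`, so the manual rebuild at the end of `run_model` is one option rather than a requirement. The sketch below illustrates this on a hypothetical toy corpus (`texts` and `labels` are made up for the example and are not the tweet data).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = positive, 0 = negative
texts = ["great flight", "love crew", "great crew", "love flight",
         "bad delay", "lost bag", "bad bag", "lost delay"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])

# refit=True is the default, so the best pipeline is refit on all of the data
gs = GridSearchCV(pipe, {'nb__alpha': [1, 2]}, cv=2)
gs.fit(texts, labels)

# The refit pipeline can be used (or saved with joblib) directly
final_model = gs.best_estimator_
predictions = final_model.predict(["great crew", "lost bag"])
print(predictions)
```

Using `best_estimator_` avoids fitting the winning configuration a second time, which matters for the slower models.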


### Logistic Regression Model


In [12]:
run_model('lr',X,y)

Grid Search for lr
Fitting 5 folds for each of 64 candidates, totalling 320 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   36.7s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  4.5min
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed:  7.4min finished


Best Score: 0.7782648401826484
Best Parameters: {'cvec__max_df': 0.9, 'cvec__max_features': 5000, 'cvec__ngram_range': (1, 2), 'lr__max_iter': 1000, 'lr__penalty': 'l2', 'tfidf__use_idf': False}
Final Model Score

Train Score:0.8485844748858448
Test Score:0.7814297452752671
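
The "64 candidates, totalling 320 fits" line in the log follows from the grid sizes: the number of candidates is the product of the number of values per parameter, and each candidate is fitted once per fold. A minimal sketch, with the grid copied from `run_model` for the `lr` case (`lr_grid` is a name introduced here for illustration):

```python
from math import prod

# Grid used for the logistic regression run, copied from run_model
lr_grid = {
    'cvec__max_features': [2500, 3000, 3500, 5000],
    'cvec__max_df': [0.9, 0.95],
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': [True, False],
    'lr__penalty': ['none', 'l2'],
    'lr__max_iter': [1000],
}

n_folds = 5
n_candidates = prod(len(values) for values in lr_grid.values())
n_fits = n_candidates * n_folds
print(f"{n_candidates} candidates, {n_fits} fits")
```

The same arithmetic explains why the decision tree and gradient boosting grids, with three model parameters of three values each, reach 864 candidates.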


### Naive Bayes Model

In [13]:
run_model('nb',X,y)

Grid Search for nb
Fitting 5 folds for each of 64 candidates, totalling 320 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    3.3s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   18.0s
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed:   31.3s finished


Best Score: 0.7365296803652968
Best Parameters: {'cvec__max_df': 0.9, 'cvec__max_features': 2500, 'cvec__ngram_range': (1, 2), 'nb__alpha': 1, 'tfidf__use_idf': False}
Final Model Score

Train Score:0.7754337899543379
Test Score:0.7376061353053958


### Decision Tree Model

In [14]:
run_model('dt',X,y)

Grid Search for dt
Fitting 5 folds for each of 864 candidates, totalling 4320 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   11.9s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   41.2s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed:  4.2min
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed:  5.7min
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed:  7.1min
[Parallel(n_jobs=-1)]: Done 4320 out of 4320 | elapsed:  7.8min finished


Best Score: 0.6889497716894978
Best Parameters: {'cvec__max_df': 0.9, 'cvec__max_features': 5000, 'cvec__ngram_range': (1, 2), 'dt__max_depth': 10, 'dt__min_samples_leaf': 10, 'dt__min_samples_split': 5, 'tfidf__use_idf': False}
Final Model Score

Train Score:0.6962557077625571
Test Score:0.6926869350862778


### Random Forest Model

In [15]:
run_model('rf',X,y)

Grid Search for rf
Fitting 5 folds for each of 288 candidates, totalling 1440 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    7.0s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   47.8s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 1440 out of 1440 | elapsed:  6.2min finished


Best Score: 0.628310502283105
Best Parameters: {'cvec__max_df': 0.9, 'cvec__max_features': 2500, 'cvec__ngram_range': (1, 2), 'rf__max_depth': 7, 'rf__n_estimators': 100, 'tfidf__use_idf': True}
Final Model Score

Train Score:0.628675799086758
Test Score:0.6280471103807176


### Ada Boost Model

In [16]:
run_model('ada',X,y)

Grid Search for ada
Fitting 5 folds for each of 384 candidates, totalling 1920 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   10.3s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  5.2min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:  9.6min
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed: 15.5min
[Parallel(n_jobs=-1)]: Done 1920 out of 1920 | elapsed: 17.6min finished


Best Score: 0.7252968036529681
Best Parameters: {'ada__base_estimator__max_depth': 2, 'ada__learning_rate': 1, 'ada__n_estimators': 100, 'cvec__max_df': 0.95, 'cvec__max_features': 5000, 'cvec__ngram_range': (1, 1), 'tfidf__use_idf': False}
Final Model Score

Train Score:0.7754337899543379
Test Score:0.7252807450013695


### Gradient Boost

In [17]:
run_model('gb',X,y)

Grid Search for gb
Fitting 5 folds for each of 864 candidates, totalling 4320 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   29.5s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  9.9min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed: 18.0min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed: 29.2min
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed: 43.2min
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed: 59.9min
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed: 78.0min
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed: 99.5min
[Parallel(n_jobs=-1)]: Done 4320 out of 4320 | elapsed: 108.6min finished


Best Score: 0.7421004566210045
Best Parameters: {'cvec__max_df': 0.9, 'cvec__max_features': 5000, 'cvec__ngram_range': (1, 2), 'gb__learning_rate': 0.12, 'gb__max_depth': 3, 'gb__n_estimators': 150, 'tfidf__use_idf': False}
Final Model Score

Train Score:0.7842922374429224
Test Score:0.7299370035606683


### XGBoost

In [11]:
run_model('xgb',X,y)

Grid Search for xgb
Fitting 5 folds for each of 96 candidates, totalling 480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   54.3s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  4.6min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed: 11.0min
[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed: 12.4min finished


Best Score: 0.7435616438356165
Best Parameters: {'cvec__max_df': 0.9, 'cvec__max_features': 2500, 'cvec__ngram_range': (1, 1), 'tfidf__use_idf': False, 'xgb__max_depth': 3}
Final Model Score

Train Score:0.7750684931506849
Test Score:0.7359627499315257


### Model Performance

| Model                    | Test Accuracy |
|--------------------------|---------------|
| Logistic Regression      | 0.781         |
| Naive Bayes              | 0.738         |
| Decision Tree            | 0.693         |
| Random Forest Classifier | 0.628         |
| Ada Boost                | 0.725         |
| Gradient Boost           | 0.730         |
| XGBoost                  | 0.736         |
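
Test accuracy alone hides how much each model overfits. As a rough sketch, the train and test scores printed by the runs above (rounded to four decimals) can be compared side by side; the `scores` dictionary and the variable names are introduced here for illustration only.

```python
# (train accuracy, test accuracy) copied from the run_model outputs above
scores = {
    'lr':  (0.8486, 0.7814),
    'nb':  (0.7754, 0.7376),
    'dt':  (0.6963, 0.6927),
    'rf':  (0.6287, 0.6280),
    'ada': (0.7754, 0.7253),
    'gb':  (0.7843, 0.7299),
    'xgb': (0.7751, 0.7360),
}

# Gap between train and test accuracy as a simple overfitting indicator
gaps = {name: round(train - test, 4) for name, (train, test) in scores.items()}

# Model with the highest test accuracy
best_model = max(scores, key=lambda name: scores[name][1])

for name, (train, test) in scores.items():
    print(f"{name:>4}: train={train:.3f} test={test:.3f} gap={gaps[name]:.3f}")
print(f"Best test accuracy: {best_model}")
```

Logistic regression has both the highest test accuracy and the largest train-test gap, while the random forest barely moves off the 62.7% baseline.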



In [None]:
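# X_test and y_test are local to run_model, so recreate them with the same train_test_split first
# loaded_model = joblib.load('model_lr.pkl')
# result = loaded_model.score(X_test, y_test)
# print(result)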
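
Because the saved object is the whole pipeline, a reloaded model can take raw (cleaned) text directly, with no separate vectorising step. The sketch below shows the round trip on a hypothetical toy corpus and a temporary file; `texts`, `labels` and `model_demo.pkl` are made up for the example and are not the notebook's data or saved files.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = positive, 0 = negative
texts = ["great flight", "love crew", "bad delay", "lost bag"]
labels = [1, 1, 0, 0]

model = Pipeline([
    ('cvec', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('lr', LogisticRegression(max_iter=1000)),
]).fit(texts, labels)

# Save the fitted pipeline, then load it back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_demo.pkl')
    joblib.dump(model, path)
    loaded_model = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions
same_predictions = list(loaded_model.predict(texts)) == list(model.predict(texts))
print(same_predictions)
```

Pickled models are generally only safe to reload with the same scikit-learn version that produced them.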