## 2. Sentiment Analysis - Modeling
---

In [23]:
#importing libraries 

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns 

from sklearn.model_selection import train_test_split, GridSearchCV 
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier,GradientBoostingClassifier,RandomForestClassifier

import joblib


In [22]:
# !pip install joblib



## Modeling
---

In this segment, we will use the following models to classify the tweets. 

##### Models Used: 
1. Logistic Regression
2. Multinomial Naive Bayes
3. Decision Tree
4. Random Forest
5. AdaBoost
6. Gradient Boosting

Each model is tuned with Grid Search over the count vectoriser, TF-IDF transformer and model hyperparameters in an attempt to improve model accuracy.

##### Metric to Validate Model: 

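Accuracy is likely the best metric to use here, as misclassifying a tweet is equally bad regardless of its sentiment class. Since the largest class makes up 62.7% of the tweets, a useful model needs to beat that majority-class baseline.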
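As a rough illustration of how accuracy relates to a majority-class baseline, the sketch below scores a set of made-up predictions against made-up labels. The arrays `y_true` and `y_pred` are hypothetical and are not taken from the tweet data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels for three classes (not the real tweet data)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0, 2, 2])

# Accuracy: share of predictions that match the true label
accuracy = accuracy_score(y_true, y_pred)

# Majority-class baseline: always predict the most frequent label
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by DummyClassifier
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_accuracy = baseline.score(X_dummy, y_true)

print(f"Model accuracy:    {accuracy:.2f}")
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

A model is only informative to the extent that its accuracy exceeds the baseline figure.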

##### Outcome:

The Logistic Regression model performed best, correctly classifying 78.3% of the tweets in the test set.

In [2]:
filepath = '../datasets/tweets_clean_1.csv'

In [3]:
df_1 = pd.read_csv(filepath)

In [4]:
df_1.head()

Unnamed: 0,text,airline_sentiment
0,said,1
1,plu ad commerci experi tacki,2
2,today must mean need take anoth trip,1
3,realli aggress blast obnoxi entertain guest fa...,0
4,realli big bad thing,0


### Modeling 

General Approach
- Split data into X and y 
- Train test split for model validation
- Hyperparameter optimisation for the count vectoriser, TF-IDF transformer and model
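The hyperparameters of each pipeline step are addressed with scikit-learn's `<step name>__<parameter>` convention. A minimal sketch on a handful of made-up stemmed texts (the `texts` and `labels` below are hypothetical) shows how those names map onto the pipeline steps:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data, for illustration only
texts = ["realli bad flight", "great flight thank", "bad delay", "thank great crew"]
labels = [0, 1, 0, 1]

pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("lr", LogisticRegression()),
])

# "cvec__ngram_range" sets ngram_range on the step named "cvec", and so on
pipe.set_params(cvec__ngram_range=(1, 2), tfidf__use_idf=False)
pipe.fit(texts, labels)

# With ngram_range=(1, 2) the vocabulary holds unigrams and bigrams
vocab = pipe.named_steps["cvec"].vocabulary_
print(sorted(vocab))
```

The same naming scheme is what a parameter grid for `GridSearchCV` uses as its dictionary keys.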


##### Split data into `X` and `y`.

In [5]:
X = df_1['text']
y = df_1['airline_sentiment']

In [6]:
X.head()

0                                                 said
1                         plu ad commerci experi tacki
2                 today must mean need take anoth trip
3    realli aggress blast obnoxi entertain guest fa...
4                                 realli big bad thing
Name: text, dtype: object

In [43]:
y.value_counts(normalize=True)

0    0.627149
1    0.211698
2    0.161153
Name: airline_sentiment, dtype: float64
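Because the classes are imbalanced, it can help to stratify the train/test split so that both parts keep the same class proportions. The sketch below uses a hypothetical toy series (`X_toy`, `y_toy`) with a similar imbalance, not the real data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy labels with an imbalance similar to the output above
y_toy = pd.Series([0] * 60 + [1] * 24 + [2] * 16)
X_toy = pd.Series([f"tweet {i}" for i in range(len(y_toy))])

# stratify=y_toy keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)

train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()

print(train_share)
print(test_share)
```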

##### Function for model validation

In [41]:
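def run_model(model_type, X, y):
    """Grid-search a CountVectorizer/TF-IDF/classifier pipeline, print scores and save the best model.

    model_type is one of 'lr', 'nb', 'dt', 'rf', 'ada' or 'gb'.
    """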
    #train test split data for model validation
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=42,
                                                    stratify = y)

    #specifying models
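    models = {'lr': LogisticRegression(), 
              'nb': MultinomialNB(),
              'dt': DecisionTreeClassifier(),
              'rf': RandomForestClassifier(),
              # note: scikit-learn 1.2+ renames `base_estimator` to `estimator`
              # (the 'ada__base_estimator__max_depth' grid key changes accordingly)
              'ada': AdaBoostClassifier(base_estimator=DecisionTreeClassifier()),
              'gb': GradientBoostingClassifier()
             }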
    
    #creating a pipeline
    pipe = Pipeline([
        ('cvec', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        (model_type, models[model_type])
    ])
    
    #pipeline parameters
    pipe_params = {
        'cvec__max_features': [2500, 5000, 7000],
        'cvec__max_df': [0.9, 0.95],
        'cvec__ngram_range': [(1, 1), (1, 2)],
        'tfidf__use_idf': [True, False],
    }
        
    #additional parameters for each model 
    if model_type == 'nb':
        pipe_params.update({'nb__alpha':[1,2]
                           })
        
    elif model_type == 'lr':
        # note: scikit-learn 1.2+ expects penalty=None instead of the string 'none'
        pipe_params.update({'lr__penalty':['none','l2'],
                            'lr__max_iter':[1000]
                           })
        
    elif model_type == 'dt':
        pipe_params.update({'dt__max_depth':[3,5,10],
                           'dt__min_samples_split':[5,10,20],
                           'dt__min_samples_leaf':[2,5,7]
                           })
    
    elif model_type == 'rf':
        pipe_params.update({'rf__n_estimators': [100,150,200],
                            'rf__max_depth':[None,1,2,3,4,5]
                           })
    elif model_type == 'ada':
        pipe_params.update({'ada__n_estimators':[50,100],
                            'ada__learning_rate':[0.9,1],
                            'ada__base_estimator__max_depth':[1,2,3]
                           })
    elif model_type == 'gb':
        pipe_params.update({'gb__n_estimators': [50,100,150],
                            'gb__learning_rate':[0.08, 0.1, 0.12],
                            'gb__max_depth':[1,2,3]
        })


    #grid search 
    gs = GridSearchCV(estimator = pipe, 
                        param_grid = pipe_params,
                        verbose = 1,
                        n_jobs = -1, 
                        cv = 5)
    
    print("Grid Search for " + model_type)
    print('===========================================================================')
    
    gs.fit(X_train,y_train)

    print(f"Best Score: {gs.best_score_}")
    print(f"Best Parameters: {gs.best_params_}")
    
    #GridSearchCV (refit=True by default) has already refit the best pipeline
    #on the entire training set, so reuse it as the final model
    final_model = gs.best_estimator_
    
    print("Final Model Score")
    print()
    print(f'Train Score:{final_model.score(X_train,y_train)}')
    print(f'Test Score:{final_model.score(X_test,y_test)}')
    
    #save model with best parameters for use later
    filename = 'model_' + model_type +'.pkl'
    joblib.dump(final_model,filename)


### Logistic Regression Model


In [37]:
run_model('lr',X,y)

Grid Search for lr
Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   21.9s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed:  5.4min finished


Best Score: 0.7764383561643836
Best Parameters: {'cvec__max_df': 0.9, 'cvec__max_features': 5000, 'cvec__ngram_range': (1, 2), 'lr__max_iter': 1000, 'lr__penalty': 'l2', 'tfidf__use_idf': False}
Final Model Score

Train Score:0.848310502283105
Test Score:0.7825253355245139
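
The "48 candidates" in the log is the product of the number of values tried for each hyperparameter. As a sketch, the effective grid for the logistic regression run can be rebuilt and counted with `ParameterGrid`. The variable names `base_grid` and `lr_grid` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Effective vectoriser / TF-IDF grid shared by all models
base_grid = {
    "cvec__max_features": [2500, 5000, 7000],
    "cvec__max_df": [0.9, 0.95],
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
}

# Extra parameters searched for logistic regression
lr_grid = {**base_grid, "lr__penalty": ["none", "l2"], "lr__max_iter": [1000]}

n_candidates = len(ParameterGrid(lr_grid))  # 3 * 2 * 2 * 2 * 2 * 1
n_fits = n_candidates * 5                   # one fit per candidate per CV fold

print(n_candidates, n_fits)
```

The same arithmetic explains why the tree-based runs are much slower: each additional hyperparameter list multiplies the number of fits.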


### Naive Bayes Model

In [38]:
run_model('nb',X,y)

Grid Search for nb
Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   12.5s
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed:   16.4s finished


Best Score: 0.7345205479452054
Best Parameters: {'cvec__max_df': 0.9, 'cvec__max_features': 2500, 'cvec__ngram_range': (1, 2), 'nb__alpha': 1, 'tfidf__use_idf': True}
Final Model Score

Train Score:0.78
Test Score:0.7376061353053958


### Decision Tree Model

In [17]:
run_model('dt',X,y)

Grid Search for dt
Fitting 5 folds for each of 648 candidates, totalling 3240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    6.1s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   16.1s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   47.4s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed:  6.9min
[Parallel(n_jobs=-1)]: Done 3240 out of 3240 | elapsed:  7.1min finished


Best Score: 0.6925114155251142
Best Parameters: {'cvec__max_df': 0.9, 'cvec__max_features': 5000, 'cvec__ngram_range': (1, 2), 'dt__max_depth': 10, 'dt__min_samples_leaf': 2, 'dt__min_samples_split': 5, 'tfidf__use_idf': False}
Final Model Score

Train Score:0.7052054794520548
Test Score:0.6894001643385373


### Random Forest Model

In [26]:
run_model('rf',X,y)

Grid Search for rf
Fitting 5 folds for each of 432 candidates, totalling 2160 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  7.1min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed: 12.8min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed: 19.1min
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed: 24.4min
[Parallel(n_jobs=-1)]: Done 2160 out of 2160 | elapsed: 28.9min finished


Best Score: 0.76337899543379
Best Parameters: {'cvec__max_df': 0.9, 'cvec__max_features': 5000, 'cvec__ngram_range': (1, 1), 'rf__max_depth': None, 'rf__n_estimators': 200, 'tfidf__use_idf': True}
Final Model Score

Train Score:0.9926027397260274
Test Score:0.7636264037250069


### Ada Boost Model

In [36]:
run_model('ada',X,y)

Grid Search for ada
Fitting 5 folds for each of 288 candidates, totalling 1440 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   10.4s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  2.8min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:  9.9min
[Parallel(n_jobs=-1)]: Done 1440 out of 1440 | elapsed: 12.6min finished


Best Score: 0.7262100456621006
Best Parameters: {'ada__base_estimator__max_depth': 2, 'ada__learning_rate': 0.9, 'ada__n_estimators': 100, 'cvec__max_df': 0.9, 'cvec__max_features': 7000, 'cvec__ngram_range': (1, 2), 'tfidf__use_idf': False}
Final Model Score

Train Score:0.7658447488584474
Test Score:0.7192549986305122


### Gradient Boost

In [42]:
run_model('gb',X,y)

Grid Search for gb
Fitting 5 folds for each of 648 candidates, totalling 3240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   23.0s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  3.7min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  8.9min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed: 15.0min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed: 27.3min
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed: 43.4min
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed: 60.2min
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed: 83.0min
[Parallel(n_jobs=-1)]: Done 3240 out of 3240 | elapsed: 85.3min finished


Best Score: 0.7415525114155251
Best Parameters: {'cvec__max_df': 0.95, 'cvec__max_features': 5000, 'cvec__ngram_range': (1, 1), 'gb__learning_rate': 0.12, 'gb__max_depth': 3, 'gb__n_estimators': 150, 'tfidf__use_idf': False}
Final Model Score

Train Score:0.7770776255707763
Test Score:0.7348671596822788


### Model Performance

| Model                    | Test Accuracy |
|--------------------------|---------------|
| Logistic Regression      | 0.783         |
| Multinomial Naive Bayes  | 0.738         |
| Decision Tree            | 0.689         |
| Random Forest Classifier | 0.764         |
| Ada Boost                | 0.719         |
| Gradient Boost           | 0.735         |
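
To compare the runs side by side, the train and test scores printed by each run above can be collected into a small DataFrame. This is a sketch with the scores typed in by hand (rounded to four decimals); the `gap` column is an added convenience for spotting overfitting.

```python
import pandas as pd

# Train/test accuracy copied from the run outputs above, rounded to 4 d.p.
scores = pd.DataFrame(
    {
        "model": ["lr", "nb", "dt", "rf", "ada", "gb"],
        "train": [0.8483, 0.7800, 0.7052, 0.9926, 0.7658, 0.7771],
        "test": [0.7825, 0.7376, 0.6894, 0.7636, 0.7193, 0.7349],
    }
).set_index("model")

# A large train-test gap suggests the model is overfitting
scores["gap"] = scores["train"] - scores["test"]

ranked = scores.sort_values("test", ascending=False)
best_model = ranked.index[0]
largest_gap = scores["gap"].idxmax()

print(ranked)
```

On these numbers, logistic regression has the highest test accuracy, while the random forest shows the largest gap between train and test scores.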


In [8]:
models = {'lr': 1}

In [9]:
models['lr']

1

In [None]:
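# X_test and y_test are local to run_model, so recreate the split first
# (same random_state=42 and stratify=y) before scoring a reloaded model
# loaded_model = joblib.load('model_lr.pkl')
# result = loaded_model.score(X_test, y_test)
# print(result)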
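
A minimal, self-contained sketch of the save-and-reload round trip with `joblib` is shown below. It fits a tiny pipeline on hypothetical toy texts and writes to a temporary directory rather than to the `model_*.pkl` files produced above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data, for illustration only
texts = ["realli bad flight", "great flight thank", "bad delay", "thank great crew"]
labels = [0, 1, 0, 1]

pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("nb", MultinomialNB()),
]).fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_nb.pkl")
    joblib.dump(pipe, path)          # persist the whole fitted pipeline
    loaded_model = joblib.load(path)  # reload it as a ready-to-use estimator

# The reloaded pipeline should reproduce the original predictions
same_predictions = list(loaded_model.predict(texts)) == list(pipe.predict(texts))
print(same_predictions)
```

Because the vectoriser is saved inside the pipeline, the reloaded model accepts raw (cleaned) text directly.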