## Checklist

- [ ] app

    - [ ] template

        - [ ] master.html  # main page of web app

        - [ ] go.html  # classification result page of web app

    - [ ] run.py  # Flask file that runs app

- [ ] data

    - [x] process_data.py

    - [ ] InsertDatabaseName.db   # database to save clean data to

- [ ] models

    - [ ] train_classifier.py

    - [ ] classifier.pkl  # saved model 

- [ ] README.md

## File structure

The coding for this project can be completed using the Project Workspace IDE provided. Here's the file structure of the project:

- app

    - template

        - master.html  # main page of web app

        - go.html  # classification result page of web app

    - run.py  # Flask file that runs app

- data

    - disaster_categories.csv  # data to process 

    - disaster_messages.csv  # data to process

    - process_data.py

    - InsertDatabaseName.db   # database to save clean data to

- models

    - train_classifier.py

    - classifier.pkl  # saved model 

- README.md

# Project Components
There are three components you'll need to complete for this project.

## 1. ETL Pipeline

In a Python script, process_data.py, write a data cleaning pipeline that:

- Loads the messages and categories datasets
- Merges the two datasets
- Cleans the data
- Stores it in a SQLite database

#### Project Workspace - ETL
The first part of your data pipeline is the Extract, Transform, and Load process. Here, you will read the dataset, clean the data, and then store it in a SQLite database. We expect you to do the data cleaning with pandas. To load the data into a SQLite database, you can use the pandas DataFrame `.to_sql()` method together with an SQLAlchemy engine.

Feel free to do some exploratory data analysis in order to figure out how you want to clean the data set. Though you do not need to submit this exploratory data analysis as part of your project, you'll need to include your cleaning code in the final ETL script, process_data.py.
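
The four ETL steps can be sketched roughly as follows. This is a minimal illustration, not the actual `process_data.py`: the two small in-memory frames stand in for the CSV files, and the `name-0/1` category encoding, the table name `messages` and the helper name `clean_data` are assumptions made for the example. It also passes a standard-library `sqlite3` connection to `.to_sql()` instead of an SQLAlchemy engine, purely to keep the sketch dependency-free.

```python
import sqlite3

import pandas as pd

# Stand-ins for disaster_messages.csv and disaster_categories.csv (assumed layout).
messages = pd.DataFrame({
    "id": [1, 2, 2],
    "message": ["we need water", "road is blocked", "road is blocked"],
})
categories = pd.DataFrame({
    "id": [1, 2, 2],
    "categories": ["water-1;transport-0", "water-0;transport-1", "water-0;transport-1"],
})


def clean_data(messages, categories):
    """Merge the two frames, expand the category string into 0/1 columns, drop duplicates."""
    df = messages.merge(categories, on="id")
    # Split "water-1;transport-0" into one column per category.
    expanded = df["categories"].str.split(";", expand=True)
    # Column names come from the first row: the text before the trailing "-0"/"-1".
    expanded.columns = [value.rsplit("-", 1)[0] for value in expanded.iloc[0]]
    for column in expanded.columns:
        # Keep only the trailing digit and store it as an integer.
        expanded[column] = expanded[column].str[-1].astype(int)
    df = pd.concat([df.drop(columns="categories"), expanded], axis=1)
    return df.drop_duplicates()


clean = clean_data(messages, categories)

# Store the result in SQLite; ":memory:" keeps the example free of side effects.
conn = sqlite3.connect(":memory:")
clean.to_sql("messages", conn, index=False, if_exists="replace")
stored = pd.read_sql("SELECT * FROM messages", conn)
conn.close()
```

In a real script the frames would come from `pd.read_csv()` and the connection would point at the database file given on the command line.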

#### Project Workspace - Machine Learning Pipeline

For the machine learning portion, you will split the data into a training set and a test set. Then, you will create a machine learning pipeline that uses NLTK, as well as scikit-learn's Pipeline and GridSearchCV to output a final model that uses the message column to predict classifications for 36 categories (multi-output classification). Finally, you will export your model to a pickle file. After completing the notebook, you'll need to include your final machine learning code in train_classifier.py.

### Data Pipelines: Python Scripts

After you complete the notebooks for the ETL and machine learning pipeline, you'll need to transfer your work into Python scripts, process_data.py and train_classifier.py. If someone in the future comes with a revised or new dataset of messages, they should be able to easily create a new model just by running your code. These Python scripts should be able to run with additional arguments specifying the files used for the data and model.

Example:

    python process_data.py disaster_messages.csv disaster_categories.csv DisasterResponse.db

    python train_classifier.py ../data/DisasterResponse.db classifier.pkl

Templates for these scripts are provided in the Resources section, as well as the Project Workspace IDE. The code for handling these arguments on the command line is given to you in the templates.
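
The provided templates are not reproduced here, so the following is only a guess at what such argument handling could look like. The function name `parse_process_data_args` and the usage message are hypothetical; the three positional arguments follow the example command above.

```python
def parse_process_data_args(argv):
    """Return (messages_path, categories_path, database_path) or exit with a usage hint."""
    if len(argv) != 4:
        raise SystemExit(
            "Usage: python process_data.py <messages.csv> <categories.csv> <database.db>"
        )
    # argv[0] is the script name; the three file paths follow it.
    return tuple(argv[1:])


# Example argument list, as sys.argv would look for the command shown above.
example_argv = [
    "process_data.py",
    "disaster_messages.csv",
    "disaster_categories.csv",
    "DisasterResponse.db",
]
messages_path, categories_path, database_path = parse_process_data_args(example_argv)
```

Inside the script itself you would pass `sys.argv` instead of the hard-coded list.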

## 2. ML Pipeline

In a Python script, train_classifier.py, write a machine learning pipeline that:

- Loads data from the SQLite database
- Splits the dataset into training and test sets
- Builds a text processing and machine learning pipeline
- Trains and tunes a model using GridSearchCV
- Outputs results on the test set
- Exports the final model as a pickle file
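
The steps above can be sketched on toy data as follows. This is a simplified illustration under several assumptions: the twelve messages and two label columns are made up, the data is built in memory instead of being read from SQLite, `CountVectorizer` uses its default tokenizer instead of an NLTK-based one, and the grid is kept tiny so the example runs quickly.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy stand-in for the message column and two category columns (made-up data).
X = np.array([
    "we need water", "send food please", "water and food needed", "no water here",
    "road is blocked", "bridge collapsed", "hungry children need food", "water pipe broken",
    "need drinking water", "food supplies are low", "the road is flooded", "please send water",
])
Y = np.array([
    [1, 0], [0, 1], [1, 1], [1, 0],
    [0, 0], [0, 0], [0, 1], [1, 0],
    [1, 0], [0, 1], [0, 0], [1, 0],
])  # columns: water, food

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)

pipeline = Pipeline([
    ("vect", CountVectorizer()),  # the project would pass an NLTK-based tokenizer here
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# A deliberately tiny grid so the example runs in seconds.
param_grid = {"clf__estimator__n_estimators": [5, 10]}
search = GridSearchCV(pipeline, param_grid=param_grid, cv=2)
search.fit(X_train, Y_train)

Y_pred = search.predict(X_test)

# Round-trip the tuned model through pickle, as train_classifier.py would do with a file.
restored = pickle.loads(pickle.dumps(search.best_estimator_))
```

Writing to disk would use `pickle.dump(search.best_estimator_, open_file)` with a file opened in `'wb'` mode.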



## 3. Flask Web App

We are providing much of the Flask web app for you, but feel free to add extra features depending on your knowledge of Flask, HTML, CSS and JavaScript. For this part, you'll need to:

- Modify file paths for database and model as needed
- Add data visualizations using Plotly in the web app. One example is provided for you
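
One possible shape for an extra visualization is sketched below. It assumes a hypothetical `genre` column whose values are counted, and it describes the chart as a plain dictionary with `data` and `layout` keys, which is the JSON structure Plotly figures use. The provided example in `run.py` may build its figure differently.

```python
import json
from collections import Counter

# Made-up genre labels standing in for a column read from the database.
genres = ["direct", "news", "direct", "social", "news", "direct"]

counts = Counter(genres)

# A bar chart described as a plain dictionary with "data" and "layout" keys.
figure = {
    "data": [{"type": "bar", "x": list(counts.keys()), "y": list(counts.values())}],
    "layout": {
        "title": "Distribution of message genres",
        "xaxis": {"title": "Genre"},
        "yaxis": {"title": "Count"},
    },
}

# JSON string that the Flask view can pass to the page template for Plotly.js to render.
graph_json = json.dumps(figure)
```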



In [None]:
# might be helpful: duplicates
# look at the rows whose message text appears more than once
ids = messages['message']
display(messages[ids.isin(ids[ids.duplicated()])].sort_values(by=['message']).head(n=10))


# look at the rows that share an id
ids = messages['id']
display(messages[ids.isin(ids[ids.duplicated()])].sort_values(by=['id']).head(n=10))
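
Once the duplicates have been inspected, removing them could look like the sketch below. The four-row frame is made up for illustration; whether rows with the same text but different ids should also be dropped is a judgment call this example only surfaces and does not decide.

```python
import pandas as pd

messages = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "message": ["need water", "road blocked", "road blocked", "need water"],
})

# Rows that are exact copies of an earlier row (same id and same message).
exact_copies = messages.duplicated().sum()

# Dropping exact copies keeps the first occurrence of each row.
deduplicated = messages.drop_duplicates()

# The same text under a different id survives, which may or may not be wanted.
repeated_text = deduplicated["message"].duplicated().sum()
```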


# 5. Test your model / metrics
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

- [ ] I want to combine model_metrics and create_report_summary
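
One way the combination could look is sketched here. The name `combined_report` is hypothetical, and the sketch uses scikit-learn's `precision_recall_fscore_support` with weighted averaging instead of parsing `classification_report` output, so it is an alternative under those assumptions, not a drop-in merge of the two functions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def combined_report(Y_test, y_preds):
    """Return one row per category with accuracy and weighted precision, recall and F1."""
    actual = np.asarray(Y_test)
    predicted = np.asarray(y_preds)
    rows = {}
    for i, col in enumerate(Y_test.columns):
        # Weighted averages over the label values of this category.
        precision, recall, f1, _ = precision_recall_fscore_support(
            actual[:, i], predicted[:, i], average="weighted", zero_division=0
        )
        rows[col] = {
            "accuracy": accuracy_score(actual[:, i], predicted[:, i]),
            "precision": precision,
            "recall": recall,
            "f1-score": f1,
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Made-up labels and predictions for two categories.
Y_test = pd.DataFrame({"water": [1, 0, 1, 0], "food": [0, 0, 1, 1]})
y_preds = np.array([[1, 0], [0, 0], [0, 1], [0, 1]])
report = combined_report(Y_test, y_preds)
```

Note that the weighted recall of a category always equals its accuracy, so the two columns carry the same information.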

In [None]:
def model_metrics(actual, predicted, Y):
    '''
    Return accuracy, precision, recall and F1 score for each output category of the dataset.

    Parameters:
    actual (np.array or pd.DataFrame): Actual Y values
    predicted (np.array): Predicted Y values
    Y (pd.DataFrame): Label dataframe; its column names label the rows of the result.

    Returns:
    df (pd.DataFrame): One row per category with the columns Accuracy, Precision, Recall and F1
    '''
    metrics = []
    actual = np.array(actual)
    col_names = list(Y.columns.values)
    # Calculate evaluation metrics for each set of labels
    for i in range(len(col_names)):
        accuracy = accuracy_score(actual[:, i], predicted[:, i])
        precision = precision_score(actual[:, i], predicted[:, i])
        recall = recall_score(actual[:, i], predicted[:, i])
        f1 = f1_score(actual[:, i], predicted[:, i])
        
        metrics.append([accuracy, precision, recall, f1])
    
    # Create dataframe containing metrics
    metrics = np.array(metrics)
    df = pd.DataFrame(data = metrics, index = col_names, columns = ['Accuracy', 'Precision', 'Recall', 'F1'])
      
    return df

In [None]:
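def model_metrics_report(y_test, y_preds, Y):
    # Full classification report for every category, keyed by column name
    metrics_dict = {}
    for pred, label, col in zip(y_preds.transpose(), y_test.values.transpose(), Y.columns):
        metrics_dict[col] = classification_report(label, pred, output_dict=True)
    return metrics_dict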

In [150]:
def get_accuracy(Y_test, y_preds, Y):
    '''
    returns accuracy of each label as dict
    '''
    results_dict = {}
    actual = np.array(Y_test)
    col_names = list(Y.columns.values)
    for i in range(len(Y.columns)):
        key = col_names[i]
        results_dict[key] = accuracy_score(actual[:,i], y_preds[:,i])
    #for pred, actual, col in zip(y_preds.transpose(), Y_test.values.transpose(), Y_test.columns):
        #results_dict[col] = accuracy_score(np.(Y_test[col]), y_preds[col])
    return results_dict

In [151]:
get_accuracy(Y_test, y_preds_3, Y)

{'related': 0.8011372368218841,
 'request': 0.8964192408175811,
 'offer': 0.9956969417550331,
 'aid_related': 0.7590287382818504,
 'medical_help': 0.9268480098355617,
 'medical_products': 0.9565083755955125,
 'search_and_rescue': 0.9734132472721684,
 'security': 0.9786383894267712,
 'military': 0.9688028277239895,
 'water': 0.9635776855693868,
 'food': 0.9472875364991548,
 'shelter': 0.9443676041186415,
 'clothing': 0.9878592285231289,
 'money': 0.9772552635623175,
 'missing_people': 0.9898570769940065,
 'refugees': 0.9681881051175657,
 'death': 0.9688028277239895,
 'other_aid': 0.8725987398186569,
 'infrastructure_related': 0.9314584293837406,
 'transport': 0.958967266021208,
 'buildings': 0.9571230982019364,
 'electricity': 0.9826340863685262,
 'tools': 0.9927770093745197,
 'hospitals': 0.9872445059167051,
 'shops': 0.9944674965421854,
 'aid_centers': 0.9869371446134931,
 'other_infrastructure': 0.9535884432149992,
 'weather_related': 0.8818195789150146,
 'floods': 0.9592746273244198

In [156]:
def get_accuracy(Y_test, y_preds, Y):
    '''
    returns accuracy of each label as df
    '''
    accuracy_df = pd.DataFrame(columns=['Accuracy score'])
    actual = np.array(Y_test)
    col_names = list(Y.columns.values)
    for i in range(len(Y.columns)):
        key = col_names[i]
        accuracy_df.loc[key] = accuracy_score(actual[:,i], y_preds[:,i])
    #for pred, actual, col in zip(y_preds.transpose(), Y_test.values.transpose(), Y_test.columns):
        #results_dict[col] = accuracy_score(np.(Y_test[col]), y_preds[col])
    return accuracy_df

In [157]:
get_accuracy(Y_test, y_preds_3, Y)

| Category | Accuracy score |
|---|---|
| related | 0.801137 |
| request | 0.896419 |
| offer | 0.995697 |
| aid_related | 0.759029 |
| medical_help | 0.926848 |
| medical_products | 0.956508 |
| search_and_rescue | 0.973413 |
| security | 0.978638 |
| military | 0.968803 |
| water | 0.963578 |


In [167]:
def test_report(Y_test, y_preds):
    '''
    Create a summary dataframe with one row per category.
    OUT df columns: accuracy, followed by the weighted averages of precision, recall, f1-score and support
    '''
    results_dict = {}

    for pred, label, col in zip(y_preds.transpose(), Y_test.values.transpose(), Y_test.columns):
        results_dict[col] = classification_report(label, pred, output_dict=True)
        
    weighted_avg = {}
    for key in results_dict.keys():
        weighted_avg[key] = results_dict[key]['weighted avg']

    df_wavg = pd.DataFrame(weighted_avg).transpose()
    accuracy_df = get_accuracy(Y_test, y_preds, Y_test)  # get_accuracy reads the column names from its third argument
    joined_df = accuracy_df.join(df_wavg)
    return joined_df

In [172]:
def metrics_dict(Y_test, y_preds):
    '''
    Return two dicts keyed by category name:
    the full classification report of every category, and only its weighted averages
    (precision, recall, f1-score, support)
    '''
    results_dict = {}

    for pred, label, col in zip(y_preds.transpose(), Y_test.values.transpose(), Y_test.columns):
        results_dict[col] = classification_report(label, pred, output_dict=True)
        
    weighted_avg = {}
    for key in results_dict.keys():
        weighted_avg[key] = results_dict[key]['weighted avg']
    return results_dict, weighted_avg

In [173]:
metrics_dict(Y_test, y_preds_3)

({'related': {'0': {'precision': 0.6670716889428918,
    'recall': 0.34990439770554493,
    'f1-score': 0.45903010033444813,
    'support': 1569},
   '1': {'precision': 0.8205489092188599,
    'recall': 0.9445119481571487,
    'f1-score': 0.8781773677273584,
    'support': 4938},
   'accuracy': 0.8011372368218841,
   'macro avg': {'precision': 0.7438102990808759,
    'recall': 0.6472081729313468,
    'f1-score': 0.6686037340309032,
    'support': 6507},
   'weighted avg': {'precision': 0.7835417233247468,
    'recall': 0.8011372368218841,
    'f1-score': 0.7771105070328024,
    'support': 6507}},
  'request': {'0': {'precision': 0.9147517979301877,
    'recall': 0.9652045160096243,
    'f1-score': 0.9393011527377522,
    'support': 5403},
   '1': {'precision': 0.7667493796526055,
    'recall': 0.5597826086956522,
    'f1-score': 0.6471204188481676,
    'support': 1104},
   'accuracy': 0.8964192408175811,
   'macro avg': {'precision': 0.8407505887913966,
    'recall': 0.7624935623526383

In [169]:
test_report(Y_test, y_preds_3)

| Category | accuracy | precision | recall | f1-score | support |
|---|---|---|---|---|---|
| related | 0.801137 | 0.783542 | 0.801137 | 0.777111 | 6507.0 |
| request | 0.896419 | 0.889641 | 0.896419 | 0.889729 | 6507.0 |
| offer | 0.995697 | 0.992023 | 0.995697 | 0.993857 | 6507.0 |
| aid_related | 0.759029 | 0.758949 | 0.759029 | 0.754306 | 6507.0 |
| medical_help | 0.926848 | 0.912669 | 0.926848 | 0.912353 | 6507.0 |
| medical_products | 0.956508 | 0.949186 | 0.956508 | 0.950013 | 6507.0 |
| search_and_rescue | 0.973413 | 0.967028 | 0.973413 | 0.967423 | 6507.0 |
| security | 0.978638 | 0.967202 | 0.978638 | 0.970599 | 6507.0 |
| military | 0.968803 | 0.962603 | 0.968803 | 0.964185 | 6507.0 |
| water | 0.963578 | 0.961374 | 0.963578 | 0.962092 | 6507.0 |


In [None]:
def create_report_summary(y_preds, Y_test):
    '''
    Create a summary dataframe of weighted averages with one row per category.
    OUT df columns: precision, recall, f1-score, support
    '''
    results_dict = {}

    for pred, actual, col in zip(y_preds.transpose(), Y_test.values.transpose(), Y_test.columns):
        results_dict[col] = classification_report(actual, pred, output_dict=True)
        
    weighted_avg = {}
    for key in results_dict.keys():
        weighted_avg[key] = results_dict[key]['weighted avg']

    df_wavg = pd.DataFrame(weighted_avg).transpose()
    return df_wavg

In [None]:
def weighted_avg_report(df):
    '''
    OUT:
        descriptive statistics for the created weighted averages summary df
        upper and lower quantile df slices
    '''
    display(df['f1-score'].describe())
    display('lowest quartile of F1 scores', df[df['f1-score'] <= df['f1-score'].quantile(0.25)])  # rows at or below the 25th percentile
    #print(df.sort_values('f1-score').head(n = 10))
    display('highest quartile of F1 scores', df[df['f1-score'] >= df['f1-score'].quantile(0.75)])  # rows at or above the 75th percentile

In [24]:
%%time
Y_pred = pipeline.predict (X_test)

In [54]:
%%time
# Train model Prediction
col_names = list(Y.columns.values)
Y_train_pred = pipeline.predict(X_train)
metrics_df = model_metrics(np.array(Y_train), Y_train_pred, Y)  # model_metrics expects the label dataframe, not the list of names

Wall time: 48.1 s


In [46]:
metrics_df

| Category | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| related | 0.999129 | 0.999265 | 0.999599 | 0.999432 |
| request | 0.999641 | 1.0 | 0.997923 | 0.99896 |
| offer | 0.999898 | 1.0 | 0.978261 | 0.989011 |
| aid_related | 0.999385 | 0.999631 | 0.998895 | 0.999263 |
| medical_help | 0.999539 | 1.0 | 0.994163 | 0.997073 |
| medical_products | 0.999744 | 1.0 | 0.994824 | 0.997405 |
| search_and_rescue | 0.999795 | 1.0 | 0.992481 | 0.996226 |
| security | 0.999846 | 1.0 | 0.991124 | 0.995542 |
| military | 0.999846 | 0.998423 | 0.99685 | 0.997636 |
| water | 0.999949 | 1.0 | 0.999195 | 0.999597 |


In [37]:
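from scipy.stats import gmean
from sklearn.metrics import fbeta_score

def multioutput_fscore(y_true, y_pred, beta=1):
    # Geometric mean of the weighted F-beta scores of all output columns
    score_list = []
    if isinstance(y_pred, pd.DataFrame):
        y_pred = y_pred.values
    if isinstance(y_true, pd.DataFrame):
        y_true = y_true.values
    for column in range(0, y_true.shape[1]):
        # beta must be passed by keyword in recent scikit-learn versions
        score = fbeta_score(y_true[:, column], y_pred[:, column], beta=beta, average='weighted')
        score_list.append(score)
    f1score_numpy = np.asarray(score_list)
    # columns with a perfect score of 1 are left out of the mean
    f1score_numpy = f1score_numpy[f1score_numpy < 1]
    f1score = gmean(f1score_numpy)
    return f1score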

In [42]:
multi_f1 = multioutput_fscore(Y_test, Y_pred, beta=1)
# Compare the raw values. Comparing the two dataframes directly raises
# "ValueError: Can only compare identically-labeled DataFrame objects" when their indexes differ.
overall_accuracy = (np.asarray(Y_pred) == np.asarray(Y_test)).mean()

print('Average overall accuracy {0:.2f}% \n'.format(overall_accuracy*100))
print('F1 score (custom definition) {0:.2f}%\n'.format(multi_f1*100))

In [27]:
#converting to dataframe
Y_pred = pd.DataFrame(Y_pred, columns = Y_test.columns)

In [21]:
# Print the classification report, accuracy and weighted F1 for each category.
# Y_pred must already hold pipeline.predict(X_test); running this cell before that one raises
# "NameError: name 'Y_pred' is not defined".
Y_pred_arr = np.asarray(Y_pred)  # works whether Y_pred is an array or a dataframe
for i in range(len(Y.columns)):
    print('Category: {} '.format(Y.columns[i]))
    print(classification_report(Y_test.iloc[:, i].values, Y_pred_arr[:, i]))
    print('Accuracy {}\n\n'.format(accuracy_score(Y_test.iloc[:, i].values, Y_pred_arr[:, i])))
    print('F1 {}\n\n'.format(f1_score(Y_test.iloc[:, i].values, Y_pred_arr[:, i], average='weighted')))

In [17]:
print(classification_report(Y_test.iloc[:, 1:].values, np.array([x[1:] for x in Y_pred]), target_names = Y.columns))

NameError: name 'categories' is not defined

---


# ??

In [2]:
def f1_pre_acc_evaluation(y_true, y_pred):
    """Measure the mean F1, precision and recall for each category of a multi-output prediction.

       Returns a dataframe with one row per category and the columns:
       f1-score (averaged over all label values of that category)
       precision (averaged over all label values of that category)
       recall (averaged over all label values of that category)
       Keep in mind that some categories are imbalanced, so the averaged values may mislead.
    """
    # collect one single-row dataframe per category
    rows = []

    for col in y_true.columns:
        # classification report as a dictionary
        class_dict = classification_report(output_dict=True, y_true=y_true.loc[:, col], y_pred=y_pred.loc[:, col])

        # convert the dictionary to a dataframe (columns: label values and summaries)
        eval_df = pd.DataFrame.from_dict(class_dict)

        # drop the summary columns; which of them exist depends on the scikit-learn version
        eval_df.drop(['accuracy', 'micro avg', 'macro avg', 'weighted avg'], axis=1, inplace=True, errors='ignore')

        # drop the unnecessary row "support"
        eval_df.drop(index='support', inplace=True)

        # mean over the label values, transposed into a single row
        av_eval_df = pd.DataFrame(eval_df.transpose().mean()).transpose()

        rows.append(av_eval_df)

    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    report = pd.concat(rows, ignore_index=True)

    # use the category names as the index for convenience
    report.index = y_true.columns

    return report

def f1_scorer_eval(y_true, y_pred):
    """Measure the mean F1 over all categories.

       Returns a single average F1 value, so that GridSearchCV can judge whether a model predicts better or worse.
    """
    # y_pred arrives as an np.array; convert it to a dataframe with the same columns as y_true
    y_pred = pd.DataFrame(y_pred, columns=y_true.columns)

    # collect one single-row dataframe per category
    rows = []

    for col in y_true.columns:
        # classification report as a dictionary
        class_dict = classification_report(output_dict=True, y_true=y_true.loc[:, col], y_pred=y_pred.loc[:, col])

        # convert the dictionary to a dataframe (columns: label values and summaries)
        eval_df = pd.DataFrame.from_dict(class_dict)

        # drop the summary columns; which of them exist depends on the scikit-learn version
        eval_df.drop(['accuracy', 'micro avg', 'macro avg', 'weighted avg'], axis=1, inplace=True, errors='ignore')

        # drop the unnecessary row "support"
        eval_df.drop(index='support', inplace=True)

        # mean over the label values, transposed into a single row
        av_eval_df = pd.DataFrame(eval_df.transpose().mean()).transpose()

        rows.append(av_eval_df)

    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    report = pd.concat(rows, ignore_index=True)

    # return the mean over all categories; since it is used for GridSearch,
    # a better model should raise this overall F1 value
    return report['f1-score'].mean()


In [None]:
y_pred = pipeline.predict (X_test)
#converting to dataframe
y_pred = pd.DataFrame (y_pred, columns = y_test.columns)

In [None]:
report = f1_pre_acc_evaluation (y_test, y_pred)

---


In [None]:
y_pred = pipeline.predict(X_test)

In [None]:
print(classification_report(y_test.iloc[:,1:].values, np.array([x[1:] for x in y_pred]), target_names=categories))

---


In [None]:
def display_results(y_test, y_pred):
    labels = np.unique(y_pred)
    confusion_mat = confusion_matrix(y_test, y_pred, labels=labels)
    accuracy = (y_pred == y_test).mean()

    print("Labels:", labels)
    print("Confusion Matrix:\n", confusion_mat)
    print("Accuracy:", accuracy)

In [None]:
def main():
    X, y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    model = model_pipeline()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    display_results(y_test, y_pred)

---

model parameters:

model 1 gridCV parameters:

In [None]:
#grid search
parameters = {'vect__min_df': [1, 5],
              'tfidf__use_idf':[True, False],
              'clf__estimator__n_estimators':[10, 25, 50, 100], 
              'clf__estimator__min_samples_split':[2, 5, 10]}

model_cv = GridSearchCV(estimator=model, param_grid=parameters, verbose=3)
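
To get a feel for how expensive this search is, the size of the grid can be counted up front. The sketch below repeats the grid from the cell above; the fold count of 5 is an assumption based on the default `cv` of recent scikit-learn versions, since the call does not set `cv` explicitly.

```python
from math import prod

parameters = {
    "vect__min_df": [1, 5],
    "tfidf__use_idf": [True, False],
    "clf__estimator__n_estimators": [10, 25, 50, 100],
    "clf__estimator__min_samples_split": [2, 5, 10],
}

# Every combination of values is fitted once per cross-validation fold.
n_candidates = prod(len(values) for values in parameters.values())
n_folds = 5  # assumption: the default cv of recent scikit-learn versions
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 48 240
```

Each of those fits trains one random forest per output category, which is why trimming the grid or the number of folds shortens the run considerably.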

model 1 best parameters:

to import (custom transformers): `StartingVerbExtractor()`, `WordCount()`, `CharacterCount()`, `NounCount()`, `VerbCount()`

In [None]:
{'clf__estimator__min_samples_split': 2,
 'clf__estimator__n_estimators': 100,
 'tfidf__use_idf': True,
 'vect__min_df': 5}

In [None]:
{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                   dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                   lowercase=True, max_df=1.0, max_features=None, min_df=1,
                   ngram_range=(1, 1), preprocessor=None, stop_words=None,
                   strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                   tokenizer=<function tokenize at 0x0000022E4B4A5CA8>,
                   vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True,
                                                          ccp_alpha=0.0,
                                                          class_weight=None,
                                                          criterion='gini',
                                                          max_depth=None,
                                                          max_features='auto',
                                                          max_leaf_nodes=None,
                                                          max_samples=None,
                                                          min_impurity_decrease=0.0,
                                                          min_impurity_split=None,
                                                          min_samples_leaf=1,
                                                          min_samples_split=2,
                                                          min_weight_fraction_leaf=0.0,
                                                          n_estimators=100,
                                                          n_jobs=None,
                                                          oob_score=False,
                                                          random_state=None,
                                                          verbose=0,
                                                          warm_start=False),
                         n_jobs=None))],
 'verbose': False,
 'vect': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                 dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                 lowercase=True, max_df=1.0, max_features=None, min_df=1,
                 ngram_range=(1, 1), preprocessor=None, stop_words=None,
                 strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                 tokenizer=<function tokenize at 0x0000022E4B4A5CA8>,
                 vocabulary=None),
 'tfidf': TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True),
 'clf': MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True,
                                                        ccp_alpha=0.0,
                                                        class_weight=None,
                                                        criterion='gini',
                                                        max_depth=None,
                                                        max_features='auto',
                                                        max_leaf_nodes=None,
                                                        max_samples=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        n_estimators=100,
                                                        n_jobs=None,
                                                        oob_score=False,
                                                        random_state=None,
                                                        verbose=0,
                                                        warm_start=False),
                       n_jobs=None),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect__tokenizer': <function __main__.tokenize(text)>,
 'vect__vocabulary': None,
 'tfidf__norm': 'l2',
 'tfidf__smooth_idf': True,
 'tfidf__sublinear_tf': False,
 'tfidf__use_idf': True,
 'clf__estimator__bootstrap': True,
 'clf__estimator__ccp_alpha': 0.0,
 'clf__estimator__class_weight': None,
 'clf__estimator__criterion': 'gini',
 'clf__estimator__max_depth': None,
 'clf__estimator__max_features': 'auto',
 'clf__estimator__max_leaf_nodes': None,
 'clf__estimator__max_samples': None,
 'clf__estimator__min_impurity_decrease': 0.0,
 'clf__estimator__min_impurity_split': None,
 'clf__estimator__min_samples_leaf': 1,
 'clf__estimator__min_samples_split': 2,
 'clf__estimator__min_weight_fraction_leaf': 0.0,
 'clf__estimator__n_estimators': 100,
 'clf__estimator__n_jobs': None,
 'clf__estimator__oob_score': False,
 'clf__estimator__random_state': None,
 'clf__estimator__verbose': 0,
 'clf__estimator__warm_start': False,
 'clf__estimator': RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                        criterion='gini', max_depth=None, max_features='auto',
                        max_leaf_nodes=None, max_samples=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=100,
                        n_jobs=None, oob_score=False, random_state=None,
                        verbose=0, warm_start=False),
 'clf__n_jobs': None}

---


In [None]:
df = create_report_summary(y_preds, Y_test)
weighted_avg_report(df)

F-scores

Model 1:
count    35.000000
mean      0.931140
std       0.057876
min       0.771124
25%       0.916096
50%       0.944198
75%       0.971679
max       0.994010


Model 2:
count    35.000000
mean      0.930772
std       0.058746
min       0.767192
25%       0.921509
50%       0.940264
75%       0.970957
max       0.994010

In [None]:
#pipeline model 3
def pipeline_model_3():
    pipeline = Pipeline([
        ('features', FeatureUnion([

            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ])),

            #('starting_verb', StartingVerbExtractor()),
            #("word_count", WordCount()),
            #("character_count", CharacterCount()),
            #("noun_count", NounCount()),
            #("verb_count", VerbCount())
        ])),

        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])
    #pipeline = Pipeline([('vect', CountVectorizer(tokenizer=tokenize)),
    #                      ('tfidf', TfidfTransformer()),
    #                      ('clf', MultiOutputClassifier(RandomForestClassifier()))])
    return pipeline

In [None]:
model.class_to_idx = image_datasets['training'].class_to_idx

def save_checkpoint(model):
    checkpoint = {'arch': "vgg16",
                 'class_to_idx': model.class_to_idx,
                  'model_state_dict': model.state_dict(),
                  'classifier_save': model.classifier
                 }
    torch.save(checkpoint, 'checkpoint.pth')

save_checkpoint(model)

In [237]:
_serialize(model_cv.best_estimator_)

{'py/class': {'name': 'Pipeline',
  'mod': 'sklearn.pipeline',
  'attr': {'steps': [{'py/tuple': ['vect',
      {'py/class': {'name': 'CountVectorizer',
        'mod': 'sklearn.feature_extraction.text',
        'attr': {'input': 'content',
         'encoding': 'utf-8',
         'decode_error': 'strict',
         'strip_accents': None,
         'preprocessor': None,
         'tokenizer': {'py/class': {'name': 'function',
           'mod': '__main__',
           'attr': {}}},
         'analyzer': 'word',
         'lowercase': True,
         'token_pattern': '(?u)\\b\\w\\w+\\b',
         'stop_words': None,
         'max_df': 1.0,
         'min_df': 5,
         'max_features': None,
         'ngram_range': {'py/tuple': [1, 1]},
         'vocabulary': None,
         'binary': False,
         'dtype': {'py/numpy.type': 'int64'},
         'fixed_vocabulary_': False,
         '_stop_words_id': 140713296288992,
         'stop_words_': {'py/set': ['mizzima',
           '7jqqy8yb',
           'c

In [None]:
ET_pipeline_pos_tag = Pipeline([
   ('u1', FeatureUnion([
      ('tfdif_features', Pipeline([('cleaner', FeatureCleaner()),
                            ('tfidf', TfidfVectorizer(max_features=40000, ngram_range=(1, 3))),
                            ])),
      ('numerical_features', Pipeline([('numerical_feats', FeatureMultiplierCount()),
                               ('scaler', StandardScaler()), ])),

      ('pos_features', Pipeline([
         ('pos', PosTagMatrix(tokenizer=nltk.word_tokenize)),
      ])),
   ])),
   ('clf', ExtraTreesClassifier()),
])

In [9]:
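from sklearn.base import BaseEstimator, TransformerMixin  # without this import the class raises NameError

class StartingVerbExtractor(BaseEstimator, TransformerMixin):

    def starting_verb(self, text):
        # True if any sentence of the text starts with a verb or the retweet marker 'RT'
        sentence_list = nltk.sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            if not pos_tags:
                # skip sentences that tokenize to nothing
                continue
            first_word, first_tag = pos_tags[0]
            if first_tag in ['VB', 'VBP'] or first_word == 'RT':
                return True
        return False

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # apply the check to every message; the method name must match the one defined above
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)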

In [9]:
def starting_verb(msg_col):
    for msg in msg_col:
        sentence_list = nltk.sent_tokenize(msg)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            first_word, first_tag = pos_tags[0]
            if first_tag in ['VB', 'VBP'] or first_word == 'RT':
                return True
    return False