# Email Marketing Campaigns Optimization Model with NLP


## Jupyter Notebook 10/10

### 7. Cleaning, preprocessing, transforming new data and making new predictions.

#### Script 01 - EDA & Feature Engineering (part I)

- Obtaining the dataset
- Dropping inactive accounts
- Getting the number of campaigns
- Applying log transformation
- Handling outliers
- Creating new variables `open_rate`, `ctr` and `ctor`
- Removing duplicated users
- Removing empty messages
- Removing nonwords
- Counting words in `subject` and `message`
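The script itself is not shown in this notebook, so the following is only a minimal sketch of how the rate variables, the log transformation and the outlier count could be computed. It assumes the standard email-marketing definitions (`open_rate` = opens / sent, `ctr` = clicks / sent, `ctor` = clicks / opens), uses `np.log1p` as the transformation and the 1.5 × IQR rule for outliers. The helper `iqr_outlier_mask` and the column `total_sent_log` are hypothetical names, and the data is made up.

```python
import numpy as np
import pandas as pd

# Toy campaigns; the base column names match those printed by Script 08.
df = pd.DataFrame({
    "total_sent": [1000, 250, 40, 500],
    "opens": [300, 50, 10, 0],
    "clicks": [30, 5, 0, 0],
})

# Engagement rates (standard definitions, assumed here).
df["open_rate"] = df["opens"] / df["total_sent"]
df["ctr"] = df["clicks"] / df["total_sent"]
# Clicks per open; campaigns with no opens get 0 instead of a division by zero.
df["ctor"] = (df["clicks"] / df["opens"].replace(0, np.nan)).fillna(0.0)

# Log transformation to reduce right skew; log1p keeps zero counts finite.
# (Assumption: the script may use a different logarithm.)
df["total_sent_log"] = np.log1p(df["total_sent"])


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


print("Total Sent Skewness:", round(df["total_sent_log"].skew(), 2))
print("Outliers:", int(iqr_outlier_mask(df["total_sent_log"]).sum()))
```

How the flagged outliers are then treated (capped, replaced or dropped) is not visible in the output below, so that step is left out.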

In [1]:
cd ../03_Scripts

/Users/danielaperezduro/Desktop/TFM/emailmarketingwnlp/notebooks/03_Scripts


In [2]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/danielaperezduro/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/danielaperezduro/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
! python 01_mainmodel_before_filtering.py --input_root ../../../datasets/input --input_file campaigns_w_label_sample_02.csv --output_root ../../../datasets/output/sample_02/ --output_file output_01.csv

Dataset:  campaigns_w_label_sample_02.csv
Campaigns df shape: (54997, 10)
Total Sent Skweness: 0.41
Opens Skweness: -0.00
Clicks Skewness: 0.12

Outliers:
213

Outliers:
618

Outliers:
183
Mean value after outliers treatment: 3.26
Mean value after outliers treatment: 2.46
Mean value after outliers treatment: 1.34

Outliers:
0

Outliers:
0

Outliers:
0
Number of campaigns within the dataset after removing users with duplicated accounts:  54948
Number of campaigns within the dataset after removing empty messages:  54935
Total number of words in subject:  417163
Total number of words in message:  16313302


#### Script 02 - Text Cleaning

- Removing irrelevant phrases from `message`
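As a rough illustration of this step, the sketch below strips a list of boilerplate phrases from a message and reports how many words were removed. The function names `remove_phrases` and `count_words` and the example phrases are hypothetical. The script's own `remove_template` method is not shown here and may work differently.

```python
import re


def count_words(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())


def remove_phrases(text: str, phrases: list[str]) -> str:
    """Delete every occurrence of the given phrases (case-insensitive)."""
    if not phrases:
        return " ".join(text.split())
    # Longest phrases first, so a long phrase wins over a shorter one it contains.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in ordered), flags=re.IGNORECASE)
    cleaned = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(cleaned.split())


# Hypothetical boilerplate phrases and message.
phrases = ["click here to unsubscribe", "view this email in your browser"]
message = "View this email in your browser Big sale today. Click here to unsubscribe"

cleaned = remove_phrases(message, phrases)
print(cleaned)
print("Number of words removed:", count_words(message) - count_words(cleaned))
```

Messages that end up empty after this cleaning would then be dropped, which is consistent with the shrinking campaign counts in the log below.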

In [4]:
! python 02_remove_irrelevant_phrases.py --input_root ../../../datasets/output/sample_02/ --input_file output_01.csv --output_root ../../../datasets/output/sample_02/ --output_file output_02.csv

output_01.csv
File 01/39: 
1. Number of seconds since it started running: 1.1920928955078125e-06
2. Campaigns shape before removing irrelevant phrases: (54935, 15)
3. Removing empty messages
4. Number of words before removing irrelevant phrases:  16313302
5. Starting the process of removing irrelevant phrases from messages
6. File appended
7. Applying the remove_template method over the dataframe
8. Number of words after having removed irrelevant phrases:  15579850
9. Number of words removed:  733452
10. Campaigns shape after having removed empty messages: (54513, 15)
11. Number of words after having removed empty messages:  15579428
12. Number of seconds since it started running: 827.8035199642181
File 02/39: 
1. Number of seconds since it started running: 2.1457672119140625e-06
2. Campaigns shape before removing irrelevant phrases: (54513, 15)
3. Removing empty messages
4. Number of words before removing irrelevant phrases:  15579428
5. Starting the process of removing irrelevant phr

4. Number of words before removing irrelevant phrases:  14783461
5. Starting the process of removing irrelevant phrases from messages
6. File appended
7. Applying the remove_template method over the dataframe
8. Number of words after having removed irrelevant phrases:  14768402
9. Number of words removed:  15059
10. Campaigns shape after having removed empty messages: (54488, 15)
11. Number of words after having removed empty messages:  14768402
12. Number of seconds since it started running: 343.2654318809509
File 14/39: 
1. Number of seconds since it started running: 9.5367431640625e-07
2. Campaigns shape before removing irrelevant phrases: (54488, 15)
3. Removing empty messages
4. Number of words before removing irrelevant phrases:  14768402
5. Starting the process of removing irrelevant phrases from messages
6. File appended
7. Applying the remove_template method over the dataframe
8. Number of words after having removed irrelevant phrases:  14753848
9. Number of words removed:  14

4. Number of words before removing irrelevant phrases:  14381227
5. Starting the process of removing irrelevant phrases from messages
6. File appended
7. Applying the remove_template method over the dataframe
8. Number of words after having removed irrelevant phrases:  14358324
9. Number of words removed:  22903
10. Campaigns shape after having removed empty messages: (54434, 15)
11. Number of words after having removed empty messages:  14358324
12. Number of seconds since it started running: 319.7901496887207
File 26/39: 
1. Number of seconds since it started running: 1.9073486328125e-06
2. Campaigns shape before removing irrelevant phrases: (54434, 15)
3. Removing empty messages
4. Number of words before removing irrelevant phrases:  14358324
5. Starting the process of removing irrelevant phrases from messages
6. File appended
7. Applying the remove_template method over the dataframe
8. Number of words after having removed irrelevant phrases:  14306504
9. Number of words removed:  51

4. Number of words before removing irrelevant phrases:  14073220
5. Starting the process of removing irrelevant phrases from messages
6. File appended
7. Applying the remove_template method over the dataframe
8. Number of words after having removed irrelevant phrases:  14069929
9. Number of words removed:  3291
10. Campaigns shape after having removed empty messages: (54336, 15)
11. Number of words after having removed empty messages:  14069929
12. Number of seconds since it started running: 362.31920194625854
File 38/39: 
1. Number of seconds since it started running: 2.1457672119140625e-06
2. Campaigns shape before removing irrelevant phrases: (54336, 15)
3. Removing empty messages
4. Number of words before removing irrelevant phrases:  14069929
5. Starting the process of removing irrelevant phrases from messages
6. File appended
7. Applying the remove_template method over the dataframe
8. Number of words after having removed irrelevant phrases:  14064029
9. Number of words removed: 

#### Script 03 - Creating one df per language

- Filtering campaigns df by language
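A minimal sketch of the split itself, assuming the language of each message has already been detected and stored in a column. The column name `language` is hypothetical, the data is made up, and the detection step is omitted because the notebook does not show which library the script uses.

```python
import pandas as pd

df = pd.DataFrame({
    "message": ["hola mundo", "hello world", "buenos dias", "bonjour"],
    "language": ["es", "en", "es", "fr"],  # hypothetical column of detected codes
})

print(df["language"].unique())

# One dataframe per language code.
per_language = {
    code: group.reset_index(drop=True)
    for code, group in df.groupby("language")
}

print("Number of campaigns in Spanish:", len(per_language["es"]))
# Each frame could then be written out, for example:
# per_language["es"].to_csv("es.csv", index=False)
```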

In [5]:
! python 03_filtering_by_language.py --input_root ../../../datasets/output/sample_02/ --input_file output_02.csv --output_root ../../../datasets/output/sample_02/output_03/

output_02.csv
1 Number of seconds since it started running: 0.035861968994140625
2 Number of seconds since it started running: 781.8422179222107
3 Number of seconds since it started running: 781.8448619842529
['es' 'ca' 'en' 'fr' 'pt' 'it' 'sl' 'pl' 'id' 'de' 'no' 'et' 'nl' 'da'
 'sw' 'ro' 'hr' 'sk' 'af' 'cs' 'tr' 'fi' 'hu' 'vi' 'sv' 'tl' 'cy' 'lt']
Number of campaigns in Spanish: 
43593


#### Script 04 - Removing stopwords from language df

- Removing stopwords from the language file
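Stopword removal can be sketched as a simple token filter. The function `remove_stopwords` is a hypothetical name. In the notebook's setting the stopword list presumably comes from the NLTK corpus downloaded in cell 2 (`nltk.corpus.stopwords.words("spanish")`), while the small hand-written set below only keeps the example runnable offline.

```python
def remove_stopwords(text: str, stopwords: set[str]) -> str:
    """Drop every token whose lowercase form is in the stopword set."""
    return " ".join(word for word in text.split() if word.lower() not in stopwords)


# Tiny illustrative subset; the full list would come from NLTK:
#   from nltk.corpus import stopwords
#   spanish_stopwords = set(stopwords.words("spanish"))
spanish_sample = {"de", "la", "el", "en", "y", "para"}

print(remove_stopwords("Descuentos en la tienda para el verano", spanish_sample))
```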

In [6]:
!python 04_remove_stopwords.py --input_root ../../../datasets/output/sample_02/output_03 --input_file es.csv --output_root ../../../datasets/output/sample_02/output_04/ --output_file es.csv --sw_remover spanish

Language:  (43593, 15)


#### Script 05 - Stemming & selecting the best vectorization method to predict verticals

- Preprocessing texts: stemming
- Filtering the df dropping NaN values in `sector`
- Encoding Verticals
- SMOTE oversampling
- Transforming texts into features using `CountVectorizer` or `TfidfVectorizer`
- Predicting verticals with a Multinomial Naive Bayes classifier
- Evaluating predictions
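The comparison between the two vectorizers can be sketched as follows with scikit-learn. The toy corpus, the labels and the helper `misclassified_pct` are hypothetical. Stemming and SMOTE oversampling are left out to keep the example short, so this is an outline of the evaluation loop rather than a reproduction of the script.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical, already-stemmed messages for two made-up verticals.
texts = [
    "ofert zapat descuent tiend", "rebaj rop tiend nuev",
    "descuent rop zapat veran", "tiend ofert rebaj colec",
    "vuel hotel viaj ofert", "viaj play hotel reserv",
    "reserv vuel destin viaj", "hotel destin play escap",
] * 3
labels = (["retail"] * 4 + ["travel"] * 4) * 3


def misclassified_pct(vectorizer, texts, labels, seed=0):
    """Train MultinomialNB on a split and return the test error in percent."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, random_state=seed, stratify=labels
    )
    # The vocabulary is learned from the training split only.
    classifier = MultinomialNB().fit(vectorizer.fit_transform(x_train), y_train)
    predictions = classifier.predict(vectorizer.transform(x_test))
    return 100.0 * np.mean(predictions != np.array(y_test))


for vectorizer in (CountVectorizer(), TfidfVectorizer()):
    print(vectorizer, "Misclassified:", misclassified_pct(vectorizer, texts, labels))
```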

In [7]:
!python 05_verticalmodel_model_selection.py --input_root ../../../datasets/output/sample_02/output_04/ --input_file es.csv --vectorizer CountVectorizer --stemmer spanish

spanish
Labeled Messages: 5420
CountVectorizer()
The dimension of our feature vector is 41877.
Misclassified: 7.195571955719557


In [8]:
!python 05_verticalmodel_model_selection.py --input_root ../../../datasets/output/sample_02/output_04/ --input_file es.csv --vectorizer TfidfVectorizer --stemmer spanish

spanish
Labeled Messages: 5420
TfidfVectorizer()
The dimension of our feature vector is 42852.
Misclassified: 7.380073800738007


#### Script 06 - Preprocessing texts 

- Preprocessing texts from the whole dataset: stemming


In [9]:
!python 06_verticalmodel_language_preprocessing_texts.py --input_root ../../../datasets/output/sample_02/output_04/ --input_file es.csv --output_root ../../../datasets/output/sample_02/output_05/ --output_file es.csv --stemmer spanish

Language df shape before cleaning texts: (43593, 15)
Language df shape after cleaning texts: (43593, 15)


#### Script 07 - Predicting verticals in the whole dataframe

- Encoding Vertical
- Splitting between labeled and unlabeled data
- Vectorizing texts
- SMOTE oversampling `target_train`
- Training a Multinomial Naive Bayes classifier
- Saving the trained classifier
- Predicting Verticals
- Merging predictions with the original messages
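The steps above can be sketched end to end on toy data. The messages and vertical names are made up, and SMOTE oversampling is omitted because it needs the separate `imbalanced-learn` package. Using `pickle` for persistence is an assumption: the script's `--save_trained_classifier` argument only indicates that the classifier is saved, not how.

```python
import pickle

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "clean_message": [
        "ofert zapat tiend", "rebaj rop tiend",
        "vuel hotel viaj", "reserv hotel play",
        "descuent zapat rop", "viaj play destin",
    ],
    "vertical": ["retail", "retail", "travel", "travel", None, None],
})

# Split between labeled and unlabeled campaigns.
labeled = df[df["vertical"].notna()]
unlabeled = df[df["vertical"].isna()]

# Encode verticals as integers and vectorize the texts.
encoder = LabelEncoder()
target_train = encoder.fit_transform(labeled["vertical"])
vectorizer = TfidfVectorizer(max_features=10000)
features_train = vectorizer.fit_transform(labeled["clean_message"])

classifier = MultinomialNB().fit(features_train, target_train)

# Save and restore the trained classifier (in memory here, for brevity).
restored = pickle.loads(pickle.dumps(classifier))

# Predict the missing verticals and merge them back into the dataframe.
predicted = restored.predict(vectorizer.transform(unlabeled["clean_message"]))
df.loc[unlabeled.index, "vertical"] = encoder.inverse_transform(predicted)
print(df)
```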

In [10]:
!python 07_verticalmodel_language_vectorization_and_ml.py --input_root ../../../datasets/output/sample_02/output_05/ --input_file es.csv --output_root ../../../datasets/output/sample_02/output_06/ --output_file es.csv --save_trained_vectorizer vertical_model_tfidf_vectorizer_spanish --save_trained_classifier vertical_model_multinomial_nb_classifier_spanish 

TfidfVectorizer(max_features=10000)
Features train shape:  (5420, 10000)
Labeled Messages: (14924, 10000)
Unabeled Messages: (38173, 10000)
The dimension of our feature vector is 10000.
MultinomialNB()
[[   0  935]
 [   1  314]
 [   2  476]
 [   3 3634]
 [   4 4180]
 [   5 2300]
 [   6 5556]
 [   7 2246]
 [   8 4914]
 [   9 2010]
 [  10 3966]
 [  11 2630]
 [  12  796]
 [  13 4216]]


#### Script 08 - Preparing Main Models Features and Target

- Creating new variables: `open_rate_benchmark`, `ctr_benchmark`, `ctor_benchmark`, `open_rate_result`, `ctr_result`, `ctor_result`

- Preprocessing texts: stemming
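The notebook does not state how the benchmark and result variables are defined. One plausible reading, shown here purely as an assumption, is that each benchmark is the mean rate within a campaign's vertical and each result is a binary flag telling whether the campaign beat that benchmark. The data below is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "vertical": [3, 3, 6, 6, 6],
    "open_rate": [0.1, 0.3, 0.5, 0.5, 0.8],
})

# Assumed definition: benchmark = mean open rate of the campaign's vertical.
df["open_rate_benchmark"] = df.groupby("vertical")["open_rate"].transform("mean")

# Assumed definition: result = 1 if the campaign beat its benchmark, else 0.
df["open_rate_result"] = (df["open_rate"] > df["open_rate_benchmark"]).astype(int)

print(df)
```

The same two lines would be repeated for `ctr` and `ctor`. A binary target of this kind would be consistent with the test-set target means between 0 and 1 reported by Script 09.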


In [11]:
!python 08_mainmodel_spanish_preparing_features_and_target.py --input_root ../../../datasets/output/sample_02/output_06/ --input_file es.csv --output_root ../../../datasets/output/sample_02/output_07/ --output_file es.csv

Campaigns df columns:  Index(['sender', 'subject', 'date_sent', 'total_sent', 'customer_cat', 'opens',
       'clicks', 'message', 'campaign_id', 'open_rate', 'ctr', 'ctor',
       'clean_subject', 'clean_message', 'vertical', 'open_rate_benchmark',
       'ctr_benchmark', 'ctor_benchmark', 'open_rate_result', 'ctr_result',
       'ctor_result'],
      dtype='object')


#### Script 09 - Predicting Main Models Targets

- Creating a Scikit-learn Pipeline
- Predicting targets
- Evaluating performance
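A minimal sketch of such a pipeline, combining a text column and the `vertical` column as in the `--features clean_subject vertical` call below. The choice of `TfidfVectorizer`, `OneHotEncoder` and `LogisticRegression` is an assumption, since the notebook does not show which transformers or estimator the script uses. The data is made up.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

features = pd.DataFrame({
    "clean_subject": [
        "ofert veran", "nuev colec", "ultim dias rebaj",
        "reserv viaj", "escap fin seman", "vuel barat",
    ],
    "vertical": [3, 3, 3, 6, 6, 6],
})
target = [1, 0, 1, 0, 1, 0]  # e.g. open_rate_result

preprocess = ColumnTransformer([
    # A plain string selects a 1-D column, which text vectorizers require.
    ("text", TfidfVectorizer(), "clean_subject"),
    # A list selects a 2-D frame, which OneHotEncoder requires.
    ("vertical", OneHotEncoder(handle_unknown="ignore"), ["vertical"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),  # placeholder estimator
])

pipeline.fit(features, target)
predictions = pipeline.predict(features)
print("Predictions mean:", predictions.mean())

# The fitted pipeline can be serialized as a single object.
restored = pickle.loads(pickle.dumps(pipeline))
```

Keeping preprocessing and estimator in one pipeline means a single pickled object can later be applied to new, raw campaigns.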

In [12]:
!python 09_second_model_pipeline.py --input_root ../../../datasets/output/sample_02/output_07/ --input_file es.csv --features clean_subject vertical --target open_rate_result --text_column clean_subject --pickle_file ../../../datasets/pipelines/main_model_sample_02_target_01_pipeline 

Predictions mean:  0.3934388621243404
Target Test:  0.44023858683184214
Predictions mean is not greater than target test mean
accuracy 71.36958017894013

Confusion Matrix

 [[1918  522]
 [ 726 1193]]
True Negatives:  1918
True Positives:  1193
False Negatives  726
False Positives:  522
Recall Score:  62.16779572694111
Precision Score:  69.56268221574345


In [13]:
!python 09_second_model_pipeline.py --input_root ../../../datasets/output/sample_02/output_07/ --input_file es.csv --features clean_message vertical --target ctr_result --text_column clean_message --pickle_file ../../../datasets/pipelines/main_model_sample_02_target_02_pipeline 

Predictions mean:  0.4278504244092682
Target Test:  0.4475797201192934
Predictions mean is not greater than target test mean
accuracy 76.55425556320256

Confusion Matrix

 [[1940  468]
 [ 554 1397]]
True Negatives:  1940
True Positives:  1397
False Negatives  554
False Positives:  468
Recall Score:  71.60430548436699
Precision Score:  74.90616621983915


In [14]:
!python 09_second_model_pipeline.py --input_root ../../../datasets/output/sample_02/output_07/ --input_file es.csv --features clean_message vertical --target ctor_result --text_column clean_message --pickle_file ../../../datasets/pipelines/main_model_sample_02_target_03_pipeline 

Predictions mean:  0.3718742830924524
Target Test:  0.40192704748795594
Predictions mean is not greater than target test mean
accuracy 78.0454232622161

Confusion Matrix

 [[2194  413]
 [ 544 1208]]
True Negatives:  2194
True Positives:  1208
False Negatives  544
False Positives:  413
Recall Score:  68.94977168949772
Precision Score:  74.52190006169032
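
The figures printed by each run follow directly from its confusion matrix. As a worked check, the sketch below recomputes the metrics of the first run (target `open_rate_result`) from the four counts printed above. The variable names are illustrative.

```python
# Counts from the confusion matrix of the open_rate_result run:
# [[TN, FP],
#  [FN, TP]]
tn, fp = 1918, 522
fn, tp = 726, 1193

total = tn + fp + fn + tp

accuracy = 100 * (tp + tn) / total   # share of correct predictions
recall = 100 * tp / (tp + fn)        # share of actual positives found
precision = 100 * tp / (tp + fp)     # share of predicted positives that are right

predictions_mean = (tp + fp) / total  # fraction predicted positive
target_test_mean = (tp + fn) / total  # fraction actually positive

print("accuracy", round(accuracy, 2))                  # 71.37
print("Recall Score:", round(recall, 2))               # 62.17
print("Precision Score:", round(precision, 2))         # 69.56
print("Predictions mean:", round(predictions_mean, 4)) # 0.3934
print("Target Test:", round(target_test_mean, 4))      # 0.4402
```

The predictions mean is below the target mean whenever there are fewer false positives than false negatives, which is the case in all three runs.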
