## **Save and load model**

[Save and load models](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/keras/save_and_load.ipynb#scrollTo=pZJ3uY9O17VN)


In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
import re  # For preprocessing
import pandas as pd  # For data handling
from time import time  # To time our operations
from collections import defaultdict  # For word frequency

import spacy  # For preprocessing

import logging  # Set up logging to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt='%H:%M:%S', level=logging.INFO)

In [10]:
url_df_train = "/content/gdrive/MyDrive/Colab Notebooks/Machine Learning/31062021_sarcasm-detection/dataset/Sarcasm_Detection_Train.csv"
url_df_test = "/content/gdrive/MyDrive/Colab Notebooks/Machine Learning/31062021_sarcasm-detection/dataset/Sarcasm_Detection_Test.csv"

df_train = pd.read_csv(url_df_train, usecols=['is_sarcastic', 'title'])

df_test = pd.read_csv(url_df_test, usecols=['is_sarcastic', 'title'])

print(df_train.shape, df_test.shape)

(337758, 2) (28207, 2)


In [13]:
df_train.head()

|   | is_sarcastic | title |
|---|---|---|
| 0 | 0 | 8 New Books We Recommend This Week |
| 1 | 1 | West Toronto getting close to 2025 goal of 1 c... |
| 2 | 0 | New Footage Emerges of Police Pepper-Spraying ... |
| 3 | 0 | Decision to scrap exams a mistake, warn school... |
| 4 | 1 | 5 effortless pottery overalls worn by calmer w... |


In [15]:
df_train.isnull().sum()

is_sarcastic    0
title           0
dtype: int64

# **Preprocessing**

In [16]:
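# The 'en' shortcut only resolves in spaCy 2.x; 'en_core_web_sm' assumes the small English model is installed.
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])  # disable NER and the parser for speed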

def cleaning(doc):
    # Lemmatize the tokens and drop stop words.
    # `doc` must be a spaCy Doc object.
    txt = [token.lemma_ for token in doc if not token.is_stop]
    # Word2Vec learns a target word's vector from its context words,
    # so a sentence with only one or two words left adds very little
    # to training. Such sentences fall through and return None.
    if len(txt) > 2:
        return ' '.join(txt)

Remove every character that is not a letter or an apostrophe, and lowercase the text:

In [17]:
brief_cleaning = (re.sub("[^A-Za-z']+", ' ', str(row)).lower() for row in df_train['title'])
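
To make the effect of this pattern concrete, here is a minimal sketch that applies the same substitution to two headlines. The first headline comes from `df_train.head()` above; the second headline and the helper name `brief_clean` are hypothetical and exist only for illustration.

```python
import re

# The first title appears in df_train.head(); the second is a made-up
# example chosen to show digits, punctuation, and an apostrophe.
sample_titles = [
    "8 New Books We Recommend This Week",
    "Don't stop: 2025 goal!",
]

def brief_clean(title):
    # Same pattern as the notebook cell: every run of characters that are
    # not ASCII letters or apostrophes collapses into a single space.
    return re.sub("[^A-Za-z']+", " ", str(title)).lower()

cleaned = [brief_clean(t) for t in sample_titles]
for original, result in zip(sample_titles, cleaned):
    print(repr(original), "->", repr(result))
```

Digits and punctuation disappear, apostrophes survive, and a leading or trailing space can remain where a removed run sat at the edge of the string.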

Use spaCy's `.pipe()` method to speed up the cleaning process:

In [19]:
t = time()

# n_threads is omitted: it is deprecated in spaCy 2.1+ and not accepted by spaCy 3.
txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=5000)]

print('Time to clean up everything: {} mins'.format(round((time() - t) / 60, 2)))

Time to clean up everything: 1.71 mins
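
One thing to keep in mind when re-running these cells: `brief_cleaning` is a generator expression, so it can be consumed only once. The sketch below shows that behaviour with plain Python and no spaCy; the sample titles and variable names are hypothetical.

```python
import re

# Hypothetical titles used only to demonstrate generator exhaustion.
sample_titles = ["First sample headline", "Second sample headline"]

# Same construction as brief_cleaning: a lazy generator, not a list.
lazy_cleaning = (re.sub("[^A-Za-z']+", " ", str(t)).lower() for t in sample_titles)

first_pass = list(lazy_cleaning)   # consumes the generator
second_pass = list(lazy_cleaning)  # nothing left to yield

print(len(first_pass), len(second_pass))
```

If the `nlp.pipe` cell is executed a second time without first re-creating `brief_cleaning`, it would therefore receive no texts and `txt` would come back empty.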


In [20]:
df_clean = pd.DataFrame({'clean': txt})
df_clean = df_clean.dropna().drop_duplicates()
df_clean.shape

INFO - 12:57:26: NumExpr defaulting to 2 threads.


(142185, 1)
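
The drop from the original row count happens because `cleaning` returns `None` for very short headlines and because identical cleaned strings are collapsed. A minimal sketch of both effects on a tiny hypothetical list (the strings reuse rows shown in the notebook output; `sample_txt`, `sample_clean`, and `deduped` are illustrative names):

```python
import pandas as pd

# Hypothetical stand-in for `txt`: None marks a headline that cleaning()
# rejected because two or fewer tokens were left after lemmatization.
sample_txt = [
    "giannis antetokounmpo call amazing sign language",
    None,
    "ronald melzack cartographer pain dead",
    "giannis antetokounmpo call amazing sign language",  # exact duplicate
]

sample_clean = pd.DataFrame({"clean": sample_txt})

# dropna() removes the None row; drop_duplicates() keeps the first copy.
deduped = sample_clean.dropna().drop_duplicates()

print(deduped.shape)
print(deduped.index.tolist())
```

Note that the surviving rows keep their original index labels; calling `reset_index(drop=True)` afterwards is one option if a contiguous index is needed.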

In [24]:
df_clean.head()

|   | clean |
|---|---|
| 0 | honour d day soldier fight freedom |
| 1 | giannis antetokounmpo call amazing sign language |
| 2 | nation s dad remind mom s birthday come d forget |
| 3 | ronald melzack cartographer pain dead |
| 4 | ' miserable ' don jr wait donald trump presidency |
