In [120]:
# Use "Base" Kernel for this notebook

#### Load a Dataset

In [121]:
# Perform imports and load the dataset:
import numpy as np
import pandas as pd

df = pd.read_csv('(A) Data/PreProcessed_News Content Title_800 Data.csv', usecols=['Detokenized', 'Labelling'], engine='python')
df.head()

Unnamed: 0,Labelling,Detokenized
0,1.0,olahraga pilates populer deret manfaat
1,1.0,janice tjen lolos nomor final chennai open
2,-1.0,pasu rsf duga bantai ribu warga sipil sudan
3,1.0,pertamina peduli salur bantu korban bencana su...
4,-1.0,cuaca panas hujan banjir rob bayang wilayah


In [122]:
df.isnull().sum()

Labelling      2
Detokenized    2
dtype: int64

In [123]:
df = df.dropna()

In [124]:
df.isnull().sum()

Labelling      0
Detokenized    0
dtype: int64

In [125]:
df.shape

(832, 2)

#### Take a Quick Look at the Label

In [126]:
df['Labelling'].value_counts()

Labelling
 0.0    557
-1.0    180
 1.0     95
Name: count, dtype: int64

#### Split the Data into Training and Testing sets

In [127]:
from sklearn.model_selection import train_test_split

X = df['Detokenized']  

# Map labels to 0, 1, 2
label_mapping = {-1: 0, 0: 1, 1: 2}
y = df['Labelling'].map(label_mapping)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [128]:
print(y.value_counts()) 

Labelling
1    557
0    180
2     95
Name: count, dtype: int64
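
The classes are imbalanced (557 / 180 / 95), so a purely random split can leave the training and test sets with slightly different class proportions. One possible variation, not used in this notebook, is to pass `stratify=y` to `train_test_split`. The sketch below uses synthetic labels with the same class counts and placeholder features (`X_demo` and `y_demo` are hypothetical names) to show the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (illustrative only)
y_demo = np.array([1] * 557 + [0] * 180 + [2] * 95)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of each class (0, 1, 2) in the two splits
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print(train_share.round(3))
print(test_share.round(3))
```

With stratification the two printed rows should be nearly identical, which makes per-class test metrics less dependent on the random seed.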


#### Scikit-learn's CountVectorizer
Text preprocessing, tokenizing and the ability to filter out stopwords are all included in [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which builds a dictionary of features and transforms documents to feature vectors.

In [129]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

# Fit on X_train only (not the whole X) so the vocabulary is learned without seeing the test set
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(665, 3783)
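
To make the "dictionary of features" concrete, here is a minimal sketch on two made-up documents (not taken from the dataset). `demo_vect` and `docs` are illustrative names:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (made up for illustration)
docs = ["banjir rob banjir", "cuaca panas banjir"]

demo_vect = CountVectorizer()
counts = demo_vect.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort words by that index
vocab = sorted(demo_vect.vocabulary_, key=demo_vect.vocabulary_.get)
print(vocab)             # ['banjir', 'cuaca', 'panas', 'rob']
print(counts.toarray())  # one row per document, one column per word
```

Each row counts how often each vocabulary word occurs in that document, so the first row is `[2, 0, 0, 1]` and the second is `[1, 1, 1, 0]`.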

#### Transform Counts to Frequencies with Tf-idf
While counting words is helpful, longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid this we can simply divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called **tf** for Term Frequencies.

Another refinement on top of **tf** is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called **tf–idf** for “Term Frequency times Inverse Document Frequency”.

Both tf and tf–idf can be computed as follows using [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html):

As an illustration, take the terms "the", "red" and "dogs". Because "the" is so common, raw term frequency tends to overemphasize documents that happen to use "the" more often, without giving enough weight to the more meaningful terms "red" and "dogs".

The inverse document frequency factor corrects for this: it diminishes the weight of terms that occur very frequently across the document set and increases the weight of terms that occur rarely. (See the "NLP With Python Notebook".)
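
As a rough check of what the weighting does, the sketch below recomputes tf-idf by hand on a made-up count matrix and compares it with `TfidfTransformer`. It assumes the scikit-learn defaults (`smooth_idf=True`, `norm='l2'`), for which the idf of a term is `ln((1 + n) / (1 + df)) + 1`:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 3 documents x 3 terms (made-up numbers)
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 1, 0]])

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)              # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf (scikit-learn default)

weighted = counts * idf                    # scale each column by its idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalise rows

sk = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, sk))
```

The first term appears in every document, so it receives the smallest idf; the other two terms appear in one document each and are weighted up.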

In [130]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(665, 3783)

Note: the `fit_transform()` method actually performs two operations: it fits an estimator to the data and then transforms our count-matrix to a tf-idf representation.
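
The two operations can also be called separately, which is what happens implicitly whenever a fitted transformer is later applied to new data. A minimal sketch on a made-up count matrix (`toy_counts` and `demo_transformer` are illustrative names):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix (made-up numbers): 2 documents x 3 terms
toy_counts = np.array([[2, 0, 1],
                       [0, 1, 1]])

# One call: learn the idf weights and apply them
one_step = TfidfTransformer().fit_transform(toy_counts)

# Two calls: fit() learns idf_, transform() applies it
demo_transformer = TfidfTransformer().fit(toy_counts)
two_step = demo_transformer.transform(toy_counts)

print(demo_transformer.idf_)  # one learned idf weight per term
print(np.allclose(one_step.toarray(), two_step.toarray()))
```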

In [131]:
X_train_tfidf

<665x3783 sparse matrix of type '<class 'numpy.float64'>'
	with 13940 stored elements in Compressed Sparse Row format>

#### Combine Steps with TfidfVectorizer
The CountVectorizer and TfidfTransformer steps can be combined into one using [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html):

In [132]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(X_train) # remember to use the original X_train set
X_train_tfidf.shape

(665, 3783)

In [133]:
X_test_tfidf = vectorizer.transform(X_test) # transform only: reuse the vocabulary and idf weights fitted on X_train
X_test_tfidf.shape

(167, 3783)
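
The test matrix has the same 3783 columns as the training matrix because `transform()` reuses the fitted vocabulary; words that were never seen during fitting are simply dropped. The sketch below, on made-up documents, illustrates both this and the equivalence of the one-step and two-step approaches (`demo_vec`, `train_docs` and `test_docs` are illustrative names):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)

train_docs = ["banjir rob wilayah", "cuaca panas wilayah"]  # toy data
test_docs = ["cuaca hujan"]  # "hujan" never appears in train_docs

# Two steps: counts first, then tf-idf weighting
two_step = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(train_docs)
)

# One step: TfidfVectorizer does both
demo_vec = TfidfVectorizer()
one_step = demo_vec.fit_transform(train_docs)
print(np.allclose(two_step.toarray(), one_step.toarray()))

# The unseen word "hujan" is ignored; only the "cuaca" column is non-zero
test_matrix = demo_vec.transform(test_docs)
print(test_matrix.shape)
print(test_matrix.toarray())
```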

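#### Neural Network Classifier
The model below is a fully connected (dense) feed-forward network; it contains no convolutional layers.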

In [134]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [135]:
# Convert the sparse matrix to a dense numpy array
X_train_tfidf_dense = X_train_tfidf.toarray()
X_test_tfidf_dense = X_test_tfidf.toarray()

In [136]:
model = Sequential(
    [ 
        Dense(25, activation = 'relu'),
        Dense(15, activation = 'relu'),
        Dense(3, activation = 'softmax')    # < softmax activation here
    ]
)
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.001),
)

model.fit(
    X_train_tfidf_dense,
    y_train,
    epochs=10
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x21ee57b03d0>

In [137]:
model.summary()

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_18 (Dense)            (None, 25)                94600     
                                                                 
 dense_19 (Dense)            (None, 15)                390       
                                                                 
 dense_20 (Dense)            (None, 3)                 48        
                                                                 
Total params: 95,038
Trainable params: 95,038
Non-trainable params: 0
_________________________________________________________________
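
The parameter counts in the summary can be reproduced by hand: a `Dense` layer has one weight per (input, unit) pair plus one bias per unit. A small sketch (`dense_params` is a helper written for this illustration, not a Keras function):

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 3783 tf-idf features in, then the units of each Dense layer
layer_sizes = [3783, 25, 15, 3]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # [94600, 390, 48]
print(sum(per_layer))  # 95038
```

Almost all of the parameters sit in the first layer, because every one of the 3783 vocabulary columns is connected to each of its 25 units.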


In [138]:
predictions = model.predict(X_test_tfidf_dense)



In [139]:
# Convert the softmax probabilities to class labels (0, 1, or 2) by taking the most probable class per row
predicted_classes = np.argmax(predictions, axis=1)
print(predicted_classes)

[1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1
 1 1 1 1 1 2 0 1 0 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1 0 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0
 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 0 1 0 0
 1 1 1 1 1 2 1 1 1 1 0 1 1 0 1 1 1 1 1]
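
The predicted values are the remapped class ids (0, 1, 2), not the original labels (-1, 0, 1). A minimal sketch of the argmax step and of mapping the ids back, using hypothetical probabilities rather than real model output (`inverse_mapping` is an illustrative name):

```python
import numpy as np

# Hypothetical softmax outputs for three headlines (each row sums to 1)
probs = np.array([[0.70, 0.20, 0.10],
                  [0.10, 0.60, 0.30],
                  [0.05, 0.15, 0.80]])

# Index of the largest probability in each row
class_ids = np.argmax(probs, axis=1)

# Inverse of label_mapping = {-1: 0, 0: 1, 1: 2}
inverse_mapping = {0: -1, 1: 0, 2: 1}
original_labels = [inverse_mapping[int(i)] for i in class_ids]

print(class_ids)        # [0 1 2]
print(original_labels)  # [-1, 0, 1]
```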


#### Evaluating Model Performance

In [140]:
from sklearn.metrics import confusion_matrix,classification_report

In [143]:
# Same argmax as predicted_classes above: the most probable class for each test row
y_pred_classes = predictions.argmax(axis=1)
confusion_matrix(y_test, y_pred_classes)

array([[ 16,  19,   0],
       [  8, 106,   0],
       [  2,  14,   2]], dtype=int64)

In [144]:
print(classification_report(y_test, y_pred_classes))

              precision    recall  f1-score   support

           0       0.62      0.46      0.52        35
           1       0.76      0.93      0.84       114
           2       1.00      0.11      0.20        18

    accuracy                           0.74       167
   macro avg       0.79      0.50      0.52       167
weighted avg       0.76      0.74      0.70       167
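
The figures in the report follow directly from the confusion matrix, whose rows are true classes and whose columns are predicted classes. The sketch below recomputes them from the matrix printed above:

```python
import numpy as np

# Confusion matrix from the cell above (rows: true class, columns: predicted class)
cm = np.array([[16, 19, 0],
               [8, 106, 0],
               [2, 14, 2]])

true_pos = np.diag(cm)                  # correctly classified samples per class
precision = true_pos / cm.sum(axis=0)   # divide by column totals (predicted counts)
recall = true_pos / cm.sum(axis=1)      # divide by row totals (true counts)
accuracy = true_pos.sum() / cm.sum()

print(precision.round(2))  # [0.62 0.76 1.  ]
print(recall.round(2))     # [0.46 0.93 0.11]
print(round(accuracy, 2))  # 0.74
```

Class 2 reaches a precision of 1.00 only because the model predicted it twice and was right both times; its recall of 0.11 shows that 16 of the 18 true class-2 samples were assigned to other classes, mostly class 1.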

