## Text Processing
This cell summarizes the text preprocessing from the previous activity.

In [78]:
import pandas as pd
import re
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt') 
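#Load the dataset, skipping malformed rows
#(on_bad_lines replaces error_bad_lines, which was removed in pandas 2.0)
df = pd.read_csv('data_supervised.csv', on_bad_lines='skip')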
#drop null values
df.dropna(inplace=True)
filename = 'english_words.txt'
with open(filename, 'r') as file:
    stop_words = file.read().splitlines()
filename = 'tagalog_stop_words.txt'
with open(filename, 'r') as file:
    tagalog_words = file.read().splitlines()
stop_words.extend(tagalog_words)
def preprocess_data(article):
    #Lowercase, then drop everything except letters, digits, and whitespace
    article = str(article).lower()
    article = re.sub(r"[^a-zA-Z0-9\s]", '', article)
    #Keep only the words that are not stop words
    temp_final = [word for word in article.split() if word not in stop_words]
    return word_tokenize(' '.join(temp_final))
df['Article_processed'] = df['Article'].apply(preprocess_data)

[nltk_data] Downloading package punkt to /Users/jagoodkid/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [11]:
%%capture
#Check first 5 instances
df.head()

# Supervised Machine Learning
In this activity, you will step into the shoes of a data scientist and work on building a Filipino news classifier using the power of Support Vector Machines (SVM). The goal is to create a program that can automatically categorize Filipino news articles into different topics, such as politics, entertainment, sports, and more. This will help you understand how SVM works in text classification tasks and how it can be applied to real-world scenarios.

In [7]:
#Importing countvectorizer
from sklearn.feature_extraction.text import CountVectorizer

#### We just imported CountVectorizer
`CountVectorizer` is a fundamental technique used in natural language processing (NLP) to convert text data into a numerical form that machine learning algorithms, like Support Vector Machines (SVM), can work with. It represents each document as a collection of word counts: one column per word in the vocabulary, one row per document.
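
As a rough illustration of what a word-count representation looks like, the sketch below builds a tiny bag-of-words table by hand with `collections.Counter`. The sample sentences and the helper name `bag_of_words` are made up for illustration; `CountVectorizer` produces the same kind of counts but stores them in a sparse matrix.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, rows); each row holds the word counts of one document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Vocabulary: every distinct word, sorted so the column order is stable
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for words that do not occur in this document
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

# Hypothetical sample sentences, not taken from the dataset
docs = ["panalo ang gilas", "talo ang gilas sa laro"]
vocabulary, rows = bag_of_words(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each printed row lines up with the vocabulary, so a zero means the word does not occur in that document.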

In [19]:
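#Creating a CountVectorizer model
#Note: 'Article' now holds token lists; preprocess_data converts each list back to a
#string with str() and re-tokenizes it, so the analyzer still receives plain words
df['Article'] = df['Article_processed']
bow_transformer= CountVectorizer(analyzer=preprocess_data).fit(df['Article'])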

In [20]:
#Print number of vocabulary/words
print(len(bow_transformer.vocabulary_))

66783


In [27]:
%%capture
#trying the countvectorizer for a single instance
artc3=df['Article'][2]
bow3=bow_transformer.transform([artc3])
print(bow3)


### 1. Now try it with the whole dataset

In [22]:
#Enter code here, name it as article_bow
article_bow=bow_transformer.transform(df['Article'])

In [29]:
#Import TFIDF
from sklearn.feature_extraction.text import TfidfTransformer
#Making an instance of this transformer
tfidf_transformer=TfidfTransformer().fit(article_bow)

### TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency. It's a technique used in natural language processing (NLP) to measure the importance of words in a collection of documents. Let's break it down:

#### Term Frequency (TF):
Imagine you have a document (like an article or a book). Term Frequency is the number of times a specific word appears in that document. It helps us know which words are important in that particular document.

#### Inverse Document Frequency (IDF):
Now, think about all the documents you have in your collection. Inverse Document Frequency measures how unique or rare a word is across all those documents. It tells us which words are special and not common in the entire collection.

#### Combining TF and IDF - TF-IDF:
TF-IDF combines Term Frequency and Inverse Document Frequency. It helps us understand how significant a word is in a particular document compared to its significance in the entire collection. If a word appears a lot in a document but is rare in the whole collection, its TF-IDF value will be high for that document.

#### Why Use TF-IDF?
Imagine you're analyzing a bunch of articles about cats. The word "cat" might appear a lot in all of them, but words like "purr" or "kitten" might appear less frequently. TF-IDF helps us identify these less common, more interesting words that give a document its unique character.
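
The cat example can be worked through by hand. The sketch below uses the smoothed IDF formula that scikit-learn applies by default, `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the word. The three toy documents and the helper names `idf` and `tf_idf` are assumptions for illustration; `TfidfTransformer` additionally normalizes each row by default, which this sketch leaves out.

```python
import math

def idf(term, documents):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc)
    return math.log((1 + n) / (1 + df)) + 1

def tf_idf(term, doc, documents):
    """Raw count of the term in one document, scaled by the term's IDF."""
    return doc.count(term) * idf(term, documents)

# Hypothetical tokenized articles about cats
documents = [
    ["cat", "sleeps", "cat", "purr"],
    ["cat", "eats"],
    ["cat", "plays", "kitten"],
]

print(round(idf("cat", documents), 3))    # "cat" appears in every document
print(round(idf("purr", documents), 3))   # "purr" appears in only one document
print(round(tf_idf("purr", documents[0], documents), 3))
```

Because "cat" occurs in every document, its IDF is the minimum value of 1, while the rarer "purr" gets a larger weight per occurrence.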

In [53]:
#Just to get what it looks like for a single article
tfidf3=tfidf_transformer.transform(bow3)

In [54]:
%%capture
#transforming a simple word count into a tfidf 
print(tfidf3)

In [55]:
#Checking the IDF weight of a particular word ('bansa')
tfidf_transformer.idf_[bow_transformer.vocabulary_['bansa']]

2.931136870164136

### 2. Now try to check the IDF weight of the word "manila"

In [56]:
#Enter code here
tfidf_transformer.idf_[bow_transformer.vocabulary_['manila']]

In [57]:
#Convert the entire bag of words corpus into a tfidf corpus at once
article_tfidf=tfidf_transformer.transform(article_bow)

## Training dataset
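Training on a dataset is a crucial step in building machine learning models like the Filipino news classifier using Support Vector Machines (SVM). Training helps the model learn the patterns in the data. Before training, the data is split into a training set, which the model learns from, and a test set, which is held back to check how well the model handles articles it has not seen.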
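
To make the idea of a held-out split concrete, here is a minimal hand-written sketch. The function `split_train_test`, the seed, and the toy data are hypothetical and are not scikit-learn's implementation; `train_test_split` does the same job with more options.

```python
import random

def split_train_test(items, labels, test_size=0.25, seed=42):
    """Shuffle paired items and labels, then hold out a fraction for testing."""
    indices = list(range(len(items)))
    # A fixed seed makes the shuffle, and so the split, reproducible
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    items_train = [items[i] for i in train_idx]
    items_test = [items[i] for i in test_idx]
    labels_train = [labels[i] for i in train_idx]
    labels_test = [labels[i] for i in test_idx]
    return items_train, items_test, labels_train, labels_test

# Hypothetical toy data: 8 articles with alternating categories
articles = [f"article {i}" for i in range(8)]
categories = ["Palaro" if i % 2 == 0 else "Showbiz" for i in range(8)]

a_train, a_test, c_train, c_test = split_train_test(articles, categories, test_size=0.25)
print(len(a_train), len(a_test))
```

Shuffling the indices rather than the lists themselves keeps every article paired with its own category.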

In [58]:
from sklearn.model_selection import train_test_split

In [59]:
article_train,article_test,category_train,category_test=train_test_split(df['Article'],df['Category'],test_size=0.25)

### 3. Now, try to make an instance where the test size is 30%

In [60]:
#Enter your answer here
article_train,article_test,category_train,category_test=train_test_split(df['Article'],df['Category'],test_size=0.3)

In [46]:
#Import SGDClassifier; with loss='hinge' it trains a linear SVM using stochastic gradient descent
from sklearn.linear_model import SGDClassifier

### Support Vector Machine
A linear SVM learns a separating boundary (a hyperplane) between categories from the numerical TF-IDF features of the articles, choosing the boundary that leaves the widest margin between them. It's like finding the patterns that separate politics from entertainment and sports from technology.
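
The sketch below shows the hinge loss that a linear SVM minimizes for a two-class problem, with labels written as +1 and -1. The weights, bias, and feature vectors are made-up numbers for illustration; for more than two categories, `SGDClassifier` combines several such binary classifiers in a one-versus-rest scheme.

```python
def decision_score(weights, bias, features):
    """Signed score of a linear classifier: positive on one side of the boundary."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def hinge_loss(label, score):
    """Zero once the example is on the correct side with a margin of at least 1."""
    return max(0.0, 1.0 - label * score)

# Hypothetical model parameters and labelled examples (label is +1 or -1)
weights, bias = [2.0, -1.0], 0.5
examples = [
    ([1.0, 0.0], +1),  # far on the correct side
    ([0.0, 1.0], -1),  # correct side, but inside the margin
    ([0.2, 0.4], +1),  # correct side, but inside the margin
]

losses = []
for features, label in examples:
    score = decision_score(weights, bias, features)
    losses.append(hinge_loss(label, score))
    print(features, label, score, losses[-1])
```

Examples that are classified correctly but sit close to the boundary still contribute a loss, which is what pushes the model toward a wide margin.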

In [65]:
from sklearn.pipeline import Pipeline

In [67]:
#think of the pipeline as the steps or methods on a task
text_clf_svm = Pipeline([('vect', CountVectorizer(analyzer=preprocess_data)),
                      ('tfidf', TfidfTransformer()),
                      ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                            alpha=1e-3, random_state=42)),
 ])
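
To illustrate the idea of a pipeline as a chain of steps, here is a stripped-down sketch written from scratch. The classes `Lowercase`, `KeywordClassifier`, and `MiniPipeline` are hypothetical stand-ins, not scikit-learn classes; the point is only that every transformer is applied in the same order during both `fit` and `predict`.

```python
class Lowercase:
    """Transformer step: nothing to learn, just lowercases each text."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [text.lower() for text in X]

class KeywordClassifier:
    """Final step: remembers which label each training word appeared with."""
    def fit(self, X, y):
        self.word_to_label = {}
        for text, label in zip(X, y):
            for word in text.split():
                self.word_to_label[word] = label
        self.default = y[0]
        return self
    def predict(self, X):
        predictions = []
        for text in X:
            labels = [self.word_to_label[w] for w in text.split() if w in self.word_to_label]
            # Majority vote over the known words; fall back to a default label
            predictions.append(max(set(labels), key=labels.count) if labels else self.default)
        return predictions

class MiniPipeline:
    """Runs every transformer in order, then hands the result to the final step."""
    def __init__(self, steps):
        self.steps = steps
    def _transform(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return X
    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self
    def predict(self, X):
        return self.steps[-1][1].predict(self._transform(X))

# Hypothetical training texts, not taken from the dataset
pipe = MiniPipeline([("lower", Lowercase()), ("clf", KeywordClassifier())])
pipe.fit(["Gilas WINS the game", "Singer releases album"], ["Palaro", "Showbiz"])
print(pipe.predict(["GILAS game tonight"]))
```

Because the lowercase step also runs at prediction time, "GILAS" matches the "gilas" seen during training.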

### loss: 'hinge', 'log_loss', 'modified_huber', 'squared_hinge', 'perceptron', 'squared_error', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'
### penalty: 'l1', 'l2', 'elasticnet'
### alpha: non-negative float (regularization strength; larger values regularize more)
### random_state: int (seed that makes the shuffling reproducible)

In [68]:
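#Note: this fits on the full dataset, so the model has already seen the test articles
#and the scores in the Evaluation section are optimistic. For a true held-out estimate,
#fit on the training split instead: text_clf_svm.fit(article_train, category_train)
_=text_clf_svm.fit(df['Article'],df['Category'])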

## Evaluation

In [69]:
from sklearn.metrics import classification_report
#predict the test article
prediction=text_clf_svm.predict(article_test)

In [72]:
# Generate a classification report
report = classification_report(category_test, prediction)

In [74]:
print(report)

              precision    recall  f1-score   support

       Bansa       0.92      0.87      0.89       213
       Metro       0.90      0.97      0.94       187
     Opinyon       0.97      0.97      0.97       181
      Palaro       0.97      1.00      0.99       205
  Probinsiya       0.98      0.94      0.96       194
     Showbiz       1.00      0.98      0.99       216

    accuracy                           0.96      1196
   macro avg       0.96      0.96      0.96      1196
weighted avg       0.96      0.96      0.96      1196
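
For reference, the columns of the report can be reproduced by hand. The sketch below computes precision, recall, F1, and support for one class from paired true and predicted labels; the helper `per_class_metrics` and the five toy labels are hypothetical and unrelated to the numbers above.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correctly predicted as label
    fp = sum(1 for t, p in pairs if t != label and p == label)  # wrongly predicted as label
    fn = sum(1 for t, p in pairs if t == label and p != label)  # label that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true examples of this class
    return precision, recall, f1, support

# Hypothetical true and predicted categories
y_true = ["Palaro", "Palaro", "Showbiz", "Showbiz", "Showbiz"]
y_pred = ["Palaro", "Showbiz", "Showbiz", "Showbiz", "Palaro"]

for label in ["Palaro", "Showbiz"]:
    print(label, per_class_metrics(y_true, y_pred, label))
```

Precision asks how many of the articles predicted as a category really belong to it, while recall asks how many of the articles in a category were found.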



In [75]:
text_clf_svm.predict(["James yap, 54 points sa latest na laro ng gilas pilipinas"])[0]

'Palaro'

# Answers

### 1. article_bow=bow_transformer.transform(df['Article'])
### 2. tfidf_transformer.idf_[bow_transformer.vocabulary_['manila']]
### 3. article_train,article_test,category_train,category_test=train_test_split(df['Article'],df['Category'],test_size=0.3)