# News Category Analysis

## Objective 
Our main goal is to build a model that predicts the category of a news article from its headline, and to save the model and the preprocessing objects so they can be reused 
in a web application. Here are some definitions:

* **Dataset used**: The dataset used is the <a href="https://www.kaggle.com/datasets/rmisra/news-category-dataset">News Category Dataset</a>, which contains around 210k news headlines from 2012 to 2022 from HuffPost. This notebook loads the v2 file (`News_Category_Dataset_v2.json`), which has 200,853 rows.

* **Data preprocessing**: To use text in ML algorithms we need to convert it into vectors. Before that, the data is prepared in the following steps: first, we clean the text by lowercasing it, expanding contractions and collapsing repeated punctuation and repeated characters; next, the text is split into tokens, punctuation tokens are dropped and every token is stemmed with SnowballStemmer; then we apply the tf-idf vectorizer; finally, we use PCA to reduce the dimensionality of the vectors.

* **Model definition and training**: After filtering, our dataset has about 19k headlines (19,025) divided into 5 categories: business, arts, education, science and sports. The training set contains 80% of the data and the rest is the test set. Since we have a reasonable amount of data and only a few categories, a simple logistic regression gives good results. The model estimates a probability for each category and predicts the most likely one.

## Results

On the training data, our best model reaches 85% accuracy, 84% macro-average F1-score and 85% weighted-average F1-score. On the test data, it achieves 80% accuracy, 78% macro-average F1-score and 80% weighted-average F1-score.

## Table of Contents
1. [Load the data](#load_data)
2. [Data preprocessing](#data_preprocessing)
3. [Create Training and Test data](#create_train_test_data)
4. [Defining the model](#defining_model)
5. [Model Hyperparameters tuning](#model_hyperparameters)
6. [Results](#results)
7. [Testing](#test)


<p id='load_data'><a href='#load_data'>#</a></p>

# 1 - Load the data

In [1]:
#imports
##imports for visualization and data preprocessing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd 
import re
import string
import contractions
import pickle 

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

#ML libraries: vectorization, dimensionality reduction, model and model selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

#evaluation metrics

from sklearn.metrics import classification_report, accuracy_score,roc_auc_score,mean_squared_error

#Web scraping libraries
import requests
from bs4 import BeautifulSoup

In [2]:
#load data
directory = r"C:\Users\Talissa\Documents\env\flask_projects\newsApp\news\model_data\News_Category_Dataset_v2.json"

df = pd.read_json(directory,lines=True)

In [3]:
df.head(5)

| | category | headline | authors | link | short_description | date |
|---|---|---|---|---|---|---|
| 0 | CRIME | There Were 2 Mass Shootings In Texas Last Week... | Melissa Jeltsen | https://www.huffingtonpost.com/entry/texas-ama... | She left her husband. He killed their children... | 2018-05-26 |
| 1 | ENTERTAINMENT | Will Smith Joins Diplo And Nicky Jam For The 2... | Andy McDonald | https://www.huffingtonpost.com/entry/will-smit... | Of course it has a song. | 2018-05-26 |
| 2 | ENTERTAINMENT | Hugh Grant Marries For The First Time At Age 57 | Ron Dicker | https://www.huffingtonpost.com/entry/hugh-gran... | The actor and his longtime girlfriend Anna Ebe... | 2018-05-26 |
| 3 | ENTERTAINMENT | Jim Carrey Blasts 'Castrato' Adam Schiff And D... | Ron Dicker | https://www.huffingtonpost.com/entry/jim-carre... | The actor gives Dems an ass-kicking for not fi... | 2018-05-26 |
| 4 | ENTERTAINMENT | Julianna Margulies Uses Donald Trump Poop Bags... | Ron Dicker | https://www.huffingtonpost.com/entry/julianna-... | The "Dietland" actress said using the bags is ... | 2018-05-26 |


In [4]:
df.shape

(200853, 6)

<p id='data_preprocessing'><a href='#data_preprocessing'>#</a></p>

# 2 - Data preprocessing

In [5]:
(df.groupby(["category"]).size()/df.shape[0])*100

category
ARTS               0.751296
ARTS & CULTURE     0.666657
BLACK VOICES       2.254385
BUSINESS           2.955893
COLLEGE            0.569571
COMEDY             2.576511
CRIME              1.695270
CULTURE & ARTS     0.512813
DIVORCE            1.705725
EDUCATION          0.499868
ENTERTAINMENT      7.994902
ENVIRONMENT        0.658691
FIFTY              0.697525
FOOD & DRINK       3.099779
GOOD NEWS          0.696031
GREEN              1.305432
HEALTHY LIVING     3.332786
HOME & LIVING      2.088592
IMPACT             1.722155
LATINO VOICES      0.562103
MEDIA              1.401523
MONEY              0.849875
PARENTING          4.320075
PARENTS            1.969102
POLITICS          16.299981
QUEER VOICES       3.143593
RELIGION           1.272572
SCIENCE            1.084375
SPORTS             2.431629
STYLE              1.122214
STYLE & BEAUTY     4.804011
TASTE              1.043549
TECH               1.036579
THE WORLDPOST      1.824220
TRAVEL             4.922506
WEDDINGS   

In [6]:
#number of categories
len(df["category"].unique())

41

In [7]:
#change_category replaces every category listed in old_categories with new_category in df
def change_category(df,old_categories,new_category):
    for c in old_categories:
        data = df.loc[df.loc[:,"category"]==c]
        df.loc[data.index,"category"]=new_category
    return df

In [8]:
# Merging overlapping categories (Culture & Arts, Wellness, Parenting, Style & Beauty, Taste, Queer Voices, Latino Voices,
# Black Voices, Women, The Worldpost, Weddings, Divorce, College, Green) so that each group becomes a single category

#arts & culture, culture & arts --> arts
#black voices, queer voices, latino voices, women --> social issues 
#style & beauty --> style 
#divorce and weddings --> social
#taste --> food & drink
#college --> education 
#green --> environment 
#parenting --> parents
#the worldpost --> worldpost 
#wellness --> healthy living

new_categories = {"STYLE & BEAUTY":"STYLE", "ARTS & CULTURE":"ARTS","CULTURE & ARTS":"ARTS",
                  "PARENTING":"PARENTS","WELLNESS":"HEALTHY LIVING",
                 "DIVORCE":"SOCIAL","WEDDINGS":"SOCIAL","TASTE":"FOOD & DRINK",
                 "COLLEGE":"EDUCATION","THE WORLDPOST":"WORLDPOST","BLACK VOICES":"SOCIAL ISSUES",
                 "QUEER VOICES":"SOCIAL ISSUES","LATINO VOICES":"SOCIAL ISSUES","WOMEN":"SOCIAL ISSUES",
                 "GREEN":"ENVIRONMENT"}


for k,v in new_categories.items():
    df = change_category(df,[k],v)
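The merge above can also be expressed as a single dictionary lookup. The sketch below uses a plain Python list instead of the DataFrame, and the sample list `categories` is made up for illustration; with pandas, passing the same dictionary to `Series.replace` should give an equivalent result.

```python
# Mapping taken from the cell above (abbreviated); unmapped categories stay unchanged.
new_categories = {
    "STYLE & BEAUTY": "STYLE",
    "ARTS & CULTURE": "ARTS",
    "CULTURE & ARTS": "ARTS",
    "COLLEGE": "EDUCATION",
    "GREEN": "ENVIRONMENT",
}

# Hypothetical sample of raw category labels.
categories = ["ARTS & CULTURE", "CULTURE & ARTS", "SPORTS", "COLLEGE", "GREEN"]

# dict.get(key, default) falls back to the original label when no merge is defined.
merged = [new_categories.get(c, c) for c in categories]
print(merged)
```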

In [9]:
len(df["category"].unique())

28

In [10]:
df["category"].unique()

array(['CRIME', 'ENTERTAINMENT', 'WORLD NEWS', 'IMPACT', 'POLITICS',
       'WEIRD NEWS', 'SOCIAL ISSUES', 'COMEDY', 'SPORTS', 'BUSINESS',
       'TRAVEL', 'MEDIA', 'TECH', 'RELIGION', 'SCIENCE', 'EDUCATION',
       'PARENTS', 'ARTS', 'STYLE', 'ENVIRONMENT', 'FOOD & DRINK',
       'HEALTHY LIVING', 'WORLDPOST', 'GOOD NEWS', 'FIFTY',
       'HOME & LIVING', 'SOCIAL', 'MONEY'], dtype=object)

In [11]:
df.groupby(["category"]).size()

category
ARTS               3878
BUSINESS           5937
COMEDY             5175
CRIME              3405
EDUCATION          2148
ENTERTAINMENT     16058
ENVIRONMENT        3945
FIFTY              1401
FOOD & DRINK       8322
GOOD NEWS          1398
HEALTHY LIVING    24521
HOME & LIVING      4195
IMPACT             3459
MEDIA              2815
MONEY              1707
PARENTS           12632
POLITICS          32739
RELIGION           2556
SCIENCE            2178
SOCIAL             7077
SOCIAL ISSUES     15461
SPORTS             4884
STYLE             11903
TECH               2082
TRAVEL             9887
WEIRD NEWS         2670
WORLD NEWS         2177
WORLDPOST          6243
dtype: int64

For our problem, we're going to use just 5 categories: Arts, Education, Science, Business and Sports

In [12]:
selected_categories = ["ARTS","EDUCATION","SCIENCE","BUSINESS","SPORTS"]
new_df = []
for category in selected_categories:
    data = df[df.category == category]
    new_df.append(data)

df = pd.concat(new_df,ignore_index=True)

In [13]:
#create a csv only with the selected categories to store at our database
df.to_csv("selected_news.csv",encoding="utf-8",index=False)

Remove the columns that are not relevant to the model, keeping only the headline and the category.

In [14]:
df.drop(columns=["authors","link","date","short_description"],inplace=True)
df.head(5)

| | category | headline |
|---|---|---|
| 0 | ARTS | Modeling Agencies Enabled Sexual Predators For... |
| 1 | ARTS | Actor Jeff Hiller Talks “Bright Colors And Bol... |
| 2 | ARTS | New Yorker Cover Puts Trump 'In The Hole' Afte... |
| 3 | ARTS | J. K. Rowling Trolls Trump For Canceled UK Vis... |
| 4 | ARTS | Man Surprises Girlfriend By Drawing Them In Di... |


In [15]:
df.groupby(["category"]).size()

category
ARTS         3878
BUSINESS     5937
EDUCATION    2148
SCIENCE      2178
SPORTS       4884
dtype: int64
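The five categories are not equally frequent, which is useful context for the accuracy figures reported later. As a rough sketch, the snippet below computes each category's share and the accuracy of a hypothetical classifier that always predicts the most frequent category; the counts are copied from the output above.

```python
# Headline counts per category, copied from the output above.
counts = {"ARTS": 3878, "BUSINESS": 5937, "EDUCATION": 2148, "SCIENCE": 2178, "SPORTS": 4884}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Accuracy of a classifier that always predicts the most frequent category.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

for name, share in shares.items():
    print(f"{name:<10} {share:.1%}")
print(f"majority baseline ({majority_class}): {baseline_accuracy:.1%}")
```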

In [16]:
#defining type of category 
df["category"] = df["category"].astype("category")
df.dtypes

category    category
headline      object
dtype: object

In [17]:
#encoding the categories as integer codes 0..4
df["category"]=df["category"].cat.rename_categories([i for i in range(len(df["category"].unique()))])
df["category"]

0        0
1        0
2        0
3        0
4        0
        ..
19020    4
19021    4
19022    4
19023    4
19024    4
Name: category, Length: 19025, dtype: category
Categories (5, int64): [0, 1, 2, 3, 4]
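Since the model will predict these integer codes, it helps to keep a way back to the category names. A minimal sketch, assuming the default pandas behaviour of sorting string categories alphabetically when a column is converted with `astype("category")`; the name `code_to_name` is ours.

```python
selected_categories = ["ARTS", "EDUCATION", "SCIENCE", "BUSINESS", "SPORTS"]

# The integer codes follow the alphabetical order of the category names,
# because rename_categories replaces the sorted categories position by position.
code_to_name = dict(enumerate(sorted(selected_categories)))
print(code_to_name)
```

This ordering agrees with the class sizes shown in the classification reports later in the notebook (for example, class 1 has the 5,937 BUSINESS headlines when train and test supports are added).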

Beginning text preprocessing

In [18]:
def to_lowercase(text):
    text = text.lower()
    return text

In [19]:
test_text = df.loc[0].headline
test_text

'Modeling Agencies Enabled Sexual Predators For Years, Former Agent Says'

In [20]:
to_lowercase(test_text)

'modeling agencies enabled sexual predators for years, former agent says'

In [21]:
def fix_contractions(text):
    for k,v in contractions.contractions_dict.items():
        text = text.replace(k,v)
    return text

In [22]:
def punct_repetition(text,default_replace=""):
    text = re.sub(r"[\.\,\?\!]+(?=[\.\,\?\!])",default_replace,text)
    return text

In [23]:
def word_repetition(text):
    text = re.sub(r'(.)\1+',r'\1\1',text)
    return text 

Applying the text preprocessing to an example

In [24]:
test = "I'm going to mall!!!!"
fix_contractions(test)

'I am going to mall!!!!'

In [25]:
punct_repetition(test)

"I'm going to mall!"

In [26]:
teste = "woooord!"
word_repetition(teste)

'woord!'
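To make the two regular expressions easier to follow, here is a small self-contained sketch that restates both functions and runs them on a few made-up strings.

```python
import re

def punct_repetition(text, default_replace=""):
    # Remove every run of . , ? ! that is followed by another such character,
    # which leaves only the last punctuation mark of the run.
    return re.sub(r"[\.\,\?\!]+(?=[\.\,\?\!])", default_replace, text)

def word_repetition(text):
    # Collapse any character repeated two or more times down to exactly two.
    return re.sub(r"(.)\1+", r"\1\1", text)

print(punct_repetition("wait...what?!"))
print(word_repetition("goooal!!!"))
print(word_repetition("book"))
```

Note that `word_repetition` keeps legitimate double letters such as the "oo" in "book", which is why it reduces runs to two characters rather than one.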

In [27]:
#download packages
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Talissa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [28]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Talissa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

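Create the function `process_text`, which applies all the text-cleaning steps, tokenizes the text and stems the tokens with the Snowball stemmer. The helper `custom_tokenize` can also remove stop words and non-alphabetic tokens, but `process_text` calls it with `keep_stop=True` and `keep_alnum=True`, so only punctuation tokens are removed.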

In [29]:
def custom_tokenize(text,keep_punct=False,keep_alnum=False,keep_stop=False):
    
    token_list = word_tokenize(text)
    if not keep_punct:
        token_list = [ token for token in token_list if token not in string.punctuation]
    
    if not keep_alnum:
        token_list = [ token for token in token_list if token.isalpha()]
    
    if not keep_stop:
        stop_words = set(stopwords.words("english"))
        #keep the negations "not" and "no" by removing them from the stop-word list
        stop_words.discard("not")
        stop_words.discard("no")
        token_list = [token for token in token_list if token not in stop_words]
    
    return token_list

In [30]:
def stem_tokens(tokens,stemmer):
    token_list = []
    for token in tokens:
        token_list.append(stemmer.stem(token))
    return token_list

In [31]:
def process_text(text):
    #clean the text
    text = to_lowercase(text)
    text = fix_contractions(text)
    text = punct_repetition(text)
    text = word_repetition(text)
    
    #tokenize the text and returns stems 
    tokens = custom_tokenize(text,keep_stop=True,keep_alnum=True)
    snowball_stemmer = SnowballStemmer("english")
    stem = stem_tokens(tokens,snowball_stemmer)
    return stem

In [32]:
df["tokens"] = df["headline"].apply(process_text)
df["tokens"]

0        [model, agenc, enabl, sexual, predat, for, yea...
1        [actor, jeff, hiller, talk, “, bright, color, ...
2        [new, yorker, cover, put, trump, in, the, hole...
3        [j., k., rowl, troll, trump, for, cancel, uk, ...
4        [man, surpris, girlfriend, by, draw, ththem, i...
                               ...                        
19020         [thank, you, jame, dolan, and, time, warner]
19021    [maria, sharapova, stun, by, victoria, azarenk...
19022    [giant, over, patriot, jet, over, colt, among,...
19023    [aldon, smith, arrest, 49er, lineback, bust, f...
19024    [dwight, howard, rip, teammat, after, magic, l...
Name: tokens, Length: 19025, dtype: object
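The token `ththem` in the output above is consistent with `fix_contractions` replacing a contraction key that occurs inside another word, since `str.replace` matches substrings. The sketch below shows one possible way to restrict replacements to whole tokens. The table `CONTRACTIONS` is a hypothetical mini-table for illustration (the real one comes from the `contractions` package), and the function names are ours.

```python
import re

# Hypothetical mini-table for illustration only.
CONTRACTIONS = {"i'm": "i am", "can't": "cannot", "em": "them"}

def fix_contractions_naive(text):
    # Same idea as fix_contractions above: plain substring replacement.
    for k, v in CONTRACTIONS.items():
        text = text.replace(k, v)
    return text

# Longest keys first, and only match keys that are not glued to other word characters.
_keys = sorted(CONTRACTIONS, key=len, reverse=True)
_pattern = re.compile(r"(?<!\w)(?:" + "|".join(map(re.escape, _keys)) + r")(?!\w)")

def fix_contractions_bounded(text):
    return _pattern.sub(lambda m: CONTRACTIONS[m.group(0)], text)

print(fix_contractions_naive("i'm drawing them"))
print(fix_contractions_bounded("i'm drawing them"))
```

The `contractions` package also exposes a `contractions.fix` function, which may be worth trying as an alternative to looping over the dictionary by hand.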

<p id='create_train_test_data'><a href='#create_train_test_data'>#</a></p>

# 3 - Create Training and Test data

In [33]:
df.head()

| | category | headline | tokens |
|---|---|---|---|
| 0 | 0 | Modeling Agencies Enabled Sexual Predators For... | [model, agenc, enabl, sexual, predat, for, yea... |
| 1 | 0 | Actor Jeff Hiller Talks “Bright Colors And Bol... | [actor, jeff, hiller, talk, “, bright, color, ... |
| 2 | 0 | New Yorker Cover Puts Trump 'In The Hole' Afte... | [new, yorker, cover, put, trump, in, the, hole... |
| 3 | 0 | J. K. Rowling Trolls Trump For Canceled UK Vis... | [j., k., rowl, troll, trump, for, cancel, uk, ... |
| 4 | 0 | Man Surprises Girlfriend By Drawing Them In Di... | [man, surpris, girlfriend, by, draw, ththem, i... |


Transform our tokens into vectors by applying the `tf-idf Vectorizer`.

In [34]:
def fit_tfidf(corpus):
    tf_vect = TfidfVectorizer(tokenizer=lambda x:x,
                            preprocessor= lambda x:x)
    tf_vect.fit(corpus)
    return tf_vect

In [35]:
#defining X and Y
X = df["tokens"]
y = df["category"]

In [36]:
tf_vect = fit_tfidf(X)

#transform into a matrix 
X_tf= tf_vect.transform(X)



In [37]:
len(tf_vect.vocabulary_)

16496
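To give an intuition for what the vectorizer computes, here is a hand-rolled sketch of tf-idf on three made-up token lists. It assumes scikit-learn's default settings as we understand them (raw term counts, smoothed idf, L2 normalization); it is an illustration, not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Three tiny, already-tokenized "headlines" (made up for illustration).
docs = [["model", "agenc"], ["model", "trump"], ["trump", "hole", "trump"]]

n_docs = len(docs)
# Document frequency: in how many documents each token appears.
doc_freq = Counter(token for doc in docs for token in set(doc))

def idf(token):
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + doc_freq[token])) + 1

def tfidf(doc):
    tf = Counter(doc)
    weights = {t: count * idf(t) for t, count in tf.items()}
    # L2-normalize so every document vector has length 1.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

for doc in docs:
    print({t: round(w, 3) for t, w in tfidf(doc).items()})
```

In the first document, the rarer token `agenc` gets a larger weight than `model`, which appears in two of the three documents.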

In [38]:
X = X_tf.todense()
X

matrix([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.31559294, 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]])

In [39]:
tfidf_docs = pd.DataFrame(X)
tfidf_docs

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16486,16487,16488,16489,16490,16491,16492,16493,16494,16495
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.310224,0.315593,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19020,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0
19021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0
19022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0
19023,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0


In [40]:
#Centering the values around the mean
tfidf_docs = tfidf_docs - tfidf_docs.mean()
tfidf_docs

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16486,16487,16488,16489,16490,16491,16492,16493,16494,16495
0,-0.002198,-0.000084,-0.000439,-0.000023,-0.017905,-0.002931,-0.000027,-0.0001,-0.000024,-0.000024,...,-0.000026,-0.000097,-0.000026,-0.000206,-0.001193,-0.003573,-0.000096,-0.000079,-0.000041,-0.000031
1,-0.002198,-0.000084,-0.000439,-0.000023,-0.017905,-0.002931,-0.000027,-0.0001,-0.000024,-0.000024,...,-0.000026,-0.000097,-0.000026,-0.000206,-0.001193,-0.003573,0.310128,0.315514,-0.000041,-0.000031
2,-0.002198,-0.000084,-0.000439,-0.000023,-0.017905,-0.002931,-0.000027,-0.0001,-0.000024,-0.000024,...,-0.000026,-0.000097,-0.000026,-0.000206,-0.001193,-0.003573,-0.000096,-0.000079,-0.000041,-0.000031
3,-0.002198,-0.000084,-0.000439,-0.000023,-0.017905,-0.002931,-0.000027,-0.0001,-0.000024,-0.000024,...,-0.000026,-0.000097,-0.000026,-0.000206,-0.001193,-0.003573,-0.000096,-0.000079,-0.000041,-0.000031
4,-0.002198,-0.000084,-0.000439,-0.000023,-0.017905,-0.002931,-0.000027,-0.0001,-0.000024,-0.000024,...,-0.000026,-0.000097,-0.000026,-0.000206,-0.001193,-0.003573,-0.000096,-0.000079,-0.000041,-0.000031
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19020,-0.002198,-0.000084,-0.000439,-0.000023,-0.017905,-0.002931,-0.000027,-0.0001,-0.000024,-0.000024,...,-0.000026,-0.000097,-0.000026,-0.000206,-0.001193,-0.003573,-0.000096,-0.000079,-0.000041,-0.000031
19021,-0.002198,-0.000084,-0.000439,-0.000023,-0.017905,-0.002931,-0.000027,-0.0001,-0.000024,-0.000024,...,-0.000026,-0.000097,-0.000026,-0.000206,-0.001193,-0.003573,-0.000096,-0.000079,-0.000041,-0.000031
19022,-0.002198,-0.000084,-0.000439,-0.000023,-0.017905,-0.002931,-0.000027,-0.0001,-0.000024,-0.000024,...,-0.000026,-0.000097,-0.000026,-0.000206,-0.001193,-0.003573,-0.000096,-0.000079,-0.000041,-0.000031
19023,-0.002198,-0.000084,-0.000439,-0.000023,-0.017905,-0.002931,-0.000027,-0.0001,-0.000024,-0.000024,...,-0.000026,-0.000097,-0.000026,-0.000206,-0.001193,-0.003573,-0.000096,-0.000079,-0.000041,-0.000031


In [41]:
tfidf_docs[:10]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16486,16487,16488,16489,16490,16491,16492,16493,16494,16495
0,-0.002198,-8.4e-05,-0.000439,-2.3e-05,-0.017905,-0.002931,-2.7e-05,-0.0001,-2.4e-05,-2.4e-05,...,-2.6e-05,-9.7e-05,-2.6e-05,-0.000206,-0.001193,-0.003573,-9.6e-05,-7.9e-05,-4.1e-05,-3.1e-05
1,-0.002198,-8.4e-05,-0.000439,-2.3e-05,-0.017905,-0.002931,-2.7e-05,-0.0001,-2.4e-05,-2.4e-05,...,-2.6e-05,-9.7e-05,-2.6e-05,-0.000206,-0.001193,-0.003573,0.310128,0.315514,-4.1e-05,-3.1e-05
2,-0.002198,-8.4e-05,-0.000439,-2.3e-05,-0.017905,-0.002931,-2.7e-05,-0.0001,-2.4e-05,-2.4e-05,...,-2.6e-05,-9.7e-05,-2.6e-05,-0.000206,-0.001193,-0.003573,-9.6e-05,-7.9e-05,-4.1e-05,-3.1e-05
3,-0.002198,-8.4e-05,-0.000439,-2.3e-05,-0.017905,-0.002931,-2.7e-05,-0.0001,-2.4e-05,-2.4e-05,...,-2.6e-05,-9.7e-05,-2.6e-05,-0.000206,-0.001193,-0.003573,-9.6e-05,-7.9e-05,-4.1e-05,-3.1e-05
4,-0.002198,-8.4e-05,-0.000439,-2.3e-05,-0.017905,-0.002931,-2.7e-05,-0.0001,-2.4e-05,-2.4e-05,...,-2.6e-05,-9.7e-05,-2.6e-05,-0.000206,-0.001193,-0.003573,-9.6e-05,-7.9e-05,-4.1e-05,-3.1e-05
5,-0.002198,-8.4e-05,-0.000439,-2.3e-05,-0.017905,-0.002931,-2.7e-05,-0.0001,-2.4e-05,-2.4e-05,...,-2.6e-05,-9.7e-05,-2.6e-05,-0.000206,-0.001193,-0.003573,-9.6e-05,-7.9e-05,-4.1e-05,-3.1e-05
6,-0.002198,-8.4e-05,-0.000439,-2.3e-05,-0.017905,-0.002931,-2.7e-05,-0.0001,-2.4e-05,-2.4e-05,...,-2.6e-05,-9.7e-05,-2.6e-05,-0.000206,-0.001193,-0.003573,-9.6e-05,-7.9e-05,-4.1e-05,-3.1e-05
7,-0.002198,-8.4e-05,-0.000439,-2.3e-05,-0.017905,-0.002931,-2.7e-05,-0.0001,-2.4e-05,-2.4e-05,...,-2.6e-05,-9.7e-05,-2.6e-05,-0.000206,-0.001193,-0.003573,-9.6e-05,-7.9e-05,-4.1e-05,-3.1e-05
8,-0.002198,-8.4e-05,-0.000439,-2.3e-05,-0.017905,-0.002931,-2.7e-05,-0.0001,-2.4e-05,-2.4e-05,...,-2.6e-05,-9.7e-05,-2.6e-05,-0.000206,-0.001193,-0.003573,-9.6e-05,-7.9e-05,-4.1e-05,-3.1e-05
9,-0.002198,-8.4e-05,-0.000439,-2.3e-05,-0.017905,-0.002931,-2.7e-05,-0.0001,-2.4e-05,-2.4e-05,...,-2.6e-05,-9.7e-05,-2.6e-05,-0.000206,-0.001193,-0.003573,-9.6e-05,-7.9e-05,-4.1e-05,-3.1e-05


Reduce the dimensionality by applying PCA

In [42]:
pca = PCA()

In [43]:
pca = PCA(n_components=1000)
X_new = pca.fit_transform(tfidf_docs)
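Converting the tf-idf matrix to a dense array before PCA works here, but the dense matrix is large. A back-of-the-envelope estimate, assuming 8-byte floats and using the shapes shown in the outputs above:

```python
n_rows, n_cols = 19025, 16496   # headlines x vocabulary size, from the outputs above
bytes_per_float64 = 8

dense_bytes = n_rows * n_cols * bytes_per_float64
print(f"dense tf-idf matrix: {dense_bytes / 1e9:.2f} GB")

n_components = 1000
reduced_bytes = n_rows * n_components * bytes_per_float64
print(f"after PCA with {n_components} components: {reduced_bytes / 1e6:.0f} MB")
```

If memory becomes a problem, `sklearn.decomposition.TruncatedSVD` accepts sparse input directly. That would be a different technique from the PCA used in this notebook, so the results may differ.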

In [44]:
#Checking for similarity
pd.DataFrame(X_new[:10].dot(X_new[:10].T)).round(1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.4,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,0.1,-0.0,-0.0
1,-0.0,0.3,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,-0.0
2,-0.0,-0.0,0.7,0.0,0.0,-0.0,0.0,0.0,-0.0,-0.0
3,0.0,-0.0,0.0,0.3,0.0,0.0,-0.0,-0.0,-0.0,-0.0
4,-0.0,-0.0,0.0,0.0,0.7,-0.0,-0.0,0.0,-0.0,-0.0
5,-0.0,-0.0,-0.0,0.0,-0.0,0.4,0.0,0.0,0.1,-0.0
6,-0.0,-0.0,0.0,-0.0,-0.0,0.0,0.2,0.0,0.0,-0.0
7,0.1,0.0,0.0,-0.0,0.0,0.0,0.0,0.5,0.0,-0.0
8,-0.0,0.0,-0.0,-0.0,-0.0,0.1,0.0,0.0,0.6,-0.0
9,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.5
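The table above holds raw dot products, so the diagonal shows each vector's squared length rather than 1. A minimal sketch of cosine similarity, which divides the dot product by the two vector lengths; the vectors `a`, `b` and `c` are made up.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

a = [0.6, 0.2, 0.0]
b = [0.3, 0.1, 0.0]   # same direction as a, half the length
c = [0.0, 0.0, 0.5]   # orthogonal to a

print(dot(a, a), cosine_similarity(a, a))
print(dot(a, b), cosine_similarity(a, b))
print(cosine_similarity(a, c))
```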


In [45]:
new_df = pd.DataFrame(X_new)
new_df["category"] = y
new_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,991,992,993,994,995,996,997,998,999,category
0,-0.048235,-0.053161,0.016758,-0.04655,-0.018398,0.003234,0.122535,0.027764,-0.052936,-0.026089,...,-0.012579,0.001445,-0.002215,-0.003683,-0.005629,-0.023917,-0.008892,0.006718,0.004505,0
1,-0.019644,-0.045558,0.007477,-0.015181,-0.038157,0.04793,0.009867,-0.048251,0.078919,0.181145,...,0.002263,0.029282,0.03072,0.014208,0.020531,-0.000456,0.029026,0.007716,-0.006406,0
2,0.059524,-0.05063,0.014088,0.017499,0.034848,-0.002484,-0.010142,-0.091695,-0.015024,-0.065567,...,-0.017096,0.007416,-0.027577,-0.012833,-0.003222,-0.008765,-0.003832,-0.033195,-0.019726,0
3,-0.038662,-0.052088,0.010021,-0.038116,0.001186,0.001947,0.089429,0.032982,-0.010793,0.000645,...,0.025383,0.028813,0.028529,-0.016804,-0.001207,-0.015744,-0.0264,0.01791,0.026685,0
4,-0.019468,-0.072099,0.020274,0.015738,0.031711,0.031034,-0.030594,-0.085184,-0.026411,-0.033109,...,0.013856,0.005316,0.011261,0.00418,0.036561,0.00375,-0.01088,0.006684,0.006572,0


In [46]:
#Defining X and Y again
X = new_df.iloc[:,0:1000]
y = new_df.category

In [47]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [48]:
X_train.shape

(15220, 1000)
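The shape above follows from the 80/20 split of the 19,025 rows. The sketch below reproduces the arithmetic with a seeded shuffle of row indices in plain Python. It does not reproduce the exact rows that `train_test_split` selects, only the idea and the resulting sizes.

```python
import random

n_samples = 19025
test_fraction = 0.2

# Shuffle the row indices with a fixed seed so the split is reproducible.
indices = list(range(n_samples))
random.Random(42).shuffle(indices)

n_test = round(n_samples * test_fraction)
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(len(train_idx), len(test_idx))
```

Because the categories are imbalanced, passing `stratify=y` to `train_test_split` is an option for keeping the class proportions equal in both sets; the notebook does not use it.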

<p id='defining_model'><a href='#defining_model'>#</a></p>

# 4 - Defining the model

In [49]:
model_lr = LogisticRegression(solver="saga",multi_class="multinomial",random_state=84)
model_lr.fit(X_train,y_train)

LogisticRegression(multi_class='multinomial', random_state=84, solver='saga')
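With `multi_class="multinomial"`, the model turns one linear score per category into probabilities with the softmax function and predicts the category with the highest probability. A minimal sketch with hypothetical scores (the numbers are made up, not taken from the fitted model):

```python
import math

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear scores (w_k . x + b_k) for the five categories of one headline.
categories = ["ARTS", "BUSINESS", "EDUCATION", "SCIENCE", "SPORTS"]
scores = [0.2, 2.1, -0.5, 0.3, -1.0]

probabilities = softmax(scores)
predicted = categories[probabilities.index(max(probabilities))]
print(predicted, [round(p, 3) for p in probabilities])
```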

In [50]:
yproba = model_lr.predict_proba(X_test)
ypred_test = model_lr.predict(X_test)
ypred_train = model_lr.predict(X_train)


auc = roc_auc_score(y_test,yproba,multi_class="ovr")

print("------------------------------------------")

print(f"The {model_lr.__class__.__name__} model had an AUC of {auc*100:.2f}")
print(f"The {model_lr.__class__.__name__} model had a test accuracy of {accuracy_score(y_test,ypred_test)*100:.2f}")
print(f"The {model_lr.__class__.__name__} model had a training accuracy of {accuracy_score(y_train,ypred_train)*100:.2f}")

print("-------------------------------------------------")

------------------------------------------
The LogisticRegression model had an AUC of 95.06
The LogisticRegression model had a test accuracy of 79.11
The LogisticRegression model had a training accuracy of 82.69
-------------------------------------------------


In [51]:
y_pred_test = model_lr.predict(X_test)
print(classification_report(y_test,y_pred_test))

              precision    recall  f1-score   support

           0       0.75      0.75      0.75       751
           1       0.75      0.87      0.81      1201
           2       0.83      0.63      0.72       447
           3       0.88      0.63      0.73       456
           4       0.84      0.88      0.86       950

    accuracy                           0.79      3805
   macro avg       0.81      0.75      0.77      3805
weighted avg       0.80      0.79      0.79      3805



<p id='model_hyperparameters'><a href='#model_hyperparameters'>#</a></p>

# 5 - Model Hyperparameters tuning

In [52]:
params = {"C": np.linspace(1,10,10),"penalty":["l2","elasticnet"],"l1_ratio":[0.2,0.3,0.7]}
clf = RandomizedSearchCV(model_lr,params,cv=5,n_jobs=-1,n_iter=10)
clf.fit(X_train,y_train)



RandomizedSearchCV(cv=5,
                   estimator=LogisticRegression(multi_class='multinomial',
                                                random_state=84,
                                                solver='saga'),
                   n_jobs=-1,
                   param_distributions={'C': array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.]),
                                        'l1_ratio': [0.2, 0.3, 0.7],
                                        'penalty': ['l2', 'elasticnet']})

In [53]:
clf.best_estimator_

LogisticRegression(C=4.0, l1_ratio=0.3, multi_class='multinomial',
                   random_state=84, solver='saga')

In [54]:
best_model = clf.best_estimator_

In [55]:
#seeing the params
best_model.get_params()

{'C': 4.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': 0.3,
 'max_iter': 100,
 'multi_class': 'multinomial',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 84,
 'solver': 'saga',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}
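In the output above, the best estimator has `penalty='l2'` together with `l1_ratio=0.3`. As far as we know, scikit-learn only uses `l1_ratio` when the penalty is `elasticnet`, so that value has no effect on this model. One way to avoid sampling such combinations is to pass a list of dictionaries, which `RandomizedSearchCV` accepts for `param_distributions`. The sketch below only builds that list and counts the combinations in plain Python; the names `param_list` and `count_combinations` are ours.

```python
from itertools import product

C_values = [float(c) for c in range(1, 11)]   # same values as np.linspace(1, 10, 10)
l1_ratios = [0.2, 0.3, 0.7]

# One dictionary per penalty, so l1_ratio is only paired with elasticnet.
param_list = [
    {"penalty": ["l2"], "C": C_values},
    {"penalty": ["elasticnet"], "C": C_values, "l1_ratio": l1_ratios},
]

def count_combinations(grid):
    # Number of distinct parameter settings a single dictionary can produce.
    return len(list(product(*grid.values())))

single_grid = {"C": C_values, "penalty": ["l2", "elasticnet"], "l1_ratio": l1_ratios}
n_single = count_combinations(single_grid)
n_split = sum(count_combinations(g) for g in param_list)
print(n_single, n_split)
```

Of the 60 settings in the single dictionary, 30 pair `l2` with an `l1_ratio` that goes unused; the split version leaves 40 settings in which every parameter matters.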

<p id='results'><a href='#results'>#</a></p>

# 6 - Results

In [56]:
y_pred_train_bm = best_model.predict(X_train)
print(classification_report(y_train,y_pred_train_bm))

              precision    recall  f1-score   support

           0       0.83      0.84      0.83      3127
           1       0.82      0.90      0.86      4736
           2       0.88      0.71      0.79      1701
           3       0.88      0.76      0.82      1722
           4       0.89      0.91      0.90      3934

    accuracy                           0.85     15220
   macro avg       0.86      0.82      0.84     15220
weighted avg       0.86      0.85      0.85     15220



In [57]:
y_pred_test_bm = best_model.predict(X_test)
print(classification_report(y_test,y_pred_test_bm))

              precision    recall  f1-score   support

           0       0.75      0.76      0.75       751
           1       0.78      0.86      0.82      1201
           2       0.81      0.68      0.74       447
           3       0.84      0.68      0.75       456
           4       0.85      0.87      0.86       950

    accuracy                           0.80      3805
   macro avg       0.80      0.77      0.78      3805
weighted avg       0.80      0.80      0.80      3805
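The macro and weighted averages in the report above can be reproduced from the per-class rows. Since the per-class values are already rounded to two decimals, the sketch below is approximate, but it shows the difference between the two averages.

```python
# Per-class F1 and support, copied from the test-set report above.
f1 = [0.75, 0.82, 0.74, 0.75, 0.86]
support = [751, 1201, 447, 456, 950]

# Macro average: every class counts equally.
macro_f1 = sum(f1) / len(f1)

# Weighted average: each class is weighted by its number of test samples.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(f"macro avg: {macro_f1:.2f}, weighted avg: {weighted_f1:.2f}")
```

The macro average is lower because the two smallest classes (2 and 3) have the weakest F1-scores and count as much as the larger ones.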



<p id='test'><a href='#test'>#</a></p>

# 7 - Testing 
Performing a test with a news article seen on HuffPost itself.

In [60]:
response = requests.get("https://www.huffpost.com/entry/apple-epic-games-ruling_n_613b856fe4b09519c5044f8d")

In [61]:
raw_data = BeautifulSoup(response.text,"html.parser")
title_tag = str(raw_data.title)
title_tag 

"<title>Federal Judge Loosens Apple's Grip On App Store In Epic Games Ruling | HuffPost Latest News</title>"

In [62]:
strg= [txt for txt in  re.split(r"<title>|</title>",title_tag) if txt !='']
strg

["Federal Judge Loosens Apple's Grip On App Store In Epic Games Ruling | HuffPost Latest News"]
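The headline can also be extracted without regular expressions. The sketch below uses only the standard library's `html.parser` on the title tag shown above; the class name `TitleParser` is ours. With BeautifulSoup, `raw_data.title.string` should return the same title text directly.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

html = "<title>Federal Judge Loosens Apple's Grip On App Store In Epic Games Ruling | HuffPost Latest News</title>"
parser = TitleParser()
parser.feed(html)
parser.close()

# Keep only the part before the " | site name" suffix.
headline = parser.title.split(" | ")[0]
print(headline)
```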

In [63]:
text_news = [txt for txt in re.split(r"\s\|\s\w+" ,strg[0]) if txt !=''][0]
text_news

"Federal Judge Loosens Apple's Grip On App Store In Epic Games Ruling"

In [64]:
def text_pipeline(text):
    processed_text = process_text(text)
    tfidf_text = tf_vect.transform([processed_text])
    tfidf_text = np.array(tfidf_text.todense())
    new_vec = pca.transform(tfidf_text)
    return new_vec
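One detail worth double-checking in this pipeline: the training vectors were centered by hand (`tfidf_docs - tfidf_docs.mean()`) before `pca.fit_transform`, while `text_pipeline` passes the uncentered tf-idf vector to `pca.transform`. To our understanding, scikit-learn's `PCA` subtracts only the mean it learned during `fit`, which is close to zero for data that was already centered, so a new headline would not be shifted by the training mean. The sketch below illustrates the effect with a made-up 2-D mean and a single made-up projection direction; it does not use the notebook's data.

```python
# Made-up training mean and one unit-length projection direction
# (a stand-in for a single PCA component).
train_mean = [0.02, 0.01]
component = [0.6, 0.8]

def project(vec, mean):
    # Center with the given mean, then project onto the component.
    return sum((v - m) * c for v, m, c in zip(vec, mean, component))

new_vec = [0.30, 0.10]

# Training-time behaviour: data centered with the training mean.
centered_projection = project(new_vec, train_mean)

# Inference-time behaviour when the fitted mean is ~0: no real centering happens.
uncentered_projection = project(new_vec, [0.0, 0.0])

# The two results differ by the projection of the training mean itself.
offset = uncentered_projection - centered_projection
print(centered_projection, uncentered_projection, offset)
```

If this applies, fitting `PCA` on the uncentered matrix (so it handles the centering itself) or saving the training mean and subtracting it in `text_pipeline` would keep training and inference consistent.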

In [65]:
new_vec = text_pipeline(text_news)
new_vec

array([[ 1.56274528e-02, -4.55689468e-02,  1.73520746e-02,
         7.13842325e-05,  5.24754819e-02, -2.03750362e-02,
        -3.14255428e-02, -5.63772775e-02,  3.01206385e-02,
        -6.72532551e-02,  1.05787098e-01,  4.85780674e-02,
         1.27405923e-01, -1.64231538e-02,  6.41296674e-04,
        -1.72112676e-03,  9.54464423e-03,  1.83846464e-02,
         1.43156495e-02,  1.35659789e-02,  2.61644393e-02,
        -8.88551397e-03, -4.88552808e-04, -2.11745521e-02,
         1.16673980e-02, -4.45292615e-03, -2.39280116e-02,
        -3.71297178e-02,  1.13518946e-02,  3.01289167e-02,
         4.56676795e-02,  3.90809892e-02, -5.74575279e-03,
        -2.96437793e-03, -2.21860584e-02,  3.52616579e-03,
        -1.63369176e-02, -1.52356839e-02,  8.26469272e-03,
        -2.18573179e-02, -7.50575894e-03, -2.17629652e-02,
         4.67132906e-03,  2.25068266e-02,  8.48329582e-03,
         2.18284532e-02,  2.97758994e-02,  1.43310892e-02,
        -2.24731967e-02, -1.38525809e-02, -1.76515571e-0

In [66]:
best_model.predict(new_vec)

array([1], dtype=int64)

The news is about the Apple company and the model we created categorized it as "Business" which is correct! Yay!

Saving the model and the text preprocessing objects (tf_vect and pca)

In [67]:
filename_model = "./save/my_model.sav"
pickle.dump(best_model,open(filename_model,"wb"))

In [68]:
import cloudpickle

filename_tf_vec = "./save/tf_vec.sav"
cloudpickle.dump(tf_vect,open(filename_tf_vec,"wb"))

In [69]:
filename_pca = "./save/pca.sav"
pickle.dump(pca,open(filename_pca,"wb"))
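For the web application, the saved files need to be read back. The sketch below shows a save/load round trip with context managers, using a temporary directory and a made-up dictionary as a stand-in for a fitted object. It also shows that the standard `pickle` module cannot serialize lambdas, which is a likely reason the vectorizer (built with lambda `tokenizer` and `preprocessor`) is saved with `cloudpickle`.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted object; any picklable Python object works the same way.
artifact = {"name": "my_model", "classes": [0, 1, 2, 3, 4]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_model.sav")
    # Context managers close the file handles even if an error occurs.
    with open(path, "wb") as f:
        pickle.dump(artifact, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == artifact)

# Plain pickle stores functions by name, and a lambda has no importable name.
try:
    pickle.dumps(lambda x: x)
    lambda_is_picklable = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_is_picklable = False
print(lambda_is_picklable)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.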