### Kaggle Case: Topic Labeled News Dataset

In this document we will explore, analyze, and compare an article recommender.

The Kaggle database used is the "Topic Labeled News Dataset". We start below by inspecting and analyzing the dataset.

In [17]:
# Import libraries

import ipywidgets as widgets
from sklearn.datasets import make_regression
import numpy as np
import pandas as pd
%matplotlib notebook
from matplotlib import pyplot as plt
import scipy.stats
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import random

In [18]:
# Show only 3 decimals per value in tables
#pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Function to read data in CSV format
# (sep=None with the python engine lets pandas detect the delimiter)
def load_dataset(path):
    dataset = pd.read_csv(path, sep=None, engine='python')
    return dataset

# Load the assigned dataset
dataset = load_dataset('labelled_newscatcher_dataset.csv')
data = dataset.values

In [19]:
print(dataset.dtypes)

topic             object
link              object
domain            object
published_date    object
title             object
lang              object
dtype: object


In [20]:
print(dataset.isnull().sum())

topic             0
link              0
domain            0
published_date    0
title             0
lang              0
dtype: int64


In [21]:
dataset.head()

| | topic | link | domain | published_date | title | lang |
|---|---|---|---|---|---|---|
| 0 | SCIENCE | https://www.eurekalert.org/pub_releases/2020-0... | eurekalert.org | 2020-08-06 13:59:45 | A closer look at water-splitting's solar fuel ... | en |
| 1 | SCIENCE | https://www.pulse.ng/news/world/an-irresistibl... | pulse.ng | 2020-08-12 15:14:19 | An irresistible scent makes locusts swarm, stu... | en |
| 2 | SCIENCE | https://www.express.co.uk/news/science/1322607... | express.co.uk | 2020-08-13 21:01:00 | Artificial intelligence warning: AI will know ... | en |
| 3 | SCIENCE | https://www.ndtv.com/world-news/glaciers-could... | ndtv.com | 2020-08-03 22:18:26 | Glaciers Could Have Sculpted Mars Valleys: Study | en |
| 4 | SCIENCE | https://www.thesun.ie/tech/5742187/perseid-met... | thesun.ie | 2020-08-12 19:54:36 | Perseid meteor shower 2020: What time and how ... | en |


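As we can see, the DB consists of 6 attributes:
- topic: topic label assigned to the article (e.g. SCIENCE)
- link: URL of the article
- domain: domain of the site that published it (e.g. eurekalert.org)
- published_date: publication date and time
- title: headline of the article
- lang: language code of the article (e.g. en)

None of them has null values, so we do not need to remove any entry from the DB.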

In [22]:
print(dataset['published_date'].min())
print(dataset['published_date'].max())

2012-09-16 04:44:50
2020-08-18 05:49:00
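
Note that `published_date` has dtype `object`, so the `min()` and `max()` above compare strings. That gives the right answer here only because the timestamps follow the sortable `YYYY-MM-DD HH:MM:SS` layout. A minimal sketch of converting the column to real datetimes, assuming every row follows that layout (the three-row `sample` frame is made up for illustration):

```python
import pandas as pd

# Hypothetical three-row sample mimicking the 'published_date' column
sample = pd.DataFrame({
    "published_date": [
        "2020-08-06 13:59:45",
        "2012-09-16 04:44:50",
        "2020-08-18 05:49:00",
    ]
})

# Parse the strings into datetime64 values; errors="coerce" turns
# rows that do not match the format into NaT instead of raising
sample["published_date"] = pd.to_datetime(
    sample["published_date"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)

oldest = sample["published_date"].min()
newest = sample["published_date"].max()
span_days = (newest - oldest).days  # length of the covered period in days
print(oldest, newest, span_days)
```

With a datetime column, date arithmetic and filtering by period become possible, which plain strings do not allow.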


In [23]:
contTopic = dataset['topic'].value_counts()
contTopic

NATION           15000
BUSINESS         15000
TECHNOLOGY       15000
ENTERTAINMENT    15000
SPORTS           15000
WORLD            15000
HEALTH           15000
SCIENCE           3774
Name: topic, dtype: int64

In [24]:
contLang = dataset['lang'].value_counts()
contLang

en    108774
Name: lang, dtype: int64

In [25]:
dataset.describe()

| | topic | link | domain | published_date | title | lang |
|---|---|---|---|---|---|---|
| count | 108774 | 108774 | 108774 | 108774 | 108774 | 108774 |
| unique | 8 | 106130 | 5164 | 68743 | 103180 | 1 |
| top | NATION | https://www.google.com/ | dailymail.co.uk | 2020-08-04 01:00:00 | US tops 5 million confirmed virus cases, to Eu... | en |
| freq | 15000 | 19 | 1855 | 41 | 21 | 108774 |


We can see that there are 8 different topics: 7 of them have 15000 entries each in the DB and one (SCIENCE) has only 3774. We can also see that all articles are in English, so the language cannot be used as a feature in the recommender.
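
The `describe()` output also shows fewer unique titles (103180) and links (106130) than rows (108774), so some entries repeat. The following is a small sketch, on an invented frame, of how such repeats could be counted and dropped. Whether to drop them at all is a judgment call that this analysis does not make.

```python
import pandas as pd

# Invented rows: the second and third share the same title
demo = pd.DataFrame({
    "topic": ["WORLD", "NATION", "NATION"],
    "title": [
        "Floods hit coastal towns",
        "Election results announced",
        "Election results announced",
    ],
})

# Count rows whose title already appeared earlier in the frame
n_repeated = demo.duplicated(subset="title").sum()

# Keep only the first occurrence of each title
deduped = demo.drop_duplicates(subset="title", keep="first").reset_index(drop=True)
print(n_repeated, len(deduped))
```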

I decided that the best attributes of the DB for making recommendations are the topic and the title. The topic is useful because it has variety: someone reading a sports article will usually want to be recommended another sports article rather than one on technology or science. I discard the link, the domain, the language, and the publication date.
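
One possible way to combine these two attributes, as a minimal sketch rather than the final recommender: vectorize the titles with TF-IDF and, for a given article, rank the other articles of the same topic by cosine similarity. The function `recommend` and the toy `articles` frame are hypothetical names introduced only for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data standing in for the 'topic' and 'title' columns (made up)
articles = pd.DataFrame({
    "topic": ["SPORTS", "SPORTS", "SPORTS", "TECHNOLOGY", "TECHNOLOGY"],
    "title": [
        "Local team wins the football cup final",
        "Football cup final ends in penalty shootout",
        "Tennis star retires after long career",
        "New phone released with faster chip",
        "Football video game gets a new release",
    ],
})

def recommend(df, index, top_n=2):
    """Return the top_n same-topic rows whose titles are most similar to df.loc[index]."""
    # Keep only articles sharing the topic of the article being read
    same_topic = df[df["topic"] == df.loc[index, "topic"]]
    # TF-IDF vectors of the candidate titles
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(same_topic["title"])
    # Position of the query article inside the filtered frame
    pos = same_topic.index.get_loc(index)
    scores = cosine_similarity(tfidf[pos], tfidf).ravel()
    # Attach the scores and remove the query article itself
    ranked = same_topic.assign(score=scores).drop(index=index)
    return ranked.sort_values("score", ascending=False).head(top_n)

recs = recommend(articles, 0)
print(recs[["title", "score"]])
```

Filtering by topic first means the technology article about a football video game is never suggested to a sports reader, even though it shares a word with the query title.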

In [39]:
from sklearn.model_selection import train_test_split
# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(dataset['title'], dataset['topic'], test_size=0.2, random_state=0)

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(x_train)
X_train_counts.shape

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, y_train)

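import numpy as np
# The test titles must go through the already fitted vectorizer and TF-IDF
# transformer; passing the raw text in x_test to clf.predict raises
# "ValueError: dimension mismatch"
X_test_counts = count_vect.transform(x_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
predicted = clf.predict(X_test_tfidf)
np.mean(predicted == y_test)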

In [46]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(x_train, y_train)

predicted = text_clf.predict(x_test)
np.mean(predicted == y_test)

0.7875890599862101
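
Because SCIENCE has far fewer examples than the other topics, a single accuracy figure can hide weak results on individual topics. The following is a hedged sketch of inspecting per-topic precision and recall with `classification_report`. It runs on a tiny invented corpus; in the notebook, `x_train`, `y_train`, `x_test` and `y_test` would take the place of the made-up lists.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny invented corpus used only to make the example self-contained
train_titles = [
    "Team wins championship game", "Striker scores twice in final",
    "Coach praises players after match", "New smartphone chip announced",
    "Software update fixes security bug", "Laptop maker unveils new model",
]
train_topics = ["SPORTS"] * 3 + ["TECHNOLOGY"] * 3
test_titles = ["Players celebrate championship", "Security bug found in smartphone"]
test_topics = ["SPORTS", "TECHNOLOGY"]

# Same three steps as the pipeline above: counts, TF-IDF weights, Naive Bayes
demo_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
demo_clf.fit(train_titles, train_topics)
demo_pred = demo_clf.predict(test_titles)

# Precision, recall and F1 reported separately for each topic
report = classification_report(test_topics, demo_pred, zero_division=0)
print(report)
```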

In [45]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

twenty_train.target_names #prints all the categories
#print("\n".join(twenty_train.data[0].split("\n")[:3])) #prints first line of the first data file

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [26]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(dataset['title'])
X_train_counts.shape

(108774, 55042)

In [27]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(108774, 55042)

In [28]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, dataset['topic'])
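
To use a classifier trained this way on unseen titles, the new text has to pass through the same fitted vectorizer and TF-IDF transformer using `transform` (not `fit_transform`), so that the feature columns match those seen during training. A minimal sketch on made-up titles, where the names `vect`, `tfidf` and `model` are illustrative stand-ins for `count_vect`, `tfidf_transformer` and `clf`:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Invented titles and topics standing in for dataset['title'] and dataset['topic']
titles = [
    "Vaccine trial shows promising results", "Doctors warn about flu season",
    "Hospital opens new cancer ward", "Stock markets rally after earnings",
    "Central bank raises interest rates", "Company reports record profits",
]
topics = ["HEALTH"] * 3 + ["BUSINESS"] * 3

vect = CountVectorizer()
tfidf = TfidfTransformer()
# fit_transform learns the vocabulary and IDF weights from the training titles
features = tfidf.fit_transform(vect.fit_transform(titles))
model = MultinomialNB().fit(features, topics)

# transform reuses the learned vocabulary, so the column count matches training;
# words never seen in training are simply ignored
new_titles = ["Bank profits rise as markets rally"]
new_features = tfidf.transform(vect.transform(new_titles))
prediction = model.predict(new_features)
print(new_features.shape[1] == features.shape[1], prediction)
```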