## Introdução

O objetivo deste notebook é executar o K-Means clustering para ver se o algoritmo consegue agrupar com sucesso as notícias em 'Reais' e 'Falsas' usando apenas as palavras dos artigos.

## Imports

In [1]:
import numpy as np # álgebra Linear
import pandas as pd # processamento de dados, E/S de arquivo CSV (por exemplo, pd.read_csv)

import matplotlib.pyplot as plt # plotagem e visualização de dados
import seaborn as sns # melhorar o visual
sns.set() # Definir como estilo padrão

import string #biblioteca python
import re #biblioteca regex

from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short # Preprocesssing
from gensim.models import Word2Vec # Word2vec

from sklearn import cluster # Kmeans clustering
from sklearn import metrics # Métricas para avaliação
from sklearn.decomposition import PCA #PCA
from sklearn.manifold import TSNE #TSNE

## Análise e limpeza de dados

In [2]:
fake = pd.read_csv("Fake.csv")
true = pd.read_csv("True.csv")

In [3]:
fake.head(10)

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"
5,Racist Alabama Cops Brutalize Black Boy While...,The number of cases of cops brutalizing and ki...,News,"December 25, 2017"
6,"Fresh Off The Golf Course, Trump Lashes Out A...",Donald Trump spent a good portion of his day a...,News,"December 23, 2017"
7,Trump Said Some INSANELY Racist Stuff Inside ...,In the wake of yet another court decision that...,News,"December 23, 2017"
8,Former CIA Director Slams Trump Over UN Bully...,Many people have raised the alarm regarding th...,News,"December 22, 2017"
9,WATCH: Brand-New Pro-Trump Ad Features So Muc...,Just when you might have thought we d get a br...,News,"December 21, 2017"


In [4]:
true.head(10)

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"
5,"White House, Congress prepare for talks on spe...","WEST PALM BEACH, Fla./WASHINGTON (Reuters) - T...",politicsNews,"December 29, 2017"
6,"Trump says Russia probe will be fair, but time...","WEST PALM BEACH, Fla (Reuters) - President Don...",politicsNews,"December 29, 2017"
7,Factbox: Trump on Twitter (Dec 29) - Approval ...,The following statements were posted to the ve...,politicsNews,"December 29, 2017"
8,Trump on Twitter (Dec 28) - Global Warming,The following statements were posted to the ve...,politicsNews,"December 29, 2017"
9,Alabama official to certify Senator-elect Jone...,WASHINGTON (Reuters) - Alabama Secretary of St...,politicsNews,"December 28, 2017"


O primeiro problema visto acima é que os dados True contêm:

1. Uma inserção de responsabilidade da Reuters de que o artigo é um tweet
> "The following statements were posted to the verified Twitter accounts of U.S. President Donald Trump, @realDonaldTrump and @POTUS.  The opinions expressed are his own. Reuters has not edited the statements or confirmed their accuracy.  @realDonaldTrump"


2. Nome da cidade e editora no início
> WASHINGTON (Reuters)

Então no próximo bloco de código eu removo isso dos dados

In [5]:
# A seguir está uma maneira simples de remover a inserção da fonte do tweet @realDonaldTrump e Estado/Editor no início do texto

cleansed_data = []
for data in true.text:
    if "@realDonaldTrump : - " in data:
        cleansed_data.append(data.split("@realDonaldTrump : - ")[1])
    elif "(Reuters) -" in data:
        cleansed_data.append(data.split("(Reuters) - ")[1])
    else:
        cleansed_data.append(data)

true["text"] = cleansed_data
true.head(10)

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",The head of a conservative Republican faction ...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,Transgender people will be allowed for the fir...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,The special counsel investigation of links bet...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,Trump campaign adviser George Papadopoulos tol...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,President Donald Trump called on the U.S. Post...,politicsNews,"December 29, 2017"
5,"White House, Congress prepare for talks on spe...",The White House said on Friday it was set to k...,politicsNews,"December 29, 2017"
6,"Trump says Russia probe will be fair, but time...",President Donald Trump said on Thursday he bel...,politicsNews,"December 29, 2017"
7,Factbox: Trump on Twitter (Dec 29) - Approval ...,While the Fake News loves to talk about my so-...,politicsNews,"December 29, 2017"
8,Trump on Twitter (Dec 28) - Global Warming,"Together, we are MAKING AMERICA GREAT AGAIN! b...",politicsNews,"December 29, 2017"
9,Alabama official to certify Senator-elect Jone...,Alabama Secretary of State John Merrill said h...,politicsNews,"December 28, 2017"


In [6]:
true.text[7]

'While the Fake News loves to talk about my so-called low approval rating, @foxandfriends just showed that my rating on Dec. 28, 2017, was approximately the same as President Obama on Dec. 28, 2009, which was 47%...and this despite massive negative Trump coverage & Russia hoax! [0746 EST] - Why is the United States Post Office, which is losing many billions of dollars a year, while charging Amazon and others so little to deliver their packages, making Amazon richer and the Post Office dumber and poorer? Should be charging MUCH MORE! [0804 EST] -- Source link: (bit.ly/2jBh4LU) (bit.ly/2jpEXYR) '

Parte do texto ainda contém vários caracteres/palavras, como:

1. Links
2. Marcação de data e hora
3. Colchetes
4. Números

Portanto, removeremos todos esses caracteres dos dados reais e falsos usando o pré-processamento genlib e um regex personalizado para os links em preparação para o Word2Vec

Antes disso, porém, o título e o texto serão mesclados em um só para que todos possam ser pré-processados juntos. Também adicionarei um rótulo para verdadeiro e falso que será usado posteriormente para avaliar nosso agrupamento

In [7]:
# Mesclando título e texto
fake['Sentences'] = fake['title'] + ' ' + fake['text']
true['Sentences'] = true['title'] + ' ' + true['text']

# Adicionando rótulo falso e verdadeiro
fake['Label'] = 0
true['Label'] = 1

# Podemos mesclar os dois, pois agora temos os rótulos
final_data = pd.concat([fake, true])

# Randomize as linhas para que tudo fique misturado
final_data = final_data.sample(frac=1).reset_index(drop=True)

# Eliminar colunas não é necessário
final_data = final_data.drop(['title', 'text', 'subject', 'date'], axis = 1)

final_data.head(10)

Unnamed: 0,Sentences,Label
0,President Obama Humiliates GOP As Iran Releas...,0
1,Is Gabby Giffords Being Sued By Her Shooter? ...,0
2,China state media warn Trump against renouncin...,1
3,Trump’s White House Team Is So Dumb They Fell...,0
4,Racists Explode As Death Of Bundy Militant Bl...,0
5,Watch This GOP Delegate CLEARLY Show He Doesn...,0
6,LIBERAL TEACHER’S Social Media Message Goes VI...,0
7,Belarus KGB says Ukrainian journalist set up s...,1
8,Leftists on cusp of power as weary Icelanders ...,1
9,MUSLIM DEMOCRAT WOMAN Is Asked How She Feels A...,0


In [8]:
# Aqui nós pré-processamos as sentenças
def remove_URL(s):
    regex = re.compile(r'https?://\S+|www\.\S+|bit\.ly\S+')
    return regex.sub(r'',s)

# Funções de pré-processamento para remover letras minúsculas, links, espaços em branco, tags, números, pontuação, faixas de palavras
CUSTOM_FILTERS = [lambda x: x.lower(), strip_tags, remove_URL, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short]

# Aqui armazenamos as frases processadas e seus rótulos
processed_data = []
processed_labels = []

for index, row in final_data.iterrows():
    words_broken_up = preprocess_string(row['Sentences'], CUSTOM_FILTERS)
    # Isso elimina quaisquer campos que possam ficar em branco após o pré-processamento
    if len(words_broken_up) > 0:
        processed_data.append(words_broken_up)
        processed_labels.append(row['Label'])

## Word2Vec

In [9]:
# Modelo Word2Vec treinado em dados processados
model = Word2Vec(processed_data, min_count=1)

In [10]:
model.wv.most_similar("country")

[('nation', 0.8103289604187012),
 ('america', 0.664252758026123),
 ('countries', 0.5652427077293396),
 ('europe', 0.5639137029647827),
 ('world', 0.5258144736289978),
 ('especially', 0.502290666103363),
 ('prosperous', 0.49783098697662354),
 ('american', 0.4973124563694),
 ('americans', 0.4865381419658661),
 ('realize', 0.4762621819972992)]

## Vetores de frases

In [11]:
# Obtendo o vetor de uma frase com base na média de todos os vetores de palavras na frase
# Obtemos a média, pois isso leva em conta diferentes comprimentos de frase

def ReturnVector(x):
    try:
        return model[x]
    except:
        return np.zeros(100)
    
def Sentence_Vector(sentence):
    word_vectors = list(map(lambda x: ReturnVector(x), sentence))
    return np.average(word_vectors, axis=0).tolist()

X = []
for data_x in processed_data:
    X.append(Sentence_Vector(data_x))

In [12]:
X_np = np.array(X)
X_np.shape

(44889, 100)

## Agrupamento/Clustering

In [13]:
# Treinamento para 2 clusters (Falso e Real)
kmeans = cluster.KMeans(n_clusters=2, verbose=1)

# Fit Predict retornará rótulos
clustered = kmeans.fit_predict(X_np)

  super()._check_params_vs_input(X, default_n_init=10)


Initialization complete
Iteration 0, inertia 0.0.
Converged at iteration 0: center shift 0.0 within tolerance 0.0.
Initialization complete
Iteration 0, inertia 0.0.
Converged at iteration 0: center shift 0.0 within tolerance 0.0.
Initialization complete
Iteration 0, inertia 0.0.
Converged at iteration 0: center shift 0.0 within tolerance 0.0.
Initialization complete
Iteration 0, inertia 0.0.
Converged at iteration 0: center shift 0.0 within tolerance 0.0.
Initialization complete
Iteration 0, inertia 0.0.
Converged at iteration 0: center shift 0.0 within tolerance 0.0.
Initialization complete
Iteration 0, inertia 0.0.
Converged at iteration 0: center shift 0.0 within tolerance 0.0.
Initialization complete
Iteration 0, inertia 0.0.
Converged at iteration 0: center shift 0.0 within tolerance 0.0.
Initialization complete
Iteration 0, inertia 0.0.
Converged at iteration 0: center shift 0.0 within tolerance 0.0.
Initialization complete
Iteration 0, inertia 0.0.
Converged at iteration 0: cent

  return fit_method(estimator, *args, **kwargs)


In [14]:
testing_df = {'Sentence': processed_data, 'Labels': processed_labels, 'Prediction': clustered}
testing_df = pd.DataFrame(data=testing_df)

testing_df.head(10)

Unnamed: 0,Sentence,Labels,Prediction
0,"[president, obama, humiliates, gop, iran, rele...",0,0
1,"[gabby, giffords, sued, shooter, images, repre...",0,0
2,"[china, state, media, warn, trump, renouncing,...",1,0
3,"[trump’s, white, house, team, dumb, fell, fake...",0,0
4,"[racists, explode, death, bundy, militant, bla...",0,0
5,"[watch, gop, delegate, clearly, doesn’t, like,...",0,0
6,"[liberal, teacher’s, social, media, message, g...",0,0
7,"[belarus, kgb, says, ukrainian, journalist, se...",1,0
8,"[leftists, cusp, power, weary, icelanders, pol...",1,0
9,"[muslim, democrat, woman, asked, feels, trump’...",0,0


Os resultados acima mostram que eles foram agrupados corretamente em alguns casos onde 0 é notícia falsa e 1 é notícia real

In [15]:
correct = 0
incorrect = 0
for index, row in testing_df.iterrows():
    if row['Labels'] == row['Prediction']:
        correct += 1
    else:
        incorrect += 1
        
print("Correctly clustered news: " + str((correct*100)/(correct+incorrect)) + "%")

Correctly clustered news: 52.28897948272405%


## Visualização

In [16]:
# PCA of sentence vectors
pca = PCA(n_components=2)
pca_result = pca.fit_transform(X_np)

PCA_df = pd.DataFrame(pca_result)
PCA_df['cluster'] = clustered
PCA_df.columns = ['x1','x2','cluster']

  self.explained_variance_ratio_ = self.explained_variance_ / total_var


## Testes de notícias personalizados

In [17]:
# Teste com notícias falsas geradas em https://www.thefakenewsgenerator.com/
onion_data = "Flint Residents Learn To Harness Superpowers, But Trump Gets Away Again They developed superpowers after years of drinking from a lead-poisoned water supply. But just having incredible abilities doesn't make them superheroes. Not yet. Donald Trump faced off against the superpowered civilians but he got away before they could catch him"

# Artigo de pré-processamento
onion_data = preprocess_string(onion_data, CUSTOM_FILTERS)

# Obtém vetor de frase
onion_data = Sentence_Vector(onion_data)

# Obtém a previsão
kmeans.predict(np.array([onion_data]))

array([0], dtype=int32)

In [18]:
# Notícias da BBC

bbc_data = "Nasa Mars 2020 Mission's MiMi Aung on women in space Next year, Nasa will send a mission to Mars. The woman in charge of making the helicopter that will be sent there – which is set to become the first aircraft to fly on another planet – is MiMi Aung. At 16, MiMi travelled alone from Myanmar to the US for access to education. She is now one of the lead engineers at Nasa. We find out what it's like being a woman in space exploration, and why her mum is her biggest inspiration."

# pré-processamento do artigo
bbc_data = preprocess_string(bbc_data, CUSTOM_FILTERS)

# Obter vetor de sentença
bbc_data = Sentence_Vector(bbc_data)

# Obtém a previsão
kmeans.predict(np.array([bbc_data]))

array([0], dtype=int32)