<p style="font-size:14pt">This notebook is about creating and visualizing word clouds. Contrary to what you might think, word clouds are not just a decorative design element. They can also be used for analysis, because the size of each word can represent its frequency. For human perception, they may be even more pleasant than bar charts of word frequencies. Sometimes the goal is simply to get a quick impression of the content of the documents.</p>

In [None]:
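!pip install contractions
import os
import numpy as np
import pandas as pd
import nltk
import contractions
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image
import matplotlib.pyplot as plt

# Fetch the NLTK resources used below (stop word list and WordNet for the lemmatizer)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)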

<p style="font-size:14pt">The dataset contains fake and real news articles. It can be found <a href="https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset">here</a>.</p>

In [None]:
Fake = pd.read_csv("../input/fake-and-real-news-dataset/Fake.csv")
Real = pd.read_csv("../input/fake-and-real-news-dataset/True.csv")

Real.head(n=5)

In [None]:
emojis = ["\U0001F600", "\U0001F64F", "\U0001F300", "\U0001F5FF", "\U0001F680", "\U0001F6FF", "\U0001F1E0", "\U0001F1FF", "\U00002702", "\U000027B0", "\U000024C2", "\U0001F251"]
emojis
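
<p style="font-size:14pt">The code points in the list above look like the start and end points of Unicode emoji ranges (for example U+1F600 to U+1F64F for emoticons). A membership test against the list therefore only catches those twelve characters. The following is a minimal sketch, assuming the intent is to remove every character inside those ranges. The names <code>EMOJI_PATTERN</code> and <code>strip_emojis</code> are hypothetical and not part of the notebook.</p>

```python
import re

# Assumption: consecutive pairs in the list mark the start and end of a range.
# The last pair (U+24C2 to U+1F251) is left out on purpose: it spans a very wide
# range that also covers non-emoji scripts such as CJK characters.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
    "\U00002702-\U000027B0"  # dingbats
    "]+"
)


def strip_emojis(text: str) -> str:
    """Remove every character that falls into one of the ranges above."""
    return EMOJI_PATTERN.sub("", text)


print(strip_emojis("breaking \U0001F680 news"))
```

<p style="font-size:14pt">Unlike the list lookup, this also removes emojis that are attached to a word without a space in between.</p>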

In [None]:
stopWords = nltk.corpus.stopwords.words("english")

In [None]:
token = nltk.tokenize.RegexpTokenizer(r"\w+")
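
<p style="font-size:14pt">The pattern <code>\w+</code> keeps runs of letters, digits and underscores and drops everything else, including apostrophes. The sketch below applies the same pattern with <code>re.findall</code> from the standard library, on the assumption that this is close to what the tokenizer does with it. The helper <code>simple_tokens</code> is a hypothetical name used only for illustration.</p>

```python
import re


def simple_tokens(text: str) -> list:
    """Split text into runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", text)


# The apostrophe splits the contraction into two fragments.
print(simple_tokens("don't stop"))
# After expanding the contraction by hand, the tokens are ordinary words.
print(simple_tokens("do not stop"))
```

<p style="font-size:14pt">This is why it helps to expand contractions before tokenizing: otherwise fragments such as "don" and "t" end up in the word list.</p>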

In [None]:
lemmatizer = nltk.stem.WordNetLemmatizer()

In [None]:
def cleaner(DF, c: str):
    """Tokenize, normalize and filter column `c` of DF; return one flat list of words."""
    data = []
    length = DF.shape[0]
    stop = set(stopWords)  # set lookup is much faster than scanning the list
    # Leftover tokens from retweets, HTML entities and mis-decoded characters
    noise = {"rt", "û_", "amp", "ûª", "ûò", "åè", "ìñ1"}

    for i, sentence in enumerate(DF[c]):
        words = sentence.lower().split(" ")
        words = [word for word in words if word not in emojis]
        # Drop links and mentions ("http" also covers "https")
        words = [word for word in words if "http" not in word and "@" not in word]
        words = [contractions.fix(word) for word in words]
        # Lowercase again in case contractions.fix changed the case
        words = token.tokenize(" ".join(words).lower())
        words = [word for word in words if word not in stop]
        words = [lemmatizer.lemmatize(word) for word in words]
        # Drop pure numbers, known noise tokens and single characters
        words = [word for word in words
                 if not word.isdigit() and word not in noise and len(word) > 1]
        data.extend(words)

        # Progress report every 5000 rows
        if i % 5000 == 0:
            print(str(int(round(i / length * 100, 0))) + " %")

    return data

In [None]:
Fake=cleaner(Fake, c="text")
Real=cleaner(Real, c="text")

In [None]:
All_Data = Real + Fake
sFke = set(Fake)
sRel = set(Real)
# Words that occur in only one of the two corpora (kept as sets for fast lookup)
justReal = sRel - sFke
justFake = sFke - sRel

just = list(justReal | justFake)
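
<p style="font-size:14pt">Because <code>just</code> is built from sets, every exclusive word occurs in it exactly once, so the list itself carries no frequency information. The following sketch shows one way the occurrence counts of exclusive words could be recovered with <code>collections.Counter</code>. The function <code>exclusive_counts</code> and the toy lists are hypothetical stand-ins for the output of <code>cleaner</code>, not real data.</p>

```python
from collections import Counter


def exclusive_counts(own_words, other_words):
    """Count the words of own_words that never occur in other_words."""
    other = set(other_words)
    return Counter(word for word in own_words if word not in other)


# Toy token lists standing in for the cleaned corpora.
real_words = ["reuters", "said", "reuters", "senate", "said"]
fake_words = ["said", "video", "featured", "video"]

print(exclusive_counts(real_words, fake_words).most_common())
print(exclusive_counts(fake_words, real_words).most_common())
```

<p style="font-size:14pt">A counter like this is a dictionary, so it could be passed to <code>WordCloud.generate_from_frequencies</code> to make the word size reflect how often an exclusive word actually appears.</p>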

In [None]:
def custom_colours(word, font_size, position, orientation, random_state=None, **kwargs):
    if word in justReal: 
        return "#70E7A4"
    elif word in justFake:
        return "#EA5852"
    return "#CCCCCC"


<p style="font-size:14pt">The color <span style="color:#70E7A4;">green</span> indicates that the word appears only in the real data.</p>

<p style="font-size:14pt">The color <span style="color:#EA5852;">red</span> indicates that the word appears only in the fake data.</p>


In [None]:
wc=WordCloud(
    max_words=10000, 
    relative_scaling=1,
    max_font_size=8,
    min_font_size=1,
    min_word_length=1,
    width=400, height=200,
    background_color="black"
    ).generate(" ".join(just))
wc.recolor(color_func=custom_colours)
plt.figure(figsize=(20, 40), dpi=90)
plt.axis('off')
plt.imshow(wc, interpolation="bilinear")
plt.show()

<p style="font-size:14pt">It is noticeable that words appearing only in fake news are repeated less often overall than words appearing only in real news. Lies therefore seem to differ more from one another than the truth does 🤥</p>

<p style="font-size:14pt">Now we load some images and use their colors for the word cloud.</p>

In [None]:
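colors = np.array(Image.open("../input/imagesforkernels/nlpcover.jpg"))
colors = colors[::5, ::5]  # keep every 5th pixel in both directions to shrink the image
mask = colors.copy()
mask[mask.sum(axis=2) == 0] = 255  # turn pure black pixels white so that no words are drawn there
width = colors.shape[1]
height = colors.shape[0]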
plt.figure(figsize=(20, 40))
plt.axis('off')
plt.imshow(colors)
plt.show()

In [None]:
wc=WordCloud(
    max_words=3000, 
    relative_scaling=1,
    max_font_size=20,
    min_font_size=2,
    width=width, height=height,
    background_color="white",
    mask=mask
    ).generate(" ".join(All_Data))
image_colors = ImageColorGenerator(colors)
wc.recolor(color_func=image_colors)
plt.figure(figsize=(20, 40))
plt.axis('off')
plt.imshow(wc, interpolation="bilinear")
plt.show()

<p style="font-size:14pt">Besides the color, you can also define the shape of the word cloud. To do this, you create a mask array.</p>
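
<p style="font-size:14pt">As a sketch of what such a mask can look like: <code>WordCloud</code> treats pure white entries (255) as masked out and draws words on everything else. The example below builds a circular mask with NumPy instead of loading an image. The function <code>circle_mask</code> is a hypothetical helper and not part of the notebook.</p>

```python
import numpy as np


def circle_mask(size: int) -> np.ndarray:
    """Return a size x size uint8 array: 0 inside a centered circle, 255 outside."""
    y, x = np.ogrid[:size, :size]
    center = (size - 1) / 2
    # True for every pixel that lies outside the circle of radius size / 2
    outside = (x - center) ** 2 + (y - center) ** 2 > (size / 2) ** 2
    return (255 * outside).astype(np.uint8)


mask_demo = circle_mask(200)
print(mask_demo.shape, mask_demo[0, 0], mask_demo[100, 100])
```

<p style="font-size:14pt">Such an array could be passed as <code>WordCloud(mask=circle_mask(200), background_color="white")</code>. When a mask is given, the shape of the mask determines the size of the image, so <code>width</code> and <code>height</code> are not needed.</p>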

In [None]:
colors=np.array(Image.open("../input/imagesforkernels/palmCart.jpg"))
mask = colors.copy()
mask[mask.sum(axis=2) == 0] = 255
width=colors.shape[1]
height=colors.shape[0]
plt.figure(figsize=(10, 10))
plt.axis('off')
plt.imshow(colors)
plt.show()

In [None]:
wc=WordCloud(
    max_words=2000, 
    relative_scaling=1,
    max_font_size=20,
    min_font_size=5,
    width=width, height=height,
    background_color="white",
    mask=mask
    ).generate(" ".join(Fake))
image_colors = ImageColorGenerator(colors)
wc.recolor(color_func=image_colors)
plt.figure(figsize=(20, 20))
plt.axis('off')
plt.imshow(wc, interpolation="bilinear")
plt.show()

In [None]:
colors=np.array(Image.open("../input/imagesforkernels/kaggle-logo-transparent-300.jpg"))
mask = colors.copy()
mask[mask.sum(axis=2) == 0] = 255
width=colors.shape[1]
height=colors.shape[0]
plt.figure(figsize=(10, 10))
plt.axis('off')
plt.imshow(colors)
plt.show()

In [None]:
wc=WordCloud(
    max_words=1000, 
    relative_scaling=0.5,
    max_font_size=10,
    min_font_size=2,
    width=width, height=height,
    background_color="white",
    mask=mask
    ).generate(" ".join(Fake))
image_colors = ImageColorGenerator(colors)
wc.recolor(color_func=image_colors)
plt.figure(figsize=(20, 20))
plt.axis('off')
plt.imshow(wc, interpolation="bilinear")
plt.show()