<a href="https://colab.research.google.com/github/BoosterGold98/Word-cloud-for-Whatsapp-/blob/master/Text_clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Creating a word cloud for WhatsApp chats**

A word cloud is a visual representation of text data, typically used to depict keywords in free-form text. This visualization technique is useful for quickly spotting the most prominent terms in a document. An example is shown below:

![word cloud](https://altoona.psu.edu/sites/altoona/files/styles/photo_gallery_large/public/success-word-cloud.jpg?itok=ywo0g6Pf)

A word cloud is a collection of words that vary in font size and colour according to their significance or frequency of occurrence.

An alternative is to build a table of word occurrences. While the table gives more precise information, a word cloud is more visually appealing and accentuates the most frequently occurring words in a document.
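The table alternative can be sketched with Python's standard library. The sample sentence and variable names below are illustrative assumptions and are not part of the notebook's own code.

```python
from collections import Counter

# Hypothetical sample text, used only for illustration
sample_text = "good morning good night see you in the morning good"

# Count how often each whitespace-separated word occurs
word_counts = Counter(sample_text.split())

# Print a simple frequency table, most frequent words first
for word, count in word_counts.most_common(3):
    print(f"{word:<10}{count}")
```

Such a table gives exact counts, which a word cloud only conveys approximately through font size.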

Python already has a library called **wordcloud** that converts text into a word cloud. Using the library, one can create word clouds from various texts such as books or scripts. Here, a word cloud is built from a WhatsApp chat export. The code is divided into two parts:

1.   Preprocessing of the text
2.   Word cloud creation





In [0]:
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)  # Fetch the stop word list if it is not already installed

This cell imports all the dependencies for the code. The re library is used for text clean-up, while the nltk library provides the stop words that are used later.

First, let's take a look at a WhatsApp chat as exported from the app.



```
01/01/2030, 12:00 AM - Sender:   sample
01/01/2030, 12:00 AM - Receiver: sample
01/01/2030, 12:00 AM - Sender:   sample
01/01/2030, 12:00 AM - Sender:   sample
01/01/2030, 12:00 AM - Sender:   sample
01/01/2030, 12:00 AM - Receiver: <Media omitted>
01/01/2030, 12:00 AM - Sender:   This message was deleted
01/01/2030, 12:00 AM - Receiver: sample

```
The above is a sample WhatsApp chat. Here, "sample" stands for the actual message, while the date and time stamps as well as the sender and receiver names are metadata. The word cloud requires only the message text and nothing else. The features that need to be discarded are:



1.   Emojis
2.   Date and time stamps
3.   Meta messages such as '<Media omitted>' and 'This message was deleted'
4.   Punctuation
5.   Stray blank lines
6.   Any character other than upper and lower case alphabets
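To make the split between metadata and message concrete, the sketch below parses one exported line into its parts. It assumes the layout shown in the sample above (`DD/MM/YYYY, H:MM AM - Name: message`). Real exports vary with device and locale. `LINE_PATTERN` and `split_chat_line` are hypothetical names introduced only for this illustration.

```python
import re

# Assumed layout of one exported line: "DD/MM/YYYY, H:MM AM - Name: message"
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}), "
    r"(?P<time>\d{1,2}:\d{2}(?: [AP]M)?) - "
    r"(?P<sender>[^:]+):\s*"
    r"(?P<message>.*)$"
)

def split_chat_line(line):
    """Return the metadata and message of one chat line, or None if it does not match."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None

parsed = split_chat_line("01/01/2030, 12:00 AM - Sender:   sample")
print(parsed)
```

Only the `message` field is of interest for the word cloud; everything else is discarded.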

In [0]:
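def remove_emoji(string):
    # Character class covering the Unicode ranges that hold most emoji
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters; this wide range also matches non-emoji characters such as CJK ideographs
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)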
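One property of the character class is worth checking before relying on it. Its last range, U+24C2 to U+1F251, is very wide and also contains non-emoji characters such as CJK ideographs. The sketch below tests that single range in isolation; the sample strings and the name `wide_range` are illustrative assumptions.

```python
import re

# The widest range from the emoji pattern, tested on its own
wide_range = re.compile("[\U000024C2-\U0001F251]+")

samples = {
    "latin": "hello",
    "heart": "hello \u2764",        # heavy black heart, U+2764
    "cjk": "hello \u4e2d\u6587",    # two CJK ideographs, U+4E2D and U+6587
}

for label, text in samples.items():
    cleaned = wide_range.sub("", text)
    print(label, repr(cleaned))
```

For chats written in Latin script this makes no difference, but chats in Chinese, Japanese, or Korean would lose most of their text at this step.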

In [0]:
def remove_string(string):
    string = re.sub(r'<.*?>', '', string)                                # Angle-bracket placeholders such as <Media omitted>
    string = re.sub(r'\d{2}/\d{2}/\d{4},', '', string)                   # Date
    string = re.sub(r'\d{1,2}:\d{2}(?:\s?[APap][Mm])? - ', '', string)   # Time, with an optional AM/PM marker
    string = re.sub(r'This message was deleted', '', string)             # Deleted-message notices
    string = re.sub(r'^[^:\n]*:', '', string, flags=re.MULTILINE)        # Sender and receiver names (text up to the first colon on each line)
    string = re.sub(r'[^\w\s]', '', string)                              # Punctuation and symbols (letters, digits and underscores are kept)
    string = re.sub(r'\n\s*\n', '\n', string)                            # Stray blank lines
    return string.lower()                                                # Converting all to lowercase for consistency of words

In [0]:
file_name = "data.txt"                                   # WhatsApp chat export, renamed to data.txt

with open(file_name, encoding="utf8") as chat:           # Encoding is utf8 since chats contain emojis and cannot be read otherwise
    chat_text = chat.read()

text = remove_emoji(chat_text)
text1 = remove_string(text)
#print(text1)                                            # Uncomment this to check output

Let's look at stop words. Stop words are words that are filtered out before or after processing natural language data (text). If one were to find the most frequent word in an English text, it would almost always be 'the', followed by words like 'if', 'that', 'from', etc. This information says little about the content, so such words are removed.
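The effect of stop word filtering can be seen on a single sentence. This is a minimal sketch: the tiny stop word set and the sentence are assumptions made for illustration, whereas the notebook itself uses the much larger NLTK English list.

```python
# A tiny hand-written stop word set, assumed here only for illustration
demo_stop_words = {"the", "if", "that", "from", "a", "is", "to"}

demo_text = "the cake that came from the bakery is a gift to the team"

# Keep only the words that are not in the stop word set
kept_words = [word for word in demo_text.split() if word not in demo_stop_words]

print(kept_words)
```

Only the content-bearing words remain, which is what the word cloud should emphasise.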

In [0]:
stop_words = set(stopwords.words('english'))
words = text1.split()
filtered_words = [w for w in words if w not in stop_words]   # Drop every stop word

# A separate text file stores the cleaned text. Mode 'w' overwrites the previous contents every time this code is run
with open('filteredtext.txt', 'w', encoding='utf8') as filtered_file:
    filtered_file.write(" ".join(filtered_words))