<hr style="border-width:1px;color:DarkSlateBlue; background-color:DarkSlateBlue;border:none; height:4px">


<h5 style="font-size:20; color:#191970;font-family:monospace"><u>Brief Description:</u></h5>
<p style="font-size:18; color:#191970;text-align:justify;text-justify: initial;font-family:monospace">
Analyzing WhatsApp chat data can provide valuable insights beyond sentiment analysis. Here are some additional ways in which WhatsApp chat analysis can be important:
 <ul style="color: #191970;font-family:monospace"><li>Content understanding,</li>
    <li>User activity,</li>
    <li>Participant engagement,</li>
    <li>Social Network,</li>
    <li>Content and Link sharing,</li> 
    <li>User Behaviour pattern etc.</li></ul></p>
<p style="font-size:18; font-weight:600;color:#191970;text-align:justify;text-justify: initial;font-family:monospace">
WhatsApp chat analysis involves studying chat data to gain insights into topics, user behavior, content sharing, engagement, and communication patterns. It offers valuable information for understanding conversations, optimizing communication strategies, monitoring user activity, and more, without focusing on sentiment analysis.
</p>

<hr style="border-width:1px;color:DarkSlateBlue; background-color:DarkSlateBlue;border:none; height:4px">

<h5 style="font-size:20; color:#191970;font-family:monospace"><u>Methodology (in brief):</u></h5>

<ol style="font-size:18; color:#191970; text-align:justify;text-justify: initial;font-family:monospace">
    <li><b>Importing Libraries</b>: Various libraries like re, nltk, pandas, numpy, emoji, collections, and matplotlib.pyplot are imported.</li>

<li><b>Loading Chat Data</b>: The WhatsApp Chat data is loaded from a text file.</li>

<li><b>Data Preprocessing</b>: Text data is preprocessed and cleaned for further analysis. This involves date & time extraction, emoji extraction, word frequency count, and so on.</li>

<li><b>Creating DataFrame</b>: A pandas DataFrame is created to store the processed chat data.</li>

<li><b>Information Extraction</b>: Important features are extracted from the data, such as top users, daily and monthly timelines, busiest day and month, and so on.</li>

<li><b>Data Description</b>: Descriptive statistics are computed on the data.</li>

<li><b>Plotting Graphs</b>: Graphs and charts like heatmaps and word clouds are generated to visualize the chat data and draw meaningful insights.</li>
</ol>

<hr style="border-width:1px;color:DarkSlateBlue; background-color:DarkSlateBlue;border:none; height:4px">

<p style="font-size:28;font-weight:600; color:#191970;text-align:justify;text-justify: initial;font-family:monospace"><font size="4">
    As a student embarking on this project, I've just completed the initial project setup, including importing necessary libraries and loading the WhatsApp chat data. The next steps will involve processing the data, creating a DataFrame, extracting key information, and ultimately visualizing the output through code to gain valuable insights from the chat data.
</font></p>

<hr style="border-width:1px;color:DarkSlateBlue; background-color:DarkSlateBlue;border:none; height:4px">

In [None]:
#importing required libraries
import re
import nltk
import pandas as pd
import numpy as np
import emoji
import collections
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import hvplot.pandas
import simple_colors

In [None]:
file_path_ = r"WhatsAppGroupChat.txt" #put your file name here, process of exporting is stated

In [None]:
#File Exist or not checker
import os

def check_file_existence(file_path_):
    if os.path.exists(file_path_):
        return True
    else:
        return False

def main():
    if check_file_existence(file_path_):
        print(f"The file '{file_path_}' exists.")
    else:
        print(f"The file '{file_path_}' does not exist.")

if __name__ == "__main__":
    main()

In [None]:
#Extracting Date and Time
def date_time(s):
    pattern = r'^([0-9]+)(\/)([0-9]+)(\/)([0-9]+), ([0-9]+):([0-9]+)([ ]|.)?(AM|PM|am|pm)? -'
    result = re.match(pattern, s)
    if result:
        return True
    return False

# Detect the "Author: message" format ==============================
def messenger(s):
    # Split on the first ": " only, so colons inside the message text
    # (URLs, times) do not hide the author
    return len(s.split(": ", 1)) == 2

# Extract Messages ================================================
def message_data(line):
    # Split once on ' - ' so a ' - ' inside the message body is preserved
    dateTime, message = line.split(' - ', 1)
    date, time = dateTime.split(", ", 1)
    if messenger(message):
        author, message = message.split(": ", 1)
    else:
        # System notifications have no author
        author = None
    return date, time, author, message

# Accumulator list for parsed rows ==============================
data = []

# Main block =================================================
try:
    with open(file_path_, encoding="utf-8") as fp:
        fp.readline()  # Skip the first line of the export
        messageBuffer = []
        date, time, author = None, None, None
        for line in fp:
            line = line.strip()
            if date_time(line):
                # A new message starts here: store the previous one first
                if len(messageBuffer) > 0:
                    data.append([date, time, author, ' '.join(messageBuffer)])
                messageBuffer.clear()
                date, time, author, message = message_data(line)
                messageBuffer.append(message)
            else:
                # Continuation line of a multi-line message
                messageBuffer.append(line)
        # Store the final message, which has no timestamp line after it
        if len(messageBuffer) > 0:
            data.append([date, time, author, ' '.join(messageBuffer)])

except Exception as e:
    # Report the problem (missing file, bad encoding, malformed line)
    print(f"An error occurred: {e}")


<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The <u>preprocessing Python script</u> <i>defines functions to detect a timestamp at the start of a line, detect the "Author: message" format, and parse each line into date, time, author, and message</i>. It then reads the exported file line by line, joins the continuation lines of multi-line messages, and prints any error raised while reading. The <i>extracted rows are stored in a list of lists, which is later turned into a Pandas DataFrame with columns for Date, Time, Author, and Message</i>. This DataFrame is the basis for all further analysis of the message data.</font></p>
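
<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">As a small illustration of the timestamp check described above, the sketch below applies the same regular expression to two made-up lines. The sample lines and the helper name is_message_start are assumptions for demonstration only; they are not taken from the chat export.</font></p>

```python
import re

# Same pattern as date_time(): "D/M/Y, H:MM[ am|pm] -" at the start of a line
HEADER = re.compile(
    r'^([0-9]+)(\/)([0-9]+)(\/)([0-9]+), ([0-9]+):([0-9]+)([ ]|.)?(AM|PM|am|pm)? -'
)

def is_message_start(line):
    """Return True when the line begins with a WhatsApp-style timestamp."""
    return HEADER.match(line) is not None

# Hypothetical sample lines: a new message and a continuation line
samples = [
    "17/04/20, 7:20 pm - Alice: Hello everyone",
    "this is the second line of the same message",
]
flags = [is_message_start(s) for s in samples]
print(flags)
```

Only the first line is treated as the start of a new message; the second would be appended to the buffer of the message before it.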

In [None]:
u='\033[4m'  # ANSI escape code: start underlined text
r='\033[0m'  # ANSI escape code: reset text formatting

In [None]:
#creating dataframe
df = pd.DataFrame(data, columns=['Date', 'Time', 'Author', 'Message'])

<p style="font-size:18px; font-weight:400; color:Brown;text-decoration:overline underline;font-family:monospace">Pre-processing steps: Date, Time, Author_Name (sender) and Message extracted, Dataframe created.</p>
<h4 style="font-size:18px; font-weight:400; color:#00008B;font-family:monospace"><u>Current dataframe structure</u>:</h4><p style="font-size:16px; font-weight:300; color:#00008B;font-family:monospace">(Column_Name Non_Null_Count Data_Type)</p>

In [None]:
df.info()

<h4 style="font-size:24px; font-weight:500; color:#00008B;font-family:monospace"><u>Data_Cleaning Processes</u>:</h4>

<ul style="font-size:18px; font-weight:500; color:#00008B;font-family:monospace">
        <li><u>Checking and removing NaN/NULL values.</u></li>  </ul>

In [None]:
#Checking no. of null values in dataset
print(f'{u}Checking no. of null values in dataset:{r}')
print(df.isnull().sum())
#removing null--------------------------------------------------------
df=df.dropna()
print( f'{u}Checking no. of null values after removal:{r}')
print(df.isnull().sum())
print( simple_colors.red('Null values deleted successfully.', ['bold', 'underlined','italic']))

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">We started by checking the <u>null values in the dataset</u> using the code snippet. First, we printed the number of null values for each column. Then, we removed the rows with null values from the DataFrame. After that, we checked again to confirm that there were no more null values. Finally, we added a print statement in red to signify the successful deletion of null values. The <i>entire process ensures a clean dataset without any missing values, ready for further analysis.</i></font></p>

<ul style="font-size:18px; font-weight:500; color:#00008B;font-family:monospace">
        <li>Changing sender names for privacy:</li>  </ul>

In [None]:
#Hiding the Names/Numbers of sender (Privacy)=======
Temp_auth_list=df['Author'].unique().tolist()
# One label per unique author, so no sender stays unmapped however large the group is
sender_list = [f'Sender {i}' for i in range(1, len(Temp_auth_list) + 1)]
author_dict= dict(zip(Temp_auth_list,sender_list ))

print(f'{u}New title for Sender/Author:{r}')
df['Author'] = df['Author'].replace(author_dict)
print(df['Author'].unique())

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">A <u>privacy-focused</u> approach has been implemented in the code to <i>hide sender names or numbers</i>. The unique authors from the DataFrame are <i>assigned temporary labels, such as 'Sender 1', 'Sender 2', and so on</i>. The original sender names or numbers are replaced with these anonymized labels. This <i>ensures confidentiality by masking the actual identities. The code successfully transforms the 'Author' column, providing a layer of privacy protection while maintaining data integrity</i>.</font></p>

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Removing stopwords, punctuation and group notifications.</li>  </ul>

In [None]:
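# Load the stopword list once, as a set, so that membership is an exact
# word match rather than a substring match against the whole file text
with open('stop_hinglish.txt', 'r', encoding='utf-8') as f:
    stop_words = set(f.read().split())

def remove_stop_words(message):
  # Lowercase the message and keep only the words that are not stopwords
  y = []
  for word in message.lower().split():
      if word not in stop_words:
          y.append(word)
  return " ".join(y)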

df['Message'] = df['Message'].apply(remove_stop_words) #remove stopwords
print( simple_colors.red('Stopwords removed.', ['bold', 'underlined','italic']))

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The code defines a <i>function remove_stop_words to eliminate stopwords from a DataFrame column ('Message') using a predefined list from a file</i>. The function converts text to lowercase, removes stopwords, and updates the DataFrame. The print statement uses simple_colors to indicate that stopwords have been successfully removed, applying formatting like bold, underlined, and italic for emphasis.</font></p>
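
<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The sketch below shows why the stopword list is best held as a set of whole words. Testing a word against the raw file text with "in" is a substring test, so short words that merely appear inside a stopword are dropped as well. The stopword text, the sample message, and the function name remove_stop_words_exact are made up for illustration.</font></p>

```python
# Hypothetical stopword file contents, one word per line
stop_text = "the\nhai\nthis\nwas\n"
stop_set = set(stop_text.split())

def remove_stop_words_exact(message, stop_words=stop_set):
    """Lowercase the message and drop words that exactly match a stopword."""
    return " ".join(w for w in message.lower().split() if w not in stop_words)

message = "He was in this hall"

# Exact word matching against the set
cleaned = remove_stop_words_exact(message)

# Substring matching against the raw text, shown for comparison only:
# "he" is removed because it occurs inside "the"
substring_cleaned = " ".join(
    w for w in message.lower().split() if w not in stop_text
)

print(cleaned)
print(substring_cleaned)
```

With the set, "he" survives because it is not itself a stopword; with the raw string it is lost.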

In [None]:
import string
def remove_punctuation(message):
  x = re.sub('[%s]'% re.escape(string.punctuation), '', message)
  return x

df['Message'] = df['Message'].apply(remove_punctuation) #remove punctuations
print( simple_colors.red('Punctuations removed.', ['bold', 'underlined','italic']))

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The code <i>defines a function remove_punctuation to remove punctuation from messages in a DataFrame using regular expressions</i>. It applies this function to the 'Message' column, effectively <i>eliminating punctuation marks</i>. A <u>confirmation message is printed</u> using the simple_colors library, indicating the successful removal of punctuations from the messages in the DataFrame.</font></p>

In [None]:
#cleaning data: drop deleted-message and missed-call notifications
# (messages are already lowercased and stripped of punctuation at this point)
notifications = ['this message was deleted', 'null', 'message deleted',
                 'deleted message', 'missed voice call', 'missed video call']
df = df[~df['Message'].isin(notifications)]

# Drop messages that are empty or whitespace-only after cleaning
df = df[df['Message'].str.strip() != '']

print( simple_colors.red('Printing dataframe:', ['bold', 'underlined','italic']))
# Display the cleaned DataFrame
df

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">In the data cleaning process, the code <i>removes rows with specific message content such as deleted messages, null values, missed voice calls, and missed video calls</i>. Additionally, <i>it eliminates rows with whitespace-only messages</i>. The <u>resulting cleaned DataFrame</u>, excluding the specified messages, is then printed with formatting, emphasizing the data cleaning steps.</font></p>

<h4 style="font-size:18px; font-weight:600; color:red;font-family:monospace"><u>Data preliminary cleaning process completed.</u></h4>

<h4 style="font-size:22px; font-weight:400; color:#00008B;font-family:monospace"><u>Dataframe Overhauling Processes:</u></h4>

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Modifying date and time column.</li>  </ul>

In [None]:
#restructuring date and time column
print( simple_colors.red('Primary columns of the Dataframe:', ['bold', 'underlined','italic']))
print(df.columns)
df["Date&Time"]=df['Date']+' '+df['Time']
df= df.drop('Time', axis=1)
df= df.drop('Date', axis=1)
df=df[['Date&Time','Author','Message']]
print( simple_colors.red('Revised columns of the Dataframe:', ['bold', 'underlined','italic']))
print(df.columns)

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">
The code <u>restructures a DataFrame by combining 'Date' and 'Time' columns into a new 'Date&Time' column</u>. The <i>original 'Time' and 'Date' columns are then dropped, and the DataFrame is rearranged to have 'Date&Time', 'Author', and 'Message' as the primary columns</i>. The revised columns are displayed for confirmation.</font></p>

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Reshaping Date&Time column datatype:</li>  </ul>

In [None]:
# convert the DateTime column to datetime format
print(f"{u}Before Formatting:{r}\n")
print(df['Date&Time'].head(5), "\n")
df['Date&Time'] = pd.to_datetime(df['Date&Time'])
print(f"{u}After Formatting:{r}\n")
print(df['Date&Time'].head(5))
print(simple_colors.red('As the Date&Time was in raw text format we now changed it to Date_Time format as YYYY-MM-DD HH:MM:SS.',['bold', 'underlined','italic']))

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">
The code converts the <u>'Date&Time' column in a DataFrame to datetime format using Pandas' pd.to_datetime function</u>. It then displays the original and formatted date and time values, emphasizing the conversion from raw text to the standard <i>YYYY-MM-DD HH:MM:SS</i> format in a bold, underlined, and italicized red message.</font></p>
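
<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">Dates such as 05/04/21 can be read as day-first or month-first, so it can help to state the layout explicitly. The sketch below assumes a day/month/two-digit-year export with 24-hour times; both the sample values and the format string are assumptions and should be checked against the actual export before use.</font></p>

```python
import pandas as pd

# Hypothetical raw timestamps in day/month/year order, 24-hour clock
raw = pd.Series(["05/04/21 19:20", "17/04/20 07:05"])

# An explicit format removes any day/month ambiguity
parsed = pd.to_datetime(raw, format="%d/%m/%y %H:%M")

print(parsed.dt.day.tolist())
print(parsed.dt.month.tolist())
```

Both sample rows are read as April dates, which is what a day-first export intends.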

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Adding additional columns to get helpful insights during extraction.</li>  </ul>

In [None]:
#Splitting date&time column ====================================
df['Only_date'] = df['Date&Time'].dt.date
df['Year'] = df['Date&Time'].dt.year
df['Month_No'] = df['Date&Time'].dt.month
df['Month'] = df['Date&Time'].dt.month_name()
df['Day'] = df['Date&Time'].dt.day
df['Day_name'] = df['Date&Time'].dt.day_name()
df['Hour'] = df['Date&Time'].dt.hour
df['Minute'] = df['Date&Time'].dt.minute
print(simple_colors.red("After splitting 'Date&Time' column, revised columns are as follows:", ['bold', 'underlined','italic']))
print(df.columns)

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The code <i>splits the 'Date&Time' column in a DataFrame into separate columns for date, year, month number, month name, day, day name, hour, and minute</i>. The <u>resulting columns provide a detailed breakdown of temporal information</u>. The revised columns are displayed using a formatted print statement.</font></p>

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Adding Hour_to_Hour period to get help in further steps during extraction.</li>  </ul>

In [None]:
#Hour to Hour collection =======================================
# Build an "H-(H+1)" label for each message hour, wrapping 23 back to 00
period = []
for hour in df['Hour']:
    if hour == 23:
        period.append(str(hour) + "-" + '00')
    elif hour == 0:
        period.append('00' + "-" + str(hour + 1))
    else:
        period.append(str(hour) + "-" + str(hour + 1))

df['Hour_Period'] = period
print(simple_colors.red("After adding 'Hour_Period' column, revised columns are as follows:", ['bold', 'underlined','italic']))
print(df.columns)

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">A new column <i>'Hour_Period' is created in a DataFrame</i>, representing time periods based on the 'Hour' column. The periods range from one hour to the next, handling the transition from 23 to 00. The <u>revised DataFrame includes this additional column for improved time-based analysis</u>.</font></p>
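
<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">A compact variant of the same idea is sketched below. It zero-pads both ends of the label and uses modulo arithmetic for the 23 to 00 wrap, so the labels sort in clock order when used as heatmap columns. The padding is a design variation, and the sample hours are made up.</font></p>

```python
import pandas as pd

# Hypothetical 'Hour' values
hours = pd.Series([0, 9, 23])

# Zero-pad both ends and wrap 23 -> 00 with modulo arithmetic
hour_period = hours.map(lambda h: f"{h:02d}-{(h + 1) % 24:02d}")

print(hour_period.tolist())
```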

<h4 style="font-size:22px; font-weight:400; color:#00008B;font-family:monospace"><u>Extraction of useful insights:</u></h4>

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Checking valuable insights of the dataframe.</li>  </ul>

In [None]:
df.describe()


<h5 style="font-size:20; color:#191970;font-family:monospace"><u>Overview of the above description:</u></h5>
<p style="font-size:18; color:#191970;text-align:justify;text-justify: initial;font-family:monospace">
The df.describe() function in pandas is like a summary report for your data. By default it gives you a quick overview of the main statistics for each numeric or datetime column (or feature) in your DataFrame. Here's what it tells you:
 <ol style="color: #191970;font-family:monospace">
     <li><b>Count</b>: The number of non-missing (non-null) values in each column. This tells you how many data points you have for each feature.</li>
     <li><b>Mean</b>: The average value for each feature. It's the sum of all values divided by the count. This gives you an idea of the central tendency of your data.</li>
     <li><b>Standard Deviation (std)</b>: This measures the amount of variation or dispersion in your data. A higher standard deviation means that the data points are more spread out from the mean.</li>
     <li><b>Minimum (min)</b>: The smallest value in each column, showing the minimum value observed.</li> 
     <li><b>25th Percentile (25%)</b>: This is the value below which 25% of your data falls. It gives you an idea of the lower end of your data distribution.</li>
     <li><b>50th Percentile (50%)</b>: Also known as the median, this is the value below which 50% of your data falls. It's a good measure of the central point of your data distribution.</li>
     <li><b>75th Percentile (75%)</b>: This is the value below which 75% of your data falls. It gives you an idea of the upper end of your data distribution.</li>
     <li><b>Maximum (max)</b>: The largest value in each column, showing the maximum value observed.</li>
</ol></p>
<p style="font-size:18; font-weight:600;color:#191970;text-align:justify;text-justify: initial;font-family:monospace">
By running df.describe(), we can quickly get a sense of the data's distribution, identify potential outliers, and understand the basic statistics of each feature in the DataFrame. It's a helpful starting point for data exploration and analysis in pandas.
</p>

<h5 style="font-size:20; color:#191970;font-family:monospace"><u>Insights from the above description:</u></h5>
<ol style="color: #191970;font-family:monospace">
    <li>The <b>first message</b> was sent on: 2020-04-17 19:20:00</li>
    <li>The <b>last message</b> was sent on: 2023-09-06 03:48:00</li>
    <li><b>25% of the messages</b> were sent on or before: 2020-12-27 23:33:00</li>
    <li><b>50% of the messages</b> were sent on or before: 2021-03-05 23:40:00</li>
    <li><b>75% of the messages</b> were sent on or before: 2021-11-25 12:19:00</li>
</ol>

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Top sender(s) of the conversation:</li>  </ul>

In [None]:
temp = df['Author'].value_counts().head(5)
print(f"{u}Top sender(s) and their message count:{r}\n")
temp

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The code <i>calculates the top 5 senders and their respective message counts from the 'Author' column in a DataFrame</i>. The <u>results are printed in a formatted message, indicating the most active contributors and their corresponding message counts</u>.</font></p>

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Senders and their message percentage in the conversation:</li>  </ul>

In [None]:
#### Percentage users & their messages =============================
new_df = round(((df['Author'].value_counts() / df.shape[0]) * 100), 2).reset_index()
# Name the columns by position: the labels produced by reset_index()
# differ between pandas versions ('Author'/'count' vs 'index'/'Author')
new_df.columns = ['Code_Name', 'Percent']
print(f"{u}Involvement of each sender by percentage of message sent:{r}\n")
new_df

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The code <u>calculates the percentage of messages sent by each user in a DataFrame</u>. It uses the 'Author' column to determine the message count for each user, rounds the percentages to two decimal places, and displays the results. The <i>resulting columns are named 'Code_Name' and 'Percent' for clarity.</i></font></p>
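
<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The same percentages can also be obtained with value_counts(normalize=True), which returns each sender's share directly. The sketch below uses a made-up list of authors; the variable names share and share_df are illustrative.</font></p>

```python
import pandas as pd

# Hypothetical anonymized authors, one entry per message
authors = pd.Series(
    ["Sender 1", "Sender 2", "Sender 1", "Sender 1"], name="Author"
)

# normalize=True returns fractions of the total; scale to percent and round
share = (authors.value_counts(normalize=True) * 100).round(2)

# Name the index and the value column explicitly
share_df = share.rename_axis("Code_Name").reset_index(name="Percent")

print(share_df)
```

Naming the columns through rename_axis and reset_index(name=...) avoids relying on the default labels that value_counts produces.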

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Total number of words in the conversation:</li>  </ul>

In [None]:
#Total Number of words ========================================
words = []
for message in df['Message']:
  words.extend(message.split())
print(f"{u}Total number of words present in message:{r}",len(words))

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">
The code counts the <i>total number of words in the 'Message' column of a DataFrame</i>. It iterates through each message, splits it into words, and appends them to a list. The <u>total word count is then printed.</u></font></p>

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Total number of media sent in the conversation:</li>  </ul>

In [None]:
#Counting Media sent in the group ===============================
#dup_df=df[['Author','Message']].copy()
df['Word_Count']=df['Message'].str.count('media omitted')
WordCount=(df['Word_Count']==1).sum()
print(f"{u}Number of media items present in conversation{r}:",WordCount)

In [None]:
print(f"{u}Dropped all the rows with message 'Media Omitted'{r}:",WordCount)
#Dropping Rows with Media Omitted Values
df=df[df.Word_Count != 1]

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The code <i>counts the occurrences of the phrase 'media omitted' in the 'Message' column of a DataFrame and prints the number of times it appears</i>. It then <u>drops the rows containing this phrase, updating the DataFrame accordingly.</u></font></p>

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Total number of links sent in the conversation:</li>  </ul>

In [None]:
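#Number of Links Shared ========================================
# Note: punctuation was stripped from 'Message' in an earlier step, so URLs no
# longer contain '.', ':' or '/'; run this count before remove_punctuation
# if an accurate number of links is needed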
from urlextract import URLExtract
extract = URLExtract()

links = []
for message in df['Message']:
    links.extend(extract.find_urls(message))
print(f"{u}Total number of links present in message:{r}\n")
print(len(links))

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The code utilizes the <u>'urlextract' library to extract URLs from messages in a DataFrame</u>. It iterates through the 'Message' column, finds and stores URLs using the URLExtract class. The <i>total number of links is then printed. The result provides insights into the quantity of shared links.</i></font></p>
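
<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">For a quick cross-check without an extra dependency, links can also be counted with a simple regular expression. The sketch below is a rough alternative to urlextract, not a replacement: it only recognises links that start with http:// or https://, and it needs the raw message text, since punctuation stripping removes the characters it looks for. The sample messages are made up.</font></p>

```python
import re

# Minimal pattern: scheme followed by any run of non-space characters
URL_RE = re.compile(r'https?://\S+')

# Hypothetical raw messages (before punctuation removal)
raw_messages = [
    "check this https://example.com/page and http://example.org",
    "no link here",
]

links = [url for message in raw_messages for url in URL_RE.findall(message)]

print(len(links))
print(links)
```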

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Emoji analysis in the conversation:</li>  </ul>

In [None]:
#Emojis in the chat =============================================
# Collect every character that the emoji library recognises as an emoji
emojis = []
for message in df['Message']:
    emojis.extend([c for c in message if c in emoji.EMOJI_DATA])
#print(emojis)


#grouping emojis ================================================
# Counter tallies the occurrences of each emoji
emoji_count = dict(Counter(emojis))

In [None]:
#Sorting dict by values (desc) ==================================
emoji_count=dict(sorted(emoji_count.items(), key=lambda item: item[1], reverse=True))
print(f"{u}Top 10 emojis used in message:{r}\n")
print(dict(list(emoji_count.items())[:10]))

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The code <i>extracts emojis from chat messages, counts their occurrences, and creates a sorted dictionary based on frequency</i>. The <u>result displays the top emojis used in messages</u>. Emojis are collected and counted, providing insights into the most frequently used symbols in the chat data.</font></p>

In [None]:
df['Emoji_Count'] = df['Message'].str.count(r'[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF\U0001FAB0-\U0001FABF\U0001FAC0-\U0001FAFF\U0001FAD0-\U0001FAFF\U0001FAE0-\U0001FAFF\U0001FAF0-\U0001FAFF\U0001F4AA]')

# Calculate the total number of emojis sent by each user
emoji_counts = df.groupby('Author')['Emoji_Count'].sum().sort_values(ascending=False)

# Keep the ten users with the highest emoji counts
user_with_most_emojis = emoji_counts.head(10)

print(f"{u}The users who sent the most emojis are:\n{r}{user_with_most_emojis}")

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The code <i>calculates the total number of emojis sent by each user in a DataFrame</i>. It creates a new column 'Emoji_Count', groups the data by the 'Author,' and computes the sum of emoji counts. The <u>result is a list of users ranked by the highest emoji count</u>, displayed for the top users.</font></p>

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Percentage of (generic) abusive words used in conversation:</li>  </ul>

In [None]:
# Read target words from a text file, skipping blank lines
# (an empty entry would make the pattern match everywhere)
with open('abusive.txt', 'r') as file:
    target_words = [line.strip() for line in file if line.strip()]

# Join the escaped words with logical OR (|) into one whole-word pattern,
# then count every occurrence across all messages
pattern = r'\b(?:' + '|'.join(re.escape(w) for w in target_words) + r')\b'
word_count = df['Message'].str.count(pattern).sum()

# Print the result
print(f"{u}The abusive (target) words appear {word_count} times in the 'Message' column.{r}")
total_words = df['Message'].apply(lambda x: len(x.split())).sum()
percentage_abusive_words = (word_count / total_words) * 100
print(f"{u}The percentage of abusive words compared to total words is: {percentage_abusive_words:.2f}%.{r}")

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The code <i>reads a list of <u>abusive words</u> from a file, checks their occurrences in the 'Message' column of a DataFrame using logical OR, and counts them</i>. The results are printed: the number of matches and their percentage relative to the total word count.</font></p>

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Word frequency of the conversation:</li>  
</ul>
<p style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">(Removed stopwords, other target words, emoji(s) and group notifications.)</p>

In [None]:
emoji_pattern = r'[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF\U0001FAB0-\U0001FABF\U0001FAC0-\U0001FAFF\U0001FAD0-\U0001FAFF\U0001FAE0-\U0001FAFF\U0001FAF0-\U0001FAFF\U0001F4AA]'

# Remove emojis from the 'Message' column
df['Message'] = df['Message'].str.replace(emoji_pattern, '', regex=True)

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Word frequency (most frequent words):</li>  </ul>

In [None]:
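# Count individual words: join all messages, then split on whitespace
# (passing the column itself would count whole messages, not words)
word_freq=collections.Counter(" ".join(df["Message"]).split())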
print(f"{u}Word_Frequency (Most Common 20):{r}\n")
word_freq.most_common(20)

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The code uses the <i>'collections.Counter'</i> to <u>calculate the word frequency in the 'Message' column of a DataFrame</u>. It then prints the most common 20 words and their counts. This provides insights into the frequently used words in the dataset.</font></p>
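
<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The sketch below shows word-level counting on a few made-up messages. The point to note is that the Counter must be fed individual words; feeding it the messages themselves would count identical messages instead. The sample list is an assumption for illustration.</font></p>

```python
from collections import Counter

# Hypothetical cleaned messages
messages = ["good morning all", "good night", "morning"]

# Split every message into words and count each word once per occurrence
word_freq = Counter(word for message in messages for word in message.split())

print(word_freq.most_common(2))
```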

<h4 style="font-size:22px; font-weight:400; color:#00008B;font-family:monospace"><u>Plotting insights:</u></h4>

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Word_Cloud:</li></ul>

In [None]:
words=df['Message']
text = ' '.join(words)
# Create a WordCloud object limited to the 20 most frequent words
wordcloud = WordCloud(width=1400, height=800, background_color='white', max_words=20).generate(text)

# Plot the WordCloud
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The code creates a <u>WordCloud from the 'Message' column in a DataFrame</u>, <i>visualizing the most frequent words</i>. The WordCloud has a width of 1400, height of 800, and displays a maximum of 20 words. It provides a concise and visually appealing representation of the most prevalent words in the text data.</font></p>

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Daily_Timeline:</li></ul>

In [None]:
import seaborn as sns
daily_timeline = df.groupby('Only_date').count()['Message'].reset_index()
plt.figure(figsize=(12, 6))
sns.lineplot(x='Only_date', y='Message', data=daily_timeline, color='maroon')
plt.show()

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The code groups a DataFrame by date, <i>counts the daily message occurrences, and creates a line plot using Seaborn</i>. The x-axis represents the dates, the y-axis shows message counts, and the maroon line depicts the daily timeline. The figure, sized 12x6, provides a clear visualization of messaging patterns over time.</font></p>

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Monthly_Timeline:</li></ul>

In [None]:
#Monthly Data =====================================
timeline = df.groupby(['Year', 'Month_No', 'Month']).count()['Message'].reset_index()
# Build a "Month-Year" label for each row, e.g. "April-2020"
timeline['Time'] = timeline['Month'] + "-" + timeline['Year'].astype(str)

plt.figure(figsize=(12,6))
plt.plot(timeline['Time'], timeline['Message'])
plt.xticks(rotation='vertical')
plt.show()

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The code groups a DataFrame by year and month, <i>counting the number of messages per month</i>. It then creates a <u>timeline by combining the month and year</u>, plotting the message count over time using a line plot. The figure size is set to 12x6 for better visibility, offering a monthly message trend overview.</font></p>

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Busiest day of the week in the conversation:</li></ul>

In [None]:
Busy_day = df['Day_name'].value_counts()
colors = plt.cm.viridis(np.linspace(0, 1, len(Busy_day)))
plt.figure(figsize=(12, 6))
plt.bar(Busy_day.index, Busy_day.values, color=colors)
plt.title("Busy Day")
plt.xticks(rotation='vertical')
plt.show()

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The code generates a bar plot using Matplotlib to visualize the message count for each day of the week in a DataFrame. It employs the 'viridis' colormap for varied colors, enhancing the representation of busy days. The figure size is set to 12x6 for improved visibility.</font></p>
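
<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">Because <i>value_counts()</i> sorts by frequency, the bars appear from busiest to quietest day rather than in calendar order. If a Monday-to-Sunday layout is preferred, the counts could be reindexed as in the sketch below. The frame <i>sample</i> and the list <i>weekday_order</i> are illustrative names, and the sketch assumes 'Day_name' holds full English day names.</font></p>

```python
import pandas as pd

# Hypothetical stand-in for the chat DataFrame; only 'Day_name' is needed here.
sample = pd.DataFrame({
    "Day_name": ["Monday", "Friday", "Monday", "Sunday", "Friday", "Monday"],
})

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# reindex() restores calendar order; fill_value=0 keeps days with no messages.
busy_day_ordered = sample["Day_name"].value_counts().reindex(weekday_order, fill_value=0)
print(busy_day_ordered)
```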

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li>Most active months in the conversation:</li></ul>

In [None]:
#Busy Month ===========================================
busy_month = df['Month'].value_counts()
colors = plt.cm.viridis(np.linspace(0, 1, len(busy_month)))
plt.figure(figsize=(12, 6))
plt.bar(busy_month.index, busy_month.values, color=colors)
plt.title("Busy Month")
plt.xticks(rotation='vertical')
plt.show()

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The code generates a <i>bar plot using Matplotlib to display the message count for each month in a DataFrame</i>. The 'viridis' colormap provides a range of colors, creating a visual representation of busy months. The figure has a size of 12x6 for improved readability.</font></p><br>

<ul style="font-size:18px; font-weight:300; color:#00008B;font-family:monospace">
    <li><u>Most active hours of each day:</u></li></ul>

In [None]:
#Plot a seaborn heatmap of the message count per day and hour period
import seaborn as sns
plt.figure(figsize=(18, 9))
sns.heatmap(df.pivot_table(index='Day_name', columns='Hour_Period', values='Message', 
            aggfunc='count').fillna(0))
plt.yticks(rotation='vertical')
plt.show()

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The <u>seaborn heatmap</u> visualizes the count of messages for each day and time period in a DataFrame. The <i>x-axis represents different time periods, the y-axis shows days, and the color intensity corresponds to the message count</i>. The figure, with a size of 18x9, provides a clear overview of messaging patterns over days and hours.</font></p>
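
<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">To see what the heatmap actually receives, it can help to inspect the pivot table on its own. The sketch below builds one from a tiny made-up frame named <i>sample</i>; the 'Hour_Period' labels such as "10-11" are an assumed format and may differ from the ones created earlier in the notebook.</font></p>

```python
import pandas as pd

# Hypothetical stand-in for the chat DataFrame (labels are assumed, not real data).
sample = pd.DataFrame({
    "Day_name": ["Monday", "Monday", "Tuesday", "Monday"],
    "Hour_Period": ["10-11", "10-11", "22-23", "22-23"],
    "Message": ["hi", "hello", "good night", "bye"],
})

# One row per day, one column per hour period, each cell a message count.
# Day/period combinations with no messages come back as NaN, hence fillna(0).
counts = sample.pivot_table(index="Day_name", columns="Hour_Period",
                            values="Message", aggfunc="count").fillna(0)
print(counts)
```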

In [None]:
#Find duplicate messages in the 'Message' column and count them, sorted in descending order
duplicates_df = df[df.duplicated(subset='Message', keep=False)]
duplicates_count = duplicates_df.groupby('Message').size().reset_index(name='occurrence_count')
duplicates_count = duplicates_count.sort_values(by='occurrence_count', ascending=False)
duplicates_count.head(50)
#---------------------------------------------------------

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">The code identifies and creates a DataFrame, `duplicates_count`, <i>containing duplicate occurrences in the 'Message' column</i> from the original DataFrame, sorted in descending order based on occurrence count. The resulting DataFrame <u>shows the top 50 duplicate messages</u> and their respective occurrence counts.</font></p>
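
<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">A similar table can probably be produced more directly with <i>value_counts()</i>, which already sorts by frequency. The following is only a sketch on a made-up frame named <i>sample</i>; <i>repeated</i> and <i>repeated_df</i> are illustrative names, not part of the notebook.</font></p>

```python
import pandas as pd

# Hypothetical stand-in for the chat DataFrame (not the real data).
sample = pd.DataFrame({"Message": ["ok", "hi", "ok", "lol", "hi", "ok", "bye"]})

# Count every distinct message, then keep only those seen more than once.
counts = sample["Message"].value_counts()
repeated = counts[counts > 1]

# Shape the result like the notebook's duplicates_count table.
repeated_df = repeated.rename_axis("Message").reset_index(name="occurrence_count")
print(repeated_df)
```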

In [None]:
#Creating csv for formatted df
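try:
    df.to_csv(r"Chat_Formatted.csv")
except OSError as error:
    # Report the failure instead of silently swallowing it
    print("Failed to export:", error)
else:
    # Runs only when the export raised no exception
    print("Successfully Exported to CSV format")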

<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">This code <i>exports the DataFrame 'df' to a CSV file named 'Chat_Formatted.csv'</i>. Because the path is relative, the file is written to the notebook's current working directory (here, the 'Mini_Project' directory). The `to_csv` method <u><i>allows easy storage and sharing of the formatted data in a common CSV format for further analysis or reference.</i></u></font></p>
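
<p style="font-size:18;font-weight:600; color:#1E90FF;text-align:justify;text-justify: initial;font-family:monospace">
    <font size="4">By default <i>to_csv</i> also writes the row index as an extra first column, which reappears as an unnamed column when the file is read back. If that is not wanted, <i>index=False</i> can be passed. The sketch below illustrates the round trip with an in-memory buffer and a made-up frame named <i>sample</i>, so no file is written.</font></p>

```python
import io
import pandas as pd

# Hypothetical stand-in for the formatted chat DataFrame.
sample = pd.DataFrame({"User": ["A", "B"], "Message": ["hi", "ok"]})

# Write without the index column, then read the CSV text back in.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```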