#### **WhatsApp Chat Analysis in Streamlit**

This notebook will guide you through building a **Streamlit app** to analyze WhatsApp chat data. The app allows users to upload their WhatsApp chat exports and provides various insights, such as message frequency, sentiment analysis, and the most active users.

#### **WhatsApp Chat Data Processing**

In this notebook, we process a WhatsApp chat data file using regular expressions and `pandas` to clean and structure the data for further analysis.

#### **Import Libraries** 

We will use `re` for regular expressions and `pandas` for data manipulation.

In [3]:
import re
import pandas as pd

# Read the exported chat file; the context manager closes it automatically
with open("WhatsApp Chat with Group discussion(AI).txt", "r", encoding="utf-8") as f:
    data = f.read()

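# Regex pattern for one chat line: date, time, sender phone number, message.
# Only senders shown as phone numbers (e.g. "+92 303 6123098") are matched.
pattern = r"(\d{2}/\d{2}/\d{4}),\s(\d{1,2}:\d{2}\s?[ap]m)\s-\s(\+\d{2}\s\d{3}\s\d{7}):\s(.*)"

# Extract every matching line from the raw chat text in `data`
matches = re.findall(pattern, data)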

# Create a DataFrame from the extracted matches
df = pd.DataFrame(matches, columns=['Date', 'Time', 'User', 'Message'])

# Convert 'Date' to datetime so the day name, month name, and year can be extracted
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

# Extract the day name (e.g., "Monday")
df['Day'] = df['Date'].dt.strftime('%A')

# Extract the full month name (e.g., "July")
df['Month'] = df['Date'].dt.strftime('%B')

# Extract the four-digit year as a string (e.g., "2024")
df['Year'] = df['Date'].dt.strftime('%Y')

# Clean up the 'Time' column before splitting it into 'Hour', 'Minute', and 'AMPM'
df['Time'] = df['Time'].str.replace('\u202f', ' ', regex=False)  # Replace narrow no-break spaces (U+202F) with regular spaces

# Split the cleaned 'Time' column into 'Hour:Minute' and 'AMPM'
df[['Hour_Minute', 'AMPM']] = df['Time'].str.extract(r'(\d{1,2}:\d{2})\s?(am|pm)', expand=True)

# Split the 'Hour_Minute' into 'Hour' and 'Minute'
df[['Hour', 'Minute']] = df['Hour_Minute'].str.split(':', expand=True)

# Drop the intermediate 'Date', 'Time', and 'Hour_Minute' columns, which are no longer needed
df = df.drop(columns=['Date', 'Time', 'Hour_Minute'])

# Display the DataFrame
df.head()


| | User | Message | Day | Month | Year | AMPM | Hour | Minute |
|---|---|---|---|---|---|---|---|---|
| 0 | +92 303 6123098 | Assalamualaikum wa rehmatullahi wa baraktuh | Thursday | July | 2024 | pm | 6 | 39 |
| 1 | +92 303 6123098 | Agr koe kr skta to mjy bta dyn Mai unko cv per... | Thursday | July | 2024 | pm | 6 | 40 |
| 2 | +92 328 9460713 | `<Media omitted>` | Thursday | July | 2024 | pm | 9 | 36 |
| 3 | +92 328 9460713 | Remember Brothers, Don't miss it ! | Thursday | July | 2024 | pm | 9 | 38 |
| 4 | +92 310 8668380 | `<Media omitted>` | Friday | July | 2024 | am | 3 | 18 |
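Two properties of the pattern above are worth keeping in mind: `.` does not match a newline, so `re.findall` captures only the first line of a multi-line message, and entries whose sender is not written as a phone number (saved contact names, system notices) are not matched at all. The following is a minimal sketch of a line-by-line parser that keeps continuation lines, under the assumption that every new entry starts with the same date and time prefix. The names `START_RE`, `SENDER_RE`, and `parse_chat` are hypothetical, and the sample text is made up rather than taken from the real export.

```python
import re

# A line that starts a new entry: "dd/mm/yyyy, h:mm am|pm - rest"
START_RE = re.compile(r"^(\d{2}/\d{2}/\d{4}),\s(\d{1,2}:\d{2}\s?[ap]m)\s-\s(.*)$")
# Sender written as a phone number, same shape as in the notebook's pattern
SENDER_RE = re.compile(r"^(\+\d{2}\s\d{3}\s\d{7}):\s(.*)$")


def parse_chat(text):
    """Return a list of [date, time, user, message] rows."""
    rows = []
    current = None  # row that continuation lines are appended to
    for line in text.splitlines():
        start = START_RE.match(line)
        if start is None:
            # Not a new entry: treat it as a continuation of the last kept message
            if current is not None:
                current[3] += "\n" + line
            continue
        date, time, rest = start.groups()
        sender = SENDER_RE.match(rest)
        if sender is None:
            # System notice or saved contact name: skipped, like in the notebook
            current = None
            continue
        current = [date, time, sender.group(1), sender.group(2)]
        rows.append(current)
    return rows


# Hypothetical sample, not taken from the real export
sample = (
    "25/07/2024, 6:39\u202fpm - +92 300 0000000: Hello everyone\n"
    "this is a second line\n"
    "25/07/2024, 6:40\u202fpm - Ali: saved contact name\n"
    "25/07/2024, 9:36\u202fpm - +92 300 0000001: <Media omitted>\n"
)
rows = parse_chat(sample)
```

The resulting `rows` list can be passed to `pd.DataFrame(rows, columns=['Date', 'Time', 'User', 'Message'])` in place of `matches`, after which the rest of the cell runs unchanged.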


#### **Basic Statistics**

In [4]:
df.describe()

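| | User | Message | Day | Month | Year | AMPM | Hour | Minute |
|---|---|---|---|---|---|---|---|---|
| count | 960 | 960 | 960 | 960 | 960 | 960 | 960 | 960 |
| unique | 103 | 738 | 7 | 5 | 1 | 2 | 12 | 60 |
| top | +92 307 7789508 | `<Media omitted>` | Thursday | August | 2024 | pm | 11 | 2 |
| freq | 88 | 145 | 247 | 324 | 960 | 708 | 136 | 39 |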
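For string (`object`) columns, `describe()` reports the number of non-null values (`count`), the number of distinct values (`unique`), the most frequent value (`top`), and how often that value occurs (`freq`). In the table above, for example, the most active user sent 88 of the 960 messages. The sketch below reproduces those four numbers for a single column with the standard library only, to make the definitions concrete. `describe_column` is a hypothetical helper and the sample values are made up.

```python
from collections import Counter


def describe_column(values):
    """Return count, unique, top and freq for a non-empty list of strings."""
    counts = Counter(values)
    top, freq = counts.most_common(1)[0]  # most frequent value and its count
    return {"count": len(values), "unique": len(counts), "top": top, "freq": freq}


# Hypothetical sample values, not the real chat data
users = ["+92 300 0000000", "+92 300 0000001", "+92 300 0000000"]
summary = describe_column(users)
```

In pandas itself, `df['User'].value_counts()` gives the full per-user message counts that `top` and `freq` summarize.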


In [5]:
# Column dtypes, non-null counts, and memory usage
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 960 entries, 0 to 959
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   User     960 non-null    object
 1   Message  960 non-null    object
 2   Day      960 non-null    object
 3   Month    960 non-null    object
 4   Year     960 non-null    object
 5   AMPM     960 non-null    object
 6   Hour     960 non-null    object
 7   Minute   960 non-null    object
dtypes: object(8)
memory usage: 60.1+ KB
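The `df.info()` output shows every column stored as `object`, so `Hour` and `Minute` are strings and `Hour` is still on a 12-hour clock. For hour-of-day analysis it may help to combine `Hour` and `AMPM` into a single 24-hour integer. The following is a minimal sketch of that conversion; `to_24h` is a hypothetical helper, not part of the original notebook.

```python
def to_24h(hour, ampm):
    """Convert a 12-hour clock hour plus 'am'/'pm' to an integer in 0-23."""
    h = int(hour) % 12  # maps 12 to 0, so 12 am becomes 0 and 12 pm becomes 12
    return h + 12 if ampm.lower() == "pm" else h


# A few spot checks on hand-picked values
midnight = to_24h("12", "am")
noon = to_24h("12", "pm")
evening = to_24h("6", "pm")
```

Applied to the DataFrame, something like `df['Hour24'] = [to_24h(h, a) for h, a in zip(df['Hour'], df['AMPM'])]` would add the new column, where `Hour24` is an assumed column name. `df['Minute'].astype(int)` and `df['Year'].astype(int)` convert the remaining numeric strings.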


In [6]:
df.shape

(960, 8)

In [8]:
df[df['User']=="+92 307 7789508"].shape

(88, 8)

In [9]:
# Filter rows where the 'User' column is equal to '+92 307 7789508'
result = df[df['User'] == "+92 307 7789508"].shape

print(result)  # Prints (rows, columns) for this user's messages: 88 rows, 8 columns

(88, 8)
