## Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '../scripts')))

In [3]:
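# Import only the helpers this notebook uses, so their origin is explicit
from preprocessing import remove_emojis, process_text, remove_non_english_rows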

## Read the data

In [4]:
df = pd.read_csv('../data/telegram_messages.csv')

In [5]:
df.head()

Unnamed: 0,date,message
0,2024-12-06 02:01:20,Hey Unihorse 🦄\nHide my Identity\nI need to ve...
1,2024-12-06 02:01:20,Hey Unihorse 🦄\nHide my Identity\nI need to ve...
2,2024-12-06 02:01:18,Hey Unihorse 🦄\nHide my Identity\nI need to ve...
3,2024-12-06 02:01:15,Hey Unihorse 🦄\nHide my Identity\nI need to ve...
4,2024-12-06 02:01:14,Hey Unihorse 🦄\nHide my Identity\nI need to ve...


In [6]:
df.tail()

Unnamed: 0,date,message
21262,2017-05-28 02:44:35,Hey unihorse 🐴.\nPlease hide my identity.\nHer...
21263,2017-05-27 15:48:49,Hey unihorse 🐴\nHide my identity.\nI got a con...
21264,2017-05-27 15:37:51,Hey unihorse 🐴.\nHide my identity.\nI have a c...
21265,2017-05-27 14:36:54,"A place to ask for advice, look for comfort or..."
21266,2017-05-27 14:17:47,[Media or Non-Text Message]


In [7]:
df.shape

(21267, 2)

## The target column (`message`) is not clean, so we will clean it

### First, inspect a sample message

In [8]:
df.iloc[1]['message']

"Hey Unihorse 🦄\nHide my Identity\nI need to vent\nGenuineee question here. I'm 19F and...all my life I have never been in any relationship( ik it's not the appropriate age to be in one either... for the most part)  And looking at my peers around me, going on multiple dates, having too many exes and stuff, I wonder...HOW TF DO YALL GET BF OR GF? Like seriously! Does it...just...happen? Idk, maybe it's because I've been obsessed with books(specifically fiction), and I'm more of an idealistic person than realistic... But I really don't get it. I always think like...there is a special moment where the man sees the woman and he falls for her and he will just try his best to win her or the woman is unrelenting and she will not stop pestering him until he eventually finds himself in love with her (oh...I forgot to mention I'm delusional too☺️) I believed it is how it always happens. But now...I'm in uni and...I see all these people dating and mnamn and I feel like... is that how it always ha

### Remove emojis from the `message` column

In [9]:
remove_emojis(df, 'message')
df.head()

Unnamed: 0,date,message
0,2024-12-06 02:01:20,Hey Unihorse \nHide my Identity\nI need to ven...
1,2024-12-06 02:01:20,Hey Unihorse \nHide my Identity\nI need to ven...
2,2024-12-06 02:01:18,Hey Unihorse \nHide my Identity\nI need to ven...
3,2024-12-06 02:01:15,Hey Unihorse \nHide my Identity\nI need to ven...
4,2024-12-06 02:01:14,Hey Unihorse \nHide my Identity\nI need to ven...


In [10]:
df.tail()

Unnamed: 0,date,message
21262,2017-05-28 02:44:35,Hey unihorse .\nPlease hide my identity.\nHere...
21263,2017-05-27 15:48:49,Hey unihorse \nHide my identity.\nI got a conf...
21264,2017-05-27 15:37:51,Hey unihorse .\nHide my identity.\nI have a co...
21265,2017-05-27 14:36:54,"A place to ask for advice, look for comfort or..."
21266,2017-05-27 14:17:47,[Media or Non-Text Message]
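
The source of `remove_emojis` lives in `../scripts/preprocessing.py` and is not shown in this notebook. As a rough illustration only, emoji stripping can be sketched with a regular expression over a few Unicode ranges. The name `strip_emojis` and the chosen ranges are assumptions, and the ranges are not exhaustive.

```python
import re

# Assumed, non-exhaustive emoji ranges (hypothetical; the real helper may differ)
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator letters used for flags
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0000FE0F"             # variation selector that follows some emojis
    "\U0000200D"             # zero-width joiner used in emoji sequences
    "]+"
)


def strip_emojis(text):
    """Return text with characters in the assumed emoji ranges removed."""
    if not isinstance(text, str):
        return text  # leave NaN and other non-string values untouched
    return EMOJI_PATTERN.sub("", text)
```

A function like this could be applied per row with `df['message'].apply(strip_emojis)`. Note that it leaves the space before the emoji in place, which matches the `Hey Unihorse \n` pattern visible in the output above.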


### Remove the first two lines of each message, since they are the same boilerplate in almost every post

In [12]:
df['message'] = df['message'].apply(process_text)

In [15]:
df.head(10)

Unnamed: 0,date,message
0,2024-12-06 02:01:20,I need to vent\nHey unicorn \nI just wanna ven...
1,2024-12-06 02:01:20,I need to vent\nGenuineee question here. I'm 1...
2,2024-12-06 02:01:18,I need to vent\nOhhh I know am gone get judged...
3,2024-12-06 02:01:15,"I need to vent\nHey there endet nachu , \n\nAm..."
4,2024-12-06 02:01:14,I need to vent\nHello\nI'm 26 M\nThere’s a dee...
5,2024-12-06 02:01:13,I need to vent\nIam here to vent \nfemale\nEna...
6,2024-12-06 02:01:12,I need to vent\nfirst time venting \nHello guy...
7,2024-12-06 02:01:11,I need to vent\nWhy did you come back to my li...
8,2024-12-06 02:01:09,I need to vent\nHey yall so straight to the po...
9,2024-12-06 02:01:07,"I need to vent\nHello, hope everyone is fine i..."
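
The body of `process_text` is not shown here either. Judging only from the output above, it appears to drop the greeting line and the hide-identity line. A minimal sketch of that behaviour, under the assumption that the boilerplate always occupies the first two newline-separated lines, might look like this. The name `drop_first_two_lines` and the handling of short messages are hypothetical.

```python
def drop_first_two_lines(text):
    """Drop the first two newline-separated lines of a message.

    Assumption: messages with two lines or fewer are returned unchanged,
    so short entries such as "[Media or Non-Text Message]" are not emptied.
    """
    if not isinstance(text, str):
        return text  # leave NaN and other non-string values untouched
    lines = text.split("\n")
    if len(lines) <= 2:
        return text
    return "\n".join(lines[2:])
```

It would be used the same way as in the cell above, for example `df['message'].apply(drop_first_two_lines)`. A stricter variant could first check that the leading lines actually match the greeting before removing them.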


### Shape of the DataFrame before removing non-English rows

In [16]:
df.shape

(21267, 2)

## Remove non-English content from the `message` column

In [17]:
english_df = remove_non_english_rows(df, 'message')

### Shape of the DataFrame after removing non-English rows

In [18]:
english_df.shape

(20571, 2)
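
The two shapes imply that 21267 - 20571 = 696 rows (about 3.3%) were dropped. How `remove_non_english_rows` decides what counts as English is not shown in this notebook. The sketch below is a stand-in heuristic, not the notebook's method: it keeps a message only if most of its letters are ASCII. The names `ascii_letter_ratio` and `looks_english` and the 0.9 threshold are assumptions.

```python
def ascii_letter_ratio(text):
    """Fraction of alphabetic characters in text that are ASCII letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0  # no letters at all, e.g. digits or punctuation only
    return sum(ch.isascii() for ch in letters) / len(letters)


def looks_english(text, threshold=0.9):
    """Crude script-based check; the threshold is an assumed value."""
    return isinstance(text, str) and ascii_letter_ratio(text) >= threshold


# Row counts taken from the shapes reported in this notebook
rows_before, rows_after = 21267, 20571
rows_removed = rows_before - rows_after
share_removed = rows_removed / rows_before
```

A filter of this kind could be applied as `df[df['message'].apply(looks_english)]`. It only detects non-Latin scripts, so text in another language written in Latin letters would still pass; a dedicated language-detection library would be needed to catch those cases.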

In [20]:
english_df.to_csv('../data/cleaned_messages.csv', index=False)