In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

I imported the data from the TSV file into a pandas DataFrame.
The import code is explained below.

The import uses the `read_csv` function from the `pandas` library to read a tab-separated values (TSV) file into a DataFrame.

Here's a breakdown of the parameters:

- `file_path`: This is the path to the file you want to read. The variable `file_path` should contain a string representing this path.

- `sep='\t'`: This specifies that the separator between values in the file is a tab character. This is typical for TSV files.

- `on_bad_lines='skip'`: This tells pandas what to do when it encounters a bad line (a line with too many fields). Setting it to 'skip' means that pandas skips bad lines and does not include them in the DataFrame.

- `engine="python"`: This specifies which engine to use for reading the file. The options include 'c' and 'python'. The 'c' engine is faster, while the 'python' engine is slower but more feature-complete. `on_bad_lines='skip'` works with either engine; the 'python' engine is only required when `on_bad_lines` is given a callable.
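
To make the effect of these parameters concrete, here is a minimal sketch on a small in-memory TSV. The sample content and the names `sample_tsv` and `df_demo` are made up for illustration and are not part of the dataset.

```python
import io

import pandas as pd

# Hypothetical three-row TSV; the middle row has one field too many.
sample_tsv = "id\ttext\n0\thello\n1\tworld\textra\n2\tbye\n"

# Same options as in the import cell: tab separator, skip bad lines.
df_demo = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    on_bad_lines="skip",
    engine="python",
)

# The row with the extra field is dropped; the other two rows remain.
print(df_demo)
```

Skipping is silent here, so it can be worth comparing the row count against the number of lines in the file to see how much was dropped.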

In [2]:
# Import a TSV file (pandas was already imported as pd in the first cell)

# Specify the file path
file_path = 'truths.tsv'

# Use pandas to read the TSV file
df = pd.read_csv(file_path, sep='\t', on_bad_lines='skip',engine="python")

# subset text column
df_text = df[['text']]





In [3]:
df_text.head()


Unnamed: 0,text
0,Q+ BE READY ANONS - PUBLIC AWAKENING COMING - ...
1,Enough is enough! RETRUTH
2,https://justthenews.com/politics-policy/all-th...
3,https://t.me/realx22report/6729
4,@CeceBloomwood


To clean the data, I used the `re` library to remove punctuation, numbers, and other unwanted characters from the text column.

In [4]:
import re

In [5]:
# Convert the 'text' column to string type
df_text['text'] = df_text['text'].astype(str)

# Now apply the 'lower' method and remove punctuation
df_text['text'] = df_text['text'].apply(lambda x: re.sub('[!@#$:).;,?&]', '', x.lower()))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_text['text'] = df_text['text'].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_text['text'] = df_text['text'].apply(lambda x: re.sub('[!@#$:).;,?&]', '', x.lower()))
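
The warning above is pandas' SettingWithCopyWarning. It appears because `df_text` was created by selecting a column subset from `df`, and pandas cannot tell whether the assignment is meant to affect the original DataFrame. A minimal sketch of one common way to avoid it, assuming an explicit copy is acceptable, is shown below. The names `df_demo` and `df_text_demo` and the sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the imported DataFrame.
df_demo = pd.DataFrame({"id": [1, 2], "text": ["A!", "B?"]})

# Taking an explicit copy makes the subset an independent DataFrame,
# so later column assignments are unambiguous.
df_text_demo = df_demo[["text"]].copy()
df_text_demo["text"] = df_text_demo["text"].str.lower()

print(df_text_demo)
```

The original DataFrame is left unchanged, which is usually what is wanted when the subset is only used for cleaning.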


I then checked whether the cleaning had worked by inspecting the head of the data.

In [6]:
df_text.head()

Unnamed: 0,text
0,q+ be ready anons - public awakening coming - ...
1,enough is enough retruth
2,https//justthenewscom/politics-policy/all-thin...
3,https//tme/realx22report/6729
4,cecebloomwood


I realized that there were still characters related to websites, so I removed them.
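
When URL scheme prefixes are stripped with regular expressions, the order of substitutions matters: removing `http` first turns every `https` into a stray `s`, which a later substitution for `https` can no longer match. The sketch below illustrates this on a made-up sample string and shows the single pattern `https?` as one possible alternative.

```python
import re

# Made-up sample in the same shape as the partially cleaned posts.
sample = "https//tme/realx22report http//examplecom"

# Two separate substitutions: the second one never matches,
# because 'http' has already been removed from every 'https'.
two_step = re.sub("https", "", re.sub("http", "", sample))

# One substitution with an optional 's' handles both prefixes.
one_step = re.sub(r"https?", "", sample)

print(repr(two_step))
print(repr(one_step))
```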

In [7]:
# remove 'http' and 'https' from the text (the optional 's' avoids leaving a stray 's' behind)
df_text['text'] = df_text['text'].apply(lambda x: re.sub('https?', '', x.lower()))



In [8]:
# no-op: 'http' was already stripped in the previous cell, so 'https' can no longer match
df_text['text'] = df_text['text'].apply(lambda x: re.sub('https', '', x.lower()))

# remove all digits
df_text['text'] = df_text['text'].apply(lambda x: re.sub('[0-9]', '', x.lower()))




I removed the pattern "jw" from the text. It was repeated in many posts from Truth Social without any obvious meaning.

In [9]:
# remove jw
df_text['text'] = df_text['text'].apply(lambda x: re.sub('jw', '', x.lower()))




In [10]:
df_text.to_csv('unlabeled_data_cleaned.csv', index=False)

In [11]:
# remove /
df_text['text'] = df_text['text'].apply(lambda x: re.sub('/', '', x.lower()))



I removed the < and ≥ characters from the text.

In [12]:
df_text['text'] = df_text['text'].apply(lambda x: re.sub('<', '', x))
df_text['text'] = df_text['text'].apply(lambda x: re.sub('≥', '', x))




Finally, I removed the emoji from the text to get clean text for a better analysis.

In [13]:
# remove emoji (and any other character that is not a word character, whitespace, or one of # @ / : % . , _ -)
df_text['text'] = df_text['text'].apply(lambda x: re.sub(r'[^\w\s#@/:%.,_-]', '', x.lower()))
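
The pattern above is a negated character class: it deletes every character that is not explicitly allowed, so it removes emoji but also other symbols such as parentheses or `+`. A small sketch on a made-up sentence (the names `pattern`, `sample`, and `cleaned` are illustrative only):

```python
import re

# Same whitelist as in the cell above, written as a raw string.
pattern = re.compile(r"[^\w\s#@/:%.,_-]")

# Made-up example containing an emoji, parentheses, and a plus sign.
sample = "great news 🎉 (100% true) + more"

# Everything outside the whitelist is deleted; whitespace is kept,
# so removed characters can leave double spaces behind.
cleaned = pattern.sub("", sample)

print(repr(cleaned))
```

If the doubled spaces matter for the later analysis, a follow-up substitution that collapses runs of whitespace is one option.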



I saved the cleaned data as a CSV file named unlabeled_data_cleaned.csv.

In [15]:
# save as csv called unlabeled_data_cleaned.csv
df_text.to_csv('unlabeled_data_cleaned.csv', index=False)
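
The individual cleaning steps in this notebook could also be gathered into one function, which makes the order of operations explicit and easy to rerun. The following is only a sketch under that assumption, not the code used above: the function name `clean_text` is hypothetical, it uses `https?` for the URL prefixes, and it is not guaranteed to reproduce the notebook output character for character.

```python
import re


def clean_text(text: str) -> str:
    """Apply the notebook's cleaning steps to a single post (sketch)."""
    text = str(text).lower()
    text = re.sub(r"https?", "", text)            # URL scheme prefixes
    text = re.sub(r"[0-9]", "", text)             # digits
    text = re.sub(r"jw", "", text)                # recurring 'jw' pattern
    text = re.sub(r"[!@#$:).;,?&/<≥]", "", text)  # punctuation removed in the cells above
    text = re.sub(r"[^\w\s%_-]", "", text)        # emoji and remaining symbols
    return text


# Two of the posts shown in the first df_text.head() output.
print(clean_text("Enough is enough! RETRUTH"))
print(clean_text("https://t.me/realx22report/6729"))
```

With a function like this, the whole column could be cleaned in one `apply` call before saving.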