## <b>Cleaning the dataset for analysis</b>
<i>This code cleans a dataset of tweets. Tweets often contain noise, such as URLs, usernames, and repeated characters, which has to be removed before the text can be analysed.</i>

First, the <u>geoso</u> library is used to convert the text to lowercase; remove repeated characters, usernames, stop words, URLs, special characters, numbers, and punctuation; and replace hashtags with their text.

In [15]:
#Import necessary libraries
import pandas as pd
from geoso import twitter_clean_text_in_dataframe

#Read the data file with pandas
outputgermany = pd.read_csv('Data\\output_germany.csv')

#read_csv already returns a dataframe; this keeps it under the name used below
tweets_germany = pd.DataFrame(outputgermany)

#Clean the data and add new column with the cleaned text
tweets_germany['text_clean'] = twitter_clean_text_in_dataframe(tweets_germany, text_column='text', lang_code_column='lang')
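
To make the cleaning steps listed above more concrete, the following is a minimal sketch of a few of them using only regular expressions. It is not how geoso implements them: the function name `rough_clean` and every pattern in it are illustrative assumptions, stop-word removal is left out because it needs a language-specific word list, and this sketch also drops emojis.

```python
import re

def rough_clean(text):
    """Hypothetical helper that approximates some of the cleaning steps."""
    text = text.lower()                           # lowercase
    text = re.sub(r'https?://\S+', ' ', text)     # remove URLs
    text = re.sub(r'@\w+', ' ', text)             # remove usernames
    text = re.sub(r'#(\w+)', r'\1', text)         # keep the hashtag text, drop the '#'
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)    # squeeze runs of 3+ identical characters to 2
    text = re.sub(r'[\d_]', ' ', text)            # remove numbers and underscores
    text = re.sub(r'[^\w\s]', ' ', text)          # remove punctuation and special characters
    return ' '.join(text.split())                 # collapse repeated whitespace

example = 'Sooooo coool!!! @user check https://t.co/abc #Berlin 2021'
print(rough_clean(example))
```

Because `\w` matches Unicode letters in Python 3, characters such as German umlauts survive the punctuation step.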

The one thing the geoso library does not handle is emojis. Because emojis carry an important part of a text's meaning, it is useful to keep them while analysing the tweets, but they first have to be converted to their textual meaning. This can be done using the <u>emoji</u> library.

In [16]:
#Import the necessary library
import emoji

#Replace the emojis with text for each row, while adding a new column with the new cleaned text
tweets_germany = tweets_germany.assign(
    emoji_clean_text=lambda x: x['text_clean'].astype(str).apply(emoji.demojize))
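
As an illustration of what converting an emoji to text involves, here is a rough sketch that uses only the standard library's `unicodedata` module instead of the emoji library. The function name `demojize_sketch` is hypothetical, and the names it produces come from the Unicode character database, so they generally differ from the aliases that `emoji.demojize` emits.

```python
import unicodedata

def demojize_sketch(text):
    """Hypothetical stand-in: replace symbol characters with ':unicode_name:'."""
    pieces = []
    for ch in text:
        # Category 'So' (Symbol, other) covers most pictographic emoji
        if unicodedata.category(ch) == 'So':
            name = unicodedata.name(ch, '')
            if name:
                pieces.append(':' + name.lower().replace(' ', '_') + ':')
                continue
        pieces.append(ch)
    return ''.join(pieces)

print(demojize_sketch('gut \U0001F44D'))
```

This sketch ignores multi-character emoji sequences (skin tones, flags, joined emoji), which is one reason to prefer the dedicated library for real data.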

In [None]:
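#Drop the character at index 3 of each 't_datetime' string: keep characters 0-2 and everything from index 4 onwards
datum = tweets_germany['t_datetime'].str[:3] + tweets_germany['t_datetime'].str[4:]

#Store the corrected string in a new column ('goed' is Dutch for 'correct')
tweets_germany['t_datetime_goed'] = datum

#Display the dataframe to check the result
tweets_germany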

In [41]:
#Split the corrected datetime string on whitespace into two columns containing the date and the time
tweets_germany[['Date', 'Time']] = tweets_germany['t_datetime_goed'].str.split(expand=True)
#Convert the date strings to pandas datetime values
tweets_germany['Date'] = pd.to_datetime(tweets_germany['Date'])
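
The split into a date part and a time part can be tried on a single string first. The sketch below assumes a timestamp of the form `YYYY-MM-DD HH:MM:SS`; that format and the sample value are assumptions for illustration, not taken from the dataset.

```python
from datetime import datetime

# Hypothetical timestamp; the real format of 't_datetime_goed' may differ
raw = '2020-03-15 18:42:07'

# Split once on whitespace: everything before the first gap is the date
date_part, time_part = raw.split(maxsplit=1)

# Parse the date string into a date object, assuming year-month-day order
parsed_date = datetime.strptime(date_part, '%Y-%m-%d').date()

print(date_part, time_part, parsed_date.isoformat())
```

If the timestamps contained more than one whitespace gap, the split would yield more than two parts, so it is worth checking a few rows before assigning the result to exactly two columns.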

After the cleaning, the data can be exported to a CSV file again.

In [43]:
#Export as csv
#index=False means that the index of the dataframe will not be exported
#header=True means that the column names of the dataframe will be written as the first row of the csv
tweets_germany.to_csv('Data\\clean_tweets_germany.csv', index=False, header=True)
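
The paths in this notebook use a Windows-style separator (`'Data\\...'`). As a possible alternative, and assuming the same `Data` folder sits next to the notebook, the path can be built with `pathlib` so that it also works on other operating systems. The variable names `data_dir` and `output_path` are illustrative.

```python
from pathlib import Path

# Assumed location of the data folder, relative to the notebook
data_dir = Path('Data')

# The '/' operator joins path parts with the separator of the current OS
output_path = data_dir / 'clean_tweets_germany.csv'

print(output_path.name)

# The resulting path object can be passed to to_csv in place of the string:
# tweets_germany.to_csv(output_path, index=False, header=True)
```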