# Preprocessing and Data Encoding

In this document, we preprocess the dataset to make sure we only have clean data. 

Towards the end, we are going to perform categorical data encoding.

## 1 Preprocessing for Clean Data

### 1.1 Loading the Data

First, we load the data from our [dataset](https://docs.google.com/spreadsheets/d/1jzNEJWQMvfK6aJeG28JSXecrLZfjeFOCYp-gtW0Q7LQ/edit#gid=1456408654), which we have made available online.

We display the last three records as shown in the output to ensure that we have obtained all data properly.

In [84]:
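# Load the dataset from Google Sheets
import gspread
import numpy as np
import pandas as pd

# Authenticate with the service-account key file and open the worksheet
sa = gspread.service_account(filename="group-28-dataset-7233caedfe09.json")
sheet = sa.open("Dataset - Group 28")
work_sheet = sheet.worksheet("Fixed Data")

# get_all_values() returns every cell as a string, header row included
df = pd.DataFrame(work_sheet.get_all_values())

new_header = df.iloc[0]  # the first row holds the column names
df = df[1:152]           # keep the 151 data rows that follow the header
df.columns = new_header
# Treat empty or whitespace-only cells as missing values
df = df.replace(r'^\s*$', np.nan, regex=True)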

df.tail(3)
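
The header promotion and blank-cell handling in this cell can be tried without access to the spreadsheet. The following is a minimal sketch on a hypothetical in-memory list of rows (the `rows` values and the `demo` name are made up for illustration), shaped like the list of lists that `get_all_values()` returns. It slices from the second row onward instead of hard-coding the record count, which is one possible alternative to `df[1:152]`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking a worksheet export: header first, all cells as strings
rows = [
    ["ID", "Account bio", "Location"],
    ["28-1", "Nurse", ""],
    ["28-2", "   ", "Manila"],
    ["28-3", "Student", "Cebu"],
]

raw = pd.DataFrame(rows)

# Promote the first row to column names and keep every remaining row,
# without hard-coding the number of records
demo = raw.iloc[1:].copy()
demo.columns = raw.iloc[0]

# Empty or whitespace-only strings become NaN so that isnull() can detect them
demo = demo.replace(r"^\s*$", np.nan, regex=True)

print(demo.isnull().sum())
```

Without the `replace` step, blank cells would remain empty strings and would not be counted as missing.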


Shown here are all the current features of the dataset. 

This may include features with missing values.

In [None]:
print("List all current features:")
df.columns

List all current features:


Index(['ID', 'Timestamp', 'Tweet URL', 'Group', 'Collector', 'Category',
       'Topic', 'Keywords', 'Account handle', 'Account name', 'Account bio',
       'Account type', 'Joined', 'Following', 'Followers', 'Location', 'Tweet',
       'Tweet Translated', 'Tweet Type', 'Date posted', 'Screenshot',
       'Content type', 'Likes', 'Replies', 'Retweets', 'Quote Tweets', 'Views',
       'Rating', 'Reasoning', 'Remarks', 'Thread/Tweet Language',
       'Add columns here', 'Add columns here', 'Reviewer', 'Review'],
      dtype='object', name=0)

In [None]:
print("Display the number of records and features:")
df.shape

Display the number of records and features:


(151, 35)

### 1.2 Checking for Missing Values

Now, we check which features have missing values.

From the output, we can see that there are missing values in Account bio, Location, Screenshot, and several other features.

In [None]:
print("Check for missing values:")
df.isnull().sum()

Check for missing values:


0
ID                         0
Timestamp                  0
Tweet URL                  0
Group                      0
Collector                  0
Category                   0
Topic                      0
Keywords                   0
Account handle             0
Account name               0
Account bio               25
Account type               1
Joined                     0
Following                  0
Followers                  0
Location                  52
Tweet                      0
Tweet Translated           0
Tweet Type                 0
Date posted                0
Screenshot               101
Content type               0
Likes                      0
Replies                    0
Retweets                   0
Quote Tweets              49
Views                    151
Rating                   150
Reasoning                  0
Remarks                   32
Thread/Tweet Language      0
Add columns here         151
Add columns here         151
Reviewer                 151
Review      
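
To focus on the incomplete features only, the counts can be filtered and sorted. The sketch below uses a small hypothetical frame (`demo`, with made-up values) rather than the real dataset; the same expressions should apply to `df`, assuming it was loaded as in Section 1.1.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataset; values are made up for illustration
demo = pd.DataFrame({
    "ID": ["28-1", "28-2", "28-3", "28-4"],
    "Account bio": ["Nurse", np.nan, "Student", np.nan],
    "Location": [np.nan, "Manila", "Cebu", "Davao"],
})

missing = demo.isnull().sum()

# Keep only features with at least one missing value, largest count first
missing = missing[missing > 0].sort_values(ascending=False)

# Express each count as a percentage of all records
summary = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(demo) * 100).round(1),
})
print(summary)
```

Sorting by count puts the features that are almost entirely empty at the top, which makes candidates for removal easier to spot.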

Upon checking, the collectors deemed that the only necessary features for this project are as follows:

> `['ID', 'Timestamp', 'Tweet URL', 'Group', 'Keywords',
    'Account handle', 'Account name', 'Account bio', 'Account type', 
    'Tweet', 'Tweet Translated', 'Tweet Type', 'Date posted',
    'Content type', 'Reasoning', 'Thread/Tweet Language',]`

From the previous output, we can verify that all of these features are complete except Account bio (25 missing values) and Account type (1 missing value).

In [None]:
print("Dataset summary:")
df_clean = df[['ID', 'Timestamp', 'Tweet URL', 'Group', 'Keywords', 
         'Account handle', 'Account name', 'Account bio', 'Account type', 
         'Tweet', 'Tweet Translated', 'Tweet Type', 'Date posted',
         'Content type', 'Reasoning', 'Thread/Tweet Language',]]
df_clean.describe().style

Dataset summary:


| Feature | count | unique | top | freq |
|---|---|---|---|---|
| ID | 151 | 151 | 28-1 | 1 |
| Timestamp | 151 | 132 | 1/03/2023 02:28:22 | 7 |
| Tweet URL | 151 | 151 | https://twitter.com/ggnelhsa/status/1430131446719549442 | 1 |
| Group | 151 | 1 | 28 | 151 |
| Keywords | 151 | 21 | "covid" "vaccine" "magnet" "implant" "kutsara" "spoon" "barya" "coin" "microchip" "metal" "track" "robot" | 41 |
| Account handle | 151 | 140 | @ven_cuenca | 4 |
| Account name | 151 | 138 | Ven cuenca | 4 |
| Account bio | 126 | 113 | Registered Psychiatric Nurse UK, loves poetry, music and women, free spirit | 4 |
| Account type | 150 | 3 | Anonymous | 110 |
| Tweet | 151 | 151 | Gagi astig ng vaccine ko astra yung akin eh dinikit ko yung kutsara sa pinasukan ng karayom aba dumikit HAHAHA! May magnet ata 😂 | 1 |
| Tweet Translated | 151 | 151 | My vaccine was so cool, mine was astra, I placed the spoon where the needle went in, but it stuck HAHAHA! It has a magnet 😂 | 1 |
| Tweet Type | 151 | 10 | Text, Reply | 67 |
| Date posted | 151 | 149 | 16/04/21 01:41 | 2 |
| Content type | 151 | 4 | Emotional | 82 |
| Reasoning | 151 | 119 | Claims that SinoVac only contains water. This is false as shown in the product information. https://www.moh.gov.sg/docs/librariesprovider5/vaccination-matter/annex-2---sinovac-vaccination-information-sheet-170621.pdf | 13 |
| Thread/Tweet Language | 151 | 3 | Foreign | 78 |


The table shows a simple summary of our clean dataset. Some notable observations are as follows: 
- ID, Tweet URL, Tweet, and Tweet Translated each have 151 unique values and a top frequency of 1, so every record has a distinct value for these features.
- 82 out of 151 tweets are of the Emotional content type.
- There are 78 Foreign tweets. Since Thread/Tweet Language has 3 unique values, the remaining 73 tweets are split between the other two language categories.
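
`describe()` reports only the most frequent value of each feature, so it cannot show how the remaining records are distributed. One option for a full breakdown is `value_counts()`. The sketch below runs on a hypothetical column with made-up labels (the real categories may differ); on the actual data it would presumably be applied to `df_clean['Thread/Tweet Language']`.

```python
import pandas as pd

# Hypothetical language labels; the real categories may differ
language = pd.Series(
    ["Foreign", "Filipino", "Foreign", "Taglish", "Filipino", "Foreign"],
    name="Thread/Tweet Language",
)

# describe() on a text column yields count, unique, top, and freq
stats = language.describe()
print(stats)

# value_counts() lists every category with its frequency, most common first
counts = language.value_counts()
print(counts)
```

The same call works for the other categorical features, such as Content type or Account type.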

### 1.3 Ensuring Formatting Consistency

Words

## 2 Categorical Data Encoding

### 2.1 One Hot Encoding

Words

### 2.2 Heat Map (Feature Correlation)

Words