# Data Analysis: SOGIE Same-Sex-Marriage

### Importing necessary packages

In [566]:
import pandas as pd
import openpyxl  # Excel engine that pandas uses to read .xlsx files

from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\juf\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Converting Dataset to Pandas DataFrame

In [567]:
raw_dataset = 'Dataset - Group 54 - Combined.xlsx'  # Excel file
df = pd.read_excel(raw_dataset)

### Sample tweets from the dataset

In [568]:
df.head(5)

Unnamed: 0,ID,Timestamp,Tweet URL,Group,Collector,Category,Topic,Keywords,Account handle,Account name,...,Likes,Replies,Retweets,Quote Tweets,Views,Rating,Reasoning,Remarks,Reviewer,Review
0,54-1,2023-03-27 11:17:00,https://twitter.com/ReightWingAngel/status/160...,54.0,"Dycaico, Julian",GNDR,Sogie; same-sex marriage,#NoToSOGIE,@ReightWingAngel,ReiCon,...,0,0,0,0,163,False,"What is gender reassignment | Equality, Divers...",,,
1,54-2,2023-03-27 11:48:00,https://twitter.com/ReightWingAngel/status/160...,54.0,"Dycaico, Julian",GNDR,Sogie; same-sex marriage,#NoToSOGIE,@ReightWingAngel,ReiCon,...,0,0,0,0,undefined,,,,,
2,54-3,2023-03-27 12:08:00,https://twitter.com/JPAbecillaPH/status/133475...,54.0,"Dycaico, Julian",GNDR,Sogie; same-sex marriage,#NoToSOGIE,@JPAbecillaPH,JP Abecilla,...,4,0,2,2,undefined,False,Sexuality explained - Better Health Channel,,,
3,54-4,2023-03-27 12:17:00,https://twitter.com/Conservative_PH/status/133...,54.0,"Dycaico, Julian",GNDR,Sogie; same-sex marriage,#NoToSOGIE,@Conservative_PH,Conservative Philippines,...,10,3,5,5,undefined,MISLEADING,"Genes cannot predict a person's sexuality, hom...",,,
4,54-5,2023-03-27 21:22:00,https://twitter.com/JPAbecillaPH/status/133449...,54.0,"Dycaico, Julian",GNDR,Sogie; same-sex marriage,#noToSOGIE,@JPAbecillaPH,JP Abecilla,...,2,1,6,0,undefined,UNPROVEN,"Baseless claim that Christians are ""complacent...",,,


## General Overview of the Data

#### Size and columns

The dataset contains 131 tweets and 32 columns: 30 features plus the 2 reviewing columns (`Review` and `Reviewer`).

In [569]:
# Size/Shape of the dataset
print(f"Dataset size: {df.shape}\n")

# Columns in the dataset
print(f"Number of columns: {df.shape[1]}\n")

# Data type of each column
print(f"Columns and their type of data:\n\n{df.dtypes}\n")

Dataset size: (131, 32)

Number of columns: 32

Columns and their type of data:

ID                          object
Timestamp           datetime64[ns]
Tweet URL                   object
Group                      float64
Collector                   object
Category                    object
Topic                       object
Keywords                    object
Account handle              object
Account name                object
Account bio                 object
Account type                object
Joined                      object
Following                    int64
Followers                    int64
Location                    object
Tweet                       object
Tweet Translated            object
Tweet Type                  object
Date posted                 object
Screenshot                  object
Content type                object
Likes                        int64
Replies                      int64
Retweets                     int64
Quote Tweets                 int64
Views    

#### Missing Values

Apart from the entirely empty **Review** and **Reviewer** columns, the features with the most missing values are **Remarks, Screenshot, Views, Location, and Account bio**.

In [570]:
null_data = df.isnull().sum().sort_values(ascending=False)
print(f"Missing Values:\n\n{null_data[null_data>0]}")

Missing Values:

Review          131
Reviewer        131
Remarks         130
Screenshot      128
Views           107
Location         50
Account bio      26
Rating           17
Reasoning        16
Content type      9
Tweet Type        2
Account type      1
Timestamp         1
Group             1
dtype: int64
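Raw counts are easier to compare when they are also expressed as a share of all rows. The following is a minimal sketch on a made-up toy frame, not on the real dataset; the helper name `missing_summary` and the sample values are assumptions introduced only for illustration.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Tweet": ["a", "b", "c", "d"],
    "Location": ["Manila", None, None, "Cebu"],
    "Review": [None, None, None, None],
})

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Applied to `df`, the same helper would show at a glance which columns are entirely empty and which are only partly missing.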


#### Dropping of Empty Columns

Drop the entirely empty "Review" and "Reviewer" columns and assign the result to df_clean.
df_clean is used from here on.

In [571]:
df_clean = df.dropna(axis=1, how='all')

print(f"Shape of df_clean: {df_clean.shape}")
print(f"Features of df_clean: {df_clean.columns.tolist()}")

Shape of df_clean: (131, 30)
Features of df_clean: ['ID', 'Timestamp', 'Tweet URL', 'Group', 'Collector', 'Category', 'Topic', 'Keywords', 'Account handle', 'Account name', 'Account bio', 'Account type', 'Joined', 'Following', 'Followers', 'Location', 'Tweet', 'Tweet Translated', 'Tweet Type', 'Date posted', 'Screenshot', 'Content type', 'Likes', 'Replies', 'Retweets', 'Quote Tweets', 'Views', 'Rating', 'Reasoning', 'Remarks']
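`dropna(axis=1, how='all')` removes a column only when every value in it is missing, which is why a nearly empty column such as Remarks survives. A small sketch on a made-up toy frame (the sample values are assumptions) illustrates the difference:

```python
import pandas as pd

# Toy frame: one mostly empty column and one entirely empty column
toy = pd.DataFrame({
    "Tweet": ["a", "b", "c"],
    "Remarks": [None, "checked", None],  # mostly empty, but not entirely
    "Review": [None, None, None],        # entirely empty
})

# how='all' drops a column only if all of its values are missing
cleaned = toy.dropna(axis=1, how="all")

# List the columns that were removed
dropped = [col for col in toy.columns if col not in cleaned.columns]
print(dropped)
```

Listing the dropped columns explicitly is a cheap check that only the intended columns were removed.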


#### Distribution of Data

In [572]:
df_clean.describe()

Unnamed: 0,Timestamp,Group,Following,Followers,Likes,Replies,Retweets,Quote Tweets
count,130,130.0,131.0,131.0,131.0,131.0,131.0,131.0
mean,2023-04-08 18:13:37.242815232,54.0,242.076336,402.465649,2.312977,1.076336,0.70229,2.229008
min,2023-03-27 10:49:00,54.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2023-03-31 15:55:59.151000064,54.0,42.0,10.0,0.0,0.0,0.0,0.0
50%,2023-04-03 09:56:58.216999936,54.0,102.0,31.0,0.0,0.0,0.0,0.0
75%,2023-04-19 23:45:53.592999936,54.0,284.5,189.5,2.0,0.0,0.0,0.0
max,2023-05-09 23:07:53,54.0,3294.0,18267.0,61.0,93.0,25.0,233.0
std,,0.0,400.910879,1754.23361,6.83102,8.146371,2.656472,20.383842


#### Unique Values for Each Column

In [573]:
df_clean.nunique()

ID                  131
Timestamp            88
Tweet URL           124
Group                 1
Collector             4
Category              2
Topic                 1
Keywords             11
Account handle       81
Account name         76
Account bio          62
Account type          3
Joined               81
Following            70
Followers            64
Location             22
Tweet               125
Tweet Translated    125
Tweet Type           20
Date posted         129
Screenshot            3
Content type          3
Likes                14
Replies               8
Retweets             11
Quote Tweets          9
Views                 2
Rating               14
Reasoning            79
Remarks               1
dtype: int64
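Tweet URL has 124 distinct values across 131 rows, which suggests that some tweets appear more than once. One possible way to inspect such rows is sketched below on a made-up toy frame; the URLs and IDs are placeholders, not values from the dataset.

```python
import pandas as pd

# Toy frame with one repeated URL (placeholder values)
toy = pd.DataFrame({
    "ID": ["54-1", "54-2", "54-3", "54-4"],
    "Tweet URL": [
        "https://example.com/1",
        "https://example.com/2",
        "https://example.com/1",
        "https://example.com/3",
    ],
})

# keep=False marks every member of a duplicate group, not only the later copies
dupes = toy[toy.duplicated(subset="Tweet URL", keep=False)]
print(dupes.sort_values("Tweet URL"))

# Keep only the first occurrence of each URL
deduped = toy.drop_duplicates(subset="Tweet URL", keep="first")
print(len(deduped))
```

Whether duplicates should be dropped depends on the analysis; for word counts, repeated tweets would inflate the frequency of their words.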

## Context-specific Questions

The main topic used for data collection is **Sogie; same-sex marriage**.

We now look at the content of the tweets by assigning the **Tweet Translated** column to dfc_tweets.

In [574]:
dfc_tweets = df_clean['Tweet Translated']
dfc_tweets.head(5)

0    Do you remember when you joined Twitter? I do!...
1    youtube.com/watch?v=vgngesz7E30 #notosogiebill...
2    Sexuality is not based on feelings. It is unli...
3    Born "gay"? "We are born this way" and "God cr...
4    Many Christians are complacent about the #Sogi...
Name: Tweet Translated, dtype: object

Similarly, to observe the **content of the bios** of the Twitter accounts, we assign the **Account bio** column to dfc_bio.

In [575]:
dfc_bio = df_clean['Account bio'].astype(str).fillna(' ')
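The order of `astype(str)` and `fillna` matters. In the pandas behaviour reflected in this notebook, casting first turns a missing value into the literal string `nan`, so the later `fillna` has nothing left to fill. The sketch below uses a made-up three-element Series to contrast the two orders; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy bios with one missing entry (made-up values)
bios = pd.Series(["Conservative", np.nan, "God first"])

# Cast first: the missing value may already be the text 'nan' when fillna runs
cast_then_fill = bios.astype(str).fillna(" ")

# Fill first: the missing value is replaced before any cast happens
fill_then_cast = bios.fillna(" ").astype(str)

print(cast_then_fill.tolist())
print(fill_then_cast.tolist())
```

Filling before casting keeps placeholder text such as `nan` out of any later word count.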


#### Function for counting common words

In [576]:
# Define stop words 
stop_words = list(stopwords.words('english'))

# Initialize Count Vectorizer with defined stop words
vectorizer = CountVectorizer(stop_words=stop_words) 

def count_words(texts):
    """Return (word, count) pairs for all texts combined, most frequent first."""
    # Join every entry into one document so the counts cover the whole column
    text = ' '.join(texts)

    word_counts = vectorizer.fit_transform([text]).toarray().flatten()
    features = vectorizer.get_feature_names_out()

    # Pair each vocabulary word with its count
    word_count_dict = dict(zip(features, word_counts))
    return sorted(word_count_dict.items(), key=lambda x: x[1], reverse=True)

def print_count(word_counts):
    """Print one 'word: count' line per (word, count) pair."""
    for word, count in word_counts:
        print(f"{word}: {count}")

#### Common Words from Tweets

In [577]:
dfc_tweets_count = count_words(dfc_tweets)
print_count(dfc_tweets_count)

sogie: 67
notosogiebill: 63
bill: 56
notosogie: 56
https: 26
co: 25
god: 23
amp: 22
equality: 20
law: 16
people: 16
rights: 15
anti: 14
bills: 14
junksogiebill: 13
discrimination: 11
junkmhbill: 11
right: 11
satanic: 11
already: 10
love: 10
mhactnow: 10
sogi: 10
yes: 10
com: 9
lgbt: 9
lgbtq: 9
special: 9
act: 8
family: 8
future: 8
gay: 8
man: 8
philippines: 8
us: 8
depraved: 7
everyone: 7
homo: 7
like: 7
pinas: 7
still: 7
want: 7
would: 7
community: 6
constitution: 6
created: 6
facebook: 6
gender: 6
laws: 6
may: 6
protect: 6
real: 6
respect: 6
sinner: 6
superiority: 6
totally: 6
treatment: 6
based: 5
even: 5
filipino: 5
junkmentalhealthbill: 5
life: 5
need: 5
state: 5
truth: 5
woman: 5
women: 5
abscbnnews: 4
agree: 4
become: 4
bit: 4
born: 4
cannot: 4
care: 4
chose: 4
christians: 4
country: 4
culture: 4
enough: 4
feelings: 4
freedom: 4
lead: 4
let: 4
lgbtqia: 4
marriage: 4
mean: 4
members: 4
mental: 4
mentalhealth: 4
notolgbt: 4
one: 4
read: 4
reject: 4
say: 4
separation: 4
sex: 4
sin:
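The list above includes `https`, `co`, and `amp`, which most likely come from links and from the HTML entity `&amp;` rather than from the wording of the tweets. A minimal sketch of cleaning the text before counting follows. It swaps in `collections.Counter` and the standard library for `CountVectorizer`, uses the same two-or-more-character token pattern that `CountVectorizer` applies by default, and only strips URLs that start with `http://` or `https://`. The names `clean_text` and `top_words` and the sample tweets are hypothetical.

```python
import html
import re
from collections import Counter

# Only URLs with an explicit scheme are removed in this sketch
URL_PATTERN = re.compile(r"https?://\S+")
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def clean_text(text: str) -> str:
    """Decode HTML entities, drop links, and lowercase the text."""
    text = html.unescape(text)         # '&amp;' becomes '&'
    text = URL_PATTERN.sub(" ", text)  # remove links
    return text.lower()

def top_words(texts, stop_words=frozenset(), n=5):
    """Return the n most common words across texts, skipping stop words."""
    counter = Counter()
    for text in texts:
        tokens = TOKEN_PATTERN.findall(clean_text(text))
        counter.update(t for t in tokens if t not in stop_words)
    return counter.most_common(n)

# Made-up sample tweets
sample = [
    "No to the bill &amp; yes to family https://t.co/abc123",
    "The bill is wrong https://t.co/xyz789 #NoToSogieBill",
]
result = top_words(sample, stop_words={"the", "to", "is"})
print(result)
```

The same `clean_text` step could be applied to the tweet column before it is passed to `count_words`, so that link fragments no longer appear among the most frequent words.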

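#### Common Words from Twitter Account Bio

The `nan` entry at the top of the output is not a real word: the 26 accounts without a bio (see Missing Values) were converted to the string `nan` by `astype(str)` before `fillna` could replace them. Tokens such as `ð_x009d_` and `ðÿ` look like mis-encoded non-ASCII characters (for example emoji) rather than words.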

In [578]:
dfc_bio_count = count_words(dfc_bio)
print_count(dfc_bio_count)

nan: 26
conservative: 15
ð_x009d_: 15
truth: 9
god: 8
strictly: 8
life: 7
christ: 6
sa: 6
love: 5
loves: 5
people: 5
ðÿ: 5
ally: 4
antifakenews: 4
antifixart: 4
antihyprocrite: 4
antistupidity: 4
antiwokes: 4
biblical: 4
blackwashingcharacterisracist: 4
calvinistic: 4
charismatic: 4
christian: 4
evangelical: 4
feminismisantiwoman: 4
friend: 4
king: 4
lady: 4
master: 4
prouddds: 4
stopasianhate: 4
supporter: 4
yet: 4
action: 3
aims: 3
ang: 3
bayan: 3
com: 3
conservatism: 3
conservativeph: 3
diyos: 3
father: 3
filipino: 3
glory: 3
jesus: 3
joy: 3
jpabecilla: 3
knowledge: 3
lab: 3
lang: 3
lord: 3
maga: 3
never: 3
notosogiebill: 3
para: 3
preserve: 3
simple: 3
testimony: 3
traditional: 3
values: 3
wife: 3
2020: 2
2nd: 2
_x008d_ðÿœˆnoah: 2
_x008f_: 2
_x008f_â: 2
advocate: 2
alone: 2
always: 2
apostle: 2
bible: 2
big: 2
blog: 2
boys: 2
catholic: 2
cell: 2
certified: 2
child: 2
choose: 2
doulos: 2
eye: 2
faithful: 2
fear: 2
filipinoinfluencer: 2
game: 2
great: 2
husband: 2
iskomoreno: 2
katot

#### Most Shared Tweets

In [579]:
# Ten most-retweeted tweets; iterating over rows keeps tweets with identical text separate
top_shared_tweets = df_clean.nlargest(10, 'Retweets')[['Tweet', 'Retweets']]

for tweet, retweet in top_shared_tweets.itertuples(index=False, name=None):
    print(f"Retweets: {retweet}\nTweet: {tweet}\n\n")

Retweets: 25
Tweet: The truths and facts of SOGIE Bill and why it must not be passed.
 #NoToSogieBill


Retweets: 10
Tweet: #NOTOSOGIE YES TO THE PRESERVATION OF THE FILIPINO FAMILY! facebook.com/10000028902408…


Retweets: 8
Tweet: The DANGERS OF SOGIE BILL #NoToSogieBill; Image with text: "The problem with FEELINGS-BASED SEXUAL ORIENTATION is that they are CHANGEABLE, VARIABLE, MODIFIABLE, INCONSTANT. A woman who feels she is a man and lives like a man may someday revert back to being a woman or a man who is living like a man now may decide to be a woman later, etc.


Retweets: 7
Tweet: Dito palang makikita mo nang MAY MALI sa pinaglalaban nila. Superiority ang gusto hindi equality. #NoToSOGIE #NoToSogieBill Shoutout mga hashtag users #PassADBNow #YESToSOGIEBill #SOGIEEqualityNow look how discriminatory sogie itself. (c) Stanley Clyde Flores


Retweets: 6
Tweet: Many Christians are complacent about the #SOGIEBill. Perhaps because...  They don't understand it.  They are not directly

#### Length of Tweets

In [580]:
dfc_tweets_lengths = dfc_tweets.str.len() 

print(f"Tweet Lengths:\n\n{dfc_tweets_lengths}")

Tweet Lengths:

0      226
1      134
2      172
3      258
4      264
      ... 
126    249
127    228
128    105
129    269
130    179
Name: Tweet Translated, Length: 131, dtype: int64


#### Average Length of Tweets

In [581]:
average_tweet_length = dfc_tweets_lengths.mean()

print(f"Average Tweet Length: {average_tweet_length}")


Average Tweet Length: 174.6412213740458
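The mean alone can hide a skewed distribution, so it may help to report it alongside the median and maximum. The sketch below uses a made-up Series, not the real tweets, and also shows that `.str.len()` leaves missing entries as missing values, which `mean()` then skips.

```python
import pandas as pd

# Made-up tweets, including one missing entry
tweets = pd.Series(["short", "a somewhat longer tweet", None])

# Character length per tweet; the missing entry stays missing
lengths = tweets.str.len()

stats = {
    "mean": round(float(lengths.mean()), 2),
    "median": float(lengths.median()),
    "max": int(lengths.max()),
}
print(stats)
```

Lengths here are character counts, so they include hashtags, mentions, and any links left in the text.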
