### BDP Final Project: Twitter Credibility Analysis on Special Education

#### Objective: determine whether Twitter can be considered a credible source of information that reflects the emergence of important trends or topics in education, with a focus on special education.
- One example of such a topic is the “Florida math book ban”.
- Do you see any spikes in Twitter activity or any shifts in the geographical distribution of Twitterers? (Twitterer is the name given to those who use Twitter.)
- Are higher Tweet volumes reflective of the emergence of a new hot topic in education, or are they more related to other events, such as sports, viral social media posts, university application cycles, being admitted to a university, etc.?
- Who are the Twitterers tweeting about K-12 or Higher Education?
- Are they mostly government institutions, universities and credible non-profit organizations, or random users tweeting about their schools, teachers, application processes, or attitudes toward going (or not going) to school?
- Are most of these messages original content, or just copies of the original tweets (retweets)?

#### Data Source Information
- Volume: ~100 million Tweets (~500 GB).
- These tweets are collected on the topics of education, schools, universities, learning, knowledge sharing, etc., but only a fraction of them would be directly related to either primary, secondary or higher education.

### Part 1: Data Preprocessing

#### 1. Set up environment & import packages

In [1]:
spark.version

'3.1.3'

In [2]:
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', None)
pd.reset_option('display.max_rows')

import warnings
warnings.filterwarnings(action='ignore')
warnings.simplefilter('ignore')

from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql import SparkSession

In [3]:
# List all files under a given GCS bucket prefix
from google.cloud import storage

def list_blobs_pd(bucket_name, folder_name):
    """Return a DataFrame with the name and size (bytes) of each blob under the prefix."""
    gcs_client = storage.Client()
    bucket = gcs_client.bucket(bucket_name)
    blobs = list(bucket.list_blobs(prefix=folder_name))

    # One row per blob
    blobs_df = pd.DataFrame(
        [(blob.name, blob.size) for blob in blobs],
        columns=['Name', 'Size'],
    )

    # Note: DataFrame.style.format returns a Styler and does not modify the frame,
    # so thousands-separator formatting is left to the caller at display time.
    return blobs_df

In [4]:
# Reading data from an open bucket, available to all students
bucket_read = 'msca-bdp-tweets'
folder_read = 'final_project'

In [5]:
spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled",True)

#### 2. Read in data

In [6]:
path = 'gs://msca-bdp-tweets/final_project'

In [7]:
%%time

total_df = spark.read.json(path)

23/03/03 15:31:06 WARN org.apache.spark.sql.execution.datasources.SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance.
                                                                                

CPU times: user 1.29 s, sys: 361 ms, total: 1.65 s
Wall time: 7min 3s


23/03/03 15:36:08 WARN org.apache.spark.sql.catalyst.util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


In [8]:
total_df.count()

                                                                                

99994342

##### - There are 99,994,342 rows in total.

In [9]:
total_df.printSchema()

root
 |-- coordinates: struct (nullable = true)
 |    |-- coordinates: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- display_text_range: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |-- media: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- additional_media_info: struct (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- embeddable: boolean (nullable = true)
 |    |    |    |    |-- monetizable: boolean (nullable = true)
 |    |    |    |   

In [10]:
total_df.columns

['coordinates',
 'created_at',
 'display_text_range',
 'entities',
 'extended_entities',
 'extended_tweet',
 'favorite_count',
 'favorited',
 'filter_level',
 'geo',
 'id',
 'id_str',
 'in_reply_to_screen_name',
 'in_reply_to_status_id',
 'in_reply_to_status_id_str',
 'in_reply_to_user_id',
 'in_reply_to_user_id_str',
 'is_quote_status',
 'lang',
 'place',
 'possibly_sensitive',
 'quote_count',
 'quoted_status',
 'quoted_status_id',
 'quoted_status_id_str',
 'quoted_status_permalink',
 'quoted_text',
 'reply_count',
 'retweet_count',
 'retweeted',
 'retweeted_from',
 'retweeted_status',
 'source',
 'text',
 'timestamp_ms',
 'truncated',
 'tweet_text',
 'user',
 'withheld_in_countries']

In [11]:
len(total_df.columns)

39

In [12]:
total_df.show(5)

[Stage 5:>                                                          (0 + 1) / 1]

+-----------+--------------------+------------------+--------------------+-----------------+--------------------+--------------+---------+------------+----+-------------------+-------------------+-----------------------+---------------------+-------------------------+-------------------+-----------------------+---------------+----+-----+------------------+-----------+--------------------+-------------------+--------------------+-----------------------+--------------------+-----------+-------------+---------+--------------+--------------------+--------------------+--------------------+-------------+---------+--------------------+--------------------+---------------------+
|coordinates|          created_at|display_text_range|            entities|extended_entities|      extended_tweet|favorite_count|favorited|filter_level| geo|                 id|             id_str|in_reply_to_screen_name|in_reply_to_status_id|in_reply_to_status_id_str|in_reply_to_user_id|in_reply_to_user_id_str|is_quote

                                                                                

In [13]:
total_df.limit(5)

                                                                                

coordinates,created_at,display_text_range,entities,extended_entities,extended_tweet,favorite_count,favorited,filter_level,geo,id,id_str,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,is_quote_status,lang,place,possibly_sensitive,quote_count,quoted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_text,reply_count,retweet_count,retweeted,retweeted_from,retweeted_status,source,text,timestamp_ms,truncated,tweet_text,user,withheld_in_countries
,Thu Nov 17 22:32:...,"[5, 140]","{[], null, [], [{...",,"{[5, 190], {[], n...",0,False,low,,1593371572567425024,1593371572567425024,ucu,1.5933315268335206e+18,1.5933315268335206e+18,17724276.0,17724276.0,False,en,,,0,,,,,,0,0,,,,"<a href=""http://t...",@ucu At teesside ...,1668724360969,True,@ucu At teesside ...,"{false, Wed Aug 3...",
,Thu Nov 17 22:32:...,,"{[], null, [], []...",,,0,False,low,,1593371572932415492,1593371572932415492,,,,,,False,en,,,0,,,,,,0,0,RT,AbrhaEyerus,"{null, Thu Nov 17...","<a href=""http://t...",RT @AbrhaEyerus: ...,1668724361056,False,"""As students in m...","{false, Mon Nov 1...",
,Thu Nov 17 22:32:...,,"{[{[62, 66], CO2}...",,,0,False,low,,1593371572810694660,1593371572810694660,,,,,,True,en,,,0,"{null, Wed Nov 02...",1.5876987968126403e+18,1.5876987968126403e+18,{twitter.com/CPAC...,Australian Geolog...,0,0,RT,GONOW77,"{null, Thu Nov 17...","<a href=""http://t...","RT @GONOW77: ❄ ""N...",1668724361027,False,"❄ ""No one has eve...","{false, Thu Apr 2...",
,Thu Nov 17 22:32:...,,"{[{[33, 44], OneC...",,,0,False,low,,1593371575138414592,1593371575138414592,,,,,,False,en,,,0,,,,,,0,0,RT,seaforthhs,"{null, Thu Nov 17...","<a href=""https://...",RT @seaforthhs: I...,1668724361582,False,It's a beautiful ...,"{false, Sat Sep 1...",
,Thu Nov 17 22:32:...,,"{[], null, [], [{...",,,0,False,low,,1593371575637639168,1593371575637639168,,,,,,False,en,,False,0,,,,,,0,0,,theblaze,,"<a href=""http://t...",This is what libe...,1668724361701,False,This is what libe...,"{false, Wed May 2...",


#### 3. Variable Selection (Columns)

##### Description for each column & Whether I will use it for analysis (X: discard; V: will select it for now)

- coordinates: The longitude and latitude of the Tweet’s location. --> X

- created_at: UTC time when this Tweet was created. Since we need to analyze and visualize tweets as a time series, we need this variable. --> V

- display_text_range: --> X

- entities: Entities provide metadata and additional contextual information about content posted on Twitter. We need hashtags, and possibly user_mentions, from this object, so we keep this variable. --> V

- extended_entities: If a Tweet contains native media (shared with the Tweet user-interface as opposed to via a link to elsewhere), there will also be an extended_entities section. --> X

- extended_tweet: When a Tweet is longer than 140 characters, the payload truncates text (truncated = true) and carries the complete content in extended_tweet, which holds full_text along with its own display_text_range, entities and extended_entities. --> V

- favorite_count: Indicates approximately how many times this Tweet has been liked by Twitter users. --> V

- favorited: Indicates whether this Tweet has been liked by the authenticating user. --> X

- filter_level: Indicates the maximum value of the filter_level parameter which may be used and still stream this Tweet.  --> X

- geo: Deprecated. Nullable. Use the coordinates field instead.  --> X 

- id: The integer representation of the unique identifier for this Tweet. --> V

- id_str: The string representation of the unique identifier for this Tweet. --> X

- in_reply_to_screen_name ---> X

- in_reply_to_status_id --> X

- in_reply_to_status_id_str: --> X

- in_reply_to_user_id --> X

- in_reply_to_user_id_str: --> X

- is_quote_status: Indicates whether this is a Quoted Tweet. ---> X

- lang: indicates a BCP 47 language identifier corresponding to the machine-detected language of the Tweet text, or und if no language could be detected. (ex: "lang": "en") --> V

- place: Places are specific, named locations with corresponding geo coordinates. ---> V

- possibly_sensitive: This field indicates content may be recognized as sensitive. --> X

- retweeted: Indicates whether this Tweet has been Retweeted by the authenticating user --> V

- quote_count: Indicates approximately how many times this Tweet has been quoted by Twitter users. --> X

- quoted_status: This field only surfaces when the Tweet is a quote Tweet. --> X

- quoted_status_id: This field only surfaces when the Tweet is a quote Tweet. --> X

- quoted_status_id_str: --> X

- quoted_status_permalink: --> X

- quoted_text: --> X

- reply_count: Number of times this Tweet has been replied to. --> V

- retweet_count: Number of times this Tweet has been retweeted. --> V

- retweeted_from: --> V

- retweeted_status: Users can amplify the broadcast of Tweets authored by other users by retweeting. Retweets can be distinguished from typical Tweets by the existence of a retweeted_status attribute. --> V

- source: Utility used to post the Tweet(ex: Twitter Web Client) --> X

- text: Tweets content --> V

- timestamp_ms --> X

- truncated: Indicates whether the value of the text parameter was truncated --> X

- tweet_text: --> V

- user: The user who posted this Tweet--> V

- withheld_in_countries: When present, indicates a list of uppercase two-letter country codes this content is withheld from. ---> X


(reference: https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet)


#### Summary:
1. I will select below variables at this stage:
   - id
   - favorite_count
   - lang
   - entities
   - extended_tweet
   - place
   - user
   - text
   - tweet_text
   - retweeted
   - reply_count
   - retweet_count
   - retweeted_from
   - retweeted_status
   - created_at


2. Since I need to identify the most influential Twitterers, I need id, favorite_count, user, retweet_count, retweeted_from and retweeted_status to build each Twitterer's profile and count their tweets.
    - According to the Twitter developer platform, the best practice is to retrieve the text, entities, original author and date from the original Tweet in retweeted_status whenever it exists. An exception is getting Twitter entities that are part of the additive Quote. Therefore, I use retweeted_status and discard all quote-related columns when analyzing retweet counts and text.
    

3. For time-series analysis, I use created_at to analyze and visualize tweets.

4. For location analysis, I use the place object rather than geo / coordinates, since it includes name, full_name, country_code and country as text.

5. Since I need to analyze the content of each tweet, I need its text (in text, tweet_text, or extended_tweet).
   I will do EDA to decide whether to use text, tweet_text or extended_tweet (or a hybrid approach).

6. Since the user, entities, extended_entities and retweeted_status objects contain child objects, I will explore them to select the fields needed for further analysis.
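
To make item 3 above concrete, here is a minimal plain-Python sketch of turning created_at strings into daily tweet counts. It is not the notebook's Spark code: the helper names `parse_created_at` and `tweets_per_day` are hypothetical, the two timestamps are taken from sample outputs in this notebook, and the `%a` / `%b` directives assume an English (C) locale.

```python
from collections import Counter
from datetime import datetime, timezone

# Layout of the `created_at` strings seen in this dataset,
# e.g. 'Fri Oct 07 18:04:54 +0000 2022'.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Return a timezone-aware datetime, or None when the value is missing."""
    if not value:
        return None
    return datetime.strptime(value, TWITTER_TIME_FORMAT)


def tweets_per_day(created_at_values):
    """Count tweets per UTC calendar day, skipping missing values."""
    counts = Counter()
    for value in created_at_values:
        parsed = parse_created_at(value)
        if parsed is not None:
            day = parsed.astimezone(timezone.utc).date().isoformat()
            counts[day] += 1
    return dict(counts)


sample = [
    "Fri Oct 07 18:04:54 +0000 2022",
    "Mon Oct 10 00:23:45 +0000 2022",
    None,  # a missing timestamp is ignored
]
daily_counts = tweets_per_day(sample)
```

Spikes in activity would then show up as days whose count is far above the surrounding days.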


##### - Use samples to see the actual information in User, Entities, Extended entities, Retweeted_status & conduct EDA

In [14]:
!hadoop fs -ls "gs://msca-bdp-tweets/final_project" | head - 5

==> standard input <==
Found 50696 items
-rwx------   3 root root          0 2023-02-08 13:58 gs://msca-bdp-tweets/final_project/_SUCCESS
-rwx------   3 root root    4500466 2023-02-08 13:44 gs://msca-bdp-tweets/final_project/part-00000-aa6d3cb4-7022-4df2-9921-218307589ce2-c000.json
-rwx------   3 root root    4107431 2023-02-08 13:44 gs://msca-bdp-tweets/final_project/part-00001-aa6d3cb4-7022-4df2-9921-218307589ce2-c000.json
-rwx------   3 root root    4672123 2023-02-08 13:44 gs://msca-bdp-tweets/final_project/part-00002-aa6d3cb4-7022-4df2-9921-218307589ce2-c000.json
-rwx------   3 root root    5186684 2023-02-08 13:44 gs://msca-bdp-tweets/final_project/part-00003-aa6d3cb4-7022-4df2-9921-218307589ce2-c000.json
-rwx------   3 root root    4729662 2023-02-08 13:44 gs://msca-bdp-tweets/final_project/part-00004-aa6d3cb4-7022-4df2-9921-218307589ce2-c000.json
-rwx------   3 root root    4605529 2023-02-08 13:44 gs://msca-bdp-tweets/final_project/part-00005-aa6d3cb4-7022-4df2-9921-218307589

In [15]:
# Pick one of the JSON files listed above as a sample
sample_path = 'gs://msca-bdp-tweets/final_project/part-00003-aa6d3cb4-7022-4df2-9921-218307589ce2-c000.json'
sample_df = spark.read.json(sample_path)

In [16]:
sample_df.count()

954

In [17]:
#sample_df.printSchema()

In [18]:
len(sample_df.columns)

37

In [19]:
for c in total_df.columns:
    if c not in sample_df.columns:
        print(c)

coordinates
geo


- I chose one file as a sample for EDA and found that coordinates and geo are missing from its inferred schema, which suggests that no tweet in this file carries those fields. This supports using place for geographical analysis.
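
Why a column can exist in the full dataset but not in a single file can be illustrated with a simplified model of schema inference: the column set is the union of the top-level keys seen in the records. This is only a sketch. Spark's real inference also resolves types and nested structs, the function name `inferred_columns` is hypothetical, and the records are made up.

```python
import json


def inferred_columns(json_lines):
    """Union of top-level keys across newline-delimited JSON records."""
    columns = set()
    for line in json_lines:
        columns.update(json.loads(line).keys())
    return columns


# Hypothetical records: only one tweet in the full data carries `geo`.
full_data = [
    '{"id": 1, "text": "a", "geo": {"type": "Point"}}',
    '{"id": 2, "text": "b"}',
]
one_file = ['{"id": 2, "text": "b"}']

# Columns present overall but absent from the single file.
missing = sorted(inferred_columns(full_data) - inferred_columns(one_file))
```

A field that appears in only a small share of records can therefore be missing from any one part file.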

-  User

In [20]:
sample_df.select("user").take(1)

[Row(user=Row(contributors_enabled=False, created_at='Mon Mar 28 01:56:54 +0000 2011', default_profile=True, default_profile_image=False, description='Retired Principal. Proud Husband, Father, and Grandpa! Who loves working with teachers and students. Who likes out-of-the-box thinking and innovative ideas.', favourites_count=3084, followers_count=666, friends_count=1220, geo_enabled=True, id=273212858, id_str='273212858', is_translator=False, listed_count=26, location=None, name='Peter Embleton', profile_background_color='C0DEED', profile_background_image_url='http://abs.twimg.com/images/themes/theme1/bg.png', profile_background_image_url_https='https://abs.twimg.com/images/themes/theme1/bg.png', profile_background_tile=False, profile_banner_url='https://pbs.twimg.com/profile_banners/273212858/1661386030', profile_image_url='http://pbs.twimg.com/profile_images/1562592385225351168/N-10-cIG_normal.jpg', profile_image_url_https='https://pbs.twimg.com/profile_images/1562592385225351168/N-1

In [63]:
sample_df.select('user.name').take(5)

[Row(name='Peter Embleton'),
 Row(name='Seattle 911'),
 Row(name='Forgive Damola'),
 Row(name='BL'),
 Row(name='TheNutKing')]

In [64]:
sample_df.select('user.screen_name').take(5)

[Row(screen_name='pcembleton'),
 Row(screen_name='seattle911'),
 Row(screen_name='Ewali_'),
 Row(screen_name='BL94055854'),
 Row(screen_name='TheNutKingsBack')]

In [23]:
sample_df.select('user.followers_count').take(1)

[Row(followers_count=666)]

In [24]:
sample_df.select('user.favourites_count').take(1)

[Row(favourites_count=3084)]

In [25]:
sample_df.select('user.description').take(1)

[Row(description='Retired Principal. Proud Husband, Father, and Grandpa! Who loves working with teachers and students. Who likes out-of-the-box thinking and innovative ideas.')]

In [26]:
sample_df.select('user.verified').take(1)

[Row(verified=False)]

##### In User object:
1. id --> unique identifier for each user; useful for aggregation (ex: counting tweets per user)
2. name --> after aggregation, identifies who the Twitterer is (useful if the account is well known)
3. description --> can be used to classify the Twitterer's profile / identify whether they are an advocate or supporter of special education
4. followers_count --> can be used to measure how popular / influential the user is
5. favourites_count --> the number of Tweets this user has liked in the account’s lifetime; can be used to measure the user's engagement on Twitter
6. verified --> whether the user's account is verified
7. created_at --> when the account was created; can be used to observe trends in account creation

These fields provide useful information, so I select them for further analysis.
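
As a plain-Python illustration of the selection above, the sketch below flattens a nested user record into the seven chosen fields. The aliases mirror the ones used when tweet_df is built later in this notebook. The helper `select_user_fields` is hypothetical, and the record is abbreviated from the sample row shown earlier.

```python
# Maps each field of the nested `user` object to a flat column name.
USER_FIELDS = {
    "id": "user_id",
    "name": "user_name",
    "description": "user_description",
    "followers_count": "user_followers_count",
    "favourites_count": "user_like_count",
    "verified": "verified_user",
    "created_at": "time_account_created",
}


def select_user_fields(tweet):
    """Flatten the chosen user fields; missing users or fields become None."""
    user = tweet.get("user") or {}
    return {alias: user.get(field) for field, alias in USER_FIELDS.items()}


# Abbreviated from the sample row shown earlier in the notebook.
tweet = {
    "user": {
        "id": 273212858,
        "name": "Peter Embleton",
        "followers_count": 666,
        "favourites_count": 3084,
        "verified": False,
        "created_at": "Mon Mar 28 01:56:54 +0000 2011",
    }
}
flat = select_user_fields(tweet)
```

Returning None for absent fields keeps every flattened row the same shape, which makes later aggregation simpler.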

- entities

In [27]:
sample_df.select("entities").take(5)

[Row(entities=Row(hashtags=[], media=None, symbols=[], urls=[Row(display_url='twitter.com/i/web/status/1…', expanded_url='https://twitter.com/i/web/status/1579271504612974594', indices=[117, 140], url='https://t.co/mNynWn9hb3')], user_mentions=[])),
 Row(entities=Row(hashtags=[], media=None, symbols=[], urls=[], user_mentions=[])),
 Row(entities=Row(hashtags=[], media=None, symbols=[], urls=[], user_mentions=[Row(id=1008832789708443649, id_str='1008832789708443649', indices=[3, 12], name='glenda 🎃', screen_name='glendyy8')])),
 Row(entities=Row(hashtags=[], media=None, symbols=[], urls=[], user_mentions=[Row(id=806344552718045184, id_str='806344552718045184', indices=[3, 16], name='Dr. Simone Gold', screen_name='drsimonegold')])),
 Row(entities=Row(hashtags=[], media=None, symbols=[], urls=[], user_mentions=[Row(id=1322604791168557056, id_str='1322604791168557056', indices=[3, 17], name='Tommy Alexander 68K (Top 0.2%)', screen_name='surferjock805')]))]

In [28]:
sample_df.select(sample_df.entities.user_mentions).take(5)

[Row(entities.user_mentions=[]),
 Row(entities.user_mentions=[]),
 Row(entities.user_mentions=[Row(id=1008832789708443649, id_str='1008832789708443649', indices=[3, 12], name='glenda 🎃', screen_name='glendyy8')]),
 Row(entities.user_mentions=[Row(id=806344552718045184, id_str='806344552718045184', indices=[3, 16], name='Dr. Simone Gold', screen_name='drsimonegold')]),
 Row(entities.user_mentions=[Row(id=1322604791168557056, id_str='1322604791168557056', indices=[3, 17], name='Tommy Alexander 68K (Top 0.2%)', screen_name='surferjock805')])]

##### In entities object:
In the rows sampled above, hashtags is empty and only urls and user_mentions are populated, so I found no information here that would help with the analysis.

- extended_entities

In [29]:
sample_df.select("extended_entities").filter(isnull('extended_entities')==False).take(2)

[Row(extended_entities=Row(media=[Row(additional_media_info=None, description=None, display_url='pic.twitter.com/H9K7K7mk8N', expanded_url='https://twitter.com/co2nsequence/status/1579271512149749760/photo/1', id=1579271506261155842, id_str='1579271506261155842', indices=[120, 143], media_url='http://pbs.twimg.com/media/FeqzEU6XoAIHvaw.jpg', media_url_https='https://pbs.twimg.com/media/FeqzEU6XoAIHvaw.jpg', sizes=Row(large=Row(h=2048, resize='fit', w=1536), medium=Row(h=1200, resize='fit', w=900), small=Row(h=680, resize='fit', w=510), thumb=Row(h=150, resize='crop', w=150)), source_status_id=None, source_status_id_str=None, source_user_id=None, source_user_id_str=None, type='photo', url='https://t.co/H9K7K7mk8N', video_info=None)])),
 Row(extended_entities=Row(media=[Row(additional_media_info=Row(description=None, embeddable=None, monetizable=False, title=None), description=None, display_url='pic.twitter.com/qH2Z3exOun', expanded_url='https://twitter.com/OniriBoy/status/15791549823302

##### In extended_entities object:
There is no useful information that could help with analysis.

- retweeted_status

In [30]:
sample_df.select("retweeted_status").filter(isnull('retweeted_status')==False).take(2)

[Row(retweeted_status=Row(created_at='Fri Oct 07 18:04:54 +0000 2022', display_text_range=None, entities=Row(hashtags=[], media=None, symbols=[], urls=[], user_mentions=[]), extended_entities=None, extended_tweet=None, favorite_count=148, favorited=False, filter_level='low', id=1578446282078572544, id_str='1578446282078572544', in_reply_to_screen_name=None, in_reply_to_status_id=None, in_reply_to_status_id_str=None, in_reply_to_user_id=None, in_reply_to_user_id_str=None, is_quote_status=False, lang='en', place=None, possibly_sensitive=None, quote_count=1, quoted_status=None, quoted_status_id=None, quoted_status_id_str=None, quoted_status_permalink=None, reply_count=12, retweet_count=15, retweeted=False, source='<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', text='i will get into nursing school, in Jesus name!', truncated=False, user=Row(contributors_enabled=False, created_at='Mon Jun 18 22:04:15 +0000 2018', default_profile=True, default_profile_im

In [53]:
sample_df.select("retweeted_status.retweeted").take(10)

[Row(retweeted=None),
 Row(retweeted=None),
 Row(retweeted=False),
 Row(retweeted=False),
 Row(retweeted=False),
 Row(retweeted=None),
 Row(retweeted=False),
 Row(retweeted=False),
 Row(retweeted=None),
 Row(retweeted=None)]

In [31]:
sample_df.select("retweeted_status.created_at").filter(isnull('retweeted_status')==False).take(2)

[Row(created_at='Fri Oct 07 18:04:54 +0000 2022'),
 Row(created_at='Mon Oct 10 00:23:45 +0000 2022')]

In [148]:
sample_df.select("retweeted_status.retweeted").filter(isnull('retweeted_status')==False).take(5)

[Row(retweeted=False),
 Row(retweeted=False),
 Row(retweeted=False),
 Row(retweeted=False),
 Row(retweeted=False)]

In [32]:
sample_df.select("retweeted_status.favorite_count").filter(isnull('retweeted_status')==False).take(2)

[Row(favorite_count=148), Row(favorite_count=2506)]

In [33]:
sample_df.select("retweeted_status.retweet_count").filter(isnull('retweeted_status')==False).take(2)

[Row(retweet_count=15), Row(retweet_count=925)]

In [34]:
sample_df.select("retweeted_status.reply_count").filter(isnull('retweeted_status')==False).take(2)

[Row(reply_count=12), Row(reply_count=120)]

In [36]:
sample_df.select("retweeted_status.entities.hashtags.text").filter(isnull('retweeted_status')==False).take(10)[4]

Row(text=['Mario', 'fanart', 'pixelart', 'ドット絵'])

In [60]:
sample_df.select("retweeted_status.entities.hashtags.text").filter(isnull('retweeted_status')==False).take(10)

                                                                                

[Row(text=[]),
 Row(text=[]),
 Row(text=[]),
 Row(text=[]),
 Row(text=['Mario', 'fanart', 'pixelart', 'ドット絵']),
 Row(text=[]),
 Row(text=[]),
 Row(text=[]),
 Row(text=[]),
 Row(text=[])]
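
Once hashtag lists like the ones above are extracted, counting them is a simple way to surface candidate topics. The following is a minimal plain-Python sketch, not the notebook's Spark code. The function name `count_hashtags` is hypothetical, and the last two rows (`None` and `['FanArt']`) are made up to show handling of missing values and letter case.

```python
from collections import Counter


def count_hashtags(hashtag_lists):
    """Count hashtags case-insensitively across rows; empty or None rows add nothing."""
    counts = Counter()
    for tags in hashtag_lists:
        for tag in tags or []:
            counts[tag.casefold()] += 1
    return counts


rows = [
    [],
    [],
    ["Mario", "fanart", "pixelart", "ドット絵"],  # from the output above
    None,        # hypothetical: tweet without a hashtags list
    ["FanArt"],  # hypothetical: same tag with different casing
]
top = count_hashtags(rows).most_common(1)
```

Case folding merges variants such as `FanArt` and `fanart`, which would otherwise split one topic into several small counts.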

In [57]:
sample_df.select("retweeted_status.id").filter(isnull('retweeted_status')==False).take(2)

[Row(id=1578446282078572544), Row(id=1579266397007319040)]


In [56]:
sample_df.select("retweeted_status.user").filter(isnull('retweeted_status')==False).take(2)

[Row(user=Row(contributors_enabled=False, created_at='Mon Jun 18 22:04:15 +0000 2018', default_profile=True, default_profile_image=False, description='txst | 1’21 🕊 | 🇨🇲 | @capturedbyglen on insta', favourites_count=33594, followers_count=1269, friends_count=985, geo_enabled=True, id=1008832789708443649, id_str='1008832789708443649', is_translator=False, listed_count=3, location='houston | san marcos', name='glenda 🎃', profile_background_color='F5F8FA', profile_background_image_url='', profile_background_image_url_https='', profile_background_tile=False, profile_banner_url='https://pbs.twimg.com/profile_banners/1008832789708443649/1663743981', profile_image_url='http://pbs.twimg.com/profile_images/1576850629569429505/lHEH5r7e_normal.jpg', profile_image_url_https='https://pbs.twimg.com/profile_images/1576850629569429505/lHEH5r7e_normal.jpg', profile_link_color='1DA1F2', profile_sidebar_border_color='C0DEED', profile_sidebar_fill_color='DDEEF6', profile_text_color='333333', profile_use_b

In [55]:
sample_df.select("retweeted_status.user.id").filter(isnull('retweeted_status')==False).take(2)

[Row(id=1008832789708443649), Row(id=806344552718045184)]

##### In retweeted_status object:
retweeted_status holds the original Tweet that was retweeted, so its fields describe the original rather than the retweet itself:
1. retweeted_status.created_at: when the original Tweet was created.
2. retweeted_status.favorite_count: how many people liked the original Tweet.
3. retweeted_status.retweet_count: the number of times the original Tweet has been retweeted.
4. retweeted_status.reply_count: how many replies the original Tweet received.
5. retweeted_status.entities.hashtags.text: the hashtags in the original Tweet's text.

These fields provide useful information, so I select them for further analysis.
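
The practice of reading metrics from the original Tweet whenever retweeted_status exists can be sketched in plain Python as follows. The helper names `original_tweet` and `engagement` are hypothetical. The nested values come from the sample output above, while the outer tweet ids (1 and 2) are made up.

```python
def original_tweet(tweet):
    """Return the embedded original for a retweet, otherwise the tweet itself."""
    return tweet.get("retweeted_status") or tweet


def engagement(tweet):
    """Engagement metrics taken from the original tweet; nulls count as 0."""
    source = original_tweet(tweet)
    return {
        "tweet_id": source.get("id"),
        "likes": source.get("favorite_count") or 0,
        "retweets": source.get("retweet_count") or 0,
        "replies": source.get("reply_count") or 0,
        "is_retweet": tweet.get("retweeted_status") is not None,
    }


retweet = {
    "id": 1,  # hypothetical id of the retweet itself
    "favorite_count": 0,
    "retweet_count": 0,
    "reply_count": 0,
    "retweeted_status": {
        "id": 1578446282078572544,
        "favorite_count": 148,
        "retweet_count": 15,
        "reply_count": 12,
    },
}
plain = {"id": 2, "favorite_count": 3, "retweet_count": None,
         "reply_count": 0, "retweeted_status": None}

retweet_metrics = engagement(retweet)
plain_metrics = engagement(plain)
```

In the sample rows shown earlier, the retweet rows themselves report zero counts, so the original's counts are the more informative ones.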

##### - Compare the text content in text, tweet_text, retweeted_status.text, and extended_tweet.full_text.

In [37]:
sample_df.select(sample_df["retweeted_status"].text).filter(isnull(sample_df["retweeted_status"].text)==False).take(2)

[Row(retweeted_status.text='i will get into nursing school, in Jesus name!'),
 Row(retweeted_status.text='California just became the first state in the nation to mandate that all children take mRNA vaccines in order to at… https://t.co/orFRQEWJIO')]

In [38]:
sample_df.select([sample_df["retweeted_status"].text,
                  sample_df["text"],
                  sample_df["tweet_text"],
                  sample_df['extended_tweet'].full_text]).filter((isnull(sample_df["retweeted_status"].text) == False)).take(1)

[Row(retweeted_status.text='i will get into nursing school, in Jesus name!', text='RT @glendyy8: i will get into nursing school, in Jesus name!', tweet_text='i will get into nursing school, in Jesus name!', extended_tweet.full_text=None)]

In [39]:
sample_df.select([sample_df["retweeted_status"].text,
                  sample_df["text"],
                  sample_df["tweet_text"],
                  sample_df['extended_tweet'].full_text]).filter((isnull(sample_df['extended_tweet'].full_text) == False)).take(1)

[Row(retweeted_status.text=None, text='A2 Kids being late for various reasons, not having a good day, interruptions, things that happen in school, I am ha… https://t.co/mNynWn9hb3', tweet_text='A2 Kids being late for various reasons, not having a good day, interruptions, things that happen in school, I am having a bad day and the list can go on. #TeachPos', extended_tweet.full_text='A2 Kids being late for various reasons, not having a good day, interruptions, things that happen in school, I am having a bad day and the list can go on. #TeachPos')]

In [40]:
sample_df.select('extended_tweet').filter(isnull('extended_tweet')==False).take(1)

[Row(extended_tweet=Row(display_text_range=[0, 163], entities=Row(hashtags=[Row(indices=[154, 163], text='TeachPos')], media=None, symbols=[], urls=[], user_mentions=[]), extended_entities=None, full_text='A2 Kids being late for various reasons, not having a good day, interruptions, things that happen in school, I am having a bad day and the list can go on. #TeachPos'))]

In [41]:
sample_df.select('extended_tweet.full_text').filter(isnull('extended_tweet')==False).take(1)

[Row(full_text='A2 Kids being late for various reasons, not having a good day, interruptions, things that happen in school, I am having a bad day and the list can go on. #TeachPos')]

In [42]:
sample_df.select(['text', 'tweet_text', 'extended_tweet.full_text']).filter((isnull(sample_df['extended_tweet'].full_text) == False)).take(5)

[Row(text='A2 Kids being late for various reasons, not having a good day, interruptions, things that happen in school, I am ha… https://t.co/mNynWn9hb3', tweet_text='A2 Kids being late for various reasons, not having a good day, interruptions, things that happen in school, I am having a bad day and the list can go on. #TeachPos', full_text='A2 Kids being late for various reasons, not having a good day, interruptions, things that happen in school, I am having a bad day and the list can go on. #TeachPos'),
 Row(text='@JMUSportsNews You have an elite experienced defense and a mobile experienced QB with experienced playmakers. Makin… https://t.co/b8voB88usX', tweet_text='@JMUSportsNews You have an elite experienced defense and a mobile experienced QB with experienced playmakers. Makings of great success in college football. Perfect storm recipe', full_text='@JMUSportsNews You have an elite experienced defense and a mobile experienced QB with experienced playmakers. Makings of great success

In [43]:
sample_df.select(['text', 'tweet_text', 'extended_tweet.full_text']).take(5)

[Row(text='A2 Kids being late for various reasons, not having a good day, interruptions, things that happen in school, I am ha… https://t.co/mNynWn9hb3', tweet_text='A2 Kids being late for various reasons, not having a good day, interruptions, things that happen in school, I am having a bad day and the list can go on. #TeachPos', full_text='A2 Kids being late for various reasons, not having a good day, interruptions, things that happen in school, I am having a bad day and the list can go on. #TeachPos'),
 Row(text='Aid Response @ 1200 University St (A25)', tweet_text='Aid Response @ 1200 University St (A25)', full_text=None),
 Row(text='RT @glendyy8: i will get into nursing school, in Jesus name!', tweet_text='i will get into nursing school, in Jesus name!', full_text=None),
 Row(text='RT @drsimonegold: California just became the first state in the nation to mandate that all children take mRNA vaccines in order to attend s…', tweet_text='California just became the first state in the na

##### From the output above, we can see that tweet_text contains the full, untruncated version of text without the trailing URL. It is also populated more frequently than extended_tweet.full_text, which is null for many rows. Therefore, I will use tweet_text for further analysis.

##### The text column shows whether a tweet is a retweet (retweets start with "RT @user:"). I can use the retweeted column together with tweet_text to identify retweets and still get the complete text.
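As a rough illustration of the "RT @user:" convention noted above, the plain-Python sketch below splits a raw text value into the retweeted account and the remaining body. The helper name `split_retweet` and the regular expression are assumptions made for this example; the dataset already ships the retweeted and retweeted_from columns, so this only mirrors what those columns encode.

```python
import re

# Assumed shape of a retweet prefix: "RT @<handle>: <body>"
RT_PREFIX = re.compile(r'^RT @(\w+): (.*)$', re.DOTALL)

def split_retweet(text):
    """Hypothetical helper: return (retweeted_from, body).

    retweeted_from is None when the text does not start with an RT prefix.
    """
    match = RT_PREFIX.match(text)
    if match is None:
        return None, text
    return match.group(1), match.group(2)

# Two rows taken from the sample output above
print(split_retweet('RT @glendyy8: i will get into nursing school, in Jesus name!'))
print(split_retweet('Aid Response @ 1200 University St (A25)'))
```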

#### Create a new df containing the selected variables

In [58]:
tweet_df = total_df.select([total_df.id.alias("tweet_id"),
                            total_df.favorite_count.alias("tweet_likes"),
                            total_df.created_at.alias("time_tweet_created"),
                            total_df.reply_count.alias("tweet_reply_count"),
                            total_df.lang.alias("language"),
                            total_df.tweet_text,
                            
                            # User 
                            total_df.user['id'].alias("user_id"),
                            total_df.user['name'].alias("user_name"),
                            total_df.user['description'].alias("user_description"),
                            total_df.user['followers_count'].alias("user_followers_count"),
                            total_df.user['favourites_count'].alias("user_like_count"),
                            total_df.user['verified'].alias('verified_user'),
                            total_df.user['created_at'].alias("time_account_created"),
                            
                            # Location
                            total_df.place['full_name'].alias('location'),
                            total_df.place['country'].alias('country_name'),
                            total_df.place['country_code'].alias('country_code'),
                            
                            # Retweeted
                            total_df.retweeted.alias("is_retweeted"), # identify whether this tweet_text is a retweet or not
                            total_df.retweeted_from,
                            
                            ## Retweeted_status
                            total_df.retweeted_status['created_at'].alias('rt_created_time'),
                            total_df.retweeted_status['favorite_count'].alias('rt_likes'),
                            total_df.retweeted_status['retweet_count'].alias('rt_count'),
                            total_df.retweeted_status['reply_count'].alias('rt_reply_count'),
                            total_df.retweeted_status.entities.hashtags['text'].alias('rt_hashtags'),
                            total_df.retweeted_status.user['id'].alias('rt_usr_id'),
                            total_df.retweeted_status['id'].alias('rt_id'),
                               ])


In [59]:
tweet_df.printSchema()

root
 |-- tweet_id: long (nullable = true)
 |-- tweet_likes: long (nullable = true)
 |-- time_tweet_created: string (nullable = true)
 |-- tweet_reply_count: long (nullable = true)
 |-- language: string (nullable = true)
 |-- tweet_text: string (nullable = true)
 |-- user_id: long (nullable = true)
 |-- user_name: string (nullable = true)
 |-- user_description: string (nullable = true)
 |-- user_followers_count: long (nullable = true)
 |-- user_like_count: long (nullable = true)
 |-- verified_user: boolean (nullable = true)
 |-- time_account_created: string (nullable = true)
 |-- location: string (nullable = true)
 |-- country_name: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- is_retweeted: string (nullable = true)
 |-- retweeted_from: string (nullable = true)
 |-- rt_created_time: string (nullable = true)
 |-- rt_likes: long (nullable = true)
 |-- rt_count: long (nullable = true)
 |-- rt_reply_count: long (nullable = true)
 |-- rt_hashtags: array (nullable

In [61]:
tweet_df.count()

                                                                                

99994342

In [62]:
tweet_df.limit(5)

                                                                                

tweet_id,tweet_likes,time_tweet_created,tweet_reply_count,language,tweet_text,user_id,user_name,user_description,user_followers_count,user_like_count,verified_user,time_account_created,location,country_name,country_code,is_retweeted,retweeted_from,rt_created_time,rt_likes,rt_count,rt_reply_count,rt_hashtags,rt_usr_id,rt_id
1600923687976194048,0,Thu Dec 08 18:42:...,0,en,Explain to me why...,1587498361556537346,1811,Life Member Disab...,99,12978,False,Tue Nov 01 17:35:...,,,,RT,RachelMaryColl,Tue Dec 06 12:33:...,30.0,19.0,10.0,[],1.4327935877800468e+18,1.6001061618517071e+18
1600923690610610176,0,Thu Dec 08 18:42:...,0,en,College Football ...,776545370826543105,UCF Knights Natio...,@UCF_Football OLB...,2728,61953,False,Thu Sep 15 22:16:...,,,,RT,BigGameBoomer,Wed Dec 07 22:40:...,5603.0,347.0,312.0,[],561975460.0,1.600621332043014e+18
1600923690211811328,0,Thu Dec 08 18:42:...,0,en,The University of...,109574708,SoccerWire,Providing informa...,29611,6128,False,Fri Jan 29 14:35:...,,,,,,,,,,,,
1600923690970992640,0,Thu Dec 08 18:42:...,0,en,"Young men, you on...",1505028271158738944,Michael Peek,#GoCards,22,3789,False,Sat Mar 19 03:48:...,,,,RT,Zay_McCray,Thu Dec 08 11:06:...,204.0,8.0,2.0,[],1.0063561524085187e+18,1.6008089268908524e+18
1600923693072674816,0,Thu Dec 08 18:42:...,0,en,Holy shit. The sc...,1259700210734776320,Quiet 12,sentient being in...,476,65818,False,Mon May 11 04:22:...,,,,RT,againstgrmrs,Thu Dec 08 15:27:...,1433.0,344.0,152.0,[ExposeGroomers],1.5338577874863882e+18,1.6008747170404966e+18
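The time_tweet_created and time_account_created columns are stored as strings, so any time-series analysis needs them parsed first. The sketch below is plain Python and assumes the usual Twitter layout (weekday, month, day, time, UTC offset, year). The full value is truncated in the table above, so the sample string `raw` is hypothetical apart from its "Thu Dec 08 18:42" prefix.

```python
from datetime import datetime

# Assumed layout of Twitter's created_at strings
TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S %z %Y'

# Hypothetical value: the seconds, offset and year are not visible in the truncated output
raw = 'Thu Dec 08 18:42:13 +0000 2022'

parsed = datetime.strptime(raw, TWITTER_TIME_FORMAT)
print(parsed.isoformat())
```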


In [68]:
from pyspark.sql.functions import col,isnan,when,count
# count null / NaN values in the text, retweet and location columns
tweet_df_Columns=["tweet_text", "is_retweeted","location","country_name","country_code"]
tweet_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in tweet_df_Columns]
   ).show()


+----------+------------+--------+------------+------------+
|tweet_text|is_retweeted|location|country_name|country_code|
+----------+------------+--------+------------+------------+
|         0|           0|99112826|    99112826|    99112826|
+----------+------------+--------+------------+------------+



In [None]:
tweet_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in tweet_df.columns[0:10]]
   ).show()


+--------+-----------+------------------+-----------------+--------+----------+-------+---------+----------------+--------------------+
|tweet_id|tweet_likes|time_tweet_created|tweet_reply_count|language|tweet_text|user_id|user_name|user_description|user_followers_count|
+--------+-----------+------------------+-----------------+--------+----------+-------+---------+----------------+--------------------+
|       0|          0|                 0|                0|       0|         0|      0|     1536|        17631689|                   0|
+--------+-----------+------------------+-----------------+--------+----------+-------+---------+----------------+--------------------+



                                                                                

In [80]:
tweet_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in tweet_df.columns[13:16]]).show()


+--------+------------+------------+
|location|country_name|country_code|
+--------+------------+------------+
|99112826|    99112826|    99112826|
+--------+------------+------------+
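To put the missing-location count in context, the short calculation below uses only the figures printed above (99,994,342 tweets in total, 99,112,826 with no location). The variable names are chosen for this illustration.

```python
# Figures copied from the notebook output above
total_tweets = 99_994_342
missing_location = 99_112_826

# Tweets that carry a place and can be used for geographic analysis
with_location = total_tweets - missing_location
share_with_location = with_location / total_tweets

print(with_location)
print(f'{share_with_location:.2%}')
```

Fewer than 1% of tweets carry a place, so any geographic distribution is based on a small and possibly unrepresentative subset.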



                                                                                

In [81]:
tweet_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in tweet_df.columns[17:20]]).show()


+--------------+---------------+--------+
|retweeted_from|rt_created_time|rt_likes|
+--------------+---------------+--------+
|      29511762|       37487791|37487792|
+--------------+---------------+--------+



                                                                                

In [69]:
tweet_df.select("tweet_text").limit(5).collect()

[Row(tweet_text="If you are more upset at Beto O'Rourke for interrupting a press conference than you are about the school shooting, you are the problem."),
 Row(tweet_text='Fun fact: In Germany, homeschooling is effectively prohibited. You must send your child to an accredited school. Exceptions are made for "the travelling people" and a few others, but even there, the state decides the curriculum.'),
 Row(tweet_text='26 school shootings this yr and they worried about abortions. Ridiculous'),
 Row(tweet_text='26 school shootings this yr and they worried about abortions. Ridiculous'),
 Row(tweet_text='FOR SALE\n\nMixed use land, (2½ plots, 1293.163sqm) with uncompleted structure fenced &amp; gated at Adebisi Layout, NNPC, Apata, Ibadan. \nPerfect for worship centre, hospital, school, warehouse, residential  other mixed uses.\nPRICE:  ₦25M https://t.co/9HEbhZdiQV')]

In [None]:
tweet_df.groupBy('language').count().orderBy('count', ascending=False).show()


+--------+--------+
|language|   count|
+--------+--------+
|      en|99994342|
+--------+--------+



                                                                                

In [149]:
tweet_df.groupBy('is_retweeted').count().orderBy('count', ascending=False).show()


+------------+--------+
|is_retweeted|   count|
+------------+--------+
|          RT|62519196|
|            |37475146|
+------------+--------+
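The two counts above speak to the question of original content versus retweets. The arithmetic below uses only the printed figures; the variable names are illustrative.

```python
# Counts copied from the groupBy output above
retweets = 62_519_196   # is_retweeted == 'RT'
originals = 37_475_146  # is_retweeted is empty

total = retweets + originals
retweet_share = retweets / total

print(total)
print(f'{retweet_share:.2%}')
```

The two groups add up to the full row count reported earlier, and retweets make up roughly five in eight tweets.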



                                                                                

#### 4. Filtering Relevant Tweets(Rows)

- Since the tweets in the full dataset were already collected on the topics of education, schools, universities, learning, and knowledge sharing, I won't include general words such as education, schools, universities, or learning as filter keywords.
- I also want to focus on special education. Assuming the topics are already education-relevant, I will add keywords specific to special education.
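One caveat with keyword filtering is that a plain alternation matches substrings and is case-sensitive. The sketch below, in plain Python, shows one possible way to build a lowercased, word-bounded pattern. The helper `build_keyword_pattern` and the small `sample_keywords` list are assumptions for illustration only, not the filter used in this notebook. The resulting pattern string uses syntax that Spark's `rlike` should also accept, although that is not tested here.

```python
import re

# Illustrative subset of keywords, not the notebook's full list
sample_keywords = ['ece', 'SAT', 'K-12', 'special ed', 'autism']

def build_keyword_pattern(keywords):
    """Hypothetical helper: lowercase, de-duplicate, escape and word-bound the keywords."""
    cleaned = sorted({kw.strip().lower() for kw in keywords}, key=len, reverse=True)
    return r'\b(?:' + '|'.join(re.escape(kw) for kw in cleaned) + r')\b'

naive = re.compile('|'.join(sample_keywords))             # plain substring alternation
bounded = re.compile(build_keyword_pattern(sample_keywords))

text = 'i received my results on saturday'
print(bool(naive.search(text)))    # 'ece' matches inside 'received'
print(bool(bounded.search(text)))  # no whole-word keyword present
print(bool(bounded.search('k-12 funding for special ed programs')))
```

Word boundaries trade recall for precision: hashtags such as #autismawareness would no longer match the bare keyword autism, so the keyword list may need explicit hashtag variants.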

In [139]:
trend_text = ['machine learning', 'elearning', 'mlearning', 'flip class', 'learningapps']

general_text = ['preschool','kindergarten', 'primaryeducation', 'secondaryeducation', 'highereducation', 'k12', 'K-12', 'k12online','teacher', 'parenting',\
                'tuition', 'college', 'highschool', 'middleschool', 'professor', 'academics', 'academy', 'bullying', 'public school', 'private school',\
                'ACT', 'SAT', 'scholarship', 'tuition', 'student loan', 'Florida math book ban']

#content_areas_text = ['english', 'art', 'music', 'science', 'math', 'algebra', 'history', 'literature', 'STEM']

digital_citizenship = ['educationtech', 'cyberbullying']

literacy = ['literacy', 'multiliteracy', 'infolit', 'homeschooling', 'hiphomeschool'] 

other = ['dropouts', 'juvenile delinquency']

special_education = ['ece', 'specialneeds','dyslexia','tck', 'gifted', 'talented', 'autism', 'bilingual', 'aspergers', 'specialeducation', 'special-needs', 'aided',\
                    'exceptional', 'alternative provision', 'exceptional student', 'special ed', 'SDC', 'SPED', 'individual differences', 'disabilities',\
                    'self-sufficiency', 'learning disabilities', 'communication disorders', 'emotional disorders', 'behavioral disorders', 'physical disabilities', 'therapy',\
                    'osteogenesis imperfecta', 'cerebral palsy', 'lissencephaly', 'muscular dystrophy', 'developmental disabilities', 'autism spectrum disorder',\
                     'intellectual disabilities']

keywords = trend_text + general_text + digital_citizenship + literacy + other + special_education

In [140]:
keywords

['lrnchat',
 'edchat',
 'blendchat',
 'elearning',
 'mlearning',
 'ntchat',
 'edtech',
 'web20',
 'whatisschool',
 'flipclass',
 'blendedlearning',
 'flatclass',
 'machine learning',
 'elearning',
 'mlearning',
 'flip class',
 'learningapps',
 'preschool',
 'kindergarten',
 'primaryeducation',
 'secondaryeducation',
 'highereducation',
 'k12',
 'K-12',
 'k12online',
 'teacher',
 'parenting',
 'tuition',
 'college',
 'highschool',
 'middleschool',
 'professor',
 'academics',
 'academy',
 'bullying',
 'public school',
 'private school',
 'ACT',
 'SAT',
 'scholarship',
 'tuition',
 'student loan',
 'Florida math book ban',
 'educationtech',
 'cyberbullying',
 'literacy',
 'multiliteracy',
 'infolit',
 'homeschooling',
 'hiphomeschool',
 'dropouts',
 'juvenile delinquency',
 'ece',
 'specialneeds',
 'dyslexia',
 'tck',
 'gifted',
 'talented',
 'autism',
 'bilingual',
 'aspergers',
 'specialeducation',
 'special-needs',
 'aided',
 'exceptional',
 'alternative provision',
 'exceptional stude

In [141]:
%%time
# convert columns with strings (tweet text) to lowercase
tweet_df = tweet_df.withColumn("tweet_text",F.lower(F.col("tweet_text")))


# edu_df = tweet_df.filter((col('language') == 'en')) \
#                  .filter(col('tweet_text').rlike('|'.join(keywords)))

CPU times: user 3.69 ms, sys: 848 µs, total: 4.54 ms
Wall time: 13.5 ms


In [142]:
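# keep English tweets whose lowercased text matches at least one keyword
# note: rlike is case-sensitive and tweet_text is lowercase here, so upper-case keywords ('K-12', 'ACT', 'SAT', 'SDC', 'SPED') cannot match as written
edu_df = tweet_df.filter(F.col('language') == 'en')
edu_df = edu_df.filter(F.col('tweet_text').rlike('|'.join(keywords)))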

In [None]:
# edu_df = edu_df.filter(tweet_df.tweet_text.rlike('\\'.join(keywords)))

In [None]:
edu_df.count()

                                                                                

33490185

In [None]:
edu_df.printSchema()

root
 |-- tweet_id: long (nullable = true)
 |-- tweet_likes: long (nullable = true)
 |-- time_tweet_created: string (nullable = true)
 |-- tweet_reply_count: long (nullable = true)
 |-- language: string (nullable = true)
 |-- tweet_text: string (nullable = true)
 |-- user_id: long (nullable = true)
 |-- user_name: string (nullable = true)
 |-- user_description: string (nullable = true)
 |-- user_followers_count: long (nullable = true)
 |-- user_like_count: long (nullable = true)
 |-- verified_user: boolean (nullable = true)
 |-- time_account_created: string (nullable = true)
 |-- location: string (nullable = true)
 |-- country_name: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- is_retweeted: string (nullable = true)
 |-- retweeted_from: string (nullable = true)
 |-- rt_created_time: string (nullable = true)
 |-- rt_likes: long (nullable = true)
 |-- rt_count: long (nullable = true)
 |-- rt_reply_count: long (nullable = true)
 |-- rt_hashtags: array (nullable

In [None]:
# keywords = ["education", "learning", "teaching", "knowledge","training", "classroom", "curriculum", 
#             "academic", "degree", "diploma", "certificate", "exam", "assessment", "textbook", "lecture", 
#             "seminar", "workshop", "e-learning", "educational technology", "pedagogy", "andragogy",
#             "career development", "professional development", "mentoring", "coaching", "tutoring", 
#             "peer-to-peer learning", "open educational resources", "Florida math book ban", "K-12"]

# filtered_df = tweets.filter(col("full_text").rlike("|".join(keywords)))

In [None]:
edu_df.select('tweet_text').limit(10).collect()

[Row(tweet_text='fun fact: in germany, homeschooling is effectively prohibited. you must send your child to an accredited school. exceptions are made for "the travelling people" and a few others, but even there, the state decides the curriculum.'),
 Row(tweet_text='just in: cnn reporter alisyn camerota just confronted a republican texas state representative who said gun reform is unnecessary because we should just “convict” the shooters: “sir, you can’t convict him. he was killed along with 19 children in the school behind me.”'),
 Row(tweet_text='no parent should drop their kid off at school and then need a dna test to id which child is their child because they have been blown to pieces by an ar 15'),
 Row(tweet_text='no parent should drop their kid off at school and then need a dna test to id which child is their child because they have been blown to pieces by an ar 15'),
 Row(tweet_text='@bluscr3n @neat0queen @gearrl @vncerl @jg7tv putting guns in that school did nothing. why would 

In [None]:
edu_df.limit(5)

                                                                                

tweet_id,tweet_likes,time_tweet_created,tweet_reply_count,language,tweet_text,user_id,user_name,user_description,user_followers_count,user_like_count,verified_user,time_account_created,location,country_name,country_code,is_retweeted,retweeted_from,rt_created_time,rt_likes,rt_count,rt_reply_count,rt_hashtags,rt_usr_id,rt_id
1529262625166708739,0,Wed May 25 00:46:...,0,en,watch live: presi...,4027759953,Diana,Business owner. L...,1009,66604,False,Sat Oct 24 00:14:...,,,,,,,,,,,,
1529262628794601472,0,Wed May 25 00:46:...,0,en,rest in peace: fo...,225569013,marilynn,♎️ .Night Owl 🌙,694,61353,False,Sat Dec 11 22:30:...,,,,RT,ABC7,Wed May 25 00:17:...,1037.0,416.0,46.0,[],16374678.0,1.529255237218775e+18
1529262633094119426,0,Wed May 25 00:46:...,0,en,while talking to ...,736956996,Erin Sandoval 🌻?...,“I’ve decided to ...,472,40826,False,Sat Aug 04 15:53:...,,,,RT,kaitlancollins,Wed May 25 00:34:...,1326.0,195.0,144.0,[],180107694.0,1.5292596231857766e+18
1529262634150744064,0,Wed May 25 00:46:...,0,en,"26 years ago, a g...",796162913178238976,Dave is pretty fu...,I swear a lot. T...,584,190923,False,Wed Nov 09 01:30:...,,,,RT,peterframpton,Tue May 24 23:52:...,3343.0,1180.0,80.0,[],44164244.0,1.5292489111012393e+18
1529262637036449792,0,Wed May 25 00:46:...,0,en,looks inconsisten...,554848793,Angel Leyva,Studio arts/Art H...,154,121452,False,Mon Apr 16 04:05:...,,,,RT,dceiver,Wed May 25 00:42:...,116.0,29.0,0.0,[],14066024.0,1.5292616271614195e+18


- References:
1. Keywords for education:

    (1) The Complete Guide To Twitter Hashtags For Education: https://www.teachthought.com/twitter-hashtags-for-teacher/
    
2. Keywords for special education:

    (1) Special education (Wikipedia): https://en.wikipedia.org/wiki/Special_education
    
    (2) The Edvocate's List of 123 Twitter Feeds for Special Educators: https://www.theedadvocate.org/edvocates-list-123-twitter-feeds-special-educators/


#### 5. Save the filtered data to an individual bucket

In [150]:
bucket_write='msca-bdp-students-bucket'
folder_write = 'shared_data/chihhan/final_project_part1_V1'

In [152]:
edu_df.write.format("parquet").\
mode('overwrite').\
save('gs://' + bucket_write + '/'+folder_write)

                                                                                

In [153]:
!hadoop fs -ls 'gs://msca-bdp-students-bucket/shared_data/chihhan'

Found 1 items
drwx------   - root root          0 2023-03-04 01:06 gs://msca-bdp-students-bucket/shared_data/chihhan/final_project_part1_V1


In [154]:
filter_data = spark.read.parquet('gs://msca-bdp-students-bucket/shared_data/chihhan/final_project_part1_V1')

In [155]:
# check the filtered data read back from the bucket
filter_data.limit(5)

                                                                                

tweet_id,tweet_likes,time_tweet_created,tweet_reply_count,language,tweet_text,user_id,user_name,user_description,user_followers_count,user_like_count,verified_user,time_account_created,location,country_name,country_code,is_retweeted,retweeted_from,rt_created_time,rt_likes,rt_count,rt_reply_count,rt_hashtags,rt_usr_id,rt_id
1597115277602430981,0,Mon Nov 28 06:28:...,0,en,happy total secti...,1587123926022656000,Priscilla Harris,the same items fr...,110,531,False,Mon Oct 31 16:47:...,,,,,,,,,,,,
1597115281717227521,0,Mon Nov 28 06:28:...,0,en,#rc16 graduation ...,1512661462438752265,Suguna College of...,Suguna College of...,2,0,False,Sat Apr 09 05:53:...,,,,,,,,,,,,
1597115288541331456,0,Mon Nov 28 06:28:...,0,en,#卡塔尔世界杯 #世界杯下注 ht...,1530512461220257792,天博体育 AOA体育 开云体育 世...,AOA体育http://aoa85...,0,0,False,Sat May 28 11:33:...,,,,,,,,,,,,
1597115289325314049,0,Mon Nov 28 06:28:...,0,en,use of any relati...,1587149413054570497,Kimberly Olson,last day i found ...,110,592,False,Mon Oct 31 18:28:...,,,,,,,,,,,,
1597115289950638080,0,Mon Nov 28 06:28:...,0,en,i’m finally gradu...,389316573,c__c,Love you to the m...,60,26599,False,Wed Oct 12 06:30:...,,,,RT,yadarilya,Mon Nov 28 04:58:...,98.0,26.0,9.0,[],1.4054508421996544e+18,1.5970924249120973e+18


In [156]:
import datetime
import pytz

datetime.datetime.now(pytz.timezone('US/Central')).strftime("%a, %d %B %Y %H:%M:%S")

'Fri, 03 March 2023 19:26:46'