# Dataset Preparation

We have utilized a dataset obtained from the ArchiveTeam Twitter Stream, specifically focusing on the month of May in the year 2019. 

This dataset, available at https://archive.org/details/archiveteam-twitter-stream-2019-05, holds valuable information from the public Twitter stream, capturing a snapshot of global conversations during that time period.

Within this dataset, We have identified two significant incidents involving criminal activities that occurred during May 2019:

**STEM School Highlands Ranch Shooting (2019)**:
On May 7, 2019, a tragic shooting took place at STEM School Highlands Ranch in Colorado, USA. This unfortunate event resulted in the loss of one student's life and left eight others injured. The incident sent shockwaves throughout the community and the nation, sparking discussions on various platforms, including social media like Twitter. The dataset we've collected captures the sentiments, reactions, and discussions surrounding this incident, providing a glimpse into how people responded to and shared information about this devastating event.

**University of North Carolina at Charlotte Shooting (2019)**:
Another distressing incident occurred on April 30, 2019, at the University of North Carolina at Charlotte. In this incident, a gunman opened fire within a classroom, leading to the tragic deaths of two students and injuring four others. The dataset I've compiled encompasses the conversations, opinions, and reactions expressed on Twitter following this event. Analyzing this data could offer insights into public sentiment, the spread of information, and reactions to such events on social media platforms.

By delving into these specific incidents within the larger dataset, We aim to gain a deeper understanding of how Twitter users engage with and respond to real-world events of varying magnitudes. Through data analysis and processing techniques, We plan to extract valuable insights, patterns, and trends from these conversations, shedding light on the role of social media in disseminating information, expressing emotions, and fostering discussions during critical moments in our society. This endeavor not only showcases the potential of data science but also underscores the importance of understanding public discourse in the context of significant events.


We've collected information from Twitter during the entire month of May in 2019. This information is stored in special files called JSON files, which contain details about tweets people posted.

From these files, We've picked out specific details that are important for my research. These details include when the tweet was posted, what the tweet says, where it was posted from, the person's username, their location, and other things like how many followers they have.

But we're not just using any tweets. we're focusing on a particular issue. we're interested in tweets that talk about certain words like "shooting," "STEM," "killed," and "Charlotte." These words are linked to two events that happened in May 2019:

The STEM School Highlands Ranch Shooting on May 7, 2019, where one student died and others got hurt.

The University of North Carolina at Charlotte Shooting on April 30, 2019, which resulted in two students losing their lives and others getting injured.

By looking at tweets with these specific words, we want to understand how people reacted and talked about these incidents on Twitter. we want to see what they said and how they felt. This can help us learn how social media like Twitter can reflect people's thoughts during important times. Our goal is to find insights from this data that can help us better understand how people connect and respond to big events in our society using platforms like Twitter.

In [1]:
import json
import os
import bz2
import pandas as pd
# Set pandas display options to show full text
pd.set_option('display.max_colwidth', None)
import warnings
warnings.filterwarnings("ignore")

We used this script that processes the files for each day of the whole month individually. This approach was chosen due to resource and power limitations that we encountered.

While it may seem slower, it's a reliable way to ensure that we're able to process the data without causing any technical issues.

In [467]:
# Define the path to the root directory
root_directory = 'C:/Users/LAB\Downloads/DATA/31'

# Initialize an empty list to store extracted data
tweet_data = []

# Iterate through each hour folder
for hour_folder in os.listdir(root_directory):
    hour_path = os.path.join(root_directory, hour_folder)
    
    # Iterate through each JSON file in the hour folder
    for json_file in os.listdir(hour_path):
        if json_file.endswith('.bz2'):
            json_path = os.path.join(hour_path, json_file)
            
            # Extract relevant information from the compressed JSON file
            with bz2.open(json_path, 'rt', encoding='utf-8') as file:
                i=0
                for line in file:
                    tweet = json.loads(line)
                    
                    
                    # Check if the tweet is in English
                    if tweet.get('lang', '') == 'en':
                        #check if the tweet was a retweet to get original tweet values
                        if "retweeted_status" in tweet.keys():
                            org_tweet=tweet["retweeted_status"]
                            created_at = org_tweet.get('created_at', '')
                            text = org_tweet.get('text', '')
                        
                            user = org_tweet.get('user', {})
                            user_name = user.get('name', '')
                            location = user.get('location', '')
                            description = user.get('description', '')
                            followers_count = user.get('followers_count', 0)
                        
                            source = org_tweet.get('source', '')
                        
                            quote_count = org_tweet.get('quote_count',0)
                            reply_count = org_tweet.get('reply_count',0)
                            retweet_count = org_tweet.get('retweet_count',0)
                            favorite_count = org_tweet.get('favorite_count',0)
                            
                            tweet_data.append({
                            'Created At': created_at,
                            'Text': text,
                            'Source': source,
                            'User Name': user_name,
                            'Location': location,
                            'Description': description,
                            'Followers Count': followers_count,
                            'Quote Count': quote_count,
                            'Reply Count': reply_count,
                            'Retweet Count': retweet_count,
                            'Favorite Count': favorite_count
                            })
                            
                            
                            
                            
                            
                        else:
                            
                            created_at = tweet.get('created_at', '')
                            text = tweet.get('text', '')
                        
                            user = tweet.get('user', {})
                            user_name = user.get('name', '')
                            location = user.get('location', '')
                            description = user.get('description', '')
                            followers_count = user.get('followers_count', 0)
                        
                            source = tweet.get('source', '')
                         
                        
                            quote_count = tweet.get('quote_count',0)
                            reply_count = tweet.get('reply_count',0)
                            retweet_count = tweet.get('retweet_count',0)
                            favorite_count = tweet.get('favorite_count',0) 
                            
                            tweet_data.append({
                            'Created At': created_at,
                            'Text': text,
                            'Source': source,
                            'User Name': user_name,
                            'Location': location,
                            'Description': description,
                            'Followers Count': followers_count,
                            'Quote Count': quote_count,
                            'Reply Count': reply_count,
                            'Retweet Count': retweet_count,
                            'Favorite Count': favorite_count
                            })
                            #check if it was a quoted tweet to get the original
                            if "quoted_status" in tweet.keys():
                                org_tweet=tweet["quoted_status"]
                                created_at = org_tweet.get('created_at', '')
                                text = org_tweet.get('text', '')
                        
                                user = org_tweet.get('user', {})
                                user_name = user.get('name', '')
                                location = user.get('location', '')
                                description = user.get('description', '')
                                followers_count = user.get('followers_count', 0)
                        
                                source = org_tweet.get('source', '')
                        
                                quote_count = org_tweet.get('quote_count',0)
                                reply_count = org_tweet.get('reply_count',0)
                                retweet_count = org_tweet.get('retweet_count',0)
                                favorite_count = org_tweet.get('favorite_count',0)
                                
                                tweet_data.append({
                                'Created At': created_at,
                                'Text': text,
                                'Source': source,
                                'User Name': user_name,
                                'Location': location,
                                'Description': description,
                                'Followers Count': followers_count,
                                'Quote Count': quote_count,
                                'Reply Count': reply_count,
                                'Retweet Count': retweet_count,
                                'Favorite Count': favorite_count
                                })
                         
                        

In [468]:
# Create a pandas DataFrame from the combined list
tweet_df = pd.DataFrame(tweet_data)

In [469]:
tweet_df['Retweet Count'].value_counts()

0        444040
1         52872
2         22754
3         15208
4         11442
          ...  
45675         1
67671         1
29855         1
49240         1
22360         1
Name: Retweet Count, Length: 60782, dtype: int64

Each file (day) has about 5 million record, but we will filter them according to ertain words like "shooting," "STEM," and "Charlotte."

In [470]:
filtered_data = tweet_df[
    (tweet_df['Text'].str.contains(' school shooting', case=False)) |
    (tweet_df['Text'].str.contains(' mass shooting', case=False))
    #(tweet_df['Text'].str.contains(' STEM ', case=True)) |
    #(tweet_df['Text'].str.contains(' colorado', case=True)) |
    #(tweet_df['Text'].str.contains(' Highlands Ranch ', case=False)) |
    #(tweet_df['Text'].str.contains(' UNCC ', case=False)) |
    #(tweet_df['Text'].str.contains(' U.N.C.C ', case=False)) |
    #(tweet_df['Text'].str.contains(' University of North Carolina ', case=False)) |
    #(tweet_df['Text'].str.contains('#UNCC', case=False)) |
    #(tweet_df['Text'].str.contains('unc charlotte ', case=False)) |
    #(tweet_df['Text'].str.contains('#stem', case=False)) 
]

In [471]:
filtered_data.info

<bound method DataFrame.info of                              Created At  \
87707    Sat May 25 11:30:13 +0000 2019   
98759    Fri May 31 09:35:48 +0000 2019   
205488   Fri May 31 12:24:37 +0000 2019   
233739   Fri May 31 13:01:02 +0000 2019   
279803   Thu May 30 13:50:14 +0000 2019   
...                                 ...   
1061287  Fri May 31 22:55:02 +0000 2019   
1061468  Sat Jun 01 02:08:04 +0000 2019   
1065644  Sat Jun 01 01:22:56 +0000 2019   
1066406  Thu May 30 16:17:24 +0000 2019   
1066625  Sat Jun 01 03:31:57 +0000 2019   

                                                                                                                                                       Text  \
87707        In the US, 43% of mass shootings were committed by men with a known history of animal abuse.\n\nWant to protect human… https://t.co/6YhC7pmbbF   
98759                                                                                                                         @dakota

In [472]:
filtered_data['Text'].head(1)

87707    In the US, 43% of mass shootings were committed by men with a known history of animal abuse.\n\nWant to protect human… https://t.co/6YhC7pmbbF
Name: Text, dtype: object

In [473]:
filtered_data.to_csv('School_Shooting31.csv', index=False,encoding='utf-8')

##### Now we have successfully saved our files from 'School_Shooting1.csv' to 'School_Shooting31.csv'

Let's combine all these files into a single file

In [474]:
# Create an empty DataFrame to hold the combined data
combined_data = pd.DataFrame()

# Loop through each day's file and append data to the combined DataFrame
for day in range(1, 33):
    file_name = f'School_Shooting{day}.csv'
    day_data = pd.read_csv(file_name)
    combined_data = combined_data.append(day_data, ignore_index=True)

# Save the combined data to a new CSV file
combined_data.to_csv('School_Shooting_Data.csv', index=False)

In [476]:
School_Shooting_Data = pd.read_csv('School_Shooting_Data.csv')

In [477]:
School_Shooting_Data.info

<bound method DataFrame.info of                            Created At  \
0      Tue Apr 30 23:57:07 +0000 2019   
1      Tue Apr 30 23:29:19 +0000 2019   
2      Wed May 01 00:36:25 +0000 2019   
3      Wed May 01 06:48:12 +0000 2019   
4      Wed May 01 02:37:28 +0000 2019   
...                               ...   
12612  Wed May 01 05:06:02 +0000 2019   
12613  Tue Apr 30 23:00:13 +0000 2019   
12614  Tue Apr 30 23:25:16 +0000 2019   
12615  Wed May 01 02:33:51 +0000 2019   
12616  Wed May 01 00:16:31 +0000 2019   

                                                                                                                                                   Text  \
0      This week: \n• Baltimore: 1 dead\n• Birmingham: 4 injured\n• Nashville: 7 injured \n• West Chester: 4 dead \n• #UNCC: 2… https://t.co/ughzMyrpU7   
1          Two people dead and several injured at the University of North Carolina in Charlotte and NOT A SINGLE MAJOR TELEVIS… https://t.co/XjWBOvIL0R   
2         

In [478]:
School_Shooting_Data.head(5)

Unnamed: 0,Created At,Text,Source,User Name,Location,Description,Followers Count,Quote Count,Reply Count,Retweet Count,Favorite Count
0,Tue Apr 30 23:57:07 +0000 2019,This week: \n• Baltimore: 1 dead\n• Birmingham: 4 injured\n• Nashville: 7 injured \n• West Chester: 4 dead \n• #UNCC: 2… https://t.co/ughzMyrpU7,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Dante Vic,"Barcelona, Spain",rhythm & blues 🎶 #UNCC17,485,113,112.0,2441.0,4605.0
1,Tue Apr 30 23:29:19 +0000 2019,Two people dead and several injured at the University of North Carolina in Charlotte and NOT A SINGLE MAJOR TELEVIS… https://t.co/XjWBOvIL0R,"<a href=""http://twitter.com"" rel=""nofollow"">Twitter Web Client</a>",Shannon Watts,,"Founder of @MomsDemand, grassroots army of @Everytown fighting for gun safety. Book “Fight Like A Mother” out May 28. IG: http://instagram.com/shannonrwatts",310019,351,507.0,6612.0,16246.0
2,Wed May 01 00:36:25 +0000 2019,Saddened to hear about the news at UNC Charlotte my thoughts and prayers go out the #UNCC community during this time 🙏🏾,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",President Parker 🇺🇸,"Charlotte, NC",Excellence is the Only Standard|#NCCU19 Student Body President 2018-2019 ✊🏾🇺🇸 Educator 🍎 ΚΑΨ♦️,7440,0,0.0,44.0,85.0
3,Wed May 01 06:48:12 +0000 2019,"It’s a sad reality when there’s been 106 school shootings in 2019.. we’re only 4 months in. Since January, the US h… https://t.co/30WqS2P8F0","<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Mike Kelleher,"Hazlet, New Jersey",•Part time owner of salernos pizzeria. •GQ Magazine. •Back to back 2nd place finishes in fantasy football,546,0,0.0,0.0,0.0
4,Wed May 01 02:37:28 +0000 2019,"I'm heartsick for the victims of the #UNCC shooting and their family, friends, and classmates. Schools should be a… https://t.co/YXLKIOQLtX","<a href=""http://twitter.com"" rel=""nofollow"">Twitter Web Client</a>",Elizabeth Warren,Massachusetts,"US Senator, MA. Former teacher & law professor. Wife, mom (Amelia, Alex, Bailey, @CFPB), grandmother, & Okie. Official account: 2020 Presidential Campaign.",2452166,27,123.0,971.0,5327.0
