# Michael DeCero
# DSC 540
# Final Project: Predicting Tweets Deleted by X

X, formerly known as Twitter, has a set of rules that it expects users of the platform to follow. Last year, I scraped over 64M tweets and wrote a set of programs to identify which of those tweets were deleted by what was then Twitter. I undertook this project to check whether the platform followed its own rules when deciding which tweets to delete, or whether its approach was biased. My findings are documented in the blog linked below.

I believe there is an opportunity for private and/or public communities to develop their own algorithms to audit social media companies. The crux of my project is to train a model that identifies text content on social media platforms that violates our shared understanding of what is inappropriate.
 
For more information about how I gathered this data, check out my blog and GitHub:
 - Blog: https://inthegraey.com/
 - GitHub: https://github.com/madecero/thegraey
 
This notebook selects a sample of 6,153 tweets from my local database (5,000 tweets that were not deleted plus the 1,153 tweets deleted for violating the Twitter Rules) and writes them to a CSV. That CSV is then used to transform the tweets into TF-IDF weights and train a series of ML models. Refer to DSC540_Final.py for the model results using the TF-IDF vectorized tweets.

### Query local database to obtain all tweets along with their delete reason code (if applicable)

In [2]:
import sqlite3
import pandas as pd
import numpy as np

In [3]:
# Establish a connection to the SQLite database
conn = sqlite3.connect('de0project.db')

# Define your SQL query
query = 'SELECT ID, Text, CreatedAt, deleteReason FROM deleteView'

# Execute the query and store the results in a Pandas DataFrame
df = pd.read_sql_query(query, conn)

# Close the database connection
conn.close()
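
The cell above pulls every row into memory and filters afterwards. As a hedged alternative, the filter could be pushed into the SQL query so that only matching rows are loaded. The sketch below is not the code used for this project: it runs against an in-memory database with a toy `deleteView` table, and the names `VIOLATION_REASON` and `load_deleted`, along with the toy rows, are assumptions made for illustration.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# The delete reason string used later in this notebook.
VIOLATION_REASON = (
    "Twitter API returned a 404 (Not Found), "
    "This Tweet is no longer available because it violated the Twitter Rules."
)


def load_deleted(conn):
    """Hypothetical helper: load only rows whose reason matches exactly."""
    query = (
        "SELECT ID, Text, CreatedAt, deleteReason "
        "FROM deleteView WHERE deleteReason = ?"
    )
    # The parameter is bound by sqlite3 rather than pasted into the SQL text.
    return pd.read_sql_query(query, conn, params=(VIOLATION_REASON,))


# closing() guarantees the connection is closed even if the query raises.
with closing(sqlite3.connect(":memory:")) as conn:
    # Toy stand-in for the real view (assumed schema, made-up rows).
    conn.execute(
        "CREATE TABLE deleteView "
        "(ID INTEGER, Text TEXT, CreatedAt TEXT, deleteReason TEXT)"
    )
    conn.executemany(
        "INSERT INTO deleteView VALUES (?, ?, ?, ?)",
        [
            (1, "kept tweet", "Sat Dec 25 00:45:31 2021", None),
            (2, "removed tweet", "Tue Jan 25 08:36:19 2022", VIOLATION_REASON),
        ],
    )
    deleted = load_deleted(conn)

print(deleted.shape)
```

For this notebook the full table is still needed to sample the non-deleted tweets, so this approach would only help for the deleted subset.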

In [4]:
#What does this dataframe look like?
df.head()

Unnamed: 0,ID,Text,CreatedAt,deleteReason
0,1474541784210362368,RT @Fukkard: Top Two Belongs to Pra’BOSS’✊\n\n...,Sat Dec 25 00:45:31 2021,
1,1474541784181182468,RT @texan40: I swear... only in south Texas 😂 ...,Sat Dec 25 00:45:31 2021,
2,1474541784176861185,RT @kdramadaisy: choi woong is the standard.\n...,Sat Dec 25 00:45:31 2021,
3,1474541784156020741,RT @nft_ray: ANY #MAYC OWNERS INTERESTED IN TR...,Sat Dec 25 00:45:31 2021,
4,1474541784155963396,RT @methnpizza: Spare 11? 1k would be a wonder...,Sat Dec 25 00:45:31 2021,


In [5]:
#What is the shape?
df.shape

(64163912, 4)

In [6]:
#Let's make sure the delete reasons we care about came through
deletedf = df[df['deleteReason'] == 'Twitter API returned a 404 (Not Found), This Tweet is no longer available because it violated the Twitter Rules.']

In [7]:
deletedf.head()

Unnamed: 0,ID,Text,CreatedAt,deleteReason
2468695,1485894286365368328,That will really help stop the huge surge of m...,Tue Jan 25 08:36:19 2022,"Twitter API returned a 404 (Not Found), This T..."
2524858,1485996263380238341,"@ebonykayxxxx Scotland: fried mars bar, Gordon...",Tue Jan 25 15:21:32 2022,"Twitter API returned a 404 (Not Found), This T..."
2536038,1486011588780134400,Hey Sunghoon! Don't you dare to be closer with...,Tue Jan 25 16:22:26 2022,"Twitter API returned a 404 (Not Found), This T..."
2632183,1486939783540588544,@BevSutphin78 @MethyNurse @catherinenunya @Can...,Fri Jan 28 05:50:45 2022,"Twitter API returned a 404 (Not Found), This T..."
2710292,1487083580220186625,@Kasoulis1 @pskrill @Gala_heart @tariqnasheed ...,Fri Jan 28 15:22:08 2022,"Twitter API returned a 404 (Not Found), This T..."


In [8]:
deletedf.shape

(1153, 4)

### Transform our target variable to binary

In [9]:
# Convert the Target column based on substring presence
df['deleteReason'] = df['deleteReason'].apply(
    lambda x: 1 if x is not None and "Twitter API returned a 404 (Not Found), This Tweet is no longer available because it violated the Twitter Rules." in x else 0)
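
Calling `apply` with a Python lambda runs once per row, which can be slow on 64M rows. A vectorized version of the same substring check might look like the sketch below. It uses a three-row toy frame, and the second reason string ("User deleted the tweet") is a made-up placeholder, not a value from the real data.

```python
import pandas as pd

VIOLATION_REASON = (
    "Twitter API returned a 404 (Not Found), "
    "This Tweet is no longer available because it violated the Twitter Rules."
)

# Toy data: a missing reason, the violation reason, and a placeholder reason.
toy = pd.DataFrame(
    {"deleteReason": [None, VIOLATION_REASON, "User deleted the tweet"]}
)

# regex=False treats the pattern as a literal substring (it contains
# parentheses and periods); na=False maps missing reasons to False.
labels = (
    toy["deleteReason"]
    .str.contains(VIOLATION_REASON, regex=False, na=False)
    .astype(int)
)

print(labels.tolist())
```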

In [10]:
df.head()

Unnamed: 0,ID,Text,CreatedAt,deleteReason
0,1474541784210362368,RT @Fukkard: Top Two Belongs to Pra’BOSS’✊\n\n...,Sat Dec 25 00:45:31 2021,0
1,1474541784181182468,RT @texan40: I swear... only in south Texas 😂 ...,Sat Dec 25 00:45:31 2021,0
2,1474541784176861185,RT @kdramadaisy: choi woong is the standard.\n...,Sat Dec 25 00:45:31 2021,0
3,1474541784156020741,RT @nft_ray: ANY #MAYC OWNERS INTERESTED IN TR...,Sat Dec 25 00:45:31 2021,0
4,1474541784155963396,RT @methnpizza: Spare 11? 1k would be a wonder...,Sat Dec 25 00:45:31 2021,0


In [11]:
deletedf = df[df['deleteReason'] == 1]

In [12]:
deletedf.head()

Unnamed: 0,ID,Text,CreatedAt,deleteReason
2468695,1485894286365368328,That will really help stop the huge surge of m...,Tue Jan 25 08:36:19 2022,1
2524858,1485996263380238341,"@ebonykayxxxx Scotland: fried mars bar, Gordon...",Tue Jan 25 15:21:32 2022,1
2536038,1486011588780134400,Hey Sunghoon! Don't you dare to be closer with...,Tue Jan 25 16:22:26 2022,1
2632183,1486939783540588544,@BevSutphin78 @MethyNurse @catherinenunya @Can...,Fri Jan 28 05:50:45 2022,1
2710292,1487083580220186625,@Kasoulis1 @pskrill @Gala_heart @tariqnasheed ...,Fri Jan 28 15:22:08 2022,1


In [13]:
df.shape

(64163912, 4)

In [14]:
deletedf.shape

(1153, 4)

### Let's create a DataFrame that contains only records with a target variable of 0 (not deleted by X)

In [15]:
sampledf = df[df['deleteReason'] == 0]

In [16]:
sampledf.head()

Unnamed: 0,ID,Text,CreatedAt,deleteReason
0,1474541784210362368,RT @Fukkard: Top Two Belongs to Pra’BOSS’✊\n\n...,Sat Dec 25 00:45:31 2021,0
1,1474541784181182468,RT @texan40: I swear... only in south Texas 😂 ...,Sat Dec 25 00:45:31 2021,0
2,1474541784176861185,RT @kdramadaisy: choi woong is the standard.\n...,Sat Dec 25 00:45:31 2021,0
3,1474541784156020741,RT @nft_ray: ANY #MAYC OWNERS INTERESTED IN TR...,Sat Dec 25 00:45:31 2021,0
4,1474541784155963396,RT @methnpizza: Spare 11? 1k would be a wonder...,Sat Dec 25 00:45:31 2021,0


In [17]:
sampledf.shape

(64162759, 4)

### Let's pull a sample of rows from sampledf so that our algorithms can handle the smaller load. 64M rows take too long to run, and we are not distributing this load for this project because we want to keep costs at $0.

In [18]:
sampledf = sampledf.sample(n=5000)
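
Because `sample` is called without a seed, rerunning the cell draws a different set of 5,000 tweets each time. If a repeatable sample is wanted, passing `random_state` is one option. The sketch below demonstrates this on a toy frame; the seed value 42 is an arbitrary assumption and was not used to produce the outputs shown in this notebook.

```python
import pandas as pd

# Toy stand-in for the non-deleted tweets.
toy = pd.DataFrame({"ID": range(100), "deleteReason": 0})

# The same seed yields the same rows on every run.
first = toy.sample(n=10, random_state=42)
second = toy.sample(n=10, random_state=42)

print(first.index.equals(second.index))
```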

In [19]:
sampledf.head()

Unnamed: 0,ID,Text,CreatedAt,deleteReason
30843473,1575129780986388485,RT @milepluto: No matter the ending is perfect...,Wed Sep 28 14:26:18 2022,0
26321096,1546240962506129408,@Mmcintoshmerc89 the movie completely glides o...,Sun Jul 10 21:12:27 2022,0
34358526,1581465328856420352,@ImRusselsSlut @LoveRussel2799 @Russel2799 rus...,Sun Oct 16 02:01:31 2022,0
63753568,1648476562340323328,@cofeads goodbye,Wed Apr 19 00:00:14 2023,0
22636401,1538466330399088641,RT @Auto_Porn: blacked out… https://t.co/hcrNh...,Sun Jun 19 10:18:51 2022,0


In [20]:
#let's sort it by index

sampledf = sampledf.sort_index()

In [21]:
sampledf.head()

Unnamed: 0,ID,Text,CreatedAt,deleteReason
16863,1474572081517973504,RT @Mag_ho: Holy night⭐️ https://t.co/MD7lPwSqNz,Sat Dec 25 02:45:55 2021,0
31648,1474598575699021826,@TahrFantastico me too!,Sat Dec 25 04:31:11 2021,0
50353,1474640092933365762,@arseinall reimu will comfort me now from my r...,Sat Dec 25 07:16:10 2021,0
59865,1474658865333424128,aromantic rights !,Sat Dec 25 08:30:45 2021,0
73431,1474685352078372864,RT @TXT__News: TOMORROW X TOGETHER COMING SOON...,Sat Dec 25 10:16:00 2021,0


In [22]:
sampledf.shape

(5000, 4)

### We now have a DataFrame of tweets that were not deleted. Let's concatenate it with the 1,153 tweets that were deleted to build the DataFrame we will use to run our models.

In [23]:
tweetdf = pd.concat([sampledf, deletedf])

In [24]:
tweetdf = tweetdf.sort_index()

In [25]:
tweetdf.head()

Unnamed: 0,ID,Text,CreatedAt,deleteReason
16863,1474572081517973504,RT @Mag_ho: Holy night⭐️ https://t.co/MD7lPwSqNz,Sat Dec 25 02:45:55 2021,0
31648,1474598575699021826,@TahrFantastico me too!,Sat Dec 25 04:31:11 2021,0
50353,1474640092933365762,@arseinall reimu will comfort me now from my r...,Sat Dec 25 07:16:10 2021,0
59865,1474658865333424128,aromantic rights !,Sat Dec 25 08:30:45 2021,0
73431,1474685352078372864,RT @TXT__News: TOMORROW X TOGETHER COMING SOON...,Sat Dec 25 10:16:00 2021,0


In [26]:
tweetdf.shape

(6153, 4)
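
With 5,000 non-deleted and 1,153 deleted tweets, the two classes are not balanced, which is worth keeping in mind when reading model metrics later. A quick check of the class counts could look like the sketch below, which rebuilds frames of the same sizes from placeholder rows rather than the real tweets.

```python
import pandas as pd

# Placeholder frames with the same class sizes as in this notebook.
kept = pd.DataFrame({"deleteReason": [0] * 5000})
removed = pd.DataFrame({"deleteReason": [1] * 1153})

combined = pd.concat([kept, removed])

# Count rows per class and compute the share of deleted tweets.
counts = combined["deleteReason"].value_counts()
deleted_share = counts.loc[1] / len(combined)

print(counts.to_dict(), round(deleted_share, 3))
```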

### Print to a csv that will be used for ML models

In [27]:
tweetdf.to_csv('projectdf.csv', index = False)
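
Tweet text often contains commas and embedded newlines, so it can be worth confirming that the CSV reads back intact. The sketch below does a round trip through an in-memory buffer instead of a file. The two IDs are taken from the outputs above, while the text values are invented for illustration.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {
        "ID": [1474541784210362368, 1485894286365368328],
        "Text": ["line one\n\nline two", "plain, with a comma"],
        "deleteReason": [0, 1],
    }
)

# Write without the index, matching the to_csv call in this notebook.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Rewind and read the CSV back.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored["Text"].tolist() == toy["Text"].tolist())
```

Fields containing commas or newlines are quoted on write, so they should survive the round trip, and the 19-digit IDs still fit in a 64-bit integer.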

# Please now refer to DSC540_final.py for the ML models. This notebook exists only to create the CSV used for the rest of the assignment, so that the grader can replicate the steps from the produced CSV rather than from a local database that they will not have access to.