Spam Posts

Calculate the percentage of spam posts among all viewed posts, by day. A post is considered spam if the string "spam" appears in the keywords of the post. Note that the facebook_posts table stores all posts posted by users. The facebook_post_views table is an action table recording which users have viewed which posts.

In [1]:
import pandas as pd

In [3]:
facebook_posts = pd.read_csv("../CSV/facebook_posts.csv")
facebook_posts = facebook_posts.iloc[:, :5]
facebook_posts

Unnamed: 0,post_id,poster,post_text,post_keywords,post_date
0,0,2,The Lakers game from last night was great.,"[basketball,lakers,nba]",2019-01-01
1,1,1,Lebron James is top class.,"[basketball,lebron_james,nba]",2019-01-02
2,2,2,Asparagus tastes OK.,"[asparagus,food]",2019-01-01
3,3,1,Spaghetti is an Italian food.,"[spaghetti,food]",2019-01-02
4,4,3,User 3 is not sharing interests,[#spam#],2019-01-01
5,5,3,User 3 posts SPAM content a lot,[#spam#],2019-01-02


In [5]:
facebook_post_views = pd.read_csv("../CSV/facebook_post_views.csv")
facebook_post_views = facebook_post_views.iloc[:, :2]
facebook_post_views

Unnamed: 0,post_id,viewer_id
0,4,0
1,4,1
2,4,2
3,5,0
4,5,1
5,5,2
6,3,1
7,3,2
8,3,3


In [6]:
facebook_posts['is_spam'] = facebook_posts.post_keywords.str.contains('spam')
facebook_posts

Unnamed: 0,post_id,poster,post_text,post_keywords,post_date,is_spam
0,0,2,The Lakers game from last night was great.,"[basketball,lakers,nba]",2019-01-01,False
1,1,1,Lebron James is top class.,"[basketball,lebron_james,nba]",2019-01-02,False
2,2,2,Asparagus tastes OK.,"[asparagus,food]",2019-01-01,False
3,3,1,Spaghetti is an Italian food.,"[spaghetti,food]",2019-01-02,False
4,4,3,User 3 is not sharing interests,[#spam#],2019-01-01,True
5,5,3,User 3 posts SPAM content a lot,[#spam#],2019-01-02,True


In [7]:
facebook_posts['is_spam'] = facebook_posts['is_spam'].apply(lambda x: 1 if x == True else 0)
facebook_posts

Unnamed: 0,post_id,poster,post_text,post_keywords,post_date,is_spam
0,0,2,The Lakers game from last night was great.,"[basketball,lakers,nba]",2019-01-01,0
1,1,1,Lebron James is top class.,"[basketball,lebron_james,nba]",2019-01-02,0
2,2,2,Asparagus tastes OK.,"[asparagus,food]",2019-01-01,0
3,3,1,Spaghetti is an Italian food.,"[spaghetti,food]",2019-01-02,0
4,4,3,User 3 is not sharing interests,[#spam#],2019-01-01,1
5,5,3,User 3 posts SPAM content a lot,[#spam#],2019-01-02,1


In [8]:
result = facebook_post_views.merge(facebook_posts[['post_id', 'post_date', 'is_spam']], how='left', on='post_id')
result

Unnamed: 0,post_id,viewer_id,post_date,is_spam
0,4,0,2019-01-01,1
1,4,1,2019-01-01,1
2,4,2,2019-01-01,1
3,5,0,2019-01-02,1
4,5,1,2019-01-02,1
5,5,2,2019-01-02,1
6,3,1,2019-01-02,0
7,3,2,2019-01-02,0
8,3,3,2019-01-02,0


In [9]:
result = result.groupby('post_date').agg({'is_spam': ['sum', 'count']}).reset_index()
result

Unnamed: 0_level_0,post_date,is_spam,is_spam
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,count
0,2019-01-01,3,3
1,2019-01-02,3,6


In [10]:
result.columns

MultiIndex([('post_date',      ''),
            (  'is_spam',   'sum'),
            (  'is_spam', 'count')],
           )

In [11]:
result.columns = ['post_date', 'spam_sum', 'post_count']
result

Unnamed: 0,post_date,spam_sum,post_count
0,2019-01-01,3,3
1,2019-01-02,3,6


In [12]:
result['spam_share'] = (result.spam_sum / result.post_count)*100
result

Unnamed: 0,post_date,spam_sum,post_count,spam_share
0,2019-01-01,3,3,100.0
1,2019-01-02,3,6,50.0


In [13]:
result.drop(['spam_sum', 'post_count'], axis=1, inplace=True)
result

Unnamed: 0,post_date,spam_share
0,2019-01-01,100.0
1,2019-01-02,50.0


Solution Walkthrough
In this walkthrough, we will go through a code snippet that calculates the percentage of spam posts in all viewed posts by day. The code uses the pandas library to manipulate and analyze data stored in two tables, facebook_posts and facebook_post_views.

Understanding The Data
The facebook_posts table stores all posts posted by users. It contains columns like post_id, post_keywords, and post_date. The post_keywords column contains strings that may include the word "spam" if the post is considered spam.

The facebook_post_views table is an action table that records which users have viewed which posts. As loaded here, it contains two columns: post_id and viewer_id.
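For readers without the CSV files, the two tables can be rebuilt in memory. This is a minimal sketch that copies the rows displayed in the notebook output above; post_text is left out for brevity, and constructing the frames by hand (rather than with read_csv) is an assumption made only to keep the example self-contained.

```python
import pandas as pd

# Rows copied from the facebook_posts output shown above (post_text omitted).
facebook_posts = pd.DataFrame(
    {
        "post_id": [0, 1, 2, 3, 4, 5],
        "poster": [2, 1, 2, 1, 3, 3],
        "post_keywords": [
            "[basketball,lakers,nba]",
            "[basketball,lebron_james,nba]",
            "[asparagus,food]",
            "[spaghetti,food]",
            "[#spam#]",
            "[#spam#]",
        ],
        "post_date": [
            "2019-01-01",
            "2019-01-02",
            "2019-01-01",
            "2019-01-02",
            "2019-01-01",
            "2019-01-02",
        ],
    }
)

# Rows copied from the facebook_post_views output shown above.
facebook_post_views = pd.DataFrame(
    {
        "post_id": [4, 4, 4, 5, 5, 5, 3, 3, 3],
        "viewer_id": [0, 1, 2, 0, 1, 2, 1, 2, 3],
    }
)

print(facebook_posts.shape, facebook_post_views.shape)
```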

The Problem Statement
The goal is to calculate the percentage of spam posts in all viewed posts by day. To achieve this, we need to perform the following steps:

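1. Identify which posts are spam by checking whether the string "spam" is present in the post_keywords column of the facebook_posts table.
2. Merge facebook_post_views with facebook_posts on the common column post_id, so that each view carries the date and spam flag of the post.
3. Group the merged table by post_date.
4. For each post_date, sum the spam flags and count the rows. Because the merged table has one row per view, these are the number of views of spam posts and the total number of views.
5. Divide the first number by the second and multiply by 100 to get the percentage for each post_date.

Breaking Down The Code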
Let's break down the code snippet step by step:

Importing the pandas library:
import pandas as pd
This line of code imports the pandas library, which is used for data manipulation and analysis.

Identifying spam posts:
facebook_posts["is_spam"] = facebook_posts.post_keywords.str.contains(
    "spam"
)
This line of code creates a new column is_spam in the facebook_posts DataFrame. It checks if the string 'spam' is present in the post_keywords column for each row and assigns True or False accordingly.
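A possible refinement, shown here only as a sketch: str.contains interprets its pattern as a regular expression by default, and rows with missing keywords may not come back as a clean False. The keywords Series below is made-up illustration data, and passing regex=False and na=False is an assumption about what a more defensive version might look like, not part of the original solution.

```python
import pandas as pd

# Hypothetical keywords, including one missing value, for illustration only.
keywords = pd.Series(["[basketball,nba]", "[#spam#]", None, "[food]"])

# regex=False treats "spam" as a literal substring rather than a pattern;
# na=False reports rows with missing keywords as not spam.
is_spam = keywords.str.contains("spam", regex=False, na=False)

print(is_spam.tolist())
```

With na=False the resulting column holds only True and False, which keeps the later integer conversion and aggregation straightforward.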

Converting boolean values to integers:
facebook_posts["is_spam"] = facebook_posts["is_spam"].apply(
    lambda x: 1 if x == True else 0
)
This line of code converts the boolean values in the is_spam column to integers. It uses a lambda function to assign 1 if the value is True and 0 if the value is False.
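The same conversion can be written without a lambda. The sketch below uses a small made-up boolean Series and compares the lambda approach with astype(int); it also shows that pandas already sums booleans as 0 and 1, so the conversion is mainly a readability choice.

```python
import pandas as pd

# Small illustrative boolean column.
is_spam = pd.Series([False, False, True, True])

# The approach used in the walkthrough.
via_lambda = is_spam.apply(lambda x: 1 if x == True else 0)

# A vectorized alternative that yields the same 0/1 values.
via_astype = is_spam.astype(int)

print(via_lambda.tolist() == via_astype.tolist())

# Booleans are summed as 0/1 even without converting first.
print(int(is_spam.sum()))
```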

Merging tables and selecting columns:
result = facebook_post_views.merge(
    facebook_posts[["post_id", "post_date", "is_spam"]],
    how="left",
    on="post_id",
)
This line of code merges the facebook_post_views and facebook_posts tables on the common column post_id. It selects the columns post_id, post_date, and is_spam from the facebook_posts table and adds them to the result DataFrame. The how='left' parameter specifies that we want to keep all rows from facebook_post_views and only merge matching rows from facebook_posts.
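A left merge silently keeps views whose post_id has no match in the posts table, leaving their post_date and is_spam empty. The following sketch shows one way to surface such rows using the indicator and validate parameters of merge. The frames are reduced illustration data, and post_id 99 is a made-up unmatched view that does not appear in the original tables.

```python
import pandas as pd

posts = pd.DataFrame(
    {
        "post_id": [3, 4],
        "post_date": ["2019-01-02", "2019-01-01"],
        "is_spam": [0, 1],
    }
)

# post_id 99 is hypothetical: a view that has no matching post.
views = pd.DataFrame({"post_id": [4, 3, 99], "viewer_id": [0, 1, 2]})

merged = views.merge(
    posts,
    how="left",
    on="post_id",
    validate="many_to_one",  # raises if post_id is duplicated in posts
    indicator=True,          # adds a _merge column describing each row's origin
)

# Rows that exist only in views have no post_date or is_spam value.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["post_id", "viewer_id"]])
```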

Grouping data and aggregating:
result = (
    result.groupby("post_date")
    .agg({"is_spam": ["sum", "count"]})
    .reset_index()
)
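This statement groups the result DataFrame by the post_date column and computes two aggregates of is_spam for each group: sum, which gives the number of views of spam posts, and count, which gives the number of non-null is_spam values (here, the total number of views). Because the aggregations are passed as a dictionary with a list, the resulting columns form a two-level MultiIndex. reset_index() then turns post_date from the group index back into a regular column.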
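One detail worth checking is that count skips missing values, while size counts every row. The sketch below uses made-up data with one missing is_spam value to show the difference; in the walkthrough's data every view matches a post, so the two would agree.

```python
import pandas as pd

# Illustrative merged data; the None stands for a view with no matching post.
merged = pd.DataFrame(
    {
        "post_date": ["2019-01-01", "2019-01-01", "2019-01-02"],
        "is_spam": [1.0, None, 0.0],
    }
)

stats = merged.groupby("post_date")["is_spam"].agg(["sum", "count", "size"])

# For 2019-01-01: count ignores the missing value, size does not.
print(stats)
```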

Renaming columns:
result.columns = ["post_date", "spam_sum", "post_count"]
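This line of code replaces the two-level column MultiIndex produced by the aggregation with three flat, descriptive names: post_date, spam_sum, and post_count. Note that post_count holds the number of view rows per date, not the number of distinct posts.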
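The grouping and renaming steps can also be combined using pandas named aggregation, which produces flat column names directly. This is an alternative sketch, not the approach used in the notebook; the merged frame below reproduces the post_date and is_spam values from the merged output shown earlier.

```python
import pandas as pd

# post_date and is_spam values as they appear in the merged table above.
merged = pd.DataFrame(
    {
        "post_date": ["2019-01-01"] * 3 + ["2019-01-02"] * 6,
        "is_spam": [1, 1, 1, 1, 1, 1, 0, 0, 0],
    }
)

# Each keyword names an output column: (source column, aggregation).
result = (
    merged.groupby("post_date")
    .agg(spam_sum=("is_spam", "sum"), post_count=("is_spam", "count"))
    .reset_index()
)

print(list(result.columns))
```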

Calculating the percentage of spam posts:
result["spam_share"] = (result.spam_sum / result.post_count) * 100
This line of code calculates the percentage of spam posts for each post_date. It divides the spam_sum column by the post_count column and multiplies by 100. The result is assigned to a new column called spam_share.
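Because is_spam holds only 0 and 1, its mean within each group is already the spam share, so the sum, count, and division can be collapsed into one step. This is a shortcut sketch under that assumption, using the same post_date and is_spam values as the merged table above.

```python
import pandas as pd

merged = pd.DataFrame(
    {
        "post_date": ["2019-01-01"] * 3 + ["2019-01-02"] * 6,
        "is_spam": [1, 1, 1, 1, 1, 1, 0, 0, 0],
    }
)

# The mean of a 0/1 column is the fraction of ones; multiply by 100 for a percentage.
spam_share = (
    merged.groupby("post_date")["is_spam"]
    .mean()
    .mul(100)
    .reset_index(name="spam_share")
)

print(spam_share)
```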

Removing unnecessary columns:
result.drop(["spam_sum", "post_count"], axis=1, inplace=True)
This line of code removes the spam_sum and post_count columns from the result DataFrame. The drop() function is used with the axis=1 parameter to specify that we want to drop columns, and the inplace=True parameter to apply the changes to the DataFrame.
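An alternative to dropping columns in place is to select the columns to keep, which returns a new DataFrame and leaves the original untouched. The sketch below rebuilds the intermediate result from the values shown in the notebook output; the variable name trimmed is introduced here for illustration.

```python
import pandas as pd

# Intermediate result as displayed before the drop step.
result = pd.DataFrame(
    {
        "post_date": ["2019-01-01", "2019-01-02"],
        "spam_sum": [3, 3],
        "post_count": [3, 6],
        "spam_share": [100.0, 50.0],
    }
)

# Select the columns to keep instead of mutating result.
trimmed = result[["post_date", "spam_share"]]

# Equivalent non-mutating form using drop with the columns keyword.
also_trimmed = result.drop(columns=["spam_sum", "post_count"])

print(trimmed.equals(also_trimmed))
```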

Displaying the final result:
result
In a notebook, a bare expression on the last line of a cell displays its value, so this line shows the result DataFrame after all the manipulations and calculations.

Bringing It All Together
The code snippet first identifies spam posts by checking for the string "spam" in the post_keywords column of the facebook_posts table. It then merges facebook_post_views with the relevant columns of facebook_posts, so that each view carries a post_date and an is_spam flag. The merged table is grouped by post_date, and the sum and count of the is_spam column are calculated for each date. Finally, the percentage is computed from those two numbers, the helper columns are removed, and the result is displayed.
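The whole pipeline can be condensed into one function. This is a sketch rather than the notebook's exact code: the function name spam_share_by_day is hypothetical, the data is copied from the tables shown earlier (post_text omitted), and the use of na=False, regex=False, astype(int), and the group mean are assumptions chosen to shorten the steps.

```python
import pandas as pd


def spam_share_by_day(posts: pd.DataFrame, views: pd.DataFrame) -> pd.DataFrame:
    """Return the percentage of views that landed on spam posts, per post_date."""
    # Flag spam posts as 0/1 without modifying the caller's DataFrame.
    flagged = posts.assign(
        is_spam=posts["post_keywords"]
        .str.contains("spam", regex=False, na=False)
        .astype(int)
    )
    # One row per view, carrying the viewed post's date and spam flag.
    merged = views.merge(
        flagged[["post_id", "post_date", "is_spam"]], how="left", on="post_id"
    )
    # The mean of a 0/1 flag is the spam fraction; scale it to a percentage.
    return (
        merged.groupby("post_date")["is_spam"]
        .mean()
        .mul(100)
        .reset_index(name="spam_share")
    )


posts = pd.DataFrame(
    {
        "post_id": [0, 1, 2, 3, 4, 5],
        "post_keywords": [
            "[basketball,lakers,nba]",
            "[basketball,lebron_james,nba]",
            "[asparagus,food]",
            "[spaghetti,food]",
            "[#spam#]",
            "[#spam#]",
        ],
        "post_date": [
            "2019-01-01",
            "2019-01-02",
            "2019-01-01",
            "2019-01-02",
            "2019-01-01",
            "2019-01-02",
        ],
    }
)
views = pd.DataFrame(
    {
        "post_id": [4, 4, 4, 5, 5, 5, 3, 3, 3],
        "viewer_id": [0, 1, 2, 0, 1, 2, 1, 2, 3],
    }
)

summary = spam_share_by_day(posts, views)
print(summary)
```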

Conclusion
The code snippet calculates the percentage of spam posts among all viewed posts by day using the facebook_posts and facebook_post_views tables. On the sample data shown above, the result is 100.0 for 2019-01-01 and 50.0 for 2019-01-02.