In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# loading data
reddit = pd.read_csv('train-balanced-sarcasm.csv')

In [3]:
# seeing the first 5 rows (the default for .head())
reddit.head()

Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment
0,0,NC and NH.,Trumpbart,politics,2,-1,-1,2016-10,2016-10-16 23:55:23,"Yeah, I get that argument. At this point, I'd ..."
1,0,You do know west teams play against west teams...,Shbshb906,nba,-4,-1,-1,2016-11,2016-11-01 00:24:10,The blazers and Mavericks (The wests 5 and 6 s...
2,0,"They were underdogs earlier today, but since G...",Creepeth,nfl,3,3,0,2016-09,2016-09-22 21:45:37,They're favored to win.
3,0,"This meme isn't funny none of the ""new york ni...",icebrotha,BlackPeopleTwitter,-8,-1,-1,2016-10,2016-10-18 21:03:47,deadass don't kill my buzz
4,0,I could use one of those tools.,cush2push,MaddenUltimateTeam,6,-1,-1,2016-12,2016-12-30 17:00:13,Yep can confirm I saw the tool they use for th...


In [4]:
# seeing number of rows, columns
reddit.shape

(1010826, 10)

There are just over 1 million rows (1,010,826) and 10 columns.

In [5]:
# checking dtypes, null values, column names
reddit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010826 entries, 0 to 1010825
Data columns (total 10 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   label           1010826 non-null  int64 
 1   comment         1010771 non-null  object
 2   author          1010826 non-null  object
 3   subreddit       1010826 non-null  object
 4   score           1010826 non-null  int64 
 5   ups             1010826 non-null  int64 
 6   downs           1010826 non-null  int64 
 7   date            1010826 non-null  object
 8   created_utc     1010826 non-null  object
 9   parent_comment  1010826 non-null  object
dtypes: int64(4), object(6)
memory usage: 77.1+ MB


Observations of note:

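- There are 10 columns in total: 6 of `object` dtype and 4 of `int64`. There are no floats and no datetimes, even though `date` and `created_utc` hold date-like values stored as `object`.

- The `comment` column contains null values: 1,010,771 of its 1,010,826 entries are non-null, so 55 are missing.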

- Before examining nulls, we should check for duplicates.

- `label` is likely our target variable, holding binary values for `sarcastic` or `non-sarcastic`.
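
Since `date` and `created_utc` are stored as `object`, one option is to parse them into datetimes before any time-based analysis. The following is a minimal sketch on a small stand-in frame (`sample` is a hypothetical name, with values copied from the first rows shown earlier); the format strings are assumptions based on how those values look, not something confirmed for every row of the dataset.

```python
import pandas as pd

# Small stand-in for `reddit`; values copied from the first rows shown earlier.
sample = pd.DataFrame({
    "date": ["2016-10", "2016-11", "2016-09"],
    "created_utc": [
        "2016-10-16 23:55:23",
        "2016-11-01 00:24:10",
        "2016-09-22 21:45:37",
    ],
})

# Parse the string columns into datetime values using explicit formats.
# A year-month string parses to the first day of that month.
sample["created_utc"] = pd.to_datetime(sample["created_utc"], format="%Y-%m-%d %H:%M:%S")
sample["date"] = pd.to_datetime(sample["date"], format="%Y-%m")

print(sample.dtypes)
```

On the full dataset, passing `errors="coerce"` to `pd.to_datetime` would turn any value that does not match the assumed format into `NaT` instead of raising an error.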

In [6]:
reddit.duplicated().sum()

28

There are 28 duplicate rows. We can examine them to see whether any pattern stands out.

In [7]:
# using .loc to visually examine duplicate rows
reddit.loc[reddit.duplicated()]

Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment
23777,1,Said the under 2k post karma guy *expert*,iam4real,youdontsurf,1,-1,-1,2016-11,2016-11-02 10:09:10,This subreddit really does suck.
78313,1,"USE REAL WORDS, DAMN IT",FlameSpartan,TumblrInAction,1,1,0,2016-09,2016-09-07 00:31:39,"Yo, that's such a kawai'i level of desu, it ma..."
160906,1,Hey you're that one guy who racked in all that...,OG_Phx_Son,pcmasterrace,1,-1,-1,2016-10,2016-10-02 15:56:40,Practice mowing lawns pls
201633,1,"Because Sandy Hook, CT and Aurora, CO were so ...",Gogomelo,news,-9,-9,0,2016-07,2016-07-31 11:59:53,"Just stay out of ghetto neighborhoods, violent..."
223116,1,That's just a player who knows how to get maxi...,Ignitus1,heroesofthestorm,1,1,0,2016-08,2016-08-06 19:40:18,tell that to the murky I played with yesterday...
281522,1,Im sure players would LOVE to have to fly all ...,pewpewpew52,nfl,2,2,0,2016-07,2016-07-02 05:19:20,London.
300841,1,What if the velocity of the electricity is pro...,GoldenScarab569,AskReddit,1,1,0,2016-06,2016-06-30 15:10:20,Electricity takes 1/143 of a second to travel ...
306604,1,Hes got Diaz and GSP lined up Wonderboy will h...,Kgb725,MMA,1,1,0,2016-08,2016-08-01 09:03:45,I feel like if Woodley decides to carry on wit...
307543,1,Only right wing nut jobs worry about that sort...,qemist,Economics,0,0,0,2016-08,2016-08-22 04:46:33,"But if society collapses, how am I supposed to..."
313724,1,Unless your neighbour's face is hidden under a...,deadcat,australia,1,1,0,2016-07,2016-07-04 21:45:37,It's much easier to blindly hate a faceless id...
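
To summarize a column across every flagged row instead of reading it off the display, one option is to count values among the duplicates directly. This is a minimal sketch on a toy frame (`toy` and its rows are hypothetical, not taken from the dataset); the same calls should apply to `reddit`.

```python
import pandas as pd

# Toy stand-in for `reddit`; these rows are hypothetical.
toy = pd.DataFrame({
    "label":   [1, 1, 0, 1, 1, 0],
    "comment": ["a", "a", "b", "c", "c", "d"],
    "author":  ["x", "x", "y", "z", "z", "w"],
})

# duplicated() marks the second and later copies of each repeated row.
dupes = toy.loc[toy.duplicated()]

# Count labels among the flagged rows only.
dupe_label_counts = dupes["label"].value_counts()
print(dupe_label_counts)

# keep=False marks every copy, so originals and repeats can be compared side by side.
all_copies = toy.loc[toy.duplicated(keep=False)]
print(all_copies)
```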


--------------------------------------------------------------------------------------------------------------------------------

Among the duplicate rows shown, the only apparent pattern is that each one has a `1` in the `label` column, our target variable. Before dropping duplicates, we should check the distribution of the `label` column.

In [8]:
reddit['label'].value_counts()

label
0    505413
1    505413
Name: count, dtype: int64

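The `sarcastic` and `non-sarcastic` labels are split exactly evenly, with 505,413 rows each. Removing the 28 duplicates will disturb that perfect balance, but only slightly given more than a million rows, and it keeps identical examples from being counted more than once when a model is trained or evaluated.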
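
To put a number on how far the balance can shift, here is a quick arithmetic sketch using the counts from the outputs above. It assumes the worst case, in which all 28 removed rows come from a single class; the variable names are illustrative only.

```python
# Figures taken from the outputs above.
per_class_before = 505413
n_duplicates = 28

# Worst case (an assumption): every removed row belongs to the same class.
smaller_class_after = per_class_before - n_duplicates
total_after = 2 * per_class_before - n_duplicates

# Share of the smaller class after dropping the duplicates.
smaller_class_share = smaller_class_after / total_after
print(f"{smaller_class_share:.4%}")
```

Even under that assumption, the smaller class stays within a few thousandths of a percentage point of an even split.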

In [9]:
# double checking for duplicate columns before dropping
reddit.columns.duplicated().sum()

0

In [10]:
# dropping duplicates
reddit = reddit.drop_duplicates()

In [11]:
# confirmation of drop
reddit.shape

(1010798, 10)

--------------------------------------------------------------------------------------------------------------------------------

Next, we examine the null values we found in the `comment` column.

In [13]:
# clear representation of nulls
reddit.isna().sum()

label              0
comment           55
author             0
subreddit          0
score              0
ups                0
downs              0
date               0
created_utc        0
parent_comment     0
dtype: int64
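
With the nulls confined to `comment`, one possible next step is to look at the affected rows and, if they carry nothing usable, drop them. The sketch below uses a toy frame (`toy` and its values are hypothetical); dropping is only one option and is not necessarily the approach taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `reddit`; these rows are hypothetical.
toy = pd.DataFrame({
    "label": [0, 1, 1, 0],
    "comment": ["first", np.nan, "third", np.nan],
    "subreddit": ["politics", "nba", "nfl", "news"],
})

# Select the rows whose comment is missing so they can be inspected.
null_rows = toy.loc[toy["comment"].isna()]
print(null_rows)

# Drop only the rows where `comment` is null; other columns are not considered.
cleaned = toy.dropna(subset=["comment"])
print(cleaned.shape)
```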