In [1]:
import pandas as pd
import numpy as np
import glob
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# articleData.csv holds brief details about the articles
articles_df = pd.read_csv('GoodData/articleData.csv')
articles_df.sample(5)

Unnamed: 0,Article Name,Article URL extension,Author,Article Description,upvotes
707,Python String Tutorial,/community/tutorials/python-string-tutorial,Sejal Jaiswal,"In this tutorial, you'll learn all about Pytho...",44
119,Hacking Date Functions in SQLite,/community/tutorials/hacking-date-functions-sq...,Hillary Green-Lerman,"In this tutorial, learn how to use date functi...",6
130,Bootstrap in R,/community/tutorials/bootstrap-r,Łukasz Deryło,"In this tutorial, you will learn how to use th...",14
343,Views (Virtual Tables) in SQL,/community/tutorials/views-in-sql,Avinash Navlani,"In this tutorial, you will learn what views ar...",16
595,Joining DataFrames in Pandas,/community/tutorials/joining-dataframes-pandas,Manish Pathak,"In this tutorial, you’ll learn various ways in...",28


In [3]:
# Shape
articles_df.shape

(776, 5)

In [4]:
# Which article has the highest number of upvotes?
maximum_upvotes = articles_df.upvotes.max()
articles_df[articles_df.upvotes==maximum_upvotes]

Unnamed: 0,Article Name,Article URL extension,Author,Article Description,upvotes
62,Python For Finance: Algorithmic Trading,/community/tutorials/finance-python-trading,Karlijn Willems,This Python for Finance tutorial introduces yo...,441
212,Python For Finance: Algorithmic Trading,/community/tutorials/finance-python-trading,Karlijn Willems,This Python for Finance tutorial introduces yo...,441
362,Python For Finance: Algorithmic Trading,/community/tutorials/finance-python-trading,Karlijn Willems,This Python for Finance tutorial introduces yo...,441
512,Python For Finance: Algorithmic Trading,/community/tutorials/finance-python-trading,Karlijn Willems,This Python for Finance tutorial introduces yo...,441


In [5]:
# Are there duplicates?
articles_df[articles_df.duplicated()].shape

(449, 5)

In [6]:
# Drop the duplicates
articles_df.drop_duplicates(inplace=True)
articles_df.shape

(327, 5)
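
The numbers above fit together: 776 rows minus 449 flagged duplicates leaves 327. As a minimal sketch of why, the toy frame below (illustrative values, not taken from `articleData.csv`) shows that `duplicated()` flags only the repeats of a row, while `drop_duplicates()` keeps the first occurrence by default.

```python
import pandas as pd

# Toy data standing in for articleData.csv (values are illustrative only)
toy_df = pd.DataFrame({
    "Article Name": ["A", "B", "A", "A"],
    "upvotes": [10, 5, 10, 10],
})

# duplicated() marks each repeat of an earlier row; the first occurrence stays False
dup_mask = toy_df.duplicated()
print(dup_mask.sum())

# drop_duplicates() keeps the first occurrence by default (keep="first")
deduped = toy_df.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```

Here two of the four rows are repeats, so two rows remain after dropping them.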

In [7]:
# Run the maximum upvotes query again
articles_df[articles_df.upvotes==maximum_upvotes]

Unnamed: 0,Article Name,Article URL extension,Author,Article Description,upvotes
62,Python For Finance: Algorithmic Trading,/community/tutorials/finance-python-trading,Karlijn Willems,This Python for Finance tutorial introduces yo...,441


In [8]:
# Lowest upvote?
lowest_upvotes = articles_df.upvotes.min()
articles_df[articles_df.upvotes==lowest_upvotes]

Unnamed: 0,Article Name,Article URL extension,Author,Article Description,upvotes
770,The RDocumentation Poster,/community/tutorials/package-rankings-task-vie...,Karlijn Willems,Rdocumentation.org is the only R documentation...,2


In [9]:
# How many unique authors contributed?
len(articles_df.Author.unique())

91

We now turn to the comments posted over time on the community articles and tutorials. `commentsData` is our folder of interest. 

In [10]:
# How many articles/tutorials are there in this folder?
total_articles = !ls commentsData/
print(len(total_articles))

325


So, we have the comments scraped for 325 articles/tutorials of DataCamp Community. Consider the following comment and its reply: 

![](images/Image1.png)


During our scraping process, we captured only the first-level comments, _not the entire conversation chain including the replies_. Let's now load the first article we have in the folder `commentsData` and glance through its comments (if any). 

In [15]:
comment_sample_1 = pd.read_csv('commentDataDetailed/10CommandlineUtilitiesinPostgreSQL.csv')
comment_sample_1.head()

Unnamed: 0,commentedBy,commentMessage,upvotes,commentDate
0,Jeremiah James,"Thank you for this tutorial, very helpful and ...",3,01/04/2019 08:29 PM


Note that `10CommandlineUtilitiesinPostgreSQL` is the name of the respective article. We are interested in knowing whether a comment is spam or non-spam, which is essentially a **binary classification** problem. To approach the solution, we first need to get all the comments from all the articles in one place. 

Binary classification is a _supervised learning_ task, but we do not have labels for any of the comments. So, we will have to resort to **manual labelling**: we will label some of the comments by hand and then proceed. But first, let's get the comments in one place (a pandas DataFrame!).

In [17]:
path = 'commentDataDetailed/'
all_files = glob.glob(path + "*.csv")

temp_list = []
count_no_data = 0

for filename in all_files:
    try:
        temp_df = pd.read_csv(filename, index_col=None)
        temp_list.append(temp_df)
    except pd.errors.EmptyDataError:  # assumes files without comments are empty
        count_no_data += 1
        
final_df = pd.concat(temp_list, axis=0, ignore_index=True)
print('Total comments accumulated: {} and {} documents do not have any comments'.
                          format(len(final_df), count_no_data))

Total comments accumulated: 2247 and 51 documents do not have any comments
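
The loop above discards the file name, so a comment can no longer be traced back to its article. The sketch below is one possible variant, not the code used for this notebook: it builds two throwaway CSV files (one of them empty), catches only `pd.errors.EmptyDataError`, and records the source in a hypothetical `article` column. The file names and the `article` column are assumptions made for illustration.

```python
import glob
import os
import tempfile

import pandas as pd

# Build a throwaway folder with two toy files; one is empty (no comments)
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "ArticleA.csv"), "w") as f:
    f.write("commentedBy,commentMessage\nAnn,Great tutorial!\n")
open(os.path.join(tmp_dir, "ArticleB.csv"), "w").close()

frames = []
empty_files = 0

for filename in sorted(glob.glob(os.path.join(tmp_dir, "*.csv"))):
    try:
        df = pd.read_csv(filename)
    except pd.errors.EmptyDataError:
        # Raised by read_csv when the file has no columns to parse
        empty_files += 1
        continue
    # Hypothetical extra column: remember which article each comment came from
    df["article"] = os.path.splitext(os.path.basename(filename))[0]
    frames.append(df)

combined = pd.concat(frames, axis=0, ignore_index=True)
print(len(combined), empty_files)
```

Catching a specific exception means that any other problem, such as a malformed file, still surfaces instead of being silently counted as "no comments".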


In [18]:
# Review
final_df.sample(10)

Unnamed: 0,commentedBy,commentMessage,upvotes,commentDate
776,Samsun Rock,"Guys, there mentioned information is so knowle...",1,31/01/2019 01:57 PM
855,Sam Shum,Very easy to understand. Well done. Thank you!,4,03/05/2018 06:38 AM
83,Mark Steven,Nice one...\n\n\n\n\n \n\nRead some of our bl...,1,15/04/2019 03:08 PM
2141,Alexander Baker,"Great tutorial, I've read through tons, and th...",1,03/01/2019 07:41 AM
2238,bhangad singh,This is a terrific tutorial and obviously a lo...,2,22/01/2019 08:24 AM
887,Laura Kirchner,"Great tutorial, but where is the link to the d...",2,02/10/2018 09:27 PM
537,Aman Sarviya,"Hello, @KarlijnWillems\n\nThank you for such a...",1,30/05/2019 11:42 AM
27,Stark Lord,This is a good tutorial. Meanwhile you can vis...,1,22/05/2019 05:57 PM
1219,Valentin Koffi,Great !,1,20/02/2018 11:42 PM
1083,Muqadder Iqbal,There should be a way to bookmark articles to ...,2,17/10/2018 01:39 PM


In [19]:
# Get the comments only
final_comments_df = pd.DataFrame()
final_comments_df['comment'] = final_df['commentMessage']
final_comments_df.sample(5)

Unnamed: 0,comment
1210,Thanks for this information. To get informatio...
564,I cannot see the images as described in the do...
2069,How can i make my own haar cascade file? any a...
2122,very good tutorial
80,We've got a team of highly skilled And ...


In [20]:
# Let's serialize this DataFrame to a .csv file
final_comments_df.to_csv('GoodData/final_data.csv', index=False)
print('File saved!')

File saved!


## Basic EDA

In [21]:
# Nulls?
final_comments_df.isna().sum()

comment    2
dtype: int64

In [22]:
# Where?
final_comments_df[final_comments_df.comment.isna()]

Unnamed: 0,comment
951,
1000,


In [23]:
# Goodbye
print('Before dropping the nulls shape of the DataFrame: {}'.format(final_comments_df.shape))
final_comments_df = final_comments_df.dropna().reset_index(drop=True)
print('After dropping the nulls shape of the DataFrame: {}'.format(final_comments_df.shape))

Before dropping the nulls shape of the DataFrame: (2247, 1)
After dropping the nulls shape of the DataFrame: (2245, 1)


In [24]:
# Duplicates?
final_comments_df[final_comments_df.duplicated()].head(20)

Unnamed: 0,comment
777,Great tutorial!
798,Very good article. This is a tutorial video ex...
985,Thanks
1013,
1087,Great article!
1431,good article
1553,Excellent tutorial!
1556,
1974,nice
2023,Great tutorial!


In [25]:
final_comments_df[final_comments_df.duplicated()].tail(10)

Unnamed: 0,comment
1087,Great article!
1431,good article
1553,Excellent tutorial!
1556,
1974,nice
2023,Great tutorial!
2047,"for i in range(201):\n\n print('EPOCH',..."
2132,Thank you very much!
2214,Very nice article!
2235,Part 3 published today: https://www.datacamp.c...


In [26]:
# Total duplicate rows
len(final_comments_df[final_comments_df.duplicated()])

14
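
Rows 1013 and 1556 in the duplicate listing print as blank even though the nulls were already dropped. One possible explanation, which is only a guess from the output, is that these comments contain nothing but whitespace. A minimal sketch on toy data of how such rows could be treated as missing:

```python
import pandas as pd

# Toy comments: two real ones, one null, two whitespace-only
toy = pd.DataFrame({"comment": ["Great!", "   ", None, "\n\n", "Thanks"]})

# Strip surrounding whitespace, then turn empty strings into missing values
cleaned = toy["comment"].str.strip()
cleaned = cleaned.mask(cleaned == "")

# Drop the rows that are now missing and rebuild the index
toy_clean = toy.assign(comment=cleaned).dropna().reset_index(drop=True)
print(toy_clean)
```

Whether blank comments should be removed at all depends on the labelling goal, so this is a choice rather than a required step.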

In [27]:
# Drop the duplicates
final_comments_df.drop_duplicates(inplace=True)

In [28]:
# Final check
print(final_comments_df.isna().sum())

comment    0
dtype: int64


In [29]:
# Overwrite the .csv file
final_comments_df.to_csv('GoodData/final_data.csv', index=False)
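
With the cleaned comments saved, the manual labelling mentioned earlier could begin from a reproducible random subset. The sketch below uses toy comments; the `label` column, its 1/0 convention, and the sample size are assumptions for illustration, not part of the original workflow.

```python
import pandas as pd

# Toy comments standing in for the cleaned data (illustrative only)
comments = pd.DataFrame({"comment": [
    "Great tutorial!",
    "Visit my site for cheap loans",
    "Where is the link to the dataset?",
    "Nice one... read some of our blogs",
]})

# Draw a reproducible subset to label by hand
to_label = comments.sample(n=3, random_state=42).copy()

# Hypothetical label column: 1 = spam, 0 = not spam, to be filled in manually
to_label["label"] = pd.NA

print(to_label)
```

Fixing `random_state` means the same rows are drawn on every run, so labels entered by hand stay aligned with the sampled comments.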