In [28]:
import pandas as pd
import numpy as np
import glob
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
!ls

articleData.csv  DatacampArticleScraping  README.md
commentsData	 Data_prep.ipynb


In [3]:
# articleData.csv holds brief details about each article
articles_df = pd.read_csv('GoodData/articleData.csv')
articles_df.sample(5)

Unnamed: 0,Article Name,Article URL extension,Author,Article Description,upvotes
265,Basic Programming Skills in R,/community/tutorials/basic-programming-skills-r,Ryan Sheehy,Practice basic programming skills in R by usin...,7
741,Exploratory Data Analysis of Craft Beers: Data...,/community/tutorials/python-data-profiling,Jean-Nicholas Hould,"In this tutorial, you'll learn about explorato...",25
300,Demystifying Mathematical Concepts for Deep Le...,/community/tutorials/demystifying-mathematics-...,Avinash Navlani,Explore basic math concepts for data science a...,7
187,How to Make a Histogram with ggplot2,/community/tutorials/make-histogram-ggplot2,Karlijn Willems,Learn how to make a histogram with ggplot2 in ...,15
538,Introduction to Python Metaclasses,/community/tutorials/python-metaclasses,Derrick Mwiti,"In this tutorial, learn what metaclasses are, ...",17


In [7]:
# Shape
articles_df.shape

(776, 5)

In [4]:
# Which article has the highest number of upvotes?
maximum_upvotes = articles_df.upvotes.max()
articles_df[articles_df.upvotes==maximum_upvotes]

Unnamed: 0,Article Name,Article URL extension,Author,Article Description,upvotes
62,Python For Finance: Algorithmic Trading,/community/tutorials/finance-python-trading,Karlijn Willems,This Python for Finance tutorial introduces yo...,441
212,Python For Finance: Algorithmic Trading,/community/tutorials/finance-python-trading,Karlijn Willems,This Python for Finance tutorial introduces yo...,441
362,Python For Finance: Algorithmic Trading,/community/tutorials/finance-python-trading,Karlijn Willems,This Python for Finance tutorial introduces yo...,441
512,Python For Finance: Algorithmic Trading,/community/tutorials/finance-python-trading,Karlijn Willems,This Python for Finance tutorial introduces yo...,441


In [8]:
# Are there duplicates?
articles_df[articles_df.duplicated()].shape

(449, 5)

In [9]:
# Drop the duplicates
articles_df.drop_duplicates(inplace=True)
articles_df.shape

(327, 5)
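
The two counts above are consistent: 449 flagged rows plus 327 surviving rows give the original 776. This follows from `duplicated()` flagging every repeat of a row except its first occurrence (`keep='first'` is the default), which is the same set of rows `drop_duplicates()` removes. A minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame (hypothetical values): the first two rows are identical
toy_df = pd.DataFrame({
    "Article Name": ["A", "A", "B"],
    "upvotes": [10, 10, 3],
})

# duplicated() marks repeats only; the first occurrence stays False
dup_mask = toy_df.duplicated()
print(dup_mask.tolist())

deduped = toy_df.drop_duplicates()
# Flagged rows + surviving rows == original rows
print(dup_mask.sum() + len(deduped) == len(toy_df))
```

Note that the row index is not part of the comparison, so rows that differ only in their index label still count as duplicates.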

In [10]:
# Run the maximum upvotes query again
articles_df[articles_df.upvotes==maximum_upvotes]

Unnamed: 0,Article Name,Article URL extension,Author,Article Description,upvotes
62,Python For Finance: Algorithmic Trading,/community/tutorials/finance-python-trading,Karlijn Willems,This Python for Finance tutorial introduces yo...,441


In [11]:
# Lowest upvote?
lowest_upvotes = articles_df.upvotes.min()
articles_df[articles_df.upvotes==lowest_upvotes]

Unnamed: 0,Article Name,Article URL extension,Author,Article Description,upvotes
770,The RDocumentation Poster,/community/tutorials/package-rankings-task-vie...,Karlijn Willems,Rdocumentation.org is the only R documentation...,2


In [13]:
# How many unique authors contributed?
len(articles_df.Author.unique())

91

We now turn to the comments posted on the community articles and tutorials. `commentsData` is our folder of interest. 

In [25]:
# How many articles/tutorials are there in this folder?
total_articles = !ls commentsData/
print(len(total_articles))

325


So, we have comments scraped for 325 articles/tutorials from DataCamp Community. Consider the following comment and its reply: 

![](images/Image1.png)


During our scraping process, we captured only the first comments, _not the entire conversation chain including the replies_. Let's now load the first article we have in the folder `commentsData` and glance through its comments (if any). 

In [26]:
comment_sample_1 = pd.read_csv('commentsData/10CommandlineUtilitiesinPostgreSQL.csv')
comment_sample_1.head()

Unnamed: 0,commentedBy,commentMessage,upvotes,commentDate
0,Jeremiah James,"Thank you for this tutorial, very helpful and ...",3,01/04/2019 08:29 PM


Note that `10CommandlineUtilitiesinPostgreSQL` is the name of the respective article. We are interested in knowing whether a comment is spam or non-spam, which is essentially a **binary classification** problem. To approach the solution, we first need to get all the comments from all the articles in one place. 

Binary classification is a _supervised learning_ task, but we do not have labels for any of the comments. So, we will have to resort to **manual labelling**: we will label some of the comments by hand and then proceed from there. But first, let's get the comments in one place (a pandas DataFrame!).

In [43]:
path = 'commentsData/'
all_files = glob.glob(path + "*.csv")

temp_list = []
count_no_data = 0

for filename in all_files:
    try:
        temp_df = pd.read_csv(filename, index_col=None)
        temp_list.append(temp_df)
    except Exception:  # files that fail to parse are counted as having no comments
        count_no_data += 1
        
final_df = pd.concat(temp_list, axis=0, ignore_index=True)
print('Total comments accumulated: {} and {} documents do not have any comments'.
                          format(len(final_df), count_no_data))

Total comments accumulated: 2233 and 57 documents do not have any comments
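
The loop above discards which article each comment came from. If that link is needed later, one option is to record the file stem alongside each row. The sketch below builds two throwaway CSVs in a temporary directory so that it runs standalone; the `article` column name and the file names are assumptions, not part of the original scrape. It also catches only `pandas.errors.EmptyDataError`, the exception `read_csv` raises for a completely empty file, on the assumption that this is why comment-less files fail to load.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-ins for files in commentsData/
    with open(os.path.join(tmp_dir, "ArticleOne.csv"), "w") as f:
        f.write("commentedBy,commentMessage\nAlice,Nice tutorial\nBob,Thanks\n")
    # A zero-byte file, standing in for an article without comments
    open(os.path.join(tmp_dir, "ArticleTwo.csv"), "w").close()

    frames = []
    empty_files = 0
    for filename in sorted(glob.glob(os.path.join(tmp_dir, "*.csv"))):
        try:
            df = pd.read_csv(filename)
        except pd.errors.EmptyDataError:
            # Nothing to parse: count the file and move on
            empty_files += 1
            continue
        # Keep the article name (file name without extension) with each comment
        df["article"] = os.path.splitext(os.path.basename(filename))[0]
        frames.append(df)

    combined = pd.concat(frames, axis=0, ignore_index=True)

print(len(combined), empty_files)
```

Catching a specific exception means any other failure (a malformed file, a wrong path) surfaces as an error instead of being silently counted as "no comments".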


In [44]:
# Review
final_df.sample(10)

Unnamed: 0,commentedBy,commentMessage,upvotes,commentDate
114,Abigail Smith,Get solutions for all Epson Printer Support...,1,29/05/2019 08:45 PM
1813,Alex T,"Hi, great work, great tut.",2,21/07/2018 01:06 AM
330,Jeff Hendricks,source .bahsrc should be .bashrc,2,08/03/2018 06:49 AM
768,John Carrell,Wow! Where was this tutorial 3 years ago when ...,1,14/06/2018 07:50 AM
1783,Khachatur Karapetyan,You well done!\nIt's very useful and interesti...,2,08/09/2018 02:36 AM
1745,Ashu yadav,"hello, i am happy to share with you guys see h...",2,31/12/2018 12:59 PM
887,Akshay Sharma,Thanks for this great article in such a simpl...,2,18/04/2019 04:09 PM
1932,thomas,In the second bullet point of the first sectio...,1,03/04/2018 02:03 AM
214,vijayabhaskar96,Please change the colors of the code and it's ...,1,14/06/2018 06:55 PM
1669,Alister cook,Needed to compose you a very little word to th...,1,04/07/2018 03:07 PM


In [45]:
# Get the comments only
final_comments_df = pd.DataFrame()
final_comments_df['comment'] = final_df['commentMessage']
final_comments_df.sample(5)

Unnamed: 0,comment
1965,"Going short: ""or you sell your stock, expectin..."
2162,The .ix indexer is deprecated starting in Pa...
1385,why Do we retrain the model with all layers as...
991,It seems that since Yhat has been acquired by ...
1045,How to group samples if I have 930 samples in...


In [47]:
# Let's serialize this DataFrame to a .csv file
final_comments_df.to_csv('GoodData/final_data.csv', index=False)

## Basic EDA

In [52]:
# Nulls?
final_comments_df.isna().sum()

comment    7
dtype: int64

In [53]:
# Where?
final_comments_df[final_comments_df.comment.isna()]

Unnamed: 0,comment
91,
101,
631,
939,
988,
1531,
2023,


In [55]:
# Goodbye
print('Before dropping the nulls shape of the DataFrame: {}'.format(final_comments_df.shape))
final_comments_df = final_comments_df.dropna().reset_index(drop=True)
print('After dropping the nulls shape of the DataFrame: {}'.format(final_comments_df.shape))

Before dropping the nulls shape of the DataFrame: (2233, 1)
After dropping the nulls shape of the DataFrame: (2226, 1)


In [64]:
# Duplicates?
final_comments_df[final_comments_df.duplicated()].head(20)

Unnamed: 0,comment
117,
132,Hi
140,"Hi Sejal,"
143,Hi
151,"Hi Sejal,"
153,"Hi Sejal,"
155,"Hi Sejal,"
216,
260,"Hi,"
291,"Hi Thushan,"


In [66]:
# Yes, but how dire?
final_comments_df.loc[329]

comment    Hi,
Name: 329, dtype: object

In [67]:
final_comments_df.loc[291]

comment    Hi Thushan,
Name: 291, dtype: object

In [68]:
final_comments_df[final_comments_df.duplicated()].tail(10)

Unnamed: 0,comment
2114,Thank you very much!
2117,"Hi Karlijn,"
2120,Thanks for this tutorial.
2133,Great tutorial!
2148,"hi,"
2196,Very nice article!
2205,"Hi Debbie,"
2209,"Hi Debbie,"
2217,Part 3 published today: https://www.datacamp.c...
2223,


In [69]:
final_comments_df.loc[2217]

comment    Part 3 published today: https://www.datacamp.c...
Name: 2217, dtype: object
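
Inspecting individual rows gives a feel for the duplicates, but a frequency count answers "how dire?" more directly. A minimal sketch using `value_counts()` on a made-up column (the comments below are hypothetical and only mimic the greeting-style repeats seen above):

```python
import pandas as pd

# Hypothetical comments mimicking the greeting-style duplicates
toy_comments = pd.DataFrame({"comment": [
    "Hi", "Hi Sejal,", "Hi", "Great tutorial!", "Hi Sejal,", "Hi Sejal,",
]})

# Count each distinct comment and keep only those that occur more than once
counts = toy_comments["comment"].value_counts()
repeated = counts[counts > 1]
print(repeated)
```

Applied to `final_comments_df['comment']`, the same two lines would list every repeated comment together with how often it occurs.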

In [71]:
# Drop the duplicates
final_comments_df.drop_duplicates(inplace=True)

In [81]:
# Final check
print(final_comments_df.isna().sum())

comment    0
dtype: int64
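
A caveat on this check: `isna()` only detects true missing values. The duplicate listings above show rows such as 117, 216 and 2223 with a blank comment even after `dropna()`, which suggests (this is an assumption, since the raw values are not shown) that some comments are empty or whitespace-only strings. A small sketch of one way to treat those as missing, on made-up data:

```python
import pandas as pd

# Hypothetical column with one whitespace-only, one empty and one missing value
toy = pd.DataFrame({"comment": ["Thanks!", "   ", "", None, "Hi"]})

# Strip surrounding whitespace, then turn empty strings into missing values
cleaned = toy["comment"].str.strip()
cleaned = cleaned.mask(cleaned == "")

print(cleaned.isna().sum())
```

After this step a regular `dropna()` would remove the blank comments along with the genuinely missing ones.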


In [82]:
# Overwrite the .csv file
final_comments_df.to_csv('GoodData/final_data.csv', index=False)
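
With the cleaned comments saved, the manual labelling step mentioned earlier could begin by drawing a reproducible subset and adding an empty label column to fill in by hand. The sketch below is only one possible setup: the `label` column name, the 1 = spam / 0 = non-spam convention, and the sample size are assumptions, and the comments are made up.

```python
import pandas as pd

# Hypothetical comments; in the notebook these would come from final_comments_df
comments = pd.DataFrame({"comment": [
    "Great tutorial, thanks!",
    "Get solutions for all printer support issues here",
    "source .bahsrc should be .bashrc",
    "Very nice article!",
]})

# Draw a reproducible subset to label by hand
to_label = comments.sample(n=2, random_state=42).copy()
# Placeholder column: 1 = spam, 0 = non-spam, to be filled in manually
to_label["label"] = pd.NA

print(to_label.shape)
```

Fixing `random_state` keeps the drawn subset stable across reruns, so labels entered by hand stay aligned with the same rows.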