In [1]:
import pandas as pd
import numpy as np
import glob
import matplotlib.pyplot as plt
import re
%matplotlib inline

In [6]:
# articleData.csv holds a short summary of each article
articles_df = pd.read_csv('GoodData/articleData.csv')
articles_df.sample(5)

Unnamed: 0,Article Name,Article URL extension,Author,Article Description,upvotes
273,Introduction to Cyclical Learning Rates,/community/tutorials/cyclical-learning-neural-...,Sayak Paul,Learn what cyclical learning rate policy is an...,14
222,Decision Tree Classification in Python,/community/tutorials/decision-tree-classificat...,Avinash Navlani,"In this tutorial, learn Decision Tree Classifi...",36
98,A Tutorial on Loops in R - Usage and Alternatives,/community/tutorials/tutorial-on-loops-in-r,Carlo Fanara,A tutorial on loops in R that looks at the con...,50
680,Logistic Regression in R Tutorial,/community/tutorials/logistic-regression-R,James Le,Discover all about logistic regression: how it...,74
543,Working With Zip Files In Python,/community/tutorials/zip-file,Hafeezul Kareem Shaik,"In this tutorial, you are going to learn how t...",7


In [7]:
# Shape
articles_df.shape

(776, 5)

In [8]:
# Which article has the highest number of upvotes?
maximum_upvotes = articles_df.upvotes.max()
articles_df[articles_df.upvotes==maximum_upvotes]

Unnamed: 0,Article Name,Article URL extension,Author,Article Description,upvotes
62,Python For Finance: Algorithmic Trading,/community/tutorials/finance-python-trading,Karlijn Willems,This Python for Finance tutorial introduces yo...,441
212,Python For Finance: Algorithmic Trading,/community/tutorials/finance-python-trading,Karlijn Willems,This Python for Finance tutorial introduces yo...,441
362,Python For Finance: Algorithmic Trading,/community/tutorials/finance-python-trading,Karlijn Willems,This Python for Finance tutorial introduces yo...,441
512,Python For Finance: Algorithmic Trading,/community/tutorials/finance-python-trading,Karlijn Willems,This Python for Finance tutorial introduces yo...,441


In [9]:
# Are there duplicates?
articles_df[articles_df.duplicated()].shape

(449, 5)

In [10]:
# Drop the duplicates
articles_df.drop_duplicates(inplace=True)
articles_df.shape

(327, 5)
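
The counts above are consistent with each other (776 - 449 = 327) because `duplicated()` by default flags every repeat of a row except its first occurrence, and `drop_duplicates()` removes exactly those flagged rows. A minimal sketch on made-up data (the toy frame below is an assumption for illustration, not taken from `articleData.csv`):

```python
import pandas as pd

# Toy data (hypothetical): one article repeated three times plus one unique article
toy_df = pd.DataFrame({
    'Article Name': ['A', 'A', 'A', 'B'],
    'upvotes': [10, 10, 10, 3],
})

# duplicated() marks every repeat except the first occurrence (keep='first' is the default)
dup_mask = toy_df.duplicated()
n_duplicates = int(dup_mask.sum())

# drop_duplicates() removes exactly the rows flagged above
deduped_df = toy_df.drop_duplicates()

print(n_duplicates, len(toy_df), len(deduped_df))
```

So the number of rows before dropping, minus the number of flagged rows, should always equal the number of rows after dropping.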

In [11]:
# Run the maximum upvotes query again
articles_df[articles_df.upvotes==maximum_upvotes]

Unnamed: 0,Article Name,Article URL extension,Author,Article Description,upvotes
62,Python For Finance: Algorithmic Trading,/community/tutorials/finance-python-trading,Karlijn Willems,This Python for Finance tutorial introduces yo...,441


In [12]:
# Lowest upvote?
lowest_upvotes = articles_df.upvotes.min()
articles_df[articles_df.upvotes==lowest_upvotes]

Unnamed: 0,Article Name,Article URL extension,Author,Article Description,upvotes
770,The RDocumentation Poster,/community/tutorials/package-rankings-task-vie...,Karlijn Willems,Rdocumentation.org is the only R documentation...,2


In [13]:
# How many unique authors contributed?
len(articles_df.Author.unique())

91

We now turn to the comments posted on the community articles and tutorials. `commentsData` is our folder of interest.

In [14]:
# How many articles/tutorials are there in this folder?
total_articles = !ls commentsData/
print(len(total_articles))

1


So, we have the comments scraped for 325 articles/tutorials of DataCamp Community. Consider the following comment and its reply:

![](images/Image1.png)


During our scraping process, we captured only the first comments, _not the entire conversation chain including the replies_. Let's now load the first article we have in the folder `commentsData` and glance through its comments (if any).

In [15]:
comment_sample_1 = pd.read_csv('commentDataDetailed/10CommandlineUtilitiesinPostgreSQL.csv')
comment_sample_1.head()

Unnamed: 0,commentedBy,commentMessage,upvotes,commentDate
0,Jeremiah James,"Thank you for this tutorial, very helpful and ...",3,01/04/2019 08:29 PM


Note that `10CommandlineUtilitiesinPostgreSQL` is the name of the respective article. We are interested in knowing whether a comment is spam or non-spam, which is essentially a **binary classification** problem. To approach the solution, we first need to get all the comments from all the articles in one place.

Binary classification is a _supervised learning_ task, but we do not have labels for any of the comments. So, we will have to resort to **manual labelling**: we will label some of the comments by hand and then proceed. But first, let's get the comments in one place (a pandas DataFrame!).

### Dictionary to map article names with file names

In [16]:
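import os

articleDict = {}
# Regex that keeps only English alphanumerics, underscores and Chinese characters
regExp = re.compile(r'[^A-Za-z0-9_⺀-⺙⺛-⻳⼀-⿕々〇〡-〩〸-〺〻㐀-䶵一-鿃豈-鶴侮-頻並-龎]', re.UNICODE)
for articleName in articles_df['Article Name'].tolist():
    # Build the key with os.path.join so it uses the same separator as the paths returned by glob
    articleDict[os.path.join('commentDataDetailed', regExp.sub('', articleName) + ".csv")] = articleName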
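
To make the mapping concrete, here is a minimal sketch of what the substitution does to a single article title. The pattern is cut down to ASCII for brevity and the helper name `article_to_file_stem` is hypothetical. That the comment files were named this way is an assumption, inferred from the file name `10CommandlineUtilitiesinPostgreSQL.csv` seen earlier.

```python
import re

# ASCII-only version of the pattern (an assumption made to keep the sketch short)
reg_exp = re.compile(r'[^A-Za-z0-9_]')

def article_to_file_stem(article_name):
    """Strip every character that is not a letter, digit, or underscore."""
    return reg_exp.sub('', article_name)

stem = article_to_file_stem('Python For Finance: Algorithmic Trading')
print(stem)

# Two titles that differ only in punctuation collapse to the same stem
same_stem = article_to_file_stem('A-B') == article_to_file_stem('AB')
print(same_stem)
```

One consequence of this scheme is that titles differing only in spaces or punctuation would share a file name, so the mapping is only safe if the stripped titles are unique.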

### Adding the article reference with the comments for better labelling

In [None]:
import os

path = 'commentDataDetailed'
# Join with the OS path separator so the pattern matches the files inside the folder
all_files = glob.glob(os.path.join(path, "*.csv"))

temp_list = []
count_no_data = 0
columnNames = None

# Get the column names dynamically from the files that can be parsed
for candidate in all_files:
    try:
        columnNames = pd.read_csv(candidate, index_col=None).columns
    except Exception:
        continue

for filename in all_files:
    try:
        # Pass the names parameter separately and skip the header row, because read_csv() with engine='python' and encoding='utf-8' is unable to read the column names properly
        temp_df = pd.read_csv(filename, names=columnNames, index_col=None, header=None, skiprows=[0], engine='python', encoding='utf-8')
        # Add the article name for reference
        temp_df['articleName'] = articleDict[filename]
        temp_list.append(temp_df)
    except Exception:
        # Any file that cannot be read or mapped to an article is counted as having no comments
        count_no_data += 1

final_df = pd.concat(temp_list, axis=0, ignore_index=True)
print('Total comments accumulated: {} and {} documents do not have any comments'.
                          format(len(final_df), count_no_data))
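
The gather-and-tag pattern above can be tried in isolation. The sketch below is a variation, not the notebook's exact method: it writes two made-up comment files to a temporary directory and looks up the article by file stem rather than by full path, so the lookup does not depend on the path separator. The file names, titles, and the `stem_to_article` mapping are all hypothetical.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical mapping from file stem to article title
stem_to_article = {'ArticleOne': 'Article One', 'ArticleTwo': 'Article Two'}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write two small made-up comment files
    pd.DataFrame({'commentMessage': ['nice', 'thanks']}).to_csv(
        os.path.join(tmp_dir, 'ArticleOne.csv'), index=False)
    pd.DataFrame({'commentMessage': ['great']}).to_csv(
        os.path.join(tmp_dir, 'ArticleTwo.csv'), index=False)

    frames = []
    for filename in sorted(glob.glob(os.path.join(tmp_dir, '*.csv'))):
        # The stem is the file name without its directory or extension
        stem = os.path.splitext(os.path.basename(filename))[0]
        frame = pd.read_csv(filename)
        frame['articleName'] = stem_to_article[stem]
        frames.append(frame)

# Stack the per-article frames into one DataFrame with a fresh index
combined_df = pd.concat(frames, axis=0, ignore_index=True)
print(combined_df)
```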

In [14]:
# Review
final_df.sample(10)

Unnamed: 0,commentedBy,commentMessage,upvotes,commentDate,articleName
1080,Elias Oziolor,Another small mistake. When you're merging the...,2,28/07/2018 05:30 AM,Machine Learning in R for beginners
861,Jan Eite Bullema,"A little error in this line, should be save_m...",1,22/05/2018 02:25 PM,keras: Deep Learning in R
166,mrgb1 ch,"\n\nHolistic Bliss Keto ""We have a positive...",1,30/05/2019 08:51 AM,AWS EC2 Tutorial For Beginners
246,Aman Sarviya,"Hello @KarlijnWillems Thank you for the blog,\...",0,23/05/2019 12:57 PM,Choosing R or Python for Data Analysis? An Inf...
589,Erkan ŞİRİN,At Visualize Color section the code should be ...,1,28/06/2018 11:59 AM,Graph Optimization with NetworkX in Python
1320,Mark Loat,"Thanks for the article however, can you tell m...",1,24/04/2019 01:08 PM,Python Data Structures Tutorial
1223,Denis Rasulev,Still not fixed - code is barely visible. Othe...,2,24/05/2018 10:17 PM,PEP-8 Tutorial: Code Standards in Python
1235,Mallikarjun M,Very helpful with the subtopics of when to use...,1,19/03/2019 07:33 AM,Pickle in Python: Object Serialization
410,Jude Anlasun,"Awesome! You nailed it, very good read.",2,03/10/2018 12:17 AM,Demystifying Crucial Statistics in Python
882,zhuowang,Really awesome program. But I face a problem w...,3,29/05/2018 11:14 PM,Keras Tutorial: Deep Learning in Python


In [15]:
# Get the comments only
final_comments_df = pd.DataFrame()
final_comments_df['comment'] = final_df['commentMessage']
final_comments_df['articleName'] = final_df['articleName']
final_comments_df.sample(5)

Unnamed: 0,comment,articleName
204,"Hey!\n\nThanks for your tutorial, it's very us...",Beginner's Guide to Google's Vision API in Python
2191,"Error in write_delim(x, path, delim = ""\t"", n...",Web Scraping in R: rvest Tutorial
1810,Is anybody else getting an error when trying t...,Survival Analysis in R For Beginners
466,"This was very helpful, thanks!",Differences Between Machine Learning & Deep Le...
1249,great details,Poker Probability and Statistics with Python


In [16]:
# Let's serialize this DataFrame to a .csv file
final_comments_df.to_csv('GoodData/final_data_extended.csv', index=False)
print('File saved!')

File saved!


### Basic EDA

In [17]:
# Nulls?
final_comments_df.isna().sum()

comment        2
articleName    0
dtype: int64

In [18]:
# Where?
final_comments_df[final_comments_df.comment.isna()]

Unnamed: 0,comment,articleName
195,,Beginner's Guide to Feature Selection in Python
1994,,Top 5 Python IDEs For Data Science


In [19]:
# Drop the rows with null comments
print('Before dropping the nulls shape of the DataFrame: {}'.format(final_comments_df.shape))
final_comments_df = final_comments_df.dropna().reset_index(drop=True)
print('After dropping the nulls shape of the DataFrame: {}'.format(final_comments_df.shape))

Before dropping the nulls shape of the DataFrame: (2247, 2)
After dropping the nulls shape of the DataFrame: (2245, 2)


In [20]:
# Duplicates?
final_comments_df[final_comments_df['comment'].duplicated()==True].head(20)

Unnamed: 0,comment,articleName
695,Very good article. This is a tutorial video ex...,Installing Anaconda on Windows
802,good article,JSON Data in Python
887,Thank you very much!,Keras Tutorial: Deep Learning in Python
908,Great article!,K-Means Clustering in Python with scikit-learn
1070,Excellent tutorial!,Machine Learning in R for beginners
1445,nice,Python For Finance: Algorithmic Trading
1546,,Python Object-Oriented Programming (OOP): Tuto...
1581,,Python Seaborn Tutorial For Beginners
1599,Thanks,Python Tuples Tutorial
1845,Great tutorial!,TensorFlow Tutorial For Beginners


In [21]:
final_comments_df[final_comments_df['comment'].duplicated()==True].tail(10)

Unnamed: 0,comment,articleName
1070,Excellent tutorial!,Machine Learning in R for beginners
1445,nice,Python For Finance: Algorithmic Trading
1546,,Python Object-Oriented Programming (OOP): Tuto...
1581,,Python Seaborn Tutorial For Beginners
1599,Thanks,Python Tuples Tutorial
1845,Great tutorial!,TensorFlow Tutorial For Beginners
1869,"for i in range(201):\n\n print('EPOCH',...",TensorFlow Tutorial For Beginners
1927,Great tutorial!,The Pip Python Package Manager
1952,Very nice article!,Tidy Sentiment Analysis in R
1973,Part 3 published today: https://www.datacamp.c...,Tidy Sentiment Analysis in R


In [22]:
# Total duplicate rows
len(final_comments_df[final_comments_df['comment'].duplicated()==True])

14

In [23]:
# Drop the duplicates
final_comments_df.drop_duplicates('comment', inplace=True)
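
Note that passing `'comment'` as the subset compares rows on the comment text alone. A generic comment such as "Great tutorial!" that appears under two different articles is therefore kept only once, attached to the first article it was seen under. A minimal sketch on made-up data (the toy frame is an assumption for illustration):

```python
import pandas as pd

toy_comments_df = pd.DataFrame({
    'comment': ['Great tutorial!', 'Great tutorial!', 'Thanks'],
    'articleName': ['Article One', 'Article Two', 'Article Two'],
})

# Whole-row comparison: no row is a full duplicate, so nothing is dropped
full_row_dedup = toy_comments_df.drop_duplicates()

# Comparison on 'comment' only: the second "Great tutorial!" is dropped
comment_dedup = toy_comments_df.drop_duplicates(subset='comment')

print(len(full_row_dedup), len(comment_dedup))
print(comment_dedup['articleName'].tolist())
```

For spam labelling this is arguably the desired behaviour, since the label depends on the comment text, but the `articleName` column of the deduplicated frame then reflects only one of the articles that text was posted under.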

In [24]:
# Final check
print(final_comments_df.isna().sum())

comment        0
articleName    0
dtype: int64


In [None]:
# Overwrite the .csv file
final_comments_df.to_csv('GoodData/final_data_extended.csv', index=False)

#### Creating a dictionary that maps comments to article names, for use with the partially labelled CSV file

In [21]:
commentToArticleDict = dict(zip([str(comment) for comment in final_comments_df['comment'].tolist()], final_comments_df['articleName'].tolist()))

In [22]:
loaded_comments_df = pd.read_csv('./GoodData/final_data_labeledPhase1.csv', names=['comment', 'label'],index_col=None, header=None, skiprows=[0], engine='python', encoding='utf-8')

In [23]:
loaded_comments_df['articleName'] = [commentToArticleDict.get(comment, None) for comment in loaded_comments_df['comment'].tolist()]

#### Checking whether the article names were mapped perfectly with comments

In [24]:
loaded_comments_df[loaded_comments_df.articleName.isna()]

Unnamed: 0,comment,label,articleName
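
The empty result above means every labelled comment found an exact match in the dictionary. For contrast, the sketch below shows what a mismatch would look like: `dict.get` returns `None` when the comment text is not a key, and that row then shows up in the `isna()` filter. Both frames are made-up stand-ins, not the notebook's data.

```python
import pandas as pd

# Hypothetical stand-ins for the full comment table and the labelled subset
all_comments_df = pd.DataFrame({
    'comment': ['nice', 'thanks a lot'],
    'articleName': ['Article One', 'Article Two'],
})
labelled_df = pd.DataFrame({
    'comment': ['thanks a lot', 'edited by hand'],
    'label': [0.0, 1.0],
})

# Map each comment text to the article it was posted under
comment_to_article = dict(zip(all_comments_df['comment'].astype(str),
                              all_comments_df['articleName']))

# dict.get returns None when the comment text has no exact match
labelled_df['articleName'] = [comment_to_article.get(c) for c in labelled_df['comment']]

# Rows whose comment could not be mapped back to an article
unmatched_df = labelled_df[labelled_df['articleName'].isna()]
print(unmatched_df)
```

Because the match is on the exact string, any change made to a comment during labelling (trimmed whitespace, altered line breaks) would leave it unmatched.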


In [25]:
loaded_comments_df.head()

Unnamed: 0,comment,label,articleName
0,"Thank you for this tutorial , it is simple to ...",0.0,Python List Comprehension Tutorial
1,Good,0.0,Python List Comprehension Tutorial
2,In the 'filter' section the given code returns...,0.0,Python List Comprehension Tutorial
3,"Hi there! Thanks for the tutorial, much appre...",0.0,Python List Comprehension Tutorial
4,It's lovely!!!!easy to understand.........than...,0.0,Python List Comprehension Tutorial


In [26]:
loaded_comments_df.to_csv('GoodData/final_data_extended_labeledPhase1.csv', index=False)

### Taking a look at the comments grouped by article names

In [44]:
loaded_comments_df.groupby('articleName')['comment'].count()\
        .sort_values(ascending=False)\
        .reset_index()\
        .head(10)

Unnamed: 0,articleName,comment
0,Python Excel Tutorial: The Definitive Guide,79
1,TensorFlow Tutorial For Beginners,53
2,Choosing R or Python for Data Analysis? An Inf...,53
3,Stock Market Predictions with LSTM in Python,49
4,Python For Finance: Algorithmic Trading,48
5,AWS EC2 Tutorial For Beginners,45
6,Tidy Sentiment Analysis in R,41
7,Web Scraping in R: rvest Tutorial,37
8,Jupyter Notebook Tutorial: The Definitive Guide,36
9,Lyric Analysis with NLP & Machine Learning with R,36
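
The same group-count-sort chain can be checked on a small made-up frame (the data below is an assumption for illustration). `count()` tallies the non-null comments per article, and `reset_index()` turns the article names back into a regular column.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'comment': ['a', 'b', 'c', 'd'],
    'articleName': ['Article One', 'Article Two', 'Article One', 'Article One'],
})

# Count comments per article, most commented first
top_articles = (
    toy_df.groupby('articleName')['comment']
    .count()
    .sort_values(ascending=False)
    .reset_index()
)
print(top_articles)
```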
