In this notebook I perform some data analysis and evaluate a few popular ML algorithms on the clap prediction task.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
from nltk.corpus import stopwords
import string
import re
%matplotlib inline

First, let's read the data and inspect the column data types.

In [None]:
df = pd.read_csv('../input/articles.csv')
df.dtypes

It seems that the claps column is in the wrong format: it is stored as text (with a 'K' suffix for thousands) rather than as a number. This needs to be fixed.

In [None]:
df.head(5)

In [None]:
df['claps'] = df['claps'].apply(lambda x: int(float(x[:-1]) * 1000) if x[-1] == 'K' else int(x))
df.dtypes
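
The same conversion can be written as a small named function, which is easier to test on its own. This is only a sketch: `parse_claps` is a hypothetical helper name, and it assumes that every value is either a plain integer string or a number followed by 'K'. Unlike the lambda above, it rounds before converting to int, so floating-point noise in the multiplication cannot truncate the result by one.

```python
def parse_claps(value):
    """Convert a clap count such as '1.5K' or '12' to an integer."""
    value = value.strip()
    if value.endswith('K'):
        # '1.5K' -> 1500; round() guards against float noise like 1099.999...
        return int(round(float(value[:-1]) * 1000))
    return int(value)

examples = [parse_claps(v) for v in ['1.5K', '12', '2K']]
```

With a data frame loaded as above, it could be applied as `df['claps'].apply(parse_claps)`.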

Let's check whether there are any NaN values.

In [None]:
df.isnull().any()

There are no NaN values in this data set, so I am moving on to the next step: feature engineering. I will add a few more fields to my pandas data frame: title_len, text_len, title_clean, text_clean, title_clean_len and text_clean_len. I think those fields are self-explanatory. Before cleaning the text data I convert all text to lower case. Finally, I combine the author, title and text fields into a single column, full_text.

In [None]:
df['title_len'] = df['title'].str.len()
df['text_len'] = df['text'].str.len()

df['title'] = df['title'].apply(lambda x: x.lower())
df['text'] = df['text'].apply(lambda x: x.lower())
df['author'] = df['author'].apply(lambda x: x.lower())

# Build the stop-word set once; calling stopwords.words() for every word is very slow.
stop_words = set(stopwords.words('english'))
df['title_clean'] = df['title'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
df['text_clean'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

df['title_clean'] = df['title_clean'].apply(lambda x: re.sub('[' + string.punctuation + '—]', '', x))
df['text_clean'] = df['text_clean'].apply(lambda x: re.sub('[' + string.punctuation + '—]', '', x))

df['title_clean'] = df['title_clean'].apply(lambda x: x.translate(str.maketrans('', '', string.digits)))
df['text_clean'] = df['text_clean'].apply(lambda x: x.translate(str.maketrans('', '', string.digits)))

df['title_clean'] = df['title_clean'].apply(lambda x: re.sub(' +', ' ', x))
df['text_clean'] = df['text_clean'].apply(lambda x: re.sub(' +', ' ', x))

df['title_clean_len'] = df['title_clean'].str.len()
df['text_clean_len'] = df['text_clean'].str.len()

df['full_text'] = df['author'] + ' ' + df['title_clean'] + ' ' + df['text_clean']

df.head(10)
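
The cleaning steps above (lower-casing, stop-word removal, stripping punctuation and digits, collapsing spaces) can also be collected into one function. The following is a minimal sketch, not the code used in this notebook: `clean_text` and `STOP_WORDS` are hypothetical names, the tiny stop-word set stands in for nltk's English list so that the example runs without downloads, and the final `strip()` is an extra step that the cells above do not perform.

```python
import re
import string

# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses nltk's stopwords.words('english').
STOP_WORDS = {'the', 'a', 'an', 'of', 'and', 'to', 'in', 'is'}

# One character class for ASCII punctuation, the em dash (U+2014) and digits.
_STRIP = re.compile('[' + re.escape(string.punctuation) + '\u2014' + string.digits + ']')

def clean_text(text, stop_words=STOP_WORDS):
    """Lower-case, drop stop words, strip punctuation/digits, collapse spaces."""
    words = [w for w in text.lower().split() if w not in stop_words]
    stripped = _STRIP.sub('', ' '.join(words))
    return re.sub(' +', ' ', stripped).strip()

cleaned = clean_text('The 3 Rules of AI \u2014 Explained!')
```

Keeping the steps in one function makes it possible to apply exactly the same cleaning to new, unseen articles later.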

Now I am going to remove the columns that are no longer needed, after which the data will be ready for analysis. Let's do that.

In [None]:
df = df.drop(columns=['link', 'text', 'title', 'title_clean', 'text_clean', 'author'])

df = df.drop_duplicates()

df.describe(include='all')

Looking at the summary, it seems we have some outliers. Let's draw box plots and check them out.

In [None]:
df.boxplot(column=['claps', 
                   'text_len', 
                   'text_clean_len'])
plt.show()

In [None]:
df.boxplot(column=['reading_time', 
                   'title_len', 
                   'title_clean_len'])
plt.show()

Let's analyze the outliers and check whether there are any reasonable grounds to exclude them from the data set. A pair plot of the numeric variables is a good starting point.

In [None]:
sns.pairplot(df[['claps', 
                 'reading_time',
                 'title_len', 
                 'title_clean_len',
                 'text_len', 
                 'text_clean_len']], kind='reg')
plt.show()

The distributions of claps, reading_time, text_len and text_clean_len are positively skewed: most observations are concentrated near small values, with a long tail towards large ones.

What is more, claps (the dependent variable) has only a weak positive linear relationship with every independent variable.

As expected, there is a strong relationship between reading_time and both text_len and text_clean_len. Each raw length is also correlated with its cleaned counterpart: text_len with text_clean_len, and title_len with title_clean_len.

The box plots flag outliers when claps > 18000, text_len > 28000, text_clean_len > 18500, reading_time > 22, title_len > 95 and title_clean_len > 81. Let's look at those data points.
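
Thresholds like these do not have to be read off the plot. By default, a matplotlib/pandas box plot draws its upper whisker no further than Q3 + 1.5 * IQR and marks anything beyond that as an outlier. Here is a minimal sketch of that rule; `upper_whisker_limit` is a hypothetical helper and the series is toy data, not the articles data set.

```python
import pandas as pd

def upper_whisker_limit(series, k=1.5):
    """Upper outlier bound used by a default box plot: Q3 + k * IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return q3 + k * (q3 - q1)

toy = pd.Series([1, 2, 3, 4, 5])          # toy values, Q1 = 2, Q3 = 4
limit = upper_whisker_limit(toy)           # 4 + 1.5 * 2
outliers = toy[toy > limit]                # empty for this toy series
```

On the real data, `upper_whisker_limit(df['claps'])` would give the bound above which claps values are drawn as individual points.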

In [None]:
df[df['claps'] > 18000]

In [None]:
df[df['text_len'] > 28000]

In [None]:
df[df['reading_time'] > 22]

After analysing the data I found no reason to remove the outliers (if I can call them that). They are normal data points that only look like outliers because the data set is small.

For the text features I decided to use TF-IDF.

In [None]:
vectorizer = TfidfVectorizer(max_features=None)
full_text_features = vectorizer.fit_transform(df['full_text'])
full_text_features.shape
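
To make the shape of the TF-IDF output concrete, here is a small sketch on a toy corpus (the three documents are made up for illustration). Each row is one document, each column one vocabulary term, and with the default settings every row is L2-normalized.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    'deep learning for text',
    'learning to rank',
    'text ranking with deep models',
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)   # sparse matrix

n_docs, n_terms = toy_matrix.shape
# Euclidean length of each document vector (1.0 with the default norm='l2').
row_norms = np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))
```

The number of columns equals the size of `toy_vectorizer.vocabulary_`, which is why an unrestricted vocabulary (`max_features=None`) produces a very wide matrix on real articles.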

The numeric variables are on different scales, so I use StandardScaler to bring them to a common scale (zero mean and unit variance).

In [None]:
scaler = StandardScaler()
num_features = scaler.fit_transform(df[['reading_time', 
                                        'title_len',
                                        'text_len',
                                        'title_clean_len',
                                        'text_clean_len']])
num_features.shape
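
For reference, StandardScaler replaces each value x by (x - mean) / std, computed per column with the population standard deviation. A short sketch on made-up numbers shows that the manual formula and the scaler agree:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: two columns on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)   # np.std uses ddof=0
scaled = StandardScaler().fit_transform(toy)
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates simply because of its units.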

Finally, the TF-IDF text features are concatenated with the scaled numeric features.

In [None]:
full_text_features = np.concatenate([full_text_features.toarray(), num_features], axis=1)
full_text_features.shape
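
Calling `toarray()` materializes the whole TF-IDF matrix as a dense array, which can take a lot of memory when the vocabulary is unrestricted. A possible alternative, shown here only as a sketch on toy matrices, is to keep everything sparse with `scipy.sparse.hstack`; scikit-learn estimators such as LinearRegression and RandomForestClassifier accept sparse input.

```python
import numpy as np
from scipy import sparse

# Toy stand-ins for the TF-IDF matrix (sparse) and the scaled numeric features (dense).
toy_text = sparse.csr_matrix(np.array([[0.0, 0.5],
                                       [0.7, 0.0]]))
toy_num = np.array([[1.0],
                    [-1.0]])

# Stack column-wise without densifying the text part.
combined = sparse.hstack([toy_text, sparse.csr_matrix(toy_num)]).tocsr()
```

The result has the same rows and the columns of both inputs side by side, just like `np.concatenate(..., axis=1)`.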

Performing train/test split.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(full_text_features, df[['claps']].values, test_size=0.3)
X_train.shape

In [None]:
y_test.shape

Testing linear regression model.

In [None]:
reg = LinearRegression().fit(X_train, y_train)

In [None]:
y_pred = reg.predict(X_test)
y_pred.shape

In [None]:
r2_score(y_test, y_pred)
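
As a reminder of how to read this number: R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations from the mean. A model that always predicts the mean therefore scores exactly 0, and a perfect model scores 1. The sketch below checks this on made-up values; `r2_manual` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_toy = np.array([100.0, 200.0, 300.0, 400.0])       # toy clap counts
mean_baseline = np.full_like(y_toy, y_toy.mean())     # always predict the mean

r2_baseline = r2_manual(y_toy, mean_baseline)         # SS_res == SS_tot
r2_perfect = r2_manual(y_toy, y_toy)                  # SS_res == 0
```

A score near 0 (or below) on the test set thus means the model is no more useful than the constant mean prediction.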

The R² metric shows that it is hard for a linear regression model to learn from so many features (R² = 1 - SS_res/SS_tot = 1 - 1 = 0, which is no better than always predicting the mean number of claps). So I am converting the number of claps to categorical values.

In [None]:
df[['claps']].hist()
plt.show()

Claps are divided into categories:
0 - 10000: rising star
10001 - 20000: star
20001 and above: super star

In [None]:
df['claps_categorical'] = df['claps'].apply(lambda x: 'rising star' if x >= 0 and x <= 10000 else 'star' if x >= 10001 and x <= 20000 else 'super star')
df[['claps', 'claps_categorical']].head(15)
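
The same binning can be expressed with `pd.cut`, which keeps the bin edges and labels in one place. This is a sketch on toy values rather than the notebook's data; the edges are chosen so that the right-closed intervals match the categories above for integer clap counts.

```python
import pandas as pd

toy_claps = pd.Series([0, 10000, 10001, 20000, 20001, 291000])

# Right-closed intervals: (-1, 10000], (10000, 20000], (20000, inf]
bins = [-1, 10000, 20000, float('inf')]
labels = ['rising star', 'star', 'super star']

toy_categories = pd.cut(toy_claps, bins=bins, labels=labels)
```

Calling `toy_categories.value_counts()` afterwards is a quick way to see how many articles fall into each class.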

Performing a new train/test split.

In [None]:
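# Pass the labels as a 1-D array; a column vector triggers a DataConversionWarning in the classifier.
X_train, X_test, y_train, y_test = train_test_split(full_text_features, df['claps_categorical'].values, test_size=0.3)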
X_train.shape

In [None]:
y_test.shape
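
If the categories turn out to be imbalanced, a plain random split can leave the rare classes almost absent from the test set. One option, sketched here on toy labels, is to pass `stratify` so that both parts keep the class proportions, and `random_state` so that the split is reproducible. The 16/4 class counts below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 16 'rising star' and 4 'star' samples.
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array(['rising star'] * 16 + ['star'] * 4)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0)

n_star_test = int((y_te == 'star').sum())   # 25% of the 4 'star' samples
```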

Using Random Forest classifier for classification task.

In [None]:
clf = RandomForestClassifier(n_estimators=1000, max_depth=2, random_state=0)
clf.fit(X_train, y_train)

Predicting claps category.

In [None]:
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

Judging by the accuracy score alone, the model appears to perform well: 85% accuracy.

In [None]:
confusion_matrix(y_test, y_pred, labels=['rising star', 'star', 'super star'])

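However, the confusion matrix shows that only the rising star category was predicted correctly; all samples from the other classes were misclassified. The high accuracy therefore mostly reflects how common the rising star class is, not real predictive power.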
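
This effect is easy to reproduce. In the sketch below the labels are made up (17 rising star, 2 star, 1 super star) and the "model" always predicts the majority class. Accuracy is still 85%, while per-class recall exposes the problem.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

class_names = ['rising star', 'star', 'super star']

# Toy labels, not the notebook's results.
y_true_toy = ['rising star'] * 17 + ['star'] * 2 + ['super star'] * 1
y_pred_toy = ['rising star'] * 20          # always predict the majority class

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
toy_cm = confusion_matrix(y_true_toy, y_pred_toy, labels=class_names)
# Recall per class: fraction of each true class that was predicted correctly.
toy_recall = recall_score(y_true_toy, y_pred_toy, labels=class_names, average=None)
```

For imbalanced classes, per-class recall or a macro-averaged F1 score is usually a more honest summary than accuracy.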

To improve recognition accuracy, more work is needed on feature engineering and on tuning the classifier hyperparameters. What is more, this data set is too small to give good, reliable results.
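
As a starting point for the tuning, here is a minimal sketch of a grid search over two Random Forest hyperparameters. Everything in it is an assumption for illustration: the data is synthetic (`make_classification`), the grid values are arbitrary, and `class_weight='balanced'` with macro-averaged F1 scoring is just one way to stop the search from rewarding majority-class predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced 3-class data standing in for the real features.
X_toy, y_toy = make_classification(
    n_samples=150, n_features=10, n_informative=4,
    n_classes=3, weights=[0.8, 0.1, 0.1], random_state=0)

param_grid = {'n_estimators': [25, 50], 'max_depth': [2, 4]}

search = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=0),
    param_grid,
    scoring='f1_macro',   # treats every class equally
    cv=3)
search.fit(X_toy, y_toy)

best_params = search.best_params_
```

On the real data the same pattern would be run on the training split only, keeping the test split for the final evaluation.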