Some general comments:

* Several Machine Learning algorithms were tried like Decision Trees, Random Forest, Gradient Boosted Trees and Logistic   Regression.


* Out of these algorithms, logistic regression turned out to be the best in accuracy as well as time taken.


* 'authors' and 'headline' turned out to be very useful features whereas 'date' and 'short_description' did not, from 'date' variable, features like month, day of week, day of month and year were extracted but they did not prove useful.


* I think more and cleaner data will be required for some categories like 'ARTS', 'EDUCATION' etc. as content for different categories at times overlaps resulting in low accuracy. For example, 'New Yorker Cover Puts Trump 'In The Hole' After 'Racist' Comment', this headline does not look like it belongs to 'ARTS & CULTURE' category but according to data it does.


* Categories which were very infrequent have been merged into one category called 'OTHER NEWS', if rows with 'OTHER NEWS' are kept, then test set accuracy is 70.91 % and if these rows are removed, then test set accuracy is 75.34 %.

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from Generic_Functions_Huffpost import model_fitting_and_get_training_accuracy, get_test_accuracy, split_data, \
    preprocess_comments, generate_features

# Please change the path accordingly
df = pd.read_json('C:/Users/Dell/Desktop/News_Category_Dataset.json', lines=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124989 entries, 0 to 124988
Data columns (total 6 columns):
authors              124989 non-null object
category             124989 non-null object
date                 124989 non-null datetime64[ns]
headline             124989 non-null object
link                 124989 non-null object
short_description    124989 non-null object
dtypes: datetime64[ns](1), object(5)
memory usage: 5.7+ MB


In [2]:
# These variables have not been used as features so they have dropped
df = df.drop(['date', 'link', 'short_description'], axis=1)

df['authors'] = df['authors'].str.strip()
df['category'] = df['category'].str.strip()
df['headline'] = df['headline'].str.strip()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124989 entries, 0 to 124988
Data columns (total 3 columns):
authors     124989 non-null object
category    124989 non-null object
headline    124989 non-null object
dtypes: object(3)
memory usage: 2.9+ MB


In [3]:
import numpy as np

# Many of the 'authors' variable's rows were empty, when further investigated, it turned out that these articles were cited to 
# Reuters 
df['authors'].replace('', 'Reuters', inplace=True)


df['category'].replace('', np.nan, inplace=True)

# 'headline' variable is an important feature so if it is empty it is of no use so deleted the empty one's
df['headline'].replace('', np.nan, inplace=True)

df = df.dropna()

df = df.drop_duplicates()

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 124612 entries, 0 to 124988
Data columns (total 3 columns):
authors     124612 non-null object
category    124612 non-null object
headline    124612 non-null object
dtypes: object(3)
memory usage: 3.8+ MB


In [4]:
df['category'].value_counts()

POLITICS          32621
ENTERTAINMENT     14241
HEALTHY LIVING     6686
QUEER VOICES       4989
BUSINESS           4246
SPORTS             4166
COMEDY             3962
PARENTS            3893
BLACK VOICES       3858
THE WORLDPOST      3662
WOMEN              3379
CRIME              2890
MEDIA              2812
WEIRD NEWS         2670
GREEN              2617
IMPACT             2602
WORLDPOST          2578
RELIGION           2548
STYLE              2246
WORLD NEWS         2174
TRAVEL             2143
TASTE              2095
ARTS               1509
FIFTY              1401
GOOD NEWS          1398
SCIENCE            1381
ARTS & CULTURE     1338
TECH               1230
COLLEGE            1144
LATINO VOICES      1129
EDUCATION          1004
Name: category, dtype: int64

In [5]:
# Categories with similar name or having different name but same meaning were clubbed together, foe example 'COLLEGE' and 
# 'EDUCATION' have been clubbed together into 'EDUCATION'
df['category'] = df['category'].replace('WORLDPOST', 'THE WORLDPOST')
df['category'] = df['category'].replace('THE WORLDPOST', 'WORLD NEWS')
df['category'] = df['category'].replace('COLLEGE', 'EDUCATION')
df['category'] = df['category'].replace('TECH', 'SCI & TECH')
df['category'] = df['category'].replace('SCIENCE', 'SCI & TECH')
df['category'] = df['category'].replace('ARTS', 'ARTS & CULTURE')
df['category'] = df['category'].replace('QUEER VOICES', 'MINORITY VOICES')
df['category'] = df['category'].replace('BLACK VOICES', 'MINORITY VOICES')
df['category'] = df['category'].replace('LATINO VOICES', 'MINORITY VOICES')


# Many categories which were very small in number and did not seem to be different from others were clubbed together 
# in 'OTHER NEWS' category
df.loc[~df['category'].isin(['POLITICS', 'ENTERTAINMENT', 'MINORITY VOICES', 'WORLD NEWS', 'HEALTHY LIVING', 'BUSINESS', 
                             'SPORTS', 'COMEDY', 'PARENTS', 'WOMEN', 'ARTS & CULTURE', 'CRIME', 'SCI & TECH', 'RELIGION', 
                             'EDUCATION']), 'category'] = 'OTHER NEWS'


# Please uncomment the below line to delete the rows with category 'OTHER NEWS'
# df = df.loc[~df['category'].isin(['OTHER NEWS']), :]

In [6]:
df['category'].value_counts()

POLITICS           32621
OTHER NEWS         19984
ENTERTAINMENT      14241
MINORITY VOICES     9976
WORLD NEWS          8414
HEALTHY LIVING      6686
BUSINESS            4246
SPORTS              4166
COMEDY              3962
PARENTS             3893
WOMEN               3379
CRIME               2890
ARTS & CULTURE      2847
SCI & TECH          2611
RELIGION            2548
EDUCATION           2148
Name: category, dtype: int64

In [7]:
le = LabelEncoder()
df_category = le.fit_transform(df['category'])

categories = list(df['category'].unique())

# combining 'authors' and 'headline' variables into a single variable so that tf-idf vectors of them can be computed together
df_author_headline = preprocess_comments(df['authors'] + ' ' + df['headline'])

In [8]:
x_train, x_test, y_train, y_test = split_data(df_author_headline, df_category, test_size=0.20)

train_tf_idf, tf_idf_vec, feature_names = generate_features(x_train)

In [9]:
lr_predictions_train, lr_accuracy_train, lr_confusion_matrix_train_category, lr_classification_report_train_category, lr_classifier = model_fitting_and_get_training_accuracy(LogisticRegression, train_tf_idf, y_train, categories, random_state=0, C=0.4)

lr_predictions_test, lr_accuracy_test, lr_confusion_matrix_test_category, lr_classification_report_test_category = get_test_accuracy(tf_idf_vec, x_test, lr_classifier, y_test, categories)

In [10]:
print('Train set accuracy in % :', lr_accuracy_train*100)
print('Test set accuracy in % :', lr_accuracy_test*100)

Train set accuracy in % : 74.34421049463833
Test set accuracy in % : 70.91441640251976


In [11]:
print(lr_classification_report_test_category)

                 precision    recall  f1-score   support

          CRIME       0.86      0.55      0.67       595
  ENTERTAINMENT       0.67      0.45      0.54       838
     WORLD NEWS       0.77      0.60      0.68       795
     OTHER NEWS       0.63      0.44      0.52       575
       POLITICS       0.68      0.49      0.57       435
MINORITY VOICES       0.79      0.78      0.79      2850
          WOMEN       0.69      0.65      0.67      1363
         COMEDY       0.81      0.64      0.72      2058
         SPORTS       0.57      0.75      0.64      3990
       BUSINESS       0.73      0.66      0.69       746
     SCI & TECH       0.74      0.89      0.81      6423
       RELIGION       0.77      0.54      0.63       504
      EDUCATION       0.83      0.36      0.50       520
        PARENTS       0.80      0.64      0.71       793
 ARTS & CULTURE       0.66      0.34      0.45       691
 HEALTHY LIVING       0.73      0.68      0.70      1747

    avg / total       0.72   

In [12]:
print(lr_classification_report_train_category)

                 precision    recall  f1-score   support

          CRIME       0.89      0.62      0.73      2252
  ENTERTAINMENT       0.75      0.49      0.59      3408
     WORLD NEWS       0.81      0.62      0.70      3167
     OTHER NEWS       0.72      0.50      0.59      2315
       POLITICS       0.71      0.51      0.59      1713
MINORITY VOICES       0.80      0.81      0.80     11391
          WOMEN       0.74      0.67      0.71      5323
         COMEDY       0.83      0.69      0.75      7918
         SPORTS       0.61      0.79      0.69     15994
       BUSINESS       0.78      0.69      0.73      3147
     SCI & TECH       0.76      0.91      0.83     26198
       RELIGION       0.80      0.56      0.66      2044
      EDUCATION       0.86      0.42      0.56      2091
        PARENTS       0.85      0.68      0.75      3373
 ARTS & CULTURE       0.72      0.40      0.51      2688
 HEALTHY LIVING       0.77      0.70      0.73      6667

    avg / total       0.75   