### Problem Statement
We have a dataset from BBC news with data of about 2000 news articles. The dataset is categorized into five different categories of tech, business, entertainment, politics and sports. The task is to build a model which can predict the category of the news article. This is a multi-class classification problem

The task is accompolished using three key things. First, we clean the textual data using different NLP guidelines, then we apply TF-IDF vectorizer to understand the importance of words present in news article relative to all the other articles. Then we provide these TF-IDF features to different classifiers to see which of these get us the highest accuracy in predicting the category of the news articles.

### Loading Packages

In [123]:
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import svm

### Read Data

In [99]:
df = pd.read_csv('bbc-text.csv')
df.shape

(2225, 2)

In [100]:
df.sample(5)

Unnamed: 0,category,text
615,tech,gadgets galore on show at fair the 2005 consum...
1904,tech,no half measures with half-life 2 could half-l...
734,sport,minister digs in over doping row the belgian s...
1434,sport,bees handed potential man utd tie brentford fa...
431,politics,blair backs pre-election budget tony blair h...


### Perform Data Cleaning
1. Keep only ascii characters
2. Remove numbers from the text
3. Remove stopwords
4. Remove punctuations

In [101]:
def words_cleaning( raw_text ):
    ## non ascii characters
    letters_only = re.sub(r'[^\x00-\x7F]+',' ', raw_text)
    letters_only = re.sub('[0-9]', '', letters_only)
    
    #convert to lower case, split into individual words
    words = [letters_only.lower().strip()]
    
    stops = set(stopwords.words("english"))
    
    # Remove stop words
    meaningful_words = [w for w in words if not w in stops]
    
    return ' '.join(meaningful_words)

In [102]:
df['text'] = df['text'].apply(words_cleaning)
df['text'] = df['text'].str.replace('[^\w\s]','')

In [103]:
df.sample(5)

Unnamed: 0,category,text
609,sport,england given tough sevens draw england will h...
957,sport,scrumhalf williams rejoins bath bath have sign...
1759,business,jarvis sells tube stake to spain shares in eng...
1949,business,huge rush for jet airways shares indian airlin...
282,tech,consumers snub portable video consumers want...


### Convert the textual data to get TF-IDF features
TF-IDF term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus

TF: Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear much more times in long documents than shorter ones. Thus, the term frequency is often divided by the document length (aka. the total number of terms in the document) as a way of normalization: 

<b> TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).</b>

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However it is known that certain terms, such as "is", "of", and "that", may appear a lot of times but have little importance. Thus we need to weigh down the frequent terms while scale up the rare ones, by computing the following: 

<b>IDF(t) = log_e(Total number of documents / Number of documents with term t in it).</b>

<b>TF-IDF = TF * IDF</b>

We apply TF-IDF on the data which converts all words to lowercase, removes stopwords and ignores rare words which appear less than 3% in the complete list of news articles.

In [104]:
tfidf_vec = TfidfVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 1), min_df=0.03)
tfidf_features = tfidf_vec.fit_transform(df['text'])
tfidf_df = pd.DataFrame(tfidf_features.todense(), columns=tfidf_vec.get_feature_names())
tfidf_df.shape

(2225, 988)

In [105]:
tfidf_df.sample(5)

Unnamed: 0,ability,able,access,according,account,accounts,accused,act,action,actor,...,world,worldwide,worth,written,wrong,year,yearold,years,york,young
2155,0.0,0.0,0.0,0.0,0.0,0.0,0.196587,0.097903,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1342,0.0,0.112511,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.112511,0.0,0.0,0.0
1695,0.0,0.0,0.074854,0.0,0.0,0.0,0.0,0.0,0.065477,0.0,...,0.0,0.0,0.0,0.0,0.0,0.108084,0.0,0.044008,0.0,0.0
1014,0.0,0.160883,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1471,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Label Encode the category

In [106]:
tfidf_df['category'] = df['category']
tfidf_df.shape

(2225, 989)

In [107]:
le = preprocessing.LabelEncoder()
tfidf_df['category'] = le.fit_transform(tfidf_df['category'])

In [108]:
tfidf_df['category'].unique()

array([4, 0, 3, 1, 2], dtype=int64)

### Running different classifiers on TF-IDF data to predict the category
1. Split the data into train and test
2. Create a helper function to train the model and calculate the accuracy
3. Train following classifiers -<br>
    1. Decision Tree
    2. Bagging based Random Forest
    3. Boosting based XGBoost
    4. Naive Bayes
    5. Logistic Regression

In [109]:
X_train, X_test, y_train, y_test = train_test_split(tfidf_df.drop(['category'], axis=1), tfidf_df['category'],\
                                                    test_size=0.20, random_state=42)

In [112]:
def analyze_model(model):
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    print(accuracy_score(y_train, y_train_pred))
    print(accuracy_score(y_test, y_test_pred))

### Decision Tree

In [113]:
dt_model = DecisionTreeClassifier(random_state=42)
analyze_model(dt_model)

1.0
0.8202247191011236


### Random Forest

In [114]:
rf_model = RandomForestClassifier(random_state=42)
analyze_model(rf_model)

0.998876404494382
0.8966292134831461


### XGBoost

In [117]:
xgb_model = XGBClassifier(random_state=42)
analyze_model(xgb_model)

0.996629213483146
0.9325842696629213


  if diff:
  if diff:


### Multinomial Naive Bayes

In [119]:
nb_model = MultinomialNB()
analyze_model(nb_model)

0.9691011235955056
0.9303370786516854


### Logistic Regression

In [121]:
log_model = LogisticRegression(random_state=42)
analyze_model(log_model)

0.9882022471910112
0.946067415730337
