# BBC Dataset task

*The author: Sofia Shchipinskaya*

The task: classify each news article into one of 5 categories based on its text.

1. Choose and justify a measure of test accuracy
2. Build a data processing and classification pipeline
3. Tune your model

In [1]:
# import the required libraries
import os

import numpy as np
import pandas as pd
from sklearn import metrics
# sklearn.grid_search was removed in scikit-learn 0.20;
# model_selection provides the same GridSearchCV class
from sklearn import model_selection as grid_search
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC



## Data preparation

In [2]:
data_folder = "D:/Рабочий стол/bbc"
folders = ["business", "entertainment", "politics", "sport", "tech"]

os.chdir(data_folder)

x = []  # article texts
y = []  # category name of each article (taken from its folder)

for folder in folders:
    for text_file in os.listdir(folder):
        file_path = os.path.join(folder, text_file)
        with open(file_path) as f:
            lines = f.readlines()
        # join the lines of one article into a single string
        x.append(' '.join(lines))
        y.append(folder)

data = {'news': x, 'type': y}
df = pd.DataFrame(data)
# save a copy of the assembled dataset in the parent directory
df.to_csv('../dataset.csv', index=False)

In [3]:
# look at the first 5 articles
df.head(5)

| | news | type |
|---|---|---|
| 0 | Ad sales boost Time Warner profit\n \n Quarter... | business |
| 1 | Dollar gains on Greenspan speech\n \n The doll... | business |
| 2 | Yukos unit buyer faces loan claim\n \n The own... | business |
| 3 | High fuel prices hit BA's profits\n \n British... | business |
| 4 | Pernod takeover talk lifts Domecq\n \n Shares ... | business |


In [4]:
# looking through the data
df.groupby('type').describe()

| type | news count | news unique | news top | news freq |
|---|---|---|---|---|
| business | 510 | 503 | SEC to rethink post-Enron rules\n \n The US st... | 2 |
| entertainment | 386 | 369 | Bookmakers back Aviator for Oscar\n \n The Avi... | 2 |
| politics | 417 | 403 | Brown outlines third term vision\n \n Gordon B... | 2 |
| sport | 511 | 505 | Ferguson rues failure to cut gap\n \n Boss Sir... | 2 |
| tech | 401 | 347 | Blind student 'hears in colour'\n \n A blind s... | 2 |


In [5]:
# share of each class in the dataset
df['type'].value_counts(normalize=True)

sport            0.229663
business         0.229213
politics         0.187416
tech             0.180225
entertainment    0.173483
Name: type, dtype: float64

#### Comment:
The classes are reasonably balanced: each of the 5 categories holds between about 17% and 23% of the articles, so there is no serious class imbalance problem.

In [6]:
# encode the category name as an integer code
df['category'] = df['type'].factorize()[0]
df.tail(5)

| | news | type | category |
|---|---|---|---|
| 2220 | BT program to beat dialler scams\n \n BT is in... | tech | 4 |
| 2221 | Spam e-mails tempt net shoppers\n \n Computer ... | tech | 4 |
| 2222 | Be careful how you code\n \n A new European di... | tech | 4 |
| 2223 | US cyber security chief resigns\n \n The man m... | tech | 4 |
| 2224 | Losing yourself in online gaming\n \n Online r... | tech | 4 |
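
The integer codes produced by `factorize` follow the order in which the labels first appear, which is why the classification reports further down show `0` to `4` rather than category names. The sketch below shows one way the mapping could be kept. It uses a small hypothetical frame (`toy_df`) instead of the real data, and `code_to_label` is an illustrative name, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame; only the 'type' column matters here
toy_df = pd.DataFrame({'type': ['business', 'business', 'sport', 'tech', 'sport']})

# factorize returns the integer codes and the unique labels in order of first appearance
codes, labels = pd.factorize(toy_df['type'])
toy_df['category'] = codes

# code -> label lookup, useful when reading per-class results
code_to_label = dict(enumerate(labels))
print(code_to_label)
```

With the real data, `list(labels)` could be passed as `target_names` to `classification_report` so that the report prints category names instead of codes.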


In [7]:
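# vectorizing: TF-IDF on unigrams and bigrams, dropping English stop words
# and terms that occur in fewer than 5 documents
vect = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1',
                       ngram_range=(1, 2), stop_words='english')
X = vect.fit_transform(df.news).toarray()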
Y = df.category
X.shape

(2225, 14454)

In [8]:
# splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=42)
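
In the cells above, the TF-IDF vocabulary and IDF weights are fitted on all 2225 documents before the split, so the test documents influence the features. One way to avoid this is to split the raw text first and wrap the vectorizer and the classifier in a scikit-learn `Pipeline`. The sketch below illustrates the idea on a tiny hypothetical corpus (`toy_texts`, `toy_labels`) with relaxed vectorizer settings; it is an assumption about how the pipeline could look, not the notebook's actual method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; the real notebook would pass df.news and df.category
toy_texts = [
    "shares rise as profit grows", "bank reports profit and shares fall",
    "market shares and bank profit", "profit warning hits bank shares",
    "team wins the cup final", "striker scores in cup match",
    "coach praises team after match", "team loses final match",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first, so the test documents never reach the vectorizer's fit
txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels)

# min_df is left at its default here because the toy corpus is tiny
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2), stop_words='english')),
    ('clf', LinearSVC()),
])
pipe.fit(txt_train, lab_train)   # fits TF-IDF on the training text only
pred = pipe.predict(txt_test)    # transforms test text with the training vocabulary
print(pred)
```

Passing such a pipeline to a grid search would also refit the vectorizer inside each cross-validation fold.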

## Building models

#### Comment:
I decided to run the four models presented below.

As the measure of test accuracy I chose the F1-score, the harmonic mean of precision and recall. It is computed per class, so it shows how well a model distinguishes each of the 5 classes. Class imbalance does not appear to be a problem here, but the F1-score would remain informative if it were.

$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
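
To make the formula concrete, the sketch below computes per-class precision, recall and F1 from true and predicted labels, then the support-weighted average (the kind of value the `f1_weighted` scorer reports). The labels are made up for illustration and the helper name `per_class_f1` is hypothetical.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each label in y_true."""
    scores = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1, tp + fn)
    return scores


# Made-up labels for three classes
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 0]

scores = per_class_f1(y_true, y_pred)

# Weighted average: each class's F1 counts in proportion to its support
total = sum(s[3] for s in scores.values())
weighted_f1 = sum(s[2] * s[3] for s in scores.values()) / total

for label, (p, r, f1, support) in scores.items():
    print(label, round(p, 2), round(r, 2), round(f1, 2), support)
print("weighted F1:", round(weighted_f1, 3))
```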

### Multinomial Naive Bayes

In [9]:
# Multinomial
Multi = MultinomialNB()
Multi.fit(X_train, y_train)
Multi_pred = Multi.predict(X_test)
print(classification_report(y_test, Multi_pred))

             precision    recall  f1-score   support

          0       0.94      0.96      0.95       115
          1       0.99      0.92      0.95        72
          2       0.93      0.97      0.95        76
          3       0.99      0.99      0.99       102
          4       0.96      0.95      0.96        80

avg / total       0.96      0.96      0.96       445



### Logistic regression

In [10]:
# Logistic regression
LR = LogisticRegression(random_state=0)
LR.fit(X_train, y_train)
LR_pred = LR.predict(X_test)
print(classification_report(y_test, LR_pred))

             precision    recall  f1-score   support

          0       0.95      0.97      0.96       115
          1       0.99      0.96      0.97        72
          2       0.96      0.96      0.96        76
          3       0.99      0.99      0.99       102
          4       0.97      0.96      0.97        80

avg / total       0.97      0.97      0.97       445



### Random forest

In [11]:
# Random forest
RF = RandomForestClassifier(n_estimators=50, max_depth=70)
RF.fit(X_train, y_train)
RF_pred = RF.predict(X_test)
print(classification_report(y_test, RF_pred))

             precision    recall  f1-score   support

          0       0.97      0.97      0.97       115
          1       0.97      0.96      0.97        72
          2       0.96      0.96      0.96        76
          3       0.97      1.00      0.99       102
          4       0.97      0.95      0.96        80

avg / total       0.97      0.97      0.97       445



### Linear Support Vector Machine

In [12]:
# SVC
SVC = LinearSVC()
SVC.fit(X_train, y_train)
SVC_pred = SVC.predict(X_test)
print(classification_report(y_test, SVC_pred))

             precision    recall  f1-score   support

          0       0.95      0.97      0.96       115
          1       0.99      0.96      0.97        72
          2       0.96      0.97      0.97        76
          3       1.00      0.99      1.00       102
          4       0.97      0.96      0.97        80

avg / total       0.97      0.97      0.97       445



#### Comments
All four models reach a high F1-score on the test set (0.96 to 0.97 on average). For further analysis I chose two of the top-scoring models: logistic regression (F1-score = 0.97) and SVC (F1-score = 0.97).
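
Because three of the four models tie at 0.97 on a single 445-document test split, a cross-validated estimate can help separate them. The sketch below is an assumption about how that could be done, not part of the original analysis. It uses synthetic data from `make_classification` (named `X_demo`, `y_demo` here) in place of the TF-IDF matrix.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic 5-class data standing in for X_train / y_train
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=5, random_state=0)

candidates = {
    'logistic regression': LogisticRegression(max_iter=1000, random_state=0),
    'linear SVC': LinearSVC(random_state=0),
}

cv_means = {}
for name, model in candidates.items():
    # 5-fold cross-validation scored with the same metric used for tuning
    fold_scores = cross_val_score(model, X_demo, y_demo, scoring='f1_weighted', cv=5)
    cv_means[name] = fold_scores.mean()
    print(f"{name}: mean F1 = {fold_scores.mean():.3f} (std {fold_scores.std():.3f})")
```

Reporting the standard deviation across folds alongside the mean shows whether a difference between two models is larger than the fold-to-fold noise.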

## Tuning models

In [13]:
# logistic regression tuning
LR_parameters = {'penalty': ['l1', 'l2'],'C': [0.0001, 0.001, 0.01, 1, 100]}
LR_tun = grid_search.GridSearchCV(LR, LR_parameters, scoring = 'f1_weighted', cv = 8)
LR_tun.fit(X_train, y_train)
LR_tun.best_estimator_
print("Best LR parameters:", LR_tun.best_params_, "Best score:", LR_tun.best_score_)

Best LR parameters: {'C': 100, 'penalty': 'l2'} Best score: 0.9842339306011278


In [14]:
# SVC tuning
SVC_parameters = {'multi_class':['ovr','crammer_singer'],'C': [0.0001, 0.001, 0.01, 1, 100]}
SVC_tun = grid_search.GridSearchCV(SVC, SVC_parameters, scoring = 'f1_weighted', cv = 10)
SVC_tun.fit(X_train, y_train)
SVC_tun.best_estimator_
print("Best SVC parameters:", SVC_tun.best_params_, "Best score:", SVC_tun.best_score_)

Best SVC parameters: {'C': 100, 'multi_class': 'ovr'} Best score: 0.9853300180480425


#### Comments
In this work the texts were vectorized with TF-IDF and four different models were built. All models reach a good F1-score (0.96 or higher on the test set); logistic regression and SVC were selected as the best-performing models.

Tuning gave higher F1-scores, measured as the mean cross-validated weighted F1 on the training set.
For logistic regression, F1-score = 0.9842 with the parameters {'C': 100, 'penalty': 'l2'}.
For SVC, F1-score = 0.9853 with the parameters {'C': 100, 'multi_class': 'ovr'}.
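
The `best_score_` of a grid search is a mean cross-validated score on the training data, whereas the 0.97 values reported earlier come from the held-out test set, so the two are not directly comparable. A hedged sketch of how the tuned estimator could be scored on the test split follows. It again uses synthetic data (`X_demo`, `y_demo`) rather than the BBC features, and a shortened parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features and category codes
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=10, n_classes=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42)

search = GridSearchCV(
    LinearSVC(random_state=0),
    {'C': [0.01, 1, 100]},
    scoring='f1_weighted',
    cv=5,
)
search.fit(X_tr, y_tr)

# best_estimator_ is refit on the whole training set (refit=True by default)
test_pred = search.best_estimator_.predict(X_te)
test_f1 = f1_score(y_te, test_pred, average='weighted')
print("CV score:", round(search.best_score_, 3), "Test score:", round(test_f1, 3))
```

Comparing the tuned model's test score with the untuned 0.97 would show whether tuning actually improved generalization.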