# About Dataset


The dataset covers 7 news categories: economy, politics, life, technology, magazine, health, and sport. I collected the headlines from Turkish news websites such as Mynet and Milliyet.
There are 600 headlines for each label, so the dataset contains 4,200 headlines in total.

# Loading Data

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("Turkish-HeadLines.csv")
df.head()

Unnamed: 0,HABERLER,ETIKET
0,TÜİK verilerine göre sanayi ciro endeksi Ağust...,Ekonomi
1,Piyasa güne eksi rezervde başladı,Ekonomi
2,"Citigroup, Deutsche Bank ve HSBC Libor manipül...",Ekonomi
3,Gelişen piyasa yatırımcılarını en fazla 'Fed' ...,Ekonomi
4,Bitcoin fiyatında yükseliş hız kesmiyor,Ekonomi


As we can see, the data contains only strings, so to classify the headlines we need a way to capture similarities between strings.


In [2]:
data = df.values

# Splitting Data

In [3]:
x = data[:,0]
x

array(['TÜİK verilerine göre sanayi ciro endeksi Ağustos ayında bir önceki yılın aynı ayına göre %26,6 arttı.',
       'Piyasa güne eksi rezervde başladı',
       'Citigroup, Deutsche Bank ve HSBC Libor manipülasyonu davasında 132 milyon dolar ödemeyi kabul ettiler.',
       ...,
       'Konak ilçesindeki operasyonda 55 gram esrar, 55 uyuşturucu hap ile 1 ruhsatsız tabanca ele geçirildi, 1 kişi tutuklandı',
       "Siirt ve Manisa'da düzenlenen operasyonda gözaltına alınan eski 2 kaymakam ve savcı tutuklandı",
       "Denizli'de Kaçak Sigara Operasyonu: 13 Gözaltı"], dtype=object)

In [4]:
target_data = data[:,-1]
target_data

array(['Ekonomi', 'Ekonomi', 'Ekonomi', ..., 'Yaşam', 'Yaşam', 'Yaşam'],
      dtype=object)

# Converting target labels from strings to integers


In [5]:
target_set = {s for s in target_data}
target_set

{'Ekonomi', 'Magazin', 'Sağlık', 'Siyaset', 'Spor', 'Teknoloji', 'Yaşam'}

In [6]:
target_id = {}
i=0
for s in target_set:
    target_id[s] = i
    i+=1
target_id

{'Teknoloji': 0,
 'Ekonomi': 1,
 'Spor': 2,
 'Sağlık': 3,
 'Magazin': 4,
 'Yaşam': 5,
 'Siyaset': 6}

In [7]:
y = [target_id[s] for s in target_data]
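
Because the ids above come from iterating over a set, and Python randomizes string hashing between interpreter runs by default, the label-to-id assignment can differ from one run to the next. The following is a minimal sketch of a reproducible alternative, assuming alphabetical label order is acceptable. The names `labels`, `label_to_id`, and `y_sorted` are illustrative and are not used elsewhere in this notebook.

```python
# Hypothetical sample standing in for target_data.
labels = ['Ekonomi', 'Spor', 'Yaşam', 'Ekonomi', 'Spor']

# Sorting the unique labels makes the label -> id mapping reproducible.
label_to_id = {label: idx for idx, label in enumerate(sorted(set(labels)))}

# Encode every label with its integer id.
y_sorted = [label_to_id[label] for label in labels]

print(label_to_id)
print(y_sorted)
```

With a fixed mapping, saved predictions and reports stay comparable across sessions.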

In [8]:
# 25% test, 75% train; stratify=y keeps the class proportions equal in both splits
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=37, stratify=y)
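
To illustrate what `stratify` does, here is a small sketch on made-up data (the `toy_x` and `toy_y` names and their contents are assumptions for the example, not part of the dataset). It counts the labels on each side of the split to show that every class keeps the same share.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 3 classes with 8 samples each (hypothetical).
toy_x = [f"headline {i}" for i in range(24)]
toy_y = [i % 3 for i in range(24)]

# Same split settings as the cell above, applied to the toy data.
_, _, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.25, random_state=37, stratify=toy_y
)

# Each class should contribute 6 training and 2 test samples.
train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)
print(train_counts)
print(test_counts)
```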


# Tokenizing text
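Text preprocessing, tokenizing and filtering of stopwords are all supported by CountVectorizer, which builds a dictionary of features and transforms documents to feature vectors. Note that with the default settings used here (`stop_words=None`), no stopwords are actually removed: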

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
x_train_counts = count_vect.fit_transform(x_train)
x_train_counts.shape

(3150, 13044)
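
To make the shape above concrete, the following sketch runs `CountVectorizer` on two made-up documents (the `toy_docs` strings are assumptions for illustration). Each row of the result is a document and each column is one vocabulary word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical mini-documents.
toy_docs = ["altın fiyatları arttı", "altın altın düştü"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# Vocabulary words in column order (columns follow sorted word order).
toy_vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(toy_vocab)

# Dense view of the sparse count matrix: one row per document.
toy_dense = toy_counts.toarray()
print(toy_dense)
```

In the notebook, the same logic yields 3150 rows (training headlines) and 13044 columns (distinct tokens).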

# From occurrences to frequencies

Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.

In [10]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(x_train_counts)
x_train_tf = tf_transformer.transform(x_train_counts)
x_train_tf.shape

(3150, 13044)
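
One detail worth noting: `TfidfTransformer` defaults to `norm='l2'`, so the cell above scales each row to unit Euclidean length rather than dividing by the total word count. The sketch below, on a made-up count matrix (`toy_counts` is an assumption for illustration), contrasts that default with `norm='l1'`, which matches the "divide by the total number of words" description.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 2 documents, 3 vocabulary words.
toy_counts = np.array([[3, 1, 0],
                       [1, 0, 1]])

# norm='l1' divides each row by its total count: plain term frequency.
tf_l1 = TfidfTransformer(use_idf=False, norm='l1').fit_transform(toy_counts).toarray()

# Default norm='l2' divides each row by its Euclidean length instead.
tf_l2 = TfidfTransformer(use_idf=False).fit_transform(toy_counts).toarray()

print(tf_l1)
print(tf_l2)
```

Either normalization removes the document-length effect; they differ only in how the rows are scaled.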

Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”.

In [11]:
tfidf_transformer = TfidfTransformer()
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)
x_train_tfidf.shape

(3150, 13044)
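
As a rough check of how the downscaling works, the sketch below recomputes the idf weights by hand on a made-up count matrix and compares them with what `TfidfTransformer` learns. It assumes the default `smooth_idf=True`, for which scikit-learn uses idf(t) = ln((1 + n) / (1 + df(t))) + 1; the `toy_counts` matrix is illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: word 0 appears in all 3 documents, word 1 in only one.
toy_counts = np.array([[1, 1],
                       [1, 0],
                       [1, 0]])

transformer = TfidfTransformer().fit(toy_counts)

# Manual idf with smoothing: ln((1 + n) / (1 + df)) + 1.
n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(transformer.idf_)
print(manual_idf)
```

The word that occurs in every document gets the smallest weight, while the rarer word is weighted more heavily.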

# Starting with Naive Bayes
Let’s start with a naive Bayes classifier, which provides a nice baseline for this task.

In [12]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(x_train_tfidf, y_train)

Let's see how that works on two new headlines:

In [13]:
ex_test_data = np.array(["Altın fiyatlarında Artış!", "Olimpiyatlarda Sporcumuz Gümüş Madalya Aldı"])
ex_test_data = ex_test_data.astype('object')
x_new_counts = count_vect.transform(ex_test_data)
x_new_tfidf = tfidf_transformer.transform(x_new_counts)
predicted = clf.predict(x_new_tfidf)
predicted

array([1, 2])

Getting the label names back from the id dictionary:

In [14]:
def get_key(val):
    # Reverse lookup: return the label name whose id equals val
    for key, value in target_id.items():
        if val == value:
            return key

for i in predicted:
    print(get_key(i))

Ekonomi
Spor


Works pretty well.
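
The reverse lookup above scans the dictionary on every call. A possible alternative, sketched here with a made-up mapping (`example_target_id` and `example_predicted` are assumptions that only mirror the shape of the real objects), is to invert the dictionary once and then index it directly.

```python
# Hypothetical mapping mirroring the shape of target_id.
example_target_id = {'Teknoloji': 0, 'Ekonomi': 1, 'Spor': 2}

# Invert once: id -> label.
id_to_label = {idx: label for label, idx in example_target_id.items()}

# Hypothetical predictions, decoded with a direct lookup.
example_predicted = [1, 2]
decoded = [id_to_label[i] for i in example_predicted]
print(decoded)
```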

# Using Pipeline


In order to make the vectorizer => transformer => classifier easier to work with, scikit-learn provides a Pipeline class that behaves like a compound classifier:

In [15]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

In [16]:
text_clf.fit(x_train, y_train)
y_pred_naive = text_clf.predict(x_test)
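
To show that the pipeline is just the earlier manual steps chained together, the sketch below fits both variants on a tiny made-up corpus and compares their predictions. All `toy_*` data is an assumption for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical mini-corpus: class 0 = economy, class 1 = sport.
toy_train = ["altın fiyatları yükseldi", "borsa güne düşüşle başladı",
             "takım maçı kazandı", "sporcu madalya aldı"]
toy_labels = [0, 0, 1, 1]
toy_test = ["altın düştü", "takım madalya kazandı"]

# Variant 1: the pipeline handles every step internally.
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
pipe.fit(toy_train, toy_labels)
pipe_pred = pipe.predict(toy_test)

# Variant 2: the same steps applied by hand.
vect = CountVectorizer()
tfidf = TfidfTransformer()
nb = MultinomialNB().fit(tfidf.fit_transform(vect.fit_transform(toy_train)), toy_labels)
manual_pred = nb.predict(tfidf.transform(vect.transform(toy_test)))

print(pipe_pred, manual_pred)
```

The main practical benefit is that `predict` on raw text reuses the fitted vectorizer and transformer automatically, so the test data cannot accidentally be transformed with a different vocabulary.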

# Evaluation of the performance on the test set


In [17]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred_naive)

0.9666666666666667

We achieved 96.7% accuracy, which is a strong result for a naive Bayes baseline. Let's see if we can do better with a linear support vector machine (SVM), which is widely regarded as one of the best text classification algorithms (although it's also a bit slower than naive Bayes). We can change the learner by simply plugging a different classifier object into our pipeline:

# Continuing with SGDClassifier

In [18]:
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    # loss='hinge' makes SGDClassifier train a linear SVM
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=37, max_iter=5, tol=None)),
])
text_clf.fit(x_train, y_train)
y_pred_SVM = text_clf.predict(x_test)

In [19]:
accuracy_score(y_test, y_pred_SVM)

0.9657142857142857

We achieved 96.6% accuracy, marginally below the naive Bayes result. Let's look at the performance in more detail:

In [20]:
from sklearn import metrics
# list(target_id) lists the label names in the same order as their integer ids
print(metrics.classification_report(y_test, y_pred_SVM, target_names=list(target_id)))

              precision    recall  f1-score   support

   Teknoloji       0.97      0.93      0.95       150
     Ekonomi       0.99      0.93      0.96       150
        Spor       0.99      0.99      0.99       150
      Sağlık       0.92      0.97      0.94       150
     Magazin       0.99      0.97      0.98       150
       Yaşam       0.98      0.98      0.98       150
     Siyaset       0.94      0.99      0.96       150

    accuracy                           0.97      1050
   macro avg       0.97      0.97      0.97      1050
weighted avg       0.97      0.97      0.97      1050



In [21]:
metrics.confusion_matrix(y_test, y_pred_SVM)

array([[139,   1,   1,   7,   0,   1,   1],
       [  2, 140,   0,   2,   1,   0,   5],
       [  0,   0, 149,   0,   0,   1,   0],
       [  1,   1,   0, 145,   0,   0,   3],
       [  1,   0,   0,   2, 146,   1,   0],
       [  0,   0,   0,   2,   1, 147,   0],
       [  1,   0,   1,   0,   0,   0, 148]], dtype=int64)
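
The summary numbers can be recovered from the confusion matrix itself. The sketch below copies the matrix printed above into a NumPy array (the variable names are illustrative) and derives the accuracy, the number of misclassified headlines, and the per-class recall and precision.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([[139,   1,   1,   7,   0,   1,   1],
               [  2, 140,   0,   2,   1,   0,   5],
               [  0,   0, 149,   0,   0,   1,   0],
               [  1,   1,   0, 145,   0,   0,   3],
               [  1,   0,   0,   2, 146,   1,   0],
               [  0,   0,   0,   2,   1, 147,   0],
               [  1,   0,   1,   0,   0,   0, 148]])

correct = np.trace(cm)                    # diagonal = correctly classified headlines
total = cm.sum()
accuracy = correct / total
n_errors = total - correct

recall = np.diag(cm) / cm.sum(axis=1)     # per true class
precision = np.diag(cm) / cm.sum(axis=0)  # per predicted class

print(accuracy, n_errors)
print(recall.round(2))
print(precision.round(2))
```

Off-diagonal entries show where the confusion is concentrated; for instance, row 0 has 7 headlines assigned to class 3.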

# Confusion Values
Let's see what went wrong in the roughly 3.4% of test headlines (36 of 1,050) that were misclassified:

In [22]:
for i in range(len(y_test)):
    right_class = y_test[i]
    predicted_class = y_pred_SVM[i]
    if (right_class != predicted_class):
        print(x_test[i]+ "\nPredicted Class:" + get_key(predicted_class))
        print("Right Class:" + get_key(right_class) + "\n")


TBMM Sağlık Komisyonu Başkanı Kavuncu, "Elektronik sigaranın içindeki nikotin de en az sigara kadar zararlı. Gençlerimizin buna aldanmaması gerekir" uyarısında bulundu
Predicted Class:Siyaset
Right Class:Sağlık

Bankaların Birleşme, Devir, Bölünme ve Hisse Değişimi Hakkında Yönetmeliği yeniden düzenlendi.
Predicted Class:Sağlık
Right Class:Ekonomi

Eylem Gülçin Kanık Cinayetindeki Gelişmeler Kan Dondurdu
Predicted Class:Sağlık
Right Class:Yaşam

Bitcoinin %1'ine sahip milyarder ikizler
Predicted Class:Teknoloji
Right Class:Ekonomi

 dünyayı etkisine alan ve 150 ülkede 200 binden fazla bilgisayara bulaşan WannaCry virüsü, dikkat edilmesi gereken ve mutlaka önlem alınması gereken bir zararlı yazılım.
Predicted Class:Sağlık
Right Class:Teknoloji

Ünlü Türk markasının Bitcoin planı
Predicted Class:Magazin
Right Class:Ekonomi

General Mobile GM 6 dayanıklılık testi videomuda, elimizde olan 2 tane GM 6 modelini şut bombardımanına tuttuk ve ne kadar dayanıklı olduğunu gözlemledik.
Predicted C