# Title: DM1590 Final Project Template

## Authors: Markus Brewitz, Linus Wallin, Saga Jonasson, Vilhelm Norström, Martin Ryberg Laude

---

### Background and motivation

News articles span many subjects, and a category tag tells a reader at a glance which field an article belongs to. Some articles carry self-reported tags, but many have none, and others are tagged inaccurately. Reliable tags help information-seekers judge an article's relevance and filter by interest, which matters more as the volume of available news grows. In this project we use machine learning to assign a category tag to each article based on its title alone, to make searching for relevant information easier.

### Dataset

https://archive.ics.uci.edu/ml/datasets/News+Aggregator

This dataset pairs each news title with a category tag. We use the titles as input and the tags as labels to train and evaluate our models.
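The UCI description of the dataset lists four category codes: `b` (business), `t` (science and technology), `e` (entertainment) and `m` (health). Before training, it can help to check how evenly the titles are spread across these categories. The sketch below does this on a tiny hand-made sample; the five rows and the names `rows`, `sample` and `category_names` are illustrative assumptions, not part of the real data.

```python
import io

import pandas as pd

# Hypothetical rows standing in for newsCorpora.csv (tab-separated, no header).
rows = [
    ["1", "Stocks rally as market opens higher", "http://example.com/1", "Example News", "b", "s1", "example.com", "1"],
    ["2", "Market slides while stocks retreat", "http://example.com/2", "Example News", "b", "s2", "example.com", "2"],
    ["3", "New phone launches with faster chip", "http://example.com/3", "Example News", "t", "s3", "example.com", "3"],
    ["4", "Actor confirms sequel to hit film", "http://example.com/4", "Example News", "e", "s4", "example.com", "4"],
    ["5", "Study links sleep to heart health", "http://example.com/5", "Example News", "m", "s5", "example.com", "5"],
]
sample_tsv = "\n".join("\t".join(row) for row in rows)

column_names = ["id", "title", "url", "publisher", "category", "story", "hostname", "timestamp"]
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", header=None, names=column_names)

# Map the one-letter codes to readable names and compute each category's share.
category_names = {"b": "business", "t": "science and technology", "e": "entertainment", "m": "health"}
category_share = sample["category"].map(category_names).value_counts(normalize=True)
print(category_share)
```

On the real data, the same `value_counts` call on `data.category` would show whether one category dominates, which affects how an accuracy score should be read.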

### Methodology

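We split the titles and their category labels into a training set (70%) and a test set (30%). Each title is converted to a TF-IDF vector, keeping only terms that appear in at least five training titles (`min_df=5`). A logistic regression classifier is trained on these vectors, with its regularization parameter `C` chosen by grid search with 5-fold cross-validation on the training set. The selected model is then scored on the held-out test set.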

---

### Import dataset:

In [75]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# The file has no header row, so the column names are supplied here
column_names = ["id", "title", "url", "publisher", "category", "story", "hostname", "timestamp"]
data = pd.read_csv('NewsAggregatorDataset/newsCorpora.csv', sep='\t', header=None, names=column_names)
corpus = data.title  # the titles are the only input feature

In [84]:
# Splitting the data
corpus_train, corpus_test, y_train, y_test = train_test_split(corpus, data.category, train_size=0.7)

# Vectorizing the train data (the vectorizer is fitted on the training titles only)
vect = TfidfVectorizer(min_df=5)  # TODO: Experiment with the min_df value to find what works best for it
X_train = vect.fit_transform(corpus_train)
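To make the effect of `min_df` concrete: it drops every term that appears in fewer than `min_df` documents. The sketch below compares the vocabulary with and without that filter on four made-up titles (`toy_titles` is an illustrative assumption, not data from the corpus).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Four made-up titles; "google", "new", "phone" and "rise" each occur in two of them.
toy_titles = [
    "Google unveils new phone",
    "Google shares rise",
    "New phone sales rise",
    "Astronaut returns home",
]

# vocabulary_ maps each kept term to its column index; sorting the keys lists the terms.
vocab_all = sorted(TfidfVectorizer(min_df=1).fit(toy_titles).vocabulary_)
vocab_min2 = sorted(TfidfVectorizer(min_df=2).fit(toy_titles).vocabulary_)

print(len(vocab_all), "terms with min_df=1")
print(vocab_min2)  # only terms found in at least two titles remain
```

A higher `min_df` shrinks the feature matrix and removes rare tokens such as station call signs, at the cost of discarding terms that might be informative for a few titles.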

In [85]:
# NOTE: This cell only visualizes the effect of the min_df parameter.
# Nothing later in the notebook depends on it.

# find maximum tf-idf value for each of the features over the training set
max_value = X_train.max(axis=0).toarray().ravel()
sorted_by_tfidf = max_value.argsort()
# get feature names (get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() already returns a NumPy array)
feature_names = vect.get_feature_names_out()

print("Features with lowest tfidf:\n{}".format(
    feature_names[sorted_by_tfidf[:20]]))

print("Features with highest tfidf: \n{}".format(
    feature_names[sorted_by_tfidf[-20:]]))

Features with lowest tfidf:
['wboc' 'delmarvas' 'atv' 'tahoe' 'wxow' 'ktvn' 'synchs' 'whig' '7news'
 'quincy' '13m' 'wsvn' 'gram' 'fha' 'hanna' 'kwwl' 'chili' 'newson6'
 'charlottesville' 'oates']
Features with highest tfidf: 
['aladdin' 'asian' 'thank' 'this' 'defeat' 'hangout' 'alibaba' 'services'
 'caper' 'eclipse' 'astronaut' 'fake' 'religious' 'windows' 'sunday'
 'sports' 'lives' 'location' 'google' 'noah']


In [77]:
# NOTE: Do not re-run this cell if you can avoid it. The grid search runs
# 25 fits (5 values of C x 5 folds) plus a final refit on the full training set.

# Searches for the best value for C for logistic regression
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)

Best cross-validation score: 0.95
Best parameters:  {'C': 10}
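In the cell above, the vectorizer is fitted on the whole training set before cross-validation, so the vocabulary and IDF weights used in each fold have already seen that fold's validation titles. One common way to avoid this is to wrap the vectorizer and classifier in a `Pipeline` and search over both at once, which also covers the `min_df` experiment left as a TODO. The following is a minimal sketch on made-up data, not the configuration used for the results in this notebook; `toy_titles`, `toy_labels`, the grid values and `cv=2` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up titles: four "business" (b) and four "technology" (t) examples.
toy_titles = [
    "Stocks rally as market opens higher",
    "Market slides while stocks retreat",
    "Stocks climb on strong market data",
    "Market steady as stocks hold gains",
    "New phone launches with faster chip",
    "Phone maker unveils new chip",
    "New chip powers latest phone",
    "Phone update brings new chip features",
]
toy_labels = ["b", "b", "b", "b", "t", "t", "t", "t"]

# The vectorizer is now refitted inside every cross-validation fold.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step.
param_grid = {
    "tfidf__min_df": [1, 2],
    "clf__C": [0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(toy_titles, toy_labels)
print("Best parameters:", search.best_params_)
```

With this setup, raw titles are passed to `fit`, `predict` and `score` directly, so no separate `vect.transform` step is needed. Since the best value reported above, `C = 10`, is the largest value in the grid, extending the grid upward (for example to 100) may also be worth trying.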


In [78]:
# Output of the grid search cell above, recorded for reference:
#   Best cross-validation score: 0.95
#   Best parameters:  {'C': 10}

X_test = vect.transform(corpus_test) # Vectorizing the test data
print("Test score: {:.2f}".format(grid.score(X_test, y_test)))

Test score: 0.95
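The test score above is overall accuracy, which does not show which categories get confused with each other. A confusion matrix and per-class report can fill that in. The sketch below uses short hand-written label lists purely to show the calls; `y_true` and `y_pred` here are illustrative assumptions, not predictions from our model.

```python
from sklearn.metrics import classification_report, confusion_matrix

labels = ["b", "e", "m", "t"]

# Hand-written example: one business title is misclassified as technology.
y_true = ["b", "b", "e", "e", "m", "m", "t", "t"]
y_pred = ["b", "t", "e", "e", "m", "m", "t", "t"]

# Rows are true categories, columns are predicted categories, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Precision, recall and F1 for each category.
print(classification_report(y_true, y_pred, labels=labels))
```

In this notebook the same calls would take `y_test` and `grid.predict(X_test)` in place of the hand-written lists.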


---
## Discussion

Reflect on your results, and how one might continue to improve them.

## Acknowledgments

For each group member, describe what they did.

## Final meme

Include here a meme describing your experience in this module.