# Final Exam Part 2: Implementing a News Reader

### by Ilia Ozhmegov

As for some people reading news can be upsetting and devastaiting, I have decided to use taught techniques in a different area. 

My model takes as input critic's articles about a videogame and outputs the grade from 0 to 100.

## Scrapping part

Metacritic is a good article aggregator that allows easily find all critic's articles about a certain movie or a videogame, so we use this one to collect a bunch of articles.

In [1]:
import requests
import pandas as pd

from bs4       import BeautifulSoup
from newspaper import Article

In [2]:
def get_parsed_page(url, parser='html.parser', save=False):
    user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36"
    headers = {'User-Agent': user_agent}

    try:
        page = requests.get(url, headers=headers, timeout=5)

        if page.status_code != 200:
            return

        if save:
            with open("foobar.html", 'w') as f:
                f.writelines(page.text)

        parsed_page = BeautifulSoup(page.text, parser)
        return parsed_page
    except:
        print("Error: status code", page.status_code)


def collect_links(URL, save=False):
    parsed_page = get_parsed_page(URL, save=False)
    all_stats = parsed_page.find_all('div', {"class": "review_stats"})

    articles = list()
    for stat in all_stats:
        try:
            link  = stat.find("div", {"class": "review_critic"}).find('a').get('href')
            grade = stat.find("div", {"class": "review_grade"}) .find('div').text
            if 'http' in link and grade:
                articles.append((link, grade))
        except:
            continue

    df = pd.DataFrame(articles)
    df.columns = ["link", "grade"]
    if save:
        df.to_csv('articles.csv', index=True)
    return df


def download_articles(df, save=False):
    for i in range(1, len(df)):
        try:
            article = Article(str(df.iloc[i]['link']))
            article.download()
            article.parse()

            df.loc[i, 'lang']  = article.meta_lang
            df.loc[i, 'title'] = article.title
            df.loc[i, 'date']  = article.publish_date
            df.loc[i, 'text']  = article.text
        except:
            print("failed to download: ", str(df.iloc[i]['link']))

        percent = int(i/len(df)*100)
        bar = int(percent/10) 
        print('[', '='*bar*5, ' '*(10-bar)*5,'] ', percent , '% \r', sep='', end='')
        
    print('[', '='*10*5,'] ', '100%', sep='')
            
            
    if save:
        df.to_csv('articles_full_data.csv', index=True)
         
    return df


In [3]:
url_list  = ["https://www.metacritic.com/game/playstation-4/desperados-iii/critic-reviews",
             "https://www.metacritic.com/game/switch/deadly-premonition-2-a-blessing-in-disguise/critic-reviews",
             "https://www.metacritic.com/game/playstation-4/the-last-of-us-part-ii/critic-reviews",
             "https://www.metacritic.com/game/pc/alone-in-the-dark-illumination/critic-reviews",
             "https://www.metacritic.com/game/playstation-3/damnation/critic-reviews",
             "https://www.metacritic.com/game/pc/damnation/critic-reviews",
             "https://www.metacritic.com/game/playstation-4/god-of-war/critic-reviews"]


df = pd.concat(list(map(lambda x: download_articles(collect_links(x)), url_list)))

failed to download:  https://www.psu.com/reviews/desperados-3-ps4-review/
failed to download:  http://www.jeuxvideo.com/test/1235624/desperados-iii-mimimi-reaffirme-sa-maitrise-de-l-infiltration-tactique-en-temps-reel.htm
failed to download:  https://www.gamerstemple.com/game-reviews/playstation-4/11488/desperados-iii-review
failed to download:  https://www.impulsegamer.com/deadly-premonition-2-a-blessing-in-disguise-nintendo-switch-review/
failed to download:  https://areajugones.sport.es/videojuegos/analisis-deadly-premonition-2-a-blessing-in-disguise/ 
failed to download:  https://www.criticalhit.net/gaming/deadly-premonition-2-a-blessing-in-disguise-review-twin-peaks-and-valleys/
failed to download:  https://www.smh.com.au/technology/video-games/brutal-and-beautiful-the-last-of-us-sequel-is-a-harrowing-masterpiece-20200619-p5545x.html 
failed to download:  https://screenrant.com/the-last-of-us-2-game-review/
failed to download:  https://gamerant.com/last-of-us-2-review/
failed to d

In [4]:
df

Unnamed: 0,link,grade,lang,title,date,text
0,https://www.darkstation.com/reviews/desperados...,90,,,NaT,
1,https://www.psu.com/reviews/desperados-3-ps4-r...,90,,,NaT,
2,http://www.playstationcountry.com/desperados-i...,90,en,PlayStation Country,2020-06-12 16:01:55+00:00,I really enjoyed Shadow Tactics: Blades of the...
3,https://www.playstationlifestyle.net/2020/06/1...,90,,Revisiting the Wild West (PS4),2020-06-12 00:00:00,A bag sits in the middle of the grass. Noticin...
4,http://www.jeuxvideo.com/test/1235624/desperad...,90,,,NaT,
...,...,...,...,...,...,...
106,https://stevivor.com/reviews/god-of-war-review...,80,en,"God of War Review: Flawed and fun, just like K...",2018-04-12 17:01:14+10:00,"God of War on PS4 presents a kind, gentle Krat..."
107,https://www.dailydot.com/parsec/god-of-war-rev...,80,en,God of War breathes new life into an exhausted...,2018-04-12 02:01:56+00:00,"Kratos, God of War’s titular star, has been se..."
108,http://www.thesixthaxis.com/2018/04/12/god-of-...,80,en,God Of War Review – TheSixthAxis,2018-04-12 00:00:00,There’s plenty about the new God of War that w...
109,http://twinfinite.net/2018/04/god-of-war-review,80,en,God of War Review,2018-04-12 07:01:01+00:00,God of War on PS4\n\nIt’s the little details t...


## Data cleaning part

* As we do not want to deal with any langue besides english, we throw all articles in different languages.
* As some URL are broken or unrechable, we do not manage to download those, so we throw out them too.
* As sometimes newspaper takes the wrong paragraph, we want throw out these outliers by quantile less than 5%.

In [5]:
import numpy  as np

In [6]:
df

Unnamed: 0,link,grade,lang,title,date,text
0,https://www.darkstation.com/reviews/desperados...,90,,,NaT,
1,https://www.psu.com/reviews/desperados-3-ps4-r...,90,,,NaT,
2,http://www.playstationcountry.com/desperados-i...,90,en,PlayStation Country,2020-06-12 16:01:55+00:00,I really enjoyed Shadow Tactics: Blades of the...
3,https://www.playstationlifestyle.net/2020/06/1...,90,,Revisiting the Wild West (PS4),2020-06-12 00:00:00,A bag sits in the middle of the grass. Noticin...
4,http://www.jeuxvideo.com/test/1235624/desperad...,90,,,NaT,
...,...,...,...,...,...,...
106,https://stevivor.com/reviews/god-of-war-review...,80,en,"God of War Review: Flawed and fun, just like K...",2018-04-12 17:01:14+10:00,"God of War on PS4 presents a kind, gentle Krat..."
107,https://www.dailydot.com/parsec/god-of-war-rev...,80,en,God of War breathes new life into an exhausted...,2018-04-12 02:01:56+00:00,"Kratos, God of War’s titular star, has been se..."
108,http://www.thesixthaxis.com/2018/04/12/god-of-...,80,en,God Of War Review – TheSixthAxis,2018-04-12 00:00:00,There’s plenty about the new God of War that w...
109,http://twinfinite.net/2018/04/god-of-war-review,80,en,God of War Review,2018-04-12 07:01:01+00:00,God of War on PS4\n\nIt’s the little details t...


In [7]:
df = df.dropna()

In [8]:
df["grade"] = pd.to_numeric(df["grade"])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [9]:
df['text_size'] = list(df['text'].apply(lambda x: len(x)))
lower_limit = np.quantile(df['text_size'], q=0.05)
df = df[df['text_size'] > lower_limit]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [10]:
df = df[df['lang'] == 'en']

In [11]:
df

Unnamed: 0,link,grade,lang,title,date,text,text_size
2,http://www.playstationcountry.com/desperados-i...,90,en,PlayStation Country,2020-06-12 16:01:55+00:00,I really enjoyed Shadow Tactics: Blades of the...,6580
6,https://www.softpedia.com/reviews/games/playst...,85,en,Desperados III Review (PS4),2020-06-30 10:38:00+00:00,There are a few videogame genres that seem to ...,7652
12,https://twinfinite.net/2020/07/desperados-iii-...,80,en,"Desperados III Review – All Good, No Bad or Ugly",2020-07-02 18:12:38+00:00,Desperados III on PlayStation 4\n\nAs someone ...,4534
13,https://www.slantmagazine.com/film/review-desp...,80,en,'Desperados III' Review: Perfect for Gunslinge...,2020-06-15 00:00:00,There are various reasons why the games on thi...,13535
16,https://metro.co.uk/2020/06/17/desperados-3-re...,70,en,Desperados 3 review – cowboy commandos,2020-06-17 00:00:00,Desperados 3 – stealth and strategy in the old...,7016
...,...,...,...,...,...,...,...
105,https://www.newgamenetwork.com/article/1886/go...,80,en,God of War Review,2018-04-12 03:01:04-04:00,"Where Kratos goes, trouble often follows\n\nAl...",17535
106,https://stevivor.com/reviews/god-of-war-review...,80,en,"God of War Review: Flawed and fun, just like K...",2018-04-12 17:01:14+10:00,"God of War on PS4 presents a kind, gentle Krat...",8288
107,https://www.dailydot.com/parsec/god-of-war-rev...,80,en,God of War breathes new life into an exhausted...,2018-04-12 02:01:56+00:00,"Kratos, God of War’s titular star, has been se...",9901
108,http://www.thesixthaxis.com/2018/04/12/god-of-...,80,en,God Of War Review – TheSixthAxis,2018-04-12 00:00:00,There’s plenty about the new God of War that w...,8074


## Machine Learning Part

In [12]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

from sklearn.pipeline                import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer

In [13]:
grade_regressor         = Pipeline([('vect', HashingVectorizer(n_features=int(2**20))),
                                    ('reg',  LinearRegression())])
  
grade_lasso_regressor   = Pipeline([('vect', HashingVectorizer(n_features=int(2**20))),
                                    ('reg',  Lasso())])
  
grade_ridge_regressor   = Pipeline([('vect', HashingVectorizer(n_features=int(2**20), ngram_range=(1,2))),
                                    ('reg',  Ridge())])

grade_ridgecv_regressor = Pipeline([('vect', HashingVectorizer(n_features=int(2**20))),
                                    ('reg',  RidgeCV(alphas=np.logspace(-10, 10, 1000)))])


In [14]:
X = df.text
y = df.grade

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [15]:
grade_regressor        .fit(X_train, y_train)
grade_lasso_regressor  .fit(X_train, y_train)
grade_ridge_regressor  .fit(X_train, y_train)
grade_ridgecv_regressor.fit(X_train, y_train)

print()




In [16]:
result = pd.DataFrame(
    [
       [mean_absolute_error(y_test, grade_regressor        .predict(X_test)),
        mean_absolute_error(y_test, grade_lasso_regressor  .predict(X_test)),
        mean_absolute_error(y_test, grade_ridge_regressor  .predict(X_test)),
        mean_absolute_error(y_test, grade_ridgecv_regressor.predict(X_test))],
       [mean_squared_error (y_test, grade_regressor        .predict(X_test)),
        mean_squared_error (y_test, grade_lasso_regressor  .predict(X_test)),
        mean_squared_error (y_test, grade_ridge_regressor  .predict(X_test)),
        mean_squared_error (y_test, grade_ridgecv_regressor.predict(X_test))]
    ],
    columns=["grade_regressor", 
             "grade lasso_regressor",
             "grade ridge_regressor",
             "grade ridgecv_regressor"],
    index=["MAE", "MSE"]
                     )

result.loc["SMSE"] = list(np.sqrt(result.loc["MSE"]))
result

Unnamed: 0,grade_regressor,grade lasso_regressor,grade ridge_regressor,grade ridgecv_regressor
MAE,10.277854,12.188149,9.874437,10.277852
MSE,198.430648,202.576092,143.50709,198.430578
SMSE,14.086541,14.232923,11.979444,14.086539
