# Model Exercises
Do your work for this exercise in a file named model.

Take the work we did in the lessons further:

* What other types of models (i.e. different classification algorithms) could you use?
* How do the models compare when trained on term frequency data alone, instead of TF-IDF values?

In [3]:
from pprint import pprint
import pandas as pd
import nltk
import re

import acquire as a
import prepare as p

# Acquire

In [4]:
url = 'https://inshorts.com/en/read'
category_list = ['business', 'technology', 'automobile']

articles = a.get_news_articles(url, category_list)

In [5]:
news_df = pd.DataFrame(articles)
news_df

Unnamed: 0,title,content,category
0,Netflix co-CEO Reed Hastings steps down,Netflix Co-founder Reed Hastings has stepped d...,business
1,"HCLTech expects to hire 30,000 people in next ...",HCL Technologies (or HCLTech) is expected to h...,business
2,Amazon to shut down its charity donation progr...,Amazon has announced that it will shut down it...,business
3,Videocon CEO Venugopal Dhoot granted interim b...,The Bombay High Court on Friday granted interi...,business
4,Crypto lender Genesis files for bankruptcy in US,Cryptocurrency lender Genesis Global Capital f...,business
...,...,...,...
70,Former Volkswagen CEO Carl Hahn dies at 96,"Former Volkswagen CEO Carl Hahn, 96, died in h...",automobile
71,Jaguar Land Rover India MD Rohit Suri to reti...,"Rohit Suri, President and Managing Director, J...",automobile
72,Tata to set up EV battery cell-manufacturing i...,Tata Motors CFO PB Balaji said that the conglo...,automobile
73,India can become exporter of green hydrogen wi...,The government's mission for clean energy will...,automobile


In [7]:
# Join each category's articles into one string, then clean and lemmatize it with p.clean
business_words = p.clean(' '.join(news_df[news_df.category == 'business']['content']),'lem')
tech_words = p.clean(' '.join(news_df[news_df.category == 'technology']['content']),'lem')
auto_words = p.clean(' '.join(news_df[news_df.category == 'automobile']['content']),'lem')
all_words = p.clean(' '.join(news_df['content']),'lem')

Lemmatizing Performed
Lemmatizing Performed
Lemmatizing Performed
Lemmatizing Performed


In [14]:
data = pd.Series(news_df.content)

In [17]:
data = list(data)

In [18]:
data

['Netflix Co-founder Reed Hastings has stepped down as the co-CEO of the streaming giant. The company said that co-CEO Ted Sarandos will remain in his position and will share the title with COO Greg Peters. Meanwhile, Hastings will move into the role of Executive Chairman. "I believe it\'s the right time to complete my succession," Hastings said in a statement. ',
 "HCL Technologies (or HCLTech) is expected to hire 30,000 people in the next 12 months, the company's CEO C Vijayakumar told BQ Prime at the WEF in Davos. HCLTech's attrition significantly dropped in the last quarter, from 23.8% to 21.7%, and even the talent costs are moderating, Vijayakumar said. The company has doubled down on fresh talent absorption, he added.",
 'Amazon has announced that it will shut down its charity donation programme, AmazonSmile, by February. "After almost a decade, the programme has not grown to create the impact that we had originally hoped," Amazon said. Launched in 2013, Amazon said it has donate

In [19]:
### Bag of Words
from sklearn.feature_extraction.text import CountVectorizer

# same basic process as any sklearn transformation:
# make the thing (create the vectorizer)
cv = CountVectorizer()
# use the thing (learn the vocabulary and count each word per document)
bag_of_words = cv.fit_transform(data)

In [20]:
bag_of_words

<75x1470 sparse matrix of type '<class 'numpy.int64'>'
	with 3486 stored elements in Compressed Sparse Row format>

In [22]:
bag_of_words.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [1, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

### TF-IDF

- term frequency - inverse document frequency
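- $\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$, where $\text{idf}(t) = \log \frac{N}{\text{df}(t)}$, $N$ is the number of documents, and $\text{df}(t)$ is the number of documents that contain the term $t$
- a measure that helps identify how important a word is in a document
- combination of how often a word appears in a document (**tf**) and how unique the word
  is among documents (**idf**)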
- used by search engines
- naturally helps filter out stopwords
- tf is for a single document, idf is for a corpus
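
To make the bullets above concrete, here is a minimal sketch that computes TF-IDF by hand and checks it against scikit-learn. With its default settings, `TfidfVectorizer` uses a smoothed idf, $\ln\frac{1 + N}{1 + \text{df}} + 1$, and then scales each document's row to unit length. That matches the idf output later in this notebook: with 75 documents, a term found in a single document gets $\ln(76/2) + 1 \approx 4.6376$. The corpus and all variable names below are made up for the illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical three-document corpus
toy_docs = ["python is fun", "python is popular", "cars are fast"]

# tf: raw counts of each word in each document
counts = CountVectorizer().fit_transform(toy_docs).toarray()

# df: number of documents that contain each word
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)

# Smoothed idf, as used by TfidfVectorizer's defaults (smooth_idf=True)
idf_manual = np.log((1 + n_docs) / (1 + df)) + 1

# tf * idf, then scale each row to unit (L2) length
raw = counts * idf_manual
tfidf_manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Compare against scikit-learn's own result
toy_tfidf = TfidfVectorizer()
tfidf_sklearn = toy_tfidf.fit_transform(toy_docs).toarray()

print(np.allclose(idf_manual, toy_tfidf.idf_))
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

Words that appear in fewer documents get a larger idf, which is why rare words end up with more weight than words that show up everywhere.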

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
bag_of_words = tfidf.fit_transform(data)
pprint(data)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(bag_of_words.todense(),
             columns=tfidf.get_feature_names_out())

['Netflix Co-founder Reed Hastings has stepped down as the co-CEO of the '
 'streaming giant. The company said that co-CEO Ted Sarandos will remain in '
 'his position and will share the title with COO Greg Peters. Meanwhile, '
 'Hastings will move into the role of Executive Chairman. "I believe it\'s the '
 'right time to complete my succession," Hastings said in a statement. ',
 'HCL Technologies (or HCLTech) is expected to hire 30,000 people in the next '
 "12 months, the company's CEO C Vijayakumar told BQ Prime at the WEF in "
 "Davos. HCLTech's attrition significantly dropped in the last quarter, from "
 '23.8% to 21.7%, and even the talent costs are moderating, Vijayakumar said. '
 'The company has doubled down on fresh talent absorption, he added.',
 'Amazon has announced that it will shut down its charity donation programme, '
 'AmazonSmile, by February. "After almost a decade, the programme has not '
 'grown to create the impact that we had originally hoped," Amazon said. '
 



Unnamed: 0,000,02,088,10,100,101,11,12,13,14,...,yatra,year,years,yesterday,york,you,your,yum,zabihullah,zeyoudi
0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.103438,0.0,0.0,0.000000,0.0,0.0,0.000000,0.119379,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.000000,0.0,0.0,0.129531,0.0,0.0,0.157146,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
71,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
72,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0
73,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
# To get the idf score for each word (these aren't terribly useful by themselves):
# zip: put these two things of the same length together
# dict: turn those two associated things into a k: v pair
# pd.Series: turn those keys into indices, and the values into values
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
pd.Series(
    dict(
        zip(
            tfidf.get_feature_names_out(), tfidf.idf_
        )
    )
)



000           2.932838
02            4.637586
088           4.637586
10            3.251292
100           4.232121
                ...   
you           4.232121
your          4.637586
yum           4.637586
zabihullah    4.637586
zeyoudi       4.637586
Length: 1470, dtype: float64

### Bag of Ngrams

For either `CountVectorizer` or `TfidfVectorizer`, you can set the `ngram_range`
parameter.

In [26]:
cv = CountVectorizer(ngram_range=(2, 3))
bag_of_grams = cv.fit_transform(data)

In [27]:
pprint(data)

['Netflix Co-founder Reed Hastings has stepped down as the co-CEO of the '
 'streaming giant. The company said that co-CEO Ted Sarandos will remain in '
 'his position and will share the title with COO Greg Peters. Meanwhile, '
 'Hastings will move into the role of Executive Chairman. "I believe it\'s the '
 'right time to complete my succession," Hastings said in a statement. ',
 'HCL Technologies (or HCLTech) is expected to hire 30,000 people in the next '
 "12 months, the company's CEO C Vijayakumar told BQ Prime at the WEF in "
 "Davos. HCLTech's attrition significantly dropped in the last quarter, from "
 '23.8% to 21.7%, and even the talent costs are moderating, Vijayakumar said. '
 'The company has doubled down on fresh talent absorption, he added.',
 'Amazon has announced that it will shut down its charity donation programme, '
 'AmazonSmile, by February. "After almost a decade, the programme has not '
 'grown to create the impact that we had originally hoped," Amazon said. '
 

In [28]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(bag_of_grams.todense(),
            columns=cv.get_feature_names_out())



Unnamed: 0,000 crore,000 crore electric,000 crore follow,000 crore for,000 electric,000 electric vehicles,000 evs,000 evs by,000 of,000 of its,...,you want,you want to,your lrs,your lrs eligibility,yum brands,yum brands has,zabihullah mujahid,zabihullah mujahid posted,zeyoudi said,zeyoudi said that
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
71,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
72,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
73,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
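
The two exercise questions can be explored with one small helper. The sketch below is one possible approach, not the lesson's own solution: it pairs each vectorizer (raw term counts or TF-IDF) with a few classifiers and reports the mean cross-validated accuracy. The helper `compare_models`, the choice of classifiers, and the toy corpus are all assumptions made for illustration. "Term frequency alone" is read here as raw counts from `CountVectorizer`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier


def compare_models(texts, labels, cv=3):
    """Mean cross-validated accuracy for each vectorizer/classifier pair."""
    vectorizers = {"term counts": CountVectorizer, "tf-idf": TfidfVectorizer}
    classifiers = {
        "logistic regression": lambda: LogisticRegression(max_iter=1000),
        "multinomial naive bayes": lambda: MultinomialNB(),
        "decision tree": lambda: DecisionTreeClassifier(max_depth=5, random_state=123),
    }
    rows = []
    for vec_name, make_vec in vectorizers.items():
        for clf_name, make_clf in classifiers.items():
            # The pipeline refits the vectorizer inside each fold,
            # so the held-out documents never shape the vocabulary
            pipeline = make_pipeline(make_vec(), make_clf())
            scores = cross_val_score(pipeline, texts, labels, cv=cv)
            rows.append(
                {"features": vec_name, "model": clf_name, "mean_accuracy": scores.mean()}
            )
    return pd.DataFrame(rows)


# Hypothetical stand-in corpus so the cell runs on its own
toy_texts = [
    "shares rose after the company reported strong quarterly profit",
    "the bank cut interest rates and the stock market rallied",
    "investors sold shares as company profit fell this quarter",
    "the startup raised funding from investors at a higher valuation",
    "the new phone ships with a faster chip and better camera",
    "the software update fixes a security bug in the app",
    "the company launched an ai chatbot for its search app",
    "developers can use the new chip to run ai software faster",
    "the electric car has a longer battery range than the old model",
    "the automaker will recall cars over a faulty brake part",
    "sales of electric vehicles rose as the automaker cut prices",
    "the new suv model gets a bigger engine and a hybrid option",
]
toy_labels = ["business"] * 4 + ["technology"] * 4 + ["automobile"] * 4

toy_results = compare_models(toy_texts, toy_labels, cv=2)
print(toy_results)
```

With the notebook's data, the equivalent call would be `compare_models(news_df.content, news_df.category)`. On a corpus of only about 75 articles the accuracy estimates will be noisy, so small differences between the rows should not be over-interpreted.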
