# Assignment 5

In this assignment, you'll scrape text from [The California Aggie](https://theaggie.org/) and then analyze the text.

The Aggie is organized by category into article lists. For example, there's a [Campus News](https://theaggie.org/campus/) list, [Arts & Culture](https://theaggie.org/arts/) list, and [Sports](https://theaggie.org/sports/) list. Notice that each list has multiple pages, with a maximum of 15 articles per page.

The goal of exercises 1.1 - 1.3 is to scrape articles from the Aggie for analysis in exercise 1.4.

__Exercise 1.1.__ Write a function that extracts all of the links to articles in an Aggie article list. The function should:

* Have a parameter `url` for the URL of the article list.

* Have a parameter `page` for the number of pages to fetch links from. The default should be `1`.

* Return a list of article URLs (each URL should be a string).

Test your function on 2-3 different categories to make sure it works.

Hints:

* Be polite to The Aggie and save time by setting up [requests_cache](https://pypi.python.org/pypi/requests-cache) before you write your function.

* Start by getting your function to work for just 1 page. Once that works, have your function call itself to get additional pages.

* You can use [lxml.html](http://lxml.de/lxmlhtml.html) or [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to scrape HTML. Choose one and use it throughout the entire assignment.
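
The pagination hint can be tried out without touching the network. The helper below is hypothetical (`page_urls` is not part of the assignment) and assumes that page `n` of an article list is served at `<list URL>/page/<n>/`; check that pattern in a browser before relying on it.

```python
def page_urls(url, pages=1):
    """Return the URLs of the first `pages` pages of an article list.

    Assumes page n of a list is served at <url>/page/<n>/.
    """
    base = url.rstrip('/')  # tolerate a trailing slash on the list URL
    return ['{}/page/{}/'.format(base, n) for n in range(1, pages + 1)]


print(page_urls('https://theaggie.org/campus/', pages=2))
```

Building the URLs separately from fetching them makes it easy to test the paging logic before any request is sent.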

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk import corpus
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
import requests
from bs4 import BeautifulSoup
import requests_cache
requests_cache.install_cache('demo_cache')
from nltk.corpus import stopwords
from wordcloud import WordCloud,STOPWORDS

In [5]:
def art_search(url, numpages=1):
    '''
    Collect the article URLs from the first `numpages` pages of an Aggie article list.

    Input:
        url: base URL of the article list
        numpages: number of pages to fetch links from (default 1)

    Output:
        list of article URLs (strings)
    '''
    links = []
    for i in range(numpages):
        page_url = url.rstrip('/') + '/page/' + str(i + 1)
        response = requests.get(page_url)
        soup = BeautifulSoup(response.text, 'lxml')
        # Each article headline is an <h2 class="entry-title"> wrapping a link
        for heading in soup.find_all('h2', class_="entry-title"):
            links.append(heading.a['href'])
    return links

In [4]:
art_search('https://theaggie.org/campus', 4)

['https://theaggie.org/2017/02/24/2017-winter-quarter-election-results/',
 'https://theaggie.org/2017/02/23/university-of-california-davis-city-council-sever-wells-fargo-contracts/',
 'https://theaggie.org/2017/02/23/academics-unite-in-peaceful-rally-against-immigration-ban/',
 'https://theaggie.org/2017/02/23/memorial-union-to-reopen-spring-quarter/',
 'https://theaggie.org/2017/02/23/asucd-president-alex-lee-vetoes-amendment-for-creation-of-judicial-council/',
 'https://theaggie.org/2017/02/22/senate-candidate-zaki-shaheen-withdraws-from-race/',
 'https://theaggie.org/2017/02/21/uc-davis-experiences-several-recent-hate-based-crimes/',
 'https://theaggie.org/2017/02/21/uc-president-selects-gary-may-as-new-uc-davis-chancellor/',
 'https://theaggie.org/2017/02/20/katehi-controversy-prompts-decline-of-uc-administrators-seeking-profitable-subsidiary-board-positions/',
 'https://theaggie.org/2017/02/20/asucd-senate-passes-resolution-submitting-comments-on-lrdp/',
 'https://theaggie.org/201

In [None]:
def extract_inf(url):
    '''
    Extract the title, text, and author line of an Aggie article.

    Input: URL of the article

    Output: dictionary with keys 'author', 'text', 'title', and 'url'
    '''
    # Map curly quotes to straight quotes and the ellipsis to a space
    table = {0x2018: 0x27, 0x2019: 0x27, 0x201C: 0x22, 0x201D: 0x22, 0x2026: 0x20}
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    body = soup.find_all('div', itemprop="articleBody")
    paragraphs = [p.text for p in body[0].find_all('p')]
    # The author line is the last line of the last paragraph
    last = paragraphs.pop().strip()
    if "\n" in last:
        head, author = last.rsplit("\n", 1)
        paragraphs.append(head)  # keep article text that precedes the author line
    else:
        author = last
    text = u' '.join(p.translate(table) for p in paragraphs)
    title = soup.find_all('h1', itemprop="headline")[0].text
    return {
        'author': author,
        'text': text,
        'title': title,
        'url': url,
    }

In [None]:
extract_inf('https://theaggie.org/2017/02/20/katehi-controversy-prompts-decline-of-uc-administrators-seeking-profitable-subsidiary-board-positions/')

__Exercise 1.2.__ Write a function that extracts the title, text, and author of an Aggie article. The function should:

* Have a parameter `url` for the URL of the article.

* For the author, extract the "Written By" line that appears at the end of most articles. You don't have to extract the author's name from this line.

* Return a dictionary with keys "url", "title", "text", and "author". The values for these should be the article url, title, text, and author, respectively.

For example, for [this article](https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/) your function should return something similar to this:
```
{
    'author': u'Written By: Bianca Antunez \xa0\u2014\xa0city@theaggie.org',
    'text': u'Davis residents create financial model to make city\'s financial state more transparent To increase transparency between the city\'s financial situation and the community, three residents created a model called Project Toto which aims to improve how the city communicates its finances in an easily accessible design. Jeff Miller and Matt Williams, who are members of Davis\' Finance and Budget Commission, joined together with Davis entrepreneur Bob Fung to create the model plan to bring the project to the Finance and Budget Commission in February, according to Kelly Stachowicz, assistant city manager. "City staff appreciate the efforts that have gone into this, and the interest in trying to look at the city\'s potential financial position over the long term," Stachowicz said in an email interview. "We all have a shared goal to plan for a sound fiscal future with few surprises. We believe the Project Toto effort will mesh well with our other efforts as we build the budget for the next fiscal year and beyond." Project Toto complements the city\'s effort to amplify the transparency of city decisions to community members. The aim is to increase the understanding about the city\'s financial situation and make the information more accessible and easier to understand. The project is mostly a tool for public education, but can also make predictions about potential decisions regarding the city\'s financial future. Once completed, the program will allow residents to manipulate variables to see their eventual consequences, such as tax increases or extensions and proposed developments "This really isn\'t a budget, it is a forecast to see the intervention of these decisions," Williams said in an interview with The Davis Enterprise. "What happens if we extend the sales tax? What does it do given the other numbers that are in?" 
Project Toto enables users, whether it be a curious Davis resident, a concerned community member or a city leader, with the ability to project city finances with differing variables. The online program consists of the 400-page city budget for the 2016-2017 fiscal year, the previous budget, staff reports and consultant analyses. All of the documents are cited and accessible to the public within Project Toto. "It\'s a model that very easily lends itself to visual representation," Mayor Robb Davis said. "You can see the impacts of decisions the council makes on the fiscal health of the city." Complementary to this program, there is also a more advanced version of the model with more in-depth analyses of the city\'s finances. However, for an easy-to-understand, simplistic overview, Project Toto should be enough to help residents comprehend Davis finances. There is still more to do on the project, but its creators are hard at work trying to finalize it before the 2017-2018 fiscal year budget. "It\'s something I have been very much supportive of," Davis said. "Transparency is not just something that I have been supportive of but something we have stated as a city council objective [ ] this fits very well with our attempt to inform the public of our challenges with our fiscal situation." ',
    'title': 'Project Toto aims to address questions regarding city finances',
    'url': 'https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/'
}
```

Hints:

* The author line is always the last line of the last paragraph.

*   Python 2 displays some Unicode characters as `\uXXXX`. For instance, `\u201c` is a left-facing quotation mark.
    You can convert most of these to ASCII characters with the method call (on a string)
    ```
    .translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 0x2026:0x20 })
    ```
    If you're curious about these characters, you can look them up on [this page](http://unicode.org/cldr/utility/character.jsp), or read 
    more about [what Unicode is](http://unicode.org/standard/WhatIsUnicode.html).
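
The two hints above can be combined in a short, offline sketch. The helper name `split_author` and the sample paragraph are made up for illustration; the translation table is the one from the hint.

```python
# Map curly quotes to straight quotes and the ellipsis to a space
# (the same table as in the hint above).
ASCII_TABLE = {0x2018: 0x27, 0x2019: 0x27, 0x201C: 0x22, 0x201D: 0x22, 0x2026: 0x20}


def split_author(last_paragraph):
    """Split the final paragraph into (remaining text, author line).

    Assumes the author line is the last line of the paragraph.
    """
    lines = last_paragraph.strip().split('\n')
    author = lines[-1].strip()
    rest = ' '.join(line.strip() for line in lines[:-1])
    return rest, author


# Made-up paragraph: one sentence of article text, then the author line
sample = u'\u201cIt\u2019s a forecast,\u201d Williams said.\nWritten by: Jane Doe'
rest, author = split_author(sample)
print(rest.translate(ASCII_TABLE))
print(author)
```

If the last paragraph holds only the author line, `rest` comes back empty and nothing needs to be added to the article text.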

__Exercise 1.3.__ Use your functions from exercises 1.1 and 1.2 to get a data frame of 60 [Campus News](https://theaggie.org/campus/) articles and a data frame of 60 [City News](https://theaggie.org/city/) articles. Add a column to each that indicates the category, then combine them into one big data frame.

The "text" column of this data frame will be your corpus for natural language processing in exercise 1.4.

In [None]:
result_Campus = pd.DataFrame(extract_inf(i) for i in art_search('https://theaggie.org/campus', 4))
result_Campus['category'] = 'Campus'
result_City = pd.DataFrame(extract_inf(i) for i in art_search('https://theaggie.org/city', 4))
result_City['category'] = 'City'
result = pd.concat([result_Campus, result_City], axis=0, ignore_index=True)  # renumber rows 0..n-1
result
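
A small sketch of the combine step on made-up records, in case the indexing is unclear. The titles and texts are placeholders; `ignore_index=True` is one way to get a clean `0..n-1` index after stacking the two frames.

```python
import pandas as pd

# Made-up records standing in for the dictionaries returned per article
campus = pd.DataFrame([{'title': 'A', 'text': 'campus text a'},
                       {'title': 'B', 'text': 'campus text b'}])
city = pd.DataFrame([{'title': 'C', 'text': 'city text c'}])

# Tag each frame with its category before stacking
campus['category'] = 'Campus'
city['category'] = 'City'

# ignore_index=True renumbers the rows instead of repeating each frame's own index
combined = pd.concat([campus, city], ignore_index=True)
print(combined)
```

With a clean index, positional lookups (`.iloc`) and label lookups (`.loc`) refer to the same rows, which avoids confusion later.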

__Exercise 1.4.__ Use the Aggie corpus to answer the following questions. Use plots to support your analysis.

* What topics does the Aggie cover the most? Do city articles typically cover different topics than campus articles?

* What are the titles of the top 3 pairs of most similar articles? Examine each pair of articles. What words do they have in common?

* Do you think this corpus is representative of the Aggie? Why or why not? What kinds of inference can this corpus support? Explain your reasoning.

Hints:

*   The [nltk book](http://www.nltk.org/book/) and [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) may be helpful here.

*   You can determine whether city articles are "near" campus articles from the similarity matrix or with k-nearest neighbors.

*   If you want, you can use the [wordcloud](http://amueller.github.io/word_cloud/) package to plot a word cloud. To install the package, run
    ```
    conda install -c https://conda.anaconda.org/amueller wordcloud
    ```
    in a terminal. Word clouds look nice and are easy to read, but are less precise than bar plots.
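
To see what a similarity matrix measures, here is a standard-library sketch of TF-IDF with cosine similarity. It uses whitespace tokenization and the smoothed idf `ln((1 + n) / (1 + df)) + 1`; both are assumptions made for this sketch, intended to resemble what an L2-normalized TF-IDF vectorizer followed by a dot product computes. The documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each token appear?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: c * (math.log((1.0 + n) / (1.0 + df[t])) + 1.0)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0  # guard empty docs
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors stored as dicts."""
    return sum(w * v[t] for t, w in u.items() if t in v)


docs = ['tuition vote regents',
        'regents approve tuition raise',
        'city council parking']
vecs = tfidf_vectors(docs)
sim = [[cosine(u, v) for v in vecs] for u in vecs]
for row in sim:
    print([round(x, 3) for x in row])
```

Each document has similarity 1 with itself, and documents with no words in common have similarity 0, so the interesting entries are the off-diagonal ones.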

<h2> Tokenize and Denoise </h2>

In [None]:
title = result['title']
text = result['text']

In [None]:
tokenize = nltk.word_tokenize
def stem(tokens, stemmer=PorterStemmer().stem):
    '''
    Drop stop words and punctuation tokens, then lowercase and stem what remains.

    Input: list of strings (tokens)

    Output: list of stemmed, lowercased tokens
    '''
    del_words = stopwords.words('english')
    # Extra stop words and punctuation; matched before lowercasing, so both cases are listed
    del_words.extend(["also", "Also", "one", ",", "Go", "and", "in", ".", "'s", "'", "'/'/",
                      "The", "be", "to", "S", "said", "for", "is", "that",
                      "uc", "as", "thi", "it", "with", "an", "at", "what", "It",
                      "from", "not", "are", "had", "would", "could", "if", ".", "it",
                      "to", "those", "since", "get", "as", "of", "''", "(", ")", "[", "]", '``',
                      "like", "one", "in", "!", "#", "%", "$", "&", "too", "go", "n't", "In", "I"])
    del_words = set(del_words)  # set membership is faster than list membership
    kept = [w for w in tokens if w not in del_words]
    return [stemmer(w.lower()) for w in kept]

def lemmatize(text):
    """
    Extract simple lemmas based on tokenization and stemming
    Input: string
    Output: list of strings (lemmata)
    """
    return stem(tokenize(text))
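
The stop list in `stem` is matched against tokens before they are lowercased, which is why entries such as "Also" and "also" both appear. A possible alternative, sketched here with a tiny made-up stop set and no stemming, is to lowercase first so each stop word needs listing only once. `filter_tokens` and `STOP` are illustrative names.

```python
STOP = {'the', 'it', 'from', 'to', 'said'}  # tiny stand-in for the real stop list


def filter_tokens(tokens, stop=STOP):
    """Lowercase first, then drop stop words and bare punctuation."""
    kept = []
    for tok in tokens:
        tok = tok.lower()
        # Skip stop words and tokens with no letter or digit (e.g. ',' or '``')
        if tok in stop or not any(ch.isalnum() for ch in tok):
            continue
        kept.append(tok)
    return kept


print(filter_tokens(['The', 'Regents', 'said', 'it', ',', 'from', 'Davis', '.']))
```

Dropping every token without a letter or digit also removes the need to enumerate punctuation marks by hand.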

In [None]:
def word_fre(text):
    '''
    Count the document frequency of each lemma, i.e. the number of
    documents in which it appears at least once.

    Input: pd.Series of text

    Output: dictionary mapping each lemma to the number of documents containing it
    '''
    numd = {}
    for doc in text:
        # set() ensures a lemma is counted at most once per document
        for tok in set(lemmatize(doc)):
            numd[tok] = numd.get(tok, 0) + 1
    return numd
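
`word_fre` counts, for each lemma, the number of articles that contain it, which is a document frequency rather than a count of total occurrences. The toy sketch below shows the difference, using made-up documents and plain whitespace splitting in place of `lemmatize`.

```python
from collections import Counter

# Made-up documents
docs = ['tuition tuition tuition vote', 'vote on parking', 'parking vote']

# Term frequency: every occurrence counts
term_freq = Counter(tok for doc in docs for tok in doc.split())
# Document frequency: each document counts a word at most once
doc_freq = Counter(tok for doc in docs for tok in set(doc.split()))

for word in ['tuition', 'vote']:
    print('{}: term frequency {}, document frequency {}'.format(
        word, term_freq[word], doc_freq[word]))
```

A word repeated many times in a single article scores high on term frequency but low on document frequency, so the second measure is less sensitive to one long article on a single subject.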

<h3> We call a word "high frequency" if it appears in at least 45 articles. The high-frequency words indicate the main topics.</h3>

In [None]:
numd = word_fre(text)
numd = pd.DataFrame([numd])
numd = numd.transpose()
numd.columns = ['Frequency']
numd = numd.sort_values(by='Frequency',ascending=False)
numd_1 = numd[numd['Frequency'] >= 45]
numd_1

In [None]:
## High Frequency level Word
numd_1.plot(y = 0, fontsize = 10, kind = 'barh', title = 'High Frequency Word')
plt.xlabel('Frequency')
plt.ylabel('Word')
plt.xlim(0,120)
plt.show()

In [None]:
text_city = []
text_campus = []

for txt in result_City["text"]:
    text_city.append(txt)
    
for txt in result_Campus["text"]:
    text_campus.append(txt)
    
text_campus_all = u' '.join(text_campus)
text_city_all = u' '.join(text_city)
vectorizer = TfidfVectorizer(tokenizer=lemmatize,stop_words="english",smooth_idf=True,norm="l2")
total = [text_city_all,text_campus_all]
tfs = vectorizer.fit_transform(total)
sim_table = tfs.dot(tfs.T)
print(sim_table)

<h3> Strategy:             
We find the high-frequency words in campus articles and city articles separately, then compare the two lists to see how the topics differ between the two types of articles.</h3>

In [None]:
Campus_text = result_Campus['text']
City_text = result_City['text']
numd_Campus = pd.DataFrame([word_fre(Campus_text)])
numd_City = pd.DataFrame([word_fre(City_text)])
numd_Campus = numd_Campus.transpose()
numd_City = numd_City.transpose()
numd_City.columns = ['Frequency']
numd_Campus.columns = ['Frequency']
numd_Campus = numd_Campus.sort_values(by='Frequency',ascending=False)
numd_City = numd_City.sort_values(by='Frequency',ascending=False)

numd_2 = numd_City[numd_City['Frequency'] >= 20]
numd_3 = numd_Campus[numd_Campus['Frequency'] >= 25]
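
Besides comparing the two bar plots by eye, the comparison can be made explicit with set operations on the high-frequency words. The frequencies below are made up and only stand in for `numd_City` and `numd_Campus`; the thresholds mirror the ones used above.

```python
# Made-up document frequencies standing in for the real tables
city_freq = {'citi': 40, 'davi': 55, 'council': 30, 'student': 12}
campus_freq = {'student': 58, 'davi': 57, 'asucd': 35, 'council': 5}


def frequent(freq, threshold):
    """Return the set of words whose frequency is at least `threshold`."""
    return {word for word, count in freq.items() if count >= threshold}


city_top = frequent(city_freq, 20)
campus_top = frequent(campus_freq, 25)

shared = sorted(city_top & campus_top)       # frequent in both categories
city_only = sorted(city_top - campus_top)    # frequent only in city articles
campus_only = sorted(campus_top - city_top)  # frequent only in campus articles

print('shared:', shared)
print('city only:', city_only)
print('campus only:', campus_only)
```

The words that appear in only one of the two sets are the ones that distinguish the categories.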

In [None]:
numd_2.plot(y = 0, fontsize = 10, kind = 'barh', title = 'High Frequency Word of City')
plt.xlabel('Frequency')
plt.ylabel('Word')
plt.xlim(0,120)
plt.show()
numd_3.plot(y = 0, fontsize = 10, kind = 'barh', title = 'High Frequency Word of Campus')
plt.xlabel('Frequency')
plt.ylabel('Word')
plt.xlim(0,120)
plt.show()


<strong>
<ul>
<li>From the frequency graph, the most-covered topics are "ucdavis", "Student", "Major", "Work", "president", and "California".</li>
<li>Both types of article share topics such as "community", "Davis", "work", "manipulate", and "help".</li>
<li>From the similarity table, the similarity between the two categories is 0.666. Therefore, campus article topics are quite similar to city article topics.</li>
</ul>
</strong>

<h2> 1.4.2       
Vectorize and similarity matrix </h2>

In [None]:
#2
text_all = text_campus + text_city
vectorizer = TfidfVectorizer(tokenizer=lemmatize,stop_words="english",smooth_idf=True,norm="l2")
tfs_text = vectorizer.fit_transform(text_all)
norm = tfs_text.dot(tfs_text.T)
norm = norm.toarray()


def sim_matrix(matrix):
    '''
    Find the three most similar pairs of articles.

    Input: square similarity matrix (numpy array)

    Output: list of three [row, column] index pairs, most similar first
    '''
    # Look only above the diagonal so that each pair is counted once
    # and no article is paired with itself
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    order = np.argsort(matrix[rows, cols])[::-1][:3]
    return [[int(rows[i]), int(cols[i])] for i in order]
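
Because a similarity matrix is symmetric, every pair shows up twice in a flattened sort, once as (i, j) and once as (j, i). The sketch below avoids the duplicates by enumerating only pairs with i < j; `top_pairs` and the toy matrix are illustrative and not part of the notebook.

```python
from itertools import combinations


def top_pairs(matrix, k=3):
    """Return the k most similar (i, j) pairs with i < j from a square matrix."""
    n = len(matrix)
    pairs = combinations(range(n), 2)  # every unordered pair once, no diagonal
    ranked = sorted(pairs, key=lambda p: matrix[p[0]][p[1]], reverse=True)
    return ranked[:k]


# Toy symmetric similarity matrix with ones on the diagonal
toy = [[1.0, 0.2, 0.9, 0.1],
       [0.2, 1.0, 0.3, 0.8],
       [0.9, 0.3, 1.0, 0.4],
       [0.1, 0.8, 0.4, 1.0]]
print(top_pairs(toy, k=2))
```

Checking a function like this on a matrix small enough to read by eye is a cheap way to confirm that the diagonal and the mirrored entries are really excluded.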

In [None]:
sim_matrix(norm)

In [None]:
print(title.iloc[14])
print(title.iloc[35])
print(title.iloc[58])
print(title.iloc[51])
print(title.iloc[24])
print(title.iloc[38])

<h3> Words in common by using Word Clouds </h3>

In [None]:
def wordcloud(text1, text2):
    text_compare = text1 + u' ' + text2  # add a space so the boundary words do not merge
    wc = WordCloud(background_color="white", max_words=80)
    wc.generate(text_compare)
    return wc

In [None]:
plt.imshow(wordcloud(text.iloc[14], text.iloc[35]))
plt.axis("off")

In [None]:
plt.imshow(wordcloud(text.iloc[58], text.iloc[51]))
plt.axis("off")

In [None]:
plt.imshow(wordcloud(text.iloc[24], text.iloc[38]))
plt.axis("off")
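
A word cloud of two concatenated texts shows their combined vocabulary, not specifically the words they share. A rough complement, sketched below with made-up sentences and plain whitespace splitting, is to intersect the two word counts directly. `common_words` is a hypothetical helper.

```python
from collections import Counter


def common_words(text1, text2, n=5):
    """Return up to n words that occur in both texts, most shared first."""
    c1 = Counter(text1.lower().split())
    c2 = Counter(text2.lower().split())
    shared = c1 & c2  # keeps each shared word with the smaller of its two counts
    return [word for word, _ in shared.most_common(n)]


# Made-up sentences
a = 'mental health conference for student mental health'
b = 'first mental health conference held on campus'
print(common_words(a, b))
```

In practice the texts would be passed through the same stop-word filtering as the rest of the analysis, otherwise common function words dominate the result.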

<strong>
Conclusion: 
<ul>
<li>The articles "UC Davis holds first mental health conference" and "UC Davis to host first ever mental health conference" share words such as "mental", "student", and "health".</li>
<li>The articles "Davis College Republicans club leads protest against cancellation of Milo Yiannopoulos event" and "Protests erupt at Milo Yiannopoulos event" share words such as "Milo Yiannopoulos", "protest", and "event".</li>
<li>The articles "University of California Regents meet, approve first tuition raise in six years" and "UC Regents vote to raise tuition for UC campuses" share words such as "tuition", "increase", and "UC student".</li>
</ul>
</strong>




<h2> 1.4.3</h2>

<strong>
The corpus is not representative of the Aggie.
<ul>
<li>First, the articles were extracted only from the Campus and City categories, so many topics are never covered, such as sports and arts.</li>
<li>Second, all of the articles are from 2017, so the corpus cannot represent the Aggie's past coverage.</li> 
</ul>
</strong>


In [None]:
numd = word_fre(title)
numd = pd.DataFrame([numd])
numd = numd.transpose()
numd.columns = ['Frequency']
numd = numd.sort_values(by='Frequency',ascending=False)
numd_1 = numd[numd['Frequency'] >= 5]
numd_1

<strong>
Inference:         
Based on the word frequencies in the titles, we can infer that recent coverage of UC Davis has focused on protests over the presidential election and on the tuition increase.
</strong>

