## Data Acquisition 

In [1]:
!pip install Wikipedia-API



**Before executing this notebook:**
- [x] Use [PetScan](https://petscan.wmflabs.org/) to export a CSV file of the Wikipedia titles under **Category: Fantasy novel series**
- [x] Add `.csv` to the end of the downloaded file's name so that pandas can import it as a CSV
- [x] Install wikipedia-api with pip: https://pypi.org/project/Wikipedia-API/
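
The renaming step in the checklist above can also be done from Python. This is a minimal sketch, not part of the original workflow: the helper `add_csv_suffix` and the file name `petscan_export` are hypothetical, and a temporary directory stands in for the real download folder.

```python
import tempfile
from pathlib import Path


def add_csv_suffix(path):
    """Rename `path` so that its name ends in .csv and return the new path.

    A file that already ends in .csv is left untouched. Any other name
    simply gets '.csv' appended (so 'export.txt' becomes 'export.txt.csv').
    """
    path = Path(path)
    if path.suffix.lower() == ".csv":
        return path
    target = path.with_name(path.name + ".csv")
    path.rename(target)
    return target


# Demo with a throwaway file standing in for the PetScan download
with tempfile.TemporaryDirectory() as tmp:
    download = Path(tmp) / "petscan_export"  # hypothetical file name
    download.write_text("number,title\n1,A_Wizard_of_Earthsea\n")
    renamed = add_csv_suffix(download)
    print(renamed.name)  # petscan_export.csv
```

Once renamed, the file can be read with `pd.read_csv` as in the cells that follow.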

**In this notebook**:
- [x] Open the CSV file and use the `title` column as input for extracting the actual articles with wikipedia-api
- [x] Save the results as a pandas DataFrame

In [2]:
import pandas as pd
import numpy as np

import pickle
import wikipediaapi

### Extract pages with Wikipedia-API

In [3]:
# open csv file
novel_list = pd.read_csv('../data/novels_by_decade.csv')
novel_list.tail()

| | number | title | pageid | namespace | length | touched |
|---|---|---|---|---|---|---|
| 2286 | 2287 | The_Barbarian_of_World's_End | 62287035 | | 5608 | 20200202073312 |
| 2287 | 2288 | The_Pirate_of_World's_End | 62287050 | | 5679 | 20200202073312 |
| 2288 | 2289 | Wrath_of_Empire | 62641545 | | 2879 | 20200202073425 |
| 2289 | 2290 | Blood_of_Empire | 62653198 | | 2498 | 20200221011008 |
| 2290 | 2291 | The_Light_Ages | 62971385 | | 3633 | 20200204103212 |


In [4]:
# Collect names of novels
novel_titles = novel_list['title'].values.tolist()

---

In [5]:
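# Wikipedia wrapper function

def getarticles(titles):
    '''Return a list of wikipediaapi.WikipediaPage objects, one per title.

    Pages are loaded lazily: article content is only downloaded when an
    attribute such as `summary` or `text` is first accessed.

    input:
        titles - list of article titles
    '''
    # Create the client once and reuse it for every title.
    # Note: newer Wikipedia-API releases may also require a `user_agent` argument.
    wiki = wikipediaapi.Wikipedia(
            language='en',
            extract_format=wikipediaapi.ExtractFormat.WIKI
    )
    # Build one page object per title
    return [wiki.page(each) for each in titles]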
 

In [6]:
# collect articles under the category of Fantasy
collection = getarticles(novel_titles)

In [7]:
print('total number of articles collected: ', len(collection))

total number of articles collected:  2291


In [8]:
type(collection)

list

In [9]:
type(collection[0])

wikipediaapi.WikipediaPage

In [10]:
collection[0].title

'A_Wizard_of_Earthsea'

In [11]:
collection[0].summary

'A Wizard of Earthsea is a fantasy novel written by American author Ursula K. Le Guin and first published by the small press Parnassus in 1968. It is regarded as a classic of children\'s literature, and of fantasy, within which it was widely influential. The story is set in the fictional archipelago of Earthsea and centers around a young mage named Ged, born in a village on the island of Gont. He displays great power while still a boy and joins the school of wizardry, where his prickly nature drives him into conflict with one of his fellows. During a magical duel, Ged\'s spell goes awry and releases a shadow creature that attacks him. The novel follows his journey as he seeks to be free of the creature.\nThe book has often been described as a Bildungsroman, or coming-of-age story, as it explores Ged\'s process of learning to cope with power and come to terms with death. The novel also carries Taoist themes about a fundamental balance in the universe of Earthsea, which wizards are suppo

In [12]:
print("Here's an example of a novel")
# collection[0].text

Here's an example of a novel


**Unpack Results**
- Collect titles and summaries from the wiki articles and convert them into lists
- Then create a pandas DataFrame from those lists

In [13]:
# Collect the title and summary of every page in the collection (a list)
# This step will take a while if you have a lot of data, because each summary is fetched on access
titles = [each.title for each in collection]
summaries = [each.summary for each in collection]
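
Because this step touches the network once per article, a single failing page can abort the whole run. The sketch below shows one way to collect titles and summaries while recording failures instead of stopping. It is an illustration under assumptions, not the notebook's method: `collect_summaries`, `BrokenPage`, and the `demo_*` names are hypothetical, and stub objects stand in for real `WikipediaPage` instances so the example runs offline.

```python
from types import SimpleNamespace


def collect_summaries(pages):
    """Return (titles, summaries, failed) for an iterable of page-like objects.

    Each object only needs `.title` and `.summary` attributes. Reading
    `.summary` is wrapped in try/except so that one failing page is
    recorded in `failed` rather than aborting the loop.
    """
    titles, summaries, failed = [], [], []
    for page in pages:
        try:
            summary = page.summary
        except Exception as exc:
            failed.append((page.title, repr(exc)))
            continue
        titles.append(page.title)
        summaries.append(summary)
    return titles, summaries, failed


class BrokenPage:
    """Stub page whose summary lookup always fails."""
    title = "Broken_Page"

    @property
    def summary(self):
        raise ConnectionError("simulated network failure")


demo_pages = [
    SimpleNamespace(title="Erewhon", summary="Erewhon is a novel."),
    BrokenPage(),
]
demo_titles, demo_summaries, demo_failed = collect_summaries(demo_pages)
print(demo_titles, len(demo_failed))  # ['Erewhon'] 1
```

The titles listed in `failed` can be retried later without repeating the pages that already succeeded.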

In [14]:
# Pickle the summaries, just in case
with open('../data/list_summaries_novelMAS.pkl','wb') as fout:
    pickle.dump(summaries, fout)

# Pickle the titles, just in case
with open('../data/list_titles_novelMAS.pkl','wb') as fout:
    pickle.dump(titles, fout)
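
To confirm that a pickled list can be restored, it can be read back with `pickle.load`. The following is a small round-trip sketch: `load_pickle`, `demo_list`, and `demo_titles.pkl` are hypothetical names, and a temporary directory is used so the example does not touch the files in `../data/`.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Read back an object that was written with pickle.dump."""
    with open(path, "rb") as fin:
        return pickle.load(fin)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_titles.pkl"  # hypothetical file name
    demo_list = ["A_Wizard_of_Earthsea", "Carmilla"]

    # Write the list the same way the cell above does
    with open(demo_path, "wb") as fout:
        pickle.dump(demo_list, fout)

    # Read it back and compare with the original
    restored = load_pickle(demo_path)

print(restored == demo_list)  # True
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.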

In [15]:
# Put these lists into a dataframe, one column per list
df = pd.DataFrame({'title': titles, 'summary': summaries})

In [16]:
df.head()

| | title | summary |
|---|---|---|
| 0 | A Wizard of Earthsea | A Wizard of Earthsea is a fantasy novel writte... |
| 1 | Carmilla | Carmilla is an 1872 Gothic novella by Irish au... |
| 2 | Don_Quixote | The Ingenious Gentleman Don Quixote of La Manc... |
| 3 | Erewhon | Erewhon: or, Over the Range () is a novel by S... |
| 4 | Farmer_Giles_of_Ham | Farmer Giles of Ham is a comic Medieval fable ... |


In [17]:
# Pickle the dataframe for later processing
with open('../data/dfraw_novelMAS.pkl','wb') as fout:
    pickle.dump(df, fout)

**Next step**:
- Clean the dataframe and do EDA, in Step2_Cleaning.ipynb

---