# CorpusImporter Example

## Import

In [5]:
%load_ext autoreload
%autoreload 2
from Import.CorpusImporter import CorpusImporter

corpus = CorpusImporter()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## CrawlNYT 
The crawler first has to be told which data to collect. All settings are passed as parameters.

First, tell the crawler which kinds of articles to collect. The following parameters control this:

* whitelist - documents carrying one of these tags are included
* blacklist - documents carrying one of these tags are ignored (takes precedence over whitelist!)
* mapping   - several labels are merged into one label, format [ summary label, [other labels] ]

By default, the crawler looks in the corpus folder and scans the years 2007, 2006, 2005 (in that order). The scan can be filtered further.

> Setting ignore_tags = True disables this behavior.
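
To make the interplay of the three tag parameters concrete, here is a minimal sketch. `resolve_tags` is a hypothetical helper written for illustration, not the code inside CorpusImporter; in particular, applying the mapping before the whitelist/blacklist check is an assumption.

```python
def resolve_tags(tags, whitelist=None, blacklist=None, mapping=None):
    """Return the (possibly merged) tags of a document, or None if it is skipped.

    Hypothetical illustration only; the real crawler may behave differently.
    """
    whitelist = set(whitelist or [])
    blacklist = set(blacklist or [])

    # mapping uses the format from the text: [summary_label, [other_labels]]
    alias = {}
    for summary, others in (mapping or []):
        for other in others:
            alias[other] = summary

    # Assumption: labels are merged first, keeping order and dropping duplicates.
    merged = list(dict.fromkeys(alias.get(tag, tag) for tag in tags))

    # The blacklist outranks the whitelist, so it is checked first.
    if any(tag in blacklist for tag in merged):
        return None
    if whitelist and not any(tag in whitelist for tag in merged):
        return None
    return merged


merged_example = resolve_tags(
    ["Art", "Theater"],
    whitelist=["Arts"],
    mapping=[["Arts", ["Art", "Theater"]]],
)
blocked_example = resolve_tags(
    ["Arts", "Sports"], whitelist=["Arts"], blacklist=["Sports"]
)
```

Here `merged_example` is `['Arts']`, while `blocked_example` is `None` because the blacklisted tag wins even though a whitelisted tag is present.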

### per_tag
If a use case calls for collecting only a certain number of documents per tag, use this parameter. Note that max_count must be set high enough to accommodate it.

If multilabel is enabled, combinations such as

[Art,Science] and [Art,Politics] count as separate labels.
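
A per-label cap with multilabel enabled can be pictured as follows. This is a sketch under assumptions: `take_per_label` and the `(tags, text)` tuples are hypothetical stand-ins, and the only behavior taken from the text is that each tag combination counts as its own label.

```python
from collections import Counter


def take_per_label(documents, per_tag):
    """Keep at most `per_tag` documents for each distinct tag combination."""
    counts = Counter()
    kept = []
    for tags, text in documents:
        key = tuple(tags)  # the whole combination acts as one label
        if counts[key] < per_tag:
            counts[key] += 1
            kept.append((tags, text))
    return kept


docs = [
    (["Art", "Science"], "a"),
    (["Art", "Politics"], "b"),
    (["Art", "Science"], "c"),
    (["Art", "Science"], "d"),  # third [Art, Science] document, over the cap
]
kept = take_per_label(docs, per_tag=2)
```

With `per_tag=2`, documents "a", "b" and "c" are kept: [Art,Science] and [Art,Politics] are counted separately, and only the third [Art,Science] document is dropped.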


### max_count
This parameter specifies how many documents are written to the database. So if you want to read at least 100 documents for each of 5 labels, this value should be > 500.

Unit tests use this value to speed up the crawl process.
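
The sizing rule is plain arithmetic. A minimal sketch, where `required_max_count` is a hypothetical helper and not part of CorpusImporter:

```python
def required_max_count(n_labels, per_tag):
    """Lower bound on max_count needed to hold per_tag documents for every label."""
    return n_labels * per_tag


lower_bound = required_max_count(n_labels=5, per_tag=100)  # 500
```

The run further down with `per_tag=335` reports `Maximum Documents:  3015`, which equals 9 × 335; the notebook does not state how that figure is derived, so treat the connection as an observation rather than documented behavior.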

### clearMemory
Calling
```python
corpus.clearMemory()
```
clears the internal database. This is a good way to get rid of leftover data.

## Autocomplete
The full documentation is available by pressing [SHIFT]+[TAB] on the function crawlNYT()

In [13]:
corpus.clearMemory()
corpus.crawlNYT(per_tag=335, is_multilabel=False)

Maximum Documents:  3015
Reading archives [/home/retkowski/nltk_data/nyt/2007]:  ['01.tgz', '02.tgz', '03.tgz', '04.tgz', '05.tgz', '06.tgz']
0  Documents successfully parsed
500  Documents successfully parsed
500  Documents successfully parsed
500  Documents successfully parsed
1000  Documents successfully parsed
1000  Documents successfully parsed
1500  Documents successfully parsed
1500  Documents successfully parsed
1500  Documents successfully parsed
2000  Documents successfully parsed
2000  Documents successfully parsed
2000  Documents successfully parsed
2000  Documents successfully parsed
2000  Documents successfully parsed
2000  Documents successfully parsed
2000  Documents successfully parsed
2000  Documents successfully parsed
2000  Documents successfully parsed
2000  Documents successfully parsed
2500  Documents successfully parsed
2500  Documents successfully parsed
2500  Documents successfully parsed
2500  Documents successfully parsed
2500  Documents successfully parsed


In [14]:
len(corpus._Collection)

3015

In [19]:
print(list(set([''.join(item.tags) for item in corpus._Collection])))

['New York and Region', 'World', 'Technology; Arts', 'Front Page; Education; New York and Region', 'Movies; Arts', 'Business; Washington', 'Health; Opinion', 'Front Page; New York and Region', 'U.S.; Washington', 'World; Washington', 'Arts; Education; Theater', 'Front Page; Health; U.S.', 'Arts', 'Health; Business; Washington', 'Technology; U.S.', 'Health; U.S.; Washington', 'Business', 'Arts; Education', 'Front Page; U.S.', 'Paid Death Notices', 'Sports', 'Arts; Health; Books', 'Business; Books', 'Arts; Obituaries', 'Arts; Theater', 'Front Page', 'Technology; Business', 'U.S.', 'Opinion']


In [11]:
corpus._Collection[10]

Item(organizations=None, locations=None, names=['Antonina Tumkovsky'], titles=None, tags=['Arts'], text="\n        Antonina Tumkovsky, an influential teacher at the School of American Ballet who trained future stars of the New York City Ballet and other dancers for 54 years, died on Friday at the Tolstoy Foundation Nursing Home in Valley Cottage, N.Y., where she lived. She was 101.\n        A spokesman for the school, Tom Schoff, confirmed her death.\n        From 8-year-old girls to the advanced men's class, all the students at the school faced her strict demands. The school promoted a DVD about her teaching by emphasizing that her ''particular style of teaching was crucial in preparing dancers for the Balanchine repertory.''\n        George Balanchine, a founder of City Ballet and of the school, hired her on the spot when she applied to him for a teaching job in 1949. Unlike him and his staff of earlier émigré teachers from Serge Diaghilev's Ballets Russes and the Russian Imperial Ba

In [2]:
extractElements = [f[0] for f in corpus.DATA_FIELDS]

In [2]:
corpus.clearMemory()

corpus.crawlNYT(ignore_tags=True, nytPaths=["2005"], max_count=500, extractElements=["Headline","OnlineTitles","Titles"])

Reading archives: ['01.tgz', '02.tgz', '03.tgz', '04.tgz', '05.tgz', '06.tgz', '07.tgz', '08.tgz', '09.tgz', '10.tgz', '11.tgz', '12.tgz']


In [5]:
titles = [len(i.Titles) for i in corpus._Collection if i.Titles is not None]
online_titles = [len(i.OnlineTitles) for i in corpus._Collection if i.OnlineTitles is not None]
headline = [len(i.Headline) for i in corpus._Collection if i.Headline is not None]

print("=== Titles ===")
print(len(titles))
print(sum(titles) / len(titles))

print("=== online_titles ===")
print(len(online_titles))
print(sum(online_titles) / len(online_titles))

print("=== Headline ===")
print(len(headline))
print(sum(headline) / len(headline))

=== Titles ===
4214
1.2297104888467014
=== online_titles ===
1871
1.0069481560662747
=== Headline ===
89975
47.443223117532646
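
The three near-identical statistics above could be folded into one helper. The sketch below is an assumption-laden illustration: `field_stats` is a hypothetical function, and the `Item` namedtuple is a stand-in with only three fields so the example runs without the corpus.

```python
from collections import namedtuple

# Stand-in for the items in corpus._Collection; the real Item has more fields.
Item = namedtuple("Item", ["Headline", "OnlineTitles", "Titles"])


def field_stats(items, field):
    """Return (count, mean length) over the items whose `field` is not None."""
    lengths = [
        len(getattr(item, field))
        for item in items
        if getattr(item, field) is not None
    ]
    if not lengths:
        # Avoid a ZeroDivisionError when no item carries the field.
        return 0, None
    return len(lengths), sum(lengths) / len(lengths)


sample = [
    Item(Headline="abc", OnlineTitles=None, Titles=["t1", "t2"]),
    Item(Headline="abcde", OnlineTitles=["o"], Titles=None),
]
for name in ("Titles", "OnlineTitles", "Headline"):
    print("===", name, "===")
    print(*field_stats(sample, name))
```

Note that, as in the cell above, the length of `Headline` is a character count, while `Titles` and `OnlineTitles` are measured in list entries; this is consistent with the much larger headline average.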


In [6]:
len(corpus._Collection)

0

In [None]:
corpus.crawlNYT(max_count=10000000000, ignore_tags=True)

Maximum Documents:  10000000000
Reading archives [/home/retkowski/nltk_data/nyt/2007]:  ['01.tgz', '02.tgz', '03.tgz', '04.tgz', '05.tgz', '06.tgz']
Reading archives [/home/retkowski/nltk_data/nyt/2006]:  ['01.tgz', '02.tgz', '03.tgz', '04.tgz', '05.tgz', '06.tgz', '07.tgz', '08.tgz', '09.tgz', '10.tgz', '11.tgz', '12.tgz']
Reading archives [/home/retkowski/nltk_data/nyt/2005]:  ['01.tgz', '02.tgz', '03.tgz', '04.tgz', '05.tgz', '06.tgz', '07.tgz', '08.tgz', '09.tgz', '10.tgz', '11.tgz', '12.tgz']
Reading archives [/home/retkowski/nltk_data/nyt/2004]:  ['01.tgz', '02.tgz', '03.tgz', '04.tgz', '05.tgz', '06.tgz', '07.tgz', '08.tgz', '09.tgz', '10.tgz', '11.tgz', '12.tgz']
