# Exercises "Lecture 8: Exploratory Data Analysis and Visualisation"

In this session, we will compute statistics and visualisations on Wikipedia articles from 16 categories, namely:

> Airports, Artists, Astronauts, Astronomical_objects, Building, City, Comics_characters, Companies, Foods, Monuments_and_memorials, Politicians, Sports_teams, Sportspeople, Transport, Universities_and_colleges, Written_communication.

Data: the **wkp/** directory, containing .txt files

Python libraries:
- [os](https://docs.python.org/3.8/library/os.html), for listdir() to list files in a directory 
- [glob](https://docs.python.org/3/library/glob.html), for listing files in a directory whose names match certain patterns
- [re](https://docs.python.org/3.8/library/re.html), for regular expressions 
- pandas
- spacy (or Stanza)

In [None]:
# LOAD THE LIBRARIES

## Regexp and loading text files into a Pandas dataframe

**Exercise 1** 

* Get the list of file names in the **wkp/** directory
* Hint: You can use os.path.basename to help you
* Use a regexp together with the list of categories (given above: 'Airports', 'Artists', ...) to split each file name into 'id' and 'category'. For example:

> File Name: 'Monteverde_Angel_Monuments_and_memorials'

is split into: 

> Id: 'Monteverde_Angel'

> Category: 'Monuments_and_memorials'

* Store each processed file name in a list of lists of the form
```[[File name, Id, Category], ...]```
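
As a hint, the splitting step could be sketched as follows. This is one possible approach, not the expected solution: the names `CATEGORIES`, `FILENAME_RE` and `split_filename` are hypothetical, and the sketch assumes every file name ends with one of the 16 categories, optionally followed by a `.txt` extension.

```python
import os
import re

# The 16 categories listed at the top of this notebook.
CATEGORIES = [
    "Airports", "Artists", "Astronauts", "Astronomical_objects", "Building",
    "City", "Comics_characters", "Companies", "Foods",
    "Monuments_and_memorials", "Politicians", "Sports_teams", "Sportspeople",
    "Transport", "Universities_and_colleges", "Written_communication",
]

# The category alternation is anchored at the end of the name, so the lazy
# first group captures everything before it as the id.
FILENAME_RE = re.compile(
    r"^(.+?)_(" + "|".join(map(re.escape, CATEGORIES)) + r")$"
)

def split_filename(path):
    """Return [file name, id, category], or None if no category matches."""
    # Drop the directory part and the extension (assumed to be .txt).
    name = os.path.splitext(os.path.basename(path))[0]
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return [name, match.group(1), match.group(2)]

example = split_filename("wkp/Monteverde_Angel_Monuments_and_memorials.txt")
print(example)
```

Building the pattern from the category list avoids guessing where the id ends, which matters because both ids and categories contain underscores.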

In [None]:
# YOUR CODE HERE

**Exercise 2** 
* Extract the content of each file (use **read()**, cf. python_basics cheatsheet)
* Create a list of lists of the form [id, category, file_content]. Save it to a variable "data4pandas", e.g.,

```
data4pandas = [['Monteverde_Angel', 'Monuments_and_memorials', 'The Monteverde Angel or Angel of the Resurrect ....'], ...]
```
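
A minimal sketch of the reading step, run on a temporary directory that stands in for **wkp/**. The helper names `read_text` and `build_rows` and the toy file are hypothetical; the sketch assumes UTF-8 encoded files and file names stored without the `.txt` extension.

```python
import os
import tempfile

def read_text(path):
    """Return the full content of a UTF-8 text file as one string."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()

def build_rows(directory, parsed_names):
    """Turn [file name, id, category] triples into [id, category, text] rows."""
    rows = []
    for file_name, doc_id, category in parsed_names:
        path = os.path.join(directory, file_name + ".txt")
        rows.append([doc_id, category, read_text(path)])
    return rows

# Toy demonstration: one made-up file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "Demo_Page_Foods.txt")
    with open(toy_path, "w", encoding="utf-8") as handle:
        handle.write("A made-up article body.")
    demo_rows = build_rows(tmp, [["Demo_Page_Foods", "Demo_Page", "Foods"]])

print(demo_rows)
```

Using `with open(...)` closes each file as soon as it has been read, which is worth doing when looping over many files.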

In [None]:
# YOUR CODE HERE

**Exercise 3** 

* Create a dataframe from this list of lists (i.e. data4pandas). Remember to add the following column headers: 'id', 'category' and 'text' (cf. pandas CS). Save this dataframe to a variable called 'df' (by convention, names of pandas dataframes start with 'df').
* Inspect the first 10 and last 10 rows
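
For illustration, here is how a list of lists becomes a dataframe with named columns. The rows in `toy_rows` are made up and `df_toy` is a hypothetical name; your own code would use `data4pandas` and `df` instead.

```python
import pandas as pd

# Made-up rows in the same [id, category, text] layout as data4pandas.
toy_rows = [
    ["Demo_Page", "Foods", "A made-up article body."],
    ["Other_Page", "City", "Another made-up body."],
]

df_toy = pd.DataFrame(toy_rows, columns=["id", "category", "text"])

# head(n) and tail(n) return the first and last n rows (fewer if the frame is shorter).
print(df_toy.head(10))
print(df_toy.tail(10))
```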

In [None]:
# YOUR CODE HERE

## Extract the list of categories

**Exercise 4** 
    
- Store the content of the **'category'** column into a string (cf. Pandas CS)
- Extract the set of unique categories from that string (cf. python basic CS)

You should find the following 16 categories (the order may differ, since sets are unordered):

```
['Comics_characters', 'Astronauts', 'Transport', 'Artists', 'Written_communication', 'Sports_teams', 'Foods', 'Airports', 'Monuments_and_memorials', 'Politicians', 'Sportspeople', 'Building', 'Universities_and_colleges', 'Astronomical_objects', 'Companies', 'City']
```
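
One way to go from column to string to set, shown on a toy frame (`df_toy` and its values are made up). The sketch relies on the fact that category names contain underscores rather than spaces, so splitting the joined string on whitespace recovers them intact.

```python
import pandas as pd

df_toy = pd.DataFrame({"category": ["Foods", "City", "Foods", "Sports_teams"]})

# Join the whole column into a single space-separated string.
all_categories = df_toy["category"].str.cat(sep=" ")

# Split it back into names and let set() drop the duplicates;
# sorted() is only there to make the printed order reproducible.
unique_categories = sorted(set(all_categories.split()))
print(unique_categories)
```

pandas also offers `Series.unique()`, which gives the distinct values directly without going through a string.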

In [None]:
# YOUR CODE HERE

## Extract the list of headers from the 'text' column

**Exercise 5** 

Hint: In the Wikipedia articles, headers are surrounded by "==" 

*E.g.,* `== Background ==`

- Define a function called 'get_title' which extracts headers from a text (Use a regular expression)
- Apply this function to the **'text'** column in your pandas data frame (use the pandas 'apply' method)
- Store the result (the list of headers associated with each text in the frame) into a new pandas Series called 'headers'
- Concatenate this series to your pandas dataframe
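
The steps above can be sketched on a toy frame as follows. The pattern `HEADER_RE` is an assumption about the markup: it expects each header on its own line, delimited by two or more `=` signs on each side (so that sub-headers such as `=== ... ===` are also caught). Adjust it if your files differ.

```python
import re
import pandas as pd

# Two or more '=' on each side, optional spaces or tabs, header text in between.
HEADER_RE = re.compile(r"^==+[ \t]*(.*?)[ \t]*==+[ \t]*$", re.MULTILINE)

def get_title(text):
    """Return the list of header strings found in one article."""
    return HEADER_RE.findall(text)

# Made-up texts: one with two headers, one with none.
df_toy = pd.DataFrame({
    "text": [
        "Intro.\n== Background ==\nBody.\n=== Early life ===\nMore.",
        "No headers here.",
    ]
})

# apply() calls get_title once per row; rename() sets the future column name.
headers = df_toy["text"].apply(get_title).rename("headers")

# axis=1 concatenates side by side, aligning on the row index.
df_toy = pd.concat([df_toy, headers], axis=1)
print(df_toy)
```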

In [None]:
# YOUR CODE HERE

## Extracting the vocabulary of each category

For each category, we extract the corresponding vocabulary i.e., the list of tokens occurring in the corresponding texts (removing the duplicates)


Optional: for each category
- extract the list of headers
- extract the nouns and verbs

**Exercise 6**

* Write a function called "remove_underscores" that takes a python string and replaces every '_' in it with a whitespace ' ', e.g. "This_is_a_text" becomes "This is a text"
* Write a function called "lowercase_string" that takes a python string and lowercases it, e.g. "This is a text" becomes "this is a text"
* Apply both the remove_underscores and lowercase_string functions to the **'text'** column of your dataframe. Save the output into a new column in your dataframe called 'clean_text' (consider using method chaining)
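
A small sketch of the chaining idea on a toy frame, assuming the raw article text lives in a 'text' column as created in Exercise 3 (`df_toy` and its contents are made up).

```python
import pandas as pd

def remove_underscores(text):
    """Replace every underscore with a single space."""
    return text.replace("_", " ")

def lowercase_string(text):
    """Return the lowercased version of the string."""
    return text.lower()

df_toy = pd.DataFrame({"text": ["This_is_a_text", "Another_Text"]})

# Method chaining: each apply() returns a new Series that feeds the next call.
df_toy["clean_text"] = (
    df_toy["text"]
    .apply(remove_underscores)
    .apply(lowercase_string)
)
print(df_toy)
```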

In [None]:
# YOUR CODE HERE

**Exercise 7**

- Define a function 'get_tokens' which, given a category, returns its vocabulary (the list of tokens occurring in the texts of that category, after removing the duplicates). One way to do this is to:
   - extract the category subframe i.e., all rows whose category column matches the input category
   - create a string out of the text column of that subframe (use str.cat(sep=" "), cf. Pandas CS)
   - run spacy or Stanza model on this string and extract the tokens from the resulting document (cf. Stanza or spacy CS)
   - use python set method to remove duplicate tokens
   - use python list method to convert the resulting set back into a list
- Create a new dataframe with headers **'CATEGORY'** and **'VOCABULARY'** in which you store for each category the corresponding vocabulary
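
The outline above might look like the sketch below. To keep it runnable without a language model, it uses a plain whitespace tokenizer (`whitespace_tokenize`, a hypothetical stand-in) where the exercise asks for spaCy or Stanza. Passing the dataframe and the tokenizer as arguments is a design choice of this sketch, not a requirement, and the 'clean_text' column is assumed to exist from Exercise 6.

```python
import pandas as pd

def whitespace_tokenize(text):
    """Stand-in tokenizer (an assumption); swap in a spaCy or Stanza pipeline."""
    return text.split()

def get_tokens(df, category, tokenize=whitespace_tokenize):
    """Return the de-duplicated list of tokens for one category."""
    # Keep only the rows whose category matches.
    subframe = df[df["category"] == category]
    # Join all texts of that category into one string.
    joined = subframe["clean_text"].str.cat(sep=" ")
    # set() removes duplicates, list() converts back to a list.
    return list(set(tokenize(joined)))

# Made-up data for demonstration.
df_toy = pd.DataFrame({
    "category": ["Foods", "Foods", "City"],
    "clean_text": ["bread and butter", "bread and jam", "a big city"],
})

df_vocab = pd.DataFrame({"CATEGORY": sorted(df_toy["category"].unique())})
df_vocab["VOCABULARY"] = df_vocab["CATEGORY"].apply(
    lambda category: get_tokens(df_toy, category)
)
print(df_vocab)
```

With spaCy, the tokenizer could be something like `lambda text: [token.text for token in nlp(text)]`, where `nlp` is a pipeline you have loaded yourself. Very long joined strings may exceed the pipeline's `max_length` setting, in which case processing the texts one by one is an alternative.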

In [None]:
# YOUR CODE HERE

## Visualising the differences in vocabulary size

**Exercise 8**

- Use pandas 'apply' method to compute the size of each category's vocabulary (the number of tokens)
- Add a **'VOCAB SIZE'** column to the dataframe created in the previous exercise, holding the size of the vocabulary for each category
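
As an illustration on made-up data (`df_vocab` here is a toy stand-in for the frame built in the previous exercise), `apply` can take the built-in `len` directly:

```python
import pandas as pd

df_vocab = pd.DataFrame({
    "CATEGORY": ["City", "Foods"],
    "VOCABULARY": [["a", "big", "city"], ["and", "bread", "butter", "jam"]],
})

# len is called once per row, on the list stored in that row.
df_vocab["VOCAB SIZE"] = df_vocab["VOCABULARY"].apply(len)
print(df_vocab)
```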

In [None]:
# YOUR CODE HERE

**Exercise 9**

Create a bar plot showing the **VOCAB SIZE** of each **CATEGORY** (use e.g. the pandas `DataFrame.plot.barh()` method)

- the y axis should show the categories
- the x axis should show the vocabulary size
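
A possible sketch with made-up numbers, assuming matplotlib is installed (pandas plotting depends on it). The `Agg` backend line is only there so the snippet runs outside a notebook; inside Jupyter it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import pandas as pd

# Made-up sizes, for illustration only.
df_vocab = pd.DataFrame({
    "CATEGORY": ["City", "Foods", "Airports"],
    "VOCAB SIZE": [120, 95, 80],
})

# barh draws horizontal bars: x= names the column used for the bar labels
# (shown on the y axis) and y= the column giving the bar lengths.
ax = df_vocab.plot.barh(x="CATEGORY", y="VOCAB SIZE", legend=False)
ax.set_xlabel("Vocabulary size")
ax.set_ylabel("Category")
ax.figure.tight_layout()
```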

In [None]:
# YOUR CODE HERE

**Exercise 10**
* Create a scatter plot showing the relationship between the category of an article and its number of headers
* Reminder: the headers are stored in the pandas dataframe saved to the 'df' variable
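
One way to read this exercise is to plot one point per article, with the header count on one axis and the category on the other. The sketch below does that on made-up data using matplotlib directly; the `n_headers` column name is hypothetical, and the `Agg` backend line can be dropped inside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import matplotlib.pyplot as plt
import pandas as pd

# Made-up frame with the same 'category' and 'headers' columns as df.
df_toy = pd.DataFrame({
    "category": ["Foods", "Foods", "City"],
    "headers": [["History", "Recipes"], ["Origin"], ["Geography", "Economy", "Culture"]],
})

# Count the headers of each article.
df_toy["n_headers"] = df_toy["headers"].apply(len)

fig, ax = plt.subplots()
# String values are treated as categories and placed at evenly spaced positions.
ax.scatter(df_toy["n_headers"], df_toy["category"])
ax.set_xlabel("Number of headers")
ax.set_ylabel("Category")
fig.tight_layout()
```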

In [None]:
# YOUR CODE HERE