# Exercises "Lecture 8: Exploratory Data Analysis and Visualisation"

In this session, we will compute statistics and build visualisations for Wikipedia articles from 16 categories, namely:

> Airports, Artists, Astronauts, Astronomical_objects, Building, City, Comics_characters, Companies, Foods, Monuments_and_memorials, Politicians, Sports_teams, Sportspeople, Transport, Universities_and_colleges, Written_communication.

Data: the wkp directory, which contains .txt files

Python libraries:
- [os](https://docs.python.org/3.8/library/os.html), for listdir() to list files in a directory 
- [glob](https://docs.python.org/3/library/glob.html), for listing files in a directory whose names match certain patterns
- [re](https://docs.python.org/3.8/library/re.html), for regular expressions 
- pandas
- spacy (or Stanza)

In [1]:
# LOAD THE LIBRARIES

import os 
import glob
import re
import pandas as pd 
import spacy

## Regexp and loading text files into a Pandas dataframe

**Exercise 1** 

* Get the list of file names in the **wkp/** directory
* Hint: You can use os.path.basename to help you
* Use a regexp together with the list of categories (given above: 'Airports', 'Artists'....) to split each file name  into 'id' and 'category'. For example: 

> File Name: 'Monteverde_Angel_Monuments_and_memorials'

is split into: 

> Id: 'Monteverde_Angel'

> Category: 'Monuments_and_memorials'

* Store each processed file name in a list of lists of the form
```[[File name, Id, Category], ...]```

In [25]:
# Define the categories and join them into a regex alternation
categories = [
    'Airports', 'Artists', 'Astronauts', 'Astronomical_objects', 
    'Building', 'City', 'Comics_characters', 'Companies', 'Foods', 
    'Monuments_and_memorials', 'Politicians', 'Sports_teams', 
    'Sportspeople', 'Transport', 'Universities_and_colleges', 
    'Written_communication'
]

category_re = '|'.join(categories)

# Extract the raw names of the files
directory_path = r'C:\Users\belen\Desktop\Université de Lorraine\Second semester\Data_science_P2\Lab_3_Pandas\wkp'
file_names = os.listdir(directory_path)

# Process each file name
names_list = []

for file_name in file_names:
    # Get the base name
    base_name = os.path.basename(file_name)

    # Use regex to find the category in the file name
    match = re.search(rf"({category_re})", base_name)
    if match:
        category = match.group(1)
        # Strip a '_<category>_' infix and everything from the first dot onward
        file_id = re.sub(rf"_{category}_|\..*$", "", base_name)
        names_list.append([file_id, category])

print(len(names_list))

print(names_list)

160
[['Airports_of_Serbia_Airports', 'Airports'], ['Airport_authority_Airports', 'Airports'], ['Airport_bus_Airports', 'Airports'], ['Airport_check-in_Airports', 'Airports'], ['Airport_security_Airports', 'Airports'], ['Airspace_Airports', 'Airports'], ['Airspace_Transport', 'Transport'], ['Airway_beacon_Airports', 'Airports'], ['Airway_beacon_Transport', 'Transport'], ['Aisam-ul-Haq_Qureshi_career_statistics_Sportspeople', 'Sportspeople'], ['Aish_tamid_Monuments_and_memorials', 'Monuments_and_memorials'], ['Aita_Mari_Transport', 'Transport'], ['Aiwowo_Foods', 'Foods'], ['Ajilimójili_Foods', 'Foods'], ['Ajman_International_Airport_Airports', 'Airports'], ['Ají_(sauce)_Foods', 'Foods'], ['Akabeko_Building', 'Building'], ['Akaflieg_Transport', 'Transport'], ['Akay_Artists', 'Artists'], ['Akie_Dagogo_Fubara_Transport', 'Transport'], ['Akira_Ryō_Sportspeople', 'Sportspeople'], ['Akira_Toriyama_(ophthalmologist)_Universities_and_colleges', 'Universities_and_colleges'], ['Akkamma_Devi_Pol
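
In the output above, each id still ends with its category (for example 'Airports_of_Serbia_Airports'): the substitution looks for `_<category>_`, but in these file names the category is followed by the extension, not by another underscore. A minimal sketch of one alternative, assuming every file is named `<id>_<category>.txt`, anchors the category at the end of the name and captures the id in front of it. The helper name `split_file_name` and the sample file names are illustrative assumptions, not part of the exercise material.

```python
import re

categories = [
    'Airports', 'Artists', 'Astronauts', 'Astronomical_objects',
    'Building', 'City', 'Comics_characters', 'Companies', 'Foods',
    'Monuments_and_memorials', 'Politicians', 'Sports_teams',
    'Sportspeople', 'Transport', 'Universities_and_colleges',
    'Written_communication'
]

category_re = '|'.join(re.escape(c) for c in categories)

# Greedy id, then '_', then a category that must sit right before '.txt'
name_re = re.compile(rf"^(?P<id>.+)_(?P<category>{category_re})\.txt$")


def split_file_name(file_name):
    """Return [file_name, id, category], or None if the name does not fit the pattern."""
    match = name_re.match(file_name)
    if match is None:
        return None
    return [file_name, match.group('id'), match.group('category')]


# Hypothetical file names, used only to illustrate the behaviour
examples = ['Monteverde_Angel_Monuments_and_memorials.txt',
            'Airports_of_Serbia_Airports.txt']
names_with_files = [row for row in map(split_file_name, examples) if row is not None]
print(names_with_files)
```

Because the category is matched only at the end of the name, a category word that also occurs earlier in the title (as in 'Airports_of_Serbia') stays part of the id.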

**Exercise 2** 
* Extract the content of each file (use **read()**, cf. python_basics cheatsheet)
* Create a list of lists of the form [id, category, file_content]. Save it to a variable "data4pandas", e.g.,

```
data4pandas = [['Monteverde_Angel', 'Monuments_and_memorials', 'The Monteverde Angel or Angel of the Resurrect ....'], ...]
```

In [35]:
# Read each file in the directory and collect [id, category, content] rows

data4pandas = []

for item in names_list:
    file_id, category = item  # Unpack the [id, category] pair

    # Find the first file name that contains both the category and the ID; None if there is none
    file_name = next((fn for fn in file_names if re.search(rf"_{category}", fn) and file_id in fn), None)
    if not file_name:
        print(f"No matching file found for: ID = {file_id}, Category = {category}")
        continue

    file_path = os.path.join(directory_path, file_name)

    # Read the file content
    with open(file_path, 'r', encoding='utf-8') as file:
        file_content = file.read()

    data4pandas.append([file_id, category, file_content])

No matching file found for: ID = Al_ZaquraMonuments_and_memorials, Category = Building
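
If the file name is kept next to the id and category, as the [File name, Id, Category] layout from Exercise 1 suggests, the file can be opened directly and no second search through the directory listing is needed. The following is a minimal sketch under that assumption; `read_rows` is a hypothetical helper, and the temporary directory with one made-up file only stands in for the wkp directory.

```python
import os
import tempfile


def read_rows(directory_path, name_rows):
    """Turn [file_name, id, category] rows into [id, category, file_content] rows."""
    rows = []
    for file_name, file_id, category in name_rows:
        file_path = os.path.join(directory_path, file_name)
        with open(file_path, 'r', encoding='utf-8') as handle:
            rows.append([file_id, category, handle.read()])
    return rows


# Demo on a throwaway directory containing a single made-up file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_name = 'Monteverde_Angel_Monuments_and_memorials.txt'
    with open(os.path.join(tmp_dir, demo_name), 'w', encoding='utf-8') as handle:
        handle.write('The Monteverde Angel ...')
    demo_rows = read_rows(
        tmp_dir, [[demo_name, 'Monteverde_Angel', 'Monuments_and_memorials']]
    )

print(demo_rows)
```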


**Exercise 3** 

* Create a dataframe from this list of lists (i.e. data4pandas). Remember to add the following column headers: 'id', 'category' and 'text' (cf. pandas CS). Save this dataframe to a variable called 'df' (by convention, names of pandas dataframes start with 'df').
* Inspect the first 10 and the last 10 rows

In [39]:
headers = ['id', 'category','text']

df = pd.DataFrame(data4pandas, columns=headers)

#print(df.head(10))
print(df.tail(10))

                                                   id           category  \
149  UP_Fighting_Maroons_Volleyball_Team_Sports_teams       Sports_teams   
150                            Uran_Butka_Politicians        Politicians   
151                   Victor_Mancha_Comics_characters  Comics_characters   
152                                      Vidnava_City               City   
153                   Vladimir_Dzhanibekov_Astronauts         Astronauts   
154                                       Votice_City               City   
155                              Wetted_area_Building           Building   
156                           William_Hogarth_Artists            Artists   
157                     Wolfgang_Nordwig_Sportspeople       Sportspeople   
158                                 Židlochovice_City               City   

                                                  text  
149  The University of the Philippines Fighting Mar...  
150  Uran Butka (2 December 1938) is an Albanian 

## Extract the list of categories

**Exercise 4** 
    
- Store the content of the **'category'** column into a string (cf. Pandas CS)
- Extract the set of unique categories from that string (cf. python basic CS)

You should find the following 16 categories:

```
['Comics_characters', 'Astronauts', 'Transport', 'Artists', 'Written_communication', 'Sports_teams', 'Foods', 'Airports', 'Monuments_and_memorials', 'Politicians', 'Sportspeople', 'Building', 'Universities_and_colleges', 'Astronomical_objects', 'Companies', 'City']
```

In [41]:
category_strings = df['category'].unique()
print(category_strings)
print(len(category_strings))

['Airports' 'Transport' 'Sportspeople' 'Monuments_and_memorials' 'Foods'
 'Building' 'Artists' 'Universities_and_colleges' 'Politicians'
 'Companies' 'Written_communication' 'Astronomical_objects'
 'Comics_characters' 'Astronauts' 'City' 'Sports_teams']
16
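
The cell above uses `unique()` directly on the column. The route described in the exercise (one string, then a set) can be sketched as follows. The toy frame `df_toy` is an assumption standing in for `df`, and the approach relies on category names containing no spaces.

```python
import pandas as pd

# Toy frame standing in for df; only the 'category' column matters here
df_toy = pd.DataFrame({'category': ['City', 'Foods', 'City', 'Airports', 'Foods']})

# Join the whole column into one space-separated string
category_string = df_toy['category'].str.cat(sep=' ')

# Category names contain no spaces, so splitting on whitespace recovers them;
# the set removes duplicates and sorted() gives a stable order for printing
unique_categories = sorted(set(category_string.split()))
print(unique_categories)
```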


## Extract the list of headers from the 'text' column

**Exercise 5** 

Hint: In the Wikipedia articles, headers are surrounded by "==" 

_*E.g., ==  Background == *_

- Define a function called 'get_title' which extracts headers from a text (Use a regular expression)
- Apply this function to the **'text'** column in your pandas data frame (use the pandas 'apply' method)
- Store the result (the list of headers associated with each text in the frame) in a new pandas series called 'headers'
- Concatenate this series to your pandas dataframe

In [42]:
def get_title(df, column_name):
    # Match a header: a title enclosed by runs of two or more '=' on each side,
    # so that '=== Subheader ===' is captured with balanced markers
    reg = r'={2,}[^=\n]+?={2,}'

    # Extract all headers from each text in the given column into a new 'headers' column
    df['headers'] = df[column_name].apply(lambda text: re.findall(reg, text))

    return df

In [43]:
get_title(df, 'text')

| | id | category | text | headers |
|---|---|---|---|---|
| 0 | Airports_of_Serbia_Airports | Airports | Airports of Serbia (Serbian Cyrillic: Аеродром... | [== Airports ==, == References ==, == External... |
| 1 | Airport_authority_Airports | Airports | An airport authority is an independent entity ... | [== Examples of airport authorities overseeing... |
| 2 | Airport_bus_Airports | Airports | An airport bus, or airport shuttle bus or airp... | [== On airport transfer ==, === Airside transf... |
| 3 | Airport_check-in_Airports | Airports | Airport check-in is the process whereby passen... | [== Types of check-in ==, === Destination or P... |
| 4 | Airport_security_Airports | Airports | Airport security refers to the techniques and ... | [== Description ==, == Airport enforcement aut... |
| ... | ... | ... | ... | ... |
| 154 | Votice_City | City | Votice (Czech pronunciation: [ˈvocɪtsɛ]; Germa... | [== Administrative parts ==, == History ==, ==... |
| 155 | Wetted_area_Building | Building | The surface area that interacts with the worki... | [== References ==] |
| 156 | William_Hogarth_Artists | Artists | William Hogarth (; 10 November 1697 – 26 Octo... | [== Early life ==, == Career ==, === Early wor... |
| 157 | Wolfgang_Nordwig_Sportspeople | Sportspeople | Wolfgang Nordwig (born 27 August 1943) is a fo... | [== Athletic career ==, === World rankings ==,... |
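
The exercise describes `get_title` as a function of a single text that is then passed to `apply`, with the result stored as a series and concatenated to the frame. A minimal sketch of that shape follows. It assumes each header sits on its own line, and it returns the titles without the '=' markers, which is a design choice rather than a requirement of the exercise. The toy frame `df_toy` is made up for illustration.

```python
import re
import pandas as pd

# A run of two or more '=', the title, then the same run of '=' closing the line
header_re = re.compile(r'^(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$', re.MULTILINE)


def get_title(text):
    """Return the header titles found in one article text, without the '=' markers."""
    return [match.group(2) for match in header_re.finditer(text)]


df_toy = pd.DataFrame({'text': [
    'Intro.\n\n== Background ==\nSome text.\n\n=== Early life ===\nMore text.',
    'No headers here.',
]})

# apply() calls get_title once per text; rename() gives the new series its column name
headers = df_toy['text'].apply(get_title).rename('headers')

# Concatenate the series to the frame as an extra column
df_toy = pd.concat([df_toy, headers], axis=1)
print(df_toy['headers'].tolist())
```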


## Extracting the vocabulary of each category

For each category, we extract the corresponding vocabulary, i.e., the list of tokens occurring in the corresponding texts (with duplicates removed).


Optional: for each category
- extract the list of headers
- extract the nouns and verbs

**Exercise 6**

* Write a function called "remove_underscores" that takes a Python string and replaces every '_' in it with a whitespace ' ', e.g. "This_is_a_text" becomes "This is a text"
* Write a function called "lowercase_string" that takes a Python string and lowercases it, e.g. "This is a text" becomes "this is a text"
* Apply both remove_underscores and lowercase_string to the **'text'** column of your dataframe. Save the output into a new column in your dataframe called 'clean_text' (consider using method chaining)
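
A minimal sketch of the two helpers and of the chained `apply` calls, run on a made-up frame `df_toy` that stands in for `df`:

```python
import pandas as pd


def remove_underscores(text):
    """Replace every underscore with a single space."""
    return text.replace('_', ' ')


def lowercase_string(text):
    """Lowercase the whole string."""
    return text.lower()


# Toy frame standing in for df
df_toy = pd.DataFrame({'text': ['This_is_a_text', 'Airport_Bus stops HERE']})

# Method chaining: each apply() returns a new Series, so the calls can be stacked
df_toy['clean_text'] = df_toy['text'].apply(remove_underscores).apply(lowercase_string)
print(df_toy['clean_text'].tolist())
```

The original 'text' column is left untouched, so the raw articles remain available for the header extraction.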

In [None]:
def remove_underscores(text):
    # Replace every underscore with a single space
    return text.replace('_', ' ')

**Exercise 7**

- Define a function 'get_tokens' which, given a category, returns its vocabulary (the list of tokens occurring in the texts of that category, with duplicates removed). One way to do this is to:
   - extract the category subframe i.e., all rows whose category column matches the input category
   - create a string out of the text column of that subframe (use str.cat(sep=" "), cf. Pandas CS)
   - run spacy or Stanza model on this string and extract the tokens from the resulting document (cf. Stanza or spacy CS)
   - use python set method to remove duplicate tokens
   - use python list method to convert the resulting set back into a list
- Create a new dataframe with headers **'CATEGORY'** and **'VOCABULARY'** in which you store for each category the corresponding vocabulary

In [None]:
# YOUR CODE HERE
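
The steps listed for Exercise 7 can be sketched as below. Two things here are assumptions made so that the example runs without a downloaded model: the signature `get_tokens(df, category, tokenize=...)`, which passes the frame and the tokenizer explicitly, and `simple_tokenize`, a regex stand-in that is not spaCy or Stanza. The toy frame is made up.

```python
import re
import pandas as pd


def simple_tokenize(text):
    """Stand-in tokenizer (an assumption, not spaCy): word runs or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def get_tokens(df, category, tokenize=simple_tokenize):
    """Return the vocabulary (unique tokens) of all texts in one category."""
    subframe = df[df['category'] == category]    # rows of that category
    joined = subframe['text'].str.cat(sep=' ')   # one string for the whole category
    return list(set(tokenize(joined)))           # the set drops duplicate tokens


df_toy = pd.DataFrame({
    'category': ['Foods', 'Foods', 'City'],
    'text': ['ajilimójili is a sauce', 'a sauce is a food', 'votice is a town'],
})

df_vocab = pd.DataFrame(
    [[category, get_tokens(df_toy, category)] for category in df_toy['category'].unique()],
    columns=['CATEGORY', 'VOCABULARY'],
)
print(df_vocab)
```

To tokenize with spaCy instead, one option is to create a pipeline once, for example `nlp = spacy.blank("en")`, and pass `tokenize=lambda text: [token.text for token in nlp(text)]`.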

## Visualising the differences in vocabulary size

**Exercise 8**

- Use pandas 'apply' method to compute the size of each category's vocabulary (the number of tokens)
- Add a **'VOCAB SIZE'** column to the dataframe created in the previous exercise, in which you store the size of the vocabulary for each category

In [None]:
# YOUR CODE HERE
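
Since each 'VOCABULARY' cell holds a list, passing the built-in `len` to `apply` is one way to get the sizes. A minimal sketch on a made-up frame with the column names from Exercise 7:

```python
import pandas as pd

# Made-up vocabularies, standing in for the frame built in Exercise 7
df_vocab = pd.DataFrame({
    'CATEGORY': ['Foods', 'City'],
    'VOCABULARY': [['a', 'is', 'sauce', 'food'], ['votice', 'is', 'town']],
})

# len() of each vocabulary list is the number of unique tokens
df_vocab['VOCAB SIZE'] = df_vocab['VOCABULARY'].apply(len)
print(df_vocab[['CATEGORY', 'VOCAB SIZE']])
```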

**Exercise 9**

Create a barplot showing the **VOCAB SIZE** of each **Category** (use e.g. the DataFrame.plot.barh() method)

- the y axis should show the categories
- the x axis should show the vocabulary size

In [None]:
# YOUR CODE HERE
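
A minimal sketch of such a plot with `DataFrame.plot.barh`, using made-up sizes. Sorting by size before plotting is an optional choice that makes the bars easier to compare.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up sizes, standing in for the frame from Exercise 8
df_vocab = pd.DataFrame({
    'CATEGORY': ['Foods', 'City', 'Airports'],
    'VOCAB SIZE': [4, 3, 7],
})

# Horizontal bars: categories on the y axis, vocabulary size on the x axis
ax = df_vocab.sort_values('VOCAB SIZE').plot.barh(
    x='CATEGORY', y='VOCAB SIZE', legend=False
)
ax.set_xlabel('Vocabulary size')
ax.set_ylabel('Category')
plt.tight_layout()
```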

**Exercise 10**
* Create a scatter plot showing how the number of headers per article relates to its category
* Reminder: you have the headers stored in the pandas dataframe saved to the 'df' variable

In [None]:
# YOUR CODE HERE
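
One possible reading of this exercise is one point per article, with the category on the x axis and the number of headers on the y axis. The sketch below follows that reading on a made-up frame. It calls matplotlib's `Axes.scatter` directly, which accepts string categories on an axis; the column name `n_headers` is an assumption.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows, standing in for df with its 'headers' column
df_toy = pd.DataFrame({
    'category': ['City', 'City', 'Foods', 'Artists'],
    'headers': [
        ['== History =='],
        ['== History ==', '== Geography =='],
        [],
        ['== Career ==', '== Legacy ==', '== References =='],
    ],
})

# Number of headers per article
df_toy['n_headers'] = df_toy['headers'].apply(len)

# One point per article: x is the category, y is the header count
fig, ax = plt.subplots()
ax.scatter(df_toy['category'].tolist(), df_toy['n_headers'].tolist())
ax.set_xlabel('Category')
ax.set_ylabel('Number of headers')
ax.tick_params(axis='x', rotation=90)
fig.tight_layout()
```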