# Python Web Scraping with BeautifulSoup

## Description

In this project, we will learn how to scrape a static website. We want to extract the names of all the artists beginning with the letter "Z", along with their nationality, birth info, and a link to their bio, from the [National Gallery of Art Website](https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm "Artist Names Website").

The listing spans four pages, so we will use a for loop to iterate through all of them. We will use Python's BeautifulSoup library to parse each page and extract the info.
   
## Objectives
- [x] [Scrape Artist Information](#Scraping-Artist-Information)
- [x] [Write output data into a CSV file](#Writing-a-CSV-file)
- [x] [Read CSV into a Pandas DataFrame](#Reading-CSV-file-into-a-pandas-DataFrame)

### Importing libraries

In [20]:
import pandas as pd
import csv
import requests
import chardet
from bs4 import BeautifulSoup

### Scraping Artist Information

In [21]:
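pages = []  # URLs of the four listing pages
tup = []    # one (name, nationality, age, link) tuple per artist

# Build the URL of each page (anZ1.htm to anZ4.htm).
for i in range(1, 5):
    url = "https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ" + str(i) + '.htm'
    pages.append(url)

for items in pages:
    page = requests.get(items)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Remove the AlphaNav element so its links are not collected below.
    last_links = soup.find(class_="AlphaNav")
    last_links.decompose()

    # The artist entries are inside the element with class "BodyText".
    namelist = soup.find(class_="BodyText")
    namelist_items = namelist.find_all(valign='top')

    for item in namelist_items:
        # Replace the comma in the artist's name with a space.
        name = item.find('a').text.replace(",", " ")
        links = 'https://web.archive.org' + str(item.find('a').get('href'))
        # The second cell holds comma-separated nationality and date info.
        info = item.find('td').find_next_sibling('td').text
        nationality = info.split(",")[0]
        age = info.split(",")[-1]
        tup.append((name, nationality, age, links))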


**Code Review 1:**
```Python

for i in range(1,5):
    url="https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ"+ str(i) +'.htm'
    pages.append(url)
```

The artist names website has 4 pages in total. The code above builds the URL of each page by inserting the page number (1 to 4) into the address, and appends each URL to the list initialized as `pages=[]`.
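
The same list can also be built in a single step. The sketch below is only an alternative spelling of the loop above; the names `BASE_URL` and `build_page_urls` are introduced here for illustration and are not part of the original notebook.

```python
# Hypothetical helper names; the URL pattern is the one used in this project.
BASE_URL = "https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ"


def build_page_urls(first=1, last=4):
    """Return the listing-page URLs for page numbers first..last (inclusive)."""
    return [f"{BASE_URL}{number}.htm" for number in range(first, last + 1)]


pages = build_page_urls()
print(len(pages))
print(pages[0])
```

Keeping the page range as parameters makes it easy to adjust if a listing for another letter has a different number of pages.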


Now that we have a list of all the web pages, we call `requests.get(items)` on each URL to fetch it and assign the response to the variable `page`.

Next, we create a BeautifulSoup object for each page. The constructor takes two arguments, `page.text` and `'html.parser'`. **page.text** is the content of the server's response decoded to a string, and **html.parser** tells BeautifulSoup to parse it with the HTML parser built into Python's standard library.


`last_links = soup.find(class_="AlphaNav")`

`last_links.decompose()`
    

The above code removes the `AlphaNav` element and the extra links it contains, which we do not want in our results. Our desired info is contained in the `<div>` tag with `class_="BodyText"`. We use another for loop to iterate through the entries inside it that have `valign='top'`, get the **Name, Nationality, Age Info and Bio link** of each artist, and append them as a tuple to the list initialized as `tup=[]`.
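
To see these steps in isolation, here is a minimal sketch that runs them on a small inline HTML string instead of the live pages. The markup, the artist names, and the function name `parse_artists` are all made up for illustration; the real pages may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on what the scraping code expects.
SAMPLE_HTML = """
<div class="BodyText">
  <table>
    <tr valign="top">
      <td><a href="/web/0/zed.htm">Zed, Alice</a></td>
      <td>Italian, 1664 - 1750</td>
    </tr>
    <tr valign="top">
      <td><a href="/web/0/zeta.htm">Zeta, Bob</a></td>
      <td>French, 1890 - 1967</td>
    </tr>
  </table>
  <div class="AlphaNav"><a href="/web/0/next.htm">next page</a></div>
</div>
"""


def parse_artists(html, base="https://web.archive.org"):
    """Return a list of (name, nationality, age, link) tuples."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop the navigation block if it is present.
    nav = soup.find(class_="AlphaNav")
    if nav is not None:
        nav.decompose()

    body = soup.find(class_="BodyText")
    artists = []
    for row in body.find_all(valign="top"):
        link = row.find("a")
        # Text of the second cell, e.g. "Italian, 1664 - 1750".
        info = row.find("td").find_next_sibling("td").text
        artists.append((
            link.text.replace(",", " "),
            info.split(",")[0],
            info.split(",")[-1],
            base + link.get("href"),
        ))
    return artists


artists = parse_artists(SAMPLE_HTML)
for artist in artists:
    print(artist)
```

Checking `nav is not None` before calling `decompose()` is a small safety measure added in this sketch: `soup.find()` returns `None` when nothing matches, and calling a method on `None` would raise an `AttributeError`.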


### Writing a CSV file

In [23]:
filename = 'Z_Artist_names.csv'
with open(filename, 'w', newline='') as f:
    csvwriter = csv.writer(f)
    # DictWriter is used here only to write the header row.
    hw = csv.DictWriter(f, fieldnames=['Name', 'Nationality', 'Age_info', 'Bio_link'])
    hw.writeheader()
    # Each tuple in tup becomes one row of the CSV file.
    for row in tup:
        csvwriter.writerow(row)



**Code Review 2**

Remember the list of tuples we initialized above as `tup=[]`?
Now we want to create a CSV file from that list and add a header. To define the header, we create a ***DictWriter()*** object and pass our desired column names as a list in the *fieldnames* argument.

We then call its ***writeheader()*** method to write the header row to the file. Now that we have our header, we use the ***writerow()*** method of the ***csv.writer*** object to write each tuple to the CSV file as one row.
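
As a sketch of the same idea with a single writer object: a `csv.writer` can write the header with `writerow()` and all the data rows with `writerows()`. This is an alternative to the notebook's approach, not a description of it. The example writes to an in-memory buffer and uses made-up rows, so it runs without the scraped data.

```python
import csv
import io

# Made-up rows in the same (name, nationality, age, link) layout as tup.
sample_rows = [
    ("Zed  Alice", "Italian", " 1664 - 1750", "https://web.archive.org/web/0/zed.htm"),
    ("Zeta  Bob", "French", " 1890 - 1967", "https://web.archive.org/web/0/zeta.htm"),
]
header = ["Name", "Nationality", "Age_info", "Bio_link"]

# io.StringIO stands in for a file opened with open(filename, 'w', newline='').
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(header)        # header row
writer.writerows(sample_rows)  # one CSV row per tuple

lines = buffer.getvalue().splitlines()
print(lines[0])
```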



### Reading CSV file into a pandas DataFrame

In [30]:
with open(filename, 'rb') as file:
    print(chardet.detect(file.read()))

{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}


In [35]:
df= pd.read_csv('Z_Artist_names.csv',encoding='ISO-8859-1')
df['Nationality'].value_counts()

Nationality
American         32
Italian          29
German           14
Dutch             5
Polish            4
Czech             4
French            3
Spanish           3
Flemish           2
Russian           2
Mexican           2
Swiss             2
British           2
Swedish           2
Chilean           1
born 1943         1
Yugoslavian       1
Austrian          1
Belgian           1
Netherlandish     1
Name: count, dtype: int64

When reading a CSV file into a DataFrame, you may encounter a **UnicodeDecodeError**. This happens when the file is not stored in the encoding pandas expects, which is UTF-8 by default.

To avoid this error, use the ***chardet*** Python library to detect the encoding of the file, and pass that encoding as the `encoding` argument when reading the CSV into a pandas DataFrame. In my case, the chardet output was as follows:

***{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}***

With that encoding, the code below reads the CSV file into a DataFrame successfully.

`df= pd.read_csv('Z_Artist_names.csv',encoding='ISO-8859-1')`
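
To illustrate where the error comes from, the sketch below uses only the standard library. The name `Zuñiga` is a made-up sample value: it is encoded as ISO-8859-1 bytes, which cannot be decoded as UTF-8 but decode correctly once the right encoding is given.

```python
# A made-up name with a non-ASCII character, stored as ISO-8859-1 bytes.
raw = "Zuñiga".encode("iso-8859-1")

# Decoding those bytes as UTF-8 fails, because the byte for "ñ"
# is not a valid UTF-8 sequence in this position.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the bytes were written in succeeds.
decoded = raw.decode("iso-8859-1")

print(utf8_ok)
print(decoded)
```

Keep in mind that chardet reports a guess together with a confidence value (0.73 in the output above), so it is worth checking a few rows of the resulting DataFrame for garbled characters.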