## **LINGUISTIC ANALYSIS OF A COOKING WEBSITE**

In this interactive essay, I walk through collecting text from a cooking website by web scraping and then analysing it with NLTK.


### Data collection

For the linguistic analysis I have chosen the home page of BBC Good Food, a website devoted to cooking. The data collection process is presented below.

In [None]:
import nltk


In [None]:
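import requests
# Download the home page; "page" is a requests.Response object
page = requests.get("https://www.bbcgoodfood.com/")
# Stop early if the server returned an error status
page.raise_for_status()
# page.content holds the raw, undecoded bytes of the HTML document
print(page.content)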

At this point, the variable "page" holds the server's response. Its raw content is not readable at this stage, but we can improve it using the Beautiful Soup library.
Let's move on to the subsection on data preparation.

### Data preparation

Using the Beautiful Soup library, we parse the HTML and split the text into paragraphs.

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())


In [None]:
paras = soup.find_all('p')
print(paras)


Now we have created a new variable, "paras", which holds all the scraped paragraph elements.

In [None]:
only_text = []
for para in paras:
    processed_para = para.get_text()
    processed_para = processed_para.strip()
    if len(processed_para) > 1:
        only_text.append(processed_para)
print(only_text)


By starting from an empty list and appending each cleaned paragraph to it in a simple for loop, we keep only the text and make the output more readable.
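The extraction and cleaning steps can also be tried offline on a small HTML string, which avoids depending on the live page. The following is a minimal sketch: the HTML snippet and the helper name `extract_paragraphs` are made up for illustration and are not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet; the real page is far larger
sample_html = """
<html><body>
<p>  Preheat the oven.  </p>
<p> </p>
<p>Mix the <b>flour</b> and sugar.</p>
</body></html>
"""

def extract_paragraphs(html):
    """Return the non-empty, whitespace-trimmed text of every <p> element."""
    soup = BeautifulSoup(html, "html.parser")
    texts = (p.get_text().strip() for p in soup.find_all("p"))
    # Same filter as the loop above: drop empty or one-character strings
    return [text for text in texts if len(text) > 1]

paragraphs = extract_paragraphs(sample_html)
print(paragraphs)
```

The whitespace-only paragraph is dropped by the length check, and the text inside the nested `<b>` tag is kept because `get_text()` collects the text of all descendants.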

### Data analysis 

#### 1.1 Part of speech

In [None]:
tuples = []
for sentence in only_text:
    tokenized = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokenized)
    for item in tagged:
        tuples.append(item)
print(tuples)


At this stage we create an empty list, "tuples". Each paragraph is split into tokens with the nltk word_tokenize function, the tokens are tagged with pos_tag, and the resulting pairs are appended to the list. Each tuple displayed contains a token and its corresponding part of speech.
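To make the structure of these pairs concrete, here is a small sketch that groups tokens by their tag. The sample pairs are written by hand for illustration and are not actual tagger output; `group_by_tag` is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-written (token, tag) pairs; a stand-in for real tagger output
sample_tagged = [
    ("Whisk", "VB"),
    ("the", "DT"),
    ("eggs", "NNS"),
    ("and", "CC"),
    ("sugar", "NN"),
]

def group_by_tag(tagged_tokens):
    """Map each part-of-speech tag to the list of tokens carrying it."""
    groups = defaultdict(list)
    for token, tag in tagged_tokens:
        groups[tag].append(token)
    return dict(groups)

tokens_by_tag = group_by_tag(sample_tagged)
print(tokens_by_tag["NN"])
```

Grouping this way makes it easy to inspect which words sit behind a given tag before counting them.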

In [None]:
counter_dict = {}
for item in tuples:
    if item[1] not in counter_dict:
        counter_dict[item[1]] = 1
    else:
        counter_dict[item[1]] += 1

print(counter_dict)


With this code we can easily count the occurrences of each part-of-speech tag.
In addition, we can visualize the counts as a bar chart using **matplotlib**. The result is shown below.

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(20,10))
# One bar per part-of-speech tag, with its count as the bar height
plt.bar(list(counter_dict.keys()), list(counter_dict.values()), color='green')
plt.xlabel('Part of speech', size=20)
plt.ylabel('Number of appearances', size=15)
plt.show()


As the chart shows, singular common nouns (NN) are the most frequent part of speech in the text.
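The most frequent tag can also be read off numerically rather than from the chart. The sketch below uses `collections.Counter` from the standard library as an alternative to the manual dictionary counting; the tagged pairs are hand-written placeholders, not results from the scraped page.

```python
from collections import Counter

# Hand-written (token, tag) pairs standing in for the "tuples" list
sample_tagged = [
    ("Bake", "VB"),
    ("the", "DT"),
    ("cake", "NN"),
    ("and", "CC"),
    ("the", "DT"),
    ("bread", "NN"),
    ("slowly", "RB"),
    ("oven", "NN"),
]

# Count how often each tag occurs, ignoring the tokens themselves
tag_counts = Counter(tag for _, tag in sample_tagged)

# most_common(1) returns a one-element list of (tag, count)
top_tag, top_count = tag_counts.most_common(1)[0]
print(top_tag, top_count)
```

`Counter` is a dictionary subclass, so it can be passed to the plotting code in the same way as a plain dictionary of counts.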

#### 1.2 Most common words

In [None]:
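# Characters to strip before counting words (raw string avoids invalid-escape warnings)
punctuation = r'''!()-[]{};:'"\,<>.?@#$%^&*_~<=>+/(|){`\}/'''
# Work on the cleaned paragraph text rather than the raw HTML
text_to_clean = " ".join(only_text)
no_punctuation = ""
for char in text_to_clean:
    # Keep only the characters that are not punctuation
    if char not in punctuation:
        no_punctuation = no_punctuation + char
print(no_punctuation)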

In [None]:
tuples = []
for sentence in only_text:
    tokenized = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokenized)
    # Collect the (token, tag) pairs; without this step the list stays empty
    tuples.extend(tagged)

In [None]:
print(tuples)
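One possible way to get from cleaned text to the most common words is sketched below. It is an assumption about how this step could be completed, not the notebook's own method: it uses a regular expression instead of the punctuation string, and both the sample paragraphs and the helper name `most_common_words` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-in for the "only_text" list built earlier
sample_text = [
    "Bake the cake, then cool the cake.",
    "Serve the cake with cream!",
]

def most_common_words(paragraphs, n=3):
    """Return the n most frequent lowercase words across all paragraphs."""
    words = []
    for paragraph in paragraphs:
        # \w+ keeps runs of letters, digits and underscores, dropping punctuation
        words.extend(re.findall(r"\w+", paragraph.lower()))
    return Counter(words).most_common(n)

top_words = most_common_words(sample_text)
print(top_words)
```

Function words such as "the" tend to dominate a raw count like this, so a stop-word filter would be a natural next step if only content words are of interest.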