# Academic Writing Project
### Text scraping and linguistic analysis

#### Importing the requests module
First, we import the requests module, which allows us to send HTTP requests from Python.

In [None]:
import requests

#### Getting the page data
The requests.get() method sends a GET request to the chosen URL and returns a response object. We assign it to a variable, then print the response (which shows the status code) and its raw content.

In [None]:
page = requests.get("https://www.football365.com/news/man-city-ffp-relegation-not-sensationalised-premier-league-desire/")
print(page)
print(page.content)
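
A request can fail or hang, so in practice it can help to set a timeout and check the status before using the content. The sketch below is one possible way to do that; the helper name fetch_html and the 10-second default are assumptions made for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded HTML of `url`, raising if the request fails.

    The name `fetch_html` and the 10-second default are illustrative.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently returning the body of an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html with the article URL above would return the page as a string, which can be parsed in the same way as page.content.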

#### BeautifulSoup package
Wikipedia describes the purpose of the package as follows:
"Beautiful Soup is a Python package for parsing HTML and XML documents, including those with malformed markup. It creates a parse tree for documents that can be used to extract data from HTML, which is useful for web scraping."

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
soup

#### Finding paragraphs in a chosen text
To find the paragraphs in the page, we use the "find_all" method, as shown below, and assign the result to the "paras" variable:

In [None]:
paras = soup.find_all('p')
print(paras)

#### Separating the text itself
The next step is important, because we want to keep only the text of the chosen page. To do this, we use a for loop together with the getText and strip methods. The latter removes leading and trailing whitespace from a string, which lets us skip paragraphs that contain no text.

In [None]:
only_text = []
for el in paras:
  # Extract the text once and drop leading and trailing whitespace
  text = el.getText().strip()
  if text:
    only_text.append(text)
print(only_text)
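
The same extraction can be tried on a small inline HTML string, which makes it easier to see what find_all, getText and strip each contribute. The HTML fragment and the variable names below are made up for illustration; here the stripped text is what gets stored.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a downloaded page.
html = """
<html><body>
  <p>  First paragraph.  </p>
  <p>   </p>
  <p>Second <b>paragraph</b>.</p>
</body></html>
"""

soup_demo = BeautifulSoup(html, "html.parser")

demo_text = []
for el in soup_demo.find_all("p"):
    text = el.getText().strip()  # drop leading and trailing whitespace
    if text:  # skip paragraphs that are empty after stripping
        demo_text.append(text)

print(demo_text)  # → ['First paragraph.', 'Second paragraph.']
```

The whitespace-only paragraph is skipped, and the nested b tag is flattened into plain text.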

### Linguistic tools and analysis
Now that we have the text, we can apply several linguistic tools to analyse it. The first step is to import the nltk (Natural Language Toolkit) package. Next, we download the nltk data with the download method, choosing the collection that we are interested in. Below we download the popular collection (nltk.download("popular")). We can, however, download all of the data as well.

In [None]:
import nltk
nltk.download("popular")

In [None]:
nltk.download("all")

In [None]:
for sentence in only_text:
  print(sentence)

#### Word tokenization
To split each paragraph of the scraped text into word tokens, we use the "word_tokenize" function, as below:

In [None]:
for sentence in only_text:
  print(nltk.word_tokenize(sentence))

#### Tagging words in a sentence
Another useful tool is the "pos_tag" function, which returns part-of-speech tags for a list of tokens. Here, I apply it inside a for loop and collect the results as (word, tag) tuples.

In [None]:
tuples = []
for sentence in only_text:
  tokens = nltk.word_tokenize(sentence)
  pos_tagged = nltk.pos_tag(tokens)
  for item in pos_tagged:
    tuples.append(item)
print(tuples)

#### Parts of speech frequency (dictionary)
Which part of speech is the most frequent? We can check this by creating a dictionary and filling it with counts in a for loop.

In [None]:
counter_dict = {}

for el in tuples:
  tag = el[1]
  if tag not in counter_dict:
    counter_dict[tag] = 1
  else:
    counter_dict[tag] += 1

print(counter_dict)
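
The standard library's collections.Counter can do the same counting in one step and can also report the most frequent tags directly. The (word, tag) pairs below are hand-written for illustration and are not output from the scraped article.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for the `tuples` list.
sample_tuples = [
    ("The", "DT"), ("club", "NN"), ("denies", "VBZ"),
    ("the", "DT"), ("charges", "NNS"), ("and", "CC"),
    ("the", "DT"), ("league", "NN"),
]

# Count how often each tag (the second item of every pair) occurs.
tag_counts = Counter(tag for _, tag in sample_tuples)

print(tag_counts.most_common(2))  # → [('DT', 3), ('NN', 2)]
```

A Counter behaves like a dictionary, so it could be used wherever counter_dict is used.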

#### Separating keys and values of the dictionary
We can also collect and display the keys and values of the previously created dictionary separately:

In [None]:
keys = []
values = []

for el in counter_dict:
  keys.append(el)
  values.append(counter_dict[el])

print(keys)
print(values)
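
A dictionary also exposes its keys and values directly, and sorting the items first gives lists ordered by frequency, which tends to make charts easier to read. The sample dictionary below is a hypothetical stand-in for counter_dict.

```python
# Hypothetical tag counts standing in for `counter_dict`.
sample_counts = {"NN": 5, "DT": 3, "VBZ": 1, "JJ": 4}

# keys() and values() iterate in the same (insertion) order.
sample_keys = list(sample_counts.keys())
sample_values = list(sample_counts.values())

# Sort the (tag, count) pairs by count, highest first, then unzip them.
ranked = sorted(sample_counts.items(), key=lambda item: item[1], reverse=True)
ranked_keys, ranked_values = zip(*ranked)

print(ranked_keys)    # → ('NN', 'JJ', 'DT', 'VBZ')
print(ranked_values)  # → (5, 4, 3, 1)
```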

#### Removing "stop words"
We can also remove so-called "stop words". What are they? According to Kavita Ganesan, they are "a set of commonly used words in any language. For example, in English, “the”, “is” and “and”, would easily qualify as stop words. In NLP and text mining applications, stop words are used to eliminate unimportant words, allowing applications to focus on the important words instead."

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
to_be_removed = set(stopwords.words('english'))

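# Join the paragraphs into one string; str(only_text) would also tokenize the list's brackets and quotes
tokenized_para = word_tokenize(" ".join(only_text))
print(tokenized_para)
# NLTK's stop word list is lowercase, so compare each token in lowercase
modified_token_list = [word for word in tokenized_para if word.lower() not in to_be_removed]
print(modified_token_list)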
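
Filtering often goes one step further and drops punctuation tokens as well. The sketch below uses a tiny hand-picked stop word set and a hand-written token list so that it runs without any NLTK data; with the real data, to_be_removed and tokenized_para would take their place.

```python
import string

# Tiny illustrative stop word set; NLTK's English list is much longer.
demo_stop_words = {"the", "is", "and", "a", "of"}
punctuation = set(string.punctuation)

# Hand-written tokens standing in for the output of word_tokenize.
demo_tokens = ["The", "league", "is", "strict", ",", "and", "the",
               "club", "denies", "breaches", "."]

content_words = [
    token for token in demo_tokens
    if token.lower() not in demo_stop_words  # case-insensitive stop word check
    and token not in punctuation  # drop single punctuation marks
]

print(content_words)  # → ['league', 'strict', 'club', 'denies', 'breaches']
```

Lowercasing before the comparison is what removes the capitalised "The" at the start of the sentence.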

#### Data presented by a chart
Our data can be presented accessibly with different types of charts. To do so, we need to import the matplotlib.pyplot module.

In [None]:
import matplotlib.pyplot as plt

#### Presenting data with charts
After importing the module, we can choose, for instance, a bar chart or a pie chart:

In [None]:
plt.bar(keys, values, color="green")

In [None]:
# Slice both lists so that the values and the labels have the same length
plt.pie(values[0:8], labels=keys[0:8])
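
A pie chart with many small slices is hard to read, so one option is to keep the most frequent tags and merge the rest into a single slice. The helper name top_n_with_other, the "Other" label and the sample counts are all assumptions made for this sketch.

```python
def top_n_with_other(counts, n):
    """Return (labels, sizes) for the n largest counts plus an 'Other' slice."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    labels = [tag for tag, _ in ranked[:n]]
    sizes = [count for _, count in ranked[:n]]
    remainder = sum(count for _, count in ranked[n:])
    if remainder:  # only add the extra slice when something was left over
        labels.append("Other")
        sizes.append(remainder)
    return labels, sizes


# Hypothetical tag counts standing in for `counter_dict`.
sample_counts = {"NN": 5, "DT": 3, "VBZ": 1, "JJ": 4, "CC": 2}
labels, sizes = top_n_with_other(sample_counts, 3)
print(labels)  # → ['NN', 'JJ', 'DT', 'Other']
print(sizes)   # → [5, 4, 3, 3]
```

The two lists always have the same length, so they can be passed to plt.pie(sizes, labels=labels).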