In [1]:
import requests
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
import urllib.request

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Jacob\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Jacob\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


# Introduction

Social media has a huge impact on the world we live in, with billions of people connecting every day and interacting on different platforms. From these interactions we can gain insights that help a wide range of industries.

## Social Data
Social media data can be found on social sites in two forms:
- Structured data, e.g. names, cities, dates of birth
- Unstructured data, e.g. text, images, videos

In social data analytics, we try to make sense of both forms of data and extract insights from them.

## Exploring Data
<img src="https://www.oreilly.com/api/v2/epubs/9781787121485/files/assets/e963952b-e1ed-447a-8516-2123c0bf87d6.png" width="500"/>


- Problem Definition: What is the question you are trying to answer? This has to be very precise.
- Data Collection: What data is needed, and what is the best way to collect it?
- Data Cleaning: In many cases the data will need some cleaning, which involves steps like removing duplicates.
- Data Analysis: What kind of analysis is required, and what structure should the data be in for this analysis? The types of analysis will depend on the objective and the type of data.
- Data Visualisation: This helps to better understand and summarise the data.
- Conclusions: We can draw better conclusions from our analysis and visualisations.
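
The steps above can be sketched on a toy dataset. This is only an illustration: the `posts` list and the question being asked are invented for the example and do not come from a real platform.

```python
from collections import Counter

# Problem definition: which word appears most often in a handful of posts?
# Data collection: a hypothetical in-memory sample stands in for real social data.
posts = [
    "Great coffee this morning",
    "Great coffee this morning",  # exact duplicate
    "Coffee and a great book",
]

# Data cleaning: drop exact duplicates while preserving order.
unique_posts = list(dict.fromkeys(posts))

# Data analysis: count lowercased words across the cleaned posts.
word_counts = Counter(
    word.lower() for post in unique_posts for word in post.split()
)

# Data visualisation: a simple text bar chart of the three most common words.
for word, count in word_counts.most_common(3):
    print(word.ljust(10), "#" * count)

# Conclusions: report the highest count found.
top_word, top_count = word_counts.most_common(1)[0]
print("Most frequent word:", top_word, "with", top_count, "occurrences")
```

Real projects would replace each step with something heavier (an API client, a proper cleaning routine, a plotting library), but the order of the steps stays the same.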


# Practice
## Getting Data
For this exercise we will explore the text of a book.
We will use the [urllib](https://docs.python.org/3/library/urllib.request.html#module-urllib.request) and [requests](https://requests.readthedocs.io/en/latest/) libraries to get data from the [Project Gutenberg](https://www.gutenberg.org/) website.

In [2]:
# Defining the url of the text we would like to retrieve
url = "https://www.gutenberg.org/cache/epub/27785/pg27785.txt"

In [3]:
# Making a request with the urllib library
response = urllib.request.urlopen(url)
raw = response.read().decode("utf8")
len(raw)

823999

In [4]:
# Using a GET request from the requests library to fetch the same data
response = requests.get(url)
raw = response.content.decode("utf8")
len(raw)

823999

In [7]:
# Getting data from a file (the file must exist in the working directory)
with open("2554-0.txt", "r", encoding="utf-8") as file:
    full_text = file.read()

FileNotFoundError: [Errno 2] No such file or directory: '2554-0.txt'

In [6]:
full_text[:200]

NameError: name 'full_text' is not defined

In [51]:
# Reading the file as a list of lines
with open("2554-0.txt", "r", encoding="utf-8") as file:
    lines = file.readlines()

In [50]:
lines[:5]

['\ufeffThe Project Gutenberg eBook of Crime and Punishment, by Fyodor Dostoevsky\n',
 '\n',
 'This eBook is for the use of anyone anywhere in the United States and\n',
 'most other parts of the world at no cost and with almost no restrictions\n',
 'whatsoever. You may copy it, give it away or re-use it under the terms\n']
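
The `\ufeff` at the start of the first line is a byte order mark (BOM) that some UTF-8 files carry. One way to drop it, sketched here on a small temporary file rather than on the book itself, is Python's `utf-8-sig` codec. The file contents below are made up for the demonstration.

```python
import os
import tempfile

# Write a small hypothetical file that starts with a UTF-8 byte order mark.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbfThe first line\nThe second line\n")
    path = tmp.name

# Plain utf-8 keeps the BOM as the character '\ufeff'.
with open(path, "r", encoding="utf-8") as f:
    with_bom = f.readlines()

# utf-8-sig strips a leading BOM if one is present.
with open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.readlines()

os.remove(path)

print(repr(with_bom[0]))
print(repr(without_bom[0]))
```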

## Tokenization
Tokenization is used in natural language processing to split text, paragraphs and sentences into smaller units, such as words, that can more easily be assigned meaning.

Below we simply use Python's `split` to get words from the text. Can you see the problem?

In [37]:
tokens = raw[:4000].split()
print(tokens[:10])

['\ufeffThe', 'Project', 'Gutenberg', 'eBook', 'of', 'A', 'Book', 'About', 'Lawyers,', 'by']


Here we use the [nltk](https://www.nltk.org/) library. Can you spot the difference?

In [57]:
tokens = nltk.word_tokenize(raw)
print(tokens[:10])


['\ufeffThe', 'Project', 'Gutenberg', 'eBook', 'of', 'A', 'Book', 'About', 'Lawyers', ',']
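
The same contrast can be seen on a single sentence. In this sketch a regular expression stands in for the NLTK tokenizer, which is an assumption made to keep the example free of downloads; NLTK's actual rules are more elaborate.

```python
import re

sentence = "A Book About Lawyers, by John Cordy Jeaffreson."

# str.split only breaks on whitespace, so punctuation stays attached to words.
split_tokens = sentence.split()

# A rough regex tokenizer: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(split_tokens)
print(regex_tokens)
```

With `split`, `'Lawyers,'` is one token; with the regex, the word and the comma become separate tokens, which makes later steps such as counting words more reliable.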


#### <span style="color:red"> Exercise 1 </span>
Try to remove the punctuation from the text.
Hint: this [link](https://machinelearningmastery.com/clean-text-machine-learning-python/) shows one way to do it.

#### <span style="color:red"> Exercise 2 </span>
Explore some of the tokenizers and methods in the [nltk docs](https://www.nltk.org/api/nltk.tokenize.html?highlight=word_tokenize#nltk.tokenize.word_tokenize)

## Exploring Data

In [62]:
# Brute-force way to find names: pairs of consecutive title-case tokens
names_res = []
for i, token in enumerate(tokens):
    if token.istitle():
        if i < len(tokens) - 1 and tokens[i + 1].istitle():
            names_res.append('{} {}'.format(token, tokens[i + 1]))
print(names_res[:15])

['\ufeffThe Project', 'Project Gutenberg', 'A Book', 'Book About', 'About Lawyers', 'John Cordy', 'Cordy Jeaffreson', 'Jeaffreson This', 'Project Gutenberg', 'Gutenberg License', 'A Book', 'Book About', 'About Lawyers', 'Lawyers Author', 'John Cordy']
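
The pairwise approach produces overlapping results such as `'John Cordy'` and `'Cordy Jeaffreson'`. A possible variation, shown here as a sketch on a hypothetical token sample, groups each run of consecutive title-case tokens into one candidate with `itertools.groupby`. It is still a heuristic and will pick up book titles as well as names.

```python
from itertools import groupby

# Hypothetical token sample, modelled on the start of the book.
sample_tokens = ["A", "Book", "About", "Lawyers", ",", "by",
                 "John", "Cordy", "Jeaffreson", "wrote", "this"]

# Group consecutive tokens by whether they are title case,
# then keep runs of two or more title-case tokens.
candidates = []
for is_title, run in groupby(sample_tokens, key=str.istitle):
    run = list(run)
    if is_title and len(run) >= 2:
        candidates.append(" ".join(run))

print(candidates)
```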


### Additional Exercise
Explore using nltk to do the above. Don't worry, we will go over this in next week's class.
A hint is in this [link](https://www.kaggle.com/code/lohitha17/name-extraction-using-nltk/notebook).

In [65]:
# Looking for words ending in 'ing'
ings = []
for token in tokens:
    if token.endswith('ing'):
        ings.append(token)
print(ings[:30])

['Proofreading', 'according', 'existing', 'during', 'King', 'putting', 'sitting', 'drawing', 'rustling', 'peering', 'morning', 'reading', 'neighboring', 'nothing', 'shunning', 'trembling', 'shivering', 'nothing', 'having', 'living', 'being', 'rising', 'adjoining', 'recalling', 'living', 'working', 'during', 'imposing', 'having', 'striking']


In [69]:
# right justified tokens
for word in ings:
    print(word.rjust(14, ' '))

  Proofreading
     according
      existing
        during
          King
       putting
       sitting
       drawing
      rustling
       peering
       morning
       reading
   neighboring
       nothing
      shunning
     trembling
     shivering
       nothing
        having
        living
         being
        rising
     adjoining
     recalling
        living
       working
        during
      imposing
        having
      striking
    comprising
        rising
        Having
      dwelling
        rising
         being
        taking
   criticising
    describing
       talking
       running
        Taking
     murmuring
        having
        taking
     according
     following
       walking
     surveying
      admiring
     extolling
       playing
       raising
      pleasing
   considering
     something
        laying
   overlooking
      speaking
   beautifying
       raising
     amounting
     levelling
       raising
       meeting
       evening
           ...

In [72]:
# Counting occurrences of a word
word_count = 0
for word in tokens:
    if word.lower() == 'and':
        word_count += 1
print(word_count)

4181


Why do we make each word lowercase? And could we have handled this differently?
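
One way to explore this question is to compare a case-sensitive count with a normalised one. The sketch below uses a small hypothetical token list and `collections.Counter`; it is one option among several, not the only way to handle case.

```python
from collections import Counter

# Hypothetical tokens; in the notebook these would come from nltk.word_tokenize.
sample_tokens = ["And", "so", "it", "began", ",", "and", "then", "AND", "again"]

# A case-sensitive comparison misses 'And' and 'AND'.
exact_matches = sum(1 for t in sample_tokens if t == "and")

# Normalising once up front lets us count every word in a single pass.
counts = Counter(t.lower() for t in sample_tokens)

print(exact_matches)
print(counts["and"])
```

Building the `Counter` once also means any other word can be looked up afterwards without looping over the tokens again.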

In [78]:
# Assign a part-of-speech tag to each word
pos_tagged = nltk.pos_tag(tokens)
print(pos_tagged[:20])

[('\ufeffThe', 'NN'), ('Project', 'NNP'), ('Gutenberg', 'NNP'), ('eBook', 'NN'), ('of', 'IN'), ('A', 'NNP'), ('Book', 'NNP'), ('About', 'IN'), ('Lawyers', 'NNP'), (',', ','), ('by', 'IN'), ('John', 'NNP'), ('Cordy', 'NNP'), ('Jeaffreson', 'NNP'), ('This', 'DT'), ('eBook', 'NN'), ('is', 'VBZ'), ('for', 'IN'), ('the', 'DT'), ('use', 'NN')]


In [79]:
# Count the tokens tagged 'NN' (singular common nouns)
noun_count = 0
for word, tag in pos_tagged:
    if tag == 'NN':
        noun_count += 1
print('Noun count', noun_count)

Noun count 21633
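
The count above only covers the `'NN'` tag. In the Penn Treebank tag set that `nltk.pos_tag` uses by default, plural and proper nouns have their own tags (`NNS`, `NNP`, `NNPS`). The sketch below tallies every tag on a small hypothetical sample shaped like the `pos_tagged` list, then sums all tags that start with `'NN'`.

```python
from collections import Counter

# Hypothetical (token, tag) pairs in the same shape nltk.pos_tag returns.
sample_tagged = [("Project", "NNP"), ("Gutenberg", "NNP"), ("eBook", "NN"),
                 ("of", "IN"), ("lawyers", "NNS"), ("is", "VBZ"), ("use", "NN")]

# Tally every tag in one pass.
tag_counts = Counter(tag for _, tag in sample_tagged)

# Sum the counts of all noun tags (NN, NNS, NNP, NNPS).
all_nouns = sum(n for tag, n in tag_counts.items() if tag.startswith("NN"))

print(tag_counts["NN"])
print(all_nouns)
```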


#### <span style="color:red"> Exercise 3 </span>
Explore the Python string methods in this [link](https://www.w3schools.com/python/python_ref_string.asp) to investigate the words, create something new from them, or find out something interesting about them.

### <span style="color:red"> Additional Homework </span>
Take it a bit further and try loading some data of your choosing and manipulating it in a way you would like.