<a href="https://colab.research.google.com/github/Jaavion/DS-Unit-1-Sprint-2-Data-Wrangling/blob/master/Copy_of_LS_DS_121_Scrape_and_process_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science_

# Scrape and process data

Objectives
- scrape and parse web pages
- use list comprehensions
- select rows and columns with pandas

Links
-  [Automate the Boring Stuff with Python, Chapter 11](https://automatetheboringstuff.com/chapter11/)
  - Requests
  - Beautiful Soup
- [Python List Comprehensions: Explained Visually](https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/)
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
  - Subset Observations (Rows)
  - Subset Variables (Columns)
- Python Data Science Handbook
  - [Chapter 3.1](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html), Introducing Pandas Objects
  - [Chapter 3.2](https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html), Data Indexing and Selection


## Scrape the titles of PyCon 2019 talks

In [0]:
url = 'https://us.pycon.org/2019/schedule/talks/list/'

In [0]:
import bs4 
import requests

result = requests.get(url)

In [4]:
result.text



In [0]:
# Name the parser explicitly; without it, bs4 guesses one and emits a warning
soup = bs4.BeautifulSoup(result.text, 'html.parser')

In [11]:
soup.select('presentation-')

[]
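
The empty result above comes from how `select` reads its argument: a bare word is treated as a tag name, so `'presentation-'` looks for a `<presentation->` element. As a minimal offline sketch, the snippet below parses a small hand-written HTML string (a hypothetical stand-in for the PyCon page, not the real markup) and compares a few CSS selectors.

```python
import bs4

# Hypothetical HTML fragment, loosely shaped like one entry in the talk list
sample_html = """
<h2>
  <a href="/2019/schedule/presentation/235/" id="presentation-235">
        5 Steps to Build Python Native GUI Widgets for BeeWare
  </a>
</h2>
<div class="presentation-description">A short description.</div>
"""

sample_soup = bs4.BeautifulSoup(sample_html, 'html.parser')

by_tag = sample_soup.select('h2')                            # match by tag name
by_class = sample_soup.select('.presentation-description')   # match by CSS class
by_id_prefix = sample_soup.select('a[id^="presentation-"]')  # id starts with a prefix
bare_word = sample_soup.select('presentation-')              # looks for a <presentation-> tag

print(len(by_tag), len(by_class), len(by_id_prefix), len(bare_word))
```

In this sketch only the bare-word selector comes back empty, because no element has that tag name.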

In [0]:
soup.select('h2')

In [0]:
type(soup.select('h2'))

In [0]:
len(soup.select('h2'))

In [0]:
first = soup.select('h2')[0]

In [58]:
first

<h2>
<a href="/2019/schedule/presentation/235/" id="presentation-235">
        5 Steps to Build Python Native GUI Widgets for BeeWare
      </a>
</h2>

In [59]:
first.text

'\n\n        5 Steps to Build Python Native GUI Widgets for BeeWare\n      \n'

In [60]:
# strip() with no argument removes leading and trailing whitespace.
# Passing a string of characters, e.g. strip('\n'), removes any of those characters from both ends instead.
first.text.strip()

'5 Steps to Build Python Native GUI Widgets for BeeWare'
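
To make the optional argument of `strip` concrete, here is a small sketch using the raw title text shown above. The `'--Hello, PyCon!--'` string is a made-up example.

```python
raw = '\n\n        5 Steps to Build Python Native GUI Widgets for BeeWare\n      \n'

# No argument: trim all leading and trailing whitespace
stripped = raw.strip()

# With an argument: trim only the listed characters from both ends.
# Here the spaces survive, because only '\n' is listed.
only_newlines = raw.strip('\n')

# Several characters can be listed at once; any of them is trimmed
punct = '--Hello, PyCon!--'.strip('-!')

print(repr(stripped))
print(repr(only_newlines))
print(repr(punct))
```

Note that `strip` only touches the two ends of the string; whitespace in the middle is left alone.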

In [0]:
# Index -1 selects the last element of a list
last = soup.select('h2')[-1]
last.text.strip()

In [0]:
# This for loop...
titles = []
for tag in soup.select('h2'):
  title = tag.text.strip()
  titles.append(title)

# ...builds the same list as this list comprehension:
titles = [tag.text.strip()
          for tag in soup.select('h2')]
titles

In [0]:
type(titles), len(titles)

In [0]:
titles[0], titles[-1]

## 5 ways to look at long titles

Let's define a long title as one with more than 80 characters.

### 1. For Loop

In [0]:
# Use a for loop to collect the titles longer than 80 characters
long_titles = []
for title in titles:
  if len(title) > 80:  # check the length of each title, not of the whole list
    long_titles.append(title)
  

### 2. List Comprehension

In [0]:
l_titles = [title for title in titles if len(title) >80]
l_titles 

### 3. Filter with named function

In [0]:

def long(title):
  return len(title) > 80


In [0]:
# filter(function, iterable) keeps the items for which the function returns True.
# It returns a lazy iterator, so wrap it in list() to see the results.
list(filter(long, titles))

### 4. Filter with anonymous function

In [0]:
list(filter(lambda t: len(t) > 80, titles))
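
As a quick check that the first four approaches agree, the sketch below runs them side by side on a tiny, made-up list of titles (`sample_titles` and `is_long` are hypothetical names, not part of the lesson).

```python
# Hypothetical titles: 10, 81, and exactly 80 characters long
sample_titles = ['Short talk', 'x' * 81, 'y' * 80]

def is_long(title):
    """Return True when a title has more than 80 characters."""
    return len(title) > 80

# 1. For loop
via_loop = []
for title in sample_titles:
    if len(title) > 80:
        via_loop.append(title)

# 2. List comprehension
via_comprehension = [title for title in sample_titles if len(title) > 80]

# 3. filter with a named function
via_named = list(filter(is_long, sample_titles))

# 4. filter with an anonymous (lambda) function
via_lambda = list(filter(lambda t: len(t) > 80, sample_titles))

print(via_loop == via_comprehension == via_named == via_lambda)
```

The 80-character title is excluded in every case because the comparison is strictly greater than 80.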

### 5. Pandas

pandas documentation: [Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html)

In [0]:
import pandas as pd 
pd.options.display.max_colwidth = 200

In [0]:
# Create a dataframe from scratch:
# each dictionary key becomes a column name, and its list becomes the column's values
df = pd.DataFrame({'title': titles})


In [61]:
df.shape


(95, 4)

In [0]:
#filter the dataframe 
df[df['title'].str.len() > 80]

In [0]:
df['title']

In [0]:
df['title'].str.len() > 80

In [0]:
# Store the boolean condition, then keep only the rows where it is True
condition = df['title'].str.len() > 80
df[condition]

In [0]:
df['title'].str.len()
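
The cells above break the pandas filter into its parts. As a compact sketch on a made-up three-row dataframe (`sample_df` is hypothetical), the same steps look like this:

```python
import pandas as pd

# Hypothetical data: titles of 10, 81, and 21 characters
sample_df = pd.DataFrame({'title': ['Short talk', 'x' * 81, 'A medium-length title']})

lengths = sample_df['title'].str.len()  # Series of integers, one per row
mask = lengths > 80                     # Series of booleans, one per row
long_rows = sample_df[mask]             # only the rows where mask is True

print(lengths.tolist())
print(mask.tolist())
print(long_rows.index.tolist())
```

The boolean Series keeps the original index, which is why `long_rows` still carries the row label `1` rather than being renumbered from zero.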

## Make new dataframe columns

pandas documentation: [apply](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html)

### title length

In [0]:
df['title_length'] = df['title'].apply(len)

In [0]:
df.head()

### long title
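
One possible way to fill in this column, following the 80-character definition used earlier, is to store the boolean comparison itself. The sketch below uses a made-up two-row dataframe (`sample_df` is hypothetical); on the lesson's dataframe the same assignment would be `df['long title'] = df['title_length'] > 80`.

```python
import pandas as pd

# Hypothetical data: one short title and one 81-character title
sample_df = pd.DataFrame({'title': ['Short talk', 'x' * 81]})

sample_df['title_length'] = sample_df['title'].apply(len)

# A comparison on a numeric column yields a boolean column
sample_df['long title'] = sample_df['title_length'] > 80

print(sample_df['long title'].tolist())
```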

### first letter

In [0]:
df['first letter'] = df['title'].str[0]

In [0]:
df[df['first letter']=='P']

### word count

Using [`textstat`](https://github.com/shivam5992/textstat)

In [0]:
!pip install textstat

In [0]:
import textstat

In [0]:
df['title word count'] = df['title'].apply(textstat.lexicon_count)

In [0]:
df.shape

In [0]:
df.head()

In [0]:
df[df['title word count'] <= 3]
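
If installing `textstat` is not an option, a rough pandas-only alternative is to split each title on whitespace and count the pieces. This is an approximation and is not guaranteed to match `textstat.lexicon_count` in every case, for example around punctuation. The data below is made up.

```python
import pandas as pd

# Hypothetical titles with 2, 1, and 5 whitespace-separated words
sample_df = pd.DataFrame({'title': ['Hello PyCon', 'One', 'A longer example title here']})

# str.split() splits on whitespace; str.len() then counts the resulting list items
sample_df['title word count'] = sample_df['title'].str.split().str.len()

short = sample_df[sample_df['title word count'] <= 3]

print(sample_df['title word count'].tolist())
print(short['title'].tolist())
```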

## Rename column

`title_length` --> `title character count`

pandas documentation: [rename](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html)

In [0]:
df = df.rename(columns= {'title_length': 'title character count'})
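
Two behaviours of `rename` are easy to miss: it returns a new dataframe rather than changing the original, and by default a key that matches no existing column is silently ignored. The sketch below shows both on a made-up one-row dataframe (`sample_df` is hypothetical).

```python
import pandas as pd

sample_df = pd.DataFrame({'title': ['a'], 'title_length': [1]})

# Exact match: the column is renamed in the returned copy
renamed = sample_df.rename(columns={'title_length': 'title character count'})

# No match (space instead of underscore): nothing changes and no error is raised
unchanged = sample_df.rename(columns={'title length': 'title character count'})

print(list(renamed.columns))
print(list(unchanged.columns))
print(list(sample_df.columns))  # the original is untouched either way
```

This is why the lesson assigns the result back with `df = df.rename(...)`.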

In [0]:
df

## Analyze the dataframe

### Describe

pandas documentation: [describe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html)

In [0]:
df.describe()

In [0]:
df.describe(include='all')

In [0]:
df.describe(exclude='number')

### Sort values

pandas documentation: [sort_values](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html)

Five shortest titles, by character count

In [0]:
df.sort_values(by='title character count').head(5)

Titles sorted reverse alphabetically

In [0]:
# Sort by the full title (not just its first letter), from Z to A
df.sort_values(by='title', ascending=False)

### Get value counts

pandas documentation: [value_counts](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)


Frequency counts of first letters

In [0]:
df['first letter'].value_counts()

Percentage of talks with long titles

In [0]:
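# normalize=True returns fractions instead of counts; multiply by 100 for percentages.
# The condition is computed from 'title' directly, because df has no 'long title' column at this point.
(df['title'].str.len() > 80).value_counts(normalize=True) * 100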

### Plot

pandas documentation: [Visualization](https://pandas.pydata.org/pandas-docs/stable/visualization.html)





Top 5 most frequent first letters
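
No cell was recorded for this item. A minimal sketch, on made-up letters rather than the scraped titles, is to take the first five rows of `value_counts()`, which is sorted from most to least frequent by default.

```python
import pandas as pd

# Hypothetical first letters with distinct frequencies: D=6, P=5, B=4, A=3, C=2, E=1
letters = ['D'] * 6 + ['P'] * 5 + ['B'] * 4 + ['A'] * 3 + ['C'] * 2 + ['E']
sample_df = pd.DataFrame({'first letter': letters})

top5 = sample_df['first letter'].value_counts().head(5)
print(top5)

# With matplotlib installed, top5.plot.bar() would draw this as a bar chart
```

On the lesson's dataframe the equivalent would be `df['first letter'].value_counts().head(5)`.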

Histogram of title lengths, in characters
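
No cell was recorded for this item either. A histogram is a count of values per bin, so the sketch below computes those counts directly with `pd.cut` on made-up lengths. The bin edges are an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical title lengths, in characters
lengths = pd.Series([12, 25, 38, 41, 47, 55, 62, 83, 95], name='title character count')

# Bin edges are an assumption; each bin is open on the left and closed on the right
bins = [0, 20, 40, 60, 80, 100]
binned = pd.cut(lengths, bins=bins).value_counts().sort_index()
print(binned)

# With matplotlib installed, lengths.plot.hist() would draw the histogram
# (by default it chooses its own bins)
```

On the lesson's dataframe the plotting call would be `df['title character count'].plot.hist()`.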

In [0]:
#Assignment 
tag = soup.select('.presentation-description')

In [13]:
description = [i.text.strip() for i in soup.select('.presentation-description')]
description


["Have you ever wanted to write a GUI application in Python that you can run on both your laptop and your phone? Have you been looking to contribute to an open source project, but you don't know where to start?\r\n\r\nBeeWare is a set of software libraries for cross-platform native app development from a single Python codebase and tools to simplify app deployment. The project aims to build, deploy, and run apps for Windows, Linux, macOS, Android, iPhone, and the web. It is native because it is actually using your platform's native GUI widgets, not a theme, icon pack, or webpage wrapper.\r\n\r\nThis talk will teach you how Toga, the BeeWare GUI toolkit, is architected and then show you how you can contribute to Toga by creating your own GUI widget in five easy steps.",
 'We rarely think about the dot “.” between our objects and their fields, but there are quite a lot of things that happen every time we use one in Python. This talk will explore the details of what happens, how the descri

In [19]:
import pandas as pd
df = pd.DataFrame({'description': description})
df

| | description |
|---|---|
| 0 | Have you ever wanted to write a GUI applicatio... |
| 1 | We rarely think about the dot “.” between our ... |
| 2 | Account security means making sure your users ... |
| 3 | Do you feel overwhelmed by the prospect of hav... |
| 4 | Everyone’s talking about it. Everyone’s using ... |
| 5 | We will look into a day in the life of a Softw... |
| 6 | Medieval European Nobility was obsessed with L... |
| 7 | In July of 2018, Guido van Rossum stepped down... |
| 8 | If you maintain a library, how can you innovat... |
| 9 | Embroidery is an technology that dates back ce... |


In [0]:
#titles = [tag.text.strip()
#          for tag in soup.select('h2')]

In [0]:
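# Build the dataframe from the scraped descriptions so the next cell can read df['description']
df = pd.DataFrame({'description': description})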

In [0]:
df['Description Character Count'] = df['description'].apply(len)

In [21]:
df.head()

| | description | Description Character Count |
|---|---|---|
| 0 | Have you ever wanted to write a GUI applicatio... | 766 |
| 1 | We rarely think about the dot “.” between our ... | 296 |
| 2 | Account security means making sure your users ... | 426 |
| 3 | Do you feel overwhelmed by the prospect of hav... | 507 |
| 4 | Everyone’s talking about it. Everyone’s using ... | 647 |


In [22]:
!pip install textstat

Collecting textstat
  Downloading https://files.pythonhosted.org/packages/66/73/97bb64c89d6f2b24be6ad76007823e19b4c32ed4d484420b3ec6892ac440/textstat-0.5.6-py3-none-any.whl
Collecting pyphen (from textstat)
  Downloading https://files.pythonhosted.org/packages/15/82/08a3629dce8d1f3d91db843bb36d4d7db6b6269d5067259613a0d5c8a9db/Pyphen-0.9.5-py2.py3-none-any.whl (3.0MB)
Collecting repoze.lru (from textstat)
  Downloading https://files.pythonhosted.org/packages/b0/30/6cc0c95f0b59ad4b3b9163bff7cdcf793cc96fac64cf398ff26271f5cf5e/repoze.lru-0.7-py3-none-any.whl
Installing collected packages: pyphen, repoze.lru, textstat
Successfully installed pyphen-0.9.5 repoze.lru-0.7 textstat-0.5.6


In [0]:
import textstat


In [0]:
df['description word count'] = df['description'].apply(textstat.lexicon_count)

In [31]:
df[df['description word count'] <=140]

| | description | Description Character Count | description word count |
|---|---|---|---|
| 0 | Have you ever wanted to write a GUI applicatio... | 766 | 135 |
| 1 | We rarely think about the dot “.” between our ... | 296 | 56 |
| 2 | Account security means making sure your users ... | 426 | 66 |
| 3 | Do you feel overwhelmed by the prospect of hav... | 507 | 84 |
| 4 | Everyone’s talking about it. Everyone’s using ... | 647 | 96 |
| 6 | Medieval European Nobility was obsessed with L... | 774 | 124 |
| 7 | In July of 2018, Guido van Rossum stepped down... | 673 | 107 |
| 8 | If you maintain a library, how can you innovat... | 337 | 50 |
| 9 | Embroidery is an technology that dates back ce... | 828 | 133 |
| 11 | There's one part of building a Django app I ha... | 634 | 108 |


In [32]:
df.describe(include='all')

| | description | Description Character Count | description word count |
|---|---|---|---|
| count | 95 | 95.0 | 95.0 |
| unique | 95 | | |
| top | Python makes it incredibly easy to build progr... | | |
| freq | 1 | | |
| mean | | 813.073684 | 130.821053 |
| std | | 415.988191 | 64.357872 |
| min | | 121.0 | 20.0 |
| 25% | | 542.5 | 85.5 |
| 50% | | 718.0 | 116.0 |
| 75% | | 1016.5 | 165.0 |


In [34]:
# Exclude the numeric columns to summarize only the text column
df.describe(exclude='number')

| | description |
|---|---|
| count | 95 |
| unique | 95 |
| top | Python makes it incredibly easy to build progr... |
| freq | 1 |


# Assignment

**Scrape** the talk descriptions. Hint: `soup.select('.presentation-description')`

**Make** new columns in the dataframe:
- description
- description character count
- description word count

**Describe** all the dataframe's columns. What's the average description word count? The minimum? The maximum?

**Answer** the question: Which descriptions could fit in a tweet?
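
One way to approach the tweet question is to filter on the character count. The sketch below assumes a 280-character limit (an assumption about the platform, stored in the hypothetical constant `TWEET_LIMIT`) and uses made-up descriptions. In the `describe` output above, the minimum is 121 characters and the 25th percentile is 542.5, so only a small share of the real descriptions would pass.

```python
import pandas as pd

TWEET_LIMIT = 280  # assumption: maximum characters allowed in a single tweet

# Hypothetical descriptions: one short, one 300 characters long
sample_df = pd.DataFrame({'description': ['A short description.', 'x' * 300]})

sample_df['description character count'] = sample_df['description'].str.len()

# Keep only the rows whose description fits within the limit
tweetable = sample_df[sample_df['description character count'] <= TWEET_LIMIT]

print(tweetable['description'].tolist())
```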


# Stretch Challenge

**Make** another new column in the dataframe:
- description grade level (you can use [this `textstat` function](https://github.com/shivam5992/textstat#the-flesch-kincaid-grade-level) to get the Flesch-Kincaid grade level)

**Answer** the question: What's the distribution of grade levels? Plot a histogram.

**Be aware** that [Textstat has issues when sentences aren't separated by spaces](https://github.com/shivam5992/textstat/issues/77#issuecomment-453734048). (A Lambda School Data Science student helped identify this issue, and emailed with the developer.) 

Also, [BeautifulSoup doesn't separate paragraph tags with spaces](https://bugs.launchpad.net/beautifulsoup/+bug/1768330).

So, you may get some inaccurate or surprising grade level estimates here. Don't worry, that's OK. Optionally, can you do anything to improve the grade level estimates?
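
One possible direction, sketched below, is to ask BeautifulSoup to put a space between text fragments when extracting them, using `get_text` with the `separator` argument. The HTML here is made up, and the assumption that the real descriptions contain adjacent `<p>` tags is not verified against the PyCon page.

```python
import bs4

# Hypothetical description made of two adjacent paragraphs
sample_html = (
    '<div class="presentation-description">'
    '<p>First sentence.</p><p>Second sentence.</p>'
    '</div>'
)

tag = bs4.BeautifulSoup(sample_html, 'html.parser').select('.presentation-description')[0]

# .text concatenates the fragments with nothing in between
joined = tag.text

# get_text can insert a separator between fragments and strip each one
spaced = tag.get_text(separator=' ', strip=True)

print(repr(joined))
print(repr(spaced))
```

The spaced version gives a sentence splitter a boundary to find, which may lead to more plausible grade level estimates.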