# Week 2. Day 1. Exercises from Chapter 4 of FSStDS. 
## Fundamentals of Social Data Science. MT 2022

Within your study pod, discuss the following questions. Please submit an individual assignment by 12:30pm Tuesday, October 18, 2022 on Canvas. 

# Exercise 1. Creating a DataFrame from multiple JSON files

There are nine pages of search results for Oxford from OMDB (as of last year; `omdb_Oxford_search_page_*.json`). 

**Exercise 1a.** Create a single DataFrame from these 9 files.

**Exercise 1b.** Report on the oldest and the most recent entries. 

- **Hint**. To read all files that match a wildcard pattern from a `Path` object, use the `glob` method, such as: `for path in data_dir.glob("omdb_Oxford*.json"): path.do_something()`

- **Challenge**. Shows that span several years are written with the two years separated by `--`. Split these values and take both years into account when reporting the oldest and newest entries. 
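One possible way to approach the challenge is sketched below. It is only an illustration on made-up rows (the titles and years are hypothetical, not taken from the OMDB files), and it assumes that every `Year` value contains at least one four-digit year. Rather than splitting on a particular separator, it collects every four-digit year in the string, so it should cope with `--` as well as an en dash or an open-ended span.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from the OMDB JSON files.
sample = pd.DataFrame({
    "Title": ["Film A", "Series B", "Series C"],
    "Year": ["1999", "2005--2010", "2016–"],
})

# Collect every four-digit year in each string, so a span yields two values.
years = sample["Year"].str.findall(r"\d{4}")

# Keep the earliest and latest year per row as integers.
# Assumption: every row has at least one year, otherwise ys[0] would fail.
sample["first_year"] = years.apply(lambda ys: int(ys[0]))
sample["last_year"] = years.apply(lambda ys: int(ys[-1]))

# The oldest entry uses the start year; the newest uses the end year.
oldest = sample.loc[sample["first_year"].idxmin()]
newest = sample.loc[sample["last_year"].idxmax()]
print(f"Oldest entry: {oldest['Title']} ({oldest['first_year']})")
print(f"Newest entry: {newest['Title']} ({newest['last_year']})")
```

Keeping two separate columns means a long-running show counts as old by its start year and as recent by its end year.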


In [14]:
# Exercise 1a answer below here 
import pandas as pd
import json
from pathlib import Path 

def read_json(json_file: Path) -> pd.DataFrame:
    with open(json_file, 'r') as f:
        data = json.load(f)
    return pd.DataFrame(data["Search"])


data_dir = Path("../data/Week2Day1 - Exercsie Data for the data folder")
assert data_dir.exists()

# Read the files from json
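# ignore_index=True gives the combined frame one continuous index instead of repeating 0-9 per page
df = pd.concat(
    [read_json(f) for f in data_dir.glob("*omdb_Oxford_search_page_*.json")],
    ignore_index=True,
)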
# Exercise 1a answer above here

In [15]:
# Exercise 1b answer below here 
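# Keep only the first four-digit year of each entry (the start year for shows that span years)
df["Year"] = df["Year"].str.extract(r"(\d{4})", expand=False).astype(int)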
oldest_entry = df[df["Year"] == df["Year"].min()]
newest_entry = df[df["Year"] == df["Year"].max()]
print(f"Oldest entry: {oldest_entry['Title'].values[0]} ({oldest_entry['Year'].values[0]})")
print(f"Newest entry: {newest_entry['Title'].values[0]} ({newest_entry['Year'].values[0]})")
# Exercise 1b answer above here

Oldest entry: The Oxford and Cambridge University Boat Race (1895)
Newest entry: Ein Sommer in Oxford (2018)


# Exercise 2. Navigate Reddit JSON 

Go to a page on Reddit and replace `www.reddit` with `api.reddit` in the URL. This returns the page as JSON. Do this for a specific subreddit of interest (such as cats, cryptocurrency, mediasynthesis, ukpolitics, etc.). 

This JSON will likely only have 25-26 entries. Normalise the data so that each story has a single row. This will produce many, many columns. One of these columns will be the title of the headline and one will be the URL. 
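As a rough sketch of what normalising means here, the example below flattens a tiny, made-up listing with `pd.json_normalize`. The `listing` dictionary is a hypothetical, heavily trimmed stand-in that only mirrors the `data` → `children` → `data` nesting used by the answer code further down; a real response has far more fields.

```python
import pandas as pd

# Hypothetical, heavily trimmed stand-in for a Reddit listing response.
listing = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"title": "First post", "url": "https://example.com/a", "ups": 10}},
            {"kind": "t3", "data": {"title": "Second post", "url": "https://example.org/b", "ups": 3}},
        ]
    },
}

# One row per story; nested keys become dotted column names such as "data.title".
stories = pd.json_normalize(listing["data"]["children"])
print(stories.columns.tolist())

# Keep only the columns of interest and drop the "data." prefix.
small = stories[["data.title", "data.url", "data.ups"]].rename(
    columns={"data.title": "title", "data.url": "url", "data.ups": "ups"}
)
print(small)
```

Passing only the inner `data` dictionaries to `pd.DataFrame`, as the answer to Exercise 2a does, avoids the dotted prefixes altogether.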

- **Exercise 2a**. Find these two columns and then create a smaller DataFrame that just has these columns as well as the one for upvote score (`ups`).  

- **Exercise 2b**. What are the most common words across all titles? Does it matter if you use lower case and remove punctuation as we did last week? 

- **Exercise 2c**. What domain names are the most common?
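For Exercise 2b, one way to see whether cleaning matters is to count words twice, once on the raw titles and once after lower-casing and stripping punctuation. The sketch below does this on three made-up titles; `top_words` is a hypothetical helper name, and `string.punctuation` only covers ASCII punctuation, so emoji and curly quotes would survive.

```python
import string

import pandas as pd

# Hypothetical titles, chosen so that case and punctuation split the counts.
titles = pd.Series(["Look at this!", "Look, a cat", "this is a Cat."])


def top_words(series: pd.Series, clean: bool) -> pd.Series:
    """Count words, optionally lower-casing and removing ASCII punctuation first."""
    if clean:
        table = str.maketrans("", "", string.punctuation)
        series = series.str.lower().str.translate(table)
    return series.str.split().explode().value_counts()


raw = top_words(titles, clean=False)
cleaned = top_words(titles, clean=True)

# In the raw counts, "cat" and "Cat." are different tokens; after cleaning they merge.
print(raw)
print(cleaned)
```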

> **Hint**: If you aren't having luck with saving your own JSON, you can use the old `environment.json` that is included with the data. 

> **Hint**: Parsing domain names can be a nuisance. Here is a small snippet that can help: 

In [57]:
# See: https://docs.python.org/3/library/urllib.parse.html
# For example:
from urllib.parse import urlparse
result = urlparse("http://www.nytimes.com/somestory.html")
print(result) # Which item is the domain name? 

ParseResult(scheme='http', netloc='www.nytimes.com', path='/somestory.html', params='', query='', fragment='')


In [53]:
# Exercise 2a Answer below here 
from typing import List
import requests

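def read_results(url: str, columns: List[str]) -> pd.DataFrame:
    # Reddit may reject the default requests User-Agent (HTTP 429), so send a descriptive one.
    # The string itself is an arbitrary placeholder; replace it with your own.
    headers = {"User-Agent": "fsstds-week2-exercise/0.1"}
    r = requests.get(url, headers=headers, timeout=30)
    assert r.status_code == 200, f"Request failed with status code {r.status_code}"
    results = r.json()
    # Each child wraps one post; its "data" dict holds the fields we want
    return pd.DataFrame([child["data"] for child in results["data"]["children"]])[columns]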

reddit_url = "https://api.reddit.com"
wholesome_url = f"{reddit_url}/r/wholesomememes/top?t=day&limit=75"

new_df = read_results(wholesome_url, columns=["title", "url", "ups"])
print(new_df.head())

# Exercise 2a Answer above here 

                            title                                  url    ups
0            Look at his smile :)      https://i.imgur.com/H9tgQ0x.png  73112
1  Still friends 37 years later 💜  https://i.redd.it/er7dsw87b8u91.jpg  70753
2            He is the chosen one  https://i.redd.it/i7zy6j11o6u91.jpg  19215
3                Important Update  https://i.redd.it/zp8an8p15au91.png   7784
4                 this is so true  https://i.redd.it/7prhc0hpq6u91.jpg   6747


In [55]:
# Exercise 2b Answer below here 
# Find most common words in titles
new_df["title"].str.lower().str.split().explode().value_counts().head(10)

# Exercise 2b Answer above here 

is           6
you          5
to           5
just         4
this         4
i            4
in           4
like         4
my           4
wholesome    4
Name: title, dtype: int64

In [60]:
# Exercise 2c Answer below here 
def parse_url(url: str) -> str:
    return urlparse(url).netloc

new_df["domain"] = new_df["url"].apply(parse_url)

# most common domains
new_df["domain"].value_counts().head(10)

# Exercise 2c Answer above here 

i.redd.it      48
i.imgur.com     3
imgur.com       2
Name: domain, dtype: int64

# Exercise 3. The love-hate relationship with DIKW

As mentioned in the chapter, the Wikipedia entry for data had DIKW in the article, then it was removed, then it reappeared! I think it is still there now. I did not make these edits. 

With the data export of the Wikipedia page on data (`Wikipedia - data - Special export - 2022-10-17_10_24_15.xml`): 

- **Exercise 3a**. Create a DataFrame where each revision of the Wikipedia article in the export is given its own row. 
- **Exercise 3b**. Find the first and the last revision in which DIKW is mentioned. Then try to find the gap: when did it disappear, and when did it reappear? 

> **Hint**: Using `xmltodict` might be helpful for wrangling the XML data, but it might also make life complicated. Explore the data both through a text editor (or browser) and through code to get a sense of it. 

> **Hint**: It is admittedly a little easier to do this if you make use of time in your DataFrame. We do not cover that much until Chapter 10, but feel free to look ahead. You can still sort by revisionID and then just browse the data yourself. This will end up being one of those tasks that's not easy but gets easier with more skills of abstraction.  
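To give a feel for the shape of the task, here is a minimal sketch that uses the standard library's `xml.etree.ElementTree` instead of `xmltodict`. It runs on a tiny, made-up XML string rather than the real export, and it assumes the export nests `<revision>` elements (each with `<id>`, `<timestamp>` and `<text>`) inside a `<page>`, all under an XML namespace. Check the actual file in a text editor before relying on these names.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical, heavily trimmed stand-in for the Special:Export file.
sample_xml = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Data</title>
    <revision>
      <id>1</id>
      <timestamp>2005-01-01T00:00:00Z</timestamp>
      <text>Data are facts.</text>
    </revision>
    <revision>
      <id>2</id>
      <timestamp>2010-06-01T00:00:00Z</timestamp>
      <text>See the DIKW pyramid.</text>
    </revision>
    <revision>
      <id>3</id>
      <timestamp>2012-03-01T00:00:00Z</timestamp>
      <text>Data are facts again.</text>
    </revision>
  </page>
</mediawiki>"""


def local(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(sample_xml)

# One dictionary per revision, keyed by the child tag names.
rows = []
for elem in root.iter():
    if local(elem.tag) == "revision":
        rows.append({local(child.tag): (child.text or "") for child in elem})

revisions = pd.DataFrame(rows)
revisions["timestamp"] = pd.to_datetime(revisions["timestamp"])
revisions = revisions.sort_values("timestamp").reset_index(drop=True)

# Flag the revisions whose text mentions DIKW, then look at the first and last.
revisions["has_dikw"] = revisions["text"].str.contains("DIKW", regex=False)
mentions = revisions[revisions["has_dikw"]]
print(mentions[["id", "timestamp"]])
```

With the real file, `ET.parse(path).getroot()` would replace `ET.fromstring(sample_xml)`. The gap should then show up as a run of consecutive revisions where `has_dikw` is `False` between two runs where it is `True`.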

In [None]:
# Exercise 3a Answer below here


# Exercise 3a Answer above here

In [6]:
# Exercise 3b Answer below here


# Exercise 3b Answer above here