# Week 2. Day 1. Exercises from Chapter 4 of FSStDS. 
## Fundamentals of Social Data Science. MT 2022

Within your study pod discuss the following questions. Please submit an individual assignment by 12:30pm Tuesday, October 18, 2022 on Canvas. 

# Exercise 1. Creating a DataFrame from multiple JSON files

The data folder contains nine pages of OMDB search results for "Oxford", collected last year (`omdb_Oxford_search_page_*.json`). 

**Exercise 1a.** Create a single DataFrame from these 9 files.

**Exercise 1b.** Report on the oldest and most recent entry. 

- **Hint**. To read all files matching a wildcard from a Path object, use the `glob` method, such as: `for path in data_dir.glob("omdb_Oxford*.json"): path.do_something()`

- **Challenge**. Shows that span several years are written with the two years separated by `--`. Split these values and take both years into account when reporting the oldest and newest entries. 
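
The hint above can be tried out without the course data. The following is a minimal sketch, assuming a temporary folder of hand-made files: the file contents and the extra `notes.json` file are hypothetical, and only the file name pattern comes from the exercise. It shows `glob` picking up the matching files and `pd.concat` stacking them into one DataFrame.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)

    # Write three hypothetical result pages plus one file that should not match.
    for page in (1, 2, 3):
        payload = {"Search": [{"Title": f"Film {page}", "Year": str(2000 + page)}]}
        (demo_dir / f"omdb_Oxford_search_page_{page}.json").write_text(json.dumps(payload))
    (demo_dir / "notes.json").write_text("{}")

    # glob returns paths in no guaranteed order, so sort them for reproducibility.
    paths = sorted(demo_dir.glob("omdb_Oxford*.json"))
    frames = [pd.DataFrame(json.loads(p.read_text())["Search"]) for p in paths]

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
matched_names = [p.name for p in paths]
print(matched_names)
print(combined)
```

Without `ignore_index=True`, each page would keep its own row labels starting from 0, so the combined frame would contain repeated index values.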


In [1]:
# Exercise 1a answer below here 
import pandas as pd
import json
from pathlib import Path 

def read_json(json_file: Path) -> pd.DataFrame:
    with open(json_file, 'r') as f:
        data = json.load(f)
    return pd.DataFrame(data["Search"])


data_dir = Path("../data/Week2Day1 - Exercsie Data for the data folder")
assert data_dir.exists()

# Read the files from json
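# ignore_index=True gives the combined DataFrame a fresh index instead of repeating each page's row labels
df = pd.concat([read_json(f) for f in data_dir.glob("*omdb_Oxford_search_page_*.json")], ignore_index=True)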
# Exercise 1a answer above here

In [2]:
# Exercise 1b answer below here 
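# Keep only the first four-digit year, so a show that spans several years is counted from its start year
df["Year"] = df["Year"].str.extract(r"(\d{4})").astype(int)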
oldest_entry = df[df["Year"] == df["Year"].min()]
newest_entry = df[df["Year"] == df["Year"].max()]
print(f"Oldest entry: {oldest_entry['Title'].values[0]} ({oldest_entry['Year'].values[0]})")
print(f"Newest entry: {newest_entry['Title'].values[0]} ({newest_entry['Year'].values[0]})")
# Exercise 1b answer above here

Oldest entry: The Oxford and Cambridge University Boat Race (1895)
Newest entry: Ein Sommer in Oxford (2018)
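
The answer above keeps only the first year of each entry, so it does not fully address the challenge about shows that span years. One possible approach is sketched below on hypothetical sample rows (the titles and years are invented for illustration). It assumes that pulling out every four-digit number is enough, which sidesteps the question of exactly which separator character appears in the data.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from the OMDB JSON files.
sample = pd.DataFrame(
    {
        "Title": ["Film A", "Series B", "Film C"],
        "Year": ["1999", "2004--2020", "2015"],
    }
)

# Extract every four-digit year from the string, whatever separates them.
years = sample["Year"].str.findall(r"\d{4}").apply(lambda ys: [int(y) for y in ys])
sample["start_year"] = years.apply(min)
sample["end_year"] = years.apply(max)

# The oldest entry is judged by its start year, the newest by its end year.
oldest = sample.loc[sample["start_year"].idxmin()]
newest = sample.loc[sample["end_year"].idxmax()]
print(f"Oldest entry: {oldest['Title']} ({oldest['start_year']})")
print(f"Newest entry: {newest['Title']} ({newest['end_year']})")
```

In this toy data the newest entry by end year is `Series B`, whereas looking at start years alone would have picked `Film C`.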


# Exercise 2. Navigate Reddit JSON 

Go to a page on Reddit and then replace www.reddit with api.reddit in the address. This returns the page as JSON. Do this for a specific subreddit of interest (such as cats, cryptocurrency, mediasynthesis, ukpolitics, etc.). 

This JSON will likely only have 25-26 entries. Normalise the data so that each story has a single row. The result will have many, many columns. One of these columns will be the title of the headline and one will be the URL. 

- **Exercise 2a**. Find these two columns and then create a smaller DataFrame that just has these columns as well as the one for upvote score (`ups`).  

- **Exercise 2b**. What are the most common words across all titles? Does it matter if you use lower case and remove punctuation as we did last week? 

- **Exercise 2c**. What domain names are the most common?
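
To get a feel for the normalising step before working with a real response, here is a minimal sketch on a hand-made dictionary. The `listing` structure is a hypothetical, heavily trimmed stand-in for a Reddit listing; a real response has many more keys per story.

```python
import pandas as pd

# Hypothetical, heavily trimmed stand-in for a Reddit listing response.
listing = {
    "data": {
        "children": [
            {"kind": "t3", "data": {"title": "First story", "url": "https://example.com/a", "ups": 10}},
            {"kind": "t3", "data": {"title": "Second story", "url": "https://example.org/b", "ups": 3}},
        ]
    }
}

# json_normalize flattens nested dictionaries into dotted column names,
# giving one row per story.
flat = pd.json_normalize(listing["data"]["children"])
print(flat.columns.tolist())

# .copy() makes the smaller frame independent of the larger one, so columns
# can be added to it later without a SettingWithCopyWarning.
small = flat[["data.title", "data.url", "data.ups"]].copy()
print(small)
```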

> **Hint**: If you aren't having luck with saving your own JSON, you can use the old `environment.json` that is included with the data. 

> **Hint**: Parsing domain names can be a nuisance. Here is a small snippet that can help: 

In [3]:
# See: https://docs.python.org/3/library/urllib.parse.html
# For example:
from urllib.parse import urlparse
result = urlparse("http://www.nytimes.com/somestory.html")
print(result) # Which item is the domain name? 

ParseResult(scheme='http', netloc='www.nytimes.com', path='/somestory.html', params='', query='', fragment='')


In [4]:
# Exercise 2a Answer below here 
from typing import List
import requests

def read_results(url: str, columns: List[str]) -> pd.DataFrame:
    r = requests.get(url)
    assert r.status_code == 200, f"Request failed with status code {r.status_code}"
    results = r.json()
    return pd.DataFrame([data["data"] for data in results["data"]["children"]])[columns]

reddit_url = "https://api.reddit.com"
wholesome_url = f"{reddit_url}/r/wholesomememes/top?t=year&limit=50"

#new_df = read_results(wholesome_url, columns=["title", "url", "ups"])
#print(new_df.head())

# using the json
json_file = data_dir / "environment.json"
with open(json_file, 'r') as f:
    data = json.load(f)



environment_df = pd.json_normalize(data["data"]["children"])

small_environment = environment_df[["data.title", "data.ups", "data.url"]]
print(small_environment.head())
# Exercise 2a Answer above here 

                                          data.title  data.ups  \
0  President Biden will make entire 645k federal ...      1226   
1  ‘I regret my country has been absent’: John Ke...      3532   
2  Should Biden declare a 'climate emergency'? 38...        50   
3  Schumer calls for Biden to declare climate eme...        66   
4  Trump’s Last-Minute Attack on Clean Air Faces ...        24   

                                            data.url  
0  https://electrek.co/2021/01/25/president-biden...  
1  https://www.independent.co.uk/environment/clim...  
2  https://grist.org/climate/38-countries-have-de...  
3  https://thehill.com/homenews/senate/535811-sch...  
4  https://www.commondreams.org/newswire/2021/01/...  


In [5]:
# Exercise 2b Answer below here 
# Find most common words in titles
small_environment["data.title"].str.lower().str.split().explode().value_counts().head(10)

# Exercise 2b Answer above here 

the        50
to         47
climate    29
on         26
of         25
and        23
in         22
for        17
a          15
from       14
Name: data.title, dtype: int64
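
The answer above lower-cases the titles but leaves punctuation in place. To explore the second half of the question, the sketch below compares raw and cleaned counts on three hypothetical titles. The helper `word_counts` is introduced here for illustration only, and the regular expression `[^\w\s]` is one possible definition of "punctuation", not the only one.

```python
import pandas as pd

# Hypothetical titles, chosen so that case and punctuation both matter.
titles = pd.Series(
    [
        "Climate emergency: what now?",
        "A climate 'emergency' declared",
        "What now for Climate policy",
    ]
)


def word_counts(series: pd.Series, clean: bool = False) -> pd.Series:
    """Count words, optionally lower-casing and stripping punctuation first."""
    if clean:
        series = series.str.lower().str.replace(r"[^\w\s]", "", regex=True)
    return series.str.split().explode().value_counts()


raw = word_counts(titles)
cleaned = word_counts(titles, clean=True)
print(raw.head())
print(cleaned.head())
```

In the raw counts `Climate` and `climate` are separate words, and so are `emergency:` and `'emergency'`. After cleaning they collapse into `climate` and `emergency`.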

In [6]:
# Exercise 2c Answer below here 
def parse_url(url: str) -> str:
    return urlparse(url).netloc

# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when a column is added to a slice of environment_df
small_environment = small_environment.assign(
    domain=small_environment["data.url"].apply(parse_url)
)

# most common domains
small_environment["domain"].value_counts().head(10)

# Exercise 2c Answer above here 


www.reddit.com             7
www.theguardian.com        5
www.commondreams.org       4
biologicaldiversity.org    4
thehill.com                3
www.nytimes.com            3
electrek.co                2
www.independent.co.uk      2
news.trust.org             2
www.bbc.co.uk              2
Name: domain, dtype: int64
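
The counts above use the network location exactly as it appears in each URL, so a site linked both with and without `www.` would be counted as two domains. A minimal sketch of one way to merge such cases follows. The helper `normalised_domain` and the example URLs are hypothetical, and the function only lower-cases the name and strips a leading `www.`; it does not attempt to reduce other subdomains to a registered domain.

```python
from urllib.parse import urlparse

import pandas as pd


def normalised_domain(url: str) -> str:
    """Return the lower-cased network location without a leading 'www.'."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


# Hypothetical URLs for illustration.
urls = pd.Series(
    [
        "https://www.example.com/a",
        "https://example.com/b",
        "http://WWW.Example.com/c",
        "https://news.example.org/d",
    ]
)

domain_counts = urls.apply(normalised_domain).value_counts()
print(domain_counts)
```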

# Exercise 3. The love-hate relationship with DIKW

As mentioned in the chapter, the Wikipedia entry for data had DIKW in the article, then it was removed, then it reappeared! I think it is still there now. I did not make these edits myself. 

With the data export of the Wikipedia page on data (`Wikipedia - data - Special export - 2022-10-17_10_24_15.xml`): 

- **Exercise 3a**. Create a DataFrame where each revision of the Wikipedia article in the export is given its own row. 
- **Exercise 3b**. Search for the first and the last time DIKW was mentioned. Can you find the gap during which it was missing? When did it first appear? 

> **Hint**: Using `xmltodict` might be helpful for wrangling the XML data, but it might also make life complicated. Explore the data both through a text editor (or browser) and through code to get a sense of it. 

> **Hint**: It is admittedly a little easier to do this if you make use of time in your DataFrame. We do not cover that much until Chapter 10, but feel free to look ahead. You can still sort by revisionID and then just browse the data yourself. This will end up being one of those tasks that's not easy but gets easier with more skills of abstraction.  
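
If `xmltodict` does make life complicated, the standard library's `xml.etree.ElementTree` is an alternative. The sketch below runs on a hypothetical, heavily trimmed export string rather than the real file. The namespace URI shown is an assumption, so the code reads the namespace from the root tag instead of relying on a fixed value.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical, heavily trimmed export; the namespace URI is an assumption.
xml_text = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Data</title>
    <revision>
      <id>1</id>
      <timestamp>2001-03-17T06:29:52Z</timestamp>
      <text>First version</text>
    </revision>
    <revision>
      <id>2</id>
      <timestamp>2005-08-25T06:17:15Z</timestamp>
      <text>Mentions DIKW</text>
    </revision>
  </page>
</mediawiki>"""

root = ET.fromstring(xml_text)

# A namespaced tag looks like "{uri}mediawiki"; pull the uri out of the root tag.
ns = {"mw": root.tag[1:].split("}")[0]}

# Build one dictionary, and therefore one row, per <revision> element.
rows = [
    {
        "id": int(rev.findtext("mw:id", namespaces=ns)),
        "timestamp": rev.findtext("mw:timestamp", namespaces=ns),
        "text": rev.findtext("mw:text", default="", namespaces=ns),
    }
    for rev in root.iterfind(".//mw:revision", ns)
]

revisions = pd.DataFrame(rows)
revisions["timestamp"] = pd.to_datetime(revisions["timestamp"])
print(revisions)
```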

In [7]:
# Exercise 3a Answer below here
wiki_path = next(data_dir.glob("*Wiki*.xml"))

# parse the xml file to dictionary
import xmltodict
with open(wiki_path, 'r', encoding="utf8") as f:
    wiki_dict = xmltodict.parse(f.read())

wikidf = pd.json_normalize(wiki_dict["mediawiki"]["page"]["revision"])
wikidf["timestamp"] = pd.to_datetime(wikidf["timestamp"])
print(wikidf.head())
# Exercise 3a Answer above here

       id                 timestamp  minor               comment     model  \
0  246479 2001-03-17 06:29:52+00:00    NaN                     *  wikitext   
1  246480 2001-03-26 22:35:32+00:00    NaN                     *  wikitext   
2   18191 2002-02-25 14:51:43+00:00    NaN  Automated conversion  wikitext   
3   18192 2002-02-25 14:52:12+00:00    NaN          term in bold  wikitext   
4   18817 2002-02-25 15:43:11+00:00    NaN     linking mass noun  wikitext   

        format                             sha1 contributor.username  \
0  text/x-wiki  4bi0lqmoh3d6z1tv2dg3zfjaqt20fj3      208.245.214.xxx   
1  text/x-wiki  lv1z2dzhms5ys77nfrgq58mem9ptdik       192.75.241.xxx   
2  text/x-wiki  rtvjc8fo6z391pjlte9kw8e7wi02enn    Conversion script   
3  text/x-wiki  1x390dickut12wma4fuy2eezu138o37                  NaN   
4  text/x-wiki  gprvn7s0a4k8jrjcou6lq139xxv1hde                  NaN   

  contributor.id text.@bytes text.@xml:space  \
0              0         542        preserve   
1 

In [8]:
wikidf[["timestamp", "text.#text"]].head()

                  timestamp                                         text.#text
0 2001-03-17 06:29:52+00:00  The word <i>data</i> is the plural of <i>datum...
1 2001-03-26 22:35:32+00:00  The word <i>data</i> is the plural of <i>datum...
2 2002-02-25 14:51:43+00:00  The word <i>data</i> is the plural of <i>datum...
3 2002-02-25 14:52:12+00:00  The word '''data''' is the plural of <i>datum<...
4 2002-02-25 15:43:11+00:00  The word '''data''' is the plural of <i>datum<...


In [9]:
wikidf["timestamp"].max()

Timestamp('2010-05-14 22:59:49+0000', tz='UTC')

In [10]:
# Exercise 3b Answer below here
# Keep the timestamp and revision text, dropping revisions that have no text
dikwdf = wikidf[["timestamp", "text.#text"]].dropna().rename(columns={"text.#text": "text"})

smalldikwdf = dikwdf[dikwdf["text"].str.contains("dikw", na=False, case=False)]


# earliest timestamp
earliest_stamp = smalldikwdf["timestamp"].min()

# latest timestamp
latest_stamp = smalldikwdf["timestamp"].max()

# time difference in days
time_diff = (latest_stamp - earliest_stamp).days

# Find the dates where the word was removed
has_dikw = dikwdf["text"].str.contains("dikw", na=False, case=False)
dikw_removed = ~has_dikw & has_dikw.shift(1, fill_value=False)
dikw_removed_dates = dikwdf.loc[dikw_removed, "timestamp"].dt.strftime("%Y-%m-%d").tolist()

print(f"Earliest timestamp: {earliest_stamp}\nLatest timestamp: {latest_stamp}\nTotal duration: {time_diff} days\nDIKW was removed on the following dates: {dikw_removed_dates}")

# Exercise 3b Answer above here

Earliest timestamp: 2005-08-25 06:17:15+00:00
Latest timestamp: 2005-11-23 21:37:48+00:00
Total duration: 90 days
DIKW was removed on the following dates: ['2005-11-04', '2005-11-24']
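
The removal dates above come from comparing each revision with the one before it. The same idea finds the dates on which the term was added, which together with the removals marks out each gap. The sketch below uses hypothetical toy revisions, and it assumes the rows are sorted by timestamp, since `shift` compares neighbouring rows.

```python
import pandas as pd

# Hypothetical toy revisions: the term appears, is removed, then reappears.
toy = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2005-01-01", "2005-02-01", "2005-03-01", "2005-04-01", "2005-05-01"],
            utc=True,
        ),
        "text": ["intro", "intro DIKW", "intro dikw more", "intro", "intro DIKW again"],
    }
).sort_values("timestamp")

has_term = toy["text"].str.contains("dikw", case=False, na=False)

# shift(1) lines each revision up with the state of the previous revision.
previous = has_term.shift(1, fill_value=False)

added = toy.loc[has_term & ~previous, "timestamp"]    # absent before, present now
removed = toy.loc[~has_term & previous, "timestamp"]  # present before, absent now

print("Added on:", added.dt.strftime("%Y-%m-%d").tolist())
print("Removed on:", removed.dt.strftime("%Y-%m-%d").tolist())
```

Each gap then runs from a removal date to the next addition date.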
