<span style="font-size: 20pt;"><span style="font-weight: bold;">Chapter 10.</span> Cleaning data for socially interesting features</span>

Last update: 12 January 2024

Thank you for checking out the code for: 

> Hogan, Bernie (2023) _From Social Science to Data Science_. London, UK. Sage Publications. 

This notebook contains the code from the book, along with the headers and additional author notes that are not in the book as a way to help navigate the code. You can run this notebook in a browser by clicking the buttons below. 
    
The version that is uploaded to GitHub should have all the results pasted, but the best way to follow along is to clear all outputs and then start afresh. To do this in Jupyter, go to the menu and select "Kernel -> Restart Kernel and Clear all Outputs...". To do this on Google Colab, go to the menu and select "Edit -> Clear all outputs".
    
The most up-to-date version of this code can be found at https://www.github.com/berniehogan/fsstds 

Additional resources and teaching materials can be found on Sage's forthcoming website for this book. 

All code for the book and derivative code on the book's repository is released open source under the MIT license. 
    

[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/berniehogan/fsstds/main?filepath=chapters%2FCh.10.Cleaning.ipynb)[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/berniehogan/fsstds/blob/main/chapters/Ch.10.Cleaning.ipynb)

<span style="font-size: 20pt;">📺 YouTube Video lecture for this chapter 📺</span>

(Please note: This lecture is still pending)

In [41]:
from IPython.display import YouTubeVideo

# YouTubeVideo('')

## Important note on this chapter before getting started 

> This chapter is a sustained cleaning example of Stack Exchange data. It uses the Movie data, which can be downloaded from the Internet Archive. The subsequent chapters use this cleaned data for some of the examples. However, while you can continue to use this chapter to create such a file, I would highly recommend the use of the [Stack Exchange Downloader](https://github.com/berniehogan/fsstds/blob/main/supplemental_notebooks/Ch.00.Stack_downloader.ipynb), which I have provided in the supplemental notebooks. You simply need to run that one long cell and it will render a series of buttons and options for a very simple download experience. The downloader also exports to `feather` as well as `pickle` and `parquet`. I recommend using feather. 

> The updated versions of the following chapters preferentially load the feather data but will still load the .pkl data, as you can see in those chapters. Many of the insights in those chapters do not require this data, but it is really great data to get started with: 

> * It is live,
> * It is messy (which is good: there is plenty to practice cleaning up),
> * There are many stacks to consider if you have topic expertise,
> * It is open access and does not require authentication. 
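
The loading preference described above (feather first, `.pkl` as a fallback) can be sketched with the standard library alone. This is a minimal sketch rather than the code used in the later chapters: the helper name `pick_data_file` is hypothetical, and the default file stem is an assumption that follows the `movies_stack_df` name used when this notebook stores its results.

```python
from pathlib import Path
from typing import Optional


def pick_data_file(data_dir: Path, stem: str = "movies_stack_df") -> Optional[Path]:
    """Hypothetical helper: return the first existing data file, preferring feather."""
    for suffix in (".feather", ".pkl"):
        candidate = data_dir / f"{stem}{suffix}"
        if candidate.exists():
            return candidate
    # Neither file is present, so the caller has to create one first.
    return None
```

The returned path could then be handed to `pd.read_feather` or `pd.read_pickle`, depending on its suffix.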

**This notebook uses the most recent Movies.StackExchange.com download, which as of the latest edit is from 4 December 2023. The results will therefore differ from the book, but the structure and logic are the same.** 

## Introduction: Data as a form of social context

In the book I give a series of examples, such as a train journey and an email message, where we can think of many forms of socially interesting data that can be collected. There is no code in the first section. It focuses on data such as dates, locations, relationships, and the semantics of text. These can all be found in many forms of social data. 

To see them all in action and to practice with live data we can look to Stack Exchange. 

# A sustained example for cleaning: Stack Exchange

The book unfortunately omits one key step in this process of wrangling data from Stack Exchange. The files from https://archive.org/download/stackexchange/ come down as `.7z` files. There are programs available for both Mac and Windows which will unzip 7zip files. However, you can also do this directly in Python. 

The first code cell below is NOT in the book. It will automatically take a 7zip file from the data folder and extract it as a subfolder. You will first have to download the `.7z` file to your data folder. If you want to look at an automated way to download the data and extract, you are welcome to review the [Stack Exchange Downloader](https://github.com/berniehogan/fsstds/blob/main/supplemental_notebooks/Ch.00.Stack_downloader.ipynb) in the supplemental files. 

Notice that the first part of the code installs `py7zr`, a Python library for extracting 7z archives, if it is not already available. The cell then defines a function called `extract_7zip` with an optional parameter `remove_7z` which, if true, will try to delete the original 7z file once it has been extracted. 

The Stack Exchange folder is then placed in the same folder as the original 7z. The function assumes the naming convention `movies.stackexchange.com.7z` and removes the `.7z` at the end, so there should be a folder called (in this case) `movies.stackexchange.com` once the cell has run. 
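
The naming convention described above can be checked without touching any files. The following is a small sketch using `pathlib` only; the relative path is an assumption about where the archive sits, and `with_suffix("")` is shown as one alternative way to drop the final extension.

```python
from pathlib import Path

# Assumed location of the archive relative to this notebook.
archive_path = Path("../data/movies.stackexchange.com.7z")

# Drop only the final ".7z" extension to get the folder name.
folder_name = ".".join(archive_path.name.split(".")[:-1])
print(folder_name)

# Path.with_suffix("") also removes just the last suffix.
print(archive_path.with_suffix("").name)

# The extraction folder sits alongside the archive.
archive_folder = archive_path.parent / folder_name
print(archive_folder)
```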

In [1]:
# py7zr might not be installed. Installing it on the fly should not cause trouble.
try: 
    import py7zr
except ModuleNotFoundError:
    import sys
    !{sys.executable} -m pip install py7zr
    import py7zr

from pathlib import Path
    
def extract_7zip(archive_path, remove_7z=False):
    
    if isinstance(archive_path, str):
        archive_path = Path(archive_path)
            
    if not archive_path.exists():
        raise FileNotFoundError(f"No file found at {archive_path}.")
        
    file_name = archive_path.name
    folder_name = ".".join(archive_path.name.split(".")[:-1])
    archive_folder = archive_path.parent / folder_name
    archive_folder.mkdir(exist_ok=True)

    with py7zr.SevenZipFile(archive_path, 'r') as archive:
        archive.extractall(archive_folder)

    if remove_7z: 
        try:
            # pathlib's unlink() deletes the file without needing `import os`
            archive_path.unlink()
        except OSError:
            print("The original 7z could not be deleted")

    return True

extract_7zip("../data/movies.stackexchange.com.7z", remove_7z=False)

True

In [2]:
from pathlib import Path

# I just unzipped the file within the data folder.
data_dir = Path().cwd().parent / "data" / "movies.stackexchange.com"

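# Set the encoding explicitly so the result does not depend on the platform default.
with open(data_dir / "Posts.xml", encoding="utf-8") as posts_file:
    print(posts_file.read(1000))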

﻿<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="2776" CreationDate="2011-11-30T19:15:54.070" Score="31" ViewCount="8503" Body="&lt;p&gt;Some comedians / actors are given creative freedom to improvise at times when producing a new film. Is there any evidence that Vince Vaughn or Owen Wilson improvised in any scenes, diverging from the script in the film &amp;quot;Wedding Crashers&amp;quot;?&lt;/p&gt;&#xA;" OwnerUserId="11" LastEditorUserId="94442" LastEditDate="2022-02-12T21:59:39.633" LastActivityDate="2022-02-12T21:59:39.633" Title="To what extent were the actors in Wedding Crashers improvising?" Tags="&lt;wedding-crashers&gt;" AnswerCount="2" CommentCount="0" ContentLicense="CC BY-SA 4.0" />
  <row Id="2" PostTypeId="2" ParentId="1" CreationDate="2011-11-30T19:37:10.510" Score="15" Body="&lt;p&gt;According to the &lt;a href=&quot;http://www.imdb.com/title/tt0396269/trivia&quot;&gt;trivia on IMDb&lt;/a&gt;, Owen Wilson and Vince Vaughn im

In [3]:
import xmltodict 

xml_data = (data_dir / "Posts.xml").read_text(encoding="utf-8")

stack_dict = xmltodict.parse(xml_data)
print(type(stack_dict))

<class 'dict'>


In [4]:
print(stack_dict["posts"].keys())
print(type(stack_dict["posts"]["row"]))

dict_keys(['row'])
<class 'list'>


In [5]:
display(stack_dict["posts"]["row"][0])

{'@Id': '1',
 '@PostTypeId': '1',
 '@AcceptedAnswerId': '2776',
 '@CreationDate': '2011-11-30T19:15:54.070',
 '@Score': '31',
 '@ViewCount': '8503',
 '@Body': '<p>Some comedians / actors are given creative freedom to improvise at times when producing a new film. Is there any evidence that Vince Vaughn or Owen Wilson improvised in any scenes, diverging from the script in the film &quot;Wedding Crashers&quot;?</p>\n',
 '@OwnerUserId': '11',
 '@LastEditorUserId': '94442',
 '@LastEditDate': '2022-02-12T21:59:39.633',
 '@LastActivityDate': '2022-02-12T21:59:39.633',
 '@Title': 'To what extent were the actors in Wedding Crashers improvising?',
 '@Tags': '<wedding-crashers>',
 '@AnswerCount': '2',
 '@CommentCount': '0',
 '@ContentLicense': 'CC BY-SA 4.0'}
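
The `@` prefix on each key is how `xmltodict` marks XML attributes. As a rough illustration of where those values come from, the sketch below parses a shortened, made-up two-row sample with the standard library's `xml.etree.ElementTree`. It is an alternative view of the same structure, not the approach used in the book.

```python
import xml.etree.ElementTree as ET

# A shortened sample in the same shape as Posts.xml (not the real file).
sample = """<posts>
  <row Id="1" PostTypeId="1" Score="31" Title="A question" />
  <row Id="2" PostTypeId="2" ParentId="1" Score="15" />
</posts>"""

root = ET.fromstring(sample)

# Each <row> stores its data as attributes; .attrib is a dict of strings.
rows = [dict(row.attrib) for row in root.iter("row")]

print(rows[0])
print(rows[1].get("Title"))  # the second row has no Title attribute
```

Every value arrives as a string and some attributes are absent on some rows, which is why the later sections deal with missing data and numeric conversion.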

In [6]:
print(len(stack_dict['posts']['row']))

64054


In [7]:
import pandas as pd 

stack_df = pd.json_normalize(stack_dict["posts"]["row"])

## Quick summaries of the dataset 

In [8]:
stack_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64054 entries, 0 to 64053
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   @Id                     64054 non-null  object
 1   @PostTypeId             64054 non-null  object
 2   @AcceptedAnswerId       13634 non-null  object
 3   @CreationDate           64054 non-null  object
 4   @Score                  64054 non-null  object
 5   @ViewCount              22570 non-null  object
 6   @Body                   64054 non-null  object
 7   @OwnerUserId            62148 non-null  object
 8   @LastEditorUserId       40154 non-null  object
 9   @LastEditDate           40848 non-null  object
 10  @LastActivityDate       64054 non-null  object
 11  @Title                  22570 non-null  object
 12  @Tags                   22570 non-null  object
 13  @AnswerCount            22570 non-null  object
 14  @CommentCount           64054 non-null  object
 15  @C

In [9]:
stack_df.columns = [i.replace("@","") for i in stack_df.columns]

In [10]:
stack_df.loc[0]

Id                                                                       1
PostTypeId                                                               1
AcceptedAnswerId                                                      2776
CreationDate                                       2011-11-30T19:15:54.070
Score                                                                   31
ViewCount                                                             8503
Body                     <p>Some comedians / actors are given creative ...
OwnerUserId                                                             11
LastEditorUserId                                                     94442
LastEditDate                                       2022-02-12T21:59:39.633
LastActivityDate                                   2022-02-12T21:59:39.633
Title                    To what extent were the actors in Wedding Cras...
Tags                                                    <wedding-crashers>
AnswerCount              

# Setting an index

In [11]:
stack_df.columns

Index(['Id', 'PostTypeId', 'AcceptedAnswerId', 'CreationDate', 'Score',
       'ViewCount', 'Body', 'OwnerUserId', 'LastEditorUserId', 'LastEditDate',
       'LastActivityDate', 'Title', 'Tags', 'AnswerCount', 'CommentCount',
       'ContentLicense', 'ParentId', 'FavoriteCount', 'LastEditorDisplayName',
       'OwnerDisplayName', 'ClosedDate', 'CommunityOwnedDate'],
      dtype='object')

In [12]:
stack_df.set_index('Id', inplace=True)

# Handling missing data

In [13]:
len(stack_df["OwnerDisplayName"].unique())

463

In [14]:
stack_df["OwnerDisplayName"].unique()[0:5]

array([nan, 'user96', 'user35', 'user223', 'user315'], dtype=object)

In [15]:
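# Assign the result back; calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas versions and may not update the DataFrame.
stack_df["OwnerDisplayName"] = stack_df["OwnerDisplayName"].fillna("")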

In [16]:
# The index holds string Ids, so use .iloc for the first row by position.
type(stack_df["OwnerDisplayName"].iloc[0])

str

# Cleaning numeric data

In [17]:
# I use [:5] for brevity. You can remove it to see all of the columns.
stack_df[stack_df.columns[:5]].head()

Id,PostTypeId,AcceptedAnswerId,CreationDate,Score,ViewCount
1,1,2776.0,2011-11-30T19:15:54.070,31,8503.0
2,2,,2011-11-30T19:37:10.510,15,
3,1,814.0,2011-11-30T19:41:14.960,30,1946.0
4,1,120.0,2011-11-30T19:42:45.470,59,3929.0
6,1,21.0,2011-11-30T19:44:55.593,16,8337.0


In [18]:
for col in ["Score", "ViewCount", "AnswerCount",
            "CommentCount", "FavoriteCount"]:
    stack_df[col] = pd.to_numeric(stack_df[col],errors="coerce")

print(stack_df['Score'].mean())

7.2629656227558


In [19]:
stack_df.describe().style.format("{:0.2f}")

,Score,ViewCount,AnswerCount,CommentCount,FavoriteCount
count,64054.0,22570.0,22570.0,64054.0,4886.0
mean,7.26,7704.07,1.64,1.58,0.0
std,11.7,29366.4,1.45,2.44,0.02
min,-24.0,6.0,0.0,0.0,0.0
25%,1.0,422.0,1.0,0.0,0.0
50%,4.0,1428.0,1.0,1.0,0.0
75%,9.0,5138.0,2.0,2.0,0.0
max,326.0,1528888.0,19.0,31.0,1.0


In [20]:
tot = len(stack_df)

for col in ["Score", "ViewCount", "AnswerCount", 
            "CommentCount","FavoriteCount"]:
    print(f"Missing rows for {col}:", tot - stack_df[col].count()) 

Missing rows for Score: 0
Missing rows for ViewCount: 41484
Missing rows for AnswerCount: 41484
Missing rows for CommentCount: 0
Missing rows for FavoriteCount: 59168


# Cleaning up Web data

In [21]:
# Remember to set `Id` to the index (and remove the @ symbols) 
# if you get an error here.
stack_df.loc["2","Body"]

'<p>According to the <a href="http://www.imdb.com/title/tt0396269/trivia">trivia on IMDb</a>, Owen Wilson and Vince Vaughn improvised the "Lock it up!" banter. As I understand it, that also means the other scenes did not - or only slightly - diverge from the script.</p>\n'

## Encoding
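
One encoding detail is visible in this data: the raw XML stores each post body with its HTML markup escaped as entities (for example `&lt;p&gt;`), and the parsed `Body` of the first post still contains HTML entities such as `&quot;`. The following is a small sketch, using the standard library's `html.unescape` on a string abridged from that first post, of how the two layers unwrap.

```python
import html

# Abridged from the raw XML of the first post.
raw = "&lt;p&gt;the film &amp;quot;Wedding Crashers&amp;quot;?&lt;/p&gt;"

once = html.unescape(raw)    # roughly what an XML parser hands back
twice = html.unescape(once)  # HTML entities inside the body resolved too

print(once)
print(twice)
```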

## Stripping HTML from text 

In [22]:
# Note - this might take a few seconds to a minute to complete. 
import bs4 
import warnings

# BeautifulSoup may warn that short text resembles a filename or URL rather than markup. 
# For discussion, example implementation, see: https://bugs.launchpad.net/beautifulsoup/+bug/1955450
warnings.filterwarnings("ignore", category=bs4.MarkupResemblesLocatorWarning)

def robustParse(text):
    try: 
        return bs4.BeautifulSoup(text, "lxml").text.replace("\n"," ")
    except Exception: 
        return None 

# Note: Variable column header different from book.
# Updated to be consistent with SemanticName conventions
stack_df["BodyText"] = stack_df["Body"].map(robustParse)

display(stack_df[["Body","BodyText"]].head())

Id,Body,BodyText
1,<p>Some comedians / actors are given creative ...,Some comedians / actors are given creative fre...
2,"<p>According to the <a href=""http://www.imdb.c...","According to the trivia on IMDb, Owen Wilson a..."
3,"<p>In his Star Wars Episode 1 <a href=""https:/...","In his Star Wars Episode 1 review/analysis, Mi..."
4,<p>I'm a big fan of the Pink Panther movies (t...,I'm a big fan of the Pink Panther movies (the ...
6,"<p>At the end of the movie, adult Jack (Sean P...","At the end of the movie, adult Jack (Sean Penn..."


## Extracting links from HTML

In [23]:

# Notice that this will, like above, take a moment to run. 
def returnLinks(text):
    try: 
        soup = bs4.BeautifulSoup(text, 'html.parser')
        return [x['href'] for x in soup.find_all('a')
                if 'href' in x.attrs and "://" in x.get('href')]
    except Exception:
        return None

# Let's make a new column with a list of all URLs found
# Errata: Naming convention aligned with Stack Downloader and other cols
#         It was ListUrl and now it is BodyURLs 
stack_df["BodyURLs"] = stack_df["Body"].map(returnLinks)

stack_df["BodyURLs"].head()

Id
1                                                   []
2         [http://www.imdb.com/title/tt0396269/trivia]
3    [https://redlettermedia.com/mr-plinketts-star-...
4    [http://www.imdb.com/title/tt0352520/, http://...
6                                                   []
Name: BodyURLs, dtype: object
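
Once the links are in a list, one possible next step is to reduce each URL to its host and count which sites are cited most often. The following is a sketch under stated assumptions: the nested list stands in for the `BodyURLs` column, and the `example.org` URL is made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# Example URL lists in the shape of the BodyURLs column.
body_urls = [
    [],
    ["http://www.imdb.com/title/tt0396269/trivia"],
    ["http://www.imdb.com/title/tt0352520/", "https://example.org/page"],
]

# Count how often each host is linked across all posts.
domain_counts = Counter(
    urlparse(url).netloc for urls in body_urls for url in urls
)
print(domain_counts.most_common())
```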

# Cleaning up lists of data 

In [24]:
def splitTags(text):
    if not isinstance(text, str):
        return []
    elif len(text) == 0:
        return []
    else:
        return text[1:-1].split("><")

# The index holds string Ids, so .iloc[4] selects the fifth row by position.
print(stack_df["Tags"].iloc[4],end="\n\n")

stack_df["TagsList"] = stack_df["Tags"].map(splitTags)
print(stack_df["TagsList"].iloc[4])

<plot-explanation><analysis><ending><the-tree-of-life>

['plot-explanation', 'analysis', 'ending', 'the-tree-of-life']
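
The same split can also be written with a regular expression that captures whatever sits between each pair of angle brackets. This is a hedged alternative, not the book's approach, and the name `split_tags_regex` is hypothetical.

```python
import re


def split_tags_regex(text):
    # Hypothetical alternative to splitTags: non-strings (such as NaN
    # for answers, which carry no tags) give an empty list.
    if not isinstance(text, str):
        return []
    # Capture the text between each "<" and ">" pair.
    return re.findall(r"<([^<>]+)>", text)


tags = split_tags_regex("<plot-explanation><analysis><ending><the-tree-of-life>")
print(tags)
print(split_tags_regex(None))
```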


In [25]:
stack_df["TagsList"].map(len).value_counts().sort_index(ascending=True)

TagsList
0    41484
1     5092
2    11805
3     4656
4      897
5      120
Name: count, dtype: int64

In [26]:
pd.crosstab(stack_df['PostTypeId'],stack_df['TagsList'].map(len))

PostTypeId \ TagsList,0,1,2,3,4,5
1,0,5092,11805,4656,897,120
2,37036,0,0,0,0,0
4,2211,0,0,0,0,0
5,2211,0,0,0,0,0
6,21,0,0,0,0,0
7,5,0,0,0,0,0


In [27]:
print(len(stack_df[stack_df["Tags"].notna()]))

22570


In [28]:
longtag_stack_df = stack_df[stack_df["Tags"].notna()].explode("TagsList")
display(longtag_stack_df[["TagsList",
                          "Body",
                          "Score",
                          "OwnerUserId"]].head(10))

Id,TagsList,Body,Score,OwnerUserId
1,wedding-crashers,<p>Some comedians / actors are given creative ...,31,11
3,analysis,"<p>In his Star Wars Episode 1 <a href=""https:/...",30,41
3,star-wars,"<p>In his Star Wars Episode 1 <a href=""https:/...",30,41
4,comedy,<p>I'm a big fan of the Pink Panther movies (t...,59,22
4,the-pink-panther,<p>I'm a big fan of the Pink Panther movies (t...,59,22
6,plot-explanation,"<p>At the end of the movie, adult Jack (Sean P...",16,34
6,analysis,"<p>At the end of the movie, adult Jack (Sean P...",16,34
6,ending,"<p>At the end of the movie, adult Jack (Sean P...",16,34
6,the-tree-of-life,"<p>At the end of the movie, adult Jack (Sean P...",16,34
10,plot-explanation,"<p>Frank Costello the mob boss, one of the mai...",21,11


In [29]:
longtag_stack_df['QuestionId'] = longtag_stack_df.index
longtag_stack_df.index = pd.RangeIndex(len(longtag_stack_df))
display(longtag_stack_df[["TagsList",
                          "QuestionId",
                          "Score",
                          "OwnerUserId"]].head(6))

,TagsList,QuestionId,Score,OwnerUserId
0,wedding-crashers,1,31,11
1,analysis,3,30,41
2,star-wars,3,30,41
3,comedy,4,59,22
4,the-pink-panther,4,59,22
5,plot-explanation,6,16,34
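
To see what the wide-to-long reshaping does, here is a plain-Python sketch of the same idea on two questions taken from the output above. It only illustrates the logic of one row per (question, tag) pair and is not how pandas implements `explode`.

```python
# Two questions in "wide" form: one row per question, tags in a list.
questions = [
    {"QuestionId": "3", "Score": 30, "TagsList": ["analysis", "star-wars"]},
    {"QuestionId": "4", "Score": 59, "TagsList": ["comedy", "the-pink-panther"]},
]

# "Long" form: one row per (question, tag) pair, other fields repeated.
long_rows = [
    {**{k: v for k, v in q.items() if k != "TagsList"}, "TagsList": tag}
    for q in questions
    for tag in q["TagsList"]
]

for row in long_rows:
    print(row)
```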


# Parsing time in the Stack Exchange 

In [30]:
for col in ["CreationDate", "LastEditDate", "LastActivityDate", 
            "ClosedDate", "CommunityOwnedDate"]:
    stack_df[col] = pd.to_datetime(stack_df[col])
    print(f"Number of missing for {col}: ",
          len(stack_df)-stack_df[col].count())

Number of missing for CreationDate:  0
Number of missing for LastEditDate:  23206
Number of missing for LastActivityDate:  0
Number of missing for ClosedDate:  61974
Number of missing for CommunityOwnedDate:  63986


In [31]:
# Slice by time 1: By Year
year = 2016
cyear = len(stack_df[stack_df["CreationDate"].dt.year == year])
print(f"There were {cyear} posts created in {year}")

There were 9545 posts created in 2016


In [32]:
# Time slicing: For one specific day
t1 = '2015-03-14'; t2 = '2015-03-15'
mask = (stack_df["CreationDate"]>= t1) & \
       (stack_df["CreationDate"]< t2)

print(f"There were {len(stack_df[mask])} posts made between",
      f"{t1} and {t2}")

There were 22 posts made between 2015-03-14 and 2015-03-15


In [33]:
# The index holds string Ids, so use .iloc for the first row by position.
type(stack_df["CreationDate"].iloc[0])

pandas._libs.tslibs.timestamps.Timestamp
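
The one-day slice above relies on a half-open interval: the start is included and the end is excluded, so a post made at exactly midnight on the following day is not counted. The following is a minimal sketch of that logic with the standard library's `datetime`; the four timestamps are made up and only mimic the format of `CreationDate`.

```python
from datetime import datetime

# Made-up timestamps in the same ISO 8601 format as CreationDate.
created = [
    datetime.fromisoformat("2015-03-13T23:59:59.000"),
    datetime.fromisoformat("2015-03-14T00:00:00.000"),
    datetime.fromisoformat("2015-03-14T18:30:00.000"),
    datetime.fromisoformat("2015-03-15T00:00:00.000"),
]

t1 = datetime.fromisoformat("2015-03-14")
t2 = datetime.fromisoformat("2015-03-15")

# Half-open interval: include t1 itself, exclude t2.
on_day = [ts for ts in created if t1 <= ts < t2]
print(len(on_day))
```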

# Regular expressions

In [34]:
list_Comments = ["I wanted a new guitar for Christmas, not a new sweater", 
                 "I always knew trombones were not for me", 
                 "Woohoo! New drums for my kit.", 
                 "What to do with my new bass?"]

import re

pattern = re.compile(r"new \w")
for comment in list_Comments: 
    print(pattern.findall(comment))

['new g', 'new s']
['new t']
[]
['new b']


In [35]:
pattern = re.compile(r"new \w*")
for comment in list_Comments: 
    print(pattern.findall(comment))

['new guitar', 'new sweater']
['new trombones']
[]
['new bass']


In [36]:
pattern = re.compile(r"\bnew \w*",re.IGNORECASE)
for comment in list_Comments: 
    print(pattern.findall(comment))

['new guitar', 'new sweater']
[]
['New drums']
['new bass']
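
The difference between the last two patterns comes down to the word boundary `\b` and the `re.IGNORECASE` flag. The short sketch below isolates each effect using two of the comments from above.

```python
import re

text = "I always knew trombones were not for me"

# Without \b, "new" can match inside "knew".
without_boundary = re.findall(r"new \w*", text)
print(without_boundary)

# With \b, "new" must start at a word boundary, so "knew" is skipped.
with_boundary = re.findall(r"\bnew \w*", text)
print(with_boundary)

# re.IGNORECASE lets the same pattern match "New" as well.
ignoring_case = re.findall(r"\bnew \w*", "Woohoo! New drums for my kit.", re.IGNORECASE)
print(ignoring_case)
```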


## Further learning for regular expressions

In [37]:
# Try the various codes yourself for the text in the example
pattern = r"\w"
text = "Happy Birthday: It's 21 time!" 

if re_match := re.compile(pattern).search(text):
    print(re_match[0])

H


## Regular expressions and _ground truth_

In [38]:
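# Raw string so the backslashes reach the regex engine unchanged.
# Python's re module spells the end-of-string anchor \Z.
email_pattern = r"\A[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\Z"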
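
A pattern like this can only check whether a string is shaped like an email address, not whether the address exists. The sketch below tries the pattern on three made-up strings. It writes the pattern as a raw string with Python's `\Z` end-of-string anchor and a backtick character in the character classes; treat both as assumptions about the intended pattern.

```python
import re

# Assumption: raw string, \Z as the end anchor, backtick in the classes.
email_re = re.compile(
    r"\A[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
    r"@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\Z",
    re.IGNORECASE,
)

# Made-up candidates; ".invalid" is a reserved top-level domain.
candidates = ["someone@example.com", "not an email", "nobody@nowhere.invalid"]
results = {c: bool(email_re.match(c)) for c in candidates}
print(results)
```

The last candidate is well formed and so passes the pattern, even though no mailbox can exist at that domain. The ground truth about whether an address is real lies outside the text itself.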

# Storing our work

In [39]:
# Previous export to pickle. Now recommending feather instead (for long-term maintainability)
#
# import pickle 
#
# with open(Path.cwd().parent / "data" / "movies_stack_df.pkl",'wb') as fileout: 
#     fileout.write(pickle.dumps(stack_df))

stack_df.reset_index().to_feather(Path.cwd().parent / "data" / "movies_stack_df.feather")

# Summary

# Further Reading 

# Extensions and reflections 