# INFOMCDMMC Critical Data Mining of Media Culture

## Utrecht University, MSc Applied Data Science


### Team members:
* Meagan Loerakker, m.b.loerakker@students.uu.nl
* Celesta Terwisscha van Scheltinga, c.c.m.terwisschavanscheltinga@students.uu.nl
* Nina Alblas, n.m.alblas@students.uu.nl
* Berber van Drunen, b.p.vandrunen@students.uu.nl
* Debarupa Roy Choudhury, d.roychoudhury@students.uu.nl

# Preprocessing of the data

In [None]:
#Stats
import pandas as pd

#Support
import re
import csv
import string

#NLP
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt')  # tokenizer models needed by word_tokenize (newer NLTK releases may ask for 'punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
lemmatizer = nltk.stem.WordNetLemmatizer()

import spacy
nlp = spacy.load("en_core_web_sm")

In [3]:
# load the cleaned articles, skipping the first column of the CSV
news_ai_df = pd.read_csv("data/cleaned_data.csv").iloc[:, 1:]
news_ai_df

Unnamed: 0,filename,outlet,title,description,datetime,body,year,month
0,2010-06-gears-of-war-3-beast.html,Wired,Gears of War 3 Co-op Makes Beasts of Gamers,"LOS ANGELES — Back in 2008, Gears of War 2 int...",2010-06-17 16:22:00.000,"LOS ANGELES – Back in 2008, Gears of War 2 in...",2010,6
1,sponsored-story-innovating-for-the-individual....,Wired,WIRED Brand Lab | Innovating for the Individual,What every leader can learn from the technolog...,2021-08-27 12:14:31.296,Innovative technology is making healthcare mor...,2021,8
2,ttps:--www.wired.com-story-for-all-mankind-bes...,Wired,‘For All Mankind’ Is the Best Sci-Fi of Its Era,The Apple TV+ alternate history series is simp...,2022-06-09 07:00:00.000,"New Star Wars, new Star Trek, Russian Doll, Se...",2022,6
3,story-ghostery-open-source-new-business-model....,Wired,Ad-Blocker Ghostery Just Went Open Source—And ...,"Ghostery, Edward Snowden’s preferred ad-blocke...",2018-03-08 09:45:00.000,"In privacy-focused, anti-establishment corners...",2018,3
4,story-best-game-subscriptions.html,Wired,Too Many Game Subscription Services? Here’s Ho...,PlayStation Plus Extra or Plus Premium? Xbox L...,2022-04-11 10:00:00.000,Gaming is starting to look more and more like...,2022,4
...,...,...,...,...,...,...,...,...
17427,bits.blogs.nytimes.com-2014-01-01-big-data-shr...,NYT,Big Data Shrinks to Grow,It was a good year for Big Data — the term at ...,2014-01-01 16:00:55.000,"In fact, it may be underway. Google Trends sh...",2014,1
17428,2013-06-09-us-revelations-give-look-at-spy-age...,NYT,How the U.S. Uses Technology to Mine More Data...,A revolution in software technology has transf...,2013-06-09 01:43:16.000,WASHINGTON — When American analysts hunting te...,2013,6
17429,krugman.blogs.nytimes.com-2013-08-18-the-dynam...,NYT,The Dynamo and Big Data,These things take time.,2013-08-18 15:43:43.000,James Glanz relays skepticism about the econom...,2013,8
17430,dealbook.nytimes.com-2012-03-26-morning-take-o...,NYT,Morning Take-Out,Highlights from the DealBook newsletter.,2012-03-26 14:23:58.000,E-Mail to Corzine Said Transfer Was Not Custom...,2012,3


In [4]:
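def include_features(doc):
    """
    Keep only the tokens tagged as verb, proper noun, noun, or adjective
    (https://spacy.io/usage/linguistic-features).

    doc: spaCy Doc
    returns: str with the kept tokens joined by spaces
    """
    
    # named differently from the function to avoid shadowing it
    pos_to_keep = ['VERB', 'PROPN', 'NOUN', 'ADJ']
    text = ' '.join([token.text for token in doc if token.pos_ in pos_to_keep])
    
    return text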
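
The part-of-speech filter can be tried out without loading the spaCy model. The sketch below is only an illustration: `FakeToken` and `keep_content_words` are hypothetical names, `FakeToken` exposes just the `text` and `pos_` attributes that the filter reads, and the tags are assigned by hand rather than produced by `en_core_web_sm`.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only the two attributes the filter reads.
FakeToken = namedtuple("FakeToken", ["text", "pos_"])

POS_TO_KEEP = {"VERB", "PROPN", "NOUN", "ADJ"}

def keep_content_words(tokens):
    """Join the text of every token whose coarse POS tag is in POS_TO_KEEP."""
    return " ".join(tok.text for tok in tokens if tok.pos_ in POS_TO_KEEP)

# Tags assigned by hand for illustration, not real model output.
sentence = [
    FakeToken("These", "DET"),
    FakeToken("things", "NOUN"),
    FakeToken("take", "VERB"),
    FakeToken("time", "NOUN"),
    FakeToken(".", "PUNCT"),
]

print(keep_content_words(sentence))  # things take time
```

Determiners and punctuation are dropped at this stage, so the later cleaning steps only see content words.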

In [8]:
def clean_text(text, stopword_list):
    """
    Clean up the texts (lowercase, remove punctuation, etc.)
    """
    
    # lowercase
    text = text.lower()
    
    # remove URLs (re.sub returns a new string, so the result must be assigned)
    text = re.sub(r'http\S+', '', text)
    
    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # tokenize
    text = nltk.word_tokenize(text) 
    
    # remove stop words
    text = [token for token in text if token not in stopword_list]
    
    # lemmatization and pass string back
    text = ' '.join([lemmatizer.lemmatize(w) for w in text])
    
    return text
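
The first cleaning steps can be checked in isolation with the standard library only. This is a minimal sketch, assuming a made-up sample sentence and a hypothetical helper name `strip_urls_and_punctuation`; it leaves out tokenization, stop word removal, and lemmatization. Note that `re.sub` does not modify its input, so the returned string has to be kept.

```python
import re
import string

def strip_urls_and_punctuation(text):
    """Lowercase the text, drop URLs, then drop ASCII punctuation."""
    text = text.lower()
    # re.sub returns the cleaned string; the original is left untouched
    text = re.sub(r'http\S+', '', text)
    # delete every character listed in string.punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # collapse the whitespace left behind by the removals
    return ' '.join(text.split())

sample = "Read more at https://example.com/ai-news, it's great!"
print(strip_urls_and_punctuation(sample))  # read more at its great
```

The pattern `http\S+` consumes everything up to the next whitespace, so punctuation attached to the end of a URL is removed together with it.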

In [15]:
def preprocess_text(df, text_column):
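    """
    Run part-of-speech filtering and text cleaning on one text column
    and add the result to df as a new column.

    df: pd.DataFrame
    text_column: "body" or "description"
    """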
    
    # parse each text with spaCy to obtain part-of-speech tags
    nlp_text = df[text_column].apply(nlp)
    
    # keep only selected parts of speech (https://spacy.io/usage/linguistic-features)
    nlp_features = nlp_text.apply(include_features)
    
    # define stop words
    stop_words = stopwords.words('english')
    # domain-specific terms and outlet names; filtering is done per token, so the
    # multi-word entries only take effect through their single-word parts
    extra_stop_words = [
        'ai', 'artificial intelligence', 'artificial', 'intelligence', 'technology',
        'said', 'year', 'people', 'time', 'tech', 'mr', 'dr', 'think', 'know',
        'want', 'create', 'say', 'going', 'human', 'woman', 'man',
        'washington post', 'nyt', 'washington', 'post', 'guardian', 'gizmodo', 'wired',
    ]
    stop_words.extend(extra_stop_words)
    
    # clean text
    cleaned_text = nlp_features.apply(lambda text : clean_text(text, stop_words))
    
    if text_column == "body":
        df["preprocessed_body"] = cleaned_text
    else:
        df["preprocessed_description"] = cleaned_text
        
    return df
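
Stop word filtering compares one token at a time, which matters for multi-word entries such as 'washington post'. The sketch below illustrates this with a hypothetical helper `remove_stop_words` and a made-up sentence; a plain `split()` stands in for the NLTK tokenizer.

```python
def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in stop_words (exact, per-token match)."""
    stop_set = set(stop_words)  # set membership is faster than scanning a list
    return [tok for tok in tokens if tok not in stop_set]

tokens = "the washington post covers artificial intelligence".split()

# A multi-word entry is never equal to a single token, so nothing is removed.
print(remove_stop_words(tokens, ["washington post", "artificial intelligence"]))

# Listing each word separately removes them as intended.
print(remove_stop_words(tokens, ["washington", "post", "artificial", "intelligence"]))
```

Because the list also contains each word on its own, the phrases are still filtered out in practice.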

In [10]:
# preprocess the description column
news_ai_df = preprocess_text(news_ai_df, "description")
news_ai_df

Unnamed: 0,filename,outlet,title,description,datetime,body,year,month,cleaned_description
0,2010-06-gears-of-war-3-beast.html,Wired,Gears of War 3 Co-op Makes Beasts of Gamers,"LOS ANGELES — Back in 2008, Gears of War 2 int...",2010-06-17 16:22:00.000,"LOS ANGELES – Back in 2008, Gears of War 2 in...",2010,6,los angeles gear war introduced horde mode co ...
1,sponsored-story-innovating-for-the-individual....,Wired,WIRED Brand Lab | Innovating for the Individual,What every leader can learn from the technolog...,2021-08-27 12:14:31.296,Innovative technology is making healthcare mor...,2021,8,leader learn transforming healthcare
2,ttps:--www.wired.com-story-for-all-mankind-bes...,Wired,‘For All Mankind’ Is the Best Sci-Fi of Its Era,The Apple TV+ alternate history series is simp...,2022-06-09 07:00:00.000,"New Star Wars, new Star Trek, Russian Doll, Se...",2022,6,apple tv alternate history series ambitious th...
3,story-ghostery-open-source-new-business-model....,Wired,Ad-Blocker Ghostery Just Went Open Source—And ...,"Ghostery, Edward Snowden’s preferred ad-blocke...",2018-03-08 09:45:00.000,"In privacy-focused, anti-establishment corners...",2018,3,ghostery edward snowden preferred ad blocker d...
4,story-best-game-subscriptions.html,Wired,Too Many Game Subscription Services? Here’s Ho...,PlayStation Plus Extra or Plus Premium? Xbox L...,2022-04-11 10:00:00.000,Gaming is starting to look more and more like...,2022,4,playstation extra plus premium xbox live gold ...
...,...,...,...,...,...,...,...,...,...
17427,bits.blogs.nytimes.com-2014-01-01-big-data-shr...,NYT,Big Data Shrinks to Grow,It was a good year for Big Data — the term at ...,2014-01-01 16:00:55.000,"In fact, it may be underway. Google Trends sh...",2014,1,good big data term least industry mass data su...
17428,2013-06-09-us-revelations-give-look-at-spy-age...,NYT,How the U.S. Uses Technology to Mine More Data...,A revolution in software technology has transf...,2013-06-09 01:43:16.000,WASHINGTON — When American analysts hunting te...,2013,6,revolution software transformed national secur...
17429,krugman.blogs.nytimes.com-2013-08-18-the-dynam...,NYT,The Dynamo and Big Data,These things take time.,2013-08-18 15:43:43.000,James Glanz relays skepticism about the econom...,2013,8,thing take
17430,dealbook.nytimes.com-2012-03-26-morning-take-o...,NYT,Morning Take-Out,Highlights from the DealBook newsletter.,2012-03-26 14:23:58.000,E-Mail to Corzine Said Transfer Was Not Custom...,2012,3,highlight dealbook newsletter


In [12]:
# preprocess the body column
news_ai_df = preprocess_text(news_ai_df, "body")
news_ai_df

step 1 done
step 2 done


Unnamed: 0,filename,outlet,title,description,datetime,body,year,month,cleaned_description,cleaned_body
0,2010-06-gears-of-war-3-beast.html,Wired,Gears of War 3 Co-op Makes Beasts of Gamers,"LOS ANGELES — Back in 2008, Gears of War 2 int...",2010-06-17 16:22:00.000,"LOS ANGELES – Back in 2008, Gears of War 2 in...",2010,6,los angeles gear war introduced horde mode co ...,los angeles gear war introduced horde mode co ...
1,sponsored-story-innovating-for-the-individual....,Wired,WIRED Brand Lab | Innovating for the Individual,What every leader can learn from the technolog...,2021-08-27 12:14:31.296,Innovative technology is making healthcare mor...,2021,8,leader learn transforming healthcare,innovative making healthcare personal bit inge...
2,ttps:--www.wired.com-story-for-all-mankind-bes...,Wired,‘For All Mankind’ Is the Best Sci-Fi of Its Era,The Apple TV+ alternate history series is simp...,2022-06-09 07:00:00.000,"New Star Wars, new Star Trek, Russian Doll, Se...",2022,6,apple tv alternate history series ambitious th...,new star war new star trek russian doll severa...
3,story-ghostery-open-source-new-business-model....,Wired,Ad-Blocker Ghostery Just Went Open Source—And ...,"Ghostery, Edward Snowden’s preferred ad-blocke...",2018-03-08 09:45:00.000,"In privacy-focused, anti-establishment corners...",2018,3,ghostery edward snowden preferred ad blocker d...,privacy focused anti establishment corner inte...
4,story-best-game-subscriptions.html,Wired,Too Many Game Subscription Services? Here’s Ho...,PlayStation Plus Extra or Plus Premium? Xbox L...,2022-04-11 10:00:00.000,Gaming is starting to look more and more like...,2022,4,playstation extra plus premium xbox live gold ...,gaming starting look netflix buying disc store...
...,...,...,...,...,...,...,...,...,...,...
17427,bits.blogs.nytimes.com-2014-01-01-big-data-shr...,NYT,Big Data Shrinks to Grow,It was a good year for Big Data — the term at ...,2014-01-01 16:00:55.000,"In fact, it may be underway. Google Trends sh...",2014,1,good big data term least industry mass data su...,fact underway google trend show search term bi...
17428,2013-06-09-us-revelations-give-look-at-spy-age...,NYT,How the U.S. Uses Technology to Mine More Data...,A revolution in software technology has transf...,2013-06-09 01:43:16.000,WASHINGTON — When American analysts hunting te...,2013,6,revolution software transformed national secur...,american analyst hunting terrorist sought new ...
17429,krugman.blogs.nytimes.com-2013-08-18-the-dynam...,NYT,The Dynamo and Big Data,These things take time.,2013-08-18 15:43:43.000,James Glanz relays skepticism about the econom...,2013,8,thing take,james glanz relay skepticism economic impact b...
17430,dealbook.nytimes.com-2012-03-26-morning-take-o...,NYT,Morning Take-Out,Highlights from the DealBook newsletter.,2012-03-26 14:23:58.000,E-Mail to Corzine Said Transfer Was Not Custom...,2012,3,highlight dealbook newsletter,e mail corzine transfer customer money jon cor...


In [14]:
# save the result; by default pandas also writes the row index as an unnamed first column
news_ai_df.to_csv("data/preprocessed_data.csv")