# Data Cleaning

So, now I have a bunch of text relating to transgender events in the news. How can I turn this text into data I can easily analyze? The first step is cleaning it up into a cleaned corpus by removing junk words and turning the relevant words into easier-to-work-with forms. Then I can transform the cleaned corpora into word-document matrices for further analysis.

# Cleaning the API Corpora

I went through a couple of iterations of a pipeline to turn the NewsAPI and WorldNewsAPI documents into cleaned corpora.
Initially, I made four pipelines: for each API, one using PorterStemmer and one using WordNetLemmatizer. This was a massive waste of space, as I quickly realized I could place both the stemmed and lemmed versions of the text into separate columns of a single .csv file.

I also realized that I was using the same basic algorithm to clean every text column in each of these APIs, with the only distinction being the column name. Thus, I combined all of the API data into a single pipeline which cleaned every text column and returned stemmed and lemmed versions of said columns. Then I was able to combine these columns into complete cleaned corpora for each API dataset.

Here is that pipeline:

In [2]:
import pandas as pd
import numpy as np
#Regular Expressions will come in handy here.
import re
#Importing the WordNetLemmatizer and the PorterStemmer from NLTK for stemming and lemming
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

news = pd.read_csv('../data/newsapi_corpus_raw.csv',index_col=0)
world_news = pd.read_csv('../data/worldnewsapi_corpus_raw.csv',index_col=0)

#Here I create some objects for later: a PorterStemmer, a WordNetLemmatizer, and lastly a 'junk' regex which will help me remove garbage characters from the text.
wnl = WordNetLemmatizer()
ps = PorterStemmer()
#(This is a regex for anything that isn't alphanumeric)
junk = re.compile('[^a-zA-Z\\d]')

#Split the text along white-space and remove junk, then lemmatize
def lemmatize(text):
    #I tried splitting along a few different regexes here, but ultimately the whitespace literal proved the best.
    word_list = re.split(' ',text)
    #This line removes all junk characters (anything non-alphanumeric) from each word in the list.
    word_list = [re.sub(junk,'',word) for word in word_list]
    word_list = [wnl.lemmatize(word) for word in word_list]
    return ' '.join(word_list)
v_lemm = np.vectorize(lemmatize)

#Same process but with the PorterStemmer rather than the WordNetLemmatizer
def stem(text):
    word_list = re.split(' ',text)
    word_list = [re.sub(junk,'',word) for word in word_list]
    word_list = [ps.stem(word) for word in word_list]
    return ' '.join(word_list)
v_stem = np.vectorize(stem)
#Fill nan values
news = news.fillna('')
world_news = world_news.fillna('')

# I did this already, now commenting it out for ease of use for stemming.

#Add stemmed and lemmed versions of News as columns
news.loc[:,'title_s']=v_stem(news.loc[:,'title'])
news.loc[:,'desc_s']=v_stem(news.loc[:,'description'])
news.loc[:,'content_s']=v_stem(news.loc[:,'content'])
news.loc[:,'title_l']=v_lemm(news.loc[:,'title'])
news.loc[:,'desc_l']=v_lemm(news.loc[:,'description'])
news.loc[:,'content_l']=v_lemm(news.loc[:,'content'])

#Do the same for world_news
world_news.loc[:,'text_s']=v_stem(world_news.loc[:,'text'])
world_news.loc[:,'text_l']=v_lemm(world_news.loc[:,'text'])
world_news.loc[:,'title_s']=v_stem(world_news.loc[:,'title'])
world_news.loc[:,'title_l']=v_lemm(world_news.loc[:,'title'])

#news.to_csv('../../data/newsapi_corpus_cleaned.csv',index=0)
#world_news.to_csv('../../data/worldnewsapi_corpus_cleaned.csv',index=0)
news.head()
world_news.head()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Owner\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,title,text,authors,country,sentiment,url,text_s,text_l,title_s,title_l
0.0,Russia’s Supreme Court effectively outlaws LGB...,Menu Menu World U.S. Politics Sports Entertain...,DASHA LITVINOVA,us,0.311,https://apnews.com/article/russia-lgbtq-crackd...,menu menu world us polit sport entertain busi ...,Menu Menu World US Politics Sports Entertainme...,russia suprem court effect outlaw lgbtq activ ...,Russias Supreme Court effectively outlaw LGBTQ...
1.0,Where Do Trans Rights Stand in Taiwan After Sa...,"An estimated 5,000 people gathered in Ximendin...",Daniel Yo-Ling,us,0.134,https://thediplomat.com/2023/11/where-do-trans...,an estim 5000 peopl gather in ximend on the ev...,An estimated 5000 people gathered in Ximending...,where do tran right stand in taiwan after same...,Where Do Trans Rights Stand in Taiwan After Sa...
2.0,Transgender People&#039;s Neurological Needs A...,"As a transgender neurologist, I advocate for t...",Deneen Broadnax,us,0.12,https://worldnewsera.com/news/science/transgen...,as a transgend neurologist i advoc for the imp...,As a transgender neurologist I advocate for th...,transgend people039 neurolog need are be overl...,Transgender People039s Neurological Needs Are ...
3.0,"Class 10th, 12th Board Exams Forms: Transgende...",After the Supreme Court recognised transgender...,Deeksha Teri,in,0.06,https://indianexpress.com/article/education/cl...,after the suprem court recognis transgend peop...,After the Supreme Court recognised transgender...,class 10th 12th board exam form transgend stud...,Class 10th 12th Board Exams Forms Transgender ...
4.0,Photos: Protesters squared off in downtown Ottawa,Article content PHOTO GALLERY Thousands of peo...,Lois Kirkup,ca,0.481,https://ottawacitizen.com/news/local-news/phot...,articl content photo galleri thousand of peopl...,Article content PHOTO GALLERY Thousands of peo...,photo protest squar off in downtown ottawa,Photos Protesters squared off in downtown Ottawa
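
The repeated column assignments above all follow one pattern, so they could be folded into a single loop. The sketch below is illustrative rather than the pipeline that produced this output: `clean_text`, `clean_columns`, and the `normalizers` mapping are hypothetical names, rows are plain dictionaries instead of a DataFrame, and `str.lower` and an identity function are trivial stand-ins for `ps.stem` and `wnl.lemmatize` so that the example runs without NLTK.

```python
import re

# Same idea as the `junk` regex in the pipeline: anything that is not a letter or digit.
JUNK = re.compile(r'[^a-zA-Z\d]')

def clean_text(text, normalize):
    """Split on spaces, strip junk characters from each word, then normalize each word."""
    words = [JUNK.sub('', word) for word in text.split(' ')]
    return ' '.join(normalize(word) for word in words)

def clean_columns(rows, columns, normalizers):
    """Add one cleaned copy of every column per normalizer, e.g. 'title_s' and 'title_l'.

    rows        -- list of dicts, one per article (stand-in for a DataFrame)
    columns     -- mapping of source column name -> prefix for the new columns
    normalizers -- mapping of suffix -> word-level function (stemmer, lemmatizer, ...)
    """
    for row in rows:
        for column, prefix in columns.items():
            for suffix, normalize in normalizers.items():
                # Missing values are treated as empty strings, like fillna('').
                row[f'{prefix}_{suffix}'] = clean_text(row.get(column) or '', normalize)
    return rows

# Stand-ins only: in the real pipeline these would be ps.stem and wnl.lemmatize.
normalizers = {'s': str.lower, 'l': lambda word: word}
news_rows = [{'title': 'Trans rights, explained!', 'description': None}]
clean_columns(news_rows, {'title': 'title', 'description': 'desc'}, normalizers)
print(news_rows[0]['title_s'])
print(news_rows[0]['title_l'])
```

With this shape, supporting another text column or another normalizer means adding one dictionary entry rather than another pair of assignment lines.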


# Ground News

For this dataset, I wanted to do more than just stem and lem each word. I knew the source_text entries contained a lot of junk words which came from advertisements, site descriptions, subscription requests, etc. I wanted to remove as much of this junk text as I could while keeping the text relevant to the news story. 

To do this, I took advantage of the fact that the source text retained its original order, and that I had a **title** entry corresponding to each **source_text** entry. The plan was to tokenize and lemmatize the title and corresponding source text entry, then only keep the words in the source text which came in between the first and last word which matched a word in the title.

### Example:
Using the title

**'Greece Legalizes same-sex marriage'**

and this monster lemmatized source text:

   ```
   "BBC HomepageSkip to contentAccessibility HelpYour accountHomeNewsSportEarthReelWorklifeTravelMore menuMore menuSearch BBCHomeNewsSportEarthReelWorklifeTravelCultureFutureMusicTVWeatherSoundsClose menuBBC NewsMenuHomeIsrael-Gaza warWar in UkraineClimateVideoWorldUS & CanadaUKBusinessTechMoreScienceEntertainment & ArtsHealthIn PicturesBBC VerifyWorld News TVNewsbeatWorldAfricaAsiaAustraliaEuropeLatin AmericaMiddle EastGreece legalises same-sex marriagePublished1 day agoShareclose panelShare pageCopy linkAbout sharingThis video can not be playedTo play this video you need to enable JavaScript in your browser.Media caption, Watch: Cheers in Athens as same-sex marriage becomes lawBy James GregoryBBC NewsGreece has become the first Christian Orthodox-majority country to legalise same-sex marriage.Same-sex couples will now also be legally allowed to adopt children after Thursday's 176-76 vote in parliament.Prime Minister Kyriakos Mitsotakis said the new law would ""boldly abolish a serious inequality"".But it has divided the country, with fierce resistance led by the powerful Orthodox Church. Its supporters held a protest rally in Athens.Many displayed banners, held crosses, read prayers and sang passages from the Bible in the capital's Syntagma Square.The head of the Orthodox Church, Archbishop Ieronymos, said the measure would ""corrupt the homeland's social cohesion"".The bill needed a simple majority to pass through the 300-member parliament.Mr Mitsotakis had championed the bill but required the support of opposition parties to get it over the line, with dozens of MPs from his centre-right governing party opposed. ""People who have been invisible will finally be made visible around us, and with them, many children will finally find their rightful place,"" the prime minister told parliament during a debate ahead of the vote. 
""The reform makes the lives of several of our fellow citizens better, without taking away anything from the lives of the many.""Image source, Getty ImagesImage caption, Opponents of the legislature held a protest rally in front of the parliament building in AthensThe vote has been welcomed by LGBTQ organisations in Greece.""This is a historic moment,"" Stella Belia, the head of same-sex parents' group Rainbow Families, told Reuters news agency. ""This is a day of joy."" Fifteen of the European Union's 27 members have already legalised same-sex marriage. It is permitted in 35 countries worldwide.Greece has until now lagged behind some of its European neighbours, largely because of opposition from the Church. It is the first country in south-eastern Europe to have marriage equality.Related TopicsGreeceMarriageLGBTMore on this storyCheers in Athens as same-sex marriage becomes law. Video, 00:00:28Cheers in Athens as same-sex marriage becomes lawPublished1 day ago0:28Top StoriesLive. ‘Putin is responsible’ - Biden speaks out after report of Navalny’s deathTrump ordered to pay 354m in New York fraud casePublished10 hours agoTrump must pay 354m. How could he do it?Published2 hours agoFeaturesAlexei Navalny: What we know about reports of his deathNavalny’s life in 'Polar Wolf' remote penal colonyArrested and poisoned: See Navalny's moments of defiance. 
VideoArrested and poisoned: See Navalny's moments of defianceSatellite images show construction on Egypt-Gaza borderIs Russia about to win another victory in Ukraine?The Argentines backing a 'crazy' president's shock therapyMillions of donkeys killed each year to make medicineWeekly quiz: Who could join Sinéad in the Rock & Roll Hall of Fame?The KGB spy who rubbed shoulders with French elite for decadesElsewhere on the BBCWhy Gen Z are dressing like Mob WivesThe world map created by a man who never left homeThe seedy underbelly of life coachingMost Read1Amy Schumer hits back at comments about her face2Two teenagers charged over Super Bowl parade shooting3Alexei Navalny: What we know about reports of his death4King's cancer may bring family closer, says Harry5Brian Wilson's family seeks conservatorship6Trump must pay 354m. How could he do it?7Satellite images show construction on Egypt-Gaza border8Fani Willis' dad testifies in Trump Georgia case9Democrats relieved as Manchin rules out White House bid10Biden condemns House recess without new Ukraine aidBBC News ServicesOn your mobileOn smart speakersGet news alertsContact BBC NewsHomeNewsSportEarthReelWorklifeTravelCultureFutureMusicTVWeatherSoundsTerms of UseAbout the BBCPrivacy PolicyCookiesAccessibility HelpParental GuidanceContact the BBCGet Personalised NewslettersWhy you can trust the BBCAdvertise with us© 2024 BBC. The BBC is not responsible for the content of external sites. Read about our approach to external linking."
```

This algorithm will throw out a bunch of junk at the beginning and end, and only keep the bulk of the article, which falls in between two instances of words which match their lemmatized cousins in the corresponding **title** entry.

  ~~BBC HomepageSkip to contentAccessibility HelpYour accountHomeNewsSportEarthReelWorklifeTravelMore menuMore menuSearch BBCHomeNewsSportEarthReelWorklifeTravelCultureFutureMusicTVWeatherSoundsClose menuBBC NewsMenuHomeIsrael-Gaza warWar in UkraineClimateVideoWorldUS & CanadaUKBusinessTechMoreScienceEntertainment & ArtsHealthIn PicturesBBC VerifyWorld News TVNewsbeatWorldAfricaAsiaAustraliaEuropeLatin AmericaMiddle EastGreece~~ 
  ```
  legalises same-sex marriagePublished1 day agoShareclose panelShare pageCopy linkAbout sharingThis video can not be playedTo play this video you need to enable JavaScript in your browser.Media caption, Watch: Cheers in Athens as same-sex marriage becomes lawBy James GregoryBBC NewsGreece has become the first Christian Orthodox-majority country to legalise same-sex marriage.Same-sex couples will now also be legally allowed to adopt children after Thursday's 176-76 vote in parliament.Prime Minister Kyriakos Mitsotakis said the new law would ""boldly abolish a serious inequality"".But it has divided the country, with fierce resistance led by the powerful Orthodox Church. Its supporters held a protest rally in Athens.Many displayed banners, held crosses, read prayers and sang passages from the Bible in the capital's Syntagma Square.The head of the Orthodox Church, Archbishop Ieronymos, said the measure would ""corrupt the homeland's social cohesion"".The bill needed a simple majority to pass through the 300-member parliament.Mr Mitsotakis had championed the bill but required the support of opposition parties to get it over the line, with dozens of MPs from his centre-right governing party opposed. ""People who have been invisible will finally be made visible around us, and with them, many children will finally find their rightful place,"" the prime minister told parliament during a debate ahead of the vote. ""The reform makes the lives of several of our fellow citizens better, without taking away anything from the lives of the many.""Image source, Getty ImagesImage caption, Opponents of the legislature held a protest rally in front of the parliament building in AthensThe vote has been welcomed by LGBTQ organisations in Greece.""This is a historic moment,"" Stella Belia, the head of same-sex parents' group Rainbow Families, told Reuters news agency. ""This is a day of joy."" Fifteen of the European Union's 27 members have already legalised same-sex marriage. 
It is permitted in 35 countries worldwide.Greece has until now lagged behind some of its European neighbours, largely because of opposition from the Church. It is the first country in south-eastern Europe to have marriage
  ```
  ~~equality.Related TopicsGreeceMarriageLGBTMore on this storyCheers in Athens as same-sex marriage becomes law. Video, 00:00:28Cheers in Athens as same-sex marriage becomes lawPublished1 day ago0:28Top StoriesLive. ‘Putin is responsible’ - Biden speaks out after report of Navalny’s deathTrump ordered to pay 354m in New York fraud casePublished10 hours agoTrump must pay 354m. How could he do it?Published2 hours agoFeaturesAlexei Navalny: What we know about reports of his deathNavalny’s life in 'Polar Wolf' remote penal colonyArrested and poisoned: See Navalny's moments of defiance. VideoArrested and poisoned: See Navalny's moments of defianceSatellite images show construction on Egypt-Gaza borderIs Russia about to win another victory in Ukraine?The Argentines backing a 'crazy' president's shock therapyMillions of donkeys killed each year to make medicineWeekly quiz: Who could join Sinéad in the Rock & Roll Hall of Fame?The KGB spy who rubbed shoulders with French elite for decadesElsewhere on the BBCWhy Gen Z are dressing like Mob WivesThe world map created by a man who never left homeThe seedy underbelly of life coachingMost Read1Amy Schumer hits back at comments about her face2Two teenagers charged over Super Bowl parade shooting3Alexei Navalny: What we know about reports of his death4King's cancer may bring family closer, says Harry5Brian Wilson's family seeks conservatorship6Trump must pay 354m. How could he do it?7Satellite images show construction on Egypt-Gaza border8Fani Willis' dad testifies in Trump Georgia case9Democrats relieved as Manchin rules out White House bid10Biden condemns House recess without new Ukraine aidBBC News ServicesOn your mobileOn smart speakersGet news alertsContact BBC NewsHomeNewsSportEarthReelWorklifeTravelCultureFutureMusicTVWeatherSoundsTerms of UseAbout the BBCPrivacy PolicyCookiesAccessibility HelpParental GuidanceContact the BBCGet Personalised NewslettersWhy you can trust the BBCAdvertise with us© 2024 BBC. 
The BBC is not responsible for the content of external sites. Read about our approach to external linking."~~
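
The windowing idea can be tried out on a toy example. This is a simplified illustration, not the cleaning code itself: `tokenize` and `title_window` are hypothetical helpers, lowercasing stands in for lemmatization, the stop-word set is a tiny hand-picked sample rather than NLTK's list, and empty tokens are discarded so that they can never count as a match.

```python
import re

JUNK = re.compile(r'[^a-zA-Z\d]+')
# Tiny illustrative stop-word sample; the real pipeline uses NLTK's English list.
STOP_WORDS = {'the', 'a', 'in', 'to', 'of', 'and', 'same'}

def tokenize(text):
    """Split on runs of non-alphanumeric characters and drop empty tokens."""
    return [token for token in JUNK.split(text) if token]

def title_window(title, source_text):
    """Keep the source tokens from the first title match through the last one."""
    title_words = {w.lower() for w in tokenize(title)} - STOP_WORDS
    tokens = tokenize(source_text)
    matches = [i for i, token in enumerate(tokens) if token.lower() in title_words]
    if len(matches) < 2:
        return ''  # not enough anchors to define a window
    # The +1 keeps the last matching word; drop it to exclude that word instead.
    return ' '.join(tokens[matches[0]:matches[-1] + 1])

title = 'Greece legalises same-sex marriage'
source = ('Home News Sport Greece legalises same-sex marriage. Same-sex couples '
          'can now adopt. It is the first country in the region to have marriage '
          'equality. Related Topics More on this story Terms of Use')
print(title_window(title, source))
```

The navigation words before "Greece" and everything after the final "marriage" fall outside the window. The approach relies on the title words appearing near the start and end of the article body, so a title word that shows up in a footer link would widen the window and let some junk through.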

Here was my final code:

In [4]:
import pandas as pd
import numpy as np
import re
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

#Here I'm cleaning the Ground News articles to form a cleaned corpus.
#My main task here is using the Ground News title for each article to remove 
#superfluous words from the article's source text.
#My plan is:
#1. Split the title, summary, and source text into word lists and lemmatize them (while maintaining duplicates, stop words, and word order)
#2. Remove all stop words from the title word list
#3. Find the first and last indices of words in the source text which match a word in the title word list
#4. Throw out all source text words which fall outside of those indices.
#This should hopefully remove all of the ads which were displayed before and after the main story, while keeping the bulk of the main
#story for each source text article.
df = pd.read_csv('../data/groundnews_corpus_raw.csv',index_col=0).fillna('')
wnl = WordNetLemmatizer()
junk = re.compile('[^a-zA-Z\\d]')
stop_words = set(stopwords.words('english'))

#I tried vectorizing separate functions for each step of this process, but pandas
#and numpy don't like the intermediate state of having three columns of the dataframe
#be columns with a list in each cell. So instead I'm combining all steps into one function which
#will return the cleaned title, summary, and source text. Then I'll vectorize it and run it across the
#data frame.

#Helper function for lemmatizing a text and returning the lemmatized word list
def lemmatize(text):
    word_list = re.split(junk,text)
    word_list = [re.sub(junk,'',word) for word in word_list]
    word_list = [wnl.lemmatize(word) for word in word_list]
    return word_list

#I tried filtering on the title as well as the summary, and filtering on the title gave me much better results.
def clean_source_text(t,s,s_t):
    #Step 1
    t=lemmatize(t)
    s=lemmatize(s)
    s_t=lemmatize(s_t)

    #Step 2
    t = [w for w in t if w.lower() not in stop_words]
    
    #Step 3
    t_in_s_t = [w in t for w in s_t]
    # s_in_s_t = [w in s for w in s_t]
    
    title_matching_indices = [i for i,x in enumerate(t_in_s_t) if x]
    # summary_matching_indices = [i for i,x in enumerate(s_in_s_t) if x]
    
    title = ' '.join(t)
    summary = ' '.join(s)
    #Step 4
    if len(title_matching_indices)>1:
        source_text = s_t[title_matching_indices[0]:title_matching_indices[-1]]
    # elif len(summary_matching_indices)>1:
    #     source_text = s_t[summary_matching_indices[0]:summary_matching_indices[-1]]
    else:
        source_text = ['']
    #Printing out how many junk words were removed by this process for each entry in the corpus.
    print(len(s_t)-len(source_text), end=", ")
    
    source_text = ' '.join(source_text)
    
    return title, summary, source_text

for i in range(len(df)):
    df.loc[i,'title'],df.loc[i,'summary'],df.loc[i,'source_text'] = clean_source_text(
        df.loc[i,'title'],df.loc[i,'summary'],df.loc[i,'source_text'])

#df.to_csv('../../data/groundnews_corpus_cleaned.csv')
df.head()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Owner\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


335, 221, 623, 1142, 1, 7, 492, 31, 449, 4, 322, 489, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 31, 1, 1, 2, 418, 161, 441, 1, 186, 1, 1, 1, 1, 13, 1, 0, 1, 1, 2, 1, 1, 4, 1, 1, 1, 1, 1, 1, 322, 0, 1, 31, 174, 1, 0, 0, 136, 97, 152, 52, 627, 1, 139, 321, 618, 39, 332, 239, 1, 1, 1, 14, 1, 1, 605, 1, 644, 3, 1, 3, 781, 1, 1, 1, 1, 1, 1, 12, 1, 1, 9, 151, 10, 1, 342, 1, 1, 1, 1, 1, 1, 1, 31, 15, 643, 1, 1, 1, 1, 1, 654, 7, 738, 1254, 1, 30, 1, 11, 4, 326, 13, 10, 161, 12, 1, 0, 10, 580, 469, 157, 291, 442, 450, 640, 1, 1, 52, 932, 808, 782, 1, 1, 1, 168, 51, 72, 763, 1, 1, 624, 1374, 327, 65, 1, 1048, 657, 1, 1, 1, 1, 67, 0, 321, 168, 691, 1, 1, 1, 1, 1, 1, 171, 1, 1, 1, 1, 1, 1, 1, 489, 1, 31, 1, 388, 1, 1, 143, 352, 0, 1, 462, 1, 170, 712, 493, 181, 400, 795, 1, 211, 2, 2, 2026, 1, 10, 1, 235, 257, 1280, 1, 1, 227, 576, 1, 1, 1, 72, 235, 223, 1, 1, 1, 171, 1, 135, 1, 1, 550, 126, 464, 256, 31, 387, 1, 1, 1, 1, 9, 10, 1, 1, 1, 1, 1, 1, 1, 1, 16, 176, 422, 190, 1, 487, 357, 100, 317, 349, 271,

Unnamed: 0,title,summary,bias,factuality,owner,source,owner_type,source_text
0,Greece legalises sex marriage,Greece ha become the first Christian Orthodox ...,Center,High Factuality,Government of the United Kingdom,https://www.bbc.co.uk/news/world-europe-683101...,Government,legalises same sex marriagePublished1 day agoS...
1,Greece becomes first Orthodox Christian countr...,Lawmakers in the 300 seat parliament voted for...,Lean Left,Mixed Factuality,Scott Trust Limited,https://www.theguardian.com/world/2024/feb/15/...,Independent,becomes first Christian Orthodox country to le...
2,Greece legalises sex marriage landmark change,The law give same sex couple the right to wed ...,Lean Left,High Factuality,The Hindu Group,https://www.thehindu.com/news/international/gr...,Independent,Greece legalises same sex marriage in landmark...
3,Greece becomes first Orthodox Christian countr...,Greece ha become the first Orthodox Christian ...,Center,High Factuality,Bell Media,https://www.ctvnews.ca/world/greece-becomes-fi...,Media Conglomerate,Greece becomes first Orthodox Christian countr...
4,Greece legalises sex marriage another Orthod...,Greece ha become the first majority Orthodox C...,Lean Left,Mixed Factuality,Evgeny Lebedev,https://www.independent.co.uk/news/world/europ...,Individual,Jump to contentUS EditionChangeUK EditionAsia...


As you can see, this method removed a fair amount of junk from many of these source_text entries. For many entries, it barely removed anything, but that's okay. The hope is that as a result of this cleaning, the quality of the remaining source text data will go up significantly, and we can use it for further analysis.

After cleaning the source_text with the lemmatized version of the titles, I decided to create a stemmed version in case that might prove better for creating word-document matrices. Here was my code:

In [6]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.stem import PorterStemmer

ps = PorterStemmer()
ground_news = pd.read_csv('../data/groundnews_corpus_cleaned.csv',index_col=0).fillna('')
junk = re.compile('[^a-zA-Z\\d]')

#slice the text into words and stem each word
def stem(text):
    word_list = re.split(' ',text)
    word_list = [re.sub(junk,'',word) for word in word_list]
    word_list = [ps.stem(word) for word in word_list]
    return ' '.join(word_list)

v_stem = np.vectorize(stem)

#Add columns for stemmed version of title, summary, and source text.
ground_news.loc[:,'title_s'] = v_stem(ground_news.loc[:,'title'])
ground_news.loc[:,'summary_s'] = v_stem(ground_news.loc[:,'summary'])
ground_news.loc[:,'source_text_s'] = v_stem(ground_news.loc[:,'source_text'])

#ground_news.to_csv('../../data/groundnews_corpus_cleaned.csv')
ground_news.head()

Unnamed: 0,title,summary,bias,factuality,owner,source,owner_type,source_text,title_s,summary_s,source_text_s
0,Greece legalises sex marriage,Greece ha become the first Christian Orthodox ...,Center,High Factuality,Government of the United Kingdom,https://www.bbc.co.uk/news/world-europe-683101...,Government,legalises same sex marriagePublished1 day agoS...,greec legalis sex marriag,greec ha becom the first christian orthodox ma...,legalis same sex marriagepublished1 day agosha...
1,Greece becomes first Orthodox Christian countr...,Lawmakers in the 300 seat parliament voted for...,Lean Left,Mixed Factuality,Scott Trust Limited,https://www.theguardian.com/world/2024/feb/15/...,Independent,becomes first Christian Orthodox country to le...,greec becom first orthodox christian countri l...,lawmak in the 300 seat parliament vote for the...,becom first christian orthodox countri to lega...
2,Greece legalises sex marriage landmark change,The law give same sex couple the right to wed ...,Lean Left,High Factuality,The Hindu Group,https://www.thehindu.com/news/international/gr...,Independent,Greece legalises same sex marriage in landmark...,greec legalis sex marriag landmark chang,the law give same sex coupl the right to wed a...,greec legalis same sex marriag in landmark cha...
3,Greece becomes first Orthodox Christian countr...,Greece ha become the first Orthodox Christian ...,Center,High Factuality,Bell Media,https://www.ctvnews.ca/world/greece-becomes-fi...,Media Conglomerate,Greece becomes first Orthodox Christian countr...,greec becom first orthodox christian countri l...,greec ha becom the first orthodox christian co...,greec becom first orthodox christian countri t...
4,Greece legalises sex marriage another Orthod...,Greece ha become the first majority Orthodox C...,Lean Left,Mixed Factuality,Evgeny Lebedev,https://www.independent.co.uk/news/world/europ...,Individual,Jump to contentUS EditionChangeUK EditionAsia...,greec legalis sex marriag anoth orthodox chr...,greec ha becom the first major orthodox christ...,jump to contentu editionchangeuk editionasia ...


### And with that, I had cleaned corpora! Now to create some word-document matrices...

# Word-Document Matrices (WDMs)

For this stage of cleaning, I might have gotten a bit overzealous. There were so many options for customization when creating a word-document matrix that I really wasn't sure which I should use. The choices to consider included:

1. Which text column from each corpus should I use (title, summary, source text)?
2. Stemmed or lemmatized versions of the text?
3. Should I remove stop-words?
4. Should I use CountVectorizer or TfidfVectorizer?
5. How many words should I count as 1 token? (i.e. which value of *n* for n-grams should I be using?)

Considering all of these options, I realized I wouldn't know the correct answer to any of them unless I came to that answer through trial and error. So I decided to use pretty much all of the options. This resulted in me creating not 1, but ***20*** word-document matrices, each stored in its own special location within the now vast ```/data/wdms``` folder tree. I eventually settled on creating one WDM for each text/title column of each API, with stemming/lemming, and count/tfidf.
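
One way to arrive at 20 for the two API corpora is to multiply out the options: 2 vectorizers × 2 normalizations × 5 text columns (three for NewsAPI, two for WorldNewsAPI). The sketch below enumerates those combinations as file paths. The directory and file names mirror the folder layout that appears in the save step of the code, while `API_COLUMNS`, `wdm_paths`, and the relative `data/wdms` root are names chosen for this illustration only.

```python
from itertools import product
from pathlib import PurePosixPath

# Text columns per API corpus, using the short file names from the save step.
API_COLUMNS = {
    'newsapi': ['title', 'desc', 'content'],
    'worldnewsapi': ['title', 'text'],
}
VECTORIZERS = ['count', 'tfidf']
NORMALIZATIONS = ['stemmed', 'lemmed']

def wdm_paths(root='data/wdms'):
    """List one CSV path per (vectorizer, API, normalization, column) combination."""
    paths = []
    for vectorizer, (api, columns), norm in product(
            VECTORIZERS, API_COLUMNS.items(), NORMALIZATIONS):
        for column in columns:
            paths.append(PurePosixPath(root, vectorizer, api, norm, f'{column}.csv'))
    return paths

paths = wdm_paths()
print(len(paths))
print(paths[0])
```

Generating the paths from the option lists, instead of typing each one by hand, also makes it harder for a variable and its destination file to drift out of sync.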

Here was my code:

In [7]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

news = pd.read_csv('../data/newsapi_corpus_cleaned.csv').fillna('')
world_news = pd.read_csv('../data/worldnewsapi_corpus_cleaned.csv').fillna('')
ground_news = pd.read_csv('../data/groundnews_corpus_cleaned.csv',index_col=0).fillna('')


#Extract the metadata, we'll add it back in later
news_metadata = news.loc[:,('author','publishedAt','source','url','urlToImage')]
world_news_metadata = world_news.loc[:,('authors','country','sentiment','url')]
ground_news_metadata = ground_news.loc[:,('bias','factuality','owner','source','owner_type')]


cv = CountVectorizer(stop_words='english')
cv_2 = CountVectorizer(stop_words='english',ngram_range=(2,2))
tv = TfidfVectorizer(stop_words='english')
tv_2 = TfidfVectorizer(stop_words='english',ngram_range=(2,2))

#This function transforms the text into a word-document matrix containing all words
#which occur more than n times in total (this is to cut down on size, under the assumption that
#rare words are more trouble than they're worth), and sorts the columns by frequency.
def create_wdm(series,metadata,cv,n):
    wdm = cv.fit_transform(series)
    wdm_df = pd.DataFrame(data=wdm.toarray(),columns=cv.get_feature_names_out())
    wdm_df.loc['sum'] = wdm_df.sum()
    #Filter for words with a minimum frequency. Cuts down on junk and size
    wdm_df = wdm_df.loc[:,wdm_df.loc['sum']>n]
    #filter for strings greater than a certain length. Cuts out some silliness.
    wdm_df = wdm_df.filter(regex=r'^[a-zA-Z]{3,}$')
    wdm_df = wdm_df.sort_values(by='sum',axis=1,ascending=False)
    wdm_df = pd.concat([metadata,wdm_df],axis=1,join='inner')
    return wdm_df

#Create the WDMs for News. Altering min number of occurrences to keep the size of larger WDMs in check.
# news_title_l_cv = create_wdm(news['title_l'], news_metadata, cv, 3)
# news_desc_l_cv = create_wdm(news['desc_l'], news_metadata, cv, 3)
# news_content_l_cv = create_wdm(news['content_l'], news_metadata, cv, 3)
# news_title_l_tv = create_wdm(news['title_l'], news_metadata, tv, 3)
# news_desc_l_tv = create_wdm(news['desc_l'], news_metadata, tv, 3)
# news_content_l_tv = create_wdm(news['content_l'], news_metadata, tv, 3)

# news_title_s_cv = create_wdm(news['title_s'], news_metadata, cv, 3)
# news_desc_s_cv = create_wdm(news['desc_s'], news_metadata, cv, 3)
# news_content_s_cv = create_wdm(news['content_s'], news_metadata, cv, 3)
# news_title_s_tv = create_wdm(news['title_s'], news_metadata, tv, 3)
# news_desc_s_tv = create_wdm(news['desc_s'], news_metadata, tv, 3)
# news_content_s_tv = create_wdm(news['content_s'], news_metadata, tv, 3)


#Create the WDMs for World News
# world_news_title_s_cv = create_wdm(world_news['title_s'], world_news_metadata, cv, 6)
# world_news_text_s_cv = create_wdm(world_news['text_s'], world_news_metadata, cv, 15)
# world_news_title_s_tv = create_wdm(world_news['title_s'], world_news_metadata, tv, 6)
# world_news_text_s_tv = create_wdm(world_news['text_s'], world_news_metadata, tv, 15)

# world_news_title_l_cv = create_wdm(world_news['title_l'], world_news_metadata, cv, 6)
# world_news_text_l_cv = create_wdm(world_news['text_l'], world_news_metadata, cv, 15)
# world_news_title_l_tv = create_wdm(world_news['title_l'], world_news_metadata, tv, 6)
# world_news_text_l_tv = create_wdm(world_news['text_l'], world_news_metadata, tv, 15)

#Create the WDMs for Ground News

ground_news_title_l_cv = create_wdm(ground_news['title'], ground_news_metadata, cv, 3)
# ground_news_title_l_tv = create_wdm(ground_news['title'], ground_news_metadata, tv, 3)
# ground_news_title_s_cv = create_wdm(ground_news['title_s'], ground_news_metadata, cv, 3)
# ground_news_title_s_tv = create_wdm(ground_news['title_s'], ground_news_metadata, tv, 3)

# ground_news_summary_l_cv = create_wdm(ground_news['summary'], ground_news_metadata, cv, 3)
# ground_news_summary_l_tv = create_wdm(ground_news['summary'], ground_news_metadata, tv, 3)
# ground_news_summary_s_cv = create_wdm(ground_news['summary_s'], ground_news_metadata, cv, 3)
# ground_news_summary_s_tv = create_wdm(ground_news['summary_s'], ground_news_metadata, tv, 3)

# ground_news_source_text_l_cv = create_wdm(ground_news['source_text'], ground_news_metadata, cv, 3)
# ground_news_source_text_l_tv = create_wdm(ground_news['source_text'], ground_news_metadata, tv, 3)
# ground_news_source_text_s_cv = create_wdm(ground_news['source_text_s'], ground_news_metadata, cv, 3)
# ground_news_source_text_s_tv = create_wdm(ground_news['source_text_s'], ground_news_metadata, tv, 3)

#Saving WDMs to csv.
# news_title_s_cv.to_csv('../../data/wdms/count/newsapi/stemmed/title.csv')
# news_desc_s_cv.to_csv('../../data/wdms/count/newsapi/stemmed/desc.csv')
# news_content_s_cv.to_csv('../../data/wdms/count/newsapi/stemmed/content.csv')
# news_title_s_tv.to_csv('../../data/wdms/tfidf/newsapi/stemmed/title.csv')
# news_desc_s_tv.to_csv('../../data/wdms/tfidf/newsapi/stemmed/desc.csv')
# news_content_s_tv.to_csv('../../data/wdms/tfidf/newsapi/stemmed/content.csv')

# news_title_l_cv.to_csv('../../data/wdms/count/newsapi/lemmed/title.csv')
# news_desc_l_cv.to_csv('../../data/wdms/count/newsapi/lemmed/desc.csv')
# news_content_l_cv.to_csv('../../data/wdms/count/newsapi/lemmed/content.csv')
# news_title_l_tv.to_csv('../../data/wdms/tfidf/newsapi/lemmed/title.csv')
# news_desc_l_tv.to_csv('../../data/wdms/tfidf/newsapi/lemmed/desc.csv')
# news_content_l_tv.to_csv('../../data/wdms/tfidf/newsapi/lemmed/content.csv')


# world_news_title_s_cv.to_csv('../../data/wdms/count/worldnewsapi/stemmed/title.csv')
# world_news_text_s_cv.to_csv('../../data/wdms/count/worldnewsapi/stemmed/text.csv')
# world_news_title_l_cv.to_csv('../../data/wdms/count/worldnewsapi/lemmed/title.csv')
# world_news_text_l_cv.to_csv('../../data/wdms/count/worldnewsapi/lemmed/text.csv')

# world_news_title_s_tv.to_csv('../../data/wdms/tfidf/worldnewsapi/stemmed/title.csv')
# world_news_text_s_tv.to_csv('../../data/wdms/tfidf/worldnewsapi/stemmed/text.csv')
# world_news_title_l_tv.to_csv('../../data/wdms/tfidf/worldnewsapi/lemmed/title.csv')
# world_news_text_l_tv.to_csv('../../data/wdms/tfidf/worldnewsapi/lemmed/text.csv')


# ground_news_title_s_cv.to_csv('../../data/wdms/count/groundnews/stemmed/title.csv')
# ground_news_summary_s_cv.to_csv('../../data/wdms/count/groundnews/stemmed/summ.csv')
# ground_news_source_text_s_cv.to_csv('../../data/wdms/count/groundnews/stemmed/source.csv')
# ground_news_title_s_tv.to_csv('../../data/wdms/tfidf/groundnews/stemmed/title.csv')
# ground_news_summary_s_tv.to_csv('../../data/wdms/tfidf/groundnews/stemmed/summ.csv')
# ground_news_source_text_s_tv.to_csv('../../data/wdms/tfidf/groundnews/stemmed/source.csv')

# ground_news_title_l_cv.to_csv('../../data/wdms/count/groundnews/lemmed/title.csv')
# ground_news_summary_l_cv.to_csv('../../data/wdms/count/groundnews/lemmed/summ.csv')
# ground_news_source_text_l_cv.to_csv('../../data/wdms/count/groundnews/lemmed/source.csv')
# ground_news_title_l_tv.to_csv('../../data/wdms/tfidf/groundnews/lemmed/title.csv')
# ground_news_summary_l_tv.to_csv('../../data/wdms/tfidf/groundnews/lemmed/summ.csv')
# ground_news_source_text_l_tv.to_csv('../../data/wdms/tfidf/groundnews/lemmed/source.csv')

ground_news_title_l_cv.head()

Unnamed: 0,bias,factuality,owner,source,owner_type,transgender,trans,gender,lgbtq,sex,...,program,horrendous,headline,hamilton,sue,criticize,surfer,create,target,abusing
0,Center,High Factuality,Government of the United Kingdom,https://www.bbc.co.uk/news/world-europe-683101...,Government,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,Lean Left,Mixed Factuality,Scott Trust Limited,https://www.theguardian.com/world/2024/feb/15/...,Independent,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,Lean Left,High Factuality,The Hindu Group,https://www.thehindu.com/news/international/gr...,Independent,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,Center,High Factuality,Bell Media,https://www.ctvnews.ca/world/greece-becomes-fi...,Media Conglomerate,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,Lean Left,Mixed Factuality,Evgeny Lebedev,https://www.independent.co.uk/news/world/europ...,Individual,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


I'm breaking with tradition here and only showing the output for one of the many word-document matrices. Of all the code I've shown so far, this program was the most prone to memory crashes, so for demonstration purposes I'm going to keep it light.

Essentially, I used the various vectorizers to create word-document matrices, then sorted the columns of those matrices by word frequency, and used a minimum frequency cut-off to remove junk words that only appeared a handful of times. I always removed stop words, after finding that keeping them resulted in a lot of unnecessary junk at the top of my WDMs. I'm also sticking with *n=1* for the purposes of n-grams for now, because with *n=2* this function often resulted in a memory crash.
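
The filtering logic can be illustrated without scikit-learn. The sketch below is a simplified stand-in for `create_wdm`, not a replacement for it: `toy_wdm` is a hypothetical function, tokenization is a plain whitespace split, and the stop-word set is a small sample. It keeps only purely alphabetic words of at least three letters whose total count is strictly greater than `n`, then orders the columns from most to least frequent.

```python
import re
from collections import Counter

STOP_WORDS = {'the', 'for', 'and', 'in'}  # small illustrative sample
WORD = re.compile(r'^[a-zA-Z]{3,}$')      # same pattern as the column filter

def toy_wdm(documents, n):
    """Return (vocabulary, rows): per-document counts for words kept by the filters."""
    counts = [Counter(w for w in doc.lower().split() if w not in STOP_WORDS)
              for doc in documents]
    totals = Counter()
    for doc_counts in counts:
        totals.update(doc_counts)
    # Keep words seen more than n times overall, ordered by total frequency.
    vocabulary = [word for word, total in totals.most_common()
                  if total > n and WORD.match(word)]
    rows = [[doc_counts[word] for word in vocabulary] for doc_counts in counts]
    return vocabulary, rows

docs = [
    'trans rights rally in the city',
    'court rules for trans rights',
    'trans athletes and the court',
]
vocabulary, rows = toy_wdm(docs, n=1)
print(vocabulary)
print(rows)
```

Because the test is `total > n`, a threshold of `n=3` drops every word that occurs three times or fewer across the whole corpus, which is why raising `n` for the longer text columns shrinks the matrix so quickly.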

The resulting WDMs are going to form the basis for further analysis, which you can check out in the exploratory analysis tab.